Data Science Interview Prep


K-Nearest Neighbor (K-NN)

KNN is an algorithm that can be used for both classification and regression. In simple terms, a point is assigned the same class as the points around it: by setting a value of K, we look at the K nearest neighbors and assign the point to the majority class among them. A low value of K (e.g. K = 1, which looks only at the single nearest neighbor) may overfit the data, while a higher value tends to underfit. The model can be run at different K values and validated, keeping the K with the lowest validation error.
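As a rough illustration of tuning K against validation error, here is a minimal sketch using scikit-learn; the dataset, split size, and candidate K values are arbitrary assumptions, not part of the original text.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Try several values of K and keep the one with the lowest validation error
best_k, best_err = None, float("inf")
for k in (1, 3, 5, 7, 9, 15):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    err = 1 - knn.score(X_val, y_val)   # validation error = 1 - accuracy
    if err < best_err:
        best_k, best_err = k, err
print(best_k, best_err)
```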

Assumptions of Linear Regression

1. Linear relationship between the dependent and independent variables.
2. The independent variables should be multivariate normal (the data should be normally distributed).
3. No multicollinearity: the independent variables shouldn't be highly correlated with each other.
4. No autocorrelation: the residuals should be independent of each other; in other words, the value of y(x+1) should not depend on the value of y(x).
5. Homoscedasticity: the residuals have equal variance across the regression line; they should not fan out at different parts of the line.

Data Analysis Process

1. Understanding the problem / setting goals
2. Data gathering: the emphasis is on ensuring accurate and honest collection of data
3. Data processing: organizing, structuring, and standardizing data
4. Data cleaning: find, change, or remove any incorrect or redundant data
5. Data analysis: techniques used to understand, interpret, and derive conclusions based on the requirements
6. Result interpretation: machine learning algorithms as well as descriptive and inferential statistics
7. Communication of results / data visualization

Logistic Regression

A classification algorithm where the response variable is categorical. The goal is to find a relationship between the features and the response variable, and the model outputs a probability between 0 and 1. It fits an S-shaped (sigmoid) curve, which helps it deal with outliers better than a straight line would. Binomial = 2 responses (true / false); multinomial = 3 or more responses (red / blue / green). Evaluate the model with a confusion matrix.
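A minimal sketch of fitting a binomial logistic regression and checking it with a confusion matrix, assuming scikit-learn and its breast-cancer toy dataset (illustrative choices, not taken from the original text):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]     # probabilities between 0 and 1
preds = (proba >= 0.5).astype(int)            # threshold the sigmoid output
print(confusion_matrix(y_test, preds))        # evaluate with a confusion matrix
```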

Decision Tree

A good classification algorithm, especially for instances with qualitative predictors. It systematically breaks the data into smaller and smaller subsets using decision nodes (splits) on various attributes, ending in a classification decision at the leaves. The deeper the tree, the more complex the decision rules and the closer the fit to the training data. Pruning shortens some of the branches so a classification decision is made earlier; this reduces complexity and prevents overfitting, since a deep tree can easily fit a training set perfectly. Splits are chosen by calculating the entropy at each node and seeing how much uncertainty a split on that attribute would remove. As entropy (uncertainty) decreases, information is gained.
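To make the entropy / information-gain idea concrete, here is a small NumPy sketch that scores a candidate split by how much it reduces entropy; the example labels and split are made up for illustration.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    """Entropy reduction achieved by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = np.array([0, 0, 0, 1, 1, 1, 1, 1])
left, right = parent[:3], parent[3:]          # a candidate split
print(information_gain(parent, left, right))  # higher = more uncertainty removed
```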

Bias-Variance Tradeoff

A model with high bias pays very little attention to the training data and oversimplifies the relationship, leading to underfitting. Bias measures how much the predictions deviate from the true value we are trying to predict. A model with high variance pays too much attention to the training data and does not generalize well to unseen data, which leads to overfitting. Variance indicates a model's sensitivity to small changes in the training data. The goal is low bias and low variance, but the two move in opposite directions, so we aim to find a happy medium.

Bayesian Network

A probabilistic graphical model that represents a set of random variables and their conditional dependencies. Bayesian networks are an important tool for understanding the dependency among events and assigning probabilities to them, thus ascertaining how probable, or what the chance of occurrence of, one event is given another. If an event is not related to the final event, it is not factored into the final probability (the probability of the final event given all the events it depends on).

Multicollinearity

A situation in which several independent variables are highly correlated with each other. This can make it difficult to estimate separate or independent regression coefficients for the correlated variables. Regression relies on the predictors being independent, so multicollinearity should be eliminated. To solve this issue, remove highly correlated predictors from the model: if two or more variables are correlated, drop one of them, since they supply redundant information.
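A quick sketch of spotting and dropping highly correlated predictors with pandas; the 0.9 threshold, the column names, and the synthetic data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=100)})
df["x2"] = df["x1"] * 0.95 + rng.normal(scale=0.1, size=100)  # nearly a copy of x1
df["x3"] = rng.normal(size=100)

# Absolute correlations between predictors (upper triangle only, to avoid duplicates)
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop one variable from each highly correlated pair -- it carries redundant information
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced = df.drop(columns=to_drop)
print(to_drop, reduced.columns.tolist())
```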

Z-Score

A type of standard score that tells us how many standard deviation units a given score is above or below the mean for that group: z = (x − mean) / standard deviation. It is a way of standardizing a broad range of continuous attributes.
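A minimal sketch of z-score standardization with NumPy; the sample values are made up.

```python
import numpy as np

scores = np.array([55.0, 60.0, 65.0, 70.0, 90.0])

# z = (x - mean) / standard deviation: units of standard deviation from the mean
z = (scores - scores.mean()) / scores.std()
print(z.round(2))
```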

Activation Function

An activation function (also known as a transfer function) maps a node's inputs to its output in a certain fashion. Activation functions are used to impart non-linearity: without one, the input signal would be mapped to the output by a linear function, which is just a polynomial of degree one, and linear functions cannot capture complex functional mappings of the data. Common choices are sigmoid, tanh, ReLU, and Leaky ReLU; depending on the situation, some functions classify better than others.
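A small sketch of the activation functions named above, implemented with NumPy; the 0.01 Leaky-ReLU slope is a common but assumed default.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))       # squashes to (0, 1)

def tanh(z):
    return np.tanh(z)                      # squashes to (-1, 1)

def relu(z):
    return np.maximum(0.0, z)              # zero for negative inputs

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)   # small slope keeps negative inputs alive

z = np.linspace(-3, 3, 7)
print(sigmoid(z), tanh(z), relu(z), leaky_relu(z), sep="\n")
```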

Neural Network

Also known as an Artificial Neural Network or ANN: a learning algorithm that takes one or more inputs and processes them into an output. The network consists of many small units called neurons, grouped into several layers connected by weighted paths. Input layer: raw data is fed in, say the pixels of a photo. Hidden layer(s): where intermediate processing or computation is done, like detecting edges in the pixels; there can be multiple hidden layers. Output layer: an activation function is applied to the hidden layer's output and its weights to identify, for example, what is in the image. Neural networks can be used for classification or regression, but are most often used for classification.

Linear Regression

An algorithm that models the relationship between a continuous dependent variable and its independent variables. To understand exactly what that relationship is, and whether one variable causes another, you will need additional research and statistical analysis. Simple = a single independent variable, straight-line relationship (positive, negative, or none). Multiple = n independent variables, still linear. Polynomial = a higher-order regression that fits a non-linear curve or surface.
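A minimal sketch of the simple, multiple, and polynomial variants with scikit-learn; the synthetic data and degree are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.5 * X[:, 0] ** 2 + rng.normal(scale=0.3, size=200)

simple = LinearRegression().fit(X[:, [0]], y)          # one independent variable
multiple = LinearRegression().fit(X, y)                # n independent variables
poly = make_pipeline(PolynomialFeatures(degree=2),     # higher-order, non-linear fit
                     LinearRegression()).fit(X, y)

print(simple.coef_, multiple.coef_, poly.score(X, y))
```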

Gradient Descent

An algorithm that optimizes a model's parameters with respect to a cost function (e.g. finding the minimum sum of squared errors). An initial guess is made (usually at random) and the cost is evaluated; the guess is then changed slightly and the cost is evaluated again. The difference between the two results gives a slope (gradient), and we keep following that slope downhill until no direction decreases the cost any further. That lowest point is the minimum error. The learning rate controls how big a step we take at each iteration: set it too high and the path becomes unstable, too low and convergence is slow. Gradient descent is the most prevalent optimization algorithm in machine learning.
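A minimal sketch of gradient descent minimizing a sum-of-squares cost for a one-parameter line fit, with the learning rate exposed; the data, learning rate, and step count are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=50)
y = 3.0 * x + rng.normal(scale=0.1, size=50)   # true slope is about 3

w = 0.0            # initial guess for the slope
lr = 0.1           # learning rate: too high is unstable, too low is slow

for step in range(500):
    pred = w * x
    grad = 2 * np.mean((pred - y) * x)   # slope of the squared-error cost w.r.t. w
    w -= lr * grad                       # move downhill along the gradient
print(w)
```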

Curse of Dimensionality

As features (dimensions) increase, the amount of data needed to generalize accurately grows exponentially. This can create complexity without benefit, overfit the model, increase compute time, and produce meaningless results. Generally, there are three solutions. Feature extraction: taking a component of the original feature, especially when the raw data is unusable (e.g. the redness of a photo). Feature selection: choosing the best subset of the original variables, usually by comparing their relationship to the target variable. Feature creation: careful pre-processing of the data into more meaningful variables (such as combining features or taking a square root). This is why every model needs a feature selection/engineering step before it is built.

Backpropagation

Backpropagation is an algorithm commonly used to train neural networks (the essence of neural network training). When the neural network is initialized, weights are set for its individual elements, called neurons. Inputs are loaded, they are passed through the network of neurons, and the network provides an output for each one, given the initial weights. Backpropagation helps to adjust the weights of the neurons so that the result comes closer and closer to the known true result.
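A minimal sketch of the idea for a single sigmoid neuron with NumPy: run the inputs forward, compare the output to the known true result, and nudge the weights in the direction that reduces the error. The data, learning rate, and epoch count are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0.0], [1.0], [2.0], [3.0]])   # one input feature
y = np.array([0, 0, 1, 1])                   # known true labels

rng = np.random.default_rng(0)
w, b = rng.normal(size=1), 0.0               # initial weights
lr = 0.5                                     # learning rate

for epoch in range(1000):
    p = sigmoid(X @ w + b)                   # forward pass: current predictions
    error = p - y                            # how far off the output is
    grad_w = X.T @ error / len(y)            # gradient w.r.t. the weight (chain rule)
    grad_b = error.mean()                    # gradient w.r.t. the bias
    w -= lr * grad_w                         # adjust weights toward the true result
    b -= lr * grad_b

print(w, b, sigmoid(X @ w + b).round(2))
```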

Bagging vs Boosting

Bagging and boosting are both ensemble techniques, where a set of weak learners are combined into a strong learner that performs better than any single one. In bagging, each model is trained in parallel and runs independently; the outputs are then aggregated at the end without preference to any model (see Random Forest). Boosting, meanwhile, is all about "teamwork": models are trained sequentially, and each new model focuses on the examples the previous models handled poorly, weighted according to their performance (see Gradient Boosting). Ensemble methods try to reduce the bias and/or variance of weak learners by combining several of them into a strong learner.

Cost Function

A cost function is used to learn the parameters of a machine learning model such that the total error is as small as possible. It measures how wrong the model is in its ability to estimate the relationship between the dependent and independent variables, and is typically expressed in terms of the difference between the predicted and actual values. A convex cost function has a single minimum that the model can be driven to; a non-convex function has multiple local minima, or locally optimal points, where the optimization may get stuck.

How to avoid overfitting?

Cross-validation sampling, reducing number of features (dimensionality reduction), pruning a decision tree, regularization, etc.

Descriptive, Predictive, Prescriptive

Descriptive = describes what has happened in the past and potentially why it happened (e.g. a cluster of customers failing to pay back a loan; see association, clustering, pattern discovery).
Predictive = predicts what is likely to happen in the future (e.g. which customer will not pay back a loan due to specific qualities; see classification, regression, etc.).
Prescriptive = recommends actions to take to affect those outcomes (e.g. a course of action so the customer will pay back the loan).

Gradient Boosting

Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function. Trees are added one at a time, and existing trees in the model are not changed. A gradient descent procedure is used to minimize the loss when adding trees. After calculating error or loss, the weights are updated to minimize that error. The output for the new tree is then added to the output of the existing sequence of trees in an effort to correct or improve the final output of the model.
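A small sketch with scikit-learn showing trees being added stage by stage and the test error shrinking as the sequence grows; the dataset and settings are arbitrary assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbr = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1, random_state=0)
gbr.fit(X_train, y_train)

# staged_predict yields the ensemble's prediction after each added tree:
# the error drops as more trees correct the output of the existing sequence
for i, pred in enumerate(gbr.staged_predict(X_test), start=1):
    if i in (1, 10, 50, 200):
        print(i, round(mean_squared_error(y_test, pred), 1))
```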

Hyperparameter Tuning

Hyperparameters are values that are set before the training process begins. They are properties that describe how a model is supposed to function, for example the depth of a decision tree or the learning rate of a regression model. Hyperparameters directly control the behavior of the training algorithm and have a significant impact on the performance of the model being trained. Some common tuning techniques are: Grid search: an exhaustive search through a manually specified subset of the hyperparameter space of a learning algorithm. Random search: replaces the exhaustive enumeration of all combinations by selecting them randomly. Bayesian optimization: takes past evaluations into account when choosing the next hyperparameter set to evaluate; by choosing its parameter combinations in an informed way, it can focus on the areas of the parameter space it believes will bring the most promising validation scores.
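A minimal sketch of grid search and random search with scikit-learn; the model, parameter grid, and dataset are illustrative assumptions, and Bayesian optimization is omitted because it needs a separate library.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_iris(return_X_y=True)
params = {"n_estimators": [50, 100, 200], "max_depth": [2, 4, 8, None]}

# Grid search: exhaustively tries every combination in the grid
grid = GridSearchCV(RandomForestClassifier(random_state=0), params, cv=5).fit(X, y)

# Random search: samples a fixed number of combinations at random
rand = RandomizedSearchCV(RandomForestClassifier(random_state=0), params,
                          n_iter=5, cv=5, random_state=0).fit(X, y)

print(grid.best_params_, rand.best_params_)
```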

Dealing with Missing Data

Imputation: replace missing data with an appropriate value (average, median, etc.).
Leave blank: sometimes missingness can itself be predictive, so it is reasonable to leave the value blank or classify it as empty.
Discretization: bin the values, with one bin for missing.
Prediction: use a regression technique to predict the missing values.
Elimination: eliminate entries that contain a blank for a specific feature.
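A small sketch of a few of these options with pandas; the toy DataFrame and bin edges are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, 31, np.nan, 58]})

imputed = df["age"].fillna(df["age"].median())     # imputation with the median
flagged = df["age"].isna()                         # keep missingness as its own signal
binned = (pd.cut(df["age"], bins=[0, 30, 50, 100]) # discretization with a "missing" bin
            .cat.add_categories("missing")
            .fillna("missing"))
dropped = df.dropna(subset=["age"])                # elimination of incomplete rows

print(imputed.tolist(), flagged.tolist(), binned.tolist(), len(dropped), sep="\n")
```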

Support Vector Machine (SVM)

In a classification problem the main goal is to derive an optimal boundary separating the groups from one another. SVMs consider the lines that correctly classify the training data and pick the one with the greatest distance to the points closest to it. Those closest points are the support vectors, and the parallel lines through them form a margin (a "confusion band") around the main classifier line. Real-world data cannot be perfectly separated due to outliers, so SVMs let us set a value C that controls the width of this band: a lower value means a wider band and allows a greater number of errors on the training data, which may capture the trend better and avoid overfitting.
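A small sketch showing how the C parameter trades training accuracy against generalization in scikit-learn; the synthetic data and C values are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for C in (0.01, 1.0, 100.0):
    # Lower C = wider band, more training errors tolerated, less risk of overfitting
    svm = SVC(kernel="linear", C=C).fit(X_train, y_train)
    print(C, svm.score(X_train, y_train), svm.score(X_test, y_test))
```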

Variance

Informally, it measures how far a set of numbers are spread out from their average value

Neural Network Operation

Initialization: initial weights are applied to all the neurons.
Forward propagation: the inputs from a training set are passed through the neural network and an output is computed.
Error function: because we are working with a training set, the correct output is known. An error function captures the delta between the correct output and the actual output of the model given the current weights (in other words, "how far off" the model is from the correct result).
Backpropagation: the objective of backpropagation is to change the weights of the neurons in order to bring the error function to a minimum.
Weight update: the weights are changed to better values according to the results of the backpropagation algorithm.
Iterate until convergence: because the weights are updated by a small delta step at a time, several iterations are required for the network to learn. After each iteration, gradient descent pushes the weights toward a lower and lower value of the loss function.
Use: feed new data to the trained network to get results.

Parsimonious Model

A less complex model: one that accomplishes a desired level of explanation or prediction with as few predictor variables as possible.

Machine Learning vs Deep Learning vs AI

Machine learning is closely related to data mining: it is a technique for examining a dataset and extracting new information through a variety of algorithms. Deep learning is a subset of machine learning that uses multiple layers, generally in a neural network, to progressively extract higher-level features from the raw input. AI is the ability of a computer program to function like a human brain, so it may use aspects of machine learning and deep learning to perform certain functions. AI is either AGI (general) or ANI (narrow), which handles only a single task.

Confusion Matrix

A metric for evaluating a binary classification; it summarizes the results of applying the model.
True Positive: predicted Yes, actually Yes.
True Negative: predicted No, actually No.
False Positive (Type I error): predicted Yes, actually No.
False Negative (Type II error): predicted No, actually Yes.
From these we can compute the accuracy ((TP+TN)/n), the misclassification rate ((FP+FN)/n), the precision (TP/(TP+FP)), and the recall or sensitivity, the share of relevant cases found (TP/(TP+FN)). Depending on the business problem, a different metric may be chosen (e.g. if the cost associated with false positives or false negatives is higher). The F1 score lets us balance precision and recall.
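A quick sketch of computing the metrics above from a confusion matrix with scikit-learn; the label vectors are made up.

```python
from sklearn.metrics import confusion_matrix, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
n = tn + fp + fn + tp

accuracy = (tp + tn) / n
misclassification = (fp + fn) / n
precision = tp / (tp + fp)
recall = tp / (tp + fn)            # a.k.a. sensitivity
print(accuracy, misclassification, precision, recall, f1_score(y_true, y_pred))
```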

Epoch

One Epoch is when an ENTIRE dataset is passed forward and backward through the neural network only ONCE. In gradient descent, there are multiple epochs run to slowly improve the model. We want to avoid too many epochs though so we don't overfit the data.

Principal Component Analysis (PCA)

PCA is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most of the information in the large set. Variables are standardized and then compared via covariance (joint variability of 2 random variables) to see how they are correlated. Principal components are then new variables that are constructed as linear combinations or mixtures of the initial variables. These combinations are done in such a way that the new variables are uncorrelated and most of the information within the initial variables is squeezed or compressed into the first components. Principal components are less interpretable and don't have any real meaning but they reduce dimensionality. The first principal component accounts for the largest possible variance in the data set. The next must be uncorrelated (perpendicular) to the first and so on.
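A minimal sketch with scikit-learn: standardize, fit PCA, and check how much variance the first components capture; the dataset and component count are assumptions.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)
X_std = StandardScaler().fit_transform(X)      # standardize before comparing covariances

pca = PCA(n_components=3)
components = pca.fit_transform(X_std)          # new, uncorrelated variables

# The first component explains the most variance, the second the next most, and so on
print(pca.explained_variance_ratio_)
```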

Perceptron vs CNN vs RNN

Perceptron: the most basic version of a neural network, which contains zero to many hidden layers but only feeds data forward.
Convolutional Neural Network (CNN): used to analyze images for image classification, segmentation, or object detection. CNNs work by reducing an image to its key features and using the combined probabilities of the identified features appearing together to determine a classification. In the convolutional layers, an input is analyzed by a set of filters that output a feature map. This output is then sent to a pooling layer, which reduces the size of the feature map, condensing it to its most essential information. The two layers are repeated several times, and the data is then sent to fully connected layers, which flatten the maps together and compare the probabilities of each feature occurring in conjunction with the others until the best classification is determined.
Recurrent Neural Network (RNN): a multi-layer neural network used to analyze sequential input, such as text, speech, or video, for classification and prediction purposes. RNNs work by evaluating sections of an input in comparison with the sections both before and after the section being classified, through the use of weighted memory and feedback loops. As the RNN analyzes the sequential features of the input, the output is returned to the analysis step in a feedback loop, allowing the current feature to be analyzed in the context of the previous features.

Regularization (L1 vs L2)

Regularization adds a penalty as model complexity increases (higher order = larger penalty).
L1 regularization (Lasso regression): adds the absolute value of the coefficient magnitudes as the penalty term to the loss function. Often used for feature selection or when dealing with multicollinearity.
L2 regularization (Ridge regression): adds the squared magnitude of the coefficients as the penalty term to the loss function. Can be used when you don't want to eliminate any feature.
Lasso shrinks the coefficients of the less important features all the way to zero, removing some features altogether, whereas Ridge only shrinks those coefficients to small values.
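A small sketch contrasting Lasso (L1) and Ridge (L2) coefficients on the same data with scikit-learn; the synthetic data and alpha values are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty: drives unimportant coefficients to zero
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks coefficients but keeps them nonzero

print("zero Lasso coefs:", int(np.sum(lasso.coef_ == 0)))
print("zero Ridge coefs:", int(np.sum(ridge.coef_ == 0)))
```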

ROC - AUC

Sensitivity = TP / (TP + FN) (true positive rate). Specificity = TN / (TN + FP) (true negative rate). 1 − Specificity = FP / (TN + FP) (false positive rate). The ROC curve plots sensitivity against 1 − specificity. As we move the decision boundary to increase sensitivity, specificity decreases; using 1 − specificity gives the false positive rate, so both axes are expressed in terms of positive predictions. AUC indicates how well the probabilities of the positive class are separated from those of the negative class: a score close to one (a curve bending toward the upper-left corner) shows the model does a good job of distinguishing positive from negative.
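A minimal sketch of building the ROC curve and AUC from predicted probabilities with scikit-learn; the dataset and model are assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

proba = LogisticRegression(max_iter=5000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, proba)   # 1 - specificity vs sensitivity
print(roc_auc_score(y_test, proba))               # closer to 1 = better separation
```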

Histogram

Shows the frequency distribution of continuous data using rectangles. Data is split into intervals and the frequency of instances in each interval is plotted.

Supervised vs Unsupervised Learning

Supervised learning predicts the value of an outcome based on input measures (e.g. regression, K-NN, SVM, decision trees, random forests, neural networks). Simply put, these methods make predictions. Unsupervised learning describes the associations or patterns among the inputs (e.g. k-means clustering, PCA, association rules). Simply put, these methods find patterns.

Random Forest

A random forest is a classification algorithm consisting of many decision trees. It uses bagging and feature randomness when building each individual tree to create an uncorrelated forest of trees whose prediction by committee is more accurate than that of any individual tree; the individual trees protect each other from their individual errors. When new data is run through the forest, the most frequently selected prediction is chosen rather than relying on a single tree. Maintaining randomness between the trees relies on two things: 1. Bagging (bootstrap aggregation): each tree randomly samples from the dataset with replacement, resulting in different trees. 2. Feature randomness: in a normal decision tree, when it is time to split a node, we consider every possible feature and pick the one that produces the most separation between the observations; in a random forest, each split considers only a random subset of the features and picks the best one among those.
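A quick sketch with scikit-learn, where n_estimators is the number of bootstrapped trees and max_features controls the random subset of features considered at each split; the dataset and settings are arbitrary assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=200,      # number of bootstrapped trees voting by committee
    max_features="sqrt",   # random subset of features considered at each split
    random_state=0,
)
print(cross_val_score(forest, X, y, cv=5).mean())
```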

Binary Classification

The task of classifying the items of a particular set into one of two groups. This can be done with a variety of methods, from logistic regression to decision trees to neural networks.

K-Means Clustering

This is an unsupervised learning algorithm. A cluster is a collection of data points aggregated together because of certain similarities, and the goal is to group similar data points and discover underlying patterns. To achieve this, K-means looks for a fixed number (k) of clusters in the dataset, and every data point is allocated to one of the clusters so as to minimize the in-cluster sum of squares.
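A minimal sketch with scikit-learn; inertia_ is the in-cluster sum of squares the algorithm minimizes, and the synthetic data and k=3 are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])   # cluster assignment for each data point
print(kmeans.inertia_)       # in-cluster sum of squares being minimized
```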

A/B Testing

This is the process of comparing two variations of a single variable to determine which performs best. A/B testing is a marketing experiment wherein you "split" your audience to test a number of variations of a campaign/new feature and determine which performs better.

Training, Testing, and Validation Data

Training = the data that the model will be trained on (usually 80-90% of the data).
Validation = a held-out, representative sample of the data used to tune the hyperparameters and select an appropriate model.
Testing = the data that will be used to test the final accuracy of the model.

Cross-Validation

Validation helps us evaluate the quality of the model, select the model that will perform best on unseen data, and avoid overfitting and underfitting. It involves splitting the data into training and validation sets. Cross-validation is a more robust version that segments the data stochastically into k folds and iteratively changes which segment is used for validation and which for training; the results are then averaged to give a more robust estimate of performance.
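A minimal sketch of k-fold cross-validation with scikit-learn, averaging scores across folds; the model, dataset, and k=5 are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# Shuffle, split into 5 folds, rotate which fold is the validation set, average the scores
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores, scores.mean())
```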

Boxplot

Way of displaying the distribution of data based on the following: Minimum, First Quartile, Median, Third Quartile, and Maximum. It gives information about variability and dispersion of data. It also displays outliers and tells us about the symmetry and skewness of the data.

Unbalanced Binary Classification

When one class vastly outnumbers the other, the result can be a biased model. Some solutions include:
Proportional oversampling: increasing the number of rare events in the dataset.
Suitable metrics: accuracy can be a misleading metric for unbalanced data; suitable metrics include precision, recall, F1-score, AUC, etc.
Suitable algorithms: algorithms such as decision trees and random forests work well with unbalanced data.

