Machine Learning Midterm
Structured data
- Fields/variables in columns - Observations in rows - Each field can be text, number, float (decimal), etc., but a field cannot have more than one type - Every row has a fixed number of fields - A primary key uniquely identifies each row
Random variable
A variable whose value is determined by the outcome of a chance experiment - Can be discrete or continuous
Recurrent neural networks
Allow the model to learn interdependencies across observations in a given data set (e.g., across time) - Problem of "vanishing gradients," which effectively leaves such models unable to capture long-term dependencies
The Modelers' Hippocratic Oath
Guidelines for an ethical ML model - Being explicit about a model's assumptions and oversights - Understanding that a model's effects could have an enormous impact on society
Underfitting
Happens when a model is unable to capture the underlying pattern of the data - These models usually have high bias and low variance - It happens when we do not have enough data to build an accurate model or when we try to fit a linear model to nonlinear data
Correlation
Measure of linear association between two variables; lies between −1 and +1, with −1 indicating perfect negative association and +1 indicating perfect positive association; denoted by ρ (rho)
Variance
Measures the spread of the random variable around the mean μ: var(X) = ∑(x − μ)² f(x) - For a weighted sum: var(w1·X + w2·Y) = w1²·σx² + w2²·σy² + 2·w1·w2·σx·σy·ρ - σ (sigma) refers to the square root of the variance and is called the standard deviation - Can also mean the variability of model predictions for a given data point - A high value means the model pays a lot of attention to the training data and does not generalize to data it hasn't seen before (overfits); as a result, such models perform very well on training data but have high error rates on test data
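A quick numeric check of the weighted-sum variance formula, sketched with NumPy on simulated data (the weights and variables are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two correlated variables (hypothetical data)
x = rng.normal(0, 2, 100_000)
y = 0.5 * x + rng.normal(0, 1, 100_000)
w1, w2 = 0.6, 0.4

rho = np.corrcoef(x, y)[0, 1]
formula = (w1**2 * x.var() + w2**2 * y.var()
           + 2 * w1 * w2 * x.std() * y.std() * rho)
direct = (w1 * x + w2 * y).var()
print(round(formula, 4), round(direct, 4))  # the two values should match closely
```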
Python
Object oriented language with many user libraries - Most popular general purpose programming language - Free + open source - Multi-purpose: Allows creation of ML models and tools, but also allows web app development - Often referred to as "executable pseudo-code" because its syntax mostly follows the conventions used by programmers to outline their ideas without the formal verbosity of code in most programming languages
Population
Set of all possible outcomes (also called sample space)
Endogeneity
Situations in which an explanatory variable is correlated with the error term; two types: - Reverse causality - Unobservable and correlated factors
Support vectors
The data points that are closest to the hyperplane. These points define the separating line by determining the margins; they are the most difficult points to classify
Parameters
Values estimated by model. E.g., regression beta
Feature
Variable used to predict an outcome. Economists like to call these "independent variables" in a regression context. Also, can be called the x variable
Integrated development environment
Increases programmer productivity by combining common activities of writing software into a single application: editing source code, compiling and executing the code, and debugging (e.g., IPython) - Can display graphics and pictures, allowing data visualization
Heteroskedasticity
- OLS assumes each observation is drawn from the same distribution, having the same variance: ui ~ N(0, σ²) for all i - If this assumption is violated, standard errors, and therefore p-values, are biased, so you cannot assess statistical significance with any degree of confidence - Solution: robust standard errors
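A minimal sketch of the robust-standard-error fix using statsmodels, with simulated data whose error variance grows with x (the data-generating process here is illustrative):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 500)
y = 1 + 2 * x + rng.normal(0, x)   # error variance grows with x -> heteroskedastic
X = sm.add_constant(x)

ols = sm.OLS(y, X).fit()                    # classical standard errors
robust = sm.OLS(y, X).fit(cov_type="HC1")   # heteroskedasticity-robust standard errors
print(ols.bse)
print(robust.bse)
```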
XGBoost
- Short for "Extreme Gradient Boosting" - Implements gradient-boosted decision trees - A variant of the gradient boosting machine that aims to improve performance and speed - Optimizes both hardware use and the algorithm - Often the best performer among gradient boosting, random forest, and logistic regression
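A minimal sketch, assuming the xgboost package is installed and using a scikit-learn toy dataset purely for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier  # assumes the xgboost package is available

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# n_estimators, learning_rate, and max_depth are illustrative hyperparameters
model = XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=3)
model.fit(X_tr, y_tr)
print(model.score(X_te, y_te))  # test-set accuracy
```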
Autocorrelation
- Time series data susceptible to this - Error terms are correlated over time - Biases standard errors and inference - Does not bias coefficient estimates - Fixed with heteroskedasticity and ___ robust standard errors
Normal distribution
1. It is symmetrical around its mean value 2. Approximately 68 percent of the area under the normal curve lies between the values of μ ± σ, about 95 percent of the area lies between μ ± 2σ, and about 99.7 percent of the area lies between μ ± 3σ - Denote a normally distributed variable as X ∼ N(μ, σ²), where ∼ means "distributed as," N stands for the normal distribution, and the quantities in the parentheses are the two parameters of the normal distribution, namely the mean and the variance - The standard normal distribution refers to a random variable with μ = 0 and σ² = 1; used to obtain test statistics - The sum of two independent normally distributed variables is also normally distributed
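The 68/95/99.7 areas can be verified with SciPy's standard normal CDF:

```python
from scipy.stats import norm

# Fraction of the area within 1, 2, and 3 standard deviations of the mean
for k in (1, 2, 3):
    print(k, round(norm.cdf(k) - norm.cdf(-k), 4))
# prints approximately 0.6827, 0.9545, 0.9973
```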
Kernel trick
3 types: linear, polynomial, radial basis function - SVM uses this to transform the input space into a higher-dimensional space (e.g., adding a z axis) when the problem can't be separated by a linear hyperplane
Hyperplane
A decision plane which separates a set of objects having different class memberships
Margin
The gap between the separating hyperplane and the closest points of each class (the support vectors). The idea is to maximize this margin
Ensemble method
A machine learning technique that combines several base models in order to produce one optimal predictive model - Known for providing an advantage over single models in terms of performance. They can be applied to both regression and classification problems (i.e., continuous vs. discrete outcome variables) - Two types: - Homogeneous ensemble model: Single model across all sub-predictions - Heterogeneous ensemble model: Multiple models used across sub-predictions
Artificial neural network
A multi-layer fully-connected neural net 1. Initialize: Randomly initialize the weights for all the nodes. The model can do that 'smartly' through inbuilt algorithms 2. Forward pass: For every training example, perform a 'forward pass' using the current weights - i.e., calculate the output of each node going from left to right - The final output is the value of the last node 3. Evaluate error: Compare the final output with the actual target in the training data, and measure the error using a loss function 4. Backpropagate: Calculate each weight's contribution to the error, and adjust the weights accordingly - Backpropagation 5. Run multiple epochs: Epochs are the number of times you do the forward pass and backpropagation to adjust the weights
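A minimal NumPy sketch of steps 1-5 on a toy XOR problem (the network size, learning rate, and epoch count are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: learn the XOR pattern (a hypothetical example)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# 1. Initialize: random weights (and zero biases) for a 2-3-1 network
W1, b1 = rng.normal(size=(2, 3)), np.zeros((1, 3))
W2, b2 = rng.normal(size=(3, 1)), np.zeros((1, 1))
lr = 0.5  # learning rate

for epoch in range(20_000):          # 5. run multiple epochs
    h = sigmoid(X @ W1 + b1)         # 2. forward pass, left to right
    out = sigmoid(h @ W2 + b2)
    err = out - y                    # 3. evaluate error (squared-error loss)
    # 4. backpropagate: each weight's contribution to the error
    d_out = err * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(out.round(2))  # should move toward [0, 1, 1, 0] (exact values depend on the random start)
```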
Example
A particular instance of the data, e.g., all x, y variables in one row
Scaling
Adjusting variables to modify their distribution. This can help with comparability and may be required in certain models - Regressions and tree-based models do not require scaling - Models that calculate distances, e.g., gradient descent and KNN, need standardization - Standardize after splitting the data into test and training samples
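A minimal scikit-learn sketch of standardizing after the train/test split (the dataset is chosen only for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
# Split first, then standardize using statistics from the training set only
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)   # reuses the training mean/std
```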
Random search
An approach to hyperparameter tuning motivated by the fact that for most cases, hyperparameters are not equally important - You'll define a sampling distribution for each hyperparameter - You can also define how many iterations you'd like to build when searching for the optimal model - For each iteration, the hyperparameter values of the model will be set by sampling the defined distributions
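A minimal sketch with scikit-learn's RandomizedSearchCV; the model, the sampling distribution for C, and the iteration count are illustrative assumptions:

```python
from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = load_breast_cancer(return_X_y=True)
# A sampling distribution for each hyperparameter; n_iter = number of models built
search = RandomizedSearchCV(
    LogisticRegression(max_iter=5000),
    param_distributions={"C": loguniform(1e-3, 1e3)},
    n_iter=20, cv=5, random_state=0)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```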
Grid search
An approach to hyperparameter tuning that will methodically build and evaluate a model for each combination of algorithm parameters specified in a grid
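A minimal sketch with scikit-learn's GridSearchCV; the model and parameter grid are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
# Every combination in the grid is built and evaluated with cross-validation
grid = GridSearchCV(SVC(),
                    param_grid={"C": [0.1, 1, 10],
                                "gamma": ["scale", 0.01, 0.001]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_)
```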
Gradient descent
An optimization algorithm used to minimize some function (e.g., a loss function such as the sum of squared errors) by iteratively moving in the direction of steepest descent, as defined by the negative of the gradient - The size of these steps is called the learning rate - High learning rate: can cover more ground each step, but risks overshooting the lowest point since the slope of the hill is constantly changing - Low learning rate: can confidently move in the direction of the negative gradient since it is recalculated so frequently; more precise, but calculating the gradient is time-consuming, so it takes a long time to reach the bottom - Gradient boosting models use this approach to sequentially correct the errors of earlier learners, generalizing the Adaboost model
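A minimal sketch of gradient descent minimizing a sum-of-squared-errors loss for a simple linear fit (the learning rate and iteration count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = 3 * x + 1 + rng.normal(0, 0.1, 200)   # true slope 3, intercept 1

w, b = 0.0, 0.0
lr = 0.1                       # learning rate (step size)
for _ in range(2000):          # iteratively move against the gradient
    pred = w * x + b
    grad_w = 2 * np.mean((pred - y) * x)   # d(mean squared error)/dw
    grad_b = 2 * np.mean(pred - y)         # d(mean squared error)/db
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 2), round(b, 2))   # should approach 3 and 1
```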
Adaboost
Boosts weak learners to strong learners - Initially, the model selects a training subset randomly - It iteratively trains the model, selecting/weighting the training set based on the accuracy of the previous round of training - Assigns a weight to the trained classifier in each iteration according to the accuracy of the classifier; the more accurate classifier gets a higher weight - This process iterates until the complete training data fits without any error or until it reaches the specified maximum number of estimators - To classify, perform a "vote" across all of the learning algorithms you built
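A minimal sketch with scikit-learn's AdaBoostClassifier (dataset and estimator count chosen for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
# n_estimators caps the number of boosting iterations (weak learners)
clf = AdaBoostClassifier(n_estimators=100, random_state=0)
print(round(cross_val_score(clf, X, y, cv=5).mean(), 3))
```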
Classifiers
Checking quality of predictions in ___: - Confusion matrix (true positive, false positive, etc.) - Accuracy: Correct predictions divided by sample size - Recall: Tells you what fraction of actual cases in a category are correctly predicted - Precision: Tells you what fraction of predictions are correct for each category - F1 score: Combination of recall and precision
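A minimal sketch computing these quality checks with scikit-learn's metrics module on made-up predictions:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # actual classes (hypothetical)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]   # predicted classes (hypothetical)

print(confusion_matrix(y_true, y_pred))
print("accuracy ", accuracy_score(y_true, y_pred))
print("recall   ", recall_score(y_true, y_pred))     # fraction of actual 1s caught
print("precision", precision_score(y_true, y_pred))  # fraction of predicted 1s correct
print("F1       ", f1_score(y_true, y_pred))
```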
Multicollinearity
Correlation between x variables - Makes parameter estimates unstable (sensitive to small changes in the data) and inflates their standard errors, and therefore p-values - May not bias predictions, but you may not understand the intuition, and it can be problematic if the same relationships are not expected to continue in the future - Symptoms: high R-squared but few significant p-values on coefficients; high pairwise correlations - Measured by VIF
Neuron
Creates a weighted average of the input signals and passes it through an activation function to produce an output signal
Learning rate
Defines how quickly a network updates its parameters - Low rate slows down the learning process but converges smoothly - Larger rate speeds up the learning but may not converge - Usually a decaying rate is preferred
Decision tree
Determines the predictive value based on a series of questions and conditions - Can use a single tree on the entire sample; a tree with too many splits (grown too deep) can overfit the model - Feature selection and threshold selection can become ad-hoc choices that affect model predictability
Statistical inference
Essentially assessing whether your analysis yields an estimate that is precise enough - You're almost always estimating using a sample, not the population, so you cannot be sure whether your estimate is signal or noise
R-squared
Explained sum of squares / Total sum of squares = ∑_i (ŷ_i − ȳ)² / ∑_i (y_i − ȳ)² - The fraction of the overall variance of y explained by x - A higher value reflects a better model fit, or predictive ability - Adjusted ___: the ordinary version can be increased mechanically by adding more x terms to the regression; the adjusted version removes such mechanical effects
Marginal PDF
The distribution of X (or Y) alone, obtained by summing the joint probabilities over all values of the other variable: f(x) = ∑_y f(x, y) or f(y) = ∑_x f(x, y)
Discrete joint PDF
Gives the (joint) probability that X takes the value of x and Y takes the value of y
Conditional PDF
Gives the probability that X takes on the value x given that Y has assumed the value y: f(x|y) = P(X = x | Y = y) (similarly, f(y|x) = P(Y = y | X = x)) - Can be obtained as the ratio of the joint distribution to the marginal distribution: f(x|y) = f(x, y)/f(y), f(y|x) = f(x, y)/f(x) - Statistical independence means f(x, y) = f(x) f(y)
Multi-layer perceptrons
Good at predicting using a given data set, but may not be able to effectively model inter-relationships within the data - I.e., they can predict y based on x, but they do not use x(t−1) or x(t+1), i.e., the previous or next observation in the sequence
Overfitting
Happens when our model captures the noise along with the underlying pattern in the data - It happens when we train our model too heavily on a noisy dataset - These models have low bias and high variance - Complex models like decision trees are prone to overfitting
Dropout
Has brought significant advances to modern neural networks and is considered one of the most powerful techniques to avoid overfitting - Somewhat like ensemble methods: if each component model learns a relationship from the data that contains the true signal with some added noise, a combination of models should maintain the relationship of the signal within the data while averaging out the noise - During training, dropout samples from different "thinned" networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks, creating a final model with smaller weights
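A minimal Keras sketch (assumes the tensorflow package is available; the layer sizes and dropout rate are illustrative):

```python
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(20,)),          # 20 input features (hypothetical)
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dropout(0.5),         # randomly drops 50% of units during training only
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```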
Variance inflation factor
Identifies correlation between independent variables and the strength of that correlation - Starts at 1 and has no upper limit - A value of 1 indicates that there is no correlation between this independent variable and any others - Values between 1 and 5 suggest that there is a moderate correlation, but it is not severe enough to warrant corrective measures - Values greater than 5 represent critical levels of multicollinearity where the coefficients are poorly estimated and the p-values are questionable
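A minimal sketch computing VIFs with statsmodels on simulated data where two regressors are deliberately correlated:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.3, size=300)   # highly correlated with x1
x3 = rng.normal(size=300)                   # independent of the others
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

for i, col in enumerate(X.columns):
    if col != "const":
        print(col, round(variance_inflation_factor(X.values, i), 2))
# x1 and x2 should show VIFs well above 5; x3 should be near 1
```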
Class balance
If one class occurs more frequently than the other, you want to adjust for this imbalance, as the classifier will otherwise tend to over-classify into the more frequent class
OLS regression
Largely concerned with estimating and/or predicting the (population) mean value of the dependent variable on the basis of the known or fixed values of the explanatory variable(s); linear relationship - Issues include multicollinearity, heteroskedasticity, autocorrelation, and endogenous regressors
Deep stacking
Leads to complex relationships in the data - more prone to overfitting - There are some ways to detect and mitigate this: use validation data in the fitting phase - Detection: check whether the validation loss starts going up - Early stopping: stop before the curve moves up
Regression
Model used to predict continuous values
Classification
Model used to predict discrete values
Semi-structured data
More flexible than structured data such as CSV, Excel, and SQL tables - Examples: XML, JSON
Unsupervised learning
Pattern recognition, without quite knowing what you are looking for; clustering customers into groups by the most relevant variable - E.g. Who is more likely to invest in 401(k)'s through robo-advising?
Supervised learning
Predicting specific well-defined variables (e.g. predicting whether the picture is a dog or cat)
Logit
Predicts the probability of an event using the logistic distribution function (non-linear) - 0/1 model prediction (classification model) - Maximizes the likelihood function - Estimation addresses the question: what are the parameters (betas) such that the probability of observing the given Y variables is as high as possible? - The likelihood multiplies the probability of each observed y under the logistic distribution - Uses odds ratios to assess economic impact
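A minimal statsmodels sketch: fit a logit by maximum likelihood on simulated data and exponentiate the betas to get odds ratios (the data-generating process is illustrative):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=500)
p = 1 / (1 + np.exp(-(0.5 + 1.5 * x)))   # logistic probability of the event
y = rng.binomial(1, p)                   # 0/1 outcome

X = sm.add_constant(x)
logit = sm.Logit(y, X).fit(disp=0)   # maximizes the likelihood function
print(logit.params)                  # estimated betas
print(np.exp(logit.params))          # odds ratios for economic interpretation
```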
Probability
Proportion of times the event A will occur in repeated trials of an experiment 0 ≤ P(A) ≤ 1 for every A
Bootstrapping
Refers to randomly selecting one or more random subsamples from the sample, with replacement (different subsamples can share common observations) - Bagging models run similar or different learning models on multiple bootstrapped samples and combine the outcomes; reduces overfitting - Bagging is a combination of bootstrapping and aggregation
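A minimal NumPy sketch of drawing bootstrap subsamples with replacement (sample size and number of resamples are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=5, scale=2, size=200)   # the original sample

# Draw bootstrap subsamples (with replacement) and look at the spread of their means
boot_means = [rng.choice(sample, size=sample.size, replace=True).mean()
              for _ in range(1000)]
print(round(np.mean(boot_means), 2), round(np.std(boot_means), 3))
```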
Central limit theorem
Regardless of the PDF of a random variable, as the sample size becomes large the distribution of the sample mean approaches a normal distribution
Hypothesis testing
Ruling out one hypothesis in favor of another - Type I error: chance of wrongly rejecting the null hypothesis; the type of error we look to minimize - We prefer to use 5% or lower (called the level of significance, denoted by α) - Not rejecting the null hypothesis does NOT mean you are accepting it - All you can say is that the null is not rejected; because the probability of a type II error (accepting a false null) can be large, you are unable to say anything further about the estimate
Reinforcement learning
Teaching a machine to take actions or achieve goals in an environment - No prediction variables or "labels." Training is done through a reward mechanism - E.g. A self-driving car is 'rewarded' for making the right decisions (not killing the cat!) and penalized for making the wrong ones
ROC curve
Tells us how well the model can distinguish between two classes (e.g., whether a patient has a disease or not). Better models can accurately distinguish between the two - Indicates how well the probabilities from the positive class are separated from the negative class - Sensitivity on the y axis, (1 − specificity) on the x axis - AUC is the area under this curve; the score gives us a good idea of how well the model performs
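A minimal scikit-learn sketch that computes the ROC curve and AUC from predicted probabilities (model and dataset chosen for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X_tr, y_tr)
probs = clf.predict_proba(X_te)[:, 1]

fpr, tpr, _ = roc_curve(y_te, probs)   # (1 - specificity) vs. sensitivity
print("AUC:", round(roc_auc_score(y_te, probs), 3))
```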
F-statistic
Tests the hypothesis that all the explanatory coefficient estimates (β) are jointly significant - Significant value reflects a model with more explanatory power
Bias
The difference between the average prediction of our model and the correct value which we are trying to predict - A high value means the model pays very little attention to the training data and oversimplifies (underfits) - High bias leads to high error on both training and test data
Irreducible error
The error that can't be reduced by creating good models. It is a measure of the amount of noise in our data - Our data will always have a certain amount of noise, or irreducible error, that cannot be removed
Probability density function
The probability of the random variable falling between two points; for a continuous variable, this is the area under the curve between those points
Cumulative distribution function
The probability that the random variable takes a value up to point a; the sum (or integral) of all probabilities of the distribution f(x) up to point a
Random forest
Type of model that can be thought of as bagging, with a slight tweak - A decision tree is formed on each subsample - However, each tree considers a different random subset of features at each split - This level of differentiation provides a greater ensemble to aggregate over, producing a more accurate predictor
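A minimal scikit-learn sketch; max_features controls the random subset of features considered at each split (the settings are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
# Each tree is fit on a bootstrapped subsample; each split sees a random feature subset
rf = RandomForestClassifier(n_estimators=300, max_features="sqrt", random_state=0)
print(round(cross_val_score(rf, X, y, cv=5).mean(), 3))
```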
Long short term memory
Type of model that can remember longer-term dependencies and forget when necessary; the more popular type of RNN model - Requires inputs as a 3D array of shape (samples, timesteps, features); each individual sample has shape (timesteps, features) - Output can be a 2D array or a 3D array depending upon the return_sequences argument
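A minimal Keras sketch showing the 3D (samples, timesteps, features) input (assumes the tensorflow package is available; the toy target is illustrative):

```python
import numpy as np
from tensorflow import keras

# 3D input: (samples, timesteps, features)
X = np.random.rand(100, 10, 1)
y = X.sum(axis=1)                                   # toy target: sum over each sequence

model = keras.Sequential([
    keras.Input(shape=(10, 1)),                     # (timesteps, features) per sample
    keras.layers.LSTM(16, return_sequences=False),  # 2D output: (samples, units)
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, verbose=0)
```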
Support vector machine
Type of model that is considered a classification approach, but it can be employed for both classification and regression problems - It solves the following problem: how do we create a boundary that separates the two classes? - Doesn't automatically calculate probability estimates; requires the hyperparameter probability=True, and is much slower with this option - Pros: works really well with a clear margin of separation; effective in high-dimensional spaces; effective in cases where the number of dimensions is greater than the number of samples; uses a subset of training points in the decision function (called support vectors), so it is also memory efficient - Cons: doesn't perform well on a large data set because the required training time is higher; doesn't perform very well when the data set has more noise, i.e., target classes overlap; doesn't directly provide probability estimates - these are calculated using an expensive five-fold cross-validation and relate to the SVC method of the Python scikit-learn library
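A minimal scikit-learn sketch of an SVC with an RBF kernel and probability=True (dataset and hyperparameters are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# probability=True enables probability estimates via internal cross-validation (slower)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, probability=True))
clf.fit(X_tr, y_tr)
print(round(clf.score(X_te, y_te), 3))
print(clf.predict_proba(X_te[:3]))
```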
Convolutional neural network
Type of network that is more likely to be used in image recognition
Gamma
Type of parameter that defines how far the influence of a single training example reaches, with low values meaning 'far' and high values meaning 'close'
Regularization
Type of parameter that tells the SVM optimization how much you want to avoid misclassifying each training example. Will trade off margin width for classification accuracy
Stationarity
Type of series in which the properties - mean, variance, and covariance - do not vary with time - Allows you to assess trends in time series data without worrying about changes in the underlying sampling distribution of the data - Stock market price predictions using prior data may not be capturing trends due to non-stationarity or regime shifts - Adjusting the data, for instance by taking first differences, and then creating a prediction may reduce overfitting - Use a baseline persistence model: the prior period's value as a predictor of the next period's value - If you are not better than a persistence model, then your model is not that great
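A minimal NumPy sketch of a persistence baseline and first differencing on a simulated random walk:

```python
import numpy as np

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=200))   # a non-stationary random walk

# Persistence baseline: predict the next value with the current value
pred = series[:-1]
actual = series[1:]
rmse = np.sqrt(np.mean((actual - pred) ** 2))
print("persistence RMSE:", round(rmse, 3))

# First differencing removes the trend, yielding a (more) stationary series
diff = np.diff(series)
print("mean/std of differences:", round(diff.mean(), 3), round(diff.std(), 3))
```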
Unstructured data
Pictures, audio, free-form text, etc. - NLP and deep learning neural networks are used to detect patterns and extract actionable intelligence from such data - Not useful in regressions
Machine learning
Uses datasets and models to draw insights from data and predict outcomes - Branch of AI - Requires data and parametric/non-parametric model; makes assumptions - Information advantage: underwriting, stock picking - Process automation: robo-advising, chatbots, back office automation - Compliance/security: suspicious account behavior - Main types are unsupervised and supervised learning
Hyperparameters
Values set up by data analyst to optimize output and model prediction (best accuracy, least overfitting) - Like the settings of an algorithm that can be adjusted to optimize performance (called hyperparameter optimization/hyperparameter tuning)
Label
Variable to be predicted. Economists call this the "dependent variable" in a regression context. Sometimes called the y variable
Javascript Object Notation
The format through which most applications transfer data to one another - Utilized in APIs and many database formats - CSVs are less scalable and can lose data
Cross validation k-fold
While tuning the parameters, it is best to use this approach (similar to splitting the sample into test and training sets) - Split the dataset into k equal partitions (folds) - Use the first fold as testing data and the union of the other folds as training data, and calculate the testing accuracy - Repeat, using a different fold as the test data each time - Take the average of these test accuracies as the accuracy estimate - The choice of k is usually 5 or 10, but there is no formal rule - Stratified: the splitting of data into folds may be governed by criteria such as ensuring that each fold has the same proportion of observations with a given categorical value, such as the class outcome value; this is called stratified cross-validation
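A minimal scikit-learn sketch of stratified k-fold cross-validation with k = 5 (model and dataset chosen for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
# Stratified folds keep the class proportions similar in each fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=cv)
print(scores, round(scores.mean(), 3))   # per-fold accuracies and their average
```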