ISYE 6501 - Quiz 1
SVM Soft Classifier: What are the parts and what do they represent?
"Minimize the sum of total error and squared coefficients." We minimize both terms in order to reduce error while maximizing margin. Max(0, 1 - (sum of coefficients x variables) x y) is the portion of the equation that measures error. Any misclassification produces a (sum of coefficients x variables) x y value that is negative, so 1 minus that value is greater than 1 and is added to the total error. Points that are classified correctly but fall inside the margin (a value between 0 and 1) add a smaller error. This part of the equation ensures accuracy. Like in hard classification, we seek to minimize the sum of coefficients squared in order to maximize margin. This part of the equation maximizes margin. Unlike hard classification, we have the parameter lambda that is multiplied with the margin term, which acts as a "lever" for what we want to prioritize.
SVM Hard Classifier: What are the parts and what do they represent?
"Subject to all points being classified correctly, we seek to maximize the margin." (Sum of coefficients x variables) x y must be greater than or equal to 1 in order to maintain accuracy. y is the response variable (-1 or 1), so the classifier must produce a prediction that has the same sign as the response variable in order for the inequality to hold. This part of the equation ensures accuracy. We seek to minimize the sum of coefficients squared since its square root is the denominator of the margin calculation (2 / sqrt(sum of squared coefficients)). Therefore, minimizing the sum of coefficients squared maximizes the margin. This part of the equation maximizes margin.
Automated Methods for Detecting Outliers
1. Box and Whisker Plot 2. Exponential Smoothing Model
How to tell when there is causation?
1. Cause is before effect. 2. Idea of causation makes sense (however this can be open to disagreement of what 'makes sense') 3. No outside factors causing the relationship (this is hard to guarantee)
What types of questions can regression answer?
1. Descriptive. It can tell you how important a variable is to an outcome. 2. Predictive. It can forecast into the future.
How do you determine what the value of k should be in k-means?
1. Fit the situation you are analyzing (Qualitative Approach). If there is a natural constraint on the number of clusters in your problem, then that is your k value. 2. Compare total distances (Quantitative Approach). You plot an elbow graph and find the k value where you get diminishing returns on reducing total distance.
Characteristics of k-means clustering
1. Heuristic. It is not guaranteed to find the absolute best solution, but it can get very close and do so quickly. 2. Expectation-maximizing. The alternating steps of finding cluster centers (expectation) and assigning data points (maximizing negative distance from point to cluster center) make it an EM algorithm.
Dealing with Real Data Outliers
1. Investigate the cause of this outlier and circumstances around it. Would removing this point make your model too optimistic? Are these circumstances likely to occur again? 2. Create 2 models. One logistic regression model estimating the probability of outliers occurring under different conditions. Another model using data without outliers. Always investigate your outliers.
k-means clustering
1. Pick k cluster centers within the range of the data. 2. Assign each data point to the nearest cluster center. 3. Recalculate cluster centers (centroids). We then repeat steps 2 and 3 until the cluster centers stop changing.
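The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function name `k_means`, the choice to pass the starting centers in explicitly, and the rule for empty clusters (keep the old center) are assumptions made for the example.

```python
import math


def k_means(points, centers, max_iter=100):
    """Alternate assignment and center updates until the centers stop changing."""
    centers = [tuple(c) for c in centers]  # step 1: starting centers chosen by the caller
    clusters = []
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest center (straight-line distance).
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # Step 3: recompute each center as the mean of the points assigned to it.
        new_centers = []
        for j, members in enumerate(clusters):
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # assumption: an empty cluster keeps its center
        if new_centers == centers:
            break  # centers stopped changing
        centers = new_centers
    return centers, clusters


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, clusters = k_means(points, centers=[points[0], points[3]])
```

Because the result depends on the starting centers, it is worth rerunning with several different starting picks and keeping the solution with the smallest total distance.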
3 Types of Outliers
1. Point Outlier 2. Contextual Outlier 3. Collective Outlier
Methods for Splitting Data
1. Random. Randomly assign points to different sets 2. Rotation. Take turns selecting points to go to each set.
Feature Selection in SVM
1. SVM graphs with lines of separation that are vertical or horizontal lines indicate an insignificant variable is present. If both variables play a part in creating the line of separation, then the line would be diagonal. 2. Factor coefficients near 0 indicate an insignificant variable.
3 Parts of Data Preparation
1. Scale the data 2. Examine data distribution 3. Handle outliers
Regular Validation Workflow
1. Split the dataset into 3 distinct sets: data1, data2, and data3. 2. Train all models on data1. 3. Validate all models on data2 and choose the best one. 4. Report the picked model's accuracy using data3.
Cross Validation Workflow
1. Split the dataset into two distinct sets: data1 and data2. 2. Perform cross-validation on all models using data1 and choose the best one. For the cross-validation, you split data1 into k different pieces and make k different models using the same hyperparameters (e.g., C or k) but different subsets of training data, giving you different model parameters (e.g., coefficients and intercepts). 3. Train the chosen model on ALL of data1 to find model parameters. 4. Report the picked model's accuracy using data2.
3 Types of Variation In Time Series Data
1. Trends 2. Cyclical Variation (Seasonality) 3. Random Variation
k-Fold Cross Validation
1. We split data into two groups: a test set and a combined training + validation set. 2. We then split the combined training + validation set into k groups. 3. We take turns assigning one of the k groups to be the validation set and the remaining groups to be the training set. We do this k times so that each group is the validation set exactly once and every data point is used to train k-1 of the models. 4. The average score across all k configurations becomes the validation score for the model.
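The rotation in steps 2 through 4 can be sketched as index bookkeeping. This is an illustration only: `k_fold_splits`, `cross_validation_score`, and the `fit_and_score` callback are hypothetical names, and dealing indices round-robin into groups is just one simple way to form the k groups.

```python
def k_fold_splits(n_points, k):
    """Return a list of (train_indices, validation_indices), one pair per rotation."""
    indices = list(range(n_points))
    # Deal the indices round-robin into k groups (shuffle first if order carries meaning).
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, validation))
    return splits


def cross_validation_score(n_points, k, fit_and_score):
    """Average the score that fit_and_score(train, validation) returns over the k rotations."""
    scores = [fit_and_score(train, val) for train, val in k_fold_splits(n_points, k)]
    return sum(scores) / len(scores)


splits = k_fold_splits(10, 5)
```

With 10 points and k = 5, every index lands in a validation group exactly once and in a training group 4 times, which is the "k-1 of the models" property described above.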
When does relying on P-Values fail?
1. With large amounts of data. When data gets large, p-values naturally get small even if attributes are not related to the response. 2. P-values are only probabilities even when meaningful. Since a p-value is only a probability, there is always a chance that two variables are not actually related even if the p-value is less than .05. Using 100 attributes with p-values of .02 each, expect two of them to be irrelevant.
Linear Regression
A method of finding the best model for a linear relationship between the predictors and response variable. Linear regression allows us to make predictions based on a set of predictors and their coefficients.
Exponential Smoothing
A method of time-series forecasting that takes a weighted average of current observations and previous observations as a way of reducing variation.
AIC (Akaike Information Criterion)
A metric used to evaluate model fit by balancing likelihood and simplicity (reducing overfitting). AIC does this by introducing a penalty term that is proportional to the number of parameters being used in the model. AIC encourages models that produce the maximum likelihood while using the fewest parameters. The best fitting model is the one that minimizes AIC.
BIC (Bayesian Information Criterion)
A similar metric to AIC that also evaluates how a model balances fit and overfitting. BIC takes into account the number of data points into its penalty term. As a result, BIC's penalty term > AIC's penalty term and BIC encourages models with fewer parameters than AIC. Only use BIC when there are more data points than parameters.
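The two penalty terms can be compared directly with the standard textbook definitions, AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L). These formulas are not spelled out in these notes, so the symbols (k parameters, n data points, log-likelihood ln(L)) and the example numbers below are assumptions for illustration.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: penalty of 2 per parameter."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_points):
    """Bayesian Information Criterion: penalty of ln(n) per parameter."""
    return n_params * math.log(n_points) - 2 * log_likelihood


# Hypothetical model: log-likelihood of -50 using 3 parameters on 100 data points.
model_aic = aic(-50.0, 3)
model_bic = bic(-50.0, 3, 100)
```

Since ln(n) exceeds 2 once n is 8 or more, BIC's penalty per parameter is larger than AIC's for all but the smallest data sets.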
Margin vs Accuracy Trade Off
A small margin for error reduces your chances of misclassifying known data points but increases your chances of misclassifying unknown data points. A large margin for error increases your chances of misclassifying known data points but decreases your chances of misclassifying unknown data points. Making trade off decisions is based on what the cost is for a false-positive and false-negative in a given business problem.
ARIMA vs Exponential Smoothing
ARIMA and Exponential Smoothing can both be used for short-term forecasting. ARIMA works better than exponential smoothing when data is stable with fewer peaks, valleys, and outliers. However, ARIMA needs at least 40 data points to work well. Exponential Smoothing can be considered a special case of ARIMA: simple exponential smoothing is equivalent to an ARIMA(0,1,1) model. Both predict future values of a variable based on previous values of that variable.
What are the main differences between GARCH's calculation and ARIMA's?
ARIMA and GARCH equations are structured similarly except that 1. GARCH measures variances and squared errors instead of observations/linear errors. GARCH doesn't care about the value of an observation since it's trying to forecast variance. 2. GARCH uses raw variances, not differences in variances. As a result, GARCH uses the variables p and q but not d.
What is ARIMA? What are the 3 key parts to ARIMA?
ARIMA is a time-series forecasting model that uses previous values of the response variable to make future predictions. The 3 key parts of ARIMA are 1. Differences (d) 2. Autoregression (p) 3. Moving Average (q)
ARIMA: How is autoregression integrated? What is the variable p?
ARIMA uses autoregression to look back p time periods of previous observations to predict d-th order differences. The variable p lets you control how reliant your ARIMA model is on past observations.
ARIMA: How is moving-average integrated? What is the variable q?
ARIMA uses moving average to model predicted observations based on previous errors (noise). The variable q controls how many time periods of previous errors are considered. Each error is the difference between the observed value and the value the model predicted for that period.
ARIMA: How are differences integrated? What is the variable d?
ARIMA uses the differences between observations to take into account trend in data. The variable d represents the order of differences you are taking. The difference between two consecutive observations is first order, the difference between consecutive first-order differences is second order, etc. (i.e., d is the number of times you need to difference the data to get it to flat line, similar to taking derivatives)
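A short sketch of d-th order differencing follows. The helper name `difference` and the sample series are made up for illustration; the point is that each pass replaces the series with the gaps between consecutive values.

```python
def difference(series, d=1):
    """Apply consecutive differencing d times."""
    for _ in range(d):
        series = [later - earlier for earlier, later in zip(series, series[1:])]
    return series


quadratic = [1, 4, 9, 16, 25]         # a curved trend
first = difference(quadratic, d=1)    # [3, 5, 7, 9]: still trending
second = difference(quadratic, d=2)   # [2, 2, 2]: flat, so d = 2 is enough here
```

Each round of differencing shortens the series by one observation.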
What does the variable alpha do in single exponential smoothing? When would you choose a large or small alpha?
Alpha is the parameter that you can set (between 0 and 1) to weight how much you want to take into account the current observation vs past observations. *A large alpha increases the weight of the current observation* and is something you would do if you *do not expect much randomness* in your system. *A small alpha increases the weight of past observations* and is something you would do if you *expect lots of randomness* in your system.
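The effect of alpha can be seen with the usual single exponential smoothing update, S_t = alpha x_t + (1 - alpha) S_{t-1}. This is a minimal sketch; starting the baseline at the first observation is a common convention assumed here, and the data is invented.

```python
def exponential_smoothing(observations, alpha):
    """Return the smoothed baseline S_t for every time period."""
    baseline = [observations[0]]  # assumed initialization: S_1 = x_1
    for x in observations[1:]:
        # Blend the current observation with the previous baseline.
        baseline.append(alpha * x + (1 - alpha) * baseline[-1])
    return baseline


noisy = [10.0, 20.0, 10.0, 20.0]
reactive = exponential_smoothing(noisy, alpha=0.9)  # chases every swing
smooth = exponential_smoothing(noisy, alpha=0.1)    # stays close to the earlier baseline
```

With alpha = 0.9 the baseline jumps almost all the way to each new observation, while alpha = 0.1 barely moves, which is why a small alpha suits systems with a lot of randomness.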
What is CUSUM (Cumulative Sum)?
An approach to change detection that detects when the mean has changed and passed a critical threshold. It can detect both increases and decreases.
Benefits of k-Fold Cross Validation
1. Better use of data. 2. Better estimate of model quality. 3. Lets you choose a model more effectively. 4. Prevents one model from benefitting more than another from randomness in the validation set. 5. Models being trained don't miss out on important data points that might otherwise only be present in the validation set.
What is Box-Cox transformation? When should you use it?
Box-Cox is a power transformation (it reduces to a logarithmic transformation when its parameter lambda is 0) that stretches out the smaller range to enlarge its variability and shrinks the larger range to reduce its variability. Box-Cox should be used when your data has heteroscedasticity.
How does CUSUM work? What are the main variables of the equation and what do they do?
CUSUM works by keeping a running sum of how far each point is above (or below) the mean over time. The main variables are 1. T: the critical threshold at which we are alerted. 2. C: the dampening factor, which is subtracted from each difference so that small random variation does not accumulate. 3. St: the running sum of how much above/below the mean each point is over time.
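A sketch of the increase-detecting version, using the common form S_t = max(0, S_{t-1} + (x_t - mean - C)) with an alert once S_t reaches T. The function name, the choice of ">=" for the alert, and the data are assumptions for illustration; detecting a decrease swaps (x_t - mean) for (mean - x_t).

```python
def cusum_increase(observations, mean, c, threshold):
    """Return the index of the first alert for an upward shift, or None if there is none."""
    running = 0.0
    for t, x in enumerate(observations):
        # Add how far x is above the mean, dampened by C; never let the sum go below 0.
        running = max(0.0, running + (x - mean - c))
        if running >= threshold:
            return t
    return None


data = [10, 11, 9, 10, 13, 14, 13, 15]
first_alert = cusum_increase(data, mean=10, c=1, threshold=5)    # alerts at index 5
later_alert = cusum_increase(data, mean=10, c=1, threshold=10)   # higher T: alerts at index 7
no_alert = cusum_increase(data, mean=10, c=3, threshold=5)       # higher C: never alerts
```

Raising either T or C delays or suppresses the alert on the same data, which is the sensitivity trade-off between early detection and false positives.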
Causation vs. Correlation
Causation: one thing causes another thing. Correlation: two things tend to happen or not happen together, and neither of them might cause the other. Two things can be correlated without a causal relationship existing. In fact, both might be caused by the same external factor.
What is Change Detection? When would you want to examine Change Detection?
Change detection is determining whether something has changed using time series/time-dependent data. You would want to use change detection to 1. Determine whether action might be needed 2. Determine the impact of a past action 3. Determine changes to help plan
Classification: What is it? What are the types?
Classification groups things into discrete categories. The types of classification are hard classification and soft classification.
Clustering vs Classification
Clustering is an *unsupervised learning problem* where we *do not know what the correct answer is* for each prediction. Furthermore, we may not know how many clusters exist. Classification is a *supervised learning problem* where *we know the correct outcome* for each prediction. As a result, you can calculate accuracy for classification but not for clustering.
Dealing with Bad Data Outliers
Consider either: 1. Omitting the data points 2. Using imputation
Why is Corrected AIC needed? What is relative likelihood?
Corrected AIC is used for *smaller data sets* since AIC only has nice properties when there are *infinitely* many data points. Relative likelihood is a comparison metric that takes the difference in AIC between two models (smaller AIC minus larger AIC), divides it by 2, and raises e to that power. Ex. A relative likelihood of 8.2% should be interpreted as "Model 2 is 8.2% as likely as Model 1 to be the better model. Therefore, Model 1 is probably better." A higher relative likelihood means a better chance that the model with the larger AIC is actually the better choice.
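The relative likelihood calculation is a one-liner. In this sketch the AIC values 75 and 80 are hypothetical, picked so that the result lands near the 8.2% figure used in the interpretation above.

```python
import math


def relative_likelihood(aic_smaller, aic_larger):
    """e raised to (smaller AIC - larger AIC) / 2; always between 0 and 1."""
    return math.exp((aic_smaller - aic_larger) / 2)


# Hypothetical models: Model 1 has AIC 75, Model 2 has AIC 80.
rl = relative_likelihood(75.0, 80.0)  # exp(-2.5), roughly 0.082
```

Two models with equal AIC give a relative likelihood of exactly 1, meaning neither is favored.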
What is cross validation and what problem does it solve?
Cross validation shuffles samples between training sets and validation sets. This solves a problem when important data points are not present in training sets but are only present in validation/test sets. Cross validation shuffles training and validation data in a way that the model incorporates validation data into its build at least once.
How do you calculate cyclical patterns? What does increasing or decreasing Gamma do?
Cyclical pattern is calculated as the current observation divided by the most recent baseline variation. Gamma acts just like alpha and beta and is the parameter used to weight the current observation's cyclical pattern vs previous observations. Increasing gamma increases the weight of the current observation's cyclical pattern. Decreasing gamma increases the weight of past observations' cyclical pattern.
1st Step When Dealing with Outliers
Determine if the outlier is a result of Bad Data or Real Data
What is double exponential smoothing?
Double exponential smoothing takes into account *baseline variation* and *trend*. Trend is an additive component to the exponential smoothing equation.
Detrending Data: How do you do it?
Factor-by-factor: for this method we go through the predictors one at a time and fit a one-dimensional regression on each of them to get a predicted response. From there we calculate the detrended response as: *Detrended Response = Actual Response - Predicted Response*
Rule of Thumb for Evaluating BIC
First you take the difference in BIC between two models. 1. Difference > 10: the smaller BIC is "very likely" better. 2. Between 6 and 10: the smaller BIC is "likely" better. 3. Between 2 and 6: the smaller BIC is "somewhat likely" better. 4. Between 0 and 2: the smaller BIC is "slightly likely" better.
What is GARCH? Why would we want to use it?
GARCH is a time-series model used to forecast *variance*. We would want to use GARCH to forecast variance so we can estimate the amount of error in our primary model's predictions and bake that into our forecasts.
When to use scaling and standardization
Generally, you want to use scaling if your data has some type of bounded range (e.g. SAT scores) since standardization won't guarantee that a data point will stay within a range
Clustering
Grouping data points together based on similarity
Hard Classification vs Soft Classification
Hard: 100% separation between groups. Soft: minimize misclassifications where possible.
What is heteroscedasticity and how can it bias your results?
Heteroscedasticity is when there is unequal variance present in your data. As a result, your model will produce larger estimation errors in the areas of your data where there is more variance. This will then push the model to fit those points better than others which leads to model bias.
What's the difference between PCA and Feature Selection?
In feature selection, we are looking to reduce the number of factors that are used by our model. We determine which features are/are not useful using some metric (ex. p-value) and remove based on a threshold. PCA is a dimension reduction technique which reduces the number of dimensions that is passed to a model but still uses all of the factors in the data set to derive each principal component. No factors are removed from the model since every factor plays a part in creating each PC.
How does changing the value of C and T affect our CUSUM model?
Increasing either T or C makes the model less sensitive to changes, while decreasing either T or C makes the model more sensitive. Increasing T raises the bar for when the model will alert that something in the data has changed, while increasing C raises the bar for how far a point must be from the mean to count as above or below it.
Why can't you use the validation score to measure a model's performance when choosing the best model from a group?
It is likely that the model that performed the best during validation did so because it happened to be better at picking up the random effects in your validation set than other models. As a result, the validation score it produced is probably too optimistic. Model scores are always a sum of fit to real patterns and fit to random patterns. If several models are pretty close to each other in how well they pick up real patterns, the deciding factor often becomes how well they fit random patterns in the validation data.
k-Nearest Neighbor (KNN)
KNN is a classifier that classifies data points based on the k number of data points nearest to a data point. To find the class of a new point, you pick the k closest points (neighbors) to the new one. The new point's class is the most common among the k neighbors. The calculation for KNN is much more straightforward with the main parameters being how distance is calculated (typically straight-line) and what the optimal value for k should be.
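The voting procedure described above fits in a few lines. This is a minimal sketch using straight-line distance; the function name, the toy points, and the color labels are invented, and ties are broken by whichever class `Counter` lists first.

```python
import math
from collections import Counter


def knn_classify(train_points, train_labels, new_point, k):
    """Predict the most common label among the k training points nearest to new_point."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], new_point),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


train_points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
train_labels = ["red", "red", "red", "blue", "blue", "blue"]
near_red = knn_classify(train_points, train_labels, (2, 2), k=3)
near_blue = knn_classify(train_points, train_labels, (7, 7), k=3)
```

Because every prediction sorts all training points by distance, classifying is slow on large training sets, and unscaled features with large ranges will dominate the distance.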
KNN vs SVM
KNN is better when there are more than 2 classes present. However, SVM is faster at classifying.
Lambda in SVM Soft Classifier
Lambda is the parameter we can set to control the trade off between margin and accuracy. Increasing lambda increases the emphasis the model has on margin. Decreasing lambda increases the emphasis the model has on accuracy.
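The lever can be seen by scoring two candidate classifiers with the soft classifier objective (total error plus lambda times the sum of squared coefficients). This is only an illustration: the function name, the explicit intercept argument, the three data points, and both candidate lines are assumptions made for the example.

```python
def soft_margin_objective(coefs, intercept, points, labels, lam):
    """Total hinge error plus lam times the sum of squared coefficients."""
    total_error = 0.0
    for x, y in zip(points, labels):
        # Signed score: positive on the +1 side of the line, negative on the -1 side.
        score = sum(a * xi for a, xi in zip(coefs, x)) + intercept
        # Zero when the point is correct and outside the margin, positive otherwise.
        total_error += max(0.0, 1.0 - y * score)
    return total_error + lam * sum(a * a for a in coefs)


points = [(2.0, 0.0), (-2.0, 0.0), (0.5, 0.0)]
labels = [1, -1, 1]
wide = ([1.0, 0.0], 0.0)    # larger margin, but the third point falls inside it
narrow = ([2.0, 0.0], 0.0)  # smaller margin, no error on any point


def preferred(lam):
    """Name the candidate with the lower objective for a given lambda."""
    wide_cost = soft_margin_objective(*wide, points, labels, lam)
    narrow_cost = soft_margin_objective(*narrow, points, labels, lam)
    return "wide" if wide_cost < narrow_cost else "narrow"


low_lambda_choice = preferred(0.1)   # "narrow": error dominates the objective
high_lambda_choice = preferred(1.0)  # "wide": margin dominates the objective
```

The same data yields a different winner purely because lambda changed, which is the trade-off between margin and accuracy described above.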
Is it a good idea for you to use predictions from your training set for model validation?
No. Predictions made from a training data set are often too optimistic since it is likely that your model is picking up random effects present in training data. Training data is just used for training the model and should not be used for deriving how good the model performs.
Can you average coefficients across k splits to get your final model from k-Fold Cross Validation?
No. You get your final model coefficients by building the model on the combined training + validation set after you have selected which model type performs best in validation.
Clustering for Prediction
Once you have your cluster centers found, you assign new data points to the cluster center that they are closest to.
P-Values
P-values estimate the probability that the actual coefficient for a predictor is 0. Small p-values mean that there is strong statistical evidence that two variables are related. The general cutoff is .05 but other thresholds can be used. Higher thresholds mean more factors can be included but increase the possibility of including irrelevant factors. Lower thresholds mean fewer factors can be included and increase the possibility of leaving out a relevant factor.
What is PCA (Principal Component Analysis)? When is it useful?
PCA is a dimension-reduction technique that is useful when dealing with a model that has 1. lots of factors/predictors present 2. high correlation between factors/predictors. PCA does this by 1. applying a linear transformation to the data, multiplying each data point by the eigenvectors of the (scaled) data matrix's X^T X to produce new dimensions (principal components) that can be used as variables 2. concentrating on the most important components by ranking each principal component by the amount of variance (information) seen on that dimension
Box and Whisker Plot Outlier Detection
Points outside of the whiskers indicate possible outliers since they are outside of the "reasonable range" of data. Whiskers typically represent the 90th and 10th percentile or the 95th and 5th.
Exponential Smoothing Model Outlier Detection
Points with very large error indicate possible outliers since a smoothed model should reduce normal levels of variance.
PCA: What does the order of Principal Components tell us?
Principal Components are ranked according to how much variance/spread (which corresponds to the eigenvalue for each eigenvector) that exists along that dimension. As a result the higher in the order a PC is, the more significant we would expect the PC to be to our model. Ex. PCA1 should be much more significant to the model we pass it to than PCA2.
R-squared and Adjusted R-squared
R-squared estimates how much variability our model accounts for in the data. Adjusted R-squared adjusts for the number of attributes used (relevant factors will increase adjusted r-squared while irrelevant factors will decrease adjusted r-squared)
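A small sketch using the standard formulas R-squared = 1 - SSE/SST and adjusted R-squared = 1 - (1 - R-squared)(n - 1)/(n - p - 1), where n is the number of data points and p the number of predictors. These formulas are not written out in the notes, and the data and function names are made up for illustration.

```python
def r_squared(actual, predicted):
    """Share of the response's variability that the predictions account for."""
    mean_actual = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - sse / sst


def adjusted_r_squared(r2, n_points, n_predictors):
    """Penalize R-squared for the number of predictors used."""
    return 1 - (1 - r2) * (n_points - 1) / (n_points - n_predictors - 1)


actual = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]
r2 = r_squared(actual, predicted)         # about 0.99
adj_one = adjusted_r_squared(r2, 5, 1)    # about 0.987 with one predictor
adj_two = adjusted_r_squared(r2, 5, 2)    # about 0.98 if the same fit took two predictors
```

For the same R-squared, using more predictors lowers the adjusted value, so an extra factor only helps if it improves the fit enough to offset the penalty.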
Random Splitting vs Rotation Splitting
Randomness can give one set more early or late data, while rotation equally separates data. However, rotation may introduce bias (ex. 5 data point rotation means all Mondays are in one dataset). You can make a combined approach that uses parts of both.
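Both splitting methods, and the Monday problem with rotation, can be sketched as follows. The function names, the fixed seed, and the two-week list of weekdays are assumptions for the example.

```python
import random


def random_split(points, train_fraction, seed=0):
    """Shuffle a copy of the data, then cut it into a training part and a remainder."""
    shuffled = list(points)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def rotation_split(points, n_sets):
    """Take turns dealing points into n_sets sets."""
    return [list(points[i::n_sets]) for i in range(n_sets)]


days = ["Mon", "Tue", "Wed", "Thu", "Fri"] * 2   # two work weeks, in order
train, test = random_split(days, 0.8)
rotated = rotation_split(days, 5)
mondays_only = rotated[0]                        # ['Mon', 'Mon']
```

A rotation length that matches the cycle in the data puts every Monday in the same set, which is the bias mentioned above; rotating within randomly ordered blocks is one way to combine the two approaches.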
Why does fitting a model on different data sets remove the impacts of random effects?
Real effects are the same in all data sets. If there is truly a relationship between two variables, then that effect will always be present even if you change data sets. Random effects are different in all data sets. When you change your data set, any random effects your model picked up in training won't help it when it sees new data with different random effects.
What are the two types of patterns that exist in data?
Real effects: real relationships between attributes and the response variable. Random effects: random, but looks like a real effect.
What is SSE? What is Likelihood? What is the relationship between the two?
SSE = Sum of Squared Errors. The best fitting model is one that *minimizes* SSE. Likelihood is how well your model's distribution overlaps with the actual data's distribution. The higher the likelihood, the better the model fits. Likelihood and SSE are intertwined: when errors are assumed to be independent and normally distributed with constant variance, maximizing likelihood is the same as minimizing SSE. In that case, the MLE (Maximum Likelihood Estimate) is the set of parameters that minimizes the SSE.
Support Vector Machine (SVM): How do they work and what are their objectives?
SVM is a classifier that draws a line that best separates groups of data points. It does this by creating two outer lines that represent the margin for error and a middle line that is the classifier. Objectives: 1. All data points are categorized correctly (accuracy) 2. Maximize the gap between the two outer lines (margin)
Is scaling important in KNN? Why or why not?
Scaling is very important in KNN since KNN is a distance-based algorithm. Without scaling, a feature with a larger range would have a much larger impact than the others in determining which data points are closest.
Is scaling important in SVM? Why or why not?
Scaling is very important in SVM since its margin value is derived from its factor coefficients, which depend on the scale of the variables. Without scaling, a small change in one variable can drown out a big change in another.
Parameter tuning with k-means
Since k-means is heuristic and fast to run, it is good practice to try lots of different parameters to find the best model. Common parameters are 1. Different initial cluster centers. The initial guess at cluster centers can have a large impact on the solution. 2. Different values for k. If you don't know how many clusters already exist in the data, it's a good idea to try different values of k.
Exponential Smoothing Forecasting: Baseline Value (St)
Since the values of future observations are unknown, our best guess for future baseline values is just the *most recent baseline value* we have. The forecasted baseline will therefore be a constant equal to the most recent baseline value. As a result, we would expect error to grow as we forecast further into the future.
Point Outlier
Single data point(s) that are very far from the rest
Collective Outlier
Something is missing in a range of points but can't tell exactly where. You can't point to a single data point with this type of outlier since the outlier is the fact that data is missing.
Where does the name 'Exponential Smoothing' come from?
The 'Smoothing' component comes from the dampening effect of the parameter alpha, which smooths out variation in the data. The 'Exponential' portion comes from every past observation contributing to the current baseline estimate with a weight that shrinks exponentially: an observation k periods in the past is weighted by alpha x (1 - alpha)^k. Newer observations are weighted more.
Infinity Norm Distance
The P-Norm Distance with P set to infinity. When we take the infinity norm distance, we are taking the distance of the largest absolute distance between two points along a single dimension. By setting P to infinity the equation gets simplified to max(abs(x - y)) which is where the distance is equal to the dimension with the largest difference between points x and y. Raising the equation to the power of infinity makes the dimension with the largest difference drown out any contribution made by other dimensions.
How do confidence intervals help determine model fit?
The narrower our confidence interval for a coefficient (the closer its width is to 0), the more precise our estimate of the population's true coefficient.
P-Norm Distance (Minkowski Distance)
The generalized distance formula that encapsulates both Euclidean (p=2) and Rectilinear distance (p=1)
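A sketch of the general formula, (sum of |x_i - y_i|^p)^(1/p), including the p = infinity case where only the largest single-dimension difference survives. The function name and the sample points are illustrative.

```python
import math


def p_norm_distance(x, y, p):
    """Minkowski distance between points x and y; p may be math.inf."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(p):
        return max(diffs)  # the dimension with the largest difference drowns out the rest
    return sum(d ** p for d in diffs) ** (1 / p)


a, b = (0, 0), (3, 4)
rectilinear = p_norm_distance(a, b, 1)       # 3 + 4 = 7
euclidean = p_norm_distance(a, b, 2)         # sqrt(9 + 16) = 5
infinity = p_norm_distance(a, b, math.inf)   # max(3, 4) = 4
```

As p grows, the result moves from the sum of the differences toward the single largest difference.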
Are model scores derived from validation data sets typically higher or lower than ones derived from training data sets? Why or why not?
They are almost always going to be lower than scores derived from training sets. The predictions made on a training set contain both real effects and random effects from that data. When that same model is run on a new validation set, only the model's ability to pick up real effects should remain.
Training and Validation Sets: What are the objectives of each and which should be larger or smaller?
Training sets should be larger and are used to fit the model. Validation sets should be smaller and are used to estimate the model's effectiveness.
Training, validation, and test sets
Training: used to fit the models. Validation: used to compare and choose the best model. Test: used to estimate the performance of the chosen model. Note: Validation sets are only used when we are comparing multiple models. If only one model was built, then we do not need a validation step and just need a Training and Test set.
How do you calculate trend in exponential smoothing? What does increasing or decreasing Beta do?
Trend is calculated as a weighted combination of the latest change in baseline (the current baseline minus the previous baseline) and the previous trend: T_t = B(S_t - S_{t-1}) + (1 - B)T_{t-1}. Beta (B) acts just like alpha and is a parameter used to weight how much importance you give the trend of the current observation vs previous observations. Increasing beta increases the weight of the current observation's trend. Decreasing beta increases the weight of previous observations' trends.
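The trend update can be traced by hand with a short loop. This sketch uses the additive form of double exponential smoothing, where the baseline update is S_t = alpha x_t + (1 - alpha)(S_{t-1} + T_{t-1}); that baseline formula, the starting values S_1 = x_1 and T_1 = 0, and the sample data are assumptions, since the notes only give the trend equation.

```python
def double_exponential_smoothing(observations, alpha, beta):
    """Return the final baseline S and trend T after processing all observations."""
    baseline = observations[0]  # assumed start: S_1 = x_1
    trend = 0.0                 # assumed start: T_1 = 0
    for x in observations[1:]:
        previous_baseline = baseline
        # Baseline: blend the observation with the previous baseline plus trend.
        baseline = alpha * x + (1 - alpha) * (previous_baseline + trend)
        # Trend: blend the latest change in baseline with the previous trend.
        trend = beta * (baseline - previous_baseline) + (1 - beta) * trend
    return baseline, trend


baseline, trend = double_exponential_smoothing([10.0, 12.0, 14.0, 16.0], alpha=0.5, beta=0.5)
forecast_two_ahead = baseline + 2 * trend  # F = S + kT with k = 2
```

On this steadily rising series the trend estimate climbs toward the true slope of 2 as more observations arrive; a larger beta would get there faster but would also react more to noise.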
What is triple exponential smoothing?
Triple exponential smoothing (Holt-Winters') takes into account *baseline variation, trend, and cyclical patterns.* Seasonality can be introduced either additively or multiplicatively.
Contextual Outlier
Value isn't very far from the rest overall, but is far from the points nearby in time. The value itself isn't far but the *context* in which it occurs is strange. Ex. It isn't strange for temperatures to reach 90 degrees but it would be strange if it reached 90 degrees in December.
Standardizing: standardizing to a normal distribution
We first find the mean and standard deviation of the variable. Then we subtract the mean from each value and divide by the standard deviation. We would do this if we want to understand how far from the mean each data point is.
Scaling: scaling data between 0 and 1
We scale all numbers between 0 and 1 by taking the min and max for that value and subtracting the min from each data point and dividing it by the range (max - min) for that value. This is how you scale linearly.
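Both rescaling recipes (scaling between 0 and 1, and standardizing with the mean and standard deviation) can be sketched side by side. The function names and the sample scores are made up, and using the population standard deviation (`statistics.pstdev`) rather than the sample version is an assumption.

```python
import statistics


def min_max_scale(values):
    """Linearly rescale values so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Subtract the mean and divide by the standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # assumption: population standard deviation
    return [(v - mean) / sd for v in values]


scores = [400, 800, 1200, 1600]
scaled = min_max_scale(scores)       # 0, 1/3, 2/3, 1
standardized = standardize(scores)   # centered on 0 with standard deviation 1
```

Scaled values are guaranteed to stay inside [0, 1] for the data used to compute the min and max, while standardized values have no fixed bounds.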
Exponential Smoothing Forecasting: Forecasting with Multiplicative Seasonality
We use the *most recent seasonality calculation for the forecasted observation's stage in the cycle*. We then multiply Forecasted Baseline + Forecasted Trend by the Forecasted Seasonality factor. F = (S + kT)C Ex. If our observation falls on a Monday, we use the most recent seasonality factor we have for Monday.
Exponential Smoothing Forecasting: Forecasting with Trend
We use the *most recent trend calculation* and multiply it by the number of periods we are forecasting into the future. We then add that forecasted trend to our forecasted baseline to get a complete forecast. F = S + kT Ex. If our most recent trend calculation is 5 and we forecast 4 days out, we would get a trend estimate of 20.
Training and Validation Sets: Choosing the best model from a group?
When choosing the best model among a group, you would use the model score from the validation set to compare results. However, you would not use the score from the validation set to measure the model's overall accuracy. You would need to evaluate the chosen model on a third data set (the test set) to estimate its performance.
Rules of Thumb for Splitting Data
Working with one model: 70%-90% for training and 10%-30% for testing. Comparing models: 50%-70% for training, with the rest split equally between validation and testing.
Is k-means clustering affected by outliers? What about scaling?
Yes. Since we group data points based on distance from the nearest cluster center, outliers and scaling are both important considerations to k-means. Outliers cause cluster centers to be skewed towards them. Scaling is important since k-means is a distance based algorithm, so determining the nearest cluster center can be impacted by scaling.
How do you add business context to build the optimal SVM?
You can add the cost of a misclassification to your model by weighting the a0 variable (intercept). Ex. If misclassifying a bad loan (-1) is twice as costly as misclassifying a good loan (1), you can weight the a0 variable as 2/3(a0 - 1) + 1/3(a0 + 1)
Transforming the Data: How can you make a general model for linear regression that captures non-linear data?
You can adjust the data (the formula you use) so the fit is linear. Ex. Quadratic Regression. You can also transform the response variable. Ex. Taking the log of the response variable
Transforming the Data: How can you take into account how variables may interact with each other?
You can introduce a new variable as an additional *interaction term* that captures how attributes may interact with each other. Ex. creating a new X3 variable that is the product of both parents' heights.
After doing k-Fold Cross Validation, how do you decide which model performed best?
You take the average model score for each model you did k-Fold Cross Validation on and choose the model that had the highest average. Once you decide which model is the best, you build the final model using the combined training + validation set and get a final model score using the test set.
How do you determine a good value for C and T in CUSUM?
You want to consider the trade off of how fast you want your model to react. Examine what the cost of reacting too slowly is and compare it to the cost of reacting too quickly. Lower C and T values also introduce the risk of false positives, so you need to consider the trade off of early detection vs false positives.
Detrending Data: When should you do it?
You would want to detrend anytime you are using a *factor-based model* (regression, SVM, etc. anything that creates a linear model with coefficients) to analyze time-series data. If the analysis you are doing has no benefit from the known trend in the data, detrending helps to remove noise caused by that trend in the data.