predicting social behavior exam 2
how would you calculate precision from data for both yes and no outcomes?
Precision for "yes" outcomes = TP / (TP + FP); precision for "no" outcomes = TN / (TN + FN)
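A minimal sketch (plain Python, hypothetical confusion-matrix counts) of precision and recall for both the "yes" and "no" outcomes:

```python
# Hypothetical confusion-matrix counts (not from the course materials).
tp, fp, fn, tn = 5, 3, 2, 20

precision_yes = tp / (tp + fp)  # of the "yes" predictions, how many were actually yes
precision_no = tn / (tn + fn)   # of the "no" predictions, how many were actually no
recall_yes = tp / (tp + fn)     # of the actual yes cases, how many we predicted as yes
recall_no = tn / (tn + fp)      # of the actual no cases, how many we predicted as no

print(precision_yes, precision_no, recall_yes, recall_no)
```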
How do you calculate the Euclidean distance between a set of predictors?
Square-root of the sum of the squared differences between each characteristic - Example: Euclidean distance between two objects (1 & 2) on 3 different characteristics (X, Y, and Z) - Distance = sqrt[(X1 - X2)^2 + (Y1 - Y2)^2 + (Z1 - Z2)^2]
How do you calculate the City-Block distance between a set of predictors?
Sum of the absolute differences between each characteristic in two vectors - Example: city-block distance between two objects (1 & 2) on 3 different characteristics (X, Y, and Z) - Distance = |X2 - X1| + |Y2 - Y1| + |Z2 - Z1|
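A brief sketch in plain Python of both distance formulas, using hypothetical values for the three characteristics:

```python
import math

# Two hypothetical objects measured on the same three characteristics (X, Y, Z).
obj1 = [2.0, 5.0, 1.0]
obj2 = [4.0, 1.0, 3.0]

euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(obj1, obj2)))  # sqrt of sum of squared differences
city_block = sum(abs(a - b) for a, b in zip(obj1, obj2))              # sum of absolute differences

print(euclidean, city_block)
```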
what is feature engineering?
feature engineering allows for additional improvements to the model - more predictor variables = more chances to understand the outcome - "predictor" = "feature" - you can create new predictor variables from the ones you already have
error
how wrong will my predictions be? - we need to know how close the predictions are to the real outcome
what is an f1 score?
used to summarize how well you can predict true/positive instances of an outcome - the harmonic mean of precision and recall - F1 score = 2 * (Precision * Recall) / (Precision + Recall)
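A quick sketch of the F1 formula with hypothetical precision and recall values:

```python
# Hypothetical precision and recall values.
precision, recall = 0.80, 0.60

f1 = 2 * (precision * recall) / (precision + recall)  # harmonic mean of the two
print(f1)
```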
Do we generally prefer higher or lower values for K, if time is not a concern?
larger K values are typically preferred: - better representation of the model's actual performance on new data (low bias) - more variability in performance estimates -- for classification, if K = n (leave-one-out), then your error on each iteration will be either perfect or the worst possible - takes more time to compute
what is an ROC curve?
a curve that quantifies how good your guesses are - plots performance in terms of the true positive rate vs. the false positive rate across different classification thresholds
How can you recategorize a categorical variable?
- 1,000s of zipcodes become regions (Northern, Southern, Midwestern, Western US) - Hundreds of school names become Private vs. Public - Hundreds of school names become a school quality rating: 1-100
How can you re-categorize ratio variables?
- Age in years becomes categories: 5 to 12 yrs = child, > 65 = retired
doesn't automatically know some neighbors are more important...
- By default, KNN gives the K closest neighbors the same vote in terms of what the final predicted value is - You might believe that certain neighbors are more important than others (closer neighbors to the unknown case should get more importance than further neighbors) - Although it is not the default option, you can weight the predictions by the relative distance of the different neighbors: called "distance weighting" - Common to specify that closer neighbors are more predictive than further neighbors
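As a hedged illustration (not part of the original notes), scikit-learn's KNeighborsClassifier exposes this choice through its weights argument; the training data below is hypothetical:

```python
from sklearn.neighbors import KNeighborsClassifier

X_train = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]]  # hypothetical training cases
y_train = ["A", "A", "B", "B"]

# Default: the k closest neighbors each get one equal vote.
knn_uniform = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X_train, y_train)

# Distance weighting: closer neighbors count more (votes weighted by 1/distance).
knn_weighted = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_train, y_train)

print(knn_uniform.predict([[1.2, 1.9]]), knn_weighted.predict([[1.2, 1.9]]))
```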
disadvantages of MAE
- all errors treated equally: you may be okay with being off a little, but NOT okay with being off a lot - not appropriate for asymmetric distributions: might prefer the median absolute error to better capture central tendency
What are two properties predictions should have to be useful?
- be close to the actual result (low error) - generalize to new situations (high generalizability)
What are some goals of classification problems?
- be right more often than we are wrong - be right about "yes" when we predict "yes" - be right about "no" when we predict "no" - identify all the "yes" outcomes that we encounter - identify all the "no" outcomes that we encounter
benefits of MAE
- clear interpretation: "when I make a prediction, on average, I'll be off by this much" - independent of sample size: can be compared across datasets with different sizes
What are recommended metrics to use for classification problems?
- do not use accuracy - use a measure that is consistent with your classification goals - - Matthews correlation coefficient: balanced - - F1 score: emphasize correct positive predictions - - AUC: take into account different prediction thresholds
easy to calculate/explain
- does not require knowledge of calculus to explain - intuitive appeal: --- birds of a feather flock together --- like goes with like
What is the Matthews Correlation Coefficient?
- how closely the predictions fall along the diagonal (TP and TN) of the confusion matrix vs. the off-diagonal, relative to a random distribution (similar to chi-square) - If the product of the top-left (TP) and bottom-right (TN) cells is larger than the product of the bottom-left and top-right cells, the Matthews Correlation Coefficient (MCC) will be a positive number - If the product of the top-left and bottom-right is smaller than the product of the top-right and bottom-left, the MCC is negative
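A small sketch of the MCC computed from hypothetical confusion-matrix counts (standard formula, not taken from the course slides):

```python
import math

# Hypothetical confusion-matrix counts.
tp, fn, fp, tn = 5, 2, 3, 20

numerator = tp * tn - fp * fn  # diagonal product minus off-diagonal product
denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
mcc = numerator / denominator if denominator != 0 else float("nan")  # undefined if any margin is 0
print(mcc)
```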
disadvantages of RMSE
- interpretation isn't as clear as MAE - sample size dependent (gets worse as n gets larger, even if the underlying model producing the outcome is the same) - because RMSE is sample size dependent, you cannot compare the RMSE of models that have different sample sizes
What is the largest value that K could be?
- k can be at most (n) - k can be at least 2
when evaluating the error of regression-type models use a measure of error...
- mean absolute error (MAE) - root mean-squared error (RMSE)
benefits of RMSE
- minimizes egregious errors (squaring the residual makes a bad error exponentially worse) - also ideal for examining performance when expecting small errors (squaring magnifies them, so they become more noticeable and the difference between models is more noticeable) - better metric to use for "loss functions": squaring the error makes it easier to calculate the gradient in optimization algorithms
requires large datasets...
- needs to find similar instances - needs training data that more or less matches any new cases you'll see - - with many predictors, it becomes harder to find similar training cases - - otherwise, the nearest instance will still be very far away
training is very fast...
- no actual training phase - just provide a sample of training cases with predictor variable values and the outcome value measured - when making a prediction, just compare the predictor values of the target to be predicted to each of the training samples' predictor values
works well on multi-class data and regression...
- not limited to only two outcome categories or even classification tasks - works on outcomes with many categories - works on numeric outcomes
What are some reasons why a predictive model will not work on data from new situations?
- overfitting - changes in time - changes in sample characteristics
what are some disadvantages of k-nearest neighbor?
- slow to make predictions - does not handle many predictors well - must standardize variables first - must have large datasets
what is k-fold cross validation?
- split our dataset into many (k) sections (folds), train on all but one of the sections, and test on the held-out section - by repeatedly splitting the data to train and test, you estimate the typical performance of the model on new data samples
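A short sketch of 10-fold cross-validation using scikit-learn's cross_val_score with a KNN classifier; the iris dataset is just a stand-in example:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Example dataset and model (hypothetical choices for illustration).
X, y = load_iris(return_X_y=True)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=10)

# Each value is the model's performance on one held-out fold;
# the average estimates typical performance on new data.
print(scores.mean())
```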
root mean-squared (RMSE)
- square-root of the average of the squared residuals - square the difference between each prediction and the outcome, add them all up, divide by the number of predictions, and take the square root of that number
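A brief sketch computing both MAE and RMSE from hypothetical predictions and outcomes:

```python
import math

# Hypothetical actual outcomes and model predictions.
actual = [3.0, 5.0, 2.5, 7.0]
predicted = [2.5, 5.0, 4.0, 8.0]

residuals = [a - p for a, p in zip(actual, predicted)]
mae = sum(abs(r) for r in residuals) / len(residuals)              # average absolute residual
rmse = math.sqrt(sum(r ** 2 for r in residuals) / len(residuals))  # square root of the average squared residual

print(mae, rmse)
```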
handles complex relationships between variables and outcome...
- there may be complex relationships between predictor variable and outcome - KNN can approximate any (complex) relationship
What are some goals of regression problems?
- to be as close on average to the actual value as possible - to treat a single big miss as worse than a few near misses - to balance overestimating with underestimating the actual value - to be close to the actual value consistently
advantages for k-nearest neighbors
- training is very fast - handles complex relationships between variables and outcome - easy to calculate/explain
What are three possible solutions to overfitting?
- try collecting new data and verify its generalizability - try penalizing our model -- keep only predictors/parameters that have a BIG effect at prediction - recommended solution: k-fold cross-validation
four limitations of feature engineering
- using the outcome variable - not using descriptive statistics appropriately - too much overlap with other predictor variables - not enough creativity / spending too little time
KNN does not work well with many predictor variables...
- with more dimensions, all points slowly start averaging out to be equally distant, and KNN is no longer predictive - keep the number of predictors limited and KNN will do well - exclude extraneous features when using KNN
What are good values to have for the Matthews Correlation Coefficient?
--> +1: Predictions are perfect (predictions fall along the TP and TN diagonal) --> -1: Not good (predictions are always the exact opposite of the truth) --> 0: Not good (equivalent to random guessing) --> Undefined: Not good (our model was never right about something)
EVALUATING PREDICTION ERRORS
CLASSIFICATION VS. REGRESSION PROBLEMS
CROSS VALIDATION
FEATURE ENGINEERING
What are some descriptive statistics that measure trends?
Describe the direction in which a set of predictor variables measured over time is trending/moving (slope, skewness, or average of items weighted by recency)
distance metrics
Distance metrics offer a way to compare the similarity of one object to another object when both were measured on the same things. A distance metric compares each specific trait to the corresponding trait on the other object.
What are some descriptive statistics that measure central tendency?
Give score based on mean/median/mode of a collection of other predictor variables
What are some descriptive statistics that measure extremes?
Give score based on minimum or maximum of a set of predictor variables
What are some descriptive statistics that measure variability?
Give score based on standard deviation or range of collection of predictor variable responses or how many unique responses are provided.
doesn't automatically know that some variables are more predictive than others...
- By default, if all variables are standardized/normalized, then they are treated as equally important in the distance calculation - However, if we have some idea of the relative importance of each variable, we can weight them -- multiply each standardized variable by the proportional amount --- example: if you have 4 variables (A, B, C, and D) and you want A to have twice as much weight as B, B to have twice as much weight as C, and C to have twice as much weight as D, multiply the variables as follows: A x 8, B x 4, C x 2, D x 1
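A small sketch of that weighting scheme with NumPy; the data matrix is hypothetical and assumed to be standardized already:

```python
import numpy as np

# Hypothetical, already-standardized columns A, B, C, D (100 cases).
X = np.random.randn(100, 4)

# A counts twice as much as B, B twice as much as C, C twice as much as D.
weights = np.array([8, 4, 2, 1])

X_weighted = X * weights  # the weighted columns then feed the distance calculation
```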
Why is accuracy not an ideal measure to use?
- Does not take into account the % of true occurrences that you correctly predicted - - I can be 99.999% accurate at predicting when earthquakes occur (simply guess "no earthquake" every day) - - In the above example, I was always right about no-earthquakes, but always wrong about earthquakes - Does not take into account the % of false occurrences that you missed - - I can be 85% accurate at predicting sunny weather (simply guess "sunny" every day) - - In the above example, for false outcomes ("rainy days"), I was always wrong
Describe the steps of the K-Nearest Neighbors Algorithm?
- For a given point, calculate the distance to all other points. - Given those distances, pick the k closest points. - Calculate the probability of each class label given those points. - The original point is classified as the class label with the largest probability ("votes").
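A minimal from-scratch sketch of those four steps (Euclidean distance, majority vote); the function name and data are hypothetical:

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, target, k=3):
    # 1. Distance from the target point to every training point.
    distances = [math.dist(target, x) for x in train_X]
    # 2. Indices of the k closest points.
    nearest = sorted(range(len(distances)), key=lambda i: distances[i])[:k]
    # 3-4. Class label with the most votes among those neighbors.
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

train_X = [[1, 1], [1, 2], [6, 6], [7, 7]]  # hypothetical training cases
train_y = ["A", "A", "B", "B"]
print(knn_predict(train_X, train_y, [2, 1], k=3))  # predicts "A"
```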
you must standardize the variables first in KNN...
- KNN uses a distance metric to compare the similarity between cases - when variables are not commensurate, we can standardize or normalize them - standardization makes the model treat all variables as equally important
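A quick sketch of z-score standardization with NumPy, using hypothetical height and income values on very different scales:

```python
import numpy as np

# Hypothetical height (cm) and income ($) for three cases.
X = np.array([[170.0, 65000.0],
              [160.0, 48000.0],
              [180.0, 52000.0]])

# Each column ends up with mean 0 and standard deviation 1,
# so both variables contribute equally to the distance calculation.
X_standardized = (X - X.mean(axis=0)) / X.std(axis=0)
```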
characteristics of city block distance
- Looks at absolute differences between each characteristic - Sensitive to scaling: all characteristics need to have similar means and standard deviations. Should transform variables into Z-scores first - Not scale invariant = converting cm to mm will produce different results - Gives equal emphasis to characteristics: Absolute difference is calculated between each characteristic - Matches and Non-matches both matter for similarity - Better for higher dimensional data than Euclidean - Describes number of mismatches if vector values are coded as 1s and 0s - Dampens large differences between characteristics: Because the differences are not squared, less susceptible to noise - Fast to compute
characteristics of euclidean distance
- Looks at squared differences between each characteristic - Sensitive to scaling (small scale = small differences; big scale = big differences) = Need all characteristics to have similar scale - Should transform variables into Z-scores first - Not scale invariant = converting cm to mm will produce different results - Gives equal emphasis to characteristics: Absolute difference is squared between each characteristic - Better for lower dimensional data: Because all characteristics are considered - Emphasizes large differences between characteristics: Because the differences are squared. Susceptible to noise and random variation.
When would you favor RMSE over MAE?
- MAE favors general accuracy at the expense of occasional wide misses - RMSE favors less overall accuracy as the price to reduce the likelihood/size of wide misses
How can you re-categorize a date?
- Monday, Tuesday, Wednesday, ... becomes Weekend vs. Weekday - 11:00 3:00 7:00 ... becomes morning, afternoon, evening - Separate time into before work hours and after work hours
What are four common ways that people engineer features?
- Recategorize - Dimension reduction - Apply descriptive statistic - Infer feature from other variables
how do you construct a confusion matrix?
- The top-left corner represents all the times something was actually true and we predicted it to be true (we were right about something happening) - - Example: We predicted something as True, and it turned out to be True 5 times - The bottom-right corner represents all the times something was actually false and we predicted it to be false (we were right about something not happening) - - Example: We predicted something as False, and it turned out to be False 20 times
slow to make predictions...
- When making a prediction about an unknown case, you provide it a set of predictor values from the target case - For each example in the training data, it has to compare the target case's predictor values to each example's values - The more training cases, the better the predictions, but the more comparisons that need to be made
what happens in high dimensionality?
KNN works better for problems where people's decisions are made on a small set of criteria - with many dimensions, all points start to look equally distant, so KNN loses predictive power
dichotomize
Make an interval/ratio variable into two categories: - Income becomes "rich" or "poor" - Age becomes "old" or "young"
When does K-Nearest Neighbors work better? Narrow and distinct predictor variables or Broad and Redundant predictor variables?
Narrow and distinct because with more dimensions, all points slowly start averaging out to be equally distant. KNN no longer predictive.
Should you dichotomize a naturally continuous variable? Why or why not?
No, Dichotomizing a variable that is normally continuous throws away information! - Treats people who are at different parts of the low-end as the same
How do you determine if you should keep an engineered feature in the model?
cross-validation determines whether the feature stays or is removed - if the model's cross-validated error improves with the feature, keep it in the model
examples of infer attributes
Use demographic tendencies: - People with these last names tend to be Hispanic - People with these heights and weights tend to be male; Use behavioral tendencies: - People with this activity level during the day and this level of education tend to be unemployed - People who have seen these movies tend to be female - People with this writing style tend to be educated; Match an entry with archival records: - This day was sunny and 90 degrees - This customer's LinkedIn says their educational background is a Bachelor's degree - This customer's zipcode has a median income of $50,000
what is a confusion matrix?
displays the types of predictions we made and how they actually turned out (all two-category classification models can describe their results using a "confusion matrix")
what is the difference between error & generalizability?
error = how wrong we are generalizability = how well the model works in unknown situations
What is K-Nearest Neighbors?
a classification and regression algorithm that makes a prediction based upon the closest data points.
What is R squared?
a statistical measure used to assess the goodness of fit of a regression model's predictions - compares the model's predicted line to the worst model (the mean of the observed responses of the outcome) - Describes the proportion of variation in the outcome that can be explained by the regression model - Ranges from 0% to 100%
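A short sketch of the R-squared calculation against the worst model (the mean of the outcome), using hypothetical values:

```python
import numpy as np

# Hypothetical actual outcomes and model predictions.
actual = np.array([3.0, 5.0, 2.5, 7.0])
predicted = np.array([2.8, 4.9, 3.0, 6.5])

ss_residual = np.sum((actual - predicted) ** 2)    # error left over after the model
ss_total = np.sum((actual - actual.mean()) ** 2)   # error from just guessing the mean
r_squared = 1 - ss_residual / ss_total             # proportion of variation explained
print(r_squared)
```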
What is a descriptive statistic?
a way of summarizing a characteristic of a series of data with a single number - You can describe a collection of predictor variables with a single number such as: --> the mean --> the standard deviation --> the maximum value --> the autocorrelation --> the skewness --> the kurtosis --> etc.
how do you calculate accuracy?
accuracy: number of correct predictions made divided by the total number of predictions made - number of predictions we made that were right / number of predictions made
what is overfitting?
adding predictors/rules/expectations that try to explain more of the data, but actually capture random error or noise instead of the underlying relationship
what is the area under the ROC curve?
aggregate measure of performance across all possible classification thresholds.
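As a hedged example, scikit-learn's roc_auc_score computes this directly from true labels and predicted probabilities; the values below are hypothetical:

```python
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.9, 0.2]

print(roc_auc_score(y_true, y_score))  # 1.0 = perfect ranking, 0.5 = random guessing
```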
mean absolute error (MAE)
average of absolute value of residuals - Take the absolute difference between the real outcome value and each predicted value, add them all up, and divide by the number of predictions made - Interpretation: "On average, this is how off my predictions were from the true values"
when would you want to use city block distance
better for high dimensional data
when would you want to use euclidean distance
better for low dimensional data: because all characteristics are considered
How do you add more importance to a given feature/predictor variable?
multiplying the normalized variables by numbers that indicate their relative importance. - If one variable should be twice as important as every other variable, then multiply the normalized variable by 2
classification model
predicting a nominal/categorical outcome (A, B, or C) - Predict purchase (Buy or Not) - Predict species (Dog, Cat, Frog, or Pig) - Predict outcome (Win or Lose)
regression model
predicting an interval or ratio outcome - Predict sales (in dollars) - Predict number of users (in counts) - Predict length of life (in years)
how would you calculate recall from data for both yes and no outcomes?
Recall for "yes" outcomes = TP / (TP + FN); recall for "no" outcomes = TN / (TN + FP)
What is a commonly used value for K?
recommended k = 10 - Shown empirically to yield test error rate estimates that approximate the true generalizability (low bias) and is fairly consistent (low variability) - May need to increase K when total sample size is small
what are the types of predictive modeling problems ?
regression & classification
how do we give all predictors equal importance?
standardize all variables
Why does the type of problem matter? (What does it determine?)
the type of problem (classification or regression) determines: - Which predictive modeling algorithm you choose - What metric you use to evaluate your model - How many observations/people you need for your study - How fast will your predictive model find an optimal solution
Recall
what proportion of the actual true occurrences were predicted as true - recall = TP / (TP + FN) - high recall = when you encounter an unknown case, that is actually a true occurrence, you will identify it
precision
what proportion of the true predictions were actually true - precision = TP / (TP + FP) - high precision = when you make a true prediction, people better believe you
generalizability
what will happen in a new situation? - we need to know how the model would perform on data we have yet to see (if we can perfectly predict what will happen for events that have yet to occur, that is useful)
What is recategorization?
when you group certain values into a cohesive category - typically accomplished by: setting cutoffs b/t ranges of values, assigning anyone in certain categories to another category
What are some ways to infer attributes? Provide examples of those?
you can infer people's attributes based on the combination of responses provided. - Works only if you combine the information from multiple variables or from a nominal meta variable (zipcode, license#)
What are reasons that lead people to feature engineer?
- you had to limit your initial data collection due to constraints -- can only ask so many questions before people stop providing data -- money/time limits the number of variables that can be collected - you may be working with an already-collected dataset -- can't go back and ask for more variables from people -- very common occurrence in industry - it doesn't hurt and might help decrease prediction error -- more predictor variables = more chances at explaining the outcome -- cross-validation will reveal if the added predictors are useful or can be removed