Data Mining Midterm

Correct Rate

(# of records in the decile with the characteristic of interest)/(# of records in the decile)

Base Rate

(# of records in the entire sample with the characteristic of interest)/(# of records in the sample)

Positively-skewed

(Right-skewed) - large outliers on the high end drag the mean up; the long tail points to the right

Negatively-skewed distribution

(Left-skewed) - small outliers on the low end drag the mean down; the long tail points to the left

Data Prep: Detecting Outliers

An outlier is an observation that is extreme and can have a disproportionate influence on models. A common rule flags data that are 3 standard deviations or more beyond the mean; under a normal distribution only about 0.3% of values (roughly 3 in 1,000) fall that far from the mean.

Cutoff Tables

A table showing each record's actual class alongside its predicted probability of being a "1." The cutoff is the probability above which a record is classified as a "1"; traditionally the cutoff level is 0.5.

False negative rate

1 - sensitivity

False positive rate

1 - specificity

Anatomy of Tukey Box Plot

1. Arrange the values from smallest to largest.
2. Calculate the median - the center-most value of the data.
3. Calculate Q1 and Q3 (Q1: 25% of the data is less than this mark; Q3: 25% of the data is greater than this mark).
4. Calculate the outlier thresholds.
5. The box plot's whiskers extend to the last non-outlier data point on the low side and on the high side. (See the sketch below.)
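
A minimal sketch of these steps in Python (the data values are made up; numpy's percentile is one common way to get Q1/Q3, though textbooks differ slightly on quartile interpolation):

    import numpy as np

    data = np.array([2, 4, 5, 7, 8, 9, 11, 30])        # hypothetical sample
    q1, median, q3 = np.percentile(data, [25, 50, 75])
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # whiskers stop at the last non-outlier points, not at the fences themselves
    whisker_low = data[data >= low_fence].min()          # 2
    whisker_high = data[data <= high_fence].max()        # 11
    outliers = data[(data < low_fence) | (data > high_fence)]  # [30]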

Uniform distribution

A distribution where all the data bars are at similar levels (no pronounced peak)

Bi-modal distribution

A distribution with two peaks

R^2

A measure of the proportion of the variation in y that is explained by the model (the x variables). R^2 is not as reliable a measure because it has blind spots. A better solution is to use validation data and calculate MAE, RMSE, and MAPE.

Naive Rule

Classify all records as belonging to the most prevalent class. This is often used as the benchmark, and we hope we can do better than the naive rule. Ex. 75% of a class are women and 25% are men. The naive rule says to classify everyone as a woman; doing so, we would be wrong only 25% of the time. The goal, then, is to create a model that outperforms the naive rule. The naive rule is NOT the same as the base rate.

Decile Lift

Decile Lift = Correct Rate / Base Rate. This can be calculated for a single decile, or for several deciles. Remember a decile is 10% of the data. Thus, if we have a sample of 40, then the first decile is the first 4 records.
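
A hedged sketch tying Correct Rate, Base Rate, and Decile Lift together (the records and their sort order are made up for illustration):

    # 20 records sorted by predicted probability, highest first; 1 = class of interest
    actuals_sorted = [1, 1, 1, 0, 1, 0, 1, 0, 0, 1,
                      0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
    first_decile = actuals_sorted[:len(actuals_sorted) // 10]  # top 10% = 2 records
    correct_rate = sum(first_decile) / len(first_decile)       # 2/2 = 1.0
    base_rate = sum(actuals_sorted) / len(actuals_sorted)      # 7/20 = 0.35
    decile_lift = correct_rate / base_rate                     # ~2.86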

How do you calculate variance and standard deviation given a data set?

Find the mean of the entire data set. Then subtract the mean from each value; the sum of these differences should be zero. To find the variance, square each difference, sum the squared differences, and divide that sum by the number of observations. To find the standard deviation, take the square root of the variance.
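
A short worked sketch of these steps (the values are made up; note that some texts divide by n - 1 for a sample rather than by n):

    data = [4, 8, 6, 5, 7]                             # hypothetical values
    mean = sum(data) / len(data)                       # 6.0
    diffs = [x - mean for x in data]                   # -2, 2, 0, -1, 1 (sum to zero)
    variance = sum(d ** 2 for d in diffs) / len(data)  # 10 / 5 = 2.0
    std_dev = variance ** 0.5                          # ~1.41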

How do you calculate the Euclidean distance for the KNN model?

First calculate the delta (difference) between the x and y coordinates of the two records (x of 1st record - x of 2nd record, etc.). Then, square those differences. The last step is to sum the squared differences and then take the square root to get the final answer. (See slides for further information)
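
A minimal sketch for two records with (x, y) coordinates (the points are made up):

    from math import sqrt

    rec1 = (2.0, 3.0)   # hypothetical records
    rec2 = (5.0, 7.0)
    distance = sqrt((rec1[0] - rec2[0]) ** 2 + (rec1[1] - rec2[1]) ** 2)  # 5.0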

How do you calculate the outlier thresholds?

Multiply the IQR by 1.5. This number is then added to Q3 and subtracted from Q1 to give the outlier thresholds. The whiskers are drawn at the last data points not beyond the thresholds - NO LINES appear at the Q1/Q3 +/- 1.5*IQR points themselves.
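
For example, if Q1 = 10 and Q3 = 20, then IQR = 10 and 1.5 * IQR = 15, so the thresholds are 10 - 15 = -5 and 20 + 15 = 35. Any value below -5 or above 35 is an outlier, and the whiskers stop at the last data points inside those thresholds.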

What is the effect on the cutoff point when one class is more important?

In many cases it is more important to identify members of one class (e.g., spoiled food, fraudulent transactions). This means that in some cases we are willing to tolerate GREATER OVERALL ERROR in return for BETTER IDENTIFYING THE IMPORTANT CLASS.

IQR

Interquartile range. This is calculated as Q3 - Q1.

What is the calculation and connection of logit, odds, and probability?

Odds = e^z; Probability = Odds/(1 + Odds); z = ln(odds) is the logit. Probability can also be calculated directly as p = 1/(1 + e^-z). Odds can also be calculated from probability as odds = p/(1 - p).
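
A small sketch of the round trip between probability, odds, and logit (the starting value is arbitrary):

    from math import exp, log

    p = 0.8                               # probability of the class of interest
    odds = p / (1 - p)                    # 4.0
    z = log(odds)                         # logit, ~1.386
    odds_back = exp(z)                    # e^z recovers the odds, 4.0
    p_back = odds_back / (1 + odds_back)  # back to probability, 0.8
    p_direct = 1 / (1 + exp(-z))          # logistic response formula, also 0.8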

Low K vs. High K

Low values of k (1, 2, 3, ...) capture local structure but may not represent the overall pattern, because they can be distracted by noise. Large values of k provide more smoothing and less noise, but may miss local structure.

MLR

Multiple Linear Regression. Numeric outcome variable only. Handles categorical input variables with dummy variables. Very fast to fit a model and easy to understand. Works well when relationships are approximately linear; does not work well when relationships are nonlinear.

Logistic Regression

For CATEGORICAL (class) OUTCOME variables only; the input variables may be numeric or categorical.

Logistic Response Formula

P = f(z) = 1/(1 + e^-z)

Evaluating Predictive Performance: Different Measures

R^2, Average Error, MAE, RMSE, MAPE

Classification Matrix

See graphic on classification matrix. The biggest thing to understand here is that the ROW totals are the total actual 1s or 0s, and the COLUMN totals are the total predicted 1s or 0s.

Lift Chart

Shows how well the classifier does in identifying the class of interest

MAE (Mean Absolute Error)

Take the absolute value of the differences between the actual and predicted values. Then add up those absolute differences and divide by the total number of observations. That is the MAE.

Specificity

The % of the 0's classified correctly

Sensitivity

The % of the 1's (class of interest) classified correctly
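
A hedged sketch applying a 0.5 cutoff and computing the measures from the last few cards (the actuals and probabilities are made up):

    actual = [1, 1, 1, 0, 0, 0, 0, 1]                # hypothetical actual classes
    prob = [0.9, 0.7, 0.4, 0.6, 0.2, 0.1, 0.3, 0.8]  # predicted P("1")
    cutoff = 0.5
    pred = [1 if p >= cutoff else 0 for p in prob]

    tp = sum(1 for a, p in zip(actual, pred) if a == 1 and p == 1)  # 3
    fn = sum(1 for a, p in zip(actual, pred) if a == 1 and p == 0)  # 1
    tn = sum(1 for a, p in zip(actual, pred) if a == 0 and p == 0)  # 3
    fp = sum(1 for a, p in zip(actual, pred) if a == 0 and p == 1)  # 1

    sensitivity = tp / (tp + fn)             # % of 1's correct: 0.75
    specificity = tn / (tn + fp)             # % of 0's correct: 0.75
    false_negative_rate = 1 - sensitivity    # 0.25
    false_positive_rate = 1 - specificity    # 0.25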

Avg Error

The difference between the actual value and the predicted value. Sum those differences and divide by the total number of observations; this is the average error. (Some of the differences may be negative; this is okay - positive and negative errors can cancel out.)

MAPE (Mean Absolute Percentage Error)

The first step is a new calculation: the absolute value of the difference divided by the actual value - abs(difference/actual). The next step is to multiply this value by 100 to get a percentage. Then, sum those percentages and divide by the total number of observations to get the mean absolute percentage error. (See graphic for more explanation)

How do we adjust for oversampling?

The predicted classes never change. However, the true negatives (0's) need to be increased to reflect a population with the correct proportions. First, find how many total observations would be needed to yield the current number of positives. If the sample has 500 positives and only 2% of the actual population are positives, then 500 = 0.02 x population, so population = 25,000. Thus, 500 records would be actual 1's and 24,500 would be actual 0's. Then, using the ratios of the actual-0 row, multiply each percentage by 24,500 to get the new row counts for the 0's. Note: the top (actual-1) row never changes; we are adjusting the bottom row to match the population proportions. (See the sketch below.)
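
A sketch of this card's worked example (the 30%/70% split of the actual-0 row is an assumed, made-up ratio):

    positives = 500                       # actual 1's in the oversampled results
    true_rate = 0.02                      # class of interest is 2% of the population
    population = positives / true_rate    # 500 = 0.02 x population -> 25,000
    negatives = population - positives    # 24,500 actual 0's needed

    # reweight the actual-0 row by its observed proportions (hypothetical 30%/70%)
    row0_pred_1 = 0.30 * negatives        # 7,350
    row0_pred_0 = 0.70 * negatives        # 17,150
    # the actual-1 row (the 500 positives) is left unchanged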

Base Rate

The proportion of the class of interest

CART

This is a decision tree. It can handle either categorical or numeric outcome variables: a numeric outcome gives a regression tree and a categorical outcome gives a classification tree. The tree must be pruned back to avoid overfitting.

RMSE (Root Mean Squared Error)

This is the square root of the average of the squared differences. To calculate it, take each difference between the actual and predicted values and square it. Then, sum those squared differences and divide by the number of observations. Finally, take the square root of that number to get the RMSE. (See graphic for more explanation)
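
One short sketch covering Average Error, MAE, RMSE, and MAPE together (the actual and predicted values are made up):

    from math import sqrt

    actual = [100.0, 150.0, 200.0, 250.0]
    predicted = [110.0, 140.0, 210.0, 240.0]
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]   # signed differences

    avg_error = sum(errors) / n                           # 0.0 (signs cancel here)
    mae = sum(abs(e) for e in errors) / n                 # 10.0
    rmse = sqrt(sum(e ** 2 for e in errors) / n)          # 10.0
    mape = sum(abs(e) / a for e, a in zip(errors, actual)) / n * 100  # ~6.4%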

Oversampling

This is when we deliberately collect more of the rare cases (oversample) in order to give the model more information to work with when the class of interest is rare. Ex. Assume the real population is 2% the class of interest and 98% the majority negative class. In the model sample, however, we use 50% positives and 50% negatives - that is oversampling.

What method is used with non-normal distributions?

Tukey's box-plot method.

Odds Ratio

One odds divided by another odds. Ex. If the odds of a heart attack with training are 0.5 and the odds of a heart attack without training are 1.75, then 1.75/0.5 = 3.5 is the odds ratio, meaning people without training have 3.5 times the odds of a heart attack.

KNN

k-Nearest Neighbors. Either categorical or numeric outcome variables. Categorical variables are input directly as such; no dummy variables. This is a very slow method, because all the work happens when a prediction is made. It is DATA driven, not model driven. It uses a distance measure to calculate how close a new record is to an existing set of instances.
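
A minimal k = 3 classification sketch using the Euclidean distance idea above (the training points and the new record are made up):

    from math import sqrt

    train = [(1, 1, 0), (2, 1, 0), (4, 5, 1), (5, 4, 1), (5, 5, 1)]  # (x, y, class)
    new = (4, 4)
    k = 3

    # sort training records by distance to the new record, keep the k nearest
    nearest = sorted(train, key=lambda r: sqrt((r[0] - new[0]) ** 2 +
                                               (r[1] - new[1]) ** 2))[:k]
    # majority vote among the k nearest neighbors
    prediction = 1 if sum(r[2] for r in nearest) > k / 2 else 0  # 1 here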

