BA 305 EXAM 1
multicollinearity
a condition where some of the predictor variables are correlated with each other, which makes the estimated solution unstable
Principal Component Analysis
a descriptive approach involving projections
PCA algorithm
the components are the eigenvectors of the covariance (or correlation) matrix of the variables, and each eigenvalue measures the strength (variance explained) of its component
difference between data summarization and reduction
data reduction squeezes the matrix from the sides, reducing the number of columns, where data summarization reduces the number of rows
how to deal with outliers
delete or replace the instance
Ordinal Scale
differentiate and then order objects (ex. Education levels like primary, secondary, undergraduate, graduate, etc)
Interval Scale
differentiated, ordered levels with consistent differences between levels
fall out
false positive rate: false positives divided by the total number of negatives (false positives divided by true negative and false positives) ***type 1 error***
overfitting
the generality of findings is hindered because new data does not behave the same way as the training data - the model learns so much from the training data that it also learns the random noise in it
parsimony
high-dimensional data violates this principle - want to keep the number of variables to a size that can be easily interpreted - want to be efficient about the variables you use so the data is easier to understand while maintaining data integrity
specificity
how many negatives can the model identify (true negatives divided by total negatives (false positives plus true negatives))
sensitivity
how many positives can the model identify (true positives predicted divided by all positives (False negative plus true positives))
how should one deal with missing data
identify the patterns and relationships underlying the missing data in order to maintain as close a result as possible to the original distribution of values
common approaches to classification
logistic regression, nearest neighbor, naive bayes, decision trees, artificial neural networks
precision
out of all the positive calls the model makes, what proportion are correct (true positives / (true positives + false positives))
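The four rates defined in these cards (sensitivity, specificity, fall-out, precision) all come from the same confusion-matrix counts. A minimal sketch, assuming hypothetical counts chosen only for illustration; the function name `classification_rates` is not from the course material.

```python
def classification_rates(tp, fp, tn, fn):
    """Return the four rates computed from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true positives / all actual positives
        "specificity": tn / (tn + fp),  # true negatives / all actual negatives
        "fall_out": fp / (fp + tn),     # false positive rate = 1 - specificity
        "precision": tp / (tp + fp),    # correct share of all positive calls
    }

# Hypothetical counts, for illustration only
rates = classification_rates(tp=40, fp=10, tn=90, fn=20)
print(rates)
```

Note that fall-out and specificity always sum to 1, since both divide by the total number of actual negatives.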
objective of principal component analysis
reduce the set of numerical variables (columns) by removing any overlap of information between variables, so that the result is fewer variables with more efficient values (aka they contain most of the necessary information)
what is the effect of missing values
reduces the sample size for analysis which can distort results
density plot
a smoothed-out version of a histogram for a single variable; it can be extended to two dimensions and displayed as a heat map
What condition requires more components
when there is heterogeneity among sample subgroups
Methods of data transformation
- Normalization - Aggregation - Data reduction - dummy variables
Scree Test Criterion
Components before inflection point when graphing the percent of variance explained
purpose of data transformations
Do so to correct violations of the statistical assumptions underlying the multivariate technique. Also do so to improve the relationship/correlation between the variables
Analytics Lifecycle
Raw Data → Data Processing → Clean, Structured Data → Exploratory Data Analysis → Insights, Reports, Visual Graphs; Models and Algorithms → Data-driven products
model performance for prediction/numerical value
**NOT THE SAME AS A GOODNESS OF FIT TEST** - we want to know how well the model predicts on NEW DATA - measure error via mean absolute error or average error - generally, want to measure the absolute difference between the actual y and the predicted y
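The two error measures named above can be sketched as follows. The actual and predicted values are made up for illustration, and the function names are assumptions rather than anything from the course.

```python
def average_error(actual, predicted):
    """Mean of signed errors; positive and negative errors can cancel out."""
    errors = [a - p for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)

def mean_absolute_error(actual, predicted):
    """Mean of absolute errors; measures the typical size of a miss."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical test-set values, for illustration only
actual = [10.0, 12.0, 9.0, 15.0]
predicted = [11.0, 10.0, 9.5, 14.0]

avg_err = average_error(actual, predicted)
mae = mean_absolute_error(actual, predicted)
print(avg_err, mae)
```

Average error shows whether the model over- or under-predicts on balance, while mean absolute error shows how far off it is regardless of direction.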
Best practices for data visualization
- Color is bad for quantitative data because color is a nominal scale - however, color is great for visualizing nominal data - Don't go 3D, it just looks gaudy - Don't use trend lines for categorical data (like gender) because they assume connections between levels - Don't use trend lines except for continuous variables - Watch out for aspect ratios -> banking to 45: choose the aspect ratio so the average absolute angle of the lines is 45 degrees - Don't break the y-axis scale (it misrepresents the data, especially because people don't commonly read the actual numbers) - Get rid of "chartjunk" - use ink as a precious resource
Combination of tools that make up business analytics
- Descriptive: what happened? (how many customers, what was my profit, etc) (more information based) - Diagnostic: why did it happen? - Predictive: what will happen in the future? - Prescriptive: how can we make it happen ("optimally")? (helps you make decisions going forward)
general: how do you evaluate a model?
- Objectively/randomly partition the data into training and testing parts - use training data for in-sample evaluation: how well does my model "fit" the data - use testing data for out-of-sample evaluation: model predictive power on new data - basically, build a model using the training set and evaluate it on the testing set (test data cannot be used at all in the training process)
When do you need to use "repeated hold out?"
- When data is smaller: repeat the process with different subsamples and take the average error rates on different iterations to yield an overall error rate - randomization works better on larger data sets, so with a smaller data set, there is a higher chance that the data will be unbalanced (so you need more random subgroups)
Naive bayes
- a classification algorithm (similar to k-NN) but conceptually different
model performance for classification
- an error would be classifying a record as belonging to one class when it belongs to another - error rate is the percent of misclassified records out of the total records - start with the naive rule: classify all records as belonging to the most prevalent class; use this as a benchmark, where the goal is to do better than this - can evaluate the model based on separation of records: a high separation of records means that the model attains low error, but a low separation of records means that the model does not improve much on the naive rule
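The naive-rule benchmark can be computed directly from the class labels. A minimal sketch, assuming a hypothetical 90/10 class split; `naive_rule_error` is an illustrative name.

```python
from collections import Counter

def naive_rule_error(labels):
    """Classify every record as the most prevalent class; return that class and the error rate."""
    majority_label, count = Counter(labels).most_common(1)[0]
    return majority_label, 1 - count / len(labels)

# Hypothetical labels: 90 "no" records and 10 "yes" records
labels = ["no"] * 90 + ["yes"] * 10
majority, benchmark_error = naive_rule_error(labels)
print(majority, benchmark_error)
```

Any model whose error rate on these records is not below this benchmark has not improved on simply guessing the majority class.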
Aggregation
- combining two or more attributes (or objects) into a single attribute (or object) - typically more stable data because has less variability - can have levels of which objects can be aggregated into (like how cities can be aggregated into regions, states, countries, etc)
process of nearest neighbor algorithm
- compute distance to other training records, identify k nearest neighbors using these distances, use class labels of the nearest neighbors to determine the class label of the unknown record
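The three steps above (compute distances, pick the k nearest, vote) can be sketched in plain Python. This is an illustrative sketch using Euclidean distance and unweighted majority voting on made-up points, not a production implementation.

```python
import math
from collections import Counter

def knn_classify(train_X, train_y, query, k=3):
    # Step 1: distance from the unknown record to every training record
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    # Step 2: keep the k training records with the smallest distances
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: the majority class label among those neighbors wins
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training data: two well-separated groups
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["A", "A", "A", "B", "B", "B"]

predicted = knn_classify(train_X, train_y, query=(1.1, 1.0), k=3)
print(predicted)
```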
How dimension reduction helps
- creates independent/uncorrelated/orthogonal input useful for regression, cluster analysis, classification, and other methods - reduces the amount of time and memory required by data mining algorithms
why does k-NN not work well for large datasets?
- due to curse of dimensionality - expected distance to nearest neighbor increases with dimensions (as when there are many dimensions, all records end up "far away" from each other) - can try reducing dimension using PCA - or just for large datasets, use a different method
Phases of preparation
- graph the data to look at the shape (univariate) and/or relationships of your data (multivariate) - identify and evaluate missing values - identify and deal with outliers
Why do we use predictive analytics
- help classify/predict out-of-sample data points (with tools like linear regressions) - use what we learn to take action
K-fold cross validation
- like an extension of the hold-out method - instead of doing 1 split, split the data into k subgroups, then train the model on (k-1) subsets and use the remaining subset to test it - repeat this process k times so each subset is the test data exactly once - take an average of the error estimates - in Python, use cross_val_score from sklearn.model_selection
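A short sketch of k-fold cross validation with scikit-learn's cross_val_score. The dataset (iris) and the model (a 5-nearest-neighbor classifier) are assumptions chosen only to make the example self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Example data and model, chosen only for illustration
X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=5)

# cv=5 splits the data into 5 folds; each fold serves as the test set once.
# For a classifier, the default score reported per fold is accuracy.
scores = cross_val_score(model, X, y, cv=5)

mean_error = 1 - scores.mean()  # average error rate across the 5 folds
print(scores, mean_error)
```

Averaging over folds gives a steadier error estimate than a single split, at the cost of fitting the model k times.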
low values of k versus high values of k (k-NN)
- low: captures the local structure of the data but also the noise - high: provides more smoothing and less noise but may miss the local structure
Sampling
- main technique employed for data selection, often used for both preliminary investigation of the data and the final data analysis - Used in data mining because processing the entire set of data of interest is too expensive or time consuming
types of rescaling
- min-max normalization: rescale the variables to a range of [0,1] - standardization (via z-score: how many standard deviations away from the mean): all variables have a mean of 0 and variances of 1
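Both rescaling methods can be sketched with the standard library. The income values are hypothetical, and the use of the population standard deviation is an assumption (the sample standard deviation is another common choice).

```python
import statistics

def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Express each value as a number of standard deviations from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population SD, an assumption of this sketch
    return [(v - mean) / sd for v in values]

# Hypothetical values, for illustration only
incomes = [30.0, 45.0, 60.0, 90.0]
scaled = min_max(incomes)
standardized = z_score(incomes)
print(scaled)
print(standardized)
```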
Why do we evaluate models?
- model deployment is incredibly expensive (in terms of time and money) so we want to evaluate before money is spent to scale it up to deploy - multiple models can be used to classify or predict (and may have multiple settings as well) so to choose the best model, we need ways to compare each model's performance
Classification
- most common datamining task - a SUPERVISED method that includes two or more classes for the categorical target variable - the algorithm examines relationships between the values of the predictor/input fields and target/output values - ex: banking - determine whether a mortgage application is "good"
Box Plot
- not visually pleasing but is a helpful tool because it contains a lot of different types of information - the box shows the 25th to 75th percentiles (so it contains the middle 50% of the data) - the belt is the median (tells us how symmetrical the data is) - a median line shifted up or down tells us if the data is skewed - whiskers are the soft minimum and maximum (conventionally 1.5 * IQR beyond the box edges) - outliers will be above or below the whiskers - side-by-side boxplots are useful for comparing subgroups
Receiver Operating Characteristic Curve
- plots the true positive rate against the false positive rate at all possible cutoff values (where the diagonal line reflects TPR = FPR, which is random classification) - compare models by comparing the area under the ROC curve
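To make the curve concrete, the sketch below computes one (false positive rate, true positive rate) point per cutoff and then the area under the curve with the trapezoid rule. The labels and scores are made up, and the function names are illustrative.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per cutoff, from the strictest cutoff to the loosest."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # a cutoff above every score classifies nothing as positive
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= cutoff and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= cutoff and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical labels (1 = positive) and predicted scores
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(y_true, scores)
area = auc(curve)
print(curve, area)
```

An area of 0.5 corresponds to the diagonal (random classification) and an area of 1 to a perfect classifier.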
How does principal component analysis work
- rely on correlations to create new variables that are linear combinations of the original variables - these linear combinations are uncorrelated (no information overlap) and only a few of them contain most of the original information - new variables are called principal components
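The idea can be sketched with NumPy: standardize the variables, take the eigenvectors of their correlation matrix, and project the data onto them. The data below is synthetic (one column is built to correlate with another), and this is an illustrative sketch rather than the exact procedure used in the course.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 rows, 3 columns; column 1 is built to correlate with column 0
x0 = rng.normal(size=200)
X = np.column_stack([x0, 0.9 * x0 + 0.1 * rng.normal(size=200), rng.normal(size=200)])

Z = (X - X.mean(axis=0)) / X.std(axis=0)      # standardize each column
corr = np.corrcoef(Z, rowvar=False)           # correlation matrix of the variables

eigenvalues, eigenvectors = np.linalg.eigh(corr)
order = np.argsort(eigenvalues)[::-1]         # sort so the largest eigenvalue comes first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

scores = Z @ eigenvectors                     # principal components (linear combinations)
explained = eigenvalues / eigenvalues.sum()   # share of total variance per component

print(explained)
```

Because the two correlated columns share most of their information, the first component should absorb the bulk of their variance, and the resulting components are uncorrelated with each other.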
advantages of k-NN
- simple because has no assumptions - handles non-linear decision boundaries -effective at capturing complex interaction among variables without having to define a statistical model
types of voting during classification (k-NN)
- simple unweighted voting: decide on the value for k to determine the number of similar records that can "vote", compare each unclassified record to its k nearest (most similar) neighbors according to the Euclidean distance function; each of the k similar records gets one vote - weighted voting: closer neighbors have a larger voice in the classification decision than more distant neighbors
reasoning behind curse of dimensionality
- sparsity increases: definitions of density and distance between points, which are critical for models like clustering and outlier detection, become less meaningful - getting more signals but also getting more noise --> signal gets lost in the noise
Interpretation of component matrix
- a squared component weight (loading) is the share of that variable's variance explained by the component - each eigenvalue (the sum of a component's squared loadings) represents the magnitude of the component, i.e. how much of the total variability it explains
Component weights/loadings
- tells us how original variables and the components/factors relate - gives the correlation of each component with each of the original variables
component scores
- the resulting data from dimension reduction - can be used for subsequent analysis (like regression, cluster analysis, classification, etc) - helps avoid complications caused by multicollinearity - however, interpretation can be more difficult because all variables contribute through loadings
shortcomings of k-NN
- there is no model so it's not interpretable - requires some trial-and-error tinkering with parameters (like the value of k and cutoffs) - can fail for large datasets - also not memory efficient (slow for large datasets)
Why do you examine your data before analysis?
- to examine the suitability of the data to make the proper adjustments
Hold-out method
- split the data into two independent sets (ex: training set is about 80%, testing set is about 20% of the original) - in Python, use train_test_split from sklearn.model_selection
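A minimal hold-out sketch with scikit-learn's train_test_split, using the 80/20 split mentioned above. The dataset (iris), the model (5-nearest neighbors), and the random seed are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Example data, chosen only for illustration
X, y = load_iris(return_X_y=True)

# Randomly hold out 20% of the records for testing; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit on the training set only, then evaluate on the untouched test set
model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
test_error = 1 - model.score(X_test, y_test)
print(len(X_train), len(X_test), test_error)
```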
cost benefit matrix
- use when the outcome of rating something as positive or negative and true/false has a cost benefit associated with it (like where a false diagnosis is a negative cost) - the best cutoff maximizes net benefit
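One way to pick a cutoff with a cost-benefit matrix is to compute the net benefit at each candidate cutoff and keep the best one. The dollar values, labels, and probabilities below are all hypothetical, and a true negative is assumed to have zero cost or benefit.

```python
def net_benefit(y_true, probs, cutoff, benefit_tp, cost_fp, cost_fn):
    """Total benefit minus total cost when classifying at the given cutoff."""
    total = 0.0
    for y, p in zip(y_true, probs):
        predicted = 1 if p >= cutoff else 0
        if predicted == 1 and y == 1:
            total += benefit_tp   # correctly flagged positive
        elif predicted == 1 and y == 0:
            total -= cost_fp      # false alarm
        elif predicted == 0 and y == 1:
            total -= cost_fn      # missed positive
        # true negatives are assumed to carry no cost or benefit
    return total

# Hypothetical actual classes and predicted probabilities
y_true = [1, 0, 1, 0, 0, 1]
probs = [0.9, 0.6, 0.55, 0.3, 0.2, 0.35]

cutoffs = [0.3, 0.5, 0.7]
benefits = {
    c: net_benefit(y_true, probs, c, benefit_tp=100, cost_fp=20, cost_fn=50)
    for c in cutoffs
}
best_cutoff = max(benefits, key=benefits.get)
print(benefits, best_cutoff)
```

With these made-up values a missed positive costs more than a false alarm, so the lowest cutoff comes out ahead.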
"leave-one-out"
- version of k-fold cross validation - for smaller sized data, k-fold cv where k=n (so the number of subsets is the number of data points) - a more efficient use of smaller data sets
why do you visualize your data first?
- want to understand your data as much as possible before you analyze it - a lot of statistical models need certain assumptions verified (like linearity) which visualizations can confirm
Output of PCA
-a component matrix with the correlation values between the variables and the components - there are as many components as there are dimensions in the data
what happens when you change the cutoff values for classifications
-changes the confusion matrix - isn't rerunning the model, just reclassifying the same output of the same model
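The sketch below reclassifies the same predicted probabilities at two cutoffs, which changes the confusion matrix without refitting anything. The labels and probabilities are hypothetical.

```python
def confusion_matrix(y_true, probs, cutoff):
    """Count TP, FP, TN, FN when records with probability >= cutoff are called positive."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for y, p in zip(y_true, probs):
        predicted = 1 if p >= cutoff else 0
        if predicted == 1:
            counts["tp" if y == 1 else "fp"] += 1
        else:
            counts["fn" if y == 1 else "tn"] += 1
    return counts

# Hypothetical model output: the probabilities stay fixed, only the cutoff moves
y_true = [1, 0, 1, 0, 0, 1]
probs = [0.9, 0.6, 0.55, 0.3, 0.2, 0.35]

at_half = confusion_matrix(y_true, probs, cutoff=0.5)
at_third = confusion_matrix(y_true, probs, cutoff=0.33)
print(at_half)
print(at_third)
```

Lowering the cutoff here turns the one false negative into a true positive, raising sensitivity while the false positive count stays the same.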
characteristics of k-Nearest neighbors
-data driven, not model driven - makes no assumptions about the data -simple goal is to classify a record like similar records
how do we measure nearby? (k-NN)
-for metric values, use Euclidean distance (may need to normalize the values using min-max or z-score) - for categorical values, can't use Euclidean; use a "different" function where 0 means the same and 1 means otherwise
Objectives of data reduction/dimensionality
-helps to understand the underlying structure (are some variables similar to each other, and if so, can we somehow group them together?) - shrinks size, while preserving the important information, for more efficient subsequent analysis
measuring of error in model evaluation (generally)
-prediction (how well does it predict the correct numerical value): when dealing with numerical values - membership in a class/classification (how well does it predict which class it's in): when dealing with categorical data - propensity (probability of belonging to a class): how well does it predict the probability that something belongs to a class
how do we choose k? (k-NN)
-typically choose the value of k which has the lowest error rate in the validation data (so repeat the process multiple times and calculate error rate and compare)
Ways to measure the quality of data
- Accuracy: is your data really measuring what it claims to be measuring (errors would be during data collection and entry) - Completeness: missing data/values - Uniqueness: entities are only recorded once - Timeliness: only include in-date data - Consistency: data agrees with itself (so common units, need to be able to compare)
what is the best model shown on an ROC curve?
a line that traces the y axis and then the top cutoff line (so would have an area under the curve of 1)
A Priori Criterion
a predetermined number of components based on research objectives and/or prior research
components in PCA
abstract concepts that provide an efficient and convenient method for labeling a number of similar variables - combine correlated measurements into one component
Structure of data
an n × d data matrix, where the n rows correspond to instances (n is referred to as the size) and the d columns represent features (d is referred to as the dimensionality)
curse of dimensionality
by throwing more variables (dimensions) at the problem, you initially increase the likelihood of getting better solutions, but as you keep adding more, your results will start to deteriorate
categorical outcome classification (k-NN)
classify the given record as whatever the predominant class is among the nearby records (majority vote determines class)
Latent root criterion
components with eigenvalues greater than 1
data summarization
compressing the data that exists by the rows (summarizing the rows, not the variables)
Pie/donut Chart
don't do it
what is the most important problem with classification
drawing the "right" decision boundary
binary classification
each record is classified as either positive or negative; comparing the actual values to the predicted values produces a confusion matrix
Predictive Analytics
encompasses modeling relationships, testing hypotheses, extrapolating / predicting
percentage variance criterion
enough components to meet a specified percentage of variance explained (usually 60% or higher)
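The latent root and percentage variance criteria can both be applied to a list of eigenvalues. A small sketch, assuming made-up eigenvalues for five standardized variables; the function names are illustrative.

```python
def latent_root(eigenvalues):
    """Latent root criterion: count components with eigenvalue greater than 1."""
    return sum(1 for ev in eigenvalues if ev > 1)

def percent_variance(eigenvalues, target=0.60):
    """Smallest number of components whose cumulative variance share reaches the target."""
    total = sum(eigenvalues)
    cumulative = 0.0
    for count, ev in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += ev / total
        if cumulative >= target:
            return count
    return len(eigenvalues)

# Hypothetical eigenvalues, for illustration only
eigenvalues = [2.5, 1.2, 0.8, 0.3, 0.2]

print(latent_root(eigenvalues))            # components with eigenvalue > 1
print(percent_variance(eigenvalues))       # components needed for 60% of variance
print(percent_variance(eigenvalues, 0.8))  # components needed for 80% of variance
```

The two criteria need not agree, so it is common to compare them (along with a scree plot) before settling on a number of components.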
overfitting with model evaluation
fitting the training data too precisely leads to learning too many peculiarities in training data and generates poor results on new data
what makes a sample representative?
if it has approximately the same property (of interest) as the original data set
how are classifications different than a regression?
in classifications, we have dependent variables that are categorical, where in regressions, they are continuous values
What is the most important quality of a sample for it to be effective?
it must be representative
Nominal Scale
measurement technique where levels are used to differentiate between different objects (ex. colors)
what is the goal of model evaluation?
minimize error on the TEST dataset
Ratio Scale
monetary values, age, can complete all sorts of mathematical analysis
what needs to happen before you can run PCA?
need to "process" the data to turn all relevant data into metric data in order to rescale - may need to use dummy variables (to make categorical data into metric data in order to analyze) - may have scaling issues (some variables may have a larger scale than others which means certain variances may dominate the calculation of covariances)
dummy variables
a nonmetric independent variable that has two (or more) distinct levels that are coded 0 or 1 - these variables act as replacement variables to enable nonmetric variables to be used as metric variables
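A minimal sketch of dummy coding for one categorical variable. The `drop_first` option is an assumption of this sketch, not something from the notes: it follows the common convention of omitting one level as the baseline so the dummy columns are not perfectly collinear.

```python
def dummy_code(values, drop_first=True):
    """Turn a categorical column into 0/1 indicator columns, one per kept level."""
    levels = sorted(set(values))
    kept = levels[1:] if drop_first else levels  # first level becomes the baseline
    return [{f"is_{level}": int(v == level) for level in kept} for v in values]

# Hypothetical categorical column
regions = ["east", "west", "north", "east"]
coded = dummy_code(regions)
for row in coded:
    print(row)
```

Here "east" is the baseline level, so an "east" record is coded 0 on every dummy column.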
Why are there missing values in data
not all information may be reported for all people (can be systematic or random)
outliers
observations which stand out from the rest (commonly 3 or 4 standard deviations from the mean)
Four types of scales for data
ratio, interval, nominal, ordinal
what other forms can datasets come in besides matrices
record (transaction data), sequences (DNA/protein), graph (the internet)
data reduction
reduce the number of attributes or objects, reduces the volume of data but still produces the same or similar analytical result
extreme case of k=n (k-NN)
the entire data set are neighbors, same as the "naive rule", classify all records according to the majority class
why do we not want to minimize error on training set?
the level of error on the training set is not a good indicator of performance on FUTURE data (since new data will not be the same as the training data) - we want to test the model's performance on data it has never seen before
is PCA unsupervised or supervised?
unsupervised - not predicting anything, just want to see if the variables are correlated among themselves
numerical outcome prediction (k-NN)
use average response values (or weighted average where weights decrease with distance)
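For a numerical outcome, the prediction is the average (or distance-weighted average) of the neighbors' response values. In the sketch below the data is made up, and inverse-distance weights are an assumption: they are one common way to make weights decrease with distance.

```python
import math

def knn_predict(train_X, train_y, query, k=3, weighted=False):
    # Find the k training records closest to the query
    nearest = sorted(
        ((math.dist(x, query), y) for x, y in zip(train_X, train_y)),
        key=lambda pair: pair[0],
    )[:k]
    if not weighted:
        return sum(y for _, y in nearest) / len(nearest)  # plain average
    # Inverse-distance weights; the small constant avoids division by zero
    weights = [1 / (d + 1e-9) for d, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)

# Hypothetical one-variable training data
train_X = [(1.0,), (2.0,), (3.0,), (10.0,)]
train_y = [10.0, 20.0, 30.0, 100.0]

plain = knn_predict(train_X, train_y, query=(2.2,), k=3)
weighted = knn_predict(train_X, train_y, query=(2.2,), k=3, weighted=True)
print(plain, weighted)
```

The weighted prediction leans toward the response of the closest neighbor, while the plain average treats all k neighbors equally.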
process of classification
usually a two-step process: for each record, calculate the probability of belonging to class 1, then compare it to the chosen cutoff value and classify accordingly **default cutoff is usually 0.5**
when would you want to do worse than the naive rule?
when the goal is to identify high-value but rare outcomes (like a rare diagnosis): the naive rule never predicts the rare class, so a model that classifies fewer records as the most prevalent class can be more useful even if its overall error rate is higher, because the goal is to find the least prevalent class