BA 305 EXAM 1

multicollinearity

a condition in which some of the predictor variables are correlated with each other; it leads to instability in the solution space

Principal Component Analysis

a descriptive approach involving projections

PCA algorithm

the components created are the eigenvectors of the correlation matrix, with the corresponding eigenvalues giving each component's strength

difference between data summarization and reduction

data reduction squeezes the matrix from the sides, reducing the number of columns, whereas data summarization reduces the number of rows

how to deal with outliers

delete or replace the instance

Ordinal Scale

differentiate and then order objects (ex. Education levels like primary, secondary, undergraduate, graduate, etc)

Interval Scale

differentiated, ordered levels with consistent differences between levels

fall out

false positive rate: false positives divided by the total number of actual negatives, i.e., FP / (FP + TN) ***Type I error***

overfitting

generalization of findings is hindered because new data does not behave the same way as the training data; when machine learning models learn too much, they also learn the random noise in the data

parsimony

high-dimensional data violates this principle; keep the number of variables small enough to be easily interpreted, and be efficient about the variables you use so the analysis is easier to understand while maintaining data integrity

specificity

how many of the actual negatives the model identifies: true negatives divided by total negatives, i.e., TN / (FP + TN)

sensitivity

how many of the actual positives the model identifies: true positives divided by all positives, i.e., TP / (FN + TP)

how should one deal with missing data

identify the patterns and relationships underlying the missing data in order to maintain as close a result as possible to the original distribution of values
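
Not from the course notes, just a minimal pandas sketch of one simple way to follow this advice: mean imputation on a hypothetical column ("income"), which preserves the sample size while keeping imputed values near the center of the original distribution.

```python
# A minimal sketch (hypothetical data): fill missing values with the column
# mean so the sample size is preserved; more careful approaches first study
# the pattern behind the missingness.
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [42000, np.nan, 58000, 61000, np.nan]})

df["income"] = df["income"].fillna(df["income"].mean())
print(df)
```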

common approaches to classification

linear regression, nearest neighbor, naive bayes, decision trees, artificial neural networks

precision

of all the positive calls the model makes, the proportion that are correct: TP / (TP + FP)
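
A hedged sketch (not from the cards) computing the four metrics above from a confusion matrix with scikit-learn; y_true and y_pred are made-up labels.

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # made-up actual classes
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # made-up predicted classes

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)  # how many positives are identified
specificity = tn / (tn + fp)  # how many negatives are identified
precision   = tp / (tp + fp)  # proportion of positive calls that are correct
fall_out    = fp / (fp + tn)  # false positive rate (Type I error)

print(sensitivity, specificity, precision, fall_out)
```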

objective of principal component analysis

reduce the set of numerical variables (columns) by removing the overlap of information between variables; the result is fewer variables with more efficient values (i.e., they contain most of the necessary information)

what is the effect of missing values

reduces the sample size for analysis which can distort results

density plot

a smoothed-out version of a histogram for a single variable; with two variables, the same idea extends to two dimensions and can be shown as a heat map

What condition requires more components

when there is heterogeneity among sample subgroups

Methods of data transformation

- Normalization - Aggregation - Data reduction - dummy variables

Scree Test Criterion

keep the components before the inflection point (elbow) when graphing the percent of variance explained

purpose of data transformations

to correct violations of the statistical assumptions underlying the multivariate technique, and to improve the relationship/correlation between the variables

Analytics Lifecycle

Raw Data → Data Processing → Clean, Structured Data → Exploratory Data Analysis → Insights, Reports, Visual Graphs; Models and Algorithms → Data-driven Products

model performance for prediction/numerical value

**NOT THE SAME AS A GOODNESS-OF-FIT TEST** - we want to know how well the model predicts on NEW DATA - measure error via mean absolute error or average error - generally, we want to measure the absolute difference between the actual y and the predicted y
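
A small illustrative sketch of measuring prediction error with mean absolute error; the actual and predicted values here are invented.

```python
from sklearn.metrics import mean_absolute_error

y_actual    = [10.0, 12.5, 9.0, 15.0]   # invented values for NEW (test) data
y_predicted = [11.0, 12.0, 8.5, 16.5]

# Average of |actual y - predicted y| on data the model has not seen before.
print(mean_absolute_error(y_actual, y_predicted))  # 0.875
```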

Best practices for data visualization

- Color is bad for quantitative data because color is a nominal scale; however, color is great for visualizing nominal data
- Don't go 3D; it just looks gaudy
- Don't use trend lines for categorical data (like gender); it assumes connections between levels. Trend lines are only for continuous variables
- Watch out for aspect ratios: "banking to 45" means the average absolute angle of the line segments should be about 45 degrees
- Don't break the y-axis scale; it misconstrues the data, especially because people don't commonly read the actual numbers
- Get rid of "chartjunk": use ink as a precious resource

Combination of tools that make up business analytics

- Descriptive: what happened? (how many customers, what was my profit, etc.) (more information based)
- Diagnostic: why did it happen?
- Predictive: what will happen in the future?
- Prescriptive: how can we make it happen ("optimally")? (helps you make decisions going forward)

general: how do you evaluate a model?

- Objectively/randomly partition the data into training and testing parts
- Use the training data for in-sample evaluation: how well does my model "fit" the data?
- Use the testing data for out-of-sample evaluation: the model's predictive power on new data
- Basically, build a model using the training set and evaluate it on the testing set (the test data cannot be used at all in the training process)

When do you need to use "repeated hold out?"

- When the data is smaller: repeat the process with different subsamples and take the average of the error rates across iterations to yield an overall error rate - randomization works better on larger data sets, so with a smaller data set there is a higher chance that a single split will be unbalanced (so you need more random subsamples)

Naive bayes

- a classification algorithm (similar to k-NN) but conceptually different

model performance for classification

- An error is classifying a record as belonging to one class when it belongs to another
- The error rate is the percent of misclassified records out of the total records
- Start with the naive rule: classify all records as belonging to the most prevalent class; use it as a benchmark, where the goal is to do better than this
- Evaluate the model based on the separation of records: high separation means the model attains low error, while low separation means the model does not improve much on the naive rule

Aggregation

- combining two or more attributes (or objects) into a single attribute (or object) - typically gives more stable data because it has less variability - aggregation can occur at different levels (e.g., cities can be aggregated into regions, states, countries, etc.)

process of nearest neighbor algorithm

- compute distance to other training records, identify k nearest neighbors using these distances, use class labels of the nearest neighbors to determine the class label of the unknown record
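
A minimal scikit-learn sketch of these steps on made-up data: distances to the training records are computed internally, the k nearest neighbors are found, and their class labels determine the label of the new record.

```python
from sklearn.neighbors import KNeighborsClassifier

X_train = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]]  # made-up records
y_train = ["A", "A", "B", "B"]

knn = KNeighborsClassifier(n_neighbors=3)  # k = 3 nearest neighbors
knn.fit(X_train, y_train)

# Majority vote among the 3 nearest training records classifies the new record.
print(knn.predict([[1.2, 2.1]]))  # ['A']
```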

How dimension reduction helps

- creates independent/uncorrelated/orthogonal input useful for regression, cluster analysis, classification, and other methods - reduces the amount of time and memory required by data mining algorithms

why does k-NN not work well for large datasets?

- due to curse of dimensionality - expected distance to nearest neighbor increases with dimensions (as when there are many dimensions, all records end up "far away" from each other) - can try reducing dimension using PCA - or just for large datasets, use a different method

Phases of preparation

- graph the data to look at the shape (univariate) and/or relationships of your data (multivariate) - identify and evaluate missing values - identify and deal with outliers

Why do we use predictive analytics

- help classify/predict out-of-sample data points (with tools like linear regressions) - use what we learn to take action

K-fold cross validation

- like an extension of the hold-out method - instead of doing 1 split, split the data into k subsets, then train the model on (k-1) subsets and use the remaining subset to test - repeat this process k times so each subset is the test data exactly once - take an average of the error estimates - use the scikit-learn function "cross_val_score"
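
A hedged sketch of k-fold cross-validation with cross_val_score; the iris data and k-NN model are stand-ins chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=5)

# Split into k = 5 subsets, train on 4 and test on the held-out subset,
# repeat 5 times, then average the per-fold accuracy estimates.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```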

low values of k versus high values of k (k-NN)

- low: captures the local structure of the data but also the noise - high: provides more smoothing and less noise but may miss the local structure

Sampling

- main technique employed for data selection, often used for both preliminary investigation of the data and the final data analysis - Used in data mining because processing the entire set of data of interest is too expensive or time consuming

types of rescaling

- min-max normalization: rescale the variables to a range of [0,1] - standardization (via z-score: how many standard deviations away from the mean): all variables have a mean of 0 and variances of 1
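
A small sketch of both rescaling methods on a made-up single-column array, using scikit-learn's scalers.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[2.0], [4.0], [6.0], [10.0]])  # made-up values

# Min-max normalization: rescale to the range [0, 1].
print(MinMaxScaler().fit_transform(x).ravel())

# Standardization: z-scores with mean 0 and variance 1.
print(StandardScaler().fit_transform(x).ravel())
```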

Why do we evaluate models?

- model deployment is incredibly expensive (in terms of time and money) so we want to evaluate before money is spent to scale it up to deploy - multiple models can be used to classify or predict (and may have multiple settings as well) so to choose the best model, we need ways to compare each model's performance

Classification

- the most common data mining task - a SUPERVISED method that includes two or more classes for the categorical target variable - the algorithm examines relationships between the values of the predictor/input fields and the target/output values - ex: banking - determine whether a mortgage application is "good"

Box Plot

- not visually pleasing but a helpful tool because it contains many different types of information - the box shows the 25th-75th percentiles (so it contains 50% of the data) - the belt is the median (tells us how symmetrical the data is); a median line shifted up or down tells us the data is skewed - whiskers are the soft minimum and maximum (1.5 × IQR beyond the box) - outliers appear above or below the whiskers - side-by-side boxplots are useful for comparing subgroups

Receiver Operating Characteristic Curve

- plots the true positive rate against the false positive rate at all possible cutoff values (where the diagonal line reflects TPR = FPR, i.e., random classification) - compare models by comparing the area under the ROC curve
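
A minimal sketch of computing the ROC curve and the area under it; the labels and predicted probabilities are invented.

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true  = [0, 0, 1, 1, 0, 1]                    # invented actual classes
y_score = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90]  # predicted probability of class 1

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one (FPR, TPR) point per cutoff
print(roc_auc_score(y_true, y_score))              # area under the ROC curve
```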

How does principal component analysis work

- relies on correlations to create new variables that are linear combinations of the original variables - these linear combinations are uncorrelated (no information overlap) and only a few of them contain most of the original information - the new variables are called principal components
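
A hedged scikit-learn sketch of PCA on made-up data: after rescaling, the fitted components are uncorrelated linear combinations of the original variables, with loadings and explained-variance shares as output.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])

X_std = StandardScaler().fit_transform(X)  # rescale so no variable dominates
pca = PCA()
scores = pca.fit_transform(X_std)          # component scores for each record

print(pca.components_)                 # loadings on the original variables
print(pca.explained_variance_ratio_)   # share of variance per component
```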

advantages of k-NN

- simple because it makes no assumptions - handles non-linear decision boundaries - effective at capturing complex interactions among variables without having to define a statistical model

types of voting during classification (k-NN)

- simple unweighted voting: decide on the value of k to determine the number of similar records that can "vote", compare each unclassified record to its k nearest (most similar) neighbors according to the Euclidean distance function, and let each of the k similar records vote - weighted voting: closer neighbors have a larger voice in the classification decision than more distant neighbors

reasoning behind curse of dimensionality

- sparsity increases: definitions of density and distance between points, which are critical for models like clustering and outlier detection, become less meaningful - getting more signals but also getting more noise --> signal gets lost in the noise

Interpretation of component matrix

- the squared component weight represents the amount of the total variability explained by the component - eigenvalues represent the magnitude of the components (the explanatory value of the variables)

Component weights/loadings

- tells us how original variables and the components/factors relate - gives the correlation of each component with each of the original variables

component scores

- the resulting data from dimension reduction - can be used for subsequent analysis (like regression, cluster analysis, classification, etc) - helps avoid complications caused by multicollinearity - however, interpretation can be more difficult because all variables contribute through loadings

shortcomings of k-NN

- there is no model, so it's not interpretable - requires some random tinkering with parameters (like the value of k and cutoffs) - can fail for large datasets - also not memory efficient (slow for large datasets)

Why do you examine your data before analysis?

- to examine the suitability of the data to make the proper adjustments

Hold-out method

- use two independent data sets (ex: training set is about 80%, testing set is about 20% of original) - use sklearn package train_test_split
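
A minimal sketch of the hold-out split with train_test_split (mentioned above); the iris data is a stand-in.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Roughly 80% of the records for training, 20% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))
```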

cost benefit matrix

- use when each classification outcome (true/false positive/negative) has a cost or benefit associated with it (e.g., a false diagnosis carries a cost) - the best cutoff maximizes the net benefit

"leave-one-out"

- version of k-fold cross validation - for smaller sized data, k-fold cv where k=n (so the number of subsets is the number of data points) - a more efficient use of smaller data sets
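
A hedged sketch of leave-one-out, i.e., k-fold cross-validation with k = n; the data and model are placeholders.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=5)

# Each record serves as the test set exactly once; average the n estimates.
scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print(scores.mean())
```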

why do you visualize your data first?

- want to understand your data as much as possible before you analyze it - a lot of statistical models need certain assumptions verified (like linearity) which visualizations can confirm

Output of PCA

-a component matrix with the correlation values between the variables and the components - there are as many components as there are dimensions in the data

what happens when you change the cutoff values for classifications

-changes the confusion matrix - isn't rerunning the model, just reclassifying the same output of the same model

characteristics of k-Nearest neighbors

- data driven, not model driven - makes no assumptions about the data - the simple goal is to classify a record like similar records

how do we measure nearby? (k-NN)

- for metric values, use Euclidean distance (may need to normalize the values using min-max or z-score) - for categorical values, Euclidean distance can't be used; use a "difference function" where the distance is 0 if the values are the same and 1 otherwise

Objectives of data reduction/dimensionality

-helps to understand the underlying structure (are some variables similar to each other, and if so, can we somehow group them together?) - shrinks size, while preserving the important information, for more efficient subsequent analysis

measuring of error in model evaluation (generally)

- prediction (how well does it predict the correct numerical value): when dealing with numerical values - membership in a class/classification (how well does it predict which class it's in): when dealing with categorical data - propensity (how well does it predict the probability that something belongs to a class)

how do we choose k? (k-NN)

-typically choose the value of k which has the lowest error rate in the validation data (so repeat the process multiple times and calculate error rate and compare)

Ways to measure the quality of data

Accuracy: is your data really measuring what it claims to be measuring? (errors would be during data collection and entry)
Completeness: missing data/values
Uniqueness: entities are only recorded once
Timeliness: only include in-date data
Consistency: data agrees with itself (so common units, need to be able to compare)

what is the best model shown on an ROC curve?

a curve that traces the y-axis up to the top and then runs along the top of the plot (so it would have an area under the curve of 1)

A Priori Criterion

a predetermined number of components based on research objectives and/or prior research

components in PCA

abstract concepts that provide an efficient and convenient method for labeling a number of similar variables - combine correlated measurements into one component

Structure of data

an n×d data matrix, where the n rows correspond to instances and are referred to as the size, and the d columns represent features and are referred to as the dimensionality

curse of dimensionality

adding more variables (dimensions) to the problem initially increases the likelihood of getting better solutions, but as you keep adding more, your results will start to deteriorate

categorical outcome classification (k-NN)

classify the given record as whatever the predominant class is among the nearby records (majority vote determines class)

Latent root criterion

components with eigenvalues greater than 1

data summarization

compressing the data that exists by the rows (summarizing the rows, not the variables)

Pie/donut Chart

don't do it

what is the most important problem with classification

drawing the "right" decision boundary

binary classification

each record is classified as either positive or negative; comparing the actual values to the predicted values creates a confusion matrix

Predictive Analytics

encompasses modeling relationships, testing hypotheses, extrapolating / predicting

percentage variance criterion

enough components to meet a specified percentage of variance explained (usually 60% or higher)

overfitting with model evaluation

fitting the training data too precisely leads to learning too many peculiarities in training data and generates poor results on new data

what makes a sample representative?

if it has approximately the same property (of interest) as the original data set

how are classifications different than a regression?

in classification, the dependent variable is categorical, whereas in regression it is a continuous value

What is the most important quality of a sample for it to be effective?

it must be representative

Nominal Scale

measurement technique where levels are used to differentiate between different objects (ex. colors)

what is the goal of model evaluation?

minimize error on the TEST dataset

Ratio Scale

differentiated, ordered levels with consistent differences and a true zero (ex. monetary values, age); all sorts of mathematical analysis can be performed

what needs to happen before you can run PCA?

need to "process" the data to turn all relevant data into metric data in order to rescale - may need to use dummy variables (to make categorical data into metric data in order to analyze) - may have scaling issues (some variables may have a larger scale than others which means certain variances may dominate the calculation of covariances)

dummy variables

a nonmetric independent variable that has two (or more) distinct levels that are coded 0 or 1 - these variables act as replacement variables to enable nonmetric variables to be used as metric variables
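
A small pandas sketch of dummy coding; the column name and levels are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"region": ["north", "south", "west", "south"]})

# Each level becomes a 0/1 indicator column; dropping the first level avoids
# redundant columns (and the multicollinearity they would cause in a regression).
print(pd.get_dummies(df["region"], prefix="region", drop_first=True))
```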

Why are there missing values in data

not all information may be reported for all people (can be systematic or random)

outliers

observations which stand out from the rest (commonly 3 or 4 standard deviations from the mean)

Four types of scales for data

ratio, interval, nominal, ordinal

what other forms can datasets come in besides matrices

record (transaction data), sequences (DNA/protein), graph (the internet)

data reduction

reduce the number of attributes or objects, reduces the volume of data but still produces the same or similar analytical result

extreme case of k=n (k-NN)

the entire data set counts as neighbors, which is the same as the "naive rule": classify all records according to the majority class

why do we not want to minimize error on training set?

the level of error on the training set is not a good indicator of performance on FUTURE data (since new data will not be the same as the training data); we want to test the model's performance on data it has never seen before

is PCA unsupervised or supervised?

unsupervised - not predicting anything, just want to see if the variables are correlated among themselves

numerical outcome prediction (k-NN)

use average response values (or weighted average where weights decrease with distance)

process of classification

usually a two step process: for each value, calculate the probability of belonging to class 1 and compare to the calculated cutoff value, then classify accordingly **default cutoff usually .5 **
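
A minimal sketch of the second step with made-up class-1 probabilities: compare each probability to the cutoff and classify accordingly; changing the cutoff reclassifies the same model output without rerunning the model.

```python
import numpy as np

proba_class1 = np.array([0.12, 0.48, 0.51, 0.85, 0.30])  # made-up probabilities
cutoff = 0.5  # default cutoff

labels = (proba_class1 >= cutoff).astype(int)
print(labels)  # [0 0 1 1 0]
```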

when would you want to do worse than the naive rule?

when the goal is to identify high-value but rare outcomes (like a rare diagnosis): the model may do better by classifying fewer records as the most prevalent class than the naive rule does, because the goal is to catch the least prevalent class

