CSE 519: Data Science

Artifacts

Systematic problems. Detectable and correctable.

Normality Testing

Tests normality of a given distribution.

monkey

Baseline classifier that guesses each label at random.

sharp

Will try to make evaluation statistic look bad by achieving a high score with a useless classifier.

Angle between vectors

cos(x) = (u⋅v) / (||u|| ∗ ||v||). When both vectors are mean-centered, this measures similarity in exactly the same way as Pearson.
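A minimal sketch of the cosine formula, plus a check of its link to Pearson on mean-centered vectors. The function names and sample vectors are illustrative assumptions, not part of the card.

```python
import math

def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (||u|| * ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def center(x):
    """Subtract the mean from every entry."""
    mean = sum(x) / len(x)
    return [a - mean for a in x]

# Hypothetical data: v is almost a scaled copy of u.
u = [1.0, 2.0, 3.0, 4.0]
v = [2.0, 4.0, 6.0, 9.0]

# Cosine of the mean-centered vectors equals the Pearson correlation of u and v.
pearson_like = cosine_similarity(center(u), center(v))
```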

Data storage

csv, xml, sql, json, protocol buffers

Tufte's Principles for Analytical Design

-Comparison -Causality -Multivariate Analysis -Integration of Evidence -Documentation -Content

recall_i

= C[i,i] / sumj(C[i,j])

precision_i

= C[i,i] / sumj(C[j,i])
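A small sketch of both per-class formulas, assuming the convention that C[x][y] counts instances of class x labeled as class y. The 3-class matrix is made-up illustration data.

```python
def per_class_recall(C, i):
    """Fraction of true class-i instances that were labeled i (row i of C)."""
    return C[i][i] / sum(C[i])

def per_class_precision(C, i):
    """Fraction of instances labeled i that truly are class i (column i of C)."""
    return C[i][i] / sum(row[i] for row in C)

# Hypothetical confusion matrix: C[x][y] = number of class-x items labeled y.
C = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 3, 7],
]
```

Rows sum to the true class sizes, columns to the number of times each label was assigned.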

Good scoring function

Bell shaped distribution. Easily computable. Easily understandable. Monotonic interpretations of variables. Produces generally satisfying results on outliers. Uses systematically normalized variables. Breaks ties in meaningful ways.

Mean Squared Error (MSE)

Boils an error distribution down to a single number. MSE(y, y') = (1/n) * Sum((y'_i - y_i)^2). Median squared error is better for noisier instances.
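A short sketch contrasting the mean and median of the squared errors. The data is invented, with one deliberate outlier to show why the median is the more robust summary.

```python
import statistics

def squared_errors(y, y_pred):
    """Per-instance squared error (y'_i - y_i)^2."""
    return [(p - t) ** 2 for t, p in zip(y, y_pred)]

def mean_squared_error(y, y_pred):
    return statistics.mean(squared_errors(y, y_pred))

def median_squared_error(y, y_pred):
    return statistics.median(squared_errors(y, y_pred))

# Hypothetical predictions: perfect except for one wild outlier.
y = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.0, 2.0, 3.0, 4.0, 15.0]
```

Here the single bad prediction drives the MSE to 20 while the median squared error stays at 0.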

Proxy

Data that is easier to find and should correlate well with the desired gold standard / base truth when that is nonexistent or hard to find.

Recommended Data Partition

Training Data 60%, Test Data 20%, Evaluation Data 20%

Data Compatibility

Units, numerical representation, name unification, time/date unification, financial unification

Receiver Operating Characteristic Curve (ROC)

Visual representation of the complete space of options in putting together a classifier. Each point represents a particular classifier threshold, defined by its false positive and true positive rates.

Linear Model

Weigh each feature by a coefficient reflecting its importance and sum them up to produce a score. e.g. Linear Regression.

Cartogram

a map on which statistical information is shown in diagrammatic form.

Relative Error

epsilon = (y - y')/y Unitless. Corrects for size.

Elo Rank Update

r'(A) = r(A) + k(SA - muA). SA is the result of the comparison and is 1 or -1. muA is the expected result, in the range [-1, 1]: 0 if both have the same skill level. muA = P(A > B) + (-1)(1 - P(A > B))
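A rough sketch of one Elo update, assuming P(A > B) comes from the logistic curve 1/(1+e^-cx) applied to the skill difference. The defaults k = 0.1 and c = 1.0 are arbitrary illustration values, not recommendations.

```python
import math

def win_probability(r_a, r_b, c=1.0):
    """Convert the skill difference r_a - r_b to P(A > B) with a logistic curve."""
    return 1.0 / (1.0 + math.exp(-c * (r_a - r_b)))

def elo_update(r_a, r_b, s_a, k=0.1, c=1.0):
    """Return A's new rating; s_a is +1 if A won the comparison, -1 if A lost."""
    p = win_probability(r_a, r_b, c)
    mu_a = p + (-1) * (1 - p)  # expected result, in [-1, 1]
    return r_a + k * (s_a - mu_a)

# Equal ratings: the expected result is 0, so a win moves A up by exactly k.
after_win = elo_update(1.0, 1.0, s_a=1)
```

A favorite gains little for an expected win and loses more than k for an upset loss.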

bias vs variance

underfit vs. overfit

Elo Skill Difference

x = r(A) - r(B) This should be converted to probability using Logit function.

Absolute Error

|y - y'| Weighs larger values of y as more important. Histogram should be bell shaped.

Gold Standard

Set of labels and answers trusted to be correct.

Arrow's Impossibility Theorem

Shows no election system for aggregating permutations of preferences satisfies: Complete System. Transitive Results. All prefer A to B => System prefers A to B. No Dictators. Preference of A over B should be independent of other comparisons.

Covariance

Gives the sign of the correlation. Cov(X, Y) = (1/n) * Sum((X_i - E[X])(Y_i - E[Y]))

T-test

Significance test. Evaluates whether the population means of two samples are different. t = (E[x1] - E[x2]) / sqrt(var1/n1 + var2/n2). Requires a table lookup with an alpha value and the degrees of freedom to find out whether t is above the threshold of significance.
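A minimal sketch of the t statistic only, assuming sample variances (n - 1 denominator) for var1 and var2. The table lookup for significance is not implemented here.

```python
import math
import statistics

def t_statistic(x1, x2):
    """t = (mean1 - mean2) / sqrt(var1/n1 + var2/n2), using sample variances."""
    m1, m2 = statistics.mean(x1), statistics.mean(x2)
    v1, v2 = statistics.variance(x1), statistics.variance(x2)
    return (m1 - m2) / math.sqrt(v1 / len(x1) + v2 / len(x2))

# Hypothetical samples whose means differ by exactly 1.
t = t_statistic([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
```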

KS-test

Significance test. Assesses cdfs by quantifying maximum y-gap between cdfs. Used to do normality testing. Fewer technical assumptions and variants than t-test. Can be applied to many problems.
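A small sketch of the two-sample KS statistic, computed directly as the largest vertical gap between the two empirical cdfs. Turning the gap into a p-value is left out.

```python
import bisect

def ks_statistic(a, b):
    """Largest vertical gap between the empirical cdfs of samples a and b."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    # The gap can only change at observed values, so checking those is enough.
    for x in a_sorted + b_sorted:
        cdf_a = bisect.bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect.bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap
```

For normality testing, one of the two cdfs would be replaced by the theoretical normal cdf.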

Baseline Model

Simplest reasonable model that can be compared against.

F-Score

Single score for a classifier. Harmonic mean of precision and recall: F = 2 * precision * recall / (precision + recall). F <= arithmetic mean of precision and recall. A high F requires decent performance on both, balancing precision and recall. Best of all classifier evaluation statistics.
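A quick sketch of the F-score from raw counts. The counts are invented to show the harmonic mean falling below the arithmetic mean when precision and recall are unbalanced.

```python
def f_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: precision = 0.8, recall = 0.5, arithmetic mean = 0.65.
f = f_score(tp=8, fp=2, fn=8)
```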

Accuracy

(TP + TN) / (TP + TN + FP + FN). Ratio of correct predictions over total predictions. When |P| << |N|, this has limitations. Essentially a misleading statistic when the class sizes are substantially different.
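A tiny sketch of why accuracy misleads on unbalanced classes. The counts are hypothetical: 10 positives, 990 negatives, and a useless classifier that always answers negative.

```python
def accuracy(tp, tn, fp, fn):
    """Correct predictions over total predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

# The classifier never finds a single positive, yet scores 99% accuracy.
always_negative = accuracy(tp=0, tn=990, fp=0, fn=10)
```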

Statistical Significance (2)

Tells you how unlikely it is that something is due to chance but not whether it is important. Measures confidence that there is a genuine difference between two given distributions.

Effect size

The magnitude of difference between the two distributions.

Monte Carlo Circle Sampling

Generate x and y uniformly in [-r, r]. Reject the point unless sqrt(x^2 + y^2) <= r.
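A sketch of the rejection-sampling idea, reading the acceptance test as "the point lies inside the circle of radius r". The fixed seed and the radius 2.0 are only there to make the illustration repeatable.

```python
import random

def sample_in_circle(r, rng):
    """Draw points from the bounding square until one lands inside the circle."""
    while True:
        x = rng.uniform(-r, r)
        y = rng.uniform(-r, r)
        if x * x + y * y <= r * r:  # same test as sqrt(x^2 + y^2) <= r
            return x, y

rng = random.Random(0)  # fixed seed so runs are repeatable
points = [sample_in_circle(2.0, rng) for _ in range(1000)]
```

About pi/4 of the candidate points are accepted, since that is the ratio of the circle's area to the square's.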

Page Rank

Google's way of ordering nodes by importance. Rewards in-links and strength of source.

Zipf's Law

Governs the distribution of word usage in natural languages. kth most popular word is used only 1/kth as frequently as the most popular word.

Data Imputation Methods

Heuristic based / Reasonable guess. Mean value: does not bias mean and generally applicable. Random values from column: good for lots of missing values. Nearest Neighbor: requires distance function; should be more accurate than mean when there are systematic reasons to explain the variance.

Precision

How often classifier is right when it dares to say positive. precision = TP / (TP + FP) Impossible to get high for sharp or monkey. High precision is hard to achieve in unbalanced class sizes.

Recall

How often you prove right on all positive instances. recall = TP / (TP + FN) high => few FN. Tradeoff between precision and recall: braver predictions are less likely to be right. Recall = Accuracy iff the classifiers are balanced.

Bonferroni Correction

How you found a correlation can be as important as the strength of the correlation itself. When testing n different hypotheses simultaneously, a p-value must be at or below alpha/n in order to be significant at level alpha. Safeguards against accepting the significance of a lone success among many trials.
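A minimal sketch of the correction: each p-value is compared against alpha/n rather than alpha. The p-values are made up for illustration.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag each hypothesis whose p-value is at or below alpha / n."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

# Hypothetical p-values from 5 simultaneous tests; the threshold is 0.05 / 5.
flags = bonferroni_significant([0.04, 0.004, 0.2, 0.03, 0.5])
```

Three of these would pass an uncorrected 0.05 cutoff, but only one survives the correction.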

number of samples

Statistical significance depends upon the _____________ ____ ___________ while the effect size does not.

Spearman Rank Correlation Coefficient

Sum the squared differences between the ranks of each (x, y) pair and normalize: rho = 1 - 6 * Sum(d_i^2) / (n(n^2 - 1)). Essentially counts the number of out-of-order points. Gives high scores to monotonic non-linear functions. Less sensitive to outliers than Pearson.
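A sketch of the rank-difference computation using the standard formula 1 - 6 * Sum(d^2) / (n(n^2 - 1)). It assumes there are no tied values, which is a simplification; the helper names are illustrative.

```python
def ranks(values):
    """Rank 1 for the smallest value. Assumes no ties (simplifying assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(x, y):
    """Spearman rank correlation from squared rank differences."""
    n = len(x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# A monotonic but non-linear relationship (y = x^3) still scores 1.
rho = spearman([1, 2, 3, 4, 5], [1, 8, 27, 64, 125])
```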

Borda's Weights

Options: Linear weights. Symmetric? Bell shaped sample. Non-symmetric? Half bell shaped sampled. Domain dependent!

Bayes Theorem

P(A|B) = P(B|A)*P(A) / P(B). posterior = likelihood*prior / marginal. Allows us to swap conditions: P(results | data) <=> P(data | results)
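A worked example of swapping the condition, with invented numbers: a test that is positive for 90% of sick people and 5% of healthy people, in a population where 1% are sick.

```python
def posterior(likelihood, prior, marginal):
    """P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / marginal

# Hypothetical rates: P(pos | sick) = 0.9, P(sick) = 0.01, P(pos | healthy) = 0.05.
p_pos = 0.9 * 0.01 + 0.05 * 0.99  # marginal P(pos) over sick and healthy
p_sick_given_pos = posterior(0.9, 0.01, p_pos)
```

Despite the accurate-looking test, P(sick | pos) comes out near 15%, because the prior is so small.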

Cross-Validation

Partition data into k equal sized sets then train k distinct models. Average the performances. Results in standard deviation of performances. Worth it on large data sets.
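A sketch of the partitioning step only, assuming records are dealt round-robin into the k sets (a real split would usually shuffle first). Training and scoring the k models is left to the caller.

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists for k folds over n records."""
    folds = [list(range(i, n, k)) for i in range(k)]  # round-robin assignment
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test

splits = list(k_fold_indices(10, 5))
```

Each record is held out exactly once, so the k performance scores can be averaged and their standard deviation reported.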

Score

Reduces n dimensional records to a single value, highlighting a property of the data.

Errors

Data lost in acquisition.

Rank

A sorting of record scores.

Squared Error

Absolute Error squared. Dominated by outliers.

scoring function

Area under ROC can be used to measure the quality of the _________ _________ defining the classifier. The closer the AUC is to one, the better.

Normal Distribution

Bell shaped. Continuous. e.g. Gaussian noise. Defined by mean and standard deviation sigma. 68-95-99.7% rule: the probability mass within 1, 2, and 3 standard deviations of the mean, respectively.

Confusion Matrix

C[x,y] reports the number or fraction of instances of class x which get labeled as class y. C[i, i] is the number of correctly labeled instances. Sparse rows => poorly represented classes in training data. Sparse columns => labels classifier is reluctant to assign. In either case, consider removing the label.

Permutation Tests

Conduct many trials against random data to establish significance of real observations. Score should be on the extreme tail among the random permutations to be significant. Try min k = 1000 permutations.
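One possible sketch of a permutation test, assuming the score of interest is the absolute difference between two group means. The choice of statistic, the seed, and the function name are illustrative assumptions.

```python
import random
import statistics

def permutation_p_value(a, b, k=1000, seed=0):
    """Fraction of k random relabelings whose mean gap is at least the observed one."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(k):
        rng.shuffle(pooled)  # random reassignment of values to the two groups
        left, right = pooled[:len(a)], pooled[len(a):]
        if abs(statistics.mean(left) - statistics.mean(right)) >= observed:
            extreme += 1
    return extreme / k
```

A small returned fraction means the real observation sits on the extreme tail of the random permutations.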

Logit function

Convert real variable x to probability p in [0, 1]. p = 1/(1+e^-cx) c governs how steep interpolation between A having complete advantage and B having complete advantage. Do small differences in skill translate to large differences in the probability of winning?

Causation

Correlation does not mean causation...

Amplifying a Dataset

Create negative examples from a prior distribution. Perturb real examples to create a similar but synthetic one. Give partial credit when you can: Squeeze everything out of what you got.

r^2 (effect size)

Reflects proportion of the variance in one variable explained by the other. Measures effect size. Small, medium, and large are just squares of r and represent percentage of variance explained.

Bar plots and Pie charts

Relative proportions of categorical variables.

Evaluation Data

Data used to confirm the performance of the final model right before it goes into production.

Testing Data

Data used to evaluate how good the model is.

Training Data

Data used to study domain and set parameters of model.

r

Degree of linear relationship between two variables. Measures effect size. small: > +- 0.2, med: > +- 0.5, large: > +- 0.8

Pearson Correlation Coefficient

Degree to which a linear predictor of the form y = mx + b can fit the observed data. r = cov(X,Y) / (std(X) * std(Y)). Good for linear predictors. Fails on functions like y = |x|. More sensitive to outliers.
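A small sketch of the coefficient, using population (1/n) covariance and standard deviations. The y = |x| data illustrates the failure case: the relationship is perfectly predictable, yet r comes out as 0.

```python
import math

def pearson(x, y):
    """r = cov(X, Y) / (std(X) * std(Y)), all with 1/n normalization."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)

x = [-2, -1, 0, 1, 2]
y_abs = [abs(v) for v in x]      # y = |x|: dependent, but not linearly
y_line = [3 * v + 1 for v in x]  # y = 3x + 1: perfectly linear
```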

Statistical Significance

Depends on n and r. The p-value is the chance that we would observe a correlation as strong as r in a random set of n points. Significant if p <= alpha = 0.05.

False Discovery Rate (FDR)

When many correlations are tested, some will look statistically significant by chance alone. The FDR is the expected fraction of reported discoveries that are false. Are they really that important?

Blackbox Model

Do their job in an unknown manner. e.g. Deep learning, Neural Networks.

Occam's Razor

Does a simpler explanation fit the data just as well?

Bias

Error from incorrect assumptions built into a model.

Variance

Error from sensitivity to fluctuations in the training set. High-variance models do better on training data than on testing data.

Root Mean Squared Error (RMSE)

Error value whose magnitude is interpretable on the same scale as the original values. = sqrt(MSE)

r^2

Estimates fraction of variance in Y explained by X in a simple linear regression.

Generating Random Permutations

for i = 1 to n do a[i] = i; for i = 1 to n - 1 do swap(a[i], a[Random(i, n)])
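A runnable sketch of this shuffle (commonly known as Fisher-Yates), starting from the identity permutation and using 0-based indices. The `rng` parameter is an added convenience for seeding, not part of the card.

```python
import random

def random_permutation(n, rng=random):
    """Return a uniformly random permutation of 1..n."""
    a = list(range(1, n + 1))      # identity permutation: a[i] = i
    for i in range(n - 1):
        j = rng.randint(i, n - 1)  # uniform index in [i, n-1], both ends inclusive
        a[i], a[j] = a[j], a[i]
    return a
```

Drawing the swap partner from positions i onward, never from the whole array, is what makes every permutation equally likely.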

Borda's Method

For merging multiple rankings. Assign a weight to each of the n positions in the permutation. For each of the n items, sum the weights of its positions over all k input rankings. Sort these n scores for the final ranking.
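A sketch of the merge using linear weights, where the item in 0-based position p of a ranking is charged p points and the lowest total wins. The weights and the three sample rankings are illustrative assumptions.

```python
def borda(rankings):
    """Merge rankings by summing each item's position weights; lowest total first."""
    scores = {}
    for ranking in rankings:
        for position, item in enumerate(ranking):
            scores[item] = scores.get(item, 0) + position  # linear weights
    return sorted(scores, key=lambda item: scores[item])

# Hypothetical input: k = 3 rankings of n = 3 items.
merged = borda([["A", "B", "C"], ["A", "C", "B"], ["B", "A", "C"]])
```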

Histogram

Frequency distribution graph. bins = ceil(n/25). Can show a pdf or a cdf.

Poisson Distribution

Frequency of intervals between rare events. Doing something n times till event R happens. e.g living and dying. P(x) = (e^-mu * mu^x) / x!

Inverse Transform Sampling

Generate a random number p from the uniform distribution on [0,1] and interpret it as a probability. Report the x such that cdf(x) = P(X <= x) = p.
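A sketch for one concrete case, assuming an exponential distribution with rate lam, whose cdf 1 - e^(-lam * x) can be inverted in closed form. The choice of distribution is an assumption for illustration only.

```python
import math
import random

def sample_exponential(lam, rng):
    """Draw p uniformly, then return the x whose cdf value equals p."""
    p = rng.random()                 # uniform in [0, 1)
    return -math.log(1.0 - p) / lam  # solves 1 - exp(-lam * x) = p for x

rng = random.Random(1)  # fixed seed so runs are repeatable
samples = [sample_exponential(2.0, rng) for _ in range(10000)]
```

The sample mean should land close to 1/lam, which is 0.5 here.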

Descriptive Model

Insight into why they are making decisions. e.g. Simplicity of Linear Regression.

Transpose

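Interchange the rows and columns of a matrix: M^T[i,j] = M[j,i]. Entry (i, j) of a matrix times its transpose is the dot product of rows i and j, which measures how similar items i and j are. For mean-centered data, a matrix times its transpose is (up to scaling) a covariance matrix.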

Power Law Distribution

Long tails in distribution. P(X = x) = cx^-a. Show as straight lines on logarithmic frequency plots. Mean and standard deviation are useless. Distribution is scale invariant. Reflect inequalities in the world.

Outlier Detection

Look at largest and smallest. View graphics. Can be indicative of systematic problem or data entry mistakes.

Color

Mark class distinctions Encode numerical values. Please use linear scales that are clear.

Scatter Plot / Multivariate Plot

Massive data sets bivariate: scatter plot

Visualization Rules

Maximize Data/Ink ratio. Minimize Lie Factor = size of effect in graphic / size of effect in data. Minimize chartjunk. Proper scaling and labeling.

Cohen's d

Measures effect size by including difference of means and natural variation (standard deviations). d = |mu - mu'|/std small: d > 0.2 med: d > 0.5 large: d > 0.8
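A sketch of d for two samples, assuming the std in the denominator is the pooled standard deviation of two equal-sized groups. Other pooling conventions exist; this is one common choice.

```python
import math
import statistics

def cohens_d(x1, x2):
    """|mean1 - mean2| divided by the pooled standard deviation (equal group sizes)."""
    m1, m2 = statistics.mean(x1), statistics.mean(x2)
    v1, v2 = statistics.variance(x1), statistics.variance(x2)
    pooled_std = math.sqrt((v1 + v2) / 2)  # assumes len(x1) == len(x2)
    return abs(m1 - m2) / pooled_std

# Hypothetical samples: means differ by 1, each with sample variance 2.5.
d = cohens_d([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
```

Here d is about 0.63, a medium effect on the card's scale.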

Benjamini-Hochberg

Minimize the False Discovery Rate. If many values are significant to a standard, a certain fraction of them should be significant to a much higher standard.
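A sketch of the usual step-up procedure: sort the m p-values, find the largest rank i with p_(i) <= i * alpha / m, and accept everything up to that rank. The p-values are invented for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the set of indices declared significant at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            cutoff = rank  # remember the largest rank that passes
    return set(order[:cutoff])

# Hypothetical p-values: four pass here, while alpha / m = 0.01 alone admits only one.
significant = benjamini_hochberg([0.001, 0.012, 0.025, 0.038, 0.6])
```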

First-Principle Model

Model based on a belief about how the system under investigation works. e.g. a simulation using mathematical laws.

Data-Driven Model

Model based on observed correlation between input and outcome. Effective model in domain in which nothing is known. e.g. Weather Predictions.

Deterministic Model

Model that always returns the same answer given the same input. Good for bug fixing in the model.

Hierarchical Model

Model that has submodels. Deep learning models are both flat and hierarchical: flat data and neural layers.

Live Model

Model updated by live data.

Dot Product

Multiply v_i * u_i for all i and sum the products.

Digraph-based rankings

Optimal ranking is a permutation of the vertices that violates the fewest edges. When the comparisons are consistent, this is a topological sort of a DAG. When they are not, it is the maximum acyclic subgraph problem. Watch the difference between each node's in-degree and out-degree: highly negative means near the front.

Correlation Coefficient

Predictive power of one variable for another. -1: anticorrelated. 1: correlated. 0: no relation.

Binomial Distribution

Probability of getting x outcome-A events in the course of n independent trials, in any order. Almost bell shaped. Discrete function. P(X = x) = nCx * p^x * (1 - p)^(n - x)
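A direct sketch of the probability mass function, with the binomial coefficient "n choose x" computed by `math.comb`. The fair-coin numbers are just an illustration.

```python
import math

def binomial_pmf(x, n, p):
    """P(X = x) = C(n, x) * p^x * (1 - p)^(n - x)."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

# Exactly 2 heads in 4 fair coin flips: 6 of the 16 equally likely outcomes.
two_heads = binomial_pmf(2, 4, 0.5)
```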

Data sources

Proprietary, government, academic, sweat, scraping, spidering, logging

Data munging tools

Python, Perl, R, Matlab, Java, C, C++, Mathematica, Excel, pynb

Box and Whisker plots

Quartiles and median

Stochastic Model

Randomly determined model. Probability distributions returned. Use a fixed seed in the RNG to test.

Elo Rank

Rates all equally then incrementally adjusts each score in response to the results of each comparison.

Line chart

Reasons to display data in a _______: Interpolating and fitting. Dot plots. Function plots.

Table

Reasons to display data in a _________: Show precision. Show scale. Multivariate visualization. Heterogeneous data. Compactness.

