CSE 519: Data Science
Artifacts
Systematic problems. Detectable and correctable.
Normality Testing
Tests normality of a given distribution.
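A minimal Python sketch of one common normality test (assumes numpy and scipy are installed; the sample below is synthetic, for illustration only):

    # Normality testing sketch using the D'Agostino-Pearson test (scipy assumed)
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    sample = rng.normal(loc=0.0, scale=1.0, size=500)  # synthetic, illustrative data

    stat, p = stats.normaltest(sample)
    print(f"statistic={stat:.3f}, p-value={p:.3f}")
    # Large p-value: no evidence against normality; small p-value: reject normality.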
monkey
Baseline that guesses randomly; shows what value of the evaluation statistic pure chance achieves.
sharp
Tries to make the evaluation statistic look bad by achieving a high score with a useless classifier.
Angle between vectors
cos θ = (u · v) / (||u|| * ||v||). Measures similarity in essentially the same way as the Pearson correlation coefficient.
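A small numpy sketch of this cosine-similarity formula (the vectors are illustrative):

    # Cosine similarity between two vectors (numpy assumed)
    import numpy as np

    u = np.array([1.0, 2.0, 3.0])
    v = np.array([2.0, 4.0, 6.5])

    cos_theta = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    print(cos_theta)  # close to 1.0 for nearly parallel vectors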
Data storage
CSV, XML, SQL, JSON, protocol buffers
Tufte's Principles for Analytical Design
-Comparison -Causality -Multivariate Analysis -Integration of Evidence -Documentation -Content
recall_i
= C[i,i] / sum_j C[i,j]
precision_i
= C[i,i] / sum_j C[j,i]
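A sketch of these two formulas applied to a confusion matrix C, where C[i, j] counts instances of true class i labeled as class j (numpy assumed; the counts are made up):

    # Per-class recall and precision from a confusion matrix (rows = true class, columns = predicted)
    import numpy as np

    C = np.array([[50,  5,  0],
                  [10, 30,  5],
                  [ 0,  5, 45]])  # illustrative counts

    recall = np.diag(C) / C.sum(axis=1)     # recall_i = C[i,i] / sum_j C[i,j]
    precision = np.diag(C) / C.sum(axis=0)  # precision_i = C[i,i] / sum_j C[j,i]
    print(recall, precision)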
Good scoring function
Bell shaped distribution. Easily computable. Easily understandable. Monotonic interpretations of variables. Produces generally satisfying results on outliers. Uses systematically normalized variables. Breaks ties in meaningful ways.
Mean Squared Error (MSE)
Boils error distributions down to a single number: MSE(y, y') = (1/n) * Sum_i (y'_i - y_i)^2. Median squared error is better for noisier instances.
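A quick numpy sketch contrasting mean and median squared error (the arrays are illustrative, with one deliberate outlier):

    # Mean vs. median squared error (numpy assumed)
    import numpy as np

    y_true = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # last value is an outlier
    y_pred = np.array([1.1, 1.9, 3.2, 4.1,  10.0])

    sq_err = (y_pred - y_true) ** 2
    print("MSE:", sq_err.mean())        # dominated by the outlier
    print("MedSE:", np.median(sq_err))  # robust to the outlier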
Proxy
Easier-to-find data that should correlate well with the desired but nonexistent or hard-to-find gold standard / ground truth.
Recommended Data Partition
Training data 60%, testing data 20%, evaluation data 20%.
Data Compatibility
Units, numerical representation, name unification, time/date unification, financial unification.
Receiver Operating Characteristic Curve (ROC)
Visual representation of the complete space of options in putting together a classifier. Each point represents a particular classifier threshold, defined by its FP and FN rates.
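A hedged sketch of building an ROC curve with scikit-learn (assumes sklearn is installed; labels and scores below are synthetic):

    # ROC curve and AUC from true labels and classifier scores (scikit-learn assumed)
    import numpy as np
    from sklearn.metrics import roc_curve, roc_auc_score

    y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])                     # synthetic labels
    scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.3])    # synthetic scores

    fpr, tpr, thresholds = roc_curve(y_true, scores)  # each threshold gives one (FPR, TPR) point
    print("AUC:", roc_auc_score(y_true, scores))      # closer to 1.0 is better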
Linear Model
Weigh each feature by a coefficient reflecting its importance and sum them up to produce a score. e.g. Linear Regression.
Cartogram
a map on which statistical information is shown in diagrammatic form.
Relative Error
epsilon = (y - y')/y Unitless. Corrects for size.
Elo Rank Update
r'(A) = r(A) + k(S_A - mu_A). S_A is the result of the comparison, either 1 or -1. mu_A is the expected result, in the range [-1, 1] (0 if both have the same skill level): mu_A = P(A > B) + (-1)(1 - P(A > B)).
bias vs variance
underfit vs. overfit
Elo Skill Difference
x = r(A) - r(B). This is converted to a win probability using the logit function.
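A minimal sketch combining the Elo update and skill-difference cards, using the logistic conversion described under the Logit function card; the constants k and c are illustrative choices, not canonical values:

    # Elo-style rating update (constants k and c are illustrative)
    import math

    def win_probability(r_a, r_b, c=0.01):
        # P(A beats B) from the rating difference x = r(A) - r(B) via a logistic curve
        x = r_a - r_b
        return 1.0 / (1.0 + math.exp(-c * x))

    def elo_update(r_a, r_b, result, k=32, c=0.01):
        # result = 1 if A won, -1 if A lost; returns A's new rating
        p = win_probability(r_a, r_b, c)
        mu_a = p + (-1) * (1 - p)      # expected result in [-1, 1]
        return r_a + k * (result - mu_a)

    print(elo_update(1500, 1400, result=1))   # small gain: A was expected to win
    print(elo_update(1500, 1400, result=-1))  # larger loss: an upset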
Absolute Error
|y - y'|. Weighs larger values of y as more important. The histogram of errors should be bell shaped.
Gold Standard
Set of labels and answers trusted to be correct.
Arrow's Impossibility Theorem
Shows that no election system for aggregating permutations of preferences satisfies all of: a complete system; transitive results; unanimity (if all prefer A to B, the system prefers A to B); no dictators; and independence (the preference of A over B should be independent of other comparisons).
Covariance
Determines the sign of the correlation. Cov(X, Y) = (1/n) * Sum_i (X_i - E[X])(Y_i - E[Y]).
T-test
Significance test. Evaluates whether the population means of two samples are different. t = (E[x1] - E[x2]) / sqrt(var1/n1 + var2/n2). Requires a table lookup, an alpha value, and the degrees of freedom to determine whether t is above the threshold of significance.
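A sketch of a two-sample t-test with scipy, which handles the table lookup and degrees of freedom internally (samples are synthetic):

    # Two-sample t-test (scipy assumed; samples are synthetic)
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    x1 = rng.normal(0.0, 1.0, size=100)
    x2 = rng.normal(0.3, 1.0, size=100)

    t, p = stats.ttest_ind(x1, x2, equal_var=False)  # Welch's variant
    print(f"t={t:.3f}, p={p:.3f}")  # p < alpha (e.g. 0.05) => means differ significantly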
KS-test
Significance test. Compares distributions by quantifying the maximum y-gap between their CDFs. Used for normality testing. Has fewer technical assumptions and variants than the t-test, so it can be applied to many problems.
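A sketch of the KS test with scipy, here used for normality testing by comparing a standardized sample against the standard normal CDF (the sample is synthetic; estimating the mean and std from the data makes the p-value approximate):

    # KS test against the standard normal CDF (scipy assumed; sample is synthetic)
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    sample = rng.exponential(scale=1.0, size=300)   # clearly non-normal data
    z = (sample - sample.mean()) / sample.std()     # standardize before comparing

    d, p = stats.kstest(z, "norm")  # D = maximum gap between empirical and normal CDFs
    print(f"D={d:.3f}, p={p:.3g}")  # tiny p-value => reject normality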
Baseline Model
Simplest reasonable model that can be compared against.
F-Score
Single score for a classifier. Harmonic mean of precision and recall: F = 2 * precision * recall / (precision + recall). F <= the arithmetic mean; a high F-score requires doing decently on both positives and negatives, balancing precision and recall. Best of all classifier evaluation statistics.
Accuracy
(TP + TN) / (TP + TN + FP + FN). Ratio of correct predictions to total predictions. When |P| << |N| this has limitations; it is essentially a misleading statistic when the class sizes are substantially different.
Statistical Significance (2)
Tells you how unlikely it is that something is due to chance but not whether it is important. Measures confidence that there is a genuine difference between two given distributions.
Effect size
The magnitude of difference between the two distributions.
Monte Carlo Circle Sampling
Generate points (x, y) with each coordinate uniform in [-r, r]; reject a point if sqrt(x^2 + y^2) > r.
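A sketch of the rejection step, used here to estimate pi (numpy assumed; r = 1 for simplicity):

    # Monte Carlo circle sampling by rejection; the kept fraction estimates pi/4
    import numpy as np

    rng = np.random.default_rng(3)
    n = 100_000
    x = rng.uniform(-1.0, 1.0, size=n)   # r = 1
    y = rng.uniform(-1.0, 1.0, size=n)

    inside = np.sqrt(x**2 + y**2) <= 1.0   # keep points inside the circle
    print("pi estimate:", 4 * inside.mean())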
Page Rank
Google's way of ordering nodes by importance. Rewards in-links and strength of source.
Zipf's Law
Governs the distribution of word usage in natural languages. kth most popular word is used only 1/kth as frequently as the most popular word.
Data Imputation Methods
Heuristic-based / reasonable guess. Mean value: does not bias the mean and is generally applicable. Random values from the column: good when there are lots of missing values. Nearest neighbor: requires a distance function; should be more accurate than the mean when there are systematic reasons to explain the variance.
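A hedged pandas sketch of mean-value and random-value imputation (the DataFrame and column name are made up; nearest-neighbor imputation would additionally need a distance function):

    # Mean-value and random-value imputation on a column with missing entries (pandas/numpy assumed)
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"height": [1.7, np.nan, 1.8, 1.6, np.nan, 1.75]})  # illustrative data

    mean_filled = df["height"].fillna(df["height"].mean())  # does not bias the mean

    rng = np.random.default_rng(4)
    observed = df["height"].dropna().to_numpy()
    random_filled = df["height"].copy()
    random_filled[random_filled.isna()] = rng.choice(observed, size=random_filled.isna().sum())
    print(mean_filled.tolist(), random_filled.tolist())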
Precision
How often the classifier is right when it dares to say positive: precision = TP / (TP + FP). Impossible for the sharp or monkey baselines to score high. High precision is hard to achieve with unbalanced class sizes.
Recall
Fraction of all positive instances the classifier gets right: recall = TP / (TP + FN). High recall => few FN. Tradeoff between precision and recall: braver predictions are less likely to be right. Recall = accuracy iff the classes are balanced.
Bonferroni Correction
How you found a correlation can be as important as the strength of the correlation itself. When testing n different hypotheses simultaneously, a result's p-value must fall below alpha/n in order to be significant at level alpha. Safeguards against accepting the significance of a lone success among many trials.
number of samples
Statistical significance depends upon the _____________ ____ ___________ while the effect size does not.
Spearman Rank Correlation Coefficient
Sums the squared rank differences over all points (in sorted order) and normalizes; counts the number of out-of-order points. Gives high scores to monotonic non-linear functions. Less sensitive to outliers than Pearson.
Borda's Weights
Options: linear weights; symmetric? sample weights from a bell-shaped curve; non-symmetric? sample from half a bell-shaped curve. Domain dependent!
Bayes Theorem
P(A|B) = P(B|A) * P(A) / P(B), i.e. posterior = likelihood * prior / marginal. Allows us to swap conditions: P(results | data) <=> P(data | results).
Cross-Validation
Partition the data into k equal-sized sets, then train k distinct models, each evaluated on its held-out set. Average the performances; this also yields a standard deviation of performance. Worth it on large data sets.
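A sketch of k-fold cross-validation with scikit-learn (the model and synthetic data are illustrative choices):

    # k-fold cross-validation: average performance plus its standard deviation (scikit-learn assumed)
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)  # synthetic data
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)   # k = 5 folds

    print("mean accuracy:", scores.mean(), "+/-", scores.std())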
Score
Reduces n dimensional records to a single value, highlighting a property of the data.
Errors
Data lost in acquisition.
Rank
A sorting of record scores.
Squared Error
Absolute Error squared. Dominated by outliers.
scoring function
Area under the ROC curve can be used to measure the quality of the _________ _________ defining the classifier. The closer the AUC is to one, the better.
Normal Distribution
Bell shaped and continuous, e.g. Gaussian noise. Defined by mean mu and standard deviation sigma. 68-95-99.7% rule at 1, 2, and 3 standard deviations respectively.
Confusion Matrix
C[x,y] reports the number or fraction of instances of class x which get labeled as class y. C[i, i] is the number of correctly labeled instances. Sparse rows => poorly represented classes in training data. Sparse columns => labels classifier is reluctant to assign. In either case, consider removing the label.
Permutation Tests
Conduct many trials against randomly permuted data to establish the significance of real observations. The real score should lie on the extreme tail among the random permutations to be significant. Try at least k = 1000 permutations.
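A sketch of a permutation test for a difference in means (samples are synthetic; k = 1000 permutations as suggested above):

    # Permutation test: how extreme is the observed mean difference among random relabelings?
    import numpy as np

    rng = np.random.default_rng(5)
    a = rng.normal(0.0, 1.0, size=50)
    b = rng.normal(0.5, 1.0, size=50)

    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])

    k = 1000
    count = 0
    for _ in range(k):
        rng.shuffle(pooled)                      # random relabeling of the pooled data
        diff = abs(pooled[:50].mean() - pooled[50:].mean())
        if diff >= observed:
            count += 1

    print("permutation p-value:", count / k)     # small => observation is on the extreme tail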
Logit function
Converts a real variable x to a probability p in [0, 1]: p = 1 / (1 + e^(-cx)). c governs how steep the interpolation is between A having complete advantage and B having complete advantage, i.e. whether small differences in skill translate into large differences in the probability of winning.
Causation
Correlation does not mean causation...
Amplifying a Dataset
Create negative examples from a prior distribution. Perturb real examples to create a similar but synthetic one. Give partial credit when you can: Squeeze everything out of what you got.
r^2 (effect size)
Reflects proportion of the variance in one variable explained by the other. Measures effect size. Small, medium, and large are just squares of r and represent percentage of variance explained.
Bar plots and Pie charts
Relative proportions of categorical variables.
Evaluation Data
Data used to confirm the performance of the final model right before it goes into production.
Testing Data
Data used to evaluate how good the model is.
Training Data
Data used to study domain and set parameters of model.
r
Degree of linear relationship between two variables. Measures effect size. Small: |r| > 0.2; medium: |r| > 0.5; large: |r| > 0.8.
Pearson Correlation Coefficient
Degree to which a linear predictor of the form y = mx + b can fit the observed data: r = cov(X, Y) / (std(X) * std(Y)). Good for linear predictors; fails on functions like y = |x|. More sensitive to outliers than Spearman.
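A sketch contrasting Pearson and Spearman on a monotonic nonlinear relationship (scipy assumed; the data is synthetic):

    # Pearson vs. Spearman on y = x**3: Spearman rewards monotonicity, Pearson wants linearity
    import numpy as np
    from scipy import stats

    x = np.linspace(-3, 3, 100)
    y = x ** 3

    print("Pearson:", stats.pearsonr(x, y)[0])    # below 1: a line y = mx + b fits imperfectly
    print("Spearman:", stats.spearmanr(x, y)[0])  # exactly 1: the relationship is monotonic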
Statistical Significance
Depends on n and r. Significant if alpha <= 0.05, where alpha is the chance that we would observe a correlation as strong as r in a random set of n points.
False Discovery Rate (FDR)
Discovering too many correlations so that many are statistically significant. Are they really that important?
Blackbox Model
Does its job in an unknown manner, e.g. deep learning / neural networks.
Occam's Razor
Does a simpler explanation fit the data just as well?
Bias
Error from incorrect assumptions built into a model.
Variance
Error from sensitivity to fluctuations in the training set. High-variance models do better on training data than on testing data.
Root Mean Squared Error (RMSD)
Error value whose magnitude is interpretable on the same scale as the original values. = sqrt(MSE)
r^2
Estimates fraction of variance in Y explained by X in a simple linear regression.
Generating Random Permutations
For i = 1 to n do a[i] = i. For i = 1 to n - 1 do swap(a[i], a[Random(i, n)]).
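A Python sketch of the swap-based method above (the Fisher-Yates shuffle); Random(i, n) is taken to mean a uniform integer in the inclusive range [i, n]:

    # Fisher-Yates style random permutation of 1..n (standard library only)
    import random

    def random_permutation(n):
        a = list(range(1, n + 1))          # a[i] = i + 1 (0-indexed Python list)
        for i in range(n - 1):
            j = random.randint(i, n - 1)   # uniform index in [i, n-1]
            a[i], a[j] = a[j], a[i]        # swap(a[i], a[Random(i, n)])
        return a

    print(random_permutation(10))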
Borda's Method
For merging multiple rankings. Assign a weight to each of the n positions in a permutation. For each of the n items, sum up the weights of its positions over all k input rankings. Sort these n scores to produce the final ranking.
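A sketch of Borda's method with linear weights (the input rankings are made up; weight n - position gives first place the most credit):

    # Borda count over k input rankings with linear weights
    from collections import defaultdict

    rankings = [                      # illustrative input permutations (best first)
        ["A", "B", "C", "D"],
        ["B", "A", "D", "C"],
        ["A", "C", "B", "D"],
    ]

    n = len(rankings[0])
    scores = defaultdict(int)
    for ranking in rankings:
        for position, item in enumerate(ranking):
            scores[item] += n - position      # linear weights: n, n-1, ..., 1

    final = sorted(scores, key=scores.get, reverse=True)
    print(final, dict(scores))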
Histogram
Frequency distribution graph showing a pdf or cdf; bins = ceil(n/25).
Poisson Distribution
Frequency of intervals between rare events, e.g. living and dying: doing something repeatedly until event R happens. P(x) = (e^-mu * mu^x) / x!
Inverse Transform Sampling
Generate a uniform random value p in [0, 1] and interpret it as a probability. Report the x for which CDF(X <= x) = p, i.e. x = CDF^{-1}(p).
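A sketch of inverse transform sampling for an exponential distribution, where the inverse CDF has a closed form (the rate lam = 1.5 is an illustrative choice):

    # Inverse transform sampling: uniform p in [0, 1] -> x = CDF^{-1}(p)
    import math
    import random

    def sample_exponential(lam):
        p = random.random()                  # uniform probability in [0, 1)
        return -math.log(1.0 - p) / lam      # inverse of CDF(x) = 1 - exp(-lam * x)

    samples = [sample_exponential(1.5) for _ in range(5)]
    print(samples)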
Descriptive Model
Provides insight into why it makes its decisions, e.g. the simplicity of linear regression.
Transpose
Interchange the rows and columns of a matrix: (M^T)_{i,j} = M_{j,i}. Entry (i, j) of M * M^T measures how similar items (rows) i and j are; for mean-centered data, M * M^T is proportional to a covariance matrix.
Power Law Distribution
Long tails in distribution. P(X = x) = cx^-a. Show as straight lines on logarithmic frequency plots. Mean and standard deviation are useless. Distribution is scale invariant. Reflect inequalities in the world.
Outlier Detection
Look at largest and smallest. View graphics. Can be indicative of systematic problem or data entry mistakes.
Color
Marks class distinctions. Encodes numerical values. Use clear, linear color scales.
Scatter Plot / Multivariate Plot
Bivariate data: scatter plot; also usable for massive data sets.
Visualization Rules
Maximize the data/ink ratio. Minimize the lie factor = size of effect shown in the graphic / size of effect in the data. Minimize chartjunk. Use proper scaling and labeling.
Cohen's d
Measures effect size by combining the difference of means with the natural variation (standard deviation): d = |mu - mu'| / std. Small: d > 0.2; medium: d > 0.5; large: d > 0.8.
Benjamini-Hochberg
Minimizes the false discovery rate. If many values are significant to a given standard, a certain fraction of them should be significant to a much higher standard.
First-Principle Model
Model based on beliefs about how the system under investigation works, e.g. a simulation using mathematical laws.
Data-Driven Model
Model based on observed correlation between input and outcome. Effective model in domain in which nothing is known. e.g. Weather Predictions.
Deterministic Model
Model that always returns the same answer given the same input. Good for bug fixing in the model.
Hierarchical Model
Model that has submodels. Deep learning models are both flat and hierarchical: flat data and neural layers.
Live Model
Model updated by live data.
Dot Product
Sum of u_i * v_i over all i.
Digraph-based rankings
Optimal ranking is a permutation of the vertices that violates the fewest edges, a.k.a. a topological sort of the DAG when the rankings are consistent. When not consistent: the maximum acyclic subgraph problem. Watch the difference between a node's in-degree and out-degree: highly negative means the node belongs near the front.
Correlation Coefficient
Predictive power of one variable for another. -1: anticorrelated; 0: no relation; 1: correlated.
Binomial Distribution
Probability of getting exactly x outcome-A events in the course of n independent trials, regardless of order. Almost bell shaped. Discrete. P(X = x) = nCx * p^x * (1 - p)^(n - x).
Data sources
Proprietary, government, academic, sweat, scraping, spidering, logging.
Data munging tools
Python, Perl, R, Matlab, Java, C, C++, Mathematica, Excel, pynb
Box and Whisker plots
Quartiles and median
Stochastic Model
Randomly determined model; returns probability distributions. Use a fixed seed in the RNG for testing.
Elo Rank
Rates all players equally at the start, then incrementally adjusts each score in response to the result of each comparison.
Line chart
Reasons to display data in a _______: Interpolating and fitting. Dot plots. Function plots.
Table
Reasons to display data in a _________: Show precision. Show scale. Multivariate visualization. Compact display of heterogeneous data.