Data Science
Multiple Regression
An extension of simple linear regression, used to predict the value of a dependent variable (also known as the outcome variable) from the values of two or more independent variables (also known as predictor variables).
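A minimal sketch of fitting a multiple regression with scikit-learn; the predictor names and the data below are made up purely for illustration:

```python
# Minimal sketch of a multiple regression with scikit-learn.
# The two predictors (square footage, bedrooms) and the prices are invented.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1400, 3], [1600, 3], [1700, 4], [1875, 4], [1100, 2]])  # two predictors
y = np.array([245000, 312000, 279000, 308000, 199000])                 # outcome

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)   # fitted intercept and one coefficient per predictor
print(model.predict([[1500, 3]]))      # predicted outcome for a new record
```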
Supervised Learning
- Data comes with additional attributes that we want to predict.
- This problem can be either:
  o Classification: samples belong to two or more classes, and we want to learn from already labeled data how to predict the class of unlabeled data. Think of it as a discrete (as opposed to continuous) form of supervised learning: there is a limited number of categories, and for each of the n samples provided, the task is to label it with the correct category or class (a minimal example follows this entry).
  o Regression: the desired output consists of one or more continuous variables.
- Continue: http://scikit-learn.org/stable/tutorial/basic/tutorial.html
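A minimal classification sketch in the spirit of the scikit-learn tutorial linked above; the iris data and the SVC classifier are just one possible choice:

```python
# Learn from labeled samples, then predict classes for new (held-out) samples.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC()                       # any classifier would do here
clf.fit(X_train, y_train)         # learn from already-labeled data
print(clf.predict(X_test[:5]))    # predicted classes for "unlabeled" samples
print(clf.score(X_test, y_test))  # fraction of test samples classified correctly
```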
Pearson product-moment correlation coefficient
- A measure of the strength of the linear relationship between two variables.
- r = 1 is a perfect positive correlation, r = -1 a perfect negative correlation (the closer |r| is to 1, the more highly correlated the two variables are).
- In Python: np.corrcoef (see the sketch below).
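For example, with NumPy (the data here is illustrative):

```python
# np.corrcoef returns a correlation matrix; the off-diagonal entry is
# the Pearson r between the two variables.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # roughly 2*x, so r should be near 1

r = np.corrcoef(x, y)[0, 1]
print(r)   # close to 1 -> strong positive linear relationship
```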
Regression Key Ideas
- Outliers in a regression are records with a large residual.
- Multicollinearity can cause numerical instability in fitting the regression equation.
- A confounding variable is an important predictor that is omitted from a model and can lead to a regression equation with spurious (not meaning what it purports to be) relationships.
- An interaction term between two variables is needed if the effect of one variable depends on the level of the other.
- Polynomial regression can fit nonlinear relationships between predictors and the outcome variable.
- Splines are series of polynomial segments strung together, joining at knots.
- Generalized additive models (GAM) automate the process of specifying the knots in splines.
Omnibus Test
A single hypothesis test of the overall variance among multiple group means.
F-Statistic
A standardized statistic that measures the extent to which differences among group means exceed what might be expected in a chance model.
Cross Validation
A technique to evaluate predictive models by partitioning the original sample into a training set to train the model, and a test set to evaluate it. In k-fold cross-validation, the original sample is randomly partitioned into k equal size subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k-1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. The k results from the folds can then be averaged (or otherwise combined) to produce a single estimation. The advantage of this method is that all observations are used for both training and validation, and each observation is used for validation exactly once.
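A possible sketch of 5-fold cross-validation using scikit-learn's cross_val_score; the dataset and model here are placeholders:

```python
# cross_val_score handles the partitioning, fitting, and scoring of each fold.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)   # 5 folds; each used once as validation
print(scores)          # one score per fold
print(scores.mean())   # averaged into a single estimate
```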
Partial Residual Plot
A way to visualize how well the estimated fit explains the relationship between a predictor and the outcome. Along with detection of outliers, this is probably the most important regression diagnostic for data scientists. The basic idea is to isolate the relationship between a predictor variable and the response, taking into account all of the other predictor variables. It might be thought of as a synthetic outcome value, combining the prediction based on a single predictor with the actual residual from the full regression equation. The partial residual for predictor X_i is the ordinary residual plus the regression term associated with X_i: Partial residual = Residual + b̂_i * X_i, where b̂_i is the estimated regression coefficient for X_i.
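A rough sketch of computing partial residuals by hand with scikit-learn; the simulated data and the column index i are placeholders:

```python
# Partial residuals for predictor column i, computed by hand.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(size=100)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)                           # ordinary residuals, full model
i = 0
partial_residual = residuals + model.coef_[i] * X[:, i]    # residual + regression term for X_i
# Plotting partial_residual against X[:, i] gives the partial residual plot.
```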
Prediction Interval
An uncertainty interval around an individual predicted value.
Confounding Variables
An important predictor that, when omitted, leads to spurious relationships in a regression equation.
Interactions
An interdependent relationship between two or more predictors and the response.
Lift Curve
Consider the records in order of their predicted probability of being 1s. Of, say, the top 10% classified as 1s, how much better did the algorithm do compared to the benchmark of simply picking blindly? If you get a 0.3% response in this top decile instead of the 0.1% you get overall by picking randomly, the algorithm is said to have a lift (also called gains) of 3 in the top decile. A lift chart (gains chart) quantifies this over the range of the data.
Stratified Sampling
Dividing the population into strata and then randomly sampling from each stratum.
Spline Regression
Fitting a smooth curve with a series of polynomial segments
Null Hypothesis
In inferential statistics, the term "null hypothesis" is a general statement or default position that there is no relationship between two measured phenomena, or no association among groups.
TF-IDF
In information retrieval, tf-idf or TFIDF, short for term frequency-inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf-idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. Tf-idf is one of the most popular term-weighting schemes today.
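A minimal sketch using scikit-learn's TfidfVectorizer; the three toy documents are made up:

```python
# Turn a small corpus into a matrix of tf-idf weights.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)       # sparse matrix: documents x terms
print(vectorizer.get_feature_names_out())    # the vocabulary
print(tfidf.toarray().round(2))              # tf-idf weight of each term per document
```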
Heteroskedasticity
Indicates that prediction errors differ for different ranges of the predicted value, and may suggest an incomplete model. In the context of the book's example, heteroskedasticity may indicate that the regression has left something unaccounted for in high- and low-range houses.
Statistical moments
In statistical theory, location (mean, median, etc.) and variability (STD, variance, MAD, etc.) are referred to as the first and second moments of a distribution. The third and fourth moments are called skewness and kurtosis. Skewness refers to whether the data is skewed to larger or smaller values and kurtosis indicates the propensity of the data to have extreme values. Generally, metrics are not used to measure skewness and kurtosis; instead, these are discovered through visual displays.
Bootstrapping
In statistics, bootstrapping is any test or metric that relies on random sampling with replacement. Bootstrapping allows assigning measures of accuracy (defined in terms of bias, variance, confidence intervals, prediction error or some other such measure) to sample estimates.
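For instance, a sketch of a bootstrap estimate of the standard error of the sample mean; the data is simulated:

```python
# Bootstrap the sample mean: resample with replacement, recompute the mean each time.
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=200)   # some skewed sample

boot_means = [
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(1000)
]
print(np.std(boot_means))   # bootstrap estimate of the standard error of the mean
```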
Imputation
In statistics, imputation is the process of replacing missing data with substituted values. When substituting for a data point, it is known as "unit imputation"; when substituting for a component of a data point, it is known as "item imputation".
Type 2 error
Mistakenly concluding an effect is due to chance (when it is real).
Type 1 Error
Mistakenly concluding an effect is real (when it is due to chance).
Down-sampling
Suppose you have two sets of data representing two categories (think upgrade vs. no upgrade). Take the set with more samples and draw from it a sample equal in size to the smaller of the two sets. In Python: resample() (e.g. sklearn.utils.resample; see the sketch below).
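A sketch of down-sampling the majority class with sklearn.utils.resample; the toy DataFrame and the "upgrade" column are made up:

```python
# Balance a 90/10 class split by down-sampling the majority class.
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({
    "upgrade": [0] * 90 + [1] * 10,   # 90 non-upgrades vs 10 upgrades
    "x": range(100),
})

majority = df[df["upgrade"] == 0]
minority = df[df["upgrade"] == 1]

majority_down = resample(
    majority,
    replace=False,                 # sample without replacement
    n_samples=len(minority),       # match the size of the smaller class
    random_state=0,
)
balanced = pd.concat([majority_down, minority])
print(balanced["upgrade"].value_counts())   # now 10 of each class
```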
Sample mean
The average in a given sample
Population mean
The average in the entire population.
Entropy
The average rate at which information is produced by a stochastic source of data (stochastic: randomly determined).
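For a discrete distribution, the Shannon entropy is H = -sum(p_i * log2(p_i)); a quick sketch in NumPy with a made-up distribution:

```python
# Entropy of a discrete probability distribution, in bits per symbol.
import numpy as np

p = np.array([0.5, 0.25, 0.25])       # probabilities of each symbol
entropy = -np.sum(p * np.log2(p))     # H(p) = -sum p_i * log2(p_i)
print(entropy)                        # 1.5 bits per symbol
```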
Interquartile range (IQR)
The difference between the 75th percentile and the 25th percentile
Standard deviation (l2-norm, Euclidean norm)
The square root of the variance (a trimmed standard deviation can also be computed).
ANOVA (analysis of variance)
The statistical procedure that tests for a statistically significant difference among the groups (when we have a comparison of multiple groups, each with numeric data).
Standard error
The variability (standard deviation) of a sample statistic over many samples (not to be confused with standard deviation, which, by itself, refers to variability of individual data values).
Knots
Values that separate spline segments.
Ensemble
Forming a prediction by using a collection of models.
Normalization (Standardization)
Subtract the mean and divide by the standard deviation (the result of this calculation for a data point is the Z-score).
Central tendency
An estimate of where most of the data is located.
Generalized Additive Models
Spline models with automated selection of knots
Polynomial Regression
Adds polynomial terms (squares, cubes, etc.) to a regression.
False Discovery Rate
Across multiple tests, the rate of making a Type 1 error. Type 1 error: mistakenly concluding an effect is real (when it is due to chance).
Pairwise Comparison
A hypothesis test (e.g. of means) between two groups among multiple groups.
Weighted mean
Multiply each data value x_i by a weight w_i and divide the sum of the weighted values by the sum of the weights (see the np.average sketch below). Two motivations:
- Some values are intrinsically more variable than others, and highly variable observations are given a lower weight. E.g. if we are taking the average from multiple sensors but some sensors are less accurate, we weight the less accurate sensors less (downweight them).
- The data collected does not equally represent the different groups we are interested in measuring. E.g. because of the way an online experiment was conducted, we may not have a set of data that accurately reflects all groups in the user base. To correct for that, we can give a higher weight to the values from the groups that were underrepresented.
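A quick sketch with np.average; the sensor readings and weights are invented:

```python
# Weighted mean: the noisy third sensor gets a lower weight.
import numpy as np

readings = np.array([10.1, 9.8, 14.0])
weights = np.array([1.0, 1.0, 0.2])            # lower weight for the less accurate sensor

print(np.average(readings, weights=weights))   # weighted mean
print(readings.mean())                         # plain mean, pulled up by the noisy sensor
```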
Occam's Razor
The more assumptions you have to make, the less likely the explanation. Applied to models, this means that if you have two models with similar performance indicators, choose the one with fewer features, because you are making fewer assumptions. Essentially, the simpler the better. Note that including additional variables always reduces RMSE and increases R^2 on the training data, so these metrics alone will not favor the simpler model.
Reference Coding (Treatment Coding)
The most common type of coding used by statisticians, in which one level of a factor is used as a reference and the other levels are compared to that level.
N Choose I
The number of different groups of i objects that can be chosen from a set of n objects.
Alpha
The probability threshold of "unusualness" that chance results must surpass, for actual outcomes to be deemed statistically significant.
Chi-Squared Test
- (The book has a great example of this, involving clicks per page.)
- A statistical hypothesis test where the sampling distribution of the test statistic is a chi-squared distribution when the null hypothesis is true.
- The chi-squared test is used to determine whether there is a significant difference between the expected frequencies and the observed frequencies in one or more categories (see the sketch below).
- Most standard uses of the chi-square test or Fisher's exact test are not terribly relevant for data science. In most experiments, whether A-B or A-B-C..., the goal is not simply to establish statistical significance, but rather to arrive at the best treatment. For this purpose, multi-armed bandits offer a more complete solution. One use for chi-square is in determining appropriate sample sizes for web experiments.
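A sketch of a chi-squared test of independence on a made-up clicks-per-page table, using scipy.stats.chi2_contingency:

```python
# Rows: page A / page B; columns: clicks / no-clicks. Counts are invented.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([
    [14, 986],    # page A
    [8, 992],     # page B
])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value)   # small p-value -> observed frequencies differ from expected
```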
F1 Score
- A function of precision and recall: F1 = 2 * (precision * recall) / (precision + recall).
- The F1 score might be a better measure to use if we need to seek a balance between precision and recall AND there is an uneven class distribution (a large number of actual negatives).
Weighted median
- A value such that the sum of the weights is equal for the lower and upper halves of the sorted list.
- Robust to outliers.
- Note: numpy.average (with the weights argument) computes a weighted mean; a weighted median needs a dedicated implementation (see the sketch below).
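A simple hand-rolled sketch of a weighted median: sort the values, then take the first value at which the cumulative weight reaches half the total weight.

```python
# Weighted median via cumulative weights.
import numpy as np

def weighted_median(values, weights):
    order = np.argsort(values)
    values, weights = np.asarray(values)[order], np.asarray(weights)[order]
    cum_weights = np.cumsum(weights)
    # first position where the cumulative weight covers half the total weight
    idx = np.searchsorted(cum_weights, 0.5 * cum_weights[-1])
    return values[idx]

print(weighted_median([1, 2, 3, 4, 5], [1, 1, 1, 1, 10]))   # 5, pulled by the heavy weight
```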
Regression to the Mean
- Phenomenon involving successive measurements on a given variable: extreme observations tend to be followed by more central ones. Attaching special focus and meaning to the extreme value can lead to a form of selection bias.
P-Value
- The frequency with which the chance model produces a result more extreme than the observed result. I.e. a p-value of .308 indicates that we would expect to achieve a result as extreme as this, or more extreme, by random chance over 30% of the time.
- Useful for a data scientist in the following way: a p-value is a useful metric in situations where you want to know whether a model result that appears interesting and useful is within the range of normal chance variability.
Root Mean Square Error RMSE
- The standard deviation of the residuals (prediction errors). Residuals are a measure of how far from the regression line data points are; RMSE is a measure of how spread out these residuals are. In other words, it tells you how concentrated the data is around the line of best fit.
Variance (mean squared error)
- The sum of squared deviations from the mean divided by n-1, where n is the number of data values.
- Remember that variance is particularly sensitive to outliers, since it involves squaring values.
One Hot Encoder
A common type of coding used in the machine learning community in which all factor levels are retained. While useful for certain machine learning algorithms, this approach is not appropriate for multiple linear regression.
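A sketch with pandas.get_dummies, using a made-up property-type factor; drop_first=True drops one level to give reference coding, the form suitable for regression:

```python
# One hot encoding of a categorical variable with pandas.
import pandas as pd

df = pd.DataFrame({"PropertyType": ["House", "Condo", "Townhouse", "House"]})

print(pd.get_dummies(df["PropertyType"]))                   # all levels retained (one hot)
print(pd.get_dummies(df["PropertyType"], drop_first=True))  # reference coding for regression
```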
Receiver Operating Characteristic (ROC) curve
A curve that captures the tradeoffs between recall and specificity. The recall (sensitivity) is plotted on the y-axis and the specificity is plotted on the x-axis (equivalently, many references plot 1 - specificity, the false positive rate, on the x-axis).
Boosting
A general technique to fit a sequence of models by giving more weight to the records with large residuals for each successive round.
Bagging
A general technique to form a collection of models by bootstrapping the data.
Median absolute deviation
A robust estimate of variability that is not influenced by extreme values (unlike the variance, standard deviation, and mean absolute deviation).
Random Forest
A type of bagged estimate based on decision tree models.
Deviation Coding
A type of coding that compares each level against the overall mean as opposed to the reference level.
Dummy variables
Binary 0-1 variables derived by recoding factor data for use in regression and other models.
Perturbation
Definition: a deviation of a system, moving object, or process from its regular or normal state or path, caused by an outside influence. In data science: a random amount up or down from each imputed estimate.
Extrapolation
Extension of a model beyond the range of the data used to fit it.
Statistical Significance
In statistical hypothesis testing, a result has statistical significance when it is very unlikely to have occurred given the null hypothesis.
IID (Independent and Identically Distributed)
In probability theory and statistics, a sequence or other collection of random variables is independent and identically distributed if each random variable has the same probability distribution as the others and are all mutually independent.
Time series data
Records successive measurements of the same variable. Raw material for statistical forecasting methods.
Apache Spark, Chapter 1 Learnings
Spark is a distributed programming model in which the user specifies transformations. Multiple transformations build up a directed acyclic graph of instructions. An action begins the process of executing that graph of instructions, as a single job, by breaking it down into stages and tasks to execute across the cluster. The logical structures that we manipulate with transformations and actions are DataFrames and Datasets. To create a new DataFrame or Dataset, you call a transformation. To start computation or convert to native language types, you call an action.
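A minimal PySpark sketch of the transformation/action distinction described above; it assumes a local Spark installation, and the DataFrame is just a range of numbers:

```python
# Transformations build the DAG lazily; the action triggers execution as a job.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

df = spark.range(1000)                                # a DataFrame of ids 0..999
filtered = df.where("id % 2 = 0")                     # transformation: nothing executes yet
doubled = filtered.selectExpr("id * 2 as doubled")    # another transformation (extends the DAG)

print(doubled.count())                                # action: runs the job as stages and tasks
```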
Maximum Likelihood Estimation (MLE)
The algorithm used to fit a logistic regression model, because unlike linear regression, there is no closed form solution in logistic regression. The MLE finds the solution such that the estimated log odds best describes the observed outcome.
Posterior Probability
The probability of an outcome after the predictor information has been incorporated (in contrast to the prior probability of outcomes, not taking predictor information into account).
Main Effects
The relationship between a predictor and the outcome variable, independent from other variables.
Nonlinear Regression
When statisticians talk about nonlinear regression, they are referring to models that can't be fit using least squares. What kind of models are nonlinear? Essentially all models where the response cannot be expressed as a linear combination of the predictors or some transform of the predictors. Nonlinear regression models are harder and computationally more intensive to fit, since they require numerical optimization. For this reason, it is generally preferred to use a linear model if possible.
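A sketch of fitting a nonlinear model y = a * exp(b * x) by numerical optimization with scipy.optimize.curve_fit, on simulated data:

```python
# Nonlinear regression: the response is not a linear combination of the predictors,
# so the fit is done iteratively rather than with a closed-form solution.
import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b):
    return a * np.exp(b * x)

rng = np.random.default_rng(0)
x = np.linspace(0, 2, 50)
y = 1.5 * np.exp(0.8 * x) + rng.normal(scale=0.1, size=x.size)

params, covariance = curve_fit(model, x, y, p0=[1.0, 1.0])   # numerical optimization
print(params)   # estimates of a and b
```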
Multicollinearity (Collinearity)
When the predictor variables have perfect, or near-perfect, correlation, the regression can be unstable or impossible to compute.