Data 100
Span of X
the set of all linear combinations of columns of X.
Variance Var[x]
the spread of x: Var[x] = E[(x - E[x])²], the expected squared deviation of x from its mean
Target population
the total group to be studied or described and from whom samples may be drawn
Constant Model
theta is a single number (1D). R(theta) = 1/n sum (y_i - theta)². The constant model's error is at least as high as the SLM's error.
Box Plot Statistics
upper/median/lower quartile = np.percentile(data, 75/50/25); outliers are points below lower quartile - 1.5·IQR or above upper quartile + 1.5·IQR (IQR = upper quartile - lower quartile)
KDE (Kernel density estimation)
used to estimate a smooth density curve; increasing the bandwidth alpha makes the curve smoother (but can hide multimodality). sns.kdeplot()
Single Observation
^y = x^T(theta): the prediction for one observation is the dot product of its feature vector x with the parameter vector theta
Change in Entropy (ΔWS)
ΔWS = WS of node - WS of all its children combined
Accuracy
(TP+TN)/(TP+FP+FN+TN)
pandas: DataFrame to Series
.squeeze()
pandas: Series to DataFrame
.to_frame()
Ways to pick k for Clustering
1. Elbow method: look for the elbow in the inertia vs. K graph. 2. Silhouette score (S): score each data point based on how well it sits inside its own cluster. A = avg. distance to points in its own cluster; B = avg. distance to points in the closest other cluster; S = (B - A) / max(A, B). High S = near other points in its own cluster; low S = near points in a different cluster; S = 1 when all points in a cluster are on top of each other; S is negative if the point is in the wrong cluster. (Code sketch below.)
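A minimal sketch of both methods with sklearn, assuming X is an (n, d) array of data points (hypothetical data):
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
X = np.random.rand(100, 2)                      # hypothetical data points
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    # inertia for the elbow plot, silhouette score for direct comparison
    print(k, km.inertia_, silhouette_score(X, km.labels_))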
Ways to avoid overfit in Decision Tree
1. Prevent growth: don't split nodes that are too small, too many levels deep, or whose split gives too small a ΔWS. 2. Tree pruning: create a validation set before building the tree; if a split has no impact on validation error, don't keep it.
Average Loss (Empirical Loss)
1/n sum L(y_i, ^y_i): the loss averaged over all n data points
Density of a plot
percent in bin = 100 * (individuals in the bin / total population); density = percent in bin / bin width. For example, if 30 of 200 individuals fall in a bin of width 5, the bin holds 15% of the data and its density is 3% per unit.
PCA (Principal component analysis)
A dimensionality-reduction method used to reduce large data sets by transforming a large set of variables into a smaller one that still contains most of the information. u, s, vt = np.linalg.svd(rectangle, full_matrices=False); usig = UΣ = u * s has the same shape as the original matrix. To keep only 1 component (rank-1 reconstruction): pd.DataFrame(usig[:, 0:1] @ vt[0:1, :]). In a scatter plot of the components, PC 1 is the horizontal axis and PC 2 the vertical axis (in the card's plot, high PC 1 is toward the left and high PC 2 toward the top; the sign of each PC is arbitrary).
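A minimal PCA-via-SVD sketch, assuming rectangle is the numeric matrix from the card above; the columns should be centered before the SVD for the variance interpretation to hold:
import numpy as np
X = rectangle - np.mean(rectangle, axis=0)      # center each column first
u, s, vt = np.linalg.svd(X, full_matrices=False)
pcs = u * s                                     # principal components, one row per observation
first_pc = pcs[:, 0]                            # projection onto PC 1
rank1_approx = pcs[:, 0:1] @ vt[0:1, :]         # best rank-1 reconstruction of X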
Regression line
A line that minimizes the mean squared error of estimation among all straight lines: y = a + bx, where b = r(SDy/SDx), a = mean(y) - b·mean(x), and r = correlation coefficient = 1/n sum ((x_i - mean(x))/SDx)·((y_i - mean(y))/SDy)
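A sketch of these formulas in numpy on hypothetical data (np.std uses the 1/n convention, matching the card):
import numpy as np
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
y = np.array([2.0, 1.9, 3.1, 3.9, 5.2])
x_std = (x - x.mean()) / np.std(x)        # x in standard units
y_std = (y - y.mean()) / np.std(y)        # y in standard units
r = np.mean(x_std * y_std)                # correlation coefficient
b = r * np.std(y) / np.std(x)             # slope
a = y.mean() - b * x.mean()               # intercept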
Random Forest
A machine learning model that builds many different trees and aggregates their outputs (majority vote for classification, mean for regression). Individual trees can have errors, but the ensemble reduces the overfitting of a single decision tree.
Covariance
A measure of linear association between two variables. Positive values indicate a positive relationship; negative values indicate a negative relationship
Scree plot
A plot of the eigenvalues (variance captured) against the principal-component number, in order of extraction. plt.plot(s**2/np.sum(s**2)) plots the fraction of total variance captured by each PC.
Regular Expression (RegEx)
A pattern that strings (e.g., the values in a table column) must match. * 0 or more; + 1 or more; ? 0 or 1; {x} matches exactly x times; {x,y} matches x to y times; .* matches as many characters as possible (greedy); .*? as few as possible (non-greedy); \w word character; \s single whitespace; \d decimal digit [0-9]; \ escapes a special character; [] a set of equivalent single characters; | or; () capture group, e.g. A(CG)+ matches ACGCGCG; ^ matches the start and $ the end; [^...] any character not in the set
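A few of these patterns with Python's re module, on hypothetical strings (a minimal sketch):
import re
re.search(r'A(CG)+', 'xxACGCGCGxx').group()   # 'ACGCGCG'  (capture group + repetition)
re.findall(r'\d{3}-\d{4}', 'call 555-1234')   # ['555-1234']
re.sub(r'\s+', ' ', 'too   many   spaces')    # 'too many spaces'
re.findall(r'^\w+', 'first word only')        # ['first']  (^ anchors the start)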
Multiple Linear Regression
A statistical method used to model the relationship between one dependent (or response) variable and two or more independent (or explanatory) variables by fitting a linear equation to observed data
Linear Combinations
A sum of scalar multiples of vectors; the scalars are called the weights. A model is linear in theta if its prediction can be written as a linear combination of features with the thetas as the weights; if it cannot be written that way, it is not linear.
Logistic Regression
A regression model that relates a set of explanatory variables to a dichotomous (binary) dependent variable by passing a linear combination of the features through the sigmoid: P(Y = 1 | x) = 1 / (1 + e^(-x^T theta)). lr = LogisticRegression(fit_intercept=True, solver = 'lbfgs') lr.fit(X_train, Y_train) lr.intercept_, lr.coef_
Bias-variance tradeoff
As complexity increases, bias decreases but variance increases
K-fold cross-validation
Data is split into k mutually exclusive subsets (folds); one fold serves as the validation set and the other folds are used for training. Repeat for all k possible choices of validation fold. Quality (λ) = average of the k validation error scores. Purpose: pick the hyperparameter α with the lowest λ. Increasing k decreases bias but increases variance (and computation). (Sketch below.)
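A minimal sketch with sklearn, assuming X (features) and y (labels) already exist:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
model = LinearRegression()
scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
avg_validation_error = -scores.mean()   # the quality used to compare hyperparameter choices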
SVD (Singular Value Decomposition)
Decompose X = UΣV^T, with the most important direction (largest singular value) in the leftmost column. Σ (s) is a diagonal matrix of singular values (for scaling). U and V are orthonormal: their columns are unit vectors and mutually orthogonal.
Batch Gradient Descent (BGD)
Each step of gradient descent uses the average of ALL the training examples.
N-grams
Given a list of words and n, form every group of n consecutive words. Rules: no wrap-around; groups overlap. def find_ngrams(input_list, n): return zip(*[input_list[i:] for i in range(n)])
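For example, using find_ngrams from the card above on a hypothetical four-word list:
list(find_ngrams(['the', 'quick', 'brown', 'fox'], 2))
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]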
Gradient Descent
Gradient descent iteratively adjusts parameters, gradually finding the combination of weights and bias that minimizes loss. Formula: x_(n+1) = x_n − α∇L(Y, f(x_n)). Step against the derivative of the cost function; repeat the process to find a local minimum of the function being optimized.
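A minimal gradient descent sketch; the gradient function, starting point, and example loss are hypothetical:
def gradient_descent(gradient, theta0, alpha=0.1, n_steps=100):
    theta = theta0
    for _ in range(n_steps):
        theta = theta - alpha * gradient(theta)   # step against the gradient
    return theta
# Example: minimize R(theta) = (theta - 3)^2, whose gradient is 2(theta - 3); minimum at theta = 3.
gradient_descent(lambda t: 2 * (t - 3), theta0=0.0)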
High Order Polynomial
Increasing the polynomial degree adds features and decreases training MSE (bias), but increases variance and the risk of overfitting
Loss Function
L(y,^y): measures how far the prediction ^y is from the true value y for a single data point
L2 Squared Loss
L(y,^y) = (y-(a+bx))² Minimizing L2 gives: ^b = r(SDy/SDx), ^a = mean(y) - ^b·mean(x)
L1 Absolute Loss
L(y,^y) = |y-(a+bx)|
Ordinary Least Squares (OLS)
Linear Regression Model + MSE. The ordinary least squares estimates are obtained by minimizing the sum of squared residuals.
Inertia
Loss Function for Clustering Sum of squared distances from each data point to its center. ex. a is 0.21 away from center, b is 0.19 away from center, c is 0.17 from center. Inertia = 0.19² + 0.21² + 0.17²
Node Entropy (S)
Measures how unpredictable a node is; higher S = lower predictability. p_i = proportion of data points in the node with label i; S = -sum(p_i * log_2(p_i)). A node where data are evenly split into C classes has log_2(C) entropy.
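A small sketch that computes node entropy from label proportions (hypothetical proportions):
import numpy as np
def entropy(proportions):
    p = np.array(proportions)
    p = p[p > 0]                      # treat 0·log(0) as 0
    return -np.sum(p * np.log2(p))
entropy([0.5, 0.5])    # 1.0 -> evenly split into C=2 classes gives log_2(2)
entropy([0.25] * 4)    # 2.0 -> evenly split into C=4 classes gives log_2(4)
entropy([1.0])         # 0.0 -> a pure node is perfectly predictable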
Mini-Batch Gradient Descent
Rather than using the whole dataset at once or a single example at a time, each step uses a fixed number of training examples smaller than the full dataset, called a mini-batch.
Classification
Predict a categorical y (instead of predicting a real number y as in regression). Types: 1. binary (win or lose) 2. multiclass (labeling) 3. structured prediction (translation)
L2 MSE
R(theta) = 1/n sum (y_i - (a + bx_i))²; for the constant model, ^theta = mean(y)
L1 MAE
R(theta) = 1/n sum |y_i - (a + bx_i)|; for the constant model, ^theta = median(y)
Regularization
Reduce model complexity and the effects of multicollinearity by adding a penalty on the size of the parameters to the objective, which helps prevent overfitting.
SQL (Structured Query Language)
Clause order: SELECT [DISTINCT] ... FROM ... JOIN ... ON ... WHERE ... (LIKE for simple pattern matching) GROUP BY ... HAVING ... ORDER BY ... LIMIT ... OFFSET (example below)
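A minimal sketch of the clause order, run with Python's sqlite3 against a hypothetical in-memory table (table and column names are made up):
import sqlite3
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE cars (make TEXT, mpg REAL)')
conn.executemany('INSERT INTO cars VALUES (?, ?)',
                 [('toyota', 30), ('toyota', 34), ('ford', 22), ('ford', 25)])
query = '''
    SELECT make, AVG(mpg) AS avg_mpg
    FROM cars
    WHERE mpg > 20
    GROUP BY make
    HAVING AVG(mpg) > 23
    ORDER BY avg_mpg DESC
    LIMIT 10;
'''
print(conn.execute(query).fetchall())   # [('toyota', 32.0), ('ford', 23.5)]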
Stochastic Gradient Descent
SGD can be used for larger datasets. It converges faster when the dataset is large because it updates the parameters more frequently. 1. Take an example 2. Feed it to the model 3. Calculate its gradient 4. Use the gradient from step 3 to update the weights 5. Repeat steps 1-4 for all the examples in the training dataset
Decision Tree
Start at the root node and keep splitting until every node is pure or unsplittable. A fully grown tree gets 100% training accuracy in classification, will overfit, and has a nonlinear decision boundary. Will have errors if data points from different classes overlap.
Clustering
Unsupervised Learning. Types of Clustering: 1. k-means clustering: pick k, randomly place k centers, repeat (assign points, move centers) until convergence 2. agglomerative clustering: pick k, repeatedly combine the two nearest clusters until only k clusters are left 3. spectral clustering: pick k, group points together based on the shape/connectivity of the data
Recall (Sensitivity, TPR)
TP/(TP+FN)
Precision (PPV)
TP/(TP+FP)
Hyperparameter-Learning Rate (α)
The size of each step in gradient descent. With a fixed number of iterations, decreasing α increases bias and decreases variance (helps prevent overfitting), while too large an α can overshoot and diverge.
Odds Ratio (OR) of Logistic Regression
The odds that an individual with a prognostic (risk) factor had the outcome of interest, compared to the odds for an individual without the prognostic (risk) factor. In logistic regression, odds = p/(1-p); the log-odds is linear in the features, so e^(theta_j) is the odds ratio for a one-unit increase in feature j.
L1 Regularization (Lasso Regression)
Like ridge regression, except that instead of squaring the coefficients, the penalty uses their magnitudes (absolute values). Optimal parameters tend to include zeros, so lasso regression not only helps reduce overfitting but can also perform feature selection.
One-hot encoding
The process by which categorical variables are converted into binary form (0 or 1) for machine reading. It is one of the most common methods for handling categorical features in text data. dummies = pd.get_dummies(data['day'])
Holdout Validation
The sample data are randomly divided into training and validation sets. train, valid = np.split(shuffle(data), [25]) — shuffle (e.g., sklearn.utils.shuffle) puts the rows in random order; the first 25 rows become the training set and the remaining 35 - 25 = 10 rows the validation set.
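An equivalent sketch using sklearn, assuming data has 35 rows as in the card above:
from sklearn.model_selection import train_test_split
train, valid = train_test_split(data, test_size=10, random_state=42)   # 25 train, 10 validation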
Orthogonal Projection
The ^Y = X(theta) closest to Y is the orthogonal projection of Y onto the Span of X. At the optimal theta, the residual vector Y - X(theta) is orthogonal to the Span of X, which gives the normal equation X^T X(theta) = X^T Y.
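A minimal numpy sketch, assuming X is an n x (p+1) design matrix and Y an n-vector (hypothetical arrays):
import numpy as np
theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)     # solve the normal equation
theta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None) # or the more robust least-squares solver
Y_hat = X @ theta_hat                             # orthogonal projection of Y onto Span(X)
residual = Y - Y_hat                              # orthogonal to every column of X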
Feature engineering
Transform raw features into features usable for modeling or EDA. Example: df['hp2'] = df['hp']**2 model = LinearRegression() model.fit(df[['hp', 'hp2']], df['mpg'])
Cross-validation (2 methods)
Checking a model's results by evaluating it on validation data drawn from the same population but not used for training. Purpose: avoid overfitting, pick hyperparameters (e.g., α), and eventually train on all data points. 2 methods: 1. Holdout method 2. K-fold
Weighted Entropy (WS)
WS = (# of samples in node/# of total samples) * entropy of the node
Central Limit Theorem (CLT)
When n is large, the sampling distribution of the sample mean is approximately Normal.
Multiple Observation
Y = X(theta), where Y is an n-vector of observations, X is the n x (p+1) design matrix, and theta is a (p+1)-vector of parameters
Random Variable
a numerical description of the outcome of an experiment: a function whose input is a random sample (of size n) and whose output is a number on the number line
Simple Linear Regression Model
a parametric model (the goal is to choose the best parameters for slope and intercept) SkLearn: model = LinearRegression() model.fit(df[['feature']], df['observation_to_predict']) model.predict(new_input_data) model.intercept_ model.coef_
Sample
a subset of the access frame
Access frame
collection of elements accessible for measurement (could be outside of target population)
Types of bias
coverage, selection, non-response, measurement
Test Set
data set used to estimate accuracy of final model on unseen data
Decision Tree Modeling
from sklearn import tree decision_tree_model = tree.DecisionTreeClassifier(criterion = 'entropy') X_train = df[['characteristic_1', 'characteristic_2']] decision_tree_model.fit(X_train, Y_train) decision_tree_model.predict(X_test)
pandas: Create a data frame
df = pd.DataFrame( {"a" : [4, 5, 6], "b" : [7, 8, 9], "c" : [10, 11, 12]}, index = [1, 2, 3])
pandas: Select columns whose name matches regular expression regex
df.filter(regex='regex')
pandas: Grouped by values in column named "col"
df.groupby("col")
pandas: Aggregate group using function
df.groupby(by="col").agg(function)
Create histogram
df.hist('x', bins=n) — bins controls the number of bins
pandas: Select only rows, only columns or both
df.loc[] and df.iloc[]
pandas: Spread rows into columns
df.pivot_table(columns='var', values='val')
Create line/scatter plots
df.plot('x', 'y') for a line plot; df.plot.scatter('x', 'y') for a scatter plot; label axes with plt.xlabel()/plt.ylabel()
pandas: Order rows by values of a column
df.sort_values('mpg')
pandas: Count number of rows with each unique value of variable
df['w'].value_counts()
pandas: Keep rows whose column value is in a list
df[df['col'].isin(list)]
pandas: Return parts of a data frame based on conditions
df[df['col']=='a']
Types of Quantitative Variables
discrete(age) and continuous(price)
Decision Tree Accuracy
from sklearn.metrics import accuracy_score accuracy_score(Y_test, predictions)
Confusion Matrix
from sklearn.metrics import confusion_matrix Y_test_pred = lr.predict(X_test) cnf_matrix = confusion_matrix(Y_test, Y_test_pred) cnf_matrix
Singular Value Interpretation
the ith value s_i in Σ tells us the variance of the ith PC: variance of PC i = s_i²/n (s_i² is the corresponding eigenvalue)
Model Risk (bias-variance decomposition)
model risk = σ² + (model bias)² + model variance. Holds for linear and non-linear models under squared loss; does not hold for absolute loss, classification, or zero-one loss.
Multinomial Distribution
n independent trials each of which leads to a success for exactly one of k categories, with each category having a given fixed success probability
Types of Qualitative Variables (categories)
ordinal(preference, level of education) and nominal(id#)
pandas: Gather columns to rows
pd.melt(df)
pandas: Join matching rows from bdf to adf
pd.merge(adf, bdf, how='left', on='x1')
Protocol
procedure to choose a sample, follow up, training, etc.
Types of Variables
qualitative and quantitative
instrument
a questionnaire or measurement device administered to an individual in the sample
urn model
a container of identical or labeled marbles that models random selection of samples from a population; helps estimate the size of sampling variation using probability
Granularity
the level of detail in the data: what each record (row) represents, e.g., one person, one purchase, one county
Types of variation
sampling variation, assignment variation, measurement error
Relationships in scatter plot
simple linear, simple nonlinear, linear spreading, v-shaped
Types of uniform random sampling
simple random sampling (SRS), systematic, stratified
Expectation E[x]
the average value of x, weighted by probability: E[x] = sum of x·P(X = x)
L2 Regularization (Ridge Regression)
the cost function is altered by adding a penalty equal to the square of the magnitude of the coefficients. Unlike lasso, ridge does not set coefficients exactly to zero, so it does not perform feature selection (it does not leverage redundancy among features). (Sketch below.)
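A minimal sklearn sketch, assuming X_train and Y_train exist; alpha is the regularization strength:
from sklearn.linear_model import Ridge, Lasso
ridge = Ridge(alpha=1.0).fit(X_train, Y_train)   # L2 penalty: shrinks coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X_train, Y_train)   # L1 penalty: can set some coefficients exactly to zero
ridge.coef_, lasso.coef_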