data 100 final
mean absolute error formula (MAE)
(1/n) * sum of |yi - theta| over every y in our set of true points
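a minimal numpy sketch of this formula, assuming y is an array of observed values and theta is a scalar estimate:

import numpy as np

def mae(theta, y):
    # mean absolute error: (1/n) * sum(|yi - theta|)
    return np.mean(np.abs(y - theta))

y = np.array([2.0, 4.0, 9.0])
mae(np.median(y), y)  # the median minimizes MAE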
calc for amount of variance captured by i'th principal component
(singular value i) ^2 / N (number of data pts)
gradient of cross entropy loss
-(1/n) * sum{ (yi - sigma(xi . theta)) * xi }
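a hedged numpy sketch of this gradient for logistic regression, assuming X is the design matrix (with a bias column of 1s), y holds 0/1 labels, and theta is the weight vector:

import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def cross_entropy_grad(theta, X, y):
    # -(1/n) * sum over points of (yi - sigma(xi . theta)) * xi
    n = len(y)
    return -(1 / n) * X.T @ (y - sigmoid(X @ theta))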
random forest method
- bootstrap resamples of training data - fit one decision tree per resample - final prediction is the majority vote of the trees on each data point
methods for avoiding overfitting on decision trees
- maximum depth - don't split nodes containing very few samples - random forest
entropy of a decision tree node
-sum(pc * log_2 pc) where pc is the proportion of data points at the node with label c (log_2 1 = 0, so a node where every label is the same has zero entropy) (analogous to 'unpredictability')
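a small numpy sketch of node entropy, assuming labels is a 1-D array of the class labels at the node:

import numpy as np

def entropy(labels):
    # S = -sum(pc * log2(pc)) over the classes present at the node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

entropy(np.array([1, 1, 1]))     # 0.0 -- pure node
entropy(np.array([0, 1, 0, 1]))  # 1.0 -- maximally mixed for two classes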
cross entropy loss
(1/n) * sum{ -yi ln(y_hat_i) - (1 - yi) ln(1 - y_hat_i) }
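a matching numpy sketch, assuming y holds 0/1 labels and y_hat holds predicted probabilities strictly between 0 and 1:

import numpy as np

def cross_entropy_loss(y, y_hat):
    # (1/n) * sum of -yi*ln(y_hat_i) - (1 - yi)*ln(1 - y_hat_i)
    return np.mean(-y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat))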
auc of roc curve if we randomly guessed
0.5
sigmoid function (logistic regression)
1/(1 + exp(-t))
one hot encoding
A sparse vector in which: One element is set to 1. All other elements are set to 0. (make a quantitative variable out of a categorical one with finite values)
variance of an estimator
E(theta_hat squared) - [E(theta_hat)] squared
bias of an estimator
E(theta_hat) - theta_star
2 formats for risk function
E[(X - theta)^2]; equivalently bias^2 + variance: (E[X] - theta)^2 + Var(X)
L1 Regularization is known as ____ L2 is known as _____
LASSO, ridge regression
L1/lasso regularization/regression
MSE + lambda * sum of absolute values of weights (penalizes large weights, encourages sparsity)
L2 regularization
MSE + lambda * sum of weights squared (penalizes large weights)
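a sketch of both regularized objectives for a linear model, assuming theta[0] is the bias term (left out of the penalty, as covered in a later card) and lam is the regularization strength:

import numpy as np

def lasso_objective(theta, X, y, lam):
    mse = np.mean((y - X @ theta) ** 2)
    return mse + lam * np.sum(np.abs(theta[1:]))  # L1 penalty, bias excluded

def ridge_objective(theta, X, y, lam):
    mse = np.mean((y - X @ theta) ** 2)
    return mse + lam * np.sum(theta[1:] ** 2)     # L2 penalty, bias excluded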
2 ways to calculate principal components
U @ Sigma or X @ V
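a numpy sketch tying the SVD cards together: after centering, U @ Sigma and X @ V give the same principal components, and squared singular values over n give the variance each component captures (random data used only for illustration):

import numpy as np

X = np.random.randn(100, 5)
Xc = X - X.mean(axis=0)                  # center each column first
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

pcs_a = U * s                            # U @ diag(s)
pcs_b = Xc @ Vt.T                        # X @ V
np.allclose(pcs_a, pcs_b)                # True

variance_captured = s ** 2 / len(Xc)     # variance captured by each component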
elastic net (combined l1 + l2 penalty)
a compromise between lasso and ridge, but requires tuning 2 regularization hyperparameters
constant model
a model whose prediction is a single constant - it does not depend on the input features of the data point we're looking at
confidence interval
if the sampling process were repeated many times, a p% interval would contain the population parameter you want to estimate about p% of the time
huber loss
a piecewise loss function: quadratic (mse-like) near the observation and linear (mae-like) farther from it (we pick the transition point)
residual formula
actual y - predicted y
there (always, sometimes, never) exists a unique ridge regression lstsq solution
always (as long as the regularization penalty lambda > 0)
model
an idealized representation of data
what do kernels do
average neighboring points for a 'smoothed' effect
given the model y = a + bx, what are a and b
b = r (sigma y / sigma x) a = mean(y) - b * mean(x)
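a quick numpy check of these formulas, assuming x and y are 1-D arrays of the same length:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 5.0, 6.0])

r = np.corrcoef(x, y)[0, 1]
b = r * (np.std(y) / np.std(x))    # slope
a = np.mean(y) - b * np.mean(x)    # intercept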
advantages of huber loss
both differentiable and robust to outliers (while mse and mae each are only one)
gini impurity (decision trees)
the chance that a randomly chosen point at that node would be misclassified if it were labeled randomly according to the node's label distribution
k-means is for clustering. k nearest neighbors is for ---
classification (prediction is the most common class among the k nearest neighbors)
l1 norm ball
diamond (absolute value) - encourages sparse solutions and is convex
rows of vt relative to pca
directions of principal components
agglomerative clustering algorithm
each point is its own cluster, and clusters are joined in closeness order (by some metric) until only k remain
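one hedged way to run this, using scikit-learn's AgglomerativeClustering (assumes sklearn is available; 'single' linkage joins clusters by smallest pairwise distance, one choice of closeness metric):

import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.random.randn(50, 2)   # toy data for illustration
labels = AgglomerativeClustering(n_clusters=3, linkage="single").fit_predict(X)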
sigmoid function vector form
f_theta(x) = sigma(theta_hat dot x)
why does k means clustering fail to find a global minimum
it alternates two steps - optimize the centers with the data colors (cluster assignments) held constant, then optimize the colors with the centers held constant - each step only improves the solution locally, so it can get stuck in a local minimum
false positive rate
fp / (fp + tn) or 1 - specificity
weighted entropy function
helps us decide a decision tree split, like a loss fxn: L = (N1*S(X) + N2*S(Y)) / (N1 + N2) - two child nodes X and Y with N1 and N2 samples each - S is our entropy fxn
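a small sketch of this split score, assuming left and right are the label arrays of the two child nodes and entropy() is the node-entropy function sketched earlier:

def weighted_entropy(left, right):
    # L = (N1*S(left) + N2*S(right)) / (N1 + N2); lower means a better split
    n1, n2 = len(left), len(right)
    return (n1 * entropy(left) + n2 * entropy(right)) / (n1 + n2)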
boxcar kernel general shape
rectangle (a KDE built from boxcar kernels looks like a histogram)
i'th singular value will tell us
how much of the variance is captured by the i'th principal component
odds & log-odds
in logistic regression: odds = p / (1 - p). the log-odds, ln(p / (1 - p)), is assumed to be a linear function of the features
increasing the size of the training set in cross validation (increases/decreases) the variance of the validation error
increases - a smaller validation set size means less representation of your true population / test set
stochastic gradient descent
instead of computing the gradient over every point, compute it on a single randomly chosen point (or a small batch) at each update step
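a minimal SGD loop, assuming grad(theta, xi, yi) returns the gradient of the loss at a single point (for example, the logistic-regression gradient sketched earlier applied to one row):

import numpy as np

def sgd(grad, theta, X, y, lr=0.01, n_steps=1000):
    rng = np.random.default_rng(0)
    for _ in range(n_steps):
        i = rng.integers(len(y))                     # pick one random point
        theta = theta - lr * grad(theta, X[i], y[i])
    return theta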
.iloc accesses by
integer position (regardless of index data type)
why is the bias/intercept term excluded from regularization
the intercept does not contribute to model complexity or variance, which regularization seeks to reduce - penalizing it would only bias the predictions
ridge regression
l2 regularization using linear model, mse
.loc accesses by
label (row index, column names); also accepts boolean arrays, lists of labels, and slices (inclusive on both ends)
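a pandas example contrasting the two accessors, using a toy DataFrame with string row labels:

import pandas as pd

df = pd.DataFrame({"a": [10, 20, 30], "b": [1, 2, 3]}, index=["x", "y", "z"])

df.iloc[0, 1]          # 1 -- integer positions, regardless of index labels
df.loc["x", "b"]       # 1 -- row label and column name
df.loc["x":"y", "a"]   # label slice, inclusive of "y"
df.loc[df["a"] > 10]   # boolean array selection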
(lasso/ridge) regression completely tosses out some features when making predictions
lasso - this can reduce dimensionality, since unneeded features increase variance without reducing bias. you can even use lasso regression to select a subset of features and then fit another model on just those features
issues with linear regression for probabilities
linear regression outputs are not bounded to [0, 1], so they can't be interpreted as probabilities
elbow method to pick k (clustering)
look for diminishing returns on inertia
risk = expected value of ___ for all training data
loss (risk is the expectation of a random variable and thus not random itself)
mean squared error (MSE) formula
loss function: (1/n) * sum of (yi - theta)^2 over every y in our set of true points (think of y as a vector)
class imbalanced dataset
the majority of labels belong to one class - always predicting that class can give deceptively good accuracy
mse loss is always minimized by the ___ of the data
mean
linear separability in classification
means we can separate the classes with a line (a hyperplane in higher dimensions)
silhouette method (clustering)
for each point: A = average distance to points in its own cluster, B = average distance to points in the closest other cluster, S = (B - A) / max(A, B)
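a from-scratch sketch of the score for a single point, assuming X is the data matrix, labels the cluster assignments, and i the index of the point (clusters assumed to have more than one point):

import numpy as np

def silhouette_of_point(X, labels, i):
    dists = np.linalg.norm(X - X[i], axis=1)
    own = (labels == labels[i]) & (np.arange(len(X)) != i)
    A = dists[own].mean()                              # avg distance within own cluster
    B = min(dists[labels == c].mean()                  # avg distance to closest other cluster
            for c in np.unique(labels) if c != labels[i])
    return (B - A) / max(A, B)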
the mae loss is always minimized by the ___ of the data
median
fitting a model is the same thing as
minimizing loss
(mse, mae) has a unique minimum every time
mse
(mse, mae) is more outlier sensitive
mse (because of the square term)
l2 norm ball
circle (squared values) - robust (spreads weight over features) but does not encourage sparsity
variance of a binomial distribution
np(1-p)
parameters and output of a loss function
parameters: theta (our estimate) and the points in the set. output: a number that measures the quality of our estimate (goal is to minimize loss)
scree plot axes
pc num (x), variance captured (y)
principal component svd relationship
pcs are the columns of u@sigma
build a decision tree
pick the best split value (e.g. by weighted entropy), then repeat on each child node until all nodes are pure or unsplittable
prediction vs inference
prediction: using data to predict outcomes for future data. inference: using data to draw conclusions about the underlying relationships (e.g. the true parameters)
variance of fitted values in the y = a + bx model
r^2 * variance of actual y (since r^2 = variance of fitted / variance of actual)
minimize risk function with the
sample mean
bootstrapping
resample your sample multiple times with replacement - used to estimate the sampling distribution of an estimator (do the resampling many times; graphing the estimate from each resample approximately yields that distribution)
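a numpy sketch of bootstrapping the sample mean, assuming sample is a 1-D array drawn from the population:

import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=5, scale=2, size=200)   # stand-in for your observed sample

boot_means = [rng.choice(sample, size=len(sample), replace=True).mean()
              for _ in range(10_000)]
# the spread of boot_means approximates the sampling distribution of the mean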
how to create decision boundary (logistic regression)
set sigma(model) equal to your threshold
inverse of logistic regression
sigma ^ -1 = ln (t / (1-t)), where t is the threshold
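tying the last two cards together: a sketch that turns a probability threshold T into a decision boundary on x . theta, assuming theta is a fitted logistic-regression weight vector:

import numpy as np

def logit(t):
    # inverse of the sigmoid: ln(t / (1 - t))
    return np.log(t / (1 - t))

def predict(X, theta, T=0.5):
    # classify as 1 when sigma(x . theta) >= T, i.e. when x . theta >= logit(T)
    return (X @ theta >= logit(T)).astype(int)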
gaussian kernel
a smooth bell-curve similarity function: exp of the negative squared distance to a point (landmark), scaled by a bandwidth
too simple a model = high ____ too complex a model = high _____
simple: bias complex: variance
how to normalize data for regularization
subtract the mean from each column and scale each column to a comparable range (e.g. between -1 and 1, or divide by its standard deviation) so the penalty treats all weights fairly
how to center principal components
subtract mean of each col (before svd) (can add back after to get proper scale)
l0 norm ball
the axes - good for feature selection but difficult to optimize
inertia (k means clustering)
the k-means loss function - the sum of squared distances from each point to its assigned center
gradient descent is performed on ____ to minimize the ______
theta, loss function
specificity (true negative rate)
tn / (tn + fp)
recall (sensitivity, true positive rate)
tp / (tp + fn)
precision
tp / (tp + fp) true positive / predicted positives
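a single sketch covering these confusion-matrix rates, assuming tp, fp, tn, fn are counts from a binary classifier:

def rates(tp, fp, tn, fn):
    return {
        "precision":   tp / (tp + fp),
        "recall_tpr":  tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "fpr":         fp / (fp + tn),   # = 1 - specificity
    }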
decision trees always have perfect accuracy on training data except when
two points are exactly the same in features but belong to different classes
avoid gradient descent finding only a local minimum by..
using a convex function (such as mae, mse, or huber)
the constant model has low ____ and high _____
variance, bias
distortion (k means clustering)
weighted sum of squared distance from each point to center (divide by num points)
multicollinearity
when a feature can be predicted fairly accurately by a linear combination of the other features. perfect multicollinearity = no unique least squares solution (because of linear dependence in the design matrix)
roc curve axes (receiver operating characteristic)
x axis: fpr (false positive rate) y axis: tpr (true positive rate)
precision-recall curve axes
x axis: recall y axis: precision