data 100 final


mean absolute error formula (MAE)

(1/n) * sum of |yi - theta| over all yi in the data
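
A minimal NumPy sketch of this formula (the array y and the choice of theta below are made-up examples):

    import numpy as np

    y = np.array([2.0, 4.0, 7.0, 9.0])       # hypothetical observed values
    theta = np.median(y)                      # MAE happens to be minimized by the median
    mae = np.mean(np.abs(y - theta))          # (1/n) * sum |y_i - theta|
    print(mae)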

calc for amount of variance captured by i'th principal component

(singular value i) ^2 / N (number of data pts)

gradient of cross entropy loss

- 1/n * sum{ (yi - sigma(xi^T theta)) * xi }
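
A hedged NumPy sketch of this gradient for logistic regression; X, y, and theta below are made-up example arrays, and the sigmoid matches the card further down:

    import numpy as np

    def sigmoid(t):
        return 1 / (1 + np.exp(-t))

    X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.3]])   # hypothetical design matrix
    y = np.array([1, 0, 1])                                # binary labels
    theta = np.zeros(2)

    # gradient of the average cross-entropy loss: -(1/n) * sum (y_i - sigma(x_i^T theta)) * x_i
    grad = -(1 / len(y)) * X.T @ (y - sigmoid(X @ theta))
    print(grad)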

random forest method

- bootstrap resamples of the training data - fit one decision tree per resample - the final prediction is the majority vote of the trees on each data point
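
A rough sketch of the bootstrap-and-vote idea using scikit-learn's DecisionTreeClassifier on made-up data (a real random forest also subsamples features at each split, which this sketch omits):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(42)
    X = rng.normal(size=(100, 3))                    # hypothetical features
    y = (X[:, 0] + X[:, 1] > 0).astype(int)          # hypothetical binary labels

    trees = []
    for _ in range(25):                              # 25 bootstrap resamples
        idx = rng.integers(0, len(X), size=len(X))   # sample rows with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

    # final prediction: majority vote of the trees on each point
    votes = np.stack([t.predict(X) for t in trees])
    y_pred = (votes.mean(axis=0) > 0.5).astype(int)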

methods for avoiding overfitting on decision trees

- maximum depth - don't split nodes containing very few samples - random forest

entropy of a decision tree node

-1 * sum (pc * log_2 pc) where pc is the proportion of data points at a node with label c (log_2 1 = 0, so we have no entropy if every label at that node is the same) (a measure of unpredictability)
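
A small NumPy sketch of this entropy calculation (the label lists are made-up examples):

    import numpy as np

    def entropy(labels):
        # proportions p_c of each label at the node
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    entropy([1, 1, 1, 1])        # 0.0 - pure node, fully predictable
    entropy([0, 0, 1, 1])        # 1.0 - maximally mixed binary node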

cross entropy loss

-1/n * sum{ yi ln(y_hat_i) + (1 - yi) ln(1 - y_hat_i) }
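
A short NumPy sketch of this loss (the labels and predicted probabilities are made-up):

    import numpy as np

    def cross_entropy_loss(y, y_hat):
        # average cross-entropy; y_hat are predicted probabilities in (0, 1)
        return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

    cross_entropy_loss(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.7]))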

auc of roc curve if we randomly guessed

0.5

sigmoid function (logistic regression)

1/(1 + exp(-t))

one hot encoding

A sparse vector in which one element is set to 1 and all other elements are set to 0 (turns a categorical variable with finitely many values into quantitative features)
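
A quick pandas illustration using get_dummies; the 'color' column is a made-up example:

    import pandas as pd

    df = pd.DataFrame({"color": ["red", "blue", "red", "green"]})   # hypothetical column
    pd.get_dummies(df, columns=["color"])
    # each row gets a 1 in exactly one of color_blue / color_green / color_red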

variance of an estimator

E(theta_hat^2) - [E(theta_hat)]^2

bias of an estimator

E(theta_hat) - theta_star

2 formats for risk function

E[ (X - theta)^2 ]; or bias^2 + variance: (E(X) - theta)^2 + Var(X)

L1 Regularization is known as ____ L2 is known as _____

LASSO, ridge regression

L1/lasso regularization/regression

MSE + lambda * sum of absolute values of weights (penalizes large weights and encourages sparse solutions)

L2 regularization

MSE + lambda * sum of weights squared (penalizes large weights)

2 ways to calculate principal components

U @ Sigma or X @ V
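
A small NumPy check that both routes give the same principal components (the data matrix is random, made-up data; centering follows the card further down):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 4))
    X = X - X.mean(axis=0)                      # center each column before SVD

    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    pcs_1 = U * S                               # U @ diag(S)
    pcs_2 = X @ Vt.T                            # X @ V
    print(np.allclose(pcs_1, pcs_2))            # True - same principal components
    # S**2 / X.shape[0] gives the variance captured by each component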

elastic net (combined l1 + l2 penalty)

a compromise between lasso and ridge, but requires tuning 2 regularization hyperparameters

constant model

a model where our predicted value is constant - not dependent on which values of a dataset we're looking at

confidence interval

a p% interval will theoretically contain the population parameter that you want to estimate p% of the time

huber loss

a piecewise loss function that behaves like mse close to the observation and like mae farther away (we can pick the transition point)
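
A sketch of Huber loss in NumPy, assuming a transition point delta (name and default chosen for illustration):

    import numpy as np

    def huber_loss(y, theta, delta=1.0):
        # MSE-like inside delta of the observation, MAE-like beyond it
        r = np.abs(y - theta)
        return np.mean(np.where(r <= delta,
                                0.5 * r**2,
                                delta * (r - 0.5 * delta)))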

residual formula

actual y - predicted y

there (always, sometimes, never) exists a unique ridge regression lstsq solution

always (as long as the regularization strength lambda > 0)

model

an idealized representation of data

what do kernels do

average neighboring points for a 'smoothed' effect

given the model y = a + bx, what are a and b

b = r * (sigma_y / sigma_x), a = mean(y) - b * mean(x)
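
A short NumPy sketch of these two formulas on made-up x and y arrays:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0])          # hypothetical data
    y = np.array([2.1, 3.9, 6.2, 8.1])

    r = np.corrcoef(x, y)[0, 1]
    b = r * np.std(y) / np.std(x)
    a = np.mean(y) - b * np.mean(x)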

advantages of huber loss

both differentiable and robust to outliers (while mse and mae each are only one)

gini impurity (decision trees)

the chance that a randomly chosen point at that node would be misclassified if it were labeled randomly according to the node's label proportions
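
A small NumPy sketch of Gini impurity for a node's labels:

    import numpy as np

    def gini_impurity(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        # probability a random point is misclassified by a random label draw
        return 1 - np.sum(p**2)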

k-means is for clustering. k nearest neighbors is for ---

classification (prediction is the most common class among the k nearest neighbors)

l1 norm ball

diamond / absolute value - encourages sparse solutions and is convex

rows of vt relative to pca

directions of principal components

agglomerative clustering algorithm

each point is its own cluster, and clusters are joined in closeness order (by some metric) until only k remain

sigmoid function vector form

f_theta(x) = sigma(theta_hat dot x)

why does k means clustering fail to find a global minimum

because it alternates: the first step optimizes the centers while holding the cluster assignments (colors) constant, and the second optimizes the assignments while holding the centers constant - so it can get stuck in a local minimum
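
A bare-bones sketch of this alternation (Lloyd's algorithm) in NumPy; it assumes no cluster ever empties out and uses made-up defaults:

    import numpy as np

    def k_means(X, k, iters=10, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(iters):
            # assign each point to its nearest center (centers held fixed)
            labels = np.argmin(((X[:, None] - centers) ** 2).sum(axis=2), axis=1)
            # move each center to the mean of its points (assignments held fixed)
            centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        return centers, labels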

false positive rate

fp / (fp + tn) or 1 - specificity

weighted entropy function

helps us decide a decision tree split, like a loss function: L = (N1*S(X) + N2*S(Y)) / (N1 + N2) for the 2 child nodes X and Y with N1 and N2 samples each, where S is our entropy function
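
A small sketch of this split criterion in NumPy (left and right are the label arrays of the two hypothetical child nodes):

    import numpy as np

    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def weighted_entropy(left, right):
        # weighted average of the child-node entropies; a lower value is a better split
        n1, n2 = len(left), len(right)
        return (n1 * entropy(left) + n2 * entropy(right)) / (n1 + n2)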

boxcar kernel general shape

rectangular - a density estimate built from boxcar kernels looks like a histogram

i'th singular value will tell us

how much of the variance is captured by the i'th principal component

odds & log-odds

in logistic regression: odds = p / (1 - p). The ln of this value (the log-odds) is assumed to be linear in the features

increasing the size of the training set in cross validation (increases/decreases) the variance of the validation error

increases - a larger training set leaves a smaller validation set, which represents your true population / test set less well

stochastic gradient descent

instead of doing gradient descent on every point, do it on a single randomly chosen one
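
A minimal sketch of SGD in NumPy, using squared loss for the per-point gradient purely as an example; the learning rate and epoch count are made-up defaults:

    import numpy as np

    def sgd(X, y, lr=0.01, epochs=100, seed=0):
        rng = np.random.default_rng(seed)
        theta = np.zeros(X.shape[1])
        for _ in range(epochs):
            i = rng.integers(len(X))                  # one randomly chosen point
            grad = -2 * (y[i] - X[i] @ theta) * X[i]  # gradient of squared loss at that point
            theta -= lr * grad
        return theta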

.iloc accesses by

integer position (regardless of index data type)

why is the bias/intercept term excluded from regularization

it does not increase variance, which regularization seeks to minimize

ridge regression

l2 regularization using linear model, mse

.loc accesses by

label (row index values, column names); also accepts boolean arrays, lists of labels, and slices (slices are inclusive of the endpoint)
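
A small pandas example contrasting .iloc (card above) and .loc on a made-up DataFrame:

    import pandas as pd

    df = pd.DataFrame({"a": [10, 20, 30], "b": [1.5, 2.5, 3.5]},
                      index=["x", "y", "z"])

    df.iloc[0]              # first row, by integer position
    df.iloc[0:2, 1]         # positional slice is exclusive of the endpoint
    df.loc["x"]             # row by index label
    df.loc["x":"y", "b"]    # label slice is inclusive of the endpoint
    df.loc[df["a"] > 15]    # boolean array filtering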

(lasso/ridge) regression completely tosses out some features when making predictions

lasso - this can reduce dimensionality, and unneeded features increase variance without reducing bias. sometimes you can even use lasso regression to select a subset of features and then use another model on just those features

issues with linear regression for probabilities

linear regression is not bounded on [0,1]

elbow method to pick k (clustering)

look for diminishing returns on inertia

risk = expected value of ___ for all training data

loss (risk is the expectation of a random variable and thus not random itself)

mean squared error (MSE) formula

loss function: (1/n) * sum of (yi - theta)^2 over every y in our set of true points (think of y as a vector)

class imbalanced dataset

the majority of labels belong to one class - always predicting that class can give deceptively high accuracy

mse loss is always minimized by the ___ of the data

mean

linear separability in classification

means we can separate the classes with a line

silhouette method (clustering)

measures distances to other points: A = average distance to points in the current cluster, B = average distance to points in the closest other cluster, S = (B - A) / max(A, B)
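
A sketch of the per-point silhouette score in NumPy; own_cluster is assumed to hold the other points in x's cluster and other_cluster the points of the closest other cluster (both hypothetical arrays):

    import numpy as np

    def silhouette_point(x, own_cluster, other_cluster):
        # A: average distance to the other points in x's cluster
        # B: average distance to the points in the closest other cluster
        A = np.mean(np.linalg.norm(own_cluster - x, axis=1))
        B = np.mean(np.linalg.norm(other_cluster - x, axis=1))
        return (B - A) / max(A, B)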

the mae loss is always minimized by the ___ of the data

median

fitting a model is the same thing as

minimizing loss

(mse, mae) has a unique minimum every time

mse

(mse, mae) is more outlier sensitive

mse (because of the square term)

l2 norm ball

circle / squared penalty - robust (spreads weight over features) but does not encourage sparsity

variance of a binomial distribution

np(1-p)

parameters and output of a loss function

parameters: theta (our estimate) and the points in the set; output: a number that measures the quality of our estimate (goal is to minimize loss)

scree plot axes

pc num (x), variance captured (y)

principal component svd relationship

pcs are the columns of u@sigma

build a decision tree

pick the best split value (e.g., the one minimizing weighted entropy); repeat until all nodes are pure or unsplittable

prediction vs inference

prediction: using data to predict outcomes for future data; inference: using data to make conclusions about the underlying relationship

variance of fitted values in the y = a + bx model

r^2 * variance of actual y (since r^2 = variance of fitted / variance of actual)

minimize risk function with the

sample mean (for squared-error risk E[ (X - theta)^2 ])

bootstrapping

sample population multiple times with replacement - used to estimate distributions (the sample is done a lot of times and graphing all the different values you get will approximately yield your distribution)
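
A short NumPy sketch of bootstrapping the sampling distribution of the mean on made-up data:

    import numpy as np

    rng = np.random.default_rng(0)
    sample = rng.normal(loc=10, scale=2, size=500)          # hypothetical sample

    # resample with replacement many times to estimate the distribution of the mean
    boot_means = np.array([rng.choice(sample, size=len(sample), replace=True).mean()
                           for _ in range(2000)])
    ci = np.percentile(boot_means, [2.5, 97.5])             # e.g. a 95% confidence interval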

how to create decision boundary (logistic regression)

set sigma(x^T theta) equal to your threshold and solve

inverse of logistic regression

sigma^-1(t) = ln(t / (1 - t)), where t is the threshold probability

gaussian kernel

a similarity function based on distance to a landmark: exp(-(distance to landmark)^2 / (2 * sigma^2))
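
A minimal sketch, assuming the exp(-squared distance) form with a bandwidth sigma (made-up default):

    import numpy as np

    def gaussian_kernel(x, landmark, sigma=1.0):
        # similarity decays with squared distance to the landmark
        return np.exp(-np.sum((x - landmark) ** 2) / (2 * sigma ** 2))

    gaussian_kernel(np.array([1.0, 2.0]), np.array([1.5, 1.0]))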

too simple a model = high ____ too complex a model = high _____

simple: bias complex: variance

how to normalize data for regularization

subtract the mean from each column and scale each column to be within -1 and 1 (so the penalty treats all features on the same scale)

how to center principal components

subtract mean of each col (before svd) (can add back after to get proper scale)

l0 norm ball

the axes - good for feature selection but difficult to optimize

inertia (k means clustering)

the loss function - sum of squared distances from each point to its closest center

gradient descent is performed on ____ to minimize the ______

theta, loss function

specificity (true negative rate)

tn / (tn + fp)

recall (sensitivity, true positive rate)

tp / (tp + fn)

precision

tp / (tp + fp) true positive / predicted positives
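
A small NumPy sketch computing the four confusion-matrix metrics from the last few cards on made-up labels and predictions:

    import numpy as np

    y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])     # hypothetical labels
    y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])     # hypothetical predictions

    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))

    fpr = fp / (fp + tn)            # false positive rate = 1 - specificity
    specificity = tn / (tn + fp)    # true negative rate
    recall = tp / (tp + fn)         # sensitivity / true positive rate
    precision = tp / (tp + fp)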

decision trees always have perfect accuracy on training data except when

two points are exactly the same in features but belong to different classes

avoid gradient descent finding only a local minimum by..

using a convex loss function (such as mae, mse, or huber)

the constant model has low ____ and high _____

variance, bias

distortion (k means clustering)

weighted sum of squared distance from each point to center (divide by num points)

multicollinearity

when a feature can be predicted fairly accurately by a linear combination of other features; perfect multicollinearity = no unique least squares solution (because of linear dependence among the columns)

roc curve axes (receiver operating characteristic)

x axis: fpr (false positive rate) y axis: tpr (true positive rate)

precision-recall curve axes

x axis: recall y axis: precision

