Data 100

Span of X

the set of all linear combinations of columns of X.

Variance Var[x]

the spread of X: the expected squared deviation from the mean, Var[X] = E[(X − E[X])²]

Target population

the total group to be studied or described and from whom samples may be drawn

Constant Model

θ is a single number (1D). R(θ) = (1/n) Σ (y_i − θ)². The constant model's error is at least as high as the simple linear model's error.

Box Plot Statistics

upper/median/lower quartile = np.percentile(data, 75/50/25). Outliers are points below Q1 − 1.5·IQR or above Q3 + 1.5·IQR, where IQR = Q3 − Q1.
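A minimal sketch of these statistics in NumPy (the data array is made up for illustration):
import numpy as np
data = np.array([1, 2, 3, 4, 5, 6, 7, 100])
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]   # here: the value 100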

KDE (Kernel density estimation)

used to estimate a smooth density curve from data. Increasing the bandwidth (α) gives a smoother curve but can hide multimodality. sns.kdeplot(data)

Single Observation

ŷ = xᵀθ (the dot product of the feature row vector x with the parameter vector θ)

Change in Entropy (ΔWS)

ΔWS = WS of node - WS of all its children combined

Accuracy

(TP+TN)/(TP+FP+FN+TN)

pandas: DataFrame to Series

.squeeze()

pandas: Series to DataFrame

.to_frame()

Ways to pick k for Clustering

1. Elbow method: plot inertia vs. k and pick the k at the "elbow" of the curve. 2. Silhouette score (S): scores each data point based on how well it sits in its cluster. A = average distance to points in its own cluster; B = average distance to points in the closest other cluster; S = (B − A) / max(A, B). High S = near the other points in its own cluster; low S = near points in a different cluster. S = 1 when all points in a cluster are on top of each other; S is negative when the point is in the wrong cluster.
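A minimal sketch of both methods with scikit-learn (the feature matrix X here is random, purely for illustration):
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
X = np.random.rand(200, 2)
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    # inertia feeds the elbow plot; silhouette scores can be compared directly
    print(k, km.inertia_, silhouette_score(X, km.labels_))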

Ways to avoid overfit in Decision Tree

1. Prevent growth: don't split nodes that are too small, limit the number of levels, and require a minimum ΔWS to split. 2. Tree pruning: set aside a validation set before building the tree; if a split has no impact on validation error, don't keep it.

Average Loss (Empirical Loss)

(1/n) Σ L(y_i, ŷ_i)

Density of a plot

percentage in a bin = 100 × (individuals in the bin / total population); this percentage equals the bar's area. density = (percentage in the bin) / (width of the bin), so the bar areas sum to 100%.

PCA (Principal component analysis)

A dimensionality-reduction method often used on large data sets: it transforms a large set of variables into a smaller one that still contains most of the information in the original set.
u, s, vt = np.linalg.svd(rectangle, full_matrices=False)
usig = u * s   # UΣ, same number of rows as the original matrix
To keep only the first principal component (rank-1 approximation): pd.DataFrame(usig[:, 0:1] @ vt[0:1, :])
In the PC scatter plot, PC1 is the horizontal axis and PC2 the vertical axis (in the course example, high-PC1 points sit toward the left and high-PC2 points toward the top).

Regression line

The line that minimizes the mean squared error of estimation among all straight lines ŷ = a + bx. Slope: b = r · (SD_y / SD_x). Intercept: a = mean(y) − b · mean(x). r = correlation coefficient = (1/n) Σ [(x_i − mean(x)) / SD_x] · [(y_i − mean(y)) / SD_y]
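A minimal sketch of these formulas in NumPy (the x and y arrays are made up):
import numpy as np
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 2.5, 4.0, 4.5])
r = np.mean((x - x.mean()) / x.std() * (y - y.mean()) / y.std())
b = r * y.std() / x.std()        # slope
a = y.mean() - b * x.mean()      # intercept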

Random Forest

A machine learning model that builds many different decision trees and aggregates their outputs (averaging predictions or taking a majority vote). Individual trees can make errors, but the ensemble reduces the overfitting of a single decision tree.

Covariance

A measure of linear association between two variables. Positive values indicate a positive relationship; negative values indicate a negative relationship

Scree plot

A plot of the variance captured by each principal component (the eigenvalues), in order of extraction. plt.plot(s**2 / np.sum(s**2)) plots the fraction of total variance per component.

Regular Expression (RegEx)

A pattern that values in text must match.
* : 0 or more; + : 1 or more; ? : 0 or 1
{x} : matches exactly x times; {x,y} : matches x to y times
.* : as many characters as possible (greedy); .*? : as few as possible (non-greedy)
\w : word character; \s : single whitespace; \d : digit [0-9]
\ : escapes a special character
[...] : a set of equivalent single characters; [^...] : not those characters
| : or; (...) : capture group, e.g. A(CG)+ matches ACGCGCG
^ : anchors the start; $ : anchors the end
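A minimal sketch with Python's re module (the strings are made up for illustration):
import re
re.search(r'A(CG)+', 'xxACGCGCGyy').group()   # 'ACGCGCG': + means one or more CG repeats
re.sub(r'\s+', ' ', 'too   many   spaces')    # collapses runs of whitespace to one space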

Multiple Linear Regression

A statistical method used to model the relationship between one dependent (or response) variable and two or more independent (or explanatory) variables by fitting a linear equation to observed data

Linear Combinations

A sum of scalar multiples of vectors; the scalars are called the weights. A model is linear in θ only if its prediction can be written as a linear combination of the θ's; if it cannot, the model is not linear.

Logistic Regression

A regression model that relates a set of explanatory variables to a dichotomous (binary) dependent variable by passing a linear combination of the features through the sigmoid function to output a probability.
lr = LogisticRegression(fit_intercept=True, solver='lbfgs')
lr.fit(X_train, Y_train)
lr.intercept_, lr.coef_

Bias-variance tradeoff

As complexity increases, bias decreases but variance increases

K-fold cross-validation

Data is split into k mutually exclusive subsets (folds); one fold serves as the validation set while the model is trained on the remaining folds. Repeat for each of the k possible choices of validation fold. The quality of a hyperparameter value is the average of the k validation errors; the purpose is to pick the hyperparameter (e.g. λ or α) with the lowest average validation error. Increasing k decreases bias but increases variance.
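A minimal sketch with scikit-learn (X, y, and the candidate α values here are placeholders):
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
X, y = np.random.rand(100, 3), np.random.rand(100)
for alpha in [0.01, 0.1, 1, 10]:
    # 5-fold CV; scoring is negative MSE, so negate to get the average validation MSE
    scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=5, scoring='neg_mean_squared_error')
    print(alpha, -scores.mean())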

SVD (Singular Value Decomposition)

Decomposes a matrix X into UΣVᵀ, with the most important directions in the leftmost columns. Σ (s) is a diagonal matrix of singular values (scaling). U and V are orthonormal: their columns are unit vectors and mutually orthogonal.

Batch Gradient Descent (BGD)

Each step of gradient descent uses the average of ALL the training examples.

N-grams

Given a list and n, produce the groups of n consecutive words. Rules: no wrap-around; the groups overlap.
def find_ngrams(input_list, n):
    return zip(*[input_list[i:] for i in range(n)])

Gradient Descent

Gradient descent iteratively adjusts parameters, gradually finding the combination of weights and bias that minimizes loss. Update rule: x_{n+1} = x_n − α ∇L(Y, f(x_n)). Use the derivative of the cost function and repeat the update to find a local minimum of the function being optimized.
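A minimal sketch of the update rule in NumPy, minimizing constant-model MSE (the data array, starting point, and learning rate are assumptions):
import numpy as np
y = np.array([2.0, 4.0, 6.0])
theta, alpha = 0.0, 0.1
for _ in range(100):
    grad = np.mean(2 * (theta - y))   # derivative of (1/n) Σ (y_i − θ)²
    theta = theta - alpha * grad
# theta converges toward y.mean() == 4.0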

High Order Polynomial

Increasing the polynomial degree adds features and decreases training MSE (bias), but increases variance.

Loss Function

L(y, ŷ)

L2 Squared Loss

L(y, ŷ) = (y − (a + bx))². Minimizers: â = mean(y) − b̂ · mean(x), b̂ = r · (SD_y / SD_x)

L1 Absolute Loss

L(y, ŷ) = |y − (a + bx)|

Ordinary Least Squares (OLS)

Linear Regression Model + MSE. The ordinary least squares estimates are obtained by minimizing the sum of squared residuals.

Inertia

Loss function for clustering: the sum of squared distances from each data point to its cluster center. Example: a is 0.21 away from its center, b is 0.19 away, c is 0.17 away; inertia = 0.21² + 0.19² + 0.17²
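A minimal sketch with scikit-learn (the points are made up):
import numpy as np
from sklearn.cluster import KMeans
X = np.array([[0, 0], [0.2, 0.1], [5, 5], [5.1, 4.9]])
km = KMeans(n_clusters=2, n_init=10).fit(X)
km.inertia_   # sum of squared distances from each point to its cluster center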

Node Entropy (S)

Measures how unpredictable a node is: higher S = lower predictability. With p_i the proportion of data points in the node with label i, S = −Σ p_i log₂(p_i). A node whose data are evenly split into C classes has entropy log₂(C).
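A minimal sketch of the formula in NumPy (the label list is made up):
import numpy as np
def entropy(labels):
    # S = -Σ p_i * log2(p_i) over the label proportions in the node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))
entropy(['a', 'a', 'b', 'b'])   # evenly split into 2 classes --> log2(2) == 1.0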

Mini-batch Gradient Descent

Uses neither the entire dataset at once nor a single example at a time: each step uses a mini-batch, a fixed number of training examples smaller than the full dataset.

Classification

Predict a categorical y (instead of predicting a real-valued y as in regression). Types: 1. binary (win or lose), 2. multiclass (labeling), 3. structured prediction (translation)

L2 MSE

R(θ) = (1/n) Σ (y_i − (a + bx_i))². For the constant model, θ̂ = mean(y)

L1 MAE

R(θ) = (1/n) Σ |y_i − (a + bx_i)|. For the constant model, θ̂ = median(y)
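A minimal numerical check of both constant-model optima (the data array is made up):
import numpy as np
y = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
thetas = np.linspace(0, 12, 1201)
mse = [np.mean((y - t) ** 2) for t in thetas]
mae = [np.mean(np.abs(y - t)) for t in thetas]
thetas[np.argmin(mse)]   # y.mean() == 4.0
thetas[np.argmin(mae)]   # np.median(y) == 3.0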

Regularization

Reduces model complexity and multicollinearity by penalizing large parameter values, which helps prevent overfitting.

SQL (Structured Query Language)

Clause order: SELECT DISTINCT ... FROM ... JOIN ... ON ... WHERE ... LIKE (simple pattern matching with % and _ wildcards, not full regex) GROUP BY ... HAVING ... ORDER BY ... LIMIT ... OFFSET

Stochastic Gradient Descent

SGD can be used for larger datasets. It converges faster when the dataset is large as it causes updates to the parameters more frequently. 1. Take an example 2. Feed it to Neural Network 3. Calculate its gradient 4. Use the gradient we calculated in step 3 to update the weights 5. Repeat steps 1-4 for all the examples in the training dataset

Decision Tree

Start at the root node and repeat splitting until every node is pure or unsplittable. Reaches 100% training accuracy in classification, so it will overfit; the decision boundary is nonlinear. It will still make errors if data points overlap (identical features with different labels).

Clustering

Unsupervised learning. Types of clustering: 1. k-means clustering: pick k, randomly place k centers, and repeat updates until convergence. 2. agglomerative clustering: pick k, repeatedly merge the two nearest clusters until only k clusters remain. 3. spectral clustering: pick k, group points together based on the shape/connectivity they form.

Recall (Sensitivity, TPR)

TP/(TP+FN)

Precision (PPV)

TP/(TP+FP)
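A minimal sketch with scikit-learn (the label arrays are made up; counts shown in comments):
from sklearn.metrics import accuracy_score, precision_score, recall_score
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]      # TP=3, FP=1, FN=1, TN=1
precision_score(y_true, y_pred)   # TP / (TP + FP) = 3/4
recall_score(y_true, y_pred)      # TP / (TP + FN) = 3/4
accuracy_score(y_true, y_pred)    # (TP + TN) / total = 4/6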

Hyperparameter-Learning Rate (α)

The length of each step in gradient descent. Decreasing α increases bias and decreases variance (helps prevent overfitting).

Odds Ratio (OR) of Logistic Regression

The odds that an individual with a prognostic (risk) factor had an outcome of interest as compared to the odds for an individual without the prognostic (risk) factor.

L1 Regularization (Lasso Regression)

The only difference from ridge is that the penalty uses the magnitudes (absolute values) of the coefficients instead of their squares. The optimal parameters tend to include exact zeros, so lasso not only reduces overfitting but also performs feature selection.
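A minimal sketch with scikit-learn contrasting the two penalties (the data are random, so the exact coefficients are illustrative only):
import numpy as np
from sklearn.linear_model import Lasso, Ridge
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + 0.5 * rng.normal(size=100)   # only feature 0 matters
Lasso(alpha=0.1).fit(X, y).coef_   # irrelevant coefficients driven to exactly 0
Ridge(alpha=0.1).fit(X, y).coef_   # coefficients shrunk, but not exactly 0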

One-hot encoding

The process by which categorical variables are converted into binary form (0 or 1) for machine reading. It is one of the most common methods for handling categorical features in text data. dummies = pd.get_dummies(data['day'])

Holdout Validation

The sample data are randomly divided into training and validation sets.
from sklearn.utils import shuffle
train, valid = np.split(shuffle(data), [25])
shuffle --> random order; 25 --> train size; the remaining 35 − 25 = 10 rows --> validation size (assuming 35 rows of data)

Orthogonal Projection

The prediction Ŷ = Xθ̂ closest to Y is the orthogonal projection of Y onto the span of X. So the optimal θ̂ is the one that makes the residual vector Y − Xθ̂ orthogonal to the span of X.

Feature engineering

Transform raw features into features usable for modeling or EDA. Example:
df['hp^2'] = df['hp'] ** 2
model = LinearRegression()
model.fit(df[['hp', 'hp^2']], df['mpg'])

Cross-validation (2 methods)

Estimating how well a model generalizes by repeatedly holding out part of the data (drawn from the same population) as a validation set. Purpose: avoid overfitting, pick hyperparameters (e.g. α), and still make use of all data points for training. Two methods: 1. holdout validation, 2. k-fold cross-validation.

Weighted Entropy (WS)

WS = (# of samples in node/# of total samples) * entropy of the node

Central Limit Theorem (CLT)

When n is large, the sampling distribution of the sample mean is approximately Normal.
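A minimal simulation sketch in NumPy (the skewed population and sample size are assumptions):
import numpy as np
rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)   # a skewed population
# the distribution of the sample mean for n = 100 is approximately Normal
sample_means = [rng.choice(population, size=100).mean() for _ in range(2000)]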

Multiple Observation

Ŷ = Xθ, where X is an n × (p + 1) design matrix and Y is an n × 1 vector of observations

Random Variable

a numerical description of the outcome of an experiment: a function whose input is a random sample (of size n) and whose output is a number on the number line

Simple Linear Regression Model

a parametric model (the goal is to choose the best parameters for slope and intercept). scikit-learn:
model = LinearRegression()
model.fit(df[['feature']], df['target'])   # features, observations to predict
model.predict(new_data)                    # input data
model.intercept_, model.coef_

Sample

a subset of the access frame

Access frame

collection of elements accessible for measurement (could be outside of target population)

Types of bias

coverage, selection, non-response, measurement

Test Set

data set used to estimate accuracy of final model on unseen data

Decision Tree Modeling

decision_tree_model = tree.DecisionTreeClassifier(criterion='entropy')
X_train = df[['characteristic_1', 'characteristic_2']]
decision_tree_model.fit(X_train, Y_train)
decision_tree_model.predict(X_test)

pandas: Create a data frame

df = pd.DataFrame( {"a" : [4, 5, 6], "b" : [7, 8, 9], "c" : [10, 11, 12]}, index = [1, 2, 3])

pandas: Select columns whose name matches regular expression regex

df.filter(regex='regex')

pandas: Grouped by values in column named "col"

df.groupby("col")

pandas: Aggregate group using function

df.groupby(by="col").agg(function)

Create histogram

df.hist('x', bins=...)   # bins sets the number of bins

pandas: Select only rows, only columns or both

df.loc[] and df.iloc[]

pandas: Spread rows into columns

df.pivot_table(columns='var', values='val')

Create line/scatter plots

df.plot('x', 'y') for lines, df.plot.scatter('x', 'y') for scatter plots; label axes with plt.xlabel()/plt.ylabel()

pandas: Order rows by values of a column

df.sort_values('mpg')

pandas: Count number of rows with each unique value of variable

df['w'].value_counts()

pandas: Select rows whose column value is in a list

df[df['col'].isin(list)]

pandas: Return parts of a data frame based on conditions

df[df['col']=='a']

Types of Quantitative Variables

discrete(age) and continuous(price)

Decision Tree Accuracy

from sklearn.metrics import accuracy_score accuracy_score(predictions, Y_test)

Confusion Matrix

from sklearn.metrics import confusion_matrix Y_test_pred = lr.predict(X_test) cnf_matrix = confusion_matrix(Y_test, Y_test_pred) cnf_matrix

Singular Value Interpretation

The ith singular value σ_i in Σ tells us about the variance of the ith PC: σ_i²/n is the variance captured by the ith principal component (σ_i² is the corresponding eigenvalue).

Model Risk (bias-variance decomposition)

model risk = σ² + (model bias)² + model variance. Holds for linear and non-linear models with squared loss, but not for absolute loss or zero-one (classification) loss.

Multinomial Distribution

n independent trials each of which leads to a success for exactly one of k categories, with each category having a given fixed success probability
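A minimal sketch with NumPy (the probabilities are made up):
import numpy as np
# 20 independent trials, each landing in exactly one of 3 categories
np.random.multinomial(20, [0.5, 0.3, 0.2])   # e.g. array([11, 6, 3])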

Types of Qualitative Variables (categories)

ordinal(preference, level of education) and nominal(id#)

pandas: Gather columns to rows

pd.melt(df)

pandas: Join matching rows from bdf to adf

pd.merge(adf, bdf, how='left', on='x1')

Protocol

procedure to choose a sample, follow up, training, etc.

Types of Variables

qualitative and quantitative

instrument

the questionnaire or measurement device administered to an individual in the sample

urn model

random selection of samples from a population, modeled as a container of identical or labeled marbles; helps estimate the size of the variation due to chance in sampling
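A minimal simulation sketch in NumPy (the urn contents and sample size are assumptions):
import numpy as np
rng = np.random.default_rng(0)
urn = np.array([1] * 30 + [0] * 70)   # 30 labeled marbles out of 100
# draw many samples of 20 without replacement to see the sampling variation
props = [rng.choice(urn, size=20, replace=False).mean() for _ in range(1000)]
np.std(props)   # size of the variation due to chance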

Granularity

refers to the level of detail in the data: what each row (record) of the table represents

Types of variation

sampling variation, assignment variation, measurement error

Relationships in scatter plot

simple linear, simple nonlinear, linear spreading, v-shaped

Types of uniform random sampling

simple random sampling (SRS), systematic, stratified

Expectation E[x]

the long-run average value of X, weighted by probability: E[X] = Σ x · P(X = x)

L2 Regularization (Ridge Regression)

the cost function is altered by adding a penalty equal to the square of the magnitude of the coefficients. Coefficients are shrunk toward zero but not set exactly to zero, so ridge does not eliminate redundant features the way lasso does.

