CS345 Final Exam


How to improve accuracy for nearest neighbor classifier?

- Modify the features (normalize them and remove noise), and base the decision on multiple nearest neighbors rather than just one (kNN).

How to combat the fact that data needs to be linearly separable in the perceptron?

1. Add a limit on the number of iterations. 2. Add a bias term so that the hyperplane is shifted and does not have to go through the origin: add another dimension to the weight vector and a constant feature to the data (a column of 1s in the feature matrix), or formulate the algorithm with an explicit bias term. Convert the labels with y = 2y - 1, mapping {0, 1} to {-1, +1}. (Sketch below.)
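
A minimal numpy sketch of the bias trick (illustrative, not the course's exact code): append a column of 1s and remap {0, 1} labels to {-1, +1}.

    import numpy as np

    X = np.array([[2.0, 1.0], [0.5, 3.0]])   # toy feature matrix
    y = np.array([1, 0])                      # labels in {0, 1}

    # Bias trick: a constant feature of 1 lets the last weight act as the bias b,
    # so the learned hyperplane no longer has to pass through the origin.
    X_bias = np.hstack([X, np.ones((X.shape[0], 1))])

    # Label remap: y <- 2y - 1 turns {0, 1} into {-1, +1} for the perceptron update.
    y_pm = 2 * y - 1
    print(X_bias)
    print(y_pm)   # [ 1 -1]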

Gradient Descent

A technique to minimize loss by computing the gradients of the loss with respect to the model's parameters, conditioned on training data. Informally, gradient descent iteratively adjusts the parameters, gradually finding the best combination of weights and bias to minimize the loss: take small steps in the negative direction of the gradient. A common halting condition is little change in the loss function across epochs. (Sketch below.)
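
A minimal gradient descent sketch (an illustration on a made-up least-squares problem; the data and learning rate are my own choices, not the course's):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

    w = np.zeros(3)      # initial parameters
    lr = 0.1             # learning rate (step size)
    prev_loss = np.inf
    for epoch in range(1000):
        grad = 2 / len(y) * X.T @ (X @ w - y)   # gradient of the mean squared error
        w -= lr * grad                           # small step in the negative gradient direction
        loss = np.mean((X @ w - y) ** 2)
        if abs(prev_loss - loss) < 1e-10:        # halt when the loss barely changes
            break
        prev_loss = loss
    print(w)             # should be close to [1.0, -2.0, 0.5]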

Random Forests

A variant of the bagging ensemble method that generates a committee of classification or regression trees based on different random samples, but restricts each individual tree, at each split, to a limited number of randomly selected features (variables). The base learner is a decision tree.
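
A minimal scikit-learn sketch (illustrative; the dataset and hyperparameters are arbitrary choices, not from the course):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # n_estimators: number of trees; max_features: features considered per split
    clf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))   # accuracy on held-out data
    print(clf.feature_importances_)    # per-feature importance scores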

Bagging

Bagging and other resampling techniques can be used to reduce the variance in model predictions. In bagging (Bootstrap Aggregating), numerous replicates of the original data set are created using random selection with replacement. Each derivative data set is then used to construct a new model and the models are gathered together into an ensemble. To make a prediction, all of the models in the ensemble are polled and their results are averaged.
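
A minimal scikit-learn sketch (illustrative; assumes scikit-learn >= 1.2, where the base model is passed as estimator):

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import BaggingClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Each of the 50 trees is trained on a bootstrap sample (drawn with replacement);
    # predictions are combined by majority vote (averaging for regression).
    bag = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50,
                            bootstrap=True, random_state=0)
    bag.fit(X_train, y_train)
    print(bag.score(X_test, y_test))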

What is stratified K-fold?

It is better to use stratified k-fold for classification problems and plain k-fold for regression. Plain k-fold chooses a random subset of examples for each fold, which can leave classes underrepresented. Stratified k-fold fixes that: it makes sure each class is represented in each fold in proportion to its overall fraction in the data. This is important for classes with few examples. Use it with shuffling too; that is the best approach. (Sketch below.)
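
A minimal scikit-learn sketch of stratified folds with shuffling (the dataset and model are arbitrary illustrative choices):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    X, y = load_iris(return_X_y=True)
    # Each fold preserves the overall class proportions; shuffle removes ordering bias.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
    print(scores.mean())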

What is expected for accuracy of training set for nearest neighbor classifier and why?

Expect to see very high accuracy (close to 100%) when testing the classifier on the training set, because this classifier memorizes the training data; every test example then finds a perfect match in the memorized training set. It is not as good with new data (overfitting).

Why use linear classifiers?

A good baseline: simple and stable, less likely to overfit because there are fewer parameters (though they can underfit), good for high-dimensional data, and the algorithms are scalable, i.e. able to handle an increasing amount of work or data efficiently, maintaining performance as the input grows without significant degradation.

how do partial derivatives fit into gradient descent?

Gradient descent is an optimization algorithm that adjusts the parameters of a model to minimize a cost or loss function. A partial derivative is the mathematical operation that computes the rate of change of a function with respect to one individual variable (it captures how the function changes along that variable). Gradient descent uses the gradient (the vector of partial derivatives) to update the parameters and minimize the cost function.

Decision Trees

Iteratively splits the data based on the value of a single feature. At each step, the algorithm picks the feature that leads to the greatest "purity" of the split. Stop growing the tree when: the maximum depth is reached, a leaf node contains only examples of the same class, or a leaf node contains very few examples.

L2 norm vs. L1

L2 = Euclidean norm, p = 2 (unit ball is a circle). L1 = Manhattan distance, p = 1 (unit ball is a square; grid system, less flexible). The Minkowski norm is a generalization of both.
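
A small numpy illustration of the norms above (the vector is made up):

    import numpy as np

    v = np.array([3.0, -4.0])
    print(np.linalg.norm(v, ord=1))   # L1 (Manhattan): |3| + |-4| = 7
    print(np.linalg.norm(v, ord=2))   # L2 (Euclidean): sqrt(9 + 16) = 5
    p = 3                             # Minkowski generalizes both: (sum |v_i|^p)^(1/p)
    print(np.sum(np.abs(v) ** p) ** (1 / p))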

differences between MSE, MAE, and RMSE

MAE = easiest to understand (the average absolute error). MSE = "punishes" larger errors. RMSE = second easiest to understand; it is in the same scale/units as y (MSE is not).
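
A minimal sketch computing all three with scikit-learn (the toy numbers are made up):

    import numpy as np
    from sklearn.metrics import mean_absolute_error, mean_squared_error

    y_true = np.array([3.0, 5.0, 2.5, 7.0])
    y_pred = np.array([2.5, 5.0, 4.0, 8.0])

    mae = mean_absolute_error(y_true, y_pred)   # mean |y - y_hat|
    mse = mean_squared_error(y_true, y_pred)    # mean (y - y_hat)^2, punishes big errors
    rmse = np.sqrt(mse)                         # back in the same units as y
    print(mae, mse, rmse)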

What does it mean for dot products to be negative or positive?

Negative means the vectors point in dissimilar directions, so the features are not similar; positive means similar directions and similar features.

Does standardization matter for decision trees?

No. At every node the split is a binary decision on a single feature, so feature scale does not matter: rescaling a feature just rescales the threshold and produces the same splits.

Pros and cons to perceptron

Pros: simple, easy, fast to train. Cons: unclear how to set the hyperparameters; the solution bounces around and ends at an arbitrary separating hyperplane; limited to linear decision boundaries and binary problems; can overfit with too many iterations.

Pros and cons for nearest neighbor classifier

Pros: simple, trivial to train, works for classification and regression, decision boundary not limited to a particular form. Cons: expensive time-wise during testing, especially for large datasets; low accuracy for high-dimensional data; usually not the best-performing classifier.

Why does RF have better accuracy than DT or bagging?

RF has more diverse classifiers because it randomizes over the features too, not just the bootstrap samples; we want randomness.

differences in decision boundary between RF and DT

RF's boundary is smoother, more flexible, and more accurate; more classifiers means more accuracy. RF is less noisy because it suffers less from overfitting, and it is more stable, with lower standard deviation than a DT.

explain SVM (support vector machine) (specifically margins)

SVM is a big linear-classifier upgrade from the perceptron, based on large-margin classification. LinearSVC is a better, faster implementation for the linear case. The classic formulation is the hard-margin SVM, which allows no points inside the margin; this is not always good if the data is not cleanly separable (noise) and can cause poor accuracy. Soft-margin SVM, however, allows some points to be within the margin or misclassified, which is more flexible; the C parameter controls it. Low C gives a soft margin, allowing slack variables into the margin. Low C can underfit and large C can overfit. (Sketch below.)
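
A minimal scikit-learn sketch of the C trade-off (dataset and C values are arbitrary illustrative choices):

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import LinearSVC

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for C in (0.01, 1.0, 100.0):   # small C = softer margin, large C = harder margin
        clf = make_pipeline(StandardScaler(), LinearSVC(C=C, max_iter=10000))
        clf.fit(X_train, y_train)
        print(C, clf.score(X_test, y_test))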

SVM vs RF?

SVMs are more susceptible to overfitting, and SVMs don't handle missing data well. SVMs are good for high-dimensional data; RF is good for large datasets.

explain PCA - principal components analysis

Selecting a smaller-dimensional subspace to represent the data (reduces dimensionality of high-dimensional data). The first principal component is the direction along which the sample variance of the dataset is largest. The data is transformed to be represented on a new set of axes defined by the component vectors. Benefits: faster, helps with visualization, removes noise. (Sketch below.)
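
A minimal scikit-learn sketch (illustrative): standardize, then keep k = 2 components.

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X, _ = load_iris(return_X_y=True)
    X_std = StandardScaler().fit_transform(X)   # zero mean, unit variance per feature

    pca = PCA(n_components=2)                   # k = number of components to keep
    X_2d = pca.fit_transform(X_std)             # data re-expressed on the new axes
    print(pca.explained_variance_ratio_)        # variance captured by each component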

Boosting

Similar to bagging, but the classifiers are constructed iteratively; the main idea is to focus training on the examples that previous classifiers made errors on.

Accuracy

(TP + TN) / (P + N)

partial derivatives

Taking the derivative of a function with respect to one specific variable while treating the other variables as if they were constants; used for multivariate functions. You still set the partial derivatives to 0 to find a minimum or maximum, but you cannot tell from this alone whether it is local or global.

What are the names of the entries in the conf. matrix for a classification problem

True positive, false negative, false positive, true negative

Ensemble methods

Use multiple algorithms to obtain better predictive performance than could be obtained from any of the algorithms by itself. Accuracy increases as # of models increases

Explain a labeled dataset

A collection of vectors (features), each with its own associated label (class). X = features, y = class labels.

Describe what a hyperplane is

A key mathematical object that gives rise to linear classifiers, the simplest examples. It is the decision boundary. If the bias b = 0, the hyperplane goes through the origin. The dot product of the input vector with the weight vector, plus the bias term, gives a score (w · x + b); the sign of that score determines which side of the hyperplane the example is on, with a different class on each side. (Sketch below.)
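
A small numpy sketch of this decision rule (the numbers are made up):

    import numpy as np

    w = np.array([2.0, -1.0])   # weight vector (normal to the hyperplane)
    b = 0.5                     # bias term: shifts the hyperplane off the origin

    def predict(x):
        score = np.dot(w, x) + b          # the score w . x + b
        return 1 if score >= 0 else -1    # the sign picks the side, i.e. the class

    print(predict(np.array([1.0, 1.0])))    # score = 2 - 1 + 0.5 = 1.5 -> class +1
    print(predict(np.array([-1.0, 1.0])))   # score = -2 - 1 + 0.5 = -2.5 -> class -1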

k-nearest neighbor explain

A majority vote among the k nearest neighbors chooses the predicted class label. Distances can be used as weights for each vote: a smaller distance gives a larger vote.

global minimum

A point whose function value is lower than all other function values. In ML a function typically has many local minima, not just a single global one.

Describe the perceptron and inputs and outputs

A training algorithm that finds a hyperplane separating the two classes of examples; a precursor to neural networks. It iterates over the training examples, updating the weight vector in a way that makes each misclassified training example more likely to be correctly classified. Works only for linearly separable data; simple binary classification. Input: a labeled dataset. Output: a weight vector w. (Sketch below.)
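
A minimal perceptron sketch (my own illustrative implementation, assuming labels in {-1, +1} and a bias column of 1s as described earlier):

    import numpy as np

    def perceptron(X, y, epochs=100, lr=1.0):
        w = np.zeros(X.shape[1])              # start from the zero weight vector
        for _ in range(epochs):               # iteration limit handles non-separable data
            errors = 0
            for xi, yi in zip(X, y):
                if yi * np.dot(w, xi) <= 0:   # misclassified (or on the boundary)
                    w += lr * yi * xi         # nudge w toward classifying xi correctly
                    errors += 1
            if errors == 0:                   # converged: every example classified correctly
                break
        return w

    # Toy usage: AND-like data with a constant 1 appended for the bias.
    X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
    y = np.array([-1, -1, -1, 1])
    print(perceptron(X, y))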

What is a unit vector?

A vector with norm = 1. To make a vector a unit vector, divide it by its norm. The unit vector points in the direction of the original vector.

What does it mean to add more dimension to a problem/dataset

add more features/variables

what is parameter b

An additional parameter for flexibility; it determines the position of the hyperplane. It is the bias term.

Condorcet's Jury Theorem

If each voter independently chooses the correct answer with probability greater than 1/2, the probability that the majority vote is correct grows toward 1 as the number of voters increases. For ML this means all we need is an ensemble of diverse, better-than-chance models built from the same training data.

How to find maximum not minimum for gradient descent?

Alter the gradient descent update so the parameters move in the positive direction of the gradient (increasing values); this performs gradient ascent, maximizing the cost function instead of minimizing it.

simplest ensemble method?

bagging

standardization

Brings every feature to the same scale (subtract the mean, divide by the standard deviation), balancing the features.

steps in cross-validation

Create a fixed test set, making sure every class is equally represented in both the train and test sets (stratification does this). Then generate different-sized training sets from the available training data and evaluate a classifier trained on each training set.

what is basis functions?

Creating a transformation of the data using a set of functions called basis functions, e.g. polynomial basis functions.

What is dimensionality of resulting features when using monomials to 2nd degree?

d(d-1)/2: the number of feature pairs increases quadratically with the number of features d, so the transformed data grows quickly; it becomes slow and may not fit in memory. Therefore not good for high-dimensional data.

how do derivatives work towards finding minimum?

The derivative points away from a local minimum, so use its negative: take small steps in the direction of the negative of the derivative, which should converge to a local minimum.

What is normalization

dividing a vector by its norm to get the unit vector.

additional features of RF

Error estimation during training using the out-of-bag data, and variable importance: RF generates a score for each feature according to its contribution to accuracy on the out-of-bag examples.

What is the gamma hyper parameter for kernels

Gamma controls how far the influence of a single training example reaches (the kernel width). Low gamma = large (wide) influence.

confusion matrix

Gives insight into the errors. Rows of the matrix correspond to true labels, columns to predicted labels; the elements in a given row quantify how the predictions are distributed across the different classes. It tells you where the classifier is confused. (Sketch below.)
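
A minimal scikit-learn sketch (the toy labels are made up for illustration):

    from sklearn.metrics import confusion_matrix

    y_true = [0, 0, 1, 1, 1, 2, 2]
    y_pred = [0, 1, 1, 1, 0, 2, 2]
    # Rows = true labels, columns = predicted labels.
    print(confusion_matrix(y_true, y_pred))
    # [[1 1 0]
    #  [1 2 0]
    #  [0 0 2]]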

Does gradient descent guarantee global minimum?

Gradient descent is not guaranteed to find the global minimum (or maximum), since cost functions often have many local minima and maxima. Whether it reaches the global one depends on initial conditions such as the starting point, the learning rate, and the characteristics of the cost function.

Gradient

The gradient is the vector of partial derivatives. As you take steps closer to the minimum, the magnitude of that vector gets smaller, so the steps get smaller and the algorithm eventually converges and stops.

F1 score

The harmonic mean of precision and recall: F1 = 2PR / (P + R). It focuses on true positives, not true negatives, i.e. on how well a classifier retrieves relevant items, e.g. in language-processing applications and search.

Precision

How accurate the positive predictions are: TP / (TP + FP).

Which direction does the gradient point?

In the direction of steepest increase of the function. The gradient is perpendicular to the contours of constant function value, and it tends to zero as we approach a minimum.

when is train test splits better than cross-validate?

When the dataset is large and cross-validation would be too time-consuming.

What makes up any vector

Length and direction (norm and unit vector): û = u / ||u||.

What does the dot product of two vectors measure?

Measures how closely two vectors align in terms of the directions they point, and is closely related to the angle between them: for unit vectors, it equals the cosine of the angle. Computed by multiplying the two vectors together feature by feature and summing. (Sketch below.)
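
A small numpy illustration (the vectors are made up):

    import numpy as np

    u = np.array([1.0, 0.0])
    v = np.array([1.0, 1.0])
    # Normalizing by the norms turns the dot product into the cosine of the angle.
    cos_angle = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    print(cos_angle)                          # cos(45 degrees) ~= 0.707
    print(np.degrees(np.arccos(cos_angle)))   # ~45.0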

what is generalization

The most important issue in ML: the ability to perform well on unseen test data.

Decision boundary for perceptron vs. nearest neighbor

The nearest neighbor decision boundary is more flexible, reducing underfitting, and it does not have to be linear like the perceptron's.

what is Non-linear modeling using basis function regression used for

Non-linear data. It addresses the limitation of linear regression by mapping the data to a higher-dimensional feature space: the features are transformed using a collection of non-linear (basis) functions, and a linear model is then fit on the transformed features. The model is still linear in its parameters, but not in the original inputs. It can be applied to classification problems too. (Sketch below.)
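
A minimal scikit-learn sketch (illustrative; the degree and toy data are arbitrary choices):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(0)
    x = np.linspace(-3, 3, 50).reshape(-1, 1)
    y = np.sin(x).ravel() + 0.1 * rng.normal(size=50)   # non-linear target

    # Degree-5 polynomial basis functions, then an ordinary linear model on top:
    # linear in the parameters, non-linear in x.
    model = make_pipeline(PolynomialFeatures(degree=5), LinearRegression())
    model.fit(x, y)
    print(model.score(x, y))   # R^2 on the training data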

pearson correlation coefficient

The numerator is the dot product of the mean-subtracted vectors; the denominator is the product of their norms (equivalently, the coefficient is the dot product of the two mean-subtracted unit vectors). The strength of a linear relationship can be quantified through the correlation. For linear regression y = wx + b, the residuals (distance from each point to the least-squares line) define the sum of squared errors; w and b are chosen to minimize the sum-squared error.

Compare kNN to NN

One-nearest-neighbor overfits; its boundary is not smooth. A smoother boundary is less affected by noise and has higher accuracy, but too many neighbors underfits: the boundary becomes too smooth and overgeneralized. Sensitivity to noisy data increases as k decreases.

How to convert binary classifier into a multi-class classifier

One-vs-the-rest: train one binary classifier per class in the data (scales to a large number of classes). One-vs-one: train a binary classifier for each pair of classes, each distinguishing between one pair (better for a smaller number of classes).

What is it called when the dot product is equal to 0.

orthogonal vectors

What is the angle between orthogonal vectors?

orthogonal vectors are perpendicular to each other, so 90 degree angle. cos(90) = 0.

If data is not linear and SVM accuracy is low, what would you do?

Map the data to a higher-dimensional feature space using polynomial basis functions (or a polynomial kernel); the SVM may then find the data linearly separable and perform very well.

What is recursive feature elimination?

Performs feature selection, building on the intuition that the magnitude of a feature's weight in a linear classifier is a good indication of that feature's usefulness.

What can you say about the angle between vectors that have positive coefficients?

Positive coefficients imply that all the components of the vectors are positive, so the dot product is positive and the angle between the vectors is less than 90 degrees (0 if they point in exactly the same direction).

derivative rules

power rule, quotient rule, product rule, chain rule.

Pros and cons to decision trees

Pros: simple to understand, white box, handles categorical and numerical data, handles missing data, fast training. Cons: unstable, inaccurate for high-dimensional data; SVMs tend to be better.

Pros and Cons to Linear regression

Pros: simple to understand, highly interpretable, fast training, no tuning required, can do well with a small dataset. Cons: presumes a linear relationship, not competitive with better regression models, sensitive to outliers and irrelevant features. It is a "parametric method": success depends on the data satisfying our assumption that it falls on a line (hyperplane).

cross-val-score

Returns only accuracy scores; the scoring parameter determines which accuracy measure is evaluated for each fold. Use balanced accuracy when the data is unbalanced (one class contains a much larger number of examples).

Non-linear SVM: Kernels - explain

The same idea as polynomial basis function regression, but more efficient. C is important in controlling the extent of regularization. Kernels map the data to a higher-dimensional space without explicitly computing the mapping (which is expensive for high-dimensional data) by expressing everything using dot products: squaring a dot product in the original space has the same effect as taking a dot product in the transformed space. (Checked numerically below.)
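
A small numpy check of that last claim, using the standard degree-2 feature map phi(x) = (x1^2, sqrt(2) x1 x2, x2^2) (an illustration; the vectors are made up):

    import numpy as np

    def phi(v):
        # Explicit degree-2 feature map for 2-dimensional inputs.
        return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

    x = np.array([1.0, 2.0])
    z = np.array([3.0, -1.0])
    print(np.dot(x, z) ** 2)        # squared dot product in the original space: 1.0
    print(np.dot(phi(x), phi(z)))   # dot product in the transformed space: also 1.0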

what is min-max scaling

Scaling the data so values fall between 0 and 1.

What is shuffling and why is it good

Shuffling is good because most datasets are organized in a certain way; shuffling removes that bias.

How does nearest neighbor classifier work?

Simple! It finds the example in the training data that is closest to the example that needs to be classified and returns its label. The notion of closeness used here is the distance between examples, e.g. Euclidean distance (the length of the vector that points from x′ to x).

Derivatives

Used to solve linear regression by finding the minima and maxima of a function of a single variable (gradient descent handles the multi-variable case). The derivative of a single-variable function is its steepness at a given point: it tells you the slope of the function there. The line tangent to f(x) at a point has a slope equal to the derivative at that point. Set the derivative equal to 0 to find candidate minima and maxima.

What is k-fold cross validation?

Split the training/validation data into k parts; train on k−1 parts and validate on the remaining part. Repeat until each fold has been used for evaluation, then compute the accuracy by averaging the estimates generated for each fold. Use with small datasets: more accurate than a single test set and more efficient than averaging over multiple train-test splits. (Sketch below.)
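
A minimal scikit-learn sketch (the model and k are arbitrary illustrative choices):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    # cv=5: train on 4 folds, validate on the 5th, rotating through all folds.
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
    print(scores)          # one accuracy per fold
    print(scores.mean())   # the cross-validated accuracy estimate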

how do you compute the accuracy of the classifier from the confusion matrix

The sum of the diagonal divided by the sum of all the entries.

why is not using a validation set bad?

Because the choice of which value to report then uses information about the test set, and the end result is an accuracy estimate that is over-optimistic. Model-selection decisions should never be based on test-set performance.

What is the square root of the dot product?

The square root of the dot product of a vector with itself is the Euclidean norm.

What is the norm of a vector

The length of the vector. Finding the norm of a vector in 2 dimensions means finding the hypotenuse (Pythagorean theorem).

Balanced accuracy

The mean of recall (true positive rate) and specificity (true negative rate): (TP/P + TN/N) / 2.

What is the norm of a vector as a dot product?

The norm (or magnitude) of a vector can be expressed via the dot product of the vector with itself: ||u|| = sqrt(u · u).

Why does using w' make xi more likely to be correctly classified?

The prime (w′) is the updated weight vector, w′ = w + yi·xi (possibly scaled by a learning rate). After the update, yi(w′ · xi) is larger than yi(w · xi), so the dot product is more likely to have the correct sign for xi. These updates constitute the perceptron algorithm; yi = 1 for positive examples and −1 for negative examples.

What is a learning curve and why would you use it?

The curve of accuracy as a function of the number of training examples is called a learning curve. It is used to assess performance and to identify over/underfitting.

Purity in decision tree

The splits should be as homogeneous as possible in the class composition of their members, as measured by the Gini impurity coefficient. Look for splits that decrease the impurity of a node in the tree.

What do you do when the number of folds equals training examples?

This is the case for very small datasets (often medical): use leave-one-out cross-validation, where N models are trained for N examples, each with one example held out for testing.

When transforming data in non-linear modeling using basis function regression, how do you transform the data?

Use polynomials: a pipeline applies polynomial features to the dataset, then applies linear regression (see the pipeline sketch under basis function regression above). A high polynomial degree causes overfitting; higher degrees have more structure, and you don't want to fit too closely to the noise.

how to use a validation set?

Use train-test-split twice (once for a train/test split, once for a train/validate split). To get the right validation size in the second split, use size_validation / (size_validation + size_train). Don't do this with small datasets, however. (Sketch below.)
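
A minimal sketch of the double split (the 20%/20% sizes are arbitrary illustrative choices):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)

    # First split: hold out 20% as the final test set.
    X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2,
                                                      random_state=0)

    # Second split: want 20% of the full data as validation, so use the formula
    # size_validation / (size_validation + size_train) = 0.2 / (0.2 + 0.6) = 0.25.
    X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25,
                                                      random_state=0)
    print(len(X_train), len(X_val), len(X_test))   # 90, 30, 30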

PCA trend as principle components increases

The variance captured by each successive principal component decreases, so later components carry less information for distinguishing between classes. The components play the role of new features; the parameter is k (the number of components).

What is the decision boundary of a NN classifier related to?

The Voronoi diagram: we get the decision boundary by merging adjacent cells that have the same label associated with them.

What is parameter w

w is the weight vector and it assigns a weight to each feature.

Recall

What fraction of the positive examples our classifier classifies correctly (the true positive rate): TP/P = TP/(TP + FN).

How can basis function regression overfit?

with increased feature-space dimensionality, the resulting regression function has increased flexibility and erratic overfitting can occur.

Do you standardize PCA?

Yes, always, so that the data has zero mean and unit variance and no single feature dominates.

