Machine Learning: Models

What is the difference between logistic regression and SVM without a kernel?

(Mostly in implementation - both learn a linear decision boundary but use different loss functions (hinge loss for the SVM vs. logistic loss), and one is often much more efficient and has good optimization packages)

Why do we need PCA and what does it do?

(PCA tries to find a lower-dimensional surface such that the sum of the squared projection errors is minimized)

How does the SVM parameter C affect the bias/variance trade off?

(Remember C = 1/lambda: increasing lambda (i.e. decreasing C) increases bias and decreases variance, while increasing C does the opposite)

Why do we need dimensionality reduction techniques?

(data compression, speeding up the learning algorithm, and visualizing data)

What are the two pre-processing steps that should be applied before doing PCA?

(mean normalization and feature scaling)

DECISION TREES: What are some common uses of decision tree algorithms?

1. Classification 2. Regression 3. Measuring feature importance 4. Feature selection

SVM: What common kernels can you use for SVM?

1. Linear 2. Polynomial 3. Gaussian RBF 4. Sigmoid

LINEAR REGRESSION: What training algorithms are appropriate for a linear regression on large data sets? Which should be avoided?

Appropriate: stochastic gradient descent, mini-batch gradient descent.
Avoided: the normal equation (too computationally complex for large data sets). A sketch of the scalable approach is shown below.
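As an illustration (not from the original card), here is a minimal sketch, assuming synthetic data and scikit-learn's SGDRegressor, of fitting a linear regression with stochastic updates instead of the normal equation:

import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 20))                       # large synthetic design matrix
y = X @ rng.normal(size=20) + rng.normal(scale=0.1, size=100_000)

X_scaled = StandardScaler().fit_transform(X)             # SGD benefits from scaled features
model = SGDRegressor(max_iter=20, tol=1e-4)              # each epoch streams over the samples
model.fit(X_scaled, y)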

KNN: Is kNN a parametric or non-parametric algorithm? Is it used as a classifier or regressor?

kNN is non-parametric and can be used as either a classifier or regressor.

LOGISTIC REGRESSION: Can logistic regression produce a probability score along with its classification prediction?

Yes; its sigmoid output is itself a probability score (see the sketch below).
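A minimal sketch, assuming scikit-learn and a toy dataset (my own example, not from the card):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression().fit(X, y)

print(clf.predict(X[:3]))        # hard class labels (0/1)
print(clf.predict_proba(X[:3]))  # probability score for each class; each row sums to 1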

REGULARIZED LINEAR MODELS: What hyperparameters can be tuned in regularized linear models? Explain how they affect model learning.

You can tune the weight of the regularization term (typically denoted alpha), which affects how strongly the model's coefficients are shrunk. alpha = 0 --> the regularized model is identical to the unregularized model. As alpha grows very large --> the coefficients are shrunk toward zero and the model approaches a constant prediction. A sketch of this effect is shown below.
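A sketch of the effect, assuming synthetic data and scikit-learn's Lasso (the alpha values are illustrative):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
for alpha in [0.001, 1.0, 100.0]:
    coefs = Lasso(alpha=alpha, max_iter=10_000).fit(X, y).coef_
    print(alpha, np.round(coefs, 2))   # larger alpha -> coefficients shrink, some reach exactly 0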

When will you use Bayesian methods instead of Frequentist methods?

(Small dataset, large feature set)

DECISION TREES: How is feature importance evaluated in decision-tree-based models?

The features that are split on most frequently and are closest to the top of the tree, thus affecting the largest number of samples, are considered to be the most important.

Logistic regression vs. SVMs: When to use which one?

(Let's say n and m are the number of features and training samples respectively. If n is large relative to m, use logistic regression or SVM with a linear kernel. If n is small and m is intermediate, use SVM with a Gaussian kernel. If n is small and m is massive, create or add more features, then use logistic regression or SVM without a kernel.)

LOGISTIC REGRESSION: What parameters can be tuned in logistic regression models? Explain how they affect model learning.

Logistic regression models can be tuned using regularization techniques (commonly the L2 norm, but other norms may be used as well); the strength of the regularization term controls the bias/variance trade-off.

DECISION TREES: What do high and low Entropy scores mean?

Low entropy (near 0) = most records from the sample are in the same class.
High entropy (maximum of 1) = records from the sample are spread evenly across classes.

What's the difference between a generative and discriminative model?

A generative model learns the distribution of each category of data, while a discriminative model simply learns the boundary between categories. Discriminative models will generally outperform generative models on classification tasks. More reading: What is the difference between a Generative and Discriminative Algorithm? (Stack Overflow) https://stackoverflow.com/questions/879432/what-is-the-difference-between-a-generative-and-discriminative-algorithm

What would you do if data in a data set were missing or corrupted?

Whenever data is missing or corrupted, you either replace it with another value or drop those rows and columns altogether. In pandas, isnull() and dropna() are handy tools to find missing or corrupted data and drop those values. You can also use the fillna() method to replace the invalid values with a placeholder, for example "0". A pandas sketch is shown below. https://www.analyticsindiamag.com/5-ways-handle-missing-values-machine-learning-datasets/
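A minimal pandas sketch with a made-up DataFrame (the column names and values are assumptions):

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40], "income": [50_000, 62_000, np.nan]})

print(df.isnull().sum())        # count missing values per column
cleaned = df.dropna()           # option 1: drop rows with any missing value
imputed = df.fillna(0)          # option 2: replace missing values with a placeholder such as 0
imputed = df.fillna(df.mean())  # or impute with a statistic such as the column mean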

When is it necessary to update an algorithm?

You should update an algorithm when the underlying data source has changed or when the data is non-stationary. The algorithm should also be updated when you want the model to evolve as data streams through the infrastructure.

How would you implement a recommendation system for our company's users?

https://stackoverflow.com/questions/6302184/how-to-implement-a-recommendation-system#6302223 A lot of machine learning interview questions of this type will involve implementation of machine learning models to a company's problems. You'll have to research the company and its industry in-depth, especially the revenue drivers the company has, and the types of users the company takes on in the context of the industry it's in.

What cross-validation technique would you use on a time series dataset?

https://stats.stackexchange.com/questions/14099/using-k-fold-cross-validation-for-time-series-model-selection Instead of using standard k-fold cross-validation, you have to pay attention to the fact that a time series is not randomly distributed data - it is inherently ordered chronologically. If a pattern emerges in later time periods, for example, your model may still pick up on it even if that effect doesn't hold in earlier years. You'll want to do something like forward chaining, where you model on past data and then test on forward-facing data (see the sketch below):
fold 1: training [1], test [2]
fold 2: training [1 2], test [3]
fold 3: training [1 2 3], test [4]
fold 4: training [1 2 3 4], test [5]
fold 5: training [1 2 3 4 5], test [6]
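A forward-chaining sketch using scikit-learn's TimeSeriesSplit on an assumed synthetic series:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(24).reshape(-1, 1)            # 24 time-ordered observations
tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X), start=1):
    print(f"fold {fold}: train={train_idx}, test={test_idx}")   # training always precedes testing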

Explain the difference between L1 and L2 regularization methods.

"A regression model that uses L1 regularization technique is called Lasso Regression and model which uses L2 is called Ridge Regression. The key difference between these two is the penalty term." https://towardsdatascience.com/l1-and-l2-regularization-methods-ce25e7fc831c

What is root cause analysis?

"All of us dread that meeting where the boss asks 'why is revenue down?' The only thing worse than that question is not having any answers! There are many changes happening in your business every day, and often you will want to understand exactly what is driving a given change — especially if it is unexpected. Understanding the underlying causes of change is known as root cause analysis." https://towardsdatascience.com/how-to-conduct-a-proper-root-cause-analysis-789b9847f84b

What are hash table collisions?

"If the range of key values is larger than the size of our hash table, which is usually always the case, then we must account for the possibility that two different records with two different keys can hash to the same table index. There are a few different ways to resolve this issue. In hash table vernacular, this solution implemented is referred to as collision resolution." https://medium.com/@bartobri/hash-crash-the-basics-of-hash-tables-bef82a8ea550

DECISION TREES: What are some ways to reduce overfitting with decision trees?

- Reduce maximum depth
- Increase min samples split
- Balance your data to prevent bias toward dominant classes
- Increase the number of samples
- Decrease the number of features
(See the sketch below.)
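A minimal sketch, assuming scikit-learn and toy data, of how these controls appear on a decision tree:

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
tree = DecisionTreeClassifier(
    max_depth=4,             # reduce maximum depth
    min_samples_split=20,    # require more samples before a node can split
    class_weight="balanced"  # counteract bias toward dominant classes
).fit(X, y)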

DIMENSIONALITY REDUCTION: Name four popular dimensionality reduction algorithms and briefly describe them.

1. Principal component analysis (PCA) - uses an eigendecomposition to transform the original feature data into linearly independent eigenvectors. The most important vectors (those with the highest eigenvalues) are then selected to represent the features in the transformed space.
2. Non-negative matrix factorization (NMF) - can be used to reduce dimensionality for certain problem types while preserving more information than PCA.
3. Embedding techniques - various embedding techniques, e.g. finding local neighbors as done in Locally Linear Embedding, can be used to reduce dimensionality.
4. Clustering or centroid techniques - each value can be described as a member of a cluster, a linear combination of clusters, or a linear combination of cluster centroids.
By far the most popular is PCA and similar eigendecomposition-based variations.

What is collaborative filtering?

Collaborative filtering can be described as a process of finding patterns from available information to build personalized recommendations. You can find collaborative filtering in action when you visit websites like Amazon and IMDB. Also known as social filtering, this approach essentially makes suggestions based on the recommendations and preferences of other people who share similar interests. http://recommender-systems.org/collaborative-filtering/

DECISION TREES: What is entropy?

Entropy is a measure of the impurity (disorder) of the class distribution among the members of a set. It is very similar to Gini in concept, but uses a slightly different calculation.

DECISION TREES: What is Gini impurity?

Gini impurity (also called the Gini index) is a measurement of how often a randomly chosen record would be incorrectly classified if it was randomly classified using the distribution of the set of samples.
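A small sketch computing both measures for a list of class labels (the helper functions are my own, matching the definitions above):

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))    # 0 for a pure set, 1 for an even two-class split

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))       # 0 for a pure set, 0.5 max for two classes

print(entropy([1, 1, 1, 1]), gini([1, 1, 1, 1]))   # pure set -> both near 0
print(entropy([0, 0, 1, 1]), gini([0, 0, 1, 1]))   # even split -> 1.0 and 0.5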

GRADIENT BOOSTING: How does gradient boosting, aka gradient boosting machines (GBM), differ from traditional decision tree algorithms?

Gradient boosting involves using multiple weak predictors (decision trees) to create a strong predictor. Specifically, it includes a loss function that calculates the gradient of the error with regard to each feature and then iteratively creates new decision trees that minimize the current error. More and more trees are added to the current model to continue correcting error until improvements fall below some minimum threshold or a pre-decided number of trees have been created.
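A minimal sketch with scikit-learn's GradientBoostingClassifier and toy data (the parameter values are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, random_state=0)
gbm = GradientBoostingClassifier(
    n_estimators=200,    # number of boosting stages (trees) added sequentially
    learning_rate=0.05,  # shrinkage applied to each new tree's correction
    max_depth=3,         # keep each tree a weak learner
).fit(X, y)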

What's the difference between inductive, deductive, and abductive learning?

Inductive learning describes algorithms that learn from a set of instances to draw general conclusions. In statistical ML, k-nearest neighbors and support vector machines are good examples of inductive learning. There are three kinds of literals in (top-down) inductive learning: arithmetic literals, equality and inequality, and predicates. In deductive learning, the algorithm draws conclusions by following a truth-generating structure (major premise, minor premise, and conclusion) and then improves them based on previous decisions; a decision tree is an example of deductive reasoning. Abductive learning draws the most plausible conclusion from the available instances; with this approach, inductive reasoning is applied to causal relationships in deep neural networks. https://www.quora.com/Whats-the-difference-between-inductive-deductive-and-abductive-reasoning

Is it better to have too many false positives or too many false negatives?

It depends on several factors. https://www.quora.com/In-Data-Science-is-it-preferable-to-have-too-many-false-negatives-or-too-many-false-positives

Explain the difference between L1 and L2 regularization.

L2 regularization tends to spread error among all the terms, while L1 is more binary/sparse, driving many weights to exactly zero. L1 corresponds to placing a Laplace prior on the terms, while L2 corresponds to a Gaussian prior.

LOGISTIC REGRESSION: Is logistic regression a regressor or a classifier?

Logistic regression is usually used as a classifier because it predicts discrete classes. Having said that, it technically outputs a continuous value (a probability) associated with each prediction, so it is actually a regression algorithm, hence the name, that can solve classification problems. It is fair to call it a classifier because it is used for classification, although it is technically also a regressor.

LOGISTIC REGRESSION: Can gradient descent get stuck at a local minima when training a logistic regression model? Why?

No, because the cost function is convex.

Can any similarity function be used for SVM?

No; the similarity function has to satisfy Mercer's theorem to be a valid kernel.

K-MEANS: Does k-means clustering always converge to the same clusters? How does this affect the use of k-means clustering in production models?

No, there is no guarantee that k-means converges to the same set of clusters, even given the same samples from the same population. The clusters that are produced may be radically different depending on the initial cluster means selected. For this reason, it is important that the cluster definitions remain static when using k-means clustering in production, to ensure that different clusters aren't created each time during training.

K-MEANS: What is one heuristic to select "k" for k-means clustering?

One such method is the elbow method. In short, it attempts to identify the point at which adding additional clusters only marginally increases the variance explained by the clusters. The elbow is the point at which we begin to see diminishing returns in explained variance when increasing k.
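A sketch of the elbow heuristic, assuming synthetic blobs and scikit-learn's KMeans:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))   # within-cluster sum of squares drops sharply, then levels off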

LINEAR REGRESSION: What are some reasons that gradient descent may converge slowly, and how can you address them?

Problem: low learning rate. Solution: increase the learning rate gradually (avoid making it so high that you jump over minima).
Problem: features have very dissimilar scales. Solution: rescale the features using a rescaling technique.
(See the sketch below.)
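A NumPy-only sketch (synthetic data assumed) of batch gradient descent for linear regression, showing both fixes:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 0.01])   # very dissimilar feature scales
y = X @ np.array([2.0, -0.5, 300.0]) + rng.normal(scale=0.1, size=200)

X = (X - X.mean(axis=0)) / X.std(axis=0)    # fix: rescale the features
w, lr = np.zeros(3), 0.1                    # fix: a larger learning rate is safe once scales match
for _ in range(500):
    grad = 2 / len(y) * X.T @ (X @ w - y)   # gradient of the mean squared error
    w -= lr * grad
print(np.round(w, 2))                       # converges quickly on the standardized features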

How is a decision tree pruned?

Pruning is what happens in decision trees when branches that have weak predictive power are removed in order to reduce the complexity of the model and increase the predictive accuracy of a decision tree model. Pruning can happen bottom-up or top-down, with approaches such as reduced error pruning and cost complexity pruning. Reduced error pruning is perhaps the simplest version: starting at the leaves, replace each node with its most common class; if predictive accuracy does not decrease, keep the change. While simple, this heuristic actually comes pretty close to an approach that would optimize for maximum accuracy. https://en.wikipedia.org/wiki/Pruning_%28decision_trees%29

RANDOM FOREST: How does Random Forest differ from traditional Decision Tree algorithms?

Random forest is an ensemble method that uses bagged decision trees with random feature subsets chosen at each split point. It then either averages the prediction results of each tree (regression) or uses votes from each tree (classification) to make the final prediction.

Explain what precision and recall are. How do they relate to the ROC curve?

Recall describes what percentage of true positives are identified as positive by the model. Precision describes what percentage of positive predictions were correct. The ROC curve plots recall (the true positive rate) against the false positive rate (1 - specificity), where specificity is the percentage of true negatives correctly identified as negative by the model. Recall, precision, and the ROC are measures used to identify how useful a given classification model is. http://www.kdnuggets.com/faq/precision-recall.html

GRADIENT BOOSTING: How can you reduce overfitting when doing gradient boosting?

Reducing the learning rate or reducing the maximum number of estimators are the two easiest ways to deal with gradient boosting models that overfit the data. With stochastic gradient boosting, reducing subsample size is an additional way to combat overfitting. Boosting algorithms tend to be vulnerable to overfitting, so knowing how to reduce overfitting is important.

REGULARIZED LINEAR MODELS: When should you use no regularization vs ridge vs lasso vs elastic net?

Regularized models tend to outperform non-regularized linear models, so it is suggested that you at least try using ridge regression. Lasso can be effective when you want to automatically perform feature selection in order to create a simpler model, but it can be dangerous since it may be erratic and remove features that contain useful signal. Elastic net is a balance of ridge and lasso, and it can be used to the same effect as lasso with less erratic behavior.

SVM: What are some possible uses of SVM models? E.g. classification, regression, etc

SVM can be used for:
• linear classification
• nonlinear classification
• linear regression
• nonlinear regression

SVM: Why is it important to scale features before using SVM?

SVM tries to fit the widest gap between all classes, so unscaled features can cause some features to have a significantly larger or smaller impact on how the SVM split is created.
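A minimal sketch, assuming scikit-learn and toy data, that standardizes features before the SVM is fit:

from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X, y)   # the scaler keeps any one feature from dominating the margin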

What are the different algorithm techniques you can use in AI and ML?

Some algorithm techniques that can be leveraged are:
- Learning to learn
- Reinforcement learning (deep adversarial networks, Q-learning, and temporal difference)
- Semi-supervised learning
- Supervised learning (decision trees, linear regression, naive Bayes, nearest neighbor, neural networks, and support vector machines)
- Transduction
- Unsupervised learning (association rules and k-means clustering)
https://towardsdatascience.com/types-of-machine-learning-algorithms-you-should-know-953a08248861

SVM: What parameters can you tune for SVM?

The hyperparameters that you can commonly tune for SVM are:
• Regularization/cost parameter
• Kernel
• Degree of polynomial (if using a polynomial kernel)
• Gamma (modifies the influence of nearby points on the support vector for Gaussian RBF kernels)
• Coef0 (influences the impact of high- vs low-degree polynomials for polynomial or sigmoid kernels)
• Epsilon (a margin term used for SVM regression)
(See the sketch below.)
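A sketch showing where these hyperparameters appear in scikit-learn's SVC/SVR constructors (the values are illustrative assumptions):

from sklearn.svm import SVC, SVR

clf = SVC(C=1.0, kernel="poly", degree=3, gamma="scale", coef0=1.0)  # classification
reg = SVR(C=1.0, kernel="rbf", gamma=0.1, epsilon=0.2)               # epsilon: regression margin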

Can you list some disadvantages related to linear models?

There are many disadvantages to using linear models, but the main ones are:
- They rely on linearity assumptions that may not hold
- They assume errors are independent (no autocorrelation)
- They can't solve overfitting problems on their own
- They can't be used directly for binary or categorical outcomes
https://www.quora.com/What-are-the-limitations-of-linear-regression-modeling-in-data-analysis

K-MEANS: Why is it difficult to identify the "ideal" number of clusters in a dataset using k-means clustering?

There is no "ideal" number of clusters since increasing the number of clusters always captures more information about the features (the limiting case is k=number of observations, i.e. each observation is a "cluster"). Having said that, there are various heuristics that attempt to identify the "optimal" number of clusters by recognizing when increasing the number of clusters only marginally increases the information captured. The true answer is usually driven by the application, though. If a business has the ability to create 4 different offers, then they may want to create 4 customer clusters, regardless of the data.

KNN: How do you select the ideal number of neighbors for kNN?

There is no closed-form solution for calculating k, so various heuristics are often used. It may be easiest to simply do cross validation and test several different values for k and choose the one that produces the smallest error during cross validation. As k increases, bias tends to increase and variance decreases.
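A cross-validation sketch for choosing k, assuming scikit-learn and toy data:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, random_state=0)
search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={"n_neighbors": list(range(1, 31, 2))},
                      cv=5)
search.fit(X, y)
print(search.best_params_)   # the k with the best cross-validation score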

What's selection bias? What other types of biases could you encounter during sampling?

When you're dealing with a non-random sample, selection bias will occur due to flaws in the selection process. This happens when a subset of the data is consistently excluded because of a particular attribute. This exclusion will distort results and influence the statistical significance of the test. Other types of biases include survivorship bias and undercoverage bias. It's important to always consider and reduce such biases because you'll want your smart algorithms to make accurate predictions based on the data. https://www.ibm.com/blogs/research/2018/02/mitigating-bias-ai-models/

In your opinion, which is more important when designing a machine learning model: model performance or model accuracy?

https://en.wikipedia.org/wiki/Accuracy_paradox

Name an example where ensemble techniques might be useful.

https://en.wikipedia.org/wiki/Ensemble_learning Ensemble techniques use a combination of learning algorithms to optimize better predictive performance. They typically reduce overfitting in models and make the model more robust (unlikely to be influenced by small changes in the training data). You could list some examples of ensemble methods, from bagging to boosting to a "bucket of models" method and demonstrate how they could increase predictive power.

Do you think 50 small decision trees are better than a large one? Why?

https://www.quora.com/Do-you-think-50-small-decision-trees-are-better-than-a-large-one-Why

What are some situations where a general linear model fails?

https://www.quora.com/What-are-the-limitations-of-linear-regression-modeling-in-data-analysis

What is an exact test?

"In statistics, an exact (significance) test is a test where all assumptions, upon which the derivation of the distribution of the test statistic is based, are met as opposed to an approximate test (in which the approximation may be made as close as desired by making the sample size big enough). This will result in a significance test that will have a false rejection rate always equal to the significance level of the test. For example an exact test at significance level 5% will in the long run reject true null hypotheses exactly 5% of the time." https://en.wikipedia.org/wiki/Exact_test

Explain the 80/20 rule, and tell me about its importance in model validation.

"People usually tend to start with a 80-20% split (80% training set - 20% test set) and split the training set once more into a 80-20% ratio to create the validation set." https://www.beyondthelines.net/machine-learning/how-to-split-a-dataset/

K-MEANS: What is one common use case for k-means clustering?

Customer segmentation is probably the most common use case for k-means clustering (although it has many uses in various industries). Often, unsupervised clustering is used to identify groups of similar customers (or data points) and then another predictive model is trained on each cluster. Then, new customers are first assigned a cluster and then scored using the appropriate model.

DIMENSIONALITY REDUCTION: Why would you want to use dimensionality reduction techniques to transform your data before training?

Dimensionality reduction can allow you to:
• Remove collinearity from the feature space
• Speed up training by reducing the number of features
• Reduce memory usage by reducing the number of features
• Identify underlying, latent features that impact multiple features in the original space

DIMENSIONALITY REDUCTION: Why would you want to avoid dimensionality reduction techniques to transform your data before training?

Dimensionality reduction can:
• Add extra unnecessary computation
• Make the model difficult to interpret if the latent features are not easy to understand
• Add complexity to the model pipeline
• Reduce the predictive power of the model if too much signal is lost

How would you go about choosing an algorithm to solve a business problem?

First, you have to develop a "problem statement" that's based on the problem provided by the business. This step is essential because it'll help ensure that you fully understand the type of problem and the input and output of the problem you want to solve. The problem statement should be simple and no more than a single sentence. For example, let's consider enterprise spam that requires an algorithm to identify it. The problem statement would be: "Is the email fake/spam or not?" In this scenario, the identification of whether it's fake/spam will be the output. Once you have defined the problem statement, you have to identify the appropriate algorithm from the following:
- Any classification algorithm
- Any clustering algorithm
- Any regression algorithm
- Any recommendation algorithm
Which algorithm you use will depend on the specific problem you're trying to solve. In this scenario, you can move forward with a clustering algorithm and choose a k-means algorithm to achieve your goal of filtering spam from the email system. While examples aren't always necessary when answering questions about artificial intelligence, sometimes it will help make it easier for you to get your point across. https://neelbhatt.com/2017/11/25/how-to-choose-ml-algorithm-machine-learning-questions-answers-part-iii/

DECISION TREES: What are the main hyperparameters that you can tune for decision trees?

Generally speaking, we have the following hyperparameters:
- max depth - maximum tree depth
- min samples split - minimum number of samples for a node to be split
- min samples leaf - minimum number of samples for each leaf node
- max leaf nodes - the maximum number of leaf nodes in the tree
- max features - maximum number of features that are evaluated for splitting at each node (only valid for algorithms that randomize the features considered at each split)
Other similar hyperparameters may be derived from the above. The "traditional" decision tree is greedy and looks at all features at each split point, but many modern implementations allow splitting on randomized features (as seen in sklearn), so max features may or may not be a tuneable hyperparameter.

DECISION TREES: Explain how each hyperparameter affects the model's ability to learn.

Generally speaking...
- max depth - increasing max depth decreases bias and increases variance
- min samples split - increasing min samples split increases bias and decreases variance
- min samples leaf - increasing min samples leaf increases bias and decreases variance
- max leaf nodes - decreasing max leaf nodes increases bias and decreases variance
- max features - decreasing max features increases bias and decreases variance
There may be instances when changing a hyperparameter has no effect on the model.

DECISION TREES: What metrics are usually used to compute splits?

Gini impurity or entropy. Both generally produce similar results.

DECISION TREES: What do high and low Gini scores mean?

Low Gini (near 0) = most records from the sample are in the same class.
High Gini (maximum of 1 or less, depending on the number of classes) = records from the sample are spread evenly across classes.

RANDOM FOREST: Are random forest models prone to overfitting? Why?

No, random forest models are generally not prone to overfitting because the bagging and randomized feature selection tends to average out any noise in the model. Adding more trees does not cause overfitting since the randomization process continues to average out noise (more trees generally reduces overfitting in random forest). In general, bagging algorithms are robust to overfitting. Having said that, it is possible to overfit with random forest models if the underlying decision trees have extremely high variance, e.g. extremely high depth and low min sample split, and a large percentage of features are considered at each split point, e.g. if every tree is identical, then random forest may overfit the data.

SVM: Can SVM produce a probability score along with its classification prediction?

Not directly; a standard SVM outputs only a class, and probability estimates require additional calibration (e.g. Platt scaling).

LINEAR REGRESSION: Is a polynomial regression non-linear?

No. It is a linear model that can be used to fit non-linear data.

DECISION TREES: Are decision trees parametric or non-parametric models?

Non-parametric. The number of model parameters is not determined before creating the model.

RANDOM FOREST: What hyperparameters can be tuned for a random forest that are in addition to each individual tree's hyperparameters?

Random forest is essentially bagged decision trees with random feature subsets chosen at each split point, so we have 2 new hyperparameters that we can tune (see the sketch below):
- num estimators - the number of decision trees in the forest
- max features - maximum number of features that are evaluated for splitting at each node
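A minimal sketch of the two forest-level hyperparameters in scikit-learn (toy data assumed):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
forest = RandomForestClassifier(
    n_estimators=300,    # num estimators: number of trees in the forest
    max_features="sqrt"  # features considered at each split point
).fit(X, y)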

DIMENSIONALITY REDUCTION: How do you select the number of principal components needed for PCA?

Selecting the number of latent features to retain is typically done by inspecting the eigenvalue of each eigenvector. As eigenvalues decrease, the impact of the latent feature on the target variable also decreases. This means that principal components with small eigenvalues have a small impact on the model and can be removed. There are various rules of thumb, but one general rule is to include the most significant principal components that account for at least 95% of the variation in the features.
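A sketch, assuming scikit-learn's PCA and the built-in digits dataset, of selecting components by a 95% explained-variance target:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)
pca = PCA(n_components=0.95)        # keep enough components to explain >= 95% of the variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape[1], round(pca.explained_variance_ratio_.sum(), 3))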

GRADIENT BOOSTING: What hyperparameters can be tuned in gradient boosting that are in addition to each individual tree's hyperparameters?

The main hyperparameters that can be tuned with GBM models are:
- Loss function - the loss function used to calculate the gradient of the error
- Learning rate - the rate at which new trees correct/modify the existing predictor
- Num estimators - the total number of trees to produce for the final predictor
- Additional hyperparameters specific to the loss function
Some specific implementations, e.g. stochastic gradient boosting, may have additional hyperparameters such as subsample size (subsample size affects the randomization in stochastic variations).

LINEAR REGRESSION: How do you select the right order of polynomial for polynomial regressions? What if the data is high-dimensional?

This is a difficult question and there is no easy way to automate this selection. It is suggested that you inspect the data and try to choose the order of polynomial that will best fit the data without overfitting. If the data is high-dimensional and can't be visualized, then you can train multiple models and observe when the validation error begins to increase instead of decrease. At this point you're probably overfitting your training data and should reduce the polynomial order to the point where validation error is minimized.

DIMENSIONALITY REDUCTION: After doing dimensionality reduction, can you transform the data back into the original feature space? How?

Yes and no. Most dimensionality reduction techniques have inverse transformations, but signal is often lost when reducing dimensions, so the inverse transformation is usually only an approximation of the original data. (See the sketch below.)
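A sketch continuing the PCA example (digits dataset assumed): the inverse transform only approximates the original data because the discarded components are gone.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)
pca = PCA(n_components=0.95)
X_approx = pca.inverse_transform(pca.fit_transform(X))   # back to the original feature space
print(np.mean((X - X_approx) ** 2))                      # reconstruction error > 0: signal was lost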

What steps would you take to evaluate the effectiveness of your ML model?

You have to first split the data set into training and test sets. You also have the option of using a cross-validation technique to further segment the data set into a composite of training and test sets within the data. Then you have to implement a choice selection of performance metrics, such as the following (see the sketch below):
- Confusion matrix
- Accuracy
- Precision
- Recall or sensitivity
- Specificity
- F1 score
For the most part, you can use measures such as accuracy, the confusion matrix, or the F1 score. However, it'll be critical for you to demonstrate that you understand the nuances of how each model can be measured by choosing the right performance measure to match the problem. https://medium.com/thalus-ai/performance-metrics-for-classification-problems-in-machine-learning-part-i-b085d432082b
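A sketch, assuming scikit-learn, toy data, and a simple train/test split, that computes the metrics listed above:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
y_pred = LogisticRegression().fit(X_tr, y_tr).predict(X_te)

print(confusion_matrix(y_te, y_pred))
print(accuracy_score(y_te, y_pred), precision_score(y_te, y_pred),
      recall_score(y_te, y_pred), f1_score(y_te, y_pred))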

How is KNN different from k-means clustering?

https://www.quora.com/How-is-the-k-nearest-neighbor-algorithm-different-from-k-means-clustering K-nearest neighbors is a supervised classification algorithm, while k-means clustering is an unsupervised clustering algorithm. While the mechanisms may seem similar at first, what this really means is that in order for k-nearest neighbors to work, you need labeled data to classify an unlabeled point into (thus the nearest neighbor part). K-means clustering requires only a set of unlabeled points and a chosen number of clusters k: the algorithm iteratively computes cluster means and assigns points to the nearest mean, gradually learning how to group them into clusters. The critical difference here is that KNN needs labeled points and is thus supervised learning, while k-means doesn't - and is thus unsupervised learning.

How is k-NN different from k-means clustering?

k-NN, or k-nearest neighbors is a classification algorithm, where the k is an integer describing the number of neighboring data points that influence the classification of a given observation. K-means is a clustering algorithm, where the k is an integer describing the number of clusters to be created from the given data. https://www.springboard.com/blog/data-science-interview-questions/

K-MEANS: Briefly explain how the k-means clustering algorithm works.

k-means clustering is an unsupervised clustering algorithm that partitions observations into k clusters. The cluster means are usually randomized at the start (often by choosing random observations from the data) and then updated/shifted as more records are observed. At each iteration, a new observation is assigned to a cluster based on which cluster mean it is nearest to, and then the means are recalculated, or updated, with the new observation's information included. (See the sketch below.)
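A minimal usage sketch with scikit-learn's KMeans on assumed synthetic blobs:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # the learned cluster means
print(km.predict(X[:5]))     # new observations are assigned to the nearest mean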

KNN: Briefly explain how the k-nearest neighbor (kNN) algorithm works.

kNN makes a prediction by averaging the k neighbors nearest to a given data point. For example, if we wanted to predict how much money a potential customer would spend at our store, we could find the 5 customers most similar to her and average their spending to make the prediction. The average could be weighted based on similarity between data points and the similarity, aka "distance," metric could be modified as well.
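A sketch of the spending example, with made-up customer numbers and scikit-learn's KNeighborsRegressor:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[25, 40_000], [32, 60_000], [47, 82_000],
              [51, 90_000], [29, 52_000], [38, 70_000]])   # [age, income] for past customers
y = np.array([120.0, 180.0, 310.0, 340.0, 150.0, 240.0])   # their past spending
knn = KNeighborsRegressor(n_neighbors=5, weights="distance").fit(X, y)
print(knn.predict([[30, 55_000]]))   # distance-weighted average of the 5 most similar customers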

REGULARIZED LINEAR MODELS: Name and briefly explain three regularized linear models.

• Ridge regression - linear regression that adds an L2-norm penalty/regularization term to the cost function
• Lasso - linear regression that adds an L1-norm penalty/regularization term to the cost function
• Elastic net - linear regression that adds a mix of both L1- and L2-norm penalty terms to the cost function

