Data science final study guide

Underfitting

When the model is too simple, both training and test errors are large

What is the Apriori property and how is it used in the Apriori algorithm?

The Apriori property: if an itemset is frequent, then all of its subsets must also be frequent. Equivalently, the support of an itemset never exceeds the support of its subsets (support is anti-monotone). The Apriori algorithm uses the property for pruning: during candidate generation, any candidate with an infrequent subset is eliminated before support counting, so larger itemsets are built only from frequent smaller ones, such as frequent pairs.
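The pruning step can be sketched as follows. This is a minimal illustration, not a full Apriori implementation; the function name `prune_candidates` and the grocery items are made up for the example.

```python
from itertools import combinations

def prune_candidates(candidates, frequent_smaller):
    """Keep only candidates whose (k-1)-item subsets are all frequent."""
    kept = []
    for cand in candidates:
        k = len(cand)
        subsets = (frozenset(s) for s in combinations(cand, k - 1))
        if all(s in frequent_smaller for s in subsets):
            kept.append(cand)
    return kept

# Hypothetical frequent 2-itemsets found in an earlier pass.
frequent_pairs = {
    frozenset(p)
    for p in [("bread", "milk"), ("bread", "eggs"), ("milk", "eggs"), ("milk", "tea")]
}

# Candidate 3-itemsets; the second contains the infrequent pair (bread, tea).
candidates = [
    frozenset(["bread", "milk", "eggs"]),
    frozenset(["bread", "milk", "tea"]),
]

kept = prune_candidates(candidates, frequent_pairs)
```

Only the first candidate survives, so support never has to be counted for the second one.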

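Why do we end up with an overfitted model with deep trees, and in data sets where the outcome depends on many variables?

Because each split reduces the training data available for subsequent splits. Deep in the tree, each decision is based on very few records, so the tree ends up fitting noise in the training data.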

What gives the Naïve Bayesian Classifier the advantage of being computationally inexpensive?

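Because of the conditional independence assumption, each conditional probability is estimated separately by simple counting: you get the count of records with Y=n and X=n and divide it by the count of records with Y=n. No joint distribution over all the attributes has to be estimated.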
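A minimal sketch of the counting involved, assuming a single attribute and made-up weather records (the names `records` and `conditional` are hypothetical):

```python
from collections import Counter

# Hypothetical records of (outlook, played) pairs.
records = [("sunny", "yes"), ("sunny", "no"), ("rainy", "yes"), ("sunny", "yes")]

class_counts = Counter(y for _, y in records)  # count of Y=y
joint_counts = Counter(records)                # count of X=x and Y=y

def conditional(x, y):
    """Estimate P(X=x | Y=y) as one count divided by another."""
    return joint_counts[(x, y)] / class_counts[y]

p_sunny_given_yes = conditional("sunny", "yes")  # 2 of the 3 "yes" records
```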

Why do we consider K-means clustering an unsupervised machine learning algorithm?

K-means clustering is unsupervised machine learning because there is no predictive model and there are no labels: we give it unlabeled data, and it finds similarities and relationships and groups the records accordingly.

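What is meant by generalization, overfitting, and underfitting?

Generalization: how well a model performs on data it was not trained on. Overfitting: creating overly complex models (for example, very deep trees) that fit the training data but do not generalize well. Underfitting: when the model is too simple, both training and test errors are large.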

Detail the four steps in the K-means clustering algorithm.

1. Choose K, then select K random "centroids".
2. Assign records to the cluster with the closest centroid.
3. Recalculate the resulting centroids (centroid: the mean value of all the records in the cluster).
4. Repeat steps 2 and 3 until record assignments no longer change.
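The four steps can be sketched in plain Python. This is an illustrative version under simple assumptions (squared Euclidean distance, a fixed random seed, made-up 2-D points), not a production implementation.

```python
import random

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # Step 1: pick K random records as centroids
    assignments = None
    for _ in range(max_iter):
        # Step 2: assign each record to the closest centroid.
        new_assignments = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignments == assignments:
            break  # Step 4: stop once assignments no longer change
        assignments = new_assignments
        # Step 3: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments

# Hypothetical data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
```

The cluster numbering depends on the random start, so only the grouping is meaningful, not which group is labeled 0 or 1.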

Detail the challenges with categorical values in linear regression model.

Categorical variables are expanded as a set of indicator variables, one for each possible value. Expanding categorical variables adds more variables to consider in Ordinary Least Squares and increases the storage and time complexity of the solution
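A small sketch of the expansion, using a made-up color column (the helper name `one_hot` is hypothetical). One column with three possible values becomes three indicator columns:

```python
def one_hot(values):
    """Expand a categorical column into one 0/1 indicator column per level."""
    levels = sorted(set(values))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows

levels, rows = one_hot(["red", "green", "blue", "green"])
# levels is ["blue", "green", "red"]; each row has exactly one 1.
```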

Overfitting

Creating overly complex models (for example, very deep trees) that fit the training data but do not generalize well to new data.

How do you describe a binary class problem?

Deciding whether a record belongs to a certain class or not, for example, determining whether an image is a dog or not.

List and discuss two standard sanity checks that you will perform on the coefficients derived from a linear regression model.

Do the signs make sense? Are the coefficients excessively large? A wrong sign is an indication of correlated inputs, but doesn't necessarily affect predictive power. Excessively large coefficient magnitudes may indicate strongly correlated inputs; you may want to consider eliminating some variables or using regularized regression techniques.

What is the most common measure of distance used with K-means clustering algorithms?

Euclidean Distance is the most common measure of distance used with K-means clustering algorithms.

What are the two major challenges in the problem of text analysis?

Finding the right structure for your unstructured data, and very high dimensionality.

Support

Fraction of transactions that contain both X and Y

How do you use a "hold-out" dataset to evaluate the effectiveness of the rules generated?

The hold-out method excludes a portion of the data from the training set and reserves it as a test set, allowing you to see how well the rules hold up on data the model has never seen.

What is the difference between Lift and Leverage? How is Lift used in evaluating the quality of rules discovered?

Lift measures how many times more often X and Y occur together than expected if they were statistically independent. It is a measure of how X and Y are really related rather than coincidentally happening together. A lift greater than 1 means X and Y occur together more often than independence would predict, so rules with higher lift are considered more interesting. Leverage is a similar notion, but it is a difference instead of a ratio: it measures the difference between the probability of X and Y appearing together in the data set and what would be expected if X and Y were statistically independent.
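As a worked example, both measures can be computed from support values. The transactions below are made up for illustration, and the helper names are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def lift(transactions, x, y):
    joint = support(transactions, set(x) | set(y))
    return joint / (support(transactions, x) * support(transactions, y))

def leverage(transactions, x, y):
    joint = support(transactions, set(x) | set(y))
    return joint - support(transactions, x) * support(transactions, y)

# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

rule_lift = lift(transactions, {"bread"}, {"milk"})          # 0.6 / (0.8 * 0.8)
rule_leverage = leverage(transactions, {"bread"}, {"milk"})  # 0.6 - 0.8 * 0.8
```

Here the lift is slightly below 1 and the leverage slightly below 0, so bread and milk appear together a little less often than independence would predict.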

Compare and contrast linear and logistic regression methods.

Linear regression models are used to predict continuous values, such as stock prices. Logistic regression models are used to classify, such as determining whether a picture has a dog in it or not.

What is Pseudo R2 and what does it measure in a logistic regression model?

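Pseudo R2 is used in logistic regression the same way R2 is used in linear regression. It is basically "the fraction of the variance explained" by the model, that is, a measure of how much better the fitted model is than a model with no input variables.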
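The card does not say which pseudo R2 is meant. One common definition, assumed here, is 1 - (deviance / null deviance), where the null model predicts the overall fraction of positives for every record. The labels and predicted probabilities below are made up.

```python
import math

def deviance(ys, ps):
    """-2 times the log-likelihood of 0/1 labels under predicted probabilities."""
    return -2.0 * sum(
        y * math.log(p) + (1 - y) * math.log(1 - p) for y, p in zip(ys, ps)
    )

def pseudo_r2(ys, ps):
    base_rate = sum(ys) / len(ys)  # null model: same probability for every record
    null_deviance = deviance(ys, [base_rate] * len(ys))
    return 1.0 - deviance(ys, ps) / null_deviance

ys = [1, 0, 1, 0]          # hypothetical true labels
ps = [0.9, 0.1, 0.8, 0.2]  # hypothetical predicted probabilities
r2 = pseudo_r2(ys, ps)
```

A model that does no better than the base rate gets a value of 0, and values closer to 1 indicate a better fit.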

How is the measure of significance used in determining the explanatory value of a driver with linear regression models?

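The measure of significance (the p-value of a coefficient) is used to determine whether a driver really makes a difference in the model. A small p-value indicates that the coefficient is unlikely to be zero, so the driver has explanatory value.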

For what conditions is the value of entropy at a maximum and when is it at a minimum?

Maximum (log n_c, where n_c is the number of classes) when records are equally distributed among all classes, implying least information. Minimum (0.0) when all records belong to one class, implying most information.
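A quick check of both extremes, assuming base-2 logarithms and made-up class labels:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

uniform = entropy(["a", "b", "c", "d"])  # four classes, equally distributed
pure = entropy(["a", "a", "a", "a"])     # all records in one class
```

With four equally likely classes the entropy equals log2(4) = 2, and with a single class it is 0.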

Confidence

Measures how often items in Y appear in transactions that contain X

Why should we use log-likelihoods rather than pure probability values in the Naïve Bayesian Classifier?

Multiplying several probability values (each < 1) invariably leads to the problem of numerical underflow. So an important implementation guideline is to compute the log of each probability (with a smoothing value added) and sum the logs instead. This recommendation is not just for high-dimensional problems but for all implementations.
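The underflow is easy to reproduce. In this sketch the probabilities are made up, and smoothing is left out to keep the comparison simple:

```python
import math

# Hypothetical conditional probabilities for 100 attributes.
probs = [1e-5] * 100

product = 1.0
for p in probs:
    product *= p  # the true value, 1e-500, is too small for a float

log_sum = sum(math.log(p) for p in probs)  # stays a finite negative number
```

The product collapses to 0.0, which makes every class look equally impossible, while the sum of logs can still be compared across classes.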

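The attributes of a data set are purchase decision (Yes/No), gender (M/F), and income group (<10K, 10-50K, >50K). Can you use K-means to cluster this data set?

No, because gender is a non-numeric, categorical variable, and so are the other attributes. K-means needs numeric attributes for which a mean and a distance are meaningful.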

Describe N-Fold cross validation method used for diagnosing a fitted model.

Partition the data into N groups. Fit N models, holding out each group in turn, and calculate the residuals on the held-out group. The estimated prediction error is the average over all the residuals.
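A minimal sketch of the procedure. Two assumptions are made for illustration: the "model" is simply the mean of the training values, and the residuals are squared before averaging.

```python
def n_fold_cv(ys, n_folds):
    # Partition the data into N groups (every n_folds-th value).
    folds = [ys[i::n_folds] for i in range(n_folds)]
    squared_residuals = []
    for i, held_out in enumerate(folds):
        # Fit on everything except the held-out group.
        train = [y for j, fold in enumerate(folds) if j != i for y in fold]
        prediction = sum(train) / len(train)  # stand-in model: training mean
        # Residuals are measured only on the held-out group.
        squared_residuals.extend((y - prediction) ** 2 for y in held_out)
    # Estimated prediction error: average over all residuals.
    return sum(squared_residuals) / len(squared_residuals)

error = n_fold_cv([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], n_folds=3)
```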

List two use cases of linear regression models.

Predicting stock and housing prices

What is the difference between Unsupervised Learning and Supervised Learning?

Supervised learning: you feed the model labeled data, and the model responds with predictions, classifications, and categorizations. Unsupervised learning: you feed the model unlabeled data, and it groups and categorizes the records by similarity. However, unsupervised models do not predict a labeled outcome.

What is the difference in MAP inference and Maximum Likelihood inference?

The maximum likelihood estimate of a parameter is the value of the parameter that maximizes the likelihood of the observed data. The maximum a posteriori (MAP) estimate is the value of the parameter that maximizes the posterior distribution, which combines the likelihood with a prior over the parameter.
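A coin-flip example shows the difference. The Beta(a, b) prior and the counts are assumptions chosen for illustration; the MAP formula is the mode of the Beta posterior and holds for a, b > 1.

```python
def mle(heads, flips):
    """Maximum likelihood estimate of P(heads): the observed fraction."""
    return heads / flips

def map_estimate(heads, flips, a=2.0, b=2.0):
    """MAP estimate under a Beta(a, b) prior (posterior mode, valid for a, b > 1)."""
    return (heads + a - 1) / (flips + a + b - 2)

ml = mle(7, 10)            # 7 / 10
mp = map_estimate(7, 10)   # (7 + 1) / (10 + 2)
```

The prior pulls the MAP estimate toward 0.5. With a flat prior (a = b = 1) the two estimates coincide.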

Training Dataset

The sample of data used to fit the model.

Test Dataset

The sample of data used to provide an unbiased evaluation of a final model fit on the training dataset.

Validation Dataset

The sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters. The evaluation becomes more biased as skill on the validation dataset is incorporated into the model configuration.

What is the difference of the training, validation and test sets?

Training Dataset: The sample of data used to fit the model. Validation Dataset: The sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters. The evaluation becomes more biased as skill on the validation dataset is incorporated into the model configuration. Test Dataset: The sample of data used to provide an unbiased evaluation of a final model fit on the training dataset.

How do we use WSS to pick the value of K?

We use WSS (within sum of squares) to pick K by choosing the K value at the elbow of the WSS-versus-K curve, the point after which adding clusters no longer reduces WSS much. WSS is a measure of how tight on average each cluster is. Thus WSS can be used to decide if K is overfitted. You can also use WSS to find the overall dispersion of the data and then choose K from that information.
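A sketch of the WSS calculation for a given clustering. The points and cluster labels are made up; in practice the labels would come from running K-means for each K.

```python
def wss(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for members in clusters.values():
        centroid = [sum(dim) / len(members) for dim in zip(*members)]
        total += sum(
            sum((a - c) ** 2 for a, c in zip(p, centroid)) for p in members
        )
    return total

# Hypothetical data: two tight pairs of points far apart.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
wss_k1 = wss(points, [0, 0, 0, 0])  # one cluster
wss_k2 = wss(points, [0, 0, 1, 1])  # two clusters
wss_k4 = wss(points, [0, 1, 2, 3])  # every point its own cluster
```

WSS drops sharply from K=1 to K=2 and only slightly after that, so the elbow is at K=2.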

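How do you use "pair-wise" plots to evaluate the effectiveness of the clustering?

You use pair-wise plots when there are not many variables: plot each pair of variables against each other, mark the records by cluster, and check whether the clusters look well separated.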

N-fold cross validation

It tells you if your set of variables is reasonable. This method is used when you don't have enough data to set aside a separate hold-out (test) set.

Unsupervised Learning

You feed the model unlabeled data, and it groups and categorizes the records by similarity. However, unsupervised models do not predict a labeled outcome.

Supervised learning

You feed the model labeled data, and the model responds with predictions, classifications, and categorizations.

