Feature Engineering & Cross Validation

Which of the following are the steps associated with feature engineering?
1. Outlier treatment
2. Data cleaning
3. Hyperparameter tuning
4. Scaling of the data

1, 2 and 4 - Feature engineering includes all the steps required to prepare the data, i.e., outlier treatment, data cleaning, and scaling. Hyperparameter tuning is not part of feature engineering.

Which of the following statements are true about SMOTE?
1. Creates synthetic data points
2. Prevents the model from getting biased towards the majority class
3. Reduces the number of majority class observations
4. Might cause overfitting

1, 2 and 4 - SMOTE creates synthetic data points for the minority class, which prevents the model from getting biased towards the majority class. SMOTE does not reduce the number of majority class observations; it only adds new minority class points. Because the synthetic points are interpolated from existing ones, SMOTE might cause overfitting.
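
A minimal sketch of how this looks in practice, assuming the imbalanced-learn (imblearn) library and an illustrative dataset generated with scikit-learn; the dataset and numbers are not from the original question:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Toy imbalanced dataset: roughly 90% majority class, 10% minority class.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.9, 0.1], random_state=42)
print("Before SMOTE:", Counter(y))

# SMOTE interpolates between a minority point and its k nearest minority
# neighbours (k_neighbors=5 by default) to create synthetic points.
X_res, y_res = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
print("After SMOTE:", Counter(y_res))   # classes are now balanced
```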

How many times will the model be trained if k=5 in K-fold cross-validation?

5 - In K-fold CV, the data is divided into k folds and the model is trained k times; at each iteration, k-1 folds are used for training and the remaining fold for testing. Hence, if k = 5, the model is trained 5 times.
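
A small sketch with scikit-learn (the dataset and model are illustrative only), showing that with k = 5 the training loop runs five times:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Each of the 5 iterations trains on 4 folds and evaluates on the remaining fold.
for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    print(f"Fold {i}: accuracy = {model.score(X[test_idx], y[test_idx]):.3f}")
```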

If 90,85,78,88,85 are the cross-validated scores then what would be the average cross-validation score?

85.2 - The average score would be (90 + 85 + 78 + 88 + 85)/5 = 85.2
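
The arithmetic can be checked directly:

```python
scores = [90, 85, 78, 88, 85]
print(sum(scores) / len(scores))  # 85.2
```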

State whether the following statements are true or false.
A. Curse of dimensionality refers to having too many features and very few data points
B. Regularization can help in dealing with the curse of dimensionality
C. Curse of dimensionality refers to having too many data points and very few dimensions

A and B are true, C is false - The curse of dimensionality refers to having too many features and very few data points, not the other way around. It typically results in large-magnitude coefficients, which regularization can keep in check.
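
A hedged sketch of the idea, using randomly generated noise purely for illustration: with far more features than observations, an unregularized linear model can fit the training data perfectly, while ridge regularization shrinks the coefficients and tempers that.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 100))   # 20 observations, 100 features
y = rng.normal(size=20)          # pure noise target

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

print("Train R^2, no penalty:   ", round(ols.score(X, y), 3))    # ~1.0, memorises the noise
print("Train R^2, ridge penalty:", round(ridge.score(X, y), 3))  # noticeably lower
print("Max |coef|, no penalty:   ", round(np.abs(ols.coef_).max(), 3))
print("Max |coef|, ridge penalty:", round(np.abs(ridge.coef_).max(), 3))
```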

The constraint region for the penalty in ridge regression has which of the following shapes?

Circular - The constraint region for the penalty in ridge regression has a circular shape.

State whether the following statement is true or false. 'Good model performance on train and test dataset means that the model will always show good performance when deployed on real-world data'

False - Good performance on the train and test sets does not guarantee good performance after deployment; real-world data can contain variations that the model was never trained on.

State whether the following statement is true or false. "Ridge regression always have better model performance than lasso regression"

False - It is not necessary that ridge regression always performs better than lasso; the relative performance depends on the data and the problem at hand.

State whether the following statements are true or false.
1. Oversampling leads to loss of information.
2. Undersampling creates synthetic data points.
3. Oversampling adds more data points over the existing patterns without tampering with the existing patterns.
4. Undersampling can be done using SMOTE.

False, False, True, False - Oversampling creates additional (or synthetic) data points, so there is no loss of information. Undersampling removes existing data points rather than creating synthetic ones. Oversampling adds data points that follow the existing patterns without tampering with them. SMOTE is used for oversampling, not undersampling.

Which of the following libraries is used to import RandomUnderSampler?

imblearn.under_sampling - RandomUnderSampler is imported from imblearn.under_sampling (from imblearn.under_sampling import RandomUnderSampler).
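
A short sketch, assuming imbalanced-learn is installed and using an illustrative dataset:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("Before:", Counter(y))

# RandomUnderSampler drops randomly chosen majority-class observations
# until the classes are balanced; no synthetic points are created.
X_res, y_res = RandomUnderSampler(random_state=0).fit_resample(X, y)
print("After:", Counter(y_res))
```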

Which of the following algorithms is used by SMOTE to create synthetic data points?

K-nearest neighbour - SMOTE uses the k-nearest neighbours (KNN) algorithm to generate synthetic data points.

What could be the maximum value of K in K-fold cross-validation?

No. of observations in the data - The maximum value of K equals the number of observations in the data, in which case K-fold CV becomes leave-one-out cross-validation.
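
As a sketch with scikit-learn (dataset and model are illustrative), setting k equal to the number of observations makes K-fold CV equivalent to leave-one-out cross-validation:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)

kf = KFold(n_splits=len(X))   # k = number of observations (150 here)
loo = LeaveOneOut()           # equivalent splitter
print(kf.get_n_splits(X), loo.get_n_splits(X))  # 150 150

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=loo)
print(len(scores))            # one score per observation
```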

Which of the following issues can be addressed by the regularization technique?

Overfitting - Regularization is used to deal with overfitting by penalizing large coefficients.

Which of the following resampling methods is used for SMOTE and T-links respectively?

Oversampling, Undersampling - SMOTE is an oversampling technique, while Tomek links (T-links) is an undersampling technique.
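
A hedged sketch with imbalanced-learn on an illustrative dataset: SMOTE oversamples the minority class, TomekLinks undersamples by removing majority points that form Tomek link pairs, and SMOTETomek combines the two.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks
from imblearn.combine import SMOTETomek

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("Original:  ", Counter(y))
print("SMOTE:     ", Counter(SMOTE(random_state=0).fit_resample(X, y)[1]))
print("TomekLinks:", Counter(TomekLinks().fit_resample(X, y)[1]))
print("SMOTETomek:", Counter(SMOTETomek(random_state=0).fit_resample(X, y)[1]))
```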

The constraint region for the penalty in lasso regression has which of the following shapes?

Rectangle - The constraint region for the penalty in lasso regression is rectangular; for two coefficients it is a square rotated by 45° (often described as a diamond), in contrast to the circular ridge constraint region.
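
For two coefficients the two constraint regions can be written out explicitly (t denotes the constraint budget):

```latex
\text{Lasso (diamond):}\quad |\beta_1| + |\beta_2| \le t
\qquad\qquad
\text{Ridge (circle):}\quad \beta_1^2 + \beta_2^2 \le t
```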

What is the penalty term in lasso regression?

Sum of Absolute values of coefficients - The sum of the absolute values of the coefficients is the penalty term in lasso regression.

What is the penalty term in ridge regression?

Sum of Squared values of coefficients - Sum of Squared values of coefficients is the penalty term in ridge regression.
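
Written out, the two regularized cost functions differ only in the penalty term (λ controls the penalty strength):

```latex
\text{Lasso:}\; \sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2 + \lambda \sum_{j=1}^{p} |\beta_j|
\qquad
\text{Ridge:}\; \sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2 + \lambda \sum_{j=1}^{p} \beta_j^2
```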

What will be the impact of the large value of K in K-fold cross-validation?

The variation across the training set will decrease - If k is large, each training fold contains nearly all of the data points, so there is very little variation across the training sets.

Oversampling and undersampling techniques are applied on which of the following data sets?

Train set - Oversampling and undersampling techniques are applied on the train set only, so that the test set keeps the original class distribution.
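
A sketch of the usual workflow, assuming scikit-learn and imbalanced-learn with an illustrative dataset: split first, then resample only the training set so the test set keeps the original class distribution.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Resample the training data only; X_test / y_test stay untouched.
X_train_res, y_train_res = SMOTE(random_state=0).fit_resample(X_train, y_train)
```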

State whether the following statement is true or false. 'Using the mean and standard deviation of the cross-validated scores, we can expect the model performance to lie in the range of (mean - 2*standard deviation) to (mean+2*standard deviation) with 95% confidence.'

True - As per the normal distribution, 95% of the data lies in the range of (mean - 2*standard deviation) to (mean+2*standard deviation). Hence, we can expect our model performance to lie in the specified confidence interval.
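
Using the five scores from the earlier question purely as an illustration (numpy assumed), the range can be computed as mean ± 2 * standard deviation:

```python
import numpy as np

scores = np.array([90, 85, 78, 88, 85])
mean, std = scores.mean(), scores.std()
print(f"Expected performance range (~95%): {mean - 2*std:.1f} to {mean + 2*std:.1f}")
```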

State whether the following statement is true or false. 'Feature engineering is applied in the very initial stages of the data science projects while model tuning comes at the end of the project where we want to inch-up the model performance.'

True - Feature engineering is performed at the early stages when we need to prepare the data to make it compatible with the machine learning algorithms. Model tuning is performed towards the end-stage when we need to improve the model performance by tweaking certain hyperparameters.

State whether the following statement is true or false. 'Very large values of the penalty term might lead to underfitting.'

True - Regularization is used to deal with overfitting by adding a penalty term to the cost function. If the penalty term is very high then it might underfit the model.

State whether the following statement is true or false. 'The cross-validation approach gives a better view of model performance as it runs the training/testing cycle on many folds. '

True - The cross-validation approach splits the data into k-folds, each fold is used in both training and testing for different iterations. The iterative process gives a better view of the model performance.

State whether the following statement is True or False "Lasso regression can be used for feature selection"

True - The penalty term in lasso regression is the sum of the absolute values of the coefficients, and this penalty can shrink some coefficients exactly to zero. Features whose coefficients become zero are effectively dropped, which is how lasso performs feature selection.
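
A hedged sketch with scikit-learn's Lasso on the built-in diabetes dataset (chosen only for illustration): as alpha grows, more coefficients are driven exactly to zero, and the features with non-zero coefficients are the ones effectively selected.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)

for alpha in [0.1, 1.0, 10.0]:
    lasso = Lasso(alpha=alpha).fit(X, y)
    n_zero = int(np.sum(lasso.coef_ == 0))
    print(f"alpha={alpha}: {n_zero} of {lasso.coef_.size} coefficients are exactly 0")
```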

State whether the following statement is True or False "The possible value of coefficients that can be obtained via ridge regression are the ones that lie on the intersection of contour lines and the constraint region"

True - The possible coefficient values lie at the intersection of the contour line and constraint region.

Which of the following functions can be used to get the scores of the model on every fold?

cross_val_score - cross_val_score can be used to get the scores of the model.
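
A minimal sketch, assuming scikit-learn; the model and dataset are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)         # one score per fold
print(scores.mean())  # average cross-validation score
```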

The KFold() function is available in which of the following Python packages?

sklearn.model_selection - KFold() is available in sklearn.model_selection.

