Feature Engineering

Max value of k

Number of data points (this is the leave-one-out cross-validation case)

Cross Validation technique info

- Creates and validates a given model multiple times
- The number of times it does so depends on the value selected by the user of the technique, usually expressed as "K", which is always an integer
- The sequence of steps used is iterated through as many times as K
- The process begins by dividing the original data into K parts/folds using a random function

Which of the following statements are true for SMOTE?
1. Creates synthetic data points
2. Prevents the model from getting biased towards the majority class
3. Reduces the number of majority class observations
4. Might increase overfitting

1, 2, and 4
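
A minimal sketch of SMOTE oversampling with imblearn; the toy dataset and parameter values below are illustrative assumptions:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# illustrative imbalanced toy dataset (roughly 90% majority, 10% minority)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print(Counter(y))

# SMOTE synthesizes new minority-class points by interpolating between
# each minority point and its k nearest minority-class neighbours
smote = SMOTE(k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X, y)
print(Counter(y_res))  # classes are now balanced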

Cross validation procedure

1. Shuffle the dataset randomly
2. Split the dataset into K folds
3. For each fold:
   a. Keep the fold's data separate as the hold-out dataset
   b. Use the remaining folds as a single training dataset
   c. Fit the model on the training set and evaluate it on the hold-out test set
   d. Retain the evaluation score and discard the model
   e. Loop back
4. Steps 3a to 3e will be executed K times
5. Summarize the scores and average them by dividing the sum by K
6. Analyze the average score and its dispersion to assess the likely performance of the model on unseen production data (see the code sketch after this list)
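
A minimal sketch of this procedure in scikit-learn; the dataset and model below are illustrative assumptions:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
kfold = KFold(n_splits=5, shuffle=True, random_state=1)  # steps 1 and 2

scores = []
for train_idx, test_idx in kfold.split(X):                # step 3
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])                 # 3a-3c
    scores.append(model.score(X[test_idx], y[test_idx]))  # 3d

print(np.mean(scores), np.std(scores))                    # steps 5 and 6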

Which of the following is true for Oversampling using SMOTE?
1. Exact copies of existing data points are created
2. Random noise is added
3. Copies of existing data points with small variations are added

Statements 1 and 2 are false; statement 3 is true

If k = 5, each data point is used ____ times in the training process.

4

In K fold CV, if k = 5, the number of times the model will be trained is

5

If 90, 85, 78, 88, and 85 are the cross-validated scores, what will be the final model performance?

85.2
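
Worked out: the final performance is the mean of the fold scores, (90 + 85 + 78 + 88 + 85) / 5 = 426 / 5 = 85.2.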

Feature Engineering is employed at the _______ of the model building while model tuning is employed towards the ________.

Beginning, End

Fill in the blank with the most appropriate option: The constraint region for the penalty in ridge regression has a ____ shape

Circular

State if the following statements are true or false:
1. Upsampling leads to loss of information.
2. Downsampling creates synthetic data points.
3. Upsampling adds more data points over the existing patterns without tampering with the existing patterns.
4. Downsampling can be done using SMOTE.

False, False, True, False (FFTF)

Which of the following is true while going through each iteration of K-fold?

K-1 folds are used for training

Which of the following algorithms is used by SMOTE for data point generation?

K-nearest neighbour

Fill in the blank with the most appropriate option: Large values of k in K-fold cross-validation lead to ____ variance across the training sets

Less

Shrinkage models help us deal with the problem of _______.

Overfitting

SMOTE can be used for:

Oversampling of minority class

Fill in the blank with the most appropriate option: The constraint region for the penalty in lasso regression has a ____ shape

Diamond (a square rotated 45°)

Which of the following are included in feature engineering?

Removing outliers, cleaning data, scaling data

What is the penalty term in ridge regression?

Sum of Squared values of coefficients
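
For reference, ridge minimizes RSS + λ·Σβj² (lasso swaps the squared term for Σ|βj|); larger λ shrinks the coefficients more. A minimal sketch with scikit-learn, where the data and alpha value are illustrative assumptions:

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=10, noise=5, random_state=0)

# alpha is the penalty strength (lambda); larger alpha shrinks coefficients more
ridge = Ridge(alpha=1.0).fit(X, y)
print(ridge.coef_)  # shrunk towards, but typically not exactly, zero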

Which of the following is NOT the purpose of using the penalty term?

To reduce underfitting

We should apply oversampling and undersampling techniques on:

Train dataset

"The possible value of coefficients that can be obtained via ridge regression are the ones that lie on the intersection of contour lines and the constraint region"

True

SMOTE is used for __________ while Tomek links (T-links) _________ the data.

Up-sampling, Down-samples
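
A minimal sketch of both techniques with imblearn; the toy dataset is an illustrative assumption:

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# SMOTE up-samples the minority class with synthetic points
X_up, y_up = SMOTE(random_state=0).fit_resample(X, y)

# Tomek links down-sample by removing majority-class points that form
# cross-class nearest-neighbour pairs near the decision boundary
X_down, y_down = TomekLinks().fit_resample(X, y)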

The best way to choose the value of 'k' in cross-validation is :

Value of k varies for every problem

Select True or False for the following statements:
a. Curse of dimensionality refers to having too many dimensions and very few data points
b. Regularization can help in dealing with the curse of dimensionality
c. Curse of dimensionality refers to having too many data points and very few dimensions

a and b are true, c is false

Cross Validation

A technique to evaluate/validate an ML model and estimate its performance on unseen data

K-fold validation technique code

from numpy import array
from sklearn.model_selection import KFold

# toy dataset of 5 points; with 5 splits, each test fold holds one point
data = array([10, 20, 30, 40, 50])
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
for train, test in kfold.split(data):
    print('train: %s, test: %s' % (data[train], data[test]))

After instantiating kfold and the model, which of the following functions can we use to get the scores of the model on every fold?

cross_val_score
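
A minimal usage sketch; the estimator and data are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
model = LogisticRegression(max_iter=1000)

# one score per fold, computed by fitting on K-1 folds and scoring on the held-out fold
scores = cross_val_score(model, X, y, cv=kfold)
print(scores, scores.mean())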

Salient features

- Each record/data point in the sample data is assigned to a single fold before the K folds are created, and stays in that fold for the duration of the procedure
- Each data point is used once in the hold-out set and K-1 times in training
- Transformations are done on each iteration's training folds, not on the entire dataset
- For hyperparameter tuning, split the original data into two: keep one part aside, use the other to do K-fold validation, and once optimal hyperparameters are found, assess the model on the test data
- Any data transformation done on the whole set outside the loop leads to data leakage and overfitting (see the pipeline sketch below)
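
A minimal sketch of keeping transformations inside the loop with a scikit-learn Pipeline, so scaling is re-fit on each iteration's training folds only; the scaler, model, and dataset are illustrative assumptions:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# the scaler is fit only on the training folds of each iteration,
# so no information from the held-out fold leaks into training
pipe = Pipeline([('scale', StandardScaler()),
                 ('model', LogisticRegression(max_iter=1000))])
scores = cross_val_score(pipe, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=1))
print(scores.mean())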

"LooCV will have a high fluctuation in training models' accuracy scores"

false

"Ridge regression always have better model performance than lasso regression"

false

Does good model performance on the train and test datasets mean that the model will show good performance when deployed on real-world data?

false

Ridge regression can be used for feature selection because the feature coefficients can become zero, and thus those features can be dropped and considered non-essential for modeling.

false

We divide the dataset into 3 parts (train, test, validation) and use the test dataset for tweaking hyperparameters.

false

While undersampling dataset using Cluster centroids from imblearn, data points are removed randomly.

false
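
A minimal sketch of ClusterCentroids undersampling with imblearn; the toy dataset is an illustrative assumption. Majority points are replaced by K-means cluster centroids rather than dropped at random:

from sklearn.datasets import make_classification
from imblearn.under_sampling import ClusterCentroids

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# the majority class is summarized by K-means centroids, not removed randomly
cc = ClusterCentroids(random_state=0)
X_res, y_res = cc.fit_resample(X, y)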

Which one of the following will divide the data into train and test sets when using LeaveOneOut(), where leave = LeaveOneOut()?

leave.split(data)
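
A minimal sketch; the toy dataset is an illustrative assumption:

from numpy import array
from sklearn.model_selection import LeaveOneOut

data = array([10, 20, 30, 40, 50])
leave = LeaveOneOut()

# each iteration holds out exactly one data point as the test set
for train, test in leave.split(data):
    print('train: %s, test: %s' % (data[train], data[test]))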

Too large a k

Less variance across the training sets (limits the model differences across iterations)

K-fold: how to determine the number of folds

Minimum value of 2 (two iterations in this case)

For a data set with n data points, K-fold CV behaves similarly to LooCV when k=?

n

"n is the maximum value for k in K-fold CV" Which of the following holds true for n?

n = no. of observations

Whatever value of k is chosen

The resulting training and test data should be as representative of the unseen data as possible

The KFold() function is found in which package of Python?

sklearn.model_selection

Overfit vs. underfit zones

Performs well on the training set but poorly on the test set = overfit zone; poor on both training and test = underfit zone; performs equally well on test and training = right-fit zone

"Data Leakage happens when information from the testing data leaks into the training process"

true

"If the random state is not set while splitting the data, we might get varying test accuracies every time we run the model"

true
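
A minimal sketch of why this happens; the dataset and split size are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# without random_state, each run shuffles differently and test accuracy varies
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3)

# with random_state fixed, the split (and hence the accuracy) is reproducible
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)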

"Lasso regression can be used for feature selection"

true
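
A minimal sketch; the data and alpha value are illustrative assumptions. Lasso's L1 penalty can drive some coefficients exactly to zero, and those features can be dropped:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=100, n_features=10, n_informative=3, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
selected = np.nonzero(lasso.coef_)[0]  # indices of features with non-zero coefficients
print(selected)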

"The cross-validation approach gives a better view of model performance as it runs the training/testing cycle many times by splitting the data into many parts and reports the model performance for each iteration"

true

"The mean and standard deviation of cross-validated scores can be used to imply that the expected performance with 95% confidence of the model will lie in between (mean - 2*standard deviation) and (mean+2*standard deviation)"

true
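
A minimal sketch of that interval computation, reusing the example scores from earlier in this set:

import numpy as np

scores = np.array([90, 85, 78, 88, 85])
mean, std = scores.mean(), scores.std()

# roughly 95% confidence interval under a normality assumption
low, high = mean - 2 * std, mean + 2 * std
print(round(low, 2), round(high, 2))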

State whether the following statement is True or False "Very large values of penalty term might lead to underfitting"

true

