ML Pipeline and Hyperparameter Tuning
Which of the following is true about grid search?
Grid search evaluates each combination of parameters in sequence and then chooses the optimal one
How do we know which hyperparameter values are best?
There is no industry standard
Two generic approaches to searching the hyperparameter space are
GridSearchCV (considers all parameter combinations) and RandomizedSearchCV (can sample a given number of candidates from a parameter space with a specified distribution)
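A minimal sketch contrasting the two approaches (the dataset, model, and parameter ranges below are illustrative assumptions, not from the source):

from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# GridSearchCV tries every combination in a discrete grid (3 x 2 = 6 candidates)
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid={"max_depth": [2, 4, 6], "criterion": ["gini", "entropy"]},
                    cv=5)
grid.fit(X, y)

# RandomizedSearchCV samples a fixed number of candidates from distributions
rand = RandomizedSearchCV(DecisionTreeClassifier(random_state=0),
                          param_distributions={"max_depth": randint(2, 10),
                                               "criterion": ["gini", "entropy"]},
                          n_iter=5, cv=5, random_state=0)
rand.fit(X, y)
print(grid.best_params_, rand.best_params_)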
State which of the following statements are true I. We should focus only on improving the performance on the training set, and the performance on testing set will get improved automatically. II. The testing error keeps on falling along with the training error with an increase in model complexity.
Neither of the two statements is true
Pipeline
An object that makes the workflow more efficient and clean. This sklearn class simplifies chaining the transformation steps and the model; combined with GridSearchCV, it helps search over the hyperparameter space applicable at each stage
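A minimal sketch of a pipeline combined with GridSearchCV; a stage's hyperparameters are addressed as step_name__parameter_name (the step names and candidate values here are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
pipe = Pipeline([("scaler", MinMaxScaler()), ("svc", SVC())])
param_grid = {"scaler__feature_range": [(0, 1), (-1, 1)],  # hyperparameter of the scaler stage
              "svc__C": [0.1, 1, 10]}                      # hyperparameter of the model stage
gs = GridSearchCV(pipe, param_grid, cv=5)
gs.fit(X, y)
print(gs.best_params_)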
testing data should be
separately transformed using the same functions that were used to transform the rest of the data for model building and hyperparameter tuning
Pipelines are objects that help us _______ the processes/steps used in the ML project cycle.
standardize
Split the data into three parts while tuning hyperparameters, to prevent data leakage
training, validation, testing
State whether the following statement is True or False "Grid Search is a computationally expensive cross-validation process"
true
Grid search import and parameter example
from sklearn.model_selection import GridSearchCV
gs = GridSearchCV(knn_clf, param_grid, cv=10)  # cv = number of cross-validation folds
GridSearchCV example
A parameter grid lists candidate values for each hyperparameter (HP1, HP2, HP3, ...), and GridSearchCV tries every combination, as in the sketch below
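A minimal sketch of a full grid search over a KNN classifier (the dataset and candidate values are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
param_grid = {"n_neighbors": [3, 5, 7],            # HP1
              "weights": ["uniform", "distance"],  # HP2
              "p": [1, 2]}                         # HP3 (power of the Minkowski metric)
gs = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10)
gs.fit(X, y)            # tries all 3 x 2 x 2 = 12 combinations
print(gs.best_params_)  # best combination found
print(gs.best_score_)   # its mean cross-validated score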
Fill in the blanks with the most appropriate option Data imbalance creates an issue because most of the classifiers/estimators work to improve the ______________, and thus the estimators are biased towards the _________.
Overall accuracy, majority class
State whether the following statement is True or False "GridSearchCV is an exhaustive sampling technique and can be inefficient"
true
State whether the following statement is True or False "Hyperparameters are like handles available to data scientists to tune the performance of their models"
true
State whether the following statement is True or False "The regression coefficients related to the features are the model parameters of a linear regression model"
true
The parameter grid defined for GridSearchCV takes input as
A dictionary mapping hyperparameter names to lists of candidate values
gs.best_params_ does what
Extracts the best combination of hyperparameter values found by the search
make_pipeline() differs from Pipeline() as
make_pipeline() does not need the user to input names for each step, while Pipeline() does
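A minimal sketch of the difference (both build the same two-step pipeline; the steps chosen are illustrative):

from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Pipeline: the user supplies a list of (name, estimator) tuples
pipe1 = Pipeline([("scale", StandardScaler()), ("clf", SVC())])

# make_pipeline: names are generated from the lowercased class names
pipe2 = make_pipeline(StandardScaler(), SVC())
print(pipe2.steps)  # [('standardscaler', StandardScaler()), ('svc', SVC())]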
How to tune a model using train, test and validation split?
1. Pick a combination of hyperparameters
2. Train a model using those hyperparameters
3. Find the model's performance on the validation set
4. Repeat this process for all available combinations
5. Choose the model with the best validation score, and find the final (generalized) score on the test set
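A minimal sketch of this manual loop (the dataset, split sizes, and candidate values are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# carve out the test set first, then split the rest into train and validation
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)

best_score, best_k = -1.0, None
for k in [1, 3, 5, 7, 9]:                                               # 1. pick a combination
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)   # 2. train on the training set
    score = model.score(X_val, y_val)                                   # 3. evaluate on the validation set
    if score > best_score:                                              # 4./5. keep the best combination
        best_score, best_k = score, k

final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_trainval, y_trainval)
print(final_model.score(X_test, y_test))  # final (generalized) score on the test set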
GridsearchCV is
1. A basic hyperparameter tuning technique
2. It builds a model for each combination of all the given hyperparameter values
3. Each model is evaluated and ranked
4. The combination of hyperparameter values that gives the best-performing model is chosen
5. For every combination, cross-validation is used and the average score is calculated
6. It is an exhaustive sampling of the hyperparameter space and can be quite inefficient
Hyperparameters and tuning steps
1. Select the model type (regressor or classifier)
2. Identify the corresponding parameter space
3. Decide the method for searching and sampling the parameter space
4. Decide the cross-validation scheme to ensure the model will generalize
5. Decide a score function to use to evaluate the model
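A minimal sketch mapping these steps onto GridSearchCV arguments (the model, grid, and scorer here are illustrative assumptions):

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()        # 1. model type: classifier
param_grid = {"max_depth": [3, 5, 7]}   # 2. corresponding parameter space
gs = GridSearchCV(model,                # 3. search method: exhaustive grid
                  param_grid,
                  cv=5,                 # 4. cross-validation scheme
                  scoring="accuracy")   # 5. score function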
Pipeline steps
1. Sequentially apply a list of transforms and a final estimator
2. Intermediate steps of the pipeline must be "transforms", that is, they must implement fit and transform methods
3. The final estimator only needs to implement fit
4. Helps standardize the model project by enforcing consistency in building, testing, and production
In a pipeline with 4 steps including the estimation/prediction step, how many times is the fit_transform() function called while calling pipeline.fit()?
3
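A minimal sketch illustrating this: pipeline.fit() calls fit_transform() on the three intermediate transformers and plain fit() on the final estimator (the particular four steps are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
pipe = Pipeline([("scale", StandardScaler()),   # fit_transform() call 1
                 ("select", SelectKBest(k=3)),  # fit_transform() call 2
                 ("pca", PCA(n_components=2)),  # fit_transform() call 3
                 ("clf", SVC())])               # final estimator: fit() only
pipe.fit(X, y)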
Which of the following is NOT true about hyperparameter tuning?
Creates artificial data points
The pipeline object takes input as:
A list of (name, estimator) tuples
Which of the following can be the hyperparameter space of decision tree?
max_depth, min_samples_split, max_features
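A minimal sketch of such a space expressed as a parameter grid (the candidate values are illustrative assumptions):

param_grid = {"max_depth": [3, 5, 10, None],
              "min_samples_split": [2, 5, 10],
              "max_features": ["sqrt", "log2", None]}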
Grid search will ensure that the best combination of hyperparameters
The best set of hyperparameters is chosen from the provided discrete sample space; it might not be the best combination for the model overall.
How to upgrade the Numpy library?
To upgrade the numpy library, you can run !pip install numpy==1.20.3 --user in your Jupyter notebook, or pip install numpy==1.20.3 --user in the Anaconda prompt
Fill in the blank with the most appropriate option Dataset is divided into 3 parts and below steps are followed: The model is trained on ----- dataset Hyperparameters are tuned on ----- dataset Final performance is checked on ----- dataset
Training, validation, test
Which of the following is true for pipelines?
Transformations are always applied on train dataset and test dataset separately
Which of the following statements is NOT true about data leakage?
Using different datasets for training and testing leads to data leakage
Make_Pipeline
A function that creates the pipeline and automatically names each step. We don't need to specify names; they will be set to the lowercase of each step's class name automatically
What is the difference between fit, fit_transform, and transform?
a. fit - learns (fits) the parameters of the transformer from the data
b. transform - transforms the data using the parameters learned by fit
c. fit_transform - first fits the parameters and then also transforms the same data
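A minimal sketch with a scaler (the toy arrays are illustrative assumptions):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[1.5], [2.5]])

scaler = MinMaxScaler()
scaler.fit(X_train)                         # fit: learn min/max from the training data
X_train_scaled = scaler.transform(X_train)  # transform: apply the learned parameters
# or in one step: X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)    # test data reuses the training fit (never refit on test)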
KNeighborsClassifier list of parameters example
algorithm='auto', leaf_size=30, metric='minkowski', n_neighbors=5
Hyperparameters are supplied as
arguments to the model algorithms while initializing them, e.g., setting the criterion for decision tree building: dt_model = DecisionTreeClassifier(criterion="entropy")
Grid search follows a particular pattern and goes through
every set of hyperparameters available, so its outputs are always the same (it is deterministic)
State whether the following statement is True or False "Based on the performance of the model on test data, we should tweak the hyper-parameters of the model to get a better result"
false
State whether the following statement is True or False "Each step/stage in a pipeline should have a transform function so that the data fed to the next step is transformed"
false
State whether the following statement is True or False "Greater model complexity always implies better model performance on test dataset"
false
State whether the following statement is True or False "Grid search gives different outputs every time you run it"
false
State whether the following statement is True or False "Grid search will ensure that the best combination of hyperparameters(from all possible values of hyperparameters) are highlighted"
false
State whether the following statement is True or False "Hyper-parameters are learned from the data"
false
State whether the following statement is True or False "In the transformation step of the pipeline, transformation is fit on the training dataset and then train dataset is transformed, similarly transformation is fit on the test dataset and then test dataset is transformed"
false
State whether the following statement is True or False "Pipelines can only be used for classification purpose and not for regression"
false
State whether the following statement is True or False "The training dataset and test dataset should be scaled together to maintain uniformity"
false
State whether the following statement is True or False "make_pipeline is similar to pipelines but it does not require and does not permit, naming the estimators. Instead, their names will be set to the uppercase of their types automatically"
false
Make a Pipeline example code
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
pipe = make_pipeline(MinMaxScaler(), SVC())
print("Pipeline steps:\n{}".format(pipe.steps))
To get a list of hyperparameters for a given algorithm, call the function
get_params(), e.g.:
from sklearn.svm import SVC
svc = SVC()
svc.get_params()
Trying out all the combinations of hyperparameter values looks like this
gs.fit(X_train, y_train)
KNeighborsClassifier is imported and instantiated with its default values as
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier()
Every stage of a pipeline needs a transform function except
the last step (the final estimator)
Hyperparameters are not
learned from the data as other model parameters are; e.g., feature coefficients in a linear model are learned from the data, while the cost of error is supplied as a hyperparameter
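A minimal sketch of the distinction (Ridge regression and its alpha are used here purely for illustration):

from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge

X, y = load_diabetes(return_X_y=True)
model = Ridge(alpha=1.0)  # alpha is a hyperparameter: supplied by the modeler, not learned
model.fit(X, y)
print(model.coef_)        # coefficients are model parameters: learned from the data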
Hyperparameters are
like handles available to the modeler to control the behavior of the algorithm used for modeling
