machine learning
To find the global minimum of a cost function with gradient descent, the cost function should be a __________ function.
convex
We have a dataset called course_enrollment that has the following columns: course, instructor, student, room. We would like to know how many enrollments each course has. Use a DataFrame method to get the number of students in each course.
course_enrollment['course'].value_counts()
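As a minimal sketch of the answer above, the snippet below builds a small, made-up course_enrollment DataFrame (the rows are hypothetical, only the column names come from the question) and counts the enrollments per course.

```python
import pandas as pd

# Hypothetical enrollment records, made up for illustration only.
course_enrollment = pd.DataFrame({
    "course": ["ML", "ML", "DB", "ML", "DB", "OS"],
    "instructor": ["Lee", "Lee", "Kim", "Lee", "Kim", "Roy"],
    "student": ["s1", "s2", "s3", "s4", "s5", "s6"],
    "room": ["A1", "A1", "B2", "A1", "B2", "C3"],
})

# value_counts() counts how many rows (enrollments) each course has.
enrollments_per_course = course_enrollment["course"].value_counts()
print(enrollments_per_course)
```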
A learning algorithm tries to find optimal values for its model parameters such that a. The model generalizes well to training instances b. The model generalizes well to outlier instances c. The model generalizes well to large instances d. The model generalizes well to new instances
d
Which of the following is not correct for softmax regression? a. It is the generalization of Logistic Regression b. Softmax regression does not train and combine multiple binary classifiers c. Softmax regression should only be used for mutually exclusive classes d. Softmax is a multioutput classifier
d
Which of the following is true for constraining weights for regularization? a. The term added to the cost function during training. Once the model is trained, use the unregularized performance measure to evaluate b. The hyperparameter α controls how much you want to regularize the model. If α is very high, then all weights end up very close to zero and the result is a flat line going through the data's mean. c. Lasso Regression tends to eliminate the weights (set to zero) of the least important features. d. All of the options listed
d
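Option b can be illustrated with a small sketch. It assumes scikit-learn's Ridge as the regularized model and uses made-up data and alpha values, so it is only one possible demonstration: with a very large alpha the slope shrinks toward zero and the prediction becomes nearly a flat line at the mean of the targets.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Made-up data following y = 2x + 1.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2 * X[:, 0] + 1

# A tiny alpha barely constrains the weights; a huge alpha shrinks them toward zero.
weak = Ridge(alpha=0.001).fit(X, y)
strong = Ridge(alpha=1e6).fit(X, y)

print(weak.coef_[0], weak.intercept_)      # slope close to 2
print(strong.coef_[0], strong.intercept_)  # slope close to 0, intercept close to mean of y
```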
The _______ of the pandas DataFrame structure is useful to get a quick description of the data, in particular the total number of rows, each attribute's type, and the number of non-null values.
info() method
You have a multi-class classification problem with k classes, using one vs rest method, how many different logistic regression classifiers will you end up training?
k
Name 5 different sub tasks you need to perform while you are exploring the data.
1. Create a copy of the data for exploration (sampling it down to a manageable size if necessary). 2. Create a Jupyter notebook to keep a record of your data exploration. 3. Study each attribute and its characteristics: • Name • Type (categorical, int/float, bounded/unbounded, text, structured, etc.) • % of missing values • Noisiness and type of noise (stochastic, outliers, rounding errors, etc.) • Usefulness for the task • Type of distribution (Gaussian, uniform, logarithmic, etc.) 4. For supervised learning tasks, identify the target attribute(s). 5. Visualize the data. 6. Study the correlations between attributes. 7. Study how you would solve the problem manually. 8. Identify the promising transformations you may want to apply. 9. Identify extra data that would be useful (go back to "Get the Data"). 10. Document what you have learned.
Name 3 sub tasks of data preparation step of machine learning.
1. Data cleaning: • Fix or remove outliers (optional). • Fill in missing values (e.g., with zero, mean, median...) or drop their rows (or columns). 2. Feature selection (optional): • Drop the attributes that provide no useful information for the task. 3. Feature engineering, where appropriate: • Discretize continuous features. • Decompose features (e.g., categorical, date/time, etc.). • Add promising transformations of features (e.g., log(x), sqrt(x), x², etc.). • Aggregate features into promising new features. 4. Feature scaling: • Standardize or normalize features.
When we are working with the above problem, we notice that some of the math grades are missing. How would you propose to clean the missing math grades?
.dropna()
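Dropping rows is one option; filling the gaps is another. The sketch below uses a hypothetical grades table (the column names and values are assumptions, since the original problem is not shown) to compare dropna() with filling by the median.

```python
import numpy as np
import pandas as pd

# Hypothetical grades table with one missing math grade.
grades = pd.DataFrame({
    "student": ["s1", "s2", "s3", "s4"],
    "math": [90.0, np.nan, 70.0, 80.0],
})

# Option 1: drop the rows whose math grade is missing.
dropped = grades.dropna(subset=["math"])

# Option 2: keep every row and fill the gap with the median math grade.
median_math = grades["math"].median()
filled = grades.fillna({"math": median_math})

print(dropped)
print(filled)
```

Dropping loses the whole student record, while filling keeps it at the cost of an estimated value.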
Suppose you want to perform regression on independent variables X1, ⋯, Xm for dependent variables Y1, ⋯, Yn. When n > 1, it is usually called
"multi-variate regression"
Suppose you want to perform regression on independent variables X1, ⋯, Xm for dependent variables Y1, ⋯, Yn. When m > 1, it is usually called
"multiple regression" or "multi-variable regression"
Name 4 main challenges of machine learning and briefly explain:
1. Insufficient Quantity of Training Data 2. Nonrepresentative Training Data 3. Poor-Quality Data 4. Irrelevant Features 5. Overfitting the Training Data 6. Underfitting the Training Data
Name 5 sub tasks of shortlisting the promising models step of a machine learning project.
1. Train many quick-and-dirty models from different categories (e.g., linear, naïve Bayes, SVM, Random Forest, neural net, etc.) using standard parameters. 2. Measure and compare their performance. • For each model, use N-fold cross-validation and compute the mean and standard deviation of the performance measure on the N folds. 3. Analyze the most significant variables for each algorithm. 4. Analyze the types of errors the models make. • What data would a human have used to avoid these errors? 5. Perform a quick round of feature selection and engineering. 6. Perform one or two more quick iterations of the five previous steps. 7. Shortlist the top three to five most promising models, preferring models that make different types of errors.
Select the multiclass classification: - Assigning a tag to an email from one of the following: Promotion, Social, Primary - Assigning a patient one of these: not ill, cold, flu - Assigning the weather as one of these: sunny, rain, snow, cloudy - Analyzing a picture and assigning both young/old and male/female options
- Assigning a tag to an email from one of the following: Promotion, Social, Primary - Assigning a patient one of these: not ill, cold, flu - Assigning the weather as one of these: sunny, rain, snow, cloudy
univariate multivariable regression
A model with one outcome and several explanatory variables. (most common)
What are the types of the Gradient Descent technique? Briefly explain the differences
Batch, Stochastic, Mini-batch. Data used per step: - Batch uses the whole data set - Mini-batch uses only a small random subset of the data set - Stochastic uses one random instance. Speed per step: - Stochastic > Mini-batch > Batch. Global minimum: - Batch reaches the minimum and then stops - Stochastic and Mini-batch keep walking around the minimum
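The three variants differ only in how many instances feed each gradient step, which a single function can show. This is a rough NumPy sketch for linear regression with an MSE cost; the data, the function name gradient_descent, the learning rate, and the epoch count are all assumptions chosen for illustration.

```python
import numpy as np

# Made-up data following y = 4 + 3x plus a little noise.
data_rng = np.random.default_rng(42)
X = 2 * data_rng.random((200, 1))
y = 4 + 3 * X[:, 0] + data_rng.normal(0, 0.1, 200)
X_b = np.c_[np.ones(len(X)), X]  # add the bias column x0 = 1


def gradient_descent(X_b, y, batch_size, eta=0.1, n_epochs=300, seed=0):
    """batch_size == len(y): Batch GD; 1: Stochastic GD; in between: Mini-batch GD."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X_b.shape[1])
    m = len(y)
    for _ in range(n_epochs):
        order = rng.permutation(m)  # shuffle the instances once per epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            # Gradient of the MSE cost computed on the selected instances only.
            gradients = 2 / len(idx) * X_b[idx].T @ (X_b[idx] @ theta - y[idx])
            theta -= eta * gradients
    return theta


theta_batch = gradient_descent(X_b, y, batch_size=len(y))
theta_sgd = gradient_descent(X_b, y, batch_size=1)
theta_mini = gradient_descent(X_b, y, batch_size=20)
print(theta_batch, theta_sgd, theta_mini)
```

With a constant learning rate, the stochastic and mini-batch results end up near the minimum rather than exactly on it, which matches the "walking around" behavior described above.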
Gradient Descent will converge when training a Logistic Regression model because the cost function is a. Convex b. Complex c. Collocated d. Core optimized
a
Name 5 of the main steps in a machine learning project:
1. Frame the problem and look at the big picture. 2. Get the data. 3. Discover and visualize the data to gain insights. 4. Prepare the data to better expose the underlying data patterns to Machine Learning algorithms. 5. Explore many different models and shortlist the best ones. (Select a model and train it.) 6. Fine-tune your models and combine them into a great solution. 7. Present your solution. 8. Launch, monitor, and maintain your system.
One-versus-the-Rest (OvR) (one-versus-all)
Get the decision score from each classifier for that image and select the class whose classifier outputs the highest score.
____________ is a function that determines the learning rate at each iteration. If the learning rate is reduced too quickly, you may get stuck in a local minimum.
Learning schedule
______________________ is simply a generalization of multilabel classification where each label can be multiclass (i.e., it can have more than two possible values).
Multioutput-multiclass classification (or simply multioutput classification)
multivariate multivariable regression
Multiple outcomes, multiple explanatory variable.
multivariate univariable regression
Multiple outcomes, single explanatory variable
To find the value of θ that minimizes the cost function, there is a closed-form solution (a mathematical equation) that gives the result directly. This is called the __________________. Mathematical formula for the normal equation is given below.
Normal Equation
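The Normal Equation is commonly written as θ̂ = (XᵀX)⁻¹ Xᵀ y, where X includes a bias column of ones. A minimal NumPy sketch on made-up data that follows y = 1 + 2x exactly:

```python
import numpy as np

# Made-up, noise-free data following y = 1 + 2x.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
X_b = np.c_[np.ones(len(X)), X]  # add the bias column x0 = 1

# theta_hat = (X^T X)^(-1) X^T y
theta_hat = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
print(theta_hat)  # intercept and slope
```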
________ Regression is the Generalization of Logistic Regression to support multiple classes directly without having to train and combine multiple binary classifiers
Softmax
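As a rough illustration, the softmax function turns one score per class into probabilities that sum to 1, and the predicted class is the one with the highest probability. The scores below are made up, and the helper name softmax is just a local function, not a library call.

```python
import math


def softmax(scores):
    """Convert a list of class scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax([2.0, 1.0, 0.1])  # made-up scores for three classes
predicted_class = probabilities.index(max(probabilities))
print(probabilities, predicted_class)
```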
Test dataset is used to estimate the generalization error that a model will make on new instances, before the model is launched in production a. True b. False
a
univariate univariable regression
One outcome, one explanatory variable (often used as the introductory).
We would like to use binary classifiers to detect a letter from the alphabet. If we use the OvR strategy, how many binary classifiers do we need to train? If we use the OvO strategy, how many binary classifiers do we need to train?
OvR strategy -> 26 OvO strategy -> (26 * 25) / 2 = 325
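The two counts can be checked with a couple of small helper functions (the function names are made up for this sketch):

```python
def ovr_classifier_count(n_classes):
    """One-versus-the-Rest trains one binary classifier per class."""
    return n_classes


def ovo_classifier_count(n_classes):
    """One-versus-One trains one binary classifier per pair of classes."""
    return n_classes * (n_classes - 1) // 2


print(ovr_classifier_count(26))  # 26
print(ovo_classifier_count(26))  # 325
```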
A validation dataset is used to compare models a. True b. False
a
__________________ is constraining a model to make it simpler and reduce the risk of overfitting.
Regularization
_______________ uses many small validation sets. Each model is evaluated once per validation set after it is trained on the rest of the training data, and all the evaluations are averaged.
Repeated cross-validation
An online learning system can learn incrementally a. True b. False
a
Clustering is a type of a. Unsupervised learning task b. Supervised learning task c. Regression learning task d. Batch learning task
a
What is happening in the following code? from sklearn.pipeline import Pipeline from sklearn.preprocessing import StandardScaler num_pipeline = Pipeline([('imputer', SimpleImputer(strategy="median")), ('attribs_adder', CombinedAttributesAdder()), ('std_scaler', StandardScaler()),]) housing_num_tr = num_pipeline.fit_transform(housing_num)
The constructor takes a list of name/estimator pairs defining a sequence of Pipeline steps. All but the last estimator must be transformers (i.e., must have fit_transform method). In the above code, we have transformers SimpleImputer, CombinedAttributesAdder and StandardScaler. SimpleImputer uses median value to fill in missing values. When you call the pipeline's fit() method, it calls fit_transform() sequentially on all transformers, passing the output of each call as the parameter to the next call until it reaches the final estimator, for which it calls the fit() method.
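A runnable variant of this pipeline might look like the sketch below. It adds the SimpleImputer import that the snippet needs, uses a tiny made-up housing_num array, and leaves out CombinedAttributesAdder because that is a custom transformer whose definition is not shown here.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up numeric data with one missing value in the second column.
housing_num = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 20.0]])

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # fill missing values with the column median
    ("std_scaler", StandardScaler()),               # zero mean, unit variance per column
])

# fit_transform() runs each step in order, feeding each output to the next step.
housing_num_tr = num_pipeline.fit_transform(housing_num)
print(housing_num_tr)
```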
When we draw a scatter plot between house value and income, we notice the correlation between the two attributes as seen in the figure. Which of the following can be learned about the data from the figure below?
a. The correlation is indeed strong b. The price cap that we noticed earlier is clearly visible as a horizontal line at $500,000. c. The plot reveals other less obvious straight lines: a horizontal line around $450,000, another around $350,000 d. All of the options listed
Linear regression predicts ______________ while logistic regression predicts ______________ a. Continuous values, classes b. Classes, continuous values c. Classes, Values close to the mean d. Classes, Outlier values away from the mean
a
One-versus-One (OvO)
Train a binary classifier for every pair of classes: one to distinguish 0s and 1s, another to distinguish 0s and 2s, another for 1s and 2s, and so on. If there are N classes, you need to train N × (N - 1) / 2 classifiers.
There are two types of performance functions 1) ____________ Measures how good your model is. 2) ___________ Measures how bad it is.
Utility (or fitness) function, Cost function
How would you use early stopping for regularization to avoid overfitting? Explain briefly.
We identify and stop at the point where the error on the validation set stops decreasing and starts increasing, while the error on the training set continues to decrease.
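The stopping rule can be sketched on a list of per-epoch validation errors. The error values, the function name best_epoch, and the patience parameter (how many non-improving epochs to tolerate) are assumptions for illustration; a real training loop would compute the errors as it goes.

```python
def best_epoch(validation_errors, patience=2):
    """Return the 0-based epoch with the lowest validation error seen before
    the error fails to improve for `patience` consecutive epochs."""
    best_error = float("inf")
    best_index = -1
    epochs_without_improvement = 0
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_error, best_index = error, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation error keeps rising: stop training
    return best_index


# Made-up validation errors: they fall, bottom out at epoch 3, then rise again.
val_errors = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
stop_at = best_epoch(val_errors)
print(stop_at)  # 3
```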
Excessively simple models ________________ while excessively complex models _____________ a. underfit the training data, overfit the training data b. overfit the training data, underfit the training data c. underfit the training data, optimally fit the training data d. optimally fit the training data, underfit the training data
a
To classify pictures as outdoor or indoor and daytime or nighttime we may train a. Two Logistic Regression classifiers b. Two Linear Regression classifiers c. Four Logistic Regression classifiers d. Four Linear Regression classifiers
a
A hyperparameter is a parameter of the learning algorithm itself, not of the model a. True b. False
a
A pipeline is a. A sequence of data processing components b. A sequence that combines all datasets c. A sequence that samples a portion of a dataset d. A sequence that randomly processes a dataset
a
The best learning type to teach a robot to learn to walk in various unknown terrains is a. Reinforcement learning b. Supervised learning c. Semisupervised learning d. Other types of learning
a
This type of learning method is capable of adapting rapidly to both changing data and autonomous systems, and of training on very large quantities of data. a. Online learning b. Offline learning c. Reinforcement learning d. Semisupervised learning
a
To fine-tune your model and perform iterative evaluations, you may use a. Grid search b. Full text search c. Sample search d. Spot search
a
Which of these can be a hyperparameter? • The learning rate for training a neural network. • The k in k-nearest neighbors.
a
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:] a. Partitions the data into training and test datasets b. Partitions the data into two training and two test datasets c. Partitions the data into four datasets d. Partitions the data into two datasets randomly
a
Gradient descent is used for the following purpose a. Find the parameters that minimize the cost function b. Evaluate how good the predictions are. c. Split the dataset in training and test sets. d. Compute the recall
a. Find the parameters that minimize the cost function
In a medical trial, we train a model with weight, age, and race features, and we get predictions for variables blood pressure and cholesterol with the same model. Which of the following is true? a. It is a multivariate multivariable (multiple) regression b. It is a univariate multivariable (multiple) regression c. It is a multivariate univariable regression d. None of the options listed.
a. It is a multivariate multivariable (multiple) regression
Which of the following are indications of overfitting? a. The model performs well on the training data but generalizes poorly according to the cross-validation metrics. b. The error on the training data is low but considerably higher on the validation data c. There is a gap between the learning curves for training and validation data
all
Which of the following are indications of underfitting? a. The model performs poorly on the training data and also generalizes poorly. b. The training and validation learning curves reach a plateau and they are close and fairly high. c. Adding more training data does not help improve the performance on the training data d. We need a more complex model or better features
all
A labeled training set is a. A dataset that contains specific names b. A dataset that contains the desired solution c. A dataset that contains boolean instances d. A dataset that contains sufficient instances
b
A validation dataset is used to compare a. Datasets b. Models c. Features d. Labels
b
Batch learning systems learn dynamically a. True b. False
b
Customer segmentation into groups is a type of a. Classification task b. Clustering task c. Regression task d. Reinforcement task
b
Given a training set with millions of features, the fastest algorithm to use to perform a search for a global minimum a. The Normal Equation b. Stochastic Gradient Descent c. Mini-batch Gradient Descent d. Batch Gradient Descent
b
Gradient Descent cannot get stuck in a _______________ when training a Logistic Regression model a. local minimum b. global minimum c. plateau d. summit
a (the cost function is convex, so there are no local minima to get stuck in)
One-hot encoding is a technique for a. encoding continuous features b. encoding categorical features c. improving the convergence of gradient descent, similar to min-max normalization d. imputing missing values
b
The code below: from sklearn.linear_model import SGDClassifier sgd_clf = SGDClassifier(random_state=42) sgd_clf.fit(X_train, y_train) a. Instantiates and validates a classifier b. Instantiates and regularizes a classifier c. Instantiates and trains a classifier d. Instantiates and optimizes a classifier
c
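For reference, a runnable version of the snippet in the question could look like this. The training data is an assumption: the question does not say what X_train and y_train hold, so the sketch borrows scikit-learn's iris toy dataset.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier

# Assumption: the iris toy dataset stands in for the unspecified training data.
X_train, y_train = load_iris(return_X_y=True)

sgd_clf = SGDClassifier(random_state=42)  # instantiate the classifier
sgd_clf.fit(X_train, y_train)             # train it on the labeled data

predictions = sgd_clf.predict(X_train[:5])
print(predictions)
```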
The logistic regression approach is used for a. Regression b. Clustering c. Classification d. Data segmentation
c
A univariate uses a. Two features to predict the outcome b. One feature to predict the outcome c. One feature in its dataset d. At least two features in its dataset
b. One feature to predict the outcome
(One-variable regression) Consider the plot below corresponding to hθ(x) = θ0 + θ1x. What are θ0 and θ1?
b. θ0 = 0.5, θ1 = 1
A performance measure for regression is: a. recall b. precision c. Root Mean Square Error (RMSE) d. F1-score
c. Root Mean Square Error (RMSE)
Mean Absolute Error is a preferred performance measure for data with many a. Instances b. Features c. Outliers d. Classes
c. outliers
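Both regression measures can be computed by hand on a few made-up predictions. Because RMSE squares each error before averaging, the single large error below weighs more heavily in RMSE than in MAE.

```python
import math

# Made-up targets and predictions.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 7.0]

errors = [p - t for p, t in zip(y_pred, y_true)]  # -1, 0, 2, 0

rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))  # sqrt(5 / 4)
mae = sum(abs(e) for e in errors) / len(errors)              # 3 / 4

print(rmse, mae)
```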
If you are using a learning algorithm to estimate the price of houses in a city, you may want one of your features xi to capture the age of the houses. In your training set, all the houses have an age between 10 and 35, with an average of 17. Which of the following would you use as the feature if you use normalization for feature scaling?
c. xi = (age of house - 10)/25
a. If lambda is very small, the parameters θ1 to θn will be close to zero b. If lambda is very large, the parameters θ1 to θn will be close to zero c. If lambda is very large, the parameters θ1 to θn will be very large too d. If lambda is close to zero, the parameters θ1 to θn will be close to zero too
b. The parameters are close to zero if lambda is very large
Which of the following is true for Normal Equation, Batch Gradient Descent (GD), Stochastic GD and Mini-Batch GD? a) After training, all these algorithms end up with very similar models and make predictions in exactly the same way. b) Batch GD's path actually stops at the minimum, while both Stochastic GD and Mini-batch GD continue to walk around the global minimum. c) Mini-batch GD will end up walking around a bit closer to the minimum than Stochastic GD, but it may be harder for it to escape from local minima d) All of the options listed
d, all
Which of the followings is correct for regularization of linear regression? a. We should avoid plain linear regression b. Ridge regression is a good default c. We should use Lasso or Elastic Net if you expect that only a few features are actually useful d. All of the options listed.
d. All of the options listed.
Which of the following is not a way of constraining weight for regularization? a. Ridge Regression b. Lasso Regression c. Elastic Net d. Softmax
d. softmax
The validation set and the test set must be as representative as possible of the data you expect to use in production. We would have unexpected generalization errors due to
data mismatch.
To avoid ________________, we should not look at the test set. If we look, we may see an interesting pattern in the test data that leads you to select a particular kind of Machine Learning model. Since your model will perform well on the test set because of this selection, you might get an unexpected generalization error.
data snooping bias
If we would like to get the count, mean, std, etc. values of numeric fields of a dataframe, we can utilize the
describe() method
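A short sketch on a made-up DataFrame shows what describe() returns, alongside info() for comparison:

```python
import pandas as pd

# Made-up data with one numeric and one text column.
df = pd.DataFrame({"age": [10, 20, 30, 40], "city": ["a", "b", "c", "d"]})

df.info()                # row count, column types, non-null counts
summary = df.describe()  # count, mean, std, min, quartiles, max for numeric columns
print(summary)
```

By default describe() summarizes only the numeric columns, so the text column city does not appear in the result.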
Root Mean Square Error (RMSE) is a measure of how much _______________ the system typically makes in its predictions
error
The error rate on new cases is called the ________________ (or_______________), and by evaluating your model on the test set, you get an estimate of this error.
generalization error, out-of-sample error
At each step, instead of computing the gradients based on the full training set (as in Batch GD) or based on just one instance (as in Stochastic GD), Mini-batch GD computes the gradients on small random sets of instances called __________________
mini-batches.
A classification system that outputs _________________ is called a multilabel classification system.
multiple binary tags
During the ________ feature scaling technique, values are shifted and rescaled so that they end up ranging from 0 to 1.
normalization/min-max
If a model performs great on the training data but generalizes poorly to new instances, the model is likely ___________________________ the training data
overfitting
To save your model, you can utilize __________ or ___________ libraries.
pickle, joblib
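A minimal sketch of both options, assuming a small scikit-learn LinearRegression model trained on made-up data and a temporary directory for the joblib file (the file name model.joblib is arbitrary):

```python
import os
import pickle
import tempfile

import joblib
from sklearn.linear_model import LinearRegression

# Made-up data following y = 1 + 2x.
model = LinearRegression().fit([[0.0], [1.0], [2.0]], [1.0, 3.0, 5.0])

# pickle: serialize the model to bytes and restore it.
restored_pickle = pickle.loads(pickle.dumps(model))

# joblib: save the model to a file and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(model, path)
    restored_joblib = joblib.load(path)

print(restored_pickle.predict([[3.0]]), restored_joblib.predict([[3.0]]))
```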
A simple way to regularize a __________________ is to reduce the number of _____________________
polynomial model, polynomial degrees (complexity).
Stochastic gradient descent is used to determine the theta vector for a regression, and the figure for the cost is given above. Briefly explain why the line is not a straight line.
Each step uses a single random instance, so the cost bounces up and down instead of decreasing smoothly. This randomness helps the algorithm jump out of local minima, but it also means Stochastic GD never settles at the minimum; it keeps jumping around it.
For linear models, ______________ is typically achieved by ________________ of the model.
regularization, constraining the weights
If you set the _________________ hyperparameter to a very large value, you will get an almost flat model (a slope close to zero); the learning algorithm will almost certainly not ________________ the training data, but it will be less likely to find a good solution.
regularization, overfit
What is a toy data set provided in scikit-learn library? How do you load them?
Small datasets used to quickly illustrate the behavior of the various algorithms implemented in scikit-learn. You load them with the load_<dataset_name>() functions in sklearn.datasets, e.g., load_iris().
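For example, the iris toy dataset can be loaded with load_iris() from sklearn.datasets; the other toy datasets follow the same pattern.

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

print(X.shape)             # (150, 4): 150 flowers, 4 features each
print(iris.target_names)   # the three iris species
```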
The________ Regression classifier predicts only one class at a time (i.e., _________ not ________ ), so it should be used only with mutually exclusive classes, such as different types of plants.
softmax, it is multiclass, not multioutput
During ________ feature scaling, we subtract the mean value and then divide by the standard deviation so that the resulting distribution has unit variance.
standardization
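A small sketch of standardization on made-up values, using only the standard library. It divides by the population standard deviation, which is an assumption about the convention in use.

```python
import statistics

# Made-up feature values.
values = [200, 400, 800, 1000, 2000, 2200]

mean = statistics.fmean(values)
std = statistics.pstdev(values)  # population standard deviation

standardized = [(v - mean) / std for v in values]
print(standardized)
```

Unlike min-max scaling, the result is not bound to a fixed range such as 0 to 1.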
We perform ________ to guarantee that the test set is representative of the overall population. During stratified sampling, the population is divided into homogeneous subgroups called _______, and the right number of instances are sampled from each ________
stratified sampling, strata, stratum
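One way to do this in scikit-learn is the stratify argument of train_test_split. The sketch below uses made-up labels with an 80/20 class balance and checks that the test set keeps the same proportions.

```python
from sklearn.model_selection import train_test_split

# Made-up population: 80 instances of stratum "a" and 20 of stratum "b".
labels = ["a"] * 80 + ["b"] * 20
samples = list(range(100))

train_x, test_x, train_y, test_y = train_test_split(
    samples, labels, test_size=0.2, stratify=labels, random_state=42
)

# The 20-instance test set keeps the 80/20 balance: 16 "a" and 4 "b".
print(list(test_y).count("a"), list(test_y).count("b"))
```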
Machine learning systems improve their performance in a given task with more and more experience or data. True/False?
t
Unless there are very few hyperparameter values to explore, prefer random search over grid search. True/False?
t
To figure out if the performance issue is due to data mismatch or overfitting, we can train our model on the ______________ set and then evaluate it on the ____________ set (a part of the training data held out from training). • If it performs well, then the model is not overfitting the training set, so the problem must be coming from the data mismatch. • If the model performs poorly on the train-dev set, then it must have overfit the training set, so you should try to simplify or regularize the model, get more training data, and clean up the training data.
training, train-dev (training-dev)
A high-degree polynomial model is likely to have high ______________, and thus to overfit the training data, while a high-_____________ model is most likely to ______________ the training data.
variance, bias, underfit
For the following group of data: 200, 400, 800, 1000, 2000, 2200, scale them with min-max.
x_scaled = (x - xmin) / (xmax - xmin), with xmin = 200 and xmax = 2200: 0, 0.1, 0.3, 0.4, 0.9, 1
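The min-max formula, (x - xmin) / (xmax - xmin), can be applied to the six values from the question in a few lines:

```python
values = [200, 400, 800, 1000, 2000, 2200]

x_min, x_max = min(values), max(values)

# Shift by the minimum, then divide by the range, so results fall between 0 and 1.
scaled = [(x - x_min) / (x_max - x_min) for x in values]
print(scaled)
```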
Which of these is not one of the feature engineering steps: • Discretize continuous features. • Decompose features (e.g., categorical, date/time, etc.). • Add promising transformations of features (e.g., log(x), sqrt(x), x², etc.). • Aggregate features into promising new features.
• Discretize continuous features. • Decompose features (e.g., categorical, date/time, etc.). • Add promising transformations of features (e.g., log(x), sqrt(x), x², etc.). • Aggregate features into promising new features.