Week 7 Machine Learning
What is the first step in constructing a decision tree? Start with all samples at a node. Partition the samples into subsets based on the input variables. Repeatedly partition data into successively purer subsets until stopping criteria are satisfied.
Start with all samples at a node.
In general, are classification and regression often supervised or unsupervised approaches? Supervised Unsupervised
Supervised
What does the following method call return? accuracy_score(data_true = data_test, data_pred = predictions) The fraction of correctly classified samples. The number of correctly classified samples.
The fraction of correctly classified samples.
What type of object does the function Kmeans output? kmeans dataframe integer series
kmeans
What is the function call to output the name of columns of a dataframe named x? x.columns(0) x.columns columns(x)
x.columns
You are given a dataframe labeled x where the column 'number' indicates the index of a record. Which function call would create a new dataframe y that takes more than 10 samples x if x has 100 records? y = x[(x['number']%5)==0] y = x[(x['number']%10)==0] y = x[(x['number']%15)==0]
y = x[(x['number']%5)==0]
Which Root Mean Square Error (RMSE) would represent a perfect prediction with no errors in regression? 0 NaN 1 -1
0
For a classification problem, if you want to predict the letter grade that a student would receive, what are 2 examples of reasonable input data to consider? Amount of time spent studying Percentage grade these students received in the previous semester Letter grade different students received in another class The students' ID numbers
Amount of time spent studying Percentage grade these students received in the previous semester
What is the difference between regression and classification for machine learning in Python? Regression transforms categorical values to numeric and then follows the same as classification. Regression is used to predict a numeric value while classification is used to predict a categorical value. Classification is used when the input data is categorical and regression is used when the input data is numeric.
Regression is used to predict a numeric value while classification is used to predict a categorical value.
What is the next step in building a classification model after the model is constructed and parameters are adjusted? Apply model to new data Train the data Minimize errors
Apply model to new data
How do you assign each sample in a dataset to a centroid using the k-means algorithm? Assign the sample to the cluster with the closest centroid. Assign the sample to the cluster with the furthest centroid. Assign the sample to a random cluster.
Assign the sample to the cluster with the closest centroid.
As an example, you have a dataset containing numerical values of subjects' heart rates during exercise and categorical values describing how much they smoke. You want to determine whether smoking and heart rate are related. What machine learning category would this fall under? Classification Regression Cluster analysis Association analysis
Association analysis
How do you determine the new centroid of a cluster? Calculate the mean of the cluster Calculate the max of the cluster Calculate the mode of the cluster Calculate the min of the cluster
Calculate the mean of the cluster
Is age group a numeric or a categorical variable? Numeric Categorical
Categorical
For example, you want to predict the number of kids someone will have: either 0, 1, 2, or 3+. Is this an example of regression or classification? Regression Classification
Classification
Why are decision boundaries of a decision tree parallel to the axes formed by the variables? Each split considers only a single variable Each subset should be as homogenous as possible The induction algorithm eventually stops expanding
Each split considers only a single variable
Cluster analysis is a supervised task. True False
False
In machine learning, algorithms and programs directly aim to learn a given task. True False
False
Regression is an unsupervised task. True False
False
Test data is the same dataset as training data in classification models. True False
False
True or False: The function call train_test_split(a, b) where a and b are dataframes will always output the same result. True False
False
What is the command to get the number of rows in a data set titled "data"? data.shape[0] data.shape[1] data.size() data.length()
data.shape[0]
Final clusters are sensitive to initial centroids. True False
True
It works out better mathematically to measure the impurity of a split in a decision tree, rather than the purity. True False
True
The target variable is always categorical in classification. True False
True
When you search an incorrectly spelled term online, suggested words is an example of machine learning. True False
True
What is the definition of 'data mining'? Activities related to finding patterns in databases and data warehouses. Process of inspecting, cleansing, transforming, and engineering a particular dataset. Query processing and statistical analysis to summarize a dataset.
Activities related to finding patterns in databases and data warehouses.
What does the "within-cluster sum of squared error" provide? A mathematical measure of the variation within a cluster. An error measurement for a specific sample in relation to the centroid of a particular cluster. An answer to which cluster is the most 'correct.'
A mathematical measure of the variation within a cluster.
Training Data--> Learning Algorithm--> ???? ----> Model (Training Phase) What goes in the ??? Build model Test data Apply model Results
Build model
What is true between supervised and unsupervised approaches? In supervised approaches, the target is unavailable. In unsupervised approaches, the target is unavailable. In supervised approaches, the target is provided. In unsupervised approaches, the target is provided. In supervised approaches, the target is unavailable. In unsupervised approaches, the target is provided. In supervised approaches, the target is provided. In unsupervised approaches, the target is unavailable.
In supervised approaches, the target is provided. In unsupervised approaches, the target is unavailable.
In a decision tree, which nodes do NOT have test conditions? Root nodes Internal nodes Leaf nodes
Leaf nodes
What 2 statements describe classification in the context of machine learning? Predict the category of the target given input data Supervised task Unsupervised task Numerical target variable
Predict the category of the target given input data Supervised task
How would you initially handle an anomaly (apparent outlier) in cluster analysis? Throw it out of the dataset Disregard in further analysis Provide further analysis on the anomaly
Provide further analysis on the anomaly
What is the correct word to describe an instance of an entity in your data? Sample Feature Attribute Field
Sample
Which is NOT mentioned in the course as a common similarity measure in cluster analysis? Euclidean distance Manhattan distance Cosine similarity Sine similarity
Sine similarity
How do you determine the size of a decision tree? The number of edges from the root node to that node. The number of edges in the longest path from the root node to the leaf node The number of nodes in the tree correct
The number of nodes in the tree correct
In building a machine learning model, why do we want to adjust the parameters? To reduce the model's error To compare different model variations To provide the best graph of the model outputs
To reduce the model's error
A Root Mean Square Error (RMSE) higher than our mean value would be too high. True False
True
In the parallel_plot function, what was represented on the y-axis of the resulting plot? Each of the features Values of each cluster center Location of each cluster center Min and max values in each cluster
Values of each cluster center
When is a prediction task referred to as simple linear regression? When there is only one input variable. When there is more than one input variable. When there are two input variables.
When there is only one input variable.
When would you use the machine learning technique 'regression'? When your model has to predict a categorical value. When your model has to predict a numerical value. When you want to organize similar items in your dataset into groups. When you want to capture associations between items
When your model has to predict a numerical value.
Which algorithm to build classification models relies on the notion that samples with similar characteristics likely belong to the same class? kNN Decision Tree Naive Bayes
kNN
Which parameter in the KMeans clustering algorithm do you have to specify for the number of clusters you want? n_clusters clusters tot cluster_centers
n_clusters
To use scikit-learn: DecisionTreeRegressor, train_test_split, and mean_squared_error, which of the following libraries are necessary? pandas sklearn.metrics sklearn.model_selection sklearn.tree scikitlearn
pandas sklearn.metrics sklearn.model_selection sklearn.tree
Which of the following is true about a model? built using test data evaluated on training data trained by the training data set
trained by the training data set
What is the appropriate input for the following line of code to make a linear regression prediction? y_prediction = regressor.predict(___) x_test x_train y_train y_test
y_test