QMB3302 Final Exam
The features of the model... a. Keep the model validation process stable. b. Are always functions of each other. c. None of these answers are correct. d. Are used as proxies for y-hat/y (that is yhat divided by y)
c
What is ONE reason the textbook lists for why a linear regression is a good starting point in a modeling task? a. They are easily coded. b. The slope is easily drawn. c. They are interpretable. d. Fixed point assessment. e. They make nice graphs.
c
What is a potential downside of using linear regression models in machine learning? a. They are not suitable for predicting continuous target variables. b. They are too complex and difficult to interpret. c. They are prone to overfitting the data. d. They can only handle numerical data.
c
What is the purpose of the following code: from sklearn.preprocessing import StandardScaler scale = StandardScaler() rfm_std = scale.fit_transform(df) a. to remove missing values b. Adjust the number of columns being used. c. to standardize the data d. the above code doesn't make any sense.
c
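As a minimal sketch of what "standardize" means here, the snippet below runs the same `StandardScaler` call on a small made-up array (the values are hypothetical stand-ins for `df`, not the course data). After scaling, every column should have a mean of about 0 and a standard deviation of about 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two columns on very different scales.
data = np.array([
    [10.0, 100.0],
    [20.0, 300.0],
    [30.0, 500.0],
    [40.0, 700.0],
])

scale = StandardScaler()
rfm_std = scale.fit_transform(data)  # returns a NumPy array

print(rfm_std.mean(axis=0))  # approximately 0 for every column
print(rfm_std.std(axis=0))   # approximately 1 for every column
```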
What type of algorithm would you use to segment customers into groups? Assume the groups ARE already labeled. a. Decision trees. b. Cluster regression. c. Random forest. d. Regression. e. All of the above.
e
In class we walked through steps to building a machine learning model. The textbook also goes over in some depth the steps. What are they? Second Step _________________
Choose Hyperparameters
In class we walked through steps to building a machine learning model. The textbook also goes over in some depth the steps. What are they? First Step __________
Choose a class of model
In class we walked through steps to building a machine learning model. The textbook also goes over in some depth the steps. What are they? Third Step __________
Arrange data
When looking at the code in the videos, we sometimes used a variable to hold our model. What is the significance of the word "model" in the below code? model = LinearRegression(fit_intercept=True) a. The word 'model' instantiates the method and calls the interpreter. Without this specific word, no model functions are available. b. Model is a named variable and is just holding our linear regression model. It could be renamed anything. The word itself is not important. It is just a container. c. The word 'model' calls the fit method. If another word is used in this example, Python will not understand that it is a model that can be run.
b
In class we walked through steps to building a machine learning model. The textbook also goes over in some depth the steps. What are they? Fourth Step __________
Fit the model
In class we walked through steps to building a machine learning model. The textbook also goes over in some depth the steps. What are they? Fifth Step ____________
Predict
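The five steps from these cards can be sketched in one short script. This is only an illustration on synthetic data (the numbers and the random seed are assumptions), with each step marked in a comment.

```python
import numpy as np
from sklearn.linear_model import LinearRegression  # 1. choose a class of model

model = LinearRegression(fit_intercept=True)        # 2. choose hyperparameters

# 3. arrange data into a features matrix X (2-D) and a target vector y (1-D)
rng = np.random.RandomState(42)
x = 10 * rng.rand(50)
y = 2 * x - 1 + rng.randn(50)  # synthetic data: true slope 2, intercept -1
X = x[:, np.newaxis]

model.fit(X, y)                                     # 4. fit the model

X_new = np.linspace(0, 10, 5)[:, np.newaxis]
y_new = model.predict(X_new)                        # 5. predict on new data
print(y_new)
```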
Pipelines are useful (in the analytics-with-Python sense) for which of the following reasons? (Choose all that apply) a. Pipelines make it very easy to change small things in your model, like which variables to include. b. Pipelines help organize the code you used to clean and treat your data. c. Pipelines make it easy to repeat/replicate steps and run multiple models. d. Pipelines automatically update to new versions of Python. e. Pipelines are good for moving data into your programming environment.
a, b, c
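A rough sketch of why pipelines make it easy to repeat steps and run multiple models: the same cleaning steps are bundled once and reused with two different estimators. The data and the helper name `build` are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data with one missing value in the second column.
X = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0], [5.0, 50.0]])
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

def build(model):
    """Bundle the same cleaning steps with whichever model is passed in."""
    return make_pipeline(SimpleImputer(strategy="median"), StandardScaler(), model)

predictions = {}
for model in (LinearRegression(), DecisionTreeRegressor(max_depth=2, random_state=0)):
    pipe = build(model).fit(X, y)
    predictions[type(model).__name__] = pipe.predict(X[:2])

print(predictions)
```

Swapping the model is a one-line change because the imputing and scaling live inside the pipeline rather than in separate cells.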
Imagine you have a dataset with 2 columns, both filled with continuous numbers. You believe the first column is a predictor of the second column. Which of the model approaches below *could work* when building a model? Choose all that apply. a. Random forest. b. Graphing. c. Running .describe and .info on the data d. Decision Trees e. Regression
a, d, e
All the nodes prior to the output nodes essentially 'guess' at the correct weights. Then the algorithm checks to see if the initial guess is correct (usually not). When it is wrong... a. It tries again (runs another epoch). b. Outputs the wrong value and waits for the user to correct it. c. Readjusts the input data (removes or adds inputs)
a
Decision trees are nice because they are fairly simple and straightforward to interpret. a. True b. False
a
In the attempt to fit values from the input layer to the output layer, the hidden layer applies some weights to the input values. a. True b. False
a
NLP stands for _________ a. Natural Language Processing b. Nothing: NLP is not a thing. c. Nothing Like Pizza d. No Language Processing
a
One big difference between the unsupervised approaches in this module, and the supervised approaches in prior modules: Unsupervised models do not have a target variable (Y). This makes it difficult to know when they are "right" or correct. a. True b. False
a
One weakness of cross-validation discussed is that information can sometimes _______ across different periods. A common situation in which this happens is when we are looking at stock data. a. Leak b. Overfit c. Not leak d. Underfit
a
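The card does not name a remedy, but one common option for time-ordered data such as stock prices is scikit-learn's `TimeSeriesSplit`, which only ever trains on periods that come before the test periods. A minimal sketch, assuming 12 consecutive periods of made-up data:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

prices = np.arange(12)  # stand-in for 12 consecutive periods, oldest first
splitter = TimeSeriesSplit(n_splits=3)

folds = []
for train_idx, test_idx in splitter.split(prices):
    folds.append((train_idx, test_idx))
    print("train:", train_idx, "test:", test_idx)
```

In every fold the training indices end before the test indices begin, so later periods cannot inform the model that is evaluated on them.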
The random forest algorithm prevents, or at least avoids to some extent, the problems with overfitting found in decision trees. a. True b. False
a
What is Scikit-learn? a. A machine learning package in Python that has built in machine learning algorithms we can use on our dataset. b. A machine learning model in Python that allows us to use various machine learning models on our dataset. c. An algorithm that sorts through a string, iterating one value at a time until it has reached the end of the string, at which point it returns the string ascending alphabetical order. d. The ideal model to use when we are in a situation where we want to separate our data into clusters.
a
What is the first variable in a decision tree called (before any of the branches)? a. Root b. Origin c. End Node d. Basis Function e. Terminal
a
When viewing a diagram of a neural network there are several layers. What is the input layer? a. These are the X's, or inputs from your data. b. These are the Y (the Target variable you are interested in) c. Something you don't see, here there is some computation to transform the X's into the Y.
a
Which is true about linear regression models? a. They are easy to interpret. b. They are always the best model to choose. c. They are the optimal choice of model in a situation where we have unlabeled data. d. We want them to completely explain our dataset.
a
Which of the following is a common use case for the random forest algorithm in machine learning? a. Classifying data into categories based on input features b. Predicting a continuous target variable based on input features. c. Clustering similar data points into groups. d. Finding the optimal hyperparameters for a model.
a
Which of the following is true about data validation and cross-validation in machine learning? a. Data validation and cross-validation are used to evaluate a model's performance and prevent overfitting. b. Data validation and cross-validation are the same thing. c. Data validation involves splitting the data into three sets: training, validation and testing. d. Cross-validation involves using the same data for both training and testing.
a
Which of the following statements best describes an ensemble method in machine learning? a. A technique that combines the results of multiple models to improve overall predictive accuracy. b. An algorithm that learns to find patterns and relationships in data without being explicitly programmed. c. A method that automatically groups similar data points into clusters based on their characteristics. d. A model that predicts the value of a dependent variable based on the values of one or more independent variables.
a
Imagine you have a dataset with the following columns (inputs) for a set of customers. Column 1 = Customer ID Column 2 = Distance to Store Column 3 = Yearly spend Column 4 = Likelihood to return (a survey response that indicates a customer is likely to shop again). What kind of approaches could you use to understand more about these customers? a. Regression- to understand the effect of one or more variables on the others. b. Clustering- to develop groups of customers that have similar patterns.
a, b
Deep Neural Networks have only 1 hidden layer and multiple input layers. a. True b. False
b
The nodes are joined by connections, and each of those connections has a(n) _______________. a. Silu b. Activation function c. Double-headed arrows d. Fit function
b
Hierarchical clustering is more powerful than K-means, as it allows the researcher to determine the exact number of clusters to use in the analysis. a. True b. False
b
Imagine X in the below is a missing value. If I were to run a median imputer on this set of data what would the returned value be? 50, 60, 70, 80, 100, 60, 5000, X a. 50 b. 70 c. 80 d. An error e. 100
b
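To check the arithmetic, sort the seven known values: 50, 60, 60, 70, 80, 100, 5000. The middle value is 70, and the 5000 outlier does not move it. A small sketch with scikit-learn's `SimpleImputer` (assuming the course's "median imputer" behaves the same way):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# X from the question is represented as np.nan; the imputer expects a 2-D array.
values = np.array([50, 60, 70, 80, 100, 60, 5000, np.nan]).reshape(-1, 1)

imputer = SimpleImputer(strategy="median")
filled = imputer.fit_transform(values)

print(imputer.statistics_[0])  # the learned median: 70.0
print(filled[-1, 0])           # the value written in place of X: 70.0
print(np.nanmean(values))      # the mean, for contrast, is pulled far up by 5000
```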
In K-Means clustering, the analyst does not need to determine the number of clusters (K), these are always derived analytically using the k-means algorithm. a. True b. False
b
In order to interpret Decision Trees it is necessary to first run a linear regression. a. True b. False
b
Models, such as the random forest model we ran, often have a number of parameters that the analyst can choose or set. What is the best source of up-to-date information about the different parameters that can be set? a. A textbook b. The scikit-learn documentation. c. Google or online forums d. A telex.
b
Neural Networks in computing are exactly the same as the neural networks from biology. a. True b. False
b
Neural networks are an unsupervised technique, because there is no target variable. a. True b. False
b
Random Forests can only be used on classification problems. a. True b. False
b
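As an illustration that random forests also handle regression, here is a short sketch using `RandomForestRegressor` on a synthetic continuous target (the sine-shaped data and the settings are assumptions made for the example).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.randn(200)  # continuous target, not classes

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)

y_hat = forest.predict(np.array([[2.5], [7.0]]))
print(y_hat)  # continuous predictions, roughly sin(2.5) and sin(7.0)
```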
Random forests are _______ interpretable than decision trees. a. More b. Less c. Just as
b
The LinearRegression estimator is only capable of simple straight line fits. a. True b. False
b
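One way to see why this is false: feed `LinearRegression` polynomial features and it fits a curve. The sketch below uses a noise-free parabola (made-up data) and compares a plain straight-line fit with a degree-2 pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

x = np.linspace(-3, 3, 30)
y = x ** 2                     # a curve, not a straight line
X = x[:, np.newaxis]

straight = LinearRegression().fit(X, y)
curved = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

print(straight.score(X, y))  # R-squared near 0: a line cannot follow the parabola
print(curved.score(X, y))    # R-squared near 1: same estimator, transformed inputs
```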
The correct number of clusters in Hierarchical clustering can be determined precisely using approaches such as silhouette scores. a. True b. False
b
The elbow method provides an exact number of clusters for a k-means algorithm. a. True b. False
b
We want the R-squared value for our regression model to be 100%. a. True b. False
b
What is the purpose of the code below? import matplotlib.pyplot as plt import seaborn as sns import numpy as np a. Create a chart of a linear regression b. Import python packages c. Nothing d. This is R code
b
When viewing a diagram of a neural network there are several layers. What is the output layer? a. These are the X's, or inputs from your data. b. These are the Y (the Target variable you are interested in) c. Something you don't see, here there is some computation to transform the X's into the Y.
b
Y and y-hat are a little different. Y is our target vector, and y-hat is an output in our model that is a(n)...... a. a combination of XY intercept coordinates. b. estimate or predictions of y. c. the actual value of y. d. an axis on our 2 way graph.
b
Recommenders come in many flavors. Two of the most common, often used together and discussed in the lecture, are: (choose two from below) a. Stock availability. b. User based. c. Item based. d. Algorithm oriented. e. Syntax dependent.
b, c
According to the documentation, a silhouette score of 1 is the ______ score and -1 is the _______ score.
best, worst
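A quick sketch of computing the score with scikit-learn's `silhouette_score`. The two well-separated blobs are synthetic, so the score here should land close to the "best" end of the range.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),   # blob around (0, 0)
    rng.normal(loc=10.0, scale=0.5, size=(50, 2)),  # blob around (10, 10)
])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
score = silhouette_score(X, labels)
print(score)  # always between -1 and 1; higher means tighter, better-separated clusters
```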
The example we walked through was from a fairly famous dataset for learning about machine learning. The dataset is called: a. TOASTER b. Handwriting c. MNIST d. LIME
c
Tokenization, as defined in the lecture, is _________ a. A mathematical operation on a set of python strings. b. Providing a placeholder, like the word "Accept" as a stem for a word like "Accepting." c. A computer turning letters and/or words into something it can read and understand, like numbers. d. Nothing- it's not a thing.
c
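A bare-bones illustration of the idea in plain Python. This is not the tokenizer used in the lecture; the function names `build_vocab` and `encode` and the sentences are hypothetical. Each distinct word gets an integer id, and unseen words map to 0.

```python
def build_vocab(sentences):
    """Give each distinct lowercase word an integer id, in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1  # 0 is reserved for unknown words
    return vocab

def encode(sentence, vocab):
    """Turn a sentence into a list of integer ids; unknown words become 0."""
    return [vocab.get(word, 0) for word in sentence.lower().split()]

corpus = ["I love my dog", "I love my cat"]
vocab = build_vocab(corpus)
encoded = encode("I love my hamster", vocab)

print(vocab)    # {'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}
print(encoded)  # [1, 2, 3, 0]
```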
When running our first decision tree, we took out "max_depth=". This has the unfortunate result of... a. Formatting the decision tree in max gray scale. b. Maximum likelihood scaling of the features. c. Building a very large hard to understand tree. d. Feature engineering the max_depth.
c
When viewing a diagram of a neural network there are several layers. What is the hidden layer? a. These are the X's, or inputs from your data. b. These are the Y (the Target variable you are interested in) c. Something you don't see, here there is some computation to transform the X's into the Y.
c
Which of the following best describes supervised learning? a. A machine learning approach where the algorithm learns to optimize a performance metric by adjusting its internal parameters. b. A machine learning approach where the algorithm automatically groups similar data points into clusters. c. A machine learning approach where the algorithm receives labeled data and learns to map inputs to outputs based on those labels. d. A machine learning approach where the algorithm learns to find patterns and relationships in data without being explicitly programmed.
c
Which of the following best describes the difference between a supervised and an unsupervised learning task in machine learning? a. A supervised learning task is faster and more efficient than an unsupervised learning task. b. A supervised learning task can handle both numerical and categorical data, while an unsupervised learning task can only handle numerical data. c. A supervised learning task requires labeled data, while an unsupervised task does not. d. A supervised learning task involves clustering data into groups, while an unsupervised learning task involves predicting a target variable.
c
Which of the following is a potential benefit of using decision trees in machine learning? a. Easy to overfit the data b. Great at predicting future data c. Can handle both numerical and categorical data. d. Can only handle numerical data
c
Which of the following machine learning models utilizes supervised learning? a. K-means b. Hierarchical clustering c. Regression d. Precipitous algorithm
c
An example this week was done in a Jupyter-like environment called Google Colab. What was the language that was demonstrated in the videos? (One cool thing about this is that it looks just like any other package! Installing this on your own is tricky.) a. R b. Java c. C d. TensorFlow
d
Decision trees have a few problems, you should probably review those for the final exam! The problem we talked about most is: a. Poor time management b. Under modeling c. Lack of sophistication d. Overfitting
d
In which of these situations would you want to use a clustering algorithm? a. You were given the financial data for the Federal Reserve of New York in 2023 and want to determine where the discrepancy in accumulated depreciation came from before you submit the financial statements. b. You have a dataset containing past crimes of current defendants and you want to determine the likelihood that they will commit another crime. c. You have a dataset containing 2023 Charlotte, NC housing data and you want to predict 2024 housing prices. d. You have a dataset for Cheesecake Factory and you want to look at customer spending at the restaurant in order to find patterns among customers who share similar characteristics.
d
What is the role of cluster centers in clustering, and how are they determined during the algorithm? a. Cluster centers are the mean or median value of all data points in a cluster and they are determined through a distance metric such as Euclidean distance. b. Cluster centers are the outliers or anomalies in the dataset, and they are determined through a clustering technique known as DBSCAN. c. Cluster centers are the number of clusters chosen prior to running the algorithm and they are determined based on the desired output of the clustering. d. Cluster centers are the initial data points chosen randomly to begin clustering and they are updated iteratively to minimize the within-cluster sum of squares.
d
What is the terminal node as discussed in the lecture? a. The first, or initiation, node in a complex decision tree. b. The income less the age of our sample. c. The node that results in a terminal error. d. The last node (sometimes called a leaf if you google the term). The tree doesn't split after this.
d
Which of the following statements best describes classification in machine learning? a. A type of supervised learning where the goal is to predict a continuous target based on input features. b. A type of reinforcement learning where the goal is to learn an optimal policy for making decisions in an environment. c. A type of unsupervised learning where the goal is to group similar data points into clusters. d. A type of supervised learning where the goal is to assign input data points to predefined categories or classes.
d
Your dataset consists of details about customer traits, such as "number of items in the basket at checkout" and "time of day of checkout." Your task is to group customers that are like each other together. You don't already have labeled customer types. What kind of model are you building? a. Cereal Identifier Model b. Stochastic Fit Model c. Supervised model (such as classification) d. Unsupervised model (such as K-means)
d
The basic idea of a regression is very simple. We have some X values (we called these ________) and some Y values (this is the variable we are trying to _____). We could have multiple Y values, but that is not something we have covered.
features, predict
What is a good model fit value? a. R-squared of .8 b. 99% accurate. c. 95% accurate. d. R-squared of p-value minus .05 e. R-squared of .4 f. R-squared of .95 g. Unknowable without knowing/understanding the context and the domain.
g
This should look familiar: "We will start with the most familiar linear regression, a straight-line fit to data. A straight-line fit is a model of the form y=ax+b" Where 'b' is commonly known as the _______.
intercept
One problem with decision trees is that they are prone to ___________ if you are not careful or do not set the _________ appropriately.
overfitting, Max Depth
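A small sketch of the overfitting problem on synthetic, deliberately noisy labels (the data and seed are assumptions). Without `max_depth`, the tree keeps splitting until it memorizes the training set; capping the depth keeps it small.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 2)
# Labels depend on the first feature plus noise, so no simple rule is perfect.
y = (X[:, 0] + 0.3 * rng.randn(200) > 0.5).astype(int)

full = DecisionTreeClassifier(random_state=0).fit(X, y)                 # no depth limit
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

print(full.get_depth(), full.score(X, y))        # deep tree, perfect training accuracy
print(shallow.get_depth(), shallow.score(X, y))  # at most 3 levels, lower training accuracy
```

Perfect accuracy on the training data is the warning sign here: the unrestricted tree has fit the noise, which usually hurts on new data.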
In K-means, the algorithm has multiple iterations. If we have a simple 2d problem, and k=2, it begins by assigning the first centroids to a(n) __________________, and then ___________________ of each point or record to the centroid.
random initial starting point, measuring the distance
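The iterations described in this card can be sketched by hand in NumPy. This is a simplified illustration on synthetic 2-D data with k=2, not scikit-learn's implementation: start from random points, measure distances, assign, move the centroids, and repeat.

```python
import numpy as np

rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),  # blob around (0, 0)
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),  # blob around (5, 5)
])

k = 2
# Start from k randomly chosen records as the initial centroids.
centroids = X[rng.choice(len(X), size=k, replace=False)]

for _ in range(20):
    # Measure the distance from every point to every centroid.
    distances = np.linalg.norm(X[:, np.newaxis, :] - centroids[np.newaxis, :, :], axis=2)
    # Assign each point to its nearest centroid.
    labels = distances.argmin(axis=1)
    # Move each centroid to the mean of the points assigned to it.
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centroids, centroids):
        break  # assignments have stopped changing
    centroids = new_centroids

print(centroids)
```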
This should look familiar: "We will start with the most familiar linear regression, a straight-line fit to data. A straight-line fit is a model of the form y=ax+b" Where 'a' is commonly known as the _______.
slope
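To connect y=ax+b to scikit-learn: after fitting, the slope `a` is stored in `coef_` and the intercept `b` in `intercept_`. A minimal sketch on made-up points that lie exactly on y=2x+1:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0  # points generated with a = 2 (slope) and b = 1 (intercept)

model = LinearRegression(fit_intercept=True).fit(x[:, np.newaxis], y)
a = model.coef_[0]    # slope
b = model.intercept_  # intercept
print(a, b)
```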