CSE 144 Week 1 Quiz Questions
What is the equation and range for z-score normalization?
$z = (x - \mu) / \sigma$; range: unbounded in principle, but for roughly normal data about 95% of values fall within [-2, 2]
What is the equation and range for min-max normalization?
(x - min) / (max - min), range: [0, 1]
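As a rough illustration of the two normalization formulas above, here is a minimal sketch using only the Python standard library. The function names `z_score` and `min_max`, the sample data, and the choice of the population standard deviation are assumptions made for this example, not something the quiz specifies.

```python
from statistics import mean, pstdev


def z_score(values):
    """Standardize: subtract the mean, then divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    return [(x - mu) / sigma for x in values]


def min_max(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]


data = [2.0, 4.0, 6.0, 8.0]  # made-up sample values
print(z_score(data))
print(min_max(data))
```

After z-score normalization the values have mean 0 and standard deviation 1; after min-max normalization they lie in [0, 1].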
What should be minimized in the regression hypothesis?
Mean of squared errors
In which kind of machine learning is the data presented with labels (each data point tagged with the correct label)? a. Supervised Learning b. Unsupervised Learning c. Reinforcement Learning d. Learning and Techniques
a. Supervised Learning
Give an example of 3 different ways to feature cross the features A and B. The vocabulary of A is {a1, a2, a3} and the vocabulary of B is {b1, b2, b3}.
1) {a1 + b1, a2 + b2, a3 + b3} 2) {a1 * b1, a2 * b2, a3 * b3} 3) {a1 - b1, a2 - b2, a3 - b3}
What is regression used for?
1. Prediction 2. Estimation 3. Hypothesis testing 4. Modeling causal relationships
What three components does a machine learning algorithm have?
1. Representation 2. Evaluation 3. Optimization
What are the 4 types of data used with ML?
1. Time series 2. Spatial 3. Textual 4. image/video
Given data X = [[0, 2, 4, 6], [1, 3, 5, 7]], what is $X_2^1$ (subscript 2, superscript 1)?
4
How do we know the model has been over-trained?
After training the model, if the validation data accuracy isn't increasing while the training data accuracy is, then the model has been over-trained.
What is the difference between an independent variable and a dependent variable?
Independent variables are the inputs of the system and can freely take on any value, whereas dependent variables change in response to changes in the independent variables or to other changes in the system.
Compute the Mean Squared Error for the following data points assuming the regression line is Y = X. Point 1 (2, 3) Point 2 (10, 8) Point 3 (12, 13) Point 4 (14, 14)
Answer is 1.5: ((2 − 3)^2 + (10 − 8)^2 + (12 − 13)^2 + (14 − 14)^2) / 4 = (1 + 4 + 1 + 0) / 4 = 6/4 = 1.5
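The arithmetic above can be checked with a short sketch. The helper name `mean_squared_error` and its `predict` argument are hypothetical names chosen for this example; the points and the line Y = X come from the question.

```python
def mean_squared_error(points, predict):
    """Average of squared differences between observed y and predicted y."""
    squared_errors = [(y - predict(x)) ** 2 for x, y in points]
    return sum(squared_errors) / len(squared_errors)


# (x, y) pairs from the question; the regression line is Y = X.
points = [(2, 3), (10, 8), (12, 13), (14, 14)]
mse = mean_squared_error(points, lambda x: x)
print(mse)  # 1.5
```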
Which of the following are ways to handle extreme outliers? • Leave the outlier as is since it is part of the overall data • Change the scale (log / exponential) • Cap the data • Change the value of the outlier to match the rest of the data • Remove or change the outlier during post-testing
Change the scale (log / exponential) • Cap the data
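A minimal sketch of the two accepted strategies, assuming made-up values and hypothetical helper names (`log_scale`, `cap`). The use of `log1p` and the specific cap bounds are illustrative choices, not something the quiz prescribes.

```python
import math


def log_scale(values):
    """Compress large values by taking log(1 + v); assumes v >= 0."""
    return [math.log1p(v) for v in values]


def cap(values, lower, upper):
    """Clip every value into the interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


values = [1.0, 3.2, 5.5, 8.5, 78.0]  # made-up data with one extreme value
print(log_scale(values))
print(cap(values, 1.0, 8.5))
```

Both approaches keep the outlying example in the dataset while reducing how much it dominates the scale of the feature.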
What are the three types of optimization? Give an example of each.
Combinatorial: Greedy search. ● Convex: Gradient descent. ● Constrained: Linear programming
Which of the following best describes data integration? a. identifying and removing outliers b. combining data sources c. min-max normalization
Combining data sources
What are the four data preparation methods?
Data Cleaning • Data Integration • Data transformation • Data reduction (dimensionality reduction)
Classify the following data preparation tasks into their correct category. 1) Adjusting the magnitude of yearly incomes of all Californians to fit onto a 0 to 1 scale 2) Removing survey feedback where more than half of the questions are left unanswered 3) Choosing to focus only on stock prices on the last Friday of every month instead of the end of every trading session 4) Combining COVID vaccination data from all counties in California to determine the statewide vaccination rate
1) Data Transformation 2) Data Cleaning 3) Data Reduction 4) Data Integration
What is the purpose of data cleaning? a. Data cleaning is the process of fixing or removing incorrect or duplicated data. It is implemented to increase the accuracy of machine learning. b. Data cleaning is the process of adding data to increase the accuracy of machine learning. c. Data cleaning is the process of editing correct data to change the accuracy of machine learning. d. Data cleaning is the process of removing data to make it simpler and faster to read.
Data cleaning is the process of fixing or removing incorrect or duplicated data. It is implemented to increase the accuracy of machine learning.
What are the major tasks in data preparation? Briefly describe each of them.
Data cleaning: Fill in missing values, smooth noisy data, identify and remove outliers. ● Data integration: Combine multiple sources of data. ● Data transformation: Min-max normalization, z-score normalization. ● Data reduction: Obtain reduced representation in volume but produce same or similar results.
What's the difference between dependent and independent variables?
Dependent variables are values that change as a consequence of changes in other values in the system; they are denoted by Y. • Independent variables are regarded as inputs to a system and may freely take on different values; they are denoted by X.
What are the other terms used to denote dependent and independent variables?
Dependent: response variable. Independent: predictor or explanatory variable.
What can be done about extreme outliers in data?
Either the regression scale can be changed to better incorporate the outlier or it can be removed from the data.
What is NOT an example of data that should be cleaned? a. Example URL b. Duplicate examples c. Bad labels d. Bad feature values
Example URL
True or False: Each data point in a training dataset can have only one feature.
False
Integration is one type of Evaluation in machine learning. True or False?
False
True or False: There can be as many dependent variables and independent variables as we need.
False. There is only one dependent variable, but there can be many independent variables.
What are feature crosses?
Feature crosses are new features created by combining multiple existing features in some way. For example, a price-per-square-foot feature could be generated from the price and square footage of houses.
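The price-per-square-foot example can be sketched as follows. The house records and the key names (`price`, `square_feet`, `price_per_sqft`) are made up for illustration.

```python
# Made-up house records; each dict holds two raw features.
houses = [
    {"price": 300_000, "square_feet": 1_500},
    {"price": 450_000, "square_feet": 1_800},
]

# Derive a new feature by combining the two existing ones.
for house in houses:
    house["price_per_sqft"] = house["price"] / house["square_feet"]

print([house["price_per_sqft"] for house in houses])  # [200.0, 250.0]
```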
Give 3 examples of problems that can be solved by regression analysis.
Financial forecasting (e.g., stock prices) • Sales forecasting • Time series forecasting • Weather analysis and prediction
Suppose every value of feature A in a dataset lies between 1.0 and 8.5, except one value of 78. What type of outlier is this?
Global outlier. A global outlier is a point that lies outside the distribution of the entire dataset.
When should you use normalization vs. standardization?
If your data is approximately normally distributed (forms a bell curve), use standardization (z-score); otherwise use min-max normalization. Normalization maps values into a fixed range but is highly sensitive to outliers, while standardization has no fixed range and is less affected by them.
What is the assumption in linear regression?
It assumes a straight-line (linear) relationship between the dependent variable and the independent variables. When the true relationship is not linear, the model gives misleading results; for example, the relationship between income and age is curved rather than a straight line.
Why is the error term squared when calculating the MSE of a linear regression?
It gets rid of negative terms and increases the punishment for outliers.
What is validation data used for, and why is it important to use it rather than just the testing data?
It is important to use validation data to check whether we have trained enough and can then move on to the test data, or if we have trained too much, and overfitted to our training data.
What happens when you over-train your model? a. It becomes muscular b. It overfits to the training dataset's noise and can lose patterns, making it less functional c. There is no such thing as overtraining your model; the more training it gets, the better it will become
It overfits to the training dataset's noise and can lose patterns, making it less functional
Describe the concept of Linear Regression, what does it do?
Linear regression is a statistical method that models the relationship between one dependent variable and one or more independent variables. It does this by fitting a line to the points in the dataset (it predicts a value; it does not classify points). The slope and y-intercept of the line are chosen to minimize the mean squared error between the line's predictions and the observed values.
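One common way to obtain the slope and y-intercept is ordinary least squares, which picks the line with the smallest mean squared error. The sketch below uses the closed-form solution for a single independent variable; the function name `fit_line` and the sample data are assumptions for illustration, and the course may use a different fitting procedure such as gradient descent.

```python
def fit_line(xs, ys):
    """Return (theta_0, theta_1) for the least-squares line y = theta_0 + theta_1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    theta_1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    theta_0 = y_mean - theta_1 * x_mean
    return theta_0, theta_1


xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]  # made-up points lying exactly on y = 1 + 2x
theta_0, theta_1 = fit_line(xs, ys)
print(theta_0, theta_1)
```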
What is the purpose of MSE and what is its formula?
Squaring makes every error value positive and increases the penalty for points that lie far from the line; taking the mean then gives the average squared error. $\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} e_i^2$
Give an example of 5 features that you would use in a model whose job was to classify whether an animal's profile was of a dog or a cat.
Nose size • Time spent around owner • Eye type • Food fed per day • Size
State whether the following values are categorical or numerical. 1) Life expectancy of different nations 2) Number of babies born each year with certain names 3) Zip codes in California 4) Names of most sold car models in the US
1) Numerical 2) Numerical 3) Categorical 4) Categorical
What are two reasons for scrubbing data?
Omitted values - Duplicate examples - Bad labels - Bad feature values
What is one-hot encoding? a. One-hot encoding is a process of converting the categorical data into numerical data that could be used for machine learning. b. One-hot encoding is a process of converting the numerical data into categorical data that could be used for machine learning. c. One-hot encoding is a process of deleting outliers. d. One-hot encoding is a process of finding outliers.
One-hot encoding is a process of converting the categorical data into numerical data that could be used for machine learning
What is One-hot encoding? Why it is good and why it is bad?
One-hot encoding converts the non-numeric categorical values in a column into numeric values: for each unique category, a new binary variable is added. It works well when the categories have no ordinal relationship, because it avoids implying an order that could cause poor performance or unexpected results. It works poorly when the vocabulary is large, since the one-hot vector becomes correspondingly large.
Which values count as outliers in a right-skewed distribution?
Values greater than Q3 + (3/2) * IQR, where IQR = Q3 − Q1
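A minimal sketch of the upper-fence rule, using `statistics.quantiles` from the standard library. The function name `upper_outliers` and the sample values are made up, and the quartile method (the library's default "exclusive" method) is an assumption; other quartile conventions give slightly different fences.

```python
from statistics import quantiles


def upper_outliers(values, k=1.5):
    """Return the values above the upper fence Q3 + k * IQR."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    fence = q3 + k * (q3 - q1)
    return [v for v in values if v > fence]


# Made-up feature values: most lie in [1.0, 8.5], one is extreme.
feature_a = [1.0, 2.5, 3.0, 4.2, 5.1, 6.3, 7.0, 8.5, 78.0]
print(upper_outliers(feature_a))  # [78.0]
```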
What are the three components of Machine Learning? a. Planning, Evaluation, Integration b. Taxation, Evaluation, Representation c. Representation, Evaluation, Optimization d. Supervised learning, Unsupervised learning, Semi-supervised learning
Representation, Evaluation, Optimization
What is the difference between supervised and unsupervised learning models?
Supervised learning models learn from labeled datasets, whereas unsupervised learning models work with unlabeled data and look for patterns or structure in it
How are data split in the stage of data pre-processing? a. Test data will be divided into a training set and validation set b. The validation set will be divided into a testing set and training set c. Test data and training data, training data will be divided into a training set and validation set d. The full dataset will be divided into a validation set and testing set
Test data and training data, training data will be divided into a training set and validation set
Which of the following is used to evaluate how accurately the machine can predict an outcome? a. Testing Dataset b. Training data c. Validation data
Testing Dataset
In linear regression, why does the data X have both a subscript and a superscript, while the corresponding labels Y have a superscript only?
Each label Y is one-dimensional, but each data point X can be multidimensional
Explain the difference between traditional programming and machine learning in regards to the data, program, output, and computer.
In traditional programming, the data and the program are fed into the computer, which produces the output. In machine learning, the data and the output are fed into the computer, which produces the program.
What will happen if we choose the model that best fits the training set rather than the validation set?
The model might overfit the training set and not generalize well enough to perform well on the test set.
Why would overtraining a model be detrimental, and what could be used to avoid doing so?
The model would start to capture qualities specific to the training data, making the model less accurate on the general data it was intended for. The validation dataset helps determine if the model has been over-trained.
When training a model, what is the difference between a validation set and training set?
The validation set is used in the process of training the model. The different iterations of training the model will be tested against and optimized for the validation set. In that sense, the model indirectly sees the validation set. In contrast, the testing set is only used at the end of the training process to test the model. The model never sees the test set before the very last test. Analogy: The training set is a textbook, the validation set is a practice final exam that you can retake as many times as you want, and the test set is the actual final exam.
Which of the following is not a reason for cleaning data? • It is an omitted value • The data is a duplicate • There is too much data already and I do not think I need it. • The data is an outlier in the whole data set.
There is too much data already and I do not think I need it.
What is the bias term in the regression hypothesis equation?
$\theta_0$
Say we have a categorical feature in a data set for Favorite Color. The only available colors are ["Red", "Blue", "Green"]. What is the problem with encoding these colors by giving each a number (i.e. "Red"→1, "Blue"→2, "Green"→3)?
This kind of encoding implies that each color has a different weight/importance. Red being 1 and Green being 3 could be interpreted as Green being 3 times as "big" or important as Red. However, color is not a quantifiable value (at least in this instance), there is no "greater" color. This encoding gives the colors a weight that shouldn't exist.
A video is an example of what type of data? a. Time Series b. Spatial c. Textual d. Natural
Time Series
Assign the following data into the correct ML data type category (time series, spatial, image and video, textual). 1) Apple stock price since 2010 2) Average rent prices in different neighborhoods in Santa Cruz, CA 3) Collection of New York Times articles about Russia 4) Security camera footage at a bank from September 23, 2022
1) Time series 2) Spatial 3) Textual 4) Image and video, and time series
What is one way you can encode the following vocabulary using one-hot encoding? voc => ['c', 's', 'e']
Represent each value as a vector with one slot per vocabulary entry: set the slot for that value to 1 and keep every other slot at 0. voc => c = [1, 0, 0], s = [0, 1, 0], e = [0, 0, 1]
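The encoding above can be sketched in a few lines. The function name `one_hot` is a hypothetical helper, and rejecting duplicate or unknown entries is an assumption about how a vocabulary should behave.

```python
def one_hot(value, vocabulary):
    """Return a list with 1 in the slot for `value` and 0 everywhere else."""
    if len(set(vocabulary)) != len(vocabulary):
        raise ValueError("vocabulary entries must be unique")
    if value not in vocabulary:
        raise ValueError(f"{value!r} is not in the vocabulary")
    return [1 if entry == value else 0 for entry in vocabulary]


voc = ["c", "s", "e"]
for v in voc:
    print(v, one_hot(v, voc))
# c [1, 0, 0]
# s [0, 1, 0]
# e [0, 0, 1]
```

The length of each vector equals the size of the vocabulary, which is why one-hot encoding becomes costly for large vocabularies.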
Roughly what percentage of the total data set should training, validation, and testing sets each make up?
Training set ~70%, validation set ~20%, testing set ~10%
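A minimal sketch of a 70/20/10 split using only the standard library. The function name `split_dataset`, the shuffle-then-slice approach, and the fixed seed are assumptions for illustration; real projects often use a library helper instead.

```python
import random


def split_dataset(examples, train=0.7, validation=0.2, seed=0):
    """Shuffle the examples, then slice them into train/validation/test lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    train_set = shuffled[:n_train]
    val_set = shuffled[n_train:n_train + n_val]
    test_set = shuffled[n_train + n_val:]  # whatever remains (~10%)
    return train_set, val_set, test_set


train_set, val_set, test_set = split_dataset(range(100))
print(len(train_set), len(val_set), len(test_set))  # 70 20 10
```

The test set is set aside once and not used until the final evaluation.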
Constrained Optimization is one type of Optimization in machine learning. True or False?
True
Decision Trees are one type of Representation in machine learning. True or False?
True
True or False: The label corresponding to each data point is a single value, while the inputs can be multidimensional.
True
Assign the following learning examples into the correct category (supervised, unsupervised, semi-supervised, reinforcement, self-supervised). 1) Unlabeled images of different types of flowers 2) An algorithm given a success cue whenever it correctly identifies a dog 3) 25% of the input set is labeled 4) Labeled images of road bikes and mountain bikes
1) Unsupervised 2) Reinforcement 3) Semi-supervised 4) Supervised
Training a model using unlabeled data to find hidden structure is called? a. Unsupervised Learning b. Supervised Learning c. Semi-supervised Learning d. Regression
Unsupervised Learning
Describe how datasets are divided and the role of each part, including sizing.
Using the full dataset, it is divided into a training set and a testing set. The training set is then partitioned into two more sets, a training set and a validation set. The training set is used to train the model, and the validation set is used to make sure the model is good enough, tweaking it to improve on the validation set. Then it is finally tested on the testing set. Validation set is used to find models that are good enough for the testing set.
What's the difference between the Validation set and the Test set?
Validation set is the sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters. Test set is the sample of data used to provide an unbiased evaluation of a final model fit on the training dataset.
What is the difference between a validation set and a testing set?
Validation set is used as a "pseudo-testing" set to tell the model when to stop training to avoid overfitting. A testing set is used to evaluate a model's performance.
In the in-class exercise, why do we use a one-hot encoding of city_num instead of the variable city_num itself?
We don't want the model to think the magnitude of city_num matters.
When do we use One-hot encoding?
When the categorical feature is not ordinal, or when the number of categories is small enough that one-hot encoding can be applied effectively while building the model.
Fill in what each variable means: $Y = \theta_0 + \theta_1 X$
Y = dependent/outcome/response variable; X = independent/predictor/explanatory variable; $\theta_0$ = Y-intercept / baseline / bias; $\theta_1$ = slope of the line / weight
Create a template first order linear model equation for a model that has 2 properties (A and B)
$Y = \theta_0 + \theta_1 A + \theta_2 B$
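To make the template concrete, here is a small sketch that evaluates a first-order linear model for two features. The function name `predict` and the parameter values are made up for illustration; in practice the parameters would be learned from data.

```python
def predict(theta, features):
    """Evaluate theta_0 + theta_1 * x_1 + theta_2 * x_2 + ... for one example."""
    theta_0, *weights = theta  # first entry is the bias, the rest are weights
    return theta_0 + sum(w * x for w, x in zip(weights, features))


theta = [1.0, 2.0, 0.5]  # hypothetical [bias, weight for A, weight for B]
print(predict(theta, [3.0, 4.0]))  # 1.0 + 2.0*3.0 + 0.5*4.0 = 9.0
```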
What is the equation for the First Order Linear Model? List the meaning of each variable.
$Y = \theta_0 + \theta_1 X$, where Y = dependent variable, X = independent variable, $\theta_0$ = Y-intercept, $\theta_1$ = slope
When you want to train a model but only have one data set, what can you do? What is important to ensure when using this technique?
You want to divide the data up into two sets: test data and training data. The training data will be split up into the training set and the validation set. It is important to ensure that you do not train on the test data.
Given the raw data: 0: { Name: ID: 123 Fruit: Banana } What is the one-hot encoding for the fruit data if there are 4 fruits in the order {Apple, Banana, Orange, Grape}?
[0,1,0,0]
Select the invalid vocabulary. • [a, b, c] • [e, f, c] • [d, b, d]
[d, b, d]
Which of the following properties contribute to the characterization of a good feature? a. Has rarely-used discrete values b. Clear and obvious meaning c. Can change over time d. Doesn't have extreme values
b. Clear and Obvious Meaning d. Doesn't have extreme values
What type of variable is regarded as the input and what type is regarded as the output? a. The input is the dependent variable and the output is the independent variable b. The input is the independent variable and the output is the dependent variable
b. The input is the independent variable and the output is the dependent variable
Should you get rid of outliers in your dataset? a. No, outliers exist in real life so they should also be used in your model b. Yes, they're not representative of your data and so they shouldn't be used c. No, outliers help make the model more accurate
b. Yes, they're not representative of your data and so they shouldn't be used
What do the superscript i and subscript j represent in the training data $x_j^i$?
b. i represents different data points and j represents the different features
In regression, what is another name for the predictor variable? a. intercept b. independent c. outcome d. Dependent
b. independent
Which of the following is a form of evaluation (cost function) in a ML algorithm? a. Hyper Planes b. Greedy Search c. Mean Squared Error d. Gradient Descent
c. Mean Squared Error
What do we want to minimize to find a linear regression model? a. mean of errors b. mean of absolute of errors c. mean of square of errors
c. mean of square of errors
How is error in a linear regression typically calculated? a. relative error b. absolute error c. mean squared error d. percentage error
c. mean squared error
What is regression used for? a. Prediction b. Estimation c. Hypothesis Testing d. All of the Above
d. All of the Above
What are different types of outliers? a. global outliers/point anomalies b. contextual/conditional outliers c. collective outliers d. All of the above
d. All of the above
Which of the following are example(s) of data cleaning? a. Remove duplicated data b. Remove incorrect data c. Remove incomplete data d. All of the above
d. All of the above
In what type of learning does the training data include only a few desired outputs?
semi-supervised learning
What do the following functions do in the Pandas library in Python? (Matching problem: match the function with its definition): head, describe, hist, info, drop
head: shows the first few rows; describe: gives descriptive statistics such as the mean and standard deviation; hist: displays data in a histogram; info: lists general stats such as the number of entries and columns, data types, and memory usage; drop: removes the specified item (e.g. a column)
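A brief sketch of four of these calls on a tiny, made-up DataFrame (the column names and values are invented for illustration). `hist` is left as a comment because plotting additionally requires matplotlib to be installed.

```python
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame({
    "city": ["Santa Cruz", "San Jose", "Oakland"],
    "rent": [3100, 2900, 2700],
    "notes": ["a", "b", "c"],
})

print(df.head(2))      # first two rows
print(df.describe())   # count, mean, std, min, quartiles, max for numeric columns
df.info()              # entries, columns, dtypes, memory usage (prints directly)
trimmed = df.drop(columns=["notes"])  # returns a copy without the "notes" column
print(trimmed.columns.tolist())

# df.hist() would draw a histogram of each numeric column (needs matplotlib).
```

Note that `drop` returns a new DataFrame by default and leaves `df` unchanged.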
Explain the "roles" of $\theta_0$ and $\theta_1$ in the equation $Y = \theta_0 + \theta_1 A$, a first order linear model.
$\theta_0$ is the bias of the entire linear model. $\theta_1$ is the weight of the independent variable (predictor), or the way it affects the model
Select the vector for a in vocabulary [d, a, c] • [1, 0, 0] • [0, 0, 1] • [0, 1, 0]
• [0, 1, 0]
Name each type of data and give an example of each.
Time series: finances, clickstream, demand/sale forecast, natural phenomena, biological data ● Spatial data: advertising, cell phone and network data, VR gaming, space exploration ● Textual data: web, social media, legal documents, articles, debugging ● Image and video data: photographs, artworks