Intro Data Science Midterm 2 Preparation
Write R code to give the result of performing z-score normalization on a numeric vector x.
(x - mean(x))/sd(x)
What does the performance of a kNN classifier depend on?
- choice of k - choice of distance function - amount and quality of training data - the set of selected features
What are some reasons you want fewer features when creating a linear model?
- better explanatory value of the model - less data needs to be stored - extra features can add noise - some algorithms scale poorly with the number of features
What are 3 treatments for missing data?
1) Remove records with missing data 2) Replace missing data with values 3) Convert missing data to NA
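A minimal sketch of the three treatments on a small made-up vector. The marker value -999 and the choice of the mean as the replacement value are assumptions for illustration only.

```r
# Hypothetical raw data: -999 is assumed to be a "missing" marker
raw <- c(4, -999, 6, 8)

# 3) Convert the missing-data marker to NA
x <- ifelse(raw == -999, NA, raw)

# 1) Remove records with missing data
x_removed <- x[!is.na(x)]

# 2) Replace missing data with a value (here, the mean of the observed values)
x_imputed <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)
```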
Describe how kNN anomaly detection works. Be sure to include how k is used.
1. For each point x, calculate the distance to its kth nearest neighbor. 2. If the distance is greater than a specified threshold value, mark x as anomalous.
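The two steps can be sketched in base R as follows. The helper name knn_anomalies, the toy data, and the values of k and threshold are all made up for illustration; Euclidean distance is assumed.

```r
# Flag points whose distance to their kth nearest neighbor exceeds a threshold.
# knn_anomalies is an illustrative helper, not a function from any package.
knn_anomalies <- function(data, k, threshold) {
  d <- as.matrix(dist(data))   # pairwise Euclidean distances
  # sort each row; position 1 is the point itself (distance 0), so take k + 1
  kth_dist <- apply(d, 1, function(row) sort(row)[k + 1])
  kth_dist > threshold
}

pts <- matrix(c(1, 2, 3, 4, 50), ncol = 1)   # one-dimensional toy data
flags <- knn_anomalies(pts, k = 2, threshold = 5)
```

Here only the last point (50) has a 2nd nearest neighbor farther away than 5, so it is the only one flagged.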
How is forward selection implemented? (It is a greedy algorithm.)
1. Test each single feature to find the best one (for example, look for the lowest RMSE). 2. Then test each of the remaining features to find the next best one. 3. Iterate, stopping when adding a feature gives little or no improvement. *** This is not guaranteed to give the optimal feature set.
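A rough sketch of the greedy loop, using lm() on the built-in mtcars data. The names forward_select and min_gain are made up, and scoring by training RMSE is a simplification; scoring on held-out data would be the more careful choice.

```r
# Greedy forward selection for lm(), scored by training RMSE.
# forward_select and min_gain are illustrative names, not library functions.
rmse <- function(fit) sqrt(mean(residuals(fit)^2))

forward_select <- function(data, response, min_gain = 0.05) {
  remaining <- setdiff(names(data), response)
  chosen <- character(0)
  best <- rmse(lm(reformulate("1", response), data = data))  # intercept-only baseline
  while (length(remaining) > 0) {
    # score each remaining feature when added to the ones already chosen
    scores <- sapply(remaining, function(f)
      rmse(lm(reformulate(c(chosen, f), response), data = data)))
    if (best - min(scores) < min_gain) break   # little or no improvement: stop
    best <- min(scores)
    chosen <- c(chosen, names(which.min(scores)))
    remaining <- setdiff(remaining, chosen)
  }
  chosen
}

selected <- forward_select(mtcars[, c("mpg", "wt", "hp", "qsec", "drat")], "mpg")
```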
Explain the process of cross validation. Assume 10 folds.
1. Split the data into 10 folds and train using 9 of them (the blue folds). 2. Then validate using the remaining fold (the orange fold) and compute the error. 3. Repeat 9 more times, so that each fold is orange once. 4. Take the mean of the 10 errors that were calculated.
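A minimal sketch of 10-fold cross validation in base R. The model mpg ~ wt on the built-in mtcars data and the use of RMSE as the error are assumptions chosen just to have something to run.

```r
set.seed(1)
k <- 10
# assign each row a random fold label from 1 to 10
fold <- sample(rep(1:k, length.out = nrow(mtcars)))

errors <- sapply(1:k, function(i) {
  train <- mtcars[fold != i, ]   # the other 9 folds
  valid <- mtcars[fold == i, ]   # the held-out fold
  fit <- lm(mpg ~ wt, data = train)
  sqrt(mean((valid$mpg - predict(fit, valid))^2))   # validation RMSE
})

cv_error <- mean(errors)   # mean of the 10 errors
```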
Assume we have some labeled training data, and want to predict the value of a new example. How does kNN regression do this?
1. find the k nearest neighbors in the training data to the example 2. return the average value of the neighbors
Assume we have some labeled training data, and want to classify an unlabeled example. How does kNN classification do this?
1. find the k nearest neighbors in the training data to the example 2. return the most popular label of the neighbors
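Both kNN variants share the neighbor search and differ only in the last step. A base-R sketch, assuming Euclidean distance; knn_predict and the toy data are made up for illustration.

```r
# knn_predict is an illustrative helper, not a library function.
knn_predict <- function(train_x, train_y, new_x, k) {
  dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))  # distance to each training row
  nearest <- train_y[order(dists)[1:k]]               # labels/values of the k nearest
  if (is.numeric(nearest)) {
    mean(nearest)                        # regression: average value
  } else {
    names(which.max(table(nearest)))     # classification: most popular label
  }
}

train_x <- matrix(c(1, 1,  1, 2,  2, 1,  8, 8,  9, 8), ncol = 2, byrow = TRUE)
labels  <- c("a", "a", "a", "b", "b")
values  <- c(10, 12, 14, 50, 52)

pred_label <- knn_predict(train_x, labels, c(1.5, 1.5), k = 3)   # classification
pred_value <- knn_predict(train_x, values, c(1.5, 1.5), k = 3)   # regression
```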
What are 3 things you want to do with an lm model?
1. see the model parameters (the coefficients) 2. make predictions 3. see how well the model fits the data
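One way to do each of the three in base R, sketched on the built-in mtcars data. The model mpg ~ wt + hp and the new example are arbitrary choices.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

params <- coef(fit)                                  # 1. model parameters (coefficients)
pred <- predict(fit, data.frame(wt = 3, hp = 110))   # 2. prediction for a new example
r2 <- summary(fit)$r.squared                         # 3. one measure of fit (R-squared)
```

summary(fit) on its own prints the coefficients, their standard errors, and the fit statistics together.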
What is mean(c(1, 2, 3, NA), na.rm=TRUE) ?
2
Some machine learning methods require that attributes be numeric. Suppose the value of predictor 'role' can be 'admin', 'instructor', or 'student'. How many indicator variables would you use to encode 'role' numerically?
2. For example: isAdmin (0 or 1), isInstructor (0 or 1). Remember: n - 1 indicator variables for n categories.
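A quick way to see the n - 1 encoding in R is model.matrix(), which builds the numeric matrix a model would use. The four-element role vector is made up. Note that R picks the first level ('admin') as the baseline by default, so its indicator columns are for 'instructor' and 'student' rather than the isAdmin/isInstructor pair in the answer; either choice uses 2 indicators.

```r
role <- factor(c("admin", "instructor", "student", "student"))

# intercept column plus n - 1 = 2 indicator columns for the 3 levels
m <- model.matrix(~ role)
indicator_names <- colnames(m)
```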
Density plots are suitable for plotting how many variables? A. 1 B. 2 C. 3 or more
A
How do you assess the quality of a classifier?
error rate = #incorrect predictions / # of predictions
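In R the error rate is a one-liner once you have predicted and actual labels. The two label vectors below are made up.

```r
actual    <- c("spam", "ham", "ham", "spam", "ham")
predicted <- c("spam", "ham", "spam", "spam", "ham")

error_rate <- mean(predicted != actual)    # incorrect predictions / predictions
confusion <- table(predicted, actual)      # confusion matrix for more detail
```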
What typically happens when you reduce the complexity of a model? a) an increase in bias, b) an increase in variance, or c) both.
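Answer: a. Bias concerns the error that comes from a model being too simple to capture the true relationship, so a less complex model typically has higher bias (and lower variance). Example of a less flexible model: a linear model. Example of a more flexible model: nearest neighbors.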
For the purpose of creating a model that is easy to interpret, should I prefer a) a more flexible model or b) a less flexible model?
Answer: b. A less flexible model will generally be simpler and have fewer parameters. Example of a less flexible model: a linear model. Example of a more flexible model: nearest neighbors.
I run a bunch of experiments. In each experiment I flip 4 coins and count the number of heads. What is the size of the sample space? a) 4 b) 5 c) 2^4
Answer: b. The outcome of an experiment is the number of heads in 4 flips. The possible outcomes are 0, 1, 2, 3, 4.
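A small simulation makes the 5 outcomes visible. Fair coins and 10000 experiments are assumptions for the sketch.

```r
set.seed(42)
# each experiment: flip 4 fair coins (0 = tails, 1 = heads) and count heads
heads <- replicate(10000, sum(sample(c(0, 1), 4, replace = TRUE)))

outcomes <- sort(unique(heads))                # observed outcomes
probs <- dbinom(0:4, size = 4, prob = 0.5)     # probability of each outcome
```

There are 2^4 = 16 equally likely flip sequences, but they collapse into only 5 distinct head counts, which is why the outcomes are not equally likely.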
What is considered "low-value data"?
Attributes that are not meaningful to the analysis. Example: a column that has the same value for every row. Treatment: remove or filter them.
In classification what leads to good performance ?
few misclassified test examples
What does a PDF (probability density function) apply to?
Continuous random variables only
Repeatability and Documentation
Document all sources of raw data. Avoid any manual steps in your process. Your results should be replicable.
Explain how training and prediction work in kNN classification.
During training, you just remember all of the training data. During prediction, the classifier finds the k closest training points to the new example and predicts the most common label among them. Euclidean distance is often used to measure the distance between points.
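In this linear model, what does "*" do? fit = lm(prp ~ mmax * cpm * cach, data=machine)
It includes each predictor on its own plus the interaction terms for all combinations of them: mmax, cpm, cach, mmax:cpm, mmax:cach, cpm:cach, and mmax:cpm:cach.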
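One way to check what a formula expands to is base R's terms(), which works on the formula alone, so the machine data frame is not needed for this sketch.

```r
# list the terms that the "*" shorthand expands to
star_terms <- attr(terms(prp ~ mmax * cpm * cach), "term.labels")
star_terms
```

This gives seven terms: the three predictors on their own, the three two-way interactions, and the one three-way interaction.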
Assume you want to use linear regression with 3 numeric predictors. How many coefficients will linear regression estimate?
Four: one for each predictor, plus one more (the "intercept"). In the case of only one predictor, we have "simple linear regression". Then the coefficients are the slope and intercept of a line.
In logistic regression how do you find the best fit ?
In logistic regression, the best fit is found using the maximum likelihood principle.
In linear regression, does it matter whether we choose to minimize the RSS or the MSE (Mean Squared Error)?
No. The linear coefficients that minimize the RSS will also minimize the MSE. Similarly you could choose to minimize the RMSE.
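The reason is that the three quantities are tied together by fixed transformations, as this sketch on the built-in mtcars data shows (the model mpg ~ wt is an arbitrary choice).

```r
fit <- lm(mpg ~ wt, data = mtcars)
res <- residuals(fit)
n <- length(res)

rss  <- sum(res^2)
mse  <- rss / n      # MSE is RSS scaled by the positive constant 1/n
rmse <- sqrt(mse)    # RMSE is an increasing function of MSE
```

Scaling by a positive constant or taking a square root does not change which coefficients give the smallest value.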
What is sum(c(1,2,NA))?
NA
What R function is used for logistic regression?
glm()
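A minimal glm() call, assuming a 0/1 response. The model am ~ wt on the built-in mtcars data is just an example; family = binomial is what selects logistic regression.

```r
fit <- glm(am ~ wt, data = mtcars, family = binomial)

coefs <- coef(fit)
# type = "response" returns a predicted probability rather than log-odds
p <- predict(fit, data.frame(wt = 3), type = "response")
```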
What are some concerns you'd have about data quality when performing data acquisition?
Is the data recent, reliable, and relevant? Other concerns: privacy issues and having enough data.
What does RSS stand for ?
Residual sum of squares
The true relationship between the _______________ variable y and the ____________ variables x_1, x_2, ..., x_n is: y = β_0 + β_1 x_1 + ... + β_n x_n + ε, where ε is the "error term". Is the error term dependent on the x variables?
Response; predictor. No: the error term is assumed to be independent of the x variables.
What are 2 data exploration functions in R that give an overview of your data?
str() and summary()
Is anomaly detection an example of supervised or unsupervised learning?
Unsupervised. If it were supervised, we would get training examples labelled with "anomaly" or "okay".
Does a CDF apply to both continuous and discrete random variables? T/F
True
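R's naming reflects this: every built-in distribution has a p-function for the CDF, while the d-function is a PMF for discrete distributions and a PDF for continuous ones. A short sketch, with the binomial (4 fair flips) and standard normal chosen as examples.

```r
cdf_discrete   <- pbinom(2, size = 4, prob = 0.5)  # P(X <= 2), heads in 4 fair flips
cdf_continuous <- pnorm(0)                         # P(Z <= 0), standard normal

pmf_value <- dbinom(2, size = 4, prob = 0.5)  # PMF (discrete): P(X = 2)
pdf_value <- dnorm(0)                         # PDF (continuous): a density, not a probability
```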
What is this scale called? scale(x) = (x - min(x)) / (max(x) - min(x))
Unit-interval scaling or 0-1 scaling
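As R code, a sketch of the same formula. The name unit_scale is made up, and the function assumes x has no NA values and at least two distinct values.

```r
unit_scale <- function(x) (x - min(x)) / (max(x) - min(x))

scaled <- unit_scale(c(2, 4, 6, 10))   # smallest value maps to 0, largest to 1
```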
Can categorical variables be used directly with linear regression in R?
Yes - if they are factors. They will be converted to indicator variables automatically.
Describe the difference between how kNN classification and regression work.
Classification: find the k nearest neighbors to the given point; return the most popular label of the neighbors. Regression: find the k nearest neighbors to the given point; return the average label value of the neighbors.
In data frame 'dat', each row represents a dog, and column "breed" gives the breed of each dog. (Only breeds "dachsund", "terrier", and "bulldog" are in the data set). Write R code to plot the number of dogs of each breed. Use "Number of each breed" as the title.
barplot(table(dat$breed), main="Number of each breed")
What range does the logistic function output ?
between 0 and 1
What is it called when you convert numeric data to categorical data ?
binning
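In R, binning is usually done with cut(). The ages, bin edges, and labels below are made up for illustration.

```r
age <- c(5, 17, 30, 64, 80)

# bins are (0, 18], (18, 65], and (65, Inf]
age_group <- cut(age, breaks = c(0, 18, 65, Inf),
                 labels = c("child", "adult", "senior"))
counts <- table(age_group)
```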
Is a spam-detection algorithm an example of classification or regression?
classification
What does a PMF (probability mass function) apply to?
Discrete random variables only
What are some data exploration examples that use a single variable ?
histograms, density plots, bar plots
Write an R expression that computes 0 if vector x contains no duplicates, and computes 1 otherwise.
ifelse(length(x) == length(unique(x)), 0, 1)
What are the hyperparameters for kNN classification?
k and the distance function. ** Remember: k and the distance function are chosen by you; they are not set by R.
What are the hyperparameters for kNN anomaly detection ?
k, distance function, threshold
Write R code to compute the length of the string obtained by appending strings s1 and s2, with a space between them.
nchar(paste(s1,s2))
Let p be a number and x be a numeric vector. Write R code to give the indexes of the 3 elements of x having values closest (in absolute terms) to p.
order(abs(x - p))[1:3]
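A worked example with made-up numbers, to see how order() returns positions rather than values.

```r
x <- c(10, 2, 7, 5, 3)
p <- 4

idx <- order(abs(x - p))[1:3]   # positions of the 3 values closest to p
closest <- x[idx]               # the values at those positions
```

The absolute differences are 6, 2, 3, 1, 1, so the closest values sit at positions 4, 5, and 2 (ties keep their original order).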
What are some data exploration examples that use two variables?
scatterplots and their interpretation, boxplots
In Regression what leads to good performance ?
small total difference between actual and predicted values.
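In this linear model, what does ":" do? fit = lm(prp ~ cach + cach:mmax, data=machine)
It adds one specific interaction term, the product of the two predictors (here cach times mmax), without adding mmax on its own.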
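The difference between ":" and "*" can be checked with base R's terms(), which only needs the formula, not the machine data frame.

```r
colon_terms <- attr(terms(prp ~ cach + cach:mmax), "term.labels")

# compare with the "*" shorthand, which also adds mmax on its own
star_terms <- attr(terms(prp ~ cach * mmax), "term.labels")
```

The first formula has two terms (cach and cach:mmax); the second has three because it also includes mmax.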
Is a nearest-neighbor classifier an example of supervised or unsupervised learning?
supervised
What do you use to build a classifier ?
training data
When you roll two dice the sum of the two values ranges from 2 to 12. Write code to assign to vector y the sum of two dice rolled 1000 times. In other words, the vector y should contain 1000 values. (assignment to y)
x1 = sample(1:6, 1000, replace=TRUE); x2 = sample(1:6, 1000, replace=TRUE); y = x1 + x2
Compute a vector consisting of the elements of a vector x that are negative.
x[x < 0]