Data Mining Exam 1
Numerical nominal data conveys no mathematical meaning
True. Nominal data, even if it is numerical, has no mathematical meaning beyond the symbols used as its values. For example, a social security number is just a taxpayer's identifier and conveys no numerical/mathematical information, unlike, for example, the taxpayer's income.
The principal goal of classification is to develop a model h that is generalizable, which means that the model must
accurately predict the class of a new, previously unseen example
Classify the attributes as discrete or continuous, qualitative or quantitative, and as binary or nominal or ordinal or interval or ratio: Student grade point average
Continuous Quantitative Interval
Classify the attributes as discrete or continuous, qualitative or quantitative, and as binary or nominal or ordinal or interval or ratio: Temperature measured in Fahrenheit
Continuous Quantitative Interval
Classify the attributes as discrete or continuous, qualitative or quantitative, and as binary or nominal or ordinal or interval or ratio: Surface area
Continuous Quantitative Ratio
Clustering is a task of assigning objects to one of several predefined categories called classes or labels
False
Clustering is an example of supervised learning
False
We must always select the model with the smallest training error
False
Classification involves supervised learning
True
Generalization error is the sum of bias and variance
True
Ideally, a model should have the lowest bias and the lowest variance, but it is not possible to achieve both simultaneously.
True
Is monitoring the heart rate of a patient for abnormalities a data mining task?
True. In order to know what is an abnormal heart rate for a specific patient, we would have to have a model of normal/abnormal heart rates that also includes other health indicators. What is a normal rate for one patient may be dangerous for another.
During data preprocessing, it is possible that the number of attributes increases.
True It is possible that the number of attributes increases during data preprocessing. For example, if we want to perform a quadratic regression using a linear regression algorithm, then we may double the number of features per vector by including the original features as well as their squares
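A minimal sketch of the feature-expansion idea described above, using hypothetical two-feature vectors: augmenting each vector with the squares of its features doubles the attribute count.

```python
import numpy as np

# Hypothetical example: two original features per vector.
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# Append the square of each feature, doubling the
# number of attributes from 2 to 4 per vector.
X_quad = np.hstack([X, X ** 2])

print(X_quad.shape)  # (2, 4)
```

A linear regression algorithm run on `X_quad` can then fit a quadratic relationship in the original features.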
Is predicting the future stock of a company using historical records a data mining task?
True. We would attempt to create a model that can predict the continuous value of the stock price. This is an example of the area of data mining known as predictive modeling. We could use regression for this modeling, although researchers in many fields have developed a wide variety of techniques for predicting time series.
Construction of a decision tree is an example of supervised learning
True
Model tuning is done using
validation set
Given the following data and their corresponding scatter plot, classify the point (1,1) using 3-NN with a Euclidean distance.
Class +
The "no free lunch" theorem informs us that:
For every problem we must custom-build a new model to fit this specific problem
Training error of a model is the error committed on
training set data
Which of the following are properties of a metric d? - A metric d must be symmetric, i.e., d(x,y) = d(y,x) for all x,y. - A metric d must be nonnegative, i.e., d(x,y) >= 0 for all x,y. - A metric d must obey the triangle inequality, i.e., d(x,z) <= d(x,y) + d(y,z) for all x,y,z. - A metric d must obey the triangle equality, i.e., d(x,z) = d(x,y) + d(y,z) for all x,y,z. - A metric d must obey the equality property, i.e., if d(x,y) = d(x,z), then y = z for all x,y,z. - A metric d must be positive, i.e., d(x,y) > 0 for all x,y.
- A metric d must be symmetric, i.e., d(x,y) = d(y,x) for all x,y. - A metric d must be nonnegative, i.e., d(x,y) >= 0 for all x,y. - A metric d must obey the triangle inequality, i.e., d(x,z) <= d(x,y) + d(y,z) for all x,y,z.
Given two attributes: size = {small, medium, large, x-large} and price_level = {low, med, high}, and two data vectors: x = (small, high) and y = (x-large, med), find d(x,y) using the Euclidean distance. Round your answer to two decimal digits.
1.12 Since the attributes are ordinal, we first convert their values to integers: "size" = {0,1,2,3} and "price level" = {0,1,2}, and the data vectors become x = (0, 2) and y = (3, 1). Then we compute the normalized distances between the individual components: d(x1,y1) = d(small, x-large) = |0-3|/3 = 1 and d(x2,y2) = d(high, med) = |2-1|/2 = 1/2. Finally, the Euclidean distance between the data vectors is d(x, y) = sqrt(1^2 + (1/2)^2), which is about 1.12. In practice, we would begin by normalizing the values of the attributes to obtain "size" = {0, 1/3, 2/3, 1} and "price level" = {0, 1/2, 1}, so that x = (0, 1) and y = (1, 1/2), and then calculate the "ordinary" Euclidean distance d(x, y) = sqrt(1^2 + (1/2)^2), which is again about 1.12.
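The calculation can be sketched in Python; the dictionaries encoding the ordinal scales mirror the integer mapping used in the solution.

```python
import math

# Ordinal attribute values mapped to integers.
size = {"small": 0, "medium": 1, "large": 2, "x-large": 3}
price_level = {"low": 0, "med": 1, "high": 2}

def ordinal_distance(a, b, scale):
    """Normalized distance between two ordinal values: |a - b| / (n - 1)."""
    return abs(scale[a] - scale[b]) / (len(scale) - 1)

x = ("small", "high")
y = ("x-large", "med")

d1 = ordinal_distance(x[0], y[0], size)         # |0 - 3| / 3 = 1.0
d2 = ordinal_distance(x[1], y[1], price_level)  # |2 - 1| / 2 = 0.5

d = math.sqrt(d1 ** 2 + d2 ** 2)
print(round(d, 2))  # 1.12
```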
What is exploratory data analysis
A way by which we can get an initial feel for data. It includes both quantitative and qualitative analysis, and its role is to give us an initial "feel" for and understanding of the data we have. It may help with further preprocessing and selection of machine learning methods.
Which of the following is true about the k-NN classifier?
As k increases, the bias increases as well
Which of the following can be used to summarize nominal data? range stem-and-leaf plot mean median bar graph histogram pie chart box plot standard deviation mode percentile frequency distribution
Bar graph Pie chart Mode Frequency Distribution
Classifying an image of an animal as a cat or a dog is an example of
Binary classification
Which of the following is the best option to decrease the impact of the "curse of dimensionality" when using k-NN classifier?
Both dimensionality reduction and feature selection
Given the following data and their corresponding scatter plot, classify the point (-1,-1) using 4-NN with a Euclidean distance.
Cannot decide definitively
Classify the attributes as discrete or continuous, qualitative or quantitative, and as binary or nominal or ordinal or interval or ratio: Brightness as measured by people's judgements
Discrete Qualitative Ordinal
Classify the attributes as discrete or continuous, qualitative or quantitative, and as binary or nominal or ordinal or interval or ratio: Bronze, silver, and gold medals as awarded at the Olympics
Discrete Qualitative Ordinal
Classify the attributes as discrete or continuous, qualitative or quantitative, and as binary or nominal or ordinal or interval or ratio: Number of patients in a hospital
Discrete Quantitative Ratio
Classify the attributes as discrete or continuous, qualitative or quantitative, and as binary or nominal or ordinal or interval or ratio: ISBN numbers for books
Discrete Qualitative Nominal
Classify the attributes as discrete or continuous, qualitative or quantitative, and as binary or nominal or ordinal or interval or ratio: Military rank
Discrete Qualitative Ordinal
Classify the attributes as discrete or continuous, qualitative or quantitative, and as binary or nominal or ordinal or interval or ratio: Coat Check Number
Discrete Qualitative Nominal
Classify the attributes as discrete or continuous, qualitative or quantitative, and as binary or nominal or ordinal or interval or ratio: Time in Terms of AM or PM
Discrete, qualitative, binary
Every data proximity measure is also a metric
False. Not all proximity measures used in data mining are necessarily metrics. The triangle inequality is the property that is most likely to be missing, but sometimes a proximity measure may also be nonsymmetric. To envision a distance measure that is not symmetric, just think of finding the driving distance between two locations in a city with many one-way streets. However, if we use a proximity measure that satisfies all three properties of a metric, then many algorithms can be designed to be much more efficient. Therefore, whenever possible, we prefer to use measures that are metrics.
We can calculate the generalization error by using the test data
False It is impossible to calculate the generalization or "true" error since it is the error for the future, unseen examples
The model-building phase for the k-NN classifier is time-consuming, but once the model is ready, we can classify new records almost instantly.
False The k-NN classifier stores all the training records and in order to classify a new record, it must find the k closest records among all the training records. In other words, all the work is done when we have a new record that needs to be classified.
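A minimal k-NN sketch (hypothetical class and data) that illustrates this point: "fitting" is just storing the training records, while prediction scans every stored record.

```python
import numpy as np
from collections import Counter

class KNN:
    """Minimal k-NN sketch: all the work happens at prediction time."""
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):  # instant: nothing is learned, data is stored
        self.X, self.y = np.asarray(X, dtype=float), np.asarray(y)
        return self

    def predict(self, x):  # expensive: distance to every training record
        dists = np.linalg.norm(self.X - np.asarray(x, dtype=float), axis=1)
        nearest = np.argsort(dists)[: self.k]
        return Counter(self.y[nearest]).most_common(1)[0][0]

clf = KNN(k=3).fit([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6]],
                   ["a", "a", "a", "b", "b"])
print(clf.predict([0.5, 0.5]))  # "a"
```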
Test set is critical for model selection
False The test set can be used exclusively for reporting the estimated generalization error. For model selection, we find the error on the validation set.
The higher the value of k, the longer the training.
False The training time in k-NN algorithm is the same regardless of the value of k. We still must find the distances to all training examples
Is dividing the customers of a company according to their profitability a data mining task?
False This is an accounting calculation, followed by the application of a threshold. However, predicting the profitability of a new customer would be a data mining task.
Is sorting a student database based on student identification numbers a data mining task?
False. This is not a data mining task; this is a simple database query.
All elements of a Pandas data frame must have the same data type
False. The elements can have different data types. In fact, this is an advantage of a DataFrame over a NumPy array when the input data we need to load and preprocess consists of attributes with mixed data types. A DataFrame can hold mixed-attribute data in a single table, unlike a NumPy array, where all elements must be of the same type.
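A small illustration (hypothetical columns) of the contrast: each DataFrame column keeps its own dtype, while converting to a NumPy array forces one common dtype.

```python
import pandas as pd

# A DataFrame holds mixed attribute types, one dtype per column.
df = pd.DataFrame({
    "name": ["Ann", "Bob"],        # object (string)
    "age": [34, 29],               # int64
    "income": [52000.0, 48500.0],  # float64
    "member": [True, False],       # bool
})
print(df.dtypes)

# A NumPy array has a single dtype for all elements,
# so mixed columns are upcast to 'object'.
arr = df.to_numpy()
print(arr.dtype)  # object
```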
Which of the following can be used to summarize ordinal data? frequency distribution of the individual values mode standard deviation stem-and-leaf plot mean box plot bar graph range histogram median percentiles pie chart of the individual values
Frequency distribution of individual values mode box plot bar graph range median percentiles pie chart of the individual values
Which of the following distance measures should be used when applying k-NN to categorical data?
Hamming distance
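A sketch of the Hamming distance on hypothetical categorical records: it simply counts the attribute positions at which two records differ.

```python
def hamming(x, y):
    """Number of positions at which two equal-length tuples differ."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(x, y))

# Hypothetical categorical records: (color, shape, size)
print(hamming(("red", "square", "small"),
              ("red", "circle", "large")))  # 2
```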
Which of the following is true about k-NN classifiers: It can be used for classification. It can be used for data cleaning for any other classifier. It can be used for regression. It can be used for categorical data. It can be used for descriptive modeling. It can be used for continuous data.
It can be used for classification. It can be used for data cleaning for any other classifier. It can be used for regression. It can be used for categorical data. It can be used for continuous data.
Which of the following measures is resistant, i.e. is not distorted by outliers? mode variance median absolute deviation (MAD) Interquartile range (IQR) mean median absolute average deviation (ADD) standard deviation range percentiles
Mode Median absolute deviation Interquartile range Median Absolute average deviation Percentiles
Which of the following can be used to summarize continuous numerical data? range stem-and-leaf plot mean histogram median pie chart of the individual values mode box plot standard deviation frequency distribution of the individual values bar graph percentiles
Range Stem-and-leaf plot mean histogram median mode box plot standard deviation percentiles
Let x = (0,1,0,0,1,0,0,0,1,0) and y = (1,0,0,0,1,0,1,0,1,0), represent voting for "All Democrat" ticket by two voters in the past ten elections, where 0 means "no" and 1 means "yes". Which proximity measure should we use to compare the voters?
Simple Matching Coefficient. Since voting the all-Democrat ticket (yes) and splitting the selections between the candidates' affiliations (no) are both important characteristics of a voter, we should include both the yes-yes and the no-no counts.
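For the two voters above, the SMC counts all matching positions (both 1-1 and 0-0) over the total; a quick sketch:

```python
def smc(x, y):
    """Simple Matching Coefficient: (1-1 and 0-0 matches) / vector length."""
    matches = sum(a == b for a, b in zip(x, y))
    return matches / len(x)

x = (0, 1, 0, 0, 1, 0, 0, 0, 1, 0)
y = (1, 0, 0, 0, 1, 0, 1, 0, 1, 0)
print(smc(x, y))  # 0.7  (7 matching elections out of 10)
```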
Which of the following is not a goal of modeling using machine learning? To find patterns among data. To make predictions. To find correlation between variables. To find explanations
To find correlation between variables
A target function in classification corresponds to
The classification model we want to learn
Overfitting means that
The model separates classes very well but it does not generalize well
When we use the z-score normalization (standardization) as a data transformation during the data preprocessing stage, then we
change the values so that their mean becomes 0 and the standard deviation is 1
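A one-line sketch of the transformation on made-up values: subtract the mean and divide by the standard deviation, after which the data has mean 0 and standard deviation 1.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])

# z-score normalization (standardization)
z = (x - x.mean()) / x.std()

print(round(z.mean(), 10))  # 0.0
print(round(z.std(), 10))   # 1.0
```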
Generalization error of a model is the error committed on
future, unseen data
Which of the following characterizes the k-NN algorithm: k-NN is an eager learner (builds model and discards training data). k-NN is a parametric algorithm (makes assumptions about data distribution). k-NN can be used for descriptive modeling. k-NN is an instance-based learner. k-NN is a lazy learner (build a model only when needed). k-NN is a nonlinear classifier (arbitrary decision boundary). k-NN is a linear classifier (linear decision boundary). k-NN is a non-parametric algorithm (makes no assumptions about data distribution). k-NN can be used exclusively for prediction.
k-NN is an instance-based learner. k-NN is a lazy learner (builds a model only when needed). k-NN is a nonlinear classifier (arbitrary decision boundary). k-NN is a non-parametric algorithm (makes no assumptions about data distribution). k-NN can be used exclusively for prediction.
