Data Mining Exam 1

Numerical nominal data conveys no mathematical meaning

True. Nominal data, even if it is numerical, has no mathematical meaning beyond the symbols used as its values. For example, a Social Security number is just a taxpayer's identifier and conveys no numerical or mathematical information, unlike, say, the taxpayer's income.

The principal goal of classification is to develop a model h that is generalizable, which means that the model must

accurately predict the class of a new, previously unseen example

Classify the attributes as discrete or continuous, qualitative or quantitative, and as binary or nominal or ordinal or interval or ratio: Student grade point average

Continuous Quantitative Interval

Classify the attributes as discrete or continuous, qualitative or quantitative, and as binary or nominal or ordinal or interval or ratio: Temperature measured in Fahrenheit

Continuous Quantitative Interval

Classify the attributes as discrete or continuous, qualitative or quantitative, and as binary or nominal or ordinal or interval or ratio: Surface area

Continuous Quantitative Ratio

Clustering is a task of assigning objects to one of several predefined categories called classes or labels

False

Clustering is an example of supervised learning

False

We must always select the model with the smallest training error

False

Classification involves supervised learning

True

Generalization error is the sum of bias and variance

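True. More precisely, under squared-error loss the expected generalization error decomposes into squared bias plus variance plus irreducible noise.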

Ideally, a model should have the lowest bias and the lowest variance, but it is not possible to achieve.

True

Is monitoring the heart rate of a patient for abnormalities a data mining task?

True. To know what an abnormal heart rate is for a specific patient, we would need a model of normal and abnormal heart rates that also includes other health indicators. What is a normal rate for one patient may be dangerous for another.

During data preprocessing, it is possible that the number of attributes increases.

True. The number of attributes can increase during data preprocessing. For example, to perform a quadratic regression using a linear regression algorithm, we may double the number of features per vector by including the original features as well as their squares.
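
The feature-doubling idea can be sketched in a few lines. This is only an illustration: the helper name `add_squared_features` and the sample rows are assumptions made for this example, not part of the course material.

```python
def add_squared_features(rows):
    """Return each row extended with the square of every original feature."""
    return [list(row) + [value ** 2 for value in row] for row in rows]


# Hypothetical two-feature records.
rows = [[1.0, 2.0], [3.0, -4.0]]
expanded = add_squared_features(rows)

# Each record now has twice as many attributes as before.
print(len(rows[0]), "->", len(expanded[0]))
print(expanded)
```

A linear regression fitted on the expanded rows is linear in the new features but quadratic in the original ones.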

Is predicting the future stock price of a company using historical records a data mining task?

True. We would attempt to create a model that can predict the continuous value of the stock price. This is an example of the area of data mining known as predictive modeling. We could use regression for this modeling, although researchers in many fields have developed a wide variety of techniques for predicting time series.

Construction of a decision tree is an example of supervised learning

True

Model tuning is done using

validation set

Given the following data and their corresponding scatter plot, classify the point (1,1) using 3-NN with a Euclidean distance.

Class +

The "no free lunch" theorem informs us that:

For every problem we must custom-build a new model to fit this specific problem

Training error of a model is the error committed on

training set data

Which of the following are properties of a metric d?
- A metric d must be symmetric, i.e., d(x,y) = d(y,x) for all x,y.
- A metric d must be nonnegative, i.e., d(x,y) ≥ 0 for all x,y.
- A metric d must obey the triangle inequality, i.e., d(x,z) ≤ d(x,y) + d(y,z) for all x,y,z.
- A metric d must obey the triangle equality, i.e., d(x,z) = d(x,y) + d(y,z) for all x,y,z.
- A metric d must obey the equality property, i.e., if d(x,y) = d(x,z), then y = z for all x,y,z.
- A metric d must be positive, i.e., d(x,y) > 0 for all x,y.

- A metric d must be symmetric, i.e., d(x,y) = d(y,x) for all x,y.
- A metric d must be nonnegative, i.e., d(x,y) ≥ 0 for all x,y.
- A metric d must obey the triangle inequality, i.e., d(x,z) ≤ d(x,y) + d(y,z) for all x,y,z.
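
As a rough illustration, the three properties can be spot-checked numerically on a small set of points. The function `looks_like_metric` and the sample points are hypothetical, and passing the check on a few points does not prove that a function is a metric; it can only expose violations.

```python
import math
from itertools import product


def euclidean(p, q):
    return math.dist(p, q)


def squared_euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def looks_like_metric(d, points, tol=1e-12):
    """Check nonnegativity, symmetry and the triangle inequality on the given points."""
    for x, y in product(points, repeat=2):
        if d(x, y) < 0:
            return False  # violates nonnegativity
        if abs(d(x, y) - d(y, x)) > tol:
            return False  # violates symmetry
    for x, y, z in product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:
            return False  # violates the triangle inequality
    return True


# Three collinear sample points (an assumption for this example).
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
print(looks_like_metric(euclidean, points))
print(looks_like_metric(squared_euclidean, points))
```

Squared Euclidean distance fails here because the direct distance from (0,0) to (2,0) is 4, while the two legs through (1,0) sum to only 2.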

Given two attributes: size = {small, medium, large, x-large} and price_level = {low, med, high}, and two data vectors: x = (small, high) and y = (x-large, med), find d(x,y) using the Euclidean distance. Round your answer to two decimal digits.

1.12
Since the attributes are ordinal, we first convert their values to integers: size = {0, 1, 2, 3} and price_level = {0, 1, 2}, so the data vectors become x = (0, 2) and y = (3, 1). Next, we compute the normalized distances between the individual components: d(x1,y1) = d(small, x-large) = |0-3|/3 = 1 and d(x2,y2) = d(high, med) = |2-1|/2 = 1/2. Finally, the Euclidean distance between the data vectors is d(x,y) = sqrt(1^2 + (1/2)^2), which is about 1.12.
In practice, we would begin by normalizing the values of the attributes to obtain size = {0, 1/3, 2/3, 1} and price_level = {0, 1/2, 1}, so that x = (0, 1) and y = (1, 1/2), and then calculate the "ordinary" Euclidean distance: d(x,y) = sqrt(1^2 + (1/2)^2), which is about 1.12.
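
The same calculation can be written as a short sketch. The helper names `normalized_rank` and `ordinal_euclidean` are made up for this example; the attribute values and the two vectors come from the question.

```python
import math

# Ordered value lists taken from the question.
SIZE = ["small", "medium", "large", "x-large"]
PRICE_LEVEL = ["low", "med", "high"]


def normalized_rank(value, ordered_values):
    """Map an ordinal value to its rank, rescaled to the range [0, 1]."""
    return ordered_values.index(value) / (len(ordered_values) - 1)


def ordinal_euclidean(x, y, scales):
    """Euclidean distance between two vectors of ordinal values."""
    return math.sqrt(sum(
        (normalized_rank(a, scale) - normalized_rank(b, scale)) ** 2
        for a, b, scale in zip(x, y, scales)
    ))


d = ordinal_euclidean(("small", "high"), ("x-large", "med"), [SIZE, PRICE_LEVEL])
print(round(d, 2))  # 1.12
```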

What is exploratory data analysis

A way to get an initial feel for the data. It includes both quantitative and qualitative analysis, and its role is to give us an initial "feel" for and understanding of the data we have. It may help with further preprocessing and with the selection of machine learning methods.

Which of the following is true about the k-NN classifier?

As k increases, the bias increases as well

Which of the following can be used to summarize nominal data? range, stem-and-leaf plot, mean, median, bar graph, histogram, pie chart, box plot, standard deviation, mode, percentile, frequency distribution

Bar graph, pie chart, mode, frequency distribution

Classifying an image of an animal as a cat or a dog is an example of

Binary classification

Which of the following is the best option to decrease the impact of the "curse of dimensionality" when using k-NN classifier?

Both, dimensionality reduction and feature selection

Given the following data and their corresponding scatter plot, classify the point (-1,-1) using 4-NN with a Euclidean distance.

Cannot decide definitively

Classify the attributes as discrete or continuous, qualitative or quantitative, and as binary or nominal or ordinal or interval or ratio: Brightness as measured by people's judgements

Discrete Qualitative Ordinal

Classify the attributes as discrete or continuous, qualitative or quantitative, and as binary or nominal or ordinal or interval or ratio: Bronze, silver, and gold medals as awarded at the Olympics

Discrete Qualitative Ordinal

Classify the attributes as discrete or continuous, qualitative or quantitative, and as binary or nominal or ordinal or interval or ratio: Number of patients in a hospital

Discrete Quantitative Ratio

Classify the attributes as discrete or continuous, qualitative or quantitative, and as binary or nominal or ordinal or interval or ratio: ISBN numbers for books

Discrete Qualitative Nominal

Classify the attributes as discrete or continuous, qualitative or quantitative, and as binary or nominal or ordinal or interval or ratio: Military rank

Discrete Qualitative Ordinal

Classify the attributes as discrete or continuous, qualitative or quantitative, and as binary or nominal or ordinal or interval or ratio: Coat Check Number

Discrete Qualitative Nominal

Classify the attributes as discrete or continuous, qualitative or quantitative, and as binary or nominal or ordinal or interval or ratio: Time in Terms of AM or PM

Discrete, qualitative, binary

Every data proximity measure is also a metric

False Not all proximity measures used in data mining are necessarily metrics. The triangle property is the one that is most likely to be missing but sometimes a proximity measure may also be nonsymmetric. To envision a distance measure that is not symmetric just think of finding the distance between 2 locations if the city uses many one-way streets. However, if we use a proximity measure that satisfies all three properties of a metric, then many algorithms can be designed to be much more efficient. Therefore, whenever possible, we prefer to use measures that are metrics.

We can calculate the generalization error by using the test data

False. It is impossible to calculate the generalization ("true") error exactly, since it is the error on future, unseen examples. The test data can only provide an estimate of it.

The model-building phase for the k-NN classifier is time-consuming, but once the model is ready, we can classify new records almost instantly.

False The k-NN classifier stores all the training records and in order to classify a new record, it must find the k closest records among all the training records. In other words, all the work is done when we have a new record that needs to be classified.
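
A minimal sketch of this behavior follows. The training records, the query point, and the function name `knn_classify` are invented for illustration. Note that there is no training step: all distance computations happen at query time, and an even k can produce a tied vote that cannot be resolved definitively.

```python
import math
from collections import Counter


def knn_classify(training, query, k):
    """Return the majority label among the k nearest records, or None on a tie.

    training is a list of (point, label) pairs; every distance is computed
    here, at classification time.
    """
    neighbors = sorted(training, key=lambda record: math.dist(record[0], query))[:k]
    votes = Counter(label for _, label in neighbors).most_common()
    if len(votes) > 1 and votes[0][1] == votes[1][1]:
        return None  # tied vote
    return votes[0][0]


# Hypothetical training records.
training = [
    ((0.0, 0.0), "+"),
    ((1.0, 0.0), "+"),
    ((0.0, 1.0), "-"),
    ((5.0, 5.0), "-"),
    ((6.0, 5.0), "-"),
]

print(knn_classify(training, (0.2, 0.1), k=3))
print(knn_classify(training, (0.2, 0.1), k=4))
```

With k = 3 the neighbors vote two to one for "+". With k = 4 the vote is two to two, so the sketch returns None rather than picking a class arbitrarily.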

Test set is critical for model selection

False The test set can be used exclusively for reporting the estimated generalization error. For model selection, we find the error on the validation set.

The higher the value of k, the longer the training.

False. The training time of the k-NN algorithm is the same regardless of the value of k. At classification time, we must still find the distances to all training examples.

Is dividing the customers of a company according to their profitability a data mining task?

False This is an accounting calculation, followed by the application of a threshold. However, predicting the profitability of a new customer would be a data mining task.

Is sorting a student database based on student identification numbers a data mining task?

False. This is not a data mining task; it is a simple database query.

All elements of a Pandas data frame must have the same data type

False. The elements can have different data types. In fact, this is an advantage of a DataFrame over a NumPy array when the input data that we need to load and preprocess consists of attributes with mixed data types. A DataFrame can hold mixed-attribute data in a single structure, unlike a NumPy array, where all elements must be of the same type.

Which of the following can be used to summarize ordinal data? frequency distribution of the individual values, mode, standard deviation, stem-and-leaf plot, mean, box plot, bar graph, range, histogram, median, percentiles, pie chart of the individual values

Frequency distribution of the individual values, mode, box plot, bar graph, range, median, percentiles, pie chart of the individual values

Which of the following distance measures should be used when applying k-NN to categorical data?

Hamming distance
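
For reference, the Hamming distance simply counts the positions at which two equal-length records differ. The sketch below uses made-up categorical records; the function name is an assumption for this example.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(1 for u, v in zip(a, b) if u != v)


# Hypothetical categorical records: (color, body type, transmission).
car_a = ("red", "suv", "manual")
car_b = ("red", "sedan", "automatic")
print(hamming_distance(car_a, car_b))  # 2
```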

Which of the following is true about k-NN classifiers: It can be used for classification. It can be used for data cleaning for any other classifier. It can be used for regression. It can be used for categorical data. It can be used for descriptive modeling. It can be used for continuous data.

It can be used for classification. It can be used for data cleaning for any other classifier. It can be used for regression. It can be used for categorical data. It can be used for continuous data.

Which of the following measures are resistant, i.e., not distorted by outliers? mode, variance, median absolute deviation (MAD), interquartile range (IQR), mean, median, absolute average deviation (AAD), standard deviation, range, percentiles

Mode, median absolute deviation, interquartile range, median, absolute average deviation, percentiles
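
A quick way to see what "resistant" means is to add a single extreme value and compare how far the mean and the median move. The numbers below are invented for illustration.

```python
import statistics

# Hypothetical sample, then the same sample with one extreme outlier.
values = [10, 12, 13, 15, 16]
with_outlier = values + [1000]

mean_shift = statistics.mean(with_outlier) - statistics.mean(values)
median_shift = statistics.median(with_outlier) - statistics.median(values)

print(f"mean moved by {mean_shift:.2f}")
print(f"median moved by {median_shift:.2f}")
```

The median moves by one unit, while the mean is dragged far toward the outlier.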

Which of the following can be used to summarize continuous numerical data? range, stem-and-leaf plot, mean, histogram, median, pie chart of the individual values, mode, box plot, standard deviation, frequency distribution of the individual values, bar graph, percentiles

Range, stem-and-leaf plot, mean, histogram, median, mode, box plot, standard deviation, percentiles

Let x = (0,1,0,0,1,0,0,0,1,0) and y = (1,0,0,0,1,0,1,0,1,0), represent voting for "All Democrat" ticket by two voters in the past ten elections, where 0 means "no" and 1 means "yes". Which proximity measure should we use to compare the voters?

Simple Matching Coefficient. Voting all Democrat (yes) and mixing selections among the candidates' affiliations (no) are both important characteristics of a voter, so we should include both the yes and the no counts.
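
To make the difference concrete, the sketch below computes the Simple Matching Coefficient for the two voters and, for contrast, the Jaccard coefficient, which ignores 0-0 matches. The function names are assumptions for this example; the vectors are those given in the question.

```python
def simple_matching_coefficient(x, y):
    """Fraction of positions where the two binary vectors agree (1-1 or 0-0)."""
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches / len(x)


def jaccard_coefficient(x, y):
    """1-1 matches divided by the positions where at least one vector has a 1."""
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    return both / either


x = (0, 1, 0, 0, 1, 0, 0, 0, 1, 0)
y = (1, 0, 0, 0, 1, 0, 1, 0, 1, 0)

smc = simple_matching_coefficient(x, y)
jaccard = jaccard_coefficient(x, y)
print(smc)      # 0.7
print(jaccard)  # 0.4
```

The two voters agree in 7 of 10 elections, 5 of which are shared "no" votes that the Jaccard coefficient would discard.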

Which of the following is not a goal of modeling using machine learning? To find patterns among data. To make predictions. To find correlation between variables. To find explanations

To find correlation between variables

A target function in classification corresponds to

The classification model we want to learn

Overfitting means that

The model separates classes very well but it does not generalize well

When we use the z-score normalization (standardization) as a data transformation during the data preprocessing stage, then we

change the values so that their mean becomes 0 and the standard deviation is 1
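
A minimal sketch of the transformation, assuming the population standard deviation is used (some tools use the sample standard deviation instead) and that the values are not all identical. The function name and the sample values are made up for illustration.

```python
import statistics


def z_scores(values):
    """Standardize values: subtract the mean, divide by the population std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumes std > 0
    return [(v - mean) / std for v in values]


# Hypothetical attribute values.
scores = z_scores([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(scores)
print(statistics.fmean(scores), statistics.pstdev(scores))
```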

Generalization error of a model is the error committed on

future, unseen data

Which of the following characterize the k-NN algorithm? k-NN is an eager learner (builds a model and discards the training data). k-NN is a parametric algorithm (makes assumptions about the data distribution). k-NN can be used for descriptive modeling. k-NN is an instance-based learner. k-NN is a lazy learner (builds a model only when needed). k-NN is a nonlinear classifier (arbitrary decision boundary). k-NN is a linear classifier (linear decision boundary). k-NN is a non-parametric algorithm (makes no assumptions about the data distribution). k-NN can be used exclusively for prediction.

k-NN is an instance-based learner. k-NN is a lazy learner (builds a model only when needed). k-NN is a nonlinear classifier (arbitrary decision boundary). k-NN is a non-parametric algorithm (makes no assumptions about the data distribution). k-NN can be used exclusively for prediction.

