Data Mining Midterm


median

holistic

mode()

holistic

rank()

holistic

One of the data reduction techniques finds true information and its source reliability while reducing the amount of data collected

truth discovery

What is a consequence of KDD?

useful information and meaningful data

Data mining can also be applied to other forms such as ................ i) Data streams ii) Sequence data iii) Networked data iv) Text data v) Spatial data A) i, ii, iii and v only B) ii, iii, iv and v only C) i, iii, iv and v only D) All i, ii, iii, iv and v

All i, ii, iii, iv and v

Imagine you have 1000 input features and 1 target feature in a machine learning problem. You must select the 100 most important features based on the relationship between input and target features. Do you think this is an example of dimensionality reduction?

yes

A concept hierarchy defines a sequence of concept mappings from a set of low-level concepts to higher-level, more general concepts

yes

How many metrics or things are available in the kNN algorithm?

Which of the following is not an attribute selection measure? This can be used to determine the best split of data.

A heuristic

A technique scales given values of features by dividing the values by a power of ten. (select all that apply)

A technique scales given values of features by dividing the values by a power of ten. We divide each value of the data by the maximum absolute value of the data. X' = X / 10^d, where X is the original feature value and d is the smallest integer such that the largest absolute value in the feature becomes less than one. It normalizes by moving the decimal point of values of the data.

Which of the following is true for decimal scaling normalization? (Select all that apply)

A technique scales given values of features by dividing the values by a power of ten. It normalizes by moving the decimal point of values of the data. We divide each value of the data by the maximum absolute value of the data. X' = X / 10^d, where X is the original feature value and d is the smallest integer such that the largest absolute value in the feature becomes less than one.
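
The decimal scaling rule described in this card can be sketched in Python. This is only an illustration under stated assumptions: the function name `decimal_scale` and the sample values are hypothetical, and the input is assumed to be a non-empty list of numbers.

```python
def decimal_scale(values):
    """Divide every value by 10**d, where d is the smallest
    non-negative integer that makes the largest absolute value < 1."""
    max_abs = max(abs(v) for v in values)
    d = 0
    while max_abs / (10 ** d) >= 1:
        d += 1
    return [v / (10 ** d) for v in values], d


# Hypothetical feature values; the largest absolute value is 991.
scaled, d = decimal_scale([-991, 45, 120, 600])
print(d, scaled)  # 3 [-0.991, 0.045, 0.12, 0.6]
```

Only one number from the data, the largest absolute value, is needed to pick d, which is why every value ends up with its decimal point moved by the same number of places.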

(select all that apply) In terms of measuring data quality, a well-accepted multidimensional view includes

Accuracy, Consistency, and Accessibility

Which is not a data cleaning method?

Aggregation

Identify the distance metric used in the kNN algorithm ___________

All

Choose examples of supervised learning

Credit/loan approval, Fraud detection

Imagine you are working at the weather forecast station, and you need to predict whether or not it will be raining at 6pm tomorrow. You want to use a classification algorithm for this. What would be the best?

Which of the following is a major task in data preprocessing?

Data cleaning, data integration, Data transformation, data reduction

Which of the following features usually applies to data in a data warehouse?

Data is rarely deleted

Which statement correctly describes the difference between data mining and data warehousing?

Data warehousing is a process of pooling all the relevant data together, data mining is a process of extracting data from large data sets

The ID3 algorithm is used to build a _______.

Decision tree

Which of the following items is used in identifying a significant change in data?

Deviation

Data mining is a process used by organizations to turn useful information into raw data.

False

The whole process of data transformation is called ETL (Extract, Transform, and Load). This process is defined as a six-step process, as listed below. Which of the following is mainly used in data preprocessing? (select all that apply)

data mapping, data extraction

Which of the normalization techniques requires just one number from the given dataset to normalize all data?

decimal scaling

Which of the following is a non-parametric method of numerosity data reduction techniques?

Histograms, clustering

When you find noise in data which of the following option would you consider in k-NN?

I will increase the value of k

Which of the following is not involved in data mining?

Information

Which of the following options is true about the k-NN algorithm?

It can be used in both classification and regression

What is KDD?

Knowledge Discovery in Database

A computer program is said to learn from experience E with respect to some class of tasks T and performance P, if its performance at tasks in T, as measured by P, improves with experience E. What is the program called?

Machine learning

There are two main processes in classification algorithms. Choose them from the following.

Model construction, Model usage

Select one or all that are untrue.

Most patterns in data are interesting

Choose examples of supervised learning algorithms. (Select all that apply)

Neural Network, Support Vector Machines

For the discriminative classifiers, what makes the difference between discriminative and non-discriminative data?

P(Y)

What is the objective of unsupervised learning? (Select all that apply)

determine data patterns, determine data groupings

In data mining, the ______________ technique reduces a significant amount of data from a large dataset by cutting down the number of features in the dataset while retaining data quality.

dimensionality reduction

count()

distributive

max()

distributive

sum()

distributive

Human inspection is an important and appropriate method to handle noisy data________

false

Regarding the Bayesian Network as shown in the diagram, what do the nodes and links in the Bayesian Network Classifier imply?

Random variables and their dependency

Which of the following has a rewarding strategy?

Reinforcement learning

Which of the following is not a data mining functionality?

Selection and interpretation

Choosing _______values for k can be noisy and will have a higher influence on the result.

Smaller k value.

In market basket analysis; confidence is:

The conditional probability that a transaction contains item set B given that it contains item set A
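
The definition of confidence as a conditional probability can be checked on a toy example. The sketch below is a minimal illustration: the transactions and the helper names `support_count` and `confidence` are assumptions made for this example.

```python
def support_count(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def confidence(transactions, a, b):
    """Confidence of the rule A -> B: P(B | A) estimated from counts."""
    count_a = support_count(transactions, a)
    if count_a == 0:
        raise ValueError("item set A never occurs")
    return support_count(transactions, a | b) / count_a


# Hypothetical market baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

# Bread appears in 3 baskets, bread and milk together in 2.
conf = confidence(transactions, {"bread"}, {"milk"})
print(conf)
```

Here the confidence of bread -> milk is 2/3, because two of the three baskets that contain bread also contain milk.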

Regarding a DT tree construction, which of the following is true?

The terminal node holds a class label in a DT.

________ is used to build a data mining model.

Training data

OLTP captures, stores, and processes data from transactions in real time, while OLAP uses complex queries to analyze aggregated historical data

True

Regarding data vs. information, data is the result of analyzing and interpreting information.

False

Machine learning always produces accurate results which are used by data mining and thereby makes data mining produce better results.

false

a data cube is generally used for easily smoothing data

false

Which of the following is an example of data mining:

What are sold more on a particular day than other days

An organization may not wish to get a response from its data warehouse for the following question.

Who are the customers of the organization?

k-NN algorithm does more computation on test time rather than train time.

Yes

Which of the following looks more like the final or mined data?

a few facts, numbers, and texts

which of the following is not a normalization method? (select all that apply)

aggregation and data cleaning

The difference between training data and test data is clear: one trains a model; the other confirms it works (or doesn't work) correctly with previously unseen data

agree

avg()

algebraic

min_n()

algebraic
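
The distributive, algebraic, and holistic labels used on the aggregate-function cards in this set can be illustrated with a small sketch. The partitioned data below is hypothetical; the point is only to show which measures can be combined from per-partition results.

```python
from statistics import median

# Hypothetical data split across three partitions.
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
all_values = [v for part in partitions for v in part]

# Distributive: applying the function to per-partition results
# gives the same answer as applying it to the whole data set.
total = sum(sum(p) for p in partitions)
count = sum(len(p) for p in partitions)
largest = max(max(p) for p in partitions)

# Algebraic: computed from a fixed number of distributive results.
average = total / count

# Holistic: no fixed-size summary per partition is enough in general.
# The median of the partition medians differs from the true median here.
median_of_medians = median([median(p) for p in partitions])
true_median = median(all_values)

print(total, count, largest, average)
print(median_of_medians, true_median)
```

In this example the median of partition medians is 4.5 while the true median is 5, which is why median is treated as holistic.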

equal width and frequency of data are used in

binning
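
Equal-width and equal-frequency binning can be sketched as follows. This is a minimal illustration, assuming a non-empty list with at least two distinct values; the function names and the sample data are hypothetical.

```python
def equal_width_bins(values, k):
    """Split the value range into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in values:
        # The maximum value falls on the upper edge, so clamp it into the last bin.
        idx = min(int((v - lo) / width), k - 1)
        bins[idx].append(v)
    return bins


def equal_frequency_bins(values, k):
    """Split the sorted values into k bins of (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


# Hypothetical sorted prices.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(equal_width_bins(prices, 3))
print(equal_frequency_bins(prices, 3))
```

With these prices, equal-width binning uses intervals of width 10 and puts six values in the last bin, while equal-frequency binning puts four values in every bin.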

According to Bayes' rule, which of the following is the correct posterior probability? (Select one or more that apply)

cancer

A systematic approach to building classification models from an input data set.

classification

A test-set is used in ______ to determine the accuracy of the model

classification

Dividing a database into 3 parts; a training data set, validation data set and testing data set is known as:

classification

Suppose that the diagram provides a snapshot of the transactions that you have made recently. As a customer, you can check when you buy, what you buy, how often you pay on time, etc. Someday, you are notified that there was fraud with one of your transactions. The credit card provider usually uses a data mining task to detect such fraud by observing credit card transactions on your account. Here, the name of the data mining task is:

classification

Which of the following data mining task would be the best for spam detection?

classification

Documents can be better grouped by:

clustering

data reduction techniques are used in data mining to reduce the size of a dataset while still keeping the most critical data quality. However, one of the techniques comprises either a lossy or lossless method to reduce the size of a dataset. What is that?

compression techniques

Which part of the KDD process may require the use of a large amount of effort, ___________?

data cleaning and preprocessing

min-max normalization is used to normalize data by dividing each value by a fixed number

false

Let's take a simple case to understand a classifier. Following is a spread of red circles (RC) and green squares (GS):

kNN classifier

Suppose that you are working on a data mining project. You are considering applying a classifier where you don't need to use a training phase. What classifier are you going to choose?

kNN classifier
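
A kNN classifier has no training phase because it simply stores the labeled points and does all of its work at prediction time. The sketch below is a minimal illustration using Euclidean distance and majority voting; the points, the labels RC (red circle) and GS (green square), and the function name `knn_predict` are assumptions for this example.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict a label by majority vote among the k nearest stored points.

    train is a list of (point, label) pairs. All distance computation
    happens here, at prediction time.
    """
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical red circles (RC) and green squares (GS).
train = [
    ((1, 1), "RC"), ((1, 2), "RC"), ((2, 1), "RC"),
    ((6, 6), "GS"), ((7, 6), "GS"), ((6, 7), "GS"),
]

print(knn_predict(train, (2, 2)))
print(knn_predict(train, (6, 5)))
```

An odd k avoids ties in a two-class problem like this one; a larger k makes the vote less sensitive to a single noisy neighbor.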

which of the following transformations is helpful with high variance

log scaling
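
Log scaling compresses values that span several orders of magnitude. The sketch below is only illustrative: the sample values are hypothetical, base 10 is an arbitrary choice, and every value is assumed to be strictly positive.

```python
import math


def log_scale(values):
    """Return log10 of each value; assumes every value is strictly positive."""
    return [math.log10(v) for v in values]


# Hypothetical, heavily skewed values.
incomes = [1_000, 10_000, 100_000, 1_000_000]
scaled = log_scale(incomes)

print(max(incomes) - min(incomes))  # spread of the raw values
print(max(scaled) - min(scaled))    # spread after log scaling
```

The ordering of the values is preserved, but the gap between the smallest and largest value shrinks from hundreds of thousands to about 3.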

Suppose in a classification problem, you are using a decision tree, and you use the Gini index as the criterion for the algorithm to select the feature for the root node. The feature with the _____ Gini index will be selected.

lowest
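
To see why the feature with the lowest Gini index is chosen, the weighted Gini index of two candidate splits can be compared. This is a minimal sketch: the class labels, the two candidate features, and the function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of one group: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_split(groups):
    """Weighted Gini index of a split into several groups."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * gini(g) for g in groups)


# Hypothetical splits of the same six records by two candidate features.
splits = {
    "feature_a": [["yes", "yes", "yes"], ["no", "no", "no"]],  # pure groups
    "feature_b": [["yes", "no", "yes"], ["no", "yes", "no"]],  # mixed groups
}

scores = {name: gini_split(groups) for name, groups in splits.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

feature_a separates the classes perfectly, so its Gini index is 0 and it would be selected for the root node.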

Which of the following is an example of relational OLAP?

metacube

a normalization technique in which values are shifted and rescaled so that they end up ranging from 0 to 1

min-max normalization
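
Min-max normalization shifts values by the minimum and rescales by the range. The sketch below is a minimal illustration, assuming the values are not all identical; the function name and sample data are hypothetical.

```python
def min_max_normalize(values):
    """Rescale values to the range [0, 1] using (v - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; the range is zero")
    return [(v - lo) / (hi - lo) for v in values]


normalized = min_max_normalize([10, 20, 30, 50])
print(normalized)  # [0.0, 0.25, 0.5, 1.0]
```

Unlike decimal scaling, this needs both the minimum and the maximum of the feature, and each value is shifted before it is divided.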

Concept Hierarchy is useful in

multiple levels of abstraction

ID3 algorithm can also be used in clustering

Consider a production line producing computers. The shop manager would like a good estimate of the required number of worker hours given that a certain number of units must be produced. The manager collects a small sample of the number of worker hours for each lot size. The __________ task fit suggests a minimum of 10 hours of work and two extra hours for each additional unit that is produced.

regression
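
The fit described in this card, 10 hours plus 2 hours per unit, is what a simple linear regression would return on data that follows that pattern. The sketch below uses ordinary least squares on hypothetical sample data constructed to match; the numbers are not from the original question.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical sample: lot sizes and the worker hours each lot required.
lot_sizes = [10, 20, 30, 40, 50]
worker_hours = [30, 50, 70, 90, 110]

intercept, slope = fit_line(lot_sizes, worker_hours)
print(intercept, slope)  # 10.0 2.0
```

The output is a continuous quantity (hours), which is what separates this regression task from classification.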

Which one would be the best for predicting sales amounts?

regression

________ is the data mining task that is used to predict sales amounts of a new product.

regression

Machine learning techniques are used to automatically find patterns in data. The patterns that are found in the given data may be represented by the following (select all that apply):

structural descriptions and black-box models

A ______________ is utilized to measure the accuracy of a classification model.

test dataset
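
Accuracy on a test dataset is the fraction of held-out records the model labels correctly. The sketch below is a minimal illustration; the labels and predictions are hypothetical and no particular model is assumed.

```python
def accuracy(y_true, y_pred):
    """Fraction of test records whose predicted label matches the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical true labels and model predictions on five unseen records.
y_true = ["spam", "ham", "spam", "ham", "ham"]
y_pred = ["spam", "ham", "ham", "ham", "ham"]

print(accuracy(y_true, y_pred))  # 0.8
```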

The idea presented in the portion of the picture takes time to cluster data. If an obstacle is encountered, it cannot be dealt with in real time. It needs to wait for an analysis to be initiated by a person. The idea presented in the below portion of the picture detects what is relevant to the task. If an obstacle is encountered, an analysis can be made in line. It does not need to wait for an analysis to be initiated by a person. Select all that are true.

the first picture is about data mining, the second picture is about machine learning

A data warehouse differs from an operational database because most data warehouses have a product orientation and are tuned to handle transactions that update the database _______.

false

Association rule is suitable for marketing and sales promotion.

true

Classification is a predictive task

true

Discretization and Concept Hierarchy Generation divides the range of continuous attributes into intervals

true

Feature selection is a dimensionality reduction technique

true

Finding a sequential pattern in data is a descriptive data mining task.

true

If you want to handle noisy data, you can use regression

true

Multiple warehouses are needed in a database-centric solution. However, integrating the warehouses is a problem

true

Normalization is used to do data transformation

true

The approach to identifying frequently occurring terms in a document is data clustering.

true

The operational data is used as a source for the data warehouse

true

When feature values in the dataset vary drastically, z-score normalization should not be a good choice.

true

it is not necessary to have a target variable for applying dimensionality reduction in data reduction
