Data Mining Midterm
median
holistic
mode()
holistic
rank()
holistic
One of the data reduction techniques finds true information and its source reliability while reducing the amount of data collected
truth discovery
What is a consequence of KDD?
useful information and meaningful data
Data mining can also be applied to other forms of data, such as ................ i) Data streams ii) Sequence data iii) Networked data iv) Text data v) Spatial data A) i, ii, iii and v only B) ii, iii, iv and v only C) i, iii, iv and v only D) All i, ii, iii, iv and v
All i, ii, iii, iv and v
Imagine you have 1000 input features and 1 target feature in a machine learning problem. You must select the 100 most important features based on the relationship between the input and target features. Do you think this is an example of dimensionality reduction?
yes
A concept hierarchy climbing defines a sequence of concept mappings from a set of low-level concepts to higher-level, more general concepts
yes
How many metrics or things are available in the kNN algorithm?
4
Which of the following is not an attribute selection measure? This can be used to determine the best split of data.
A heuristic
A technique scales given values of features by dividing the values by a power of ten. (select all that apply)
A technique scales given values of features by dividing the values by a power of ten. We divide each value of the data by the maximum absolute value of the data. Where X is the original feature value, and d is the smallest integer such that the largest absolute value in the feature becomes less than one. It normalizes by moving the decimal point of values of the data.
Which of the following is true for decimal scaling normalization? (Select all that apply)
A technique scales given values of features by dividing the values by a power of ten. It normalizes by moving the decimal point of values of the data. We divide each value of the data by the maximum absolute value of the data. Where X is the original feature value, and d is the smallest integer such that the largest absolute value in the feature becomes less than one.
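A minimal sketch of decimal scaling as the card describes it, assuming the usual formulation in which every value is divided by 10^d and d is the smallest integer that brings the largest absolute value below one. The function name decimal_scale and the sample values are hypothetical, chosen only for illustration.

```python
def decimal_scale(values):
    """Divide every value by 10**d, where d is the smallest integer
    such that the largest absolute scaled value is below 1."""
    max_abs = max(abs(v) for v in values)
    d = 0
    # Move the decimal point one place at a time until max |x| < 1.
    while max_abs / (10 ** d) >= 1:
        d += 1
    return [v / (10 ** d) for v in values], d


# Hypothetical feature values; the largest absolute value is 986.
scaled, d = decimal_scale([-986, 120, 45, 917])
print(d)       # number of decimal places moved
print(scaled)  # every value now lies strictly between -1 and 1
```

Only one number from the dataset, the maximum absolute value, is needed to choose d, which is why another card in this set names decimal scaling as the technique that needs a single value.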
(select all that apply) In terms of measuring data quality, a well-accepted multidimensional view includes
Accuracy, Consistency, and Accessibility
Which is not a data cleaning method?
Aggregation
Identify the distance metric used in the KNN algorithm ___________
All
Choose examples of supervised learning
Credit/loan approval, Fraud detection
Imagine you are working at the weather forecast station, and you need to predict whether or not it will be raining at 6pm tomorrow. You want to use a classification algorithm for this. What would be the best?
DT
Imagine you are working at the weather forecast station, and you need to predict whether or not it will be raining at 6pm tomorrow. You want to use a classification algorithm for this. What would you choose?
DT
Which of the following is a major task in data preprocessing?
Data cleaning, data integration, Data transformation, data reduction
Which of the following features usually applies to data in a data warehouse?
Data is rarely deleted
Choose the correct statement about the difference between data mining and data warehousing
Data warehousing is a process of pooling all the relevant data together, data mining is a process of extracting data from large data sets
The ID3 algorithm is used to build a _______.
Decision tree
Which of the following items is used in identifying a significant change in data?
Deviation
Data mining is a process used by organizations to turn useful information into raw data.
False
The whole process of data transformation is called ETL (Extract, Transform, and Load). This process is defined as a six-step process, as listed below. Which of the following is mainly used in data preprocessing? (select all that apply)
data mapping, data extraction
Which of the normalization techniques requires just one number from the given dataset to normalize all data?
decimal scaling
Which of the following is a non-parametric method of numerosity data reduction techniques?
Histograms, clustering
When you find noise in data which of the following option would you consider in k-NN?
I will increase the value of k
Which of the following methods is not involved in data mining?
Information
Which of the following options is true about the k-NN algorithm?
It can be used in both classification and regression
What is KDD?
Knowledge Discovery in Database
A computer program is said to learn from experience E with respect to some class of tasks T and performance P, if its performance at tasks in T, as measured by P, improves with experience E. What is the program called?
Machine learning
There are two main processes in classification algorithms. Choose them from the following.
Model construction, Model usage
Select one or all that is untrue.
Most patterns in data are interesting
Choose examples of supervised learning algorithms. (Select all that apply)
Neural Network, Support Vector Machines
For the discriminative classifiers, what makes the difference between discriminative and non-discriminative data?
P(Y)
What is the objective of unsupervised learning? (Select all that apply)
determine data patterns, determine data groupings
In data mining, the ______________ technique reduces a significant amount of data from a large dataset by cutting down the number of features in the dataset while retaining data quality.
dimensionality reduction
count()
distributive
max()
distributive
sum()
distributive
Human inspection is an important and appropriate method to handle noisy data ________
false
Regarding the Bayesian Network as shown in the diagram, what do the nodes and links in the Bayesian Network Classifier imply?
Random variables and their dependency
Which of the following has a rewarding strategy?
Reinforcement learning
Which of the following is not a data mining functionality?
Selection and interpretation
Choosing _______values for k can be noisy and will have a higher influence on the result.
Smaller k value.
In market basket analysis; confidence is:
The conditional probability that a transaction contains item set B given that it contains item set A
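The definition above can be sketched in a few lines: confidence(A => B) is the number of transactions containing both A and B divided by the number containing A. The helper names and the four sample baskets below are hypothetical.

```python
def support_count(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def confidence(transactions, a, b):
    """Estimate P(B | A): of the transactions containing A,
    the fraction that also contain B."""
    count_a = support_count(transactions, a)
    if count_a == 0:
        raise ValueError("item set A never occurs, confidence is undefined")
    return support_count(transactions, a | b) / count_a


# Hypothetical market baskets, each a set of purchased items.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

# Bread appears in 3 baskets, and 2 of those also contain milk.
conf = confidence(baskets, {"bread"}, {"milk"})
print(conf)
```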
Regarding a DT tree construction, which of the following is true?
The terminal node holds a class label in a DT.
________ is used to build a data mining model.
Training data
OLTP captures, stores, and processes data from transactions in real time, while OLAP uses complex queries to analyze aggregated historical data
True
Regarding data vs. information, data is the result of analyzing and interpreting information.
False
Machine learning always produces accurate results which are used by data mining and thereby makes data mining produce better results.
false
a data cube is generally used for easily smoothing data
false
Which of the following is an example of data mining:
What are sold more on a particular day than other days
An organization may not wish to get a response from its data warehouse for the following question.
Who are the customers of the organization?
k-NN algorithm does more computation on test time rather than train time.
Yes
Which of the following looks more like the final or mined data?
a few facts, numbers, and texts
which of the following is not a normalization method? (select all that apply)
aggregation and data cleaning
The difference between training data and test data is clear: one trains a model; the other confirms it works (or doesn't work) correctly with previously unseen data
agree
avg()
algebraic
min_n()
algebraic
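The measure categories used on these cards (count(), sum(), and max() as distributive, avg() as algebraic, median as holistic) can be illustrated with data split across partitions. This is a small sketch on made-up numbers; the partition layout and variable names are assumptions for illustration only.

```python
from statistics import median

# Hypothetical data stored in three separate partitions.
partitions = [[3, 9, 4], [10, 2], [7, 7, 8, 1]]

# Distributive: applying the function per partition and then combining
# the partial results gives the same answer as using all the data.
total = sum(sum(p) for p in partitions)
count = sum(len(p) for p in partitions)
maximum = max(max(p) for p in partitions)

# Algebraic: computed from a fixed number of distributive results
# (here avg = sum / count).
average = total / count

# Holistic: no fixed-size summary per partition is enough.
# The median of the partition medians differs from the true median.
all_values = [v for p in partitions for v in p]
true_median = median(all_values)
median_of_medians = median(median(p) for p in partitions)

print(total, count, maximum, average)
print(true_median, median_of_medians)
```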
equal width and frequency of data are used in
binning
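A brief sketch of the two partitioning schemes named on this card, assuming k bins: equal-width binning splits the value range into k intervals of the same size, and equal-frequency binning puts roughly the same number of values in each bin. The function names and the sample price list are hypothetical.

```python
def equal_width_bins(values, k):
    """Split the range [min, max] into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in sorted(values):
        # The maximum value falls on the right edge, so clamp it
        # into the last bin.
        idx = min(int((v - lo) / width), k - 1) if width > 0 else 0
        bins[idx].append(v)
    return bins


def equal_frequency_bins(values, k):
    """Split the sorted values into k bins of (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


# Hypothetical sorted price data.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
width_bins = equal_width_bins(prices, 3)
freq_bins = equal_frequency_bins(prices, 3)
print(width_bins)
print(freq_bins)
```

Smoothing noisy data would then replace each value by its bin mean, median, or nearest boundary; that step is left out of the sketch.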
According to Bayes' rule, which of the following is the correct posterior probability? (Select one or more that apply)
cancer
A systematic approach to building classification models from an input data set.
classification
A test-set is used in ______ to determine the accuracy of the model
classification
Dividing a database into 3 parts; a training data set, validation data set and testing data set is known as:
classification
Suppose that the diagram provides a snapshot of the transactions that you have made recently. As a customer, you can check when you buy, what you buy, how often you pay on time, etc. Someday, you are notified that there was fraud with one of your transactions. The credit card provider usually uses a data mining task to detect such fraud by observing credit card transactions on your account. Here, the name of the data mining task is:
classification
Which of the following data mining task would be the best for spam detection?
classification
Documents can be better grouped by:
clustering
data reduction techniques are used in data mining to reduce the size of a dataset while still keeping the most critical data quality. However, one of the techniques comprises either a lossy or lossless method to reduce the size of a dataset. What is that?
compression techniques
Which part of the KDD process may require the use of a large amount of effort, ___________?
data cleaning and preprocessing
min-max normalization is used to normalize data by dividing each value by a fixed number
false
Let's take a simple case to understand a classifier. Following is a spread of red circles (RC) and green squares (GS):
kNN classifier
Suppose that you are working on a data mining project. You are considering applying a classifier where you don't need to use a training phase. What classifier are you going to choose?
kNN classifier
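A minimal sketch of why k-NN needs no training phase: the "model" is just the stored examples, and all distance computation happens at prediction time. It also shows the effect of k on a noisy point. The coordinates, the RC/GS labels on the sample points, and the function name knn_predict are hypothetical, and Euclidean distance is only one possible metric.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Classify query by majority vote among its k nearest stored
    examples (Euclidean distance). Nothing is fitted beforehand."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical red circles (RC) and green squares (GS).
train = [
    ((1.0, 1.0), "RC"), ((1.5, 2.0), "RC"), ((2.0, 1.0), "RC"),
    ((6.0, 6.0), "GS"), ((6.5, 7.0), "GS"), ((7.0, 6.0), "GS"),
    ((6.2, 6.4), "RC"),  # a noisy point sitting inside the GS cluster
]
query = (6.3, 6.3)

label_k1 = knn_predict(train, query, 1)  # decided by the noisy point alone
label_k3 = knn_predict(train, query, 3)  # noise is outvoted by two GS points
print(label_k1, label_k3)
```

With k = 1 the single noisy neighbor decides the label, while a larger k lets the surrounding majority outvote it, which matches the cards about small k values and noise.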
which of the following transformations is helpful with high variance
log scaling
Suppose in a classification problem, you are using a decision tree, and you use the Gini index as the criterion for the algorithm to select the feature for the root node. The feature with the _____ Gini index will be selected.
lowest
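A short sketch of the selection rule on this card, assuming the standard definitions: the Gini impurity of a node is 1 minus the sum of squared class proportions, and a split is scored by the size-weighted impurity of its child nodes. The tiny weather table and the function names are made up for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_split(rows, feature, target):
    """Weighted Gini impurity after splitting rows on feature."""
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[target])
    n = len(rows)
    return sum(len(g) / n * gini(g) for g in groups.values())


# Hypothetical training rows.
rows = [
    {"outlook": "sunny", "windy": "no", "rain": "no"},
    {"outlook": "sunny", "windy": "yes", "rain": "no"},
    {"outlook": "cloudy", "windy": "no", "rain": "yes"},
    {"outlook": "cloudy", "windy": "yes", "rain": "yes"},
]

scores = {f: gini_split(rows, f, "rain") for f in ("outlook", "windy")}
root = min(scores, key=scores.get)  # the lowest Gini index wins
print(scores, root)
```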
Which of the following is an example of relational OLAP?
metacube
a normalization technique in which values are shifted and rescaled so that they end up ranging from 0 to 1
min-max normalization
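A minimal sketch of min-max normalization to the range 0 to 1, assuming the common formula (x - min) / (max - min). Note that it both shifts and rescales, so it is not a plain division by one fixed number. The function name and sample values are hypothetical.

```python
def min_max_normalize(values):
    """Shift and rescale values so they range from 0 to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map them to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical feature values.
normalized = min_max_normalize([20, 30, 50, 70])
print(normalized)  # smallest value maps to 0.0, largest to 1.0
```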
Concept Hierarchy is useful in
multiple levels of abstraction
ID3 algorithm can also be used in clustering
no
Consider a production line producing computers. The shop manager would like a good estimate of the required number of worker hours given that a certain number of units must be produced. The manager collects a small sample of the number of worker hours for each lot size. The __________ task fit suggests a minimum of 10 hours of work and two extra hours for each additional unit that is produced.
regression
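The fit described in this card (a minimum of 10 hours plus 2 hours per unit) is a straight line, hours = 10 + 2 * units. A small sketch of how ordinary least squares recovers such a line follows; the sample data is invented so that it lies exactly on that line, and fit_line is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up sample: lot sizes and worker hours lying on hours = 10 + 2 * units.
lot_sizes = [10, 20, 30, 40]
hours = [30, 50, 70, 90]

intercept, slope = fit_line(lot_sizes, hours)
predicted = intercept + slope * 25  # estimated hours for a lot of 25 units
print(intercept, slope, predicted)
```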
Which one would be the best for predicting sales amounts?
regression
________ is the data mining task that is used to predict sales amounts of a new product.
regression
Machine learning techniques are used to automatically find patterns in data. The patterns that are found in the given data may be represented by the following (select all that apply):
structural descriptions and black-box models
A ______________ is utilized to measure the accuracy of a classification model.
test dataset
The idea presented in the upper portion of the picture takes time to cluster data. If an obstacle is encountered, it cannot be dealt with in real time. It needs to wait for an analysis to be initiated by a person. The idea presented in the lower portion of the picture detects what is relevant to the task. If an obstacle is encountered, an analysis can be made in line. It does not need to wait for an analysis to be initiated by a person. Select all that are true.
The first picture is about data mining; the second picture is about machine learning
A data warehouse differs from an operational database because most data warehouses have a product orientation and are tuned to handle transactions that update the database _______.
true
Association rule is suitable for marketing and sales promotion.
true
Classification is a predictive task
true
Discretization and Concept Hierarchy Generation divides the range of continuous attributes into intervals
true
Feature selection is a dimensionality reduction technique
true
Finding a sequential pattern in data is a descriptive data mining task.
true
If you want to handle noisy data, you can use regression
true
Multiple warehouses are needed in a database-centric solution. However, integrating the warehouses is a problem
true
Normalization is used to do data transformation
true
The approach to identifying frequently occurring terms in a document is data clustering.
true
The operational data is used as a source for the data warehouse
true
When feature values in the dataset vary drastically, z-score normalization should not be a good choice.
true
It is not necessary to have a target variable for applying dimensionality reduction in data reduction
true
A method that can be used to detect and resolve data equalities
truth discovery
