Data Mining Final
Data mining is a simple transformation of technology developed from databases, statistics, and machine learning.
False (it is not)
A process of finding a model that describes and distinguishes data classes or concepts, so that the model can be used to predict the class of objects whose class label is unknown. It predicts categorical (discrete, unordered) labels.
classification
database
clean data
has difficulty when the number of classes is large:
gini index
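As a rough illustration of the measure named on this card, the sketch below computes the Gini index of a set of class labels as 1 minus the sum of squared class proportions, plus the weighted Gini index of a split. The function names and the sample labels are made up for the example.

```python
from collections import Counter

def gini_index(labels):
    """Gini index of a label list: 1 - sum(p_i^2) over class proportions p_i."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def gini_split(partitions):
    """Weighted Gini index of a split, where each partition is a list of labels."""
    total = sum(len(p) for p in partitions)
    return sum(len(p) / total * gini_index(p) for p in partitions)

# Hypothetical sample: 9 "yes" and 5 "no" labels.
labels = ["yes"] * 9 + ["no"] * 5
print(round(gini_index(labels), 3))  # 0.459
```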
data warehouse
any data used, dirty data
how do you handle noisy data?
binning, regression, clustering, detect suspicious values
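Binning is the first technique listed here. A minimal sketch, assuming equal-frequency bins and smoothing by bin means (one of several possible binning variants); the function name and sample values are illustrative only.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, cut them into bins of bin_size, and replace
    every value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed

# Hypothetical sample values.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```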
A process to analyze data objects without consulting a known class label. The objects are clustered or grouped based on the principle of maximizing the intra-class similarity and minimizing the interclass similarity
clustering
1950-1990
computational science
a process that removes or transforms noise and inconsistent data
data cleaning
where multiple data sources may be combined
data integration
1990-now
data science
Clustering partitions data set into clusters based on ________.
similarities
The goal of data reduction is to obtain a reduced representation of the data set that is much ____ in volume yet produces the same ____ results.
smaller, analytical
regression
smooth by fitting data into a function
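One simple way to "fit data into a function" is a least-squares straight line. The sketch below assumes a single predictor and a linear model; `fit_line` and the sample points are hypothetical names and data chosen for the example.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical noisy measurements.
xs = [0, 1, 2, 3, 4]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]
slope, intercept = fit_line(xs, ys)

# Smoothed values are the points on the fitted line.
smoothed = [slope * x + intercept for x in xs]
print(smoothed)
```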
Noise is error or variance in a measured variable.
true
When data is integrated, redundant attributes may be generated. Such redundancy could be detected by ____ analysis and _____ analysis
covariance, correlation
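A minimal sketch of both measures for two numeric attributes, assuming population (divide by n) covariance and the Pearson correlation coefficient. The attribute names and values are made up; a coefficient near +1 or -1 suggests one attribute may be redundant.

```python
import math

def covariance(xs, ys):
    """Population covariance of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    std_x = math.sqrt(covariance(xs, xs))
    std_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (std_x * std_y)

# Hypothetical attributes: the second is just the first converted to other units.
height_cm = [150, 160, 170, 180]
height_in = [59.1, 63.0, 66.9, 70.9]
print(covariance(height_cm, height_in))
print(correlation(height_cm, height_in))
```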
an essential process where intelligent and efficient methods are applied in order to extract patterns
data mining
objects are made up of:
entities
The task of data cleaning is just to get rid of noisy data.
false
Decision tree is constructed in a bottom-up recursive divide-and-conquer manner.
false (top-down)
Dissimilarity of data is higher when objects are more alike
false (lower)
prefers unbalanced splits in which one partition is much smaller than the others:
gain ratio
data preprocessing
improves data quality
Another challenge in data mining is the parallel, distributed, and ____ processing of data mining algorithms. Due to the high cost of some data mining processes, ____ data mining algorithms incorporate database updates without the need to mine the entire data again from scratch. The two blanks should be filled with the same word. What is it?
incremental
has bias towards multivalued attributes:
information gain
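The bias toward multivalued attributes can be seen on a toy example. The sketch below assumes entropy-based information gain and uses gain ratio (information gain divided by split information) as the normalized comparison; the attribute names and data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(items):
    """Shannon entropy (in bits) of the value distribution in items."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())

def partition(values, labels):
    """Group the labels by the attribute value of each row."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[value].append(label)
    return list(groups.values())

def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after splitting."""
    n = len(labels)
    after = sum(len(g) / n * entropy(g) for g in partition(values, labels))
    return entropy(labels) - after

def gain_ratio(values, labels):
    """Information gain normalized by the split information of the attribute."""
    split_info = entropy(values)
    return information_gain(values, labels) / split_info if split_info > 0 else 0.0

labels = ["yes", "yes", "no", "no"]
row_id = [1, 2, 3, 4]           # unique per row, useless for prediction
color = ["a", "a", "b", "b"]    # two values, separates the classes

print(information_gain(row_id, labels), information_gain(color, labels))  # 1.0 1.0
print(gain_ratio(row_id, labels), gain_ratio(color, labels))              # 0.5 1.0
```

Information gain scores the unique `row_id` as highly as `color`, while gain ratio penalizes the many-valued split.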
where visualization and knowledge representation techniques are used to present the mined knowledge to the user
knowledge presentation
Central tendency of data can be measured by mean, median, and ____
mode
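All three measures are available in Python's standard `statistics` module. A short sketch on made-up values (`multimode` requires Python 3.8 or later and returns every most-frequent value):

```python
import statistics

# Hypothetical sample values.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(statistics.mean(values))       # 58
print(statistics.median(values))     # 54.0
print(statistics.multimode(values))  # [52, 70]
```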
The two-step process of classification consists of the following:
model construction, model usage
what are the 5 data attribute types?
nominal, binary, ordinal, interval-scaled, ratio-scaled
data sets are made up of:
objects
A process to analyze the objects that do not comply with the general behavior or model of the data. Examples include fraud detection based on a large dataset of credit card transactions
outlier analysis
An induced tree may _____ the training data when it has too many branches. Some may reflect anomalies due to noise or outliers.
overfit
where data relevant to the analysis task are retrieved from the database
data selection
where data are transformed or consolidated into forms appropriate for mining
data transformation
clustering
detect and remove outliers
The need for data mining has arisen due to the wide availability of huge amounts of data and the imminent need for turning such data into useful _____ and ____
knowledge, information
In supervised learning, the training data are accompanied by ____, which indicate the class of the observations.
labels
Correlation analysis measures the _________ relationship between objects.
linear
a process that identifies the truly interesting patterns representing knowledge based on some interestingness measures
pattern evaluation
A process to model continuous-valued functions. It is used to predict missing or unavailable numerical data values rather than (discrete) class label
regression
One challenge to data mining regarding performance issues is the ___ and ___ of data mining algorithms, because it is extremely important to effectively extract information from large amounts of data in databases within predictable and acceptable running times.
scalability, efficiency
data normalization
scales data
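Two common ways to scale an attribute are min-max normalization and z-score normalization. The sketch below is one possible implementation; it assumes the population standard deviation, and the function names and sample values are illustrative.

```python
import statistics

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to the lower bound.
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Center values on the mean and divide by the population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

# Hypothetical sample values.
incomes = [10, 20, 30]
print(min_max(incomes))  # [0.0, 0.5, 1.0]
print(z_score(incomes))
```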
data visualization
search patterns, trends among data
The purpose of data pre-processing is to improve data quality.
true