Data Mining Final Questions
To use a trained decision tree model to make predictions for a scoring data set in RapidMiner, you would use the _________ operator. Apply Model | Predict Model | Apply | Predict()
Apply Model
A _________ is a type of data store that is intentionally created for a specific business unit, usually for management and reporting purposes. data extraction | data mart | data warehouse | relational database
data mart
The mathematical formula for calculating the ________ percent is the number of times an association did occur divided by the number of times the premise (or antecedent) occurred in the data set. support | confidence | Laplace | gain
confidence
__________ is a statistical measure of how strong the relationships are between attributes in a data set.
correlation
In RapidMiner, which operator is used to create correlation coefficients? Coefficient Matrix | Statistical Correlation | Correlate | Correlation Matrix
correlation matrix
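As a rough illustration of what a correlation matrix contains, the sketch below computes Pearson correlation coefficients for every pair of attributes in plain Python. It is not RapidMiner's implementation; the attribute names and values are made up for the example, and `pearson` is a hypothetical helper.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data set: one list of values per attribute (column).
data = {
    "heating_oil": [10, 20, 30, 40, 50],
    "insulation": [9, 7, 6, 3, 1],
    "avg_age": [30, 35, 42, 50, 61],
}

names = list(data)
# The matrix holds one coefficient for every pair of attributes.
matrix = {a: {b: pearson(data[a], data[b]) for b in names} for a in names}

for a in names:
    print(a, [round(matrix[a][b], 2) for b in names])
```

Each attribute correlates perfectly with itself, so the diagonal is 1, and the matrix is symmetric because the coefficient does not depend on the order of the two attributes.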
Which of the following is NOT one of the six steps of the CRISP-DM process? Evaluation | Data Execution | Data Understanding | Deployment
data execution
Given three political parties (Republican, Democratic, and Green), logistic regression can be used to predict which party will win the election.
false
If you attempt to make a prediction for an out-of-range scoring observation in a linear regression model in RapidMiner, the software will throw an error.
false
The Logistic Regression operator in RapidMiner offers only one algorithm for model creation.
false
k-Means clustering is a data mining model that predicts values.
false
True or false: A support percent of 18% in an association set would be considered too low to be of any use.
false
True or false: Business (or Organizational) Understanding and Data Understanding are not necessary when you are confident that data is prepared and ready for analysis.
false
True or false: Conducting data mining and analytics on high-volume transactional database systems is recommended because such systems have the most up-to-date data.
false
True or false: Decision tree models do not provide confidence percentages alongside their predictions.
false
True or false: FP-Growth creates association rules in RapidMiner.
false
True or false: In decision tree models, all independent variables are given equal weight when making predictions.
false
True or false: Scatterplots can only show the correlation of two attributes at a time.
false
True or false: The cluster number assigned to each cluster in a k-Means model indicates the relative importance of each cluster when compared to the others.
false
True or false: When correlation coefficients between two attributes reach the "strong" or "very strong" ranges, you have discovered statistical evidence that one of the attributes causes the other to change in some way.
false
Unlike in linear regression, it is possible to have more than one dependent variable in a logistic regression model.
false
Which of the following is NOT another name for a row of data in a database? case | example | field | observation | record | tuple
field
To examine all records in one specific cluster in RapidMiner, use a ______________ operator. Sample | Filter Examples | Cluster | Select Attributes
filter examples
To remove out-of-range values from a scoring data set in RapidMiner, use a ___________ operator.
filter examples
To remove unwanted observations from a data set in RapidMiner, use the _______ operator. Select Attributes | Remove Observations | Filter Observations | Filter Examples
filter examples
To remove unwanted or unusable observations from data sets in RapidMiner, use the _______ operator. Filter Examples | Select Attributes | Declare Missing Values | Replace
filter examples
To view which observations are assigned to each cluster in a k-Means model in RapidMiner, use the ________ feature.
folder view
Sloppy organization of data causing dubious analysis results is an example of ________. Data Preparation | Lazy Modeling | Garbage In, Garbage Out | Data Understanding
garbage in, garbage out
To calculate the sum total of all predictions in a linear regression operator in RapidMiner, use a(n) __________ operator. Sum | Total | Aggregate | Summarize
aggregate
In the Neural Net operator in RapidMiner, which of the following parameters will cause the model to stop the training process if its value is reached? Training cycles | Momentum | Learning rate | All of these would stop training if their value is reached.
training cycles
In RapidMiner, the attribute you wish to predict must be set to the role of _________.
label
In a decision tree, the dependent variable value found at the end of each path through the tree is known as a _________. fork | branch | leaf | node
leaf
The output of a correlation is called a ________. contingency table | match table | similarity index | matrix
matrix
To prevent a k-Means model for a large data set from taking a long time to run, you can adjust the _________ parameter in RapidMiner. divergence | max runs | measure types | start values
max runs
In linear regression, the p-values for each independent variable must be smaller than ________. the intercept | the value of the y variable | alpha | the coefficients for each independent variable
alpha
The k-Means clustering technique for data analysis is ideal for _________. machine learning | prediction | time series forecasting | segmentation
segmentation
To remove unwanted attributes from a data set in RapidMiner, use the _______ operator. Filter Examples | Select Attributes | Remove Attributes | Select Examples
select attributes
Correlation coefficients are generally considered strong if they are at least _________. 0.2 or -0.2 | 0.95 or -0.95 | 0.6 or -0.6 | none of the above
0.6 or -0.6
In RapidMiner, if the number of hidden layers is not specified by the analyst, how many hidden layers will be used to train a neural network? 2 | 10 | 5 | 1
1
The default Confidence Percent used for logistic regression models is _______.
95%
The operator required in RapidMiner to find frequency patterns in a data set is called ________. FP-Growth | FP-Find | FP-Detect | FP-Associate
FP-Growth
Increasing which parameter of the Decision Tree operator in RapidMiner would reduce the size of the tree? Minimal Leaf Size | Criterion | Confidence | Split Size
Minimal Leaf Size
Databases designed to support a dimensional examination and aggregation are referred to as ________. SQL systems | OLAP systems | ENTP systems | OLTP systems
OLAP systems
Databases designed to support a high number of reads and writes are referred to as ________. OLTP systems | OLAP systems | SQL systems | ENTP systems
OLTP systems
If a training data set in RapidMiner contains a non-predictive, numeric identification column, how must this be handled when creating logistic regression models? Nothing. RapidMiner can detect and ignore identification columns in training data. | The role for the identification column must be set to "ID". | Nothing. RapidMiner will allow you to remove the identification column in Results view after the model is created. | The role for the identification column must be set to "label".
The role for the identification column must be set to "ID".
Data types for independent variables in a decision tree model must be _______. numeric | binary | text | any of the above
any of the above
Association rules use the _______ algorithm to find frequently associated attributes in a data set. apriori | incarini | logetti | associati
apriori
In a decision tree model represented visually in RapidMiner, the first predictive independent variable is represented __________. at the bottom | on the right | on the left | at the top
at the top
Removing columns from a data set because they are not useful for a certain type of data analysis is an example of _________. attribute reduction | observation reduction | content reduction | document reduction
attribute reduction
The data type of the dependent variable in logistic regression must be ________.
binominal
The required data type for all attributes in an association rule model in RapidMiner is _______. binary | polynominal | binominal | integer
binominal
In RapidMiner, which of the following will automatically be generated when the Apply Model operator applies a neural network model to a scoring data set? both class predictions and confidence percentages | class predictions only | both class predictions and standard error values | confidence predictions only
both class predictions and confidence percentages
Establishing permission to use company data in analytic activities should take place during which CRISP-DM phase? Business (or Organizational) Understanding | Data Preparation | Deployment | Evaluation
business understanding
________ is the first step in the CRISP-DM process. Business (or Organizational) Understanding | Evaluation | Modeling | Data Preparation
business understanding
Which of the following is NOT another name for a column of data in a database? attribute | case | variable | field
case
The averages for each attribute in each cluster created by a k-Means model are called _______. midpoints | complex means | centroids | simple means
centroids
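To make the idea of a centroid concrete, here is a minimal sketch that averages each attribute within each cluster. It assumes cluster assignments already exist (for example, from a k-Means run); the observations, attribute names, and the `centroids` helper are hypothetical.

```python
# Hypothetical observations that have already been assigned to clusters.
observations = [
    {"cluster": 0, "weight": 150.0, "cholesterol": 180.0},
    {"cluster": 0, "weight": 170.0, "cholesterol": 200.0},
    {"cluster": 1, "weight": 110.0, "cholesterol": 140.0},
    {"cluster": 1, "weight": 130.0, "cholesterol": 160.0},
    {"cluster": 1, "weight": 120.0, "cholesterol": 150.0},
]

def centroids(rows, attributes):
    """Return the mean of each attribute for each cluster."""
    sums = {}
    counts = {}
    for row in rows:
        cluster = row["cluster"]
        counts[cluster] = counts.get(cluster, 0) + 1
        totals = sums.setdefault(cluster, {a: 0.0 for a in attributes})
        for a in attributes:
            totals[a] += row[a]
    return {
        cluster: {a: sums[cluster][a] / counts[cluster] for a in attributes}
        for cluster in sums
    }

result = centroids(observations, ["weight", "cholesterol"])
for cluster, means in sorted(result.items()):
    print(cluster, means)
```

The cluster numbers here are only labels for grouping; they carry no ranking.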
In linear regression, the m variable is the independent variable's ________.
coefficient
The values in correlational analysis results are called _______. convergences | covariates | coefficients | contingencies
coefficients
Evaluation within the CRISP-DM process is intended to ensure that ________. data inputs have numeric data types | data mining results are reliable and useful | employees within a company know about data mining | the company has complied with governmental regulations
data mining results are reliable and useful
Reformatting phone numbers so that they all conform to a 12-character standard, such as 123-456-7890, would take place in which phase of the CRISP-DM process? Deployment | Data Preparation | Evaluation | Data Understanding
data preparation
A __________ is a subset of a database or data warehouse usually created for a specific analytic purpose. relational database | reduction | denormalization | data set
data set
Auditing an organization's databases, spreadsheets, file servers, and records repositories for information to use in data analysis is an example of ________. Modeling | Deployment | Business (or Organizational) Understanding | Data Understanding
data understanding
The attribute you want to predict in a predictive model is called a(n) _________. dependent variable | independent variable | identifying variable | category variable
dependent variable
To see the size of each cluster in RapidMiner, click the ______________ icon in Results view. Folder View | Description | Centroid Table | Graph
description
Changing one categorical attribute (e.g., "Blue," "Red," "Green") into a series of binary attributes (e.g., Blue = 0/1; Red = 0/1; Green = 0/1) is known as _________. dummification | digitization | binarification | dummy coding
dummy coding
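A minimal sketch of dummy coding, using the Blue/Red/Green example from the card above. The `dummy_code` helper is hypothetical; it produces one 0/1 attribute per distinct category.

```python
def dummy_code(values):
    """Turn a list of categorical values into one 0/1 attribute per category."""
    categories = sorted(set(values))
    return [{c: 1 if v == c else 0 for c in categories} for v in values]

colors = ["Blue", "Red", "Green", "Red"]
coded = dummy_code(colors)

for original, row in zip(colors, coded):
    print(original, row)
```

Every coded row contains exactly one 1, in the column that matches the original value.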
Which of the following is NOT an example of data scrubbing as described in the text? handling inconsistent data | reducing data | handling marginal data | handling missing data
handling marginal data
The space between the independent variables and the dependent variable where a neural network model gets trained is called the ____________. training layer | hidden layer | neural layer | synapse layer
hidden layer
A value of "middle-aged" in an attribute that otherwise contains people's ages in number of years would be an example of _______. modified data | inconsistent data | aged data | alphabetic data
inconsistent data
An attribute used to predict outcome values in a predictive model is called a(n) _________. dependent variable | independent variable | identifying variable | category variable
independent variable
What is calculated in the nodes of the hidden layer of a neural network? dependent variable weights | independent variable ranges | dependent variable ranges | independent variable weights
independent variable weights
In linear regression, the b variable is the model's ________.
intercept coefficient
What type of correlation occurs when two attributes are correlated to one another, and as the values in one attribute increase, the values in the other attribute decrease? mutual | neutral | negative | positive
negative
In a decision tree, the independent variable found at each branch of the tree is known as a _________. fork | branch | leaf | node
node
The data type for the dependent variable in a classification decision tree model must be __________. Nominal | Numeric | Binary | None of these
nominal
Missing values in a data set mean that ________. the data contains errors | the data set is unusable | the data set contains outliers | none of the above
none of the above
In linear regression, p-values larger than alpha indicate that their corresponding independent variables are __________.
not statistically significant
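The comparison of p-values against alpha can be sketched as a simple filter. The p-values and attribute names below are invented for illustration, and alpha = 0.05 is an assumption that corresponds to a 95% confidence level.

```python
ALPHA = 0.05  # assumed significance threshold (95% confidence)

# Hypothetical p-values for the independent variables of a regression model.
p_values = {
    "insulation": 0.001,
    "avg_age": 0.030,
    "num_occupants": 0.640,
}

# Variables with p-values smaller than alpha are statistically significant.
significant = {name for name, p in p_values.items() if p < ALPHA}
not_significant = set(p_values) - significant

print("significant:", sorted(significant))
print("not statistically significant:", sorted(not_significant))
```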
The data type required for independent variables in a neural network model must be _________. categorical | label | binary | numeric
numeric
The data types of all independent variables in logistic regression must be _________.
numeric
What data type must be assigned to the dependent variable in RapidMiner when building linear regression models? Label | Target | Numeric | Polynominal
numeric
Line-by-line records of each item sold at a grocery store would be an example of ________. organizational data | strategic data | operational data | aggregate data
operational data
Data analysis processes in RapidMiner are built using rectangular building blocks called _________.
operators
What type of correlation occurs when two attributes are correlated to one another, and as the values in one attribute decrease, the values in the other attribute also decrease? positive | negative | mutual | neutral
positive
Discriminant analysis, k-Nearest Neighbors, and Naïve Bayes are all data mining models used to __________ data values. categorize | predict & categorize | guess | predict
predict & categorize
Considerations for Data Understanding include all of the following EXCEPT _________. accuracy of the data | presentation of the data | age of the data | completeness of the data
presentation of the data
The Naïve Bayes technique for predicting categorical outcomes employs both ________ and _______. outliers; skewness | probability; outliers | variance; skewness | probability; variance
probability; variance
If a data analyst finds that a decision tree model has too many nodes or leaves to be meaningful, the analyst should apply _________ to the tree. chopping | lifting | stumping | pruning
pruning
Association rules data mining is often used to produce ________. ideations | forecasts | predictions | recommendations
recommendations
Removing records that contain missing or inconsistent data from a data set before analysis is an example of _________. purging | reduction | missing values | data mining
reduction
Selecting some subset of records from a data set is called ________ the data. modifying | sampling | morphing | correcting
sampling
Data that does not have known outcome values for an attribute you wish to predict is known as ________ data. unknown | scoring | evaluating | training
scoring
The mathematical formula for calculating the ________ percent is the number of times an association did occur divided by the number of times it could have occurred in the data set. support | confidence | Laplace | gain
support
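The support and confidence formulas described in these cards can be sketched in a few lines of Python. The transactions are hypothetical, and the sketch assumes that "the number of times it could have occurred" means the total number of observations in the data set.

```python
# Hypothetical market-basket data: each set is one observation (transaction).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def support(premise, conclusion, rows):
    """Times the association occurred / number of observations (assumed denominator)."""
    both = sum(1 for row in rows if premise <= row and conclusion <= row)
    return both / len(rows)

def confidence(premise, conclusion, rows):
    """Times the association occurred / times the premise (antecedent) occurred."""
    premise_count = sum(1 for row in rows if premise <= row)
    both = sum(1 for row in rows if premise <= row and conclusion <= row)
    return both / premise_count if premise_count else 0.0

rule_support = support({"bread"}, {"milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(f"bread -> milk: support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

For the rule bread -> milk, the two items appear together in 3 of 5 transactions (support 0.60), and bread appears in 4 transactions (confidence 3/4 = 0.75).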
In neural networks, the pathways between independent variables and dependent variables are called __________. synapses | convergences | line paths | propagations
synapses
Data arranged into columns and rows in a database are stored in ________. warehouses | tables | dimensional arrays | matrices
tables
In RapidMiner, if one or more independent variables has a non-numeric data type, what would be required for the Neural Net operator to work correctly? The non-numeric independent variables could be recoded to numeric values or excluded from the model. | Nothing. The Neural Net operator in RapidMiner will automatically convert independent variable data types if it needs to. | Nothing. The Neural Net operator in RapidMiner will use a Linear Regression model instead. | Nothing. Like Decision Trees in RapidMiner, the Neural Net operator can also handle independent variables of all different data types.
the non-numeric independent variables could be recoded to numeric values or excluded from the model
The k in k-Means indicates ________. the intercept for the model | the number of clusters desired | the coefficient of the dependent variable | the coefficient of the independent variable
the number of clusters desired
In a neural network model, how many nodes will the output layer always have? the number of variables used in the scoring dataset | the number of distinct values in the dependent variable | a random number based on how many training iterations (cycles) occur | the number of variables used to train the model
the number of distinct values in the dependent variable
Data containing known outcome values for an attribute you wish to predict is known as ________ data. testing | training | evaluating | scoring
training
Which parameter of the Neural Net operator in RapidMiner will limit the number of forward/backward propagations? training cycles | learning rate | momentum | maximum iterations
training cycles
Using two or three different modeling techniques on the same data and then comparing predicted outcomes across the different models is called _________. the circle method | trial-and-error analytics | hypothesis testing | triangulation
triangulation
All of the attribute values must be numeric if one wants to use the k-Means Clustering model.
true
In RapidMiner, the label (dependent variable) can be coded either alphabetically (e.g., true/false) or numerically (e.g., 0/1).
true
In logistic regression, the smaller the p-value for an independent variable, the more predictive power that variable has relative to the dependent variable.
true
The values true/false or 0/1 would both be valid combinations for the dependent variable in a logistic regression model.
true
True or false: A confidence percent of 18% in an association set would be considered too low to be of any use.
true
True or false: Data mining modeling techniques can classify, predict, or both.
true
True or false: In RapidMiner, data can be either imported or read into the software from CSV, text, and spreadsheet files.
true
True or false: In RapidMiner, if the min support parameter on the FP-Growth operator is set at 0.8 and the min confidence on the Create Association Rules operator is set at 0.75, the software may still return rules if an association has confidence of 88% but support of just 52%.
true
True or false: It is permissible at any point in the CRISP-DM process to return to an earlier step.
true
True or false: Scatterplots are a method of visualizing statistical correlations.
true
Value ranges for all attributes for every observation in a scoring data set must be within the value ranges for the corresponding attributes in the training data set in a linear regression model.
true
In linear regression, the x variable is the independent variable's ________.
value
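The linear regression cards refer to the familiar form y = mx + b, where m is the coefficient, x is the independent variable's value, and b is the intercept. The sketch below applies that formula and also flags scoring values that fall outside the training range. The coefficient, intercept, and data are hypothetical, and the range check is an illustrative stand-in for filtering a scoring set, not RapidMiner behavior.

```python
def predict(m, b, x):
    """Apply y = m*x + b for a single independent variable."""
    return m * x + b

def in_training_range(x, training_xs):
    """True if x lies within the range of values seen in training."""
    return min(training_xs) <= x <= max(training_xs)

# Hypothetical model and data.
m, b = 2.5, 10.0                   # assumed coefficient and intercept
training_xs = list(range(1, 11))   # training values 1..10
scoring_xs = [3, 12]

results = [
    (x, predict(m, b, x), in_training_range(x, training_xs))
    for x in scoring_xs
]
for x, y, ok in results:
    note = "" if ok else " (out of range: prediction is unreliable)"
    print(f"x={x} -> y={y}{note}")
```

The formula still yields a number for x = 12, which is why out-of-range observations have to be removed deliberately rather than relying on the software to raise an error.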