MSIS-4263 Midterm
In CRISP-DM methodology, how many sequential steps exist?
6
Which of the following is true about what clustering can do?
Assigning customers to different segments
Random sampling of a fixed number of instances from the original data with replacement to construct the training data set is achieved by
Bootstrapping
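A minimal sketch of bootstrap sampling, assuming the data set fits in a Python list. The helper name `bootstrap_sample` and the sample values are hypothetical, chosen only for illustration.

```python
import random

def bootstrap_sample(data, n, seed=None):
    """Draw n instances from data uniformly at random, with replacement."""
    rng = random.Random(seed)
    return rng.choices(data, k=n)

original = [10, 20, 30, 40, 50]
training = bootstrap_sample(original, n=len(original), seed=0)
# Because sampling is with replacement, an instance may appear more than once
# and some instances may not appear at all.
print(training)
```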
Identifying the goals, purpose, and requirements of the customers is achieved in which step of the CRISP-DM process?
Business Understanding
The most relevant methodology that is used to implement data science and business analytics projects is
CRISP-DM methodology
______________ classification approach uses historical samples and cases to identify commonalities in order to assign a new case to the most similar category.
Case-based reasoning
Which of the following is not a supervised machine learning algorithm?
Clustering
When an SVM prediction model is developed, it can be integrated into a decision support system by which of the following methods?
Computational object
In classification problems, the main source for all accuracy estimation metrics is a
Confusion matrix
Which of the following provides an estimate of the degree of linear association between numerically represented variables?
Correlation
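As an illustration, the Pearson correlation coefficient can be computed directly from its definition. This is a sketch only; `pearson_r` and the made-up `hours`/`score` values are assumptions, not part of the course material.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

hours = [1, 2, 3, 4, 5]          # made-up study hours
score = [52, 55, 61, 64, 70]     # made-up exam scores
print(round(pearson_r(hours, score), 3))
```

Values near +1 or -1 indicate a strong linear association; values near 0 indicate a weak one.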
The _____________ method's common idea is to split the data sample into a number of randomly drawn, disjoint subsamples.
Cross validation
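A rough sketch of how k-fold cross validation might partition the data, assuming instances are addressed by index. The function name `k_fold_indices` is hypothetical.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = k_fold_indices(n=10, k=5)
for i, test_fold in enumerate(folds):
    # Each fold serves once as the test set; the remaining folds form the training set.
    train = [j for f in folds if f is not test_fold for j in f]
    print(f"fold {i}: test={sorted(test_fold)} train_size={len(train)}")
```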
Identifying the relevant data from different sources is achieved in which step of the CRISP-DM process?
Data Understanding
Usually, which step in the CRISP-DM process takes the most time to complete?
Data preparation
Data mining is primarily concerned with mining (i.e., digging out data) from a variety of disparate data sources.
False
If I am distributing funds to different financial products to maximize return, I am essentially doing descriptive analytics.
False
If a classification problem is not binary, we cannot use a confusion matrix to tabulate prediction outcomes.
False
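A confusion matrix extends naturally to more than two classes: it simply counts every (actual, predicted) label pair. A minimal sketch, with hypothetical labels and a hypothetical helper named `confusion_matrix`:

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count (actual, predicted) label pairs; works for any number of classes."""
    return Counter(zip(actual, predicted))

actual    = ["low", "mid", "high", "mid", "low",  "high"]
predicted = ["low", "mid", "mid",  "mid", "high", "high"]
cm = confusion_matrix(actual, predicted)

labels = sorted(set(actual) | set(predicted))
for a in labels:
    # One row per actual class, one column per predicted class.
    print(a, [cm[(a, p)] for p in labels])
```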
In SEMMA process, visualization and description of the data is carried out in the modify step.
False
In banking and finance, data mining is often used to manage microeconomics movements and overall cash flow outcomes.
False
In linear regression, the independence of errors assumption is also known as homoscedasticity.
False
In linear regression, hypothesis testing reveals the existence of relationships between explanatory (i.e., input) variables.
False
In the normality of error assumption of linear regression, the response variable values are expected to be randomly distributed.
False
In the project finalization task, both CRISP-DM and SEMMA methodologies prescribe deploying the results.
False
In the testing and evaluation step of CRISP-DM methodology, monitoring and maintenance of the models are important.
False
Linear Regression aims to capture the functional relationships between one or more numeric input variables and a categorical output variable.
False
Logistic regression is like linear regression where both of them are used to predict a numeric target variable.
False
Major commercial business intelligence products and services were well established in the early 1970s.
False
One of the most pronounced reasons for the increasing popularity of data mining is that there are fewer suppliers than corresponding demand in the business marketplace.
False
Prediction modeling is often classified under the unsupervised machine learning methods.
False
The Naïve Bayes method requires output variables to have numeric values.
False
The area under the ROC curve is a graphical assessment technique for binary classification problems, in which sensitivity is plotted on the y-axis and the specificity is plotted on the x-axis.
False
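The x-axis of a ROC curve is the false positive rate, i.e. 1 minus specificity, and the area under the curve is a single number summarizing it. The sketch below assumes binary labels coded as 0/1 and a model that outputs scores; `roc_points`, `auc`, and the sample data are hypothetical.

```python
def roc_points(actual, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    pos = sum(actual)
    neg = len(actual) - pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for a, s in zip(actual, scores) if s >= threshold and a == 1)
        fp = sum(1 for a, s in zip(actual, scores) if s >= threshold and a == 0)
        # x = 1 - specificity (false positive rate), y = sensitivity
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under consecutive ROC points."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

actual = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
pts = roc_points(actual, scores)
print(pts, auc(pts))
```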
The modify step in Six-Sigma involves the process of assessing the mapping between organizational data repositories and the business problem.
False
The multi split methodology partitions data into exactly two mutually exclusive subsets called training set and test set.
False
The ratio of correctly classified positives divided by the total actual positive count is defined as a precision metric.
False
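The ratio described in the statement is sensitivity (recall); precision divides by the number of predicted positives instead. A small sketch with made-up counts:

```python
def sensitivity(tp, fn):
    """True positive rate (recall): TP / (TP + FN), i.e. out of all actual positives."""
    return tp / (tp + fn)

def precision(tp, fp):
    """TP / (TP + FP), i.e. out of all predicted positives."""
    return tp / (tp + fp)

tp, fp, fn = 40, 10, 20   # made-up counts for illustration
print(sensitivity(tp, fn), precision(tp, fp))
```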
____________ clustering methods are based on the basic idea that nearby objects are more related to each other than are those that are farther away from each other.
Hierarchical
During which step in DMAIC are the identified data sources consolidated and transformed into a format that is amenable to machine processing?
Measure
________________ is the occurrence of high intercorrelations among two or more independent variables in a multiple regression model.
Multicollinearity
Which of the following relates to a pattern-recognition methodology for machine learning?
Neural computing
The categorical data contains
Nominal
The types of patterns discovered with data mining include all of these, except:
Optimization
Customer credit ratings such as bad, fair, and excellent are considered what type of data?
Ordinal
_____________ is a type of linear least squares method for estimating the unknown parameters in a linear regression model.
Ordinary least squares
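For a single explanatory variable, the ordinary least squares estimates have a closed form. The sketch below assumes simple (one-variable) regression; `ols_fit` and the sample points are hypothetical.

```python
def ols_fit(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept, slope

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]   # generated from y = 1 + 2x, so the fit should recover those values
print(ols_fit(xs, ys))
```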
In retailing, data mining is most commonly used to
Predict future sales
Data mining is an essential part of which type of analytics in the analytics taxonomy?
Predictive
In brokerages and securities trading, data mining is used to
Prevent fraudulent activities
____________ is the statistical measure of a regression model that is also known as the coefficient of determination.
R-squared
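One common way to compute R-squared is 1 minus the ratio of the residual sum of squares to the total sum of squares. A sketch under that definition, with made-up values:

```python
def r_squared(actual, predicted):
    """1 - SS_res / SS_tot: share of the variance in actual explained by the model."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    return 1 - ss_res / ss_tot

actual = [3, 5, 7, 9]
predicted = [2.8, 5.3, 6.9, 9.2]   # made-up model output
print(round(r_squared(actual, predicted), 3))
```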
The well-known standardized process for data analytics which was developed by SAS is called
SEMMA methodology
In data mining, clustering is classified further into
Segmentation, Outlier Analysis
Which of the following is not among the main assumptions in linear regression?
Simplicity
The ratio of accurately classified negatives divided by the total negative count is defined as
Specificity
The ratio of correctly classified negatives divided by the total negative count is called:
Specificity
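In terms of confusion matrix counts, specificity is TN / (TN + FP). A one-function sketch with made-up counts:

```python
def specificity(tn, fp):
    """True negative rate: correctly classified negatives over all actual negatives."""
    return tn / (tn + fp)

tn, fp = 90, 10   # made-up counts for illustration
print(specificity(tn, fp))
```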
The primary difference between statistics and data mining is
Statistics starts with a well-defined proposition and hypothesis whereas data mining starts with a loosely defined discovery statement.
A typical example of interval scale measurement is the temperature on the Celsius scale.
True
Analytics is the art and science of discovering insight to support accurate and timely decision making.
True
Apriori and FP-Growth algorithms are part of the association type data mining tasks.
True
Association patterns can also include capturing the sequence of events and things.
True
Business intelligence is nothing more than the descriptive analytics part of the simple business analytics taxonomy.
True
CRM aims to create one-on-one relationships with customers by developing an intimate understanding of their needs and wants.
True
Cubes in OLAP are defined as multidimensional representation of the data stored in and retrieved from data warehouses.
True
Data mining leverages capabilities of statistics, artificial intelligence, machine learning, management science, information systems, and databases, in a systematic and synergistic way.
True
During the model building step in CRISP-DM process, the data mining methods and algorithms are applied to the current data set.
True
ERP stands for enterprise resource planning and is used for the integration of company-wide data.
True
Homoscedasticity states that the response variables must have the same variance in their error, regardless of the explanatory variables' values.
True
In SEMMA process, the accuracy and usefulness of the models are evaluated in the assess step.
True
In prediction, linear regression uses a mathematical equation to identify additive mathematical relationships between explanatory variables and the response variable.
True
In the model-building task, both CRISP-DM and SEMMA methodologies build and test various models.
True
In the retail industry, association rule mining is frequently called market-basket analysis.
True
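Association rules over shopping baskets are typically scored by support and confidence. The sketch below assumes each transaction is a Python set of item names; the helper names and the baskets are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in items."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Support of the whole rule divided by support of its antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
# Rule: bread -> milk
print(support(baskets, {"bread", "milk"}), confidence(baskets, {"bread"}, {"milk"}))
```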
Manufacturers use data mining to classify anomalies and commonalities in the production system to improve the manufacturing system.
True
Multicollinearity can be triggered by having two or more perfectly correlated explanatory variables present in the model.
True
Organizations apply analytics to business problems to identify problems, foresee future trends, and make best possible decisions.
True
Six Sigma process promotes an error-free/perfect business execution.
True
The important part of KDD process is the feedback loop that allows the process flow to redirect backward, from any step to any other previous steps, for rework and readjustments.
True
The purpose of data preparation, also commonly known as data preprocessing, is to eliminate the possibility of GIGO (garbage in, garbage out) errors.
True
The ratio of accurately classified instances (positives and negatives) divided by the total number of instances is defined as the overall accuracy metric.
True
Today, analytics can be defined as simply as "the discovery of information/knowledge/insight in data."
True
k-NN is a prediction method used not only for classification but also for regression-type prediction problems.
True
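The same nearest-neighbor lookup can serve both problem types: a majority vote for classification, an average for regression. A minimal sketch, assuming a single numeric feature; all names and data are hypothetical.

```python
from collections import Counter

def k_nearest(train, query, k):
    """Return the targets of the k training points closest to query (1-D feature)."""
    ranked = sorted(train, key=lambda pair: abs(pair[0] - query))
    return [target for _, target in ranked[:k]]

def knn_classify(train, query, k=3):
    """Majority vote among the k nearest neighbors."""
    return Counter(k_nearest(train, query, k)).most_common(1)[0][0]

def knn_regress(train, query, k=3):
    """Mean target value of the k nearest neighbors."""
    neighbors = k_nearest(train, query, k)
    return sum(neighbors) / len(neighbors)

labeled = [(1.0, "A"), (1.5, "A"), (2.0, "A"), (8.0, "B"), (9.0, "B")]
valued = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0), (8.0, 80.0)]
print(knn_classify(labeled, 1.8), knn_regress(valued, 2.1))
```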
Association and clustering type patterns are often classified as the result of
Unsupervised learning procedures
Business Analytics is the process of developing code and frameworks.
False
In linear regression the relationship between the variables can be represented as:
All the answers are true: mathematical equation, additive function, linear representation, linear coefficient
Which one of the following represents unstructured data?
All answers are true: multimedia, XML/HTML, pictures, textual
Which of the following application areas make use of association rule mining:
All answers are true: sales transactions, medical records, credit card transactions, banking services
The tasks that are followed in the SVM model when performing data preprocessing include:
All the answers are true: handling noisy values, handling missing and incomplete data, normalizing the data, numericizing the data
Business intelligence is a broad concept that also includes business analytics within its simple taxonomy.
False
The CRISP-DM methodology was proposed by Fayyad et al. in 1996.
False
Analytics and analysis are essentially the same thing; they both focus on the granular level representation of complex problems through decomposition of the whole into its lower-level parts.
False
Balancing skewed data means oversampling of the more represented class records and under sampling of the less represented class records.
False
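Balancing usually goes the other way: the less represented class is oversampled, or the more represented class is undersampled. A minimal sketch of random oversampling, assuming each record is a tuple whose first field is the class label; `oversample_minority` and the records are hypothetical.

```python
import random

def oversample_minority(records, label_of, seed=0):
    """Randomly duplicate records of smaller classes until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for r in records:
        by_class.setdefault(label_of(r), []).append(r)
    target = max(len(rows) for rows in by_class.values())
    balanced = []
    for rows in by_class.values():
        # k is 0 for the largest class, so it is left unchanged.
        extra = rng.choices(rows, k=target - len(rows))
        balanced.extend(rows + extra)
    return balanced

records = [("fraud", 1)] * 2 + [("ok", 0)] * 8   # skewed 2-to-8
balanced = oversample_minority(records, label_of=lambda r: r[0])
print(sum(r[0] == "fraud" for r in balanced), sum(r[0] == "ok" for r in balanced))
```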
Bootstrapping methodology is similar to the leave-one-out methodology in that it can be used to calculate accuracy by leaving one sample out at each iteration of the estimation process.
False
The most commonly used clustering technique is
K-means
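A bare-bones sketch of the k-means loop (assign each point to its nearest centroid, then recompute centroids), assuming 1-D data and caller-supplied starting centroids. The function name and data are hypothetical; real implementations also handle initialization and convergence checks.

```python
def k_means_1d(points, centroids, iterations=10):
    """Alternate nearest-centroid assignment and centroid update on 1-D data."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as its cluster mean; keep it if the cluster is empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centroids, clusters = k_means_1d(points, centroids=[0.0, 5.0])
print(centroids, clusters)
```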
________________ clustering is a centroid-based method of vector quantization that partitions observations into a predetermined, fixed number of clusters.
K-means
The first and earliest data mining process is known as
Knowledge discovery in databases (KDD) methodology
Which of the following is not a classification method?
Linear regression
