ISM4402 Exam 1
Data mining requires specialized data analysts to ask ad hoc questions and obtain answers quickly from the system. T/F
False
________ regression is a very popular, statistically sound, probability-based classification algorithm that employs supervised learning.
Logistic
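As a rough illustration of why logistic regression is called probability-based, the sketch below turns a weighted sum of inputs into a probability and then into a class. The coefficients, the feature values, and the 0.5 cutoff are assumptions chosen for illustration; a real model would learn the coefficients from labeled (supervised) training data.

```python
import math

def logistic(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, weights, bias):
    """Weighted sum of the features passed through the logistic function."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return logistic(score)

# Hypothetical, hand-picked coefficients (not learned from data).
WEIGHTS = [0.8, -0.5]
BIAS = -0.2

probability = predict_probability([1.0, 0.4], WEIGHTS, BIAS)
predicted_class = probability >= 0.5  # assumed cutoff of 0.5
print(round(probability, 3), predicted_class)
```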
The programming algorithm developed by Google to handle Big Data computational challenges is known as ________.
MapReduce
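The map-then-reduce idea can be sketched in a single process as a word count. This is only a toy: a real MapReduce job spreads the map and reduce work across many machines, and the function names and sample documents here are assumptions made for illustration.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group all emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Combine the values for each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}

documents = ["big data big ideas", "data mining"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```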
________ are typically used together with other charts and graphs, as opposed to by themselves, and show postal codes, country names, etc.
Maps
In the Target case study, why did Target send maternity ads to a teen? Target's analytic model confused her with an older woman with a similar name. Target was sending ads to all women in a particular neighborhood. Target's analytic model suggested she was pregnant based on her buying habits. Target was using a special promotion that targeted all teens in her geographical area.
Target's analytic model suggested she was pregnant based on her buying habits.
Computer applications have moved from transaction processing and monitoring activities to problem analysis and solution applications. T/F
True
Converting continuous valued numerical variables to ranges and categories is referred to as discretization. T/F
True
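A minimal sketch of discretization, assuming made-up age ranges and labels: each continuous value is replaced by the label of the range it falls into.

```python
def discretize(value, bins):
    """Return the label of the first bin whose upper bound exceeds value.

    bins is a list of (upper_bound, label) pairs sorted by upper_bound.
    """
    for upper_bound, label in bins:
        if value < upper_bound:
            return label
    raise ValueError(f"{value} is above the last bin")

# Hypothetical age ranges: [0, 18), [18, 65), [65, infinity)
AGE_BINS = [(18, "minor"), (65, "adult"), (float("inf"), "senior")]

ages = [4, 17, 18, 42, 65, 90]
age_groups = [discretize(age, AGE_BINS) for age in ages]
print(age_groups)
```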
Data accessibility means that the data are easily and readily obtainable. T/F
True
Data is the main ingredient for any BI, data science, and business analytics initiative. T/F
True
Descriptive statistics is all about describing the sample data on hand. T/F
True
During classification in data mining, a false positive is an occurrence classified as true by the algorithm while being false in reality. T/F
True
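To make the false-positive definition concrete, the sketch below tallies the four possible outcomes of a binary classifier. The two label lists are invented for illustration.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for boolean labels."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for is_true, said_true in zip(actual, predicted):
        if said_true and is_true:
            counts["TP"] += 1
        elif said_true and not is_true:
            counts["FP"] += 1  # classified as true, false in reality
        elif not said_true and not is_true:
            counts["TN"] += 1
        else:
            counts["FN"] += 1  # classified as false, true in reality
    return counts

# Hypothetical ground truth and classifier output.
actual = [True, False, False, True, False, True]
predicted = [True, True, False, False, False, True]
counts = confusion_counts(actual, predicted)
print(counts)
```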
During the early days of analytics, data was often obtained from the domain experts using manual processes to build mathematical or knowledge-based models. T/F
True
Google Maps has set new standards for data visualization with its intuitive Web mapping software. T/F
True
If using a mining analogy, "knowledge mining" would be a more appropriate term than "data mining." T/F
True
In data mining, classification models help in prediction. T/F
True
Why is a performance management system superior to a performance measurement system? because performance measurement systems are only in their infancy because measurement automatically leads to problem solution because performance management systems cost more because measurement alone has little use without action
because measurement alone has little use without action
With a dashboard, information on sources of the data being presented, the quality and currency of underlying data provide contextual ________ for users.
metadata
Oper marts are created when operational data needs to be analyzed linearly. in a dashboard. unidimensionally. multidimensionally.
multidimensionally.
There has been an increase in data mining to deal with global competition and customers' more sophisticated ________ and wants.
needs
Dashboards can be presented at all the following levels EXCEPT the visual dashboard level. the static report level. the visual cube level. the self-service cube level.
the visual cube level.
What are the four major components of a Business Intelligence (BI) system?
1. A data warehouse, with its source data 2. Business analytics, a collection of tools for manipulating, mining, and analyzing the data in the data warehouse 3. Business performance management (BPM) for monitoring and analyzing performance 4. A user interface (e.g., a dashboard)
In the Influence Health case, the company was able to evaluate over ________ million records in only two days.
195
In what decade did disjointed information systems begin to be integrated? 1970s 1980s 1990s 2000s
1980s
Relational databases began to be used in the 1960s. 1970s. 1980s. 1990s.
1980s.
Typical charts, graphs, and other visual elements used in visualization-based applications usually involve ________ dimensions.
2
What is the definition of a data mart?
A data mart is a subset of a data warehouse, typically consisting of a single subject area (e.g., marketing, operations). Whereas a data warehouse combines databases across an entire enterprise, a data mart is usually smaller and focuses on a particular subject or department.
Today, many vendors offer diversified tools, some of which are completely preprogrammed (called shells). How are these shells utilized? They are used for customization of BI solutions. All a user needs to do is insert the numbers. The shell provides a secure environment for the organization's BI data. They host an enterprise data warehouse that can assist in decision making.
All a user needs to do is insert the numbers.
Why is data alone worthless?
Alone, data is worthless because it does not provide business value. To provide business value, it has to be analyzed.
The ________ is the most commonly used algorithm to discover association rules. Given a set of itemsets, the algorithm attempts to find subsets that are common to at least a minimum number of the itemsets.
Apriori algorithm
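The level-by-level search behind Apriori can be sketched as follows. This is a simplified teaching version, not an optimized implementation; the shopping baskets and the minimum support of 3 are assumptions made for illustration.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: count} for itemsets found in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    candidates = {frozenset([item]) for item in items}
    frequent = {}
    size = 1
    while candidates:
        # Count how many transactions contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        size += 1
        # Join frequent itemsets into candidates that are one item larger.
        joined = {a | b for a in level for b in level if len(a | b) == size}
        # Prune candidates that have any infrequent subset of the previous size.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        }
    return frequent

# Hypothetical shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
frequent_itemsets = apriori(baskets, min_support=3)
for itemset, count in frequent_itemsets.items():
    print(sorted(itemset), count)
```

With these baskets, every single item meets the threshold, but {bread, milk} is the only pair that does, so the search stops after the second level.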
Which of the following is an umbrella term that combines architectures, tools, databases, analytical tools, applications, and methodologies? MIS DSS ERP BI
BI
________ charts are effective when you have nominal data or numerical data that splits nicely into different categories so you can quickly see comparative results and trends within your data.
Bar
________ charts are useful in displaying nominal data or numerical data that splits nicely into different categories so you can quickly see comparative results and trends.
Bar
________ cycle times are now extremely compressed, faster, and more informed across industries.
Business
________ is an umbrella term that combines architectures, tools, databases, analytical tools, applications, and methodologies.
Business intelligence (BI)
Which data mining process/methodology is thought to be the most comprehensive, according to kdnuggets.com rankings? SEMMA proprietary organizational methodologies KDD Process CRISP-DM
CRISP-DM
________ was proposed in the mid-1990s by a European consortium of companies to serve as a nonproprietary standard methodology for data mining.
CRISP-DM
Describe categorical and nominal data.
Categorical data represent the labels of multiple classes used to divide a variable into specific groups. Examples of categorical variables include race, sex, age group, and educational level. Nominal data contain measurements of simple codes assigned to objects as labels, which are not measurements. For example, the variable marital status can be generally categorized as (1) single, (2) married, and (3) divorced.
________ providers focus on providing technology and services aimed toward integrating data from multiple sources.
Data Warehouse
Because the recession has raised interest in low-cost open source software, it is now set to replace traditional enterprise software. T/F
False
Bill Inmon advocates the data mart bus architecture whereas Ralph Kimball promotes the hub-and-spoke architecture, a data mart bus architecture with conformed dimensions. T/F
False
Business intelligence (BI) is a specific term that describes architectures and tools only. T/F
False
Which of the following is a data mining myth? Data mining is a multistep process that requires deliberate, proactive design and use. Data mining requires a separate, dedicated database. The current state-of-the-art is ready to go for almost any business. Newer Web-based tools enable managers of all educational levels to do data mining.
Data mining requires a separate, dedicated database.
Computerized support is only used for organizational decisions that are responses to external pressures, not for taking advantage of opportunities. T/F
False
Six Sigma rests on a simple performance improvement model known as DMAIC. What are the steps involved?
Define Measure Analyze Improve Control
________ analytics help managers understand current events in the organization including causes, trends, and patterns.
Descriptive
________ modeling is a retrieval-based system that supports high-volume query access.
Dimensional
Dashboards provide visual displays of important information that is consolidated and arranged across several screens to maintain data order. T/F
False
With ________, all the data from every corner of the enterprise is collected and integrated into a consistent schema so that every part of the organization has access to the single version of the truth when and where needed.
Enterprise Resource Planning (ERP)
________ is a mechanism that integrates application functionality and shares functionality (rather than data) across systems, thereby enabling flexibility and reuse.
Enterprise application integration (EAI)
________ is a mechanism for pulling data from source systems to satisfy a request for information. It is an evolving tool space that promises real-time data integration from a variety of sources, such as relational databases, Web services, and multidimensional databases.
Enterprise information integration (EII)
BI represents a bold new paradigm in which the company's business strategy must be aligned to its business intelligence analysis initiatives. T/F
False
How does the use of cloud computing affect the scalability of a data warehouse? Cloud computing vendors bring as much hardware as needed to users' offices. Hardware resources are dynamically allocated as use increases. Cloud vendors are mostly based overseas where the cost of labor is low. Cloud computing has little effect on a data warehouse's scalability.
Hardware resources are dynamically allocated as use increases
Describe the difference between simple and multiple regression.
If the regression equation is built between one response variable and one explanatory variable, then it is called simple regression. Multiple regression is the extension of simple regression where the explanatory variables are more than one.
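A minimal sketch of simple regression, fitting one response variable to one explanatory variable by ordinary least squares. The data points are invented so that they lie exactly on y = 1 + 2x; multiple regression would add more explanatory variables and is not shown here.

```python
def fit_simple_regression(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data generated from y = 1 + 2x.
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]
intercept, slope = fit_simple_regression(x, y)
print(intercept, slope)
```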
In lessons learned from the Target case, what legal warnings would you give another retailer using data mining for marketing?
If you look at this practice from a legal perspective, you would conclude that Target did not use any information that violates customer privacy; rather, they used transactional data that most every other retail chain is collecting and storing (and perhaps analyzing) about their customers. What was disturbing in this scenario was perhaps the targeted concept: pregnancy. There are certain events or concepts that should be off limits or treated extremely cautiously, such as terminal disease, divorce, and bankruptcy.
What is the definition of a data warehouse (DW) in simple terms?
In simple terms, a data warehouse (DW) is a pool of data produced to support decision making; it is also a repository of current and historical data of potential interest to managers throughout the organization.
________ (also called in-database analytics) refers to the integration of the algorithmic extent of data analytics into the data warehouse.
In-database processing
________ statistics is about drawing conclusions about the characteristics of the population.
Inferential
Mehra (2005) indicated that few organizations really understand metadata, and fewer understand how to design and implement a metadata strategy. How would you describe metadata?
Metadata are data about data. Metadata describe the structure of and some meaning about data, thereby contributing to their effective or ineffective use.
________ management reports are used to manage business performance through outcome-oriented metrics in many organizations.
Metric
________ providers focus on bringing all the data stores into an enterprise-wide platform.
Middleware
________ analytics help managers understand probable future outcomes.
Predictive
________ analytics help managers make decisions to achieve the best performance in the future.
Prescriptive
Which of the following statements about Big Data is true? Data chunks are stored in different locations on one computer. Hadoop is a type of processor used to process Big Data applications. MapReduce is a storage filing system. Pure Big Data systems do not involve fault tolerance.
Pure Big Data systems do not involve fault tolerance
________ plots are often used to explore the relationship between two or three variables (in 2-D or 3-D visuals).
Scatter
All of the following statements about data mining are true EXCEPT: The term is relatively new. Its techniques have their roots in traditional statistical analysis and artificial intelligence. The ideas behind it are relatively new. Intense, global competition makes its application more important.
The ideas behind it are relatively new.
What is the intent of the analysis of data that is stored in a data warehouse?
The intent of the analysis is to give management the ability to analyze data for insights into the business, and thus provide tactical or operational decision support whereby, for example, line personnel can make quicker and/or more informed decisions.
In the data mining in Hollywood case study, how successful were the models in predicting the success or failure of a Hollywood movie?
The researchers claim that these prediction results are better than any reported in the published literature for this problem domain. Fusion classification methods attained up to 56.07% accuracy in correctly classifying movies and 90.75% accuracy in classifying movies within one category of their actual category. The SVM classification method attained up to 55.49% accuracy in correctly classifying movies and 85.55% accuracy in classifying movies within one category of their actual category.
Describe the role of the simple split in estimating the accuracy of classification models.
The simple split (or holdout or test sample estimation) partitions the data into two mutually exclusive subsets called a training set and a test set (or holdout set). It is common to designate two-thirds of the data as the training set and the remaining one-third as the test set. The training set is used by the inducer (model builder), and the built classifier is then tested on the test set. An exception to this rule occurs when the classifier is an artificial neural network. In this case, the data is partitioned into three mutually exclusive subsets: training, validation, and testing.
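The two-thirds/one-third holdout described above can be sketched like this. The fixed random seed and the 30 placeholder records are assumptions chosen so the example is repeatable.

```python
import random

def simple_split(records, train_fraction=2 / 3, seed=0):
    """Shuffle the records and cut them into mutually exclusive train and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

records = list(range(30))  # stand-in for 30 labeled records
train_set, test_set = simple_split(records)
print(len(train_set), len(test_set))
```

The model builder would see only `train_set`; accuracy would then be measured on `test_set`, which it never saw during training.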
________ series forecasting is the use of mathematical modeling to predict future values of the variable of interest based on previously observed values.
Time
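One of the simplest ways to predict a future value from previously observed values is a moving average. The sketch below assumes a window of three periods and made-up monthly figures; it stands in for the many more capable forecasting models.

```python
def moving_average_forecast(series, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    if len(series) < window:
        raise ValueError("series is shorter than the window")
    recent = series[-window:]
    return sum(recent) / window

monthly_sales = [10, 12, 14, 16]  # hypothetical observations
next_month = moving_average_forecast(monthly_sales)
print(next_month)
```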
Which type of question does visual analytics seek to answer? Why is it happening? What happened yesterday? What is happening today? When did it happen?
Why is it happening?
What is Six Sigma? a letter in the Greek alphabet that statisticians use to measure process variability a methodology aimed at reducing the number of defects in a business process a methodology aimed at reducing the amount of variability in a business process a methodology aimed at measuring the amount of variability in a business process
a methodology aimed at reducing the number of defects in a business process
When you tell a story in a presentation, all of the following are true EXCEPT a story should make sense and order out of a lot of background noise. a well-told story should have no need for subsequent discussion. stories and their lessons should be easy to remember. the outcome and reasons for it should be clear at the end of your story.
a well-told story should have no need for subsequent discussion.
Fundamental reasons for investing in BI must be ________ with the company's business strategy.
aligned
BI applications must be integrated with databases. legacy systems. enterprise systems. all of these
all of these
Data warehouses are intended to work with informational data used for online ________ processing systems.
analytical
Understanding customers better has helped Amazon and others become more successful. The understanding comes primarily from collecting data about customers and transactions. developing a philosophy that is data analytics-centric. analyzing the vast data amounts routinely collected. asking the customers what they want.
analyzing the vast data amounts routinely collected.
This plot is a graphical illustration of several descriptive statistics about a given data set. pie chart bar graph box-and-whiskers plot kurtosis
box-and-whiskers plot
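The descriptive statistics that a box-and-whiskers plot typically draws are the minimum, the quartiles, and the maximum. The sketch below computes them with Python's standard `statistics` module; the sample values are invented, and different quartile conventions can give slightly different Q1 and Q3 values.

```python
import statistics

def five_number_summary(values):
    """Return the minimum, quartiles, and maximum of a data set."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4)
    return {
        "min": ordered[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": ordered[-1],
    }

summary = five_number_summary([7, 1, 9, 3, 5, 2, 8, 4, 6])
print(summary)
```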
Which kind of chart is described as an enhanced version of a scatter plot? heat map bullet pie chart bubble chart
bubble chart
Which broad area of data mining applications analyzes data, forming rules to distinguish between defined classes? associations visualization classification clustering
classification
The user interface of a BI system is often referred to as a(n) ________.
dashboard
Data are often buried deep within very large ________, which sometimes contain data from several years.
databases
One way to accomplish privacy and protection of individuals' rights when data mining is by ________ of the customer records prior to applying data mining applications, so that the records cannot be traced to an individual.
de-identification
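A minimal sketch of de-identification, assuming the direct identifiers are stored in fields named `name` and `ssn` (hypothetical field names). Dropping direct identifiers is only a first step; combinations of the remaining fields can sometimes still point to an individual.

```python
# Hypothetical names of the fields that directly identify a person.
IDENTIFIER_FIELDS = {"name", "ssn"}

def de_identify(record):
    """Return a copy of the record without its direct identifiers."""
    return {
        field: value
        for field, value in record.items()
        if field not in IDENTIFIER_FIELDS
    }

# Made-up customer records.
customers = [
    {"name": "A. Example", "ssn": "000-00-0000", "zip": "33101", "purchases": 12},
    {"name": "B. Sample", "ssn": "000-00-0001", "zip": "33199", "purchases": 3},
]
mining_ready = [de_identify(customer) for customer in customers]
print(mining_ready)
```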
In the Opening Vignette on Sports Analytics, what type of modeling was used to predict offensive tactics? heuristics heat maps cascaded decision trees sentiment analysis
heat maps
A(n) ________ architecture is used to build a scalable and maintainable infrastructure that includes a centralized data warehouse and several dependent data marts.
hub-and-spoke
Which data warehouse architecture uses a normalized relational warehouse that feeds multiple data marts? independent data marts architecture centralized data warehouse architecture hub-and-spoke data warehouse architecture federated architecture
hub-and-spoke data warehouse architecture
Data warehouses provide direct and indirect benefits to organizations. Which of the following is an indirect benefit of data warehouses? better and more timely information extensive new analyses performed by users simplified access to data improved customer service
improved customer service
All of the following are true about in-database processing technology EXCEPT it pushes the algorithms to where the data is. it makes the response to queries much faster than conventional databases. it is often used for apps like credit card fraud detection and investment risk management. it is the same as in-memory storage technology.
it is the same as in-memory storage technology.
The Internet emerged as a new medium for visualization and brought all the following EXCEPT worldwide digital distribution of visualization. immersive environments for consuming data. new forms of computation of business logic. new graphics displays through PC displays.
new forms of computation of business logic.
The data field "ethnic group" can be best described as nominal data. interval data. ordinal data. ratio data.
nominal data.
When validating the assumptions of a regression, ________ assumes that the errors of the response variable are normally distributed.
normality
The data mining in cancer research case study explains that data mining methods are capable of extracting patterns and ________ hidden deep in large and complex medical databases.
relationships
A(n) ________ is a communication artifact, concerning business matters, prepared with the specific intention of relaying information in a presentable form.
report
Given that the size of data warehouses is expanding at an exponential rate, ________ is an important issue.
scalability
List the five most common functions of business reports.
• To ensure that all departments are functioning properly • To provide information • To provide the results of an analysis • To persuade others to act • To create an organizational memory (as part of a knowledge management system)
One of SiriusXM's challenges was tracking potential customers when cars were sold. T/F
True
List 3 common data mining myths and realities.
1) Myth: Data mining provides instant, crystal-ball-like predictions. Reality: Data mining is a multistep process that requires deliberate, proactive design and use. 2) Myth: Data mining is not yet viable for mainstream business applications. Reality: The current state of the art is ready to go for almost any business type and/or size. 3) Myth: Data mining requires a separate, dedicated database. Reality: Because of the advances in database technology, a dedicated database is not required. 4) Myth: Only those with advanced degrees can do data mining. Reality: Newer Web-based tools enable managers of all educational levels to do data mining. 5) Myth: Data mining is only for large firms that have lots of customer data. Reality: If the data accurately reflect the business or its customers, any company can use data mining.
List and briefly describe the six steps of the CRISP-DM data mining process.
1. Business Understanding - learn the needs of management and the specific goals of the business for the project 2. Data Understanding - clearly identify and select all data sources 3. Data Preparation - preprocess the data to be acceptable for the algorithm(s) to be used 4. Model Building - select and apply modeling techniques to satisfy business needs 5. Testing and Evaluating - assess the results of the model for accuracy and generality 6. Deployment - implement the model into the business so that the results can be useful
What are the most important assumptions in linear regression?
1. Linearity. This assumption states that the relationship between the response variable and the explanatory variables is linear. 2. Independence (of errors). This assumption states that the errors of the response variable are uncorrelated with each other. 3. Normality (of errors). This assumption states that the errors of the response variable are normally distributed. 4. Constant variance (of errors). This assumption, also called homoscedasticity, states that the response variables have the same variance in their error, regardless of the values of the explanatory variables. 5. Multicollinearity. This assumption states that the explanatory variables are not correlated with each other.
According to Eckerson (2006), a well-known expert on BI dashboards, what are the three layers of information of a dashboard?
1. Monitoring. Graphical, abstracted data to monitor key performance metrics. 2. Analysis. Summarized dimensional data to analyze the root cause of problems. 3. Management. Detailed operational data that identify what actions to take to resolve a problem.
While prediction is largely experience and opinion based, ________ is data and model based.
forecasting
As the number of potential BI applications increases, the need to justify and prioritize them arises. This is not an easy task due to the large number of ________ benefits.
intangible
Data ________ comprises data access, data federation, and change capture.
integration
Software monitors referred to as ________ can be placed on a separate server in the network and use event- and process-based approaches to measure and monitor operational processes.
intelligent agents
Key performance indicators (KPIs) are metrics typically used to measure database responsiveness. qualitative feedback. external results. internal results.
internal results.
With dashboards, the layer of information that uses graphical, abstracted data to keep tabs on key performance metrics is the ________ layer.
monitoring
Visual analytics is widely regarded as the combination of visualization and ________ analytics.
predictive
Third party providers of publicly available data sets protect the anonymity of the individuals in the data set primarily by asking data users to use the data ethically. leaving in identifiers (e.g., name), but changing other variables. removing identifiers such as names and social security numbers. letting individuals in the data know their data is being accessed.
removing identifiers such as names and social security numbers.
Which of the following is NOT an example of transaction processing? ATM withdrawal bank deposit sales report cash register scans
sales report
Dashboards present visual displays of important information that are consolidated and arranged on a single ________.
screen
Data warehouse administrators (DWAs) do not need strong business insight since they only handle the technical aspect of the infrastructure. T/F
False
Data warehouses are subsets of data marts. T/F
False
Decision support system (DSS) and management information system (MIS) have precise definitions agreed to by practitioners. T/F
False
Demands for instant, on-demand access to dispersed information decrease as firms successfully integrate BI into their operations. T/F
False
Due to industry consolidation, the analytics ecosystem consists of only a handful of players across several functional areas. T/F
False
In the Dallas Cowboys case study, the focus was on using data analytics to decide which players would play every week. T/F
False
The entire focus of the predictive analytics system in the Infinity P&C case was on detecting and handling fraudulent claims for the company's benefit. T/F
False
With the balanced scorecard approach, the entire focus is on measuring and managing specific financial goals based on the organization's strategy. T/F
False
A(n) ________ is a major component of a Business Intelligence (BI) system that holds source data.
data warehouse
Which of the following is LEAST related to data/information visualization? information graphics scientific visualization statistical graphics graphic artwork
graphic artwork
Most data warehouses are built using ________ database management systems to control and manage the data.
relational
Because of performance and data quality issues, most experts agree that the federated architecture should supplement data warehouses, not replace them. T/F
True
In the 2000s, the DW-driven DSSs began to be called BI systems. T/F
True
In the FEMA case study, the BureauNet software was the primary reason behind the increased speed and relevance of the reports FEMA employees received. T/F
True
Interval data are variables that can be measured on interval scales. T/F
True
Managing data warehouses requires special methods, including parallel computing and/or Hadoop/Spark. T/F
True
Many business users in the 1980s referred to their mainframes as "the black hole," because all the information went into it, but little ever came back and ad hoc real-time querying was virtually impossible. T/F
True
One way an operational data store differs from a data warehouse is the recency of their data. T/F
True
Predictive algorithms generally require a flat file with a target variable, so making data analytics ready for prediction means that data sets must be transformed into a flat-file format and made ready for ingestion into those predictive algorithms. T/F
True
Structured data is what data mining algorithms use and can be classified as categorical or numeric. T/F
True
The "islands of data" problem in the 1980s describes the phenomenon of unconnected data being stored in numerous locations within an organization. T/F
True
The cost of data storage has plummeted recently, making data mining feasible for more firms. T/F
True
The data warehousing maturity model consists of six stages: prenatal, infant, child, teenager, adult, and sage. T/F
True
The hub-and-spoke data warehouse model uses a centralized warehouse feeding dependent data marts. T/F
True
The use of statistics in baseball by the Oakland Athletics, as described in the Moneyball case study, is an example of the effectiveness of prescriptive analytics. T/F
True
There are basic chart types and specialized chart types. A Gantt chart is a specialized chart type. T/F
True
Traditional BI systems use a large volume of static data that has been extracted, cleansed, and loaded into a data warehouse to produce reports and analyses. T/F
True
Using data mining on data about imports and exports can help to detect tax avoidance and money laundering. T/F
True
Visualization differs from traditional charts and graphs in complexity of data sets and use of multiple dimensions and measures. T/F
True
When a problem has many attributes that impact the classification of different patterns, decision trees may be a useful approach. T/F
True
With key performance indicators, driver KPIs have a significant effect on outcome KPIs, but the reverse is not necessarily true. T/F
True
Without middleware, different BI programs cannot easily connect to the data warehouse. T/F
True
The Google search engine is an example of Big Data in that it has to search and index billions of ________ in fractions of a second for each search.
Web pages
Online transaction processing (OLTP) systems handle a company's routine ongoing business. In contrast, a data warehouse is typically the end result of BI processes and operations. a repository of actionable intelligence obtained from a data mart. a distinct system that provides storage for data that will be made use of in analysis. an integral subsystem of an online analytical processing (OLAP) system.
a distinct system that provides storage for data that will be made use of in analysis.
This measure of central tendency is the sum of all the values/observations divided by the number of observations in the data set. dispersion mode median arithmetic mean
arithmetic mean
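The definition translates directly into code: add up the observations and divide by how many there are. The sample values are invented for illustration.

```python
def arithmetic_mean(values):
    """Sum of all observations divided by the number of observations."""
    if not values:
        raise ValueError("cannot take the mean of an empty data set")
    return sum(values) / len(values)

observations = [2, 4, 4, 6, 9]
mean_value = arithmetic_mean(observations)
print(mean_value)  # 25 / 5 = 5.0
```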
In data mining, finding an affinity of two products to be commonly together in a shopping cart is known as association rule mining. cluster analysis. decision trees. artificial neural networks.
association rule mining.
This technique makes no a priori assumption of whether one variable is dependent on the other(s) and is not concerned with the relationship between variables; instead it gives an estimate on the degree of association between the variables. regression correlation means test multiple regression
correlation
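As a sketch of how a degree of association can be estimated, the code below computes the Pearson correlation coefficient, one common correlation measure. The two variables are invented, and swapping them gives the same result, which reflects that neither is treated as dependent on the other.

```python
import math

def pearson_correlation(x, y):
    """Return the Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired observations.
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 6, 8, 10]
r = pearson_correlation(hours, scores)
print(r)
```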
Different types of players are identified and described in the analytics ________.
ecosystem
The very design that makes an OLTP system efficient for transaction processing makes it inefficient for end-user ad hoc reports, queries, and analysis. transaction processing systems that constantly update operational databases. the collection of reputable sources of intelligence. transactions such as ATM withdrawals, where we need to reduce a bank balance accordingly.
end-user ad hoc reports, queries, and analysis.
What is the fundamental challenge of dashboard design? ensuring that users across the organization have access to it ensuring that the organization has the appropriate hardware onsite to support it ensuring that the organization has access to the latest Web browsers ensuring that the required information is shown clearly on a single screen
ensuring that the required information is shown clearly on a single screen
Which approach to data warehouse integration focuses more on sharing process functionality than data across systems? extraction, transformation, and load enterprise application integration enterprise information integration enterprise function integration
enterprise application integration
The need for more versatile reporting than what was available in 1980s era ERP systems led to the development of what type of system? management information systems relational databases executive information systems data warehouses
executive information systems
Patterns have been manually ________ from data by humans for centuries, but the increasing volume of data in modern times has created a need for more automatic approaches.
extracted
Performing extensive ________ to move data to the data warehouse may be a sign of poorly managed data and a fundamental lack of a coherent data management strategy.
extraction, transformation, and load (ETL)
The ________ data warehouse architecture involves integrating disparate systems and analytical resources from multiple sources to meet changing needs or business conditions.
federated
Which data warehouse architecture uses metadata from existing data warehouses to create a hybrid logical data warehouse comprised of data from the other warehouses? independent data marts architecture centralized data warehouse architecture hub-and-spoke data warehouse architecture federated architecture
federated architecture
Which type of visualization tool can be very helpful when a data set contains location data? bar chart geographic map highlight table tree map
geographic map
All of the following are benefits of hosted data warehouses EXCEPT smaller upfront investment. better quality hardware. greater control of data. frees up in-house systems.
greater control of data.
The filing system developed by Google to handle Big Data storage challenges is known as the ________ Distributed File System.
Hadoop
In the Influence Health case study, what was the goal of the system? locating clinic patients understanding follow-up care decreasing operational costs increasing service use
increasing service use
Which kind of data warehouse is created separately from the enterprise data warehouse by a department and not reliant on it for updates? sectional data mart public data mart independent data mart volatile data mart
independent data mart
Identifying and preventing incorrect claim payments and fraudulent activities falls under which type of data mining applications? insurance retailing and logistics customer relationship management computer hardware and software
insurance
What does the scalability of a data mining method refer to? its ability to predict the outcome of a previously unknown data set accurately its speed of computation and computational costs in using the model its ability to construct a prediction model efficiently given a large amount of data its ability to overcome noisy data to make somewhat accurate predictions
its ability to construct a prediction model efficiently given a large amount of data
What does the robustness of a data mining method refer to? its ability to predict the outcome of a previously unknown data set accurately its speed of computation and computational costs in using the model its ability to construct a prediction model efficiently given a large amount of data its ability to overcome noisy data to make somewhat accurate predictions
its ability to overcome noisy data to make somewhat accurate predictions
In ________, a classification method, the complete data set is randomly split into mutually exclusive subsets of approximately equal size and tested multiple times on each left-out subset, using the others as a training set.
k-fold cross-validation
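The splitting step described in this card can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper name `k_fold_splits`, the sample count, and the seed are all assumptions made for the example, and no model is actually trained.

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Hypothetical helper written for this illustration.
    """
    indices = list(range(n_samples))
    # Shuffle so the split into folds is random.
    random.Random(seed).shuffle(indices)
    # Striding through the shuffled list gives k mutually exclusive
    # folds of approximately equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        # Every fold except the left-out one forms the training set.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_splits(n_samples=10, k=5))
for train, test in splits:
    print("train:", sorted(train), "test:", sorted(test))
```

Each sample appears in exactly one test fold, so every record is used for testing once and for training k - 1 times.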
Fayyad et al. (1996) defined ________ in databases as a process of using data mining methods to find useful information and patterns in the data.
knowledge discovery
When validating the assumptions of a regression, ________ assumes that the relationship between the response variable and the explanatory variables is linear.
linearity
Which of the following developments is NOT contributing to facilitating growth of decision support and analytics? collaboration technologies Big Data knowledge management systems locally concentrated workforces
locally concentrated workforces
Because of its successful application to retail business problems, association rule mining is commonly called ________.
market-basket analysis
A(n) ________ data store (ODS) provides a fairly recent form of customer information file.
operational
What is the management feature of a dashboard? operational data that identify what actions to take to resolve a problem summarized dimensional data to analyze the root cause of problems summarized dimensional data to monitor key performance metrics graphical, abstracted data to monitor key performance metrics
operational data that identify what actions to take to resolve a problem
Which of the following BEST enables a data warehouse to handle complex queries and scale up to handle many more requests? use of the Web by users as a front-end parallel processing Microsoft Windows a larger IT staff
parallel processing
Which type of visualization tool can be very helpful when the intention is to show relative proportions of dollars per department allocated by a university administration? heat map bullet pie chart bubble chart
pie chart
What type of analytics seeks to determine what is likely to happen in the future? descriptive prescriptive predictive domain
predictive
What type of analytics seeks to recognize what is going on as well as the likely forecast and make decisions to achieve the best performance possible? descriptive prescriptive predictive domain
prescriptive
In ________ oriented data warehousing, operational databases are tuned to handle transactions that update the database.
product
Due to the ________ expansion of information technology coupled with the need for improved competitiveness in business, there has been an increase in the use of computing power to produce unified reports that join different views of the enterprise in one place.
rapid
Prediction problems where the variables have numeric values are most accurately defined as classifications. regressions. associations. computations.
regressions
Customer ________ management extends traditional marketing by creating one-on-one relationships with customers.
relationship
The competitive imperatives for BI include all of the following EXCEPT right information right user right time right place
right user
Clustering partitions a collection of things into segments whose members share similar characteristics. dissimilar characteristics. similar collection methods. dissimilar collection methods.
similar characteristics.
Real-time data warehousing can be used to support the highest level of decision making sophistication and power. The major feature that enables this in relation to handling the data is country of (data) origin. nature of the data. speed of data transfer. source of the data.
speed of data transfer.
This measure of dispersion is calculated by simply taking the square root of the variance. standard deviation range variance arithmetic mean
standard deviation
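As a quick worked example, the relationship between variance and standard deviation can be checked in Python. The data values are made up for illustration, and the sketch assumes the population form (dividing by N); the sample form divides by n - 1 instead.

```python
import math
import statistics

# Illustrative data, not taken from the course material.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(data) / len(data)
# Population variance: the average squared deviation from the mean.
variance = sum((x - mean) ** 2 for x in data) / len(data)
# Standard deviation: the square root of the variance.
std_dev = math.sqrt(variance)

print(mean, variance, std_dev)
# Cross-check against the standard library's population standard deviation.
print(statistics.pstdev(data))
```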
When representing data in a data warehouse, using several dimension tables that are each connected only to a fact table means you are using which warehouse structure? star schema snowflake schema relational schema dimensional schema
star schema
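A star schema can be sketched with Python's built-in `sqlite3` module. The table names (`dim_product`, `dim_store`, `fact_sales`) and the rows are hypothetical; the point is only that each dimension table joins directly to the central fact table and to nothing else.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: each connects only to the fact table.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT);

    -- Fact table: one row per sale, with a foreign key to each dimension.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        store_id   INTEGER REFERENCES dim_store(store_id),
        amount     REAL
    );

    INSERT INTO dim_product VALUES (1, 'Laptop'), (2, 'Phone');
    INSERT INTO dim_store   VALUES (1, 'Miami'), (2, 'Tampa');
    INSERT INTO fact_sales  VALUES (1, 1, 1000.0), (2, 1, 500.0), (1, 2, 1200.0);
""")

# A typical analytical query: join the fact table to one dimension and aggregate.
rows = conn.execute("""
    SELECT s.city, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_store AS s ON f.store_id = s.store_id
    GROUP BY s.city
    ORDER BY s.city
""").fetchall()
print(rows)
```

In a snowflake schema, by contrast, a dimension such as `dim_store` would itself be normalized into further tables (for example a separate city or region table).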
Whereas ________ starts with a well-defined proposition and hypothesis, data mining starts with a loosely defined discovery statement.
statistics
Operational or transaction databases are product oriented, handling transactions that update the database. In contrast, data warehouses are subject-oriented and nonvolatile. product-oriented and nonvolatile. product-oriented and volatile. subject-oriented and volatile.
subject-oriented and nonvolatile.
Data generation is a precursor, and is not included in the analytics ecosystem. T/F
False
Data is the contextualization of information, that is, information set in context. T/F
False
Data mining can be very useful in detecting patterns such as credit card fraud, but is of little help in improving sales. T/F
False
Data source reliability means that data are correct and are a good match for the analytics problem. T/F
False
Data that is collected, stored, and analyzed in data mining is often private and personal. There is no way to maintain individuals' privacy other than being very careful about physical data security. T/F
False
In the Dell case study, the largest issue was how to properly spend the online marketing budget. T/F
False
In the Miami-Dade Police Department case study, predictive analytics helped to identify the best schedule for officers in order to pay the least overtime. T/F
False
In the cancer research case study, data mining algorithms that predict cancer survivability with high predictive power are good replacements for medical professionals. T/F
False
In the opening case, police detectives used data mining to identify possible new areas of inquiry. T/F
False
Information systems that support such transactions as ATM withdrawals, bank deposits, and cash register scans at the grocery store represent transaction processing, a critical branch of BI. True False
False
K-fold cross-validation is also called sliding estimation. T/F
False
Major commercial business intelligence (BI) products and services were well established in the early 1970s. T/F
False
Managing information on operations, customers, internal procedures and employee interactions is the domain of cognitive science. True False
False
Market basket analysis is a useful and entertaining way to explain data mining to a technologically less savvy audience, but it has little business significance. T/F
False
Moving the data into a data warehouse is usually the easiest part of its creation. T/F
False
Nominal data represent the labels of multiple classes used to divide a variable into specific groups. T/F
False
OLTP systems are designed to handle ad hoc analysis and complex queries that deal with many data items. T/F
False
Open-source data mining tools include applications such as IBM SPSS Modeler and Dell Statistica. T/F
False
Organizations seldom devote a lot of effort to creating metadata because it is not important for the effective use of data warehouses. T/F
False
Properly integrating data from various databases and other disparate sources is a trivial process. T/F
False
Ratio data is a type of categorical data. T/F
False
Statistics and data mining both look for data sets that are as large as possible. T/F
False
Subject oriented databases for data warehousing are organized by detailed subjects such as disk drives, computers, and networks. T/F
False
Successful BI is a tool for the information systems department, but is not exposed to the larger organization. T/F
False
The BPM development cycle is essentially a one-shot process where the requirement is to get it right the first time. T/F
False
The data storage component of a business reporting system builds the various reports and hosts them for, or disseminates them to users. It also provides notification, annotation, collaboration, and other services. T/F
False
The growth in hardware, software, and network capacities has had little impact on modern BI innovations. T/F
False
The use of dashboards and data visualizations is seldom effective in identifying issues in organizations, as demonstrated by the Silvaris Corporation Case Study. T/F
False
To respond to its market challenges, SiriusXM decided to focus on manufacturing efficiency. T/F
False
Two-tier data warehouse/BI infrastructures offer organizations more flexibility but cost more than three-tier ones. T/F
False
User-initiated navigation of data through disaggregation is referred to as "drill up." T/F
False
Visual analytics is aimed at answering, "What is happening?" and is usually associated with business analytics. T/F
False
When telling a story during a presentation, it is best to avoid describing hurdles that your character must overcome, to avoid souring the mood. T/F
False
How does Amazon.com use predictive analytics to respond to product searches by the customer?
Amazon uses clustering algorithms to segment customers into different clusters to be able to target specific promotions to them. The company also uses association mining techniques to estimate relationships between different purchasing behaviors. That is, if a customer buys one product, what else is the customer likely to purchase? That helps Amazon recommend or promote related products. For example, any product search on Amazon.com results in the retailer also suggesting other similar products that may interest a customer.
________ is an evolving tool space that promises real-time data integration from a variety of sources, such as relational databases, Web services, and multidimensional databases. Enterprise information integration (EII) Enterprise application integration (EAI) Extraction, transformation, and load (ETL) None of these
Enterprise information integration (EII)
Describe cluster analysis and some of its applications.
Cluster analysis is an exploratory data analysis tool for solving classification problems. The objective is to sort cases (e.g., people, things, events) into groups, or clusters, so that the degree of association is strong among members of the same cluster and weak among members of different clusters. Cluster analysis is an essential data mining method for classifying items, events, or concepts into common groupings called clusters. The method is commonly used in biology, medicine, genetics, social network analysis, anthropology, archaeology, astronomy, character recognition, and even in MIS development. As data mining has increased in popularity, the underlying techniques have been applied to business, especially to marketing. Cluster analysis has been used extensively for fraud detection (both credit card and e-commerce fraud) and market segmentation of customers in contemporary CRM systems.
________ charts are a special case of horizontal bar charts that are used to portray project timelines, project tasks/activity durations, and overlap among the tasks/activities.
Gantt
There are several basic information system architectures that can be used for data warehousing. What are they?
Generally speaking, these architectures are commonly called client/server or n-tier architectures, of which two-tier and three-tier architectures are the most common, but sometimes there is simply one tier.
The ________ Model, also known as the EDW approach, emphasizes top-down development, employing established database development methodologies and tools, such as entity-relationship diagrams (ERD), and an adjustment of the spiral development approach.
Inmon
The ________ Model, also known as the data mart approach, is a "plan big, build small" approach. A data mart is a subject-oriented or department-oriented data warehouse. It is a scaled-down version of a data warehouse that focuses on the requests of a specific department, such as marketing or sales.
Kimball
________ describe the structure and meaning of the data, contributing to their effective use.
Metadata
________ charts or network diagrams show precedence relationships among the project activities/tasks.
PERT
________, or "The Extended ASP Model," is a creative way of deploying information system applications where the provider licenses its applications to customers for use as a service on demand (usually over the Internet).
SaaS (software as a service)
More data, coming in faster and requiring immediate conversion into decisions, means that organizations are confronting the need for real-time data warehousing (RDW). How would you define real-time data warehousing?
Real-time data warehousing, also known as active data warehousing (ADW), is the process of loading and providing data via the data warehouse as they become available.
What are the four processes that define a closed-loop BPM cycle?
Strategize, Plan, Monitor/Analyze, Adjust/Act
Describe the difference between descriptive and inferential statistics.
The main difference between descriptive and inferential statistics is the data used in these methods: descriptive statistics is all about describing the sample data on hand, whereas inferential statistics is about drawing inferences or conclusions about the characteristics of the population.
Kaplan and Norton developed a report that presents an integrated view of success in the organization called metric management reports. balanced scorecard-type reports. dashboard-type reports. visual reports.
balanced scorecard-type reports.
What is the main reason parallel processing is sometimes used for data mining? because the hardware exists in most organizations, and it is available to use because most of the algorithms used for data mining require it because of the massive data amounts and search efforts involved because any strategic application requires parallel processing
because of the massive data amounts and search efforts involved
In which stage of extraction, transformation, and load (ETL) into a data warehouse are anomalies detected and corrected? transformation extraction load cleanse
cleanse
Organizations using BI systems are typically seeking to ________ the gap between the operational data and strategic objectives, a need that has become more pressing.
close
Which broad area of data mining applications partitions a collection of objects into natural groupings with similar features? associations visualization classification clustering
clustering
As described in the Influence Health case study, customers are more often ________ services from a variety of healthcare service providers before selecting one.
comparing
How are enterprise resources planning (ERP) systems related to supply chain management (SCM) systems? different terms for the same system complementary systems mutually exclusive systems None of these; these systems never interface
complementary systems
Which characteristic of data requires that the variables and data values be defined at the lowest (or as low as required) level of detail for the intended use of the data? data source reliability data accessibility data richness data granularity
data granularity
A large storage location that can hold vast quantities of data (mostly unstructured) in its native/raw format for future/potential analytics consumption is referred to as a(n) extended ASP. data cloud. data lake. relational database.
data lake.
In the Dell case study, engineers, working closely with marketing, used lean software development strategies and numerous technologies to create a highly scalable, singular ________.
data mart
Knowledge extraction, pattern analysis, data archaeology, information harvesting, pattern searching, and data dredging are all alternative names for ________.
data mining
Business applications have moved from transaction processing and monitoring to other activities. Which of the following is NOT one of those activities? problem analysis solution applications data monitoring mobile access
data monitoring
Data preparation, the third step in the CRISP-DM data mining process, is more commonly known as ________.
data preprocessing
Which characteristic of data means that all the required data elements are included in the data set? data source reliability data accessibility data richness data granularity
data richness
The three main types of data warehouses are data marts, operational ________, and enterprise data warehouses.
data stores
The role responsible for successful administration and management of a data warehouse is the ________, who should be familiar with high-performance software, hardware, and networking technologies, and also possesses solid business insight.
data warehouse administrator (DWA)
The basic idea behind a(n) ________ is that it recursively divides a training set until each division consists entirely or primarily of examples from one class.
decision tree
A(n) ________ data mart is a subset that is created directly from the data warehouse.
dependent
In the terrorist funding case study, an observed price ________ may be related to income tax avoidance/evasion, money laundering, or terrorist financing.
deviation
A data mining study is specific to addressing a well-defined business task, and different business tasks require general organizational data. general industry data. general economic data. different sets of data.
different sets of data.
When querying a dimensional database, a user went from summarized data to its underlying details. The function that served this purpose is dice. slice. roll-up. drill down.
drill down.
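The difference between roll-up and drill-down can be illustrated with a toy data set in Python. The sales rows and the function names `roll_up` and `drill_down` are assumptions made for this sketch; real OLAP tools perform these operations against a dimensional database.

```python
from collections import defaultdict

# Hypothetical detail rows: (region, city, sales amount).
sales = [
    ("East", "Boston", 120),
    ("East", "New York", 300),
    ("West", "Seattle", 200),
    ("West", "Portland", 80),
]


def roll_up(rows):
    """Summarize detail rows to the region level (aggregation)."""
    totals = defaultdict(int)
    for region, _city, amount in rows:
        totals[region] += amount
    return dict(totals)


def drill_down(rows, region):
    """Go from one region's summary to its underlying city-level detail."""
    detail = defaultdict(int)
    for row_region, city, amount in rows:
        if row_region == region:
            detail[city] += amount
    return dict(detail)


print(roll_up(sales))             # summarized view
print(drill_down(sales, "East"))  # details behind the "East" total
```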
Information dashboards enable ________ operations that allow the users to view underlying data sources and obtain more detail.
drill-down/drill-through
What has caused the growth of the demand for instant, on-demand access to dispersed information? the increasing divide between users who focus on the strategic level and those who are more oriented to the tactical level the need to create a database infrastructure that is always online and contains all the information from the OLTP systems the more pressing need to close the gap between the operational data and strategic objectives the fact that BI cannot simply be a technical exercise for the information systems department
the more pressing need to close the gap between the operational data and strategic objectives
All of the following statements about data mining are true EXCEPT the process aspect means that data mining should be a one-step process to results. the novel aspect means that previously unknown patterns are discovered. the potentially useful aspect means that results should lead to some business benefit. the valid aspect means that the discovered patterns should hold true on new data.
the process aspect means that data mining should be a one-step process to results.
Big Data often involves a form of distributed storage and processing using Hadoop and MapReduce. One reason for this is centralized storage creates too many vulnerabilities. the "Big" in Big Data necessitates over 10,000 processing nodes. the processing power needed for the centralized model would overload a single computer. Big Data systems have to match the geographical spread of social media.
the processing power needed for the centralized model would overload a single computer.
In estimating the accuracy of data mining (or other) classification models, the true positive rate is the ratio of correctly classified positives divided by the total positive count. the ratio of correctly classified negatives divided by the total negative count. the ratio of correctly classified positives divided by the sum of correctly classified positives and incorrectly classified positives. the ratio of correctly classified positives divided by the sum of correctly classified positives and incorrectly classified negatives.
the ratio of correctly classified positives divided by the total positive count.
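The true positive rate can be computed directly from actual and predicted labels. In this sketch the labels are invented for illustration, 1 stands for the positive class, and the total positive count is taken as true positives plus false negatives (every case that is actually positive).

```python
def true_positive_rate(actual, predicted):
    """Correctly classified positives divided by the total positive count."""
    true_pos = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    false_neg = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    total_pos = true_pos + false_neg
    if total_pos == 0:
        raise ValueError("no actual positives, so the rate is undefined")
    return true_pos / total_pos


# Illustrative labels: four actual positives, three of them caught by the model.
actual = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0]
print(true_positive_rate(actual, predicted))
```

The false positive in the fifth position does not affect this measure; it would lower precision, which is the third option listed in the question.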
Benefits of the latest visual analytics tools, such as SAS Visual Analytics, include all of the following EXCEPT mobile platforms such as the iPhone are supported by these products. it is easier to spot useful patterns and trends in the data. they explore massive amounts of data in hours, not days. there is less demand on IT departments for reports.
they explore massive amounts of data in hours, not days.
A Web client that connects to a Web server, which is in turn connected to a BI application server, is reflective of a one-tier architecture. two-tier architecture. three-tier architecture. four-tier architecture.
three-tier architecture.
In the Opening Vignette on Sports Analytics, what was adjusted to drive one-time ticket sales? player selections stadium location fan tweets ticket prices
ticket prices
Online ________ is a term used for a transaction system that is primarily responsible for capturing and storing data related to day-to-day business functions such as ERP, CRM, SCM, and point of sale.
transaction processing
A(n) ________ is a major component of a Business Intelligence (BI) system that is often browser based and often presents a portal or dashboard.
user interface
Contextual metadata for a dashboard includes all the following EXCEPT whether any high-value transactions that would skew the overall trends were rejected as a part of the loading process. which operating system is running the dashboard server software. whether the dashboard is presenting "fresh" or "stale" information. when the data warehouse was last refreshed.
which operating system is running the dashboard server software.
Describe and define Big Data. Why is a search engine a Big Data application?
• Big Data is data that cannot be stored in a single storage unit. Big Data typically refers to data that is arriving in many different forms, be they structured, unstructured, or in a stream. Major sources of such data are clickstreams from Web sites, postings on social media sites such as Facebook, or data from traffic, sensors, or weather. • A Web search engine such as Google needs to search and index billions of Web pages in order to give you relevant search results in a fraction of a second. Although this is not done in real time, generating an index of all the Web pages on the Internet is not an easy task.
List four myths associated with data mining.
• Data mining provides instant, crystal-ball-like predictions. • Data mining is not yet viable for business applications. • Data mining requires a separate, dedicated database. • Only those with advanced degrees can do data mining. • Data mining is only for large firms that have lots of customer data.
Briefly describe four major components of the data warehousing process.
• Data sources. Data are sourced from multiple independent operational "legacy" systems and possibly from external data providers (such as the U.S. Census). Data may also come from an OLTP or ERP system. • Data extraction and transformation. Data are extracted and properly transformed using custom-written or commercial ETL software. • Data loading. Data are loaded into a staging area, where they are transformed and cleansed. The data are then ready to load into the data warehouse and/or data marts. • Comprehensive database. Essentially, this is the EDW to support all decision analysis by providing relevant summarized and detailed information originating from many different sources. • Metadata. Metadata include software programs about data and rules for organizing data summaries that are easy to index and search, especially with Web tools. • Middleware tools. Middleware tools enable access to the data warehouse. There are many front-end applications that business users can use to interact with data stored in the data repositories, including data mining, OLAP, reporting tools, and data visualization tools.
Briefly describe five techniques (or algorithms) that are used for classification modeling.
• Decision tree analysis. Decision tree analysis (a machine-learning technique) is arguably the most popular classification technique in the data mining arena. • Statistical analysis. Statistical techniques were the primary classification algorithm for many years until the emergence of machine-learning techniques. Statistical classification techniques include logistic regression and discriminant analysis. • Neural networks. These are among the most popular machine-learning techniques that can be used for classification-type problems. • Case-based reasoning. This approach uses historical cases to recognize commonalities in order to assign a new case into the most probable category. • Bayesian classifiers. This approach uses probability theory to build classification models based on the past occurrences that are capable of placing a new instance into a most probable class (or category). • Genetic algorithms. This approach uses the analogy of natural evolution to build directed-search-based mechanisms to classify data samples. • Rough sets. This method takes into account the partial membership of class labels to predefined categories in building models (collection of rules) for classification problems.
List and describe three levels or categories of analytics that are most often viewed as sequential and independent, but also occasionally seen as overlapping.
• Descriptive or reporting analytics refers to knowing what is happening in the organization and understanding some underlying trends and causes of such occurrences. • Predictive analytics aims to determine what is likely to happen in the future. This analysis is based on statistical techniques as well as other more recently developed techniques that fall under the general category of data mining. • Prescriptive analytics recognizes what is going on as well as the likely forecast and makes decisions to achieve the best performance possible.
What storage system and processing algorithm were developed by Google for Big Data?
• Google developed and released as an Apache project the Hadoop Distributed File System (HDFS) for storing large amounts of data in a distributed way. • Google developed and released as an Apache project the MapReduce algorithm for pushing computation to the data, instead of pushing data to a computing node.
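The MapReduce idea can be sketched as a word count running in a single Python process. This is only a conceptual illustration: the function names and the input chunks are assumptions, and a real cluster would run the map step on the nodes that already hold each chunk of data.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of data."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values for each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


# Each string stands in for a block of data stored on a different node.
chunks = ["big data big", "data lake"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
print(counts)
```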
List five types of specialized charts and graphs.
• Histograms • Gantt charts • PERT charts • Geographic maps • Bullets • Heat maps • Highlight tables • Tree maps
List four possible analytics applications in the retail value chain.
• Inventory Optimization • Price Elasticity • Market Basket Analysis • Shopper Insight • Customer Churn Analysis • Channel Analysis • New Store Analysis • Store Layout • Video Analytics
List and describe the three major categories of business reports.
• Metric management reports. Many organizations manage business performance through outcome-oriented metrics. For external groups, these are service-level agreements (SLAs). For internal management, they are key performance indicators (KPIs). • Dashboard-type reports. This report presents a range of different performance indicators on one page, like a dashboard in a car. Typically, there is a set of predefined reports with static elements and fixed structure, but customization of the dashboard is allowed through widgets, views, and set targets for various metrics. • Balanced scorecard-type reports. This is a method developed by Kaplan and Norton that attempts to present an integrated view of success in an organization. In addition to financial performance, balanced scorecard-type reports also include customer, business process, and learning and growth perspectives.
List five reasons for the growing popularity of data mining in the business world.
• More intense competition at the global scale driven by customers' ever-changing needs and wants in an increasingly saturated marketplace • General recognition of the untapped value hidden in large data sources • Consolidation and integration of database records, which enables a single view of customers, vendors, transactions, etc. • Consolidation of databases and other data repositories into a single location in the form of a data warehouse • The exponential increase in data processing and storage technologies • Significant reduction in the cost of hardware and software for data storage and processing • Movement toward the demassification (conversion of information resources into nonphysical form) of business practices
Business applications can be programmed to act on what real-time BI systems discover. Describe two approaches to the implementation of real-time BI.
• One approach to real-time BI uses the DW model of traditional BI systems. In this case, products from innovative BI platform providers provide a service-oriented, near-real-time solution that populates the DW much faster than the typical nightly extract/transform/load (ETL) batch update does. • A second approach, commonly called business activity management (BAM), is adopted by pure play BAM and/or hybrid BAM-middleware providers (such as Savvion, Iteration Software, Vitria, webMethods, Quantive, Tibco, or Vineyard Software). It bypasses the DW entirely and uses Web services or other monitoring means to discover key business events. These software monitors (or intelligent agents) can be placed on a separate server in the network or on the transactional application databases themselves, and they can use event- and process-based approaches to proactively and intelligently measure and monitor operational processes.
Describe the three major subsets of the Analytics Focused Software Developers portion of the Analytics Ecosystem.
• Reporting/Descriptive Analytics: Includes tools enabled by and available from the Middleware industry players, plus unique capabilities offered by focused providers. • Predictive Analytics: A rapidly growing area that includes a variety of statistical packages. • Prescriptive Analytics: Software providers in this category offer modeling tools and algorithms for optimization of operations, usually called management science/operations research software.
List six common data mining mistakes.
• Selecting the wrong problem for data mining • Ignoring what your sponsor thinks data mining is and what it really can and cannot do • Leaving insufficient time for data preparation • Looking only at aggregated results and not at individual records • Being sloppy about keeping track of the data mining procedure and results • Ignoring suspicious findings and quickly moving on • Running mining algorithms repeatedly and blindly • Believing everything you are told about the data • Believing everything you are told about your own data mining analysis • Measuring your results differently from the way your sponsor measures them
Mention briefly some of the recently popularized concepts and technologies that will play a significant role in defining the future of data warehousing.
• Sourcing (mechanisms for acquisition of data from diverse and dispersed sources): o Web, social media, and Big Data o Open source software o SaaS (software as a service) o Cloud computing • Infrastructure (architectural enhancements to hardware and software): o Columnar (a new way to store and access data in the database) o Real-time data warehousing o Data warehouse appliances (all-in-one solutions for DW) o Data management technologies and practices o In-database processing technology (putting the algorithms where the data is) o In-memory storage technology (moving the data into memory for faster processing) o New database management systems o Advanced analytics
A common way of introducing data warehousing is to refer to its fundamental characteristics. Describe three characteristics of data warehousing.
• Subject oriented. Data are organized by detailed subject, such as sales, products, or customers, containing only information relevant for decision support. • Integrated. Integration is closely related to subject orientation. • Time variant (time series). A warehouse maintains historical data. The data do not necessarily provide current status (except in real-time systems). • Nonvolatile. After data are entered into a data warehouse, users cannot change or update the data. • Web based. Data warehouses are typically designed to provide an efficient computing environment for Web-based applications. • Relational/multidimensional. A data warehouse uses either a relational structure or a multidimensional structure. • Client/server. A data warehouse uses the client/server architecture to provide easy access for end users. • Real time. Newer data warehouses provide real-time, or active, data-access and analysis capabilities (see Basu, 2003; and Bonde and Kuckuk, 2004). • Include metadata. A data warehouse contains metadata (data about data) about how the data are organized and how to effectively use them.
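The "time variant" and "nonvolatile" characteristics can be sketched in a few lines. The example below is a minimal, assumption-laden illustration rather than how any particular warehouse is implemented: the `PriceFact` and `PriceHistory` names and the sample prices are hypothetical. Rows are only ever appended, each carries a date, and queries ask for the value as of a given day.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)  # frozen: a loaded row cannot be changed in place
class PriceFact:
    product: str
    price: float
    as_of: date


class PriceHistory:
    """Append-only store: new facts are added, old ones are never updated."""

    def __init__(self) -> None:
        self._rows: List[PriceFact] = []

    def load(self, product: str, price: float, as_of: date) -> None:
        self._rows.append(PriceFact(product, price, as_of))

    def price_on(self, product: str, day: date) -> Optional[float]:
        # Time variant: return the most recent fact dated on or before `day`.
        candidates = [
            r for r in self._rows if r.product == product and r.as_of <= day
        ]
        if not candidates:
            return None
        return max(candidates, key=lambda r: r.as_of).price


history = PriceHistory()
history.load("widget", 9.99, date(2023, 1, 1))
history.load("widget", 11.49, date(2023, 6, 1))

print(history.price_on("widget", date(2023, 3, 15)))
print(history.price_on("widget", date(2023, 7, 1)))
```

Because the price change is stored as a new dated row instead of overwriting the old one, the earlier price stays available for historical analysis.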