ACCT 6600 Midterm
independent data mart
A small data warehouse designed for a strategic business unit or a department.
Which model from the list is used for predicting categorical outcomes of the response variable? A. Linear regression B. Logistic Regression C. Classification D. Association
B. Logistic Regression
Where does the data for business analytics come from?
Business processes, internet sources and social media, machines, and Internet of Things.
ratio data
Continuous data where both differences and ratios are interpretable. The distinguishing feature of a ratio scale is the possession of a nonarbitrary zero value.
Centrality is best defined as? a. A business situation where complete knowledge is available, so the exact course of action is known. b. A group of metrics that aims to quantify the importance or influence of a particular group or node. c. A category of data mining algorithm that establishes relationships about items that occur together.
b. A group of metrics that aims to quantify the importance or influence of a particular group or node.
Which of these is not one of the main areas of Web mining? a. Content b. Authorization c. Structure d. Usage
b. Authorization
The umbrella term that combines architecture tools, databases, analytic tools, applications, and methodologies is _______. a. Analytics b. Business Intelligence c. Enterprise resource planning d. Business performance management
b. Business Intelligence
What are three common myths about data mining?
It provides instant predictions, it requires a separate, defined database, and it is only for large firms with a lot of customer data.
association
A category of data mining algorithm that establishes relationships about items that occur together in a given record.
business intelligence
A conceptual framework for managerial decision support. It combines architecture, databases (or data warehouses), analytical tools, and applications.
CRISP-DM
A cross-industry standardized process of conducting data mining projects, which is a sequence of six steps that starts with a good understanding of the business and the need for the data mining project (i.e., the application domain) and ends with the deployment of the solution that satisfied the specific business need.
What is a cube? What do drill down, roll-up, slice, and dice mean?
A cube is a multidimensional data structure that allows fast analysis. Drill down: navigate from summarized data to the most detailed data. Roll up: compute the data relationships (aggregates) for one or more dimensions. Slice: a subset of a multidimensional array corresponding to a single value set for one dimension. Dice: a slice on more than two dimensions of the data cube.
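The cube operations can be sketched in a few lines of plain Python. This is only an illustration: the sales figures, the three dimensions (region, product, quarter), and the function names are all made up for the example, and a real OLAP server would do this with precomputed aggregates.

```python
from collections import defaultdict

# Hypothetical sales facts keyed by (region, product, quarter) -> units sold.
cube = {
    ("East", "Widget", "Q1"): 10,
    ("East", "Widget", "Q2"): 12,
    ("East", "Gadget", "Q1"): 5,
    ("West", "Widget", "Q1"): 7,
    ("West", "Gadget", "Q2"): 9,
}

def slice_cube(cube, quarter):
    """Slice: fix one dimension (here, quarter) to a single value."""
    return {k: v for k, v in cube.items() if k[2] == quarter}

def dice_cube(cube, regions, products, quarters):
    """Dice: keep a sub-cube by restricting several dimensions at once."""
    return {k: v for k, v in cube.items()
            if k[0] in regions and k[1] in products and k[2] in quarters}

def roll_up(cube, dim_index):
    """Roll up: aggregate the detail away, keeping one dimension."""
    totals = defaultdict(int)
    for key, units in cube.items():
        totals[key[dim_index]] += units
    return dict(totals)

q1 = slice_cube(cube, "Q1")
east_widgets = dice_cube(cube, {"East"}, {"Widget"}, {"Q1", "Q2"})
by_region = roll_up(cube, 0)  # drilling down is the reverse: back to `cube`
```

Here `by_region` holds one total per region, while the original `cube` dictionary is the detailed level a drill down would return to.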
dependent data mart
A data mart that depends on the existence of a data warehouse.
regression
A data mining method for real-world prediction problems where the predicted values (i.e., the output variable or dependent variable) are numeric (e.g., predicting the temperature for tomorrow as 68°F).
How does a data warehouse differ from a transactional database?
A data warehouse accesses aggregates of data while a transactional database accesses single data records
ETL
A data warehousing process that consists of extraction (i.e., reading data from a database), transformation (i.e., converting the extracted data from its previous form into the form in which it needs to be so that it can be placed into a data warehouse or simply another database), and load (i.e., putting the data into the data warehouse).
lift
A goodness-of-fit measure for classification as well as association rule mining models.
centrality
A group of metrics that aims to quantify the importance or influence (in a variety of senses) of a particular node (or group) within a network.
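One of the simplest centrality metrics is degree centrality, the share of other nodes a node is directly connected to. The sketch below assumes a small made-up undirected network; the names and the `degree_centrality` helper are hypothetical, and other metrics (betweenness, closeness) measure influence in different senses.

```python
# Hypothetical undirected social network as an adjacency mapping.
network = {
    "Ann": {"Bob", "Cy", "Dee"},
    "Bob": {"Ann"},
    "Cy": {"Ann", "Dee"},
    "Dee": {"Ann", "Cy"},
}

def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly linked to."""
    others = len(graph) - 1
    return {node: len(neighbors) / others for node, neighbors in graph.items()}

scores = degree_centrality(network)
most_central = max(scores, key=scores.get)
```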
What are the characteristics of Big Data?
A collection of data too large to be stored and processed in a single conventional database (volume). It comes in many forms (variety) and is generated so fast that it must be captured and processed quickly (velocity).
snowflake schema
A logical arrangement of tables in a multidimensional database in such a way that the entity relationship diagram resembles a snowflake in shape.
Six Sigma
A performance management methodology aimed at reducing the number of defects in a business process to as close to zero defects per million opportunities (DPMO) as possible.
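DPMO is usually computed as defects divided by total opportunities, scaled to one million. A minimal sketch, assuming made-up inspection numbers and a hypothetical `dpmo` helper:

```python
def dpmo(defects, units, opportunities_per_unit):
    """Defects per million opportunities."""
    return defects * 1_000_000 / (units * opportunities_per_unit)

# Hypothetical example: 25 defects found in 1,000 invoices,
# each invoice having 5 places where an error could occur.
invoice_dpmo = dpmo(defects=25, units=1000, opportunities_per_unit=5)
```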
balanced scorecard
A performance measurement and management methodology that helps translate an organization's financial, customer, internal process, and learning and growth objectives and targets into a set of actionable initiatives.
data warehouse
A physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized format.
data mining
A process that uses statistical, mathematical, artificial intelligence, and machine-learning techniques to extract and identify useful information and subsequent knowledge from large databases.
What is a data warehouse?
A repository of historical data that are organized by subject to support decision makers in the organization
What is a search engine? Why are they important for today's businesses?
A search engine is a software program that searches for information based on keywords. As the size of the Web increases, finding what you want is becoming more complex.
What processing technique is applied to process Big Data?
A single computer cannot process Big Data, so the work is split across many machines and processed in parallel, most commonly with the MapReduce programming paradigm.
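The MapReduce idea can be illustrated on one machine with a word count. This is a toy sketch, not a distributed implementation: the input chunks and the function names are assumptions, and in a real cluster each chunk would be mapped on a different node.

```python
from collections import defaultdict

# Hypothetical input, already split into chunks as if spread across machines.
chunks = ["big data big ideas", "data beats ideas"]

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk."""
    return [(word, 1) for word in chunk.split()]

def shuffle(mapped_pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine the values for each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(shuffle(mapped))
```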
What is a social network? What is the need for SNA?
A social network is a structure composed of individuals and organizations linked to one another. SNA identifies patterns and examines network dynamics. It is useful in customer intelligence and sociology.
correlation
A statistical measure that indicates the extent to which two or more variables change/fluctuate together.
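The most common correlation measure for two numeric variables is the Pearson coefficient, which ranges from -1 to +1. A minimal sketch with made-up data; `pearson_r` is a hypothetical helper name:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_variation / (spread_x * spread_y)

# Hypothetical data: ad spend vs. sales (move together), price vs. demand (move apart).
r_positive = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_negative = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
```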
kurtosis
A statistical measure to characterize the shape of a unimodal distribution—characterizing the peak/tall/skinny nature of the distribution.
decision tree
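A graphical, tree-shaped representation of a sequence of decisions that classifies entities into particular classes based on their features. (A tabular representation of possible condition combinations and outcomes is a decision table.)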
data preprocessing
A tedious process of converting raw data into an analytic ready state.
What are the key similarities and differences between a two-tiered and a three tiered architecture?
A three tiered architecture separates the client work station, application server, and database server so it is better performing but more expensive. A two tiered architecture has a client work station but combines the application and database servers so it sometimes has more processing issues but is more economical in cost.
nominal data
A type of data that contains measurements of simple codes assigned to objects as labels, which are not measurements. For example, the variable marital status can be generally categorized as (1) single, (2) married, and (3) divorced.
logistic regression
A very popular, statistically sound, probability-based classification algorithm that employs supervised learning.
dashboard
A visual presentation of critical data for executives to view. It allows executives to see hot spots in seconds and explore the situation.
Which of the following is not a step for Data Mining Process? A- Business Understanding B- Model Building C- Data Preparation D- Separation of Data
D- Separation of Data
Which of the following is not included in the ETL process? A: Extraction B: Transformation C: Load D: Transfer
D: Transfer
What are the main data preprocessing steps?
Data consolidation, cleaning, transformation, and reduction result in well-formed data.
What is data?
Data is a collection of facts usually obtained as a result of experiments, observations, transactions, or experiences.
Define data mining.
Data mining is discovering knowledge from large amounts of data. It is a process through which previously unknown patterns are discovered.
Describe the data warehousing process.
Data sources; data extraction and transformation; data loading; comprehensive database; metadata; middleware tools.
ordinal data
Data that contain codes assigned to objects or events as labels that also represent the rank order among them. For example, the variable credit score can be generally categorized as (1) low, (2) medium, and (3) high.
unstructured data
Data that do not have a predetermined format and are stored in the form of textual documents.
What are the privacy issues in data mining?
Data that is collected and stored often contains information about real people. Personal, demographic, and financial information needs to be protected.
categorical data
Data that represent the labels of multiple classes used to divide a variable into specific groups.
What is data visualization? Why is it needed?
Data visualization is the use of visual representations to explore, make sense of, and communicate data. It is used to better understand current and future trends.
What things can help a Web page rank higher in the search engine results?
Hire a company to improve site appeal; pay the search engine; liberation from dependence on search engine traffic.
What are the most popular commercial data mining tools?
IBM SPSS Modeler, SAP InfiniteInsight, Dell Statistica, and SAS Enterprise Miner.
dimension table
In a data warehouse, the tables that surround the central fact tables (and are linked to them via foreign keys) are called dimension tables.
confidence
In association rules, the conditional probability of finding the RHS of the rule present in a list of transactions where the LHS of the rule exists.
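Confidence is easiest to see next to the other two association rule measures in this set, support and lift. The sketch below uses a made-up list of market-basket transactions; the function names are hypothetical, and lift is computed with the usual formula of confidence divided by the support of the RHS.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Proportion of transactions that contain every item in the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(lhs, rhs):
    """Conditional probability of the RHS given that the LHS is present."""
    return support(lhs | rhs) / support(lhs)

def lift(lhs, rhs):
    """Confidence relative to how often the RHS occurs on its own."""
    return confidence(lhs, rhs) / support(rhs)

# Rule: bread -> milk
rule_support = support({"bread", "milk"})
rule_confidence = confidence({"bread"}, {"milk"})
rule_lift = lift({"bread"}, {"milk"})
```

A lift below 1, as in this toy data, suggests that buying bread makes milk slightly less likely than its baseline rate.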
corpus
In linguistics, a large and structured set of texts (usually stored and processed electronically) prepared for the purpose of conducting knowledge discovery.
What is an information dashboard? Why are they so popular?
An information dashboard is a visual representation of important information that is consolidated and arranged on a screen so it can be easily digested and further explored. It is popular because it provides an overview of current and future trends over multiple aspects relevant to the entity.
What is NOT one of three main areas of web mining? a. Web Content Mining b. Web Structure Mining c. Web Usage Mining d. Web Context Mining
d. Web Context Mining
NLP
Natural language processing (NLP): using a natural language processor to interface with a computer-based system.
search engine
A program that finds and lists Web sites or pages (designated by URLs) that match some user-selected criteria.
TDM
Term-document matrix (TDM): a frequency matrix created from digitized and organized documents (the corpus) where the columns represent the terms and rows represent the individual documents.
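A term-document matrix can be built by hand for a tiny corpus. This sketch assumes three made-up documents and simple whitespace tokenization; real text mining would also apply steps such as stop-word removal and stemming.

```python
from collections import Counter

# Hypothetical three-document corpus.
corpus = [
    "data mining finds patterns",
    "text mining mines text",
    "data warehouse stores data",
]

# Columns: the sorted vocabulary of terms across the whole corpus.
terms = sorted({word for doc in corpus for word in doc.split()})

# Rows: one per document, holding the frequency of each term.
tdm = []
for doc in corpus:
    counts = Counter(doc.split())
    tdm.append([counts[term] for term in terms])
```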
analytics
the science of fact-based decision making
data visualization
A graphical, animation, or video presentation of data and the results of data analysis.
Which of the following two are common derivatives of association rule mining? A- Link analysis & Sequence Mining B- Statistical Analysis & Interval Data C- Clustering Analysis & Data Mining D- Forecast Analysis & Interval Data
A- Link analysis & Sequence Mining
Which of the following is not a data mining task? A. Segmentation B. Association C. Prediction D. Exporting
D. Exporting
What is the main difference between descriptive and inferential statistics? A. descriptive - Describes data on hand, inferential- Draw conclusions about the characteristics of population B. inferential - Predicts future outcomes, descriptive- describes past outcomes C. Inferential - Describes data on hand, Descriptive- Draw conclusions about the characteristics of population D. Descriptive - Predicts future outcomes, Inferential- describes past outcomes
A. descriptive - Describes data on hand, inferential- Draw conclusions about the characteristics of population
web mining
The discovery and analysis of interesting and useful information from the Web, about the Web, and usually through Web-based tools.
prescriptive analytics
A branch of business analytics that deals with finding the best possible solution alternative for a given problem.
descriptive statistics
A branch of statistical modeling that aims to describe a given sample of data.
inferential statistics
A branch of statistical modeling that aims to draw inferences or conclusions about the characteristics of the population based on a given sample of data.
predictive analytics
A business analytical approach toward forecasting (e.g., demand, problems, opportunities) that is used instead of simply reporting data as they occur.
storytelling
A case with rich information and episodes. Lessons may be derived from this kind of case in a case base.
SEMMA
An alternative process for data mining projects proposed by the SAS Institute. The acronym "SEMMA" stands for "sample, explore, modify, model, and assess."
spider
An application that automatically sifts, crawls, and reads through the content of Web sites.
OLAP
An information system that enables the user, while at a PC, to query the system, conduct an analysis, and so on. The result is generated in seconds.
Define analytics.
Analytics is the process of developing actions or recommendations based on information derived from historical data.
A decision tree classifies entities into what? A. All classes based on the features of the entities B. Particular classes based on features of the entities C. Particular classes based on the size of the entities D. All classes based on the size of the entities
B. Particular classes based on features of the entities
How many steps are in the data warehousing process? A: 2 B: 4 C: 6 D: 8
C: 6
How does CRISP-DM differ from SEMMA?
CRISP-DM takes a more comprehensive approach, considering business needs and relevant data. SEMMA assumes needs have been identified and understood.
List and briefly define at least two classification techniques.
Case-based reasoning: uses historical cases to recognize commonalities in order to assign a new case to the most probable category. Genetic algorithms: use the analogy of natural evolution to build search-based mechanisms to classify data samples.
What is the major difference between cluster analysis and classification?
Classification is supervised: it learns from historical cases whose classes are already known. Clustering has no supervising mechanism that enforces the learning process; it uses one or more heuristics to discover natural groupings of objects.
What is prescriptive analytics?
Combining what is currently going on with what is projected for the future to determine the best possible course of action.
List and describe the major components of BI.
Data warehouse: data storage. Business analytics: manipulating the data through data mining and other applications. Business performance management: performance analysis and monitoring. User interface: allows users to interact through a dashboard.
List three of the terms that have been predecessors of analytics.
Decision support systems, enterprise/executive information systems, and business intelligence.
What are the main differences between descriptive and inferential statistics?
Descriptive statistics describe sample data while inferential statistics draw conclusions about the characteristics of a population.
What are the main steps in the text mining process?
Establish the corpus; create the term-document matrix; extract knowledge.
Describe the three steps of the ETL process.
Extract: read data. Transform: convert data into a new form that can be loaded to the data warehouse. Load: put data into the data warehouse.
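The three ETL steps can be sketched end to end with Python's built-in sqlite3 module. Everything here is illustrative: the source rows, the `sales` table, and the cleaning rules are assumptions, and an in-memory database stands in for the data warehouse.

```python
import sqlite3

# Hypothetical source rows, standing in for data read from an operational system.
source_rows = [
    {"customer": " alice ", "amount": "19.99"},
    {"customer": "BOB", "amount": "5.00"},
]

def extract(rows):
    """Extract: read the raw records from the source."""
    return list(rows)

def transform(rows):
    """Transform: clean up names and convert amounts to numbers."""
    return [(r["customer"].strip().title(), float(r["amount"])) for r in rows]

def load(conn, rows):
    """Load: write the transformed records into the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()

warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(source_rows)))
loaded = warehouse.execute(
    "SELECT customer, amount FROM sales ORDER BY customer"
).fetchall()
```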
Define Gini index. What does it measure?
Gini Index is used to measure the diversity of a population in economics. The same concept is used to determine the purity of a specific class as a result of a decision to branch along a particular attribute.
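In decision tree induction the Gini index of a group is commonly computed as 1 minus the sum of squared class proportions, and a candidate split is scored by the size-weighted average over its branches. A minimal sketch with made-up labels; the helper names are hypothetical.

```python
from collections import Counter

def gini_index(labels):
    """Gini impurity: 0 means the group is pure (a single class)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_of_split(groups):
    """Size-weighted Gini index of the branches produced by a split."""
    total = sum(len(group) for group in groups)
    return sum(len(group) / total * gini_index(group) for group in groups)

# Hypothetical loan outcomes before and after splitting on some attribute.
before = gini_index(["yes", "yes", "no", "no"])
after = gini_of_split([["yes", "yes"], ["no", "no"]])
```

The split that lowers the Gini index the most yields the purest branches, so in this toy case the attribute separates the classes perfectly.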
authoritative pages
Web pages that are identified as particularly popular based on links by other Web pages and directories.
What are the main differences among line, bar and pie charts? When should you use one over the others?
Line charts: time series data; depict the relationship between two variables to find a trend. Bar charts: categorical data; used to compare data across multiple categories. Pie charts: depict relative proportions of a specific measure.
What is logistic regression? How does it differ from linear regression?
Logistic regression is a probability-based classification algorithm that uses supervised learning. Its response is a class, whereas linear regression's response is a numerical value.
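The difference shows up in the output: logistic regression passes a linear combination through the logistic (sigmoid) function to get a probability, then assigns a class. The sketch below assumes coefficients that were already estimated; the values and function names are made up for illustration.

```python
import math

def predict_probability(x, intercept, slope):
    """Logistic (sigmoid) function applied to a linear combination of x."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * x)))

def predict_class(x, intercept, slope, threshold=0.5):
    """Turn the probability into a class label (1 or 0)."""
    return 1 if predict_probability(x, intercept, slope) >= threshold else 0

# Hypothetical fitted coefficients: intercept = -1, slope = 1.
high = predict_class(2, intercept=-1, slope=1)
low = predict_class(0, intercept=-1, slope=1)
```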
List and briefly define the central tendency measures of descriptive statistics.
Mean: average value of the observations. Median: middle value of a sorted dataset. Mode: the observation that occurs most frequently.
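All three measures are available in Python's standard statistics module. The observations below are a made-up sample.

```python
import statistics

observations = [2, 3, 3, 5, 7, 10]  # hypothetical sample

mean = statistics.mean(observations)      # arithmetic average
median = statistics.median(observations)  # middle of the sorted values
mode = statistics.mode(observations)      # most frequent value
```

With an even number of observations, the median is the average of the two middle values.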
KPI
Measure of performance against a strategic objective and goal.
What is metadata? Explain the importance of metadata.
Metadata is data about data. It describes the structure and meaning of the data, which contributes to how effectively (or ineffectively) the data can be used.
star schema
Most commonly used and simplest style of dimensional modeling.
How does NLP relate to text mining?
NLP considers semantic, contextual and syntactic structure of the text in a document to extract information from the data source.
What is NLP?
NLP is an important component of text mining and a subfield of AI and computational linguistics. It studies the problem of understanding the natural human language with a view of converting the depictions of human language into more formal documentation that is easier for computer programs to manipulate.
hubs
One or more Web pages that provide a collection of links to authoritative pages.
What is OLAP, how does it differ from OLTP?
Online analytical processing (OLAP) lets users query and analyze data, typically in a data warehouse, with quickly generated results to support decision making. Online transaction processing (OLTP) captures and stores the data from day-to-day business transactions.
Define OLAP.
Online analytical processing is an online tool for data storage, retrieval, and analysis processing.
Define OLTP.
Online transaction processing supports routine business in an organization and responds to specific requests (transactions) made by users.
clustering
Partitioning a given data set into segments (natural groupings) in which the members of a segment share similar qualities.
What are the key differences among the major data mining tasks?
Prediction is telling the future. Classification analyzes historical data to predict future trends. Segmentation identifies natural groupings based on common characteristics. Association discovers relationships among variables in a dataset.
List and briefly define the dispersion measures of descriptive statistics.
Range: difference between the largest and smallest values. Variance: average of the squared deviations from the mean. Standard deviation: square root of the variance; a measure of how far the observations are spread from the mean.
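A short sketch of the three dispersion measures on a made-up data set. It uses the population versions from the statistics module; `statistics.variance` and `statistics.stdev` give the sample versions, which divide by n - 1 instead of n.

```python
import statistics

observations = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data, mean = 5

value_range = max(observations) - min(observations)
variance = statistics.pvariance(observations)  # mean of squared deviations
std_dev = statistics.pstdev(observations)      # square root of the variance
```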
What is regression, and what statistical purpose does it serve?
Regression is a statistical technique to model the dependency of a variable on one or more explanatory variables. It supports hypothesis testing to identify potential relationships and can be used for prediction and forecasting.
What are the most common metrics that make for analytics-ready data?
Reliability, accuracy, accessibility, security, consistency, currency, validity, and relevancy
text mining
The application of data mining to nonstructured or less structured text files. It entails the generation of meaningful numeric indices from the unstructured text and then processing those indices using various data mining algorithms.
What is "search engine optimization"? Who benefits from it?
Search engine optimization is intentional activity affecting the visibility of a website in search engine's natural results. Individual users and businesses both benefit from it. Users can efficiently perform searches and businesses can increase traffic to their sites.
What are three common data mining mistakes/blunders?
Selecting the wrong problem for data mining, defining a project around a foundation your data cannot support, and leaving insufficient time for data preparation
What is sentiment analysis? How does it relate to text mining?
Sentiment analysis is automated extraction of opinions, feelings, and subjectivity in text. It is used to detect opinions toward products or services using textual data sources. It extracts explicit and implicit texts.
What are the main steps in carrying out sentiment analysis projects?
Sentiment detection; N-P polarity classification; target identification; collection and aggregation.
Identify and comment on the information dimensions captured in the Napoleon march diagram.
Size of the army: thickness of the band. Direction of movement: the yellow band is the movement into Russia and the black band is the retreat. Geographic location: all cities visited. Outside temperature: the bottom half of the diagram displays temperatures during the march through certain areas.
What are the two most commonly used shape characteristics to describe a data distribution?
Skewness: measure of asymmetry in a distribution of data. Kurtosis: degree to which a peak in the distribution is more or less peaked than the normal distribution.
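Both shape measures can be computed from the central moments of the data. The sketch below uses the simple population (moment-based) formulas and reports excess kurtosis, so a normal distribution scores 0; statistical packages often apply small-sample corrections that give slightly different numbers. The data and the `shape_measures` name are made up.

```python
def shape_measures(values):
    """Return (skewness, excess kurtosis) using population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # variance
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3
    return skewness, excess_kurtosis

symmetric_skew, symmetric_kurt = shape_measures([1, 2, 3, 4, 5])
right_skew, _ = shape_measures([1, 1, 1, 1, 10])  # long right tail
```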
What is meant by social analytics? Why is it an important business concept?
Social analytics monitor, analyze, measure, and interpret digital interactions and relations of people and content. It helps gain insights about existing and potential customers' current and future behaviors and preferences towards products and services.
What is social media analytics?
Social media analytics is the systematic way to consume vast amounts of content created by web-based social media outlets and tools to better an organization's competitiveness.
What are some major data mining methods and algorithms?
Some major data mining methods include prediction, association, and segmentation. Some algorithms include decision trees, regression, and expectation maximization.
When developing a data warehouse, what are the most important risks and issues to consider?
Some risks to consider include choosing personnel who are technology rather than user oriented, focusing on internal data and ignoring the value of external data, believing problems are over when the data warehouse is up and running, and delivering data with confusing definitions.
classification
Supervised induction used to analyze the historical data stored in a database and to automatically generate a model that can predict future behavior.
What is text mining?
Text mining is a semi-automatic process of extracting patterns (useful information and knowledge) from large amounts of unstructured data sources.
What are the major data mining processes?
CRISP-DM: the six-step Cross-Industry Standard Process for Data Mining. SEMMA: sample, explore, modify, model, assess. KDD: knowledge discovery in databases.
clickstream analysis
The analysis of data that occur in the Web environment.
drill down
The investigation of information in detail (e.g., finding not only total sales but also sales by region, by product, or by salesperson). Finding the detailed sources.
support
The measure of how often products and/or services appear together in the same transaction; that is, the proportion of transactions in the data set that contain all the products and/or services mentioned in a specific rule.
sentiment analysis
The technique used to detect favorable and unfavorable opinions toward specific products and services using a large number of textual data sources (customer feedback in the form of Web postings).
What are the main categories of data?
The two main categories of data are structured and unstructured. Structured can further be broken into categorical and numerical. Unstructured includes text and images.
What is time series? What are the main forecasting techniques for time series data?
A time series is a sequence of data points of a variable, measured and represented at uniform time intervals, that is used for prediction. The main forecasting techniques include simple, moving, and weighted averages, and exponential smoothing.
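Three of these forecasting techniques fit in a few lines each. This is a sketch under simple assumptions: the demand series, the weights, and the smoothing constant are made up, and exponential smoothing is seeded with the first observation.

```python
def moving_average(series, window):
    """Forecast the next value as the mean of the last `window` values."""
    return sum(series[-window:]) / window

def weighted_moving_average(series, weights):
    """Like a moving average, but later weights apply to more recent values."""
    recent = series[-len(weights):]
    return sum(w * x for w, x in zip(weights, recent)) / sum(weights)

def exponential_smoothing(series, alpha):
    """New forecast = alpha * latest actual + (1 - alpha) * previous forecast."""
    forecast = series[0]  # seed with the first observation
    for actual in series[1:]:
        forecast = alpha * actual + (1 - alpha) * forecast
    return forecast

demand = [10, 12, 14, 16]  # hypothetical monthly demand
ma = moving_average(demand, window=3)
wma = weighted_moving_average(demand, weights=[1, 2, 3])
es = exponential_smoothing(demand, alpha=0.5)
```

On a steadily rising series like this one, all three forecasts lag behind the trend, with the weighted and exponential versions reacting a little faster because they favor recent values.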
Define BI.
Tools, databases, applications, and methods that allow users to manipulate information to make analyses and draw conclusions.
OLTP
Transaction system that is primarily responsible for capturing and storing data related to day-to-day business functions.
List and briefly define the phases in the CRISP-DM process.
Business understanding: know the reason for the study. Data understanding: identify relevant data for this specific study. Data preparation: prepare consolidated data to make it ready for analysis. Model building: identify the best method for the study. Testing and evaluation: evaluate the accuracy and scope of the model. Deployment: generate an analysis report and implement repeatable processes.
knowledge
Understanding, awareness, or familiarity acquired through education or experience; anything that has been learned, perceived, discovered, inferred, or understood; the ability to use information. In a knowledge management system, knowledge is information in action.
What are the three types of data generated through Web page visits?
Usage patterns, user profiles, and customer value.
What is predictive analytics?
Using statistical evaluation of historical data to develop a forecast of what is most likely to occur in the future.
What is descriptive analytics?
Using trends and available information to develop an understanding of what is currently going on within an organization through reading and analysis.
What are the most popular free data mining tools? Why are they gaining overwhelming popularity?
WEKA, KNIME, and RapidMiner are open-source and free of cost.
What recent technologies may shape the future of data warehousing, why?
Web and social media allow access to vast data sources with heavy amounts of personal data. Because of its volume and velocity, internet data will likely require new data warehousing technology. Cloud computing is now starting to be used as a data warehouse platform and will continue to be developed.
What are the three main areas of Web mining?
Web content, Web structure, and Web usage.
What is Web mining? How does it differ from regular data mining or text mining?
Web mining is the process of discovering intrinsic relationships from Web data; it can identify authoritative data through search engines. Because the Web is so vast, traditional data mining is less effective on it than Web mining.
What is web structure mining? How does it differ from Web content mining?
Web structure mining is the process of extracting useful information from links embedded in Web documents. Web content mining extracts information from web pages while web structure mining extracts useful information from links.
What are commonly used Web analytics metrics? What is the importance of metrics?
Website usability: how the website is being used. Traffic sources: where users are coming from. Visitor profiles: what the visitors look like. Conversion statistics: what this means for the business.
What is the first step of the 6 step CRISP-DM data mining process? a. Business Understanding b. Data Understanding c. Data Prep d. Deployment
a. Business Understanding
What type of data contains measurements of simple codes assigned to objects as labels, which are not measurements? a. Nominal Data b. Ordinal Data c. Ratio Data d. Interval Data
a. Nominal Data
What is a data warehouse? a. a pool of data produced to support decision making b. a nontrivial process of identifying useful patterns in data c. a process of extracting data from unstructured data d. a conceptual framework for managerial decision support
a. a pool of data produced to support decision making
What is a software program that searches for documents based on keywords via the web? a. search engine b. Train engine c. data analytics d. Web mining
a. search engine
big data analytics
application of analytics methods and tools to Big Data - data that is characterized by volume, variety, and velocity that exceeds the reach of commonly used hardware environments and/or capabilities of software tools to process
A very popular, statistically sound, probability-based classification algorithm that employs supervised learning, best describes? a. Normality b. Logistic Regression c. Linearity d. Linear Regression
b. Logistic Regression
What is a semi automated process of extracting patterns from large amounts of unstructured data? a. Data mining b. Text mining c. Analytics d. Delivery robot
b. Text mining
What is the definition of extract in the three-step ETL process? a. converting data from its previous form into new form b. read data from one or more data bases c. putting data into the data warehouse d. navigate among high-level data
b. read data from one or more data bases
Why would we use these steps? Steps: data consolidation, data cleaning, data transformation, and data reduction? a) Learning b) Data analysis c) Data processing d) Big Data
c) Data processing
Which one of the following is not an OLAP operation? a) Drill down b) Drill up c) Drill across d) Roll up
c) Drill across
What is a main category of data? a) Consolidated b) Multi-Structured c) Unstructured d) Organized
c) Unstructured
Which of the following is not a form of business analytics? a. Predictive b. Descriptive c. Big Data d. Prescriptive
c. Big Data
A technique used to detect favorable and unfavorable opinions towards specific products and services using a large number of textual data sources describes what? a. Speech Recognition b. Deception Detection c. Sentiment Analysis d. Trend Analysis
c. Sentiment Analysis
What is NOT a characteristic of Big Data? a. Velocity b. Volume c. Versatility d. Variety
c. Versatility
The 4 perspectives of balanced score cards include all of the below except a) Financial b) Internal Business Processes c) Customers d) External Business Processes
d) External Business Processes
What is a characteristic of Big Data? a. Volume b. Velocity c. Variety d. All of the above
d. All of the above
What things can help a web page rank higher in the search engine results? a. Edit content to specific keywords b. Remove barriers to the indexing activities of search engines c. HTML d. All of the above
d. All of the above
_______________ partitions a collection of things into segments whose members share similar characteristics? a. linked analysis b. Classification c. regression d. Clustering
d. Clustering
Which of the following is an example of a descriptive statistic? a. Median b. Mean c. Mode d. all of the above e. none of the above
d. all of the above
EAI
Enterprise application integration (EAI): a technology that provides a vehicle for pushing data from source systems into a data warehouse.
EDW
Enterprise data warehouse (EDW): an organizational-level data warehouse developed for analytical purposes.
EII
Enterprise information integration (EII): an evolving tool space that promises real-time data integration from a variety of sources, such as relational databases, Web services, and multidimensional databases.