MIS 6873 Exam 2

Data quality

GIGO (garbage in garbage out), organization process, culture, pre processing of the data


General Data Protection Regulation (2016). EU. Controls privacy rights, including the right to know where your data is going.

Key building blocks of Hadoop

Google File System and Google MapReduce

Which statement is CORRECT?

HBase can be considered as a NoSQL database.

Which components does the base Hadoop stack include?

HDFS, MapReduce and YARN.

Which statement is NOT CORRECT?

Hive applies a 'schema on write' approach.

Which statement is NOT CORRECT?

Hive's main advantage lies within the query speed and performance.


How much work is involved in translating the numerical score from an instrument into a clinically meaningful result.


Hybrid online analytical processing. Combines MOLAP and ROLAP. Uses the best of both OLAP types


the range of data types and sources that are used, data in its "many forms"


A presentation of an OLAP measure with associated dimensions. The reason for this term is that some products show these displays using three axes, like a cube in geometry. Same as OLAP report.

MapReduce Framework

A programming model to process large sets of data in parallel
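
The map, shuffle and reduce steps can be sketched with a word count. This is a single-process illustration only, not Hadoop code; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # On a cluster, each map call could run on a different node.
    # Here the calls simply run one after another.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data is big", "data lake"])
print(counts)
```

Because each map call only sees its own split and each reduce call only sees one key, the framework is free to spread both phases over many machines.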

Pig is...

A project offering a programming language to provide more user-friendliness compared to MapReduce programs.

Roll up, Roll Down

OLAP operator which allows you to aggregate or de-aggregate the current set of fact values within or across one or more dimensions


OLAP operator which allows you to select a range on one or more dimensions


OLAP operator which allows you to set one of the dimensions at a particular value
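
The three operators described in the cards above (aggregating, selecting a range, fixing one dimension) can be sketched on a tiny in-memory fact set. The data and function names are hypothetical and only meant to show what each operator does to the rows.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, quarter, region, sales)
FACTS = [
    (2023, "Q1", "EU", 100),
    (2023, "Q2", "EU", 150),
    (2023, "Q1", "US", 200),
    (2024, "Q1", "EU", 120),
    (2024, "Q1", "US", 210),
]


def roll_up(facts):
    """Aggregate quarter-level sales up to the year level."""
    totals = defaultdict(int)
    for year, _quarter, _region, sales in facts:
        totals[year] += sales
    return dict(totals)


def select_range(facts, years, regions):
    """Keep facts whose year and region fall inside the given sets."""
    return [f for f in facts if f[0] in years and f[2] in regions]


def fix_dimension(facts, region):
    """Keep facts where the region dimension equals one particular value."""
    return [f for f in facts if f[2] == region]


print(roll_up(FACTS))
print(select_range(FACTS, years={2023}, regions={"EU"}))
print(fix_dimension(FACTS, "US"))
```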

Drill Across

OLAP operator which allows you to retrieve information from two or more connected fact tables


Data Warehouse: structured; Data Lake: raw, "as is", often unstructured

Organizational Aspects

Data governance plan, approach

Overall Considerations of privacy and security

Data integrity, Data availability, authentication and access control, confidentiality, auditing, mitigating vulnerabilities

Which statement is CORRECT?

The dimension tables of a star schema contain the criteria for aggregating the measurement data and will typically be used as constraints to answer queries.

Which statement is NOT CORRECT?

The lower the area under the ROC curve (AUC), the better the model performs.
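
The statement above is the incorrect one because a higher AUC means better ranking of positives over negatives. A minimal sketch, assuming binary labels (1 = positive, 0 = negative) and using the pairwise-ranking interpretation of AUC; the function name auc and the sample scores are made up for illustration.

```python
def auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 0, 0]
good_model = auc(labels, [0.9, 0.8, 0.3, 0.1])  # every positive outranks every negative
weak_model = auc(labels, [0.9, 0.2, 0.8, 0.1])  # one positive is ranked below a negative
print(good_model, weak_model)
```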

Deep learning

The most advanced machine learning methods. Analysis of multiple examples. Examples include Google Translate and speech recognition.

Star Schema

The most commonly used and the simplest style of dimensional modeling. Contains a fact table surrounded by and connected to several dimension tables.
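
A small star schema can be sketched with Python's built-in sqlite3 module. The table and column names (fact_sales, dim_product, dim_date) are hypothetical; the query shows the dimension tables acting as the grouping criterion and as a constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE fact_sales  (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    amount     REAL
);
INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Games');
INSERT INTO dim_date    VALUES (10, 2023), (11, 2024);
INSERT INTO fact_sales  VALUES (1, 10, 20.0), (1, 11, 30.0), (2, 11, 50.0);
""")

# Dimension attributes supply the constraint (year) and the grouping (category);
# the fact table supplies the measure that is aggregated (amount).
rows = conn.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date    AS d ON d.date_id = f.date_id
    WHERE d.year = 2024
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
print(rows)
```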


Transactional system: normalized data; Data warehouse: (sometimes also) denormalized

Data Latency

Transactional system: real time data; Data warehouse: periodic snapshots, including historical data

Sensitivity Analysis

a special case of what-if analysis, is the study of the impact on other variables when one variable is changed repeatedly. How near or far the data is.

linear regression analysis

a straight-line mathematical model to describe the functional relationships between independent and dependent variables
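
The straight-line model y = slope * x + intercept can be fitted with ordinary least squares. A minimal sketch for one independent variable; the function name fit_line and the sample points are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one independent variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Points that lie exactly on y = 2x + 1
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)
```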

logistic regression analysis

a type of multiple regression in which a sample's scores on two or more measures are used to predict their scores on a categorical measure

Issue with data prep: Data preparation requires deep knowledge of organizational data

Create company standards for data definitions

Issue with data prep: The hidden reality of data prep silos

Create consistency and collaboration within the data prep process

Decision trees can be used in the following applications:

Credit risk scoring and churn prediction


Choosing a subset of historical data. This should be representative. Should consider the optimal time window (trade-off between lots of data and recent data).

Which statement is CORRECT?

A MapReduce pipeline in Hadoop can include an optional Sorter to sort the final output.

Which statement is CORRECT?

A characteristic of data warehouses is that they are time-variant.

Virtual Data Warehouses

A set of separate databases, which can be queried together, so a user can effectively access all the data as if it was stored in one data warehouse

Hadoop Common

A set of shared programming libraries used by the other modules

Which statement is CORRECT?

A side benefit of cross-validation is that you can calculate a standard deviation and confidence interval for the performance measure.
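
Because cross-validation yields one performance value per fold, the spread across folds can be summarized directly. A minimal sketch with made-up fold accuracies; the 1.96 multiplier assumes a normal approximation, which is rough for a handful of folds (a t-based interval would be wider).

```python
import statistics


def summarize_folds(scores, z=1.96):
    """Mean, sample standard deviation and an approximate 95% interval."""
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores)  # sample standard deviation (n - 1)
    half_width = z * stdev / len(scores) ** 0.5
    return mean, stdev, (mean - half_width, mean + half_width)


fold_accuracy = [0.80, 0.82, 0.78, 0.84, 0.76]  # hypothetical 5-fold results
mean, stdev, interval = summarize_folds(fold_accuracy)
print(mean, stdev, interval)
```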

Independent Data Mart

A small data warehouse designed for a strategic business unit or a department

Data Lake

A storage repository that holds a vast amount of raw data in its original format until the business needs it

OLAP (online analytical processing)

BI technique where business users can query to interactively analyze the data, summarize it, and visualize it in various ways

Operational Efficiency

Can we implement it once we create the model


Capture and retrieval of the data


Data Warehouse: before entering the data warehouse; Data Lake: before analysis


Data Warehouse: decision makers; Data Lake: data scientists


Data Warehouse: low; Data Lake: high


Data Warehouse: mature; Data Lake: maturing


Data Warehouse: expensive; Data Lake: low cost


Data Warehouse: schema on write; Data Lake: schema on read

Which of the following measures cannot be used to make the splitting decision in a regression tree?


Which statement is NOT CORRECT?

MLlib is based on the MapReduce pipeline.

Which of the following statements on Big Data is correct?

MapReduce programs can be automatically parallelized and executed across a cluster of different computers.

SQL on Hadoop

MapReduce is very complex when compared to SQL. Need a more database-like setup on top of Hadoop.

Consider a data set with a multiclass target variable as follows: 25% bad payers, 25% poor payers, 25% medium payers and 25% good payers. In this case, the entropy will be:
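
The entropy for this class distribution can be worked out directly. A minimal sketch, assuming the base-2 logarithm (so the result is in bits); the function name entropy is made up for this example.

```python
import math


def entropy(proportions):
    """Shannon entropy in bits; classes with proportion 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)


# Four equally likely classes: 4 * (0.25 * log2(1 / 0.25)) = 4 * 0.5
four_classes = entropy([0.25, 0.25, 0.25, 0.25])
two_classes = entropy([0.5, 0.5])
print(four_classes, two_classes)
```

With k equally likely classes the entropy equals log2(k), which is the maximum possible value for k classes.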


Which statement is CORRECT?

One of the disadvantages of Spark is that its streaming and machine learning APIs are still mostly RDD based.

Apache Spark

Open-source alternative for MapReduce. Uses a new programming paradigm: the resilient distributed dataset (RDD), distributed across the cluster, with implicit data parallelism and fault tolerance.


Relational online analytical processing. Uses schemas and a lot more standard. Uses SQL and can be very versatile

Which of the following elements is the entry point to which client submit their YARN applications?


Given the following two statements: i. Missing values are meaningless and should always be discarded. ii. In outlier detection and handling, it is crucial to differentiate between valid and invalid values. Which of these statements are correct?

Statement ii is correct but statement i is not correct.


The Electronic Communications Privacy Act. Establishes the guidelines for e-mail monitoring by employers and employees. US. 1986.

Which statement is NOT CORRECT?

The Hive executer takes the MapReduce stages and sends these to Hadoop Common.


The amount of data, also referred to as data "at rest"

Which statement is CORRECT?

The betweenness counts the number of the times that a node or edge occurs in the geodesics of the network.

How does a client read a file from HDFS?

The client sends a request to the NameNode. The NameNode returns the block locations, i.e. which DataNode(s) contain the desired information. The client then reads the data directly from the DataNode(s).

Business Intelligence (BI)

The set of activities, techniques, and tools aimed at understanding patterns in past data and predicting the future.

Virtual Data Mart

Use a metadata table structure that would then reference data elements within the data warehouse

Virtual Data Warehouse and Virtual Data Mart

Use wrappers to update data consistently


completely structured data; data warehousing; Facebook uses this

Google File System

easily distributed across commodity hardware, while providing fault tolerance

Exploratory Analysis

examining the data descriptively to become as familiar as possible with it at a macro level (mean, median, mode, STDEV, percentiles)

open source software

free, less quality assurance, full access to source code

clustering approach

groups similar characteristics or items

Social Network Metrics

how network roles affect one another


semi structured data; resembles SQL but relatively procedural vs. declarative

New source of data

social networks, public data (government websites), variety to add to the analysis


use of historical data to test a strategy that was developed subsequent to the observation of the data

commercial software

well engineered business focused solutions (end to end), extensive help facilities, business continuity, pre packaged (black box routines), expensive

Which of the following costs should be included in a Total Cost of Ownership (TCO) analysis?

All of these costs

Harvard Business Review: Defense

- Ensure Data security, Privacy, integrity, quality, regulatory compliance, and governance - optimize data extraction, standardization, storage, and access - control - SSOT (single source of truth)

Harvard Business Review: Offense

- improve competitive position and profitability - optimize data analytics, modeling, visualization, transformation, and enrichment - flexibility - MVOTs (multiple versions of the truth)

Which of the following strategies can be used to deal with missing values?

All of these strategies can be applied

Random forests

1. Very good performance (speed, accuracy) when abundant data is available. 2. Use bootstrapping/bagging to initialize each tree with different data. 3. Use only a subset of variables at each node. 4. Use a random optimization criterion at each node. 5. Project features on a random different manifold at each node.

Hadoop Distributed File System (HDFS)

A Java based file system to store data across multiple machines

Which statement is NOT CORRECT?

A MapReduce program can be implemented in an easy, straightforward manner.

Which statement is NOT CORRECT?

A data lake is targeted towards decision makers at middle and top management level, whereas a data warehouse requires a data scientist, which is a more specialized profile in terms of data handling and analysis.

Which statement is CORRECT?

A dendrogram can be used to decide upon the optimal number of clusters. It is a tree-like diagram that records the sequences of merges.

Decision Tree Analysis

A diagramming and calculation technique for evaluating the implications of a chain of multiple options in the presence of uncertainty.

Privacy Act

A law passed in 1974 requiring government files about individuals to be kept confidential. US.

neural networks

A lot of nodes that are interconnected. Mainly used for AI

Data Warehouse

A subject-oriented, integrated, time-variant, nonvolatile collection of data used in support of management's decision-making process

Dependent Data Mart

A subset (or mart) that is created directly from a data warehouse

Which of the following activities are part of the post-processing step?

All of these activities

Operational Data Store (ODS)

A type of database often used as an interim area for a data warehouse query, especially for customer information files. (used for more real time look at data)

Snowflake Schema

A type of star schema in which dimension tables can have their own dimension tables. The snowflake schema is usually the result of normalizing dimension tables.

Cross Fertilization

Across models and experiences from different areas

Which statement is NOT CORRECT?

All given success factors of an analytical model, i.e. relevance, performance, interpretability, efficiency, economical cost and regulatory compliance, are always equally important.

Total Cost of Ownership (TCO)

All of the costs associated with the design, development, testing, implementation, documentation, training and maintenance of a software system.


An open source software framework used for distributed storage and fast processing of big datasets using clusters of computers built from normal, commodity hardware

Which statement is NOT CORRECT?

An outrigger table can be defined to store a set of attribute types of a dimension table which are uncorrelated, high in cardinality and updated simultaneously.

Query and Reporting

BI technique where business users can graphically and interactively design a query and corresponding report (the report can be refreshed at any time). Also known as query by example (QBE).

Pivot Tables (or cross table)

BI technique that cross-tabulates multidimensional data and represents it in a two-dimensional tabular format
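
Cross-tabulation can be sketched in plain Python: one dimension becomes the rows, another the columns, and the measure is summed in each cell. The records and the function name pivot are hypothetical.

```python
from collections import defaultdict


def pivot(records, row_key, col_key, value_key):
    """Sum value_key for every (row, column) combination."""
    table = defaultdict(lambda: defaultdict(float))
    for record in records:
        table[record[row_key]][record[col_key]] += record[value_key]
    return {row: dict(cols) for row, cols in table.items()}


sales = [
    {"region": "EU", "year": 2023, "amount": 100.0},
    {"region": "EU", "year": 2024, "amount": 120.0},
    {"region": "US", "year": 2023, "amount": 200.0},
    {"region": "EU", "year": 2023, "amount": 50.0},
]
table = pivot(sales, "region", "year", "amount")
print(table)
```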

Which of the following is not one of the reasons why Spark programs are generally faster than MapReduce operations?

Because Mesos can be used as a resource manager instead of YARN.

Cloud solution

Better security management, scalability and economies of scale, easy maintenance/upgrades, improved collaboration across business departments, risk of vendor lock-in

In house vs Outsourcing

Both can accomplish the same thing but you have to look at the risks. Analytics concerns a company's frontend strategy, exchange of confidential information, continuity of the partnership, cultural mismatch

Sequence Rules

Descriptive; analyzes transactions and looks for patterns

Association Rules

Descriptive; discovers links or associations amongst data.

Outlier Detection and Handling

Determining if the data is valid or invalid. Detected by min/max test or looking at a histogram, box plot, and/or scatter plot. We handle this by treating the data as missing value (invalid observation) or capping (valid observation)
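
The two handling options can be sketched as follows: values outside the plausible range are invalid and become missing, while valid but extreme values are capped. All limits here are hypothetical and would normally come from domain knowledge or from inspecting the distribution.

```python
def handle_outliers(values, lower, upper, valid_min, valid_max):
    """Invalid values become None (missing); valid extremes are capped."""
    cleaned = []
    for value in values:
        if value < valid_min or value > valid_max:
            cleaned.append(None)  # invalid observation: treat as missing value
        else:
            cleaned.append(min(max(value, lower), upper))  # valid: cap to [lower, upper]
    return cleaned


ages = [25, 40, 300, 97, -1]
cleaned = handle_outliers(ages, lower=18, upper=90, valid_min=0, valid_max=120)
print(cleaned)
```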

Issue with data prep: Rigid and time-consuming processes don't keep up with demand

Develop agile processes with the right tools to support them

Statistical Performance and Validity

Does the model fit the data. Are the results correctly representing the data

Management Support

Existing c level vs new position (chief analytic officer)

Is the following statement true or false? "All given success factors of an analytical model, i.e. relevance, performance, interpretability, efficiency, economical cost and regulatory compliance, are always equally important."


YARN (Yet Another Resource Negotiator)

Handles the management and scheduling of resources requests in a distributed environment

Fact Constellation (Galaxy) Schema

Has >1 fact table and dimension table where dimension tables can be connected to multiple fact tables


Health Insurance Portability and Accountability Act of 1996. US.

Social Network Definitions

Helps describe the data collected from social media and networks

Executive Systems

Helps executives understand data and decide how to move forward *what do we want to achieve*

Which statement is CORRECT?

In a data warehouse context, the definition of junk dimensions greatly contributes to the maintainability and query performance.

Which statement is CORRECT?

In outlier detection and handling, it is crucial to differentiate between valid and invalid values.

Which statement is CORRECT?

Missing values and outliers can potentially provide useful information and should be analyzed before they are removed/replaced.

Management Systems

Information-based routines and procedures designed to maintain or alter patterns in organizational activities *How are we going to achieve goals*

Economical Cost

Is the investment worth the benefit.

Denormalizing Data

Merging data into one table. Information goes from multiple data tables into one big table that holds all the information.
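
Denormalizing can be sketched as joining a lookup table into every row that references it. The customers and orders data below are made up; the result repeats customer attributes on each order row, which is the redundancy that denormalization accepts.

```python
# Normalized: customer attributes are stored once, orders only hold a key.
customers = {
    1: {"name": "Ann", "city": "Dallas"},
    2: {"name": "Bob", "city": "Austin"},
}
orders = [
    {"order_id": 100, "customer_id": 1, "total": 50.0},
    {"order_id": 101, "customer_id": 2, "total": 20.0},
    {"order_id": 102, "customer_id": 1, "total": 75.0},
]


def denormalize(orders, customers):
    """Merge each order with its customer's attributes into one wide row."""
    wide = []
    for order in orders:
        customer = customers[order["customer_id"]]
        wide.append({**order, **customer})
    return wide


wide_table = denormalize(orders, customers)
print(wide_table[0])
```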


Multidimensional online analytical processing. Uses arrays to manage data. Can be fast but needs more storage. Does not use standard SQL; products have their own language.

Which statement is CORRECT?

Negative ROI of analytics often boils down to the lack of good quality data, management support and a company-wide data driven decision culture


NoSQL- like data storage platform; large volume; limited query facilities

Which of the following elements launches application containers and monitors the application resource usage?


Which of the following commands are not a part of HBase?


Social Network Learning

Predictive modeling technique for social networks

The Analytics Process Model

Predictive, Descriptive, Social Network

Google MapReduce (GMR)

Programming paradigm to write programs that can be automatically parallelized and executed across a cluster

EU-US Privacy Shield

Protects data over different continents. Provide companies on both sides of the Atlantic with a mechanism to comply with data protection requirements when transferring personal data from the European Union and Switzerland to the United States in support of transatlantic commerce

Issue with data prep: "Clean data" is a matter of perspective

Put the power in the hands of the data experts

RACI Matrix

Responsible, Accountable, Consulted, Informed

Model Deployment

Sending the model and analysis to be used for business development

Tufte Fundamental Principles

Show comparisons, show causality, use multivariate data, completely integrate modes (like text, images, numbers), establish credibility, and focus on content

Which statement is NOT CORRECT?

Spark SQL DataFrames need to be created by loading a file.

Given the following statements: i. When using featurization, the network is summarized in a set of features, such as betweenness and closeness. ii. The betweenness of a node is its average distance to all other nodes in the network. Which of these statements are correct?

Statement i is correct, but statement ii is not correct.
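
The difference between the two measures can be sketched on a small undirected graph. Betweenness counts how often a node lies on geodesics (shortest paths) between other nodes, while average distance underlies closeness. The graph and function names are made up; where several geodesics exist between a pair, this sketch counts the fraction that pass through the node, which is the usual convention.

```python
from collections import deque


def bfs(graph, source):
    """Distances and number of shortest paths from source to each reachable node."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]
    return dist, sigma


def betweenness(graph, node):
    """Sum over pairs (s, t) of the share of geodesics that pass through node."""
    info = {n: bfs(graph, n) for n in graph}
    dist_n, sigma_n = info[node]
    others = [n for n in graph if n != node]
    total = 0.0
    for i, s in enumerate(others):
        dist_s, sigma_s = info[s]
        for t in others[i + 1:]:
            if t not in dist_s or node not in dist_s or t not in dist_n:
                continue  # unreachable pair
            if dist_s[node] + dist_n[t] == dist_s[t]:
                total += sigma_s[node] * sigma_n[t] / sigma_s[t]
    return total


def average_distance(graph, node):
    """Average geodesic distance from node to all other reachable nodes."""
    dist, _ = bfs(graph, node)
    others = [d for n, d in dist.items() if n != node]
    return sum(others) / len(others)


# Path graph A - B - C - D
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
print(betweenness(graph, "B"), average_distance(graph, "B"))
```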

Which of the following is the definition for the sensitivity?


Which of the following is the definition for the precision?
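
The answer options for these two cards are not reproduced here. Assuming the standard confusion-matrix definitions, sensitivity is TP / (TP + FN) and precision is TP / (TP + FP). A minimal sketch with made-up labels (1 = positive, 0 = negative):

```python
def confusion_counts(actual, predicted):
    """Return (TP, FP, FN, TN) for binary labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn


def sensitivity(actual, predicted):
    """Share of actual positives that were predicted positive (also called recall)."""
    tp, _fp, fn, _tn = confusion_counts(actual, predicted)
    return tp / (tp + fn)


def precision(actual, predicted):
    """Share of predicted positives that are actually positive."""
    tp, fp, _fn, _tn = confusion_counts(actual, predicted)
    return tp / (tp + fp)


actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
print(sensitivity(actual, predicted), precision(actual, predicted))
```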



The speed at which data comes in and goes out, data "in motion"


The uncertainty of the data, data "in doubt"

Operational Systems

These systems support the day‐to‐day activities of the business (purchasing of goods and services, manufacturing activities, sales to customers, cash collections, payroll, etc.) Also known as transaction processing systems (TPS). *How well are we achieving goals*

Which statement is CORRECT?

To guarantee maximum independence and organizational impact of analytics, it is important that a Chief Data Officer or Chief Analytics officer is added to the executive committee who directly reports to the CEO.

Regulatory Compliance

To ensure that the model is not breaking any laws. Mainly ethics or privacy rights.


Transactional system: application-oriented; Data warehouse: subject-oriented


Transactional system: day-to-day business operations; Data warehouse: decision support at the tactical/strategic level

Transaction Management

Transactional system: important; Data warehouse: less of a concern

Data manipulation

Transactional system: insert/update/delete/select; Data warehouse: insert/select

Type of queries

Transactional system: many simple queries; Data warehouse: fewer, but complex and ad-hoc (when necessary or needed) queries

Advantages of Apache Spark

Uses RDDs which can handle - iterative programs that have to visit a data set multiple times; more interactive or exploratory programs - great for analytics applications. Much faster than MapReduce implementations. Rapidly adopted by many Big Data vendors.

Missing Values

Values that were not collected. Raises the question of whether to keep, delete, or replace them.

Which statement is NOT CORRECT?

Veracity in Big Data refers to data "in change".

Which of the following is not a characteristic of a data warehouse?


What do the 5 V's of Big Data stand for?

Volume, Variety, Velocity, Veracity, Value.

5 Vs of Big Data

Volume, Velocity, Variety, Veracity, Value


Was it worth it

Which statement is CORRECT?

When building decision trees, the impurity can be measured with entropy, gini, MSE or ANOVA with the F-test
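
Two of the impurity measures named above can be sketched in a few lines: Gini for a categorical target (classification tree) and MSE for a continuous target (regression tree). The function names and sample values are made up for illustration.

```python
def gini(proportions):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in proportions)


def mse(values):
    """Mean squared error around the node mean, used for continuous targets."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


pure_node = gini([1.0])          # only one class present
mixed_node = gini([0.5, 0.5])    # two equally frequent classes
node_mse = mse([1.0, 2.0, 3.0])
print(pure_node, mixed_node, node_mse)
```

A split is attractive when the weighted impurity of the child nodes is lower than the impurity of the parent node.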

Which statement is CORRECT?

When choosing the granularity of the fact table, a trade-off between the level of detailed analysis and storage requirements (and hence query performance) must be made.

Which statement is CORRECT?

When predicting a categorical value, logistic regression can be used.
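
Logistic regression predicts a categorical outcome by passing a linear score through the logistic (sigmoid) function and thresholding the resulting probability. A minimal sketch of the prediction step only; the weights and bias are hypothetical values, not fitted coefficients.

```python
import math


def predict_probability(features, weights, bias):
    """Probability of the positive class: sigmoid of the linear score."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def classify(features, weights, bias, threshold=0.5):
    """Turn the probability into a class label (1 or 0)."""
    return 1 if predict_probability(features, weights, bias) >= threshold else 0


weights, bias = [1.5, -0.5], -1.0  # hypothetical coefficients
print(predict_probability([2.0, 1.0], weights, bias))
print(classify([2.0, 1.0], weights, bias), classify([0.0, 2.0], weights, bias))
```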

To guarantee maximum independence and organizational impact of analytics, it is important that...

a Chief Data Officer or Chief Analytics officer is added to the executive committee who directly reports to the CEO.

Featurization in the context of neural networks refers to...

making features (=inputs) out of the network characteristics.

Interpretation and Validation

interpreting the results and making sure they are correct

Business relevance

is it relevant or the answer to the problem or need

On premise solution

keep data in house (full control), security risk, expensive up or downsizing

Return on Investment (ROI)

the direct financial impact of a firm's expenditure of a resource, such as time or money

