MIS 6873 Exam 2
Data quality
GIGO (garbage in, garbage out), organizational processes, culture, preprocessing of the data
GDPR
General Data Protection Regulation (2016), EU. Governs privacy rights, including the right to know where your data is going.
Key building blocks of Hadoop
Google File System (GFS) and Google MapReduce
Which statement is CORRECT?
HBase can be considered as a NoSQL database.
Which components does the base Hadoop stack include?
HDFS, MapReduce and YARN.
Which statement is NOT CORRECT?
Hive applies a 'schema on write' approach.
Which statement is NOT CORRECT?
Hive's main advantage lies within the query speed and performance.
Interpretability
How much work is involved in translating the numerical score from an instrument into a clinically meaningful result.
HOLAP
Hybrid online analytical processing. Combines MOLAP and ROLAP. Uses the best of both OLAP types
Variety
the range of data types and sources that are used, data in its "many forms"
OLAP Cube
A presentation of an OLAP measure with associated dimensions. The reason for this term is that some products show these displays using three axes, like a cube in geometry. Same as OLAP report.
MapReduce Framework
A programming model to process large sets of data in parallel
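A minimal single-machine sketch of the map, shuffle, and reduce steps, using word counting as the example. The function names are illustrative and are not part of the Hadoop API; on a real cluster each input split would be mapped on a different node.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    # Each document stands in for one input split.
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

counts = word_count(["big data big cluster", "data lake"])
```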
Pig is...
A project offering a programming language to provide more user-friendliness compared to MapReduce programs.
Roll up, Roll Down
OLAP operators which allow you to aggregate (roll up) or de-aggregate (roll down) the current set of fact values within or across one or more dimensions
Dicing
OLAP operator which allows you to select a range on one or more dimensions
Slicing
OLAP operator which allows you to set one of the dimensions at a particular value
Drill Across
OLAP operator which allows you to retrieve information from two or more connected fact tables
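A small sketch of roll-up, slicing, and dicing on an in-memory fact table. The rows, dimension names, and helper functions are made up for illustration; a real OLAP engine would do this on a cube or through SQL.

```python
from collections import defaultdict

# Hypothetical fact rows: year, quarter, region are dimensions; sales is the measure.
facts = [
    {"year": 2023, "quarter": "Q1", "region": "EU", "sales": 100},
    {"year": 2023, "quarter": "Q2", "region": "EU", "sales": 150},
    {"year": 2023, "quarter": "Q1", "region": "US", "sales": 200},
    {"year": 2024, "quarter": "Q1", "region": "US", "sales": 250},
]

def roll_up(rows, dims, measure="sales"):
    """Roll up: aggregate the measure to the coarser level given by dims."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row[measure]
    return dict(totals)

def slice_rows(rows, dim, value):
    """Slicing: fix one dimension at a single value."""
    return [r for r in rows if r[dim] == value]

def dice_rows(rows, allowed):
    """Dicing: keep rows whose dimension values fall in the allowed sets."""
    return [r for r in rows if all(r[d] in vals for d, vals in allowed.items())]

by_year = roll_up(facts, ["year"])                      # quarter and region aggregated away
eu_only = slice_rows(facts, "region", "EU")             # region fixed at EU
diced = dice_rows(facts, {"year": {2023}, "quarter": {"Q1"}})
```

Rolling down is the reverse of `roll_up`: call it with more dimensions, for example `["year", "quarter"]`.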
Data
Data Warehouse: structured; Data Lake: raw, "as is", often unstructured
Organizational Aspects
Data governance plan, approach
Overall Considerations of privacy and security
Data integrity, Data availability, authentication and access control, confidentiality, auditing, mitigating vulnerabilities
Which statement is CORRECT?
The dimension tables of a star schema contain the criteria for aggregating the measurement data and will typically be used as constraints to answer queries.
Which statement is NOT CORRECT?
The lower the area under the ROC curve (AUC), the better the model performs.
Deep learning
Among the most advanced machine learning methods. Learns from the analysis of many examples. Examples are Google Translate and speech recognition.
Star Schema
The most commonly used and the simplest style of dimensional modeling. Contains a fact table surrounded by and connected to several dimension tables.
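A minimal sketch of a star schema using Python's built-in sqlite3 module. The table and column names are hypothetical: one fact table holds the measure, and two dimension tables supply the grouping and filtering criteria.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE fact_sales  (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    amount     REAL
);
INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Games');
INSERT INTO dim_date    VALUES (1, 2023), (2, 2024);
INSERT INTO fact_sales  VALUES (1, 1, 10.0), (1, 2, 20.0), (2, 2, 5.0);
""")

# Dimension tables act as constraints (year) and aggregation criteria (category).
rows = conn.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_date d    ON d.date_id = f.date_id
    WHERE d.year = 2024
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
```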
Normalization
Transactional system: normalized data; Data warehouse: (sometimes also) denormalized data
Data Latency
Transactional system: real-time data; Data warehouse: periodic snapshots, including historical data
Sensitivity Analysis
A special case of what-if analysis: the study of the impact on other variables when one variable is changed repeatedly. How near or far the data is.
linear regression analysis
a straight-line mathematical model to describe the functional relationships between independent and dependent variables
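A short sketch of fitting such a straight line by ordinary least squares for one independent variable. The data points are made up, and `fit_line` is an illustrative helper rather than a library function.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data lying exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```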
logistic regression analysis
a type of multiple regression in which a sample's scores on two or more measures are used to predict their scores on a categorical measure
Issue with data prep: Data preparation requires deep knowledge of organizational data
Create company standards for data definitions
Issue with data prep: The hidden reality of data prep silos
Create consistency and collaboration within the data prep process
Decision trees can be used in the following applications:
Credit risk scoring and churn prediction
Sampling
Choosing a subset of historical data. This should be representative. Should consider the optimal time window (trade-off between lots of data and recent data).
Which statement is CORRECT?
A MapReduce pipeline in Hadoop can include an optional Sorter to sort the final output.
Which statement is CORRECT?
A characteristic of data warehouses is that they are time-variant.
Virtual Data Warehouses
A set of separate databases, which can be queried together, so a user can effectively access all the data as if it was stored in one data warehouse
Hadoop Common
A set of shared programming libraries used by the other modules
Which statement is CORRECT?
A side benefit of cross-validation is that you can calculate a standard deviation and confidence interval for the performance measure.
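A sketch of that side benefit, assuming five made-up fold accuracies. The interval uses the normal quantile 1.96 as a rough approximation; with only a few folds, a t-quantile would give a wider interval.

```python
import math
import statistics

# Hypothetical accuracy measured on each of 5 cross-validation folds.
fold_scores = [0.80, 0.82, 0.78, 0.84, 0.81]

mean_score = statistics.mean(fold_scores)
std_score = statistics.stdev(fold_scores)  # sample standard deviation across folds

# Approximate 95% confidence interval for the mean performance.
half_width = 1.96 * std_score / math.sqrt(len(fold_scores))
confidence_interval = (mean_score - half_width, mean_score + half_width)
```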
Independent Data Mart
A small data warehouse designed for a strategic business unit or a department
Data Lake
A storage repository that holds a vast amount of raw data in its original format until the business needs it
OLAP (online analytical processing)
BI technique where business users can interactively query the data to analyze, summarize, and visualize it in various ways
Operational Efficiency
Can the model be implemented once it has been created?
Value
Capture and retrieval of the data
Transformation
Data Warehouse: before entering the data warehouse; Data Lake: before analysis
Users
Data Warehouse: decision makers; Data Lake: data scientists
Agility
Data Warehouse: low; Data Lake: high
Security
Data Warehouse: mature; Data Lake: maturing
Storage
Data Warehouse: expensive; Data Lake: low cost
Processing
Data Warehouse: schema on write; Data Lake: schema on read
Which of the following measures cannot be used to make the splitting decision in a regression tree?
Entropy
Which statement is NOT CORRECT?
MLlib is based on the MapReduce pipeline.
Which of the following statements on Big Data is correct?
MapReduce programs can be automatically parallelized and executed across a cluster of different computers.
SQL on Hadoop
MapReduce is very complex compared to SQL. A more database-like setup on top of Hadoop is needed.
Consider a data set with a multiclass target variable as follows: 25% bad payers, 25% poor payers, 25% medium payers and 25% good payers. In this case, the entropy will be:
Maximal
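A quick check of why the answer is "maximal": entropy peaks when all classes are equally likely. The sketch below uses base-2 logarithms, so four balanced classes give log2(4) = 2 bits; the skewed distribution is a made-up comparison.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits; classes with probability 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

balanced = entropy([0.25, 0.25, 0.25, 0.25])  # bad, poor, medium, good payers
skewed = entropy([0.70, 0.10, 0.10, 0.10])    # hypothetical unbalanced target
```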
Which statement is CORRECT?
One of the disadvantages of Spark is that its streaming and machine learning APIs are still mostly RDD based.
Apache Spark
Open-source alternative for MapReduce. Uses a new programming paradigm: the resilient distributed dataset (RDD), which is distributed across the cluster and provides implicit data parallelism and fault tolerance.
ROLAP
Relational online analytical processing. Uses schemas and a lot more standard. Uses SQL and can be very versatile
Which of the following elements is the entry point to which clients submit their YARN applications?
ResourceManager
Given the following two statements: (i) Missing values are meaningless and should always be discarded. (ii) In outlier detection and handling, it is crucial to differentiate between valid and invalid values. Which of these statements are correct?
Statement ii is correct but statement i is not correct.
ECPA
The Electronic Communications Privacy Act (US, 1986). Establishes the guidelines for e-mail monitoring by employers and employees.
Which statement is NOT CORRECT?
The Hive executer takes the MapReduce stages and sends these to Hadoop Common.
Volume
The amount of data, also referred to as data "at rest"
Which statement is CORRECT?
The betweenness counts the number of the times that a node or edge occurs in the geodesics of the network.
How does a client read a file from HDFS?
The client sends a request to the NameNode. The NameNode returns the block locations, i.e. which DataNode(s) contain the desired information. The client then reads the data directly from the DataNode(s).
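A toy simulation of that read path. The classes and method names below are stand-ins invented for illustration; they are not the real HDFS client API, and replication is ignored.

```python
class NameNode:
    """Holds only metadata: which blocks make up a file and where they live."""
    def __init__(self, block_map):
        self.block_map = block_map  # filename -> ordered list of (block_id, datanode_name)

    def get_block_locations(self, filename):
        return self.block_map[filename]

class DataNode:
    """Holds the actual block contents."""
    def __init__(self, blocks):
        self.blocks = blocks  # block_id -> data

    def read_block(self, block_id):
        return self.blocks[block_id]

def read_file(filename, namenode, datanodes):
    # Step 1: ask the NameNode where the blocks are.
    locations = namenode.get_block_locations(filename)
    # Step 2: read each block directly from its DataNode.
    return "".join(datanodes[node].read_block(block_id) for block_id, node in locations)

namenode = NameNode({"log.txt": [("blk_1", "dn1"), ("blk_2", "dn2")]})
datanodes = {"dn1": DataNode({"blk_1": "hello "}), "dn2": DataNode({"blk_2": "world"})}
content = read_file("log.txt", namenode, datanodes)
```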
Business Intelligence (BI)
The set of activities, techniques, and tools aimed at understanding patterns in past data and predicting the future.
Virtual Data Mart
Use a metadata table structure that would then reference data elements within the data warehouse
Virtual Data Warehouse and Virtual Data Mart
Use wrappers to update data consistently
Hive
completely structured data; data warehousing; Facebook uses this
Google File System
easily distributed across commodity hardware, while providing fault tolerance
Exploratory Analysis
examining the data descriptively to become as familiar as possible with it at a macro level (mean, median, mode, STDEV, percentiles)
open source software
free, less quality assurance, full access to source code
clustering approach
groups similar characteristics or items
Social Network Metrics
how network roles affect one another
Pig
semi structured data; resembles SQL but relatively procedural vs. declarative
New source of data
Social networks and public data (e.g., government websites); add variety to the analysis
Backtesting
use of historical data to test a strategy that was developed subsequent to the observation of the data
commercial software
well engineered business focused solutions (end to end), extensive help facilities, business continuity, pre packaged (black box routines), expensive
Which of the following costs should be included in a Total Cost of Ownership (TCO) analysis?
All of these costs
Harvard Business Review: Defense
- Ensure Data security, Privacy, integrity, quality, regulatory compliance, and governance - optimize data extraction, standardization, storage, and access - control - SSOT (single source of truth)
Harvard Business Review: Offense
- improve competitive position and profitability - optimize data analytics, modeling, visualization, transformation, and enrichment - flexibility - MVOTs (multiple versions of the truth)
Which of the following strategies can be used to deal with missing values?
All of these strategies can be applied
Random forests
1. Very good performance (speed, accuracy) when abundant data is available. 2. Use bootstrapping/bagging to initialize each tree with different data. 3. Use only a subset of variables at each node. 4. Use a random optimization criterion at each node. 5. Project features on a random different manifold at each node.
Hadoop Distributed File System (HDFS)
A Java based file system to store data across multiple machines
Which statement is NOT CORRECT?
A MapReduce program can be implemented in an easy, straightforward manner.
Which statement is NOT CORRECT?
A data lake is targeted towards decision makers at middle and top management level, whereas a data warehouse requires a data scientist, which is a more specialized profile in terms of data handling and analysis.
Which statement is CORRECT?
A dendrogram can be used to decide upon the optimal number of clusters. It is a tree-like diagram that records the sequences of merges.
Decision Tree Analysis
A diagramming and calculation technique for evaluating the implications of a chain of multiple options in the presence of uncertainty.
Privacy Act
A US law passed in 1974 requiring government files about individuals to be kept confidential.
neural networks
A lot of nodes that are interconnected. Mainly used for AI
Data Warehouse
A subject-oriented, integrated, time-variant, nonvolatile collection of data used in support of management's decision-making process
Dependent Data Mart
A subset (or mart) that is created directly from a data warehouse
Which of the following activities are part of the post-processing step?
All of these activities
Operational Data Store (ODS)
A type of database often used as an interim area for a data warehouse query, especially for customer information files. (used for more real time look at data)
Snowflake Schema
A type of star schema in which dimension tables can have their own dimension tables. The snowflake schema is usually the result of normalizing dimension tables.
Cross Fertilization
Across models and experiences from different areas
Which statement is NOT CORRECT?
All given success factors of an analytical model, i.e. relevance, performance, interpretability, efficiency, economical cost and regulatory compliance, are always equally important.
Total Cost of Ownership (TCO)
All of the costs associated with the design, development, testing, implementation, documentation, training and maintenance of a software system.
Hadoop
An open-source software framework used for distributed storage and fast processing of big datasets using clusters of computers built from normal, commodity hardware
Which statement is NOT CORRECT?
An outrigger table can be defined to store a set of attribute types of a dimension table which are uncorrelated, high in cardinality and updated simultaneously.
Query and Reporting
BI technique where business users can graphically and interactively design a query and the corresponding report (the report can be refreshed at any time). Also known as query by example (QBE).
Pivot Tables (or cross table)
BI technique that cross-tabulates multidimensional data and represents it in a two-dimensional tabular format
Which of the following is not one of the reasons why Spark programs are generally faster than MapReduce operations?
Because Mesos can be used as a resource manager instead of YARN.
Cloud solution
Better security management, scalability and economies of scale, easy maintenance/upgrades, improved collaboration across business departments, risk of vendor lock-in
In house vs Outsourcing
Both can accomplish the same thing but you have to look at the risks. Analytics concerns a company's frontend strategy, exchange of confidential information, continuity of the partnership, cultural mismatch
Sequence Rules
Descriptive; analyzes transactions and looks for patterns
Association Rules
Descriptive; discovers links or associations amongst data.
Outlier Detection and Handling
Determining whether the data is valid or invalid. Outliers are detected with a min/max test or by looking at a histogram, box plot, and/or scatter plot. They are handled by treating the value as missing (invalid observation) or by capping it (valid observation).
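A sketch of the two handling options, assuming made-up validity limits and capping limits for an age variable. In practice the limits would come from domain knowledge or from the distribution (for example, percentiles).

```python
def handle_outliers(values, valid_min, valid_max, cap_low, cap_high):
    """Invalid values become missing (None); valid but extreme values are capped."""
    cleaned = []
    for value in values:
        if value < valid_min or value > valid_max:
            cleaned.append(None)  # min/max test failed: invalid observation
        else:
            cleaned.append(min(max(value, cap_low), cap_high))  # cap valid extremes
    return cleaned

ages = [25, 40, -3, 97, 300]  # hypothetical raw data
cleaned_ages = handle_outliers(ages, valid_min=0, valid_max=120, cap_low=18, cap_high=90)
```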
Issue with data prep: Rigid and time-consuming processes don't keep up with demand
Develop agile processes with the right tools to support them
Statistical Performance and Validity
Does the model fit the data? Do the results correctly represent the data?
Management Support
Existing C-level executive vs. a new position (Chief Analytics Officer)
Is the following statement true or false? "All given success factors of an analytical model, i.e. relevance, performance, interpretability, efficiency, economical cost and regulatory compliance, are always equally important."
False
YARN (Yet Another Resource Negotiator)
Handles the management and scheduling of resources requests in a distributed environment
Fact Constellation (Galaxy) Schema
Has more than one fact table, with dimension tables that can be connected to multiple fact tables
HIPAA
Health Insurance Portability and Accountability Act of 1996. US.
Social Network Definitions
Helps describe the data collected from social media and networks
Executive Systems
Helps executives understand data and decide how to move forward *what do we want to achieve*
Which statement is CORRECT?
In a data warehouse context, the definition of junk dimensions greatly contributes to the maintainability and query performance.
Which statement is CORRECT?
In outlier detection and handling, it is crucial to differentiate between valid and invalid values.
Which statement is CORRECT?
Missing values and outliers can potentially provide useful information and should be analyzed before they are removed/replaced.
Management Systems
Information-based routines and procedures designed to maintain or alter patterns in organizational activities *How are we going to achieve goals*
Economical Cost
Is the investment worth the benefit.
Denormalizing Data
Merging data into one table. Information goes from multiple data tables into one big table that holds all the information.
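A small sketch of denormalizing, assuming two made-up normalized tables (customers and orders) that are merged into one wide table by copying the customer attributes onto every order row.

```python
# Hypothetical normalized tables.
customers = {
    1: {"name": "Ann", "city": "Dallas"},
    2: {"name": "Bob", "city": "Austin"},
}
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 50},
    {"order_id": 11, "customer_id": 1, "amount": 75},
    {"order_id": 12, "customer_id": 2, "amount": 20},
]

# One big table: customer attributes are repeated for each order (redundant but join-free).
denormalized = [{**order, **customers[order["customer_id"]]} for order in orders]
```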
MOLAP
Multidimensional online analytical processing. Uses arrays to manage data. Can be fast but needs more storage. Does not use standard SQL; MOLAP products have their own language.
Which statement is CORRECT?
Negative ROI of analytics often boils down to the lack of good quality data, management support and a company-wide data driven decision culture
HBase
NoSQL-like data storage platform; large volume; limited query facilities
Which of the following elements launches application containers and monitors the application resource usage?
NodeManager
Which of the following commands is not a part of HBase?
Place
Social Network Learning
Predictive modeling technique for social networks
The Analytics Process Model
Predictive, Descriptive, Social Network
Google MapReduce (GMR)
Programming paradigm to write programs that can be automatically parallelized and executed across a cluster
EU-US Privacy Shield
Protects data over different continents. Provide companies on both sides of the Atlantic with a mechanism to comply with data protection requirements when transferring personal data from the European Union and Switzerland to the United States in support of transatlantic commerce
Issue with data prep: "Clean data" is a matter of perspective
Put the power in the hands of the data experts
RACI Matrix
Responsible, Accountable, Consulted, Informed
Model Deployment
Sending the model and analysis to be used for business development
Tufte Fundamental Principles
Show comparisons, show causality, use multivariate data, completely integrate modes (like text, images, numbers), establish credibility, and focus on content
Which statement is NOT CORRECT?
Spark SQL DataFrames need to be created by loading a file.
Given the following statements: (i) When using featurization, the network is summarized in a set of features, such as betweenness and closeness. (ii) The betweenness of a node is its average distance to all other nodes in the network. Which of these statements are correct?
Statement i is correct, but statement ii is not correct.
Which of the following is the definition for the sensitivity?
TP/(TP+FN)
Which of the following is the definition for the precision?
TP/(TP+FP)
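Both definitions as code, applied to made-up confusion-matrix counts (TP = true positives, FN = false negatives, FP = false positives).

```python
def sensitivity(tp, fn):
    """Share of actual positives that the model finds: TP / (TP + FN)."""
    return tp / (tp + fn)

def precision(tp, fp):
    """Share of predicted positives that are correct: TP / (TP + FP)."""
    return tp / (tp + fp)

# Hypothetical counts.
tp, fn, fp = 40, 10, 20
model_sensitivity = sensitivity(tp, fn)  # 40 / 50
model_precision = precision(tp, fp)      # 40 / 60
```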
Velocity
The speed at which data comes in and goes out, data "in motion"
Veracity
The uncertainty of the data, data "in doubt"
Operational Systems
These systems support the day‐to‐day activities of the business (purchasing of goods and services, manufacturing activities, sales to customers, cash collections, payroll, etc.) Also known as transaction processing systems (TPS). *How well are we achieving goals*
Which statement is CORRECT?
To guarantee maximum independence and organizational impact of analytics, it is important that a Chief Data Officer or Chief Analytics officer is added to the executive committee who directly reports to the CEO.
Regulatory Compliance
To ensure that the model is not breaking any laws, mainly regarding ethics or privacy rights.
Design
Transactional system: application-oriented; Data warehouse: subject-oriented
Usage
Transactional system: day-to-day business operations; Data warehouse: decision support at the tactical/strategic level
Transaction Management
Transactional system: important; Data warehouse: less of a concern
Data manipulation
Transactional system: insert/update/delete/select; Data warehouse: insert/select
Type of queries
Transactional system: many simple queries; Data warehouse: fewer, but complex and ad hoc (as needed) queries
Advantages of Apache Spark
Uses RDDs which can handle - iterative programs that have to visit a data set multiple times; more interactive or exploratory programs - great for analytics applications. Much faster than MapReduce implementations. Rapidly adopted by many Big Data vendors.
Missing Values
Values that were not collected. Raises the question of whether to keep, delete, or replace them.
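A sketch of the three options on a made-up income column, with None marking a missing value. Median imputation is just one possible replacement rule.

```python
import statistics

incomes = [1200, None, 1500, 1800, None]  # hypothetical data

# Delete: drop the observations with a missing value.
deleted = [value for value in incomes if value is not None]

# Replace: impute the median of the observed values.
median_income = statistics.median(deleted)
replaced = [median_income if value is None else value for value in incomes]

# Keep: record missingness as its own indicator, since it may carry information.
is_missing = [value is None for value in incomes]
```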
Which statement is NOT CORRECT?
Veracity in Big Data refers to data "in change".
Which of the following is not a characteristic of a data warehouse?
Volatile
What do the 5 V's of Big Data stand for?
Volume, Variety, Velocity, Veracity, Value.
5 Vs of Big Data
Volume, Velocity, Variety, Veracity, Value
Justifiability
Was it worth it
Which statement is CORRECT?
When building decision trees, the impurity can be measured with entropy, gini, MSE or ANOVA with the F-test
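A sketch of two of these impurity measures: Gini for a classification node and MSE for a regression node. The inputs are made up, and the helper names are illustrative.

```python
def gini(probabilities):
    """Gini impurity of a node given its class proportions: 1 - sum(p^2)."""
    return 1 - sum(p * p for p in probabilities)

def mse(targets):
    """Mean squared error of a regression node around its mean prediction."""
    mean = sum(targets) / len(targets)
    return sum((t - mean) ** 2 for t in targets) / len(targets)

pure_node = gini([1.0, 0.0])    # all observations in one class
mixed_node = gini([0.5, 0.5])   # maximally impure for two classes
node_mse = mse([1.0, 2.0, 3.0])
```

A split is chosen to reduce the impurity of the child nodes as much as possible relative to the parent.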
Which statement is CORRECT?
When choosing the granularity of the fact table, a trade-off between the level of detailed analysis and storage requirements (and hence query performance) must be made.
Which statement is CORRECT?
When predicting a categorical value, logistic regression can be used.
To guarantee maximum independence and organizational impact of analytics, it is important that...
a Chief Data Officer or Chief Analytics officer is added to the executive committee who directly reports to the CEO.
Featurization in the context of social networks refers to...
making features (=inputs) out of the network characteristics.
Interpretation and Validation
Interpreting the results and making sure they are correct
Business relevance
Is the model relevant to, and does it answer, the business problem or need?
On premise solution
Keep data in house (full control), security risk, expensive upsizing or downsizing
Return on Investment (ROI)
the direct financial impact of a firm's expenditure of a resource, such as time or money
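One common way to put a number on ROI is (gain - cost) / cost. This formula is an assumption added here for illustration, not a definition taken from the card above, and the amounts are made up.

```python
def roi(gain, cost):
    """Return on investment as a fraction of the cost: (gain - cost) / cost."""
    if cost == 0:
        raise ValueError("cost must be non-zero")
    return (gain - cost) / cost

# Hypothetical analytics project: 100,000 spent, 150,000 gained.
analytics_roi = roi(gain=150_000, cost=100_000)
```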