Midterm 1 - Data Wrangling, Hadoop and Spark, Big Data Strategy, Data Lakes
One drawback of MapReduce is that it passes data using key/value pairs and the value must be coded as a numeric value.
False
One step in developing a Big Data strategy is to identify the relevant business entities. While customers and competitors can definitely impact a business, the author emphasizes that only business entities internal to the organization should be considered in developing a Big Data strategy.
False
What are the components of CoNVO?
Context, Need, Vision, Outcome.
When profiling a dataset, you should avoid using visualizations to present the data profile.
False
What happens when you load your Yelp data into Trifacta?
After entering the title and description, click the Import & Wrangle button. Trifacta loads the file and parses the JSON. Each of the first-level JSON name/value pairs is shown as a column, and each JSON object for a business is shown as a row. For some of these pairs, such as hours (when they are open) and categories (for restaurants this often contains the cuisines), the value is a nested JSON object or array, respectively.
Does the tool keep the provenance of what transformations are done?
Yes. Trifacta records each transformation in a recipe; OpenRefine records the operation history as a script.
McKinsey's definition is based on which of the 3V's?
"Big data" refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze. - McKinsey (2011) Volume?/??
What is a spreadmart?
A spreadmart (spreadsheet data mart) is a business data analysis system running on spreadsheets or other desktop databases that is created and maintained by individuals or groups to perform the tasks normally done by a data mart or data warehouse.
What are the main steps in MapReduce?
- Map (done by a mapper written by the programmer)
- Shuffle (done by Hadoop)
- Reduce (done by a reducer written by the programmer)
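A minimal sketch of the three phases in plain Python, using word count as the classic example. In a real cluster, Hadoop would run the mappers and reducers on many machines and handle the shuffle itself:

```python
from collections import defaultdict

def mapper(line):
    # Map: emit a (key, value) pair for each word in the line
    for word in line.split():
        yield (word.lower(), 1)

def reducer(key, values):
    # Reduce: combine all values for one key into a single pair
    return (key, sum(values))

lines = ["the quick brown fox", "the lazy dog"]

# Shuffle: group all mapped values by key (Hadoop does this step for you)
groups = defaultdict(list)
for line in lines:
    for key, value in mapper(line):
        groups[key].append(value)

print([reducer(k, v) for k, v in groups.items()])
# [('the', 2), ('quick', 1), ('brown', 1), ('fox', 1), ('lazy', 1), ('dog', 1)]
```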
Criteria to evaluate data sources:
1. The business value that the data source could provide to support the individual use case. 2. The feasibility (or ease) of acquiring, cleaning, aligning, normalizing, enriching, and analyzing that data source.
What is the relationship between business initiatives, entities, and key decisions?
Business initiatives define the goals of the project, business entities are the users or customers the final project will serve, and key decisions are the decisions that must be made for the initiative to succeed.
Why did Data.gov almost get cancelled? (watch Porway video)
Because nobody was using it: the Government didn't know who would use it or how, and it was nearly defunded.
How is data communicated in MapReduce?
Through HDFS: each phase writes its output as files that the next phase reads.
If you are not very familiar with SQL, how could you avoid needing to do CASE or IF statements?
CASE: during data wrangling, create a column that already categorizes the data, so the query can select on it directly. IF: during data wrangling, create a true/false column, so that at SELECT time we can simply ask for the rows we need.
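A hedged sketch of the idea using pandas (the column names are made up): compute the category and true/false columns once while wrangling, so the later query is a plain filter with no CASE/IF logic.

```python
import pandas as pd

df = pd.DataFrame({"name": ["A", "B", "C"], "stars": [4.5, 2.0, 3.5]})

# Instead of CASE ... WHEN in SQL, derive a category column while wrangling
df["rating_band"] = pd.cut(df["stars"], bins=[0, 3, 4, 5],
                           labels=["low", "mid", "high"])

# Instead of IF(...) in SQL, derive a true/false column while wrangling
df["is_good"] = df["stars"] >= 3.5

# Later, the "query" is a simple filter
print(df[df["is_good"]])
```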
Of the actions taken in transforming data, which of the following does NOT generally change the schema (structure) of the data?
Cleaning
Which data store is designed to hold more data?
Data Lake
Which type of data is loaded into a data warehouse or data lake?
Data Warehouse: structured data only.
Data Lake: all of them: structured, semi-structured, and unstructured.
What does embarrassingly parallel mean?
An embarrassingly parallel workload or problem (also called perfectly parallel or pleasingly parallel) is one where little or no effort is needed to separate the problem into a number of parallel tasks.
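For example, a minimal Python sketch: squaring a list of numbers is embarrassingly parallel, because each task needs only its own input and no coordination with the others.

```python
from multiprocessing import Pool

def square(x):
    # Each task depends only on its own input: no communication needed
    return x * x

if __name__ == "__main__":
    with Pool(4) as pool:
        print(pool.map(square, range(10)))  # [0, 1, 4, ..., 81]
```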
The chapter discusses creating a Big Data strategy document. Which of the following best describes the starting point for this approach:
Identify strategic business initiatives the organization aims to accomplish over the next 9 to 12 months
Compared to MapReduce, why is Spark faster?
In Spark, the intermediate results as well as data files read in for queries can be kept in memory as resilient distributed datasets (RDDs) and are only written to disk if there is not enough RAM to keep them in memory. MapReduce passes data between steps as files on HDFS; Spark passes records in memory.
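A hedged PySpark sketch (the file path is hypothetical): cache() asks Spark to keep the RDD in memory, so repeated actions reuse the in-memory records instead of re-reading the file each time.

```python
from pyspark import SparkContext

sc = SparkContext(appName="cache-demo")

# Read a text file and keep the records in memory as an RDD
lines = sc.textFile("reviews.txt").cache()  # hypothetical input file

# Both actions below reuse the cached RDD rather than re-reading from disk
print(lines.count())
print(lines.filter(lambda l: "pizza" in l).count())
```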
What were the benefits at Yahoo?
It was cheap and ran on commodity machines.
What does the name "Hadoop" mean?
Hadoop was the name of a stuffed elephant belonging to the son of Doug Cutting, the developer who created Hadoop.
Which of the following is NOT generally considered an advantage of data lakes built on Hadoop in comparison to traditional data warehouses?
More predictable processing load due to better data governance.
Major components of a query
Select queries, built from:
- Functions and operators
- Grouping and aggregate functions
- Joins and unions
- CASE and IF statements
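An illustrative sketch using Python's sqlite3 (the table and columns are invented) showing several of these components in one query: a function (UPPER), an aggregate with GROUP BY (AVG), and a CASE expression.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE business (name TEXT, city TEXT, stars REAL)")
conn.executemany("INSERT INTO business VALUES (?, ?, ?)",
                 [("A", "Austin", 4.5), ("B", "Austin", 2.0), ("C", "Reno", 3.5)])

# One SELECT combining a function, an aggregate with GROUP BY, and a CASE
for row in conn.execute("""
    SELECT UPPER(city),
           AVG(stars),
           CASE WHEN AVG(stars) >= 3.5 THEN 'good' ELSE 'poor' END
    FROM business
    GROUP BY city
"""):
    print(row)
```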
What did they discuss that was similar to the Porway video?
The importance of showing data to the right people in the right way. Porway says there is a lot of data out there that most citizens don't know how to use or don't know exists, while in the O'Reilly video they gave examples of how ordinary data could be used by regular people to answer fairly complex questions, like the hurricane one.
Why do you need to start with a question?
To find an answer.
6 techniques for refining the vision.
+ Interviews: talk to experts in the subject matter. Having their perspective is invaluable for building your intuition.
+ Rapid investigation: get order-of-magnitude estimates, related quantities, easy graphs, and so on, to build intuition for the topic.
+ Kitchen sink interrogation (brainstorming): ask every question that comes to mind relating to the need or data collection. Those questions will bring more questions.
+ Working backward: start from the finished idea and figure out what is needed just prior to achieving the outcome. Then see what comes before that, and so on.
+ More mockups: drawing the final result helps uncover the actual needs as well as how the final result might look.
+ Roleplaying: pretend you are the final user. Think through the process of interacting with the finished work.
Executing jobs on MapReduce consists of three phases in the following order:
Map, Shuffle, Reduce
What are the characteristics of key business initiatives?
Behaviors, tendencies, patterns, trends, preferences: anything that helps us understand a customer's personality. For example, a credit card company would want to capture Bill Schmarzo's specific travel and buying patterns and tendencies in order to better detect fraud and improve merchant marketing offers.
If you want to see histograms or other summaries of your data in OpenRefine (similar to Trifacta), what feature do you use?
Facets
Although the data warehouse can be a source of data for a data lake, the data lake is not appropriate for doing the ETL processes required to load a data warehouse.
False
Should data lakes be specialized for different purposes?
No
From the Patil / O'Reilly video, what did Kaggle winners do differently?
They didn't do anything different algorithmically; they used the same algorithms. The winners are more creative and think about what data to bring to the problem.
When wrangling a dataset, profiling and transformation are an iterative process that keep getting repeated in cycles of profiling and transforming the data.
True
MapReduce
A technique popularized by Google that distributes the processing of very large multi-structured data files across a large cluster of machines. Goal: achieving high performance with "simple" computers. Good at processing and analyzing large volumes of multi-structured data in a timely manner.
Which of the following best describes why Gartner (an IT research firm), said that data lakes run the risk of turning into data swamps?
A lack of data governance, security, and descriptive metadata will result in analysts not being able to re-use data.
What are the 4 components of data wrangling?
+ Access: the extract portion of the ETL pipeline.
+ Transformation: the functional aspect of wrangling; takes the data from how it is to how we want it.
+ Profiling: motivates and validates transformation. Checks for missing or invalid data and tells you about the state of the dataset.
+ Output: completion. Two outputs: the wrangled dataset and the script of transformation logic.
What are the component sections of the big data strategy document?
+ Business Strategy: how the organization will achieve its goals over the next 2 to 3 years.
+ Strategic Business Initiatives: the plans to achieve the business strategy over the next 9 to 12 months. Usually includes business objectives, financial targets, metrics, and time frames.
+ Business Entities: customers, students, doctors, etc.; the consumers or users around whom the business initiative studies behaviors and performance.
+ Business Decisions: decisions that the stakeholders need to make in support of the business initiative.
+ Financial Drivers or Big Data Use Cases: the analytic use cases (decisions and corresponding actions) that support the strategic business initiative.
How do Data Warehouse and Data lake differ?
+ DATA WAREHOUSE:
• Generates reports and dashboards on a scheduled basis
• Predictable processing load
• SLA-constrained
• Heavily governed
• Standardized
• Schema on load
• Complex integrations (ETL)
+ DATA LAKE:
• A cheaper way to store data (20 to 50 times cheaper)
• Data is stored as-is in its raw form
• Rapid ingest
• Provides the ability to store data before determining its value (Starbucks example)
• Linear scalability
• Ad hoc analysis
• Separates exploration from production
• Exploratory environment
• Unpredictable system load
• Experimentation oriented
• Looser governance
• Best tool for the job (large ecosystem of tools)
• Schema on query
• Complex processing
• Holds more data than a data warehouse
What are the 3 types of data projects?
+ Exploration: what is in the data, from the perspective of the metadata and the values contained in the dataset.
+ Curation: build, structure, and store datasets in the most useful way.
+ Production: happens when a data source has proven valuable to customers and is used regularly to serve them, for example, suggesting what show to watch or what product to buy.
What are the 3 major components of JSON?
+ Object: a set of name/value pairs enclosed in curly braces {}, where the name is a string that identifies the value. In JSON the names do not have to be unique. Pairs are separated by commas.
+ Array: a list of values associated with a single name label. Values are separated by commas and enclosed in brackets [ ].
+ Value: can be a string, a number, a Boolean, another object enclosed in {}, an array, or null.
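A small Python illustration (the values are made up, loosely modeled on the Yelp data) showing all three components: an object at the top level, an array under categories, and several kinds of values.

```python
import json

# An object {...} of name/value pairs; "categories" holds an array [...]
text = ('{"name": "Cafe X", "open": true, '
        '"categories": ["Coffee", "Bakery"], "hours": {"Mon": "8-17"}}')

business = json.loads(text)
print(business["name"])           # string value
print(business["open"])           # Boolean value
print(business["categories"][1])  # array element -> "Bakery"
print(business["hours"]["Mon"])   # value nested inside another object
```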
What are the types of transformations in wrangling?
+ Structuring: manipulates the schema of the dataset using only the data already in the dataset.
+ Enriching: adds columns/fields that bring new information into the dataset.
+ Cleaning: focused on the values within the dataset; tries to ensure that every column/field holds valid values.
What are the 3V's?
+ Volume: the sheer amount of data (measured in units up to yottabytes).
+ Velocity: how fast the data is generated or used.
+ Variety: different formats and lack of structure.
What are the types of profiling?
+ Type-based: calculate the percent of values in a column/field that satisfy the constraints of the type.
+ Distributional: there are two kinds. Distributional profiling helps detect deviation-based anomalies; cross-distributional profiling detects quality issues when patterns across multiple columns/fields are considered.
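A minimal sketch of type-based profiling in Python: compute the percent of values in a column that satisfy a type constraint (here, "parses as a number").

```python
def percent_valid(values, is_valid):
    # Type-based profiling: share of values meeting the type constraint
    return 100 * sum(is_valid(v) for v in values) / len(values)

def parses_as_number(v):
    try:
        float(v)
        return True
    except ValueError:
        return False

column = ["4.5", "3", "N/A", "2.0", ""]
print(percent_valid(column, parses_as_number))  # 60.0
```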
How would a data lake help with the spreadmart problem?
- Eliminate data silos. Rather than having dozens of independently managed collections of data (e.g., data warehouses, data marts, spreadmarts), you can combine these sources into a single data lake for indexing, cataloging, and analytics. Consolidating the data into a single repository increases data use and sharing while cutting costs through server and license reduction.
- Store, manage, protect, and analyze data by consolidating inefficient storage silos across the organization.
- Provide a simple, scalable, flexible, and efficient solution that works across block, file, or object workloads (i.e., a shared storage platform that natively supports both traditional and next-generation workloads).
- Reduce the costs of IT infrastructure.
- Speed up time to insights.
- Improve operational flexibility.
- Enable robust data protection and security capabilities.
- Reduce data warehouse workloads by shifting analytics-intensive queries to a special-purpose analytics sandbox environment.
- Free up data warehousing resources by off-loading Extract, Transform, and Load (ETL) processes from the data warehouse to the more cost-efficient, more powerful Hadoop-based data lake.
ETL (Extract, Transform and Load)
A process in data warehousing where data is pulled out of the source systems and placed in the data warehouse. Hadoop can substitute for dedicated ETL systems and is cheaper. For example, using Hadoop makes it much easier to create advanced customer purchase and product performance metrics around frequency (how often), recency (how recently), and sequencing (in what order) activities that could yield new insights that might be better predictors of customer behaviors and product performance.
Which component is most involved with the capture of metadata and provenance?
Access
If given examples, could you match them to the CoNVO components?
Context → This news organization produces stories and editorials for a wide audience. It makes money through advertising and through premium subscriptions to its content. The main decision maker for this project is the head of online business.
Need → We want to decide between two competing vendors. Which is better for us?
Vision → An idea for a marketing department looking to target new cities (details depend on the context):
• Vision: The marketing department that wants to improve its targeting will get a report that ranks cities by their predicted value to the company.
• Mockup: Austin, Texas, would provide a 20% return on investment per month. New York City would provide an 11% return on investment per month.
• Argument sketch: The department should focus on city X, because it is most likely to bring in high value. The definition of high value that we're planning to use is substantiated for the following reasons...
Outcome → The metrics email for the nonprofit needs to be set up, verified, and tweaked. Sysadmins at the nonprofit need to be briefed on how to keep the email system running. The CTO and CEO need to be trained on how to read the metrics emails, which will consist of a document written to explain it.
Describe or identify each part of CoNVO
Context: understanding who we are working with and why we are doing it.
Need: what could be improved by better knowledge; what's the gap? Ask LOTS of questions. It's a design process, designing knowledge.
Vision: what does success look like? Not the actual answer.
Outcome: what happens when we are "done".
What does flattening based on a JSON array field do?
Creates a row for each element of the array (e.g., each category).
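A minimal Python sketch of the effect (the data is invented): one business with three categories becomes three rows.

```python
row = {"name": "Cafe X", "categories": ["Coffee", "Bakery", "Breakfast"]}

# Flatten: emit one row per element of the array field
flattened = [{"name": row["name"], "category": c} for c in row["categories"]]
for r in flattened:
    print(r)
# {'name': 'Cafe X', 'category': 'Coffee'} ... one row per category
```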
If data flows between the data warehouse and data lake, in which direction does it flow?
From Data Lake to Data Warehouse
What is HDFS? What are its main characteristics that we discussed?
Hadoop Distributed File System (HDFS)
• Hides the ugliness of distributed systems (things break)
• Is fault tolerant (things break)
• Replicates the data (things break)
• Write once, no updates, not limited by disk size
When Yahoo! first applied Hadoop to building their index of the Web, which of the following was NOT an advantage of using Hadoop?
Hadoop used fewer servers than their prior system
What are sources of data for identifying key business initiatives?
If it is a public company, check its financial statements and filings:
- Annual reports
- 10-K (filed annually)
- 10-Q (filed quarterly)
- Quarterly analyst calls
- SeekingAlpha.com
If it is a private company or a nonprofit:
- Executive presentations and conferences
- Executive blogs
- News releases
- Social media sites
- Web searches using Google, Yahoo, and Bing
In the CoNVO approach, the "Need" is defined as a question we want to answer to fill a gap in our knowledge. Which of the following sections of the Big Data strategy document is most similar to identifying the Need in CoNVO?
Key decisions
John Schroeder of MapR compared the difference between traditional databases and Hadoop to the evolution of Gmail. The characteristic he saw in common was:
With Gmail you can save everything, and with Hadoop it's cheap enough to save all of the raw data and analyze it as needed.
Of the actions you may take in transforming data, which of the following does NOT generally require an iterative processing of the data?
Preprocessing
In which component are you learning about your data?
Profiling
The summary statistics displayed in Trifacta are an example of what aspect of data wrangling?
Profiling
The big data strategy document:
Provides a framework for linking an organization's business strategy and supporting business initiatives to the organization's big data efforts. The big data strategy document guides the organization through the process of breaking down its business strategy and business initiatives into potential big data business use cases and the supporting data and analytic requirements.
What is the difference between a union and a join?
The SQL UNION operator stacks the results (rows) of two or more SELECT statements with matching columns (not very popular or useful; e.g., combining a table of men with a table of women). Joins match records in two tables on a common key, placing them side by side.
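A hedged sqlite3 sketch (the tables are invented) contrasting the two: UNION stacks rows from two compatible SELECTs, while JOIN matches rows across tables on a key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE men (name TEXT);   INSERT INTO men VALUES ('Bob');
    CREATE TABLE women (name TEXT); INSERT INTO women VALUES ('Ann');
    CREATE TABLE cities (name TEXT, city TEXT);
    INSERT INTO cities VALUES ('Bob', 'Austin'), ('Ann', 'Reno');
""")

# UNION: stack rows from two SELECTs that have the same columns
print(conn.execute("SELECT name FROM men UNION SELECT name FROM women").fetchall())

# JOIN: match rows from two tables on a common key
print(conn.execute("""
    SELECT m.name, c.city FROM men m JOIN cities c ON m.name = c.name
""").fetchall())
```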
What kind of hardware was being used?
Specialized, super expensive hardware.
The author emphasizes that the Big Data strategy document should focus on what the business is trying to accomplish over the next 9 to 12 months. Which of the following is NOT one of the reasons he emphasizes this time frame?
Technology is changing so fast that anything longer would be using obsolete tools
Which of the following statements best describes the relationship between a firm's data lake, their data warehouse, and the analytics sandbox used by their data science teams?
The data lake is a source of data for both the data warehouse and analytics sandbox, but only the sandbox feeds data back to the data lake.
Which is more closely related to data wrangling?
The feasibility (or ease) of acquiring, cleaning, aligning, normalizing, enriching, and analyzing those data sources
Which data source is more related to data Wrangling?
The feasibility (or ease) criterion.
The data wrangling process at a high level consists of four activities. Which two activities are considered the core of data wrangling?
Transformation and Profiling
Which of the data wrangling process are iterative?
Transformation and profiling. If profiling shows the needed information is not there yet, the transformation step has to be done again: action, feedback, action, feedback, etc.
Both MapReduce and Spark can use large distributed clusters of computers, but Spark is faster because it can keep data in memory (RAM) instead of writing it to disk as often as MapReduce does.
True
Data analytics teams in different business units often have different data needs, but the author argues that a firm should have one data lake - not separate specialized data lakes.
True
If shown a formula (similar to those we did in the lab), could you identify the output? Or, given a description of the result, could you select which action to take?
Yes.
parseJson → converts a JSON string to objects, e.g., value.parseJson().compliment_hot
if → e.g., if(value.parseJson().elite[0] == "None", 0, value.parseJson().elite.length())
sort → orders the array from smallest to largest
reverse → orders from largest to smallest
toNumber → converts a string to a number
cell.cross → returns an array of matching rows
If a desired result is described (similar to the lab) could you identify which function was used?
Yes.
Flatten: adds a row for each array element.
Drop: deletes or removes columns.
Keep: keeps the rows where the column has the value we selected.
Extract: extracts part of a string.
Rename: changes a column name.
Derive: creates a new column based on another column.
Delete: deletes rows.
In either tool, could you re‐run someone else's analysis? How?
Yes. With Trifacta we can reuse the recipe; with OpenRefine we can apply the extracted script (operation history).
Mockups (visualizations) are done in which part?
Vision
Is it more like agile or waterfall in software development?
Waterfall
In which part of a query are you most likely to use CASE and IF statements?
In the SELECT clause, when you are deriving or categorizing something in particular.
Would customers often be a key business entity?
YES!
If given a query with a major problem, could you identify the issue?
Yes
Hadoop
An open source platform designed to crunch epic amounts of data using an army of dirt-cheap servers. Hadoop is an open source framework for processing, storing, and analyzing massive amounts of distributed, unstructured data. Originally created by Doug Cutting at Yahoo!, Hadoop was inspired by MapReduce. Hadoop clusters run on inexpensive commodity hardware, so projects can scale out inexpensively.
In evaluating potential data sources, the author says one criteria is the feasibility or ease of doing some of the profiling and transformation tasks such as cleaning, aligning, and enriching the data. The other major criteria for evaluating a data source is the _______ (fill in the blank here)
business value
What is the origin of Hadoop?
• 2004: Dean & Ghemawat at Google publish a paper on MapReduce (the paper on the file system it runs on had appeared the year before)
• Cutting and Cafarella are working on a search engine: "We had it up and running on four or five machines, and we had a lot of manual steps to keep everything running on those machines. The papers provided a way to automate all those steps and give us reliability and really give a nice framework for what we were doing."
• Cutting is hired by Yahoo! to work on Hadoop: "Every search company at that time had built something very MapReduce-like ... [but] these were overly complex systems. The genius [of] MapReduce is that it was such a simple abstraction." - Raymie Stata, former Yahoo! CTO
What do you need to do to prevent a data lake from becoming a data swamp?
• Automate metadata capture / cataloging of data sets
• Graph-based description of data set lineage
• Examples: the GE data lake, Google Dataset Search (Goods)
What are the inputs and outputs of each step?
• Mapper
- Takes a key (line offset)
- Takes a line of input (e.g., a Yelp JSON object)
- Generates key/value pair(s) from that input
• Reducer
- Takes a key and an array of values for that key
- Generates a key/value pair
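A sketch of those signatures in Python (the Yelp-style line is invented): the mapper takes a line offset and one JSON line and emits key/value pairs; the reducer takes a key plus the array of values shuffled to it.

```python
import json

def mapper(offset, line):
    # key = line offset, value = one Yelp-style JSON object per line
    business = json.loads(line)
    for category in business.get("categories", []):
        yield (category, 1)  # emit key/value pair(s)

def reducer(key, values):
    # one key plus the array of values the shuffle delivered for it
    return (key, sum(values))

line = '{"name": "Cafe X", "categories": ["Coffee", "Bakery"]}'
print(list(mapper(0, line)))      # [('Coffee', 1), ('Bakery', 1)]
print(reducer("Coffee", [1, 1]))  # ('Coffee', 2)
```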
What led to the development of Hadoop?
• Scaling up is difficult. In today's internet-driven world, more and more data is hitting big businesses, and it's hitting them faster. Hadoop is a way of dealing with that data.