Week 4: Data Wrangling for Analytics

Steps in Data Wrangling Process

1. Prepare Analytical Sandbox 2. Perform ETLT 3. Condition the Data 4. Survey and Visualize the Data

(Steps in Data Wrangling) What is the process of cleaning, normalizing, and performing transformations specific to individual analyses (e.g., creating the outcome variable for prediction)?

3. Condition the data

(Steps in Data Wrangling) What is the process of leveraging visualization tools to get an overview of the data's characteristics (e.g., unexpected values or ranges)?

4. Survey and visualize the data

(Basic Data Terminology) dataset

A dataset is a collection of attribute values of many instances of the entity (e.g., all customers)

Continuous variables can have "interval" or "ratio" level

A ratio level has a true zero point (e.g., volume, since there is a zero volume, which is an absence of volume). An interval level has no true zero point (e.g., Celsius temperature, where zero degrees Celsius does not imply an absence of heat).
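The practical consequence of the zero point can be shown numerically: ratios of values are meaningful only at the ratio level. A small illustrative sketch (the specific values are made up):

```python
# Ratio vs. interval scales: ratios of values are meaningful only when the
# scale has a true zero. 40 oz really is twice 20 oz, but 40 C is not
# "twice as hot" as 20 C, as converting to Kelvin (a ratio scale) shows.
volume_ratio = 40 / 20                          # ratio scale: 2x is meaningful
kelvin_ratio = (40 + 273.15) / (20 + 273.15)    # interval scale in Celsius
print(volume_ratio, round(kelvin_ratio, 3))     # 2.0 vs. about 1.068
```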

(Basic Data Terminology) Variable

A variable stores data about an attribute (e.g. age)

(Data Wrangling Challenges) Example of Ambiguity in recorded data

Ambiguity in recorded data: data in most free-format storage, such as text descriptions or survey responses, is prone to ambiguities and misspellings. For example, the FDA, NIH, and pharma companies use different terms to describe the same side effects, leading to problems when documents from these entities are merged to create a dataset.

(Basic Data Terminology) Observation

An observation stores data about different attributes (e.g., name, past purchase amount) for a particular instance of the entity (e.g., a customer named "Chandra")

What can grow to be very large, since copies of the data are created to serve multiple analysis models?

Analytical Sandbox (workspace)

What can hold all formats of data, including raw data, aggregated data, transformed data?

Analytical Sandbox (workspace)

What is the workspace created for exploring data without interfering with live production databases?

Analytical Sandbox (workspace)

(Data Wrangling Challenges) Some Causes of Messiness in Datasets

Column headers are values, not variable names
Multiple variables stored in one column
Variables stored in both rows and columns
Multiple types of experimental units stored in the same table
One type of experimental unit stored in multiple tables

Steps in Data Wrangling 3. Condition the data

Condition the data: the process of cleaning, normalizing, and performing transformations specific to individual analyses (e.g., creating the outcome variable for prediction). Though the book says this step is done only by IT, data owners, DBAs, or data engineers (not the data scientist), I think most data scientists have (or are expected to have) the data skills to do many of the conditioning steps.

What are the considerations in data conditioning?

Considerations in data conditioning:
What are the data sources and target fields?
How clean is the data?
How consistent are the files and content? Missing values or deviations from normal? Consistency of data types?
Do the values in the columns make sense? E.g., negative age or income
Evidence of systematic error? E.g., data captured with shifted columns, a repurposed column, or data capture stopped midway
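Several of these checks are one-liners in pandas. A minimal sketch of conditioning checks, using hypothetical "age" and "income" columns for illustration:

```python
# Sketch of two data-conditioning checks with pandas: counting missing
# values and flagging values that make no sense (e.g., negative age).
# Column names and values are illustrative.
import pandas as pd

df = pd.DataFrame({
    "age": [34, -2, 51, None],          # -2 and None are suspect values
    "income": [52000, 48000, None, 61000],
})

# Missing values per column
missing = df.isna().sum()

# Values that do not make sense, e.g., negative age
bad_age = df[df["age"] < 0]

print(missing["age"], len(bad_age))     # one missing age, one negative age
```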

(Basic Data Terminology) Continuous Variables

Continuous variables store numerical values which have a logical order, and the relative distances between the values are meaningful. Example: volume of a beverage in oz (a continuous variable). The value 20 oz comes after 18 oz, and the difference between 20 oz and 18 oz is the same as the difference between 18 oz and 16 oz.

What is an important step before any useful analytics can be done on the data?

Data Preparation (data wrangling)

What is the most labor intensive step in the analytical life cycle?

Data Wrangling

Data Wrangling Challenges

Data Wrangling Challenges:
Multiple formats of data
Ambiguity in recorded data: data in most free-format storage, such as text descriptions or survey responses, is prone to ambiguities and misspellings. For example, the FDA, NIH, and pharma companies use different terms to describe the same side effects, leading to problems when documents from these entities are merged to create a dataset.

DataWrangler (a Stanford project that became the commercial venture Trifacta), is an example of?

Data Wrangling tools

Many companies build their own ad hoc data wrangling programs/scripts using languages such as Python, Java & R

Data Wrangling tools

Mr. Data Converter (from csv/tab separated data to other formats), is an example of?

Data Wrangling tools

OpenRefine (formerly Google Refine), is an example of?

Data Wrangling tools

Python with Pandas library, is an example of?

Data Wrangling tools

R packages - e.g., from tidyverse.org (dplyr,tidyr etc.), is an example of?

Data Wrangling tools

Tabula (to extract data from pdf into csv or excel format), is an example of?

Data Wrangling tools

Data Wrangling tools

Data Wrangling tools
Free tools (source: http://blog.varonis.com/free-data-wrangling-tools/):
Tabula (to extract data from PDF into CSV or Excel format)
OpenRefine (formerly Google Refine)
DataWrangler (a Stanford project that became the commercial venture Trifacta)
R packages, e.g., from tidyverse.org (dplyr, tidyr, etc.)
Python with the Pandas library
Mr. Data Converter (from CSV/tab-separated data to other formats)
Many companies build their own ad hoc data wrangling programs/scripts using languages such as Python, Java, and R

What is Data Wrangling?

Data Wrangling:
Another name for "data preparation" for analytics modeling; also called "data munging" and "data janitor work"
It is the process of converting or mapping data from one format into another, cleaning the data, merging, and filtering so it is ready for modeling and analysis
The goal of data wrangling is to provide a dataset that facilitates analysis: reporting, visualization, prediction, etc.
It involves understanding the data sources, the formats, and the pitfalls of extracting from one file format to another
It is the most labor-intensive step in the analytics life cycle
Many steps in data wrangling are iterative

Why is data wrangling important?

Data Wrangling:
Data preparation is an important step before any useful analytics can be done on the data.
Data from multiple sources can provide important clues about a company's business and customers.
Data scientists spend 50% to 80% of their time collecting and preparing the data before it can be used in analytics models.
In many firms that involves quite a bit of manual effort, creating bottlenecks for the more useful descriptive or predictive models.

Steps in Data Wrangling 2. Perform ETLT

Data is extracted in its raw form and loaded into the data store in the sandbox, where it can then be transformed, if necessary
Big data platforms like Hadoop and MapReduce are used to move large datasets to the sandbox and do some analysis
It is advisable to make an inventory of the data available and compare it to the data needed for analysis
If external data is needed, APIs are a popular way to access data sources (e.g., the Twitter API)
Learn about the data: metadata or data dictionaries have information about the data fields, providing context to the data and what to expect from the analysis
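The extract-load-then-transform pattern can be sketched in a few lines. Here an in-memory SQLite database stands in for the sandbox data store; the file contents and field names are made up for illustration:

```python
# Minimal ELT sketch: extract raw data, load it unchanged into a sandbox
# store (an in-memory SQLite database stands in for the sandbox), then
# transform inside the store. CSV contents and fields are illustrative.
import csv
import io
import sqlite3

raw_csv = "customer,amount\nChandra,120\nLee,85\n"   # pretend extract step

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE raw_sales (customer TEXT, amount INTEGER)")

# Load in raw form, no transformation yet
for row in csv.DictReader(io.StringIO(raw_csv)):
    con.execute("INSERT INTO raw_sales VALUES (?, ?)",
                (row["customer"], int(row["amount"])))

# Transform later, as the analysis requires
total = con.execute("SELECT SUM(amount) FROM raw_sales").fetchone()[0]
print(total)  # 205
```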

What does data wrangling involve?

Data wrangling involves understanding the data sources, the formats, pitfalls of extracting from one file format to another.

Data wrangling is the process of?

Data wrangling is the process of converting or mapping data from one format into another, cleaning the data, merging, and filtering so it is ready for modeling and analysis
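A tiny example of the "converting or mapping from one format into another" part, using pandas to turn CSV text into JSON-style records (the data itself is made up):

```python
# Sketch of format conversion with pandas: CSV text in, JSON records out.
# The CSV contents are illustrative.
import io
import json

import pandas as pd

csv_text = "name,size\nlatte,medium\nmocha,large\n"
df = pd.read_csv(io.StringIO(csv_text))
records = df.to_dict(orient="records")   # list of one dict per row
print(json.dumps(records))
```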

Basic Data Terminology

For analytics and visualization, we collect data about various attributes and multiple instances of an entity.
A variable stores data about an attribute (e.g., age)
An observation stores data about different attributes (e.g., name, past purchase amount) for a particular instance of the entity (e.g., a customer named "Chandra")
A dataset is a collection of attribute values of many instances of the entity (e.g., all customers)
For most analytics modeling, each variable is expected to store a fixed type of data (e.g., numeric or text); hence, when a variable is specified, its type is also specified.
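This terminology maps directly onto a pandas DataFrame, which is worth seeing once (the customer data below is made up):

```python
# The basic terminology as a pandas DataFrame: each column is a variable
# with a fixed type, each row is an observation, and the DataFrame is the
# dataset. Values are illustrative.
import pandas as pd

dataset = pd.DataFrame({
    "name": ["Chandra", "Lee"],        # variable of type text (object)
    "past_purchase": [120.0, 85.5],    # variable of type numeric (float64)
})

observation = dataset.iloc[0]          # one instance: the customer "Chandra"
print(dataset.dtypes["past_purchase"], observation["name"])
```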

(data wrangling important) Collecting and Preparing the data creates?

In many firms, this involves quite a bit of manual effort, creating bottlenecks for the more useful descriptive or predictive models.

(data wrangling important) What is an example of how data from multiple sources can provide important clues about a company's business outcomes?

In the food industry, data about production volume, location data on shipments, weather reports, daily sales, and social network comments can be combined to provide better insights into customer sentiment and demand

Many steps in what process are iterative?

Many steps in Data Wrangling are iterative.

Column headers are values, not variable names, is an example of?

Messiness in Dataset - Data Wrangling Challenges

Multiple types of experimental units stored in same table, is an example of?

Messiness in Dataset - Data Wrangling Challenges

Multiple variables stored in one column, is an example of?

Messiness in Dataset - Data Wrangling Challenges

One type of experimental unit stored in multiple tables, is an example of?

Messiness in Dataset - Data Wrangling Challenges

Variables stored in both rows and columns, is an example of?

Messiness in Dataset - Data Wrangling Challenges

(Data Wrangling Challenges) Examples of multiple formats of data?

Multiple formats of data:
Data is being sourced from multiple and disparate sources (e.g., the web, sensors, smartphones, data warehouses, databases)
Data needs to be converted into some unified form appropriate for the analytics algorithms or platforms

(Basic Data Terminology) Nominal variables store ?

Nominal variables store values which have no logical ordering or sequence. For example, the variable "marital status" may store the values "married", "single", and "divorced"; there is no logical ordering of these values.
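In pandas, a nominal variable can be represented as an unordered Categorical, which records that the values carry no ordering:

```python
# A nominal variable as an unordered pandas Categorical: the values
# ("marital status" from the example) have no logical ordering.
import pandas as pd

marital = pd.Categorical(["married", "single", "divorced"], ordered=False)
print(marital.ordered)  # False: no ordering is defined on the values
```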

What is a tidy dataset?

Observations are in rows
Variables are in columns
Data is contained in a single dataset
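A classic messy-to-tidy fix is a table whose column headers ("2022", "2023" below) are really values of a year variable; pandas `melt` reshapes it so each row is one observation. The sales figures are made up:

```python
# Tidying a messy table whose column headers are values of a "year"
# variable, using pandas melt. The data is illustrative.
import pandas as pd

messy = pd.DataFrame({"product": ["latte", "mocha"],
                      "2022": [100, 90],
                      "2023": [120, 95]})

# One row per product-year observation; variables in columns
tidy = messy.melt(id_vars="product", var_name="year", value_name="sales")
print(len(tidy), list(tidy.columns))
```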

(Basic Data Terminology) Ordinal variables

Ordinal variables store values which have a logical ordering. For example, the variable "drink size" may store the values "small", "medium", and "large". There is an order: we know "medium" is more than "small", and "large" is more than "medium". However, it may not be the case that the difference between "medium" and "large" is the same as the difference between "small" and "medium".
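The same drink-size example as an ordered pandas Categorical: the ordering is declared explicitly, so order-based operations work, while no "distance" between levels is implied:

```python
# An ordinal variable as an ordered pandas Categorical: the order of the
# levels is meaningful, but the distances between them are not defined.
import pandas as pd

sizes = pd.Categorical(["small", "large", "medium"],
                       categories=["small", "medium", "large"],
                       ordered=True)
print(sizes.min(), sizes.max())  # order-based operations are now valid
```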

In what step in the data wrangling process is data extracted in its raw form and loaded into the data store in the sandbox, where it can then be transformed, if necessary?

Perform ETLT

Steps in Data Wrangling Process Prepare the analytics sandbox

Prepare the analytics sandbox:
• The sandbox is the workspace created for exploring data without interfering with live production databases
• The sandbox can hold all formats of data, including raw data, aggregated data, and transformed data
• The sandbox can grow to be very large, since copies of the data are created to serve multiple analysis models

Steps in Data Wrangling 4. Survey and visualize the data

Survey and visualize the data: leverage visualization tools to get an overview of the data's characteristics (e.g., unexpected values or ranges)
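Even before plotting, a quick numeric summary gives the same kind of overview; `describe()` can surface unexpected values or ranges. The ages below are made up, with one deliberately bad value:

```python
# Surveying the data: describe() gives a quick overview that can flag
# unexpected values or ranges before any plotting. Values are illustrative.
import pandas as pd

df = pd.DataFrame({"age": [34, 29, 41, -2, 57]})  # -2 is an unexpected value
summary = df["age"].describe()
print(summary["min"])  # the minimum immediately flags the bad value
```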

The Analytical Sandbox

The Analytical Sandbox:
Many analytics projects are performed in an analytics sandbox
Data for modeling is loaded into the sandbox (or workspace) so analysis can be done without interfering with the production environment
Example: a fraud analytics workspace will get a copy of the customer and financial data rather than directly connecting to the production databases
It is important for the analytics team to collaborate with IT to balance its need for more data with IT's need for proper data control

What is goal of data wrangling?

The goal of data wrangling is to provide a data set to facilitate analysis-reporting, visualization, prediction, etc.

Data Wrangling Software Companies

Trifacta: a startup based in San Francisco, CA; platform and tools for data wrangling and viewing
ClearStory Data: a startup in Palo Alto, CA; tools for combining data from a variety of sources
Paxata: HQ in Redwood City, CA; part of DataRobot Inc.; runs on Apache Spark; uses semantic algorithms to infer the meaning of columns and pattern recognition to identify potential duplicates
Tableau Prep: data cleaning and preparation tool for Tableau visualization

(Basic Data Terminology) Variables are also classified

Variables are also classified according to the measurement scale used

Appropriate time-related measurements as needed for analysis, is an example of?

What to look for when surveying and visualizing data

Calculations remain consistent within columns and across tables, is an example of?

What to look for when surveying and visualizing data

Data distributions are acceptable for analysis, example of?

What to look for when surveying and visualizing data

Granularity or aggregation of data as needed for analysis, is an example of?

What to look for when surveying and visualizing data

Population of interest is represented in the data, is an example of?

What to look for when surveying and visualizing data

Scales and units are consistent, is an example of?

What to look for when surveying and visualizing data

What to look for when surveying and visualizing data?

What to look for when surveying and visualizing data:
Calculations remain consistent within columns and across tables
Data distributions are acceptable for analysis
Granularity or aggregation of data as needed for analysis
Population of interest is represented in the data
Appropriate time-related measurements as needed for analysis
Scales and units are consistent

A fraud analytics workspace will get a copy of the customer and financial data rather than directly connecting to the production databases, is an example of?

analytics sandbox (workspace)

Many analytics projects are performed in an?

analytics sandbox (workspace)

(data wrangling important) How much time do data scientists spend collecting and preparing the data before it can be used in analytics models?

anywhere from 50% to 80% of their time

Analytic Sandbox Data for modeling is loaded into the sandbox (or workspace) so analysis can be done without interfering with ?

production environment

Analytic Sandbox It is important for the analytics team to collaborate with IT to balance its need for more data with IT's need for ?

proper data control

(Basic Data Terminology) Each variable is expected to store a fixed?

type of data (e.g., numeric or text); hence, when a variable is specified, its type is also specified.

