Data Wrangling


Handling blank values:

Are blank values represented as "NA," "Null," "-1," or "TBD"? If so, deciding on a single value for consistency's sake will help eliminate stakeholder confusion. A more advanced approach is imputing values. This means using populated cells in a column to make a reasonable guess at the missing values, such as finding the average of the populated cells and assigning that to the blank cells.
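A minimal pandas sketch of both ideas, assuming a hypothetical orders.csv with an order_amount column:

import pandas as pd

# Read everything as text so blank markers like "NA", "Null", "-1", and "TBD"
# can be standardized to a single missing value before any typing.
df = pd.read_csv("orders.csv", dtype=str, keep_default_na=False)
df = df.replace(["NA", "Null", "-1", "TBD", ""], pd.NA)

# Impute: use the populated cells' average as a reasonable guess for the blanks.
df["order_amount"] = pd.to_numeric(df["order_amount"], errors="coerce")
df["order_amount"] = df["order_amount"].fillna(df["order_amount"].mean())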

Understand the columns and data types

Having a data dictionary (a document that describes a data set's column names, business definition, and data type) can really help with this step. It's necessary to ensure that the data values actually stored in a column match the business definition of that column. For example, a column called "date_of_birth" should hold dates in a format like MM/DD/YYYY. Combining this practice with data profiling, described above, should help the analyst really get to know the data.
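As a rough sketch, pandas can surface the stored types and enforce the documented date format; customers.csv and the column names are illustrative:

import pandas as pd

df = pd.read_csv("customers.csv")
print(df.dtypes)  # compare the actual stored types to the data dictionary

# Confirm date_of_birth really holds dates, then render it as MM/DD/YYYY.
df["date_of_birth"] = pd.to_datetime(df["date_of_birth"], errors="coerce")
df["date_of_birth"] = df["date_of_birth"].dt.strftime("%m/%d/%Y")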

Deduplicating

No source data set is perfect, and sometimes source systems send duplicate rows. The key here is to know the "natural key" of each record, meaning the field or fields that uniquely identify each row. If an inbound data set includes records having the same natural key, all but one of the rows could be removed.
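A short pandas sketch, assuming a hypothetical claims file whose natural key is the combination of member, claim number, and service date:

import pandas as pd

df = pd.read_csv("claims.csv")

# The natural key: the fields that should uniquely identify each row.
natural_key = ["member_id", "claim_number", "service_date"]

# Keep the first row for each natural key and drop the rest.
df = df.drop_duplicates(subset=natural_key, keep="first")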

Start with a small test set

One of the challenges of Big Data is working with large data sets, especially early in the data transformation process, when analysts need to quickly iterate through many different exploratory techniques. To help tame the unruly beast of 500 million rows, apply random sampling to the data set to explore the data and lay out the preparation steps. This method will greatly accelerate data exploration and quickly set the stage for further transformation.
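For illustration, a pandas sketch of working against a random sample (the file name and sample size are placeholders):

import pandas as pd

df = pd.read_csv("web_events.csv")

# Explore and prototype the preparation steps on a reproducible random sample,
# then apply the finished steps to the full data set.
sample = df.sample(n=100_000, random_state=42)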

Validating accuracy:

One type of accuracy check is making sure data is correctly captured at the point of collection. For example, a website may have changed so that an expected value is no longer present, or a product's price may only appear after the item is placed in a shopping cart because of a promotion.

What are the first steps of data cleansing and transformation?

Defining the objective, investigating the data source, and profiling the data

What are some of the best data cleansing practices?

Defining a data quality plan, validating accuracy, deduplicating, handling blank values, reformatting values, and threshold checking

Defining a data quality plan:

Derived from the business case (see above), the quality plan may also entail some conversation with business stakeholders to tease out answers to questions like "What are our data extraction standards?", "What opportunities do we have to automate the data pipeline?", "What data elements are key to downstream products and processes?", "Who is responsible for ensuring data quality?", and "How do we determine accuracy?"

You can clean the data before the data source is evaluated and profiled

False

Test early and often

Ideally, reliable expected values are available to test the results of a data wrangling effort. A good business case could include expected values for validation purposes. But even if not, knowing the business question and iteratively testing the results of data wrangling should help testers surface data transformation issues for resolution early in the process.
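One lightweight way to do this, sketched here with placeholder file, column, and expected values, is to assert against known figures after each wrangling step:

import pandas as pd

df = pd.read_csv("claims_clean.csv")

# Expected values would come from the business case or a trusted reference.
expected_row_count = 100_000
expected_total_allowed = 2_000_000.00

assert len(df) == expected_row_count, f"unexpected row count: {len(df)}"
assert round(df["allowed_amount"].sum(), 2) == expected_total_allowed, \
    "total allowed amount does not match the expected value"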

What does a data source investigation include?

Identifying the data required by the business case; knowing whether the data will be integrated directly into an application or business process or used to drive an analytical investigation; identifying what trends project team members anticipate seeing as web data is collected over time; cataloging possible data sources and their data stewards in a mature IT environment; and understanding the delivery mechanism and frequency of refreshed data from the source

What are some of the best data wrangling practices?

Start with a small test set; understand the columns and data types; visualize source data; zero in on only the needed data elements; turn it into actionable data; test early and often

Data Wrangling

Take messy, incomplete, or overly complex data and simplify and/or clean it so that it's usable for analysis

Turn it into actionable data

The steps above shed light on the manipulations, transformations, calculations, reformatting, etc. needed to convert the web source data into the target format. A skilled analyst can create repeatable workflows that translate the required business rules into data wrangling actions.
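As a sketch of what a repeatable workflow can look like in pandas (the rules and column names below are illustrative assumptions, not the article's):

import pandas as pd

def wrangle(path: str) -> pd.DataFrame:
    """Apply the agreed business rules to a raw web extract."""
    df = pd.read_csv(path)
    df = df.replace(["NA", "Null", "TBD"], pd.NA)        # standardize blanks
    df = df.drop_duplicates(subset=["record_id"])        # dedupe on the natural key
    df["price"] = pd.to_numeric(df["price"], errors="coerce")
    return df[["record_id", "price", "captured_at"]]     # keep only the needed columns

clean = wrangle("web_source.csv")  # rerun the same steps on every refresh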

Threshold checking:

This is a more nuanced data cleansing approach. It includes comparing a current data set to historical values and record counts. For example, in the health care world, let's say a monthly claims data source averages total allowed amounts of $2M and unique claim counts of 100K. If a subsequent data load arrives with a total allowed amount of $10M and 500K unique claims, those amounts exceed the normal expected threshold of variance and should trigger additional scrutiny.
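A sketch of that check in pandas, using the healthcare figures above as the historical baseline (the tolerance, file name, and column names are assumptions):

import pandas as pd

history_avg_allowed = 2_000_000   # historical monthly total allowed amount
history_avg_claims = 100_000      # historical unique claim count
tolerance = 0.25                  # flag anything beyond +/- 25% of normal

df = pd.read_csv("claims_current_month.csv")
total_allowed = df["allowed_amount"].sum()
claim_count = df["claim_id"].nunique()

if (abs(total_allowed - history_avg_allowed) / history_avg_allowed > tolerance
        or abs(claim_count - history_avg_claims) / history_avg_claims > tolerance):
    print("Load exceeds the expected variance threshold; review before use")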

Zero in on only the needed data elements

This is where having a well-defined business case can really help. Since most source data sets have far more columns than are actually needed, it's imperative to wrangle only the columns required by the business case. Proper application of this practice will save untold amounts of time, money, and credibility.
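In pandas, this can be as simple as reading only the required columns; the names below are illustrative:

import pandas as pd

needed = ["product_id", "price", "captured_at"]  # columns the business case requires
df = pd.read_csv("web_source.csv", usecols=needed)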

you'll have to drop null values to do any reliable data analysis

True

Visualize source data:

Using common graphing tools and techniques can help bring the "current state" of the data to life. Histograms show distributions, scatter plots help find outliers, pie graphs show percentage to whole, and line graphs can show trends in key fields over time. Showing how data looks in visual form is also a great way to explain exploratory findings and needed transformations to non-technical users.
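A rough sketch of those views using pandas' built-in plotting (file and column names are placeholders):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("web_source.csv")

df["price"].plot.hist(bins=30, title="Price distribution")       # distribution
plt.show()

df.plot.scatter(x="quantity", y="price", title="Outlier check")  # outliers
plt.show()

df.groupby("captured_date")["price"].mean().plot.line(title="Average price over time")
plt.show()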

What does this function do? df.groupby('Sex').Survived.value_counts()

Counts survivors and non-survivors separately for each sex (groups the rows by Sex, then counts the values of Survived within each group)

How would you filter rows where Age is greater than 30 in Python?

df[df['Age']>30]

Reformatting values:

If the source data's date fields are in the MM-DD-YYYY format, and your target date fields are in the YYYY/MM/DD format, update the source date fields to match the target format.
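A small pandas sketch of that conversion, using a made-up order_date column:

import pandas as pd

df = pd.DataFrame({"order_date": ["01-31-2023", "02-15-2023"]})

# Parse the source MM-DD-YYYY strings, then write them out as YYYY/MM/DD.
df["order_date"] = pd.to_datetime(df["order_date"], format="%m-%d-%Y")
df["order_date"] = df["order_date"].dt.strftime("%Y/%m/%d")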

Web data integration (WDI)

has built-in Excel-like transform functions that allow you to normalize data right within the web application; focuses on data quality and controls

What does data profiling include?

Reveals data structure, null records, outliers, junk data, and other potential data quality issues
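A few common pandas calls cover most of this; the file name is illustrative:

import pandas as pd

df = pd.read_csv("web_source.csv")

df.info()                  # structure: columns, dtypes, non-null counts
print(df.describe())       # summary stats; extreme min/max values hint at outliers
print(df.isnull().sum())   # null records per column
print(df.nunique())        # cardinality; constant or junk columns stand out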

