Data Analytics Course 4 Week 4


Changelogs take you the last mile in documentation

A changelog can build on your automated version history by giving you an even more detailed record of your work. This is where data analysts record all the changes they make to the data. Version histories record what was changed in a project, but they don't tell us why. Changelogs are super useful for helping us understand the reasons changes have been made. Changelogs have no set format, and you can even make your entries in a blank document. But if you are using a shared changelog, it is best to agree with the other data analysts on the format of all your log entries.

Example of how a version control system affects a change to a query

A company keeps official versions of important queries in its version control system.
1. An analyst makes sure the most up-to-date version of the query is the one they will change. This is called syncing.
2. The analyst makes a change to the query.
3. The analyst might ask someone to review this change. This is called a code review, and it can be done formally or informally. An informal review could be as simple as asking a senior analyst to take a look at the change.
4. After a reviewer approves the change, the analyst submits the updated version of the query to a repository in the company's version control system. This is called a code commit. A best practice is to document exactly what the change was and why it was made in a comments area. Going back to our example of a query that pulls daily revenue, a comment might be: Updated revenue to include revenue coming from the new product, Calypso.
5. After the change is submitted, everyone else in the company will be able to access and use this new query when they sync to the most up-to-date queries stored in the version control system.
6. If the query has a problem or business needs change, the analyst can undo the change using the version control system. The analyst can look at a chronological list of all changes made to the query and who made each change. Then, after finding their own change, the analyst can revert to the previous version.
7. The query is back to what it was before the analyst made the change, and everyone at the company sees this reverted, original query, too.
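To make the commit concrete, here is a minimal sketch of what the updated daily revenue query might look like, with the commit comment carried as a header comment. The table name daily_sales and the product names other than Calypso are hypothetical, used only for illustration.

-- Code commit comment: Updated revenue to include revenue coming from the new product, Calypso.
SELECT
  sale_date,
  SUM(revenue) AS daily_revenue
FROM daily_sales
WHERE product_name IN ('Alpha', 'Beta', 'Calypso')  -- 'Calypso' added in this change
GROUP BY sale_date
ORDER BY sale_date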

Changelog

A file containing a chronologically ordered list of modifications made to a project. It's usually organized by version and includes the date, followed by a list of added, improved, and removed features. Changelogs are very useful for keeping track of how a dataset evolved over the course of a project. They're also another great way to communicate and report on data to others.

Sometimes you have an error that shows up repeatedly, and it can't be resolved with a quick manual edit or a tool that fixes the problem automatically. In these cases, it's helpful to create a pivot table.

A pivot table is a data summarization tool used in data processing. Pivot tables sort, reorganize, group, count, total, or average data stored in a database.
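A pivot-table-style summary can also be produced in SQL with GROUP BY. Here is a minimal sketch, reusing the customer_address table from the CASE statement example later in these notes, that counts how many times each name appears so a repeated error, such as a recurring misspelling, stands out:

SELECT
  name,
  COUNT(*) AS name_count
FROM `project3-326803.customer_data.customer_address`
GROUP BY name
ORDER BY name_count DESC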

Verification

A process to confirm that a data-cleaning effort was well executed and the resulting data is accurate and reliable. It involves rechecking your clean dataset, doing some manual clean-ups if needed, and taking a moment to sit back and really think about the original purpose of the project. By doing this you can be confident that the data you collected is credible and appropriate for your purposes. Verification is important because it allows you to double-check that the work you did to clean up your data was thorough and accurate, and it lets you catch mistakes before you begin analysis. Without it, any insights you gain from analysis can't be trusted for decision-making. You might even risk misrepresenting populations or damaging the outcome of a product that you're actually trying to improve. Verification also involves manually cleaning data to compare your expectations with what's actually present.

The final step in data cleaning

As a quick reminder, the goal is to ensure that our data-cleaning work was done properly and the results can be counted on. You want your data to be verified so you know it's 100 percent ready to go. The first step in verification is returning to your original, unclean dataset and comparing it to what you have now. This is an opportunity to search for common problems. After that, you clean up the problems manually.

Data analysts usually use a changelog to access this information.

As a reminder, a changelog is a file containing a chronologically ordered list of modifications made to a project.

Best practices for changelogs

A changelog for a personal project may take any form desired. However, in a professional setting and while collaborating with others, readability is important. These guiding principles help to make a changelog accessible to others:

- Changelogs are for humans, not machines, so write legibly.
- Every version should have its own entry.
- Each change should have its own line.
- Group the same types of changes. For example, Fixed should be grouped separately from Added.
- Versions should be ordered chronologically, starting with the latest.
- The release date of each version should be noted.
- All the changes for each category should be grouped together.

Types of changes usually fall into one of the following categories:
- Added: new features introduced
- Changed: changes in existing functionality
- Deprecated: features about to be removed
- Removed: features that have been removed
- Fixed: bug fixes
- Security: lowering vulnerabilities

Data Checklist

Use this checklist to confirm your cleaning work. It has two parts, each detailed below:
- Correct the most common problems
- Review the goal of your project

Typical Changelogs contain:

- Data, file, formula, query, or any other component that changed
- Description of what changed
- Date of the change
- Person who made the change
- Person who approved the change
- Version number
- Reason for the change
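Pulling these pieces together with the category labels above, a minimal sketch of one changelog version entry might look like this (the version number, date, names, and changes are hypothetical):

Version 1.2 (2023-06-14)
Changed:
- Daily revenue query: included the new product, Calypso. Reason: new product launch. Changed by: A. Analyst. Approved by: S. Senior.
Fixed:
- customer_address table: corrected misspelled customer names. Reason: data-entry errors found during verification. Changed by: A. Analyst. Approved by: S. Senior.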

Tools to help in the verification process

- Find and replace: a tool that looks for a specific search term in a spreadsheet and allows you to replace it with something else.
- COUNTA: a function that counts the total number of values within a specified range (used within a pivot table in the Excel example).
- COUNT: only counts the numerical values within a specified range (used within a pivot table in the Excel example). In some other spreadsheet applications, COUNT would give us the information we want for our current example.
- If you're working in SQL, you can address misspellings using a CASE statement (see the CASE Statement section below).
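For instance, assuming the cleaned values sit in column A of a Google Sheet (the range A2:A100 is hypothetical), the two counting functions would be used like this:

=COUNTA(A2:A100)   counts every non-empty value in the range
=COUNT(A2:A100)    counts only the numeric values in the range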

Taking a big picture view of your project involves doing three things.

First, consider the business problem you're trying to solve with the data. If you've lost sight of the problem, you have no way of knowing what data belongs in your analysis. Taking a problem-first approach to analytics is essential at all stages of any project. You need to be certain that your data will actually make it possible to solve your business problem.

Second, consider the goal of the project. It's not enough just to know that your company wants to analyze customer feedback about a product. What you really need to know is that the goal of getting this feedback is to make improvements to that product. On top of that, you also need to know whether the data you've collected and cleaned will actually help your company achieve that goal.

Third, consider whether your data is capable of solving the problem and meeting the project objectives. That means thinking about where the data came from and testing your data collection and cleaning processes.

Having a record of how a data set evolved does three very important things.

First, it lets us recover from data-cleaning errors. Instead of scratching our heads, trying to remember what we might have done three months ago, we have a cheat sheet to rely on if we come across the same errors again later. It's also a good idea to create a clean table rather than overwriting your existing table. This way, you still have the original data in case you need to redo the cleaning.

Second, documentation gives you a way to inform other users of changes you've made.

Third, documentation helps you determine the quality of the data to be used in analysis. The first two benefits assume the errors aren't fixable. But if they are, a record gives the data engineer more information to refer to. It's also a great warning to ourselves that the dataset is full of errors and should be avoided in the future. If the errors were time-consuming to fix, it might be better to check out alternative datasets that we can use instead.

Clarification: The instructor stated that the first two benefits of documentation, 1) recalling the errors that were cleaned and 2) informing others of the changes, assume that the data errors aren't fixable. She then added that when the data errors are fixable, the documentation needs to record how the data was fixed. Data-cleaning documentation is important in both cases.

You can use and view a changelog in spreadsheets and SQL to achieve similar results.

For spreadsheets, we can use Sheets' version history, which provides a real-time tracker of all the changes and who made them, from individual cells to the entire worksheet (File > Version history). If you want to check out changes in a specific cell, you can right-click it and select Show edit history. The way you create and view a changelog with SQL depends on the software program you're using. Some companies even have their own separate software that keeps track of changelogs and important SQL queries. Essentially, all you have to do is specify exactly what you did and why when you commit a query to the repository as a new and improved query. This allows the company to revert back to a previous version if something you've done crashes the system, which has happened to me before. Another option is to just add comments as you go while you're cleaning data in SQL; this will help you construct your changelog after the fact. BigQuery's query history tracks all of the changes you've made. You can click on any of them to revert back to a previous version of your query or to bring up an older version to find what you've changed. Changelogs like these are a great way to keep yourself on track. They also let your team get real-time updates when they want them.
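As a minimal sketch of the comments-as-you-go approach (the dates and the specific cleaning steps shown are hypothetical), you can annotate each change directly in the query so the notes can be copied into the changelog afterward:

-- 2023-06-14: Trimmed extra spaces from customer names (extra-space errors found during verification).
SELECT
  customer_id,
  TRIM(name) AS name
FROM `project3-326803.customer_data.customer_address`
-- 2023-06-14: Excluded rows with a NULL customer_id (null errors found during verification).
WHERE customer_id IS NOT NULL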

For example, one of the biggest challenges of working with data is dealing with errors. Some of the most common errors involve:

- Human error in data entry
- Flawed processes (like poor design of a survey form)
- System issues (older systems integrate data incorrectly)

When you clean your data, all the incorrect or outdated information is gone, leaving you with the highest-quality content. But all those changes you made to the data are valuable too.

Keeping track of all the changes you've made involves documentation

Automated version control takes you most of the way there

Most software applications have some kind of history tracking built in. For example, Google Sheets and Google Docs have version history, Excel has Track Changes, and BigQuery keeps a log of the SQL queries you have run.

Review the goal of your project

Once you have finished these data-cleaning tasks, it is a good idea to review the goal of your project and confirm that your data is still aligned with that goal. This is a continuous process that you will do throughout your project, but here are three steps you can keep in mind while thinking about this:
- Confirm the business problem
- Confirm the goal of the project
- Verify that data can solve the problem and is aligned to the goal

The other important part of the verification process is reporting on your efforts.

Open communication is a lifeline for any data analytics project. Reports are a super effective way to show your team that you're being 100 percent transparent about your data cleaning. Reporting is also a great opportunity to show stakeholders that you're accountable, build trust with your team, and make sure you're all on the same page of important project details.

Correct the most common problems. Make sure you identified the most common problems and corrected them, including:

- Sources of errors: Did you use the right tools and functions to find the source of the errors in your dataset?
- Null data: Did you search for NULLs using conditional formatting and filters?
- Misspelled words: Did you locate all misspellings?
- Mistyped numbers: Did you double-check that your numeric data has been entered correctly?
- Extra spaces and characters: Did you remove any extra spaces or characters using the TRIM function?
- Duplicates: Did you remove duplicates in spreadsheets using the Remove Duplicates function or DISTINCT in SQL?
- Mismatched data types: Did you check that numeric, date, and string data are typecast correctly?
- Messy (inconsistent) strings: Did you make sure that all of your strings are consistent and meaningful?
- Messy (inconsistent) date formats: Did you format the dates consistently throughout your dataset?
- Misleading variable labels (columns): Did you name your columns meaningfully?
- Truncated data: Did you check for truncated or missing data that needs correction?
- Business logic: Did you check that the data makes sense given your knowledge of the business?
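As a minimal sketch of how a couple of these checks might look in SQL (reusing the customer_address table from the CASE example below; the specific checks are only illustrative):

-- Null data: count rows missing a customer_id
SELECT COUNT(*) AS null_ids
FROM `project3-326803.customer_data.customer_address`
WHERE customer_id IS NULL;

-- Duplicates: find customer_ids that appear more than once
SELECT customer_id, COUNT(*) AS occurrences
FROM `project3-326803.customer_data.customer_address`
GROUP BY customer_id
HAVING COUNT(*) > 1;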

CASE Statement

The CASE statement goes through one or more conditions and returns a value as soon as a condition is met. The example below says that if the name is Chad Lucero, change it to Chuck Lucero; the same goes for Abel Black, which becomes Abe Black. If neither condition is met, the name is kept as is. END AS creates a new column named cleaned_name.

SELECT
  customer_id,
  CASE
    WHEN name = "Chad Lucero" THEN "Chuck Lucero"
    WHEN name = "Abel Black" THEN "Abe Black"
    ELSE name
  END AS cleaned_name
FROM `project3-326803.customer_data.customer_address`

FILTER function

The FILTER function is fully internal to a spreadsheet and doesn't require the use of a query language. The FILTER function lets you view only the rows (or columns) in the source data that meet your specified conditions. It makes it possible to pre-filter data before you analyze it. The FILTER function might run faster than the QUERY function. But keep in mind, the QUERY function can be combined with other functions for more complex calculations. For example, the QUERY function can be used with other functions like SUM and COUNT to summarize data, but the FILTER function can't.
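A minimal sketch in Google Sheets (the range A2:D100 and the threshold on column C are hypothetical): this returns only the rows of the source data where the value in column C is greater than 100.

=FILTER(A2:D100, C2:C100 > 100)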

IMPORTRANGE function

The IMPORTRANGE function in Google Sheets and the Paste Link feature (a Paste Special option in Microsoft Excel) both allow you to insert data from one sheet to another. Using these on a large amount of data is more efficient than manual copying and pasting. They also reduce the chance of errors being introduced by copying and pasting the wrong data. They are also helpful for data cleaning because you can "cherry pick" the data you want to analyze and leave behind the data that isn't relevant to your project. This functionality is also useful for day-to-day data monitoring; with it, you can build a tracking spreadsheet to share the relevant data with others. The data is synced with the data source, so when the data is updated in the source file, the tracked data is also refreshed. If you are using IMPORTRANGE in Google Sheets, data can be pulled from another spreadsheet, but you must allow access to the spreadsheet the first time it pulls the data.
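A minimal sketch in Google Sheets (the spreadsheet URL, sheet name, and range are placeholders, not real values): this pulls A1:D100 of Sheet1 from the source spreadsheet into the current one.

=IMPORTRANGE("https://docs.google.com/spreadsheets/d/SOURCE_SPREADSHEET_ID", "Sheet1!A1:D100")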

QUERY function

The QUERY function is also useful when you want to pull data from another spreadsheet. The QUERY function's SQL-like ability can extract specific data within a spreadsheet. For a large amount of data, using the QUERY function is faster than filtering data manually. This is especially true when repeated filtering is required. For example, you could generate a list of all customers who bought your company's products in a particular month using manual filtering. But if you also want to figure out customer growth month over month, you have to copy the filtered data to a new spreadsheet, filter the data for sales during the following month, and then copy those results for the analysis. With the QUERY function, you can get all the data for both months without needing to change your original dataset or copy results.

The QUERY function's syntax is similar to IMPORTRANGE. You enter the sheet by name and the range of data that you want to query from, and then use the SQL SELECT command to select the specific columns. You can also add specific criteria after the SELECT statement by including a WHERE statement. But remember, all of the SQL code you use has to be placed between the quotes! Google Sheets runs the Google Visualization API Query Language across the data. Excel spreadsheets use a query wizard to guide you through the steps to connect to a data source and select the tables. In either case, you can be sure that the data imported is verified and clean based on the criteria in the query.

Analysts can use SQL to pull a specific dataset into a spreadsheet. They can then use the QUERY function to create multiple tabs (views) of that dataset. For example, one tab could contain all the sales data for a particular month and another tab could contain all the sales data from a specific region. This solution illustrates how SQL and spreadsheets are used well together.
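A minimal sketch in Google Sheets (the sheet name, range, column letters, and month value are hypothetical): this selects the customer and sales columns for rows where the month column equals June, without changing the original data.

=QUERY(Sheet1!A1:D100, "SELECT A, D WHERE C = 'June'", 1)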

Steps in the verification process

The first step in the verification process is going back to your original, unclean dataset and comparing it to what you have now. Review the dirty data and try to identify any common problems. For example, maybe you had a lot of nulls; in that case, you check your clean data to ensure no nulls are present.

Another key part of verification involves taking a big-picture view of your project. This is an opportunity to confirm you're actually focusing on the business problem that you need to solve and the overall project goals, and to make sure that your data is actually capable of solving that problem and achieving those goals. It's important to take the time to reset and focus on the big picture because projects can sometimes evolve or transform over time without us even realizing it. Sometimes data analysts can be too familiar with their own data, which makes it easier to miss something or make assumptions. Asking a teammate to review your data from a fresh perspective and getting feedback from others is very valuable at this stage. This is also the time to notice if anything sticks out to you as suspicious or potentially problematic in your data. Again, step back, take a big-picture view, and ask yourself, do the numbers make sense?

Documentation

The process of tracking changes, additions, deletions, and errors involved in your data-cleaning effort. Think about it like a crime TV show: data errors are the crime, data cleaning is gathering evidence, and documentation is detailing exactly what happened for peer review or court.

What to record in a changelog

To start, you record the various changes, additions, and fixes that were discussed above. Arrange them using bullets or numbering with one change per line. Group similar changes together with a label describing the change immediately above them. Use different version numbers for each milestone reached in your project. Within each version, place the logged changes that were made since the previous version (milestone). Dates are not generally necessary for each change, but they are recommended for each version. You can do this using a simple text file or spreadsheet and include your changelog with the project write-up. It will help you stay organized and collaborate with others. Keep this in mind when you reach the capstone project in an upcoming course, and don't be afraid to revisit this lesson if you have questions.

Since it's organized chronologically, it provides a real-time account of every modification. Documenting will be a huge time saver for you as a future data analyst. It's basically a cheat sheet you can refer to if you're working with a similar dataset or need to address similar errors.

While your team can view changelogs directly, stakeholders can't and have to rely on your report to know what you did. Regardless of how we capture and share our changelogs, we're setting ourselves up for success by being 100 percent transparent about our data cleaning. This keeps everyone on the same page and shows project stakeholders that we are accountable for effective processes.

