Data Analytics Course 4 Week 2
Not checking for misfielded values
A misfielded value happens when a value is entered into the wrong field. These values might still be formatted correctly, which makes them harder to catch if you aren't careful. For example, you might have a dataset with columns for cities and countries. These are the same type of data, so they are easy to mix up. But if you were trying to find all of the instances of Spain in the country column, and Spain had mistakenly been entered into the city column, you would miss key data points. Making sure your data has been entered correctly is key to accurate, complete analysis.
Data Engineers
Transform data into a useful format for analysis and give it a reliable infrastructure. They develop, maintain, and test databases, data processors and related systems.
Removing extra spaces and blanks
Extra spaces can cause unexpected results when you sort, filter, or search through your data. And because these characters are easy to miss, they can lead to unexpected and confusing results. To remove these unwanted spaces or blank cells, you can delete them yourself. Or again, you can rely on your spreadsheets, which offer lots of great functions for removing spaces or blanks automatically.
Inconsistent Data
Any data that uses different formats to represent the same thing. Possible Causes: data stored incorrectly or errors inserted during data transfer. Potential Harm to Businesses: contradictory data points leading to confusion or inability to classify or segment customers
The Split function
Divides text around a specified character or string and puts each fragment of text into a separate cell in the row
Forgetting to document errors
Documenting your errors can be a big time saver, as it helps you avoid those errors in the future by showing you how you resolved them. For example, you might find an error in a formula in your spreadsheet. You discover that some of the dates in one of your columns haven't been formatted correctly. If you make a note of this fix, you can reference it the next time your formula is broken, and get a head start on troubleshooting. Documenting your errors also helps you keep track of changes in your work, so that you can backtrack if a fix didn't work.
Types of Dirty Data
Duplicate Data, Outdated Data, Incomplete Data, Incorrect/Inaccurate Data, and Inconsistent Data
MID
MID is a function that gives you a segment from the middle of a text string.
Functions can optimize your efforts to ensure data integrity
A function is a set of instructions that performs a specific calculation using the data in a spreadsheet
Concatenate
A function that joins multiple text strings into a single string
COUNTIF Function
A function that returns the number of cells that match a specified value. Basically, it counts the number of times a value appears in a range of cells.
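COUNTIF's behavior can be sketched in a few lines of Python. This is an illustrative analogue, not how spreadsheets implement it; the `ratings` data is made up.

```python
# A minimal sketch of what COUNTIF does: count the cells in a
# range that match a specified value.
def countif(cell_range, value):
    """Return the number of cells equal to value."""
    return sum(1 for cell in cell_range if cell == value)

ratings = [5, 3, 5, 4, 5, 2]
print(countif(ratings, 5))  # 5 appears 3 times -> prints 3
```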
How to get rid of extra spaces in a string
1. Highlight the data in the spreadsheet. 2. Click on the Data tab, then hover over Data cleanup and select Trim whitespace. In Excel, you can use the TRIM function to get rid of white spaces. In any space beneath your data (such as cell A10), type =TRIM(A1). Then, drag the fill handle at the bottom right corner of the cell down to apply TRIM to the rest of the column, filling in the data without the white spaces.
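The same cleanup can be sketched in Python with `str.strip()`, which removes leading and trailing spaces. The names below are made-up example data.

```python
# Sketch of trimming whitespace from spreadsheet-style data.
raw_names = ["  Anna Kim", "Luis Ortiz  ", "  Mei Chen  "]
clean_names = [name.strip() for name in raw_names]
print(clean_names)  # ['Anna Kim', 'Luis Ortiz', 'Mei Chen']
```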
Pivot tables
A data summarization tool that is used in data processing. Pivot tables sort, reorganize, group, count, total, or average data stored in a database. In data cleaning, pivot tables are used to give you a quick, clutter-free view of your data. You can choose to look at just the specific parts of the dataset that you need, shown as a visual in the form of a pivot table. Pivot tables are another great tool for changing the way you view data.
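The core idea of a pivot table, grouping rows by a key and summarizing a value, can be sketched in plain Python. The sales data here is a made-up example.

```python
# A rough analogue of a pivot table: group rows by region and
# total the amount column.
from collections import defaultdict

sales = [("North", 120), ("South", 80), ("North", 50), ("East", 40)]

totals = defaultdict(int)
for region, amount in sales:
    totals[region] += amount

print(dict(totals))  # {'North': 170, 'South': 80, 'East': 40}
```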
Merger
Agreement that unites two organizations into a single new one. When big things happen in an industry, it's common for two organizations to team up and become stronger through a merger. If a merger occurs, all the data from each organization would need to be combined using data merging.
Not accounting for data cleaning in your deadlines/process
All good things take time, and that includes data cleaning. It is important to keep that in mind when going through your process and looking at your deadlines. When you set aside time for data cleaning, it helps you get a more accurate estimate for ETAs for stakeholders, and can help you know when to request an adjusted ETA.
In data analytics, a text string is a group of characters within a cell, most often composed of letters.
An important characteristic of a text string is its length, which is the number of characters in it. A substring is a smaller subset of a text string. Split is a tool that divides a text string around a specified character and puts each fragment into a new and separate cell. Split is helpful when you have more than one piece of data in a cell and you want to separate them out. EX: split a person's first and last name into two cells. Delimiter: the specified text which separates each item. Split text to columns is also helpful for fixing instances of numbers stored as text. Sometimes values in your spreadsheet will seem like numbers, but they're formatted as text. This can happen when copying and pasting from one place to another or if the formatting's wrong. Split text to columns can fix this formatting issue.
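Both ideas above have direct Python equivalents: `str.split()` divides text around a delimiter, and `int()` converts a number stored as text back to a number. The example values are made up.

```python
# SPLIT around a delimiter (a space), like splitting a full name
# into first- and last-name cells.
full_name = "Ada Lovelace"
first, last = full_name.split(" ")
print(first, last)  # Ada Lovelace

# A number stored as text can be converted back to a number.
price_as_text = "42"
price = int(price_as_text)
```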
Null
An indication that a value does not exist in the dataset. It can be a cell left blank or marked N/A. Note that it is not the same as a 0. To do your analysis, you would first need to clean this data. Step one would be to decide what to do with those nulls. You could either filter them out and communicate that you now have a smaller sample size, or you can keep them in and learn from the fact that the customers did not provide responses. There's lots of reasons why this could have happened. Maybe your survey questions weren't written as well as they could be. Maybe they were confusing or biased, something we learned about earlier.
Duplicate Data
Any data record that shows up more than once. Possible Causes: manual data entry, batch data imports, or data migration. Potential Harm to Business: skewed metrics or analyses, inflated or inaccurate counts or predictions, or confusion during data retrieval
Incorrect/Inaccurate Data
Any data that is complete but inaccurate. Possible Causes: human error inserted during data input, fake information, or mock data. Potential Harm to Business: inaccurate insights or decision-making based on bad information resulting in revenue loss
Incomplete Data
Any data that is missing important fields. Possible Causes: improper data collection or incorrect data entry. Potential Harm to Businesses: decreased productivity, inaccurate insights, or inability to complete essential services
Outdated Data
Any data that is old and should be replaced with newer and more accurate information. Possible Causes: people changing roles or companies, or software and systems becoming obsolete. Potential Harm to Businesses: inaccurate insights, decision-making, and analytics
Workflow Automation
Basically, workflow automation is the process of automating parts of your work. That could mean creating an event trigger that sends a notification when a system is updated. Or it could mean automating parts of the data cleaning process. As you can probably imagine, automating different parts of your work can save you tons of time, increase productivity, and give you more bandwidth to focus on other important aspects of the job. Automation sounds amazing, doesn't it? But as convenient as it is, there are still some parts of the job that can't be automated. One of the most important ways you can streamline your data cleaning is to clean data where it lives. This will benefit your whole team, and it also means you don't have to repeat the process over and over. For example, you could create a programming script that counts the number of words in each spreadsheet file stored in a specific folder. Using tools where your data is stored means that you don't have to repeat your cleaning steps, saving you and your team time and energy. *Look within the Google Doc for various examples of tasks you can automate and tasks you can't
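The word-counting script mentioned above could be sketched like this in Python. This is a hedged illustration, not the course's actual script; the `.csv` extension and folder layout are assumptions.

```python
# Sketch: count the words in every .csv file stored in a folder.
from pathlib import Path

def count_words(text):
    """Count whitespace-separated words in a string."""
    return len(text.split())

def count_words_in_folder(folder):
    """Map each .csv filename in `folder` to its word count."""
    return {p.name: count_words(p.read_text())
            for p in Path(folder).glob("*.csv")}

print(count_words("data cleaning saves time"))  # 4
```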
Another useful spreadsheet tool enables you to make formats consistent
You can use the Format tool to format a column and make its formatting consistent.
Data mapping:
Data mapping is the process of matching fields from one database to another. This is very important to the success of data migration, data integration, and lots of other data management activities. Different systems store data in different ways. For example, the state field in one spreadsheet might show Maryland spelled out, but another spreadsheet might store it as MD. Data mapping helps us note these kinds of differences so we know that when data is moved and combined, it will be compatible. For more challenging projects, there's all kinds of data mapping software programs you can use. These data mapping tools will analyze, field by field, how to move data from one place to another, then automatically clean, match, inspect, and validate the data. They also create consistent naming conventions, ensuring compatibility when the data is transferred from one source to another. When selecting a software program to map your data, you want to be sure that it supports the file types you're working with, such as Excel, SQL, Tableau, and others. Concatenate is a function that joins together two or more text strings, which is useful for getting merged data into a consistent format.
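A tiny version of the Maryland/MD mapping above can be sketched as a lookup table in Python. The mapping dictionary here is a made-up assumption with only two entries, just to show the idea.

```python
# Normalize one system's spelled-out state names to another
# system's two-letter codes before the data is combined.
STATE_CODES = {"Maryland": "MD", "Virginia": "VA"}

def map_state(value):
    """Translate a spelled-out state name to its code, if known."""
    return STATE_CODES.get(value, value)  # leave unknown values as-is

print(map_state("Maryland"))  # 'MD'
```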
Irrelevant Data
Data that doesn't fit the specific problem that you're trying to solve also needs to be removed. EX: Let's say a data analyst was working on a project that focused only on current members. They wouldn't want to include information on people who are no longer members. Removing irrelevant data takes a little more time and effort because you have to figure out the difference between the data you need and the data you don't. But believe me, making those decisions will save you a ton of effort down the road.
Clean Data
Data that is complete, correct, and relevant to the problem you're trying to solve. When you work with clean data, you'll find that your projects go much more smoothly. Even tools like SQL only give reliable results when the data is clean. Clean data is incredibly important for effective analysis: it allows you to understand and analyze information, identify important patterns, connect related information, and draw useful conclusions. Then you can apply what you learn to make effective decisions. In some cases, you won't have to do a lot of work to clean data. For example, when you use internal data that's been verified and cared for by your company's data engineers and data warehouse team, it's more likely to be clean.
Dirty Data
Data that is incomplete, incorrect, or irrelevant to the problem you are trying to solve. When you work with dirty data, you can't be sure that your results are correct. In fact, you can pretty much bet they won't be. The number one cause of poor quality data is human error. Dirty data can be the result of someone typing in a piece of data incorrectly, inconsistent formatting, blank fields, or the same piece of data being entered more than once, which creates duplicates. Because dirty data is incomplete, incorrect, or irrelevant to the problem you're trying to solve, it can't be used in a meaningful way, which makes analysis very difficult, if not impossible.
Data Validation
Data validation is a tool for checking the accuracy and quality of data before adding or importing it. Data validation is a form of data cleaning, which you'll learn more about soon.
Data warehousing specialists
Develop processes and procedures to effectively store and organize data. They make sure that data is available, secure, and backed up to prevent loss.
Not fixing the source of the error
Fixing the error itself is important. But if that error is actually part of a bigger problem, you need to find the source of the issue. Otherwise, you will have to keep fixing that same error over and over again. For example, imagine you have a team spreadsheet that tracks everyone's progress. The table keeps breaking because different people are entering different values. You can keep fixing all of these problems one by one, or you can set up your table to streamline data entry so everyone is on the same page. Addressing the source of the errors in your data will save you a lot of time in the long run.
How to eliminate rows with blank cells
Highlight all cells in the spreadsheet. Click on the Data tab and pick the Create a filter option. In Microsoft Excel, this is called Filter. Every column now shows a green triangle in the first row next to the column title. Click the green triangle in Column B to access a new menu. On that new menu, click Filter by condition and open the dropdown menu to select Is empty. Click OK. Select all these cells and delete the rows except the row of column headers.
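The same "drop rows with blank cells" cleanup can be sketched in Python: keep only the rows where every field is non-empty. The rows below are made-up example data.

```python
# Filter out rows that contain any blank cell.
rows = [
    ["Anna", "anna@example.com"],
    ["Luis", ""],              # blank cell -> this row is dropped
    ["Mei", "mei@example.com"],
]
complete_rows = [row for row in rows
                 if all(cell.strip() for cell in row)]
print(complete_rows)  # only Anna's and Mei's rows remain
```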
Removing unwanted data
However, before removing unwanted data, it's always a good practice to make a copy of the data set. That way, if you remove something that you end up needing in the future, you can easily access it and put it back in the data set. Once you've made a copy, you can move on to getting rid of the duplicates or data that isn't relevant to the problem you're trying to solve. Typically, duplicates appear when you're combining data sets from more than one source or using data from multiple departments within the same business. Duplicates can be a big problem for data analysts. So it's really important that you can find and remove them before any analysis starts. Duplicates can be fixed manually, but most spreadsheet applications also offer lots of tools to help you find and remove duplicates.
What to ask yourself when merging data
I always begin by asking myself some key questions to help me avoid redundancy and to confirm that the datasets are compatible. Compatibility: How well two or more datasets are able to work together First question you should ask is, "Do I have all the data that I need?" Next: "Does the data I need exist within these data sets?" This involves considering the entire dataset analytically. Looking through the data before I start using it lets me get a feel for what it's all about, what the schema looks like, if it's relevant to my customer purchase insights, and if it's clean data. Next: "Does the data need to be cleaned, or are they ready for me to use?" If using more than one source, "Are both data sets cleaned to the same standard?" For example, what fields are regularly repeated? How are missing values handled? How recently was the data updated? Finding the answers to these questions and understanding if I need to fix any problems at the start of a project is a very important step in data merging.
Not analyzing the system prior to data cleaning
If we want to clean our data and avoid future errors, we need to understand the root cause of the dirty data. Imagine you are an auto mechanic. You would find the cause of the problem before you started fixing the car, right? The same goes for data. First, you figure out where the errors come from. Maybe it is from a data entry error, not setting up a spell check, lack of formats, or from duplicates. Then, once you understand where bad data comes from, you can control it and keep your data clean.
How to delete all formatting
If you want to clear the formatting for any or all cells, you can find the command in the Format tab. To clear formatting: 1. Select the data for which you want to delete the formatting. In this case, highlight all the data in the spreadsheet by clicking and dragging over Rows 1-8. 2. Click the Format tab and select the Clear Formatting option. In Excel, go to the Home tab, then hover over Clear and select Clear Formats. You will notice that all the cells have had their formatting removed.
"Remove duplicates"
Is a tool that automatically searches for and eliminates duplicate entries from a spreadsheet. You can use the Data > Remove duplicates function in Google Sheets to remove all duplicates.
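What "Remove duplicates" does can be sketched in Python: drop repeated rows while keeping the first occurrence and the original order. The data is a made-up example.

```python
# Drop duplicate rows, keeping the first occurrence of each.
def remove_duplicates(rows):
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)      # rows are lists; tuples are hashable
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

data = [["a", 1], ["b", 2], ["a", 1]]
print(remove_duplicates(data))  # [['a', 1], ['b', 2]]
```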
Not backing up your data prior to data cleaning
It is always good to be proactive and create your data backup before you start your data clean-up. If your program crashes, or if your changes cause a problem in your dataset, you can always go back to the saved version and restore it. The simple procedure of backing up your data can save you hours of work-- and most importantly, a headache.
Only looking at a subset of the data
It is important to think about all of the relevant data when you are cleaning. This helps make sure you understand the whole story the data is telling, and that you are paying attention to all possible errors. For example, if you are working with data about bird migration patterns from different sources, but you only clean one source, you might not realize that some of the data is being repeated. This will cause problems in your analysis later on. If you want to avoid common errors like duplicates, each field of your data requires equal attention.
LEFT and RIGHT function:
LEFT is a function that gives you a set number of characters from the left side of a text string. RIGHT is a function that gives you a set number of characters from the right side of a text string. You can use LEFT or RIGHT to give you the specific set of characters or numbers you need. By using LEFT and RIGHT, you can extract a substring.
LEN
LEN is a function that tells you the length of a text string by counting the number of characters it contains. This is useful when cleaning data if you have a certain piece of information in your spreadsheet that you know must be a certain length. EX: a member ID must be a certain length.
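LEFT, RIGHT, MID, and LEN map directly onto Python string slicing and `len()`. The member ID below is a made-up example value.

```python
# Spreadsheet string functions sketched with Python slicing.
member_id = "AZ-2024-0917"

left_3  = member_id[:3]    # LEFT(member_id, 3)  -> 'AZ-'
right_4 = member_id[-4:]   # RIGHT(member_id, 4) -> '0917'
mid_4   = member_id[3:7]   # MID from position 4, length 4 -> '2024'
length  = len(member_id)   # LEN -> 12
```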
Overlooking missing values
Missing values in your dataset can create errors and give you inaccurate conclusions. For example, if you were trying to get the total number of sales from the last three months, but a week of transactions was missing, your calculations would be inaccurate. As a best practice, try to keep your data as clean as possible by maintaining completeness and consistency.
Not checking for spelling errors
Misspellings can be as simple as typing or input errors. Most of the time the wrong spelling or common grammatical errors can be detected, but it gets harder with things like names or addresses. For example, if you are working with a spreadsheet table of customer data, you might come across a customer named "John" whose name has been input incorrectly as "Jon" in some places. The spreadsheet's spellcheck probably won't flag this, so if you don't double-check for spelling errors and catch this, your analysis will have mistakes in it.
How to change text to lower case/upper case/ proper case
Next, you'll process string data. The easiest way to clean up string data depends on the spreadsheet program you are using. If you are using Excel, you'll use a simple formula. If you are using Google Sheets, you can use an add-on to do this in a few clicks. Follow the steps in the relevant section below. Microsoft Excel: If you are using Microsoft Excel, this documentation explains how to use a formula to change the case of a text string. Follow these instructions to clean the string text and then move on to the confirmation and reflection section of this activity. Google Sheets: 1. Click on the Add-Ons option at the top of Google Sheets. 2. Click on Get add-ons. 3. Search for ChangeCase. 4. Click on Install to install the add-on. It may ask you to log in or verify the installation permissions. 5. Click on Column C. Be sure to deselect the column header, unless you want to change the case of that as well (which you don't). 6. Click on the Add-Ons tab and select ChangeCase. Select the option All uppercase. Notice the other options that you could have chosen if needed.
Common data cleaning pitfalls
Not checking for spelling errors, forgetting to document errors, not checking for misfielded values, overlooking missing values, only looking at a subset of the data, losing track of business objectives, not fixing the source of the error, not analyzing the system prior to data cleaning, not backing up your data prior to data cleaning, and not accounting for data cleaning in your deadlines/processes
When you plot data, you put it in a graph, chart, table, or other visual to help you quickly see what it looks like.
Plotting is very useful when trying to identify any skewed data or outliers. Looking at data in new and creative ways helps data analysts identify all kinds of dirty data.
Common issues with dirty data
Some common issues associated with dirty data include spelling and other text errors; inconsistent labels, formats, and field lengths; missing data; and duplicates. Misspellings, spelling variations, mixed-up letters, inconsistent punctuation, and typos in general happen when someone types in a piece of data incorrectly. As a data analyst, you'll also deal with different currencies. For example, one dataset may be in dollars and the other in euros, and you don't want to mix them up. Another type of dirty data is inconsistent formatting: something that should be formatted as currency is shown as a percentage. We discussed nulls previously, but as a reminder, nulls are empty fields. This kind of dirty data requires a little more work than just fixing a spelling error or changing a format. Another common type of dirty data is a duplicate. Maybe two different people added the same appointment on August 13th, not realizing that someone else had already done it. Or maybe the person entering the data hit copy and paste by accident. Whatever the reason, it's the data analyst's job to identify this error and correct it by deleting one of the duplicates. Another type of dirty data is incorrect labeling. EX: labeling a group of pictures but labeling one picture incorrectly. Another type of dirty data is inconsistent field length. A field is a single piece of information from a row or column of a spreadsheet. Field length is a tool for determining how many characters can be keyed into a field. Assigning a certain length to the fields in your spreadsheet is a great way to avoid errors. For instance, if you have a column for someone's birth year, you know the field length is four, because all years are four digits long.
Data analysts use different methods to look at data differently
Some of these methods include sorting and filtering, pivot tables, a function called VLOOKUP, and plotting to find outliers.
Sorting
Sorting involves arranging data into a meaningful order to make it easier to understand, analyze, and visualize. Sorting and filtering data helps data analysts customize and organize the information the way they need for a particular project. But these tools are also very useful for data cleaning.
Conditional Formatting
Spreadsheet tool that changes how cells appear when values meet specific conditions. Likewise, it can let you know when a cell does not meet the conditions you've set. Visual cues like this are very useful for data analysts, especially when we're working in a large spreadsheet with lots of data. Making certain data points stand out makes the information easier to understand and analyze. For cleaning data, knowing when the data doesn't follow the condition is very helpful.
Syntax
Syntax is a predetermined structure that includes all required information and its proper placement.
TRIM
TRIM is a function that removes leading, trailing, and repeated spaces in data. Sometimes when you import data, your cells have extra spaces, which can get in the way of your analysis. For example, if this cosmetics maker wanted to look up a specific client name, it won't show up in the search if it has extra spaces. You can use TRIM to fix that problem.
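TRIM's full behavior, removing leading, trailing, and repeated spaces, can be sketched in Python with `strip()` plus a small regular expression. The client name is a made-up example.

```python
import re

# Sketch of TRIM: strip leading/trailing spaces and collapse
# repeated interior spaces down to a single space.
def trim(text):
    return re.sub(r" +", " ", text.strip())

print(trim("  Kali   Smith  "))  # 'Kali Smith'
```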
Data Mapping Steps
The first step to data mapping is identifying what data needs to be moved. This includes the tables and the fields within them. We also need to define the desired format for the data once it reaches its destination. Next comes mapping the data. Depending on the schema and number of primary and foreign keys in a data source, data mapping can be simple or very complex. The next step is transforming the data into a consistent format. This is a great time to use concatenate. Now that everything's compatible, it's time to transfer the data to its destination. There's a lot of different ways to move data from one place to another, including querying, import wizards, and even simple drag and drop. After merging, we'll go into the testing phase of data mapping. For this, you inspect a sample piece of data to confirm that it's clean and properly formatted. It's also a smart practice to do spot checks on things such as the number of nulls. For the test, you can use a lot of the data cleaning tools we discussed previously, such as data validation, conditional formatting, COUNTIF, sorting, and filtering. Finally, once you've determined that the data is clean and compatible, you can start using it for analysis. Data mapping is so important because even one mistake when merging data can ripple throughout an organization, causing the same error to appear again and again. This leads to poor results. On the other hand, data mapping can save the day by giving you a clear road map you can follow to make sure your data arrives safely at its destination.
How to transpose data
The second technique you will practice will help you convert the data from the current long format (more rows than columns) to the wide format (more columns than rows). This action is called transposing. The following steps will help you transpose the data: 1. Highlight and copy the data that you want to transpose, including the column labels. You can do this by highlighting Columns A-H. In Excel, highlight only the relevant cells (A1-H45) instead of the headers. 2. Right-click on cell I1. This is where you want the transposed data to start. 3. Hover over Paste Special from the right-click menu. 4. Select the Transposed option. In Excel, select the Transpose icon under the paste options. You should now find the data transformed into the new wide format. At this point, you should remove the original long data from the spreadsheet. 5. Delete the previous long data. The easiest way to do this is to click on Column A, so the entire column is highlighted. Then, hold down the Shift key and click on Column H. You should find these columns highlighted. Right-click on the highlighted area and select Delete Columns A - H.
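Transposing rows to columns has a compact Python analogue using `zip(*rows)`. The table below is a made-up two-column example in long format.

```python
# Transpose a long-format table (rows) into wide format (columns).
long_format = [
    ["name", "score"],
    ["Anna", "90"],
    ["Luis", "85"],
]
wide_format = [list(col) for col in zip(*long_format)]
print(wide_format)  # [['name', 'Anna', 'Luis'], ['score', '90', '85']]
```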
Data cleaning features in spreadsheets
There's a lot of great efficiency tools that data analysts use all the time, such as conditional formatting, removing duplicates, formatting dates, fixing text strings and substrings, and splitting text to columns.
The next data cleaning step involves fixing misspellings, inconsistent capitalization, incorrect punctuation, and other typos.
These types of errors can lead to some big problems. EX: an incorrect email address can lead to sending emails to the wrong people instead of your intended recipient. Like the other problems you've come across, you can fix these problems manually. Or you can use spreadsheet tools, such as spellcheck, autocorrect, and conditional formatting, to make your life easier. There are also easy ways to convert text to lowercase, uppercase, or proper case.
The next step is removing formatting.
This is particularly important when you get data from lots of different sources. Every database has its own formatting, which can cause the data to seem inconsistent. Creating a clean and consistent visual appearance for your spreadsheets will help make it a valuable tool for you and your team when making key decisions. Most spreadsheet applications also have a "clear formats" tool, which is a great time saver. Cleaning data is an essential step in increasing the quality of your data.
V-Lookup
Vertical Lookup. It's a function that searches for a certain value in a column to return a corresponding piece of information. When data analysts look up information for a project, it's rare for all of the data they need to be in the same place. Usually, you'll have to search across multiple sheets or even different databases (very common). To put it simply, VLOOKUP searches for the value in the first argument in the leftmost column of the specified location. Then the value of the third argument tells VLOOKUP to return the value in the same row from the specified column. The "false" tells VLOOKUP that we want an exact match. In short, VLOOKUP takes the value in one cell and searches for a match in another place.
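VLOOKUP's exact-match behavior can be sketched in Python: search the leftmost column for a value and return the cell from another column in the matching row. The price table is a made-up example, and `col_index` is 1-based as in spreadsheets.

```python
# Minimal sketch of VLOOKUP with exact matching.
def vlookup(search_key, table, col_index):
    """Return table[row][col_index-1] where the row's first cell
    equals search_key; None plays the role of #N/A."""
    for row in table:
        if row[0] == search_key:
            return row[col_index - 1]
    return None

prices = [["apple", 0.5], ["banana", 0.25]]
print(vlookup("banana", prices, 2))  # 0.25
```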
Losing track of business objectives
When you are cleaning data, you might make new and interesting discoveries about your dataset-- but you don't want those discoveries to distract you from the task at hand. For example, if you were working with weather data to find the average number of rainy days in your city, you might notice some interesting patterns about snowfall, too. That is really interesting, but it isn't related to the question you are trying to answer right now. Being curious is great! But try not to let it distract you from the task at hand.
Data merging
is the process of combining two or more datasets into a single dataset. This presents a unique challenge because when two totally different datasets are combined, the information is almost guaranteed to be inconsistent and misaligned. For example, when two companies merge, customers may use services covered by both companies, which could create duplicates between the datasets. One dataset may use different terminology than the other, and one dataset may have a lot more data than the other. There's lots of other reasons why data analysts merge datasets. For example, in one of my past jobs, I merged a lot of data from multiple sources to get insights about our customers' purchases.
Filtering
Showing only the data that meets specific criteria while hiding the rest. Filters are very useful in data cleaning when you want to find a particular piece of information. When cleaning data, you might use a filter to find only values above a certain number, or just even or odd values. Again, this helps you find what you need quickly and separates out the information you want from the rest.
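Both filter examples mentioned above, values above a threshold and even values, can be sketched with Python list comprehensions. The numbers are made-up example data.

```python
# Filtering keeps only the values that meet a condition.
values = [3, 10, 7, 22, 4]

above_five = [v for v in values if v > 5]       # threshold filter
evens = [v for v in values if v % 2 == 0]       # even-values filter

print(above_five)  # [10, 7, 22]
print(evens)       # [10, 22, 4]
```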