Course 4: Process Data from Dirty to Clean


What are the most common processes and procedures handled by data engineers?

- Developing, maintaining, and testing databases and related systems
- Transforming data into a useful format for analysis
- Giving data a reliable infrastructure

T/F: A data analyst determines an appropriate sample size for a survey. They can check their work by making sure the confidence level percentage plus the margin of error percentage add up to 100%.

False

Fill in the blank: While cleaning data, documentation is used to track _____.

errors, deletions, changes

Fill in the blank: Data _____ is the process of matching fields from one data source to another.

mapping

A data analyst is given a dataset for analysis. It includes data about the total population of every country in the previous 20 years. Which of the following questions can the analyst use this dataset to address?

- What was the average population of a certain country from 2015 through 2020?
- What was the difference in population between two specific countries in 2018?

A data analyst at a nonprofit organization is working with a dataset about a summer fundraiser. Although they have a lot of useful data by the end of June, they recognize that the data is insufficient. So, they decide to wait until the end of the season to begin working with the dataset. Which type of insufficient data does this example describe?

Data that keeps updating

T/F: Sometimes during analysis, an analyst discovers that it's necessary to adjust the business objective. When this happens, the analyst should take the initiative to do so without involving others in order to be respectful of their time.

False

T/F: VLOOKUP searches for a value in a row in order to return a corresponding piece of information.

False

A data analyst is managing a database of customer information for a retail store. What SQL command can the analyst use to add a new customer to the database?

INSERT INTO
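A minimal sketch of INSERT INTO, run through Python's built-in sqlite3 module so it is self-contained. The customers table, its columns, and the sample customer are hypothetical and only serve the illustration.

```python
import sqlite3

# In-memory database so the example leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT, city TEXT)")

# INSERT INTO adds one new row; the ? placeholders keep values separate from the SQL text.
conn.execute(
    "INSERT INTO customers (customer_id, name, city) VALUES (?, ?, ?)",
    (1, "Ada Perez", "Denver"),
)

rows = conn.execute("SELECT customer_id, name, city FROM customers").fetchall()
print(rows)
```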

What can jeopardize data integrity throughout its lifecycle?

Malware, system failures, human error

A research team runs an experiment to determine if a new security system is more effective than the previous version. What type of results are required for the experiment to be statistically significant?

Results that are real and not caused by random chance

Which of the following queries considers one or more conditions and returns a value as soon as that condition is met?

SELECT * CASE WHEN COLUMN = VARIABLE

A data analyst is tasked with identifying what orders are still in transit. The current list of orders contains trillions of rows. What is the best tool for the analyst to use?

SQL

A data analyst is in the verification step. They consider the business problem, the goal, and the data involved in their analytics project. What scenario does this describe?

Seeing the big picture

For a function to work properly, data analysts must follow each function's predetermined structure. What is this structure called?

Syntax

During the verification process, you find that you missed a few leading spaces during data cleaning. What function can you use to eliminate these spaces?

TRIM

Fill in the blank: To remove leading, trailing, and repeated spaces in data, analysts use the ____ function.

TRIM
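As a rough illustration of what the spreadsheet TRIM function does, the Python sketch below removes leading, trailing, and repeated spaces. The helper name trim_like_spreadsheet is made up for this example; it is an approximation, not the spreadsheet implementation.

```python
def trim_like_spreadsheet(text: str) -> str:
    # Split on single spaces, drop the empty pieces left by extra spaces,
    # then rejoin the remaining words with exactly one space.
    return " ".join(part for part in text.split(" ") if part)


cleaned = trim_like_spreadsheet("  New   York  ")
print(repr(cleaned))
```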

Which of the following SQL functions can data analysts use to clean string variables?

TRIM, SUBSTR
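A small sketch of TRIM and SUBSTR on a string column, using SQLite through Python's sqlite3 module. The customers table and its country values are invented for the example, and other SQL dialects may spell these functions slightly differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (country TEXT)")
conn.executemany(
    "INSERT INTO customers (country) VALUES (?)",
    [("  USA",), ("United States ",)],
)

# TRIM strips leading and trailing spaces; SUBSTR(text, 1, 2) keeps the first two characters.
rows = conn.execute(
    "SELECT TRIM(country), SUBSTR(TRIM(country), 1, 2) FROM customers ORDER BY rowid"
).fetchall()
print(rows)
```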

A data analyst is cleaning a dataset with inconsistent formats and repeated cases. They use the TRIM function to remove extra spaces from string variables. What other tools can they use for data cleaning?

Remove duplicates and find and replace

In order to have a high confidence level in a customer survey, what should the sample size accurately reflect?

The entire population

A data analyst uses the COUNTA function to count which of the following?

The total number of values within a specified range

T/F: When a data analyst changes SQL queries and uses comments to create a changelog, they specify the changes they made and why they made them.

True

A data analyst is cleaning a dataset. They want to confirm that users entered five-digit zip codes correctly by checking the data in a certain spreadsheet column. What would be most helpful as the next step?

Using the field length tool to specify the number of characters in each cell in the column
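The same field-length idea can be sketched outside a spreadsheet. The snippet below flags entries that are not exactly five characters long; the zip_codes list is hypothetical sample data.

```python
# Zip codes stored as text so leading zeros are not lost.
zip_codes = ["80202", "8020", "802021", "10001"]

# Keep only the entries whose length differs from the expected five characters.
wrong_length = [code for code in zip_codes if len(code) != 5]
print(wrong_length)
```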

Documenting data-cleaning makes it possible to achieve what goals?

be transparent about your process, keep team members on the same page, and demonstrate to project stakeholders that you are accountable

Which of the following are benefits of using SQL?

can handle huge amounts of data, can be adapted and used with multiple database programs, and offers powerful tools for cleaning data.

Fill in the blank: A changelog contains a _____ list of modifications made to a project.

chronological

Fill in the blank: Every database has its own formatting, which can cause the data to seem inconsistent. Data analysts use the _____ tool to create a clean and consistent visual appearance for their spreadsheets.

clear formats

Fill in the blank: In data analytics, _____ describes how well two or more datasets are able to work together.

compatibility

A healthcare company keeps copies of their data at several locations across the country. The data becomes compromised because each location creates a copy of the original at different times of day. Which of the following processes caused the compromise?

data replication

Which of the following are limitations that might lead to insufficient data?

data that updates continually, outdated data, and data from a single source.

Fill in the blank: A _____ is a character that the SPLIT function uses to determine where a text string is to be divided.

delimiter
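To show the role of a delimiter, here is a short Python sketch that behaves much like the spreadsheet SPLIT function. The comma-separated record is made-up sample text.

```python
record = "Perez,Ada,Denver"

# The comma is the delimiter: the string is divided wherever it appears.
parts = record.split(",")
print(parts)
```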

Fill in the blank: Margin of error is the _____ amount that the sample results are expected to differ from those of the actual population.

maximum

Which of the following tasks can data analysts do using both spreadsheets and SQL?

perform arithmetic, use formulas, and join data.

Making sure data is properly verified is an important part of the data-cleaning process. Which of the following tasks are involved in this verification?

recheck the data-cleaning effort, manually fix errors, and consider whether the data is credible and appropriate for the project.

Why is it important for a data analyst to document the evolution of a dataset?

recover data-cleaning errors, inform other users of changes, and determine the quality of the data

A data analyst wants to find out how many people in Utah have swimming pools. It's unlikely that they can survey every Utah resident. Instead, they survey enough people to be representative of the population. This describes what data analytics concept?

sample

Which of the following principles are key elements of data integrity?

the accuracy, completeness, consistency, and trustworthiness of data throughout its life cycle.

Fill in the blank: Sampling bias in data collection happens when a sample isn't representative of _____.

the population as a whole

Fill in the blank: A data analyst finishes cleaning their data. The next step in the process is reporting and ____.

verification

SQL is a language used to communicate with databases. Like most languages, SQL has dialects. What are the advantages of learning and using standard SQL?

works with a majority of databases and requires a small number of syntax changes to adapt to other dialects

Which process do data analysts use to make data more organized and easier to read?

Data manipulation

In SQL databases, what data type refers to a number that does not contain a decimal?

Integer

Fill in the blank: Documentation is the process of tracking _____ during data cleaning.

additions, changes, and deletions

What should an analyst do if they do not have the data needed to meet a business objective?

They should gather related data on a small scale and request additional time. Then, they can find more complete data or perform the analysis by finding and using proxy data from other datasets.

What are the most common processes and procedures handled by data warehousing specialists?

- Ensuring data is secure
- Ensuring data is backed up to prevent loss
- Ensuring data is available

A team of data analysts is working on a large project that will take months to complete and contains a huge amount of data. They need to document their process and communicate with multiple databases. The team decides to use a SQL server as the main analysis tool for this project and SQL for the queries. What makes this the most efficient tool?

- SQL efficiently handles large amounts of data.
- SQL allows you to connect to multiple databases.
- SQL records queries and changes throughout a project.

Conditional formatting is a spreadsheet tool that changes how cells appear when values meet a specific condition. Data analysts can use conditional formatting to do which of the following tasks?

- To identify blank cells or missing information
- To make cells stand out for more efficient analysis

A data analyst uses the COUNTIF function to count the number of times a value less than 5 occurs between spreadsheet cells A2 through A100. What is the correct syntax?

=COUNTIF(A2:A100,"<5")
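For comparison, the counting logic behind this COUNTIF formula can be sketched in Python. The values list stands in for cells A2 through A100 and is invented for the example.

```python
values = [1, 7, 4, 5, 3, 12]

# Count only the entries that satisfy the condition "less than 5".
count_below_5 = sum(1 for value in values if value < 5)
print(count_below_5)
```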

An analyst is cleaning a new dataset. They want to make sure the data contained from cell B2 through cell B100 does not contain a number smaller than 10. Which COUNTIF function syntax can be used to answer this question?

=COUNTIF(B2:B100,"<10")

In a spreadsheet, what function would you use to extract the last three characters of the string located in row 4, column C?

=RIGHT(C4,3)
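A rough Python counterpart to RIGHT, assuming the cell holds a plain text string. The helper name right and the sample value are hypothetical.

```python
def right(text: str, n: int) -> str:
    # Return the last n characters; an n of 0 or less yields an empty string.
    return text[-n:] if n > 0 else ""


last_three = right("AB-1234-XYZ", 3)
print(last_three)
```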

Describe the difference between a null and a zero in a dataset.

A null indicates that a value does not exist. A zero is a numerical response.
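The difference matters in calculations. In the sketch below (SQLite via Python's sqlite3 module, with a hypothetical survey table), the zero counts toward the average while the null is ignored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE survey (respondent TEXT, purchases INTEGER)")
conn.executemany(
    "INSERT INTO survey (respondent, purchases) VALUES (?, ?)",
    [("a", 0), ("b", None), ("c", 4)],  # None is stored as NULL
)

# AVG and COUNT(column) skip NULLs; the zero is a real answer and is included.
average, answered, missing = conn.execute(
    "SELECT AVG(purchases), COUNT(purchases), SUM(purchases IS NULL) FROM survey"
).fetchone()
print(average, answered, missing)
```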

What is the process of combining two or more datasets into a single dataset?

Data merging

A car manufacturer wants to learn more about the brand preferences of electric car owners. There are millions of electric car owners in the world. Who should the company survey?

A sample of all electric car owners

Describe the relationship between a text string and a substring.

A text string is a group of characters within a cell. A substring is a smaller subset of that text string.

To correct a typo in a database column, where should you insert a CASE statement in a query?

As a SELECT clause
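A minimal sketch of a CASE statement placed in the SELECT clause to correct a typo, run in SQLite through Python's sqlite3 module. The customers table and the misspelled name are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, first_name TEXT)")
conn.executemany(
    "INSERT INTO customers (customer_id, first_name) VALUES (?, ?)",
    [(1, "Tnoy"), (2, "Maria")],
)

# CASE returns the corrected value when the condition matches, else the original value.
rows = conn.execute(
    """
    SELECT customer_id,
           CASE WHEN first_name = 'Tnoy' THEN 'Tony' ELSE first_name END AS cleaned_name
    FROM customers
    ORDER BY customer_id
    """
).fetchall()
print(rows)
```

The query leaves the stored data unchanged; it only corrects the value in the result set.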

In a survey about a new cleaning product, 75% of respondents report they would buy the product again. The margin of error for the survey is 5%. Based on the margin of error, what percentage range reflects the population's true response?

Between 70% and 80%
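The range comes from subtracting and adding the margin of error to the sample result, as this short calculation shows (variable names are illustrative).

```python
sample_result = 75      # percent of respondents who would buy again
margin_of_error = 5     # percentage points

# The population's true response is expected to fall within this range.
low = sample_result - margin_of_error
high = sample_result + margin_of_error
print(low, high)
```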

You're working with a dataset that contains a float column with a significant amount of decimal places. This level of granularity is not needed for your current analysis. How can you convert the data in the float column to be integer data?

CAST
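A small sketch of CAST converting float values to integers, using SQLite through Python's sqlite3 module. The purchases table and prices are hypothetical. Note that SQLite truncates toward zero when casting; other databases may round instead, so check the dialect in use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (price REAL)")
conn.executemany("INSERT INTO purchases (price) VALUES (?)", [(19.99,), (5.5,)])

# CAST(price AS INTEGER) drops the decimal part in SQLite.
rows = conn.execute(
    "SELECT CAST(price AS INTEGER) FROM purchases ORDER BY rowid"
).fetchall()
print(rows)
```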

Fill in the blank: The _____ function can be used to join strings to create a new column.

CONCAT
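String joining can be sketched as follows. This example runs in SQLite through Python's sqlite3 module and uses the || operator, which plays the same role as CONCAT in dialects such as BigQuery or MySQL; the people table and its columns are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (first_name TEXT, last_name TEXT)")
conn.execute("INSERT INTO people (first_name, last_name) VALUES ('Ada', 'Perez')")

# Join two string columns, with a space between them, into one new column.
rows = conn.execute(
    "SELECT first_name || ' ' || last_name AS full_name FROM people"
).fetchall()
print(rows)
```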

Fill in the blank: To count the total number of spreadsheet values within a specified range, a data analyst uses the _____ function.

COUNTA
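As a loose analogy for COUNTA, the Python sketch below counts every non-empty entry in a range, whether it is text or a number. The cells list is sample data, and None is assumed to stand for an empty cell.

```python
cells = ["blue", 3, None, 0, None, "red"]

# Count every cell that holds a value; zero is a value, an empty cell is not.
filled = sum(1 for cell in cells if cell is not None)
print(filled)
```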

Before analysis, a company collects data from countries that use different date formats. Which of the following updates would improve the data integrity?

Change all of the dates to the same format
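One way to standardize dates is sketched below with Python's datetime module. It assumes the source format of each value is already known; the sample dates and their formats are hypothetical.

```python
from datetime import datetime

# Each raw value is paired with the format used by its source country.
raw_dates = [
    ("03/04/2021", "%m/%d/%Y"),  # month/day/year
    ("04.03.2021", "%d.%m.%Y"),  # day.month.year
    ("2021-03-04", "%Y-%m-%d"),  # already ISO 8601
]

# Parse with the known format, then write every date in one consistent format.
iso_dates = [datetime.strptime(text, fmt).date().isoformat() for text, fmt in raw_dates]
print(iso_dates)
```

Knowing the source format matters: "03/04/2021" is March 4 in one convention and April 3 in another, so the format cannot be guessed safely from the text alone.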

