Collecting, Cleaning, and Validating Data
A data set of results from a customer feedback survey that includes multiple, identical entries from a single customer will require a data scientist to ___________________.
remove duplicates
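As a minimal sketch of what removing duplicates can look like in practice, the Python below drops survey entries that are exact repeats of an earlier entry. The field names (customer_id, rating, comment) and the sample rows are made up for illustration and are not from the study set.

```python
def remove_duplicates(entries):
    """Return entries with exact repeats dropped, keeping the first copy of each."""
    seen = set()
    unique = []
    for entry in entries:
        # Build a hashable fingerprint of the whole entry (field order does not matter).
        key = tuple(sorted(entry.items()))
        if key not in seen:
            seen.add(key)
            unique.append(entry)
    return unique


# Hypothetical survey rows: the first two are identical entries from one customer.
survey = [
    {"customer_id": 101, "rating": 4, "comment": "Fast shipping"},
    {"customer_id": 101, "rating": 4, "comment": "Fast shipping"},
    {"customer_id": 102, "rating": 2, "comment": "Late delivery"},
]
deduped = remove_duplicates(survey)
```

Only rows that match on every field are treated as duplicates here, so a customer who submits two different responses keeps both.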
Data scientists validate their data after cleaning it to _______________________.
Ensure that their finished product contains quality, usable data
Removing duplicates, identifying outliers, and adding missing values are a few examples of ways to ____________.
clean data
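Two of the cleaning steps named above, identifying outliers and adding missing values, could be sketched as follows. The 1.5 x IQR outlier rule and the median fill are common choices assumed here for illustration; the study set does not prescribe a particular method, and the ratings list is invented.

```python
from statistics import median, quantiles


def find_outliers(values):
    """Flag values lying more than 1.5 * IQR outside the quartiles (assumed rule)."""
    present = [v for v in values if v is not None]
    q1, _, q3 = quantiles(present, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in present if v < low or v > high]


def fill_missing(values):
    """Replace missing entries (None) with the median of the values that are present."""
    present = [v for v in values if v is not None]
    fill = median(present)
    return [fill if v is None else v for v in values]


# Hypothetical 1-5 survey ratings with one missing answer and one data-entry error.
ratings = [4, 5, 3, None, 4, 5, 4, 3, 5, 50]
outliers = find_outliers(ratings)
filled = fill_missing(ratings)
```

Whether a flagged outlier is corrected, removed, or kept is a judgment call that depends on the data set.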
Removing unneeded data, fixing misspellings and identifying contradictions are each examples of ways to ______________.
clean data
Companies collect data from ________________.
close-ended surveys, open-ended surveys, interviews, focus groups and online analytics
Web scraping refers to ________________.
collecting data that is publicly available on the internet, usually by using an automated tool
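A rough sketch of the "automated tool" part of web scraping, using Python's built-in html.parser module: a real scraper would first download a public page (for example with urllib.request), but here a hard-coded HTML string stands in for that page so the example runs offline. The page content and the "review" class name are hypothetical.

```python
from html.parser import HTMLParser


class ReviewParser(HTMLParser):
    """Collect the text of every <p class="review"> element on a page."""

    def __init__(self):
        super().__init__()
        self.in_review = False
        self.reviews = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "p" and ("class", "review") in attrs:
            self.in_review = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_review = False

    def handle_data(self, data):
        if self.in_review and data.strip():
            self.reviews.append(data.strip())


# Stand-in for a downloaded public web page (made-up content).
page = """
<html><body>
<p class="review">Great product</p>
<p class="note">Not a review</p>
<p class="review">Arrived late</p>
</body></html>
"""
parser = ReviewParser()
parser.feed(page)
reviews = parser.reviews
```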
Data scientists clean data to __________________ and ___________________.
ensure accuracy and usability of data; fix and respond to errors
Which of the following methods are used by companies to collect data?
All of the above
A customer feedback survey shows multiple survey entries from a single customer that are identical. Why should just the duplicates be removed and not the unneeded data?
Both A and B
A data scientist would need to ____________________ to check for inconsistencies or unneeded data to find the most common occupations in the United States from a data set of the most common occupations in North America.
Both A and B
What is the difference between qualitative and quantitative data?
Both A and B
Why is it important for data scientists to clean data?
Both A and B
Why is it important for data scientists to validate their data?
Both A and B
Why might a business use web scraping to collect data?
Both A and B
Which of the following is not a method data scientists use to clean data?
Creating charts and graphs
_________________ refers to the process of verifying that data has been cleaned, corrected and properly formatted.
Data validation
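One way to picture data validation is as a set of checks run over data that has already been cleaned, reporting anything that is still missing, out of range, or badly formatted. The sketch below assumes a survey with customer_id, rating, and comment fields and a 1 to 5 rating scale; these rules and the sample rows are illustrative assumptions, not part of the study set.

```python
REQUIRED_FIELDS = ("customer_id", "rating", "comment")


def validate(entries):
    """Return a list of (row index, problem description) for entries that fail a check."""
    problems = []
    for index, entry in enumerate(entries):
        # Check 1: every required field is present and not empty.
        missing = [f for f in REQUIRED_FIELDS if entry.get(f) is None]
        if missing:
            problems.append((index, "missing: " + ", ".join(missing)))
            continue
        # Check 2: the rating is a whole number on the assumed 1-5 scale.
        rating = entry["rating"]
        if not isinstance(rating, int) or not 1 <= rating <= 5:
            problems.append((index, "rating outside 1-5"))
        # Check 3: the comment is formatted without stray leading/trailing spaces.
        if entry["comment"] != entry["comment"].strip():
            problems.append((index, "comment has stray whitespace"))
    return problems


# Hypothetical "cleaned" data that still contains two problems.
cleaned = [
    {"customer_id": 101, "rating": 4, "comment": "Fast shipping"},
    {"customer_id": 102, "rating": 9, "comment": "Late delivery"},
    {"customer_id": 103, "rating": 3, "comment": None},
]
problems = validate(cleaned)
```

An empty problems list would suggest the data passed these particular checks; it does not prove the data is correct.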
Which of the following statements does not explain the importance of data validation?
Data validation is important because it allows data scientists to categorize data by the way in which it was collected.
Which of the following is not a data validation method?
Fixing errors, such as misspellings in a data set
Which of the following is not an example of methods used by companies to collect data?
Focus groups
What is the first step a data scientist must take to validate collected and cleaned data?
Format the data
Which of the following is not a method data scientists use to clean data?
Identifying trends and patterns
What is the difference between primary and secondary data?
Primary data is collected by a person or business for their own use while secondary data is collected and analyzed by another source.
Which of the following statements does not explain the difference between primary and secondary data collection?
Primary data refers to data that is most relevant to a person or business while secondary data refers to data that is the least relevant to a person or business.
__________________ refers to data a person or business collects themselves while __________________ refers to data that is collected and analyzed by another source.
Primary data; secondary data
_______________ refers to data that is non-numeric while __________________ refers to data that is numeric.
Qualitative data; quantitative data
Which of the following statements does not explain the difference between qualitative and quantitative data?
Quantitative data is not easily handled or measured while qualitative data is easily handled or measured
A data scientist is cleaning a data set of results from a customer feedback survey when they notice multiple survey entries from a single customer. What should they do?
Remove duplicates
A data scientist would like to know the most common occupations in the United States. They find a data set of the most common occupations in North America that includes data from the United States, Canada, and Mexico. What is wrong with the data set?
The category (country studied) does not match, which can negatively impact data consistency.
Why is it important for a data scientist to both clean and validate a data set of the most common occupations in North America to find the most common occupations in the United States?
To check for data consistency and to see if some data needs to be removed from the data set
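For the North America example, the cleaning and validation steps might look like the sketch below: remove the rows that are not needed (Canada and Mexico), confirm that the remaining rows are consistent, then count occupations. The rows are invented purely for illustration and are not real occupation statistics.

```python
from collections import Counter

# Hypothetical rows from a North America occupations data set.
records = [
    {"country": "United States", "occupation": "Nurse"},
    {"country": "Canada", "occupation": "Teacher"},
    {"country": "United States", "occupation": "Cashier"},
    {"country": "Mexico", "occupation": "Farmer"},
    {"country": "United States", "occupation": "Nurse"},
]

# Clean: keep only the rows that match the question being asked.
us_only = [r for r in records if r["country"] == "United States"]

# Validate: every remaining row should now belong to the same category.
is_consistent = all(r["country"] == "United States" for r in us_only)

# Analyze: find the most common occupation among the remaining rows.
top_occupation = Counter(r["occupation"] for r in us_only).most_common(1)
```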
Collecting data that is publicly available on the internet, usually by using an automated tool, is called _________.
web scraping
