Course 4: Process Data from Dirty to Clean
CONCAT
A SQL function that adds strings together to create new text strings that can be used as unique keys
CAST
A SQL function that converts data from one datatype to another
COALESCE
A SQL function that returns the first non-null value in a list
CASE
A SQL statement that returns records that meet conditions by including an if/then statement in a query
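The three SQL entries above (CAST, COALESCE, CASE) can be tried together in one query. This is a minimal sketch, assuming Python's built-in sqlite3 module as the database; the purchases table, its columns, and the sample rows are made up for illustration.

```python
import sqlite3

# Hypothetical table, columns, and rows, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, price TEXT, nickname TEXT)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [("Ana", "19.99", None), ("Ben", "5.00", "Benny")],
)

rows = conn.execute(
    """
    SELECT
        CAST(price AS REAL) AS price_number,           -- text converted to a number
        COALESCE(nickname, customer) AS display_name,  -- first non-null value wins
        CASE
            WHEN CAST(price AS REAL) >= 10 THEN 'high' -- if/then logic inside the query
            ELSE 'low'
        END AS price_band
    FROM purchases
    ORDER BY customer
    """
).fetchall()
conn.close()

for row in rows:
    print(row)
```

Other SQL databases may spell type names differently (for example FLOAT64 instead of REAL), so the CAST target here is specific to SQLite.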
Fill handle
A box in the lower-right-hand corner of a selected spreadsheet cell that can be dragged through neighboring cells in order to continue an instruction
Equation
A calculation that involves addition, subtraction, multiplication, or division (also called a math expression)
Cell reference
A cell or a range of cells in a worksheet typically used in formulas and functions
Delimiter
A character that indicates the beginning or end of a data item
Attribute
A characteristic or quality of data used to label a column in a table
Database
A collection of data stored in a computer system
Dataset
A collection of data that can be manipulated or analyzed as one unit
Data
A collection of facts
Bias
A conscious or subconscious preference in favor of or against a person, group of people, or thing
Bad data source
A data source that is not reliable, original, comprehensive, current, and cited (ROCCC)
Boolean data
A data type with only two possible values, usually true or false
CSV (comma-separated values) file
A delimited text file that uses a comma to separate values
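To see a delimiter at work, the sketch below reads a small piece of comma-separated text with Python's standard csv module. The sample text is invented for illustration; note that a comma inside quotation marks is treated as part of the value rather than as a delimiter.

```python
import csv
import io

# Made-up sample data for illustration only.
raw_text = 'name,city\nAna,Lisbon\nBen,"Austin, TX"\n'

# The delimiter argument tells the reader which character separates values.
reader = csv.reader(io.StringIO(raw_text), delimiter=",")
parsed_rows = list(reader)

for row in parsed_rows:
    print(row)
```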
Data science
A field of study that uses raw data to create new ways of modeling and understanding the unknown
Changelog
A file containing a chronologically ordered list of modifications made to a project
DISTINCT
A keyword that is added to a SQL SELECT statement to retrieve only non-duplicate entries
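A short sketch of DISTINCT, again assuming Python's built-in sqlite3 module; the orders table and its contents are hypothetical.

```python
import sqlite3

# Hypothetical table with one duplicated customer name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT)")
conn.executemany("INSERT INTO orders VALUES (?)", [("Ana",), ("Ben",), ("Ana",)])

# DISTINCT keeps one copy of each value in the selected column.
unique_customers = [
    row[0]
    for row in conn.execute("SELECT DISTINCT customer FROM orders ORDER BY customer")
]
conn.close()

print(unique_customers)
```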
Agenda
A list of scheduled appointments
Data element
A piece of information in a dataset
Cloud
A place to keep data online, rather than on a computer hard drive
Data governance
A process for ensuring the formal management of a company's data assets
Algorithm
A process or set of rules followed for a specific task
Cross-field validation
A process that ensures certain conditions for multiple data fields are satisfied
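One way to picture cross-field validation is a check that compares several fields of the same record. The sketch below is only an illustration; the field names and the two rules are assumptions, not part of the course material.

```python
from datetime import date


def is_valid_enrollment(record):
    """Return True when the record's fields agree with one another."""
    # Rule across two fields: the end date must not come before the start date.
    dates_ok = record["start_date"] <= record["end_date"]
    # Rule across three fields: the parts must add up to the reported total.
    totals_ok = record["online"] + record["in_person"] == record["total"]
    return dates_ok and totals_ok


# Hypothetical records: the second one fails because 30 + 15 is not 50.
good = {"start_date": date(2024, 1, 8), "end_date": date(2024, 3, 1),
        "online": 30, "in_person": 20, "total": 50}
bad = dict(good, in_person=15)

print(is_valid_enrollment(good), is_valid_enrollment(bad))
```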
Data warehousing specialist
A professional who develops processes and procedures to effectively store and organize data
Data engineer
A professional who transforms data into a useful format for analysis and gives it a reliable infrastructure
Fairness
A quality of data analysis that does not create or reinforce bias
Action-oriented question
A question whose answers lead to change
Confidence interval
A range of values that conveys how likely a statistical estimate reflects the population
Field
A single piece of information from a row or column of a spreadsheet; in a data table, typically a column in the table
Cookie
A small file stored on a computer that contains information about its users
DATEDIF
A spreadsheet function that calculates the number of days, months, or years between two dates
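For readers working outside a spreadsheet, a rough Python analogue of DATEDIF for day and month units might look like the sketch below. The function names are hypothetical, and end-of-month edge cases may not match every spreadsheet's behavior.

```python
from datetime import date


def days_between(start, end):
    """Number of days from start to end (similar in spirit to the "D" unit)."""
    return (end - start).days


def whole_months_between(start, end):
    """Number of complete months from start to end (similar to the "M" unit)."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    # The last month only counts once the day-of-month has been reached.
    if end.day < start.day:
        months -= 1
    return months


print(days_between(date(2024, 1, 15), date(2024, 3, 10)))
print(whole_months_between(date(2024, 1, 15), date(2024, 3, 10)))
```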
COUNT
A spreadsheet function that counts the number of cells in a range that contain numeric values
COUNTA
A spreadsheet function that counts the total number of values within a specified range
CONCATENATE
A spreadsheet function that joins together two or more text strings
AVERAGE
A spreadsheet function that returns an average of the values from a selected range
COUNTIF
A spreadsheet function that returns the number of cells in a range that match a specified value
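The difference between COUNT, COUNTA, and COUNTIF is easiest to see side by side. The sketch below imitates them in Python on one small list of cells. It assumes that None and the empty string stand in for empty cells, and it follows the behavior in Google Sheets and Excel, where COUNT tallies only numeric cells; the helper names are hypothetical.

```python
# None and "" stand in for empty cells (an assumption for this sketch).
cells = [12, "n/a", None, 7, "", 12]


def count(values):
    """Rough analogue of COUNT: how many cells hold numbers."""
    return sum(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )


def counta(values):
    """Rough analogue of COUNTA: how many cells are not empty."""
    return sum(v is not None and v != "" for v in values)


def countif(values, target):
    """Rough analogue of COUNTIF with an exact-match criterion."""
    return sum(v == target for v in values)


print(count(cells), counta(cells), countif(cells, 12))
```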
Conditional formatting
A spreadsheet tool that changes how cells appear when values meet specific conditions
Data validation
A tool for checking the accuracy and quality of data
Field length
A tool for determining how many characters can be keyed into a spreadsheet field
Data model
A tool for organizing data elements and how they relate to one another
Find and replace
A tool that finds a specified search term and replaces it with something else
Dashboard
A tool that monitors live, incoming data
Data type
An attribute that describes a piece of data based on its values, its programming language, or the operations it can perform
Digital photo
An electronic or computer-based image usually in BMP or JPG format
Duplicate data
Any record that inadvertently shares data with another record
Clean data
Data that is complete, correct, and relevant to the problem being solved
Discrete data
Data that is counted and has a limited number of values
Dirty data
Data that is incomplete, incorrect, or irrelevant to the problem to be solved
Continuous data
Data that is measured and can have almost any numeric value
External data
Data that lives, and is generated, outside of an organization
Audio file
Digitized audio storage usually in an MP3, AAC, or other compressed format
Data-inspired decision-making
Exploring different data sources to find out what they have in common
Access control
Features such as password protection, user permissions, and encryption that are used to protect a spreadsheet
Data design
How information is organized
Compatibility
How well two or more datasets are able to work together
Big data
Large, complex datasets typically involving long periods of time, which enable data analysts to address far-reaching business problems
Borders
Lines that can be added around two or more cells on a spreadsheet
Descriptive metadata
Metadata that describes a piece of data and can be used to identify it at a later point in time
Administrative metadata
Metadata that indicates the technical source of a digital asset
Data range
Numerical values that fall between predefined maximum and minimum values
Data privacy
Preserving a data subject's information any time a data transaction occurs
Data security
Protecting data from unauthorized access or corruption by adopting safety measures
Analytical skills
Qualities and characteristics associated with using facts to solve problems
Data analyst
Someone who collects, transforms, and organizes data in order to draw conclusions, make predictions, and drive informed decision-making
Data interoperability
The ability to integrate data from multiple sources; a key factor in the successful use of open data among companies and governments
Data integrity
The accuracy, completeness, consistency, and trustworthiness of data throughout its life cycle
Consent
The aspect of data ethics that presumes an individual's right to know how and why their personal data will be used before agreeing to provide it
Currency
The aspect of data ethics that presumes individuals should be aware of financial transactions resulting from the use of their personal data and the scale of those transactions
Estimated response rate
The average number of people who typically complete a survey
Data analysis
The collection, transformation, and organization of data in order to draw conclusions, make predictions, and drive informed decision-making
Context
The condition in which something exists or happens
Data constraints
The criteria that determine whether a piece of data is clean and valid
Accuracy
The degree to which data conforms to the actual entity being measured or described
Completeness
The degree to which data contains all desired components or measures
Consistency
The degree to which data is repeatable from different points of entry or collection
Data visualization
The graphical representation of data
Data strategy
The management of the people, processes, and tools used in data analysis
Confidence level
The probability that a sample accurately reflects the greater population
Data manipulation
The process of changing data to make it more organized and easier to read
Data merging
The process of combining two or more datasets into a single dataset
Data transfer
The process of copying data from a storage device to computer memory or from one computer to another
Analytical thinking
The process of identifying and defining a problem, then solving it by using data in an organized, step-by-step manner
Data mapping
The process of matching fields from one data source to another
Data anonymization
The process of protecting people's private or sensitive data by eliminating identifying information
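A very simple form of data anonymization is dropping the fields that identify a person before the data is shared. The sketch below shows that idea only; the field names are hypothetical, and real projects often need stronger techniques such as masking or hashing.

```python
# Hypothetical set of fields considered identifying.
IDENTIFYING_FIELDS = {"name", "email", "phone"}


def anonymize(record):
    """Return a copy of the record without the identifying fields."""
    return {
        key: value
        for key, value in record.items()
        if key not in IDENTIFYING_FIELDS
    }


# Made-up record for illustration.
customer = {"name": "Ana", "email": "ana@example.com", "age": 34, "city": "Lisbon"}
print(anonymize(customer))
```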
Filtering
The process of showing only the data that meets specified criteria while hiding the rest
Data replication
The process of storing data in multiple locations
A/B testing
The process of testing two variations of the same web page to determine which page is more successful at attracting user traffic and generating revenue
Business task
The question or problem data analysis answers for a business
Data analytics
The science of data
Data life cycle
The sequence of stages that data experiences, which include plan, capture, manage, analyze, archive, and destroy
Data analysis process
The six phases of ask, prepare, process, analyze, share, and act whose purpose is to gain insights that drive informed decision-making
Experimenter bias
The tendency for different people to observe things differently (Refer to Observer bias)
Confirmation bias
The tendency to search for or interpret information in a way that confirms pre-existing beliefs
Data ecosystem
The various elements that interact with one another in order to produce, manage, store, organize, analyze, and share data
Data-driven decision-making
Using facts to guide business strategy
Data ethics
Well-founded standards of right and wrong that dictate how data is collected, shared, and used
Ethics
Well-founded standards of right and wrong that prescribe what humans ought to do, usually in terms of rights, obligations, benefits to society, fairness, or specific virtues
Data bias
When a preference in favor of or against a person, group of people, or thing systematically skews data analysis results in a certain direction