Data Vocab - Exam 1
Data analytics
A means of searching through large databases to make predictions or identify patterns or trends.
Relational databases
A means of storing data that eliminates redundancies and enforces business rules (such as referential integrity)
Value correlation
Checking collections of values against a rule that must hold true over the data
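A minimal sketch of a value correlation check, assuming a hypothetical rule that an order cannot ship before it is placed. The field names and records are made up for illustration.

```python
from datetime import date

# Made-up order records; the rule relates two values in the same record.
orders = [
    {"order_id": "A1", "order_date": date(2024, 3, 1), "ship_date": date(2024, 3, 4)},
    {"order_id": "A2", "order_date": date(2024, 3, 5), "ship_date": date(2024, 3, 2)},
    {"order_id": "A3", "order_date": date(2024, 3, 7), "ship_date": date(2024, 3, 7)},
]

# Rule that must hold over the data: ship_date >= order_date.
violations = [o["order_id"] for o in orders if o["ship_date"] < o["order_date"]]
print(violations)
```

Each value in record A2 could pass a check on its own; only the combination breaks the rule.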
Data timeliness
Problems related to this include data that may eventually be accurate, but are not available when needed
Data accuracy for records
Problems related to this include missing records, incorrect records, and old records.
Classification
Process to assign data to categories; helps with prediction
A well-chosen average
When the word "average" is used to describe something with no indication of whether it is actually a mean, median, or mode
Sampling bias
When not every item in the population has an equal chance to be chosen for the sample
Data quality
What data has when it satisfies the requirements of its intended use
Missing value
A data element that has no value in it, which may be either accurate or inaccurate, is a ___
Co-occurrence grouping
A data approach meant to identify associations among individuals based on transactions that involve them.
Value inspection
A method to identify the presence of inaccuracies through visual inspection when it is not possible to create a clear rule that defines the boundary between right and wrong
Composite/concatenated primary key
A primary key composed of two or more data elements; this is typically the primary key of a bridge table
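A minimal sketch of a bridge table with a composite primary key, using Python's built-in sqlite3 module. The student, course, and enrollment tables are hypothetical names chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    CREATE TABLE student (student_id TEXT PRIMARY KEY);
    CREATE TABLE course  (course_id  TEXT PRIMARY KEY);
    CREATE TABLE enrollment (
        student_id TEXT REFERENCES student(student_id),
        course_id  TEXT REFERENCES course(course_id),
        PRIMARY KEY (student_id, course_id)  -- composite primary key
    );
    INSERT INTO student VALUES ('S1');
    INSERT INTO course  VALUES ('C1');
    INSERT INTO enrollment VALUES ('S1', 'C1');
""")

def try_insert(student_id, course_id):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO enrollment VALUES (?, ?)", (student_id, course_id))
        return True
    except sqlite3.IntegrityError:
        return False

duplicate_accepted = try_insert("S1", "C1")  # same key pair a second time
orphan_accepted = try_insert("S9", "C1")     # S9 does not exist in student
print(duplicate_accepted, orphan_accepted)
```

In this sketch both inserts are rejected: the first repeats an existing key pair, and the second would break referential integrity because it points to a student that does not exist.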
Clustering
A process to group individuals based on characteristics or attributes
Data request form
A request for data that you do not have direct access to
Regression
A statistical method that predicts a variable based on the influence of one or more other variables.
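A minimal sketch of simple linear regression by ordinary least squares, assuming one independent variable. The data (hours studied and exam score) are made up for illustration.

```python
from statistics import mean

# Made-up data: hours studied (independent) and exam score (dependent).
hours = [1, 2, 3, 4, 5]
scores = [52, 57, 61, 68, 72]

x_bar, y_bar = mean(hours), mean(scores)

# Least-squares slope and intercept for the line: score = intercept + slope * hours
slope = (
    sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores))
    / sum((x - x_bar) ** 2 for x in hours)
)
intercept = y_bar - slope * x_bar

def predict(x):
    """Predict the dependent variable for a given value of the independent variable."""
    return intercept + slope * x

print(round(slope, 2), round(intercept, 2), round(predict(6), 2))
```

Here the fitted slope is about 5.1 points per hour of study, so the sketch predicts roughly 77.3 for six hours.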
Nominal data
A type of qualitative data that you can count, group, and take a proportion of
Dependent variable
A variable in regression analysis that is predicted based on the predictor variable(s)
Independent variable
A variable that predicts or explains another variable, used in regression analysis
Degree of significance
A way to determine if an inadequate sample is being used to make a claim
Data reduction
Allows the user to focus on the most critical items in a large set of data
Profiling
An attempt to identify "typical" behavior by generating summary statistics about the data such as mean and standard deviation
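A minimal sketch of profiling with summary statistics. The amounts are made up, and the two-standard-deviation cutoff is an assumption chosen for illustration, not a rule from the course material.

```python
from statistics import mean, stdev

# Made-up daily transaction amounts for one account.
amounts = [20, 22, 19, 21, 20, 23, 95]

avg = mean(amounts)
sd = stdev(amounts)  # sample standard deviation

# Treat anything more than two standard deviations from the mean as atypical.
atypical = [a for a in amounts if abs(a - avg) > 2 * sd]
print(round(avg, 2), round(sd, 2), atypical)
```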
Similarity matching
An attempt to identify similar individuals based on known data
Link prediction
An attempt to predict a relationship between two data items
Similarity matching
Approach that tries to identify similar individuals based on known data
Qualitative data
Categorical data that can only be counted and grouped, although sometimes the data can be ranked
The gee whiz graph
Changing the scale or proportion of a graph, or eliminating lower values, to make something look better or more significant than it really is
Structural analysis
Checking fields for uniqueness or consecutiveness, checking for orphans on collections of records, checking for circular references
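A minimal sketch of three of these structural checks (uniqueness, consecutiveness, and orphans) on made-up invoice and customer records. Circular-reference checking is left out of this sketch.

```python
# Made-up records: each invoice refers to a customer by customer_id.
customers = [{"customer_id": 1}, {"customer_id": 2}]
invoices = [
    {"invoice_no": 101, "customer_id": 1},
    {"invoice_no": 102, "customer_id": 2},
    {"invoice_no": 104, "customer_id": 3},  # gap after 102, unknown customer
    {"invoice_no": 104, "customer_id": 1},  # repeated invoice number
]

numbers = [inv["invoice_no"] for inv in invoices]

# Uniqueness: invoice numbers that appear more than once.
duplicates = sorted({n for n in numbers if numbers.count(n) > 1})

# Consecutiveness: numbers missing from the observed range.
missing = sorted(set(range(min(numbers), max(numbers) + 1)) - set(numbers))

# Orphans: invoices whose customer_id matches no customer record.
known = {c["customer_id"] for c in customers}
orphans = [inv["invoice_no"] for inv in invoices if inv["customer_id"] not in known]

print(duplicates, missing, orphans)
```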
Costs described in a data quality program business case
Costs of rework, cost of lost customers, cost of late reporting, cost of incorrect decisions. Other costs include implementing large projects without understanding the underlying databases or taking too long to complete the project.
Identified costs
Costs that are already known as costs for a data quality project
Non-key data elements/descriptive attributes
Data elements that are included in a table that are neither primary keys nor foreign keys
Discrete data
Data represented by whole numbers; e.g., an Astros game score
Ordinal data
Data that can be counted and categorized, and the categories can be ranked; e.g., gold, silver, and bronze medals
Interval data
Data that can be counted and grouped like qualitative data, and the differences between each data point are meaningful
Structured v unstructured data
Data that can be used in a relational database v data such as video that cannot be readily adapted for use in a relational database
Continuous data
Data that can take on any value within a range; e.g., height
Normal distribution
Distribution of data where the mean, median, and mode are all equal; half of the observations fall below the mean and half are above the mean
Mastering the data
Identifying and determining what data is needed for answering the questions identified in a data analytic project
Steps in the IMPACT cycle
Identifying the questions to be answered, mastering the data, performing the test plan, addressing and refining results, communicating insights, and tracking outcomes
Potential for future costs included in a business case
Identifying what could happen if certain data elements contain inaccurate information
An example of "post hoc fallacy"
Assuming that because B follows A, A must have caused B
Ratio data
Data that can be counted and grouped, whose intervals between data points are meaningful, and for which zero has meaning; for example, zero has a meaning of "the absence of"
Quantitative data
In these types of data, the intervals between data points are meaningful so that means, medians, and modes can be calculated
Data dictionary
Includes descriptions of all the data attributes (e.g., range, domain, numeric v. alphanumeric, etc.)
Element analysis
Looking at individual values in isolation to determine if they are valid
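A minimal sketch of element analysis, where each value is tested on its own against a validity rule. The rules and the allowed state list are hypothetical.

```python
import re

# Hypothetical domain of allowed values for a state field.
VALID_STATES = {"TX", "LA", "OK"}

def is_valid_state(value):
    """Valid if the value is in the allowed domain."""
    return value in VALID_STATES

def is_valid_zip(value):
    """Valid if the value is exactly five digits."""
    return re.fullmatch(r"\d{5}", value) is not None

def is_valid_quantity(value):
    """Valid if the value is a whole number within an assumed range of 1 to 1000."""
    return isinstance(value, int) and 1 <= value <= 1000

print(is_valid_state("TX"), is_valid_zip("7700"), is_valid_quantity(0))
```

A value that passes these checks is valid, which does not guarantee that it is accurate for the object it describes.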
Reverification
Manually going back to the original source of the information to check every value
Statisticulation
Misinforming people by the use of statistics
The argument for a data quality program
Placing your company in the best position to rapidly respond to changes in the business requires having and maintaining highly accurate data and metadata in the company's information systems
Data accuracy for data values
Refers to whether data values stored for an object are the correct values
Spurious precision
Saying people sleep an average of 6.71 hours a night without taking into account that most people will miss their guess by a quarter hour or more
Flat file
Storing data in one place (such as an Excel spreadsheet), rather than in multiple tables
Change-induced inconsistencies
System changes that alter the way information is recorded, or its granularity, describe ____
Assessment project
The part of the project that is all cost and no value unless it is implemented
Extracting, transforming, and loading of data
The process of mastering the data in a data analytic project
Primary key
The unique identifier for each record in a table; it is typically an alphanumeric code
Valid value
The value is in the collection of possible accurate values and is represented in a consistent and unambiguous way
Trusted data
This issue is related to problems with an application that produce misinformation, leading to a lack of believability
Value representation consistency
Two values that are both correct and unambiguous can still cause problems; this issue relates to ____
Foreign key
Typically the primary key in one table that is used to link information from that table to another table.
Statistical error/standard error
Used to determine how accurately your sample can be taken to represent the population
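A minimal sketch of the standard error of a sample mean, computed as the sample standard deviation divided by the square root of the sample size. The sample (hours of sleep per night) is made up for illustration.

```python
from math import sqrt
from statistics import mean, stdev

# Made-up sample: reported hours of sleep per night.
sample = [6.5, 7.0, 6.0, 7.5, 8.0, 6.5, 7.0, 7.5]

n = len(sample)
sample_mean = mean(sample)
standard_error = stdev(sample) / sqrt(n)  # s / sqrt(n)

print(round(sample_mean, 2), round(standard_error, 3))
```

A smaller standard error means the sample mean is a more reliable stand-in for the population mean; a larger sample shrinks it.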
Business case
Used to present the difference between the gains and costs of a project, for example a case for a data quality program
Data relevance
Without this information specifically related to the use of the data, the data has a low level of quality