Data Analytics
load (ETL process)
transformed data is inserted/loaded into the new database
recoding
transforming a record containing one value to a record containing another intended value
relational data
two or more data sets that have a common feature or variable
z test
used to compare the averages between a population and its sample
outliers
values that lie outside the normal range of values in a data set.
overlapping variables
variables that are present in both sources
stepped color
when colors change along a visualization based on whole numbers
Skewed distributions
when values tend toward one side of the range of the x-axis or the other
Steps of a hypothesis test
1. Create null and alternative hypothesis 2. Identify the test statistic 3. Calculate the p value 4. Compare that p value to the significance level
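The four steps above can be sketched with a one-sample z test. This is a minimal illustration using only Python's standard library; every number below is hypothetical, and a known population standard deviation is assumed.

```python
from math import sqrt
from statistics import NormalDist

# Step 1 (hypothetical setup):
#   null hypothesis:        the sample comes from a population with mean 100
#   alternative hypothesis: the sample mean differs from 100 (two-tailed)
population_mean = 100.0
population_sd = 15.0   # assumed to be known
sample_mean = 104.0
sample_size = 50
alpha = 0.05           # significance level

# Step 2: test statistic = difference in means relative to the standard error
standard_error = population_sd / sqrt(sample_size)
z = (sample_mean - population_mean) / standard_error

# Step 3: two-tailed p value from the standard normal distribution
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# Step 4: compare the p value to the significance level
reject_null = p_value <= alpha
print(f"z = {z:.3f}, p = {p_value:.4f}, reject null: {reject_null}")
```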
Data Analysis Pipeline
1. collecting 2. cleaning 3. analysis 4. visualization 5. communication
Predictive analysis
1. use historical sales trends to predict how sales will perform in the future 2. look at inbound call volumes to predict how many call center employees to staff
Trend Line/ Line of Best Fit
A straight line that comes closest to the points on a scatter plot.
left-skewed distribution
a distribution that has a concentration of data on the upper end (tail on the left). The mean is lower than the median
Volume (Big Data)
amount of data
qualitative data
data that describes features or attributes. Often relies on senses
if p value is greater than alpha
fail to reject the null hypothesis
Variety (Big Data)
different forms of structured and unstructured data
data uniqueness
ensures that no duplicates are entered
GROUP BY
groups selected fields to give a summary or see how the data is organized
usage data
how a brand or product is used; important for improving services and driving product development
frequency
how many times values occur within a data set
heat maps
illustrates density of data points to show trends
Entity Relationship Diagram (ERD)
illustrates the links between the tables in a relational database
If the data is relatively symmetrical, which measure of central tendency could I use to impute missing data
mean
forecasting
method for predicting how variables will change in the future
If the data has no mean or median, which measure of central tendency could I use to impute missing data?
mode
Using Quotes In SQL code
quotes go around non-numeric values such as text and dates
What type of data exists in an implied order? *Example*: -small, medium, large -movie ratings
ordinal qualitative data
data transformation
process of adjusting data so it's consistent across data sets
data mapping
process of finding matching variables across data sets
descriptive analysis
process of using current and historical data to identify trends and relationships
algorithm
process or set of rules to be followed in calculations or other problem-solving operations, often by a computer
Aggregating data/data joining
putting data together and calculating the sum or average
NumPy
Python library that supports numerical operations on arrays and matrices
What type of data is non numeric and relies on the senses: what I see, taste, hear, or feel?
qualitative
________ data is measured with a specific numerical quantity
quantitative
elevating questions
questions that observe the project in terms of its purpose and impact
adjoining questions
questions that seek to determine how a problem/issue fits into a bigger picture
bin
range of numbers in a histogram. Each bin must cover a range of the same width
READ operation
reading data from a database
Measures (Tableau)
refer to numeric data, or data on which you'd perform some kind of mathematical operation
P value is less than or equal to alpha
reject the null hypothesis
Determining If A Data Set is Relevant
review where it came from, how it was collected (administrative or usage?), and what it contains
hypothesis testing
seeing if the data at hand supports a proposed hypothesis
Clustering Algorithm
segmenting customers with similar attributes into different clusters to be able to target specific promotions to them
INNER JOIN
selects all records from both tables that have matching values
RIGHT JOIN
selects all records from the right table and its matching records from the left
FULL JOIN/OUTER JOIN
selects every record from table A and table B, regardless of whether there's a match
color scheme
specific ways of combining colors using a color wheel
constraints
specifies what type of data a table or column can accept
T-score
standard score with a mean of 50 and a standard deviation of 10 -the larger the t score, the larger the difference between the two groups compared
Graph Databases (NoSQL)
stores data as nodes and links
dimension tables
stores details from the fact table
Central Limit Theorem
the distribution of the sample average approaches a normal distribution as the sample size gets larger (a common rule of thumb is over 30 records), no matter what the shape of the population distribution. The more records there are in each sample, the closer the distribution of sample averages comes to being normal
types of external data
-benchmarking -open
How can missing data be coded in a data set?
-record is blank, null, or "NA" -coded as numbers (0, 99999) -a clearly incorrect date (Jan 1 1900) -its own text category (Unknown, Missing, Other)
Relational Database Management Systems
1. PostgreSQL 2. Oracle 3. Microsoft SQL server 4. MySQL
How to calculate interquartile range
1. Sort the data into ascending order, then split into 4 equal groups 2. Subtract Q1 from Q3 (Q3 - Q1)
How Tableau classifies data
1. dimensions 2. measures
relational database
A group of database tables that are connected by a defined relationship that ties the information together.
When comparing a p-value to a significance level, what result will allow you to reject the null hypothesis?
A p-value less than or equal to the significance level
Foreign Key
A field in one table that references the primary key of another table, used to establish a relationship between the two tables
subquery/inner statement
A query that's nested within another query
Composite Key
A set of two or more columns that together uniquely identify a record
snowflake schema
An expanded version of a star schema in which dimension tables are broken down into related tables
collection bias
Bias in how the data is collected
Surrogate Key
Computer-generated primary key
CRUD operations in SQL
Create Read/Select Update Delete
Data Quality
Data Integrity (Accuracy, Consistency) + Completeness + Uniqueness + Timeliness
Data completeness
Data includes all required elements
Measures of Spread
Data sets with a low range have values that don't deviate that much from each other. Data sets with a high range have values that vary widely.
To filter aggregated data what keyword do I use?
HAVING
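A small sketch of the difference between WHERE and HAVING, run through Python's built-in sqlite3 module. The `sales` table and its values are hypothetical and exist only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("East", 100), ("East", 250), ("West", 40), ("West", 30), ("North", 500)],
)

rows = conn.execute(
    """
    SELECT region, SUM(amount) AS total
    FROM sales
    WHERE amount > 0          -- WHERE filters individual rows before grouping
    GROUP BY region
    HAVING SUM(amount) > 100  -- HAVING filters the aggregated groups
    ORDER BY region
    """
).fetchall()
print(rows)
conn.close()
```

Here the West region is dropped because its total (70) does not pass the HAVING condition, even though every individual West row passes the WHERE condition.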
Transparency
Individuals should have a clear overview of the data that organizations collect and share
Why is a frequency table important?
It helps organize information with an unusually high or low number of values
% wildcard
Matches any sequence of zero or more characters
JOIN (SQL)
SQL command used to merge fields from two separate tables into a new table
UNION
SQL keyword that combines the results of two queries and returns the records from both, with duplicate rows removed (UNION ALL keeps the duplicates)
VIEW
SQL keyword used to create a virtual table
What section of a business requirement document involves everyone that's interested in, involved in, or impacted by the project?
Stakeholder identification
monochromatic color scheme
Use of different tints, shades, & intensities of ONE color
one sample t test
Used to determine if a single sample mean is different from a known population mean
logistic regression model
a classification model used to predict a qualitative outcome that has two possible values
relational database
a collection of data stored in one or more tables made up of rows and columns
What does a company's style guide usually include?
a color scheme
bimodal distribution
a distribution with two modes
null hypothesis
the default statement that there is no effect or no difference; it is written so that it can be falsified, or proved wrong, by the data
test statistic
a statistic whose value helps determine whether a null hypothesis should be rejected. The value tells how much of a difference there is between the average as it relates to the data's spread
time series
algorithm that aims to predict quantitative values based on previous values
decision tree
algorithm that consists of a hierarchical arrangement of criteria that predict a classification or a value, creates groups that are as "pure" as possible using "if/then" statements to create rules
classification model
algorithm that draws a conclusion from qualitative data
regression model
algorithm used to measure the relationship between numerical values
standard deviation
average distance of the data from the mean
extract (ETL process)
collecting data from multiple data sources
data wrangling
converting raw data into a usable format
decentralized/distributed databases
data located across various machines
structured data (relational database)
data organized in tables made up of rows and columns
data timeliness
data should be recorded in an appropriate period of time after the event
DISTINCT
keyword used to return unique values in a field
multivariate analysis
measures the relationship between three or more variables
If the data is skewed which measure of central tendency could I use to impute missing data
median
pandas
open source library for importing and exporting data, merging data sets, dropping and renaming columns, conducting descriptive analyses, and other data-manipulation tasks
symmetrical distribution
right and left halves of the distribution mirror each other
Velocity (Big Data)
the pace that data is generated
UNIQUE Constraint
Restriction placed on a column to ensure that no duplicate values exist for that column
Which excel function takes out a set number of characters starting from the end of the text in a cell
Right
Big Data Lifecycle
collection > storage > analysis > implementation
analogous color scheme
colors next to each other on the color wheel
What type of data can be measured along a scale and can take any point along the scale as its value? *Example*: temperature measured in Fahrenheit, Celsius, or Kelvin
continuous quantitative data
subsetting
create a smaller data set from a whole data set based on a certain filter
benchmarking data
data that benchmarks statistics about a specific population or industry across multiple organizations
time variant variables
data that can change over time (example: age, education, work history)
Time invariant variables
data that can't change over time (example: birthdate, eye color, height)
mixed data types
data that has numeric and string values
data accuracy
data that is free of errors
unclear data
data that's in a vague format
data corruption
data that's unreadable or unusable; can be fixed by a software manager or database administrator
Boolean data type
data type used to test whether or not a condition/statement is true or false.
Data collection
details about data collection provide context towards data reliability and timeliness
anomaly
deviation from what is normal. Example: flagging potentially fraudulent bank transactions by picking out transactions that are unusual compared to the account holder's typical spending behavior.
Python function that returns the descriptive statistics for a dataframe
df.describe()
Python function that returns the data types in a dataframe
df.info()
Python (pandas) attribute that returns the number of rows and columns in a dataframe
df.shape
Numeric ID columns should always be a ________ (dimension/measure) in Tableau
dimension
What type of data can only be represented through whole numbers? *Example*: the number of students in a class
discrete quantitative data
word cloud
display of common words and phrases found within a data set according to their frequency (the larger a word or phrase, the more times it exists within the data)
bell curve (normal distribution)
distribution of scores in which the bulk of the scores fall toward the middle, with progressively fewer scores toward the "tails" or extremes. Has the same value for mean, median, and mode
Data Validity
does the data fit the business and technical requirements of a project?
data profiling
examining the data that's available and providing summary information about it
transform (ETL process)
extracted data is converted into another format. Examples: calculating ages from date of birth; combining multiple data points like area codes and telephone numbers
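The two transform examples on this card can be sketched in plain Python. The function names, the record, and the reference date are all hypothetical; a fixed reference date is used so the result does not depend on when the code runs.

```python
from datetime import date


def age_on(birth_date, reference_date):
    """Age in whole years on reference_date."""
    had_birthday = (reference_date.month, reference_date.day) >= (
        birth_date.month,
        birth_date.day,
    )
    return reference_date.year - birth_date.year - (0 if had_birthday else 1)


def full_phone(area_code, number):
    """Combine two extracted fields into one formatted value."""
    return f"({area_code}) {number}"


# Hypothetical extracted record
record = {"dob": date(1990, 6, 15), "area_code": "555", "phone": "010-1234"}

transformed = {
    "age": age_on(record["dob"], date(2024, 6, 14)),  # day before the birthday
    "phone": full_phone(record["area_code"], record["phone"]),
}
print(transformed)
```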
Predictive analysis
extracts information from data and uses it to predict future trends and identify behavioral patterns
Columns in a table are called
fields
data sourcing
looking for existing data
data grain
lowest level of detail in the data
Geospatial charts
maps that display geographical data
point maps
maps that show the exact location of events
Interquartile Range (IQR)
measure the spread of data by finding the difference between the 25th and 75th percentiles of a data set (Q3 - Q1)
Variance
measures how far a data set is spread out
Normally distributed data has
more than 30 records
What type of data has no ordered categories? *Example:* categorizing cars by manufacturer (Nissan, Toyota, Honda) *none of these categories are more or less than
nominal qualitative data
4 Types of Data
nominal, ordinal, discrete, continuous
Effect of outliers
outliers can impact the range of a data set in a potentially misleading manner
Sentiment charts (ex. histograms) are the only type of chart that focuses exclusively on what type of data?
quantitative data
funneling questions
questions that start broad and go deeper and deeper into an issue
LEFT JOIN
returns all records from the left table and its matching record from the right table
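To compare INNER JOIN and LEFT JOIN side by side, here is a minimal sketch using Python's built-in sqlite3 module. The `customers` and `orders` tables are hypothetical. RIGHT JOIN and FULL JOIN are left out because older SQLite versions do not support them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, item TEXT)"
)
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ana"), (2, "Ben"), (3, "Cy")])
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1, "book"), (2, 1, "pen"), (3, 2, "lamp")],
)

# INNER JOIN: only customers that have at least one matching order
inner_rows = conn.execute(
    """
    SELECT c.name, o.item
    FROM customers AS c
    INNER JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.id, o.id
    """
).fetchall()

# LEFT JOIN: every customer; the order columns are NULL when there is no match
left_rows = conn.execute(
    """
    SELECT c.name, o.item
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.id, o.id
    """
).fetchall()

print(inner_rows)
print(left_rows)
conn.close()
```

The customer with no orders appears only in the LEFT JOIN result, with `None` (SQL NULL) in place of the item.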
cross joins
returns all the combinations of rows from two tables that do not have a column in common
fact table
stores information about an event (ex. movie rental)
data integration
the integration of data from multiple sources, which provides a unified view of all data
stationary
the mean and variance stay the same over time
median
the middle value in a group whose values are ordered from least to greatest
mode
the most frequent value in a data set -A data set can have multiple modes if multiple values appear the same number of times in a data set. -A data set may have no mode if every value occurs only once -A data set with no mode may indicate that the data is widely spread out, with no real central tendency
two-tailed test
a hypothesis test in which the sample mean could be either higher or lower than the population mean (a difference in either direction counts)
aggregate data
the presentation of data in a summarized form
normalization
the process of making units uniform between multiple data sets, such as changing a product weight from pounds to kilograms
OLTP (online transaction processing)
used to rapidly insert, delete, and update large quantities of transactions
outlier
value that is more than 2 or 3 standard deviations away from the mean
seasonality
values fluctuate during certain time periods
How does the Empirical Rule classify outliers?
values more than two standard deviations away from the mean
frequency table
A table used to show the number of times something occurs.
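A frequency table can be built in a few lines with `collections.Counter` from the standard library. The list of colors below is made-up sample data.

```python
from collections import Counter

# Hypothetical survey answers
answers = ["red", "blue", "red", "green", "red", "blue"]

# Count how many times each value occurs, most frequent first
frequency_table = Counter(answers).most_common()

for value, count in frequency_table:
    print(f"{value:<6} {count}")
```

The first row of the table is also the mode of the data set.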
bubble chart
A type of scatter plot with circular symbols used to compare three variables; the area of the circle indicates the value of a third variable
Candidate Key
Any field that can serve as a primary key
Latitude and Longitude
Latitude lines are lines that go horizontally on maps. Longitude lines are lines that go vertically on maps.
Which two aggregate functions in SQL can ONLY be used on numeric fields?
SUM and AVG
Business Requirements Document (BRD)
a list of requirements, goals, and objectives to all the stakeholders in a project (a.k.a. Project bible)
Data mapping
adding columns from one data set to another data set
geocoding
adding geographic codes to customer records to make it possible to plot customer addresses on a map
What kind of SQL functions can't be used in the WHERE clause?
aggregate (COUNT,AVG,SUM,MIN,MAX)
Paired t-test
compares the means from the same group at two different times.
application programming interface (API)
connects users to databases usually through an app
administrative data
data about the running of a business
internal data
data collected by the organization itself (or by another company that it pays to provide the data)
aggregated data
data from more than one source and grouped for comparison
Structured Data vs. Unstructured Data
structured: data in tabular form; unstructured: free response data
structured data
data in the form of rows and columns
centralized/single node database
data is located on a single machine that's on site
data consistency
data is uniform and correctly formatted
types of variables
date, text, numeric
deviation
departure or wandering away from the accepted standard
NOT NULL constraint
enforces a column can't have any empty or missing values
population (statistics)
entire set of items in a data set
composition/comparison charts
focuses on how different parts of a data set compare to the whole
histogram
frequency chart that displays range of values
What is the most common type of temporal chart?
line chart
Concatenation
linking items in a chain or series
What type of python function locates a certain column in the dataframe?
loc[] (an indexer, so it is used with square brackets rather than parentheses)
diagnostic analysis
looks for patterns in historical data to explain *why* something happened. Often supplemented with a machine
_ wildcard
matches a single character
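Both SQL wildcards can be tried with the LIKE operator through Python's built-in sqlite3 module. The `people` table and the names in it are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?)", [("Jan",), ("Jon",), ("Jean",), ("Dan",)]
)

# % stands for any run of characters (including none)
starts_with_j = [
    row[0]
    for row in conn.execute("SELECT name FROM people WHERE name LIKE 'J%' ORDER BY name")
]

# _ stands for exactly one character
j_one_char_n = [
    row[0]
    for row in conn.execute("SELECT name FROM people WHERE name LIKE 'J_n' ORDER BY name")
]

print(starts_with_j)
print(j_one_char_n)
conn.close()
```

`Jean` matches `J%` but not `J_n`, because two characters sit between the J and the n.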
scripts
multiple commands that are executed by a certain program
numerically coded qualitative data
qualitative data that has numbers and is usually an identifier (ex. license plate numbers, zip codes, employee ID, movie ratings)
linear regression model
regression model used to predict the value of a variable based on the value of another variable
CRISP-DM (Cross-Industry Standard Process for Data Mining)
standard process for data mining
descriptive statistics
summarize and describe data
Empirical Rule
tells you where most of the values lie in a normal distribution: 68% of all data is within one standard deviation of the mean, 95% of all data is within two standard deviations of the mean, and 99.7% of all data is within three standard deviations of the mean. Being more than one standard deviation away from the average is usually considered a large deviation
Dimensions (Tableau)
text or categorical data and are shown in the top-left corner of the dashboard
Z Scores are ...
the # of standard deviations above or below the mean
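As a quick illustration, z scores can be computed by subtracting the mean from each value and dividing by the standard deviation. The sketch below uses made-up scores and the population standard deviation (`pstdev`); using the sample standard deviation instead is an equally common choice.

```python
from statistics import mean, pstdev

# Hypothetical scores
scores = [2, 4, 4, 4, 5, 5, 7, 9]

center = mean(scores)
spread = pstdev(scores)  # population standard deviation

# z score = how many standard deviations a value sits above (+) or below (-) the mean
z_scores = [(x - center) / spread for x in scores]

for x, z in zip(scores, z_scores):
    print(f"value {x}: z = {z:+.2f}")
```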
data integrity
the accuracy and consistency of data
probability distribution
the likelihood of a certain outcome
one tailed test
a hypothesis test in which the sample mean is checked in one direction only: either higher than or lower than the population mean. Depends on what you're looking for
data record/observation
a single row in a data set (for example, one row in an Excel worksheet)
graduated symbol map
uses symbols of different sizes to indicate different amounts of something
inferential statistics
using a sample of data to infer information about an entire population
autocorrelation
the values of data points are correlated with earlier values of the same series at a certain time interval (lag)
Value (big data)
what purpose can this data serve?
measurement bias
when there's a problem with the machines or humans doing the measuring or observing
examples of benchmarking data
-Body mass index chart -application program interface (Twitter, Google Maps)
Examples of usage data
-Netflix: a sequence of clicks a customer follows before starting to watch a movie, how often a customer clicks on recommended movies after a movie has ended -interviews and surveys
4 Components of A Project Management Plan
1. Communicating to Stakeholders (How and when will I communicate to the people involved?) 2. Schedule and Milestones 3. Project deliverables - how will the project be presented 4. Defining My Audience - who are the people involved? high level executives, civilians = concise information
Resolving Data Integrity/Accuracy Issues
1. Is the original person or machine that collected the data available? If a person, can they correct the data value? If it was a machine, is there any record of a service, calibration, or coding change that could explain how the temperature records changed from Celsius to Fahrenheit? 2. Does other similar data (such as temperature from another weather center) exist that can be used for comparison? 3. Was the same variable measured over time so that historical data exists? 4. Can you identify the error by using the other values of the variable?
6 Components of A Business Requirement Document
1. Project Overview: quick summary of the project; includes motivation (problem that needs to be solved), objectives (what will be achieved), and scope 2. Stakeholder Identification: everyone that's interested in, involved in, or impacted by the project 3. Success Factors: the success criteria. Should be explicitly correlated with objectives 4. Assumptions and Constraints: defines factors that might limit the plan; constraint = limitations the project must operate under (scheduling, budgeting, software issues) 5. Requirements: details how the objectives will be achieved 6. Glossary of Terms
How to calculate variance
1. Subtract the mean from each score 2. Square those values 3. Add them all up 4. Divide by the number of values (or by the number of values minus 1 when the data is a sample)
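The variance steps can be followed one at a time in Python. This is a sketch with made-up scores; the standard library's `pvariance` and `variance` functions are used only to cross-check the hand calculation.

```python
from statistics import pvariance, variance

# Hypothetical scores
scores = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(scores) / len(scores)

# Step 1: difference between each score and the mean
deviations = [x - mean for x in scores]

# Step 2: square those values
squared = [d ** 2 for d in deviations]

# Step 3: add them all up
total = sum(squared)

# Step 4: divide by the number of values (population variance) ...
population_variance = total / len(scores)
# ... or by n - 1 when the scores are a sample from a larger population
sample_variance = total / (len(scores) - 1)

print(population_variance, pvariance(scores))
print(sample_variance, variance(scores))
```

The standard deviation is the square root of the variance.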
I Should Use Pie Charts in Tableau When...
1. There's no more than 3 categories 2. The categories aren't similar in size
Missing value strategies
1. average the non missing values 2. take the most common non missing values 3. take a random value
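The first two strategies, plus the median, can be sketched with the standard library. The values below are hypothetical, and `None` is assumed to be how missing values are coded in this example.

```python
from statistics import mean, median, mode

# Hypothetical column with two missing values and one large outlier (50)
raw = [10, 12, None, 11, 50, None, 12]

observed = [v for v in raw if v is not None]


def impute(values, fill):
    """Replace every missing value with the chosen fill value."""
    return [fill if v is None else v for v in values]


filled_mean = impute(raw, mean(observed))      # pulled upward by the outlier
filled_median = impute(raw, median(observed))  # more robust for skewed data
filled_mode = impute(raw, mode(observed))      # most common non-missing value

print(filled_mean)
print(filled_median)
print(filled_mode)
```

Because of the outlier, the mean fill value is well above most of the observed values, which is why the median is the usual choice for skewed data.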
Data Story Components
1. beginning: project motivation and objectives 2. middle: how I proceeded with my analysis 3. end: conclusions and recommendations
Handling Duplicate Records in SQL
1. create a unique table using the VIEW statement 2. delete the duplicate record from the table or view
Methods of Forecasting
1. linear extrapolation: draw a line through a series of existing data points to get an understanding of how the data will behave in the future 2. averaging 3. exponential smoothing: a weighted average that gives more weight to recent values and less weight to older ones 4. seasonality: school graduations
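A minimal sketch of simple exponential smoothing in its usual form, where each smoothed value is a weighted average of the newest observation and the previous smoothed value. The function name, the data, and the smoothing factor of 0.5 are assumptions for illustration. The `smoothing` parameter here is unrelated to the significance level alpha used in hypothesis testing.

```python
def exponential_smoothing(values, smoothing):
    """Return the smoothed series; smoothing is between 0 and 1.

    A higher smoothing factor puts more weight on recent values.
    """
    smoothed = [values[0]]  # start from the first observation
    for value in values[1:]:
        smoothed.append(smoothing * value + (1 - smoothing) * smoothed[-1])
    return smoothed


# Hypothetical monthly sales
sales = [10, 12, 14, 13]
smoothed_sales = exponential_smoothing(sales, smoothing=0.5)

# The last smoothed value serves as a simple forecast for the next period
forecast = smoothed_sales[-1]
print(smoothed_sales, forecast)
```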
Examples of descriptive analysis
1. looking at how sales or marketing performed over a time period 2. Examining the amount of inbound calls to a call center over a certain week or month
Strategies to Measure The Data's Central Tendency
1. mean 2. median 3. mode
ORDER BY
A SQL clause that sorts data in ascending or descending order
horizontal bar chart
A bar chart that displays the bars in a horizontal direction. Should be used when variable names are long
significance level (alpha)
A benchmark value that's compared to the p-value to determine if the null hypothesis will be rejected. It doesn't have to be calculated; it is chosen before the test. alpha = .05 corresponds to a 95 percent confidence level; alpha = .01 corresponds to a 99 percent confidence level
delimiter
A character that marks the beginning or end of a unit of data (commas, semicolon, slashes)
Outlier
A common definition of an outlier is a value that's either lower than Q1 - (IQR * 1.5) or larger than Q3 + (IQR * 1.5)
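The 1.5 * IQR rule can be sketched with `statistics.quantiles`. The data is made up, and the "inclusive" quartile method is an assumption: different tools compute quartiles slightly differently, so Q1 and Q3 may not match other software exactly.

```python
from statistics import quantiles

# Hypothetical data with one suspicious value
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]

# Quartiles (the method is a choice; conventions vary between tools)
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Fences from the common definition of an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}, fences=({lower_fence}, {upper_fence})")
print(outliers)
```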
star schema
A database design that contains one fact table and multiple dimension tables linked to it.
chi square test
A statistical method of testing used to compare observations against expectations
If My Expectation Has Been Challenged
Ask myself: why is the data not behaving as expected? What's going on here? Does the data actually disconfirm (i.e., not confirm) our expectations about how our organization works? Either the organization's understanding is incorrect, or there is an error in how you're processing or interpreting the data: it has to be one or the other.
NoSQL databases
Databases that can manipulate structured as well as unstructured data and inconsistent or missing data; are useful when working with Big Data. Example: Facebook Messenger, Amazon
Removing Records (General Rule of Thumb)
I can usually remove up to 5 percent of the data in a data set without causing any major issues. Only use central tendency and random values methods when the data you're missing makes up 5 percent or less of the total values within your data set.
Which excel function returns the total number of characters within a cell
Len
P value In null hypothesis significance testing
Measures the likelihood of something happening by random chance. Low p value = low probability of happening by chance
unstructured data
Not defined and does not follow a specified format
ETL Process (Extract, Transform, and Load)
Process of migrating data into a database 1. Extract necessary data from databases 2. Transform data so it's in a common format and free of errors 3. Load extracted and transformed data into the warehouse to be used
Interpreting Correlations
The correlation coefficient takes a value between -1 and +1 and provides information about the strength and direction of the linear relationship. The closer the coefficient is to +1, the stronger the positive relationship: if one variable increases, the other also increases. On the opposite side of the spectrum, the same holds true, only in reverse. The closer the coefficient is to -1, the stronger the relationship, but in the opposite direction, meaning that if one variable increases, the other decreases. Generally speaking, using the absolute value of the correlation coefficient: 0: no relationship; 0.1-0.3: weak relationship; 0.3-0.5: moderate relationship; 0.5-1.0: strong relationship
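To make the interpretation concrete, here is a sketch that computes the Pearson correlation coefficient by hand and labels its strength with the bands from this card. The function names and data are hypothetical, and treating any absolute value below 0.1 as "no relationship" is an assumption, since the card only lists 0 for that band.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / sqrt(spread_x * spread_y)


def strength(r):
    """Label the strength using the absolute value of r."""
    size = abs(r)
    if size < 0.1:   # assumption: values near 0 count as no relationship
        return "no relationship"
    if size < 0.3:
        return "weak"
    if size < 0.5:
        return "moderate"
    return "strong"


# Hypothetical data
hours_studied = [1, 2, 3, 4, 5]
test_score = [2, 4, 6, 8, 10]    # rises with hours studied
errors_made = [10, 8, 6, 4, 2]   # falls as hours studied rise

r_positive = pearson_r(hours_studied, test_score)
r_negative = pearson_r(hours_studied, errors_made)
print(r_positive, strength(r_positive))
print(r_negative, strength(r_negative))
```

The sign gives the direction of the relationship and the absolute value gives the strength, so both examples are labeled strong.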
absolute error
The magnitude of difference between the actual number and the nearest representable value.
color harmony
The practice of using color to create visualizations that are engaging, balanced, and pleasing to the eye
Student's t-distribution
The sampling distribution of the t statistic
spatial analysis
an analysis that incorporates a geographic component
git
Version control system that helps keep track of changes made to files and code
Five V's of Big Data
Volume, Velocity, Variety, Veracity, Value
CASE statement
WHEN, THEN, and END. WHEN specifies a condition; THEN indicates the value to return if the condition is met; END signifies the end of the case statement
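A short sketch of a CASE statement, run through Python's built-in sqlite3 module. The `exam` table, the score cutoffs, and the labels are hypothetical. The optional ELSE branch, which is not listed on the card, supplies the value when no WHEN condition is met.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE exam (student TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO exam VALUES (?, ?)", [("Ana", 95), ("Ben", 72), ("Cy", 40)]
)

graded = conn.execute(
    """
    SELECT student,
           CASE
               WHEN score >= 90 THEN 'high'    -- first condition that is met wins
               WHEN score >= 50 THEN 'medium'
               ELSE 'low'                      -- fallback when no WHEN matches
           END AS band
    FROM exam
    ORDER BY student
    """
).fetchall()
print(graded)
conn.close()
```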
Data Analyst Question Process
What do we want to measure --> How will we measure it? ----> What complications may arise?
note
When the data contains an outlier, the analyst may want to consider using a different measure of central tendency in order to more-accurately find the center.
right skewed distribution
a distribution with a tail on the right. The mean is higher than the median
ENUM (enumerated)
a drop down of permitted values chosen for a field
Primary Key
a field that uniquely identifies a record in a table
scatterplot
a graph in which one variable is represented along the x-axis and the other along the y-axis
box plot
a graph that shows a summary of data using the median, quartiles, and outliers of the data
Choropleth Map
a map that uses differences in shading, coloring, or the placing of symbols within predefined areas to illustrate a numeric count or statistical value
central tendency
a measure that represents the typical response or the behavior of a variable as a whole. Done by finding the mean, median, or mode
repository
a place designated for storage
unstructured question
a question with no definite answer
database index
a quick lookup table for specific records
python libraries/package
a set of functions that have been brought together for a specific purpose
two sample t test
a statistical method used to compare the means of 2 groups of subjects. For example, If p <0.05, the null is rejected and the means are different.
Common Table Expression (CTE)
a temporary table that can be referenced in the main query that comes after it
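As an illustration, a CTE is written with the WITH keyword and then used like a table in the query that follows it. The sketch below runs through Python's built-in sqlite3 module; the `sales` table and the name `region_totals` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("East", 100), ("East", 250), ("West", 40), ("North", 500)],
)

top_regions = conn.execute(
    """
    WITH region_totals AS (              -- the CTE: exists only for this query
        SELECT region, SUM(amount) AS total
        FROM sales
        GROUP BY region
    )
    SELECT region, total
    FROM region_totals                   -- main query references the CTE
    WHERE total > 100
    ORDER BY total DESC
    """
).fetchall()
print(top_regions)
conn.close()
```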
open data
a type of external data that's usually free and available to the public
OLAP system (online analytical processing)
also known as a data warehouse; configured for READ operations rather than updating, inserting, or deleting records
What is a relational database management system?
an application that allows users to create, read, update, and delete data in a relational database. Can also grant users access to specific tables, make backups of the data, and monitor performance
implementation (big data lifecycle)
an integrated collection of assumptions, data, and inferences that are mathematically measured to produce or predict an outcome
terminal
an interface used for interacting directly with your computer
data silo
an isolated data set that is difficult to obtain, combine, or use with other company data
prescriptive analysis
analyzing data to recommend the best course of action or strategy moving forward
data mining
analyzing data to make insights not offered by the raw data alone
univariate analysis
analyzing one variable to identify patterns within it
complementary color scheme
any 2 colors directly opposite each other on the color wheel
downsampling
any method that decreases the amount of data
FOREIGN KEY constraint
assigns a foreign key to another table to establish a relationship between the two
PRIMARY KEY constraint
assigns a primary key for a specific field
expectation
baseline understanding about how some facet of an organization works, which can then either be confirmed, revised, or augmented by insights. Example: an apparel company might expect its sales to go up during holiday shopping periods
What type of data can be categorized into 2 groups? *Example*: yes/no, true/false
binary data
Machine learning
branch of analytics that uses algorithms to search for patterns in the data
Transposing
changing the rows to columns and vice versa
temporal chart
charts that include some kind of time component
statistical charts
charts used to display some statistical aspect of the data. Correlation is usually visualized using a scatterplot. Frequency distributions are usually visualized on a histogram
CHECK constraint
checks new values and if they don't meet certain conditions the user receives an error message
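The constraint cards (NOT NULL, UNIQUE, PRIMARY KEY, CHECK) can be tried together in one small table through Python's built-in sqlite3 module. The `employees` table, its columns, and the age rule are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE employees (
        id    INTEGER PRIMARY KEY,         -- uniquely identifies each record
        email TEXT NOT NULL UNIQUE,        -- no missing values, no duplicates
        age   INTEGER CHECK (age >= 18)    -- new values must meet this condition
    )
    """
)
conn.execute("INSERT INTO employees (email, age) VALUES ('a@example.com', 30)")

# Each of these rows breaks one constraint and should be refused
bad_rows = [
    ("a@example.com", 40),  # duplicate email -> UNIQUE
    (None, 25),             # missing email   -> NOT NULL
    ("b@example.com", 12),  # under 18        -> CHECK
]

rejected = []
for email, age in bad_rows:
    try:
        conn.execute("INSERT INTO employees (email, age) VALUES (?, ?)", (email, age))
    except sqlite3.IntegrityError as error:
        rejected.append(str(error))  # the database's error message

row_count = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]
print(rejected)
print(row_count)
conn.close()
```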
Veracity (Big Data)
the accuracy and credibility of data
bivariate analysis
the analysis of two variables simultaneously, for the purpose of determining the relationship between them
simple moving average
the average value over a set time period
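A simple moving average can be sketched in a few lines. The function name, the window of 3 periods, and the data are assumptions for illustration.

```python
def simple_moving_average(values, window):
    """Average of each run of `window` consecutive values."""
    return [
        sum(values[i : i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical daily call volumes
calls = [3, 6, 9, 12, 15]
averages = simple_moving_average(calls, window=3)
print(averages)
```

A longer window smooths the series more but reacts more slowly to recent changes.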
Data accuracy
the data points (example: customer addresses) should be correct
Data source
the entity that owns/collects the data. Internal data: a particular department or team. External data: the third party organization providing the data
alternative hypothesis
the statement I'm trying to prove
query cost
the time it takes to execute a query
Textual analysis
type of visualization that focuses on qualitative data; commonly shown as a word cloud
K-means
type of clustering algorithm in which "k" indicates the number of clusters and "means" represents the clusters' average
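To show what "k" and "means" refer to, here is a deliberately small sketch of k-means on one-dimensional data in pure Python. It is a teaching sketch, not a production implementation: the function name is hypothetical, the starting centroids are simply the first k points (an assumption; real libraries choose them more carefully), and the loop runs a fixed number of iterations.

```python
def k_means_1d(points, k, iterations=10):
    """Group numbers into k clusters around their averages."""
    centroids = list(points[:k])  # naive starting centroids (assumption)
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the cluster with the nearest centroid
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: abs(point - centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical customer spend values with two obvious groups
spend = [1, 2, 3, 10, 11, 12]
centroids, clusters = k_means_1d(spend, k=2)
print(centroids)
print(clusters)
```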
sentiment analysis
type of textual analysis that analyzes the feelings behind the textual data by categorizing text into negative, neutral, and positive groupings
frequency analysis
type of textual analysis that involves counting specific words and phrases to spot broad trends
domain knowledge
understanding a specific subject area or field