Data Analytics

load (ETL process)

transformed data is inserted/loaded into the new database

recoding

transforming a record containing one value to a record containing another intended value

relational data

two or more data sets that have a common feature or variable

z test

used to compare the averages between a population and its sample

outliers

values that lie outside the normal range of values in a data set

overlapping variables

variables that are present in both sources

stepped color

when colors change along a visualization based on whole numbers

Skewed distributions

when values tend toward one side of the range of the x-axis or the other

Steps of a hypothesis test

1. Create null and alternative hypothesis 2. Identify the test statistic 3. Calculate the p value 4. Compare that p value to the significance level
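
The four steps can be sketched with a one-sample z test using only Python's standard library. The sample values, the population mean, and the population standard deviation are hypothetical numbers chosen for illustration, and the test assumes the population standard deviation is known.

```python
from math import sqrt
from statistics import NormalDist, mean

# Step 1: hypotheses (hypothetical scenario).
#   Null: the sample comes from a population whose mean is 100.
#   Alternative: the population mean is different from 100 (two-tailed).
population_mean = 100.0
population_sd = 15.0  # assumed to be known for a z test
sample = [112, 105, 98, 120, 109, 101, 117, 108, 111]
alpha = 0.05

# Step 2: test statistic (how many standard errors the sample mean is from 100).
standard_error = population_sd / sqrt(len(sample))
z = (mean(sample) - population_mean) / standard_error

# Step 3: two-tailed p-value from the standard normal distribution.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# Step 4: compare the p-value to the significance level.
reject_null = p_value <= alpha
print(f"z = {z:.2f}, p = {p_value:.4f}, reject null: {reject_null}")
```

With these made-up numbers the p-value is larger than alpha, so the null hypothesis is not rejected.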

Data Analysis Pipeline

1. collecting 2. cleaning 3. analysis 4. visualization 5. communication

Predictive analysis

1. use historical sales trends to predict how sales will perform in the future 2. look at inbound call volumes to predict how many call center employees to staff

Trend Line/ Line of Best Fit

A straight line that comes closest to the points on a scatter plot.

left-skewed distribution

a distribution that has a concentration of data on the upper end (tail on the left). The mean is lower than the median

Volume (Big Data)

amount of data

qualitative data

data that describes features or attributes. Often relies on senses

if p value is greater than alpha

fail to reject the null hypothesis

Variety (Big Data)

different forms of structured and unstructured data

data uniqueness

ensures that no duplicates are entered

GROUP BY

groups selected fields to give a summary or see how the data is organized

usage data

how a brand or product is used important for improving services and driving product development

frequency

how many times values occur within a data set

heat maps

illustrates density of data points to show trends

Entity Relationship Diagram (ERD)

illustrates the links between the tables in a relational database

If the data is relatively symmetrical, which measure of central tendency could I use to impute missing data

mean

forecasting

method for predicting how variables will change in the future

If the data has no mean or median, which measure of central tendency could I use to impute missing data?

mode

Using Quotes In SQL code

non-numeric values such as text and dates

What type of data exists in an implied order? *Example*: -small, medium, large -movie ratings

ordinal qualitative data

data transformation

process of adjusting data so it's consistent across data sets

data mapping

process of finding matching variables across data sets

descriptive analysis

process of using current and historical data to identify trends and relationships

algorithm

process or set of rules to be followed in calculations or other problem-solving operations, often by a computer

Aggregating data/data joining

putting data together and calculating the sum or average

NumPy

Python library that supports numerical operations on arrays and matrices

What type of data is non numeric and relies on the senses: what I see, taste, hear, or feel?

qualitative

________ data is measured with a specific numerical quantity

quantitative

elevating questions

questions that observe the project in terms of its purpose and impact

adjoining questions

questions that seek to determine how a problem/issue fits into a bigger picture

bin

range of numbers in a histogram. Each bin must cover the same width (the same range of values)

READ operation

reading data from a database

Measures (Tableau)

refer to numeric data, or data on which you'd perform some kind of mathematical operation

P value is less than or equal to alpha

rejected the null hypothesis

Determining If A Data Set is Relevant

review where it came from, how it was collected (administrative or usage?), and what it contains

hypothesis testing

seeing if the data at hand supports a proposed hypothesis

Clustering Algorithm

segmenting customers with similar attributes into different clusters to be able to target specific promotions to them

INNER JOIN

selects all records from both tables that have matching values

RIGHT JOIN

selects all records from the right table and its matching records from the left

FULL JOIN/OUTER JOIN

selects every record from table A and table B, regardless of whether there's a match

color scheme

specific ways of combining colors using a color wheel

constraints

specifies what type of data a table or column can accept

T-score

standard score with a mean of 50 and a standard deviation of 10 -the larger the t score, the larger the difference between the two groups compared

Graph Databases (NoSQL)

stores data as nodes and links

dimension tables

stores details from the fact table

Central Limit Theorem

the distribution of the sample average approaches an approximately normal distribution as the sample size gets larger (rule of thumb: over 30 records), no matter what the shape of the population distribution. The more records there are in each sample, the closer the distribution of the sample average comes to a normal distribution
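
A small simulation can make the theorem concrete. The sketch below repeatedly draws samples from a strongly skewed (exponential) population and looks at the averages of those samples. The sample size, the number of samples, and the seed are arbitrary choices made for illustration.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(0)  # fixed seed so the sketch is repeatable

SAMPLE_SIZE = 40     # above the "over 30 records" rule of thumb
NUM_SAMPLES = 2000   # hypothetical number of repeated samples

# An exponential population with rate 1 is right-skewed;
# its mean and standard deviation are both 1.
sample_means = [
    mean(random.expovariate(1.0) for _ in range(SAMPLE_SIZE))
    for _ in range(NUM_SAMPLES)
]

center_of_means = mean(sample_means)
spread_of_means = stdev(sample_means)
expected_spread = 1 / sqrt(SAMPLE_SIZE)  # population sd / sqrt(n)

print(f"mean of sample means: {center_of_means:.3f}")
print(f"spread of sample means: {spread_of_means:.3f} (theory: {expected_spread:.3f})")
```

Even though the individual values are skewed, the sample averages cluster around the population mean with a spread close to the population standard deviation divided by the square root of the sample size.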

types of external data

-benchmarking -open

How can missing data be coded in a data set?

-record is blank, null, or "NA" -coded as numbers (0, 99999) -a clearly incorrect date (Jan 1 1900) -its own text category (Unknown, Missing, Other)

Relational Database Management Systems

1. PostgreSQL 2. Oracle 3. Microsoft SQL server 4. MySQL

How to calculate interquartile range

1. Sort the data into ascending order, then split it into 4 equal groups (quartiles) 2. Subtract Q1 from Q3 (Q3 - Q1)

How Tableau classifies data

1. dimensions 2. measures

relational database

A group of database tables that are connected by a defined relationship that ties the information together.

When comparing a p-value to a significance level, what result will allow you to reject the null hypothesis?

A p-value less than or equal to the significance level

Foreign Key

A field in one table that references the primary key of another table, used to establish a relationship between the two tables

subquery/inner statement

A query that's nested within another query

Composite Key

A set of two or more columns that together uniquely identify a record

snowflake schema

An expanded version of a star schema in which dimension tables are broken down into related tables

collection bias

Bias in how the data is collected

Surrogate Key

Computer-generated primary key

CRUD operations in SQL

Create Read/Select Update Delete

Data Quality

Data Integrity (Accuracy, Consistency) + Completeness + Uniqueness + Timeliness

Data completeness

Data includes all required elements

Measures of Spread

Data sets with a low range have values that don't deviate that much from each other. Data sets with a high range have values that vary widely.

To filter aggregated data what keyword do I use?

HAVING
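
A minimal sketch of HAVING next to WHERE and GROUP BY, run against an in-memory SQLite database through Python's built-in `sqlite3` module. The `sales` table and its values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("East", 100), ("East", 250), ("West", 40), ("West", 30), ("North", 500)],
)

rows = conn.execute(
    """
    SELECT region, SUM(amount) AS total
    FROM sales
    WHERE amount > 35          -- filters individual rows before grouping
    GROUP BY region            -- one summary row per region
    HAVING SUM(amount) > 200   -- filters the aggregated groups
    ORDER BY region
    """
).fetchall()

print(rows)
```

WHERE removes the 30 sale before any totals are computed; HAVING then drops the West group because its remaining total is not above 200.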

Transparency

Individuals should have a clear overview of the data that organizations collect and share

Why is a frequency table important?

It helps organize information with an unusually high or low number of values

% wildcard

Matches any sequence of zero or more characters

JOIN (SQL)

SQL command used to merge fields from two separate tables into a new table

UNION

SQL keyword that stacks the results of two queries into one result set. UNION removes duplicate rows; UNION ALL keeps all records from both

VIEW

SQL keyword used to create a virtual table

What section of a business requirement document involves everyone that's interested in, involved in, or impacted by the project?

Stakeholder identification

monochromatic color scheme

Use of different tints, shades, & intensities of ONE color

one sample t test

Used to determine if a single sample mean is different from a known population mean

logistic regression model

a classification model used to predict a qualitative outcome that has two possible values

relational database

a collection of data stored in one or more tables made up of rows and columns

What does a company's style guide usually include?

a color scheme

bimodal distribution

a distribution with two modes

null hypothesis

a statement or idea that can be falsified, or proved wrong

test statistic

a statistic whose value helps determine whether a null hypothesis should be rejected. The value tells how much of a difference there is between the average as it relates to the data's spread

time series

algorithm that aims to predict quantitative values based on previous values

decision tree

algorithm that consists of a hierarchical arrangement of criteria that predict a classification or a value, creates groups that are as "pure" as possible using "if/then" statements to create rules

classification model

algorithm that draws a conclusion from qualitative data

regression model

algorithm used to measure the relationship between numerical values

standard deviation

average distance of the data from the mean

extract (ETL process)

collecting data from multiple data sources

data wrangling

converting raw data into a usable format

decentralized/distributed databases

data located across various machines

structured data (relational database)

data organized in tables made up of rows and columns

data timeliness

data should be recorded in an appropriate period of time after the event

DISTINCT

keyword used to return unique values in a field

multivariate analysis

measures the relationship between three or more variables

If the data is skewed which measure of central tendency could I use to impute missing data

median

pandas

open source library for importing and exporting data, merging data sets, dropping and renaming columns, conducting descriptive analyses, and other data-manipulation tasks

symmetrical distribution

right and left halves of the distribution mirror each other

Velocity (Big Data)

the pace that data is generated

UNIQUE Constraint

Restriction placed on a column to ensure that no duplicate values exist for that column

Which excel function takes out a set number of characters starting from the end of the text in a cell

RIGHT

Big Data Lifecycle

collection > storage > analysis > implementation

analogous color scheme

colors next to each other on the color wheel

What type of data can be measured along a scale and can take any point along the scale as its value? *Example*: temperature measured in Fahrenheit, Celsius, or Kelvin

continuous quantitative data

subsetting

create a smaller data set from a whole data set based on a certain filter

benchmarking data

data that benchmarks statistics about a specific population or industry across multiple organizations

time variant variables

data that can change over time (example: age, education, work history)

Time invariant variables

data that can't change over time (example: birthdate, eye color, height)

mixed data types

data that has numeric and string values

data accuracy

data that is free of errors

unclear data

data that's in a vague format

data corruption

data that's unreadable or unusable; it can sometimes be fixed by a software manager or database administrator

Boolean data type

data type used to test whether or not a condition/statement is true or false.

Data collection

details about data collection provide context towards data reliability and timeliness

anomaly

deviation from what is normal Example: flagging potentially fraudulent bank transactions by picking out transactions that are unusual compared to the account holder's typical spending behavior.

Python function that returns the descriptive statistics for a dataframe

df.describe()

Python function that returns the data types in a dataframe

df.info()

Python attribute that returns the number of rows and columns in a dataframe

df.shape

Numeric ID columns should always be a ________ (dimension/measure) in Tableau

dimension

What type of data can only be represented through whole numbers? *Example*: the number of students in a class

discrete quantitative data

word cloud

display of common words and phrases found within a data set according to their frequency (the larger a word or phrase, the more times it exists within the data)

bell curve (normal distribution)

distribution of scores in which the bulk of the scores fall toward the middle, with progressively fewer scores toward the "tails" or extremes. Has the same value for mean, median, and mode

Data Validity

does the data fit the business and technical requirements of a project?

data profiling

examining the data that's available and providing summary information about it

transform (ETL process)

extracted data is converted into another format -calculating ages from date of birth -combining multiple data points like area codes and telephone numbers

Predictive analysis

extracts information from data and uses it to predict future trends and identify behavioral patterns

Columns in a table are called

fields

data sourcing

looking for existing data

data grain

lowest level of detail in the data

Geospatial charts

maps that display geographical data

point maps

maps that show the exact location of events

Interquartile Range (IQR)

measure the spread of data by finding the difference between the 25th and 75th percentiles of a data set (Q3 - Q1)

Variance

measures how far a data set is spread out

Normally distributed data has

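more than 30 records (a rule of thumb from the Central Limit Theorem; it applies to the distribution of sample averages, and a large record count alone does not make the data itself normal)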

What type of data has no ordered categories? *Example:* categorizing cars by manufacturer (Nissan, Toyota, Honda) *none of these categories are more or less than

nominal qualitative data

4 Types of Data

nominal, ordinal, discrete, continuous

Effect of outliers

outliers can impact the range of a data set in a potentially misleading manner

Sentiment charts (ex. histograms) are the only type of chart that focuses exclusively on what type of data?

quantitative data

funneling questions

questions that start broad and go deeper and deeper into an issue

LEFT JOIN

returns all records from the left table and its matching record from the right table
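
The difference between an INNER JOIN and a LEFT JOIN is easiest to see on a small example. The sketch below uses Python's built-in `sqlite3` module with two hypothetical tables, `customers` and `orders`, where one customer has no orders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, item TEXT);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cara');
    INSERT INTO orders VALUES (10, 1, 'lamp'), (11, 1, 'desk'), (12, 2, 'chair');
    """
)

# INNER JOIN: only customers that have at least one matching order.
inner_rows = conn.execute(
    "SELECT c.name, o.item FROM customers AS c "
    "INNER JOIN orders AS o ON o.customer_id = c.customer_id "
    "ORDER BY o.order_id"
).fetchall()

# LEFT JOIN: every customer; order columns are NULL (None) when there is no match.
left_rows = conn.execute(
    "SELECT c.name, o.item FROM customers AS c "
    "LEFT JOIN orders AS o ON o.customer_id = c.customer_id "
    "ORDER BY c.customer_id, o.order_id"
).fetchall()

print(inner_rows)
print(left_rows)
```

Cara appears only in the LEFT JOIN result, paired with `None`, because she has no matching order.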

cross joins

returns every combination of rows from two tables (the Cartesian product); the tables do not need a column in common

fact table

stores information about an event (ex. movie rental)

data integration

the integration of data from multiple sources, which provides a unified view of all data

stationary

the mean and variance stay the same over time

median

the middle value in a group whose values are ordered from least to greatest

mode

the most frequent value in a data set -A data set can have multiple modes if multiple values appear the same number of times in a data set. -A data set may have no mode if every value occurs only once -A data set with no mode may indicate that the data is widely spread out, with no real central tendency

two-tailed test

a hypothesis test that checks for a difference in either direction: the sample mean could be higher or lower than the population mean

aggregate data

the presentation of data in a summarized form

normalization

the process of making units uniform between multiple data sets, such as changing a product weight from pounds to kilograms

OLTP (online transaction processing)

used to rapidly insert, delete, and update large quantities of transactions

outlier

value that is more than 2 or 3 standard deviations away from the mean

seasonality

values fluctuate during certain time periods

How does the Empirical Rule classify outliers?

values more than two standard deviations away from the mean

frequency table

A table used to show the number of times something occurs.

bubble chart

A type of scatter plot with circular symbols used to compare three variables; the area of the circle indicates the value of a third variable

Candidate Key

Any field that can serve as a primary key

Latitude and Longitude

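Latitude lines run horizontally (east-west) on maps and measure distance north or south of the equator. Longitude lines run vertically (north-south) on maps and measure distance east or west of the prime meridian.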

Which two aggregate functions in SQL can ONLY be used on numeric fields?

SUM and AVG

Business Requirements Document (BRD)

a list of requirements, goals, and objectives to all the stakeholders in a project (a.k.a. Project bible)

Data mapping

adding columns from one data set to another data set

geocoding

adding geographic codes to customer records to make it possible to plot customer addresses on a map

What kind of SQL functions can't be used in the WHERE clause?

aggregate (COUNT,AVG,SUM,MIN,MAX)

Paired t-test

compares the means from the same group at two different times.

application programming interface (API)

connects users to databases usually through an app

administrative data

data about the running of a business

internal data

data collected by the organization itself (or that it pays another company to provide)

aggregated data

data from more than one source and grouped for comparison

Structured Data vs. Unstructured Data

Structured: data in tabular form. Unstructured: free-response data

structured data

data in the form of rows and columns

centralized/single node database

data is located on a single machine that's on site

data consistency

data is uniform and correctly formatted

types of variables

date, text, numeric

deviation

departure or wandering away from the accepted standard

NOT NULL constraint

enforces a column can't have any empty or missing values

population (statistics)

entire set of items in a data set

composition/comparison charts

focuses on how different parts of a data set compare to the whole

histogram

frequency chart that displays range of values

What is the most common type of temporal chart?

line chart

Concatenation

linking items in a chain or series

What type of python function locates a certain column in the dataframe?

loc (an indexer used with square brackets, for example df.loc[rows, columns])

diagnostic analysis

looks for patterns in historical data to explain *why* something happened. Often supplemented with a machine

_ wildcard

matches a single character
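
A minimal sketch of the two SQL wildcards used with LIKE, run through Python's built-in `sqlite3` module. The `products` table and its names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?)",
    [("cat",), ("cart",), ("cot",), ("coat",), ("dog",)],
)

def names_like(pattern):
    """Return the product names that match a LIKE pattern, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM products WHERE name LIKE ? ORDER BY name", (pattern,)
    ).fetchall()
    return [name for (name,) in rows]

# % matches any run of characters (including none) between 'c' and 't'.
percent_matches = names_like("c%t")

# _ matches exactly one character between 'c' and 't'.
underscore_matches = names_like("c_t")

print(percent_matches)
print(underscore_matches)
```

The `%` pattern picks up names of any length that start with "c" and end with "t", while the `_` pattern only matches the three-letter names.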

scripts

multiple commands that are executed by a certain program

numerically coded qualitative data

qualitative data that has numbers and is usually an identifier (ex. license plate numbers, zip codes, employee ID, movie ratings)

linear regression model

regression model used to predict the value of a variable based on the value of another variable

CRISP-DM (Cross-Industry Standard Process for Data Mining)

standard process for data mining

descriptive statistics

summarize and describe data

Empirical Rule

tells you where most of the values lie in a normal distribution: 68% of all data is within one standard deviation of the mean, 95% is within two standard deviations of the mean, and 99.7% is within three standard deviations of the mean. Being more than one standard deviation away from the average is usually considered a large deviation
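
The 68 / 95 / 99.7 percentages can be checked against the standard normal distribution with `statistics.NormalDist` from Python's standard library. The helper name `share_within` is made up for this sketch.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.1%}")
```

The rule's figures are rounded versions of these values.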

Dimensions (Tableau)

text or categorical data and are shown in the top-left corner of the dashboard

Z Scores are ...

the # of standard deviations above or below the mean

data integrity

the accuracy and consistency of data

probability distribution

the likelihood of a certain outcome

one tailed test

a hypothesis test that checks for a difference in only one direction: whether the sample mean is higher than the population mean, or whether it is lower. The direction depends on what you're looking for

data record/observation

a single row in a data set (for example, one row in an Excel worksheet)

graduated symbol map

uses symbols of different sizes to indicate different amounts of something

inferential statistics

using a sample of data to infer information about an entire population

autocorrelation

the values of data points are correlated with earlier values of the same series at a certain time interval (lag)

Value (big data)

what purpose can this data serve?

measurement bias

when there's a problem with the machines or humans doing the measuring or observing

examples of benchmarking data

-Body mass index chart -application program interface (Twitter, Google Maps)

Examples of usage data

-Netflix: a sequence of clicks a customer follows before starting to watch a movie, how often a customer clicks on recommended movies after a movie has ended -interviews and surveys

4 Components of A Project Management Plan

1. Communicating to Stakeholders (How and when will I communicate to the people involved?) 2. Schedule and Milestones 3. Project deliverables - how will the project be presented 4. Defining My Audience - who are the people involved? high level executives, civilians = concise information

Resolving Data Integrity/Accuracy Issues

1. Is the original person or machine that collected the data available? If a person, can they correct the data value? If it was a machine, is there any record of a service, calibration, or coding change that could explain how the temperature records changed from Celsius to Fahrenheit? 2. Does other similar data (such as temperature from another weather center) exist that can be used for comparison? 3. Was the same variable measured over time so that historical data exists? 4. Can you identify the error by using the other values of the variable?

6 Components of A Business Requirement Document

1. Project Overview - quick summary of the project -includes motivation (problem that needs to be solved), objectives (what will be achieved), and scope 2. Stakeholder Identification: everyone that's interested in, involved in, or impacted by the project 3. Success Factors: the success criteria. Should be explicitly correlated with objectives 4. Assumptions and Constraints - defines factors that might limit the plan constraint = limitations the project must operate under (scheduling, budgeting, software issues) 5. Requirements details how the objectives will be achieved 6. Glossary of Terms

How to calculate variance

1. Subtract the mean from each score 2. Square those values 3. Add them all up 4. Divide by the number of values (for a sample rather than a whole population, divide by the number of values minus 1)
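
The steps can be followed one at a time in plain Python. The scores are hypothetical, and the sketch divides by the number of values, which gives the population variance; `statistics.pvariance` is used only as a cross-check.

```python
from statistics import pvariance

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data
mean_score = sum(scores) / len(scores)

# Step 1: distance of each score from the mean.
deviations = [score - mean_score for score in scores]

# Step 2: square those distances.
squared_deviations = [d ** 2 for d in deviations]

# Step 3: add them all up.
sum_of_squares = sum(squared_deviations)

# Step 4: divide by the number of values (population variance).
variance = sum_of_squares / len(scores)

# The standard deviation is the square root of the variance.
standard_deviation = variance ** 0.5

print(variance, standard_deviation, pvariance(scores))
```

Squaring in step 2 keeps positive and negative distances from cancelling each other out.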

I Should Use Pie Charts in Tableau When...

1. There are no more than 3 categories 2. The categories aren't similar in size

Missing value strategies

1. average the non missing values 2. take the most common non missing values 3. take a random value
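
A minimal sketch of the three strategies using only the standard library. The `ages` list is hypothetical, `None` stands in for a missing value, and the helper `impute` is a name made up for this example.

```python
import random
from statistics import mean, median, mode

ages = [23, 25, None, 25, 31, None, 40]  # None marks a missing value
observed = [age for age in ages if age is not None]

def impute(values, fill):
    """Replace every missing (None) entry with the given fill value."""
    return [fill if value is None else value for value in values]

# Strategy 1: average of the non-missing values.
mean_filled = impute(ages, mean(observed))

# Strategy 2: most common non-missing value.
mode_filled = impute(ages, mode(observed))

# Strategy 3: a random non-missing value for each gap (seeded for repeatability).
random.seed(1)
random_filled = [random.choice(observed) if v is None else v for v in ages]

# For skewed data, the median is a common alternative to the mean.
median_filled = impute(ages, median(observed))

print(mean_filled)
print(mode_filled)
print(random_filled)
```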

Data Story Components

1. beginning: project motivation and objectives 2. middle: how I proceeded with my analysis 3. end: conclusions and recommendations

Handling Duplicate Records in SQL

1. create a unique table using the VIEW statement 2. delete the duplicate record from the table or view

Methods of Forecasting

1. linear extrapolation: draw a line through a series of existing data points to get an understanding of how the data will behave in the future 2. averaging 3. exponential smoothing: a weighted average of past values in which more recent values receive more weight 4. seasonality: school graduations
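
Two of these methods can be sketched in a few lines. The version of exponential smoothing below is the common simple form, in which each smoothed value blends the newest observation with the previous smoothed value. The sales figures, the smoothing factor `alpha`, and the three-period averaging window are hypothetical choices.

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing: higher alpha gives recent values more weight."""
    smoothed = [series[0]]  # start from the first observation
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

monthly_sales = [10, 12, 14, 13]  # hypothetical data

# Exponential smoothing: the last smoothed value serves as the next-period forecast.
smoothed_sales = exponential_smoothing(monthly_sales, alpha=0.5)
smoothing_forecast = smoothed_sales[-1]

# Averaging: forecast the next period as the mean of the last three periods.
average_forecast = sum(monthly_sales[-3:]) / 3

print(smoothed_sales, smoothing_forecast, average_forecast)
```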

Examples of descriptive analysis

1. looking at how sales or marketing performed over a time period 2. Examining the amount of inbound calls to a call center over a certain week or month

Strategies to Measure The Data's Central Tendency

1. mean 2. median 3. mode
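
The three measures are available in Python's `statistics` module. The salary figures below are hypothetical and include one extreme value to show how differently the mean and the median react to an outlier.

```python
from statistics import mean, median, multimode

salaries = [40, 42, 45, 45, 48, 50, 400]  # in thousands; 400 is an outlier

average_salary = mean(salaries)      # pulled upward by the outlier
middle_salary = median(salaries)     # middle value of the sorted data
common_salaries = multimode(salaries)  # list of the most frequent value(s)

print(f"mean: {average_salary:.1f}, median: {middle_salary}, mode(s): {common_salaries}")
```

`multimode` returns a list because a data set can have more than one mode.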

ORDER BY

A SQL clause that sorts data in ascending or descending order

horizontal bar chart

A bar chart that displays the bars in a horizontal direction. Should be used when variable names are long

significance level (alpha)

A benchmark value that's compared to the p-value to determine if the null hypothesis will be rejected. It doesn't have to be calculated; it is chosen in advance. alpha = .05 corresponds to a 95 percent confidence level; alpha = .01 corresponds to a 99 percent confidence level

delimiter

A character that marks the beginning or end of a unit of data (commas, semicolon, slashes)

Outlier

A common definition of an outlier is a value that's either lower than Q1 - (IQR * 1.5) or larger than Q3 + (IQR * 1.5)
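
The fences in this definition can be computed with `statistics.quantiles`. The data values are hypothetical, and the quartiles use the function's default "exclusive" method; other tools may compute quartiles slightly differently and produce slightly different fences.

```python
from statistics import quantiles

values = [10, 12, 13, 14, 15, 16, 18, 45]  # hypothetical data

# quantiles(..., n=4) returns the three cut points Q1, Q2 (the median), and Q3.
q1, q2, q3 = quantiles(values, n=4)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [v for v in values if v < lower_fence or v > upper_fence]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"fences: {lower_fence} to {upper_fence}, outliers: {outliers}")
```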

star schema

A database design that contains one fact table and multiple dimension tables linked to it.

chi square test

A statistical method of testing used to compare observations against expectations

If My Expectation Has Been Challenged

Ask myself: why is the data not behaving as expected? What's going on here? Does the data actually disconfirm (i.e., not confirm) our expectations about how our organization works? Either the organization's understanding is incorrect, or there is an error in how you're processing or interpreting the data: it has to be one or the other.

NoSQL databases

Databases that can manipulate structured as well as unstructured data and inconsistent or missing data; are useful when working with Big Data. Example: Facebook Messenger, Amazon

Removing Records (General Rule of Thumb)

I can usually remove up to 5 percent of the data in a data set without causing any major issues. Only use central tendency and random values methods when the data you're missing makes up 5 percent or less of the total values within your data set.

Which excel function returns the total number of characters within a cell

LEN

P value In null hypothesis significance testing

The probability of seeing a result at least as extreme as the one observed if the null hypothesis is true. A low p-value means the result is unlikely to have happened by random chance alone

unstructured data

Not defined and does not follow a specified format

ETL Process (Extract, Transform, and Load)

Process of migrating data into a database 1. Extract necessary data from databases 2. Transform data so it's in a common format and free of errors 3. Load extracted and transformed data into the warehouse to be used
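
A toy version of the three stages, using an in-memory SQLite database as a stand-in for the warehouse. The source records, the field names, and the fixed reference date are all hypothetical; the two transformations (age from date of birth, combining area code and phone number) are the ones this set gives as examples of the transform step.

```python
import sqlite3
from datetime import date

# Extract: records as they might arrive from a source system.
source_rows = [
    {"name": "Ana", "birth_date": "1990-06-15", "area_code": "415", "phone": "5550101"},
    {"name": "Ben", "birth_date": "2000-01-01", "area_code": "212", "phone": "5550199"},
]

AS_OF = date(2024, 1, 1)  # fixed reference date so the ages are repeatable

def age_on(birth, as_of):
    """Whole years between birth and as_of; one less if the birthday hasn't come yet."""
    birthday_pending = (as_of.month, as_of.day) < (birth.month, birth.day)
    return as_of.year - birth.year - birthday_pending

# Transform: derive age and combine the two phone fields into one.
transformed = [
    (
        row["name"],
        age_on(date.fromisoformat(row["birth_date"]), AS_OF),
        f"({row['area_code']}) {row['phone']}",
    )
    for row in source_rows
]

# Load: insert the transformed rows into the target table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE customers (name TEXT, age INTEGER, phone TEXT)")
warehouse.executemany("INSERT INTO customers VALUES (?, ?, ?)", transformed)

loaded = warehouse.execute(
    "SELECT name, age, phone FROM customers ORDER BY name"
).fetchall()
print(loaded)
```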

Interpreting Correlations

The correlation coefficient takes a value between -1 and +1 and provides information about the strength and direction of the linear relationship. The closer the coefficient is to +1, the stronger the positive relationship: if one variable increases, the other also increases. The closer the coefficient is to -1, the stronger the relationship in the opposite direction: if one variable increases, the other decreases. Generally speaking, using the absolute value of the correlation coefficient: 0: no relationship; 0.1-0.3: weak relationship; 0.3-0.5: moderate relationship; 0.5-1.0: strong relationship
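
A small sketch of computing and labelling a Pearson correlation coefficient by hand. The data series are hypothetical, and because the listed ranges share their boundary values, the helper `strength` assigns a boundary value such as 0.3 to the stronger category; that tie-breaking choice is an assumption of this example.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_movement / (spread_x * spread_y)

def strength(r):
    """Label the strength of a relationship from the absolute value of r."""
    size = abs(r)
    if size >= 0.5:
        return "strong"
    if size >= 0.3:
        return "moderate"
    if size >= 0.1:
        return "weak"
    return "none"

ad_spend = [1, 2, 3, 4, 5]
sales = [2, 4, 6, 8, 10]    # rises with ad_spend
returns = [10, 8, 6, 4, 2]  # falls as ad_spend rises

r_sales = pearson_r(ad_spend, sales)
r_returns = pearson_r(ad_spend, returns)
print(round(r_sales, 3), strength(r_sales))
print(round(r_returns, 3), strength(r_returns))
```

Both relationships are labelled strong; only the sign of the coefficient differs.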

absolute error

The magnitude of difference between the actual number and the nearest representable value.

color harmony

The practice of using color to create visualizations that are engaging, balanced, and pleasing to the eye

Student's t-distribution

The sampling distribution of the t statistic

spatial analysis

an analysis that incorporates a geographic component

git

Version control system that helps keep track of changes made to files and code

Five V's of Big Data

Volume, Velocity, Variety, Veracity, Value

CASE statement

uses WHEN, THEN, and END: WHEN specifies a condition, THEN indicates the value to return if the condition is met, and END signifies the end of the CASE statement
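
A minimal CASE example run through Python's built-in `sqlite3` module. The `orders` table, the thresholds, and the `size_band` column name are hypothetical. The query also uses ELSE, which supplies a default value when no WHEN condition matches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 20), (2, 150), (3, 600)])

rows = conn.execute(
    """
    SELECT order_id,
           CASE
               WHEN amount >= 500 THEN 'large'    -- first matching WHEN wins
               WHEN amount >= 100 THEN 'medium'
               ELSE 'small'                       -- default when nothing matches
           END AS size_band
    FROM orders
    ORDER BY order_id
    """
).fetchall()

print(rows)
```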

Data Analyst Question Process

What do we want to measure --> How will we measure it? ----> What complications may arise?

note

When the data contains an outlier, the analyst may want to consider using a different measure of central tendency in order to more-accurately find the center.

right skewed distribution

a distribution with a tail on the right. The mean is higher than the median

ENUM (enumerated)

a drop down of permitted values chosen for a field

Primary Key

a field that uniquely identifies a record in a table

scatterplot

a graph in which one variable is represented along the x-axis and the other along the y-axis

box plot

a graph that shows a summary of data using the median, quartiles, and outliers of the data

Choropleth Map

a map that uses differences in shading, coloring, or the placing of symbols within predefined areas to illustrate a numeric count or statistical value

central tendency

a measure that represents the typical response or the behavior of a variable as a whole Done by finding the mean, median, or mode

repository

a place designated for storage

unstructured question

a question with no definite answer

database index

a quick lookup table for specific records

python libraries/package

a set of functions that have been brought together for a specific purpose

two sample t test

a statistical method used to compare the means of 2 groups of subjects. For example, If p <0.05, the null is rejected and the means are different.

Common Table Expression (CTE)

a temporary table that can be referenced in the main query that comes after it
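
A short sketch of a CTE written with the WITH keyword, run through Python's built-in `sqlite3` module. The `sales` table and the CTE name `rep_totals` are hypothetical. The main query references the CTE twice, once directly and once inside a subquery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (rep TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("Ana", 100), ("Ana", 300), ("Ben", 50), ("Cara", 400)],
)

rows = conn.execute(
    """
    WITH rep_totals AS (                 -- the CTE: total sales per rep
        SELECT rep, SUM(amount) AS total
        FROM sales
        GROUP BY rep
    )
    SELECT rep, total
    FROM rep_totals
    WHERE total > (SELECT AVG(total) FROM rep_totals)  -- subquery on the same CTE
    ORDER BY rep
    """
).fetchall()

print(rows)
```

The CTE exists only for the duration of this one statement; it is not stored in the database.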

open data

a type of external data that's usually free and available to the public

OLAP system (online analytical processing)

also known as a data warehouse; configured for READ operations rather than updating, inserting, or deleting records

What is a relational database management system?

an application that allows users to create, read, update, and delete data in a relational database. Can also grant users access to specific tables, make backups of the data, and monitor performance

implementation (big data lifecycle)

an integrated collection of assumptions, data, and inferences that are mathematically measured to produce or predict an outcome

terminal

an interface used for interacting directly with your computer

data silo

an isolated data set that is difficult to obtain, combine, or use with other company data

prescriptive analysis

analyzing data to recommend the best course of action or strategy moving forward

data mining

analyzing data to make insights not offered by the raw data alone

univariate analysis

analyzing one variable to identify patterns within it

complementary color scheme

any 2 colors directly opposite each other on the color wheel

downsampling

any method that decreases the amount of data

FOREIGN KEY constraint

designates a field as a foreign key that references the primary key of another table, establishing a relationship between the two

PRIMARY KEY constraint

assigns a primary key for a specific field

expectation

baseline understanding about how some facet of an organization works, which can then either be confirmed, revised, or augmented by insights. Example: an apparel company might expect its sales to go up during holiday shopping periods

What type of data can be categorized into 2 groups? *Example*: yes/no, true/false

binary data

Machine learning

branch of analytics that uses algorithms to search for patterns in the data

Transposing

changing the rows to columns and vice versa

temporal chart

charts that include some kind of time component

statistical charts

charts used to display some statistical aspect of the data Correlation is usually visualized using a scatterplot Frequency distributions are usually visualized on a histogram

CHECK constraint

checks new values and if they don't meet certain conditions the user receives an error message

Veracity (Big Data)

the accuracy and credibility of data

bivariate analysis

the analysis of two variables simultaneously, for the purpose of determining the relationship between them

simple moving average

the average value over a set time period

Data accuracy

the data points (example: customer addresses) should be correct

Data source

the entity who owns/ collects the data Internal data - a particular department or team External data - the third party organization providing the data

alternative hypothesis

the statement I'm trying to prove

query cost

the time it takes to execute a query

Textual analysis

type of visualization that focuses on qualitative data. commonly used through a word cloud

K-means

type of clustering algorithm in which "k" indicates the number of clusters and "means" represents the clusters' average
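
A bare-bones, one-dimensional sketch of the idea: assign each point to its nearest cluster average, recompute the averages, and repeat until they stop changing. The spending figures are hypothetical, and starting from the first `k` points is a simplification made here for repeatability; library implementations usually choose starting points more carefully.

```python
def k_means_1d(points, k, max_iterations=100):
    """Cluster numbers into k groups around their means (centroids)."""
    centroids = [float(p) for p in points[:k]]  # simplistic starting centroids
    clusters = [[] for _ in range(k)]
    for _ in range(max_iterations):
        # Assignment step: put each point in the cluster with the nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: abs(point - centroids[i]))
            clusters[nearest].append(point)
        # Update step: each centroid becomes the mean of its cluster.
        new_centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, clusters

monthly_spend = [1, 2, 3, 10, 11, 12]  # hypothetical customer spending
centroids, clusters = k_means_1d(monthly_spend, k=2)
print(centroids, clusters)
```

Here "k" is 2, and each returned centroid is the mean of the customers assigned to that cluster.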

sentiment analysis

type of textual analysis that analyzes the feelings behind the textual data by categorizing text into negative, neutral, and positive groupings

frequency analysis

type of textual analysis that involves counting specific words and phrases to spot broad trends

domain knowledge

understanding a specific subject area or field

