statistics/probability
Box Plot
We are given a dataset containing the number of miles that different brands of cars ran before they broke down. Which plot is best suited to identify the outliers in miles per brand?
A bar chart is used for categorical data, while a histogram is used for quantitative data
What is a key difference between a histogram and a bar chart?
n is the number of trials and p is the probability of success on each trial (the expected value divided by n)
What are n and p in a binomial distribution?
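As a rough illustration of how n and p enter the binomial distribution, the sketch below computes the probability of exactly k successes, taking n as the number of independent trials and p as the success probability of each trial. The function name `binomial_pmf` and the coin-flip numbers are illustrative assumptions, not part of the card set.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(exactly k successes in n independent trials, each succeeding with probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Hypothetical example: 4 flips of a fair coin.
n, p = 4, 0.5
probabilities = [binomial_pmf(k, n, p) for k in range(n + 1)]

# The probability-weighted average number of successes works out to n * p.
expected_value = sum(k * prob for k, prob in enumerate(probabilities))
```

Summing k times its probability over all k reproduces the n × p shortcut for the expected value.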
Continuous Numerical data
What type of data is best suited to work with box plots?
Box Plot
What type of data visualization is best suited to understand what the median is from a continuous dataset?
Histogram
What type of plot should we use to identify the probability distribution of a dataset?
Discrete Uniform Distribution
A distribution in which every possible outcome has the same probability
Confidence Interval
A range in which the true population mean is likely to be found with a certain probability
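One way such a range might be computed is sketched below, using the normal approximation (sample mean plus or minus z times the standard error). The function name and the sample values are made up for illustration; for small samples a t-distribution critical value would usually replace z.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * stdev(sample) / sqrt(len(sample))
    center = mean(sample)
    return center - margin, center + margin

# Made-up sample values, for illustration only.
low, high = mean_confidence_interval([2, 4, 4, 4, 5, 5, 7, 9])
```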
positive correlation
A correlation where as one variable increases, the other also increases, or as one decreases so does the other. Both variables move in the same direction.
Heatmap
A data visualization tool that shows levels of activity on a web page in different colors.
Scatter plot
A factory collected some data points regarding the number of hours it took to make an item and the return rate of the item. An analyst wants to check if there is a relationship between these two variables. What plot is best suited for visualizing the relationship?
normal distribution
A function that represents the distribution of variables as a symmetrical bell-shaped graph.
Histogram
A graph of vertical bars representing the frequency distribution of a set of data.
two-tailed test
A hypothesis test in which rejection of the null hypothesis occurs for values of the test statistic in either tail of its sampling distribution.
one-tailed test
A hypothesis test in which rejection of the null hypothesis occurs for values of the test statistic in one tail of its sampling distribution.
Correlation
A measure of the extent to which two factors vary together, and thus of how well either factor predicts the other.
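A common way to put a number on this is the Pearson correlation coefficient, which ranges from -1 (perfect negative) to +1 (perfect positive). The sketch below implements it directly; the data values and the name `pearson_r` are assumptions chosen for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Illustrative data: one series rises with hours, the other falls.
hours = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]
falling = [10, 8, 6, 4, 2]
r_positive = pearson_r(hours, rising)
r_negative = pearson_r(hours, falling)
```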
McCandless Method
A method for presenting data visualizations that moves from general to specific information
line plot
A method of visually displaying a distribution of data values where each data value is shown as a dot or mark above a number line. Also known as a dot plot.
Percentile
A point on a ranking scale of 0 to 100. The 50th percentile is the midpoint; half the people in the population being studied rank higher and half rank lower.
alternate hypothesis
A statement that is accepted if the sample data provide sufficient evidence that the null hypothesis is false.
machine learning model
A statistical representation of a real-world process based on data
Variance Analysis
A technique for determining the cause and degree of difference between the baseline and actual performance.
blocked test case
A test case that cannot be executed because the preconditions for its execution are not fulfilled.
Paired t-test
A test designed to determine the statistical difference between two groups' means where the participants in each group are either the same or matched pairs.
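A minimal sketch of the paired t statistic follows: it works on the per-participant differences and divides their mean by their standard error. The before/after numbers are invented, and the resulting statistic would still need to be compared against a t-distribution with n - 1 degrees of freedom, which is not done here.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(before, after):
    """t statistic for matched pairs: mean difference / standard error of the differences."""
    differences = [a - b for a, b in zip(after, before)]
    standard_error = stdev(differences) / sqrt(len(differences))
    return mean(differences) / standard_error

# Invented measurements for the same four participants, before and after.
t_stat = paired_t_statistic([10, 12, 9, 11], [12, 14, 10, 13])
```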
controlled experiment
An experiment in which only one variable is manipulated at a time.
In repeated experiments, we expect our test results to match results from the population 95% of the time
As a Data Scientist at ABC corp., you are asked to explain to the Marketing Manager what a 95% confidence interval is. How should you describe a 95% confidence interval to the manager?
The color of the campaign
As a Data Scientist in a Marketing team, you are asked to analyze the effect of changing the color of your campaign from red to green on the number of clicks the campaign receives. Which of the following represents the independent variable in this scenario?
heatmap
As a biologist, you are responsible for tracking whale migration in the Pacific Ocean. You have special satellite equipment that monitors both the numbers of whales that are traveling along a known route and the latitude and longitude of their location. What would be the most appropriate visualization tool to see the density of whales across specific latitudes and longitudes in the Pacific Ocean?
Poisson
As a forecasting analyst in credit risk, you are responsible for tracking loan defaults. The bank is expecting a recession soon and is hoping to minimize losses. The bank experiences 5 loan defaults per month. Your boss has asked you to calculate the probability that the bank will have 10, 15, or 20 losses next month. You calculate the probabilities using the mean of 5, the predictions of 10, 15, or 20, and a constant of 2.71828. Which statistical distribution are you working with?
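The calculation described in this card can be sketched with the Poisson probability mass function, P(k) = λ^k · e^(-λ) / k!, where λ is the mean of 5 and e is the constant 2.71828. The function name `poisson_pmf` is an illustrative choice.

```python
from math import exp, factorial

def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k events when the average per period is lam."""
    return lam**k * exp(-lam) / factorial(k)

mean_defaults = 5  # average loan defaults per month, from the card
for defaults in (10, 15, 20):
    print(defaults, poisson_pmf(defaults, mean_defaults))
```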
Uniform
As a magician, you do many card tricks. One of your tricks involves asking an audience member to draw a card from a 52-card deck. If you plot the distribution of n number of card draws where the outcome is spade, heart, diamond, or club, what kind of statistical distribution will you observe?
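To see why the suit outcomes are uniform, the sketch below builds a standard 52-card deck and computes the exact probability of each suit for a single random draw. The variable names are illustrative assumptions.

```python
from collections import Counter
from fractions import Fraction

suits = ["spade", "heart", "diamond", "club"]
ranks = ["A"] + [str(n) for n in range(2, 11)] + ["J", "Q", "K"]
deck = [(rank, suit) for suit in suits for rank in ranks]

# Each suit appears 13 times in 52 cards, so every suit has the same probability.
suit_counts = Counter(suit for _, suit in deck)
suit_probabilities = {suit: Fraction(count, len(deck)) for suit, count in suit_counts.items()}
```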
Poisson Distribution
The average number of events in a period is known, but the time or space between events is random
Histogram
Before proceeding with further analyses, we want to know the distribution shape of one of our features with continuous data. Which of the following visualization types allows us to do it?
General applications
The binomial distribution can be used for independent events that produce binary outcomes, e.g. clinical trials
Quantitative
Data that scatter plots visualize
Quantitative and Qualitative
Data that pivot tables can visualize
continuous data
Data that can take on any value. There is no space between data values for a given domain. Graphs are represented by solid lines.
discrete probability distribution
Defined as a probability distribution giving the probability that a discrete random variable will have a specified value.
probability distribution
Describes the probability of each possible outcome in a scenario
The normal distribution is continuous, while the Poisson distribution is discrete
Difference between the normal and Poisson distributions
Interval Data
Differences between values can be found, but there is no absolute zero (e.g. temperature in °C or °F, time of day)
uniform distribution
Distribution in which values are spread evenly, so that every outcome is equally likely
type 2 error
False negative: an error that occurs when a researcher concludes that the independent variable does not have an effect on the dependent variable when it does.
A t-distribution has heavier tails and tends to have values farther from the mean
How is a t-distribution different than a normal distribution?
Chi-square
Involves categorical variables. Looks at 2 distributions of categorical data to see if they differ from each other.
Kurtosis
Measure of the fatness of the tails of a probability distribution relative to that of a normal distribution. Indicates likelihood of extreme outcomes.
sampling with replacement
Once an element has been included in the sample, it is returned to the population. A previously selected element can be selected again and therefore may appear in the sample more than once.
Independent Probability
Principle that probability of one event occurring has no effect on the probability of the other.
binomial distribution
Probability distribution of the number of successes in a fixed number of independent trials, each with the same probability of success.
We are 95% confident that the true population value lies between 2 and 4
Suppose an analyst at your shoe company tells you that she ran a statistical test on expected sneaker sales. The test suggests customers will buy an average of 3 sneakers per person this year, with a margin of error of 1 sneaker. How do you explain to your stakeholder the results of this statistical finding?
interquartile range
The difference between the upper and lower quartiles.
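The interquartile range is also what a box plot typically uses to flag outliers. The sketch below applies the common 1.5 × IQR convention (an assumption, not stated in the cards) with the "inclusive" quartile method from Python's `statistics` module; other quartile conventions give slightly different fences. The mileage figures are invented.

```python
from statistics import quantiles

def iqr_outliers(values):
    """Return values lying more than 1.5 * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # upper quartile minus lower quartile
    lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]

# Invented mileage figures (in thousands of miles) with one extreme value.
miles_thousands = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
outliers = iqr_outliers(miles_thousands)
```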
Skewness
The extent to which cases are clustered more at one or the other end of the distribution of a quantitative variable rather than in a symmetric pattern around its center
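One way to quantify skewness is the moment-based coefficient, the third central moment divided by the second central moment raised to the power 1.5. The sketch below uses the population (uncorrected) form, which is only one of several definitions in use; the data are invented.

```python
def skewness(values):
    """Moment-based skewness: third central moment / (second central moment ** 1.5)."""
    n = len(values)
    center = sum(values) / n
    m2 = sum((v - center) ** 2 for v in values) / n
    m3 = sum((v - center) ** 3 for v in values) / n
    return m3 / m2**1.5

symmetric = skewness([1, 2, 3, 4, 5])      # zero for a symmetric sample
right_skewed = skewness([1, 1, 1, 2, 10])  # positive: long tail to the right
left_skewed = skewness([1, 9, 10, 10, 10]) # negative: long tail to the left
```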
Pivot Table
The name of the tool used by most spreadsheet programs to create a summary table.
Chi-square distribution
The theoretical distribution that models the test statistic in chi-square tests
Central Limit Theorem
The theorem that, as the sample size n increases, the distribution of the means of randomly selected samples of size n approaches a normal distribution.
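A small simulation can make this concrete. The sketch below draws repeated samples from a uniform distribution on [0, 1), which is far from bell-shaped, and collects the sample means; they cluster around 0.5 with a spread close to σ/√n. The sample size, repetition count, and seed are arbitrary assumptions.

```python
import random
from statistics import mean, stdev

def sample_means(sample_size, repetitions, seed=0):
    """Means of repeated random samples drawn from a uniform distribution on [0, 1)."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [
        sum(rng.random() for _ in range(sample_size)) / sample_size
        for _ in range(repetitions)
    ]

means = sample_means(sample_size=30, repetitions=2000)
center = mean(means)   # close to the population mean, 0.5
spread = stdev(means)  # close to sqrt(1/12) / sqrt(30)
```

Plotting `means` as a histogram would show the roughly bell-shaped pattern the theorem describes.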
BI tools
Tools that process large amounts of unstructured data in books, journals, documents, health records, images, files, email, video, and so forth, to help you discover meaningful trends and identify new business opportunities.
Quartiles
Values that divide a data set into four equal parts
data storytelling
the process of translating often complex data analyses into easier-to-understand terms to enable better decision making
Proportion Hypothesis Test
used to determine if a sampled proportion is significantly different from a specified population proportion.
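A minimal sketch of a one-sample z-test for a proportion follows, assuming the sample is large enough for the normal approximation. The function name and the 60-out-of-100 example are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def proportion_z_test(successes, n, p0):
    """Two-sided z-test of a sample proportion against a specified proportion p0."""
    p_hat = successes / n
    standard_error = sqrt(p0 * (1 - p0) / n)  # computed under the null hypothesis
    z = (p_hat - p0) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical data: 60 successes in 100 trials, tested against p0 = 0.5.
z, p_value = proportion_z_test(60, 100, 0.5)
```

Here z is about 2, so the p-value falls just under 0.05.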
ordinal statistics
values can be ranked but not measured, e.g. house numbers
Qualitative
What data type is data in the form of words?
Quantitative
What data type is data that is in numbers?
Poisson process example
What are the following examples of: number of animals adopted from an animal shelter per week, number of people arriving at a restaurant per hour, number of website visits in a day?
expected value
What is n × p in a binomial distribution?
Normal
Which continuous probability distribution is characterized by being unimodal, symmetric, asymptotic to the x axis, with an equal mode, mean, and median? (exponential, uniform or normal)
to reduce bias, ensure groups are comparable, and increase the chances that results will be representative of the target population
Why is randomization in experiments important?
the population is usually too large to measure every member
Why is it necessary to make parameter estimates when collecting data for statistical analysis?
There is no significant difference between the click-through rates of the red and green campaigns
You are a Data Scientist on a Marketing team. The team is analyzing whether changing the color of their campaign from red to green will increase click-through rates. What is an appropriate null hypothesis for this experiment?
One-Tailed Test
You are a Data Scientist on a Marketing team. The team is analyzing whether changing the color of their campaign will increase click-through rates. What kind of hypothesis do you propose to solve the team's question?
F-distribution
You are a research analyst working for a consulting company who has been assigned a project to find whether the average weight for American adults has increased or decreased based on their work habits. You obtain a sample of weights from people who work from home and from people who commute to the office. You now need to calculate how varied your samples are. Which statistical distribution will be most useful in this case?
nominal statistics
You can count but not order or measure, e.g. sex and eye color
Wider
You conduct a survey to compare student study habits between two countries. Canada's data has a standard deviation of 10 and Mexico's has a standard deviation of 6. How can you describe the confidence interval for the Canadian students' study habits vs Mexican students'?
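The reason for the wider interval can be sketched with the margin-of-error formula z × σ / √n: at the same sample size and confidence level, the margin grows in proportion to the standard deviation. The sample size of 100 below is an assumption, since the card does not give one.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(std_dev, n, confidence=0.95):
    """Half-width of a normal-approximation confidence interval for a mean."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * std_dev / sqrt(n)

n = 100  # assumed sample size, identical for both countries
canada_margin = margin_of_error(10, n)
mexico_margin = margin_of_error(6, n)
```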
there is a 2.5% chance that she will capture less than 26.5% and a 2.5% chance more than 33.5%
You conducted a telephone survey on an election candidate and after analyzing your data concluded that there is a 95% chance that she would capture between 26.5% and 33.5% of the vote given the margin of error of 3.5%. What can you say about the remaining 5% chance?
Control Group
You design an experiment to test engagement with a new product. The users exposed to the existing version of the product are called what?
It cannot be rejected because hair regrowth was the same between the groups
You want to test the regrowth of hair in people who eat spinach on a daily basis vs. people who don't. You conduct an experiment and find that eating spinach daily has no effect on hair regrowth. What can you say about the null hypothesis?
Sugar Pill
You work at a pharmaceutical company and are testing a new drug to reduce hair loss. You separate the group into two sections: one you will give the new drug and the other you will give a sugar pill. Which one did the control group receive?
bar plot
a common way to display a single categorical variable
standard deviation
a computed measure of how much scores vary around the mean score
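As a quick illustration with made-up scores, Python's `statistics` module offers both the population form (dividing by n) and the sample form (dividing by n - 1):

```python
from statistics import mean, pstdev, stdev

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up scores with mean 5
average = mean(scores)
population_sd = pstdev(scores)  # treats the scores as the whole population
sample_sd = stdev(scores)       # treats the scores as a sample (n - 1 in the denominator)
```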
time series
a forecasting technique that uses a series of past data points to make a forecast
sample distribution
a frequency distribution of a sample
violin plot
a graph that shows an approximation of the frequency distribution of a numerical variable in each group and its mirror image
box plot
a graphic way of showing a summary of data and distribution of features
Scatterplot
a graphical depiction of the relationship between two variables
null hypothesis
the default statement of no effect or no difference, which a hypothesis test tries to falsify, or prove wrong
artificial intelligence
a subdiscipline of computer science that attempts to simulate human thinking
negative correlation
as one variable increases, the other decreases
Lambda
average number of events per time interval in a Poisson distribution; also known as the expected value of the distribution
time series forecasting
based on the assumption that the future is an extension of the past. Historical data is used to predict future demand
continuous uniform distribution
defined over a range that spans between some lower limit, a, and some upper limit, b, which serve as the parameters of the distribution.
theoretical distribution
hypothesized scores based on mathematical formulas and logic
How does machine learning work?
An interdisciplinary mix of statistics and computer science, the ability to learn without being explicitly programmed, learning patterns from existing data and applying them to new data, and reliance on high-quality data describe what type of learning?
Data Science
involves applications of statistics, computer science, and software engineering, along with some other relevant fields (such as sociology or finance).
F distribution
mathematically defined curve that is the comparison distribution used in an analysis of variance
neutral correlation
no relationship between variables
inferential statistics
numerical data that allow one to generalize, that is, to infer from sample data the probability of something being true of a population
Type 1 error
rejection of a true null hypothesis
Pareto Principle
roughly 80% of the effects come from 20% of the causes
descriptive statistics
statistics that summarize the data collected in a study
Conjoint Analysis
technique used to develop an understanding of the attributes that guide consumer preferences by having consumers compare product preferences across varying levels of evaluative criteria and expected utility
Machine Learning
the extraction of knowledge from data based on algorithms created from training data
Law of Large Numbers
the larger the number of individuals that are randomly drawn from a population, the more representative the resulting group will be of the entire population
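A simple simulation, sketched below under the assumption of a fair six-sided die, shows the same idea numerically: the average of the rolls drifts toward the theoretical mean of 3.5 as the number of rolls grows. The function name and seed are arbitrary.

```python
import random

def mean_of_die_rolls(n_rolls, seed=0):
    """Average of n_rolls simulated rolls of a fair six-sided die."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return sum(rng.randint(1, 6) for _ in range(n_rolls)) / n_rolls

# Larger samples tend to land closer to the theoretical mean of 3.5.
for n_rolls in (10, 1_000, 100_000):
    print(n_rolls, mean_of_die_rolls(n_rolls))
```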