statistics/probability

Box Plot

We are given a dataset containing the number of miles that different brands of cars ran before they broke down. Which plot is best suited to identify the outliers in miles per brand?
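
As a rough illustration of what a box plot flags, the sketch below applies the common 1.5 × IQR whisker rule (an assumed convention; plotting libraries may use slightly different quartile methods). The function name `iqr_outliers` and the mileage figures are made up for the example.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical miles before breakdown (in thousands) for one brand.
miles = [118, 122, 125, 127, 130, 131, 134, 136, 140, 250]
print(iqr_outliers(miles))
```

The single very large value is the kind of point a box plot would draw beyond the whiskers.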

A bar chart is used for categorical data, while a histogram is used for quantitative data

What is a key difference between a histogram and a bar chart?

n is the number of independent trials and p is the probability of success on each trial (equivalently, the expected value divided by n)

What are n and p in a binomial distribution?
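
A minimal sketch of how n and p determine a binomial distribution, and why the expected value is n × p. The values n = 10 and p = 0.5 are arbitrary choices for illustration, and `binomial_pmf` is a hypothetical helper name.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.5  # assumed example values
probs = [binomial_pmf(k, n, p) for k in range(n + 1)]

# The expected value computed from the distribution agrees with n * p.
mean = sum(k * pk for k, pk in enumerate(probs))
print(round(mean, 6), n * p)
```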

Continuous Numerical data

What type of data is best suited to work with box-plots?

Box Plot

What type of data visualization is best suited to understanding what the median of a continuous dataset is?

Histogram

What type of plot should we use to identify the probability distribution of a dataset?

Discrete Uniform Distribution

When all of the probabilities are the same

Confidence Interval

A range in which the true population mean is likely to be found, at a stated level of confidence
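
One way to compute such a range is sketched below, assuming a normal approximation (for small samples a t critical value would usually be more appropriate). The sample values and the name `mean_confidence_interval` are hypothetical.

```python
import statistics
from math import sqrt

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    center = statistics.mean(sample)
    margin = z * statistics.stdev(sample) / sqrt(len(sample))
    return center - margin, center + margin

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up measurements
low, high = mean_confidence_interval(sample)
print(low, high)
```

Raising the confidence level (for example to 0.99) widens the interval, since a larger critical value is needed.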

positive correlation

A correlation where as one variable increases, the other also increases, or as one decreases so does the other. Both variables move in the same direction.

Heatmap

A data visualization tool that shows levels of activity on a web page in different colors.

Scatter plot

A factory collected some data points regarding the number of hours it took to make an item and the return rate of the item. An analyst wants to check if there is a relationship between these two variables. What plot is best suited for visualizing the relationship?

normal distribution

A function that represents the distribution of variables as a symmetrical bell-shaped graph.

Histogram

A graph of vertical bars representing the frequency distribution of a set of data.

two-tailed test

A hypothesis test in which rejection of the null hypothesis occurs for values of the test statistic in either tail of its sampling distribution.

one-tailed test

A hypothesis test in which rejection of the null hypothesis occurs for values of the test statistic in one tail of its sampling distribution.

Correlation

A measure of the extent to which two factors vary together, and thus of how well either factor predicts the other.
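
A small sketch of the Pearson correlation coefficient, one common way to quantify how two variables vary together. The helper name `pearson_r` and the data are illustrative assumptions.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariation / (spread_x * spread_y)

hours = [1, 2, 3, 4]
print(pearson_r(hours, [2, 4, 6, 8]))  # both rise together: positive
print(pearson_r(hours, [8, 6, 4, 2]))  # one rises, one falls: negative
```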

McCandless Method

A method for presenting data visualizations that moves from general to specific information

line plot

A method of visually displaying a distribution of data values where each data value is shown as a dot or mark above a number line. Also known as a dot plot.

Percentile

A point on a ranking scale of 0 to 100. The 50th percentile is the midpoint; half the people in the population being studied rank higher and half rank lower.

alternate hypothesis

A statement that is accepted if the sample data provide sufficient evidence that the null hypothesis is false.

machine learning model

A statistical representation of a real-world process based on data

Variance Analysis

A technique for determining the cause and degree of difference between the baseline and actual performance.

blocked test case

A test case that cannot be executed because the preconditions for its execution are not fulfilled.

Paired t-test

A test designed to determine the statistical difference between two groups' means where the participants in each group are either the same or matched pairs.
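
A sketch of the test statistic behind a paired t-test: it works on the pairwise differences rather than on the two groups separately. The before/after scores and the name `paired_t_statistic` are made up; turning the statistic into a p-value would additionally need the t-distribution with n − 1 degrees of freedom, which is not shown here.

```python
import statistics
from math import sqrt

def paired_t_statistic(before, after):
    """t statistic for the mean of the pairwise differences (after - before)."""
    diffs = [a - b for a, b in zip(after, before)]
    standard_error = statistics.stdev(diffs) / sqrt(len(diffs))
    return statistics.mean(diffs) / standard_error

# Hypothetical scores for the same four participants.
before = [10, 12, 9, 11]
after = [12, 14, 10, 13]
t = paired_t_statistic(before, after)
print(t, "with", len(before) - 1, "degrees of freedom")
```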

controlled experiment

An experiment in which only one variable is manipulated at a time.

If we repeated the experiment many times, about 95% of the intervals we computed would contain the true population value

As a Data Scientist at ABC corp., you are asked to explain to the Marketing Manager what a 95% confidence interval is. How should you describe a 95% confidence interval to the manager?

The color of the campaign

As a Data Scientist in a Marketing team, you are asked to analyze the effect of changing the color of your campaign from red to green on the number of clicks the campaign receives. Which of the following represents the independent variable in this scenario?

heatmap

As a biologist, you are responsible for tracking whale migration in the Pacific Ocean. You have special satellite equipment that monitors both the numbers of whales that are traveling along a known route and the latitude and longitude of their location. What would be the most appropriate visualization tool to see the density of whales across specific latitudes and longitudes in the Pacific Ocean?

Poisson

As a forecasting analyst in credit risk, you are responsible for tracking loan defaults. The bank is expecting a recession soon and is hoping to minimize losses. The bank experiences 5 loan defaults per month. Your boss has asked you to calculate the probability that the bank will have 10, 15, or 20 losses next month. You calculate the probabilities using the mean of 5, the predictions of 10, 15, or 20, and a constant of 2.71828. Which statistical distribution are you working with?
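
The calculation described in this question can be sketched as follows, using the Poisson formula P(X = k) = e^(−λ) · λ^k / k! with λ = 5 and e ≈ 2.71828. The helper name `poisson_pmf` is an assumption for the example.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)

mean_defaults = 5  # average loan defaults per month, from the question
for k in (10, 15, 20):
    print(k, poisson_pmf(k, mean_defaults))
```

With a mean of 5, each of these counts is unlikely, and the probability shrinks quickly as k grows.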

Uniform

As a magician, you do many card tricks. One of your tricks involves asking an audience member to draw a card from a 52-card deck. If you plot the distribution of n number of card draws where the outcome is spade, heart, diamond, or club, what kind of statistical distribution will you observe?

Poisson Distribution

The average number of events in a period is known, but the time or space between events is random

Histogram

Before proceeding with further analyses, we want to know the distribution shape of one of our features with continuous data. Which of the following visualization types allows us to do it?

General applications

Binomial distribution can be used for independent events producing binary outcomes. ex. clinical trials

Quantitative

The type of data that scatter plots visualize

Quantitative and Qualitative

The types of data that pivot tables can visualize

continuous data

Data that can take on any value. There is no space between data values for a given domain. Graphs are represented by solid lines.

discrete probability distribution

Defined as a probability distribution giving the probability that a discrete random variable will have a specified value.

probability distribution

Describes the probability of each possible outcome in a scenario

Normal is continuous while Poisson is discrete

Difference between the normal and Poisson distributions

Interval Data

Differences between values can be found, but there is no absolute 0. (Temp. and Time)

uniform distribution

Distribution in which all outcomes are equally likely, so values are spread evenly across the range

type 2 error

False negative: an error that occurs when a researcher concludes that the independent variable does not have an effect on the dependent variable when it does.

A t-distribution has heavier tails and tends to have values farther from the mean

How is a t-distribution different from a normal distribution?

Chi-square

Involves categorical variables. Looks at 2 distributions of categorical data to see if they differ from each other.

Kurtosis

Measure of the fatness of the tails of a probability distribution relative to that of a normal distribution. Indicates likelihood of extreme outcomes.

sampling with replacement

Once an element has been included in the sample, it is returned to the population. A previously selected element can be selected again and therefore may appear in the sample more than once.

Independent Probability

Principle that probability of one event occurring has no effect on the probability of the other.

binomial distribution

Probability distribution of the number of successes in a fixed number of independent trials, each with the same probability of success.

We are 95% confident that the true population value lies between 2 and 4

Suppose an analyst at your shoe company tells you that she ran a statistical test on expected sneaker sales. The test suggests customers will buy an average of 3 sneakers per person this year, with a margin of error of 1 sneaker. How do you explain to your stakeholder the results of this statistical finding?

interquartile range

The difference between the upper and lower quartiles.

Skewness

The extent to which cases are clustered more at one or the other end of the distribution of a quantitative variable rather than in a symmetric pattern around its center
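
A sketch of one common numeric measure of skewness, the moment coefficient m3 / m2^1.5 (other definitions with small-sample corrections exist). The function name `skewness` and the data are illustrative.

```python
def skewness(values):
    """Moment coefficient of skewness: third central moment over m2 ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2**1.5

print(skewness([1, 2, 3, 4, 5]))     # symmetric data
print(skewness([1, 1, 1, 2, 10]))    # long right tail
print(skewness([1, 9, 10, 10, 10]))  # long left tail
```

A symmetric dataset gives a value of zero, a long right tail gives a positive value, and a long left tail gives a negative value.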

Pivot Table

The name of the tool used by most spreadsheet programs to create a summary table.

Chi-square distribution

The theoretical distribution that models the test statistic for doing Chi-Square tests

Central Limit Theorem

The theory that, as sample size increases, the distribution of sample means of size n, randomly selected, approaches a normal distribution.
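
A small simulation can make this concrete. The sketch below draws repeated samples from a skewed Exponential(1) population (an arbitrary choice for illustration) and looks at the distribution of their means; the sample size, number of samples, and seed are all assumptions.

```python
import random
import statistics

def sample_means(sample_size, num_samples, seed=0):
    """Means of repeated samples from a skewed Exponential(1) population."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(sample_size))
        for _ in range(num_samples)
    ]

means = sample_means(sample_size=50, num_samples=2000)
# The means cluster around the population mean (1.0), with a spread
# close to the population SD divided by sqrt(sample_size).
print(statistics.fmean(means), statistics.stdev(means))
```

Even though the population is far from normal, a histogram of `means` should look roughly bell-shaped at this sample size.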

BI tools

Tools that process large amounts of unstructured data in books, journals, documents, health records, images, files, email, video, and so forth, to help you discover meaningful trends and identify new business opportunities.

Quartiles

Values that divide a data set into four equal parts

data storytelling

the process of translating often complex data analyses into more easy to understand terms to enable better decision making

Proportion Hypothesis Test

used to determine if a sampled proportion is significantly different from a specified population proportion.
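
A sketch of a one-sample proportion z-test under a normal approximation, which also shows how one-tailed and two-tailed p-values differ. The counts and the name `proportion_z_test` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def proportion_z_test(successes, n, p0):
    """Return (z, upper-tail p-value, two-tailed p-value) for H0: p = p0."""
    p_hat = successes / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    upper_tail = 1 - NormalDist().cdf(z)
    two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, upper_tail, two_tailed

# Made-up example: 60 clicks out of 100 against a claimed rate of 0.5.
z, p_one, p_two = proportion_z_test(60, 100, 0.5)
print(z, p_one, p_two)
```

The one-tailed p-value is half the two-tailed one here because the normal distribution is symmetric.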

ordinal statistics

Values can be ranked but not measured (e.g., house numbers)

Qualitative

What data type is data in the form of words?

Quantitative

What data type is data in the form of numbers?

Poisson process example

What are the following examples of: number of animals adopted from an animal shelter per week, number of people arriving at a restaurant per hour, number of website visits in a day?

expected value

What is n × p in a binomial distribution?

Normal

Which continuous probability distribution is characterized by being unimodal, symmetric, asymptotic to the x axis, with an equal mode, mean, and median? (exponential, uniform or normal)

to reduce bias, ensure groups are comparable, and increase the chances that results will be representative of the target population

Why is randomization important in experiments?

the population is usually too large to measure every member

Why is it necessary to make parameter estimates when collecting data for statistical analysis?

There is no significant difference between the red and green click-through rates

You are a Data Scientist on a Marketing team. The team is analyzing whether changing the color of their campaign from red to green will increase click-through rates. What is an appropriate null hypothesis for this experiment?

One-Tailed Test

You are a Data Scientist on a Marketing team. The team is analyzing whether changing the color of their campaign will increase click-through rates. What kind of hypothesis do you propose to solve the team's question?

F-distribution

You are a research analyst working for a consulting company who has been assigned a project to find whether the average weight for American adults has increased or decreased based on their work habits. You obtain a sample of weights from people who work from home and from people who commute to the office. You now need to calculate how varied your samples are. Which statistical distribution will be most useful in this case?

nominal statistics

You can count but not order or measure (e.g., sex and eye color)

Wider

You conduct a survey to compare student study habits between two countries. Canada's data has a standard deviation of 10 and Mexico's has a standard deviation of 6. How can you describe the confidence interval for the Canadian students' study habits vs Mexican students'?
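
The reasoning can be sketched numerically: the margin of error scales with the standard deviation, so at the same sample size and confidence level the larger spread gives the wider interval. The sample size of 100 is an assumption (the question does not give one), as is the normal approximation.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(std_dev, n, confidence=0.95):
    """Half-width of a normal-approximation confidence interval for a mean."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * std_dev / sqrt(n)

n = 100  # hypothetical sample size, assumed equal for both countries
canada = margin_of_error(10, n)
mexico = margin_of_error(6, n)
print(canada, mexico)
```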

There is a 2.5% chance that she will capture less than 26.5% of the vote and a 2.5% chance that she will capture more than 33.5%

You conducted a telephone survey on an election candidate and after analyzing your data concluded that there is a 95% chance that she would capture between 26.5% and 33.5% of the vote given the margin of error of 3.5%. What can you say about the remaining 5% chance?

Control Group

You design an experiment to test engagement with a new product. The users exposed to the existing version of the product are called what?

It cannot be rejected because hair regrowth was the same between the groups

You want to test the regrowth of hair in people who eat spinach on a daily basis vs. people who don't. You conduct an experiment and find that eating spinach daily has no effect on hair regrowth. What can you say about the null hypothesis?

Sugar Pill

You work at a pharmaceutical company and are testing a new drug to reduce hair loss. You separate the group into two sections: one you will give the new drug and the other you will give a sugar pill. Which one did the control group receive?

bar plot

a common way to display a single categorical variable

standard deviation

a computed measure of how much scores vary around the mean score
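
A minimal sketch of the computation, using the population form (dividing by n; the sample form divides by n − 1 instead). The scores are made up, and `population_sd` is a hypothetical helper checked against the standard library.

```python
import statistics
from math import sqrt

def population_sd(values):
    """Square root of the average squared deviation from the mean."""
    mean = sum(values) / len(values)
    return sqrt(sum((v - mean) ** 2 for v in values) / len(values))

scores = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_sd(scores), statistics.pstdev(scores))
```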

time series

a forecasting technique that uses a series of past data points to make a forecast

sample distribution

a frequency distribution of a sample

violin plot

a graph that shows an approximation of the frequency distribution of a numerical variable in each group and its mirror image

box plot

a graphic way of showing a summary of data and distribution of features

Scatterplot

a graphical depiction of the relationship between two variables

null hypothesis

the default statement of no effect or no difference, which a hypothesis test tries to falsify

artificial intelligence

a subdiscipline of computer science that attempts to simulate human thinking

negative correlation

as one variable increases, the other decreases

Lambda

average number of events per time interval. also known as the expected value of the distribution

time series forecasting

based on the assumption that the future is an extension of the past. Historical data is used to predict future demand

continuous uniform distribution

defined over a range that spans between some lower limit, a, and some upper limit b, which serves as the parameters of the distribution.

theoretical distribution

hypothesized scores based on mathematical formulas and logic

How does machine learning work?

It is an interdisciplinary mix of statistics and computer science: it learns without being explicitly programmed, learns patterns from existing data and applies them to new data, and relies on high-quality data.

Data Science

involves applications of statistics, computer science, and software engineering, along with some other relevant fields (such as sociology or finance).

F distribution

mathematically defined curve that is the comparison distribution used in an analysis of variance

neutral correlation

no relationship between variables

inferential statistics

numerical data that allow one to generalize: to infer from sample data the probability of something being true of a population

Type 1 error

rejection of a true null hypothesis

Pareto Principle

roughly 80% of the effects come from 20% of the causes

descriptive statistics

statistics that summarize the data collected in a study

Conjoint Analysis

technique used to develop an understanding of the attributes that guide consumer preferences by having consumers compare product preferences across varying levels of evaluative criteria and expected utility

Machine Learning

the extraction of knowledge from data based on algorithms created from training data

Law of Large Numbers

the larger the number of individuals that are randomly drawn from a population, the more representative the resulting group will be of the entire population
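
A simulation sketch of the same idea: the running average of fair die rolls drifts toward the theoretical mean of 3.5 as more rolls are drawn. The number of rolls and the seed are arbitrary assumptions.

```python
import random

def running_mean_of_die_rolls(num_rolls, seed=0):
    """Running average after each roll of a simulated fair six-sided die."""
    rng = random.Random(seed)
    total = 0
    means = []
    for i in range(1, num_rolls + 1):
        total += rng.randint(1, 6)
        means.append(total / i)
    return means

means = running_mean_of_die_rolls(100_000)
# Early averages can be far from 3.5; later ones settle close to it.
print(means[9], means[999], means[-1])
```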

