The ULTIMATE stats study guide

Statistics

Statistics is a process of collecting data, summarizing data, and interpreting data.

When describing the shape of a histogram, we should consider the following:

Symmetry/skewness of the distribution
Peakedness (modality): the number of peaks (modes) the distribution has

Multistage sampling

Taking a large population and narrowing it down progressively, in stages, using methods such as stratified sampling, cluster sampling, or simple random sampling.

Cluster Sampling

Target population is divided into subgroups & entire subgroups are randomly selected

Stratified Sampling

Target population is divided into subgroups & subjects from each subgroup are randomly chosen to yield a representative sample

Using the IQR to Detect Outliers

The 1.5(IQR) criterion for outliers: an observation is considered a suspected outlier if it is less than Q1 - 1.5(IQR) or more than Q3 + 1.5(IQR).
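The criterion above can be sketched in a few lines of Python. The function names and the sample values (Q1 = 10, Q3 = 20) are illustrative assumptions, not part of the original card.

```python
def outlier_fences(q1, q3):
    """Return the (lower, upper) cutoffs of the 1.5(IQR) criterion."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def suspected_outliers(data, q1, q3):
    """Return the observations that fall outside the two cutoffs."""
    lower, upper = outlier_fences(q1, q3)
    return [x for x in data if x < lower or x > upper]


# Illustrative values only: Q1 = 10 and Q3 = 20, so IQR = 10.
print(outlier_fences(10, 20))                           # (-5.0, 35.0)
print(suspected_outliers([3, 12, 18, 36, -7], 10, 20))  # [36, -7]
```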

Midpoint

The center of the distribution is the value that divides the distribution so that approximately half the observations take smaller values, and approximately half the observations take larger values.

Correlation coefficient (r)

The correlation coefficient (r) is a numerical measure of the strength and direction of a linear relationship between two quantitative variables.
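As a rough sketch, r can be computed as the sum of the products of the standardized x and y values, divided by n - 1. The card itself does not give this formula, so the standard sample formula is assumed here. The function name and the data are made up for illustration.

```python
from math import sqrt


def correlation(xs, ys):
    """Sample correlation coefficient r of two equal-length lists."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sample standard deviations (dividing by n - 1).
    s_x = sqrt(sum((x - x_mean) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - y_mean) ** 2 for y in ys) / (n - 1))
    # Average (over n - 1) of the products of standardized values.
    products = (((x - x_mean) / s_x) * ((y - y_mean) / s_y)
                for x, y in zip(xs, ys))
    return sum(products) / (n - 1)


# Hypothetical data, for illustration only.
r = correlation([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(r, 4))  # 0.7746
```

A value near +1 or -1 indicates a strong linear relationship; a value near 0 indicates a weak or absent linear relationship.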

Uniform

The distribution has no modes, or no value around which the observations are concentrated.

Empirical Methods

The empirical method uses a series of trials to estimate the probability of an event. Each trial produces an outcome that cannot be predicted in advance. The probability is determined based on experience.

Skewed Left

The left tail (smaller values) is much longer than the right tail (larger values). Most of the data is on the right side of the graph, and the median is larger than the mean. Ex: age at death.

Weak Relationship

The points also follow the linear pattern, but much less closely.

Quantitative explanatory and categorical response

Time is the explanatory variable and it is quantitative. Driving Test Outcome is the response variable and it is categorical. Therefore this is an example of case Q→C.

Inference

Use what we've discovered about our sample to draw conclusions about our population

Choosing Numerical Summaries

Use x̄ (the mean) and the standard deviation as measures of center and spread only for reasonably symmetric distributions with no outliers. Use the median and IQR as measures of center and spread for all other cases.

Classical Methods

Used for games of chance, such as flipping coins, rolling dice, spinning spinners, roulette wheels, or lotteries. The probabilities are "classical" because their values are determined by the game itself, that is, by theory.

Strong Relationship

The data points in the scatterplot follow the linear pattern quite closely. This is an example of a strong relationship.

Split Stemplots

When some of the stems hold a large number of leaves, it is common for statistical software to split each stem into two: the first holding the leaves 0-4, and the second holding the leaves 5-9. Note that all stems have to be split

Simpson's paradox

Occurs when including a lurking variable causes us to rethink the direction of an association.

Equation of a straight line

Y = a + bX. The intercept a is the value of Y when X = 0. The slope b is the change in Y for every increase of 1 unit in X.

Variable

A particular characteristic of the individual. Ex: gender, age, weight.

Stemplot

Also called a stem-and-leaf plot. The leaf is the right-most digit. The stem is everything except the right-most digit. So, if the data point is 34, then 3 is the stem and 4 is the leaf. If the data point is 3.41, then 3.4 is the stem and 1 is the leaf.

The Law of Large Numbers

as the number of trials increases, the empirical probability gets closer and closer to the theoretical probability.
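A small simulation can illustrate this. The sketch below flips a simulated fair coin, whose theoretical probability of heads is 0.5, and reports the relative frequency of heads for growing numbers of flips. The function name, the seed, and the flip counts are arbitrary choices for illustration.

```python
import random


def relative_frequency_of_heads(num_flips, rng):
    """Flip a simulated fair coin num_flips times; return the share of heads."""
    heads = sum(1 for _ in range(num_flips) if rng.random() < 0.5)
    return heads / num_flips


# A fixed seed makes the run repeatable; the seed value itself is arbitrary.
rng = random.Random(0)
for n in (10, 1_000, 100_000):
    print(n, relative_frequency_of_heads(n, rng))
```

With more flips, the printed relative frequency should tend to settle near 0.5, although any single run can still wander a little.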

Variance

The average of the squared deviations (found by dividing their sum by n - 1) is called the variance of the data.

Producing Data

choosing a sample and collecting data

two-way table (also called a contingency table)

comparing two categorical variables

Spread/Variability

described by the approximate range covered by the data

First Quartile (Q1)

lower quartile

IQR

Measures the variability of a distribution by giving the range covered by the middle 50% of the data.

Exploratory Data Analysis consists of:

organizing and summarizing data, discovering important features and patterns in the data and any striking deviations from those patterns, and then interpreting our findings in the context of the problem.

Individual

A particular person or object. Individuals are also called units.

Data

pieces of information about individuals organized into variables

Relative Frequency

The proportion of times the event happened: the number of times the event happened divided by the total number of trials.

prospective observational study

records the values of variables (in this case, baby's growth) as they naturally happen forward in time.

scatterplot

relationship between two variables which are both quantitative

A store surveyed 250 of its customers to study the relationship between the amount spent on groceries and income. This is an example of:

scatterplot

systematic sampling

Selecting samples based on a set schedule or plan. Ex: picking every 50th name on a list.

In order to study whether IQ level is related to gender, data were collected from a sample of 540. This is an example of:

side-by-side boxplots

Exploratory Data Analysis

summarizing the collected data

A survey was conducted to study the relationship between the zip code of the family home and whether they buy or rent the home. Data were collected from a random sample of 280 families from a certain metropolitan area. This is an example of:

two-way table

Third Quartile (Q3)

upper quartile

To display data from one quantitative variable graphically

use either the histogram or the stemplot

observational study

values of the variable or variables of interest are recorded as they naturally occur. There is no interference by the researchers who conduct the study.

Probability of an Event

When all outcomes in the sample space S are equally likely, we can find the probability of any event A by dividing the number of outcomes in A by the number of outcomes in S.
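As an illustration, assume two fair dice, so that all 36 outcomes in S are equally likely. The dice example and the helper name are not from the original card. The probability of rolling a sum of 7 can then be found by counting:

```python
from fractions import Fraction
from itertools import product

# Sample space S: every (first die, second die) pair, assumed equally likely.
sample_space = list(product(range(1, 7), repeat=2))


def probability(event, space):
    """Number of outcomes in the event divided by the number of outcomes in S."""
    favorable = [outcome for outcome in space if event(outcome)]
    return Fraction(len(favorable), len(space))


p_sum_7 = probability(lambda roll: roll[0] + roll[1] == 7, sample_space)
print(len(sample_space))  # 36
print(p_sum_7)            # 1/6
```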

sample survey

A particular type of observational study in which individuals report variables' values themselves, frequently by giving their opinions.

Response Variable

(also commonly referred to as the dependent variable)—the outcome of the study. Denoted by Y

Bimodal

it has two modes (roughly at 10 and 20) around which the observations are concentrated.

Explanatory Variable

(also commonly referred to as the independent variable)—the variable that claims to explain, predict, or affect the response. Denoted by X

The probability that an event will occur

0 ≤ P(A) ≤ 1

How is the IQR found?

1) Arrange the data in increasing order, and find the median M.
2) Find the median of the lower 50% of the data. This point is called the first quartile of the distribution, and is denoted by Q1. The bottom (top) 50% of the data is all the observations whose position in the ordered list is to the left (right) of the location of the overall median M.
3) Find the median of the top 50% of the data. This point is called the third quartile of the distribution, and is denoted by Q3.
4) The middle 50% of the data falls between Q1 and Q3, and therefore: IQR = Q3 - Q1.
5) Note that when n is odd (for example, n = 7), the median is not included in either the bottom or top half of the data. When n is even (for example, n = 8), the data are naturally divided into two halves.
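The steps above can be sketched as follows. The function names and the two small data sets (n = 7 and n = 8) are illustrative. Note that statistical software often uses slightly different quartile conventions, so results may not match this method exactly.

```python
def median(sorted_values):
    """Median of a list that is already sorted in increasing order."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def quartiles(data):
    """Return (Q1, M, Q3) using the median-of-halves method (needs n >= 2)."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]
    # When n is odd, the overall median is left out of both halves.
    upper = values[half + 1:] if n % 2 == 1 else values[half:]
    return median(lower), median(values), median(upper)


q1, m, q3 = quartiles([1, 2, 3, 4, 5, 6, 7])     # n = 7 (odd)
print(q1, m, q3, q3 - q1)  # 2 4 6 4

q1, m, q3 = quartiles([1, 2, 3, 4, 5, 6, 7, 8])  # n = 8 (even)
print(q1, m, q3, q3 - q1)  # 2.5 4.5 6.5 4.0
```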

Standard Deviation Formula

1) Find the mean of the data (add all the numbers, then divide by the number of observations).
2) Subtract the mean from each of the numbers to get the deviations.
3) Square each resulting deviation.
4) Average the squared deviations by adding them up and dividing by n - 1 (the number of observations minus 1). This is the variance.
5) The SD of the data is the square root of the variance.
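A minimal sketch of these five steps, using a made-up data set (the function name is also just for illustration):

```python
from math import sqrt


def standard_deviation(data):
    """Sample standard deviation, following the five steps one by one."""
    n = len(data)
    mean = sum(data) / n                    # 1) mean
    deviations = [x - mean for x in data]   # 2) deviations from the mean
    squared = [d ** 2 for d in deviations]  # 3) squared deviations
    variance = sum(squared) / (n - 1)       # 4) variance (divide by n - 1)
    return sqrt(variance)                   # 5) SD = square root of variance


# Illustrative data: mean 3, squared deviations 4, 1, 0, 1, 4, variance 2.5.
print(round(standard_deviation([1, 2, 3, 4, 5]), 4))  # 1.5811
```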

Big Picture of Statistics

1. Producing Data: choosing a sample from the population of interest and collecting data
2. Exploratory Data Analysis (EDA): summarizing the data we've collected
3. Probability
4. Inference: drawing conclusions about the entire population based on the data collected from the sample

lurking variable

A lurking variable is a variable that is not among the explanatory or response variables in a study, but could substantially affect your interpretation of the relationship among those variables. We say that the lurking variable is confounded with the explanatory variable when their effects on the response variable cannot be distinguished from each other.

correlation (r)

A measurement of the direction and strength of a linear relationship between two quantitative variables. Extremely important!

Negative Relationship

A negative (or decreasing) relationship means that an increase in one of the variables is associated with a decrease in the other.

Positive Relationship

A positive (or increasing) relationship means that an increase in one of the variables is associated with an increase in the other.

Standard Deviation Example: In general, the larger the animal the longer the length of pregnancy (also called gestation period). For the horse, for example, the gestation period varies roughly according to a normal distribution with a mean of 336 days and a standard deviation of 3 days (Source: These figures are from Moore and McCabe, Introduction to the Practice of Statistics ). Use the Standard Deviation Rule to answer the following questions: (a picture of the SD rule applied to this distribution will help).

Almost all (99.7%) horse pregnancies fall in what range of lengths?
Options: above 336 days; below 336 days; between 333 and 339 days; between 330 and 342 days; between 327 and 345 days.
Answer: between 327 and 345 days. The Standard Deviation Rule tells us that virtually all the data fall within 3 standard deviations of the mean, which in this case is between 336 - 3(3) = 327 and 336 + 3(3) = 345.

Least Square Criterion

Among all the lines that look good on your data, choose the one that has the smallest sum of squared vertical deviations. This line is called the least-squares regression line.

Association does not imply causation

An observed association between two variables is not enough evidence that there is a causal relationship between them

The Standard Deviation Rule

Approximately 68% of the observations fall within 1 standard deviation of the mean. Approximately 95% of the observations fall within 2 standard deviations of the mean. Approximately 99.7% (or virtually all) of the observations fall within 3 standard deviations of the mean.
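The three ranges are simple arithmetic on the mean and the standard deviation. The sketch below applies the rule to the horse gestation example from this guide (mean 336 days, SD 3 days); the function name is an illustrative choice.

```python
def sd_rule_ranges(mean, sd):
    """Ranges within 1, 2, and 3 standard deviations of the mean."""
    labels = ((1, "68%"), (2, "95%"), (3, "99.7%"))
    return {pct: (mean - k * sd, mean + k * sd) for k, pct in labels}


for pct, (low, high) in sd_rule_ranges(336, 3).items():
    print(pct, low, high)
# 68% 333 339
# 95% 330 342
# 99.7% 327 345
```

The rule only applies to distributions that are approximately normal, as the gestation example assumes.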

Outliers

Data points/observations that fall outside the overall pattern of the distribution and need further investigation before continuing the analysis.

Boxplot Summary

The five-number summary of a distribution consists of the median (M), the two quartiles (Q1, Q3) and the extremes (min, Max). The five-number summary provides a compact numerical description of a distribution. The median describes the center, and the extremes (which give the range) and the quartiles (which give the IQR) describe the spread. The boxplot graphically represents the distribution of a quantitative variable by visually displaying the five-number summary and any observation that was classified as a suspected outlier using the 1.5(IQR) criterion. Boxplots are most useful when presented side-by-side to compare and contrast distributions from two or more groups.

Possibilities for Role Type Classification

Categorical explanatory and quantitative response (C→Q)
Categorical explanatory and categorical response (C→C)
Quantitative explanatory and quantitative response (Q→Q)
Quantitative explanatory and categorical response (Q→C)

Categorical variables

Categorical variables take labels or ranks as values and place (classify) an individual into one of several groups.

Ordinal variables

Categorical variables where there is a natural order among the categories. Ex: What is your mood today? (very good, good, ok, bad, very bad)

Nominal variables

Categorical variables where there is no natural order among the categories Ex: Race

Rules for Interpreting the Correlation Coefficient r

Exactly -1: a perfect downhill (negative) linear relationship
-0.70: a strong downhill (negative) linear relationship
-0.50: a moderate downhill (negative) linear relationship
-0.30: a weak downhill (negative) linear relationship
0: no linear relationship
+0.30: a weak uphill (positive) linear relationship
+0.50: a moderate uphill (positive) linear relationship
+0.70: a strong uphill (positive) linear relationship
Exactly +1: a perfect uphill (positive) linear relationship

What are some Examples of categorical variables?

Examples of categorical variables are a person's eye color, a person's socioeconomic status (low, medium, or high), and a person's political affiliation (Democrat, Republican, or Independent).

What are some Examples of quantitative variables

Examples of quantitative variables are the time you wait in line, the distance between a person's home and work, and the number of text messages a person sends in a day. Quantitative variables always take numerical values. For example, the outside temperature (in °F) can be 50, 66, -20, etc.; the time you wait in line (in minutes) can be 5, 10, or 60.

Probability Rule 1

For any event A, 0 ≤ P(A) ≤ 1. The probability of an event, which informs us of the likelihood of it occurring, can range anywhere from 0 (indicating that the event will never occur) to 1 (indicating that the event is certain). One practical use of this rule is that it can be used to identify any probability calculation that comes out to be more than 1 (or less than 0) as wrong.

Categorical explanatory and quantitative response

Gender is the explanatory variable and it is categorical. Test score is the response variable and it is quantitative. Therefore this is an example of case C→Q.

The distribution of a categorical variable is summarized using:

Graphical display: pie chart or bar chart, supplemented by Numerical summaries: category counts and percentages.

Multimodal

If a distribution has more than two modes, we say that the distribution is multimodal.

Determining which is larger in a histogram, the mean or median

If the distribution is skewed right, the mean will be larger than the median. If the distribution is skewed left, the mean will be smaller than the median.

Value of 0

In ratio variables the value of 0 means the absence of the quantity while in interval variables, the value of 0 does not mean the absence of the quantity.

Counting Intervals

It is very important that each observation be counted only in one interval. The square bracket means "including" and the parenthesis means "not including".
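For example, with intervals written as [0, 10), [10, 20), [20, 30), the value 10 is counted in the second interval only. A short sketch, with a hypothetical function name and made-up data:

```python
def count_in_intervals(data, edges):
    """Count observations in the half-open intervals [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for x in data:
        for i in range(len(counts)):
            # Left end included (square bracket), right end excluded (parenthesis).
            if edges[i] <= x < edges[i + 1]:
                counts[i] += 1
                break  # each observation is counted in one interval only
    return counts


# Intervals [0, 10), [10, 20), [20, 30); the two 10s land in the second one.
print(count_in_intervals([0, 5, 10, 10, 19, 20, 29], [0, 10, 20, 30]))  # [2, 3, 2]
```

With this convention a value equal to the last edge (30 here) is not counted, so the last edge should lie above the largest observation.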

Categorical explanatory and categorical response

Light Type is the explanatory variable and it is categorical. Nearsightedness is the response variable and it is categorical. Therefore this is an example of case C→C.

the slope and intercept of the least squares regression line are found using the following formulas:

Like any other line, the equation of the least-squares regression line for summarizing the linear relationship between the response variable (Y) and the explanatory variable (X) has the form Y = a + bX. All we need to do is calculate the intercept a and the slope b, which is easily done if we know:
X̄: the mean of the explanatory variable's values
SX: the standard deviation of the explanatory variable's values
Ȳ: the mean of the response variable's values
SY: the standard deviation of the response variable's values
r: the correlation coefficient
The slope is b = r(SY / SX) and the intercept is a = Ȳ - bX̄.
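A minimal sketch of the calculation, assuming the standard formulas b = r(SY / SX) and a = Ȳ - bX̄. The function name and the summary values passed in are made up for illustration.

```python
from math import sqrt


def regression_line(x_mean, s_x, y_mean, s_y, r):
    """Return (intercept a, slope b) of the least-squares line Y = a + bX."""
    slope = r * s_y / s_x
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Hypothetical summary statistics, for illustration only.
a, b = regression_line(x_mean=3, s_x=sqrt(2.5), y_mean=4, s_y=sqrt(1.5),
                       r=6 / sqrt(60))
print(round(a, 2), round(b, 2))  # 2.2 0.6
```

The fitted line always passes through the point (X̄, Ȳ), which is what the intercept formula expresses.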

Linear Regression: Summarizing the Pattern of the Data with a Line

Linear regression is the technique of finding the line that best fits the pattern of the linear relationship (or, in other words, the line that best describes how the response variable linearly depends on the explanatory variable). We need to agree on what we mean by "best fits the data"; in other words, we need to agree on a criterion by which we would select this line.

Ratio variables

Meaningful to talk about both the differences and the ratios between the values, because the value 0 means the absence of the quantity. Examples of ratio variables are income, weight, and time.

Interval variables

Meaningful to talk about the differences between the values but not the ratios. Ex: temperature in degrees Fahrenheit.

What are the Two types of categorical variables?

Nominal & Ordinal Variables

Probability

Probability is the "machinery" that allows us to draw conclusions about the population based on the data collected about the sample.

What are the Types of variables?

Quantitative and Categorical

Quantitative variables

Quantitative variables represent a measurement or count and generally answer the question "how much" or "how many". Ex: age, weight, height.

Curvilinear Form

Relationships with a curvilinear form are most simply described as points dispersed around the same curved line

Linear Form

Relationships with a linear form are most simply described as points scattered about a line

Quantitative explanatory and quantitative response

SAT Score is the explanatory variable and it is quantitative. GPA of Freshman Year is the response variable and it is quantitative. Therefore this is an example of case Q→Q.

Skewed Right

The right tail (larger values) is much longer than the left tail (smaller values). Most of the data is on the left side, and the mean is larger than the median. The direction of the skew is the direction of the tail. Ex: salary.

Use of the regression line

The slope of the regression line can be interpreted as the average change in the response variable (Y) when the explanatory variable (X) increases by one unit. The regression line can also be used for prediction.

Standard Deviation

The standard deviation gives the average (or typical) distance between a data point and the mean, x̄. The further the data points are from the mean, the larger the standard deviation. The closer the data points are to the mean, the smaller the standard deviation.

The three main numerical measures for the center of a distribution:

The three main numerical measures for the center of a distribution are the mode, the mean (x̄), and the median (M).

Summary of Center of Distribution

The three main numerical measures for the center of a distribution are the mode, the mean (x̄), and the median (M). The mode is the most frequently occurring value. The mean is the average value, while the median is the middle value. The mean is very sensitive to outliers (as it factors in their magnitude), while the median is resistant to outliers. The mean is an appropriate measure of center only for symmetric distributions with no outliers. In all other cases, the median should be used to describe the center of the distribution.
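A quick check with made-up numbers shows how one outlier moves the mean but not the median. The data sets and the helper name are purely illustrative.

```python
from statistics import median


def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    return sum(values) / len(values)


typical = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 100]  # the largest value replaced by an outlier

print(mean(typical), median(typical))            # 3.0 3
print(mean(with_outlier), median(with_outlier))  # 22.0 3
```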

What are the Two types of quantitative variables?

The two types of quantitative variables are interval variables and ratio variables.

There are two fundamental ways in which we can determine probability:

Theoretical (also known as Classical)
Empirical (also known as Observational)

retrospective observational study

involves recording variables' values that naturally happened in the past.

Population

is the entire group that is the target of our interest

Unimodal

it has one mode (roughly at 10) around which the observations are concentrated

