The ULTIMATE stats study guide
Statistics
Statistics is a process in which we collect data, summarize data, and interpret data.
When describing the shape of a histogram, we should consider the following:
Symmetry/skewness of the distribution. Peakedness (modality): the number of peaks (modes) the distribution has.
Multistage sampling
Taking a large population and making it progressively smaller by using methods such as stratified sampling, cluster sampling, or simple random sampling.
Cluster Sampling
Target population is divided into subgroups & entire subgroups are randomly selected
Stratified Sampling
Target population is divided into subgroups & subjects from each subgroup are randomly chosen to yield a representative sample
Using the IQR to Detect Outliers
The 1.5(IQR) criterion for outliers: an observation is considered a suspected outlier if it is less than Q1 - 1.5(IQR) or more than Q3 + 1.5(IQR).
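A minimal Python sketch of the 1.5(IQR) criterion, assuming Q1 and Q3 have already been found. The function name and the data values are hypothetical, chosen only for illustration.

```python
def suspected_outliers(data, q1, q3):
    """Return the observations flagged by the 1.5(IQR) criterion."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [x for x in data if x < lower_fence or x > upper_fence]

# Hypothetical data with Q1 = 4 and Q3 = 8, so IQR = 4 and the fences are -2 and 14.
values = [1, 4, 5, 6, 7, 8, 30]
flagged = suspected_outliers(values, q1=4, q3=8)  # only 30 lies beyond a fence
```

A flagged value is only a suspected outlier; it still needs to be examined before it is kept or dropped.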
Midpoint
The center of the distribution is the value that divides the distribution so that approximately half the observations take smaller values, and approximately half the observations take larger values.
Correlation coefficient (r)
The correlation coefficient (r) is a numerical measure of the strength and direction of a linear relationship between two quantitative variables.
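The guide does not give a formula for r. As a rough sketch, the standard Pearson formula can be computed as below; the function name and the data are hypothetical.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient r for paired quantitative data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data lying exactly on a line.
r_up = correlation([1, 2, 3, 4], [2, 4, 6, 8])    # perfectly increasing line
r_down = correlation([1, 2, 3, 4], [8, 6, 4, 2])  # perfectly decreasing line
```

The sign of r gives the direction of the relationship, and how close r is to -1 or +1 gives its strength.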
Uniform
The distribution has no modes, or no value around which the observations are concentrated.
Empirical Methods
The empirical way of finding probability uses a series of trials to determine (actually, estimate) the probability of an event. Each such trial produces an outcome that cannot be predicted in advance. The probability is determined based on experience.
Skewed Left
The left tail (smaller values) is much longer than the right tail (larger values). Most of the data is on the right side of the graph. The median is larger than the mean (median > mean). Ex: age at death.
Weak Relationship
The points also follow the linear pattern, but much less closely.
Quantitative explanatory and categorical response
Time is the explanatory variable and it is quantitative. Driving Test Outcome is the response variable and it is categorical. Therefore this is an example of case Q→C.
Inference
Use what we've discovered about our sample to draw conclusions about our population
Choosing Numerical Summaries
Use x̄ (the mean) and the standard deviation as measures of center and spread only for reasonably symmetric distributions with no outliers. Use the median and IQR as measures of center and spread for all other cases.
Classical Methods
Used for games of chance, such as flipping coins, rolling dice, spinning spinners, roulette wheels, or lotteries. They are "classical" because their values are determined by the game itself, that is, by theory.
Strong Relationship
The data points follow the linear pattern quite closely. This is an example of a strong relationship.
Split Stemplots
When some of the stems hold a large number of leaves, it is common for statistical software to split each stem into two: the first holding the leaves 0-4, and the second holding the leaves 5-9. Note that all stems have to be split, not only the crowded ones.
Simpson's paradox
Whenever including a lurking variable causes us to rethink the direction of an association
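A small numeric sketch of how the direction of an association can reverse once a lurking variable is included. All counts are invented for illustration, and the treatment and subgroup labels are hypothetical.

```python
# Hypothetical (successes, trials) counts, invented for illustration only.
counts = {
    "mild cases":   {"A": (9, 10),   "B": (80, 100)},
    "severe cases": {"A": (30, 100), "B": (2, 10)},
}

def rate(successes, trials):
    return successes / trials

# Within each subgroup, treatment A has the higher success rate.
a_better_in_each_group = all(
    rate(*group["A"]) > rate(*group["B"]) for group in counts.values()
)

def overall_rate(treatment):
    """Success rate when the subgroups (the lurking variable) are ignored."""
    successes = sum(group[treatment][0] for group in counts.values())
    trials = sum(group[treatment][1] for group in counts.values())
    return successes / trials

overall_a = overall_rate("A")  # 39 successes out of 110
overall_b = overall_rate("B")  # 82 successes out of 110
```

In these made-up numbers, A does better within each subgroup, yet B does better overall because most of A's cases are severe.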
Equation of a straight line
Y = a + bX. The intercept a is the value of Y when X = 0. The slope b is the change in Y for every increase of 1 unit in X.
Variable
A particular characteristic of the individual. Ex: gender, age, weight.
Stemplot
Also called a stem-and-leaf plot. The leaf is the right-most digit. The stem is everything except the right-most digit. So, if the data point is 34, then 3 is the stem and 4 is the leaf. If the data point is 3.41, then 3.4 is the stem and 1 is the leaf.
The Law of Large Numbers
as the number of trials increases, the empirical probability gets closer and closer to the theoretical probability.
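A short simulation sketch of this idea, assuming a fair coin (theoretical probability of heads 0.5). The function name, the seed, and the trial counts are arbitrary choices for illustration.

```python
import random

def relative_frequency_of_heads(num_flips, seed=0):
    """Estimate P(heads) empirically by simulating fair coin flips."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = sum(1 for _ in range(num_flips) if rng.random() < 0.5)
    return heads / num_flips

# Empirical probabilities for a small, a medium, and a large number of trials.
estimates = {n: relative_frequency_of_heads(n) for n in (10, 1_000, 100_000)}
```

With few flips the estimate can be far from 0.5; with many flips it should settle close to 0.5.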
Variance
average of the squared deviations is called the variance of the data.
Producing Data
choosing a sample and collecting data
two-way table (also called a contingency table)
comparing two categorical variables
Spread/Variability
described by the approximate range covered by the data
First Quartile (Q1)
lower quartile
IQR
Measures the variability of a distribution; gives the range covered by the middle 50% of the data.
Exploratory Data Analysis consists of:
organizing and summarizing data, discovering important features and patterns in the data and any striking deviations from those patterns, and then interpreting our findings in the context of the problem.
Individual
A particular person or object; individuals are also called units.
Data
pieces of information about individuals organized into variables
Relative Frequency
proportion of times the event happened; the number of times the event happened divided by the total number of trials
prospective observational study
records the values of variables (in this case, baby's growth) as they naturally happen forward in time.
scatterplot
relationship between two variables which are both quantitative
A store surveyed 250 of its customers to study the relationship between the amount spent on groceries and income. This is an example of:
scatterplot
systematic sampling
Selecting samples based on a set schedule or plan. Ex: picking every 50th name on a list.
In order to study whether IQ level is related to gender, data were collected from a sample of 540. This is an example of:
side-by-side boxplots
Exploratory Data Analysis
summarizing the collected data
A survey was conducted to study the relationship between the zip code of the family home and whether they buy or rent the home. Data were collected from a random sample of 280 families from a certain metropolitan area. This is an example of:
two-way table
Third Quartile (Q3)
upper quartile
To display data from one quantitative variable graphically
use either the histogram or the stemplot
observational study
values of the variable or variables of interest are recorded as they naturally occur. There is no interference by the researchers who conduct the study.
Probability of an Event
When all outcomes in the sample space S are equally likely, we can find the probability of any event A by dividing the number of outcomes in A by the number of outcomes in S.
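A minimal sketch of this counting rule, assuming every outcome in S is equally likely (as in a fair die roll). The function name and the die example are illustrative.

```python
from fractions import Fraction

def probability(event, sample_space):
    """P(A) = (number of outcomes in A) / (number of outcomes in S)."""
    favorable = [outcome for outcome in sample_space if outcome in event]
    return Fraction(len(favorable), len(sample_space))

die = {1, 2, 3, 4, 5, 6}               # sample space S for one roll of a fair die
p_even = probability({2, 4, 6}, die)   # 3 of the 6 outcomes, so 1/2
```

Because the result is a count divided by a larger or equal count, it always lands between 0 and 1.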
sample survey
A particular type of observational study in which individuals report variables' values themselves, frequently by giving their opinions.
Response Variable
(also commonly referred to as the dependent variable) The outcome of the study. Denoted by Y.
Bimodal
it has two modes (roughly at 10 and 20) around which the observations are concentrated.
Explanatory Variable
(also commonly referred to as the independent variable) The variable that claims to explain, predict, or affect the response. Denoted by X.
The probability that an event will occur
0 ≤ P(A) ≤ 1
How is the IQR found?
1) Arrange the data in increasing order and find the median M. 2) Find the median of the lower 50% of the data; this is the first quartile, denoted Q1. (The lower and upper 50% of the data are the observations whose positions in the ordered list are to the left and right, respectively, of the location of the overall median M.) 3) Find the median of the upper 50% of the data; this is the third quartile, denoted Q3. 4) The middle 50% of the data falls between Q1 and Q3, so IQR = Q3 - Q1. Note: when n is odd (for example, n = 7), the median is not included in either the lower or upper half of the data; when n is even (for example, n = 8), the data divide naturally into two halves.
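The steps above can be sketched in Python as follows. The function names and data are hypothetical, and statistical software may use a different quartile convention and give slightly different values.

```python
def median(sorted_values):
    """Median of a list that is already sorted."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2

def quartiles(data):
    """Return (Q1, M, Q3, IQR) using the median-of-halves method."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # left of the median's location
    upper = ordered[half + (n % 2):]  # right of it; skips M itself when n is odd
    q1, q3 = median(lower), median(upper)
    return q1, median(ordered), q3, q3 - q1

# Hypothetical data with n = 7: halves are [2, 5, 7] and [12, 15, 20].
q1, m, q3, iqr = quartiles([2, 5, 7, 9, 12, 15, 20])
```

For this data the sketch gives Q1 = 5, M = 9, Q3 = 15, and IQR = 10.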
Standard Deviation Formula
1) Find the mean of the data (add all the numbers, then divide by the number of observations). 2) Subtract the mean from each number to get the deviations. 3) Square each resulting deviation. 4) Average the squared deviations by adding them up and dividing by n - 1 (the number of observations minus 1); this is the variance. 5) The SD of the data is the square root of the variance.
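The five steps map directly onto a few lines of Python. This is a sketch with a hypothetical function name and made-up data, dividing by n - 1 as in step 4.

```python
import math

def sample_standard_deviation(data):
    n = len(data)
    mean = sum(data) / n                     # step 1: mean
    deviations = [x - mean for x in data]    # step 2: deviations from the mean
    squared = [d ** 2 for d in deviations]   # step 3: squared deviations
    variance = sum(squared) / (n - 1)        # step 4: variance (divide by n - 1)
    return math.sqrt(variance)               # step 5: square root of the variance

# Hypothetical data: mean is 5, squared deviations sum to 32, variance is 32/7.
sd = sample_standard_deviation([2, 4, 4, 4, 5, 5, 7, 9])
```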
Big Picture of Statistics
1. Producing Data: choosing a sample from the population of interest and collecting data. 2. Exploratory Data Analysis (EDA): summarizing the data we've collected. 3. Probability. 4. Inference: drawing conclusions about the entire population based on the data collected from the sample.
lurking variable
A lurking variable is a variable that is not among the explanatory or response variables in a study, but could substantially affect your interpretation of the relationship among those variables. We say that the lurking variable is confounded with the explanatory variable, since their effects on the response variable cannot be distinguished from each other.
correlation (r)
A measurement of the direction and strength of a linear relationship between two quantitative variables. Extremely important!
Negative Relationship
A negative (or decreasing) relationship means that an increase in one of the variables is associated with a decrease in the other.
Positive Relationship
A positive (or increasing) relationship means that an increase in one of the variables is associated with an increase in the other.
Standard Deviation Example: In general, the larger the animal the longer the length of pregnancy (also called gestation period). For the horse, for example, the gestation period varies roughly according to a normal distribution with a mean of 336 days and a standard deviation of 3 days (Source: These figures are from Moore and McCabe, Introduction to the Practice of Statistics ). Use the Standard Deviation Rule to answer the following questions: (a picture of the SD rule applied to this distribution will help).
Almost all (99.7%) horse pregnancies fall in what range of lengths? Choices: above 336 days; below 336 days; between 333 and 339 days; between 330 and 342 days; between 327 and 345 days. Answer: between 327 and 345 days. The Standard Deviation Rule tells us that virtually all the data fall within 3 standard deviations of the mean, which in this case is between 336 - 3(3) = 327 and 336 + 3(3) = 345.
Least Square Criterion
Among all the lines that look good on your data, choose the one that has the smallest sum of squared vertical deviations. This line is called the least-squares regression line.
Association does not imply causation
An observed association between two variables is not enough evidence that there is a causal relationship between them
The Standard Deviation Rule
For a distribution that is approximately normal (bell-shaped): Approximately 68% of the observations fall within 1 standard deviation of the mean. Approximately 95% of the observations fall within 2 standard deviations of the mean. Approximately 99.7% (or virtually all) of the observations fall within 3 standard deviations of the mean.
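A small sketch that turns the rule into ranges, applied to the horse gestation figures used in this guide (mean 336 days, standard deviation 3 days). The function name is hypothetical, and the rule is assumed to apply because the distribution is roughly normal.

```python
def sd_rule_ranges(mean, sd):
    """Ranges covering about 68%, 95%, and 99.7% of a roughly normal distribution."""
    return {
        "68%": (mean - sd, mean + sd),
        "95%": (mean - 2 * sd, mean + 2 * sd),
        "99.7%": (mean - 3 * sd, mean + 3 * sd),
    }

horse = sd_rule_ranges(mean=336, sd=3)
# horse["68%"]   is (333, 339)
# horse["95%"]   is (330, 342)
# horse["99.7%"] is (327, 345)
```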
Outliers
Are data points/observations that fall outside the overall pattern of the distribution and need further research before continuing the analysis.
Boxplot Summary
The five-number summary of a distribution consists of the median (M), the two quartiles (Q1, Q3) and the extremes (min, Max). The five-number summary provides a compact numerical description of a distribution. The median describes the center, and the extremes (which give the range) and the quartiles (which give the IQR) describe the spread. The boxplot graphically represents the distribution of a quantitative variable by visually displaying the five-number summary and any observation that was classified as a suspected outlier using the 1.5(IQR) criterion. Boxplots are most useful when presented side-by-side to compare and contrast distributions from two or more groups.
Possibilities for Role Type Classification
Categorical explanatory and quantitative response (C→Q); categorical explanatory and categorical response (C→C); quantitative explanatory and quantitative response (Q→Q); quantitative explanatory and categorical response (Q→C).
Categorical variables
Categorical variables take values that are labels or ranks and place (classify) an individual into one of several groups.
Ordinal variables
Categorical variables where there is a natural order among the categories. Ex: What is your mood today? (very good, good, OK, bad, very bad)
Nominal variables
Categorical variables where there is no natural order among the categories. Ex: race.
Rules for Interpreting the Correlation Coefficient R
Exactly -1: a perfect downhill (negative) linear relationship. -0.70: a strong downhill (negative) linear relationship. -0.50: a moderate downhill (negative) linear relationship. -0.30: a weak downhill (negative) linear relationship. 0: no linear relationship. +0.30: a weak uphill (positive) linear relationship. +0.50: a moderate uphill (positive) linear relationship. +0.70: a strong uphill (positive) linear relationship. Exactly +1: a perfect uphill (positive) linear relationship.
What are some Examples of categorical variables?
Examples of categorical variables are a person's eye color, a person's socioeconomic status (low, medium, or high), and a person's political affiliation (Democrat, Republican, or Independent).
What are some Examples of quantitative variables?
Examples of quantitative variables are the time you wait in line, the distance between a person's home and work, and the number of text messages a person sends in a day. Quantitative variables always take numerical values. For example, the outside temperature (in °F) can be 50, 66, -20, etc.; the time you wait in line (in minutes) can be 5, 10, or 60.
Probability Rule 1
For any event A, 0 ≤ P(A) ≤ 1. The probability of an event, which informs us of the likelihood of it occurring, can range anywhere from 0 (indicating that the event will never occur) to 1 (indicating that the event is certain). One practical use of this rule is that it can be used to identify any probability calculation that comes out to be less than 0 or more than 1 as wrong.
Categorical explanatory and quantitative response
Gender is the explanatory variable and it is categorical. Test score is the response variable and it is quantitative. Therefore this is an example of case C→Q.
The distribution of a categorical variable is summarized using:
Graphical display: pie chart or bar chart, supplemented by Numerical summaries: category counts and percentages.
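A short sketch of the numerical summaries (category counts and percentages). The function name and the eye-color responses are hypothetical.

```python
from collections import Counter

def summarize_categorical(values):
    """Map each category to (count, percentage of all observations)."""
    counts = Counter(values)
    total = len(values)
    return {category: (n, 100 * n / total) for category, n in counts.items()}

# Hypothetical eye-color responses from five individuals.
summary = summarize_categorical(["brown", "blue", "brown", "green", "brown"])
# summary["brown"] is (3, 60.0): 3 of the 5 responses, or 60%.
```

These counts or percentages are the heights of the bars in a bar chart or the slice sizes in a pie chart.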
Multimodal
If a distribution has more than two modes, we say that the distribution is multimodal.
Determining which is larger in a histogram, the mean or the median
If the distribution is skewed right, the mean will be larger than the median. If it is skewed left, the mean will be smaller than the median.
Value of 0
In ratio variables the value of 0 means the absence of the quantity while in interval variables, the value of 0 does not mean the absence of the quantity.
Counting Intervals
It is very important that each observation be counted in only one interval. The square bracket means "including" and the parenthesis means "not including".
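A sketch of counting with half-open intervals of the form [a, b), so that a value sitting exactly on a boundary is counted once, in the interval that starts there. The function name, the data, and the interval edges are hypothetical.

```python
def count_in_intervals(data, edges):
    """Count observations in the intervals [edges[i], edges[i + 1])."""
    counts = [0] * (len(edges) - 1)
    for x in data:
        for i in range(len(edges) - 1):
            # Left end included (square bracket), right end excluded (parenthesis).
            if edges[i] <= x < edges[i + 1]:
                counts[i] += 1
                break  # each observation is counted in only one interval
    return counts

# Intervals [10, 20), [20, 30), [30, 40): the value 20 goes in the second one.
counts = count_in_intervals([10, 15, 20, 20, 29, 30], edges=[10, 20, 30, 40])
```

Here the counts come out as 2, 3, and 1, which add up to the 6 observations.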
Categorical explanatory and categorical response
Light Type is the explanatory variable and it is categorical. Nearsightedness is the response variable and it is categorical. Therefore this is an example of case C→C.
the slope and intercept of the least squares regression line are found using the following formulas:
Like any other line, the equation of the least-squares regression line for summarizing the linear relationship between the response variable (Y) and the explanatory variable (X) has the form Y = a + bX. All we need to do is calculate the intercept a and the slope b, which is easily done if we know: X̄, the mean of the explanatory variable's values; SX, the standard deviation of the explanatory variable's values; Ȳ, the mean of the response variable's values; SY, the standard deviation of the response variable's values; and r, the correlation coefficient. The slope is b = r(SY / SX) and the intercept is a = Ȳ - bX̄.
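A sketch of the calculation from those five summary values, using the standard least-squares results b = r(SY / SX) and a = Ȳ - bX̄. The function name and the summary statistics are hypothetical.

```python
def least_squares_line(mean_x, sd_x, mean_y, sd_y, r):
    """Return (intercept a, slope b) of the line Y = a + bX."""
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical summary statistics, invented for illustration only.
a, b = least_squares_line(mean_x=10, sd_x=2, mean_y=50, sd_y=8, r=0.5)

# Use the line for prediction at X = 12.
predicted = a + b * 12
```

With these numbers the slope is 2, the intercept is 30, and the predicted Y at X = 12 is 54. The line always passes through the point (X̄, Ȳ).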
Linear Regression: Summarizing the Pattern of the Data with a Line
Linear regression is the technique of finding the line that best fits the pattern of the linear relationship (or, in other words, the line that best describes how the response variable linearly depends on the explanatory variable). We need to agree on what we mean by "best fits the data"; in other words, we need to agree on a criterion by which we would select this line.
Ratio variables
Meaningful to talk about both the differences and the ratios between the values. Examples of ratio variables are income, weight, and time.
Interval variables
Meaningful to talk about the differences between the values but not the ratios. Ex: temperature (in °F or °C).
What are the Two types of categorical variables?
Nominal & Ordinal Variables
Probability
Probability is the "machinery" that allows us to draw conclusions about the population based on the data collected about the sample.
What are the Types of variables?
Quantitative and Categorical
Quantitative variables
Quantitative variables represent a measurement or count and generally answer the question "how much?" or "how many?". Ex: age, weight, height.
Curvilinear Form
Relationships with a curvilinear form are most simply described as points dispersed around the same curved line
Linear Form
Relationships with a linear form are most simply described as points scattered about a line
Quantitative explanatory and quantitative response
SAT Score is the explanatory variable and it is quantitative. GPA of Freshman Year is the response variable and it is quantitative. Therefore this is an example of case Q→Q.
Skewed Right
The right tail (larger values) is much longer than the left tail (smaller values). Most of the data is on the left side; the tail is on the right. To name the skew, look at the tail. The mean is larger than the median (mean > median). Ex: salary.
Use of the regression line
The slope of the regression line can be interpreted as the average change in the response variable (Y) when the explanatory variable (X) increases by one unit. The line can also be used for prediction.
Standard Deviation
The standard deviation gives the average (or typical) distance between a data point and the mean, x̄. In other words, it measures on average how far the data points are from their mean. The further the data points are from the mean, the larger the standard deviation. The closer the data points are to the mean, the smaller the standard deviation.
The three main numerical measures for the center of a distribution:
The three main numerical measures for the center of a distribution are the mode, the mean (x̄), and the median (M).
Summary of Center of Distribution
The three main numerical measures for the center of a distribution are the mode, the mean (x̄), and the median (M). The mode is the most frequently occurring value. The mean is the average value, while the median is the middle value. The mean is very sensitive to outliers (as it factors in their magnitude), while the median is resistant to outliers. The mean is an appropriate measure of center only for symmetric distributions with no outliers. In all other cases, the median should be used to describe the center of the distribution.
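A quick check of how an outlier affects the mean but not the median, using Python's standard `statistics` module. The two data sets are made up for illustration.

```python
import statistics

# Hypothetical data: identical except that the largest value becomes an outlier.
without_outlier = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 100]

# How far each measure of center moves when the outlier is introduced.
mean_shift = statistics.mean(with_outlier) - statistics.mean(without_outlier)
median_shift = statistics.median(with_outlier) - statistics.median(without_outlier)
```

The mean moves from 3 to 22 (a shift of 19), while the median stays at 3.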
What are the Two types of quantitative variables?
The two types of quantitative variables are interval variables and ratio variables.
There are two fundamental ways in which we can determine probability:
Theoretical (also known as Classical) and Empirical (also known as Observational)
retrospective observational study
involves recording variables' values that naturally happened in the past.
Population
is the entire group that is the target of our interest
Unimodal
it has one mode (roughly at 10) around which the observations are concentrated