Stat 161 Midterm 1
continuous random variable
(X takes all values in a given interval of numbers) - The probability distribution of a continuous random variable is shown by a density curve. - The probability that X falls within an interval of numbers is the area under the density curve between the interval endpoints - The probability that a continuous random variable X is exactly equal to a number is zero
Finding relative frequency table
1) Count the total number of items. In this chart the total is 40. 2) Divide the count (the frequency) by the total number. For example, 1/40 = .025 or 3/40 = .075.
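The two steps above can be sketched in Python. The category labels and the raw data here are hypothetical, chosen only so that the total is 40 and the counts of 1 and 3 reproduce the worked values.

```python
from collections import Counter

# Hypothetical raw data: 40 observations in four made-up categories.
observations = ["A"] * 1 + ["B"] * 3 + ["C"] * 16 + ["D"] * 20

counts = Counter(observations)   # frequency of each category
total = sum(counts.values())     # step 1: total number of items

# Step 2: divide each frequency by the total.
relative_freq = {category: count / total for category, count in counts.items()}

for category, freq in relative_freq.items():
    print(category, counts[category], round(freq, 3))
```

The relative frequencies should always add up to 1, which is a quick way to check the table.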
Properties of r
1. Always between -1 and 1 2. Is affected by outliers--> one outlier can severely alter the association. If the outlier is in the middle of the distribution, it will affect r less than if the outlier is at one of the ends 3. Is not affected by linear transformations such as a change of units (r stays the same when the points are shifted or rescaled by a positive multiplier)
Describing a distribution
1. Describe shape 2. Describe variability 3. Describe center 4. Describe/ identify outliers
Correlation coefficient tells us
1. The direction and strength of the linear relationship, i.e. how a change in the explanatory variable corresponds to a change in the response variable 2. Can help you make vague predictions
R squared analysis
1. The overall prediction error drops by 73.3% when using the linear regression model as compared to the naive prediction model that uses the average 2. 73.3% of the variability in the longevity of animals can be explained by their varying gestational period
Example of ratio of proportions
1. The proportion of those contracting Covid in the vaccine group was 0.3 times the proportion in the placebo group 2. Those in the vaccine group were 70% less likely to contract Covid as compared to those who received the placebo 3. Those in the placebo group were 3.3 times as likely to contract Covid as compared to those receiving the vaccine 4. (ratio of 1.6) Those in the placebo group were 60% more likely to contract Covid as compared to those receiving the vaccine
r cannot tell you...
1. You cannot determine the magnitude of the slope from r (but you can get the sign) 2. Cannot make predictions about values of the response variable based on the explanatory variable
Variability (s squared)
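s squared = [1/(n-1)] x (sum of the squared distances from the mean): square each observation's distance from the mean, add the squares together, then divide by n-1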
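A minimal sketch of the computation, assuming the usual sample formula with n-1 in the denominator. The data values are made up, and the function name `sample_variance` is only illustrative.

```python
import statistics

def sample_variance(values):
    """Sum of squared distances from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    squared_distances = [(x - mean) ** 2 for x in values]
    return sum(squared_distances) / (n - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical observations, mean = 5
s_squared = sample_variance(data)
s = s_squared ** 0.5              # the standard deviation is the square root

# Cross-check against the standard library's implementation.
print(s_squared, statistics.variance(data))
```

Note that each distance is squared before summing; the raw distances from the mean would cancel out to zero.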
Box plot
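A graph that displays the lowest and highest quarters of the data as whiskers, the middle two quarters of the data as a box, and the median as a line inside the box. The whiskers extend to the most extreme observations that are no lower than Q1 - 1.5 IQR and no higher than Q3 + 1.5 IQR; points beyond those limits are plotted individually as potential outliers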
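As a rough illustration of the 1.5 IQR limits, the sketch below takes Q1 and Q3 as given inputs (software packages compute quartiles in slightly different ways, so they are not computed here). The data, the quartile values, and the function names are hypothetical.

```python
def whisker_limits(q1, q3):
    """Return the lowest and highest values a whisker may reach."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def potential_outliers(values, q1, q3):
    """Observations that fall outside the whisker limits."""
    low, high = whisker_limits(q1, q3)
    return [x for x in values if x < low or x > high]

data = [1, 12, 14, 15, 16, 18, 20, 21, 40]   # hypothetical observations
q1, q3 = 14, 20                               # assumed quartiles for this data

print(whisker_limits(q1, q3))
print(potential_outliers(data, q1, q3))
```

With an IQR of 6, the limits are 5 and 29, so the observations 1 and 40 would be flagged.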
normal distribution
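A bell-shaped probability curve described by its mean and standard deviation. Normal distributions are symmetrical about the mean. In the standard normal distribution (the one used with z-scores) the mean is zero and the standard deviation is 1.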
Shift
Addition or subtraction from the original points
Interpretation of the slope
As the explanatory variable increases by one unit (check the x-axis for the actual units and specify), the response variable is predicted to change by the value of the slope (an increase if the slope is positive, a decrease if it is negative)
Right-skewed Distribution
Asymmetric distribution with the tail on the right side, mean is more toward the right and the median is more toward the left
Graphs describing categorical variables
Bar and pie charts
Define Given
Conditional probability
Depicting relationship between two categorical variables
Construct a contingency table; the table can then be displayed as side-by-side bar charts
R squared
Describes the difference in prediction abilities between the naive prediction model (using averages) and the linear regression model. Equivalently, it is the proportion of the variability in the response variable that is explained by the explanatory variable
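One way to see this comparison is to compute both prediction errors directly. The sketch below is an illustration under stated assumptions: the data are made up, the function names are hypothetical, and the line is fitted with the least squares formulas (slope = Sxy / Sxx), which gives the same slope as b = r(Sy/Sx).

```python
def least_squares_line(xs, ys):
    """Return intercept a and slope b of the least squares line."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b = sxy / sxx
    a = y_mean - b * x_mean
    return a, b

def r_squared(xs, ys):
    """Proportional drop in squared prediction error versus the naive model."""
    a, b = least_squares_line(xs, ys)
    y_mean = sum(ys) / len(ys)
    naive_error = sum((y - y_mean) ** 2 for y in ys)                        # predict the average
    regression_error = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # predict with the line
    return (naive_error - regression_error) / naive_error

xs = [1, 2, 3, 4, 5]   # hypothetical explanatory values
ys = [2, 4, 5, 4, 5]   # hypothetical response values
print(round(r_squared(xs, ys), 3))
```

For this data the naive error is 6 and the regression error is 2.4, so the error drops by 60%.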
What does Unbiased mean
Equal chance
Slope interpretation for single regression model
Example: For every 10 percentage point increase in the education level, we predict the crime rate to increase by 14.9 crimes per thousand
Addition rule ( generalized for any two events)
For any two events A and B, the probability of A or B is the sum of the probability of A and the probability of B minus the shared probability of both A and B P(A or B) = P(A) + P(B) - P(A and B)
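A small check of the rule on one roll of a fair die. The events A and B are hypothetical examples, and `prob` is an illustrative helper that counts equally likely outcomes.

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}   # one roll of a fair die
A = {2, 4, 6}                   # event A: the roll is even
B = {4, 5, 6}                   # event B: the roll is greater than 3

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(outcomes))

# Addition rule: subtract the shared outcomes so they are not counted twice.
p_rule = prob(A) + prob(B) - prob(A & B)

# Direct count of the outcomes in A or B.
p_direct = prob(A | B)

print(p_rule, p_direct)
```

Both approaches give 2/3, because the shared outcomes 4 and 6 are only counted once.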
Probability value
Found in the middle of the z table
Graphs describing quantitative variables
Histograms, stem and leaf and box plot
What is sigma value?
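The standard deviation of the population. Distances from the mean are measured in multiples of sigma, showing how far a sample or data point is away from the mean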
Mutually exclusive
If and ONLY if the two events have no shared outcomes
Standard Deviation
If the distances from the mean are simply added up, they always sum to zero, which does not help, so the standard deviation is defined as the square root of the variability (s squared). This measure tells you how spread apart the observations are and roughly how far, on average, a point falls from the mean
How to identify if variables are independent
If the conditional distribution/probabilities for one variable are the same for each category of the other variable, these are independent random variables. If the two variables are correlated, then they are not independent.
Randomness in choosing subjects
Important to randomly pick subjects; otherwise the sample will not be representative of the entire population. Random selection also helps reduce bias
Multiplicative rule for intersections
The intersection is the event that both (or all) of the events you are considering happen at the same time, which is less likely than any one of them alone. For independent events, P(A and B) = P(A) x P(B). In general, P(A and B) = P(A) x P(B|A).
z-score graph to get sign of r
Look at the number of negative and positive contributions (the products ZxZy) to the overall sum; if there are more negative contributors, the relationship is likely negative and vice versa.
Fitting of Linear Regression (least squares)
Minimization of the sum of all squared residuals
Correlation does not equal causation
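No matter how strongly correlated the values are (for example, a high R squared value), and no matter the model that you are using, correlation alone does not show that the explanatory variable causes the change in the response variable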
Empirical Rule
Only applies to bell-shaped curves. One can describe the distribution as follows: 1. 68% of the data falls within 1 standard deviation from the mean--> mean +/- 1s 2. 95% of the data falls within 2 standard deviations from the mean --> mean +/- 2s 3. 99.7% of the data falls within 3 standard deviations from the mean --> mean +/- 3s
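The percentages can be checked against the exact areas under a normal curve. This sketch uses `statistics.NormalDist` from the Python standard library (available in Python 3.8 and later); the helper name `area_within` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def area_within(k):
    """Area under the normal curve within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    # Roughly 0.68, 0.95 and 0.997 for k = 1, 2, 3.
    print(k, round(area_within(k), 4))
```

The exact areas are about 68.3%, 95.4% and 99.7%, which the rule rounds for easy recall.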
complement rule
P(A^c) = 1 - P(A) states that the sum of the probabilities of an event and its complement must equal 1.
conditional probability rule
P(A|B) = P(A and B) / P(B) the measure of the probability of an event occurring, given that another event has already occurred.
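A short check of the formula on one roll of a fair die. The events A and B are hypothetical examples, and `prob` is an illustrative helper.

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}   # one roll of a fair die
A = {2, 4, 6}                   # event A: the roll is even
B = {4, 5, 6}                   # event B: the roll is greater than 3

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(outcomes))

# P(A|B) = P(A and B) / P(B)
p_a_given_b = prob(A & B) / prob(B)

# Same answer by restricting attention to the outcomes in B only.
p_by_counting = Fraction(len(A & B), len(B))

print(p_a_given_b, p_by_counting)
```

Once we know the roll is greater than 3, two of the three remaining outcomes are even, so P(A|B) = 2/3 even though P(A) = 1/2.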
Quartiles
Q1= 25% of the data falls below this point Q2= median, center of the distribution Q3= 75% of the points fall below this point
Interquartile range
Q3-Q1, describes the middle 50% of the distribution
Example of Joint distribution
Q: What proportion of all students who responded are freshman and prefer milk chocolate?
Example of Conditional Distribution
Q: What proportion of freshman prefer milk chocolate?
Example of marginal distribution
Q: what proportion of students prefer milk chocolate?
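The three kinds of question differ only in which counts go in the numerator and denominator. The sketch below uses a made-up contingency table (class year by chocolate preference); all counts are hypothetical.

```python
# Hypothetical contingency table: rows = class year, columns = preferred chocolate.
table = {
    "freshman": {"milk": 30, "dark": 20},
    "senior": {"milk": 20, "dark": 30},
}

grand_total = sum(sum(row.values()) for row in table.values())

# Joint: freshman AND milk, out of all students.
joint = table["freshman"]["milk"] / grand_total

# Conditional: milk, out of freshmen only.
conditional = table["freshman"]["milk"] / sum(table["freshman"].values())

# Marginal: milk, out of all students, ignoring class year.
marginal = sum(row["milk"] for row in table.values()) / grand_total

print(joint, conditional, marginal)
```

In this made-up table the conditional proportion for freshmen (0.6) differs from the marginal proportion (0.5), which would suggest that preference and class year are not independent.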
random sampling
Randomly selecting people from the population of interest as my sample
Contingency table analysis
Row= explanatory variable Column= response variable Cell= the observation count at those categorical variables
Statistical life cycle
Statistical question--> collecting data--> processing data <--> explanatory variable--> learning from the data --> report--> statistical question
Computation of r
r = [sum of (Zx x Zy)] / (n-1)
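A minimal sketch of this computation: convert each x and y to a z-score, multiply the pairs, add the products, and divide by n-1. The data are hypothetical, and the function name `correlation` is illustrative.

```python
import statistics

def correlation(xs, ys):
    """r as the sum of z-score products divided by n - 1."""
    n = len(xs)
    x_mean, y_mean = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)   # sample standard deviations
    z_x = [(x - x_mean) / s_x for x in xs]
    z_y = [(y - y_mean) / s_y for y in ys]
    return sum(zx * zy for zx, zy in zip(z_x, z_y)) / (n - 1)

xs = [1, 2, 3, 4, 5]   # hypothetical explanatory values
ys = [2, 4, 5, 4, 5]   # hypothetical response values
r = correlation(xs, ys)
print(round(r, 3))
```

Points where both z-scores have the same sign contribute positive products, which is why mostly positive contributions lead to a positive r.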
density curve
The area under the curve is equal to 100 percent of all probabilities. As we usually use decimals in probabilities you can also say that the area is equal to 1
Negative Residual
The linear regression line overestimates the value of the response variable given a specific value of the explanatory variable
Positive Residual
The linear regression line underestimates the value of the response variable given a specific value of the explanatory variable
Difference of Conditional Proportions
The percentage of patients who contracted Covid was 0.89 percentage points lower for those who received the vaccine as compared to those receiving the placebo
When do you use the complement rule
When two mutually exclusive events are complements of each other, that is, when together they cover every possible outcome
linear correlation
a measure of dependence between two random variables that can take values between -1 and 1
population parameter
a numerical summary that describes a characteristic of a population.
random phenomenon
a situation in which we know what outcomes can occur, but we do not know which outcome will occur. We cannot predict each outcome
Census
a study of every unit, everyone or everything, in a population.
retrospective study
a study that looks backwards, so you will observe the subjects' past data. For example, you select a sample of kids, and look at their records and see how many extra words they speak from age 2 t
prospective studies
a study that looks forward, so you will observe the subjects from now into the future. For example, you select a sample of 24-month-old babies now and observe how many extra words they can speak 1 year later.
continuous variable
a variable whose value is obtained by measuring (height, time, distance)
Equation for intercept (a)
a= y(mean)-b(x mean)
Residuals
actual/observed value - predicted value (both of the response variable)
Left-skewed Distribution
asymmetric distribution with the tail in the left side, the mean is more toward the left and the median is more toward the right
(x mean, y mean)
average point will lie directly on the linear regression line
Equation for slope (b)
b= r(Sy/Sx)
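The slope and intercept formulas can be combined in a short sketch. The data are hypothetical, the function name is illustrative, and r is computed from z-scores with the sample standard deviations.

```python
import statistics

def slope_and_intercept(xs, ys):
    """Return (a, b) using b = r(Sy/Sx) and a = y mean - b(x mean)."""
    n = len(xs)
    x_mean, y_mean = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    r = sum(((x - x_mean) / s_x) * ((y - y_mean) / s_y)
            for x, y in zip(xs, ys)) / (n - 1)
    b = r * (s_y / s_x)
    a = y_mean - b * x_mean
    return a, b

xs = [1, 2, 3, 4, 5]   # hypothetical explanatory values
ys = [2, 4, 5, 4, 5]   # hypothetical response values
a, b = slope_and_intercept(xs, ys)
print(round(a, 3), round(b, 3))

# The average point (x mean, y mean) lies on the fitted line.
print(round(a + b * statistics.mean(xs), 3), statistics.mean(ys))
```

Because the intercept is defined from the two means, plugging the x mean into the line always returns the y mean.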
Pareto Chart
categories are sorted from most to least frequent in the bar chart, typically the first two categories account for 80% or more of the participants
population inferences
conclusions about the population based on sample data
Continuous Quantitative variable
continuous scale, example is weight or height
experiment
controlled study in which the researcher attempts to understand cause-and-effect relationships
Linear Transformations
conversion of units that change the exact values of the points
Discrete Quantitative variable
counting, example is number of siblings
Single Regression Model
demonstrates the relationship between one explanatory variable and one response variable, ignoring the other variables
Frequency table
describes categorical variables, summarizes the frequencies of observations
variability between samples
different people make up different samples, but if samples are chosen randomly their differences should be within the margin of error
z-score
distance from the mean divided by the standard deviation
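As a quick illustration, with a made-up mean of 70 and standard deviation of 10 (the function name `z_score` is illustrative):

```python
def z_score(observation, mean, standard_deviation):
    """Number of standard deviations the observation lies from the mean."""
    return (observation - mean) / standard_deviation

# Hypothetical distribution: mean 70, standard deviation 10.
print(z_score(85, 70, 10))   # above the mean, so the z-score is positive
print(z_score(60, 70, 10))   # below the mean, so the z-score is negative
```

The sign shows which side of the mean the point falls on, and the size shows how many standard deviations away it is.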
Marginal distribution
distribution for a single variable ignoring the others
Joint distribution
distribution of two variables jointly (together), where each proportion is the count in one cell out of all of the observations
Causal Inferences
drawing conclusion about a cause and effect connection
Subjects
entities we measure (small chosen group)
disjoint events
events that cannot happen at the same time
Correlation coefficient (r)
if the relationship is approximately linear, then you can compute r to say how closely the points lie to the linear regression line. Estimates the strength of the relationship
Variability within a sample
measurements vary from person to person
finding mean/median and mode
Mode: the value with the highest count. Median: the middle value of the sorted observations when n is odd; the average of the two middle values when n is even. Mean: sum of all observations / total # of observations
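These three summaries are available in Python's standard `statistics` module. The data below are made up for illustration.

```python
import statistics

odd_data = [3, 5, 5, 7, 10]   # hypothetical observations, n is odd
even_data = [3, 5, 7, 10]     # hypothetical observations, n is even

mean_value = statistics.mean(odd_data)         # sum of observations / n
mode_value = statistics.mode(odd_data)         # value with the highest count
median_odd = statistics.median(odd_data)       # the middle value
median_even = statistics.median(even_data)     # average of the two middle values

print(mean_value, mode_value, median_odd, median_even)
```

For the even-sized list the two middle values are 5 and 7, so the median is their average, 6.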
Scaling
multiplier to the original points
Inferential statistics
numerical data that allow one to generalize- to infer from sample data the probability of something being true of a population. Involves a prediction of the characteristics of an entire population based on the characteristics of the sample
Descriptive statistics
numerical data used to measure and describe characteristics of groups. Includes mean, median, graphs
Ordinal categorical variable
order does matter in the range of possible categories; put the categories in their natural order on the x-axis, not in order from most common to least common (don't make a Pareto chart)
Nominal categorical variable
order does not matter in the range of possible categories
Population
the entire overarching group that is being studied and about which we want to draw conclusions
Positive z-score
point falls above the mean
Negative z-score
point falls below the mean
Random allocation
randomly assigning the subjects into different treatment groups
cluster sample
sampling method in which you divide a population into clusters and then randomly select whole clusters. A sampling plan used when the population contains mutually homogeneous yet internally heterogeneous groupings
Stratified sample
sampling that involves the division of a population into smaller sub-groups known as strata. Members of each sub-group have a shared characteristic. Samples are drawn from each stratum.
Convenience sample
sampling that involves the sample being drawn from that part of the population that is close to hand/ easy to contact
intersection
set of elements that are common to each of the sets. An element is in the intersection if it belongs to all of the sets.
Sample statistic
a numerical summary that directly describes a characteristic of the sample
Observational studies
study in which the researcher simply observes the subjects without interfering.
Sample
the subjects chosen to be studied, since it would be too costly to study everyone in the population
Difference of Proportions
the conditional proportion (in a particular category of y) for one category of x, minus the corresponding proportion for another category of x
Ratio of Proportions
the conditional proportion for one category x divided by the corresponding proportion for another category of x
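Both comparisons start from the same two conditional proportions. The sketch below uses invented counts (30 of 1000 in one group and 100 of 1000 in the other); they are not taken from any real study, and the function name is illustrative.

```python
def conditional_proportion(count, group_size):
    """Proportion of a group that falls in the category of interest."""
    return count / group_size

# Hypothetical counts of people who got sick in each group.
vaccine = conditional_proportion(30, 1000)
placebo = conditional_proportion(100, 1000)

difference = vaccine - placebo   # difference of proportions
ratio = vaccine / placebo        # ratio of proportions

print(round(difference, 3), round(ratio, 3))
```

Here the difference is -0.07 (7 percentage points lower in the vaccine group) and the ratio is 0.3 (the vaccine-group proportion is 0.3 times the placebo-group proportion). If the two groups had the same proportion, the difference would be 0 and the ratio would be 1.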
Conditional Distribution
the distribution of one variable given that the other observations fall into a particular category of the other variable
Addition Rule of Probability ( mutually exclusive)
The probability of A or B is the sum of the probability of A and the probability of B: P(A or B) = P(A) + P(B). The rule for mutually exclusive events is really a special case of the generalized rule, because if A and B are mutually exclusive, then the probability of both A and B is zero.
Independence between two variables
two categorical variables are independent of each other when the conditional distribution of one variable is the same across all categories of the other variable, not necessarily exactly the same but very close to it --> will have a ratio of 1 and a difference of 0
Numerical Variable
values that describe a measurable quantity as a number, like 'how many' or 'how much' (quantitative variables)
Categorical Variable
values that describe a quality or characteristic of a data unit, like 'what type' or 'which category'
Lurking variable
variables not included in the model that might affect the value of the response variable and may contribute to a poor fit of the single regression model (or of any relationship shown only between two variables). It can be really difficult to know which variables are causing the variability in the relationship between the response and explanatory variables
Correlation
an association between two variables in a set of data, so that knowing the value of one variable helps you make predictions about the other
Quantitative variable
when the response is a numerical value; a continuous quantitative response can fall anywhere within an interval
Categorical variable
when the response falls into one of several categories
cluster random sampling
when you divide the population into separate clusters, and each cluster is like a sample that's representative of the population, so you'll usually just randomly pick one (or a few) of the clusters as your sample.
systematic random sampling
when you select every nth person from the population of interest.
convenience sample
when you select your sample at your convenience, and this is usually a biased sample. For example, I want to measure the average height of the U of A population, but I only go to the basketball court at the U of A and sample everyone there.
Describing data
Who = rows of the spreadsheet (the subjects); what/variable = columns
Histogram
x-axis is divided into even bins and the bars show how many observations fall within that range in the data
Linear Regression Equation
y(hat)= a+bx
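A minimal sketch of using the equation to make a prediction and to compute a residual (observed value minus predicted value). The intercept, slope, and data points are hypothetical, and the function names are illustrative.

```python
def predict(a, b, x):
    """Predicted response: y(hat) = a + bx."""
    return a + b * x

def residual(a, b, x, y_observed):
    """Observed value minus predicted value of the response variable."""
    return y_observed - predict(a, b, x)

a, b = 2.2, 0.6   # hypothetical intercept and slope

print(round(predict(a, b, 3), 2))       # predicted response at x = 3
print(round(residual(a, b, 3, 5), 2))   # positive: the line underestimates
print(round(residual(a, b, 1, 2), 2))   # negative: the line overestimates
```

At x = 3 the line predicts 4.0, so an observed value of 5 leaves a residual of 1.0; at x = 1 it predicts 2.8, so an observed value of 2 leaves a residual of -0.8.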
Observational study
you simply observe, so there is no manipulation at all, i.e. there is no random assignment of subjects to different treatment groups.