Stat 161 Midterm 1

continuous random variable

(X takes all values in a given interval of numbers) - The probability distribution of a continuous random variable is shown by a density curve. - The probability that X falls in an interval of numbers is the area under the density curve between the interval endpoints. - The probability that a continuous random variable X is exactly equal to any single number is zero.

Finding relative frequency table

1) Count the total number of items. In this chart the total is 40. 2) Divide the count (the frequency) by the total number. For example, 1/40 = .025 or 3/40 = .075.
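
The two steps above can be sketched in Python as follows. This is a minimal illustration: the category names and counts are made up, chosen only so that the total is 40 as in the card, and `relative_frequency_table` is a hypothetical helper name.

```python
from collections import Counter

def relative_frequency_table(observations):
    """Map each category to (count, count / total)."""
    counts = Counter(observations)
    total = sum(counts.values())  # step 1: total number of items
    # step 2: divide each frequency by the total
    return {category: (count, count / total) for category, count in counts.items()}

# Made-up data with 40 items (an assumption, not the chart from the course).
observations = ["red"] * 1 + ["blue"] * 3 + ["green"] * 36
table = relative_frequency_table(observations)

for category, (count, share) in table.items():
    print(f"{category}: {count}/40 = {share:.3f}")
```

The relative frequencies in any such table should add up to 1, which is a quick way to check the arithmetic.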

Properties of r

1. Always between -1 and 1 2. Is affected by outliers --> one outlier can severely alter the association. If the outlier is in the middle of the distribution, it will affect r less than if the outlier is at one of the ends 3. Is not affected by linear transformations such as a change of units

Describing a distribution

1. Describe shape 2. Describe variability 3. Describe center 4. Describe/ identify outliers

Correlational coefficient tells us

1. How the change in the explanatory variable corresponds to a change in the response variable 2. Can help you make vague predictions

R squared analysis

1. The overall prediction error drops by 73.3% when using the linear regression model as compared to the naive prediction model that uses the average 2. 73.3% of the variability in the longevity of animals can be explained by their varying gestational period

Example of ratio of proportions

1. The proportion of those contracting Covid among those receiving the vaccine was 0.3 times the proportion among those who received the placebo 2. Those in the vaccine group were 70% less likely to contract Covid as compared to those who received the placebo 3. Those in the placebo group were 3.3 times as likely to contract Covid as compared to those receiving the vaccine 4. (ratio of 1.6) Those in the placebo group were 60% more likely to contract Covid as compared to those receiving the vaccine

r cannot tell you...

1. You cannot extrapolate the magnitude of the slope (but you can get the sign) 2. You cannot make predictions about specific values of the response variable based on the explanatory variable

Variability (s squared)

s squared = [sum of (each observation - mean) squared] / (n - 1): square each distance from the mean, add the squares up, then divide by n - 1
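
As a rough sketch, the sample variance can be computed by hand in Python and compared with the standard library's `statistics.variance`. The data values are invented for illustration and `sample_variance` is a hypothetical helper name.

```python
import statistics

def sample_variance(data):
    """s^2 = sum of squared distances from the mean, divided by n - 1."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)

# Invented data: the mean is 5 and the squared distances add up to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]
s_squared = sample_variance(data)

# The raw (unsquared) distances cancel out, which is why they are squared first.
raw_distance_sum = sum(x - sum(data) / len(data) for x in data)

print(s_squared, statistics.variance(data), raw_distance_sum)
```

Taking the square root of `s_squared` gives the sample standard deviation.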

Box plot

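A graph that displays the lowest and highest quarters of the data as whiskers, the middle two quarters of the data as a box, and the median as a line inside the box. The whiskers extend to the most extreme observations that are no lower than Q1 - 1.5 IQR and no higher than Q3 + 1.5 IQR; points beyond those limits are plotted individually as potential outliers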

normal distribution

A bell-shaped probability curve that is symmetric about its mean and is described by its mean and standard deviation. The standard normal distribution is the special case in which the mean is zero and the standard deviation is 1.

Shift

Addition or subtraction from the original points

Interpretation of the slope

As the explanatory variable increases by one unit (check the x-axis for the actual units and specify), the response variable is predicted to change by the value of the slope (an increase if the slope is positive, a decrease if it is negative)

Right-skewed Distribution

Asymmetric distribution with the tail on the right side; the mean is pulled toward the right tail, so it is typically greater than the median

Graphs describing categorical variables

Bar and pie charts

Define Given

Conditional probability

Depicting relationship between two categorical variables

Construct contingency table and then after can be put into side by side bar charts

R squared

Describes the difference in prediction abilities between the naive prediction model (using averages) and the linear regression model. Describes the variability!

What does Unbiased mean

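Equal chance: every subject in the population has the same chance of being selected, so no group is systematically favored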

R squared for single regression model

Example: For every 10 percentage increase in the education level, we predict the crime to increase by 14.9 crimes per thousand

Addition rule ( generalized for any two events)

For any two events A and B, the probability of A or B is the sum of the probability of A and the probability of B minus the shared probability of both A and B: P(A or B) = P(A) + P(B) - P(A and B)
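
A small check of the generalized addition rule, using one roll of a fair six-sided die as an assumed example (the events A and B below are chosen only for illustration):

```python
from fractions import Fraction

# Assumed example: one roll of a fair die.
sample_space = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}  # the roll is even
B = {4, 5, 6}  # the roll is greater than 3

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

p_direct = prob(A | B)                        # count the outcomes in "A or B"
p_by_rule = prob(A) + prob(B) - prob(A & B)   # P(A) + P(B) - P(A and B)

print(p_direct, p_by_rule)
```

Without subtracting P(A and B), the outcomes 4 and 6 would be counted twice.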

Probability value

Found in the middle of the z table

Graphs describing quantitative variables

Histograms, stem and leaf and box plot

What is sigma value?

How far a sample or data point is away from the mean

Mutually exclusive

If and ONLY if the two events have no shared outcomes

Standard Deviation

If you simply add up the distances from the mean, you get a sum of zero, which does not help, so the standard deviation is defined as the square root of the variance (s squared). This measure tells you how spread apart the observations are and roughly how far, on average, a point falls from the mean

How to identify if a variable is independent

If the conditional distribution/probabilities for one variable are the same for each category these are independent random variables. if the two variables are correlated, then they are not independent.

Randomness in choosing subjects

Important to randomly pick subjects otherwise it will not be an indicative sample of the entire population, also helps alleviate biases

Multiplicative rule for intersections

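Intersection is the probability of both or all of the events you are calculating happening at the same time (never more likely than either event alone). In general, P(A and B) = P(A) x P(B|A). Only when A and B are independent does this simplify to P(A and B) = P(A) x P(B).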

z-score graph to get sign of distribution

Look to see the number of negative and positive contributions to the overall sum of z-scores, if there are more negative contributors, the relationship is likely negative and vice versa.

Extrapolation of Linear Regression

Minimization of the sum of the squared residuals

Correlation does not equal causation

No matter how correlated the values are (as in a high R squared value), and no matter the model that you are using, correlation alone does not show that one variable causes the other

Empirical Rule

Only applies to bell-shaped curves. One can describe the distribution as follows: 1. About 68% of the data falls within 1 standard deviation of the mean --> mean +/- 1s 2. About 95% of the data falls within 2 standard deviations of the mean --> mean +/- 2s 3. About 99.7% of the data falls within 3 standard deviations of the mean --> mean +/- 3s
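
For an exactly normal distribution, these percentages can be checked as areas under the density curve. The sketch below uses the error function from Python's `math` module; `prob_within` is a hypothetical helper name.

```python
import math

def prob_within(k):
    """Area under a normal density curve within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sd: {prob_within(k):.4f}")
```

The printed areas are roughly 0.68, 0.95, and 0.997, which is where the rule's rounded percentages come from.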

complement rule

P(A^c) = 1 - P(A) states that the sum of the probabilities of an event and its complement must equal 1.

conditional probability rule

P(A|B) = P(A and B) / P(B) the measure of the probability of an event occurring, given that another event has already occurred.
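
The complement and conditional probability rules can be tried out on an assumed example, one roll of a fair die, with events picked only for illustration:

```python
from fractions import Fraction

# Assumed example: one roll of a fair die.
sample_space = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}  # the roll is even
B = {4, 5, 6}  # the roll is greater than 3

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

p_not_a = 1 - prob(A)                  # complement rule: P(A^c) = 1 - P(A)
p_a_given_b = prob(A & B) / prob(B)    # P(A|B) = P(A and B) / P(B)

# Rearranging the conditional rule gives P(A and B) = P(B) x P(A|B).
p_a_and_b = prob(B) * p_a_given_b

# A and B are independent only if P(A|B) equals P(A); here they are not.
independent = p_a_given_b == prob(A)

print(p_not_a, p_a_given_b, p_a_and_b, independent)
```

Here knowing that the roll is greater than 3 raises the chance that it is even from 1/2 to 2/3, so the two events are not independent.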

Quartiles

Q1 = 25% of the data falls below this point; Q2 = median, center of the distribution; Q3 = 75% of the data falls below this point

Interquartile range

Q3-Q1, describes the middle 50% of the distribution
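
A sketch of how the quartiles, the IQR, and the 1.5 x IQR limits used for box plot whiskers fit together. Quartile conventions differ between textbooks and software; this example assumes the "inclusive" method of Python's `statistics.quantiles`, which may not match the course's hand method exactly. The data are invented and `box_plot_summary` is a hypothetical helper name.

```python
import statistics

def box_plot_summary(data):
    """Quartiles, IQR, 1.5 x IQR limits, whisker ends, and flagged outliers."""
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    inside = [x for x in data if lower_limit <= x <= upper_limit]
    return {
        "q1": q1, "median": q2, "q3": q3, "iqr": iqr,
        "lower_limit": lower_limit, "upper_limit": upper_limit,
        # Whiskers stop at the most extreme observations inside the limits.
        "whisker_low": min(inside), "whisker_high": max(inside),
        "outliers": [x for x in data if x < lower_limit or x > upper_limit],
    }

# Invented data with one unusually large value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
summary = box_plot_summary(data)
print(summary)
```

In this data set the value 30 lies above Q3 + 1.5 IQR, so it is flagged as a potential outlier and the upper whisker stops at 9.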

Example of Joint distribution

Q: What proportion of all students who responded are freshman and prefer milk chocolate?

Example of Conditional Distribution

Q: What proportion of freshman prefer milk chocolate?

Example of marginal distribution

Q: what proportion of students prefer milk chocolate?
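
The three example questions (joint, conditional, marginal) differ only in which counts go in the numerator and denominator. A minimal sketch, using survey counts that are entirely made up:

```python
# Made-up survey counts (assumption): class year by chocolate preference.
counts = {
    ("freshman", "milk"): 30,
    ("freshman", "dark"): 20,
    ("senior", "milk"): 25,
    ("senior", "dark"): 25,
}

total = sum(counts.values())
freshman_total = sum(c for (year, _), c in counts.items() if year == "freshman")
milk_total = sum(c for (_, chocolate), c in counts.items() if chocolate == "milk")

# Joint: freshman AND milk chocolate, out of all students who responded.
joint = counts[("freshman", "milk")] / total

# Conditional: milk chocolate, out of freshmen only.
conditional = counts[("freshman", "milk")] / freshman_total

# Marginal: milk chocolate, out of all students, ignoring class year.
marginal = milk_total / total

print(joint, conditional, marginal)
```

With these counts the conditional proportion for freshmen (0.6) differs from the marginal proportion (0.55), which suggests class year and chocolate preference are not independent in this invented table.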

random sampling

Randomly selecting people from the population of interest as my sample

Contingency table analysis

Row = explanatory variable; Column = response variable; Cell = the observation count for that combination of categories

Statistical life cycle

Statistical question--> collecting data--> processing data <--> explanatory variable--> learning from the data --> report--> statistical question

Computation of r

r = [sum of (Zx x Zy)] / (n - 1), where Zx and Zy are the z-scores of each observation's x and y values
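
A minimal sketch of this computation, using sample standard deviations (dividing by n - 1) for the z-scores. The data points are invented and `correlation` is a hypothetical helper name.

```python
import statistics

def correlation(xs, ys):
    """r = sum of (z-score of x times z-score of y), divided by n - 1."""
    n = len(xs)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    z_x = [(x - mean_x) / sd_x for x in xs]
    z_y = [(y - mean_y) / sd_y for y in ys]
    return sum(a * b for a, b in zip(z_x, z_y)) / (n - 1)

# Invented data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)

# A change of units (multiply by 10, add 3) leaves r unchanged.
r_rescaled = correlation([10 * x + 3 for x in xs], ys)

print(round(r, 4), round(r_rescaled, 4))
```

Points where x and y are both above or both below their means contribute positive products, which is why mostly positive products give a positive r.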

density curve

The area under the curve is equal to 100 percent of all probabilities. As we usually use decimals in probabilities you can also say that the area is equal to 1

Negative Residual

The linear regression line overestimates the value of the response variable given a specific value of the explanatory variable

Positive Residual

The linear regression line underestimates the value of the response variable given a specific value of the explanatory variable

Difference of Conditional Proportions

The percentage of patients who contracted Covid was 0.89 percentage points lower for those who received the vaccine as compared to those receiving the placebo

When do you use the complement rule

When two events are complements of each other: they are mutually exclusive and together cover every possible outcome

linear correlation

a measure of dependence between two random variables that can take values between -1 and 1

population parameter

a numerical summary that describes a characteristic of a population.

random phenomenon

a situation in which we know what outcomes can occur, but we do not know which outcome will occur. We cannot predict each outcome

Census

a study of every unit, everyone or everything, in a population.

retrospective study

a study that looks backwards, so you will observe the subjects' past data. For example, you select a sample of kids, and look at their records and see how many extra words they speak from age 2 t

prospective studies

a study that looks forward, so you will observe the subjects from now to the future. For example, you select a sample of 24 months old babies now, observe how many extra words they can speak 1 year later.

continuous variable

a variable whose value is obtained by measuring (height, time, distance)

Equation for intercept (a)

a= y(mean)-b(x mean)

Residuals

actual/observed - predicted value (all of the response variable)

Left-skewed Distribution

asymmetric distribution with the tail on the left side; the mean is pulled toward the left tail, so it is typically less than the median

(x mean, y mean)

average point will lie directly on the linear regression line

Equation for slope (b)

b= r(Sy/Sx)
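
The slope and intercept formulas can be combined into a short sketch. The data are invented, `fit_line` is a hypothetical helper name, and r is computed from z-scores with sample standard deviations.

```python
import statistics

def fit_line(xs, ys):
    """Return (a, b) using b = r * (Sy / Sx) and a = mean of y - b * mean of x."""
    n = len(xs)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    r = sum(((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
            for x, y in zip(xs, ys)) / (n - 1)
    b = r * sd_y / sd_x
    a = mean_y - b * mean_x
    return a, b

# Invented data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = fit_line(xs, ys)

# Plugging in the mean of x returns the mean of y: (x mean, y mean) is on the line.
prediction_at_mean_x = a + b * statistics.mean(xs)

print(round(a, 3), round(b, 3), round(prediction_at_mean_x, 3))
```

Because the intercept is defined as a = y mean - b(x mean), the point (x mean, y mean) always lies on the fitted line.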

Pareto Chart

categories are sorted from most to least frequent in the bar chart, typically the first two categories account for 80% or more of the participants

population inferences

conclusion about the pop. based on sample data

Continuous Quantitative variable

continuous scale, example is weight or height

experiment

controlled study in which the researcher attempts to understand cause-and-effect relationships

Linear Transformations

conversion of units that change the exact values of the points

Discrete Quantitative variable

counting, example is number of siblings

Single Regression Model

demonstrates the relationship between one explanatory variable and one response variable, ignoring the other variables

Frequency table

describes categorical variables, summarizes the frequencies of observations

variability between samples

different people make up different samples, but if samples are chosen randomly their differences should be within the margin of error

z-score

distance from the mean divided by the standard deviation
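
A short sketch of the z-score calculation, assuming the sample standard deviation (n - 1 in the denominator) is the one intended. The data are invented and `z_scores` is a hypothetical helper name.

```python
import statistics

def z_scores(data):
    """(observation - mean) / standard deviation, for every observation."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)  # sample standard deviation
    return [(x - mean) / sd for x in data]

# Invented data with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]
zs = z_scores(data)

for x, z in zip(data, zs):
    side = "above" if z > 0 else "below" if z < 0 else "at"
    print(f"{x}: z = {z:+.2f} ({side} the mean)")
```

The sign of each z-score shows which side of the mean the point falls on, and its size shows how many standard deviations away it is.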

Marginal distribution

distribution for a single variable ignoring the others

Joint distribution

distribution of two variables jointly-together, where the proportion of one cell out of all of the observations

Causal Inferences

drawing conclusion about a cause and effect connection

Subjects

entities we measure (small chosen group)

disjoint events

events that cannot happen at the same time

Correlational coefficient (r)

if the relationship is approximately linear, then you can compute r to say how close the points lie to the linear regression line. Estimates the strength and direction of the linear relationship

Variability within a sample

measurements vary from person to person

finding mean/median and mode

Mode: the value with the highest count. Median: the middle value when n is odd; the average of the two middle values when n is even. Mean: sum of all observations / total number of observations
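
These three definitions translate directly into code. A minimal sketch with invented data; note that when several values tie for the highest count, this `mode` helper returns only the one that appears first.

```python
from collections import Counter

def mean(data):
    return sum(data) / len(data)

def median(data):
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: the middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: average the two middle values

def mode(data):
    return Counter(data).most_common(1)[0][0]     # value with the highest count

even_data = [3, 1, 4, 1, 5, 9, 2, 6]  # n = 8
odd_data = [3, 1, 4, 1, 5]            # n = 5

print(mean(even_data), median(even_data), mode(even_data))
print(median(odd_data))
```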

Scaling

multiplier to the original points

Inferential statistics

numerical data that allow one to generalize- to infer from sample data the probability of something being true of a population. Involves a prediction of the characteristics of an entire population based on the characteristics of the sample

Descriptive statistics

numerical data used to measure and describe characteristics of groups. Includes mean, median, graphs

Ordinal categorical variable

order does matter in the range of possible categories; put them in their natural order on the x-axis rather than from most common to least common (don't make a Pareto chart)

Nominal categorical variable

order does not matter in the range of possible categories

Population

the overarching group that is being studied, from which the sample is drawn

Positive z-score

point falls above the mean

Negative z-score

point falls below the mean

Random allocation

randomly assigning the subjects into different treatment groups

cluster sample

sampling method in which you divide a population into clusters and then sample whole clusters. A sampling plan used when the population contains mutually homogeneous yet internally heterogeneous groupings

Stratified sample

sampling that involves the division of a population into smaller sub-groups known as strata. Members of each stratum have a shared characteristic. Samples are drawn from each stratum.

Convenience sample

sampling that involves the sample being drawn from that part of the population that is close to hand/ easy to contact

intersection

set of elements that are common to each of the sets. An element is in the intersection if it belongs to all of the sets.

Sample statistic

statistic that directly describes the sample

Observational studies

study in which the researcher simply observes the subjects without interfering.

Sample

subjects chosen to be studied, would be too costly to study everyone in the population

Difference of Proportions

the conditional proportion (in a particular category of y) for one category of x, minus the corresponding proportion for another category of x

Ratio of Proportions

the conditional proportion for one category x divided by the corresponding proportion for another category of x
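
A sketch of both comparisons on a hypothetical vaccine trial. The counts are invented for illustration and are not the data behind the course examples.

```python
# Invented counts (assumption): cases of illness out of 1000 subjects per group.
vaccine_cases, vaccine_size = 30, 1000
placebo_cases, placebo_size = 100, 1000

p_vaccine = vaccine_cases / vaccine_size   # conditional proportion given vaccine
p_placebo = placebo_cases / placebo_size   # conditional proportion given placebo

difference = p_vaccine - p_placebo         # difference of proportions
ratio = p_vaccine / p_placebo              # ratio of proportions
reverse_ratio = p_placebo / p_vaccine      # same comparison, other group on top

print(f"difference: {difference * 100:.1f} percentage points")
print(f"ratio (vaccine / placebo): {ratio:.2f}")
print(f"ratio (placebo / vaccine): {reverse_ratio:.2f}")
```

With these counts the ratio is 0.3, so the vaccine group is 70% less likely to get sick, and the reverse ratio of about 3.3 says the placebo group is about 3.3 times as likely. If the two variables were independent, the difference would be close to 0 and the ratio close to 1.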

Conditional Distribution

the distribution of one variable given that the other observations fall into a particular category of the other variable

Addition Rule of Probability ( mutually exclusive)

the probability of A or B is the sum of the two probabilities: P(A or B) = P(A) + P(B). The rule for mutually exclusive events is really a special case of the generalized rule, because if A and B are mutually exclusive, then the probability of both A and B is zero.

Independence between two variables

two categorical variables are independent of each other when the conditional distribution of one variable is the same across all categories of the other variable, not necessarily exactly the same but very close to it --> will have a ratio of 1 and a difference of 0

Numerical Variable

values that describe a measurable quantity as a number, like 'how many' or 'how much' (quantitative variables)

Categorical Variable

values that describe a quality or characteristic of a data unit, like 'what type' or 'which category'

Lurking variable

a variable that is not included in the model but might affect the value of the response variable, and may contribute to a weak correlation in a single regression model (or in any relationship shown between only two variables). It can be really difficult to know which variables are causing the variability in the relationship between the response and explanatory variables

Correlation

an association between two variables in a set of data that lets you make predictions about one variable from the other

Quantitative variable

when the response can fall anywhere within an interval

Categorical variable

when the response falls into one of several categories

cluster random sampling

when you divide the population into separate clusters, and each cluster is like a sample that's representative of the population, so you'll usually just randomly pick one (or a few) of the clusters as your sample.

systematic random sampling

when you select every nth person from the population of interest.

convenience sample

when you select your sample at your convenience, and this is usually a biased sample. For example, I want to measure the average height of the U of A population, but I only go to the basketball court at the U of A and sample everyone there.

Describing data

Who = rows of the spreadsheet; what (variables) = columns

Histogram

x-axis is divided into even bins and the bars show how many observations fall within that range in the data

Linear Regression Equation

y(hat)= a+bx
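
To tie the regression cards together, here is a rough sketch that fits y(hat) = a + bx by least squares, computes the residuals (observed minus predicted), and derives R squared as the drop in squared prediction error relative to the naive model that always predicts the mean. The data are invented and `least_squares` is a hypothetical helper name.

```python
def least_squares(xs, ys):
    """Return (a, b) for the line that minimizes the sum of squared residuals."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b = s_xy / s_xx
    a = mean_y - b * mean_x
    return a, b

# Invented data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = least_squares(xs, ys)

predicted = [a + b * x for x in xs]
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]  # observed - predicted

mean_y = sum(ys) / len(ys)
error_regression = sum(e ** 2 for e in residuals)       # squared error of the line
error_naive = sum((y - mean_y) ** 2 for y in ys)        # squared error of the mean
r_squared = 1 - error_regression / error_naive

print(round(a, 2), round(b, 2), [round(e, 2) for e in residuals], round(r_squared, 2))
```

The first residual is negative, meaning the line overestimates that observation, and the R squared of 0.6 means the line reduces the squared prediction error by 60% compared with always predicting the mean.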

Observational study

you simply observe, so there is no manipulation at all, i.e., there is no random assignment of subjects to different treatment groups.

