STAT 2020 Exam 1

Common Rule

- The organization that carries out the study must have an institutional review board that reviews all planned studies in advance to protect the subjects from possible harm
- All individuals who are subjects in a study must give their informed consent in writing before data are collected
- All individual data must be kept confidential

Sample

- A group selected from a population; its size is denoted by "n"
- Many different samples can be selected from any given population
- In general, we want to take a random sample of individuals from the population
- The sample is then used to explore possible trends/patterns and can be used to answer questions posed about the population

Statistic

- A number that describes a characteristic of a sample (aka sample statistic)
- The observed value of a statistic is used to estimate the unobserved value of a parameter

odds

- A ratio of two probabilities, where the numerator is the probability of an event occurring and the denominator is the complementary probability of that event not occurring
- Can take any value of 0 or more, including values greater than 1
- odds = p/(1-p), where p = risk
- p = odds/(1+odds)
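The two conversions on this card can be checked numerically. This is a minimal sketch; the function names `risk_to_odds` and `odds_to_risk` are illustrative and do not come from the course material.

```python
def risk_to_odds(p):
    """Convert a probability (risk) p, with 0 <= p < 1, to odds p / (1 - p)."""
    return p / (1 - p)


def odds_to_risk(odds):
    """Convert odds back to a probability: p = odds / (1 + odds)."""
    return odds / (1 + odds)


# A risk of 0.75 corresponds to odds of 3 (odds can exceed 1).
print(risk_to_odds(0.75))  # 3.0
print(odds_to_risk(3.0))   # 0.75
```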

Continuous sample spaces

- Contain an infinite number of events
- The density curve is the probability model for a continuous RV (random variable); Y is our notation for a random variable
- The total area under a density curve represents the whole population (sample space) and equals 1 (100%)
- Probabilities are computed as areas under the corresponding portion of the density curve for the chosen interval
- The probability of an event being equal to a single numerical value is zero when the sample space is continuous, e.g. P(y = 0.5) = 0
- Areas over disjoint intervals add: P(y ≤ 0.5 or y > 0.8) = P(y ≤ 0.5) + P(y > 0.8) = 0.7 for the example density curve
- With continuous RVs, probability is defined OVER a range of values and not just a single occurrence

density curve

- Is always on or above the horizontal axis
- Has area exactly 1 underneath it
- Describes the overall pattern of a distribution
- The area under the curve and above any range of values on the horizontal axis is the proportion of all observations that fall within that range

Tree Diagrams

- Used to represent probabilities graphically and facilitate computations
1. The first set of branches gives the marginal/individual probabilities
2. The second set of branches, extending from the first set, gives the conditional probabilities

The basic principles of statistical design of experiments are

1. Control the effects of lurking variables on the response, most simply by comparing two or more treatments.
2. Randomize: use impersonal chance to assign subjects to treatments.
3. Use enough subjects in each group to reduce chance variation in the results.
An observed effect so large that it would rarely occur by chance is called statistically significant.

Boxplots

- Display the 5 number summary (min, Q1, median, Q3, max)
- Symmetric or skewed: look at and compare the lengths of the whiskers (the lines that extend from the box)
- Outliers: check with the outlier rule

The Best Measure of Center

- A distribution is skewed if it is not symmetric and extends more to one side than the other
- Skewed left: mean < median
- Skewed right: mean > median
- The mean is unique in that it takes all data values into account; however, it is not resistant to skew and extreme values (outliers)
- The median is resistant to skew and outliers
- For data that are approximately symmetric with only one mode, the mean, median, and mode will be approximately the same
- For data that are obviously asymmetric, you should report both the mean and the median

Simpson's paradox

An association or comparison that holds for all of several groups can reverse direction when the data are combined to form a single group
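A small numeric illustration can make the reversal concrete. The counts below are hypothetical, made up only for this sketch, and the names `data`, `rate`, and `combined` are illustrative.

```python
# Hypothetical (successes, trials) counts, made up for illustration only.
data = {
    "group 1": {"A": (9, 10), "B": (80, 100)},
    "group 2": {"A": (30, 100), "B": (2, 10)},
}


def rate(successes, trials):
    """Success rate as a proportion."""
    return successes / trials


# Within each group, treatment A has the higher success rate.
for group, counts in data.items():
    print(group, {t: rate(*c) for t, c in counts.items()})

# Combining the groups into one reverses the comparison.
combined = {}
for t in ("A", "B"):
    successes = sum(data[g][t][0] for g in data)
    trials = sum(data[g][t][1] for g in data)
    combined[t] = rate(successes, trials)
print("combined", combined)
```

Here A wins in both groups (0.9 vs 0.8 and 0.3 vs 0.2), but B wins once the groups are pooled, because most of A's subjects sit in the group with low success rates.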

Ways to chart categorical data

- Bar graphs: each characteristic is represented by a bar. The height of a bar represents either the count of individuals with that characteristic (the frequency) or the percent of individuals with that characteristic (the relative frequency).
- Pie charts: can only represent how one categorical variable breaks down into its components. Each characteristic is represented by a slice, and the size of a slice represents what percent of the whole is made up by the characteristic.

Comparative Observational Studies

- Case-control studies: start with 2 random samples of individuals with different outcomes and look for exposure factors in the subjects' past ("retrospective"). Case-subjects are selected based on a defined outcome, and a control group of subjects is selected separately to serve as a baseline with which the case group is compared.
- Cohort studies: enlist individuals of a common demographic and keep track of them over a long period of time ("prospective"). Individuals who later develop a condition are compared with those who don't. Long time frame (longitudinal).
- Cross-sectional studies: measure the exposure and the outcome at the same time, e.g. surveys.

Stem & Leaf Plot

- Create a two-column table
- The leftmost digit goes to the "stem"
- The rightmost digit goes to the "leaf"
- Only used with small data sets (15 observations or fewer), and the variable must be quantitative

1979: The Belmont Report

Establishes 3 aims: respect for persons, beneficence, justice

Comparative, Randomized Experiments

- Experiments compare the response to a given treatment versus: another treatment, the absence of treatment (often called a control), or a placebo (a fake treatment)
- Experiments randomize the assignment of subjects to treatments
- Experiments use replication: several or many individuals are studied

Multiplication Rule

For independent events: P(A and B)= P(A) x P(B)

When we describe a scatter plot, we look at

- Form: linear, curved, clusters, no pattern
- Direction of the trend: positive, negative, no direction
- Strength: how closely the points fit the "form"
- Outliers

Bias and Blinding

- Hawthorne effect: a type of bias that may occur when subjects modify their behavior because they are enrolled in a study. Also known as the "observer effect."
- Blinding can help against bias.
- Double-blind: an experiment in which neither the subjects nor the experimenters know which individuals received which treatment until the experiment is completed.
- The most serious potential weakness of experiments is a lack of realism.

Ways to chart quantitative data

- Histograms: summary graph for a single variable, useful to understand the pattern of variability in the data, especially for large data sets
- Dotplots and stem/leaf plots: graphs for raw data, useful to describe the pattern of variability in the data, especially for small data sets
- Time series plots: a graph with a sequence for the horizontal variable, like time. The line connecting the points helps emphasize change over time

IQR and suspected outliers

- IQR = Q3 - Q1
- Suspected low outlier: any value < Q1 - 1.5 x IQR
- Suspected high outlier: any value > Q3 + 1.5 x IQR
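The 1.5 x IQR rule can be sketched in a few lines. This example assumes quartiles are found as the medians of the lower and upper halves of the sorted data, and that the overall median is left out of both halves when n is odd; software packages may use other quartile conventions. The function names and the sample data are made up for illustration.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q3) as medians of the lower and upper halves of the sorted data.

    Assumption: when n is odd, the overall median is excluded from both halves.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[-half:]
    return median(lower), median(upper)


def suspected_outliers(values):
    """Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]


# Made-up data for illustration.
sample = [2, 4, 5, 6, 7, 8, 9, 30]
print(quartiles(sample))           # (4.5, 8.5)
print(suspected_outliers(sample))  # [30]
```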

Conditional Probability

If events A and B are independent, then the conditional probability is P(B | A) = P(B | not A) = P(B)

Bayes' Theorem

If we know the conditional probability P(B|A) and the individual probability P(A), we can use Bayes' Theorem to find the conditional probability P(A|B):
P(A|B) = P(B|A)P(A) / (P(B|A)P(A) + P(B|A^C)P(A^C))
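A worked example of the formula on this card, as a rough sketch. The function name `bayes` and the screening-test probabilities are hypothetical values chosen only for illustration.

```python
def bayes(p_b_given_a, p_a, p_b_given_not_a):
    """Return P(A|B) from P(B|A), P(A), and P(B|A^C)."""
    p_not_a = 1 - p_a
    numerator = p_b_given_a * p_a
    denominator = numerator + p_b_given_not_a * p_not_a
    return numerator / denominator


# Hypothetical screening test: A = has the condition, B = tests positive.
# P(A) = 0.01, P(B|A) = 0.95, P(B|A^C) = 0.05 (made-up values).
print(round(bayes(0.95, 0.01, 0.05), 3))  # 0.161
```

Even with a fairly accurate test, P(A|B) stays small here because the condition is rare, so the P(B|A^C)P(A^C) term dominates the denominator.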

Experimental Designs

- Completely randomized: individuals are randomly assigned to groups, then the groups are assigned to treatments completely at random
- Matched pairs: choose pairs of subjects that are closely matched (like twins). Within each pair, randomly assign who will receive which treatment
- Repeated measures: give the two (or more) treatments to each subject over time, in random order, so we have repeated measures for each subject

marginal distributions

- In a two-way table there are two marginal distributions: one for the row variable and one for the column variable
- Calculating the marginal distributions: row/column total divided by the overall total of the table
- A marginal distribution tells us nothing about the relationship between the two variables

Standard Deviation

- Not resistant to skew or outliers
- Only used when the mean (average) is the measure of center
- Always greater than or equal to zero; s = 0 only when all the values in the sample are identical

Interpreting Histograms

- Shape/Distribution: unimodal, bimodal, symmetric, skewed (left/right), irregular
- Center: approximate midpoint; where is the peak (or peaks) of the curve?
- Spread: range of values observed
- Outliers: any points that may be possible outliers
- How many peaks? (unimodal, bimodal, multimodal)
- Overall distribution (normal, uniform, etc.); the normal distribution is the bell curve

Types of Histograms

- Standard way: made with classes/bins and the number of observations in each class/bin
- Relative frequency: counts the number of observations in each class/bin, then divides by the total and converts to a percent (called a "relative frequency")

Establishing Causation from an observed association can be done if

- The association is strong
- The association is consistent
- Higher doses are associated with stronger responses
- The alleged cause precedes the effect
- The alleged cause is plausible

Linear Correlation Coefficient r

- The correlation coefficient measures the strength and direction of the linear association between paired x and y quantitative values in a sample
- Correlation cannot be calculated with categorical variables
- r = the sample correlation
- ρ (Greek letter rho) = the population correlation (this is unknown; it is the population parameter)

Mode

- The mode of a data set is the value that occurs most frequently
- When two values occur with the same (greatest) frequency, each one is a mode and the data set is bimodal
- When more than two values occur with the same (greatest) frequency, each is a mode and the data set is multimodal
- When no value is repeated, there is no mode
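The cases on this card (one mode, several modes, no mode) can be sketched with a frequency count. The function name `modes` is illustrative; it follows the card's convention that a data set with no repeated value has no mode, and it assumes a non-empty list.

```python
from collections import Counter


def modes(values):
    """Return the sorted list of modes, or [] when no value is repeated."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # no value repeats, so there is no mode
    return sorted(v for v, c in counts.items() if c == top)


print(modes([1, 2, 2, 3]))     # [2]     one mode
print(modes([1, 1, 2, 2, 3]))  # [1, 2]  bimodal
print(modes([1, 2, 3]))        # []      no mode
```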

Sample space

The probability of the complete sample space S must equal 1: P (sample space) = 1

Completely randomized designs

The simplest experimental design assigns the individuals (subjects or experimental units) to the treatments completely at random. Note that it is not necessary for a completely randomized design to assign the same number of individuals to each treatment.
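One way to carry out such an assignment is to shuffle the subjects and deal them out to the treatments. This is only a sketch under that assumption: the function name `completely_randomized`, the `seed` parameter, and the subject labels are hypothetical, and dealing in turn happens to give near-equal group sizes, which the design does not require.

```python
import random


def completely_randomized(subjects, treatments, seed=None):
    """Shuffle the subjects, then deal them out to the treatments in turn."""
    rng = random.Random(seed)  # a seed makes the assignment reproducible
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    for i, subject in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(subject)
    return groups


# Ten hypothetical subjects assigned to a treatment and a control group.
assignment = completely_randomized(range(1, 11), ["treatment", "control"], seed=2020)
print(assignment)
```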

Properties of the Correlation Coefficient

- The value of r is always between -1 and 1
- The value of r does not change if all values of either variable are converted to a different scale
- The value of r is not affected by the choice of x and y
- r measures the strength and direction of linear association
- Strong correlations are between 0.7 and 1 or between -1 and -0.7
- Moderate correlations are between 0.4 and 0.69 or between -0.69 and -0.4
- Weak correlations are between -0.39 and 0.39
- If r is close to -1 or 1, we conclude that there is a significant linear correlation between x and y
- If r is equal to 0 or close to 0, we conclude that there is no significant linear correlation; all this means is that there is no linear relationship (it does not mean that there is no relationship whatsoever)
- r is not resistant to outliers

z-score

- The z-score is the number of standard deviations that a given value x is above or below the mean
- Sample: z = (x - xbar)/s
- Population: z = (x - mu)/sigma
- A positive z-score indicates that the value is above the mean, while a negative z-score indicates that the value is below the mean
- A z-score of -2 shows that the value is 2 standard deviations below the mean
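A short sketch of the calculation: subtract the mean first, then divide by the standard deviation. The function name `z_score` and the numbers are made up for illustration.

```python
from statistics import mean, stdev


def z_score(x, center, spread):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - center) / spread


# Population version with hypothetical parameters mu = 100, sigma = 15.
print(z_score(70, 100, 15))  # -2.0, i.e. 2 standard deviations below the mean

# Sample version: xbar and s are computed from the (made-up) sample itself.
sample = [2, 4, 6]
print(z_score(6, mean(sample), stdev(sample)))  # 1.0
```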

Survey Challenges

- Undercoverage or selection bias: parts of the population are systematically left out
- Nonresponse: some people choose not to answer/participate; nonresponse is increasing
- Wording effects: biased or leading questions and complicated/confusing statements can influence survey results
- Response bias: fancy term for lying or forgetting (especially on sensitive/personal issues); can be exacerbated by the survey method

Measures of Variation

- Variation: a measure of the amount that values within a data set vary among themselves
- The range of a set of data is the difference between the maximum value and the minimum value: range = max - min
- The standard deviation (s) of a set of sample values is a measure of variation of values about the mean
- The variance (s^2) has squared units of the original observations and is harder to interpret

The Role of Randomness in Sampling

- Voluntary response sampling: the researcher requests/advertises for individuals of the population to volunteer -> individuals choose whether to be involved
- Convenience sampling: the researcher chooses a sample that is readily available (in a non-random way)
- Probability sampling: individuals or units are randomly selected; the sampling process is unbiased
- Voluntary response and convenience sampling are not scientific methods because their samples are not truly representative of the whole population of interest; rather, they tend to be strongly biased
- Probability sampling relies on randomness and probability; choosing a sample by chance mitigates bias by giving all individuals an equal chance to be chosen
- Simple random sample (SRS): made of randomly selected individuals. Each individual in the population has the same probability of being in the sample, and all possible samples of size n have the same chance of being drawn

Addition rule for disjoint events

- When two events A and B are disjoint: P(A or B) = P(A) + P(B)
- General addition rule for any two events A and B: P(A or B) = P(A) + P(B) - P(A and B)
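Both rules can be verified by counting outcomes. The sketch below assumes one roll of a fair six-sided die, so every outcome is equally likely; the event names and the helper `prob` are illustrative.

```python
from fractions import Fraction

# Assumed sample space: one roll of a fair six-sided die.
sample_space = {1, 2, 3, 4, 5, 6}


def prob(event):
    """Probability of an event (a set of outcomes) when outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))


even = {2, 4, 6}
at_least_five = {5, 6}

# General rule: the overlap {6} is counted twice, so subtract it once.
print(prob(even | at_least_five))                                    # 2/3
print(prob(even) + prob(at_least_five) - prob(even & at_least_five))  # 2/3

# Disjoint events: no overlap, so the probabilities simply add.
one, two = {1}, {2}
print(prob(one | two), prob(one) + prob(two))                        # 1/3 1/3
```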

Dotplot

a graph in which each data value is plotted as a point along a scale of values. Dots representing equal values are stacked. Only used with small data sets (and the variable must be quantitative), a small data set is about 15 observations or less

block

a group of individuals that are known before the experiment to be similar in some way that is expected to affect the response to the treatments

Parameter

a number that describes a characteristic of a population

Outlier

a value that is located very far away from almost all of the other values. Relative to the other data, an outlier is an extreme value

Lurking variable

a variable that is not among the explanatory or response variables in a study, and yet might influence the relationship between the variables studied We say that two variables are confounded when their effects on a response variable cannot be distinguished from each other

Population

all subjects or items of interest, denoted by "N"

Binomial distributions

are models for some categorical variables, typically representing the number of successes in a series of n independent trials
- the total number of observations n is fixed in advance
- each observation falls into just one of two categories: success and failure
- the outcomes of all n observations are statistically independent
- all n observations have the same probability p of "success"
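When these four conditions hold, the probability of exactly k successes follows the standard binomial formula C(n, k) p^k (1-p)^(n-k). The card does not state this formula, so the sketch below adds it; the function name `binomial_pmf` and the coin-flip numbers are chosen for illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k): probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Example: exactly 2 heads in 4 flips of a fair coin.
print(binomial_pmf(2, 4, 0.5))  # 0.375

# The probabilities over all possible counts 0..n add up to 1.
print(sum(binomial_pmf(k, 4, 0.5) for k in range(5)))  # 1.0
```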

Intercept

b0 = ybar - b1*xbar
- the predicted value of y when x = 0
- only meaningful when x can actually take values close to 0

The slope of the regression line

b1 = r(sy/sx), where r is the correlation and sx, sy are the sample standard deviations of x and y
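The slope formula, together with the intercept formula b0 = ybar - b1*xbar, is enough to fit a line by hand. The sketch below computes r as the average product of standardized x and y values (dividing by n - 1), which is the usual sample definition. The function names and the data are made up for illustration.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Sample correlation r: sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    xbar, ybar = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    return sum((x - xbar) / sx * (y - ybar) / sy for x, y in zip(xs, ys)) / (n - 1)


def least_squares(xs, ys):
    """Return (b0, b1) using b1 = r * sy / sx and b0 = ybar - b1 * xbar."""
    r = correlation(xs, ys)
    b1 = r * stdev(ys) / stdev(xs)
    b0 = mean(ys) - b1 * mean(xs)
    return b0, b1


# Made-up data for illustration.
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 7]
b0, b1 = least_squares(xs, ys)
print(round(b0, 3), round(b1, 3))  # 0.5 1.6
```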

Independent events

knowing that one event occurred does not change the probability of the other event occurring
- ex: being male and getting tails when flipping a coin -> independent

Continuous sample space

continuous variables that can take on any one of an infinite number of possible values over an interval. Comparable to quantitative variables, continuous spaces will be able to take on any real number within an interval, ex: cholesterol level

risk

corresponds to the probability of an undesirable event such as death, disease, or side effects

Experimental Study

deliberately imposing or assigning a treatment on individuals and recording their responses. Influential factors can be controlled. Unlike observational studies, experiments provide an opportunity for manipulating the environment and confounding variables. Therefore, well-designed experiments yield stronger conclusions.

Sampling Design

describes exactly how a sample is chosen from the population.

Poisson distribution

describes the count X of occurrences of an event in fixed, finite intervals of time or space when
- occurrences are all independent, and
- the probability of an occurrence is the same over all possible intervals.
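Under these conditions the probability of observing exactly k occurrences follows the standard Poisson formula e^(-lam) lam^k / k!, where lam is the mean number of occurrences per interval. The card gives neither the formula nor a symbol for the mean, so both are added here as assumptions, and the rate of 2 per interval is a made-up value.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """P(X = k) when occurrences average lam per interval."""
    return exp(-lam) * lam**k / factorial(k)


# Hypothetical rate: an average of 2 occurrences per interval.
for k in range(4):
    print(k, round(poisson_pmf(k, 2.0), 4))
```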

Discrete sample space

discrete variables that can take on only certain values (a whole number or a descriptor). Comparable to categorical variables, discrete sample spaces will generally have a finite sample space (can only take on certain outcomes), ex: blood type

Quartiles

divide the sorted data values into four equal parts
- Q1: first quartile, 25th percentile; find the median of the bottom 50%
- Q2: second quartile, 50th percentile; the median
- Q3: third quartile, 75th percentile; find the median of the top 50%

Correlation

exists between two variables when one of them is linearly related to the other in some way

Sample Surveys

is an observational study that relies on a random sample drawn from the entire population

Median

measure of center that is the middle value when the data values are arranged in increasing or decreasing order
- If the number of values is odd, the median is the number located in the exact middle of the list
- If the number of values is even, the median is found by computing the mean of the two middle numbers
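The odd and even cases can be written out directly. This is a minimal sketch; the function name `median_of` is illustrative, and Python's built-in `statistics.median` gives the same results.

```python
def median_of(values):
    """Middle value of the sorted data; mean of the two middle values when n is even."""
    data = sorted(values)
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        return data[mid]                  # odd n: exact middle of the list
    return (data[mid - 1] + data[mid]) / 2  # even n: average the two middle values


print(median_of([3, 1, 2]))     # 2
print(median_of([4, 1, 3, 2]))  # 2.5
```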

Inferential Statistics

methods for drawing conclusions about a phenomenon (population) on the basis of data (sample)

Descriptive Statistics

methods of organizing, summarizing, and presenting data in an informative way

Random

outcomes are uncertain, but there is nonetheless a regular distribution of outcomes in a large number of repetitions

Observational Study

record data on individuals without attempting to influence the responses. Observational studies often fail to yield clear causal conclusions because the explanatory variable is confounded with lurking variables.

stratified random sample

sample distinct groups (strata) within the population separately, and then combine these samples

Least Squares Regression Line

- Makes the sum of the squared vertical distances from the data points to the line as small as possible
- The sum of the (signed) vertical distances between the data points and the line is zero
- The least squares regression line will always pass through the point (xbar, ybar)
- Equation: yhat = b0 + b1x
- Least-squares regression is only for linear associations

Two-way tables

- Summarize data about two categorical variables (or factors) collected on the same set of individuals
- Display joint probabilities for combinations of events

Conditional Distribution

the distribution of one factor for each level of the other factor. A conditional percent is computed using the counts within a single row or a single column. The denominator is the corresponding row or column total (rather than the table grand total). To calculate a conditional distribution, you will fix either a row or a column then calculate the distribution of the levels of the other variable.
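The difference between the two denominators (row total versus grand total) is easy to see in a small table. The counts, labels, and function names below are hypothetical and made up only for this sketch.

```python
# Hypothetical two-way table of counts (rows: group, columns: outcome).
table = {
    "treatment": {"improved": 30, "not improved": 20},
    "control": {"improved": 10, "not improved": 40},
}


def conditional_distribution(table, row):
    """Distribution of the column variable within one fixed row (divide by the row total)."""
    row_total = sum(table[row].values())
    return {col: count / row_total for col, count in table[row].items()}


def marginal_row_distribution(table):
    """Distribution of the row variable (divide each row total by the grand total)."""
    grand_total = sum(sum(cols.values()) for cols in table.values())
    return {row: sum(cols.values()) / grand_total for row, cols in table.items()}


print(conditional_distribution(table, "treatment"))  # {'improved': 0.6, 'not improved': 0.4}
print(conditional_distribution(table, "control"))    # {'improved': 0.2, 'not improved': 0.8}
print(marginal_row_distribution(table))              # {'treatment': 0.5, 'control': 0.5}
```

The two conditional distributions differ (0.6 vs 0.2 improved), which suggests an association that the marginal distribution alone cannot reveal.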

Mean

the mean or arithmetic average of a data set is the measure of center found by adding the values and dividing the total by the number of values
- Notation: μ (mu) = population mean (population parameter)
- Notation: xbar = sample mean (sample statistic)

Complement Rule

the probability that an event A does not occur (not A) equals 1 minus the probability that it does occur: P(A^C) = P(not A) = 1 - P(A)

Statistical Inference

the process of drawing conclusions about a population on the basis of sample data

R squared

the proportion of variation in y that is explained by x
- Also called the coefficient of determination (r2)
- Represents the fraction of the variation in y that is explained by the regression model
- Will always be between 0 and 1
- The closer r2 gets to 1, the better the model explains the data
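One common way to compute this fraction is 1 - SSE/SST, where SSE is the sum of squared residuals (observed minus predicted) and SST is the total squared variation of y about its mean. The card does not give this formula, so it is added here as an assumption. In the sketch, the data are made up and the coefficients b0 = 0.5, b1 = 1.6 are the least-squares fit for that data.

```python
def r_squared(xs, ys, b0, b1):
    """Fraction of the variation in y explained by the line yhat = b0 + b1*x."""
    ybar = sum(ys) / len(ys)
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]  # observed - predicted
    sse = sum(e**2 for e in residuals)       # unexplained variation
    sst = sum((y - ybar) ** 2 for y in ys)   # total variation in y
    return 1 - sse / sst


# Made-up data and its least-squares line yhat = 0.5 + 1.6x.
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 7]
print(round(r_squared(xs, ys, 0.5, 1.6), 4))  # 0.9846
```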

block design/crossover design

the random assignment of individuals to treatments is carried out separately within each block

Residuals

the vertical distances from each point to the least-squares regression line
- Residual = Observed - Predicted
- Outliers have unusually large residuals
- Use the equation of the least-squares regression line to predict y for any value of x within the range studied
- Prediction outside the range of the data is extrapolation (avoid extrapolation)

disjoint, or mutually exclusive

two events that have no outcomes in common, so they can never happen together; their joint probability is zero

