STAT 220 Midterm 1
Event
A subset of a sample space
Histogram
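A plot of a numerical variable's distribution; the horizontal axis is divided into class intervals (bins), and each bar's height shows how many observations fall in that bin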
How to deal with low response rate
Cannot take a new sample to replace the non-respondents, since the newly sampled subjects who answer are still respondents, not non-respondents. Must make more effort to reach the non-respondents. Always check the response rate; a low one may make the study untrustworthy.
Nominal variable
Categorical and categories do not have a natural order
Ordinal Variable
Categorical and has ordered categories
boxplot (box and whisker plot)
Graph based on the five-number summary (FNS); know how to make
Bin width
Determines the precision (level of detail) of a histogram: narrower bins show more detail, wider bins smooth it out
68% rule
In a normal distribution, about 68% of the cases lie within ONE standard deviation of the mean (on either side)
95% rule
In a normal distribution, about 95% of the cases lie within TWO standard deviations of the mean (on either side)
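Both rules can be checked numerically. This is a minimal sketch, assuming a normal distribution and using only Python's standard library; the helper name `prob_within` is made up for illustration.

```python
import math

def prob_within(k):
    """Probability that a normal value lies within k SDs of its mean."""
    # For any normal distribution, P(|Z| <= k) = erf(k / sqrt(2)).
    return math.erf(k / math.sqrt(2))

within_one = prob_within(1)   # roughly 0.68
within_two = prob_within(2)   # roughly 0.95
print(round(within_one, 3), round(within_two, 3))
```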
Quartile
One of the three markers (Q1, median, Q3) that divide sorted data into four equal parts
Numerical Variables
Values that represent quantities; continuous and discrete
Case
What we collect data from (every row represents a case)
random assignment vs random sampling
Whether a result is generalizable from the data to a larger population depends on whether the data came from a random sample. Whether a cause-and-effect relationship can be inferred depends on whether the subjects were randomly assigned to the control/treatment groups.
Confounder
a factor associated with some outcome that confuses or confounds the determination of true cause and effect
Mosaic plot
a modified segmented bar graph in which the width of each rectangle is proportional to the number of individuals in the corresponding category; maintains absolute count
Statistic
a number describing a characteristic of a sample; can be computed based on your sample
Parameter
a number describing a characteristic of the population
observational study
a study based on data in which no manipulation of factors has been employed; can show association only, not causation
Categorical Variables
any variable that is not quantitative; nominal and ordinal
Histogram in a density scale
bar area = proportion of observations in the bin
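On a density scale, each bar's height is the bin's proportion divided by the bin's width, so the bar areas add up to 1 even when bins have unequal widths. A rough sketch with made-up data and made-up bin edges:

```python
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]   # made-up values
edges = [0, 2, 4, 6, 10]                 # bins [0,2), [2,4), [4,6), [6,10]

n = len(data)
heights = []   # bar heights on the density scale
areas = []     # bar areas (height * width)
for left, right in zip(edges[:-1], edges[1:]):
    # Count values in [left, right); the last bin also includes its right edge.
    is_last = right == edges[-1]
    count = sum(1 for x in data if left <= x < right or (is_last and x == right))
    proportion = count / n
    width = right - left
    heights.append(proportion / width)            # density = proportion / bin width
    areas.append((proportion / width) * width)    # area recovers the proportion

print(heights)
print(sum(areas))   # total area is 1, up to floating-point rounding
```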
non-response bias
bias introduced to a sample when a large fraction of those sampled fail to respond
Rows represent
cases
Variable
characteristic of a case that varies; describes a single case
retrospective study
collect data after events have taken place
Multistage sampling
combines the three basic sampling methods (simple random, stratified, cluster) in any way
graph for categorical v categorical
contingency tables, segmented/standardized bar plots, mosaic plots
Bad sampling methods
convenience samples (whoever is available) and voluntary response samples (e.g., online polls)
Outliers
extreme values that don't seem to belong with the rest of the data
Look for in scatter plots
form of the relationship, direction, strength, deviations from pattern
marginal totals
row and column totals of a contingency table; give the distribution of each of the two variables on its own
Segmented bar plot
graphical display of contingency table information (two categorical variables)
Clustered sampling
divide the entire population into subgroups (clusters), randomly choose some of the clusters, and take all the data from each chosen cluster
bell-shaped distribution
highest point occurs in the middle and tails go off equally to the left and right
Difference between bar plot and histograms:
histograms = numerical data, bar plots = categorical data
When are two variables independent
if row/column proportions do not change, the variables are independent
Placebo effect
improvement resulting from the mere expectation of improvement
Clusters
indicate data may be comprised of several distinct kinds of individuals
mean < median impact on graph
left skewed
Probability
likelihood that a particular event will occur
Forms of relationship
linear, no relation, nonlinear
The mean and median of a symmetric distribution:
mean = median
Common measure of center
mean and median
Mean v median
mean is more easily changed with one data point
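A quick way to see this is to replace one value with an extreme one and compare. The numbers below are made up for illustration.

```python
import statistics

values = [2, 3, 4, 5, 6]             # made-up data
with_outlier = values[:-1] + [60]    # swap the largest value for an extreme one

mean_before = statistics.mean(values)
median_before = statistics.median(values)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

# The mean jumps from 4 to 14.8; the median stays at 4.
print(mean_before, median_before)
print(mean_after, median_after)
```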
mean and median in skewed distribution
mean is pulled toward the longer tail (because it is more sensitive to extreme values)
center of histogram
mean or median
Standard deviation (concept)
measure that describes an average distance of every score from the mean
Properties of SD
measures spread about the mean and should be used only when the mean is the measure of center; very sensitive to outliers
Cells
interior entries of a contingency table (the counts for each combination of categories), not the row or column totals
five number summary
minimum, Q1, median, Q3, maximum
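A minimal sketch of computing the five numbers with Python's standard library, on made-up data. Textbooks and software use slightly different quartile conventions, so the `method="inclusive"` choice here is an assumption and may not match the hand method taught in class.

```python
import statistics

data = [1, 3, 4, 6, 7, 8, 10, 12, 15]   # made-up values

# quantiles(..., n=4) returns the three cut points Q1, median, Q3.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")

five_number_summary = (min(data), q1, median, q3, max(data))
print(five_number_summary)
```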
What to look for in a histogram
modality, skewness, outliers, center, spread
Blocking
more sophisticated design technique for experiments (know how this works/steps)
Unimodal
one mode/peak
frequentist interpretation of probability
the probability of an event is the proportion of times the event occurs in a large number of repetitions of the experiment (long-run relative frequency)
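The long-run idea can be illustrated by simulation. This sketch assumes a fair coin and uses a hypothetical helper `proportion_heads`; the fixed seed only makes the run repeatable.

```python
import random

random.seed(220)  # fixed seed so repeated runs give the same numbers

def proportion_heads(flips):
    """Simulate fair coin flips and return the proportion that land heads."""
    heads = sum(1 for _ in range(flips) if random.random() < 0.5)
    return heads / flips

# The proportion tends to settle near 0.5 as the number of flips grows.
for flips in (10, 1_000, 100_000):
    print(flips, proportion_heads(flips))
```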
continuous variable
quantitative, with values that are not countable (can take any value in an interval)
discrete variable
quantitative, with values that form a set of separate numbers (0, 1, ...)
Simple random sampling
random selection; every sampling unit has a known and equal chance of being selected (con: impractical for large populations)
how do you combat confounding
randomization (restrict or balance the confounder), make comparisons within small, homogeneous groups
Mean > median impact on graph
right skewed
standardized bar plot
same as segmented bar plot but uses proportions, not absolute count
graph for numerical v numerical
scatterplot
Strength of association of variables
seen by how much scatter there is around main form
two-way contingency table
shows data for two categorical variables as a two-dimensional table of rows and columns
graph for numerical v categorical
side by side boxplots, histograms on the same horizontal axis
standard deviation formula
sqrt( (sum of squared deviations from the mean) / (n - 1) )
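The formula can be followed step by step in code. A small sketch on made-up data, checked against the standard library's `statistics.stdev`, which also divides by n - 1:

```python
import math
import statistics

data = [2, 4, 4, 6, 9]   # made-up values
n = len(data)

mean = sum(data) / n                                   # 5.0
sum_sq_dev = sum((x - mean) ** 2 for x in data)        # 9 + 1 + 1 + 1 + 16 = 28
variance = sum_sq_dev / (n - 1)                        # 28 / 4 = 7
sd = math.sqrt(variance)                               # square root of the variance

print(variance, sd, statistics.stdev(data))
```

Note that the division by n - 1 happens before the square root; the variance is the quantity under the root.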
prospective study
study that identifies individuals and collects information as events unfold
left-skewed distribution
tail extends to the left
right skewed distribution
tail extends to the right
Mean of histogram
the balance point of the histogram
Sample
the part of the population we actually examine and have data on
Sample Space
the set of all possible outcomes of an experiment
Trimodal
three modes/peaks
column total
total number in each column
Row total
total number in each row
Bimodal
two modes/peaks
single categorical variable
use a pie or bar chart, bar plot is most common
1.5 IQR Rule
used for identifying outliers: any values that are more than 1.5 times the IQR lower than the first quartile or higher than the third quartile are called outliers
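A sketch of the rule on made-up data. The quartile convention (`method="inclusive"`) is an assumption; a different convention can shift the fences slightly.

```python
import statistics

data = [1, 3, 4, 6, 7, 8, 10, 12, 30]   # made-up values

q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr   # values below this are flagged
upper_fence = q3 + 1.5 * iqr   # values above this are flagged

outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(lower_fence, upper_fence, outliers)
```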
Columns represent
variables
Simpson's Paradox
when averages or rates are computed within separate groups, they can appear to contradict the overall average: a trend seen in every group can reverse when the groups are combined
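A small numeric illustration, using entirely hypothetical counts: treatment A does better than B inside each group, yet B does better overall, because A was mostly used in the group where success is rare.

```python
# (successes, total) for two hypothetical treatments in two hypothetical groups
results = {
    "group 1": {"A": (8, 10), "B": (70, 100)},
    "group 2": {"A": (30, 100), "B": (2, 10)},
}

def rate(successes, total):
    return successes / total

# Within each group, A has the higher success rate.
for group, arms in results.items():
    print(group, {arm: rate(*counts) for arm, counts in arms.items()})

# Combined over both groups, B has the higher success rate.
overall = {}
for arm in ("A", "B"):
    successes = sum(results[g][arm][0] for g in results)
    total = sum(results[g][arm][1] for g in results)
    overall[arm] = rate(successes, total)
print("overall", overall)
```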
negative association
when one variable increases or becomes larger, the other does the opposite
positive association
when one variable increases or becomes larger, the other does the same
Principles of experimenting
Replicate (collecting a large enough sample to make sure the difference in outcome of groups is not by chance), control, randomization, blocking
Variance
Measures the average squared distance of the data points from the mean; formula: standard deviation squared
Modes of a histogram
Number of peaks
Types of variables
Numerical and Categorical
Sampling keywords
Population, sample, parameter, statistic
Interquartile Range (IQR)
Q3-Q1
Common measures of spread
Range = max-min, SD, IQR
What do you use the mean and median for in a graph
Tell if the graph is skewed and by how much
Skewness
Describes how asymmetric a distribution is: the direction in which the longer tail extends
Multimodal
distributions with more than two modes/peaks
stratified sampling
divide the population into subgroups (strata) and then perform simple random sampling within each stratum
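The three basic sampling methods in this set can be contrasted in a few lines. This is only a sketch on a hypothetical population of 12 labeled students; the strata, cluster sizes, and sample sizes are arbitrary choices for illustration.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical population: 6 first-years and 6 seniors
population = [("first-year", i) for i in range(6)] + [("senior", i) for i in range(6)]

# Simple random sample: every unit has the same chance of selection.
srs = random.sample(population, 4)

# Stratified sample: group units by stratum, then take an SRS within each stratum.
strata = {}
for unit in population:
    strata.setdefault(unit[0], []).append(unit)
stratified = [u for members in strata.values() for u in random.sample(members, 2)]

# Cluster sample: randomly choose whole clusters and keep everyone in them.
clusters = [population[i:i + 3] for i in range(0, len(population), 3)]
chosen_clusters = random.sample(clusters, 2)
cluster_sample = [u for cluster in chosen_clusters for u in cluster]

print(srs)
print(stratified)
print(cluster_sample)
```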
column proportions
dividing by column total to find proportions (normalized to 1)
overall proportions
dividing by grand total to find proportions (normalized to 1)
row proportions
dividing by row total to find proportions (normalized to 1)
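The three kinds of proportions differ only in the total used as the denominator. A sketch on a hypothetical 2x2 table of counts (rows are groups, columns are outcomes):

```python
# Hypothetical counts: rows = group, columns = outcome
table = {
    "treatment": {"improved": 30, "not improved": 20},
    "control": {"improved": 15, "not improved": 35},
}

row_totals = {r: sum(cols.values()) for r, cols in table.items()}
col_totals = {}
for cols in table.values():
    for c, count in cols.items():
        col_totals[c] = col_totals.get(c, 0) + count
grand_total = sum(row_totals.values())

# Each cell divided by its row total, its column total, or the grand total.
row_props = {r: {c: n / row_totals[r] for c, n in cols.items()} for r, cols in table.items()}
col_props = {r: {c: n / col_totals[c] for c, n in cols.items()} for r, cols in table.items()}
overall_props = {r: {c: n / grand_total for c, n in cols.items()} for r, cols in table.items()}

print(row_props)
print(col_props)
print(overall_props)
```

Here the row proportions differ between the two rows (0.6 versus 0.3 improved), which is the pattern that suggests the two variables are not independent.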
Population
entire group of individuals we are interested in
how to establish causality
establish correlation, establish time order, rule out alternative explanations
intersection
event that both A and B occur (A ∩ B)
Union
event that either A or B (or both) occurs (A ∪ B)
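Events are subsets of the sample space, so intersection and union map directly onto set operations. A sketch assuming one roll of a fair six-sided die, with equally likely outcomes; the events A and B and the helper `prob` are chosen only for illustration.

```python
from fractions import Fraction

sample_space = set(range(1, 7))                  # outcomes of one die roll
A = {x for x in sample_space if x % 2 == 0}      # event: roll is even
B = {x for x in sample_space if x >= 4}          # event: roll is at least 4

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

both = A & B      # intersection: outcomes in A and in B
either = A | B    # union: outcomes in A or B (or both)

print(both, prob(both))
print(either, prob(either))
```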
double-blind experiment
experiment in which neither the experimenter nor the participants know which participants received which treatment
single-blind experiment
experiment in which the participants do not know which treatment they received, but the experimenter does
