STATISTICS
histogram
-range of possible values is divided into evenly-sized intervals - vertical axis shown as a frequency or relative frequency - individuals are counted, creating the height of the column over the range
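The counting step behind a histogram can be sketched as follows. This is a minimal illustration, assuming equal-width intervals and made-up exam scores; the helper name `histogram_counts` is hypothetical.

```python
def histogram_counts(values, low, high, n_bins):
    """Count how many values fall in each of n_bins equal-width intervals on [low, high]."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        if v < low or v > high:
            continue  # ignore values outside the plotted range
        # The top edge is assigned to the last interval so that `high` itself is counted.
        index = min(int((v - low) / width), n_bins - 1)
        counts[index] += 1
    return counts


# Made-up exam scores, binned into 5 intervals of width 10 on [50, 100].
scores = [52, 58, 61, 67, 67, 73, 75, 78, 81, 84, 88, 95]
counts = histogram_counts(scores, 50, 100, 5)        # frequency (column heights)
relative = [c / len(scores) for c in counts]         # relative frequency
```

Either `counts` or `relative` can serve as the vertical axis; the shape of the histogram is the same.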
What is the fundamental difference between a matched pairs experiment and a case-control study?
Case-control studies are observational studies, not experiments
raw data
The original data as it was collected; must have: a unique identifier for each individual, all data for an individual in one row, each variable in one column
Which of the following is correct about categorical data?
The proportion of a given outcome for a sample of observations is given the symbol P-HAT, whereas the proportion of a given outcome for an entire population is given the symbol P.
frequency
When summarizing categorical data, counts can also be called
percentile
a measure of placement in a sorted quantitative dataset
When examining a time plot, we especially look for evidence of
a trend, cyclical pattern, or both
issues that affect our ability to estimate the properties of a population when using only sample data
amount of information made available (sample size), variability of data collected, potential bias
mistake
an error was made during the study or when recording the value -can be corrected or discarded -always keep detailed records
mean
average
The outcomes in a graph representing categorical data
can be sorted in any order
association, however strong, does NOT imply
causation
center (distribution of quantitative data)
center of mass (mean) vs midpoint (median)
response bias
fancy term for lying or forgetting (especially on sensitive or personal issues), can be exacerbated by the survey method (in person vs by phone or online)
when describing association, need to find
form, direction, strength, outliers
form
general shape of the plot of points (linear, curved, clusters, no pattern)
"suspected" outlier flag
greater than Q3 + 1.5 * IQR or less than Q1 - 1.5 * IQR, where IQR = Q3 - Q1
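A rough sketch of the flag under the common 1.5 * IQR convention. The data are made up, the helper name `suspected_outliers` is hypothetical, and the quartile method ("inclusive") is an assumption, since textbooks and software differ slightly on how quartiles are computed.

```python
import statistics


def suspected_outliers(values):
    """Return the values that fall beyond the 1.5 * IQR fences."""
    # Quartile convention is an assumption; other methods can shift Q1 and Q3 a little.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]


data = [2, 4, 4, 5, 7, 8, 9, 11, 30]   # made-up values
flagged = suspected_outliers(data)
```

A flagged value is only "suspected": it still has to be investigated before deciding whether it is a mistake, an atypical individual, or a legitimate value.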
strength
how closely the points fit the "form" -weak (lots of scatter) -strong (little scatter)
median
measure of center -splits the data set into sets of equal number of data points -50th percentile (50% smaller, 50% larger)
five number summary
minimum, Q1 (25th percentile), median, Q3 (75th percentile), maximum
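One way to compute the five numbers, as a sketch only. The data are made up, `five_number_summary` is a hypothetical helper, and the "inclusive" quartile method is an assumption (other conventions can give slightly different Q1 and Q3).

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


data = [2, 4, 4, 5, 7, 8, 9, 11, 12]   # made-up values
summary = five_number_summary(data)
```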
quantitative variable
must be on scaled axis: -dotplot -histogram
The variable EyeColor in the UCI student dataset is an example of
nominal data
nonresponse
some people chose not to answer/participate
variance
standard deviation squared
With a simple random sample, nobody can game the system to purposefully amplify their answers over that of others because
the individuals are selected entirely by chance
To learn about the target population, anecdotal data and volunteer data are fundamentally biased data collection processes because
the individuals used are typically unusual and may have ulterior motives.
timeplots
trends or major changes over time, seasonal or cyclical variations
explanatory variable (independent variable)
x-value of a function
response variable (dependent variable)
y-value of a function
matched pairs experiment
(comparisons made at individual level) imposed conditions are compared on pairs/sets of RELATED individuals -the subjects in each pair are closely matched, then randomly assigned a treatment -detects subtle effects of variables
Which of the following is correct notation for the standard deviation?
The standard deviation of a sample of observations is labeled with the English letter S, whereas the standard deviation of an entire population is labeled with the Greek letter SIGMA.
What is the main reason that it is easier to reach a conclusion of causation from an experimental study than from an observational study?
When conditions are imposed at random like in a randomized experiment, the conditions are not confounded with any important lurking variable
relative frequency
When summarizing categorical data, proportions can also be called
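Counts (frequency) and proportions (relative frequency) for one categorical variable can be tallied as below. The eye-color values are made up for illustration.

```python
from collections import Counter

# Made-up eye colors recorded for 10 students.
eye_colors = ["brown", "blue", "brown", "green", "brown",
              "blue", "brown", "hazel", "blue", "brown"]

frequency = Counter(eye_colors)   # count for each outcome
n = len(eye_colors)
# Proportion for each outcome; the proportions add up to 1.
relative_frequency = {color: count / n for color, count in frequency.items()}
```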
What is the fundamental difference between a sample survey of human beings that may suffer from nonresponse and data using a volunteer sample?
You can't participate in a sample survey unless you were selected by chance, and you can never answer any given question more than once
A bar graph is more versatile than a pie chart to plot categorical data because
a bar graph can display more than one categorical variable whereas a pie chart cannot
modified boxplot
a display for quantitative data that graphs the five-number summary on an axis and shows outliers if they exist (does not include them in whiskers)
scatterplot
a graphed cluster of dots, each of which represents the values of two variables -specifically two quantitative variables recorded for each individual (1 dot= 1 individual, the two variables for that individual form the (x,y) coordinates)
percentile AKA quantile
a measure of placement in the ordered data set -value splitting data set in two, some smaller or equal to the percentile -pth percentile= p% smaller or equal to
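The "p% smaller or equal to" idea can be checked directly by counting. A minimal sketch with made-up data; `percent_at_or_below` is a hypothetical helper name.

```python
def percent_at_or_below(values, x):
    """Percent of observations that are smaller than or equal to x."""
    return 100 * sum(1 for v in values if v <= x) / len(values)


data = [2, 4, 4, 5, 7, 8, 9, 11, 12, 30]   # made-up values
rank = percent_at_or_below(data, 7)        # 5 of the 10 values are <= 7
```

Here the value 7 sits at the 50th percentile of this data set under this counting definition.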
linear correlation coefficient (r)
a measure of the strength and direction of the linear relation between two quantitative variables -unitless (calculated relative to the mean and standard deviation of both variables) -bounded by -1 and 1 -meaningful for linear or random (no pattern) scatter only, not for curved patterns -always plot the data before computing any summary value -outliers can have a strong impact
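A hand computation of r, as a sketch with made-up data; the function name `correlation` is hypothetical. Rescaling x (for example, changing its units) is included to illustrate that r is unitless.

```python
import math


def correlation(xs, ys):
    """Linear correlation coefficient r from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]   # made-up values
y = [2, 4, 5, 4, 5]
r = correlation(x, y)
# Same data with x expressed in different units: r should not change.
r_rescaled = correlation([100 * v for v in x], y)
```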
extrapolation
a model-based prediction for a value outside the range of data used to create the model
double-blind procedure
an experimental procedure in which both the research participants and the research staff are ignorant (blind) about whether the research participants have received the treatment or a placebo. Commonly used in drug-evaluation studies. (informed consent for human subjects- ethical matter)
bar graphs
bar heights indicate a summary value for each outcome shown (some or all outcomes are shown, count/proportion of outcomes must be shown, versatile- can easily be misrepresented) -can be used to display anything for which the height of the bar represents a numerical value
wording effects
biased or leading questions, complicated/confusing statements can influence survey results
actual relationship
both variables are influenced by another variable (confounding) -one variable actually influences the other, directly or indirectly (need to explore all possibilities before reaching causality when interpreting data in context)
stratification
constrained random sample so that it has x%, y%, z% of individuals of certain types (typically to fit the population makeup)
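A sketch of proportional stratification, assuming two made-up strata and a hypothetical helper `stratified_sample`. Each stratum contributes a share of the sample equal to its share of the population; within a stratum, individuals are picked at random.

```python
import random


def stratified_sample(strata, total_size, rng):
    """Draw a random sample from each stratum, sized in proportion to the stratum."""
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Rounding may make the total differ slightly from total_size in general.
        k = round(total_size * len(members) / population_size)
        sample[name] = rng.sample(members, k)
    return sample


# Made-up population: 60% freshmen, 40% sophomores.
strata = {
    "freshman": [f"F{i}" for i in range(60)],
    "sophomore": [f"S{i}" for i in range(40)],
}
rng = random.Random(1)          # fixed seed only so the sketch is repeatable
sample = stratified_sample(strata, 10, rng)
```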
2 categorical variables
contingency/two-way tables, bar graphs, comparing conditional proportions (comparing distribution of eye color between male and female students)
systematic random sample
create your own sample by taking every nth individual on the population list (beware of potential patterns/cycles in the population) -does not guarantee a totally unbiased sample, but makes a biased sample unlikely
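A minimal sketch of taking every nth individual from a list. The random starting point is an assumption added here (a common way to make the selection random); the roster and the helper name `systematic_sample` are made up.

```python
import random


def systematic_sample(population, step, rng):
    """Pick a random start among the first `step` individuals, then take every step-th one."""
    start = rng.randrange(step)
    return population[start::step]


roster = list(range(1, 101))    # made-up population list of 100 ID numbers
rng = random.Random(0)          # fixed seed only so the sketch is repeatable
sample = systematic_sample(roster, 10, rng)
```

If the list itself has a cycle of length 10 (say, every 10th entry is a team captain), this sample would pick up only one kind of individual, which is the pattern/cycle risk noted on the card.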
illusion
created by inappropriately lumping things together -results in a deceptive pattern or the appearance of no pattern
population data
data from every individual of interest cons: expensive, time consuming, maybe impossible pro: exact knowledge of the population
case-control observational study
data is recorded in an observational setting using 2 distinct random samples of individuals that differ by some feature; individuals with the feature are CASES, those without are CONTROLS
experimental study
treatments are deliberately imposed on the individuals and their responses are recorded pros: influential factors can be controlled, concluding causation is possible cons: unrealistic/simplistic setting
Boxplot
displays the 5-number summary as a central box with whiskers that extend to the non-outlying data values
multivariate analysis
examining a pattern in a single variable is interesting, but we can go further: -comparing patterns across different groups -studying how a pattern changes over time -looking for patterns of association made by two or more variables *higher level of complexity
direction
if there is a pattern to the plot of points (positive, negative, no direction)
r^2
indicates what fraction of the variation in y can be explained by the linear regression model
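The "fraction of variation explained" reading can be sketched as 1 minus (unexplained variation / total variation). The data are made up and `r_squared` is a hypothetical helper that fits the least squares line itself.

```python
def r_squared(xs, ys):
    """Fraction of the variation in y explained by the least squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    # Variation left over around the line (sum of squared residuals).
    unexplained = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # Total variation of y around its mean.
    total = sum((y - mean_y) ** 2 for y in ys)
    return 1 - unexplained / total


x = [1, 2, 3, 4, 5]   # made-up values
y = [2, 4, 5, 4, 5]
fraction_explained = r_squared(x, y)
```

For these values the result is about 0.6, which is also the square of the correlation coefficient r for the same data.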
atypical individual
individual is fundamentally not representative of target population -belongs to different subgroup or is known to be atypical -values may be discarded if summary is desired for the majority group only
dotplot
individual value plot -each individual placed on scaled axis -for large data, one dot= multiple individuals
for every study, we need to identify...
individuals and variables studied, type and design of the study (objective)
completely randomized experiment
individuals are assigned to different treatment groups; the assignment occurs completely at random and creates independent random samples
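Random assignment to two groups can be sketched by shuffling the list of subjects and splitting it. The subject labels and the two-group design are assumptions for illustration.

```python
import random

subjects = [f"subject_{i}" for i in range(1, 21)]   # 20 made-up subjects

rng = random.Random(42)        # fixed seed only so the sketch is repeatable
shuffled = subjects[:]         # copy, so the original list is left untouched
rng.shuffle(shuffled)          # chance alone decides the order

half = len(shuffled) // 2
treatment_group = shuffled[:half]
control_group = shuffled[half:]
```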
Longitudinal cohort study
individuals are observed repeatedly over time, examine the compounded effect of naturally occurring factors over time
independent samples
individuals compared are unrelated, the comparison is made at the group level ex: comparing GPA at the end of the first year in a random sample of freshmen who did or did not attend a seminar on learning skills and study habits
When describing quantitative data, an outlier
is a data point that does not fit the main pattern of the data (investigate- don't throw out unless justified)
double-blind experiment
neither the participants nor the experimenters know who is receiving which treatment until the study is over
quantitative data
numerical data (values of many individuals can be averaged) -discrete (only record whole numbers) -continuous (anything over an interval: decimals)
parameter
numerical summary of a population
sample surveys (polls)
observational design
case-control studies
observational studies using two distinct random samples that differ by one important feature
volunteer response sample
open to anyone who wants to participate, fundamentally biased, open to manipulation, potentially hugely different from target population ex: write-ins, polls, bots, tweets
outliers (distribution of quantitative data)
points that do not fit the main pattern
What is NOT an important issue affecting our ability to estimate the properties of a population when using only sample data?
population size
categorical data
qualitative description recorded for each individual (individuals with a given feature can be counted) -nominal (attribute) -ordinal (ranked)
coincidence
random occurrence of unrelated things
simple random sample
randomly selecting individuals so that every individual in the population has the same probability of being selected and all possible samples of size n have the same chance of being drawn ex: in a class of 100 students the instructor uses the roster to randomly pick 5 students' midterms to check that they were graded properly
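The roster example can be sketched with the standard library's `random.sample`, which draws without replacement so that every set of 5 students is equally likely. The roster names are made up.

```python
import random

roster = [f"student_{i}" for i in range(1, 101)]   # made-up class of 100 students

rng = random.Random(7)              # fixed seed only so the sketch is repeatable
chosen = rng.sample(roster, 5)      # 5 distinct students, selected entirely by chance
```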
spread (distribution of quantitative data)
range (min to max)
observational study
record data on individuals without attempting to influence the responses cons: lots of unknown factors, concluding causation is very difficult pro: realistic setting
association (relationship)
refers to the idea that variables can vary together with some level of synchronicity; the existence of an overall pattern -deterministic (exact pattern) -statistical: example being weather (overall, but not an exact pattern)
legitimate value
represents the natural variability for the group and the variable measured -provides important information about location and spread -do not discard
conditional distribution
row percents and column percents of one factor, given the levels of the other factor
2 quantitative variables
scatterplot, correlation and regression (examining the distribution of height and weight among students)
convenience sampling
select a set of easily accessible individuals, representative of similar individuals but not the whole population ex: using college students for human behavioral studies
1 categorical variable and 1 quantitative variable
side-by-side dotplots or boxplots, comparing means or medians, variability and spread
pie charts
slices are scaled to proportion of each outcome that make up the categorical variable (all outcomes must be shown, count/proportion must be shown, only represent one variable in one group)
when choosing a variable, things that need to be considered
study's ultimate objective, what aspects of the goal can be recorded, would quantitative or categorical point of view be better, cost, speed, and accuracy
marginal distribution
summarizes each factor independently with proportions or percents
statistic
summary of values of sample data
shape (distribution of quantitative data)
symmetrical- homogenous, skewed- right, left, multimodal(several types of occurrences), irregular
replication and randomization prevent what?
systematic bias and confounding ex: several individuals are studied for each condition, individuals are assigned to treatment using probability
sample data
the data are from only some of the individuals of interest pros: cheaper, faster, typically doable cons: uncertainty about the population
confounding variables
the effects of two or more variables on the response variable cannot be distinguished from one another -major issue because there is no clear conclusion, especially in observational studies -makes it difficult to conclude causation
matched pairs, cross-over, repeated measures, time series
the individuals compared across conditions are clearly RELATED or identical, comparison is made at the individual level ex: comparing total sleep times the week before and the week after finals in a random sample of freshmen (same students both times)
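Comparing at the individual level means working with one difference per individual. A small sketch with made-up sleep times (hours per night) for four students measured twice:

```python
# Made-up average nightly sleep (hours) for the same four students.
before = [7.0, 6.5, 8.0, 7.5]   # week before finals
after = [6.0, 6.5, 7.0, 6.5]    # week after finals

# One difference per student: the pairing is what makes this a related-samples comparison.
differences = [a - b for a, b in zip(after, before)]
mean_difference = sum(differences) / len(differences)
```

With independent samples there would be no meaningful way to pair the values, and the comparison would be made between group summaries instead.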
probability sampling
the individuals/units are randomly selected therefore the sampling process is unbiased
undercoverage
the sampling process systematically leaves out or under-represents part of the population
standard deviation
the square root of the variance
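The link between variance and standard deviation, sketched by hand on made-up data. The sample versions divide by n - 1 and the population versions divide by n; the variable names `s` and `sigma` follow the usual notation for the sample and population standard deviation.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up values

mean = sum(data) / len(data)
squared_deviations = [(v - mean) ** 2 for v in data]

# Sample variance and standard deviation (divide by n - 1).
sample_variance = sum(squared_deviations) / (len(data) - 1)
s = math.sqrt(sample_variance)

# Population variance and standard deviation (divide by n),
# appropriate only if `data` is the entire population.
population_variance = sum(squared_deviations) / len(data)
sigma = math.sqrt(population_variance)
```

The standard library functions `statistics.stdev` and `statistics.pstdev` compute the same two quantities.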
least squares regression line
the unique line such that the sum of the squared vertical distances between the data points (residuals) and the line is as small as possible
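A sketch of fitting the line from the usual closed-form formulas and checking that a slightly different line has a larger sum of squared residuals. The data and the helper names `least_squares_line` and `sum_squared_residuals` are made up for illustration.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_squared_residuals(xs, ys, intercept, slope):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


x = [1, 2, 3, 4, 5]   # made-up values
y = [2, 4, 5, 4, 5]
intercept, slope = least_squares_line(x, y)
best = sum_squared_residuals(x, y, intercept, slope)
# Any other line, such as one with a slightly steeper slope, does worse.
nudged = sum_squared_residuals(x, y, intercept, slope + 0.1)
```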
bias
unconscious or conscious, should be prevented at all costs
anecdotal evidence
uniquely personal cases not representative of the target population ex: celebrity endorsements
The ideal way to organize electronically the raw data for a study is to
use one table and assign one row for each individual so that all the data for that individual is in that row.
two-way tables (contingency tables)
used to organize the counts of joint outcomes of two categorical variables (factors) 1 factor= row, 1 factor= column -represents the intersection of one factor with a given level of the other factor
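A two-way table, its marginal distribution, and its conditional distributions can be sketched with counters. The (sex, eye color) records are made up, and conditioning on the row factor is just one of the two possible choices.

```python
from collections import Counter

# Made-up records: one (sex, eye color) pair per student.
records = [("F", "brown"), ("F", "blue"), ("F", "brown"), ("M", "brown"),
           ("M", "blue"), ("M", "blue"), ("F", "green"), ("M", "brown")]

# Two-way table: count for each joint outcome (row factor = sex, column factor = eye color).
table = Counter(records)

# Marginal distribution of eye color: summarizes that factor on its own.
n = len(records)
column_totals = Counter(color for _, color in records)
marginal_eye_color = {color: count / n for color, count in column_totals.items()}

# Conditional distribution of eye color given sex (row percents as proportions).
row_totals = Counter(sex for sex, _ in records)
conditional_given_sex = {
    (sex, color): count / row_totals[sex] for (sex, color), count in table.items()
}
```

Comparing `conditional_given_sex` across the two rows is the "comparing conditional proportions" step used to look for an association between two categorical variables.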
cross-sectional survey
uses 1 random sample drawn once from a population, comparisons can be made from the subgroups after the data set is collected
regression line
y= constant + slope * x -slope: how much we expect y to change on average for every unit increase in x -y-intercept: may or may not have a meaning in context -can be used to make predictions within range (on average)