Biometry test 1
What is 6% of 1,200 responses?
(6/100)(1,200) = 72
Midquartile
(Q3+Q1)/2
Interquartile Range
(Q3-Q1)
Semi-interquartile range
(Q3-Q1)/(2)
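A minimal sketch of the three quartile-based measures above, assuming Python's standard `statistics.quantiles` with the "inclusive" method as the quartile convention (textbooks and software use several slightly different conventions). The function name `quartile_measures` and the sample data are made up for illustration.

```python
import statistics

def quartile_measures(values):
    """Return the midquartile, interquartile range, and semi-interquartile range."""
    # n=4 yields the three cut points Q1, Q2, Q3; the "inclusive" method is an assumption.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "midquartile": (q3 + q1) / 2,
        "iqr": q3 - q1,
        "semi_iqr": (q3 - q1) / 2,
    }

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # illustrative values: Q1 = 3, Q3 = 7
print(quartile_measures(data))
# {'midquartile': 5.0, 'iqr': 4.0, 'semi_iqr': 2.0}
```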
For small data sets (20 values or fewer)...
Use a table instead of a graph.
Bar Graphs
Uses bars of equal width to show frequencies of categories of categorical (qualitative) data.
Cluster Sampling
We first divide the population area into sections (or clusters), then randomly select some of those clusters, and then choose all the members from the selected clusters.
Systematic Sampling
We select some starting point and then select every Kth element in the population.
Convenience Sampling
We simply use data that is very easy to get.
Stratified Sampling
We subdivide the population into at least two different subgroups (or strata) so that subjects within the same subgroup share the same characteristics, then we draw a sample from each subgroup.
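The cluster, systematic, and stratified methods above can be sketched as follows. This is an illustration under stated assumptions, not a prescribed procedure: the function names, the random starting point for systematic sampling, and the equal sample size per stratum are all choices made for the example.

```python
import random

def systematic_sample(population, k, rng):
    """Pick a starting point (here chosen at random), then take every k-th element."""
    start = rng.randrange(k)
    return population[start::k]

def cluster_sample(clusters, n_clusters, rng):
    """Randomly choose whole clusters and keep every member of each chosen cluster."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]

def stratified_sample(strata, per_stratum, rng):
    """Draw a random sample of the same size (an assumption) from every stratum."""
    return [m for members in strata.values() for m in rng.sample(members, per_stratum)]

rng = random.Random(0)  # seeded only so the example is repeatable
people = list(range(100))
print(systematic_sample(people, 10, rng))
print(cluster_sample({"north": [1, 2, 3], "south": [4, 5], "east": [6, 7, 8, 9]}, 2, rng))
print(stratified_sample({"first_year": list(range(50)), "second_year": list(range(50, 80))}, 5, rng))
```

Convenience sampling has no counterpart here because it involves no selection rule at all.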
Important uses of Histograms
-Visually displays the shape of the distribution of the data. -Shows the location of the center of the data. -Shows the spread of the data -Identifies outliers
Double Blind
neither the person giving the treatment nor the person receiving it knows whether it is the real treatment or a placebo.
Sampling Error
occurs when the sample has been selected with a random method, but there is a discrepancy between a sample result and the true population result; the discrepancy results from chance sample fluctuations.
stratified sampling
population is separated into strata (homogeneous groups), then a random sample is drawn from each stratum
Relative Frequency Equation
relative frequency = frequency / sum of all frequencies
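As a small illustration of the equation above, the sketch below counts categories and divides each count by the total. The function name and the blood-type data are invented for the example.

```python
from collections import Counter

def relative_frequencies(observations):
    """Map each category to frequency / sum of all frequencies."""
    counts = Counter(observations)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}

blood_types = ["O", "A", "O", "B", "A", "O", "AB", "O"]  # made-up data
rel = relative_frequencies(blood_types)
print(rel)                # {'O': 0.5, 'A': 0.25, 'B': 0.125, 'AB': 0.125}
print(sum(rel.values()))  # 1.0
```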
Stem Plot
represents quantitative data by separating each value into two parts: the stem (such as the leftmost digit) and the leaf (such as the rightmost digit).
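A rough sketch of how a stem plot could be built, assuming non-negative whole numbers and taking the last digit as the leaf and everything before it as the stem. The function name is hypothetical, and stems with no leaves are simply omitted here.

```python
from collections import defaultdict

def stem_plot(values):
    """Return stem plot rows; the leaf is the last digit, the stem is the rest."""
    rows = defaultdict(list)
    for value in sorted(values):       # sorting keeps the leaves in order
        stem, leaf = divmod(value, 10)
        rows[stem].append(leaf)
    return [f"{stem} | {''.join(str(leaf) for leaf in leaves)}"
            for stem, leaves in sorted(rows.items())]

for row in stem_plot([27, 12, 34, 21, 15, 27]):
    print(row)
# 1 | 25
# 2 | 177
# 3 | 4
```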
Continuous (numerical) data
result from infinitely many possible quantitative values, where the collection of values is not countable.
Discrete data
results when the data values are quantitative and the number of values is finite or "countable"
Practical significance
some treatment or finding may be effective (statistically significant), yet common sense might suggest that it does not make enough of a difference to be practical.
standard deviation (SD)
square root of the variance
Census
the collection of data from every member of the population.
Population
the complete collection of all measurements or data that are being considered; typically the population is the complete collection of data we would like to make inferences about.
A data value is missing completely at random if...
the likelihood of its being missing is independent of its value or any of the other values in the data set.
Median
the measure of center that is the middle value when the original data values are arranged in order of increasing value.
Midrange
the measure of center that is the value midway between the maximum and the minimum values in the original data set. Midrange = (Max data value + Min data value)/2. Not resistant.
A statistic is resistant if....
the presence of extreme values (outliers) does not cause it to change very much.
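To make resistance concrete, the sketch below compares three measures of center on a small made-up data set before and after one value is replaced by an outlier. The helper name `centers` is an assumption for the example.

```python
import statistics

def centers(values):
    """Compute three measures of center for comparison."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "midrange": (max(values) + min(values)) / 2,
    }

clean = [2, 3, 4, 5, 6]
with_outlier = [2, 3, 4, 5, 60]  # same data with one extreme value

print(centers(clean))         # {'mean': 4, 'median': 4, 'midrange': 4.0}
print(centers(with_outlier))  # {'mean': 14.8, 'median': 4, 'midrange': 31.0}
```

The median does not move, while the mean and especially the midrange are pulled toward the outlier.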
Relative frequency
the proportion (or percent) of observations within a category
Statistics
the science of planning studies and experiments; obtaining data; organizing, summarizing, presenting, analyzing, and interpreting those data; and then drawing conclusions based on them.
Regression line
the straight line that best fits the scatter plot of the data.
5-number summary and boxplot
the minimum, the maximum, and the three quartiles (Q1, Q2, Q3) make up the 5-number summary and are used to construct boxplots.
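A possible way to compute the 5-number summary, assuming the "inclusive" quartile method of Python's `statistics.quantiles` (other quartile conventions can give slightly different Q1 and Q3). The function name and data are illustrative only.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, Q2, Q3, maximum)."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))

print(five_number_summary([2, 4, 4, 5, 7, 9, 10, 12, 15]))
# (2, 4.0, 7.0, 10.0, 15)
```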
what can be used to avoid measurement bias?
true placebo control groups
what can be used to avoid selection (sampling) bias?
truly random samples
Frequency Polygon
uses line segments connected to points located directly above class midpoint values. Similar to a histogram; uses segments instead of bars.
Relative Frequency Polygon
variation of the basic frequency polygon.
Experiment
we apply some treatment and then proceed to observe its effects on the individuals.
Observational Study
we observe and measure specific characteristics but we don't attempt to modify the individuals being studied.
Confounding
when we can see some effect, but we can't identify the specific factor that caused it.
Statistical significance is achieved in a study when:
we get a result that is very unlikely to occur by chance.
what is one type of measurement bias?
- Hawthorne effect - human subjects change behavior simply because they are being studied
variance
- S^2 - AKA mean square - mean of squares of all the deviation scores in a distribution - obtained by finding the deviation score (x) for each element then squaring these and obtaining their mean
what type of sampling is based on geographical areas?
- cluster samples - randomly selecting geographical locations and then taking # of samples from each
odds ratio
- measure of association between an exposure and an outcome - (A x D)/(B x C) = (cases exposed x controls not exposed) / (controls exposed x cases not exposed), where A = cases exposed, B = controls exposed, C = cases not exposed, D = controls not exposed - if ratio = 1 then the exposure is NOT related to the disease - if ratio > 1 then the risk factor is found more frequently among the cases than the controls - if ratio < 1 then the risk factor may actually be a protective factor against the disease - must be used instead of relative risk when analyzing CASE-CONTROL data
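A small sketch of the odds ratio, following the worded definition above: (cases exposed x controls not exposed) / (controls exposed x cases not exposed). The function name, parameter names, and counts are hypothetical.

```python
def odds_ratio(cases_exposed, controls_exposed, cases_unexposed, controls_unexposed):
    """Odds ratio for case-control data from the four cell counts."""
    return (cases_exposed * controls_unexposed) / (controls_exposed * cases_unexposed)

# Made-up case-control counts.
ratio = odds_ratio(cases_exposed=30, controls_exposed=10,
                   cases_unexposed=20, controls_unexposed=40)
print(ratio)  # 6.0, so the exposure is more common among cases than controls
```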
cluster sampling
- population is already separated into clusters (naturally occurring groups), then clusters are randomly selected - Ex: choosing 100 medical students by randomly selecting 10 med schools, and then 10 random students from each
relative risk
- probability of developing a disease over a time period; considers risk factors - (incidence in those exposed to the risk factor)/(incidence in those NOT exposed to the risk factor) = (A/(A+B))/(C/(C+D))
attributable risk
- risk difference - portion of incidence in exposed group that is due to exposure - [exposed - not exposed]
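Relative risk and attributable risk can be computed from the same four counts. The sketch below assumes a and b are the exposed subjects with and without the disease, and c and d are the unexposed subjects with and without the disease; the function name and the counts are invented.

```python
def risk_measures(a, b, c, d):
    """a, b: exposed with/without disease; c, d: unexposed with/without disease."""
    incidence_exposed = a / (a + b)
    incidence_unexposed = c / (c + d)
    return {
        "relative_risk": incidence_exposed / incidence_unexposed,
        "attributable_risk": incidence_exposed - incidence_unexposed,
    }

# Made-up cohort: 50 of 100 exposed and 25 of 100 unexposed develop the disease.
print(risk_measures(50, 50, 25, 75))
# {'relative_risk': 2.0, 'attributable_risk': 0.25}
```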
systematic sampling
- selecting elements in a specific, systematic way (Ex: pick every 3rd person); every kth element. - usually provides the equivalent of a simple random sample w/out using randomization
Features of a dot plot
-Displays the shape of the distribution of data -It is usually possible to recreate the original list of data points.
Features of a Stem Plot
-Shows the shape of the distribution of data -Retains the original data values -The sample data are sorted (arranged in orders)
Graphs that Deceive
-Nonzero axis: the vertical scale starts at a value other than zero, which exaggerates differences -Pictographs (images)
Preparing Data
1. Context: what do the data represent; what is the goal of the study. 2. Source of data: are the data from a source with a special interest so that there is pressure to obtain results that are favorable to the source. 3. Sampling method: were the data collected in a way that is unbiased, or were the data collected in a way that is biased.
Analyzing Data
1. Graph the data. 2. Explore the data: are there any outliers (numbers far away from the majority of data points); what important statistics summarize the data (mean/standard deviation); how are the data distributed; are there any missing data; did many selected subjects refuse to respond? 3. Apply statistical methods: use technology to obtain results.
Normal Frequency Distribution
1. The frequencies start low, then increase to one or two high frequencies, and then decrease to a low frequency. 2. The distribution is approximately symmetric: frequencies preceding the maximum frequency should be roughly a mirror image of those that follow the maximum frequency.
Histogram
A graph consisting of bars of equal width drawn adjacent to each other (unless there are gaps in the data). -The horizontal scale represents classes of quantitative data values, and the vertical scale represents frequencies. -The heights of bars correspond to frequency values.
multiplication rule
AND; independent events
σ^2 "sigma squared"
Population variance
Skewed to the Right
Positive
Data
Collections of observations, such as measurements or survey responses.
Dot Plots
Consist of a graph of quantitative data in which each data value is plotted as a point (or dot) above a horizontal scale of values.
Weighted Mean
When different x data values are assigned different weights w, we can compute a weighted mean: (sum of w times x)/(sum of w).
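A brief sketch of a weighted mean, using the standard formula (sum of w times x) divided by (sum of w). The grade-point example, with course credits as weights, is made up for illustration.

```python
def weighted_mean(values, weights):
    """Return sum(w * x) / sum(w) for paired values x and weights w."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

grade_points = [4.0, 3.0, 2.0]  # hypothetical course grades
credits = [3, 4, 1]             # hypothetical weights
print(weighted_mean(grade_points, credits))  # 3.25
```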
Skewed Distribution
Distribution is not symmetric and extends more to one side than the other.
Significance
Do the results have statistical significance; do the results have practical significance?
Boxplot
Graphical representation of the spread of a set of data
Randomization
Individuals assigned to groups by random selection.
Experimental units
Individuals in the experiments; often called subjects when they are people.
r
Linear correlation coefficient
Relative Frequency Distribution
Lists each category of data together with the relative frequency. The sum of all the relative frequencies should add up to 1.
Variance
Measure of variation equal to the square of the standard deviation.
Standard Deviation
Measure of variation equal to the square root of the variance. How much data values deviate away from the mean. Never negative; only zero when all data values are the same.
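The relationship between variance and standard deviation can be checked with Python's standard `statistics` module. Note the assumption about which version is wanted: `variance` and `stdev` treat the data as a sample (dividing by n - 1), while `pvariance` and `pstdev` treat it as a whole population (dividing by n). The data values are illustrative.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean is 5, squared deviations sum to 32

pop_var = statistics.pvariance(data)   # 32 / 8
pop_sd = statistics.pstdev(data)       # square root of the population variance
samp_var = statistics.variance(data)   # 32 / 7
samp_sd = statistics.stdev(data)       # square root of the sample variance

print(pop_var, pop_sd)  # 4 2.0
print(math.isclose(samp_sd ** 2, samp_var))  # True
```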
Skewed to the Left
Negative
addition rule
OR; mutually exclusive (disjoint) events
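A short sketch of the addition rule for mutually exclusive events, P(A or B) = P(A) + P(B), next to the multiplication rule for independent events, P(A and B) = P(A) x P(B). Exact fractions are used to avoid rounding; the function names and the die and coin examples are assumptions made for illustration.

```python
from fractions import Fraction

def p_or_mutually_exclusive(p_a, p_b):
    """Addition rule when A and B cannot both occur."""
    return p_a + p_b

def p_and_independent(p_a, p_b):
    """Multiplication rule when A and B do not affect each other."""
    return p_a * p_b

# Rolling a 1 or a 2 on one fair die (mutually exclusive outcomes).
print(p_or_mutually_exclusive(Fraction(1, 6), Fraction(1, 6)))  # 1/3
# Heads on two separate fair coin flips (independent events).
print(p_and_independent(Fraction(1, 2), Fraction(1, 2)))        # 1/4
```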
Mode
The value(s) in a data set that occur with the greatest frequency. Two values tied for the greatest frequency: bimodal. More than two values tied: multimodal. No value repeated: no mode.
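The mode cases above can be sketched as a small function that returns every value tied for the greatest frequency, and an empty list when no value repeats. The function name `modes` is hypothetical.

```python
from collections import Counter

def modes(values):
    """Return the value(s) with the greatest frequency, or [] if nothing repeats."""
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # no value is repeated, so there is no mode
    return sorted(v for v, c in counts.items() if c == top)

print(modes([1, 2, 2, 3]))     # [2]
print(modes([1, 1, 2, 2, 3]))  # [1, 2]  (bimodal)
print(modes([1, 2, 3]))        # []      (no mode)
```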
10-90 percentile range
P90-P10
Types of data
Parameter: a numerical measurement describing some characteristic of a population. Statistic: a numerical measurement describing some characteristic of a sample ("population parameter, sample statistic"). Quantitative (or numerical): data consist of numbers representing counts or measurements. Categorical (or qualitative or attribute): data consist of names or labels (not numbers representing counts or measurements).
Multi-stage sample design
Pollsters select a sample in different stages, and each stage might use different methods of sampling.
μ "mu"
Population mean
σ "sigma"
Population standard deviation
Big data
Refers to data sets so large and so complex that their analysis is beyond the capabilities of traditional software tools. Analysis of big data may require software running simultaneously in parallel on many different computers.
Modified Boxplot
Regular boxplot with these modifications: 1. A special symbol (such as an asterisk or point) is used to identify outliers. 2. The solid horizontal line extends only as far as the minimum and maximum data values that are not outliers.
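A sketch of the pieces a modified boxplot needs. The cutoff used here, flagging values more than 1.5 x IQR below Q1 or above Q3, is a common convention and is an assumption; the card itself does not state which outlier rule to use. The function name, the "inclusive" quartile method, and the data are also illustrative choices.

```python
import statistics

def modified_boxplot_parts(values, fence_factor=1.5):
    """Return the outliers and the whisker ends (extreme values that are not outliers)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - fence_factor * iqr    # assumed outlier cutoff below Q1
    high_fence = q3 + fence_factor * iqr   # assumed outlier cutoff above Q3
    outliers = [v for v in values if v < low_fence or v > high_fence]
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {"outliers": outliers, "whisker_min": min(inside), "whisker_max": max(inside)}

print(modified_boxplot_parts([2, 4, 4, 5, 7, 9, 10, 12, 40]))
# {'outliers': [40], 'whisker_min': 2, 'whisker_max': 12}
```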
x̅ "x-bar"
Sample mean
x̃ "x-tilde"
Sample median
S
Sample standard deviation
S^2
Sample variance
Features of Bar Graphs
Shows the relative distribution of categorical data so that it is easy to compare the different categories.
Percentiles
The 99 values that divide ranked data into 100 groups with approximately 1% of the values in each group.
Quartiles
The three values that divide ranked data into four groups with approximately 25% of the values in each group.
Cumulative Frequency Distribution
The frequency for each class is the sum of the frequency for that class and all previous classes.
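As an illustration of the running total described above, the sketch below accumulates class frequencies with `itertools.accumulate`. The class labels and frequencies are made up.

```python
from itertools import accumulate

def cumulative_frequencies(frequencies):
    """Each entry is the sum of that class's frequency and all previous ones."""
    return list(accumulate(frequencies))

classes = ["50-59", "60-69", "70-79", "80-89", "90-99"]  # hypothetical classes
frequencies = [2, 5, 8, 4, 1]

for label, total in zip(classes, cumulative_frequencies(frequencies)):
    print(label, total)
# 50-59 2
# 60-69 7
# 70-79 15
# 80-89 19
# 90-99 20
```

The last cumulative frequency equals the total number of observations.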
Range
The measure of variation that is the difference between the highest and lowest values. Range = Max data value - Min data value
Normal Distribution
The population distribution is normal if the pattern of the points in the normal quantile plot is reasonably close to a straight line and the points do not show some systematic pattern that is not a straight line pattern.
Not a Normal Distribution
The population distribution is not normal if the normal quantile plot has either or both of these conditions: -the points do not lie reasonably close to a straight line pattern -the points show some systematic pattern that is not a straight line.
Replication
The repetition of an experiment on more than one individual.
Non-Sampling Error
The result of human error, biased wording, false data provided, or applying statistical methods that are not appropriate for the circumstances.
Nonrandom Sampling Error
The result of using a sampling method that is not random, such as using convenience sample or a voluntary response sample.
Mean/Average
The sum of a set of values divided by the number of values. Sensitive to outliers.
Linear Correlation Coefficient (r)
a numerical measure of the strength of the linear association in paired data; calculating r helps us make decisions more objectively than judging a scatter plot by eye. r is always between -1 and 1: values near -1 or 1 indicate a strong linear correlation, and values near 0 indicate little to no linear correlation.
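A minimal sketch of computing r for paired data from its definition: the sum of products of deviations divided by the square root of the product of the sums of squared deviations. The function name and the data are illustrative; Python 3.10 and later also offer `statistics.correlation` for the same purpose.

```python
import math

def pearson_r(xs, ys):
    """Linear correlation coefficient for paired (x, y) data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0  (perfect positive linear pattern)
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))  # -1.0 (perfect negative linear pattern)
```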
Scatter Plot
a plot of paired (x, y) quantitative data with a horizontal x-axis and a vertical y-axis
Voluntary response samples/self-selected sample
a sample in which the respondents themselves decide whether to be included.
Sample
a sub-collection of members selected from a population
Measure of Center
a value at the center or middle of a data set.
Lurking variable
a variable that affects the variables in the study but is not itself included in the study.
Ordinal level of measurement
can be arranged in some order, but differences (obtained by subtraction) between data values either cannot be determined or are meaningless. (rank of colleges in the U.S.)
Nominal level of measurement
characterized by data that consist of names, labels, or categories only. Not possible to arrange in any order. (eye color)
Pie Charts
common graph that depicts categorical data as slices of a circle.
Retrospective study
data are collected from a past time period by going back in time to examine records, interviews, etc.
Prospective Study
data are collected in the future from groups that share common factors.
Cross-sectional study
data are observed, measured, and collected at one point in time, not over a period of time.
Interval level of measurement
data can be arranged in order, and differences between data values can be found and are meaningful, but data at this level do not have a natural zero starting point at which none of the quantity is present. (body temp. in degrees)
Ratio level of measurement
data can be arranged in order, differences can be found and are meaningful, and there is a natural zero starting point at which none of the quantity is present. (height, length, distances, volume)
what can be used to avoid confounding bias?
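randomization (random assignment to groups), matching, or crossover designs; a double blind design guards against expectancy effects rather than confounding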
what can be used to avoid experimental expectancy/bias?
double blind design
simple random sampling
every element has an equal chance of selection
Correlation
exists between two variables when the values of one variable are somehow associated with the values of the other variable.
Linear Correlation
exists between two variables when there is a correlation and the plotted points of paired data result in a pattern that can be approximated by a straight line.
Block
a group of subjects that are similar to one another; different blocks differ in ways that might affect the outcome of the experiment.
Relative frequency histogram
has the same shape and horizontal scale as a histogram, but the vertical scale uses relative frequencies (as percentages or proportions) instead of actual frequencies.
A data value is missing not at random if...
the missing value is related to the reason that it is missing.
Data science
involves applications of statistics, computer science, and software engineering, along with some other relevant fields (such as biology).
Blinding
is used when the subject doesn't know whether he or she is receiving a treatment or a placebo.