EXST 2201, Exam 1 (McKenna)
What is the formula for deviation (of ONE data value)?
(x - x̄)
What is the mean of the standard normal curve?
0
What is the area under the standard normal curve between +3 standard deviations and +infinity (fourth event)?
0.15%
What does the sum of deviations always equal? Why?
0; takes into account direction
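A minimal sketch of why the deviations cancel, assuming a small made-up sample (the numbers and the helper name `deviations` are illustrative only): values below the mean give negative deviations, values above give positive ones, and the two sides balance.

```python
def deviations(values):
    """Return each value's deviation from the sample mean (x - x̄)."""
    mean = sum(values) / len(values)
    return [x - mean for x in values]

data = [2, 4, 6, 8, 10]   # hypothetical sample, mean = 6
devs = deviations(data)
print(devs)               # [-4.0, -2.0, 0.0, 2.0, 4.0]
print(sum(devs))          # 0.0
```

With data that does not divide evenly, the sum may differ from 0 by a tiny floating-point rounding error.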
What is the standard deviation of the standard normal curve?
1
What are the two symmetries of the standard normal curve (since the normal curve is perfectly symmetrical)?
1) area left of mean = area right of mean 2) area left of point = 1 - area right of point
What are the three steps for finding a percentile?
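1. Rank the data values and calculate the index position 2. Adjust the index position (if the decimal part is 0, average the values at that position and the next one; if the decimal part is not 0, go up to the next position) 3. Find the percentile (the value at the adjusted position)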
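The three steps could be sketched in code roughly as follows. This assumes the index position is computed as (k/100) × n, which the card does not state, and the function name `percentile` and the sample data are made up for illustration.

```python
def percentile(values, k):
    """k-th percentile (0 < k < 100) using the 'average or go up' rule."""
    if not 0 < k < 100:
        raise ValueError("k must be between 0 and 100 (exclusive)")
    ranked = sorted(values)                          # step 1: rank the data
    # Index position = k * n / 100 (assumed formula), split into its
    # whole part and remainder so no floating-point rounding is involved.
    whole, remainder = divmod(k * len(ranked), 100)
    if remainder == 0:
        # Step 2: decimal part is 0, so average positions whole and whole + 1
        return (ranked[whole - 1] + ranked[whole]) / 2
    # Step 2: decimal part is not 0, so go up to position whole + 1
    return ranked[whole]                             # step 3: read the value

data = [3, 7, 8, 5, 12, 14, 21, 13]   # hypothetical sample, n = 8
print(percentile(data, 25))   # index 2.0 -> average of 5 and 7 -> 6.0
print(percentile(data, 50))   # index 4.0 -> average of 8 and 12 -> 10.0
print(percentile(data, 90))   # index 7.2 -> go up to position 8 -> 21
```

Positions are counted from 1 in the steps, while Python lists count from 0, which is why position `whole` is `ranked[whole - 1]`.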
What three things does a random sample do?
1. Removes selection bias by the researcher 2. Does not affect natural variation in the data 3. Does not guarantee a representative sample
What does the normal probability density function (PDF) ensure are true about frequency values on a normal curve?
1. area under curve = 1 (100%) 2. curve never goes below x-axis (area never negative)
What are the two interpretations of a probability of an event by the science of statistics?
1. the proportion of the population described by the event (ex: 50% of people have brown hair) 2. the chance that a randomly selected individual from the population will be described by the event (ex: there is a 50% chance that the student selected will have brown hair)
What is the area under the standard normal curve between +1 and +2 standard deviations (second event)?
13.5%
If the standard deviation is 4, what is the variance?
16
What three positions divide a set of ranked data into four equally sized parts?
1st quartile, median, 3rd quartile
What is the area under the standard normal curve between +2 and +3 standard deviations (third event)?
2.35%
What is the area under the standard normal curve between the mean and +1 standard deviation (first event)?
34%
If the variance is 25, what is the standard deviation?
5
What is another name for the median?
50th percentile or second quartile (Q2)
What is the area under the standard normal curve within one standard deviation of the mean (between -1 and +1)?
68%
What is the area under the standard normal curve within two standard deviations of the mean (between -2 and +2)?
95%
What is the area under the standard normal curve within three standard deviations of the mean (between -3 and +3)?
99.7%
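As a rough check on the 68 / 95 / 99.7 figures, the areas can be computed from the standard normal curve with Python's `math.erf`. The helper name `area_between` is an assumption for illustration. The one-sided event areas on the other cards (34%, 13.5%, 2.35%, 0.15%) come from splitting the rounded figures, so the computed values differ slightly.

```python
import math

def area_between(a, b):
    """Area under the standard normal curve between z = a and z = b."""
    return 0.5 * (math.erf(b / math.sqrt(2)) - math.erf(a / math.sqrt(2)))

for k in (1, 2, 3):
    # Roughly 68.3%, 95.4%, and 99.7%
    print(k, round(100 * area_between(-k, k), 1))

print(round(100 * area_between(1, 2), 2))         # about 13.59 (card: 13.5%)
print(round(100 * area_between(2, 3), 2))         # about 2.14  (card: 2.35%)
print(round(100 * area_between(3, math.inf), 2))  # about 0.13  (card: 0.15%)
```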
Using statistical distance, which values are close to the population mean and considered a reasonable value for a randomly selected individual from this population?
Between -2 and 2
uses the probability distribution of the population, expressed in terms of statistical distance, to determine which values are close to the population mean and which values are far from the population mean
Concept of close and far
When making a frequency table, what counts as a category for discrete data and how should they be ordered?
Each number found in the column of discrete data is a category. All numbers must be listed, must be in numerical order, and must include every number between the lowest category and the highest category. No numbers can be skipped because they have no data values
When making a frequency table, what counts as a category for qualitative data and how should they be ordered?
Each value found in the column of qualitative data is a category. All values are listed, and the order of the categories does not matter
When making a frequency table, what counts as a category for continuous data and how should they be ordered?
First, the numbers are binned into ranges. Then, each range is a category, and each number goes into only one range. All ranges must be listed, must be in numerical order, and must include every range between the lowest category and the highest category. No ranges can be skipped because they have no data values.
What is the formula for inter-quartile range?
IQR = Q3 - Q1
What method of spread is used with resistant statistics?
Inter-quartile range
a number value that divides a set of ranked data into two parts: a lower part with a k% of the data values, and an upper part with a (100-k) % of the data values. (Denoted Pk for a population; Pk for a sample)
Kth percentiles
How does the measure of distance differ in mathematics vs. statistics?
Mathematics uses a deterministic (exact) measure, while statistics uses a probabilistic (likely) measure
What letter denotes size (or observations) in a population?
N
Using the Shapiro-Wilk test of normality, if the p-value is < 0.05, what can one say about the shape?
NOT normal shape
Is range resistant to extreme values?
No; its calculations use the most extreme values
What is the formula for the lower fence of a column of data?
Q1 - 1.5(IQR)
What is the formula for the upper fence of a column of data?
Q3 + 1.5(IQR)
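The two fence formulas could be applied like this. It is a sketch only: the quartile values and the data are hypothetical, and the function name `fences` is made up.

```python
def fences(q1, q3):
    """Return (lower fence, upper fence) from the first and third quartiles."""
    iqr = q3 - q1                      # IQR = Q3 - Q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

q1, q3 = 6.0, 13.5                     # hypothetical quartiles, IQR = 7.5
lower, upper = fences(q1, q3)
print(lower, upper)                    # -5.25 24.75

data = [3, 5, 7, 8, 12, 13, 14, 40]    # hypothetical column of data
extreme = [x for x in data if x < lower or x > upper]
print(extreme)                         # [40]
```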
What statistics are used as a summary number for normal shape?
Shapiro-Wilk statistics
a statistical test designed to determine if the frequency of data values in a column of data follows the normal curve (with p-values)
Shapiro-Wilk test of normality
a RAW measure of spread for a column of data values found by squaring all the deviations from the mean, then summing the squares (denoted: SS or SOS)
Sum-of-Squares (Sum of squared deviations)
Why doesn't a simple random sample guarantee a representative sample?
We will never know if a sample is TRULY representative of the population or not because we can't see all population information
What is the first step in analytical thinking?
abstraction
the process of looking at a problem, extracting only the information relevant to solving the problem, and ignoring all other unnecessary information
abstraction
On a stem-and-leaf plot, what makes up the stem?
all remaining digits to the left of the leaf (which is the rightmost digit)
What is more useful in statistics, analytical or synthetic thinking?
analytical (need to break down problems, not just look at them whole)
to break a big problem down into smaller parts, solve each part individually, and then put the parts back together to get the answer to the big problem
analytical thinking
using one piece of information to make a decision (contrasts with statistics)
anecdotal decision making
On a histogram, where is the value (category) of the bar marked?
at the middle of the bar (indicating a single value)
a graphical summary of a frequency table for qualitative data giving shape information by showing a NON-TOUCHING bar for each category, with the height of the bar representing how many data values are in the category
bar chart
How is the shape of qualitative data given (what graphical summary)?
bar chart (or a pareto chart)
If one changes the standard deviation of a standard normal curve (and not the mean), what happens to the standard normal curve?
becomes narrower (s < 1) or wider (s > 1)
shape of a histogram in which there are two "humps"
bimodal
to separate continuous data into bins (or groups) to reduce the number of values (or categories) for use in a frequency table
binning
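A minimal sketch of binning, assuming equal-width ranges where each range includes its left edge and excludes its right edge, so every number lands in exactly one range. The function name, parameters, and data are illustrative only.

```python
def bin_counts(values, start, width, n_bins):
    """Count the values in each equal-width range.

    Range i runs from start + i*width (included) up to
    start + (i+1)*width (excluded).
    """
    counts = [0] * n_bins
    for x in values:
        i = int((x - start) // width)
        if not 0 <= i < n_bins:
            raise ValueError(f"{x} falls outside the binned ranges")
        counts[i] += 1
    return counts

data = [1.2, 3.7, 4.0, 5.5, 9.9, 2.1]   # hypothetical continuous measurements
# Ranges: 0-2, 2-4, 4-6, 6-8, 8-10
print(bin_counts(data, start=0, width=2, n_bins=5))   # [1, 2, 2, 0, 1]
```

The empty 6-8 range still appears with a count of 0, since no range between the lowest and highest may be skipped.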
What graphical method shows a picture of the five number summary?
boxplot
a graphical summary of continuous data giving shape information by displaying the middle 50% of the data values as a box, the median as a line inside the box, and the upper 25% and lower 25% of the data values as tails on either side of the box; resistant measure of shape
boxplot
What are the five columns of a frequency table?
category (of a certain value), frequency, relative frequency, cumulative frequency, cumulative relative frequency
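The five columns could be built for a column of discrete data roughly as below. This is a sketch with made-up data and a made-up function name. It lists every whole number between the lowest and highest category, including ones with no data values.

```python
from collections import Counter

def frequency_table(values):
    """Rows of (category, frequency, relative frequency,
    cumulative frequency, cumulative relative frequency)."""
    counts = Counter(values)
    n = len(values)
    rows = []
    cumulative = 0
    for category in range(min(values), max(values) + 1):
        freq = counts[category]        # Counter gives 0 for a missing category
        cumulative += freq
        rows.append((category, freq, freq / n, cumulative, cumulative / n))
    return rows

data = [1, 2, 2, 4, 4, 4, 5, 5]        # hypothetical discrete data, n = 8
for row in frequency_table(data):
    print(row)
# (1, 1, 0.125, 1, 0.125)
# (2, 2, 0.25, 3, 0.375)
# (3, 0, 0.0, 3, 0.375)
# (4, 3, 0.375, 6, 0.75)
# (5, 2, 0.25, 8, 1.0)
```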
an observational study that measures a characteristic of the individuals in a population; does not involve a sample (measures everyone in a population)
census
type of data that does not have a pattern "in the long run"; not useful to study with statistics (not true of almost all data in nature)
chaotic data
Within a data set, where are variables located?
columns
What can a lurking variable cause in statistics?
confounding
phenomena in statistics where the effect of one variable cannot be distinguished from the effect of another variable
confounding
type of variable when the data value is fixed to only one possible value by there being only one value in the population (ex: number of cents in a quarter dollar--always equals 25)
constant
What are the three types of variables depending on the possible data values in the population?
constant, variable, or random variable
What type of data values contains the most information?
continuous
data values that consist of numerical information that has many, many possible values in the population (usually infinite number of values; ex: wavelengths of radio waves, minutes stuck on interstate, etc.)
continuous data
the summary numbers of a column of data whose values still need to be known to calculate probability from the equation of the shape of the data values (mean and standard deviation)
critical parameters
accumulation of frequency up to and including each category
cumulative frequency
accumulation of relative frequency up to and including each category
cumulative relative frequency
a column of numbers; distinct pieces of information organized in a special way to allow analysis
data
What information is used in the calculation of resistant statistics?
data positions (in a ranked set)
a specific way to organize data by putting the information from 1) variables into columns, and 2) individuals into rows
data set
multiple columns of numbers
data set
a measurement from an individual (ex: height, volts, beak length, etc.)
data value
What information is used in the calculation of efficient statistics?
data values
the number of units of information contained in a sample statistic; amount of information WITHIN a statistic
degrees of freedom
type of experimental design where half of the experimental units are chosen randomly and the second half are chosen by matching some statistic
dependent design
broad class of statistical methods that give sample information
descriptive statistics
statistical methods used to summarize and describe columns of data for the purpose of extracting information about the values of data
descriptive statistics
how far a data value is from its mean value; efficient measure of spread for ONE data value
deviation
data values that consist of numerical information that has only a few possible values in the population (less than 15 to 20 values); often counts (ex: number of doors on a car, number of drinks consumed in a day, etc.)
discrete data
the shape, location, and spread of a column of data; summary of all possible values of data and how often each occurs
distribution
to describe a column of data by giving the shape, location, and spread of all the data values in the column
distribution of data
How does one find the relative frequency of a data value in a column?
divide the frequency by the total number of values
study in which neither the researcher nor the experimental unit knows whether or what treatment is being applied
double blind
On a histogram, what is the height of the bars equal to?
each category's frequency (or relative frequency)
On a bar chart, what does the height of the bars equal?
each category's frequency (or relative frequency)
Why is standard deviation the most commonly used measure of spread in statistics?
easy to understand and used with normal curve to get probability
What kind of descriptive statistics are used with "pretty" columns of data?
efficient statistics
summary numbers that extract the most information about a characteristic out of a column of data; used for discrete or continuous data; contain the most information but are strongly affected by extreme values
efficient statistics
On a histogram of continuous data, what does the height represent?
equal to the category's frequency (or relative frequency)
something that happens; an interval on the x-axis of the real number line describing a real situation being studied
event
type of study where the individuals are in a highly controlled environment before measuring, so that physical controls can be used to allow only the variable of interest to have an effect (can control which individuals are studied and which treatment is given)
experimental study
another name for an individual in the sample
experimental unit
proper name for outliers; these are data values that lie outside the overall shape; pull the average away from its true mean; any data value far enough away from the mean value to question if it comes from the population being measured or from another population
extreme values
How do fences help us find extreme values in a column of data?
extreme values will lie outside the fences
two values that bracket the reasonable values for a column of data such that any data value outside of this bracket can be considered to be an extreme value (lower and upper)
fences
the number value such that 25% of the data values have values less than, and 75% of the data values have values greater than (25th percentile)
first quartile (Q1)
What is the set of numbers included in a boxplot called?
five number summary
a set (enclosed in brackets) of five resistant statistics consisting of the min, the first quartile (Q1), the median (M), the third quartile (Q3), and the max
five number summary
number of times a data value occurs in each category
frequency
a table summarizing the shape information in a column of data by listing all possible data values and recording how often each value occurs in the column of data
frequency table
How is data collected?
from observing and recording some phenomena in nature
used to summarize shape information from a column of data
graphs
What type of graphical summary is used for discrete data?
histogram
a Bar Chart of the data values in a column of data where the shape can be seen from the patterns of the bars; most often used to get information about samples
histogram
a graphical summary of a frequency table for discrete data giving shape information by showing a TOUCHING bar for each category, with the height of the bar representing how many data values are in the category
histogram
What type of graphical summary is used for efficient continuous data? Resistant continuous data?
histogram (but with binning) or stem-and-leaf; boxplot
What are the three most common types of graphs used to summarize information in data?
histogram, boxplot, and scatterplot
When should resistant statistics be used?
if the data contains any extreme values
When deciding whether to "go up or average" when finding the percentile, when do you decide to go up?
if the decimals of the position are NOT 0s (even if less than 0.5)
When deciding whether to "go up or average" when finding the percentile, when do you decide to average?
if the decimals of the position given are all 0s
type of experimental design where all experimental units are chosen randomly and assigned to treatments randomly (randomized design)
independent design
What is the first subset of a sample?
individual
one element of a population; single unit in a sample (ex: one refrigerator, one voter, one structural beam, etc.)
individual
broad class of statistical methods that give population information
inferential statistics
statistical methods that take sample information, combined with probability information, to get information about a population
inferential statistics
What is the maximum sample size for sampling with replacement?
infinite (no max)
How many normal distributions exist in statistics?
infinite number (different means or different standard deviations)
A population has a(n) ___ number of elements that can be measured, while a sample has a(n) ___ number of elements to be measured.
infinite, finite
a non-weighted measure of the spread of all the values in a column of data (spread of the middle 50%); resistant statistic
inter-quartile range (IQR)
What three things can be used to show spread of a column of data?
inter-quartile range, variance, standard deviation
A measure of the spread of the middle 50% of the data values found by the difference between the first and third quartiles; resistant statistic
interquartile range (IQR)
the total area under the normal curve (1) minus the right tail area (the area to the left of a positive z-score)
left body area
the area to the left of a negative z-score
left tail area
gives the middle of the data values, again usually on a real number line; often considered the most representative data value of the column
location
refers to the data value in the middle of a column of data (can be described in different ways: mean, median, and mode); where the middle of the shape of data lies over a real number line
location
characteristics of the individuals in a sample that are not measured during the study, but do affect the result; a known or unknown variable whose values affect the values of the variables being tested
lurking variables
On a histogram of continuous data, where are the values (categories) of each bar marked?
marked at the left side of the bar (indicating a range of values)
study in which the same people tested before administering the treatment are tested after
matched pairs
the number of mathematical units between two numbers on the Real Number Line; found by taking the difference between the two numbers
mathematical distance
the highest data value in a data set
max
What method of location is used in efficient statistics?
mean
the arithmetic average of all the data values in a column of data 1) denoted by μ for a population; x̄ for a sample 2) gives location information
mean
the value weighted middle of all the values in a column of data; average of all the data values (non-resistant statistic)
mean (μ, x̄)
What three ways can be used to show location of a column of data?
mean, median, and mode
What method of location is used in resistant statistics?
median
the number value that divides a ranked data set into two equal parts, thus becoming an appropriate measure of the location (middle) of the data values in a column of data (use percentile method to find)
median
the non-weighted middle of all the values in a column of data (resistant statistic)
median (M)
the total area under the normal curve (1) minus the right tail area and minus the left tail area
middle body area
On a bar chart, where is the value (category) of each bar marked?
middle of the bar
the lowest data value in a data set
min
What two characteristics are used to describe the overall shape of data?
modality and symmetry
What method of location is used with qualitative data?
mode
the data value that occurs most often in the column of data; highest frequency of occurrence
mode
If one changes the mean of the standard normal curve (and not the standard deviation), what happens to the curve?
moved left or right on the real number line (depending on what the mean is)
What letter denotes size (or observations) in a sample?
n
What formula is used to find degrees of freedom?
n-1 (Number of observations - 1)
Can qualitative data be ranked on a real number line (and be used in mathematical calculations)?
no
Can qualitative data be used to tell direction (greater than or less than)?
no
On a bar chart, do the bars touch?
no (indicates that they can be rearranged)
Do statistics determine causation?
no, only relationship
occurs when the mathematical equation describing the frequency of data values has been modified to make it easy to find probabilities by satisfying two conditions: 1) total area under the entire curve must equal 1 (100%), 2) curve never goes below the x-axis (no negative frequencies)
normal probability density function (PDF)
Using the Shapiro-Wilk test of normality, if the p-value is > 0.05, what can one say about the shape?
normal shape (no evidence against normality)
What method of spread is used with qualitative data?
number of categories
What things are descriptive statistics usually used to find for a column of sample data?
number of data values (n), shape (histogram), location (sample average), spread (sample standard deviation)
type of study where the individuals are in an uncontrolled environment before measuring, so that statistical controls are needed to cancel out the effects of the non-interesting variables; DO NOT control experimental units or treatment (experiment is post study); determines association not causation
observational study
What is another name for data values?
observations
Using statistical distance, which values are far from the population mean and NOT considered a reasonable value for a randomly selected individual from this population?
outside -2 and +2
a number summarizing information for a characteristic from a column containing population data; value does not change when repeating the statistical process; denoted with Greek letters
parameter
type of bar chart where the bars are rearranged from highest frequency to the lowest frequency (or vice versa) for the purpose of making the information easier to see
pareto chart
a false treatment that has no effect used to prevent experimental units from knowing whether they receive the treatment
placebo
the totality of elements in a well-defined group that is to be studied; infinite number of elements
population
Should sample or population information be used to make a decision?
population (more observations)
Parameter is to ___, as statistic is to ___
population; sample
Sources of information can come from a ___ from which a ___ is drawn, that consists of information from many ___.
population; sample; individuals
the position of a data value in a column of data ranked from lowest to highest
position (of a data value)
What is another name for resistant statistics?
positional statistics
type of data where the value of the next data value is not known (in the short run), but the value of many data values is very well known (in the long run); raw material for the science of statistics (data that can be predicted in the long run because it follows a pattern of behavior)
probabilistic data
the likelihood of an event happening; the area over an event and under a curve
probability
to understand the behavior of probabilistic data, so that information can be extracted from columns of sample data, to help make better decisions
purpose of statistics
Of the three types of data, which contains the least amount of information?
qualitative data
data value that consists of measured qualities or categories; non numerical data (labels)
qualitative data
What are the three types of data values (observations)?
qualitative, discrete, or continuous
What type of variable is the focus of statistical analysis?
random variable
type of variable where the data values can vary, AND the data values vary randomly (individuals in the sample are chosen randomly)
random variable
a measure of spread of 100% of the data values found by the difference between the largest data value and the smallest data value
range
concept of if, and how, the values of one variable relate to the values of another variable
relationship
percent (or proportion) of each data value in the column of data
relative frequency
What kind of descriptive statistics are used with "ugly" columns of data?
resistant statistics
summary numbers that extract less, but more robust, information about a characteristic out of a column of data; used for discrete or continuous data; weakly affected by extreme values but contain less information
resistant statistics
a quantitative or qualitative variable that reflects the character of interest
response variable
the total area under the normal curve (1) minus the left tail area (area to the right of a negative z-score)
right body area
the area to the right of a positive z-score
right tail area
Within a data set, where are individuals located?
rows
What is the symbol for variance of a sample?
s^2
What is the first subset of a population?
sample
any part of the population that is 1) small enough to measure, and in a good sample is 2) representative of the population
sample
Statistics takes columns of ___ information from the short run, combines this with ___ information from the long run, to give ___ information.
sample, probability, population
defined procedure in which the researcher chooses the sample from a population
sampling
What type of sampling does statistical THEORY use?
sampling with replacement
to select and measure an individual by sampling, and then return it back into the population, so there is some small chance that it could be selected again
sampling with replacement
What type of sampling does statistical PRACTICE use?
sampling without replacement
to select and measure an individual by sampling, then not return it to the population, so there is no chance that it could be selected again
sampling without replacement
a graphical device used to help solve problems about probabilities and events
schematic curve
refers to a picture of the value and frequency of the data values in a column of data; mainly seen in graphical methods (but can also be summary numbers)
shape
refers to the pattern the data values make when graphed, usually over a real number line; can be expressed with a bar chart for any data or written with an equation for "pretty" data
shape
What are the three "useful characteristics" used to describe a column of data?
shape, location, spread
What is the most common form of sampling?
simple random sample
a method of sampling that gives every individual in the population the same chance of being chosen
simple random sampling
the number of data values (also called observations) in a column of data; counted number of data values
size
shape of a histogram in which a tail is stretched (could be tail stretched to the right, or tail stretched to the left)--occurs with extreme values
skewed
skewed distribution where the tail is stretched to the left and the "hump" is on the right
skewed to the left
skewed distribution where the tail is stretched to the right and the "hump" is on the left
skewed to the right
What shape are resistant statistics appropriate for?
skewed, extreme values
gives the width of the data values over a real number line in how far away the minimum data value is from the maximum data value or by measuring how far the data values are from the middle on average
spread
refers to how far apart data values are; how wide the shape is over a real number line (how far away the data values are from the middle)
spread
comparison of boxplots to compare multiple columns of data
stacked boxplots
What method of spread is used with efficient statistics?
standard deviation
an approximation of the average deviation found by taking the square root of the variance (denoted σ for a population; s for a sample)
standard deviation
What is the most common measure of spread of a column of data?
standard deviation (square root of the variance)
a normal curve with the mean = 0 and the standard deviation = 1
standard normal curve
a graph of data values whose frequency follows the standard normal distribution
standard normal curve (also called z-curve)
distribution of data values whose frequencies follow the normal shape, with only a mean value of zero (μ=0) and only a standard deviation value of one (σ=1)
standard normal distribution (or Z-distribution)
How is mathematical direction found (with two numbers)?
stating to the right or to the left on a real number line (after finding magnitude)
a number summarizing information for a characteristic from a column containing sample data; value changes when repeating the statistical process (because new random sample is chosen); denoted with roman letters
statistic
the number of spread units between two numbers; found by dividing the mathematical distance by the spread of the data (standard deviation)
statistical distance
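A small sketch of statistical distance and the "close and far" idea. It assumes a cutoff of 2 spread units and treats the boundary itself as close, and the mean, standard deviation, and function names are hypothetical. The sign is kept so the result also shows direction (left or right of the mean).

```python
def statistical_distance(x, mean, sd):
    """Signed number of standard deviations between x and the mean."""
    return (x - mean) / sd

def is_close(x, mean, sd, cutoff=2):
    """True when x lies within `cutoff` standard deviations of the mean."""
    return abs(statistical_distance(x, mean, sd)) <= cutoff

mu, sigma = 100, 15                            # hypothetical population values
print(statistical_distance(130, mu, sigma))    # 2.0
print(is_close(130, mu, sigma))                # True
print(statistical_distance(55, mu, sigma))     # -3.0
print(is_close(55, mu, sigma))                 # False
```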
science that studies columns of numbers to extract information to help make better decisions
statistics
the science of data (collecting data, organizing data, summarizing data, and analyzing data) with the purpose of getting information to make a decision
statistics
a graphical summary of continuous data giving shape information by displaying each data value as a stem for the category and a leaf for the bar; essentially a histogram turned sideways
stem-and-leaf plot
sampling with a certain method (ex: sampling every 5th person)
strategic sampling
study in which sampling is taken from known, stratified groups (ex: freshmen, sophomores, juniors, etc.)
stratified study
How is mathematical magnitude found (with two numbers)?
subtracting the smaller number from the larger number (always positive)
a single number summarizing information about one characteristic from a column of data
summary number
numerical values used to summarize one characteristic from a column of data in order to communicate the largest amount of information as simply as possible (ex: sample average); form of descriptive statistics
summary numbers
used to summarize location and spread information from a column of data
summary numbers
shape of a histogram in which the shape of the left side is equal to the shape of the right
symmetric
to look at the whole problem at once, see what aspect of the problem is most important, then use this aspect to solve the problem
synthetic thinking
the theorem that states that the shape of the sample average is approximately normal when the sample size is large (rule of thumb: greater than 30 data values)
the central limit theorem
What is the thing being studied in statistics (thing statistics gets information on)?
the individual (any treatment in an experiment is applied to the individual)
law that has shown that the larger the sample size, the closer the sample statistic gets to the population parameter
the law of large numbers
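The law can be illustrated with a small simulation. This sketch assumes a fair six-sided die as the population (population mean 3.5) and a seeded generator, and the function name is made up. The printed averages depend on the seed, so no specific output is claimed.

```python
import random

def sample_mean_of_die_rolls(n, rng):
    """Average of n simulated rolls of a fair six-sided die."""
    return sum(rng.randint(1, 6) for _ in range(n)) / n

population_mean = 3.5              # (1 + 2 + ... + 6) / 6
rng = random.Random(0)             # seeded so the run is repeatable

for n in (10, 1_000, 100_000):
    x_bar = sample_mean_of_die_rolls(n, rng)
    # The gap between the sample statistic and the parameter tends to shrink
    print(n, x_bar, abs(x_bar - population_mean))
```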
a symmetric, bell-shaped curve of the frequency of data values that are normally distributed
the normal curve
Resistant statistics are resistant to extreme values because they look at what?
the positions of the data values (instead of the values of the data values)
On a stem-and-leaf plot, what makes up the leaf?
the rightmost digit of the data value
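Splitting values into stems and leaves could look like this. It is a sketch that assumes non-negative whole numbers, and the function name and data are invented for illustration.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (all digits left of the last) to its sorted leaves."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)     # leaf = rightmost digit, stem = the rest
        plot[stem].append(leaf)
    return dict(plot)

data = [27, 12, 34, 15, 108, 21, 27]   # hypothetical data values
for stem, leaves in stem_and_leaf(data).items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
# 1 | 2 5
# 2 | 1 7 7
# 3 | 4
# 10 | 8
```

This sketch prints only stems that have leaves. A full plot would also list the empty stems in between.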
What is the maximum sample size for sampling without replacement?
the size of the population
the number value such that 75% of the data values have values less than, and 25% of the data values have values greater than (75th percentile)
third quartile (Q3)
What is the purpose of statistics?
to analyze data to make a decision
What is the goal of sampling?
to get a measurable number of individuals that are representative of a population
What does a negative deviation tell about the position of a data value relative to the mean?
to the left of the mean
What does a positive deviation tell about the position of a data value relative to the mean?
to the right of the mean value
a condition of interest that is applied to the experimental unit
treatment
the sum of the left and right tail areas
two tail area
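The tail and body areas relate to each other as sketched below. The code uses `math.erf` to stand in for a z-table lookup, and the z-score 1.96 and the helper name `left_area` are chosen only for illustration.

```python
import math

def left_area(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

z = 1.96                               # hypothetical positive z-score
right_tail = 1 - left_area(z)          # area right of +z
left_tail = left_area(-z)              # area left of -z (equal by symmetry)
two_tail = left_tail + right_tail      # sum of both tails
left_body = 1 - right_tail             # area left of +z
right_body = 1 - left_tail             # area right of -z
middle_body = 1 - two_tail             # area between -z and +z

print(round(right_tail, 4), round(two_tail, 4), round(middle_body, 4))
# about 0.025, 0.05, 0.95
```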
shape of a histogram in which all the bars are about the same height (each number has the same probability of being there)
uniform
shape of a histogram in which there is one "hump"
unimodal
What shape are efficient statistics appropriate for?
unimodal symmetrical
characteristic that is easily seen in data
useful characteristic
type of variable when the data values can vary by there being more than one value in the population; values change in some systematic fashion (individuals chosen systematically)
variable
What is the most useful form for spread of a column of data?
variance
a value weighted form for spread and a "non-resistant" statistic; standardized form of spread calculated from the sum-of-squares (raw form of spread)
variance
standardized measure of spread in a column of data values found by dividing the sum of squares by the degrees of freedom (denoted σ^2 for a population and s^2 for a sample)
variance
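The chain from deviations to sum-of-squares to variance to standard deviation can be sketched as follows, for a sample (dividing by the degrees of freedom, n - 1). The data and function names are hypothetical.

```python
import math

def sum_of_squares(values):
    """SS: square each deviation from the mean, then add the squares."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)

def sample_variance(values):
    """s^2 = SS / degrees of freedom, with df = n - 1."""
    return sum_of_squares(values) / (len(values) - 1)

def sample_sd(values):
    """s = square root of the sample variance."""
    return math.sqrt(sample_variance(values))

data = [2, 4, 6, 8, 10]                # hypothetical sample, mean = 6
print(sum_of_squares(data))            # 40.0
print(sample_variance(data))           # 10.0
print(round(sample_sd(data), 3))       # 3.162
```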
the average of a column of data when the numbers have different weights
weighted average
How is data analyzed?
with graphs or images or by mathematical measurements (to give summary numbers)
In what two ways can sampling be done with respect to the treatment of the individual after being measured?
with replacement and without replacement
a data value from any normal distribution other than the standard normal distribution (frequency follows any normal curve); has units of measurement
x-value
What is the symbol for the mean of a sample (sample average)?
x̅
Can continuous data be ranked on a real number line?
yes
Can discrete data be ranked on a real number line (and used in mathematical calculations)?
yes
Can qualitative data be used to tell difference (equal or unequal)?
yes
On a histogram of continuous data, do the bars touch?
yes (indicating they cannot be rearranged--real number line)
On a histogram, do the bars touch?
yes (indicating they cannot be rearranged--real number line)
Can discrete data be used to tell difference and direction?
yes, but its information is limited because of the number of possible values
What is another name for the standard normal distribution?
z-distribution
The equation used to convert any normal data (an x-value) into standard normal data (a z-score), or standard normal data (a z-score) into any normal data (an x-value)
z-equation
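For reference, the standard form of the z-equation and its rearrangement are given below. The card itself does not spell the formula out, and the numbers in the example are hypothetical.

```latex
z = \frac{x - \mu}{\sigma}
\qquad\Longleftrightarrow\qquad
x = \mu + z\,\sigma

% Example with \mu = 100 and \sigma = 15:
%   x = 130  gives  z = (130 - 100) / 15 = 2
%   z = -1   gives  x = 100 + (-1)(15) = 85
```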
What can be used to transform between z-scores and x-values to find probability?
z-equation
a data value from a column of data that follows the standard normal curve; graphed on the x-axis
z-score
What two things can a z-table be used to find?
z-scores and tail areas
a table relating events, expressed as an interval of z-scores on the x-axis, with probability, expressed as the area over the event and under a standard normal curve (used to go between z-scores and tail areas)
z-table
What is the symbol for the mean of a population?
μ (mu)
What is the symbol for variance of a population?
σ^2