STAT 201- Exam 1 TAMU
experimental unit
a single object or individual to be measured
sample
a subset of the population of interest that we will collect data on
representative sample
a subset of the population that has the same characteristics as the population of interest
placebo
a treatment that cannot influence the response variable
population
all individuals, objects, or things whose properties are being studied
ethics
all planned studies that involve human participants must be approved in advance by the IRB (institutional review board)
systematic sample
randomly select a starting point and then take every nth piece of data after that; frequently chosen because it is a simple method
relative frequency
ratio (fraction or proportion) of the number of times a value of the data occurs in the set of all outcomes to the total number of outcomes.
statistical fraud
researchers have a responsibility to verify that proper methods are being followed
What type of sample? A completely random method is used to select 75 students. Each undergraduate student in the fall semester has the same probability of being chosen at any stage of the sampling process.
simple random sample
What type of sample? A sample of 100 undergraduate San Jose State students is taken by organizing the students' names by classification (freshman, sophomore, junior, senior), and then selecting 25 from each
stratified
What type of sample? A random number generator is used to select from the alphabetical listing of all undergraduate students in the Fall semester. Starting with that student, every 50th student is chosen until 75 students are included in the sample.
systematic
frequency table
table with data value and frequency
descriptive statistics
organizing and summarizing data
steps for constructing a box plot
- Obtain the Five Number Summary (Min, Q1, M, Q3, Max) - Compute IQR = Q3 − Q1 - Draw the line from the Min to the Max - Draw a box from Q1 to Q3 - Draw a line in the box at Median M - Compute the outlier cutoffs: L = Q1 − 1.5 × IQR and U = Q3 + 1.5 × IQR - If Min is below L, erase the line from Min to L: any data in this region is a potential outlier, mark with an × or ◦ - If Max is above U, erase the line from U to Max: any data in this region is a potential outlier, mark with an × or ◦
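The box plot steps above can be sketched in Python. This is only an illustration: the data are made up, `box_plot_summary` is a hypothetical helper name, and Python's `statistics.quantiles` interpolates between neighboring values, so its quartiles can differ slightly from quartiles found by hand.

```python
import statistics

def box_plot_summary(data):
    """Five-number summary, IQR, outlier cutoffs, and flagged points."""
    xs = sorted(data)
    # With n=4, statistics.quantiles returns [Q1, median, Q3].
    # Assumption: the "exclusive" method (positions based on n + 1) is used;
    # it interpolates, so it may not match a hand calculation exactly.
    q1, median, q3 = statistics.quantiles(xs, n=4, method="exclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr   # cutoff L
    upper = q3 + 1.5 * iqr   # cutoff U
    return {
        "min": xs[0], "Q1": q1, "M": median, "Q3": q3, "max": xs[-1],
        "IQR": iqr, "L": lower, "U": upper,
        "potential_outliers": [x for x in xs if x < lower or x > upper],
    }

# Hypothetical data, made up for illustration only.
summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30])
print(summary)
```

Here the value 30 lies above U, so it would be drawn as a separate point rather than as part of the whisker.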
observational study
- observes individuals and measures variables of interest - does not attempt to influence responses
percentile
- percentiles divide ordered data into hundredths; percentiles may or may not be part of the data - the kth percentile is the value such that k% of data is less than it, and (100-k)% of data is greater than it
Identify population, statistic, parameter, sample, variable, & data: We want to know the average (mean) amount of money spent on school uniforms each year by families with kids at Knoll Academy. We randomly survey 100 families with kids in the school. Three families spent $65, $75, and $95, respectively
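- Population: all families whose kids attend Knoll Academy - Statistic: mean amount of $ spent on uniforms per year by the 100 families in the sample - Parameter: mean amount of $ spent on uniforms per year by all families with kids at the school - Sample: the 100 families surveyed - Variable: amount of $ one family spent on uniforms in a year - Data: $65, $75, $95...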
example: On a 20 question math test, the 70th percentile for number of correct answers was 16. Interpret the 70th percentile in the context of this situation.
- Seventy percent of students answered 16 or fewer questions correctly. - Thirty percent of students answered 16 or more questions correctly. - A higher percentile is considered good in this context, as answering more questions correctly is desirable
IQR= Q3-Q1
- The IQR can help to determine potential outliers - A value is a potential outlier if it is more than (1.5)(IQR) below the first quartile or more than (1.5)(IQR) above the third quartile, that is, less than Q1 − 1.5 × IQR or greater than Q3 + 1.5 × IQR.
example: On a timed math test, the first quartile for time it took to finish the exam was 35 minutes. Interpret the first quartile, in the context of this situation.
- Twenty-five percent of students finished the exam in 35 minutes or less. - Seventy-five percent of students finished the exam in 35 minutes or more. - A lower percentile is considered good in this context, as finishing more quickly on a timed exam is desirable. (If you take too long, you might not be able to finish.)
variable
- a characteristic or measurement that can be determined for each member of a population - 2 types
statistic
- a number that represents the property of interest in the sample - an estimate of the parameter value that we get looking only at the sample
problems with samples
- a sample must be representative of the population - a non-representative sample is biased - biased samples that are not representative of the population give results that are inaccurate and not valid
lurking variables
- a variable that has an important effect on the relationship among the variables in a study - is not one of the explanatory variables studied
outlier
- an observation of data that does not fit the rest of the data - some are due to mistakes while others may indicate that something unusual is happening - it may take some background info to explain outliers
stratified sample vs. cluster sample
- both divide population into groups - stratified samples include individuals in EVERY group - cluster samples include every individual in SOME groups
pie chart
- categories of data are represented by wedges in a circle and are proportional in size to the percent of individuals in each category - displaying parts of a whole, cannot add up to be more than 100%
non-response/refusal of subject to participate
- collected responses may no longer be representative of the populations - people with strong positive or negative opinions may answer surveys which can affect results
histogram
- consists of contiguous (adjoining) bars - horizontal axis is labeled with what the data represents - vertical axis is labeled either frequency or relative frequency (or percent frequency or probability) of each bin - graph will have the same shape with either label
sampling bias
- created when a sample is collected from a population and some members of the population are not as likely to be chosen as others - can cause incorrect conclusions drawn about the population that is being studied
variation
- data values will not always be the same for each element of the population - this difference is called variation
experiment
- deliberately imposes some treatment on individuals to measure their responses - purpose is to study whether the treatment causes a change in the response
random experiment- outcomes
- different outcomes measured in the response variable, therefore, must be a direct result of the different treatments - in this way, an experiment can prove a cause-and-effect connection between the explanatory and response variables
stem-and-leaf plot
- divide each observation of data into a stem and leaf - the leaf consists of a final significant digit - good choice for showing the distribution of a numerical variable when the data set is small
simple random sample (SRS)
- every member of the population has an equal chance of being chosen 1. give each member of the pop. a number 2. use a random number generator to select a set of labels 3. these randomly selected labels identify the members of your sample- use them to find out which members were sampled
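The three SRS steps above can be sketched in Python. This is a minimal sketch, assuming the population fits in a list; `simple_random_sample` and the student names are hypothetical, and `random.Random.sample` stands in for the random number generator.

```python
import random

def simple_random_sample(population, size, seed=None):
    """Draw an SRS of the given size without replacement."""
    rng = random.Random(seed)
    labels = range(len(population))          # 1. give each member a number
    chosen = rng.sample(labels, size)        # 2. randomly select labels
    return [population[i] for i in chosen]   # 3. look up the sampled members

# Hypothetical population of 500 students.
students = [f"student{i}" for i in range(1, 501)]
srs = simple_random_sample(students, 75, seed=1)
print(srs[:5])
```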
how to find relative frequency
- frequency (number of times a data value occurs) / (total number of data values) - ex: if the number 2 occurs 3 times in a data set of 20 values, its relative frequency is 3/20 or 0.15
time series graph (line graph)
- given a paired data set, we start with the standard Cartesian coordinate system: horizontal axis is used to plot date/time increment, vertical axis is used to plot the values of the variable that we are measuring - by doing this, we make each point on the graph correspond to a date and a measured quantity - the points on the graph are connected by straight lines in the order in which they occur
misleading use of data
- improperly displayed graphs - incomplete data - lack of context
five number summary
- minimum (min) - first quartile (Q1) - Median (M) - Third Quartile (Q3) - Maximum (Max)
descriptive statistics
- numerical and graphical ways to describe and display your data - a graph is a tool that helps you learn about the shape or distribution of a sample/population - statisticians often graph data first to get a picture of the data then apply more formal tools
Identify the population, sample, experimental units, explanatory variable, treatments, and response variable: Researchers want to investigate whether taking aspirin regularly reduces the risk of heart attack. Four hundred men between the ages of 50 and 84 are recruited as participants. The men are divided randomly into two groups: one group will take aspirin, and the other group will take a placebo. Each man takes one pill each day for three years, but he does not know whether he is taking aspirin or the placebo. At the end of the study, researchers count the number of men in each group who have had heart attacks
- population: men aged 50-84 - sample: 400 men who participated - experimental units: individual men in study - explanatory variable: aspirin/oral medication - treatments: aspirin and placebo - response variable: whether a subject had a heart attack
causality
- relationship between two variables does not mean that one causes the other to occur - they may be related (correlated) b/c of their relationship through a different variable
bin
- represents a range of data - used when displaying large data sets - also called classes or intervals - all bins have the same width
key protections mandated by law
- risks to participants must be minimized and reasonable with respect to projected benefits. - participants must give informed consent. this means that the risks of participation must be clearly explained to the subjects of the study. - subjects must consent in writing, and researchers are required to keep documentation of their consent - data collected from individuals must be guarded carefully to protect their privacy
sample size issues
- samples that are too small are often unreliable - the larger the better - small samples are sometimes unavoidable and can still be used to draw conclusions - ex: crash testing cars and medical testing for rare conditions
self-funded/self-interest studies
- study performed by a person or organization in order to support their claim - is the study impartial? Read the study carefully to evaluate the work - do not automatically assume that the study is good, but do not automatically assume the study is bad either - evaluate it on its merits and the work done
categorical/qualitative variable
- take on a label - nominal: there is no natural order to the values (ethnicity) - ordinal: there is a natural order to the values (movie reviews)
numerical/quantitative variable
- take on a numerical value, usually with units - discrete: measured in whole units (count data) - continuous: can be measured in decimals (time to finish a race)
sampling errors
- the actual process of sampling causes sampling errors - a sample will never be exactly representative of the population so there will always be some sampling error - as a rule, the larger the sample, the smaller the sampling error
sample mean
- the average in a set of data - used to estimate the population mean - add up all the data values and divide by the total number
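As a quick worked example, the sample mean can be computed in Python for the eight 16-ounce can measurements listed in the variation example of this set. The variable names are just for illustration.

```python
# Can measurements (in ounces) from the variation example.
cans = [15.8, 16.1, 15.2, 14.8, 15.8, 15.9, 16.0, 15.5]

# Add up all the data values and divide by the total number of values.
sample_mean = sum(cans) / len(cans)
print(round(sample_mean, 2))
```

The result is about 15.64 ounces, slightly under the labeled 16 ounces.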
quartile/percentile
- the median of the data is the second quartile or 50th percentile - the first and third quartiles are the 25th and 75th percentiles, respectively
interquartile range
- the range of the middle 50% of the data values - the IQR is found by subtracting the first quartile from the third quartile
control group
- the treatment group that receives the placebo - helps researchers balance the effects of being in an experiment with the effects of active treatments
confounding (lurking/explanatory)
- two variables are confounded when their effects on a response variable cannot be distinguished from each other - the confounded variables may be either explanatory variables or lurking variables
confounding
- when the effects of multiple factors on a response cannot be separated - makes it difficult or impossible to draw valid conclusions about the effect of each factor
box plot (box-and-whisker plot)
- graphical representation of the concentration of the data - shows how far the extreme values are from most of the data
self-selected samples
- responses come only from people who choose to respond, so results are often unreliable - ex: call-in surveys
randomized experiment- random treatments
-when subjects are assigned treatments randomly, all of the potential lurking variables are spread equally among the groups - at this point the only difference between groups is the one imposed by the researcher.
role of statistician
1. Design Studies 2. Analyze Data 3. Translate data into knowledge and understanding of the world around us
constructing a histogram
1. decide how many bins to use 2. calculate the width of each bin: (ending point − starting point) / (number of bins) 3. calculate the bin boundaries: boundary = starting point + k × bin width (k = 0, ..., number of bins), which gives (number of bins + 1) boundaries 4. calculate bin frequencies (or relative frequencies) 5. draw the histogram, where the height of each bar is given by the frequency/relative frequency of its corresponding bin
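The binning steps can be sketched in Python. This is a minimal sketch under a few assumptions that the card does not state: the starting and ending points are taken as the smallest and largest data values, the bin width is (ending point − starting point) / (number of bins), each bin includes its left boundary, and the largest value is counted in the last bin. `histogram_bins` is a hypothetical helper name; the data are the hours-worked responses from the frequency table example in this set.

```python
def histogram_bins(data, num_bins):
    """Return (boundaries, counts) for equal-width bins."""
    start, end = min(data), max(data)
    if start == end:
        raise ValueError("all values are equal; bin width would be zero")
    width = (end - start) / num_bins
    # num_bins + 1 boundaries: start, start + width, ..., end
    boundaries = [start + k * width for k in range(num_bins + 1)]
    counts = [0] * num_bins
    for x in data:
        # Left-closed bins; the maximum value falls into the last bin.
        idx = min(int((x - start) / width), num_bins - 1)
        counts[idx] += 1
    return boundaries, counts

hours = [5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3]
boundaries, counts = histogram_bins(hours, 5)
print(boundaries)
print(counts)
```

Each count is the height of one bar; dividing by `len(hours)` gives relative frequencies and the same shape.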
Determine correct data type: 1. number of pairs of shoes you own 2. type of car you drive 3. distance from home to nearest grocery store 4. number of classes you take per school year 5. type of calculator you use 6. weights of sumo wrestlers 7. number of correct answers on a quiz 8. IQ scores
1. quantitative discrete 2. categorical 3. quantitative continuous 4. quantitative discrete 5. categorical 6. quantitative continuous 7. quantitative discrete 8. quantitative discrete
methods of random sampling
1. simple random sample (srs) 2. systematic sample 3. stratified sample 4. cluster sample
variation example
16-ounce cans of beverage may contain more or less than 16 ounces of liquid. - in one study, eight 16 ounce cans were measured and produced the following amount (in ounces) of beverage: 15.8; 16.1; 15.2; 14.8; 15.8; 15.9; 16.0; 15.5 - measurements of the amount of beverage in a 16-ounce can may vary because different people make the measurements or because the exact amount, 16 ounces of liquid, was not put into the cans
convenience sample example
A computer software store conducts a marketing study by interviewing potential customers who happen to be in the store browsing through the available software. The results of convenience sampling may be very good in some cases and highly biased (favor certain outcomes) in others.
frequency/frequency table example: Twenty students were asked how many hours they worked per day. Their responses, in hours, are as follows: 5,6,3,3,2,4,7,5,2,3,5,6,5,4,4,3,5,2,5,3 Make a frequency table.
Data Value | 2 | 3 | 4 | 5 | 6 | 7
Frequency  | 3 | 5 | 3 | 6 | 2 | 1
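The same frequency table, plus relative frequencies, can be built in Python. This is a small sketch using `collections.Counter`; the variable names are only illustrative.

```python
from collections import Counter

hours = [5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3]

freq = Counter(hours)   # data value -> number of times it occurs
total = len(hours)

# One row per data value: (value, frequency, relative frequency)
table = [(value, freq[value], freq[value] / total) for value in sorted(freq)]

print("value  frequency  relative frequency")
for value, count, rel in table:
    print(f"{value:>5}  {count:>9}  {rel:>18.2f}")
```

The frequencies add up to 20 (the number of students) and the relative frequencies add up to 1.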
cluster sample example
If you randomly sample four departments from your college population, the four departments make up the cluster sample. Divide your college faculty by department. The departments are the clusters. Number each department, and then choose four different numbers using simple random sampling. All members of the four departments with those numbers are the cluster sample.
stratified sample example
You could stratify (group) your college population by department and then choose a proportionate simple random sample from each stratum (each department) to get a stratified random sample. Suppose there are 5 departments, and you want to choose a sample of 50. You choose 10 from each department using SRS sampling.
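The cluster and stratified examples can be compared side by side in Python. This is a sketch only: the five departments and their members are hypothetical, and `stratified_sample` and `cluster_sample` are made-up helper names.

```python
import random

def stratified_sample(strata, per_stratum, seed=None):
    """Take an SRS of per_stratum members from EVERY group."""
    rng = random.Random(seed)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, per_stratum))
    return sample

def cluster_sample(clusters, num_clusters, seed=None):
    """Randomly pick SOME groups and keep EVERY member of those groups."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), num_clusters)
    return [member for name in chosen for member in clusters[name]]

# Hypothetical college: 5 departments with 30 members each.
departments = {
    f"dept{d}": [f"dept{d}-person{i}" for i in range(1, 31)]
    for d in range(1, 6)
}

strat = stratified_sample(departments, 10, seed=1)   # 10 from each department
clust = cluster_sample(departments, 4, seed=1)       # all of 4 departments
print(len(strat), len(clust))
```

The stratified sample has 50 people drawn from all five departments, while the cluster sample has all 120 people from four departments and nobody from the fifth.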
distribution
a listing or function showing all the possible values (or intervals) of the data and how often they occur
parameter
a number that is used to represent a characteristic of the population and that generally cannot be determined easily
bar graph vs. histogram
bar graph: - used to visualize distribution for categorical variables - number of bars cannot be changed ( it's the # of categories) histogram: - used to visualize distribution for quantitative variables - number of bars/bins is your choice
double-blinding
both the subjects and the researchers involved with the subjects are blinded
cause-and-effect in observational studies
cause-and-effect relationships usually cannot be drawn, since lurking variables confound the effects that one variable may have on another variable
non-sampling errors
caused by factors not related to the sampling process
What type of sample? The freshman, sophomore, junior, and senior years are numbers one, two, three, and four, respectively. A random number generator is used to pick two of those years. All students in those two years are in the sample.
cluster
undue influence
collecting data or asking questions in a way that influences the response
What type of sample? An administrative assistant is asked to stand in front of the library one Wednesday and to ask the first 100 undergraduate students he encounters what they paid for tuition the Fall semester. Those 100 students are the sample.
convenience
frequency
count number of times each data value occurs
quartile
divide ordered data into quarters; quartiles may or may not be part of the data
inferential statistics
formal methods for drawing conclusions from data
explanatory and response variable relationship
in an experiment, the researcher manipulates values of the explanatory variable and measures the resulting changes in the response variable
randomized experiment- proving change
in order to prove that the explanatory variable is causing a change in the response variable, it is necessary to isolate the explanatory variable.
formula for finding a quartile
k = the quartile (1, 2, or 3) or percentile (25, 50, 75) i = the index (position of a data value) n = total number of data values 1. order data from smallest to largest 2. calculate the index: i = (k/4)(n + 1) if using k = 1, 2, 3, or i = (k/100)(n + 1) if using k = 25, 50, 75 3. find the value that corresponds to index i: - if i is a whole number, find the ith data value in the ordered list - if i is a decimal, take the two whole-number positions on either side of i and average their corresponding data values
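The quartile rule above can be written as a short Python function. This is a sketch of the position rule i = (k/4)(n + 1) with neighbors averaged when i is not a whole number; `quartile` is a hypothetical helper name, and the sketch assumes at least three data values so that the index stays inside the list. The data are the hours-worked responses from the frequency table example in this set.

```python
def quartile(data, k):
    """k-th quartile (k = 1, 2, 3) using position i = (k/4)(n + 1)."""
    xs = sorted(data)                  # 1. order the data
    n = len(xs)
    if n < 3:
        raise ValueError("need at least 3 data values for this rule")
    i = k * (n + 1) / 4                # 2. index (1-based position)
    lo = int(i)                        # whole-number position at or below i
    if i == lo:                        # 3a. whole number: take that value
        return xs[lo - 1]
    # 3b. decimal: average the values on either side of position i
    return (xs[lo - 1] + xs[lo]) / 2

hours = [5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3]
q1, median, q3 = (quartile(hours, k) for k in (1, 2, 3))
print(q1, median, q3)
```

With n = 20 the indices are 5.25, 10.5, and 15.75, so each quartile is the average of two neighboring values in the ordered list.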
probability
mathematical study of uncertainty- foundation of statistics
convenience sample
non-random and involves using results that are readily available
sampling with replacement
once a member is picked, that member goes back into the population and thus may be chosen more than once
sampling without replacement
once a member is picked, that member goes out of the pool and thus cannot be chosen more than once (the difference from sampling with replacement only becomes a mathematical issue when the population is small)
response variable (experimental design)
the affected variable is called the response variable
treatments (experimental design)
the different values of the explanatory variable
bar graph
the length of the bar for each category is proportional to the number or percent of individuals in each category. bars may be vertical or horizontal
blinding
the subjects do not know who is receiving the active treatments and who is receiving the placebo treatment
describing data
to describe the distribution of data, look for an overall pattern and any outliers
cluster sample
we divide the population into clusters (groups) and then randomly select some of the clusters. all members from these clusters are in the cluster sample.
stratified sample
we divide the population into groups called strata and then take a SRS from each stratum
parameter is to population as statistic is to sample
we use the statistic, calculated from data obtained from a sample, in order to estimate the parameter which is unknown 1. Have parameter that you are interested in learning about for some population 2. Obtain a sample from the pop. 3. Gather data on that sample 4. Calculate a statistic 5. Use the statistic as an estimate of the parameter
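The five steps can be walked through in Python with a made-up population. Everything here is hypothetical: the 1,000 spending values are randomly generated for illustration, and in a real study the parameter would not be available to compare against.

```python
import random
import statistics

rng = random.Random(0)

# 1. Parameter of interest: mean yearly spending for the whole population.
#    Hypothetical population of 1,000 families (values in dollars).
population = [rng.randint(40, 120) for _ in range(1000)]
parameter = statistics.mean(population)   # normally unknown

# 2-3. Obtain a sample and gather data on it.
sample = rng.sample(population, 100)

# 4. Calculate a statistic from the sample data.
statistic = statistics.mean(sample)

# 5. Use the statistic as an estimate of the parameter.
print(f"statistic (sample mean): {statistic:.2f}")
print(f"parameter (population mean): {parameter:.2f}")
```

The two numbers will usually be close but not identical; the gap is the sampling error.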
explanatory variable (experimental design)
when one variable causes change in another, we call the first variable the explanatory variable
placebo effect
when participation in a study prompts a physical response from a participant, the experimenter must take further steps to isolate the effects of the explanatory variable.
interpreting percentiles, quartiles, and median
when writing the interpretation of a percentile in the context of the given data, the following information should be included: - information about the context of the situation being considered - the data value that represents the percentile - the percent of individuals below the percentile - the percent of individuals above the percentile
systematic sample example
you have to do a phone survey. your phone book contains 20,000 residence listings. you must choose 400 names for the sample. number the population from 1 to 20,000 and then use a simple random sample to pick a number that represents the first name in the sample. then choose every fiftieth name thereafter until you have a total of 400 names (you might have to go back to the beginning of your phone list).
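The phone book example can be sketched in Python. This is a minimal sketch, assuming the step is the population size divided by the sample size (20,000 / 400 = 50) and that the list wraps around to the beginning when the end is reached; `systematic_sample` is a hypothetical helper name and the listings are just numbers standing in for names.

```python
import random

def systematic_sample(population, sample_size, seed=None):
    """Random starting point, then every step-th member, wrapping around."""
    n = len(population)
    step = n // sample_size            # 20,000 // 400 = 50
    rng = random.Random(seed)
    start = rng.randrange(n)           # random starting position
    return [population[(start + j * step) % n] for j in range(sample_size)]

# Listings numbered 1 to 20,000 stand in for the names in the phone book.
listings = list(range(1, 20001))
sample = systematic_sample(listings, 400, seed=3)
print(sample[:5])
```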
Example: Consider this data set on house prices: 114,950; 158,000; 230,500; 387,000; 389,950; 479,000; 488,800; 529,000; 575,000; 639,000; 659,000; 1,095,000; 5,500,000 Are there any outliers?
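• n = 13, so the quartile positions are 0.25 × 14 = 3.5 and 0.75 × 14 = 10.5 • Q1 = (230,500 + 387,000)/2 = 308,750 • Q3 = (639,000 + 659,000)/2 = 649,000 • IQR = 649,000 − 308,750 = 340,250 • Lower Cutoff: L = 308,750 − 1.5 × 340,250 = −201,625 • Upper Cutoff: U = 649,000 + 1.5 × 340,250 = 1,159,375 ∴ The 5,500,000 house is the only value outside the cutoffs, so it is a potential outlier.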