Stats Exam 1
(Split) Stem Plot
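Split into two lines per stem (leaves 0-4, then 5-9):
0 : 3
0 : 8
1 : 24
1 : 779
2 : 01123
2 : 567899
3 : 3
3 : 59
Split into five lines per stem (leaves 0-1, 2-3, 4-5, 6-7, 8-9):
2 : 0
2 : 223
2 : 455
2 : 66677
2 : 8888999
3 : 00011
3 : 222333
3 : 445
3 : 67
3 : 9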
Principles of Valid Experiments
1. Control/Comparison 2. Randomization 3. Replication 4. Double-Blinding
Study Designs
1. Observational Study 2. Sample Survey 3. Experiment
Computing the Median (n Even)
1. Order the data 2. If the number of observations is even, median equals the mean of two middle observations
Computing the Median (n Odd)
1. Order the data 2. If the number of observations is odd, median equals the middle observation
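The two median cards above can be sketched as one small function. This is a minimal illustration, not a library routine; the function name and the example values are made up for the sketch.

```python
def median(values):
    """Middle observation if n is odd, mean of the two middle observations if n is even."""
    ordered = sorted(values)              # step 1: order the data
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:                        # n odd: the single middle observation
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # n even: mean of the two middle observations

odd_example = median([7, 1, 4])           # ordered: 1, 4, 7
even_example = median([7, 1, 4, 10])      # ordered: 1, 4, 7, 10
print(odd_example, even_example)
```

Sorting first matters: the middle position is only meaningful in the ordered list.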
Three Principles of Experiments (Matched Pairs Design)
1. Randomly assign the two treatments to the two individuals within each pair (block) OR randomize the order of applying the treatments to each individual 2. Replication equals the number of pairs 3. Compare the two treatments. Each pair serves as its own control
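Randomizing within each pair, as described above, might look like the following sketch. The pair names, the treatment labels "A" and "B", and the helper `assign_within_pairs` are all hypothetical.

```python
import random

def assign_within_pairs(pairs, treatments=("A", "B"), seed=None):
    """Randomly assign the two treatments to the two individuals in each pair (block)."""
    rng = random.Random(seed)
    assignment = {}
    for first, second in pairs:
        order = list(treatments)
        rng.shuffle(order)                # coin flip: which member gets which treatment
        assignment[first], assignment[second] = order
    return assignment

# Hypothetical pairs of twins; replication = number of pairs (3 here).
pairs = [("twin1a", "twin1b"), ("twin2a", "twin2b"), ("twin3a", "twin3b")]
assignment = assign_within_pairs(pairs, seed=5)
for first, second in pairs:
    print(first, assignment[first], "|", second, assignment[second])
```

Every pair receives both treatments, so the comparison is always made inside a pair.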
Skills of Metacognition
1. assess task of learning statistics 2. evaluate your strengths and weaknesses 3. plan an approach to your learning 4. monitor your performance 5. reflect and adjust plan
Reasons to avoid bad sampling
1. bias • sample favors certain outcomes • not representative 2. impossible to assess uncertainty • more on this later
Correct Sampling Requirements
1. explicitly describe population 2. explicitly describe variable 3. select representative sample (but how?)
For Experiments to Work
1. explicitly describe response variable 2. if possible, choose homogeneous subjects 3. choose treatments to control effects of lurking variables (but how?) 4. assign subjects to treatments such that groups nearly identical other than treatments - no confounding (but how?)
Mean vs. Median General advice:
1. first construct histogram or stem plot, evaluate skewness and outliers 2. use median if markedly skewed or outliers are present 3. use mean if roughly symmetric
Reasons for Possible Outliers
1. if distribution is long-tailed and value is legitimate: • keep outlier 2. if values produced under different conditions than rest of data set: • remove outlier 3. if value is a mistake or typo: • correct if possible; otherwise remove outlier
What is Statistics?
1. science of extracting meaning from data 2. art of persuading the universe to divulge information about itself 3. methodology for using data to answer questions in the presence of variation
Cluster Sampling
1. used when population is naturally divided into groups called clusters. (e.g., households are divided into city blocks) 2. each cluster is essentially representative of the population as a whole 3. a random sample of clusters is taken 4. all individuals in the selected clusters are included in the sample
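The cluster sampling steps above can be sketched as follows, assuming the population is already stored as a mapping from cluster name to its members. The city blocks and household labels are invented for illustration.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """clusters: dict mapping cluster name -> list of individuals in that cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)      # random sample of clusters
    # every individual in a selected cluster is included in the sample
    sample = [person for name in chosen for person in clusters[name]]
    return chosen, sample

# Hypothetical city blocks and the households on each block.
blocks = {
    "block_A": ["a1", "a2", "a3"],
    "block_B": ["b1", "b2"],
    "block_C": ["c1", "c2", "c3", "c4"],
    "block_D": ["d1"],
}
chosen, sample = cluster_sample(blocks, n_clusters=2, seed=1)
print(chosen, sample)
```

Note that the randomness is applied to clusters, not to individuals.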
Stem Plot
 7 : 5
 8 : 05555
 9 : 0000000000055555555555555
10 : 000000000000000000555555555555555
11 : 000000000000000000000055555
12 : 0000000
13 : 000
14 : 0
15 : 0
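A stem plot like the one above can be built by splitting each value into a stem (all but the last digit) and a leaf (the last digit). The sketch below assumes non-negative whole numbers; the function name and sample values are illustrative only.

```python
from collections import defaultdict

def stem_plot(values):
    """Return stem plot lines for non-negative integers; the last digit is the leaf."""
    stems = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        stems[stem].append(str(leaf))
    lines = []
    # include empty stems so that gaps in the data remain visible
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(stems.get(stem, []))
        lines.append(f"{stem:>2} : {leaves}".rstrip())
    return lines

lines = stem_plot([75, 80, 85, 85, 90, 95, 100, 100, 105, 120])
print("\n".join(lines))
```

Keeping the empty stem (11 in this example) preserves the shape of the distribution.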
IQR Calculation
= range occupied by middle 50% of data = 3rd quartile - 1st quartile (more on this later)
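As a rough sketch of the IQR calculation, the code below takes Q1 and Q3 to be the medians of the lower and upper halves of the ordered data, leaving out the overall median when n is odd. That quartile convention is an assumption; software packages use several slightly different rules. The data values are made up.

```python
def median_of_ordered(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Q1 and Q3 as medians of the lower and upper halves (assumed convention)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]             # observations below the median position
    upper = ordered[half + n % 2:]     # observations above the median position
    return median_of_ordered(lower), median_of_ordered(upper)

data = [1, 3, 4, 6, 8, 9, 12, 15, 20]
q1, q3 = quartiles(data)
iqr = q3 - q1                          # range occupied by the middle 50% of the data
print(q1, q3, iqr)
```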
Self-check In a famous study, 5200 patients were categorized into 2 groups according to their soda habit. After 4-years of follow-up, the rate of heart disease was higher in the "regularly drank" group than the "sparingly drank" group. What kind of study is this? (a) historical comparison experiment (b) unreplicated experiment (c) confounded experiment (d) observational study
D
Self-check Five men in a room have a mean height of 70 inches. A tall man, 80 inches, enters the room. Now the mean height is: (a) impossible to say (b) 70.4 inches (c) 71.7 inches (d) 75.1 inches
C
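The new mean can be worked out from the totals: five men with a mean of 70 inches have a combined height of 350 inches, and adding the 80-inch man gives 430 inches over six men. A short check, using only the numbers in the question:

```python
n_before = 5
mean_before = 70.0
newcomer = 80.0

total_before = n_before * mean_before                      # 350 inches in total
mean_after = (total_before + newcomer) / (n_before + 1)    # 430 / 6
print(round(mean_after, 1))
```

The mean moves toward the new value but only by a sixth of the gap, which is why it lands near 71.7 rather than near 75.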
Self-check 22 5 12 13 59 What is M for the data set above? (a) 12 (b) 12.5 (c) 13 (d) 27
C
Self-check 240 subjects are available for an experiment testing the effects of different diets. Software randomly assigns 60 subjects to Diet 1, 60 subjects to Diet 2, 60 subjects to Diet 3, and 60 subjects to Diet 4. What type of study is this? (a) a randomized controlled experiment (b) a randomized block design, with four blocks (c) a matched pairs design (d) an observational study (e) none of the above
A
Self-check In a famous randomized vitamin C study, most patients could tell from taste whether they were receiving vitamin C pills or placebo pills. The rate of cold/flu was lower in the vitamin C group. What do you conclude? (a) vitamin C reduces the cold/flu rate (b) nothing - the difference could be due to vitamin C or a placebo effect
B
Self-check Which is better? (a) random sample of size 400 (b) a nonrandom sample of size 5000
A
Sample
A subgroup of the population which we can examine or observe and collect data from.
Outliers
Ask: • Is data point miscoded? • Were conditions for outlier unusual? • Should data point be excluded?
Self-check A given data set has Q1 = 25 Median = 37 Q3 = 45 Use the IQR rule to determine if the following statement is true or false: "73 is an outlier in this data set." (a) True (b) False
B
Self-check Gallup has never been wrong in predicting the winner of a presidential election. (a) True (b) False
B
Self-check Gallup's predictions have always been more accurate when the sample has been larger. (a) True (b) False
B
Self-check Subjects in an experiment knew that they were being observed, so they behaved better than they usually did. This is an example of: (a) diagnostic bias (b) Hawthorne effect (c) placebo effect (d) Simpson's paradox
B
Self-check To participate in the Time.com poll, go to the Time website and click. What kind of sample is Time.com poll? (a) convenience (b) volunteer response (c) quota
B
Self-check What are the treatments in the Salk vaccine experiment? (a) syringe, school nurse (b) polio, vaccine (c) polio status (d) vaccine, placebo
D
Self-check What is an advantage of histograms over stem plots? (a) they can be created by hand (b) the data set can be any size (c) the actual data can be extracted from them (d) they can be horizontal or vertical
B
Self-check What treatments are being compared in the Visual Cliff experiment? (a) Whether infants crawled to their mothers when called (b) Whether mothers stood on the checkered side of the table (c) Whether the table had a checkered pattern (d) Whether the observer was in the room
B
Self-check Who are the subjects in the Salk vaccine experiment? (a) 400,000 children who participated in the study (b) 200,000 children who received the vaccine (c) second-grade American children (d) all American children
A
Self-check What is the dogma of statistics? (a) numbers are fun (b) variation has to be dealt with (c) uncertainty must be avoided (d) statistics don't lie
B
Self-check A stemplot of the 29 measurements made by Henry Cavendish in 1798 when he measured the density of the earth (in g/cm³) is shown here. What is the median value of his measurements? (Leaf unit = 0.01) (a) 5.42 (b) 5.44 (c) 5.46 (d) 5.47
Variable : Cavendish
48 : 8
49 :
50 : 7
51 : 0
52 : 6799
53 : 04469
54 : 2467
55 : 03578
56 : 12358
57 : 59
58 : 5
C
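One way to check the Cavendish answer is to expand the stemplot back into values and pick the middle one. The dictionary below is transcribed from the stemplot in the question; with 29 values, the median is the 15th in order.

```python
# stem -> leaves, leaf unit = 0.01 (so stem 54, leaf 6 means 5.46)
stemplot = {
    48: "8", 49: "", 50: "7", 51: "0", 52: "6799", 53: "04469",
    54: "2467", 55: "03578", 56: "12358", 57: "59", 58: "5",
}

values = sorted(
    (stem * 10 + int(leaf)) / 100
    for stem, leaves in stemplot.items()
    for leaf in leaves
)
n = len(values)
median = values[n // 2]      # n is odd, so the median is the single middle value
print(n, median)
```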
Self-check An automobile salesman tells you that he gets a bonus if you report on a post-sale survey that he was effective and courteous. What kind of bias might be present in this survey? (a) nonresponse (b) undercoverage (c) misleading response (d) no bias
C
Self-check In a study of religious practices among U.S. college students, 127 students were interviewed. Of those interviewed, 107 said that they pray at least once in a while. What is the population? (a) All Americans (b) All U.S. students (c) All U.S. college students (d) The 127 students who were interviewed (e) The 107 students who said they prayed at least once in a while
C
Self-check The IRS obtains a sample of Utah tax returns by taking a random sample of Utah counties and then taking a random sample of returns filed in those counties. What kind of sample is this? (a) simple random sample (b) stratified random sample (c) multistage sample (d) cluster sample
C
Self-check What is the population in the potato example? (a) all potatoes in the world (b) all potatoes in the U.S. (c) all potatoes in the truckload (d) all potatoes in the buckets
C
Self-check What is the response variable in the Salk vaccine experiment? (a) type of inoculation (b) polio, vaccine (c) polio status (d) vaccine, placebo
C
Producing Data
Choosing a sample, and collecting data from it.
Incorrect Sampling Methods
Convenience Sampling Volunteer Response Sampling Quota Sampling
Self-check In a study of religious practices among U.S. college students, 127 students were interviewed. Of those interviewed, 107 said that they pray at least once in a while. What is the sample in this study? (a) All Americans (b) All U.S. students (c) All U.S. college students (d) The 127 students who were interviewed (e) The 107 students who said they prayed at least once in a while
D
Self-check What does the distribution of textbook costs tell you? (a) who the students were (b) how the students were selected (c) that the costs were measured in dollars (d) how frequently the various costs occurred in the data set
D
Self-check What is the factor in the Salk vaccine experiment? (a) type of inoculation (b) vaccine (c) placebo (d) polio status
A
Self-check What type of discipline is statistics? (a) an art (b) a science (c) a methodology (d) all of the above (e) none of the above
D
Self-check 200 students were assigned to teaching method A or B. The method A group was taught by Dr. R. while the method B group was taught by Dr. T. What kind of study is this? (a) historical comparison experiment (b) unreplicated experiment (c) confounded experiment (d) observational study
C
First Quartile of Distribution
Value at or below which about 25% of the observations fall
Second Quartile of Distribution
Value at or below which about 50% of the observations fall (the median)
Third Quartile of Distribution
Value at or below which about 75% of the observations fall
Data Set
Data identified with contextual information In a table: rows = individuals columns = variables
Diagnostic bias
Diagnosis of subjects biased by preconceived notions about effectiveness of treatment
Self-check Do cars get better gas mileage with clean air filters? Gas mileage for 10 cars with dirty air filters and clean air filters was studied. Each car was tested once with a clean air filter and once with a dirty air filter (with the order of the testing randomized). What type of study is this? (a) an observational study based on a simple random sample (b) an observational study based on a stratified random sample (c) an observational study based on a multistage random sample (d) a randomized controlled experiment (e) a matched pairs experiment
E
Quota Sampling
Force the sample to meet specified quotas
Types of Bad Experiments
Historical Comparison Experiments Unreplicated Experiments Confounded Experiments
Volunteer Response Sampling
Individuals select themselves
IQR
Interquartile Range
Interviewer
Interviewer influences responses Examples: • rude • intimidating to some people • subtle clues or gestures
Metacognition
Literally "thinking about thinking"
Center
Look for a value with • roughly half of the data to the left and • half to the right
Calculate Range
Max - Min
Replication
Multiple subjects for a given treatment
Types of Questions in Sample Surveys
Open Questions Closed Questions
Question Order
Order of questions promotes certain responses
Experimental Design Principles
Principle #1: Comparison Principle #2: Randomization Principle #3: Replication Principle #4: Double-blinding
Correct Sampling Methods
Probability Sampling
Non-Sampling Bias
Probability samples may still have bias due to: • undercoverage • non-response • misleading response • interviewer effect • question order • question wording
Stratified Random Sample
Quota sampling done right! 1. classify population into groups (strata) that are different from each other (e.g., classify according to age or gender) 2. individuals within a group (stratum) share a similar characteristic (e.g., all are males or all are children) 3. select SRS from every group 4. combine SRS's
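The four steps above can be sketched as below. The population, the `stratum_of` labelling function, and the choice of an equal sample size per stratum are assumptions made for the example.

```python
import random

def stratified_sample(population, stratum_of, n_per_stratum, seed=None):
    """Classify individuals into strata, take an SRS from every stratum, combine."""
    rng = random.Random(seed)
    strata = {}
    for person in population:                        # step 1: classify into strata
        strata.setdefault(stratum_of(person), []).append(person)
    sample = []
    for label in sorted(strata):
        # step 3: SRS from every stratum; step 4: combine the SRS's
        sample.extend(rng.sample(strata[label], n_per_stratum))
    return sample

# Hypothetical population of (id, age group) records: 6 children and 14 adults.
people = [("p%02d" % i, "child" if i < 6 else "adult") for i in range(20)]
sample = stratified_sample(people, stratum_of=lambda p: p[1], n_per_stratum=3, seed=7)
print(sample)
```

Unlike quota sampling, the individuals inside each stratum are chosen by chance rather than by the interviewer.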
Placebo Effect
Response by human subjects due to the psychological effect of being treated
Simple Random Sample (SRS)
Sample of specified size chosen such that every possible set of that size has equal chance of being the sample
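A small sketch of an SRS using Python's standard `random.sample`, which draws without replacement so that every subset of the requested size is equally likely. The population names are hypothetical.

```python
import random
from itertools import combinations

population = ["Ann", "Ben", "Cal", "Dee", "Eve"]

rng = random.Random(42)
srs = rng.sample(population, 2)          # one simple random sample of size 2

# Every possible set of size 2; under an SRS each has the same chance of being chosen.
possible_samples = list(combinations(population, 2))
print(srs, len(possible_samples))
```

With five individuals there are ten possible samples of size 2, so each has a 1-in-10 chance.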
Convenience Sampling
Select individuals in easiest possible way
Misleading Response
Selected individuals lie or give inaccurate answer (sensitive issues)
Non-Response
Selected individuals refuse to answer or can't be contacted
Undercoverage
Some individuals have no possibility of being selected
Population
The entire group of individuals that is the target of our interest
Spread
The full 'Range' Look for • minimum and • maximum
Uncertainty
The unknown regarding relationships of linked subjects
Flat or Uniform
Where there is no 'hill'
Question Wording
Wording of question leads, misleads, or confuses
Experiment Definition
a study design where treatments are imposed on individuals before observing response
Define: Sample Survey (Poll)
a type of observational study in which individuals report variables' values themselves, frequently by giving their opinions
Terminology Control
an effort to reduce effects of lurking variables
Individual
an entity that is observed e.g., student, person, rat, classroom, plot of ground
Replication
assign more than one subject to each treatment group
Response variable
characteristic measured on each subject
Variable
characteristic that is measured
Variable
characteristic that is measured on each individual e.g., cost, height, yield, opinion
Process of Statistics
collect data summarize data interpret data
Why Sample?
compared to census: • practical • cheap • often more accurate!
Control/Comparison
control lurking variables by including comparison treatments, using homogeneous subjects; used to measure placebo effect
Statistic
corresponding numerical fact in the sample
Experiment Purpose
determine if treatments cause change in response
Population
entire group of individuals of interest
Treatment
experimental condition applied to subject = value of factor
Subject
individual to which treatment applied
Sample
individuals that are selected from the population and measured
Data
measurements for a set of individuals e.g., textbook costs for the sample of students
Flagging possible Outliers
lower fence = Q1 − 1.5(IQR); upper fence = Q3 + 1.5(IQR). Values below the lower fence or above the upper fence are flagged as possible outliers.
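The 1.5 × IQR rule can be checked in a few lines. The sketch reuses the quartiles from the IQR self-check in this deck (Q1 = 25, Q3 = 45, candidate value 73); the function names are illustrative.

```python
def outlier_fences(q1, q3):
    """Return the cutoffs Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def is_possible_outlier(x, q1, q3):
    low, high = outlier_fences(q1, q3)
    return x < low or x > high

low, high = outlier_fences(25, 45)       # IQR = 20
print(low, high, is_possible_outlier(73, 25, 45))
```

Here the cutoffs are -5 and 75, so 73 falls inside them and is not flagged.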
Double Blinding
neither the subjects nor the people who evaluate them know which treatment each subject is receiving; used to prevent experimenter effect
Randomization
neutralize effects of lurking variables by assigning subjects to treatments randomly
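Random assignment can be sketched as shuffling the subjects and cutting the shuffled list into equal groups. The example mirrors the diet self-check (240 subjects, four diets); the helper name is hypothetical and it assumes the number of subjects divides evenly among the treatments.

```python
import random

def randomize(subjects, treatments, seed=None):
    """Shuffle subjects, then deal them into equal-sized treatment groups."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    size = len(shuffled) // len(treatments)      # assumes an even split
    return {t: shuffled[i * size:(i + 1) * size] for i, t in enumerate(treatments)}

groups = randomize(range(1, 241), ["Diet 1", "Diet 2", "Diet 3", "Diet 4"], seed=0)
print({diet: len(members) for diet, members in groups.items()})
```

Because chance alone decides group membership, lurking variables tend to balance out across groups.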
Parameter
numerical fact about the variable in the population
Hawthorne effect
phenomenon where people in an experiment behave differently from how they would normally behave; attention/observation bias
Factor
planned explanatory variable
Bias Due to Question Wording Occurs when...
questions have leading phrases, loaded words, or ambiguities that influence the response.
Lack of realism
realism is often compromised by controlled study conditions, choice of homogeneous subjects, application of treatments
Confounding
situation in which effects of lurking variables cannot be distinguished from effects of factors
Multistage Sample
take sample at each level: e.g., 1. SRS of states 2. for selected states, SRS's of counties 3. for selected counties, SRS's of people 4. combine SRS's of people
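The multistage steps above might be sketched as nested random draws. The toy population, its naming scheme, and the fixed sample size at each level are assumptions for the example.

```python
import random

def multistage_sample(states, n_states, n_counties, n_people, seed=None):
    """states: {state: {county: [people]}}. Take an SRS at each level, then combine."""
    rng = random.Random(seed)
    sample = []
    for state in rng.sample(sorted(states), n_states):              # 1. SRS of states
        counties = states[state]
        for county in rng.sample(sorted(counties), n_counties):     # 2. SRS of counties
            sample.extend(rng.sample(counties[county], n_people))   # 3. SRS of people
    return sample                                                   # 4. combined sample

# Hypothetical population: 4 states, 3 counties each, 5 people per county.
toy = {
    f"state{s}": {
        f"s{s}_county{c}": [f"s{s}c{c}p{p}" for p in range(5)] for c in range(3)
    }
    for s in range(4)
}
sample = multistage_sample(toy, n_states=2, n_counties=2, n_people=3, seed=3)
print(sample)
```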
Purpose: Sample Survey (Poll)
use sample fact in place of population fact • e.g., use sample mean as (uncertain) estimate of population mean
Explanatory variable
used to predict or explain changes in the response variable
Measurement:
value of a variable for an individual e.g., textbook cost for Nathan
Quantitative Variable
variable whose possible values are meaningful numbers e.g., cost, height, yield
Categorical Variable
variable whose possible values are non-quantitative categories e.g., gender, opinion
Lurking variables
variables that affect the response variable but are not measured or included in the planned factors
Matched Pairs
• Examples: • Twins: each receiving a treatment • Two treatments on each individual • Measurements before and after treatment on each individual
Valid Experimental Designs
• Randomized Controlled Experiment: subjects randomly assigned to treatments • Randomized Block Design (RBD) • matched pairs, a special case of RBD
Dogma of Statistics
• always variation • variation leads to uncertainty • converting data into useful information requires understanding and dealing with variation/uncertainty
• collect data • summarize data • exploratory data analysis • inference • distribution of a variable
• collect data: get data from a sample of the population • summarize data: turn data into useful information • exploratory data analysis: see the five points on the Exploratory Data Analysis card • inference: conclusions drawn about the general population based on the sample • distribution of a variable: the values of the variable and how often they occurred
Noncompliance
• failure to submit to the assigned treatment • refusal to follow the protocol of the experiment
Why Experiment? Compared to observational study:
• no confounded lurking variables • can validly draw cause-effect conclusions
Exploratory Data Analysis
• organize and summarize data • discover: features, patterns, striking deviations from patterns • interpret patterns in context • single variable patterns (distribution) and two variable patterns (relationship) • visual displays and numerical summaries
Pitfalls in Experimentation: Randomized comparative experiments may still have problems. What are they?
• placebo effect • diagnostic bias • lack of realism • Hawthorne effect • noncompliance
Principles of Data Ethics
• safety and well-being of the subjects must be protected • all individuals must give their informed consent before data are collected • individual data must be kept confidential
Observational Studies
• subjects choose which treatment to receive or naturally belong to one of the treatment groups • lurking variables that influence choice confounded with treatments • passive data collection: observing, measuring, counting, subjects are undisturbed • media often improperly attribute cause-effect conclusions to these