MAT 202 Statistics Super Study Guide
At the beginning of the semester, an Intro to Statistics instructor asked the 225 students enrolled in the class to complete a survey. For each student, the instructor collects information about the following: Sport (Favorite sport: Football, Baseball, Basketball, Hockey, Other) Exercise (How many minutes do you spend exercising per week) Personality (on a 0-25 scale, how would you describe your personality 0=total introvert, 25=total extrovert) Death penalty (Strongly agree, Agree, Neutral, Disagree, Strongly Disagree) How many variables are in this example?
4
The histogram below displays the distribution of 50 ages at death due to trauma (accidents and homicides) that were observed in a certain hospital during a week. What percentage of deaths were individuals younger than 35?
68%
Use the Standard Deviation Rule to calculate 1, 2 and 3 SD's of data if the mean= 70.5 and the SD=3
68% of data: 70.5-3= 67.5 to 70.5+3= 73.5; 95% of data: 70.5-(2)3= 64.5 to 70.5+(2)3= 76.5; All or nearly all (99.7%) of data: 70.5-(3)3= 61.5 to 70.5+(3)3= 79.5
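The arithmetic on this card can be checked with a short Python sketch. The function name sd_rule_intervals is a hypothetical helper introduced here for illustration; it only computes the mean plus or minus 1, 2 and 3 SDs and assumes the data are roughly normal, as the Standard Deviation Rule requires.

```python
def sd_rule_intervals(mean, sd):
    """Return the (low, high) interval for 1, 2 and 3 SDs around the mean."""
    return {k: (mean - k * sd, mean + k * sd) for k in (1, 2, 3)}

# Values from the card: mean = 70.5, SD = 3
intervals = sd_rule_intervals(70.5, 3)
for k, (low, high) in intervals.items():
    print(f"{k} SD: {low} to {high}")
```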
This histogram shows the distribution of times, in minutes, required for 25 rats in an animal behavior experiment to navigate a maze successfully. What percentage of the rats navigated the maze in less than 5.5 minutes?
84%
stemplot
Also called a stem-and-leaf plot. Data are separated into a stem and leaf by place value and organized in the form of a histogram.
Least Squares Criterion
Among all the lines that could reasonably fit the data, choose the one with the smallest sum of squared vertical deviations (smallest total area of the squares). That line is the least squares regression line
Calculating standard deviation
1. Calculate each score's deviation (distance from the mean) 2. Square each deviation 3. Add the squared deviations and divide by n-1 (this is the variance) 4. Take the square root of the variance (this is the standard deviation) ie. data set (7, 9, 5, 13, 3, 11, 15, 9) mean= 9 deviations: (7-9), (9-9), (5-9), (13-9)... = -2, 0, -4, 4, -6, 2, 6, 0 square deviations and add= 4+0+16+16+36+4+36+0= 112 divide by n-1 (8-1=7) = 112/7= 16 (the variance) square root the variance for a SD= 4
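The four steps above can be sketched in Python as follows. This is a minimal illustration, not a library routine: sample_sd is a hypothetical helper name, and the data set is assumed to be 7, 9, 5, 13, 3, 11, 15, 9, which are the values consistent with the deviations listed on the card (-2, 0, -4, 4, -6, 2, 6, 0).

```python
import math

def sample_sd(data):
    """Return (mean, variance, SD) using the n-1 divisor from the card."""
    n = len(data)
    mean = sum(data) / n
    # Steps 1 and 2: deviation from the mean, then square it
    squared_deviations = [(x - mean) ** 2 for x in data]
    # Step 3: divide the sum of squared deviations by n-1 (the variance)
    variance = sum(squared_deviations) / (n - 1)
    # Step 4: the square root of the variance is the standard deviation
    return mean, variance, math.sqrt(variance)

# Assumed data set (matches the deviations shown on the card)
mean, variance, sd = sample_sd([7, 9, 5, 13, 3, 11, 15, 9])
print(mean, variance, sd)
```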
Here again is the histogram showing the distribution of 50 ages at death due to trauma (accidents and homicides) that occurred in a certain hospital during a week. A possible value of the median in this example is:
33
At the beginning of the semester, an Intro to Statistics instructor asked the 225 students enrolled in the class to complete a survey. For each student, the instructor collects information about the following: Sport (Favorite sport: Football, Baseball, Basketball, Hockey, Other) Exercise (How many minutes do you spend exercising per week) Personality (on a 0-25 scale, how would you describe your personality 0=total introvert, 25=total extrovert) Death penalty (Strongly agree, Agree, Neutral, Disagree, Strongly Disagree) Which of the variables above is ordinal?
Death penalty
Which of the tables is the appropriate table of conditional percents to discover if the region where one lives affects whether or not one has health insurance?
Table A
Which of the following variables is not a ratio variable? Temperature (Outside temperature) Charity (How much money do you donate to charity in a year) Internet (How many minutes per day do you spend on the Internet?) Text messages (How many text messages you send a day)
Temperature
What does it mean to have a SD of 4 with a mean of 9?
The average is 9, give or take 4
The number of hours students study is compared with the day's highest temperature. It is found that the coefficient of determination r2 = 0.481. About 48% of the studying habits can be explained by the linear regression model of the relationship between the two variables. What is r?
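The square root of 0.481 is approximately 0.694, so r is about 0.694 or -0.694. The sign of r matches the direction (positive or negative) of the linear relationship.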
A local cafe kept track of the number of servings of the soup of the day it sold each day, and the temperature that day, for two months during the summer. The data are displayed in the scatterplot below:
Negative linear relationship with outlier(s)
a
a= y bar - b(x bar) ie a= 423 - (-3 * 51) = 576
SD=0 when
all observations are the same value
outlier
an observation is considered an outlier if it is less than Q1-1.5(IQR) or more than Q3+1.5(IQR) ie lower cutoff: 32-(1.5)(9.5)= 17.75; upper cutoff: 41.5+(1.5)(9.5)= 55.75. Observations of 62, 74 and 80 should therefore be flagged as outliers
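The 1.5(IQR) rule on this card can be sketched in a few lines of Python. The helper name outlier_fences is hypothetical; the quartiles (Q1=32, Q3=41.5) and the three candidate observations come from the card itself.

```python
def outlier_fences(q1, q3):
    """Return the (lower, upper) cutoffs of the 1.5(IQR) rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

low, high = outlier_fences(32, 41.5)
# Any observation below the lower cutoff or above the upper cutoff is flagged
flagged = [x for x in (62, 74, 80) if x < low or x > high]
print(low, high, flagged)
```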
to add an observation and keep M the same, the observation must be
equal in value to the current median (at the exact same place where the median already is)
b
b= r(Sy/Sx) ie b= (-0.793) x (82.8/21.78)= approximately -3. For every year a driver gets older, the maximum distance at which they can read a sign decreases on average by 3 feet
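As a rough check of the slope and intercept formulas, the sketch below plugs in the numbers from the driver example. The function names slope and intercept are hypothetical helpers, and the mean distance y_bar = 423 is an assumption made here because it reproduces the intercept of 576 quoted in this guide.

```python
def slope(r, s_y, s_x):
    """b = r * (Sy / Sx)"""
    return r * (s_y / s_x)

def intercept(y_bar, b, x_bar):
    """a = y bar - b * (x bar)"""
    return y_bar - b * x_bar

b = slope(-0.793, 82.8, 21.78)   # values from the card
# y_bar = 423 is an assumed value; x_bar = 51 and the rounded slope follow the guide
a = intercept(423, round(b), 51)
print(b, a)
```

The slope is rounded to -3 before computing the intercept because that is how the guide's own example does the arithmetic.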
This histogram shows the distribution of times, in minutes, required for 25 rats in an animal behavior experiment to navigate a maze successfully. Which of the following best describes the shape of the histogram?
Right skewed with a possible outlier
Sx
SD of explanatory variable
Sy
SD of response variable
The boxplots below show amount spent for vehicles in two neighboring locations (in thousands of dollars). Which city has a greater percentage of vehicles which cost between $30,000 and $50,000?
Suburbia
The distribution of the amount of money spent by students for textbooks in a semester is approximately normal in shape with a mean of $235 and a standard deviation of $20. According to the Standard Deviation Rule, in a semester, almost all (99.7%) of the students spent on textbooks:
between 175 and 295 dollars.
The boxplots below show amount spent for vehicles in two neighboring locations (in thousands of dollars). Which city has the greater percentage of vehicles which cost below $30,000?
Both locations have the same percentage of vehicles which cost below $30,000.
standard deviation rule
a normal distribution contains 68% of the data between one standard deviation above and below the mean 95% of the data between two standard deviations above and below the mean 99.7% of data between three standard deviations above and below the mean
correlation coefficient (r)
a numerical measure of the strength and direction of a linear relationship between two quantitative variables. Can fall between -1 and 1
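One common way to compute r is to average the products of standardized x and y values, dividing by n-1. The sketch below follows that formula; the function name correlation and both small data sets are hypothetical and only illustrate that r stays between -1 and 1.

```python
import math

def correlation(xs, ys):
    """Correlation coefficient r from standardized x and y values."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

# Hypothetical data: a perfect line, then a weaker positive relationship
r_perfect = correlation([1, 2, 3, 4], [2, 4, 6, 8])
r_weaker = correlation([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(r_perfect, r_weaker)
```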
A survey was conducted to study the relationship between the annual income of a family and the amount of money the family spends on entertainment. Data were collected from a random sample of 280 families from a certain metropolitan area. A meaningful graphical display of these data would be:
a scatterplot
dataset
a set of data identified with particular circumstances. Typically displayed in tables with rows as the individuals and columns as the variables
A survey was conducted to study the relationship between whether the family is buying or renting their home and the marital status of the parents. Data were collected from a random sample of 280 families from a certain metropolitan area. A meaningful graphical display of these data would be:
a two way table
ordinal variable
categorical variable where there is a natural order among the categories ie socioeconomic status (high, medium, low)
nominal variable
categorical variables where there is no natural order among the categories (eye color)
median
midpoint of the distribution (M): the middle observation of the ordered data. If n is even, it is the average of the 2 observations at the center ie 3, 5 M=4
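The rule for finding M can be written out as a short Python sketch. The function name median is chosen here for illustration (the standard library's statistics.median does the same job); the odd-sized data set is hypothetical, while 3, 5 is the example from the card.

```python
def median(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

m_even = median([3, 5])      # example from the card
m_odd = median([7, 3, 5])    # hypothetical odd-sized data set
print(m_even, m_odd)
```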
r
correlation coefficient
form of a scatterplot means its
general shape
boxplot
graphically represents the distribution of a quantitative variable, displaying the 5 number summary and any observations classified as outliers by the 1.5 IQR rule. Outliers are marked with *. The box spans the IQR: top line Q3, bottom line Q1, with M shown as a line inside the box. The whiskers end at the max (largest non-outlier) at the top and the min (smallest non-outlier) at the bottom
strength of a scatterplot means
how closely the data follow the form of the relationship (strong vs. weak); assessing it requires a numerical measure
a bar graph is used to show
how the different categories compare to each other
a pie chart is used to show
how the different categories relate to the whole
At the beginning of the semester, an Intro to Statistics instructor asked the 225 students enrolled in the class to complete a survey. For each student, the instructor collects information about the following: Sport (Favorite sport: Football, Baseball, Basketball, Hockey, Other) Exercise (How many minutes do you spend exercising per week) Personality (on a 0-25 scale, how would you describe your personality 0=total introvert, 25=total extrovert) Death penalty (Strongly agree, Agree, Neutral, Disagree, Strongly Disagree) What kind of variable is Personality?
interval
what are the two types of quantitative variables?
interval and ratio
if data is skewed right, the mean will be
larger than the median
x bar
mean
x bar
mean of explanatory variables
y bar
mean of response variable
_______ are sensitive to outliers, while _______ are resistant
means, medians
M
median
mode
most commonly occurring value in a distribution
At the beginning of the semester, an Intro to Statistics instructor asked the 225 students enrolled in the class to complete a survey. For each student, the instructor collects information about the following: Sport (Favorite sport: Football, Baseball, Basketball, Hockey, Other) Exercise (How many minutes do you spend exercising per week) Personality (on a 0-25 scale, how would you describe your personality 0=total introvert, 25=total extrovert) Death penalty (Strongly agree, Agree, Neutral, Disagree, Strongly Disagree) What kind of variable is Sport?
nominal
what are the two types of categorical variables?
nominal and ordinal
sample size
number of individuals
box plots do not show
number of observations
variable
particular characteristic of the individual, ie
individual
particular person/object, unit, ie marathon runners
standard deviation
quantifies the spread of a distribution by measuring how far the observations are from their mean (x bar); gives an average distance from a data point to the mean; represented by SD, s, Sd, and StDev
Interquartile Range (IQR)
quantifies variability of a distribution by giving us the range covered by the middle 50% of data
ratio variable
quantitative variable for which it makes sense to talk about the difference between values and for which the ratio also has intrinsic meaning ie income, weight, time
interval variable
represent a measure/count for which it makes sense to talk about the DIFFERENCE between values but it does not make sense to talk about the ratio ie temperature
categorical variable
represent labels or ranks and places/classifies an individual into one of several groups ie eye color, social status, right/left handed, "strongly agree" to "strongly disagree" Can be represented numerically (1, 2)
quantitative variable
represents a measurement or count, answering "how much" or "how many" ie. time waiting in line, temperature, income, height
In order to study the relationship between IQ level and GPA, data were collected from a sample of 540 students. The data collected in this study would best be displayed using:
scatterplot
The data display and numerical summary you should use to analyze QQ study are
scatterplot (display) and correlation coefficient (numerical summary); check for negative or positive direction and for outliers
what is used to interpret a histogram?
shape, center, spread (the pattern) and outliers (deviation from the pattern)
In order to study whether there is a relationship between IQ level and birth order, data were collected from a sample of 540 students on their birth order (Oldest/In Between/Youngest) and their score on an IQ test. The data collected in this study would best be displayed using:
side by side box plots
The data display and numerical summary you should use to analyze CQ study are
side by side boxplots (display) and descriptive statistics (numerical summary)
A store asked 250 of its customers how much they spend on groceries each week. The responses were also classified according to the gender of the customers. We want to study whether there is a relationship between amount spent on groceries and gender. A meaningful display of the data from this study would be:
side by side boxplots
Here again is the histogram showing the distribution of 50 ages at death due to trauma (accidents and homicides) that were observed in a certain hospital during a week. For the data described by the above histogram, the median will be _________ than the mean
smaller
This histogram shows the distribution of times, in minutes, required for 25 rats in an animal behavior experiment to navigate a maze successfully. For the data described by the above histogram, the median will be
smaller than the mean
if the data is skewed left, the mean will be
smaller than the median
regression
specifies the dependence of the response variable on the explanatory variable
means can be used as a measure of center over a median only for
symmetric distributions without outliers, otherwise medians are better
SD should only be used for
symmetrical distributions without outliers, as it is strongly influenced by outliers
shape
symmetry/skewness of the distribution; peakedness (modality), or number of peaks (modes)
linear regression
technique of finding the line that best fits the pattern of the linear relationship of the response and explanatory variable
for symmetric distributions with no outliers, the mean (x bar) is approximately equal to
the median (M)
A store asked 250 of its customers whether they were satisfied with the service or not. The responses were also classified according to the gender of the customers. We want to study whether there is a relationship between satisfaction and gender. A meaningful display of the data from this study would be:
two way table
types of symmetrical distributions
unimodal (one peak), bimodal (two peaks), multimodal (more than 2 peaks), uniform (no peaks)
distribution
what values the variable takes and how often the variable takes those values
This histogram shows the distribution of times, in minutes, required for 25 rats in an animal behavior experiment to navigate a maze successfully. A possible value of the median in this example is:
3.9
the sum of deviations from the mean is always
0
The data display and numerical summary you should use to analyze CC study are
two way table (display) and conditional percentages (numerical summary)
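Conditional percentages are computed within each category of the explanatory variable, so each row of the table adds up to 100%. The sketch below shows one way to do this; the function name conditional_percents and all of the counts are hypothetical and are not taken from any question in this guide.

```python
def conditional_percents(table):
    """Convert each row of counts into percentages of that row's total."""
    result = {}
    for row, cells in table.items():
        total = sum(cells.values())
        result[row] = {col: 100 * count / total for col, count in cells.items()}
    return result

# Hypothetical counts: rows are the explanatory variable, columns the response
counts = {
    "Female": {"Satisfied": 90, "Not satisfied": 30},
    "Male": {"Satisfied": 60, "Not satisfied": 40},
}
percents = conditional_percents(counts)
print(percents)
```

Comparing the rows (75% versus 60% satisfied in this made-up table) is what suggests whether the two categorical variables are related.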
left skewed distribution
A density curve where the left side of the distribution extends in a long tail
right skewed distribution
A density curve where the right side of the distribution extends in a long tail; (mean > median)
What determines which numerical measures of center and spread are appropriate for describing a given distribution of a quantitative variable?
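The shape of the distribution: for skewed distributions or distributions with outliers, the median and IQR are better; for symmetric distributions with no outliers, the mean and SD are appropriate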
The boxplots below show amount spent for vehicles in two neighboring locations (in thousands of dollars). Which location has more vehicles?
It is impossible to tell from the boxplots.
The histogram below displays the distribution of exam scores for 40 students in an elementary statistics course. To describe the center and spread of the above distribution, the appropriate numerical measures are:
M and IQR
The boxplots below show amount spent for vehicles in two neighboring locations (in thousands of dollars). Which city has greater variability in the cost of the vehicles?
Metropolis
5 Number Summary
Min, Q1, M, Q3, Max, with Q1 to Q3= IQR
Coefficient of Determination r2
The proportion of the variation in the y data set that can be predicted by the linear regression model of the relationship between x and y. The value of r2 can fall between 0 and 1. r2 is never negative.
A student survey was conducted in a major university, where data were collected from a random sample of 750 undergraduate students. One variable that was recorded for each student was the student's answer to the question: What region of the country did you live in just prior to enrolling in this university? Northeast/Southeast/Northwest/Southwest/Midwest/Outside the U.S. These data would be best displayed using which of the following?
Pie chart
Here again is the histogram showing the distribution of 50 ages at death due to trauma (accidents and homicides) that were observed in a certain hospital during a week. Which of the following best describes the shape of the histogram?
Right skewed with a possible outlier
In IQR, M is essentially
Q2
Calculating IQR
Q3-Q1, with Q1 the median of the bottom 50% of data and Q3 as the median of the upper 50% ie Q1=32 and Q3=41.5 IQR= 41.5-32= 9.5
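The quartile rule on this card can be sketched as below. The function name quartiles and the data set of 1 through 8 are hypothetical; the sketch also assumes one common convention for odd n, which is to leave the median out of both halves (other textbooks and software use slightly different conventions).

```python
from statistics import median

def quartiles(values):
    """Q1 = median of the lower half, Q3 = median of the upper half."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd n, skip the middle observation (assumed convention)
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower), median(upper)

q1, q3 = quartiles([1, 2, 3, 4, 5, 6, 7, 8])   # hypothetical data
iqr = q3 - q1
card_iqr = 41.5 - 32                            # the example from the card
print(q1, q3, iqr, card_iqr)
```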
The number of people at a park each day is compared with the day's highest temperature. It is found that the coefficient of correlation r = 0.76. What is r2?
The coefficient of determination r2 = (0.76)^2 = 0.5776, or about 0.578. About 58% of the variation in park attendance can be explained by the linear regression model of the relationship between the two variables.
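Going between r and r2 is a one-line calculation in either direction, as the sketch below shows with the two values used in this guide (r = 0.76 for park attendance and r2 = 0.481 for study hours). Note that taking the square root only recovers the size of r; the sign has to come from the direction of the relationship.

```python
import math

r = 0.76
r_squared = r ** 2                  # coefficient of determination from r

r_magnitude = math.sqrt(0.481)      # size of r from r2; the sign is not recoverable
print(round(r_squared, 3), round(r_magnitude, 3))
```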
Here again is the histogram showing the distribution of 50 ages at death due to trauma (accidents and homicides) that were observed in a certain hospital during a week. Assume that the largest observation in this dataset is 90. If this observation were wrongly recorded as 900, then:
The mean will increase, but the median won't change.
This histogram shows the distribution of times, in minutes, required for 25 rats in an animal behavior experiment to navigate a maze successfully. Assume that the largest observation in this dataset is 8.6 minutes. If this observation were wrongly recorded as 86, then the mean will ___________ and the median will ___________
The mean will increase, but the median won't change.
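The effect of a recording error on the mean and the median can be seen with a tiny example. The five times below are hypothetical (the histogram's actual 25 values are not given); only the largest value, 8.6, and the wrongly recorded 86 come from the question.

```python
from statistics import mean, median

times = [2.1, 3.0, 3.9, 4.4, 8.6]      # hypothetical maze times, largest is 8.6
recorded = [2.1, 3.0, 3.9, 4.4, 86]    # same data with 8.6 wrongly recorded as 86

print(mean(times), median(times))
print(mean(recorded), median(recorded))
```

The mean is pulled up by the single bad value, while the middle observation, and so the median, stays where it was.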
At the beginning of the semester, an Intro to Statistics instructor asked the 225 students enrolled in the class to complete a survey. For each student, the instructor collects information about the following: Sport (Favorite sport: Football, Baseball, Basketball, Hockey, Other) Exercise (How many minutes do you spend exercising per week) Personality (on a 0-25 scale, how would you describe your personality 0=total introvert, 25=total extrovert) Death penalty (Strongly agree, Agree, Neutral, Disagree, Strongly Disagree) What are the individuals in this example?
The students enrolled in Intro to Statistics
The histogram below displays the distribution of 50 ages at death due to trauma (accidents and homicides) that were observed in a certain hospital during a week. What is the largest age of death due to trauma in this dataset?
This information is not provided by the histogram.
This histogram shows the distribution of times, in minutes, required for 25 rats in an animal behavior experiment to navigate a maze successfully. What is the largest time recorded in this dataset?
This information is not provided by the histogram.
histogram
a graph of bars depicting the frequency distribution of a quantitative variable, with the values grouped into intervals
