Stat exam 2
In each of the following situations, determine whether the mean is smaller than the median, the mean is greater than the median, or the mean is the same as the median: 1. The distribution is symmetric. 2. The distribution is left skewed. 3. The distribution is right skewed.
1. The distribution is symmetric. The mean is the same as the median. 2. The distribution is left skewed. The mean is less than the median. 3. The distribution is right skewed. The mean is greater than the median.
For each of the following situations, do you expect the standard score to be greater than zero, equal to zero, or less than zero? 1. The observed value is the same as the mean. 2. The observed value is less than the mean. 3. The observed value is greater than the mean.
1. The observed value is the same as the mean. Standard score equals zero. 2. The observed value is less than the mean. Standard score is less than zero. 3. The observed value is greater than the mean. Standard score is greater than zero.
What three things are necessary for a clear data table?
1. A main heading giving the subject and the date of the data, 2. labels within the table to identify the variables and the units they are measured in, and 3. the source of the data.
In each of the following situations, what type of chart or graph would be most appropriate: 1. Graphing the number of students at Texas A&M each year from 1950 to 2016? 2. You collect data from 8 different countries (United States, Iceland, Sweden, Canada, Greenland, Switzerland, Finland, and Denmark) about what percent of people in each country speak at least two languages fluently. 3. You do a survey of students at A&M and ask students whether or not they want the library to be open for longer hours. 50% say yes, 43% say no, and 7% are undecided.
1. line graph, 2. bar graph, 3. pie or bar graph
One way to describe our data is to draw a density curve. What is a density curve and what is the general idea behind how to find the density curve for a set of data?
A density curve is a way of describing the overall pattern of a distribution with a smooth curve. In general, a density curve is created by drawing a smooth curve through the tops of the bars of a histogram, making sure to draw it such that the area under the curve is exactly 1.
What is a market basket? What is a fixed market basket price index?
A market basket is a collection of goods and services whose total cost we follow. A fixed market basket price index is an index number for the total cost of a fixed collection of goods and services.
What is a personal probability?
A personal probability of an outcome is a number between 0 and 1 that expresses an individual's judgement of how likely the outcome is.
What does a positive value of r indicate? What does a negative value of r indicate?
A positive value of r indicates a positive association. A negative value of r indicates a negative association.
What is a probability model?
A probability model for a random phenomenon describes all the possible outcomes and says how to assign probability to any collection of outcomes (events).
What is a regression line? What does a regression line do/accomplish?
A regression line is a line that summarizes the straight-line relationship between two variables. It describes how a response variable, y, changes as an explanatory variable, x, changes. It can be used to predict the value of y given a value of x.
What is a simulation? Why do we use simulations? When we do a simulation, how many trials do we generally need to do?
A simulation is when we use random digits from a table or from computer software to imitate chance behavior. We use simulations to try and determine the probabilities of complex events. We generally need to do thousands of trials of a simulation to get a good estimate.
What does a standard score do? What is another name for a standard score? How do we calculate a standard score?
A standard score expresses an observation in terms of the number of standard deviations it is above or below the mean. Standard scores are also called z-scores. We calculate a standard score by using this formula: (observation - mean)/standard deviation.
Eleanor flips a coin 6 times and gets HTHTTH. Brittany flips a coin 6 times and gets HHHTTT. Alec flips a coin 6 times and gets HHHHHH. Which of these outcomes was most likely to happen? Which of these outcomes was least likely to happen?
All of these outcomes have the same probability.
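As a rough check on the answer above, the sketch below multiplies the per-flip probabilities for each exact sequence. It assumes a fair coin and independent flips; the function name `sequence_probability` is made up for illustration.

```python
from fractions import Fraction

def sequence_probability(sequence, p_heads=Fraction(1, 2)):
    """Probability of one exact sequence of independent flips (assumes a fair coin by default)."""
    probability = Fraction(1)
    for flip in sequence:
        probability *= p_heads if flip == "H" else 1 - p_heads
    return probability

for flips in ("HTHTTH", "HHHTTT", "HHHHHH"):
    print(flips, sequence_probability(flips))  # each line ends with 1/64
```

Every specific sequence of 6 flips has the same probability, which is why none of the three students' results is more likely than the others.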
What does an index number do? How do we calculate an index number?
An index number measures the value of a variable relative to its value at a base period. It is calculated by dividing the value by the base value and then multiplying by 100.
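The calculation described above can be written as a one-line function. This is only a sketch; the function name and the prices are hypothetical.

```python
def index_number(value, base_value):
    """Value relative to its value at the base period, scaled so the base period equals 100."""
    return value / base_value * 100

# Hypothetical prices: an item cost 1.25 in the base period and costs 2.50 now.
print(index_number(2.50, 1.25))  # 200.0
```

An index number of 200 would mean the value has doubled since the base period.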
What does it mean if something has a probability of zero? Give an example of something that may have a probability of zero?
An outcome with a probability of 0 never occurs. An example of something with a probability of 0 is rolling a 6 sided die and getting a 7.
What does it mean if something has a probability of 0.5? Give an example of something that may have a probability of 0.5?
An outcome with a probability of 0.5 happens half of the time in a very long series of trials. An example of something with a probability of 0.5 is tossing a coin and getting tails.
What does it mean if something has a probability of one? Give an example of something that may have a probability of one?
An outcome with a probability of 1 always occurs. An example of something with a probability of 1 is rolling a 6 sided die and getting a number between 1 and 6.
Bar charts and pie charts are most useful for what kind of variables?
Bar charts and pie charts are most useful for categorical variables.
What chart makes it easier for us to compare categories? What are some advantages of this type of chart?
Bar graphs make it easiest for us to compare categories. These are easy to draw, there is a natural way to order categories with a bar graph, and we can visually compare different categories, even those not positionally next to each other.
A survey of college freshmen in 2007 asked what field they planned to study. Of those surveyed, 12.8% were arts and humanities majors, 17.7% were business majors, 9.2% were education majors, 19.3% were engineering, biological sciences, or physical sciences majors, 14.5% were professional majors, and 11.1% were social science majors. Given the data presented (using no additional categories), what type of chart (bar chart or pie chart) is appropriate to use for this data? What could we add to this data to make it appropriate to use either type of chart?
Because the categories given do not include all possible categories, it would not be appropriate to use a pie chart of this data as presented. It would only be appropriate to use a bar chart. If we added a category for other majors, then it would be appropriate to use either type of chart.
What do we know about chance behavior in the short run? What do we know about chance behavior in the long run?
Chance behavior is unpredictable in the short run, so we don't know anything about chance behavior in the short run. Chance behavior is regular and predictable in the long run.
Does correlation take into account the difference between the explanatory and the response variable? Will the correlation change if we switch the two?
Correlation does not take into account the difference between the explanatory and the response variable. It will not change if we switch the two.
How do you calculate correlation (r)?
Correlation is found by finding the standard scores for each of the x and y values, multiplying the pairs of standard scores together, adding up all these products, and then dividing by n-1. The formula describing this is: r = [sum of (standard score of x) * (standard score of y)] / (n-1).
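The steps described above can be sketched in code as follows. This is a minimal illustration, not a library routine; the function name and the sample data are hypothetical, and `statistics.stdev` is used because it divides by n-1.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation r: average product of paired standard scores, dividing by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (divide by n - 1)
    products = [
        ((x - x_bar) / s_x) * ((y - y_bar) / s_y)  # standard score of x times standard score of y
        for x, y in zip(xs, ys)
    ]
    return sum(products) / (n - 1)

# Hypothetical data: hours studied and quiz scores for five students.
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 79]
print(round(correlation(hours, scores), 3))
print(round(correlation(scores, hours), 3))  # same value: r ignores which variable is x
```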
What kind of associations does correlation measure?
Correlation measures straight-line associations only.
What does correlation measure? Why do we use correlation to measure this?
Correlation measures the strength of the straight-line relationship between two variables. We use correlation to do this because it is very hard for us to judge how strong this relationship is just by looking at it.
Why do we use data tables?
Data tables are used to summarize large amounts of information. We use them to show us what is going on with the data overall, instead of what is going on with each individual.
When making classes for histograms, we need to make sure they are exclusive and exhaustive. What does this mean?
Exclusive means that there should be no overlap between groups (one individual can't be placed into multiple groups). Exhaustive means that there is a place for every data point; every individual falls into a group.
Using a density curve, how do we find the mean?
The mean is slightly harder to find just by looking at a density curve. It is the balancing point: the point at which the curve would balance if it were made of a solid material.
Citizens rely on their government for data. What qualities should good government data have?
Good government data should be accurate, timely, keeping up with the changes in society and the economy, and free from political influence.
What does it mean if the standard deviation is 0? What values can the standard deviation never be? What does it mean if one set of numbers has a larger standard deviation than another set of numbers?
If the standard deviation is 0, that means there is no spread and all the observations have the same value. The standard deviation can never be negative (it must be greater than or equal to 0). If one set of numbers has a larger standard deviation than another, that means its values are more spread out.
If we change the mean of a normal distribution, what happens? If we change the standard deviation of a normal distribution, what happens?
If we change the mean of a normal distribution we change its location. If we change the standard deviation of a normal distribution we change its shape.
How do we draw a box plot?
In a boxplot, a center box spans the quartiles. A line drawn across this box marks the median. Lines extend from the box out to the smallest and largest observations (the minimum and the maximum).
When we make a histogram, what do we do? What are the steps to making a histogram?
In order to make a histogram, we have to group nearby values together to make the histogram easy to read. First we divide the range of the data into classes or groups of equal width, then we count the number of individuals in each class/group, and finally we draw the histogram.
When making a data table, is it better to present data as counts (for example, the number of people in a category) or as rates (for example, the percent of people in a category)?
It is better to present the data as rates, because they are more informative to someone reading the data table.
Phil wins the lottery in 2008. He wins the lottery again in 2013. Is it unlikely that Phil won the lottery twice? Is it unlikely that someone won the lottery twice?
It is unlikely that Phil won the lottery twice, but it is not unlikely that someone won the lottery twice.
What type of graph do we use to show how quantitative variables change over time?
A line graph. It shows how a quantitative variable changes over time by plotting the values of the variable (vertical scale) against time (horizontal scale). Data points are connected by lines.
In order to use an association for predictive reasons, do we need to know that one variable causes the change in the other (is causation necessary)?
No, causation is not necessary for us to be able to use an association for predictive reasons.
Can we calculate correlation for categorical variables?
No, correlation can only be calculated for numeric variables.
Does a strong relationship between two variables mean that changes in one variable causes changes in another variable?
No, just because we see a relationship, we can't be sure that changes in one variable actually cause the changes in the other variable.
Chad is taking a statistics class and a history class. The probability that he passes the statistics class is 0.9. The probability that he passes the history class is 0.8. The probability that he passes both is 0.77. Are these two events independent?
No, these events are not independent. If they were independent, the probability of passing both classes would be the same as the probability of passing statistics times the probability of passing history. Since 0.8*0.9 = 0.72, which is not the same as 0.77, the two events are not independent.
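The multiplication check in this answer can be reproduced in a few lines. The variable names are made up for illustration; the probabilities are the ones given in the question.

```python
import math

p_statistics = 0.9   # probability Chad passes statistics
p_history = 0.8      # probability Chad passes history
p_both = 0.77        # stated probability he passes both

product = p_statistics * p_history
# If the events were independent, the product would equal the stated joint probability.
independent = math.isclose(product, p_both)

print(round(product, 2), independent)  # 0.72 False
```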
People who use low-calorie salad dressing in place of regular dressing tend to be heavier than people who use regular dressing. Does this mean that low-calorie salad dressings cause weight gain? Give a more plausible explanation for this association.
No, this does not mean that low-calorie salad dressing causes people to gain weight. People who are overweight are more likely to be trying to lose weight, so they may be more likely to use low-calorie salad dressing.
With normal distributions, do we expect there to be many outliers?
No, we do not expect there to be outliers with normal distributions.
When defining classes, can the classes be of unequal widths?
No, the classes can't be of unequal widths, because this changes how the graph is interpreted (our eyes respond to the areas of the bars in a histogram).
In a histogram, should there be any space between the class bars?
No, when drawing a histogram, there shouldn't be any space between the class bars.
What are three terms we use to describe normal curves (or normal distributions)?
Normal distributions are symmetric, single-peaked, and bell-shaped.
Give an example in which you would rely on a probability found as a long-term proportion from data on many trials. Give an example in which you would rely on your own personal probability.
One time you would rely on a probability found as a long-term proportion from data on many trials is if you wanted to know the probability of rolling a die and getting a 1. One time you would rely on your own personal probability is if you wanted to know the probability of you personally getting in a car crash.
A pictogram is a variation of a bar chart. Why is it generally a bad idea to use a pictogram?
Pictograms are misleading. Bar graphs are a better idea because in a bar graph all of the bars are the same width, which means when a person is reading it, they only have to compare the height of the different bars. However, in a pictogram, both the heights and the widths of the picture are different for each category, which makes it difficult for people to see the true difference between the categories.
What are some of the disadvantages of using a pie chart? What about pie charts makes them hard for people to visually read?
Pie charts are hard to draw by hand, have no natural way to order the categories, and make it hard to compare the sizes of different categories. It is harder for us to visually compare angles (which is how pie charts are drawn) than lengths, so pie charts are hard to read. One way to make a pie chart easier to read is to add the percentage falling into each category next to the wedge representing that category.
What are pie charts used to show? What do the wedges within the pie chart represent?
Pie charts show how a whole is divided into parts. Wedges within the circle represent the parts, with the angle spanned by each wedge in proportion to the size of that part.
Many times we use a regression line for prediction. What is prediction based on? When does prediction work best? What is extrapolation and why should we be careful about extrapolating? Describe an example where extrapolation was a bad idea.
Prediction is based on fitting some model to a set of data: this could be a straight line or something more elaborate. Prediction works best when the model fits the data closely; when the data closely follow the regression line, our model is more trustworthy. Extrapolation is when you use a model to predict a y value for an x value that is outside of your original range. We should be careful about extrapolating because there could be different patterns going on outside of the ranges we have data for. For example, if we have data on the heights of children, we would see a straight-line relationship between a child's age and their height: as they get older, they get taller. We shouldn't use this to predict the heights of adults, because people don't keep growing at a steady rate forever: at some point their growth levels off.
What does it mean if something is random?
Random is a word to describe events that are unpredictable in the short run, but have a pattern in the long run. We call a phenomenon random if individual outcomes are uncertain but there is nonetheless a regular distribution of outcomes in a large number of repetitions.
What study design is the best for establishing causation? What can we do if this study type is not ethical or feasible?
Randomized comparative experiments are the best for establishing causation. If a randomized comparative experiment is not ethical or feasible, we have some criteria we can use to try to establish causation. These include a strong association, a consistent association, a dose response (higher doses are associated with stronger responses), the cause preceding the effect (the cause happens before the effect), and a plausible cause (there is some biological or scientific reason that it makes sense).
What type of error do we commonly see in data tables?
Roundoff errors: when each entry is rounded, the entries may not add up exactly to the total, which is calculated separately.
What is seasonal variation? What is seasonal adjustment? What is an example of a time when you would need to use seasonal adjustment?
Seasonal variation is a pattern that repeats itself at known regular intervals of time. Seasonal adjustment is when the expected seasonal variation is removed before the data are published. Examples where you would need to use seasonal adjustment are graphing unemployment rates or prices of gasoline over time.
Emma flips a coin 5 times and gets all heads. Is she more likely to get heads or tails on her sixth coin toss?
She is equally likely to get heads and tails on the sixth coin toss.
The correlation between the average SAT Mathematics score in the states and the proportion of high school seniors who take the SAT is r = −0.843. The correlation is negative. What does that tell us? How well does proportion taking predict average score? (Use r² in your answer.)
Since the correlation is negative, we know that the average SAT Math score and the proportion of high schoolers taking the SAT are negatively associated. As the proportion of high schoolers taking the SAT increases, the average score decreases. We know that r² is 0.7106, so 71.06% of the variation in the SAT Math scores can be explained by the least-squares regression line. This is a fairly high number, so we know that the proportion taking the exam is a good predictor.
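The arithmetic behind the 0.7106 figure is just squaring the correlation given in the question. A quick sketch:

```python
r = -0.843           # correlation given in the question
r_squared = r ** 2   # squaring removes the sign, so r² is never negative

print(round(r_squared, 4))  # 0.7106
```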
How do you interpret a stemplot?
Stemplots are simply histograms rotated by 90 degrees. You can interpret a stemplot the same way you interpret a histogram: look for patterns, outliers, and describe the center, shape, and spread.
What is the 68-95-99.7 rule?
The 68-95-99.7 rule states that in any normal distribution, approximately 68% of the observations fall within one standard deviation of the mean, approximately 95% of the observations fall within two standard deviations of the mean, and approximately 99.7% of the observations fall within three standard deviations of the mean.
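The three percentages can be checked numerically with Python's standard-library `statistics.NormalDist`. The sketch below uses a standard normal distribution (mean 0, standard deviation 1) as an assumption; the helper name `share_within` is made up for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k) * 100, 1))
```

The printed percentages should come out close to 68, 95, and 99.7, which is why the rule is stated as an approximation.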
Is the CPI a true fixed market basket price index? Who is included in the CPI? How are the items in the market basket and their prices determined?
The CPI is not a true fixed market basket price index, because the market basket is not fixed. The most commonly used CPI is the CPI for all urban consumers, which includes all urban consumers. The items in the market basket and their prices are determined by sample surveys of many households and prices around the country.
What is the CPI? How do we interpret the CPI? For example, if the CPI (1982 = 100) for 2011 is 224.9, what does that mean?
The CPI is the Consumer Price Index. It is an index number for the cost of everything that American consumers buy. If the CPI (1982 = 100) for 2011 is 224.9, that means that for goods and services that cost $100 in 1982, we would have to pay $224.90 in 2011.
What does the Law of Large Numbers say?
The Law of Large Numbers says that in a large number of independent repetitions of a random phenomenon, averages or proportions are likely to become more stable as the number of trials increases.
What does the area under a density curve represent?
The area under the density curve represents proportions of the total number of observations.
Pie charts and bar charts show us the distribution of a categorical variable. What charts can we use to show the distribution of numeric variables?
The chart we use to show the distribution of numeric variables is a histogram.
How do you choose the classes for a stemplot?
The classes of a stemplot are the stems. You don't choose the classes, they are given to you. However, you can adjust the stems slightly by rounding the data differently. This is typically done when the data have too many digits.
The correlation between IQ score and school GPA is r = 0.634. The correlation between wine consumption and heart disease is r = −0.645. Which of these two correlations indicates a stronger straight-line relationship? Explain your answer.
The correlation between wine consumption and heart disease is a stronger straight-line relationship because its value is further from zero.
What two numbers does r always fall between?
The correlation, r, is always between -1 and 1.
What is the distribution of a variable?
The distribution of a variable tells us what values it takes and how often it takes these values (for an example, see Table 10.1 on page 214).
What is expected value? How is it calculated?
The expected value is a "long run average." It is an average of the possible outcomes in which each outcome is weighted by its probability. The expected value of a random phenomenon that has numerical outcomes is found by multiplying each outcome by its probability and then adding all of the products.
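As an illustration of the weighted average described above, the sketch below computes the expected value of one roll of a fair six-sided die. The fair die is an assumed example, and the function name is hypothetical.

```python
from fractions import Fraction

def expected_value(outcomes_with_probabilities):
    """Multiply each outcome by its probability, then add all of the products."""
    return sum(outcome * probability for outcome, probability in outcomes_with_probabilities)

# Assumed example: a fair six-sided die, each face with probability 1/6.
fair_die = [(face, Fraction(1, 6)) for face in range(1, 7)]
print(expected_value(fair_die))  # 7/2
```

The result, 3.5, is not a possible roll; it is the long-run average over many rolls.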
What variable goes on the x-axis (horizontal)? What variable goes on the y-axis (vertical)?
The explanatory variable goes on the x-axis and the response variable goes on the y-axis.
What are the first and third quartiles? How do we calculate them?
The first and third quartiles are the midpoints of each half. They divide the data in quarters. Like when finding the median, you start by arranging the observations in order from smallest to largest. The first quartile is the median of the observations that are to the left of the overall median. The overall median is not included in these numbers. The third quartile is the median of the observations that are to the right of the overall median. Again, the overall median is not included in these numbers.
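The rule above (medians of the lower and upper halves, leaving out the overall median) can be sketched as follows. Note that software packages often use slightly different quartile rules, so this hypothetical `quartiles` function follows the description in this answer rather than any particular library. It assumes at least two observations.

```python
from statistics import median

def quartiles(data):
    """Return (first quartile, median, third quartile) using the halves-of-the-data rule."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]  # observations to the left of the overall median
    # With an odd count, skip the middle observation (the overall median itself).
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(ordered), median(upper)

print(quartiles([1, 2, 3, 4, 5, 6, 7]))  # (2, 4, 6)
print(quartiles([1, 2, 3, 4, 5, 6]))     # (2, 3.5, 5)
```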
What are the four probability rules? Which two of these rules must we check to make sure we have a valid probability distribution?
The first probability rule is that any probability is a number between 0 and 1. The second probability rule is that all possible outcomes together must have a probability of 1. The third probability rule is that the probability that an event does not occur is 1 minus the probability that the event does occur. The fourth probability rule is that if two events have no outcomes in common, the probability that one or the other occurs is the sum of their individual probabilities. In order to check that we have a valid probability distribution, we must check the first two rules (all probabilities between 0 and 1 and that the sum of all the probabilities is 1).
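The two validity checks named at the end of this answer can be sketched as a small function. The function name is hypothetical, and `math.isclose` is used so that tiny floating-point rounding does not cause a false failure.

```python
import math

def is_valid_distribution(probabilities):
    """Check rule 1 (each probability is between 0 and 1) and rule 2 (they sum to 1)."""
    each_in_range = all(0 <= p <= 1 for p in probabilities)
    sums_to_one = math.isclose(sum(probabilities), 1.0)
    return each_in_range and sums_to_one

print(is_valid_distribution([0.50, 0.43, 0.07]))  # True
print(is_valid_distribution([0.50, 0.60]))        # False: the sum is more than 1
print(is_valid_distribution([1.20, -0.20]))       # False: values fall outside 0 to 1
```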
What are the three steps to doing a simulation?
The first step is to give a probability model. The second step is to assign digits to represent outcomes. The third step is to simulate many repetitions.
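The three steps can be sketched in code. The example below is an assumed one, not taken from these notes: it estimates the probability of seeing at least one run of three heads in a row in 10 tosses of a fair coin. The function names, the digit assignment (odd = heads, even = tails), and the seed are all illustrative choices.

```python
import random

def has_run_of_heads(tosses, run_length=3):
    """True if the tosses contain at least run_length heads in a row."""
    streak = 0
    for toss in tosses:
        streak = streak + 1 if toss == "H" else 0
        if streak >= run_length:
            return True
    return False

def simulate_run_probability(repetitions=10_000, num_tosses=10, seed=1):
    # Step 1: probability model. Each toss is heads or tails with probability 0.5, independently.
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    hits = 0
    for _ in range(repetitions):
        # Step 2: assign digits. One random digit 0-9 per toss; odd means heads, even means tails.
        digits = [rng.randrange(10) for _ in range(num_tosses)]
        tosses = ["H" if digit % 2 == 1 else "T" for digit in digits]
        hits += has_run_of_heads(tosses)
    # Step 3: simulate many repetitions and report the proportion that contained a run.
    return hits / repetitions

print(simulate_run_probability())
```

With thousands of repetitions the estimate should settle near 0.51 (the exact value for this example is 520/1024); with only a handful of repetitions it would bounce around much more.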
When was the first time randomness was studied? When did we begin studying probability theory?
The first time randomness was studied was in the 17th century when gamblers in France wanted to know how they should bet. We began studying probability theory in the 17th century as well.
What numbers make up the five-number summary? What is the graphical representation of the five-number summary called?
The five number summary is made up of the minimum, first quartile, median, third quartile, and maximum. We graphically represent this with a boxplot.
If we convert dollars from one year into dollars for another year, that allows us to adjust for changes in buying power. What is the formula we use to do this?
The formula we use for this is dollars at time B = dollars at time A * (CPI at time B / CPI at time A).
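The conversion formula translates directly into code. In the sketch below the function name is hypothetical, and the CPI values are the ones quoted elsewhere in this set (100 for the 1982 base period and 224.9 for 2011).

```python
def convert_dollars(amount_at_a, cpi_at_a, cpi_at_b):
    """Dollars at time B = dollars at time A * (CPI at time B / CPI at time A)."""
    return amount_at_a * (cpi_at_b / cpi_at_a)

# $100 in the base period (CPI = 100) expressed in 2011 dollars (CPI = 224.9).
print(round(convert_dollars(100, 100, 224.9), 2))  # 224.9
```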
What is the idea behind standard deviation? How do you calculate standard deviation?
The idea of the standard deviation is to give the average distance of the observations from the mean. It is calculated by taking the square root of the variance. That means you find the distance of each observation from the mean, square each of these distances, add up all of the squared distances, divide by n-1, and take the square root.
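The steps in this answer can be followed one at a time in code. This is a sketch with a hypothetical function name; it should agree with `statistics.stdev` from the standard library, which also divides by n-1.

```python
import math

def standard_deviation(observations):
    """Sample standard deviation, following the steps: distances, squares, sum, divide by n - 1, root."""
    n = len(observations)
    mean = sum(observations) / n
    squared_distances = [(x - mean) ** 2 for x in observations]
    variance = sum(squared_distances) / (n - 1)
    return math.sqrt(variance)

print(standard_deviation([4, 4, 4, 4]))            # 0.0: no spread at all
print(round(standard_deviation([9, 8, 7.5]), 3))   # 0.764
```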
What are some of the advantages and disadvantages of a stemplot?
The main advantage of a stemplot is that you don't lose the data: you still have access to each exact data value. They are also faster to draw than histograms and easier to make because we don't have to make a choice about the classes. The main disadvantage is that it doesn't work well with large data sets because then there are too many leaves on each stem.
What is the mean? How do you calculate the mean?
The mean is the arithmetic average of a set of observations. You calculate it by adding up all the values and dividing by n.
What measurement do we use to determine how useful our regression line is for prediction?
The measurement we use to determine how useful our regression line is for prediction is r².
Using a density curve, how do we find the median and the quartiles?
The median is the point with half of the observations on either side. In a density curve, this is the point where half of the area lies to the right of it and half of the area lies to the left of it. The quartiles are found by determining the points that divide the area under the curve into quarters. The first quartile is the point where 25% of the area is to the left of it and the third quartile is the point where 75% of the area is to the left of it.
What is the median? How do we calculate it?
The median represents the midpoint of the data: it is the point that divides the data in half because half of the data points are below the median and half of the data points are above the median. To find the median, first arrange all observations in order of size from smallest to largest. After they are all in order, find the midpoint of the data values. If there are an odd number of observations, the median is the center observation (the (n+1)/2 observation). If there are an even number of observations, the median is the average of the two center observations.
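The procedure above handles odd and even counts differently, which a short sketch makes concrete. The function name `find_median` is made up for illustration.

```python
def find_median(observations):
    """Median: the center value (odd count) or the average of the two center values (even count)."""
    ordered = sorted(observations)  # arrange from smallest to largest
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]  # the (n + 1)/2-th observation, counting from 1
    return (ordered[middle - 1] + ordered[middle]) / 2

print(find_median([3, 1, 2]))     # 2
print(find_median([4, 1, 3, 2]))  # 2.5
```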
What is the most common numerical way to describe a distribution?
The most common numerical way to describe a distribution is the combination of the mean and the standard deviation.
What is the most common type of regression line? How is this line drawn?
The most common regression line is the least-squares regression line. This line is drawn by minimizing the sum of the squared vertical distances from the line to the actual observed values.
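For a straight line written as y = a + bx, the intercept a and slope b that minimize the sum of squared vertical distances have a well-known closed form. The sketch below uses that standard formula; the function name and the sample data are hypothetical.

```python
from statistics import mean

def least_squares_line(xs, ys):
    """Return (a, b) for the line y = a + b*x that minimizes the squared vertical distances."""
    x_bar, y_bar = mean(xs), mean(ys)
    sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sum_xx = sum((x - x_bar) ** 2 for x in xs)
    b = sum_xy / sum_xx      # slope
    a = y_bar - b * x_bar    # intercept: the line passes through (x_bar, y_bar)
    return a, b

# Hypothetical data that lie exactly on y = 1 + 2x.
a, b = least_squares_line([1, 2, 3], [3, 5, 7])
print(a, b)
prediction = a + b * 4  # predicted y for x = 4
```

Predicting at x = 4 here is a small extrapolation, since the sample x values only run from 1 to 3.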
What is a percentile? What percentile is the median? What percentile is the first quartile? What percentile is the third quartile?
The cth percentile of a distribution is a value such that c percent of the observations lie below it and the rest lie above. The median is the 50th percentile. The first quartile is the 25th percentile. The third quartile is the 75th percentile.
What is the probability of an outcome happening?
The probability of any outcome of a random phenomenon is a number between 0 and 1 that describes the proportion of times the outcome would occur in a very long series of repetitions.
What is the purpose of making graphs?
The purpose of making a graph is to help us understand the data.
What is a sampling distribution? How do we generally describe a sampling distribution?
The sampling distribution of a statistic tells us what values the statistic takes in repeated samples from the same population and how often it takes those values. We generally describe a sampling distribution using a density curve (such as a normal curve).
How do we interpret the slope and the intercept of a regression equation?
The slope of the line is the amount by which y changes when x is increased by one unit. The intercept is the value of y when x = 0.
What is r²? How do we calculate r²?
The square of the correlation, or r², is the proportion of the variation in the values of y that is explained by the least-squares regression of y on x. It is calculated by squaring r, the correlation.
What does standard deviation measure? You should only use standard deviation when you use what to measure the center of a distribution?
The standard deviation measures spread around the mean. You should only use the standard deviation when you use the mean to measure the center of a distribution.
What is the standard form of a regression line? What do the values in this equation stand for?
The standard form or equation of a regression line is y = a + bx. In this equation, a is the intercept of the line and b is the slope of the line.
What are three possible explanations for an association between two variables (see slide 15)?
The three possible explanations for an association between two variables are causation, common response, and confounding.
What are the three principles for making good graphs?
The three principles for making good graphs are making sure the graph has labels and legends (tell what variables are plotted, their units, and the source of the data), making sure the data stand out (do not use unnecessary grids or background art and make sure the placement of the labels doesn't interfere with reading the data), and paying attention to what people will see when they read the graph (be careful with scales and don't use pictograms or 3D effects that will confuse the reader).
What three things do we use to describe the overall pattern of a graph/distribution?
The three things we look at when describing the overall pattern of a graph/distribution are the center, spread, and shape.
What are the 4 steps for exploring data with a single, quantitative variable?
The four steps for exploring data with a single, quantitative variable are: 1) plot your data; 2) look for the overall pattern and striking deviations; 3) choose a numeric summary (five-number summary or mean and standard deviation) to describe the data; and 4) describe the overall pattern with a smooth curve.
Three students take a quiz. Their scores are 9, 8, and 7.5. What are each of their standard scores?
Their standard scores are 1.1, -0.2, and -0.9, respectively.
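These values can be reproduced with the standard score formula, (observation - mean)/standard deviation, using the mean and standard deviation of the three quiz scores themselves. A short sketch using the standard-library `statistics` module:

```python
from statistics import mean, stdev

scores = [9, 8, 7.5]
average = mean(scores)   # about 8.17
spread = stdev(scores)   # sample standard deviation (divides by n - 1), about 0.76

standard_scores = [round((score - average) / spread, 1) for score in scores]
print(standard_scores)  # [1.1, -0.2, -0.9]
```

The student who scored 9 is about one standard deviation above the mean, while the other two are below it.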
What are some ways we can determine if two events are independent?
There are many methods for us to look at two events to try to determine whether or not they are independent. We can use the definition of independence and see if knowing the outcome of one event tells us anything about the outcome of the other. If we are looking at two events with numerical outcomes, we can look at a scatterplot of the outcomes or see what the correlation is between them.
Why do some people worry about risks that almost never occur, but ignore other risks that are much more plausible? Give an example of this occurring.
There are many reasons why some people worry about risks that almost never occur, but ignore more plausible risks. One reason is that we feel safer about risks we can control. Another is that humans are bad at comprehending small probabilities, so we tend to overestimate small risks and underestimate larger risks. Another reason is that sometimes these probabilities are determined from complex studies, which people find harder to trust. An example of this is that very few people would leave a sleeping infant home alone for ten minutes while they went to run errands, even though the risk of a car crash is higher than the risks the child would face sleeping at home.
Why does the government have less complete data on social statistics?
Government statistics on social issues are less complete because citizens are not always comfortable with their government collecting this sort of information about them.
How many classes should a histogram have? What happens if there are too many or too few classes?
There is no one correct way to determine how many classes a histogram should have. A good general rule is to use between 10 and 20 classes. Too few classes will give a "skyscraper" histogram, with all the values in a few classes with tall bars. Too many classes will give a "pancake" histogram, with most classes having one or no observations.
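The effect of the number of classes can be seen by counting observations per class directly. The sketch below is only an illustration: the 20 data values are invented, and `class_counts` is a hypothetical helper that splits the range of the data into equal-width classes.

```python
def class_counts(values, num_classes):
    """Count how many values fall in each of num_classes equal-width classes."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_classes
    counts = [0] * num_classes
    for v in values:
        # The largest value is placed in the last class rather than past it.
        i = min(int((v - lo) / width), num_classes - 1)
        counts[i] += 1
    return counts

# Invented data for illustration only.
data = [52, 55, 58, 61, 63, 64, 66, 67, 68, 70,
        71, 72, 73, 75, 77, 79, 82, 85, 88, 94]

too_few = class_counts(data, 2)    # "skyscraper": a couple of tall bars
too_many = class_counts(data, 40)  # "pancake": mostly empty or single-value classes
print(too_few)
print(too_many)
```

With two classes every observation lands in one of two tall bars, while with 40 classes most classes hold one observation or none, so neither picture shows the shape of the distribution.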
Suppose you had data regarding the GPAs of all the students at Texas A&M University. Would this data be better displayed as a histogram or a stem-and-leaf plot?
This data would be better displayed as a histogram. There would be too many data points for a stem and leaf plot.
Stem plot
This stemplot has all stems in the range of the data, has leaves in ascending order, and repeated leaves to represent repeated data values.
How do you make a stemplot? In a stemplot, what is the stem and what is the leaf?
To make a stemplot, you must separate each observation into a stem and a leaf. Then you write the stems in a vertical column with the smallest at the top and draw a vertical line at the right of this column. Finally, you write each leaf in the row to the right of its stem, in increasing order out from the stem. The stem consists of all but the final (rightmost) digit, and the leaf is the final digit. Stems may have as many digits as needed, but each leaf contains only a single digit.
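The steps above can be sketched in a few lines. This is a minimal illustration, assuming non-negative whole-number data; the function name `stemplot` and the sample values are invented here.

```python
from collections import defaultdict

def stemplot(values):
    """Return stemplot rows: stem = all but the last digit, leaf = last digit."""
    leaves = defaultdict(list)
    for v in sorted(values):            # sorting puts leaves in increasing order
        leaves[v // 10].append(v % 10)  # split each value into stem and leaf
    rows = []
    # Include every stem in the range of the data, even those with no leaves.
    for stem in range(min(leaves), max(leaves) + 1):
        row = "".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(f"{stem} | {row}".rstrip())
    return rows

data = [12, 15, 15, 23, 41, 47, 47, 52]  # invented values
for row in stemplot(data):
    print(row)
# 1 | 255
# 2 | 3
# 3 |
# 4 | 177
# 5 | 2
```

Repeated values such as 15 and 47 appear as repeated leaves, and the empty stem 3 is still written so that gaps in the data stay visible.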
What does it mean if two events are independent?
Two events are independent if knowing the outcome of one does not change the probabilities for the outcomes of the other.
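One way to see this definition at work is to compare a conditional probability with the unconditional one. The sketch below assumes two tosses of a fair coin with four equally likely outcomes; the event names and helper functions are made up for the illustration.

```python
from fractions import Fraction
from itertools import product

# Assumption: two tosses of a fair coin, all four outcomes equally likely.
outcomes = list(product("HT", repeat=2))

def prob(event):
    """Probability of an event, given as a function outcome -> bool."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def cond_prob(event, given):
    """Probability of `event` once we know `given` happened."""
    return prob(lambda o: event(o) and given(o)) / prob(given)

def first_heads(o):
    return o[0] == "H"

def second_heads(o):
    return o[1] == "H"

def at_least_one_head(o):
    return "H" in o

# Knowing the first toss does not change the probability for the second toss.
print(prob(second_heads), cond_prob(second_heads, first_heads))            # 1/2 1/2
# Knowing the first toss does change the probability of at least one head.
print(prob(at_least_one_head), cond_prob(at_least_one_head, first_heads))  # 3/4 1
```

The first pair of events is independent because the two probabilities match; the second pair is not, because learning that the first toss was heads raises the probability from 3/4 to 1.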
We can completely describe a normal distribution using what two things?
We can completely describe a normal distribution with the mean and the standard deviation.
How do we use a regression equation to predict a new value?
We can use a regression line to predict a new value of y by substituting the x-value into the equation and solving for y.
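As a small illustration of the substitution, suppose the regression equation were y = 1.5 + 0.8x. This equation is invented for the example and does not come from any data in these notes.

```python
def predict(intercept, slope, x):
    """Plug x into the regression equation y = intercept + slope * x."""
    return intercept + slope * x

# Hypothetical regression line: y = 1.5 + 0.8x
predicted_y = predict(intercept=1.5, slope=0.8, x=10)
print(predicted_y)  # 9.5
```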
How do we define the center of a distribution?
We define the center of the distribution as its midpoint: the point in the graph where roughly half of the observations are smaller and roughly half are larger.
How can we describe the direction of a scatter plot?
We describe the direction of a scatterplot by saying the variables are positively associated, negatively associated, or show no association.
What three things do we use to describe the overall pattern of a scatter plot?
We describe the overall pattern of a scatterplot by describing its form, direction, and strength.
When looking at the spread of a distribution, what do we do about outliers?
We do not include outliers when looking at the spread of the distribution. We define the spread of a distribution by giving the smallest and largest values, ignoring any outliers.
What graph do we use to show the relationship between two quantitative variables? How do you make this graph? In this graph, what does each point represent?
We use a scatterplot to show the relationship between two quantitative variables. To make this graph, values of one variable are plotted on the horizontal axis and values of the other variable are plotted on the vertical axis. Each point on a scatterplot represents one individual in the data and their observed values for both of the variables.
When creating a line graph of how price of gas changes over time, what would you plot on the X (horizontal) and Y (vertical) axes?
When creating a line graph, you put time on the X (horizontal) axis and the variable you are measuring on the Y (vertical) axis. Here, time goes on the X axis and the price of gas goes on the Y axis.
When choosing how to numerically describe a distribution, what is the first thing you should do?
When deciding how to numerically describe a distribution, the first thing you should do is start with a graph of your data.
What three words do we use to describe the shape of a distribution? What do these three words mean?
When describing the shape of a distribution, we say it is either symmetric/roughly symmetric/approximately symmetric, skewed to the right, or skewed to the left. Symmetric means the right and left sides of the histogram are approximately mirror images of each other. Skewed to the right means that the right side of the histogram extends out much further than the left (sometimes we call this a tail to the right). Skewed to the left means that the left side of the histogram extends out much further than the right (tail to the left).
What two things do we look for when looking at a scatter plot?
When looking at a scatterplot, we look for the overall pattern and any deviations from that pattern, which may be outliers.
What things should we look for when making graphs of data?
When making graphs we should look for overall patterns and any deviations from those patterns, which could be a sign of an outlier.
What three things do we look for when studying line graphs?
When studying a line graph, we look for overall patterns/trends, deviations from that pattern, and seasonal variation.
What happens to r when we change the unit of measurement?
When we change the units of measurement, r does not change.
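This can be checked numerically. The sketch below computes r from its usual definition for some invented height and weight values, then again after converting inches to centimeters and pounds to kilograms; the `correlation` helper and the data are assumptions made for the illustration.

```python
from math import sqrt

def correlation(xs, ys):
    """Correlation r between two lists of equal length."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Invented data: heights in inches, weights in pounds.
heights_in = [60, 62, 65, 68, 70, 72]
weights_lb = [115, 120, 140, 155, 160, 180]

# The same individuals measured in centimeters and kilograms.
heights_cm = [h * 2.54 for h in heights_in]
weights_kg = [w * 0.4536 for w in weights_lb]

r_original = correlation(heights_in, weights_lb)
r_converted = correlation(heights_cm, weights_kg)
print(abs(r_original - r_converted) < 1e-9)  # True
```

The conversion factors cancel in the ratio that defines r, which is why the two values agree up to rounding error.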
When we describe chance behavior, what two things do we need to include in our description?
When we describe chance behavior, we need to include a list of possible outcomes and a probability for each outcome.
When we see an outlier, what should we do?
When we see an outlier, we should look for an explanation. Is there some reason we would expect this value to be an outlier? Is there a chance it is a mistake? Once we determine it is an outlier, we should make note that it exists, but not include it when discussing the overall pattern of the data.
Is correlation affected by outliers?
Yes, correlation is strongly affected by outliers.
Are regression lines affected by outliers?
Yes, regression lines are strongly affected by outliers.
Does regression take into account what you pick to be the explanatory and the response variable?
Yes, regression takes into account which variable is the explanatory variable and which is the response variable.
You decide to study the average temperature in Chicago each month for many years. Do you expect a line graph of the data to show seasonal variation? Describe the kind of seasonal variation you expect to see.
Yes, you would expect a line graph to show seasonal variation. You would expect the average temperatures to be lowest in the winter and highest in the summer, so you expect average temperatures to increase during the first half of the year and decrease during the second half of the year.
When should you use the mean/standard deviation to describe a distribution? When should you use the five-number summary to describe a distribution?
You should use the mean/standard deviation to describe a distribution when the distribution is reasonably symmetric and there are no outliers. You should use the five-number summary to describe a distribution when the distribution is skewed or has outliers.
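The reason for this advice shows up quickly with a small example. The sketch below uses invented values with one large outlier, and finds the quartiles as the medians of the lower and upper halves of the sorted data (one common textbook convention, assumed here); `five_number_summary` is a hypothetical helper that needs at least two values.

```python
from statistics import mean, median, stdev

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max); quartiles are medians of each half."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # For an odd count, the middle observation is left out of both halves.
    upper = data[half + 1:] if n % 2 else data[half:]
    return (data[0], median(lower), median(data), median(upper), data[-1])

# Invented data with one large outlier (250).
data = [20, 22, 25, 27, 30, 31, 250]

summary = five_number_summary(data)
print(summary)                   # (20, 22, 27, 31, 250)
print(mean(data), stdev(data))   # both pulled far upward by the outlier
```

The median (27) and quartiles barely notice the outlier, while the mean (about 57.9) lands above six of the seven observations, so the five-number summary describes this skewed data more faithfully.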
Square of Correlation
is the proportion of the variation in the values of y that is explained by the least-squares regression of y on x.
The least-squares line of y on x
predicts the value of y based on the value of x.
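The two definitions above can be tied together numerically. The sketch below fits a least-squares line to invented data, then checks that the fraction of the variation in y accounted for by the line equals the square of the correlation; the `least_squares` helper and the data values are assumptions made for this illustration.

```python
from math import sqrt

def least_squares(xs, ys):
    """Return (intercept, slope, r) for the least-squares line of y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / sqrt(sxx * syy)
    return intercept, slope, r

# Invented data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope, r = least_squares(xs, ys)
predictions = [intercept + slope * x for x in xs]

mean_y = sum(ys) / len(ys)
total_variation = sum((y - mean_y) ** 2 for y in ys)
unexplained = sum((y - p) ** 2 for y, p in zip(ys, predictions))
explained_fraction = 1 - unexplained / total_variation

print(abs(explained_fraction - r ** 2) < 1e-9)  # True
```

Here `unexplained` is the variation left over around the line, so `explained_fraction` is the share of the variation in y that the regression of y on x accounts for.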
