STATS Exam 2

When making classes for histograms, we need to make sure they are exclusive and exhaustive. What does this mean?

*Exclusive* means that there should be *no overlap* between groups (one individual can't be placed into multiple groups). *Exhaustive* means that there is a *place for every data point*; every individual falls into a group.

What are some of the disadvantages of using a pie chart? What about pie charts makes them hard for people to visually read?

Pie charts are *hard to draw by hand*, have no natural way to *order* the categories, and make it *hard to compare the sizes* of different categories. It is harder for us to visually *compare angles* than lengths, which is what makes pie charts hard to read. They can be made easier to read by adding the percentage falling into each category next to its wedge.

What three words do we use to describe the shape of a distribution? What do these three words mean?

*Symmetric*: right and left sides of the histogram are approximately mirror images of each other. *Skewed to the right*: the right side of the histogram extends out much further than the left (sometimes we call this a tail to the right). *Skewed to the left*: left side of the histogram extends out much further than the right (tail to the left).

What two numbers does r always fall between?

-1 and 1

What three things are necessary for a clear data table?

-A main heading giving the subject and the date of the data -Labels within the table to identify the variables and the units they are measured in -The source of the data.

What three things do we look for when studying line graphs?

-Overall patterns/trends -Deviations from that pattern -Seasonal variation.

How do you make a stemplot? In a stemplot, what is the stem and what is the leaf?

-Separate each observation into a stem and a leaf. -Write the stems in a vertical column with the smallest at the top and draw a vertical line at the right of this column. -Finally, you write each leaf in the row to the right of its stem, in increasing order out from the stem. The stem consists of all but the final (rightmost digit) and the leaf is the final digit. *Stems may have as many digits as needed*, but each leaf contains only a *single digit*.
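
The steps above can be sketched in code. This is a minimal illustration, assuming the observations are non-negative whole numbers so that the leaf is simply the last digit; the `stemplot` function and the sample data are hypothetical and not part of the course material.

```python
from collections import defaultdict

def stemplot(values):
    """Return stemplot rows: stem = all but the final digit, leaf = final digit.

    Assumes non-negative whole numbers (an assumption of this sketch).
    """
    rows = defaultdict(list)
    for v in sorted(values):          # sorting puts leaves in increasing order
        stem, leaf = divmod(v, 10)    # split off the final digit
        rows[stem].append(leaf)
    lines = []
    # List every stem from smallest to largest, even those with no leaves
    for stem in range(min(rows), max(rows) + 1):
        leaves = "".join(str(leaf) for leaf in rows.get(stem, []))
        lines.append(f"{stem} | {leaves}")
    return lines

# Hypothetical data, for illustration only
scores = [42, 47, 51, 53, 53, 68, 70]
for line in stemplot(scores):
    print(line)
```

Stems with no observations are still listed, so gaps in the data stay visible.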

What are the three principles for making good graphs?

-The graph has *labels and legends* (tell what variables are plotted, units, and source) -*Data stands out* (no unnecessary grids or background art and make sure placement of labels doesn't interfere with data) -Paying attention to *what people will see* when they read the graph (be careful with scales and don't use pictograms or 3D effects)

What are the 4 steps for exploring data with a single, quantitative variable?

1) plot your data; 2) look for the overall pattern and any striking deviations; 3) choose a numeric summary (five-number summary or mean/standard deviation) to describe the data; and 4) describe the overall pattern with a smooth curve.

Find the mean and the standard deviation of the following sets of numbers: 1. 4, 3, 8, 10 2. 1, 1, 1, 2, 1 3. 12, 13, 12, 19 4. 4, 8, 12

1. 4, 3, 8, 10 mean = 6.25, standard deviation = 3.304 2. 1, 1, 1, 2, 1 mean = 1.2, standard deviation = 0.447 3. 12, 13, 12, 19 mean = 14, standard deviation = 3.367 4. 4, 8, 12 mean = 8, standard deviation = 4
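
One way to check these answers is with Python's standard `statistics` module. This is only a sketch; it assumes the sample standard deviation (dividing by n-1), which is the definition used elsewhere in this set.

```python
from statistics import mean, stdev

# The four data sets from the question
data_sets = [
    [4, 3, 8, 10],
    [1, 1, 1, 2, 1],
    [12, 13, 12, 19],
    [4, 8, 12],
]

# stdev() divides by n-1 (sample standard deviation)
summaries = [(mean(d), round(stdev(d), 3)) for d in data_sets]

for data, (m, s) in zip(data_sets, summaries):
    print(data, "mean =", m, "standard deviation =", s)
```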

The test scores in a college class are approximately normally distributed with a mean of 75 and a standard deviation of 7. Using the 68-95-99.7 rule:

1. 68% of scores fall between what 2 numbers? *68 and 82*. 2. 95% of the scores fall between what 2 numbers? *61 and 89*. 3. 99.7% of the scores fall between what 2 numbers? *54 and 96*. 4. The highest 2.5% of scores fall above what number? The highest 0.15% of scores fall above what number? *The highest 2.5% of scores fall above 89. The highest 0.15% of scores fall above 96*. 5. The lowest 2.5% of scores fall below what number? The lowest 0.15% of scores fall below what number? *The lowest 2.5% of scores fall below 61. The lowest 0.15% of scores fall below 54*.
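
The cutoffs above come from stepping one, two, and three standard deviations away from the mean. A minimal sketch of that arithmetic follows; the function name `rule_intervals` is made up for illustration.

```python
def rule_intervals(mean, sd):
    """Map each percentage in the 68-95-99.7 rule to its (low, high) interval."""
    return {
        pct: (mean - k * sd, mean + k * sd)
        for k, pct in [(1, 68), (2, 95), (3, 99.7)]
    }

# Test scores: mean 75, standard deviation 7
intervals = rule_intervals(75, 7)
for pct, (low, high) in intervals.items():
    print(f"{pct}% of scores fall between {low} and {high}")
```

Because the normal curve is symmetric, the percentage left outside each interval is split evenly between the two tails, which is where the 2.5% and 0.15% figures come from.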

In each of the following situations, determine whether the mean is smaller than the median, the mean is greater than the median, or the mean is the same as the median:

1. The distribution is symmetric. The mean is the same as the median. 2. The distribution is left skewed. The mean is less than the median. 3. The distribution is right skewed. The mean is greater than the median.

Three students take a quiz. Their scores are 9, 8, and 7.5. What are each of their standard scores?

1.1, -0.2, and -0.9

What is a regression line? What does a regression line do/accomplish?

A regression line is a line that summarizes the straight-line relationship between two variables. It describes how a response variable, y, changes as an explanatory variable, x, changes. It can be used to predict the value of y given a value of x.

How do we draw a boxplot?

A center box spans the quartiles. A line drawn across this box marks the median. Lines extend from the box out to the smallest and largest observations (the minimum and the maximum).

What does a positive value of r indicate? What does a negative value of r indicate?

A positive value of r indicates a positive association. A negative value of r indicates a negative association.

What does a standard score do? What is another name for a standard score? How do we calculate a standard score?

A standard score expresses an observation in terms of the *number of standard deviations it is above or below the mean*. Standard scores are also called *z-scores*. We calculate a standard score by using this formula: *(observation - mean)/standard deviation*.
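
As a sketch of this formula, the code below computes standard scores for the quiz scores 9, 8, and 7.5 used in an earlier card. It assumes the mean and standard deviation are taken from those three scores themselves, which the card does not state explicitly.

```python
from statistics import mean, stdev

def standard_scores(observations):
    """Return (observation - mean) / standard deviation for each observation."""
    m = mean(observations)
    s = stdev(observations)   # sample standard deviation (divides by n-1)
    return [(x - m) / s for x in observations]

# Quiz scores from the earlier card; mean and sd come from these same scores
z = [round(score, 1) for score in standard_scores([9, 8, 7.5])]
print(z)
```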

What are some of the advantages and disadvantages of a stemplot?

Advantages: -You don't lose the data; you still have access to each exact data value. -Faster to draw than a histogram. -Easier to make because we don't have to make a choice about the classes. Disadvantage: -It doesn't work well with large data sets because there are too many leaves on each stem.

What does an index number do? How do we calculate an index number?

An index number measures the value of a variable relative to its value at a base period. It is calculated by dividing the value by the base value and then multiplying by 100.
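
A minimal sketch of that calculation follows. The function name and the sample values are hypothetical, chosen only to show the arithmetic.

```python
def index_number(value, base_value):
    """Value relative to the base period, scaled so the base period equals 100."""
    return value / base_value * 100

# Hypothetical values: 120 in the base period, 150 in the current period
base_index = index_number(120, 120)      # the base period itself
current_index = index_number(150, 120)   # 25% higher than the base period
print(base_index, current_index)
```

An index number of 125 means the value is 25% higher than it was in the base period.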

You collect data from 8 different countries about what percent of people in each country speak at least two languages fluently. What type of graph should you use?

Bar graph

What chart makes it easier for us to compare categories? What are some advantages of this type of chart?

Bar graphs. -Easy to draw -Natural way to order categories -Can visually compare different categories, even those not next to each other.

Scales

-Be aware of the scales on the axes. -It is better to plot the percentage increase than the actual increase.

A survey of college freshmen in 2007 asked what field they planned to study. 12.8% were arts and humanities majors, 17.7% were business majors, 9.2% were education majors, 19.3% were engineering, biological sciences, or physical sciences majors, 14.5% were professional majors, and 11.1% were social science majors. What type of chart (bar chart or pie chart) is appropriate to use for this data? What could we add to this data to make it appropriate to use either type of chart?

Because the categories given *do not include all possible categories*, it would not be appropriate to use a pie chart. Use a *bar chart.* If we added a category for other majors, then it would be appropriate to use either type of chart.

Bar charts and pie charts are most useful for what kind of variables?

Categorical variables

What three things do we use to describe the overall pattern of a graph/distribution?

Center, spread, and shape

Does correlation take into account the difference between the explanatory and the response variable? Will the correlation change if we switch the two?

Correlation does not take into account the difference between the explanatory and the response variable. It will not change if we switch the two.

How do you calculate correlation (r)?

Correlation is found by finding the standard scores for each of the x and y values, multiplying the pairs of standard scores together, adding up all these products, and then dividing by n-1. The formula describing this is: r = [1/(n-1)] × Σ [(x - mean of x)/(standard deviation of x)] × [(y - mean of y)/(standard deviation of y)].
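
The calculation can be sketched directly from that description. The data below are hypothetical, and the function name `correlation` is chosen only for illustration.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sum of products of paired standard scores, divided by n-1."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)   # sample standard deviations
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

# Hypothetical paired data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)
r_swapped = correlation(ys, xs)                      # switch x and y
r_rescaled = correlation([x * 100 for x in xs], ys)  # change the units of x
print(r, r_swapped, r_rescaled)
```

Switching the two variables or changing the unit of measurement leaves r unchanged, which matches the other cards in this set.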

What does correlation measure? Why do we use correlation to measure this?

Correlation measures the strength of the straight-line relationship between two variables. We use correlation to do this because it is very hard for us to judge how strong this relationship is just by looking at it.

What is the idea behind standard deviation? How do you calculate standard deviation?

It gives the average distance of the observations from the mean. It is calculated by taking the square root of the variance. That means that you find the distance each observation is from the mean, square each of these distances, add all of the squared distances up, divide by n-1, and take the square root.
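
These steps can be written out one at a time. The sketch below uses the data set 4, 8, 12 from an earlier card; the function name is hypothetical.

```python
from math import sqrt

def standard_deviation(values):
    """Follow the steps: distances from the mean, square, add, divide by n-1, square root."""
    n = len(values)
    mean = sum(values) / n
    distances = [x - mean for x in values]     # distance of each observation from the mean
    squared = [d ** 2 for d in distances]      # square each distance
    variance = sum(squared) / (n - 1)          # add them up and divide by n-1
    return sqrt(variance)                      # square root of the variance

sd = standard_deviation([4, 8, 12])
no_spread = standard_deviation([5, 5, 5])      # identical values give 0
print(sd, no_spread)
```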

Pie charts and bar charts show us the distribution of a categorical variable. What charts can we use to show the distribution of numeric variables?

HISTOGRAM

Suppose you had data regarding the GPAs of all the students at Texas A&M University. Would this data be better displayed as a histogram or a stem and leaf plot?

Histogram. There would be too many data points for a stem and leaf plot

What are pie charts used to show? What do the wedges within the pie chart represent?

How a *whole is divided into parts.* The wedges represent the *parts.* The size of each wedge represents what portion of the whole falls into that category.

What does it mean if the standard deviation is 0? What values can the standard deviation never be? What does it mean if one set of numbers has a larger standard deviation than another set of numbers?

If the standard deviation is 0, that means there is no spread and all the observations have the same value. The standard deviation can never be negative (it must be greater than or equal to 0). If one set of numbers has a larger standard deviation than another, that means its values are more spread out.

If we change the mean of a normal distribution, what happens? If we change the standard deviation of a normal distribution, what happens?

If we change the mean of a normal distribution we change its location. If we change the standard deviation of a normal distribution we change its shape.

How do you interpret a stemplot?

Interpret a stemplot the same way you interpret a histogram: look for *patterns, outliers, and describe the center, shape, and spread.*

How do we define the center of a distribution?

Its midpoint: the point in the graph where roughly half of the observations are smaller and roughly half are larger.

What type of graph should you use to show the number of students at Texas A&M each year from 1950 to 2016?

Line graph

When we see an outlier, what should we do?

-Look for an explanation. Is there some reason we would expect this value to be an outlier? Is there a chance it is a mistake? -Make note that it exists, but do not include it when discussing the overall pattern.

What is the most common numerical way to describe a distribution?

Mean and standard deviation

In a histogram, should there be any space between the class bars?

NO

With normal distributions, do we expect there to be many outliers?

No

How many classes should a histogram have? What happens if there are too many or too few classes?

No definitive answer. -A good general rule is to use between 10 and 20 classes. -Too few classes will give a "skyscraper" histogram, with all the values in a few classes with tall bars. -Too many classes will give a "pancake" histogram, with most classes having one or no observations.

Does a strong relationship between two variables mean that changes in one variable cause changes in the other variable?

No, just because we see a relationship, we can't be sure that changes in one variable actually cause the changes in the other variable.

Can we calculate correlation for categorical variables?

No, only numeric

When defining classes, can the classes be of unequal widths?

No, the classes can't be of unequal widths, because this changes how the graph is interpreted (our eyes respond to the areas of the bars in a histogram)

Why is it generally a bad idea to use a pictogram?

Pictograms are misleading. In a bar graph all of the bars are the same width, which means people only have to compare the heights of the different bars. In a pictogram, both the heights and the widths of the pictures are different for each category, which makes it difficult for people to see the true difference between the categories.

You do a survey of students at A&M and ask students whether or not they want the library to be open for longer hours. 50% say yes, 43% say no, and 7% are undecided. What type of graph should you use?

Pie/bar graph

Many times we use a regression line for prediction. What is prediction based on? When does prediction work best? What is extrapolation and why should we be careful about extrapolating? Describe an example where extrapolation was a bad idea.

Prediction is based on fitting some model to a set of data: this could be a straight line or something more elaborate. Prediction works best when the model fits the data closely; when the data closely follow the regression line, our model is more trustworthy. Extrapolation is when you use a model to predict a y value for an x value that is outside of your original range. We should be careful about extrapolating because there could be different patterns going on outside of the ranges we have data for. For example, if we have data on the heights of children, we would see a straight-line relationship between a child's age and their height: as they get older, they get taller. We shouldn't use this to predict the heights of adults, because people don't keep growing at a steady rate forever: at some point their growth levels off.

What does the area under a density curve represent?

Proportions of the total number of observations

When making a data table, is it better to present data as counts (for example, the number of people in a category) or as rates (for example, the percent of people in a category)?

RATES, more informative to reader

What study design is the best for establishing causation? What can we do if this study type is not ethical or feasible?

Randomized comparative experiments are the best for establishing causation. If a randomized comparative experiment is not ethical or feasible, we have some criteria we can use to try and establish causation. This includes strong association, consistent association, a dose response (higher doses are associated with higher responses), the cause precedes the effect (cause happens before the effect), and the cause is plausible (some sort of biologic or scientific reason that it makes sense).

What type of error do we commonly see in data tables?

Roundoff error

What is seasonal variation?

Seasonal variation is a pattern that repeats itself at known regular intervals of time.

The correlation between the average SAT Mathematics score in the states and the proportion of high school seniors who take the SAT is r = −0.843. The correlation is negative. What does that tell us? How well does proportion taking predict average score?

Since the correlation is negative, we know that the average SAT Math score and the proportion of high schoolers taking the SAT are negatively associated. As the proportion of high schoolers taking the SAT increases, the average score decreases. We know that r^2 is 0.7106, so 71.06% of the variation in the SAT Math scores can be explained by the least-squares regression line. This is a fairly high number, so we know that the proportion taking the exam is a good predictor.

When choosing how to numerically describe a distribution, what is the first thing you should do?

Start with a graph

What are three terms we use to describe normal curves (or normal distributions)?

Symmetric, single peaked, bell shaped

What is the distribution of a variable?

Tells us what values it takes and how often it takes those values

How do you choose the classes for a stemplot?

The classes of a stemplot are the stems, they are given to you. However, you can adjust the stems slightly by rounding the data differently. This is typically done when the data have too many digits

The correlation between IQ score and school GPA is r = 0.634. The correlation between wine consumption and heart disease is r = −0.645. Which of these two correlations indicates a stronger straight-line relationship? Explain your answer.

The correlation between wine consumption and heart disease is a stronger straight-line relationship because its value is further from zero.

What variable goes on the x-axis (horizontal)? What variable goes on the y-axis (vertical)?

The explanatory variable goes on the x-axis and the response variable goes on the y-axis.

What numbers make up the five-number summary? What is the graphical representation of the five-number summary called?

The five number summary is made up of the minimum, first quartile, median, third quartile, and maximum. We graphically represent this with a boxplot.

Using a density curve, how do we find the mean?

The mean of the density curve is the balancing point: the point at which the curve would balance if it were made of a solid material.

Using a density curve, how do we find the median and the quartiles?

The median is the point with half of the observations on either side. (half of the area lies to the right of it and half of the area lies to the left of it.) The quartiles are found by determining the points that divide the area under the curve into quarters. The first quartile is the point where 25% of the area is to the left of it and the third quartile is the point where 75% of the area is to the left of it

What is the median? How do we calculate it?

The median represents the midpoint of the data: the point that divides the data in half, because half of the observations are below the median and half are above it. -First arrange all observations in order of size from smallest to largest. -Find the midpoint of the data values. -If there is an odd number of observations, the median is the center observation (the (n+1)/2 observation). -If there is an even number of observations, the median is the average of the two center observations.
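
The odd and even cases can be sketched as follows. The function is written out by hand to mirror the steps; the example data sets are reused from earlier cards.

```python
def median(values):
    """Middle observation if n is odd, average of the two middle observations if n is even."""
    ordered = sorted(values)          # arrange from smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # the (n+1)/2-th observation, counting from 1
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_case = median([9, 8, 7.5])        # three quiz scores
even_case = median([4, 3, 8, 10])     # four observations
print(odd_case, even_case)
```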

What are the first and third quartiles? How do we calculate them?

The midpoints of each half. They divide the data into quarters. Start by arranging the observations in order from smallest to largest. The first quartile is the median of the observations that are to the left of the overall median. The overall median is not included in these numbers. The third quartile is the median of the observations that are to the right of the overall median.
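
A sketch of this quartile rule, combined with the minimum, median, and maximum to give the five-number summary, is shown below. The function name and the first data set are hypothetical. Other software may use slightly different quartile conventions, so results can differ from this rule.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum).

    Quartiles are the medians of the observations to the left and right of
    the overall median; the overall median itself is not included.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]             # observations left of the overall median
    upper = ordered[half + n % 2:]     # observations right of the overall median
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

odd_summary = five_number_summary([2, 4, 5, 7, 8, 10, 12])   # hypothetical data
even_summary = five_number_summary([4, 3, 8, 10])
print(odd_summary, even_summary)
```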

What is a percentile? What percentile is the median? What percentile is the first quartile? What percentile is the third quartile?

The cth percentile of a distribution is a value such that c percent of the observations lie below it and the rest lie above. The median is the 50th percentile. The first quartile is the 25th percentile. The third quartile is the 75th percentile.

For each of the following situations, do you expect the standard score to be greater than zero, equal to zero, or less than zero?

The observed value is the same as the mean. Standard score *equals zero.* The observed value is less than the mean. Standard score is *less than zero.* The observed value is greater than the mean. Standard score is *greater than zero*.

A researcher wanted to see if the average SAT Mathematics score of each state's high school seniors could be predicted by the proportion of each state's seniors who took the exam. He found that the least-squares regression line for predicting average SAT Math score from proportion taking is: average Math SAT score = 580.0 − (109.7 × proportion taking). Interpret the slope and the intercept of this equation. In New York, the proportion of high school seniors who took the SAT was 0.89. Predict their average score.

The slope is -109.7. This means that when we increase the proportion taking the test by 1, we expect the average score to decrease by 109.7 points. The intercept is 580.0. This means that when 0% of high school seniors in a state take the SAT, we expect their average score to be 580. The predicted average score for the state of New York is 482.367.
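
The prediction is just a substitution into the regression equation, sketched here. The function name is made up for illustration; the intercept and slope are the ones given in the question.

```python
INTERCEPT = 580.0   # predicted average score when the proportion taking is 0
SLOPE = -109.7      # change in predicted score per 1-unit increase in proportion

def predict_average_score(proportion_taking):
    """Predicted average SAT Math score from the least-squares line y = a + bx."""
    return INTERCEPT + SLOPE * proportion_taking

new_york = round(predict_average_score(0.89), 3)
print(new_york)
```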

What is r^2? How do we calculate r^2?

The square of the correlation, or r^2, is the proportion of the variation in the values of y that is explained by the least-squares regression of y on x. It is calculated by squaring r, the correlation.
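
As a quick check of the arithmetic, here is the calculation for the SAT example in this set, where r = −0.843.

```python
r = -0.843                 # correlation from the SAT example
r_squared = r ** 2         # squaring removes the sign
percent_explained = round(100 * r_squared, 2)
print(round(r_squared, 4), percent_explained)
```

Because r is squared, r^2 is the same for r = −0.843 and r = 0.843; it says how well the line predicts, not the direction of the association.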

What does standard deviation measure? You should only use standard deviation when you use what to measure the center of a distribution?

The standard deviation measures spread around the mean. You should only use the standard deviation when you use the mean to measure the center of a distribution.

What is the standard form of a regression line? What do the values in this equation stand for?

The standard form or equation of a regression line is y = a + bx. In this equation, a is the intercept of the line and b is the slope of the line.

When creating a line graph of how price of gas changes over time, what would you plot on the X (horizontal) and Y (vertical) axes?

Time on the X (horizontal) axis and the variable you are measuring on the Y (vertical) axis.

What is the purpose of making graphs?

To help us understand the data

Why do we use data tables?

To summarize *large amounts of information*. We use them to show us what is going on with the data *overall*, instead of what is going on with each individual.

One way to describe our data is to draw a density curve. What is a density curve and what is the general idea behind how to find the density curve for a set of data?

Way of describing the *overall pattern of a distribution* with a smooth curve. Created by drawing a smooth curve through the tops of the bars of a histogram, making sure to draw it such that the area under the curve is exactly 1

How do we use a regression equation to predict a new value?

We can use a regression line to predict a new value of y by substituting the x-value into the equation and solving for y.

How can we describe the direction of a scatterplot?

We describe the direction of a scatterplot by saying it is positively associated, negatively associated, or no association.

What three things do we use to describe the overall pattern of a scatterplot?

We describe the overall pattern of a scatterplot by describing its form, direction, and strength.

When looking at the spread of a distribution, what do we do about outliers?

We do not include outliers when looking at the spread of the distribution. We define the spread of a distribution by giving the smallest and largest values, ignoring any outliers.

What are the steps to making a histogram?

We have to group nearby values together to make the histogram easy to read. -First divide the range of the data into classes or groups of equal width -Then count the number of individuals in each class/group -Draw the histogram.
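
The first two steps (equal-width classes, then counts) can be sketched as below. The function name, the class boundaries, and the data are all hypothetical. Each class includes its lower boundary but not its upper boundary, so the classes are exclusive, and the function raises an error if a value has no class, so they are exhaustive.

```python
def class_counts(values, start, width, n_classes):
    """Count observations in equal-width classes [start, start+width), [start+width, ...), etc."""
    counts = [0] * n_classes
    for v in values:
        index = int((v - start) // width)   # each value lands in exactly one class
        if not 0 <= index < n_classes:
            raise ValueError(f"{v} does not fall into any class")
        counts[index] += 1
    return counts

# Hypothetical data with classes 0-5, 5-10, and 10-15 (upper boundary excluded)
data = [1, 3, 5, 5, 9, 10, 14]
counts = class_counts(data, start=0, width=5, n_classes=3)
print(counts)
```

The counts become the bar heights when the histogram is drawn.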

What graph do we use to show the relationship between two quantitative variables? How do you make this graph? In this graph, what does each point represent?

We use a scatterplot to show the relationship between two quantitative variables. To make this graph, values of one variable are plotted on the horizontal axis and values of the other variable are plotted on the vertical axis. Each point on a scatterplot represents one individual in the data and their observed values for both of the variables.

The heights of children ages 3-5 are approximately normally distributed with a mean of 40 inches, with a standard deviation of 2.5 inches. Use that information and the 68-95-99.7 rule to answer the following questions:

1. What percent of children are between 40 and 45 inches tall? 47.5% of children are between 40 and 45 inches tall. 2. What percent of children are above 42.5 inches tall? 16% of children are above 42.5 inches tall. 3. What percent of children are between 32.5 inches tall and 42.5 inches tall? 83.85% of children are between 32.5 inches tall and 42.5 inches tall. 4. What percent of children are below 35 inches tall? 2.5% of children are below 35 inches tall. 5. What percent of children are below 40 inches tall? 50% of children are below 40 inches tall. 6. What percent of children are above 37.5 inches tall? 84% of children are above 37.5 inches tall.
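
These answers can be reproduced by tabulating the percent of observations below each whole number of standard deviations from the mean, as implied by the 68-95-99.7 rule, and subtracting. This is a sketch only: the table and function names are made up, and it only works for heights that sit exactly 0, 1, 2, or 3 standard deviations from the mean.

```python
# Percent of observations below (mean + k standard deviations), from the 68-95-99.7 rule
PERCENT_BELOW = {-3: 0.15, -2: 2.5, -1: 16.0, 0: 50.0, 1: 84.0, 2: 97.5, 3: 99.85}

def percent_between(low, high, mean=40.0, sd=2.5):
    """Percent of children with heights between low and high (None means unbounded)."""
    below_high = 100.0 if high is None else PERCENT_BELOW[(high - mean) / sd]
    below_low = 0.0 if low is None else PERCENT_BELOW[(low - mean) / sd]
    return round(below_high - below_low, 2)

answers = [
    percent_between(40, 45),       # 1. between 40 and 45 inches
    percent_between(42.5, None),   # 2. above 42.5 inches
    percent_between(32.5, 42.5),   # 3. between 32.5 and 42.5 inches
    percent_between(None, 35),     # 4. below 35 inches
    percent_between(None, 40),     # 5. below 40 inches
    percent_between(37.5, None),   # 6. above 37.5 inches
]
print(answers)
```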

What two things do we look for when looking at a scatterplot?

When looking at a scatterplot, we look for the overall pattern and any deviations from that pattern, which may be outliers.

What is seasonal adjustment? Give an example.

When the expected seasonal variation is removed before the data is published. Ex: graphing unemployment rates or prices of gasoline over time.

Are regression lines affected by outliers? Do they take into account what you pick for response and explanatory variable?

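Yes to both. Regression lines are affected by outliers, and they do depend on which variable you pick as the response and which as the explanatory variable.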

You decide to study the average temperature in Chicago each month for many years. Do you expect a line graph of the data to show seasonal variation?

Yes, you would expect the average temperatures to be lowest in the winter and highest in the summer, so you expect to see average temperatures to increase during the first half of the year and decrease during the second half of the year.

When should you use the mean/standard deviation to describe a distribution? When should you use the five-number summary to describe a distribution?

You should use the mean/standard deviation to describe a distribution when the distribution is *reasonably symmetric and there are no outliers*. You should use the five-number summary to describe a distribution when the *distribution is skewed or has outliers.*

What are three possible explanations for an association between two variables?

causation, common response, confounding

What is the 68-95-99.7 rule?

In any normal distribution, approximately 68% of the observations fall within one standard deviation of the mean, approximately 95% of the observations fall within two standard deviations of the mean, and approximately 99.7% of the observations fall within three standard deviations of the mean.

What is the most common type of regression line? How is this line drawn?

The least-squares regression line. This line is drawn by minimizing the sum of the squared vertical distances from the line to the actual observed values.
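
For a straight line y = a + bx, the slope and intercept that minimize this sum have a standard closed form, which is not stated in these cards but is sketched here. The data and function names are hypothetical.

```python
from statistics import mean

def least_squares_line(xs, ys):
    """Return (a, b) for the line y = a + bx that minimizes the sum of squared vertical distances."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx          # slope
    a = my - b * mx        # intercept: the line passes through (mean of x, mean of y)
    return a, b

def sum_squared_distances(xs, ys, a, b):
    """Sum of squared vertical distances from the observed y values to the line."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical paired data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = least_squares_line(xs, ys)
best = sum_squared_distances(xs, ys, a, b)
print(a, b, best)
```

Nudging either the intercept or the slope away from these values makes the sum of squared distances larger, which is what "least squares" means.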

What type of graph do we use to show how quantitative variables change over time?

line graph

We can completely describe a normal distribution using what two things?

mean and standard deviation

In order to use an association for predictive reasons, do we need to know that one variable causes the change in the other (is causation necessary)?

no

What things should we look for when making graphs of data?

overall patterns and any deviations from those patterns, which could be a sign of an outlier.

What happens to r when we change the unit of measurement?

r does not change

What measurement do we use to determine how useful our regression line is for prediction?

r^2

What kind of associations does correlation measure?

straight line

Is correlation affected by outliers?

yes

