Data Analysis

0.49

A linear relationship is found between number of beers and blood alcohol level. The value for r is 0.7. Clearly there are variables besides the number of beers that affect the BAC level (e.g. weight, race, gender, etc.). If you had to give a number, how strong a role do you think the number of beers plays in determining the BAC level?
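The answer on this card appears to come from squaring r. A minimal sketch of that arithmetic, assuming the usual reading of r squared as the share of variation in the response explained by the explanatory variable (the variable names are illustrative only):

```python
# r is the correlation quoted in the question.
r = 0.7

# Squaring r gives the proportion of the variation in BAC that is
# accounted for by the number of beers in a simple linear model.
r_squared = r ** 2

print(round(r_squared, 2))  # 0.49
```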

Because there are additional variables besides weight that contribute to the risk of heart disease (e.g. age, diet, exercise, etc.).

A person's risk of heart attack is clearly correlated to their weight. However, when an analysis is done, the R-squared is fairly low. Why do you think that is?

He is extrapolating beyond the dataset.

A wedding planner is attempting to convince this poor lady to buy her wedding supplies in bulk. In what way is he lying with his statistics?

250 days

According to this graph, which of the following represents the approximate mean gestation time?

About 33

According to this histogram, approximately how many cars have a city gas mileage *below* 40?

Between 15 and 20

According to this histogram, what is the most common mileage achieved by cars driving in the city?

Somewhere between 4 and 6

Suppose you are given the scatterplot shown here along with a properly generated regression line. What is the value of y^ for an x of 4?

About 240 or fewer days.

The gestation time for a group of women in a study was N(265, 25). About how many days did the lower (bottom) 16% of women carry their babies?

More than 265 days.

The gestation time for a group of women in a study was N(265, 25). How many days did the upper 50% of women carry their babies?

About 240 or more days.

The gestation time for a group of women in a study was N(265, 25). How many days did the upper/top 84% of women carry their babies? Hint: If I asked you about the upper/top 10% of women, this would refer to the 90th percentile. If I asked you about the top 20% of women, this would refer to the 80th percentile and so on...
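The three N(265, 25) cards above can be checked numerically. This is a sketch only: it uses Python's `statistics.NormalDist` in place of the empirical rule, and the variable names are hypothetical.

```python
from statistics import NormalDist

# Gestation time in days, as given on the cards: mean 265, SD 25.
gestation = NormalDist(mu=265, sigma=25)

# 240 days is one SD below the mean, so about 16% of women fall below it.
p_below_240 = gestation.cdf(240)

# The remaining ~84% carried for about 240 days or more.
p_above_240 = 1 - p_below_240

# The upper 50% sits above the mean of 265 days.
p_above_265 = 1 - gestation.cdf(265)

print(round(p_below_240, 3), round(p_above_240, 3), round(p_above_265, 3))
```

The exact areas are close to, but not identical with, the rounded 16% and 84% figures from the empirical rule.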

0.045

The graph and resulting model shown here were generated using appropriate statistical techniques. A person drinks 3.25 beers. What would you predict their BAC to be? BAC = -0.013 + 0.018 * num_beers
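Plugging 3.25 into the model from the card gives the prediction directly. A small sketch, where `predict_bac` is a hypothetical helper name and the coefficients are the ones quoted in the question:

```python
def predict_bac(num_beers):
    """Predicted BAC from the regression formula quoted on the card."""
    return -0.013 + 0.018 * num_beers

bac = predict_bac(3.25)

# Roughly 0.0455, which the card reports as 0.045.
print(round(bac, 4))
```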

The relationship is not linear.

The residual plot of a relationship is plotted as shown here. Which of the following can you determine from this plot?

Both of the above.

Examine the plot shown. If we were to remove the red dot, which of the following do you think would happen?

Indicate somewhere in your discussion that you have done so.

If you choose to remove an outlier from your dataset, what must also always be done?

Normal distribution

If you had to make an educated guess, what would you imagine to be the most common distribution encountered in 'real life'?

Mean

If you wanted to calculate the center of the distribution for the 'hours online' variable, which measure should you use? (You may assume that this dataset is not skewed).

40%

Make your best estimate: Approximately what percentage of people would be expected to score MORE than 8 on this grade equivalency test?

Both groups demonstrate a linear relationship with fairly strong correlation, yet the regression models for the two groups are quite distinct.

This plot demonstrates the improvement in emotional quotient (EQ) over the years between a group of people from Mars and another group from Venus. Why was it important to distinguish these two categories on the plot?

They are essentially the same thing. SD is simply the square root of the variance.

What is the difference between variance and standard deviation (SD)?

There is no statistical tool/plot that can make such a guarantee.

What is the name given to the statistical plot that can guarantee that our distribution is Normal?

Normal distribution

What is the proper or 'official' name typically applied to this distribution?

None of the above

What is the z-score for an observation of 27 given a mean of 20 and a standard deviation of 3?

Median because it is skewed

What measure of center would you use to describe this distribution?

About 50%

What percentage of people had a gestation time of less than 250 days?

BOTH center and spread are very important in describing a distribution.

When attempting to describe and/or analyze a distribution, which of the following is the most important number to consider?

Bimodal

Of the following options, which best describes the shape of this distribution? Hint: Do you think the density curve drawn on this figure is accurate?

Bell-shaped

Of the following options, which best describes the shape of this distribution? You may ignore the outlier at 55-60 mpg.

None of the above. The sample size is very small and, as a result, we cannot have much confidence in this model.

You are interested in the relationship between the number of hours students spend studying and their performance on their midterm statistics exams. You look at 3 students (n=3) and accurately record this information. You plot the information, and it does appear to be linear. You calculate r and see that it is very close to +1. You then generate the regression formula: score^ = 42 + # hours * 8. What would you predict as their score if they studied for 4.5 hours?

Explanatory: Soda Consumed, Response: Weight gain

You are trying to examine a relationship between the amount of soda (or 'pop' if you are from the midwest!) consumed, and weight gain. In particular, you want to see if the amount of soda a person drinks results in weight gain. Which do you think should be considered the explanatory variable, and which the response?

Your observation lies 2.4 standard deviations below the mean.

You convert a datapoint to a z-score of -2.4. What does this value tell you?

The proportion of observations that lie under the curve to the left of your z-score.

You look up a certain z-score using either R's pnorm() function, or on a standard normal table (z-table). What does the value that you find tell you?

Both of the above.

You look up a z-score using R's pnorm() function and are given a value of 0.342. What does that value tell you?

You are encouraged that your data follows a normal distribution. But you are also aware that you can't be 100% certain.

You plot a dataset on a normal quantile plot and find the result seen here. It does indeed appear to be a straight line. What can you conclude?

You are certain the data is not normally distributed.

You plot a dataset on a normal quantile plot and find the result seen here. The line is clearly curved. What can you conclude?

True

True/False: A histogram should account for 100% of the observations in the dataset. For example, this gas mileage histogram should account for ALL of the 34 cars in the previous question.

80%

Make your best estimate: Approximately what percentage of people would be expected to score MORE than 4 on this grade equivalency test?

Yes, you can calculate a z-score for any value within the range of a Normal distribution.

A dataset composed of the following values follows a Normal distribution: 59 60 61 62 62 63 63 63 64 64 65 66 67 68. Is it possible to calculate a z-score for the value 63.49?

91.9%

A group of women are sampled and their heights give the distribution (in inches): N(64.5, 2.5) What percentage of women are *shorter* than 68 inches?

About 67 inches

A randomly selected individual from the N(64.5, 2.5) distribution tells you she is at the 84th percentile for height. How tall is she?
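Both N(64.5, 2.5) height cards can be reproduced with a normal CDF and its inverse. This sketch assumes Python's `statistics.NormalDist` as a stand-in for the R `pnorm()` lookup mentioned elsewhere in this set; the variable names are illustrative.

```python
from statistics import NormalDist

# Women's heights in inches, as given on the cards: mean 64.5, SD 2.5.
heights = NormalDist(mu=64.5, sigma=2.5)

# Convert 68 inches to a z-score, then look up the area to its left.
z_68 = (68 - heights.mean) / heights.stdev
p_shorter_than_68 = NormalDist().cdf(z_68)

# Going the other way: the height that sits at the 84th percentile.
height_at_84th = heights.inv_cdf(0.84)

print(round(p_shorter_than_68, 3), round(height_at_84th, 1))
```

The first value corresponds to the 91.9% answer and the second to "about 67 inches", which is one standard deviation above the mean.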

About 36% of students' GPA comes from factors other than the number of hours they study.

A relationship is hypothesized between hours studied and students' GPA. A study is done and the analysis shows a linear relationship with an r of 0.8 and an R^2 (r squared) of 0.64. What does the R^2 value tell us?

Sounds like a good score as it is above average, but I can't tell just how good it was without some measure of the variation.

A student scores 24 on this year's ACT exam. He describes it as a fantastic score since the "average was only 18". Which of the following would be a reasonable response?

Histogram

Friends <- c(10.2, 20.7, 15.9, 25.3, 30.6, 12.1, 18.8, 35.4, 40.9, 23.5) If you wished to graph/chart the 'Friends' variable, which of the following would be the best choice?

+3

Given a mean of 100 and a standard deviation of 5, what is the z-score of 115?

Somewhere between 1 and 2.

Given a mean of 20 and a standard deviation of 4, what is the z-score of 25?

It is not appropriate to use r in this situation.

Given the options below, which would you estimate as the best value for r?

r = -0.7

Given the options below, which would you estimate as the best value for r ?

0.6

Given: x-bar (mean of x) = 1, standard deviation of x = 2, y-bar (mean of y) = 3, standard deviation of y = 6, r = 0.8. What is the value for b0?
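The card does not show the working, so here is a sketch under the standard least-squares identities: the slope is b1 = r * (s_y / s_x), and the line passes through (x-bar, y-bar), so b0 = y-bar - b1 * x-bar. The variable names are assumptions for illustration.

```python
# Summary statistics as given on the card.
x_bar, s_x = 1, 2
y_bar, s_y = 3, 6
r = 0.8

# Slope of the least-squares line.
b1 = r * s_y / s_x

# Intercept: the least-squares line passes through (x_bar, y_bar).
b0 = y_bar - b1 * x_bar

print(round(b1, 2), round(b0, 2))
```

With these numbers the slope is 2.4 and the intercept is 0.6, matching the card.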

All of the above statements are TRUE.

Given: N(3, 0.220), which (if any) of the following statements is FALSE?

True

Here is the 5-number summary for a certain dataset: 10, 20, 25, 30, 42. True/False: According to the 1.5 IQR rule, a value of 48 should be considered an outlier.
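The 1.5 IQR rule flags any value more than 1.5 interquartile ranges below Q1 or above Q3. A short sketch applying it to the five-number summary on the card (the names `lower_fence` and `upper_fence` are illustrative):

```python
# Five-number summary from the card: min, Q1, median, Q3, max.
minimum, q1, median, q3, maximum = 10, 20, 25, 30, 42

# Interquartile range and the two fences of the 1.5 IQR rule.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# A value outside either fence is flagged as a suspected outlier.
candidate = 48
is_outlier = candidate < lower_fence or candidate > upper_fence

print(lower_fence, upper_fence, is_outlier)
```

Here the fences are 5 and 45, so a value of 48 is flagged.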

x

Here is the general regression equation: y^ = b0 + b1*x. Which character represents the explanatory variable?

b0

Here is the general regression equation: y^ = b0 + b1*x. Which character represents the y-intercept?

By using a widely accepted mathematical model/formula such as the method of least squares.

How do we decide on the best regression line on a scatterplot?

9

How many cars have a gas mileage of below 15?

It can vary - it really depends on the distribution of the variable.

How many intervals (or 'bins' or 'classes') should be chosen when creating a histogram?

outlier

How would you classify the red dot in this plot?

Right skewed

How would you describe the skew (if any) of this histogram?

Form: Linear, Direction: negative, Strength: strong

How would you interpret this scatterplot?

Left skewed

Identify the distribution pictured here:

The means are all the same

In the image seen here, which of the curves has the highest mean?

Blue

In the image seen here, which of the curves has the lowest standard deviation?

Scales that are abnormally stretched or compressed can give very misleading impressions about our data.

In the image shown here, most of the plots have very poorly chosen scales. Why can this be a problem?

Bottom left

In the image shown here, which of the four plots would be the best choice to display?

No because while it is far from the other observations, it is still close to the regression line.

In the scatterplot shown here, would you consider the datapoint at the top right to be an outlier?

4

In this scatterplot, given an IQ of 83, what would you predict as the grade point average?

False

It is observed that a certain variable 'x' always seems to be associated with some change in a variable 'y'. True/False: We can therefore assume that there is some causation present.

x-axis

On a scatterplot, the explanatory variable is typically plotted on which axis?

Uniform distribution / The straight red line

On this graph, name (1) the distribution and (2) the density curve.

None of the above

The following distribution has a mean of 10 and a standard deviation of 2. What is the z-score for an observation of 6? Generous hint: Examine the distribution....

Unable to determine as the dataset range does not include this period.

The plot shown here shows the relationship between temperature and time (as we head into late fall and winter). As you can see, the relationship is linear and with a reasonably strong correlation. What would you predict the temperature to be for the week of November 21st (11/21)?

Yes. The relationship appears to be linear, a fact that is supported by an apparently random residual. Also, r appears fairly strong. Therefore, I am comfortable continuing on to do a regression analysis.

The top plot here is a scatterplot showing what appears to be a linear relationship. The residual plot below shows no apparent pattern, that is, the dots appear to be randomly scattered. Would you be comfortable progressing to a regression analysis?

3

The variance of a distribution is calculated to be 9. What is the standard deviation?
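Since the standard deviation is the square root of the variance, the card's answer is a one-line calculation. A trivial sketch:

```python
import math

# Variance as given on the card.
variance = 9

# Standard deviation is the square root of the variance.
sd = math.sqrt(variance)

print(sd)  # 3.0
```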

It exerts a disproportional pull on the line upwards towards itself.

This image shows a scatterplot with a single influential point. What effect, if any, does this point have on the regression line?

False

True/False: For a dataset with outliers, the best statistic to describe the center of a distribution is the mean.

False

True/False: If a correlation is perfect (1.0) or very close to perfect, you can assume that there is probably at least some degree of causation.

True

True/False: If you had to guess, the car make that achieves 55-60 mpg on the road should probably be considered an outlier.

False

True/False: Not all values in a dataset with a Normal distribution can be converted to a z-score.

True

True/False: The correlation coefficient is NOT resistant to (i.e. *is* affected by) outliers.

Student A

Two students are applying to your graduate school program. Student A had a GPA of 3. Her school's GPAs had the distribution N(2.5, 0.5). Student B had a GPA of 3.5. Her school's GPAs had the distribution N(3, 1). Which student performed better within their school?
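One way to compare the two applicants is to standardize each GPA against its own school's distribution. A sketch, assuming the second number in each N(mean, sd) is the standard deviation; the variable names are hypothetical.

```python
# Student A: GPA 3.0 at a school with N(2.5, 0.5).
z_a = (3.0 - 2.5) / 0.5

# Student B: GPA 3.5 at a school with N(3, 1).
z_b = (3.5 - 3.0) / 1.0

# The larger z-score is further above its own school's mean.
better = "Student A" if z_a > z_b else "Student B"

print(z_a, z_b, better)
```

Student A is one standard deviation above her school's mean, while Student B is only half a standard deviation above hers.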

0.382

Use R and your z-score formula: What is the proportion (or percentage) of values that lie below a z-score of -0.30?

97.5%

Use the empirical rule to answer the following: The score distribution for a certain standardized exam was N(17, 0.3). What percentage of students scored *below* 17.6?

99.7%

Use the empirical rule to answer the following: The score distribution for a certain standardized exam was N(17, 0.3). What percentage of students scored between 16.1 and 17.9?

81.5%

Use the empirical rule to answer the following: The score distribution for a certain standardized exam was N(17, 0.3). What percentage of students scored between 16.4 and 17.3?

About 68%

Use the empirical rule to answer the following: The score distribution for a certain standardized exam was N(17, 0.3). What percentage of students scored between 16.7 and 17.3?

0.15%

Use the empirical rule to answer the following: What percentage of observations lie above z=+3?
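The empirical-rule cards above all reduce to adding up one-SD bands of the 68-95-99.7 rule. The sketch below encodes those bands in a small table; `BANDS`, `to_z`, and `empirical_percent` are hypothetical names, and the helper only handles scores that land on whole-number z-scores.

```python
# Percent of observations in each one-SD band under the 68-95-99.7 rule,
# ordered from the tail below z=-3 up to the tail above z=+3.
BANDS = [0.15, 2.35, 13.5, 34.0, 34.0, 13.5, 2.35, 0.15]

def to_z(x, mean=17, sd=0.3):
    """Convert an exam score to the nearest whole-number z-score."""
    return round((x - mean) / sd)

def empirical_percent(low_z, high_z):
    """Percent between two whole-number z-scores; use -4 or 4 for an open tail."""
    return sum(BANDS[low_z + 4 : high_z + 4])

below_17_6 = empirical_percent(-4, to_z(17.6))            # below z = +2
between_16_1_17_9 = empirical_percent(to_z(16.1), to_z(17.9))
between_16_4_17_3 = empirical_percent(to_z(16.4), to_z(17.3))
between_16_7_17_3 = empirical_percent(to_z(16.7), to_z(17.3))
above_plus_3 = empirical_percent(3, 4)                    # upper tail only

print(below_17_6, between_16_1_17_9, between_16_4_17_3,
      between_16_7_17_3, above_plus_3)
```

These sums line up with the answers on the cards: 97.5%, 99.7%, 81.5%, about 68%, and 0.15%.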

58.7%

Using R and your z-score formula: What percentage of observations are *above* a z-score of -0.22?

38%

Using R, what is the area under the curve (rounded to the nearest percentage) to the right of z=0.31?
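The cards ask for these lookups in R with `pnorm()`. As a cross-check only, the same areas can be sketched in Python, where `NormalDist().cdf(z)` plays the role of `pnorm(z)`: it returns the area under the standard normal curve to the left of z. The variable names are illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, SD 1

# Area to the left of z = -0.30.
below_minus_0_30 = standard_normal.cdf(-0.30)

# Area to the right of a z-score is 1 minus the area to its left.
above_minus_0_22 = 1 - standard_normal.cdf(-0.22)
above_0_31 = 1 - standard_normal.cdf(0.31)

print(round(below_minus_0_30, 3),
      round(above_minus_0_22, 3),
      round(above_0_31, 2))
```

The results agree with the card answers of 0.382, 58.7%, and 38%.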

-1.5

Using the formula discussed in lecture: z = (x-mean)/sd What is the z-score for an observation of 15.5 given a mean of 20 and a standard deviation of 3?
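The formula z = (x-mean)/sd translates directly into a small function. A sketch, with `z_score` as a hypothetical helper name, applied to several of the z-score cards in this set:

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

z_15_5 = z_score(15.5, mean=20, sd=3)   # below the mean, so negative
z_115 = z_score(115, mean=100, sd=5)
z_25 = z_score(25, mean=20, sd=4)
z_27 = z_score(27, mean=20, sd=3)

print(z_15_5, z_115, z_25, round(z_27, 2))
```

A negative result means the observation is below the mean; a positive one means it is above.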

True

Variance is used for quartiles whereas SD is used for accurate estimation under the curve.

If the number of powerboats is restricted (reduced), fewer manatees will die.

What conclusions would you draw from this plot? (For those of you who have read ahead, you may assume there is also causation. If you don't know what I mean by this, you can ignore...)

The mean of a population

What does the Greek character 'mu' represent? Tip: 'mu' is the character that looks like a funny-shaped 'u'.

40

What is the IQR for this boxplot?

Both of the above answers are correct.

What is the Interquartile Range (IQR)?

Using the line requires us to estimate both the x and y values and is therefore imprecise.

What is the chief problem with simply using a regression line as our model instead of going on to generate a regression formula?

An r of +0.1 indicates a strong relationship.

Which of the following facts about r is FALSE?

The observation is clearly a data entry error.

Which of the following is *most* likely to be considered a valid reason for removing an observation due to it being an outlier?

A well-designed experiment that includes a control group.

Which of the following is the best method of definitively establishing causation?

Both are resistant to outliers.

Which of the following measures of center is said to be 'resistant' to outliers?

The one labeled r=0.9

You are attempting to describe the "strength" of a relationship between two variables. Of the following plots, which relationship do you think has the highest strength?

Support your idea that the data is linear by drawing a residual plot.

You are interested in the relationship between number of beers and blood alcohol level. You collect data, and the plot appears to show a very strong linear relationship. What should you probably do NEXT?

You calculate a value for r and it is 0.1.

You are interested in the relationship between people's age and how much money they spend on eating in restaurants each month. You ask a sample of 200 people their age, and record the amount of money they spent eating out. Which, if any, of the following would cause you to decide AGAINST generating a regression model?

