2.2-2.3-2.6


Example 2.11 Find the median and the mean for the heart rates, in beats per minute, of 20-year-old patients and 55-year-old patients from the ICUAdmissions study.
-20-year-old patients: 108, 68, 80, 83, 72
-55-year-old patients: 86, 86, 92, 100, 112, 116, 136, 140

a.) 20-year-old patients: the mean is 82.2 beats per minute and the median is 80.
b.) 55-year-old patients: the median is 106 and the mean is 108.5 beats per minute.
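
A quick way to check these answers, sketched with Python's standard `statistics` module. The variable names are only illustrative; the heart rates are the ones listed in Example 2.11.

```python
from statistics import mean, median

# Heart rates (beats per minute) from Example 2.11
young = [108, 68, 80, 83, 72]                    # 20-year-old patients
older = [86, 86, 92, 100, 112, 116, 136, 140]    # 55-year-old patients

young_mean = mean(young)        # sum of values / number of values
young_median = median(young)    # middle value after sorting
older_mean = mean(older)
older_median = median(older)    # average of the two middle values (even count)

print(young_mean, young_median)
print(older_mean, older_median)
```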

Example 2.42: In Example 2.41, we find the predicted tip amount for three different bills in the RestaurantTips dataset. The actual tips left by each of these customers are shown below. Use this information to calculate the residuals for each of these sample points.
-The tip left on a bill of $59.33 was $10.00. From Example 2.41, the predicted tip was $10.51.
-The tip left on a bill of $9.52 was $1.00, and the predicted tip was $1.44.
-The tip left on a bill of $23.70 was $10.00, and the predicted tip was $4.02.

a) 10.00 − 10.51 = −0.51
b) 1.00 − 1.44 = −0.44
c) 10.00 − 4.02 = 5.98
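
The same subtraction (residual = observed − predicted) can be sketched in Python. The predicted tips 10.51, 1.44, and 4.02 are the rounded values from Example 2.41; the list names are just for illustration.

```python
# Observed tips and the predicted tips from Example 2.41 (rounded to cents)
observed = [10.00, 1.00, 10.00]
predicted = [10.51, 1.44, 4.02]

# residual = observed - predicted, rounded to cents
residuals = [round(o - p, 2) for o, p in zip(observed, predicted)]

print(residuals)
```

A positive residual (the third bill) means the point sits above the regression line; a negative one means it sits below.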

What do the terms regression line, least squares line, line of best fit, and fitted line mean?

They are used interchangeably; all of them refer to the regression line.

Example 2.45: We consider some scatterplots from the dataset FloridaLakes showing relationships between acidity, alkalinity, and fish mercury levels in n = 53 Florida lakes. We wish to predict a quantity that is difficult to measure (mercury level of fish) using a value that is more easily obtained from a water sample (acidity). We see in Example 2.34 that there appears to be a negative linear association between these two variables, so a regression line is appropriate.
-Use technology to find the regression line to predict Mercury from pH, and plot it on a scatterplot of the data: predicted Mercury = −0.1523 × pH + 1.53
-Interpret the slope of the regression line in the context of Florida lakes.

The slope in the prediction equation represents the expected change in the response variable for a one unit increase in the explanatory variable. Since the slope in this case is −0.1523, we expect the average mercury level in fish to decrease by about 0.1523 ppm for each increase of 1 in the pH of the lake water.

variability

How spread out the data values are.

Data 2.3 The dataset ICUAdmissions includes information on 200 patients admitted to the Intensive Care Unit at a hospital. Twenty variables are given for the patients being admitted, including age, sex, race, heart rate, systolic blood pressure, whether or not CPR was administered, and whether or not the patient lived or died.

look at next

notation for median

m

Notation for predicted vs. observed values of the response variable

Observed: y. Predicted: ŷ (y-hat).

Does it always make sense to interpret the y-intercept?

No. The intercept represents the predicted value of the response variable y if the explanatory variable x is zero. The interpretation may be nonsensical since it is often not reasonable for the explanatory variable to be zero.

correlations

A stronger correlation is close to 1 or −1; a weaker correlation is closer to zero. The more clearly the points follow a line, the stronger the correlation.

What does it mean that it is always important to plot the data and look for patterns that may or may not follow a linear trend?

It is only appropriate to use a regression line when there is a linear trend in the data.

Finding the regression line

tells us the prediction

Using the z-score to judge whether a value is unusual

-The bigger the z-score (in absolute value), the farther the value is from the mean.
-For bell-shaped data, most values (about 95%) fall within 2 standard deviations of the mean, so a value with a z-score beyond ±2 is unusual.

Two summary statistics that describe the center or location of a distribution for a single quantitative variable

the mean and the median

observed value of a scatterplot

the observed value is the height of the particular data point with that Bill amount, so the residual is the vertical distance from the point to the line.

Equation of a regression line:

y = mx + b
-Response = y
-Explanatory = x
-m = slope of the line = (change in y) / (change in x)
-b = the y-intercept

example 2.21 The five number summary for the number of hours spent exercising a week for the StudentSurvey sample is (0, 5, 8, 12, 40). Explain what this tells us about the amount of exercise for this group of students.

-All of the students exercise between 0 and 40 hours per week.
-The 25% of students who exercise the least exercise between 0 and 5 hours a week, and the 25% of students who exercise the most exercise between 12 and 40 hours a week.
-The middle 50% of students exercise between 5 and 12 hours a week, with half exercising less than 8 hours per week and half exercising more than 8 hours per week.

Regression cautions:

-Avoid trying to apply a regression line to predict values far from those that were used to create it (extrapolating). We can't make predictions using values of the explanatory variable that are too far from the original data.
-It is always important to plot the data and look for patterns that may or may not follow a linear trend.
-Outliers can have a strong influence on the regression line, just as we saw for correlation. In particular, data points for which the explanatory value is an outlier are often called influential points because they exert an overly strong effect on the regression line.

advantages and disadvantages to mean and SD

-Both the mean and standard deviation have the advantage that they use all of the data values. However, they are not resistant to outliers.

visualizing mean and median if skewed

-If the data are skewed to the right, the values in the extended right tail pull the mean up but have little effect on the median.
-In this case, the mean is bigger than the median.
-Similarly, if data are skewed to the left, the mean is less than the median.

IQR and range

-In general, although the range is a very easy statistic to compute, the IQR is a more resistant measure of spread.
-Both are single numbers, not intervals.

example 2.40 -Use the regression line in Figure 2.70 to estimate the predicted tip amount on a $60 bill.

-It is about 10 dollars (go to the x value of 60 and go up to the y value that falls on the line).

what do we ask when we consider the shape of a dataset

-Is it approximately symmetric? -If not, is the data piled up on one side? -If so, which side? -Are there outliers?

Why it's important to know about resistance

-Knowing that outliers have a substantial effect on the mean but not the median can help determine which is more meaningful in different situations. (If there is a huge outlier, use the median because it is resistant.)

Process for finding the residual when the predicted value is not given

-First find the predicted value, since it is not given.
-The data point you are given is a coordinate pair (x, y).
-Put the x-coordinate into the equation of the regression line and solve for ŷ.
-Subtract that predicted value from the actual (observed) y that is provided.

Positive and negative residuals and their sizes

-Points above the line will have positive residuals and points below the line will have negative residuals. -If the predicted values closely match the observed data values, the residuals will be small.

range and interquartile range

-Range = max − min
-Interquartile range (IQR) = Q3 − Q1

Symmetric vs. right skewed vs. left skewed

-Symmetric: if the two sides approximately match when folded on a vertical center line
-Skewed to the right: if the data are piled up on the left and the tail extends relatively far out to the right
-Skewed to the left: if the data are piled up on the right and the tail extends relatively far out to the left

example 2.20 Standard & Poor's maintains one of the most widely followed indices of large-cap American stocks: the S&P 500. The index includes stocks of 500 companies in industries in the US economy. A histogram of the daily volume (in millions of shares) for the S&P 500 stock index for every trading day in 2018 is shown in Figure 2.21. The data are stored in SandP500. Use the histogram to roughly estimate and interpret the 25th percentile and the 90th percentile.

-The 25th percentile is the value with a quarter of the values below or equal to it (or about 63 of the 251 cases in this dataset). -This is the value where 25% of the area of the histogram lies to the left. -This appears to be somewhere in the tallest bar, perhaps around 3200 million.

example 2.16 Temperatures on April 14th in Des Moines and San Francisco are given in Table 2.26 and shown in Figure 2.18. -Which dataset do we expect to have a larger standard deviation? Why? -Use technology to find the standard deviation for each dataset and compare your answers.

-The Des Moines temperatures are more spread out, so we expect this dataset to have a larger standard deviation. -In all three cases, we see that the standard deviation for the sample of Des Moines temperatures is about s = 11.18°F. Similar output for the San Francisco temperatures shows that the standard deviation for those 25 values is s = 3.06°F. As we expect, the standard deviation is larger for the Des Moines temperatures than for the San Francisco temperatures.

Direction of skewness:

-The direction of skewness is determined by the longer tail.
-If the tail is on the right, it's right skewed.
-If the tail is on the left, it's left skewed.

How is standard deviation connected to variability?

-The larger the standard deviation, the more variability there is in the data and the more spread out the data are.

What Does "Line of Best Fit" Mean?

-The line that fits the data best should then be one where the residuals are close to zero.
-The least squares line is the line that minimizes the sum of the squared residuals, Σ(y − ŷ)².
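
A minimal sketch of what "least squares" means, using a small made-up dataset (the x and y values are hypothetical, not from the textbook). The slope and intercept come from the standard closed-form least squares formulas, which these notes do not state, so treat that part as an assumption added for illustration.

```python
from statistics import mean

# Hypothetical data for illustration only (not from the textbook)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

x_bar, y_bar = mean(x), mean(y)

# Standard closed-form least squares slope and intercept
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
intercept = y_bar - slope * x_bar


def sum_squared_residuals(b0, b1):
    """Sum of (observed - predicted)^2 for the line y-hat = b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


best = sum_squared_residuals(intercept, slope)
shifted = sum_squared_residuals(intercept + 0.5, slope)   # line moved up
steeper = sum_squared_residuals(intercept, slope + 0.1)   # line tilted

print(slope, intercept)
print(best, shifted, steeper)
```

Any other line, such as the shifted or steeper one here, gives a larger sum of squared residuals than the least squares line.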

notation for the mean

-The mean for a sample is x̄ (x-bar).
-The mean for a population is μ, the Greek letter mu.

Visualizing the mean and the median on a graph:

-The mean is the "balancing point" of a dotplot or histogram in the sense that it is the point on the horizontal axis that balances the graph. -In contrast, the median splits the dots of a dotplot, or area in the boxes of a histogram, into two equal halves.

advantages and disadvantages to IQR and median

-The median and IQR are resistant to outliers. -Furthermore, if there are outliers or the data are heavily skewed, the five number summary can give more information (such as direction of skewness) than the mean and standard deviation.

Key to avoiding mistakes when making predictions with the regression line

-The regression line to predict y from x is NOT the same as the regression line to predict x from y. -Be sure to always pay attention to which is the explanatory variable and which is the response variable!

Calculation for the residual

-The residual at a data value is the difference between the observed and predicted values of the response variable: residual = observed − predicted = y − ŷ
-It is the vertical deviation of the point from the line.

2.44 For the restaurant tips regression line, predicted Tip = 0.182 × Bill − 0.292, interpret the slope and the intercept in context.

-The slope 0.182 indicates that the tip is predicted to go up by about $0.182 for a one dollar increase in the bill. A rough interpretation is that people in this sample tend to tip about 18.2%. -The intercept −0.292 indicates that the tip will be −$0.292 if the bill is $0. Since a bill is rarely zero dollars and a tip cannot be negative, this makes little sense.

Notation for standard deviation

-The standard deviation of a sample is denoted s, and measures how spread out the data are from the sample mean x̄.
-The standard deviation of a population is denoted σ, which is the Greek letter "sigma," and measures how spread out the data are from the population mean μ.

Resistance:

-The term resistance is related to the impact of outliers on a statistic. -In general, we say that a statistic is resistant if it is relatively unaffected by extreme values. -The median is resistant, while the mean is not.

what does the z score tell you

-The z-score tells how many standard deviations the value is from the mean, and is independent of the unit of measurement.

Q1: first quartile =

25th percentile

Calculating the deviation for a single data value

Data value - mean

how to calculate z scores

z = (data value − mean) / s, where s is the standard deviation

predicted value on a scatter plot

On a scatterplot, the predicted value is the height of the regression line for a given Bill amount

regression line using explanatory and response

Response=m(explanatory) + b

Interpreting standard deviation:

Since the standard deviation is computed using the deviations from the mean, we get a rough sense of the magnitude of s by considering the typical distance of a data value from the mean.

negative and positive deviation

Since values can fall above and below the mean, some deviations are negative and some are positive.

what does the slope tell you

The slope tells you how much the predicted y changes when x increases by 1.

Data 2.5 The dataset Cars2020 contains information for a sample of 110 new car models in 2020. There are many variables given for these cars, including model, price, miles per gallon, and weight. One of the variables, QtrMile, shows the time (in seconds) needed for a car to travel one-quarter mile from a standing start. Example 2.18 A histogram of the quarter-mile times is shown in Figure 2.20. Is the distribution approximately symmetric and bell-shaped? Use the histogram to give a rough estimate of the mean and standard deviation of quarter-mile times.

The histogram is relatively symmetric and bell-shaped. The mean appears to be approximately 16 seconds.

median in the 5 number summary

The median is the 50th percentile, since it divides the data into two equal halves.

minimum and maximum

The minimum and maximum in a dataset identify the extremes of the distribution: the smallest and largest values, respectively.

predicted value

The predicted value is an estimate of the average response value for that particular value of the explanatory variable.

Residuals:

The residual is the difference between the observed value and the predicted value.

Example 2.46: In Example 2.45 on page 143, we used the acidity (pH) of Florida lakes to predict mercury levels in fish. Suppose that, instead of mercury, we use acidity to predict the calcium concentration (mg/l) in Florida lakes. Figure 2.74 shows a scatterplot of these data with the regression line predicted Calcium = −51.4 + 11.17 × pH for the 53 lakes in our sample. Give an interpretation for the slope in this situation. Does the intercept make sense? Comment on how well the linear prediction equation describes the relationship between these two variables.

The slope of 11.17 in the prediction equation indicates that the calcium concentration in lake water increases by about 11.17 mg/l when the pH goes up by one. The intercept does not have a physical interpretation since there are no lakes with a pH of zero and a negative calcium concentration makes no sense. Although there is clearly a positive association between acidity and calcium concentration, the relationship is not a linear one. The pattern in the scatterplot indicates a curved pattern that increases more steeply as pH increases. The least squares line predicts negative calcium concentrations (which are impossible) for pH levels as large as 4.5, which are within the domain of lakes in this sample.
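
To see the negative prediction concretely, here is a small sketch that plugs pH values into the line from Example 2.46. The function name `predict_calcium` is hypothetical; the coefficients are the ones given in the example.

```python
def predict_calcium(ph):
    """Predicted calcium (mg/l) from the Example 2.46 line: -51.4 + 11.17 * pH."""
    return -51.4 + 11.17 * ph


at_ph_4_5 = predict_calcium(4.5)   # a pH within the range of the sample
at_ph_7 = predict_calcium(7.0)

print(at_ph_4_5)   # negative, which is impossible for a concentration
print(at_ph_7)
```

A negative predicted concentration inside the range of the data is one sign that a straight line is not a good model here.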

what does the standard deviation look at

The standard deviation gives a rough estimate of the typical distance of a data value from the mean.

Standard deviation:

The standard deviation is a statistic that measures how much variability there is in the data

y-intercept

The y value when x = 0

Using the equation of the regression line to make predictions:

We substitute the value of the explanatory variable into the prediction equation to calculate the predicted response.

How to find the residuals when given the regression line

We use the regression line to find the predicted value for each data point, and then subtract to find the residuals.

what does a large deviation mean

-A larger standard deviation means the data values are more spread out and have more variability. (Another measure of variability is the square of the standard deviation, called the variance.)

when is a graph symmetric

-A distribution is considered symmetric if we can fold the plot (either a histogram or dotplot) over a vertical center line and the two sides match closely.

percentiles if 620 is 84th percentile

-The 84th percentile means that 84% of the students who took the same exam scored less than or equal to 620.

importance of z scores

-A common way to determine how unusual a single data value is, that is independent of the units used, is to count how many standard deviations it is away from the mean.

example 2.13 As in most professional sports, some star players in the National Football League (NFL) in the US are paid much more than most other players. In particular, five players (all quarterbacks) were paid salaries greater than $30 million in 2019. Two measures of the center of the player salary distribution for the 2019 NFL season are $930,000 and $3.075 million. (Note: if you have a huge outlier, use the median.)
-One of the two values is the mean and the other is the median. Which is which? Explain your reasoning.
-In salary negotiations, which measure (the mean or the median) are the owners more likely to find relevant? Which are the players more likely to find relevant? Explain.

-There are some high outliers in the data, representing the players who make a very high salary. These high outliers will pull the mean up above the median. The mean is $3.075 million and the median is $930,000. -The owners will find the mean more relevant, since they are concerned about the total payroll, which is the mean times the number of players. The players are likely to find the median more relevant, since half of the players make less than the median and half make more. The high outliers influence the mean but are irrelevant to the salaries of most players. Both measures give an appropriate measure of center for the distribution of player salaries, but they give significantly different values. This is one of the reasons that salary negotiations can often be so difficult.

-The owner of a bistro called First Crush in Potsdam, New York, is interested in studying the tipping patterns of its patrons. He collected restaurant bills over a two-week period that he believes provide a good sample of his customers. The data from 157 bills are stored in RestaurantTips and include the amount of the bill, size of the tip, percentage tip, number of customers in the group, whether or not a credit card was used, day of the week, and a coded identity of the server.
-What is the equation of the regression line for the tips question?

-Predicted Tip = 0.182 × Bill − 0.292
-The y-intercept of this line is −0.292 and the slope is 0.182.

How do you estimate the SD in Example 2.18 (quarter-mile times)?

-To estimate the standard deviation, we estimate an interval centered at 16 that contains approximately 95% of the data. -The interval from 13 to 19 appears to contain almost all the data. Since 13 is 3 units below the mean of 16 and 19 is 3 units above the mean, by the 95% rule we estimate that 2 times the standard deviation is 3, so the standard deviation appears to be approximately 1.5 seconds. -Note that we can only get a rough approximation from the histogram. To find the exact values of the mean and standard deviation, we would use technology and all the values in the dataset. For the QtrMile variable in this example, we find mean = 16.04 and s = 1.30 seconds, so the rough approximation worked reasonably well.
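
The 95% rule reasoning above can be written as a short calculation. This is only a sketch of the rough estimate; the endpoints 13 and 19 are the values read off the histogram in the answer.

```python
# Interval read from the histogram that holds about 95% of the data
low, high = 13, 19

center = (low + high) / 2        # estimated mean
# By the 95% rule the interval is mean - 2s to mean + 2s, so its width is 4s
sd_estimate = (high - low) / 4

print(center, sd_estimate)
```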

Example 2.17 We see in Example 2.9 on page 74 that the distribution for pulse rates from the StudentSurvey data is symmetric and approximately bell-shaped.

-Use the fact that the mean of the pulse rates is mean = 69.6 and the standard deviation is s = 12.2 to give an interval that is likely to contain about 95% of the pulse rates for students.
-69.6 − 2(12.2) = 45.2
-69.6 + 2(12.2) = 94.0
-About 95% of the pulse rates should be between 45.2 and 94.0 beats per minute.
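
A small sketch of the 95% rule as a reusable calculation. The helper name `interval_95` is hypothetical; the mean 69.6 and standard deviation 12.2 are the values given in Example 2.17.

```python
def interval_95(mean, sd):
    """Interval mean - 2*sd to mean + 2*sd (95% rule for bell-shaped data)."""
    return mean - 2 * sd, mean + 2 * sd


low, high = interval_95(69.6, 12.2)
print(round(low, 1), round(high, 1))
```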

How to draw a histogram:

-The variable being investigated goes on the x-axis.
-Frequency (number of cases) goes on the y-axis.
-The height of each bar shows the number of cases that fall in that range of values.

Using a curve to represent the shape of a histogram

-We often draw smooth curves to illustrate the general shape of a distribution. -follow general shape

In describing a single quantitative variable, we generally consider the following three questions:

-What is the general shape of the data? -Where are the data values centered? -How do the data vary?

Example 2.10 Give the notation for the mean in each case.
-For a random sample of 50 seniors from a large high school, the average SAT (Scholastic Aptitude Test) score was 582 on the Math portion of the test.
-About 1.67 million students in the class of 2014 took the SAT, and the average score overall on the Math portion was 513.

-x̄ = 582 (sample mean)
-μ = 513 (population mean)

Ex 2.41 Three different bill amounts from the RestaurantTips dataset are given. In each case, use the regression line predicted Tip = 0.182 × Bill − 0.292 to predict the tip.
1.) A bill of $59.33
2.) A bill of $9.52
3.) A bill of $23.70

1.) Predicted Tip = 0.182 × 59.33 − 0.292 = 10.506 → 10.51
2.) Predicted Tip = 0.182 × 9.52 − 0.292 = 1.441 → 1.44
3.) Predicted Tip = 0.182 × 23.70 − 0.292 = 4.021 → 4.02
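
The substitution can be sketched as a small function. The name `predict_tip` is hypothetical; the slope 0.182 and intercept −0.292 are the ones given for the RestaurantTips regression line.

```python
def predict_tip(bill):
    """Predicted tip from the regression line: 0.182 * Bill - 0.292."""
    return 0.182 * bill - 0.292


bills = [59.33, 9.52, 23.70]
predictions = [round(predict_tip(b), 2) for b in bills]   # round to cents

print(predictions)
```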

what is used to show a 5 number summary

-a box plot

what are histograms similar to

-are similar to bar charts for a categorical variable, except that a histogram always has a numerical scale on the horizontal axis.

mean:

-for a single quantitative variable -is the numerical average -Sum of all data values/ number of data values

outliers

-is an observed value that is notably distinct from the other values in a dataset. -Usually, an outlier is much larger or much smaller than the rest of the data values.

Dotplots:

-looking at one quantitative variable -We create a dotplot by using an axis with a scale appropriate for the numbers in the dataset and placing a dot over the axis for each case in the dataset.

Histograms:

-looks at one quantitative variable
-formed by grouping similar values together into intervals; the height of each bar shows how many cases fall in that interval

Median

-one quantitative -summarizes the center of a data set -If the numbers in a dataset are arranged in order from smallest to largest, the median is the middle value in the list. -If there are an even number of values in the dataset, then there is not a unique middle value and we use the average of the two middle values.

features of a box plot and 5 number summary

-Outliers are shown as individual points beyond the whiskers.
-The min and max are at the ends of the whiskers (not counting outliers).
-The longer the box or whiskers, the more variability.
-Q1 is the left edge of the box.
-The median is the line in the middle of the box.
-Q3 is the right edge of the box.

regression line

-the line of best fit -The regression line provides a model of a linear association between two variables, and we can use the regression line on a scatterplot to give a predicted value of the response variable, based on a given value of the explanatory variable.

how to find two points where 95% of data is between

-use 95% rule -about 95% of the data in a sample from a bell-shaped distribution should fall in the interval from mean - 2s to mean + 2s

Q3: third quartile

75th percentile

linear regression

The process of fitting a line to a set of data

bell shaped curve

A distribution is bell-shaped if the data are symmetric and, in addition, have one large central bump shaped like a bell.

Example 2.15 Average temperature on April 14th for the 25 years ending in 2019 is given in Table 2.26 for Des Moines, Iowa, and San Francisco, California. Use technology and the data in April14Temps to find the mean and the median temperature on April 14th for each city.

Choosing measures of center and spread:

Because the standard deviation measures how much the data values deviate from the mean, it makes sense to use the standard deviation as a measure of variability when the mean is used as a measure of center.

Notation for the slope for population

For the slope of a regression line for a population, we use the Greek letter β1 (beta). The slope for a sample is written with a hat, β̂.

Example 2.22 The five number summary for the mammal longevity data in Table 2.21 on page 73 is (1, 8, 12, 16, 40). Find the range and interquartile range for this dataset.

From the five number summary (1, 8, 12, 16, 40), we see that the minimum longevity is 1 and the maximum is 40, so the range is 40 − 1 = 39 years. The first quartile is 8 and the third quartile is 16, so the interquartile range is IQR = 16 − 8 = 8 years.
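
The two subtractions can be sketched directly from the five number summary. The variable names are illustrative; the values are the ones given in Example 2.22.

```python
# Five number summary for mammal longevity (years), from Example 2.22
minimum, q1, med, q3, maximum = (1, 8, 12, 16, 40)

data_range = maximum - minimum   # range = max - min
iqr = q3 - q1                    # interquartile range = Q3 - Q1

print(data_range, iqr)
```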

using the standard deviation: 95% rule

If a distribution of data is approximately symmetric and bell-shaped, about 95% of the data should fall within two standard deviations of the mean.

what to do if you have multiple data values that are the same

If there are multiple data values that are the same, we stack the dots vertically.

what do all deviations add to

In fact, if you add up all of the deviations, the sum will always be zero.
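
A quick check of this fact, sketched with the five heart rates from Example 2.11. Because of floating-point rounding the sum may come out as a tiny number rather than exactly 0, so the sketch rounds it.

```python
from statistics import mean

# Heart rates of the 20-year-old patients from Example 2.11
values = [108, 68, 80, 83, 72]
x_bar = mean(values)

# deviation = data value - mean; some are positive, some negative
deviations = [v - x_bar for v in values]
total = sum(deviations)

print(deviations)
print(round(total, 10))
```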

parts of a 5 number summary

The minimum, Q1, the median (m), Q3, and the maximum.

Example 2.12 In Example 2.11(a), we saw that the mean and the median heart rate for n = 5 ICU patients in their twenties are given by mean = 82.2 bpm and m = 80 bpm.
-Suppose that the patient with a heart rate of 108 bpm instead had an extremely high heart rate of 200 bpm. How does this change affect the mean and median?

The median would not change (it stays at 80), but the mean would increase a lot, from 82.2 to 100.6.
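
A short sketch of the comparison, swapping 108 for 200 in the Example 2.11 data. It illustrates resistance: the median stays put while the mean moves.

```python
from statistics import mean, median

original = [108, 68, 80, 83, 72]
with_outlier = [200, 68, 80, 83, 72]   # 108 bpm replaced by 200 bpm

mean_before, median_before = mean(original), median(original)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print(mean_before, median_before)
print(mean_after, median_after)
```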

Example 2.43: The regression line for these 12 data points is predicted Margin = 0.839 (Approval) − 36.8.
-Calculate the predicted values and the residuals for all the data points (go to the graph and plug in the approval value for each).
-Which residual is the largest? For this largest residual, is the observed margin higher or lower than the margin predicted by the regression line? To which president and year does this residual correspond?

-Predicted Margin = 0.839 (62) − 36.8 = 15.23; actual = 10.0; residual = 10.0 − 15.23 = −5.23. Do this for each of the 12 data points.
-The largest residual is 12.16. The observed margin of victory is 23.2, well above the predicted value of 11.04.

example 2.9 The StudentSurvey dataset introduced in Data 1.1 on page 4 contains results for 362 students and many variables. Figure 2.8 shows histograms for three of the quantitative variables: Pulse (pulse rate in number of beats per minute), Exercise (number of hours of exercise per week), and Piercings (number of body piercings). Describe each histogram

a)In the histogram for Pulse, we see that almost all pulse rates are between about 35 beats per minute and about 100 beats per minute, with two possible outliers at about 120 and 130. Other than the outliers, this histogram is quite symmetric. b.)In the histogram for Exercise, the number of hours spent exercising appears to range from about 0 hours per week to about 30 hours per week, with a possible outlier at 40. This histogram is not very symmetric, since the data stretch out more to the right than to the left. c.)The histogram for Piercings is even more asymmetric than the one for Exercise. It does not stretch out at all to the left and stretches out quite a bit to the right. Notice the peak at 0, for all the people with no piercings, and the peak at 2, likely due to students who have pierced ears and no other piercings.

Example 2.19: One of the patients (ID#772) in the ICU study (Data 2.3 on page 77) had a high systolic blood pressure of 204 mmHg and a low pulse rate of 52 bpm. Which of these values is more unusual relative to the other patients in the sample? The summary statistics for systolic blood pressure show a mean of 132.3 and standard deviation of 32.95, while the heart rates have a mean of 98.9 and standard deviation of 26.83

-Systolic blood pressure: z = (204 − 132.3) / 32.95 = 2.18
-Heart rate: z = (52 − 98.9) / 26.83 = −1.75
-The systolic blood pressure is more unusual, since its z-score is farther from zero.
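
The z-score comparison can be sketched as follows. The helper name `z_score` is hypothetical; the means and standard deviations are the summary statistics given in Example 2.19.

```python
def z_score(value, mean, sd):
    """Number of standard deviations a value lies from the mean."""
    return (value - mean) / sd


z_bp = z_score(204, 132.3, 32.95)   # systolic blood pressure
z_hr = z_score(52, 98.9, 26.83)     # heart rate

print(round(z_bp, 2), round(z_hr, 2))

# The value whose z-score is farther from zero is the more unusual one
more_unusual = "blood pressure" if abs(z_bp) > abs(z_hr) else "heart rate"
print(more_unusual)
```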

