Soc 106 - Ch 3


Empirical Rule: For Bell-Shaped Frequency Distributions, the Empirical Rule Specifies Approximate Percentages of Data within 1, 2, and 3 Standard Deviations of the Mean

Whenever the smallest or largest observation is less than a standard deviation from the mean, this is evidence of severe skew. Suppose that the first exam in your course, having potential scores between 0 and 100, has ȳ = 86 and s = 15. The upper bound of 100 is less than one standard deviation above the mean. The distribution is likely highly skewed to the left. The standard deviation, like the mean, can be greatly affected by an outlier, especially for small data sets. For instance, for the incomes of the seven Leonardo's Pizza employees shown on page 36, ȳ = $45,900 and s = $78,977. When we remove the outlier, ȳ = $16,050 and s = $489.
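The effect of the outlier can be checked directly. This is a minimal sketch, assuming the seven incomes are the ones listed in the median card of these notes and that s is the sample standard deviation (n − 1 in the denominator); the helper name `summarize` is made up for illustration.

```python
import statistics

# Incomes of the seven Leonardo's Pizza employees, as listed in these notes.
incomes = [15400, 15600, 15900, 16400, 16400, 16600, 225000]

def summarize(values):
    """Return (mean, sample standard deviation), rounded to whole dollars."""
    return round(statistics.mean(values)), round(statistics.stdev(values))

# Summary with the $225,000 income included.
with_outlier = summarize(incomes)
# Summary after dropping the $225,000 income.
without_outlier = summarize([y for y in incomes if y != 225000])

print(with_outlier)
print(without_outlier)
```

Both the mean and s shrink sharply once the single large income is removed, which is the point the card makes.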

Finding the "mean" formula

greek letter sigma
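The picture with the formula is not part of these notes. As a reconstruction from the definition used elsewhere in the notes (the mean is the sum of the observations divided by the sample size), with the capital Greek letter sigma denoting the sum, the formula would read:

```latex
\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}
```

Here n is the sample size and y_i is the i-th observation; the index notation is an assumption about how the textbook writes it.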

Properties of Standard Deviation

p. 44

comparing groups

Comparing Canadian and U.S. Murder Rates: Figure 3.16 (page 50) shows side-by-side box plots of murder rates (measured as the number of murders per 100,000 population) in a recent year for the 50 states in the United States and for the provinces of Canada. From this figure, it is clear that the murder rates tended to be much lower in Canada, varying between 0.7 (Prince Edward Island) and 2.9 (Manitoba), whereas those in the United States varied between 1.6 (Maine) and 20.3 (Louisiana). These side-by-side box plots reveal that the murder rates in the United States tend to be much higher and have much greater variability.

Range

The difference between the largest and smallest observations is the simplest way to describe variability. For nation A, from Figure 3.10, the range of income values is about $50,000 − $0 = $50,000. For nation B, the range is about $30,000 − $20,000 = $10,000. Nation A has greater variability of incomes. The range is not, however, sensitive to other characteristics of data variability. The three distributions in Figure 3.11 all have the same mean ($25,000) and range ($50,000), but they differ in variability about the center. In terms of distances of observations from the mean, nation A has the most variability, and nation B the least. The incomes in nation A tend to be farthest from the mean, and the incomes in nation B tend to be closest.

quartiles and other percentiles

The range uses two measures of position, the maximum value and the minimum value. The median is a measure of position, with half the data falling below it and half above it. The median is a special case of a set of measures of position called percentiles.

stem and leaf plots

This figure, called a stem-and-leaf plot, represents each observation by its leading digit(s) (the stem) and by its final digit (the leaf). Each stem is a number to the left of the vertical bar and a leaf is a number to the right of it. For instance, on the first line, the stem of 1 and the leaves of 2 and 3 represent the violent crime rates 12 and 13. The plot arranges the leaves in order on each line, from smallest to largest.

association b/w response and explanatory variables

With multivariable analyses, the main focus is on studying associations among the variables. An association exists between two variables if certain values of one variable tend to go with certain values of the other. For example, consider "religious affiliation," with categories (Protestant, Catholic, Jewish, Muslim, Hindu, Other), and "ethnic group," with categories (Anglo-American, African-American, Hispanic). In the United States, Anglo-Americans are more likely to be Protestant than are Hispanics, who are overwhelmingly Catholic. African-Americans are even more likely to be Protestant. An association exists between religious affiliation and ethnic group, because the proportion of people having a particular religious affiliation changes as the ethnic group changes. An analysis of association between two variables is called a bivariate analysis, because there are two variables. Usually one is an outcome variable on which comparisons are made at levels of the other variable. The outcome variable is called the response variable. The variable that defines the groups is called the explanatory variable. The analysis studies how the outcome on the response variable depends on or is explained by the value of the explanatory variable. For example, when we describe how religious affiliation depends on ethnic group, religious affiliation is the response variable and ethnic group is the explanatory variable. In a comparison of men and women on income, income is the response variable and gender is the explanatory variable. Income may depend on gender, not gender on income. Often, the response variable is called the dependent variable and the explanatory variable is called the independent variable. The terminology dependent variable refers to the goal of investigating the degree to which the response on that variable depends on the value of the other variable. We prefer not to use these terms, since independent and dependent are used for many other things in statistical science.

FREQUENCY DISTRIBUTIONS AND BAR GRAPHS: CATEGORICAL DATA

A bar graph has a rectangular bar drawn over each category. The height of the bar shows the frequency or relative frequency in that category. Figure 3.1 is a bar graph for the data in Table 3.1. The bars are separated to emphasize that the variable is categorical rather than quantitative. Since household structure is a nominal variable, there is no particular natural order for the bars. The order of presentation for an ordinal variable is the natural ordering of the categories. Another type of graph, the pie chart, is a circle having a "slice of the pie" for each category. The size of a slice represents the percentage of observations in the category. A bar graph is more precise than a pie chart for visual comparison of categories with similar relative frequencies.

Chapter Summary

A frequency distribution summarizes numbers of observations for possible values or intervals of values of a variable. For a quantitative variable, a histogram uses bars over possible values or intervals of values to portray a frequency distribution. It shows shape, such as whether the distribution is approximately bell shaped or skewed to the right (longer tail pointing to the right) or to the left. The box plot portrays the quartiles, the extreme values, and any outliers.

Histograms

A graph of a frequency distribution for a quantitative variable is called a histogram. Each interval has a bar over it, with height representing the number of observations in that interval. Choosing intervals for frequency distributions and histograms is primarily a matter of common sense. If too few intervals are used, too much information is lost. If too many intervals are used, they are so narrow that the information presented is difficult to digest, and the histogram may be irregular and the overall pattern of the results may be obscured. Ideally, two observations in the same interval should be similar in a practical sense. To summarize annual income, for example, if a difference of $5000 in income is not considered practically important, but a difference of $15,000 is notable, we might choose intervals of width less than $15,000, such as $0-$9999, $10,000-$19,999, $20,000-$29,999, and so forth. For a discrete variable with relatively few values, a histogram has a separate bar for each possible value. For a continuous variable or a discrete variable with many possible values, you need to divide the possible values into intervals, as we did with the violent crime rates. Statistical software can automatically choose intervals for us and construct frequency distributions and histograms.

the mode

Another measure, the mode, states the most frequent outcome. The mode is most commonly used with highly discrete variables, such as with categorical data. In Table 3.6 on the number of sex partners in the last year, for instance, the mode is 1, since the frequency for that outcome is higher than the frequency for any other outcome. Properties of the Mode: - The mode is appropriate for all types of data. For example, we might measure the mode for religion in Australia (nominal scale), for the grade given by a teacher (ordinal scale), or for the number of years of education completed by Hispanic Americans (interval scale). - A frequency distribution is called bimodal if two distinct mounds occur in the distribution. Bimodal distributions often occur with attitudinal variables when populations are polarized, with responses tending to be strongly in one direction or another. For instance, Figure 3.9 shows the relative frequency distribution of responses in a General Social Survey to the question "Do you personally think it is wrong or not wrong for a woman to have an abortion if the family has a very low income and cannot afford any more children?" The frequencies in the two extreme categories are much higher than those in the middle categories. - The mean, median, and mode are identical for a unimodal, symmetric distribution, such as a bell-shaped distribution. The mean, median, and mode are complementary measures. They describe different aspects of the data. In any particular example, some or all of their values may be useful. - Be on the lookout for any misleading statistical analysis (e.g., be wary of the mean when the distribution may be highly skewed). - p. 41

measures of position

Another way to describe a distribution is with a measure of position. This tells us the point at which a given percentage of the data fall below (or above) that point. As special cases, some measures of position describe center and some describe variability. - quartiles & other percentiles - IQR - five number summary (box plots) - Z-score

the shape of a distribution

Another way to describe a sample or a population distribution is by its shape. A group for which the distribution is bell shaped is fundamentally different from a group for which the distribution is U-shaped, for example. See Figure 3.5. In the U-shaped distribution, the highest points (representing the largest frequencies) are at the lowest and highest scores, whereas in the bell-shaped distribution, the highest point is near the middle value. A U-shaped distribution indicates a polarization on the variable between two sets of subjects. A bell-shaped distribution indicates that most subjects tend to fall near a central value. The distributions in Figure 3.5 are symmetric: The side of the distribution below a central value is a mirror image of the side above that central value. Most distributions encountered in the social sciences are not symmetric. Figure 3.6 illustrates this. The parts of the curve for the lowest values and the highest values are called the tails of the distribution. Often, as in Figure 3.6, one tail is much longer than the other. A distribution is said to be skewed to the right or skewed to the left, according to which tail is longer. - p. 34

how many standard deviations from the mean? The Z-Score

Another way to measure position is by the number of standard deviations that a value falls from the mean. By the Empirical Rule, for a bell-shaped distribution it is very unusual for an observation to fall more than three standard deviations from the mean. An alternative criterion regards an observation as an outlier if it has a z-score larger than 3 in absolute value. By this criterion, the murder rate for Louisiana is an outlier.
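The z-score calculation can be sketched in a few lines. The mean and standard deviation behind the Louisiana example are not given in these notes, so this sketch reuses the exam example instead (mean 86, s = 15, maximum score 100); the function names are made up for illustration.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that `value` falls from `mean`."""
    return (value - mean) / sd

def is_outlier_by_z(value, mean, sd, cutoff=3):
    """Flag an observation whose z-score exceeds `cutoff` in absolute value."""
    return abs(z_score(value, mean, sd)) > cutoff

# Exam example from these notes: mean 86, s = 15, largest possible score 100.
z_upper = z_score(100, 86, 15)
print(round(z_upper, 2))              # less than 1 standard deviation above the mean
print(is_outlier_by_z(100, 86, 15))   # not an outlier by the |z| > 3 criterion
```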

overview of bivariate descriptive statistics

Bivariate statistics summarize data on two variables together, to analyze the association between them. Many studies analyze how the outcome on a response variable depends on the value of an explanatory variable. For categorical variables, a contingency table shows the number of observations at the combinations of possible outcomes for the two variables. For quantitative variables, a scatterplot graphs the observations. It shows a point for each observation, plotting the response variable on the y-axis and the explanatory variable on the x-axis. For quantitative variables, the correlation describes the strength of straight-line association. It falls between −1 and +1 and indicates whether the response variable tends to increase (positive correlation) or decrease (negative correlation) as the explanatory variable increases. A regression line is a straight-line formula for predicting the response variable using the explanatory variable.

outliers

Box plots identify outliers separately. An observation is an outlier if it falls more than 1.5(IQR) above the upper quartile or more than 1.5(IQR) below the lower quartile. In box plots, the whiskers extend to the smallest and largest observations only if those values are not outliers, that is, if they are no more than 1.5(IQR) beyond the quartiles. Otherwise, the whiskers extend to the most extreme observations within 1.5(IQR), and the outliers are marked separately. Why highlight outliers? It can be informative to investigate them. Was the observation perhaps incorrectly recorded? Was that subject fundamentally different from the others in some way? Often it makes sense to repeat a statistical analysis without an outlier, to make sure the conclusions are not overly sensitive to a single observation. Another reason to show outliers separately in a box plot is that they do not provide much information about the shape of the distribution, especially for large data sets. In practice, the 1.5(IQR) criterion for an outlier is somewhat arbitrary. It is better to regard an observation satisfying this criterion as a potential outlier rather than a definite outlier. When a distribution has a long right tail, some observations may fall more than 1.5(IQR) above the upper quartile even if they are not separated far from the bulk of the data.
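The 1.5(IQR) criterion amounts to two cutoffs. This is a small sketch, assuming the quartiles quoted in the interquartile range card of these notes for the U.S. violent crime rates (lower quartile 26, upper quartile 43); `outlier_fences` is a made-up helper name.

```python
def outlier_fences(lower_quartile, upper_quartile):
    """Return the (low, high) cutoffs of the 1.5(IQR) criterion."""
    iqr = upper_quartile - lower_quartile
    return lower_quartile - 1.5 * iqr, upper_quartile + 1.5 * iqr

# Quartiles quoted in these notes for the violent crime rates.
low, high = outlier_fences(26, 43)

# Smallest and largest state rates (12 and 64) and the D.C. rate (130).
flags = {rate: not (low <= rate <= high) for rate in (12, 64, 130)}
print(low, high)
print(flags)
```

Under these assumptions only the D.C. rate is flagged, and it is best read as a potential outlier, as the card notes.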

comparing two groups: bivariate categorical and quantitative data

Chapter 7 presents descriptive and inferential methods for comparing two groups. For example, suppose we'd like to know whether men or women have more good friends, on the average. A General Social Survey reports that the mean number of good friends is 7.0 for men (s = 8.4) and 5.9 for women (s = 6.0). The two distributions have similar appearance, both being highly skewed to the right and with a median of 4. Here, this is an analysis of two variables: number of good friends and gender. The response variable, number of good friends, is quantitative. The explanatory variable, gender, is categorical. In this case, it's common to compare categories of the categorical variable on measures of the center (such as the mean and median) for the response variable. Graphs are also useful, such as side-by-side box plots.

Bivariate categorical data

Chapter 8 presents methods for analyzing association between two categorical variables. Table 3.8 is an example of such data. This table results from answers to two questions on the 2014 General Social Survey. One asked whether homosexual relations are wrong. The other asked about the fundamentalism/liberalism of the respondent's religion. A table of this kind, called a contingency table, displays the number of subjects observed at combinations of possible outcomes for the two variables. It displays how outcomes of a response variable are contingent on the category of the explanatory variable. Consider the always wrong category. For fundamentalists, since 262/378 = 0.69, 69% believe homosexual relations are always wrong. For those who report being liberal, since 122/541 = 0.23, 23% believe homosexual relations are always wrong. Likewise, you can check that the percentages responding not wrong at all were 23% for fundamentalists and 67% for liberals. There seems to be an appreciable association between opinion about homosexuality and religious beliefs, with religious fundamentalists being more negative about homosexuality.
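The percentages in the card are conditional proportions within each category of the explanatory variable. This is a minimal sketch using only the counts quoted above; the rest of Table 3.8 is not reproduced in these notes, so the dictionaries hold just the "always wrong" counts and the group totals.

```python
# Counts quoted in these notes for the "always wrong" response.
always_wrong = {"fundamentalist": 262, "liberal": 122}
# Total number of respondents in each religious-belief group.
group_totals = {"fundamentalist": 378, "liberal": 541}

# Proportion answering "always wrong" within each group.
conditional = {
    group: always_wrong[group] / group_totals[group]
    for group in always_wrong
}

for group, proportion in conditional.items():
    print(group, round(proportion, 2))
```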

comparing variability at first sight and calculation (standard deviation) ex. - p. 43

Comparing Variability of Quiz Scores: Each of the following sets of quiz scores for two small samples of students has a mean of 5 and a range of 10: Sample 1: 0, 4, 4, 5, 7, 10. Sample 2: 0, 0, 1, 9, 10, 10. By inspection, sample 1 shows less variability about the mean than sample 2. Most scores in sample 1 are near the mean of 5, whereas all the scores in sample 2 are quite far from 5.
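The "calculation" half of this comparison can be sketched as follows, assuming s is the sample standard deviation with n − 1 in the denominator; `describe` is a made-up helper name.

```python
import statistics

sample_1 = [0, 4, 4, 5, 7, 10]
sample_2 = [0, 0, 1, 9, 10, 10]

def describe(scores):
    """Return (mean, range, sample standard deviation rounded to 1 decimal)."""
    return (
        statistics.mean(scores),
        max(scores) - min(scores),
        round(statistics.stdev(scores), 1),
    )

summary_1 = describe(sample_1)
summary_2 = describe(sample_2)
print(summary_1)
print(summary_2)
```

The two samples share the same mean and range, so only the standard deviation reflects the difference that is visible by inspection.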

bivariate quantitative data/ scatterplot/ correlations/ regression analysis

Figure 3.17 is an example of a type of graphical plot, called a scatterplot, that portrays bivariate relations between quantitative variables. It plots data on percent using the Internet and gross domestic product. Here, values of GDP are plotted on the horizontal axis, called the x-axis, and values of Internet use are plotted on the vertical axis, called the y-axis. The values of the two variables for any particular observation form a point relative to these axes. In Chapter 9, we'll learn about two ways to describe such a trend. One way to describe the trend, called the correlation, describes how strong the association is, in terms of how closely the data follow a straight-line trend. For Figure 3.17, the correlation is 0.88. The positive value means that Internet use tends to go up as GDP goes up. By contrast, Figure 3.17 also shows a scatterplot for GDP and GII. Those variables have a negative correlation of −0.85. As GDP goes up, GII tends to go down. The correlation takes values between −1 and +1. The larger it is in absolute value, that is, the farther from 0, the stronger the association. For example, GDP is more strongly associated with Internet use and with GII than it is with fertility, because correlations of 0.88 and −0.85 are larger in absolute value than the correlation of −0.49 between GDP and fertility. The second useful tool for describing the trend is regression analysis. This method treats one variable, usually denoted by y, as the response variable, and the other variable, usually denoted by x, as the explanatory variable. It provides a straight-line formula for predicting the value of y from a given value of x. Chapter 9 shows how to find the correlation and the regression line. It is simple with software, as shown in Table 3.10 using R with variables from the data file UN at the text website.
Later chapters show how to extend the analysis to handle categorical as well as quantitative variables.

RELATIVE FREQUENCIES: CATEGORICAL DATA

For categorical variables, we list the categories and show the number of observations in each category. To make it easier to compare different categories, we also report proportions or percentages in the categories, also called relative frequencies. The proportion equals the number of observations in a category divided by the total number of observations. It is a number between 0 and 1 that expresses the share of the observations in that category. The percentage is the proportion multiplied by 100. The sum of the proportions equals 1.00. The sum of the percentages equals 100. Table 3.1 lists the different types of households in the United States in 2015. Of 116.3 million households, for example, 23.3 million were a married couple with children, for a proportion of 23.3/116.3 = 0.20. A percentage is the proportion multiplied by 100. That is, the decimal place is moved two positions to the right. For example, since 0.20 is the proportion of families that are married couples with children, the percentage is 100(0.20) = 20%. Table 3.1 is a "frequency distribution." A frequency distribution is a listing of possible values for a variable, together with the number of observations at each value. When the table shows the proportions or percentages instead of the numbers, it is called a relative frequency distribution.
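The proportion and percentage calculation is short enough to sketch directly. Only the one category quoted above is used, since the rest of Table 3.1 is not reproduced in these notes; `relative_frequency` is a made-up helper name.

```python
def relative_frequency(count, total):
    """Return (proportion, percentage) for one category."""
    proportion = count / total
    return proportion, 100 * proportion

# Married couples with children: 23.3 million of 116.3 million households.
proportion, percentage = relative_frequency(23.3, 116.3)
print(round(proportion, 2), round(percentage))
```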

FREQUENCY DISTRIBUTIONS: QUANTITATIVE DATA

Frequency distributions and graphs also are useful for quantitative variables. The next example illustrates this. Table 3.2 lists all 50 states in the United States and their 2015 violent crime rates. This rate measures the number of violent crimes in that state per 10,000 population. For instance, if a state had 12,000 violent crimes and a population size of 2,300,000, its violent crime rate was (12,000/2,300,000) × 10,000 = 52. Tables, graphs, and numerical measures help us absorb the information in these data. Table 3.3 also shows the relative frequencies, using proportions and percentages. As with any summary method, we lose some information as the cost of achieving some clarity. The frequency distribution does not show the exact violent crime rates or identify which states have low or high rates. The intervals of values in frequency distributions are usually of equal width. The width equals 10 in Table 3.3. The intervals should include all possible values of the variable. In addition, any possible value must fit into one and only one interval; that is, they should be mutually exclusive.
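The rate calculation and the "one and only one interval" requirement can be sketched together. The interval boundaries of Table 3.3 are not given in these notes, so the sketch assumes half-open intervals of width 10 that start at multiples of 10; both function names are made up for illustration.

```python
def rate_per_10000(crimes, population):
    """Violent crimes per 10,000 population."""
    return crimes / population * 10_000

def interval_for(rate, width=10):
    """Return (lower, upper) of the half-open interval [lower, upper) containing `rate`.

    Assumes intervals start at multiples of `width`, so every possible
    value falls into exactly one interval.
    """
    lower = int(rate // width) * width
    return lower, lower + width

# Example from these notes: 12,000 violent crimes, population 2,300,000.
rate = rate_per_10000(12_000, 2_300_000)
print(round(rate))
print(interval_for(rate))
```

Using half-open intervals is one way to guarantee mutual exclusivity: a boundary value such as 60 belongs to the interval that starts at 60, not to the one that ends there.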

population distribution and sample data distribution

Frequency distributions and histograms apply both to a population and to samples from that population. The first type is called the population distribution, and the second type is called a sample data distribution. In a sense, the sample data distribution is a blurry photo of the population distribution. As the sample size increases, the sample proportion in any interval gets closer to the true population proportion. Thus, the sample data distribution looks more like the population distribution. For a continuous variable, imagine the sample size increasing indefinitely, with the number of intervals simultaneously increasing, so their width narrows. Then, the shape of the sample histogram gradually approaches a smooth curve. Note: on the figure, relative frequency is on the y-axis and values of the variable are on the x-axis. Also, for continuous variables, the bars should be adjacent to each other over their respective intervals, as opposed to discrete variables (double check for accuracy). Even if a variable is discrete, a smooth curve often approximates well the population distribution, especially when the number of possible values of the variable is large.

overview of measures of center

Measures of center describe the center of the data, in terms of a typical observation. The mean is the sum of the observations divided by the sample size. It is the center of gravity of the data. The median divides the ordered data set into two parts of equal numbers of observations, half below and half above that point. The lower quarter of the observations fall below the lower quartile, and the upper quarter fall above the upper quartile. These are the 25th and 75th percentiles. The median is the 50th percentile. The quartiles and median split the data into four equal parts. These measures of position, portrayed with extreme values in box plots, are less affected than the mean by outliers or extreme skew.

overview of measures of variability

Measures of variability describe the spread of the data. The range is the difference between the largest and smallest observations. The interquartile range is the range of the middle half of the data between the upper and lower quartiles. It is less affected by outliers. The variance averages the squared deviations about the mean. Its square root, the standard deviation, is easier to interpret, describing a typical distance from the mean. The Empirical Rule states that for a bell-shaped distribution, about 68% of the observations fall within one standard deviation of the mean, about 95% fall within two standard deviations of the mean, and nearly all, if not all, fall within three standard deviations of the mean. Table 3.11 summarizes measures of center and variability. A statistic summarizes a sample. A parameter summarizes a population. Statistical inference uses statistics to make predictions about parameters.

Sample statistics and population parameters

Of the measures introduced in this chapter, the mean ȳ (y-bar) is the most commonly used measure of center and the standard deviation s is the most common measure of spread. We'll use them frequently in the rest of the text. Since the values ȳ and s depend on the sample selected, they vary in value from sample to sample. In this sense, they are variables. Their values are unknown before the sample is chosen. Once the sample is selected and they are computed, they become known sample statistics. With inferential statistics, we distinguish between sample statistics and the corresponding measures for the population. Section 1.2 introduced the term parameter for a summary measure of the population. A statistic describes a sample, while a parameter describes the population from which the sample was taken. In this text, lower case Greek letters usually denote population parameters and Roman letters denote the sample statistics. The population mean μ (mu) is the average of the observations for the entire population. The population standard deviation σ (sigma) describes the variability of those observations about the population mean. Whereas the statistics ȳ and s are variables, with values depending on the sample chosen, the parameters μ and σ are constants. This is because μ and σ refer to just one particular group of observations, namely, the observations for the entire population. The parameter values are usually unknown, which is the reason for sampling and computing sample statistics to estimate their values. Much of the rest of this text deals with ways of making inferences about parameters (such as μ) using sample statistics (such as ȳ). Before studying these inferential methods, though, you need to learn some basic ideas of probability, which serves as the foundation for the methods. Probability is the subject of the next chapter.

measuring variability: interquartile range

The difference between the upper and lower quartiles is called the interquartile range, denoted by IQR. This measure describes the spread of the middle half of the observations. For the U.S. violent crime rates just summarized by the five-number summary, the interquartile range IQR = 43 − 26 = 17. The middle half of the rates fall within a range of 17, whereas all rates fall within a range of 64 − 12 = 52. Like the range and standard deviation, the IQR increases as the variability increases, and it is useful for comparing variability of different groups. An advantage of the IQR over the ordinary range or the standard deviation is that it is not sensitive to outliers. The violent crime rates ranged from 12 to 64, so the range was 52. When we include the observation for D.C., which was 130, the IQR changes only from 17 to 18. By contrast, the range changes from 52 to 118.

Box plots: graphing the 5 number summary of positions

The five-number summary consisting of (minimum, lower quartile, median, upper quartile, maximum) is the basis of a graphical display called the box plot that summarizes center and variability. The box of a box plot contains the central 50% of the distribution, from the lower quartile to the upper quartile. The median is marked by a line drawn within the box. The lines extending from the box are called whiskers. These extend to the maximum and minimum, except for outliers, which are marked separately.

interpreting the magnitude of s: the empirical rule

The following rule is applicable for many data sets. The rule is called the Empirical Rule because many frequency distributions seen in practice (i.e., empirically) are approximately bell shaped. The Empirical Rule applies only to distributions that are approximately bell shaped. For other shapes, the percentage falling within two standard deviations of the mean need not be near 95%. It could be as low as 75% or as high as 100%. The Empirical Rule does not apply if the distribution is highly skewed or if it is highly discrete, with the variable taking few values. The exact percentages depend on the form of the distribution, as the next example demonstrates.
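To see where the approximate percentages come from, one can compute them for one particular bell-shaped distribution. This is a sketch that assumes the normal distribution as the bell shape (an assumption, not something these notes state), using Python's standard-library `statistics.NormalDist`.

```python
from statistics import NormalDist

# One specific bell-shaped distribution: the standard normal (an assumption).
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of the distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

# Percentages within 1, 2, and 3 standard deviations, to one decimal place.
shares = {k: round(100 * share_within(k), 1) for k in (1, 2, 3)}
print(shares)
```

For other bell-shaped distributions the percentages differ somewhat, which is why the rule is stated with "about" and "nearly all."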

properties of the mean

The formula for the mean uses numerical values for the observations. So, the mean is appropriate only for quantitative variables. It is not sensible to compute the mean for observations on a nominal scale. For instance, for religion measured with categories such as (Protestant, Catholic, Muslim, Jewish, Other), the mean religion does not make sense, even though for convenience these levels may be coded in a data file by numbers. - The mean can be highly influenced by an observation that falls well above or well below the bulk of the data, called an outlier. - The mean is pulled in the direction of the longer tail of a skewed distribution, relative to most of the data. In the Leonardo's Pizza example, the large observation $225,000 results in an extreme skewness to the right of the income distribution. This skewness pulls the mean above six of the seven observations. This example shows that the mean is not always typical of the observations in the sample. The more highly skewed the distribution, the less typical the mean is of the data. - The mean is the point of balance on the number line when an equal weight is at each observation point. The mean is the center of gravity (balance point) of the observations: The sum of the distances to the mean from the observations above the mean equals the sum of the distances to the mean from the observations below the mean. - Denote the sample means for two sets of data with sample sizes n1 and n2 by ȳ1 and ȳ2. The overall sample mean for the combined set of (n1 + n2) observations is the weighted average - refer to pic for formula
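The picture with the weighted-average formula is not part of these notes. The standard formula for combining two sample means, given here as a reconstruction rather than a quotation from the text, weights each mean by its sample size:

```latex
\bar{y} = \frac{n_1 \bar{y}_1 + n_2 \bar{y}_2}{n_1 + n_2}
```

The numerator is the total of all observations in the two samples, so dividing by the combined sample size gives the mean of the combined set.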

the median

The mean is a simple measure of the center. But other measures are also informative and sometimes more appropriate. Most important is the median. It splits the sample into two parts with equal numbers of observations, when they are ordered from lowest to highest or from highest to lowest. Def.: The median is the observation that falls in the middle of the ordered sample. When the sample size n is odd, a single observation occurs in the middle. When the sample size is even, two middle observations occur, and the median is the midpoint between the two. To illustrate, the ordered income observations for the seven employees of Leonardo's Pizza are $15,400, $15,600, $15,900, $16,400, $16,400, $16,600, $225,000. The median is the middle observation, $16,400. - This is a more typical value for this sample than the sample mean of $45,900. When a distribution is highly skewed, the median describes a typical value better than the mean. In Table 3.4, the ordered economic activity values for the Western European nations are 66, 68, 71, 78, 78, 79, 79, 81, 82, 85. - Since n = 10 is even, the median is the midpoint between the two middle values, 78 and 79, which is (78 + 79)/2 = 78.5. This is close to the sample mean of 76.7, because this data set has no outliers. The middle observation has the index (n + 1)/2 (p. 38). - That is, the median is the value of observation (n + 1)/2 in the ordered sample. When n = 7, (n + 1)/2 = (7 + 1)/2 = 4, so the median is the fourth smallest, or equivalently fourth largest, observation. When n is even, (n + 1)/2 falls halfway between two numbers, and the median is the midpoint of the observations with those indices. For example, when n = 10, then (n + 1)/2 = 5.5, so the median is the midpoint between the fifth and sixth smallest observations. ex. 3.4 - refer to pic
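The (n + 1)/2 indexing rule can be written out step by step. This is a minimal sketch using the two data sets quoted above; `median_by_index` is a made-up name, and in practice `statistics.median` from the standard library does the same job.

```python
def median_by_index(values):
    """Median via the (n + 1)/2 position in the ordered sample (1-based)."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2
    if position.is_integer():
        # Odd n: a single middle observation.
        return ordered[int(position) - 1]
    # Even n: midpoint of the two observations on either side of the position.
    below = ordered[int(position) - 1]
    above = ordered[int(position)]
    return (below + above) / 2

incomes = [15400, 15600, 15900, 16400, 16400, 16600, 225000]
economic_activity = [66, 68, 71, 78, 78, 79, 79, 81, 82, 85]

income_median = median_by_index(incomes)
activity_median = median_by_index(economic_activity)
print(income_median, activity_median)
```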

median compared to mean

The median is usually more appropriate than the mean when the distribution is very highly skewed, as we observed with the Leonardo's Pizza employee incomes. The mean can be greatly affected by outliers, whereas the median is not. For the mean we need quantitative (interval-scale) data. The median also applies for ordinal scales. To use the mean for ordinal data, we must assign scores to the categories. In Table 3.5, if we assign scores 10, 12, 13, 14, 16, 18, and 20 to the categories of highest degree, representing approximate number of years of education, we get a sample mean of 13.7. The mean uses the numerical values of the observations, not just their ordering. (p. 39: the median has its own disadvantages) - For discrete data that take relatively few values, quite different patterns of data can have the same median. The most extreme form of this problem occurs for binary data, which can take only two values, such as 0 and 1. The median equals the more common outcome, but gives no information about the relative number of observations at the two levels. For instance, consider a sample of size 5 for the variable, number of times married. The observations (1, 1, 1, 1, 1) and the observations (0, 0, 1, 1, 1) both have a median of 1. The mean is 1 for (1, 1, 1, 1, 1) and 3/5 for (0, 0, 1, 1, 1). For binary (0, 1) data, proportion = mean: When observations take values of only 0 or 1, the mean equals the proportion of observations that equal 1. Generally, for highly discrete data, the mean is more informative than the median. In summary, - If a distribution is highly skewed, the median is better than the mean in representing what is typical. - If the distribution is close to symmetric or only mildly skewed or if it is discrete with few distinct values, the mean is usually preferred over the median, because it uses the numerical values of all the observations.
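The binary-data example can be checked with the standard library. This is a short sketch using the two samples quoted above; the variable names are made up for illustration.

```python
import statistics

all_ones = [1, 1, 1, 1, 1]
mixed = [0, 0, 1, 1, 1]

# Both samples share the same median.
medians = (statistics.median(all_ones), statistics.median(mixed))

# The means differ, and for 0/1 data each mean equals the proportion of 1s.
means = (statistics.mean(all_ones), statistics.mean(mixed))
proportion_of_ones = mixed.count(1) / len(mixed)

print(medians, means, proportion_of_ones)
```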

properties of the median

The median, like the mean, is appropriate for quantitative variables. Since it requires only ordered observations to compute it, it is also valid for ordinal-scale data, as the previous example showed. It is not appropriate for nominal-scale data, since the observations cannot be ordered. For symmetric distributions, such as in Figure 3.5, the median and the mean are identical. To illustrate, the sample of observations 4, 5, 7, 9, and 10 is symmetric about 7; 5 and 9 fall equally distant from it in opposite directions, as do 4 and 10. Thus, 7 is both the median and the mean. For skewed distributions, the mean lies toward the longer tail relative to the median. The median is insensitive to the distances of the observations from the middle, since it uses only the ordinal characteristics of the data. For example, the following four sets of observations all have medians of 10 - p. 39. The median is not affected by outliers. For instance, the incomes of the seven Leonardo's Pizza employees have a median of $16,400 whether the largest observation is $20,000, $225,000, or $2,000,000.
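
A small sketch of these two properties, assuming the six smaller Leonardo's Pizza incomes listed elsewhere in these notes ($15,400 through $16,600); the variable names are illustrative only.

```python
from statistics import mean, median

# Six smaller incomes; only the largest observation is varied below.
base_incomes = [15400, 15600, 15900, 16400, 16400, 16600]

for largest in (20000, 225000, 2000000):
    incomes = base_incomes + [largest]
    # The median stays at the 4th ordered value; the mean moves with the outlier.
    print(largest, "median:", median(incomes), "mean:", round(mean(incomes)))

# Symmetric sample: median and mean coincide.
symmetric = [4, 5, 7, 9, 10]
print(median(symmetric), mean(symmetric))
```

The median stays at 16400 in all three cases while the mean changes, and for the symmetric sample both measures equal 7.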

Standard Deviation

The most useful measure of variability is based on the deviations of the data from their mean. Deviation: The deviation of an observation yi from the sample mean ȳ (read "y-bar") is (yi − ȳ), the difference between them. Each observation has a deviation. The deviation is positive when the observation falls above the mean. The deviation is negative when the observation falls below the mean. The interpretation of ȳ as the center of gravity of the data implies that the sum of the positive deviations equals the negative of the sum of negative deviations. Thus, the sum of all the deviations about the mean equals 0. Because of this, measures of variability use either the absolute values or the squares of the deviations. The most popular measure uses the squares. The variance is approximately the average of the squared deviations. The units of measurement are the squares of those for the original data, since it uses squared deviations. This makes the variance difficult to interpret. It is why we use instead its square root, the standard deviation. The larger the deviations, the larger the sum of squares and the larger s tends to be - p. 43. Although its formula looks complicated, the most basic interpretation of the standard deviation s is simple: s is a sort of typical distance of an observation from the mean. So, the larger the standard deviation, the greater the spread of the data.
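
The steps above (deviations, squared deviations, square root) can be sketched as follows. One assumption is labeled in the code: the sum of squares is divided by n − 1, the usual sample formula. This card only says "approximately the average", but the n − 1 divisor reproduces the values of s quoted in these notes for the Leonardo's Pizza incomes. The helper name `sample_std` is made up for illustration.

```python
from math import sqrt

incomes = [15400, 15600, 15900, 16400, 16400, 16600, 225000]


def sample_std(values):
    """Return (mean, deviations, standard deviation) for a sample."""
    n = len(values)
    y_bar = sum(values) / n
    deviations = [y - y_bar for y in values]
    # Assumption: divide by n - 1 (sample variance), not n.
    variance = sum(d ** 2 for d in deviations) / (n - 1)
    return y_bar, deviations, sqrt(variance)


y_bar, deviations, s = sample_std(incomes)
print(sum(deviations))        # deviations about the mean sum to 0
print(y_bar, round(s))        # 45900.0 78977

# Dropping the outlier shrinks both the mean and s.
y_bar_trim, _, s_trim = sample_std(incomes[:-1])
print(y_bar_trim, round(s_trim))  # 16050.0 489
```

`statistics.stdev` in the standard library uses the same n − 1 divisor.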

percentiles and quartiles

The pth percentile is the point such that p% of the observations fall below or at that point and (100 − p)% fall above it. Substituting p = 50 in this definition gives the 50th percentile. This is the median. The median is larger than 50% of the observations and smaller than the other (100 − 50) = 50%. In proportion terms, a percentile is called a quantile. The 50th percentile is the 0.50 quantile. Two other commonly used percentiles are the lower quartile and the upper quartile. The 25th percentile is called the lower quartile. The 75th percentile is called the upper quartile. One quarter of the data fall below the lower quartile. One quarter fall above the upper quartile. The quartiles result from the percentile definition when we set p = 25 and p = 75. The quartiles together with the median split the distribution into four parts, each containing one-fourth of the observations. See Figure 3.14. The lower quartile is the median for the observations that fall below the median, that is, for the bottom half of the data. The upper quartile is the median for the observations that fall above the median, that is, for the upper half of the data. The interquartile range (IQR) describes the spread of the middle half of the distribution. The median, the quartiles, and the maximum and minimum are five positions often used as a set to describe center and spread. Software can easily find these values as well as other percentiles.
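
A minimal sketch of the five positions, using the "median of each half" description above. Two points are assumptions rather than something stated in this card: when n is odd, the middle observation is left out of both halves, and the IQR is computed as upper quartile minus lower quartile. Software packages use several slightly different quartile conventions, so their output may differ. The data are the Western European economic activity values from the median card; `five_number_summary` is a made-up name.

```python
from statistics import median


def five_number_summary(values):
    """Min, lower quartile, median, upper quartile, max (needs n >= 2)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    # Assumption: for odd n, skip the middle observation in both halves.
    upper_half = ordered[half + 1:] if n % 2 else ordered[half:]
    return {
        "min": ordered[0],
        "lower_quartile": median(lower_half),
        "median": median(ordered),
        "upper_quartile": median(upper_half),
        "max": ordered[-1],
    }


activity = [66, 68, 71, 78, 78, 79, 79, 81, 82, 85]
summary = five_number_summary(activity)
iqr = summary["upper_quartile"] - summary["lower_quartile"]
print(summary)
print("IQR:", iqr)
```

For this sample the quartiles are 71 and 81, so the middle half of the data spans 10 units.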

Analyzing more than 2 variables

Correlation does not imply causality or "causation". This section has introduced analyzing associations between two variables. One important lesson from later in the text is that just because two variables have an association does not mean there is a causal connection. For example, the correlation for Table 3.9 between the Internet use and the fertility rate is −0.48. But having more people using the Internet need not be the reason the fertility rate tends to be lower (e.g., because people are on the Internet rather than doing what causes babies). Perhaps high values on Internet use and low values on fertility are both a by-product of a nation being more economically advanced. Most studies have several variables. The second half of this book (Chapters 10-16) shows how to conduct multivariate analyses. For example, to study what is associated with the number of good friends, we might want to simultaneously consider gender, age, whether married, educational level, whether attend religious services regularly, and whether live in urban or rural setting.

frequency distribution and Relative Frequency Distribution for Violent Crime Rates

note mutually exclusive active here

describing variability of data

p. 41 intro - range - standard deviation

descriptive statistics

Statistical methods are descriptive or inferential. The purpose of descriptive statistics is to summarize data, to make it easier to assimilate the information. Quantitative variables also have two key features to describe numerically: - The center of the data: a typical observation - The variability of the data: the spread around the center. Most importantly, the mean describes the center and the standard deviation describes the variability. Association: how certain values for one variable may tend to go with certain values of the other. For quantitative variables, the correlation describes the strength of the association, and regression analysis predicts the value of one variable from a value of the other variable.

describing the center of the data

These statistics describe the center of a frequency distribution for a quantitative variable. They show what a typical observation is like. Mean: - The best known and most commonly used measure of the center is the mean - The mean is the sum of the observations divided by the number of observations. - It is often called the average - p. 35 (an example). Sample observations on a variable y are denoted by y1 for the first observation, y2 for the second, and so forth. For example, for female economic activity in Western Europe, n = 10, and the observations are y1 = 79, y2 = 79, ..., y10 = 81. The symbol ȳ for the sample mean is read as "y-bar." A bar over a letter represents the sample mean for that variable. For instance, x̄ represents the sample mean for a variable denoted by x.
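
The definition of the mean (sum of the observations divided by n) can be sketched as below. The ten economic activity values are taken in the ordered form listed in the median card, not in the original table order, which does not change the sum. The variable names are illustrative.

```python
# Female economic activity values for the Western European nations,
# listed here in ascending order.
activity = [66, 68, 71, 78, 78, 79, 79, 81, 82, 85]

n = len(activity)            # number of observations
y_bar = sum(activity) / n    # sum of the observations divided by n

print(n, y_bar)  # 10 76.7
```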

