BUSI-230 Probability & Statistics - Module 2

Decide which type of graph to use

- Bar graphs are useful for quantitative or qualitative data. With qualitative data, the frequency or percentage of occurrence can be displayed. With quantitative data, the measurement itself can be displayed, as was done in the bar graph showing life expectancy. Watch that the measurement scale is consistent or that a jump scale squiggle is used.
- Pareto charts identify the frequency of events or categories in decreasing order of frequency of occurrence.
- Circle graphs display how a total is dispersed into several categories. The circle graph is very appropriate for qualitative data, or any data for which percentage of occurrence makes sense. Circle graphs are most effective when the number of categories or wedges is 10 or fewer.
- Time-series graphs display how data change over time. It is best if the units of time are consistent in a given graph. For instance, measurements taken every day should not be mixed on the same graph with data taken every week.
- For any graph: Provide a title, label the axes, and identify units of measure. As Edward Tufte suggests in his book The Visual Display of Quantitative Information, don't let artwork or skewed perspective cloud the clarity of the information displayed.

How to compute the sample variance and sample standard deviation

- The variable x represents a data value or outcome.
- Mean, x bar = Σx / n: This is the average of the data values, or what you "expect" to happen the next time you conduct the statistical experiment. Note that n is the sample size.
- x - x bar: This is the difference between what happened and what you expected to happen. This represents a "deviation" away from what you "expect" and is a measure of risk.
- The expression Σ(x - x bar)² is called the sum of squares. The (x - x bar) quantity is squared to make it nonnegative. The sum is over all the data. If you don't square (x - x bar), then the sum Σ(x - x bar) is equal to 0 because the negative values cancel the positive values. This occurs even if some (x - x bar) values are large, indicating a large deviation or risk.
- The sample variance is s². The variance can be thought of as a kind of average of the (x - x bar)² values. However, for technical reasons, we divide the sum by the quantity n - 1 rather than n. Dividing by n - 1 makes s² an unbiased estimate of the population variance. The defining formula for the variance is s² = Σ(x - x bar)² / (n - 1). The computation formula for the variance is s² = [Σx² - (Σx)²/n] / (n - 1). Both formulas give the same result.
- The sample standard deviation is s. Why do we take the square root? Well, if the original x units were, say, days or dollars, then the s² units would be days squared or dollars squared (wow, what's that?). We take the square root to return to the original units of the data measurements. The standard deviation can be thought of as a measure of variability or risk. Larger values of s imply greater variability in the data. The defining formula for the standard deviation is s = square root of [Σ(x - x bar)² / (n - 1)]. The computation formula for the standard deviation is s = square root of {[Σx² - (Σx)²/n] / (n - 1)}. Both formulas give the same result.
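The claim that the defining formula and the computation formula give the same result can be checked numerically. The sketch below is only an illustration: the function names and the data set are made up for this example and do not come from the course material.

```python
import math

def variance_defining(data):
    """Defining formula: sum of squared deviations divided by n - 1."""
    n = len(data)
    x_bar = sum(data) / n
    return sum((x - x_bar) ** 2 for x in data) / (n - 1)

def variance_computation(data):
    """Computation formula: (sum of x^2 - (sum of x)^2 / n) divided by n - 1."""
    n = len(data)
    sum_x = sum(data)
    sum_x2 = sum(x * x for x in data)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)

# Hypothetical sample, chosen only so the arithmetic is easy to follow.
data = [2, 3, 3, 8, 10, 10]

s2_defining = variance_defining(data)        # 70 / 5
s2_computation = variance_computation(data)  # (286 - 216) / 5
s = math.sqrt(s2_defining)                   # back to the original units

print(s2_defining, s2_computation)  # 14.0 14.0
print(round(s, 2))                  # 3.74
```

The computation formula avoids building the column of deviations, which is why it is convenient for hand calculation.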

Features of a Bar Graph

1. Bars can be vertical or horizontal.
2. Bars are of uniform width and uniformly spaced.
3. The lengths of the bars represent values of the variable being displayed, the frequency of occurrence, or the percentage of occurrence. The same measurement scale is used for the length of each bar.
4. The graph is well annotated with title, labels for each bar, and vertical scale or actual value for the length of each bar.

How to make a box-and-whisker plot

1. Draw a vertical scale to include the lowest and highest data values.
2. To the right of the scale, draw a box from Q1 to Q3.
3. Include a solid line through the box at the median level.
4. Draw vertical lines, called whiskers, from Q1 to the lowest value and from Q3 to the highest value.

How to compute a 5% trimmed mean

1. Order the data from the smallest to the largest.
2. Delete the bottom 5% of the data and the top 5% of the data. Note: If the calculation of 5% of the number of data values does not produce a whole number, round to the nearest integer.
3. Compute the mean of the remaining 90% of the data.
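The three steps above can be sketched as a short function. This is a minimal illustration, not a reference implementation: the function name, the data, and the choice to round a tie such as 0.5 upward are assumptions (the steps only say "round to the nearest integer").

```python
import math

def trimmed_mean(data, percent=5):
    """Mean of the data after trimming `percent`% from each end."""
    ordered = sorted(data)                       # step 1: order the data
    n = len(ordered)
    # Step 2: number of values to delete from each end, rounded to the
    # nearest integer. Ties are rounded up here (an assumption); Python's
    # built-in round() would round ties to the nearest even integer.
    k = math.floor(n * percent / 100 + 0.5)
    if 2 * k >= n:
        raise ValueError("trimming would remove every data value")
    kept = ordered[k:n - k]
    return sum(kept) / len(kept)                 # step 3: mean of what is left

# Hypothetical data: 18 ordinary values plus one very low and one very high value.
data = [100, 1] + list(range(11, 29))

plain_mean = sum(data) / len(data)
trimmed = trimmed_mean(data, percent=5)

print(round(plain_mean, 1))  # 22.6
print(trimmed)               # 19.5
```

With 20 values, 5% is exactly one value, so the extreme low and high values are dropped and the trimmed mean is not pulled upward by the value 100.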

What Does a Box-and-Whisker plot Tell Us?

A box-and-whisker plot is a visual display of data spread around the median. It tells us:
• the high value, low value, first quartile, median, and third quartile;
• how the data are spread around the median;
• the location of the middle half of the data;
• if there are outliers (see Problem 12 of this section).

Box-and-whisker plot

A graph that displays the highest and lowest quarters of data as whiskers, the middle two quarters of the data as a box, and the median. We will use these five numbers to create a graphic sketch of the data called a box-and-whisker plot. Box-and-whisker plots provide another useful technique from exploratory data analysis (EDA) for describing data.

Trimmed Mean

A measure of center that is more resistant than the mean but still sensitive to specific data values is the trimmed mean. A trimmed mean is the mean of the data values left after "trimming" a specified percentage of the smallest and largest data values from the data set. Usually a 5% trimmed mean is used. This implies that we trim the lowest 5% of the data as well as the highest 5% of the data. A similar procedure is used for a 10% trimmed mean.

Percentile

A point on a ranking scale of 0 to 100. The 50th percentile is the midpoint; half the people in the population being studied rank higher and half rank lower. For whole numbers P (where 1 ≤ P ≤ 99), the Pth percentile of a distribution is a value such that P% of the data fall at or below it and (100 - P)% of the data fall at or above it. For all conventions, the data are first ranked or ordered from smallest to largest. A natural way to find the Pth percentile is to then find a value such that P% of the data fall at or below it. This will not always be possible, so we take the nearest value satisfying the criterion. It is at this point that there is a variety of processes to determine the exact value of the percentile.

Resistant Measure

A summary number that is not affected by outliers. The median is a resistant measure of center. The mean is not a resistant measure of center because we can make the mean as large as we want by changing the size of only one data value. The median, on the other hand, is more resistant. However, a disadvantage of the median is that it is not sensitive to the specific size of each data value.

Mean

An average that uses the exact value of each entry is the mean (sometimes called the arithmetic mean). To compute the mean, we add the values of all the entries and then divide by the number of entries. Usually used to calculate test scores. Mean = sum of all entries / number of entries

Time Series

Data sets composed of similar measurements from the same subject taken at regular intervals over time

Chebyshev's Theorem

For any set of data (either population or sample) and for any constant k greater than 1, the proportion of the data that must lie within k standard deviations on either side of the mean is at least 1 - 1/k². The concept of data spread about the mean can be expressed quite generally for all data distributions (skewed, symmetric, or other shapes) by using the remarkable theorem of Chebyshev.
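A short sketch of how the bound 1 - 1/k² behaves for a few values of k. The function names and the mean and standard deviation used in the interval example are hypothetical and only serve to illustrate the theorem.

```python
def chebyshev_min_proportion(k):
    """Smallest proportion of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2

def chebyshev_interval(mean, std_dev, k):
    """Interval from mean - k*std_dev to mean + k*std_dev."""
    return mean - k * std_dev, mean + k * std_dev

for k in (2, 3, 4):
    # Print the bound as a percentage, rounded for readability.
    print(k, round(100 * chebyshev_min_proportion(k), 1))

# Hypothetical example: mean 75, standard deviation 6, k = 2.
low, high = chebyshev_interval(75, 6, 2)
print(low, high)  # 63 87
```

For k = 2 the bound is 75%, so at least three quarters of any data set lies within two standard deviations of its mean, whatever the shape of the distribution.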

Mew

If your data comprise the entire population, we use the symbol µ (lowercase Greek letter mu, pronounced "mew") to represent the mean.

Time-Series Graph

In a time-series graph, data are plotted in order of occurrence at regular intervals over a period of time. A time-series graph is a graph showing data measurements in chronological order. To make a time-series graph, we put time on the horizontal scale and the variable being measured on the vertical scale. In a basic time-series graph, we connect the data points by line segments.

Distribution Shapes and Averages

In general, when a data distribution is mound-shaped symmetric, the values for the mean, median, and mode are the same or almost the same. For skewed-left distributions, the mean is less than the median and the median is less than the mode. For skewed-right distributions, the mode is the smallest value, the median is the next largest, and the mean is the largest.

Chapter 2 Summary

Organizing and presenting data are the main purposes of the branch of statistics called descriptive statistics. Graphs provide an important way to show how the data are distributed.
• Frequency tables show how the data are distributed within set classes. The classes are chosen so that they cover all data values and so that each data value falls within only one class. The number of classes and the class width determine the class limits and class boundaries. The number of data values falling within a class is the class frequency.
• A histogram is a graphical display of the information in a frequency table. Classes are shown on the horizontal axis, with corresponding frequencies on the vertical axis. Relative-frequency histograms show relative frequencies on the vertical axis. Ogives show cumulative frequencies on the vertical axis. Dotplots are like histograms, except that the classes are individual data values.
• Bar graphs, Pareto charts, and pie charts are useful to show how quantitative or qualitative data are distributed over chosen categories.
• Time-series graphs show how data change over set intervals of time.
• Stem-and-leaf displays are effective means of ordering data and showing important features of the distribution.
Graphs aren't just pretty pictures. They help reveal important properties of the data distribution, including the shape and whether or not there are any outliers.

What Do Graphs Tell Us?

Graphs provide a visual summary of data that tells us:
• how data are distributed over several categories or data intervals;
• how data from two or more data sets compare;
• how data change over time.

Measures of Variation

Right now, let's think about the steps that we went through to calculate the standard deviation. Uh, we began with our data set and then we...we calculated, uh, our mean. But...but to do that we needed the sum of the data values. So we found the sum of the data values, we calculated the mean, and then we set up the table. And in the table is a column for deviation from the mean and then squaring all of those deviations from the mean. And then we add those values, and we're ready to go with the calculation with the formulas. Now concerning the formulas, let's...let's think about our relationships here and...and how we can consolidate matters a little bit. A lot of times it's nice for us to talk about variance when we first talk about standard deviation. We...we want the logic flow of what we're doing because if...if we're trying to remember this in a...on a test, let's say, or later in life, let's say, or whatever, it's...it's important to go back to concepts in order for that memory to be trip-started. And here, if we think of standard deviation as average deviation from the mean, then the logic pattern flows through what we have done in this problem. Uh, but we can consolidate things quite a bit here. We notice that if this is an s squared and here's an s squared under the radical, then we can put the entire fraction under the radical, you see. So we can have one formula for standard deviation if we wish to consolidate in that fashion. Let's go through the...the situation with the other data values, the data values 14, 2. Here are the heights that we talked about earlier and they're in inches. And remember the first task is to add those items and then calculate x bar. So we would write the equation for x bar and then make the calculation. Then we figure out the deviation from the mean. And I'm putting a sum here just for emphasis; we know that the sum of those deviations from the mean is going to be 0. Then we square those values, and add those squared values. And now we're ready to apply to the formula. So it's the square root of 156 over 4. Don't forget, now, that in the denominator there, under the radical, it's...it's n minus 1. And n is 5 in this case, so the number is 4 in that denominator; 156 divided by 4 turns out to be, I think it's 39. The square root of 39 is approximately 6.24. So the standard deviation for team two is approximately 6.24 inches. Recall that the standard deviation for team one was 2 point...approximately 2.45 inches. And we're seeing a larger standard deviation for team two, which had a bigger range of values, you see, for those different heights. And, in general, we can say this: That the more variation there is in a data set, the larger the standard deviation. That's just a simple truism having to do with, uh...with statistics.
Notes:
- Column 1: the data values; find the sum of the data values, then calculate the mean.
- Column 2: deviation from the mean, x - x bar.
- Column 3: the squared deviations, (x - x bar)².
- x bar 375
- Variance: s² = Σ(x - x bar)² / (n - 1)
- Standard deviation: s = square root of s²
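The table procedure described in the transcript (data column, deviation column, squared-deviation column, then the formula) can be sketched as follows. The transcript does not list the individual heights, so the two data sets below are assumed values, chosen only because they reproduce the totals quoted above (a sum of squares of 156 with n = 5 for team two, and s of about 2.45 for team one). The function names are also made up for this sketch.

```python
import math

def deviation_table(data):
    """Return the mean and rows of (x, x - x_bar, (x - x_bar)^2)."""
    x_bar = sum(data) / len(data)
    rows = [(x, x - x_bar, (x - x_bar) ** 2) for x in data]
    return x_bar, rows

def sample_std_dev(data):
    """Square root of the sum of squared deviations over n - 1."""
    _, rows = deviation_table(data)
    sum_of_squares = sum(squared for _, _, squared in rows)
    return math.sqrt(sum_of_squares / (len(data) - 1))

# Assumed heights in inches (not given in the transcript).
team_one = [72, 73, 76, 76, 78]
team_two = [67, 72, 76, 76, 84]

x_bar_two, rows_two = deviation_table(team_two)
sum_dev_two = sum(dev for _, dev, _ in rows_two)     # deviations cancel out
ss_two = sum(squared for _, _, squared in rows_two)  # sum of squares

s_one = sample_std_dev(team_one)
s_two = sample_std_dev(team_two)

print(sum_dev_two, ss_two)  # 0.0 156.0
print(round(s_one, 2))      # 2.45
print(round(s_two, 2))      # 6.24
```

Both assumed teams have the same mean, so the difference in s comes entirely from how widely the heights are spread.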

Approximating x bar and s from grouped data

Sample Mean for a Frequency Distribution: x bar = Σxf / n, where x is the midpoint of a class, f is the number of entries in that class, and n = Σf.

Outliers

Some data sets include values so high or so low that they seem to stand apart from the rest of the data. These data are called outliers. Outliers may represent data collection errors, data entry errors, or simply valid but unusual data values. It is important to identify outliers in the data set and examine the outliers carefully to determine if they are in error. One way to detect outliers is to use a box-and-whisker plot. Data values that fall beyond the limits,

interquartile range

The difference between the upper and lower quartiles. The median, or second quartile, is a popular measure of the center utilizing relative position. A useful measure of data spread utilizing relative position is the interquartile range (IQR). It is simply the difference between the third and first quartiles: IQR = Q3 - Q1.

Mode

The easiest average to compute is the mode. The mode of a data set is the value that occurs most frequently. Note: If a data set has no single value that occurs more frequently than any other, then that data set has no mode. The mode is a useful average when we want to know the most frequently occurring data value, such as the most frequently requested shoe size.

Harmonic Mean

The mean of n numbers expressed as the reciprocal of the arithmetic mean of the reciprocals of the numbers: Harmonic mean = n / Σ(1/x). When data consist of rates of change, the harmonic mean is the appropriate average to use.
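A small sketch of the harmonic mean as "the reciprocal of the arithmetic mean of the reciprocals". The function name and the speed example are hypothetical and are only meant to show why this average suits rates.

```python
def harmonic_mean(values):
    """n divided by the sum of the reciprocals; all values must be positive."""
    if any(v <= 0 for v in values):
        raise ValueError("harmonic mean needs positive values")
    return len(values) / sum(1 / v for v in values)

# Hypothetical rates: the same distance driven at 40 mph and then at 60 mph.
speeds = [40, 60]

h_mean = harmonic_mean(speeds)
a_mean = sum(speeds) / len(speeds)

print(round(h_mean, 2))  # 48.0
print(a_mean)            # 50.0
```

Because more time is spent at the slower speed, the average speed over the whole trip is 48 mph, not the arithmetic mean of 50 mph.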

five number summary

The quartiles together with the low and high data values give us a very useful five-number summary of the data and their spread: lowest value, Q1, median, Q3, highest value.

Quartiles

The values that divide the data into four equal parts.
1. Order the data from smallest to largest.
2. Find the median. This is the second quartile, Q2.
3. The first quartile, Q1, is the median of the lower half of the data; that is, it is the median of the data falling below the Q2 position, not including Q2.
4. The third quartile, Q3, is the median of the upper half of the data; that is, it is the median of the data falling above the Q2 position, not including Q2.
The 1st quartile is the 25th percentile. The 2nd quartile is the 50th percentile (the median). The 3rd quartile is the 75th percentile. The highest quarter of the data lies above Q3.
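The quartile steps above can be sketched in a few lines, which also yields the five-number summary and the interquartile range. The function names and the data are hypothetical. Note that other software may use a different quartile convention and return slightly different values.

```python
def median_of(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(data):
    """Lowest value, Q1, median, Q3, highest value (needs at least 2 values)."""
    ordered = sorted(data)                 # step 1
    n = len(ordered)
    half = n // 2
    q2 = median_of(ordered)                # step 2
    lower = ordered[:half]                 # values below the Q2 position
    # For an odd count, skip the middle value so Q2 is not included.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return ordered[0], median_of(lower), q2, median_of(upper), ordered[-1]

# Hypothetical data set with nine values.
data = [12, 2, 20, 7, 8, 4, 15, 10, 5]

summary = five_number_summary(data)
low, q1, med, q3, high = summary
iqr = q3 - q1

print(summary)  # (2, 4.5, 8, 13.5, 20)
print(iqr)      # 9.0
```

These five numbers are exactly what a box-and-whisker plot draws: the box runs from Q1 to Q3, the line inside is the median, and the whiskers reach the low and high values.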

Stem and Leaf Displays - Video 2.3

There are a whole bunch of numbers in an array. Well, if we rearrange the array a little bit, maybe we can see some trends, and one way to arrange the information is to arrange it by size in particular categories.
Notes:
- Each data value is split into a stem and a leaf.
- The display needs an indicator of how to read the graph; for example, 3 | 2 represents 32 pounds.
- The most important thing is that the integrity of the data is preserved.

Bar Graphs

These are graphs that can be used to display quantitative or qualitative data.

Grouped Data

When data are grouped, such as in a frequency table or histogram, we can estimate the mean and standard deviation by using the following formulas. Notice that all data values in a given class are treated as though each of them equals the midpoint x of the class.

Geometric Mean

When data consist of percentages, ratios, compounded growth rates, or other rates of change, the geometric mean is a useful measure of central tendency. For n data values, assuming all data values are positive: Geometric mean = nth root of the product of the n data values.
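A minimal sketch of the geometric mean as the nth root of the product. The function name and the growth factors are hypothetical. `math.prod` requires Python 3.8 or later.

```python
import math

def geometric_mean(values):
    """nth root of the product of n positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean needs positive values")
    return math.prod(values) ** (1 / len(values))

# Hypothetical growth factors: +25% in one period, then +80% in the next.
growth_factors = [1.25, 1.8]

g_mean = geometric_mean(growth_factors)

print(round(g_mean, 4))  # 1.5
```

Growing by a factor of 1.5 in each of the two periods gives the same overall growth (1.5 × 1.5 = 2.25) as 1.25 × 1.8, which is why the geometric mean is the natural average for compounded rates.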

Cluster Bar Graphs

Bar graphs in which two or more bars are grouped together to compare two or more things, like the life expectancy of males vs. females.

Changing Scale

Whenever you use a change in scale in a graphic, warn the viewer by using a squiggle (a short zigzag break mark) on the changed axis. Sometimes, if a single bar is unusually long, the bar length is compressed with a squiggle in the bar itself.

Pareto Chart

a bar graph in which the bar height represents frequency of an event. In addition, the bars are arranged from left to right according to decreasing height.

standard deviation

a computed measure of how much scores vary around the mean score

Percentile - Video 3.3

a percentile number describes the percent of data of a data set which are less than the percentile number. So if a person scores in the 85th percentile, it simply means that 85 percent of the people taking that test scored lower than that person's score.

How to Find the Mean

add up all the numbers, then divide by how many numbers there are

Pareto Charts - Video 2.2

bars are arranged according to height. Pareto charts can come into play when the lower axis is not some kind of timeline, for example, where we have to list the lower axis in a particular fashion. And Pareto charts are useful in being able to identify the most frequently occurring items or the least frequently occurring items in a particular survey. Notes: list the causes in order from the one that happened most frequently to the least frequent, in a table with columns Cause and Frequency.

measures of variation

give information on the spread or variability or dispersion of the data values
- range
- sample standard deviation
- variance

x bar

mean of a sample of x values

Computation Formula for the Sample Standard Deviation

s = square root of { [Σx²f - (Σxf)²/n] / (n - 1) }, where x is the midpoint of a class, f is the number of entries in that class, n is the total number of entries in the distribution, and n = Σf. The summation Σ is over all classes in the distribution.

Sample Standard Deviation for a Frequency Distribution

s = square root of [ Σ(x - x bar)²f / (n - 1) ]
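The grouped-data formulas (x bar = Σxf / n, the defining formula for s, and the computation formula for s) can be compared on a small frequency table. The class midpoints and frequencies below are hypothetical, as are the function names.

```python
import math

def grouped_mean(classes):
    """x bar = sum of x*f divided by n, where n is the sum of the frequencies."""
    n = sum(f for _, f in classes)
    return sum(x * f for x, f in classes) / n

def grouped_std_defining(classes):
    """s from the sum of (x - x bar)^2 * f divided by n - 1."""
    n = sum(f for _, f in classes)
    x_bar = grouped_mean(classes)
    return math.sqrt(sum((x - x_bar) ** 2 * f for x, f in classes) / (n - 1))

def grouped_std_computation(classes):
    """s from (sum of x^2*f - (sum of x*f)^2 / n) divided by n - 1."""
    n = sum(f for _, f in classes)
    sum_xf = sum(x * f for x, f in classes)
    sum_x2f = sum(x * x * f for x, f in classes)
    return math.sqrt((sum_x2f - sum_xf ** 2 / n) / (n - 1))

# Hypothetical frequency table: (class midpoint x, frequency f).
classes = [(5, 2), (15, 5), (25, 3)]

x_bar = grouped_mean(classes)
s_defining = grouped_std_defining(classes)
s_computation = grouped_std_computation(classes)

print(x_bar)                    # 16.0
print(round(s_defining, 2))     # 7.38
print(round(s_computation, 2))  # 7.38
```

These are approximations to the mean and standard deviation of the raw data, because every value in a class is treated as if it were equal to the class midpoint.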

Sample Standard Deviation Formula

s = square root of [ Σ(x - x bar)² / (n - 1) ], which is the square root of the sample variance s²

Variance

standard deviation squared. Tells us the square of the standard deviation. As such, it is also a measure of data spread around the mean.

Range

the difference between the highest and lowest scores in a distribution. Range = Largest Value - Smallest Value. Tells us the difference between the highest data value and the lowest. It tells us about the spread of data but does not tell us if most of the data is or is not closer to the mean.

Median

the middle score in a distribution; half the scores are above it and half are below it. Another average that is useful is the median, or central value, of an ordered distribution. When you are given the median, you know there are an equal number of data values in the ordered distribution that are above it and below it.
How to find the median: The median is the central value of an ordered distribution. To find it,
1. Order the data from smallest to largest.
2. For an odd number of data values in the distribution, Median = middle data value.
3. For an even number of data values in the distribution, Median = sum of middle two values / 2.

Weighted Mean - Video 3.1

the notion of weighted mean, and weighted mean refers to the, the idea that some data values are more important than others. And a very simple example of it would be a situation where a, a teacher would like to give more value to a particular test. Maybe to the final exam. Maybe we want the final exam to count double. Weights: Quizzes = 1, Tests = 4 x's quizzes, Project = 8 x's mean. In the numerator is the sum of items where we multiply a test score times the weight, and it's the sum of all of these products, you see. So it's X times W plus X times W plus X times W and so forth, and in the denominator, we're just adding all of the weight values (how many weight values do we have here). Weighted mean = Σxw / Σw
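The calculation described in the video (sum of score times weight in the numerator, sum of the weights in the denominator) might look like this. The scores and weights are hypothetical, with the final exam counting double as in the spoken example, and the function name is made up for this sketch.

```python
def weighted_mean(values, weights):
    """Sum of x*w divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("each value needs exactly one weight")
    return sum(x * w for x, w in zip(values, weights)) / sum(weights)

# Hypothetical scores: two regular tests and a final exam that counts double.
scores = [70, 80, 95]
weights = [1, 1, 2]

w_mean = weighted_mean(scores, weights)
plain_mean = sum(scores) / len(scores)

print(w_mean)                # 85.0
print(round(plain_mean, 2))  # 81.67
```

The weighted mean is higher than the plain mean here because the strongest score carries the largest weight.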

Sample Standard Deviation

the square root of the sample variance. In statistics, the sample standard deviation and sample variance are used to describe the spread of data about the mean x bar.

Coefficient of Variation (CV)

the standardized measure of the risk per unit of return; calculated as the standard deviation divided by the expected return, which expresses the standard deviation as a percentage of the sample or population mean. The coefficient of variation can be thought of as a measure of the spread of the data relative to the average of the data. If x bar and s represent the sample mean and sample standard deviation, respectively, then the sample coefficient of variation CV is defined to be CV = (s / x bar) * 100%. For a population with mean µ and standard deviation σ, CV = (σ / µ) * 100%. Notice that the numerator and denominator in the definition of CV have the same units, so CV itself has no units of measurement. This gives us the advantage of being able to directly compare the variability of two different populations using the coefficient of variation.
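The point that CV has no units can be illustrated by computing it for the same measurements in two different units. The heights below are hypothetical, and the helper name is made up; `statistics.mean` and `statistics.stdev` are from the Python standard library, and `stdev` uses the n - 1 divisor.

```python
import statistics

def sample_cv(data):
    """Sample coefficient of variation as a percentage: s / x bar * 100."""
    return statistics.stdev(data) / statistics.mean(data) * 100

# Hypothetical heights in inches, then the same heights converted to centimeters.
heights_in = [67, 72, 76, 76, 84]
heights_cm = [h * 2.54 for h in heights_in]

cv_in = sample_cv(heights_in)
cv_cm = sample_cv(heights_cm)

print(round(cv_in, 2))  # 8.33
print(round(cv_cm, 2))  # 8.33
```

The standard deviation changes when the units change, but the CV does not, which is what makes it usable for comparing data sets measured on different scales.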

Sum of Squares

the sum of each score's squared deviation from the mean

population mean, µ

the sum of the values in the population divided by the population size: µ = Σx / N, where N is the number of data values in the population and x represents the individual data values of the population.

Chapter 3 Summary

To characterize numerical data, we use both measures of center and of spread.
• Commonly used measures of center are the arithmetic mean, the median, and the mode. The weighted average and trimmed mean are also used as appropriate.
• Commonly used measures of spread are the variance, the standard deviation, and the range. The variance and standard deviation are measures of spread about the mean.
• Chebyshev's theorem enables us to estimate the data spread about the mean.
• The coefficient of variation lets us compare the relative spreads of different data sets.
• Other measures of data spread include percentiles, which indicate the percentage of data falling at or below the specified percentile value.
• Box-and-whisker plots show how the data are distributed about the median and the location of the middle half of the data distribution.

Summation Notation

using the Greek letter "sigma" (Σ) to indicate a sum; for example, Σx means the sum of all the x values

Circle Graphs or Pie Charts

wedges of a circle visually display proportional parts of the total population that share a common characteristic. It is relatively safe from misinterpretation and is especially useful for showing the division of a total quantity into its component parts. The total quantity, or 100%, is represented by the entire circle. Each wedge of the circle represents a component part of the total. These proportional segments are usually labeled with corresponding percentages of the total.

Population Standard Deviation Formula σ

σ = square root of [ Σ(x - µ)² / N ], where N is the number of data values in the population and x represents the individual data values of the population.

population variance σ²

σ² = Σ(x - µ)² / N

What do averages tell us?

• The mode tells us the single data value that occurs most frequently in the data set. The value of the mode is completely determined by the data value that occurs most frequently. If no data value occurs more frequently than all the other data values, there is no mode. The specific values of the less frequently occurring data do not change the mode.
• The median tells us the middle value of a data set that has been arranged in order from smallest to largest. The median is affected by only the relative position of the data values. For instance, if a data value above the median (or above the middle two values of a data set with an even number of data) is changed to another value above the median, the median itself does not change.
• The mean tells us the value obtained by adding up all the data and dividing by the number of data. As such, the mean can change if just one data value changes. On the other hand, if data values change, but the sum of the data remains the same, the mean will not change.

