Statistics: Descriptive- Chapter 2

Shapes of distribution

The graphs of frequency distributions can have different shapes: symmetric, uniform, skewed left (negatively skewed), or skewed right (positively skewed)

box-and-whisker plot (or boxplot)

highlights important features of a data set. Five values (the five-number summary) must be known in order to construct a box-and-whisker plot

bimodal

if two data entries are tied for the greatest frequency, each entry is a mode and the data set is called bimodal

population variance

in a population data set of N entries, the mean of the squares of the deviations. Population variance = σ² = Σ(x - μ)² / N = sum of the squared deviations / population size

interquartile range (IQR)

measures the spread of the middle 50% of the data. The calculation is: IQR = Q3 - Q1

organize data sets by grouping the data into:

intervals called classes and forming a frequency distribution

frequency histogram

is a bar graph that represents the frequency distribution of a data set. *class limits must be changed to class boundaries so that consecutive bars touch each other *to find the boundaries, first find the distance between the upper limit of one class and the lower limit of the next class, then divide it by 2. Subtract this half-distance from each lower class limit and add it to each upper class limit.

outlier

is a data entry that is far removed from the other entries in the data set. Example: for the data values {3, 6, 4, 3, 1, 6, 77}, the data entry 77 is an outlier

cumulative frequency graph (ogive)

is a line graph that displays the cumulative frequency of each class at its upper class boundary. *the upper boundaries are marked on the horizontal axis, and the cumulative frequencies are marked on the vertical axis

frequency polygon

is a line graph that emphasizes the continuous change in frequencies

frequency distribution

is a table that shows classes or intervals of data entries with a count of the number of entries in each class *used to show patterns or trends in a data set

measure of central tendency

is a value that represents a typical, or central, entry of a data set

range

is the difference between the maximum and minimum entries in a data set Range = Maximum - Minimum

class width

is the distance between lower (or upper) limits of consecutive classes

lower class limit

is the least number that can belong to the class

frequency f of a class:

is the number of data entries in the class

a z-score of -2.5 is considered very unusual (T/F)

False. A z-score of -2.5 is considered unusual

range

difference between the maximum and minimum data entries

cumulative frequency (of a class)

is the sum of the frequencies of that class and all previous classes. the cumulative frequency of the last class is equal to the sample size n
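
A minimal Python sketch of the running total described above. The class frequencies are made up for illustration and are not from the chapter.

```python
from itertools import accumulate

# Hypothetical class frequencies, listed from the lowest class to the highest.
frequencies = [5, 8, 6, 8, 5]

# Running total: each entry is the frequency of that class plus all previous classes.
cumulative = list(accumulate(frequencies))

print(cumulative)                          # cumulative frequency of each class
print(cumulative[-1] == sum(frequencies))  # the last value equals the sample size n
```

These running totals are the heights plotted at the upper class boundaries of an ogive.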

characteristics to look for when organizing and describing a data set

its center, its variability (or spread), and its shape

positive z-score

means that the x-value is greater than the mean

negative z-score

means that the x-value is less than the mean

z-score of exactly zero

means x-value is equal to the mean

standard deviation

a measure of the typical amount an entry deviates from the mean *a large standard deviation indicates that the data are spread out away from the mean, while a small standard deviation indicates that the data are clustered close to the mean

How do outliers affect the central tendencies?

the mean is heavily influenced by outliers because they are included in the calculation * the median is usually not influenced by outliers much since outliers usually fall at the beginning or end of the data set. * the mode is usually not influenced by outliers since they typically occur just once

weighted mean

the mean of a data set whose entries have varying weights. x̅ = Σ(x * w) / Σw = Σ(data values * weights) / Σ weights
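
A short Python sketch of the weighted mean formula. The function name `weighted_mean` and the scores and weights are hypothetical, chosen only to show the arithmetic.

```python
def weighted_mean(values, weights):
    """Return sum(x * w) / sum(w) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(x * w for x, w in zip(values, weights)) / total_weight

# Hypothetical example: three scores weighted 5 : 2 : 3.
scores = [86, 96, 82]
weights = [5, 2, 3]
print(weighted_mean(scores, weights))  # (430 + 192 + 246) / 10
```

When every weight is the same, the result reduces to the ordinary mean.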

Z-scores

the number of standard deviations that a value, x, falls from the mean, μ. The formula is: z = (value - mean) / standard deviation = (x - μ) / σ. For bell-shaped data, z-scores usually fall between -2 and 2, since about 95% of the data lie in this range according to the Empirical Rule. A z-score outside this range occurs about 5% of the time and is considered unusual
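
A small Python sketch of the z-score calculation, with the subtraction done before the division. The helper names `z_score` and `is_unusual` and the sample numbers are illustrative assumptions. The cutoff of 2 follows the card above.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    if std_dev <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / std_dev

def is_unusual(z):
    """Per the card above: a z-score outside the range -2 to 2 is unusual."""
    return abs(z) > 2

# Hypothetical example: a value of 70 when the mean is 80 and sigma is 4.
z = z_score(70, 80, 4)
print(z, is_unusual(z))
```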

class boundaries

the numbers that separate classes without forming gaps between them *if data entries are integers, subtract 0.5 from each lower limit to find the lower class boundaries. *to find the upper class boundaries, add 0.5 to each upper limit. * the upper boundary of a class will equal the lower boundary of the next higher class
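
A minimal Python sketch of the boundary rule, assuming integer data so that consecutive classes are 1 unit apart. The function name `class_boundaries` and the `gap` parameter are hypothetical.

```python
def class_boundaries(lower_limit, upper_limit, gap=1):
    """Return (lower boundary, upper boundary) for one class.

    gap is the distance between the upper limit of one class and the
    lower limit of the next class (1 for integer data, an assumption here).
    """
    half_gap = gap / 2
    return lower_limit - half_gap, upper_limit + half_gap

# Two consecutive classes: 17-31 and 32-46.
first = class_boundaries(17, 31)
second = class_boundaries(32, 46)
print(first, second)
```

The upper boundary of the first class matches the lower boundary of the second, so the histogram bars touch.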

population standard deviation

the square root of the population variance, for a population data set of N entries. Population standard deviation = σ = √σ² = √( Σ(x - μ)² / N )

sample standard deviation

the square root of the sample variance *used to estimate the population standard deviation. Sample standard deviation = s = √s² = √( Σ(x - x̅)² / (n - 1) )

After constructing an expanded frequency distribution, what should the sum of the relative frequencies be? Explain

the sum of the relative frequencies must be 1 (or 100%) because it is the sum of all portions or percentages of the data.

Σ (sigma)

the uppercase Greek letter sigma ( Σ) is used throughout statistics to indicate a summation of values

what is the advantage of using a stem-and-leaf plot instead of a histogram? What is a disadvantage?

unlike the histogram, the stem-and-leaf plot still contains the original data values. However, some data are difficult to organize in a stem-and-leaf plot

Five Number Summary

used to create a box-and-whisker plot 1. minimum 2. Q1 3. Q2, or median 4. Q3 5. Maximum
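
A Python sketch of the five-number summary. It assumes one common textbook convention: Q1 and Q3 are the medians of the lower and upper halves of the ordered data, with the middle entry left out of both halves when the number of entries is odd. Other quartile conventions give slightly different values. The function name `five_number_summary` is hypothetical.

```python
from statistics import median

def five_number_summary(data):
    """Return (minimum, Q1, Q2, Q3, maximum) using the median-of-halves rule."""
    ordered = sorted(data)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two entries")
    half = n // 2
    lower_half = ordered[:half]
    # For an odd count, skip the middle entry so it belongs to neither half.
    upper_half = ordered[half + 1:] if n % 2 else ordered[half:]
    return (ordered[0], median(lower_half), median(ordered),
            median(upper_half), ordered[-1])

sample = [4, 15, 9, 12, 16, 8, 11, 19, 14]
minimum, q1, q2, q3, maximum = five_number_summary(sample)
iqr = q3 - q1  # interquartile range
print(minimum, q1, q2, q3, maximum, iqr)
```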

Describe the difference between the calculation of the population standard deviation and that of the sample standard deviation

when calculating the population standard deviation, you divide the sum of the squared deviations by N, then take the square root of that value. When calculating the sample standard deviation, you divide the sum of the squared deviations by n - 1, then take the square root of that value.
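
The two calculations can be sketched in Python as below. The function names are hypothetical. The data sets are the sample and population exercises from this set.

```python
import math

def sum_of_squares(data, mean):
    """Sum of the squared deviations from the given mean."""
    return sum((x - mean) ** 2 for x in data)

def population_std(data):
    """Divide the sum of squares by N, then take the square root."""
    n = len(data)
    mu = sum(data) / n
    return math.sqrt(sum_of_squares(data, mu) / n)

def sample_std(data):
    """Divide the sum of squares by n - 1, then take the square root."""
    n = len(data)
    if n < 2:
        raise ValueError("a sample needs at least two entries")
    x_bar = sum(data) / n
    return math.sqrt(sum_of_squares(data, x_bar) / (n - 1))

sample = [4, 15, 9, 12, 16, 8, 11, 19, 14]
population = [9, 5, 9, 10, 11, 12, 7, 7, 8, 12]
print(round(sample_std(sample), 1), round(population_std(population), 1))
```

The only difference between the two functions is the divisor.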

Sum of squares

when you add the squares of the deviations, you compute a quantity called the sum of squares, denoted SSx

upper class limit

is the greatest number that can belong to the class

construct a sample data set for which n = 7, x̅ = 9, and s = 0

{9, 9, 9, 9, 9, 9, 9}

Construct the described data set: median and mode are the same

Example: 1, 2, 2, 2, 3 (answers may vary)

Construct the described data set: Mean is not representative of a typical number in the data set

Example: 2, 5, 7, 9, 35 (answers may vary)

Some quantitative data sets do not have medians (T/F)

False. Every quantitative data set has a median.

The 50th percentile is equivalent to Q1 (T/F)

False. The 50th percentile is equivalent to Q2

the mean and median of a data set are both fractiles. (T/F)

False. the median of a data set is a fractile, but the mean may or may not be a fractile depending on the distribution of the data

In a frequency distribution, the class width is the distance between the lower and upper limits of a class (T/F)

False: Class width is the difference between lower or upper limits of consecutive classes

An ogive is a graph that displays relative frequencies (T/F)

False: an ogive is a graph that displays cumulative frequencies.

How is a Pareto chart different from a standard vertical bar graph?

In a Pareto chart, the height of each bar represents frequency or relative frequency and the bars are positioned in order of decreasing height with the tallest bar at the left

Find the range, mean, variance, and standard deviation of the sample data set: 4 15 9 12 16 8 11 19 14

Range = 15 x̅ = 12 s squared = 21 s = 4.6

Find the range, mean, variance, and standard deviation of the population data set: 9 5 9 10 11 12 7 7 8 12

Range = 7 μ = 9 σ squared = 4.8 standard deviation = 2.2

min= 17, max= 135, 8 classes; what is the class width, the lower class limits, and the upper class limits

class width=15; lower class limits: 17,32,47,62,77,92,107,122; upper class limits: 31,46,61,76,91,106,121,136

relative frequency histogram

*has the same shape and the same horizontal axis as a frequency histogram. The only difference is that the vertical axis measures relative frequencies instead of frequencies *uses the same information as a frequency histogram but compares each class to the total number of entries

median

*the data entry located in the middle of an ordered data set. *the data is arranged in order from least to greatest so that the data value in the middle can be determined. *measures the center of an ordered data set by dividing it into two equal parts. If the data set has an odd number of entries, the median is the middle data entry. if the data set has an even number of entries, the median is the mean of the two middle data entries.
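
A minimal Python sketch of the odd and even cases described above. The function name `median_of` is hypothetical. The standard library function `statistics.median` follows the same rule.

```python
def median_of(data):
    """Middle entry of the ordered data, or the mean of the two middle entries."""
    ordered = sorted(data)  # arrange from least to greatest
    n = len(ordered)
    if n == 0:
        raise ValueError("data set is empty")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd count: the middle entry
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: mean of the two middle entries

print(median_of([3, 6, 4, 3, 1, 6, 77]))  # odd number of entries
print(median_of([1, 2, 3, 4]))            # even number of entries
```

In the first data set the outlier 77 sits at the end of the ordered list, so it does not move the median.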

relative frequency (of a class)

*is the portion or percentage of the data that falls in that class. *To find the relative frequency of a class, divide the frequency f by the sample size n. relative frequency= class frequency / sample size = f/n * the sample size is the total number of data values *you can write as a fraction, decimal, or percent. *the sum of the relative freq of all classes should be equal to 1, or 100%. *due to rounding, the sum may be slightly less than or greater than 1. Such as 0.99 or 1.01
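
A short Python sketch of f/n for each class. The class frequencies are hypothetical.

```python
# Hypothetical class frequencies.
frequencies = [5, 8, 6, 8, 5, 4, 4]

n = sum(frequencies)                    # sample size
relative = [f / n for f in frequencies] # relative frequency of each class

print(relative)
print(sum(relative))  # should be 1, up to rounding
```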

mean

*the average * is the sum of the data entries divided by the number of entries. There are two types of mean: the population mean and the sample mean

frequency histogram (properties)

* has the following properties: 1. the horizontal scale is quantitative and measures the data values 2. the vertical scale measures the frequencies of the class 3. consecutive bars must touch

sample variance

* for a sample of n entries; used to estimate the population variance. sample variance = s² = Σ(x - x̅)² / (n - 1) = sum of the squared deviations / (sample size - 1)

midpoint (of a class)

* the center of each class *is the sum of the lower and upper limits of the class divided by two. The midpoint is sometimes called the class mark. midpoint = (lower class limit + upper class limit) / 2

dot plot

each data entry is represented by a point that is placed above an axis. *with a dot plot, the specific data entries can be determined, and it displays the distribution of the data

stem-and-leaf plot

* a way to display quantitative data *contains the original data values * an easy way to display and sort data *two-digit numbers are separated into a stem (the first digit) and a leaf (the second digit) * if an entry has more than two digits, the stem is typically all but the last digit and the leaf is the last digit *similar to a histogram but still contains the original data values
Example: data entries 382, 397, and 398
Stem | Leaf
38 | 2
39 | 7 8
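
A Python sketch of the stem and leaf split, assuming non-negative integer data where the leaf is the last digit. The function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(data):
    """Map each stem (all but the last digit) to its sorted list of leaves."""
    plot = defaultdict(list)
    for value in sorted(data):
        stem, leaf = divmod(value, 10)  # leaf is the last digit
        plot[stem].append(leaf)
    return dict(plot)

plot = stem_and_leaf([382, 397, 398])
for stem, leaves in plot.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```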

frequency distribution (guidelines for constructing)

1. decide on the number of classes. This will be given in each problem. 2. find the class width. class width = range / number of classes. Always round up to the next whole number (EX: 2.8 rounds up to 3. 2.3 also rounds up to 3) 3. find the class limits. Use the minimum as the lower limit of the first class. Then add the class width to the lower limit of the previous class to get the lower limit of the next class. For integer data, the upper limit of each class is one less than the lower limit of the next class. Classes canNOT overlap. 4. tally the data by marking where each data entry belongs in the table. 5. count the tally marks to find the frequency of each class
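
The steps above can be sketched in Python for integer data. The function names are hypothetical. One assumption goes beyond the card: when range / number of classes is already a whole number, the sketch still moves up to the next whole number so that the maximum fits in the last class.

```python
def class_limits(minimum, maximum, num_classes):
    """Return (class width, lower limits, upper limits) for integer data."""
    # Round range / num_classes up to the next whole number.
    # Assumption: an exact quotient is also bumped up by one.
    width = (maximum - minimum) // num_classes + 1
    lower = [minimum + i * width for i in range(num_classes)]
    upper = [lo + width - 1 for lo in lower]  # one less than the next lower limit
    return width, lower, upper

def tally(data, lower, upper):
    """Count how many entries fall in each class (limits are inclusive)."""
    return [sum(lo <= x <= hi for x in data) for lo, hi in zip(lower, upper)]

width, lower, upper = class_limits(9, 64, 7)
print(width, lower, upper)
print(tally([9, 10, 17, 30, 64], lower, upper))  # hypothetical data entries
```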

sample mean (x̅, read as "x" bar)

The mean of a data set that includes only a portion (a sample) of the entries in a group. x̅ = Σx / n = sum of the data / sample size

in terms of displaying data, how is a stem-and-leaf plot similar to a dot plot?

Both allow you to see how data are distributed, to determine specific data entries, and to identify unusual data values

What is the difference between class limits and class boundaries?

Class limits determine which numbers can belong to each class. Class boundaries are the numbers that separate classes without forming gaps between them.

min=9, max=64, 7 classes; what is the class width, the lower class limits, and the upper class limits?

Class width=8; lower class limits: 9,17,25,33,41,49,57; upper class limits: 16,24,32,40,48,56,64

population mean ( μ, pronounced mu)

The mean of a numerical set that includes all the numbers within the entire group. μ = Σ x / N= sum of the data / number of data entries

A data set can have the same mean, median, and mode. (T/F)

True

The mean is the measure of central tendency most likely to be affected by an outlier. (T/F)

True

When each data class has the same frequency, the distribution is symmetric (T/F)

True

the second quartile is the median of an ordered data set (T/F)

True

Pareto chart

a bar graph where the heights of each bar represent the frequency or relative frequency of each category. the bars are in order from the tallest to the shortest. Commonly seen in newspaper and on tv

pie chart

a circle that is divided into sectors that represent categories. the area of each sector is proportional to the frequency of each category

gaps

a data set can have one or more outliers, causing gaps in distribution

time series chart

a time series is a data set composed of quantitative entries taken at regular intervals of time; a time series chart graphs these entries against time

paired data sets

a data set that contains points, where the x-value of one set is paired with a y-value of another set *each entry in one data set corresponds to one entry in a second data set, the sets are called paired data sets

clusters

a distribution can have several gaps caused by outliers, or clusters of data. * they occur when several types of data are included in the one data set

scatter plot

a graph of data points with an x-axis and y-axis. *one way to graph paired data sets, where the ordered pairs are graphed as points in a coordinate plane. *used to show the relationship between two quantitative variables

symmetric

a vertical line can be drawn down the center of the graph, and the graph is about the same on both sides. * the mean, median, and mode are equal

uniform

all classes have the same frequency. * the mean and median are equal. *a uniform distribution is also symmetric

fractiles

numbers that divide an ordered data set into equal parts * median is a fractile b/c it divides a data set into two equal parts

deviation

of an entry, x, in a population data set is the difference between the entry and the mean μ of the data set deviation = x - μ

what are some benefits of representing data sets using frequency distributions?

organizing the data into a frequency distribution may make patterns within the data more evident

name some ways to display qualitative data graphically

pie chart, Pareto chart

Discuss the similarities and the differences between the Empirical Rule and Chebychev's Theorem

Similarities: both estimate proportions of the data contained within k standard deviations of the mean. Differences: the Empirical Rule assumes the distribution is bell-shaped; Chebychev's Theorem makes no such assumption

what are some benefits of using graphs of frequency distribution?

sometimes it is easier to identify patterns in a data set by looking at a graph of the frequency distribution

name some ways to display quantitative data graphically.

stem-and-leaf plot, dot plot, histogram, scatter plot, time series chart

Chebychev's Theorem

gives a minimum for the portion of data that lies within k standard deviations of the mean (k > 1): at least 1 - 1/k² * k = 2: in any data set, at least 75% of the data lie within 2 standard deviations of the mean *k = 3: in any data set, at least 88.9% of the data lie within 3 standard deviations of the mean
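
A minimal Python sketch of the 1 - 1/k² calculation. The function name `chebychev_minimum` is hypothetical. The result is a lower bound on the portion of data, not an exact portion.

```python
def chebychev_minimum(k):
    """Minimum portion of data within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2

for k in (2, 3):
    print(k, round(chebychev_minimum(k) * 100, 1), "% at least")
```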

what is an advantage of using the range as a measure of variation? what is the disadvantage?

the advantage of the range is that it is easy to calculate. the disadvantage is that it uses only two entries from the data set

mode

the data entry that occurs with the greatest frequency. * a data set can have one mode, more than one mode, or no mode.
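
A Python sketch that covers the three cases (one mode, more than one mode, no mode). The function name `modes` is hypothetical. It treats a data set in which no entry repeats as having no mode.

```python
from collections import Counter

def modes(data):
    """Return the list of modes; an empty list means the data set has no mode."""
    counts = Counter(data)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []  # no entry is repeated, so there is no mode
    return sorted(value for value, count in counts.items() if count == highest)

print(modes([1, 2, 2, 2, 3]))  # one mode
print(modes([1, 1, 2, 2, 3]))  # two modes (bimodal)
print(modes([1, 2, 3]))        # no mode
```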

skewed left (negatively skewed)

the graph has a tail that elongates to the left. * a tail has values with a lower frequency. *the mean is less than the median which is less than the mode

skewed right (positively skewed)

the graph has a tail that elongates to the right. *the mode is less than the median which is less than the mean

Why is the standard deviation used more frequently than the variance? (Hint: Consider the units of the variance)

the units of variance are squared, so they are hard to interpret (example: dollars squared). The standard deviation has the same units as the data.

quartiles

the three numbers that divide an ordered data set into four equal parts. first quartile: 1/4 of the data is on or below Q1. second quartile, or median: 1/2 of the data is on or below Q2. third quartile: 3/4 of the data is on or below Q3

Empirical Rule

used for data with a bell-shaped distribution, which has the following characteristics: *about 68% of the data lie between (x̅ - s, x̅ + s), or within one standard deviation of the mean *about 95% of the data lie between (x̅ - 2s, x̅ + 2s), or within two standard deviations of the mean *about 99.7% of the data lie between (x̅ - 3s, x̅ + 3s), or within three standard deviations of the mean

