Business Statistics

Survey

-To encourage respondents to participate, an effective survey will state its purpose in the beginning -Personal demographic questions are often last, when respondents feel more comfortable with the process

Discrete Data

-Values are WHOLE numbers (integers) -Usually COUNTED, not measured -number of complaints per day -number of TVs in a household -number of rings before the phone is answered

Mean of Binomial Distribution

*Represents the long-term average number of successes to expect based on the number of trials conducted: μ = np

Stem and Leaf Display

*SPLITS the data values into STEMS (the LARGER place values) and LEAVES (the SMALLER place value)
*By listing all of the LEAVES to the RIGHT of each stem, we can graphically describe how the data are distributed
-All the original data points are visible on the display
-Easy to construct by hand
-Provides a histogram-like view of the distribution
1) Sort the data from lowest to highest
2) Determine the unique stem values (7, 8, and 9 are the different stem values in this example)
3) List the stems in a vertical column and then add the leaf values to the right of the appropriate stem, in ascending order
Sorted data: 78, 78, 79, 79, 79, 80, 80, 80, 80, 81, 81, 82, 83, 83...
7 | 8 8 9 9 9
8 | 0 0 0 0 1 1 2 3 3 4 4 4 5 6 7 8
9 | 0 2 5
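
The three construction steps can be sketched in Python. This is only an illustration: the function name `stem_and_leaf` is made up here, the full list of scores is read back from the display on this card, and the sketch assumes two-digit whole numbers and skips stems that have no leaves.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return display lines such as '7 | 8 8 9' for two-digit whole numbers."""
    groups = defaultdict(list)
    for v in sorted(values):            # step 1: sort lowest to highest
        stem, leaf = divmod(v, 10)      # tens digit is the stem, ones digit the leaf
        groups[stem].append(leaf)       # step 2: collect leaves under each unique stem
    # step 3: one line per stem, leaves listed to the right in ascending order
    return [f"{stem} | {' '.join(map(str, leaves))}"
            for stem, leaves in sorted(groups.items())]

scores = [78, 78, 79, 79, 79, 80, 80, 80, 80, 81, 81, 82,
          83, 83, 84, 84, 84, 85, 86, 87, 88, 90, 92, 95]
lines = stem_and_leaf(scores)
print("\n".join(lines))
```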

Measures of Variability

*Show how much SPREAD is present in the data. 1) Range 2) Variance (for a sample or population) 3) Standard Deviation (for a sample or population)

Range

*Simplest measure of variation
-Difference between the highest value and the lowest value in a data set
ADVANTAGES:
-Easy to calculate and understand
DISADVANTAGES:
-Only based on TWO numbers in the data set (ignores the way in which data are distributed)
-Sensitive to outliers:
ex: 1, 2, 3, 4, 1000 Range = 999
ex: 1, 2, 3, 4, 5 Range = 4
Only one value changed, yet the range moved from 4 to 999, so the range does NOT accurately reflect the overall VARIABILITY of the data

Empirical Rule

According to the Empirical Rule, if a distribution follows a BELL-SHAPED, SYMMETRICAL curve centered around the MEAN, we would expect: 1) Approximately 68% of the values to fall within ± 1 standard deviations from the mean 2) Approximately 95% of the values to fall within ± 2 standard deviations from the mean 3) Approximately 99.7% of the values to fall within ± 3 standard deviations from the mean

Advantages and Disadvantages of Using MEAN to Summarize Data

Advantages: Simple to calculate -Summarizes the data with a single value Disadvantages: With only a summary value you LOSE information about the original data -Just knowing the mean does not help you know what the underlying data looks like -The value of the mean is sensitive to OUTLIERS (values that are much higher or lower than most of the data)

Standard Deviation

The standard deviation is the square root of the variance Has the same units as the original data *A measure of HOW FAR on average each data value is FROM the MEAN of the sample Sample SD=STDEV.S(data values) Population SD=STDEV.P(data values) *Is a common measure of CONSISTENCY in business applications, such as quality control *The standard deviation measures the amount of VARIABILITY AROUND the mean -The standard deviation is AFFECTED by the SCALE of the data -When sample means are DIFFERENT comparing standard deviations can be MISLEADING
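
As a rough Python counterpart to STDEV.S and STDEV.P, the standard library's `statistics` module offers `stdev` (divides by n - 1) and `pstdev` (divides by n). The four ages below are borrowed from the Central Limit Theorem card, where σ = 2.236 is quoted for the same values.

```python
import statistics

ages = [18, 20, 22, 24]

sample_sd = statistics.stdev(ages)        # sample formula, like STDEV.S
population_sd = statistics.pstdev(ages)   # population formula, like STDEV.P

print(round(sample_sd, 3), round(population_sd, 3))
```

The sample value is larger because the same sum of squared deviations is divided by n - 1 instead of n.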

Standard Normal Distribution

When the original random variable, x, follows the Normal distribution, Z-Scores also follow a Normal distribution with μ = 0 and σ = 1

Point Estimate

Is a SINGLE value that best describes the population of interest -The sample MEAN is a point estimate of the UNKNOWN population mean -The sample PROPORTION is a point estimate of the unknown population proportion Point estimates are easy to calculate but do not provide any information about their accuracy An INTERVAL ESTIMATE provides additional information about VARIABILITY

Quantitative Data

It is COUNTED: Examples: Number of Children, Defects per hour (Counted items) It is MEASURED Examples: Weight, Voltage (Measured characteristics) 1) INTERVAL: Meaningful differences; NO true zero point -Example: Calendar Year (2014, 2015) 2) RATIO: Meaningful differences; TRUE zero point -Example: Income ($48,000, $0)

Nonsampling Error

Occur as a result of issues such as -ambiguous survey questions -questions that lead respondents to a certain "correct" answer -data collection errors *These are errors not related to sampling variability

Formulas for Expressing Z-Score in terms of X

Population: x = μ + zσ
Sample: x = x̄ + zs
QUESTION: For a symmetric bell shaped population with a mean of 20 and a standard deviation of 3, what interval will contain about 95% of all the values?
-By the Empirical Rule, approximately 95% of the values fall within ± 2 standard deviations from the mean, so z = 2
x = μ + zσ = 20 + (2)(3) = 26
x = μ - zσ = 20 - (2)(3) = 14
ANSWER: About 95% of the values will fall between 14 and 26
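
The same calculation as a small Python sketch. The helper name `interval_around_mean` is invented for this illustration.

```python
def interval_around_mean(mean, sd, z):
    """Return (mean - z*sd, mean + z*sd), the x values z standard deviations from the mean."""
    return (mean - z * sd, mean + z * sd)

# Mean 20, standard deviation 3, z = 2 for the "about 95%" band of the Empirical Rule
low, high = interval_around_mean(20, 3, 2)
print(low, high)
```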

Sample Correlation Coefficient (Rxy)

Rxy measures both the STRENGTH and DIRECTION of the linear relationship between two variables Rxy = Sxy/(Sx·Sy) -The values of r range from -1.0, a perfect negative linear relationship, to +1.0, a perfect positive linear relationship -When r = 0, there is NO LINEAR relationship between variables x and y (a nonlinear relationship may still exist)
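
A minimal Python sketch of Rxy = Sxy/(Sx·Sy). It assumes the usual sample covariance with an n - 1 divisor, which this card does not spell out. The function names and the five data pairs are hypothetical, chosen only to exercise the formula.

```python
from math import sqrt

def sample_covariance(xs, ys):
    """Sxy: sum of (x - x_bar)(y - y_bar) divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

def sample_correlation(xs, ys):
    """Rxy = Sxy / (Sx * Sy); the covariance of a variable with itself is its variance."""
    s_x = sqrt(sample_covariance(xs, xs))
    s_y = sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)

x = [1, 2, 3, 4, 5]      # hypothetical independent variable
y = [2, 4, 5, 4, 5]      # hypothetical dependent variable
s_xy = sample_covariance(x, y)
r_xy = sample_correlation(x, y)
print(s_xy, round(r_xy, 4))
```

Dividing by Sx·Sy removes the units, which is why r is always between -1 and +1 while the covariance depends on the scale of the data.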

Formula for the Variance: Grouped Data

Sample: s^2 = Σ[(m - x̄)^2 · f]/(n - 1), where m = class midpoint, f = class frequency, x̄ = mean of the grouped data, and n = Σf

Sampling Error

Sampling error is defined as the DIFFERENCE between the sample STATISTIC and the population PARAMETER -For the mean: Sampling error = x̄ - μ -A LARGER sample size will provide a SMALLER average sampling error

Example: Probabilities for a NEGATIVE Z-Score

Suppose that μ = 12 and σ = 3 for a normal distribution
*Find P(x ≤ 8.5)
P(x ≤ 8.5) = P(z ≤ (8.5 - 12)/3) = P(z ≤ -1.17) = 0.1210
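
The table lookup can be checked with `statistics.NormalDist` from Python's standard library. This sketch computes the probability twice: once with z rounded to two decimals, as a printed table requires, and once without rounding.

```python
from statistics import NormalDist

mu, sigma, x = 12, 3, 8.5

z = round((x - mu) / sigma, 2)            # tables list z to two decimals
p_table = NormalDist().cdf(z)             # standard normal area to the left of z
p_exact = NormalDist(mu, sigma).cdf(x)    # same probability without rounding z

print(z, round(p_table, 4), round(p_exact, 4))
```

The two results differ slightly in the third decimal place because the table forces z to be rounded.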

Sample Covariance (Sxy)

Sxy measures the DIRECTION of the linear relationship between two variables *A relationship is LINEAR if the SCATTER plot of the independent and dependent variables has a straight-line pattern -A POSITIVE covariance implies a positive linear relationship (as one variable increases, the second variable also tends to increase) -A NEGATIVE covariance indicates a negative linear relationship (as one variable increases, the second variable tends to decrease) -A covariance close to ZERO indicates NO LINEAR relationship between the two variables

Shapes of Frequency Distribution

Symmetric: Mean=Median Skewed: Left or Right Skewed -Left Skewed: Mean< Median -Right Skewed: Mean> Median

Cumulative Relative Frequency Distribution

TOTALS the proportion of observations that are LESS THAN or EQUAL TO the class at which you are looking -Shows the ACCUMULATED proportion as values VARY from low to high

Five-Number Summary

consists of these five values: 1) The minimum value 2) The first quartile 3) The second quartile 4) The third quartile 5) The maximum value *Note that outliers ARE INCLUDED in the Five-number summary

Frequency Distribution

shows the NUMBER of data observations that fall into specific INTERVALS -Graphically summarize information not readily observable by merely looking at data in a table

Sample

refers to a portion of the population that is representative of the population from which it was selected

Population

represents all possible subjects that are of interest in a particular study

Data

values assigned to observations or measurements -Raw facts or measurements of interest

Cross Section Data

values collected from a number of subjects during a SINGLE time period

Time Series Data

values that correspond to specific measurements taken over a RANGE of time periods

Mean of Grouped Data

x̄ = Σ(f·m)/n, where f = frequency, m = midpoint, and n = Σf
Only an approximate value since the MIDPOINT is just an ESTIMATE of the value in each class
-Example: An online merchant has collected the following grouped data for the number of web pages viewed by a sample of its customers:
# of Pages          Frequency (f)
1 to under 5        6
5 to under 9        12
9 to under 13       10
13 to under 17      4
-The merchant would like to calculate the AVERAGE number of viewed pages.
1. Find the midpoint (m) for each class: 3, 7, 11, 15
2. Calculate the MEAN: [6(3)+12(7)+10(11)+4(15)]/(6+12+10+4) = 272/32 = 8.5
*The average number of viewed pages is about 8.5
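
The web-page example can be reproduced in Python. As an extension that is not worked on this card, the sketch also applies the grouped-data sample variance formula (squared deviations of the midpoints, weighted by frequency, divided by n - 1) to the same classes. Variable names are illustrative only.

```python
# (lower bound, upper bound) and frequency for each class on the card
classes = [((1, 5), 6), ((5, 9), 12), ((9, 13), 10), ((13, 17), 4)]

midpoints = [(low + high) / 2 for (low, high), _ in classes]
freqs = [f for _, f in classes]
n = sum(freqs)

# Mean of grouped data: sum of f*m divided by n
grouped_mean = sum(f * m for f, m in zip(freqs, midpoints)) / n

# Sample variance of grouped data: sum of f*(m - mean)^2 divided by n - 1
grouped_variance = sum(f * (m - grouped_mean) ** 2
                       for f, m in zip(freqs, midpoints)) / (n - 1)

print(midpoints, grouped_mean, round(grouped_variance, 4))
```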

Calculating Probabilities for Normal Distributions Using Normal Probability Tables

*Any Normal distribution (with any mean and standard deviation combination) can be transformed into the Standard normal distribution (z) -Need to transform x units into z units -The resulting z value is called a Z-Score

Continuous Random Variables

*Are outcomes that take on ANY numerical value in an INTERVAL, as determined by conducting an experiment -Usually MEASURED rather than counted -EXAMPLES of continuous data include time, distance, and weight -can take on ANY value WITHIN a specified interval *Because there are an INFINITE number of possible values, the probability of ONE specific value occurring is theoretically equal to ZERO -Probabilities are based on INTERVALS, not individual values -----Probability is represented by an area UNDER the probability distribution *Continuous Probability Distributions can have a VARIETY of SHAPES

Interquartile Range (IQR)

*Describes the spread of the middle 50% of the data -Find the IQR by subtracting the first quartile from the third quartile: IQR = Q3 - Q1

Percentage Polygon

*Graphs the MIDPOINT of each class as a LINE rather than a column -The HEIGHT of each midpoint represents the relative frequency of the CORRESPONDING class -Used to COMPARE the shape of two or more distributions on one graph

Z-Score

*Identifies the number of STANDARD DEVIATIONS a particular value is FROM the MEAN of its distribution
*A Z-Score has NO UNITS
-ZERO for values EQUAL to the MEAN
-POSITIVE for values ABOVE the MEAN
-NEGATIVE for values BELOW the MEAN
-A data value that has a Z-Score ABOVE +3 or BELOW -3 is categorized as an OUTLIER (has a value FAR from the MEAN)
Formula: z = (x - μ)/σ for a population, z = (x - x̄)/s for a sample
Question: How far is 670 from the sample mean of 776.3 (in standard deviation increments)?
Answer: 670 is 0.276 standard deviations below the mean

Percentile Rank

*Identifies the percentile of a particular value within a set of data -Formula to find the approximate percentile RANK for a value x: Percentile Rank= [[(Number of Values below x)+0.5]/(Total Number of Values)]*100 -QUESTION: What is the percentile rank for the car with 35.8 MPG?
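
A short Python sketch of the percentile rank formula. The MPG data for the question is not reproduced on this card, so the example reuses the 11 values from the Quartiles card instead. The function name `percentile_rank` is made up here.

```python
def percentile_rank(data, x):
    """[(number of values below x) + 0.5] / (total number of values) * 100"""
    below = sum(1 for v in data if v < x)
    return (below + 0.5) / len(data) * 100

sample = [11, 12, 13, 16, 16, 17, 18, 21, 22, 22, 25]
rank_of_18 = percentile_rank(sample, 18)   # six values lie below 18
print(round(rank_of_18, 1))
```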

Box-and-Whisker Plot

*Is a graphical display showing the relative position of the three QUARTILES as a BOX on a number LINE -It also shows the MINIMUM and MAXIMUM values in the data set and any outliers

Probability Density Function

*Is a mathematical description of a probability distribution -represents the RELATIVE DISTRIBUTION of FREQUENCY of a Continuous Random Variable

Expected Monetary Value (EMV)

*Is the MEAN of a discrete probability distribution when the discrete random VARIABLE is expressed in terms of DOLLARS -The EMV represents a LONG-TERM AVERAGE, as if outcomes from the distribution occurred many times

Percentiles

*Measure the approximate percentage of values in the data set that are BELOW the value of interest
-The pth percentile of a data set (where p is any number between 1 and 100) is the value that at least p percent of the observations will fall below
Examples: 20% of the data values are below the 20th percentile; 73% of the data values are below the 73rd percentile
*To find percentiles MANUALLY:
-Sort the data from LOWEST to HIGHEST
-Calculate the index point, i: i = (p/100)n, where p = percentile of INTEREST and n = # of data values
-----If i is NOT a WHOLE number, ROUND i UP to the next whole number. The ith position represents our VALUE of interest
-----If i IS a WHOLE number, the MIDPOINT BETWEEN the values in the ith and (i + 1) positions is our VALUE of interest
***i is NOT the value of the percentile, it is the POSITION of the percentile value in the ranked data
EXAMPLE: Miles per gallon were recorded for a sample of 12 cars. The ranked values are shown below. What is the value of the 60th percentile?
i = (p/100)n = (60/100)12 = 7.2, round up to 8
Position 8 equals 31.1 MPG = 60th percentile
*EXCEL: =PERCENTILE.EXC(array, k) array = the data range of interest, k = the percentile of interest, between 0 and 1 exclusive
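
The manual index-point rule can be sketched in Python as follows. The function name `percentile_value` is hypothetical, p is assumed to be a whole number from 1 to 99, and the data comes from the Quartiles card because the MPG values are not listed here. Note that Excel's PERCENTILE.EXC interpolates between positions, so it will not always return the same value as this rule.

```python
def percentile_value(data, p):
    """Value at the p-th percentile by the index-point rule (whole-number p, 1 to 99)."""
    ranked = sorted(data)
    n = len(ranked)
    scaled = p * n                      # i = (p/100) * n, kept as a whole-number numerator
    if scaled % 100 == 0:               # i is a whole number
        i = scaled // 100
        return (ranked[i - 1] + ranked[i]) / 2   # midpoint of positions i and i + 1
    i = scaled // 100 + 1               # i is not whole: round up
    return ranked[i - 1]                # positions are counted from 1

sample = [11, 12, 13, 16, 16, 17, 18, 21, 22, 22, 25]
p25 = percentile_value(sample, 25)      # i = 2.75, rounds up to position 3
p50 = percentile_value(sample, 50)      # i = 5.5, rounds up to position 6
print(p25, p50)
```

Working with `p * n` as an integer avoids floating-point surprises when checking whether i is a whole number.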

Estimated Class Width

*Once k is known, the WIDTH of each class can be found **The width is the RANGE of numbers to put into EACH CLASS Estimated Class Width= [(Max Data Value)-(Min Data Value)]/(k) *Round this estimate to a useful whole number that makes the frequency distribution more readable -There is no one correct answer for the class width -The goal is to create a histogram to clearly and usefully show the pattern in the data -Often there is more than one acceptable way to accomplish this

Sampling and Non-Sampling Errors

*PARAMETERS: are values that describe some characteristic of a POPULATION, such as its mean or median -Values calculated using population data *STATISTICS: are values calculated from a SAMPLE, such as the sample's mean or median -Values computed from sample data -Statistics will vary from sample to sample -A sample statistic is NOT likely to be exactly EQUAL to the population parameter, since only a portion of the population is in the sample

Scatter Plots

*Provide a picture of the relationship between TWO data points that are PAIRED together -The DEPENDENT VARIABLE, which is placed on the VERTICAL axis of the scatter plot, is influenced by changes in the INDEPENDENT VARIABLE, which is placed on the HORIZONTAL axis

Quartiles

*Split the ranked data into 4 equal groups:
-The FIRST quartile (Q1) is the value that constitutes the 25th percentile
-The SECOND quartile (Q2) is the value that constitutes the 50th percentile
--Note that the second quartile (the 50th percentile) is the MEDIAN
-The THIRD quartile (Q3) is the value that constitutes the 75th percentile
EXAMPLE: Find the first quartile.
-Sample Data: 11, 12, 13, 16, 16, 17, 18, 21, 22, 22, 25 (n=11)
i = (p/100)n = (25/100)11 = 2.75, round up to 3
The value in the 3rd position is 13, so Q1 = 13
EXCEL: =QUARTILE.EXC(array, quart) array = data range of interest, quart = 1, 2, or 3 (for first, second, or third quartile)

Chebyshev's Theorem

*States that for any number z GREATER than 1, the percent of the values that fall within z standard deviations above and below the mean will be at least [1 - (1/z^2)] × 100
-Applies REGARDLESS of the SHAPE of the distribution
At least 75% of the data values will fall within ±2 standard deviations around the mean
At least 89% of the data values will fall within ±3 standard deviations around the mean
At least 94% of the data values will fall within ±4 standard deviations around the mean
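
The three percentages follow directly from the formula, as this small Python sketch shows. The function name `chebyshev_min_percent` is made up for the illustration.

```python
def chebyshev_min_percent(z):
    """Minimum percent of values within z standard deviations of the mean (z > 1)."""
    if z <= 1:
        raise ValueError("Chebyshev's Theorem requires z greater than 1")
    return (1 - 1 / z ** 2) * 100

for z in (2, 3, 4):
    print(z, round(chebyshev_min_percent(z), 2))
```

The card's 89% and 94% are the rounded values of 88.89% and 93.75%.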

Working with Grouped Data

*Suppose data has already been summarized by a frequency distribution **The individual data values are no longer shown **Only grouped data is available -To ESTIMATE the AVERAGE for the FREQUENCY DISTRIBUTION: --Find the MIDPOINT for each GROUP (The midpoint is the halfway point in each group) --Use the midpoint as a REPRESENTATIVE VALUE for that group

Which Measure of Central Tendency Should You Use?

*The MEAN is generally used, unless extreme values (outliers) exist -If outliers are PRESENT, the MEDIAN is often used, since the median is NOT sensitive to outliers For example, median home prices may be reported for a region; it is less sensitive to outliers -For categorical data, the MODE is the best choice

Measures of Association Between Two Variables

*The goal of this section is to examine TWO DESCRIPTIVE statistics that MEASURE the LINEAR relationship between two variables 1) Sample Covariance 2) Sample Correlation Coefficient

Using the Normal Distribution to Approximate the Binomial Distribution

*The normal distribution can be used as an approximation to the binomial distribution -Normal probabilities are easy to look up in Appendix A, Tables 3 and 4 -Binomial probabilities are MORE DIFFICULT to calculate -The normal distribution approximation can be used when the sample size is large enough so that np ≥ 5 and nq ≥ 5 EXAMPLE: Suppose that 15% of people are left-handed.

Features of Z-Scores

*Z-Scores are NEGATIVE for values of x that are LESS than the distribution MEAN *Z-Scores are POSITIVE for values of x that are MORE than the distribution mean -The Z-Score AT the MEAN of the distribution EQUALS zero

Biased Sample

*a sample that does not represent the intended population -can lead to distorted findings -biased sampling can occur intentionally or unintentionally -results can be manipulated by how we ask questions and who is responding to them

Pareto Charts

*are BAR CHARTS that show the frequency of the categories that cause QUALITY CONTROL PROBLEMS. -Show quality problem categories in DECREASING order -The MOST problematic categories are shown FIRST -Pareto charts also plot the cumulative relative frequency as a LINE on the chart known as an OGIVE *The categories are arranged from MOST FREQUENT to least frequent

Bar Charts

*are a good tool for displaying QUALITATIVE data that have been ORGANIZED in categories -Can be arranged in a vertical or horizontal orientation

Pie Charts

*are another excellent tool for comparing proportions for categorical data -Each SEGMENT of the pie represents the RELATIVE FREQUENCY of one category -All categories in the data set must be included in the pie -Use a pie chart to compare the relative sizes of all possible categories -Bar charts are MORE USEFUL when you want to highlight the ACTUAL DATA VALUES and when the classes combined DON'T form a whole

Cumulative Percentage Polygon (Ogive)

*is a LINE GRAPH that plots the cumulative relative frequency distribution Percentage polygons and cumulative percentage polygons can be created using PHStat

Line Chart

*is a SCATTER PLOT in which the data points in the scatter plot are CONNECTED with line segments -Often used with TIME SERIES DATA -When graphing a time series the convention is to place the TIME DATA on the HORIZONTAL axis

Mode

*is the value that appears MOST often in a data set -If no data value or category occurs more than once, then we say that the mode DOES NOT exist -More than one mode can exist if two or more values tie for most frequent -The mode is a particularly useful way to describe categorical data

Inferential Statistics

*making CLAIMS or conclusions about the data based on a SAMPLE -Making statements about a population by examining sample results Observed sample statistic (known) -->(INFERENCE)---> Estimated population parameter (unknown, but can be estimated from sample evidence)

Coefficient of Variation (CV)

*measures the standard deviation in terms of its PERCENTAGE of the MEAN *HIGH CV indicates high VARIABILITY relative to the SIZE of the mean *LOW CV indicates low VARIABILITY relative to the SIZE of the mean (GOOD) *A SMALLER coefficient of variation indicates MORE CONSISTENCY within a set of data values. CV = (s/x̄)(100)

Contingency Tables

*provide a format to display observations that have MORE than ONE VALUE associated with them -Use ROWS and COLUMNS for separate variables to summarize the data efficiently

Discrete Probability Distributions

-A listing of ALL the possible outcomes of an experiment for a discrete random variable -along with the relative frequency of each outcome *A discrete probability distribution meets the following conditions: -Each outcome in the distribution needs to be MUTUALLY EXCLUSIVE with other outcomes in the distribution -The probability of each outcome, P(x), must be BETWEEN 0 and 1 (inclusive): -The SUM of the probabilities for all the outcomes in the distribution must be 1

Continuous Data

-Can potentially take on ANY value, depending only on the ability to MEASURE ACCURATELY -Often MEASURED, fractional values ARE POSSIBLE -thickness of an item -time required to complete a task -temperature of a solution -height, in inches

Qualitative Data

-Classified by DESCRIPTIVE terms Examples: Marital Status, Political Party, Eye Color (Defined categories) 1) NOMINAL: arbitrary labels for data. No ranking allowed -Examples: Zip Codes (75033) 2) ORDINAL: ranking allowed. No measurable meaning to the number differences -Example: Education level (master's degree, doctorate)

Random Variables

-DISCRETE Random Variables: Have outcomes that typically take on WHOLE numbers as a result of conducting an EXPERIMENT -CONTINUOUS Random Variables: Have outcomes that take on ANY NUMERICAL VALUE as a result of conducting an experiment How many data values can be found in a specific interval? -Discrete Random Variables: a FINITE number of values -Continuous Random Variables: an INFINITE number of values

Class Frequencies

-Find class frequencies by counting and recording the number of observations in each class -this is easier when the data are sorted

Formulas for Limits of Outliers

-Formulas for the Upper and Lower Limits of Outliers -UPPER LIMIT= Q3+1.5(IQR) -LOWER LIMIT= Q1-1.5(IQR) -Values beyond these limits are considered OUTLIERS
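
A sketch of the outlier limits in Python, using the 11 values from the Quartiles card. It relies on `statistics.quantiles` with `method="exclusive"`, which for this data set gives the same Q1 as the manual index-point calculation on that card. Agreement for other data sets is not guaranteed.

```python
import statistics

data = [11, 12, 13, 16, 16, 17, 18, 21, 22, 22, 25]

q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = [v for v in data if v < lower_limit or v > upper_limit]

print(q1, q3, iqr, lower_limit, upper_limit, outliers)
```

No value in this sample falls outside the limits, so the list of outliers is empty.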

Central Limit Theorem

-States that the sample means of LARGE-sized samples will be approximately normally distributed REGARDLESS of the SHAPE of their population distributions
-Sample means from samples of sufficient size, drawn from any population, will be approximately normally distributed
-In most cases, sample sizes of 30 or larger will result in sample means being approximately normally distributed, regardless of the shape of the population distribution
-If the population follows the normal probability distribution, the sample means will also be normally distributed, regardless of the size of the samples
-For any population, the average value of all possible sample means computed from all possible random samples of a given size from the population is equal to the population mean: μx̄ = μ
EXAMPLE: Population size N = 4
Random variable, x, is AGE of individuals
Values of x: 18, 20, 22, 24 (years)
μ = (18+20+22+24)/4 = 21
σ = 2.236
*The sampling distribution of the mean describes the pattern that sample averages tend to follow when randomly drawn from a population
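
The claim that the average of all possible sample means equals the population mean can be checked by brute force for the four ages in the example. The card does not state the sample size or sampling scheme, so this Python sketch assumes samples of size 2 drawn with replacement (16 possible samples).

```python
from itertools import product
from statistics import mean

population = [18, 20, 22, 24]

# Every possible ordered sample of size 2, drawn with replacement (an assumption)
samples = list(product(population, repeat=2))
sample_means = [mean(s) for s in samples]

mean_of_sample_means = mean(sample_means)
print(len(samples), mean_of_sample_means, mean(population))
```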

Standard Normal Probability Tables

-The column represents the second decimal digit of the desired z-score
-The row shows the value of z to the first decimal place
---Example: UPPER TAIL probabilities
The area under the normal curve equals 1.0, so P(z > 0.67) = 1 - P(z ≤ 0.67) = 1 - 0.7486 = 0.2514
---Example: Suppose that μ = 12 and σ = 3 for a normal distribution. Find the value x0 so that P(x ≤ x0) = 0.95
-Find the necessary z-score
-What z value is needed to include 95% of the area under the curve? Look in the body of the table for 0.9500
The value 0.9500 would be found in the 1.6 row and between the 0.04 and 0.05 columns. This means our point of interest is halfway between these columns at 1.6 + 0.045, or z = 1.645
*QUESTION: Find the x value that is 1.645 standard deviations above the mean: μ = 12 and σ = 3
x = μ + zσ = 12 + (1.645)(3) = 16.94
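
Reading the table "backwards" corresponds to an inverse cumulative lookup. A minimal Python sketch with `statistics.NormalDist.inv_cdf` from the standard library:

```python
from statistics import NormalDist

mu, sigma = 12, 3

z_95 = NormalDist().inv_cdf(0.95)            # z with 95% of the area to its left
x_95 = NormalDist(mu, sigma).inv_cdf(0.95)   # same point on the original x scale

print(round(z_95, 3), round(x_95, 2))
```

The software result agrees with the table interpolation (z = 1.645) to about three decimal places.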

Normal Probability Distribution

-The distribution is bell-shaped and symmetrical around the mean -Because the shape of the distribution is symmetrical, the mean and median are the same value -Values NEAR the mean, where the curve is the tallest, have a HIGHER likelihood of occurring than values FAR from the MEAN, where the curve is shorter -The total area under the curve is always equal to 1.0 -Because the distribution is SYMMETRICAL around the mean, the area to the LEFT of the mean equals 0.5, AS DOES the area to the RIGHT of the mean -The left and right ends of the normal probability distribution EXTEND INDEFINITELY *A distribution's mean (μ) and standard deviation (σ) completely describe its shape ----Changing μ shifts the distribution left or right ----Changing σ increases or decreases the spread (changes vertically)

Class Boundaries

-represent the MINIMUM and MAXIMUM values for each class -Choose class boundaries that are easy to read (use whole numbers)

Specific Discrete Probability Distributions

1) Binomial 2) Poisson 3) Hypergeometric

Primary Data Collection Methods

1) Direct Observation or Focus Group: Observing subjects in their natural environment -Example: Watching to see if drivers stop at a stop sign 2) Experiments: Treatments are applied in controlled conditions -Example: Crop growth from different plots using different fertilizers 3) Surveys or Questionnaires: Subjects are asked to respond to questions or discuss attitudes -Example: E-mail surveys to customers to assess service quality

Probability Distributions

1) Discrete Probability Distributions 2) Continuous Probability Distributions

Constructing a Box-and-Whiskers Plot

1) Draw a horizontal number line that spans the length of the data values 2) Draw a box above the number line extending from Q1 to Q3, with a center line at the median (Q2) 3) WHISKERS extend from the central box to the highest and lowest values that are not outliers 4) If outliers exist in the data set, they are plotted with an ASTERISK above the number line

Rules for Classes for Grouped Data

1) Equal-Size Classes: All classes in the frequency distribution must be of equal WIDTH 2) Mutually-Exclusive Classes: Class boundaries CANNOT overlap 3) Include all Data Values: Make sure all data values are accounted for in the total row of the frequency distribution 4) Avoid Empty Classes: It is undesirable for a histogram to display a class so narrow that there are no observations in it 5) Avoid Open-Ended Classes (if possible): These violate the first rule of equal class sizes

Why Sample?

1) Examining the entire population would be EXPENSIVE and TIME CONSUMING 2) Can't examine everything if the test is DESTRUCTIVE -If a sample is selected properly and the analysis performed correctly, sample information can be used to make an accurate assessment of the entire population

Specific Continuous Probability Distributions

1) Normal Probability Distribution: is useful when the data tend to fall into the CENTER of the distribution and when very HIGH and very LOW values are fairly RARE - shape is a bell curve 2) Exponential Probability Distribution: is used to describe data where LOWER values tend to DOMINATE and HIGHER values DON'T OCCUR very often - shape is a downward curved slope 3) Uniform Probability Distribution: describes data where ALL the values have the SAME CHANCE of OCCURRING - shape is a box

Binomial Distributions

1) The experiment consists of a FIXED number of trials, denoted by n
2) Each trial has only TWO possible outcomes, a SUCCESS or a FAILURE
3) The probability of a SUCCESS, p, and the probability of a FAILURE, q = 1 - p, are CONSTANT throughout the experiment
4) Each trial is INDEPENDENT of the other trials in the experiment
EXAMPLES of Binomial SETTINGS:
-A survey RESPONSE to a question is "yes I will buy" or "no I will not"
-An electronic component is either DEFECTIVE or ACCEPTABLE
-New job applicants either ACCEPT an offer or REJECT it
*The Binomial Probability Distribution is used to calculate the probability of a specific number of SUCCESSES (x) for a certain number of TRIALS (n), given specified PROBABILITY of SUCCESS (p) and probability of FAILURE (q)
*P(x,n) = The probability of observing x successes in n trials
EXCEL: =BINOM.DIST(x, n, p, cumulative)
x = Number of successes
n = Number of trials
p = Probability of a success
cumulative = FALSE, if you want to determine the probability of EXACTLY x successes occurring
cumulative = TRUE, if you want to determine the probability of x OR FEWER successes occurring
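
The two BINOM.DIST modes can be imitated in Python with the standard binomial formula, P(x) = C(n, x) · p^x · q^(n - x), which this card names but does not write out. The function names are made up here. n = 10 is a hypothetical number of trials, and p = 0.15 borrows the left-handedness rate mentioned on the normal-approximation card.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of EXACTLY x successes in n trials (cumulative = FALSE)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def binomial_cdf(x, n, p):
    """Probability of x OR FEWER successes in n trials (cumulative = TRUE)."""
    return sum(binomial_pmf(k, n, p) for k in range(x + 1))

n, p = 10, 0.15                                  # hypothetical trial count, 15% success rate
exactly_two = binomial_pmf(2, n, p)
two_or_fewer = binomial_cdf(2, n, p)

# Weighting every outcome by its probability recovers the mean, n * p
mean_successes = sum(k * binomial_pmf(k, n, p) for k in range(n + 1))

print(round(exactly_two, 4), round(two_or_fewer, 4), round(mean_successes, 4))
```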

One method to determine the number of classes in a frequency distribution is the rule

2^k ≥ n, where k = Number of Classes and n = Number of Data Points
-Find the lowest value of k that satisfies the rule
Suppose n = 50
2^5 = 32 < 50 (k = 5 is too small)
2^6 = 64 > 50 (k = 6 is a GOOD CHOICE)
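
A small Python sketch of the rule, together with the estimated class width formula from its own card. Both function names are invented, and the minimum and maximum values passed to the width function are hypothetical.

```python
def number_of_classes(n):
    """Smallest k such that 2**k >= n."""
    k = 1
    while 2 ** k < n:
        k += 1
    return k

def estimated_class_width(min_value, max_value, k):
    """(max - min) / k, to be rounded afterwards to a readable whole number."""
    return (max_value - min_value) / k

k = number_of_classes(50)
width = estimated_class_width(12, 72, k)   # hypothetical minimum and maximum
print(k, width)
```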

Measures of Relative Position

COMPARE the position of one value in relation to other values in the data set 1) Percentiles 2) Quartiles

Descriptive Statistics

Collecting, summarizing, and displaying data

Example: Probability between two values Suppose income is normally distributed for a group of workers, with μ = $45,000 and σ = $5,000

Find the probability that a randomly selected worker from this group has an income between $38,000 and $48,000
Convert x = 38 and x = 48 (in thousands of dollars) to z-scores:
z38 = (38 - 45)/5 = -1.40
z48 = (48 - 45)/5 = 0.60
P(38 ≤ x ≤ 48) = P(-1.40 ≤ z ≤ 0.60) = P(z ≤ 0.60) - P(z ≤ -1.40) = 0.7257 - 0.0808 = 0.6449
EXCEL: =NORM.DIST(x, mean, standard_dev, cumulative)
-cumulative = FALSE if you want the probability density function
-cumulative = TRUE if you want the CUMULATIVE PROBABILITY
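
The same subtraction of two cumulative probabilities, sketched in Python with `statistics.NormalDist` from the standard library:

```python
from statistics import NormalDist

income = NormalDist(mu=45_000, sigma=5_000)

# P(38,000 <= x <= 48,000) = P(x <= 48,000) - P(x <= 38,000)
p_between = income.cdf(48_000) - income.cdf(38_000)
print(round(p_between, 4))
```

The result agrees with the table-based answer of 0.6449 to within rounding.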

Bias

The manner in which survey questions are asked can affect responses -can occur when a question is stated in a way that encourages or leads a respondent to a particular answer Example: "Do you agree that the current overly complex tax code should be simplified and made more fair?"

Expected Value; E(x)

The mean, μ, of a discrete probability distribution is the weighted average of the outcomes of the random variables that comprise it

Sample Variance

The sample variance is denoted by s^2 The variance measures the variability, or SPREAD, of the data points around the MEAN Sample Variance=VAR.S(data values) Population Variance=VAR.P(data values)

Z-Score Example

The time customers spend on the phone for service follows the Normal Distribution with a MEAN of 12 minutes and a standard deviation of 3 minutes.
*What is the probability that the next customer who calls will spend 14 minutes or less on the phone?
μ = 12 and σ = 3
z = (x - μ)/σ = (14 - 12)/3 = 2/3 = 0.67
-This says that x = 14 is 0.67 standard deviations (0.67 increments of 3 units) ABOVE the mean of 12
-From the standard normal table, P(z ≤ 0.67) = 0.7486, so the probability is about 0.75

Stem and Leaf Display Cont.

To get more detail the stems can be split in half
7(5) | 8 8 9 9 9
8(0) | 0 0 0 0 1 1 2 3 3 4 4 4
8(5) | 5 6 7 8
9(0) | 0 2
9(5) | 5
The stem labeled 7(5) stores all the scores between 75 and 79
The stem 8(0) stores all the scores between 80 and 84

Parameter

a described characteristic about a POPULATION -Values calculated using population data are called parameters

Statistic

a described characteristic about a SAMPLE -Values computed from sample data are called statistics

Weighted Mean

allows you to assign more weight to certain values and less weight to others

Discrete Data

are values based on observations that can be counted and are typically represented by WHOLE numbers -represent something that has been COUNTED -take on whole numbers such as 0, 1, 2, 3 Examples: # of children per family, # of cars listed per insurance policy, Vacation days per month

Continuous Data

are values that can take on any real numbers, including numbers that contain DECIMAL points -usually MEASURED rather than counted -Examples are weight, time, and distance, Time required to read chapter 2, Thickness of paint applied to a car body, Voltage of batteries produced in August

Secondary Data

data collected by someone else Advantages: Readily available, Less expensive to collect Disadvantages: No control over how the data was collected, Less reliable unless collected and recorded accurately

Information

data that are transformed into useful facts that can be used for a specific purpose, such as making a decision -Analyzing the data can provide information for decision making

Primary Data

data that you have collected for your own use Advantages: collected by the person or organization who uses the data Disadvantages: Can be expensive and time-consuming to gather

Relative Frequency Distribution

displays the PROPORTION of observations of each class relative to the total number of observations -shows the fraction of observations in each class -found by dividing each frequency by the total number of observations -the fractions in a relative frequency distribution add up to 1.00

Central Tendency

is a SINGLE value used to describe the CENTER point of a data set *Use Descriptive Statistics *Measures of Central Tendency: 1) Mean or Weighted Mean 2) Median 3) Mode

Histograms

are graphs showing the number of observations in each class of a frequency distribution -Excel uses the term "bins" for the classes in the distribution

Mean (Average)

is the most common measure of central tendency *Calculate the mean by adding all the values in a data set and then dividing the result by the number of observations

Median

is the value in the data set for which half the observations are higher and half the observations are lower
-The median is not sensitive to outliers
-First arrange the data in ASCENDING order
-Use an Index Point to determine the position of the median in the data set
Formula for the Index Point for the Median: i = 0.5(n)
-Whenever the index point is not a whole number, round the value up to the next highest whole number
Example with sample of size n = 7: 21 27 27 28 34 45 50
i = 0.5(n) = 0.5(7) = 3.5
Index number is not a whole number so round up to i = 4
Median = 28
-When the index point is a whole number (n is even), the median is halfway between the values in the ith and (i + 1) positions
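
The index-point procedure for the median can be sketched in Python. The function name `median_by_index` is made up here, and the six-value example is hypothetical (the card's sample with its largest value removed).

```python
import statistics

def median_by_index(data):
    """Median via the index point i = 0.5 * n, with positions counted from 1."""
    ranked = sorted(data)
    n = len(ranked)
    if n % 2 == 0:                       # i = 0.5n is a whole number
        i = n // 2
        return (ranked[i - 1] + ranked[i]) / 2   # halfway between positions i and i + 1
    i = n // 2 + 1                       # i is not whole: round up
    return ranked[i - 1]

odd_sample = [21, 27, 27, 28, 34, 45, 50]
even_sample = [21, 27, 27, 28, 34, 45]   # hypothetical even-sized sample

print(median_by_index(odd_sample), median_by_index(even_sample))
```

Both results match `statistics.median`, which applies the same rule.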

Statistics

the mathematical science that deals with the COLLECTION, ANALYSIS, and PRESENTATION of data, which can then be used as a basis for INFERENCE and INDUCTION

