Statistics Chapter 1
Median
The median M is the midpoint of a distribution, the number such that half the observations are smaller and the other half are larger; resistant measure of center
Association
There is an association b/w 2 variables if specific values of one variable tend to occur in common with specific values of the other
What does the mean mean?
Think of it as a fair share value; tells us how large each observation in the data set would be if the total were split equally among all the observations
Back-to-back stem plot
To compare stem plots use common stems; leaves on each side are ordered out from each common stem
Mean (ordinary arithmetic average)
To find the mean of a set of observations, add their values and divide by the number of observations
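A minimal sketch of that recipe in Python. The function name `mean` and the quiz scores are hypothetical, chosen only for illustration.

```python
def mean(observations):
    """Add the values and divide by the number of observations."""
    if not observations:
        raise ValueError("mean of an empty data set is undefined")
    return sum(observations) / len(observations)

# Hypothetical data: quiz scores for five students.
scores = [7, 9, 10, 6, 8]
print(mean(scores))  # 8.0 (total of 40 split equally among 5 observations)
```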
How to deal with outliers
Try to find an explanation (typing error, measuring device broken, silly response); if there is one, remove the outlier from the data. When outliers are real data, choose statistical methods not greatly affected by outliers
Common mistake w/ histograms #3
Use percents instead of counts on the vertical axis when comparing distributions with different numbers of observations.
Purpose of graphs
Used to observe distributions of variables
Mode
Value that appears most frequently; shows up as a major peak in the graph
How to calculate percents
We can display the marginal distribution of opinions in percents by dividing each row total by the table total and converting to a percent.
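The calculation can be sketched as follows. The table of counts, its labels, and the function name `marginal_percents` are all hypothetical and only illustrate the arithmetic.

```python
# Hypothetical two-way table of counts: rows are opinions, columns are groups.
table = {
    "Agree":    {"Women": 30, "Men": 20},
    "Neutral":  {"Women": 10, "Men": 15},
    "Disagree": {"Women": 10, "Men": 15},
}

def marginal_percents(two_way):
    """Divide each row total by the table total and convert to a percent."""
    row_totals = {row: sum(cells.values()) for row, cells in two_way.items()}
    table_total = sum(row_totals.values())
    return {row: 100 * total / table_total for row, total in row_totals.items()}

marginal = marginal_percents(table)
print(marginal)  # {'Agree': 50.0, 'Neutral': 25.0, 'Disagree': 25.0}
```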
Distribution
What values the variable takes and how often it takes these values.
Splitting stems
When there are too many leaves, it is difficult to see distributions so we split the stems (duplicate stems of the same number; can be represented by a dot) - Only want 5 to 7 leaves - 0 to 4/5 to 9
Which conditional distributions should we compare?
Which of these two gives us the information we want? Think about whether changes in one variable might help explain changes in the other.
When you first meet a new data set, ask yourself the following questions:
Who are the individuals described by the data? How many individuals are there? What are the variables? In what units is each variable recorded?
Box plot vs. modified box plot
"Modified" box plot shows outliers while standard box plot does not.
Do numbers guarantee a quantitative variable?
-Not every variable that takes number values is quantitative. -Other categorical variables are created by grouping values of a quantitative variable into classes
What should we beware of when looking at graphs?
1.) Pictographs 2.) Scales. Both can be misleading and may give a distorted impression of the relative frequencies or percentages of categories
Roundoff error
Roundoff errors don't point to mistakes in our work, just to the effect of rounding off results - Ex: percents don't add to exactly 100
Common mistake w/ histograms #2
Don't use counts (in a frequency table) or percents (in a relative frequency table) as data.
Unimodal
Graph has a single peak (ignore minor ups and downs)
Multimodal
Graph has more than two clear peaks
The Interquartile Range (IQR)
IQR = Q3 - Q1
Should we use the mean or median
Income tax: mean. Skewed distributions (income): median
Conditional distribution
Of a variable describes the values of that variable among individuals who have a specific value of another variable; there is a separate conditional distribution for each value of the other variable 1.) Distributions of a row variable for each value of the column variable. 2.) Distributions of the column variable for each value of the row variable.
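A short sketch of the second kind of conditional distribution listed above's counterpart, the row variable within one column. The counts, labels, and the function name `conditional_distribution` are hypothetical.

```python
# Hypothetical two-way table of counts: rows are opinions, columns are groups.
table = {
    "Agree":    {"Women": 30, "Men": 20},
    "Neutral":  {"Women": 10, "Men": 15},
    "Disagree": {"Women": 10, "Men": 15},
}

def conditional_distribution(two_way, column):
    """Percents of the row variable among individuals in one column only."""
    column_total = sum(cells[column] for cells in two_way.values())
    return {row: 100 * cells[column] / column_total
            for row, cells in two_way.items()}

# One separate conditional distribution for each value of the column variable.
women = conditional_distribution(table, "Women")
men = conditional_distribution(table, "Men")
print(women)  # {'Agree': 60.0, 'Neutral': 20.0, 'Disagree': 20.0}
print(men)    # {'Agree': 40.0, 'Neutral': 30.0, 'Disagree': 30.0}
```

Comparing the two sets of percents (60% vs. 40% agreeing in this made-up table) is what describes the relationship between the variables.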
Categorical variable
Places an individual into one of several groups or categories
Spread
Tells us how much variability there is in the data
How do we determine how to analyze variables?
The proper method of analysis for a variable depends on whether it is categorical or quantitative.
What to consider when creating a stem plot
- Do not work well for large data sets where each stem must hold a large number of leaves - There is no magic number of stems to use, but five is good minimum (too few or too many will make it difficult to see the distribution's shape) - If you split stems, be sure that each stem is assigned an equal number of possible leaf digits - You can get more flexibility by rounding the data so that the final digit after rounding is suitable as a leaf/you can also truncate (remove one or more digits)
What is important to consider when reading data?
Whether the data you have help answer your questions.
Are quartiles/IQR resistant?
Yes because they are not affected by a few extreme observations
How do we explore data?
-Begin by examining each variable by itself. Then move on to study relationships among the variables. -Start with a graph or graphs. Then add numerical summaries.
How to find the median
1.) Arrange all observations in order of size, from smallest to largest. 2.) If the number of observations n is odd, the median M is the center observation in the ordered list. 3.) If the number of observations n is even, the median M is the average of the two center observations in the ordered list.
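The three steps can be sketched in Python as below. The function name `median` and the sample lists are hypothetical.

```python
def median(observations):
    """Center value of the ordered list, or the average of the two center values."""
    ordered = sorted(observations)          # step 1: smallest to largest
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    middle = n // 2
    if n % 2 == 1:                          # step 2: n odd, take the center
        return ordered[middle]
    # step 3: n even, average the two center observations
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median([5, 1, 3]))     # 3
print(median([4, 1, 3, 2]))  # 2.5
```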
How to calculate quartiles
1.) Arrange the observations in increasing order and locate the median M in the ordered list of observations. 2.) The first quartile Q1 is the median of the observations whose position in the ordered list is to the left of the median. 3.) The third quartile Q3 is the median of the observations whose position in the ordered list is to the right of the median. *leave out overall median when you locate quartiles
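A sketch of this quartile method, assuming at least two observations. The helper names are hypothetical. Note that statistical software often uses other quartile conventions, so results from a library may differ slightly from this by-hand rule.

```python
def _median_of_sorted(ordered):
    """Median of a list that is already in increasing order."""
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

def quartiles(observations):
    """Return (Q1, M, Q3) using the median-of-each-half rule."""
    ordered = sorted(observations)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]
    # When n is odd, leave the overall median out of both halves.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return _median_of_sorted(lower), _median_of_sorted(ordered), _median_of_sorted(upper)

q1, m, q3 = quartiles([1, 2, 3, 4, 5, 6, 7])
print(q1, m, q3)   # 2 4 6
print(q3 - q1)     # 4, the IQR
```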
Box plot/box and whisker plot
1.) Central box drawn from Q1 to Q3 2.) A line in the box marks the median 3.) Lines (whiskers) extend from box out to smallest and largest observations that aren't outliers *Outliers denoted with star
How to make a histogram
1.) Divide the range of the data into classes of equal width. Specify the classes so that each individual falls into exactly one class (use inequalities such as < so that class boundaries don't overlap) 2.) Find the count (frequency) or percent (relative frequency) of individuals in each class. 3.) Title, label and scale your axes and draw the histogram. Label the horizontal axis with the variable (units) whose distribution you are displaying and the vertical axis with count/number/frequency or percent/relative frequency 4.) Draw the bars with no horizontal space between them unless a class is empty, in which case its bar has height 0
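Steps 1 and 2 can be sketched without any plotting library. The function name `histogram_counts`, its parameters, and the data are hypothetical; each class here includes its lower boundary and excludes its upper one, so every value lands in exactly one class.

```python
def histogram_counts(data, start, width, num_classes):
    """Count observations in equal-width classes [start + i*width, start + (i+1)*width)."""
    counts = [0] * num_classes
    for x in data:
        index = int((x - start) // width)
        if not 0 <= index < num_classes:
            raise ValueError(f"{x} falls outside the chosen classes")
        counts[index] += 1
    return counts

# Hypothetical data, grouped into classes 0 to <10, 10 to <20, 20 to <30, 30 to <40.
data = [3, 7, 12, 15, 18, 21, 22, 29, 35, 38]
counts = histogram_counts(data, start=0, width=10, num_classes=4)
percents = [100 * c / len(data) for c in counts]
print(counts)    # [2, 3, 3, 2]
print(percents)  # [20.0, 30.0, 30.0, 20.0]

# Rough text version of the bars; an empty class would simply show no stars.
for i, count in enumerate(counts):
    print(f"{i * 10} to <{(i + 1) * 10}: {'*' * count}")
```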
Which graphs do you use for quantitative data?
1.) Dotplot 2.) Stemplot 3.) Histogram 4.) Box and Whisker Plot
How to make a dot plot
1.) Draw a horizontal axis (a number line) and label it with the variable name. 2.) Scale the axis. Start by looking at the minimum and maximum values of the variable. 3.) Mark a dot above the location on the horizontal axis corresponding to each data value.
Problems w/ association
1.) Even a strong association between two categorical variables can be influenced by other variables lurking in the background. 2.) In the most extreme cases, it is possible for an association between two categorical variables to be "reversed" when we consider a third variable
Things to consider when making a histogram
1.) Our eyes respond to the area of the bars so choose classes that are the same width; then area is determined by height and all classes are fairly represented 2.) There is no right choice of the classes; too few gives a "skyscraper" graph with all values in a few classes with tall bars. Too many will produce a "pancake" graph with most classes having one or no observations. Neither choice gives a good picture of the shape of the distribution. Five classes is a good minimum.
Two types of charts/graphs for categorical variables
1.) Pie Chart 2.) Bar Graph/Chart
How to make a stem plot?
1.) Separate each observation into a stem, consisting of all but the final digit, and a leaf, the final digit. Write the stems in a vertical column with the smallest at the top, and draw a vertical line at the right of this column. Do not skip any stems, even if there is no data value for a particular stem. 2.) Write each leaf in the row to the right of its stem 3.) Arrange leaves in increasing order out from stem. 4.) Provide a key that explains in context what the stems and leaves represent.
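A text-only sketch of these steps, assuming the data are non-negative whole numbers and the list is not empty. The function name `stemplot` and the data are hypothetical.

```python
def stemplot(data):
    """Return the rows of a stemplot: stem = all but the final digit, leaf = final digit."""
    stems = {}
    for value in sorted(data):            # sorting puts leaves in increasing order
        stem, leaf = divmod(value, 10)
        stems.setdefault(stem, []).append(leaf)
    lines = []
    # Walk every stem from smallest to largest so that empty stems are not skipped.
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}".rstrip())
    return lines

# Hypothetical data set.
lines = stemplot([23, 12, 47, 15, 21, 40, 23])
for line in lines:
    print(line)
print("Key: 1 | 2 means 12")
```

The stem 3 has no leaves here but still gets a row, which keeps the gap in the distribution visible.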
The five number summary
1.) Smallest observation 2.) First Quartile 3.) Median 4.) Third Quartile 5.) Largest Observation Written in order from smallest to largest (Minimum Q1 M Q3 Maximum) Five numbers divide each distribution roughly into quarters (25% b/w min and Q1, 25% b/w Q1 and M, 25% b/w M and Q3, 25% b/w Q3 and max)
How to organize a statistical problem; a four step process
1.) State: what's the question that you're trying to answer 2.) Plan: how will you go about answering the question? What statistical techniques does this problem call for? 3.) Do: make graphs and carry out needed calculations 4.) Conclude: give your practical conclusion in the setting of the real-world problem (context)
Simpson's paradox
An association between two variables that holds for each individual value of a third variable can be changed or even reversed when the data for all values of the third variable are combined
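A small numeric illustration of the reversal. All counts, group names, and treatment labels are made up for this sketch; they are not from any real study.

```python
# Hypothetical (successes, trials) for treatments A and B within two groups.
results = {
    "mild":   {"A": (9, 10),   "B": (80, 100)},
    "severe": {"A": (30, 100), "B": (2, 10)},
}

def rate(successes, trials):
    """Success rate as a proportion."""
    return successes / trials

def combined_rate(treatment):
    """Success rate after pooling the data from every group."""
    successes = sum(results[group][treatment][0] for group in results)
    trials = sum(results[group][treatment][1] for group in results)
    return successes / trials

# Within each group, A has the higher success rate (0.9 vs 0.8, 0.3 vs 0.2).
for group, by_treatment in results.items():
    print(group, rate(*by_treatment["A"]), rate(*by_treatment["B"]))

# After combining the groups, B has the higher rate (39/110 vs 82/110).
print(combined_rate("A") < combined_rate("B"))  # True
```

The reversal happens because A was used mostly on the severe group, where success is rare for both treatments; group severity is the lurking third variable.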
Outlier
An individual value that falls outside the overall pattern.
Case
Another name used for individual
Bar graph/chart
Bar graphs represent each category as a bar. The bar heights show the category counts or percents. Bar graphs are easier to make than pie charts and are also easier to read.
Advantages of bar graph when compared to a pie chart
Both graphs can display the distribution of a categorical variable, but a bar graph can also compare any set of quantities that are measured in the same units.
What should you do if there are no totals and why?
Calculate them b/c it helps with marginal distribution
Variable
Characteristic of an individual; can take different values for different individuals.
Comparison w/ bar graphs
Compare several quantities by comparing the heights of bars that represent the quantities
How to describe a shape of a graph?
Concentrate on the main features. Look for major peaks, not for minor ups and downs in the graph. Look for clusters of values and obvious gaps. Look for rough symmetry or clear skewness.
Center
Describe using median (middle value; divides observations so half take larger values and half take smaller values) Describe using mean (average)
Column variable
Described by each column in a two-way table
Row variable
Described by each row in a two-way table
How to compare distributions of quantitative variable?
Discuss shape, center, spread and possible outliers - Don't just list values; explicitly compare them --> greater than, less than or about the same as
Frequency table
Displays the counts (frequencies) of individuals in a category
Symmetric distribution
Distribution is roughly symmetric if the right and left sides of the graph are approximately mirror images of each other
Skewed distribution
Distribution is skewed to the right if the right side of the graph (containing the half of the observations w/ large values) is much longer than the left side. It is skewed to the left if the left side of the graph is much longer than the right side.
Common mistake w/ histograms #1
Don't confuse histograms and bar graphs. Although histograms resemble bar graphs, their details and uses are different. A histogram displays the distribution of a quantitative variable. The horizontal axis of a histogram is marked in the units of measurement for the variable. A bar graph is used to display the distribution of a categorical variable or to compare the sizes of different quantities. The horizontal axis of a bar graph identifies the categories or quantities being compared. Draw bar graphs with blank space between the bars to separate the items being compared. Draw histograms with no space, to show the equal-width classes.
Inference
Draw conclusions beyond the data - Depends on probability; study of chance behavior - What are the chances?
Dot plot
Each data value is shown as a dot above its location on a number line; used for quantitative variables
What graph to use for marginal distribution?
Each marginal distribution from a two-way table is a distribution for a single categorical variable. We can use a bar graph or a pie chart to display such a distribution.
Common setup of data tables
Each row is an individual; each column is a variable.
Segmented bar graph
For each category, there is a single bar with "segments"; segments of bar add up to 100%
Bimodal
Graph has two clear peaks
Facts vs. graphing
Graphs give best overall picture of a distribution, not numerical measures; numerical measures of center/spread report specific facts but do not describe entire shape; numerical summaries do not highlight presence of multiple peaks or clusters for example; always plot data
Interpret median
Half observations above this value and half observations below this value
How to Examine the Distribution of a Quantitative Variable
In any graph, look for the overall pattern and for striking departures from that pattern. You can describe the overall pattern of a distribution by its shape, center, and spread. An important kind of departure is an outlier
Common mistake w/ histograms #4
Just because a graph looks nice, it's not necessarily a meaningful display of data
Values of a categorical variable
Labels for the categories
Distribution of a categorical variable
Lists the categories and gives either the count or the percent of individuals who fall into each category
Average value vs. typical value
Mean (balancing point) vs. median (equal areas)
Estimates of time
Most people estimate time in multiples of 5/10/15/30/60 minutes
Problem w/ measuring center
Need to measure both center and spread to provide useful description of distribution
Individuals
Objects described by a set of data; may be people, animals or things.
Marginal Distribution
Of one of the categorical variables in a two-way table of counts is the distribution of values of that variable among all individuals described by the table.
Two-way table
Organizes data on two categorical variables; often used to summarize large amounts of information by grouping outcomes into categories.
Good bar graphs; area and height
Our eyes react to the area of the bars as well as to their height. When all bars have the same width, the area (width × height) varies in proportion to the height, and our eyes receive the right impression. Make the bars equally wide for a bar graph.
Identifying outliers
Outlier if: 1.) Less than Q1 - 1.5(IQR) 2.) More than Q3 + 1.5(IQR)
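A sketch of the 1.5 × IQR rule. The function names and data are hypothetical, and Q1 and Q3 are passed in rather than computed, since quartile conventions vary.

```python
def outlier_fences(q1, q3):
    """Return the cutoffs Q1 - 1.5(IQR) and Q3 + 1.5(IQR)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(data, q1, q3):
    """Values below the lower cutoff or above the upper cutoff."""
    low, high = outlier_fences(q1, q3)
    return [x for x in data if x < low or x > high]

# Hypothetical data. Taking the median of each half gives Q1 = 11 and Q3 = 17.
data = [1, 10, 12, 13, 15, 16, 18, 40]
print(outlier_fences(11, 17))       # (2.0, 26.0)
print(find_outliers(data, 11, 17))  # [1, 40]
```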
Percents for Marginal Distribution
Percents are often more informative than counts, especially when we are comparing groups of different sizes.
Pie Chart
Pie charts show the distribution of a categorical variable as a "pie" whose slices are sized by the counts or percents for the categories. A pie chart must include all the categories that make up a whole. Use a pie chart only when you want to emphasize each category's relation to the whole.
Stemplot/stem and leaf plot
Provides a picture of the shape of a distribution of a quantitative variable while including the actual numerical values in the graph
Histograms
Quantitative variables often take many values. A graph of the distribution is clearer if nearby values are grouped together. The most common graph of the distribution of one quantitative variable is a histogram.
Interpreting IQR
Range of the middle half of the data (Q3 - Q1)
Types of histograms
Relative frequency histogram (vertical axis: percents w/o multiplying by 100) Frequency histogram (vertical axis: counts/number)
Comparing the mean and the median
Roughly symmetric distribution: close together. Exactly symmetric: exactly the same. Skewed distribution: mean is farther out in long tail than the median
Problems w/ mean
Sensitive to the influence of extreme observations (outliers/skewed toward long tail); cannot resist influence of extreme observations so mean is not a resistant measure of center
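A quick check of this using Python's standard `statistics` module. The two small data sets are hypothetical; the second just replaces the largest value with an extreme one.

```python
from statistics import mean, median

# Hypothetical data: identical except that the largest value becomes an outlier.
data = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 100]

# The mean is pulled toward the extreme observation: it moves from 3 to 22.
print(mean(data), mean(with_outlier))

# The median stays at 3 in both cases, so it is resistant.
print(median(data), median(with_outlier))
```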
SOCS
Shape: skewed? (compare distances b/w min/max and median). Outliers: identify. Center: compare medians. Spread: IQR (height of boxes) and range - Variability
When should we use box plots?
Show less detail than histograms or stemplots so they are best used for side-by-side comparison of more than one distribution
Relative frequency table
Shows the percents (relative frequencies) of individuals in a category
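A sketch that builds both a frequency table and a relative frequency table from raw categorical data. The function name `frequency_tables` and the survey responses are hypothetical.

```python
from collections import Counter

def frequency_tables(values):
    """Return (counts, percents) for each category in the data."""
    counts = Counter(values)
    total = sum(counts.values())
    percents = {category: 100 * count / total for category, count in counts.items()}
    return dict(counts), percents

# Hypothetical survey responses: favorite pet.
responses = ["dog", "cat", "dog", "fish", "dog", "cat", "dog", "cat"]
counts, percents = frequency_tables(responses)
print(counts)    # {'dog': 4, 'cat': 3, 'fish': 1}
print(percents)  # {'dog': 50.0, 'cat': 37.5, 'fish': 12.5}
```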
What aids finding median
Stem plot b/c arranges observations in order
Range
Subtract the smallest value from the largest value. Shows full spread of data but only uses max and min (which may be outliers)
Quantitative variable
Takes numerical values for which it makes sense to find an average
Problem with marginal distribution
Tell us nothing about the relationship between two variables. To describe a relationship between two categorical variables, we must calculate some well-chosen percents from the counts given in the body of the table.
Classes for a histogram
The choice of classes in a histogram can influence the appearance of a distribution. Histograms with more classes show more detail/outliers but may have a less clear pattern. (want five to seven bins w/ outliers separate)
Mean notation
x̄ (x w/ bar) - mean of a sample (most of the time); μ - mean of a population