Statistics Chapter 1

Median

The median M is the midpoint of a distribution, the number such that half the observations are smaller and the other half are larger; resistant measure of center

Association

There is an association b/w 2 variables if specific values of one variable tend to occur in common with specific values of the other

What does the mean mean?

Think of it as a fair share value; tells us how large each observation in the data set would be if the total were split equally among all the observations

Back-to-back stem plot

To compare stem plots use common stems; leaves on each side are ordered out from each common stem

Mean (ordinary arithmetic average)

To find the mean of a set of observations, add their values and divide by the number of observations
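
A minimal sketch of this rule in Python; the function name `mean` and the sample scores are made up for illustration:

```python
def mean(observations):
    """Add the values and divide by the number of observations."""
    if not observations:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(observations) / len(observations)


scores = [4, 8, 6, 2]  # made-up data
print(mean(scores))  # 5.0
```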

How to deal with outliers

Try to find an explanation (typing error, broken measuring device, silly response); if one exists, remove the outlier from the data. When outliers are real data, choose statistical methods that are not greatly affected by outliers.

Common mistake w/ histograms #3

Use percents instead of counts on the vertical axis when comparing distributions with different numbers of observations.

Purpose of graphs

Used to observe distributions of variables

Mode

Value that appears most frequently; shows up as a major peak in the graph

How to calculate percents

We can display the marginal distribution of opinions in percents by dividing each row total by the table total and converting to a percent.
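
The calculation can be sketched in Python as follows. The two-way table, its labels, and the function name `marginal_percents` are all hypothetical, chosen only to illustrate dividing each row total by the table total:

```python
def marginal_percents(table):
    """table maps each row label to a dict of {column label: count}."""
    row_totals = {row: sum(cols.values()) for row, cols in table.items()}
    table_total = sum(row_totals.values())
    # Divide each row total by the table total and convert to a percent.
    return {row: 100 * total / table_total for row, total in row_totals.items()}


# Made-up counts: opinion (rows) by gender (columns).
opinions = {
    "Agree": {"Female": 45, "Male": 15},
    "Disagree": {"Female": 15, "Male": 25},
}
print(marginal_percents(opinions))  # {'Agree': 60.0, 'Disagree': 40.0}
```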

Distribution

What values the variable takes and how often it takes these values.

Splitting stems

When a stem has too many leaves, the distribution is hard to see, so we split the stems (write each stem twice; the repeated stem can be represented by a dot). Aim for about 5 to 7 leaves per stem; leaves 0 to 4 go on the first copy of the stem and leaves 5 to 9 on the second.

Which conditional distributions should we compare?

Which of these two gives us the information we want? Think about whether changes in one variable might help explain changes in the other.

When you first meet a new data set, ask yourself the following questions:

Who are the individuals described by the data? How many individuals are there? What are the variables? In what units is each variable recorded?

Box plot vs. modified box plot

"Modified" box plot shows outliers while standard box plot does not.

Do numbers guarantee a quantitative variable?

-Not every variable that takes number values is quantitative. -Other categorical variables are created by grouping values of a quantitative variable into classes

What should we beware of when looking at graphs?

1.) Pictographs 2.) Scales. Both can be misleading: you may get a distorted impression of the relative frequencies or percentages of the categories.

Roundoff error

Roundoff errors don't point to mistakes in our work, just to the effect of rounding off results. Ex: percents that don't add to exactly 100

Common mistake w/ histograms #2

Don't use counts (in a frequency table) or percents (in a relative frequency table) as data.

Unimodal

Graph has a single peak (ignore minor ups and downs)

Multimodal

Graph has more than two clear peaks

The Interquartile Range (IQR)

IQR = Q3 - Q1

Should we use the mean or median

Income tax: mean. Skewed distributions (income): median.

Conditional distribution

Of a variable describes the values of that variable among individuals who have a specific value of another variable; there is a separate conditional distribution for each value of the other variable. A two-way table gives two sets: 1.) Distributions of the row variable for each value of the column variable. 2.) Distributions of the column variable for each value of the row variable.
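
As a rough illustration, the sketch below computes the conditional distribution of the row variable for one value of the column variable. The table, its labels, and the function name `conditional_percents` are hypothetical:

```python
def conditional_percents(table, column):
    """Distribution of the row variable among individuals in one column.

    table maps each row label to a dict of {column label: count}.
    """
    column_total = sum(cols[column] for cols in table.values())
    return {row: 100 * cols[column] / column_total for row, cols in table.items()}


# Made-up counts: opinion (rows) by gender (columns).
opinions = {
    "Agree": {"Female": 45, "Male": 15},
    "Disagree": {"Female": 15, "Male": 25},
}
print(conditional_percents(opinions, "Female"))  # {'Agree': 75.0, 'Disagree': 25.0}
print(conditional_percents(opinions, "Male"))    # {'Agree': 37.5, 'Disagree': 62.5}
```

Each column gets its own conditional distribution, and each one adds up to 100%.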

Categorical variable

Places an individual into one of several groups or categories

Spread

Tells us how much variability there is in the data

How do we determine how to analyze variables?

The proper method of analysis for a variable depends on whether it is categorical or quantitative.

What to consider when creating a stem plot

- Stem plots do not work well for large data sets, where each stem must hold a large number of leaves - There is no magic number of stems to use, but five is a good minimum (too few or too many will make it difficult to see the distribution's shape) - If you split stems, be sure that each stem is assigned an equal number of possible leaf digits - You can get more flexibility by rounding the data so that the final digit after rounding is suitable as a leaf; you can also truncate (remove one or more digits)

What is important to consider when reading data?

Whether the data you have help answer your questions.

Are quartiles/IQR resistant?

Yes because they are not affected by a few extreme observations

How do we explore data?

-Begin by examining each variable by itself. Then move on to study relationships among the variables. -Start with a graph or graphs. Then add numerical summaries.

How to find the median

1.) Arrange all observations in order of size, from smallest to largest. 2.) If the number of observations n is odd, the median M is the center observation in the ordered list. 3.) If the number of observations n is even, the median M is the average of the two center observations in the ordered list.
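
The three steps can be sketched in Python like this (the function name `median` and the sample data are made up for illustration):

```python
def median(observations):
    """Median of a non-empty list of numbers."""
    ordered = sorted(observations)  # step 1: arrange from smallest to largest
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]  # step 2: n is odd, take the center observation
    # step 3: n is even, average the two center observations
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median([5, 1, 3]))     # 3
print(median([4, 1, 3, 2]))  # 2.5
```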

How to calculate quartiles

1.) Arrange the observations in increasing order and locate the median M in the ordered list of observations. 2.) The first quartile Q1 is the median of the observations whose position in the ordered list is to the left of the median. 3.) The third quartile Q3 is the median of the observations whose position in the ordered list is to the right of the median. *leave out overall median when you locate quartiles
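
A minimal Python sketch of this rule, assuming at least two observations. The function name `quartiles` is made up; `statistics.median` is from the standard library. Note that software packages often use slightly different quartile rules, so their results may not match this hand method exactly.

```python
from statistics import median


def quartiles(observations):
    """Return (Q1, M, Q3), leaving the overall median out of both halves."""
    ordered = sorted(observations)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]  # observations to the left of the median
    # When n is odd, skip the center observation (the overall median itself).
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(ordered), median(upper)


print(quartiles([1, 2, 3, 4, 5, 6, 7]))     # (2, 4, 6)
print(quartiles([1, 2, 3, 4, 5, 6, 7, 8]))  # (2.5, 4.5, 6.5)
```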

Box plot/box and whisker plot

1.) Central box drawn from Q1 to Q3 2.) A line in the box marks the median 3.) Lines (whiskers) extend from box out to smallest and largest observations that aren't outliers *Outliers denoted with star

How to make a histogram

1.) Divide the range of the data into classes of equal width. Specify the classes so that each individual falls into exactly one class (state the boundaries with inequalities, for example 0 ≤ x < 5). 2.) Find the count (frequency) or percent (relative frequency) of individuals in each class. 3.) Title, label, and scale your axes and draw the histogram. Label the horizontal axis with the variable (and units) whose distribution you are displaying and the vertical axis with the count (frequency) or percent (relative frequency). 4.) Draw the bars with no horizontal space between them, unless a class is empty, in which case its bar has height 0.
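
Steps 1 and 2 can be sketched in Python as below. The function name `class_counts`, its parameters, and the data are hypothetical; the sketch assumes each class includes its left endpoint and excludes its right endpoint, so every value lands in exactly one class.

```python
def class_counts(data, start, width, num_classes):
    """Count the observations falling in each equal-width class.

    Class k covers start + k*width <= x < start + (k+1)*width.
    """
    counts = [0] * num_classes
    for x in data:
        index = int((x - start) // width)
        if not 0 <= index < num_classes:
            raise ValueError(f"{x} falls outside the classes")
        counts[index] += 1
    return counts


data = [1, 4, 5, 7, 9, 10, 12, 14]  # made-up observations
counts = class_counts(data, start=0, width=5, num_classes=3)
percents = [100 * c / len(data) for c in counts]
print(counts)    # [2, 3, 3]
print(percents)  # [25.0, 37.5, 37.5]
```

The counts give the bar heights of a frequency histogram; the percents give the bar heights of a relative frequency histogram.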

Which graphs do you use for quantitative data?

1.) Dotplot 2.) Stemplot 3.) Histogram 4.) Box and Whisker Plot

How to make a dot plot

1.) Draw a horizontal axis (a number line) and label it with the variable name. 2.) Scale the axis. Start by looking at the minimum and maximum values of the variable. 3.) Mark a dot above the location on the horizontal axis corresponding to each data value.

Problems w/ association

1.) Even a strong association between two categorical variables can be influenced by other variables lurking in the background. 2.) In the most extreme cases, it is possible for an association between two categorical variables to be "reversed" when we consider a third variable

Things to consider when making a histogram

1.) Our eyes respond to the area of the bars, so choose classes that are the same width; then area is determined by height and all classes are fairly represented. 2.) There is no single right choice of classes; too few gives a "skyscraper" graph with all values in a few classes with tall bars. Too many will produce a "pancake" graph with most classes having one or no observations. Neither choice gives a good picture of the shape of the distribution. Five classes is a good minimum.

Two types of charts/graphs for categorical variables

1.) Pie Chart 2.) Bar Graph/Chart

How to make a stem plot?

1.) Separate each observation into a stem, consisting of all but the final digit, and a leaf, the final digit. Write the stems in a vertical column with the smallest at the top, and draw a vertical line at the right of this column. Do not skip any stems, even if there is no data value for a particular stem. 2.) Write each leaf in the row to the right of its stem 3.) Arrange leaves in increasing order out from stem. 4.) Provide a key that explains in context what the stems and leaves represent.
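
A rough Python sketch of these steps, assuming a non-empty list of non-negative whole numbers so that the final digit is the leaf. The function name `stemplot` and the data are made up:

```python
def stemplot(data):
    """Return the lines of a stemplot, one line per stem."""
    leaves = {}
    for value in sorted(data):  # sorting puts the leaves in increasing order
        stem, leaf = divmod(value, 10)  # all but the final digit, final digit
        leaves.setdefault(stem, []).append(leaf)
    lines = []
    # Walk every stem from smallest to largest; empty stems are not skipped.
    for stem in range(min(leaves), max(leaves) + 1):
        digits = "".join(str(leaf) for leaf in leaves.get(stem, []))
        lines.append(f"{stem} | {digits}".rstrip())
    return lines


for line in stemplot([12, 15, 21, 21, 28, 40, 43]):
    print(line)
print("Key: 2 | 8 represents a value of 28")
```

For this data the plot has four stems (1 through 4), and stem 3 appears with no leaves.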

The five number summary

1.) Smallest observation 2.) First Quartile 3.) Median 4.) Third Quartile 5.) Largest Observation. Written in order from smallest to largest (Minimum Q1 M Q3 Maximum). The five numbers divide each distribution roughly into quarters (25% b/w min and Q1, 25% b/w Q1 and M, 25% b/w M and Q3, 25% b/w Q3 and max)

How to organize a statistical problem; a four step process

1.) State: what's the question that you're trying to answer 2.) Plan: how will you go about answering the question? What statistical techniques does this problem call for? 3.) Do: make graphs and carry out needed calculations 4.) Conclude: give your practical conclusion in the setting of the real-world problem (context)

Simpson's paradox

An association between two variables that holds for each individual value of a third variable can be changed or even reversed when the data for all values of the third variable are combined
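
The reversal is easiest to see with numbers. The sketch below uses entirely made-up success counts for two hypothetical treatments, A and B, split by a third variable (case difficulty):

```python
# Hypothetical (successes, attempts) for each treatment within each group.
results = {
    "easy": {"A": (9, 10), "B": (80, 100)},
    "hard": {"A": (30, 100), "B": (2, 10)},
}


def percent(successes, attempts):
    return 100 * successes / attempts


def combined(treatment):
    """Success percent for a treatment after pooling both groups."""
    successes = sum(group[treatment][0] for group in results.values())
    attempts = sum(group[treatment][1] for group in results.values())
    return percent(successes, attempts)


for difficulty, group in results.items():
    print(difficulty, {t: round(percent(*c), 1) for t, c in group.items()})
print("combined", {t: round(combined(t), 1) for t in ("A", "B")})
```

In this made-up data, A does better within each group (90% vs. 80% for easy cases, 30% vs. 20% for hard cases), yet B does better once the groups are combined (about 74.5% vs. 35.5%), because A was used mostly on hard cases.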

Outlier

An individual value that falls outside the overall pattern.

Case

Another name used for individual

Bar graph/chart

Bar graphs represent each category as a bar. The bar heights show the category counts or percents. Bar graphs are easier to make than pie charts and are also easier to read.

Advantages of bar graph when compared to a pie chart

Both graphs can display the distribution of a categorical variable, but a bar graph can also compare any set of quantities that are measured in the same units.

What should you do if there are no totals and why?

Calculate them b/c it helps with marginal distribution

Variable

Characteristic of an individual; can take different values for different individuals.

Comparison w/ bar graphs

Compare several quantities by comparing the heights of bars that represent the quantities

How to describe a shape of a graph?

Concentrate on the main features. Look for major peaks, not for minor ups and downs in the graph. Look for clusters of values and obvious gaps. Look for rough symmetry or clear skewness.

Center

Describe using the median (middle value; divides the observations so half take larger values and half take smaller values) or the mean (average)

Column variable

Described by each column in a two-way table

Row variable

Described by each row in a two-way table

How to compare distributions of quantitative variable?

Discuss shape, center, spread and possible outliers - Don't just list values; explicitly compare them --> greater than, less than or about the same as

Frequency table

Displays the counts (frequencies) of individuals in a category

Symmetric distribution

Distribution is roughly symmetric if the right and left sides of the graph are approximately mirror images of each other

Skewed distribution

Distribution is skewed to the right if the right side of the graph (containing the half of the observations w/ large values) is much longer than the left side. It is skewed to the left if the left side of the graph is much longer than the right side.

Common mistake w/ histograms #1

Don't confuse histograms and bar graphs. Although histograms resemble bar graphs, their details and uses are different. A histogram displays the distribution of a quantitative variable. The horizontal axis of a histogram is marked in the units of measurement for the variable. A bar graph is used to display the distribution of a categorical variable or to compare the sizes of different quantities. The horizontal axis of a bar graph identifies the categories or quantities being compared. Draw bar graphs with blank space between the bars to separate the items being compared. Draw histograms with no space, to show the equal-width classes.

Inference

Draw conclusions beyond the data - Depends on probability; study of chance behavior - What are the chances?

Dot plot

Each data value is shown as a dot above its location on a number line; used for quantitative variables

What graph to use for marginal distribution?

Each marginal distribution from a two-way table is a distribution for a single categorical variable. We can use a bar graph or a pie chart to display such a distribution.

Common setup of data tables

Each row is an individual; each column is a variable.

Segmented bar graph

For each category, there is a single bar with "segments"; segments of bar add up to 100%

Bimodal

Graph has two clear peaks

Facts vs. graphing

Graphs give best overall picture of a distribution, not numerical measures; numerical measures of center/spread report specific facts but do not describe entire shape; numerical summaries do not highlight presence of multiple peaks or clusters for example; always plot data

Interpret median

Half of the observations are above this value and half are below it

How to Examine the Distribution of a Quantitative Variable

In any graph, look for the overall pattern and for striking departures from that pattern. You can describe the overall pattern of a distribution by its shape, center, and spread. An important kind of departure is an outlier

Common mistake w/ histograms #4

Just because a graph looks nice, it's not necessarily a meaningful display of data

Values of a categorical variable

Labels for the categories

Distribution of a categorical variable

Lists the categories and gives either the count or the percent of individuals who fall into each category

Average value vs. typical value

Mean (balancing point) vs. median (equal areas)

Estimates of time

Most people estimate time in multiples of 5/10/15/30/60 minutes

Problem w/ measuring center

Need to measure both center and spread to provide useful description of distribution

Individuals

Objects described by a set of data; may be people, animals or things.

Marginal Distribution

Of one of the categorical variables in a two-way table of counts is the distribution of values of that variable among all individuals described by the table.

Two-way table

Organizes data on two categorical variables; often used to summarize large amounts of information by grouping outcomes into categories.

Good bar graphs; area and height

Our eyes react to the area of the bars as well as to their height. When all bars have the same width, the area (width × height) varies in proportion to the height, and our eyes receive the right impression. Make the bars equally wide for a bar graph.

Identifying outliers

Outlier if: 1.) Less than Q1 - 1.5(IQR) 2.) More than Q3 + 1.5(IQR)
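
A short Python sketch of this rule; the function names `outlier_fences` and `find_outliers` and the numbers are made up for illustration, and the quartiles are assumed to be already known:

```python
def outlier_fences(q1, q3):
    """Return the cutoffs Q1 - 1.5(IQR) and Q3 + 1.5(IQR)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(data, q1, q3):
    """Observations below the lower cutoff or above the upper cutoff."""
    low, high = outlier_fences(q1, q3)
    return [x for x in data if x < low or x > high]


print(outlier_fences(10, 20))                      # (-5.0, 35.0)
print(find_outliers([-6, 3, 12, 18, 40], 10, 20))  # [-6, 40]
```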

Percents for Marginal Distribution

Percents are often more informative than counts, especially when we are comparing groups of different sizes.

Pie Chart

Pie charts show the distribution of a categorical variable as a "pie" whose slices are sized by the counts or percents for the categories. A pie chart must include all the categories that make up a whole. Use a pie chart only when you want to emphasize each category's relation to the whole.

Stemplot/stem and leaf plot

Provides a picture of the shape of the distribution of a quantitative variable while including the actual numerical values in the graph

Histograms

Quantitative variables often take many values. A graph of the distribution is clearer if nearby values are grouped together. The most common graph of the distribution of one quantitative variable is a histogram.

Interpreting IQR

The IQR is the range of the middle half of the data.

Types of histograms

Relative frequency histogram (vertical axis: proportions, i.e., percents before multiplying by 100). Frequency histogram (vertical axis: counts/number).

Comparing the mean and the median

Roughly symmetric distribution: mean and median are close together. Exactly symmetric: exactly the same. Skewed distribution: the mean is farther out in the long tail than the median.

Problems w/ mean

Sensitive to the influence of extreme observations (outliers/skewed toward long tail); cannot resist influence of extreme observations so mean is not a resistant measure of center
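
A quick way to see this is to change one observation into an extreme value and compare the mean with the median. The data are made up; `mean` and `median` come from Python's standard `statistics` module:

```python
from statistics import mean, median

original = [2, 3, 4, 5, 6]       # made-up data
with_outlier = [2, 3, 4, 5, 60]  # same data, largest value replaced by 60

print(mean(original), median(original))          # 4 4
print(mean(with_outlier), median(with_outlier))  # 14.8 4
```

The single extreme observation pulls the mean from 4 up to 14.8, while the median stays at 4.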

SOCS

Shape: skewness (distance b/w min/max and median). Outliers: identify them. Center: compare medians. Spread: IQR (height of boxes) and range (variability).

When should we use box plots?

Show less detail than histograms or stemplots so they are best used for side-by-side comparison of more than one distribution

Relative frequency table

Shows the percents (relative frequencies) of individuals in a category

What aids finding median

Stem plot b/c arranges observations in order

Range

Subtract the smallest value from the largest value. Shows full spread of data but only uses max and min (which may be outliers)

Quantitative variable

Takes numerical values for which it makes sense to find an average

Problem with marginal distribution

Tell us nothing about the relationship between two variables. To describe a relationship between two categorical variables, we must calculate some well-chosen percents from the counts given in the body of the table.

Classes for a histogram

The choice of classes in a histogram can influence the appearance of a distribution. Histograms with more classes show more detail/outliers but may have a less clear pattern. (want five to seven bins w/ outliers separate)

Mean notation

x̄ (x w/ bar): mean of a sample (most of the time). μ: mean of a population.

