STATISTICS

Interaction amongst categorical variables: "Or"

"Or" question: What percentage of the respondents are Midwesterners or persons who agree that Global Warming increased temperatures during December 2011 and January 2012? a) 951/1009 b) 168/1009 c) 168/219 d) 783/1009

(Your) Example

Create three survey questions that elicit public opinion on whether the NFL will treat all of its new players with equal respect, regardless of sexual orientation. Each question should be written from one of the following perspectives:
a) Strong belief that the NFL will be fair in treating all new players: "The NFL stands by the fact that players are drafted based upon performance. Do you believe that it will treat new players fairly?"
b) Neutral: "Will the NFL treat all of its players fairly, regardless of orientation?"
c) Strong belief that the NFL will not be fair in its treatment of all new players: "Because the NFL has a recent history of bullying, do you think that each player will be treated fairly, especially if one is a rookie of a different orientation?"

categorical variable A few examples:

-Stores from which the birthday presents were purchased. -The method by which people expressed interest in attending your friend's party. -Breed of puppy.

How do you know if a variable is categorical or quantitative?

-What is being recorded about those n individuals/units? -Is that a number (→ quantitative) or a statement (→ categorical)? (Note: Be careful of categorical variables parading around as numbers...)

Here are a couple of reasons to trust random sampling:

1. Random sampling eliminates bias in choosing the sample, so the sample tends to be representative of the population. 2. Inference is based upon sound mathematics and probability theory.

Block Designs

A block is a group of individuals that are known before the experiment to be similar in some way that is expected to affect the response to the treatments. In a block design, the random assignment of individuals to treatments is carried out separately within each block. Block designs allow another form of control in an experiment. They control for possible confounding variables by bringing them into the experiment as blocks. Example: Back to the Crohn's disease study... Suppose that the researchers believe that a subject's response to either Azath or Inflix might vary by age. What should be our blocking variable?
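As a rough sketch of "random assignment carried out separately within each block," the Python below shuffles the subjects inside each block and deals them out to the treatments. The subject labels and the two age groups are hypothetical; only the treatment names Azath and Inflix come from the example above, and age is used as the blocking variable because the example suggests it.

```python
import random

def assign_within_blocks(blocks, treatments, seed=None):
    """Randomly assign treatments separately within each block."""
    rng = random.Random(seed)
    assignment = {}
    for block_name, subjects in blocks.items():
        shuffled = subjects[:]
        rng.shuffle(shuffled)
        # Deal the shuffled subjects to the treatments in turn, so each
        # block is split as evenly as possible among the treatments.
        for i, subject in enumerate(shuffled):
            assignment[subject] = (block_name, treatments[i % len(treatments)])
    return assignment

# Hypothetical subjects, blocked by age group (an assumption for illustration).
blocks = {
    "under 40": ["S1", "S2", "S3", "S4"],
    "40 and over": ["S5", "S6", "S7", "S8"],
}
result = assign_within_blocks(blocks, ["Azath", "Inflix"], seed=1)
for subject, (block, treatment) in sorted(result.items()):
    print(subject, block, treatment)
```

Because the shuffle happens inside each block, every age group ends up with the same number of subjects on each treatment.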

Boxplot

A boxplot is a graphical display of the five-number summary. A central box spans the middle 50% of the data (marked by the first and third quartiles). A line in the box marks the median M. Lines extend from the box out to the smallest and largest observations. Boxplots for skewed data will show the skew.

Symmetric and skewed distributions:

A distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other. A distribution is skewed to the right if the right side of the histogram (containing the half of the observations with larger values) extends much farther out than the left side. It is skewed to the left if the left side of the histogram extends much farther out than the right side.

Placebos

A placebo is a dummy/phantom treatment. Often a sugar pill is administered to subjects.

Regression Lines

A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes. We often use a regression line to predict the value of y for a given value of x.

Residuals

A residual is the difference between an observed value of the response variable and the value predicted by the regression line. That is, a residual is the remaining prediction "error" after we have calculated the regression line: We can calculate the residual for each individual in the data set. Because of the mathematics of least-squares regression, the sum (and therefore the mean) of the residuals is always zero.
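The claim that least-squares residuals sum to zero can be checked numerically. This is a minimal sketch on made-up data (the x and y values are hypothetical); it fits the least-squares line and then computes residual = observed y minus predicted y for each individual.

```python
from statistics import mean

# Hypothetical (x, y) data for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

x_bar, y_bar = mean(x), mean(y)

# Least-squares slope and intercept.
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
a = y_bar - b * x_bar

# residual = observed y - predicted y
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
print(residuals)
print(sum(residuals))  # zero, up to floating-point rounding
```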

Residual Plot

A residual plot is a scatterplot of the regression residuals against the explanatory variable. Residual plots help us assess how well a regression line fits the data. In other words, residual plots are good diagnostic tools for least-squares regression. We hope to see no significant patterns (i.e., random scatter that is evenly spread about the residual = 0 line) in the residual plot. Comment on the residual plot for the per capita income & education data: There seems to be one potential outlier (a town where the actual growth in per capita income was much less than the predicted amount). The other residuals reflect more of a random scatter.

Displaying Relationships: Scatterplots

A scatterplot shows the relationship between two quantitative variables measured on the same individuals. In this graph, each individual in the data appears as a point corresponding to the value of both variables for that individual.

Time Plots

A time plot of a variable plots each observation against the time at which it was measured. Always put time on the horizontal scale of your plot and the variable you are measuring on the vertical scale. Connecting the data points by lines helps emphasize any change over time.

Voluntary Response Samples

A voluntary response sample consists of people who choose themselves to be part of the sample by responding to a broad appeal. Voluntary response samples are biased because people with strong opinions are most likely to respond. Example: Flo Haysbert contacts CNN, which then presents the "Which company insures your automobile?" question as a 24-hour poll. The responses from college students are separated from the others and the results are posted. Identify any problems with the data gathered. This is better, as we now have a national set of responses. But not many college students visit CNN.com on a daily basis. The data might reflect the auto insurance habits of students with stronger political views. This website, retrieved January 28, 2014, affirms the change in top auto insurance companies. http://www.insurancejournal.com/news/national/2013/12/16/314530.htm

Interpreting Scatterplots

After plotting two variables on a scatterplot, we look for an overall pattern and for striking deviations from that pattern. We describe the overall pattern of a scatterplot by the direction, form, and strength of the relationship. Outliers reappear in this chapter. In this chapter, outliers are ordered pairs that fall outside of the relationship of bivariate data. We are also interested in any clusters of observations. Two variables are positively associated when above-average values of one tend to accompany above-average values of the other, and below-average values tend to occur together. Two variables are negatively associated when above-average values of one tend to accompany below-average values of the other, and vice versa.

The 68-95-99.7 Rule

All Normal distributions obey the following rule: Approximately 68% of all observations fall within 1 standard deviation (σ) of the mean (μ). Approximately 95% of all observations are within 2 standard deviations of μ. Almost all (99.7%) of the observations are within 3 standard deviations of μ.
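The three percentages can be reproduced from the standard Normal curve. A small sketch using Python's standard-library `statistics.NormalDist` (the variable names are just for illustration):

```python
from statistics import NormalDist

z = NormalDist()  # standard Normal: mean 0, standard deviation 1

# Proportion of observations within k standard deviations of the mean.
within_k = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, proportion in within_k.items():
    print(f"within {k} sd: {proportion:.4f}")
# within 1 sd: 0.6827
# within 2 sd: 0.9545
# within 3 sd: 0.9973
```

The rule's 68, 95, and 99.7 are rounded versions of these values.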

Randomized Comparative Experiments

An experiment that uses both comparison of two or more treatments and random assignment of subjects to treatments is a randomized comparative experiment. Note: 1. In a completely randomized experimental design, all of the subjects are allocated at random among all the treatments. 2. An outline of the experimental design of an experiment reflects the random assignment, the groups, the treatments, and the comparison of results.

Influential Observations

An observation is influential for a statistical calculation if removing it would markedly change the result of the calculation. Points that are outliers in the x direction are often influential for the least-squares regression line.

Finding a Value When Given a Proportion

At times we must find the observed value (or range of values) that corresponds to a given area under the curve. This requires using Table A in reverse. We'll use P.P.A.Z.A. (Piazza?): Step 1. State the Problem in terms of the observed variable x. Step 2. Draw a Picture that shows the proportion you want. Step 3. Use Table A to find the entry closest to the proportion we are interested in (and its corresponding z-score). Step 4. Unstandardize by using the z-score equation to find x; that is, solve $z = (x - \mu)/\sigma$ for $x = \mu + z\sigma$. Step 5. Answer the question. In effect, we reverse some of our intermediate steps from 3.7 P.P.Z.A.A.
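Steps 3 and 4 can be sketched in code, with `NormalDist().inv_cdf` standing in for the reverse Table A lookup. The function name and the example distribution (mean 500, standard deviation 100) are hypothetical, chosen only for illustration.

```python
from statistics import NormalDist

def value_for_proportion(p, mu, sigma):
    """Return x such that a proportion p of a Normal(mu, sigma)
    distribution lies at or below x."""
    z = NormalDist().inv_cdf(p)  # Step 3: z-score for cumulative proportion p
    return mu + z * sigma        # Step 4: unstandardize, x = mu + z * sigma

# Hypothetical example: scores that are Normal with mean 500 and sd 100.
x = value_for_proportion(0.90, mu=500, sigma=100)
print(round(x, 1))
```

A table lookup gives a slightly different answer than software because Table A entries are rounded to two decimal places of z.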

Additional Cautions

1. Beware of ecological correlation. A correlation based on averages rather than on individuals is called ecological. Ecological correlations are typically stronger than correlations for individuals. Example: A scatterplot of the average final exam score versus the average midterm score for 11 sections of an introductory statistics course gives a correlation of r = 0.829. A scatterplot of individual final exam scores versus individual midterm scores for each student in these sections gives a correlation of r = 0.687. What can you say about the summarized data versus the individual data?
2. Beware of extrapolation. Extrapolation is the use of a regression line for prediction outside the range of x-values you used to obtain the line. Such predictions are often not accurate. Checkpoint: Would we use the regression line to predict growth in per capita income for cities where 70% of residents have at least 4 years of college? (Yes / No / Maybe)
3. Beware of lurking variables. A lurking variable is a variable that is not among the explanatory or response variables in a study and yet may influence the interpretation of relationships among these variables.
A final caution: Association Does Not Imply Causation! An association between an explanatory variable x and a response variable y, even if it is very strong, is not by itself good evidence that changes in x actually cause changes in y. Example (from the Chp. 4 notes): Ice cream consumption & number of homicides. Consider the lurking variable of "season." The relationship between these two variables is not causal.

Simple Random Samples

Convenience samples and voluntary response samples are based upon personal choice and do not yield representative samples. Sampling by chance allows all members of the population an equal opportunity to be selected as part of the sample. A simple random sample (SRS) of size n consists of n individuals from the population chosen in such a way that every set of n individuals has an equal chance of being the sample actually selected (and every individual has an equal chance of being selected). Simple random samples can be chosen by assigning random numbers to individuals in the population of interest.
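In software, drawing an SRS usually amounts to one call to a random sampler. The sketch below uses Python's `random.sample` on a hypothetical labeled population of 50 individuals; the seed is fixed only so the example is repeatable.

```python
import random

# Hypothetical population of 50 labeled individuals.
population = [f"person_{i:02d}" for i in range(1, 51)]

rng = random.Random(2014)          # fixed seed only for repeatability
srs = rng.sample(population, k=5)  # sampling without replacement;
                                   # every set of 5 is equally likely
print(srs)
```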

Facts About Correlation

Correlation does not distinguish between explanatory and response variables. Why is this? The order does not matter when we multiply the standardized values of x and y for each individual. r does not change when we change the units of measurement of x, y, or both. Why is this? We standardize to find the correlation; therefore, neither r nor the z-scores possess units. Positive r indicates positive association between the variables, and negative r indicates negative association between the variables. Correlation is always between ‒1 and +1, inclusive. Correlation can be exactly equal to ‒1 or +1, but we probably won't see that in real data because it would mean that all of the data points fall exactly on a line. Determining the value of the correlation from a scatterplot takes some practice. The Correlation and Regression Applet re-calculates r based upon the arrangement of the ordered pairs.

More Facts/Cautions About Correlation

Correlation is meaningless unless both variables are quantitative. While we can calculate a correlation for any two variables that use numbers, we must think about the situation to ensure that the variables are indeed quantitative. Correlation measures only the strength of the linear relationship between two variables. Correlation does not describe curved relationships between variables, no matter how strong the relationship. Like the mean and standard deviation, the correlation is not a resistant measure: r is strongly affected by a few outlying observations. Correlation is not a complete summary of two-variable data, even when a clear linear relationship exists between the two variables. In addition to r, report the means and standard deviations for both variables.

quantitative variable EXAMPLE

Cost of a friend's birthday present. Number of people who expressed interest in attending your friend's birthday party.

Pie charts:

Each "slice" represents a piece of one whole. The size of a slice depends on what percent of the whole this category represents. Make sure that your labels match the data and the percentages total 100%. (Note that JMP does not label the percent for each slice.) -Categorical Variables

Bar graphs

Each category is represented by one bar. The bar's height shows the count (or sometimes the percentage) for that category. -Categorical Variables

Finding the slope

First we calculate the slope b of the line: $b = r \, s_y / s_x$, where r is the correlation, $s_y$ is the standard deviation of y, and $s_x$ is the standard deviation of x. Interpretation: The slope is the amount by which y changes when x increases by 1 unit.

Stemplots

For small data sets, a stemplot is quicker to make and presents more detailed information. Directions for making a stemplot can be found on page 20 of the text.

bimodal

Histograms with two peaks are called bimodal (and can indicate that there are two subgroups in the data set).

multimodal

If the histogram has more than two peaks, we call it multimodal.

unimodal

If there is one peak (mode) in the histogram, we say the distribution is unimodal.

Connecting Chapter 3 to our Current Knowledge of Statistics

In Chapters 1 and 2, we explored distributions for quantitative variables by: making graphical displays of data (Chapter 1), looking for an overall pattern and four key features (center, spread, shape, and outliers) (Chapters 1/2), and calculating numerical summaries to briefly describe center and spread (Chapter 2). Let's extend this: Approximate the overall pattern of a large sample size with a smooth curve. Use this smooth curve to determine probabilities of events.

Convenience Samples

In a convenience sample we collect information from the members of the population that are easiest to reach.

Matched Pairs Design

In a matched pairs design, the researchers choose pairs of subjects that are closely matched - e.g., same sex, height, weight, age, and race - and the treatment is randomly assigned to one of the subjects within the pair. Note: A variation of this design is to use 1 subject and give the two treatments to the subject in random order over time.

Interpreting Scatterplots - Form

In particular, we are interested in a linear relationship between the explanatory and response variables.

Four-Step Process

In this course we will use a Four-Step Process to organize our problems: 1. STATE: What is the practical question, in the context of the real-world setting? 2. PLAN: What specific statistical operations does this problem call for? 3. SOLVE: Make the graphs and carry out all required calculations. 4. CONCLUDE: Give your practical conclusion in the setting of the real-world problem. Earlier chapters will be more prescriptive in letting you know which tools are to be used. Later chapters tend to be more realistic by presenting the problem at hand, then allowing you to use the four-step process to solve it.

The Least-squares Regression Line

Mathematically: the least-squares regression line of y on x is the line that minimizes the sum of the squared vertical deviations between the data points and the line. The least-squares regression line has the equation $\hat{y} = a + bx$, where $\hat{y}$ is the predicted y-value, a is the intercept, and b is the slope. Note: a is in units of y, and b is in (units of y)/(units of x).

Principles of Experimental Design

Minimizing confounding and maximizing the perceived impact of the treatment(s) on the response provide intuitive support for sound experimental design. The authors continue this discussion on page 232 of the text—please review this. Statistical significance occurs when an observed effect is so large that it would rarely occur by chance alone. Chance variation is reduced by having an adequate number of subjects. (Experimental Design Principle 3)

Finding the intercept a

Next, we can calculate the intercept a: $a = \bar{y} - b\bar{x}$, where $\bar{x}$ and $\bar{y}$ are the sample means of the x- and y-variables, respectively. The intercept is the value of y when x = 0.
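Putting the two calculations together, here is a minimal sketch that computes the slope as b = r(s_y / s_x) and then the intercept as a = ȳ − b·x̄. The data are hypothetical and do not come from the per capita income example; the variable names are illustrative.

```python
from statistics import mean, stdev

# Hypothetical data: x = explanatory variable, y = response variable.
x = [8.0, 10.0, 12.0, 14.0, 16.0]
y = [3.1, 3.9, 5.2, 5.8, 7.0]
n = len(x)

x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)  # sample standard deviations (divide by n - 1)

# Correlation: sum of products of standardized values, divided by n - 1.
r = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
        for xi, yi in zip(x, y)) / (n - 1)

b = r * s_y / s_x      # slope
a = y_bar - b * x_bar  # intercept

print(f"y-hat = {a:.3f} + {b:.3f} x")
```

Computing the slope before the intercept mirrors the order of the notes, since the intercept formula needs b.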

Resistance to Outliers

Outliers can strongly influence the mean. The mean is not a resistant measure of center. The center of a dataset tells us information about the "typical" value. We will also use numerical measurements to describe the shape of the data.

Spotting Outliers

Outliers tend to require additional investigation. Statisticians tend to ask questions like, "Was there a data entry error?" or "Was this an unusual circumstance?" In our example, Esmerelda spent a larger than expected amount on a present. Would it qualify as an outlier? A suspected outlier falls more than 1.5 "IQRs" away from either Q1 or Q3. This is called the "1.5 × IQR rule for outliers." Mathematically, a suspected outlier lies below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR.
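The rule translates directly into two cutoffs. In this sketch the present costs and the quartiles Q1 = 15 and Q3 = 30 are hypothetical (they are not Esmerelda's actual data, which these notes do not list), and the function names are illustrative.

```python
def outlier_fences(q1, q3):
    """Return the (lower, upper) cutoffs from the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def suspected_outliers(data, q1, q3):
    """Return the observations that fall outside the cutoffs."""
    lower, upper = outlier_fences(q1, q3)
    return [value for value in data if value < lower or value > upper]

# Hypothetical present costs in dollars, with Q1 = 15 and Q3 = 30.
costs = [10, 15, 15, 20, 25, 30, 30, 95]
print(outlier_fences(15, 30))             # (-7.5, 52.5)
print(suspected_outliers(costs, 15, 30))  # [95]
```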

Finding Quartiles

Quartiles and medians are also measures of location - hence the need to order the values. Arrange the observations in increasing order and locate the median M in the ordered list of observations. The first quartile Q1 is the median of the observations that are to the left of M. The third quartile Q3 is the median of the observations that are to the right of M.
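The procedure above can be sketched as a short function. Note that this follows the "median of each half" rule described in these notes; statistical software often uses slightly different quartile conventions, so results may not match exactly. The function name and data are illustrative.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, M, Q3), where Q1 and Q3 are the medians of the
    observations to the left and to the right of the median M."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    m = median(ordered)
    lower = ordered[:half]            # observations to the left of M
    upper = ordered[half + (n % 2):]  # observations to the right of M
                                      # (M itself is skipped when n is odd)
    return median(lower), m, median(upper)

print(quartiles([10, 15, 15, 20, 25, 30, 30, 95]))  # (15.0, 22.5, 30.0)
print(quartiles([1, 2, 3, 4, 5, 6, 7]))             # (2, 4, 6)
```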

Response bias

Results when the behavior of the respondent or of the interviewer can impact the sample results. Operational Issues

Example: Per Capita Income vs. Education

Since the scatterplot confirmed that the regression line fits these data well, let's go ahead and use it to make some predictions. The regression line can be used to predict the percentage growth in per capita income for cities that have 8% to 16% of residents that have at least 4 years of college. Let's use the actual equation from the JMP output. a. Predict the percentage growth in per capita income for a city with 14% of residents having at least 4 years of college. b. Predict the percentage growth in per capita income for a city with 11.4% of residents having at least 4 years of college.

Single-blind

In a single-blind experiment, only one group (either the subjects or the personnel involved with the study) knows which treatment each subject is receiving; the other group does not. Recall that the first two cautions for randomized comparative experiments pertained to administration of the treatment.

Another Caution

Some experiments fall prey to a lack of realism that occurs when the experiment cannot realistically duplicate the conditions desired.

Explanatory and Response Variables

Sometimes we would like to show a causal relationship between the two variables (where one or more explanatory variables cause a change in the response variable). More will be discussed about this later on in Chapter 8.
Checkpoint: Which variable in each of the following pairs is the explanatory variable and which is the response variable? There might not be a clear answer in some cases; that is, there might be some cases where we would just like to explore the relationship between the two variables.
1. Square Footage of a Home & its Amount in Real Estate Taxes: a) Explanatory: Square Footage of a Home; Response: Real Estate Taxes b) Explanatory: Real Estate Taxes; Response: Square Footage of a Home c) Explore the relationship
2. Ice Cream Consumption & Number of Homicides: a) Explanatory: Ice Cream Consumption; Response: Number of Homicides b) Explanatory: Number of Homicides; Response: Ice Cream Consumption c) Explore the relationship

Using Table A to Find Normal Proportions, you might need "Pizza" (P.p.z.a.a.):

Step 1. State the Problem in terms of the observed variable x. Step 2. Draw a Picture that shows the proportion you want in terms of cumulative proportions. Step 3. Compute $z = (x - \mu)/\sigma$ to restate the problem in terms of a standard Normal variable z. Step 4. Use Table A to find the cumulative proportion. Step 5. Answer the question. Use the fact that the total area under the curve is 1 to find the required area under the standard Normal curve.
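Steps 3 through 5 can be sketched in code, with `NormalDist().cdf` playing the role of the Table A lookup. The function name and the example distribution (mean 64.5, standard deviation 2.5) are hypothetical.

```python
from statistics import NormalDist

def proportion_below(x, mu, sigma):
    """Cumulative proportion of a Normal(mu, sigma) distribution below x."""
    z = (x - mu) / sigma        # Step 3: standardize
    return NormalDist().cdf(z)  # Step 4: cumulative proportion for z

# Hypothetical example: heights that are Normal with mean 64.5 and sd 2.5.
mu, sigma = 64.5, 2.5
below_67 = proportion_below(67, mu, sigma)  # z = 1
above_67 = 1 - below_67                     # Step 5: total area is 1
between_62_and_67 = below_67 - proportion_below(62, mu, sigma)

print(round(below_67, 4), round(above_67, 4), round(between_62_and_67, 4))
```

"Above" and "between" areas are always built from cumulative ("below") proportions, which is why Step 2 asks for the picture in those terms.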

3. Most experiments follow this design:

Subjects → Treatment → Measure Response.

We standardize the values we have in any Normal distribution by

Subtracting the mean of the distribution from the observation, then dividing this value by the standard deviation of the distribution: $z = (x - \mu)/\sigma$.

Review of Straight Lines

Suppose that y is a response variable (plotted on the vertical axis) and x is an explanatory variable (plotted on the horizontal axis). In Algebra, a straight line relating y to x had an equation of the form y = mx + b In Statistics, a straight line relating y to x has an equation of the form y = a + bx In the equation, b is the slope, the amount by which y changes when x increases by one unit. The number a is the intercept, the value of y when x=0.

Inference about the Population

Taking a census requires gathering information about each element of a population. Sampling tells us something about our population of interest. If done scientifically, sampling can be just as effective as a census and considerably more efficient. Drawing conclusions about a population based on the sample is called inference.

Choosing Measures of Center and Spread

The 5-Number summary is resistant to strong outliers. The median is a resistant measure of center. The IQR is a resistant measure of spread. When extreme observations are involved, be sure to choose statistical methods that are not influenced by outliers. The mean and standard deviation ($\bar{x}$ and s) should be used for reasonably symmetric distributions that are free of outliers. Numerical measures of center and spread report specific facts about a distribution. Be sure to produce graphical displays to better understand the behavior of the data.

The Standard Normal Distribution

The 68-95-99.7 rule is helpful when a value is exactly 1, 2, or 3 standard deviations away from the mean... but standardizing allows us to work with values that are any number of standard deviations from the mean.

Facts About Normal Distributions

The Normal distribution will be the premier distribution used. All Normal curves have the same overall shape: symmetric, single-peaked, and bell-shaped. Any specific Normal curve is completely described by its mean μ and standard deviation σ. We use the notation N(μ, σ) to abbreviate "Normal with mean μ and standard deviation σ." Changing the mean μ without changing σ shifts the curve horizontally. Changing the standard deviation σ without changing μ adjusts the spread and preserves the center. Compare and contrast the center and spread/scale of each pair of graphs. Normal curves A and B have similar centers, but different scales. Normal curves B and C have similar scales, but different centers. Normal curves A and C have different centers and scales.

Measuring Linear Association: Correlation

The correlation r measures the direction and strength of the linear relationship between two quantitative variables. The correlation can be calculated with the formula $r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$. We will let technology do the work for us in terms of calculating r. Examining the formula tells us some facts about correlation.
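A common form of the formula is r = (1/(n − 1)) × the sum, over individuals, of (standardized x) × (standardized y). The sketch below implements that form and uses it to illustrate two facts: swapping x and y, or changing units, leaves r unchanged. The height and weight values are hypothetical.

```python
from statistics import mean, stdev

def correlation(x, y):
    """r = (1 / (n - 1)) * sum of products of standardized x and y values."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)
    return sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
               for xi, yi in zip(x, y)) / (n - 1)

# Hypothetical data.
height_in = [60, 62, 65, 68, 70]
weight_lb = [110, 125, 140, 155, 180]

r = correlation(height_in, weight_lb)
r_swapped = correlation(weight_lb, height_in)  # roles of x and y reversed
height_cm = [2.54 * h for h in height_in]      # change of units
r_metric = correlation(height_cm, weight_lb)

print(round(r, 4), round(r_swapped, 4), round(r_metric, 4))
```

All three printed values agree, because standardizing removes the units and multiplication does not care about order.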

Bias

The design of a statistical study is biased if it systematically favors certain outcomes; it either systematically over- or under-estimates the variable of interest.

Facts about Least-squares Regression

The distinction between explanatory and response variables is essential in regression. If we reverse the roles of the two variables, we get a different least-squares line. There is a close connection between correlation and the slope of the least-squares line: the slope and the correlation always have the same sign. Along the regression line, a change of one standard deviation in x corresponds to a change of r standard deviations in y. (Remember the equation of the slope.) The least-squares regression line always passes through the point $(\bar{x}, \bar{y})$ on the graph of y against x. The correlation describes the strength of a straight-line relationship in a specific way: the square of the correlation, $r^2$, is the fraction of the variation in the values of y that is explained by the least-squares regression of y on x.

Marginal Distributions

The first thing we want to do when looking at a two-way table is to examine each variable separately. Add row and column totals to the table if they do not already appear. Example: The distributions of Region of U.S. and Belief that Global Warming Increased Temperatures During December 2011 and January 2012 are called marginal distributions because they appear at the right and bottom margins of the two-way table. The marginal distribution of one of the categorical variables in a two-way table of counts is the distribution of the values of that variable among all individuals described by the table.

Goal

The goal for this chapter is to create an equation that best models the relationship between a response and an explanatory variable.

interquartile range (IQR),

The interquartile range (IQR) is a measure of spread that plays a role in mathematically determining outliers. The IQR is the distance between Q1 and Q3. To find the IQR, just subtract: IQR = Q3 − Q1.

Mean and Median of a Symmetric Distribution

The mean and median of a roughly symmetric distribution are close together. (If the distribution is exactly symmetric, the mean and median are exactly the same.)

mean

The mean is an arithmetic average and is calculated by summing the observations and then dividing by the number of observations. It is considered the "balance point" of the distribution. The mathematical notation for the mean is $\bar{x}$ (read "x-bar"): $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum x_i$.

Median

The median M is the midpoint of a distribution: the number such that half of the observations are smaller than it and the other half are larger. To find the median of a distribution: Arrange all n observations in order of size, from smallest to largest. If n is odd, the median M is the center observation in the list. If n is even, the median M is midway between the two center observations in the list. Note: the median in the ordered list of observations can be found by counting up to the value at the "(n+1)/2" location.

Histograms

The most common graph of the distribution of one quantitative variable is a histogram. Directions for manually constructing a histogram can be found in Example 1.4 on pages 12-13 of the text. We can display either the counts or percentages (relative frequencies or probabilities) of the distribution in a histogram. The four key features used in describing a graphical display of a quantitative variable are the: Shape: Is the distribution symmetric or skewed? Are there multiple 'peaks?' Center: What seems to be a 'typical' range of values? Spread: How variable is the data set? Outliers/unusual features: Are there any other odd things to note?

placebo effect

The placebo effect is the response to a dummy/phantom treatment. Phantom, mirage, hoax, etc... are all synonyms for placebos. The physician's belief in the treatment and the patient's faith in the physician exert a mutually reinforcing effect; the result is a powerful remedy that is almost guaranteed to produce an improvement and sometimes a cure. -- Petr Skrabanek and James McCormick, Follies and Fallacies in Medicine, p. 13

Interpreting Scatterplots - Strength

The strength of the relationship between the two variables can be seen by how much variation, or scatter, there is around the main form. We will have a formal measure for strength when we meet correlation. For now, focus on how closely the points follow the underlying pattern.

Mathematical Notes about Standard Deviation

The sum of the deviations is always 0. Computing the average "squared deviation" and then taking the square root provides a far better description of the data than computing the average deviation. When you know the n - 1 deviations, you could easily determine the last one.

Cautions

There are a couple of cautions we should consider with randomized comparative experiments: 1. If the treatment is in the form of a pill (or the like), should everyone take the treatment? 2. What if half of the subjects take the treatment and the other half do not? 3. What if the subjects know which treatment they are receiving? 4. What if those evaluating the results know which treatment the subjects received?

Examples of Surveys

There are a few excellent examples of sample surveys in use in the United States: 1. The Current Population Survey (CPS) is a joint effort between the Bureau of Labor Statistics and the U.S. Census Bureau. The CPS samples about 60,000 households each month from the 50 states and the District of Columbia. (More details can be found at http://www.census.gov/cps/methodology and Example 8.1 on page 200 of the text.) 2. a. The General Social Survey (GSS) is run by the National Opinion Research Center (NORC) and funded by the Sociology Program of the National Science Foundation. (See http://www3.norc.org/GSS+Website/ for more information. ) b. Small fact - Dr. Baker was fortunate to have a summer internship with NORC (Chicago office) during graduate school. He worked on a Child Immunization study and assisted with several other projects.

Organizing a Statistical Problem

There are numerous ways to approach the beloved discipline of collecting, organizing, analyzing, and interpreting data. Generally, picturing distributions with graphs and describing them with numbers are good initial steps. Interpreting results within the context of the problem is also important.

Categorical Variables

This same set of n individuals could be placed into the 1st, 2nd, ..., or cth category for Categorical Variable II. Example: Categorical Variable II is the belief that global warming contributed to above-average temperatures during December 2011 & January 2012 (Agree, Disagree: c = 2). There are (r × c) pairs of outcomes the individuals could be placed into. Example: 4 × 2 = 8 pairs of outcomes:
-Northeasterners who agree that global warming increased temperatures in 12/11-1/12
-Midwesterners who agree that global warming increased temperatures in 12/11-1/12
-Southerners who agree that global warming increased temperatures in 12/11-1/12
-"Westerners" who agree that global warming increased temperatures in 12/11-1/12
-Northeasterners who do not agree that global warming increased temperatures in 12/11-1/12
-Midwesterners who do not agree that global warming increased temperatures in 12/11-1/12
-Southerners who do not agree that global warming increased temperatures in 12/11-1/12
-"Westerners" who do not agree that global warming increased temperatures in 12/11-1/12
A two-way table helps us organize the data for a pair of categorical variables. Note: You might hear this table referred to as a "contingency table."

Other Sampling Designs

To select a stratified random sample, 1. Classify the population into groups of similar individuals, called strata. 2. Then choose a separate SRS in each stratum and combine these SRSs to form the full sample. This method is useful when there are important groups within the population. Note: Most large-scale surveys use multistage samples, which might incorporate several types of sampling.
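The two steps above can be sketched as follows: an SRS is drawn separately inside each stratum, and the pieces are combined. The strata, labels, and per-stratum sample sizes are hypothetical.

```python
import random

def stratified_sample(strata, n_per_stratum, seed=None):
    """Choose a separate SRS in each stratum and combine them."""
    rng = random.Random(seed)
    sample = []
    for name, members in strata.items():
        sample.extend(rng.sample(members, n_per_stratum[name]))
    return sample

# Hypothetical population classified by class year.
strata = {
    "freshman": [f"F{i}" for i in range(1, 41)],
    "senior": [f"S{i}" for i in range(1, 21)],
}
sample = stratified_sample(strata, {"freshman": 4, "senior": 2}, seed=7)
print(sample)
```

Unlike a single SRS from the whole population, this guarantees that each important group is represented in the full sample.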

Interaction amongst categorical variables: "And"

Two-way tables allow us to look at interactions amongst categorical variables. "And" question: What percentage of the respondents are Midwesterners who agree that Global Warming increased temperatures during December 2011 and January 2012? a) 219/1009 b) 168/1009 c) 168/219 d) 732/1009

Cautions about Correlation and Regression

We already know that correlation and regression lines describe only linear relationships. In addition, correlation and least-squares regression lines are not resistant to outliers and influential observations.
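To see what "not resistant" means in practice, here is a minimal Python sketch that computes the correlation r for a perfectly linear data set and then again after one outlying point is added. The data values are made up for illustration, and `correlation` is a helper written here rather than a library function.

```python
from math import sqrt

def correlation(xs, ys):
    """Correlation r between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data lying exactly on a line.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

r_clean = correlation(x, y)
# Add a single outlying observation (6, 0) that breaks the pattern.
r_outlier = correlation(x + [6], y + [0])

print(round(r_clean, 3), round(r_outlier, 3))  # 1.0 0.143
```

A single unusual observation moves r from a perfect 1.0 to a weak positive value, which is why outliers should be checked before a correlation is interpreted.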

Blindness

When neither the subjects nor the personnel who work with them know which treatment each subject is receiving, the experiment is double-blind.

Conditional Distributions

When we change the "percent of what" question from "percent of all respondents" to "percent of Midwesterners" or "percent of persons who believe in global warming's impact," we are conditioning on one of the variables. A conditional distribution of a variable is the distribution of values of that variable among only the individuals who have a given value of the other variable.
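The change in the "percent of what" question can be sketched in Python. The counts below are made up purely for illustration and are NOT the survey's results; `conditional_distribution` is a hypothetical helper name.

```python
# Made-up counts for illustration only; these are NOT the survey's results.
table = {
    "Midwest": {"Agree": 30, "Disagree": 10},
    "South":   {"Agree": 20, "Disagree": 40},
}

def conditional_distribution(table, row):
    """Distribution of the column variable among individuals in the given row only."""
    row_total = sum(table[row].values())
    return {col: count / row_total for col, count in table[row].items()}

total = sum(sum(row.values()) for row in table.values())

# "Percent of all respondents": divide by the grand total.
share_of_all = table["Midwest"]["Agree"] / total

# "Percent of Midwesterners": divide by the Midwest row total only.
midwest = conditional_distribution(table, "Midwest")

print(share_of_all)  # 30 / 100
print(midwest)       # Agree: 30 / 40, Disagree: 10 / 40
```

The same cell count gives a different percentage depending on which total it is divided by, which is the whole point of conditioning.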

Pie Charts and Bar Graphs

When we conduct exploratory data analysis: -Begin by examining each variable by itself. Then move on to study the relationships among the variables. -Produce a graphical display of the data. Then add numerical summaries of specific aspects of the data.

Describing Density Curves

When we talk about density curves, we use lower-case Greek letters to describe the curves. We use μ (Greek lower-case "mu," pronounced like "mew") to denote the mean of a density curve. We use σ (Greek lower-case "sigma," pronounced like "sig-muh") to denote the standard deviation of a density curve.

When planning a statistical study or exploring data provided from someone else's work, ask the following questions:

Who? What individuals do the data describe? What? How many variables do the data contain? What are the exact definitions of these variables? In what unit of measurement is each variable recorded? Where? Does the location seem consistent with the findings? When? Could the results have been impacted by a historic event? Why? What purpose do the data have?

Normal distributions

are a family of specific density curves that are also good approximations for the results of many kinds of chance outcomes. The statistical inference procedures we will use later in the course can be approximated by Normal distributions. Though you don't necessarily know what statistical inference is right now, the Normal distribution will be the premier distribution used throughout the whole course - you cannot get through the course without knowing it!

wording of questions

can have the most important influence on the answers given to a sample survey, especially when the questions are confusing or misleading. Operational Issues

A sampling design

describes how to choose a sample from the population.

density curve

describes the overall pattern of a distribution. The area under the curve for a given range of values along the x-axis is the proportion of the population that falls in that range. Important: the density curve has an area of exactly 1 underneath it. You might find it helpful to think about this in terms of 100%. We can use the same concepts of shape, center, and spread/variability to describe a density curve. Outliers are less relevant here, as we'll focus on symmetric distributions. The median of a density curve is the point that divides the area under the curve in half - it is the equal-areas point. The mean of a density curve is the balance point, the point at which the curve would balance if made of solid material.
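The "area of exactly 1" and "equal-areas point" ideas can be checked numerically. The sketch below assumes the standard Normal density as the example curve (the card itself does not pick one) and approximates areas with the trapezoid rule; `normal_density` and `area_under` are helper names chosen here.

```python
from math import exp, pi, sqrt

def normal_density(x, mu=0.0, sigma=1.0):
    """Height of the Normal density curve with mean mu and standard deviation sigma."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

def area_under(density, a, b, steps=10_000):
    """Trapezoid-rule approximation of the area under the curve between a and b."""
    width = (b - a) / steps
    total = 0.5 * (density(a) + density(b))
    for i in range(1, steps):
        total += density(a + i * width)
    return total * width

# The range -8 to 8 covers essentially all of the standard Normal curve.
total_area = area_under(normal_density, -8, 8)
# For a symmetric curve, the median (and mean) sits at the center, 0.
left_half = area_under(normal_density, -8, 0)

print(round(total_area, 4))  # about 1, i.e. 100% of the population
print(round(left_half, 4))   # about 0.5, the equal-areas point
```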

Experiments

determine whether the treatment elicited the change in the response.

An explanatory (independent) variable

explains or influences changes in a response variable. Cause

observational study

focuses on individuals and measures variables of interest but does not attempt to influence the responses. Data is collected to better describe a group or situation. There is no active treatment being impressed upon the individuals.

cumulative proportion

for a value x in a distribution is the proportion of observations in the distribution that are less than or equal to x.
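For a list of observations, this definition amounts to a single count and a division. The scores below are hypothetical.

```python
def cumulative_proportion(observations, x):
    """Proportion of observations that are less than or equal to x."""
    at_or_below = sum(1 for obs in observations if obs <= x)
    return at_or_below / len(observations)

# Hypothetical exam scores, made up for illustration.
scores = [55, 62, 70, 70, 81, 88, 93, 97]

print(cumulative_proportion(scores, 70))  # 4 of the 8 scores are <= 70, so 0.5
```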

The population

in a statistical study is the entire group of individuals about which we want information.

explanatory variables

in an experiment are often called factors.

Use enough subjects

in each group to reduce chance variation in the result

2. Confounding can happen in

experiments, but ideally the researcher designs the experiment to avoid confounding.

variable

is a characteristic that differs among individuals in a population or in a sample

A sample

is a part of the population from which we actually collect information. We can make inferences about a population from a well-drawn sample.

treatment

is any specific experimental condition applied to the subjects. If an experiment has more than one factor, a treatment is a combination of specific values of each factor.

Typically, the explanatory variable

is plotted on the x-axis, and the response variable is plotted on the y-axis. If there is no explanatory-response distinction, either variable can go on the x-axis.

modified boxplot

is similar to a boxplot, but it shows suspected outliers as dots (or another symbol, such as an asterisk). Fuel efficiency (in MPG) of 15 automobiles. Which type of car might the outlier be?

quantitative variable

is something that can be counted or measured for each individual and then added, subtracted, averaged, etc. across individuals in the population.

second quartile, Q2

is synonymous with the median.

The first quartile, Q1,

is the value in the sample that has 25% of the data at or below it.

The third quartile, Q3

is the value in the sample that has 75% of the data at or below it.

The Standard Deviation

is used to describe the variation around the mean. Since the standard deviation depends on the mean $\bar{x}$, it is not resistant to outliers and skewness. The deviation of an observation $x_i$ is how far it is from the mean: $x_i - \bar{x}$. Prior to finding the standard deviation, we first calculate the variance $s^2$, an average of the squares of the deviations of the observations from their mean. The variance is calculated using the following formula: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$. The standard deviation is the positive square root of the variance: $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$. Note: The number n-1 in the denominator is called the degrees of freedom.
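The calculation can be followed step by step in Python. The five data values are hypothetical, and the result is compared against the standard library's `statistics.stdev`, which also divides by n-1.

```python
from math import sqrt
import statistics

# Hypothetical observations, made up for illustration.
data = [1, 2, 3, 4, 5]

n = len(data)
mean = sum(data) / n                      # the mean is 3.0
deviations = [x - mean for x in data]     # how far each observation is from the mean

# Variance: sum of squared deviations divided by the degrees of freedom, n - 1.
variance = sum(d ** 2 for d in deviations) / (n - 1)   # 10 / 4 = 2.5

# Standard deviation: positive square root of the variance.
std_dev = sqrt(variance)

print(variance, round(std_dev, 4))
print(round(statistics.stdev(data), 4))   # library value for comparison
```

Note that the deviations always sum to zero, which is why they are squared before being averaged.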

A response (dependent) variable

measures an outcome of a study. Effect

Nonresponse

occurs when an individual chosen for the sample can't be contacted or refuses to participate. Design Issues

Undercoverage

occurs when some groups in the population are left out of the process of choosing the sample. Design Issues

experiment

on the other hand, deliberately imposes some treatment on individuals in order to observe their responses.

we collect information—data—from individuals. Individuals can be

people, animals, plants, or any other objects of interest.

categorical variable

places an individual into one of several groups or categories. Counts or proportions of individuals are important here.

statistical inference

procedures we will use later in the course can be approximated by Normal distributions. Though you don't necessarily know what statistical inference is right now, the Normal distribution will be the premier distribution used throughout the whole course - you cannot get through the course without knowing it!

Properties of Standard Deviation

s measures spread about the mean and should be used when the mean is the most appropriate measure of center. s is always greater than or equal to 0. s increases as the observations become more spread out about their mean. s has the same units of measurement as the original observations. s, like the mean, is not resistant to outliers and skewness.

1. Observational studies are not as effective at

showing cause and effect because of confounding.

five-number summary

consists of the smallest observation, Q1, the median, Q3, and the largest observation, written in order from smallest to largest. In symbols: Minimum, Q1, M, Q3, Maximum.
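A five-number summary can be computed by hand or with a short Python sketch like the one below. It assumes one common textbook convention for the quartiles (Q1 is the median of the observations below the overall median's position, Q3 the median of those above it); software packages often use slightly different rules. The data values are hypothetical.

```python
def median(sorted_values):
    """Median of a list that is already sorted."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2

def five_number_summary(values):
    """Return (Minimum, Q1, M, Q3, Maximum) using the median-of-halves convention."""
    data = sorted(values)
    n = len(data)
    lower_half = data[: n // 2]          # observations below the median's position
    upper_half = data[(n + 1) // 2 :]    # observations above the median's position
    return (data[0], median(lower_half), median(data), median(upper_half), data[-1])

# Hypothetical observations, made up for illustration.
values = [12, 15, 18, 20, 21, 22, 24, 25, 27]

print(five_number_summary(values))  # (12, 16.5, 21, 24.5, 27)
```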

individuals

studied in an experiment are often called subjects, particularly when they are people. Animals or objects can be subjects also.

The distribution of a categorical variable lists

the categories and gives either the count or the percent of individuals who fall in each category.

Control

the effects of lurking variables on the response, most simply by comparing two or more treatments.

In a skewed density curve

the mean and median are not the same. For example, when a curve is skewed right, the mean is pulled to the right, toward the long tail.

In a symmetric density curve

the mean and median are the same. We say there is no skew.

Two variables (explanatory variables or lurking variables) are confounded when

their effects on a response variable cannot be distinguished from each other.

Randomize

use (impersonal) chance to assign subjects to treatments. Technology is used to determine these assignments.
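One way technology can make these assignments is to shuffle the subjects and then deal them out to the treatments in turn. The sketch below is an illustration only: the subject labels, the treatment names, and the helper `randomize` are hypothetical.

```python
import random

def randomize(subjects, treatments, seed=0):
    """Assign subjects to treatments by chance, keeping group sizes as equal as possible."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    shuffled = list(subjects)
    rng.shuffle(shuffled)      # impersonal chance decides the order

    groups = {treatment: [] for treatment in treatments}
    for i, subject in enumerate(shuffled):
        # Deal the shuffled subjects out to the treatments in turn.
        groups[treatments[i % len(treatments)]].append(subject)
    return groups

# Hypothetical subjects and treatments, made up for illustration.
subjects = [f"subject{i}" for i in range(1, 13)]
groups = randomize(subjects, ["treatment A", "treatment B"])

for treatment, members in groups.items():
    print(treatment, members)
```

Because chance alone decides who gets which treatment, neither the researcher's judgment nor the subjects' characteristics can systematically favor one group.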

The distribution of a variable tells us

what values it takes and how often it takes these values.

