Statistic

Difference between sample statistics & population parameters

"Population" is the total group about which we want to draw a conclusion. "Parameter" is a numerical summary of the POPULATION. Greek letters are typically used for parameters: - Mean: μ ("mu") - Standard deviation: σ ("sigma") "Sample" is a subset of the population; it is the group from which we actually have data. "Statistic" is a numerical summary of the sample DATA. - Mean: x̄ ("x bar") - Standard deviation: s

Association vs. Correlation

"correlation" applies strictly to linear relationships between variates, whereas the word "association" applies broadly to any type of relationship.

Percent / Proportion / Percentile

"Relative frequency": the frequency of a value divided by the total number of occurrences. E.g. total occurrences = 20 and the frequency of this value is 3, so its relative frequency is 3 / 20 = 0.15, or 15%.
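
As a quick check of that arithmetic, here is a minimal Python sketch. The function name `relative_frequency` is an illustrative choice, not something from the course material.

```python
def relative_frequency(frequency, total):
    """Share of all observations that take a given value."""
    return frequency / total

rel = relative_frequency(3, 20)  # 3 occurrences out of 20 observations
print(rel)                       # 0.15
print(f"{rel:.0%}")              # 15%
```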

Stem and Leaf (stemplots)

Like a dot plot, it displays individual observations. 1. Each observation is represented by a stem and a leaf. Typically, the stem consists of all the digits except for the units digit, which is the leaf. 2. The stems go on the left side of a vertical line and the leaves on the right. *If we don't have many stems, we can split each stem in two by its leaves, e.g. leaves 0-4 on one line and leaves 5-9 on the next.

Individuals

Objects described by a set of data. E.g. people, dogs, almost any noun.

Frequency Table

One way to display data is a frequency table. These tables list the possible values alongside the number of observations for each value.

Dot Plots

Show a dot for each observation, placed on a number line just above the corresponding value. To make one: 1. Draw a horizontal line, label it with the name of the variable, and mark regular values of the variable on it, 2. For each observation, place a dot above its value on the number line.

Side by side box plot

Side by side box plots help to compare distributions

Nonsense Correlations

A strong correlation that leads to a nonsense conclusion. The association is real, but there is no causal effect present.

- Median

Symbol: M. Arrange the data in numerical order and find the middle value. If the number of observations is even, take the mean of the two middle values.
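
The two cases (odd and even number of observations) can be sketched in Python as follows. This is a minimal illustration; the function name `median` is my own, and the standard library's `statistics.median` does the same job.

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: average the middle pair

print(median([5, 1, 3]))     # 3
print(median([4, 1, 3, 2]))  # 2.5
```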

- variance

Symbol: s^2. Roughly the average of the squares of the deviations of the observations from their mean (the sum is divided by n - 1 rather than n). s^2 = ((x1 - xbar)^2 + (x2 - xbar)^2 + ... + (xn - xbar)^2) / (n - 1)
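
A short Python sketch of this formula, using made-up data. The name `sample_variance` is illustrative; the result is compared against the standard library's `statistics.variance`, which also divides by n - 1.

```python
import statistics

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    xbar = sum(values) / n
    squared_deviations = [(x - xbar) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)

data = [1, 2, 3, 4, 5]           # hypothetical observations
s2 = sample_variance(data)
print(s2)                        # 2.5
print(statistics.variance(data)) # same value from the standard library
```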

- Mean

Symbol: x̄, read as "x bar", means "average". Formula:【 (x1 + x2 + ... + xn) / n 】, or,【 ∑xi / n 】 ∑ - "sigma", means "sum of"; xi - each data value; n - number of data values

1.5 * IQR criterion

The 1.5 * IQR criterion for identifying outliers: A data point is an outlier if it lies: - More than 1.5 * IQR below Q1 【data < Q1 - 1.5 * IQR】 - More than 1.5 * IQR above Q3 【data > Q3 + 1.5 * IQR】 Outliers are points that fall outside the range (Q1 - 1.5IQR) to (Q3 + 1.5IQR)
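
The criterion can be sketched in Python as below. The quartile values and data are hypothetical, and the function names `iqr_fences` and `find_outliers` are my own.

```python
def iqr_fences(q1, q3):
    """Lower and upper cutoffs: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(values, q1, q3):
    """Values that fall strictly outside the fences."""
    lower, upper = iqr_fences(q1, q3)
    return [v for v in values if v < lower or v > upper]

lower, upper = iqr_fences(10, 20)   # hypothetical Q1 = 10, Q3 = 20, so IQR = 10
print(lower, upper)                 # -5.0 35.0
flagged = find_outliers([-7, 12, 18, 35, 40], 10, 20)
print(flagged)                      # [-7, 40]
```

A value sitting exactly on a fence (35 here) is not flagged, since the criterion asks for points outside the range.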

Correlation

Correlation is the way to measure the strength of a linear association. Symbol: r. r = (1 / (n - 1)) ∑ ((xi - xbar) / Sx)((yi - ybar) / Sy). You don't need to memorize the formula; use a calculator.
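
For readers who want to see the formula at work, here is a minimal Python sketch on made-up data. The function name `correlation` is illustrative; `statistics.stdev` is the sample standard deviation (dividing by n - 1), matching Sx and Sy.

```python
import statistics

def correlation(xs, ys):
    """r = (1 / (n - 1)) * sum of products of standardized x and y values."""
    n = len(xs)
    xbar, ybar = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    total = sum(((x - xbar) / sx) * ((y - ybar) / sy) for x, y in zip(xs, ys))
    return total / (n - 1)

x = [1, 2, 3, 4, 5]   # hypothetical explanatory values
y = [2, 4, 5, 4, 5]   # hypothetical response values
r = correlation(x, y)
print(round(r, 4))                  # 0.7746
print(round(correlation(y, x), 4))  # same value: swapping x and y does not change r
```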

Equation of the line

y hat = a + bx (use y hat to emphasize that this is a predicted response for any x. It is VERY important to include this notation!) where b = r(Sy/Sx) 【slope】 a = y bar - b(x bar)【intercept】 **NOTE: (x bar, y bar) is always on the line!! We could also use this form of the line: 【y hat - y bar = r(Sy/Sx)(x - x bar)】
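
A hedged Python sketch of these formulas on made-up data: the slope comes from r, Sx and Sy, and the intercept uses the means of x and y. The name `least_squares_line` is my own.

```python
import statistics

def least_squares_line(xs, ys):
    """Return (a, b) for y hat = a + b * x, with b = r * Sy / Sx and a = ybar - b * xbar."""
    n = len(xs)
    xbar, ybar = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    r = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)
    b = r * sy / sx
    a = ybar - b * xbar
    return a, b

x = [1, 2, 3, 4, 5]   # hypothetical data, x bar = 3
y = [2, 4, 5, 4, 5]   # hypothetical data, y bar = 4
a, b = least_squares_line(x, y)
print(round(a, 2), round(b, 2))  # 2.2 0.6
print(round(a + b * 3, 2))       # 4.0, so (x bar, y bar) is on the line
```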

Residual

residual (prediction error) = observed - predicted. The LSRL of y on x minimizes the sum of the squares of the residuals.
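
To make "minimizes the sum of the squares" concrete, the sketch below compares two candidate lines on made-up data. For this data set the least-squares line works out to y hat = 2.2 + 0.6x; the second line, with slope 0.7, is an arbitrary alternative chosen for comparison. Function names are illustrative.

```python
def residuals(xs, ys, a, b):
    """Observed minus predicted, for the line y hat = a + b * x."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]

def sum_squared_residuals(xs, ys, a, b):
    return sum(e ** 2 for e in residuals(xs, ys, a, b))

x = [1, 2, 3, 4, 5]   # hypothetical data
y = [2, 4, 5, 4, 5]
best = sum_squared_residuals(x, y, 2.2, 0.6)   # least-squares line for this data
other = sum_squared_residuals(x, y, 2.2, 0.7)  # a slightly different line
print(round(best, 2), round(other, 2))         # 2.4 2.95
```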

Percentile

The xth percentile is a value such that x% of the data fall at or below it. Percentiles are used to describe position when the median is the measure of center. - The median is the 50th percentile.

Bar Graphs

A vertical bar for each category. The height of the bar is the percentage of observations in the category.

Quartiles

- The 1st quartile, Q1 is the 25th percentile. It is the median of the observations below the overall median. - The 2nd quartile, Q2 is the median of all observations. - The 3rd quartile, Q3 is the 75th percentile. It is the median of the observations above the overall median.
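
The "median of each half" description can be sketched in Python as follows. One assumption is labeled here: when the number of observations is odd, this sketch leaves the overall median out of both halves. Other conventions exist, so a calculator or `statistics.quantiles` may return slightly different quartiles.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Assumption: for odd n, the middle observation belongs to neither half.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return statistics.median(lower), statistics.median(ordered), statistics.median(upper)

print(quartiles([1, 3, 5, 7, 9, 11, 13]))    # (3, 7, 11)
print(quartiles([1, 2, 3, 4, 5, 6, 7, 8]))   # (2.5, 4.5, 6.5)
```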

2.2 Describing Data with Summaries

...

2.3 Describing Center

...

2.4 Describing Spread(variability)

...

2.5 Measures of Position and Spread

...

3.1 Association between two categorical variables

...

Measures of Center:

...

Measures of Spread:

...

Quantitative variables can be:

...

Two types of Variables:

...

Warning of Associations

1. Associations are tendencies, not truths set in stone 2. The relationship could be strongly influenced by other variables lurking in the background

Other notes

1. Correlation does NOT care about the explanatory vs. response variable distinction (i.e. it doesn't matter which is x or y) 2. Correlation requires BOTH variables to be quantitative 3. Correlation has NO units 4. Correlation only works for LINEAR relationships 5. Correlation is NOT resistant to outliers 6. Use caution if there are outliers: - Usually a good idea to compute it again without the outliers and compare the two results

Explanatory Variable

1. Defines the groups to be compared with respect to values of the response variable 2. If there is a linear relationship, it's our "x" in y = mx + b 3. Explains or causes CHANGES in the response variable 4. Can also think of it as the input

General Rules When Looking at Data

1. Examine each variable separately 2. Study any relationships between the variables 3. Create graph(s) *pick appropriate graph 4. Add numerical summaries

Timeplots

1. Horizontal axis is ALWAYS time 2. Vertical axis is whatever is being measured per unit time 3. Connect "dots" to emphasize the pattern 4. Used to display change/trend 5. LABEL THE AXES

* Comments

1. If the distribution is exactly symmetric, the median and the mean will be exactly the same. 2. The median is resistant to extreme data. This means that extreme values or outliers have little effect on it. - What is an outlier? For now we will say it is an observation that lies well outside the bulk of the data. - Use the median to describe the center of the data if it is highly skewed. 3. The mean is nonresistant to extreme data. A skewed distribution will pull the mean towards the tail relative to the median. - If skewed to the left, the mean is smaller than the median. - If skewed to the right, the mean is larger than the median. - Use the mean to describe the center of the data if it is fairly symmetrical. 4. Mode: the value that occurs most frequently. - It does not need to be the center of the distribution - It describes the most typical observation in terms of the most common outcome. - It is most often used to describe the category of a categorical variable that has the highest frequency. 5. Names for skew (measures listed from left to right): - Symmetrical Distribution (Mean, Median, Mode at one line) - Positive Skew (Mode, Median, Mean) - Negative Skew (Mean, Median, Mode) - The more skewed, the bigger the difference between median and mean.

Aspects to consider

1. Interpret the y-intercept (the predicted value when x = 0) 2. Interpret the slope (the amount that y hat changes when x increases by one unit)

Notes about box plots

1. It doesn't show all the features of a distribution that other graphs do, such as modality. A histogram is better for that. 2. It does show whether the data are skewed. If they are, the skew is toward the side with the larger part of the box and the longer whisker. 3. Great for identifying potential outliers. 4. When summarizing a distribution, the mean and standard deviation are more commonly used than the 5 number summary.

Response Variable

1. It is the outcome variable on which comparisons are made 2. If there is a linear relationship, it's our "y" in y = mx + b 3. It's the dependent variable 4. Think of it as the output

LSRL (least-squares regression line)

1. A method for finding the relationship, i.e. how the response variable (y) changes as the explanatory variable (x) changes 2. Purpose: to be able to predict y given x *Note - Unlike correlation, it MUST have distinct response and explanatory variables - If the variables aren't related, do not create a regression line - Statisticians write the line of best fit as 【y hat = a + bx】 or 【y hat = b + ax】 The idea behind it: since we want to predict y from x, we want a line that is as close as possible to the points in the vertical direction

Conditional Proportions (based on table)

1. Proportions within a category 2. Whenever we distinguish between a response and an explanatory variable, it is natural to form conditional proportions for categories of the response variable.
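
A minimal Python sketch of conditional proportions computed from a contingency table. The counts, the group labels ("treatment", "control") and the function name are all hypothetical; each row is a category of the explanatory variable and each column a category of the response.

```python
def conditional_proportions(table):
    """For each explanatory category, divide each response count by that row's total."""
    result = {}
    for group, counts in table.items():
        total = sum(counts.values())
        result[group] = {outcome: count / total for outcome, count in counts.items()}
    return result

counts = {  # hypothetical contingency table
    "treatment": {"improved": 30, "not improved": 20},
    "control": {"improved": 15, "not improved": 35},
}
props = conditional_proportions(counts)
print(props["treatment"]["improved"])  # 0.6
print(props["control"]["improved"])    # 0.3
```

Comparing 0.6 with 0.3 is the kind of comparison a side-by-side bar graph would display.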

What are the criteria for establishing causation when we can't do an experiment (for ethical or feasibility reasons)?

1. The association is strong 2. The association is consistent 3. Higher doses = stronger responses 4. Alleged cause precedes the effect in time 5. Alleged cause is plausible

Pareto Charts

A bar graph with categories ordered by their frequency, from tallest to shortest.

Variable

A characteristic of an individual. E.g. if individual = HS students, then possible variables = age, grade, ethnicity.

Pie Charts

A circle that has a "slice" for each category. The size of each slice corresponds to the percentage of observations in the category.

Contingency Table

A display for 2 categorical variables. Rows list the categories of one variable and its columns list the categories of the other variable.

Histograms

A graph that uses bars to portray the frequencies or the relative frequencies of the possible outcomes for a quantitative variable. These are great to use when a quantitative variable takes on many values. It breaks the range of values of a variable into "classes" (intervals) and displays only the count (or %) of the observations that fall into each class. Things to keep in mind: 1. You can choose any convenient number of classes. Pick what's best for displaying the shape of the data. 2. Always choose classes of uniform (equal) width. 3. "Blocks" should be adjacent, as there should not be a break in your intervals on the horizontal axis. 4. LABEL THE AXES 5. NOTE: They do NOT display the actual values observed. 6. If you want to compare distributions with different numbers of observations by comparing their histograms, use histograms of percents, not frequencies. 7. Modes. These are the "peaks." If there is only one, we say it is unimodal. If there appear to be two, we call it bimodal.

- Categorical (Qualitative)

A group to which an individual belongs.

2.1 Descriptive Statistics

A key element of statistical analysis is the pictorial representation of data. We can more quickly draw conclusions by looking at displays than by looking at raw data.

Extrapolation

A prediction made at a value of the explanatory variable that is outside the given values for x. Consider the domain and range.

Z-Score

It is the number of standard deviations that an observation falls from the mean. 【Z-score = (observation - mean) / standard deviation】
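
A tiny Python sketch of the formula, with made-up numbers (a mean of 70 and a standard deviation of 10). The function name `z_score` is illustrative.

```python
def z_score(observation, mean, standard_deviation):
    """Number of standard deviations the observation lies from the mean."""
    return (observation - mean) / standard_deviation

print(z_score(85, 70, 10))  # 1.5  (above the mean)
print(z_score(55, 70, 10))  # -1.5 (below the mean)
```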

Side-by-Side bars (graph)

Also compare conditional proportions

Association

An association exists between two variables if a particular value for one variable is more likely to occur with certain values of the other variable.

3.3 Predicting Outcomes

As we saw in the last section, sometimes the scatterplot implies that the data have a linear relationship. If the two variables are quantitative, we can use a regression line to describe that relationship.

Key concept

Association does not imply causation.

Correlation coefficient: the general rule (not absolute) for determining strength

Direction/Assoc.
Positive (+): 【weak】 0 < r <= 0.3; 【mild/moderate】 0.3 < r <= 0.6; 【strong】 0.6 < r <= 1
Negative (-): 【weak】 0 > r >= -0.3; 【mild/moderate】 -0.3 > r >= -0.6; 【strong】 -0.6 > r >= -1
**Note: if r = ±1, the points lie on an exact straight line
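
The rule of thumb above can be written as a small Python function. This is only a sketch: the function name is my own, and the handling of r = 0 (which the cutoffs do not cover) is an assumption.

```python
def describe_correlation(r):
    """Return (direction, strength) using the 0.3 and 0.6 cutoffs."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")
    if r == 0:
        # Assumption: the rule of thumb has no entry for r exactly 0.
        return ("none", "no linear association")
    direction = "positive" if r > 0 else "negative"
    size = abs(r)
    if size <= 0.3:
        strength = "weak"
    elif size <= 0.6:
        strength = "mild/moderate"
    else:
        strength = "strong"
    return (direction, strength)

print(describe_correlation(0.77))   # ('positive', 'strong')
print(describe_correlation(-0.25))  # ('negative', 'weak')
```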

r^2 in regression

It is the proportional reduction in error. The square of the correlation is the fraction of the variation in the values of y that is explained by the least squares regression of y on x.
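
"Proportional reduction in error" can be illustrated with a short Python sketch on made-up data: compare the squared error from always predicting y bar with the squared error from the regression line. The function name is illustrative, and the slope is computed as Sxy / Sxx, which is algebraically the same as r(Sy/Sx).

```python
def r_squared(xs, ys):
    """(error using y bar - error using the line) / error using y bar."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    b = sxy / sxx          # slope of the least-squares line
    a = ybar - b * xbar    # intercept
    error_using_mean = sum((y - ybar) ** 2 for y in ys)
    error_using_line = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return (error_using_mean - error_using_line) / error_using_mean

x = [1, 2, 3, 4, 5]   # hypothetical data
y = [2, 4, 5, 4, 5]
r2 = r_squared(x, y)
print(round(r2, 2))   # 0.6
```

For this data r is about 0.7746, and 0.7746 squared is about 0.6, matching the result.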

How do we establish causation?

Experimentation. We have to design experiments in which the effects of lurking variables are controlled.

3.4 Cautions in Analyzing Association

For any statistical procedure, there are pitfalls to be avoided.

5 number summary

For the 5 number summary (found in the same place as the mean and standard deviation), give the following values from the data set: minimum, Q1, median, Q3, maximum. To get them on the calculator: STAT > EDIT > 1: Edit, then STAT > CALC > 1: 1-Var Stats

Proportion

Frequency / Total occurrences. It is expressed as a fraction.

- Quantitative

Have numerical values. We can apply arithmetic operations to them. E.x. age, score, cholesterol level.

Mode

The value with the highest frequency.

Frequency (count)

How often a piece of data occurs.

Interquartile range (IQR)

IQR = Q3 - Q1. Another way to measure spread, used when the median is the measure of center. - Measures the spread of the middle half of the data. - It is resistant to outliers - VERY important as it can help identify outliers

Empirical Rule

If a distribution is unimodal and is approximately symmetric and bell shaped, we can apply the empirical rule: - 68% of the observations fall within 1 standard deviation of the mean. - 95% of the observations fall within 2 standard deviations of the mean. - 99.7% (nearly all) of the observations fall within 3 standard deviations of the mean. Hence, we get six ranges between -3 and +3 standard deviations, holding about 2.35%, 13.5%, 34%, 34%, 13.5%, and 2.35% of the observations.
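
The 68 / 95 / 99.7 figures can be checked numerically. The sketch below assumes an exactly normal (bell-shaped) distribution, which real data only approximate, and uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The function name `share_within` is my own.

```python
from statistics import NormalDist

def share_within(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    standard_normal = NormalDist(mu=0, sigma=1)
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973

# Share between 1 and 2 standard deviations above the mean: one of the "13.5" ranges.
print(round((share_within(2) - share_within(1)) / 2, 3))  # 0.136
```

The exact normal values are slightly different from the rounded 68%, 95% and 13.5% figures, which is why it is called a rule of thumb.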

Outlier

If an observation has a large effect on the results of a regression analysis, it is said to be influential. An influential point pulls the regression line toward it and away from the trend of the rest of the points. To be considered influential: - Its x value is relatively low or high compared to the rest of the data - It is a regression outlier, as it falls quite far from the trend that the rest of the data follow

Scatter plot

If both variables are quantitative, the best way to graph them is to use a scatter plot. 1. Shows the relationship between TWO quantitative variables 2. Explanatory variable - x-axis 3. Response variable - y-axis (if there is no explanatory-response distinction, either variable can go on either axis) 4. A scatterplot does NOT have to be a function - meaning: one explanatory value can take on more than one response value

Appears to be some type of association between the variables

If there appears to be some type of association between the variables, is it positive or negative, and how strong is it? - Positive association: roughly linear, and low values of x tend to occur with low values of y, and vice versa - Negative association: roughly linear, and high values of one variable tend to occur with low values of the other variable, and vice versa

Standard deviation

The variance is in squared units, so it is easier to interpret its square root. This is called the standard deviation. Symbol: s. The square root of the variance: √(s^2) = s - Gives roughly the typical distance of an observation from the mean. - The larger the standard deviation, the more spread out the data are. - It is equal to zero if all the data have the same value. - It can be influenced by outliers.

- Deviation

The difference between the observation and the sample mean, (x - "x bar"). E.g. data = 7, mean = 5, the deviation of the data is 2. More about deviation: - Uses all of the data - It is positive if the observation falls above the mean - It is negative if the observation falls below the mean - The sum of the deviations always equals zero. The positive deviations balance out the negative ones.
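
A short Python sketch of deviations on made-up data with mean 5, including the observation 7 from the example. The function name `deviations` is illustrative.

```python
def deviations(values):
    """Each observation minus the sample mean."""
    xbar = sum(values) / len(values)
    return [x - xbar for x in values]

devs = deviations([7, 3, 5])   # hypothetical data, mean = 5
print(devs)                    # [2.0, -2.0, 0.0]
print(sum(devs))               # 0.0
```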

- Range

The difference between the largest and smallest values.

To find the correlation using the calculator, NOT the formula

STAT > CALC > 9: LinReg(a + bx). If r is not shown, press 2nd > 0 and turn "diagnostic on".

Confounding variable

Two variables that are both associated with a response variable but are also associated with each other. It is difficult to determine whether either of them truly causes the response.

Extrapolation

Using a regression line to predict y values for x values outside the observed range of data

- Continuous

Values can be anything within an interval.

- Discrete

Values can only be specific numbers.

Lurking Variable

A variable that affects the relationship between the variables studied but is not itself included among them. E.g. age, initial condition, type... Time: policies can change over time, and that can affect the data. "The new factor *lurking variable* would affect the data, and yet it wasn't taken into account with the original variables"

3.2 Exploring association between two quantitative variables

We will be exploring multi-variable situations. These are typically more useful than single-variable situations because they allow for comparison. Remember: Always plot the data! If both variables are quantitative, the best way to graph them is to use a scatter plot.

Boxplot

You can display the five-number summary of your data in the form of a boxplot. Constructing a box plot: 1. A box goes from the lower quartile Q1 to the upper quartile Q3 2. A line is drawn inside the box at the median 3. A line goes from the lower end of the box to the smallest observation that is not a potential outlier. A separate line goes from the upper end of the box to the largest observation that is not a potential outlier. These lines are called 【whiskers】. The potential outliers are shown separately. - A modified boxplot shows the outliers - A non-modified box plot doesn't distinguish the outliers. Reading a modified boxplot from left to right: Outlier, Min (last data point that is not an outlier), Q1, Median, Q3, Max (last data point that is not an outlier), Outlier. To get it on the calculator: 2ND > Stat Plot, change the settings and turn the plot on, then use Zoom Stat

