Agresti Statistics 1a
Shapes of Distributions
1. symmetric 2. asymmetric/ non-symmetric: => skewed right (positive skew) => skewed left (negative skew)
data
- Observations gathered
Parameters
- Parameters are characteristics of populations. = numerical summaries of the population => They are not known (in fact, they are often what you want to know) => Examples include the population mean, the population variance, the population median
Variance
- the square of the standard deviation; roughly the average squared deviation of the observations from the mean - Question: How far, on average, are observations from the mean?
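A minimal sketch of the computation behind this card, assuming a small made-up sample (the values are hypothetical) and the sample formula with n - 1 in the denominator, which the card itself does not specify:

```python
import math

# Hypothetical sample, chosen only for illustration.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(sample)
mean = sum(sample) / n

# Squared deviation of each observation from the mean.
squared_deviations = [(x - mean) ** 2 for x in sample]

# Sample variance (assumption: n - 1 denominator).
variance = sum(squared_deviations) / (n - 1)

# The standard deviation is the square root of the variance.
std_dev = math.sqrt(variance)

print(f"mean={mean}, variance={variance:.3f}, sd={std_dev:.3f}")
```

Squaring keeps positive and negative deviations from cancelling; taking the square root afterwards returns the result to the original units.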
standard normal distribution
Properties: - mean: 0 - variance: 1 - standard deviation: 1 - median: 0 - mode: 0
Population
- larger set of data from which the sample is drawn => Actual population: inferences apply to this population => Conceptual population: generalization (hypothetical)
descriptive statistics
- no generalization beyond the data at hand
How do you represent the different scales of measurement?
- nominal & ordinal scales (!are qualitative/ categorical!) => Plot; e.g. bar graph, pie chart - interval & ratio scales (!are quantitative!) => Plot; e.g. histogram, stemplot
What to do about outliers?
- remove them - examine whether the outliers signal a problem with your sampling - obtain more data (maybe they are not really outliers)
Sample
- small subset of a larger set of data (=population) - one score in this subset is called a "sample point" - sampling has to be random
range
- the difference between the highest and lowest scores in a distribution - Range = max - min
Lower Quartile (Q1)
- the median of the lower half of the data
Upper Quartile (Q3)
- the median of the upper half of the data
stratified sampling
- random sampling from each subgroup in a population => used if the population consists of distinct "strata" or groups - sizes of the subgroups in the sample = proportional to their sizes in the population - subgroups are often randomly divided into a treatment group and a control group (e.g. taking a test without sleep (treatment) vs. with sleep (control))
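A rough sketch of proportional stratified sampling. The function name `stratified_sample`, the strata, and the member labels are all hypothetical and exist only for illustration:

```python
import random

def stratified_sample(strata, total_size, seed=0):
    """Draw a random sample from each stratum, proportional to its size.

    strata: dict mapping a stratum name to the list of its members.
    total_size: desired overall sample size.
    """
    rng = random.Random(seed)
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # The stratum's share of the sample mirrors its share of the population.
        k = round(total_size * len(members) / population_size)
        # Simple random sampling without replacement inside the stratum.
        sample[name] = rng.sample(members, k)
    return sample

# Hypothetical population: 60 first-year and 40 second-year students.
strata = {
    "first_year": [f"A{i}" for i in range(60)],
    "second_year": [f"B{i}" for i in range(40)],
}
chosen = stratified_sample(strata, total_size=10)
print({name: len(members) for name, members in chosen.items()})
```

Because each stratum's share is rounded, the subgroup sizes may not add up exactly to `total_size` for less convenient proportions; a real design would have to decide how to handle that.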
How to identify outliers?
- use histograms and boxplots - the 1.5 × IQR rule => an observation that falls more than 1.5 × IQR below the 1st quartile or above the 3rd quartile is a suspected outlier
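The 1.5 × IQR rule can be sketched as follows. The data are made up, and the quartiles come from Python's `statistics.quantiles` with the "exclusive" method, which is only one of several quartile conventions, so other software may flag slightly different points:

```python
import statistics

def suspected_outliers(data, factor=1.5):
    """Return the observations flagged by the factor x IQR rule."""
    # Assumption: 'exclusive' quartiles; other conventions give other cut points.
    q1, _, q3 = statistics.quantiles(data, n=4, method="exclusive")
    iqr = q3 - q1
    lower_fence = q1 - factor * iqr
    upper_fence = q3 + factor * iqr
    return [x for x in data if x < lower_fence or x > upper_fence]

# Hypothetical sample with one unusually large value.
values = [3, 5, 7, 8, 9, 11, 13, 15, 40]
flagged = suspected_outliers(values)
print(flagged)
```

A flagged value is only a suspected outlier; the rule says where to look, not what to delete.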
sampling bias
- when the sample over-represents one kind of group at the expense of others
Match the question with the suitable summary measure. 1. Where is the "center" of the data? 2. Where does the data tend to cluster? 3. How spread out is the data? How different are observations from one another? 4. Is the distribution symmetric?
1. central tendency: median, mean 2. mode 3. spread: range, variance, standard deviation, interquartile range 4. shape: skewness, outliers
Sampling distribution: 3 very important facts in statistics!
1. on average, the sample mean x̅ will not be too far from the population mean μ 2. statement: 95% of the time, the sample mean is within X units of the population mean => how small X is depends on the sample size n; larger n = smaller X 3. We can use this knowledge to create "confidence intervals" to learn about population parameters
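A small simulation can make the first two facts concrete. Everything here is an assumption for illustration: the population is 10,000 simulated heights, samples are drawn with replacement, and the seed and sample sizes are arbitrary:

```python
import random
import statistics

def sample_means(population, n, repetitions, rng):
    """Return the means of `repetitions` random samples of size n."""
    return [
        statistics.mean(rng.choices(population, k=n))  # sampling with replacement
        for _ in range(repetitions)
    ]

rng = random.Random(42)

# Hypothetical population: heights in cm generated from a normal model.
population = [rng.gauss(170, 10) for _ in range(10_000)]
mu = statistics.mean(population)

small = sample_means(population, n=4, repetitions=2000, rng=rng)
large = sample_means(population, n=100, repetitions=2000, rng=rng)

# Spread of the sample means around the population mean.
spread_small = statistics.stdev(small)
spread_large = statistics.stdev(large)

print(f"population mean: {mu:.2f}")
print(f"n=4:   mean of sample means {statistics.mean(small):.2f}, spread {spread_small:.2f}")
print(f"n=100: mean of sample means {statistics.mean(large):.2f}, spread {spread_large:.2f}")
```

In both cases the sample means centre on the population mean, but the spread for n = 100 should come out much smaller than for n = 4, which is the "larger n = smaller X" statement.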
variables
= properties or characteristics that can vary in value among subjects in a sample or population
Histogram
=> A graph of vertical bars representing the frequency distribution of a set of data. - Each bar is a 'bin'. - There is a bin for every range of numbers in the data - All bins have the same width - The height of a bin is the number of observations in the bin - There are no gaps between bins (Note: Difference with bar graph!)
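A minimal sketch of the binning step behind a histogram, assuming made-up heights in cm and a hypothetical helper `histogram_counts`. Each bin covers `[left, left + bin_width)`, so a value on a boundary goes into the bin to its right:

```python
def histogram_counts(data, start, bin_width, n_bins):
    """Count the observations that fall into each of n_bins equal-width bins."""
    counts = [0] * n_bins
    for x in data:
        k = int((x - start) // bin_width)  # index of the bin containing x
        if 0 <= k < n_bins:                # values outside the bins are ignored
            counts[k] += 1
    return counts

# Hypothetical heights in cm.
heights = [152, 158, 161, 163, 165, 167, 168, 171, 174, 179, 183]
counts = histogram_counts(heights, start=150, bin_width=10, n_bins=4)

# Text version of the histogram: one row per bin, one '#' per observation.
for k, count in enumerate(counts):
    left = 150 + 10 * k
    print(f"{left}-{left + 10}: {'#' * count}")
```

The bins are adjacent and equally wide, which is what distinguishes the result from a bar graph of categories.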
qualitative (categorical) variable
A variable that cannot assume a numerical value but can be classified into two or more nonnumeric categories
Which four scales of measurement are there? Are they qualitative/ categorical OR quantitative?
Nominal => qualitative/ categorical (= no scores) Ordinal => qualitative/ categorical Interval => quantitative (= uses scores/ numerical) Ratio => quantitative
Normal distribution
Properties: - mean: μ - variance: σ² - standard deviation: σ - median: μ - mode: μ => If X is a value that comes from a Normal distribution, we say: X ∼ N (μ,σ)
What are the two kinds of distribution?
discrete and continuous distribution
Data
evidence; information gathered from observations
Discrete data
- "between" numbers are meaningless - Example: How many siblings do you have? (2 and 3 are possible answers, but 2.5 is not!)
Continuous data
- "between" numbers have meaning -Example: How tall are you? (all positive real numbers are meaningful answers)
What is an outlier?
- An outlier is a value that appears to be unusually large or small, given the rest of the data. - Look around the room. Would a 175cm tall person be unusual? A 100cm tall person?
treatment
- Condition in an experiment
Reliability
- Consistent in the sense that a subject will give the same response when asked again
Continuous distribution (probability density!)
- Continuous distribution functions are described by probability density
Validity
- Measuring what is intended to be measured and accurately reflecting the concept
discrete distribution (probability!)
- Discrete distribution functions are described by probability
Leptokurtic distribution
- Distribution curve is very tall, thin and peaked. => More scores in its tails
Distributions
- Distributions = Models of populations => It is possible to observe single samples with absolute precision, but we can not observe populations in science => therefore creating models (=distributions)
independent vs. dependent variable
- Effects of independent variable on dependent variable are measured
Platykurtic distribution
- Flatter and more spread out than a normal curve. => (Memory: 'Plat' sounds like 'flat') => Fewer scores in its tails
Confidence Interval (CI)
A range of values, calculated from the sample observations, that is believed, with a particular probability, to contain the true value of a population parameter. A 95% confidence interval, for example, implies that were the estimation process repeated again and again, then 95% of the calculated intervals would be expected to contain the true parameter value. Note that the stated probability level refers to properties of the interval and not to the parameter itself which is not considered a random variable.
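As a rough illustration, a 95% interval for a population mean can be sketched with the normal approximation x̅ ± 1.96 · s/√n. The multiplier 1.96, the helper name `mean_confidence_interval`, and the sample of heights are assumptions made for this example; for small samples a t-based multiplier is usually more appropriate:

```python
import math
import statistics

def mean_confidence_interval(sample, z=1.96):
    """Approximate interval for the population mean: mean +/- z * s / sqrt(n)."""
    n = len(sample)
    mean = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    margin = z * standard_error
    return mean - margin, mean + margin

# Hypothetical sample of heights in cm.
sample = [178, 182, 175, 180, 185, 177, 181, 179, 183, 180]
low, high = mean_confidence_interval(sample)
print(f"95% CI for the mean: ({low:.2f}, {high:.2f})")
```

The interval is what varies from sample to sample; the population mean stays fixed, which is why the 95% describes the procedure and not the parameter.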
Ordinal scale
- assigns observations to ordered categories - categorical: e.g. How good are you in sports? Choose from very poor, satisfactory, very good; another example: service quality
Nominal scale
- assigns observations to unordered categories - identity / labels; e.g., gender, marital status, car owned
Interval scale
- assigns scores on a scale with equal intervals - e.g. thermometer: the difference between 400°C and 500°C equals the difference between 200°C and 300°C - However, one can't say 100°C is twice as hot as 50°C => This is because 0°C is not an absolute minimum (no true zero point) - e.g. standardized exam score
Ratio scale
- assigns scores on a scale with equal intervals and a true zero point. - e.g. weight, height, age, weekly food spending
What are the 3 types of distribution?
- bimodal distribution - leptokurtic distribution - platykurtic distribution
Database
- existing archived collection of data
Inferential Statistics
- generalizing from the data to a larger set of cases - based on the idea that sampling is random
standard deviation
the square root of the variance
mean of a sample
x̅ (x bar)
population mean
μ
population variance
σ²
Computing upper & lower quartile - Example:
- Formula: R = P/100 × (N + 1) => R = rank, P = desired percentile, N = number of observations 1. Order the sample points by size and compute R. 2. Define IR as the integer portion of R (the digits to the left of the decimal point), e.g. R = 2.25 => IR = 2. 3. Define FR as the fractional portion of R (the digits to the right of the decimal point) => FR = 0.25. 4. Find the scores with rank IR and rank IR + 1 in the ordered sample => e.g. the scores at the 2nd and 3rd positions, here 5 and 7. 5. Interpolate: multiply the difference between the two scores by FR and add the result to the lower score => 0.25 × (7 - 5) + 5 = 5.5. => Therefore, the 25th percentile (= lower quartile) is 5.5.
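The five steps can be sketched in code. The card only gives the scores at ranks 2 and 3 (5 and 7) and implies N = 8 through R = 2.25; the remaining scores below are invented, and clamping out-of-range ranks to the smallest or largest score is an added assumption:

```python
import math

def percentile(scores, p):
    """Percentile via the rank formula R = P/100 * (N + 1) with interpolation."""
    ordered = sorted(scores)          # step 1: order the sample points
    n = len(ordered)
    r = p / 100 * (n + 1)             # step 1: compute the rank R
    ir = math.floor(r)                # step 2: integer portion of R
    fr = r - ir                       # step 3: fractional portion of R
    # Assumption: ranks outside 1..N are clamped to the extreme scores.
    if ir < 1:
        return ordered[0]
    if ir >= n:
        return ordered[-1]
    lower = ordered[ir - 1]           # step 4: score with rank IR (ranks start at 1)
    upper = ordered[ir]               # step 4: score with rank IR + 1
    return lower + fr * (upper - lower)  # step 5: interpolate

# Hypothetical sample; only the scores 5 and 7 come from the card.
scores = [3, 5, 7, 8, 9, 11, 13, 15]
q1 = percentile(scores, 25)
q3 = percentile(scores, 75)
print(q1, q3)
```

With these eight scores the lower quartile reproduces the card's result of 5.5.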
population
- is defined with respect to the psychological question - can be abstract - is usually large - Computing: settle for a sample and make inferences about the population => populations are not random! (The population stays the same across samples)
Example: population vs. sample
- Males in Germany average 182cm in height. => statement about the population - The males in this class average 180cm. => statement about the sample
Interquartile Range (IQR)
- Q3 - Q1, the range of the middle 50% of the data - Finding quartiles is similar to finding the median. => First quartile (Q1): observation such that 25% of observations are less than or equal to it. => Second quartile (Q2): observation such that 50% of observations are less than or equal to it. (median!) => Third quartile (Q3): observation such that 75% of observations are less than or equal to it. 1. find Q1 and Q3 2. subtract Q1 from Q3 to get the interquartile range
Mean
- Question: What is the "average" number? - Computing: Add up all the values and divide the sum by n (n = total number of observations in the sample) - The sample mean is the most important measure of central tendency - It is the point where the SUM of all deviations from it is 0. It can also be thought of as the "balancing point" of the sample. - The mean of the observations is often abbreviated x̅ (pronounced "x bar") => an individual observation is often written as x_j
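A short check of the "balancing point" property, using a made-up sample:

```python
# Hypothetical sample, chosen only for illustration.
sample = [4, 8, 15, 16, 23, 42]

n = len(sample)
mean = sum(sample) / n                 # add all values, divide by n

# Deviation of each observation x_j from the mean.
deviations = [x - mean for x in sample]
total_deviation = sum(deviations)      # positive and negative deviations cancel

print(f"mean = {mean}, sum of deviations = {total_deviation}")
```

The deviations below the mean exactly offset those above it, so their sum is 0 (up to floating-point rounding for less convenient numbers).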
Median
- Question: What is the "middle" number? - Computing: Order the values. Odd number of values: median = middle value; even number of values: add the two "middle" values and divide the sum by 2 - median = (also) the middle quartile (Q2) or 50th percentile
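The odd and even cases can be sketched like this (the input values are made up):

```python
def median(values):
    """Middle value of the ordered data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]                         # odd n: the single middle value
    return (ordered[middle - 1] + ordered[middle]) / 2  # even n: average the two middle values

print(median([7, 1, 5]))      # odd number of values
print(median([7, 1, 5, 3]))   # even number of values
```

Sorting first matters: the median is the middle of the ordered values, not the middle of the list as it was recorded.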
Mode
- Question: What is the most common number? Where do the data tend to cluster? - Computing: Look at the sample; the mode is the value that occurs most often.
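A minimal sketch of finding the mode by counting. Returning every value that ties for the highest count is a choice made for this example, so that a sample with two equally common values reports both:

```python
from collections import Counter

def modes(values):
    """Return all values that occur most often, in ascending order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([1, 2, 2, 3, 3, 3]))   # one mode
print(modes([1, 1, 2, 2, 3]))      # two values tie
```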
simple random sampling
- Sampling is random => equal chance to be selected for every member of the population; independent selection of members
Skewness
- Skew describes the symmetry of a sample => If a histogram is symmetric, it has no skew =>If it is not symmetric, it can have left skew (negative skewness) or right skew (positive skewness).
Statistics
- Statistics are computed from a sample => They are known with absolute precision. => Examples include x̅ (mean of the sample), s² (variance of the sample), the sample median
Leaf display
- Each observation is split into a stem (the leading digits) and a leaf (the final digit) - The leaf is written next to its stem
box plot
- The graphical representation of the five-number-summary - Functions: 1. They give a quick representation of the important properties of a sample. 2. You can fit many of them in a small area (histograms are not good for this!)
dependent variable
- The outcome factor; the variable that may change in response to manipulations of the independent variable.
What is measurement?
- The process of applying numbers to objects according to a set of rules
back-to-back stemplot
- Used to compare two sets of data. - The leaves for one set of data are on one side of the stem, and the leaves for the other set of data are on the other side. - Good for displaying distributions
Five-number-summary
- Using the range and the quartiles, we can describe any distribution in five numbers. => i.e.: minimum, Q1, median, Q3, maximum
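A sketch of the five-number summary for a made-up sample. The quartiles use `statistics.quantiles` with the "exclusive" method; this is one convention among several, and the "median of each half" definition can give slightly different Q1 and Q3:

```python
import statistics

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum)."""
    # Assumption: 'exclusive' quartiles; other conventions may differ slightly.
    q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")
    return min(data), q1, q2, q3, max(data)

# Hypothetical sample.
scores = [3, 5, 7, 8, 9, 11, 13, 15]
minimum, q1, med, q3, maximum = five_number_summary(scores)

data_range = maximum - minimum   # range = max - min
iqr = q3 - q1                    # interquartile range = Q3 - Q1

print(minimum, q1, med, q3, maximum)
print(f"range = {data_range}, IQR = {iqr}")
```

These five numbers are what a box plot draws: the box spans Q1 to Q3 with a line at the median.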
independent variable
- Variable manipulated by experimenter - Number of levels of an independent variable = the number of experimental conditions
What is a density curve?
- With a density curve, probabilities correspond to the AREA under the curve => total area under the curve is 1
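To illustrate that probabilities are areas, the sketch below numerically integrates the standard normal density with the trapezoidal rule. The integration limits of -8 to 8 (standing in for the whole real line) and the step count are arbitrary choices for this example:

```python
import math

def standard_normal_pdf(x):
    """Density of the standard normal distribution (mean 0, standard deviation 1)."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def area_under(pdf, a, b, steps=10_000):
    """Trapezoidal approximation of the area under pdf between a and b."""
    width = (b - a) / steps
    total = (pdf(a) + pdf(b)) / 2
    for i in range(1, steps):
        total += pdf(a + i * width)
    return total * width

total_area = area_under(standard_normal_pdf, -8, 8)          # practically the whole curve
central_area = area_under(standard_normal_pdf, -1.96, 1.96)  # P(-1.96 < Z < 1.96)

print(f"total area   = {total_area:.4f}")
print(f"central area = {central_area:.4f}")
```

The height of the curve at a single point is not a probability; only the area over an interval is.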
quantitative variable
- a characteristic that can be measured numerically => e.g. numerical values on a measurement scale - divided into: 1. discrete variables: => variables whose possible values are separate, countable numbers => e.g. number of children 2. continuous variables: => variables that can take any value within an interval => e.g. height
bimodal distribution
- a distribution with two modes => two distinct peaks
Sampling
- Samples are observed; they are the object of analysis - They are drawn from a particular population to answer a psychological question - They provide data so that we can learn about the population without examining the whole population => Samples vary and are random
Why are outliers important? Give 3 reasons.
1. An outlier may represent a sample from outside the population of interest 2. Outliers have a large effect on many statistics (like mean, standard deviation) 3. Outliers may represent errors in the data (Example: a height of 100cm)
What is the right measurement scale? 1. How many siblings do you have? 2. How many cigarettes do you smoke per day? 3. How many degrees Celsius is it in the room? 4. What color do you prefer? 5. On a scale from extremely dissatisfied (0) to extremely satisfied (9), how satisfied are you with your life?
1. Ratio scale (true zero point/ equal intervals) 2. Ratio scale (true zero point/ equal intervals) 3. Interval (no true zero point) 4. Nominal scale (no order) 5. Ordinal (ordered/ no equal intervals)
What are the 3 types of distribution?
1. Standard normal distribution 2. Normal distribution 3. Binomial distribution
What are the two (complementary) approaches to data analysis?
1. Summarize data graphically 2. Summarize characteristics of data with numbers (numerically/ summary measures => i.e. center, spread, shape)