PSY 251: Chapter 1 (Introduction) Notes
Probability Density Function
For a discrete random variable, a probability distribution contains the probability of each possible outcome. However, for a continuous random variable, the probability of any one outcome is zero (if you specify it to enough decimal places). A probability density function is a formula that can be used to compute probabilities of a range of outcomes for a continuous random variable. The total area under the density function is always 1.0, and the value of the function is always greater than or equal to zero.
Discrete Variable
Variables that can only take on a finite number of values. All qualitative variables are discrete, as are some quantitative variables, such as performance rated as 1, 2, 3, 4, or 5, or temperature rounded to the nearest degree. Sometimes, a variable that takes on enough discrete values can be considered continuous (ex. time to the nearest millisecond).
Continuous Variables
Variables that can take on any value in a certain range. Time and distance are continuous, whereas gender, SAT score, and "time rounded to the nearest second" are not. No measured variable is truly continuous; however, discrete variables measured with enough precision can often be considered continuous for practical purposes.
Third-variable problem
When two variables appear to be related to each other but there is another unknown variable (the third variable) that is the real source of the link between the first two variables; people erroneously believe that there is a causal relationship between the two primary variables rather than recognize that a third variable can cause both. Ex. The more churches in a city, the more crime there is. Thus, churches lead to crime. A major flaw is that both increased churches and increased crime rates can be explained by larger populations. In bigger cities, there are both more churches and more crime.
Population
the larger set from which a sample is drawn
Platykurtic
a distribution with relatively fewer scores in its tails, that is, short tails relative to a normal distribution
Data
information that has been collected from an experiment, a survey, a historical record, etc. Note that "data" is plural; one piece of information is called a datum
History Effect
interpreting outcomes as the result of one variable when another variable is actually responsible. Ex. A new advertisement for Ben and Jerry's ice cream introduced in late May of last year resulted in a 30% increase in ice cream sales for the following three months. Thus, the advertisement was effective. A major flaw is that ice cream consumption generally increases in the months of June, July, and August regardless of advertisements.
Simple Random Sampling (aka Random Sampling)
process of selecting a subset of a population for the purposes of statistical inference (the act of drawing conclusions about a population from a sample); means that every member of the population is equally likely to be chosen for the sample. Additionally, the selection of one member of a sample from the population must not increase or decrease the probability of picking any other member (relative to the others); simple random sampling chooses a sample by pure chance
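A minimal sketch of simple random sampling, assuming a hypothetical population of 100 numbered members. The function name `simple_random_sample` and the seed are illustrative choices, not part of the notes; Python's standard `random.sample` draws without replacement and gives every member the same chance of selection.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n members by pure chance, without replacement."""
    rng = random.Random(seed)          # seed is only for reproducibility
    return rng.sample(list(population), n)

# Hypothetical population: members numbered 1..100
population = list(range(1, 101))
sample = simple_random_sample(population, 10, seed=0)
print(sample)
```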
Variables
properties or characteristics of some event, object, or person that can take on different values or amounts as opposed to constants that do not vary
To be an intelligent consumer of statistics, your first reflex must be to...
question the statistics that you encounter
Things to keep in mind about the curve that describes a continuous distribution (ex. normal curve)
1. the area under the curve of a probability distribution is 1, meaning that the probability that a score chosen at random falls somewhere under the curve is 1 2. the probability of any exact value of X is zero 3. the area under the curve bounded between two points is the probability that a number chosen at random will fall between those two points
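The three points above can be checked numerically. This is a sketch that assumes a standard normal curve (mean 0, standard deviation 1) and uses `statistics.NormalDist` from the Python standard library (3.8+); the helper name `prob_between` is hypothetical.

```python
from statistics import NormalDist

z = NormalDist(mu=0, sigma=1)   # assumed: standard normal curve

def prob_between(dist, a, b):
    """Area under the curve between a and b."""
    return dist.cdf(b) - dist.cdf(a)

total = prob_between(z, -10, 10)     # practically the whole curve
p_exact = prob_between(z, 1.0, 1.0)  # a single exact value has zero width
p_range = prob_between(z, -1, 1)     # within one SD of the mean

print(total, p_exact, round(p_range, 4))
```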
Basics of logarithms
1. A log transformation reduces positive skew; this can be valuable both for making the data more interpretable and for helping to meet the assumptions of inferential statistics. 2. Logarithms are, in a sense, the opposite of exponents. Log10(100) = 2 can be read as: the log base ten of 100 equals 2. The result is the power to which the base (10) must be raised in order to equal the value (100). This example used base 10, but any base could have been used. The base that yields "natural logarithms" is called e and equals approximately 2.718. Natural logarithms can be written either as Ln(x) or loge(x). 3. Changing the base of the log changes the result by a multiplicative constant. To convert from Log10 to natural logs, multiply by 2.303; to convert in the other direction, divide by 2.303. 4. Taking the antilog of a number undoes the operation of taking the log: it raises the base of the logarithm in question to that number. 5. A series of numbers that increases proportionally will increase in equal amounts when converted to logs. For example, if one student increased their score from 100 to 200 while a second student increased theirs from 150 to 300, the percentage change (100%) is the same for both students, and the log difference is also the same. 6. Log(AB) = Log(A) + Log(B) and Log(A/B) = Log(A) - Log(B).
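The log rules above can be verified with Python's standard `math` module. This is only an illustrative check; the variable names are made up for the example, and the values A = 8 and B = 4 are arbitrary.

```python
import math

log10_100 = math.log10(100)        # power that 10 must be raised to: 2
ln_100 = math.log(100)             # natural log (base e)
conversion = ln_100 / log10_100    # constant for Log10 -> Ln, about 2.303
antilog = 10 ** log10_100          # antilog undoes the log: back to 100

# Proportional increases become equal differences in logs
diff_first = math.log10(200) - math.log10(100)
diff_second = math.log10(300) - math.log10(150)

# Product and quotient rules, with arbitrary A and B
A, B = 8.0, 4.0
product_rule = math.log10(A * B) - (math.log10(A) + math.log10(B))
quotient_rule = math.log10(A / B) - (math.log10(A) - math.log10(B))

print(log10_100, round(conversion, 3), antilog)
```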
Leptokurtic
A distribution with long tails relative to a normal distribution
Bimodal distribution
A distribution with two peaks
Linear Transformation
A linear transformation is any transformation of a variable that can be achieved by multiplying it by a constant, and then adding a second constant. If Y is the transformed value of X, then Y = aX + b. The transformation from degrees Fahrenheit to degrees Centigrade is linear and is done using the formula: C = 0.55556F - 17.7778. Thus, in a plot of degrees Centigrade as a function of degrees Fahrenheit, the points form a straight line. This will always be the case if the transformation from one scale to another consists of multiplying by one constant and then adding a second constant, whereas with nonlinear transformations, the points in a plot of the transformed variable against the original variable would not fall on a straight line. More examples: the transformation of miles into feet is linear because you multiply the distance in miles by 5,280 feet/mile (here the added constant is zero).
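A short sketch of Y = aX + b using the two examples above. The function names and the sample temperatures are hypothetical; the constants 5/9 and 160/9 are the exact values behind the rounded 0.55556 and 17.7778. Equal slopes between successive points are what "falls on a straight line" means.

```python
def fahrenheit_to_celsius(f):
    """Linear transformation with a = 5/9 and b = -160/9."""
    return (5 / 9) * f - 160 / 9

def miles_to_feet(miles):
    """Linear transformation with a = 5280 and b = 0."""
    return 5280 * miles

temps_f = [32, 50, 68, 212]                      # arbitrary sample points
temps_c = [fahrenheit_to_celsius(f) for f in temps_f]

# Slope between each pair of neighboring points; all equal for a line
slopes = [
    (temps_c[i + 1] - temps_c[i]) / (temps_f[i + 1] - temps_f[i])
    for i in range(len(temps_f) - 1)
]
print(temps_c, slopes)
```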
Based on what you have learned in this chapter about measurement scales, does it make sense to compare SAT scores using percentages? Why or why not?
As you may know, the SAT has an arbitrarily-determined lower limit on test scores of 200. Therefore, SAT is measured on either an ordinal scale or, at most, an interval scale. However, it is clearly not measured on a ratio scale. Therefore, it is not meaningful to report SAT score differences in terms of percentages. For example, consider the effect of subtracting 200 from every student's score so that the lowest possible score is 0. How would that affect the difference as expressed in percentages?
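To make the closing question concrete, here is a small sketch with two hypothetical SAT scores (400 and 600, chosen only for illustration). The helper `percent_difference` is an assumed name. The same pair of students yields a different percentage once the arbitrary 200-point floor is removed, which is why percentages are not meaningful without a true zero.

```python
def percent_difference(low, high):
    """How much higher 'high' is than 'low', as a percentage of 'low'."""
    return 100 * (high - low) / low

# Hypothetical scores on the usual scale (lowest possible score is 200)
original = percent_difference(400, 600)

# Same students after subtracting 200 so the lowest possible score is 0
shifted = percent_difference(400 - 200, 600 - 200)

print(original, shifted)
```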
Positive Skew
Distribution with the longer tail extending in the positive direction is said to have a positive skew; "skewed to the right"
Negative Skew
Distribution with the longer tail extending to the left; less common than positive skew
Ordinal scale
Example: rating feelings as either "very dissatisfied," "somewhat dissatisfied," "somewhat satisfied," or "very satisfied." Items in this scale are ordered, ranging from least to most satisfied. This is what distinguishes ordinal from nominal scales. Unlike nominal scales, ordinal scales allow comparisons of the degree to which two subjects possess the dependent variable. However, an ordinal scale fails to capture important information that is present in the other scales we examine: the difference between two levels cannot be assumed to be the same as the difference between two other levels. Statisticians express this point as: the differences between adjacent scale values do not necessarily represent equal intervals on the underlying scale giving rise to the measurements (ex. the underlying scale here is the true feeling of satisfaction).
Computing means for different scales
Nominal data: there is agreement that it is NOT OK to compute means of nominal data. Ordinal data: there is a DEBATE; most statisticians agree that it is valid to compute means of ordinal data, although some vehemently disagree. Interval data: there is agreement that it is OK to compute means of interval data. Ratio data: there is agreement that it is OK to compute means of ratio data.
"We're spending $70 per person to fill this out. That's just not cost effective, especially since in the end this is not a scientific survey. It's a random survey." What is wrong with this statement? Despite the error in this statement, what type of sampling could be done so that the sample will be more likely to be representative of the population?
Randomness is what makes the survey scientific. If the survey were not random, then it would be biased and therefore statistically meaningless, especially since the survey is conducted to make generalizations about the American population. Stratified sampling would likely be more representative of the population.
What level of measurement is used for psychological variables?
Rating scales are used frequently in psychological research. Ratings are typically made on a 5-point or 7-point scale. Side note: It is often inappropriate to consider psychological measurement scales as either interval or ratio. Ex. memory recall of 5 easy and 5 difficult items
Summation Notation
The Greek letter capital sigma (Σ) indicates summation. The "i = 1" at the bottom indicates that the summation is to start with X1, and the 4 at the top indicates that the summation will end with X4. The "Xi" indicates that X is the variable to be summed as i goes from 1 to 4, so the sum is X1 + X2 + X3 + X4. When no values of i are shown, it means to sum all the values of X.
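The notation maps directly onto a loop. In this sketch the four values of X are made up for illustration; the only subtlety is that Python lists start at index 0, so Xi is stored at `X[i - 1]`.

```python
# Hypothetical values X1..X4
X = [7, 6, 5, 8]

# Sum of Xi for i = 1 to 4 (list index is i - 1)
total_1_to_4 = sum(X[i - 1] for i in range(1, 5))

# No limits shown: sum all the values of X
total_all = sum(X)

print(total_1_to_4, total_all)
```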
Why are we interested in a type of scale that measures a dependent variable?
The crux of the matter is the relationship between the variable's level of measurement and the statistics that can be meaningfully computed with that variable
How to calculate percentiles with the third definition
The first step is to compute the rank of the desired percentile using the formula: R = P/100 × (N + 1), where P is the desired percentile and N is the number of numbers in the data. If R is an integer, the Pth percentile is the number with rank R. When R is not an integer, we compute the Pth percentile by interpolation as follows: 1. Define IR as the integer portion of R (the number to the left of the decimal point). For example, if R is 2.25, then IR is 2. 2. Define FR as the fractional portion of R. For example, if R is 2.25, then FR is 0.25. 3. Find the scores with rank IR and with rank IR + 1. For the example being used, this means the score with rank 2 and the score with rank 3. Say these scores are 5 and 7 according to the data. 4. Interpolate by multiplying the difference between the scores by FR and adding the result to the lower score. For these data, this is (0.25)(7 - 5) + 5 = 5.5. Therefore, 5.5 is the score at the desired percentile in this example. EXAMPLE: Consider the 50th percentile of the numbers 2, 3, 5, 9. Answer: R = 50/100 × (4 + 1) = 2.5, so IR = 2 and FR = 0.5. The score with rank IR (rank 2) is 3 and the score with rank IR + 1 (rank 3) is 5. Therefore, the 50th percentile is (0.5)(5 - 3) + 3 = 4.
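The procedure above can be sketched as a short function. The name `percentile` and the handling of ranks that fall outside the data (returning the lowest or highest score) are assumptions added for the sketch; the notes do not say what to do in that case. The second data set is made up so that ranks 2 and 3 hold the scores 5 and 7, as in the walkthrough.

```python
def percentile(values, p):
    """Pth percentile via R = P/100 * (N + 1) with interpolation."""
    scores = sorted(values)
    n = len(scores)
    if n == 0:
        raise ValueError("need at least one value")

    r = p / 100 * (n + 1)   # rank of the desired percentile
    ir = int(r)             # integer portion of R
    fr = r - ir             # fractional portion of R

    # Assumption: clamp to the ends when the rank is outside 1..N
    if ir < 1:
        return scores[0]
    if ir >= n:
        return scores[-1]

    low = scores[ir - 1]    # score with rank IR (ranks start at 1)
    high = scores[ir]       # score with rank IR + 1
    return low + fr * (high - low)

print(percentile([2, 3, 5, 9], 50))                    # → 4.0
print(percentile([3, 5, 7, 8, 9, 11, 13, 15], 25))     # → 5.5
```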
Percentile
There is no universally accepted definition of percentile. Definition 1: the lowest score that is greater than X% of the scores. Definition 2: the smallest score that is greater than or equal to X% of the scores. A third way to compute percentiles is as a weighted average of the percentiles computed according to the first two definitions. This third definition handles rounding more gracefully than the other two and has the advantage that it allows the median to be defined conveniently as the 50th percentile. Unless otherwise specified, we will use this definition.
Example of Sample vs. Population
You have been hired by the National Election Commission to examine how the American people feel about the fairness of the voting procedures in the U.S. Whom will you ask? You query a relatively small number of Americans, and draw inferences about the entire country from their responses. The Americans actually queried constitute our sample of the larger population of all Americans.
Sample
a small subset of a larger set of data; typically a small subset of the population. In choosing a sample, it is really important not to over-represent one kind of citizen at the expense of others. In statistics, we often rely on a sample to draw inferences about the larger set
Frequency Table
a table containing the number of occurrences in each class of data; often used to create histograms and frequency polygons
Frequency Distribution
aka distribution; the distribution of empirical data is called a frequency distribution and consists of a count of the number (frequency) of occurrences of each value; if the data are continuous, then a grouped frequency distribution (frequencies displayed for ranges of data rather than for individual values) is used. Typically, a distribution is portrayed using a frequency polygon or a histogram. Mathematical equations are often used to define distributions. Ex: the normal distribution
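A minimal sketch of both kinds of frequency distribution, using `collections.Counter` from the standard library. The ratings, the response times, and the class width of 1.0 are all made-up illustration values.

```python
from collections import Counter

# Discrete data: count occurrences of each individual value
ratings = [1, 2, 2, 3, 3, 3, 4, 4, 5]
frequency = Counter(ratings)

# Continuous data: count occurrences within ranges (grouped distribution)
times = [0.42, 1.37, 1.91, 2.05, 2.76, 3.10]
width = 1.0                                   # assumed class width
grouped = Counter(int(t // width) * width for t in times)

for value in sorted(frequency):
    print(value, frequency[value])
for lower in sorted(grouped):
    print(f"{lower} to {lower + width}: {grouped[lower]}")
```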
Qualitative variables
also known as categorical variables; variables with no sense of ordering. They are measured on a nominal scale. For instance, hair color is a qualitative variable, as is name. They can be coded to appear numeric but their numbers are meaningless, as in male=1, female=2
Continuous Variables
can take on any value in a certain range; time and distance are continuous; no measured variable is truly continuous; however, discrete variables measured with enough precision can often be considered continuous for practical purposes.
Statistics
facts and figures. Statistics RELY upon the calculation of numbers, upon how the numbers are chosen, and upon how the statistics are interpreted (interpretations are sometimes problematic). In the broadest sense, "statistics" refers to a range of techniques and procedures for analyzing, interpreting, displaying, and making decisions based on data. Statistics is a field of study concerned with summarizing data, interpreting data, and making decisions based on data.
Probability Distribution
for a discrete random variable, a probability distribution contains the probability of each possible outcome. The sum of all probabilities is always 1.0
Inferential statistics
involves generalizing beyond the data at hand to another set of cases; uses data from a sample to answer questions about a population. Includes mathematical procedures whereby we convert information about the sample into intelligent guesses about the population. Inferential statistics are based on the assumption that sampling is random, because a random sample is trusted to represent different segments of society in close to appropriate proportions, provided that the sample is large enough. Only a large sample size makes it likely that our sample is close to representative of the population. For this reason, inferential statistics take into account the sample size when generalizing results from samples to populations
Quantitative Variable
measured on a numeric or quantitative scale. Ordinal, interval, and ratio scales are quantitative. A country's population, a person's shoe size, or a car's speed are all quantitative variables
Kurtosis
measures how fat or thin the tails of a distribution are relative to a normal distribution
Sampling bias
method of sampling in which a sample is collected in such a way that some of the members of the intended population are less likely to be included than others, leading to a non-representative sample
Ratio scale
most informative scale; it is an interval scale with the additional property that the zero position indicates the absence of the quantity being measured. Like a nominal scale, it provides a name or category for each object (the numbers serve as labels). Like an ordinal scale, the objects are ordered (in terms of the ordering of the numbers). Like an interval scale, the same difference at two places on the scale has the same meaning. And in addition, the same ratio at two places on the scale also carries the same meaning. Ex. the Kelvin scale, since zero on the Kelvin scale is absolute zero; on the Kelvin scale, a temperature that is twice as high as another has twice the kinetic energy. Another example is money! If you have zero money, this implies the absence of money.
Number of Levels of an Independent Variable
number of experimental conditions
Descriptive Statistics
numbers that are used to summarize and describe data. Ex. percentage of certificates issued in NY in May 1999; average age of mother in NY at the same time; average salaries for various occupations in 1999. Any other number we choose to compute also counts as a descriptive statistic for the data from which the statistic is computed. Several descriptive statistics are often used at one time to give a full picture of the data; descriptive statistics are just descriptive and do not involve generalizing beyond the data at hand
Interval scales
numerical scales in which intervals have the same interpretation throughout. Ex. the Fahrenheit scale of temperature. The difference between 30 degrees and 40 degrees represents the same temperature difference as the difference between 80 degrees and 90 degrees; each 10-degree interval has the same physical meaning in terms of the kinetic energy of molecules. Not perfect: interval scales do not have a true zero point, even if one of the scaled values happens to carry the name "zero". Ex. zero degrees Fahrenheit does not represent the complete absence of temperature or molecular kinetic energy. Since an interval scale has no true zero point, it does not make sense to compute ratios of temperatures. For example, it doesn't make sense to say that 80 degrees is "twice as hot" as 40 degrees, as such a claim would depend on an arbitrary decision about where to "start" the temperature scale, namely what temperature to call zero
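A quick illustration of why ratios on an interval scale depend on where zero is placed. The sketch takes the 80 and 40 degree Fahrenheit example and re-expresses the same two temperatures in Celsius and in Kelvin using the standard conversion formulas; the function names are hypothetical. The "twice as hot" ratio does not survive the change of scale.

```python
def fahrenheit_to_celsius(f):
    return (f - 32) * 5 / 9

def celsius_to_kelvin(c):
    return c + 273.15

hot_f, cool_f = 80.0, 40.0
hot_c, cool_c = fahrenheit_to_celsius(hot_f), fahrenheit_to_celsius(cool_f)
hot_k, cool_k = celsius_to_kelvin(hot_c), celsius_to_kelvin(cool_c)

ratio_f = hot_f / cool_f   # 2.0 on the Fahrenheit scale
ratio_c = hot_c / cool_c   # a very different ratio in Celsius
ratio_k = hot_k / cool_k   # ratio on a scale with a true zero

print(ratio_f, round(ratio_c, 2), round(ratio_k, 2))
```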
Random Assignment
occurs when the subjects in an experiment are randomly assigned to conditions; prevents systematic confounding of treatment effects with other variables Ex. consider the bias that could be introduced if the first 20 subjects to show up at the experiment were assigned to the experimental group and the second 20 subjects were assigned to the control group. It is possible that subjects who show up late tend to be more depressed than those who show up early, thus making the experimental group less depressed than the control group even before the treatment was administered. Failure to randomize the splitting of a sample group invalidates the experimental findings. A non-random sample simply restricts the generalizability of the results.
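A sketch of random assignment for the 40-subject example above, assuming subjects are identified by their arrival order (1 = first to show up). The function name `randomly_assign` and the seed are illustrative. Shuffling before splitting means arrival time cannot be systematically tied to condition.

```python
import random

def randomly_assign(subjects, seed=None):
    """Shuffle the subjects, then split them into two equal-sized groups."""
    rng = random.Random(seed)      # seed is only for reproducibility
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical subjects labeled by arrival order
subjects = list(range(1, 41))
experimental, control = randomly_assign(subjects, seed=1)
print(sorted(experimental))
print(sorted(control))
```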
Median
popular measure of central tendency (center of a distribution). It is the 50th percentile of a distribution. To find the median of a number of values, first order them, then find the observation in the middle. Note--if there is an even number of values, one takes the average of the middle two. The median is often more appropriate than the mean in skewed distributions and in situations with outliers
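The steps for finding the median can be sketched as follows. The function name and the sample values are illustrative; the standard library's `statistics.median` follows the same rule and is used here only as a cross-check. The last example shows how an outlier moves the mean far more than the median.

```python
import statistics

def median(values):
    """Order the values, then take the middle one (or average the middle two)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one value")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_case = median([9, 2, 5])            # middle of 2, 5, 9
even_case = median([2, 3, 5, 9])        # average of 3 and 5
with_outlier = [1, 2, 3, 4, 100]        # made-up data with an outlier
print(odd_case, even_case, median(with_outlier), statistics.mean(with_outlier))
```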
Stratified Random Sampling
the population is divided into a number of subgroups (or strata). Random samples are then taken from each subgroup with sample sizes proportional to the size of the subgroup in the population. For instance, if a population contained equal numbers of men and women, and the variable of interest is suspected to vary by gender, one might conduct stratified random sampling to ensure a representative sample. In stratified sampling, you first identify the members of the population who belong to each group. Then you randomly sample from each of those subgroups in such a way that the sizes of the subgroups in the sample are proportional to their sizes in the population. Ex. If 70% of the students were day students, it makes sense to ensure that 70% of the sample consists of day students. Thus, your sample of 200 students would consist of 140 day students and 60 night students.
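The day/night example can be sketched as proportional allocation followed by a simple random sample within each stratum. The population of 1,000 students (700 day, 300 night), the function name `stratified_sample`, and the seed are assumptions for illustration. Note that rounding each stratum's share may not always add up exactly to the requested total; it does here.

```python
import random

def stratified_sample(strata, total_n, seed=None):
    """Sample from each stratum in proportion to its share of the population."""
    rng = random.Random(seed)
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Proportional allocation, rounded to a whole number of people
        k = round(total_n * len(members) / population_size)
        sample[name] = rng.sample(members, k)
    return sample

# Hypothetical population: 70% day students, 30% night students
students = {
    "day": [f"day_{i}" for i in range(700)],
    "night": [f"night_{i}" for i in range(300)],
}
chosen = stratified_sample(students, 200, seed=0)
print(len(chosen["day"]), len(chosen["night"]))   # → 140 60
```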
Symmetric Distribution
the tails extend equally in both directions. Therefore, there is no skew, and if folded in the middle, two sides would match perfectly
Independent variable
variable manipulated by an experimenter
Dependent variable
variable that measures the experimental outcome. In most experiments, the effects of the independent variable on the dependent variable are observed
Nominal scale
when measuring using this scale, one simply names or categorizes responses; the categories do not imply any ordering among the responses, as is necessary for ordinal, interval, and ratio scales. Nominal scales embody the lowest level of measurement. Gender, handedness, favorite color, and religion are examples of variables measured on this scale
A difference between the means of two groups on an ordinal rating scale...
will usually reflect a valid difference between groups, but can be in the opposite direction of the "true" difference between means. For all but the most extraordinary situations, differences between means on interval scales are meaningful. However, in extreme circumstances, it is possible for the difference on an ordinal scale to be in the opposite direction from the difference on an interval scale.