Statistics, test #2 - stats and methods - ch. 3 & 4
Sum of Squares
(SS) A numerical value obtained by subtracting the mean of a distribution from each score in the distribution, squaring each difference, and then summing the squared differences.
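As a quick check of the definition, here is a minimal Python sketch that computes SS for the scores 2, 4, 6, 7, 8 (the same scores used in the mean question in this set). The function name `sum_of_squares` is an illustrative choice, not something from the course material.

```python
def sum_of_squares(scores):
    """Sum of squared deviations of each score from the mean."""
    mean = sum(scores) / len(scores)
    # Subtract the mean from each score, square the difference, then add them up.
    return sum((x - mean) ** 2 for x in scores)


scores = [2, 4, 6, 7, 8]
ss = sum_of_squares(scores)
print(round(ss, 2))  # 23.2
```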
Mean
(X̄) The sum of a set of scores divided by the number of scores summed. -Arithmetic average -Most common measure of central tendency -Influenced by extreme scores -Data should have interval properties; cannot be used with nominal or ordinal data -Sample mean is the best estimator of the population mean -Can be manipulated algebraically -Any change of a score in the distribution affects the mean
Leptokurtic
- If you move scores from shoulders of a mesokurtic distribution into the center and tails of a distribution, the result is a peaked distribution with thick tails. This shape is referred to as leptokurtic.
Positively skewed
- A distribution is positively skewed when it has a tail extending out to the right (larger numbers). When a distribution is positively skewed, the mean is greater than the median, reflecting the fact that the mean is sensitive to each score in the distribution and is subject to large shifts when the sample is small and contains extreme scores.
Negatively skewed
- A negatively skewed distribution has an extended tail pointing to the left (smaller numbers) and reflects bunching of numbers in the upper part of the distribution with fewer scores at the lower end of the measurement scale.
Mesokurtic
- A normal distribution is called mesokurtic. The tails of a mesokurtic distribution are neither too thin nor too thick, and there are neither too many nor too few scores in the center of the distribution.
Platykurtic
- Starting with a mesokurtic distribution and moving scores from both the center and tails into the shoulders, the distribution flattens out and is referred to as platykurtic.
Outliers
-An extreme score that is not typical of the rest of the distribution -It may be larger or smaller than the other numbers -Distorts the mean. To find an outlier: -Organize your data -Look for extreme scores -If the mean and median differ by a large amount, you may have an outlier
A measurement instrument was used at Mercy Hospital in a sample of 175 patients. There were 35 true positives, 40 false positives, 10 false negatives and 90 true negatives. What is the Prevalence of disease in the sample?
.26
A measurement instrument was used at Mercy Hospital in a sample of 175 patients. There were 35 true positives, 40 false positives, 10 false negatives and 90 true negatives. What is the positive predictive value of the measurement tool?
.47
A measurement instrument was used at Mercy Hospital in a sample of 175 patients. There were 35 true positives, 40 false positives, 10 false negatives and 90 true negatives. What is the efficiency of the screening instrument?
.71 -- Efficiency is all correct test results (true positives plus true negatives) divided by the total sample. That is (35 + 90) / 175 = 125 / 175 = .71428, or 71.43%.
A measurement instrument was used at Mercy Hospital in a sample of 175 patients. There were 35 true positives, 40 false positives, 10 false negatives and 90 true negatives. What is the negative predictive value of the measurement tool?
.9
What is the real lower limit of the interval .60 - .69 in the table below? Please use 3 decimal places in your answer. .50 - .59 .60 - .69 .70 - .79 .80 - .89
0.595
A measurement instrument was used at Mercy Hospital in a sample of 175 patients. There were 35 true positives, 40 false positives, 10 false negatives and 90 true negatives. What is the specificity of the measurement tool?
0.69
A measurement instrument was used at Mercy Hospital in a sample of 175 patients. There were 35 true positives, 40 false positives, 10 false negatives and 90 true negatives. What is the sensitivity of the measurement tool?
0.78
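The six Mercy Hospital answers all come from the same four counts, so they can be checked together. The following is a small sketch; the function name `screening_measures` and the dictionary keys are made up for illustration.

```python
def screening_measures(tp, fp, fn, tn):
    """Common screening statistics from the four cells of a 2 x 2 table."""
    total = tp + fp + fn + tn
    return {
        "prevalence": (tp + fn) / total,   # everyone who has the disease
        "sensitivity": tp / (tp + fn),     # diseased patients who test positive
        "specificity": tn / (tn + fp),     # healthy patients who test negative
        "ppv": tp / (tp + fp),             # positive tests that are correct
        "npv": tn / (tn + fn),             # negative tests that are correct
        "efficiency": (tp + tn) / total,   # all correct results
    }


measures = screening_measures(tp=35, fp=40, fn=10, tn=90)
for name, value in measures.items():
    print(f"{name}: {value:.2f}")
```

Rounded to two decimals, this gives .26, .78, .69, .47, .90 and .71, matching the answers on the cards.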
What is the real upper limit of the interval .80 - .89 in the table below? Please use 3 decimal places in your answer. .50 - .59 .60 - .69 .70 - .79 .80 - .89
0.895
What is the mode of the following distribution? Round your answer to the nearest 2 decimal places. 2 2 2 2 3 5 7 9 10 10 10
2
Calculate the 20% winsorized trimmed mean of the following distribution. Round your answer to the nearest 2 decimal places. x 10 13 14 17 19 22 24 24 26 27 27 27 28 29 30 37 47 77 88 100
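27.2 -- Replace the 4 lowest scores with the adjacent value (19) and the 4 highest scores with the adjacent value (37), then take the mean of all 20 scores: 544 / 20 = 27.2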
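The replacement step can be sketched in Python as follows. This assumes the convention used on this card (20% of 20 scores means 4 scores are replaced in each tail); `winsorized_mean` is a hypothetical helper name.

```python
def winsorized_mean(scores, proportion):
    """Mean after replacing the k lowest and k highest scores with the nearest kept score."""
    data = sorted(scores)
    # Number of scores replaced in each tail; the tiny offset guards against
    # floating-point products such as 3.9999999 being truncated to 3.
    k = int(proportion * len(data) + 1e-9)
    if 2 * k >= len(data):
        raise ValueError("proportion leaves no scores to keep")
    if k > 0:
        data[:k] = [data[k]] * k          # low tail takes the next score up
        data[-k:] = [data[-k - 1]] * k    # high tail takes the next score down
    return sum(data) / len(data)


x = [10, 13, 14, 17, 19, 22, 24, 24, 26, 27,
     27, 27, 28, 29, 30, 37, 47, 77, 88, 100]
result = winsorized_mean(x, 0.20)
print(round(result, 2))  # 27.2
```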
What is the mean of the following distribution? Round your answer to the nearest 2 decimal places. 2 4 6 7 8
5.4
Given the following data, list the outliers. 50, 60, 73, 77, 80, 81, 82, 83, 84, 84, 84, 85, 88, 95, 100
50, 60, 100. These are the values that go beyond the ends of the whiskers.
What is the real lower limit of the interval 60 - 69 in the table below? Please use 2 decimal places in your answer. 50 - 59 60 - 69 70 - 79 80 - 89
59.5
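The real-limit answers above follow one rule: go half a unit of measurement below the bottom of the interval and half a unit above the top. A minimal sketch, assuming the unit is 1 for whole-number intervals and .01 for the two-decimal intervals (`real_limits` is an illustrative name):

```python
def real_limits(lower, upper, unit):
    """Real lower and upper limits of an interval measured to the nearest `unit`."""
    half_unit = unit / 2
    return lower - half_unit, upper + half_unit


low, high = real_limits(60, 69, unit=1)
dec_low, dec_high = real_limits(0.60, 0.69, unit=0.01)
print(low, high)                               # 59.5 69.5
print(round(dec_low, 3), round(dec_high, 3))   # 0.595 0.695
```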
What is the median of the following distribution? Round your answer to the nearest 2 decimal places. 2 4 6 7 8
6
What is the median of the following distribution? Round your answer to the nearest 2 decimal places. 2 4 6 6 6 10 20 25
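6.17 -- The median falls within the three tied 6s. Two scores (2 and 4) lie below the real lower limit (RLL) of 6, which is 5.5, so 2 of the 3 sixes are needed to reach the middle of the 8 scores: RLL of 6 plus 2/3 = 5.5 + .67 = 6.17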
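The interpolation used for tied scores can be written as a short function. This is a sketch of the real-lower-limit method shown on the card, assuming scores are measured to the nearest whole unit (`width=1`); the name `interpolated_median` is hypothetical.

```python
from collections import Counter


def interpolated_median(scores, width=1.0):
    """Median found by interpolating within the score that contains the middle case."""
    if not scores:
        raise ValueError("scores must not be empty")
    counts = Counter(scores)
    half = len(scores) / 2          # number of cases that must fall below the median
    below = 0                       # cases counted so far
    for value in sorted(counts):
        freq = counts[value]
        if below + freq >= half:
            real_lower_limit = value - width / 2
            # Move into the interval by the fraction of its cases still needed.
            return real_lower_limit + (half - below) / freq * width
        below += freq


median = interpolated_median([2, 4, 6, 6, 6, 10, 20, 25])
print(round(median, 2))  # 6.17
```

For the other two median questions in this set (2 4 6 7 8 and 2 4 6 7 8 10), the same function returns 6.0 and 6.5.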
What is the median of the following distribution? Round your answer to the nearest 2 decimal places. 2 4 6 7 8 10
6.5
Suppose you wanted to construct a box plot of the following data. What is the end of the lower whisker? This is the end of the lower whisker, not the maximum lower whisker. We will use this same data for the rest of the questions in this practice test. 50, 60, 75, 79, 80, 81, 82, 83, 84, 85, 87, 88, 92, 95, 100, 106, 120
75
Suppose you wanted to construct a box plot of the following data. What is the first quartile? We will use this same data for the rest of the questions in this practice test. 50, 60, 75, 79, 80, 81, 82, 83, 84, 85, 87, 88, 92, 95, 100, 106, 120
80
correlation
A measure of the extent to which two factors vary together, and thus of how well either factor predicts the other; a reciprocal connection between two or more things.
Normal distribution
A normal distribution of data means that most of the examples in a set of data are close to the "average," while relatively few examples tend to one extreme or the other
Sample
A portion of the population selected for a study
Random sample
A sample drawn in such a way that each element of the population has the same chance of being included in the sample
Given the following data, what is the first quartile? .95, 1.06, 1.13, 1.40, 1.41, 1.56, 1.63, 1.73, 1.73, 2.03
Count over 3 from the left (3 is the QL) and you get 1.13
Given the following data, what is the whisker end on the right? .95, 1.06, 1.13, 1.40, 1.41, 1.56, 1.63, 1.73, 1.73, 2.03
Start from the 3rd quartile and add the maximum whisker length: 1.73 + .90 = 2.63. There is not a 2.63 in the data, so move to the left until you reach 2.03.
Given the following data, what is the end of the whisker on the left? .95, 1.06, 1.13, 1.40, 1.41, 1.56, 1.63, 1.73, 1.73, 2.03
From the 1st quartile, go .90 to the left, which is 1.13 - .90 = 0.23. There is not a 0.23 in the data, so move to the right until you reach 0.95.
Given the following data, what is the interquartile range? .95, 1.06, 1.13, 1.40, 1.41, 1.56, 1.63, 1.73, 1.73, 2.03
IQR = 3rd quartile minus 1st quartile = 1.73 - 1.13 = .60
Symmetrical unimodal
In a perfectly symmetrical unimodal distribution, mean, median, and mode are identical.
An advantage of the mean is
It can be manipulated algebraically
Given the following data, what is the median location? 50, 60, 73, 77, 80, 81, 82, 83, 84, 84, 84, 85, 88, 95, 100
ML = (n+1) / 2 = (15 + 1) / 2 = 16 / 2 = 8
Given the following data, what is the maximum length of the whisker? .95, 1.06, 1.13, 1.40, 1.41, 1.56, 1.63, 1.73, 1.73, 2.03
Max length of whisker = 1.5 * IQR = 1.5 * .60 = .90
Given the following data, what is the maximum whisker length? 50, 60, 73, 77, 80, 81, 82, 83, 84, 84, 84, 85, 88, 95, 100
Maximum whisker length is 1.5 * IQR = 1.5 * 8 = 12
Median
Middle score -The score that has an equal number of scores above and below it (the 50th percentile). -It cuts the distribution into two equal parts. 50% split of data. -Not affected by extreme scores (desirable for skewed distributions). -Can be used with ordinal and interval data, but not with nominal data. -Does not take into account all scores. -Not a stable measure of central tendency.
Given the following data, what is the median location? .95, 1.06, 1.13, 1.40, 1.41, 1.56, 1.63, 1.73, 1.73, 2.03
ML = (n+1) / 2, which equals (10 + 1) / 2 = 5.5
Mode
Most frequent score. Finding the mode: -Put the data in order -Choose the most frequently occurring score in the data set. UNIMODAL: distribution has only one mode. BIMODAL: distribution has two modes. MULTIMODAL: distribution has more than 2 modes. -Mode may not appear in all data sets. -Data set may contain multiple modes. -Not a stable measure of central tendency. -Not affected by extreme scores. -Can be used with nominal, ordinal, interval, or ratio data.
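The steps for finding the mode, and the unimodal / bimodal / multimodal labels, can be sketched like this. Treating a data set in which every score occurs once as having no mode is an assumption made here to match "Mode may not appear in all data sets"; `find_modes` and `modality` are illustrative names, and the second and third data sets are made-up examples.

```python
from collections import Counter


def find_modes(scores):
    """Return the most frequent score(s), or [] if every score occurs only once."""
    if not scores:
        raise ValueError("scores must not be empty")
    counts = Counter(scores)
    highest = max(counts.values())
    if highest == 1 and len(counts) > 1:
        return []  # assumption: no repeated score means no mode
    return sorted(value for value, count in counts.items() if count == highest)


def modality(scores):
    """Label a data set by how many modes it has."""
    labels = {0: "no mode", 1: "unimodal", 2: "bimodal"}
    return labels.get(len(find_modes(scores)), "multimodal")


card_data = [2, 2, 2, 2, 3, 5, 7, 9, 10, 10, 10]
print(find_modes(card_data), modality(card_data))   # [2] unimodal
print(find_modes([1, 1, 2, 3, 3]))                  # [1, 3]
print(modality([1, 2, 3, 4]))                       # no mode
```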
Which measurement of Central Tendency to Use
Nominal data: mode. Ordinal data: median. Interval or ratio data: mean for a symmetrical distribution (no outliers); median for a skewed distribution (outliers).
Scale of Measurement of Scores
Nominal: Mode. Ordinal: Mode, Median. Interval: Mode, Median, Mean. Ratio: Mode, Median, Mean.
Sample size
Number of units in a sample
Measures of Variability
Numbers that indicate how much scores differ from each other and from the measure of central tendency in a set of scores. -Range, Variance, Standard Deviation
Measures of Central Tendency
Numbers that represent the average or typical score obtained from measurements of a sample. -Indicate typical score obtained -Mean, Median, Mode
Given the following data, what is the quartile location? 50, 60, 73, 77, 80, 81, 82, 83, 84, 84, 84, 85, 88, 95, 100
QL = (ML + 1) / 2 = (8+1) / 2 = 9/2 = 4.5 but you drop the fraction so it is 4
Given the following data, what is the quartile location? .95, 1.06, 1.13, 1.40, 1.41, 1.56, 1.63, 1.73, 1.73, 2.03
QL = (ML + 1) / 2, which equals (5 + 1) / 2, which equals 6 / 2 = 3. Note: we dropped the fraction in the ML, from 5.5 to 5.
Given the following data, what is the interquartile range (IQR)? 50, 60, 73, 77, 80, 81, 82, 83, 84, 84, 84, 85, 88, 95, 100
The quartile on the right is 4 scores in from the right end = 85. The quartile on the left is 4 scores in from the left end = 77. IQR = 85 - 77 = 8.
Given the following data, what is end of the whisker on the left side of the distribution? 50, 60, 73, 77, 80, 81, 82, 83, 84, 84, 84, 85, 88, 95, 100
Start from the 1st quartile, which is 77. Subtract the maximum whisker length: 77 - 12 = 65. This is the maximum possible location of the whisker. There is not a 65 in the data, so move to the right until you find the data value that represents the end of the whisker, which is 73.
Given the following data, what is end of the whisker on the right side of the distribution? 50, 60, 73, 77, 80, 81, 82, 83, 84, 84, 84, 85, 88, 95, 100
Start from the 3rd quartile, which is 85. Add the maximum whisker length: 85 + 12 = 97. This is the maximum possible location of the whisker. There is not a 97 in the data, so move to the left until you find the data value that represents the end of the whisker, which is 95.
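The box plot cards all follow the same counting procedure: median location, quartile location (dropping fractions), quartiles, IQR, maximum whisker length, whisker ends, outliers. Below is a sketch of that procedure as these cards apply it. Statistical software often uses other quartile rules, so its results may differ slightly, and `box_plot_summary` is a hypothetical name.

```python
def box_plot_summary(scores):
    """Quartiles, whisker ends, and outliers using the counting method from these cards."""
    data = sorted(scores)
    n = len(data)
    median_location = (n + 1) // 2                  # (n + 1) / 2 with the fraction dropped
    quartile_location = (median_location + 1) // 2  # fraction dropped again
    q1 = data[quartile_location - 1]                # count in from the left
    q3 = data[-quartile_location]                   # count in from the right
    iqr = q3 - q1
    max_whisker = 1.5 * iqr
    low_limit = q1 - max_whisker                    # farthest the left whisker may reach
    high_limit = q3 + max_whisker                   # farthest the right whisker may reach
    inside = [x for x in data if low_limit <= x <= high_limit]
    return {
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "max_whisker": max_whisker,
        "whiskers": (inside[0], inside[-1]),        # most extreme scores within the limits
        "outliers": [x for x in data if x < low_limit or x > high_limit],
    }


data = [50, 60, 73, 77, 80, 81, 82, 83, 84, 84, 84, 85, 88, 95, 100]
summary = box_plot_summary(data)
print(summary)
```

For this data set the sketch gives quartiles of 77 and 85, an IQR of 8, a maximum whisker length of 12, whisker ends at 73 and 95, and outliers 50, 60, and 100.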
Descriptive Statistics
Statistical procedures used to summarize and describe the data from a sample. -Describe raw data with a single number -Way of capturing trends in data -Two Types of Descriptive Statistics
Mean
Sum of the observations divided by the number of observations = average. The average result of a test, survey, or experiment.
Range
The difference between the largest and smallest data value in a data set
Deviation
The difference of a score in a set of scores from the mean of that set of scores.
Given the following data, list the outliers. .95, 1.06, 1.13, 1.40, 1.41, 1.56, 1.63, 1.73, 1.73, 2.03
The ends of the whiskers are the highest and lowest numbers in the data. That means there are no outliers.
Median
The middle number or center value of a set of data in which all the data are arranged in sequence
What is the 20% trimmed mean of the following distribution? Round your answer to the nearest 2 decimal places. X 10 13 14 17 19 22 24 24 26 27 27 27 28 29 30 37 47 77 88 100
The mean of the 20 numbers is 34.3 and the 20% trimmed mean is 26.67.
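Both numbers in this answer can be reproduced with a few lines of Python. The sketch assumes a 20% trimmed mean drops 20% of the scores from each end (4 of the 20 here), which is what the stated answer implies; `trimmed_mean` is an illustrative name.

```python
def trimmed_mean(scores, proportion):
    """Mean of the scores left after dropping `proportion` of them from each tail."""
    data = sorted(scores)
    # Scores dropped per tail; the small offset avoids float truncation errors.
    k = int(proportion * len(data) + 1e-9)
    if 2 * k >= len(data):
        raise ValueError("proportion leaves no scores to average")
    kept = data[k:len(data) - k]
    return sum(kept) / len(kept)


x = [10, 13, 14, 17, 19, 22, 24, 24, 26, 27,
     27, 27, 28, 29, 30, 37, 47, 77, 88, 100]
plain_mean = sum(x) / len(x)
trimmed = trimmed_mean(x, 0.20)
print(round(plain_mean, 2), round(trimmed, 2))  # 34.3 26.67
```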
Suppose you wanted to construct a box plot of the following data. What would be the median? We will use this same data for the rest of the questions in this practice test. 50, 60, 75, 79, 80, 81, 82, 83, 84, 85, 87, 88, 92, 95, 100, 106, 120
The median location is (17 + 1) / 2 = 9, and the ninth number is 84.
Skewed Distribution
The mode is at the peak of the curve, mean is closest to the tail and median is positioned between the mode and the mean. The median is the best measure of central tendency for skewed distributions.
Mode
The value or values that occur most frequently in a data set
Shape of Distribution of Scores
Unimodal and perfectly symmetrical distribution: Mean = Median = Mode. Skewed distribution: Mode > Median > Mean when negatively skewed; Mode < Median < Mean when positively skewed.
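In a skewed distribution the mean is pulled toward the tail while the mode stays at the peak, with the median in between. The two small data sets below are made up purely to illustrate that ordering; they do not come from the course material.

```python
from statistics import mean, median, mode

# Hypothetical scores with a tail to the right (one large value).
positively_skewed = [1, 2, 2, 3, 4, 5, 10]
# Hypothetical scores with a tail to the left (one small value).
negatively_skewed = [0, 5, 6, 7, 8, 8, 9]

for label, scores in [("positive", positively_skewed),
                      ("negative", negatively_skewed)]:
    print(label, mode(scores), median(scores), round(mean(scores), 2))
# positive 2 3 3.86
# negative 8 7 6.14
```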
Statistics
A set of concepts, rules, and procedures that help us to: organize numerical information in the form of tables, graphs, and charts; understand statistical techniques underlying decisions that affect our lives and well-being; and make informed decisions.
Standard deviation
A statistic that tells you how tightly all the various examples are clustered around the mean in a set of data. When the examples are pretty tightly bunched together and the bell-shaped curve is steep, the standard deviation is small. When the examples are spread apart and the bell curve is relatively flat, that tells you that you have a relatively large standard deviation. In a normal distribution, about 68% of the data will fall within one standard deviation of the mean, 95% of the data will fall within two standard deviations of the mean and 99.7% of the data will fall within three standard deviations of the mean.
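The 68 / 95 / 99.7 percentages apply to a normal distribution and can be checked with the error function in Python's standard library. This is only a sketch for verifying the rule; `proportion_within` is a made-up helper name.

```python
import math


def proportion_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, f"{proportion_within(k):.1%}")
# 1 68.3%
# 2 95.4%
# 3 99.7%
```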
Qualitative Variable
a variable based on categorical data.
Continuous Variable
a variable that can take on many different values, in theory, any value between the lowest and highest points on the measurement scale.
Independent Variable
a variable that is manipulated, measured, or selected by the researcher as an antecedent condition to an observed behavior. In a hypothesized cause-and-effect relationship, the independent variable is the cause and the dependent variable is the outcome or effect.
Dependent Variable
a variable that is not under the experimenter's control -- the data. It is the variable that is observed and measured in response to the independent variable.
Discrete Variable
a variable with a limited number of values (e.g., gender (male/female), college class (freshman/sophomore/junior/senior)).
Categorical data
also referred to as frequency or qualitative data. Things are grouped according to some common property(ies) and the number of members of each group is recorded (e.g., males/females, vehicle type).
diagram in which occurrence frequency of different values of X is represented by height
bar graph
a normal distribution must
be symmetric
real lower limit
boundary halfway between the bottom of one interval and the top of the next
real upper limit
boundary halfway between the top of one interval and the bottom of the next
The "real lower limit" of an interval in a histogram is
the lowest continuous value that would be rounded up into that interval.
unimodal
characteristic of distribution having one distinct peak
symmetry
characteristic of having the same shape on both sides of the center
Given the following data, what is the median? .95, 1.06, 1.13, 1.40, 1.41, 1.56, 1.63, 1.73, 1.73, 2.03
Count over 5.5 places from the left: the median lies between the 5th and 6th scores, 1.41 and 1.56. The mean of these is 1.485.
Histogram
diagram in which rectangles are used to represent recurrence of observations within each interval
line graph
diagram in which the Y values corresponding to different values of X are connected
Data
facts, observations, and information that come from investigations.
stem-and-leaf display
graphic presenting original data arranged into a histogram
A negatively skewed distribution
has a tail pointing to the left
leaf
horizontal axis of display containing the trailing digits
Which of the following is not an advantage of the median?
it can be manipulated algebraically
leading digit
leftmost numeral of a number
Skewness
measure of the degree to which a distribution is asymmetrical
modality
number of major peaks in a distribution
less significant digit
numeral to the right of the leading digit
Standard deviation
(s or σ) is defined as the positive square root of the variance. The variance is a measure in squared units and has little meaning with respect to the data. Thus, the standard deviation is a measure of variability expressed in the same units as the data. The standard deviation is very much like a mean or an "average" of these deviations. In a normal (symmetric and mound-shaped) distribution, about two-thirds of the scores fall between +1 and -1 standard deviations from the mean, and the standard deviation is approximately 1/4 of the range in small samples (N < 30) and 1/5 to 1/6 of the range in large samples (N > 100).
Symmetric
Distributions that have the same shape on both sides of the center are called symmetric. The normal distribution is one example of a symmetric distribution with only one peak.
Kurtosis
Like skewness, kurtosis has a specific mathematical definition, but generally it refers to how scores are concentrated in the center of the distribution, the upper and lower tails (ends), and the shoulders (between the center and tails) of a distribution.
Interquartile Range (IQR)
Provides a measure of the spread of the middle 50% of the scores. The IQR is defined as the 75th percentile minus the 25th percentile. The interquartile range plays an important role in the graphical method known as the boxplot. The advantage of using the IQR is that it is easy to compute and extreme scores in the distribution have much less impact, but its strength is also a weakness in that it suffers as a measure of variability because it discards too much data. Researchers want to study variability while eliminating scores that are likely to be accidents. The boxplot allows for this distinction and is an important tool for exploring data.
Skewness
Refers to the degree of asymmetry in a distribution. Asymmetry often reflects extreme scores in a distribution.
Variance
The variance is a measure based on the deviations of individual scores from the mean. As noted in the definition of the mean, however, simply summing the deviations will result in a value of 0. To get around this problem, the variance is based on squared deviations of scores about the mean. When the deviations are squared, the rank order and relative distance of scores in the distribution are preserved while negative values are eliminated. Then, to control for the number of subjects in the distribution, the sum of the squared deviations, Σ(X - X̄)², is divided by N (population) or by N - 1 (sample). The result is the average of the squared deviations, and it is called the variance.
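The two divisors (N for a population, N - 1 for a sample) can be seen side by side in a short sketch. The scores 2, 4, 6, 7, 8 are borrowed from the mean question in this set, and the function name `variance` is illustrative; Python's built-in `statistics` module is used only as a cross-check.

```python
import math
import statistics


def variance(scores, sample=True):
    """Sum of squared deviations divided by N - 1 (sample) or N (population)."""
    n = len(scores)
    mean = sum(scores) / n
    ss = sum((x - mean) ** 2 for x in scores)
    return ss / (n - 1) if sample else ss / n


scores = [2, 4, 6, 7, 8]
sample_var = variance(scores, sample=True)
population_var = variance(scores, sample=False)
sample_sd = math.sqrt(sample_var)   # standard deviation is the positive square root

print(round(sample_var, 2), round(population_var, 2), round(sample_sd, 2))
# 5.8 4.64 2.41

# Cross-check against the standard library.
print(math.isclose(sample_var, statistics.variance(scores)))       # True
print(math.isclose(population_var, statistics.pvariance(scores)))  # True
```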
Histogram
A form of a bar graph used with interval or ratio-scaled data. Unlike the bar graph, bars in a histogram touch, with the width of the bars defined by the upper and lower limits of the interval. The measurement scale is continuous, so the lower limit of any one interval is also the upper limit of the previous interval.
Scatterplot
A form of graph that presents information from a bivariate distribution. In a scatterplot, each subject in an experimental study is represented by a single point in two-dimensional space. The underlying scale of measurement for both variables is continuous (measurement data). This is one of the most useful techniques for gaining insight into the relationship between two variables.
Bar graph
A form of graph that uses bars separated by an arbitrary amount of space to represent how often elements within a category occur. The higher the bar, the higher the frequency of occurrence. The underlying measurement scale is discrete (nominal or ordinal-scale data), not continuous.
Boxplot
A graphical representation of dispersions and extreme scores. Represented in this graphic are minimum, maximum, and quartile scores in the form of a box with "whiskers." The box includes the range of scores falling into the middle 50% of the distribution (interquartile range = 75th percentile - 25th percentile), and the whiskers are lines extended to the minimum and maximum scores in the distribution or to mathematically defined (+/-1.5*IQR) upper and lower fences.
Quantitative Variable
a variable based on quantitative data.
frequency distribution
occurrence in which dependent variable values are tabled or plotted against their recurrence
"μ" (mu) is the
population mean
exploratory data analysis (EDA)
set of techniques developed by Tukey for presenting data in visually meaningful ways
Variable
property of an object or event that can take on different values. For example, college major is a variable that takes on values like mathematics, computer science, English, psychology, etc.
Which of the following is an advantage of the median?
Relatively unaffected by extreme scores; it does not depend on the assumption of interval or ratio level data.
trailing digit
rightmost numeral of a number
Xbar is the
sample mean
Someone asks you if you have seen the movie Titanic. Before you answer, you look back into your memory for all of the movies you have ever seen and review the titles one at a time. This is an example of
sequential processing
Measurement data
sometimes called quantitative data -- the result of using some instrument to measure something (e.g., test score, weight);
If the mean score of test #1 was 80.00 in section 01 with 20 students, 70.00 in section 02 with 15 students and 50.00 in section 03 with 40 students, what is the mean score of all students in all three sections? Round your answer to the nearest 2 decimal places.
sum of (n * mean) = 4650; sum of n = 75; 4650 / 75 = 62.00. Below is how I answered the question with Excel.
Section      n    mean   n * mean
Section 01   20   80     1600
Section 02   15   70     1050
Section 03   40   50     2000
Sums         75   200    4650
Weighted mean = sum of (n * mean) divided by sum of n, which is 4650 / 75 = 62 in this example.
What is the mean of the following frequency distribution? Round your answer to the nearest 2 decimal places. X = 2, f = 5; X = 3, f = 6; X = 4, f = 4
sum of xf = 44 sum of f = 15 44 / 15 = 2.93
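The section-means question and the frequency-distribution question use the same formula: multiply each value by its weight (n or f), add the products, and divide by the sum of the weights. A minimal sketch, with `weighted_mean` as an illustrative name:

```python
def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must be the same length")
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Three section means weighted by the number of students in each section.
section_mean = weighted_mean([80, 70, 50], [20, 15, 40])
# Scores weighted by their frequencies.
frequency_mean = weighted_mean([2, 3, 4], [5, 6, 4])

print(round(section_mean, 2), round(frequency_mean, 2))  # 62.0 2.93
```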
An advantage of the mode is
the mode can be used with nominal data
stem
vertical axis of display containing the leading digits
In deciding on the number of stems to use in a stem and leaf display,
you should normally make all of the stems the same width.
Measures of Shape
For distributions summarizing data from continuous measurement scales, statistics can be used to describe how the distribution rises and drops.
Measures of Center
Plotting data in a frequency distribution shows the general shape of the distribution and gives a general sense of how the numbers are bunched. Several statistics can be used to represent the "center" of the distribution. These statistics are commonly referred to as measures of central tendency.
Graphs
Visual display of data used to present frequency distributions so that the shape of the distribution can easily be seen.