Chapter 5 - Basic Stats Concepts, and Descriptive Stats
Factors that influence reliability
- Administration errors - Test length: the more questions, the better - Item homogeneity: the more similar the questions, the better - Item difficulty - Interval between tests - Error effects - Objectivity: responses that are subjective and involve observations leave more room for error - Testing environment/student factors: fatigue/illness, temperature of the environment, etc.
Classical test theory subsumed 3 primary types of evidence that support the validity of a measure
1. Construct 2. Content 3. Criterion
To compute variance use the following steps
1. Find the mean 2. Find the difference between each observation and the mean 3. Square each difference score 4. Sum the squared differences 5. Because the data are a sample, divide the sum of the squared differences (step 4) by the number of observations minus one (n - 1)
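The five steps above can be sketched in Python as follows. This is a minimal illustration; the function name `sample_variance` and the example scores are made up for this sketch.

```python
def sample_variance(scores):
    """Sample variance, following the five steps one at a time."""
    n = len(scores)
    mean = sum(scores) / n                      # step 1: find the mean
    deviations = [x - mean for x in scores]     # step 2: difference from the mean
    squared = [d ** 2 for d in deviations]      # step 3: square each difference
    sum_of_squares = sum(squared)               # step 4: sum the squared differences
    return sum_of_squares / (n - 1)             # step 5: divide by n - 1 (sample)


scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical example data
print(sample_variance(scores))
```

For these scores the mean is 5 and the squared differences sum to 32, so the result is 32/7.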
Stanley Smith Stevens, Scales of measurement
1. Nominal 2. Ordinal 3. Interval 4. Ratio
Histogram
A bar graph depicting a frequency distribution. Bars are contiguous and represent increasing magnitude of the variable. Can also be used to show percentage and/or frequency
Symmetrical Distribution
A distribution in which the patterns of frequencies on the left and right sides are mirror images of each other. The best-known example is the normal curve (bell curve): one peak (mode), and the curve is constant and always bell shaped.
Positively Skewed distribution
A distribution where the scores pile up on the left side and the line tapers off to the right. The tail is pulled toward the right of the x-axis. The mean will be higher than the median in this distribution.
Kuder-Richardson Formula 20
A measure of homogeneity for dichotomous responses (yes/no, correct/incorrect). It functions under the assumption that all items on a test measure the same thing or are at the same difficulty level. A type of split-half estimate: the formula computes an average across all the possible splits to yield an overall reliability estimate.
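The card above does not give the formula itself. A commonly cited form is KR-20 = (k / (k - 1)) * (1 - Σpq / σ²), where k is the number of items, p is the proportion answering an item correctly, q = 1 - p, and σ² is the variance of total scores. The sketch below assumes that form and assumes the population variance (dividing by n) for total scores; the function name and data are hypothetical.

```python
def kr20(responses):
    """responses: one row per test-taker, each row a list of 0/1 item scores."""
    n_people = len(responses)
    k = len(responses[0])

    # Sum of p * q over items (p = proportion correct, q = 1 - p).
    pq_sum = 0.0
    for item in range(k):
        p = sum(row[item] for row in responses) / n_people
        pq_sum += p * (1 - p)

    # Variance of total scores; dividing by n is an assumed convention here.
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_people
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_people

    return (k / (k - 1)) * (1 - pq_sum / var_total)


# Hypothetical data: 4 test-takers, 3 dichotomous items.
responses = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(kr20(responses))
```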
Standard Deviation
A measure of variability that describes an average distance of every individual score from the group mean. We interpret this as indicating how far something is from the mean and how many points in our sample fall within that distance
Standard normal distribution
A normal distribution with a mean of 0 and a standard deviation of 1.
T-Scores
A test score that is converted to a normal distribution that has a mean of 50 and a standard deviation of 10. Always whole numbers and always positive.
Criterion Validity
The extent to which a test predicts an outcome based on information from other measures
Negatively Skewed Distribution
Asymmetric distribution in which the majority of the data is concentrated to the right of the mean. The tail is pulled toward the left of the x-axis. The mean will be lower than the median in a negatively skewed distribution.
Nominal Scale of measurement
Categorical or grouping data. Assigning observations into various independent categories and then counting the frequency of occurrence within each of the categories. Chi-square tests can be used with this type of data.
Ways to calculate inter-rater reliability
Cohen's Kappa and variations of kappa
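As a rough sketch of how Cohen's kappa works for two raters: it compares the observed proportion of agreement with the agreement expected by chance, kappa = (p_o - p_e) / (1 - p_e). The formula is not stated in these notes; the function name and the ratings below are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who rated the same cases."""
    n = len(rater_a)
    # Observed agreement: proportion of cases where both raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal proportions, per category.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    categories = set(rater_a) | set(rater_b)
    expected = sum((counts_a[c] / n) * (counts_b[c] / n) for c in categories)

    return (observed - expected) / (1 - expected)


# Hypothetical ratings of four cases by two scorers.
rater_a = ["yes", "yes", "no", "no"]
rater_b = ["yes", "no", "no", "no"]
print(cohens_kappa(rater_a, rater_b))
```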
Evidence of Criterion Validity
Concurrent & Predictive evidence
Standard Scores in a normal distribution
Converting sample/raw scores to a standard metric helps to compare across a sample and have a standard scale of reference. Allows us to determine probability of a particular score occurring within our sample.
Mean
the arithmetic average of a distribution, obtained by adding the scores and then dividing by the number of scores
Median
the exact middle score in a distribution; half the scores are above it and half are below it. List all scores in numerical order and locate the center score. With an odd number of values, the position of the median can be computed by: Md = (N + 1)/2
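A minimal Python sketch of the procedure: sort the scores, then take the score at position (N + 1)/2. The handling of an even number of values (averaging the two middle scores) is a common convention assumed here, not something stated in these notes; the function name is illustrative.

```python
def median(scores):
    """Middle score of a distribution."""
    ordered = sorted(scores)          # list all scores in numerical order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd N: 1-based position (N + 1)/2 is 0-based index N // 2.
        return ordered[mid]
    # Even N (assumed convention): average the two middle scores.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([7, 1, 5, 3, 9]))   # ordered: 1 3 5 7 9, position (5 + 1)/2 = 3
```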
Validity
the extent to which a test measures or predicts what it is supposed to. The applicability, meaningfulness, and usefulness of the specific inferences made from scores
Reliability Coefficient
the measure of the degree of reliability of an assessment. Ranges from 0.00 to 1.00; higher numbers mean greater reliability. .90 or higher is preferred
Standard Deviation formula
the square root of the variance; converts the variance back into the same units as the raw data. Subtracting the SD from the mean gives you the lower limit; adding the SD to the mean gives you the upper limit
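A short Python sketch of this definition, assuming the sample variance (dividing by n - 1). The scores and variable names are hypothetical.

```python
import math

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical example data

n = len(scores)
mean = sum(scores) / n
variance = sum((x - mean) ** 2 for x in scores) / (n - 1)  # sample variance

sd = math.sqrt(variance)   # back in the same units as the raw scores
lower = mean - sd          # lower limit: one SD below the mean
upper = mean + sd          # upper limit: one SD above the mean

print(sd, lower, upper)
```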
Z-Score formula
z = (x - mean)/standard deviation, or in symbols: z = (x - μ)/σ
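The formula translates directly to code. In this small sketch the function name and the example numbers are made up.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


# Hypothetical test with mean 50 and standard deviation 10.
print(z_score(65, 50, 10))   # 1.5 standard deviations above the mean
print(z_score(40, 50, 10))   # 1 standard deviation below the mean
```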
Internal Consistency
Correlating the individual items of a test with each other. High reliability means that the test has homogeneous items, measures a single construct, and its items correlate highly
68-95-99.7 Rule
Designates the approximate percentages of the data for any normal curve that fall within 1, 2, and 3 standard deviations of the mean. - 68% of values are within 1 standard deviation of the mean. - 95% are within 2 standard deviations. - 99.7% are within 3 standard deviations.
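These percentages can be checked numerically. For a normal distribution, the proportion of values within k standard deviations of the mean equals erf(k / sqrt(2)), which Python's standard `math.erf` can evaluate. The function name below is illustrative.

```python
import math


def proportion_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, round(proportion_within(k) * 100, 1))
```

Rounded to one decimal, the values are about 68.3%, 95.4%, and 99.7%, which the rule simplifies to 68, 95, and 99.7.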
Range of variability
Difference between the highest and lowest scores. Does not include all observations/data points.
Split-half reliability
Divide a test into halves (e.g., odd vs. even questions), correlate the scores on each half, and then correct for length. The longer the test, the more reliable it is.
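A sketch of the odd/even split in Python. The notes do not name the length correction; the Spearman-Brown formula, r_full = 2r / (1 + r), is the one usually used and is assumed here. Function names and data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_brown(r_half):
    """Assumed length correction: estimate full-test reliability from the half-test r."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(responses):
    """responses: one row per test-taker, each row a list of item scores."""
    odd_half = [sum(row[0::2]) for row in responses]    # items 1, 3, 5, ...
    even_half = [sum(row[1::2]) for row in responses]   # items 2, 4, 6, ...
    return spearman_brown(pearson(odd_half, even_half))


# Hypothetical data: 4 test-takers, 4 items scored 0/1.
responses = [[1, 1, 1, 1], [1, 1, 0, 0], [0, 0, 0, 0], [1, 0, 1, 1]]
print(split_half_reliability(responses))
```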
Construct Validity; Discriminant Evidence
Extent to which scores on a test do not correlate with, or negatively correlate with, scores on another test that was intended to measure a different construct. Ex. If this validity is high, scores on a test designed to assess X should NOT be highly correlated with scores from tests designed to assess Y. Correlations between theoretically dissimilar tests should be "low"
Variability of Distribution
Extent to which the scores in a distribution differ from one another. Three measures are: Range, Variance, and Standard Deviation
Outliers
Extreme values which are on the tails of the distribution and create the skew
When is reliability generally more important than validity?
For the purposes of estimating consistency and stability of the scale
When is validity generally more important than reliability?
For the purposes of predicting future behavior, scores, success, and performance, and diagnostic purposes
Frequency polygon
Graph of a frequency distribution that shows the number of instances of obtained scores, usually with the data points connected by straight lines. Appropriate for quantitative data, especially continuous data (height or age).
If a distribution has variability it is called
Heterogeneous
Ratio Scale of measurement
Highest form of measurement; meets all of the rules of the other forms of measurement. Has a true zero point. Can use all statistical procedures in data analysis
Most common methods of graphing a distribution?
Histogram (bar graph) and Frequency polygon (percentage)
If a distribution is lacking in variability it is called
Homogeneous
Kurtosis
How "flat" or "peaked" a normal distribution is; indicates how much variability there is in the distribution of scores. Indicates the likelihood of extreme outcomes.
Variance of variability
How close the scores in the distribution are to the mean. This is the average of the squared deviations from the mean
OS = TS + error
In classical test theory, any Observed Score (OS) consists of the True Score (TS) plus some amount of error.
Systematic errors
Errors that increase or decrease a response by a predictable amount on each measurement
Chi-Square tests
Nonparametric tests used to compare frequency data from two samples, or to compare observed and expected frequencies.
Interval Scale of measurement
Ranking items, but with equivalent and meaningful distances between scale points. Does NOT have a true Zero point. This measurement can help to find: - Mean and Standard Deviation - Correlation and regression - Analysis of variance (ANOVA) - Factor Analysis
Construct Validity; Convergent Evidence
Refers to the degree to which scores on a test correlate highly and positively with scores on other tests that are designed to assess the same construct or trait. Correlations between theoretically similar tests should be "high" when they assess the same construct/trait
Stanine
Standard nine scale. A method of scaling test scores on a nine-point standard scale with a mean of five (5) and a standard deviation of two (2). These are useful in comparing scores across different content areas
Types of distributions of visual data (graphs)
Symmetrical, Skewed, Multimodal
T-Score Formula
T = 50 + 10z. You have to have the z-score to find the T-score
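A small sketch of the conversion, starting from a raw score so that both steps (z first, then T) are visible. The function name and numbers are hypothetical.

```python
def t_score(x, mean, sd):
    """Convert a raw score to a T-score (mean 50, standard deviation 10)."""
    z = (x - mean) / sd      # the z-score is needed first
    return 50 + 10 * z


# Hypothetical raw scores on a test with mean 100 and standard deviation 15.
print(t_score(115, 100, 15))   # z = 1, so T = 60
print(t_score(70, 100, 15))    # z = -2, so T = 30
```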
Criterion Validity; Concurrent Evidence
Test scores and criterion measures are obtained at the same time. Ex. IQ test scores compared to student's most recent school grades would be assessing the concurrent evidence of the IQ scores
Reliability
The extent to which a test yields consistent results. Free from error and provides consistent results. Focuses only on degree of nonsystematic or random error in assessment.
Content Validity
The extent to which the measurement accurately samples the content domain. Extent to which a test-taker's responses to a given test reflect that test-taker's knowledge of the content area
Construct Validity
The extent to which the test is an accurate measure of a particular construct or variable. Is the test measuring what we think it should be measuring?
Mode
The most frequently occurring score(s) in a distribution. In a frequency polygon, this score is the peak of the distribution. Useful to find the most common category when studying nominal variables
Measures of central tendency
These are intended to describe the most average or "typical" score in the distribution. Mean, Median, and Mode are most common.
Descriptive Statistics
These are used to explain the basic characteristics of study data, including describing the numbers and what they show. They help us to simplify large amounts of data by providing a summary that may enable comparisons across people or other units. A bridge between measurement and understanding.
Skewed Distribution
When a variable does not fall within a normal distribution. An asymmetrical but generally bell-shaped distribution; its mode, or most frequent response, lies off to one side
Can you have reliability without validity?
Yes, but you CANNOT have validity without reliability
Criterion Validity; Predictive Evidence
The criterion scores are obtained at a later time than the administration of the test. Test scores are kept on record and compared with a criterion measure obtained sometime in the future. Ex. A high school math test yields predictive evidence of validity if it can predict some aspect of college performance
Ordinal Scale of measurement
Divides observations into categories and provides measurement by order and rank. Does NOT take intervals into account; a higher number is a higher value, but intervals between numbers are not necessarily equal. Allows calculation of the mode and median but not the mean. Ex: Likert Scale
Random errors
error that results from chance alone and influences measures arbitrarily.
Evidence of internal structure
evidence that the items on the measure relate to one another (one factor or multiple components of construct) in a way that reflects the theoretical basis of the construct Encapsulates how well the structure and relationship of the variables in the assessment correspond with the theoretical understanding of the construct
Inter-rater reliability
for tests with subjective responses: this assesses the consistency or agreement between two or more scorers when making independent ratings of a particular construct
Multimodal Distribution
frequency distribution with two or more high frequencies separated by a lower frequency; a bimodal distribution is the special case of two high frequencies
Z-Scores
indicates by how many standard deviations a score is above or below the mean. 0= mean 1= 1 standard deviation above the mean -1= 1 standard deviation below the mean
Level of measurement of a variable
is a classification that describes the nature of information contained within the numbers that are assigned to that particular variable.
Cronbach's alpha
An indicator of internal consistency reliability assessed by examining the average correlation of each item (question) in a measure with every other question.
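In practice alpha is usually computed from variances rather than by averaging correlations directly. A commonly used computational form is alpha = (k / (k - 1)) * (1 - Σ item variances / variance of total scores), where k is the number of items. The sketch below assumes that form; the function name and data are hypothetical.

```python
from statistics import variance


def cronbach_alpha(responses):
    """responses: one row per respondent, each row a list of item scores."""
    k = len(responses[0])
    # Sample variance of each item across respondents.
    item_vars = [variance([row[i] for row in responses]) for i in range(k)]
    # Sample variance of each respondent's total score.
    total_var = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical data: 4 respondents, 3 items on a 1-5 scale.
responses = [[2, 3, 3], [4, 4, 5], [1, 2, 2], [5, 4, 4]]
print(cronbach_alpha(responses))
```

When items rise and fall together across respondents, the total-score variance is large relative to the summed item variances and alpha approaches 1.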
Evidence of Response processes
Analysis of responses to individual test items, rationale of responses, performance strategies, even possibly eye movement and response times
Stanine Averages
Above average = 9,8,7 Average = 6,5,4 Below Average = 3,2,1
Parallel Form Reliability
Administer similar, but not identical, tests (Form A and Form B) and correlate the scores. Most expensive and demanding test for reliability
test-retest reliability
Administer the same test twice and correlate the scores. The closer the scores, the greater the reliability
Frequency distribution
An arrangement of data into a table that indicates how often a particular score or observation occurs. The most common way to describe a single variable and display the chaos of numbers in an organized manner.
Evidence of consequences of testing
An examination of the possible consequences of using a particular assessment. Ensures that no harm comes from using the assessment