Psych 461 Test 1
What are the three parts that make up a test score?
-Observed Score: What you actually get on the test.
-True Score: The "true" amount of what is being measured.
-Error Score: Error present in measurement, accounting for the difference between the observed score and the true score.
What is the standard error of measurement (SEM)?
-The degree to which a person's observed score fluctuates as a result of errors of measurement.
-Do not confuse with the standard error of estimate (SEE) from chapter 4: the SEM relates to the reliability of measurement, whereas the SEE concerns the validity of an estimate.
What slides say: An index of how much, on average, an individual's score might vary if they were to repeatedly take a test.
● Reliable tests have a small SEM and little variation with repeat testing
● The SEM has an inverse relationship with reliability: the more reliable the test, the less error there is on average
-The SEM is the standard deviation of the hypothetical distribution of test scores we would get for an individual who repeatedly took the test
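To make the arithmetic concrete, here is a minimal Python sketch using the standard classical-test-theory formula SEM = SD * sqrt(1 - r); the function name is just illustrative, and the example numbers come from the confidence-interval cards later in this guide.

```python
import math

def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SEM = SD * sqrt(1 - r): the average amount a person's observed
    score is expected to vary around their true score due to error."""
    return sd * math.sqrt(1 - reliability)

# Values from the confidence-interval examples in this guide
print(round(standard_error_of_measurement(10, 0.91), 2))  # 3.0
print(round(standard_error_of_measurement(10, 0.75), 2))  # 5.0
```

Note how the less reliable test (r = .75) produces the larger SEM, illustrating the inverse relationship described above.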
Which measure of central tendency is strongly influenced by extreme scores?
-The mean
-The mean is the only measure of central tendency that depends on all the values, as it is derived from the sum of the values divided by the number of observations.
What is validity?
-The property of an assessment tool that indicates that the tool does what it says it does. -Does a test measure what it claims to measure?
What are the properties of a normal distribution?
-Bell shaped
-Symmetric around its mean
-About 95% of its probability falls between μ - 1.96 SD and μ + 1.96 SD
-For the standard normal distribution, the mean = 0 and the variance (SD²) = 1
What slides say
-The total area under the curve represents the total number of scores in the distribution
-The curve is symmetrical
-The curve extends ad infinitum
-Be concerned with three SDs in each direction from the midpoint of the curve
-The percentage of cases (scores) falling between any two points is known
We will use approximations for percentages of scores falling between any two points on the curve:
-68% within +/- 1 SD
-95% within +/- 2 SD
-99% within +/- 3 SD
Less than 1% of cases fall outside the area +/- 3 SD from the mean
What are the benefits in using percentiles? What are the drawbacks?
Benefits
-Ease of calculation and interpretation
● Computed based on one's relative position to others in the group
● Thought to be the most readily understood of all raw-score transformations
● Used commonly in most disciplines (e.g., education, psychology, medicine)
Drawbacks
-Ordinally ranked - not an interval scale of measurement
-Some confusion between percentages and percentiles
-Transformation to percentiles distorts the underlying measurement scale such that there are not equal intervals between percentiles that have equal mathematical differences
How can we calculate the various standardized scores once we have the Z scores?
Calculate the z score first:
Z = (X - Xbar) / s
X = raw score you want to transform
Xbar = mean of raw scores in the distribution
s = standard deviation of raw scores in the distribution
To transform scores to distributions having different means and standard deviations, use the generic formula:
X' = (Z)(SD) + M
X' = score for the distribution you are transforming to
Z = the z score for the raw score you want to transform
SD = standard deviation of the distribution you are transforming to
M = mean of the distribution you are transforming to
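A small Python sketch of the two-step conversion described above (function names are illustrative); the mean/SD pairs for T, IQ, and CEEB scores are the ones listed on the mean/SD card later in this guide.

```python
def z_score(x: float, mean: float, sd: float) -> float:
    """Z = (X - Xbar) / s for the raw-score distribution."""
    return (x - mean) / sd

def transform(z: float, new_sd: float, new_mean: float) -> float:
    """X' = (Z)(SD) + M for the distribution you are transforming to."""
    return z * new_sd + new_mean

z = z_score(60, 50, 10)        # 1.0, as in the worked example below
print(transform(z, 10, 50))    # T score: 60.0
print(transform(z, 15, 100))   # IQ score: 115.0
print(transform(z, 100, 500))  # CEEB score: 600.0
```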
What is an acceptable level for d? a marginal level? at what level might we discard an item?
d = .30 or higher: item is acceptable (good discriminator)
d = .20-.29: item is marginal; consider revision
d = .10-.19: minimal discrimination; revise or discard
d = .09 or lower: discard
What does validity tell us about a test?
Does a test measure what it claims to measure?
Test validity
-Validity addresses the accuracy or usefulness of test results
-A test is valid to the extent that inferences made from it are appropriate, meaningful, and useful.
-Validity is the most important aspect of a test.
Validity tells us...
• who the test works for
• what conclusions can be drawn from the results
• under what conditions the test can be used and when it cannot
• the limitations to use & interpretation
What is the standard error of the estimate? How is it related to predictive validity?
Error in Prediction
• The higher the correlation between test and criterion, the less error there is in the prediction of the criterion.
• Standard Error of Estimate - the margin of error expected in the predicted criterion score; it tells us how accurately test scores can predict performance on the criterion
● Its formula has the same form as the SEM
Predictive Validity
• The extent to which scores on a test are able to predict a criterion that is obtained at a later date.
i.e., IQ tests have predictive validity because they correlate with future achievement in school
i.e., the SAT has predictive validity because scores correlate with first-year college GPA
• Predictive validity is crucial for tests that determine who will succeed/fail; it helps to determine cut-off scores
Validity
The property of an assessment tool that indicates that the tool does what it says it does
Ex: Does a test measure what it claims to measure?
Example of transforming a z score to a T score
Using the generic formula X' = (Z)(SD) + M:
X' = (Z)(10) + 50
(The mean of 50 and SD of 10 come from the mean/SD chart for T scores - use the values for the scale you want to transform to.)
If Z = 1.5, the equivalent T score is X' = 1.5(10) + 50 = 65
If Z = -.5, the equivalent T score is X' = -.5(10) + 50 = 45
test validity
Validity addresses the accuracy or usefulness of test results
A test is valid to the extent that inferences made from it are appropriate, meaningful, and useful
Validity is the most important aspect of a test
Validity tells us
Who the test works for What conclusions can be drawn from the results Under what conditions the test can be used and when it cannot The limitations to use and interpretation
What are the Mean & SD of z scores, T scores, IQ scores, & CEEB scores?
Z: Mean = 0, SD = 1
T: Mean = 50, SD = 10; Example: personality tests (MMPI-2, CPI-R)
IQ: Mean = 100, SD = 15; Example: aptitude/intelligence tests (Wechsler scales)
CEEB: Mean = 500, SD = 100; Example: educational testing (SAT, GRE)
What is a z score? How is one calculated?
-The Z-score, or standard score, indicates how far a given value lies from the mean in standard deviation units (i.e., how many standard deviations above or below the mean a data point falls).
-The standard deviation is essentially a reflection of the amount of variability within a given data set.
-Z scores are easy to understand as the distance from the mean of a set of scores.
-No more fooling around with raw scores - Z scores tell you exactly where a test score lies and what its relative relationship is to the entire set of scores.
-Negative numbers and decimals can make the results difficult to interpret
What slides say
Z score = a standard score
-Corresponds to standard deviation units
-Mean = 0, SD = 1
-The distribution of Z scores is called the standard normal distribution
Examples of z score If x=60, xbar=50, and s =10
Z = (X - Xbar) / s
Z = (60 - 50) / 10 = 1
-A raw score of 60 in this distribution = a Z score of 1
-You need to find the z score first to transform it into something else (or it can be given)
What is the optimal difficulty for multiple choice items with 4 response options?
e.g., 4 options on a multiple-choice test (g = .25):
(1.0 + .25) / 2 = .625 ≈ .63
-For MC questions with four response options, the optimal difficulty is always about .63
What is meant by desirable procedures in test administration?
i.e., the conditions during testing
What is meant by standardized procedures in test administration?
i.e., the rules for using the test
And If You Can't Establish Reliability... Then What?
● Remember, lower the error to increase the reliability ● How can we lower the error? -Make sure instructions are standardized -Increase number of test items. -Delete unclear items. -Moderate easiness and difficulty of test. -Minimize the effects of external events.
What are the five main purposes of tests in psychology?
● Selection ● Placement ● Diagnosis ● Hypothesis Testing ● Classify
How does one assess parallel (or alternate) form reliability?
● Separate/equivalent forms of the test are developed. ● Given at the same time (immediate) or at separate times (delayed) ● Assumption is that scores should be the same/similar ● Scores on the two forms are correlated to determine the reliability. ● Error in immediate condition has to be due to sample of questions. ● In the delayed condition error is due to administration time and sample of questions. Ex The two forms of the regular guy test are equivalent to each other and have shown parallel forms of reliability
Spearman-Brown formula
● Split-half correlations estimate reliability for a test that is only half as long as the one actually taken
● Longer tests are generally more reliable
● The Spearman-Brown formula is used to make a statistical correction
● It elevates the correlation to approximately a parallel-forms reliability
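The card above does not print the formula itself, so here is a hedged sketch using the standard two-half Spearman-Brown correction, r_full = 2r / (1 + r); the example value of .70 is made up for illustration.

```python
def spearman_brown(split_half_r: float) -> float:
    """Corrects a split-half correlation up to the reliability of the
    full-length test: r_full = (2 * r_half) / (1 + r_half)."""
    return (2 * split_half_r) / (1 + split_half_r)

# A split-half correlation of .70 corrects up to roughly .82
print(round(spearman_brown(0.70), 2))  # 0.82
```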
What are norms? What are the important issues to consider when developing test norms?
"Norms" = a set of scores that represents a collection of individual performances and is developed by administering a test to a large group of test takers. -Norms are developed by selecting a sample from the "target population" (group for whom the test is intended) ● For a norm-referenced test, having strong "norms" is everything!! ● If the norms don't represent the population for whom the test was intended, they are useless and all comparisons made will be useless. Norm group must represent the people who will be taking the test in terms of important characteristics: Age Ethnic group Grade SES Sex Geographic region Education Any other characteristic that could affect test scores -Year the data were collected -Norms can become outdated, especially for tests of cognitive abilities -Most standardized tests are periodically re-normed -Number of people in standardization sample Number of ppl needed to form an adequate norm group -Consider that larger samples are more likely to yield a normal distribution of scores -Group tests (e.g. SAT, GRE) are often standardized on as many as 100,000 people -Individual tests (e.g. WISC, WAIS) typically have 2,000-4,000 people in the normative sample
What is construct validity? What does it involve?
"The most interesting, ambitious, and difficult of all the validities to establish." •Construct: A group of interrelated variables; a theoretical, intangible quality of a trait on which individuals differ i.e., aggression, intelligence, creativity •Construct Validity: Involves the theoretical meaning of test scores. -Are test scores consistent with what we expect based on our understanding of the construct? -There is no single way to validate its existence - we need multiple sources of evidence • This is the most comprehensive type of validity -It subsumes content and criterion-related validity -It includes reliability -Essentially, construct validity is proven via any means that one can show the test works the way we expect it to work! Must develop evidence of construct validity through a program of research. It takes many studies and many methods to amass enough evidence that the test works that way it is supposed to work.
What are some of the important ethical issues that are relevant to testing?
(1) Best interest of the client *Consider why doing the testing *Ask: if it serves a constructive purpose? *Select tests with care - consider how the results will affect the person (2) Expertise *You only give tests you are qualified to give and interpret & in the context you are giving them (3) Confidentiality & Duty to Warn • Must keep information protected • Unless danger to self or others, or reasonable suspicion of child or elder abuse (4) Informed Consent • Taker must be aware of reasons for testing, tests used, how results used, consequences of results • Three elements: disclosure, competency, voluntariness (5) Can't use obsolete tests or test results • Varies by test how long results are good (6) Responsible Report Writing (7) Communication of Test Results • "in a language the test taker can understand" (8) Consideration of Individual Differences • How do test results vary across groups • Balance considering culture and not stereotyping
What were the main events that are covered in the "5-minute history of testing?"
-2200 BC: Civil Service Testing in China
-Late 1800s: Galton's study of individual differences
-1905: First Binet-Simon Intelligence Test (goal: to identify which kids could or could not learn in a typical classroom environment)
-1916: Stanford-Binet Intelligence Test published in the U.S. (Terman)
-Early 1920s: Psychological Corporation (1st test publisher) formed by Cattell, Thorndike, Woodworth
-1937: Scholastic Aptitude Test (SAT) required for admission to Ivy League schools
-1945: Stanley Kaplan founds test prep service (still debated whether prep can improve test scores)
-1948: Educational Testing Service (ETS) opens to assess areas other than intelligence (SATs, GREs, TOEFL)
What percent of scores fall within 1 SD in a normal distribution? 2 SD? 3 SD?
-68% of scores fall within +/- 1 SD (Mean = 0.0, SD = 1.0)
-95% within +/- 2 SD
-99% within +/- 3 SD
-The first SD on each side of the mean accounts for 68% of scores
-The second SD on each side accounts for 27% more of the cases: 27% divided by 2 = 13.5% per side
-The outermost regions (beyond 2 SD) account for around 2% on each side
What is a percentile rank? How do you calculate a percentile rank?
-A point in a distribution of scores below which a given percentage of scores fall.
-Tells you how someone scored relative to others
-e.g., 70th percentile = scored better than 70% of other examinees
Percentile rankings can be determined for any score in the distribution (as long as you have access to all scores in the distribution)
● Rankings are only accurate if the distribution of scores is normal!
PR = (B / N)(100)
-B = number of scores in the distribution lower than the score being transformed
-N = total number of scores in the distribution
Calculating percentile ranks
-Find the percentile rank for a score of 84 in this distribution: 64 91 83 72 94 72 66 84 87 76
-There are 6 scores lower than 84, so:
PR = (6 / 10)(100) = 60th percentile
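A minimal Python version of the PR = (B / N)(100) calculation, reusing the ten scores from the worked example above (the function name is illustrative):

```python
def percentile_rank(score: float, scores: list) -> float:
    """PR = (B / N) * 100, where B = number of scores below the given score."""
    below = sum(1 for s in scores if s < score)
    return below / len(scores) * 100

distribution = [64, 91, 83, 72, 94, 72, 66, 84, 87, 76]
print(percentile_rank(84, distribution))  # 60.0 -> 60th percentile
```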
What is a standardized score? How are they an improvement on percentile scores?
-A z-score, or standard score, is used for standardizing scores on the same scale by dividing a score's deviation by the standard deviation in a data set. -The result is a standard score. -It measures the number of standard deviations that a given data point is from the mean. What slides say -Standard Score: One that is standardized based on a certain metric (Standard Deviation). ● Do NOT create the distortion we see in percentiles ● Used to make scores more directly interpretable ● Express a score's distance from the mean in standard deviation units ● Means and standard deviations of these scores have been arbitrarily chosen
How do we calculate confidence intervals? And what does a confidence interval tell us?
-Calculated using the Standard Error of Measurement: SEM = SD * sqrt(1 - r11)
Example: If sX = 10 and r11 = .91, then serr = 10 * sqrt(1 - .91) = 3.0
Interpretation: 68% of the time the examinee takes this test, their true score will fall within +/- 1 standard error of measurement, i.e., within +/- 3 points of the observed score
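A rough sketch tying the SEM to a confidence interval (assuming an observed score of 50, as in the examples later in this guide; the function name and rounding are illustrative):

```python
import math

def confidence_interval(observed: float, sd: float, reliability: float,
                        n_sems: int = 1) -> tuple:
    """CI = observed score +/- (n_sems * SEM), where SEM = SD * sqrt(1 - r).
    +/- 1 SEM ~ 68%, +/- 2 SEMs ~ 95%, +/- 3 SEMs ~ 99%."""
    sem = sd * math.sqrt(1 - reliability)
    return round(observed - n_sems * sem, 2), round(observed + n_sems * sem, 2)

print(confidence_interval(50, 10, 0.91, 1))  # (47.0, 53.0) -> 68% CI
print(confidence_interval(50, 10, 0.75, 2))  # (40.0, 60.0) -> 95% CI
```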
How do we decide whether or not test norms are good?
-How do you know if the test has good norms? -The test manual will describe the norm group, present tables, graphs, etc. to see who is in it and how they scored. But YOU must decide if the group is representative of the individuals for whom you intend to use the test! -Some tests are great for one group, but not for others! And if the norms are not representative - the test is no good to you.
Why do we not use raw scores on norm-referenced tests?
-In norm-referenced tests, raw scores are meaningless
-We don't care about how many they got right (that's a criterion-referenced test!)
-We care about how they did compared to others.
-So... we don't use the raw score, we create a new score that represents how they did in comparison to others who take the test!
What is a criterion-reference test?
-Interpret performance by comparing the score to a set standard
-Used to determine if a person meets a predetermined standard of performance
-How well has someone mastered some domain of content - did they meet the set standard?
-There is no ranking or comparison to others. i.e., most classroom exams are this type!
What is a norm-referenced test?
-Interpret performance by comparing score to others -Used to classify people from low to high on a continuum, often for placement decisions -Must compare obtained score to the scores of others - the norm group or standardization sample
What is reliability?
-Refers to consistency in test measurement
● Not all or nothing - more like a continuum, ranging from minimal to perfect
● A psychometric property of tests indicating the degree to which a test is consistent in measurement
● If you take a fairly reliable test twice in a short period of time you will have highly similar scores (scores that correlate with one another)
● A test is reliable if it produces consistent measurement under varying conditions, i.e., similar scores on repeat administrations
What is face validity? How is it different from other types of validity?
-Refers to the appearance of the appropriateness of the test from the test taker's perspective. -Addresses the question of whether the test content appears to measure what the test is measuring from the perspective of the test taker. ***Face validity is unrelated to whether the test is truly valid! ***
What is the relationship between reliability and the SEM?
-SEM has an inverse relationship with reliability. The more reliable the test the less error there is on average -Reliable tests have small SEM and little variation with repeat testing
What is a transformed score?
-So... we don't use the raw score, we create a new score that represents how they did in comparison to others who take the test!
-This is called a transformed score
What are examples of transformed scores?
-We use the Normal Distribution to convert a raw score into a transformed score on a norm-referenced test -Take someone's score - Look at how close that score is to the average score on the test (mean) while also taking into consideration how much scores tend to vary on the test (SD) Examples -Percentile ranks -Stanines -Z scores -T scores -Tell us how a given score compares to other scores in the distribution
Why do we like working with a normal distribution?
-easy to work with -mathematically defined -many statistical tests assume a normal distribution -it just keeps coming up
Summary of statistics review
-Use the normal distribution to interpret raw scores for a norm-referenced test
-Take someone's score
-See how their score compares to others
-Look at how close the score is to the average score (mean) while taking into consideration how much scores tend to vary on the test (SD)
-To make the comparison easier, we don't use the raw score - we use a transformed score
What it the optimal difficulty for true/false items?
.75, from (1.0 + .50) / 2, since the chance level (g) for true/false items is .50
If the midpoint of the distribution is at the 50th percentile what is the percentile ranking of a score of 1 sd below the mean
16th percentile (look at slide 30 in appendix power point)
If the midpoint of the distribution is at the 50th percentile what is the percentile ranking of a score of 2 sds below the mean
2nd
If the midpoint of the distribution is at the 50th percentile what is the percentile ranking of a score of 2 sd's above the mean
98th percentile (look at slide 29 in the appendix a power point)
How is content validity assessed
An educated judgment (made by experts)
Test Validation requires...
A lot of research
Some studies will be done by the test authors before publication
It is an ongoing process - data are accumulated over time, even after the test is published
Reliability and Validity
A test can be reliable without being valid, BUT a test cannot be valid unless it is also reliable
How can anything do what it is supposed to do (validity) if it cannot do that thing consistently (reliability)?
What is the definition of a test?
A tool, procedure, device, examination, investigation, assessment, or measure of an outcome (which is usually some kind of behavior).
What can you say about a z score = 1?
Above the mean or below the mean? Above ● How far above the mean in SD units? 1 SD ● Better than approximately 84% of the distribution
What is classical test theory? (Be able to explain what it is and how it relates to reliability!)
Classical test theory, also known as true score theory, assumes that each person has a true score, T, that would be obtained if there were no errors in measurement.
-A person's true score is defined as the expected score over an infinite number of independent administrations of the test.
● X = T + e
● X = observed score
● T = true score
● e = error score
Observed score = True score + Error score (error = trait error + method error)
Reliability is the degree to which observed scores are free of this error.
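The X = T + e idea can be illustrated with a toy simulation (not from the slides; the true score, error SD, and seed are arbitrary): repeated administrations give different observed scores, but their average converges on the true score.

```python
import random

random.seed(461)
true_score = 75  # the (unknowable) true amount of the trait
error_sd = 3     # combined trait + method error

# Each administration: observed score X = T + e
observed = [true_score + random.gauss(0, error_sd) for _ in range(5)]

print([round(x, 1) for x in observed])          # observed scores vary
print(round(sum(observed) / len(observed), 1))  # average is close to 75
```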
What is a confidence interval?
Confidence Intervals -Once we have a reliability coefficient we can calculate the standard deviation of error to estimate the range within which a person's score might fall if they were to take the test over and over again -i.e. an estimate of where the true score lies -Reliability estimates can help us determine confidence intervals for an obtained score indicating where the person's true score might fall on repeat administrations
What are the different types of validity
Content validity Criterion validity (or Criterion related validity) Concurrent validity Predictive Validity Construct Validity
What is convergent validity? discriminant validity? What do they tell us?
Convergent validity - when tests measuring the same traits are correlated (as expected) Discriminant validity - when tests measuring different traits are NOT correlated (as expected)
How can we summarize & pictorially represent a distribution of scores?
Frequency distribution (a set of numbers and their frequency of occurrence)
-Plotting the data
-Frequency histogram
-Frequency polygon
Important consideration
For all types of standard scores we are comparing a person's score to the scores of others. ● Who are these others? ● The norm group or standardization sample. ● "Norms" = a set of scores that represents a collection of individual performances and is developed by administering a test to a large group of test takers.
What are the 'goods and bads' of multiple-choice items?
Good -They can be used to measure learning outcomes at almost any level -They are easy to understand (if well written, that is) -They de-emphasize writing skills -They minimize guessing -they are easy to score -they can be easily analyzed for their effectiveness Bad -They take a long time to write -Good ones are difficult to write -they limit creativity -they may have more than one correct answer
What are the 'goods and bads' of true/false items?
Good
-They are convenient to write
-They are easy to score
Bad
-The truth is never quite what it seems to be
-The items emphasize memorization
-The right answer is relatively easy to guess
In decision theory, what is considered a hit? miss? false positive? false negative?
Hit= correct prediction, predicted will succeed and did succeed Miss= incorrect prediction, false positive or false negative False positive= predicted will succeed, but failed False negative= predicted will fail, but succeeded
What is a validity coefficient?
How do we assess criterion-related validity? By calculating a correlation coefficient between test scores and criterion variable scores That correlation is called a validity coefficient
What are the primary steps in creating a test?
Idea/trait/characteristic being tested
Select the best method of testing
Develop test items
Pilot testing of items
Evaluate/revise items
Final determination of items
Development of instructions
Establishing reliability and validity
Develop norms
Publish the test
Confidence interval example changing the reliability
If sX = 10 and r11 = .75, then serr = 10 * sqrt(1 - .75) = 5.0
● Again, given an observed score of 50:
● 68% likelihood the true score is between 45 & 55
● 95% likelihood the true score is between 40 & 60
● 99% likelihood the true score is between 35 & 65
Note the inverse relation between reliability and standard error
What is the mean (M)? median? mode?
Mean = the average (add all the numbers together and then divide by the number of numbers there are): ΣX / n
Mode = the most frequent number
Median = line the numbers up in order and take the middle number (or the average of the two middle numbers)
Test Validation statistic
NOT described with a single statistic, but rather summarized in terms of the evidence being strong, moderate, weak, nonexistent, etc.
The summary is based on all the research
What are the four levels of measurement?
Nominal - Categorical and discrete, no logical order, qualitative in nature, just assigning a label (i.e., 0 = true, 1 = false). Ordinal - numbers refer to rank order, can make < or > comparison, but distance between ranks is unknown (i.e., 1st , 2nd, 3rd) Interval - numbers refer to both order & rank, with equal intervals (i.e., aptitude test score) Ratio - same as interval but with an absolute zero that indicates the absence of the trait (i.e., weight) ■ Some stats can't be used with lower order data - Nominal/Ordinal - simple stats - Interval/Ratio - more complex stats ■ In Psychology - most data are ordinal or interval
How does test interpretation work for norm-referenced tests & criterion-referenced tests?
Norm-referenced tests: -Used to classify people from low to high on a continuum, often for placement decisions -Must compare obtained score to the scores of others - the norm group or standardization sample Criterion-referenced tests: -Used to determine if a person meets a predetermined standard or performance -How well has someone mastered some domain of content - did they meet the set standard? -There is no ranking or comparison to others. i.e., most classroom exams are this type!
How is a table of specifications used in test construction?
A table of specifications gives you an organized and planned approach to item development: it lays out the content areas the test should cover and how many items are devoted to each, so the items sample the intended domain.
What is a positively skewed distribution? negatively skewed?
Positively skewed: too many hard items?
-Tail to the right - a type of distribution in which most values are clustered around the left tail of the distribution while the right tail is longer.
Negatively skewed: too many easy items?
-Tail to the left
If the distribution is highly skewed:
-mean = bad (pulled toward the tail by extreme scores)
-median = good
Which levels of measurement are most commonly used in psychology?
Ratio then interval then ordinal and nominal is last
Content Validity
The property of a test such that the test items sample the universe of items for which the test is designed
Ex: Are items on the test a good representative sample of the domain we are measuring?
Ex 2: Classroom achievement test (like ours) - do the items sufficiently address the material students are expected to master?
What is the variance? standard deviation (SD)?
SD
-How close or how far scores are from each other on average; the average distance from the mean
-The larger the SD, the larger the average distance each data point is from the mean of the distribution
How to get the SD through the math:
-Get the mean of all the scores
-Subtract the mean from each score (X - Xbar)
-Square each of the answers you got in the previous step
-Add all of them together to get a sum
-Divide the sum you got in the previous step by n - 1
-Find the square root of all that
(You can do the first half in chart form; the division by n - 1 doesn't have to be done in the chart.)
Variance
-A measure of variability
-Basically the same steps as finding the SD, except you do not take the square root - skip the last step
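The same steps in a short Python sketch (function names illustrative), using the ten scores from the percentile-rank example earlier in this guide:

```python
def variance(scores):
    """Sum of squared deviations from the mean, divided by n - 1."""
    mean = sum(scores) / len(scores)
    squared_devs = [(x - mean) ** 2 for x in scores]
    return sum(squared_devs) / (len(scores) - 1)

def standard_deviation(scores):
    """The square root of the variance (the 'last step' the variance skips)."""
    return variance(scores) ** 0.5

data = [64, 91, 83, 72, 94, 72, 66, 84, 87, 76]
print(round(variance(data), 2))            # 108.32
print(round(standard_deviation(data), 2))  # 10.41
```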
What is the difference between split-half reliability and a coefficient alpha?
Split half reliability ● Treats a single test as if it consists of two equivalent halves ● Assumes scores on the two halves should be the same ● Correlate the two halves of the test ● Any differences are due to the sample of questions (error) ● Often used along with test-retest reliability Alpha Formulas ● Coefficient Alpha (Cronbach's Alpha) -Provides the mean reliability for all possible split-half combinations Split-half on steroids!!!! -This is a MUCH stronger estimate of internal consistency than split-half reliability and you are much more likely to see this nowadays.
What is an expectancy table? How does it relate to predictive validity?
Table of data with the number of scores, and a cut off to select who will succeed and who will fail Internet An expectancy table is a two-way table showing the relationship between two tests. Helmstadter (1964) notes that an expectancy table provides the probabilities indicating that students achieving a given test score will behave in a certain way in some second situation
What are most tests primarily used to do?
Tests are primarily used to make decisions or judgments about people. ● Selection ● Placement ● Diagnosis ● Hypothesis Testing ● Classify
If half the scores fall above the mean and half below approximately what percentage of scores fall within one sd above the mean?
The answer is 84%, so we can say the percentile ranking of a score 1 SD above the mean = 84th
Criterion-Related Validity
The extent to which the test correlates with non-test behavior(s), called criterion variables Criterion - can be anything you would expect the test to relate to - BUT must be reliable, appropriate, and free of contamination from the test itself. CRV exists when a test can estimate an examinee's performance on some outcome measure.
What are the differences between concurrent & predictive validity?
These are types of criterion validity.
• Concurrent validity - the test is correlated with a criterion measure that is available at the time of testing.
-i.e., achievement test scores correlate with current grades in a class (must be the same content area as the test).
• Predictive validity - the test is correlated with a criterion that becomes available in the future.
-i.e., the SAT has predictive validity because scores correlate with first-year college GPA.
-Good for tests that determine who will succeed/fail; helps to determine cut-off scores.
What is the relationship between reliability and validity? Can a test be valid if it is not reliable?
• A test can be reliable without being valid, BUT a test cannot be valid unless it is also reliable. "How can anything do what it is supposed to do (validity) if it cannot do that thing consistently (reliability)?"
Types of Criterion-Related Validity
• Concurrent validity - test is correlated with a criterion measure that is available at the time of testing • Predictive validity - test is correlated to a criterion that becomes available in the future
What are the three primary types of validity?
• Content Validity
• Criterion Validity (or Criterion-Related Validity) - includes Concurrent Validity and Predictive Validity
• Construct Validity
What is content validity and how is it assessed?
• Content Validity: The property of a test such that the test items sample the universe of items for which the test is designed.
-Are items on the test a good representative sample of the domain we are measuring?
• Example: Classroom achievement test (like ours) - do the items sufficiently address the material students are expected to master?
How is content validity assessed?
-By experts
• A matter of educated judgment
-No statistic expresses this concept directly, although you can compute the degree of agreement among experts regarding the validity of the selected questions
-Content validity is particularly important for achievement or classroom tests and other tests that have a well-defined domain of content
What happens in test development process after item analysis is done?
● Once you've created, tested, and revised your items you can finalize your test. ● This involves all those things we've already learned about... Conduct research to determine reliability & validity Develop norms Create your testing materials and manual Publish your test! But don't rest... because you'll have to revise it, update your norms, etc., etc., etc.,
What is criterion-related validity? What is a criterion?
• Criterion Validity: Assesses whether a test reflects a set of abilities in a current or future setting as measured by some other test or evaluation tool. • The Criterion: Can be anything you would expect a test to relate to BUT must be reliable, appropriate, & free of contamination from the test itself • Criterion validity exists when a test can estimate an examinee's performance on some outcome measure. Examples of criterion validity Criterion validity is found when there is a correlation between: -Achievement test scores & grades in school -A scale measuring a person's performance on the job & supervisor's ratings
Summary of Construct Validity
• Established through many studies • Is ongoing after the test is published • Must use many methods to demonstrate • Not all methods will be appropriate for any given test • The more support shown, the stronger the validity of the test, the more confident we can be in the test results
What are some of the major guiding sources that tell us what we can and can't do in testing?
• Ethical Principles and rules guiding use of tests and standards for practice are offered by MANY groups. • Some examples: • APA Ethics Code (2017) • Code of Fair Testing Practices (2004) • State Laws • State Psychology Boards
And If You Can't Establish Validity... Then What?
• If you don't have the validity evidence you want, it's because your test is not doing what it should... • Content Validity: Redo questions on test to make sure they are consistent with what they should be, according to an expert. • Criterion Validity: Reexamine the nature of the items on the test and answer the question of how well you would expect these responses to these questions to relate to the criterion you selected. • Construct Validity: Take a close look at the theoretical rationale that underlies the test you developed and the items you created to reflect the rationale.
What is decision theory used for?
• Involves the use of test scores as a decision-making tool
• Relates to Predictive Validity: if we use tests to make decisions (e.g., to select among school or job applicants) then those tests must have strong PV
• We must conduct empirically based research to provide the foundation for decision-making.
-Predictive validation studies are used to determine who will succeed/fail on the criterion
-Data from predictive validation studies can be tabled in expectancy tables
-The minimum score on the test is selected based on information (expectancy tables) about who succeeded and who failed
What is a reliability coefficient?
● Reliability is estimated by correlating test scores from repeat administrations. The resulting correlation coefficient is called a reliability coefficient.
What are some ways we can demonstrate a test has construct validity?
• Many types of studies that can provide evidence: Developmental Change - if test measures something that changes with age, do test scores reflect this Theory-Consistent Group Differences - do people with different characteristics score differently (in a way we would expect) Theory Consistent Intervention Effects - do test scores change as expected based on an intervention • Factor-Analytic Studies - a way of analyzing test items to identify distinct and related factors in a test • Classification Accuracy - how well can a test classify people on the construct being measured -Sensitivity - accurately identify those with the trait -Specificity - accurately identify those without the trait • Multitrait-Multimethod Matrix - measure many traits to provide collection of correlations -Convergent validity - when tests measuring the same traits are correlated (as expected) -Discriminant validity - when tests measuring different traits are NOT correlated (as expected)
What is test validation?
• Test validation requires extensive research. • Some studies will be done by the test authors before publication, but... • Test validation is an ongoing process where data are accumulated over time, even after the test is published. • Studies after the test is published can: -Provide additional support for the test -Show conditions under which the test doesn't work -Demonstrate additional uses for the test (i.e., new populations it works with or new behaviors that can be measured) • Test validity is NOT described with a single statistic but rather summarized in terms of the evidence being strong, moderate, adequate, weak, or sometimes nonexistent • The summary is a judgement made based on all the research (validation studies) that have been done on that instrument.
What is the difference between testing and assessment?
• Testing v. Assessment
• One is specific, the other is broad: testing refers to administering and scoring a particular test, while assessment is the broader process of gathering and integrating information (which may include multiple tests and other methods) to make judgments about a person.
What is a typical validity coefficient for predictive validity?
• There is no set value for validity - it will depend on the test and the nature of decisions to be made from the test.
● Concurrent validity tends to be higher than predictive validity (there is no time between test and criterion)
● Predictive validity coefficients are rarely greater than r = .60 - .70
● Tests are still considered useful & acceptable for use with a far smaller validity coefficient, e.g., .30 - .50
Why does this matter for testing?
■ Everything we measure is on one of these scales. ■ The higher the level, the more precise the measurement. ■ As we begin to talk about different tests, take note on what type of scaling is used. ■ Once we have a scaling method, we must create and test our items!
What occurs in the first step of test construction? (i.e., defining the test/trait)
■ Test developer makes decisions about: - What the test will measure - Who the target population is - Type of items to be included - Standard scores to be used - Other norms to be reported - What it will be used for - Etc.
Select the Best Method of Testing
■ We use scores to represent how much or little of a trait a person has ■ Quantify this information ■ The scaling method provides the rules by which we assign numbers to the responses
Confidence Intervals: Summary
● A confidence interval is a range of scores (i.e., 45-55) ● It is calculated using the Standard Error of Measurement. ● The SEM uses the reliability coefficient to calculate a number that represents how much (on average) a person's score is likely to vary if they were to take the test again (i.e., how much the test score might change due to error!) ● So... we add and subtract the SEM to the observed score to get the Confidence Interval We add to get the high end of the interval. We subtract to get the low end of the interval.
What is acceptable reliability for tests used to make important decisions about individuals?
● Remember these values: -For research purposes, reliability needs to be at least .65 -For making important decisions about individuals, reliability needs to be at least .90
What is item discrimination (d)? How is it calculated? (FROM SLIDES, NOT BOOK!)
● A measure of how effectively an item discriminates between the high and low groups.
● Procedure: Divide examinees into groups based on test scores
Upper group (U) = 27% of examinees with the highest scores on the test
Lower group (L) = 27% of examinees with the lowest scores on the test
Formula: d = (Up - Lp) / U
Up = number of people in the upper group that passed the item
Lp = number of people in the lower group that passed the item
U = total number of people in the upper or lower group
U = 27% of N (always round to a whole number!)
**U = N(.27), then round BEFORE entering into the equation**
● Range of values: -1.00 to +1.00
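A minimal sketch of the slide formula d = (Up - Lp) / U (function name illustrative); the example numbers reproduce the worked items on the item discrimination example card.

```python
def item_discrimination(upper_passed: int, lower_passed: int, n_examinees: int) -> float:
    """d = (Up - Lp) / U, where U = 27% of N, rounded to a whole number
    before it goes into the equation."""
    u = round(n_examinees * 0.27)
    return (upper_passed - lower_passed) / u

# 100 test takers, so U = 27
print(round(item_discrimination(18, 9, 100), 2))   # 0.33 -> good discrimination
print(round(item_discrimination(13, 9, 100), 3))   # 0.148 -> marginal
```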
According to Classical Test Theory, what are the three parts of a test score?
● A test score consists of three parts: Observed Score: What you actually get on the test. True Score: the "true" amount of what is being measured Error Score: error present in measurement accounting for difference between observed score and true score
What different things do we test in psychology? (i.e., primary types of tests)
● Achievement Tests ● Personality Tests ● Aptitude Tests ● Ability or Intelligence Tests ● Performance ● Vocational or Career Tests ● Neuropsychological Tests
Confidence interval summary part 3
● But why these percentages?
-Because the SEM works just like a standard deviation, and 68% of scores in a distribution fall within 1 SD of the mean, 95% within 2 SDs, and 99% within 3 SDs.
-Therefore, 68% of the time a person takes a test, their score will fall within one SEM of their original (observed) test score (their score +/- 1 SEM).
-And 95% of the time a person takes a test, their score will fall within 2 SEMs of their observed score (score +/- 2 SEMs)
-And 99% of the time a person takes a test, their score will fall within 3 SEMs of their observed score (score +/- 3 SEMs)
Correlations & Reliability
● Can use a correlation to express reliability
● If a test is highly consistent, scores from repeat administrations will be highly correlated
● Reliability ranges from 0.00 to 1.00 - the closer to 1.00, the more reliable
● Most test scores are related to one another to some degree
Alpha Formulas
● Coefficient Alpha (Cronbach's Alpha) -Provides the mean reliability for all possible split-half combinations -Split-half on steroids!!!! -This is a MUCH stronger estimate of internal consistency than split-half reliability and you are much more likely to see this nowadays.
Item discrimination example
● Consider a test taken by 100 test takers
● For a given item: everyone in the upper group answers the item correctly and no one in the lower group answers the item correctly
● A perfect discriminator; rarely if ever achieved
d = (27 - 0) / 27 = 1.0
Another item
● No one in the upper group gets it right but everyone in the lower group does
● A perfect negative discriminator
● We NEVER want negative discrimination values (poor items); more of the examinees in the lower group got the item correct than examinees in the upper group
d = (0 - 27) / 27 = -1.0
Most items will have a value between zero and one
● e.g., another item with 100 test takers
● 18 in the upper group get the item correct and 9 in the lower group get it correct
● Good discrimination
d = (18 - 9) / 27 = .33
● 13 in the upper group get it right and 9 people in the lower group get it right
● The item shows marginal discrimination
d = (13 - 9) / 27 = .148
Error Score
● Each time you take a test you get a single score that could vary if you take the test again. ● If a single individual takes the same test over and over again and gets varying scores, what makes the scores different from one another? Error ● Reliability tells us the extent to which a test is free of error
What is item analysis?
● Evaluates the quality of the test at the item level
● Always done post hoc (after the test is given)
● Various statistical techniques that help determine whether test items should be kept, tossed, or revised
Types of Item Analyses
● There are MANY techniques one can use to analyze test items.
● We will learn two:
Item-difficulty index (p) **Your book uses D for this index.
Item-discrimination index (d)
IMPORTANT: Your book uses different formulas from the ones in these slides. Please use the ones in these slides!!!
What is considered acceptable reliability for research purposes?
● How high does the reliability coefficient have to be? -Or, how much error can we tolerate? ● It depends how the test is being used. ● According to your book .70 or above is acceptable (.80 or > is better) But... there's a bit more to it than that!
Example of item difficulty
● If 46 out of 78 test takers get an item correct:
p = 46 / 78 = .59
● Interpretation: 59% of test takers passed the item. It is moderately difficult.
What is a lower bound? How does it affect the way we interpret item difficulty?
● If optimal difficulty tells you where you are aiming, how do you know when an item is too far away from ideal?
● Use the lower bound of acceptable difficulty
● It is a function of k (options per item) and N (number of examinees)
● Formula: Lower bound = [1 + 1.645 * sqrt((k - 1) / N)] / k
e.g., k = 4, N = 150: Lower bound = .31
Interpretation of the lower-bound value
● e.g., lower bound = .31
● As long as items are at or above the lower bound they are not considered to be too difficult.
● Items passed by fewer than 31% of test takers should be considered difficult and examined for discrimination ability
Item difficulty: p, optimal difficulty, lower bound
● You want items that are closest to optimal difficulty & not below the lower bound
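A quick check of the lower-bound formula as reconstructed above against the slide example (k = 4, N = 150); the function name is illustrative.

```python
import math

def lower_bound(k: int, n: int) -> float:
    """Lower bound of acceptable difficulty:
    (1 + 1.645 * sqrt((k - 1) / N)) / k."""
    return (1 + 1.645 * math.sqrt((k - 1) / n)) / k

print(round(lower_bound(4, 150), 2))  # 0.31, matching the slide example
```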
What does the optimal difficulty tell us?
● Optimal difficulty tells you the item difficulty (p) you are aiming for.
● It can be evaluated using the formula (1.0 + g) / 2, where g = chance level (the probability of guessing the item correctly)
● If an item falls too far from optimal difficulty, use the lower bound of acceptable difficulty to decide whether it is too difficult
How does one assess test-retest reliability?
● Most direct, common type of reliability
● The test is given at two different times separated by days or weeks
● Assumption is that the true score has not changed
● Scores from the two administrations are correlated with one another.
● Any differences (error) in the scores are attributable to random error associated with the time of administration
-Risk of practice effect: the first test influences the second test.
Ex: The Bonzo test of identity formation for adolescents is reliable over time
Observed Score
● Observed Score - what a person actually scores on a test, typically on one administration ● The observed score is our best estimate of a true score
How do we consider item difficulty and discrimination together to make decisions about items?
● Start by interpreting each index separately according to the given criteria
p = close to optimal difficulty & at or above the lower bound
d = .30 acceptable (.20-.29 = marginal)
● Then evaluate the item difficulty and item discrimination indices in conjunction with one another
Generally it is desirable to have moderate difficulty (p) with good item discrimination
If p is low (near the lower bound), the item may be OK if it has good discrimination ability
● Discrimination level is constrained by difficulty level. Ideal p is .50; the further from that, the more limited the item will be in discriminating. So a very high or very low p will automatically pull down d
● Items that are both very difficult and that show little discrimination ability should be discarded or completely revised
● An item passed by nearly every test taker or missed by nearly every test taker will not allow for differentiation within the group
What is item difficulty (p)? How is it calculated? (FROM SLIDES, NOT BOOK!)
● Tells us how many people got that item correct.
● It is a percentage reflecting the number of responses in both the high and low groups that got the item correct.
● Expresses the percentage or proportion of examinees that answered an item correctly
p = np / N
np = number of test takers that passed the item
N = total number of test takers
● Range of values: 0 to 1.00
-If p = 0, no one got the item correct
-If p = 1.0, everyone got the item correct
● What is an ideal level of difficulty? .5 is generally ideal
● Must adjust this for true/false or multiple-choice items to account for guessing
Use the optimal difficulty formula, (1.0 + g) / 2 (this is different from the formula used to correct for guessing!!)
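A short sketch of both formulas on this card (function names illustrative), reproducing the worked values from the difficulty examples earlier in this guide:

```python
def item_difficulty(num_passed: int, num_examinees: int) -> float:
    """p = np / N: proportion of examinees who answered the item correctly."""
    return num_passed / num_examinees

def optimal_difficulty(chance_level: float) -> float:
    """(1.0 + g) / 2, where g is the chance level of guessing correctly."""
    return (1.0 + chance_level) / 2

print(round(item_difficulty(46, 78), 2))   # 0.59 -> moderately difficult
print(round(optimal_difficulty(0.25), 3))  # 0.625, about .63 for 4-option MC
print(optimal_difficulty(0.50))            # 0.75 for true/false items
```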
What are the four main types of reliability?
● Test-Retest Reliability ● Parallel Forms Reliability ● Reliability as Internal Consistency Split-Half Reliability Spearman-Brown formula Cronbach's Alpha (α) ● Interrater reliability
What are the two types of error?
● Trait Error: Sources of error residing within the individual taking the test (didn't study enough, feeling sick, etc.)
● Method Error: Sources of error that reside in the testing situation (poor test instructions, room too loud/warm/cold, etc.)
True Score
● True Score - the true amount of the trait being assessed. ● A person's "true" score on any test is defined as the average score a person would obtain on a test taken an infinite number of times. ● Obviously - true scores can not be known or directly measured. ● True scores can only be estimated
How does one assess internal consistency reliability?
● Used to estimate the reliability of a single test given on a single occasion ● Tests are internally consistent when items inter-correlate ● If two halves from one test (given in one administration) correlate with each other, then scores on the whole test should correlate across two administrations Ex All the items on the SMART test of creativity assess the same construct
How does one assess interrater reliability?
● Used when we want to know how much two raters agree on their judgements of an outcome. ● Shouldn't be used alone, but is good to know when scoring error is possible ● What to do if IRR is low? -Provide guidelines to increase consistency; with good criteria to help guide scoring -Even subjectively scored items can be reliably scored - if they have good guidance to help in scoring The interrater reliability for the best dressed football player judging was .91, indicating a high degree of agreement among judges
Confidence Intervals: Summary Part 2
● We add/subtract the SEM 1 time to get the 68% CI
Observed score = 50, SEM = 5, so CI = 45 - 55
● We add/subtract the SEM 2 times to get the 95% CI
Observed score = 50, SEM = 5, so CI = 40 - 60
● We add/subtract the SEM 3 times to get the 99% CI
Observed score = 50, SEM = 5, so CI = 35 - 65
● We interpret a CI like this: There is a 95% chance that, if this person were to take this test again, they would obtain a score that falls between 40 and 60.
What is a correlation? What is the value of a strong v. weak correlation? Positive v. negative?
● What is a correlation? ● Degree of linear relation between two variables ● Ranges from -1.00 to +1.00 +1.00 - strong positive correlation -1.00 - strong negative correlation 0.00 - no correlation, no relation between variables