Psychometrics
steps to developing a measure
1. Overall Goal & Pre-Planning: aim & purpose of testing? what construct will be measured? test format? administration modality? approval?
2. Content Definition: operationalization
3. Test Specifications: test & item format, total number of test items / test length, whether items contain visual stimuli, how test scores will be interpreted, time limits
4. Item Development
5. Test Design & Assembly
6. Test Production
7. Test Administration
8. Scoring Responses: item difficulty, discriminating power, item bias
9. Establishing Passing Scores: reliability & validity, establishing norms/cut-off scores
10. Reporting Results
11. Item Banking
12. Test Technical Report
aspects of content validity
1. construct under-representation - the test does not capture important components of the construct 2. construct-irrelevant variance - when test scores are influenced by things other than the construct the test is supposed to measure
9. establishing passing scores
1. establishing reliability and validity (the attribute of consistency in measurement; the extent to which a test measures what it claims to measure) 2. establishing norms, performance standards, or cut-off scores (depends on whether or not test is norm-referenced or criterion-referenced)
how to improve reliability
1. item analysis - item-total correlations, discrimination index, exclude poor items 2. use identical instructions 3. eliminate questions that evoke inconsistent responses 4. cover entire range of the dimension 5. clear conceptualization 6. standardization 7. inter-rater training 8. use more precise measurement 9. use multiple indicators 10. pilot-testing
factors influencing reliability
1. number of items in scale 2. variability of the sample - better estimate of reliability with sample representative of wider population 3. extraneous variables - testing situation, ambiguous/misleading items, unstandardized procedures, perceived demand effects, etc
factors affecting validity
1. reliability: any form of measurement error reduces validity 2. social diversity: tests may not be equally valid for different social/cultural groups 3. variability: validity coefficients depend on the variability of the sample - a restricted range of scores lowers the correlation between test and criterion
types of reliability
1. test-retest reliability - give someone a test at one point in time, then give them the same test again later 2. parallel forms reliability - two different forms of the same test 3. internal consistency - do the different items within one test all measure the same thing, to the same extent? (split-half reliability - test is split in half, each half scored separately, total scores for each half correlated; kuder-richardson 20 reliability; coefficient/cronbach's alpha - estimates the consistency of responses to different scale items) 4. inter-rater reliability - measures how consistently 2 or more raters/judges agree when rating something
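Split-half, KR-20, and coefficient alpha are all computed from item-level data. A minimal sketch of Cronbach's alpha in plain Python (the function name and sample responses are illustrative, not from the source):

```python
from statistics import variance

def cronbach_alpha(item_scores):
    """item_scores: one list per respondent, each holding that person's item scores."""
    k = len(item_scores[0])                       # number of items
    items = list(zip(*item_scores))               # transpose: one tuple per item
    sum_item_vars = sum(variance(col) for col in items)
    total_var = variance([sum(person) for person in item_scores])
    # alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
    return (k / (k - 1)) * (1 - sum_item_vars / total_var)

responses = [   # 4 hypothetical respondents x 3 items
    [2, 3, 2],
    [4, 4, 5],
    [3, 3, 3],
    [5, 4, 4],
]
print(round(cronbach_alpha(responses), 3))   # higher alpha = more internally consistent
```

Because alpha is a ratio of variances, using the sample variance consistently (as `statistics.variance` does) gives the same result as population variance.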
12. publishing and refinement
4 parts
classical test theory
4 underlying assumptions: 1. each person has a true score we could obtain if there was no measurement error 2. there is measurement error - but it is random 3. the true score of an individual doesn't change with repeated applications for the same test, even though their observed score does 4. the distribution of random errors and thus observed scores will be the same for all people
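The assumptions above can be illustrated with a toy simulation (all numbers hypothetical): each observed score is a fixed true score plus random error, so across many administrations the errors cancel out and the mean observed score converges on the true score.

```python
import random

random.seed(1)
true_score = 50     # assumption 1: an error-free true score exists
error_sd = 5        # assumption 2: measurement error is random
# assumption 3: the true score stays constant across repeated administrations
observed = [true_score + random.gauss(0, error_sd) for _ in range(10_000)]
mean_observed = sum(observed) / len(observed)
# random errors cancel out, so the mean observed score approaches the true score
print(round(mean_observed, 2))
```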
Beck Depression Inventory
BDI : "I feel my family would be better off if I were gone" "I would kill myself if I could"
history of testing: what tests
Binet-Simon (1905) did not measure any single faculty; assessed a child's general mental development; aim was classification, not measurement; brief and practical (less than one hour); measured practical judgement rather than wasting time on lower-level abilities; items were arranged by level of difficulty instead of content (30 items) - revised in 1908 with 58 items, and a third time in 1911; each age level had 5 tests, and the scale was extended into the adult range - 'mental level/mental age' Intelligence Quotient (IQ): mental age/chronological age x 100 Stanford-Binet Scale: Lewis Terman in 1916; adapted the Binet test for schools and adults; 5th version of the test still in use today; heavy reliance on language/vocabulary skills - coined the abbreviation "IQ"
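The ratio IQ formula above (mental age divided by chronological age, times 100) as a one-line function; the example ages are hypothetical:

```python
def ratio_iq(mental_age, chronological_age):
    # Terman's ratio IQ: mental age / chronological age x 100
    return mental_age / chronological_age * 100

print(ratio_iq(10, 8))   # an 8-year-old performing at a 10-year-old level: 125.0
```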
testing and racism, and testing in SA
Fick (1929) applied tests developed for and standardized on whites only to white & black children Used Army Alpha and Army Beta tests And used Fick Scale White children of course got the highest scores Originally proposed cultural, environmental, educational, and social reasons for discrepancy Later suggested that due to differences in 'innate abilities' between whites and blacks, and not external factors Use of tests gained momentum after WW II and 1948 when NP came to power Arose from need to identify occupational suitability of black people Whites tested using the Otis Mental Ability Test Developed in US, only had American norms
history of modern psychometrics; Galton
Francis Galton - father of modern psychometrics; obsessed with measurement (believed everything was measurable); was more interested in the problems of human evolution than psychology; used psychophysical procedures (brass instruments) from Wundt, adapted them to a series of single and quick sensorimotor measures, allowed him to collect data from thousands of subjects quickly; set up psychometric laboratory in London-tests involved physical and behavioral domains
sensory discrimination
Galton believed sensory discrimination defined intelligence - "The discriminative facility of idiots is curiously low; they hardly distinguish between heat and cold, and their sense of pain is so obtuse that some of the more idiotic seem hardly to know what it is. In their dull lives such pain as can be excited in them may literally be accepted with a welcome surprise"
Henry Goddard (1906)
Hired to do research on classification & education of feebleminded children Needed a diagnostic instrument to do so Translated Binet-Simon (1908) scale to make application to American children < 2 years = idiots 3 to 7 years = imbeciles 8 to 12 years = feebleminded Tested normal children with translated test Children with mental age four or more years behind chronological age were feebleminded; this constituted 3% (!) of his normal sample Needed to be segregated Used tests for immigrants on Ellis Island, found feeblemindedness in very high percentages of immigrants and suggested deportation or use as laborers
testing and the army
IQ tests were welcomed as a way to assess the intellect of immigrants and potential soldiers Robert Yerkes and the rise of group tests: a quick, effective, efficient way of evaluating the emotional and intellectual functioning of soldiers - Stanford-Binet adapted to a multiple-choice test in 1917 for use by the US army - ease of administration and scoring - does not need to be administered by trained professionals - lack of subjectivity Testees were assigned either: Army Alpha - English literates Army Beta - non-English speakers and non-literates
history of modern psychometrics; Cattell
James Cattell studied experimental psychology with Wundt and Galton - invented the term mental test; examined the relationship between academic grades, psycho-sensory tests, and the size of the brain and shape of the head - proposed a battery of 10 mental tests: Strength-of-hand squeeze Rate of hand movement through 50cm Two-point threshold for touch (2-point discrimination) Degree of pressure causing pain Weight differentiation Reaction time for sound Time for naming colours Line bisection of 50cm line Judgement of 10 seconds of time Number of letters remembered on one hearing
personality test examples
MMPI, Rorschach, TAT (provides ambiguous pictures about which the test-taker has to make up a story), 16PF Questionnaire
1. overall goal and pre-planning
Provides systematic framework for project: A priori decisions What is the aim & purpose of the testing? What construct will be measured? Test format? Administration modality? Decide on timeline Test security and quality control Who produces, publishes or prints the test?
summary of psychometrics history
Sir Francis Galton (1869) Believed 'genius' was hereditary; used psychophysical methods; believed sensory discrimination and RTs defined intelligence James McKeen Cattell (1890) Took 'Brass Instruments' method to U.S.A.; invented term 'mental test'; proposed battery of 10 tests Alfred Binet (1905; 1908; 1911) Developed Binet-Simon Intelligence Test to separate special needs children (three revisions) from ages 3-13; intelligence was 'good judgment' and changeable; 1908 revision invented term 'mental level/age'; 1911 revision extended into adult range Lewis Terman (1916) Coined abbreviation IQ; developed Stanford-Binet Scale for U.S. schools and adults
SEM
Standard Error of Measurement: we can estimate how much measurement error we have by working out how much, on average, an observed score on our test differs from the true score - the standard deviation of the distribution of observed scores around the true score; SEM = SD x sqrt(1 - reliability)
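A sketch of the SEM calculation using the standard formula SEM = SD x sqrt(1 - reliability); the SD and reliability values below are hypothetical example numbers:

```python
import math

def standard_error_of_measurement(sd, reliability):
    # the expected spread of observed scores around a person's true score:
    # more reliable test -> smaller SEM -> more precise scores
    return sd * math.sqrt(1 - reliability)

# e.g. an IQ-style scale with SD 15 and reliability .91
print(round(standard_error_of_measurement(15, 0.91), 2))   # -> 4.5
```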
intelligence test example
Stanford-Binet; Raven's Progressive Matrices (designs with a cut out portion), WAIS
critique of mental testing
Testing came under attack by advocates for underprivileged Required knowledge and cultural values rather than innate intelligence - biased towards White middle class Tests are culturally biased Correlation doesn't imply causation Stephen Jay Gould: The Mismeasure of Man (1981)
(3.4) ensure that all domains of the construct are tested
a grid structure (test blueprint); columns represent content areas (indicators), rows represent manifestations
Psychological test
a set of items that is designed to measure characteristics of human beings that pertain to behavior; the process of measuring psychology-related variables by means of devices or procedures designed to obtain a sample of behavior
absolute standard setting
a specific passing score; expert judgment of what amount of knowledge, skill or ability testee needs to demonstrate
ability vs personality tests
ability: measure skills in terms of speed, accuracy (power), or both; (1) achievement (previous learning) - refers to a person's past/previous learning - designed to measure person's past learning on accomplishment of a task (2) aptitude (potential for acquiring a specific skill/learning) - refers to a person's potential to learn a specified task under provision of training (3) intelligence (general mental abilities) - refers to a person's general potential to solve problems, adapt to changing circumstances, think abstractly, benefit from experience personality: measure typical behavior - traits, temperaments, dispositions, etc; designed to measure a person's individuality in terms of their unique traits and behavior; tests can help in predicting future behavior
(8.3) item bias
bias in testing: errors in measurement, associated with an individual's group membership item bias in tests: differences in group performance on a test, that result from the item content of a test
Domain Sampling Model
a central concept of classical test theory: if we construct a test on something, we can't ask all the possible questions, so we use only a few test items (a sample); using fewer test items can lead to the introduction of error
relative standard setting
compare testee's score to well-defined group of test-takers
normative
compares scores to a norm group
reliability
consistency in measurement, how much error we have in measurement; the precision with which the test score measures achievement; desired consistency or reproducibility of test scores - reliability is necessary for validity
item-total correlations
correlation between the score on an item and performance on the total measure; + correlation = good discriminating power; 0 correlation = no discriminating power; - correlation = poor discriminating power items having correlations less than .20 and negative correlations are not retained
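A minimal item-total correlation check in plain Python; the data are hypothetical and the .20 cut-off follows the rule of thumb stated above:

```python
def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

responses = [   # 4 hypothetical respondents x 3 dichotomous items (1 = correct)
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
    [0, 0, 0],
]
totals = [sum(person) for person in responses]
for i in range(len(responses[0])):
    item = [person[i] for person in responses]
    r_it = pearson(item, totals)
    keep = r_it >= 0.20            # retain only items with adequate discriminating power
    print(f"item {i + 1}: r = {r_it:.2f}, keep = {keep}")
```

In practice the item is often removed from the total before correlating (the "corrected" item-total correlation), since each item correlates with itself.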
theoretical (conceptual) definitions
defined in terms of a concept's relationship with other concepts; e.g. stress is defined as hardship, adversity, affliction, feeling of strain and pressure
operational definitions
defined in terms of how to observe/measure the concept/variable; operational definitions therefore "real life" definitions; concepts often have many indicators - related but distinct items that make up that concept
defining the content of the measure
definition of test content is ultimately a matter of human judgment; methods and procedures may be developed and used to minimize bias and increase objectivity of judgments
(12.4) revision and refinement
depends on content and how quickly it dates, depends on popularity of the measure
predictive validity
do scores on a test predict a future event successfully? - the test is the predictor, the future event is the criterion
11. item banking
essentially storing items for future use; security is NB; must be able to easily retrieve items
types of validity
face validity, content validity, criterion validity, construct validity
history of modern psychometrics; Wundt
father of psychology (1879); 'brass instruments' era of testing (early psychologists mistook sensory processes for intelligence) - Wundt's 'thought meter'; overly simplistic, but at least demonstrated empirical analysis that sought to explain individual differences
Alfred Binet
first to develop an intelligence test; Binet-Simon intelligence test (1905); wanted to separate children with intellectual disabilities from normal children in schools
(5.2) pretesting
give the test to a representative sample from the target population to gather info about it, content should be checked by independent, non-invested content experts
history of testing
civil service testing in china dates back to around 2200 BC - officials of the emperor were examined every 3 years; by the han dynasty, written tests required proficiency in civil law, military affairs, agriculture, revenue, and geography - good penmanship was important; preliminary, district, and peking final round exams
criterion validity (predictive; concurrent)
how well a test score estimates or predicts a criterion behavior or outcome, now or in the future - easy for ability tests but can be hard for personality and attitude tests
(3.1) decide on test/response format
how will participants demonstrate their skills? selected response (Likert scale/MCQ/Dichotomous) constructed response (essay question/fill in the blank) performance (block design task)
individual vs group tests
individual: designed to be administered to 1 person at a time, useful for collecting comprehensive info (e.g. some personality tests, some IQ tests), usually some degrees of subjectivity in the scoring, time cost labor intensive group: designed to be administered to more than 1 person at a time (mass testing), e.g. university tests (especially MCQs), scoring is usually more objective, economical and time-saving
subjective formats
interpretation of response depends on examiner judgment - more unstructured format type (projective tests)
ipsative
intra-individual comparisons
(3.3) decide on your test length
mainly depends on: amount of administration time available and the purpose of the measure; fewer items for a screening measure; tests that measure more than one aspect of a construct need more items than tests that measure only one aspect general rule of thumb: need at least 12 items compliance is lower when item numbers are higher; people get fatigued, bored, etc need at least 50% more items in the initial version than in the final version (you will discard bad items) do you need an equal number of items per area if you are measuring more than 1 area of a construct?
test
measurement devices or techniques used to quantify behavior - to aid in understanding and prediction of behaviors
covert behavior
mental, social, or physical action or practice that is not immediately observable: e.g. feelings of anxiety, depression
7. test administration
most public and visible aspect of testing; standardization of testing conditions is related to the quality of test administration: each test can be considered a mini-experiment that seeks to control extraneous variables and make conditions identical for all examinees - examinee fairness, clarity of instructions, time limits - otherwise examinee scores are difficult to interpret; security is a major concern for test administration; preferable to designate one highly experienced invigilator as 'chief invigilator' for each testing site
contributions of the han dynasty to modern testing
names of candidates to be concealed; independent assessments; conditions of examinations should be standardized
2. content definition
need to operationally define the construct you are measuring
norm vs criterion-referenced tests
norm: test score is judged against the distribution of scores obtained by the other test takers; this distribution is called the norms; compares an individual's results on the test with statistically representative sample; rank the performance of a student in a particular group criterion-referenced: compare each individual's performance to a criterion or expected level of performance; testee's score compared to objectively stated standard of performance on that test; establish standard/criterion and marks students against it; e.g. you need 50% to pass your exams
overt behavior
observable human behavior: eg, time needed to put 10 pegs in a peg board
(3.2) decide on an item format
open-ended items; no limitations on the test taker forced choice items; MCQs: true/false; likert-type items ipsative forced choice (e.g. do you prefer humanities or commerce?) sentence-completion items (e.g. the most important thing in life is ____) performance-based items; involve the manipulation of an apparatus; writing an essay; oral presentations
criterion-referenced
performance compared to predefined standard
examples of operationally defined content areas (test specifications): mobile advertising
permissions-based advertising: consumer gives company permission to send them ads incentive-based advertising: company offers rewards to consumers who agree to receive their ads general mobile advertising: ads sent to consumers without receiving their permission
5. test design and assembly
placement of correct items, check for errors, manual or computer assembly?, how are people going to answer the test?, does our test look aesthetically pleasing?
problems with classical test score theory
population dependent, test dependent, assumption of equal error measurement
6. test production
production, printing, publication of examination; making final all items, their order, visual stimuli; security of tests very NB; quality assurance procedures
Psychometrics
psychological measurement; the measurement of mental capacity, thought processes, aspects of personality, etc, especially by mathematical or statistical analysis of quantitative data; the science or study of this; (also) the construction and application of psychological tests
(12.3) publishing and marketing
publishing and marketing
test items
quantify behavior; a specific stimulus to which a person responds overtly
10. reporting results
reporting of examination results; all examinees have a right to accurate, timely, and useful reports of their performance, in understandable language, test manual should label appropriate uses of the scores and discourage misuse; standardizing administration; catastrophic error to publicly distribute incorrect score report
8. scoring criteria and item analysis
scoring criteria: develop a scoring key, when to drop respondents from sample item analysis: item difficulty, discriminating power, item bias
(12.2) submitting the measure for classification
should it be classified as a psychological measure or not?
construct validity (convergent; discriminant)
something we think exists, but is not directly observable or measurable - look at the relationship between the construct and other constructs - what observable behavior can we expect if a person has a high (or low) score on a test measuring this construct convergent validity: scores on a test have high correlations with other tests that measure similar constructs discriminant validity: scores on a test have low correlations with other tests that measure different constructs
personality test types
structured (objective): self report statements, choose between 2 or more alternative responses (true/false, yes/no) projective: unstructured; spontaneous response required of test-takers, provides an ambiguous stimulus; response requirements unclear - assumes that person's interpretation of an ambiguous stimulus will reflect their unique characteristics (e.g. Rorschach)
psychometrics committee of the professional board
test will be listed as a test in the development phase
individual test example
the Wechsler Adult Intelligence Scale (WAIS): block design: requires blocks to be assembled into a certain pattern; testee is given a number of different trials (test run) the Rorschach test (blotches)
operationalization
the act of making a fuzzy concept measurable
measurement
the assignment of numbers to objects and events according to some rules
content validity
the degree to which a test measures an intended content area; content validity is a non-statistical type of validity that involves the systematic examination of the test content to determine whether it covers a representative sample of the behavior domain to be measured; correspondence between items on a test and the content domain = number of relevant items / total number of items
(8.2) item discriminating power
the extent to which an item measures same aspect that the total test is measuring; measured by: discrimination index, item-total correlations
concurrent validity
the extent to which test scores can correctly identify the current state of individuals - if an individual scores above 75 on depression test, can we call them depressed?
(8.1) item difficulty
the proportion or percentage of individuals who answer the item correctly higher % of correct responses = easier the item; and vice-versa
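The difficulty index above is just a proportion; a one-line sketch with hypothetical responses (1 = correct, 0 = incorrect):

```python
def item_difficulty(item_responses):
    # proportion of test-takers answering the item correctly;
    # higher values mean an easier item
    return sum(item_responses) / len(item_responses)

print(item_difficulty([1, 1, 0, 1, 0]))   # -> 0.6, i.e. 60% answered correctly
```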
eugenics
the science of using controlled, selective breeding to improve the hereditary qualities of the human race and create superior individuals; concern that the lower classes were breeding too quickly, lowering the average standard of intelligence; ideological forerunner to Nazism Positive eugenics: encouraging reproduction of the genetically 'fit' Negative eugenics: aims to prevent those deemed physically, mentally, or morally unfit from procreating (sterilization, segregation) there was a movement for eugenics in the early 1900s
3. test specifications
the test blueprint; test specifications should describe: test (response) format item format total number of test items (test length) all the content areas of the construct(s) tested whether items or prompts will contain visual stimuli how test scores will be interpreted time limits
(12.1) compiling the test manual and technical report
these are usually very detailed; should contain: administration instructions, scoring, instructions on reporting results, norms, test analysis and evaluation
4. item development
use clear wording, appropriate vocabulary (not too fancy), avoid double negatives, unique stimuli if test is for children; don't make questions too obvious and don't ignore the influence of social desirability
discrimination index
uses the method of extreme groups: compares performance of top 25% to performance of bottom 25% (# in top 25% passing item/# in top 25%) - (# in bottom 25% passing item/# in bottom 25%)
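The extreme-groups formula above, sketched in plain Python; the 25% split and the sample data are illustrative:

```python
def discrimination_index(item_scores, total_scores, frac=0.25):
    """Pass rate on one item among top scorers minus pass rate among bottom scorers."""
    n = len(total_scores)
    k = max(1, int(n * frac))                            # size of each extreme group
    order = sorted(range(n), key=lambda i: total_scores[i])   # low -> high total score
    bottom, top = order[:k], order[-k:]
    p_top = sum(item_scores[i] for i in top) / k
    p_bottom = sum(item_scores[i] for i in bottom) / k
    return p_top - p_bottom

# hypothetical: 8 test-takers; this item is passed mostly by high total scorers
item = [0, 0, 0, 1, 0, 1, 1, 1]
totals = [10, 12, 15, 20, 25, 30, 35, 40]
print(discrimination_index(item, totals))   # -> 1.0: maximally discriminating item
```

A value near +1 means the item separates strong from weak test-takers well; values near 0 or below 0 flag items to revise or drop.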
relationship between validity and reliability
validity: does a scale measure the construct it set out to measure reliability: does a scale consistently measure the same thing
objective formats
very structured type of response format; person usually picks only one response
purpose
what are we going to use scores on this test for (e.g. use a participant's perceived stress score to develop a plan for cognitive behavior therapy to change perceptions of stress)
aim
what exactly are we going to measure (to measure of the degree to which situations in one's life are perceived as stressful)
face validity
when a test seems on the surface to measure what it is supposed to measure - a test can have good face validity, but not really be a valid test - test/scale must feel authentic to participants (if participants have doubts about the test, it will have an effect on test scores); least scientific of all measures of validity - just the researcher's opinion if items look valid or not
validity
whether or not a test measures what it intends to measure - is the test measuring what it claims to measure? aims of establishing validity: to be able to make accurate inferences from scores on a test - gives meaning to test scores indicates the usefulness of the test