Week 1: Psychological Testing and Assessment
Wechsler based his concept of verbal and performance scales on
the Army Alpha and Army Beta tests
The standard error of measurement is more useful than the reliability coefficient
for making decisions about the test scores of individuals
other methods of estimating internal consistency
formulas developed by Kuder, Richardson and Cronbach
z-score
How far the score is from the mean, expressed as a proportion of a standard deviation; the z-score locates the individual score in relation to the mean of the distribution of scores
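As an illustrative sketch (the raw score values below are invented for the example), the z-score is just (raw − mean) / SD:

```python
# z-score: distance of a raw score from the mean, in SD units.
scores = [35, 20, 17, 11, 5]          # invented raw scores
mean = sum(scores) / len(scores)
sd = (sum((x - mean) ** 2 for x in scores) / len(scores)) ** 0.5

def z_score(raw, mean, sd):
    """Locate a raw score relative to the mean of its distribution."""
    return (raw - mean) / sd

z_top = z_score(35, mean, sd)   # positive: above the mean
z_low = z_score(5, mean, sd)    # negative: below the mean
```

A score exactly at the mean always yields z = 0, whatever the distribution.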
Donald McElwain and George Kearney were responsible for developing the
the Queensland Test
Inter-rater reliability
- Examines the extent to which the score obtained by one informant (e.g. parent) correlates with the score obtained by another informant (e.g., teacher)
Test-retest reliability
- Examines whether the score we have obtained on a measure remains stable over time. Time between testing varies across measures.
Internal consistency
- Splits a test into subtests which are then correlated with all the other subtests and the average correlation is calculated (can also look at internal consistency of individual subtests, by examining the items within the subtest)
Split half reliability
- Splits a test into two equivalent halves and correlates the scores on the two halves
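A minimal sketch of the procedure, using invented odd/even half scores for six hypothetical test takers; the Spearman-Brown step estimates full-length reliability from the half-test correlation:

```python
def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

# Hypothetical scores on the odd- and even-numbered items.
odd_half = [10, 12, 9, 15, 11, 14]
even_half = [11, 13, 8, 14, 12, 15]

half_r = pearson_r(odd_half, even_half)
# Spearman-Brown correction: project reliability for the full-length test.
full_r = (2 * half_r) / (1 + half_r)
```

The corrected value is always higher than the half-test correlation, because a longer test samples the item domain more fully.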
Convergent validity
- Tests the extent that the content of the items in one test correlate (have a relationship with) the content of items in another measure of the same (or a similar) construct e.g., VABS and ABAS
Face validity
- The extent that items appear to be valid for the area that is being measured (informal and may be subjective)
Predictive validity
- The extent that scores on a test allow us to predict scores on some criterion measure. e.g., is the Conners 3 an adequate screening tool for ADHD?
Discriminant (or divergent) validity
- The extent that the content of the items in one test is different from (does not overlap with) the content of the items in a contrasting measure e.g., RSES and WAIS
Content validity
- The extent that the content of the test items represents all facets of the construct being measured e.g., WIAT-III
Construct validity
- The extent that the test truly reflects the construct that it purports to measure. Convergent and divergent validity are subsets of construct validity.
What is a construct?
A hypothetical entity with theoretical links to other hypothesised variables, that is postulated to bring about the consistent set of observable behaviours, thoughts or feelings that is the target of a psychological test
What is a psychological test?
A psychological test is an objective procedure for sampling and quantifying human behaviour to make an inference about a particular psychological construct using standardised stimuli, and methods of administration and scoring.
linear transformation
A simple linear transformation is the addition of a constant to all raw scores.
• E.g. five test scores on an achievement test that has a maximum score of 40: 35, 20, 17, 11, and 5.
• A constant value (100) is added to each of the raw scores, and a new set of scores results.
• Plotting the transformed scores against the raw scores produces a straight line.
• An essential feature of this type of transformation is that the differences between the raw scores are maintained in the transformed scores, although the magnitudes have changed.
• The same effect would occur if, instead of adding a constant, we subtracted one.
• In general, a linear transformation is one in which the transformed scores are related to the raw scores in terms of a straight-line function — this means that the equivalence of distances between points on the raw score distribution is maintained in the transformed distribution.
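The worked example above can be checked in a few lines (same five invented scores):

```python
raw = [35, 20, 17, 11, 5]               # achievement-test raw scores
transformed = [x + 100 for x in raw]    # add a constant of 100

# A linear transformation preserves the differences between scores.
raw_gaps = [a - b for a, b in zip(raw, raw[1:])]
new_gaps = [a - b for a, b in zip(transformed, transformed[1:])]
```

The gaps between adjacent scores are identical before and after the transformation; only the magnitudes of the scores change.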
Arthur Otis and Cyril Burt
Arthur Otis and Cyril Burt trialled a variety of group tests of intelligence, but the most convincing demonstration of their usefulness was to come from Clarence Yoakum and Robert Yerkes.
• They developed two group tests of general mental ability for use with recruits to the US armed services during WW1 (the Army Alpha and the Army Beta).
criterion referenced test
Criterion referenced test: a psychological test that uses a predetermined empirical standard as an objective reference point for evaluating the performance of a test taker.
criterion referencing
Criterion referencing: a way of giving meaning to a test score by specifying the standard that needs to be reached in relation to a limited set of behaviours. Not many variables in psychology allow this form of interpretation because the potential item pool for a test often cannot be determined with precision.
Error in psychological testing
Error is an inherent part of psychological testing.
• In terms of test administration, personal and environmental factors mean that you can never control every extraneous variable when administering a test; you can only attempt to control as many variables as possible.
• Error can also occur during test development, e.g. the RCMAS being a measure of anxiety but not including items on separation anxiety, which is commonly observed and reported in younger children.
• Psychological tests may also be used in contexts for which they were not developed, calling into question the validity of the conclusions drawn from the test findings, e.g. using the MMPI as a screener by HR personnel in organisations when it was designed to measure pathological personality traits.
• If a test is translated into another language without rigorous processes, the items may not measure (in the second language) what they were designed to measure in the original language, due to differences in interpretation of the items by translators, or to the second language not having an equivalent word.
What factors can affect the validity of test results?
• External events unrelated to the construct being measured (e.g. the death of a family member prior to taking an exam)
• Factors that are not considered by the test developers which are found to be relevant to the construct, e.g. the RCMAS does not include items about separation anxiety
• Scores on one construct correlating highly with scores on an unrelated construct, e.g. measures of creativity correlating more highly with IQ tests than with other measures of creativity
practical implications for poor reliability
For individual assessment, it increases the standard error of measurement and hence widens the confidence intervals within which the person's true score might lie. For research, it decreases the possible intercorrelation of measures and works against finding relationships that are hypothesised.
norm referencing
Norm referencing: a way of giving meaning to a test score by relating it to the performance of an appropriate reference group for the person. Test developers have sought to relate the raw score to the average score or norm of a representative group of people similar to the person being tested.
percentiles can be determined in three main ways
• Graphic interpolation.
• Arithmetic calculation (see page 64 for the formula).
• By reading from tables of the normal curve.
who developed the MMPI?
Hathaway and McKinley
What factors can affect the reliability of test results?
• How recently the test was developed
• The type of test it is - tests of cognitive abilities and (self-reported) personality are generally more reliable than other tests
• The standard error of measurement
• The length of the test (long forms are generally more reliable than their short-form or screening equivalents)
• The interval between testing and retesting
• Who the test was developed for (i.e. cultural considerations)
• Personal (e.g. fatigue, motivation, anxiety) and environmental (e.g. time of day, lighting, external noise) factors
Why do we need psychological assessments/testing?
Human judgment is subjective and fallible. Some of the factors that can influence the outcomes of human judgment include stereotyping, personal bias, positive and negative halo effect, errors of central tendency. Psychologists consider psychological tests better than personal judgment in informing decision making in many situations because of the nature and defining characteristics of these tests.
normalised standardised scores
Normalised standard scores: a score in a distribution that has been altered to conform to a normal distribution by calculating the z-scores for each percentile equivalent of the original raw score distribution.
principal axes analysis
Kelley's approach — principal axes analysis: the reduction was a way of identifying the underlying or latent variables that gave the particular form to the correlation matrix.
non-probability methods of sampling
Non-probability methods of sampling may produce a biased estimate of the parameter, and the precision of the estimate is unknown.
• These include accidental or convenience sampling.
• There is no way of knowing how representative such a sample is, or even what population it may be a sample of.
norm referenced test
Norm referenced test: a psychological test that uses the performance of a representative group of people (i.e., the norms) on the test for evaluating the performance of a test taker.
the domain sampling model
Other techniques to estimate the reliability of a single test are based on the domain sampling model, in which tests are seen as being made up of items randomly sampled from a domain of items. The standard deviation of the distribution of scores from all possible samples about the true score would tell us about the likelihood of obtaining any particular sample score.
• It is referred to as the standard error of measurement and indicates the precision of our estimate of the true score.
• In practice, we only have samples and no true score, but we can estimate the likely true score for an individual, and the interval in which it lies, with a stated degree of confidence.
• If the confidence interval is very large, there is a great deal of imprecision in the measurement process and we cannot depend on any score we obtain with this sample of items.
• Two quantitative indexes of reliability that allow us to be more precise are the standard error of measurement (SEM) and the reliability coefficient (r).
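A minimal sketch of the two indexes, assuming an IQ-style metric (SD 15) and an assumed reliability of .90:

```python
sd = 15.0   # SD of the score distribution (assumed, IQ-style metric)
r = 0.90    # reliability coefficient (assumed)

# Standard error of measurement: SEM = SD * sqrt(1 - r)
sem = sd * (1 - r) ** 0.5

# 95% confidence interval for the true score around an observed score.
observed = 100
ci_95 = (observed - 1.96 * sem, observed + 1.96 * sem)
```

The lower the reliability, the larger the SEM and the wider (and less dependable) the interval within which the true score is likely to lie.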
Flynn effect
IQ scores tend to increase over time, by about 3 points per decade.
how do psychological tests and assessment differ?
Psychological testing - the process of administering a psychological test and obtaining and interpreting the test scores.
• Psychological assessment - a broad process of answering referral questions, which includes but is not limited to psychological testing.
• Maloney and Ward defined psychological assessment as a process of solving problems in which psychological tests are often used as one of the methods of collecting relevant data.
• Psychological testing forms only a part of psychological assessment.
• Best practice in assessment must take into account other sources of information - do not rely solely on test results.
What's the difference between psychological testing and psychological assessment?
Psychological testing - the process of administering a psychological test and obtaining and interpreting the test scores Psychological assessment - a broad process of answering referral questions, which includes but is not limited to psychological testing
types of psychological tests
Psychological tests differ in a number of ways:
• Self-report tests vs. performance tests.
• Self-report tests have practical advantages in that they usually take less time to complete and can be given to a number of people at the one time.
• Self-report tests are common when the interest is in typical behaviour - what the person frequently does, as in the case of personality and attitude.
• Performance tests are usually limited to individual administration, but they provide information about what the person can actually do, as distinct from what they say they can do.
• Performance tests are used in assessing the limits of what a person can do, such as in assessing their aptitudes or abilities.
• Group vs. individual administration.
• Computer-based tests vs. non-computer-based tests.
• Tests can also differ in terms of the frame of reference for comparing the performance of the individual on the test: norm-referenced vs. criterion-referenced testing.
random sampling
Random sampling: members are drawn from the population but in such a way that every member of the population has an equal opportunity of being selected and drawing one member does not influence in any way the likelihood of any other members being selected.
What is reliability?
Reliability is the confidence we can have that the score is an accurate reflection of what the test purports to measure. Dependability refers to consistency in measurement. Psychological tests have both systematic and unsystematic sources of unreliability. The variance of observed scores on the test is likely to differ depending on the sample of individuals we choose to study, and we cannot assume that the reliability will remain constant across different samples.
if reliability is 1
SEM = 0
if reliability is 0
SEM = SD (all observed variance is error; the SEM equals 1 only when scores are expressed as z-scores)
why is sample size important in developing norms
Size is important because the requirement is to estimate the mean and SD with precision, and sample size has a potent influence on the standard errors of these statistics.
• In the case of estimating the mean, the SE is proportional to the SD of the distribution divided by the square root of the sample size.
• As sample size increases, the denominator becomes larger and the SE smaller.
• The effect is not a linear one: doubling sample size does not halve the SE.
• It is not sample size but the square root of sample size that is the denominator; therefore, to halve the sampling error one must quadruple the sample size.
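The square-root relationship is easy to verify numerically (the SD value here is assumed):

```python
def se_mean(sd, n):
    """Standard error of the mean: SD / sqrt(n)."""
    return sd / n ** 0.5

sd = 15.0                   # assumed SD of the score distribution
se_100 = se_mean(sd, 100)   # 1.50
se_200 = se_mean(sd, 200)   # doubling n does NOT halve the SE
se_400 = se_mean(sd, 400)   # quadrupling n does halve it
```

Going from n = 100 to n = 200 only shrinks the SE by a factor of √2; reaching half the SE requires n = 400.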
non-linear transformation
• Sometimes we can find that numbers in the raw score distribution are bunched in the middle of the range of scores, affording little discrimination in that region.
• A test developer might want to draw out the differences in the middle of the range while leaving the values in the tails of the distribution unchanged.
• This means a non-linear transformation of the raw scores, because in these circumstances the plot of transformed and raw scores will not produce a straight line.
Standardised Scores have a M of _ and a SD of _
Standardised Scores have a M of 10 and a SD of 3
Stanley Porteus
Stanley Porteus saw the need for practical or performance tests of ability that did not depend on verbal skills or exposure to mainstream formal schooling.
• He reported the use of mazes for assessing comprehension and foresight.
• His test required the test taker to trace with a pencil increasingly complex mazes while avoiding dead ends and not lifting the pencil from the paper.
• The test is still used by neuropsychologists in assessing executive functions.
• His work was the forerunner of the development in Australia of a number of tests of ability that are not dependent on access to English for their administration - most notable was the Queensland Test by Donald McElwain and George Kearney (the administrator of this test used mime to indicate task requirements).
T scores have a M of _ and a SD of _
T scores have a M of 50 and a SD of 10
What are norms?
Tables of the distribution of scores on a test for specified groups in a population that allow interpretation of any individual's score on the test by comparison to the scores for a relevant group. E.g. age, gender, grade etc. Ideally, norm samples should be representative of the reference group. They should take into account demographic characteristics that relate to the construct of interest (e.g. age, gender, education level, SES, ethnicity).
What factors should you consider when selecting a psychological test for a client?
The age of the client, and the age range that the test was developed for In an English speaking country, the English language proficiency of the client being assessed (particularly relevant for IQ testing) How long it has been since the client was assessed (if the test has been administered before) Whether the client (or their parent) understands what the assessment is for The psychometrics of the assessment tool you have selected Whether you have the skills/experience to administer and interpret the tests needed to assess this client
Can reliabilities be improved if found wanting in any particular case?
The answer depends on the nature of the reliability being considered and the practical constraints on what is possible in any particular situation.
• Where one is sampling from a domain of items, reliability can often be improved by extending the sample; that is, lengthening the test.
• The Spearman-Brown formula can be used to give an indication of the number of items that need to be added to a test to bring its reliability from a given level to some desired level.
• The relationship between increasing the number of items and changes in reliability is not linear - doubling the number of items, for example, does not double the reliability.
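The Spearman-Brown prophecy formula can be sketched as follows (the reliability values are invented for illustration):

```python
def spearman_brown(r, k):
    """Projected reliability when test length changes by a factor of k."""
    return (k * r) / (1 + (k - 1) * r)

def length_factor(r, target):
    """Factor by which the test must be lengthened to reach target reliability."""
    return (target * (1 - r)) / (r * (1 - target))

r = 0.70                            # assumed current reliability
r_doubled = spearman_brown(r, 2)    # ~0.82, not 1.40: the gain is non-linear
k_needed = length_factor(r, 0.90)   # ~3.86 times the current length
```

Note the diminishing returns: doubling the items lifts .70 only to about .82, and reaching .90 requires nearly quadrupling the test length.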
culture differences
The effects of language and culture on tests are so pervasive that repeated attempts to develop tests that are culture free have proved unsuccessful.
• A culture-fair test is one for which there is no systematic distortion of scores resulting from differences in the cultural background of the test takers.
• Producing a culture-fair test requires that test items have the same meaning in each of the cultures in which the test is to be used, and that the test acts as a predictor of socially relevant criteria in each culture in the same way. There must be an equivalence across cultures in what is termed the test's construct validity and in its predictive or criterion validity.
• To define bias in a test on the basis of differences in average scores between groups is to rule out other possible causes of the difference. Bias must be determined independently of average differences.
• Cultural differences may lead to bias in the use of psychological tests. Several criteria need to be applied to adequately assess whether cultural differences are biasing test results.
• To date, most studies applying these criteria have not found evidence of bias, but this conclusion can only be conditional on the outcome of further research.
• The cultural background of the person to be tested must be appreciated and respected if the psychologist is to perform the task competently and ethically.
• Assessment is more than testing because it involves decisions about whether a test should be used in the first place and, if it is, how the test score is to be interpreted against the background of a full knowledge of the person, including their cultural experiences.
origins of testing
The origins of psychological testing can be found in the public service examinations used by Chinese dynasties to select those who would work for them. Programs of testing were conducted from about 200 BCE until the early years of the 20th century, when they were discontinued - at about the time the modern era of psychological testing was being introduced in the USA. A major driver of this modern development of testing was the need to select men for military service during WW1. There were a number of precursors to this development, the most significant being the work of Alfred Binet in the late 19th and early 20th centuries.
In the context of psychological testing, what does standardisation mean?
The process of administering a test to a representative sample of test takers for the purpose of establishing norms.
raw score total
The raw score total on a psychological test is the score obtained by summing the item scores on the test.
• Raw score totals typically are of little use by themselves and require some way of acquiring meaning.
• The meaning of the raw score total is established by its place in the distribution of scores: if it is towards the top end of the distribution, the person's performance is better than most, and vice versa.
• This approach has received some criticism: the individual only has importance when considered in the context of other individuals.
reliability coefficient
The reliability coefficient can be defined as the proportion of observed score variance that is due to true score variance.
• If the proportion is only 0.5 (i.e. r = 0.5), 50% of the variance in the scores obtained with the test is due to variance in true scores and the other 50% to errors of measurement.
• The reliability coefficient is generally used in forming judgments about the overall value of a particular test.
what is SEM?
The standard error of measurement (SEm) estimates how repeated measures of a person on the same instrument tend to be distributed around his or her "true" score. The true score is always an unknown because no measure can be constructed that provides a perfect reflection of the true score. SEm is directly related to the reliability of a test; that is, the larger the SEm, the lower the reliability of the test and the less precision there is in the measures taken and scores obtained
percentile point
The term percentile point is sometimes used to describe the point in the raw score distribution that corresponds to a given percentile (the raw score value itself, as distinct from the percentile rank).
What is validity?
The validity of a test has been traditionally defined as the extent to which the test measures what it purports to measure (based on what we currently know)
If a person receives a standard score of 100 on an intelligence test (that has a M of 100 and SD of 15) what does this mean?
This score would be considered to be at the central point on the curve of normal distribution, when comparing this person's performance with that of their same-aged peers. This individual would be considered to be performing within the Average range of functioning.
If a person receives a percentile rank of 2 on an intelligence test (that has a M of 100 and SD of 15) what does this mean?
This would mean that the person was performing as well as or better than 2% of their same-aged peers. Their IQ score would be considered to be within the Very Low range.
When test administration is not followed exactly as outlined in a test manual, what are the possible implications?
You may not obtain a result that is representative of the individual's abilities, behaviour or skills. This could have further-reaching implications for the individual e.g., eligibility for funding within the school system, or for Centrelink benefits. It could result in an individual receiving a diagnosis that does not represent their functioning or behaviour in that area (false positive), or alternatively, not being diagnosed (false negative)
Z scores have a M of _ and a SD of _
Z scores have a M of 0 and a SD of 1
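Following the score conventions used in these notes (z: M 0, SD 1; T score: M 50, SD 10; standardised score: M 10, SD 3; deviation IQ: M 100, SD 15), every metric is just a linear transformation of z:

```python
def z_to_t(z):
    return 50 + 10 * z      # T score: M 50, SD 10

def z_to_standardised(z):
    return 10 + 3 * z       # standardised (scaled) score: M 10, SD 3

def z_to_deviation_iq(z):
    return 100 + 15 * z     # deviation IQ: M 100, SD 15

z = 1.0  # one SD above the mean
t, s, iq = z_to_t(z), z_to_standardised(z), z_to_deviation_iq(z)
```

One SD above the mean is a T score of 60, a standardised score of 13, and a deviation IQ of 115 - the same standing expressed on three metrics.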
In confirmatory factor analysis the fit of data and model is shown by
a nonsignificant chi-square test
Expectancy tables depend on
• a single well-specified criterion
• a large and appropriate sample
• sufficient data at each score level
1. What are the five defining characteristics of a psychological test?
a. A psychological test is a sample of behaviour that is used to make inferences about the individual in a significant social context
b. It is an objective procedure
c. The result of a psychological test is summarised quantitatively in terms of a score or scores
d. A psychological test provides an objective reference point for evaluating the behaviour it measures
e. It must meet a number of criteria to be useful - its psychometric properties
Content validity is usually restricted to
achievement tests
codes of ethics for psychological assessments
B.13.1. Psychologists use established scientific procedures and observe relevant psychometric standards when they develop and standardise psychological tests and other assessment techniques.
B.13.2. Psychologists specify the purposes and uses of their assessment techniques and clearly indicate the limits of the assessment techniques' applicability.
B.13.3. Psychologists ensure that they choose, administer and interpret assessment procedures appropriately and accurately.
B.13.4. Psychologists use valid procedures and research findings when scoring and interpreting psychological assessment data.
B.13.5. Psychologists report assessment results appropriately and accurately in language that the recipient can understand.
B.13.6. Psychologists do not compromise the effective use of psychological assessment methods or techniques, nor render them open to misuse, by publishing or otherwise disclosing their contents to persons unauthorised or unqualified to receive such information.
___ IQs, ___ scores, and ___ scores are all variations on the basic idea of the z-score.
deviation, T scores, standardised scores
practical implications of poor validity
does not measure what it claims to measure
Percentiles have been widely used in _______________
education
The reliability of a test for a particular purpose depends on
eliminating unsystematic error
z-score transformation is useful because
if we can assume a distribution of scores is normal, the properties of the normal curve can be invoked in interpreting a z-score
Factor analysis can also be used to identify ? between variables in a data set
interrelationships
Factor analysis proceeds
involves judgement at every step
Construct validity of a test
involves study of the implications of psychological theory
A pass mark of 50 per cent on a test
is a convention adopted in some academic settings
multitrait multimethod matrix
is a tool for evaluating construct validity. It organises convergent and discriminant validity evidence so that we can compare how a measure relates to other measures.
the equivalence of differences only holds with __________ transformations.
linear
In transforming psychological tests scores to give them normative meaning ___________________are used.
linear and non-linear transformations
The standard error of measurement depends
on both reliability of the test and the variability of scores in the sample to which the test is administered
the most common form of a non-linear transformation is
percentile
Human judgment is influenced by
• personal bias
• halo effect
• errors of central tendency
Alfred Binet in devising the test that bears his name
required the items of his test to pass the empirical standards he set for validity
Who developed the first self report personality test?
Robert Woodworth - this was a screening test for psychological adjustment to the military situation
in the public arena, there were several challenges to psychological testing
• serious invasion of privacy
• concern about the homogenising effects on the workforce of using psychological tests for selection, with only a limited set of personality characteristics and abilities being acceptable to an organisation
• that testing could be discriminatory
common errors in scoring
Some of the most common errors in scoring include miscalculations, incorrectly reading tables and incorrectly transferring scores on test forms.
who developed the first theory of intelligence?
Spearman
what type of scores did Wechsler, the MMPI and Cattell use?
standardised score - Wechsler
T-score - MMPI
sten score - Cattell
Psychometric properties of a psychological test refer to
the criteria that a test needs to meet to be a useful device
Percentile ranking transforms scores in a distribution such that
the order is preserved
In terms of decision theory, the base rate involves
the sum of false negatives and valid positives
The sensitivity of a test used in diagnosing whether a person suffers from psychosis or not is
the sum of the valid positives divided by the sum of the valid positives and the false negatives
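That definition translates directly into a one-line calculation (the counts below are hypothetical):

```python
def sensitivity(valid_pos, false_neg):
    """Proportion of people who truly have the condition that the test detects."""
    return valid_pos / (valid_pos + false_neg)

# Hypothetical diagnostic study: 45 true cases detected, 5 missed.
sens = sensitivity(valid_pos=45, false_neg=5)   # 0.90
```

A sensitivity of .90 means the test identifies 90% of the people who actually have the condition; the remaining 10% are false negatives.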
T/F: the z-score transformation is linear
true
Factor analysis is frequently used in developing psychological tests to estimate
• unidimensionality
• an alternative to Cronbach's alpha
• construct validity
the most common form of a linear transformation is
z-score
limitations of Cronbach's alpha
• A test can be developed to have a high internal consistency by having items with highly similar content - the domain itself may be so constricted as to be trivial.
• High internal consistency does not in itself guarantee that the items are all reflecting the one thing. It means that the items are interrelated, but not that they are homogeneous or, as it is technically referred to, uni-dimensional.
• If there are multiple factors underlying performance on the test, alpha may overestimate the reliability of the factor thought to underlie the test, because alpha estimates the reliability of the labelled factor and all other factors being measured.
• High internal consistency is an important attribute in a psychological test, but it is not of itself a seal of approval.
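For reference, a minimal computation of alpha (item scores invented); note how three near-duplicate items produce a very high alpha, illustrating that high internal consistency can simply reflect a narrow item domain:

```python
def cronbach_alpha(items):
    """items: one list of scores per item, respondents in the same order."""
    k = len(items)

    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    total_scores = [sum(vals) for vals in zip(*items)]
    item_var_sum = sum(variance(item) for item in items)
    return (k / (k - 1)) * (1 - item_var_sum / variance(total_scores))

# Three items with highly similar content: alpha is high, but the
# domain sampled may be too constricted to be meaningful.
items = [[1, 2, 3, 4, 5],
         [1, 2, 3, 4, 5],
         [2, 2, 3, 4, 4]]
alpha = cronbach_alpha(items)   # ~0.97
```

A high alpha here says only that the items covary strongly, not that the test is uni-dimensional or that the construct is well sampled.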
stanine scale
• A variant of the percentile that is used by some test developers is the stanine scale.
• This was developed to facilitate recording of scores because it required only nine numbers, all single digits, to describe all possible raw scores.
• The standard nine or stanine scale grouped percentiles into bands and assigned the numbers 1-9 to these bands.
• The stanine distribution has a mean of 5 and a standard deviation of approximately 2.
• Stanines are a non-linear transformation; stens are a linear transformation.
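Assuming the commonly cited percentile cut-points for the nine bands (4, 11, 23, 40, 60, 77, 89, 96), the banding can be sketched as:

```python
# Assumed percentile cut-points between adjacent stanine bands.
STANINE_CUTS = [4, 11, 23, 40, 60, 77, 89, 96]

def stanine(percentile):
    """Map a percentile (0-100) onto the 1-9 stanine scale."""
    return 1 + sum(1 for cut in STANINE_CUTS if percentile >= cut)

middle = stanine(50)   # 5: the band straddling the mean
low = stanine(2)       # 1: bottom band
high = stanine(97)     # 9: top band
```

Because the bands group unequal slices of the percentile scale, this mapping is non-linear, consistent with the note above.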
selecting a test
• After deciding that psychological testing is necessary for a client and selecting the particular construct or constructs to be assessed, a psychologist needs to select the most appropriate and psychometrically sound tests.
• Psychometrics is concerned with psychological measurement and the theories that underpin it.
• In Australia and overseas, test publishers usually require test purchasers to register before they are allowed to buy the tests.
• The purpose of this is to ensure that confidential test materials are supplied to professionals who are appropriately trained and qualified.
what needs to be considered before administering psychological tests
After selecting a psychological test, the following need to be considered before administering the test:
• Ensure that the test is appropriate for use with the particular client in terms of age, educational level and ethnic background.
• Ensure a suitable venue is selected for administration of the test.
• Check that all test materials are present and intact.
• Ensure adequate time is spent becoming familiar with the test so that standardised instructions and procedures are used.
principal components analysis
• Components are produced (or caused) by the variables (PCA).
• They are aggregates of correlated variables.
• No underlying theory about which variables should be related to which factors: an empirical summary of the data.
• All variance in the observed variables is distributed to components (including error and unique variance).
• Always yields a unique mathematical solution.
• Useful for shortening scales.
factor analysis
• Factors are thought to cause the variables: the underlying construct produces the scores on the variables.
• Useful for theory development and testing: research on which underlying constructs are expected to produce scores on your variables.
• Analyses shared variance: covariance (shared variance) is estimated via communalities, to eliminate variance due to error and variance unique to each variable.
• Most forms of FA do not provide a unique solution (image analysis is an exception).
• Useful for scale development.
Binet
• He was asked to provide a method for objectively determining which children would benefit from special education.
• Responding to this, Binet developed the first of the modern intelligence tests.
• He proposed a method for quantifying intelligence in terms of the concept of mental age: the child's standing among children of different chronological ages in terms of his or her cognitive capacity.
• The assumption in Binet's work - that performance on a range of apparently different problems can be aggregated to yield an overall estimate of mental age - was examined by Charles Spearman.
measures of inter-scorer reliability
• In some types of tests under some conditions, the score may be more a function of the scorer than anything else. • Certain tests lend themselves to scoring in a way that is more consistent than with other tests. • Inter-scorer reliability is the degree of agreement or consistency between two or more scorers with regard to a particular measure. • If the reliability coefficient is high, the prospective test user knows that test scores can be derived in a systematic, consistent way by various scorers with sufficient training. • Indexes of inter-rater reliability: • The contingency coefficient: a version of the product moment correlation when applied to categorical data. • Kappa.
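Kappa can be computed directly from two scorers' categorical ratings: it is the observed agreement corrected for the agreement expected by chance. A minimal sketch; the pass/fail ratings below are invented for illustration.

```python
import numpy as np

def cohens_kappa(rater1, rater2):
    """Agreement between two scorers, corrected for chance agreement."""
    categories = sorted(set(rater1) | set(rater2))
    idx = {c: i for i, c in enumerate(categories)}
    k = len(categories)
    table = np.zeros((k, k))
    for a, b in zip(rater1, rater2):
        table[idx[a], idx[b]] += 1
    n = table.sum()
    p_observed = np.trace(table) / n                               # raw agreement
    p_chance = (table.sum(axis=1) * table.sum(axis=0)).sum() / n**2  # chance agreement
    return (p_observed - p_chance) / (1 - p_chance)

# Two hypothetical scorers classifying 10 responses as pass/fail.
r1 = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
r2 = ["pass", "pass", "fail", "fail", "fail", "pass", "fail", "fail", "pass", "pass"]
print(round(cohens_kappa(r1, r2), 3))  # 0.8
```

A kappa of 1 indicates perfect agreement; 0 indicates agreement no better than chance.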
percentiles
• A non-linear transformation known as the percentile is as popular as the z-score. • The percentile scale expresses each raw score in a distribution in terms of the percentage of cases that lie below it. • E.g. the 50th percentile is larger than 50 per cent of the raw scores in the distribution. • Note that the percentile does not indicate the percentage correct on the test but the percentage of cases below the given value of the raw score. • The value of the percentile scale is that it allows scores to be ranked in such a way that their position in the distribution is immediately apparent. • Because the transformation is non-linear (it is not based on the equation of a straight line), it does not preserve the equivalence of distances between scores in the raw score distribution. • Scores in the middle of a normal distribution are stretched apart on the percentile scale, whereas those at the tails are pushed closer together, forming a rectangular scale. • Scores that are an equal number of percentiles apart are not necessarily an equal distance apart in the raw score distribution.
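The definition above (percentage of cases below a given raw score) can be computed directly. A minimal sketch with a made-up distribution of ten raw scores:

```python
def percentile_rank(raw_score, distribution):
    """Percentage of scores in the distribution that fall below raw_score."""
    below = sum(1 for s in distribution if s < raw_score)
    return 100 * below / len(distribution)

# Hypothetical raw-score distribution for a norm group of 10 people.
scores = [10, 12, 12, 15, 18, 20, 22, 25, 28, 30]

print(percentile_rank(20, scores))  # 50.0: half the scores lie below 20
print(percentile_rank(10, scores))  # 0.0: no score lies below the minimum
```

Note the rank says nothing about how many items were answered correctly, only where the score sits in the distribution.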
normal distribution
• Normal curve: a statistical distribution that has a characteristic bell shape. • A normal distribution is symmetrical about the mean, with half the scores below the mean and half above. • When standard scores are specified in terms of their distance from the mean, the mean is 0 and the standard deviation is 1.
Cronbach
• Proposed that the test be split into subtests, each one item in length. • All subtests are then correlated with all other subtests and the average correlation calculated. • This average correlation becomes the estimate of reliability. • This method is often described as determining the internal consistency of a test.
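This estimate is usually reported as coefficient alpha, computed from the item variances and the variance of the total score. A minimal sketch; the 5-person, 4-item data are invented for illustration.

```python
import numpy as np

def cronbach_alpha(items):
    """items: people x items matrix of scores. Returns coefficient alpha."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)          # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)      # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical scores: 5 people, 4 items that hang together well.
data = [
    [3, 4, 3, 3],
    [5, 4, 5, 5],
    [1, 2, 2, 1],
    [4, 3, 4, 4],
    [2, 2, 1, 2],
]
print(round(cronbach_alpha(data), 3))  # high alpha: items are internally consistent
```

When the items correlate strongly, the total-score variance greatly exceeds the sum of item variances and alpha approaches 1.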
limitation of psychological tests
• Psychological tests are only tools: they do not and cannot make decisions for test users. • Psychological tests are often used in an attempt to capture the effects of hypothetical constructs. • Psychology employs constructs that are not directly observable. • As such, we need to be aware that sometimes a gap exists between what the psychologist intends to measure using a psychological test and what the test actually measures. • Because of the continual development and refinement of psychological theories, the development of technology and the passage of time, psychological tests can become obsolete. • Sometimes a psychological test can disadvantage a subgroup of test takers because of their cultural experience or language background.
classical test theory
• Sometimes referred to as weak true-score theory. • Five basic assumptions of classical test theory are:
1. The observed score on a test is the sum of two components, a true score and an error score: Xo = T + e
2. The true score is the population mean of the observed scores: E(Xo) = T, where E(Xo) means the expected value (population mean) of the observed scores. E.g. if one were to repeatedly administer a test to a person, then the long-run mean of the scores obtained would be the true score for that person.
3. The correlation between the true score and error score components is zero: r(T, e) = 0. Errors are random and therefore cannot relate systematically to any other variable.
4. The correlation between the error components on two tests is zero: r(e, e') = 0, where e and e' are the error components of the observed scores on the two tests.
5. The correlation between the error component of the observed score on one test and the true score component on another is zero: r(e, T') = 0, where e is the error component of the observed score on one test and T' is the true score component on another.
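Assumption 2 can be demonstrated with a small simulation. The true score and error standard deviation below are assumed values chosen for illustration; the point is that the long-run mean of the observed scores converges on the true score because errors are random with mean zero.

```python
import random
random.seed(0)  # fixed seed so the simulation is reproducible

TRUE_SCORE = 100   # assumed fixed true score T for one person
ERROR_SD = 5       # assumed spread of the random error component e

# Repeatedly "administer the test": each observed score is Xo = T + e.
observed = [TRUE_SCORE + random.gauss(0, ERROR_SD) for _ in range(100_000)]

# The long-run mean of the observed scores estimates the true score.
mean_observed = sum(observed) / len(observed)
print(round(mean_observed, 1))
```

Any single administration may miss the true score by several points, but the errors cancel out in the long run.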
Stratified sampling
• Stratified sampling (non-probability sampling method): a method of sampling in which the sample is drawn from the population in such a way that it matches the population with respect to a number of characteristics that are considered important for the purposes of the study.
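The matching idea can be sketched in code. The population, strata and proportions below are all invented: the sample is drawn stratum by stratum so that its composition mirrors the population's.

```python
import random
random.seed(1)

# Hypothetical population of 1000 people, stratified by gender to match
# assumed population proportions (52% female, 48% male).
population = [{"id": i, "gender": "F" if i < 520 else "M"} for i in range(1000)]
proportions = {"F": 0.52, "M": 0.48}
sample_size = 100

sample = []
for gender, prop in proportions.items():
    stratum = [p for p in population if p["gender"] == gender]
    # Draw from each stratum in proportion to its share of the population.
    sample.extend(random.sample(stratum, round(sample_size * prop)))

print(len(sample))  # 100: 52 female + 48 male, matching the population
```

In practice a norm sample would be stratified on several characteristics at once (age, education level, SES, ethnicity), but the principle is the same.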
what are norms?
• Tables of the distribution of scores on a test for specified groups in a population that allow interpretation of any individual's score on the test by comparison to the scores for a relevant group. • E.g. age, gender, grade etc. • Ideally, norm samples should be representative of the reference group. They should take into account demographic characteristics that relate to the construct of interest (e.g. age, gender, education level, SES, ethnicity). • Norm-referenced testing and assessment is a method of evaluation and a way of deriving meaning from test scores by evaluating an individual test taker's score and comparing it to the scores of a group of test takers. • The meaning of an individual test score is understood relative to other scores on the same test. • In a psychometric context, norms are the test performance data of a particular group of test takers that are designed for use as a reference when evaluating or interpreting individual test scores. • A normative sample is that group of people whose performance on a particular test is analysed for reference in evaluating the performance of individual test takers. • In norm referencing, the raw score is referred to a relevant group for comparison purposes. • If the comparison is not with an appropriate group, the transformation (although technically correct) fails to convey meaning or opens the score to misinterpretation. • Selecting an appropriate reference distribution and ensuring that the mean and SD are well estimated are essential aspects of the norm-referencing approach. • The accuracy of the mean and SD depends on two principal considerations: the manner by which the sample is drawn from the population in question and the size of the sample. • Probability methods of sampling increase the likelihood of the sample matching the population in all respects that may be important to the researcher, and permit the calculation of the degree of precision in estimating a parameter of interest in the population.
how reliable does a test need to be?
• The answer to this question depends on the circumstances in which the test is being used. • If the result of the test has serious consequences for an individual, then a very high level of reliability is required. • Nunnally gave the following rule of thumb for assessing reliability: • 0.5 or better for test development; • 0.7 or better for using a test in research; • Better than 0.9 for use in individual assessment. • Tests of cognitive abilities have the highest reliabilities, followed by self-report tests of personality.
what are the implications of differing levels of reliability across different tests?
• The reliability of a test affects the magnitude of the intercorrelation of the test with any other variable. • An unreliable test is one that does not correlate with itself: how then can it be expected to correlate with anything else? • The theoretical maximum correlation coefficient is 1.0; as the reliability falls, the maximum possible correlation falls too. • With low reliabilities we may conclude that two variables are unrelated when in fact the magnitude of the correlation has been reduced by poor measurement of one or the other of the variables.
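The falling ceiling described above is usually quantified with the standard attenuation result: the observed correlation between two measures cannot exceed the square root of the product of their reliabilities. A minimal sketch with assumed reliability values:

```python
import math

def max_observable_correlation(rel_x, rel_y):
    """Ceiling on the observed correlation between two measures,
    given their reliabilities (attenuation due to unreliability)."""
    return math.sqrt(rel_x * rel_y)

# Two perfectly reliable tests could in principle correlate up to 1.0 ...
print(max_observable_correlation(1.0, 1.0))   # 1.0
# ... but two tests each with reliability .5 can correlate at most .5.
print(max_observable_correlation(0.5, 0.5))   # 0.5
```

So an observed correlation of .4 between two poorly measured variables may reflect a much stronger underlying relationship.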
standard error of measurement
• The standard error of measurement is used in making judgements about the individual scores obtained with a test. • The reliability coefficient, by contrast, is determined from data obtained with the test and describes the test as a whole, which makes the standard error of measurement more useful for decisions about individual scores.
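The standard error of measurement is conventionally computed as the test's standard deviation times the square root of one minus its reliability, and is then used to put a confidence band around an individual's observed score. The SD, reliability and observed score below are assumed values for illustration:

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SD of the test times sqrt(1 - r)."""
    return sd * math.sqrt(1 - reliability)

# Assumed: an IQ-style scale with SD = 15 and reliability .91.
error = sem(15, 0.91)          # 4.5

# An approximate 95% confidence band around an observed score of 103.
score = 103
lo, hi = score - 1.96 * error, score + 1.96 * error
print(round(error, 1), (round(lo, 1), round(hi, 1)))
```

This is why the SEM matters for individual decisions: a single observed score is best read as a band, not a point.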
percentile equivalent
• The term percentile equivalent is used to refer to the percentile score that expresses the raw score.
percentile rank
• The term percentile rank is more widely used and refers to the percentage of scores that fall below the percentile point.
age and grade equivalents
• To examine children's performance in terms of expected developmental change, age- or grade-equivalent scores are sometimes computed. • The idea is to refer the child's level of performance to the typical performance of children of the same age or grade level. • The median age or grade score for a sample of children is set as the age or grade equivalent. • E.g. for children in grade 6 the median may be 17, and a raw score of 17 is thus a grade equivalent of 6. • Raw scores between 17 and 20 are given grade equivalents by interpolation; a raw score of 18 would have a grade equivalent of 6.3. • The decimal place can be expressed in months. • There are warnings against misinterpretation: for example, children at different ages have different levels of understanding and preparedness for different types of learning, even though they may have the same age or grade equivalent scores.
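The interpolation step can be sketched as linear interpolation between adjacent grade medians. The grade 7 median of 20 below is an assumed value, chosen because it reproduces the example in the notes (raw 17 at grade 6, raw 18 at about 6.3):

```python
# Assumed median raw scores by grade (grade 6 median = 17 from the example;
# a grade 7 median of 20 is assumed for illustration).
medians = {6: 17, 7: 20}

def grade_equivalent(raw):
    """Linear interpolation between two adjacent grade medians."""
    lo_grade, hi_grade = 6, 7
    lo, hi = medians[lo_grade], medians[hi_grade]
    return lo_grade + (raw - lo) / (hi - lo)

print(grade_equivalent(17))            # 6.0: exactly the grade 6 median
print(round(grade_equivalent(18), 1))  # 6.3: one third of the way to grade 7
```

The linearity here is an assumption of the scoring convenience, not a fact about development, which is one source of the misinterpretations warned against above.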
transforming scores for norm referencing
• To refer a raw score total to an appropriate reference group, the raw score has to be changed or transformed to a score that has normative information. • Two basic forms of transformation are typically employed: linear and non-linear.
• When standard scores are specified in terms of their distance from the mean, the mean is ? and the standard deviation is ?.
• When standard scores are specified in terms of their distance from the mean, the mean is 0 and the standard deviation is 1. If the z-score is positive, the score is larger than the mean; if it is negative, the score is less than the mean (lies below it in the distribution).
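The z-score transformation itself is a one-line linear formula. A minimal sketch, using an assumed scale with mean 100 and SD 15:

```python
def z_score(raw, mean, sd):
    """Distance of a raw score from the mean, in standard deviation units."""
    return (raw - mean) / sd

# A score of 115 on a scale with mean 100 and SD 15 lies 1 SD above the mean.
print(z_score(115, 100, 15))   # 1.0
# A negative z-score lies below the mean.
print(z_score(85, 100, 15))    # -1.0
```

Because the transformation is linear, equal raw-score distances stay equal on the z scale, unlike the percentile transformation described earlier.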
When test administration is not followed exactly as outlined in a test manual, what are the possible implications?
• You may not obtain a result that is representative of the individual's abilities, behaviour or skills. • This could have further-reaching implications for the individual e.g., eligibility for funding within the school system, or for Centrelink benefits. • It could result in an individual receiving a diagnosis that does not represent their functioning or behaviour in that area (false positive), or alternatively, not being diagnosed (false negative).