PSYCH ASSESSMENT 106
SOURCES OF ERROR VARIANCE
-Assessee -Assessor -Measuring instruments
TYPES OF NORMS
1. Age norms 2. Grade norms 3. National norms 4. National anchor norms 5. Local norms 6. Norms from a fixed reference group 7. Subgroup norms 8. Percentile norms
2 types of sampling procedure
1. Purposive sampling 2. Incidental sampling
Sources of Error Variance
1. Test construction, administration, scoring, and/or interpretation
Sources of error variance during test administration
1. Test environment, such as room temperature, level of lighting, and amount of ventilation and noise. 2. Test-taker variables, such as pressing emotional problems, physical discomfort, lack of sleep, and the effects of drugs or medication. 3. Formal learning experiences, casual life experiences, therapy, illness, and changes in mood or mental state. 4. Body weight can also be a source of error variance. 5. Examiner-related variables are potential sources of error variance. 6. Scorers and scoring systems are potential sources of error variance.
Basic 3 approaches to the estimation of reliability
1. Test-retest 2. Alternate or parallel forms 3. Internal or inter-item consistency
- also known as age-equivalent scores, indicate the average performance of different samples of test takers who were at various ages at the time the test was administered
AGE NORMS
are simply different versions of a test that have been constructed so as to be parallel.
ALTERNATE FORMS
refers to an estimate of the extent to which these different forms of the same test have been affected by item sampling error or other error.
Alternate forms reliability
an informed, scientific concept developed or constructed to describe or explain behavior. We can't see, hear, or touch constructs, but we can infer their existence from overt behavior.
CONSTRUCT
a standard on which a judgment or decision may be based.
CRITERION
is also referred to as the true score model of measurement. It is the most widely used and accepted model in the psychometric literature today.
Classical test theory
states that a score on an ability test is presumed to reflect not only the test taker's true score on the ability being measured but also error.
Classical test theory
may be defined as a method of evaluation and a way of deriving meaning from test scores by evaluating an individual's score with reference to a set standard
Criterion-referenced testing and assessment
is designed to provide an indication of where a test taker stands with respect to some variable or criterion such as an educational or a vocational objective.
Criterion-referenced tests
a test item or question that can be answered with only one of two response options such as true or false or yes-no.
Dichotomous test item
The computation of a coefficient of split-half reliability generally entails three steps:
1. Divide the test into equivalent halves. 2. Calculate a Pearson r between scores on the two halves of the test. 3. Adjust the half-test reliability using the Spearman-Brown formula.
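As an illustration, the three steps can be sketched in Python; the half-test scores below are invented, and `pearson_r` is a hand-rolled helper rather than part of any particular package:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Step 1: the test has been divided into equivalent halves (e.g. odd vs. even items)
odd_half = [10, 12, 9, 14, 11]    # invented scores for five test takers
even_half = [11, 13, 8, 15, 10]

# Step 2: Pearson r between scores on the two halves
r_half = pearson_r(odd_half, even_half)

# Step 3: Spearman-Brown adjustment to estimate full-length-test reliability
r_full = (2 * r_half) / (1 + r_half)
```

The adjusted coefficient is at least as large as the half-test correlation, reflecting the general principle that, other things being equal, longer tests tend to be more reliable.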
- refer to mistakes, miscalculations and the like. - Traditionally refers to something that is more than expected; it is a component of the measurement process; - Refers to a long-standing assumption that factors other than what a test attempts to measure will influence performance on the test
ERROR
is used as the basis for the calculation of test scores for future administration of the test.
Fixed reference group
- a system of scoring wherein the distribution of scores obtained on the test from one group of test takers is used as the basis for the calculation of test scores for future administrations
Fixed reference group scoring system
are developed by administering the test to representative samples of children over a range of consecutive grade levels
GRADE NORMS
designed to indicate the average test performance of test takers in a given school grade.
GRADE NORMS
- examines how generalizable scores from a particular test are if the test is administered in different situations
Generalizability study
describes the degree to which a test measures different factors.
Heterogeneity
is composed of items that measure more than one trait.
Heterogeneous test
is one source of variance during test construction. It refers to the variation among items within a test as well as to variation among items between tests.
ITEM SAMPLING OR CONTENT SAMPLING
- referred to as convenience sampling. The process of arbitrarily selecting some people to be part of sample because they are readily available, not because they are most representative of the population being studied.
Incidental sampling
refers to the degree of correlation among all the items on a scale. A measure of inter-item consistency is calculated from a single administration of a single form of a test.
Inter-item consistency
Is the degree of agreement or consistency between two or more scorers with regard to a particular measure.
Inter-scorer reliability
an estimate of reliability of a test obtained from a measure of inter-item consistency
Internal consistency estimates of reliability
- refers, collectively, to all of the factors associated with the process of measuring some variable, other than the variable being measured. Measurement error is commonly categorized as random or systematic.
MEASUREMENT ERROR
are derived from a normative sample that was nationally representative of the population at the time the norming study was conducted.
NATIONAL NORMS
- refer to behavior that is usual, average, normal, standard, expected, or typical
NORM
- provide a standard with which the results of measurement can be compared.
NORMS
an equivalency table for scores on two nationally standardized tests designed to measure the same thing.
National anchor norms
refers to an observable action or product of an observable action
OVERT BEHAVIOR
exist when for each form of the test the means and the variances of observed test scores are equal.
PARALLEL FORMS OF A TEST
refers to an estimate of the extent to which item sampling and other errors have affected test scores on versions of the same test when, for each form of the test, the means and variances of observed test scores are equal.
PARALLEL FORMS RELIABILITY
refers to the distribution of raw scores; more specifically, to the number of items that were answered correctly, multiplied by 100 and divided by the total number of items.
PERCENTAGE CORRECT
is an expression of the percentage of people whose score on a test or measure falls below a particular raw score. It is a converted score that refers to a percentage of test takers.
PERCENTILE
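The percentile idea can be illustrated with a short, hypothetical computation; the standardization scores below are invented:

```python
# Hypothetical standardization sample of ten raw scores
scores = [50, 55, 60, 65, 70, 75, 80, 85, 90, 95]

def percentile(raw, sample):
    """Percentage of scores in the sample that fall below the given raw score."""
    below = sum(1 for s in sample if s < raw)
    return 100 * below / len(sample)

p = percentile(80, scores)  # 6 of the 10 scores fall below 80, so p is 60.0
```

A test taker scoring 80 here would be said to be at the 60th percentile of this (invented) sample.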
are the raw data from a test's standardization sample converted to percentile form.
PERCENTILE NORMS
- a test item or question with three or more alternative responses where only one alternative is scored correct or scored as being consistent with a targeted trait or other construct
Polytomous test items
- a test in which the time limit is long enough to allow test takers to attempt all items, and some items are so difficult that no test taker is able to obtain a perfect score
Power tests
Assumption #1
Psychological Traits and States Exist
Methods of obtaining internal consistency estimates of reliability
SPLIT-HALF ESTIMATE
norms for any defined group within a large group.
Subgroup norms
is an estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the same test. It is appropriate when evaluating the reliability of a test that purports to measure something that is relatively stable over time such as personality trait.
TEST-RETEST RELIABILITY
is presumed to represent the strength of the targeted ability or trait or state and is frequently based on cumulative scoring
THE TEST SCORE
any distinguishable, relatively enduring way in which one individual varies from another
TRAIT
The greater the proportion of the total variance attributed to true variance, the more reliable the test
TRUE VARIANCE
Assumption #3
Test-Related Behavior Predicts Non-Test-Related Behavior. Patterns of answers to true-false questions on one widely used test of personality are used in decision making regarding mental disorders. The tasks in some tests mimic the actual behaviors that the test user is attempting to understand.
Assumption #4
Tests and Other Measurement Techniques Have Strengths and Weaknesses. Competent test users understand a great deal about the tests they use. They understand, among other things, how a test was developed, the circumstances under which it is appropriate to administer the test, how the test should be administered and to whom, and how the test results should be interpreted. Competent test users understand and appreciate the limitations of the tests they use, as well as how those limitations might be compensated for by data from other sources.
- allows a test developer or user to estimate internal consistency reliability from a correlation of two halves of a test.
The Spearman-Brown formula
Assumption #5
Various Sources of Error Are Part of the Assessment Process
variance from true differences
true variance
a relatively new measure for evaluating the internal consistency of a test. It focuses on the degree of difference that exists between item scores.
Average proportional distance (APD)
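As a sketch, one published formulation of the APD averages the absolute differences between every pair of item scores and divides by the number of response options minus one; the Likert-scale responses below are invented, so treat this as an illustration under those assumptions rather than a definitive implementation:

```python
from itertools import combinations

responses = [        # rows = test takers, columns = four items on a 1-7 scale
    [5, 6, 5, 6],
    [3, 3, 4, 3],
    [7, 6, 7, 7],
]
options = 7          # number of response options (1..7)

# Mean absolute difference for each pair of items, averaged across test takers
pair_means = []
for i, j in combinations(range(4), 2):
    diffs = [abs(row[i] - row[j]) for row in responses]
    pair_means.append(sum(diffs) / len(diffs))

# Average proportional distance: mean pairwise difference as a proportion
# of the maximum possible difference between two item scores
apd = (sum(pair_means) / len(pair_means)) / (options - 1)
```

Under this formulation, smaller values indicate that item scores sit close together, i.e. better internal consistency.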
refers to the component of the observed test score that does not have to do with the test taker's ability.
Error
the assumption is made that each test taker has a true score on a test that would be obtained but for the action of measurement error.
Classical test theory or true score theory
- developed by Cronbach and elaborated on by others
Coefficient alpha
the mean of all possible split-half correlations
Coefficient alpha
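Coefficient alpha can be computed directly from an item-score matrix; the data below are invented, and population variances are used throughout:

```python
def variance(xs):
    """Population variance: mean squared deviation from the mean."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

items = [            # rows = test takers, columns = items (invented data)
    [2, 3, 3, 2],
    [4, 4, 5, 4],
    [1, 2, 1, 2],
    [3, 3, 4, 3],
]
k = len(items[0])    # number of items

item_vars = [variance([row[j] for row in items]) for j in range(k)]
total_var = variance([sum(row) for row in items])

# alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores)
alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
```

When items covary strongly, the variance of the total scores is large relative to the sum of the item variances, and alpha approaches 1.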
When the interval between testing is greater than six months, the estimate of test-retest reliability is often referred to as the
Coefficient of stability
also referred to as criterion-referenced or domain-referenced testing and assessment. A method of evaluation and a way of deriving meaning from test scores by evaluating an individual's score with reference to a set standard; contrast with norm-referenced testing and assessment.
Content-referenced testing and assessment
- seeks to estimate the extent to which specific sources of variation under defined conditions contribute to the test score. In this theory, test reliability is conceived of as an objective measure of how precisely the test score assesses the domain from which the test draws a sample.
Domain sampling theory
- the component of a test score attributable to sources other than the trait or ability measured
ERROR VARIANCE
the equivalency of scores on different tests is calculated with reference to corresponding percentile scores
Equipercentile method
another alternative to the true score model. It is also referred to as latent-trait theory or the latent-trait model: a system of assumptions about measurement and the extent to which each test item measures the trait.
Item response theory (IRT)
- provide normative information with respect to the local population's performance on some test.
Local norms
are the test performance data of a particular group of test takers that are designed for use as a reference when evaluating or interpreting individuals' test scores
Norm in the psychometric context
a way to derive meaning from test scores; an approach that evaluates a test score in relation to other scores on the same test
Norm- referenced
a method of evaluation and a way of deriving meaning from test scores by evaluating an individual test taker's score and comparing it to the scores of a group of test takers. A common goal of norm-referenced tests is to yield information on a test taker's standing or ranking relative to some comparison group of test takers.
Norm-referenced testing and assessment
Assumption #2
Psychological Traits and States Can Be Quantified and Measured. Measuring traits and states by means of a test entails developing not only appropriate test items but also appropriate ways to score the test and interpret the results. The test score is presumed to represent the strength of the targeted ability, trait, or state and is frequently based on cumulative scoring.
the arbitrary selection of people to be part of a sample because they are thought to be representative of the population being studied
Purposive sampling
- is a source of error in measuring a targeted variable caused by unpredictable fluctuations and inconsistencies of other variables in the measurement process
RANDOM ERROR
A good test, or more generally, a good measuring tool or procedure, is reliable. The criterion of reliability involves the consistency of the measuring tool: the precision with which the test measures and the extent to which error is present in measurements. In theory, the perfectly reliable measuring tool consistently measures in the same way.
RELIABILITY
refers to the proportion of the total variance attributed to true variance
RELIABILITY
also referred to as inflation of variance; a reference to a phenomenon associated with reliability estimates wherein the variance of either variable in a correlational analysis is inflated by the sampling procedure used, so that the resulting correlation coefficient tends to be higher; contrast with restriction of range.
Restriction or inflation of range
a portion of the universe of people deemed to be representative of the whole population
SAMPLE
the process of selecting the portion of the universe deemed to be representative of the whole population.
SAMPLING
- distinguish one person from another but are relatively less enduring. Like a trait, a psychological state exists only as a construct.
STATES
-refers to a source of error in measuring a variable that is typically constant or proportionate to what is presumed to be the true value of the variable being measured.
SYSTEMATIC ERROR
- the test developer targets some defined group as the population for which the test is designed.
Sampling
generally contain items of a uniform level of difficulty so that, when given generous time limits, all test takers should be able to complete all the test items correctly
Speed tests
is obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once.
Split-half reliability
is the process of administering a test to a representative sample of test takers for the purpose of establishing norms.
Standardization or test standardization
a generalizability study examines how much of an impact different facets of the universe have on the test score.
Stated in the language of generalizability theory
a trait, state or ability presumed to be relatively unchanging overtime; contrast with dynamic.
Static characteristics
is the process of developing a sample based on specific subgroups of a population
Stratified sampling
is the process of developing a sample based on specific subgroups of a population in which every member has the same chance of being included in the sample.
Stratified-random sampling
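A stratified-random draw can be sketched with the standard library; the strata names and the 10% sampling fraction below are invented for illustration:

```python
import random

# Hypothetical population divided into three strata (subgroups)
population = {
    "freshmen":   [f"F{i}" for i in range(50)],
    "sophomores": [f"S{i}" for i in range(30)],
    "juniors":    [f"J{i}" for i in range(20)],
}

random.seed(0)  # for reproducibility only
sample = []
for stratum, members in population.items():
    n = len(members) // 10                    # sample 10% of each stratum
    sample.extend(random.sample(members, n))  # equal chance within the stratum
```

Because each stratum is sampled in proportion to its size and every member within a stratum has an equal chance of selection, the result is a stratified-random sample.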
Assumption #7
Testing and Assessment Benefit Society. In a world without tests, teachers and school administrators could arbitrarily place children in different types of special classes simply because that is where they believed the children belonged. In a world without tests, there would be a great need for instruments to diagnose educational difficulties in reading and math and point the way to remediation. The criteria for a good test would include clear instructions for administering, scoring, and interpretation. It would also seem to be a plus if a test offered economy in time and money it took to administer, score, and interpret it.
Assumption #6
Testing and Assessment Can Be Conducted in a Fair and Unbiased Manner. Today all major test publishers strive to develop instruments that are fair when used in strict accordance with guidelines in the test manual. One source of fairness-related problems is the test user who attempts to use a particular test with people whose background and experience differ from the background and experience of the people for whom the test was intended.
True
The extent to which a test taker's score is affected by the content sampled on a test, and by the way the content is sampled, is a source of error variance.
a value that, according to classical test theory, genuinely reflects an individual's ability level as measured by a particular test.
True score
A test is considered valid for a particular purpose if it does, in fact, measure what it purports to measure. A test of reaction time is a valid test if it accurately measures reaction time. A test of intelligence is a valid test if it truly measures intelligence. Other considerations
VALIDITY
The degree of relationship between various forms of a test can be evaluated by means of an alternate-forms or parallel-forms coefficient of reliability, which is often termed the
coefficient of equivalence
The influence of particular facets on the test score is represented by
coefficients of generalizability
In the decision study, developers examine the usefulness of test scores in helping the test user make decisions.
- is a trait state, or ability presumed to be ever-changing as a function of situational and cognitive experiences.
dynamic characteristic
variance from irrelevant, random sources
error variance
Split-half reliability is also referred to as
odd-even reliability
Tests are said to be homogeneous if
they contain items that measure a single trait.
The simplest way of determining the degree of consistency among scorers in the scoring of a test is
to calculate a coefficient of correlation, known as the coefficient of inter-scorer reliability.
A statistic useful in describing sources of test score variability is the
variance (the standard deviation squared).
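The relationship between the variance and the standard deviation can be verified with a few invented scores:

```python
from math import sqrt

scores = [70, 75, 80, 85, 90]  # invented raw scores
mean = sum(scores) / len(scores)

# Variance: mean squared deviation from the mean
var = sum((s - mean) ** 2 for s in scores) / len(scores)

# Standard deviation: square root of the variance,
# so the variance is the standard deviation squared
sd = sqrt(var)
```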