PSYC 461 Exam 1 Study guide
Types of tests: individual v. group
- Individual tests are instruments that by their design and purpose must be administered one on one
- Group tests are largely pencil-and-paper measures suitable to the testing of large groups of persons at the same time
Effects of test anxiety
- Test anxiety is negatively correlated with school achievement, aptitude test scores, and measures of intelligence.
- One possibility is that students develop test anxiety because of a history of performing poorly on tests.
- Test anxiety is likely both cause and effect in the equation linking it with poor test performance.
- Tests with narrow time limits seem to exacerbate the degree of personal threat, causing significant reductions in the performance of test-anxious persons.
Factors affecting reliability: item difficulty
- If items are too easy or too difficult, everyone scores about the same and the test cannot detect consistent individual differences; use items of moderate difficulty.
Constructing the items: floor effects v. ceiling effects
- Ceiling effect = too many high scores; on an ability test, too few hard questions; difficult to discriminate
- Floor effect = too many low scores; on an ability test, too few easy questions; difficult to discriminate
Writing good multiple choice questions
- Don't use many opinion questions.
- It's typical to have four or five answer choices.
- If there is a natural order to the options, use it.
- All options should be plausible to those who do not know the answer.
- Minimize the use of negative expressions (such as 'not').
- Avoid determiners such as "always" and "never."
- Do not use "all of the above" or "none of the above."
Transformed scores (standard scores) + mean and standard deviation: percentile rankings
- Expresses the percentage of scores that fall below a specific raw score; tells you how someone scored relative to others.
- The familiar SD/percentile pairings (e.g., 1 SD above the mean = 84th percentile) apply only when the distribution is normal.
- Solve: PR = (nL / N) x 100, where nL = number of scores in the distribution lower than the score being transformed and N = total number of scores in the distribution.
- Drawback: transformation to percentiles distorts the underlying measurement scale, so equal percentile differences do not reflect equal raw-score intervals (the raw-score difference between the 55th and 59th percentiles is not as big as that between the 95th and 99th).
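A minimal Python sketch of the percentile-rank formula above (the score distribution is made up for illustration):

```python
def percentile_rank(score, scores):
    """PR = (nL / N) * 100: the percentage of scores below a given raw score."""
    n_lower = sum(1 for s in scores if s < score)   # nL
    return 100 * n_lower / len(scores)              # N = total number of scores

scores = [2, 4, 5, 7, 8, 8, 9, 10, 12, 15]          # hypothetical distribution
print(percentile_rank(8, scores))                   # 40.0 -- 4 of 10 scores fall below 8
```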
Guidelines for evaluating reliability - research purposes v. individual decisions
- For research purposes, the reliability cutoff is at least .65 (rxx = .65-1.00).
- For making important decisions about individuals, the cutoff is at least .90 (rxx = .90-1.00).
Galton and the Brass Testing Era
- Human abilities came to be tested in laboratories, using objective procedures capable of replication.
- The problem was that the early experimental psychologists mistook simple sensory processes for intelligence. They used assorted brass instruments to measure sensory thresholds and reaction times, thinking that such abilities were at the heart of intelligence.
- Galton borrowed the time-consuming psychophysical procedures and adapted them into a series of simple and quick sensorimotor measures. He thus continued the brass-instruments tradition of mental testing but with an important difference: his measures were quick, practical, and amenable to studying individual differences. For this reason, historians of psychological testing usually regard Galton as the father of mental testing.
Types of reliability and sources of unsystematic error: Other methods: KR20
- KR-20 assumes items vary in difficulty (e.g., an IQ test). The KR formulas are used for dichotomously scored (e.g., T/F) tests.
Types of reliability and sources of unsystematic error: Other methods: KR21
- KR-21 assumes items are the same in difficulty (e.g., a personality inventory). The KR formulas are used for dichotomously scored (e.g., T/F) tests.
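A sketch of both Kuder-Richardson formulas, assuming a NumPy array of 0/1 item scores (rows = examinees, columns = items); the function names are mine:

```python
import numpy as np

def kr20(items):
    """KR-20 for dichotomously scored (0/1) items; allows item
    difficulties to vary across items."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                   # number of items
    p = items.mean(axis=0)               # proportion passing each item
    total_var = items.sum(axis=1).var()  # variance of examinees' total scores
    return (k / (k - 1)) * (1 - (p * (1 - p)).sum() / total_var)

def kr21(items):
    """KR-21 shortcut: assumes all items are equal in difficulty, so only
    the mean and variance of total scores are needed."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    totals = items.sum(axis=1)
    m, var = totals.mean(), totals.var()
    return (k / (k - 1)) * (1 - m * (k - m) / (k * var))
```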
Standard error of measurement (relationship between SEM and reliability)
- SEM is the index of how much, on average, an individual's score might vary if they were to repeatedly take a test.
- SEM has an inverse relationship with reliability: the more reliable the test, the less error there is on average.
- SEM is the standard deviation of the hypothetical distribution of test scores we would get for an individual who repeatedly took the test.
- Reliable tests have small SEM values and little variation with repeat testing.
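The standard formula connecting SEM to reliability is SEM = SD * sqrt(1 - rxx); a quick sketch (the SD and reliability values below are illustrative):

```python
import math

def sem(sd, rxx):
    """SEM = SD * sqrt(1 - rxx): higher reliability -> smaller SEM."""
    return sd * math.sqrt(1 - rxx)

print(round(sem(15, 0.90), 2))  # 4.74 on an IQ-type scale (SD = 15)
print(round(sem(15, 0.65), 2))  # 8.87 -- the less reliable test has more error
```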
Types of reliability and sources of unsystematic error: Other methods: Spearman-Brown
- Split-half correlations estimate reliability for a test that is only half as long as the one actually taken; longer tests are generally more reliable.
- The Spearman-Brown (rSB) formula is used to make a statistical correction, elevating the half-test correlation to approximate parallel-forms reliability for the full-length test.
Types of reliability and sources of unsystematic error: Other methods: Split-half
- Split-half method: treats a single test as if it consists of two equivalent halves and assumes scores on the two halves should be the same.
- Correlate the two halves of the test; any differences (error) are due to the sample of questions.
- Can be used to estimate parallel-forms reliability.
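A sketch of the split-half procedure with the Spearman-Brown correction applied (odd/even split; assumes a NumPy array of item scores, rows = examinees):

```python
import numpy as np

def split_half_reliability(items):
    """Correlate odd-item and even-item half scores, then apply Spearman-Brown:
    rSB = 2 * rhh / (1 + rhh), the full-length reliability estimate."""
    items = np.asarray(items, dtype=float)
    odd  = items[:, 0::2].sum(axis=1)   # half score on items 1, 3, 5, ...
    even = items[:, 1::2].sum(axis=1)   # half score on items 2, 4, 6, ...
    rhh = np.corrcoef(odd, even)[0, 1]  # correlation between the two halves
    return 2 * rhh / (1 + rhh)
```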
Types of reliability and sources of unsystematic error: Other methods (internal consistency coefficients)
- Split-half method: treats a single test as two equivalent halves; correlate the halves; any differences (error) are due to the sample of questions; can be used to estimate parallel forms.
- Alpha coefficient: provides the mean reliability for all possible split-half combinations.
- KR-20: assumes items vary in difficulty (e.g., IQ).
- KR-21: assumes items are the same in difficulty (e.g., personality).
- The KR formulas are used for dichotomously scored tests: T/F and correct/incorrect (i.e., multiple choice).
Selecting norm groups (age, grade, ethnic group, SES, ...)
- The age span for a normative age group can vary from a month to a decade or more, depending on the degree to which test performance is age-dependent.
- Grade groups work similarly, but are organized by school grade rather than chronological age.
How the examiner can influence test results: gender, experience, and ethnicity
-The results are contradictory and, therefore, inconclusive. Most studies find that sex, experience, and race of the examiner make little, if any, difference.
How the examiner can influence test results: rapport
- Rapport is a comfortable, warm atmosphere that serves to motivate examinees and elicit cooperation.
- A tester who fails to establish rapport may cause a subject to react with anxiety, passive-aggressive noncooperation, or open hostility.
- Failure to establish rapport distorts test findings: ability is underestimated and personality is misjudged.
- Rapport is especially important in individual testing, and particularly so when evaluating children.
Major uses of testing: classification
- Assigning a person to one category rather than another, such as granting or restricting access to a specific college or determining whether a person is hired for a particular job.
Major uses of testing: placement
- Placement is the sorting of persons into different programs appropriate to their needs or skills. For example, universities often use a mathematics placement exam to determine whether students should enroll in calculus, algebra, or remedial courses.
Cattell's testing in the U.S.
- Cattell measured with great precision the fractions of a second presumably required for different mental reactions.
- Cattell (1890) invented the term mental test in his famous paper entitled "Mental Tests and Measurements."
- In Cattell's view, an ostensibly physiological measure such as dynamometer pressure was an index of one's mental power as well.
Factors affecting reliability: number of test items
- The more test items, the better: more items produce greater score variance, and greater variance supports higher reliability. More items also give more opportunity to gauge the consistency of someone's answers.
Leta Stetter Hollingworth and Testing for Giftedness
- Spent her short career focusing on the psychology of genius.
- Found that the most exceptionally gifted children showed significantly greater school achievement than those of mere ordinary genius, and dispelled the belief that gifted children should not be moved ahead in school because they would lag behind older children in penmanship and other motor skills.
- Was also a feminist who attributed gender differences in eminence and achievement to social and cultural influences.
Rudimentary Forms of Testing in China 2200 B.C.
- The Chinese emperor had his officials examined every third year to determine their fitness for office.
- The procedures were refined over the centuries until written exams were introduced in the Han dynasty (202 B.C.-A.D. 200). Five topics were tested: civil law, military affairs, agriculture, revenue, and geography.
- The Chinese failed to validate their selection procedures; nonetheless, it does appear that the examination program incorporated relevant selection criteria.
Methods of scaling: method of equal appearing intervals
Method of Equal Appearing Intervals (Thurstone): a method for constructing attitude scales, e.g., a scale measuring attitude toward physical exercise (text, pp. 119-120).
Correlation: Spearman
Used with two ordinal variables.
- Variable 1: rank-ordered, e.g., political orientation (far left, left, center, right, far right); Olympic medal
- Variable 2: rank-ordered, e.g., Definitely No / No / Maybe / Yes / Definitely Yes; income in $ (ratio converted to ordinal)
stratified random sampling
A form of probability sampling; a random sampling technique in which the researcher identifies particular demographic categories of interest and then randomly selects individuals within each category.
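A toy sketch of the idea, using a hypothetical population with made-up strata (region labels); standard library only:

```python
import random

# Hypothetical population: (person_id, region) pairs; the regions are the strata
population = [(i, random.choice(["NE", "MW", "S", "W"])) for i in range(1000)]

def stratified_sample(pop, per_stratum):
    """Group the population by stratum, then randomly sample within each."""
    strata = {}
    for person, group in pop:
        strata.setdefault(group, []).append(person)
    return {g: random.sample(members, per_stratum) for g, members in strata.items()}

sample = stratified_sample(population, per_stratum=25)   # 25 people per region
```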
Relationship between reliability and validity
A test can be reliable without being valid, BUT a test cannot be valid unless it is also reliable. Reliability puts a statistical cap on validity: a test's validity coefficient cannot exceed the square root of its reliability.
Importance of standardization
A test is considered to be standardized if the procedures for administering it are uniform from one examiner and setting to another. - Nonstandard testing procedures can alter the meaning of the test results, rendering them invalid and, therefore, misleading.
Types of tests: aptitude v. achievement
Aptitude tests: measure the capability for a relatively specific task or type of skill; aptitude tests are, in effect, a narrow form of ability testing. They test your potential for future success.
Achievement tests: measure a person's degree of learning, success, or accomplishment in a subject or task. They test the learned abilities you have acquired in the past.
Types of tests: creativity
Assess novel, original thinking and the capacity to find unusual or unexpected solutions, especially for vaguely defined problems.
Content validity v. face validity
- Content validity is assessed from the perspective of experts; face validity is assessed from the experience of the test takers.
- Content validity: particularly important for achievement tests and other tests that have a well-defined domain of content, e.g., math, reading, science, history.
- Face validity: addresses whether the test appears valid from the perspective of the test taker, based on test content. But face validity is actually unrelated to whether the test is truly valid.
Binet and Simon's Tests
- Binet and his colleague Simon were called on to develop a practical tool for assessing the intelligence of children. Binet introduced the scale in 1905, intended to help identify children with special needs; it was published and revised a few years later.
- Tests were grouped and administered by age level.
- In a third revision of the Binet-Simon scales, each age level had exactly five tests, and the scale was extended into the adult range.
- In 1916, Terman and his associates at Stanford revised the Binet-Simon scales, producing the Stanford-Binet.
- After Binet's death, Simon called the concept of IQ a "betrayal" of their scale's original objectives.
Criterion-related validity: concurrent v. predictive
Both are forms of criterion-related validity.
- Concurrent validity: the test is correlated with a criterion measure that is available at the time of testing.
- Predictive validity: the test is correlated with a criterion that becomes available in the future.
Trinitarian model (know different types of validity and when each type is appropriate)
- Content validity: are items on the test a good representative sample of the domain we are measuring? That is, are the items representative of the universe of skills and behaviors that the test is supposed to measure?
- Criterion-related validity: the extent to which the test correlates with non-test behavior(s), called criterion variables. This type of validity is most important when a test is being used to make a prediction.
- Construct validity: the most comprehensive type of validity, subsuming content and criterion-related validity. It involves the theoretical meaning of test scores: are test scores consistent with what we would expect to find based on our theory (understanding) of the construct? Construct validity is most important for tests that do NOT have clear, easily identifiable content or a single criterion adequate to define or describe the construct being assessed. It is particularly relevant for measuring psychological "constructs" such as personality (e.g., MMPI), intelligence (e.g., WAIS), and leadership ability. Construct validity is the unifying concept of test validity.
The Army Alpha and Army Beta Tests
Developed in World War I for selection and placement of recruits.
- The Army Alpha was based on the then-unpublished work of Otis (1918) and consisted of eight verbally loaded tests for average and high-functioning recruits: (1) following oral directions, (2) arithmetical reasoning, (3) practical judgment, (4) synonym-antonym pairs, (5) disarranged sentences, (6) number series completion, (7) analogies, (8) information.
- The Army Beta was a nonverbal group test designed for use with illiterates and recruits whose first language was not English. It consisted of various visual-perceptual and motor tests, such as tracing a path through mazes and visualizing the correct number of blocks depicted in a three-dimensional drawing.
Correlation: effect of restricting the range
- Don't eliminate outliers arbitrarily: by restricting the range of either variable, you can dramatically change the correlation, typically attenuating (lowering) it.
Methods of scaling: Expert rankings
Expert rankings: assign numbers according to set criteria indicating where the examinee falls on the construct being measured, e.g., the Glasgow Coma Scale (Eyes open: 4 = spontaneously, 3 = to speech, 2 = to pain, 1 = none).
Construct validity (know the various sources that are used to determine Construct validity)
- Expert opinion: looking for agreement in the field; specifically supports content validity.
- Internal consistency estimates (reliability): looking for items that inter-correlate.
- Experimental and observational research: looking for empirical support, particularly group differences.
- Developmental change: looking for a relation between test scores and maturation, e.g., abstract reasoning ability in older vs. younger kids.
- Test performance related to intervention: looking for clinical utility and consistency with theory, e.g., a decrease in depression test scores following successful cognitive-behavior therapy in depressed patients.
- Factor-analytic studies: identify distinct and related factors in the test, e.g., spatial ability and verbal ability on an intelligence test.
- Inter-correlations among tests: looking for discrimination or congruence in test scores (convergent validity and discriminant validity).
Item response theory: item response functions (item characteristic curves)
For four test items plotted as item characteristic curves:
- x-axis: standard scores, ranging from -3 to +3; zero represents the average amount of the trait (e.g., intelligence).
- Item A: easiest item (of the 4), passed by almost everyone, including those with small amounts of the trait.
- Item D: most difficult item; only those with high amounts of the trait answer correctly.
- Items B, C: equal in difficulty, answered correctly by 50% of examinees.
- At a given trait level, the item with the steeper curve (Item C) discriminates better than the item with the flatter curve (Item B).
Reliability and Classical Test Theory (X = T + E)
For every score X we obtain for a person on a test, there will be some degree of error: the observed score X equals the true score T plus unsystematic error E (X = T + E).
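A tiny simulation of the model (the true score and error spread below are invented):

```python
import random

true_score = 50   # T: fixed, but unknowable in practice
observed = [true_score + random.gauss(0, 4) for _ in range(5)]  # X = T + E
print(observed)   # the X values scatter around T because of random error E
```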
Sampling: cluster
Naturally occurring groups of individuals are sampled as intact units, with members assumed to share similar characteristics. Example: clusters by school, club, housing unit, or geographic area.
Methods of scaling: Guttman scale
Guttman scale: it is assumed that if one statement in an ordered sequence is endorsed, all milder or lesser statements apply as well. e.g., The client is able to:
a ( ) plan and prepare a meal on her own
b ( ) plan and prepare a meal with some assistance
c ( ) prepare a meal but must be given the ingredients
d ( ) prepare a meal but needs assistance
e ( ) cannot prepare a meal
Major uses of testing: self-knowledge
In some cases, the feedback a person receives from psychological tests can change a career path or otherwise alter a person's life course.
Item response theory: item information functions
Item A: useful only for individuals low on the trait (at high trait levels, everyone answers correctly and no information is gained). Item D: useful only for individuals high on the trait (at lower trait levels, everyone answers incorrectly and no information is gained).
Item analysis: item difficulty v. item discrimination indexes
Item difficulty (p): expresses the percentage or proportion of examinees who answered an item correctly.
Item discrimination (d): tells us whether items are capable of discriminating between high and low scorers. Procedure: divide examinees into upper- and lower-scoring groups based on total test scores; d is the difference between the proportions of each group passing the item.
Item analysis: Item validity v. item reliability
Item reliability: assesses the extent to which an item contributes to the overall assessment of the construct being measured. If most people who get an item right do poorly on the test overall, and most people who get it wrong do well on the test overall, it's a bad item.
Item validity: assesses the extent to which a given item correlates with a measure of the criterion you are trying to predict with the test. Item validity is determined with a correlation (point biserial, rpb) computed between item score and criterion score.
Methods of scaling: Likert scale
Likert Scales: Often used for scaling attitudes. Typically uses an odd number (e.g., 5, 7) of ordered responses; ordinal or interval scale but often treated as interval without evidence; sometimes referred to as a summative scale; central tendency bias is sometimes present
Transformed scores: standardized scores + mean and standard deviation: IQ
M = 100, SD = 15; IQ = (z)(15) + 100
Transformed scores: standard scores + mean and standard deviation: z scores
Mean = 0, SD = 1; z = (X - M)/SD. Drawbacks: negative values and decimals.
Transformed scores: standardized scores + mean and standard deviation: T score
Mean = 50, SD = 10; T = (z)(10) + 50. To approximate a percentile for a T score halfway between whole SDs, average the two neighboring percentiles: (percentile 1 + percentile 2)/2, e.g., T = 65 (1.5 SD above the mean) -> (84 + 98)/2 = 91.
Transformed scores: standardized scores + mean and standard deviation: CEEB
Mean = 500, SD = 100; CEEB = (z)(100) + 500
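All four standard-score transformations follow the same pattern, new score = (z)(new SD) + new mean; a sketch using a hypothetical raw test with M = 65 and SD = 10:

```python
def z_score(x, mean, sd):
    return (x - mean) / sd            # standard score: M = 0, SD = 1

def t_score(z):  return z * 10 + 50   # M = 50,  SD = 10
def iq_score(z): return z * 15 + 100  # M = 100, SD = 15
def ceeb(z):     return z * 100 + 500 # M = 500, SD = 100

z = z_score(75, mean=65, sd=10)       # raw score of 75 on the hypothetical test
print(z, t_score(z), iq_score(z), ceeb(z))  # 1.0 60.0 115.0 600.0
```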
Types of tests: intelligence tests
Measure an individual's ability in relatively global areas such as verbal comprehension, perceptual organization, or reasoning and thereby help determine potential for scholastic work or certain occupations.
Types of tests: interest inventories
Measure an individual's preference for certain activities or topics and thereby help determine occupational choice.
Types of tests: neuropsychological tests
Measure cognitive, sensory, perceptual, and motor performance to determine the extent, locus, and behavioral consequences of brain damage.
Types of tests: personality
Measure the traits, qualities, or behaviors that determine a person's individuality; such tests include checklists, inventories, and projective techniques.
Distributions: normal distribution v. skewed distributions
- Normal distribution: as more scores accumulate, the distribution more and more closely resembles a symmetrical, mathematically defined, bell-shaped curve. The normal curve has useful mathematical features that form the basis for several kinds of statistical investigation, it has mathematical precision, and it often arises spontaneously in nature.
- Skewness: refers to the symmetry or asymmetry of a frequency distribution. If test scores are piled up at the low end of the scale, the distribution is positively skewed; when test scores are piled up at the high end, the distribution is negatively skewed.
Types of tests: behavioral procedures
Objectively describe and count the frequency of a behavior, identifying the antecedents and consequences of the behavior.
Item analysis: know how to interpret p and d values
p interpretations:
- p < lower bound -> difficult
- lower bound < p < upper bound -> moderately difficult
- p > upper bound -> easy
d interpretations:
- d = .30+ : item is acceptable (good discriminator)
- d = .20-.29 : item is marginal and should be revised (only minor revision required)
- d = .10-.19 : very marginal; revise
- d < .09 : probably a poor item; major revision (if the d value is negative, consider discarding)
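A sketch of how the p and d values these guidelines interpret are computed; assumes a 0/1 item-score matrix (rows = examinees) and uses the common upper/lower 27% convention for the score groups:

```python
import numpy as np

def item_p_and_d(items, frac=0.27):
    """Difficulty p = proportion answering each item correctly;
    discrimination d = p(upper group) - p(lower group), with groups
    formed from the top and bottom 27% of total scores."""
    items = np.asarray(items, dtype=float)
    totals = items.sum(axis=1)
    n = max(1, int(round(frac * len(totals))))
    order = np.argsort(totals)
    lower, upper = order[:n], order[-n:]   # bottom and top scorers
    p = items.mean(axis=0)
    d = items[upper].mean(axis=0) - items[lower].mean(axis=0)
    return p, d
```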
Types of reliability and sources of unsystematic error: Parallel forms
Parallel (alternate) forms reliability:
- The sample of questions is the source of error; if administration of the second form is delayed (Form B given later), time of administration becomes an additional source of error.
- Scores on the two forms are correlated to determine the reliability.
- Assumes scores on the two forms should be the same or similar.
Correlation: different types of correlation
Pearson r, Spearman rank order, Phi, Point Biserial, Biserial
Speed and power tests
- Power test: a test that allows enough time for test takers to attempt all items; however, the test is difficult enough that no test taker is able to obtain a perfect score.
- Speed test: typically contains items of uniform and generally simple levels of difficulty; most subjects should be able to complete most or all of the items, with time limits constraining how many are reached.
- For speed tests, the traditional split-half approach (comparing odd and even items) will yield a spuriously high reliability coefficient.
Incremental validity
Relates to the amount of increase in predictive accuracy gained by adding a new test to a battery of tests: can we account for more of the variance in the criterion by adding the test as a predictor?
Methods of scaling: method of empirical keying
- Scales can be developed by experts or by empirical keying.
- Empirical keying is unique in that items are selected based entirely on empirical considerations, without regard to theory or expert judgment: items are chosen based on how well they distinguish a criterion group from a normative sample (e.g., a depression scale).
- Empirically keyed tests tend to be highly valid, but the method can lead to the inclusion of unexpected items.
Item analysis: know how to compute optimal item difficulty
Solve:
1) Difficulty: p = (number of test takers who passed the item) / (total number of test takers)
2) Optimal difficulty = (1.0 + g) / 2, where g = chance success level (e.g., .25 with 4 answer choices)
3) Lower bound: Lb = [1 + 1.645 * sqrt((k - 1) / N)] / k, where k = number of answer choices and N = total number of test takers
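The same computations in Python (optimal difficulty halfway between chance and 1.0, plus the 1.645 lower-bound formula as reconstructed above); the k and N values in the example are made up:

```python
import math

def optimal_difficulty(k):
    """Optimal p lies halfway between chance (g = 1/k) and 1.0: (1 + g) / 2."""
    return (1 + 1 / k) / 2

def lower_bound(k, N):
    """Lb = [1 + 1.645 * sqrt((k - 1) / N)] / k -- a one-tailed 95% bound
    above chance-level performance."""
    return (1 + 1.645 * math.sqrt((k - 1) / N)) / k

print(optimal_difficulty(4))           # 0.625 for four answer choices (g = .25)
print(round(lower_bound(4, 100), 3))   # 0.321 for k = 4 choices, N = 100 examinees
```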
Standard error of estimate (how SEest differs from SEM)
The standard error of estimate (SEest) indexes the error when predicting a criterion score from a test score; the SEM indexes the error already present in the observed test score itself.
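Side by side, the two standard formulas (variable names are mine):

```python
import math

def se_est(sd_y, rxy):
    """SEest = SDy * sqrt(1 - rxy^2): margin of error around a criterion
    score predicted from a test score."""
    return sd_y * math.sqrt(1 - rxy ** 2)

def sem(sd_x, rxx):
    """SEM = SDx * sqrt(1 - rxx): error in the observed score itself."""
    return sd_x * math.sqrt(1 - rxx)
```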
Types of reliability and sources of unsystematic error: Test-retest (time of administration)
Test-retest reliability: time of administration is the source of error. Scores from the two administrations are correlated. Assumes your true score won't change between administrations.
Item analysis: analysis of distractors
The purpose of distractors is to offer alternatives that sound reasonable to people who do not know the correct answer
True scores v. Observed scores
True score: the true amount of the trait being assessed. The true score cannot be known, only estimated.
Observed score: what a person actually scores on a test, typically on one administration; it is our best estimate of the true score.
What is validity?
Validity = accuracy: the degree to which a test measures what it claims to measure. For example, if the Peabody test were interpreted as a measure of general intelligence, that interpretation would lack validity, because the test measures only receptive vocabulary.
Correlation: Biserial
- Variable 1: artificial dichotomy, e.g., SAT score (high/low)
- Variable 2: interval or ratio, e.g., cumulative college GPA (interval)
Correlation: Pearson
- Variable 1: interval or ratio, e.g., SAT score (interval)
- Variable 2: interval or ratio, e.g., IQ score (interval)
Correlation: Point Biserial
- Variable 1: true dichotomy, e.g., took a GRE prep class (Y/N)
- Variable 2: interval or ratio, e.g., GRE score (interval)
Correlation: Phi
- Variable 1: true dichotomy, e.g., took a GRE prep class (Y/N)
- Variable 2: true dichotomy, e.g., got into a PhD program (Y/N)
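A sketch using scipy.stats to compute the coefficient matching each variable pairing above (the data are invented; note scipy has no built-in biserial coefficient, and phi can be obtained by applying Pearson to two true dichotomies):

```python
import numpy as np
from scipy import stats

sat   = np.array([1100, 1250, 1300, 1420, 980, 1510])   # interval
gpa   = np.array([2.8, 3.1, 3.3, 3.6, 2.5, 3.8])        # interval
prep  = np.array([0, 0, 1, 1, 0, 1])                    # true dichotomy (Y/N)
admit = np.array([0, 1, 1, 1, 0, 1])                    # true dichotomy (Y/N)

r, _   = stats.pearsonr(sat, gpa)         # Pearson: interval x interval
rho, _ = stats.spearmanr(sat, gpa)        # Spearman: computed on the rank orders
rpb, _ = stats.pointbiserialr(prep, gpa)  # point biserial: dichotomy x interval
phi, _ = stats.pearsonr(prep, admit)      # phi: Pearson applied to two dichotomies
print(r, rho, rpb, phi)
```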
The Normal Curve Properties relevant to testing
We will use approximations for the percentage of scores falling between points on the curve: about 68% within +/- 1 SD, 95% within +/- 2 SDs, and 99% (more precisely, 99.7%) within +/- 3 SDs. Note that less than 1% of cases fall outside the area +/- 3 SDs from the mean.
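The exact areas can be checked with the normal CDF from scipy:

```python
from scipy.stats import norm

for k in (1, 2, 3):
    area = norm.cdf(k) - norm.cdf(-k)         # proportion within +/- k SDs
    print(f"+/- {k} SD: {area * 100:.1f}%")   # 68.3%, 95.4%, 99.7%
```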
Standard error of difference
The standard error of difference is a statistical measure that helps a test user determine how large a difference between two scores must be before it is considered statistically significant.
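One common formula (for two scores on the same scale) combines the two tests' SEMs; the SD and reliability values below are illustrative:

```python
import math

def se_diff(sd, r11, r22):
    """SEdiff = SD * sqrt(2 - r11 - r22), for two scores on the same scale;
    equivalent to sqrt(SEM1^2 + SEM2^2)."""
    return sd * math.sqrt(2 - r11 - r22)

# e.g., two subtest scores with SD = 15 and reliabilities .90 and .85
print(se_diff(15, 0.90, 0.85))  # 7.5 -- differences much smaller than this
                                # are likely measurement noise
```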
Major uses of testing: program evaluation
a use for psychological tests is the systematic evaluation of educational and social programs.
Systematic v. Unsystematic error
Both are errors in measurement.
Systematic error:
- affects validity only
- affects test takers' scores in a predictable way
- can occur when a test consistently measures something other than the trait it was intended to measure, or when there is scoring error
- its effect is consistent across examinees
Unsystematic error:
- affects reliability, which in turn affects validity
- arises from random or chance influences that have different effects on different examinees
- cannot be measured directly or corrected
Item response theory: computerized adaptive tests
Item selection adapts as the examinee takes the test: after each response, the computer updates its estimate of the examinee's trait level and selects the next item accordingly (e.g., a harder item after a correct answer, an easier item after an incorrect one).
Major uses of testing
classification, placement, screening, certification, diagnosis, self-knowledge, program evaluation, research
Major uses of testing: diagnosis
consists of two intertwined tasks: determining the nature and source of a person's abnormal behavior, and classifying the behavior pattern within an accepted diagnostic system. Is usually a precursor to remediation or treatment of personal distress or impaired performance.
Correlation: deterministic v. statistical
- Deterministic: a perfect correlation, r = 1.0 or -1.0; a single straight line can be drawn through all the data points. This is highly unusual.
- Statistical: the closer the scores are to a straight line, the higher the correlation (r) in absolute value, e.g., when r = -.8, the scores fit a straight line more closely than when r = .6.
simple random sampling
Every member of the population has an equal probability of being selected for the sample; ideal for gathering normative data.
Major uses of testing: certification
Certification exams have a pass/fail quality: passing confers privileges, such as the right to practice psychology or to drive a car. Certification thus typically implies that a person has at least a minimum proficiency in some discipline or activity.
Reliability of criterion-referenced tests
How consistently the test classifies individuals as masters or nonmasters. Because of the structure of criterion-referenced tests, the variability of scores among examinees is typically quite minimal; under these conditions, traditional approaches to the assessment of reliability (which depend on score variance) are simply inappropriate.
Correlation: linear v. curvilinear
Linear: data fall along a straight line (positive or negative). Curvilinear: increases in one variable are associated with systematic increases and then decreases (or vice versa) in another variable.
Descriptive statistics: what they tell us and how they are used: measures of central tendency
- Mean: the average; calculated by adding all the scores and dividing by N, the number of scores. The mean can be problematic because it is pulled toward the longer tail in skewed distributions.
- Median: the middlemost score when all the scores have been ranked. If the number of scores is even, the median is the average of the two middlemost scores.
- Mode: the most frequently occurring score. If two scores tie for highest frequency of occurrence, the distribution is said to be bimodal.
Distributions: kurtosis
Kurtosis concerns the tails of a distribution. When kurtosis is high (leptokurtic), more of the variance is the result of infrequent extreme deviations from the mean (outliers), as opposed to frequent moderate deviations from the mean (as in platykurtic distributions). Three types: leptokurtic (heavy-tailed), mesokurtic (normal), platykurtic (light-tailed).
Norm-referenced vs. Criterion-referenced tests (when each is used)
- Norm-referenced tests classify examinees, from low to high, across a continuum of ability or achievement; used to rank students along a continuum in comparison to one another (e.g., IQ, SAT, GRE, MCAT).
- Criterion-referenced tests compare examinees' accomplishments to a predefined performance standard (e.g., classroom tests graded against a standard; pass/no-pass tests).
Major uses of testing: research
play a major role in both the applied and theoretical branches of behavioral research.
Norm group: expectancy tables
portrays the established relationship between test scores and expected outcome on a relevant task. -Expectancy tables are especially useful with predictor tests used to forecast well-defined criteria. For example, an expectancy table could depict the relationship between scores on a scholastic aptitude test (predictor) and subsequent college grade point average (criterion).
Correlation: positive v. negative
Positive: an increase in one variable is associated with an increase in the other.
Negative: an increase in one variable is associated with a decrease in the other.
Types of reliability and sources of unsystematic error: Other methods: coefficient alpha
Provides the mean reliability for all possible split-half combinations.
- KR-20: assumes items vary in difficulty (e.g., IQ).
- KR-21: assumes items are the same in difficulty (e.g., personality).
- The KR formulas are used for dichotomously scored tests (T/F and correct/incorrect, i.e., multiple choice).
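A sketch of coefficient alpha from its variance definition (assumes a NumPy item-score matrix, rows = examinees; works for dichotomous or multi-point items):

```python
import numpy as np

def cronbach_alpha(items):
    """alpha = k/(k-1) * (1 - sum of item variances / total-score variance);
    equals the mean of all possible split-half reliability estimates."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0).sum()   # sum of the individual item variances
    total_var = items.sum(axis=1).var()   # variance of examinees' total scores
    return (k / (k - 1)) * (1 - item_vars / total_var)
```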
How the examiner can influence test results?
rapport, gender, experience, ethnicity
Major uses of testing: screening
refers to quick and simple tests or procedures to identify persons who might have special characteristics or needs.
Distributions: skewness
Skewed distributions usually signify that the test developer has included too few easy items or too few hard items. If scores are massed at the low end (positive skew), the test probably contains too few easy items to make effective discriminations at that end of the scale.
asymptote
The region where the curve approaches but never touches a reference line; e.g., the tails of the normal curve are asymptotic to the x-axis, leaving a shrinking space between the x-axis and the curve.
Descriptive statistics: what they tell us and how they are used: variability
Standard deviation, designated as s or abbreviated as SD. From a conceptual standpoint, the standard deviation reflects the degree of dispersion in a group of scores: if the scores are tightly packed around a central value, the standard deviation is small. The variance and the standard deviation convey interchangeable information; one can be computed from the other (square the standard deviation to obtain the variance).
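Python's statistics module covers all of these measures directly (the score list is invented):

```python
import statistics as st

scores = [4, 5, 5, 6, 7, 8, 8, 8, 10]
print(st.mean(scores), st.median(scores), st.mode(scores))  # ~6.78, 7, 8
print(st.pstdev(scores))     # population standard deviation
print(st.pvariance(scores))  # variance = the standard deviation squared
```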
Types of error affecting validity
Systematic and unsystematic error (see "Systematic v. Unsystematic error" above): systematic error affects validity directly, while unsystematic error lowers reliability and thereby caps validity.
Guidelines for testing time
- True/false: 30 sec per item
- Multiple choice: 45-60 sec per item
- Essay: 6-8 min per item
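A quick arithmetic helper built from these per-item guidelines (using the midpoint of the 45-60 sec multiple-choice range and 7 min per essay; the item counts are examples):

```python
# Estimated exam length in minutes from item counts
def estimated_minutes(n_tf, n_mc, n_essay):
    return (n_tf * 30 + n_mc * 52.5) / 60 + n_essay * 7

print(estimated_minutes(n_tf=10, n_mc=40, n_essay=2))   # 54.0 minutes
```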
reliability coefficient
Tells us the extent to which a test is free of random error; an index of reliability expressed as a proportion: the ratio of true score variance to the total variance of scores on the test.
Factors affecting reliability: heterogeneity of test takers
Test takers should come from a wide range of the trait being measured; restricting the range lowers the reliability coefficient.
Distributions: modality
the number of modes.
Factors affecting reliability: sample of questions
- Error arises from the particular questions selected for the exam: maybe the selected questions were all ones you studied, or maybe they were the ones you didn't study.
- Error due to the sample of questions can be measured with parallel-forms reliability.
The Stanford-Binet
The widely used American revision (by Terman at Stanford University) of Binet's original intelligence test. It yielded only a single overall summary score, the global IQ.
Constructing the items: item formats
there is no evidence students do better when items are arranged in ascending order of difficulty
Decision theory as related to testing (empirical expectancy tables: true +, true -, false +, false -)
There is a trade-off between error types: to limit false positives, you have to raise the cutoff, which increases false negatives (and vice versa).
Types of reliability and sources of unsystematic error: Other methods: interscorer reliability
Used when we require an estimate of agreement between examiners on subjectively scored tests or behaviors; the source of error is differences between scorers.
Factors affecting reliability: time of administration
- Time of administration is a source of error: when you took the test, you could have been distracted by family or relationship issues, or affected by the time of day or lack of sleep.
- Error due to time of administration can be measured with test-retest reliability.