PSYCH 309 Midterm #1
Know the four scales of measurement
(NOIR) nominal, ordinal, interval, ratio
magnitude
(property of scale) property of moreness
equal intervals
(property of scales) Difference between 2 points on a scale has the same meaning as the difference between 2 other points that differ by the same number of units
absolute zero
(property of scales) nothing of the measured property exists
Define and explain how the extreme group and point biserial methods differ.
-Extreme Group Method: compares people who have done well with those who have done poorly on the test -The Point Biserial Method: find the correlation between performance on the item and performance on the total test
What factors should be considered when choosing a reliability coefficient?
-Homogeneity v heterogeneity of items: is the test measuring a multi-faceted or uni-faceted construct? -Dynamic v static characteristics: is the true score fluctuating or relatively stable over time? Does it change frequently or from situation to situation?
When shown an item characteristic curve, be able to determine good or poor discrimination
-high positive slope - good discrimination -no or negative slope- bad discrimination
How do the different aspects of internal consistency differ?
-internal consistency: examines how people perform on similar subsets of items selected from the same form of the measure; the consistency of items within the same test. Evaluates the extent to which the different items on a test measure the same ability or trait -Split half: corrected correlation between two halves of the test -KR20: requires you to find the proportion of people who got each item "correct" -Alpha: a more general reliability estimate that extends KR20 to tests without a right or wrong answer (such as a Likert scale test)
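As a sketch of the alpha computation in Python (not from the course materials; the function name is my own). With dichotomous 0/1 items and the same variance definition used throughout, this is the KR20 estimate:

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha = (k / (k - 1)) * (1 - sum of item variances / variance of totals).
    scores: 2-D array, rows = test takers, columns = items."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)
```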
Polytomous format
-resembles the dichotomous format but there are more than two alternatives; multiple choice tests, easy to score, takes little time for the test taker; incorrect choices are called distractors, and well-chosen distractors are an essential ingredient of good items -ADVANTAGE: good for telling whether someone really knows the information -DISADVANTAGE: poorly written distractors can adversely affect the quality of a test
how do you measure split half reliability (internal consistency)
-splitting a test in half can cause problems when one half is more difficult than the other; if this is the case it's better to use the odd-even system, where one subscore is obtained from the odd-numbered items and the other from the even-numbered items -to estimate the reliability you need to find the correlation between the two halves -Spearman-Brown formula: corrects for the half length: corrected r = 2r/(1 + r), where r is the correlation between the two halves; the corrected value estimates the reliability the test would have if each half had the total # of items in the test
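The odd-even split plus Spearman-Brown correction can be sketched in Python (a minimal illustration, not from the course; the function name is my own):

```python
import numpy as np

def split_half_reliability(scores):
    """Odd-even split-half reliability with the Spearman-Brown correction:
    corrected r = 2r / (1 + r), where r is the correlation between halves.
    scores: rows = test takers, columns = items in test order."""
    scores = np.asarray(scores, dtype=float)
    odd = scores[:, 0::2].sum(axis=1)    # subscore from items 1, 3, 5, ...
    even = scores[:, 1::2].sum(axis=1)   # subscore from items 2, 4, 6, ...
    r = np.corrcoef(odd, even)[0, 1]     # correlation between the halves
    return 2 * r / (1 + r)               # Spearman-Brown correction
```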
Dichotomous Format
-two alternatives for each item Most common is T/F -ADVANTAGES: easy construction and easy scoring, T/F require absolute judgment -DISADVANTAGES: encourage students to memorize material, "truth" comes in shades of gray, do not allow test takers to show they understand complexity, tends to be less reliable and less precise
skewness
A measure of the shape of a data distribution
criterion validity
A property exhibited by a test that accurately measures performance of the test taker against a specific learning goal.
interval scale
Adjacent values on the scale represent equal intervals in magnitude of the attribute being measured. What properties does it have? -data can thus be: Classified, Counted, Proportioned, Rank-ordered, added (creating a total-scale score), Subtracted, divided to form averages (calculating a scale mean) -data does not have a true "0" point and cannot be: Divided to form ratios -ex. temperature
advantages and disadvantages of mode
Advantages: easy to obtain, only measure that can be used for a nominal scale Disadvantages: not stable from sample to sample, there may be more than one mode for a particular set of scores
advantages and disadvantages of median
Advantages: less sensitive to extreme scores; for skewed distributions the median is the best measure Disadvantages: it responds to how many scores are above or below but not how far away the scores are
advantages and disadvantages of mean
Advantages: the best choice when we need a measure of central tendency to reflect total scores, stable from sample to sample, most resistant to chance sample variation Disadvantages: reactive to the exact position of each score, gives undue weight to extreme values
measurable phenomenon
All phenomena the construct generates (gives rise to) (ex. panic attacks, with operational definitions such as minutes spent worrying or # of anxious thoughts)
ordinal scale
Assignment of ranks according to the degree to which the measured attribute is present/absent. What properties does it have? -data can be: Classified, Counted, Proportioned, Rank-ordered -data cannot be: Added/Subtracted or Divided to form averages/ratios -ex. how do you feel on a scale of 1-10
Know and be able to identify examples of a double-barreled item.
Avoid "double-barreled" items that convey two or more constructs at the same time
Define Content Validity: How is it measured?
Based on the correspondence between the item content and the domain the items represent. Measured through asking whether the items are a fair sample of the total potential, consider the wording of test items (Does my test get at the whole domain?)
What are the primary differences between the Likert and Category formats?
Category Format: like the Likert format but with an even greater number of choices (e.g., a 10-point rating scale). People will change ratings depending on the context; this problem can be avoided if the endpoints of the scale are clearly defined and the subjects are frequently reminded of the definitions of the endpoints
What are the two types of evidence in construct validity?
Convergent and discriminant
What is the Correlation Coefficient? With what concept should correlation not be confused?
Describes how much two measures or items covary, i.e., how much variance the variables share. Not to be confused with causation
What types of questions are answered by psychologists through assessment?
Diagnosis and Treatment Planning, Monitor Treatment Progress, Help clients make more effective life choices/changes, Program evaluation, Helping third parties make informed decisions
IQR
Discard the distribution's upper and lower 25% and take what remains IQR = Q3 - Q1 (the middle 50%, or 75th to 25th percentile)
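A minimal sketch of the IQR in Python (not from the course; note that quartile conventions vary, and this uses NumPy's default linear interpolation):

```python
import numpy as np

def iqr(scores):
    """Interquartile range: Q3 - Q1, the spread of the middle 50% of scores."""
    q1, q3 = np.percentile(scores, [25, 75])  # 25th and 75th percentiles
    return q3 - q1
```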
To avoid bias, how should error be distributed in a psychological test?
Double blind, random sampling, want error to be unsystematic and random!
Which two types of validity are logistical and not statistical? Why?
Face validity and content validity Require good logic, intuitive skills, and perseverance
What are the five characteristics of a good theory?
Has explanatory power, broad scope, systematic, fruitful, Parsimonious
What types of irregularities might make reliability coefficients biased or invalid?
Essentially, anything that introduces bias: the environment, personal factors, evaluator bias, inconsistent scoring materials, tired participants, personal effects
How can one address/improve low reliability?
Increase the number of items; Factor and item analysis (the reliability of a test depends on the extent to which all of the items measure one common characteristic); Correction for attenuation: estimating what the correlation between tests would have been if there had been no measurement error, i.e., the true correlation if the tests did not have measurement error
Kurtosis
Index of the "peakedness" vs. "flatness" of a distribution
In what settings do psychologists assess and what is their primary responsibility in each?
Inpatient, Schools, Forensic (legal) settings, Employment settings, such as corporations and law firms, Career counseling, Pre-marital counseling
What is the Pearson product moment correlation? What meaning do the values -1.0 to 1.0 have?
It is a ratio used to determine the degree of variation in one variable that can be estimated from knowledge about variation in the other variable. The closer r is to +1, the stronger the positive correlation; the closer r is to -1, the stronger the negative correlation. If |r| = 1 the variables are perfectly correlated! (continuous variables)
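The Pearson r can be sketched from its definition (Python, not from the course; the function name is my own):

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation: the covariance of X and Y
    divided by the product of their standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```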
ratio scale
Measured on a scale with a true "0" point. Allows all mathematical operations. It can be meaningfully: Classified, Counted, Proportioned, Rank-ordered, added (creating a total-scale score), Subtracted, divided to form averages (calculating a scale mean), divided to form ratios -ex. a weight scale
median
Middle score in the distribution (50% ↑ 50% ↓) Rank the scores (include repeating scores) from lowest to highest If an odd number of scores, select the middle score. If an even number of scores, take the average of the middle scores. measure of central tendency
Platykurtic
Negative kurtosis = flatter distribution (Plate)
percentiles
Percentage of test-takers whose scores fall below a given raw score
Leptokurtic
Positive kurtosis = more peaked distribution (Leaping)
hypothetical construct
Processes that are not directly measurable, but which are inferred to have real existence and to give rise to measurable phenomena
What is psychometry? What are the two major properties of psychometry?
Psychometry: the branch of psychology dealing with the properties of psychological tests -Reliability: dependability, consistency, or repeatability of the test results (measuring tool) -Validity: Does a test measure what it purports to measure? Accuracy
What is the relationship between reliability and validity?
Reliability is necessary but not sufficient for validity
What is factor analysis?
Studies interrelationships among items within a test. Data-reduction technique. Can be used as measure of internal consistency
How are T scores different from Z scores?
T scores (Unlike Z Scores): Mean = 50 and standard deviation = 10. Are all positive, Values > 70 are often considered "clinically significant" (2 SD's above)
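The z-to-T conversion follows directly from the two distributions' parameters (a trivial Python sketch, not from the course):

```python
def z_to_t(z):
    """Convert a z score (mean 0, SD 1) to a T score (mean 50, SD 10)."""
    return 50 + 10 * z
```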
What are the stages of test development?
Test conceptualization, Test construction, Test Tryout, Item Analysis, Test revision
What should be asked when generating a pool of candidate test items?
Is the test covering the universe of the construct? What content domain should the test items cover? How many items? What are the demographics of the population? How should I word my items?
Norm (testing)
Testing in which scores are compared with the average performance of others. Test norms are created during the standardization process and must be periodically updated.
Ecological validity
The extent to which a study is realistic or representative of real life.
What is the co-efficient of determination? What is the purpose of the co-efficient of determination?
The proportion of the total variation in scores on Y that we explain through X (r^2)
standardization. why is it important to obtain a standardization sample?
The uniform procedures used in the administration and scoring of a test. This is important because without it the meaning of scores would be almost impossible to evaluate
What is restricted range? To what does it lead?
Using a sample of people for whom the test is a poor fit, or a test that is too easy or too hard. It reduces the range and variability of scores (leads to floor and ceiling effects)
nominal scale
Variables can be named - put into categories. What properties does it have? -Values symbolize category membership, and can be: Classified, Counted, Proportioned -data cannot be meaningfully: Ranked, Added/Subtracted, divided to form averages/ratios -ex. labels
Define item analysis. What two methods are closely associated with item analysis?
a general term for a set of methods used to evaluate test items, one of the most important aspects of test construction. The basic methods involve assessment of item difficulty and item discriminability
test
a measurement device or technique used to quantify behavior or aid in the understanding and predictions of behavior
What is a scatterplot (scatter diagram)? How does it work?
a picture of the relationship between 2 variables. each point on the diagram shows where a particular individual scored on both X and Y
item
a specific stimulus to which a person responds overtly (questions on a test)
Define split half reliability
a test is given and divided into halves that are scored separately. the results from each half are compared to one another
If a test is reliable its results are what?
accurate, dependable, consistent, or repeatable
What is a construct?
an indicator variable that measures a characteristic or trait. For example, college admission scores are constructs that measure how well a student is likely to do in their first year. Construct validity measures how well the observed construct predicts the expected outcome.
What is incremental validity?
are you offering something new with your test
What components make up Classical Test Score Theory?
assumes that each person has a true score that would be obtained if there were no errors in measurement. This is used to understand and improve the reliability of a test. X = T + E (observed score = true score + error)
What is systematic error variance called? Is it good or bad and why?
bias; bad, because systematic error distorts scores in one direction instead of averaging out across measurements
Know the different types of correlations and when they are used.
biserial correlation, point biserial correlation, phi coefficient, spearmans rho
norm referenced tests
compares a test takers performance to others
psychological assessment
comprised of tests, interviews, case studies, behavioral observations, apparatus, etc. it is comprised of psychological tests
What are the three main types of validity evidence?
construct related, criterion related, content related
concurrent validity
the criterion is measured at the same time as the measure. ex: you work at the factory and they give you a test to make sure you know what you're doing while you are there
Define item difficulty. What does the proportion of people getting the item correct indicate?
defined by the proportion of people who get a particular item correct. One of the first things a test constructor needs to determine is the probability that an item could be answered correctly by chance alone
operational definition
defining a way to measure a hypothetical construct
Reliability
dependability, consistency, or repeatability of the test results (measuring tool) -The proportion of true to total observed score variance
Define item discriminability. What is good discrimination? What are two ways to test item discriminability?
determines whether the people who have done well on particular items have also done well on the whole test - 2 ways to test item discrimination - Extreme group method - top 3rd and bottom 3rd - Point biserial - all people tested
standardization
develop specific (standardized) procedures for the administration, scoring, and interpretation of a test
What is a z score? How is it calculated?
difference between a score and the mean, divided by the SD
Define parallel/alternate forms reliability. What are its advantages and disadvantages?
different forms of the same test (e.g., ACT/SAT). The forms are hard to construct with equal difficulty, but this is one of the more rigorous ways to assess reliability.
content validity
does the test represent the whole content? does it get at the whole breadth and depth of the content? ex: if you are being tested on the first 6 chapters of a book, then content-related validity is provided by the correspondence between the items on the test and the information in the chapters.
construct validity
does your test measure what it should measure. ex: trying to find out if an educational program increases emotional maturity in elementary school age children. Construct validity would measure if your research is actually measuring emotional maturity.
What is the regression formula? Understand the different components of the formula and how they are applied.
equation of the form Y = a + bX, where X is the explanatory variable and Y is the dependent variable. The slope of the line is b, and a is the intercept (the value of y when x = 0)
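The least-squares fit of a and b can be sketched in Python (not from the course; the function name is my own). The slope is cov(X, Y)/var(X), and the intercept follows from the fact that the line passes through the point of means:

```python
def linear_regression(x, y):
    """Least-squares fit of Y = a + bX.
    b = cov(X, Y) / var(X); a = mean(Y) - b * mean(X)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    return a, b
```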
In what ways can error impact the observed score?
error causes the observed score to deviate from the true score in either direction (X = T + E)
test-retest method
evaluates the error associated with taking a test at two different times (Rorschach inkblot tests are not appropriate for this evaluation). There is a possibility of a carryover effect, when the first testing session influences scores from the second session (some people remember their answers from the first test), or practice effects, when some skills improve with practice
Internal consistency
examines how people perform on similar subsets of items selected from the same form of the measure
What is the content validity ratio and how is it calculated?
experts examine each item and rate whether it is essential to the construct, which yields the CVR (content validity ratio). The formula is CVR = (Ne - N/2)/(N/2), in which Ne is the number of panelists indicating "essential" and N is the total number of panelists (good for educational testing)
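The CVR formula is simple enough to sketch directly (Python, not from the course; the function name is my own):

```python
def content_validity_ratio(n_essential, n_total):
    """CVR = (Ne - N/2) / (N/2): Ne = panelists rating the item
    'essential', N = total panelists. Ranges from -1 to +1."""
    return (n_essential - n_total / 2) / (n_total / 2)
```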
biserial correlation
expresses the relationship between a continuous variable and an artificial dichotomous variable.
What is Criterion-related Validity?
how well the test relates to the criterion we are using.
postdictive validity
if the test is a valid measure of something that happened before. For example, does a test for adult memories of childhood events work?
Be able to define and recognize the Likert Format. What scales most frequently use the Likert format?
indicate the degree of agreement with a particular question, used with attitude and personality, "Strongly agree... strongly disagree"
Central tendency
indices of the central value or location of a frequency distribution with respect to the X Axis.
The Normal distribution
is the most common continuous probability distribution. The function gives the probability that an event will fall between any two real number limits as the curve approaches 0 on either side of the mean. Area underneath the normal curve is always equal to 1
Mesokurtic
kurtosis at zero. normal distribution (Medium)
Define item characteristic curve. Know what information the X and Y axes give as well as slope
learn about items by graphing their characteristics. X is ability level Y is probability of correct responses
What are the three properties of scales that make scales different from one another?
magnitude, equal intervals, absolute zero
Three types of central tendency
mean, median, mode
mode
measure of central tendency, the most frequently occurring score in the distribution
What is the Kappa statistic and how does it relate to reliability?
measures inter-rater reliability when raters make categorical judgments about the same sample; indicates the actual agreement as a proportion of the potential agreement following correction for chance agreement
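Cohen's kappa for two raters can be sketched as follows (Python, not from the course; the function name is my own). Chance agreement is estimated from each rater's marginal category proportions:

```python
def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement).
    rater_a, rater_b: lists of categorical judgments on the same cases."""
    n = len(rater_a)
    p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    categories = set(rater_a) | set(rater_b)
    p_chance = sum((rater_a.count(c) / n) * (rater_b.count(c) / n)
                   for c in categories)
    return (p_obs - p_chance) / (1 - p_chance)
```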
Criterion referenced tests
measures performance against an established criterion
spearman's rho
method of correlation for finding the association between two sets of ranks -tetrachoric correlation: used when both dichotomous variables are artificial
Which type of validity has been referred to as "the mother of all validities", or "the big daddy" and why?
mother: construct validity. it measures what it should. you want your test to measure the construct
multiple regression
multivariate analysis that considers the relationship among combinations of three or more variables; the goal is to find the linear combination of the three variables that provides the best prediction
Name and define the three subtypes of criterion related validity. Be able to give examples.
predictive, concurrent, postdictive
covariance
relationships between variables (How much both variables change together)
What example was given in class regarding reliability
rubber yardstick
Define representative sample. Know when and why representative samples are collected.
sample comprised of individuals similar to those for whom the test is to be used When: when the test is used for the general population, a rep. sample must reflect all segments of the population Why: it can be used as a standardization sample and be representative of an entire population, which raises validity
predictive validity
sat and freshman GPA, the test is trying to predict something in the future
simple linear regression
seeks to find the linear explanation for the relationship between 2 variables
convergent evidence of validity
shows how similar your test is to other test that are measuring the same thing
discriminant evidence of validity
shows that your test is different from other tests of different constructs
variance
standard deviation squared
What is the principle of least squares? How does it relate to the regression line?
statistical procedure to find the best fit for a set of data points by minimizing the sum of the offsets or residuals of points from the plotted curve
negative skew
tail points to the left (towards - end)
positive skew
tail points towards to the right (towards + end)
What contributes to measurement error?
test construction, test administration, test scoring and interpretation
What prerequisites exist for validity?
test needs to be RELIABLE. you can't have validity without reliability (you can have reliability without validity)
Test reliability is usually estimated in one of what three ways?
test retest method, parallel forms method, internal consistency
intelligence testing
testing a person's general potential to solve problems
aptitude testing
testing potential for learning or acquiring a specific skill
achievement testing
testing previous learning
incremental validity
whether a test adds something new beyond existing measures
shrinkage
the amount of decrease observed when a regression equation is created for one population and applied to another
mean
the arithmetic average of a distribution, obtained by adding the scores and then dividing by the number of scores. measure of central tendency
What is Construct-related Validity?
the degree to which a test measures what it claims, or purports, to be measuring
residual
the difference between the predicted and the observed values
Define Face Validity. How does it differ from other aspects of validity?
the mere appearance that a measure has validity. Not technically validity, but still important: it is when it's obvious what you are measuring.
percentile rank
the percentage of scores below a specific score in a distribution of scores. equation is (B/N) x 100, where B is the # of cases below the individual score and N is the total # of scores -Ex. A runner finishes 62nd out of 63: B = 1, so 1/63 = .016, a percentile rank of 1.6
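The percentile-rank formula can be sketched directly (Python, not from the course; the function name is my own):

```python
def percentile_rank(score, scores):
    """Percentile rank = (B / N) * 100: B = number of scores below
    the given score, N = total number of scores."""
    below = sum(s < score for s in scores)
    return below / len(scores) * 100
```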
What is the meaning of a squared validity coefficient?
the percentage of variation in the criterion that we can expect to know in advance because of our knowledge of the test scores. ex: from the previous example we will know .40 squared, or 16%, of the variation in college performance because of the info we have from the SAT test. the remaining 84% of the variance in why they perform differently is still unexplained
standard deviation
the positive square root of the variance
psychological testing
the process of measuring psychology-related variables by obtaining information
What is the validity coefficient?
the relationship between a test and a criterion is usually expressed as a correlation called the validity coefficient. ex: the SAT has a validity coefficient of .40 for predicting GPA at a particular university. because the coefficient is significant, we can say that it tells us more about how well people will do in college than we would know by chance. (r)
How do construct underrepresentation and construct-irrelevant variance relate to content validity?
the score that you get on a test should represent your comprehension of the content you are expected to know. construct underrepresentation is the failure to capture important components of a construct. construct-irrelevant variance occurs when scores are influenced by factors irrelevant to the construct.
What is the standard error of estimate? What is its relationship to the residuals?
the standard deviation of the residuals
Define and be able to apply the broad definition of validity.
the usefulness and meaning of the results. it can be defined as an agreement between a test score or measure and the quality it is believed to measure. it can also be defined as the answer to a question: does the test measure what it is supposed to measure?
Parallel forms method
this compares two equivalent forms of a test that measures the same attribute Advantages: Reduces memory bias, One of the most rigorous assessment of reliability Disadvantages: Hard to construct
Understand the major components of inter-rater reliability.
three different ways to do this: most common method is to record the percentage of times that two or more observers agree. Kappa statistic is the best method for assessing the level of agreement among several observers
What is the purpose of factor and item analysis?
to see if a certain item is bringing the reliability down. see how many factors there are in the test
observed score
true score plus error
what are test batteries?
two or more tests used in conjunction
point biserial correlation
used when the dichotomous variable is true, meaning that the variable naturally forms two categories
What does the standard error of measurement do?
uses the standard deviation of errors as the basic measure of error. allows us to estimate the degree to which a test provides inaccurate readings. the larger the standard error of measurement, the less certain we can be about the accuracy with which an attribute is measured
phi coefficient
when both variables are dichotomous and at least one of the dichotomies is true