Measurement, reliability and validity
Example of ordinal questions
1. Are you underweight, healthy weight, or overweight? 2. How many times a week do you eat pizza? (never, sometimes, often) 3. How close do you live to campus? (close, not too far, really far)
What are the two types of construct validity evidence?
1. Convergent validity: 2 measures of a similar construct should be correlated 2. Discriminant validity: 2 unrelated measures should not be correlated
How to establish construct validity?
1. Correlate new test with an established test 2. Show that people with and w/o certain traits score differently 3. Compare your measure with other related measures of similar or differing constructs
Define item to total correlation
Item-to-total correlation ∙ correlate the performance on each item with the overall (total) performance ∙ computed across all participants within the study
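The item-to-total computation on this card can be sketched in a few lines of Python. The scale, items, and participant scores below are made up for illustration; some texts use a "corrected" version that excludes the item itself from the total.

```python
# Illustrative sketch (hypothetical data): item-to-total correlation
# for a 3-item scale answered by 5 participants.

def pearson_r(x, y):
    """Pearson correlation coefficient between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def item_total_correlations(items):
    """Correlate each item's scores with participants' total scores.
    items: one list per item, aligned by participant."""
    n_people = len(items[0])
    totals = [sum(it[p] for it in items) for p in range(n_people)]
    return [pearson_r(it, totals) for it in items]

items = [
    [2, 4, 3, 5, 1],   # item 1, participants 1-5
    [3, 5, 3, 4, 2],   # item 2
    [2, 5, 4, 5, 1],   # item 3
]
for i, r in enumerate(item_total_correlations(items), 1):
    print(f"item {i}: r = {r:.2f}")
```

An item with a low item-to-total correlation is a candidate for removal, since it does not track the characteristic the rest of the scale measures.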
How do you measure internal consistency?
1. Split-half reliability 2. Cronbach's alpha
Types of reliability
1. Test-retest 2. Parallel forms 3. Interrater/intrarater 4. Internal consistency
Factors that influence error score
1. Tester or rater (errors in taking measurements, recording behavior or data entry) 2. Measurement instrument (Equipment malfunction/uncalibrated, unclear questionnaire) 3. Variability in characteristic being measured (transient states of the participants, ie food intake, mood, blood pressure, fatigue-level) 4. Situational factors (room temperature, lighting, crowding)
Decreasing Error/Increasing Reliability
1. Tester or rater (maintain consistent scoring procedures) 2. Measurement instrument (increase number of items/observations, standardize instructions, eliminate unclear questions) 3. Variability in characteristic being measured (standardize testing conditions) 4. Situational factors (minimize the effects of external events)
If you are measuring BMI in your research study, what issues could contribute to: A. systematic error B. random error
A. Systematic error: an uncalibrated scale or stadiometer that consistently over- or under-reads weight/height (predictable bias) B. Random error: chance variation such as reading or recording mistakes, clothing, hydration status, or time of day of measurement
Which are strategies to improve reliability in a study?
Increase the number of items/observations, standardize instructions and testing conditions, eliminate unclear questions, maintain consistent scoring procedures, and minimize the effects of external events
Nominal Scale
Assignment of labels ∙ for categorical variables - cannot quantify! ∙ categories vary in quality but not amount ∙ cannot say that one is more or less than another
Which of the following variables is an example of the nominal level of measurement? A. rank in graduating class B. gender C. age of students D. amount of money earned
B. Gender
How to increase reliability?
Decrease error! ∙ Inc. sample size (# of items/observations) ∙ Standardize instructions ∙ Eliminate unclear q's ∙ Standardize testing conditions ∙ Minimize the effects of external events ∙ Maintain consistent scoring procedures ∙ Moderate test difficulty
What is validity?
Does the test do what it's supposed to do and measure what it's supposed to measure; tool measures "what it should - truthfulness, accuracy, authenticity" ∙ refers to the meaning of the test's results not the test itself
Concurrent Validity (Present-criterion)
E.g. a food frequency questionnaire validated with 24 hr recalls or diet records collected for the same time period.
Predictive validity (future)
E.g. how well do nutrition screening tools predict mortality among seniors.
Parallel/Alternate forms
Equivalence ∙ give 2 different but comparable forms of the same test to the same group ∙ e.g. to assess dietary knowledge, give slightly different forms of the test (make sure they are comparable) ∙ often used to eliminate practice effects (having done the test once, a person may already know the answers)
A valid measurement doesn't have to be reliable. T/F?
False
How do you know if you're measuring the right thing in the right way?
First you'll need to find a measure that's "reliable": consistent, dependable, predictable
Examples of nominal questions.
Gender, favorite color, where do you live, mode of transportation to school
What is internal and external validity?
Internal: are the methods correct and the results accurate? External: are the findings generalizable to other people and settings?
CQ: Temperature in degrees Celsius is an example of which level of measurement?
Interval ∙ can you rank temperatures (yes - 25C hotter than 15C) ∙ are spaces between ranks equal? (Yes 24-25C is the same as 14-15C) ∙ Does it have an absolute zero & can you make a meaningful ratio? (No - 0C does not mean absence of temperature)
What is considered least precise level of measurements?
Nominal
NOIR - no one is ready
Nominal - lowest: categories, no rank Ordinal - second lowest: ranked categories Interval - next to highest: ranked categories with known units b/w rankings Ratio - highest: ranked categories with known intervals and an absolute zero
True score equation:
Observed score = True score + error score ∙ True: perfect reflection of true value (theoretical but never truly known) ∙ Error: diff b/w true and observed
Define reliability using an equation.
Reliability = true score variance / (true score variance + error variance) ∙ Reliability of the observed score becomes higher as error is reduced!
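A quick numeric sketch of this ratio (the variance values are hypothetical; classical test theory expresses reliability as the share of observed-score variance that is true-score variance):

```python
# Sketch, not from the cards: reliability as the ratio of true-score
# variance to observed variance (true + error), with made-up numbers.
true_var = 9.0    # hypothetical true-score variance
error_var = 1.0   # hypothetical error variance

reliability = true_var / (true_var + error_var)
print(reliability)  # → 0.9

# Halving the error variance raises reliability, as the card states:
reliability_less_error = true_var / (true_var + error_var / 2)
print(round(reliability_less_error, 3))  # → 0.947
```

The point of the example is the direction of the relationship: any strategy that shrinks the error term pushes the ratio toward 1.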
Systematic error
Source of error ∙ "predictable" errors of measurement ∙ i.e. consistent under/over estimation ∙ e.g. measured height consistently 0.5 cm greater than true height ∙ major concern for *validity of measure*
Random Error
Source of error ∙ due to chance ∙ e.g. fatigue, mistake ∙ major concern for *reliability* ∙ e.g. true height = 167 cm; measured (observed) heights = 166.5, 168, 166 cm, etc.
operational definitions
Specifying exactly what will be observed and how it will be done ∙ how to measure the variables e.g. measuring socioeconomic status --> total family income + highest level of school completed
Conceptualization
Specifying what we mean by a term ∙ helps translate an abstract theory/ construct into specific variables ∙ makes it possible to test hypotheses ∙ e.g. Are children healthier when they eat well? (what do we mean by "healthier" and "eat well"?)
Test Re-test Reliability
Stability over time: ∙ give the same test to the same people to see if the same results are obtained ∙ choose an appropriate time period b/w tests ∙ assumes the characteristic measured does not change over time ∙ e.g. give an IQ test twice
"To assess the _____ of the DHQ, 58 pregnant women completed it twice within a 4-5 week interval" which type of reliability assessment was used?
Test-retest reliability
Level of measurement
The degree of precision by which a variable may be assessed
Operationalization
The process of connecting concepts to observations
Measurements can be reliable but not valid. T/F?
True
Poor operationalization
can influence study results (invalid operational definitions) ∙ E.g. operationalizing "success in career" by looking at pay cheque only
Cronbach's Alpha
conceptually, it is the average consistency across all possible split-half reliabilities ∙ can be directly computed from data (0.7 is seen as acceptable) ∙ used for internal consistency reliability ∙ often reported to show scale reliability
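The standard formula behind this card is alpha = k/(k-1) × (1 − sum of item variances / variance of total scores). A minimal sketch with made-up data (3 items, 5 respondents; the scale itself is hypothetical):

```python
def variance(xs):
    """Population variance of a list of scores."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cronbach_alpha(items):
    """items: one list of scores per item, aligned by respondent.
    alpha = k/(k-1) * (1 - sum(item variances) / variance(totals))."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    item_var_sum = sum(variance(it) for it in items)
    return k / (k - 1) * (1 - item_var_sum / variance(totals))

# Hypothetical 3-item scale, 5 respondents:
items = [
    [2, 4, 3, 5, 1],
    [3, 5, 3, 4, 2],
    [2, 5, 4, 5, 1],
]
alpha = cronbach_alpha(items)
print(round(alpha, 2))  # → 0.95, well above the 0.7 acceptability cutoff
```

Because item and total variances use the same denominator, it does not matter whether population or sample variance is used, as long as it is consistent.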
What are the types of validity?
face, content, criterion, construct
Variables are measured at one of four levels which are:
nominal, ordinal, interval, and ratio ∙ (in order from least to most precise) ∙ The more precise (higher) the level of measurement, the more accurate the measurement process
Good example that uses NOIR
∙ numbers assigned to runners - nominal ∙ rank order of winners - ordinal ∙ performance rating on 0-10 - interval ∙ time to finish - ratio
Split-half reliability
Randomly divide the items into 2 subsets and examine the consistency in total scores across the 2 subsets ∙ compares within a person's responses
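The split-half procedure can be sketched as below. The data are hypothetical, and the Spearman-Brown correction at the end is not mentioned on the card but is commonly applied, since correlating two half-length tests understates the reliability of the full test.

```python
import random

def pearson_r(x, y):
    """Pearson correlation coefficient between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def split_half(items, seed=0):
    """items: one list of scores per item, aligned by respondent.
    Randomly split the items into two halves, total each half per
    person, correlate the two halves, then apply Spearman-Brown."""
    idx = list(range(len(items)))
    random.Random(seed).shuffle(idx)
    mid = len(idx) // 2
    half1, half2 = idx[:mid], idx[mid:]
    n_people = len(items[0])
    t1 = [sum(items[i][p] for i in half1) for p in range(n_people)]
    t2 = [sum(items[i][p] for i in half2) for p in range(n_people)]
    r = pearson_r(t1, t2)
    return 2 * r / (1 + r)   # Spearman-Brown corrected full-test estimate

# Hypothetical 4-item scale, 5 respondents:
items = [
    [2, 4, 3, 5, 1],
    [3, 5, 3, 4, 2],
    [2, 5, 4, 5, 1],
    [3, 4, 3, 5, 2],
]
print(round(split_half(items), 2))
```

Because the split is random, different splits give different values; Cronbach's alpha (next card) is conceptually the average over all possible splits.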
Define reliability
reproducibility of a measurement -a consistent and free from error measurement
Criterion validity
∙ Ability of the tool to predict results obtained on an external criterion or reference standard ∙ How well does test estimate performance in comparison to an external criterion or reference standard ∙ criterion should be a valid indicator of variable of interest and relevant to variable being measured ∙ concurrent validity (current) ∙ predictive validity (future)
Examples of ratio scale
∙ Age (days, months, years) ∙ Height (cm, inches) ∙ Nutrient intake (kcal/day) ∙ Length of hospital stay (days) ∙ Medical costs ($)
Ordinal scale
∙ Assignment of values along some underlying dimension ∙ One observation is ranked above or below another (variables are ordered) ∙ BUT you cannot say how much more or less of the variable one observation has than another
Interval scale
∙ Assignment of values with equal distances between points ∙ one score differs from another on a scale that has equal-appearing intervals ∙ arbitrary zero ∙ BUT because the zero point is arbitrary, ratios of values are not meaningful
Criterion Validity Examples
∙ Brief beverage and snack questionnaire (target tool) compared with intakes from FFQ/24 recall (criterion). ∙ FFQ (target tool) compared with objective measure such as blood biomarker (criterion) ∙ Questionnaire to assess childcare nutrition environment (target tool) compared with child observation or caregiver interview (criterion)
Internal consistency
∙ Consistency of underlying measures ∙ Extent to which items measure the same characteristic ∙ degree of correlation/consistency among items in a scale ∙ correlate performance of each item with overall performance across participants.
Face validity
∙ Does the measuring instrument appear to test what it is supposed to? ∙ Does it appear to be valid to the persons completing it?
define construct validity
∙ Extent to which test results are related to underlying construct ∙ Difficult to establish - often use a combination of methods ∙ Can try to compare to "gold standard" if there is one... (chicken and egg). ∙ how well does it measure an abstract concept (e.g. how well does a questionnaire measure healthy eating)
Absolute vs. Relative Validity
∙ For "absolute" validity you need a true gold standard that is an exact measure of what it is intended to capture ∙ in nutrition studies, we seldom achieve absolute validity
Importance of levels of measurements
∙ How a variable is measured can determine the amount of information we obtain (i.e. measuring elevated body mass as BMI gives more information than overweight/underweight) ∙ The level of measurement affects the types of statistical tests you can use.
Content validity
∙ How well the items represent the entire universe of items ∙ ask an expert: does this instrument measure everything it is supposed to? ∙ think about a short FFQ to assess vitamin D intake: are all relevant vitamin D sources covered? ∙ Does the HEI capture key aspects of dietary quality in nutritional guidelines?
Rater reliability (inter/intra)
∙ Inter: consistency b/w raters (two raters judge the same event/behavior and assess agreement) ∙ Intra: stability of measures by the same person (to assess effects of bias, fatigue etc.)
Why should we care about levels of measurements?
∙ Measurement should be as precise as possible ∙ In social science, variables are often measured at the nominal or ordinal level (but in nutritional/food science and dietetics we often want precise data such as blood glucose, Na concentration) ∙ How a variable is measured can determine the level of precision ∙ Affects the types of statistical test you can use.
Challenges in measuring change
∙ Need valid & reliable tools of measurement ∙ the level of the original measurements matters ∙ reliability of the tool ∙ the starting point matters (floor/ceiling effects) ∙ Variables change naturally over time (and people can get better at taking tests with practice) ∙ can you detect the difference? how much change is clinically important?
Vitamin D intake measurements
∙ Nominal = Vit D supplements (yes/no) ∙ Ordinal = intake reported as exceeds/meets/does not meet DRI ∙ Interval = set RDA as 0 and report scores as (+) above or (-) below RDA. (e.g. for RDA 600 IU/d, an intake of 1000 IU/d would be reported as +400). ∙ Ratio - Vitamin D intake as IU/d
Ratio scale
∙ Possess all the properties of the nominal, ordinal, interval scales ∙ has an absolute zero point ∙ CAN say values differ, by how much, and what this means
What are the types of errors made in measurements?
∙ Systematic error ∙ Random error
What are the factors that can lower validity?
∙ Tests that are too short ∙ Identifiable patterns of answers ∙ Unclear directions ∙ Time limits ∙ Level of difficulty ∙ Poorly structured test items ∙ Difficult reading vocabulary and sentence structure
How is reliability measured?
∙ Using a correlation coefficient (r) ∙ indicates how scores on one test change relative to scores on a 2nd test ∙ can range from -1 to +1 (+1 = perfect reliability, 0 = no reliability)
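As a concrete sketch, a test-retest reliability check boils down to correlating two administrations of the same measure. The two score lists below are made up for illustration:

```python
def pearson_r(x, y):
    """Pearson correlation coefficient between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical scores for 5 people on two administrations of one test:
test1 = [10, 12, 15, 18, 20]
test2 = [11, 12, 14, 19, 21]
print(round(pearson_r(test1, test2), 3))  # → 0.98, high test-retest reliability
```

An r near +1 means people kept roughly the same relative standing across the two administrations, which is exactly what test-retest reliability asks.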
Examples of interval questions
∙ What is your GPA? (2 vs 4 doesn't mean you are twice as smart) ∙ What is your IQ? (120 vs. 40 doesn't mean you are 3 times as smart) ∙ What temperature (C) was it when you left the house? ∙ What is your score on the "Healthy Eating Index"? ∙ Year (AD, BC) - equally occurring intervals (1980-81 is the same as 2010-2011, but year 0 is not the beginning of time)
Approaches for establishing validity
∙ correlate new test with an established test ∙ show that people with and without certain traits score differently ∙ determine whether tasks required on test are consistent with theory guiding test development