EXAM 1 - test and measurements

External Validity

Extent to which the results can be generalized to other populations and settings

Amount of time in test retest reliability

The amount of time allowed between measures is critical. The shorter the time gap, the higher the correlation; the longer the time gap, the lower the correlation. This is because the two observations are related over time. The optimum time between administrations is 2 to 4 weeks.

measurement error

Any fluctuation in test scores that results from factors related to the measurement process that are irrelevant to what is being measured. The difference between the observed score and the true score is called the error score:
S_true = S_observed - S_error
Developing better tests with less random measurement error is better than simply documenting the amount of error.

internal consistency

Measures the reliability of a test based solely on the number of items on the test and the intercorrelation among the items. Therefore, it compares each item to every other item. If a scale is measuring a construct, then overall the items on that scale should be highly correlated with one another. A common way of measuring internal consistency is Cronbach's alpha:
.80 to .95 (Excellent)
.70 to .80 (Very Good)
.60 to .70 (Satisfactory)
<.60 (Suspect)
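
To make the idea concrete, here is a minimal sketch of how Cronbach's alpha could be computed from the number of items and the item variances. The response data and the function name `cronbach_alpha` are hypothetical, chosen only for illustration; this uses the standard alpha formula, not a procedure stated in the course material.

```python
from statistics import variance

def cronbach_alpha(rows):
    """Cronbach's alpha for rows of item scores (one row per respondent)."""
    k = len(rows[0])  # number of items on the scale
    # Sample variance of each item across respondents
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    # Sample variance of each respondent's total score
    total_variance = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical data: 5 respondents answering 4 items scored 1 to 5
responses = [
    [4, 5, 4, 5],
    [3, 3, 4, 3],
    [2, 2, 1, 2],
    [5, 4, 5, 5],
    [1, 2, 2, 1],
]

alpha = cronbach_alpha(responses)
print(round(alpha, 2))
```

Because the items in this made-up data rise and fall together across respondents, alpha comes out high; items that did not correlate would pull it down.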

Describe and give example of how a test score is the combination of a true score and an error score

We never know exactly what a person's true score is: an observed score changes from moment to moment and always includes some error.

Explain some pros and cons about different aspects of the subject (p.13): precision, good tools (validity and reliability), take on many forms, interpreted only within context measured, test misuse, achievement tests (p.13-15)

Precision: a test can measure one's ability, but HOW one solves a problem is difficult to measure.
Tools: sometimes we use the best, sometimes what's convenient. The best tools cost more time and money but are more accurate and reliable.
Take on many forms: paper and pencil, computer, observation. Select the form that best fits the question you're asking.
Interpreted only within the context measured: keep test scores in perspective and understand them within the initial purpose for the testing.
Misuse: know the purpose of the test, who to give it to, its quality, and how to interpret it.
Achievement tests: separate those who know the material from those who don't, but some people are bad test takers.

Define Quartile; Q1, Q2, Q3

Q1 = the 25th percentile, also called the first quartile. Q2 = the median, also the 50th percentile. Q3 = the 75th percentile, also called the third quartile. Anyone scoring at or below the 25th percentile falls in the first quartile, and so on.
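
As a small illustration, the quartile cut points can be found with Python's standard library. The scores below are hypothetical, and the `inclusive` method is one of several conventions for interpolating quartiles, so other tools may give slightly different cut points.

```python
from statistics import quantiles

# Hypothetical test scores, already sorted for readability
scores = [55, 60, 65, 70, 75, 80, 85, 90, 95]

# n=4 splits the distribution into four parts and returns the three cut points
q1, q2, q3 = quantiles(scores, n=4, method="inclusive")

print(q1, q2, q3)  # Q1 (25th percentile), Q2 (median), Q3 (75th percentile)
```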

nominal scales

Separated into different categories (NAMES).
All categories are equal: cats, dogs, rats; Republicans, Democrats, Independents. NOT: 1st, 2nd, 3rd.
There is no magnitude within a category: one dog is not more dog than another.
No intermediate categories: no dog/cat or cat/fish categories.
Membership in only one category, not both: mutually exclusive properties.

closed-ended questions should have categories that are?

Should have response categories that are:
Exhaustive (can use the category "Other")
Mutually exclusive
Bad example: "How many times have you hit your children in the last year?" 0 to 5, 5 to 10, 10 to 20, 20 or more. The answers 5, 10, and 20 each fit two categories, so the categories are not mutually exclusive.
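
The two requirements can be checked mechanically. The sketch below is only an illustration: the helper names and the "fixed" category list are hypothetical, and it assumes whole-number answers from 0 up to an arbitrary ceiling.

```python
def matching_categories(value, categories):
    """Return the labels of every category that contains the value."""
    return [
        label
        for label, low, high in categories
        if value >= low and (high is None or value <= high)
    ]

def check_categories(categories, max_value=30):
    """Find answers that fit more than one category (overlaps) or none (gaps)."""
    overlaps, gaps = [], []
    for value in range(max_value + 1):
        hits = matching_categories(value, categories)
        if len(hits) > 1:
            overlaps.append(value)
        elif not hits:
            gaps.append(value)
    return overlaps, gaps

# The bad example from the card: boundaries are shared between categories
bad = [("0 to 5", 0, 5), ("5 to 10", 5, 10), ("10 to 20", 10, 20), ("20 or more", 20, None)]
# A hypothetical repair with non-overlapping boundaries
fixed = [("0 to 5", 0, 5), ("6 to 10", 6, 10), ("11 to 20", 11, 20), ("21 or more", 21, None)]

print(check_categories(bad))    # overlapping answers, then uncovered answers
print(check_categories(fixed))
```

An empty overlap list means the categories are mutually exclusive; an empty gap list means they are exhaustive over the range checked.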

Advantages of split half and odd-even

Simplest method - easy to perform Time and Cost Effective

Four General Categories of Variables

Situational variables, response variables, participant or subject variables, mediating variables

Test retest reliability

Test-retest reliability is usually measured by computing the correlation coefficient between the scores of two administrations. If a scale is measuring a construct consistently, then there should not be radical changes in the scores between administrations, unless something significant happened. The rationale behind this method is that the difference between the scores of the test and the retest should be due solely to measurement error.
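
A minimal sketch of that calculation, assuming the correlation coefficient in question is Pearson's r. The two score lists are hypothetical, and `pearson_r` is a hand-written helper rather than a function from any particular statistics package.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products of deviations from each mean
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)

# Hypothetical scores for the same five people, tested twice a few weeks apart
first_administration = [10, 12, 14, 16, 18]
second_administration = [11, 12, 15, 15, 19]

test_retest_r = pearson_r(first_administration, second_administration)
print(round(test_retest_r, 2))
```

People keep roughly the same rank order across the two administrations here, so the coefficient is close to 1.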

open ended questions positive points:

Give us a rich source of information.
Often used to construct a closed-ended survey for later use.
Usually included at the end of every survey as a 'catch all' question. Example: "Is there anything else you want to share with us about your parents?"

reliability is synonymous with?

Consistency. It is the degree to which test scores for an individual test taker or group of test takers are consistent over repeated applications. No psychological test is completely consistent; however, a measurement that is unreliable is worthless. The consistency of test scores is critically important in determining whether a test can provide good measurement.

Measurement:

The assignment of labels to a variable or an outcome.
SAT (variable): 740 (measurement)
Age: 22
Weight: 160
Gender: male

IV is on what axis?

X axis

Is this open or closed? "About how many times did you get a spanking as a child?" Never 1 to 20 21 or more

closed.

T/F? The higher the level of measurement the more precise the value.

true

ordinal scales

What is measured is expressed in ranks (1st, 2nd, 3rd). Although there is a ranking difference between the groups, the actual difference between the groups may vary. Example: marathon runners classified by finish order. The times for each group will be different: the top ten may have 4- to 5-hour times, the bottom ten 4- to 5-week times.

Because no unit of measurement is exact, any time you measure something (observed score), you are really measuring WHAT two things?

True Score - the amount of the observed score that truly represents what you are intending to measure.
Error Component - the amount of other variables that can impact the observed score.
Observed Test Score = True Score + Errors of Measurement
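
The relationship can be illustrated with a few made-up numbers. The true score and the error values below are hypothetical; in practice the true score is never observed directly, and the errors here were chosen to sum to zero to reflect the common assumption that random error averages out over many measurements.

```python
TRUE_SCORE = 100  # hypothetical; unknown in real testing

# Hypothetical errors of measurement on five occasions (sum to zero)
errors = [3, -2, 0, 4, -5]

# Observed Test Score = True Score + Errors of Measurement
observed_scores = [TRUE_SCORE + error for error in errors]
mean_observed = sum(observed_scores) / len(observed_scores)

print(observed_scores)  # each single score is off by its error
print(mean_observed)    # the average lands on the true score in this example
```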

Variable

Anything that can take on more than one value: SAT score, age, weight, gender

disadvantages of parallel/alternate forms method

Are the two forms of the test actually measuring the same thing? More expensive. Requires additional work to develop two measurement tools.

Conclusion Validity

Draws reasonable conclusions based upon an analysis of the data

closed ended positive points:

Gives us uniform objective responses Can be easily input into a data file for analysis

Describe how personality tests first got started

They began in World War I as a way to place people into different areas of the military, based on traits such as maturity and leadership.

IV

Independent variables: the variables that are considered to be the "cause". Usually manipulated by the researcher.

how can we determine reliability?

Internal Consistency - Test-retest Reliability - Interrater Reliability - Split-half Methods - Odd-even Reliability - Alternate Forms Methods

Survey Instruments can also be developed to assess individual attributes. EXAMPLES:

Personality Anxiety Depression Happiness

administrator factors

Poor or unclear directions given during administration or inaccurate scoring can affect reliability.

types of questionnaires

Self-administration Face-to-face

interval scales

Someone or thing is measured on a scale in which interpretations can be made by knowing the resulting measure. The difference between units of measure is consistent. Height Speed Four miles really is twice as far as two miles

Describe what a professional must do when they notice a particular behavior or set of behaviors

TAKE AN ACTION - they reach conclusion, then take action

causality - Inferences of Cause and Effect Require Three Elements:

Temporal precedence; covariation between the two variables; elimination of plausible alternative explanations

open ended difficulties

Hard to summarize across many surveys. Must be 'coded' so that statistics can be generated. Coding can be too subjective.

Minimum correlation in test-retest reliability is?

.50. The higher the correlation (in a positive direction), the higher the test-retest reliability.

What is the Likert scale?

1. Strongly Agree 2. Agree 3. Disagree 4. Strongly Disagree Sometimes insert a "Neutral" response

How high should reliability be?

A highly reliable test is always preferable to a test with lower reliability.
.80 or greater (Excellent)
.70 to .80 (Very Good)
.60 to .70 (Satisfactory)
<.60 (Suspect)
A reliability coefficient of .80 indicates that 20% of the variability in test scores is due to measurement error.
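
The last sentence follows from treating reliability as the share of score variance that is true-score variance. A short sketch under that assumption, with hypothetical variance figures and hypothetical function names; the cutoffs are the ones listed on this card.

```python
def reliability(true_variance, error_variance):
    """Share of observed-score variance that is true-score variance."""
    return true_variance / (true_variance + error_variance)

def interpret(coefficient):
    """Label a reliability coefficient using the bands from the card."""
    if coefficient >= 0.80:
        return "Excellent"
    if coefficient >= 0.70:
        return "Very Good"
    if coefficient >= 0.60:
        return "Satisfactory"
    return "Suspect"

# Hypothetical variances: 80 units of true-score variance, 20 of error
r = reliability(true_variance=80, error_variance=20)
error_share = 1 - r  # proportion of variability due to measurement error

print(r, interpret(r), round(error_share, 2))
```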

FACTORS that affect reliability

Administrator factors; number of items on the instrument; the instrument taker; length of time between test and retest

disadvantages of split-half and odd even

There are many ways of splitting the test. Each split yields a somewhat different reliability estimate. Which is the real reliability of the test?

closed ended difficulties:

May fail to include proper response categories. Example: "How many sexual partners have you had in the last year?" 1 to 5, 6 or more. There is no category for someone whose answer is zero.

NONEXPERIMENTAL VERSUS EXPERIMENTAL METHODS

Nonexperimental Method Direction of Cause and Effect The Third-Variable or Confounding Variable Problem Experimental Method Experimental Control Randomization

Give example of predictive validity and how the GRE is an example of a test with problems

The GRE doesn't necessarily predict how well one will do in graduate school.

Internal consistency estimates are a function of:

The Number of Items - if we think of each test item as an observation of behaviour, more items strengthen the relationship, i.e., there is more of it to observe.
Average Intercorrelation - the extent to which each item represents an observation of the same thing.
The more you observe a construct, and the more consistently, the greater the RELIABILITY.

types of variables

Discrete variables: what is measured belongs to unique and separate categories (e.g., dogs, cats, rats). If there are only two categories, then it is a dichotomous variable (e.g., open or closed, male or female).
Continuous variables: what is measured varies along a line scale and can have small or large units of measure (length, temperature, age, distance, time).

The causal possibilities in a non-experimental study

exercise -> anxiety; anxiety -> exercise; income -> exercise and anxiety (a third variable)

List some common tests and acronyms

APGAR: appearance, pulse, grimace, activity, respiration (newborn test). Others: SAT, GRE, FCAT, MCAT

Internal Validity

Ability to draw conclusions about causal relationships from our data

Perceptions and attitudes in questionnaires

In general, questionnaires measure perceptions and attitudes: public views on a variety of subjects, such as sentencing policies, gun control, police performance, and drug abuse. Example: campus security: fear > victimization

Give some examples of independent and dependent variables

IV - what's manipulated (e.g., how much alcohol is consumed). DV - what's being measured (e.g., how drunk you are).

ratio scale

Just like an interval scale, but with a definable and reasonable zero point. Examples: time, weight, length. Seldom used in social sciences. All ratio scales are also interval scales, but not all interval scales are ratio scales.

Give examples for the different types of validity (p.64)

Predictive validity: how well a test outcome is consistent with a criterion that occurs in the future. Example: do high scores in high school predict doing well in college?
Content validity: whether the test items sample the universe of items for which the test is designed. Used for achievement tests, certification, and licensing; examine the items closely to be sure they are accurate.
Criterion validity: whether test scores are systematically related to other criteria that indicate the test taker is competent in a certain area (correlate the scores with another measure that is valid and assesses the same abilities).
Construct validity: how well a test score reflects some underlying psychological construct.

In-Person Interviews?

Completion rates are very high (85% +) Reduces 'don't knows' Can explain questions to eliminate confusion Can observe the respondent and make valuable notes on their behavior Can 'probe' for deeper information But: people are more likely to give 'socially-acceptable' answers

Construct Validity

Adequacy of the operational definition of variables

question ordering can have what effect?

Administer the most sensitive questions near the end of the survey. Example: health, income, sexuality.
Be aware that questions that appear earlier can influence answers later. Example: questions about neighborhood crimes before questions about gun control laws.

open/closed ended should be?

Clear and unambiguous: "I feel fear when I must speak in public." (measured with a Likert scale).
Short items are best.
Don't use negative items: rather than "I don't think people should be afraid to speak," use "I like to speak to large groups."
Wording should be examined for potential bias: "assistance to poor" rather than "welfare" (results: 63% of respondents said too little was given for "assistance to poor" vs. 23% for "welfare"); "people who are homeless" rather than "bums".

Construct Validity

Evaluate the adequacy of the operational definition. Is the operational definition sufficiently measuring the construct it claims to measure?

the test taker

If you took an instrument in August when you had a terrible flu and then in December when you were feeling quite good, we might see a difference in your response consistency. If you were under considerable stress of some sort or if you were interrupted while answering the instrument questions, you might give different responses.

types of reliability

Internal Consistency (Consistency of the items) - Test-retest Reliability (Consistency over time) - Interrater Reliability (Consistency between raters) - Split-half Methods - Alternate Forms Methods

The book's authors define a test as:

everything - exams, test, procedure. any way of measuring things

What are some problems with Stanine scores

lumps people together, no individual differences, cannot tell differences in performance, limit to nine categories, assumption that placing in one of nine stanines makes conceptual sense.

Problem in test-retest

Memory effect: a respondent may recall the answers from the original test, thereby inflating the reliability.

Mediating Variables

number of bystanders -> diffusion of responsibility -> helping behaviors

Define Stanine

One of nine segments of a normal distribution. Each segment is normally half of one standard deviation wide.
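
One way to picture this is to convert a z-score into a stanine. The sketch below assumes the usual convention that stanine 5 is centered on the mean and that the two end stanines (1 and 9) absorb everything beyond the outer cut points; the function name is hypothetical.

```python
import math

def stanine(z_score):
    """Map a z-score to a stanine, with half-SD-wide bands centered on the mean."""
    # Doubling z turns half-SD bands into unit bands; +5.5 centers band 5 on z = 0
    band = math.floor(2 * z_score + 5.5)
    # Scores in the tails are lumped into the lowest and highest stanines
    return max(1, min(9, band))

for z in (-3.0, -1.0, 0.0, 0.3, 1.0, 3.0):
    print(z, stanine(z))
```

Two people with z-scores of -0.2 and 0.2 land in the same stanine, which is the "lumping" problem raised on the card about stanine scores.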

number of items

The larger the number of items, the greater the chance for high reliability. Use longer tests or accumulate scores from short tests.

DV is on what axis?

Y axis

How did Alfred Binet contribute to tests and measurement

investigated why children didn't do well which led to intelligence testing

types of questions

Open ended: "Tell us about your relationship with your parents." Closed ended: single-word answer, e.g., "Did your parents hit you?"

what are levels of measurement?

nominal, ordinal, interval, ratio (from lowest to highest level)

Define a percentile or percentile rank

The percentage of people who scored lower than you. The point in a distribution below which a given percentage of scores fall.
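
A minimal sketch of that definition, using hypothetical scores and a hypothetical helper name. It counts only scores strictly below the target; some textbooks also count half of the tied scores, which this sketch does not do.

```python
def percentile_rank(score, all_scores):
    """Percentage of scores in the distribution that fall below the given score."""
    below = sum(1 for s in all_scores if s < score)
    return 100 * below / len(all_scores)

# Hypothetical class of ten test scores
class_scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]

print(percentile_rank(80, class_scores))   # half the class scored lower
print(percentile_rank(100, class_scores))  # top score, but not the 100th percentile
```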

Explain and give examples why as psychologists we act as if ordinal level data is really interval level data

tradition

What is controversial in Kaplan's approach to tutoring

He offered tutoring for standardized tests so people could do better, and built up a tutoring school that still exists today. There is still an assumption that we can teach people how to do better on tests. CONTROVERSY = no data to show that tutoring actually helps.

DV

The variables that are considered to be the "effect" Usually measured by the researcher

Participant or Subject Variables

These extraneous variables are related to individual characteristics of each participant that may impact how he or she responds.

why do Psychologists typically treat their measurements as if they are Interval Data

This is done mostly out of tradition. It is done despite the fact that IQ or personality scores are actually ordinal data, and despite the fact that the statistical tests we choose demand interval or ratio level data. We justify this because our tests are 'robust'.

OPERATIONAL DEFINITIONS OF VARIABLES

Variable is an abstract concept that must be translated into concrete forms of observation or manipulation Studied empirically Help communicate ideas to others

type of variable : continuous variables

What is measured varies along a line scale and can have small or large units of measure Length Temperature Age Distance Time

Example of true experiment: in class

Women in Drug Treatment (Tx) I.V.'s... Assessed for pre- post-Tx drug use (D.V.) using surveys Other D.V.'s measured

Measurement Error is Reduced By?

Writing items clearly - Making instructions easily understood - Adhering to proper test administration - Providing consistent scoring

Describe and give examples of how we might assess: Achievement, Personality, Aptitude, Ability (intelligence and vocational)

Achievement: a history exam, or any exam.
Personality: criteria test.
Aptitude: how good someone is in sales.
Ability: skills and competence (intelligence: learn and apply).

The chapter title is "Welcome to Lake Woebegone..." This comes from: a fictional town in Minnesota from a character who reports on the radio: "Well, that's the news from Lake Wobegon, where all the women are strong, all the men are good looking, and all the children are above average." Why is this relevant to our discussion of percentiles

Because the claim ignores the fact that there is variability: not everyone can be above average, and scores fall at other places in the distribution.

reliability

consistency of the instrument

SPLIT HALF reliability

refers to determining a correlation between the first half of the measurement and the second half of the measurement (i.e., we would expect answers to the first half to be similar to the second half).

Describe the concept of validity

The assessment tool does what it says it does. Validity asks what is being tested; reliability asks how consistently. ADD more items to make it more valid.

Parallel/Alternate Forms Method

refers to the administration of two alternate forms of the same measurement device and then comparing the scores. Both forms are administered to the same person and the scores are correlated. If the two produce the same results, then the instrument is considered reliable. A correlation between these two forms is computed just as the test-retest method.

Describe and give examples for the five main reasons we do tests: Selection, Placement, Diagnosis, Hypothesis Testing, Classification for career

Selection: training, who to train, who to accept.
Placement: giving a math test to see what course to put someone in.
Diagnosis: mental disorders.
Hypothesis testing: did the program work?
Classification for career: what will suit you best?

ADVANTAGES OF MULTIPLE METHODS

Artificiality of Experiments Ethical and Practical Considerations Participant Variables Description of Behavior Successful Predictions of Future Behavior

advantages of parallel/alternate forms method

Eliminates the problem of memory effect. Reactivity effects (i.e., experience of taking the test) are also partially controlled. Can address a wider array of issues than the test-retest method.

two types of self reports

Frequency of occurrence: how many drugs are used over time. Prevalence of occurrence: how many women use drugs.

Response Variables

All potential responses to the manipulation of an independent variable or observed in reaction to the environment.

If you can... (types of data) what type do you have?

Assign just names: then you have Nominal Data.
Put things in order: then you have Ordinal Data.
If the distance between things is consistent: then you have Interval Data.
If the scale has an absolute zero: then you have Ratio Data.
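
The same decision sequence can be written as a short function. This is only a sketch of the checklist above: the function name and the three yes/no parameters are hypothetical, and the Celsius example is a commonly cited interval scale added here as an assumption rather than taken from the cards.

```python
def level_of_measurement(ordered, equal_intervals, absolute_zero):
    """Walk the checklist from the lowest level of measurement to the highest."""
    if not ordered:
        return "nominal"   # just names
    if not equal_intervals:
        return "ordinal"   # order, but uneven distances
    if not absolute_zero:
        return "interval"  # consistent distances, arbitrary zero
    return "ratio"         # consistent distances and an absolute zero

print(level_of_measurement(False, False, False))  # cats, dogs, rats
print(level_of_measurement(True, False, False))   # marathon finish order
print(level_of_measurement(True, True, False))    # temperature in Celsius (assumed example)
print(level_of_measurement(True, True, True))     # weight
```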

How did Darwin's work influence (directly and indirectly) tests and measurement

Darwin saw individual differences. His cousin Francis Galton took that idea and applied it to humans in the anthropometric method: metrics for differences in humans, e.g., weight and height.

Questionnaire construction

Don't squeeze too much on a page. Leave it open and 'breezy' looking: this reduces errors and doesn't demoralize respondents, as they move through the pages quickly.
Don't use abbreviations like "abbrev." as they can confuse everyone.
Use 'contingency' or 'skip' questions so only certain people get certain questions. Example: pregnancy questions for women; otherwise, "If male - skip to question 60".
Use a matrix for questions with the same potential responses. Example:
                     Agree   Disagree
Spanking helps        [ ]      [ ]
Hit with hand only    [ ]      [ ]
Using belt is okay    [ ]      [ ]
The matrix is also useful for Likert responses: SA A D SD

Internal Validity

Evaluate the extent that it was the independent variable that caused the changes or differences in the dependent variable. Are there alternative explanations (confounds)?

External Validity

Evaluate the extent that the results can generalize to other populations and settings. Can the results be replicated with other participants? Can the results be replicated in other settings?

Situational Variable

Extraneous features of the environment that are present and may influence dependent variable responses.

Explain what you would do if your test validity is low in: content validity and construct validity

content - rewrite, consult someone who is in this field and they will rewrite it. construct - take a better look at the theoretical rationale that underlies the test you developed and the items you created to reflect that rationale. maybe the definition and theoretical model are underdeveloped.

What is the purpose of textbook

To expose us to how assessment instruments are assigned and used in sociological, legal, and political testing, and to the legal implications.

interrater reliability

Interrater reliability means that if two different raters scored the scale using the scoring rules, they should attain the same result. Interrater reliability is usually measured by computing the correlation coefficient between the scores of two raters. Here the criterion of acceptability is pretty high (e.g., a correlation of at least .9), but what is considered acceptable will vary from situation to situation.

ODD EVEN reliability

refers to the correlation between even items and odd items of a measurement tool. In this sense, we are using a single test to create two tests, eliminating the need for additional items and multiple administrations. Since in both of these types only 1 administration is needed and the groups are determined by the internal components of the test, it is referred to as an internal consistency measure.
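
To see how a single administration yields two "tests", the sketch below splits the same hypothetical item scores two ways (first half vs. second half, and odd vs. even items) and correlates the half totals. The data and helper names are made up for illustration, and Pearson's r is assumed as the correlation.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)

# Hypothetical data: 5 respondents, 6 items scored 1 to 5, one administration
responses = [
    [5, 4, 5, 4, 5, 4],
    [4, 4, 3, 4, 3, 3],
    [3, 2, 3, 3, 2, 2],
    [2, 2, 1, 2, 2, 1],
    [1, 1, 2, 1, 1, 2],
]

# Split-half: total of items 1-3 vs. total of items 4-6
first_half = [sum(row[:3]) for row in responses]
second_half = [sum(row[3:]) for row in responses]

# Odd-even: total of items 1, 3, 5 vs. total of items 2, 4, 6
odd_items = [sum(row[0::2]) for row in responses]
even_items = [sum(row[1::2]) for row in responses]

split_half_r = pearson_r(first_half, second_half)
odd_even_r = pearson_r(odd_items, even_items)
print(round(split_half_r, 2), round(odd_even_r, 2))
```

The two splits give somewhat different estimates from the same answers, which is the disadvantage noted on the split-half and odd-even card.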

Why do test scores vary - Possible Sources of Variability of Scores

General Ability to comprehend instructions - Stable response sets (e.g., answering "C" option more frequently) - The element of chance of getting a question right - Conditions of testing - Unreliability or bias in grading or rating performance - Motivation - Emotional Strain

