RM - Exam 3: CH 10-12, Prog Review

Gliner alternative scales of measurement

- Nominal - Dichotomous - Ordinal - Approx. Normal

- for a particular purpose - the test - a number of different purposes - for example, specialty area scores on the graduate record examination might be used to predict first year success in graduate school. However, they also could be used as a method to assess current status in a particular undergraduate major. - the purpose of the test is completely different - a chain saw is "valid" for tree surgery but not brain surgery. - for each purpose or use. - prior to using the test.

- When we address the issue of validity with respect to a particular test, we are addressing the issue of the scores on that test for _______________ NOT the validity of ___________ per se. - therefore, any particular test might be used for __________. - Example ! - implication of the example: While the same test is used in both instances,_______________completely different for each situation. - Example - Chainsaw - Therefore, validity should be determined for ____________. - When?

- If the SAT has good predictive evidence, then students who score high on this test will perform better in college than those who do not score high. - The criterion in this case would be some measure of how well the student performs in college, usually grades during the first year. - high school students would take the SAT. Then, when they are finished with their freshman year of college, correlations would be established between their high school SAT scores and college grades. If the correlation is high, then predictive evidence is good. If the correlation is low, then the test has problems for prediction of future performance.

- predictive validity and the SAT - EXAMPLE - criterion in this case - how to establish predictive evidence in this example

Almost all physical measurements provide either ______ or ______ data; however, the situation is often less clear with regard to _______.

1) ratio 2) interval 3) psychological measurements

- In research articles, there is usually more evidence for the 1) ________ of the instrument than for the 2) ________ of the instrument because evidence for 3) ________ is ________. - To establish 4) ________, one ideally needs a 5) ________. To obtain such a criterion is often not an easy matter, so other types of evidence to support the 6) ________ are necessary.

1) reliability 2) validity 3) validity, more difficult to obtain 4) validity 5) "gold standard" or "criterion" related to the particular purpose of the measure. 6) validity of a measure

- face validity - content validity - criterion validity - construct validity

4 types of validity based on 1985 Standards for Educational and Psychological testing:

(1) content; (2) response processes; (3) internal structure; (4) relations to other variables; and (5) the consequences of testing. - Note that the five types of evidence are not separate types of validity and that any one type of evidence alone is insufficient. - Validation should integrate all the pertinent evidence from as many of the five types of evidence as possible. -Preferably, validation should include some evidence in addition to content evidence, which is probably the most common and easiest to obtain.

5 types of validity set forth under the 1999 standards

- +.7 and +1.0 - +.9 - +.7; +.6 (+.6 is only considered marginally acceptable)

A measure that is considered reliable will have a reliability coefficient between ____ and ______. - reliability for instruments like IQ tests, GRE, and personnel decisions should be _________. -it is common to see published journal articles in which one or a few reliability coefficients are below _____, but usually ______ or above

- Determine value/outcomes of "programs" - Recommendations for refinement and/or achieving success

Aims of Program Evaluation

- many (5+) ordered levels or scores - frequency distribution of scores is approximately normally distributed - traditional term(s), Interval and Ratio

Approximately normal

- 68% - 34% - 34% - 95% - 47.5% - 47.5% - If we were to subtract 95 percent from 100 percent, the remaining 5 percent relates to the probability or p value of 0.05 conventionally established for statistical significance, as we see in Chapter 16. Values not falling within two standard deviations of the mean are relatively rare events.

Areas under the normal curve: - _____% of the area under the normal curve is within +/-1 SD of the mean. - This means _____% of the area under the normal curve is between the mean and +1 SD of the mean. - This means _____% of the area under the normal curve is between the mean and -1 SD of the mean. - _______% of the area under the normal curve is within +/-2 SD of the mean. - This means ________% of the area under the normal curve is between the mean and +2 SD of the mean. - This means ________% of the area under the normal curve is between the mean and -2 SD of the mean. - Discuss statistical significance in relation to the area under the normal curve:
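
The percentages on this card can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper name `area_within` is an assumption made for illustration. Note that the card's 68% and 95% are rounded values (exactly 95% corresponds to about 1.96 SD rather than 2 SD).

```python
import math

def area_within(k: float) -> float:
    """Proportion of a normal distribution lying within k SDs of the mean."""
    return math.erf(k / math.sqrt(2))

within_1sd = area_within(1)  # roughly 0.68
within_2sd = area_within(2)  # roughly 0.95

# Each half (mean to +k SD, or mean to -k SD) holds half of the area.
print(f"within 1 SD: {within_1sd:.4f}, each side: {within_1sd / 2:.4f}")
print(f"within 2 SD: {within_2sd:.4f}, each side: {within_2sd / 2:.4f}")
print(f"outside 2 SD: {1 - within_2sd:.4f}")  # close to the conventional .05
```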

- the data can be graphed in a frequency distribution; - the mean, median, and mode can be checked: if they are close to the same value, the distribution can be considered approximately normally distributed - OR the skew value can be computed: if it is between -1 and 1 the distribution can be considered normally distributed.

Assessing whether a variable is approximately normally distributed:
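
As a rough illustration of the checks on this card, the sketch below compares the mean, median, and mode and computes a skew value. The scores are made up, and `skewness` uses one common population definition (the mean cubed z-score); statistical packages often report a slightly different, sample-adjusted value.

```python
import statistics

def skewness(scores):
    """Population skewness: the mean of the cubed z-scores."""
    m = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    return sum(((x - m) / sd) ** 3 for x in scores) / len(scores)

# Hypothetical, symmetric set of scores.
scores = [2, 3, 3, 4, 4, 4, 5, 5, 6]

mean = statistics.fmean(scores)
median = statistics.median(scores)
mode = statistics.mode(scores)
skew = skewness(scores)

# Rule of thumb from the card: skew between -1 and 1.
approximately_normal = -1 < skew < 1
print(mean, median, mode, round(skew, 3), approximately_normal)
```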

1) Nominal - although the numbers correspond to the gate where the horses start the race, their function for the spectator is to identify the name of the horse in the racing/gambling form. 2) Ordinal - because it is based on whether the selected horse comes in first, second, or third (i.e., win, place, or show). It does not matter if a horse wins by a nose or by ten lengths; a win is a win. Thus ranks form an ordinal scale. 3) Normally distributed - a few people probably win a lot, many break even or lose a little, and a few lose a lot.

Attending a Horse Race as spectators: example of 3 out of the 4 levels of measurement: -Numbers worn by horse = 1 -The betting is based on ___2__ scale of measurement -The money people win from all the bets made that day might be best described as ___3___.

- Help determine need for program - ID whether program is effective and efficiently run - Improve program - Add to scientific knowledge (with permission of agency and/or other parties involved) - Helps correct the natural drift that can occur whereby agency finds themselves off course of what they originally set out to do.

Benefits of Program Evaluation

M= ΣX/N

Calculation of population mean

observed score, true score

Cannot assume that ________ score is the same as _______ score.

Name the characteristics and provide examples of: - Dichotomous

Characteristics: -2 levels -ordered or not ordered Examples: -Gender -Math grades (High vs low)

Name the characteristics and provide examples of: - Nominal

Characteristics: -3+ levels -Not ordered -True categories -Names, labels Examples: -Ethnicity -Religion -Curriculum type -Hair color

Name the characteristics and provide examples of: - Ordinal

Characteristics: -3+ levels -Ordered levels -Unequal intervals between levels -Not normally distributed Examples: -Most ranked data -Race Finish (1st, 2nd, 3rd)

Name the characteristics and provide examples of: - Normal

Characteristics: -5+ levels Ordered levels -Approximately normally distributed -Equal interval between levels Examples: -test scores -GRE scores -Height -IQ

- the measurement error is considered to be a single entity, which does not give the researcher the information necessary to improve the instrument

Classical test theory - problems

- Highly politicized environment o No boss wants to be shown that his/her program is ineffective o No politician wants to be attached to a faulty program. o E.g., the "Just Say No" drug education program was universally found to be ineffective, yet it got millions of dollars bc of a powerful lobby and because there was nothing better.

Constraints

1) starts with a definition of the concept that the investigator is attempting to measure. 2) a literature search to see how this concept is represented in the literature. 3) Next, items are generated that might measure this concept. 4) Gradually, this list of items is reduced to form the test. ------ One of the methods of reducing items is to form a panel of experts to review the items for representativeness of the concept.

Content validity - process of establishing it.

- An aspect of measurement validity. The content of the instrument is representative of the concept that is being measured. - Specifically, one asks if the content that makes up the instrument is representative of the concept that one is attempting to measure. - there is no statistic to demonstrate ______ -

Content validity - what is it? - specifically.... - statistic to demonstrate content validity

- One of the most important contributions of the instrument (Assessment of Motor and Process Skills - AMPs) is that it has ecological soundness. Fisher has participants choose to perform "everyday" tasks from a list of possible tasks that require motor and process skills. - If Fisher asked participants to stack blocks or perform other artificial types of motor tasks, then her test would NOT have strong content validity, even though the artificial tasks involved motor activity. - Her test has strong content validity because the tasks not only involve motor and process activity, but also because they are representative of the types of tasks that a person would do in everyday life.

Content validity: - describe using the Assessment of Motor and Process Skills (AMPs)

- reliability -r -

Correlation coefficient - it is the measure most often selected to evaluate ________ - symbol?

a) +1 b) -1 c) 0 d)

Correlation strength for Pearson and Spearman Rho -Perfect positive = a - Perfect negative = b - No correlation = c - Strong = d

1) One aspect of measurement validity. Validating the instrument against a form of external criterion. 2) Computing a correlation coefficient between the instrument and the external or outside criterion. 3) Instruments that are intended to select participants for a school or profession. 4) 2 types: predictive evidence and concurrent evidence. 5) Being able to establish an outside criterion that is measurable.

Criterion Validity - what is it? 1) - validation procedure usually involves 2)_______ - Common examples of criterion validity involve instruments that are _3)__________________. - there are 4)_____ types of evidence for criterion validity. What are they? - The key to criterion validity is __5)_______

- that while it is a measure of internal consistency, it does not necessarily measure homogeneity, or unidimensionality. In other words, people often determine Cronbach's alpha and assume that since it is at a high level (e.g., .85), the test is measuring only one concept or construct. - even though the overall item correlations may be relatively high, they could be measuring more than one factor or dimension. This can lead to problems, because one of the assumptions of using Cronbach's alpha as an index of reliability is that it is measuring only one construct. We caution that when reporting reliability, if only Cronbach's alpha is provided, without information indicating that there is only one underlying dimension, or another index of reliability, then reliability has not been adequately assessed.

Cronbach's alpha - Problems

- designed to show the association between two nominal or dichotomous variables - can be used with ordered variables, but less appropriate when either variable has three or more ordered levels - see image for example

Cross-tabulation table
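
Since the card's image is not reproduced here, a small sketch of how a cross-tabulation could be built may help. The two variables and all eight records are hypothetical.

```python
from collections import Counter

# Hypothetical data: (gender, result) for eight participants.
records = [
    ("female", "pass"), ("female", "pass"), ("female", "fail"),
    ("male", "pass"), ("male", "fail"), ("male", "fail"),
    ("female", "pass"), ("male", "pass"),
]

# Each cell of the table counts one combination of the two variables.
cells = Counter(records)
rows = sorted({r for r, _ in records})
cols = sorted({c for _, c in records})

print("".ljust(8) + "".join(c.ljust(6) for c in cols))
for r in rows:
    print(r.ljust(8) + "".join(str(cells[(r, c)]).ljust(6) for c in cols))
```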

Summarize and describe data from a sample without making inferences about the larger pop from which the sample data were drawn

Descriptive Statistics

- one would need to ask how many different categories are there and what are the percentages in each.

Determining variability of Nominal data
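
A minimal sketch of that idea, using made-up hair color data: count the categories and report the percentage in each.

```python
from collections import Counter

# Hypothetical nominal data for ten participants.
hair = ["brown", "black", "brown", "blond", "red",
        "brown", "black", "blond", "brown", "black"]

counts = Counter(hair)
percentages = {cat: 100 * n / len(hair) for cat, n in counts.items()}

print("number of categories:", len(counts))
for cat, pct in sorted(percentages.items()):
    print(f"{cat}: {pct:.0f}%")
```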

two categories - Either ordered or unordered - traditional term, N/A

Dichotomous

For example, if the average gender was 1.55 (with males = 1 and females = 2), then 55 percent of the participants were females.

Dichotomous Variable statistics - Example of usefulness of mean
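
The arithmetic behind this example can be sketched as follows. The sample of 20 participants is hypothetical; it was chosen so that the mean code comes out to 1.55.

```python
# Hypothetical sample: 9 males coded 1 and 11 females coded 2.
codes = [1] * 9 + [2] * 11

mean_code = sum(codes) / len(codes)  # (9 * 1 + 11 * 2) / 20

# With 1/2 coding, the part of the mean above 1 is the proportion coded 2.
proportion_female = mean_code - 1
print(f"mean = {mean_code:.2f}, females = {proportion_female:.0%}")
```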

one with only two levels or categories (e.g., Yes or No, Pass or Fail) is sometimes assumed to be nominal. While some such dichotomous variables are clearly unordered (e.g., gender) and others are clearly ordered (e.g., math grades—high or low), all dichotomous variables form a special case

Dichotomous Variables

for multiple regression, dichotomous variables, called "dummy" variables, can be used as independent variables along with other variables that are normally distributed. It turns out that dichotomous variables can be treated, in most cases, as similar to normally distributed variables

Dichotomous variable statistics - usefulness for multiple regressions

Per classical test theory, ____________ is the difference b/w observed and true score.

Error

- Surveys: participants, staff, and/or management o Sample all or randomly select a subset - Administrative data: data collected for another purpose o E.g., GPA, attendance - Focus groups (i.e., structured discussions) o Facilitator with 8-10 people who answer broad questions about the program or needs - Key informant interviews o Qualitative, in-depth; with 15-25 "experts" selected for knowledge about the program or topic - Observations: description of events or behaviors o Systematic social observation in the field o Participant observation: evaluator experiences the program

Evaluating program effectiveness - types of measures / data

- would find a sample of persons who were not participants in the experiment previously described but who would fit his target population. - He would administer the QOL to this sample, - and at a later date (at a date that would approximate the interval of the intervention) he would administer the QOL to the same sample. - Then he would determine the reliability coefficient based on the scores of the two administrations using a correlation between the two sets of scores. - If the reliability coefficient is relatively high (e.g., above .80), then he would be satisfied that the QOL has good test-retest reliability. - On the other hand, if the reliability coefficient is below .70, then he may need to reconsider the QOL as a measure that produces reliable scores of quality of life.

Example of test-retest reliability in action - use QOL example
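
The procedure above can be sketched in a few lines. The QOL scores below are invented for six hypothetical participants, and `pearson_r` is a hand-written helper, not part of any particular statistics package.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical QOL scores for the same six people at two administrations.
time1 = [50, 62, 71, 45, 80, 66]
time2 = [52, 60, 75, 47, 78, 64]

r = pearson_r(time1, time2)
# Thresholds taken from the card: above .80 is good, below .70 is a concern.
verdict = "good" if r > 0.80 else "questionable" if r < 0.70 else "borderline"
print(f"test-retest r = {r:.2f} ({verdict})")
```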

- Many researchers do not consider this to be a scientifically recognized type of measurement validity. - An instrument is said to have _____ validity if the content appears to be appropriate for the purpose of the instrument. The key word is "appears." this type of validity does not actually describe the content.

Face Validity - how do researchers view it? - what is it?

- Informal evaluation; results are accessible to and used only by the requestors of the evaluation and by the evaluator o Contributes to development of or revision to the program o Aim is often to improve the program rather than justify its existence o Reported to program director/client (may be an informal report, e.g., memo, presentation) o Can happen before summative program evaluation

Formative Evaluation - What is it?

a graph which indicates how many participants are in each category They are useful whether categories are ordered or unordered

Frequency distributions

- normal / normally distributed data - they connect the points between the categories

Frequency polygons and Histograms are best used with ________ data. Why?

1. Determine Problem 2. Needs assessment 3. Formative Evaluation (sometimes) 4. Summative Evaluation 5. Cost Effectiveness determination

General Steps in Program Evaluation Process

- Extension of classical test theory - allows the investigator to estimate more precisely the different components of measurement error - in simplest form: this theory partitions the variance that makes up an obtained score into variance components, such as variance that is attributable to the participants, to the judges (observers), and to items

Generalizability theory, an extension of ______________, does what?

- a ______________ correlation , it would mean that students with high anxiety tended to have low grades; also, high grades would be associated with low anxiety.

High negative correlation - what does it mean?

- A ___________correlation between anxiety and grades would mean that students with higher anxiety tended to have high grades, those with lower anxiety had low grades, and those in between had grades that were neither especially high nor especially low.

High positive correlation - what does it mean?

- Measurement reliability is expressed as a coefficient. The reliability coefficient is the ratio of the variance of true scores to the variance of observed scores (Ghiselli, Campbell, & Zedeck, 1981). In other words, the higher the reliability of the data, the closer the true scores will be to observed scores.

How does reliability relate to observed and true scores?
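
To make the ratio concrete, here is a small sketch with invented numbers. In practice true scores are never known; they are fixed here only so the ratio can be computed, and the errors were picked to be uncorrelated with the true scores.

```python
import statistics

# Hypothetical true scores and measurement errors for four participants.
true_scores = [10, 20, 30, 40]
errors = [2, -2, -2, 2]  # chosen to be uncorrelated with the true scores

# Classical test theory: observed score = true score + error.
observed = [t + e for t, e in zip(true_scores, errors)]

# Reliability coefficient = variance of true scores / variance of observed scores.
reliability = statistics.pvariance(true_scores) / statistics.pvariance(observed)
print(f"reliability = {reliability:.3f}")
```

With smaller errors the observed variance shrinks toward the true-score variance and the ratio approaches 1.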

- Note that a high Cronbach's alpha (see Chapter 11) is incorrectly assumed to provide evidence that a measure contains only one dimension or construct; it is possible to have a high Cronbach's alpha and be measuring multiple dimensions; thus, Cronbach's alpha should not be relied on to assess evidence based on internal structure.

IMPORTANT NOTE ABOUT CRONBACH's ALPHA AND EVIDENCE BASED ON INTERNAL STRUCTURE

- Multiple choices, like a likert scale; inter-item reliability - dichotomous items; K-R 20 (Kuder-Richardson 20) - most commonly used index of reliability! -- educational and psychological research -- it takes only one administration of the instrument. More important, though, is that alpha is related to the validity of the construct being measured.

If each item on the test has _____________ like a _________ scale, then Cronbach's alpha is the method of choice to determine _______________ reliability. It is also appropriate for ______ items so it can be used instead of__________ How frequently is Cronbach's alpha used? - and for what area of research? -- why?
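
The card does not give the formula, so the sketch below assumes the standard textbook form of Cronbach's alpha, k/(k-1) * (1 - sum of item variances / variance of total scores). The Likert-type responses are invented.

```python
import statistics

def cronbach_alpha(rows):
    """rows: one list of item scores per respondent (same items for all)."""
    k = len(rows[0])
    item_vars = [statistics.variance(col) for col in zip(*rows)]
    total_var = statistics.variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical responses of five people to three 1-5 Likert-type items.
responses = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [1, 2, 1],
    [3, 3, 3],
]

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha = {alpha:.2f}")
```

Only one administration of the instrument is needed, which is part of why alpha is so widely reported.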

- like tx dosages, E.g, no drug, 10mg, 20 mg, and 30 mg - ordinal - normally distributed

In certain cases like ________, an active IV could be ______ or even ________.

- in addition to obtaining test-retest reliability, or parallel forms reliability, the researcher wants to know that the instrument is consistent among the items; that is, the instrument is measuring a single concept or construct. - Rather than correlate different administrations of the same instrument, the investigator can use the results of a single administration of the instrument to determine internal consistency. - split-half method, the Kuder-Richardson (K-R 20) method, and Cronbach's alpha. - But one can only use Kuder-Richardson or Cronbach's alpha when one has data from several items that are combined to make a composite score.

Internal consistency reliability - what is it? - how to determine it - generally - common methods

- The distance between the 25th and 75th percentiles -best measure of variability for ordinal data

Interquartile range - what is it? - When is it appropriate to use?
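
A short sketch of the calculation, using the standard library's `statistics.quantiles` with its default method. The eleven scores are hypothetical, and other quartile conventions can give slightly different cut points.

```python
import statistics

# Hypothetical ordinal scores for eleven participants.
scores = list(range(1, 12))  # 1, 2, ..., 11

# n=4 returns the three quartile cut points (25th, 50th, 75th percentiles).
q1, median, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```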

- Agreement or consistency among raters. - When observation is the method of collecting data, then reliability must be established among the judges' scores to maintain consistency. - most common theme is that two or more judges (observers) score certain episodes of behavior and some form of correlation is performed to determine the level of agreement among the judges.

Interrater (interobserver) reliability - what is it? - When to use it? - How to determine it?

- mutually exclusive categories ordered from low to high - intervals are equally spaced - Used for many psychological scales and types of data (ex: extroversion, attitudes about X, IQ).

Interval Scales

- when more than 2 observers/raters are needed to observe behavior as the DV of a study - allow the researcher to calculate a reliability coefficient with two or more judges. - A second advantage of using ICC is that if the judges are selected randomly, then the researcher can generalize the interrater reliability beyond the sample of judges that took part in the reliability study. CRITERIA that MUST BE SATISFIED TO USE THE ICC: -the behavior to be rated must be scaled at an interval level. ---For example, each rater might be rating instances of cooperation on a 1-5 scale. These ICCs are computed using analysis of variance methods with repeated measures to analyze interrater reliability.

Intraclass Correlation Coefficients (ICCs)

- NOT focused on generalizing findings (external validity) or relating back to literature - BUT uses same designs, methods, and statistics - AND may be presented or published as scholarly product o Especially if a quasi-experimental design is used for the evaluation/study.

Is program evaluation research?

strong positive correlation

Is this a strong or weak correlation? Pos or neg?

weak positive correlation

Is this a strong or weak correlation? Pos or neg?

- a method of calculating intraclass correlation coefficients when the data are nominal is the ____ - can be calculated with two or more raters - can validate that the agreement exceeds chance - data are often dichotomous (e.g., present or absent); however, it is not uncommon to have more than two nominal categories - i.e., a measure of interrater reliability for nominal data which corrects for random agreement.

Kappa Statistic
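
A minimal sketch of the chance correction, assuming the two-rater (Cohen's) form of kappa: (observed agreement - chance agreement) / (1 - chance agreement). The card notes kappa can also be used with more raters, which this sketch does not cover. The ratings are invented.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning nominal categories."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    categories = freq_a.keys() | freq_b.keys()
    # Agreement expected by chance, from each rater's category proportions.
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical ratings of ten episodes (P = present, A = absent).
rater_a = ["P", "P", "P", "P", "P", "A", "A", "A", "A", "A"]
rater_b = ["P", "P", "P", "P", "A", "A", "A", "A", "A", "P"]

kappa = cohen_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.2f}")  # raw agreement is 80%, kappa is lower
```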

- when the instrument being used is intended to measure a single theme or trait - if each item is scored dichotomously (i.e., pass/fail, true/false, right/wrong) - See photo for formula

Kuder-Richardson 20 (K-R 20) - when is it used? - when is it an appropriate method of determining _____ reliability? - See photo for formula (also see pg 189)
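
Because the photo with the formula is not available here, the sketch below assumes the standard form of K-R 20: k/(k-1) * (1 - sum of p*q / variance of total scores), where p is the proportion passing an item and q = 1 - p. Using the population variance for the total scores is also an assumption; the answer data are invented.

```python
import statistics

def kr20(rows):
    """rows: one list of 0/1 item scores per respondent."""
    k = len(rows[0])
    n = len(rows)
    sum_pq = 0.0
    for item in zip(*rows):
        p = sum(item) / n          # proportion answering the item correctly
        sum_pq += p * (1 - p)
    total_var = statistics.pvariance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum_pq / total_var)

# Hypothetical right/wrong answers of five people on four items.
answers = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

coefficient = kr20(answers)
print(f"K-R 20 = {coefficient:.2f}")
```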

- Measure of central tendency - arithmetic average - usually the statistic of choice if the data are normally distributed - takes into account all of the available information when used to compute the central tendency of a frequency distribution - commonly used in both descriptive and inferential statistics

Mean - What is it? - When is it the statistic of choice?

- variability describes the spread or dispersion of the scores. - In the extreme, if all of the scores in a distribution are the same, there is no variability. - If they are all different and widely spaced apart, the variability will be high. - Standard Deviation

Measure of Variability - What is it? - Most common measure of variability?

the assignment of numbers or symbols to the diff. levels or values of variables according to rules

Measurement

- mean, median, mode - all three measures of central tendency can be used for normally distributed data - all three are the same and in the center of the distribution when the data are normally distributed

Measures of Central Tendency - what are they? - relationship to normally distributed data

- measure of central tendency - the middle score - appropriate for ordinal-level data.

Median - What is it? - When is it the statistic of choice?

- the most common category - can be used with any kind of data, but generally provides the LEAST precise info about central tendency - ONLY use mode for central tendency if there is ONLY ONE mode, if it is clearly identified, and if you want a quick, NONCALCULATED measure. - mode is more useful when the data are nominal or dichotomous, there are relatively few categories, and there are a larger number of participants.

Mode - What is it? - When to use it? Considerations?

1) test-retest 2) parallel forms 3) internal consistency - measured through split-half methods 4) internal consistency - measured through Kuder-Richardson 20 5) Internal consistency - measured through Cronbach's alpha 6) Interrater - measured through percentage agreement methods 7) Interrater - measured through the Kappa Statistic 8) Interrater - measured through Intraclass correlation coefficients (ICCs)

Name 8 types of measurement reliability

- a __________ correlation there are no consistent associations; a student with high anxiety might have low, medium, or high grades.

No correlation/ zero correlation - what does it mean?

- three or more categories - categories are unordered - traditional term: Nominal

Nominal

- same as traditional nominal scale except: Gliner only includes variables that have 3 or more unordered categories. - ex: single people might be assigned the numeral 1, married persons might be coded 2, and divorced persons could be coded 3 -- this does not imply that a divorced person is higher or lower than a married/single one. - same reasoning applies to other nominal variables such as: - ethnic group - type of disability, or - section under class schedule, etc. - questions to determine if nominal: 1) are the variable levels/categories overlapping? 2) are the variable levels/categories ordered? If answers to both 1 & 2 = no, then it is safe to assume the variable is nominal

Nominal Variables (Gliner's view)

the numbers used for identifying the categories in a nominal variable must not be treated as if they were numbers that could be used in a formula, added together, subtracted from one another, or used to compute an average. Ex: Average ethnic group makes no sense. However, if one asks a computer to compute average ethnic group, it will do so and provide meaningless information.

Nominal Variables - How are #s used in relation to nominal variables?

The important thing about nominal scales is to have clearly defined, non-overlapping, or mutually exclusive categories that can be coded reliably by observers or by participant self-report.

Nominal Variables - summary of what is most important!

- Lowest of the four levels of measurement - Categories that are not more or less, but are different from one another in some way - Mutually exclusive and exhaustive categories - Named categories - Example: Gender 1 = Male 2 = Female

Nominal scale of measurement

Statistics such as the mean or variance would be meaningless for a three or more category nominal variable (e.g., ethnic group or marital status, as described already). However, such statistics do have meaning when the data are dichotomous, that is, have only two categories. For example, if the average gender was 1.55 (with males = 1 and females = 2), then 55 percent of the participants were females.

Nominal vs Dichotomous - use of statistics

- idealized frequency distribution - provides a model for the fit of the distributions of many of the DVs used in behavioral sciences - ex: height, weight, IQ, various other psych. variables. - Most ppl fall toward the middle of the curve w/ fewer at the extremes

Normal Curve

- 5 or more ordered levels - mutually exclusive categories that are ordered from low to high - ALSO: responses/scores are at least approximately normally distributed in the population from which the sample was selected Normality is important for: - inferential statistics (ex: t test), which assume the DV is normally distributed - the appropriate use of several common descriptive stats (e.g., Mean and standard deviation, etc)

Normally Distributed Variables

- Any score that is obtained from any participant on a particular instrument. - Comprised of true score and error. - ________________ score = true score +/- Error -true score, observed scores

Observed Scores - We can never know a person's _____ score, so we will only ever know their _______ score.

- Three or more levels - ordered levels - frequency distribution of the scores is not normally distributed - traditional term, ordinal

Ordinal

a scale of measurement in which the measurement categories form a rank order from low to high - ex: consider the ranking of winners of a horse race (1st place, 2nd place, and so on) - intervals between the various ranks are not necessarily equal - 2nd place horse may finish 30 seconds behind 1st, but only 1 second in front of the 3rd place horse (thus the intervals are not equal but the rank is not affected)

Ordinal Scales

- 3 or more ordered categories or levels - Responses are NOT normally distributed - Gliner's interpretation of this scale is similar to the traditional interpretation - Important note: when frequencies are plotted from a sample of participants, they do NOT look like the bell-shaped or normal distribution of the scores shown in figure 10.1

Ordinal Variables

A parallel form can be created by simply reordering the items or by writing new items that are similar to the existing items. It is important that the two forms have similar content.

Parallel forms - How to create one?

- the reliability estimate between two similar forms of a measure - involves establishing the relationship between the two forms of the same test. - This type of reliability is easy to establish, since it involves having a sample of participants take the two forms of the same instrument with very little time elapsed between the two administrations. - Then, similar to test-retest reliability, a correlation coefficient is determined for the two sets of scores. - Again, a reliability coefficient of at least .80 would be expected for parallel forms reliability.

Parallel forms reliability

-involves having two or more raters, prior to the study, observe a sample of behaviors that will be similar to what would be observed in the study. - It is important for the two raters to discuss what they will be rating (i.e., the construct of interest) to agree on what each rater believes is an instance of the construct. - Suppose that rater A observes eight occurrences of a particular behavior and rater B observes ten occurrences of the same behavior. A percentage is then computed by dividing the smaller number of observations by the larger number of observations of the specific behavior. In this case, the percentage is 80.

Percentage agreement methods

- One of the problems with this method is that although both observers may agree that a behavior was elicited a particular number of times, this does not mean that each time the behavior occurred that both judges agreed. - For example, suppose that the behavior of cooperation was the dependent variable for a study. Prior to the study, two judges were to observe a classroom of students for particular instances of cooperation. One observer (judge) said that there were eight examples of cooperation. A second observer said that there were ten examples of cooperation. The percentage agreement would be 8 divided by 10, or 80 percent. However, it is possible that the eight instances observed by one judge were not the same instances observed by the second judge. The percentage would be inflated in this particular instance.

Percentage agreement methods -Problems

Using a point-by-point basis of establishing interrater reliability, each behavior would be rated as an agreement or disagreement between judges. The point-by-point method would be easiest to perform if the behavior is on a tape that could be played for the judges. To calculate percentage agreement in the point-by-point method, the number of agreements between the two judges would be divided by the total number of responses (agreements plus disagreements).

Point by point basis - percentage agreement method
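
The two percentage agreement methods can be contrasted with a small sketch. The interval-by-interval codes are hypothetical; they were chosen so the totals match the 8-versus-10 cooperation example while the judges often disagree on which intervals contained the behavior.

```python
def total_count_agreement(count_a, count_b):
    """Smaller total divided by larger total, as a percentage."""
    return 100 * min(count_a, count_b) / max(count_a, count_b)

def point_by_point_agreement(rater_a, rater_b):
    """Agreements divided by agreements plus disagreements, as a percentage."""
    agreements = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100 * agreements / len(rater_a)

# Hypothetical codes for 12 intervals (1 = cooperation observed, 0 = not).
judge_a = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]  # 8 instances
judge_b = [1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1]  # 10 instances

by_totals = total_count_agreement(sum(judge_a), sum(judge_b))
by_points = point_by_point_agreement(judge_a, judge_b)
print(by_totals, by_points)  # prints 80.0 50.0
```

The gap between the two figures shows how comparing totals alone can inflate apparent agreement.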

1) type of evidence for criterion validity 2) extent to which we can predict how a subject will do on the criterion measure in the future based on a score on the instrument to be validated 3) predict future performance; SAT, GRE, and LSAT

Predictive evidence: - 1) a type of evidence for ______________ validity - 2) what is it? -3) examples of instruments that are used to ________. list 3

- often not all of the participants who were evaluated on the original instrument can be evaluated on the criterion variable. -This is especially the case in selection studies. -For example, we may have SAT scores for a wide range of high school students. However, not all of these students will attend college. Therefore, our criterion variable of first semester college GPA will not only have fewer participants than our predictor variable, but will represent a more homogeneous group (those selected into college). - Therefore the range of scores of those who could participate in the study on both the predictor and criterion variables is restricted, leading to a smaller correlation coefficient

Predictive validity - problem #1
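
A rough simulation of the restricted-range problem, assuming simulated (not real) predictor and criterion scores with a true correlation near .6. Keeping only the top half of the predictor stands in for "those selected into college"; the seed, sample size, and cutoff are arbitrary assumptions for illustration.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation computed from sums of squares and cross-products."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
predictor = [rng.gauss(0, 1) for _ in range(5000)]                 # e.g., test score
criterion = [0.6 * x + 0.8 * rng.gauss(0, 1) for x in predictor]   # e.g., later GPA

r_full = pearson_r(predictor, criterion)

# Keep only cases at or above the median on the predictor (the "admitted" group).
cutoff = sorted(predictor)[len(predictor) // 2]
selected = [(x, y) for x, y in zip(predictor, criterion) if x >= cutoff]
r_restricted = pearson_r([x for x, _ in selected], [y for _, y in selected])

print(f"full range r = {r_full:.2f}, restricted range r = {r_restricted:.2f}")
```

The restricted-range coefficient comes out smaller than the full-range one, even though the underlying relationship is unchanged.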

- in order to establish validity, the researcher must wait until those who were tested initially can be measured on the criterion. Sometimes this wait could take years. ---- Therefore a second type of criterion validity was developed to solve this problem -- concurrent evidence

Predictive validity - problem #2 --- AND what has been done in an attempt to solve this problem

Program Evaluation Steps - In Detail!! - Step 1: Determine the problem o Goal of evaluation? What question(s) is evaluation supposed to answer? o Ask directors of program o Can ask target groups (i.e., parents of children on spectrum who enrolled their kids in a program to improve high school graduation rates of autistic children) o What is the problem the program was designed to address and how serious is it? o How has program determined on its own to evaluate its own success?

Program Evaluation Steps - In Detail!! - Step 1:?

Program Evaluation Steps - In Detail!! - Step 2: Needs Assessment o Survey of potential users § To determine need for program § To determine need for expansion of program if that is on the table for future of the program o Survey of available resources o Look at census or archival data to help us answer some of our questions

Program Evaluation Steps - In Detail!! - Step 2: ??

Program Evaluation Steps - In Detail!! - Step 3: Determine the type of evaluation being requested o Summative vs Formative o Need to determine this before you gather any data because if the results are negative and this hasn't been negotiated beforehand the use of the info may be affected. Keeps some level of integrity in the process.

Program Evaluation Steps - In Detail!! - Step 3: ??

Program Evaluation Steps - In Detail!! - Step 4: Cost/benefit analysis o Cost effectiveness o Is this program worth the cost? § Look at financials · This is one area where assistance of experts can be helpful

Program Evaluation Steps - In Detail!! - Step 4:

1) The normal curve is unimodal. It has one "hump," and this hump is in the middle of the distribution. The most frequent value is in the middle. 2) The mean, median, and mode are equal. 3) The curve is symmetric. If you folded the normal curve in half, the right side would fit perfectly with the left side; that is, it is not skewed. 4) The range is infinite. This means that the extremes approach but never touch the x axis. 5) The curve is neither too peaked nor too flat, and its tails are neither too short nor too long; it has no kurtosis. Its proportions are like those in the attached figure.

Properties of the normal curve that are always present:

Nominal Variables. They also rely heavily on the process of developing appropriate codes or categories for behaviors, words, etc.

Qualitative / constructivist researchers rely on ______ variables.

§ What type of evaluation is feasible · i.e., evaluation design? § How effective was the program? · Observational or self-report methods · Quasi experimental designs IF POSSIBLE § How should program be delivered? · Obs. Or self-report methods to gather this info/assess this § What is net impact of program?

Questions for Summative Evaluation

- Mutually exclusive categories which are ordered low to high and have a true zero. Very few psychological scales have a true zero (ratio scale) -- ex: it is not possible to say that one has no intelligence (0) or no extroversion (0), or no amount of attitude of a particular type (0).

Ratio Scale

- Unacceptable with regard to reliability. - Even a high negative correlation is unacceptable: it would indicate that a person who initially scored high on the measure later scored low, and vice versa. - A negative reliability coefficient usually indicates a computational error or terrible inconsistency.

Reliability and negative correlations

Criteria: - Past reliability of the data produced by the instrument is high (e.g., above .80) or at least marginally acceptable (e.g., above .60). - The length of time that had been used to establish the test-retest reliability is similar to the length of time to be used in the study. It should be noted that as the length of time increases between administrations, the reliability usually decreases. - The sample that had been used to determine reliability of the instrument is similar to the sample that will be used in the current study.

Reliability considerations when choosing a measure:

- nominal, ordinal, interval, ratio - nominal (least/lowest unordered level) to Ratio (the highest level)

Scales of measurement

- provides a visual picture of the correlation - each dot/circle on the plot represents a particular individual's scores on the two variables, with one variable represented on the x axis and the other on the y axis - shows how an individual's score on one variable is associated with his or her score on the other variable

Scatterplot

See Chart - Appropriate descriptive graph and stats

See Chart - Appropriate descriptive graph and stats

Our Measurement terms: 1) Dichotomous 2) Nominal 3) Ordinal 4) Normal (approximately normally distributed)

Similar Measurement Terms to match with ours: 1)Binary, dummy variable, two categories 2) unordered, qualitative, names, categorical, discrete 3) Unequal intervals, ranks, discrete ordered categories 4) Continuous, equal intervals, interval scale, ratio scale, quantitative, scale (in SPSS), dimensional

When the data bunch up on one side of a central tendency and trail out on the other.

Skewness

Estimated reliability of the full test = 2r/(1+r), where r is the correlation between the two halves. Ex: if the correlation coefficient between the 1st and 2nd half of a test = .7, then the Spearman-Brown formula would estimate the reliability of the scores when using the entire test to be approximately .82

Spearman-brown formula - used to correct for the underestimation of split-half methods
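
The worked example on this card can be checked with a short sketch. The function name is hypothetical; the formula 2r/(1+r) and the value .7 are taken from the card itself.

```python
def spearman_brown(half_test_r: float) -> float:
    """Step a split-half correlation up to an estimate for the full-length test."""
    return 2 * half_test_r / (1 + half_test_r)


# Correlation of .7 between the two halves, as in the card's example.
print(round(spearman_brown(0.7), 2))  # 0.82
```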

- when a test or measure is split into two halves, and then the data from these halves are correlated - For example, one could correlate the first half of the test with the second half of the test, or compare the odd items with the even items. A third and highly recommended method is to randomly sample half of the items of the test and correlate them with the remaining items. - similar, content, difficulty

Split-half methods - what is it? - three examples of how to determine it - The two halves need to be _______ in _______ and ______.

- when dividing the test into two halves, the number of items is reduced by 50 percent compared with test-retest reliability or alternative forms reliability. * This reduction in size means that the resulting correlation coefficient will probably underestimate reliability. -- this can be accounted for by using the spearman-brown formula once the correlation coefficient is established.

Split-half methods - Problems

- the measure of how scores vary about the mean - common in both descriptive and inferential statistics. - See photo for formula

Standard Deviation (SD) - what is it? -formula

- allows us to establish a range of scores (i.e., confidence interval) within which should lie a performer's true score.

Standard error of measurement
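
A minimal sketch of how such a range might be computed. The card does not give a formula, so this assumes the usual classical test theory expression, SEM = SD × sqrt(1 − reliability), and a 95 percent band of plus or minus 1.96 SEM around the observed score. The SD of 15, reliability of .91, and observed score of 100 are made-up values.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Assumed classical formula: SD times the square root of (1 - reliability)."""
    return sd * math.sqrt(1 - reliability)


def score_band(observed: float, sem: float, z: float = 1.96):
    """Range around an observed score; z = 1.96 gives a roughly 95% band."""
    return observed - z * sem, observed + z * sem


sem = standard_error_of_measurement(sd=15, reliability=0.91)
low, high = score_band(observed=100, sem=sem)
print(round(sem, 2))                   # 4.5
print(round(low, 2), round(high, 2))   # 91.18 108.82
```

Lower reliability widens the band, so the same observed score says less about where the true score lies.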

- When the plotted points are close to a straight line (the linear regression line) from the lower left corner of the plot to the upper right, as in Figure 10.4, there is a relatively high positive correlation (e.g., +.5) between the variables. - When the linear regression line slopes downward from the upper left to the lower right, the correlation is high negative (e.g., -.5). - For correlations near zero (as in Figure 10.5), the regression line will be close to flat with many points far from the line, and the points will form a pattern more like a circle or blob than a line or oval.

Strength of correlations on scatter plots - What do they look like? - strong positive - strong negative - zero

o results of evaluation are available to others outside of those involved § Summative data · Reported at end of project to measure whether goals are being reached · SEE SLIDE!

Summative Evaluation - What is it?

- stability - one of the most common forms of reliability - easy to understand - if a test produces reliable scores, then if it is given more than once to the same person, that person's scores should be very close if not equal.

Test-retest reliability - a coefficient of what? - how often is this used? - understandability? -- Explain

- test-retest reliability is not established during an experiment . The test-retest reliability coefficient must be established ahead of time, prior to the study, using a period of time when little related to the substance of the instrument should be happening between the two administrations of the instrument. - Even if test-retest reliability has already been established for the instrument of choice, the investigator needs to determine some type of reliability for the present study.

Test-retest reliability - Important considerations

- problem: participants may use the knowledge gained on the pretest to alter the posttest score. This problem, often referred to as testing, or carryover effects, creates significant problems for the investigator because it becomes impossible to determine if the change in scores is due to the intervention or to knowledge obtained on the pretest. - Solutions: 1) create a design without a pretest (e.g., the posttest-only control group design). However, that design can be used only if the investigator can randomly assign participants to groups. 2) have a second or parallel form that could be used as a posttest in place of the instrument used for the pretest.

Test-retest reliability - problems / solutions

Measurement Reliability

The consistency of data collected from a measure.

- probability - It is important to be able to conceptualize the normal curve as a probability distribution because statistical convention sets acceptable probability levels for rejecting the null hypothesis at .05 or .01. As we shall see, when events or outcomes happen very infrequently, that is, only 5 times in 100 or 1 time in 100 (way out in the left or right tail of the curve), we wonder if they belong to that distribution or perhaps to a different distribution.

The normal curve is a _______________ distribution - why is this important?

All normal curves can be converted into standard normal curves by setting the mean equal to zero and the standard deviation equal to one. Since all normal curves have the same proportion of the curve within one standard deviation, two standard deviations, and so on of the mean, this conversion allows comparisons among normal curves with different means and standard deviations. - Show the number of standard deviation units that a person's score deviates from the group mean. A valuable characteristic of_____ is that they allow you to compare scores on different tests.

The standard normal curve and normal curves - what is the relationship? What are Z scores?
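
A short sketch of the z-score conversion described above, z = (score − mean) / SD. The two tests and their means and standard deviations are hypothetical values chosen to show a comparison across different scales.

```python
def z_score(score: float, mean: float, sd: float) -> float:
    """Number of standard deviation units a score lies from the group mean."""
    return (score - mean) / sd


# Hypothetical: the same person takes two tests scored on different scales.
z_test_a = z_score(score=85, mean=75, sd=10)
z_test_b = z_score(score=60, mean=50, sd=5)

print(z_test_a)  # 1.0
print(z_test_b)  # 2.0
```

Although the raw score on test A is higher, the person stands further above the group mean on test B.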

Average score that would result if a person were tested an infinite number of times

True score:

- Frequency polygons - Histograms - Bar charts

Types of Frequency Distributions

1999 vs. 1985: SEE IMAGE

Types of evidence for validity - 1999 vs 1985

that there is quite a bit of variability in the sample

What does it mean when very few scores in a sample are close to the mean?

- Use of research methods to assess the need for, and the design, implementation, and effectiveness of, a social intervention - Systematic assessment of the operations or outcomes of a program or a policy

What is Program Evaluation?

Spearman rho correlation - it is a nonparametric, ordinal statistic.

What statistic should be used with ordinal data? - explain

- Bar graph - the data is nominal - the points that happen to be adjacent in a bar graph's frequency distribution are not by necessity adjacent.

What type of frequency distribution should be used for variables like ethnic group, school curriculum, or age?

when data are ordinal or when other assumptions are markedly violated - Spearman Rho

When NOT TO use the Pearson correlation coefficient? - use what instead?
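
One way to see the difference is to compute Spearman rho as a Pearson correlation on ranks, a common way to calculate it. The sketch below assumes that approach; the helper names and the small data set are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation from sums of squares and cross-products."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Rank from 1 upward; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman rho computed as the Pearson correlation of the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


# Hypothetical scores: perfectly ordered, but with one extreme value.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 100]

print(round(spearman_rho(x, y), 2))  # 1.0
print(round(pearson_r(x, y), 2))     # lower, pulled down by the extreme value
```

Because rho uses only the ordering, the extreme score does not affect it, while the Pearson coefficient drops.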

- when the data is normally distributed

When is it appropriate to use standard deviation to measure variability?

- what type of variability of performance we might expect.

When selecting a test, one of the most important questions to ask, in addition to reliability and validity information, is ___________

- with nominal data - because there is no necessary ordering of the data points for nominal data

When should frequency polygons and histograms not be used? Why?

- when the frequency distribution is skewed markedly to one side - ex: the median income of 100 mid-level workers and a millionaire is substantially lower and reflects the central tendency of the group better than the mean income, which would be inflated in this example and for the country as a whole by a few people who make very large amounts of money.

When to use median instead of the mean? Example?
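
The income example can be sketched with made-up numbers: 100 workers earning 50,000 each and one person earning 1,000,000. The figures are assumptions for illustration, not data from the source.

```python
import statistics

# Hypothetical incomes: 100 mid-level workers plus one very high earner.
incomes = [50_000] * 100 + [1_000_000]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

print(round(mean_income))  # 59406
print(median_income)       # 50000
```

The single extreme income raises the mean by more than 9,000, while the median still equals what the typical worker earns.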

- when both variables are ordered and approximately normally distributed. - it is a bivariate parametric statistic used to determine correlation.

When to use the Pearson correlation coefficient?

- Trained in program evaluation o No specialized degree or certification o Relevant coursework § Qual and quant methods § Program evaluation theory § Ethics - Internal vs External Evaluators o External: Hired from outside → no apparent biases o Internal: employee who collects evaluation-related data § Should not do summative evaluation and corresponding report. o E.g., SACS vs Institutional Research

Who are the evaluators?

- Marketable skills - Agencies and organizations likely need evaluation o Are funds being used well? - often the question of private sector as well as state funded agencies o Who is paying for program evaluations? § E.g.: Accrediting bodies, granting agencies, state and fed gov agencies.

Why do you need to know about Program Evaluation?

2 must be reported. They are: 1) reliability coefficients cited in the literature prior to data collection for the study; and 2) the reliability coefficients estimated with the data from the study. - common for published measures - test-retest reliability, parallel forms - internal consistency, interrater reliability

_____#_______ reliability coefficients need to be reported. - Explain! - Which are cited as previously reported reliability coefficients? - which ones can be cited as previously reported AND also as reliability coefficients estimated with the data from the study?

- Nominal - nominal - associational - dichotomous

_______ variables cannot be included in associational statistics, such as multiple regression or Pearson correlations. The data from a ______variable could be included in _______ statistics if it is transformed into _______ variables.
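
A minimal sketch of transforming a nominal variable into dichotomous (dummy) variables. The function name, the curriculum categories, and the choice of reference category are hypothetical.

```python
def dummy_code(values, reference):
    """Return one 0/1 variable per category, omitting the reference category."""
    categories = sorted(set(values) - {reference})
    return {
        category: [1 if value == category else 0 for value in values]
        for category in categories
    }


# Hypothetical nominal variable with three unordered categories.
curriculum = ["academic", "vocational", "general", "academic"]
dummies = dummy_code(curriculum, reference="general")

print(dummies)
# {'academic': [1, 0, 0, 1], 'vocational': [0, 1, 0, 0]}
```

A three-category variable becomes two dichotomous variables; a case scoring 0 on both belongs to the reference category.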

reliability or consistency

___________ is necessary for measurement validity.

evaluation of validity

______________ is concerned with establishing evidence for the use of a particular instrument in a particular setting.

- similar to predictive evidence -- examines the relationship between the instrument and an outside criterion - can be used in place of predictive evidence in situations when it may be too expensive to wait between the time the test was taken and the measurement of the criterion. - ex: suppose that we were interested to see if the SAT taken in high school was a good predictor of freshman grades in college. However, we do not wish to wait the time it takes for the high school students to become freshmen. To determine concurrent evidence, we could take current freshmen and have them take the SAT and see whether it correlates with their grades (present grades, since they are now freshmen). If there is a high correlation, we can have some confidence in using the instrument as a predictor for success in college.

concurrent evidence

also can be obtained by substituting another instrument for the criterion, especially if it is difficult to measure the criterion. the instrument that is substituted for the criterion can never be more valid than the criterion. One must be cautious when substituting an instrument for a criterion, since in many cases the substituted instrument has not been validated against the criterion of interest. often the case with therapeutic or educational outcomes.

concurrent evidence - how else can it be obtained

the problem of finding a suitable criterion and then being able to measure that criterion. For example, gaining admission to occupational and physical therapy programs in the United States is very difficult due to the high number of applicants for the limited number of positions. In order to select the successful applicants, criteria such as grades and achievement tests often are used. Students (especially those who are not admitted) often complain that high grades don't make the person a good therapist. Could one create an admission test that would predict becoming a good therapist? Consider the problems of defining and measuring the criterion of "what makes a good occupational or physical therapist"?

concurrent evidence - major problem

concurrent evidence is not the same as predictive evidence, and one may not wish to place as much confidence in this procedure. Also, restricted range problems similar to those pointed out under predictive evidence are present. In the above example, we must make the assumption that there was little change between high school students and college students, since the target for the SAT is high school students. If there are large changes between high school and college, the validity of the instrument should be questioned. The best situation would be to obtain both predictive and concurrent evidence, although Cronbach (1960) suggests that this rarely occurs.

concurrent evidence problem of large changes in sample of target

- type of measurement validity - constructs are hypothetical concepts that cannot be observed directly (intelligence, achievement, and anxiety all = constructs) - When applying construct validity to an instrument, there is a requirement that the construct that the instrument is measuring is guided by an underlying theory. Often, especially in applied disciplines, there is little underlying theory to support the construct. - construct validation is a process (relatively slow process) where the investigator conducts studies to attempt to demonstrate that the instrument is measuring a construct. Three processes that are important for achieving construct validity are convergent evidence, discriminant evidence, and factorial evidence.

construct validity

This is determined by obtaining relatively high correlations between your scale and other measures that the theory suggests would be related. In order to demonstrate construct validity, one develops hypotheses about what the instrument should predict (convergent evidence or validity) if it is actually measuring the construct.

convergent evidence

This is provided by obtaining relatively low correlations between your scale and measures that the theory suggests should not be related to it (Lord & Novick, 1968). Discriminant evidence can also be obtained by comparing two groups that should differ on your scale and finding that they do, in fact, differ.

discriminant evidence

This type of evidence is provided when a construct is complex, and several aspects (or factors) of it are measured. If the clustering of items (usually done with factor analysis) supports the theory-based grouping of items, factorial evidence is provided (see Chapter 19 for a brief discussion of factor analysis).

factorial evidence

- insufficient - only one type

it is _____________ for establishing validity to use/provide ___________ type of evidence

Dependent Variables

it is always important to know the levels of measurement of the _______ variable(s) in a study.

- allows the researcher to separate test characteristics from participant characteristics. - This differs from both classical test theory and generalizability theory by providing information about reliability as a function of ability rather than averaging over all ability levels.

item response theory

- degree to which a measure or test measures that which it was intended to measure. - concerned with establishing evidence for the use of a particular measure or instrument in a particular setting with a particular population for a specific purpose. - We use the term measurement validity; others might use terms such as test validity, score validity, or just validity

measurement validity - what is it?

- We use the modifier measurement to distinguish it from internal, external, and overall research validity (discussed in Chapters 8, 9, 23, and 24) and to point out that the scores provide evidence for validity; it is inappropriate to say that a test is "valid" or "invalid." - Thus, when we address the issue of measurement validity with respect to a particular test, we are addressing the issue of the evidence for the validity of the scores on that test for a particular purpose and not the validity of the test or instrument. EXAMPLE/ EXPLAIN: - Scores from a given test might be used for a number of purposes. For example, specialty area scores on the Graduate Record Examination (GRE) might be used to predict first-year success in graduate school. However, the scores could also be used as a method to assess current status or achievement in a particular undergraduate major. - Although the same test is used in both instances, the purpose of the test is different, and thus the evidence in support of each purpose could be quite different.

measurement validity - what it IS NOT?

- A problem with this method is that it ignores chance agreements when few categories are used. - An additional problem with these percentage agreement methods is that they are most suited to situations with only two raters.

point-by-point / percentage agreement method - problems
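
To illustrate the chance-agreement problem, the sketch below compares raw point-by-point agreement with Cohen's kappa. Kappa is not named on these cards; it is brought in here as one commonly used chance-corrected index for two raters. The ratings are made up.

```python
from collections import Counter


def percent_agreement(ratings_a, ratings_b) -> float:
    """Point-by-point agreement: agreements divided by total responses."""
    return sum(a == b for a, b in zip(ratings_a, ratings_b)) / len(ratings_a)


def cohens_kappa(ratings_a, ratings_b) -> float:
    """(observed - expected) / (1 - expected), expected from each rater's base rates."""
    n = len(ratings_a)
    observed = percent_agreement(ratings_a, ratings_b)
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(
        (counts_a[category] / n) * (counts_b[category] / n)
        for category in set(ratings_a) | set(ratings_b)
    )
    if expected == 1:
        return 1.0  # both raters used a single identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical two-category ratings (1 = behavior present, 0 = absent).
judge_a = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
judge_b = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1]

print(percent_agreement(judge_a, judge_b))       # 0.8
print(round(cohens_kappa(judge_a, judge_b), 2))  # -0.11
```

With only two categories and both judges marking "present" 90 percent of the time, 80 percent raw agreement is no better than what chance alone would produce.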

Authors describe a different classification of levels of measurement that they believe is more useful and easier to understand. Please explain:

see slide for Gliner categorization of levels of measurement.

- measurement validity, criterion validity.

there is no one type of statistic to describe measurement validity. However, the correlation coefficient is used to describe one type of -------------, -------------.

-Attribute -Active IV -Nominal

when the IV is an _______, a judgment about the level of measurement should be made. Usually, with an __________ IV the categories of the independent variable are ________.

