CS 3205 Final Exam
Scalar questions
Ask user to judge a specific statement on a numeric scale. Scale usually corresponds with agreement or disagreement with a statement
Open-ended questions
Asks for unprompted opinions, good for general subjective information. Difficult to analyze rigorously
Discount usability evaluation (qualitative)
Observe user interactions. Gather user explanations and opinions. Produces a description, usually in non-numeric terms. Anecdotes, transcripts, problem areas, critical incidents
"Do you like my interface?"
Leading question; invites please-the-experimenter bias
Naturalistic approach
Observation occurs in realistic setting. Provides useful, realistic data. More likely to generalize. Hard to arrange and do, time consuming. Good for external validity
Usability engineering approach
Observe people using systems in simulated/ artificial settings. Given specific tasks to do. Observations/ measures made as people do their tasks. Look for problem areas/ successes. Good for uncovering 'big effects'. Drawbacks: non-typical users tested, non-typical tasks, different physical environment/ social context
Acceptance testing
Verify that system meets expected user performance criteria
Quantity vs. Quality
Bayles and Orland (and their pots). Quantity produces a better final product. Actually making things is important, instead of just sitting and thinking/ theorizing (functional fixation)
Sharing multiple prototypes
Better than sharing best or working as a group on one. More individual exploration, more feature sharing, more conversational turns, better consensus, increase in group rapport
Double-blind studies
Neither the user nor the facilitator knows which experimental group the user is in
Internal validity
Can reproduce the experiment multiple times yourself. Same prototypes, different users, same experimental setup, same conditions
Video recording
Can see and hear what a user is doing. One camera for the screen; a rear-view mirror is useful for catching the user's face in the same shot. Initially intrusive
Why do quantitative analysis?
Can't just ask people (preference is not performance). Observations alone won't work (effects may be too small to see but important, variability of people will mask differences). Need to understand differences between users. Good for small details
Case/field studies
Careful study of "system usage" at the site, good for seeing "real life" use, external observer monitors behavior, site visits
Ordinal scale
Classification into named or numbered ordered categories; no information on magnitude of differences between categories (preference, social status, gold/ silver/ bronze medals). Can do everything you can with a nominal scale, plus merge adjacent classes; the ordering is also transitive. Can find median, percentiles.
Nominal scale
Classification into named or numbered unordered categories (country of birth, user groups, gender, etc.). Can tell whether an item belongs in a category, can count items in a category. Can't find means, medians, etc.
Interval scale
Classification into ordered categories with equal differences between categories; zero only by convention (temperature C or F, time of day). Can add, subtract, cannot multiply as this needs an absolute zero. Can find mean, standard deviation, range, variance. Can have problems with instrument calibration, reproducibility, readability, human error
Qualitative analysis
Collect non-numerical data. Conversation transcripts, general observations. Analyze for broad consistent patterns. Naturalistic vs. experimental
Unpaired T-test
Comparing two sets of independent observations. Usually different subjects in each group. Groups may be different sizes
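A minimal sketch of an unpaired comparison using SciPy; the task-completion times and group sizes below are invented for illustration.

```python
from scipy import stats

# Hypothetical task-completion times (seconds) for two independent groups of users;
# note the groups may be different sizes
group_a = [41.2, 38.5, 44.0, 39.8, 42.1, 40.3]
group_b = [45.6, 47.1, 43.9, 48.2, 46.5]

# equal_var=True matches the t-test assumption of equal population variances
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=True)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # reject the null hypothesis if p < 0.05
```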
Causal inference
Control as many external variables as possible; randomize confounding variables. Any outside variable that could affect the result of the study should be equivalent for ALL test subjects.
Discount usability evaluation (quantitative)
Count, log, measure something of interest in user interactions. Speed, error rate, counts of activities, etc.
Collecting user performance data
Data collected on system use. Exploratory vs. targeted
T-test assumptions
Data points of each sample are normally distributed, population variances are equal, individual observations of data points in sample are independent (a person's data is included no more than once)
Usability engineering lifecycle
Design -> implementation -> evaluation (and repeat)
Inspection
Designer tries the system or prototype. Can catch major problems in early versions. Not reliable as completely subjective, not valid as the introspector is a non-typical user, and intuitions and introspection are often wrong. Task-centered walkthroughs and heuristic evaluation help address these weaknesses
Initial design stages
Develop and evaluate initial design ideas with the user
Best iPhone study
Hypothetical (did not actually happen): select participants for each phone group at random, train them for a while, then test for speed and error rate
Correlation
Do X and Y co-vary? Requires measuring X and Y. Probably need two prototypes or two different versions of a prototype (each with different X)
Causation
Does X cause Y? Requires measuring X and Y (establishing correlation). Requires establishing time precedence. Requires controlling for all confounding variables
Iterative design
Does system behavior match user's task requirements? Are there specific problems with the design? What solutions work?
Ways to get around please-the-experimenter bias
Double-blind studies; don't let the user know what you are measuring or what you care about (until the study is over); ask questions that cancel each other out. The evaluation measure should ALWAYS have a baserate if possible
Direct observations
Evaluator observes users interacting with system. Excellent at identifying gross design/ interface problems. Validity depends on how controlled/ contrived situation is. Simple observation, think aloud, constructive interaction
Evaluation
Experiment (or set of experiments) meant to provide answers to at least one design question. MUST have a research question, usually related to usability requirements. Heuristic, quantitative, qualitative
External validity
Experiment applies generally to other outside settings. Different users selected from a different 'pool', different prototypes with the same general IV and DV, different designers running the experiments. Results apply generally to experiments with the same abstract characteristics
Self selection
Experimental groups are chosen by participants in some manner
Experimental approach
Experimenter controls all environmental factors. Good for internal validity
Heuristic evaluation
Experts look at a system and analyze carefully, produce report of usability problems. Can be difficult/ expensive to find/ hire an expert
Chi-square test
Good for categorical data sets too large for Fisher's exact test. No expected count should be less than 5. X^2 = sum over categories i = 1..n of (O_i - E_i)^2 / E_i, where O_i is the observed count for category i and E_i is the expected count: E_i = (rowTotal * colTotal)/n. Use a table to get the p-value
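A minimal sketch of running the test with SciPy, assuming a 2x2 table of invented counts; chi2_contingency derives the expected counts from row and column totals exactly as above.

```python
from scipy.stats import chi2_contingency

# Hypothetical counts: rows = prototype A vs. B, columns = task success vs. failure
observed = [[30, 10],
            [18, 22]]

# correction=False gives the plain chi-square statistic (no Yates continuity correction)
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p:.4f}")
print(expected)  # verify no expected count falls below 5
```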
Interviews
Good for pursuing specific issues. Vary questions to suit context, probe more deeply on interesting issues as they arise (let user lead conversation), often leads to specific constructive suggestions. Accounts are suggestive, time consuming, evaluator can easily bias the interview, prone to rationalization of events/ thoughts
Audio recording
Good for recording think aloud talk. Hard to tie into on-screen user actions
Exploratory data collection
Hope something interesting shows up (like a pattern), can be difficult to analyze
Formative conceptual model
How a person perceives a screen after it has been used for a while
Initial conceptual model
How a person perceives a screen the very first time it is viewed
Baserate
How often does Y occur in the current setting (if one exists)? Might make sense to look at competing product
Independent vs dependent variables
Independent: the variable that is manipulated to study an effect via a change. Dependent: the variable that is measured for change after the IV is altered
Fair comparison
Insert the new approach into an actual production setting; recreate the production approach in your new setting; scale things down so you're looking at a piece of a larger system (most relevant); when expertise is relevant, train people before running the study
Methods for qualitative discount usability evaluation
Inspection, extracting the conceptual model, direct observation (think-aloud, constructive interaction), query techniques (interviews, questionnaires), continuous evaluation (user feedback, field studies)
Ratio scale
Interval scale with absolute, non-arbitrary zero (temperature K, length, weight, time periods). Can multiply, divide
Pre-design
Investing in new expensive system requires proof of viability
Targeted data collection
Look for specific information, but may miss something
Discount usability evaluation
Low cost methods to gather usability problems. Approximate: capture most large and many minor problems
Process of controlled experiments
Lucid and testable hypothesis (includes both independent and dependent variable(s)). Judiciously select and assign subjects to groups. Control for bias (in instructions, experimental protocols, subject selection). Apply statistical methods to data analysis. Interpret your results
Parallel prototyping
Make multiple prototypes in parallel. Separates ego from artifact (criticism of one design is not a criticism of designer). Supports transfer of positive attributes across designs
Continuous evaluation
Monitor systems in actual use (usually late stages of development like beta releases, delivered system; fix problems in next release). User feedback via gripe lines (users can provide feedback to designers while using the system through help desks, bulletin boards, email, built-in gripe facility) best combined with trouble-shooting facility
Degrees of freedom T-test
df = N1 + N2 - 2 for an unpaired test with group sizes N1 and N2 (a paired test has df = N - 1)
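For example, an unpaired study with 10 users in one group and 12 in the other has df = 10 + 12 - 2 = 20, so the critical t is read from the row for 20 degrees of freedom.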
Null hypothesis of T-test
No difference exists between the means of two sets of collected data
Scales of measurements
Nominal, ordinal, interval, ratio
First iPhone study
Numeric keypad and QWERTY users. Measure wpm. Internal validity, no external validity. Both groups performed the same (and did poorly compared to using their own phones)
How many users should you observe?
Observing many users is expensive. Individual differences matter. Shoot for 5 - 10. Reasonable number of users tested, reasonable range of users, big problems usually detected with handful of users, small problems/ fine measures need many users
Directional T-test
One-tailed. Only interested in whether the mean of a given condition is greater (OR less) than the other
Styles of questions
Open-ended questions, closed questions (scalar, multi-choice, ranked). Can combine to get specific response while allowing for user's opinion (with a comment section)
Recording observations
Paper and pencil, audio recording, video recording
Critical incidence interviews
People talk about incidents that stood out. Usually discuss extremely annoying problems with fervor, not representative but important to the user, often raises issues not seen in lab tests
Please-the-experimenter bias
People want to make you feel good about your work (they assume you worked hard)
How do we compare prototypes?
Perform an evaluation
Quantitative evaluation/ analysis
Perform an experiment that involves the collection of quantitative data (numeric data or data that can be translated into numeric data). Run statistical tests to evaluate differences across prototypes
How to interview
Plan a set of central questions, could be based on results of user observations, focuses the interview. Avoid leading questions. Let user responses lead follow-up questions
Retrospective testing interviews
Post-observation interview. Perform observational test, create video record of it, have users view video and comment on it. Clarify events that occurred during system use, avoids erroneous reconstructions, users often offer concrete suggestions
Quantitative analysis
Precise measurement, numerical values. User performance data collection, controlled experiments
Natural vs. experimental
Precision and direct control over experimental design vs. desire for maximum generalizability in real life situations
Questionnaires/ surveys
Preparation is expensive but administration is cheap. Does not require presence of evaluator. Results can be quantified. Only as good as the questions asked. Only ask questions that will have answers you care about. Determine the audience you want to reach. Determine how to deliver/ collect questionnaire (on-line, web site, surface mail)
Paper and pencil
Primitive but cheap. Record events, comments, interpretations. Hard to get detail (writing is slow). Should probably have two people doing this
Leading question
Question that suggests the answer the examiner is looking for or contains the information the examiner is looking to have confirmed. Don't ask these!
Ways of controlling subject variability
Reasonable number of subjects, random assignment, make different user groups an independent variable, screen for anomalies in the subject group
Type 1 error
Reject null hypothesis when it is, in fact, true. Considered worse because null hypothesis is meant to reflect the incumbent theory
Multi-choice questions
Respondent offered a choice of explicit responses
Ranked questions
Respondent places an ordering on items in a list. Useful to indicate user's preferences. Forced choice
Closed questions
Restrict respondent's responses by supplying alternative answers. Makes questionnaires a chore for respondent to fill in. Can be easily analyzed. Watch out for hard to interpret responses (alternative answers should be very specific)!
Conceptual model extraction
Show user static images of prototype or screens during use, have user explain function of each screen element/ how they would perform a particular task (and why they think that). Initial vs. formative. Good for eliciting people's understanding before and after use. Poor for examining system exploration and learning
T-test
Simple statistical test; allows one to say something about differences between means at a certain confidence level. Unpaired vs. paired, non-directional vs. directional. Compute t (formulas below), then look up the critical value in a table
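The formulas are standard statistics rather than anything specific to these notes. Unpaired (pooled-variance) test: t = (mean1 - mean2) / (sp * sqrt(1/N1 + 1/N2)), where sp^2 = [(N1 - 1)*s1^2 + (N2 - 1)*s2^2] / (N1 + N2 - 2). Paired test: t = dbar / (sd / sqrt(N)), where dbar and sd are the mean and standard deviation of the per-subject differences.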
Lucid and testable hypothesis
State a lucid, testable hypothesis. This is a precise problem statement
Statistical analysis
Tells us mathematical attributes about data sets (mean, variance, etc.), how data sets relate to each other, and the probability that claims are correct (statistical significance, typically at the 5% level)
Confidence limits
The confidence that your conclusion is correct
Null hypothesis
There is no difference
Constructive interaction method
Two people work together on a task. Monitor normal conversations, removes awkwardness of think-aloud. Co-discovery learning -> use semi-knowledgeable 'coach' and novice, only novice uses the interface, gives insights into two user groups
Non-directional T-test
Two-tailed. No expectation that the direction of difference matters
Fisher's exact test
Use a 2x2 contingency table with cells a, b, c, d and total n = a + b + c + d. p = [(a+b)! (c+d)! (a+c)! (b+d)!] / (a! b! c! d! n!). Good for simple comparisons between distributions of data and small sample sizes; very robust (p is exact). Bad for complicated multi-dimensional data and large sample sizes (because of the factorials)
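A minimal sketch with SciPy on an invented 2x2 table; fisher_exact returns the exact p directly.

```python
from scipy.stats import fisher_exact

# Hypothetical counts; the cells correspond to a, b, c, d in the formula above
table = [[8, 2],
         [3, 7]]

odds_ratio, p = fisher_exact(table, alternative='two-sided')
print(f"odds ratio = {odds_ratio:.2f}, p = {p:.4f}")  # p is exact, so it suits small samples
```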
Partial solution to usability engineering approach
Use real users. Task-centered system design tasks. Environment similar to real situation
Controlled experiments
Use traditional scientific method. Reductionist (clear convincing result on specific issues). Insights into cognitive process, human performance limitations, etc. Allows system comparison, fine-tuning of details, etc.
Direct observations in lab
User asked to complete set of pre-determined tasks
Direct observations in field
User goes through normal duties
Simple observation method
User is given task, evaluator just watches user. Does not give insight into the user's decision process or attitude
Think aloud method
Users speak their thoughts while doing the task (what they are trying to do, why they took action, how they interpret what the system did, etc.). Gives insight into what the user is thinking. Most widely used evaluation method in industry. May alter the way users do the task, unnatural, hard to talk if concentrating
Paired T-test
Usually a single group studied under both experimental conditions. Data points from one subject are treated as a pair. Both conditions will have the same number of data points
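A minimal sketch with SciPy, assuming the same five users tried both prototypes; the times are invented.

```python
from scipy import stats

# Hypothetical task times (seconds); the i-th entry of each list comes from the
# same user, so ttest_rel treats the observations as pairs
proto_a = [40.1, 43.5, 38.2, 45.0, 41.7]
proto_b = [36.4, 41.0, 37.1, 42.2, 39.8]

t_stat, p_value = stats.ttest_rel(proto_a, proto_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # df = N - 1 = 4 for five pairs
```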
Confounding variables
Variables that affect both X and Y. If not controlled, cannot say that X causes Y
3 questions to establish purpose of questionnaire
What information is sought? How would you analyze the results? What would you do with your analysis?
Interpret your results
What you believe results really mean, their implications on your research, their implications to practitioners, how generalizable they are, limitations and critique
Statistical vs. practical significance
When n is large, even a trivial difference may show up as a statistically significant result. Statistical significance does not imply that the difference is important (matter of interpretation)
Problem with visual inspection of data
Will almost always see variation in collected data. Is it normal variation or a real difference between data?
Better iPhone study
iPhone users (after >= one month of use), numeric keypad users, and QWERTY users. Measure speed and error rate. iPhone and QWERTY were the same speed (numeric much slower); iPhone users made many more errors. Still has the problem of self selection