ESI Full Deck
Cognitive Walkthrough
Cognitive walkthrough is cost-effective:
- It predicts the usability problems that users will experience when learning a new system by exploratory browsing.
- Experts walk through a task in a system, or with a design spec, to "question" it.
- Experts ask whether users will experience difficulties at each stage of the task by focusing on the users' knowledge and goals.

It predicts usability problems that users will encounter during exploratory learning of a new system:
- Therefore, it is used as part of formative evaluation.
- Requires experts, not users.
- Conducted individually or in a group.

There are 3 distinct phases in a cognitive walkthrough:
- Preparation (or "defining inputs")
- Conducting the walkthrough evaluation
- Analysis (dealing with the data)

Preparation includes defining the correct sequence of actions needed to complete each of the tasks using the system (the "happy path"). A cognitive walkthrough begins by defining the task or tasks that the user would be expected to carry out. It is these tasks that the cognitive walkthrough will examine for usability - any tasks that can be performed in the product but are not subject to a cognitive walkthrough will not normally be assessed during the process.

The evaluator is asked to tell a 'success' or 'failure' story for each action, using four questions:
- Will the user try to achieve the right outcome?
- Will the user notice that the correct action is available to them?
- Will the user associate the correct action with the outcome they expect to achieve?
- If the correct action is performed, will the user see that progress is being made towards their intended outcome?

The answers to the 4 questions for each action should be recorded on forms. If the answer to any question is 'no' (and hence there is a failure story for the action), this is considered to be a usability problem and a separate problem report form should be completed. Every "no" answer is a usability problem. List these - they are the findings from a cognitive walkthrough. E.g. "The photocopier does not provide any visual feedback when the user presses the on/off button."

A cognitive walkthrough is a technique used to evaluate the learnability of a system from the perspective of a new user. Unlike user testing, it does not involve users (and, thus, it can be relatively cheap to implement). Like heuristic evaluations, expert reviews, and PURE evaluations, it relies on the expertise of a set of reviewers who, in a highly structured manner, walk through a task and assess the interface from a new user's point of view.

A cognitive walkthrough takes place in a workshop setting. The user tasks to be evaluated within the session are defined in advance. (If you have a list of top tasks, that's a good source for evaluation tasks.) The workshop participants may include UX specialists, product owners, engineers, and domain experts. One participant acts as a facilitator. All participants serve as evaluators, offering their interpretation of how a particular type of user (which could be defined by a user persona) would perceive the interface and behave in the given situation. Another participant serves as the recorder, documenting the answers found for each question and the probable success or failure of the overarching task (as determined by the group).

The evaluators discuss 4 key questions (analysis criteria) meant to uncover potential causes for failure:
- Will users try to achieve the right result? In other words, do users understand that the action (step) at hand is needed to reach their larger goal?
- Will users notice that the correct action is available? In other words, is the interactive element that achieves the step visible or easily findable?
- Will users associate the correct action with the result they're trying to achieve? Perhaps the right button is visible, but will users understand the label and will they know to engage with it?
- After the action is performed, will users see that progress is made toward the goal? Based on what occurs after the action is taken, will users know that this action was correct and helped them make progress toward their larger goal?

Are cognitive walkthroughs appropriate for all types of interfaces? Since cognitive walkthroughs are meant to evaluate learnability, they're most effective for systems with complex, new, or unfamiliar workflows and functionalities. This question-based evaluation approach applied to common tasks sets cognitive walkthroughs apart from heuristic evaluations, which are more general in nature. Heuristic evaluations help identify weaknesses and potential improvements by evaluating the entire product against a set of usability guidelines and best practices; they do not seek to explore users' perspectives and reactions to the system.
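To make the recording step concrete, here is a minimal sketch (my own illustration, with a hypothetical action and answers based on the photocopier example above) of how the four questions can be logged per action, with every "no" turned into a problem report:

```python
# A minimal sketch of recording one walkthrough action: a yes/no answer to each
# of the four questions; any "no" (a failure story) becomes a problem report.
QUESTIONS = [
    "Will the user try to achieve the right outcome?",
    "Will the user notice that the correct action is available?",
    "Will the user associate the correct action with the outcome they expect?",
    "If the correct action is performed, will the user see progress towards the outcome?",
]

action = "Press the on/off button"        # hypothetical step on the happy path
answers = [True, True, True, False]       # False = a failure story for that question

problem_reports = []
for question, answer in zip(QUESTIONS, answers):
    if not answer:
        problem_reports.append({
            "action": action,
            "failed question": question,
            "description": "The photocopier does not provide any visual feedback "
                           "when the user presses the on/off button.",
        })

print(f"{len(problem_reports)} usability problem(s) recorded for '{action}'")
```

In practice this is simply a paper or spreadsheet form; the point is that the list of failure stories is the set of findings from the walkthrough.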
A/B Testing
In A/B testing, you unleash two different versions of a design on the world and see which performs best. For decades, this has been a classic method in direct mail, where companies often split their mailing lists and send out different versions of a mailing to different recipients. A/B testing is also becoming popular on the Web, where it's easy to make your site show different page versions to different visitors. Sometimes, A and B are directly competing designs and each version is served to half the users. Other times, A is the current design and serves as the control condition that most users see. In this scenario, B, which might be more daring or experimental, is served only to a small percentage of users until it has proven itself.

Benefits
Compared with other methods, A/B testing has four huge benefits:
- As a branch of website analytics, it measures the actual behavior of your customers under real-world conditions. You can confidently conclude that if version B sells more than version A, then version B is the design you should show all users in the future.
- It can measure very small performance differences with high statistical significance because you can throw boatloads of traffic at each design; with enough traffic you can detect even a 1% difference in sales between two designs.
- It can resolve trade-offs between conflicting guidelines or qualitative usability findings by determining which one carries the most weight under the circumstances. For example, if an e-commerce site prominently asks users to enter a discount coupon, user testing shows that people will complain bitterly if they don't have a coupon, because they don't want to pay more than other customers. At the same time, coupons are a good marketing tool, and usability for coupon holders is obviously diminished if there's no easy way to enter the code. When e-commerce sites have tried A/B testing with and without coupon entry fields, overall sales typically increased by 20-50% when users were not prompted for a coupon on the primary purchase and checkout path. Thus, the general guideline is to avoid prominent coupon fields. Still, your site might be among the exceptions, where coupons help more than they hurt. You can easily find out by doing your own A/B testing under your own particular circumstances.
- It's cheap: once you've created the two design alternatives (or the one innovation to test against your current design), you simply put both of them on the server and employ a tiny bit of software to randomly serve each new user one version or the other. You also typically need to cookie users so that they'll see the same version on subsequent visits instead of suffering fluctuating pages, but that's also easy to implement. There's no need for expensive usability specialists to monitor each user's behavior or analyze complicated interaction design questions. You just wait until you've collected enough statistics, then go with the design that has the best numbers.

Limitations
With these clear benefits, why don't we use A/B testing for all projects? Because the downsides usually outweigh the upsides. First, A/B testing can only be used for projects that have one clear, all-important goal - that is, a single KPI (key performance indicator). Furthermore, this goal must be measurable by computer, by counting simple user actions. Examples of measurable actions include:
- Sales for an e-commerce site.
- Users subscribing to an email newsletter.
- Users opening an online banking account.
- Users downloading a white paper, asking for a salesperson to call, or otherwise explicitly moving ahead in the sales pipeline.

Unfortunately, it is rare that such actions are a site's only goal. Yes, for e-commerce, the amount of dollars collected through sales is probably paramount. But sites that don't close sales online can't usually say that a single desired user action is the only thing that counts. Yes, it's good if users fill in a form to be contacted by a salesperson. But it's also good if they leave the site feeling better about your product and place you on their shortlist of companies to be contacted later in the buying process, particularly for B2B sites. If, for example, your only decision criterion is to determine which design generates the most white paper downloads, you risk undermining other parts of your business.

For many sites, the ultimate goals are not measurable through user actions on the server. Goals like improving brand reputation or supporting the company's public relations efforts can't be measured by whether users click a specific button. Press coverage resulting from your online PR information might be measured by a clippings service, but it can't tell you whether the journalist visited the site before calling your CEO for a quote.

In contrast, paper prototyping lets you try out several different ideas in a single day. Of course, prototype tests give you only qualitative data, but they typically help you reject truly bad ideas quickly and focus your efforts on polishing the good ones. Much experience shows that refining designs through multiple iterations produces superior user interfaces. If each iteration is slow or resource-intensive, you'll have too few iterations to truly refine a design. A possible compromise is to use paper prototyping to develop your ideas. Once you have something great, you can subject it to A/B testing as a final stage to see whether it's truly better than the existing site. But A/B testing can't be the primary driver on a user interface design project.

No Behavioral Insights
The biggest problem with A/B testing is that you don't know why you get the measured results. You're not observing the users or listening in on their thoughts. All you know is that, statistically, more people performed a certain action with design A than with design B. Sure, this supports the launch of design A, but it doesn't help you move ahead with other design decisions. You also have no idea whether other changes might bring even bigger improvements, such as changing the button's color or the wording on its label. Or maybe changing the button's page position or its label's font size, rather than changing the button's size, would create the same or better results. Basically, you know nothing about why button B was not optimal, which leaves you guessing about what else might help. After each guess, you have to implement more variations and wait until you collect enough statistics to accept or reject the guess.

Worst of all, A/B testing provides data only on the element you're testing. It's not an open-ended method like user testing, where users often reveal stumbling blocks you never would have expected. It's common, for example, to discover problems related to trust, where users simply don't want to do business with you because your site undermines your credibility.

A disadvantage of A/B tests, however, is the time required to prepare and set up two different versions of a website.
In addition, for websites with little traffic, the tests may have to be carried out over several weeks or months in order to obtain a sufficiently large database for meaningful results. A/B testing also cannot measure or indicate whether there are usability problems on a website that may be responsible for results such as a low conversion rate. If multiple variables are changed simultaneously, there is also a risk that the test results will be misinterpreted. -- I would also add that it can be confusing for users to come across different versions of a site, especially if someone is attempting to show another person how to use it.
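To give a feel for how much traffic "a sufficiently large database" can mean, here is a rough sketch (my own illustration, not from the text): it estimates how many visitors each variant needs to reliably detect a one-percentage-point difference in conversion, using hypothetical rates of 2% for A and 3% for B and the standard two-proportion approximation.

```python
# A rough sample-size sketch: detect a one-percentage-point difference in
# conversion (hypothetical: 2% for A vs 3% for B), two-sided test,
# alpha = 0.05, power = 0.80.
from scipy.stats import norm

def visitors_per_variant(p_a, p_b, alpha=0.05, power=0.80):
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance = p_a * (1 - p_a) + p_b * (1 - p_b)
    return ((z_alpha + z_beta) ** 2 * variance) / (p_a - p_b) ** 2

print(round(visitors_per_variant(0.02, 0.03)))   # roughly 3,800 visitors per variant
```

Needing several thousand visitors per variant is exactly why low-traffic sites can end up running a test for weeks or months.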
CUE-9 - The evaluator effect
The purpose of CUE-9 was to investigate the evaluator effect (or Rashomon effect), which names the observation that usability evaluators who analyse the same usability test sessions often identify substantially different sets of usability problems. CUE-9 assembled experienced usability professionals to discuss the state of the art in usability evaluation, based on a common experience of evaluating the website for the US moving company U-Haul. Each professional team:
- Watched five 30-minute videos from usability test sessions,
- Wrote a short, anonymous report about their findings,
- Submitted their report,
- Read similar reports written by other experienced professionals,
- Met experienced colleagues at the CUE-9 workshop, where they compared and discussed findings and learned from the similarities and differences.

Practitioner's Take Away
- Have more than one evaluator independently analyse test sessions, at least in important evaluations. With more than one evaluator, more problems are detected and evaluators get an opportunity to reflect on their agreements and disagreements.
- Consult people with local or domain knowledge to avoid uncertainty in the analysis of user actions. Local and domain knowledge may be needed to interpret whether users approach tasks appropriately, miss important information, and reach correct task solutions.
- The goal of a test should be clarified ahead of the test.
- Consolidate the severity ratings of the reported usability issues in a group process. Such a process is likely to reduce the number of highly rated problems and thereby adds focus to redesign work. A group process may also support problem prioritisation, by providing the usability specialists with development people who are knowledgeable about the ease or difficulty of fixing the problems.
- Consider the use of unmoderated tests. On the basis of this study, unmoderated tests appear to be a cost-effective alternative or supplement to moderated tests, as the evaluator effect and the number of identified usability issues were similar for moderated and unmoderated tests.
- Remember that perfect reliability is not required in order for usability testing to be worthwhile. This is particularly relevant when multiple usability tests are conducted in an iterative process of evaluation and redesign, thereby providing additional possibilities for finding usability problems that are initially missed.

Paper and Article About CUE-9
- What you get is what you see: revisiting the evaluator effect in usability tests, by Morten Hertzum, Rolf Molich and Niels Ebbe Jacobsen, Behaviour & Information Technology, April 2013.
- A little known factor that could have a big effect on your next usability test, by David Travis, retrieved 19 March 2020.

Example: CUE-9
Again, investigated the evaluator effect.
- 9 UX professionals watched moderated usability test sessions; 10 watched unmoderated sessions.
- The evaluator effect was similar for both.
- Participants individually reported an average of 33% (32%) of all the problems in their group.
- The evaluator effect existed for all the problems and for the most severe problems.
Eight is Not Enough
When we tested the site with 18 users, we identified 247 total obstacles-to-purchase. Contrary to our expectations, we saw new usability problems throughout the testing sessions. In fact, we saw more than five new obstacles for each user we tested. Equally important, we found many serious problems for the first time with some of our later users. What was even more surprising to us was that repeat usability problems did not increase as testing progressed. These findings clearly undermine the belief that five users will be enough to catch nearly 85% of the usability problems on a web site. In our tests, we found only 35% of all usability problems after the first five users. We estimated over 600 total problems on this particular online music site. Based on this estimate, it would have taken us 90 tests to discover them all!

While we had tested users on an e-commerce site, Virzi and others had tested users on software products. Today's web sites, particularly e-commerce sites, can be more complex than standard software products that often confine users to a very limited set of activities. Web tasks are also vastly more complex than those users have with most software applications. For example, our tests asked users to complete shopping tasks. No two users looked for the same product and no two users approached the site in the same way. The tasks were dependent on individual user characteristics and interests. Because of the increased complexity of web sites, it's understandable that more users are needed to detect the majority of usability problems.

If you're working on a large e-commerce site - or any web site at all - the usability of your site would likely benefit from ongoing testing. Instead of thinking of usability testing as a discrete activity that takes place every 6 months and involves six, eight or twelve users, think about the advantages of ongoing usability testing, bringing in a user or two every week. With this kind of plan, you'll see over 20 users in six months. With more users testing your site, you'll get more feedback, find more problems, and have more data, but there may be some less obvious advantages as well. When a design team gets into the mindset of regular testing, they can try out new ideas and find out whether these work without making the live site the testing ground. Because web sites undergo many incremental, seemingly small design changes in between drastic redesigns, there's always new fodder for testing.

Thinking Aloud and Eye Tracking
A separate study found that participants tended to shift their gazes from the upper navigation to the left navigation when thinking aloud. The results partially matched the findings from the Bergstrom and Olmsted-Hawala study. It has been hypothesized that one reason a difference exists is that participants are looking away from the screen to describe something to the researcher, or focusing on certain areas of the screen while describing their thought processes regarding that area. Another explanation for the shift is that participants are quickly leaving the top navigation and moving to the left navigation to find things to read (especially because the participants knew they would be asked to recall what the website did). Future research can examine whether this pattern holds. The results of this study found that thinking aloud does affect where and how long people look at parts of a website homepage in the first five seconds.
Pros and Cons of Experiments in HCI
- Experiments can provide strong evidence to address specific questions - the kind of evidence that may be needed for publication.
- Their scope is narrow.
- They are expensive in terms of resources.
- Rigorous experimental design is required to avoid errors that undermine validity.
Matched Participants
- If the same users cannot do all the conditions (because one condition would contaminate the other), but you are worried about not being able to get enough users in each group, you can use two different groups of users but match them on relevant variables.
- For example, you might match participants based on experience with systems of this type:
- if you have 5 users in each condition, have one pair who are matched for one year of experience, one pair who are matched for two years of experience, and so on (see the sketch below).
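As a concrete illustration of matching, here is a minimal sketch (my own example, with a hypothetical participant pool) that pairs users on years of experience and randomly assigns one member of each matched pair to each condition:

```python
# A minimal sketch of matched assignment: group a hypothetical pool by years
# of experience, then randomly assign one member of each pair to each condition.
import random
from collections import defaultdict

pool = {"P1": 1, "P2": 1, "P3": 2, "P4": 2, "P5": 5,
        "P6": 5, "P7": 10, "P8": 10, "P9": 20, "P10": 20}

by_experience = defaultdict(list)
for name, years in pool.items():
    by_experience[years].append(name)

condition_a, condition_b = [], []
for names in by_experience.values():
    random.shuffle(names)              # randomise who in each pair gets which condition
    for i, name in enumerate(names):
        (condition_a if i % 2 == 0 else condition_b).append(name)

print("Condition A:", condition_a)     # 5 users per condition, matched on experience
print("Condition B:", condition_b)
```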
Comparative Studies: Guidelines II
- Participants should be similar in terms of their experience etc. - Participants using the different techniques should be matched (or randomly assigned if you are confident that they are similar). - Participants should perform the evaluation in the same location under the same conditions. - Participants should be asked to report their results using the same usability report format.
"Within Subjects"
- Same users for each condition.
- Two advantages:
- You need fewer participants. E.g. in an experiment with two conditions, you only need half the number of users compared to a 'Between Subjects' design.
- You cut down on problems like one person being very particular about something, as they have that effect on both conditions.
- But there is a learning (order) effect: users will learn from their experience in the first condition, so their performance in the second condition might be improved. The usual remedy is to counterbalance the order of conditions across participants, as sketched below.
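A minimal sketch of counterbalancing (my own illustration, with hypothetical participants): half of the participants do condition A first, the other half do B first, so the learning effect is spread evenly across both orders.

```python
# A minimal sketch of counterbalancing a within-subjects study.
import random

participants = [f"P{i}" for i in range(1, 11)]   # hypothetical participants
random.shuffle(participants)

orders = {p: (["A", "B"] if i % 2 == 0 else ["B", "A"])
          for i, p in enumerate(participants)}

for p in sorted(orders, key=lambda name: int(name[1:])):
    print(p, "->", " then ".join(orders[p]))
```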
Statistical Significance (P value)
- The probability of obtaining a difference in the measures at least as large as the one observed if there were really no effect (i.e. if the null hypothesis were true). A small p value indicates that the observed difference is unlikely to have occurred by chance alone.
Comparative Studies: Guidelines III
- The researcher should be consistent and avoid bias in matching problems, both to remove duplicates and to compare techniques.
- The researcher should be consistent and avoid bias in categorising problems and rating severity, e.g. through the use of multiple blind raters.
How to Run an A/B Test
1. Ask a question and refine it into a specific hypothesis.
2. Decide what to change: create the A and B versions.
3. Decide what to measure (often clicks on something).
4. Visitors to the site are divided equally between the A and B versions.
5. The testing service collects click data until there are enough visitors for statistical significance (see the sketch below).

What Do You Test?
- This is the independent variable.
- Almost anything:
• Call to action button
• Images
• Layout
• Colour
• Content

What Do You Measure?
- This is the dependent variable.
- You need quantifiable success metrics.
- Different sites want visitors to do different things:
• Click on ads
• Take out a subscription
• Register
• Purchase
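As a sketch of step 5 (my own illustration with hypothetical click counts; a real A/B testing service does the equivalent for you), a two-proportion z-test shows how "enough visitors for statistical significance" is checked:

```python
# A minimal sketch of checking significance once click data are in.
from statsmodels.stats.proportion import proportions_ztest

clicks = [130, 165]       # hypothetical clicks on the call-to-action for A and B
visitors = [5000, 5000]   # visitors served each version

z_stat, p_value = proportions_ztest(count=clicks, nobs=visitors)
print(f"A: {clicks[0]/visitors[0]:.2%}  B: {clicks[1]/visitors[1]:.2%}  p = {p_value:.3f}")
if p_value < 0.05:
    print("Difference is statistically significant - B can be declared the winner.")
else:
    print("Not significant yet - keep collecting data.")
```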
When Creating Tasks, Avoid
1. Telling Users Where to Go
2. Telling Users What to Do
3. Creating Out-of-Date Tasks (like booking a flight for a month in the past)
4. Making Tasks Too Simple
5. Creating an Elaborate Scenario
6. Writing an Ad, not a Task. Don't let marketing language or internal lingo sneak into tasks. Make sure your tasks don't include marketing phrases like "exciting new feature," business phrases like "thinking outside the box," or mysterious corporate acronyms.
7. Risking an Emotional Reaction. While writing a task that revolves around someone's mother may seem harmless, you never know the specific circumstances of your study participants. Mentioning a specific relationship in a task may add unnecessary emotion to the user test.
8. Trying to Be Funny. Don't joke, use famous names in tasks, or otherwise try to lighten the mood. Doing so can backfire and make some participants feel awkward or, even worse, as though you are making fun of them. Even using gender-neutral names, such as telling the user to register as Kelly or Jesse, can be a distraction from the task.
9. Offending the Participant. Avoid potentially offensive details in tasks. Societal issues, politics, health, religion, age, and money all have the possibility of offending a participant.
10. Asking Rather than Telling. While you want to be polite to your participants, don't overdo it. Don't ask participants "how would you" complete a task - unless you want them to talk you through what they theoretically would do on a site, rather than doing it. The point of usability testing is to see what users do, not to hear what they would do.

Tip: Start with the End Goal
Tasks
1. Verb-based tasks:
- ask the user to do something and thereby test functionality
- e.g. submit your INM315 coursework via Moodle
2. Scavenger-hunt tasks:
- ask the user to find something specific
- useful for information-rich systems
- e.g. find a bookcase that fits your living room on the IKEA website

Beware Simplistic "Scavenger-Hunt" Tasks (Jared Spool, 2006, 2019)
- simplistic information-seeking tasks
- can be too easily achieved
- not realistic
- e.g. "find a bookcase"
PURE
Pragmatic Usability Rating by Experts. https://www.nngroup.com/articles/pure-method/

Experts rate ease of use (or "friction") for key tasks for specific target users. PURE delivers a single UX score by aggregating several ratings.

PURE Scorecard:
- 7 tasks
- a bar for each step in the task
- varying levels of friction

Definition: PURE is a usability-evaluation method in which usability experts assign one or more quantitative ratings to a design based on a set of criteria and then combine all these ratings into a final score and easy-to-understand visual representation.

The numbers and colors shown in PURE scores represent friction, the opposite of ease of use. The higher the number and the "hotter" the colors, the more friction there is - similar to usability-severity ratings. Comparing the PURE scorecard for the same task across different product versions or among competitors allows you to easily see the variation in friction for different designs of the task. Although lower numbers usually mean less friction, the quality of the steps should also be considered, as indicated by their colors. One of the big benefits of PURE is that it considers overall user effort, rather than just clicks or steps. This can help counter overly simplistic arguments that fewer clicks will result in higher levels of success, and instead refocus attention on reducing user effort, rather than just clicks. (Note that you should generally avoid comparing the PURE scores of different tasks, since their nature and goals are often quite different.)

Because PURE measures the friction in a set of tasks, it is important to define the tasks to be reviewed. Pragmatically, not every task can be measured, so in PURE, we only score the "fundamental tasks" - those critical for the target user and the business. In a sample PURE score for a product with 7 fundamental tasks, the PURE score for the product (38 in this case) is the sum of the PURE scores for all fundamental tasks. Just like for tasks, the overall color for the product is determined by the worst color of the fundamental tasks in the product. This means that a single red step (rated 3) in any fundamental task causes that entire task and product to be colored red. The rationale for this convention is that no consumer product should have a step in which the target user is likely to fail a fundamental task. The color red has a tendency to make that statement clearly and focus attention on potential points of failure in the product. (A small sketch of how these ratings roll up is shown after this section.)

Stakeholders will want to improve these numbers as well. But, unlike other metrics, PURE scores are operational - they show what caused poor metrics and where the user experience needs improvement, providing a clear roadmap for refining the design. Showing PURE at regular business meetings, where product or business metrics are discussed, helps ensure that projects aimed at improving user experience are prioritized and executed.

Another benefit of PURE is that you can use it on user experiences that haven't been completely built yet. While it is more accurate when conducted on fully functioning products, PURE can be applied to medium-fidelity prototypes or to clickable wireframes - to either compare possible solutions to the same design problem or see how a proposed flow fares in terms of ease of use before committing to coding it.
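To show how the roll-up rules above work, here is a minimal sketch (my own illustration with hypothetical tasks; the 1 = green, 2 = yellow, 3 = red mapping and the task score being the sum of its step ratings are assumptions consistent with the description above):

```python
# A minimal sketch of how PURE ratings roll up: a task's score is the sum of
# its step ratings and its colour is its worst step colour; the product score
# is the sum of the task scores and the product colour is the worst task colour.
COLOURS = {1: "green", 2: "yellow", 3: "red"}

def task_summary(step_ratings):
    return sum(step_ratings), COLOURS[max(step_ratings)]

def product_summary(tasks):
    per_task = {name: task_summary(steps) for name, steps in tasks.items()}
    product_score = sum(score for score, _ in per_task.values())
    product_colour = COLOURS[max(max(steps) for steps in tasks.values())]
    return product_score, product_colour, per_task

tasks = {                       # hypothetical scorecard with three fundamental tasks
    "Create an account": [1, 2, 1],
    "Set up a device":   [2, 3, 2, 1],
    "Share a report":    [1, 1, 2],
}
score, colour, per_task = product_summary(tasks)
print(per_task)                                    # per-task (score, colour) pairs
print("Product PURE score:", score, "-", colour)   # one red step makes the product red
```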
Having 5 Participants
Qualitative Usability Studies: Assumptions Behind the 5-User Guideline

In contrast, qualitative user studies are mostly formative: their goal is to figure out what doesn't work in a design, fix it, and then move on with a new, better version. The new version will usually also get tested, improved on, and so forth. While it is possible to have qualitative studies that have summative goals (let's see all that's wrong with our current website!), a lot of the time they simply aim to refine an existing design iteration. Qualitative studies (even when they are summative) do not try to predict how many users will complete a task, nor do they attempt to figure out how many people will run into any specific usability issue. They are meant to identify usability problems.

The 5-user guideline rests on these assumptions:
- That you are trying to identify issues in a design. By definition, an issue is some usability problem that the user experiences while using the design.
- That any issue that somebody encounters is a valid one worth fixing. To make an analogy for this assumption: if one person falls into a pothole, you know you need to fix it. You don't need 100 people to fall into it to decide it needs fixing.
- That the probability of someone encountering an issue is 31%.

Conclusion
There is no contradiction between the 5-user guideline for qualitative user testing and the idea that you cannot trust metrics obtained from small studies, because you do not collect metrics in a qualitative study. Quantitative and qualitative user studies have different goals:
- Quantitative studies aim to find metrics that predict the behavior of the whole population; such numbers will be imprecise - and thus useless - if they are based on a small sample size.
- Qualitative studies aim for insights: to identify usability issues in an interface. Researchers must use judgment rather than numbers to prioritize these issues.
(And, to hammer home the point: the 5-user guideline only applies to qualitative, not to quantitative studies.)

After creating the new design, you need to test again. Even though I said that the redesign should "fix" the problems found in the first study, the truth is that you think that the new design overcomes the problems. But since nobody can design the perfect user interface, there is no guarantee that the new design does in fact fix the problems. A second test will discover whether the fixes worked or whether they didn't. Also, in introducing a new design, there is always the risk of introducing a new usability problem, even if the old one did get fixed. Also, the second study with 5 users will discover most of the remaining 15% of the original usability problems that were not found in the first round of testing. (There will still be 2% of the original problems left - they will have to wait until the third study to be identified.)

Why Not Test With a Single User?
You might think that 15 studies with a single user would be even better than 3 studies with 5 users. The curve does show that we learn much more from the first user than from any subsequent users, so why keep going? Two reasons:
- There is always a risk of being misled by the spurious behavior of a single person who may perform certain actions by accident or in an unrepresentative manner. Even 3 users are enough to get an idea of the diversity in user behavior and insight into what's unique and what can be generalized.
- The cost-benefit analysis of user testing provides the optimal ratio around 3 or 5 users, depending on the style of testing.
There is always a fixed initial cost associated with planning and running a study: it is better to depreciate this start-up cost across the findings from multiple users.

If, for example, you have a site that will be used by both children and parents, then the two groups of users will have sufficiently different behavior that it becomes necessary to test with people from both groups. The same would be true for a system aimed at connecting purchasing agents with sales staff. Even when the groups of users are very different, there will still be great similarities between the observations from the two groups. All the users are human, after all. Also, many of the usability problems are related to the fundamental way people interact with the Web and the influence from other sites on user behavior. In testing multiple groups of disparate users, you don't need to include as many members of each group as you would in a single test of a single group of users. The overlap between observations will ensure a better outcome from testing a smaller number of people in each group. I recommend:
- 3-4 users from each category if testing two groups of users
- 3 users from each category if testing three or more groups of users (you always want at least 3 users to ensure that you have covered the diversity of behavior within the group)

As with any human factors issue, however, there are exceptions:
- Quantitative studies (aiming at statistics, not insights): test at least 20 users to get statistically significant numbers; tight confidence intervals require even more users.
- Card sorting: test at least 15 users per user group.
- Eyetracking: test 39 users if you want stable heatmaps.

The main argument for small tests is simply return on investment: testing costs increase with each additional study participant, yet the number of findings quickly reaches the point of diminishing returns. There's little additional benefit to running more than 5 people through the same study; ROI drops like a stone with a bigger N. And if you have a big budget? Yay! Spend it on additional studies, not more users in each study.

"We have several different target audiences." This can actually be a legitimate reason for testing a larger user set because you'll need representatives of each target group. However, this argument holds only if the different users are actually going to behave in completely different ways. Some examples from our projects include a medical site targeting both doctors and patients, and an auction site where you can either sell stuff or buy stuff.

The last point also explains why the true answer to "how many users" can sometimes be much smaller than 5. If you have an Agile-style UX process with very low overhead, your investment in each study is so trivial that the cost-benefit ratio is optimized by a smaller benefit. (It might seem counterintuitive to get more return on investment by benefiting less from each study, but this savings occurs because the smaller overhead per study lets you run so many more studies that the sum of numerous small benefits becomes a big number.) For really low-overhead projects, it's often optimal to test as few as 2 users per study. For some other projects, 8 users - or sometimes even more - might be better. For most projects, however, you should stay with the tried-and-true: 5 users per usability test.
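The arithmetic behind the guideline can be made explicit. A minimal sketch, assuming each participant independently reveals a given problem with probability 31% (the figure quoted above):

```python
# The expected share of problems found with n participants is 1 - (1 - L)**n,
# where L is the per-participant probability of revealing a given problem.
L = 0.31

for n in (1, 2, 3, 5, 10, 15):
    found = 1 - (1 - L) ** n
    print(f"{n:>2} users -> {found:.1%} of problems expected to be found")
# Five users give roughly 85%, which is where the familiar claim comes from.
```

Note that the Eight is Not Enough study above argues the per-user probability is much lower for complex e-commerce tasks, which is why its numbers come out so differently.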
Hypothesis
States a causal relationship between independent variables and dependent variables
Null Hypothesis (H0)
States that the independent variable does not influence the dependent variable.
Effectiveness Measures IV
Thoroughness: - how many of the 'real' usability problems are found - this leads us on to comparing techniques...
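One way to make thoroughness concrete (a sketch under the usual definition: the proportion of 'real' problems, i.e. those users actually experience, that a technique finds; the problem names below are hypothetical):

```python
# A minimal sketch of thoroughness as a proportion.
real_problems = {"no feedback on submit", "jargon in error message", "hidden search"}
found_by_technique = {"no feedback on submit", "hidden search", "logo too small"}

thoroughness = len(found_by_technique & real_problems) / len(real_problems)
print(f"Thoroughness: {thoroughness:.0%}")   # 2 of 3 real problems found -> 67%
```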
Alternative Hypotheses (HA)
Users will find the "Add to Basket" button more quickly if it is placed on the bottom right rather than the bottom left of a web page.
User Experience
"User experience' encompasses all aspects of the end-user's interaction... the first requirement for an exemplary user experience is to meet the exact needs of the customer, without fuss or bother. Next comes simplicity and elegance which produces products that are a joy to own, a joy to use." Nielsen Norman Group "User experience... refers to all aspects of someone's interaction with a product, application, or system" Tullis and Albert
Comparative Studies: Guidelines I
- All techniques being compared should be used to evaluate the same system. - All techniques should be used to evaluate the same parts of that system. - The same number of participants should use each technique. - Participants should use only one of the techniques (Between Subjects design). - There should be sufficient participants for statistically valid results (if you want statistics).
Usability / UX Review
A primarily qualitative technique where a UX expert reviews early design ideas, prototypes or live systems. Relies heavily on the expert's experience and knowledge of usability, but should be driven by data, including metrics. Preferably undertaken by an expert who is not part of the design team. It is useful to alternate cycles of expert review and usability testing during creative phases.

You can use analytics to focus the review. For example:
- look for landing pages with a high bounce rate
- look at behaviour flows to discover pages where users drop out
Or click tracking, or surveys.

Prepare:
- Find out about users, their demographics, behaviours, where they are coming from.
- Organisational goals.
Conduct:
- Use behavioural metrics.
- Use attitudinal metrics.
- Analyse user journeys driven by data and personas.
- Find shortcomings and recommend improvements.
Alhadreti and Mayhew (2017)
Alhadreti and Mayhew compared 3 think-aloud protocols:
- classic concurrent think-aloud,
- speech communication, and
- active intervention.
60 participants, each undertaking 9 tasks with one of the think-aloud protocols. (They also give a useful summary of earlier work.)
Measures went beyond usability problems:
- Task performance
- Participants' experience
- Quantity and quality of usability problems
- Cost
All 3 techniques delivered similar usability findings, but active intervention modified participants' interaction and their feelings towards the evaluator.
READ http://uxpajournal.org/intervene-think-aloud-protocols-usability-testing/
Effectiveness Measures II
And you can measure benefits. Benefit measures:
- number of usability problems identified.
- number of unique usability problems identified.
- how serious those problems are (severity).
- types of problems identified.
- other usability measures that are obtained.
Usability Problem
Any aspect of the user interface that causes the system to have reduced usability for the user. Not "user errors".
"Between Subjects"
Between Subjects experimental design: - Different users for each condition. - To avoid bias and learning effects. - But the groups must be similar. - What if you have an 'outlier'?
Comparative Effectiveness
Basic measures such as 'total number of problems found' do not give an adequate account of effectiveness - notably, they say nothing about how effective the predictive, expert techniques are at identifying 'real' usability problems. - in particular, expert review techniques raise questions regarding how the problems predicted by experts relate to those actually experienced by users. So comparative studies are conducted, and measures are taken to: - compare the effectiveness of different techniques. - in particular, to compare usability predictions against 'real' usability of a system. These are experimental investigations in which the performance of two or more evaluation techniques is compared. And therefore, these investigations should be conducted as rigorous experimental studies. In essence, several evaluation techniques are used to evaluate the same system and the results compared. This usually means that the results of expert evaluations are compared with the results of usability testing.
Who Is An "Expert"?
Broadly speaking, an expert is someone with HCI/UX expertise But, the definition of an expert varies from one method to another, e.g. Cognitive Walkthrough suggests the experts should have some background in cognitive psychology And Nielsen talks about HCI experts, domain experts and 'double' experts all having a role in heuristic evaluation Advisable to use more than one expert (and they must not be the designers or similarly involved people); 3-5 often recommended
Behavioral Metrics vs Attitudinal Metrics, and How They Relate to PURE
Broadly speaking, traditional metrics can be broken down into behavioral (what people do) or attitudinal (what people say) measures. Behavioral metrics are gathered from usage, as users perform actions on software or websites, and are commonly used in analytics and A/B testing. They include counts (users, page views, visits, downloads), rates (bounces, conversion, installation, task success), and times (time on page, time on task, engagement). Common attitudinal measures come from surveys (Net Promoter Score, System Usability Scale, customer satisfaction) or user ratings.

While these are all useful, there are significant limitations:
- Numbers alone don't usually provide the insights needed to understand why an effect was observed or how to fix a problem.
- The metrics used in analytics and A/B testing are typically indirect indicators of the quality of the user experience: they reflect software performance, not human experience.
- Classic measures of user experience, such as those derived from usability benchmarking studies, are expensive and time-consuming, so they aren't used frequently enough to provide regular assessment and tracking.

PURE (Pragmatic Usability Rating by Experts) is a relatively new usability-evaluation method that attempts to sidestep these problems in a way that is reasonably quick, cheap, reliable, and valid. The metrics resulting from PURE can be used frequently and comparatively, making it practical to publish metrics for each version of a product or across a set of competitors, with just a few days of effort. When used with other measures, PURE scores fill in an important gap left by the limitations of traditional metrics.
CUE Studies
CUE (Comparative Usability Evaluation) is a series of ten studies that investigate the reproducibility of usability evaluations and explore common practices among usability professionals. In a CUE study, a considerable number of professional usability teams independently and simultaneously evaluate the same website, web application, or Windows program. Afterwards the results are compared and discussed.

The two most important goals of the CUE studies are:
- Study the reproducibility of usability evaluations. In other words, if two professional teams independently carry out a usability evaluation of the same product, will they report similar results? The answer turns out to be mostly negative.
- Learn about common practices among usability professionals when they do a usability evaluation.

Purpose of the CUE Studies
The main purpose of the CUE studies is to answer a series of questions about professional usability evaluation, including:
- What is common practice? What usability evaluation methods and techniques do professionals actually use? Are there any popular methods or techniques that experienced professionals avoid, even though they receive a lot of coverage? This question is addressed in all CUE studies.
- Are usability evaluation results reproducible? This question is addressed in CUE-1 to CUE-6.
- How many usability problems are there in a product? What's the order of magnitude of the total number of usability problems that you can expect to find on a typical, nontrivial website? CUE-1 to CUE-7 showed that the number is huge. No CUE study came close to finding all usability problems in the product that was evaluated, even though many CUE studies found more than 300 usability problems.
- How many test participants are needed? How many test participants are required to find most of the critical problems? CUE-1 to CUE-7 showed that the number is huge. A large number of test participants (>>100) and a large number of moderators (>>30) will be required to find most of the critical problems.
- Quality differences. Are there important quality differences between the results the teams obtained? All CUE studies addressed this question.
- What's the return on investment? If you invest more time in a usability evaluation - for example, 100 hours instead of 25 - will you get substantially better results? CUE-4 analyzed this question.
- Usability test versus usability inspection. How do professional usability testing and usability inspection compare? CUE-4, CUE-5, and CUE-6 analyzed this question.
Independent variables
Characteristics that are being investigated and will be manipulated to produce different conditions for comparison (e.g. the design) - The variables that the researcher is interested in. - Manipulated to create different experimental conditions. - Different values (levels) of the independent variables create the experimental conditions. - There can be more than one independent variable, but it is better to have just one or two.
Dependent variables
Characteristics that are measured in the experiment - e.g. time to complete a task - These are the variables that the researcher measures. - They are variables that are thought to be influenced by the independent variables. - It's good to use quantitative, objective dependent variables; but qualitative, subjective measures are also possible. - Time and errors are the classic measures for usability, but you can be creative and there are many possibilities.
Confidence
Confidence: measured using a 7-point scale immediately after a task; confidence leads to competence (higher completion rates). Participants are generally over-confident (men more than women), but low confidence can be a good symptom of problems. This measure of participant confidence is different from a confidence interval.

Disasters: using confidence ratings in conjunction with completion rates allows you to compute disasters - when participants fail a task but rate that they were extremely confident. A task failure with a 7 out of 7 on task confidence is a disaster. The only thing worse than failing a task is a participant thinking they did it correctly when they really failed.
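A minimal sketch (my own illustration with hypothetical data) of flagging disasters from per-participant completion and confidence data:

```python
# A minimal sketch of flagging "disasters": participants who failed the task
# (completed == 0) yet rated their confidence 7 out of 7.
results = [                                   # hypothetical per-participant data
    {"participant": "P1", "completed": 1, "confidence": 6},
    {"participant": "P2", "completed": 0, "confidence": 7},   # a disaster
    {"participant": "P3", "completed": 0, "confidence": 3},
    {"participant": "P4", "completed": 1, "confidence": 7},
]

disasters = [r["participant"] for r in results
             if r["completed"] == 0 and r["confidence"] == 7]
print(f"Disasters: {disasters} ({len(disasters)} of {len(results)} participants)")
```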
Heuristic Evaluation
Dated, but influential, expert evaluation technique Developed by Jakob Nielsen (see https://www.nngroup.com for lots of info) Originally for conventional 'desktop' applications Nielsen's Heuristics (1994, 2020) 1. Visibility of system status 2. Match between system and real world 3. User control and freedom 4. Consistency and standards 5. Error prevention 6. Recognition rather than recall 7. Flexibility and efficiency of use 8. Aesthetic and minimalist design 9. Help users recognise, diagnose and recover from errors 10. Help and documentation
Examples of Independent Variables in HCI
Design features — Location of buttons; Elements of visual design, information architecture, navigation structures. Technology — Devices for interaction (input and output) Users — Skills, Age, Education, Culture, Emotions, Physical abilities
Direct and Indirect Competitors
Direct competitors are those with similar products and services. Indirect ones are different products and services solving problems for the same target market. For example, in the case of ride-share services, Uber and Lyft are direct competitors; personal vehicles and mass transit are indirect competitors to ride-share companies.
Comparing Evaluation Techniques
Each evaluation technique has its strengths and weaknesses. Knowing about these enables practitioners and researchers to:
- make appropriate choices of techniques to use.
- appreciate the limitations of evaluation work.
- determine where techniques need further development or where new techniques are required.

Effectiveness
Three alternative ways to consider this:
- Textbook comparisons - claims about techniques
- Empirical studies of individual techniques
- Comparative studies
The Evaluator Effect
Evaluators have an important influence on the outcome of usability and UX evaluations. Evaluators vary in: - the techniques they choose. - how they apply the chosen techniques. - what they see in the resulting data. We see evidence of this in the CUE studies, Hertzum et al (2014) etc.
Comparative Studies: Experiments
Experimental studies are used to compare the effectiveness of evaluation techniques. In these: - Typically, different evaluation techniques are applied to the same evaluation problem and the results are compared. - Usually it is the evaluation technique (or how it is applied) that is changed between experiment conditions: this is the independent variable. - And measures such as number, type and severity of usability problems, productivity, cost-effectiveness, etc, would be the dependent variables
Experiments in Evaluation
Experiments are used to test causal relationships: whether one thing affects another thing
Competitor or Comparator Review
Experts score a product and its competitors against a set of criteria. 1. Identify the competitors (direct and indirect). 2. Establish the criteria and scoring system. 3. Create a spreadsheet. 4. Get several experts to score each competitor against each criterion. 5. Report. It's a good idea to justify the scores
Expert Reviews: Quantitative
Experts score usability and/or UX against a set of criteria These are summative or benchmark reviews - to track progress or compare against others Examples: - PURE - Competitor reviews - Quantifying the User Experience (Rubinoff)—not covered - Purdue Usability Testing Questionnaire—not covered
Expert Review: Qualitative
Experts use qualitative expert review techniques to "predict" usability problems Therefore, they are typically used in formative evaluation They are also called "usability inspection methods" Techniques - Heuristic evaluation - Cognitive walkthrough - Usability/UX review ('informal' expert review) - Heuristic Walkthrough—not covered but slides at end - Guideline review—not covered in this module - Pluralistic Walkthrough—not covered in this module Qualitative Expert Reviews: Pros and Cons ✓ Easy to plan, rapid to perform, low-cost (certainly easier to plan and conduct than usability testing) ✗ Results depend on experts' knowledge of HCI, the product and the domain ✗ Experts lack users' task and contextual knowledge and may find it difficult to evaluate from their perspective - different knowledge and mental models ✓ Good for finding major usability problems ✗ But often also report lots of minor problems
Uncovering the True Emotions of Our Users
- Eye Tracking
- Getting Into the Heads of Our Users: EEG (and the connection to BCI)
- Your Skin Reveals a Lot About How You Are Feeling: GSR (galvanic skin response)
- Let's Face It, Your Emotions Are Showing: Facial Response Analysis
Five key takeaways from the CUE-studies
- Five users are not enough. It is a widespread myth that five users are enough to find 85 percent of the usability problems in a product. The CUE studies have consistently shown that even 15 or more professional teams report only a fraction of the usability problems. Five users are enough to drive a useful iterative cycle, but never claim that you found all usability problems - or even half of them - in an interactive system.
- Huge number of issues. The total number of usability issues for the state-of-the-art websites that we have tested is huge, more than 300 and counting. It is much larger than you can hope to find in one usability test.
- Usability inspections are useful. The CUE-4 study indicated that usability inspections produce results of a quality comparable to usability tests - at least when carried out by experts.
- Designing good usability test tasks is challenging. In CUE-2, nine teams created 51 different tasks for the same user interface. We found each task to be well designed and valid, but there was scant agreement on which tasks were critical. If each team used the same best practices, then they should have derived similar tasks from the test scenario. But that isn't what happened. Instead, there was virtually no overlap. It was as if each team thought the interface was for a completely different purpose.
- Quality problems in some usability test reports. The quality of the usability test reports varied dramatically. In CUE-2, the size of the nine reports varied from five pages to 52 pages - a 10-times difference. Some reports lacked positive findings, executive summaries, and screen shots. Others were complete with detailed descriptions of the team's methods and definitions of terminology.
Effectiveness Measures I
For any evaluation technique, you can measure costs. Cost measures: - how long does it take to set up the evaluation? - how long to conduct? - how long to analyse the data? - and how long to come up with recommendations? - what resources are required, e.g. how many participants (experts and/or users) and how much do they cost, what equipment?
Formative vs Summative evaluations
Formative evaluations focus on determining which aspects of the design work well or not, and why. These evaluations occur throughout a redesign and provide information to incrementally improve the interface. Thus, formative evaluations are meant to steer the design on the right path. Summative evaluations describe how well a design performs, often compared to a benchmark such as a prior version of the design or a competitor.
Comparing Techniques in Practice
However, it is important to know about evaluation techniques in practice, not just in theory. - For example, usability testing is known to find problems that are missed by expert reviews, but may still miss genuine usability problems. Therefore, researchers conduct studies of evaluation techniques. And take measures of their effectiveness. - Note that many, although not all, of the effectiveness measures are applicable to evaluation techniques that identify usability problems rather than those that take quantitative measures
How Do You Compare Usability Problems?
If you want to compare the usability problems identified by two or more different evaluation techniques, you need to define matching rules. A matching rule defines when two problem reports are describing the same actual usability problem - e.g., both reports describe the same outcome from the same cause in the same part of the system
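A minimal sketch (my own illustration, with hypothetical report fields) of such a matching rule; in practice the comparison is a researcher's judgement call rather than exact string equality:

```python
# A minimal sketch of a matching rule: two reports describe the same usability
# problem if they report the same cause and outcome in the same part of the system.
from dataclasses import dataclass

@dataclass
class ProblemReport:
    location: str   # part of the system, e.g. "checkout page"
    cause: str      # e.g. "prominent coupon field"
    outcome: str    # e.g. "user leaves to hunt for a coupon"

def same_problem(a: ProblemReport, b: ProblemReport) -> bool:
    return (a.location == b.location
            and a.cause == b.cause
            and a.outcome == b.outcome)

r1 = ProblemReport("checkout page", "prominent coupon field", "user leaves to hunt for a coupon")
r2 = ProblemReport("checkout page", "prominent coupon field", "user leaves to hunt for a coupon")
print(same_problem(r1, r2))   # True: count as one problem when comparing techniques
```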
Problem With Think Alouds
Thinking aloud impacts time on task and participants' energy (if they spend a long time doing it), and it may also affect where they look, which matters for eye tracking. Thinking aloud is likely to be prohibitively unnatural and taxing if tests are performed in the field, say in a public or collaborative context, or for prolonged periods of time. That is, thinking aloud appears confined to lab settings. In addition, it is often not clear what the users are concretely instructed to do when they are asked to think aloud. Many usability professionals appear to relax the prescriptions of the classic thinking-aloud protocol by asking users to verbalize their feelings, expectations, reflections, and proposals for redesigns (Boren & Ramey, 2000; Nørgaard & Hornbæk, 2006). It is well established that such relaxed thinking aloud affects behavior (Ericsson & Simon, 1993; Hertzum, Hansen, & Andersen, 2009). For example, Hertzum et al. (2009) found that during relaxed thinking aloud users took longer to solve tasks, spent a larger part of tasks on general distributed visual behavior, navigated more from one page to another on the websites used in the experiment, scrolled more within pages, and experienced a higher mental workload. It is, however, debated whether the additional information that usability professionals get from relaxed thinking aloud outweighs its effects on behavior. For example, Goodman, Kuniavsky, and Moed (2012) considered the additional information valuable, whereas Boren and Ramey (2000) recommended restricting relaxed thinking aloud to curb its effects on behavior.

However, it can also be useful. For example, Hertzum, Borlund, and Kristoffersen (2015) found that 38-44% of the verbalizations made during relaxed thinking aloud were of medium or high relevance to the identification of usability problems.

In summary:
- Classic thinking aloud may not affect behavior but may also add little value to usability tests beyond what can be derived from users' observable behavior.
- Relaxed thinking aloud affects behavior but appears to add value to usability tests beyond what can be derived from users' observable behavior.
- The only way to ensure that thinking aloud does not affect behavior is to abandon thinking aloud concurrently with the behavior. Retrospective thinking aloud or a retrospective interview, possibly supported by a video recording of the session, may provide a cost-effective separation between performing with the system and commenting on it. This separation also provides for using the system in the field but moving to the lab for the retrospective part.
Study-level metrics
Include broader measures of the overall user experience. These usually include SUPR-Q, SUS, UMUX-Lite, product satisfaction, and/or NPS. Figures 1 and 2 include study-level metrics in the top part of each figure. See Chapter 5 in Benchmarking the User Experience for more.
Discourse Variations Between Usability Tests and Usability Reports (Friess, 2011)
Investigated the variations between what participants said in user testing sessions and what novice testers included in oral reports: - 25% of issues were included in the oral reports. - Most findings (84%) included in oral reports had some basis in the user testing. - Half the findings in the oral reports were accurate. - Explanations: confirmation bias; omission of issues contrary to evaluators' concerns; bias from client desires; poor interpretation skills.
Example: An Early Comparative Study Jeffries et al (1991)
Jeffries et al (1991):
- Historically important.
- Compared 4 techniques: heuristic evaluation, cognitive walkthrough, guidelines and usability testing.
- Heuristic evaluation found the most usability problems and the most severe problems, but also a lot of trivial problems.
- Cognitive walkthrough was time-consuming.
- Cognitive walkthrough and guidelines missed many of the severe problems.
Limitations:
- The people conducting the evaluations were rather different in each case (in an effort to follow the recommendations of the method proposers).
- Different numbers of people were involved in each evaluation.
- People had different lengths of time in which to conduct the evaluation.
- No attempt was made to establish the validity of the usability problems.
PAPER Jeffries et al (1991), User interface evaluation in the real world: a comparison of four techniques. Proc. CHI 1991, ACM Press.
Conversion metric
Measuring whether users can sign up or purchase a product is a measure of effectiveness. Conversion rates are a special kind of completion rate and are the essential metric in eCommerce. Conversion rates are binary measures (1 = converted, 0 = not converted) and can be captured at all phases of the sales process, from landing page and registration through checkout and purchase. It is often a combination of usability problems, errors and time that leads to lower conversion rates in shopping carts.
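As an illustration of treating conversion as a binary measure, here is a minimal Python sketch that computes a conversion rate and an adjusted-Wald 95% confidence interval (one common choice for binary data with small samples; the outcomes are made up):

```python
import math

# 1 = converted, 0 = not converted (made-up data)
outcomes = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0]

x, n = sum(outcomes), len(outcomes)
z = 1.96  # 95% confidence

# Adjusted-Wald interval: add z^2/2 successes and z^2 trials before computing the proportion
p_adj = (x + z**2 / 2) / (n + z**2)
margin = z * math.sqrt(p_adj * (1 - p_adj) / (n + z**2))

print(f"Conversion rate: {x / n:.1%}")
print(f"95% CI (adjusted Wald): {p_adj - margin:.1%} to {p_adj + margin:.1%}")
```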
Dealing With The Data
More simply, but not as usefully, you can report the data using descriptive statistics - e.g. average the quantitative dependent variable within each experimental condition, calculate the standard deviation and make an informal comparison.
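A minimal Python sketch of this descriptive approach, using made-up task times (in seconds) for two hypothetical conditions:

```python
from statistics import mean, stdev

# Average the quantitative dependent variable within each condition and report
# the standard deviation for an informal comparison. Data are made up.
times = {
    "condition_a": [34.2, 41.0, 29.8, 38.5, 36.1],
    "condition_b": [48.7, 52.3, 45.0, 60.2, 49.9],
}

for condition, values in times.items():
    print(f"{condition}: mean = {mean(values):.1f}s, SD = {stdev(values):.1f}s")
```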
Examples of Dependent Variables in HCI
Objective:
- The number of times users fail/succeed in doing tasks
- The time taken to do tasks
- Number of assist requests
- Number of clicks
Subjective:
- Users' ratings of usefulness, ease of use, mental workload, e.g. SUS, UMUX, NASA-TLX etc.
Step 10 (optional): Comparing PURE Scores
One of the most gratifying aspects of the PURE method is comparing scores for the same task across product versions or competitive products, especially when improvement on one's own product is demonstrated. Below is an example of one of the first PURE scorecards, from an actual product that went to market; it shows drastic improvements in ease of use over 5 months. The task names were genericized for confidentiality reasons, but you can see that big improvements were realized through redesign iterations by simplifying some of the task flows (cutting steps) and also improving ease of use for individual steps.
Who Should Participate?
Participants must be like the people who will actually use the product (or must be able to put themselves into the position of the users), or else you are wasting your time. If you haven't already found out about your users, now is the time to do so:
− identify the groups of people who will use the system
− identify their relevant characteristics (things they have in common, ways in which they differ)
Archetypes or Personas
Personas help the team focus on one representation, with information like gender, age, scenarios and goals. Archetypes focus on behavior (e.g. savor-focused, task-centric, or identity-focused).
Heuristic Evaluation
Prepare:
- Create a paper or software prototype.
- Recruit several experts (Nielsen suggests 3-5).
- Choose a set of heuristics (different ones are possible).
- Prepare report forms (coding sheets).
- Prepare other materials for use during the evaluation: consent forms, tasks (possibly), instruction sheet, severity rating scale.

Decide how you will do it:
- Either conduct an exhaustive review of the entire interface (Nielsen's original approach),
- or design user tasks for all evaluators to undertake (the most useful approach?),
- or allow evaluators to establish their own tasks (but then you have less data from each task).

Conduct the heuristic evaluation:
- Evaluators inspect the interface individually.
- They check whether each screen/feature they encounter violates any of the heuristics.
- Any such violation is considered to be a usability problem.
- They record the problem, its location and the heuristic that was violated: either by writing it down or by describing it to an observer.
- Evaluators might also give a severity rating to each problem.

Deal with the data:
- If you haven't got severity ratings, evaluators should now rate the severity of all problems (not just the ones they found).
- Group similar problems, to reduce the number of problems and to determine how frequently each problem occurs (see the sketch below).
- Reassess severity and prioritise.
- Recommend possible fixes (redesign).

✗ Experts should not give a simple yes/no answer to each heuristic.
✗ Experts should not give a score to each heuristic.
✗ Experts should not find problems and then match them against the heuristics (this isn't a classification scheme).

Strengths:
- It's helpful that the expert evaluators find the problems and rate their severity for you.
- It's good to have just 10 heuristics (manageable).
- Nielsen's heuristics encourage evaluators to look for issues that they might otherwise overlook; the heuristics offer broad coverage of usability concerns.
- It's a fairly cheap method: little planning required, fast to conduct, not too painful to analyse.

Weaknesses:
- Nielsen's general heuristics don't work for all systems.
- How should they be interpreted? E.g. what is an aesthetic design? There's a lot of judgement.
- Evaluators vary in what they judge to be a problem and how severe they think it is (this could be a good thing).
- This is aggravated by the fact that evaluators are not provided with a description of the users.
- It can be difficult for evaluators to remain focused: it's all too easy to forget the heuristics and start doing an intuitive evaluation instead.
- Concerns have been raised about false alarms and missed problems (see later in module).
- It may find the serious problems, but it also tends to yield lots of trivial problems.
- It doesn't look at the positive features of a system.
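As an illustration of the "deal with the data" stage above, here is a minimal Python sketch (with made-up reports and hypothetical field names) that groups duplicate problems, counts how many evaluators found each one, and averages the severity ratings:

```python
from collections import Counter

# Hypothetical problem reports from three evaluators after individual inspections.
reports = [
    {"evaluator": "E1", "problem": "no undo on delete", "heuristic": "User control and freedom", "severity": 3},
    {"evaluator": "E2", "problem": "no undo on delete", "heuristic": "User control and freedom", "severity": 4},
    {"evaluator": "E3", "problem": "jargon in error text", "heuristic": "Match between system and the real world", "severity": 2},
]

# How many evaluators reported each problem (a rough frequency measure)
frequency = Counter(r["problem"] for r in reports)

for problem, count in frequency.most_common():
    severities = [r["severity"] for r in reports if r["problem"] == problem]
    print(f"{problem}: found by {count} evaluator(s), mean severity {sum(severities) / len(severities):.1f}")
```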
Effectiveness Measures III
Raw counts aren't particularly meaningful. Therefore, we can convert counts into measures that can be compared.
Productivity:
- how many usability problems are identified in a given period of time, or per participant, or per unit cost.
Cost-effectiveness:
- ratio of total cost to total benefits,
- e.g. ratio of total time to prepare, conduct and analyse to the total number of problems identified, other benefits etc.
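A minimal sketch of these ratios with made-up numbers, to show how raw counts become comparable measures:

```python
# Made-up figures for one evaluation
problems_found = 24
hours_spent = 16       # preparation + conduct + analysis
evaluators = 4

productivity_per_hour = problems_found / hours_spent
productivity_per_evaluator = problems_found / evaluators
cost_per_problem = hours_spent / problems_found  # hours per problem identified

print(f"{productivity_per_hour:.1f} problems per hour")
print(f"{productivity_per_evaluator:.1f} problems per evaluator")
print(f"{cost_per_problem:.2f} hours per problem")
```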
Sauro (2018)
Reviewed 33 studies that examined the evaluator effect. Concluded:
- Any-2-agreement is a recommended way to measure overall agreement/disagreement between evaluators.
- The average agreement between evaluators is 27%.
- Controlled studies have higher agreement.
- Agreement isn't necessarily the goal.
- Future work is needed.
https://measuringu.com/evaluator-effect/ (read this article)
SUPR-Q
SUPR-Q measures:
- the overall "quality" of the website user experience
- and usability, credibility/trust, loyalty and appearance.
Consists of:
- seven statements with responses given on a 5-point Likert scale
- one question about likelihood of recommending, with the response given on an 11-point scale.
Supports comparison between websites.
Usability:
- This website is easy to use.
- It is easy to navigate within the website.
Credibility:
- The information on the website is credible.
- The information on the website is trustworthy.
Appearance:
- I found the website to be attractive.
- The website has a clean and simple presentation.
Loyalty:
- I will likely visit this website in the future.
- How likely are you to recommend this website to a friend or colleague?
Limitations of Comparative Studies
Some comparative studies make the assumption that it is possible to know all the real usability problems. Is this reasonable? Just because a problem is not identified in usability testing doesn't mean that it doesn't exist... Many of these studies also focus solely on comparing usability problems. Many of the studies to measure and compare the effectiveness of evaluation techniques have been criticised for poor experimental method. It is important to: - apply the methods as they were intended. - minimise differences in experimental conditions to ensure that the comparison is a fair one. But these goals are often in conflict.
Tasks in Iterative Design
Sometimes your goal will be to uncover usability problems as part of iterative design. Choose tasks that:
− are done frequently, i.e. key user journeys
− are novel or controversial
− are critical (in terms of safety, time or money)
− follow up issues identified in earlier evaluations
Sometimes your goal will be to investigate the usability of particular aspects of a system, e.g. based on a client brief:
− choose tasks that test the relevant parts of the system
Why is PURE not great for comparing more than one path?
Step 3: Happy Paths.
A given task can often be fulfilled in a variety of ways, with a different number of steps for each method. PURE requires the team to identify the "happy path," which is the most desired way in which the target user would accomplish this task. This path is our best shot at making the task easy for users, so it makes sense to focus PURE scoring on this particular flow more than on any other. It would be reasonable to evaluate multiple paths for the same task, but, just like having more than one target user type, doing so increases the time and effort required to conduct a PURE evaluation. Also, other methods, like heuristic evaluation or standard usability studies, would be sufficient to find and fix problems in other paths. I would only use PURE on multiple paths if it seemed critical to measure and compare them. Lastly, some teams have chosen to use PURE to evaluate the "popular path" by looking at clickstream analytics to determine which flow is most likely for a given goal. This is a reasonable decision, and it may take the place of a happy path for some teams.
Example: An Early Comparative Study John and Marks (1997)
Study of six "predictive" expert review techniques:
- Claims Analysis
- Cognitive Walkthrough
- GOMS
- Heuristic Evaluation
- User Action Notation
- Reading the specification
Each technique was used by one (inexperienced) analyst to evaluate a specification of a multimedia authoring tool. The real interest for us is the approach that John and Marks took. They investigated:
- Predictive power: how predictions (usability problems) from these techniques compared with the results of usability tests.
- Persuasive power: how many usability problems led to design changes.
- Design-change effectiveness: how effective the design changes were in reducing the number of usability problems in new versions of the system.
PAPER John, B. and Marks, S. (1997), Tracking the effectiveness of usability evaluation methods, Behaviour and Information Technology, 16(4-5).
Hawthorne Effect
The Hawthorne Effect is, in essence, a phenomenon of research: by focusing a researcher's attention on something, the subject is likely to strive to deliver the expectation of that research. In other words, the act of observing someone in research is likely to affect the way they behave. There are two actions we can take to try to minimize this. The first is to address the issue of expectation in research. If we can keep the participants unaware of our expectations (and the best way to do this is to use research to test a hypothesis rather than to prove an expectation), then we reduce the likelihood that they will produce the results we expect to see. The second is to account for the Hawthorne Effect. If you have to indicate an expectation from your research, then the research needs a longer "follow up" period, in which you can examine whether a behaviour (or change in product) is sustained beyond the research or whether it exists only for a short period during and after the research. If the former is the case, the Hawthorne Effect is irrelevant; if the latter, you may need to revisit the design of the product.
NPS
The Net Promoter Score is a popular/notorious measure of customer loyalty that can be applied to all interfaces (desktop, mobile, hardware) and for both consumer-to-business and business-to-business experiences. It's based on responses to a single 11-point item asking participants how likely they are to recommend the experience (which can be a brand, website, product, feature, or page).
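The item itself is described above; the sketch below applies the standard NPS calculation (promoters rate 9-10, detractors 0-6, and the score is the percentage of promoters minus the percentage of detractors), using made-up ratings:

```python
# Standard NPS scoring applied to made-up 0-10 likelihood-to-recommend ratings.
ratings = [10, 9, 7, 6, 8, 10, 3, 9, 5, 10]

promoters = sum(1 for r in ratings if r >= 9)   # 9 or 10
detractors = sum(1 for r in ratings if r <= 6)  # 0 to 6

nps = (promoters - detractors) / len(ratings) * 100
print(f"NPS = {nps:.0f}")  # here (5 - 3) / 10 * 100 = 20
```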
Task-level metrics
The core task-level metrics address the ISO 9241 pt 11 aspects of usability: effectiveness (completion rates), efficiency (task time), and satisfaction (task-level ease using the SEQ). Figure 1 shows these metrics aggregated into a Single Usability Metric (SUM) and in disaggregated form at the bottom of the scorecard for three competitor websites. Figure 3 also shows task-level metrics for two dimensions: platform (desktop and mobile) and competitor (base product and two competitors).
Cognitive Walkthrough (2)
Continuing from the walkthrough procedure described earlier: descriptions of cognitive walkthrough are not very explicit about how to deal with the data. As with other evaluation techniques, you can:
- group usability problems identified by different evaluators to remove duplicates
- assess severity
- and determine fixes.
The 4 questions offer some clues to possible fixes.
- Focuses on one aspect of usability: ease of learning.
- Confronts assumptions about users' mental models and knowledge, and how they match with the system.
- Incorporates explicit user descriptions.
- Identifies problems that are specific to the tasks and action sequences rather than general problems.
- Several versions exist, which causes confusion.
- More costly than heuristic evaluation; cheaper than usability testing.
Task Time
The fundamental measure of efficiency provides a sensitive way of understanding how long it takes participants to complete (or fail) tasks. You can provide measures of average task completion time (successfully completed attempts), average task time (average time of all participants) or mean time of failure (average time till participants fail a task). We usually use average task completion time using an appropriate transformation to handle the inherent positive skew in this measure. We prefer this measure over clicks.
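The text above does not name the transformation; one common choice (an assumption here) is to average in log space and report the geometric mean, which is less affected by the positive skew. A minimal Python sketch with made-up times from successfully completed attempts:

```python
import math

# Task times in seconds (made-up, successfully completed attempts only).
times = [42.0, 38.5, 55.2, 120.4, 47.8, 61.3]

# Geometric mean: average the logs, then exponentiate.
geometric_mean = math.exp(sum(math.log(t) for t in times) / len(times))

print(f"Arithmetic mean: {sum(times) / len(times):.1f}s")
print(f"Geometric mean:  {geometric_mean:.1f}s")  # less influenced by the 120.4s outlier
```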
Task Ease
The perception of a task experience, measured using the single-item, 7-point Single Ease Question (SEQ), provides a succinct way to describe a participant's attitude toward the immediately attempted task. While it correlates with post-study metrics (like SUS and SUPR-Q), it provides additional information that those broader measures don't.
CUE-4 - Usability test vs. inspection
The purpose of CUE-4 was to compare the effectiveness and efficiency of usability testing and inspection techniques. The study showed that inspection is as effective and efficient as usability testing. CUE-4 was a comparative usability evaluation of Hotel Pennsylvania's website, www.hotelpenn.com, conducted in March 2003. Seventeen professional teams simultaneously and independently evaluated the website's usability. Nine teams used usability testing, and eight teams used their favorite inspection technique.

Practitioner's Take Away:
- Usability testing isn't the "high-quality gold standard" against which all other methods should be measured. CUE-4 shows that usability testing, just like any other method, overlooks some problems, even critical ones.
- Inspections carried out by highly experienced practitioners can be quite valuable and, according to this study, comparable to usability tests in the pattern of problems identified, despite their negative reputation.
- Focus on productivity instead of quantity. In other words, spend your limited evaluation resources wisely. Many of the teams obtained results that could effectively drive an iterative process in less than 25 person-hours. Teams A and L used 18 and 21 hours, respectively, to find more than half of the key problem issues, but with limited reporting requirements. Teams that used five to ten times as many resources did better, but the additional results in no way justified the considerable extra resources. This, of course, depends on the type of product investigated; for a medical device, for example, the additional resources might be justified.

Papers about CUE-4:
- Joseph S. Dumas, Rolf Molich, and Robin Jeffries, "Describing Usability Problems - Are We Sending the Right Message," Interactions, July/August 2004, pp. 24-29.
- Rolf Molich and Joseph S. Dumas, "Comparative Usability Evaluation (CUE-4)," Behaviour & Information Technology, Vol. 27, issue 3, 2008.
- Rolf Molich, Robin Jeffries, and Joseph S. Dumas, "Making Usability Recommendations Useful and Usable," Journal of Usability Studies, vol. 2, no. 4, August 2007.

Early example: CUE-4 was a comparative study undertaken back in 2003.
- 17 professional usability teams evaluated a hotel website.
- Wide range of different approaches: 9 teams conducted user testing, 8 teams conducted expert reviews.
- And a wide range of results: of the 340 usability issues reported, only 9 were reported by more than half of the teams, and 205 were reported by just one team.
https://www.dialogdesign.dk/cue-4
CUE-8 - Task measurement
The purpose of CUE-8 was to compare practical approaches to usability task measurement. Fifteen participating teams carried out independent, simultaneous measurements of the Budget.com car rental website and compared results. CUE-8 results were presented and discussed at a workshop at the UPA 2009 conference in Portland, OR, USA, on 9 June 2009.

Practitioner's Take Away:
- Adhere strictly to precisely defined measurement procedures for quantitative tests.
- Report time-on-task, success/failure rate and satisfaction for quantitative tests.
- Exclude failed times from average task completion times.
- Understand the inherent variability from samples.
- Use strict participant screening criteria.
- Provide confidence intervals around your results if this is possible. Keep in mind that time-on-task is not normally distributed, and therefore confidence intervals as commonly computed on raw scores may be misleading.
- Combine qualitative and quantitative findings in your report. Present what happened (quantitative data) and support it with why it happened (qualitative data). Qualitative data provide considerable insight regarding the serious obstacles that users faced and it is counterproductive not to report this insight.
- Justify the composition and size of your participant samples. This is the only way you have to allow your client to judge how much confidence they should place in your results.
- When using unmoderated methodologies for quantitative tests, ensure that you can distinguish between extreme and incorrect results. Although unmoderated testing can exhibit remarkable productivity in terms of user tasks measured with limited effort, quantity of data is no substitute for clean data.

Papers about CUE-8:
- Rolf Molich, Jarinee Chattratichart, Veronica D. Hinkle, Janne Jul Jensen, Jurek Kirakowski, Jeff Sauro, Tomer Sharon, and Brian Traynor, "Rent a Car in Just 0, 60, 240 or 1,217 Seconds? - Comparative Usability Measurement, CUE-8," JUS, the Journal of Usability Studies, November 2010.
- Rolf Molich, Jurek Kirakowski, and Tomer Sharon, "Rent a car in just 60, 120, 240 seconds - Comparative Usability Measurement," 90-minute session at the UPA 2010 conference in Munich, Germany, May 2010.

In summary, CUE-8 investigated quantitative usability work:
- 15 usability teams evaluated a car rental website.
- Teams were asked to take certain measures, including time on task, for 5 prescribed tasks.
- Teams took a wide variety of approaches.
- Those that took similar approaches delivered similar findings.
- Reported pitfalls in usability measurement.
Measurements
The purpose of measurement under this definition isn't to reduce everything to a single number. It's not to generate an exact amount. It's not to have 100% accuracy and perfectly predict the future. And it certainly isn't meant to tell you how to do your job. Measurement and statistical analysis don't guarantee success; instead, they improve the likelihood of success. Instead of saying an experience was "bad" or "good," "intuitive" or "non-intuitive," or "better" or "worse," we can express these more precisely with quantities. Use measures as your dependent, not independent, variable.

1. Describe whether designs help or hinder an experience. Testing designs early and often is made easier by collecting measures early and often. Observe users attempting a task and note any problems they have. You can use prototypes or live products; there's compelling research that even low-fidelity prototypes are indicative of the real experience. You can start simple by collecting only two dependent measures:
- Completion rate: track whether users complete the tasks (you should know what users are trying to accomplish). Task completion is a fundamental metric; not much else matters if users can't complete what they want to do.
- Task ease: after users attempt a task, ask them to rate how easy or difficult it was with the one-item Single Ease Question (SEQ).

4. Compare your experience to industry standards. One of the benefits of using standardized measures is that you can compare the scores to published benchmarks. Both task- and study-level metrics have average or best-in-class scores available. For example:
- SUS: average is 68 (out of 100)
- SEQ: average is about 5.5 (on its 7-point scale)
- SUPR-Q: average is 50% (a percentile rank)
- Completion rate: average is 68%
These can be more specific to industries too. For example, in the US airline industry, which relies heavily on self-service, the average SUPR-Q score is 81% compared to the global average of 50%.
The ten CUE-studies
This website contains one page for each CUE study, with detailed information about the study and links to related articles and downloads.
- CUE-1 - Are usability tests reproducible? Four teams usability tested the same Windows program, Task Timer for Windows.
- CUE-2 - Confirm the results of CUE-1. Nine teams usability tested www.hotmail.com.
- CUE-3 - Usability inspection. Twelve Danish teams evaluated www.avis.com using usability inspections.
- CUE-4 - Usability test vs. usability inspection. Seventeen professional teams evaluated www.hotelpenn.com. Nine teams used usability testing and eight teams used usability inspections.
- CUE-5 - Usability test vs. usability inspection. Thirteen professional teams evaluated the IKEA PAX Wardrobe planning tool on www.ikea-usa.com. Six teams used usability testing and seven teams used usability inspection.
- CUE-6 - Usability test vs. usability inspection. Thirteen professional teams evaluated the Enterprise car rental website, Enterprise.com. Ten teams used usability testing, six teams used usability inspection, and three teams used both methods.
- CUE-7 - Recommendations. Nine professional teams provided recommendations for six nontrivial usability problems identified in CUE-5.
- CUE-8 - Task measurement. Fifteen professional teams measured key usability parameters for the Budget car rental website, Budget.com.
- CUE-9 - The evaluator effect. Nineteen experienced usability professionals independently observed the same five videos from usability test sessions of www.Uhaul.com, reported their observations and then discussed similarities and differences in their observations.
- CUE-10 - Moderation. Sixteen usability professionals independently moderated three usability test sessions of Ryanair.com using the same test script. Videos from the usability test sessions were analyzed to determine good and poor moderation practice.
Does Thinking Aloud Affect Where People Look?
There are a number of ways to assess the effects of thinking aloud on behavior, including:
- the number and type of usability problems uncovered
- task metrics (e.g. completion rates, time, clicks)
- where people look
- how long people look at elements
- comprehension and recall
- purchase behavior
- scores on standardized questionnaires
One good place to start investigating the effects of thinking aloud is how and where people look at webpages. If thinking aloud causes people to systematically look at each part of a page differently, then it's likely other metrics can be affected: for example, task time, usability problems encountered, and attitudinal metrics collected in standardized questionnaires. Tracking where participants look is a good place to start because it's more sensitive to subtle differences compared to blunt measures like task completion rates.
In another study, of 95 adults viewing the U.S. Census Bureau website, Romano Bergstrom and Olmsted-Hawala found that thinking aloud affected where participants looked on the website. They reported different numbers of eye fixations on the top and left navigation areas of the website across two tasks, depending on whether the participants were concurrently thinking aloud or retrospectively thinking aloud.
To mitigate the effects of between-person variability and increase statistical power, we employed a within-subjects approach with 13 participants who viewed all 20 websites. Each participant was randomly assigned 10 of the websites to think aloud on and viewed the other half without being prompted to think aloud. Even with this relatively small sample size we found a statistically significant difference in viewing patterns based on whether participants were thinking aloud. Figure 2 shows the heatmaps for all websites when participants thought aloud and when they didn't; the heatmaps show the aggregated fixations from the participants across the 20 websites.
Expert Reviews
These are evaluation techniques where usability experts examine and critique the usability and UX of a system. So they are evaluations carried out by experts, without users. They can be:
- Quantitative: experts make quantitative assessments of usability and UX.
- Qualitative: experts identify usability problems (this means they are predicting problems they think will have an impact on the actual users).
- Or a combination of quantitative and qualitative.
Expert review techniques rely on expert judgement, and they vary in how they ask experts to make their judgements. Hence different expert review techniques identify different usability problems or give different quantitative assessments of usability. Sometimes the judgement is supported by objective data such as analytics. Expert reviews are primarily used in formative evaluation, because the focus of most techniques is on identifying usability problems.
Why 'review' rather than 'test'? To 'save' (or avoid) users:
- It can be challenging and expensive to recruit users for user testing, and a review can yield rapid, low-cost evaluation data.
- It may not be possible or desirable to involve users in certain situations: confidentiality, appropriateness of feedback, practicality...
- Expert reviews are a useful complement to usability testing, but not a replacement, because they tend to provide different findings.
- There are many variants of expert review out there, and many are heavily reliant on the expertise of the evaluator.
Usability Metrics
These are quantitative measures of usability and user experience. Usability metrics should measure some aspect of the user's interaction with a system or their perceptions of it - for example, efficiency or time on task.
- Performance metrics, e.g. efficiency (time on task), effectiveness (success / completion rate), learnability, memorability.
- Experience metrics (self-reported), e.g. trust, pleasure, frustration, control, happiness, fun, confusion, stress.
- Experience metrics (physiological), e.g. heart rate, pupil dilation, galvanic skin response.
- Issue metrics, e.g. number of usability problems, severity of usability problems, type of usability problems.
Another way of thinking about metrics:
- Task-level metrics that focus on measuring some aspect of each task, e.g. efficiency (time on task), effectiveness (success / completion rate).
- Study-level metrics that focus on measuring the overall experience, e.g. System Usability Scale (SUS), Standardised User Experience Percentile Rank Questionnaire (SUPR-Q).
Evaluator effect
Researchers studying the evaluator effect have proposed that its principal cause is that usability evaluation is an interpretive activity, in which evaluators need to exercise judgement in moving from a sequence of user-system interactions to a list of usability problems. It is unsurprising that such judgements do not produce exactly the same results when performed by different evaluators. What may be surprising is the magnitude of the evaluator effect: in multiple studies the number of problems detected by only a single evaluator has clearly exceeded the number of problems shared by all evaluators.
SUS
This 10-item questionnaire measures perceived usability of any user experience. The System Usability Scale has been around for 30 years and is best used for measuring software or hardware interfaces. Its popularity and longevity mean you can reference published databases (for example, the average SUS score is 68).
SUPR-Q
This compact questionnaire is ideal for benchmarking a website user experience. It provides a measure of the overall quality of the website user experience and its scores are normalized into percentile ranks. It also includes measures of usability, appearance, trust, and loyalty. A license is available to access the normalized database and the items can be used without a fee with attribution.
Task Completion Rate
This fundamental metric of effectiveness tells you whether participants can complete a task (1 = success and 0 = failure).
System Usability Scale (SUS)
This has been around for a very long time. It consists of 10 statements (5 positive, 5 negative) with a 5-point agreement scale, e.g. "I think I would like to use this system frequently"; "I found the system unnecessarily complex". The aim is to determine an overall score for the system (an algorithm is provided for this, sketched below) rather than looking at the ratings for individual statements. A score > 68 is indicative of "good" usability.
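A minimal Python sketch of the standard SUS scoring algorithm: odd (positively worded) items contribute (response − 1), even (negatively worded) items contribute (5 − response), and the sum is multiplied by 2.5 to give a 0-100 score. The responses are made up.

```python
# Responses to items 1-10 for one participant (1 = strongly disagree ... 5 = strongly agree).
responses = [4, 2, 5, 1, 4, 2, 5, 1, 4, 2]

score = 0
for i, r in enumerate(responses, start=1):
    # Odd items are positively worded, even items negatively worded.
    score += (r - 1) if i % 2 == 1 else (5 - r)

sus = score * 2.5
print(f"SUS = {sus}")  # 0-100 scale; a score above 68 is above average
```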
The Evaluator Effect
This is the phenomenon by which different evaluators tend to uncover different issues. In 1998, Rolf Molich conducted the first of his influential Comparative Usability Evaluations (CUE). The general format of a CUE study is that teams of researchers work independently to evaluate an interface and come up with a list of usability problems. Molich generally found, and extensively reported, that different evaluators tend to uncover different issues.

Measuring agreement: to quantify the agreement/disagreement between different evaluators, you need a measure. While many measures assess agreement (correlation coefficient, Kappa coefficient, chi-square test), they don't necessarily work well for uncovering usability problems, especially with a varying number of evaluators involved. For example, a metric of unique problems uncovered isn't ideal because as you increase the number of evaluators, you inflate the chances of more unique problems. Any-2-agreement is a recommended way to measure overall agreement/disagreement between evaluators: it's the percentage of problems found in common between two evaluators divided by the total number of problems found, averaged over every combination of evaluators in a study (see the sketch below).

The average agreement between evaluators is 27%. This includes studies in this review that had little control, in that they collected data with different methods (usability testing vs. expert reviews), tasks, and participants. In other words, with little direction given, you should expect about 27% agreement (but it can range from 4% to 75%). Controlled studies have higher agreement: by having evaluators watch the exact same participants (controlling for methods, tasks, functions), you can expect the agreement between evaluators to roughly double, to around 59%.

Agreement isn't necessarily the goal. While the evaluator effect suggests some core UX methods are unreliable, agreement isn't necessarily the only goal, if it is the goal at all. The goal of usability testing is ultimately a more usable experience, and diversity of perspective can be a strength: using multiple evaluators in a study will leverage different perspectives and capture more problems.
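A minimal Python sketch of the any-2-agreement calculation described above, using made-up problem sets for three evaluators:

```python
from itertools import combinations

# Made-up sets of problem IDs found by each evaluator.
found = {
    "E1": {"p1", "p2", "p3", "p5"},
    "E2": {"p1", "p3", "p4"},
    "E3": {"p2", "p3", "p6", "p7"},
}

pair_scores = []
for a, b in combinations(found, 2):
    common = found[a] & found[b]          # problems found by both evaluators
    total = found[a] | found[b]           # all problems found by the pair
    pair_scores.append(len(common) / len(total))

print(f"Any-2 agreement: {sum(pair_scores) / len(pair_scores):.0%}")
```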
A / B Testing
This is the process of comparing two variations of a single variable to determine which performs best, in order to help improve marketing efforts. More than just answering a one-off question or settling a disagreement, A/B testing can be used to continually improve a given experience or improve a single goal, like conversion rate optimization (CRO), over time. Numbers alone don't usually provide the insights needed to understand why an effect was observed or how to fix a problem. The metrics used in analytics and A/B testing are typically indirect indicators of the quality of the user experience: they reflect software performance, not human experience. Classic measures of user experience, such as those derived from usability benchmarking studies, are expensive and time-consuming, so they aren't used frequently enough to provide regular assessment and tracking.

Larger sites and apps often employ segmentation for their A/B tests. If your number of visitors is high enough, this is a valuable way to test changes for specific sets of visitors. A common segment used for A/B testing is splitting out new visitors versus return visitors. This allows you to test changes to elements that only apply to new visitors, like signup forms. On the other hand, a common A/B testing mistake is to create audiences for tests that are too small: it can then take a long time to achieve statistically significant results and tell what impact your change had on a particular set of website visitors. So it is important to check how large your segments are before starting an experiment, to avoid misleading results.

- A/B testing, or split testing, is a simple form of between-subjects experiment.
- Used for testing websites.
- Supported by software. https://www.youtube.com/watch?v=_E2tErf8esk
Show visitors two versions of a live site and collect data about what they do - measure what they do. What to measure?
- Sign-ups
- Downloads
- Purchases
- Donations
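As an illustration of the analysis side, here is a minimal Python sketch comparing conversion rates between two variants with a two-proportion z-test (one common approach; the counts are made up and this is not tied to any particular A/B testing tool):

```python
import math

# Made-up results: conversions and visitors for each variant.
conv_a, n_a = 120, 2400   # variant A
conv_b, n_b = 150, 2380   # variant B

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se

# Two-sided p-value from the standard normal distribution.
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
print(f"A: {p_a:.1%}  B: {p_b:.1%}  z = {z:.2f}  p = {p_value:.3f}")
```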
Data Analysis
To identify the usability problems (in a rigorous way):
1. Transcribe think-alouds, observational data, interviews, etc., or work directly with screen capture and video data.
2. Find usability problems in the data: this is 'coding' the data.
3. Group individual problems to remove duplicates, creating a final set of problems.
4. Possibly determine the "severity" of each problem.
5. Possibly group and categorise problems.
6. Report the usability problems.

Coding usability problems, for example:
- the user does not succeed in a task
- the user has to try 2 or more alternative strategies in order to achieve their goal
- the user states what they would do to progress but it's not correct
- the user takes a less than optimum series of actions
- the user expresses surprise
- the user expresses frustration
- the user suggests that something should be changed

Coding positive findings: although not traditionally part of usability testing, it's often useful to report on the strengths of an interactive experience. Again, this is coding. For example:
- the user expresses pleasure
- the user smiles / laughs
- the user is successful
- the user's expectations are met
A rainbow spreadsheet can be used to record these observations across participants.

Severity ratings: severity is a judgement based on factors such as:
- Frequency
- Impact
- Persistence
- Criticality
Affinity diagrams can be used for grouping problems.
UX scorecards
UX scorecards are an excellent way to visually display UX metrics. They can be used to more visibly track (and communicate) how design changes have quantifiably improved the user experience. They should be tailored to an organization's goals and feature a mix of broad (study-level/product-level) and specific (task-level) metrics. Use multiple ways to visualize metric performance (colors, grades, and distances) and include external benchmarks, competitor data, and levels of precision when possible.
Example hypothesis for usability / UX:
Users will find the "Add to Basket" button more quickly if it is placed on the bottom right rather than the bottom left of a web page.
Another example:
- Research question: is there a difference in how long it takes a user to make a selection using different forms of input?
- Hypothesis: it is faster to speak an item than to select it from a drop-down menu.
Possible experimental design:
- Independent variable: mechanism for the user to make a selection.
- Independent variable levels: speech and drop-down menu.
- Dependent variable: time taken.
- Run an experiment where all participants use the same interface: half say the name of the item and half select it from a menu (a between-subjects design). A sketch of the analysis is shown below.
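A minimal sketch of how the time-taken data from this between-subjects design might be analysed with an independent-samples t-test, assuming SciPy is available; the times are made up:

```python
from scipy import stats

# Time taken (seconds) to make a selection in each condition (made-up data).
speech_times   = [3.2, 2.8, 3.9, 3.1, 2.6, 3.4]
dropdown_times = [4.8, 5.1, 4.2, 5.6, 4.9, 4.4]

# Independent-samples t-test: appropriate because different participants
# provided the data in each condition.
t_stat, p_value = stats.ttest_ind(speech_times, dropdown_times)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # a small p supports the hypothesis that speech is faster
```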
Within Subjects: The Learning Problem
Ways of addressing the learning problem:
- Get users to do similar (not the same) tasks in each condition.
- Pick a situation in which knowing how to do a task does not matter.
- Also counter-balance the order of presentation of the two conditions (in case users get tired, bored etc., as the study proceeds).
What Makes a Good Report?
Writing:
- Use straightforward, professional language.
- Do not alienate your reader with overly formal language.
- Do not be too casual.
- Focus on clarity and readability.
- Use short, descriptive sentences.
- Use a consistent tense.
- Define any acronyms used. Consider a glossary for specialist terms.
- Use clear headings to structure the document.
- Consider the information architecture of your report.

Describing findings:
- When you describe a problem, explain why it's a problem. E.g. instead of "X will be a problem for users", say "X will be a problem for users because Y...".

Using screenshots:
- Screenshots are a good way of documenting issues with a site.
- Ensure you take screenshots in the same time period you are running your usability testing, in case the site changes before you write your report.
- Without screenshots you will need to be very descriptive in order to provide clarity for readers.
PURE: How To Do It
https://www.nngroup.com/articles/pure-method/
In brief:
1. Identify the target user type(s).
2. Select the fundamental tasks of this product for the target users.
3. Write the happy path for each fundamental task, i.e. the steps that make up the task.
4. Three expert raters walk through the happy paths of the fundamental tasks and independently rate each step on a scale of 1-3.
5. Calculate the inter-rater reliability for the experts' independent scores to ensure reasonable agreement.
6. Experts discuss ratings and rationale and then agree on a single score for each step.
7. Sum the PURE scores for each fundamental task and hence for the entire product.

PURE summary points:
- A low score and green colour are good.
- Can use PURE to compare the same task on different products.
- Can use PURE to compare different products (same tasks on all).
- Not very useful for comparing different tasks (why?). Some tasks may have ten steps and others only three, so you can't compare their difficulty: they will have different totals by default.

Using the PURE method to score a given product or service requires certain steps to be taken, many of which are helpful for any cross-functional product, design, and development team. There are 8 required and 2 optional steps to follow:
1. Clearly identify the target user type(s).
2. Select the fundamental tasks of this product for target users.
3. Indicate the happy path (or the desired path) for each fundamental task.
4. Determine step boundaries for each task and label them in a PURE scoresheet.
5. Collect PURE scores from three expert raters who walk through the happy paths of the fundamental tasks together and silently rate each step.
6. Calculate the inter-rater reliability for the raters' independent scores to ensure reasonable agreement among experts.
7. Have the expert panel discuss ratings and rationale for individual scores, and then agree on a single score for each step.
8. Sum the PURE scores for each fundamental task and for the entire product; color each step, task, and product appropriately.
9. (Optional) For each step, provide a screenshot (or photo) and a qualitative summary of the experts' rationale for the scoring of that step.
10. (Optional) If comparing multiple products or product versions, prepare a comparative PURE scorecard, showing the same PURE task scores side by side.
A sketch of the scoring in steps 7-8 is shown below.
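A minimal Python sketch of the scoring in steps 7-8, using made-up tasks and agreed step ratings; the colouring rule used here (a task takes the colour of its worst step) is an assumption for illustration, not a rule spelled out above:

```python
# Agreed 1-3 ratings for each step of each task's happy path (made-up data).
tasks = {
    "Create account": [1, 2, 1],
    "Place order":    [1, 3, 2, 1],
}

colour = {1: "green", 2: "yellow", 3: "red"}

product_score = 0
for task, steps in tasks.items():
    task_score = sum(steps)            # task score = sum of its step ratings
    product_score += task_score        # product score = sum of task scores
    print(f"{task}: PURE = {task_score}, colour = {colour[max(steps)]}")

print(f"Product PURE score = {product_score}")
```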