In this video we are going to introduce a technique called Heuristic Evaluation.

As we talked about at the beginning of the course, there are lots of different ways to evaluate software. The one you might be most familiar with is empirical methods, where, at some level of formality, you have actual people trying out your software. It's also possible to use formal methods, where you build a model of how people behave in a particular situation, and that enables you to predict how different user interfaces will work. Or, if you can't build a closed-form formal model, you can try out your interface in simulation and run automated tests that can detect usability bugs in your designs. That works especially well for low-level issues; it's harder to do for higher-level ones. And what we're going to talk about today is critique-based approaches, where people give you feedback directly, based on their expertise or on a set of heuristics.

As any of you who have ever taken an art or design class know, peer critique can be an incredibly effective form of feedback, and it can help you make your designs even better. You can get peer critique at almost any stage of your design process, but I'd like to highlight a few points where I think it can be particularly valuable.

First, it's really valuable to get peer critique before user testing, because that helps you avoid spending your users on problems that critique would catch anyway. You want to focus the valuable resource of user testing on things that other people wouldn't be able to pick up on.

Second, the rich qualitative feedback that peer critique provides can be really valuable before redesigning your application, because it can show you which parts of your app you probably want to keep, and which parts are more problematic and deserve redesign.

Third, sometimes you know there are problems, and you need data to convince other stakeholders to make the changes. Peer critique, especially if it's structured, can be a great way to get the feedback you need to make the changes that you know need to happen.

And lastly, this kind of structured peer critique can be really valuable before releasing software, because it helps you do a final sanding of the entire design and smooth out any rough edges.

As with most types of evaluation, it's usually helpful to begin with a clear goal, even if what you ultimately learn is completely unexpected.

So what we're going to talk about today is a particular technique called Heuristic Evaluation.
Heuristic Evaluation was created by Jakob Nielsen and colleagues, about twenty years ago now. The goal of Heuristic Evaluation is to find usability problems in a design.

I first learned about Heuristic Evaluation when I TA'd James Landay's Intro to HCI course, and I've been using it and teaching it ever since. It's a really valuable technique because it lets you get feedback really quickly, and it's a high bang-for-the-buck strategy. The slides I have here are based on James's slides for this course, and the materials are all available on Jakob Nielsen's website.

The basic idea of heuristic evaluation is that you provide a set of people, often other stakeholders on the design team or outside design experts, with a set of heuristics or principles, and they use those to look for problems in your design. Each of them first does this independently, walking through a variety of tasks using your design to look for these bugs. You'll see that different evaluators find different problems. They communicate only at the end of the process, when they get back together and talk about what they found. This "independent first, gather afterwards" structure is how you get a "wisdom of crowds" benefit from having multiple evaluators.

One of the reasons we're talking about this early in the class is that it's a technique you can use either on a working user interface or on sketches of user interfaces. So heuristic evaluation works really well in conjunction with paper prototypes and other rapid, low-fidelity techniques that you may be using to get your design ideas out quick and fast.

Here are Nielsen's ten heuristics, and they're a pretty darn good set. That said, there's nothing magic about these heuristics. They do a pretty good job of covering many of the problems you'll see in many user interfaces, but you can add any that you want and drop any that aren't appropriate for your system. We're going to go over the content of these ten heuristics in the next couple of lectures; in this lecture I'd like to introduce the process you're going to use with them.

So here's what you're going to have your evaluators do: give them a couple of tasks to use your design for, and have them do each task, stepping through it carefully several times. While they're doing this, they keep the list of usability principles as a reminder of things to pay attention to.

Now, which principles will you use?
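For reference while reading the rest of this section, the ten heuristics can be kept as a simple checklist that evaluators carry through each task. The wording below follows Nielsen's published list rather than the slide shown in the video (the next two videos go through each heuristic in detail), so treat it as a convenience, not a transcription of the slide.

```python
# Nielsen's ten usability heuristics, phrased per his published list; the
# lecture's slide may word them slightly differently.
NIELSEN_HEURISTICS = [
    "Visibility of system status",
    "Match between system and the real world",
    "User control and freedom",
    "Consistency and standards",
    "Error prevention",
    "Recognition rather than recall",
    "Flexibility and efficiency of use",
    "Aesthetic and minimalist design",
    "Help users recognize, diagnose, and recover from errors",
    "Help and documentation",
]

# As the lecture suggests, you can extend this list with project-specific
# design goals or drop heuristics that don't apply to your system.
```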
I think Nielsen's ten heuristics are a fantastic start, and you can augment them with anything else that's relevant to your domain. If you have particular design goals that you would like your design to achieve, include those in the list. If you have goals that you've set from competitive analysis of designs that are already out there, that's great too. Or if there are things that you've seen your own or other designs excel at, those are important goals as well and can be included in your list of heuristics.

And then, obviously, the important part is that you take what you learn from these evaluators and use those violations of the heuristics as a way of fixing problems and redesigning.

Let's talk a little bit more about why you might want multiple evaluators rather than just one. The graph on this slide is adapted from Jakob Nielsen's work on heuristic evaluation. Each black square is a problem that a particular evaluator found. An individual evaluator is represented by a row of this matrix, and there are about twenty evaluators in this set. The columns represent the problems. What you can see is that some problems were found by relatively few evaluators, and there are others that almost everybody found. We'll call the ones on the right the easy problems and the ones on the left the hard problems. In aggregate, what we can say is that no evaluator found every problem, and some evaluators found more than others, so there are better and worse people to do this.

So why not have lots of evaluators? Well, as you add more evaluators, they do find more problems, but the gains taper off; eventually you lose that benefit. From a cost-benefit perspective, it just stops making sense after a certain point. So where is the peak of this curve? It depends, of course, on the user interface you're working with, how much you're paying people, how much time is involved, and all sorts of other factors. Jakob Nielsen's rule of thumb for these kinds of user interfaces and heuristic evaluation is that three to five people tend to work pretty well, and that's been my experience too.

And I think one of the reasons people use heuristic evaluation is that it can be an extremely cost-effective way of finding problems. In one study that Jakob Nielsen ran, he estimated that the value of the problems found with heuristic evaluation was about $500,000, while the cost of performing it was just over $10,000, so he estimates roughly a 48-fold benefit-cost ratio for that particular user interface.
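Part of why the method is so cost-effective is that problem discovery saturates quickly. Here is a minimal sketch of the usual model behind that curve; the 31% single-evaluator discovery rate is an illustrative figure associated with Nielsen and Landauer's analyses, not something stated in this lecture, so treat both numbers as assumptions.

```python
def expected_problems_found(evaluators: int,
                            total_problems: int = 100,
                            hit_rate: float = 0.31) -> float:
    """Expected number of distinct problems a panel uncovers, assuming each
    evaluator independently finds any given problem with probability
    hit_rate: found(i) = N * (1 - (1 - hit_rate) ** i)."""
    return total_problems * (1 - (1 - hit_rate) ** evaluators)

for panel in (1, 3, 5, 10):
    print(panel, round(expected_problems_found(panel)))
# 1 -> 31, 3 -> 67, 5 -> 84, 10 -> 98 (out of 100): most of the benefit
# arrives by three to five evaluators, which matches the rule of thumb above.
```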
Obviously, numbers like these are back-of-the-envelope, and your mileage will vary. You can think about how to estimate the benefit you get from something like this. If you have an in-house software tool, say an expense reporting system or another internal system, you can estimate it through productivity increases: making people's time more efficiently used is a big usability win. And if you've got software that you're selling on the open market, you can think about the benefit in terms of sales or other measures like that.

One thing we can take from that graph is that evaluators are more likely to find severe problems, and that's good news; with a relatively small number of people, you're pretty likely to stumble across the most important stuff. However, as we saw, in this particular case even the best single evaluator found only about a third of the problems in the system. That's why ganging up a number of evaluators, say five, is going to get you most of the benefit you're going to be able to achieve.

If we compare heuristic evaluation and user testing, one thing we see is that heuristic evaluation can often be a lot faster: it takes just an hour or two per evaluator, whereas the mechanics of getting a user test up and running can take longer, not even accounting for the fact that you may have to build the software first. Also, heuristic evaluation results come pre-interpreted, because your evaluators are directly providing you with problems and things to fix, which saves you the time of having to infer from a usability test what the problem or the solution might be.

Conversely, experts walking through your system can generate false positives, problems that wouldn't actually come up in a real environment. This does happen, and so user testing is, almost by definition, going to be more accurate.

At the end of the day, I think it's valuable to alternate methods. All of the different techniques you'll learn in this class for getting feedback can be valuable, and by cycling through them you can often get the benefits of each. Heuristic evaluation and user testing will find different problems, and by running heuristic evaluation early in the design process, you'll avoid wasting real users, whom you may bring in later on.

So now that we've seen the benefits, what are the steps? The first thing to do is to get all of your evaluators up to speed on the story behind your software and any domain knowledge they might need, and to tell them about the scenario you're going to have them step through.
Then, obviously, you have the evaluation phase, where people work through the interface. Afterwards, each person assigns a severity rating, and you do this individually first; then you aggregate those into a group severity rating and produce an aggregate report out of that. And finally, once you've got this aggregated report, you can share it with the design team, and the design team can discuss what to do about it.

Doing this kind of expert review can be really taxing, so for each of the scenarios you lay out in your design, it can be valuable to have the evaluator go through that scenario twice. The first time, they just get a sense of it; the second time, they can focus on more specific elements.

If you've got a walk-up-and-use system, like a ticket machine, you may not want to give people any background information at all: if your real users are people just getting off the bus or the train who walk up to your machine without any prior information, that's the experience you want your evaluators to have. On the other hand, if you're building a genomics system or some other expert user interface, you'll want to make sure that whatever training you would give to real users, you also give to your evaluators. In other words, whatever the background is, it should be realistic.

When your evaluators are walking through your interface, it's important for them to produce a list of very specific problems and to explain each problem with reference to one of the design heuristics. You don't want people to just say, "I don't like it." And in order to communicate these results effectively to the design team, you'll want each problem listed separately, so that it can be dealt with efficiently. Separate listings also help you avoid recording the same problem over and over again: if there's a problematic element repeated on every single screen, you don't want to list it at every single screen; you want to list it once so that it can be fixed once.

These problems can be very detailed, like "the name of something is confusing," or they can have more to do with the flow of the user interface or the architecture of the user experience, and not be tied to a specific interface element.

Your evaluators may also find that something is missing that ought to be there, and this can sometimes be ambiguous with early prototypes, like paper prototypes. So you'll want to clarify ahead of time whether the user interface is something you believe to be complete, or whether there are elements that are intentionally missing.
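Stepping back to the mechanics of the report for a moment: the severity ratings are assigned individually and then combined into a group rating. The lecture doesn't prescribe a formula for that combination, so the averaging below is just one simple choice, and the example problems and ratings are made up for illustration.

```python
from statistics import mean

# Hypothetical individual ratings (0-4 scale) from three evaluators.
individual_ratings = {
    "Weight cannot be edited after entry": [2, 3, 2],
    "'Submit' label is ambiguous":         [1, 2, 1],
    "Progress is lost on screen rotation": [4, 3, 4],
}

# Aggregate by averaging, then sort the report with the worst problems first.
report = sorted(individual_ratings.items(),
                key=lambda item: mean(item[1]),
                reverse=True)
for problem, ratings in report:
    print(f"group severity {mean(ratings):.1f}  {problem}")
```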
Of course, sometimes there are features that are obviously going to be there because they're implied by the user interface; mellow out and relax about those.

After your evaluators have gone through the interface, they each independently assign a severity rating to all of the problems they've found. That's going to let you allocate resources to fix those problems. It can also give you feedback about how well you're doing on the usability of your system in general, and give you a kind of benchmark for your efforts in this vein.

The severity measure your evaluators come up with combines several things: the frequency, the impact, and the pervasiveness of the problem they're seeing. Something that appears in only one place may be less of a big deal than something that shows up throughout the entire user interface. Similarly, some things, like misaligned text, may be inelegant but aren't a deal killer for your software.

Here is the severity rating system that Nielsen created; you can obviously use anything you want. It ranges from zero to four, where zero means the evaluator decides, at the end of the day, that it actually isn't a usability problem, all the way up to four, something really catastrophic that has to get fixed right away.

And here is an example of a particular problem that our TA Robby found when he was taking CS147 as a student. He walked through somebody's mobile interface that had a "weight" entry element, and he realized that once you entered your weight, there was no way to edit it after the fact. That's kind of clunky, and you wish you could fix it, but it's maybe not a disaster. What you see here is that he's listed the issue, given it a severity rating, noted the heuristic it violates, and then described exactly what the problem is.

And finally, after all your evaluators have gone through the interface, listed their problems, and combined them in terms of severity and importance, you'll want to debrief with the design team. This is a nice chance to discuss general issues in the user interface and qualitative feedback, and it gives you a chance to go through each of these items and suggest improvements for how to address them.

In this debrief session, it can be valuable for the development team to estimate the amount of effort it would take to fix each of these problems.
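To make the report and the debrief concrete, here is a small sketch of what one entry might look like and of how the severity-versus-effort discussion can be framed. The field names, the heuristic assignment, and the triage thresholds are illustrative assumptions; the lecture only specifies that each finding lists an issue, a severity, the heuristic violated, and a description.

```python
from dataclasses import dataclass

# The 0-4 severity scale described in the lecture (middle levels paraphrase
# Nielsen's published wording).
SEVERITY = {
    0: "not actually a usability problem",
    1: "cosmetic; fix if there is extra time",
    2: "minor; low priority",
    3: "major; high priority",
    4: "usability catastrophe; must fix before release",
}

@dataclass
class Finding:
    issue: str
    severity: int      # 0-4, per the scale above
    heuristic: str     # which heuristic it violates
    description: str

def triage(severity: int, effort_days: float) -> str:
    """Rough debrief rule: major problems and catastrophes get fixed
    regardless of cost, cheap fixes get made even when minor, and
    low-importance, high-cost items are deferred. Thresholds are assumptions."""
    if severity >= 3:
        return "fix now: important, whatever the cost"
    if effort_days <= 0.5:
        return "fix now: cheap"
    return "defer: importance doesn't justify the cost right now"

weight_entry = Finding(
    issue="Weight cannot be edited after entry",
    severity=2,
    heuristic="User control and freedom",   # illustrative assignment
    description="Once a weight is entered there is no way to correct it.",
)
print(f"{weight_entry.issue}: {SEVERITY[weight_entry.severity]}; "
      f"{triage(weight_entry.severity, effort_days=1.0)}")
```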
So, for example, if you've got something that is a one on your severity scale and not too big a deal, maybe it has to do with wording and it's dirt simple to fix, that tells you to go ahead and fix it. Conversely, you may have something that's a catastrophe and takes a lot more effort, but its importance will lead you to fix it anyway. And there are other things where the importance, relative to the cost involved, just doesn't justify dealing with them right now.

This debrief session can also be a great way to brainstorm future design ideas, especially while you've got all the stakeholders in the room and the issues with the user interface are fresh in their minds.

In the next two videos we'll go through Nielsen's ten heuristics and talk more about what they mean.