WEBVTT 00:00:00.032 --> 00:00:05.077 In this video we are going to introduce a technique called Heuristic Evaluation. 00:00:05.077 --> 00:00:11.047 As we talked about at the beginning of the course, there’s lots of different ways to evaluate software. 00:00:11.047 --> 00:00:14.058 One that you might be most familiar with is empirical methods, 00:00:14.058 --> 00:00:19.045 where, at some level of formality, you have actual people trying out your software. 00:00:19.045 --> 00:00:25.029 It’s also possible to have formal methods, where you’re building a model 00:00:25.029 --> 00:00:28.019 of how people behave in a particular situation, 00:00:28.019 --> 00:00:32.020 and that enables you to predict how different user interfaces will work. 00:00:32.020 --> 00:00:36.027 Or, if you can’t build a closed-form formal model, 00:00:36.027 --> 00:00:40.046 you can also try out your interface with simulation and have automated tests 00:00:40.046 --> 00:00:44.082 that can detect usability bugs and ineffective designs. 00:00:44.082 --> 00:00:49.075 This works especially well for low-level stuff; it’s harder to do for higher-level stuff. 00:00:49.075 --> 00:00:52.093 And what we’re going to talk about today is critique-based approaches, 00:00:52.093 --> 00:01:00.005 where people are giving you feedback directly, based on their expertise or a set of heuristics. 00:01:00.005 --> 00:01:03.024 As any of you who have ever taken an art or design class know, 00:01:03.024 --> 00:01:06.072 peer critique can be an incredibly effective form of feedback, 00:01:06.072 --> 00:01:09.024 and it can help you make your designs even better. 00:01:09.024 --> 00:01:12.095 You can get peer critique really at any stage of your design process, 00:01:12.095 --> 00:01:16.840 but I’d like to highlight a couple that I think can be particularly valuable. 00:01:16.840 --> 00:01:21.393 First, it’s really valuable to get peer critique before user testing, 00:01:21.393 --> 00:01:27.079 because that helps you not waste your users on stuff that’s just going to get picked up automatically. 00:01:27.079 --> 00:01:30.352 You want to be able to focus the valuable resources of user testing 00:01:30.352 --> 00:01:34.194 on stuff that other people wouldn’t be able to pick up on. 00:01:34.194 --> 00:01:37.025 The rich qualitative feedback that peer critique provides 00:01:37.025 --> 00:01:40.780 can also be really valuable before redesigning your application, 00:01:40.780 --> 00:01:45.147 because what it can do is it can show you what parts of your app you probably want to keep, 00:01:45.147 --> 00:01:48.779 and which other parts are more problematic and deserve redesign. 00:01:49.426 --> 00:01:51.145 Third, sometimes, you know there are problems, 00:01:51.145 --> 00:01:55.660 and you need data to be able to convince other stakeholders to make the changes. 00:01:55.660 --> 00:01:59.549 And peer critique can be a great way, especially if it’s structured, 00:01:59.549 --> 00:02:04.633 to be able to get the feedback that you need, to make the changes that you know need to happen. 00:02:05.571 --> 00:02:11.012 And lastly, this kind of structured peer critique can be really valuable before releasing software, 00:02:11.012 --> 00:02:15.633 because it helps you do a final sanding of the entire design, and smooth out any rough edges. 00:02:15.633 --> 00:02:20.787 As with most types of evaluation, it’s usually helpful to begin with a clear goal, 00:02:20.787 --> 00:02:24.068 even if what you ultimately learn is completely unexpected.
00:02:26.001 --> 00:02:30.658 And so, what we’re going to talk about today is a particular technique called Heuristic Evaluation. 00:02:30.658 --> 00:02:35.460 Heuristic Evaluation was created by Jakob Nielsen and colleagues, about twenty years ago now. 00:02:36.122 --> 00:02:41.653 And the goal of Heuristic Evaluation is to be able to find usability problems in the design. 00:02:42.653 --> 00:02:44.446 I first learned about Heuristic Evaluation 00:02:44.446 --> 00:02:49.748 when I TA’d James Landay’s Intro to HCI course, and I’ve been using it and teaching it ever since. 00:02:49.748 --> 00:02:54.049 It’s a really valuable technique because it lets you get feedback really quickly 00:02:54.049 --> 00:02:57.698 and it’s a high bang-for-the-buck strategy. 00:02:57.698 --> 00:03:01.590 And the slides that I have here are based off James’ slides for this course, 00:03:01.590 --> 00:03:05.897 and the materials are all available on Jakob Nielsen’s website. 00:03:05.897 --> 00:03:10.055 The basic idea of heuristic evaluation is that you’re going to provide a set of people — 00:03:10.055 --> 00:03:15.042 often other stakeholders on the design team or outside design experts — 00:03:15.042 --> 00:03:17.937 with a set of heuristics or principles, 00:03:17.937 --> 00:03:22.828 and they’re going to use those to look for problems in your design. 00:03:23.567 --> 00:03:26.101 Each of them is first going to do this independently 00:03:26.101 --> 00:03:31.057 and so they’ll walk through a variety of tasks using your design to look for these bugs. 00:03:32.565 --> 00:03:36.846 And you’ll see that different evaluators are going to find different problems. 00:03:36.846 --> 00:03:41.330 And then they’re going to communicate and talk together only at the end, afterwards. 00:03:43.068 --> 00:03:47.269 At the end of the process, they’re going to get back together and talk about what they found. 00:03:47.269 --> 00:03:50.589 And this “independent first, gather afterwards” approach 00:03:50.589 --> 00:03:56.583 is how you get a “wisdom of crowds” benefit from having multiple evaluators. 00:03:56.583 --> 00:03:58.772 And one of the reasons that we’re talking about this early in the class 00:03:58.772 --> 00:04:05.067 is that it’s a technique that you can use either on a working user interface or on sketches of user interfaces. 00:04:05.067 --> 00:04:10.417 And so heuristic evaluation works really well in conjunction with paper prototypes 00:04:10.417 --> 00:04:16.352 and other rapid, low-fidelity techniques that you may be using to get your design ideas out quick and fast. 00:04:18.290 --> 00:04:22.375 Here are Nielsen’s ten heuristics, and they’re a pretty darn good set. 00:04:22.375 --> 00:04:25.044 That said, there’s nothing magic about these heuristics. 00:04:25.044 --> 00:04:30.300 They do a pretty good job of covering many of the problems that you’ll see in many user interfaces; 00:04:30.300 --> 00:04:33.488 but you can add on any that you want 00:04:33.488 --> 00:04:37.608 and get rid of any that aren’t appropriate for your system. 00:04:37.608 --> 00:04:40.984 We’re going to go over the content of these ten heuristics in the next couple lectures, 00:04:40.984 --> 00:04:45.543 and in this lecture I’d like to introduce the process that you’re going to use with these heuristics.
00:04:46.343 --> 00:04:49.243 So here’s what you’re going to have your evaluators do: 00:04:49.243 --> 00:04:52.272 Give them a couple of tasks to use your design for, 00:04:52.272 --> 00:04:57.025 and have them do each task, stepping through carefully several times. 00:04:57.025 --> 00:05:00.576 When they’re doing this, they’re going to keep the list of usability principles 00:05:00.576 --> 00:05:03.065 as a reminder of things to pay attention to. 00:05:03.065 --> 00:05:05.707 Now which principles will you use? 00:05:05.707 --> 00:05:08.752 I think Nielsen’s ten heuristics are a fantastic start, 00:05:08.752 --> 00:05:12.955 and you can augment those with anything else that’s relevant for your domain. 00:05:12.955 --> 00:05:19.035 So, if you have particular design goals that you would like your design to achieve, include those in the list. 00:05:19.035 --> 00:05:21.572 Or, if you have particular goals that you’ve set up 00:05:21.572 --> 00:05:25.893 from competitive analysis of designs that are out there already, 00:05:25.893 --> 00:05:27.312 that’s great too. 00:05:27.312 --> 00:05:32.621 Or if there are things that you’ve seen your own or other designs excel at, 00:05:32.621 --> 00:05:37.189 those are important goals too and can be included in your list of heuristics. 00:05:38.835 --> 00:05:42.706 And then obviously, the important part is that you’re going to take what you learn from these evaluators 00:05:42.706 --> 00:05:48.606 and use those violations of the heuristics as a way of fixing problems and redesigning. 00:05:49.360 --> 00:05:55.042 Let’s talk a little bit more about why you might want to have multiple evaluators rather than just one. 00:05:55.042 --> 00:05:59.899 The graph on this slide is adapted from Jakob Nielsen’s work on heuristic evaluation, 00:05:59.899 --> 00:06:06.568 and what you see is that each black square is a bug that a particular evaluator found. 00:06:07.783 --> 00:06:11.908 An individual evaluator represents a row of this matrix, 00:06:11.908 --> 00:06:15.036 and there are about twenty evaluators in this set. 00:06:15.036 --> 00:06:16.973 The columns represent the problems. 00:06:16.973 --> 00:06:21.566 And what you can see is that there are some problems that were found by relatively few evaluators 00:06:21.566 --> 00:06:24.621 and other stuff which almost everybody found. 00:06:24.621 --> 00:06:29.056 So we’re going to call the stuff on the right the easy problems and the stuff on the left the hard problems. 00:06:30.087 --> 00:06:35.004 And so, in aggregate, what we can say is that no evaluator found every problem, 00:06:35.004 --> 00:06:41.407 and some evaluators found more than others, and so there are better and worse people to do this. 00:06:43.007 --> 00:06:44.951 So why not have lots of evaluators? 00:06:44.951 --> 00:06:48.878 Well, as you add more evaluators, they do find more problems; 00:06:49.617 --> 00:06:53.160 but it kind of tapers off over time — you lose that benefit eventually. 00:06:53.544 --> 00:06:58.435 And so from a cost-benefit perspective it just stops making sense after a certain point. 00:06:59.035 --> 00:07:00.604 So where’s the peak of this curve? 00:07:00.604 --> 00:07:04.134 It’s of course going to depend on the user interface that you’re working with, 00:07:04.134 --> 00:07:08.470 how much you’re paying people, how much time is involved — all sorts of factors.
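To get a feel for why the curve tapers off, here is a minimal sketch using the problem-discovery model that Nielsen and Landauer popularized, where the expected number of distinct problems found by i independent evaluators is N(1 - (1 - λ)^i). The total problem count and the per-evaluator detection rate below are illustrative assumptions, not values read off the graph in this lecture.

```python
# A minimal sketch of the Nielsen-Landauer problem-discovery model:
# found(i) = N * (1 - (1 - lam) ** i), where N is the total number of
# usability problems and lam is the chance that a single evaluator spots
# any given problem. N = 40 and lam = 0.31 are illustrative assumptions.

def problems_found(evaluators: int, total_problems: int = 40, lam: float = 0.31) -> float:
    """Expected number of distinct problems found by independent evaluators."""
    return total_problems * (1 - (1 - lam) ** evaluators)

for i in (1, 3, 5, 10, 20):
    print(f"{i:2d} evaluators -> ~{problems_found(i):4.1f} of 40 problems")
# The gain from each additional evaluator shrinks quickly, which is why the
# cost-benefit peak sits at a handful of evaluators rather than dozens.
```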
00:07:08.470 --> 00:07:13.459 Jakob Nielsen’s rule of thumb for these kinds of user interfaces and heuristic evaluation 00:07:13.459 --> 00:07:19.033 is that three to five people tends to work pretty well; and that’s been my experience too. 00:07:20.171 --> 00:07:24.015 And I think that definitely one of the reasons that people use heuristic evaluation 00:07:24.015 --> 00:07:28.042 is because it can be an extremely cost-effective way of finding problems. 00:07:29.119 --> 00:07:31.590 In one study that Jakob Nielsen ran, 00:07:31.590 --> 00:07:37.293 he estimated that the value of the problems found with heuristic evaluation was $500,000 00:07:37.293 --> 00:07:41.171 and the cost of performing it was just over $10,000, 00:07:41.171 --> 00:07:48.980 and so he estimates a 48-fold benefit-cost ratio for this particular user interface. 00:07:48.980 --> 00:07:54.906 Obviously, these numbers are back of the envelope, and your mileage will vary. 00:07:54.906 --> 00:07:58.984 You can think about how to estimate the benefit that you get from something like this 00:07:58.984 --> 00:08:03.302 if you have an in-house software tool using something like productivity increases: 00:08:03.302 --> 00:08:06.956 if you are making an expense reporting system 00:08:06.956 --> 00:08:11.672 or other in-house system, anything that lets people use their time more efficiently 00:08:11.672 --> 00:08:13.901 is a big usability win. 00:08:13.901 --> 00:08:17.537 And if you’ve got software that you’re making available on the open market, 00:08:17.537 --> 00:08:22.450 you can think about the benefit from sales or other measures like that. 00:08:23.604 --> 00:08:28.265 One thing that we can get from that graph is that evaluators are more likely to find severe problems, 00:08:28.265 --> 00:08:29.615 and that’s good news; 00:08:29.615 --> 00:08:32.258 and so with a relatively small number of people, 00:08:32.258 --> 00:08:35.911 you’re pretty likely to stumble across the most important stuff. 00:08:35.911 --> 00:08:40.927 However, as we saw with just one person in this particular case, 00:08:40.927 --> 00:08:46.109 even the best evaluator found only about a third of the problems in the system. 00:08:46.109 --> 00:08:50.680 And so that’s why ganging up a number of evaluators, say five, 00:08:50.680 --> 00:08:54.974 is going to get you most of the benefit that you’re going to be able to achieve. 00:08:55.958 --> 00:09:00.017 If we compare heuristic evaluation and user testing, one of the things that we see 00:09:00.017 --> 00:09:06.927 is that heuristic evaluation can often be a lot faster — it takes just an hour or two for an evaluator — 00:09:06.927 --> 00:09:11.458 whereas the mechanics of getting a user test up and running can take longer, 00:09:11.458 --> 00:09:16.344 not even accounting for the fact that you may have to build software. 00:09:17.667 --> 00:09:21.465 Also, the heuristic evaluation results come pre-interpreted, 00:09:21.465 --> 00:09:26.164 because your evaluators are directly providing you with problems and things to fix, 00:09:26.164 --> 00:09:34.315 and so it saves you the time of having to infer from the usability tests what might be the problem or solution. 00:09:35.638 --> 00:09:39.235 Now conversely, experts walking through your system 00:09:39.235 --> 00:09:44.095 can generate false positives that wouldn’t actually happen in a real environment. 00:09:44.105 --> 00:09:50.376 And this indeed does happen, and so user testing is, sort of, by definition going to be more accurate.
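As a quick back-of-the-envelope check of the figures quoted above, here is the arithmetic; the exact cost value and all of the productivity numbers are assumptions for illustration only ("just over $10,000" is taken as $10,500 so the ratio lands near 48).

```python
# Back-of-the-envelope check of the benefit-cost figures quoted above.
estimated_benefit = 500_000   # estimated value of the problems found
estimated_cost = 10_500       # "just over $10,000" (assumed exact value)
print(f"benefit-cost ratio ~ {estimated_benefit / estimated_cost:.0f} : 1")  # ~48 : 1

# For an in-house tool, one way to estimate the benefit is through
# productivity gains; every number below is a made-up illustration.
minutes_saved_per_report = 3
reports_per_year = 50_000
loaded_cost_per_minute = 1.00  # dollars
annual_benefit = minutes_saved_per_report * reports_per_year * loaded_cost_per_minute
print(f"illustrative annual productivity benefit: ${annual_benefit:,.0f}")
```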
00:09:52.099 --> 00:09:55.071 At the end of the day I think it’s valuable to alternate methods: 00:09:55.071 --> 00:10:00.306 All of the different techniques that you’ll learn in this class for getting feedback can each be valuable, 00:10:00.306 --> 00:10:04.849 and by cycling through them you can often get the benefits of each. 00:10:04.849 --> 00:10:10.642 That’s because with heuristic evaluation and user testing, you’ll find different problems, 00:10:10.642 --> 00:10:15.491 and by running HE or something like that early in the design process, 00:10:15.491 --> 00:10:20.217 you’ll avoid wasting real users that you may bring in later on. 00:10:21.479 --> 00:10:24.944 So now that we’ve seen the benefits, what are the steps? 00:10:24.944 --> 00:10:29.640 The first thing to do is to get all of your evaluators up to speed 00:10:29.640 --> 00:10:35.798 on what the story is behind your software — any necessary domain knowledge they might need — 00:10:35.814 --> 00:10:39.663 and tell them about the scenario that you’re going to have them step through. 00:10:40.879 --> 00:10:45.081 Then obviously, you have the evaluation phase where people are working through the interface. 00:10:45.081 --> 00:10:50.075 Afterwards, each person is going to assign a severity rating, 00:10:50.075 --> 00:10:52.742 and you do this individually first, 00:10:52.742 --> 00:10:56.123 and then you’re going to aggregate those into a group severity rating 00:10:56.123 --> 00:10:59.504 and produce an aggregate report out of that. 00:11:00.689 --> 00:11:06.284 And finally, once you’ve got this aggregated report, you can share that with the design team, 00:11:06.284 --> 00:11:09.715 and the design team can discuss what to do with that. 00:11:10.007 --> 00:11:12.906 Doing this kind of expert review can be really taxing, 00:11:12.906 --> 00:11:16.096 and so for each of the scenarios that you lay out in your design, 00:11:16.096 --> 00:11:22.056 it can be valuable to have the evaluator go through that scenario twice. 00:11:22.056 --> 00:11:28.029 The first time, they’ll just get a sense of it; and the second time, they can focus on more specific elements. 00:11:30.029 --> 00:11:34.710 If you’ve got some walk-up-and-use system, like a ticket machine somewhere, 00:11:34.710 --> 00:11:38.897 then you may want to not give people any background information at all, 00:11:38.897 --> 00:11:42.098 because if you’ve got people that are just getting off the bus or the train, 00:11:42.098 --> 00:11:45.369 and they walk up to your machine without any prior information, 00:11:45.369 --> 00:11:49.348 that’s the experience you’ll want your evaluators to have. 00:11:49.348 --> 00:11:53.485 On the other hand, if you’re going to have a genomics system or other expert user interface, 00:11:53.485 --> 00:11:57.020 you’ll want to make sure that whatever training you would give to real users, 00:11:57.020 --> 00:11:59.570 you’re going to give to your evaluators as well. 00:11:59.570 --> 00:12:03.553 In other words, whatever the background is, it should be realistic. 00:12:05.738 --> 00:12:08.647 When your evaluators are walking through your interface, 00:12:08.647 --> 00:12:12.571 it’s going to be important to produce a list of very specific problems 00:12:12.571 --> 00:12:16.983 and explain those problems with regard to one of the design heuristics.
00:12:16.983 --> 00:12:21.200 You don’t want people to just be, like, “I don’t like it.” 00:12:21.200 --> 00:12:26.233 And in order to make these results maximally useful for the design team, 00:12:26.233 --> 00:12:31.445 you’ll want to list each one of these problems separately so that they can be dealt with efficiently. 00:12:31.445 --> 00:12:37.158 Separate listings can also help you avoid listing the same repeated problem over and over again. 00:12:37.158 --> 00:12:42.483 If there’s a repeated element on every single screen, you don’t want to list it at every single screen; 00:12:42.483 --> 00:12:45.819 you want to list it once so that it can be fixed once. 00:12:46.881 --> 00:12:52.322 And these problems can be very detailed, like “the name of something is confusing,” 00:12:52.322 --> 00:12:55.709 or they can be something that has to do more with the flow of the user interface, 00:12:55.709 --> 00:13:02.109 or the architecture of the user experience, and that’s not specifically tied to an interface element. 00:13:03.232 --> 00:13:07.048 Your evaluators may also find that something is missing that ought to be there, 00:13:07.048 --> 00:13:11.247 and this can sometimes be ambiguous with early prototypes, like paper prototypes. 00:13:11.247 --> 00:13:17.365 And so you’ll want to clarify whether the user interface is something that you believe to be complete, 00:13:17.365 --> 00:13:21.762 or whether there are intentional elements missing, ahead of time. 00:13:22.177 --> 00:13:25.789 And, of course, sometimes there are features that are obviously going to be there 00:13:25.789 --> 00:13:28.077 that are implied by the user interface. 00:13:28.077 --> 00:13:31.893 And so, mellow out, and relax on those. 00:13:34.509 --> 00:13:36.755 After your evaluators have gone through the interface, 00:13:36.755 --> 00:13:41.265 they can each independently assign a severity rating to all of the problems that they’ve found. 00:13:41.265 --> 00:13:45.099 And that’s going to enable you to allocate resources to fix those problems. 00:13:45.099 --> 00:13:48.220 It can also help give you feedback about how well you’re doing 00:13:48.220 --> 00:13:50.972 in terms of the usability of your system in general, 00:13:50.972 --> 00:13:55.180 and give you a kind of benchmark of your efforts in this vein. 00:13:56.380 --> 00:14:01.119 The severity measure that your evaluators are going to come up with is going to combine several things: 00:14:01.119 --> 00:14:05.032 It’s going to combine the frequency, the impact, 00:14:05.032 --> 00:14:08.930 and the pervasiveness of the problem that they’re seeing on the screen. 00:14:08.930 --> 00:14:14.052 So, something that is only in one place may be less of a big deal 00:14:14.052 --> 00:14:18.563 than something that shows up throughout the entire user interface. 00:14:18.563 --> 00:14:23.024 Similarly, there are going to be some things, like misaligned text, 00:14:23.024 --> 00:14:27.553 which may be inelegant, but aren’t a deal killer in terms of your software. 00:14:29.060 --> 00:14:34.441 And here is the severity rating system that Nielsen created; you can obviously use anything that you want: 00:14:34.441 --> 00:14:36.692 It ranges from zero to four, 00:14:36.692 --> 00:14:41.896 where zero is “at the end of the day your evaluator decides it actually is not a usability problem,” 00:14:41.896 --> 00:14:47.720 all the way up to it being something really catastrophic that has to get fixed right away.
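To make the rating and aggregation steps concrete, here is a small sketch of Nielsen’s zero-to-four scale, paraphrased, along with one simple way to combine individual ratings into a group severity; the example problems are drawn from the ones mentioned above, and the specific numbers and the use of a plain mean are illustrative choices, not a prescribed format.

```python
from statistics import mean

# Nielsen's 0-4 severity scale, paraphrased.
SEVERITY_SCALE = {
    0: "Not a usability problem at all",
    1: "Cosmetic problem only; fix if extra time is available",
    2: "Minor usability problem; low priority to fix",
    3: "Major usability problem; high priority to fix",
    4: "Usability catastrophe; imperative to fix before release",
}

# Each evaluator rates each problem independently; the group rating here is
# simply the mean of the individual ratings (a median works just as well).
ratings_by_evaluator = {
    "the name of a field is confusing": [2, 1, 2],      # illustrative numbers
    "misaligned text on several screens": [1, 1, 0],
}
group_ratings = {issue: round(mean(r), 1) for issue, r in ratings_by_evaluator.items()}

for issue, rating in sorted(group_ratings.items(), key=lambda kv: -kv[1]):
    print(f"group severity {rating}: {issue}")
```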
00:14:48.766 --> 00:14:51.335 And here is an example of a particular problem 00:14:51.335 --> 00:14:56.027 that our TA Robby found when he was taking CS147 as a student. 00:14:56.027 --> 00:15:01.079 He walked through somebody’s mobile interface that had a “weight” entry element to it; 00:15:01.079 --> 00:15:05.916 and he realized that once you entered your weight, there was no way to edit it after the fact. 00:15:05.916 --> 00:15:12.258 So, that’s kind of clunky, you wish you could fix it — maybe not a disaster. 00:15:12.258 --> 00:15:17.085 And so what you see here is he’s listed the issue, he’s given it a severity rating, 00:15:17.085 --> 00:15:23.157 he’s got the heuristic that it violates, and then he describes exactly what the problem is. 00:15:23.634 --> 00:15:26.869 And finally, after all your evaluators have gone through the interface, 00:15:26.869 --> 00:15:31.272 listed their problems, and combined them in terms of the severity and importance, 00:15:31.272 --> 00:15:34.183 you’ll want to debrief with the design team. 00:15:34.183 --> 00:15:39.171 This is a nice chance to be able to discuss general issues in the user interface and qualitative feedback, 00:15:39.171 --> 00:15:42.234 and it gives you a chance to go through each of these items 00:15:42.234 --> 00:15:45.683 and suggest improvements on how you can address these problems. 00:15:47.713 --> 00:15:51.096 In this debrief session, it can be valuable for the development team 00:15:51.096 --> 00:15:55.913 to estimate the amount of effort that it would take to fix one of these problems. 00:15:55.913 --> 00:16:01.436 So, for example, if you’ve got something that is a one on your severity scale and not too big a deal — 00:16:01.436 --> 00:16:06.128 it might have something to do with wording and it’s dirt simple to fix — 00:16:06.128 --> 00:16:08.335 that tells you “go ahead and fix it.” 00:16:08.335 --> 00:16:11.147 Conversely, you may have something which is a catastrophe 00:16:11.147 --> 00:16:15.483 that takes a lot more effort, but its importance will lead you to fix it. 00:16:15.483 --> 00:16:19.602 And there are other things where the importance, relative to the cost involved, 00:16:19.602 --> 00:16:22.813 just doesn’t make it worth dealing with them right now. 00:16:22.813 --> 00:16:26.867 And this debrief session can be a great way to brainstorm future design ideas, 00:16:26.867 --> 00:16:29.723 especially while you’ve got all the stakeholders in the room, 00:16:29.723 --> 00:16:34.373 and the ideas about what the issues are with the user interface are fresh in their minds. 00:16:34.373 --> 00:16:40.749 In the next two videos we’ll go through Nielsen’s ten heuristics and talk more about what they mean.
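To round out the process described above, here is a small sketch of the kind of per-problem record the lecture mentions (issue, heuristic, severity, description) together with the severity-versus-effort triage from the debrief; the effort scale, the triage rule, and the example entries are assumptions for illustration, not part of the method as Nielsen defines it.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    issue: str       # short, specific description of the problem
    heuristic: str   # which heuristic it violates
    severity: int    # 0-4 group severity rating
    effort: int      # rough fix cost from the development team (1 = trivial, 5 = major rework)
    detail: str      # exactly what the evaluator observed, and where

findings = [
    Finding("weight can't be edited after entry", "User control and freedom", 2, 2,
            "Once a weight is entered on the profile screen, there is no way to change it."),
    Finding("confusing wording on a button label", "Match between system and the real world", 1, 1,
            "The label uses internal jargon instead of the user's vocabulary."),
    Finding("no confirmation before destructive delete", "Error prevention", 4, 4,
            "Tapping delete immediately removes all saved data with no undo."),
]

# One simple triage rule: always fix catastrophes, fix anything that is cheap,
# and defer items whose severity doesn't justify the effort right now.
for f in findings:
    if f.severity >= 4 or f.effort <= 1:
        decision = "fix now"
    elif f.severity >= f.effort:
        decision = "schedule a fix"
    else:
        decision = "defer for now"
    print(f"{decision:14s} severity={f.severity} effort={f.effort}  {f.issue}")
```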