0:00:02.008,0:00:04.895
In this lecture, we’re going to talk about trying out your interface with people

0:00:04.895,0:00:12.050
and doing so in a way that lets you improve your designs based on what you learn.

0:00:12.050,0:00:16.802
One of the most common things that people ask when running studies is: “Do you like my interface?”

0:00:16.802,0:00:20.990
and it’s a really natural thing to ask, because on some level it’s what we all want to know.

0:00:20.990,0:00:23.916
But this is really problematic on a whole lot of levels.

0:00:23.916,0:00:28.099
For one, it’s not very specific, and so sometimes people are trying to make this better

0:00:28.099,0:00:34.359
and so they’ll improve it by doing something like: “How much do you like my interface on a one-to-five scale?”

0:00:34.359,0:00:39.401
Or: “‘This is a useful interface’: agree or disagree on a one-to-five scale.”

0:00:39.401,0:00:42.636
And this adds some kind of a patina of scientificness to it

0:00:42.636,0:00:46.749
but really it’s just the same thing — you’re asking somebody “Do you like my interface?”

0:00:46.749,0:00:49.679
And people are nice, so they’re going to say “Sure I like your interface.”

0:00:49.679,0:00:52.257
This is the “please the experimenter” bias.

0:00:52.257,0:00:56.699
And this can be especially strong when there are social or cultural or power differences

0:00:56.699,0:01:00.606
between the experimenter and the people that you’re trying out your interface with:

0:01:00.606,0:01:05.061
For example, [inaudible] and colleagues showed this effect in India

0:01:05.061,0:01:09.316
where this effect was exacerbated when the experimenter was white.

0:01:09.316,0:01:15.908
Now, you should <i>not</i> take this to mean that you shouldn’t have your developers try out stuff with users —

0:01:15.908,0:01:21.800
Being the person who is both the developer and the person who is trying stuff out is incredibly valuable.

0:01:21.800,0:01:24.522
And one example I like a lot of this is Mike Krieger,

0:01:24.522,0:01:30.125
one of the Instagram founders, who is also a former master’s student and TA of mine.

0:01:30.125,0:01:32.313
And Mike, when he left Stanford and joined Silicon Valley,

0:01:32.313,0:01:36.483
every Friday afternoon he would bring people into the lab into his office

0:01:36.483,0:01:39.606
and have them try out whatever they were working on that week.

0:01:39.606,0:01:43.217
And so that way they were able to get this regular feedback each week

0:01:43.217,0:01:48.009
and the people who were building those systems got to see real people trying them out.

0:01:48.009,0:01:52.169
This can be nails-on-a-chalkboard painful, but you’ll also learn a ton.

0:01:52.169,0:01:55.118
So how do we get beyond “Do you like my interface?”

0:01:55.118,0:01:58.972
The basic strategy that we’re going to talk about today is being able

0:01:58.972,0:02:05.039
to use specific measures and concrete questions to be able to deliver meaningful results.

0:02:05.039,0:02:10.216
One of the problems of “Do you like my interface?” is “Compared to what?”

0:02:10.216,0:02:16.102
And I think one of the reasons people say “Yeah sure” is that there’s no comparison point

0:02:16.102,0:02:21.889
and so one thing that’s really important is when you’re measuring the effectiveness of your interface,

0:02:21.889,0:02:25.783
even informally, it’s really nice to have some kind of comparison.

0:02:25.783,0:02:28.687
It’s also important to think about, well, what’s the yardstick?

0:02:28.687,0:02:31.184
What constitutes “good” in this arena?

0:02:31.184,0:02:33.925
What are the measures that you’re going to use?

0:02:33.925,0:02:36.885
So how can we get beyond “Do you like my interface?”

0:02:36.885,0:02:41.071
One of the ways that we can start out is by asking a base rate question,

0:02:41.071,0:02:46.526
like “What fraction of people click on the first link in a search results page?”

0:02:46.526,0:02:50.142
Or “What fraction of students come to class?”

0:02:50.142,0:02:54.555
Once we start to measure correlations things get even more interesting,

0:02:54.555,0:03:00.328
like, “Is there a relationship between the time of day a class is offered and how many students attend it?”

0:03:00.328,0:03:07.610
Or “Is there a relationship between the order of a search result and the clickthrough rate?”

0:03:07.610,0:03:11.492
For both students and clickthrough, there can be multiple explanations.

0:03:11.492,0:03:16.410
For example, if there are fewer students that attend early morning classes,

0:03:16.410,0:03:19.054
is that a function of when students want to show up,

0:03:19.054,0:03:22.865
or is that a function of when good professors want to teach?

0:03:22.865,0:03:26.219
With the clickthrough example, there are also two kinds of explanations.

0:03:26.219,0:03:37.528
If lower-placed links yield fewer clicks, is that because the links are of intrinsically poorer quality,

0:03:37.528,0:03:41.075
or is it because people just click on the first link —

0:03:41.075,0:03:45.238
[that] they don’t bother getting to the second one even if it might be better?

0:03:45.238,0:03:48.869
To isolate the effect of placement and identify it as playing a causal role,

0:03:48.869,0:03:54.155
you’d need to isolate that as a variable by, say, randomizing the order of search results.
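
As a rough sketch of what randomizing the order could look like, the Python below shuffles a list of results for each trial and then tallies clickthrough by position. The function names, the three placeholder results, and the simulated user who always clicks the first link shown are all assumptions made for illustration; they do not come from any real search system.

```python
import random
from collections import defaultdict


def randomized_order(results, rng):
    """Return a shuffled copy, so position is independent of result quality."""
    shuffled = list(results)
    rng.shuffle(shuffled)
    return shuffled


def clickthrough_by_position(trials):
    """trials: iterable of (shown_order, clicked_result_or_None).

    Returns {position: fraction of trials in which that position was clicked}.
    """
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for order, clicked_result in trials:
        for position, result in enumerate(order, start=1):
            shown[position] += 1
            if result == clicked_result:
                clicked[position] += 1
    return {p: clicked[p] / shown[p] for p in sorted(shown)}


# Hypothetical simulation: a user who always clicks whatever is shown first.
rng = random.Random(0)
results = ["A", "B", "C"]  # placeholder results, not real data
trials = []
for _ in range(1000):
    order = randomized_order(results, rng)
    trials.append((order, order[0]))

rates = clickthrough_by_position(trials)
print(rates)
```

Because the order is shuffled independently of the results themselves, a position effect that survives the shuffle cannot be explained by the top links simply being better links.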

0:03:54.155,0:04:00.329
As we start to talk about these experiments, let’s introduce a few terms that are going to help us.

0:04:00.329,0:04:05.485
The multiple different conditions that we try, that’s the thing we are manipulating —

0:04:05.485,0:04:12.402
for example, the time of a class, or the location of a particular link on a search results page.

0:04:12.402,0:04:18.379
These manipulations are independent variables because they are independent of what the user does.

0:04:18.379,0:04:22.245
They are in the control of the experimenter.

0:04:22.245,0:04:26.706
Then we are going to measure what the user does

0:04:26.706,0:04:31.447
and those measures are called dependent variables because they depend on what the user does.

0:04:31.447,0:04:36.007
Common measures in HCI include things like task completion time —

0:04:36.007,0:04:38.983
How long does it take somebody to complete a task

0:04:38.983,0:04:43.375
(for example, find something I want to buy, create a new account, order an item)?

0:04:43.375,0:04:46.838
Accuracy — How many mistakes did people make,

0:04:46.838,0:04:51.298
and were those fatal errors or were those things that they were able to quickly recover from?

0:04:51.300,0:04:55.376
Recall — How much does a person remember afterward, or after periods of non-use?

0:04:55.376,0:04:59.183
And emotional response — How does the person feel about the tasks being completed?

0:04:59.183,0:05:01.440
Were they confident, were they stressed?

0:05:01.440,0:05:04.354
Would the user recommend this system to a friend?

0:05:04.354,0:05:09.075
So, your independent variables are the things that you manipulate,

0:05:09.075,0:05:11.983
your dependent variables are the things that you measure.

0:05:11.983,0:05:14.031
How reliable is your experiment?

0:05:14.031,0:05:17.573
If you ran this again, would you see the same results?

0:05:17.573,0:05:20.922
That’s the internal validity of an experiment.

0:05:20.922,0:05:24.776
So, to have a precise experiment, you need to remove the confounding factors.

0:05:24.776,0:05:30.348
Also, it’s important to study enough people so that the result is unlikely to have been by chance.
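
One way to check whether a result is "unlikely to have been by chance" is a permutation test. This is a swapped-in illustration, not a method the lecture names: it shuffles the condition labels many times and counts how often a difference at least as large as the observed one appears. The function name, its parameters, and the task times below are made up for the sketch.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Estimate how often a difference in means at least as large as the
    observed one arises when condition labels are assigned at random."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations


# Hypothetical task completion times in seconds for two interface versions.
times_old = [41, 38, 45, 50, 39, 47]
times_new = [35, 33, 40, 36, 31, 38]
print(permutation_p_value(times_old, times_new))
```

With only a handful of participants, even a large-looking gap can show up often under shuffled labels, which is the informal reason to study enough people.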

0:05:30.348,0:05:34.373
You may be able to run the same study over and over and get the same results

0:05:34.373,0:05:42.212
but it may not matter in some real-world sense. The external validity is the generalizability of your results.

0:05:42.212,0:05:44.898
Does this apply only to eighteen-year-olds in a college classroom?

0:05:44.898,0:05:47.908
Or does this apply to everybody in the world?

0:05:47.908,0:05:52.003
Let’s bring this back to HCI and talk about one of the problems you’re likely to face as a designer.

0:05:52.003,0:05:55.499
I think one of the things that we commonly want to be able to do

0:05:55.499,0:06:00.364
is to be able to ask something like “Is my cool new approach better than the industry standard?”

0:06:00.364,0:06:03.290
Because after all, that’s why you’re making the new thing.

0:06:03.290,0:06:06.956
Now, one of the challenges with this, especially early on in the design process

0:06:06.956,0:06:11.026
is that you may have something which is very much in its prototype stages

0:06:11.026,0:06:16.841
and something that is the industry standard is likely to benefit from years and years of refinement.

0:06:16.841,0:06:21.514
And at the same time, it may be stuck with years and years of cruft

0:06:21.514,0:06:25.114
which may or may not be intrinsic to its approach.

0:06:25.114,0:06:30.586
So if you compare your cool new tool to some industry standard, there are two things varying here.

0:06:30.586,0:06:35.725
One is the fidelity of the implementation and the other one of course is the approach.

0:06:35.725,0:06:37.822
Consequently, when you get the results,

0:06:37.822,0:06:43.933
you can’t know whether to attribute the results to fidelity or approach or some combination of the two.

0:06:43.933,0:06:48.400
So we’re going to talk about ways of teasing apart those different causal factors.

0:06:48.400,0:06:53.712
Now, one thing I should say right off the bat is there are some times where it may be more

0:06:53.712,0:06:57.332
or less relevant whether you have a good handle on what the causal factors are.

0:06:57.332,0:07:01.407
So for example, if you’re trying to decide between two different digital cameras,

0:07:01.407,0:07:07.730
at the end of the day, maybe all you care about is image quality or usability or some other factor

0:07:07.730,0:07:12.828
and exactly what makes that image quality better or worse

0:07:12.828,0:07:17.834
or any other element along the way may be less relevant to you.

0:07:17.834,0:07:24.032
If you don’t have control over the variables, then identifying cause may not always be what you want.

0:07:24.032,0:07:27.693
But when you are a designer, you do have control over the variables,

0:07:27.693,0:07:30.718
and that’s when it is really important to ascertain cause.

0:07:30.718,0:07:35.951
Here’s an example of a study that came out right when the iPhone was released,

0:07:35.951,0:07:41.041
done by the research firm User Centric, and I’m going to read from this news article here.

0:07:41.041,0:07:43.496
Research firm User Centric has released a study

0:07:43.496,0:07:48.734
that tries to gauge how effective the iPhone’s unusual onscreen keyboard is.

0:07:48.734,0:07:51.066
The goal is certainly a noble one

0:07:51.066,0:07:56.337
but I cannot say the survey’s approach results in data that makes much sense.

0:07:56.337,0:07:59.857
User Centric brought in twenty owners of other phones.

0:07:59.857,0:08:05.118
Half had qwerty keyboards, half had ordinary numeric phones, with keypads.

0:08:05.118,0:08:08.086
None were familiar with the iPhone.

0:08:08.086,0:08:13.677
The research involved having the test subjects enter six sample text messages with the phones

0:08:13.677,0:08:17.335
that they already had, and six with the iPhone.

0:08:17.335,0:08:20.817
The end result was that the iPhone newbies took twice as long

0:08:20.817,0:08:26.785
to enter text with an iPhone as they did with their own phones and made lots more typos.

0:08:26.785,0:08:31.625
So let’s critique this study and talk about its benefits and drawbacks.

0:08:31.625,0:08:34.025
Here’s the webpage directly from User Centric.

0:08:34.025,0:08:37.615
What’s our manipulation in this study?

0:08:37.615,0:08:41.779
Well the manipulation is going to be the input style.

0:08:41.779,0:08:45.078
How about the measure in the study?

0:08:45.078,0:08:48.630
It’s going to be the words per minute.

0:08:48.630,0:08:56.312
And there’s absolutely value in being able to measure the initial usability of the iPhone.

0:08:56.312,0:09:00.368
There are several reasons; one is that if you’re introducing new technology,

0:09:00.368,0:09:03.678
it’s beneficial if people are able to get up to speed pretty quickly.

0:09:03.678,0:09:09.326
However it’s important to realize that this comparison is intrinsically unfair

0:09:09.326,0:09:14.945
because the users of the previous cell phones were experts at that input modality

0:09:14.945,0:09:18.696
and the people who are using the iPhone are novices in that modality.

0:09:18.696,0:09:24.036
And so it seems quite likely that the iPhone users, once they become actual users,

0:09:24.036,0:09:29.476
are going to get better over time and so if you’re not used to something the first time you try it,

0:09:29.476,0:09:35.060
that may not be a deal killer, and it’s certainly not an apples-to-apples comparison.

0:09:35.060,0:09:40.008
Another thing that we don’t get out of this article is “Is this difference significant?”

0:09:40.008,0:09:46.965
So we read that each person typed six messages in each of two conditions

0:09:46.965,0:09:52.004
and so they did their own device and the iPhone, or vice versa.

0:09:52.004,0:10:00.001
Six messages each and that the iPhone users were half the speed of the…

0:10:00.001,0:10:08.812
or rather the people typing with the iPhone were half as fast as when they got to type with a mini qwerty

0:10:08.812,0:10:12.572
at the device that they were accustomed to.

0:10:12.572,0:10:17.131
So while this may tell us something about the initial usability of the iPhone,

0:10:17.131,0:10:23.014
in terms of the long-term usability, you know, I don’t think we get so much out of this here.

0:10:23.014,0:10:29.819
If you weren’t satisfied by that initial data, you’re in good company: neither were the authors of that study.

0:10:29.819,0:10:35.450
So they went back a month later and they ran another study where they brought in 40 new people to the lab

0:10:35.450,0:10:39.947
who were either iPhone users, qwerty users, or nine-key users.

0:10:39.947,0:10:42.871
And now it’s more of an apples-to-apples comparison

0:10:42.871,0:10:48.989
in that they are going to test people who are relative experts in these three different modalities —

0:10:48.989,0:10:55.307
after about a month on the iPhone you’re <i>probably</i> starting to asymptote in terms of your performance.

0:10:55.307,0:11:02.878
Definitely it gets better over time, even past a month; but, you know, a month starts to get more reasonable.

0:11:02.878,0:11:12.011
And what they found was that iPhone users and qwerty users were about the same in terms of speed,

0:11:12.011,0:11:16.921
and that the numeric keypad users were much slower.

0:11:16.921,0:11:21.738
So once again our manipulation is going to be input style and we’re going to measure speed.

0:11:21.738,0:11:24.558
This time we’re also going to measure error rate.

0:11:24.558,0:11:30.416
And what we see is that iPhone users and qwerty users are essentially the same speed.

0:11:30.416,0:11:36.545
However, the iPhone users make many more errors.

0:11:36.545,0:11:40.153
Now, one thing I should point out about the study is

0:11:40.153,0:11:46.775
that each of the different devices was used by a different group of people.

0:11:46.775,0:11:51.596
And it was done this way so that each device was used by somebody

0:11:51.596,0:11:55.881
who is comfortable and had experience with working with that device.

0:11:55.881,0:12:00.518
And so, we removed the worry that you had newbies working on these devices.

0:12:00.518,0:12:04.595
However, especially in 2007, there may have been significant differences

0:12:04.595,0:12:11.310
in who the people were who were the early adopters of the 2007 iPhone,

0:12:11.310,0:12:17.053
or maybe business users were particularly drawn to the qwerty devices or people who had better things

0:12:17.053,0:12:22.457
to do with their time than send e-mail on their telephone were using the nine-key devices.

0:12:22.457,0:12:26.639
And so, while this comparison is better than the previous one,

0:12:26.639,0:12:31.501
the potential for variation between the user populations is still problematic.

0:12:31.501,0:12:36.838
If what you’d like to be able to claim is something about the intrinsic properties of the device,

0:12:36.838,0:12:42.212
the differences you see may, at least in part, have to do with the users instead.

0:12:42.212,0:12:45.445
So, what are some strategies for fairer comparison?

0:12:45.445,0:12:50.253
To brainstorm a couple of options: one thing that you can do is insert your approach into the production setting

0:12:50.253,0:12:52.687
and this may seem like a lot of work —

0:12:52.687,0:12:56.543
sometimes it is, but in the age of the web this is a lot easier than it used to be.

0:12:56.543,0:13:03.126
And it’s possible even if you don’t have access to the server of the service that you’re comparing against.

0:13:03.126,0:13:06.564
You can use things like a proxy server or client-side scripting

0:13:06.564,0:13:11.566
to be able to put your own technique in and have an apples-to-apples comparison.

0:13:11.566,0:13:16.576
A second strategy for neutralizing the environment difference between a production version

0:13:16.576,0:13:25.692
and your new approach is to make a version of the production thing in the same style as your new approach.

0:13:25.692,0:13:30.897
That also makes them equivalent in terms of their implementation fidelity.

0:13:30.897,0:13:34.003
A third strategy and one that’s used commonly in research,

0:13:34.003,0:13:39.423
is to scale things down so you’re looking at just a piece of the system at a particular point in time.

0:13:39.423,0:13:42.711
That way you don’t have to worry about implementing a whole big, giant thing.

0:13:42.711,0:13:48.186
You can just focus on one small piece and have that comparison be fair.

0:13:48.186,0:13:52.775
And the fourth strategy is that when expertise is relevant,

0:13:52.775,0:13:55.859
train people up, giving them the practice that they need,

0:13:55.859,0:14:00.742
so that they can start at least hitting that asymptote in terms of performance

0:14:00.742,0:14:04.990
and you can get a better read than you would from them as newbies.

0:14:04.990,0:14:11.804
So now to close out this lecture, if somebody asks you the question “Is interface x better than interface y?”

0:14:11.804,0:14:15.259
you know that we’re off to a good start because we have a comparison.

0:14:15.259,0:14:18.541
However, you also know to be worried: What does “better” mean?

0:14:18.541,0:14:25.963
And often, in a complex system, you’re going to have several measures. That’s totally cool.

0:14:25.963,0:14:30.578
There’s a lot of value in being explicit though about what it is you mean by better —

0:14:30.578,0:14:33.722
What are you trying to accomplish? What are you trying to [im]prove?

0:14:33.722,0:14:38.003
And if anybody ever tells you that their interface is <i>always</i> better,

0:14:38.003,0:14:44.296
don’t believe them because nearly all of the time the answer is going to be “it depends.”

0:14:44.296,0:14:48.441
And the interesting question is “What does it depend on?”

0:14:48.441,0:14:53.004
Most interfaces are good for some things and not for others.

0:14:53.004,0:14:57.972
For example if you have a tablet computer where all of the screen is devoted to display,

0:14:57.972,0:15:04.204
that is going to be great for reading, for web browsing, for that kind of activity, looking at pictures.

0:15:04.204,0:15:06.374
Not so good if you want to type a novel.

0:15:06.374,0:15:09.143
So here, we’ve introduced controlled comparison

0:15:09.143,0:15:13.777
as a way of finding the smoking gun, as a way of inferring cause.

0:15:13.777,0:15:17.313
And often, when you have only two conditions,

0:15:17.313,0:15:21.000
we’re going to talk about that as being a minimal pairs design.

0:15:21.000,0:15:24.920
As a practicing designer, the reason to care about what’s causal

0:15:24.920,0:15:29.605
is that it gives you the material to make a better decision going forward.

0:15:29.605,0:15:32.205
A lot of studies violate this constraint.

0:15:32.205,0:15:39.711
And that gets dangerous because it prevents you from being able to make sound decisions.

0:15:39.711,0:15:43.800
I hope that the tools that we’ve talked about today and in the next several lectures

0:15:43.800,0:15:48.823
will help you become a wise skeptic like our friend in this XKCD comic.

0:15:48.823,0:15:53.001
I’ll see you next time.