1
00:00:02,008 --> 00:00:04,895
In this lecture, we’re going to talk about trying out your interface with people
2
00:00:04,895 --> 00:00:12,050
and doing so in a way that you can improve your designs based on what you learned.
3
00:00:12,050 --> 00:00:16,802
One of the most common things that people ask when running studies is: “Do you like my interface?”
4
00:00:16,802 --> 00:00:20,990
and it’s a really natural thing to ask, because on some level it’s what we all want to know.
5
00:00:20,990 --> 00:00:23,916
But this is really problematic on a whole lot of levels.
6
00:00:23,916 --> 00:00:28,099
For one, it’s not very specific, and so sometimes people are trying to make this better
7
00:00:28,099 --> 00:00:34,359
and so they’ll improve it by doing something like: “How much do you like my interface on a one-to-five scale?”
8
00:00:34,359 --> 00:00:39,401
Or: “‘This is a useful interface.’ Agree or disagree on a one-to-five scale.”
9
00:00:39,401 --> 00:00:42,636
And this adds some kind of a patina of scientificness to it
10
00:00:42,636 --> 00:00:46,749
but really it’s just the same thing — you’re asking somebody “Do you like my interface?”
11
00:00:46,749 --> 00:00:49,679
And people are nice, so they’re going to say “Sure I like your interface.”
12
00:00:49,679 --> 00:00:52,257
This is the “please the experimenter” bias.
13
00:00:52,257 --> 00:00:56,699
And this can be especially strong when there are social or cultural or power differences
14
00:00:56,699 --> 00:01:00,606
between the experimenter and the people that you’re trying out your interface with:
15
00:01:00,606 --> 00:01:05,061
For example, [inaudible] and colleagues showed this effect in India
16
00:01:05,061 --> 00:01:09,316
where this effect was exacerbated when the experimenter was white.
17
00:01:09,316 --> 00:01:15,908
Now, you should not take this to mean that you shouldn’t have your developers try out stuff with users —
18
00:01:15,908 --> 00:01:21,800
Being the person who is both the developer and the person who is trying stuff out is incredibly valuable.
19
00:01:21,800 --> 00:01:24,522
And one example I like a lot of this is Mike Krieger,
20
00:01:24,522 --> 00:01:30,125
one of the Instagram founders; [he] is also a former master’s student and TA of mine.
21
00:01:30,125 --> 00:01:32,313
And Mike, when he left Stanford and joined Silicon Valley,
22
00:01:32,313 --> 00:01:36,483
every Friday afternoon he would bring people into his office
23
00:01:36,483 --> 00:01:39,606
and have them try out whatever they were working on that week.
24
00:01:39,606 --> 00:01:43,217
And so that way they were able to get this regular feedback each week
25
00:01:43,217 --> 00:01:48,009
and the people who were building those systems got to see real people trying them out.
26
00:01:48,009 --> 00:01:52,169
This can be nails-on-a-chalkboard painful, but you’ll also learn a ton.
27
00:01:52,169 --> 00:01:55,118
So how do we get beyond “Do you like my interface?”
28
00:01:55,118 --> 00:01:58,972
The basic strategy that we’re going to talk about today is being able
29
00:01:58,972 --> 00:02:05,039
to use specific measures and concrete questions to be able to deliver meaningful results.
30
00:02:05,039 --> 00:02:10,216
One of the problems of “Do you like my interface?” is “Compared to what?”
31
00:02:10,216 --> 00:02:16,102
And I think one of the reasons people say “Yeah sure” is that there’s no comparison point
32
00:02:16,102 --> 00:02:21,889
and so one thing that’s really important is when you’re measuring the effectiveness of your interface,
33
00:02:21,889 --> 00:02:25,783
even informally, it’s really nice to have some kind of comparison.
34
00:02:25,783 --> 00:02:28,687
It’s also important to think about, well, what’s the yardstick?
35
00:02:28,687 --> 00:02:31,184
What constitutes “good” in this arena?
36
00:02:31,184 --> 00:02:33,925
What are the measures that you’re going to use?
37
00:02:33,925 --> 00:02:36,885
So how can we get beyond “Do you like my interface?”
38
00:02:36,885 --> 00:02:41,071
One of the ways that we can start out is by asking a base rate question,
39
00:02:41,071 --> 00:02:46,526
like “What fraction of people click on the first link in a search results page?”
40
00:02:46,526 --> 00:02:50,142
Or “What fraction of students come to class?”
41
00:02:50,142 --> 00:02:54,555
Once we start to measure correlations things get even more interesting,
42
00:02:54,555 --> 00:03:00,328
like, “Is there a relationship between the time of day a class is offered and how many students attend it?”
43
00:03:00,328 --> 00:03:07,610
Or “Is there a relationship between the order of a search result and the clickthrough rate?”
44
00:03:07,610 --> 00:03:11,492
For both students and clickthrough, there can be multiple explanations.
45
00:03:11,492 --> 00:03:16,410
For example, if there are fewer students that attend early morning classes,
46
00:03:16,410 --> 00:03:19,054
is that a function of when students want to show up,
47
00:03:19,054 --> 00:03:22,865
or is that a function of when good professors want to teach?
48
00:03:22,865 --> 00:03:26,219
With the clickthrough example, there are also two kinds of explanations.
49
00:03:26,219 --> 00:03:37,528
If lower-placed links yield fewer clicks, is that because the links are of intrinsically poorer quality,
50
00:03:37,528 --> 00:03:41,075
or is it because people just click on the first link —
51
00:03:41,075 --> 00:03:45,238
[that] they don’t bother getting to the second one even if it might be better?
52
00:03:45,238 --> 00:03:48,869
To isolate the effect of placement and identify it as playing a causal role,
53
00:03:48,869 --> 00:03:54,155
you’d need to isolate that as a variable by, say, randomizing the order of search results.
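To make the randomization idea concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the function names `shuffled_results` and `clickthrough_by_position`, the three placeholder results, and the simulated user who always clicks the top link. It is not how any real search engine runs this experiment.

```python
import random
from collections import defaultdict


def shuffled_results(results, rng):
    """Return a copy of the results in a random order (the manipulation)."""
    order = list(results)
    rng.shuffle(order)
    return order


def clickthrough_by_position(trials):
    """Compute the fraction of trials in which each position was clicked.

    trials: iterable of (shown_order, clicked_item) pairs.
    """
    clicks = defaultdict(int)
    total = 0
    for shown_order, clicked_item in trials:
        total += 1
        clicks[shown_order.index(clicked_item)] += 1
    return {pos: clicks[pos] / total for pos in sorted(clicks)}


if __name__ == "__main__":
    rng = random.Random(0)
    results = ["A", "B", "C"]  # hypothetical placeholder results
    trials = []
    for _ in range(100):
        shown = shuffled_results(results, rng)
        # Hypothetical user who always clicks whatever is shown first.
        trials.append((shown, shown[0]))
    print(clickthrough_by_position(trials))  # {0: 1.0}
```

Because each result lands in each position about equally often, a clickthrough rate that still tracks position cannot be explained by the quality of the links themselves.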
54
00:03:54,155 --> 00:04:00,329
As we start to talk about these experiments, let’s introduce a few terms that are going to help us.
55
00:04:00,329 --> 00:04:05,485
The multiple different conditions that we try, that’s the thing we are manipulating —
56
00:04:05,485 --> 00:04:12,402
for example, the time of a class, or the location of a particular link on a search results page.
57
00:04:12,402 --> 00:04:18,379
These manipulations are independent variables because they are independent of what the user does.
58
00:04:18,379 --> 00:04:22,245
They are in the control of the experimenter.
59
00:04:22,245 --> 00:04:26,706
Then we are going to measure what the user does
60
00:04:26,706 --> 00:04:31,447
and those measures are called dependent variables because they depend on what the user does.
61
00:04:31,447 --> 00:04:36,007
Common measures in HCI include things like task completion time —
62
00:04:36,007 --> 00:04:38,983
How long does it take somebody to complete a task
63
00:04:38,983 --> 00:04:43,375
(for example, find something I want to buy, create a new account, order an item)?
64
00:04:43,375 --> 00:04:46,838
Accuracy — How many mistakes did people make,
65
00:04:46,838 --> 00:04:51,298
and were those fatal errors or were those things that they were able to quickly recover from?
66
00:04:51,300 --> 00:04:55,376
Recall — How much does a person remember afterward, or after periods of non-use?
67
00:04:55,376 --> 00:04:59,183
And emotional response — How does the person feel about the tasks being completed?
68
00:04:59,183 --> 00:05:01,440
Were they confident, were they stressed?
69
00:05:01,440 --> 00:05:04,354
Would the user recommend this system to a friend?
70
00:05:04,354 --> 00:05:09,075
So, your independent variables are the things that you manipulate,
71
00:05:09,075 --> 00:05:11,983
your dependent variables are the things that you measure.
72
00:05:11,983 --> 00:05:14,031
How reliable is your experiment?
73
00:05:14,031 --> 00:05:17,573
If you ran this again, would you see the same results?
74
00:05:17,573 --> 00:05:20,922
That’s the internal validity of an experiment.
75
00:05:20,922 --> 00:05:24,776
So, to have a precise experiment, you need to better remove the confounding factors.
76
00:05:24,776 --> 00:05:30,348
Also, it’s important to study enough people so that the result is unlikely to have been by chance.
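One hedged way to check whether a difference between two conditions could plausibly be chance is a permutation test. The sketch below is a swapped-in illustration, not a method named in this lecture; the function name `permutation_p_value` and the task times in the usage example are hypothetical.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in means.

    Repeatedly shuffles the pooled measurements and counts how often a
    random split differs at least as much as the observed split.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations


if __name__ == "__main__":
    # Hypothetical task completion times in seconds, made up for illustration.
    interface_x = [41, 38, 45, 50, 39, 44]
    interface_y = [52, 47, 55, 49, 58, 51]
    print(permutation_p_value(interface_x, interface_y))
```

A small value suggests the observed difference would rarely arise from a random split of the same measurements; with only a handful of participants, the test has little power to detect real differences.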
77
00:05:30,348 --> 00:05:34,373
You may be able to run the same study over and over and get the same results
78
00:05:34,373 --> 00:05:42,212
but it may not matter in some real-world sense. External validity is the generalizability of your results.
79
00:05:42,212 --> 00:05:44,898
Does this apply only to eighteen-year-olds in a college classroom?
80
00:05:44,898 --> 00:05:47,908
Or does this apply to everybody in the world?
81
00:05:47,908 --> 00:05:52,003
Let’s bring this back to HCI and talk about one of the problems you’re likely to face as a designer.
82
00:05:52,003 --> 00:05:55,499
I think one of the things that we commonly want to be able to do
83
00:05:55,499 --> 00:06:00,364
is to be able to ask something like “Is my cool new approach better than the industry standard?”
84
00:06:00,364 --> 00:06:03,290
Because after all, that’s why you’re making the new thing.
85
00:06:03,290 --> 00:06:06,956
Now, one of the challenges with this, especially early on in the design process
86
00:06:06,956 --> 00:06:11,026
is that you may have something which is very much in its prototype stages
87
00:06:11,026 --> 00:06:16,841
and something that is the industry standard is likely to benefit from years and years of refinement.
88
00:06:16,841 --> 00:06:21,514
And at the same time, it may be stuck with years and years of cruft
89
00:06:21,514 --> 00:06:25,114
which may or may not be intrinsic to its approach.
90
00:06:25,114 --> 00:06:30,586
So if you compare your cool new tool to some industry standard, there are two things varying here.
91
00:06:30,586 --> 00:06:35,725
One is the fidelity of the implementation and the other one of course is the approach.
92
00:06:35,725 --> 00:06:37,822
Consequently, when you get the results,
93
00:06:37,822 --> 00:06:43,933
you can’t know whether to attribute the results to fidelity or approach or some combination of the two.
94
00:06:43,933 --> 00:06:48,400
So we’re going to talk about ways of teasing apart those different causal factors.
95
00:06:48,400 --> 00:06:53,712
Now, one thing I should say right off the bat is there are some times where it may be more
96
00:06:53,712 --> 00:06:57,332
or less relevant whether you have a good handle on what the causal factors are.
97
00:06:57,332 --> 00:07:01,407
So for example, if you’re trying to decide between two different digital cameras,
98
00:07:01,407 --> 00:07:07,730
at the end of the day, maybe all you care about is image quality or usability or some other factor
99
00:07:07,730 --> 00:07:12,828
and exactly what makes that image quality better or worse
100
00:07:12,828 --> 00:07:17,834
or any other element along the way may be less relevant to you.
101
00:07:17,834 --> 00:07:24,032
If you don’t have control over the variables, then identifying cause may not always be what you want.
102
00:07:24,032 --> 00:07:27,693
But when you are a designer, you do have control over the variables,
103
00:07:27,693 --> 00:07:30,718
and that’s when it is really important to ascertain cause.
104
00:07:30,718 --> 00:07:35,951
Here’s an example of a study that came out right when the iPhone was released,
105
00:07:35,951 --> 00:07:41,041
done by the research firm User Centric, and I’m going to read from this news article here.
106
00:07:41,041 --> 00:07:43,496
Research firm User Centric has released a study
107
00:07:43,496 --> 00:07:48,734
that tries to gauge how effective the iPhone’s unusual onscreen keyboard is.
108
00:07:48,734 --> 00:07:51,066
The goal is certainly a noble one
109
00:07:51,066 --> 00:07:56,337
but I cannot say the survey’s approach results in data that makes much sense.
110
00:07:56,337 --> 00:07:59,857
User Centric brought in twenty owners of other phones.
111
00:07:59,857 --> 00:08:05,118
Half had qwerty keyboards, half had ordinary numeric phones, with keypads.
112
00:08:05,118 --> 00:08:08,086
None were familiar with the iPhone.
113
00:08:08,086 --> 00:08:13,677
The research involved having the test subjects enter six sample text messages with the phones
114
00:08:13,677 --> 00:08:17,335
that they already had, and six with the iPhone.
115
00:08:17,335 --> 00:08:20,817
The end result was that the iPhone newbies took twice as long
116
00:08:20,817 --> 00:08:26,785
to enter text with an iPhone as they did with their own phones and made lots more typos.
117
00:08:26,785 --> 00:08:31,625
So let’s critique this study and talk about its benefits and drawbacks.
118
00:08:31,625 --> 00:08:34,025
Here’s the webpage directly from User Centric.
119
00:08:34,025 --> 00:08:37,615
What’s our manipulation in this study?
120
00:08:37,615 --> 00:08:41,779
Well the manipulation is going to be the input style.
121
00:08:41,779 --> 00:08:45,078
How about the measure in the study?
122
00:08:45,078 --> 00:08:48,630
It’s going to be the words per minute.
123
00:08:48,630 --> 00:08:56,312
And there’s absolutely value in being able to measure the initial usability of the iPhone.
124
00:08:56,312 --> 00:09:00,368
There are several reasons; one is that if you’re introducing new technology,
125
00:09:00,368 --> 00:09:03,678
it’s beneficial if people are able to get up to speed pretty quickly.
126
00:09:03,678 --> 00:09:09,326
However it’s important to realize that this comparison is intrinsically unfair
127
00:09:09,326 --> 00:09:14,945
because the users of the previous cell phones were experts at that input modality
128
00:09:14,945 --> 00:09:18,696
and the people who are using the iPhone are novices in that modality.
129
00:09:18,696 --> 00:09:24,036
And so it seems quite likely that the iPhone users, once they become actual users,
130
00:09:24,036 --> 00:09:29,476
are going to get better over time and so if you’re not used to something the first time you try it,
131
00:09:29,476 --> 00:09:35,060
that may not be a deal killer, and it’s certainly not an apples-to-apples comparison.
132
00:09:35,060 --> 00:09:40,008
Another thing that we don’t get out of this article is “Is this difference significant?”
133
00:09:40,008 --> 00:09:46,965
So we read that each person typed six messages in each of two conditions
134
00:09:46,965 --> 00:09:52,004
and so they did their own device and the iPhone, or vice versa.
135
00:09:52,004 --> 00:10:00,001
Six messages each and that the iPhone users were half the speed of the…
136
00:10:00,001 --> 00:10:08,812
or rather the people typing with the iPhone were half as fast as when they got to type with a mini qwerty
137
00:10:08,812 --> 00:10:12,572
on the device that they were accustomed to.
138
00:10:12,572 --> 00:10:17,131
So while this may tell us something about the initial usability of the iPhone,
139
00:10:17,131 --> 00:10:23,014
in terms of the long-term usability, you know, I don’t think we get so much out of this here.
140
00:10:23,014 --> 00:10:29,819
If you weren’t satisfied by that initial data, you’re in good company: neither were the authors of that study.
141
00:10:29,819 --> 00:10:35,450
So they went back a month later and they ran another study where they brought in 40 new people to the lab
142
00:10:35,450 --> 00:10:39,947
who were either iPhone users, qwerty users, or nine key users.
143
00:10:39,947 --> 00:10:42,871
And now it’s more of an apples-to-apples comparison
144
00:10:42,871 --> 00:10:48,989
in that they are going to test people that are relatively experts in these three different modalities —
145
00:10:48,989 --> 00:10:55,307
after about a month on the iPhone you’re probably starting to asymptote in terms of your performance.
146
00:10:55,307 --> 00:11:02,878
Definitely it gets better over time, even past a month; but, you know, a month starts to get more reasonable.
147
00:11:02,878 --> 00:11:12,011
And what they found was that iPhone users and qwerty users were about the same in terms of speed,
148
00:11:12,011 --> 00:11:16,921
and that the numeric keypad users were much slower.
149
00:11:16,921 --> 00:11:21,738
So once again our manipulation is going to be input style and we’re going to measure speed.
150
00:11:21,738 --> 00:11:24,558
This time we’re also going to measure error rate.
151
00:11:24,558 --> 00:11:30,416
And what we see is that iPhone users and qwerty users are essentially the same speed.
152
00:11:30,416 --> 00:11:36,545
However, the iPhone users make many more errors.
153
00:11:36,545 --> 00:11:40,153
Now, one thing I should point out about the study is
154
00:11:40,153 --> 00:11:46,775
that each of the different devices was used by a different group of people.
155
00:11:46,775 --> 00:11:51,596
And it was done this way so that each device was used by somebody
156
00:11:51,596 --> 00:11:55,881
who is comfortable and had experience with working with that device.
157
00:11:55,881 --> 00:12:00,518
And so, we removed the worry that you had newbies working on these devices.
158
00:12:00,518 --> 00:12:04,595
However, especially in 2007, there may have been significant differences
159
00:12:04,595 --> 00:12:11,310
in who the people were who were the early adopters of the 2007 iPhone,
160
00:12:11,310 --> 00:12:17,053
or maybe business users were particularly drawn to the qwerty devices or people who had better things
161
00:12:17,053 --> 00:12:22,457
to do with their time than send e-mail on their telephone were using the nine key devices.
162
00:12:22,457 --> 00:12:26,639
And so, while this comparison is better than the previous one,
163
00:12:26,639 --> 00:12:31,501
the potential for variation between the user populations is still problematic.
164
00:12:31,501 --> 00:12:36,838
If what you’d like to be able to claim is something about the intrinsic properties of the device,
165
00:12:36,838 --> 00:12:42,212
it may at least in part have to do with the users.
166
00:12:42,212 --> 00:12:45,445
So, what are some strategies for fairer comparison?
167
00:12:45,445 --> 00:12:50,253
To brainstorm a couple of options, one thing that you can do is insert your approach into your production setting
168
00:12:50,253 --> 00:12:52,687
and this may seem like a lot of work —
169
00:12:52,687 --> 00:12:56,543
sometimes it is but in the age of the web this is a lot easier than it used to be.
170
00:12:56,543 --> 00:13:03,126
And it’s possible even if you don’t have access to the server of the service that you’re comparing against.
171
00:13:03,126 --> 00:13:06,564
You can use things like a proxy server or client-side scripting
172
00:13:06,564 --> 00:13:11,566
to be able to put your own technique in and have an apples-to-apples comparison.
173
00:13:11,566 --> 00:13:16,576
A second strategy for neutralizing the environment difference between a production version
174
00:13:16,576 --> 00:13:25,692
and your new approach is to make a version of the production thing in the same style as your new approach.
175
00:13:25,692 --> 00:13:30,897
That also makes them equivalent in terms of their implementation fidelity.
176
00:13:30,897 --> 00:13:34,003
A third strategy and one that’s used commonly in research,
177
00:13:34,003 --> 00:13:39,423
is to scale things down so you’re looking at just a piece of the system at a particular point in time.
178
00:13:39,423 --> 00:13:42,711
That way you don’t have to worry about implementing a whole big, giant thing.
179
00:13:42,711 --> 00:13:48,186
You can just focus on one small piece and have that comparison be fair.
180
00:13:48,186 --> 00:13:52,775
And the fourth strategy is that when expertise is relevant,
181
00:13:52,775 --> 00:13:55,859
train people up, giving them the practice that they need,
182
00:13:55,859 --> 00:14:00,742
so that they can start at least hitting that asymptote in terms of performance
183
00:14:00,742 --> 00:14:04,990
and you can get a better read than what they would be as newbies.
184
00:14:04,990 --> 00:14:11,804
So now to close out this lecture, if somebody asks you the question “Is interface x better than interface y?”
185
00:14:11,804 --> 00:14:15,259
you know that we’re off to a good start because we have a comparison.
186
00:14:15,259 --> 00:14:18,541
However, you also know to be worried: What does “better” mean?
187
00:14:18,541 --> 00:14:25,963
And often, in a complex system, you’re going to have several measures. That’s totally cool.
188
00:14:25,963 --> 00:14:30,578
There’s a lot of value in being explicit though about what it is you mean by better —
189
00:14:30,578 --> 00:14:33,722
What are you trying to accomplish? What are you trying to [im]prove?
190
00:14:33,722 --> 00:14:38,003
And if anybody ever tells you that their interface is always better,
191
00:14:38,003 --> 00:14:44,296
don’t believe them because nearly all of the time the answer is going to be “it depends.”
192
00:14:44,296 --> 00:14:48,441
And the interesting question is “What does it depend on?”
193
00:14:48,441 --> 00:14:53,004
Most interfaces are good for some things and not for others.
194
00:14:53,004 --> 00:14:57,972
For example if you have a tablet computer where all of the screen is devoted to display,
195
00:14:57,972 --> 00:15:04,204
that is going to be great for reading, for web browsing, for that kind of activity, looking at pictures.
196
00:15:04,204 --> 00:15:06,374
Not so good if you want to type a novel.
197
00:15:06,374 --> 00:15:09,143
So here, we’ve introduced controlled comparison
198
00:15:09,143 --> 00:15:13,777
as a way of finding the smoking gun, as a way of inferring cause.
199
00:15:13,777 --> 00:15:17,313
And often, when you have only two conditions,
200
00:15:17,313 --> 00:15:21,000
we’re going to talk about that as being a minimal pairs design.
201
00:15:21,000 --> 00:15:24,920
As a practicing designer, the reason to care about what’s causal
202
00:15:24,920 --> 00:15:29,605
is that it gives you the material to make a better decision going forward.
203
00:15:29,605 --> 00:15:32,205
A lot of studies violate this constraint.
204
00:15:32,205 --> 00:15:39,711
And that gets dangerous because it prevents you from being able to make sound decisions.
205
00:15:39,711 --> 00:15:43,800
I hope that the tools that we’ve talked about today and in the next several lectures
206
00:15:43,800 --> 00:15:48,823
will help you become a wise skeptic like our friend in this XKCD comic.
207
00:15:48,823 --> 00:15:53,001
I’ll see you next time.