WEBVTT 00:00:02.008 --> 00:00:04.895 In this lecture, we’re going to talk about trying out your interface with people 00:00:04.895 --> 00:00:12.050 and doing so in a way that lets you improve your designs based on what you learn. 00:00:12.050 --> 00:00:16.802 One of the most common things that people ask when running studies is: “Do you like my interface?” 00:00:16.802 --> 00:00:20.990 and it’s a really natural thing to ask, because on some level it’s what we all want to know. 00:00:20.990 --> 00:00:23.916 But this is really problematic on a whole lot of levels. 00:00:23.916 --> 00:00:28.099 For one, it’s not very specific, and so sometimes people are trying to make this better 00:00:28.099 --> 00:00:34.359 and so they’ll improve it by doing something like: “How much do you like my interface on a one to five scale?” 00:00:34.359 --> 00:00:39.401 Or: “‘This is a useful interface’ — Agree or disagree on a one to five scale.” 00:00:39.401 --> 00:00:42.636 And this adds some kind of a patina of scientificness to it 00:00:42.636 --> 00:00:46.749 but really it’s just the same thing — you’re asking somebody “Do you like my interface?” 00:00:46.749 --> 00:00:49.679 And people are nice, so they’re going to say “Sure, I like your interface.” 00:00:49.679 --> 00:00:52.257 This is the “please the experimenter” bias. 00:00:52.257 --> 00:00:56.699 And this can be especially strong when there are social or cultural or power differences 00:00:56.699 --> 00:01:00.606 between the experimenter and the people that you’re trying out your interface with: 00:01:00.606 --> 00:01:05.061 For example, [inaudible] and colleagues showed this effect in India 00:01:05.061 --> 00:01:09.316 where this effect was exacerbated when the experimenter was white.
00:01:09.316 --> 00:01:15.908 Now, you should not take this to mean that you shouldn’t have your developers try out stuff with users — 00:01:15.908 --> 00:01:21.800 Being the person who is both the developer and the person who is trying stuff out is incredibly valuable. 00:01:21.800 --> 00:01:24.522 And one example I like a lot of this is Mike Krieger, 00:01:24.522 --> 00:01:30.125 one of the Instagram founders — [he] is also a former master’s student and TA of mine. 00:01:30.125 --> 00:01:32.313 And Mike, when he left Stanford and joined Silicon Valley, 00:01:32.313 --> 00:01:36.483 every Friday afternoon he would bring people into the lab, into his office, 00:01:36.483 --> 00:01:39.606 and have them try out whatever they were working on that week. 00:01:39.606 --> 00:01:43.217 And so that way they were able to get this regular feedback each week 00:01:43.217 --> 00:01:48.009 and the people who were building those systems got to see real people trying them out. 00:01:48.009 --> 00:01:52.169 This can be nails-on-a-chalkboard painful, but you’ll also learn a ton. 00:01:52.169 --> 00:01:55.118 So how do we get beyond “Do you like my interface?” 00:01:55.118 --> 00:01:58.972 The basic strategy that we’re going to talk about today is being able 00:01:58.972 --> 00:02:05.039 to use specific measures and concrete questions to deliver meaningful results. 00:02:05.039 --> 00:02:10.216 One of the problems of “Do you like my interface?” is “Compared to what?” 00:02:10.216 --> 00:02:16.102 And I think one of the reasons people say “Yeah sure” is that there’s no comparison point 00:02:16.102 --> 00:02:21.889 and so one thing that’s really important is when you’re measuring the effectiveness of your interface, 00:02:21.889 --> 00:02:25.783 even informally, it’s really nice to have some kind of comparison. 00:02:25.783 --> 00:02:28.687 It’s also important to think about, well, what’s the yardstick? 00:02:28.687 --> 00:02:31.184 What constitutes “good” in this arena?
00:02:31.184 --> 00:02:33.925 What are the measures that you’re going to use? 00:02:33.925 --> 00:02:36.885 So how can we get beyond “Do you like my interface?” 00:02:36.885 --> 00:02:41.071 One of the ways that we can start out is by asking a base rate question, 00:02:41.071 --> 00:02:46.526 like “What fraction of people click on the first link in a search results page?” 00:02:46.526 --> 00:02:50.142 Or “What fraction of students come to class?” 00:02:50.142 --> 00:02:54.555 Once we start to measure correlations, things get even more interesting, 00:02:54.555 --> 00:03:00.328 like, “Is there a relationship between the time of day a class is offered and how many students attend it?” 00:03:00.328 --> 00:03:07.610 Or “Is there a relationship between the order of a search result and the clickthrough rate?” 00:03:07.610 --> 00:03:11.492 For both students and clickthrough, there can be multiple explanations. 00:03:11.492 --> 00:03:16.410 For example, if there are fewer students that attend early morning classes, 00:03:16.410 --> 00:03:19.054 is that a function of when students want to show up, 00:03:19.054 --> 00:03:22.865 or is that a function of when good professors want to teach? 00:03:22.865 --> 00:03:26.219 With the clickthrough example, there are also two kinds of explanations. 00:03:26.219 --> 00:03:37.528 If lower placed links yield fewer clicks, is that because the links are of intrinsically poorer quality, 00:03:37.528 --> 00:03:41.075 or is it because people just click on the first link — 00:03:41.075 --> 00:03:45.238 [that] they don’t bother getting to the second one even if it might be better? 00:03:45.238 --> 00:03:48.869 To isolate the effect of placement and identify it as playing a causal role, 00:03:48.869 --> 00:03:54.155 you’d need to isolate that as a variable by, say, randomizing the order of search results. 00:03:54.155 --> 00:04:00.329 As we start to talk about these experiments, let’s introduce a few terms that are going to help us.
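The randomization idea just described can be sketched roughly in Python. This is only an illustration under assumptions: the function names, the link labels, and the log format of (position, clicked) pairs are hypothetical and do not come from the lecture. If every visitor sees the same links in a fresh random order, then any remaining drop in clicks for lower positions can be attributed to placement rather than to link quality.

```python
import random
from collections import defaultdict


def randomized_order(results, rng):
    """Return a shuffled copy, so that position no longer tracks link quality."""
    shuffled = list(results)
    rng.shuffle(shuffled)
    return shuffled


def clickthrough_by_position(log):
    """Compute the clickthrough rate for each on-page position.

    log: iterable of (position, clicked) pairs, one per link shown.
    Returns {position: fraction of showings at that position that were clicked}.
    """
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for position, was_clicked in log:
        shown[position] += 1
        clicked[position] += int(was_clicked)
    return {pos: clicked[pos] / shown[pos] for pos in shown}


# Hypothetical usage: shuffle the same three links for one visitor.
rng = random.Random(0)  # seeded only so the sketch is repeatable
links = ["link_a", "link_b", "link_c"]  # placeholder labels
page = randomized_order(links, rng)
```

Here position is the independent variable (the thing manipulated) and the per-position clickthrough rate is the dependent variable (the thing measured).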
00:04:00.329 --> 00:04:05.485 The multiple different conditions that we try, that’s the thing we are manipulating — 00:04:05.485 --> 00:04:12.402 for example, the time of a class, or the location of a particular link on a search results page. 00:04:12.402 --> 00:04:18.379 These manipulations are independent variables because they are independent of what the user does. 00:04:18.379 --> 00:04:22.245 They are in the control of the experimenter. 00:04:22.245 --> 00:04:26.706 Then we are going to measure what the user does 00:04:26.706 --> 00:04:31.447 and those measures are called dependent variables because they depend on what the user does. 00:04:31.447 --> 00:04:36.007 Common measures in HCI include things like task completion time — 00:04:36.007 --> 00:04:38.983 How long does it take somebody to complete a task 00:04:38.983 --> 00:04:43.375 (for example, find something I want to buy, create a new account, order an item)? 00:04:43.375 --> 00:04:46.838 Accuracy — How many mistakes did people make, 00:04:46.838 --> 00:04:51.298 and were those fatal errors or were those things that they were able to quickly recover from? 00:04:51.300 --> 00:04:55.376 Recall — How much does a person remember afterward, or after periods of non-use? 00:04:55.376 --> 00:04:59.183 And emotional response — How does the person feel about the tasks being completed? 00:04:59.183 --> 00:05:01.440 Were they confident, were they stressed? 00:05:01.440 --> 00:05:04.354 Would the user recommend this system to a friend? 00:05:04.354 --> 00:05:09.075 So, your independent variables are the things that you manipulate, 00:05:09.075 --> 00:05:11.983 your dependent variables are the things that you measure. 00:05:11.983 --> 00:05:14.031 How reliable is your experiment? 00:05:14.031 --> 00:05:17.573 If you ran this again, would you see the same results? 00:05:17.573 --> 00:05:20.922 That’s the internal validity of an experiment. 
00:05:20.922 --> 00:05:24.776 So, to have a precise experiment, you need to remove the confounding factors. 00:05:24.776 --> 00:05:30.348 Also, it’s important to study enough people so that the result is unlikely to have been by chance. 00:05:30.348 --> 00:05:34.373 You may be able to run the same study over and over and get the same results 00:05:34.373 --> 00:05:42.212 but it may not matter in some real-world sense; the external validity is the generalizability of your results. 00:05:42.212 --> 00:05:44.898 Does this apply only to eighteen-year-olds in a college classroom? 00:05:44.898 --> 00:05:47.908 Or does this apply to everybody in the world? 00:05:47.908 --> 00:05:52.003 Let’s bring this back to HCI and talk about one of the problems you’re likely to face as a designer. 00:05:52.003 --> 00:05:55.499 I think one of the things that we commonly want to be able to do 00:05:55.499 --> 00:06:00.364 is to be able to ask something like “Is my cool new approach better than the industry standard?” 00:06:00.364 --> 00:06:03.290 Because after all, that’s why you’re making the new thing. 00:06:03.290 --> 00:06:06.956 Now, one of the challenges with this, especially early on in the design process 00:06:06.956 --> 00:06:11.026 is that you may have something which is very much in its prototype stages 00:06:11.026 --> 00:06:16.841 and something that is the industry standard is likely to benefit from years and years of refinement. 00:06:16.841 --> 00:06:21.514 And at the same time, it may be stuck with years and years of cruft 00:06:21.514 --> 00:06:25.114 which may or may not be intrinsic to its approach. 00:06:25.114 --> 00:06:30.586 So if you compare your cool new tool to some industry standard, there are two things varying here. 00:06:30.586 --> 00:06:35.725 One is the fidelity of the implementation and the other one of course is the approach.
00:06:35.725 --> 00:06:37.822 Consequently, when you get the results, 00:06:37.822 --> 00:06:43.933 you can’t know whether to attribute the results to fidelity or approach or some combination of the two. 00:06:43.933 --> 00:06:48.400 So we’re going to talk about ways of teasing apart those different causal factors. 00:06:48.400 --> 00:06:53.712 Now, one thing I should say right off the bat is there are some times where it may be more 00:06:53.712 --> 00:06:57.332 or less relevant whether you have a good handle on what the causal factors are. 00:06:57.332 --> 00:07:01.407 So for example, if you’re trying to decide between two different digital cameras, 00:07:01.407 --> 00:07:07.730 at the end of the day, maybe all you care about is image quality or usability or some other factor 00:07:07.730 --> 00:07:12.828 and exactly what makes that image quality better or worse 00:07:12.828 --> 00:07:17.834 or any other element along the way may be less relevant to you. 00:07:17.834 --> 00:07:24.032 If you don’t have control over the variables, then identifying cause may not always be what you want. 00:07:24.032 --> 00:07:27.693 But when you are a designer, you do have control over the variables, 00:07:27.693 --> 00:07:30.718 and that’s when it is really important to ascertain cause. 00:07:30.718 --> 00:07:35.951 Here’s an example of a study that came out right when the iPhone was released, 00:07:35.951 --> 00:07:41.041 done by a research firm User Centric, and I’m going to read from this news article here. 00:07:41.041 --> 00:07:43.496 Research firm User Centric has released a study 00:07:43.496 --> 00:07:48.734 that tries to gauge how effective the iPhone’s unusual onscreen keyboard is. 00:07:48.734 --> 00:07:51.066 The goal is certainly a noble one 00:07:51.066 --> 00:07:56.337 but I cannot say the survey’s approach results in data that makes much sense. 00:07:56.337 --> 00:07:59.857 User Centric brought in twenty owners of other phones. 
00:07:59.857 --> 00:08:05.118 Half had qwerty keyboards, half had ordinary numeric phones, with keypads. 00:08:05.118 --> 00:08:08.086 None were familiar with the iPhone. 00:08:08.086 --> 00:08:13.677 The research involved having the test subjects enter six sample text messages with the phones 00:08:13.677 --> 00:08:17.335 that they already had, and six with the iPhone. 00:08:17.335 --> 00:08:20.817 The end result was that the iPhone newbies took twice as long 00:08:20.817 --> 00:08:26.785 to enter text with an iPhone as they did with their own phones and made lots more typos. 00:08:26.785 --> 00:08:31.625 So let’s critique this study and talk about its benefits and drawbacks. 00:08:31.625 --> 00:08:34.025 Here’s the webpage directly from User Centric. 00:08:34.025 --> 00:08:37.615 What’s our manipulation in this study? 00:08:37.615 --> 00:08:41.779 Well, the manipulation is going to be the input style. 00:08:41.779 --> 00:08:45.078 How about the measure in the study? 00:08:45.078 --> 00:08:48.630 It’s going to be the words per minute. 00:08:48.630 --> 00:08:56.312 And there’s absolutely value in being able to measure the initial usability of the iPhone. 00:08:56.312 --> 00:09:00.368 For one, if you’re introducing new technology, 00:09:00.368 --> 00:09:03.678 it’s beneficial if people are able to get up to speed pretty quickly. 00:09:03.678 --> 00:09:09.326 However, it’s important to realize that this comparison is intrinsically unfair 00:09:09.326 --> 00:09:14.945 because the users of the previous cell phones were experts at that input modality 00:09:14.945 --> 00:09:18.696 and the people who are using the iPhone are novices in that modality.
00:09:18.696 --> 00:09:24.036 And so it seems quite likely that the iPhone users, once they become actual users, 00:09:24.036 --> 00:09:29.476 are going to get better over time and so if you’re not used to something the first time you try it, 00:09:29.476 --> 00:09:35.060 that may not be a deal killer, and it’s certainly not an apples-to-apples comparison. 00:09:35.060 --> 00:09:40.008 Another thing that we don’t get out of this article is “Is this difference significant?” 00:09:40.008 --> 00:09:46.965 So we read that each person typed six messages in each of two conditions 00:09:46.965 --> 00:09:52.004 and so they did their own device and the iPhone, or vice versa. 00:09:52.004 --> 00:10:00.001 Six messages each and that the iPhone users were half the speed of the… 00:10:00.001 --> 00:10:08.812 or rather the people typing with the iPhone were half as fast as when they got to type with a mini qwerty 00:10:08.812 --> 00:10:12.572 or the device that they were accustomed to. 00:10:12.572 --> 00:10:17.131 So while this may tell us something about the initial usability of the iPhone, 00:10:17.131 --> 00:10:23.014 in terms of the long-term usability, you know, I don’t think we get so much out of this here. 00:10:23.014 --> 00:10:29.819 If you weren’t satisfied by that initial data, you’re in good company: neither were the authors of that study. 00:10:29.819 --> 00:10:35.450 So they went back a month later and they ran another study where they brought in 40 new people to the lab 00:10:35.450 --> 00:10:39.947 who were either iPhone users, qwerty users, or nine key users. 00:10:39.947 --> 00:10:42.871 And now it’s more of an apples-to-apples comparison 00:10:42.871 --> 00:10:48.989 in that they are going to test people that are relative experts in these three different modalities — 00:10:48.989 --> 00:10:55.307 after about a month on the iPhone you’re probably starting to asymptote in terms of your performance.
00:10:55.307 --> 00:11:02.878 Definitely it gets better over time, even past a month; but, you know, a month starts to get more reasonable. 00:11:02.878 --> 00:11:12.011 And what they found was that iPhone users and qwerty users were about the same in terms of speed, 00:11:12.011 --> 00:11:16.921 and that the numeric keypad users were much slower. 00:11:16.921 --> 00:11:21.738 So once again our manipulation is going to be input style and we’re going to measure speed. 00:11:21.738 --> 00:11:24.558 This time we’re also going to measure error rate. 00:11:24.558 --> 00:11:30.416 And what we see is that iPhone users and qwerty users are essentially the same speed. 00:11:30.416 --> 00:11:36.545 However, the iPhone users make many more errors. 00:11:36.545 --> 00:11:40.153 Now, one thing I should point out about the study is 00:11:40.153 --> 00:11:46.775 that each of the different devices was used by a different group of people. 00:11:46.775 --> 00:11:51.596 And it was done this way so that each device was used by somebody 00:11:51.596 --> 00:11:55.881 who is comfortable and had experience with working with that device. 00:11:55.881 --> 00:12:00.518 And so, we removed the worry that you had newbies working on these devices. 00:12:00.518 --> 00:12:04.595 However, especially in 2007, there may have been significant differences 00:12:04.595 --> 00:12:11.310 in who the people were: there were the early adopters of the 2007 iPhone, 00:12:11.310 --> 00:12:17.053 or maybe business users were particularly drawn to the qwerty devices, or people who had better things 00:12:17.053 --> 00:12:22.457 to do with their time than send e-mail on their telephone were using the nine key devices. 00:12:22.457 --> 00:12:26.639 And so, while this comparison is better than the previous one, 00:12:26.639 --> 00:12:31.501 the potential for variation between the user populations is still problematic.
00:12:31.501 --> 00:12:36.838 If what you’d like to be able to claim is something about the intrinsic properties of the device, 00:12:36.838 --> 00:12:42.212 the difference you see may at least in part have to do with the users. 00:12:42.212 --> 00:12:45.445 So, what are some strategies for fairer comparison? 00:12:45.445 --> 00:12:50.253 To brainstorm a couple of options: one thing that you can do is insert your approach into the production setting 00:12:50.253 --> 00:12:52.687 and this may seem like a lot of work — 00:12:52.687 --> 00:12:56.543 sometimes it is, but in the age of the web this is a lot easier than it used to be. 00:12:56.543 --> 00:13:03.126 And it’s possible even if you don’t have access to the server of the service that you’re comparing against. 00:13:03.126 --> 00:13:06.564 You can use things like a proxy server or client-side scripting 00:13:06.564 --> 00:13:11.566 to be able to put your own technique in and have an apples-to-apples comparison. 00:13:11.566 --> 00:13:16.576 A second strategy for neutralizing the environment difference between a production version 00:13:16.576 --> 00:13:25.692 and your new approach is to make a version of the production thing in the same style as your new approach. 00:13:25.692 --> 00:13:30.897 That also makes them equivalent in terms of their implementation fidelity. 00:13:30.897 --> 00:13:34.003 A third strategy, and one that’s used commonly in research, 00:13:34.003 --> 00:13:39.423 is to scale things down so you’re looking at just a piece of the system at a particular point in time. 00:13:39.423 --> 00:13:42.711 That way you don’t have to worry about implementing a whole big, giant thing. 00:13:42.711 --> 00:13:48.186 You can just focus on one small piece and have that comparison be fair.
00:13:48.186 --> 00:13:52.775 And the fourth strategy is that when expertise is relevant, 00:13:52.775 --> 00:13:55.859 train people up — give them the practice that they need — 00:13:55.859 --> 00:14:00.742 so that they can start at least hitting that asymptote in terms of performance 00:14:00.742 --> 00:14:04.990 and you can get a better read than you would get from them as newbies. 00:14:04.990 --> 00:14:11.804 So now to close out this lecture, if somebody asks you the question “Is interface x better than interface y?” 00:14:11.804 --> 00:14:15.259 you know that we’re off to a good start because we have a comparison. 00:14:15.259 --> 00:14:18.541 However, you also know to be worried: What does “better” mean? 00:14:18.541 --> 00:14:25.963 And often, in a complex system, you’re going to have several measures. That’s totally cool. 00:14:25.963 --> 00:14:30.578 There’s a lot of value in being explicit though about what it is you mean by better — 00:14:30.578 --> 00:14:33.722 What are you trying to accomplish? What are you trying to [im]prove? 00:14:33.722 --> 00:14:38.003 And if anybody ever tells you that their interface is always better, 00:14:38.003 --> 00:14:44.296 don’t believe them because nearly all of the time the answer is going to be “it depends.” 00:14:44.296 --> 00:14:48.441 And the interesting question is “What does it depend on?” 00:14:48.441 --> 00:14:53.004 Most interfaces are good for some things and not for others. 00:14:53.004 --> 00:14:57.972 For example, if you have a tablet computer where all of the screen is devoted to display, 00:14:57.972 --> 00:15:04.204 that is going to be great for reading, for web browsing, for that kind of activity, looking at pictures. 00:15:04.204 --> 00:15:06.374 Not so good if you want to type a novel. 00:15:06.374 --> 00:15:09.143 So here, we’ve introduced controlled comparison 00:15:09.143 --> 00:15:13.777 as a way of finding the smoking gun, as a way of inferring cause.
00:15:13.777 --> 00:15:17.313 And often, when you have only two conditions, 00:15:17.313 --> 00:15:21.000 we’re going to talk about that as being a minimal pairs design. 00:15:21.000 --> 00:15:24.920 As a practicing designer, the reason to care about what’s causal 00:15:24.920 --> 00:15:29.605 is that it gives you the material to make a better decision going forward. 00:15:29.605 --> 00:15:32.205 A lot of studies violate this constraint. 00:15:32.205 --> 00:15:39.711 And that gets dangerous because it prevents you from being able to make sound decisions. 00:15:39.711 --> 00:15:43.800 I hope that the tools that we’ve talked about today and in the next several lectures 00:15:43.800 --> 00:15:48.823 will help you become a wise skeptic like our friend in this XKCD comic. 00:15:48.823 --> 00:15:53.001 I’ll see you next time.