1 00:00:00,057 --> 00:00:01,092 So we spent a bunch of time 2 00:00:01,092 --> 00:00:03,032 in the last couple of lectures 3 00:00:03,032 --> 00:00:05,082 talking about different kinds of testing 4 00:00:05,082 --> 00:00:08,021 about unit testing versus integration testing 5 00:00:08,021 --> 00:00:10,010 We talked about how do you use RSpec 6 00:00:10,010 --> 00:00:12,049 to really isolate the parts of your code you want to test 7 00:00:12,049 --> 00:00:14,090 you’ve also, you know, because of homework 3, 8 00:00:14,090 --> 00:00:18,017 and other stuff, we have been doing BDD, 9 00:00:18,017 --> 00:00:20,062 where we’ve been using Cucumber to turn user stories 10 00:00:20,062 --> 00:00:22,095 into, essentially, integration and acceptance tests 11 00:00:22,095 --> 00:00:25,061 So you’ve seen testing in a couple of different levels 12 00:00:25,061 --> 00:00:27,063 and the goal here is sort of to do a few remarks 13 00:00:27,063 --> 00:00:29,092 to, you know, let’s back up a little bit 14 00:00:29,092 --> 00:00:33,001 and see the big picture, and tie those things together 15 00:00:33,001 --> 00:00:34,095 So this sort of spans material 16 00:00:34,095 --> 00:00:37,000 that covers three or four sections in the book 17 00:00:37,000 --> 00:00:39,061 and I want to just hit the high points in lecture 18 00:00:39,061 --> 00:00:41,046 So a question that comes up 19 00:00:41,046 --> 00:00:43,025 I’m sure it’s come up for all of you 20 00:00:43,025 --> 00:00:44,052 as you have been doing homework 21 00:00:44,052 --> 00:00:45,069 is: “How much testing is enough?” 22 00:00:45,069 --> 00:00:48,049 And, sadly, for a long time 23 00:00:48,049 --> 00:00:51,009 kind of if you asked this question in industry 24 00:00:51,009 --> 00:00:52,017 the answer was basically 25 00:00:52,017 --> 00:00:53,017 “Well, we have a shipping deadline, 26 00:00:53,017 --> 00:00:54,099 so however much testing we can do 27 00:00:54,099 --> 00:00:56,066 before that deadline, that’s how much.” 28 00:00:56,066 
--> 00:00:58,015 That’s what you have time for. 29 00:00:58,015 --> 00:01:00,002 So, you know, that’s a little flip 30 00:01:00,002 --> 00:01:01,011 obviously not very good 31 00:01:01,011 --> 00:01:02,054 So you can do a bit better, right? 32 00:01:02,054 --> 00:01:03,070 There’re some static measures 33 00:01:03,070 --> 00:01:06,003 like how many lines of code does your app have 34 00:01:06,003 --> 00:01:08,021 and how many lines of tests do you have? 35 00:01:08,021 --> 00:01:10,029 And it’s not unusual in industry 36 00:01:10,029 --> 00:01:12,068 in a well-tested piece of software 37 00:01:12,068 --> 00:01:14,057 for the number of lines of tests 38 00:01:14,057 --> 00:01:17,073 to go far beyond the number of lines of code 39 00:01:17,073 --> 00:01:19,075 So, integer multiples are not unusual 40 00:01:19,075 --> 00:01:21,084 And I think even for sort of, you know, 41 00:01:21,084 --> 00:01:23,022 research code or classwork 42 00:01:23,022 --> 00:01:26,085 a ratio of, you know, maybe 1.5 is not unreasonable 43 00:01:26,085 --> 00:01:30,005 so one and a half times the amount of test code 44 00:01:30,005 --> 00:01:32,024 as you have application code 45 00:01:32,024 --> 00:01:34,022 And in a lot of production systems 46 00:01:34,022 --> 00:01:35,027 where they really care about testing 47 00:01:35,027 --> 00:01:36,091 it is much higher than that 48 00:01:36,091 --> 00:01:38,015 So maybe a better question to ask: 49 00:01:38,015 --> 00:01:39,047 Rather than saying “How much testing is enough?” 50 00:01:39,047 --> 00:01:42,049 is to ask “How good is the testing I am doing now? 
51 00:01:42,049 --> 00:01:44,035 How thorough is it?” 52 00:01:44,035 --> 00:01:45,056 Later in this semester 53 00:01:45,056 --> 00:01:46,056 Professor Sen will talk about 54 00:01:46,056 --> 00:01:48,018 a little bit about formal methods 55 00:01:48,018 --> 00:01:50,085 and sort of what’s at the frontiers of testing and debugging 56 00:01:50,085 --> 00:01:52,068 But a couple of things that we can talk about 57 00:01:52,068 --> 00:01:54,007 based on what you already know 58 00:01:54,007 --> 00:01:57,074 is some basic concepts about test coverage 59 00:01:57,074 --> 00:01:59,054 And although I would say 60 00:01:59,054 --> 00:02:01,001 you know, we’ve been saying all along 61 00:02:01,001 --> 00:02:03,003 formal methods, they don’t really work on big systems 62 00:02:03,003 --> 00:02:05,033 I think that statement, in my personal opinion 63 00:02:05,033 --> 00:02:07,001 is actually a lot less true than it used to be 64 00:02:07,001 --> 00:02:09,019 I think there are a number of specific places 65 00:02:09,019 --> 00:02:10,052 especially in testing and debugging 66 00:02:10,052 --> 00:02:12,084 where formal methods are actually making fast progress 67 00:02:12,084 --> 00:02:15,075 and Koushik Sen is one of the leaders in that 68 00:02:15,075 --> 00:02:17,094 So you’ll have the opportunity to hear more about that later 69 00:02:17,094 --> 00:02:21,043 but for the moment I think, kind of bread and butter 70 00:02:21,043 --> 00:02:22,073 is let’s talk about coverage measurement 71 00:02:22,073 --> 00:02:24,047 because this is where the rubber meets the road 72 00:02:24,047 --> 00:02:26,020 in terms of how you’d be evaluated 73 00:02:26,020 --> 00:02:28,063 if you are doing this for real 74 00:02:28,063 --> 00:02:29,052 So what’s some basics? 
75 00:02:29,052 --> 00:02:30,078 Here’s a really simple class you can use 76 00:02:30,078 --> 00:02:32,090 to talk about different ways to measure 77 00:02:32,090 --> 00:02:34,080 how our test covers this code 78 00:02:34,080 --> 00:02:36,063 And there’re a few different levels 79 00:02:36,063 --> 00:02:37,085 with different terminologies 80 00:02:37,085 --> 00:02:40,073 It’s not really universal across all software houses 81 00:02:40,073 --> 00:02:42,064 But one common set of terminology 82 00:02:42,064 --> 00:02:43,064 that the book exposes 83 00:02:43,064 --> 00:02:44,068 is we could talk about S0 84 00:02:44,068 --> 00:02:47,045 where we’d just mean you’ve called every method once 85 00:02:47,045 --> 00:02:50,045 So you know, if you call foo, and you call bar, you’re done 86 00:02:50,045 --> 00:02:52,015 That’s S0 coverage: not terribly thorough 87 00:02:52,015 --> 00:02:54,068 A little more stringent, S1, is 88 00:02:54,068 --> 00:02:56,013 you could say, we’re calling every method 89 00:02:56,013 --> 00:02:57,028 from every place that it could be called 90 00:02:57,028 --> 00:02:58,082 So what does that mean? 
91 00:02:58,082 --> 00:03:00,007 It means, for example 92 00:03:00,007 --> 00:03:01,012 it’s not enough to call bar 93 00:03:01,012 --> 00:03:02,095 You have to make sure that you call it 94 00:03:02,095 --> 00:03:05,057 at least once from in here 95 00:03:05,057 --> 00:03:07,016 as well as calling it once 96 00:03:07,016 --> 00:03:10,037 from any exterior function that might call it 97 00:03:10,037 --> 00:03:12,081 C0 which is what SimpleCov measures 98 00:03:12,081 --> 00:03:15,099 (those of you who’ve gotten SimpleCov up and running) 99 00:03:15,099 --> 00:03:18,052 basically says you’ve executed every statement 100 00:03:18,052 --> 00:03:20,004 you’ve touched every statement in your code once 101 00:03:20,004 --> 00:03:22,048 But the caveat there is that 102 00:03:22,048 --> 00:03:25,058 conditionals really just count as a single statement 103 00:03:25,058 --> 00:03:28,091 So, if you, no matter which branch of this “if” you took 104 00:03:28,091 --> 00:03:31,074 as long as you touched one or the other branch 105 00:03:31,074 --> 00:03:33,035 you’ve executed the “if” statement 106 00:03:33,035 --> 00:03:35,066 So even C0 is still, you know, sort of superficial coverage 107 00:03:35,066 --> 00:03:37,026 But, as we will see 108 00:03:37,026 --> 00:03:39,023 the way that you will want to read this information is: 109 00:03:39,023 --> 00:03:41,079 if you are getting bad coverage at the C0 level 110 00:03:41,079 --> 00:03:44,007 then you have really really bad coverage 111 00:03:44,007 --> 00:03:46,008 So if you are not kind of making 112 00:03:46,008 --> 00:03:47,037 this simple level of superficial coverage 113 00:03:47,037 --> 00:03:50,002 then your testing is probably deficient 114 00:03:50,002 --> 00:03:51,091 C1 is the next step up from that 115 00:03:51,091 --> 00:03:53,071 We could say: 116 00:03:53,071 --> 00:03:55,019 Well, we have to take every branch in both directions 117 00:03:55,019 --> 00:03:56,061 So, when we are doing this “if” statement 118 
00:03:56,061 --> 00:03:58,066 we have to make sure that 119 00:03:58,066 --> 00:03:59,092 we do the “if x” part once 120 00:03:59,092 --> 00:04:05,013 and the “if not x” part at least once to meet C1 121 00:04:05,013 --> 00:04:08,036 You can augment that with decision coverage 122 00:04:08,036 --> 00:04:09,063 saying: Well, if we’re gonna… 123 00:04:09,063 --> 00:04:12,036 If we have “if” statements where the condition 124 00:04:12,036 --> 00:04:13,088 is made up of multiple terms 125 00:04:13,088 --> 00:04:15,071 we have to make sure that every subexpression 126 00:04:15,071 --> 00:04:17,097 has been evaluated in both directions 127 00:04:17,097 --> 00:04:19,067 In other words, that means that 128 00:04:19,067 --> 00:04:22,041 if we’re going to fail this “if” statement 129 00:04:22,041 --> 00:04:24,034 we have to make sure to fail it at least once 130 00:04:24,034 --> 00:04:26,044 because y was false and at least once because z was false 131 00:04:26,044 --> 00:04:28,088 In other words, any subexpression that could 132 00:04:28,088 --> 00:04:31,021 independently change the outcome of the condition 133 00:04:31,021 --> 00:04:34,048 has to be exercised in both directions 134 00:04:34,048 --> 00:04:36,003 And then, 135 00:04:36,003 --> 00:04:38,052 kind of, the one that, you know, a lot of people aspire to 136 00:04:38,052 --> 00:04:41,026 but there is disagreement on how much more valuable it is 137 00:04:41,026 --> 00:04:42,083 is you take every path through the code 138 00:04:42,083 --> 00:04:45,053 Obviously, this is kind of difficult because 139 00:04:45,053 --> 00:04:48,033 it tends to be exponential in the number of conditions 140 00:04:48,033 --> 00:04:53,008 And in general it’s difficult 141 00:04:53,008 --> 00:04:55,031 to evaluate if you’ve taken every path through the code 142 00:04:55,031 --> 00:04:57,001 There are formal techniques that you can use 143 00:04:57,001 --> 00:04:58,083 to tell you where the holes are 144 00:04:58,083 --> 00:05:01,031 but the 
bottom line is that 145 00:05:01,031 --> 00:05:03,004 in most commercial software houses 146 00:05:03,004 --> 00:05:04,089 there is, I would say, not complete consensus 147 00:05:04,089 --> 00:05:06,070 on how much more valuable C2 is 148 00:05:06,070 --> 00:05:08,068 compared to C0 or C1 149 00:05:08,068 --> 00:05:10,013 So, I think, for the purpose of our class 150 00:05:10,013 --> 00:05:11,067 you get exposed to the idea 151 00:05:11,067 --> 00:05:13,020 of how you use coverage information 152 00:05:13,020 --> 00:05:16,040 SimpleCov takes advantage of some built-in Ruby features 153 00:05:16,040 --> 00:05:18,009 to give you C0 coverage 154 00:05:18,009 --> 00:05:19,062 [It] does really nice reports 155 00:05:19,062 --> 00:05:21,025 We can sort of see it 156 00:05:21,025 --> 00:05:22,096 at the level of individual lines in your file 157 00:05:22,096 --> 00:05:24,091 You can see what your coverage is 158 00:05:24,091 --> 00:05:27,015 and I think that’s kind of a, you know 159 00:05:27,015 --> 00:05:31,018 a good start for where we are 160 00:05:31,018 --> 00:05:33,076 So, having seen, sort of, different flavours of tests 161 00:05:33,076 --> 00:05:37,020 stepping back and looking at the big picture 162 00:05:37,020 --> 00:05:38,098 what are the different kinds of tests 163 00:05:38,098 --> 00:05:40,078 that we’ve seen concretely? 164 00:05:40,078 --> 00:05:42,032 and what are the tradeoffs 165 00:05:42,032 --> 00:05:43,089 between using those different kinds of tests? 
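
To make the coverage levels from the preceding discussion concrete, here is a minimal Ruby sketch. The class below is an assumption: it stands in for the slide's example, which is not reproduced in this transcript. Only the names foo, bar, x, y and z come from the lecture; MyClass, outcomes and the input lists are hypothetical.

```ruby
# Hypothetical stand-in for the "really simple class" shown on the slide.
class MyClass
  attr_reader :w

  def foo(x, y, z)
    if x
      if y && z          # condition made up of two terms
        bar(0)
      end
    else
      bar(1)
    end
  end

  def bar(value)
    @w = value
  end
end

# Calls foo on a fresh object for each input triple and returns the
# resulting values of @w (nil means bar was never reached).
def outcomes(inputs)
  inputs.map do |x, y, z|
    obj = MyClass.new
    obj.foo(x, y, z)
    obj.w
  end
end

# C0: every statement is executed at least once.
C0_INPUTS = [[true, true, true], [false, true, true]]

# C1: additionally take the inner "if" in its false direction.
C1_INPUTS = C0_INPUTS + [[true, false, true]]

# Decision coverage as described in the lecture: fail the inner "if"
# once because y is false and once because z is false.
DECISION_INPUTS = C1_INPUTS + [[true, true, false]]

p outcomes(C0_INPUTS)        # => [0, 1]
p outcomes(DECISION_INPUTS)  # => [0, 1, nil, nil]
```

Note that the two C0 inputs already touch every statement, yet they never show what happens when the inner condition fails. That gap is what the C1 and decision-coverage inputs are meant to close.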
166 00:05:43,089 --> 00:05:47,016 So we’ve seen at the level of individual classes or methods 167 00:05:47,016 --> 00:05:50,009 we use RSpec, with extensive use of mocking and stubbing 168 00:05:50,009 --> 00:05:53,004 So, for example when we do testing methods in the model 169 00:05:53,004 --> 00:05:55,057 that will be an example of unit testing 170 00:05:55,057 --> 00:05:59,025 We also did something that is pretty similar to 171 00:05:59,025 --> 00:06:00,097 functional or module testing 172 00:06:00,097 --> 00:06:02,071 where there is more than one module participating 173 00:06:02,071 --> 00:06:04,065 So, for example when we did controller specs 174 00:06:04,065 --> 00:06:07,085 we saw that—we simulate a POST action 175 00:06:07,085 --> 00:06:09,029 but remember that the POST action 176 00:06:09,029 --> 00:06:10,086 has to go through the routing subsystem 177 00:06:10,086 --> 00:06:12,042 before it gets to the controller 178 00:06:12,042 --> 00:06:14,048 Once the controller is done it will try to render a view 179 00:06:14,048 --> 00:06:16,007 So in fact there’s other pieces 180 00:06:16,007 --> 00:06:17,067 that collaborate with the controller 181 00:06:17,067 --> 00:06:19,099 that have to be working in order for controller specs to pass 182 00:06:19,099 --> 00:06:21,051 So that’s somewhere inbetween: 183 00:06:21,051 --> 00:06:23,035 where we’re doing more than a single method 184 00:06:23,035 --> 00:06:25,000 touching more than a single class 185 00:06:25,000 --> 00:06:27,000 but we’re still concentrating [our] attention 186 00:06:27,000 --> 00:06:28,088 on a fairly narrow slice of the system at a time 187 00:06:28,088 --> 00:06:31,044 and we’re still using mocking and stubbing extensively 188 00:06:31,044 --> 00:06:35,030 to sort of isolate that behaviour that we want to test 189 00:06:35,030 --> 00:06:36,091 And then at the level of Cucumber scenarios 190 00:06:36,091 --> 00:06:38,047 these are more like integration or system tests 191 00:06:38,047 --> 
00:06:41,069 They exercise complete paths throughout the application 192 00:06:41,069 --> 00:06:43,044 They probably touch a lot of different modules 193 00:06:43,044 --> 00:06:46,003 They make minimal use of mocks and stubs 194 00:06:46,003 --> 00:06:48,032 because part of the goal of an integration test 195 00:06:48,032 --> 00:06:50,099 is exactly to test the interaction between pieces 196 00:06:50,099 --> 00:06:53,021 So you don’t want to stub or control those interactions 197 00:06:53,021 --> 00:06:54,080 You actually want to let the system do 198 00:06:54,080 --> 00:06:56,030 what it would really do 199 00:06:56,030 --> 00:06:58,025 if this was a scenario happening in production 200 00:06:58,025 --> 00:07:00,069 So how would we compare these different kinds of tests? 201 00:07:00,069 --> 00:07:02,038 There’s a few different axes we can look at 202 00:07:02,038 --> 00:07:05,007 One of them is how long they take to run 203 00:07:05,007 --> 00:07:06,090 Now, both RSpec and Cucumber 204 00:07:06,090 --> 00:07:09,013 have, kind of, high startup times and stuff like that 205 00:07:09,013 --> 00:07:10,008 But, as you’ll see 206 00:07:10,008 --> 00:07:11,090 as you start adding more and more RSpec tests 207 00:07:11,090 --> 00:07:14,038 and using autotest to run them in the background 208 00:07:14,038 --> 00:07:17,088 by and large, once RSpec kind of gets off the launching pad 209 00:07:17,088 --> 00:07:19,092 it runs specs really fast 210 00:07:19,092 --> 00:07:21,095 whereas running Cucumber features just takes a long time 211 00:07:21,095 --> 00:07:24,059 as it essentially fires up your entire application 212 00:07:24,059 --> 00:07:26,010 And later in this semester 213 00:07:26,010 --> 00:07:28,086 we’ll see a way to make Cucumber even slower— 214 00:07:28,086 --> 00:07:30,070 which is to have it fire up an entire browser 215 00:07:30,070 --> 00:07:33,045 basically act like a puppet, remote-controlling Firefox 216 00:07:33,045 --> 00:07:35,083 so you can test 
JavaScript code 217 00:07:35,083 --> 00:07:37,000 We’ll do that when we actually— 218 00:07:37,000 --> 00:07:40,032 I think we’ll be able to work with our friends at Sauce Labs 219 00:07:40,032 --> 00:07:42,080 so you can do that in the cloud—That will be exciting 220 00:07:42,080 --> 00:07:45,083 So, “run fast” versus “run slow” 221 00:07:45,083 --> 00:07:46,068 Resolution: 222 00:07:46,068 --> 00:07:48,025 If an error happens in your unit tests 223 00:07:48,025 --> 00:07:49,075 it’s usually pretty easy 224 00:07:49,075 --> 00:07:52,029 to figure out and track down what the source of that error is 225 00:07:52,029 --> 00:07:53,071 because the tests are so isolated 226 00:07:53,071 --> 00:07:56,025 You’ve stubbed out everything that doesn’t matter 227 00:07:56,025 --> 00:07:58,025 and you’re focusing on only the behaviour of interest 228 00:07:58,025 --> 00:07:59,076 So, if you’ve done a good job of doing that 229 00:07:59,076 --> 00:08:01,097 when something goes wrong in one of your tests 230 00:08:01,097 --> 00:08:03,045 there’s not a lot of places 231 00:08:03,045 --> 00:08:04,088 that something could have gone wrong 232 00:08:04,088 --> 00:08:07,041 In contrast, if you’re running a Cucumber scenario 233 00:08:07,041 --> 00:08:08,089 that’s got, you know, 10 steps 234 00:08:08,089 --> 00:08:10,031 and every step is touching 235 00:08:10,031 --> 00:08:11,061 a whole bunch of pieces of the app 236 00:08:11,061 --> 00:08:12,091 it could take a long time 237 00:08:12,091 --> 00:08:14,076 to actually get to the bottom of a bug 238 00:08:14,076 --> 00:08:16,014 So it is kind of a tradeoff 239 00:08:16,014 --> 00:08:17,054 in how well you can localize errors 240 00:08:17,054 --> 00:08:20,065 Coverage: 241 00:08:20,065 --> 00:08:23,002 It’s possible if you write a good suite 242 00:08:23,002 --> 00:08:24,072 of unit and functional tests 243 00:08:24,072 --> 00:08:26,020 you can get really high coverage 244 00:08:26,020 --> 00:08:27,085 You can run your SimpleCov 
report 245 00:08:27,085 --> 00:08:30,080 and you can actually identify specific lines in your files 246 00:08:30,080 --> 00:08:32,036 that have not been exercised by any test 247 00:08:32,036 --> 00:08:34,016 and then you can go write tests that cover them 248 00:08:34,016 --> 00:08:36,014 So, figuring out how to improve your coverage 249 00:08:36,014 --> 00:08:37,057 for example at the C0 level 250 00:08:37,057 --> 00:08:40,021 is something much more easily done with unit tests 251 00:08:40,021 --> 00:08:42,018 whereas, with a Cucumber test— 252 00:08:42,018 --> 00:08:43,078 with a Cucumber scenario— 253 00:08:43,078 --> 00:08:45,076 you are touching a lot of parts of the code 254 00:08:45,076 --> 00:08:47,080 but you are doing it very sparsely 255 00:08:47,080 --> 00:08:49,038 So, if your goal is to get your coverage up 256 00:08:49,038 --> 00:08:51,031 use the tools that are at the unit level 257 00:08:51,031 --> 00:08:53,007 so that you can focus on understanding 258 00:08:53,007 --> 00:08:54,074 what parts of your code are undertested 259 00:08:54,074 --> 00:08:56,055 and then you can write very targeted tests 260 00:08:56,055 --> 00:08:58,086 just to focus on them 261 00:08:58,086 --> 00:09:01,043 And, sort of, you know, putting those pieces together 262 00:09:01,043 --> 00:09:03,039 the unit tests 263 00:09:03,039 --> 00:09:05,059 because of their isolation and their fine resolution 264 00:09:05,059 --> 00:09:07,039 tend to use a lot of mocks 265 00:09:07,039 --> 00:09:09,012 to isolate the behaviours you don’t care about 266 00:09:09,012 --> 00:09:11,020 But that means that, by definition 267 00:09:11,020 --> 00:09:12,070 you’re not testing the interfaces 268 00:09:12,070 --> 00:09:14,099 and it’s sort of a “received wisdom” in software 269 00:09:14,099 --> 00:09:16,069 that a lot of the interesting bugs 270 00:09:16,069 --> 00:09:18,076 occur at the interfaces between pieces 271 00:09:18,076 --> 00:09:20,078 and not sort of within a class or within 
a method— 272 00:09:20,078 --> 00:09:22,040 those are sort of the easy bugs to track down 273 00:09:22,040 --> 00:09:24,026 And at the other extreme 274 00:09:24,026 --> 00:09:26,081 the more you get towards the integration testing extreme 275 00:09:26,081 --> 00:09:29,072 you’re supposed to rely less and less on mocks 276 00:09:29,072 --> 00:09:30,090 for that exact reason 277 00:09:30,090 --> 00:09:32,066 Now we saw, if you’re testing something like 278 00:09:32,066 --> 00:09:34,015 say, in a service-oriented architecture 279 00:09:34,015 --> 00:09:35,089 where you have to interact with the remote site 280 00:09:35,089 --> 00:09:37,028 you still end up 281 00:09:37,028 --> 00:09:38,094 having to do a fair amount of mocking and stubbing 282 00:09:38,094 --> 00:09:40,028 so that you don’t rely on the Internet 283 00:09:40,028 --> 00:09:41,067 in order for your tests to pass 284 00:09:41,067 --> 00:09:43,006 but, generally speaking 285 00:09:43,006 --> 00:09:47,014 you’re trying to remove as many of the mocks that you can 286 00:09:47,014 --> 00:09:48,095 and let the system run the way it would run in real life 287 00:09:48,095 --> 00:09:52,070 So, the good news is you are testing the interfaces 288 00:09:52,070 --> 00:09:54,074 but when something goes wrong in one of the interfaces 289 00:09:54,074 --> 00:09:57,053 because your resolution is not as good 290 00:09:57,053 --> 00:10:00,031 it may take longer to figure out what it is 291 00:10:00,031 --> 00:10:05,019 So, what’s sort of the high-order bit from this tradeoff 292 00:10:05,019 --> 00:10:07,024 is you don’t really want to rely 293 00:10:07,024 --> 00:10:08,076 too heavily on any one kind of test 294 00:10:08,076 --> 00:10:10,078 They serve different purposes and, depending on 295 00:10:10,078 --> 00:10:13,043 are you trying to exercise your interfaces more 296 00:10:13,043 --> 00:10:15,089 or are you trying to improve your fine-grain coverage 297 00:10:15,089 --> 00:10:18,003 that affects how you develop 
your test suite 298 00:10:18,003 --> 00:10:20,065 and you’ll evolve it along with your software 299 00:10:20,065 --> 00:10:24,014 So, we’ve used a certain set of terminology in testing 300 00:10:24,014 --> 00:10:26,028 It’s the terminology that, by and large 301 00:10:26,028 --> 00:10:29,001 is most commonly used in the Rails community 302 00:10:29,001 --> 00:10:30,060 but there’s some variation 303 00:10:30,060 --> 00:10:33,069 [and] some other terms that you might hear 304 00:10:33,069 --> 00:10:35,018 if you go get a job somewhere 305 00:10:35,018 --> 00:10:36,093 and you hear about mutation testing 306 00:10:36,093 --> 00:10:38,072 which we haven’t done 307 00:10:38,072 --> 00:10:40,024 This is an interesting idea that was, I think, invented by 308 00:10:40,024 --> 00:10:43,037 Ammann and Offutt, who have, sort of 309 00:10:43,037 --> 00:10:44,093 the definitive book on software testing 310 00:10:44,093 --> 00:10:46,048 The idea is: 311 00:10:46,048 --> 00:10:48,000 Suppose I introduced a deliberate bug into my code 312 00:10:48,000 --> 00:10:49,051 does that force some test to fail? 
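
A minimal Ruby sketch of the mutation-testing idea follows. Everything in it is hypothetical and hand-written for illustration (shipping_fee, the two suites, kills?); real mutation tools generate the mutants automatically, which this sketch does not attempt.

```ruby
# Original code under test (hypothetical example).
def shipping_fee(express)
  if express
    10
  else
    4
  end
end

# A hand-made "mutant": the deliberate bug is a flipped condition.
def shipping_fee_mutant(express)
  if !express
    10
  else
    4
  end
end

# Each suite takes an implementation and returns true when all checks pass.
# The weak suite runs both branches but never checks the returned amounts.
WEAK_SUITE = lambda do |impl|
  [true, false].all? { |flag| impl.call(flag).is_a?(Integer) }
end

# The strong suite checks the actual values.
STRONG_SUITE = lambda do |impl|
  impl.call(true) == 10 && impl.call(false) == 4
end

# A suite "kills" the mutant if it passes on the original code
# but fails on the mutated code.
def kills?(suite, original, mutant)
  suite.call(original) && !suite.call(mutant)
end

original = method(:shipping_fee)
mutant   = method(:shipping_fee_mutant)

puts "weak suite kills mutant:   #{kills?(WEAK_SUITE, original, mutant)}"
puts "strong suite kills mutant: #{kills?(STRONG_SUITE, original, mutant)}"
```

The weak suite executes every line of shipping_fee and still lets the mutant survive, which is the kind of gap mutation testing is meant to expose.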
313 00:10:49,051 --> 00:10:53,003 Because, if I changed, you know, “if x” to “if not x” 314 00:10:53,003 --> 00:10:56,010 and no tests fail, then either I’m missing some coverage 315 00:10:56,010 --> 00:10:59,019 or my app is very strange and somehow nondeterministic 316 00:10:59,019 --> 00:11:03,099 Fuzz testing, which Koushik Sen may talk more about 317 00:11:03,099 --> 00:11:07,085 basically, this is the “10,000 monkeys at typewriters 318 00:11:07,085 --> 00:11:09,024 throwing random input at your code” 319 00:11:09,024 --> 00:11:10,037 What’s interesting about it is that 320 00:11:10,037 --> 00:11:11,065 those tests we’ve been doing 321 00:11:11,065 --> 00:11:13,086 essentially are crafted to test the app 322 00:11:13,086 --> 00:11:15,058 the way it was designed 323 00:11:15,058 --> 00:11:16,088 and these, you know, fuzz testing 324 00:11:16,088 --> 00:11:19,064 is about testing the app in ways it wasn’t meant to be used 325 00:11:19,064 --> 00:11:22,098 So, what happens if you throw enormous form submissions 326 00:11:22,098 --> 00:11:25,036 What happens if you put control characters in your forms? 327 00:11:25,036 --> 00:11:27,062 What happens if you submit the same thing over and over? 
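
The "random input" idea can be sketched in a few lines of plain Ruby. This is only an illustration under stated assumptions: first_word_length is a hypothetical handler with a deliberately planted bug, and fuzz is a hypothetical helper, not part of any library. Real fuzzers are far more sophisticated about how they generate input.

```ruby
# Hypothetical input handler with a latent bug: it assumes the text
# always contains at least one word.
def first_word_length(text)
  text.split.first.length   # raises NoMethodError when there are no words
end

# Throws `runs` random ASCII strings (control characters included) at the
# handler and collects every input that made it raise.
def fuzz(runs, seed)
  rng = Random.new(seed)    # fixed seed so a failing run can be replayed
  failures = []
  runs.times do
    length = rng.rand(0..40)
    input = Array.new(length) { rng.rand(0..127).chr }.join
    begin
      first_word_length(input)
    rescue StandardError => e
      failures << [input, e.class]
    end
  end
  failures
end

failures = fuzz(1_000, 42)
puts "#{failures.size} of 1000 random inputs crashed the handler"
puts "example crashing input: #{failures.first[0].inspect}" unless failures.empty?
```

None of the crashing inputs here is something a hand-crafted test of the intended behaviour would be likely to include, such as an empty or whitespace-only string.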
328 00:11:27,062 --> 00:11:29,093 And, Koushik has a statistic that 329 00:11:29,093 --> 00:11:32,033 Microsoft finds up to 20% of their bugs 330 00:11:32,033 --> 00:11:34,064 using some variation of fuzz testing 331 00:11:34,064 --> 00:11:36,029 and that about 25% 332 00:11:36,029 --> 00:11:39,021 of the common Unix command-line programs 333 00:11:39,021 --> 00:11:40,092 can be made to crash 334 00:11:40,092 --> 00:11:44,018 [when] put through aggressive fuzz testing 335 00:11:44,018 --> 00:11:46,089 Defining-use coverage is something that we haven’t done 336 00:11:46,089 --> 00:11:48,089 but it’s another interesting concept 337 00:11:48,089 --> 00:11:50,089 The idea is that at any point in my program 338 00:11:50,089 --> 00:11:52,062 there’s a place where I define— 339 00:11:52,062 --> 00:11:54,046 or I assign a value to some variable— 340 00:11:54,046 --> 00:11:56,000 and then there’s a place downstream 341 00:11:56,000 --> 00:11:57,075 where presumably I’m going to consume that value— 342 00:11:57,075 --> 00:11:59,058 someone’s going to use that value 343 00:11:59,058 --> 00:12:01,013 Have I covered every pair? 
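
As a small illustration of define-use pairs, consider the following Ruby sketch. The function and the labels D1, D2 and U1 are hypothetical names chosen for this example only.

```ruby
# Hypothetical function with two definitions of `rate` and a single use.
def discounted_price(price, member)
  rate = 0                      # D1: first definition of rate
  rate = 20 if member           # D2: second definition, members only
  price * (100 - rate) / 100    # U1: the one place rate is used
end

# Each case is labelled with the definition-use pair it exercises.
DU_CASES = {
  "D1 -> U1" => [[100, false], 100],
  "D2 -> U1" => [[100, true], 80]
}

DU_CASES.each do |pair, (args, expected)|
  actual = discounted_price(*args)
  puts "#{pair}: #{actual == expected ? 'covered and passing' : 'failing'}"
end
```

A line-based coverage tool would likely report all three lines as executed after the first case alone, even though the pair D2 -> U1 has not been exercised yet. That is the extra information DU-coverage is after.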
344 00:12:01,013 --> 00:12:02,059 In other words, do I have tests where every pair 345 00:12:02,059 --> 00:12:04,054 of defining a variable and using it somewhere 346 00:12:04,054 --> 00:12:07,014 is executed at some part of my test suites 347 00:12:07,014 --> 00:12:10,071 It’s sometimes called DU-coverage 348 00:12:10,071 --> 00:12:14,011 And other terms that I think are not as widely used anymore 349 00:12:14,011 --> 00:12:17,071 blackbox versus whitebox, or blackbox versus glassbox 350 00:12:17,071 --> 00:12:20,025 Roughly, a blackbox test is one that is written from 351 00:12:20,025 --> 00:12:22,041 the point of view of the external specification of the thing 352 00:12:22,041 --> 00:12:24,022 [For example:] “This is a hash table 353 00:12:24,022 --> 00:12:26,015 When I put in a key I should get back a value 354 00:12:26,015 --> 00:12:28,011 If I delete the key the value shouldn’t be there” 355 00:12:28,011 --> 00:12:29,099 That’s a blackbox test because it doesn’t say 356 00:12:29,099 --> 00:12:32,028 anything about how the hash table is implemented 357 00:12:32,028 --> 00:12:34,072 and it doesn’t try to stress the implementation 358 00:12:34,072 --> 00:12:36,056 A corresponding whitebox test might be: 359 00:12:36,056 --> 00:12:38,008 “I know something about the hash function 360 00:12:38,008 --> 00:12:39,098 and I’m going to deliberately create 361 00:12:39,098 --> 00:12:41,088 hash keys in my test cases 362 00:12:41,088 --> 00:12:43,078 that cause a lot of hash collisions 363 00:12:43,078 --> 00:12:45,095 to make sure that I’m testing that part of the functionality” 364 00:12:45,095 --> 00:12:49,007 Now, a C0 test coverage tool, like SimpleCov 365 00:12:49,007 --> 00:12:52,001 would reveal that if all you had is blackbox tests 366 00:12:52,001 --> 00:12:53,028 you might find that 367 00:12:53,028 --> 00:12:55,056 the collision coverage code wasn’t being hit very often 368 00:12:55,056 --> 00:12:56,075 And that might tip you off and say: 369 00:12:56,075 --> 
00:12:58,028 “Ok, if I really want to strengthen that— 370 00:12:58,028 --> 00:13:00,008 for one, if I want to boost coverage for those tests 371 00:13:00,008 --> 00:13:02,006 now I have to write a whitebox or a glassbox test 372 00:13:02,006 --> 00:13:04,057 I have to look inside, see what the implementation does 373 00:13:04,057 --> 00:13:05,061 and find specific ways 374 00:13:05,061 --> 00:13:10,060 to try to break the implementation in evil ways” 375 00:13:10,060 --> 00:13:13,075 So, I think, testing is a kind of a way of life, right? 376 00:13:13,075 --> 00:13:16,069 We’ve gotten away from the phase of 377 00:13:16,069 --> 00:13:18,033 “We’d build the whole thing and then we’d test it” 378 00:13:18,033 --> 00:13:19,092 and we’ve gotten into the phase of 379 00:13:19,092 --> 00:13:20,077 “We’re testing as we go” 380 00:13:20,077 --> 00:13:22,048 Testing is really more like a development tool 381 00:13:22,048 --> 00:13:24,022 and like so many development tools 382 00:13:24,022 --> 00:13:25,062 the effectiveness of it depends 383 00:13:25,062 --> 00:13:27,013 on whether you’re using it in a tasteful manner 384 00:13:27,013 --> 00:13:31,002 So, you could say: “Well, let’s see—I kicked the tires 385 00:13:31,002 --> 00:13:33,048 You know, I fired up the browser, I tried a couple of things 386 00:13:33,048 --> 00:13:35,097 (claps hand) Looks like it works! 
Deploy it!” 387 00:13:35,097 --> 00:13:38,045 That’s obviously a little more cavalier than you’d want to be 388 00:13:38,045 --> 00:13:41,024 And, by the way, one of the things that we discovered 389 00:13:41,024 --> 00:13:43,077 with this online course just starting up 390 00:13:43,077 --> 00:13:45,090 when 60,000 people are enrolled in the course 391 00:13:45,090 --> 00:13:48,099 and 0.1% of those people have a problem 392 00:13:48,099 --> 00:13:50,083 you’d get 60 emails 393 00:13:50,083 --> 00:13:53,078 The corollary is: when your site is used by a lot of people 394 00:13:53,078 --> 00:13:55,089 some stupid bug that you didn’t find 395 00:13:55,089 --> 00:13:57,018 but that could have been found by testing 396 00:13:57,018 --> 00:13:59,080 could very quickly generate *a lot* of pain 397 00:13:59,080 --> 00:14:02,023 On the other hand, you don’t want to be dogmatic and say 398 00:14:02,023 --> 00:14:04,056 “Uh, until we have 100% coverage and every test is green 399 00:14:04,056 --> 00:14:06,005 we absolutely will not ship” 400 00:14:06,005 --> 00:14:07,012 That’s not healthy either 401 00:14:07,012 --> 00:14:08,048 And the test quality 402 00:14:08,048 --> 00:14:10,057 doesn’t necessarily correlate with the statement 403 00:14:10,057 --> 00:14:11,064 unless you can say something 404 00:14:11,064 --> 00:14:12,068 about the quality of your tests 405 00:14:12,068 --> 00:14:14,029 just because you’ve executed every line 406 00:14:14,029 --> 00:14:17,010 doesn’t mean that you’ve tested the interesting cases 407 00:14:17,010 --> 00:14:18,068 So, somewhere in between, you could say 408 00:14:18,068 --> 00:14:20,014 “Well, we’ll use coverage tools to identify 409 00:14:20,014 --> 00:14:23,004 undertested or poorly-tested parts of the code 410 00:14:23,004 --> 00:14:24,073 and we’ll use them as a guideline 411 00:14:24,073 --> 00:14:27,011 to sort of help improve our overall confidence level” 412 00:14:27,011 --> 00:14:29,024 But remember, Agile is about embracing change 413 
00:14:29,024 --> 00:14:30,032 and dealing with it 414 00:14:30,032 --> 00:14:32,002 Part of change is that things will change in ways that cause 415 00:14:32,002 --> 00:14:33,038 bugs that you didn’t foresee 416 00:14:33,038 --> 00:14:34,031 and the right reaction is: 417 00:14:34,031 --> 00:14:36,026 Be comfortable enough with the testing tools 418 00:14:36,026 --> 00:14:37,064 [so] that you can quickly find those bugs 419 00:14:37,064 --> 00:14:39,025 Write a test that reproduces that bug 420 00:14:39,025 --> 00:14:40,062 And then make the test green 421 00:14:40,062 --> 00:14:41,061 Then you’ve really fixed it 422 00:14:41,061 --> 00:14:43,004 That means, the way that you really fix a bug is 423 00:14:43,004 --> 00:14:45,049 if you created a test that correctly failed 424 00:14:45,049 --> 00:14:46,088 by reproducing that bug 425 00:14:46,088 --> 00:14:48,055 and then you went back and fixed the code 426 00:14:48,055 --> 00:14:49,057 to make those tests pass 427 00:14:49,057 --> 00:14:51,073 Similarly, you don’t want to say 428 00:14:51,073 --> 00:14:53,036 “Well, unit tests give you better coverage 429 00:14:53,036 --> 00:14:54,073 They’re more thorough and detailed 430 00:14:54,073 --> 00:14:56,044 So let’s focus all our energy on that” 431 00:14:56,044 --> 00:14:57,062 as opposed to 432 00:14:57,062 --> 00:14:58,093 “Oh, focus on integration tests 433 00:14:58,093 --> 00:15:00,006 because they’re more realistic, right? 
434 00:15:00,006 --> 00:15:01,056 They reflect what the customer said they want 435 00:15:01,056 --> 00:15:03,034 So, if the integration tests are passing 436 00:15:03,034 --> 00:15:05,067 by definition we’re meeting a customer need” 437 00:15:05,067 --> 00:15:07,034 Again, both extremes are kind of unhealthy 438 00:15:07,034 --> 00:15:09,079 because each one of these can find problems 439 00:15:09,079 --> 00:15:11,031 that would be missed by the other 440 00:15:11,031 --> 00:15:12,060 So, having a good combination of them 441 00:15:12,060 --> 00:15:15,042 is kind of what it is all about 442 00:15:15,042 --> 00:15:18,072 The last thing I want to leave you with, I think 443 00:15:18,072 --> 00:15:20,036 in terms of testing, is “TDD versus 444 00:15:20,036 --> 00:15:22,005 what I call conventional debugging— 445 00:15:22,005 --> 00:15:24,004 i.e., the way that we all kind of do it 446 00:15:24,004 --> 00:15:25,051 even though we say we don’t” 447 00:15:25,051 --> 00:15:26,064 and we’re all trying to get better, right? 448 00:15:26,064 --> 00:15:27,085 We’re all kind of in the gutter 449 00:15:27,085 --> 00:15:29,036 Some of us are looking up at the stars 450 00:15:29,036 --> 00:15:31,011 trying to improve our practices 451 00:15:31,011 --> 00:15:33,099 But, having now lived with this for 3 or 4 years myself 452 00:15:33,099 --> 00:15:35,091 and—I’ll be honest—3 years ago I didn’t do TDD 453 00:15:35,091 --> 00:15:37,079 I do it now, because I find that it’s better 454 00:15:37,079 --> 00:15:40,081 and here’s my distillation of why I think it works for me 455 00:15:40,081 --> 00:15:43,032 Sorry, the colours are a little weird 456 00:15:43,032 --> 00:15:45,000 but on the left column of the table 457 00:15:45,000 --> 00:15:46,034 [it] says “Conventional debugging” 458 00:15:46,034 --> 00:15:47,044 and the right side says “TDD” 459 00:15:47,044 --> 00:15:49,069 So what’s the way I used to write code? 
460 00:15:49,069 --> 00:15:51,056 Maybe some of you still do this 461 00:15:51,056 --> 00:15:53,013 I write a whole bunch of lines 462 00:15:53,013 --> 00:15:54,043 maybe a few tens of lines of code 463 00:15:54,043 --> 00:15:55,059 I’m sure they’re right— 464 00:15:55,059 --> 00:15:56,061 I mean, I am a good programmer, right? 465 00:15:56,061 --> 00:15:57,099 This is not that hard 466 00:15:57,099 --> 00:15:59,002 I run it – It doesn’t work 467 00:15:59,002 --> 00:16:01,098 Ok, fire up the debugger – Start putting in printf’s 468 00:16:01,098 --> 00:16:04,088 If I’d been using TDD what would I do instead? 469 00:16:04,088 --> 00:16:08,022 Well I’d write a few lines of code, having written a test first 470 00:16:08,022 --> 00:16:10,071 So as soon as the test goes from red to green 471 00:16:10,071 --> 00:16:12,064 I know I wrote code that works— 472 00:16:12,064 --> 00:16:15,013 or at least the parts of the behaviour that I had in mind 473 00:16:15,013 --> 00:16:16,096 Those parts of the behaviour work, because I had a test 474 00:16:16,096 --> 00:16:19,056 Ok, back to conventional debugging: 475 00:16:19,056 --> 00:16:21,073 I’m running my program, trying to find the bugs 476 00:16:21,073 --> 00:16:23,028 I start putting in printf’s everywhere 477 00:16:23,028 --> 00:16:24,062 to print out the values of things 478 00:16:24,062 --> 00:16:25,064 which by the way is a lot of fun 479 00:16:25,064 --> 00:16:26,073 when you’re trying to read them 480 00:16:26,073 --> 00:16:28,014 out of the 500 lines of log output 481 00:16:28,014 --> 00:16:29,035 that you’d get in a Rails app 482 00:16:29,035 --> 00:16:30,087 trying to find your printf’s 483 00:16:30,087 --> 00:16:32,035 you know, “I know what I’ll do— 484 00:16:32,035 --> 00:16:34,008 I’ll put in 75 asterisks before and after 485 00:16:34,008 --> 00:16:36,043 That will make it readable” (laughter) 486 00:16:36,043 --> 00:16:38,071 Who doesn’t—Ok, raise your hands if you don’t do this! 
487 00:16:38,071 --> 00:16:40,090 Thank you for your honesty. (laughter) Ok. 488 00:16:40,090 --> 00:16:43,014 Or— Or I could do the other thing, I could say: 489 00:16:43,014 --> 00:16:45,030 Instead of printing the value of a variable 490 00:16:45,030 --> 00:16:47,039 why don’t I write a test that inspects it 491 00:16:47,039 --> 00:16:48,079 with an expectation such as “should” 492 00:16:48,079 --> 00:16:50,090 and I’ll know immediately in bright red letters 493 00:16:50,090 --> 00:16:53,033 if that expectation wasn’t met 494 00:16:53,033 --> 00:16:56,005 Ok, I’m back on the conventional debugging side: 495 00:16:56,005 --> 00:16:58,090 I break out the big guns: I pull out the Ruby debugger 496 00:16:58,092 --> 00:17:02,044 I set a debug breakpoint, and I now start tweaking and say 497 00:17:02,044 --> 00:17:04,085 “Oh, let’s see, I have to get past that ‘if’ statement 498 00:17:04,085 --> 00:17:06,002 so I have to set that thing 499 00:17:06,002 --> 00:17:07,063 Oh, I have to call that method and so I need to…” 500 00:17:07,063 --> 00:17:08,065 No! 501 00:17:08,065 --> 00:17:10,087 I could instead—if I’m going to do that anyway— 502 00:17:10,087 --> 00:17:13,000 let’s just do it in a file, set up some mocks and stubs 503 00:17:13,000 --> 00:17:16,045 to control the code path, make it go the way I want 504 00:17:16,045 --> 00:17:19,013 And now, “Ok, for sure I’ve fixed it! 
505 00:17:19,013 --> 00:17:22,012 I’ll get out of the debugger, run it all again!” 506 00:17:22,012 --> 00:17:24,022 And, of course, 9 times out of 10, you didn’t fix it 507 00:17:24,022 --> 00:17:26,072 or you kind of partly fixed it but you didn’t completely fix it 508 00:17:26,072 --> 00:17:30,040 and now I have to do all these manual things all over again 509 00:17:30,040 --> 00:17:32,086 or I already have a bunch of tests 510 00:17:32,086 --> 00:17:34,031 and I can just rerun them automatically 511 00:17:34,031 --> 00:17:35,056 and, if some of them fail, I can say 512 00:17:35,056 --> 00:17:36,087 “Oh, I didn’t fix the whole thing 513 00:17:36,087 --> 00:17:38,040 No problem, I’ll just go back!” 514 00:17:38,040 --> 00:17:39,096 So, the bottom line is that 515 00:17:39,096 --> 00:17:41,095 you know, you could do it on the left side 516 00:17:41,095 --> 00:17:45,004 but you’re using the same techniques in both cases 517 00:17:45,004 --> 00:17:48,062 The only difference is, in one case you’re doing it manually 518 00:17:48,062 --> 00:17:50,004 which is boring and error-prone 519 00:17:50,004 --> 00:17:51,078 In the other case you’re doing a little more work 520 00:17:51,078 --> 00:17:53,095 but you can make it automatic and repeatable 521 00:17:53,095 --> 00:17:55,071 and have, you know, some high confidence 522 00:17:55,071 --> 00:17:57,003 that as you change things in your code 523 00:17:57,003 --> 00:17:58,092 you are not breaking stuff that used to work 524 00:17:58,092 --> 00:18:00,091 and basically it’s more productive 525 00:18:00,091 --> 00:18:02,047 So you’re doing all the same things 526 00:18:02,047 --> 00:18:04,037 but with a, kind of, “delta” extra work 527 00:18:04,037 --> 00:18:07,086 you are using your effort with much higher leverage 528 00:18:07,086 --> 00:18:10,036 So that’s kind of my view of why TDD is a good thing 529 00:18:10,036 --> 00:18:11,088 Really, it doesn’t require new skills 530 00:18:11,088 --> 00:18:15,011 It just requires 
[you] to refactor your existing skills 531 00:18:15,011 --> 00:18:18,014 I also tried when I—again, honest confessions, right?— 532 00:18:18,014 --> 00:18:19,034 when I started doing this it was like 533 00:18:19,034 --> 00:18:21,049 “Ok, I’m gonna be teaching a course on Rails 534 00:18:21,049 --> 00:18:22,065 I should really focus on testing” 535 00:18:22,065 --> 00:18:24,032 So I went back to some code I had written 536 00:18:24,032 --> 00:18:26,087 that was working—you know, that was decent code— 537 00:18:26,087 --> 00:18:29,006 and I started trying to write tests for it 538 00:18:29,006 --> 00:18:31,019 and it was *so painful* 539 00:18:31,019 --> 00:18:33,033 because the code wasn’t written in a way that was testable 540 00:18:33,033 --> 00:18:34,097 There were all kinds of interactions 541 00:18:34,097 --> 00:18:36,038 There were, like, nested conditionals 542 00:18:36,038 --> 00:18:38,083 And if you wanted to isolate a particular statement 543 00:18:38,083 --> 00:18:41,070 and have a test trigger just that statement 544 00:18:41,070 --> 00:18:44,000 the amount of stuff you’d have to set up in your test 545 00:18:44,000 --> 00:18:45,009 to have it happen— 546 00:18:45,009 --> 00:18:46,040 remember when we talked about mock train wrecks— 547 00:18:46,040 --> 00:18:48,014 you have to set up all this infrastructure 548 00:18:48,014 --> 00:18:49,063 just to get to one line of code 549 00:18:49,063 --> 00:18:51,015 and you do that and you go 550 00:18:51,015 --> 00:18:52,074 “Gawd, testing is really not worth it! 
551 00:18:52,074 --> 00:18:54,034 I wrote 20 lines of setup 552 00:18:54,034 --> 00:18:56,059 so that I could test two lines in my function!” 553 00:18:56,059 --> 00:18:58,085 What that’s really telling you—as I now realize— 554 00:18:58,085 --> 00:19:00,042 is your function is bad 555 00:19:00,042 --> 00:19:01,049 It’s a badly written function 556 00:19:01,049 --> 00:19:02,052 It’s not a testable function 557 00:19:02,052 --> 00:19:03,088 It’s got too many moving parts 558 00:19:03,088 --> 00:19:06,026 whose dependencies can’t be broken 559 00:19:06,026 --> 00:19:07,070 There are no seams in my function 560 00:19:07,070 --> 00:19:11,008 that allow me to individually test the different behaviours 561 00:19:11,008 --> 00:19:12,083 And once you start doing Test First Development 562 00:19:12,083 --> 00:19:15,043 because you have to write your tests in small chunks 563 00:19:15,043 --> 00:19:17,053 it kind of makes this problem go away 564 00:19:17,053 --> 99:59:59,999 So that’s been my epiphany
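
[Editor's sketch, not part of the lecture] The bug-fix loop described above (write a test that reproduces the bug, then make it green, instead of printing values and reading them out of the log) might look roughly like this. It is plain Ruby rather than RSpec so that it runs without any gems; `average_rating` and `expect_equal` are hypothetical names invented for illustration, and the "bug" is an assumed one, not something from the course code.

```ruby
# Hypothetical function under test (not from the lecture).
# The assumed buggy version divided by ratings.size unconditionally,
# so an empty list gave NaN (0.0 / 0) instead of a usable number.
def average_rating(ratings)
  return 0.0 if ratings.empty? # the fix, added only after the test below went red
  ratings.sum.to_f / ratings.size
end

# Tiny stand-in for an RSpec expectation: it fails loudly
# instead of printing a value that has to be found in the log.
def expect_equal(actual, expected)
  raise "expected #{expected.inspect}, got #{actual.inspect}" unless actual == expected
  true
end

# Step 1: the test that reproduces the bug (red before the fix, green after).
expect_equal(average_rating([]), 0.0)

# Step 2: behaviour that already worked is rechecked automatically on every run.
expect_equal(average_rating([4, 5]), 4.5)
```

In a real Rails project the two checks would more likely be RSpec examples; the point of the sketch is only that the check is written once and rerun for free.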