1
00:00:00,057 --> 00:00:01,092
So we spent a bunch of time
2
00:00:01,092 --> 00:00:03,032
in the last couple of lectures
3
00:00:03,032 --> 00:00:05,082
talking about different kinds of testing
4
00:00:05,082 --> 00:00:08,021
about unit testing versus integration testing
5
00:00:08,021 --> 00:00:10,010
We talked about how do you use RSpec
6
00:00:10,010 --> 00:00:12,049
to really isolate the parts of your code you want to test
7
00:00:12,049 --> 00:00:14,090
you’ve also, you know, because of homework 3,
8
00:00:14,090 --> 00:00:18,017
and other stuff, we have been doing BDD,
9
00:00:18,017 --> 00:00:20,062
where we’ve been using Cucumber to turn user stories
10
00:00:20,062 --> 00:00:22,095
into, essentially, integration and acceptance tests
11
00:00:22,095 --> 00:00:25,061
So you’ve seen testing in a couple of different levels
12
00:00:25,061 --> 00:00:27,063
and the goal here is sort of to make a few remarks
13
00:00:27,063 --> 00:00:29,092
to, you know, let’s back up a little bit
14
00:00:29,092 --> 00:00:33,001
and see the big picture, and tie those things together
15
00:00:33,001 --> 00:00:34,095
So this sort of spans material
16
00:00:34,095 --> 00:00:37,000
that covers three or four sections in the book
17
00:00:37,000 --> 00:00:39,061
and I want to just hit the high points in lecture
18
00:00:39,061 --> 00:00:41,046
So a question that comes up
19
00:00:41,046 --> 00:00:43,025
I’m sure it’s come up for all of you
20
00:00:43,025 --> 00:00:44,052
as you have been doing homework
21
00:00:44,052 --> 00:00:45,069
is: “How much testing is enough?”
22
00:00:45,069 --> 00:00:48,049
And, sadly, for a long time
23
00:00:48,049 --> 00:00:51,009
kind of if you asked this question in industry
24
00:00:51,009 --> 00:00:52,017
the answer was basically
25
00:00:52,017 --> 00:00:53,017
“Well, we have a shipping deadline,
26
00:00:53,017 --> 00:00:54,099
so however much testing we can do
27
00:00:54,099 --> 00:00:56,066
before that deadline, that’s how much.”
28
00:00:56,066 --> 00:00:58,015
That’s what you have time for.
29
00:00:58,015 --> 00:01:00,002
So, you know, that’s a little flip
30
00:01:00,002 --> 00:01:01,011
obviously not very good
31
00:01:01,011 --> 00:01:02,054
So you can do a bit better, right?
32
00:01:02,054 --> 00:01:03,070
There’re some static measures
33
00:01:03,070 --> 00:01:06,003
like how many lines of code does your app have
34
00:01:06,003 --> 00:01:08,021
and how many lines of tests do you have?
35
00:01:08,021 --> 00:01:10,029
And it’s not unusual in industry
36
00:01:10,029 --> 00:01:12,068
in a well-tested piece of software
37
00:01:12,068 --> 00:01:14,057
for the number of lines of tests
38
00:01:14,057 --> 00:01:17,073
to go far beyond the number of lines of code
39
00:01:17,073 --> 00:01:19,075
So, integer multiples are not unusual
40
00:01:19,075 --> 00:01:21,084
And I think even for sort of, you know,
41
00:01:21,084 --> 00:01:23,022
research code or classwork
42
00:01:23,022 --> 00:01:26,085
a ratio of, you know, maybe 1.5 is not unreasonable
43
00:01:26,085 --> 00:01:30,005
so one and a half times the amount of test code
44
00:01:30,005 --> 00:01:32,024
as you have application code
45
00:01:32,024 --> 00:01:34,022
And in a lot of production systems
46
00:01:34,022 --> 00:01:35,027
where they really care about testing
47
00:01:35,027 --> 00:01:36,091
it is much higher than that
48
00:01:36,091 --> 00:01:38,015
So maybe a better question to ask:
49
00:01:38,015 --> 00:01:39,047
Rather than saying “How much testing is enough?”
50
00:01:39,047 --> 00:01:42,049
is to ask “How good is the testing I am doing now?
51
00:01:42,049 --> 00:01:44,035
How thorough is it?”
52
00:01:44,035 --> 00:01:45,056
Later in this semester
53
00:01:45,056 --> 00:01:46,056
Professor Sen will talk
54
00:01:46,056 --> 00:01:48,018
a little bit about formal methods
55
00:01:48,018 --> 00:01:50,085
and sort of what’s at the frontiers of testing and debugging
56
00:01:50,085 --> 00:01:52,068
But a couple of things that we can talk about
57
00:01:52,068 --> 00:01:54,007
based on what you already know
58
00:01:54,007 --> 00:01:57,074
is some basic concepts about test coverage
59
00:01:57,074 --> 00:01:59,054
And although I would say
60
00:01:59,054 --> 00:02:01,001
you know, we’ve been saying all along
61
00:02:01,001 --> 00:02:03,003
formal methods, they don’t really work on big systems
62
00:02:03,003 --> 00:02:05,033
I think that statement, in my personal opinion
63
00:02:05,033 --> 00:02:07,001
is actually a lot less true than it used to be
64
00:02:07,001 --> 00:02:09,019
I think there are a number of specific places
65
00:02:09,019 --> 00:02:10,052
especially in testing and debugging
66
00:02:10,052 --> 00:02:12,084
where formal methods are actually making fast progress
67
00:02:12,084 --> 00:02:15,075
and Koushik Sen is one of the leaders in that
68
00:02:15,075 --> 00:02:17,094
So you’ll have the opportunity to hear more about that later
69
00:02:17,094 --> 00:02:21,043
but for the moment I think, kind of bread and butter
70
00:02:21,043 --> 00:02:22,073
is let’s talk about coverage measurement
71
00:02:22,073 --> 00:02:24,047
because this is where the rubber meets the road
72
00:02:24,047 --> 00:02:26,020
in terms of how you’d be evaluated
73
00:02:26,020 --> 00:02:28,063
if you are doing this for real
74
00:02:28,063 --> 00:02:29,052
So what’s some basics?
75
00:02:29,052 --> 00:02:30,078
Here’s a really simple class you can use
76
00:02:30,078 --> 00:02:32,090
to talk about different ways to measure
77
00:02:32,090 --> 00:02:34,080
how our tests cover this code
78
00:02:34,080 --> 00:02:36,063
And there’re a few different levels
79
00:02:36,063 --> 00:02:37,085
with different terminologies
80
00:02:37,085 --> 00:02:40,073
It’s not really universal across all software houses
81
00:02:40,073 --> 00:02:42,064
But one common set of terminology
82
00:02:42,064 --> 00:02:43,064
that the book exposes
83
00:02:43,064 --> 00:02:44,068
is we could talk about S0
84
00:02:44,068 --> 00:02:47,045
where we’d just mean you’ve called every method once
85
00:02:47,045 --> 00:02:50,045
So you know, if you call foo, and you call bar, you’re done
86
00:02:50,045 --> 00:02:52,015
That’s S0 coverage: not terribly thorough
87
00:02:52,015 --> 00:02:54,068
A little more stringent, S1, is
88
00:02:54,068 --> 00:02:56,013
you could say, we’re calling every method
89
00:02:56,013 --> 00:02:57,028
from every place that it could be called
90
00:02:57,028 --> 00:02:58,082
So what does that mean?
91
00:02:58,082 --> 00:03:00,007
It means, for example
92
00:03:00,007 --> 00:03:01,012
it’s not enough to call bar
93
00:03:01,012 --> 00:03:02,095
You have to make sure that you call it
94
00:03:02,095 --> 00:03:05,057
at least once from in here
95
00:03:05,057 --> 00:03:07,016
as well as calling it once
96
00:03:07,016 --> 00:03:10,037
from any exterior function that might call it
97
00:03:10,037 --> 00:03:12,081
C0 which is what SimpleCov measures
98
00:03:12,081 --> 00:03:15,099
(those of you who’ve gotten SimpleCov up and running)
99
00:03:15,099 --> 00:03:18,052
basically says you’ve executed every statement
100
00:03:18,052 --> 00:03:20,004
you’ve touched every statement in your code once
101
00:03:20,004 --> 00:03:22,048
But the caveat there is that
102
00:03:22,048 --> 00:03:25,058
conditionals really just count as a single statement
103
00:03:25,058 --> 00:03:28,091
So, if you, no matter which branch of this “if” you took
104
00:03:28,091 --> 00:03:31,074
as long as you touched one or the other branch
105
00:03:31,074 --> 00:03:33,035
you’ve executed the “if” statement
106
00:03:33,035 --> 00:03:35,066
So even C0 is still, you know, sort of superficial coverage
107
00:03:35,066 --> 00:03:37,026
But, as we will see
108
00:03:37,026 --> 00:03:39,023
the way that you will want to read this information is:
109
00:03:39,023 --> 00:03:41,079
if you are getting bad coverage at the C0 level
110
00:03:41,079 --> 00:03:44,007
then you have really really bad coverage
111
00:03:44,007 --> 00:03:46,008
So if you are not kind of making
112
00:03:46,008 --> 00:03:47,037
this simple level of superficial coverage
113
00:03:47,037 --> 00:03:50,002
then your testing is probably deficient
114
00:03:50,002 --> 00:03:51,091
C1 is the next step up from that
115
00:03:51,091 --> 00:03:53,071
We could say:
116
00:03:53,071 --> 00:03:55,019
Well, we have to take every branch in both directions
117
00:03:55,019 --> 00:03:56,061
So, when we are doing this “if” statement
118
00:03:56,061 --> 00:03:58,066
we have to make sure that
119
00:03:58,066 --> 00:03:59,092
we do the “if x” part once
120
00:03:59,092 --> 00:04:05,013
and the “if not x” part at least once to meet C1
121
00:04:05,013 --> 00:04:08,036
You can augment that with decision coverage
122
00:04:08,036 --> 00:04:09,063
saying: Well, if we’re gonna…
123
00:04:09,063 --> 00:04:12,036
If we have “if” statements where the condition
124
00:04:12,036 --> 00:04:13,088
is made up of multiple terms
125
00:04:13,088 --> 00:04:15,071
we have to make sure that every subexpression
126
00:04:15,071 --> 00:04:17,097
has been evaluated in both directions
127
00:04:17,097 --> 00:04:19,067
In other words, that means that
128
00:04:19,067 --> 00:04:22,041
if we’re going to fail this “if” statement
129
00:04:22,041 --> 00:04:24,034
we have to make sure to fail it at least once
130
00:04:24,034 --> 00:04:26,044
because y was false and at least once because z was false
131
00:04:26,044 --> 00:04:28,088
In other words, any subexpression that could
132
00:04:28,088 --> 00:04:31,021
independently change the outcome of the condition
133
00:04:31,021 --> 00:04:34,048
has to be exercised in both directions
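[Editor's sketch] The class on the slide is not captured in the transcript, so the Ruby below is a hypothetical stand-in, assuming two methods named foo and bar and a condition built from y and z as the lecture describes. The names MyClass, w, and w_after_foo are assumptions. It lists which calls would be needed to reach each coverage level discussed so far.

```ruby
# Hypothetical stand-in for the slide's class (all names are assumptions).
class MyClass
  attr_reader :w

  def foo(x, y, z)
    if x
      bar(0) if y && z   # inner condition made up of two terms
    else
      bar(1)
    end
  end

  def bar(value)
    @w = value
  end
end

# Helper: run foo on a fresh object and report what bar stored (nil if bar never ran).
def w_after_foo(x, y, z)
  obj = MyClass.new
  obj.foo(x, y, z)
  obj.w
end

# S0: every method called once. This single call runs foo, which calls bar.
s0 = w_after_foo(true, true, true)

# C0: every statement executed. We also need the else branch so bar(1) runs.
c0 = [s0, w_after_foo(false, true, true)]

# C1: every branch in both directions. The inner condition must also fail once.
c1 = c0 + [w_after_foo(true, false, true)]

# Decision coverage: the inner condition must fail once because of y (above)
# and once because of z.
decision = c1 + [w_after_foo(true, true, false)]

p decision
```

S1 would additionally require calling bar directly from outside the class, for example MyClass.new.bar(5), on top of the two call sites inside foo.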
134
00:04:34,048 --> 00:04:36,003
And then,
135
00:04:36,003 --> 00:04:38,052
kind of, the one that, you know, a lot of people aspire to
136
00:04:38,052 --> 00:04:41,026
but there is disagreement on how much more valuable it is
137
00:04:41,026 --> 00:04:42,083
is C2, where you take every path through the code
138
00:04:42,083 --> 00:04:45,053
Obviously, this is kind of difficult because
139
00:04:45,053 --> 00:04:48,033
it tends to be exponential in the number of conditions
140
00:04:48,033 --> 00:04:53,008
And in general it’s difficult
141
00:04:53,008 --> 00:04:55,031
to evaluate if you’ve taken every path through the code
142
00:04:55,031 --> 00:04:57,001
There are formal techniques that you can use
143
00:04:57,001 --> 00:04:58,083
to tell you where the holes are
144
00:04:58,083 --> 00:05:01,031
but the bottom line is that
145
00:05:01,031 --> 00:05:03,004
in most commercial software houses
146
00:05:03,004 --> 00:05:04,089
there is, I would say, not complete consensus
147
00:05:04,089 --> 00:05:06,070
on how much more valuable C2 is
148
00:05:06,070 --> 00:05:08,068
compared to C0 or C1
149
00:05:08,068 --> 00:05:10,013
So, I think, for the purpose of our class
150
00:05:10,013 --> 00:05:11,067
you get exposed to the idea
151
00:05:11,067 --> 00:05:13,020
of how you use coverage information
152
00:05:13,020 --> 00:05:16,040
SimpleCov takes advantage of some built-in Ruby features
153
00:05:16,040 --> 00:05:18,009
to give you C0 coverage
154
00:05:18,009 --> 00:05:19,062
[It] does really nice reports
155
00:05:19,062 --> 00:05:21,025
We can sort of see it
156
00:05:21,025 --> 00:05:22,096
at the level of individual lines in your file
157
00:05:22,096 --> 00:05:24,091
You can see what your coverage is
158
00:05:24,091 --> 00:05:27,015
and I think that’s kind of a, you know
159
00:05:27,015 --> 00:05:31,018
a good start for where we are
160
00:05:31,018 --> 00:05:33,076
So, having seen different flavours of tests
161
00:05:33,076 --> 00:05:37,020
Stepping back and looking back at the big picture
162
00:05:37,020 --> 00:05:38,098
what are the different kinds of tests
163
00:05:38,098 --> 00:05:40,078
that we’ve seen concretely?
164
00:05:40,078 --> 00:05:42,032
and what are the tradeoffs
165
00:05:42,032 --> 00:05:43,089
between using those different kinds of tests?
166
00:05:43,089 --> 00:05:47,016
So we’ve seen at the level of individual classes or methods
167
00:05:47,016 --> 00:05:50,009
we use RSpec, with extensive use of mocking and stubbing
168
00:05:50,009 --> 00:05:53,004
So, for example when we test methods in the model
169
00:05:53,004 --> 00:05:55,057
that will be an example of unit testing
170
00:05:55,057 --> 00:05:59,025
We also did something that is pretty similar to
171
00:05:59,025 --> 00:06:00,097
functional or module testing
172
00:06:00,097 --> 00:06:02,071
where there is more than one module participating
173
00:06:02,071 --> 00:06:04,065
So, for example when we did controller specs
174
00:06:04,065 --> 00:06:07,085
we saw that—we simulate a POST action
175
00:06:07,085 --> 00:06:09,029
but remember that the POST action
176
00:06:09,029 --> 00:06:10,086
has to go through the routing subsystem
177
00:06:10,086 --> 00:06:12,042
before it gets to the controller
178
00:06:12,042 --> 00:06:14,048
Once the controller is done it will try to render a view
179
00:06:14,048 --> 00:06:16,007
So in fact there’s other pieces
180
00:06:16,007 --> 00:06:17,067
that collaborate with the controller
181
00:06:17,067 --> 00:06:19,099
that have to be working in order for controller specs to pass
182
00:06:19,099 --> 00:06:21,051
So that’s somewhere in between:
183
00:06:21,051 --> 00:06:23,035
where we’re doing more than a single method
184
00:06:23,035 --> 00:06:25,000
touching more than a single class
185
00:06:25,000 --> 00:06:27,000
but we’re still concentrating [our] attention
186
00:06:27,000 --> 00:06:28,088
on a fairly narrow slice of the system at a time
187
00:06:28,088 --> 00:06:31,044
and we’re still using mocking and stubbing extensively
188
00:06:31,044 --> 00:06:35,030
to sort of isolate that behaviour that we want to test
189
00:06:35,030 --> 00:06:36,091
And then at the level of Cucumber scenarios
190
00:06:36,091 --> 00:06:38,047
these are more like integration or system tests
191
00:06:38,047 --> 00:06:41,069
They exercise complete paths throughout the application
192
00:06:41,069 --> 00:06:43,044
They probably touch a lot of different modules
193
00:06:43,044 --> 00:06:46,003
They make minimal use of mocks and stubs
194
00:06:46,003 --> 00:06:48,032
because part of the goal of an integration test
195
00:06:48,032 --> 00:06:50,099
is exactly to test the interaction between pieces
196
00:06:50,099 --> 00:06:53,021
So you don’t want to stub or control those interactions
197
00:06:53,021 --> 00:06:54,080
You actually want to let the system do
198
00:06:54,080 --> 00:06:56,030
what it would really do
199
00:06:56,030 --> 00:06:58,025
if this was a scenario happening in production
200
00:06:58,025 --> 00:07:00,069
So how would we compare these different kinds of tests?
201
00:07:00,069 --> 00:07:02,038
There’s a few different axes we can look at
202
00:07:02,038 --> 00:07:05,007
One of them is how long they take to run
203
00:07:05,007 --> 00:07:06,090
Now, both RSpec and Cucumber
204
00:07:06,090 --> 00:07:09,013
have, kind of, high startup times and stuff like that
205
00:07:09,013 --> 00:07:10,008
But, as you’ll see
206
00:07:10,008 --> 00:07:11,090
as you start adding more and more RSpec tests
207
00:07:11,090 --> 00:07:14,038
and using autotest to run them in the background
208
00:07:14,038 --> 00:07:17,088
by and large, once RSpec kind of gets off the launching pad
209
00:07:17,088 --> 00:07:19,092
it runs specs really fast
210
00:07:19,092 --> 00:07:21,095
whereas running Cucumber features just takes a long time
211
00:07:21,095 --> 00:07:24,059
as it essentially fires up your entire application
212
00:07:24,059 --> 00:07:26,010
And later in this semester
213
00:07:26,010 --> 00:07:28,086
we’ll see a way to make Cucumber even slower—
214
00:07:28,086 --> 00:07:30,070
which is to have it fire up an entire browser
215
00:07:30,070 --> 00:07:33,045
basically act like a puppet, remote-controlling Firefox
216
00:07:33,045 --> 00:07:35,083
so you can test JavaScript code
217
00:07:35,083 --> 00:07:37,000
We’ll do that when we actually—
218
00:07:37,000 --> 00:07:40,032
I think we’ll be able to work with our friends at Sauce Labs
219
00:07:40,032 --> 00:07:42,080
so you can do that in the cloud—That will be exciting
220
00:07:42,080 --> 00:07:45,083
So, “run fast” versus “run slow”
221
00:07:45,083 --> 00:07:46,068
Resolution:
222
00:07:46,068 --> 00:07:48,025
If an error happens in your unit tests
223
00:07:48,025 --> 00:07:49,075
it’s usually pretty easy
224
00:07:49,075 --> 00:07:52,029
to figure out and track down what the source of that error is
225
00:07:52,029 --> 00:07:53,071
because the tests are so isolated
226
00:07:53,071 --> 00:07:56,025
You’ve stubbed out everything that doesn’t matter
227
00:07:56,025 --> 00:07:58,025
and you’re focusing on only the behaviour of interest
228
00:07:58,025 --> 00:07:59,076
So, if you’ve done a good job of doing that
229
00:07:59,076 --> 00:08:01,097
when something goes wrong in one of your tests
230
00:08:01,097 --> 00:08:03,045
there’s not a lot of places
231
00:08:03,045 --> 00:08:04,088
that something could have gone wrong
232
00:08:04,088 --> 00:08:07,041
In contrast, if you’re running a Cucumber scenario
233
00:08:07,041 --> 00:08:08,089
that’s got, you know, 10 steps
234
00:08:08,089 --> 00:08:10,031
and every step is touching
235
00:08:10,031 --> 00:08:11,061
a whole bunch of pieces of the app
236
00:08:11,061 --> 00:08:12,091
it could take a long time
237
00:08:12,091 --> 00:08:14,076
to actually get to the bottom of a bug
238
00:08:14,076 --> 00:08:16,014
So it is kind of a tradeoff
239
00:08:16,014 --> 00:08:17,054
in how well you can localize errors
240
00:08:17,054 --> 00:08:20,065
Coverage:
241
00:08:20,065 --> 00:08:23,002
It’s possible if you write a good suite
242
00:08:23,002 --> 00:08:24,072
of unit and functional tests
243
00:08:24,072 --> 00:08:26,020
you can get really high coverage
244
00:08:26,020 --> 00:08:27,085
You can run your SimpleCov report
245
00:08:27,085 --> 00:08:30,080
and you can actually identify specific lines in your files
246
00:08:30,080 --> 00:08:32,036
that have not been exercised by any test
247
00:08:32,036 --> 00:08:34,016
and then you can go write tests that cover them
248
00:08:34,016 --> 00:08:36,014
So, figuring out how to improve your coverage
249
00:08:36,014 --> 00:08:37,057
for example at the C0 level
250
00:08:37,057 --> 00:08:40,021
is something much more easily done with unit tests
251
00:08:40,021 --> 00:08:42,018
whereas, with a Cucumber test—
252
00:08:42,018 --> 00:08:43,078
with a Cucumber scenario—
253
00:08:43,078 --> 00:08:45,076
you are touching a lot of parts of the code
254
00:08:45,076 --> 00:08:47,080
but you are doing it very sparsely
255
00:08:47,080 --> 00:08:49,038
So, if your goal is to get your coverage up
256
00:08:49,038 --> 00:08:51,031
use the tools that are at the unit level
257
00:08:51,031 --> 00:08:53,007
so that you can focus on understanding
258
00:08:53,007 --> 00:08:54,074
what parts of your code are undertested
259
00:08:54,074 --> 00:08:56,055
and then you can write very targeted tests
260
00:08:56,055 --> 00:08:58,086
just to focus on them
261
00:08:58,086 --> 00:09:01,043
And, sort of, you know, putting those pieces together
262
00:09:01,043 --> 00:09:03,039
the unit tests
263
00:09:03,039 --> 00:09:05,059
because of their isolation and their fine resolution
264
00:09:05,059 --> 00:09:07,039
tend to use a lot of mocks
265
00:09:07,039 --> 00:09:09,012
to isolate away the behaviours you don’t care about
266
00:09:09,012 --> 00:09:11,020
But that means that, by definition
267
00:09:11,020 --> 00:09:12,070
you’re not testing the interfaces
268
00:09:12,070 --> 00:09:14,099
and it’s sort of a “received wisdom” in software
269
00:09:14,099 --> 00:09:16,069
that a lot of the interesting bugs
270
00:09:16,069 --> 00:09:18,076
occur at the interfaces between pieces
271
00:09:18,076 --> 00:09:20,078
and not sort of within a class or within a method—
272
00:09:20,078 --> 00:09:22,040
those are sort of the easy bugs to track down
273
00:09:22,040 --> 00:09:24,026
And at the other extreme
274
00:09:24,026 --> 00:09:26,081
the more you get towards the integration testing extreme
275
00:09:26,081 --> 00:09:29,072
you’re supposed to rely less and less on mocks
276
00:09:29,072 --> 00:09:30,090
for that exact reason
277
00:09:30,090 --> 00:09:32,066
Now we saw, if you’re testing something like
278
00:09:32,066 --> 00:09:34,015
say, in a service-oriented architecture
279
00:09:34,015 --> 00:09:35,089
where you have to interact with the remote site
280
00:09:35,089 --> 00:09:37,028
you still end up
281
00:09:37,028 --> 00:09:38,094
having to do a fair amount of mocking and stubbing
282
00:09:38,094 --> 00:09:40,028
so that you don’t rely on the Internet
283
00:09:40,028 --> 00:09:41,067
in order for your tests to pass
284
00:09:41,067 --> 00:09:43,006
but, generally speaking
285
00:09:43,006 --> 00:09:47,014
you’re trying to remove as many of the mocks as you can
286
00:09:47,014 --> 00:09:48,095
and let the system run the way it would run in real life
287
00:09:48,095 --> 00:09:52,070
So, the good news is you are testing the interfaces
288
00:09:52,070 --> 00:09:54,074
but when something goes wrong in one of the interfaces
289
00:09:54,074 --> 00:09:57,053
because your resolution is not as good
290
00:09:57,053 --> 00:10:00,031
it may take longer to figure out what it is
291
00:10:00,031 --> 00:10:05,019
So, what’s sort of the high-order bit from this tradeoff
292
00:10:05,019 --> 00:10:07,024
is you don’t really want to rely
293
00:10:07,024 --> 00:10:08,076
too heavily on any one kind of test
294
00:10:08,076 --> 00:10:10,078
They serve different purposes and, depending on
295
00:10:10,078 --> 00:10:13,043
are you trying to exercise your interfaces more
296
00:10:13,043 --> 00:10:15,089
or are you trying to improve your fine-grain coverage
297
00:10:15,089 --> 00:10:18,003
that affects how you develop your test suite
298
00:10:18,003 --> 00:10:20,065
and you’ll evolve it along with your software
299
00:10:20,065 --> 00:10:24,014
So, we’ve used a certain set of terminology in testing
300
00:10:24,014 --> 00:10:26,028
It’s the terminology that, by and large
301
00:10:26,028 --> 00:10:29,001
is most commonly used in the Rails community
302
00:10:29,001 --> 00:10:30,060
but there’s some variation
303
00:10:30,060 --> 00:10:33,069
[and] some other terms that you might hear
304
00:10:33,069 --> 00:10:35,018
if you go get a job somewhere
305
00:10:35,018 --> 00:10:36,093
and you hear about mutation testing
306
00:10:36,093 --> 00:10:38,072
which we haven’t done
307
00:10:38,072 --> 00:10:40,024
This is an interesting idea that was, I think, popularized by
308
00:10:40,024 --> 00:10:43,037
Ammann and Offutt, who have, sort of
309
00:10:43,037 --> 00:10:44,093
the definitive book on software testing
310
00:10:44,093 --> 00:10:46,048
The idea is:
311
00:10:46,048 --> 00:10:48,000
Suppose I introduced a deliberate bug into my code
312
00:10:48,000 --> 00:10:49,051
does that force some test to fail?
313
00:10:49,051 --> 00:10:53,003
Because, if I changed, you know, “if x” to “if not x”
314
00:10:53,003 --> 00:10:56,010
and no tests fail, then either I’m missing some coverage
315
00:10:56,010 --> 00:10:59,019
or my app is very strange and somehow nondeterministic
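[Editor's sketch] A minimal hand-rolled illustration of the mutation testing idea, assuming a test suite can be modelled as a lambda that returns true when all its checks pass. The names original, mutant, weak_suite, strong_suite, and mutant_killed? are hypothetical. Real mutation testing tools generate the mutants automatically; here the single mutant is written by hand.

```ruby
# The code under test, and a mutant with "if x" flipped to "if not x".
original = ->(x) { x ? :yes : :no }
mutant   = ->(x) { !x ? :yes : :no }

# A suite takes an implementation and returns true if every check passes.
# The weak suite only checks the type of the result.
weak_suite = ->(f) { f.call(true).is_a?(Symbol) }

# The strong suite checks the actual value in both directions.
strong_suite = ->(f) { f.call(true) == :yes && f.call(false) == :no }

# A mutant is "killed" when the suite fails on it.
def mutant_killed?(suite, mutant)
  !suite.call(mutant)
end

puts "weak suite kills mutant:   #{mutant_killed?(weak_suite, mutant)}"
puts "strong suite kills mutant: #{mutant_killed?(strong_suite, mutant)}"
```

A surviving mutant, as with the weak suite here, is the signal that some behaviour is executed but never actually checked.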
316
00:10:59,019 --> 00:11:03,099
Fuzz testing, which Koushik Sen may talk more about
317
00:11:03,099 --> 00:11:07,085
basically, this is the “10,000 monkeys at typewriters
318
00:11:07,085 --> 00:11:09,024
throwing random input at your code”
319
00:11:09,024 --> 00:11:10,037
What’s interesting about it is that
320
00:11:10,037 --> 00:11:11,065
those tests we’ve been doing
321
00:11:11,065 --> 00:11:13,086
essentially are crafted to test the app
322
00:11:13,086 --> 00:11:15,058
the way it was designed
323
00:11:15,058 --> 00:11:16,088
and these, you know, fuzz testing
324
00:11:16,088 --> 00:11:19,064
is about testing the app in ways it wasn’t meant to be used
325
00:11:19,064 --> 00:11:22,098
So, what happens if you throw enormous form submissions at it?
326
00:11:22,098 --> 00:11:25,036
What happens if you put control characters in your forms?
327
00:11:25,036 --> 00:11:27,062
What happens if you submit the same thing over and over?
328
00:11:27,062 --> 00:11:29,093
And, Koushik has a statistic that
329
00:11:29,093 --> 00:11:32,033
Microsoft finds up to 20% of their bugs
330
00:11:32,033 --> 00:11:34,064
using some variation of fuzz testing
331
00:11:34,064 --> 00:11:36,029
and that about 25%
332
00:11:36,029 --> 00:11:39,021
of the common Unix command-line programs
333
00:11:39,021 --> 00:11:40,092
can be made to crash
334
00:11:40,092 --> 00:11:44,018
[when] put through aggressive fuzz testing
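[Editor's sketch] A very small random-input fuzzer, assuming the code under test is a hypothetical form-field parser named parse_quantity. The fuzz helper, its keyword arguments, and the failure rule (any exception or any unexpected return type counts as a failure) are all assumptions for illustration. Industrial fuzzers are far more sophisticated about generating inputs.

```ruby
# Hypothetical function under test: parse a quantity typed into a form field.
# It is meant to return an Integer, or nil for anything it cannot parse.
def parse_quantity(text)
  Integer(text, 10)
rescue ArgumentError, TypeError
  nil
end

# Throw random byte strings (including control characters) at a block and
# collect every input that raised or returned something unexpected.
def fuzz(iterations:, seed:)
  rng = Random.new(seed)          # fixed seed so a failure can be replayed
  failures = []
  iterations.times do
    input = rng.bytes(rng.rand(0..64))
    begin
      result = yield(input)
      failures << input unless result.nil? || result.is_a?(Integer)
    rescue StandardError
      failures << input
    end
  end
  failures
end

failures = fuzz(iterations: 1_000, seed: 42) { |input| parse_quantity(input) }
puts "inputs that broke parse_quantity: #{failures.size}"
```

Recording the seed matters in practice: a crash found by random input is only useful if the same input can be generated again while debugging.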
335
00:11:44,018 --> 00:11:46,089
Define-use coverage is something that we haven’t done
336
00:11:46,089 --> 00:11:48,089
but it’s another interesting concept
337
00:11:48,089 --> 00:11:50,089
The idea is that at any point in my program
338
00:11:50,089 --> 00:11:52,062
there’s a place where I define—
339
00:11:52,062 --> 00:11:54,046
or I assign a value to some variable—
340
00:11:54,046 --> 00:11:56,000
and then there’s a place downstream
341
00:11:56,000 --> 00:11:57,075
where presumably I’m going to consume that value—
342
00:11:57,075 --> 00:11:59,058
someone’s going to use that value
343
00:11:59,058 --> 00:12:01,013
Have I covered every pair?
344
00:12:01,013 --> 00:12:02,059
In other words, do I have tests where every pair
345
00:12:02,059 --> 00:12:04,054
of defining a variable and using it somewhere
346
00:12:04,054 --> 00:12:07,014
is executed in some part of my test suite?
347
00:12:07,014 --> 00:12:10,071
It’s sometimes called DU-coverage
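[Editor's sketch] One way to picture DU pairs, assuming a hypothetical shipping_cost method (the name, the rates, and the weight threshold are invented for illustration). The variable rate has two definitions and two uses, so there are four definition-use pairs, and each call below exercises exactly one of them.

```ruby
# Hypothetical example with two definitions and two uses of `rate`.
def shipping_cost(weight, express)
  rate = 5                 # definition 1
  rate = 12 if express     # definition 2 (overrides definition 1)
  if weight > 10
    rate * 2               # use A
  else
    rate                   # use B
  end
end

# One call per definition-use pair:
du_pairs = {
  "def 1 -> use A" => shipping_cost(20, false),
  "def 1 -> use B" => shipping_cost(5,  false),
  "def 2 -> use A" => shipping_cost(20, true),
  "def 2 -> use B" => shipping_cost(5,  true)
}

du_pairs.each { |pair, cost| puts "#{pair}: #{cost}" }
```

Note that two calls, (20, true) and (5, false), would already execute every statement, so C0 can be complete while half of the DU pairs are still untested.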
348
00:12:10,071 --> 00:12:14,011
And other terms that I think are not as widely used anymore
349
00:12:14,011 --> 00:12:17,071
blackbox versus whitebox, or blackbox versus glassbox
350
00:12:17,071 --> 00:12:20,025
Roughly, a blackbox test is one that is written from
351
00:12:20,025 --> 00:12:22,041
the point of view of the external specification of the thing
352
00:12:22,041 --> 00:12:24,022
[For example:] “This is a hash table
353
00:12:24,022 --> 00:12:26,015
When I put in a key I should get back a value
354
00:12:26,015 --> 00:12:28,011
If I delete the key the value shouldn’t be there”
355
00:12:28,011 --> 00:12:29,099
That’s a blackbox test because it doesn’t say
356
00:12:29,099 --> 00:12:32,028
anything about how the hash table is implemented
357
00:12:32,028 --> 00:12:34,072
and it doesn’t try to stress the implementation
358
00:12:34,072 --> 00:12:36,056
A corresponding whitebox test might be:
359
00:12:36,056 --> 00:12:38,008
“I know something about the hash function
360
00:12:38,008 --> 00:12:39,098
and I’m going to deliberately create
361
00:12:39,098 --> 00:12:41,088
hash keys in my test cases
362
00:12:41,088 --> 00:12:43,078
that cause a lot of hash collisions
363
00:12:43,078 --> 00:12:45,095
to make sure that I’m testing that part of the functionality”
364
00:12:45,095 --> 00:12:49,007
Now, a C0 test coverage tool, like SimpleCov
365
00:12:49,007 --> 00:12:52,001
would reveal that if all you had is blackbox tests
366
00:12:52,001 --> 00:12:53,028
you might find that
367
00:12:53,028 --> 00:12:55,056
the collision-handling code wasn’t being hit very often
368
00:12:55,056 --> 00:12:56,075
And that might tip you off and say:
369
00:12:56,075 --> 00:12:58,028
“Ok, if I really want to strengthen that—
370
00:12:58,028 --> 00:13:00,008
for one, if I want to boost coverage for those tests
371
00:13:00,008 --> 00:13:02,006
now I have to write a whitebox or a glassbox test
372
00:13:02,006 --> 00:13:04,057
I have to look inside, see what the implementation does
373
00:13:04,057 --> 00:13:05,061
and find specific ways
374
00:13:05,061 --> 00:13:10,060
to try to break the implementation in evil ways”
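[Editor's sketch] The hash table example can be made concrete with a toy implementation. TinyHashTable and its deliberately weak hash function (a byte checksum modulo the bucket count) are assumptions chosen so that collisions are easy to predict; nothing here reflects how Ruby's own Hash works. The first block of checks is a blackbox test, the second a whitebox test.

```ruby
# Toy hash table with chaining; the hash function is intentionally simple.
class TinyHashTable
  attr_reader :buckets

  def initialize(size = 8)
    @buckets = Array.new(size) { [] }
  end

  def put(key, value)
    bucket = bucket_for(key)
    pair = bucket.find { |k, _| k == key }
    if pair
      pair[1] = value          # overwrite an existing key
    else
      bucket << [key, value]   # collision-handling path: append to the chain
    end
  end

  def get(key)
    pair = bucket_for(key).find { |k, _| k == key }
    pair && pair[1]
  end

  def delete(key)
    bucket_for(key).reject! { |k, _| k == key }
  end

  private

  # Byte checksum modulo bucket count: anagrams always collide.
  def bucket_for(key)
    @buckets[key.to_s.sum % @buckets.size]
  end
end

# Blackbox test: written only from the external specification.
table = TinyHashTable.new
table.put("apple", 1)
blackbox_ok = table.get("apple") == 1
table.delete("apple")
blackbox_ok &&= table.get("apple").nil?

# Whitebox test: uses knowledge of the hash function to force a collision.
table.put("ab", :first)
table.put("ba", :second)       # same checksum as "ab", so same bucket
whitebox_ok = table.get("ab") == :first &&
              table.get("ba") == :second &&
              table.buckets.map(&:size).max == 2

puts "blackbox: #{blackbox_ok}, whitebox: #{whitebox_ok}"
```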
375
00:13:10,060 --> 00:13:13,075
So, I think, testing is a kind of a way of life, right?
376
00:13:13,075 --> 00:13:16,069
We’ve gotten away from the phase of
377
00:13:16,069 --> 00:13:18,033
“We’d build the whole thing and then we’d test it”
378
00:13:18,033 --> 00:13:19,092
and we’ve gotten into the phase of
379
00:13:19,092 --> 00:13:20,077
“We’re testing as we go”
380
00:13:20,077 --> 00:13:22,048
Testing is really more like a development tool
381
00:13:22,048 --> 00:13:24,022
and like so many development tools
382
00:13:24,022 --> 00:13:25,062
the effectiveness of it depends
383
00:13:25,062 --> 00:13:27,013
on whether you’re using it in a tasteful manner
384
00:13:27,013 --> 00:13:31,002
So, you could say: “Well, let’s see—I kicked the tires
385
00:13:31,002 --> 00:13:33,048
You know, I fired up the browser, I tried a couple of things
386
00:13:33,048 --> 00:13:35,097
(claps hand) Looks like it works! Deploy it!”
387
00:13:35,097 --> 00:13:38,045
That’s obviously a little more cavalier than you’d want to be
388
00:13:38,045 --> 00:13:41,024
And, by the way, one of the things that we discovered
389
00:13:41,024 --> 00:13:43,077
with this online course just starting up
390
00:13:43,077 --> 00:13:45,090
when 60,000 people are enrolled in the course
391
00:13:45,090 --> 00:13:48,099
and 0.1% of those people have a problem
392
00:13:48,099 --> 00:13:50,083
you’d get 60 emails
393
00:13:50,083 --> 00:13:53,078
The corollary is: when your site is used by a lot of people
394
00:13:53,078 --> 00:13:55,089
some stupid bug that you didn’t find
395
00:13:55,089 --> 00:13:57,018
but that could have been found by testing
396
00:13:57,018 --> 00:13:59,080
could very quickly generate *a lot* of pain
397
00:13:59,080 --> 00:14:02,023
On the other hand, you don’t want to be dogmatic and say
398
00:14:02,023 --> 00:14:04,056
“Uh, until we have 100% coverage and every test is green
399
00:14:04,056 --> 00:14:06,005
we absolutely will not ship”
400
00:14:06,005 --> 00:14:07,012
That’s not healthy either
401
00:14:07,012 --> 00:14:08,048
And the test quality
402
00:14:08,048 --> 00:14:10,057
doesn’t necessarily correlate with statement coverage
403
00:14:10,057 --> 00:14:11,064
unless you can say something
404
00:14:11,064 --> 00:14:12,068
about the quality of your tests
405
00:14:12,068 --> 00:14:14,029
just because you’ve executed every line
406
00:14:14,029 --> 00:14:17,010
doesn’t mean that you’ve tested the interesting cases
407
00:14:17,010 --> 00:14:18,068
So, somewhere in between, you could say
408
00:14:18,068 --> 00:14:20,014
“Well, we’ll use coverage tools to identify
409
00:14:20,014 --> 00:14:23,004
undertested or poorly-tested parts of the code
410
00:14:23,004 --> 00:14:24,073
and we’ll use them as a guideline
411
00:14:24,073 --> 00:14:27,011
to sort of help improve our overall confidence level”
412
00:14:27,011 --> 00:14:29,024
But remember, Agile is about embracing change
413
00:14:29,024 --> 00:14:30,032
and dealing with it
414
00:14:30,032 --> 00:14:32,002
Part of change is things would change that will cause
415
00:14:32,002 --> 00:14:33,038
bugs that you didn’t foresee
416
00:14:33,038 --> 00:14:34,031
and the right reaction is:
417
00:14:34,031 --> 00:14:36,026
Be comfortable enough with the testing tools
418
00:14:36,026 --> 00:14:37,064
[so] that you can quickly find those bugs
419
00:14:37,064 --> 00:14:39,025
Write a test that reproduces that bug
420
00:14:39,025 --> 00:14:40,062
And then make the test green
421
00:14:40,062 --> 00:14:41,061
Then you’ve really fixed it
422
00:14:41,061 --> 00:14:43,004
That means, the way that you really fix a bug is
423
00:14:43,004 --> 00:14:45,049
if you created a test that correctly failed
424
00:14:45,049 --> 00:14:46,088
to reproduce that bug
425
00:14:46,088 --> 00:14:48,055
and then you went back and fixed the code
426
00:14:48,055 --> 00:14:49,057
to make those tests pass
427
00:14:49,057 --> 00:14:51,073
Similarly, you don’t want to say
428
00:14:51,073 --> 00:14:53,036
“Well, unit tests give you better coverage
429
00:14:53,036 --> 00:14:54,073
They’re more thorough and detailed
430
00:14:54,073 --> 00:14:56,044
So let’s focus all our energy on that”
431
00:14:56,044 --> 00:14:57,062
as opposed to
432
00:14:57,062 --> 00:14:58,093
“Oh, focus on integration tests
433
00:14:58,093 --> 00:15:00,006
because they’re more realistic, right?
434
00:15:00,006 --> 00:15:01,056
They reflect what the customer said they want
435
00:15:01,056 --> 00:15:03,034
So, if the integration tests are passing
436
00:15:03,034 --> 00:15:05,067
by definition we’re meeting a customer need”
437
00:15:05,067 --> 00:15:07,034
Again, both extremes are kind of unhealthy
438
00:15:07,034 --> 00:15:09,079
because each one of these can find problems
439
00:15:09,079 --> 00:15:11,031
that would be missed by the other
440
00:15:11,031 --> 00:15:12,060
So, having a good combination of them
441
00:15:12,060 --> 00:15:15,042
is kind of what it’s all about
442
00:15:15,042 --> 00:15:18,072
The last thing I want to leave you with is, I think
443
00:15:18,072 --> 00:15:20,036
in terms of testing, “TDD versus
444
00:15:20,036 --> 00:15:22,005
what I call conventional debugging—
445
00:15:22,005 --> 00:15:24,004
i.e., the way that we all kind of do it
446
00:15:24,004 --> 00:15:25,051
even though we say we don’t”
447
00:15:25,051 --> 00:15:26,064
and we’re all trying to get better, right?
448
00:15:26,064 --> 00:15:27,085
We’re all kind of in the gutter
449
00:15:27,085 --> 00:15:29,036
Some of us are looking up at the stars
450
00:15:29,036 --> 00:15:31,011
trying to improve our practices
451
00:15:31,011 --> 00:15:33,099
But, having now lived with this for 3 or 4 years myself
452
00:15:33,099 --> 00:15:35,091
and—I’ll be honest—3 years ago I didn’t do TDD
453
00:15:35,091 --> 00:15:37,079
I do it now, because I find that it’s better
454
00:15:37,079 --> 00:15:40,081
and here’s my distillation of why I think it works for me
455
00:15:40,081 --> 00:15:43,032
Sorry, the colours are a little weird
456
00:15:43,032 --> 00:15:45,000
but in the left column of the table
457
00:15:45,000 --> 00:15:46,034
[it] says “Conventional debugging”
458
00:15:46,034 --> 00:15:47,044
and the right side says “TDD”
459
00:15:47,044 --> 00:15:49,069
So what’s the way I used to write code?
460
00:15:49,069 --> 00:15:51,056
Maybe some of you still do this
461
00:15:51,056 --> 00:15:53,013
I write a whole bunch of lines
462
00:15:53,013 --> 00:15:54,043
maybe a few tens of lines of code
463
00:15:54,043 --> 00:15:55,059
I’m sure they’re right—
464
00:15:55,059 --> 00:15:56,061
I mean, I am a good programmer, right?
465
00:15:56,061 --> 00:15:57,099
This is not that hard
466
00:15:57,099 --> 00:15:59,002
I run it – It doesn’t work
467
00:15:59,002 --> 00:16:01,098
Ok, fire up the debugger – Start putting in printf’s
468
00:16:01,098 --> 00:16:04,088
If I’d been using TDD what would I do instead?
469
00:16:04,088 --> 00:16:08,022
Well I’d write a few lines of code, having written a test first
470
00:16:08,022 --> 00:16:10,071
So as soon as the test goes from red to green
471
00:16:10,071 --> 00:16:12,064
I know I wrote code that works—
472
00:16:12,064 --> 00:16:15,013
or at least the parts of the behaviour that I had in mind
473
00:16:15,013 --> 00:16:16,096
Those parts of the behaviour work, because I had a test
474
00:16:16,096 --> 00:16:19,056
Ok, back to conventional debugging:
475
00:16:19,056 --> 00:16:21,073
I’m running my program, trying to find the bugs
476
00:16:21,073 --> 00:16:23,028
I start putting in printf’s everywhere
477
00:16:23,028 --> 00:16:24,062
to print out the values of things
478
00:16:24,062 --> 00:16:25,064
which, by the way, is a lot of fun
479
00:16:25,064 --> 00:16:26,073
when you’re trying to read them
480
00:16:26,073 --> 00:16:28,014
out of the 500 lines of log output
481
00:16:28,014 --> 00:16:29,035
that you’d get in a Rails app
482
00:16:29,035 --> 00:16:30,087
trying to find your printf’s
483
00:16:30,087 --> 00:16:32,035
you know, “I know what I’ll do—
484
00:16:32,035 --> 00:16:34,008
I’ll put in 75 asterisks before and after
485
00:16:34,008 --> 00:16:36,043
That will make it readable” (laughter)
486
00:16:36,043 --> 00:16:38,071
Who doesn’t—Ok, raise your hands if you don’t do this!
487
00:16:38,071 --> 00:16:40,090
Thank you for your honesty. (laughter) Ok.
488
00:16:40,090 --> 00:16:43,014
Or— Or I could do the other thing, I could say:
489
00:16:43,014 --> 00:16:45,030
Instead of printing the value of a variable
490
00:16:45,030 --> 00:16:47,039
why don’t I write a test that inspects it
491
00:16:47,039 --> 00:16:48,079
in an expectation, such as “should”,
492
00:16:48,079 --> 00:16:50,090
and I’ll know immediately in bright red letters
493
00:16:50,090 --> 00:16:53,033
if that expectation wasn’t met
494
00:16:53,033 --> 00:16:56,005
Ok, I’m back on the conventional debugging side:
495
00:16:56,005 --> 00:16:58,090
I break out the big guns: I pull out the Ruby debugger
496
00:16:58,092 --> 00:17:02,044
I set a debug breakpoint, and I now start tweaking and say
497
00:17:02,044 --> 00:17:04,085
“Oh, let’s see, I have to get past that ‘if’ statement
498
00:17:04,085 --> 00:17:06,002
so I have to set that thing
499
00:17:06,002 --> 00:17:07,063
Oh, I have to call that method and so I need to…”
500
00:17:07,063 --> 00:17:08,065
No!
501
00:17:08,065 --> 00:17:10,087
I could instead—if I’m going to do that anyway—
502
00:17:10,087 --> 00:17:13,000
let’s just do it in a file, set up some mocks and stubs
503
00:17:13,000 --> 00:17:16,045
to control the code path, make it go the way I want
504
00:17:16,045 --> 00:17:19,013
And now, “Ok, for sure I’ve fixed it!
505
00:17:19,013 --> 00:17:22,012
I’ll get out of the debugger, run it all again!”
506
00:17:22,012 --> 00:17:24,022
And, of course, 9 times out of 10, you didn’t fix it
507
00:17:24,022 --> 00:17:26,072
or you kind of partly fixed it but you didn’t completely fix it
508
00:17:26,072 --> 00:17:30,040
and now I have to do all these manual things all over again
509
00:17:30,040 --> 00:17:32,086
or I already have a bunch of tests
510
00:17:32,086 --> 00:17:34,031
and I can just rerun them automatically
511
00:17:34,031 --> 00:17:35,056
and, if some of them fail, I can say
512
00:17:35,056 --> 00:17:36,087
“Oh, I didn’t fix the whole thing
513
00:17:36,087 --> 00:17:38,040
No problem, I’ll just go back!”
514
00:17:38,040 --> 00:17:39,096
So, the bottom line is that
515
00:17:39,096 --> 00:17:41,095
you know, you could do it on the left side
516
00:17:41,095 --> 00:17:45,004
but you’re using the same techniques in both cases
517
00:17:45,004 --> 00:17:48,062
The only difference is, in one case you’re doing it manually
518
00:17:48,062 --> 00:17:50,004
which is boring and error-prone
519
00:17:50,004 --> 00:17:51,078
In the other case you’re doing a little more work
520
00:17:51,078 --> 00:17:53,095
but you can make it automatic and repeatable
521
00:17:53,095 --> 00:17:55,071
and have, you know, some high confidence
522
00:17:55,071 --> 00:17:57,003
that as you change things in your code
523
00:17:57,003 --> 00:17:58,092
you are not breaking stuff that used to work
524
00:17:58,092 --> 00:18:00,091
and basically it’s more productive
525
00:18:00,091 --> 00:18:02,047
So you’re doing all the same things
526
00:18:02,047 --> 00:18:04,037
but with a, kind of, “delta” extra work
527
00:18:04,037 --> 00:18:07,086
you are using your effort with much higher leverage
528
00:18:07,086 --> 00:18:10,036
So that’s kind of my view of why TDD is a good thing
529
00:18:10,036 --> 00:18:11,088
Really, it doesn’t require new skills
530
00:18:11,088 --> 00:18:15,011
It just requires [you] to refactor your existing skills
531
00:18:15,011 --> 00:18:18,014
I also tried when I—again, honest confessions, right?—
532
00:18:18,014 --> 00:18:19,034
when I started doing this it was like
533
00:18:19,034 --> 00:18:21,049
“Ok, I’m gonna be teaching a course on Rails
534
00:18:21,049 --> 00:18:22,065
I should really focus on testing”
535
00:18:22,065 --> 00:18:24,032
So I went back to some code I had written
536
00:18:24,032 --> 00:18:26,087
that was working—you know, that was decent code—
537
00:18:26,087 --> 00:18:29,006
and I started trying to write tests for it
538
00:18:29,006 --> 00:18:31,019
and it was *so painful*
539
00:18:31,019 --> 00:18:33,033
because the code wasn’t written in a way that was testable
540
00:18:33,033 --> 00:18:34,097
There were all kinds of interactions
541
00:18:34,097 --> 00:18:36,038
There were, like, nested conditionals
542
00:18:36,038 --> 00:18:38,083
And if you wanted to isolate a particular statement
543
00:18:38,083 --> 00:18:41,070
and have a test trigger just that statement
544
00:18:41,070 --> 00:18:44,000
the amount of stuff you’d have to set up in your test
545
00:18:44,000 --> 00:18:45,009
to have it happen—
546
00:18:45,009 --> 00:18:46,040
remember when we talked about mock train wrecks—
547
00:18:46,040 --> 00:18:48,014
you have to set up all this infrastructure
548
00:18:48,014 --> 00:18:49,063
just to get to one line of code
549
00:18:49,063 --> 00:18:51,015
and you do that and you go
550
00:18:51,015 --> 00:18:52,074
“Gawd, testing is really not worth it!
551
00:18:52,074 --> 00:18:54,034
I wrote 20 lines of setup
552
00:18:54,034 --> 00:18:56,059
so that I could test two lines in my function!”
553
00:18:56,059 --> 00:18:58,085
What that’s really telling you—as I now realize—
554
00:18:58,085 --> 00:19:00,042
is your function is bad
555
00:19:00,042 --> 00:19:01,049
It’s a badly written function
556
00:19:01,049 --> 00:19:02,052
It’s not a testable function
557
00:19:02,052 --> 00:19:03,088
It’s got too many moving parts
558
00:19:03,088 --> 00:19:06,026
whose dependencies can’t be broken
559
00:19:06,026 --> 00:19:07,070
There are no seams in my function
560
00:19:07,070 --> 00:19:11,008
that allow me to individually test the different behaviours
561
00:19:11,008 --> 00:19:12,083
And once you start doing Test First Development
562
00:19:12,083 --> 00:19:15,043
because you have to write your tests in small chunks
563
00:19:15,043 --> 00:19:17,053
it kind of makes this problem go away
564
00:19:17,053 --> 99:59:59,999
So that’s been my epiphany
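To make the idea of a seam concrete, here is a small sketch in plain Ruby. Both methods are hypothetical and not from the lecture; they only contrast a function whose dependency is buried inside it with one where a test can control that dependency directly.

```ruby
# Hard to test: the clock is read inside the method, so a test cannot
# choose which branch runs without stubbing Time.now.
def greeting_untestable
  Time.now.hour < 12 ? "Good morning" : "Good afternoon"
end

# Easier to test: the hour is passed in (a seam), with the real clock
# as the default so existing callers do not have to change.
def greeting(hour = Time.now.hour)
  hour < 12 ? "Good morning" : "Good afternoon"
end

# Each behaviour can now be exercised with one line of setup.
puts greeting(9)
puts greeting(15)
```

Writing the test first tends to push code toward the second shape, because the awkward setup shows up before the function has grown around it.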