So we spent a bunch of time
in the last couple of lectures
talking about different kinds of testing
about unit testing versus integration testing
We talked about how you use RSpec
to really isolate the parts of your code you want to test
And, you know, because of homework 3
and other assignments, we have been doing BDD,
where we’ve been using Cucumber to turn user stories
into, essentially, integration and acceptance tests
So you’ve seen testing in a couple of different levels
and the goal here is to make a few remarks
to, you know, back up a little bit
and see the big picture, and tie those things together
So this sort of spans material
that covers three or four sections in the book
and I want to just hit the high points in lecture
So a question that comes up
I’m sure it’s come up for all of you
as you have been doing homework
is: “How much testing is enough?”
And, sadly, for a long time
kind of if you asked this question in industry
the answer was basically
“Well, we have a shipping deadline,
so however much testing we can do
before that deadline, that’s how much.”
That’s what you have time for.
So, you know, that’s a little flip
obviously not very good
So you can do a bit better, right?
There’re some static measures
like how many lines of code does your app have
and how many lines of tests do you have?
And it’s not unusual in industry
in a well-tested piece of software
for the number of lines of tests
to go far beyond the number of lines of code
So, integer multiples are not unusual
And I think even for sort of, you know,
research code or classwork
a ratio of, you know, maybe 1.5 is not unreasonable
so one and a half times the amount of test code
as you have application code
And in a lot of production systems
where they really care about testing
it is much higher than that
So maybe a better question to ask
rather than “How much testing is enough?”
is “How good is the testing I am doing now?
How thorough is it?”
Later in this semester
Professor Sen will talk
a little bit about formal methods
and sort of what’s at the frontiers of testing and debugging
But a couple of things that we can talk about
based on what you already know
is some basic concepts about test coverage
And although we’ve been saying all along
that formal methods don’t really work on big systems
I think that statement, in my personal opinion
is actually a lot less true than it used to be
I think there are a number of specific places
especially in testing and debugging
where formal methods are actually making fast progress
and Koushik Sen is one of the leaders in that
So you’ll have the opportunity to hear more about that later
but for the moment I think, kind of bread and butter
is let’s talk about coverage measurement
because this is where the rubber meets the road
in terms of how you’d be evaluated
if you are doing this for real
So what are some basics?
Here’s a really simple class you can use
to talk about different ways to measure
how our tests cover this code
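The slide itself isn’t in the transcript, so here’s a minimal sketch of the kind of class being described—the names foo, bar, x, y, and z come from the lecture; the method bodies are assumptions:

```ruby
# A minimal, hypothetical sketch of the slide's class: foo has a compound
# conditional, and bar can be called both from inside foo and from outside.
class MyClass
  def foo(x, y, z)
    if x && y && z    # compound condition with three subexpressions
      bar(x)          # bar called from inside foo...
    else
      'skipped'
    end
  end

  def bar(x)          # ...and bar is also callable from outside
    x ? 'yes' : 'no'
  end
end
```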
And there’re a few different levels
with different terminologies
It’s not really universal across all software houses
But one common set of terminology
that the book exposes
is we could talk about S0
where we’d just mean you’ve called every method once
So you know, if you call foo, and you call bar, you’re done
That’s S0 coverage: not terribly thorough
A little more stringent, S1, is
you could say, we’re calling every method
from every place that it could be called
So what does that mean?
It means, for example
it’s not enough to call bar
You have to make sure that you call it
at least once from inside foo
as well as calling it at least once
from any exterior function that might call it
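Concretely, against the MyClass sketch above (a hedged illustration, not the lecture’s own tests):

```ruby
# S0: every method called at least once, anywhere in the suite.
m = MyClass.new
m.foo(true, true, true)  # calls foo, and calls bar from inside foo
m.bar(false)             # calls bar directly

# S1 additionally requires bar to be exercised from every call site:
# both the call inside foo (first line) and external callers (second line).
```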
C0, which is what SimpleCov measures
(those of you who’ve gotten SimpleCov up and running)
basically says you’ve executed every statement
in your code at least once
But the caveat there is that
conditionals really just count as a single statement
So, no matter which branch of this “if” you took
as long as you touched one branch or the other
you’ve executed the “if” statement
So even C0 is still, you know, sort of superficial coverage
But, as we will see
the way that you will want to read this information is:
if you are getting bad coverage at the C0 level
then you have really really bad coverage
So if you are not even reaching
this superficial level of coverage
then your testing is probably deficient
C1 is the next step up from that
We could say:
Well, we have to take every branch in both directions
So, when we are doing this “if” statement
we have to make sure that
we do the “if x” part once
and the “if not x” part at least once to meet C1
You can augment that with decision coverage
saying: Well, if we have “if” statements
where the condition is made up of multiple terms
we have to make sure that every subexpression
has been evaluated in both directions
In other words, that means that
if we’re going to fail this “if” statement
we have to make sure to fail it at least once
because y was false, and at least once because z was false
In other words, any subexpression that could
independently change the outcome of the condition
has to be exercised in both directions
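In terms of the MyClass sketch above, a hedged illustration of the difference:

```ruby
# C1 (branch coverage): take foo's "if" both ways.
m = MyClass.new
m.foo(true, true, true)    # condition true  -> "then" branch (calls bar)
m.foo(false, true, true)   # condition false -> "else" branch

# Decision coverage: each subexpression independently flips the outcome.
m.foo(true, false, true)   # condition fails only because y is false
m.foo(true, true, false)   # condition fails only because z is false
```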
And then,
kind of, the one that, you know, a lot of people aspire to
but there is disagreement on how much more valuable it is
is C2: you take every path through the code
Obviously, this is kind of difficult because
it tends to be exponential in the number of conditions
And in general it’s difficult
to evaluate if you’ve taken every path through the code
There are formal techniques that you can use
to tell you where the holes are
but the bottom line is that
in most commercial software houses
there is, I would say, not complete consensus
on how much more valuable C2 is
compared to C0 or C1
So, I think, for the purpose of our class
you get exposed to the idea
of how you use coverage information
SimpleCov takes advantage of some built-in Ruby features
to give you C0 coverage
[It] produces really nice reports
You can see your coverage
at the level of individual lines in each file
and I think that’s kind of a, you know
a good start for where we are
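For reference, the basic SimpleCov setup really is just a couple of lines at the top of your spec helper (this is the documented usage; the 'rails' profile groups results by app directory):

```ruby
# spec/spec_helper.rb -- must be required before any application code loads
require 'simplecov'
SimpleCov.start 'rails'
# After running your specs, open coverage/index.html for the
# line-by-line (C0) report.
```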
So, having seen a few different flavours of tests
let’s step back and look at the big picture:
what are the different kinds of tests
that we’ve seen concretely?
and what are the tradeoffs
between using those different kinds of tests?
So we’ve seen at the level of individual classes or methods
we use RSpec, with extensive use of mocking and stubbing
So, for example, when we test methods in the model
that will be an example of unit testing
We also did something that is pretty similar to
functional or module testing
where there is more than one module participating
So, for example when we did controller specs
we saw that—we simulate a POST action
but remember that the POST action
has to go through the routing subsystem
before it gets to the controller
Once the controller is done it will try to render a view
So in fact there’s other pieces
that collaborate with the controller
that have to be working in order for controller specs to pass
So that’s somewhere in between:
where we’re doing more than a single method
touching more than a single class
but we’re still concentrating [our] attention
on a fairly narrow slice of the system at a time
and we’re still using mocking and stubbing extensively
to sort of isolate that behaviour that we want to test
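A hedged sketch of what such a controller spec looks like—MoviesController and the stubbed model call are illustrative stand-ins, not code from the lecture:

```ruby
# Hypothetical controller spec: the simulated POST exercises routing,
# the controller action, and the redirect, while the model is stubbed out.
describe MoviesController do
  describe 'POST create' do
    it 'redirects to the movie list' do
      allow(Movie).to receive(:create!).and_return(double('movie'))
      post :create, movie: { 'title' => 'Milk' }
      expect(response).to redirect_to(movies_path)
    end
  end
end
```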
And then at the level of Cucumber scenarios
these are more like integration or system tests
They exercise complete paths throughout the application
They probably touch a lot of different modules
They make minimal use of mocks and stubs
because part of the goal of an integration test
is exactly to test the interaction between pieces
So you don’t want to stub or control those interactions
You actually want to let the system do
what it would really do
if this was a scenario happening in production
So how would we compare these different kinds of tests?
There’s a few different axes we can look at
One of them is how long they take to run
Now, both RSpec and Cucumber
have, kind of, high startup times and stuff like that
But, as you’ll see
as you start adding more and more RSpec tests
and using autotest to run them in the background
by and large, once RSpec kind of gets off the launching pad
it runs specs really fast
whereas running Cucumber features just takes a long time
as it essentially fires up your entire application
And later in this semester
we’ll see a way to make Cucumber even slower—
which is to have it fire up an entire browser
and treat it like a puppet, remote-controlling Firefox
so you can test JavaScript code
We’ll do that when we actually—
I think we’ll be able to work with our friends at Sauce Labs
so you can do that in the cloud—That will be exciting
So, “run fast” versus “run slow”
Resolution:
If an error happens in your unit tests
it’s usually pretty easy
to figure out and track down what the source of that error is
because the tests are so isolated
You’ve stubbed out everything that doesn’t matter
and you’re focusing on only the behaviour of interest
So, if you’ve done a good job of doing that
when something goes wrong in one of your tests
there’s not a lot of places
that something could have gone wrong
In contrast, if you’re running a Cucumber scenario
that’s got, you know, 10 steps
and every step is touching
a whole bunch of pieces of the app
it could take a long time
to actually get to the bottom of a bug
So there is kind of a tradeoff
in how well you can localize errors
Coverage:
It’s possible if you write a good suite
of unit and functional tests
you can get really high coverage
You can run your SimpleCov report
and you can actually identify specific lines in your files
that have not been exercised by any test
and then you can go write tests that cover them
So, figuring out how to improve your coverage
for example at the C0 level
is something much more easily done with unit tests
whereas, with a Cucumber test—
with a Cucumber scenario—
you are touching a lot of parts of the code
but you are doing it very sparsely
So, if your goal is to get your coverage up
use the tools that work at the unit level
so that you can focus on understanding
what parts of your code are undertested
and then you can write very targeted tests
just to focus on them
And, sort of, you know, putting those pieces together
the unit tests
because of their isolation and their fine resolution
tend to use a lot of mocks
to isolate the behaviours you don’t care about
But that means that, by definition
you’re not testing the interfaces
and it’s sort of a “received wisdom” in software
that a lot of the interesting bugs
occur at the interfaces between pieces
and not sort of within a class or within a method—
those are sort of the easy bugs to track down
And at the other extreme
the more you get towards the integration testing extreme
you’re supposed to rely less and less on mocks
for that exact reason
Now we saw, if you’re testing something like
say, in a service-oriented architecture
where you have to interact with a remote site
you still end up
having to do a fair amount of mocking and stubbing
so that you don’t rely on the Internet
in order for your tests to pass
but, generally speaking
you’re trying to remove as many of the mocks as you can
and let the system run the way it would run in real life
So, the good news is you are testing the interfaces
but when something goes wrong in one of the interfaces
because your resolution is not as good
it may take longer to figure out what it is
So, the high-order bit from this tradeoff
is you don’t really want to rely
too heavily on any one kind of test
They serve different purposes, and depending on
whether you’re trying to exercise your interfaces more
or trying to improve your fine-grained coverage
that affects how you develop your test suite
and you’ll evolve it along with your software
So, we’ve used a certain set of terminology in testing
It’s the terminology that, by and large
is most commonly used in the Rails community
but there’s some variation
[and] some other terms that you might hear
if you go get a job somewhere
One is mutation testing
which we haven’t done
This is an interesting idea that was, I think, invented by
Ammann and Offutt, who have, sort of
the definitive book on software testing
The idea is:
Suppose I introduced a deliberate bug into my code
does that force some test to fail?
Because, if I changed, you know, “if x” to “if not x”
and no tests fail, then either I’m missing some coverage
or my app is very strange and somehow nondeterministic
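A tiny illustrative sketch of the idea (adult? is a made-up example, not from the lecture):

```ruby
# Original code:
def adult?(age)
  age >= 18
end

# The kind of deliberate bug a mutation tool would introduce:
def adult_mutant?(age)
  age < 18   # comparison flipped
end

# A suite that asserts adult?(30) is true and adult?(5) is false would
# "kill" this mutant; a suite in which no test fails on the mutant has
# a coverage hole.
```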
Fuzz testing, which Koushik Sen may talk more about
basically, this is the “10,000 monkeys at typewriters
throwing random input at your code”
What’s interesting about it is that
those tests we’ve been doing
essentially are crafted to test the app
the way it was designed
whereas fuzz testing
is about testing the app in ways it wasn’t meant to be used
So, what happens if you throw enormous form submissions at it?
What happens if you put control characters in your forms?
What happens if you submit the same thing over and over?
And, Koushik has a statistic that
Microsoft finds up to 20% of their bugs
using some variation of fuzz testing
and that about 25%
of the common Unix command-line programs
can be made to crash
[when] put through aggressive fuzz testing
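A naive sketch of a fuzz loop—Parser and ParseError are hypothetical stand-ins for whatever entry point you’re attacking:

```ruby
# Throw random bytes at the code and watch for anything other than a
# clean rejection. (Parser and ParseError are hypothetical.)
require 'securerandom'

1_000.times do
  input = SecureRandom.random_bytes(rand(1..4096))
  begin
    Parser.parse(input)
  rescue ParseError
    # fine: cleanly rejecting garbage is correct behaviour
  end
  # any other exception escaping this loop is a robustness bug
end
```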
Defining-use coverage is something that we haven’t done
but it’s another interesting concept
The idea is that at any point in my program
there’s a place where I define—
or I assign a value to some variable—
and then there’s a place downstream
where presumably I’m going to consume that value—
someone’s going to use that value
Have I covered every pair?
In other words, do I have tests where every pair
of defining a variable and using it somewhere
is exercised at some point in my test suite?
It’s sometimes called DU-coverage
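A small made-up example: rate is defined in two places and used in one, so DU-coverage wants a test through each define-use pair:

```ruby
def ticket_price(base, member)
  if member
    rate = 0.9   # definition 1 of rate
  else
    rate = 1.0   # definition 2 of rate
  end
  base * rate    # the use: DU-coverage wants both (def 1 -> use)
end              # and (def 2 -> use) pairs exercised

ticket_price(10, true)   # covers the (definition 1, use) pair
ticket_price(10, false)  # covers the (definition 2, use) pair
```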
And there are other terms that I think are not as widely used anymore:
blackbox versus whitebox, or blackbox versus glassbox
Roughly, a blackbox test is one that is written from
the point of view of the external specification of the thing
[For example:] “This is a hash table
When I put in a key I should get back a value
If I delete the key the value shouldn’t be there”
That’s a blackbox test because it doesn’t say
anything about how the hash table is implemented
and it doesn’t try to stress the implementation
A corresponding whitebox test might be:
“I know something about the hash function
and I’m going to deliberately create
hash keys in my test cases
that cause a lot of hash collisions
to make sure that I’m testing that part of the functionality”
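A hedged sketch of the two styles, using Ruby’s built-in Hash (the colliding-key trick is illustrative; a real whitebox test would target the actual hash function):

```ruby
# Blackbox: written purely against the external contract.
describe Hash do
  it 'returns a stored value and forgets a deleted key' do
    h = {}
    h['key'] = 'value'
    expect(h['key']).to eq('value')
    h.delete('key')
    expect(h['key']).to be_nil
  end
end

# Whitebox: craft keys that all collide, to stress collision handling.
class CollidingKey
  attr_reader :id
  def initialize(id)
    @id = id
  end
  def hash
    42                       # every key lands in the same bucket
  end
  def eql?(other)
    other.is_a?(CollidingKey) && id == other.id
  end
end

describe 'collision handling' do
  it 'still finds values when every key collides' do
    h = {}
    100.times { |i| h[CollidingKey.new(i)] = i }
    expect(h[CollidingKey.new(7)]).to eq(7)
  end
end
```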
Now, a C0 test coverage tool like SimpleCov
would reveal that if all you had were blackbox tests
you might find that
the collision-handling code wasn’t being hit very often
And that might tip you off and say:
“Ok, if I really want to strengthen that—
if I want to boost coverage there—
now I have to write a whitebox or a glassbox test
I have to look inside, see what the implementation does
and find specific ways
to try to break the implementation in evil ways”
So, I think, testing is a kind of a way of life, right?
We’ve gotten away from the phase of
“We’d build the whole thing and then we’d test it”
and we’ve gotten into the phase of
“We’re testing as we go”
Testing is really more like a development tool
and like so many development tools
the effectiveness of it depends
on whether you’re using it in a tasteful manner
So, you could say: “Well, let’s see—I kicked the tires
You know, I fired up the browser, I tried a couple of things
(claps hand) Looks like it works! Deploy it!”
That’s obviously a little more cavalier than you’d want to be
And, by the way, one of the things that we discovered
with this online course just starting up
is that when 60,000 people are enrolled in the course
and 0.1% of those people have a problem
you get 60 emails
The corollary is: when your site is used by a lot of people
some stupid bug that you didn’t find
but that could have been found by testing
could very quickly generate *a lot* of pain
On the other hand, you don’t want to be dogmatic and say
“Uh, until we have 100% coverage and every test is green
we absolutely will not ship”
That’s not healthy either
And coverage doesn’t necessarily
correlate with test quality:
unless you can say something
about the quality of your tests
just because you’ve executed every line
doesn’t mean that you’ve tested the interesting cases
So, somewhere in between, you could say
“Well, we’ll use coverage tools to identify
undertested or poorly-tested parts of the code
and we’ll use them as a guideline
to sort of help improve our overall confidence level”
But remember, Agile is about embracing change
and dealing with it
Part of change is that things will change in ways that cause
bugs that you didn’t foresee
and the right reaction is:
Be comfortable enough with the testing tools
that you can quickly find those bugs
Write a test that reproduces that bug
And then make the test green—then you’ve really fixed it
That is, the way that you really fix a bug is:
you create a test that correctly fails
reproducing that bug
and then you go back and fix the code
to make that test pass
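A sketch of that workflow—the Movie validation is a hypothetical bug, not one from the homework:

```ruby
# Hypothetical regression spec: it fails first, reproducing the bug,
# passes once the code is fixed, and stays in the suite so the bug
# can't silently come back.
describe Movie do
  it 'is invalid with a blank release date' do
    movie = Movie.new(title: 'Alien', release_date: '')
    expect(movie).not_to be_valid
  end
end
```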
Similarly, you don’t want to say
“Well, unit tests give you better coverage
They’re more thorough and detailed
So let’s focus all our energy on that”
as opposed to
“Oh, focus on integration tests
because they’re more realistic, right?
They reflect what the customer said they want
So, if the integration tests are passing
by definition we’re meeting a customer need”
Again, both extremes are kind of unhealthy
because each one of these can find problems
that would be missed by the other
So, having a good combination of them
is kind of what it’s all about
The last thing I want to leave you with
in terms of testing, is “TDD versus
what I call conventional debugging—
i.e., the way that we all kind of do it
even though we say we don’t”
and we’re all trying to get better, right?
We’re all kind of in the gutter
Some of us are looking up at the stars
trying to improve our practices
But, having now lived with this for 3 or 4 years myself
and—I’ll be honest—3 years ago I didn’t do TDD
I do it now, because I find that it’s better
and here’s my distillation of why I think it works for me
Sorry, the colours are a little weird
but on the left column of the table
[it] says “Conventional debugging”
and the right side says “TDD”
So what’s the way I used to write code?
Maybe some of you still do this
I write a whole bunch of lines
maybe a few tens of lines of code
I’m sure they’re right—
I mean, I am a good programmer, right?
This is not that hard
I run it – It doesn’t work
Ok, fire up the debugger – Start putting in printf’s
If I’d been using TDD what would I do instead?
Well I’d write a few lines of code, having written a test first
So as soon as the test goes from red to green
I know I wrote code that works—
or at least the parts of the behaviour that I had in mind
work, because I had a test for them
Ok, back to conventional debugging:
I’m running my program, trying to find the bugs
I start putting in printf’s everywhere
to print out the values of things
which, by the way, is a lot of fun
when you’re trying to read them
out of the 500 lines of log output
that you’d get in a Rails app
trying to find your printf’s
you know, “I know what I’ll do—
I’ll put in 75 asterisks before and after
That will make it readable” (laughter)
Who don’t—Ok, raise your hands if you don’t do this!
Thank you for your honesty. (laughter) Ok.
Or I could do the other thing, I could say:
Instead of printing the value of a variable
why don’t I write a test that inspects it
with an expectation?
Then I’ll know immediately, in bright red letters
if that expectation wasn’t met
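A sketch of that contrast—ShoppingCart is a hypothetical class:

```ruby
# Instead of: puts "***** cart total: #{cart.total} *****"
# write the check as an expectation that fails loudly on its own:
describe ShoppingCart do
  it 'totals its line items' do
    cart = ShoppingCart.new
    cart.add(price: 10.00, quantity: 2)
    expect(cart.total).to eq(20.00)
  end
end
```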
Ok, I’m back on the conventional debugging side:
I break out the big guns: I pull out the Ruby debugger
I set a debug breakpoint, and I now start tweaking and say
“Oh, let’s see, I have to get past that ‘if’ statement
so I have to set that thing
Oh, I have to call that method and so I need to…”
No!
I could instead—if I’m going to do that anyway—
let’s just do it in a file, set up some mocks and stubs
to control the code path, make it go the way I want
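A sketch of that move—OrderProcessor and its gateway are hypothetical collaborators:

```ruby
# Instead of steering execution from a breakpoint, stub the collaborator
# so the hard-to-reach path is taken every time the spec runs.
describe OrderProcessor do
  it 'schedules a retry when the payment gateway times out' do
    gateway = double('payment gateway')
    allow(gateway).to receive(:charge).and_raise('connection timed out')
    processor = OrderProcessor.new(gateway)
    expect(processor.process(order_id: 1)).to eq(:retry_scheduled)
  end
end
```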
And now, “Ok, for sure I’ve fixed it!
I’ll get out of the debugger, run it all again!”
And, of course, 9 times out of 10, you didn’t fix it
or you kind of partly fixed it but you didn’t completely fix it
and now I have to do all these manual things all over again
Or: I already have a bunch of tests
and I can just rerun them automatically
and if some of them fail—
“Oh, I didn’t fix the whole thing
No problem, I’ll just go back!”
So, the bottom line is that
you know, you could do it on the left side
but you’re using the same techniques in both cases
The only difference is, in one case you’re doing it manually
which is boring and error-prone
In the other case you’re doing a little more work
but you can make it automatic and repeatable
and have, you know, some high confidence
that as you change things in your code
you are not breaking stuff that used to work
and basically it’s more productive
So you’re doing all the same things
but with a, kind of, “delta” extra work
you are using your effort at a much higher leverage
So that’s kind of my view of why TDD is a good thing
It really doesn’t require new skills
It just requires [you] to refactor your existing skills
I also tried when I—again, honest confessions, right?—
when I started doing this it was like
“Ok, I’m gonna be teaching a course on Rails
I should really focus on testing
So I went back to some code I had written
that was working—you know, that was decent code—
and I started trying to write tests for it
and it was *so painful*
because the code wasn’t written in way that was testable
There were all kinds of interactions
There were, like, nested conditionals
And if you wanted to isolate a particular statement—
to have a test trigger just that statement—
the amount of stuff you’d have to set up in your test
to make that happen—
remember when we talked about mock train wrecks—
you have to set up all this infrastructure
just to get one line of code
and you do that and you go
“Gawd, testing is really not worth it!
I wrote 20 lines of setup
so that I could test two lines in my function!”
What that’s really telling you—as I now realize—
is your function is bad
It’s a badly written function
It’s not a testable function
It’s got too many moving parts
whose dependencies can be broken
There are no seams in my function
that allow me to individually test the different behaviours
And once you start doing Test-First Development
because you have to write your tests in small chunks
it kind of makes this problem go away
So that’s been my epiphany