So we spent a bunch of time in the last couple of lectures talking about different kinds of testing: unit testing versus integration testing. We talked about how you use RSpec to really isolate the parts of your code you want to test. You've also, because of Homework 3 and other work, been doing BDD, where we've been using Cucumber to turn user stories into, essentially, integration and acceptance tests. So you've seen testing at a couple of different levels, and the goal here is to make a few remarks, back up a little, see the big picture, and tie those things together. This spans material that covers three or four sections in the book, and I just want to hit the high points in lecture.

So a question that comes up, and I'm sure it's come up for all of you as you've been doing the homework, is: "How much testing is enough?" Sadly, for a long time, if you asked this question in industry the answer was basically, "Well, we have a shipping deadline, so however much testing we can do before that deadline, that's how much. That's what you have time for." That's a little flip, and obviously not very good.

You can do a bit better, right? There are some static measures, like: how many lines of code does your app have, and how many lines of tests do you have? It's not unusual in industry, in a well-tested piece of software, for the number of lines of tests to go far beyond the number of lines of code; integer multiples are not unusual. And I think even for research code or classwork, a ratio of maybe 1.5 is not unreasonable: one and a half times as much test code as application code. In a lot of production systems where they really care about testing, it is much higher than that.

So maybe a better question to ask, rather than "How much testing is enough?", is "How good is the testing I am doing now? How thorough is it?" Later in the semester Professor Sen will talk a little bit about formal methods and what's at the frontiers of testing and debugging. But one thing we can talk about, based on what you already know, is some basic concepts about test coverage. And although we've been saying all along that formal methods don't really work on big systems, that statement, in my personal opinion, is a lot less true than it used to be. I think there are a number of specific places, especially in testing and debugging, where formal methods are actually making fast progress, and Koushik Sen is one of the leaders in that. You'll have the opportunity to hear more about that later, but for the moment the bread and butter is coverage measurement, because this is where the rubber meets the road in terms of how you'd be evaluated if you were doing this for real.

So what are some basics? Here's a really simple class we can use to talk about different ways to measure how our tests cover this code. There are a few different levels of coverage, and the terminology is not really universal across all software houses, but one common set of terms, which the book uses, is this. We could talk about S0, where we just mean you've called every method once. So if you call foo and you call bar, you're done. That's S0 coverage: not terribly thorough. A little more stringent is S1, which says we're calling every method from every place it could be called. So what does that mean?
It means, for example, that it's not enough to call bar; you have to make sure you call it at least once from in here, as well as calling it at least once from any exterior function that might call it.

C0, which is what SimpleCov measures (for those of you who've gotten SimpleCov up and running), basically says you've executed every statement, you've touched every statement in your code once. The caveat there is that conditionals just count as a single statement: no matter which branch of this "if" you took, as long as you touched one branch or the other, you've executed the "if" statement. So even C0 is still fairly superficial coverage. But, as we will see, the way you want to read this information is: if you are getting bad coverage at the C0 level, then you have really, really bad coverage. If you're not even meeting this superficial level of coverage, your testing is probably deficient.

C1 is the next step up from that. We say: we have to take every branch in both directions. So, when we are doing this "if" statement, we have to make sure we do the "if x" part at least once and the "if not x" part at least once to meet C1. You can augment that with decision coverage, which says: if we have "if" statements where the condition is made up of multiple terms, we have to make sure that every subexpression has been evaluated in both directions. That means if we're going to fail this "if" statement, we have to fail it at least once because y was false and at least once because z was false. In other words, any subexpression that could independently change the outcome of the condition has to be exercised in both directions.

And then the one that a lot of people aspire to, though there is disagreement on how much more valuable it is, is C2: you take every path through the code. Obviously this is difficult, because the number of paths tends to be exponential in the number of conditions, and in general it's difficult to evaluate whether you've taken every path through the code. There are formal techniques you can use to tell you where the holes are, but the bottom line is that in most commercial software houses there is, I would say, not complete consensus on how much more valuable C2 is compared to C0 or C1.

So, for the purposes of our class, you get exposed to the idea of how you use coverage information. SimpleCov takes advantage of some built-in Ruby features to give you C0 coverage. [It] does really nice reports: you can see, at the level of individual lines in your file, what your coverage is, and I think that's a good start for where we are.
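[The example class on the slide isn't reproduced in this transcript, so here is a rough, hypothetical Ruby stand-in with the coverage levels annotated as comments; the class name and method bodies are invented, and only the names foo and bar come from the discussion above.]

```ruby
# Hypothetical example class (not the one from the slides) for making the
# coverage levels concrete.
class Thermostat
  def foo(x)
    bar(x, true) if x     # S1: bar must be exercised from this call site too
    "done"
  end

  def bar(y, z)
    if y && z             # C0: this line executed at least once
                          # C1: the branch taken both ways
                          # Decision coverage: the condition must fail at least
                          # once because y is false and once because z is false
      "heat on"
    else
      "heat off"
    end
  end
end

# S0: every method called at least once, e.g.
#   t = Thermostat.new; t.foo(true); t.bar(true, false)
# S1: every method called from every place it could be called, so bar must
#   also be exercised via the call inside foo, not just directly.
# C2: every path through foo and bar exercised, which grows quickly with
#   the number of conditionals.
```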
So, having seen a few different flavours of tests, let's step back and look at the big picture: what are the different kinds of tests we've seen concretely, and what are the tradeoffs between using them?

At the level of individual classes or methods, we've used RSpec, with extensive use of mocking and stubbing. For example, when we test methods in the model, that's an example of unit testing. We also did something pretty similar to functional or module testing, where more than one module participates. For example, when we did controller specs, we saw that we simulate a POST action, but remember that the POST action has to go through the routing subsystem before it gets to the controller, and once the controller is done it will try to render a view. So in fact there are other pieces that collaborate with the controller that have to be working in order for controller specs to pass. That's somewhere in between: we're doing more than a single method and touching more than a single class, but we're still concentrating [our] attention on a fairly narrow slice of the system at a time, and we're still using mocking and stubbing extensively to isolate the behaviour we want to test. And then, at the level of Cucumber scenarios, these are more like integration or system tests. They exercise complete paths through the application, they probably touch a lot of different modules, and they make minimal use of mocks and stubs, because part of the goal of an integration test is exactly to test the interactions between pieces. You don't want to stub or control those interactions; you want to let the system do what it would really do if this were a scenario happening in production.

So how would we compare these different kinds of tests? There are a few different axes we can look at. One of them is how long they take to run. Both RSpec and Cucumber have fairly high startup times, but as you'll see, as you start adding more and more RSpec tests and using autotest to run them in the background, by and large, once RSpec gets off the launching pad it runs specs really fast, whereas running Cucumber features just takes a long time, because Cucumber essentially fires up your entire application. Later in the semester we'll see a way to make Cucumber even slower, which is to have it fire up an entire browser and basically act like a puppet, remote-controlling Firefox so you can test JavaScript code. I think we'll be able to work with our friends at SourceLabs so you can do that in the cloud; that will be exciting. So: "run fast" versus "run slow".

Resolution: if an error happens in your unit tests, it's usually pretty easy to figure out and track down the source of that error, because the tests are so isolated. You've stubbed out everything that doesn't matter and you're focusing on only the behaviour of interest, so if you've done a good job of that, when something goes wrong in one of your tests there aren't a lot of places it could have gone wrong. In contrast, if you're running a Cucumber scenario that's got, say, 10 steps, and every step is touching a whole bunch of pieces of the app, it could take a long time to get to the bottom of a bug. So there is a tradeoff in how well you can localize errors.

Coverage: if you write a good suite of unit and functional tests, you can get really high coverage. You can run your SimpleCov report, identify specific lines in your files that have not been exercised by any test, and then go write tests that cover them. So, figuring out how to improve your coverage, for example at the C0 level, is something much more easily done with unit tests,
whereas with a Cucumber scenario you are touching a lot of parts of the code, but you are doing it very sparsely. So if your goal is to get your coverage up, use the tools that work at the unit level, so you can focus on understanding which parts of your code are undertested and then write very targeted tests to cover them.

Putting those pieces together: unit tests, because of their isolation and their fine resolution, tend to use a lot of mocks to isolate the behaviours you don't care about. But that means, by definition, you're not testing the interfaces, and it's received wisdom in software that a lot of the interesting bugs occur at the interfaces between pieces, not within a class or within a method; those are sort of the easy bugs to track down. At the other extreme, the more you move towards integration testing, the less you're supposed to rely on mocks, for exactly that reason. Now, as we saw, if you're testing something like a service-oriented architecture where you have to interact with a remote site, you still end up doing a fair amount of mocking and stubbing so that your tests don't depend on the Internet to pass, but generally speaking you try to remove as many of the mocks as you can and let the system run the way it would run in real life. The good news is that you are testing the interfaces; the bad news is that when something goes wrong in one of the interfaces, because your resolution is not as good, it may take longer to figure out what it is.

So the high-order bit from this tradeoff is that you don't want to rely too heavily on any one kind of test. They serve different purposes, and whether you're trying to exercise your interfaces more or trying to improve your fine-grained coverage affects how you develop your test suite, which you'll evolve along with your software.

Now, we've used a certain set of terminology in testing. It's the terminology that, by and large, is most commonly used in the Rails community, but there's some variation, [and] there are some other terms you might hear if you go get a job somewhere. You might hear about mutation testing, which we haven't done. This is an interesting idea, popularized, I think, by Ammann and Offutt, who have sort of the definitive book on software testing. The idea is: suppose I introduce a deliberate bug into my code; does that force some test to fail? Because if I changed "if x" to "if not x" and no tests fail, then either I'm missing some coverage or my app is very strange and somehow nondeterministic.

Fuzz testing, which Koushik Sen may talk more about, is basically the "10,000 monkeys at typewriters" approach: throwing random input at your code. What's interesting about it is that the tests we've been writing are essentially crafted to test the app the way it was designed to be used, whereas fuzz testing is about testing the app in ways it wasn't meant to be used. What happens if you throw enormous form submissions at it? What happens if you put control characters in your forms? What happens if you submit the same thing over and over?
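[To make the fuzz-testing idea concrete, here is a hedged sketch of what it could look like as an RSpec request spec. The /movies route, the parameter names, and the "no 5xx response" acceptance criterion are all invented for illustration; they are not from the lecture.]

```ruby
# A rough sketch of fuzz-style testing as a Rails request spec: throw input
# the app was never designed for and check that it fails gracefully.
require 'rails_helper'

RSpec.describe 'submitting hostile input', type: :request do
  it 'never responds with a server error to garbage form data' do
    20.times do
      # Build a large string of random printable and control characters.
      garbage = Array.new(5_000) { rand(1..126).chr }.join
      post '/movies', params: { movie: { title: garbage, rating: garbage } }
      # Whatever the app does (reject, redirect, re-render the form),
      # it should not blow up with a 5xx error.
      expect(response.status).to be < 500
    end
  end
end
```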
And Koushik has a statistic that Microsoft finds up to 20% of their bugs using some variation of fuzz testing, and that about 25% of common Unix command-line programs can be made to crash [when] put through aggressive fuzz testing.

Define-use coverage is something we haven't done, but it's another interesting concept. The idea is that at some point in my program there's a place where I define, that is, assign a value to, some variable, and then there's a place downstream where presumably I'm going to consume that value; someone is going to use it. Have I covered every pair? In other words, do I have tests in which every pair of defining a variable and using it somewhere is exercised at some point in my test suite? It's sometimes called DU-coverage.

Other terms that I think are not as widely used anymore are blackbox versus whitebox, or blackbox versus glassbox. Roughly, a blackbox test is one that is written from the point of view of the external specification of the thing. [For example:] "This is a hash table. When I put in a key, I should get back a value. If I delete the key, the value shouldn't be there." That's a blackbox test, because it doesn't say anything about how the hash table is implemented and it doesn't try to stress the implementation. A corresponding whitebox test might be: "I know something about the hash function, and I'm going to deliberately create hash keys in my test cases that cause a lot of hash collisions, to make sure I'm testing that part of the functionality." Now, a C0 test coverage tool like SimpleCov would reveal that, if all you had were blackbox tests, the collision-handling code might not be getting hit very often. That might tip you off: OK, if I really want to strengthen that, if I want to boost coverage for those tests, now I have to write a whitebox or glassbox test. I have to look inside, see what the implementation does, and find specific ways to try to break the implementation in evil ways.

So, I think, testing is a kind of way of life, right? We've gotten away from the phase of "we build the whole thing and then we test it" and into the phase of "we're testing as we go." Testing is really more like a development tool, and like so many development tools, its effectiveness depends on whether you're using it in a tasteful manner. So, you could say: "Well, let's see, I kicked the tires. You know, I fired up the browser, I tried a couple of things (claps hands). Looks like it works!
Deploy it!" That's obviously a little more cavalier than you'd want to be. And, by the way, one of the things we discovered with this online course just starting up: when 60,000 people are enrolled in the course and 0.1% of them have a problem, you get 60 emails. The corollary is that when your site is used by a lot of people, some stupid bug that you didn't find, but that could have been found by testing, can very quickly generate *a lot* of pain.

On the other hand, you don't want to be dogmatic and say, "Until we have 100% coverage and every test is green, we absolutely will not ship." That's not healthy either, and statement coverage doesn't necessarily correlate with test quality: unless you can say something about the quality of your tests, just because you've executed every line doesn't mean you've tested the interesting cases. So, somewhere in between, you could say, "Well, we'll use coverage tools to identify undertested or poorly tested parts of the code, and we'll use them as a guideline to help improve our overall confidence level."

But remember, Agile is about embracing change and dealing with it. Part of change is that things will change in ways that cause bugs you didn't foresee, and the right reaction is to be comfortable enough with the testing tools [so] that you can quickly find those bugs: write a test that reproduces the bug, and then make the test green. Then you'll have really fixed it. That is, the way you really fix a bug is that you create a test that correctly fails, reproducing the bug, and then you go back and fix the code to make that test pass.

Similarly, you don't want to say, "Well, unit tests give you better coverage, they're more thorough and detailed, so let's focus all our energy on those," as opposed to, "Oh, focus on integration tests, because they're more realistic, right? They reflect what the customer said they want, so if the integration tests are passing, by definition we're meeting a customer need." Again, both extremes are unhealthy, because each one can find problems that would be missed by the other. So having a good combination of them is what it's all about.

The last thing I want to leave you with, in terms of testing, is TDD versus what I call conventional debugging, i.e., the way that we all kind of do it even though we say we don't. And we're all trying to get better, right? We're all in the gutter; some of us are looking up at the stars, trying to improve our practices. Having now lived with this for three or four years myself, and, I'll be honest, three years ago I didn't do TDD, I do it now because I find that it's better, and here's my distillation of why I think it works for me. Sorry, the colours are a little weird, but the left column of the table says "Conventional debugging" and the right column says "TDD".

So what's the way I used to write code? Maybe some of you still do this. I write a whole bunch of lines, maybe a few tens of lines of code. I'm sure they're right; I mean, I'm a good programmer, right? This is not that hard. I run it. It doesn't work. OK, fire up the debugger, start putting in printf's. If I'd been using TDD, what would I do instead?
Well, I'd write a few lines of code, having written a test first. So as soon as the test goes from red to green, I know I wrote code that works, or at least that the parts of the behaviour I had in mind work, because I had a test.

OK, back to conventional debugging. I'm running my program, trying to find the bugs. I start putting in printf's everywhere to print out the values of things, which, by the way, is a lot of fun when you're trying to read them out of the 500 lines of log output you get in a Rails app, trying to find your printf's. You know: "I know what I'll do, I'll put in 75 asterisks before and after. That will make it readable." (laughter) OK, raise your hands if you don't do this! Thank you for your honesty. (laughter) Or I could do the other thing. Instead of printing the value of a variable, why don't I write a test that inspects it with an expectation? Then I'll know immediately, in bright red letters, if that expectation wasn't met.

OK, I'm back on the conventional debugging side. I break out the big guns: I pull out the Ruby debugger, I set a breakpoint, and I start tweaking: "Oh, let's see, I have to get past that 'if' statement, so I have to set that thing. Oh, I have to call that method, and so I need to..." No! If I'm going to do that anyway, I could instead just do it in a file: set up some mocks and stubs to control the code path and make it go the way I want.

And now: "OK, for sure I've fixed it! I'll get out of the debugger and run it all again!" And, of course, nine times out of ten you didn't fix it, or you partly fixed it but didn't completely fix it, and now I have to do all these manual things all over again. Or: I already have a bunch of tests, I can just rerun them automatically, and if some of them fail, "Oh, I didn't fix the whole thing. No problem, I'll just go back."

So the bottom line is that you could do it on the left side, but you're using the same techniques in both cases. The only difference is that in one case you're doing it manually, which is boring and error-prone; in the other case you're doing a little more work, but you can make it automatic and repeatable, and have some high confidence that as you change things in your code you are not breaking stuff that used to work. Basically, it's more productive. You're doing all the same things, but with a small "delta" of extra work you are using your effort at much higher leverage. So that's my view of why TDD is a good thing. It doesn't require new skills; it just requires [you] to refactor your existing skills.

I'll also make an honest confession: when I started doing this, it was, "OK, I'm going to be teaching a course on Rails, I should really focus on testing." So I went back to some code I had written that was working, you know, decent code, and I started trying to write tests for it, and it was *so painful*, because the code wasn't written in a way that was testable. There were all kinds of interactions; there were nested conditionals. If you wanted to isolate a particular statement and have a test trigger just that statement, think of the amount of stuff you'd have to set up in your test to make that happen (remember when we talked about mock train wrecks?): you have to set up all this infrastructure just to exercise one line of code. And you do that and you go, "Gawd, testing is really not worth it! I wrote 20 lines of setup so that I could test two lines in my function!"
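[Here is an invented sketch of the kind of spec being described: a chain of doubles and a pile of setup just to exercise one line of application code. None of these class or method names are from the lecture; they only illustrate the shape of the problem.]

```ruby
# Invented example of hard-to-test code: the method reaches through a chain
# of collaborators (movie.studio.owner.email), so the spec has to build a
# matching train of doubles just to reach the one line under test.
class Mailer
  def self.deliver(to, subject); end   # stand-in for a real mailer
end

class FlopNotifier
  def notify(movie)
    if movie.gross < movie.budget / 10              # buried condition
      Mailer.deliver(movie.studio.owner.email,      # mock train wreck
                     "#{movie.title} flopped")
    end
  end
end

RSpec.describe FlopNotifier do
  it 'emails the studio owner about a flop' do
    # Many lines of setup for a single expectation:
    owner  = double('owner', email: 'boss@example.com')
    studio = double('studio', owner: owner)
    movie  = double('movie', title: 'Apocalypse Then',
                             gross: 1_000, budget: 50_000_000, studio: studio)
    allow(Mailer).to receive(:deliver)

    FlopNotifier.new.notify(movie)

    expect(Mailer).to have_received(:deliver).with('boss@example.com', anything)
  end
end
```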
What that's really telling you, as I now realize, is that your function is bad. It's a badly written function; it's not a testable function. It has too many moving parts whose dependencies can't be broken apart, and there are no seams in the function that allow me to test the different behaviours individually. And once you start doing test-first development, because you have to write your tests in small chunks, this problem largely goes away. So that's been my epiphany.
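[Continuing that invented sketch: one way test-first pressure tends to reshape such code is by introducing seams, here an injected collaborator and a small predicate method, so that each behaviour can be tested on its own with almost no setup. Again, the names are made up for illustration.]

```ruby
# The same invented example refactored with seams: the mailer is injected and
# the flop decision is its own method, so each piece is independently testable.
class FlopNotifier
  def initialize(mailer)
    @mailer = mailer                    # seam: the collaborator can be swapped in tests
  end

  def flop?(movie)
    movie.gross < movie.budget / 10     # seam: the condition is testable on its own
  end

  def notify(movie, owner_email)
    @mailer.deliver(owner_email, "#{movie.title} flopped") if flop?(movie)
  end
end

RSpec.describe FlopNotifier do
  let(:movie) do
    double('movie', title: 'Apocalypse Then', gross: 1_000, budget: 50_000_000)
  end

  it 'knows a flop when it sees one' do
    expect(FlopNotifier.new(nil).flop?(movie)).to be true
  end

  it 'emails the owner about a flop' do
    mailer = double('mailer')
    expect(mailer).to receive(:deliver).with('boss@example.com', anything)
    FlopNotifier.new(mailer).notify(movie, 'boss@example.com')
  end
end
```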