So we spent a bunch of time in the last couple of lectures talking about different kinds of testing: unit testing versus integration testing. We talked about how you use RSpec to really isolate the parts of your code you want to test. You've also, because of Homework 3 and other work, been doing BDD, where we've been using Cucumber to turn user stories into, essentially, integration and acceptance tests. So you've seen testing at a couple of different levels, and the goal here is to make a few remarks, back up a little, see the big picture, and tie those things together. This spans material that covers three or four sections in the book, and I just want to hit the high points in lecture.

So a question that comes up, and I'm sure it's come up for all of you as you've been doing the homework, is: "How much testing is enough?" Sadly, for a long time, if you asked this question in industry the answer was basically, "Well, we have a shipping deadline, so however much testing we can do before that deadline, that's how much. That's what you have time for." That's a little flip, and obviously not very good.

You can do a bit better, right? There are some static measures, like: how many lines of code does your app have, and how many lines of tests do you have? It's not unusual in industry, in a well-tested piece of software, for the number of lines of tests to go far beyond the number of lines of code; integer multiples are not unusual. And I think even for research code or classwork, a ratio of maybe 1.5 is not unreasonable: one and a half times as much test code as application code. In a lot of production systems where they really care about testing, it is much higher than that.

So maybe a better question to ask, rather than "How much testing is enough?", is "How good is the testing I am doing now? How thorough is it?" Later in the semester Professor Sen will talk a little bit about formal methods and what's at the frontiers of testing and debugging. But one thing we can talk about, based on what you already know, is some basic concepts about test coverage. And although we've been saying all along that formal methods don't really work on big systems, that statement, in my personal opinion, is a lot less true than it used to be. I think there are a number of specific places, especially in testing and debugging, where formal methods are actually making fast progress, and Koushik Sen is one of the leaders in that. You'll have the opportunity to hear more about that later, but for the moment the bread and butter is coverage measurement, because this is where the rubber meets the road in terms of how you'd be evaluated if you were doing this for real.

So what are some basics? Here's a really simple class we can use to talk about different ways to measure how our tests cover this code. There are a few different levels of coverage, and the terminology is not really universal across all software houses, but one common set of terms, which the book uses, is this. We could talk about S0, where we just mean you've called every method once. So if you call foo and you call bar, you're done. That's S0 coverage: not terribly thorough. A little more stringent is S1, which says we're calling every method from every place it could be called. So what does that mean?
It means, for example, that it's not enough to call bar; you have to make sure you call it at least once from in here, as well as calling it at least once from any exterior function that might call it.

C0, which is what SimpleCov measures (for those of you who've gotten SimpleCov up and running), basically says you've executed every statement, you've touched every statement in your code once. The caveat there is that conditionals just count as a single statement: no matter which branch of this "if" you took, as long as you touched one branch or the other, you've executed the "if" statement. So even C0 is still fairly superficial coverage. But, as we will see, the way you want to read this information is: if you are getting bad coverage at the C0 level, then you have really, really bad coverage. If you're not even meeting this superficial level of coverage, your testing is probably deficient.

C1 is the next step up from that. We say: we have to take every branch in both directions. So, when we are doing this "if" statement, we have to make sure we do the "if x" part at least once and the "if not x" part at least once to meet C1. You can augment that with decision coverage, which says: if we have "if" statements where the condition is made up of multiple terms, we have to make sure that every subexpression has been evaluated in both directions. That means if we're going to fail this "if" statement, we have to fail it at least once because y was false and at least once because z was false. In other words, any subexpression that could independently change the outcome of the condition has to be exercised in both directions.

And then the one that a lot of people aspire to, though there is disagreement on how much more valuable it is, is C2: you take every path through the code. Obviously this is difficult, because the number of paths tends to be exponential in the number of conditions, and in general it's difficult to evaluate whether you've taken every path through the code. There are formal techniques you can use to tell you where the holes are, but the bottom line is that in most commercial software houses there is, I would say, not complete consensus on how much more valuable C2 is compared to C0 or C1.

So, for the purposes of our class, you get exposed to the idea of how you use coverage information. SimpleCov takes advantage of some built-in Ruby features to give you C0 coverage. [It] does really nice reports: you can see, at the level of individual lines in your file, what your coverage is, and I think that's a good start for where we are.
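[The example class on the slide isn't reproduced in this transcript, so here is a rough, hypothetical Ruby stand-in with the coverage levels annotated as comments; the class name and method bodies are invented, and only the names foo and bar come from the discussion above.]

```ruby
# Hypothetical example class (not the one from the slides) for making the
# coverage levels concrete.
class Thermostat
  def foo(x)
    bar(x, true) if x     # S1: bar must be exercised from this call site too
    "done"
  end

  def bar(y, z)
    if y && z             # C0: this line executed at least once
                          # C1: the branch taken both ways
                          # Decision coverage: the condition must fail at least
                          # once because y is false and once because z is false
      "heat on"
    else
      "heat off"
    end
  end
end

# S0: every method called at least once, e.g.
#   t = Thermostat.new; t.foo(true); t.bar(true, false)
# S1: every method called from every place it could be called, so bar must
#   also be exercised via the call inside foo, not just directly.
# C2: every path through foo and bar exercised, which grows quickly with
#   the number of conditionals.
```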
So, having seen a few different flavours of tests, let's step back and look at the big picture: what are the different kinds of tests we've seen concretely, and what are the tradeoffs between using them?

At the level of individual classes or methods, we've used RSpec, with extensive use of mocking and stubbing. For example, when we test methods in the model, that's an example of unit testing. We also did something pretty similar to functional or module testing, where more than one module participates. For example, when we did controller specs, we saw that we simulate a POST action, but remember that the POST action has to go through the routing subsystem before it gets to the controller, and once the controller is done it will try to render a view. So in fact there are other pieces that collaborate with the controller that have to be working in order for controller specs to pass. That's somewhere in between: we're doing more than a single method and touching more than a single class, but we're still concentrating [our] attention on a fairly narrow slice of the system at a time, and we're still using mocking and stubbing extensively to isolate the behaviour we want to test. And then, at the level of Cucumber scenarios, these are more like integration or system tests. They exercise complete paths through the application, they probably touch a lot of different modules, and they make minimal use of mocks and stubs, because part of the goal of an integration test is exactly to test the interactions between pieces. You don't want to stub or control those interactions; you want to let the system do what it would really do if this were a scenario happening in production.

So how would we compare these different kinds of tests? There are a few different axes we can look at. One of them is how long they take to run. Both RSpec and Cucumber have fairly high startup times, but as you'll see, as you start adding more and more RSpec tests and using autotest to run them in the background, by and large, once RSpec gets off the launching pad it runs specs really fast, whereas running Cucumber features just takes a long time, because Cucumber essentially fires up your entire application. Later in the semester we'll see a way to make Cucumber even slower, which is to have it fire up an entire browser and basically act like a puppet, remote-controlling Firefox so you can test JavaScript code. I think we'll be able to work with our friends at SourceLabs so you can do that in the cloud; that will be exciting. So: "run fast" versus "run slow".

Resolution: if an error happens in your unit tests, it's usually pretty easy to figure out and track down the source of that error, because the tests are so isolated. You've stubbed out everything that doesn't matter and you're focusing on only the behaviour of interest, so if you've done a good job of that, when something goes wrong in one of your tests there aren't a lot of places it could have gone wrong. In contrast, if you're running a Cucumber scenario that's got, say, 10 steps, and every step is touching a whole bunch of pieces of the app, it could take a long time to get to the bottom of a bug. So there is a tradeoff in how well you can localize errors.

Coverage: if you write a good suite of unit and functional tests, you can get really high coverage. You can run your SimpleCov report, identify specific lines in your files that have not been exercised by any test, and then go write tests that cover them. So, figuring out how to improve your coverage, for example at the C0 level, is something much more easily done with unit tests,
whereas with a Cucumber scenario you are touching a lot of parts of the code, but you are doing it very sparsely. So if your goal is to get your coverage up, use the tools that work at the unit level, so you can focus on understanding which parts of your code are undertested and then write very targeted tests to cover them.

Putting those pieces together: unit tests, because of their isolation and their fine resolution, tend to use a lot of mocks to isolate the behaviours you don't care about. But that means, by definition, you're not testing the interfaces, and it's received wisdom in software that a lot of the interesting bugs occur at the interfaces between pieces, not within a class or within a method; those are sort of the easy bugs to track down. At the other extreme, the more you move towards integration testing, the less you're supposed to rely on mocks, for exactly that reason. Now, as we saw, if you're testing something like a service-oriented architecture where you have to interact with a remote site, you still end up doing a fair amount of mocking and stubbing so that your tests don't depend on the Internet to pass, but generally speaking you try to remove as many of the mocks as you can and let the system run the way it would run in real life. The good news is that you are testing the interfaces; the bad news is that when something goes wrong in one of the interfaces, because your resolution is not as good, it may take longer to figure out what it is.

So the high-order bit from this tradeoff is that you don't want to rely too heavily on any one kind of test. They serve different purposes, and whether you're trying to exercise your interfaces more or trying to improve your fine-grained coverage affects how you develop your test suite, which you'll evolve along with your software.

Now, we've used a certain set of terminology in testing. It's the terminology that, by and large, is most commonly used in the Rails community, but there's some variation, [and] there are some other terms you might hear if you go get a job somewhere. You might hear about mutation testing, which we haven't done. This is an interesting idea, popularized, I think, by Ammann and Offutt, who have sort of the definitive book on software testing. The idea is: suppose I introduce a deliberate bug into my code; does that force some test to fail? Because if I changed "if x" to "if not x" and no tests fail, then either I'm missing some coverage or my app is very strange and somehow nondeterministic.

Fuzz testing, which Koushik Sen may talk more about, is basically the "10,000 monkeys at typewriters" approach: throwing random input at your code. What's interesting about it is that the tests we've been writing are essentially crafted to test the app the way it was designed to be used, whereas fuzz testing is about testing the app in ways it wasn't meant to be used. What happens if you throw enormous form submissions at it? What happens if you put control characters in your forms? What happens if you submit the same thing over and over?
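[To make the fuzz-testing idea concrete, here is a hedged sketch of what it could look like as an RSpec request spec. The /movies route, the parameter names, and the "no 5xx response" acceptance criterion are all invented for illustration; they are not from the lecture.]

```ruby
# A rough sketch of fuzz-style testing as a Rails request spec: throw input
# the app was never designed for and check that it fails gracefully.
require 'rails_helper'

RSpec.describe 'submitting hostile input', type: :request do
  it 'never responds with a server error to garbage form data' do
    20.times do
      # Build a large string of random printable and control characters.
      garbage = Array.new(5_000) { rand(1..126).chr }.join
      post '/movies', params: { movie: { title: garbage, rating: garbage } }
      # Whatever the app does (reject, redirect, re-render the form),
      # it should not blow up with a 5xx error.
      expect(response.status).to be < 500
    end
  end
end
```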
And Koushik has a statistic that Microsoft finds up to 20% of their bugs using some variation of fuzz testing, and that about 25% of common Unix command-line programs can be made to crash [when] put through aggressive fuzz testing.

Define-use coverage is something we haven't done, but it's another interesting concept. The idea is that at some point in my program there's a place where I define, that is, assign a value to, some variable, and then there's a place downstream where presumably I'm going to consume that value; someone is going to use it. Have I covered every pair? In other words, do I have tests in which every pair of defining a variable and using it somewhere is exercised at some point in my test suite? It's sometimes called DU-coverage.

Other terms that I think are not as widely used anymore are blackbox versus whitebox, or blackbox versus glassbox. Roughly, a blackbox test is one that is written from the point of view of the external specification of the thing. [For example:] "This is a hash table. When I put in a key, I should get back a value. If I delete the key, the value shouldn't be there." That's a blackbox test, because it doesn't say anything about how the hash table is implemented and it doesn't try to stress the implementation. A corresponding whitebox test might be: "I know something about the hash function, and I'm going to deliberately create hash keys in my test cases that cause a lot of hash collisions, to make sure I'm testing that part of the functionality." Now, a C0 test coverage tool like SimpleCov would reveal that, if all you had were blackbox tests, the collision-handling code might not be getting hit very often. That might tip you off: OK, if I really want to strengthen that, if I want to boost coverage for those tests, now I have to write a whitebox or glassbox test. I have to look inside, see what the implementation does, and find specific ways to try to break the implementation in evil ways.

So, I think, testing is a kind of way of life, right? We've gotten away from the phase of "we build the whole thing and then we test it" and into the phase of "we're testing as we go." Testing is really more like a development tool, and like so many development tools, its effectiveness depends on whether you're using it in a tasteful manner. So, you could say: "Well, let's see, I kicked the tires. You know, I fired up the browser, I tried a couple of things (claps hands). Looks like it works!
Deploy it!" That's obviously a little more cavalier than you'd want to be. And, by the way, one of the things we discovered with this online course just starting up: when 60,000 people are enrolled in the course and 0.1% of them have a problem, you get 60 emails. The corollary is that when your site is used by a lot of people, some stupid bug that you didn't find, but that could have been found by testing, can very quickly generate *a lot* of pain.

On the other hand, you don't want to be dogmatic and say, "Until we have 100% coverage and every test is green, we absolutely will not ship." That's not healthy either, and statement coverage doesn't necessarily correlate with test quality: unless you can say something about the quality of your tests, just because you've executed every line doesn't mean you've tested the interesting cases. So, somewhere in between, you could say, "Well, we'll use coverage tools to identify undertested or poorly tested parts of the code, and we'll use them as a guideline to help improve our overall confidence level."

But remember, Agile is about embracing change and dealing with it. Part of change is that things will change in ways that cause bugs you didn't foresee, and the right reaction is to be comfortable enough with the testing tools [so] that you can quickly find those bugs: write a test that reproduces the bug, and then make the test green. Then you'll have really fixed it. That is, the way you really fix a bug is that you create a test that correctly fails, reproducing the bug, and then you go back and fix the code to make that test pass.

Similarly, you don't want to say, "Well, unit tests give you better coverage, they're more thorough and detailed, so let's focus all our energy on those," as opposed to, "Oh, focus on integration tests, because they're more realistic, right? They reflect what the customer said they want, so if the integration tests are passing, by definition we're meeting a customer need." Again, both extremes are unhealthy, because each one can find problems that would be missed by the other. So having a good combination of them is what it's all about.

The last thing I want to leave you with, in terms of testing, is TDD versus what I call conventional debugging, i.e., the way that we all kind of do it even though we say we don't. And we're all trying to get better, right? We're all in the gutter; some of us are looking up at the stars, trying to improve our practices. Having now lived with this for three or four years myself, and, I'll be honest, three years ago I didn't do TDD, I do it now because I find that it's better, and here's my distillation of why I think it works for me. Sorry, the colours are a little weird, but the left column of the table says "Conventional debugging" and the right column says "TDD".

So what's the way I used to write code? Maybe some of you still do this. I write a whole bunch of lines, maybe a few tens of lines of code. I'm sure they're right; I mean, I'm a good programmer, right? This is not that hard. I run it. It doesn't work. OK, fire up the debugger, start putting in printf's. If I'd been using TDD, what would I do instead?
Well, I'd write a few lines of code, having written a test first. So as soon as the test goes from red to green, I know I wrote code that works, or at least that the parts of the behaviour I had in mind work, because I had a test.

OK, back to conventional debugging. I'm running my program, trying to find the bugs. I start putting in printf's everywhere to print out the values of things, which, by the way, is a lot of fun when you're trying to read them out of the 500 lines of log output you get in a Rails app, trying to find your printf's. You know: "I know what I'll do, I'll put in 75 asterisks before and after. That will make it readable." (laughter) OK, raise your hands if you don't do this! Thank you for your honesty. (laughter) Or I could do the other thing. Instead of printing the value of a variable, why don't I write a test that inspects it with an expectation? Then I'll know immediately, in bright red letters, if that expectation wasn't met.

OK, I'm back on the conventional debugging side. I break out the big guns: I pull out the Ruby debugger, I set a breakpoint, and I start tweaking: "Oh, let's see, I have to get past that 'if' statement, so I have to set that thing. Oh, I have to call that method, and so I need to..." No! If I'm going to do that anyway, I could instead just do it in a file: set up some mocks and stubs to control the code path and make it go the way I want.

And now: "OK, for sure I've fixed it! I'll get out of the debugger and run it all again!" And, of course, nine times out of ten you didn't fix it, or you partly fixed it but didn't completely fix it, and now I have to do all these manual things all over again. Or: I already have a bunch of tests, I can just rerun them automatically, and if some of them fail, "Oh, I didn't fix the whole thing. No problem, I'll just go back."

So the bottom line is that you could do it on the left side, but you're using the same techniques in both cases. The only difference is that in one case you're doing it manually, which is boring and error-prone; in the other case you're doing a little more work, but you can make it automatic and repeatable, and have some high confidence that as you change things in your code you are not breaking stuff that used to work. Basically, it's more productive. You're doing all the same things, but with a small "delta" of extra work you are using your effort at much higher leverage. So that's my view of why TDD is a good thing. It doesn't require new skills; it just requires [you] to refactor your existing skills.

I'll also make an honest confession: when I started doing this, it was, "OK, I'm going to be teaching a course on Rails, I should really focus on testing." So I went back to some code I had written that was working, you know, decent code, and I started trying to write tests for it, and it was *so painful*, because the code wasn't written in a way that was testable. There were all kinds of interactions; there were nested conditionals. If you wanted to isolate a particular statement and have a test trigger just that statement, think of the amount of stuff you'd have to set up in your test to make that happen (remember when we talked about mock train wrecks?): you have to set up all this infrastructure just to exercise one line of code. And you do that and you go, "Gawd, testing is really not worth it! I wrote 20 lines of setup so that I could test two lines in my function!"
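[Here is an invented sketch of the kind of spec being described: a chain of doubles and a pile of setup just to exercise one line of application code. None of these class or method names are from the lecture; they only illustrate the shape of the problem.]

```ruby
# Invented example of hard-to-test code: the method reaches through a chain
# of collaborators (movie.studio.owner.email), so the spec has to build a
# matching train of doubles just to reach the one line under test.
class Mailer
  def self.deliver(to, subject); end   # stand-in for a real mailer
end

class FlopNotifier
  def notify(movie)
    if movie.gross < movie.budget / 10              # buried condition
      Mailer.deliver(movie.studio.owner.email,      # mock train wreck
                     "#{movie.title} flopped")
    end
  end
end

RSpec.describe FlopNotifier do
  it 'emails the studio owner about a flop' do
    # Many lines of setup for a single expectation:
    owner  = double('owner', email: 'boss@example.com')
    studio = double('studio', owner: owner)
    movie  = double('movie', title: 'Apocalypse Then',
                             gross: 1_000, budget: 50_000_000, studio: studio)
    allow(Mailer).to receive(:deliver)

    FlopNotifier.new.notify(movie)

    expect(Mailer).to have_received(:deliver).with('boss@example.com', anything)
  end
end
```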
What that's really telling you, as I now realize, is that your function is bad. It's a badly written function; it's not a testable function. It has too many moving parts whose dependencies can't be broken apart, and there are no seams in the function that allow me to test the different behaviours individually. And once you start doing test-first development, because you have to write your tests in small chunks, this problem largely goes away. So that's been my epiphany.
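[Continuing that invented sketch: one way test-first pressure tends to reshape such code is by introducing seams, here an injected collaborator and a small predicate method, so that each behaviour can be tested on its own with almost no setup. Again, the names are made up for illustration.]

```ruby
# The same invented example refactored with seams: the mailer is injected and
# the flop decision is its own method, so each piece is independently testable.
class FlopNotifier
  def initialize(mailer)
    @mailer = mailer                    # seam: the collaborator can be swapped in tests
  end

  def flop?(movie)
    movie.gross < movie.budget / 10     # seam: the condition is testable on its own
  end

  def notify(movie, owner_email)
    @mailer.deliver(owner_email, "#{movie.title} flopped") if flop?(movie)
  end
end

RSpec.describe FlopNotifier do
  let(:movie) do
    double('movie', title: 'Apocalypse Then', gross: 1_000, budget: 50_000_000)
  end

  it 'knows a flop when it sees one' do
    expect(FlopNotifier.new(nil).flop?(movie)).to be true
  end

  it 'emails the owner about a flop' do
    mailer = double('mailer')
    expect(mailer).to receive(:deliver).with('boss@example.com', anything)
    FlopNotifier.new(mailer).notify(movie, 'boss@example.com')
  end
end
```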