Thursday, July 10, 2008

Comedy Of Errors

** Build Page / Testing **
==========================

As reported earlier, in reaction to the crashing tests rendering the build page ineffective, I have been working on creating a script to isolate test modules in subprocesses. The approach I took, was to compile the results of each isolated test into the same form as the old test runner. A quick hack, or so I hoped.

I realised that subprocess out of the box has no cross platform non-blocking calls, so you can't timeout on hung tests. I had to find a recipe for this which unfortunately required win32 extensions. Not really a big deal but still time spent and dependencies.

So what we have is a test runner parsing the results of a unittest text report, meant for human consumption, which is then in turn parsed by the automated build pages regexes. This seems pretty ridiculous, especially as the form is not exactly machine friendly. I could have (should have?) hacked into the build page code and modularised the test parsing code there, sharing between the test runner and build page.

But then if you are going to do that why not just replace the TextTestRunner class with something completely customised for the job? Replace unittest bit by bit in an adhoc as-needed fashion? Slowly building a framework? I didn't want to. I'm not really supposed to be and that was the psychology in play.

Another? foolish design decision I made, based on a shallow visual aesthetic of less LOC, was to parse unittest results in a way that only worked when there was no "test noise". What do I mean by test noise? print statments left in source code. C extensions that don't respect sys.stderr, sys.stdout redirection/supression.

See below exhibit A, a specimen from a sunny day of testing.

...............
---------------------------------------------------------------------
Ran 15 tests in 1.234s

As the tests are running unittest prints to a stream, by default stderr, but it can be any file-like object of choice, either a dot an E or an F, mapping to pass, error or fail. I used a simple regex ^[.EF]*$ to find any "dots" in the return output. If there were any, I would take a slice from the length of the dots. From there I would take the first of a split at the "Ran xxx tests" boundary, defined as '%s\nRan' % (70 * '-'). In between the DOTS and the RAN_TEST_DIV (thus named) would lay the failures.

To piece it all together as if it was the output of one run I would "join the dots" and join the failures. Then at the end count the total length of DOTS (., E, F combined), E)rrors and F)ailures. Voila. Worked a charm.

What the hell was I thinking? The whole point of the exercise was to create a reliable test runner. I suppose I thought I was. I wrote a few tests for some spectacularly unimaginative cases. I compared output of single process mode and subprocess mode running some fake test suites, zero assertions, all passing, some failures, some errors. The subprocess mode was character for character perfect in its mime artistry. In fact it was for this easy, pull apart, bind together, compare automated testing that I did it in the first place.

All was simple and peaceful, until I finally got a linux test box working again. (my laptop fan died) I used ssh to log in and run the tests from my friends windows machine. Of course one of the tests that required initiating the display failed.

single process mode: 504 tests, FAIL (failures=1)
subprocess mode: 495 tests OK

What the hell was going on? With horror I realized what I had done. Something was wreaking havoc with the fragile little regex. On failure a huge amount of debugging output was put out by one of the SDL functions interupting the DOTS. I thought about rewriting it using some more substantive regular expressions. I tossed up between doing that and redirecting sys.(stdout|stderr) and passing a StringIO to unittest for test results. I figured by doing that I would be able to keep the comparison tests I had in place, and for that matter the same degree of mimicry. I opted for redirecting std(err|out). I imagined other uses for this at the same time, none all that compelling upon reflection and only useful if implemented in another manner. (only show stdout/stderr on failure of test, can leave print statement debugging in there, I did global redirection)

Of course to do that I needed to create a command line option for each individual test module to call from the "master" script in subprocess mode. Because unittest.main() is running with it's own getopt parsing, you can't just add an option and check sys.argv or use optparse. You have to do either and then clear those options from sys.argv which would otherwise cause unittest to error. So more fun hamfisting around with unittest. I realised that I would need to do that at some stage for profiling cmd line options so there was another push in that direction. All the time wondering whether I should just completely override the parseArgs method.

I replaced the test_utils.get_fail_incomplete_option();unittest.main() in each module with a test_utils.get_command_line_options(). unittest.main() always calls sys.exit() on completion of tests so I had to subclass it, overriding one of it's methods. I did this because after catching the unittest result stream to a StringIO, I would restore stderr and write the results to it.

I added in some test cases, print_stdout and print_stderr, comparing the results (I of course had to put a redirect mode onto single process mode for purposes of testing). Everything was OK again, until I ran it again on the the linux box through ssh.

495 tests OK. (should have been 504 with one failure)

Damn it! So it seems that some stderr, stdout is not redirected. I imagine it's mostly C extensions (or system calls) and the like that would do this but then that is pygame all over. Briefly I pondered printing results back out on stdout, and just PrayingTM that any such noise would always be stderr.

So what did I do? What any fool, already invested would do. I decided to markup the results, with lines like.

<!-- UNITTEST_RESULTS_START_HERE --!>


I created 3 sections using 2 divisors. The first is all the noise output, anything not respecting redirection. The second is the unittest results and the last is the multiplexed results, what you would see if running the script in a shell. I overrode the write method on a StringIO collecting unittest results and made it also write to (a previously redirected) stdout. using subprocess.Popen(...., stdout=subprocess.PIPE, stderr = subprocess.STDOUT) everything is muxed together. I then wrote a function that regex splits the 3, keeping the results for compiling DOTS. It's a long way from Kansas though isn't it Toto.

What a PITA? That's not even the half of it. I ended up having to rewrite all the command lines I was passing to subprocess.Popen from string template to lists so it would work cross platform. Also, the way subprocess multiplexes stderr and stdout when you use the same file object for both is inconsistent cross platform. What you would see is not neccessarily what you get. On windows it would suffice to just "print compiled_test_results", but on linux had there was need to print >> sys.stderr.

All in all, a lot of tipsy toeing around unittest. I really made a complete tangled webby mess of the whole job. A black comedy of errors. I'm not sure whether to remove the stderr/stdout redirection and replace the regexes with something less fragile. It's already been too much of a hole, sucking in time. I would have to update the run_tests__tests also.

What would I do differently looking back? What would I do if I had no constraints? Unfortunately, probably two very different questions.

** What I would do differently? **
==================================

This much I do know, the build page and the test runner script require intersecting functionality. They both parse the results of a unittest TextTestRunner output to gather statistics on test results. I could have modularised this parsing functionality, sharing between the two of them. This really begs the question though, why parse something designed for human consumption at all? Why not pass a customised test runner class into unittest?

Still there is the problem of communication across process boundaries, solved by using an asynchronous extension class of subprocess.Popen. Would you log the result of each processes output to a file using something like xml? Or maybe, pickling the results and then joining them back together? You could even have a client / server architecture, using sockets to transfer pickled test results as native python objects back to the server to piece together.

As well as the requirement for isolation of tests, we are wanting to add profiling functionality and tagging to split tests into different groups.

No comments: