Monday, June 30, 2008

subprocessed

PyGame tests are structured so that for each module in the pygame package (e.g. pygame.sprite, pygame.color) there is a test/xxxx_test.py file containing the corresponding unit tests. PyGame has an automated build page that shows build and test results for the latest svn version of PyGame on a variety of platforms and versions of Python. It uses regular expressions to parse the results of the test runner script.

The test runner script compiles tests from each of the xxxx_test.py files and runs them in a single process. Advantage: speed. Disadvantage: instability. PyGame uses a lot of C code, and where there is C code there is potential for strange errors.
As one example, there was a test for the ability to save OpenGL surfaces which would segfault on Windows. This would stop the test runner halfway through, leaving its output in a form the automated build page could not decipher.

"Build Successful, Invalid Test Results"

Another issue with running all tests in one process is the need to restore a "fresh" state for tests that rely on it. Conflicts between tests can cause the test runner script to crash completely. On the other hand, such conflicts have uncovered some obscure bugs.

Besides writing tests for individual units, I have recently been working on adding a subprocess mode to the Python test runner script. It processes the output of each module's test script and outputs the results in the same form as the single process mode.

There is a library called subunit that uses os.fork() to run unittest suites in subprocesses, which seemed like it would be a perfect candidate for the job. Unfortunately, Windows doesn't have the fork system call, so it was not an option. Windows Python does not even provide os.kill().

What good is running all tests in subprocesses if one of them hangs and Python is using a blocking call to retrieve its output?
As I was going to the trouble of making a subprocess mode, I realized I should deal with this possibility. Unfortunately the Python subprocess module doesn't ship with async calls, but I found a recipe on the ActiveState Python Cookbook site.

On Windows it relies on win32pipe and win32file from the pywin32 package. I worked around the lack of os.kill on Windows by using system calls to "taskkill" or "pskill". If a wayward test suite running in a subprocess doesn't finish within a specified time allowance, it is killed.
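For comparison, here is a minimal sketch of the same idea (give a child process a time allowance, then kill it) using only the modern standard library. It is not the recipe or the taskkill workaround described above: it assumes Python 3.5 or later, where subprocess.run accepts a timeout argument. The name run_with_timeout and the one-line child scripts are made up for illustration.

```python
import subprocess
import sys

def run_with_timeout(cmd, timeout_seconds):
    """Run cmd and return (return_code, output, timed_out)."""
    try:
        completed = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired as exc:
        # subprocess.run kills the child itself before raising TimeoutExpired.
        return None, exc.output or b"", True
    return completed.returncode, completed.stdout, False

# A stand-in for a wayward suite: sleeps far longer than its allowance.
hanging = [sys.executable, "-c", "import time; time.sleep(30)"]
return_code, output, timed_out = run_with_timeout(hanging, timeout_seconds=1)

# A stand-in for a well-behaved suite.
quick = [sys.executable, "-c", "print('ok')"]
quick_code, quick_output, quick_timed_out = run_with_timeout(quick, timeout_seconds=30)
```

A caller can treat timed_out much like a complete failure and report it in whatever form the result parser expects.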

COMPLETE_FAILURE_TEMPLATE = """
======================================================================
ERROR: all_tests_for (%s.AllTestCases)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test\%s.py", line 1, in all_tests_for

subprocess completely failed with return code of %s

cmd: %s

return (abbrv):
%s

"""
# Leave that last empty line else build page regex won't match


Running each test suite in a subprocess is a huge performance hit. For the automated build page, I don't think the performance hit will really affect the experience, as it all runs headlessly from cron jobs. Nevertheless, I added the ability to run subprocessed tests simultaneously in multiple threads. The single process mode is still available, as is running module-specific test suites.
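As a rough illustration of running subprocessed suites from several threads, here is a sketch built on concurrent.futures from the modern standard library. It is an assumption about how such a mode could look, not the code in the test runner. The function run_suite and the one-line fake suites are hypothetical stand-ins for the xxxx_test.py scripts.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

def run_suite(code):
    """Run one fake suite in its own interpreter; return (return_code, output)."""
    completed = subprocess.run(
        [sys.executable, "-c", code],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return completed.returncode, completed.stdout

# Hypothetical stand-ins for real test scripts.
fake_suites = [
    "print('all_ok')",
    "import sys; print('failures1'); sys.exit(1)",
    "print('another_ok')",
]

with ThreadPoolExecutor(max_workers=2) as pool:
    # map() returns results in submission order, whatever order they finish in.
    results = list(pool.map(run_suite, fake_suites))
```

Each thread spends its time waiting on a child process, so threads are enough here even though the work itself happens in separate interpreters.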

I wrote some tests comparing the output of (single|sub)process modes running a group of fake test suites, some all OK, some with errors and failures.

$ run_tests__test.py
all_ok suite OK
failures1 suite OK

2/2 passes

-h for help

$ run_tests__test.py -h

-v, to output diffs even on success
-u, to output diffs of unnormalized tests


The standard library module difflib is very good, and extremely well documented.

Other than the obvious differences such as timing which are normalized before comparison, all is OK :)
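A minimal sketch of the normalize-then-diff idea, assuming the only expected difference is the elapsed time in unittest's summary line. The two sample outputs and the helper names normalize and diff are invented for illustration; the real comparison script may normalize more than this.

```python
import difflib
import re

# Hypothetical outputs from the two modes; only the timing differs.
single_output = "Ran 3 tests in 0.012s\n\nOK\n"
sub_output = "Ran 3 tests in 1.734s\n\nOK\n"

def normalize(text):
    """Replace the elapsed time with a fixed placeholder."""
    return re.sub(r"in \d+\.\d+s", "in X.XXXs", text)

def diff(a, b):
    """Return the unified diff between two blocks of text as a list of lines."""
    return list(difflib.unified_diff(
        a.splitlines(), b.splitlines(),
        fromfile="single", tofile="subprocess", lineterm="",
    ))

raw_diff = diff(single_output, sub_output)              # non-empty: timings differ
normalized_diff = diff(normalize(single_output),
                       normalize(sub_output))           # empty: outputs agree
```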

Wednesday, June 25, 2008

Testing

Just a quick note,

Still waiting on my fan, it's getting shipped in from Amurricah.

I wrote a run_tests_sub.py the other day that uses subprocess to run each xxxx_test.py in the trunk/test directory.

It will run with an optional threads parameter: -t num_threads

$ run_tests_sub.py -t 4

Apparently this runs faster on multi-core.



It should output results similar to run_tests.py, though it may need tweaking to get it to run transparently in place of run_tests.py for the build page.

Speaking of the build page, Rene and I have had a few ideas for a combined build / test web app that collected builds and test statistics ( profiling / passes etc). Also, a means to distribute the writing of tests. Many hands make light work.

If it was possible to be assigned a stub of a test to fill out and then post it back painlessly we could quite quickly increase the coverage of our tests. If twenty people filled out 1 test a week, then over a month that would be 80 extra unit tests.

ATM there are "FAILED (failures=232)" unimplemented tests, and possibly that many again that haven't been stubbed out, waiting to be written.

$ run_tests.py -i


Will show tests that need fleshing out.

Friday, June 20, 2008

Aha!

I realised why the change from CONSTANT = (expr) to CONSTANT = [expr] fixed the bug in color_test.py.

A generator expression is only good for one iteration, and after that it will act as an empty sequence. I thought it would be a reusable, lazily evaluated equivalent of a list comp. Turns out I was dead wrong.

I went over the stub generator recently and it's pretty much in its finalized form as far as the naming scheme is concerned.
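A small sketch of the difference, with made-up names (the real fixtures in color_test.py are not shown here):

```python
squares_gen = (n * n for n in range(3))    # CONSTANT = (expr)
squares_list = [n * n for n in range(3)]   # CONSTANT = [expr]

gen_first_pass = list(squares_gen)     # consumes the generator
gen_second_pass = list(squares_gen)    # nothing left to yield

list_first_pass = list(squares_list)
list_second_pass = list(squares_list)  # a list can be iterated again and again
```

Any test that iterates the generator after the first pass loops over nothing.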

Can't wait to get my own computer back in action.

Wednesday, June 18, 2008

Fan

Damn fan on my laptop packed it in. I will have to convince my friend to let me install linux on his windows box while I wait for a replacement. Can't get a windows build of development pygame at the moment due to failing tests... or can I? Temporarily disable the failing tests and let the build farm run?

What a pain in the arse.

Thursday, June 12, 2008

Happenings

I have been writing unittests using the naming scheme (see below) keeping to it as much as possible.

There have been a few modifications, but that's fine as long as I am consistent. I haven't yet written the part of the test stub generator that filters out any generated tests for units that already have tests written. I am letting the writing of tests dictate the naming scheme's evolution.

Will post some more thoughts on the naming scheme in days to come. Also thoughts on one to one test names.

Thoughts on speed of test suites, isolating "dangerous" tests that can crash the whole test suite.

Thumb Rules: Testing Generalities

It's been quite an illuminating experience spending most of my time of late writing tests. A lot of the rules of thumb I have read about have proven to make good sense.

With the unittest framework especially, it seems to pay to keep the number of assertions per test to a minimum. Sometimes this increases the verbosity of your tests, something that Python programmers seem to almost despise. People tend to optimize for elegance of code and to some extent performance. Test code should probably be optimized for other priorities.

Why does it pay though? The unittest framework stops any test dead in its tracks on any error or assertion failure. If you have a whole battalion of assertions grouped under one test, on a failure you will only get information about one failure, and you miss out on a lot of context that would otherwise have been reported. Is there anything common to all these failures?

For example, there was a bug in some of the sprite collision tests (that intriguingly was not apparent on windows). The problem eluded me, but my brother discovered that it was due to testing the equivalence of lists of sprites, one of which was sourced from a dict, the order of which can not be guaranteed.
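One common way to avoid that kind of order dependence is to compare the two collections in a way that ignores order. The sketch below uses plain strings as hypothetical stand-ins for sprites; it is not the fix that was applied to the sprite tests.

```python
expected = ["s1", "s2", "s3"]
collided = ["s3", "s1", "s2"]   # same sprites, arriving in a different order

order_sensitive_equal = (collided == expected)
order_insensitive_equal = (sorted(collided) == sorted(expected))
```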

At the time there was a squadron of tests under one test, test_spritecollide.
( I restructured the assertions while renaming tests to fit the test stubber naming scheme. )

Running the test on linux, it would only report one failure when in fact the bug was repeated through about 4 - 5 assertions. Fix one and then the next assertion would fail.

Had they been structured with fewer assertions per test, showing all the failures, it's possible a programmer of less ability like myself would have been able to solve the problem. Other programmers would have solved it instantly.

This context makes it easier to home in on the real problem. This brings me to another thought. unittests, what are they?

Unit tests; tests of units. Leaving aside the definition of unit, tests for what? One could say that they are testing for defined behaviour. Essentially then, you are testing for bugs, as if the unit is not working within defined behaviour, the unit is buggy.

What do you do when something is buggy? You debug of course. Tests then could or even should be debugging aids, especially useful when the tests are written before the actual units.

If you are not using tests for debugging then what? Some temporary scaffolding that with a tiny bit more energy could have been a test?

How to make the tests help in debugging without spending too much extra energy?

I'm not really sure on this, other than trying to keep assertions per test to a minimum. Another thing that may help is to not use anonymous expressions in the tests. If everything has been named, it's easier to use a debugger and get a glimpse of what's going on.

Tests should probably sacrifice compactness and abstraction for explicitness. There is probably a line to draw somewhere near repeating yourself too much (copy paste programming).

Sometimes it is easy to miss some bugs in your tests that give a false sense of everything being "OK".

An example of that is some tests for the Color type properties. I was looping over a generator expression that I defined for some test fixtures. All the tests were passing OK. It turns out that for some reason ( I still don't know why ) in the scope of the tests, the expression was behaving as an empty sequence.

eg

for fixture in []:
    assert something_about(fixture) == something

Nothing was ever asserted, so the tests passed. I have since changed the generator expression to a list comp and the tests are asserting themselves. ( expr ) => [ expr ]
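A sketch of how a loop like that can pass without checking anything, and one cheap guard against it. It assumes the fixtures were a generator that something else had already iterated; the values and names are invented for illustration.

```python
fixtures = (value for value in (0, 128, 255))
consumed_elsewhere = list(fixtures)   # something iterates the generator first

checked = 0
for fixture in fixtures:              # now behaves like `for fixture in []:`
    assert 0 <= fixture <= 255
    checked += 1

# The loop "passed", but it asserted nothing at all.
vacuous = (checked == 0)

# Guard: use a list and assert that there is something to test.
fixtures = [value for value in (0, 128, 255)]
assert fixtures, "no fixtures: the loop would assert nothing"
checked_with_list = sum(1 for fixture in fixtures if 0 <= fixture <= 255)
```

Counting how many fixtures were actually checked, or asserting that the fixture list is non-empty, turns a silent pass into a visible failure.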