Thursday, June 12, 2008

Thumb Rules: Testing Generalities

It's been quite an illuminating experience spending most of my time of late writing tests. A lot of the rules of thumb I have read about have proven to make good sense.

With the unittest framework especially, it seems to pay to keep the number of assertions per test to a minimum. Sometimes this increases the verbosity of your tests, something that Python programmers seem to almost despise. People tend to optimize for elegance of code and, to some extent, performance. Test code should probably be optimized for other priorities.

Why does it pay, though? The unittest framework stops any test dead in its tracks on any error or assertion failure. If you have a whole battalion of assertions grouped under one test, on a failure you will only get information about the first one, and you miss out on a lot of context that would otherwise have been reported. Is there anything common to all these failures? With only one report you cannot tell.

For example, there was a bug in some of the sprite collision tests (that intriguingly was not apparent on Windows). The problem eluded me, but my brother discovered that it was due to testing the equivalence of lists of sprites, one of which was sourced from a dict, the order of which cannot be guaranteed.
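A rough sketch of that kind of comparison, and of two order-insensitive alternatives. FakeSprite is a hypothetical stand-in so the example does not depend on pygame, and this is not necessarily how the actual tests were fixed. It assumes Python 3.2 or later for assertCountEqual:

```python
import unittest


class FakeSprite:
    """Hypothetical stand-in for a sprite; not pygame's Sprite class."""

    def __init__(self, name):
        self.name = name


a, b, c = FakeSprite("a"), FakeSprite("b"), FakeSprite("c")

expected = [a, b, c]
# Same members in a different order, as can happen when a list is built
# by iterating over a container with no guaranteed order.
collided = [c, a, b]

# List equality is order-sensitive, so this comparison fails.
ordered_match = expected == collided

# Comparing as sets ignores order (and also ignores duplicates).
unordered_match = set(expected) == set(collided)

# unittest's assertCountEqual ignores order but still counts duplicates.
case = unittest.TestCase()
case.assertCountEqual(expected, collided)

print(ordered_match, unordered_match)
```

Which alternative fits depends on whether duplicates matter for the test in question.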

At the time there was a squadron of assertions under one test, test_spritecollide.
(I restructured the assertions while renaming tests to fit the test stubber naming scheme.)

Run on Linux, the test would only report one failure when in fact the bug was repeated through about four or five assertions. Fix one and then the next assertion would fail.

Had they been structured with fewer assertions per test, showing all the failures, it's possible a programmer of lesser ability like myself would have been able to solve the problem. Other programmers would have solved it instantly.

This context makes it easier to home in on the real problem. This brings me to another thought: unit tests, what are they?

Unit tests: tests of units. Leaving aside the definition of unit, tests for what? One could say that they are testing for defined behaviour. Essentially then, you are testing for bugs, since if the unit is not working within defined behaviour, the unit is buggy.

What do you do when something is buggy? You debug of course. Tests then could or even should be debugging aids, especially useful when the tests are written before the actual units.

If you are not using tests for debugging then what? Some temporary scaffolding that with a tiny bit more energy could have been a test?

How to make the tests help in debugging without spending too much extra energy?

I'm not really sure about this, other than trying to keep assertions per test to a minimum. Another thing that may help is to not use anonymous expressions in the tests. If everything has been named, it's easier to use a debugger and get a glimpse of what's going on.
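To illustrate what naming buys you, here is a small sketch. The overlaps helper and the span values are hypothetical, invented only for this example. Both tests check the same thing, but in the second one every intermediate value has a name you can inspect from a debugger or print on failure:

```python
import io
import unittest


def overlaps(a, b):
    """Hypothetical helper: do two (left, right) intervals overlap?"""
    return a[0] < b[1] and b[0] < a[1]


class NamedValuesExample(unittest.TestCase):
    def test_anonymous(self):
        # Compact, but nothing here can be inspected by name.
        self.assertTrue(overlaps((0, 10), (5, 15)))

    def test_named(self):
        # More verbose, but each step is visible in a debugger.
        player_span = (0, 10)
        enemy_span = (5, 15)
        spans_overlap = overlaps(player_span, enemy_span)
        self.assertTrue(spans_overlap)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(NamedValuesExample)
result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
print(result.testsRun, result.wasSuccessful())
```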

Tests should probably sacrifice compactness and abstraction for explicitness. There is probably a line to draw somewhere near repeating yourself too much (copy paste programming).

Sometimes it is easy to miss some bugs in your tests that give a false sense of everything being "OK".

An example of that is some tests for the Color type properties. I was looping over a generator expression that I defined for some test fixtures. All the tests were passing OK. It turns out that, for some reason (I still don't know why), in the scope of the tests the expression was behaving as an empty sequence.

e.g.

    for fixture in []:
        assert something_about(fixture) == something

Nothing was ever asserted, so the tests passed. I have since changed the generator expression to a list comprehension, and the tests are now actually asserting something. ( expr ) => [ expr ]
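One common way for a generator expression to look empty is that something has already iterated over it once. That is only a guess at what happened here, not a confirmed diagnosis. The sketch below shows the behaviour, plus a hypothetical check_all helper (not part of unittest) that fails loudly when a loop has nothing to check:

```python
fixture_gen = (n * 2 for n in range(3))   # generator expression
fixture_list = [n * 2 for n in range(3)]  # list comprehension

first_pass_gen = list(fixture_gen)    # consumes the generator
second_pass_gen = list(fixture_gen)   # nothing left the second time

first_pass_list = list(fixture_list)
second_pass_list = list(fixture_list)  # a list can be iterated again


def check_all(fixtures, predicate):
    """Assert predicate on every fixture; fail if there were no fixtures."""
    checked = 0
    for fixture in fixtures:
        assert predicate(fixture), fixture
        checked += 1
    # Guard against the silent "loop over nothing" case.
    assert checked > 0, "no fixtures were checked"
    return checked


print(first_pass_gen, second_pass_gen, second_pass_list)
```

With a guard like this, an accidentally empty fixture sequence shows up as a failure instead of a pass.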
