Thursday, August 28, 2008

GSOC OverView

My GSOC project was all about testing for PyGame:

  • I wrote lots of tests; almost every module in PyGame now has at least one test

  • Test modules can now be isolated in subprocesses; one segfault no longer brings down the whole test suite

  • Can now test for speed regressions; important for real time software such as games

  • PyGame Automated Build Page extended
    • Shows / Collects more info
    • Runs tests in subprocesses


  • Test Stubbing Utility: A Testing "Todo List"

  • Optional Interactive Tests / Test Tagging



For writing the tests I wrote a small utility that inspects the PyGame package and finds all the untested callables (functions, properties) and creates test stubs, including documentation for each so you don't have to leave the editor. The stubber knows which functions have already been tested by using a naming scheme for all of the tests. Essentially, "test_$callable__$comment", namespaced by having TestCase[s] per Class and a test module per module.

In this way I could create stubs for each module, essentially a TODO list, and cycle through all the modules looking for tests that were easy to write. The functions in PyGame are many and greatly varied, each requiring somewhat specialised knowledge to test. I wasn't able to write tests for all of them but hopefully the test stubbing utility will help enable some testing sprints. I intend to develop a testing website where people can submit bugs/tests in the form of a unittest.

PyGame has a somewhat unique set of requirements compared to most python libraries in that most of the framework is actually written in C. C code when it goes awry can do some very strange things. We had a test runner running all of the tests in one single process so if one failed hard it would bring down the whole suite. This can be a bit of a pain so I developed a test runner that isolates each module in a subprocess.

Some of the tests in PyGame have requirements that make them unsuitable for running as part of the main test suite. For example, some require a CDRom or a JoyStick, take way too long, or need interaction with a human. With the test runner script I extended unittest with the ability to exclude certain tests by tags. The tags can be set at module, class or individual test level and are inheritable/overridable.

Another extension to the test runner was the ability to randomize the run ordering of tests, so along with the test results the seed is printed out. If there are failures you can seed the randomizer with the failure inducing seed. We also wanted to be able to record the timings of each individual test so we could make comparisons between revisions / platforms. I again extended the test runner with that ability.
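The seeded ordering can be sketched roughly as follows, in current Python. The function name, the demo tests and the fixed seed are all made up for illustration; the real runner's code is not shown in this post.

```python
import random
import unittest


def shuffled_tests(suite, seed):
    """Flatten a (possibly nested) suite and shuffle it reproducibly."""
    def flatten(s):
        for item in s:
            if isinstance(item, unittest.TestSuite):
                for test in flatten(item):
                    yield test
            else:
                yield item

    tests = list(flatten(suite))
    # A private Random instance keeps the ordering independent of any
    # other use of the random module during the run.
    random.Random(seed).shuffle(tests)
    return unittest.TestSuite(tests)


class DemoTest(unittest.TestCase):  # hypothetical tests, just to have an order
    def test_a(self): pass
    def test_b(self): pass
    def test_c(self): pass
    def test_d(self): pass


suite = unittest.defaultTestLoader.loadTestsFromTestCase(DemoTest)
seed = 2008  # a real run would pick one at random and print it with the results
first = [t.id() for t in shuffled_tests(suite, seed)]
second = [t.id() for t in shuffled_tests(suite, seed)]
print("seed:", seed)
print("same order both times:", first == second)
```

Feeding the printed seed back in reproduces the ordering that triggered a failure.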

I worked with Brian Fisher to extend the PyGame automated build page to record the test results in a ZODB and utilize the new test runner to run tests in subprocesses. We will be able to use this information for detecting speed regressions amongst other things.

Saturday, August 23, 2008

Johnny, Kick A Hole Right In The Sky

Johnny, Kick a hole right in the sky! Won't somebody testify? Poke a lion in its eye!

I bought pygame-testify.net today, and set up a python/cgi based form that takes a zip and enumerates the results + adds the (safe evaled) test results dict to a ZODB.

I found a multi-part python snippet for POST[ing] of test results.

The test/build page is starting to come together.

I am using htpasswd for security.

Saturday, August 2, 2008

todo_xxxxxxx

I recently altered the "fail incomplete tests" mechanism we use in the pygame test runner. Before we were doing assertions on test_utils.test_not_implemented(). This would check a module level variable test_utils.fail_incomplete_tests, which we would set as desired depending on whether we wanted to fail incomplete tests.

This was a fairly non-invasive technique but as I was already hijacking the test loading mechanism for filtering tests by tags, I realized I could alter the TestLoader class to pick up tests starting with the prefix "todo_" as well as "test_". I would call TestCase.fail directly which would only run if picking up todo_ tests.
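A minimal sketch of a loader that also collects todo_ methods, in current Python. The class names and the todo_test_blit method are hypothetical, and the pygame runner may well have done this differently.

```python
import unittest


class TodoLoader(unittest.TestLoader):
    """Collect "todo_" methods in addition to the usual "test_" ones."""

    def getTestCaseNames(self, testCaseClass):
        names = list(super().getTestCaseNames(testCaseClass))
        original = self.testMethodPrefix
        self.testMethodPrefix = "todo_"
        try:
            names.extend(super().getTestCaseNames(testCaseClass))
        finally:
            self.testMethodPrefix = original
        return names


class SurfaceTest(unittest.TestCase):  # hypothetical test case
    def test_get_size(self):
        self.assertEqual((1, 2), (1, 2))

    def todo_test_blit(self):
        # Only runs (and fails) when a loader picks up todo_ methods
        self.fail("not implemented")


def run(loader):
    suite = loader.loadTestsFromTestCase(SurfaceTest)
    result = unittest.TestResult()
    suite.run(result)
    return result.testsRun, len(result.failures)


print(run(unittest.TestLoader()))  # (1, 0)
print(run(TodoLoader()))           # (2, 1)
```

With the stock loader the incomplete test is invisible; with the patched one it shows up as a failure.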

This of course meant altering all the stubs. I pondered briefly doing a mass search and replace, completely automating it but I don't really trust that for tests.

For the test stubs I have been including the documentation so it's really easy to walk through a test file writing tests without having to leave the editor. I was just using inspect.getdoc to get the __doc__ string.

It seems the documentation included in the .doc files is different to that contained in the __doc__ for each function. The __doc__ seems to be the function signature and a very brief, usually one sentence description. The .doc files contains a lot more detailed descriptions that can be very useful when writing tests.

I quickly added a docs_as_dict() function to makeref.py, then added it to the stub generator. The stub generator will add both the __doc__ and the .doc file documentation to each stub.

I went through semi-manually updating all the unfilled out stubs for each test file with the more complete docs and the new todo_xxxxx test naming. It took about an hour but I feel more confident than if I had just grep'd it.

Everything is pretty much now in place for the test site I wanted to create.

Test Timing
Test Tagging
Isolated Tests

Friday, July 18, 2008

import test.unittest as unittest

I split the test runner further, now into three files, with all the monkey business in unittest_patch.

The patching is done by a patch() function taking an optparse options object as the solo argument, which drives the decisions behind which parts of unittest are patched.

With the features we wanted I had to override some methods in a quite drastic way. I even needed to override TestCase.run, a many many line method. The only way I could do this was to basically copy/paste, alter and monkey-patch in. This meant sometimes calling private members.

Unfortunately, the author of unittest had decided somewhere between python 2.4 and 2.5 that he would rename all the private members from the double underscore preceding __name_mangling convention to a single underscore _caution.

As my mentor said (or something like it), "using an underscore is a warning, that said member is an implementation detail not an interface".
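A small illustration of why the rename matters for copy/pasted overrides. Runner is a hypothetical class, not unittest's.

```python
class Runner(object):  # hypothetical class, not part of unittest
    def __init__(self):
        self.__mangled = "2.4 style"  # stored as _Runner__mangled
        self._single = "2.5 style"    # stored under exactly this name


def patched(self):
    # Defined outside the class body, so no name mangling happens here:
    # this looks for an attribute literally called "__mangled".
    return self.__mangled


r = Runner()
print(sorted(vars(r)))  # ['_Runner__mangled', '_single']

try:
    patched(r)
except AttributeError:
    lookup_failed = True
else:
    lookup_failed = False
print(lookup_failed)  # True
```

A method body copied out of a class and monkey-patched back in has to spell out the mangled name itself, and that name differs between the two unittest versions.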

What to do? We now include a 2.5 version of unittest in the test directory. Apparently pygame has come full circle; it was included way back in the day before PyUnit was part of the standard library.

All of our individual test files, typically $module_test.py, all import an unpatched unittest and run unittest.main() to make the module "conveniently executable". Only when running the complete suite is unittest enhanced with extra functionality.

While I was in there tinkering with the internals, recording timings of individual tests I moved the redirect std(err|out) per module to per test. I then patched the TextTestRunner to dump stderr/stdout on error.

def printErrorList(self, flavour, errors):
    for test, err in ((e[0], e[1]) for e in errors):
        self.stream.writeln(self.separator1)
        self.stream.writeln("%s: %s" % (flavour, test))
        self.stream.writeln(self.separator2)
        self.stream.writeln("%s" % err)

        # DUMP REDIRECTED STDERR / STDOUT ON ERROR / FAILURE
        if self.show_redirected_on_errors:
            stderr, stdout = map(self.tests[test].get, ('stderr','stdout'))
            if stderr: self.stream.writeln("STDERR:\n%s" % stderr)
            if stdout: self.stream.writeln("STDOUT:\n%s" % stdout)


It would be relatively easy to add in support for show locals() etc.

Tuesday, July 15, 2008

Redesign

I decided to (had to) redesign the test runner, this time cutting more directly to the root of matters, overriding select methods of unittest classes.

Before, in subprocess mode, I was calling the individual test modules, which would in turn run unittest.main() with all the attendant pains of cmd line options conflicting and output parsing. (we have to add profiling, exclusion by tags etc). One major design change I made was to unify the single / subprocess modes to use one test runner, (test_runner.py).

In it, along with a lot of utility functions, is defined a run_test() function. It takes a list of modules and an options object as arguments. It compiles a dictionary of the test results and on completion either returns the dict or in subprocess mode pretty prints it to stdout. (This is then eval'd for an all_results.update(result))
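The hand-off can be sketched like this, in current Python. The child script is a made-up stand-in for test_runner.py, and ast.literal_eval is swapped in for the plain eval mentioned above because it only accepts literals.

```python
import ast
import subprocess
import sys

# Hypothetical child: builds a results dict and pretty prints it to stdout
CHILD = r'''
import pprint
results = {"demo_test": {"num_tests": 3, "failures": [], "errors": [],
                         "output": "...\n", "stdout": "", "stderr": ""}}
pprint.pprint(results)
'''


def run_module_in_subprocess():
    proc = subprocess.run([sys.executable, "-c", CHILD],
                          stdout=subprocess.PIPE, universal_newlines=True)
    # The printed repr of the dict is turned back into a dict here
    return ast.literal_eval(proc.stdout)


all_results = {}
all_results.update(run_module_in_subprocess())
print(all_results["demo_test"]["num_tests"])  # 3
```

This only works while nothing else writes to the child's stdout.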

RESULTS_TEMPLATE = {
    'output' : '', # unittest.TextTestRunner output
    'stderr' : '', # stderr output
    'stdout' : '', # stdout output
    'num_tests' : 0, # taken directly from the unittest results object
    'failures' : [], # ditto
    'errors' : [], # ditto
}


In single process mode run_tests.py just imports the run_test() function from test_runner.py and passes it the optparse options object and list of modules to search for tests.

Both run_tests.py and test_runner.py share the same optparse cmd line parser options. In subprocess mode, run_tests.py calls test_runner.py with essentially the same sys.argv it was initiated with. When run as __main__, test_runner.py runs the run_test() function on a list of [args[0]]. Now all the extra functionality and cmd line parsing is in one place.

There were quite a few extra little changes that have made it not perfect but a lot better. Adding exclusion by tagging functionality took 10 minutes, most of the time being spent on picking a format.

|Tags:display|
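A sketch of how such a tag might be read and used to exclude tests, in current Python. The post does not say where the tag lives, so this assumes it sits in the test's docstring and that multiple tags are comma separated; get_tags and filter_names are hypothetical names.

```python
import re
import unittest

TAGS_RE = re.compile(r"\|Tags:([^|]*)\|")


def get_tags(obj):
    """Return the set of tags found in obj's docstring (assumed location)."""
    doc = getattr(obj, "__doc__", None) or ""
    match = TAGS_RE.search(doc)
    if not match:
        return set()
    return set(t.strip() for t in match.group(1).split(",") if t.strip())


class DisplayTest(unittest.TestCase):  # hypothetical test case
    def test_set_mode(self):
        """|Tags:display|"""

    def test_version(self):
        """No tags here."""


def filter_names(case_class, exclude):
    """Names of the tests in case_class carrying none of the excluded tags."""
    loader = unittest.TestLoader()
    return [name for name in loader.getTestCaseNames(case_class)
            if not get_tags(getattr(case_class, name)) & set(exclude)]


print(filter_names(DisplayTest, ["display"]))  # ['test_version']
```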


Adding profiling decorators or whatever other functionality is desired will also be a lot easier now.

Thursday, July 10, 2008

Comedy Of Errors

** Build Page / Testing **
==========================

As reported earlier, in reaction to the crashing tests rendering the build page ineffective, I have been working on creating a script to isolate test modules in subprocesses. The approach I took, was to compile the results of each isolated test into the same form as the old test runner. A quick hack, or so I hoped.

I realised that subprocess out of the box has no cross platform non-blocking calls, so you can't timeout on hung tests. I had to find a recipe for this which unfortunately required win32 extensions. Not really a big deal but still time spent and dependencies.

So what we have is a test runner parsing the results of a unittest text report, meant for human consumption, which is then in turn parsed by the automated build page's regexes. This seems pretty ridiculous, especially as the form is not exactly machine friendly. I could have (should have?) hacked into the build page code and modularised the test parsing code there, sharing between the test runner and build page.

But then if you are going to do that why not just replace the TextTestRunner class with something completely customised for the job? Replace unittest bit by bit in an adhoc as-needed fashion? Slowly building a framework? I didn't want to. I'm not really supposed to be and that was the psychology in play.

Another(?) foolish design decision I made, based on a shallow visual aesthetic of fewer LOC, was to parse unittest results in a way that only worked when there was no "test noise". What do I mean by test noise? print statements left in source code. C extensions that don't respect sys.stderr, sys.stdout redirection/suppression.

See below exhibit A, a specimen from a sunny day of testing.

...............
---------------------------------------------------------------------
Ran 15 tests in 1.234s

As the tests run, unittest prints a dot, an E or an F for each one (mapping to pass, error or fail) to a stream, by default stderr, though it can be any file-like object of choice. I used a simple regex ^[.EF]*$ to find any "dots" in the return output. If there were any, I would take a slice from the length of the dots. From there I would take the first of a split at the "Ran xxx tests" boundary, defined as '%s\nRan' % (70 * '-'). In between the DOTS and the RAN_TEST_DIV (thus named) would lie the failures.
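A reconstruction of that parsing from the description, in current Python. It is a sketch, not the original code: the function name and sample reports are made up, and only the regex and the divider come from the text.

```python
import re

DOTS_RE = re.compile(r"^[.EF]*$", re.MULTILINE)
RAN_TEST_DIV = "%s\nRan" % (70 * "-")


def split_report(output):
    """Return (dots, failures_text) from a TextTestRunner style report."""
    match = DOTS_RE.search(output)
    dots = match.group(0) if match else ""
    rest = output[len(dots):]
    failures = rest.split(RAN_TEST_DIV)[0]
    return dots, failures


SAMPLE = "\n".join([
    "..F",
    70 * "=",
    "FAIL: test_b (demo_test.DemoTest)",
    70 * "-",
    "Traceback (most recent call last):",
    "AssertionError",
    "",
    70 * "-",
    "Ran 3 tests in 0.001s",
    "",
    "FAILED (failures=1)",
    "",
])

dots, failures = split_report(SAMPLE)
print(len(dots), dots.count("F"))  # 3 1

# The same report with stray output landing in the middle of the dots
NOISY = "..SDL debug output\nF" + SAMPLE[3:]
noisy_dots, _ = split_report(NOISY)
print(len(noisy_dots))  # 1
```

On a clean report the count is right; one stray line in the middle of the dots and two tests silently go missing.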

To piece it all together as if it was the output of one run I would "join the dots" and join the failures. Then at the end count the total length of DOTS (., E, F combined), E)rrors and F)ailures. Voila. Worked a charm.

What the hell was I thinking? The whole point of the exercise was to create a reliable test runner. I suppose I thought I was. I wrote a few tests for some spectacularly unimaginative cases. I compared output of single process mode and subprocess mode running some fake test suites, zero assertions, all passing, some failures, some errors. The subprocess mode was character for character perfect in its mime artistry. In fact it was for this easy, pull apart, bind together, compare automated testing that I did it in the first place.

All was simple and peaceful, until I finally got a linux test box working again. (my laptop fan died) I used ssh to log in and run the tests from my friend's windows machine. Of course one of the tests that required initiating the display failed.

single process mode: 504 tests, FAIL (failures=1)
subprocess mode: 495 tests OK

What the hell was going on? With horror I realized what I had done. Something was wreaking havoc with the fragile little regex. On failure a huge amount of debugging output was put out by one of the SDL functions, interrupting the DOTS. I thought about rewriting it using some more substantive regular expressions. I tossed up between doing that and redirecting sys.(stdout|stderr) and passing a StringIO to unittest for test results. I figured by doing that I would be able to keep the comparison tests I had in place, and for that matter the same degree of mimicry. I opted for redirecting std(err|out). I imagined other uses for this at the same time, none all that compelling upon reflection and only useful if implemented in another manner. (only show stdout/stderr on failure of test, can leave print statement debugging in there, I did global redirection)

Of course to do that I needed to create a command line option for each individual test module to call from the "master" script in subprocess mode. Because unittest.main() is running with its own getopt parsing, you can't just add an option and check sys.argv or use optparse. You have to do either and then clear those options from sys.argv which would otherwise cause unittest to error. So more fun hamfisting around with unittest. I realised that I would need to do that at some stage for profiling cmd line options so there was another push in that direction. All the time wondering whether I should just completely override the parseArgs method.

I replaced the test_utils.get_fail_incomplete_option();unittest.main() in each module with a test_utils.get_command_line_options(). unittest.main() always calls sys.exit() on completion of tests so I had to subclass it, overriding one of its methods. I did this because after catching the unittest result stream to a StringIO, I would restore stderr and write the results to it.

I added in some test cases, print_stdout and print_stderr, comparing the results (I of course had to put a redirect mode onto single process mode for purposes of testing). Everything was OK again, until I ran it again on the linux box through ssh.

495 tests OK. (should have been 504 with one failure)

Damn it! So it seems that some stderr, stdout is not redirected. I imagine it's mostly C extensions (or system calls) and the like that would do this but then that is pygame all over. Briefly I pondered printing results back out on stdout, and just PrayingTM that any such noise would always be stderr.

So what did I do? What any fool, already invested would do. I decided to markup the results, with lines like.

<!-- UNITTEST_RESULTS_START_HERE --!>


I created 3 sections using 2 divisors. The first is all the noise output, anything not respecting redirection. The second is the unittest results and the last is the multiplexed results, what you would see if running the script in a shell. I overrode the write method on a StringIO collecting unittest results and made it also write to (a previously redirected) stdout. using subprocess.Popen(...., stdout=subprocess.PIPE, stderr = subprocess.STDOUT) everything is muxed together. I then wrote a function that regex splits the 3, keeping the results for compiling DOTS. It's a long way from Kansas though isn't it Toto.

What a PITA? That's not even the half of it. I ended up having to rewrite all the command lines I was passing to subprocess.Popen from string templates to lists so it would work cross platform. Also, the way subprocess multiplexes stderr and stdout when you use the same file object for both is inconsistent cross platform. What you would see is not necessarily what you get. On windows it would suffice to just "print compiled_test_results", but on linux there was a need to print >> sys.stderr.

All in all, a lot of tipsy toeing around unittest. I really made a complete tangled webby mess of the whole job. A black comedy of errors. I'm not sure whether to remove the stderr/stdout redirection and replace the regexes with something less fragile. It's already been too much of a hole, sucking in time. I would have to update the run_tests__tests also.

What would I do differently looking back? What would I do if I had no constraints? Unfortunately, probably two very different questions.

** What I would do differently? **
==================================

This much I do know, the build page and the test runner script require intersecting functionality. They both parse the results of a unittest TextTestRunner output to gather statistics on test results. I could have modularised this parsing functionality, sharing between the two of them. This really begs the question though, why parse something designed for human consumption at all? Why not pass a customised test runner class into unittest?

Still there is the problem of communication across process boundaries, solved by using an asynchronous extension class of subprocess.Popen. Would you log the result of each processes output to a file using something like xml? Or maybe, pickling the results and then joining them back together? You could even have a client / server architecture, using sockets to transfer pickled test results as native python objects back to the server to piece together.

As well as the requirement for isolation of tests, we are wanting to add profiling functionality and tagging to split tests into different groups.

Tuesday, July 8, 2008

killer redux

I was laboring under the bastard conception that when using subprocess.Popen(), shell=True is required for a subprocess executable to have access to the environment variables. Where the hell did I get that idea? Stupid unquestioned assumption that almost gave birth to a lasting bug.
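A quick check that a child process inherits the environment without shell=True, in current Python. The variable name is made up for the demonstration.

```python
import os
import subprocess
import sys

os.environ["PYGAME_DEMO_VAR"] = "visible"  # hypothetical variable name

child_code = "import os; print(os.environ.get('PYGAME_DEMO_VAR'))"

# No shell=True, and no env argument: the child gets a copy of our environment
out = subprocess.check_output([sys.executable, "-c", child_code],
                              universal_newlines=True)
print(out.strip())  # visible
```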

For the test runner I was using system calls to taskkill or pskill for process control under windows. The idea was to try executing each and if one was on the %PATH% the return code would not be one of err. If this was the case then the search was over and a Popen wrapper of (taskkill|pskill) would suffice as an os.kill().

This worked fine and dandy except that on windows98, there would be no error code if either of the task killers weren't on the path. It would define a useless os.kill.

Lenard, the windows maintainer of PyGame, questioned why use a hacky wrapper of pskill or one of its ilk, when if there was already a reliance on pywin32, why not use win32api.TerminateProcess?

That works fine but does not kill process trees, something I thought was a requirement due to using shell = True as a Popen constructor argument. Using shell = 1 calls cmd.exe etc which in turn calls the subprocess of choice.

Realizing that there was only need to kill one process, and that it would also avoid problems with differing return codes on older versions of windows, TerminateProcess was given the job.

Long live TerminateProcess.

Friday, July 4, 2008

dot points on build page extensions

I have been thinking about making some extensions to the build page.

Raw Data

  • Keep raw_data to process at any time. No need to discount old data collected from buggy analysis.

Profiling

  • Use function wrappers, that log profiling of each test and multiple calls.

  • -p|--profile command line mode

Tests

  • Use subprocess mode by default for run_tests.py

  • Web interface for ticketing off tests

Build information

  • Post compiler version

  • Post complete Setup file

  • Post complete build output

  • Post complete test output

  • Python sys.path

  • Environment variables

  • As much as possible, unprocessed for archives

Machine information

  • Processor speed

  • CDRom availability

  • etc, etc.

Breaking up tests

Should the tests fail if a machine doesn't have a CD drive (assuming stubs were filled out) for example?

Should tests that require Numeric or NumPy fail if neither available?

There are some classes of tests that it seems to make sense to split apart from the main "base" group of tests. What should be the "base" group of tests to automate with the run_tests.py test runner?

What about tests that require human verification? For the build page a "base" group of tests should be specified.

What should be the requirements for machines sending results to the build page? Numeric, Numpy? win32 extensions on windows? A CD rom drive? 32 bit color display?

Thursday, July 3, 2008

test_not_implemented()

def test_get_arraytypes(self):

    # __doc__ (as of 2008-06-25) for pygame.sndarray.get_arraytypes:

    # pygame.sndarray.get_arraytypes (): return tuple
    #
    # Gets the array system types currently supported.
    #
    # Checks, which array system types are available and returns them as a
    # tuple of strings. The values of the tuple can be used directly in
    # the use_arraytype () method.
    #
    # If no supported array system could be found, None will be returned.

    self.assert_(test_not_implemented())


test_not_implemented() will fail if any test suite is run with a "(-i|--incomplete)" command line option.

As mentioned in previous posts, I developed a unittest stub generator that will output stubs for any untested units. It is supported by a naming scheme for the tests. The stubber will inspect the xxxx_test.py modules and based upon the names of the unittest.TestCase's and their children test_xxxx methods will determine what is already tested.

For each public callable there is a corresponding test named test_$callable_name. Comments or descriptions will be appended to this separated by a double underscore.

test_quit__returns_None_if_not_already_init


What if there is a module.quit and a module.class.quit ? Each class has its own TestCase (and thus namespace) named $classTypeTest. This is typically the case anyway with setUp()'s specific to the class tested.
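Going the other way, from test names back to the callables they cover, might look like this in current Python. The function names and the SpriteTypeTest case are hypothetical.

```python
import unittest


def callable_from_test_name(test_name):
    """'test_quit__returns_None_if_not_already_init' -> 'quit'"""
    body = test_name[len("test_"):]
    # Anything after the first double underscore is a comment
    return body.split("__", 1)[0]


class SpriteTypeTest(unittest.TestCase):  # hypothetical test case
    def test_add(self): pass
    def test_kill__removes_from_all_groups(self): pass


def already_tested(case_class):
    """Set of callable names that case_class has at least one test for."""
    names = unittest.TestLoader().getTestCaseNames(case_class)
    return set(callable_from_test_name(n) for n in names)


print(callable_from_test_name("test_quit__returns_None_if_not_already_init"))  # quit
print(sorted(already_tested(SpriteTypeTest)))  # ['add', 'kill']
```

Because the lookup is per TestCase, a module level quit and a class level quit never collide.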

def get_callables(obj, if_of = None, check_where_defined=False):
    publics = (getattr(obj, x) for x in dir(obj) if is_public(x))
    callables = (x for x in publics if callable(x) or isgetsetdescriptor(x))

    if check_where_defined:
        callables = (c for c in callables if ( 'pygame' in c.__module__ or
                    ('__builtin__' == c.__module__ and isclass(c)) )
                    and REAL_HOMES.get(c, 0) in (0, obj))

    if if_of:
        callables = (x for x in callables if if_of(x)) # isclass, ismethod etc

    return set(callables)


The script uses inspection to find all testables in pygame but there were a few complications, for example getter/setter properties and the fact that some objects need to be instantiated before inspection reveals their innards. Also, filtering out non-pygame callables and after that callables that appeared in more than one module.

eg pygame.rect.Rect led a double life as pygame.sprite.Rect. Just check the __module__ attribute ?

In [4]: pygame.sprite.Rect.__module__
Out[4]: 'pygame'


The workaround was to make a mapping of object to the place where it was defined. There were only 9 of these.

REAL_HOMES = {
    pygame.rect.Rect : pygame.rect,
    pygame.mask.from_surface : pygame.mask,
    pygame.time.get_ticks : pygame.time,
    .....


On some of the classes the __module__ attribute was __builtin__ so I needed to put an exception for them in the filtering out of non pygame callables.

In [7]: pygame.cdrom.CDType.__module__
Out[7]: '__builtin__'


def module_stubs(module):
    stubs = {}
    all_callables = get_callables(module, check_where_defined = True) - IGNORES
    classes = set (
        c for c in all_callables if isclass(c) or c in MUST_INSTANTIATE
    )

    for class_ in classes:
        base_type = class_

        if class_ in MUST_INSTANTIATE:
            class_ = get_instance(class_)

        stubs.update (
            make_stubs(get_callables(class_) - IGNORES, module, base_type)
        )

    stubs.update(make_stubs(all_callables - classes, module))

    return stubs


The stubber finds all modules in the pygame package. For each module it uses inspection to create a set of all the callables minus those listed in the IGNORES setting. This is here for any exceptions to the filtering and also for tests that have been grouped under one test name. These objects will not be stubbed.

IGNORES = set([

    pygame.rect.Rect.h, pygame.rect.Rect.w,
    pygame.rect.Rect.x, pygame.rect.Rect.y,

    pygame.color.Color.a, pygame.color.Color.b,
    pygame.color.Color.g, pygame.color.Color.r,

    ......



From that it creates a subset of "classes", the criteria being that for each element "inspect.isclass(element)" or that the element is in the manually set MUST_INSTANTIATE dict. This is a mapping of class to helper function, and instantiation args required to return an instance.

MUST_INSTANTIATE = {

    # BaseType / Helper # (Instantiator / Args) / Callable

    pygame.cdrom.CDType : (pygame.cdrom.CD, (0,)),
    pygame.mixer.ChannelType : (pygame.mixer.Channel, (0,)),
    pygame.time.Clock : (pygame.time.Clock, ()),


    ..

}


Inspecting the xxxxType would reveal no methods, so it needed to be instantiated, but then the object returned lacked other attributes; one example being __name__, needed later for determining the test name. Therefore the xxxxType was sent to the stub generation function as the "parent class" for each callable that was gathered by inspecting the instantiation.

Any callables not in the "classes" set are assumed module level functions and a stub is created for each.


The test stubber is used from the command line:

$ gen_stubs.py --help
Usage:
$ gen_stubs.py ROOT

eg.

$ gen_stubs.py sprite.Sprite

def test_add(self):

# Doc string for pygame.sprite.Sprite:

...


Options:
-h, --help show this help message and exit
-l, --list list callable names not stubs
-t, --test_names list test names not stubs


$ gen_stubs.py pygame -l
pygame.base.error.args,
pygame.bufferproxy.BufferProxy.length,
pygame.bufferproxy.BufferProxy.raw,
pygame.event.Event,
pygame.image.tostring,
pygame.joystick.Joystick,
pygame.key.get_repeat,
pygame.mask.Mask,
pygame.mixer.Channel,
pygame.movie.Movie,
pygame.overlay.overlay.display,
pygame.overlay.overlay.get_hardware,
pygame.overlay.overlay.set_location,
pygame.pixelarray.PixelArray.surface,
pygame.sprite.AbstractGroup.add,
pygame.sprite.AbstractGroup.add_internal,
pygame.sprite.AbstractGroup.clear,
pygame.sprite.AbstractGroup.copy,
pygame.sprite.AbstractGroup.draw,
pygame.sprite.AbstractGroup.empty,
pygame.sprite.AbstractGroup.has_internal,
pygame.sprite.AbstractGroup.remove,
pygame.sprite.AbstractGroup.remove_internal,
pygame.sprite.AbstractGroup.sprites,
pygame.sprite.AbstractGroup.update,
pygame.sprite.collide_rect,


Commas are appended for easy copy/paste into the IGNORES list.

gen_stubs.py is an integral part of the plan to make it extremely easy for people to contribute to unittests. One man can only do so much.

Monday, June 30, 2008

subprocessed

PyGame tests are structured in such a way that for each module in the pygame package (eg pygame.sprite, pygame.color) there is a test/xxxx_test.py file containing corresponding unittests. PyGame has an automated build page that shows build and test results for the latest svn version of PyGame on a variety of platforms and versions of python. It uses regular expressions to parse the results of the test runner script.

The test runner script compiles tests from each of the xxxx_test.py files and runs them in a single process. Advantage: speed, disadvantage: instability. PyGame uses a lot of c code, and where there is c code there is potential for strange errors.
As one example, there was a test for the ability to save OpenGL surfaces which would segfault on windows. This would stop the test runner half way through, leaving its output in a form the automated build page could not decipher.

"Build Successful, Invalid Test Results"

Another issue with running all tests in one process is the need to restore a "fresh" state for tests that rely on it. Conflicts can cause the test runner script to crash completely. On the other hand, some obscure bugs have been uncovered due to them.

Besides writing tests for individual units I have recently been working on adding a subprocess mode to the python test runner script. It processes the output of each module's test script and outputs the results in the same form as the single process mode.

There is a library called subunit that uses os.fork() to run unittest suites in subprocesses, that seemed like it would have been a perfect candidate for the job. Unfortunately windows doesn't have the fork system call so it was not an option. Windows python does not even provide os.kill().

What good is running all tests in subprocesses if one of them hangs and python is using a blocking call to retrieve its output?
As I was going to the trouble of making a subprocess mode, I realized I should deal with this possibility. Unfortunately the python subprocess module doesn't ship with async calls but I found a recipe on the ActiveState Python CookBook site.

On windows it relies on win32pipe and win32file from the pywin32 package. I worked around the lack of os.kill on windows by using system calls to "taskkill" or "pskill". If a wayward test suite running in a subprocess doesn't finish up in a specified allowance of time then it will be os.kill'd.
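For comparison, a sketch of the same timeout-and-kill behaviour using only what the standard library offers today. Popen.kill() arrived in Python 2.6 and the timeout argument in 3.3, so none of this was available to the runner described here; the function name is made up.

```python
import subprocess
import sys


def run_with_timeout(cmd, seconds):
    """Run cmd, killing it if it has not finished within the allowance.

    Returns (return_code, combined_output, timed_out).
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT,
                            universal_newlines=True)
    try:
        output, _ = proc.communicate(timeout=seconds)
        return proc.returncode, output, False
    except subprocess.TimeoutExpired:
        proc.kill()
        output, _ = proc.communicate()  # collect whatever was written
        return proc.returncode, output, True


# A stand-in for a hung test suite
hung = [sys.executable, "-c", "import time; time.sleep(60)"]
code, output, timed_out = run_with_timeout(hung, 0.5)
print(timed_out)  # True
```

Note that kill() only terminates the one process, not a tree of children.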

COMPLETE_FAILURE_TEMPLATE = """
======================================================================
ERROR: all_tests_for (%s.AllTestCases)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test\%s.py", line 1, in all_tests_for

subprocess completely failed with return code of
%s

cmd:
%s

return (abbrv):
%s

"""
# Leave that last empty line else build page regex won't match


Running each test suite in a subprocess is a huge performance hit. I think for the automated build page the performance hit won't really affect the experience as it's all running headlessly from cron jobs. Nevertheless, I added the ability to run subprocessed tests simultaneously in multiple threads. Also, the single process mode is still available, as is running module specific test suites.
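The threaded variant can be sketched with a thread pool from the modern standard library (concurrent.futures did not exist in 2008, so the original necessarily used plain threads). The module names and the child command are placeholders, not the real test invocation.

```python
from concurrent.futures import ThreadPoolExecutor
import subprocess
import sys


def run_module(name):
    # Placeholder child command; a real runner would launch the test module
    code = "print('ran %s')" % name
    out = subprocess.check_output([sys.executable, "-c", code],
                                  universal_newlines=True)
    return name, out.strip()


modules = ["color_test", "rect_test", "sprite_test"]  # hypothetical list

# Each thread mostly just waits on its subprocess, so the children
# genuinely run in parallel on a multi-core machine.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(run_module, modules))

print(results["rect_test"])  # ran rect_test
```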

I wrote some tests comparing the output of (single|sub)process modes running a group of fake test suites, some all OK, some with errors and failures.

$ run_tests__test.py
all_ok suite OK
failures1 suite OK

2/2 passes

-h for help

$ run_tests__test.py -h

-v, to output diffs even on success
-u, to output diffs of unnormalized tests


The standard library module difflib is very good, and extremely well documented.

Other than the obvious differences such as timing which are normalized before comparison, all is OK :)
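A sketch of that kind of normalized comparison, in current Python. The regex, function names and sample reports are made up; run_tests__test.py itself is not shown in this post.

```python
import difflib
import re

TIMING_RE = re.compile(r"(Ran \d+ tests?) in \d+\.\d+s")


def normalize(report):
    """Blank out the one part of a report expected to differ between runs."""
    return TIMING_RE.sub(r"\1 in X.XXXs", report)


def diff_reports(single, sub):
    a = normalize(single).splitlines()
    b = normalize(sub).splitlines()
    return list(difflib.unified_diff(a, b, "single", "subprocess",
                                     lineterm=""))


DIV = 70 * "-"
single = "...\n%s\nRan 3 tests in 0.012s\n\nOK\n" % DIV
sub = "...\n%s\nRan 3 tests in 1.234s\n\nOK\n" % DIV
broken = "..\n%s\nRan 2 tests in 1.234s\n\nOK\n" % DIV

print(diff_reports(single, sub))          # []
print(len(diff_reports(single, broken)) > 0)  # True
```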

Wednesday, June 25, 2008

Testing

Just a quick note,

Still waiting on my fan, it's getting shipped in from Amurricah.

I wrote a run_tests_sub.py the other day that uses subprocess to run each xxxx_test.py in the trunk/test directory.

It will run with an optional threads parameter: -t num_threads

$ run_tests_sub.py -t 4

Apparently this runs faster on multi-core.



It should output results similar to run_tests.py though may need tweaking to get it run transparently in place of run_tests.py for the build page.

Speaking of the build page, Rene and I have had a few ideas for a combined build / test web app that collected builds and test statistics ( profiling / passes etc). Also, a means to distribute the writing of tests. Many hands make light work.

If it was possible to be assigned a stub of a test to fill out and then post it back painlessly we could quite quickly increase the coverage of our tests. If twenty people filled out 1 test a week, then over a month that would be 80 extra unit tests.

ATM there are 232 unimplemented tests ("FAILED (failures=232)"), and possibly that many again that haven't been stubbed out yet, waiting to be written.

$ run_tests.py -i


Will show tests that need fleshing out.

Friday, June 20, 2008

Aha!

I realised why the change from CONSTANT = (expr) to CONSTANT = [expr] fixed the bug in the color_test.py

A generator expression is only good for one iteration, and after that it will act as an empty sequence. I thought it would be a reusable, lazily evaluated equivalent of a list comp. Turns out I was dead wrong.
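
A minimal demonstration of the difference, with made-up fixture values rather than the ones from color_test.py:

```python
# A generator expression can be consumed only once.
fixtures_gen = (n * 2 for n in range(3))
first_pass = list(fixtures_gen)   # consumes the generator
second_pass = list(fixtures_gen)  # nothing left to yield

# A list comprehension builds a real list that can be iterated repeatedly.
fixtures_list = [n * 2 for n in range(3)]
first_list_pass = list(fixtures_list)
second_list_pass = list(fixtures_list)

print(first_pass, second_pass)
print(first_list_pass, second_list_pass)
```

If the generator is defined once at module or class level and several tests loop over it, only the first loop ever sees any fixtures.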

I went over the stub generator recently and it's pretty much in its finalized form as far as the naming scheme is concerned.

Can't wait to get my own computer back in action.

Wednesday, June 18, 2008

Fan

Damn fan on my laptop packed it in. I will have to convince my friend to let me install linux on his windows box while I wait for a replacement. Can't get a windows build of development pygame at the moment due to failing tests... or can I? Temporarily disable the failing tests and let the build farm run?

What a pain in the arse.

Thursday, June 12, 2008

Happenings

I have been writing unittests using the naming scheme (see below) keeping to it as much as possible.

There have been a few modifications but that's fine as long as I am consistent. I haven't yet written the part of the test stub generator that filters from the generated tests any tests for units that have already been written. I am letting the writing of tests dictate the naming scheme's evolution.

Will post some more thoughts on the naming scheme in days to come. Also thoughts on one to one test names.

Thoughts on speed of test suites, isolating "dangerous" tests that can crash the whole test suite.

Thumb Rules: Testing Generalities

It's been quite an illuminating experience spending most of my time of late writing tests. A lot of the rules of thumb I have read about have proven to make good sense.

With the unittest framework especially, it seems to pay to keep to a minimum the number of assertions per test. Sometimes this increases the verbosity of your tests, something that python programmers seem to almost despise. People tend to optimize for elegance of code and to some extent performance. Test code should probably be optimized for other priorities.

Why does it pay though? The unittest framework stops any test dead in its tracks on any error or assertion failure. If you have a whole battalion of assertions grouped under one test, on a failure you will only get information about one failure, and you miss out on a lot of context that would otherwise have been reported. Is there anything common to all these failures?

For example, there was a bug in some of the sprite collision tests (that intriguingly was not apparent on windows). The problem eluded me, but my brother discovered that it was due to testing the equivalence of lists of sprites, one of which was sourced from a dict, the order of which can not be guaranteed.
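
One way to compare such lists without depending on order is sketched below. It uses assertCountEqual, which is in today's unittest but was not available in 2008 (comparing sets or sorted lists would have been the equivalent then). The Sprite class is a hypothetical stand-in, not pygame.sprite.Sprite, and this is not the patch that was actually committed.

```python
import unittest


class Sprite:
    """Hypothetical stand-in for pygame.sprite.Sprite."""


class SpriteOrderTest(unittest.TestCase):
    def test_same_members__any_order(self):
        s2, s3 = Sprite(), Sprite()
        expected = [s2, s3]
        collided = [s3, s2]  # same sprites, different order

        # A plain list comparison is order sensitive, so these are not equal.
        self.assertNotEqual(collided, expected)
        # assertCountEqual ignores order but still checks membership and counts.
        self.assertCountEqual(collided, expected)


result = unittest.TestResult()
unittest.defaultTestLoader.loadTestsFromTestCase(SpriteOrderTest).run(result)
print(result.wasSuccessful())
```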

At the time there was a squadron of tests under one test, test_spritecollide.
( I restructured the assertions while renaming tests to fit the test stubber naming scheme. )

Running the test on linux, it would only report one failure when in fact the bug was repeated through about 4 - 5 assertions. Fix one and then the next assertion would fail.

Had they been structured with fewer assertions per test, showing all the failures, it's possible a programmer of less ability like myself would have been able to solve the problem. Other programmers would have solved it instantly.

This context makes it easier to home in on the real problem. This brings me to another thought: unittests, what are they?

Unit tests; tests of units. Leaving aside the definition of unit, tests for what? One could say that they are testing for defined behaviour. Essentially then, you are testing for bugs, as if the unit is not working within defined behaviour, the unit is buggy.

What do you do when something is buggy? You debug of course. Tests then could or even should be debugging aids, especially useful when the tests are written before the actual units.

If you are not using tests for debugging then what? Some temporary scaffolding that with a tiny bit more energy could have been a test?

How to make the tests help in debugging without spending too much extra energy?

I'm not really sure on this, other than trying to keep assertions per test to a minimum. Another thing that may help is to not use anonymous expressions in the tests. If everything has been named, it's easier to use a debugger and get a glimpse of what's going on.

Tests should probably sacrifice compactness and abstraction for explicitness. There is probably a line to draw somewhere near repeating yourself too much (copy paste programming).

Sometimes it is easy to miss some bugs in your tests that give a false sense of everything being "OK".

An example of that is some tests for the Color type properties. I was looping over a generator expression that I defined for some test fixtures. All the tests were passing OK. It turns out that for some reason ( I still don't know why ) in the scope of the tests, the expression was behaving as an empty sequence.

eg

for fixture in []:
    assert something_about(fixture) == something

Nothing was ever asserted, the tests passed. I since changed the generator expression to a list comp and the tests are asserting themselves. ( expr ) => [ expr ]
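
One possible guard against a loop that silently never runs is to count how many fixtures were actually checked. This is a sketch of the idea, not something that is in the PyGame tests; check_fixtures and its arguments are hypothetical names.

```python
def check_fixtures(fixtures, predicate):
    """Assert predicate on every fixture and fail loudly if there were none."""
    checked = 0
    for fixture in fixtures:
        assert predicate(fixture), "predicate failed for %r" % (fixture,)
        checked += 1
    assert checked > 0, "no fixtures were checked; the loop body never ran"
    return checked


# An empty generator stands in for the expression that behaved as an empty sequence.
exhausted = (n for n in [])

print(check_fixtures([1, 2, 3], lambda n: n > 0))
try:
    check_fixtures(exhausted, lambda n: n > 0)
except AssertionError as error:
    print(error)
```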

Tuesday, May 27, 2008

blog.update(recent_events)

I have resolved the failing test issues I had on Ubuntu.

It was a bug in testing the content equality of two lists; one was the return value of a function, the order of which could not be guaranteed. Odd that it was only an issue on Ubuntu, and presumably the linux platform in general. I committed a patch that fixed it.

At the moment I am working on creating a script, and supporting test structuring, to automate test stubbing for all units that are not tested.

This is so I can pick off the tests more methodically. Also, if it's possible to get a large group of people writing tests, just one at a time each week, a lot could very quickly get done. How to reduce the overhead so that it is worthwhile and easy to be assigned one unit and write some tests?

The basic idea is to have a naming convention (possibly renaming existing tests) for tests that makes it easy to see if units are tested. A one 2 one mapping of callable to test.

test_$Class__$callable__$description

$Class if a method
$callable
$description if any / optional


At the moment the script just creates test stubs for all "public" callables in the pygame package with that naming convention. It appends the callable's doc string as a comment.

def test_Surface__get_colorkey(self):
    """
    TODO: Test for unit, get_colorkey

    """


    # Docstring:

    # Surface.get_colorkey(): return RGB or None
    # Get the current transparent colorkey


    self.assert_(not_completed())


I am still fleshing out the idea.

Tuesday, May 6, 2008

Compilation of Blues

I was advised that I should get some experience compiling pygame. I use windows and thanks to Brian/Lenard it's a one click process to install the latest version of pygame. I knew that in the future I would be needing to run tests on Linux so I thought I may as well set it up now and compile pygame on that. I toyed briefly with the idea of setting up a dual boot. I have run that setup in the past and it's a PITA so I opted for a virtual machine running Ubuntu 8.04 on one of my virtual desktops. Virtual insanity.

After setting up the VM and installing Ubuntu I decided to have a crack at compiling everything from source and not using apt-get. This was both for the experience and because I was under the mistaken impression that the svn HEAD version of pygame required newer versions of its dependencies than were available through apt-get packaging.

(The wiki has compilation steps for Mac and Windows with links to source archives of pygame dependencies. A note on the wiki says "We should have a download with everything included. As well as patches for each one that we need". Sounds like a great idea to me. Maybe an svn repo? Is there much difference between the platforms?)


I downloaded the packages and began the quite tedious process of compiling everything.

............

I eventually had all the dependencies compiled and was glad to be finally able to begin compiling pygame

$ sudo python setup.py install

............

building 'pygame.scrap' extension
gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -D_REENTRANT -I/usr/X11R6/include -I/usr/local/include/SDL -I/usr/include/python2.5 -c src/scrap.c -o build/temp.linux-i686-2.5/src/scrap.o
In file included from src/scrap.c:59:
src/scrap_x11.c: In function ‘_convert_format’:
src/scrap_x11.c:77: error: ‘XA_PIXMAP’ undeclared (first use in this function)
src/scrap_x11.c:77: error: (Each undeclared identifier is reported only once
src/scrap_x11.c:77: error: for each function it appears in.)
src/scrap_x11.c:79: error: ‘XA_BITMAP’ undeclared (first use in this function)
src/scrap_x11.c: In function ‘_add_clip_data’:

............


My happiness was short-lived. I googled the problem and found someone back in 2005 having similar problems trying to compile qt. It was something to do with not having the XFree86 development package. Turns out this package goes by the name libx11-dev for debian/ubuntu.

I installed it but the problem remained so I turned to #pygame for help. Someone helpful there, I can't recall who, said that it seemed I was missing some X11 header files and specifically a file called Xatom.h. Off in circles again for a while looking for a missing package. I was pointed toward packages.ubuntu.com.

I searched for all packages containing Xatom.h. and..... aha! "x11proto-core-dev". That must be it!
$ sudo apt-get install x11proto-core-dev


It was already installed.

What now? I harassed one of my mentors, Rene, and he said that it could be a problem with pygame-trunk/src/scrap_x11.c or that it could possibly be because I got a non X version of SDL. "Damn.., I don't think I did"

He taught me about "sudo updatedb" and "locate X" with which I was able to confirm the existence of and location of Xatom.h on the system. He recommended that I use apt-get to get ready made packages of the dependencies and compile again.

I managed to replace all the local versions of the dependencies with apt-get ones, but the first time round I missed deleting a lot of them. I don't have that much experience with linux. The first time I recompiled it worked, but upon running the unit tests I found that 5 of them were failing.

Rene was quick to the rescue. "my guess is you're linking to your self installed sdl_image... try ldd `find . -name imageext.so`"

That confirmed it. I hunted down all the non apt managed packages and exterminated them from the system then compiled pygame again.

1 Unit test was still failing. IS still failing.

======================================================================
FAIL: test_spritecollide (sprite_test.SpriteTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/sprite_test.py", line 68, in test_spritecollide
    self.assertEqual(sprite.spritecollide(s1, ag2, dokill = False, collided = sprite.collide_rect_ratio(20.0)),[s2,s3])
AssertionError: [<Sprite sprite(in 1 groups)>, <Sprite sprite(in 1 groups)>] != [<Sprite sprite(in 1 groups)>, <Sprite sprite(in 1 groups)>]

This test passes fine under Windows.

The hunt is on.

Tuesday, April 29, 2008

Journey of a Thousand Miles



On the advice of my brother I have changed the name of the blog to something more generic so I can keep the blog once the winter has passed. You will be hearing from me soon. I want your ideas.

Tuesday, April 22, 2008

Winter Of Testing

Wow, my "Google Summer of Code" application was accepted. My inbox has been flooded with congratulations and introductions from people from all over the world, who likewise were accepted in the program. Brazil, China, Russia to name just a few places. Hundreds of people excited about being sponsored to work on a project that interests them "over the summer". I live in Australia, and summer has just ended; no more morning swims, I will leave that to the diehards. Winter is here and it looks like I will spend it underneath a blanket at the keyboard.

For my GSOC project I will be extending the coverage of tests in pygame. With Py3K on the horizon and lots of other pygame ports in conception it's critical to get more tests in place. I will be extending the unittests and implementing some speed regression tests. Also, I will develop some interactive tests for things that are difficult/impossible/notWorthTheTime to test automatically.

My general strategy will be "breadth first", cycling through the modules. As time allows I will pinpoint target modules for intensive testing based on community feedback. I look forward to working with the PyGame community and welcome any ideas and advice.