Sunday, March 29, 2009

CTags 1: Captain Bisect And His Rag (c)Tag Crew

CTags

Ctags generates an index (or tag) file of language objects found in source files
that allows these items to be quickly and easily located by a text editor or
other utility. A tag signifies a language object for which an index entry is
available (or, alternatively, the index entry created for that object).

Python has a philosophy of `Batteries Included` (and `Designer Straight Jacket`)
and its standard library has many useful modules to allow you to zoom out and
fly. Say what? What pixie dust have you have been snorting!?

The higher level you are the higher level you can go. If your stuck in the
trenches of detail it's hard to see emergent patterns to exploit which can help
you simplify code. Refactoring and simplifying is generally an iterative process
of simplifying, getting a higher level view and simplifying again.

Taking a bottom up approach in this helps as you know what building blocks you
need to create along the way. Don't want to be *too* simplistic. How can you do
things `top down` if you aren't `on top`?
Maybe you have flown over many times already and are just implementing an old
innovation from a time honoured map. That's exactly the spirit of prototyping.
Sending in the scouts to survey the terrain and report back.

However, those troops get mighty mutinous really quick if you don't take the
time to at least make a spot before you send em out like a swarm of jellyfish
(especially when they are being fired at)

Strategy? Fly over, send in the scouts and maintain communication.

What the hell is all that got to do with CTags? Not much I admit. But we does
love to babble.

Python std libraries are your `A Team` of special operatives that can do the
work you tell em without constant supervision and no need to worry about the
messy details. Go nuke PHP! "Yes Sir!"

One of these crackshot ninjas is Captain `bisect` with his special move
`bisect_left`.

He's a great leader because he's absolutely fastidious in keeping his company of
troops in perfect sorted order at all times.

Some of his duties involves training new recruits and he'll do a trick where
upon meeting them all for the first time he will get them to silently line up in
alphabetical order while he has his back turned. (we'll say alphabetical by
name, but well, these guys *are* macho army men so..)

>>> scum = "omewEDjyFapAdxslfhbgBcnCtkqzuvri"

>>> len(scum) == 32
True

>>> men = ''.join(sorted(scum))
>>> men
'ABCDEFabcdefghijklmnopqrstuvwxyz'
( Isn't it funny how the ones that try and make them selves seem big are in fact the smallest? )

" I bet I can find the position in the line of any one of you maggots after at most 5 guesses "

A, Yeah right, he thinks. "Find q sir"

bisect, "man 16 are you a lesser man than q?" k, "Yes"
bisect, "man 24 are you a lesser man than q?" s, "No"
bisect, "man 20 are you a lesser man than q?" o, "Yes"
bisect, "man 22 are you a lesser man than q?" q, "No" (smiles)
bisect, "man 21 are you a lesser man than q?" p, "Yes"

bisect, "man 22 you are q!"

troops, "Bravo Sir!"

( bisect is a a bit of a kookoo and uses 0 based indexing. Lucky the scum neverseem to get confused )

How does he do it!?

He starts in the middle (32 / 2 = 16) and compares what he was searching for with what he finds there. What he finds, k, is less than q so he knows he can rule out all other men before k as they too would be less than q.

This only worked for him because all the men are in order.

He then subdivides again. The man at position 16, k, is less than q so his area will start at
17 and extend in the opposite direction ending (as before) at 32 (the greatest man). He then looks for his next midpoint with an eye to ruling out another half of the remaining men.

( He repeats this simple process until his start point is no longer less than the
end point )

((17 + 32) // 2 = 24)
men[24] (s) is not lt q so his end point becomes 24

((17 + 24) // 2 = 20)
men[20] (o) is less than q so his start point becomes 21

((21 + 24) // 2 = 22)
men[22] (q) is not less than q :) so his end point becomes 22

((21 + 22 // 2 = 21))
men[21] (p) is less than q so his start point becomes 22
Start is 22 and end is 22 so he has a match!

With 64 men (double 32) he would only take 1 more guess; As each guess rules
out half and half is the inversion of double.

What about 128 (or some arbitrary number?) How many times can you half 128
before you get one (the right `one`).

Or in other words what to the power of two makes 128?

32 log 2 = 5
64 log 2 = 6
128 log 2 = 7

import math
math.log(128, 2)
Captain bisect's talent scales exceedingly well and in fact he's ready to put
forth his talent to whatever use come up with.
import bisect
bisect.bisect_left(some_sequence, search)

WOAH What a verbose explanation! Too much detail! Couldn't you just have said
use `bisect.bisect_left` to index left most occurence of any item in a sequence.

Yeah, that's kind of the point. Python is chock full of handy high level
abstractions you can trust to do the job without worrying about the details.
You are already zooming.

"Ok you admit your are babbling but what the heck is this got to do with
CTags?"

To paraphrase, Ctags generates an index (in sorted order) of symbols to
be quickly and easily located. Sounds like a job for Captain bisect. He is
actually made of this stern stuff called `C` that makes him faster than anything
you could genetically engineer in your laboratories.

Python also has this useful thing going called duck typing. bisect will work
on any class that exposes a __getitem__ method.

But what if you wanted Captain bisect to search a 50 MB ctags file? You can't
sub class a file object... can you? In any case each index would just return
a character wouldn't it?

A sneak preview of how to use bisect and mmap to binary search CTags `tags` files. Explanation to follow.

 1  #################################### IMPORTS ###################################
2

3 from __future__ import with_statement
4
5 import os
6 import bisect
7 import mmap
8 import unittest
9
10 ################################### CONSTANTS ##################################
11

12 # CSV Column in tag file
13
SYMBOL = 0
14 FILENAME = 1
15
16 ################################################################################
17

18 class TagFile(object):
19 def __init__(self, p, column):
20 self.p = p
21 self.column = column
22
23 def __getitem__(self, index):
24 self.fh.seek(index)
25 self.fh.readline()
26 return self.fh.readline().split('\t')[self.column]
27
28 def __len__(self):
29 return os.stat(self.p).st_size
30
31 def get(self, *tags):
32 with open(self.p, 'r+') as fh:
33 self.fh = mmap.mmap(fh.fileno(), 0)
34
35 for tag in tags:
36 b4 = bisect.bisect_left(self, tag)
37 fh.seek(b4)
38
39 for l in fh:
40 comp = cmp(l.split('\t')[self.column], tag)
41
42 if comp == -1: continue
43 elif comp: break
44
45 yield l
46
47 self.fh.close()
48
49 ##################################### TESTS ####################################
50

51 class CTagsTest(unittest.TestCase):
52 def test_tags_files(self):
53 tags = r"tags"
54 tag_file = TagFile(tags, SYMBOL)
55
56 with open(tags, 'r') as fh:
57 latest = ''
58 lines = []
59
60 for l in fh:
61 symbol = l.split('\t')[SYMBOL]
62
63 if symbol != latest:
64
65 if latest:
66 tags = list(tag_file.get(latest))
67 self.assertEqual(lines, tags)
68
69 lines = []
70
71 latest = symbol
72
73 lines += [l]
74
75 if __name__ == '__main__':
76 unittest.main()
77
78 ################################################################################

No comments: