Getting erlang_hgsvn to dance with Subversion 1.5

Protocol-driven programming in Erlang can be extremely rewarding. It can also lead you down paths where the subtlest change in the external system you are interfacing with breaks your code. I had just such an experience when I upgraded my workstation from Ubuntu Hardy Heron to Intrepid Ibex.

To my horror, my carefully constructed merging system (based on erlang_hgsvn) crashed when I tried to run it. First I investigated what might have changed. I quickly identified that Subversion 1.5 was the standard in the Intrepid repository. I looked at the output from svn log and found it to be an exact match for the output of Subversion 1.4. Something more subtle must be at work, I told myself.

Something more subtle was at work. It turns out that Subversion 1.5 does not flush STDOUT at the end of each log entry the way Subversion 1.4 does. Instead, the flushes appear to land at arbitrary points in the output. Initially I cursed myself for having assumed that each chunk of data would line up with a log entry. I went over and over in my mind why I would have done that instead of using a simple getline approach. Then I remembered the performance issues and I set my mind to retaining a protocol-driven design. This time it would not be sensitive to the exact boundaries of the data yielded by svn log.

To tolerate arbitrary termination of the data yielded by svn log, I chose to make all parsing routines able to fall back on a receive block to get more data. The design I chose uses a rolling 8-byte binary parsing window to match the telltale aspects of a log entry. It turns out to be more code than the original list-based implementation. However, I found it much easier to walk my friend through the binary-based implementation.
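The fallback on a receive block can be illustrated with a minimal sketch. This is not the erlang_hgsvn code: the module name chunk_sketch, the function read_until/2, the {data, Chunk} and eof message shapes, and the short separator are all assumptions made for illustration, and it uses binary:split/2 rather than the rolling window. It only shows how a parsing routine can ask for more data when the buffer ends at an awkward point.

```erlang
-module(chunk_sketch).
-export([read_until/2, demo/0]).

%% Return {Before, Rest}, where Before is everything up to the first
%% occurrence of Sep. Buffer holds bytes received but not yet consumed.
%% Returns {eof, Buffer} if the input ends before Sep is seen.
read_until(Sep, Buffer) ->
    case binary:split(Buffer, Sep) of
        [Before, Rest] ->
            {Before, Rest};
        [_NoSeparatorYet] ->
            %% Not enough data yet: block until the next chunk arrives.
            %% The {data, Chunk} and eof messages are hypothetical.
            receive
                {data, Chunk} ->
                    read_until(Sep, <<Buffer/binary, Chunk/binary>>);
                eof ->
                    {eof, Buffer}
            end
    end.

%% Feed the parser chunks that break at arbitrary points, including
%% in the middle of the (hypothetical) separator.
demo() ->
    Sep = <<"\n---\n">>,
    self() ! {data, <<"first en">>},
    self() ! {data, <<"try\n-">>},
    self() ! {data, <<"--\nsecond">>},
    self() ! eof,
    {First, Rest} = read_until(Sep, <<>>),
    {eof, Second} = read_until(Sep, Rest),
    {First, Second}.
```

Calling chunk_sketch:demo() returns {<<"first entry">>, <<"second">>} even though the separator arrives split across two chunks. Rescanning the whole buffer after every chunk is the simple choice here; a rolling window avoids that repeated work.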

The core of the design, the binary rolling window, was inspired by a fast get_line solution by Per Gustafsson. Originally I looked at Per's solution in horror since it seemed to have needless repetition of pattern-matching clauses. I later learned that this repetition was, in fact, the implementation of a highly efficient binary parsing window. I modified and reused this technique in erlang_hgsvn to support Subversion 1.5. I hope the change has also improved performance when converting large repositories, but I have not yet taken any measurements.
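To show what the repeated clauses look like, here is a rough sketch of a rolling window that finds the first newline in a binary. It is neither Per's code nor the code in erlang_hgsvn: the names window_sketch, split_line/1, scan/2 and cut/2 are made up for this example, and the window is cut down to 4 bytes to keep it short. An 8-byte window follows the same pattern with more clauses.

```erlang
-module(window_sketch).
-export([split_line/1]).

%% split_line(Bin) -> {Line, Rest} | more
%% Returns 'more' when Bin holds no newline, so the caller can fetch
%% more data and try again.
split_line(Bin) ->
    scan(Bin, 0).

%% Inspect up to four bytes starting at offset N. One clause per
%% position in the window; the fifth clause slides the window forward.
scan(Bin, N) ->
    case Bin of
        <<_:N/binary, $\n, _/binary>>          -> cut(Bin, N);
        <<_:N/binary, _, $\n, _/binary>>       -> cut(Bin, N + 1);
        <<_:N/binary, _, _, $\n, _/binary>>    -> cut(Bin, N + 2);
        <<_:N/binary, _, _, _, $\n, _/binary>> -> cut(Bin, N + 3);
        <<_:N/binary, _:4/binary, _/binary>>   -> scan(Bin, N + 4);
        _                                      -> more
    end.

%% Split Bin around the newline found at byte offset Pos.
cut(Bin, Pos) ->
    <<Line:Pos/binary, $\n, Rest/binary>> = Bin,
    {Line, Rest}.
```

For example, window_sketch:split_line(<<"r42 | alice\nnext">>) returns {<<"r42 | alice">>, <<"next">>}, and window_sketch:split_line(<<"no newline">>) returns more. The repetition that looks needless is the window itself: each clause tests one byte position, so the function recurses once per four bytes rather than once per byte.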

You do not need to tie yourself down to obvious get_line solutions to make things work. Quite often that approach will get you into bigger trouble. This trouble can be seen fairly obviously when loading a large file without line terminators into most text editors, including (to my surprise and disappointment) Vim. Using a protocol-driven approach to data processing makes it much easier to avoid memory crashes in these circumstances and also gives you the insight you need to store and parse data in binary file formats. Erlang opens such arcane niches of data processing to the average developer.