Going away from bzr toward git

(this is a small rant about why I like bzr less and less and like git more and more; this is only a personal experience, not a general git vs bzr thing, take it as such).

Source control systems are a vital tool for any serious software project. They provide an history of the project, are an invaluable tool for release process, etc… When I started to develop some code outside school exercises, I wanted to learn one for my own projects.

Using svn

This was not so long ago – 3-4 years ago, and at that time, SVN was the logical choice. I wanted to use it on my machine, to keep history, and being able to go back; since I mainly code for scientific research, the time and rollback aspects were particularly important.

Using SVN did not really make sense to me at that time: Using it to track other projects was of course easy (checking out, log, commit), but I could not really understand how to use it for my own projects:

  • I could not understand their branches and tags concept. Note that I did not even know what those terms mean at that time; I did not understand why it would matter at all where I would put the tags and   branches, why I needed to copy things for tags, etc… From the svn-book, it was not really clear what the difference between branch and tags was.
  • Setting up svn on one machine is awkward: Why should I create a repository somewhere, and populate it from somewhere else ? How should I do backup of the repository ?
  • Getting back in time is unintuitive: you have to “merge back” in time the revisions you want to rollback. This is really error prone.

Bzr, the first source control which made sense to me

At the end, I found easier to just use tarballs to save the state of my projects (my projects are always quite small). Then, a bit more than two years ago, I discovered bzr (bzr-ng at that time): it was a better arch, the SCS developed by Tom Lord for distributed development. Arch always intrigued me, but was extremely awkward: it could not handle windows very well, there were strange filenames, and it was source code invasive. Even checking out other projects like rhythmbox was painful. bzr on the contrary was really simple:

  • Creating a new project ? bzr init in the top directory, and then adding the code and committing. No separate directory for the db, no “bzradmin” to create the repository
  • branches and tags (tags came a bit later in bzr, starting at version 0.15 IIRC) were dead easy: bzr branch to create the branch, no need to use some copy commands, etc.. tags are even easier.

I have used bzr ever since for all my projects; in the mean time, I have been much more involved with several open source projects, which all use svn, and I always felt svn was an inferior, more complicated tool compared to bzr. With bzr, I understood what branch could be used for, and more generally how a SCS
can be helpful for development.

Since bzr was so pleasant to use, I of course wanted to use it for the projects I was involved with, so I was really excited by bzr-svn to track svn repositories. Unfortunately, bzr-svn has never been a really pleasant
experience. One problem was that the python wrapper of libsvn were really buggy (to the point that bzr-svn has now its own wrapper). Also, it was extremely slow to import revisions, and failed on some repositories I used bzr-svn on. That’s how I started to look at other tools, in particular hg: hg had an ability to import svn, and it was more reliable than bzr-svn in my experience. But it was not really practical to use to commit back to svn repository, so I never investigated this really deeply.

Bzr annoyances

At the same time, there were some things which I was never thrilled by with bzr. Two in particular:

One branch per directory

That’s a conscious design decision from bzr developers. This means it is a bit simpler to know where you are (a branch is a path), but I find it awkward when you need to compare branches / need to “jump” from branch to branch. When you are deep down inside the tree of your project, comparing branches (diff, log, etc…) becomes annoying because you have to refer to branch form their path.

Revision numbers

Each commit is assigned a revid by bzr, which is a unique number per repository. That’s the number bzr deals with internally. But for most UI purpose, you deal with revno, that is simple integers numbers: of course, because of the distributed nature of bzr, those numbers are not unique for a repository, only within a branch. I find this extremely confusing. Again, this appears more clearly when comparing several branches at the same time. For example, when I have not worked on a project for a long time, I may not remember the relative state of different branches: the bzr command missing is then very useful to know which commits are unique to one branch. But the numbers mean different things in different branches, which mean they are useless in that case; being useless would have actually been ok, but they are in fact very confusing.

For example, I recently went back to a branch I have not worked on for more than one month. Let’s say my current development focus in in branch A, and I wanted to see the status of branch B. I can use bzr missing for that purpose. I can see that 5 revisions, from 300 to 305 are missing. I then go into branch B, and study a bit the source code, in particular with bzr blame. I see some code with revision under 300 in branch B, which I could not see in branch A. Now, this was confusing: any revision before 300 is in A too according to bzr missing, so how is it possible for bzr blame to report difference code in A and B, for a section commited with a revno < 300 ? The reason is that revision 305 is actually a merge, and when going through the detailed log in branch B, I can see that revision 305 contains 296.1.1, then 299.1, 299.2, 299.3 and 299.4. I can’t see how this a useful behavior. Maybe I am biased as someone doing a lot of math all day long, but having 296.1.1 after 304 does not make any sense to me. What’s the point of using supposedly simple numbers when they have arbitrary ordering, which changes depending on where you are seeing them ? SVN revno were already quite confusing when using branches, but bzr made it worse in my opinion.

Nitpicks

There were also things which were less significant for me, but still unpleasant: bzr startup is really slow, its use in script not really useful – if you want to do anything substantial, you have to study the plugin API. Also, it  tarted to become a bit inflexible for some things: for example, incorporating a second project also tracked by bzr into a first project is difficult (if not impossible; I could never manage to do it), history-related perations are often slow, using a lot of branches takes a lot of space unless you are using shared repository which feel like an hack more than a real solution, etc…

(Re)-Discovering git

About the same time, I had to use git for one project which I was interested in. I found it much easier to use than when I looked at it for the first time. There was no cogito anymore, the basic commands were like bzr. I decided to give git-svn a try, and it was much faster than bzr-svn to import some projects; the repositories were extremely small [1]. Also, although git UI is still quite arcane, I found git itself a pleasure to use: it felt simple, because the concept were simple – much more than bzr, in fact. sha-1 for revision are not awkward, because you barely use them at the UI level (git UI is very powerful for human-revision handling: no number, but you can easily ask for parent in a branch or in the DAG relatively to a given revision, you can look by commiters, by string in the commit or the code, by date, etc…); bzr revno feel like an hack after being used to git. For example, wherever I am, if I want to compare branch2 to branch1, in git I can do:

git log branch1..branch2
git diff branch1..branch2

Also, git is scriptable, which is appealing to the Unix user in me. I can understand the POV of bzr developers concerning extensibility with plugin (it is not unlike the argument of UNIX pipe vs Windows COM extensions as developed by Miguel in his Let’s make Unix not suck [2]), but I prefer the git model at the end. Bzr decision to go toward extensibility with plugins is not without merit: I  think the good error report from bzr is partly a consequence of this choice. OTOH, git messages can be cryptic; but git simplicity at the core level makes this much less significant than I first expected.

A key git difference compared to bzr is that git is really just a content tracker. It does not track directory at all, or filenames for example: it instead tries to detect when you rename files. I remember at least once  then this was mentioned on bzr ML [3], where a bzr developer argued that bzr could do like git, while keeping explicit meta information (when you tell bzr to rename a file). One obvious drawback is that depending on how the change was made to the tree, patch vs merge for example, bzr behavior will be different; this is very serious in my opinion. Specially for a language like python, where the files/directory name matters, directory renames should be quickly propagated, and can never be done lightly anyway. And it means git can be much better at dealing with renames when import external data, merge between unrelated branches, etc…  Because its algorithm for renames detection is used all the time, it has to work quite well. It is a bit similar to the merge capability of distributed SCS: there is no reason for them to be inherently better at merging, but because they would be unusable without good merge tracking capability, this has to work reliably from the start in DVCS. Even if in theory, bzr could detect renames like git (in addition to its explicit rename handling), in practice, it has not happened, and as far as I am aware, nobody has done any work in that direction.

Another advantage of git I did not mention, but that’s because it has been rehashed ad nauseam, and it is the most obvious one to anyone using both tools: git is incredibly fast. Many things I would never do with bzr because it would take too much time are doable with git; sometimes, git favor speed to much (in its rename detection, for example: you should really be aware of the -M and -C options in log and other history-related command), but even when telling git to spend time detecting renames, it is still much faster than bzr.

Finally, git is getting a lot of traction: it is used by Linux, Xorg, android, RoR, a lot of freedesktop projects, is being discussed for KDE. This means it will become even better, and that other DVCS will have a very hard time to compete. As a very concrete example: Git UI improvements were much more significant than bzr speed improvements during the last year (bzr speed has not improved much in my experience since 0.92 and the pack format: long history and network make bzr almost unusable for big projects with large history contributed by a large team across the world; OTOH, git 1.5.3 was the first git version which I could use without hurting my head too much).

For all those reasons – simplicity of the core model, flexibility, scriptability, and speed – I think I will start to use git for all my projects, and give up on bzr. I think bzr is still superior to git for some things, and
depending on the project or the tree you are tracking, bzr may be better (in particular because it tracks directories, which git does not, and this can matter; I am also not sure whether git would be appropriate for tracking /etc or your $HOME).

[1] for every project I have imported so far, the git clone is as big or smaller than a svn checkout; you read that right: one revision checked out from svn is often bigger than a full history; I have imported the full history of numpy, scipy, scikits on my github account, and I have not used much more than half of my 100 Mb account)

[2] http://primates.ximian.com/~miguel/bongo-bong.html

[3] https://lists.ubuntu.com/archives/bazaar/2007q3/028591.html

A python 2.5.2 binary for Mac OS X with dtrace enabled

As promised a few days ago, I took the time to build a .dmg of python from the official sources + my patch for dtrace. The binary is built with the build-script.py script in the Mac/ directory of python, and except the dtrace patch, no other modification has been done, so it should be usable as a drop-in replacement for the official binary on python.org. You can find the binary here

Again, use it at your own risk. If you prefer building it yourself, or with different options, the patch can be found here

Building dtrace-enabled python from sources on Mac OS X

One highlight of Mac OS X Tiger is dtrace. Providers for ruby and python are also available, but only with the “system” interpreters (the one included out of the box). If you install python from http://www.python.org, you can’t use dtrace anymore. Since the code to make dtrace enable python is available on the open source corner of Apple, I thought it would be easy to apply it to pristine sources available on python.org.

Unfortunately, for some strange reasons, Apple only provides changes in the form of ed scripts applied through a Makefile; the changes are not feature specific, you just have a bunch of files for each modified file: dtrace, Apple-specific changes are all put together. I managed to extract the dtrace part of this mess, so that I can apply only dtrace related changes. The goal is to have a python as close as possible to the official binary available on python.org. The patch can be found there.

How to use ?

  • Untar the python 2.5.2 tarball
  • Apply the patch
  • Regenerate the configure scripts by running autoconf
  • configure as usual, with the additional option –enable-dtrace  (the configuration is buggy, and will fail if you don’t enable dtrace, unfortunately)
  • build python (make, make install).
It time permits, I will post a .dmg. Needless to say, you run this at your own risk.

numscons and cython

numscons 0.9.2 has just been released. The main feat of this release is cython support: I implemented a small cython tool during the cython tutorial at scipy08, and now, you can build a cython extension from .py or .pyx:

from numscons import GetNumpyEnvironment
env = GetNumpyEnvironment(ARGUMENTS)
# cython tool not loaded by default
name = "cython"
env.Tool(name)
# Build a python extension from yop.py
env.DistutilsPythonExtension(source = ["yop.py"])

The example can be found in test/examples/cython in numscons sources. This is preliminary, since there is no way to pass option to cython generation.


Observing memory usage for python programs

Several times already, I wish I could observe the memory usage of some python scripts/modules, and more importantly, which portion is consumed in which objects. For C/python problems, I’ve used massif from valgrind, but this is not alway usable. On the bzr ML, several alternatives have been suggested:

  1. reducing-the-footprint-of-python-applications
  2. Heapy
  3. Another solution which requires recompiling the python interpreter: see here. Unfortunately, numpy does not work with an interpreter compiled with this (at least on linux).

Will try those methods when I will have some more time, and report a bit in more details

April 2014
M T W T F S S
« Oct    
 123456
78910111213
14151617181920
21222324252627
282930  

Enter your email address to subscribe to this blog and receive notifications of new posts by email.

Join 9 other followers

Follow

Get every new post delivered to your Inbox.