The “every Linux distribution should have the same package manager” fallacy

I have heard several times that every Linux distribution should have the same package manager (the usual implication being that having both rpm and deb is already one too many), and it was mentioned once again recently in a well-publicized video (see the linux hater blog).

The argument goes as follows: doing packaging takes time, and making packages for every distribution is a waste of time. If every distribution used the same package system, it would be much better for 3rd party distributors. Many people answer that competition is good, having many distributions is what makes Linux great – [insert usual stuff about how good Linux is].

While it is true that multiple package systems mean more work, saying that there should only be one is kinda clueless – I wonder if anyone pushing for this has ever done any rpm/deb packaging. What makes deb vs rpm a problem is not that they are different formats, like say zip vs gzip, but that they are deployed on different systems. A RHEL rpm won’t work great on Mandrake, and even if a lot of Debian .debs work on Ubuntu, it is not always ideal. The problem is that each distribution-specific package needs to be designed for the target distribution. To build a rpm or a deb package, you need:

  • To decide where to put what
  • To encode the exact versions for the dependencies
  • To decide how to handle configuration files, set up start/stop scripts for servers, etc…

Basically, almost everything which makes the difference between distributions A and B! For file locations, the LSB tries to standardize things, but some details still differ, like where to put 64 vs 32 bits libraries. One distribution may have libfoo 1.2, another one 1.3, so even if they are compatible, you can’t use the same package for every distribution. And some libraries do not have the same name under different distributions.

So requesting the same package manager for every distribution is almost equivalent to asking that every distribution be the same. You can’t have one without the other. You can argue that there should be only one distribution, but don’t forget that Ubuntu appeared only about five years ago.

Why people should stop talking about git speed

As I have already written in a previous post, I have moved away from bzr to git for most of my software projects (I still prefer bzr for documents, like my research papers). A lot, if not most, of the comparisons of git vs other tools focus on speed. True, git is quite fast for source code management, but I think this kind of misses the point of git. It took me time to appreciate it, but one of git’s killer features for source code control is the notion of content tracking. Bzr (and I believe hg, although I could not find good information on that point) uses file ids, i.e. it tracks files, and a tree is a set of files. Git, on the contrary, tracks content, not files. In other words, it does not treat files individually, but always internally considers the whole tree.

This may seem like an internal detail, and an annoyance, because it leaks at the UI level quite a lot (the so-called index is linked to this). But it means git can record the history of code, instead of files, quite accurately. This is especially visible with git blame. One example: I recently started a massive surgery on the numpy C source code. Because of some C limitations, the numpy core C code was in a couple of gigantic source files, and I split this into more logical units. But this breaks svn blame heavily: if you just rename a file, svn blame already gets lost, whereas tools which track renames can follow it; if you split one file into two, it becomes useless. Because git tracks the whole tree, the blame command can be asked to detect code moves across files. For example, git blame with rename detection gives me the following on one file in numpy:

dc35f24e numpy/core/src/arrayobject.c         1) #define PY_SSIZE_T_CLEAN
dc35f24e numpy/core/src/arrayobject.c         2) #include <Python.h>
dc35f24e numpy/core/src/arrayobject.c         3) #include "structmember.h"
dc35f24e numpy/core/src/arrayobject.c         4)
65d13826 numpy/core/src/arrayobject.c         5) /*#include <stdio.h>*/
5568f288 scipy/base/src/multiarraymodule.c    6) #define _MULTIARRAYMODULE
2f91f91e numpy/core/src/multiarraymodule.c    7) #define NPY_NO_PREFIX
2f91f91e numpy/core/src/multiarraymodule.c    8) #include "numpy/arrayobject.h"
dc35f24e numpy/core/src/arrayobject.c         9) #include "numpy/arrayscalars.h"
38f46d90 numpy/core/src/multiarray/common.c  10)
38f46d90 numpy/core/src/multiarray/common.c  11) #include "config.h"
0f81da6f numpy/core/src/multiarray/common.c  12)
71875d5c numpy/core/src/multiarray/common.c  13) #include "usertypes.h"
71875d5c numpy/core/src/multiarray/common.c  14)  
0f81da6f numpy/core/src/multiarray/common.c  15) #include "common.h"
5568f288 scipy/base/src/arrayobject.c        16)
65d13826 numpy/core/src/arrayobject.c        17) /*
65d13826 numpy/core/src/arrayobject.c        18)  * new reference
65d13826 numpy/core/src/arrayobject.c        19)  * doesn't alter refcount of chktype or mintype ---
65d13826 numpy/core/src/arrayobject.c        20)  * unless one of them is returned
65d13826 numpy/core/src/arrayobject.c        21)  */

You can see that the original file is recovered for every line of code in the new file. The original author and date are available as well; I just removed them for the blog post.

This is truly impressive, and is one of the reasons why git is so far ahead of the competition IMHO. This kind of feature is extremely useful for open source projects, much more so than rename support. I am ready to deal with quite a few (real) git UI annoyances for this.

Edit

It looks like my example was not very clear. I am not interested in following the renames of the file: in the example above, the file was not arrayobject.c first, then renamed to multiarraymodule.c, and later to common.c. The file was created from scratch, with content taken from those files at some point. You can try the following simplified example. First, create two files, sum.c and prod.c:

sum.c:

#include <math.h>

double sum(const double* in, int n)
{
        int i;
        double acc = 0;

        for(i = 0; i < n; ++i) {
                acc += in[i];
        }

        return acc;
}

prod.c:

#include <math.h>

double prod(const double* in, int n)
{
        int i;
        double acc = 1;

        for(i = 0; i < n; ++i) {
                acc *= in[i];
        }

        return acc;
}

Commit them to your favorite VCS. Then, you reorganize the code, and in particular you put the code of both files into a new file common.c. So you create a new file common.c:

#include <math.h>

double prod(const double* in, int n)
{
        int i;
        double acc = 1;

        for(i = 0; i < n; ++i) {
                acc *= in[i];
        }

        return acc;
}

double sum(const double* in, int n)
{
        int i;
        double acc = 0;

        for(i = 0; i < n; ++i) {
                acc += in[i];
        }

        return acc;
}

And commit. Then, try blame. Rename tracking won't help at all, since nothing was renamed. On this very simple example, you could improve things by first renaming, say, sum.c to common.c, then adding the content of prod.c to common.c, but you would still lose the fact that the prod function comes from prod.c. git blame -C -M gives me the following:

^ae7f28a prod.c  1) #include <math.h>
^ae7f28a prod.c  2)
^ae7f28a prod.c  3) double prod(const double* in, int n)
^ae7f28a prod.c  4) {
^ae7f28a prod.c  5)         int i;
^ae7f28a prod.c  6)         double acc = 1;
^ae7f28a prod.c  7)
^ae7f28a prod.c  8)         for(i = 0; i < n; ++i) {
^ae7f28a prod.c  9)                 acc *= in[i];
^ae7f28a prod.c 10)         }
^ae7f28a prod.c 11)
^ae7f28a prod.c 12)         return acc;
^ae7f28a prod.c 13) }
^ae7f28a sum.c  14)
^ae7f28a sum.c  15) double sum(const double* in, int n)
^ae7f28a sum.c  16) {
^ae7f28a sum.c  17)         int i;
^ae7f28a sum.c  18)         double acc = 0;
^ae7f28a sum.c  19)
^ae7f28a sum.c  20)         for(i = 0; i < n; ++i) {
^ae7f28a sum.c  21)                 acc += in[i];
^ae7f28a sum.c  22)         }
^ae7f28a sum.c  23)
^ae7f28a sum.c  24)         return acc;
^ae7f28a sum.c  25) }

hg blame on the contrary will tell me everything comes from common.c. Even when using the rename trick, I cannot get more than the following with hg blame -f -c:

81c4468e59f9    sum.c: #include <math.h>
81c4468e59f9    sum.c:
81c4468e59f9    sum.c: double sum(const double* in, int n)
81c4468e59f9    sum.c: {
81c4468e59f9    sum.c:         int i;
81c4468e59f9    sum.c:         double acc = 0;
81c4468e59f9    sum.c:
81c4468e59f9    sum.c:         for(i = 0; i < n; ++i) {
81c4468e59f9    sum.c:                 acc += in[i];
81c4468e59f9    sum.c:         }
81c4468e59f9    sum.c:
81c4468e59f9    sum.c:         return acc;
81c4468e59f9    sum.c: }
3c1ac7db76ba common.c:
3c1ac7db76ba common.c: double prod(const double* in, int n)
3c1ac7db76ba common.c: {
3c1ac7db76ba common.c:         int i;
3c1ac7db76ba common.c:         double acc = 1;
3c1ac7db76ba common.c:
3c1ac7db76ba common.c:         for(i = 0; i < n; ++i) {
3c1ac7db76ba common.c:                 acc *= in[i];
3c1ac7db76ba common.c:         }
3c1ac7db76ba common.c:
3c1ac7db76ba common.c:         return acc;
3c1ac7db76ba common.c: }

First steps toward C code coverage in numpy

For quite some time, I wanted to add code coverage to the C part of numpy. The upcoming port to python 3k will make this even more useful, and besides, Stefan Van Der Walt promised me a beer if I could do it.

There are several tools to do code coverage of C code – the most well known is gcov (I obviously discard non-free tools – those tend to be fairly expensive anyway). The problem with gcov is its inability to do code coverage for dynamically loaded code such as python extensions. The solution is thus to build numpy and statically link it into python, which is not totally straightforward.

Statically linking simple extensions

I first looked into simpler extensions: the basic solution is to add the source files of the extensions into Modules/Setup.local in python sources. For example, to build the zlib module statically, you add

*static*
zlib zlibmodule.c -I$(prefix)/include -L$(exec_prefix)/lib -lz

Then run make: this will statically link the zlib module into python. One simple way to check whether the extension is indeed statically linked is to look at the __file__ attribute of the extension. In the dynamically loaded case, __file__ returns the location of the .so, but the attribute does not exist in the static case.
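
For example, a quick check from the interpreter could look like the following (a minimal sketch; adapt the module name to the extension you built):

import sys
import zlib

# A dynamically loaded extension has a __file__ attribute pointing to the .so;
# a statically linked (builtin) one does not, and shows up in
# sys.builtin_module_names instead.
if hasattr(zlib, '__file__'):
    print("zlib dynamically loaded from %s" % zlib.__file__)
else:
    print("zlib statically linked (builtin)")

print('zlib' in sys.builtin_module_names)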

Code coverage

To use gcov, two compilation flags are needed, and one link flag:

gcc -c -fprofile-arcs -ftest-coverage …
gcc … -lgcov

Note that -lgcov must be near the end of the link command (after the other library flags). To do code coverage of e.g. the zlib module, the following works in Modules/Setup.local:

*static*
zlib zlibmodule.c -I$(prefix)/include -fprofile-arcs -ftest-coverage -L$(exec_prefix)/lib -lz -lgcov

If everything goes right after a make call, you should have two files, zlibmodule.gcda and zlibmodule.gcno, in your Modules directory. You can now run gcov in Modules to get code coverage:

cd Modules && gcov zlibmodule

Of course, since nothing was run yet, the code coverage is 0. After running the zlib test suite, things are better though:

./python Lib/test/test_zlib.py && gcov -o Modules Modules/zlibmodule

The -o option tells gcov where to look for the gcov data (the .gcda and .gcno files), and the output is

File ‘./Modules/zlibmodule.c’
Lines executed:74.55% of 448

Build numpy statically

I quickly added a hack to build the numpy C code statically instead of dynamically in numscons (static_build branch, available on github). As it is, numpy will not work; some source code modifications are needed to make it work. The modifications reside in the static_link branch on github as well.

Then, to statically build numpy with code coverage:

LINKFLAGSEND="-lgcov" CFLAGS="-pg -fprofile-arcs -ftest-coverage" $PYTHON setupscons.py scons --static=1

where $PYTHON refers to the python you built from sources. This will build every extension as a static library. To link them into the python binary, I simply added a fake source file, and linked the numpy libraries against this fake source in Modules/Setup.local:

*static*
multiarray fake.c -L$LIBPATH -lmultiarray -lnpymath
umath fake.c -L$LIBPATH -lumath -lnpymath
_sort fake.c -L$LIBPATH -l_sort -lnpymath

where LIBPATH refers to the path where to find the static numpy libraries (e.g. build/scons/numpy/core in your numpy source tree). To run the test suite, one has to make sure to import a numpy where the multiarray, umath and _sort extensions have been removed; it will crash otherwise (as the extensions would be present twice in the python process, once as dynamically loaded code, once as statically linked code). The test suite kind of runs (~1500 tests), and one can get code coverage afterwards. For the multiarray extension, here is what I get:

File ‘build/scons/numpy/core/src/multiarray/common.c’
Lines executed:52.56% of 293
build/scons/numpy/core/src/multiarray/common.c:creating ‘common.c.gcov’

File ‘build/scons/numpy/core/include/numpy/npy_math.h’
Lines executed:50.00% of 12
build/scons/numpy/core/include/numpy/npy_math.h:creating ‘npy_math.h.gcov’

File ‘build/scons/numpy/core/src/multiarray/arraytypes.c’
Lines executed:62.23% of 1030
build/scons/numpy/core/src/multiarray/arraytypes.c:creating ‘arraytypes.c.gcov’

File ‘build/scons/numpy/core/src/multiarray/hashdescr.c’
Lines executed:68.38% of 117
build/scons/numpy/core/src/multiarray/hashdescr.c:creating ‘hashdescr.c.gcov’

File ‘build/scons/numpy/core/src/multiarray/numpyos.c’
Lines executed:81.48% of 189
build/scons/numpy/core/src/multiarray/numpyos.c:creating ‘numpyos.c.gcov’

File ‘build/scons/numpy/core/src/multiarray/scalarapi.c’
Lines executed:47.43% of 350
build/scons/numpy/core/src/multiarray/scalarapi.c:creating ‘scalarapi.c.gcov’

File ‘build/scons/numpy/core/src/multiarray/descriptor.c’
Lines executed:61.96% of 1028
build/scons/numpy/core/src/multiarray/descriptor.c:creating ‘descriptor.c.gcov’

File ‘build/scons/numpy/core/src/multiarray/flagsobject.c’
Lines executed:42.31% of 208
build/scons/numpy/core/src/multiarray/flagsobject.c:creating ‘flagsobject.c.gcov’

File ‘build/scons/numpy/core/src/multiarray/ctors.c’
Lines executed:64.69% of 1583
build/scons/numpy/core/src/multiarray/ctors.c:creating ‘ctors.c.gcov’

File ‘build/scons/numpy/core/src/multiarray/iterators.c’
Lines executed:70.41% of 774
build/scons/numpy/core/src/multiarray/iterators.c:creating ‘iterators.c.gcov’

File ‘build/scons/numpy/core/src/multiarray/mapping.c’
Lines executed:77.95% of 721
build/scons/numpy/core/src/multiarray/mapping.c:creating ‘mapping.c.gcov’

File ‘build/scons/numpy/core/src/multiarray/number.c’
Lines executed:51.80% of 361
build/scons/numpy/core/src/multiarray/number.c:creating ‘number.c.gcov’

File ‘build/scons/numpy/core/src/multiarray/getset.c’
Lines executed:44.09% of 372
build/scons/numpy/core/src/multiarray/getset.c:creating ‘getset.c.gcov’

File ‘build/scons/numpy/core/src/multiarray/sequence.c’
Lines executed:50.00% of 60
build/scons/numpy/core/src/multiarray/sequence.c:creating ‘sequence.c.gcov’

File ‘build/scons/numpy/core/src/multiarray/methods.c’
Lines executed:47.35% of 942
build/scons/numpy/core/src/multiarray/methods.c:creating ‘methods.c.gcov’

File ‘build/scons/numpy/core/src/multiarray/convert_datatype.c’
Lines executed:56.11% of 442
build/scons/numpy/core/src/multiarray/convert_datatype.c:creating ‘convert_datatype.c.gcov’

File ‘build/scons/numpy/core/src/multiarray/convert.c’
Lines executed:66.67% of 183
build/scons/numpy/core/src/multiarray/convert.c:creating ‘convert.c.gcov’

File ‘build/scons/numpy/core/src/multiarray/shape.c’
Lines executed:76.81% of 345
build/scons/numpy/core/src/multiarray/shape.c:creating ‘shape.c.gcov’

File ‘build/scons/numpy/core/src/multiarray/item_selection.c’
Lines executed:55.07% of 937
build/scons/numpy/core/src/multiarray/item_selection.c:creating ‘item_selection.c.gcov’

File ‘build/scons/numpy/core/src/multiarray/calculation.c’
Lines executed:59.08% of 523
build/scons/numpy/core/src/multiarray/calculation.c:creating ‘calculation.c.gcov’

File ‘build/scons/numpy/core/src/multiarray/usertypes.c’
Lines executed:0.00% of 111
build/scons/numpy/core/src/multiarray/usertypes.c:creating ‘usertypes.c.gcov’

File ‘build/scons/numpy/core/src/multiarray/refcount.c’
Lines executed:66.67% of 129
build/scons/numpy/core/src/multiarray/refcount.c:creating ‘refcount.c.gcov’

File ‘build/scons/numpy/core/src/multiarray/conversion_utils.c’
Lines executed:59.49% of 316
build/scons/numpy/core/src/multiarray/conversion_utils.c:creating ‘conversion_utils.c.gcov’

File ‘build/scons/numpy/core/src/multiarray/buffer.c’
Lines executed:56.00% of 25
build/scons/numpy/core/src/multiarray/buffer.c:creating ‘buffer.c.gcov’

File ‘build/scons/numpy/core/src/multiarray/scalartypes.c’
Lines executed:42.42% of 877
build/scons/numpy/core/src/multiarray/scalartypes.c:creating ‘scalartypes.c.gcov’

File ‘build/scons/numpy/core/src/multiarray/ucsnarrow.c’
Lines executed:89.36% of 47
build/scons/numpy/core/src/multiarray/ucsnarrow.c:creating ‘ucsnarrow.c.gcov’

File ‘build/scons/numpy/core/src/multiarray/arrayobject.c’
Lines executed:58.75% of 514
build/scons/numpy/core/src/multiarray/arrayobject.c:creating ‘arrayobject.c.gcov’

File ‘build/scons/numpy/core/src/multiarray/multiarraymodule.c’
Lines executed:49.12% of 1134
build/scons/numpy/core/src/multiarray/multiarraymodule.c:creating ‘multiarraymodule.c.gcov’

The figures themselves are not that meaningful ATM, since the test suite does not run completely, and the built numpy is a quite bastardized version of the real numpy.

The numpy modifications, although small, are very hackish – I just wanted to see if that could work at all. If time permits, I hope to be able to automate most of this, and have a system where it can be integrated in the trunk. I am still not sure about the best way to build the extensions themselves. I can see other solutions, such as producing a single file per extension, with every internal numpy header/source integrated, so that they could easily be built from Setup.local. Or maybe a patch to the python sources so that make in the python sources would automatically build numpy.

Python packaging: a few observations, cabal for a solution ?

The python packaging situation has been causing quite some controversy for some time. The venerable distutils has been augmented with setuptools, zc.buildout, pip, yolk and what not. Some people praise those tools, some others despise them; in particular, discussion about setuptools keeps coming up in the python community, and almost every time, the discussion goes nowhere, because what some people consider broken is a feature for others. It seems to me that the conclusion of those discussions is obvious: no tool can make everybody happy, so there has to be a system such that different tools can be used for different usages, without interfering with each other. The solution is to agree on common formats and data/metadata, so that people can build on them and communicate with each other.

You can find a lot of information on people who like setuptools/eggs, and their rationale for it. A good summary, with a web-developer POV, is given by Ian Bicking. I thought it would be useful to give the other side of the story, that is, people like me, whose needs are very different from those of the web-development crowd (the community which pushes eggs the most AFAICS).

Distutils limitation

Most of those tools are built on top of distutils, which is a first problem. Distutils is a giant mess, with tight, undocumented coupling between vastly different parts. Distutils takes care of configuration (rarely used, except for projects like numpy which need to probe for fairly low level system dependencies), build, installation and package building. I think that’s the fundamental issue of distutils: the installation and deployment parts do not need to know so much about each other, and should be split. The build part should be easily extensible, without too much magic or assumption, because different projects have different needs. The king here is of course make; but ruby for example has rake and rant, etc…

A second problem of distutils is its design, which is not so good. Distutils is based on commands (one command does the build of C extensions, one command does the installation, one command builds eggs in the case of setuptools, etc…). Commands are fundamentally imperative in distutils: do this, and then that. This is far from ideal for several reasons:

You can’t pass options between commands

For example, if you want to change the compilation flags, you have to pass them to every concerned command.

Building requires handling dependencies

You declare some targets, which depend on some other targets, and the build tool builds a dependency graph to build them in the right order. AFAIK, this is the ONLY correct way to build software. Distutils commands are inherently incapable of doing that. That’s one example where the web development crowd may be unaware of the need for this: Ian Bicking for example says that we do pretty well without it. Well, I know I don’t, and having a real dependency system for numpy/scipy would be wonderful. In the scientific area, large, compiled libraries won’t go away soon.
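
To make this concrete, here is a minimal sketch (plain python, with made-up target names) of what declaring targets and dependencies means: the tool, not the user, derives the build order from the graph.

# Targets and their dependencies are declared; the build order is derived with
# a simple post-order walk of the graph (no cycle detection here, this is only
# meant to illustrate the concept). The target names are made up.
deps = {
    'multiarray.so': ['arrayobject.o', 'ctors.o'],
    'arrayobject.o': ['arrayobject.c', 'config.h'],
    'ctors.o': ['ctors.c', 'config.h'],
}

def build_order(target, done=None):
    if done is None:
        done = []
    for d in deps.get(target, []):
        build_order(d, done)
    if target not in done:
        done.append(target)
    return done

print(build_order('multiarray.so'))
# -> ['arrayobject.c', 'config.h', 'arrayobject.o', 'ctors.c', 'ctors.o', 'multiarray.so']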

Fragile extension system

Maybe even worse: extending distutils means extending commands, which makes code reuse quite difficult, or causes some weird issues. In particular, in numpy, we need to extend distutils fairly extensively (for fortran support, etc…), and setuptools extends distutils as well. Problem: we have to take setuptools monkey-patching into account. It quickly becomes impractical when more tools are involved (the combinations grow exponentially).

Typical problem: how to make setuptools and numpy.distutils extensions cohabit? Another example: paver is a recent but interesting tool for doing common tasks related to the build. Paver extends setuptools commands, which means it does not (it can’t) work with numpy.distutils extensions. The problem can be somewhat summarized by: I have class A in project A, class B(A) in project B and class C(A) in project C – how do I handle B and C in a later package? I am starting to think it can’t be done reliably using inheritance (the current way).
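
A stripped-down sketch of the problem (all names made up, this is not the actual distutils code): two projects independently subclass the same command, and whoever needs both has to invent the combination, with monkey-patching on top of that in the real world.

class build_ext(object):                    # stands for the distutils command
    def run(self):
        print("plain build_ext")

class setuptools_build_ext(build_ext):      # setuptools' extension
    def run(self):
        print("setuptools additions")
        build_ext.run(self)

class numpy_build_ext(build_ext):           # numpy.distutils' extension
    def run(self):
        print("fortran support")
        build_ext.run(self)

# A project needing both has to invent the combination itself...
class combined_build_ext(numpy_build_ext, setuptools_build_ext):
    pass

# ... and hope the pieces compose. They don't here: numpy_build_ext.run calls
# build_ext.run directly, so the setuptools additions are silently skipped.
combined_build_ext().run()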

Extending commands is also particularly difficult for anything non-trivial, due to various issues: lack of documentation, the related distutils code is horrible (attributes added on the fly for no good reason), and nothing is very well specified. You can’t retrieve where distutils builds a given file (library, source file, .o file, etc…), for example. You can’t get the name of the sdist target (you have to recreate the logic yourself, which is platform dependent). Etc…

Final problem: you can’t really call commands directly in setup.py. As a recent example encountered in numpy: I wanted to install a C library built through the libraries argument of setup. I can’t just add the file to the install command. Now, since we extend the install command in numpy.distutils, it should have been simple: just retrieve the name of the library, and add it to the list of files to install. But you can’t retrieve the name of the built library from the install command, and the install command does not know about the build_clib one (the one which builds C libs).

Packaging, dependency management

This is maybe the most controversial issue. By packaging, I mean putting everything which constitutes the software (configuration, .py, .so/.pyd, documentation, etc…) into a format which can be deployed on many machines in a consistent way. For web developers, it seems this means something which can be put on a couple of machines, in a known state. For packages like numpy, this means being able to install on many different kinds of platforms, with different capabilities (different C runtimes, different math libraries, different optimized libraries, etc…). And other cases exist as well.

For some people, the answer is: use a sane OS with package management, and life goes on. Other people consider the setuptools way of doing things almost perfect; it does everything they want, and they don’t understand those pesky Debian developers who complain about multiple versions, etc… I will try to summarize the different approaches here, and the related issues.

The underlying problem is simple: any non-trivial software depends on other things to work. Obviously, any python package needs a python interpreter. But most will also need other packages: for example, sphinx needs pygments and Jinja to work correctly. This becomes a problem because software evolves: unless you take great care about it, software will become incompatible with an older version. For example, the package foo 1.1 decided to change the order of arguments in one function, so bar, which worked with foo 1.0, will not work with foo 1.1. There are basically three ways to deal with this problem:

  1. Forbid the situation. Foo 1.1 should not break software which works with foo 1.0. It is a bug, and foo should be fixed. That’s generally the preferred OS vendor approach.
  2. Bypass the problem by bundling foo in bar. The idea is to distribute a snapshot of most of your dependencies, in a known working situation. That’s the bundling approach.
  3. Install multiple versions: bar will require foo 1.1, but fubar still uses the old foo 1.0, so both foo 1.0 and foo 1.1 should be installed. That’s the “setuptools approach”.

Package management à la linux is the most robust approach in the long term for the OS. If foo has a bug, only one version needs to be repackaged. For system administrators, that’s often the best solution. It has some problems, too: generally, things cannot be installed without admin privileges, and packages are often fairly old. The latter point is not really a problem, but inherent to the approach: you can’t request both stability and bleeding edge. And obviously, it does not work for the other OSes. It also means you are at the mercy of your OS vendor.

Bundling is the easiest. The developer works with a known working state, and is not dependent on the OS vendor to get an up to date version.

Option 3 sounds like the best solution, but in my opinion, it is the worst, at least in the current state of affairs as far as python is concerned, and when the software target is “average users”. The first problem is that many people seem to ignore the problems caused by multiple, side-by-side installations. Once you start saying “depends on foo 1.1 and later, but not higher than 1.3”, you start creating a management hell, where many versions of every package are installed. The more it happens, the more likely you get into a situation like the following:

  • A depends on B >= 1.1
  • A depends on C which depends on B <= 1.0

Meaning a broken dependency. This situation has to be avoided as much as possible, and the best way to avoid it is to maintain compatibility, such that B 1.2 can be used as a drop-in replacement for B 1.0. I think too often people request multiple versions as a poor man’s replacement for backward compatibility. I don’t think it is manageable. If you need a known version of a library which keeps changing, I think bundling is better – generally, if you want deployable software, you should really avoid depending on libraries which change too often; I think there is no way around it. If you don’t care about deploying on many machines (which seems to be the case for web deployment), then virtualenv and other similar tools are helpful; but they can’t seriously be suggested as a general deployment tool for the same audience as .deb/.rpm/.msi/.pkg. Deployment for testing is very different from deployment to many machines you can’t control at all (the users’ ones).

Now, having a few major versions of the most common libraries should be possible – after all, it is done for C libraries (with the same library installed under different versions with different sonames). But python, contrary to C loaders, does not support explicit version loading independently of the name. You can’t say something like “import foo with v >= 1.1”; you have to use a new name for the module – meaning changing every library user’s source code. So you end up with hacks as used by setuptools/easy_install, which are very fragile (sys.path overriding, PYTHONPATH mess, easy_install.pth, etc…). At least for me, that’s a constant source of frustration, to the point that I effectively forbid setuptools to do anything on my machine: easy-install.pth is read only, and I always install with --single-version-externally-managed.
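
For reference, the closest existing thing to “import foo with v >= 1.1” is the setuptools runtime machinery, and it works precisely by rewriting sys.path before the import – which is the fragile part (the package name below is hypothetical):

# pkg_resources adjusts sys.path so that an egg matching the requirement comes
# first; the subsequent import is still a plain import by name.
import pkg_resources
pkg_resources.require("foo>=1.1,<1.3")   # hypothetical package
import foo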

With things like virtualenv and pip freeze, I don’t understand the need for multiple versions of the same libraries installed system-wide. I can see how python does not make it easy to support tools like virtualenv and pip directly (that is, without setuptools), but maybe people should focus on enabling virtualenv/zc.buildout usage without the setuptools hacks (sys.path hacking, easy_install.pth), basically without setuptools, instead of pushing the multiple-version thing on everyone?

Standardize on data, not on tools

As mentioned previously, I don’t think python should standardize on one tool. The problem is just too vast. I would be very frustrated if setuptools became the tool of choice for python – but I understand that it solves issues for some people. Instead, I hope the python community will be able to standardize on metadata. Most packages have relatively simple needs, which could be covered with a set of static metadata.

It looks like such a design already exists: cabal, the packaging tool for haskell (Thanks to Fernando Perez for pointing me to cabal):

http://www.haskell.org/cabal/release/cabal-latest/doc/users-guide/

Cabal works with two files:

  • setup.hs -> the equivalent of our setup.py. It can use haskell, and as such can do pretty much anything
  • the .cabal file: static metadata.

For example:

Name: HUnit
Version: 1.1.1
Cabal-Version: >= 1.2
License: BSD3
License-File: LICENSE
Author: Dean Herington
Homepage: http://hunit.sourceforge.net/
Category: Testing
Synopsis: A unit testing framework for Haskell

Library
  Build-Depends: base
  Exposed-modules:
    Test.HUnit.Base, Test.HUnit.Lang, Test.HUnit.Terminal,
    Test.HUnit.Text, Test.HUnit
  Extensions: CPP

Even for a developer who knows nothing about haskell (like me :) ), this looks obvious. Basically, the classifiers and arguments of the distutils setup function go into the static file in haskell. By being a simple, readable text file, other tools can use it pretty easily. Of course, we would provide an API to get those data, but the common infrastructure is the file format and metadata, not the API.
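
To give an idea of how little machinery such a format needs, here is a minimal sketch of a parser for the flat “Field: value” part of such a file (it deliberately ignores sections and conditionals, and the file name is made up):

def parse_metadata(lines):
    """Parse "Field: value" lines; indented lines without a colon are treated
    as continuations of the previous field (e.g. a long module list)."""
    meta = {}
    last = None
    for line in lines:
        if not line.strip():
            continue
        if ':' in line:
            field, value = line.split(':', 1)
            last = field.strip()
            meta[last] = value.strip()
        elif line[0].isspace() and last is not None:
            meta[last] += ' ' + line.strip()
    return meta

meta = parse_metadata(open('hunit.cabal').readlines())   # hypothetical file name
print("%s %s" % (meta['Name'], meta['Version']))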

Note that the .cabal file allows for conditionals, albeit in a very structured form. I don’t know whether this should be followed or not: the point of a static file is that it is easily parsable. Having conditionals severely decreases the simplicity. OTOH, a simple way to add options is nice – and other almost static metadata files for packaging, such as RPM .spec files, allow for this.

It could also be simple to convert many distutils packages to such a format; actually, I would be surprised if the majority of packages out there could not be automatically translated to such a mechanism.

Then, we could gradually deprecate some distutils commands (to end up with configure/build/install, with configure optional), such that different build tools could be plugged in for the build itself – distutils could be used for simple packages (the ones without compiled extensions), and other people could use other tools for more advanced needs (something like what I did with numscons, which bypasses distutils entirely for building C/C++/Fortran code).

Uninstall

Another often requested feature. I think it is a difficult feature to support reliably. Uninstall is not just about removing files: if you install a daemon, you should stop it, you may ask about configuration files, etc… It should at least support pre/post install hooks and the corresponding uninstall equivalents. But the main problem for python is how to keep a list of installed packages/files. Since python packages can be installed in many locations, there should be one db (the db could and most likely should be a simple flat file) for each site-packages. I am not yet familiar with haskell module management, but it looks like that’s how haskell does it.
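
As a rough sketch of what such a flat-file db could look like (this is only to fix ideas, not a proposal for an actual format): one file per site-packages, one line per installed file, and uninstalling boils down to removing the recorded files, with hooks run around it.

import os

DB = 'installed-files.db'    # hypothetical name; would live inside the target site-packages

def record_install(package, files, db=DB):
    # append one "package<TAB>path" line per installed file
    with open(db, 'a') as f:
        for path in files:
            f.write("%s\t%s\n" % (package, path))

def uninstall(package, db=DB):
    kept = []
    for line in open(db):
        pkg, path = line.rstrip('\n').split('\t', 1)
        if pkg == package:
            # pre/post uninstall hooks would be run around this
            if os.path.exists(path):
                os.remove(path)
        else:
            kept.append(line)
    with open(db, 'w') as f:
        f.writelines(kept)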

Conclusion

Different people have different needs. Any solution from one camp which prevents other solutions is very unhelpful and counter-productive. I don’t want to get my ubuntu deployment system screwed up by some toy dependency system – but I don’t want to prevent web developers from using their workflow either. I can’t see a single system solving all this altogether – the problem has not been solved by anything I know of – it is too big a problem to hope for a general solution. Instead of piling complexity and hacks over complexity and hacks, we should standardize the commonalities (of which there are plenty), and make sure different systems can be used by different projects.

Scipy on windows amd64

I spent some more time on this today, and for the first time ever, I managed to build and run the full test suite on windows 64 bits! The code changes in scipy are small (~50 lines), for the internal isnan and co. functions in cephes (the windows linker is much stricter about multiple symbols with the same name).

There are still some issues; in particular, depending on how I run the test suite, it crashes right away (when nose looks for tests). But when the test suite does run, it successfully passes 3000 unit tests, only fails about 20 of them, and some of those failures are relatively harmless. To build both numpy and scipy on windows amd64, I used the following combination:

  • gcc: 4.4, snapshot 20090320
  • binutils: 2.19.51
  • mingw-w64: trunk@rev 692

I use the same versions for the cross compiler (linux->windows) and to build the native compiler from the cross-compiler. Two things need to be done:

  • remove the redundant pow function from mingwex (it is already available in the MS runtime)
  • build the native compiler with -O0. Building with the usual optimization flags builds a buggy native compiler (the gcc driver does not call as, resulting in a quite strange “foo.c: file type not recognized”, because the driver then gives foo.c to the linker directly instead of assembling it first).

Maybe scipy 0.7.1 + numpy 1.3.0 will be able to run on windows 64 bits :)

Gfortran + Visual studio

For numpy 1.3, I wanted to make gfortran work with Visual Studio on windows. The reason is twofold: gfortran is the fortran compiler of the gcc 4.* series (the 3.* series is starting to become ancient), and g77 will not run on windows 64 bits. Making gfortran + gcc work together is of course a no-brainer – but linking fortran code built by gfortran with code built by visual studio is another story.

There are no official binaries for the gcc 4.* series on windows yet – I simply built my own native toolchain from Mac OS X. The makefiles can be found on a mini git branch here:

http://github.com/cournape/cross-mingw-w64/tree/master

Makefile.mingw32 builds the cross compiler from unix (both linux and mac os X work), and makefile.native uses the cross compiler to build the native toolchain. On a fast computer, building both toolchains for C/Fortran/C++ takes little time (< 30 minutes on a quadcore with lots of memory).

Building blas/lapack is easy: just use the make.inc.gfortran in lapack-lite-3.1.1, and run make from cygwin (or msys – if you use cygwin, take care to use the native compilers, and not the cygwin ones). Now, for a linking example, we will use the following code snippet:


#include <stdio.h>
void sgemm_(char *, char*, int*, int*, int*,
          float*, float*, int*, float*, int*, float*, float*, int*);
int
main (void)
{
    char transa = 'N', transb = 'N';
    int lda = 2;
    int ldb = 3;
    int n = 2, m = 2, k = 3;
    float alpha = 1.0, beta = 0.0;
 
    float A[] = {1, 4,
                 2, 5,
                 3, 6};
 
    float B[] = {1, 3, 5,
                 2, 4, 6};
    int ldc = 2;
    float C[] = { 0.00, 0.00,
                 0.00, 0.00 };
 
    /* Compute C = A B */
    sgemm_(&transa, &transb, &n, &m, &k,
          &alpha, A, &lda, B, &ldb, &beta, C, &ldc);
 
    printf("C = {%f, %f; %f, %f}\n", C[0], C[2], C[1], C[3]);
    return 0;
}

 

It simply calls into the blas library (sgemm) to compute the matrix product between A and B. First, the C code is compiled as follows:

cl /c main.c

Then, copy the following files as follows:

copy libmingw32.a mingw32.lib
copy libmingwex.a mingwex.lib
copy libgcc.a gcc.lib
copy libgfortran.a gfortran.lib

(all those libraries are installed either in lib or mingw\lib directories of the mingw install). Also copy the blas library into blas.lib. You can then link the whole as follows:

link.exe main.obj blas.lib gfortran.lib mingw32.lib gcc.lib mingwex.lib

If everything goes well, the executable main.exe should run, and display the correct matrix product. This works for VS 2008 and should work for 2005 as well. It does not work for me for VS 2003, but this may be a bug in mingw or a problem when building the toolchain.

Numscons: current state, future, alternative build tools for numpy

Several people in the numpy/scipy community have raised build issues recently. Brian Granger has wondered whether numscons is still maintained, and Ondrej recently asked why numscons is not the default build tool for numpy. I thought I owed some explanation of how I see the future of numscons.

First, I am still interested in working on numscons, but lately, time has become a scarcer resource: I am at the end of my PhD, and as such cannot spend too much time on it. Also, numscons is more or less done, in the sense that it does what it was supposed to do. Of course, many problems remain. Most of them are either implementation details or platform-specific bugs, which can only be dealt with if numscons is integrated into numpy — which raises the question of integrating numscons into numpy.

Currently, I see one big limitation in numscons: the current architecture is based on launching a scons subprocess for every subpackage, sequentially. As stupid as this decision may sound, there are relatively strong rationales for it. First, scipy is designed as a set of almost independent packages, and that’s true for the build process as well. Every subpackage declares its requirements (blas, lapack, etc…), and can be built independently of the others: if you launch the build process at the top of the source tree, the whole scipy is built; if you launch it in scipy/sparse/sparsetools, only sparsetools is built. This is a strong requirement: this is impossible to do with autotools, for example, unless each subpackage has its own configure (like gcc does).
It is possible in theory with scons, but it is practically almost impossible to do so while staying compatible with distutils, because of build directory issues (the build directory cannot be a subdirectory of the source tree). When I started numscons, I could see only two solutions: launching independent scons builds for each subpackage, or having a whole source tree with the configuration at the top; but scons is too slow for the latter solution (although it is certainly the best one from a design POV).

The second problem is that scons cannot be used as a library. You cannot do something like “from scons import *; build(‘package1’); build(‘package2’)”, which means the only simple way to have independent builds for each package is to launch independent scons processes. Having to use subprocesses to launch scons is the single fundamental numscons issue:

  1. Because scons is slow to start (it needs to check for tools, etc…), no-op builds are slow (it takes 20 seconds to complete a no-op full scipy build on a more than decent machine, which is why numscons has an option --package-list to list the packages to rescan, but that’s nothing more than an ugly hack).

  2. Error handling is hard to do: if scons fails, it is hard to pass useful information back to the calling process.

  3. Since distutils still handles installation and tarball generation, it needs to know about the source files. But since only scons knows about them, it is hard to pass this information back to distutils from scons. Currently, it only works because distutils knows the sources from the conventional setup.py files.

Another limitation I see with scons is the code quality: scons is a relatively old project, and focused a lot on backward compatibility, with a lot of cruft (scons still supports python 1.5). There is still a lot of development happening, and it is still supported; scons is used in several high-profile projects (some vmware products are built with scons, Intel acknowledges its use internally, Google uses it – Steve Knight, the first author of scons, works at Google on the Chrome project, and Chrome’s sources have scons scripts). But there is a lot of tight coupling, and changing core implementation issues is extremely challenging. It is definitely much better than distutils (in the sense that in distutils, everything is wrong: the implementation, the documentation, the UI, the concepts). But fixing scons’ tight coupling is a huge task, to the point where rewriting some core parts from scratch may be easier (see here). There are also some design decisions in scons which are not great, like options handling (everything is passed through Environment instances, which are nothing more than a big global variable).

A potential solution would be to use waf instead of scons. Waf started as a scons fork, and dropped backward compatibility. Waf has several advantages:

  • it is much smaller and nicer implementation-wise than scons (core waf is ~ 4000 LOC, scons is ten times more). There are some things I can do today in waf I still have no idea how to do in scons, although I am much more familiar with the latter.
  • waf is much faster than scons (see here for some benchmarks)
  • it seems like waf can be used as a library

But:

  • waf is not stable (the API kept changing; the main waf developer said he would focus on stability from now on)
  • waf does not support Fortran – this one should be relatively easy to solve (I have some code working already for most fortran requirements in scipy)
  • I am still not satisfied with its tool handling – that’s a hard problem though, I have not seen a single build tool which handle this well. That’s not a fundamental issue to prevent the use of waf.
  • support on windows looks flaky

There may also be hope to see more cross-pollination between scons and waf – but that’s a long-term goal, unless someone can work on it full time for at least several weeks IMHO. Porting numscons to waf should be relatively easy once fortran is handled – I think a basic port could be done in a day or two.

Ondrej also mentioned cmake: I think it would be very hard to build numpy with cmake, because it means you have to give up distutils entirely. How would it work with easy_install? How would you generate tarballs, windows installers, etc…? If anyone wants to try something else, here is my suggestion: ignoring tarball/install issues, try to build numpy.core alone on windows, mac os X and linux, with either Atlas, Accelerate or nothing. Use the scons scripts as information for the checks to do (they are much more readable than the distutils setup.py files). If you can do this, you will have solved most of the build details necessary to build the whole scipy. It will certainly give you a good overview of the difficulty of the task.

From ctypes to cython for C library wrapping

Since the cython presentation by R. Bradshaw at Scipy08, I wanted to give cython a shot to wrap existing C libraries. Up to now, my method of choice has been ctypes, because it is relatively simple, and can be done in python directly.

The problem with ctypes

I was not entirely satisfied with ctypes, in particular because it is sometimes difficult to control some platform-dependent details, like type sizes and so on; ctypes has of course the notion of platform-independent types with a given size (int32_t, etc…), but some libraries define their own types, with the underlying implementation depending on the platform. Also, making sure the function declarations match the real ones is awkward; ctypes’ author Thomas Heller developed a code generator to generate those declarations from headers, but they are dependent on the header you are using; some libraries unfortunately have platform-dependent headers, so in theory you should generate the declarations at installation time, but this is awkward because the code generator uses gccxml, which is not widely available.
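
To illustrate the kind of problem I mean with a made-up example (libfoo, foo_count_t and foo_seek do not exist, they stand for any library defining its own platform-dependent types):

import ctypes

libfoo = ctypes.CDLL("libfoo.so")    # made-up library

# Say the header declares "typedef ... foo_count_t;" whose actual size (32 or
# 64 bits) depends on the platform or on how the library was configured. The
# ctypes wrapper has to hardcode a guess, and nothing checks it against the
# real header: a mismatch silently corrupts arguments instead of failing.
foo_count_t = ctypes.c_int64

libfoo.foo_seek.restype = foo_count_t
libfoo.foo_seek.argtypes = [ctypes.c_void_p, foo_count_t, ctypes.c_int]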

Here comes cython

One of the advantages of Cython for low level C wrapping is that cython declarations need not be exact: in theory, you can’t pass an invalid pointer for example, because even if the cython declaration is wrong, the C compiler will complain about the C file generated by cython. Since the generated C file uses the actual header file, you are also pretty sure to avoid any mismatch between declarations and usage; at worst, the failure will happen at compilation time.

Unfortunately, cython does not have a code generator like ctypes. For a long time, I have wanted to add sound output capabilities to audiolab, in particular for mac os X and ALSA (linux). Unfortunately, those APIs are fairly low level. For example, here is an extract of AudioHardware (the HAL of CoreAudio) usage:

AudioHardwareGetProperty(kAudioHardwarePropertyDefaultOutputDevice,
                         &count, (void *) &(audio_data.device))

AudioDeviceGetProperty(audio_data.device, 0, false,
                       kAudioDevicePropertyBufferSize,
                       &count, &buffer_size)

The Mac OS X convention is that variables starting with k are enums, defined like:

kAudioDevicePropertyDeviceName = 'name',
kAudioDevicePropertyDeviceNameCFString = kAudioObjectPropertyName,
kAudioDevicePropertyDeviceManufacturer = 'makr',
kAudioDevicePropertyDeviceManufacturerCFString = kAudioObjectPropertyManufacturer,
kAudioDevicePropertyRegisterBufferList = 'rbuf',
kAudioDevicePropertyBufferSize = 'bsiz',
kAudioDevicePropertyBufferSizeRange = 'bsz#',
kAudioDevicePropertyChannelName = 'chnm',
kAudioDevicePropertyChannelNameCFString = kAudioObjectPropertyElementName,
kAudioDevicePropertyChannelCategoryName = 'ccnm',
kAudioDevicePropertyChannelNominalLineLevelNameForID = 'cnlv'
...

These use the implicit conversion from char[4] to int – which is not supported by cython AFAIK. With thousands of enums defined this way, any process which is not mostly automatic will be painful.
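
For the record, a four character code like 'bsiz' is simply the 32 bits integer obtained by packing the four ASCII characters in big-endian order (which is how those multi-character constants are evaluated on Mac OS X); computing one value is easy, the pain is doing it for thousands of constants by hand:

import struct

def fourcc(code):
    # 'bsiz' -> 0x6273697a: four ASCII bytes packed as a big-endian 32 bits int
    return struct.unpack('>I', code.encode('ascii'))[0]

print(hex(fourcc('bsiz')))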

During the Scipy08 cython presentation, I asked whether there was any plan toward automatic generation of cython ‘headers’, and Robert fairly answered “please feel free to do so”. As announced a couple of days ago, I have taken the idea of the ctypes code generator, and ‘ported’ it to cython; I have used it on scikits.audiolab to write a basic ALSA and CoreAudio player, and used it to convert my old ctypes-based wrapper to sndfile (a C library for audio file IO). This has worked really well: the optional typing in cython makes some parts of the wrapper easier to implement than in ctypes (I don’t need to check whether an int-like argument won’t overflow, for example). Kudos to the cython developers!

Usage on alsa

For completeness, I added a simple example of how to use the xml2cython codegen with ALSA, as used in scikits.audiolab. Hopefully, it should show how it can be used for other libraries. First, I parse the headers with gccxml, using the ctypeslib h2xml helper:

h2xml /usr/include/alsa/asoundlib.h -o asoundlib.xml

Now, I use the xml2cython script to parse the xml file and generate some .pxd files. By default, the script will pull out almost everything from the xml file, which will generate a big cython file. xml2cython has a couple of basic filters, though, so that I only pull out what I want; in the alsa case, I was mostly interested in a couple of functions, so I used the input file filter:

xml2cython.py -i input -o alsa.pxd alsa/asoundlib.h asoundlib.xml

This generates alsa.pxd with declarations of the functions whose names match the list in input, plus all the typedefs/structures used as arguments (they are recursively pulled out, so if one argument is a function pointer, the types in the function pointer should hopefully be pulled out as well). The exception is enums: every enum defined in the parsed tree from the xml is put out automatically in the cython file, because ‘anonymous’ enums are usually not part of function declarations in C (enums are not typed in C, so it would not be so useful). This means every enum coming from standard header files will be included as well, and this is ugly – as well as making cython compilation much slower. So I used a location filter as well, which tells xml2cython to pull out only enums which are defined in files matched by the filter:

xml2cython.py -l alsa -i input -o alsa.pxd alsa/asoundlib.h asoundlib.xml

This works since every alsa header on my system is of the form /usr/include/alsa/*.h. I used something very similar on the AudioHardware.h header in CoreAudio. The generated cython can be seen in the scikits trunk here. Doing this kind of thing by hand would have been particularly error-prone…

cython-codegen: cython code generator from gccxml files

I have enjoyed using cython to wrap C libraries recently. Unfortunately, some libraries I was interested in (Alsa, CoreAudio) are quite big. In particular, they have a lot of structures, typedefs and enumerations which are easy to get wrong when declaring them manually. Since the problem is quite similar to wrapping with ctypes (my former method of choice), I thought it would be interesting to do something like the ctypeslib code generator for cython – hence the cython-codegen “project”, available on github:

http://github.com/cournape/cython-codegen

Basic usage goes like this to generate a .pyx file for the foo.h header:

gccxml -I. foo.h -o foo.xml
xml2cython.py -l 'foo' foo.h foo.xml

I can’t stress enough that this is little more than a throw-away script, and is likely to fail on many header files, or generate invalid cython code. I could use it successfully on non trivial headers though, like alsa or CoreAudio on Mac OS X. Your mileage may vary.

Going away from bzr toward git

(this is a small rant about why I like bzr less and less and like git more and more; this is only a personal experience, not a general git vs bzr thing, take it as such).

Source control systems are a vital tool for any serious software project. They provide a history of the project, are an invaluable tool for the release process, etc… When I started to develop some code outside school exercises, I wanted to learn one for my own projects.

Using svn

This was not so long ago – 3-4 years ago – and at that time, SVN was the logical choice. I wanted to use it on my machine, to keep history, and to be able to go back; since I mainly code for scientific research, the time and rollback aspects were particularly important.

Using SVN did not really make sense to me at that time: using it to track other projects was of course easy (checking out, log, commit), but I could not really understand how to use it for my own projects:

  • I could not understand their branches and tags concept. Note that I did not even know what those terms meant at that time; I did not understand why it would matter at all where I put the tags and branches, why I needed to copy things for tags, etc… From the svn-book, it was not really clear what the difference between branches and tags was.
  • Setting up svn on one machine is awkward: why should I create a repository somewhere, and populate it from somewhere else? How should I back up the repository?
  • Getting back in time is unintuitive: you have to “merge back” in time the revisions you want to roll back. This is really error-prone.

Bzr, the first source control which made sense to me

In the end, I found it easier to just use tarballs to save the state of my projects (my projects are always quite small). Then, a bit more than two years ago, I discovered bzr (bzr-ng at that time): it was a better arch, the SCS developed by Tom Lord for distributed development. Arch had always intrigued me, but was extremely awkward: it could not handle windows very well, there were strange filenames, and it was source code invasive. Even checking out other projects like rhythmbox was painful. bzr, on the contrary, was really simple:

  • Creating a new project ? bzr init in the top directory, and then adding the code and committing. No separate directory for the db, no “bzradmin” to create the repository
  • branches and tags (tags came a bit later in bzr, starting at version 0.15 IIRC) were dead easy: bzr branch to create a branch, no need to use copy commands, etc… Tags are even easier.

I have used bzr ever since for all my projects; in the meantime, I have been much more involved with several open source projects, which all use svn, and I have always felt svn was an inferior, more complicated tool compared to bzr. With bzr, I understood what branches could be used for, and more generally how an SCS can be helpful for development.

Since bzr was so pleasant to use, I of course wanted to use it for the projects I was involved with, so I was really excited by bzr-svn to track svn repositories. Unfortunately, bzr-svn has never been a really pleasant experience. One problem was that the python wrappers of libsvn were really buggy (to the point that bzr-svn now has its own wrapper). Also, it was extremely slow to import revisions, and failed on some repositories I used bzr-svn on. That’s how I started to look at other tools, in particular hg: hg had the ability to import svn, and it was more reliable than bzr-svn in my experience. But it was not really practical for committing back to a svn repository, so I never investigated this really deeply.

Bzr annoyances

At the same time, there were some things which I was never thrilled by with bzr. Two in particular:

One branch per directory

That’s a conscious design decision from the bzr developers. This means it is a bit simpler to know where you are (a branch is a path), but I find it awkward when you need to compare branches or need to “jump” from branch to branch. When you are deep down inside the tree of your project, comparing branches (diff, log, etc…) becomes annoying because you have to refer to branches by their path.

Revision numbers

Each commit is assigned a revid by bzr, which is a unique identifier per repository. That’s the identifier bzr deals with internally. But for most UI purposes, you deal with revnos, that is, simple integer numbers: of course, because of the distributed nature of bzr, those numbers are not unique for a repository, only within a branch. I find this extremely confusing. Again, this appears more clearly when comparing several branches at the same time. For example, when I have not worked on a project for a long time, I may not remember the relative state of different branches: the bzr missing command is then very useful to know which commits are unique to one branch. But the numbers mean different things in different branches, which means they are useless in that case; being useless would actually have been ok, but they are in fact very confusing.

For example, I recently went back to a branch I had not worked on for more than a month. Let’s say my current development focus is in branch A, and I wanted to see the status of branch B. I can use bzr missing for that purpose. I can see that 5 revisions, from 300 to 305, are missing. I then go into branch B, and study the source code a bit, in particular with bzr blame. I see some code with revisions under 300 in branch B, which I could not see in branch A. Now, this was confusing: any revision before 300 is in A too according to bzr missing, so how is it possible for bzr blame to report different code in A and B, for a section committed with a revno < 300? The reason is that revision 305 is actually a merge, and when going through the detailed log in branch B, I can see that revision 305 contains 296.1.1, then 299.1, 299.2, 299.3 and 299.4. I can’t see how this is a useful behavior. Maybe I am biased as someone doing a lot of math all day long, but having 296.1.1 after 304 does not make any sense to me. What’s the point of using supposedly simple numbers when they have an arbitrary ordering, which changes depending on where you are seeing them? SVN revnos were already quite confusing when using branches, but bzr made it worse in my opinion.

Nitpicks

There were also things which were less significant for me, but still unpleasant: bzr startup is really slow, and its use in scripts is not really practical – if you want to do anything substantial, you have to study the plugin API. Also, it started to become a bit inflexible for some things: for example, incorporating a second project also tracked by bzr into a first project is difficult (if not impossible; I could never manage to do it), history-related operations are often slow, using a lot of branches takes a lot of space unless you are using a shared repository, which feels like a hack more than a real solution, etc…

(Re)-Discovering git

Around the same time, I had to use git for one project which I was interested in. I found it much easier to use than when I had looked at it for the first time. There was no cogito anymore, and the basic commands were like bzr’s. I decided to give git-svn a try, and it was much faster than bzr-svn at importing some projects; the repositories were extremely small [1]. Also, although the git UI is still quite arcane, I found git itself a pleasure to use: it felt simple, because the concepts were simple – much more so than with bzr, in fact. sha-1s for revisions are not awkward, because you barely use them at the UI level (the git UI is very powerful for human revision handling: no numbers, but you can easily ask for the parent in a branch or in the DAG relative to a given revision, you can look up commits by committer, by string in the commit or the code, by date, etc…); bzr revnos feel like a hack after being used to git. For example, wherever I am, if I want to compare branch2 to branch1, in git I can do:

git log branch1..branch2
git diff branch1..branch2

Also, git is scriptable, which is appealing to the Unix user in me. I can understand the POV of the bzr developers concerning extensibility with plugins (it is not unlike the argument of UNIX pipes vs Windows COM extensions as developed by Miguel in his Let’s make Unix not suck [2]), but I prefer the git model in the end. Bzr’s decision to go toward extensibility with plugins is not without merit: I think the good error reporting from bzr is partly a consequence of this choice. OTOH, git messages can be cryptic; but git’s simplicity at the core level makes this much less significant than I first expected.

A key difference of git compared to bzr is that git is really just a content tracker. It does not track directories at all, or filenames for example: it instead tries to detect when you rename files. I remember at least one time when this was mentioned on the bzr ML [3], where a bzr developer argued that bzr could do like git, while keeping explicit meta information (recorded when you tell bzr to rename a file). One obvious drawback is that depending on how the change was made to the tree, patch vs merge for example, bzr’s behavior will be different; this is very serious in my opinion. Especially for a language like python, where file and directory names matter, directory renames should be quickly propagated, and can never be done lightly anyway. And it means git can be much better at dealing with renames when importing external data, merging between unrelated branches, etc… Because its rename detection algorithm is used all the time, it has to work quite well. It is a bit similar to the merge capability of distributed SCSs: there is no reason for them to be inherently better at merging, but because they would be unusable without good merge tracking capability, this has to work reliably from the start in a DVCS. Even if in theory bzr could detect renames like git (in addition to its explicit rename handling), in practice it has not happened, and as far as I am aware, nobody has done any work in that direction.

There is another advantage of git I did not mention, but that’s because it has been rehashed ad nauseam, and it is the most obvious one to anyone using both tools: git is incredibly fast. Many things I would never do with bzr because they would take too much time are doable with git; sometimes, git favors speed too much (in its rename detection, for example: you should really be aware of the -M and -C options of log and other history-related commands), but even when telling git to spend time detecting renames, it is still much faster than bzr.

Finally, git is getting a lot of traction: it is used by Linux, Xorg, android, RoR and a lot of freedesktop projects, and is being discussed for KDE. This means it will become even better, and that other DVCSs will have a very hard time competing. As a very concrete example: git UI improvements were much more significant than bzr speed improvements during the last year (bzr speed has not improved much in my experience since 0.92 and the pack format: long history and network use make bzr almost unusable for big projects with a large history contributed by a large team across the world; OTOH, git 1.5.3 was the first git version which I could use without hurting my head too much).

For all those reasons – simplicity of the core model, flexibility, scriptability, and speed – I think I will start to use git for all my projects, and give up on bzr. I think bzr is still superior to git for some things, and depending on the project or the tree you are tracking, bzr may be better (in particular because it tracks directories, which git does not, and this can matter; I am also not sure whether git would be appropriate for tracking /etc or your $HOME).

[1] for every project I have imported so far, the git clone is as big as or smaller than a svn checkout; you read that right: one revision checked out from svn is often bigger than the full git history. I have imported the full history of numpy, scipy and the scikits on my github account, and I have not used much more than half of my 100 Mb account.

[2] http://primates.ximian.com/~miguel/bongo-bong.html

[3] https://lists.ubuntu.com/archives/bazaar/2007q3/028591.html