Going away from bzr toward git

(this is a small rant about why I like bzr less and less and like git more and more; this is only a personal experience, not a general git vs bzr thing, take it as such).

Source control systems are a vital tool for any serious software project. They provide a history of the project, are invaluable for the release process, etc. When I started to develop code outside school exercises, I wanted to learn one for my own projects.

Using svn

This was not so long ago, 3-4 years, and at that time SVN was the logical choice. I wanted to use it on my machine to keep history and to be able to go back; since I mainly code for scientific research, the time and rollback aspects were particularly important.

Using SVN did not really make sense to me at that time: using it to track other projects was of course easy (checking out, log, commit), but I could not really understand how to use it for my own projects:

  • I could not understand its branches and tags concept. Note that I did not even know what those terms meant at that time; I did not understand why it would matter at all where I would put the tags and branches, why I needed to copy things for tags, etc. From the svn-book, it was not really clear what the difference between a branch and a tag was.
  • Setting up svn on one machine is awkward: why should I create a repository somewhere, and populate it from somewhere else? How should I back up the repository?
  • Getting back in time is unintuitive: you have to “merge back” in time the revisions you want to roll back. This is really error-prone.

Bzr, the first source control which made sense to me

In the end, I found it easier to just use tarballs to save the state of my projects (my projects are always quite small). Then, a bit more than two years ago, I discovered bzr (bzr-ng at that time): it was a better arch, the SCS developed by Tom Lord for distributed development. Arch always intrigued me, but was extremely awkward: it could not handle Windows very well, there were strange filenames, and it was invasive in the source tree. Even checking out other projects like rhythmbox was painful. bzr on the contrary was really simple:

  • Creating a new project? Run bzr init in the top directory, then add the code and commit. No separate directory for the db, no “bzradmin” to create the repository.
  • Branches and tags (tags came a bit later in bzr, starting at version 0.15 IIRC) were dead easy: bzr branch to create the branch, no need to use some copy commands, etc. Tags are even easier.

I have used bzr ever since for all my projects; in the meantime, I have been much more involved with several open source projects, which all use svn, and I always felt svn was an inferior, more complicated tool compared to bzr. With bzr, I understood what branches could be used for, and more generally how an SCS can be helpful for development.

Since bzr was so pleasant to use, I of course wanted to use it for the projects I was involved with, so I was really excited by bzr-svn to track svn repositories. Unfortunately, bzr-svn has never been a really pleasant experience. One problem was that the python wrappers of libsvn were really buggy (to the point that bzr-svn now has its own wrapper). Also, it was extremely slow to import revisions, and it failed on some of the repositories I tried it on. That’s how I started to look at other tools, in particular hg: hg had the ability to import svn, and it was more reliable than bzr-svn in my experience. But it was not really practical for committing back to an svn repository, so I never investigated this really deeply.

Bzr annoyances

At the same time, there were some things about bzr which never thrilled me. Two in particular:

One branch per directory

That’s a conscious design decision from the bzr developers. This means it is a bit simpler to know where you are (a branch is a path), but I find it awkward when you need to compare branches or “jump” from branch to branch. When you are deep down inside the tree of your project, comparing branches (diff, log, etc.) becomes annoying because you have to refer to branches by their path.

Revision numbers

Each commit is assigned a revid by bzr, which is a unique identifier. That’s the identifier bzr deals with internally. But for most UI purposes, you deal with revnos, that is simple integers: of course, because of the distributed nature of bzr, those numbers are not unique for a repository, only within a branch. I find this extremely confusing. Again, this appears more clearly when comparing several branches at the same time. For example, when I have not worked on a project for a long time, I may not remember the relative state of different branches: the bzr missing command is then very useful to know which commits are unique to one branch. But the numbers mean different things in different branches, which means they are useless in that case; being useless would have actually been ok, but they are in fact very confusing.

For example, I recently went back to a branch I had not worked on for more than a month. Let’s say my current development focus is in branch A, and I wanted to see the status of branch B. I can use bzr missing for that purpose, and I can see that 5 revisions, from 300 to 305, are missing. I then go into branch B and study the source code a bit, in particular with bzr blame. I see some code with a revision under 300 in branch B which I could not see in branch A. Now, this was confusing: any revision before 300 is in A too according to bzr missing, so how is it possible for bzr blame to report different code in A and B for a section committed with a revno < 300? The reason is that revision 305 is actually a merge, and when going through the detailed log in branch B, I can see that revision 305 contains 296.1.1, then 299.1, 299.2, 299.3 and 299.4. I can’t see how this is a useful behavior. Maybe I am biased as someone doing a lot of math all day long, but having 296.1.1 after 304 does not make any sense to me. What’s the point of using supposedly simple numbers when they have an arbitrary ordering, which changes depending on where you are seeing them? SVN revnos were already quite confusing when using branches, but bzr made it worse in my opinion.

Nitpicks

There were also things which were less significant for me, but still unpleasant: bzr startup is really slow, and it is not really useful in scripts (if you want to do anything substantial, you have to study the plugin API). Also, it started to become a bit inflexible for some things: for example, incorporating a second project also tracked by bzr into a first project is difficult (if not impossible; I could never manage to do it), history-related operations are often slow, using a lot of branches takes a lot of space unless you are using a shared repository, which feels like a hack more than a real solution, etc.

(Re)-Discovering git

About the same time, I had to use git for one project which I was interested in. I found it much easier to use than when I looked at it for the first time. There was no cogito anymore, and the basic commands were like bzr. I decided to give git-svn a try, and it was much faster than bzr-svn at importing some projects; the repositories were extremely small [1]. Also, although the git UI is still quite arcane, I found git itself a pleasure to use: it felt simple, because the concepts were simple (much more so than in bzr, in fact). SHA-1 revision identifiers are not awkward, because you barely use them at the UI level. The git UI is very powerful for human-friendly revision handling: no numbers, but you can easily ask for a parent in a branch or in the DAG relative to a given revision, and you can search by committer, by string in the commit message or the code, by date, etc. bzr revnos feel like a hack after getting used to git. For example, wherever I am, if I want to compare branch2 to branch1, in git I can do:

git log branch1..branch2
git diff branch1..branch2

Also, git is scriptable, which is appealing to the Unix user in me. I can understand the POV of the bzr developers concerning extensibility with plugins (it is not unlike the argument of UNIX pipes vs Windows COM extensions as developed by Miguel in his Let’s make Unix not suck [2]), but I prefer the git model in the end. Bzr’s decision to go toward extensibility with plugins is not without merit: I think the good error reporting from bzr is partly a consequence of this choice. OTOH, git messages can be cryptic; but git’s simplicity at the core level makes this much less significant than I first expected.

A key git difference compared to bzr is that git is really just a content tracker. It does not track directories at all, or filenames for example: it instead tries to detect when you rename files. I remember at least one occasion when this was mentioned on the bzr ML [3], where a bzr developer argued that bzr could do like git, while keeping explicit meta information (when you tell bzr to rename a file). One obvious drawback is that depending on how the change was made to the tree, patch vs merge for example, bzr’s behavior will be different; this is very serious in my opinion. Especially for a language like python, where file and directory names matter, directory renames should be quickly propagated, and can never be done lightly anyway. And it means git can be much better at dealing with renames when importing external data, merging between unrelated branches, etc. Because its rename detection algorithm is used all the time, it has to work quite well. It is a bit similar to the merge capability of distributed SCS: there is no reason for them to be inherently better at merging, but because they would be unusable without good merge tracking capability, this has to work reliably from the start in a DVCS. Even if in theory bzr could detect renames like git (in addition to its explicit rename handling), in practice it has not happened, and as far as I am aware, nobody has done any work in that direction.

Another advantage of git I did not mention, because it has been rehashed ad nauseam and is the most obvious one to anyone using both tools: git is incredibly fast. Many things I would never do with bzr because they would take too much time are doable with git; sometimes, git favors speed too much (in its rename detection, for example: you should really be aware of the -M and -C options in log and other history-related commands), but even when telling git to spend time detecting renames, it is still much faster than bzr.
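To make the -M option a bit more concrete, here is a small throwaway experiment. It is only a sketch: the file names, the commit identity and the temporary repository are made up for the demonstration, and it assumes git is installed.

```shell
#!/bin/sh
# Throwaway demo of rename detection with -M; it never touches a real project.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Hypothetical identity, passed per command so no global config is needed.
commit() { git -c user.name=demo -c user.email=demo@example.com commit -q -m "$1"; }

printf 'line %s\n' 1 2 3 4 5 > old_name.txt
git add old_name.txt
commit "add a file"

# No rename metadata is stored: git only records that one path
# disappeared and another one appeared with the same content.
git mv old_name.txt new_name.txt
commit "rename the file"

# -M asks git to pair deleted and added paths whose content is similar;
# --name-status then reports the pair as an R (rename) entry with a score.
renames=$(git diff -M --name-status HEAD~1 HEAD)
echo "$renames"
```

The -C option extends the same kind of detection to copies.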

Finally, git is getting a lot of traction: it is used by Linux, Xorg, Android, RoR, a lot of freedesktop projects, and is being discussed for KDE. This means it will become even better, and that other DVCS will have a very hard time competing. As a very concrete example: git UI improvements were much more significant than bzr speed improvements during the last year (bzr speed has not improved much in my experience since 0.92 and the pack format: long history and network access make bzr almost unusable for big projects with a large history contributed by a large team across the world; OTOH, git 1.5.3 was the first git version which I could use without hurting my head too much).

For all those reasons (simplicity of the core model, flexibility, scriptability, and speed), I think I will start to use git for all my projects, and give up on bzr. I think bzr is still superior to git for some things, and depending on the project or the tree you are tracking, bzr may be better (in particular because it tracks directories, which git does not, and this can matter; I am also not sure whether git would be appropriate for tracking /etc or your $HOME).

[1] For every project I have imported so far, the git clone is as big as or smaller than an svn checkout; you read that right: one revision checked out from svn is often bigger than the full history in git. I have imported the full history of numpy, scipy and scikits on my github account, and I have not used much more than half of my 100 MB account.

[2] http://primates.ximian.com/~miguel/bongo-bong.html

[3] https://lists.ubuntu.com/archives/bazaar/2007q3/028591.html

How to embed a manifest into a dll with mingw tools only

(DISCLAIMER: I am not a windows guy; all the discussion here is how I understand things from various sources).

With Visual Studio 2005, MS introduced a mechanism called side by side assemblies and C/C++ isolated applications. Assembly is the MS term which encompasses usual dlls as well as .Net modules targeting the CLR, the .Net runtime (e.g. anything programmed in C#). The idea is to provide a mechanism to deal with the well known dll hell, because there was no proper versioning scheme for dlls in windows. You can read more here:

Why should you care as a python developer? Concretely, starting from VS 2005, if you build a python extension with the mingw compiler, it will link against a runtime which is not available system-wide (such as in C:\Windows\system32 by default), causing a runtime error when loading the extension (msvcr80.dll not found). A simple way to reproduce the result is to have a small dll, and try to link it to a simple executable with the MS runtime:

# This works:
gcc -shared hello.c -o hello.dll
gcc main.c hello.dll -o main.exe
# This does not:
gcc -shared hello.c -o hello.dll
gcc main.c hello.dll -o main.exe -lmsvcr90

If you build the 2nd way, explicitly linking msvcr90, you will get a dll not found error when running the executable, because the dll is not in the system paths (and should not be; the dll is not redistributable). Starting from VS 2005, the only way to refer to VS libraries is to use manifests, which are xml files embedded in the binary. Those manifests are automatically generated by the MS compiler. Assuming you already have the manifest, how can you generate a binary using it without using MS compilers?

1. Build the object of the dll:

    gcc -c hello.c

2. Have a hello.rc file which refers to the manifest file (2 seems to refer to dll, vs 1 for exe, but I am not sure):

    #include "winuser.h"
    2 RT_MANIFEST hello.dll.manifest

3. Build the .res file, which will embed the xml file into the resource file (.res):

    windres --input hello.rc --output hello.res --output-format=coff

4. Link the whole thing together:

    gcc -shared -o hello.dll hello.o hello.res -lmsvcr90

Now, executing main.exe should be possible. There is still the problem of generating the manifest file. Since in our case the problem is mainly with the MSVC runtime, maybe we can just reuse the same manifest all the time to stay compatible with the python.org binary?
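For reference, a manifest which only declares a dependency on the VC 9.0 runtime is quite short. The sketch below writes one as hello.dll.manifest, the name used in hello.rc. The version and publicKeyToken values are an assumption: they are the ones usually seen for the Visual Studio 2008 runtime, and they have to match the runtime really installed on the target machine, so check them against a manifest generated by the MS tools before relying on them.

```shell
#!/bin/sh
# Write a minimal manifest declaring a dependency on the VC 9.0 C runtime.
# ASSUMPTION: version and publicKeyToken are those usually seen for the
# VS 2008 runtime; they must match the runtime installed on the target.
cat > hello.dll.manifest <<'EOF'
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly xmlns="urn:schemas-microsoft-com:asm.v1" manifestVersion="1.0">
  <dependency>
    <dependentAssembly>
      <assemblyIdentity type="win32" name="Microsoft.VC90.CRT"
                        version="9.0.21022.8" processorArchitecture="x86"
                        publicKeyToken="1fc8b3b9a1e18e3b"/>
    </dependentAssembly>
  </dependency>
</assembly>
EOF
```

The resulting file is what windres embeds when it processes hello.rc.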

A few more links on the topic:

http://www.ddj.com/windows/184406482

http://msdn.microsoft.com/en-us/library/ms235591(VS.80).aspx

http://www.codeproject.com/KB/COM/regsvr42.aspx

Build a cross compiler from linux to windows (mingw)

It is not too difficult to find a mingw cross compiler to produce windows binaries from linux, except that it does not include the fortran part (g77, or gfortran for 4.*). Here is a relatively simple script which can build gcc and g77 (and other compilers as well, I guess: the difficult part is to build the bootstrap gcc + system headers/libs; once you get this right, all the other parts are much easier).

The script has the following characteristics:

1. It needs the tarballs in the archives directory
2. It will build everything in the build directory
3. It will install everything in the install directory
4. By default, the above directories are created in the directory of the script
5. It does not build gcc in the sources, but in separate build directories, which is the advised way to do it

It is not too difficult to do it by hand once you know exactly what to do. The basic scheme is the following:

1. Build the binutils with target mingw32 and install them in $PREFIX (e.g. --prefix=$PREFIX --target=mingw32).
2. Add $PREFIX/bin to your PATH.
3. Copy the content of the include directories of both w32api and mingw32-runtime into $PREFIX/mingw32/include.
4. Create an empty build directory for the bootstrap gcc, and inside it, build a minimal gcc (e.g. --enable-languages=c) with the same target and prefix options as binutils.
5. The tricky part: once the bootstrap gcc is built, go back into the w32api directory, and configure it with *host* as mingw32 and prefix as $PREFIX/mingw32: configure --prefix=$PREFIX/mingw32 --host=mingw32. Build and install it.
6. In the parent directory of the w32api build directory, make a soft link named w32api pointing to the w32api build directory (e.g. ln -s w32api-3.11 w32api). I think that this is a bug, but without doing this, the mingw32 runtime won’t pick up the w32api headers (it does not look in $PREFIX/target/include).
7. Make sure that the extracted archive of the mingw runtime is in the parent directory of w32api.
8. Configure, build and install the mingw runtime the exact same way as w32api (same host, same prefix).
9. Now, you can build the final gcc, g77, etc. by just using --target=mingw32.
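The nine steps above can be sketched as a small dry-run script. This is not the script mentioned earlier, only an illustration of the order of operations: the source directory names are hypothetical placeholders (only w32api-3.11 comes from the text), and by default the script just prints the commands instead of running them.

```shell
#!/bin/sh
# Dry-run sketch of the nine steps. Set DRY_RUN=0 to really execute.
# ASSUMPTION: binutils-src, gcc-src and mingw-runtime-src are hypothetical
# names for wherever the corresponding tarballs were extracted under $SRC.
SRC=${SRC:-$PWD/build}
PREFIX=${PREFIX:-$PWD/install}
DRY_RUN=${DRY_RUN:-1}
LANGS=${LANGS:-c,f77}   # use c,fortran for the gcc 4.* series

run() {
    # Print the command, and execute it only when not in dry-run mode.
    echo "+ $*"
    if [ "$DRY_RUN" != 1 ]; then "$@"; fi
}

configure_make_install() {
    # usage: configure_make_install <directory to build in> <configure command...>
    dir=$1; shift
    run mkdir -p "$dir"
    (
        if [ "$DRY_RUN" != 1 ]; then cd "$dir" || exit 1; fi
        run "$@" && run make && run make install
    )
}

build_all() {
    # 1. binutils for the mingw32 target
    configure_make_install "$SRC/binutils-build" \
        "$SRC/binutils-src/configure" --prefix="$PREFIX" --target=mingw32
    # 2. make the freshly built cross tools visible
    echo "+ PATH=$PREFIX/bin:\$PATH"
    if [ "$DRY_RUN" != 1 ]; then PATH=$PREFIX/bin:$PATH; export PATH; fi
    # 3. headers of w32api and of the mingw runtime
    run mkdir -p "$PREFIX/mingw32/include"
    run cp -r "$SRC/w32api-3.11/include/." "$PREFIX/mingw32/include"
    run cp -r "$SRC/mingw-runtime-src/include/." "$PREFIX/mingw32/include"
    # 4. minimal bootstrap gcc (C only), built outside the source tree
    configure_make_install "$SRC/gcc-bootstrap-build" \
        "$SRC/gcc-src/configure" --prefix="$PREFIX" --target=mingw32 \
        --enable-languages=c
    # 5. w32api, configured with mingw32 as *host*
    configure_make_install "$SRC/w32api-3.11" \
        ./configure --prefix="$PREFIX/mingw32" --host=mingw32
    # 6. soft link next to the w32api directory
    run ln -s "$SRC/w32api-3.11" "$SRC/w32api"
    # 7-8. mingw runtime, same host and prefix as w32api
    configure_make_install "$SRC/mingw-runtime-src" \
        ./configure --prefix="$PREFIX/mingw32" --host=mingw32
    # 9. final compilers
    configure_make_install "$SRC/gcc-final-build" \
        "$SRC/gcc-src/configure" --prefix="$PREFIX" --target=mingw32 \
        --enable-languages="$LANGS"
}

build_all
```

Reading the printed commands first is a cheap way to check the paths before spending an hour compiling.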

I managed to build both the gcc 3.* and 4.* series for C and fortran using the sources provided on the mingw website. I have not extensively tested the built compilers, but I managed to compile non trivial programs with both g77 and gfortran (blas, lapack). The only thing not working properly is dlltool, to produce the .lib from the dll: using the resulting import library with visual studio will crash, but this does not seem to work with mingw32 on windows either (i.e. this is a bug in mingw).

Build recent python wrapper for subversion

I wanted to use a more recent version of the python wrapper for subversion. I thought it would be easy. Boy was I wrong: compiling subversion is a pain. I wanted to use subversion 1.4.6, which depends on the following software:

1. apache with mod_dav support
2. neon (0.25.5 version, nothing else!)
3. openssl

Compiling OpenSSL was not too difficult once I solved the TLS problem with libz. Compiling apache was not too difficult either, but you need to configure it with the following options:

./configure --enable-mods-shared="most ssl dav" --prefix=/export/bbc8/local/

Compiling neon was not difficult either. For subversion:

./configure --prefix=/export/bbc8/local/stow/subversion-1.4.6 --with-apxs=/export/bbc8/local/bin/apxs

(of course, replace /export/bbc8/local with the install path you want)

TLS definition section .tbss mismatches non-TLS reference

This is the nice message I get while trying to compile some stuff on CentOS 5. I am not familiar with this distribution, and do not have admin rights on it, so I have to compile quite a lot of software myself. This error seems to come from the fact that for some reason, when the linker links libz, it gets a really old (libc 5) libz, which is not binary compatible with libc 6 (TLS refers to Thread Local Storage). The culprit is in the ld configuration file, which contains the following line:

/usr/i486-linux-libc5/lib

The only way I found to get the right libs first was to put /usr/lib in LD_LIBRARY_PATH, which is ugly. But without it, I cannot get most software to compile correctly.
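For the record, the workaround looks like the following in a Bourne-style shell. This is a minimal sketch: only /usr/lib comes from the situation described above, and the rest is just a careful way of prepending a directory to the variable.

```shell
# Put /usr/lib in front of whatever LD_LIBRARY_PATH already contains.
# The ${VAR:+...} form avoids leaving a trailing ":" when the variable was
# unset or empty (an empty entry is treated as the current directory).
LD_LIBRARY_PATH=/usr/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
export LD_LIBRARY_PATH
echo "$LD_LIBRARY_PATH"
```

Setting it only in the shell used for the build, rather than in a login script, keeps the ugliness contained.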