Recent progress on bento – build numpy !

I have spent the last few days on a relatively big feature for bento: recursive package description. The idea is to be able to simply describe packages in a deeply nested hierarchy without having to write long paths, and to split complicated packages descriptions into several files.

At the level, the addition is easy:

Subento: numpy/core, numpy/lib ...

It took me more time to figure out a way to do it in the hook file. I ended up with a recurse decorator:

@recurse(["numpy/core/bscript", "numpy/lib/bscript"])
def some_func(ctx):

I am not sure it is the right solution yet, but it works for now. My first idea was to simply use a recurse function attached to hook contexts (the ctx argument), but I did not find a good way to guarantee an execution order (declaration order == execution order), and it was a bit unintuitive to integrate both hook decorator and the recurse together.

The reason why I tackle this now is that bento is at a stage where it need to be used on “real” builds to get a feeling of what works and what does not. The target is numpy and hopefully later scipy. Although I still hope to integrate waf or scons in bento as the canonical way of building numpy/scipy with bento, this also gives a good test for yaku (my simple build system).

It took me less than half a day to port the scons scripts to bento/yaku. A full build, unnoptimized build of numpy with clang is less than 10 seconds. A no-op build is ~ 150 ms, but as yaku does not have all the infrastructure for header dependency tracking yet, the number for no-op build is rather meaningless.

Solving huge memory use on Ubuntu 9.10

I noticed huge (~ 500 Mb) usage of memory on my workstation running on Ubuntu 9.10 (on amd64), causing issues when running large numerical numpy scripts. It ended up being related to ureadahead.

Removing ureadahead, and rebooting seems to fix the issue for now. I am baffled such a bug stayed in a release version. It seems that every new version of Ubuntu is slightly worse than the previous one. It is still better than many other low-hassle distributions, but I worry about the trend.

The “every Linux distribution should have the same package manager” fallacy

I have heard several times that every linux distribution should have the same package manager (where it is understood that there is one-too-many within the rpm vs deb), and it was mentioned once again recently in a well publicized video (see on linux hater blog)

The argument goes as follows: doing packaging takes time, and making packages for every distribution is a waste of time. If every distribution used the same package system, it would be much better for 3rd party distributors. Many people answer that competition is good, having many distributions is what makes Linux great – [insert usual stuff about how good Linux is].

While it is true that multiple packages systems means more work, saying that there should only be one is kinda clueless – I wonder if anyone pushing for this has even done any rpm/deb pacaking. What makes deb vs rpm a problem is not that they are different formats, like say zip vs gunzip, but that they are deployed on different systems. A RHEL rpm won’t work great on Mandrake, and even if a lot of debian .deb work on Ubuntu, it is not always ideal. The problem is that each distribution-specific package needs to be designed for the target distribution. To build a rpm or a deb package, you need:

  • To decide where to put what
  • To encode the exact versions for the dependencies
  • To decide how to handle configuration files, set up start/stop scripts for servers, etc…

Basically, almost everything which makes the difference between a distribution A and B ! For file locations, the LSB tries to standardize on this, but some things are different, like where to put 64 vs
32 bits libraries. One distribution may have libfoo 1.2, another one 1.3, so even if they are compatible, you can’t use the same for every distribution. Or some libraries do not have the same name under different distributions.

So requesting the same package manager for every distribution is almost equivalent to asking that every distribution should be the same. You can’t have one without the other. You can argue that there should be only one distribution, but don’t forget that Ubuntu appeared like 5 years ago.

Why people should stop talking about git speed

As I have already written in a previous post, I have moved away from bzr to git for most of my software projects (I still prefer bzr for documents, like my research papers). A lot if not most of the comparison of git vs other tools focus on speed. True, git is quite fast for source code management, but I think this kinds of miss the point of git. It took me time to appreciate it, but one of the git’s killer feature for source code control is the notion of content tracking. Bzr (and I believe hg although I could not find good information on that point) use file id, i.e. they track files, and a tree is a set of files. Git, on the contrary, tracks content, not files. In other words, it does not treat files individually, but always internally consider the whole tree.

This may seem like an internal detail, and an annoyance because it leaks at the UI level quite a lot (the so-called index is linked to this). But this means that it can record the history of code instead of files quite accurately. This is especially visible with git blame. One example: I recently started a massive surgery on the numpy C source code. Because of some C limitations, the numpy core C code was in a couple of giantic source files, and I split this into more logical units. But this breaks svn blame heavily. If you just rename a file, svn blame is lost can follow renames. But if you split one file into two, it becomes useless. Because git tracks the whole tree, the blame command can be asked to detect code moves across files. For example, git blame with rename detections gives me the following on one file in numpy:

dc35f24e numpy/core/src/arrayobject.c         1) #define PY_SSIZE_T_CLEAN
dc35f24e numpy/core/src/arrayobject.c         2) #include <Python.h>
dc35f24e numpy/core/src/arrayobject.c         3) #include "structmember.h"
dc35f24e numpy/core/src/arrayobject.c         4)
65d13826 numpy/core/src/arrayobject.c         5) /*#include <stdio.h>*/
5568f288 scipy/base/src/multiarraymodule.c    6) #define _MULTIARRAYMODULE
2f91f91e numpy/core/src/multiarraymodule.c    7) #define NPY_NO_PREFIX
2f91f91e numpy/core/src/multiarraymodule.c    8) #include "numpy/arrayobject.h"
dc35f24e numpy/core/src/arrayobject.c         9) #include "numpy/arrayscalars.h"
38f46d90 numpy/core/src/multiarray/common.c  10)
38f46d90 numpy/core/src/multiarray/common.c  11) #include "config.h"
0f81da6f numpy/core/src/multiarray/common.c  12)
71875d5c numpy/core/src/multiarray/common.c  13) #include "usertypes.h"
71875d5c numpy/core/src/multiarray/common.c  14)  
0f81da6f numpy/core/src/multiarray/common.c  15) #include "common.h"
5568f288 scipy/base/src/arrayobject.c        16)
65d13826 numpy/core/src/arrayobject.c        17) /*
65d13826 numpy/core/src/arrayobject.c        18)  * new reference
65d13826 numpy/core/src/arrayobject.c        19)  * doesn't alter refcount of chktype or mintype ---
65d13826 numpy/core/src/arrayobject.c        20)  * unless one of them is returned
65d13826 numpy/core/src/arrayobject.c        21)  */

You can notice that the original file can be found for every line of code in the new file. The original author and date may be found as well, I just removed them for the blog post.

This is truely impressive, and is one of the reason why git is so far ahead of the competition IMHO. This kind of features is extremely useful for open source projects, much more than rename support. I am ready to deal with quite a few (real) Git UI annoyances for this.


It looks like my example was not very clear. I am not interested in following the renames of the file: in the example above, the file was not arrayobject.c first, then renamed to multiarraymodules.c, and later to common.c. The file was created from scratch, with content taken from those files at some point. You can try the following simplified example. First, create two files prod.c and sum.c:

#include double sum(const double* in, int n)
int i;
double acc = 0;

for(i = 0; i < n; ++i) { acc += in[i]; } return acc; } [/sourcecode] [sourcecode language='c'] #include

double prod(const double* in, int n)
int i;
double acc = 1;

for(i = 0; i < n; ++i) { acc *= in[i]; } return acc; } [/sourcecode] Commit to your favorite VCS. Then, you reorganize the code, and in particular you put the code of both files into a new file common.c. So you create a new file common.c: [sourcecode language='c']#include

double prod(const double* in, int n)
int i;
double acc = 1;

for(i = 0; i < n; ++i) { acc *= in[i]; } return acc; } double sum(const double* in, int n) { int i; double acc = 0; for(i = 0; i < n; ++i) { acc += in[i]; } return acc; } [/sourcecode] And commit. Then, try blame. Rename tracking won't help at all, since nothing was renamed. On this very simple example, you could improve things by first renaming say sum.c to common.c, then adding the content of prod.c to common.c, but you will still loose that the prod function comes from prod.c. git blame -C -M gives me the following:

^ae7f28a prod.c  1) #include <math.h>
^ae7f28a prod.c  2)
^ae7f28a prod.c  3) double prod(const double* in, int n)
^ae7f28a prod.c  4) {
^ae7f28a prod.c  5)         int i;
^ae7f28a prod.c  6)         double acc = 1;
^ae7f28a prod.c  7)
^ae7f28a prod.c  8)         for(i = 0; i < n; ++i) {
^ae7f28a prod.c  9)                 acc *= in[i];
^ae7f28a prod.c 10)         }
^ae7f28a prod.c 11)
^ae7f28a prod.c 12)         return acc;
^ae7f28a prod.c 13) }
^ae7f28a sum.c  14)
^ae7f28a sum.c  15) double sum(const double* in, int n)
^ae7f28a sum.c  16) {
^ae7f28a sum.c  17)         int i;
^ae7f28a sum.c  18)         double acc = 0;
^ae7f28a sum.c  19)
^ae7f28a sum.c  20)         for(i = 0; i < n; ++i) {
^ae7f28a sum.c  21)                 acc += in[i];
^ae7f28a sum.c  22)         }
^ae7f28a sum.c  23)
^ae7f28a sum.c  24)         return acc;
^ae7f28a sum.c  25) }

hg blame on the contrary will tell me everything comes from common.c. Even when using the rename trick, I cannot get more than the following with hg blame -f -c:

81c4468e59f9    sum.c: #include <math.h>
81c4468e59f9    sum.c:
81c4468e59f9    sum.c: double sum(const double* in, int n)
81c4468e59f9    sum.c: {
81c4468e59f9    sum.c:         int i;
81c4468e59f9    sum.c:         double acc = 0;
81c4468e59f9    sum.c:
81c4468e59f9    sum.c:         for(i = 0; i < n; ++i) {
81c4468e59f9    sum.c:                 acc += in[i];
81c4468e59f9    sum.c:         }
81c4468e59f9    sum.c:
81c4468e59f9    sum.c:         return acc;
81c4468e59f9    sum.c: }
3c1ac7db76ba common.c:
3c1ac7db76ba common.c: double prod(const double* in, int n)
3c1ac7db76ba common.c: {
3c1ac7db76ba common.c:         int i;
3c1ac7db76ba common.c:         double acc = 1;
3c1ac7db76ba common.c:
3c1ac7db76ba common.c:         for(i = 0; i < n; ++i) {
3c1ac7db76ba common.c:                 acc *= in[i];
3c1ac7db76ba common.c:         }
3c1ac7db76ba common.c:
3c1ac7db76ba common.c:         return acc;
3c1ac7db76ba common.c: }