Archive Page 2

Going away from bzr toward git

(this is a small rant about why I like bzr less and less and like git more and more; this is only a personal experience, not a general git vs bzr thing, take it as such).

Source control systems are a vital tool for any serious software project. They provide a history of the project, are invaluable for the release process, etc… When I started to develop code outside school exercises, I wanted to learn one for my own projects.

Using svn

This was not so long ago – 3-4 years ago, and at that time, SVN was the logical choice. I wanted to use it on my machine, to keep history and be able to go back; since I mainly code for scientific research, the time and rollback aspects were particularly important.

Using SVN did not really make sense to me at that time: using it to track other projects was of course easy (checking out, log, commit), but I could not really understand how to use it for my own projects:

  • I could not understand its branches and tags concept. Note that I did not even know what those terms meant at that time; I did not understand why it would matter at all where I would put the tags and branches, why I needed to copy things for tags, etc… From the svn-book, it was not really clear what the difference between branches and tags was.
  • Setting up svn on one machine is awkward: why should I create a repository somewhere, and populate it from somewhere else ? How should I back up the repository ?
  • Getting back in time is unintuitive: you have to “merge back” in time the revisions you want to roll back. This is really error prone.

Bzr, the first source control which made sense to me

In the end, I found it easier to just use tarballs to save the state of my projects (my projects are always quite small). Then, a bit more than two years ago, I discovered bzr (bzr-ng at that time): it was a better arch, the SCS developed by Tom Lord for distributed development. Arch always intrigued me, but was extremely awkward: it could not handle Windows very well, there were strange filenames, and it was source code invasive. Even checking out other projects like rhythmbox was painful. bzr on the contrary was really simple:

  • Creating a new project ? bzr init in the top directory, then add the code and commit. No separate directory for the db, no “bzradmin” to create the repository.
  • Branches and tags (tags came a bit later in bzr, starting at version 0.15 IIRC) were dead easy: bzr branch to create the branch, no need to use some copy commands, etc… Tags are even easier.

I have used bzr ever since for all my projects; in the meantime, I have been much more involved with several open source projects, which all use svn, and I always felt svn was an inferior, more complicated tool compared to bzr. With bzr, I understood what branches could be used for, and more generally how a SCS can be helpful for development.

Since bzr was so pleasant to use, I of course wanted to use it for the projects I was involved with, so I was really excited by bzr-svn to track svn repositories. Unfortunately, bzr-svn has never been a really pleasant experience. One problem was that the python wrapper of libsvn was really buggy (to the point that bzr-svn now has its own wrapper). Also, it was extremely slow to import revisions, and failed on some repositories I used bzr-svn on. That’s how I started to look at other tools, in particular hg: hg had an ability to import svn, and it was more reliable than bzr-svn in my experience. But it was not really practical for committing back to an svn repository, so I never investigated this really deeply.

Bzr annoyances

At the same time, there were some things which I was never thrilled by with bzr. Two in particular:

One branch per directory

That’s a conscious design decision from bzr developers. This means it is a bit simpler to know where you are (a branch is a path), but I find it awkward when you need to compare branches / need to “jump” from branch to branch. When you are deep down inside the tree of your project, comparing branches (diff, log, etc…) becomes annoying because you have to refer to branches by their path.

Revision numbers

Each commit is assigned a revid by bzr, which is a unique identifier per repository. That’s what bzr deals with internally. But for most UI purposes, you deal with revnos, that is simple integers: of course, because of the distributed nature of bzr, those numbers are not unique for a repository, only within a branch. I find this extremely confusing. Again, this appears more clearly when comparing several branches at the same time. For example, when I have not worked on a project for a long time, I may not remember the relative state of different branches: the bzr command missing is then very useful to know which commits are unique to one branch. But the numbers mean different things in different branches, which means they are useless in that case; being useless would have actually been ok, but they are in fact very confusing.

For example, I recently went back to a branch I had not worked on for more than one month. Let’s say my current development focus is in branch A, and I wanted to see the status of branch B. I can use bzr missing for that purpose. I can see that 5 revisions, from 300 to 305, are missing. I then go into branch B, and study the source code a bit, in particular with bzr blame. I see some code with a revision under 300 in branch B, which I could not see in branch A. Now, this was confusing: any revision before 300 is in A too according to bzr missing, so how is it possible for bzr blame to report different code in A and B, for a section committed with a revno < 300 ? The reason is that revision 305 is actually a merge, and when going through the detailed log in branch B, I can see that revision 305 contains 296.1.1, then 299.1, 299.2, 299.3 and 299.4. I can’t see how this is a useful behavior. Maybe I am biased as someone doing a lot of math all day long, but having 296.1.1 after 304 does not make any sense to me. What’s the point of using supposedly simple numbers when they have arbitrary ordering, which changes depending on where you are seeing them ? SVN revnos were already quite confusing when using branches, but bzr made it worse in my opinion.

Nitpicks

There were also things which were less significant for me, but still unpleasant: bzr startup is really slow, and its use in scripts is not really practical – if you want to do anything substantial, you have to study the plugin API. Also, it started to become a bit inflexible for some things: for example, incorporating a second project also tracked by bzr into a first project is difficult (if not impossible; I could never manage to do it), history-related operations are often slow, using a lot of branches takes a lot of space unless you are using a shared repository, which feels like a hack more than a real solution, etc…

(Re)-Discovering git

About the same time, I had to use git for one project which I was interested in. I found it much easier to use than when I looked at it for the first time. There was no cogito anymore, and the basic commands were like bzr. I decided to give git-svn a try, and it was much faster than bzr-svn at importing some projects; the repositories were extremely small [1]. Also, although the git UI is still quite arcane, I found git itself a pleasure to use: it felt simple, because the concepts were simple – much more than bzr, in fact. sha-1 for revisions are not awkward, because you barely use them at the UI level (the git UI is very powerful for human-friendly revision handling: no numbers, but you can easily ask for a parent in a branch or in the DAG relative to a given revision, you can search by committers, by string in the commit or the code, by date, etc…); bzr revnos feel like a hack after being used to git. For example, wherever I am, if I want to compare branch2 to branch1, in git I can do:

git log branch1..branch2
git diff branch1..branch2

Also, git is scriptable, which is appealing to the Unix user in me. I can understand the POV of bzr developers concerning extensibility with plugins (it is not unlike the argument of UNIX pipes vs Windows COM extensions as developed by Miguel in his Let’s make Unix not suck [2]), but I prefer the git model in the end. Bzr’s decision to go toward extensibility with plugins is not without merit: I think the good error reporting from bzr is partly a consequence of this choice. OTOH, git messages can be cryptic; but git’s simplicity at the core level makes this much less significant than I first expected.

A key git difference compared to bzr is that git is really just a content tracker. It does not track directories at all, or filenames for example: it instead tries to detect when you rename files. I remember at least once when this was mentioned on the bzr ML [3], where a bzr developer argued that bzr could do what git does, while keeping explicit meta information (when you tell bzr to rename a file). One obvious drawback is that depending on how the change was made to the tree, patch vs merge for example, bzr behavior will be different; this is very serious in my opinion. Especially for a language like python, where the file/directory names matter, directory renames should be quickly propagated, and can never be done lightly anyway. And it means git can be much better at dealing with renames when importing external data, merging between unrelated branches, etc… Because its algorithm for rename detection is used all the time, it has to work quite well. It is a bit similar to the merge capability of distributed SCS: there is no reason for them to be inherently better at merging, but because they would be unusable without good merge tracking capability, this has to work reliably from the start in DVCS. Even if in theory bzr could detect renames like git (in addition to its explicit rename handling), in practice it has not happened, and as far as I am aware, nobody has done any work in that direction.

Another advantage of git I did not mention, but that’s because it has been rehashed ad nauseam, and it is the most obvious one to anyone using both tools: git is incredibly fast. Many things I would never do with bzr because it would take too much time are doable with git; sometimes, git favors speed too much (in its rename detection, for example: you should really be aware of the -M and -C options in log and other history-related commands), but even when telling git to spend time detecting renames, it is still much faster than bzr.

Finally, git is getting a lot of traction: it is used by Linux, Xorg, Android, RoR and a lot of freedesktop projects, and is being discussed for KDE. This means it will become even better, and that other DVCS will have a very hard time competing. As a very concrete example: git UI improvements were much more significant than bzr speed improvements during the last year (bzr speed has not improved much in my experience since 0.92 and the pack format: long history and network make bzr almost unusable for big projects with large history contributed by a large team across the world; OTOH, git 1.5.3 was the first git version which I could use without hurting my head too much).

For all those reasons – simplicity of the core model, flexibility, scriptability, and speed – I think I will start to use git for all my projects, and give up on bzr. I think bzr is still superior to git for some things, and depending on the project or the tree you are tracking, bzr may be better (in particular because it tracks directories, which git does not, and this can matter; I am also not sure whether git would be appropriate for tracking /etc or your $HOME).

[1] For every project I have imported so far, the git clone is as big as or smaller than a svn checkout; you read that right: one revision checked out from svn is often bigger than a full history. I have imported the full history of numpy, scipy, scikits on my github account, and I have not used much more than half of my 100 Mb account.

[2] http://primates.ximian.com/~miguel/bongo-bong.html

[3] https://lists.ubuntu.com/archives/bazaar/2007q3/028591.html

The links of the week

A few articles which I have recently read:

  • The worst academic job: an article summarizing what’s wrong with academic career paths today in the US and in Europe.
  • Is Google making us stupid ?: a bit late, but it is an interesting article on how Google, and more generally easy access to a vast knowledge base, may influence how we think. I can’t help linking this to a recent work from James Evans on the effects of open access to science (discussed in The Economist here).

Is science obsolete ?

That’s basically what is argued in the Wired article “the end of theory” (found through the article “La fin de la théorie?”, on the excellent Econ/French blog econoclastes). The article itself is not that interesting: it tries to be provocative, but fails at giving good arguments for its case. The main argument is that thanks to enormous amounts of data, number crunching, and computer farms as available for example at Google, it will be more effective to find patterns with just data, and without models. But the analysis is fundamentally flawed at several levels, both scientific and philosophical. It is true that some sciences, or more exactly some activities traditionally labeled as science, are endangered by number crunching; but computers already made some repetitive activities obsolete, and it is hardly big news, except for the people concerned by the changes.

First, the epistemological arguments against this thesis: the reason why science is first about making theories, and then making experiments to confront the theory with reality, is not just about practicality. There are some fundamental reasons why this is the case, as mentioned in the article by econoclaste (warning, in French): data gathering itself is subject to various biases, and theory can somewhat alleviate this bias. There is also an ambiguity on what Chris Anderson means by scientific models: he argues that Google succeeded in getting reliable search results by avoiding using a model. But you could argue on the contrary that Google did better than everyone else because it had a better model for getting interesting pages related to keywords; indeed, the PageRank algorithm, which is the foundation of the Google search engine, is a better algorithm than what other search engines used to do. And the PageRank algorithm is based upon some other works and theories, in particular citation analysis, which traces back to at least 1950. Another example given by Anderson is translation: he argues that translation without any linguistic knowledge can work better than translation based on it. But arguing that translation can be done better without knowing the language is different from arguing it can be done without any model at all. For people who work on machine translation, it is actually quite well known that you don’t need to know a language to be successful in translating to it.

Reductionism and the curse of dimensionality

But more significantly in my opinion, the number crunching approach is fundamentally reductionist: it assumes you can explain the whole phenomenon from its smallest parts. A typical example of the failure of reductionism in science is fluid mechanics: in principle, you could explain the behavior of a fluid from the behavior of each particle in your fluid, but in practice you can’t, because once you have a reasonable number of particles, it becomes intractable to do it at the particle level. Tom Roud made a similar argument (in French) for the flaws of the reductionist approach in biology.

There are some theoretical reasons why you will never manage to make a model of everything just from data. Most number crunching methods are statistical in nature, and rely on estimating some probability distribution. From this point of view, Anderson’s argument can be understood as “with enough data, you can estimate any probability distribution”. But this is not true for several reasons; one is that complex problems often require computation in high-dimensional spaces, and high-dimensional spaces have some funky properties which are not intuitive and do not map well to our fundamentally three-dimensional world. One particularly significant property is the localization of volume in smooth solids. For example, in high dimension, most of the volume of a sphere is in a very thin shell, i.e. really near the surface. In dimension 1000, for a sphere of radius one, the volume contained in the sub-sphere of radius 0.5 is 0.5^1000, that is only about 1/10^301 of the total volume. This means that if you could spread all the atoms of the universe uniformly in the sphere, you would not even get one in the sub-sphere. In statistics, this phenomenon is known as the curse of dimensionality: the amount of data necessary for estimation in high dimensions grows exponentially with the number of dimensions.
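
The order of magnitude quoted above can be checked in a few lines of python. This is only a sketch: it relies on the fact that the volume of a ball in dimension d scales as r^d, so the fraction of the unit ball contained in the concentric ball of radius r is r^d. The function name is mine, and logarithms are used because the fraction itself is too small to be handled comfortably.

```python
import math

def inner_volume_fraction_log10(radius, dim):
    """log10 of the fraction of a unit ball's volume lying inside the
    concentric ball of the given radius, in dimension dim.

    The volume of a dim-dimensional ball scales as radius**dim, so the
    fraction is radius**dim; its log10 is dim * log10(radius).
    """
    return dim * math.log10(radius)

# The fraction collapses very quickly as the dimension grows.
for dim in (3, 10, 100, 1000):
    print(dim, inner_volume_fraction_log10(0.5, dim))
```

For dimension 1000, the exponent comes out at about -301. With the commonly quoted (and very rough) estimate of 10^80 atoms in the universe, the expected number of atoms falling in the inner ball would be of the order of 10^-221.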

Also, more data does not always mean you will get better information: a common quote in the data-mining community is “there is no better data than more data”, but this is a fallacy. You want data which brings more information, and in some cases, you can only easily get data which are not very informative. For example, when transcribing speech with computers (Automatic Speech Recognition, i.e. “speaking to your computer instead of typing”), you need to estimate the probability distribution of the words, and you are especially interested in the probability of the words which do not appear often. After analyzing a few thousand examples, you will get a pretty good estimation of the “behavior” of common words like “the” and the like, but maybe not of words like “hermeneutic”. And for practical applications, those rare words are the ones which matter: if you miss “the” in a sentence, you can still understand it, but if you miss “hermeneutic”, this is much less likely.
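
To give an idea of the numbers involved, here is a small sketch. It assumes that word frequencies roughly follow Zipf’s law (the word of rank k has a probability proportional to 1/k); this assumption, the vocabulary size and the corpus size are mine and only serve as an illustration, they do not come from any actual speech corpus.

```python
def zipf_expected_counts(corpus_size, vocabulary_size, ranks):
    """Expected number of occurrences of the words of the given frequency ranks.

    Hypothetical model: the word of rank k has probability proportional
    to 1/k (Zipf's law with exponent 1) over a finite vocabulary.
    """
    normalization = sum(1.0 / k for k in range(1, vocabulary_size + 1))
    return {rank: corpus_size / (rank * normalization) for rank in ranks}

counts = zipf_expected_counts(corpus_size=100_000,
                              vocabulary_size=50_000,
                              ranks=[1, 100, 10_000])
for rank, expected in counts.items():
    print(f"rank {rank:>6}: about {expected:.1f} expected occurrences")
```

Under this model, the most common word is seen thousands of times in a corpus of 100000 words, whereas a word of rank 10000 is expected less than once: adding more of the same data mostly adds more occurrences of words you already know well.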

Is number crunching new ?

So is this number crunching really the beginning of something new ? Actually, similar theses have been argued before Anderson; the fields of data-mining and artificial intelligence (AI) have, since their inception, a history of making claims which never really materialize (AI, for example, has known several “AI winters”, i.e. periods of low funding, generally after periods of high funding and high claims about what AI could do). Anyone familiar with the data-mining and artificial intelligence communities should be skeptical about big announcements like paradigm shifts, or like here claiming to make science obsolete. I would not be surprised if AI/data-mining/associated fields are the ones which use the expression “paradigm shift” the most often.

It baffles me that people still argue similar points with similar claims as 50 years ago.

Linked articles

For more on this, see also Cosma’s blog. Also, in French:

A python 2.5.2 binary for Mac OS X with dtrace enabled

As promised a few days ago, I took the time to build a .dmg of python from the official sources + my patch for dtrace. The binary is built with the build-script.py script in the Mac/ directory of python, and except for the dtrace patch, no other modification has been made, so it should be usable as a drop-in replacement for the official binary on python.org. You can find the binary here.

Again, use it at your own risk. If you prefer building it yourself, or with different options, the patch can be found here

How to embed a manifest into a dll with mingw tools only

(DISCLAIMER: I am not a windows guy; all the discussion here is how I understand things from various sources).

With Visual Studio 2005, MS introduced a mechanism called side by side assemblies and C/C++ isolated applications. Assembly is the MS term which encompasses usual dll, as well as .Net modules implemented in CLR, the .Net bytecode (e.g. anything programmed in C#). The idea is to provide a mechanism to deal with the well known dll hell, because there was no proper versioning scheme with dll in windows. You can read more here:

Why should you care as a python developer ? Concretely, starting from VS 2005, if you build a python extension with the mingw compiler, it will link against a runtime which is not available system-wide (such as in C:\Windows\system32 by default), causing a runtime error when loading the extension (msvcr80.dll not found). A simple way to reproduce the problem is to have a small dll, and try to link it to a simple executable with the MS runtime:

# This works:
gcc -shared hello.c -o hello.dll
gcc main.c hello.dll -o main.exe
# This does not:
gcc -shared hello.c -o hello.dll
gcc main.c hello.dll -o main.exe -lmsvcr90

If you build the 2nd way, explicitly linking the msvcr90, you will get a dll not found error when running the executable, because the dll is not in the system paths (and should not be; the dll is not redistributable). Starting from VS 2005, the only way to refer to VS libraries is to use manifests, which are xml files embedded in the binary. Those manifests are automatically generated by the MS compiler. Assuming you already have the manifest, how can you generate a binary using it without using MS compilers ?

  • Build the object of the dll:

    gcc -c hello.c

  • Have a hello.rc file which refers to the manifest file (2 seems to refer to dll, vs 1 for exe, but I am not sure):

    #include "winuser.h"
    2 RT_MANIFEST hello.dll.manifest

  • Build the .res file, which will embed the xml file into the resource file (.res):

    windres --input hello.rc --output hello.res --output-format=coff

  • Link the whole thing together:

    gcc -shared -o hello.dll hello.o hello.res -lmsvcr90

Now, executing main.exe should be possible. There is still the problem of generating the manifest file. Since in our case, the problem is mainly with the MSVC runtime, to stay compatible with the python.org binary, we may just reuse the same manifest all the time ?
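
If you need to repeat those steps for several dlls, they can be generated from a small python helper. This is only a sketch of the steps above: the function name and its parameters are made up for the example, nothing is executed, and it assumes the manifest is named after the dll as in the hello example.

```python
def manifest_build_steps(name, runtime="msvcr90", resource_id=2):
    """Return the .rc file content and the commands used to embed a manifest.

    Nothing is run here: the commands are the ones from the steps above,
    parametrized by the base name of the dll (hypothetical helper).
    """
    rc_content = '#include "winuser.h"\n%d RT_MANIFEST %s.dll.manifest\n' % (
        resource_id, name)
    commands = [
        # Compile the object file of the dll
        ["gcc", "-c", name + ".c"],
        # Embed the manifest referenced by the .rc file into a .res file
        ["windres", "--input", name + ".rc", "--output", name + ".res",
         "--output-format=coff"],
        # Link the object, the resource and the MS runtime together
        ["gcc", "-shared", "-o", name + ".dll", name + ".o", name + ".res",
         "-l" + runtime],
    ]
    return rc_content, commands

rc_content, commands = manifest_build_steps("hello")
print(rc_content)
for command in commands:
    print(" ".join(command))
```

To actually build, you would write rc_content into hello.rc and pass each command to subprocess.check_call, on a machine where gcc and windres from mingw are in the PATH.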

A few more links on the topic:

http://www.ddj.com/windows/184406482

http://msdn.microsoft.com/en-us/library/ms235591(VS.80).aspx

http://www.codeproject.com/KB/COM/regsvr42.aspx

Building dtrace-enabled python from sources on Mac OS X

One highlight of Mac OS X Leopard is dtrace. Providers for ruby and python are also available, but only with the “system” interpreters (the ones included out of the box). If you install python from http://www.python.org, you can’t use dtrace anymore. Since the code to make python dtrace-enabled is available on the open source corner of Apple, I thought it would be easy to apply it to the pristine sources available on python.org.

Unfortunately, for some strange reason, Apple only provides changes in the form of ed scripts applied through a Makefile; the changes are not feature specific, you just have a bunch of files for each modified file: dtrace and other Apple-specific changes are all put together. I managed to extract the dtrace part of this mess, so that I can apply only the dtrace-related changes. The goal is to have a python as close as possible to the official binary available on python.org. The patch can be found there.

How to use ?

  • Untar the python 2.5.2 tarball
  • Apply the patch
  • Regenerate the configure scripts by running autoconf
  • configure as usual, with the additional option --enable-dtrace (the configuration is buggy, and will fail if you don’t enable dtrace, unfortunately)
  • build python (make, make install).

If time permits, I will post a .dmg. Needless to say, you run this at your own risk.

numscons and cython

numscons 0.9.2 has just been released. The main feature of this release is cython support: I implemented a small cython tool during the cython tutorial at scipy08, and now, you can build a cython extension from .py or .pyx:

from numscons import GetNumpyEnvironment
env = GetNumpyEnvironment(ARGUMENTS)
# cython tool not loaded by default
name = "cython"
env.Tool(name)
# Build a python extension from yop.py
env.DistutilsPythonExtension(source = ["yop.py"])

The example can be found in test/examples/cython in numscons sources. This is preliminary, since there is no way to pass options to the cython generation.

numscons, part 2 : Why scons ?

This is the 2nd part of the series about numscons. This part will present scons in more detail, to show it can solve the problems mentioned in part 1.

scons is a tool intended as a replacement for the venerable make. It is written in python, making it a logical candidate to build complex extension code like numpy and scipy. The scons process is driven by a scons script, as the make process is driven by a Makefile. Like makefiles, scons scripts are declarative, and scons automatically builds the Directed Acyclic Graph (DAG) from the description in scons scripts to build the software in a correct order. The comparison stops here, though, because scons is fundamentally different from make in many respects.

Scons scripts are python scripts

Not only is scons itself written in python, but scons scripts themselves are also python scripts. Almost anything possible in python is possible in a scons script; rules in makefiles are mostly replaced by Builders in scons parlance, which are python functions. This also means that anything fancy done in numpy.distutils can be used in a scons script if the need arises, which is not a small feat.

Scons has a top notch dependency system

This is one of the reasons people go from make to scons. Although make does handle dependencies, you have to set up the dependencies in the rules, for example, for a simple object file hello.o built from hello.c, which includes a header hello.h:

hello.o : hello.c hello.h
        $(CC) -c hello.c -o hello.o

If you don’t list hello.h, and change hello.h later, make will not detect it as a change, and will consider hello.o as up to date. This quickly becomes intractable for large projects, and thus several tools exist to automatically handle dependencies and generate rules for make. Automake (used in most projects using autotools) does this, for example; distutils itself does this, but it is not really reliable. With make files, you have to regenerate the make files every time the dependencies change.

On the contrary, scons does this automatically: if you have #include "hello.h" in your source file, scons will automatically add hello.h as a dependency of hello.o. It does so by scanning the content of hello.c. Even better, scons automatically adds for each target a dependency on the code and commands used to build the target; concretely, if you build some C code, and the compiler changes, scons detects it.
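
To give an idea of what scanning the content of a source file means, here is a deliberately naive sketch in plain python. It is not the scons implementation (the real scanners also handle include paths, recurse into the headers, and cache their results); the regular expression and the function name are mine.

```python
import re

# Matches lines such as: #include "hello.h"
# (system headers in <...> are deliberately ignored in this sketch)
INCLUDE_RE = re.compile(r'^\s*#\s*include\s*"([^"]+)"', re.MULTILINE)

def scan_includes(source_text):
    """Return the quoted headers included by a piece of C source, in order."""
    return INCLUDE_RE.findall(source_text)

example = '''
#include <stdio.h>
#include "hello.h"

int main(void) { hello(); return 0; }
'''
print(scan_includes(example))  # prints ['hello.h']
```

Each header found this way becomes an implicit dependency of the object file, so that touching hello.h is enough to trigger a rebuild of hello.o.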

Thus, scons solves for free the dependency problem, one of the fundamental problems of distutils for extension code (this problem is the first in the list of distutils revamp goals).

Build configurations are handled in objects, not in code

Another fundamental problem of distutils is the way it stores knowledge about how to build a particular kind of target: the compilation flags, compilers and paths are embedded in the code of distutils itself, and not available programmatically. Some of it is available through distutils.sysconfig, but not always (in particular, it is not available for python built with MS Visual Studio).

On the other hand, scons stores compiler flags and any kind of build specific knowledge in environment objects. In that regard, Environment instances are like python dictionaries, which store compiler, compiler flags, etc… Those environments can be copied and modified at will. They can also be used to compile different source files differently, for example with different optimization or warning levels. For example

warnflags = ['-Wall', '-W']
env = Environment()
warnenv = env.Clone(CFLAGS = warnflags)

will create two environments, and any build command related to env will use the default compiler flags, whereas warnenv will use the warning flags. This also makes customization by the user much easier. People often have trouble compiling numpy with different options, for example for more aggressive compilation:

CFLAGS="-O3 -funroll-loops" python setup.py build

does not work because the CFLAGS environment variable overrides the CFLAGS used by distutils, and all compiler flags are kept in the same variable (flags from distutils and flags from the user are stored at the same place). With scons, those can easily be put in different locations. With numscons, those work out of the box:

python setup.py build # Default build
CFLAGS="-W -Wall -Wextra -DDEBUG -g" python setup.py build # Unoptimized, debug build
CFLAGS="-funroll-loops -O3" python setup.py build # Aggressive build

scons enables straightforward compilation customization through the command line. This is important for users who like to build numpy/scipy on special configurations (which is quite common in the scientific community), and also for packagers, who complain a lot about distutils and its weird arguments handling.

Scons is extensible

scons is also extensible. Although it has some quirks, in particular some unpythonic ways of doing things, it is built with customization in mind. As mentioned earlier, scons generates targets from sources (for example hello.o from hello.c) through special methods called Builders. It is possible and relatively easy to create your own builder. Builders can be complex, but that’s because they can be very flexible:

  • Builders can have their own scanner. For example, the f2py builder in numscons has its own scanner to automatically handle dependencies in <include_file=…> f2py directives.
  • Builders can have their own emitters: an emitter is a function which generates the list of targets from the list of sources. It can be used to dynamically add new source files, and modify the list of targets. For example, when building f2py extensions, some extra files are needed, and an emitter is a way to add them.
  • Builders have many other options which I won’t talk about here.
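
As an illustration of the emitter idea, here is a minimal sketch written as a plain python function, so that it can be tried without scons. An emitter receives the list of targets, the list of sources and the environment, and returns the (possibly modified) lists of targets and sources. The "-wrappers.f" naming is invented for the example and is not what the numscons f2py builder actually generates.

```python
def add_wrapper_emitter(target, source, env):
    """Emitter sketch: declare one extra generated file per source.

    Returns the new (targets, sources) pair; the extra file names are
    hypothetical and only serve to show how targets can be added.
    """
    extra_targets = []
    for src in source:
        base = str(src).rsplit(".", 1)[0]
        extra_targets.append(base + "-wrappers.f")
    return list(target) + extra_targets, list(source)

# Stand-alone usage, with plain strings instead of scons nodes
targets, sources = add_wrapper_emitter(["foomodule.c"], ["foo.pyf"], env=None)
print(targets, sources)
```

In a scons script, such a function would be attached to a builder through the emitter argument of Builder, next to its action; scons then knows about the extra files and can track them in the DAG like any other target.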

          The scons wiki also contains a vast range of builders for different kind of tasks (building documentation, tarballs, etc…). With builders, building code using swig, cython, ctypes is possible, and does not require some distutils magic: if you know how to build them from the command line, implementing builders for them is relatively straifgtforward, as long as they fit in the DAG view (f2py for example was quite difficult to fit there).

          Scons has a configure subsystem

          When building numpy/scipy, we need to check for dependencies such as BLAS/LAPACK, fft libraries, etc… The way numpy.distutils does it is to look for files in some paths. This is highly unreliable, because the mere existence of a file does not mean it is usable; in particular, maybe it is too old, or nor usable by the used compiler, etc… Scons has a configure subsystem which works in a manner similar to autotools: to check for libfoo with the foo.h header, scons will try to compile a code snippet including foo.h, and try to link it with -lfoo (or /LIB:foo.lib with MS compiler). This is much more robust. Robustness is important here because people often try to build their own blas/lapack, make some mistake in the process, and then can build numpy successfully. Only once they try to run numpy do they have some problems. Another problem with the current scheme in numpy.distutils is that it is fragile, and difficult to modify by people with unusual configuration (Using Intel or AMD optimized libraries for example); thus, only the few people who know enough about numpy.distutils can do it. Finally, the scons subsystem is much easier to use:

          
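          # Inside a scons script: Environment and Configure are provided by scons
          env = Environment()
          config = Configure(env)
          # Try to compile and link a C snippet including foo.h against libfoo
          config.CheckLibWithHeader('foo', 'foo.h', 'C')
          env = config.Finish()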
          

          This is straightforward, whereas the same thing in numpy.distutils takes around 50 lines of code. Out of the box, the scons configure subsystem has the following checks:

          • CheckHeader: to check for the availability of a C/C++ header
          • CheckLib: to check for the availability of a library
          • CheckType/CheckTypeSize: to check for the availability of a type and its size
          • CheckDeclaration: to check for #define

          An example I find striking is to compare the setup.py and the scons script for numpy.core. Because of the configure subsystem, the scons script is much easier to understand IMHO.

          Now, the scons subsystem is not ideal either: internally, it relies heavily on some obscure features of scons itself for the dependency handling, which means it is quite fragile. For most usages (in particular checking for libraries/headers, which is the only thing that the vast majority of numscons users will use), this works perfectly. For some advanced uses of the subsystem, this is problematic: the fortran configuration subsystem of numscons for example requires grepping through the output (both stdout/stderr) of the builders inside the checkers, and this does not work well in scons (I have to bypass the configure builders, basically).
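The compile-and-link idea behind these checks can be sketched outside scons in a few lines of Python. This is only an illustration of the principle, not how scons implements it: the function name is hypothetical, and it assumes a Unix-style compiler driver (`cc` by default) that accepts `-l` and `-o`. If no such compiler is found, the check simply reports failure.

```python
import os
import shutil
import subprocess
import tempfile


def check_lib_with_header(lib, header, cc="cc"):
    """Return True if a tiny C program including `header` links against `lib`.

    Assumption: `cc` is a Unix-style compiler driver; otherwise return False.
    """
    if shutil.which(cc) is None:
        return False
    snippet = "#include <%s>\nint main(void) { return 0; }\n" % header
    with tempfile.TemporaryDirectory() as tmp:
        c_file = os.path.join(tmp, "conftest.c")
        exe = os.path.join(tmp, "conftest")
        with open(c_file, "w") as f:
            f.write(snippet)
        # The check succeeds only if compilation AND linking both succeed,
        # which is stronger than testing that some file exists on disk.
        result = subprocess.run(
            [cc, c_file, "-o", exe, "-l" + lib],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return result.returncode == 0
```

For example, `check_lib_with_header("m", "math.h")` would typically succeed on a Linux machine with a C compiler installed, while a library that is present but not linkable makes the check fail at configure time instead of at run time.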

          Conclusion

          When looking at the list prepared by David M. Cook for distutils improvements, one can see that scons already solves most of them:

          • better dependency handling: done by scons DAG handling
          • make it easier to use a specific compiler or compiler option: through scons environments
          • allow .c files to specify what options they should/shouldn’t be compiled with (such as using -O1 when optimization screws up, or not using -Wall for .c made from Pyrex files): through scons environments
          • simplify system_info so that adding checks for libraries, etc., is easier: through scons configure subsystem
          • a more “pluggable” architecture: adding source file generators (such as Pyrex or SWIG) should be easy: through builders, actions, etc…

          And more interesting for me, when I see some problems in scons, I can solve them upstream, so that it benefits other people, not just numpy/scipy. In particular, the fortran support was problematic in scons, and since scons 0.98.2, my work on a new fortran support is available. CheckTypeSize and CheckDeclaration, as well as some configuration header generation improvements, were also committed upstream.

          In Part 3, I will explain the basic design of numscons, and how it brings the power of scons to the numpy build system.

          Redirecting stderr/stdout in cmd.exe

          Here is the magic:

          command > file.log 2>&1
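When the command is launched from a Python script rather than typed at the prompt, the same effect can be obtained without the shell syntax. A minimal sketch, assuming a hypothetical helper name `run_logged`:

```python
import subprocess


def run_logged(command, log_path):
    """Run `command` (a list of arguments) with stdout and stderr
    both written to `log_path`; return the exit code."""
    with open(log_path, "wb") as log:
        # stderr=subprocess.STDOUT plays the role of "2>&1"
        return subprocess.call(command, stdout=log, stderr=subprocess.STDOUT)
```

This behaves the same way in cmd.exe and in Unix shells, since the redirection is done by Python and not by the shell.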

          numscons, part 1: the problems with building numpy/scipy with distutils

          This will be the first post of a series about numscons, a project I have been working on for a bit more than 6 months now. Simply put, numscons is an alternative build system to build numpy/scipy and other python software which relies heavily on compiled code. Before talking about numscons, this first post will be a list of problems with the current build system.

          Current flaws in distutils/numpy.distutils:

          Here are some things that several people, including me, would like to be able to do:

          1. If a package depends on a library, it is difficult to test for the dependency (header, library). In autoconf, it is one line to test for the headers/libraries. With numpy.distutils, you have to use 50 lines of code, and it is quite fragile.
          2. Not possible to build ctypes extensions in a portable way.
          3. Not possible to compile different parts of a package with different compilation options.
          4. No dependency system: if you change some C code, the only reliable way to build correctly is to start from scratch.
          5. CFLAGS/FFLAGS/LDFLAGS do not have the expected semantics: instead of being added to the options used for the actual compilation, they override them, which means that doing something like CFLAGS="-O3" will break, since -fPIC and all the other options necessary to build python extensions are missing.
          6. The way to use different BLAS/LAPACK/Compilers is arcane, with too many options, which may fail in different ways.
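The CFLAGS problem in item 5 can be illustrated with a small sketch. The flag lists and function names below are hypothetical and only contrast the two semantics; they are not taken from distutils.

```python
import shlex

# Hypothetical flags: what the build system needs for every extension,
# and the optimization flags it would pick by default.
REQUIRED_FLAGS = ["-fPIC"]
DEFAULT_OPT_FLAGS = ["-O2"]


def flags_override(user_cflags):
    """Override semantics: user CFLAGS replace everything."""
    if user_cflags:
        return shlex.split(user_cflags)
    return REQUIRED_FLAGS + DEFAULT_OPT_FLAGS


def flags_combined(user_cflags):
    """Expected semantics: required flags are kept, and user flags come
    last so that they take precedence over the defaults."""
    return REQUIRED_FLAGS + DEFAULT_OPT_FLAGS + shlex.split(user_cflags or "")
```

With `CFLAGS="-O3"`, the first function returns only `-O3` and the extension can no longer be built as a shared object, while the second keeps `-fPIC` and lets `-O3` win over `-O2`.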

          Why not improving the current build system ?

          Last year, I sent an email to the numpy ML explaining the problems I had with distutils and its extension numpy.distutils. The majority agreed that the current situation was less than ideal, but the people who knew enough about the current system to improve it could not spend a lot of time on it. The current build system is a set of extensions around distutils, the standard package for build/distribution under python. Here lies the first problem: distutils is a big mess. The code is ugly, badly designed, and not documented. In particular:

          1. Difficult to extend: although in theory, distutils has the Command class which can be inherited from, a lot of magic is going on, and there is no clear public API. Depending on the way you call distutils, the classes have different attributes !!!
          2. Distutils fundamentally works as a set of commands. You first do that, then that, then that. That’s the wrong model for building software; the right model is a DAG of dependencies (à la make). In particular, for numpy/scipy, when you change some C code, the only way to reliably rebuild the package is to start from scratch.
          3. The compilation options are spread everywhere in the code. Depending on the platform, they are available in distutils.sysconfig (UNIX) or not (windows). On the latter, it is not possible to retrieve the options used for compilation. This, combined with the lack of extensibility, means that simple things like building ctypes extensions are much more difficult than they should be.
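To show why the DAG model from item 2 matters, here is a minimal sketch of dependency-based rebuilding. The graph and file names are hypothetical, and a real tool such as make or scons also decides how to detect changes (timestamps, content signatures) and in which order to rebuild; this only computes which targets are affected.

```python
# Hypothetical dependency graph: each target maps to the files it is built from.
DEPS = {
    "foo.o": ["foo.c", "foo.h"],
    "bar.o": ["bar.c", "foo.h"],
    "ext.so": ["foo.o", "bar.o"],
}


def targets_to_rebuild(changed, deps):
    """Return the sorted list of targets affected by the `changed` files."""
    stale = set(changed)
    rebuilt = set()
    progress = True
    while progress:
        progress = False
        for target, sources in deps.items():
            # A target is stale as soon as one of its inputs is stale.
            if target not in stale and any(s in stale for s in sources):
                stale.add(target)
                rebuilt.add(target)
                progress = True
    return sorted(rebuilt)
```

Changing `foo.c` only affects `foo.o` and `ext.so`, whereas changing the shared header `foo.h` affects all three targets. A command-based system has no such graph to consult, which is why starting from scratch is the only reliable option there.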

          Using scons to build compiled extensions:

          For this reason, I thought it might be better to use a build system which knows about dependencies and compiled code, preferably one written in python. The best-known contender with those characteristics is scons. scons is a make replacement, written 100% in python. In particular:

          1. scons is built around the DAG concept. Its dependency system is top-notch: if you change a link option, it will only relink; if header files change, scons automatically detects it.
          2. scons has a primitive but working system to check for dependencies (check for headers, libraries, etc…). It works like autoconf, that is, instead of looking for files, it tries to build code snippets. This is much more robust than the current numpy.distutils way, because if for example your blas/lapack is buggy, you can detect it. Since many people build their own blas/lapack for numpy/scipy, and get it wrong, this is important.
          3. scons is heavily commented, reasonably well documented, and some relatively high-profile companies are using it, so it is proven software (vmware uses it for some of its main products, Intel uses it, Doom and all id Software titles on Linux are built with scons; it seems that generally, scons is quite popular in the gaming community, both open source and proprietary).

          Scons has also some disadvantages:

          1. It uses ancient python (compatible with 1.5.2). This has many unfortunate consequences, and the advantages of compatibility are outweighed by its disadvantages IMO. In particular, some code is quite arcane because of it (use of apply instead of the foo(*args, **kw) idiom).
          2. A lot of things are ‘unpythonic’, and a lot of the logic is hardcoded in the main callee, meaning you cannot really use it as a library within your project. You have to let scons drive the whole process.
          3. It misses a lot of essential features for packaging, meaning it is not often used for open source projects.
          4. It is relatively slow, although this is not a big problem for numpy/scipy.
          5. The scons developer community is not large: it is mainly the job of 2-3 people, and I believe this is partly a consequence of 1 and 3.

          Nevertheless, I decided to use scons, and I believe it was the right choice. One thing which pleased me is that instead of improving numpy.distutils, a fragile system that nobody outside numpy/scipy will use anyway, I spent time implementing missing features in scons, some of which are already integrated upstream (better fortran support, better support of some fortran compilers, etc…). This way, everybody can benefit from those new features.

          The next post in the series will be about the features I was interested in implementing in numscons, and how I implemented them.
