Python packaging: a few observations, cabal for a solution ?

The python packaging situation has been causing quite some controversy for some time. The venerable distutils has been augmented with setuptools, zc.buildout, pip, yolk and what not. Some people praise those tools, some others despise them; in particular, discussion about setuptools keeps coming up in the python community, and almost every time, the discussion goes nowhere, because what some people consider broken is a feature for others. It seems to me that the conclusion of those discussions is obvious: no tool can make everybody happy, so there has to be a system such that different tools can be used for different usages, without interfering with each other. The solution is to agree on common formats and data/metadata, so that people can build on them and communicate with each other.

You can find a lot of information on people who like setuptools/eggs, and their rationale for it. A good summary, from a web-developer’s point of view, is given by Ian Bicking. I thought it would be useful to give another side of the story, that of people like me, whose needs are very different from those of the web-development crowd (the community which pushes eggs the most AFAICS).

Distutils limitations

Most of those tools are built on top of distutils, which is a first problem. Distutils is a giant mess, with tight, undocumented coupling between vastly different parts. Distutils takes care of configuration (rarely used, except for projects like numpy which need to probe for fairly low level system dependencies), build, installation and package building. I think that’s the fundamental issue of distutils: the installation and deployment parts do not need to know so much about each other, and should be split. The build part should be easily extensible, without too much magic or too many assumptions, because different projects have different needs. The king here is of course make; ruby, for example, has rake and rant, etc…

A second problem of distutils is its design, which is not so good. Distutils is based on commands (one command does the build of C extensions, one command does the installation, one command builds eggs in the case of setuptools, etc…). Commands are fundamentally imperative in distutils: do this, and then that. This is far from ideal for several reasons:

You can’t pass options between commands

For example, if you want to change the compilation flags, you have to pass them to every concerned command.

Building requires handling dependencies

You declare some targets, which depend on some other targets, and the build tool builds a dependency graph to build everything in the right order. AFAIK, this is the ONLY correct way to build software. Distutils commands are inherently incapable of doing that. That’s one example where the web development crowd may be unaware of the need for this: Ian Bicking for example says that we do pretty well without it. Well, I know I don’t, and having a real dependency system for numpy/scipy would be wonderful. In the scientific area, large, compiled libraries won’t go away soon.

Fragile extension system

Maybe even worse: extending distutils means extending commands, which makes code reuse quite difficult, or causes weird issues. In particular, in numpy, we need to extend distutils fairly extensively (for fortran support, etc…), and setuptools extends distutils as well. Problem: we have to take setuptools’ monkey-patching into account. It quickly becomes impractical when more tools are involved (the combinations grow exponentially).

Typical problem: how to make setuptools and numpy.distutils extensions cohabit ? Another example: paver is a recent, but interesting tool for doing common tasks related to build. Paver extends setuptools commands, which means it does not (it can’t) work with numpy.distutils extensions. The problem can be somewhat summarized by: I have class A in project A, class B(A) in project B and class C(A) in project C – how do I handle B and C in a later package ? I am starting to think it can’t be done reliably using inheritance (the current way).
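To make the inheritance problem concrete, here is a minimal, hypothetical sketch (the class names are illustrative, not the actual numpy.distutils or setuptools class hierarchies):

from distutils.command.build_ext import build_ext as _build_ext

class NumpyBuildExt(_build_ext):          # "class B(A)" in project B
    def run(self):
        # fortran-specific setup would go here
        _build_ext.run(self)

class SetuptoolsBuildExt(_build_ext):     # "class C(A)" in project C
    def run(self):
        # egg-specific setup would go here
        _build_ext.run(self)

# A later package wanting both behaviours has no reliable option: it has to
# write yet another subclass by hand and hope the two run() implementations
# do not step on each other (or on setuptools' monkey-patching).
class CombinedBuildExt(NumpyBuildExt, SetuptoolsBuildExt):
    pass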

Extending commands is also particularly difficult for anything non trivial, due to various issues: lack of documentation, the related distutils code is horrible (attributes added on the fly for no good reason), and nothing is very well specified. You can’t retrieve where distutils builds a given file (library, source file, .o file, etc…), for example. You can’t get the name of the sdist target (you have to recreate the logic yourself, which is platform dependent). Etc…

Final problem: you can’t really call commands directly in setup.py. As a recent example encountered in numpy: I want to install a C library built through the libraries argument of setup. I can’t just add the file to the install command. Now, since we extend the install command in numpy.distutils, it should have been simple: just retrieve the name of the library, and add it to the list of files to install. But you can’t retrieve the name of the built library from the install command, and the install command does not know about the build_clib one (the one which builds C libs).

Packaging, dependency management

This is maybe the most controversial issue. By packaging, I mean putting everything which constitutes the software (configuration, .py, .so/.pyd, documentation, etc…) in a format which can be deployed on many machines in a consistent way. For web-developers, it seems this means something which can be put on a couple of machines, in a known state. For packages like numpy, this means being able to install on many different kinds of platforms, with different capabilities (different C runtimes, different math libraries, different optimized libraries, etc…). And other cases exist as well.

For some people, the answer is: use a sane OS with package management, and life goes on. Other people consider setuptools’ way of doing things almost perfect; it does everything they want, and they don’t understand those pesky Debian developers who complain about multiple versions, etc… I will try to summarize the different approaches here, and the related issues.

The underlying problem is simple: any non trivial software depends on other things to work. Obviously, any python package needs a python interpreter. But most will also need other packages: for example, sphinx needs pygments and Jinja to work correctly. This becomes a problem because software evolves: unless you take great care about it, software will become incompatible with an older version. For example, the package foo 1.1 decided to change the order of arguments in one function, so bar, which worked with foo 1.0, will not work with foo 1.1. There are basically three ways to deal with this problem:

  1. Forbid the situation. Foo 1.1 should not break software which works with foo 1.0. It is a bug, and foo should be fixed. That’s generally the preferred OS vendor approach.
  2. Bypass the problem by bundling foo in bar. The idea is to distribute a snapshot of most of your dependencies, in a known working situation. That’s the bundling approach.
  3. Install multiple versions: bar will require foo 1.1, but fubar still uses the old foo 1.0, so both foo 1.0 and foo 1.1 should be installed. That’s the “setuptools approach”.

Package management ala linux is the most robust approach in the long term for the OS. If foo has a bug, only one version needs to be repackaged. For system administrators, that’s often the best solution. It has some problems, too: generally, things cannot be installed without admin privileges, and packages are often fairly old. The latter point is not really a problem, but inherent to the approach: you can’t request both stability and bleeding edge. And obviously, it does not work for the other OSes. It also means you are at the mercy of your OS vendor.

Bundling is the easiest. The developer works against a known, working set of dependencies, and is not dependent on the OS vendor to get an up to date version.

Approach 3 sounds like the best solution, but in my opinion, it is the worst, at least in the current state of affairs as far as python is concerned, and when the software targets “average users”. The first problem is that many people seem to ignore the problems caused by multiple, side by side installations. Once you start saying “depends on foo 1.1 and later, but not higher than 1.3”, you start creating a management hell, where many versions of every package are installed. The more it happens, the more likely you get into a situation like the following:

  • A depends on B >= 1.1
  • A depends on C which depends on B <= 1.0

Meaning a broken dependency. This situation has to be avoided as much as possible, and the best way to avoid it is to maintain compatibility, such that B 1.2 can be used as a drop-in replacement for B 1.0. I think too often people request multiple versions as a poor man’s replacement for backward compatibility. I don’t think it is manageable. If you need a known version of a library which keeps changing, I think bundling is better – generally, if you want deployable software, you should really avoid depending on libraries which change too often; I think there is no way around it. If you don’t care about deploying on many machines (which seems to be the case for web-deployment), then virtualenv and other similar tools are helpful; but they can’t seriously be suggested as a general deployment tool for the same audience as .deb/.rpm/.msi/.pkg. Deployment for testing is very different from deployment to many machines you can’t control at all (the users’ ones).

Now, having a few major versions of the most common libraries should be possible – after all, it is done for C libraries (the same library installed under different versions with different sonames). But python, contrary to C loaders, does not support explicit version loading independently of the name. You can’t say something like “import foo with v >= 1.1”; you have to use a new name for the module – meaning changing every library user’s source code. So you end up with the hacks used by setuptools/easy_install, which are very fragile (sys.path overriding, PYTHONPATH mess, easy_install.pth, etc…). At least for me, that’s a constant source of frustration, to the point that I effectively forbid setuptools to do anything on my machine: easy-install.pth is read only, and I always install with --single-version-externally-managed.
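For reference, here is roughly how setuptools works around the lack of versioned imports today – a minimal sketch, with “foo” being a hypothetical package name:

import pkg_resources
pkg_resources.require("foo>=1.1")   # activates an egg matching the requirement on sys.path
import foo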

With things like virtualenv and pip freeze, I don’t understand the need for multiple versions of the same libraries installed system-wide. I can see how python does not make it easy to support tools like virtualenv and pip directly (that is, without setuptools), but maybe people should focus on enabling virtualenv/zc.buildout usage without setuptools hacks (sys.path hacking, easy_install.pth), basically without setuptools, instead of pushing the multiple library thing on everyone ?

Standardize on data, not on tools

As mentioned previously, I don’t think python should standardize on one tool. The problem is just too vast. I would be very frustrated if setuptools became the tool of choice for python – but I understand that it solves issues for some people. Instead, I hope the python community will be able to standardize on metadata. Most packages have relatively simple needs, which could be covered by a set of static metadata.

It looks like such a design already exists: cabal, the packaging tool for haskell (Thanks to Fernando Perez for pointing me to cabal):

http://www.haskell.org/cabal/release/cabal-latest/doc/users-guide/

Cabal works with two files:

  • setup.hs -> the equivalent of our setup.py. It can use haskell, and as such can do pretty much anything.
  • .cabal -> static metadata.

For example:

Name: HUnit
Version: 1.1.1
Cabal-Version: >= 1.2
License: BSD3
License-File: LICENSE
Author: Dean Herington
Homepage: http://hunit.sourceforge.net/
Category: Testing
Synopsis: A unit testing framework for Haskell

Library
  Build-Depends: base
  Exposed-modules:
    Test.HUnit.Base, Test.HUnit.Lang, Test.HUnit.Terminal,
    Test.HUnit.Text, Test.HUnit
  Extensions: CPP

Even for a developer who knows nothing about haskell (like me :) ), this looks obvious. Basically, the classifiers and arguments of the distutils setup function go into a static file in the haskell world. Being a simple, readable text file, other tools can use it pretty easily. Of course, we would provide an API to get those data, but the common infrastructure is the file format and meta-data, not the API.
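To illustrate why a static file lowers the bar for other tools, here is a hedged, minimal sketch of reading such metadata without running any setup script (the parser is deliberately naive – it ignores sections and continuation lines – and nothing here is a proposed API):

def read_metadata(path):
    """Parse simple 'Key: value' lines into a dict."""
    meta = {}
    with open(path) as f:
        for line in f:
            if ':' in line and not line[0].isspace():
                key, value = line.split(':', 1)
                meta[key.strip()] = value.strip()
    return meta

# e.g. read_metadata('HUnit.cabal')['Version'] would give '1.1.1'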

Note that the .cabal file allows for conditionals, albeit in a very structured form. I don’t know whether this should be followed or not: the point of a static file is that it is easily parsable, and having conditionals severely decreases the simplicity. OTOH, a simple way to add options is nice – and other almost static metadata files for packaging, such as RPM .spec files, allow for this.

It could also be simple to convert many distutils packages to such a format; actually, I would be surprised if the majority of packages out there could not be automatically translated to such a mechanism.

Then, we could gradually deprecate some distutils commands (to end up with configure/build/install, with configure optional), such that different build tools could be plugged in for the build itself – distutils could be used for simple packages (the ones without compiled extensions), and other people could use other tools for more advanced needs (something like what I did with numscons, which bypasses distutils entirely for building C/C++/Fortran code).

uninstall

Another often requested feature. I think it is a difficult feature to support reliably. Uninstall is not just about removing files: if you install a daemon, you should stop it, you may ask about configuration files, etc… It should at least support pre/post install hooks and corresponding uninstall equivalents. But the main problem for python is how to keep a list of installed packages/files. Since python packages can be installed in many locations, there should be one db (the db could and most likely should be a simple flat file) for each site-packages. I am not yet familiar with haskell module management, but it looks like that’s how haskell does it.
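As a hedged illustration only (the file layout and location below are made up; nothing like this is standardized for python), such a flat-file db per site-packages could be as simple as one record per package listing the installed files:

import os

def record_install(db_dir, package, version, files):
    """Write the list of installed files for a package to a flat file."""
    record = os.path.join(db_dir, '%s-%s.files' % (package, version))
    with open(record, 'w') as f:
        f.write('\n'.join(files))

def uninstall(db_dir, package, version):
    """Remove every file listed in the package record, then the record itself."""
    record = os.path.join(db_dir, '%s-%s.files' % (package, version))
    with open(record) as f:
        for path in f.read().splitlines():
            if os.path.exists(path):
                os.remove(path)
    os.remove(record)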

Conclusion

Different people have different needs. Any solution from one camp which prevents other solutions is very unhelpful and counterproductive. I don’t want to get my ubuntu deployment system screwed up by some toy dependency system – but I don’t want to prevent web developers from using their workflow either. I can’t see a single system solving all of this – the problem has not been solved by anything I know of – it is too big a problem to hope for a general solution. Instead of piling complexity and hacks over complexity and hacks, we should standardize the commonalities (of which there are plenty), and make sure different systems can be used by different projects.

numscons and cython

numscons 0.9.2 has just been released. The main feature of this release is cython support: I implemented a small cython tool during the cython tutorial at scipy08, and now, you can build a cython extension from a .py or .pyx file:

from numscons import GetNumpyEnvironment
env = GetNumpyEnvironment(ARGUMENTS)
# cython tool not loaded by default
name = "cython"
env.Tool(name)
# Build a python extension from yop.py
env.DistutilsPythonExtension(source = ["yop.py"])

The example can be found in test/examples/cython in the numscons sources. This is preliminary, since there is no way to pass options to the cython generation yet.


numscons, part 2 : Why scons ?

This is the 2nd part of the series about numscons. This part will present scons in more detail, to show how it can solve the problems mentioned in part 1.

scons is intended as a replacement for the venerable make. It is written in python, making it a logical candidate to build complex extension code like numpy and scipy. The scons process is driven by a scons script, as the make process is driven by a Makefile. Like makefiles, scons scripts are declarative, and scons automatically builds a Directed Acyclic Graph (DAG) from the description in the scons scripts to build the software in a correct order. The comparison stops here, though, because scons is fundamentally different from make in many aspects.

Scons scripts are python scripts

Not only is Scons itself written in python, but scons scripts are themselves python scripts. Almost anything possible in python is possible in a scons script; rules in makefiles are mostly replaced by Builders in scons parlance, which are python functions. This also means that anything fancy done in numpy.distutils can be used in scons scripts if the need arises, which is not a small feat.
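For example, a minimal scons script (SConstruct) is just python – ordinary python constructs such as imports and list manipulation describe the build (the file names below are made up):

import glob

# Environment (and the Builders attached to it) is provided by scons in the script's namespace
env = Environment()
sources = glob.glob('src/*.c')    # plain python to collect sources
env.Program('hello', sources)     # Program is a Builder: it links hello from the sources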

Scons has a top notch dependency system

This is one of the reasons people move from make to scons. Although make does handle dependencies, you have to set up the dependencies in the rules yourself; for example, for a simple object file hello.o built from hello.c, which includes a header hello.h:

hello.o : hello.c hello.h
        $(CC) -c hello.c -o hello.o

If you don’t list hello.h, and change hello.h later, make will not detect the change, and will consider hello.o as up to date. This quickly becomes intractable for large projects, and thus several tools exist to automatically handle dependencies and generate rules for make. Automake (used in most projects using autotools) does this, for example; distutils itself does this, but not really reliably. With make files, you have to regenerate the make files every time the dependencies change.

On the contrary, scons does this automatically: if you have #include "hello.h" in your source file, scons will automatically add hello.h as a dependency of hello.c. It does this by scanning the content of hello.c. Even better, scons automatically adds, for each target, a dependency on the code and commands used to build the target; concretely, if you build some C code and the compiler changes, scons detects it.
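The make example above, redone as a hedged scons sketch – note that no dependency on hello.h appears anywhere; it is discovered by scanning hello.c:

env = Environment()
env.Object('hello.c')   # hello.h is found via the #include scan; hello.o is rebuilt when it changes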

Thus, scons solves the dependency problem for free, one of the fundamental problems of distutils for extension code (this problem is the first in the list of distutils revamp goals).

Build configurations are handled in objects, not in code

Another fundamental problem of distutils is the way it stores knowledge about building a particular kind of target: the compilation flags, compilers and paths are embedded in the code of distutils itself, and not available programmatically. Some of it is available through distutils.sysconfig, but not always (in particular, it is not available for python built with MS Visual Studio).

On the other hand, Scons stores compiler flags and any kind of build specific knowledge in environment objects. In that regard, Environment instances are like python dictionaries, which store the compiler, compiler flags, etc… Those environments can be copied and modified at will. They can also be used to compile different source files differently, for example with different optimization or warning levels. For example:

warnflags = ['-Wall', '-W']
env = Environment()
warnenv = env.Clone(CFLAGS = warnflags)

Will create two environments; any build command related to env will use the default compiler flags, whereas warnenv will use the warning flags. This also makes customization by the user much easier. People often have trouble compiling numpy with different options, for example for more aggressive compilation:

CFLAGS="-O3 -funroll-loops" python setup.py build

Does not work, because the user’s CFLAGS overrides the CFLAGS used by distutils, and all compiler flags are kept in the same variable (flags from distutils and flags from the user are stored in the same place). With scons, those can easily be put in different locations. With numscons, the following work out of the box:

python setup.py build # Default build
CFLAGS="-W -Wall -Wextra -DDEBUG -g" python setup.py build # Unoptimized, debug build
CFLAGS="-funroll-loops -O3" python setup.py build # Agressive build

scons enables straightforward compilation customization through the command line. This is important for users who like to build numpy/scipy with special configurations (which is quite common in the scientific community), and also for packagers, who complain a lot about distutils and its weird argument handling.

Scons is extensible

scons is also extensible. Although it has some quirks, in particular some unpythonic ways of doing things, it is built with customization in mind. As mentioned earlier, scons generates targets from sources (for example hello.o from hello.c) through special methods called Builders. It is possible and relatively easy to create your own builder. Builders can be complex, though, but that’s because they can be very flexible:

  • Builders can have their own scanner. For example, the f2py builder in numscons has its own scanner to automatically handle dependencies in <include_file=…> f2py directives.
  • Builders can have their own emitters: an emitter is a function which generates the list of targets from the list of sources. It can be used to dynamically add new source files, and modify the list of targets. For example, when building f2py extensions, some extra files are needed, and an emitter is a way to do it.
  • Builders have many other options which I won’t talk about here.

The scons wiki also contains a vast range of builders for different kinds of tasks (building documentation, tarballs, etc…). With builders, building code using swig, cython or ctypes is possible, and does not require any distutils magic: if you know how to build them from the command line, implementing builders for them is relatively straightforward, as long as they fit in the DAG view (f2py for example was quite difficult to fit there).
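As a hedged sketch of the mechanism (this is not the actual numscons cython tool, and the builder name is made up), a builder is essentially a command template plus a suffix mapping registered on an environment:

env = Environment()
cython_builder = Builder(action='cython $SOURCE -o $TARGET',
                         suffix='.c', src_suffix='.pyx')
env.Append(BUILDERS={'CythonToC': cython_builder})
env.CythonToC('yop.pyx')   # produces yop.c, which a C builder can then compile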

Scons has a configure subsystem

When building numpy/scipy, we need to check for dependencies such as BLAS/LAPACK, fft libraries, etc… The way numpy.distutils does it is to look for files in some paths. This is highly unreliable, because the mere existence of a file does not mean it is usable; maybe it is too old, or not usable by the compiler in use, etc… Scons has a configure subsystem which works in a manner similar to autotools: to check for libfoo with the foo.h header, scons will try to compile a code snippet including foo.h, and try to link it with -lfoo (or /LIB:foo.lib with the MS compiler). This is much more robust. Robustness is important here because people often try to build their own blas/lapack, make some mistake in the process, and still build numpy successfully; only once they try to run numpy do they hit problems. Another problem with the current scheme in numpy.distutils is that it is fragile, and difficult to modify by people with unusual configurations (using Intel or AMD optimized libraries, for example); thus, only the few people who know enough about numpy.distutils can do it. Finally, the scons subsystem is much easier to use:


env = Environment()
config = Configure(env)
config.CheckLibWithHeader('foo', 'foo.h', 'c')
config.Finish()

Is straightforward, whereas the same thing in numpy.distutils takes around 50 lines of code. Out of the box, the scons configure subsystem has the following checks (a small usage sketch follows the list):

  • CheckHeader: to check for the availability of a C/C++ header
  • CheckLib: to check for the availability of a library
  • CheckType/CheckTypeSize: to check for the availability of a type and its size
  • CheckDeclaration: to check for #define
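
As a hedged sketch of how these checks are used inside a scons script (the library, header and symbol names are made up):

env = Environment()
conf = Configure(env)
conf.CheckHeader('foo.h')                                           # header available?
conf.CheckLib('foo')                                                # can we link against -lfoo?
conf.CheckTypeSize('long long')                                     # type present, and its size
conf.CheckDeclaration('FOO_VERSION', includes='#include <foo.h>')   # #define present?
env = conf.Finish()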

An example I find striking is to compare the setup.py and the scons script for numpy.core. Because of the configure subsystem, the scons script is much easier to understand IMHO.

Now, the scons configure subsystem is not ideal either: internally, it relies heavily on some obscure features of scons itself for the dependency handling, which means it is quite fragile. For most usages (in particular checking for libraries/headers, which is the only thing that the vast majority of numscons users will use), this works perfectly. For some advanced uses of the subsystem, this is problematic: the fortran configuration subsystem of numscons, for example, requires grepping through the output (both stdout/stderr) of the builders inside the checkers, and this does not work well in scons (I have to bypass the configure builders, basically).

Conclusion

When looking at the list prepared by David M. Cook for distutils improvements, one can see that scons already solves most of them:

  • better dependency handling: done by scons DAG handling
  • make it easier to use a specific compiler or compiler option: through scons environments
  • allow .c files to specify what options they should/shouldn’t be compiled with (such as using -O1 when optimization screws up, or not using -Wall for .c made from Pyrex files): through scons environments
  • simplify system_info so that adding checks for libraries, etc., is easier: through the scons configure subsystem
  • a more “pluggable” architecture: adding source file generators (such as Pyrex or SWIG) should be easy: through builders, actions, etc..

And more interesting for me, when I see some problems in scons, I can solve them upstream, so that it benefits other people, not just numpy/scipy. In particular, the fortran support was problematic in scons, and since scons 0.98.2, my work on new fortran support is available. CheckTypeSize and CheckDeclaration, as well as some configuration header generation improvements, were also committed upstream.

In Part 3, I will explain the basic design of numscons, and how it brings scons power into numpy build system.

numscons, part 1: the problems with building numpy/scipy with distutils

This will be the first post of a series about numscons, a project I have been working on for a bit more than 6 months now. Simply put, numscons is an alternative build system to build numpy/scipy and other python software which relies heavily on compiled code. Before talking about numscons, this first post will list the problems with the current build system.

Current flaws in distutils/numpy.distutils:

Here are some things that several people, including me, would like to be able to do:

  1. If a package depends on a library, it is difficult to test for the dependency (header, library). In autoconf, it is one line to test for the headers/libraries. With numpy.distutils, you need around 50 lines of code, and it is quite fragile.
  2. Not possible to build ctypes extensions in a portable way.
  3. Not possible to compile different part of a package with different compilation options.
  4. No dependency system: if you change some C code, the only reliable way to build correctly is to start from scratch.
  5. CFLAGS/FFLAGS/LDFLAGS do not have the expected semantics: instead of prepending options to the ones used for the actual compilation, they override the flags, which means that doing something like CFLAGS="-O3" will break, since -fPIC and all the options necessary to build python extensions are missing.
  6. The way to use different BLAS/LAPACK/Compilers is arcane, with too many options, which may fail in different ways.

Why not improve the current build system ?

Last year, I sent an email to the numpy ML explaining the problems I had with distutils and its extensions in numpy.distutils. The majority agreed that the current situation was less than ideal, but the people who knew enough about the current system to improve it could not spend a lot of time on it. The current build system is a set of extensions around distutils, the standard package for build/distribution under python. Here lies the first problem: distutils is a big mess. The code is ugly, badly designed, and not documented. In particular:

  1. Difficult to extend: although in theory distutils has the Command class which can be inherited from, a lot of magic is going on, and there is no clear public API. Depending on the way you call distutils, the classes have different attributes !!!
  2. Distutils fundamentally works as a set of commands. You first do that, then that, then that. That’s the wrong model for building software; the right model is a DAG of dependencies (ala make). In particular, for numpy/scipy, when you change some C code, the only way to reliably rebuild the package is to start from scratch.
  3. The compilation options are spread everywhere in the code. Depending on the platform, they are available in distutils.sysconfig (UNIX) or not (windows). On the latter, it is not possible to retrieve the options for compilation. This, combined with the lack of extensibility, means simple things like building ctypes extensions are much more difficult than they should be.

Using scons to build compiled extensions:

For this reason, I thought it might be better to use a build system which knows about dependencies and compiled code, and is preferably written in python. The best known contender with those characteristics is scons. scons is a make replacement, written 100% in python. In particular:

  1. scons is built around the DAG concept. Its dependency system is top-notch: if you change a link option, it will only relink; if header files change, scons automatically detects it.
  2. scons has a primitive but working system to check for dependencies (check for headers, libraries, etc…). It works like autoconf, that is, instead of looking for files, it tries to build code snippets. This is much more robust than the current numpy.distutils way, because if for example your blas/lapack is buggy, you can detect it. Since many people build their own blas/lapack for numpy/scipy, and get it wrong, this is important.
  3. scons is heavily commented, reasonably well documented, and some relatively high-profile companies are using it, so it is proven software (vmware uses it for some of its main products, Intel uses it, Doom and the other Id Software Linux ports are built with scons; it seems that generally, scons is quite popular in the gaming community, both open source and proprietary).

Scons also has some disadvantages:

  1. It uses ancient python (compatible with 1.5.2). This has many unfortunate consequences, and the advantages of compatibility are outweighed by its disadvantages IMO. In particular, some code is quite arcane because of it (use of apply instead of the foo(*args, **kw) idiom).
  2. A lot of things are ‘unpythonic’, and a lot of the logic is hardcoded in the main entry point, meaning you cannot really use it as a library within your project. You have to let scons drive the whole process.
  3. It misses a lot of essential features for packaging, meaning it is not often used for open source projects.
  4. It is relatively slow, although this is not a big problem for numpy/scipy.
  5. The scons developer community is not large: it is mainly the work of 2-3 people, and I believe this is partly a consequence of points 1 and 3.

Nevertheless, I decided to use scons, and I believe it was the right choice. One thing which pleased me is that instead of improving numpy.distutils, a fragile system that nobody outside numpy/scipy will use anyway, I instead spent time implementing missing features in scons, some of which are already integrated upstream (better fortran support, better support of some fortran compilers, etc…). This way, everybody can benefit from those new features.

The next post in the series will be about the features I was interested in implementing in numscons, and how I implemented them.
