Distributing python programs à la Go, as a single file

I am glad to see discussions about the problem of distributing python programs in the wild. A recent post by Glyph articulates the main issues better than I could. The developers vs end-users focus is indeed critical, as is making the platform an implementation detail.

There is one solution that Glyph did not mention: the freeze tool in python itself. While not for the faint of heart, it allows building a single, self-contained executable. Since the process is not really documented, I thought I would document it here.

Setting up a statically linked python

The freeze tool is not installed by default, so you need to get it from the sources, e.g. from one of the source tarballs. You also need to build python statically, which is itself a bit of an adventure.

I prepared a special build of static python on OS X which statically links sqlite (3.8.11) and ssl (1.0.2d), both from homebrew.

Building a single-file, hello world binary

Let’s say you have a script hello.py with the following content:

print("hello world")

To freeze it, simply do as follows:

<static-python>/bin/python <static-python>/lib/python2.7/freeze/freeze.py hello.py
make

You should now have an executable called hello of approximately 7-8 MB. This binary should be relatively portable across machines, although in this case I built the binary on Yosemite, so I am not sure whether it would work on older OS X versions.

How does it work?

The freeze tool works by byte-compiling every dependent module and creating a corresponding .c file containing the bytecode as a string. Every module is then statically linked into a new executable.
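
A quick way to convince yourself that the result does not depend on the python files on disk is to ask the imp module whether a given module was frozen into the binary (Python 2.x, to match the static python above); the exact set of frozen modules will of course depend on what your script imports:

# Run from within the frozen executable: prints True for every module that
# was baked into the binary by freeze, rather than loaded from site-packages.
import imp

for name in ("__main__", "os"):
    print name, imp.is_frozen(name)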

Limitations

I have used this process successfully to build non-trivial applications that depend on dozens of libraries. If you want a single executable, the main limitation is that you cannot depend on C extensions.

More generally, the main limitations are:

  1. you need to build python statically
  2. you have to be on unix
  3. you cannot depend on C extensions
  4. none of your dependencies may use shenanigans for package data or imports

1 and 2 are linked. There is no reason why it should not work on windows, but statically linking python on windows is even less supported than doing it on unix. It would be nice for python itself to support static builds better.

3 is a problem that has been solved over and over by the various freezer tools. It would be nice to get a minimal, well-written library solving this problem. Alternatively, a way to load C extensions from within a single file would be even better, but not every platform can do this.

4 is actually the main issue in practice; it would be nice to have a good solution here. Something like pkg_resources, but more hackable/tested.

I would argue that the pieces for a better deployment story in python are there: what is needed is taking the existing pieces to build a cohesive solution.

Bento at Pycon2011 and what’s coming in bento 0.0.5

I could not spend much time (if any) on bento the last few weeks of 2010, but I fortunately got back some time to work on it this month. It is a good time to describe a bit what I hope will happen in bento in the next few months.

Bento poster @ Pycon2011

First, my bento talk proposal was rejected for PyCon 2011, so it will only be presented as a poster. It is a bit unfortunate because I think it would have worked much better as a talk than as a poster. Nevertheless, I hope it will help bring awareness of bento outside the scipy community, and give me a better understanding of people’s needs for packaging (a poster may actually be better for that latter point).

Bento 0.0.5

Bento 0.0.5 should be coming soon (mid-february). Contrary to the 0.0.4 release, this version won’t bring major user-visible features, but it contains a lot of internal redesign to make bento easier to use:

Automatic command dependency

One does not need to run each command separately anymore. If you run “bentomaker install”, it will automatically run configure and build on its own, in the right order. What’s interesting about it is how dependencies are specified. In distutils, subcommand order is hardcoded inside the parent command, which makes it virtually impossible to extend them. Bento does not suffer from this major deficiency:

  • Dependencies are specified outside the classes: you just need to say which class must be run before/after
  • Class order is then computed at run time using a simple topological sort. Although the API is not there yet, this will enable arbitrary insertion of new commands between existing commands without the need to monkey patch anything (see the sketch after this list)
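
To make the idea concrete, here is a minimal sketch of the principle (this is an illustration, not bento’s actual API): the dependency graph lives in a plain dictionary outside the command classes, and the run order is just a topological sort of that graph.

# Illustration only: command dependencies declared outside the command
# classes, with the run order computed by a depth-first topological sort.
DEPENDENCIES = {
    "install": ["build"],
    "build": ["configure"],
    "configure": [],
}

def run_order(target, deps=DEPENDENCIES, order=None):
    if order is None:
        order = []
    for before in deps[target]:
        if before not in order:
            run_order(before, deps, order)
    if target not in order:
        order.append(target)
    return order

print run_order("install")   # ['configure', 'build', 'install']

Adding a new command between two existing ones is then only a matter of adding entries to the dictionary, without touching any class.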

Virtualenv support

If a bento package is installed from within a virtualenv, it will be installed inside that virtualenv by default:

virtualenv .env
source .env/bin/activate
bentomaker install # this will install the package inside the virtualenv

Of course, if the install paths have been customized (through prefix/eprefix), those take precedence over the virtualenv.

List files to be installed

The install command can optionally print the list of files to be installed and their actual installation paths. This can be used to check where things are installed. By design, this list is exactly what bento would install, so it is much harder to end up with weird corner cases where the list and what is actually installed differ.

First steps toward uninstall

An initial “transaction-based” install is available: in this mode, a transaction log is generated, which can be used to roll back an install. For example, if the install fails in the middle, already installed files are removed to keep the system in a clean state. This is a first step toward uninstall support.
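
The principle is easy to sketch (again, an illustration, not bento’s actual implementation): record every file as it is copied, and replay the log in reverse if anything goes wrong. The same log is obviously the starting point for uninstall.

# Sketch of a transaction-based install: every installed file is logged,
# and the log is rolled back if the install fails half-way.
import os, shutil

def transactional_install(pairs, log_path="install-transaction.log"):
    # pairs is a list of (source, target) paths
    log = open(log_path, "w")
    installed = []
    try:
        for source, target in pairs:
            shutil.copy(source, target)
            installed.append(target)
            log.write(target + "\n")
    except Exception:
        # rollback: remove everything already copied, newest first
        for target in reversed(installed):
            if os.path.exists(target):
                os.remove(target)
        raise
    finally:
        log.close()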

Refactoring to help using waf inside bento

Bento's internals have been improved to enable easier customization of the build tool. I have a proof of concept where bento is customized to use waf to build extensions. The whole point is to be able to do so without changing bento’s code itself, of course. The same scheme can be used to build extensions with distutils (for compatibility reasons, to help complex packages move to bento one step at a time).

Bentoshop: a framework to manage installed packages

I am hoping to have at least a proof of concept for a package manager based around bento for Pycon 2011. As already stated on this blog, there are a few non-negotiable features that the design must follow:

  1. Robust by design: things that can be installed can be removed, avoiding synchronisation issues between metadata and installed packages
  2. Transparent: it should play well with native packaging tools and not get in the way of anyone’s workflow.
  3. No support whatsoever for multiple versions: this can be handled with virtualenv for trivial cases, and through native “virtualization” schemes when virtualenv is not enough (chroot for fs “virtualization”, or actual virtual machines for more)
  4. Efficient

This means PEP376 is out of the question (it breaks points 1 and 4). I will build a first proof of concept modeled on the haskell cabal and R (CRAN) systems, but backed by a db for performance.

The main design issue is point 2: ideally, one would want a user-specific, python-specific package manager to be aware of packages installed through the native system, but I am not sure it is really possible without breaking other points.

A few remarks on distutils2

Disclaimer: I am working on a project which may be seen as a competitor to
the distutils2 effort, and I am quite biased against the existing packaging tools
in python. On the other hand, I know distutils extremely well, having
maintained numpy.distutils extensions for several years, and most of my
criticisms should stand on their own.

There is a strong consensus in the python community that the current packaging
tools (distutils) are too limited. There have been various attempts to improve
the situation, through setuptools, the distribute fork, etc… Beginning this
year, the focus has shifted toward distutils2, which is scheduled to be
part of the stdlib for python 3.3, while staying compatible with python 2.4
onwards. A first alpha has been released recently, and I thought it was a good
occasion to look at what happened in that space.

As far as I can see, distutils2 had at least the three following goals:

  • standardize a lot of setuptools practices through PEPs and implement them.
  • refactor distutils code and add a test suite with a significant coverage.
  • get rid of setup.py for most packages, while adding hooks for people who
    need to customize their build/installation/deployment process

I won’t discuss much about the first point: most setuptools features are
useless to the scipy community, and are generally poor reimplementations of
existing solutions anyway. As far as I can see, the third point is still being
discussed, and not present in the mainline.

The second point is more interesting: distutils code quality was pretty low,
but the main issue was (and still is) the overall design. Unfortunately, adding
tests does not address the reliability issues which have plagued the scipy
community (and I am sure other communities as well). The main issues w.r.t.
build and installation remain:

  • unreliable installation: distutils installs things by simply copying the trees
    built into a build directory (build/ by default). This is a problem when
    you decide to change your source code (e.g. renaming some modules), as
    distutils will add things to the existing build tree, and hence install
    will copy both old and new targets. As with distutils, the only way to get
    a reliable build is to first rm -rf build. This alone is a consistent
    source of issues for numpy/scipy, as many end-users are bitten by it. We
    somewhat alleviate this by distributing binary installers (which know how
    to uninstall things and are built by people familiar with distutils idiocy).
  • Inconsistencies between compiler classes. For example, the MSVCCompiler
    class compiler executable is defined as a string, and set as the attribute
    cc. On the other hand, most other compiler classes define the compiler_so
    attribute (which is a list in that case). They also don’t have the same
    methods.
  • No consistent, centralized API to obtain basic compilation options (CC
    flags, etc…)

Even more significantly, it means that the fundamental issue of extensibility
has not been addressed at all, because the command-based design is still there.
This is by far the worst part of the original distutils design, and I fail to
see the point of a backward-incompatible successor to distutils which does not
address this issue.

Issues with command-based design

Distutils is built around commands, which correspond almost one to one to
command-line commands: when you do “python setup.py install”, distutils will essentially
call the install.run command after some initialization stuff. This by itself is
a relatively common pattern, but the issue lies elsewhere.

Options handling

First, each command has its own set of options, but the options of one command
often affect the other commands, and there is no easy way for one command to
know the options of another. For example, you may want to know the
options of the install command at build time. The usual pattern is to
instantiate the command whose options you need and query it, by using
e.g. get_finalized_command:

install = self.get_finalized_command("install")
install_lib = install.install_lib

This is hard to use correctly because every command can be reset by other
commands, and some commands cannot be instantiated this way depending on the
context. Worse, this can cause unexpected issues later on if you are calling a
command which has not already been run (like the install command in a build
command). Quite a few subtle bugs in setuptools and in numpy.distutils were/are
caused by this.


According to Tarek Ziade (the main maintainer of distutils2), this is addressed in a distutils2 development branch. I cannot comment on it as I have not looked at the code yet.

Sub-commands

Distutils has a notion of commands and “sub-commands”. Subcommands may override
each other’s options through the set_undefined_options function, which creates
new attributes on the fly. This is every bit as bad as it sounds.

Moreover, the hardcoding of dependencies between commands and sub-commands
significantly hampers extensibility. For example, in numpy, we use some
templated source files which are processed into .c files: this is done in the
build_src command. Now, because the build command of distutils does not know
about build_src, we need to override build as well to call build_src. Then
came setuptools, which of course did not know about build_src, so we had to
conditionally subclass from setuptools to run build_src too [1]. Every command
which may potentially trigger this command may need to be overridden, with all
the complexity that follows. This is completely insane.

Hooks

Distutils2 has added the notion of hooks, which are functions to be run before/after
the command they hook into. But because they interact with distutils2 through
the command instances, they share all the aforementioned issues, and I suspect
they won’t be of much use.

More concretely, let’s consider a simple example: a file generated from
a template (say config.pkg.in), containing some information only known at
build time (like the version and build time). Doing this correctly is
surprisingly difficult:

  • you need to generate the file in a build command, and put it at the right
    place in the build directory
  • you need to install it at the right place (in-place vs normal build, egg
    install vs non-egg install vs externally_managed install)
  • you may want to automatically include the template file in sdist
  • you may want the file to be installed in bdist/msi/mpkg, so you may need to
    know all the details of those commands

Each of these steps may be quite complex and error-prone. Some are impossible with a
simple hook: it is currently impossible to add files to sdist without rewriting
the sdist.run function AFAIK.
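
The irony is that the generation itself is trivial; the difficulty is entirely
in the bookkeeping around it. Assuming a hypothetical config.pkg.in using
$version and $build_date placeholders, the build-time step is just:

# Generating the file is the easy part; knowing where it must live in the
# build tree and where it will be installed is what the command system
# makes hard. config.pkg.in and its placeholders are made up for the example.
import time
from string import Template

def generate_config(template="config.pkg.in", target="config.pkg",
                    version="1.0"):
    content = Template(open(template).read())
    open(target, "w").write(content.substitute(
        version=version,
        build_date=time.strftime("%Y-%m-%d %H:%M:%S")))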

To deal with this correctly, the whole command business needs a significant
redesign. Several extremely talented people in the scipy community have
independently attempted to improve this in the last decade or so, without any
success. Nothing short of a rewrite will work there, and commands constitute a
good third of distutils code.

Build customization

distutils2 does not improve the situation w.r.t. building compiled code, but I
guess that’s relatively specific to the big packages like numpy, scipy or
pywin32. Needless to say, the compiler classes are practically impossible to
extend (they don’t even share a consistent interface), and very few people know
how to add support for new compilers, new tools or new binaries (ctypes
extensions, for example).

Overall, I don’t quite understand the rationale for distutils2. It seems that
most setuptools-standardization could have happened without breaking backward
compatibility, and the improvements are too minor for people with significant
distutils extensions to switch. Certainly, I don’t see myself porting
numpy.distutils to distutils2 anytime soon.

[1]: it should be noted that most setuptools issues are really distutils
issues, in the sense that distutils does not provide the right abstractions to
be extended.

Bento 0.0.4 released!

I have just released the new version of Bento, 0.0.4. You can get it on github as usual.


Bento itself did not change too much, except for the support of sub-packages and a few other things. But now bento can build both numpy and scipy on the “easy” platforms (linux + Atlas + gcc/clang). This post shows a few cool things that you can now do with bento.

Full distribution check

The best way to use this version of bento is to do the following:

# Download bento and create bentomaker
git clone http://github.com/cournape/Bento.git bento-git
cd bento-git && python bootstrap.py && cd ..
# Download the _bento_build branch from numpy
git clone http://github.com/cournape/numpy.git numpy-git
cd numpy-git && git checkout -b bento_build origin/_bento_build
# Create a source tarball from numpy, configure, build and test numpy
# from that tarball
../bento-git/bentomaker distcheck

For reasons I am still unclear about, the test suite fails to run from distcheck for scipy, but that seems to be more of a nose issue than a bento one.

Building numpy with clang

Assuming you are on Linux, you can try to build numpy with clang, the LLVM-based C compiler. Clang compiles faster than gcc, and generally gives better error messages. Although bento itself does not have any support for clang yet, you can easily play with the bento scripts to use it. In the top bscript file from numpy, at the end of the post_configure hook, replace every compiler with clang, i.e.:

for flag in ["CC", "PYEXT_CC"]:
    yctx.env[flag] = ["clang"]

Once the project is configured, you can also get a detailed look at the configured options, in the file build/default.env.py. You should not modify this file, but it is very useful to debug build issues. Another aid for debugging configuration options is the build/config.log file. Not only does it list every configuration command (both success and failures), but it also shows the source content as well as the command output.

What’s coming next?

Version 0.0.5 will hopefully have a shorter release period than 0.0.4. The goal for 0.0.5 is to make bento good enough so that other people can jump into bento development.

The main features I am thinking about are windows and python 3 support, plus a lot of code cleaning/documentation. Windows should not be too difficult: it is mainly about taking the numscons/scons code for Visual Studio support and adapting it to yaku. I have already started working on python 3 support as well – the main issue is bootstrapping bento, and finding an efficient process to work on both python 2 and 3 at the same time. Depending on the difficulty, I will also try to add proper dependency handling in yaku for compiled libraries and dependent headers: ATM, yaku does not detect header changes, nor does it rebuild an extension if the linked libraries change. An alternative is to bite the bullet and start working on integration with waf, which already does all this internally.

Bento (ex-toydist): what’s coming for 0.0.3

A lot has happened feature-wise since the 0.0.2 release of toydist. This is a
short summary of what is about to come in the 0.0.3 release.

Toydist renamed to bento

I have finally found a not too sucky name for toydist: bento. As you may know, bento is the Japanese word for lunch-box (see picture if you have no idea what I am talking about). The idea is that those are often nicely prepared, and bentomaker becomes the command to get a nicely packaged piece of software :)

Integration of yaku, a micro build framework

The 0.0.2 release of toydist was still dependent on distutils to build C
extensions. I have since then integrated a small package to build things, yaku
(“grill, bake” in Japanese). This gives the following features when building C extensions:

  • basic dependency handling (soon auto-detection
    of header file dependencies through compiler-specific extensions)
  • reliable out-of-date detection through file content checksums
  • reliable parallel execution

I still think complex packages should use a real build system like waf or
scons, and in that regard, bento will remain completely agnostic (the distutils
build is still available as a configuration option).

Hooks

Any command may now be overridden, and some hooks have been added as well.
Here is a list of possible customizations through hooks:

  • adding custom commands (for example build_doc to build doc)
  • adding dynamically generated files in sdist
  • using waf as a build tool
  • adding autoconf-like tests in configure

This opens a lot of possibilities. Some examples can be found in the hook subdirectory.

Distcheck command

This command configures, builds, installs and optionally tests a package from the
tarball generated by sdist. This is very useful to test a release.

This command is still very much in its infancy, but quite useful already.

One file distribution

Since bento is still in the planning phase, its API is subject to significant
changes, and I obviously don’t care about backward compatibility at this stage.
Nevertheless, several people want to use it already, so I intend to provide
waf-like one-file distribution support. It would be a self-extracting file which looks
like a python script, and could be included in a project to avoid any extra dependency. This
would solve both distribution and compatibility issues until bento stabilizes.
There is a nice explanation of how this works on the waf-devel blog.

Bug fixes, python 2.4 support

I have started to fix the numerous but mostly trivial issues under
python 2.4. Bento 0.0.3 should be compatible with any python version from 2.4
to 2.7. Although python 3.x support should not be too difficult, it is rather
low priority. Let me know if you think otherwise.

Yaku, a simple python build system for toydist

[EDIT] Of course, just after having written this post, I came across two
interesting projects: mem and fbuild. That’s what I get for not
having Internet for weeks now … Both projects are based on memoization
instead of a dependency graph, and seem quite advanced feature-wise.
Unfortunately, fbuild requires python 3.1. Maybe mem would do. If so, consider
yaku dead[/EDIT]

While working on toydist, I first considered re-using distutils' ability to build
C code, with the idea that people would use waf/scons/etc… if they
had involved compilation needs. But distutils is so horrendous that I realized
that implementing something significantly better and simpler would be possible.
After a few hours of coding, I had something which could build extensions on a
few platforms: yaku (“bake” in Japanese).

Yaku's main design goal is simplicity: I don’t want the core code to be more than
~ 1000 LOC. Fortunately, this is more than enough to create something
significantly better than distutils. The current codebase is strongly inspired
by waf (and scons to some extent), and has the following features:

  • Task-based: a yaku task is like a rule in make, with a list of
    targets, dependencies, and a list of executable commands
  • Each task knows about its environment (e.g. flags for C compilation),
    and environment changes as well as dependency changes trigger a
    task (re)execution
  • Extension through callbacks: adding support for new source files
    (cython, swig, fortran, etc…) requires neither monkey patching nor
    inheritance. This is one of my biggest gripes with distutils
  • Primitive autoconf-like features to check for headers, libraries, etc…
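
The out-of-date logic boils down to comparing signatures. Roughly (this is a
simplification, not yaku's actual code), a task signature is a checksum of its
dependencies' contents and of its environment, and the task is re-executed
whenever the stored signature differs from the current one:

# Simplified view of content-based out-of-date detection: hash the content
# of the dependencies and the environment; any change in either changes the
# signature and triggers a re-execution of the task.
import hashlib

def task_signature(dependencies, env):
    m = hashlib.md5()
    for filename in dependencies:
        m.update(open(filename, "rb").read())
    for key in sorted(env):
        m.update("%s=%s" % (key, env[key]))
    return m.hexdigest()

def is_up_to_date(name, dependencies, env, previous_signatures):
    return previous_signatures.get(name) == task_signature(dependencies, env)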

Besides polishing the API, I intend to add the following features:

  • Parallel build
  • Automatically find header dependencies for C/C++ code (through
    scanning sources)

I want to emphasize that yaku is not meant as a replacement for a real build
tool. To keep it simple, yaku has no abstraction of the filesystem (the node
concept in scons and waf), which has a serious impact on its reliability and
power as a build tool. The graph of dependencies is also built in one shot, and
cannot be changed dynamically (so yaku won’t ever be able to detect dependencies
on generated code, for example foo.c which depends on foo.h generated from
foo.h.in).

Nevertheless, I believe yaku’s features are significant enough to warrant the
project. If the project takes off, it may be possible to integrate yaku within
the Distribute project, for example, whereas integrating waf or scons is out of
the question.

First public release of toydist

Toydist 0.0.2 has just been announced, and since this is the first public release since I announced it at Scipy India 2009, I thought it would be a good occasion to summarize the current status of toydist, and where I see it going in the next few months.

Toydist is an experimental alternative to distutils/setuptools, and aims at replacing the whole packaging infrastructure for python software, without requiring people to throw away their current infrastructure. The main philosophy of toydist is simplicity + extensibility:

  • simple: it should be simpler than distutils for simple packages, to the point where it is difficult to get it wrong. Although packaging is difficult, there are known good practices, and the tools should at least hint at those practices.
  • extensible: it should be possible to do things as complex as wanted in some parts of packaging, while still benefiting from toydist capabilities otherwise.

In other words, making toydist more pythonic, with OOWTDI, without getting in your way.

The present

The focus of this first release has been the design of a declarative package description, and implementing just enough features so that toydist can install itself. A simple command line interface, called toymaker, is provided as well. Installing a package with toymaker is very similar to the autotools way:

toymaker configure
toymaker build
toymaker install
toymaker sdist # Assemble a tarball

I have also implemented preliminary support to build eggs and windows installers (.exe-based), through the buildegg and buildwininst commands.

This first release also brings a few distribution-related features which have been big pain points in distutils/setuptools. First, the flexibility of the autotools installation scheme is available at the configure stage:

    toymaker configure --prefix=somepath --libdir=someotherpath --mandir=yetanotherpath

works as expected, and every customized path is available inside toydist from the beginning, instead of being available only at install time as in distutils.

Secondly, data files are correctly handled, instead of the distutils/setuptools mess. Toydist makes the difference between extra source files, which are not intended to be installed (say .rst source documentation), and data files, which are installed. For the latter, you can declare as many data files sections as you want, and each data files section potentially has a different installation path:

DataFiles: manpath
SourceDir: doc/
        TargetDir: $manpath
        Files: man1/foo.1, man3/foo.3

This syntax, inspired by automake, will cause doc/man1/foo.1 to be installed as $manpath/man1/foo.1 and doc/man3/foo.3 as $manpath/man3/foo.3. As the TargetDir field accepts non-expanded path variables, and because you can define new path variables, you can be as flexible as needed (a small sketch of the semantics follows).
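
Under the hood this is nothing more than path variable substitution followed by re-rooting the files from SourceDir to TargetDir; the following sketch (an illustration of the semantics, not bento's code) shows the mapping for the section above, with $manpath arbitrarily set to /usr/local/man:

# Illustration of the DataFiles semantics: each file is looked up relative
# to SourceDir and installed relative to the expanded TargetDir.
import os

def expand_data_files(files, source_dir, target_dir, variables):
    for name in files:
        source = os.path.join(source_dir, name)
        target = os.path.join(target_dir, name)
        for var, value in variables.items():
            target = target.replace("$" + var, value)
        yield source, target

for source, target in expand_data_files(["man1/foo.1", "man3/foo.3"],
                                        "doc/", "$manpath",
                                        {"manpath": "/usr/local/man"}):
    print source, "->", target
# doc/man1/foo.1 -> /usr/local/man/man1/foo.1
# doc/man3/foo.3 -> /usr/local/man/man3/foo.3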

For toydist to be successful at all, the transition from a setup.py-based build must be straightforward. For simple packages, this is as simple as running

toymaker convert

inside the same directory as setup.py. Packages such as Jinja2 and Sphinx can already be converted pretty accurately using this method. Packages which rely heavily on distutils extensions, like NumPy or Twisted, will most likely never be convertible this way.

As there is a lot of existing infrastructure based on distutils (and setuptools), with tools like virtualenv, pip or buildout, going from toydist to setup.py is also desirable. This can be done manually at the moment:

from distutils.core import setup
from toydist.core import PackageDescription

pkg = PackageDescription.from_file("toysetup.info")

DESCR = pkg.description
CLASSIFIERS = pkg.classifiers

METADATA = {
            'name': pkg.name,
            'version': pkg.version,
            'description': pkg.summary,
            'url': pkg.url,
            'author': pkg.author,
            'author_email': pkg.author_email,
            'license': pkg.license,
            'long_description': pkg.description,
            'platforms': 'any',
            'classifiers': pkg.classifiers,
}

PACKAGE_DATA = {
            'packages': pkg.packages,
}

if __name__ == '__main__':
        config = {}
        for d in (METADATA, PACKAGE_DATA):
                for k, v in d.items():
                        config[k] = v
        setup(**config)

Toydist's own setup.py is basically the above. The next version of toydist will have a distutils compatibility layer so that this will look as follows:

from toydist.distutils_compat import setup

if __name__ == '__main__':
        setup("toysetup.info")

Depending on the required compatibility level with distutils, one can write distutils commands to support some toydist features.

What’s coming next?

Easy interoperation with distutils, setuptools, etc…

For toydist 0.0.3, I intend to add support for a single-file distribution of toydist, à la waf. Integrating the full code of the packaging program in a source distribution is sometimes quite useful in my experience (that’s how autotools manages its cross-platformness, to some degree), and this would make distributing toydist-enabled packages easier.

Except on windows, it should be possible to make this single bootstrapping file no bigger than 100-200 kb, so space would not be an issue. Windows needs more, as building windows installers requires binaries which take a lot of space.

Extensibility through command hooks

My minimal threshold to consider toydist successful is the ability to build numpy and scipy. I am convinced that a packaging tool should leverage existing build tools for complex extension builds, be it scons, waf or even the venerable make. Toydist started as a prototype to make writing things like numscons easier, and it is still a major design principle I intend to follow throughout toydist development.

I am currently working on a hook API so that any toymaker command can be customized in an auxiliary python file. Toydist 0.0.3 will contain examples to build simple python C extensions with waf in a couple of lines of code. Building extensions with a real build system like waf brings automatic dependency handling, parallel builds and other features which are nearly impossible to implement correctly in distutils.

Replacement for pkg_resources

There are currently only two ways to retrieve data files from an installed python package: through __file__ and pkg_resources. __file__ has the advantage of simplicity, but it is inflexible. pkg_resources is too complicated, significantly slows down everything which uses it, and I have no use for its other features (plugins).

Using something akin to autoheader to generate data locations at install time should be easy to implement (a sketch follows the list below):

  • no more import slowdown (pkg_resources can easily increase import times by a factor of 2 to 3)
  • much more robust, without the possibility of breaking other packages (pkg_resources is a single point of failure for every package which uses it – I have had some experience where installing one setuptools package broke unrelated existing packages on my system).
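
Concretely, the idea is that the installer writes a tiny python module containing the final data locations, so that packages import a constant instead of querying pkg_resources at runtime. A hypothetical sketch (none of the names below are real toydist API):

# Hypothetical sketch of the autoheader-like approach: the install step
# writes a small module with the resolved paths, and package code simply
# imports it.
def write_config_module(target, data_dir, man_dir):
    # called once at install time, with the paths chosen by the installer
    f = open(target, "w")
    f.write("# Generated at install time - do not edit\n")
    f.write("DATADIR = %r\n" % data_dir)
    f.write("MANDIR = %r\n" % man_dir)
    f.close()

# Client code then does a plain, cheap import:
#     from mypackage._config import DATADIR
# instead of calling pkg_resources.resource_filename("mypackage", "data").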

Progress for numpy on windows 64 bits

The numpy 1.3.0 installer for windows 64 bits does not work very well. On some configurations, it does not even import without crashing. The crashes are most likely due to some bad interactions between the 64 bits mingw compilers and python (built with Visual Studio 2008). Although I know building numpy with the MS compiler works, I had no interest in it, because gfortran does not work with VS 2008: the fortran runtime from gfortran is incompatible with the VS 2008 C runtime (I get some scary linking errors).

So the situation is either building numpy with the MS compiler, but with no hope of getting scipy afterwards, or building a numpy with crashes which are very difficult to track down. Today, I realized that I might get somewhere if, somehow, I could use gfortran without using the gfortran runtime (e.g. libgfortran.a). I first tried calling a gfortran-built blas/lapack from a C program built with VS 2008, and after a couple of hours, I managed to get it working. Building numpy itself with full blas/lapack was a no-brainer then.

Now, there is the problem of scipy. Since scipy has some fortran code, which itself depends on the gfortran runtime when built with gfortran, I am trying to ‘fake’ a minimal gfortran runtime built with the C compiler. Since this mini runtime is built with the MS compiler and with the same C runtime as used by python, it should work, provided the runtime is ABI compatible with the gfortran one. As gfortran is open source, this may not be intractable :)

With this technique, I could go relatively far in a short time. Among the packages which build and pass most of the test suite:
 – scipy.fftpack
 – scipy.lapack
 – some scipy.sparse

Some packages like cluster or spatial are not ANSI C compatible, so they fail to build. This should not be too hard to fix. The main problem is scipy.special: the C code is horrible, and many hacks are needed to build it. The Fortran code needs quite a few functions from the fortran runtime, so this needs some work. But ~300 unit tests of scipy pass, so this is encouraging.

Python packaging: a few observations, cabal for a solution?

The python packaging situation has been causing quite some controversy for some time. The venerable distutils has been augmented with setuptools, zc.buildout, pip, yolk and what not. Some people praise those tools, others despise them; in particular, discussion about setuptools keeps coming up in the python community, and almost every time, the discussion goes nowhere, because what some people consider broken is a feature for others. It seems to me that the conclusion of those discussions is obvious: no tool can make everybody happy, so there has to be a system such that different tools can be used for different usages, without interfering with each other. The solution is to agree on common formats and data/metadata, so that people can build on them and communicate with each other.

You can find a lot of information on people who like setuptools/eggs, and their rationale for it. A good summary, with a web-developer POV, is given by Ian Bicking. I thought it would be useful to give another side of the story, that is people like me, whose needs are very different from those of the web-development crowd (the community which pushes eggs the most AFAICS).

Distutils limitation

Most of those tools are built on top of distutils, which is a first problem. Distutils is a giant mess, with tight, undocumented coupling between vastly different parts. Distutils takes care of configuration (rarely used, except for projects like numpy which need to probe for fairly low level system dependencies), build, installation and package building. I think that’s the fundamental issue of distutils: the installation and deployment parts do not need to know so much about each other, and should be split. The build part should be easily extensible, without too much magic or assumption, because different projects have different needs. The king here is of course make; but ruby for example has rake and rant, etc…

A second problem of distutils is its design, which is not so good. Distutils is based on commands (one command does the build of C extensions, one command does the installation, one command builds eggs in the case of setuptools, etc…). Commands are fundamentally imperative in distutils: do this, and then that. This is far from ideal, for several reasons:

You can’t pass options between commands

For example, if you want to change the compilation flags, you have to pass them to every concerned command.

Building requires handling dependencies

You declare some targets, which depend on some other targets, and the build tool builds a dependency graph to build everything in the right order. AFAIK, this is the ONLY correct way to build software. Distutils commands are inherently incapable of doing that. That’s one example where the web development crowd may be unaware of the need for this: Ian Bicking for example says that we do pretty well without it. Well, I know I don’t, and having a real dependency system for numpy/scipy would be wonderful. In the scientific area, large, compiled libraries won’t go away soon.

Fragile extension system

Maybe even worse: extending distutils means extending commands, which makes code reuse quite difficult, or causes weird issues. In particular, in numpy, we need to extend distutils fairly extensively (for fortran support, etc…), and setuptools extends distutils as well. Problem: we have to take setuptools' monkey patching into account. It quickly becomes impractical when more tools are involved (the combinations grow exponentially).

Typical problem: how to make setuptools and numpy.distutils extensions cohabit? Another example: paver is a recent, but interesting tool for doing common tasks related to build. Paver extends setuptools commands, which means it does not (it cannot) work with numpy.distutils extensions. The problem can be somewhat summarized by: I have class A in project A, class B(A) in project B and class C(A) in project C – how do I handle B and C in a later package? I am starting to think it can’t be done reliably using inheritance (the current way).
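
In distutils terms, the problem looks like this (schematic, with made-up class names; the real classes are distutils commands extended by setuptools and numpy.distutils):

# Schematic version of the cohabitation problem: A is a distutils command,
# B and C extend it independently, and a later package needs both.
from distutils.command.build_ext import build_ext as A

class B(A):                     # think setuptools' build_ext
    def run(self):
        # ... egg-related additions ...
        A.run(self)

class C(A):                     # think numpy.distutils' build_ext
    def run(self):
        # ... fortran support, etc ...
        A.run(self)

class D(B, C):                  # the later package has to hand-write the diamond
    def run(self):
        # there is no reliable, generic way to compose B.run and C.run here
        B.run(self)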

Extending commands is also particularly difficult for anything non trivial, due to various issues: lack of documentation, the related distutils code is horrible (attributes added on the fly for no good reason), and nothing is very well specified. You can’t retrieve where distutils builds a given file (library, source file, .o file, etc…), for example. You can’t get the name of the sdist target (you have to recreate the logic yourself, which is platform dependent). Etc…

Final problem: you can’t really call commands directly in setup.py. As a recent example encountered in numpy: I want to install a C library built through the libraries argument of setup. I can’t just add the file to the install command. Now, since we extend the install command in numpy.distutils, it should have been simple: just retrieve the name of the library, and add it to the list of files to install. But you can’t retrieve the name of the built library from the install command, and the install command does not know about the build_clib one (the one which builds C libs).

Packaging, dependency management

This is maybe the most controversial issue. By packaging, I mean putting everything which constitutes the software (configuration, .py, .so/.pyd, documentation, etc…) in a format which can be deployed on many machines in a consistent way. For web-developers, it seems this means something which can be put on a couple of machines, in a known state. For packages like numpy, this means being able to install on many different kinds of platforms, with different capabilities (different C runtimes, different math libraries, different optimized libraries, etc…). And other cases exist as well.

For some people, the answer is: use a sane OS with package management, and life goes on. Other people consider the setuptools way of doing things almost perfect: it does everything they want, and they don’t understand those pesky Debian developers who complain about multiple versions, etc… I will try to summarize the different approaches here, and the related issues.

The underlying problem is simple: any non trivial software depends on other things to work. Obviously, any python package needs a python interpreter. But most will also need other packages: for example, sphinx needs pygments and Jinja to work correctly. This becomes a problem because software evolves: unless you take great care about it, a new version will become incompatible with software written against an older one. For example, the package foo 1.1 decided to change the order of arguments in one function, so bar, which worked with foo 1.0, will not work with foo 1.1. There are basically three ways to deal with this problem:

  1. Forbid the situation. Foo 1.1 should not break software which works with foo 1.0. It is a bug, and foo should be fixed. That’s generally the preferred OS vendor approach.
  2. Bypass the problem by bundling foo in bar. The idea is to distribute a snapshot of most of your dependencies, in a known working situation. That’s the bundling approach.
  3. Install multiple versions: bar will require foo 1.1, but fubar still uses the old foo 1.0, so both foo 1.0 and foo 1.1 should be installed. That’s the “setuptools approach”.

Package management à la linux is the most robust approach in the long term for the OS. If foo has a bug, only one version needs to be repackaged. For system administrators, that’s often the best solution. It has some problems, too: generally, things cannot be installed without admin privileges, and packages are often fairly old. The latter point is not really a problem, but inherent to the approach: you can’t request both stability and bleeding edge. And obviously, it does not work for the other OSes. It also means you are at the mercy of your OS vendor.

Bundling is the easiest. The developer works with a known, working set of dependencies, and is not dependent on the OS vendor to get an up-to-date version.

3 sounds like the best solution, but in my opinion, it is the worst, at least in the current state of affairs as far as python is concerned, and when the software target is “average users”. The first problem is that many people seem to ignore the problems caused by multiple, side-by-side installations. Once you start saying “depends on foo 1.1 and later, but not higher than 1.3”, you start creating a management hell where many versions of every package are installed. The more it happens, the more likely you get into a situation like the following:

  • A depends on B >= 1.1
  • A depends on C which depends on B <= 1.0

Meaning a broken dependency. This situation has to be avoided as much as possible, and the best way to avoid it is to maintain compatibility such that B 1.2 can be used as a drop-in replacement for B 1.0. I think too often people request multiple versions as a poor man’s replacement for backward compatibility. I don’t think it is manageable. If you need a known version of a library which keeps changing, I think bundling is better – generally, if you want deployable software, you should really avoid depending on libraries which change too often; I think there is no way around it. If you don’t care about deploying on many machines (which seems to be the case for web deployment), then virtualenv and other similar tools are helpful; but they can’t seriously be suggested as a general deployment tool for the same audience as .deb/.rpm/.msi/.pkg. Deployment for testing is very different from deployment to many machines you can’t control at all (the users’ ones).

Now, having a few major versions of the most common libraries should be possible – after all, it is done for C libraries (with the same library installed under different versions with different sonames). But python, contrary to C loaders, does not support explicit version loading independently of the name. You can’t say something like “import foo with v >= 1.1”; you have to use a new name for the module – meaning changing every library user's source code. So you end up with hacks as used by setuptools/easy_install, which are very fragile (sys.path overriding, PYTHONPATH mess, easy_install.pth, etc…). At least for me, that’s a constant source of frustration, to the point that I effectively forbid setuptools to do anything on my machine: easy-install.pth is read-only, and I always install with --single-version-externally-managed.

With things like virtualenv and pip freeze, I don’t understand the need for multiple versions of the same libraries installed system-wide. I can see how python does not make it easy to support tools like virtualenv and pip directly (that is, without setuptools), but maybe people should focus on enabling virtualenv/zc.buildout usage without setuptools hacks (sys.path hacking, easy_install.pth), basically without setuptools, instead of pushing the multiple-library thing on everyone?

Standardize on data, not on tools

As mentioned previously, I don’t think python should standardize on one tool. The problem is just too vast. I would be very frustrated if setuptools became the tool of choice for python – but I understand that it solves issues for some people. Instead, I hope the python community will be able to standardize on metadata. Most packages have relatively simple needs, which could be covered with a set of static metadata.

It looks like such a design already exists: cabal, the packaging tool for haskell (Thanks to Fernando Perez for pointing me to cabal):

http://www.haskell.org/cabal/release/cabal-latest/doc/users-guide/

Cabal works with two files:

  • setup.hs -> the equivalent of our setup.py. Can use haskell, and as such can do pretty much anything
  • the .cabal file: static metadata.

For example:

Name: HUnit
Version: 1.1.1
Cabal-Version: >= 1.2
License: BSD3
License-File: LICENSE
Author: Dean Herington
Homepage: http://hunit.sourceforge.net/
Category: Testing
Synopsis: A unit testing framework for Haskell

Library
  Build-Depends: base
  Exposed-modules:
    Test.HUnit.Base, Test.HUnit.Lang, Test.HUnit.Terminal,
    Test.HUnit.Text, Test.HUnit
  Extensions: CPP

Even for a developer who knows nothing about haskell (like me :) ), this looks obvious. Basically, the classifiers and arguments of the distutils setup function go into the static file in haskell. By being a simple, readable text file, other tools can use it pretty easily. Of course, we would provide an API to get those data, but the common infrastructure would be the file format and metadata, not the API.
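
To give an idea of how cheap consuming such a file would be for third-party tools, here is a toy parser for flat "Field: value" lines (a real format would need continuation lines, sections and conditionals, but the point stands):

# Toy parser for flat "Field: value" metadata, to show how easily other
# tools could consume a static metadata file.
def parse_metadata(path):
    metadata = {}
    for line in open(path):
        line = line.strip()
        if not line or ":" not in line:
            continue
        field, value = line.split(":", 1)
        metadata[field.strip().lower()] = value.strip()
    return metadata

# e.g. parse_metadata("HUnit.cabal")["version"] would return "1.1.1"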

Note that the .cabal file allows for conditionals, albeit in a very structured form. I don’t know whether this should be followed or not: the point of a static file is that it is easily parsable, and having conditionals severely decreases the simplicity. OTOH, a simple way to add options is nice – and other almost-static metadata files for packaging, such as RPM .spec files, allow for this.

It could also be simple to convert many distutils packages to such a format; actually, I would be surprised if the majority of packages out there could not be automatically translated to such a mechanism.

Then, we could gradually deprecate some distutils commands (to end up with configure/build/install, with configure optional), such that different build tools could be plugged in for the build itself – distutils could be used for simple packages (the ones without compiled extensions), and other people could use other tools for more advanced needs (something like what I did with numscons, which bypasses distutils entirely for building C/C++/Fortran code).

Uninstall

Another often requested feature. I think it is a difficult feature to support reliably. Uninstall is not just about removing files: if you install a daemon, you should stop it, you may ask about configuration files, etc… It should at least support pre-install/post-install hooks and the corresponding uninstall equivalents. But the main problem for python is how to keep a list of installed packages/files. Since python packages can be installed in many locations, there should be one db (the db could, and most likely should, be a simple flat file) for each site-packages directory. I am not yet familiar with haskell module management, but it looks like that’s how haskell does it.

Conclusion

Different people have different needs. Any solution from one camp which prevents other solutions is very unhelpful and counterproductive. I don’t want to get my ubuntu deployment system screwed up by some toy dependency system – but I don’t want to prevent web developers from using their workflow. I can’t see a single system solving all this altogether – the problem has not been solved by anything I know of – it is too big a problem to hope for a general solution. Instead of piling complexity and hacks over complexity and hacks, we should standardize the commonalities (of which there are plenty), and make sure different systems can be used by different projects.

From ctypes to cython for C library wrapping

Ever since the cython presentation by R. Bradshaw at Scipy08, I have wanted to give cython a shot for wrapping existing C libraries. Up to now, my method of choice has been ctypes, because it is relatively simple, and can be done in python directly.

The problem with ctypes

I was not entirely satisfied with ctypes, in particular because it is sometimes difficult to control platform-dependent details, like type sizes and so on; ctypes has of course the notion of platform-independent types with a given size (int32_t, etc…), but some libraries define their own types, with the underlying implementation depending on the platform. Also, making sure the function declarations match the real ones is awkward; ctypes' author Thomas Heller developed a code generator to generate those declarations from headers, but they are dependent on the headers you are using; some libraries unfortunately have platform-dependent headers, so in theory you should generate the declarations at installation time, but this is awkward because the code generator uses gccxml, which is not widely available.

Here comes cython

One of the advantages of Cython for low-level C wrapping is that cython declarations need not be exact: in theory, you can’t pass an invalid pointer for example, because even if the cython declaration is wrong, the C compiler will complain about the C file generated by cython. Since the generated C file uses the actual header file, you are also pretty sure to avoid any mismatch between declarations and usage; at worst, the failure will happen at compile time.

Unfortunately, cython does not have a code generator like ctypes. For a long time, I have wanted to add sound output capabilities to audiolab, in particular for mac os X and ALSA (linux). Unfortunately, those APIs are fairly low level. For example, here is an extract of AudioHardware (the HAL of CoreAudio) usage:

AudioHardwareGetProperty(kAudioHardwarePropertyDefaultOutputDevice,
                         &count, (void *) &(audio_data.device))

AudioDeviceGetProperty(audio_data.device, 0, false,
                       kAudioDevicePropertyBufferSize,
                       &count, &buffer_size)

The Mac OS X convention is that variables starting with k are enums, defined like:

kAudioDevicePropertyDeviceName = 'name',
kAudioDevicePropertyDeviceNameCFString = kAudioObjectPropertyName,
kAudioDevicePropertyDeviceManufacturer = 'makr',
kAudioDevicePropertyDeviceManufacturerCFString = kAudioObjectPropertyManufacturer,
kAudioDevicePropertyRegisterBufferList = 'rbuf',
kAudioDevicePropertyBufferSize = 'bsiz',
kAudioDevicePropertyBufferSizeRange = 'bsz#',
kAudioDevicePropertyChannelName = 'chnm',
kAudioDevicePropertyChannelNameCFString = kAudioObjectPropertyElementName,
kAudioDevicePropertyChannelCategoryName = 'ccnm',
kAudioDevicePropertyChannelNominalLineLevelNameForID = 'cnlv'
...

These use the implicit char[4] to int conversion – which is not supported by cython AFAIK. With thousands of enums defined this way, any process which is not mostly automatic will be painful.
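
The conversion itself is trivial to reproduce in a code generator: a four-char code is just four ASCII bytes packed into a 32-bit big-endian integer.

# A CoreAudio four-char code is four ASCII bytes packed into a 32-bit
# big-endian integer, which is the plain int a generated cython file needs.
import struct

def fourcc(code):
    assert len(code) == 4
    return struct.unpack(">I", code)[0]

print hex(fourcc("bsiz"))    # the value of kAudioDevicePropertyBufferSize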

During the cython presentation at Scipy08, I asked whether there was any plan toward automatic generation of cython ‘headers’, and Robert fairly answered “please feel free to do so”. As announced a couple of days ago, I have taken the idea of the ctypes code generator and ‘ported’ it to cython; I have used it on scikits.audiolab to write a basic ALSA and CoreAudio player, and to convert my old ctypes-based wrapper for sndfile (a C library for audio file IO). This has worked really well: the optional typing in cython makes some parts of the wrapper easier to implement than in ctypes (I don’t need to check whether an int-like argument would overflow, for example). Kudos to the cython developers!

Usage on alsa

For completeness, I added a simple example of how to use the xml2cython codegen with ALSA, as used in scikits.audiolab. Hopefully, it should show how it can be used for other libraries. First, I parse the headers with gccxml; I use the ctypes codegenlib helper:

h2xml /usr/include/alsa/asoundlib.h -o asoundlib.xml

Now, I use the xml2cython script to parse the xml file and generate some .pxd files. By default, the script will pull out almost everything from the xml file, which will generate a big cython file. xml2cython has a couple of basic filters, though, so that I only pull out what I want; in the alsa case, I was mostly interested in a couple of functions, so I used the input file filter:

xml2cython.py -i input -o alsa.pxd alsa/asoundlib.h asoundlib.xml

This generates alsa.pxd with declarations of the functions whose names match the list in input, plus all the typedefs/structures used as arguments (they are recursively pulled out, so if one argument is a function pointer, the types in the function pointer should hopefully be pulled out as well). The exception is enums: every enum defined in the parsed tree from the xml is put out automatically in the cython file, because ‘anonymous’ enums are usually not part of function declarations in C (enums are not typed in C, so it is not so useful). This means every enum coming from standard header files would be included as well, which is ugly – as well as making cython compilation much slower. So I used a location filter as well, which tells xml2cython to pull out only enums which are defined in files matched by the filter:

xml2cython.py -l alsa -i input -o alsa.pxd alsa/asoundlib.h asoundlib.xml

This works since every alsa header on my system is of the form /usr/include/alsa/*.h. I used something very similar for the AudioHardware.h header in CoreAudio. The generated cython can be seen in the scikits trunk here. Doing this kind of thing by hand would have been particularly error-prone…