Python packaging: a few observations, cabal for a solution ?

The python packaging situation has been causing quite some controversy for some time. The venerable distutils has been augmented with setuptools, zc.buildout, pip, yolk and what not. Some people praise those tools, others despise them; in particular, discussions about setuptools keep coming up in the python community, and almost every time they go nowhere, because what some people consider broken is a feature for others. The conclusion of those discussions seems obvious to me: no tool can make everybody happy, so there has to be a system such that different tools can be used for different purposes without interfering with each other. The solution is to agree on common formats and data/metadata, so that people can build on them and communicate with each other.

You can find a lot of information on people who like setuptools/eggs, and their rationale for it. A good summary, from a web developer’s POV, is given by Ian Bicking. I thought it would be useful to give the other side of the story, that of people like me, whose needs are very different from those of the web-development crowd (the community which pushes eggs the most, AFAICS).

Distutils limitations

Most of those tools are built on top of distutils, which is a first problem. Distutils is a giant mess, with tight, undocumented coupling between vastly different parts. Distutils takes care of configuration (rarely used, except for projects like numpy which need to probe for fairly low-level system dependencies), build, installation and package building. I think that’s the fundamental issue with distutils: the installation and deployment parts do not need to know so much about each other, and should be split. The build part should be easily extensible, without too much magic or assumption, because different projects have different needs. The king here is of course make; Ruby, for example, has rake and rant, etc…

A second problem with distutils is its design, which is not very good. Distutils is based on commands (one command builds C extensions, one command does the installation, one command builds eggs in the case of setuptools, etc…). Commands in distutils are fundamentally imperative: do this, and then that. This is far from ideal, for several reasons:

You can’t pass options between commands

For example, if you want to change the compilation flags, you have to pass them to every command concerned.
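
A minimal sketch of why this is so (the command names and the option are hypothetical): each distutils command declares and parses its own options, and nothing automatically propagates a common setting such as compiler flags from one command to another, so it has to be redeclared in every command and passed again on the command line.

    # Hypothetical setup.py: two commands that both need the same compiler
    # flags. Each one declares, initializes and finalizes the option on its
    # own; nothing propagates the value from one command to the other.
    from distutils.core import Command, setup

    class build_foo(Command):
        description = "hypothetical build step"
        user_options = [("compile-flags=", None, "extra compiler flags")]

        def initialize_options(self):
            self.compile_flags = None

        def finalize_options(self):
            pass

        def run(self):
            print("building with flags:", self.compile_flags)

    class install_foo(Command):
        description = "hypothetical install step"
        # the very same option has to be redeclared, and repeated on the
        # command line:
        #   python setup.py build_foo --compile-flags=-O3 \
        #                   install_foo --compile-flags=-O3
        user_options = [("compile-flags=", None, "extra compiler flags")]

        def initialize_options(self):
            self.compile_flags = None

        def finalize_options(self):
            pass

        def run(self):
            print("installing, flags were:", self.compile_flags)

    setup(name="example", version="0.1",
          cmdclass={"build_foo": build_foo, "install_foo": install_foo})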

Building requires handling dependencies

You declare targets, which depend on other targets, and the build tool builds a dependency graph so that everything is built in the right order. AFAIK, this is the ONLY correct way to build software. Distutils commands are inherently incapable of doing that. That’s one example where the web-development crowd may be unaware of the need: Ian Bicking, for example, says that we do pretty well without it. Well, I know I don’t, and having a real dependency system for numpy/scipy would be wonderful. In the scientific area, large, compiled libraries won’t go away soon.
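
To make the contrast concrete, here is a minimal sketch of the declarative model used by make-like tools (the target names are made up, and the standard library topological sort is used only for brevity): targets declare what they depend on, and the build order falls out of the graph instead of being hard-coded as a sequence of commands.

    # Hypothetical targets and their dependencies; the build order is derived
    # from the graph rather than spelled out imperatively.
    from graphlib import TopologicalSorter  # stdlib since Python 3.9

    targets = {
        "extension.so": ["wrapper.c", "corelib.a"],
        "corelib.a": ["core1.f", "core2.f"],
        "wrapper.c": [],
        "core1.f": [],
        "core2.f": [],
    }

    def build(target):
        print("building", target)  # stand-in for the real build action

    # dependencies are guaranteed to come before their dependents
    for node in TopologicalSorter(targets).static_order():
        build(node)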

Fragile extension system

Maybe even worse: extending distutils means extending commands, which makes code reuse quite difficult, or causes weird issues. In particular, in numpy, we need to extend distutils fairly extensively (for Fortran support, etc…), and setuptools extends distutils as well. Problem: we have to take setuptools’ monkey-patching into account. It quickly becomes impractical when more tools are involved (the number of combinations grows exponentially).

Typical problem: how to make setuptools and numpy.distutils extensions cohabit ? Another example: paver is a recent but interesting tool for common build-related tasks. Paver extends setuptools commands, which means it does not (it cannot) work with numpy.distutils extensions. The problem can be summarized as: I have class A in project A, class B(A) in project B and class C(A) in project C; how do I handle B and C in a later package ? I am starting to think it can’t be done reliably using inheritance (the current way).
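
A stripped-down sketch of that situation (all class names are hypothetical, nothing here is actual distutils or setuptools code): two projects independently subclass the same command, and anyone who wants both sets of features has to know about both subclasses and write the combination themselves, for every pair of tools involved.

    # The base "command" and two independent extensions of it.
    class build_ext:
        def run(self):
            print("plain build_ext")

    class setuptools_build_ext(build_ext):   # one project's extension
        def run(self):
            print("setuptools additions")
            super().run()

    class numpy_build_ext(build_ext):        # another project's extension
        def run(self):
            print("fortran support")
            super().run()

    # To get both behaviours, a third party must create (and maintain) the
    # combination itself:
    class combined_build_ext(setuptools_build_ext, numpy_build_ext):
        pass

    combined_build_ext().run()
    # prints: setuptools additions / fortran support / plain build_ext

This only works at all because both subclasses cooperate through super(); real distutils commands usually call the base class directly, so one extension silently drops the other’s behaviour, and monkey-patching makes it worse.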

Extending commands is also particularly difficult for anything non-trivial, due to various issues: lack of documentation, horrible related distutils code (attributes added on the fly for no good reason), and nothing being very well specified. You can’t retrieve where distutils builds a given file (library, source file, .o file, etc…), for example. You can’t get the name of the sdist target (you have to recreate the logic yourself, which is platform-dependent). Etc…
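
As an illustration of the sdist point, this is the kind of logic a setup.py ends up re-creating by hand (a sketch only: it assumes the default formats, gztar on Unix and zip on Windows, and ignores any --formats override):

    import os

    def guess_sdist_filename(name, version):
        # nothing in the distutils API hands this path back to you
        ext = ".zip" if os.name == "nt" else ".tar.gz"
        return os.path.join("dist", "%s-%s%s" % (name, version, ext))

    print(guess_sdist_filename("numpy", "1.3.0"))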

Final problem: you can’t really call commands directly in setup.py. A recent example encountered in numpy: I want to install a C library built through the libraries argument of setup. I can’t just add the file to the install command. Now, since we extend the install command in numpy.distutils, it should have been simple: just retrieve the name of the built library, and add it to the list of files to install. But you can’t retrieve the name of the built library from the install command, and the install command does not know about the build_clib one (the one which builds C libraries).
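
For illustration, this is roughly what one would like to write in a custom install command. get_finalized_command() and get_library_names() are real distutils APIs, but the get_built_library_path() query is hypothetical; that query is precisely the piece distutils does not offer.

    # Hedged sketch only: plain distutils is used here instead of
    # numpy.distutils to keep the example self-contained.
    from distutils.command.install import install as _install

    class install(_install):
        def run(self):
            _install.run(self)
            build_clib = self.get_finalized_command("build_clib")
            for libname in build_clib.get_library_names() or []:
                # hypothetical: ask where the built static library ended up
                built = build_clib.get_built_library_path(libname)
                self.copy_file(built, self.install_lib)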

Packaging, dependency management

This is maybe the most controversial issue. By packaging, I mean putting everything which constitutes the software (configuration, .py, .so/.pyd, documentation, etc…) into a format which can be deployed on many machines in a consistent way. For web developers, it seems this means something which can be put on a couple of machines, in a known state. For packages like numpy, this means being able to install on many different kinds of platforms, with different capabilities (different C runtimes, different math libraries, different optimized libraries, etc…). And other cases exist as well.

For some people, the answer is: use a sane OS with package management, and life goes on. Other people consider the setuptools way of doing things almost perfect; it does everything they want, and they don’t understand those pesky Debian developers who complain about multiple versions, etc… I will try to summarize the different approaches here, and the related issues.

The underlying problem is simple: any non-trivial software depends on other things to work. Obviously, any python package needs a python interpreter. But most will also need other packages: for example, sphinx needs pygments and Jinja to work correctly. This becomes a problem because software evolves: unless you take great care about it, a new version will become incompatible with an older one. For example, foo 1.1 decided to change the order of arguments in one function, so bar, which worked with foo 1.0, will not work with foo 1.1. There are basically three ways to deal with this problem:

  1. Forbid the situation. Foo 1.1 should not break software which works with foo 1.0. It is a bug, and foo should be fixed. That’s generally the preferred OS vendor approach.
  2. Bypass the problem by bundling foo in bar. The idea is to distribute a snapshot of most of your dependencies, in a known working state. That’s the bundling approach.
  3. Install multiple versions: bar will require foo 1.1, but fubar still uses the old foo 1.0, so both foo 1.0 and foo 1.1 should be installed. That’s the “setuptools approach”.

Package management à la Linux is the most robust approach in the long term for the OS. If foo has a bug, only one version needs to be repackaged. For system administrators, that’s often the best solution. It has some problems, too: generally, things cannot be installed without admin privileges, and packages are often fairly old. The latter point is not really a problem, but inherent to the approach: you can’t request both stability and bleeding edge. And obviously, it does not work on the other OSes. It also means you are at the mercy of your OS vendor.

Bundling is the easiest. The developer works with a known, working set of dependencies, and does not depend on the OS vendor to get an up-to-date version.

Option 3 sounds like the best solution, but in my opinion it is the worst, at least in the current state of affairs as far as python is concerned, and when the software targets “average users”. The first problem is that many people seem to ignore the problems caused by multiple, side-by-side installations. Once you start saying “depends on foo 1.1 and later, but not higher than 1.3”, you start creating a management hell, where many versions of every package are installed. The more it happens, the more likely you are to get into a situation like the following:

  • A depends on B >= 1.1
  • A depends on C which depends on B <= 1.0

Meaning a broken dependency. This situation has to be avoided as much as possible, and the best way to avoid it is to maintain compatibility, such that B 1.2 can be used as a drop-in replacement for B 1.0. I think too often people request multiple versions as a poor man’s replacement for backward compatibility. I don’t think it is manageable. If you need a known version of a library which keeps changing, I think bundling is better; more generally, if you want deployable software, you should really avoid depending on libraries which change too often, and I don’t think there is a way around it. If you don’t care about deploying on many machines (which seems to be the case for web deployment), then virtualenv and other similar tools are helpful; but they can’t seriously be suggested as a general deployment tool for the same audience as .deb/.rpm/.msi/.pkg. Deployment for testing is very different from deployment to many machines you can’t control at all (the users’ ones).

Now, having a few major versions of the most common libraries should be possible; after all, that is what is done for C libraries (the same library installed under different versions, with different sonames). But python, contrary to C loaders, does not support loading an explicit version of a module independently of its name. You can’t say something like “import foo with v >= 1.1”; you have to use a new name for the module, which means changing the source code of every user of the library. So you end up with the hacks used by setuptools/easy_install, which are very fragile (sys.path overriding, PYTHONPATH mess, easy_install.pth, etc…). At least for me, that’s a constant source of frustration, to the point that I effectively forbid setuptools from doing anything on my machine: easy-install.pth is read-only, and I always install with --single-version-externally-managed.
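
For reference, the closest thing today is the runtime requirement mechanism of pkg_resources (the package name below is hypothetical), which works precisely through the sys.path manipulation criticized above:

    # The import statement itself has no version syntax; pkg_resources
    # "activates" a matching distribution by editing sys.path before the
    # import happens (and fails loudly if none is installed).
    import pkg_resources
    pkg_resources.require("foo>=1.1")   # hypothetical package name
    import foo                          # resolves to whatever was activated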

With things like virtualenv and pip freeze, I don’t understand the need for multiple versions of the same library installed system-wide. I can see how python does not make it easy to support tools like virtualenv and pip directly (that is, without setuptools), but maybe people should focus on enabling virtualenv/zc.buildout usage without the setuptools hacks (sys.path hacking, easy_install.pth), basically without setuptools, instead of pushing the multiple-versions scheme on everyone ?

Standardize on data, not on tools

As mentioned previously, I don’t think python should standardize on one tool. The problem is just too vast. I would be very frustrated if setuptools became the tool of choice for python, but I understand that it solves issues for some people. Instead, I hope the python community will be able to standardize on metadata. Most packages have relatively simple needs, which could be covered by a set of static metadata.

It looks like such a design already exists: cabal, the packaging tool for Haskell (thanks to Fernando Perez for pointing me to cabal):

http://www.haskell.org/cabal/release/cabal-latest/doc/users-guide/

Cabal works with two files:

  • setup.hs: the equivalent of our setup.py. It is written in Haskell, and as such can do pretty much anything.
  • the .cabal file: static metadata.

For example:

    Name: HUnit
    Version: 1.1.1
    Cabal-Version: >= 1.2
    License: BSD3
    License-File: LICENSE
    Author: Dean Herington
    Homepage: http://hunit.sourceforge.net/
    Category: Testing
    Synopsis: A unit testing framework for Haskell

    Library
      Build-Depends: base
      Exposed-modules: Test.HUnit.Base, Test.HUnit.Lang, Test.HUnit.Terminal,
                       Test.HUnit.Text, Test.HUnit
      Extensions: CPP

Even for a developer who knows nothing about Haskell (like me :) ), this looks obvious. Basically, the classifiers and arguments of the distutils setup function go into the static file on the Haskell side. By being a simple, readable text file, other tools can use it pretty easily. Of course, we would provide an API to get those data, but the common infrastructure is the file format and metadata, not the API.
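
As a rough illustration of that point (a deliberately naive sketch, not a proposed format or API): once the metadata is a plain text file, a few lines of code in any language can read it without executing setup.py. The reader below only handles the flat top-level fields of the example above and ignores the nested Library section; the file name is a hypothetical local copy.

    def read_metadata(path):
        """Naive reader for the flat "Key: value" fields of a cabal-like file."""
        meta = {}
        for line in open(path):
            if ":" in line and not line.startswith((" ", "\t")):
                key, _, value = line.partition(":")
                meta[key.strip().lower()] = value.strip()
        return meta

    meta = read_metadata("hunit.cabal")   # hypothetical local copy
    print(meta["name"], meta["version"], meta["synopsis"])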

Note that the .cabal format allows for conditionals, albeit in a very structured form. I don’t know whether this should be followed or not: the point of a static file is that it is easily parsable, and having conditionals severely decreases the simplicity. OTOH, a simple way to add options is nice, and other almost-static metadata files for packaging, such as RPM .spec files, allow for this.
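
For reference, this is roughly what those structured conditionals look like in a .cabal file (adapted from the cabal users’ guide, not taken from the HUnit file above):

    Library
      Build-Depends: base
      if os(windows)
        Build-Depends: Win32
      else
        Build-Depends: unix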

It should also be simple to convert many distutils packages to such a format; actually, I would be surprised if the majority of packages out there could not be translated automatically.

Then, we could gradually deprecate some distutils commands (to end up with configure/build/install, with configure optional), such that different build tools could be plugged in for the build itself: distutils could be used for simple packages (the ones without compiled extensions), and other people could use other tools for more advanced needs (something like what I did with numscons, which bypasses distutils entirely for building C/C++/Fortran code).

Uninstall

Another often-requested feature. I think it is a difficult feature to support reliably. Uninstall is not just about removing files: if you install a daemon, you should stop it, you may ask about configuration files, etc… It should at least support pre-install/post-install hooks and the corresponding uninstall equivalents. But the main problem for python is how to keep a list of installed packages/files. Since python packages can be installed in many locations, there should be one db (which could, and most likely should, be a simple flat file) for each site-packages directory. I am not yet familiar with Haskell module management, but it looks like that’s how Haskell does it.
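
A minimal sketch of the flat-file record idea (the directory layout, file names and functions are made up for illustration; this is not an existing distutils feature): the installer writes the list of files it created next to the site-packages directory it targeted, and uninstall simply replays that list, leaving hooks and configuration files aside.

    import os
    import site

    # Hypothetical layout: one record file per installed package, stored in a
    # small db directory inside the site-packages that received the package.
    def _db_dir():
        return os.path.join(site.getsitepackages()[0], ".install-db")

    def record_install(name, version, installed_files):
        os.makedirs(_db_dir(), exist_ok=True)
        record = os.path.join(_db_dir(), "%s-%s.files" % (name, version))
        with open(record, "w") as db:
            db.write("\n".join(installed_files))

    def uninstall(name, version):
        record = os.path.join(_db_dir(), "%s-%s.files" % (name, version))
        with open(record) as db:
            for path in db.read().splitlines():
                if os.path.exists(path):   # no hooks, no rollback: a sketch only
                    os.remove(path)
        os.remove(record)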

Conclusion

Different people have different needs. Any solution from one camp which prevents other solutions is very unhelpful and counterproductive. I don’t want to get my Ubuntu deployment system screwed up by some toy dependency system, but I don’t want to prevent web developers from using their workflow either. I can’t see a single system solving all of this: the problem has not been solved by anything I know of, and it is too big a problem to hope for a general solution. Instead of piling complexity and hacks on top of complexity and hacks, we should standardize the commonalities (of which there are plenty), and make sure different systems can be used by different projects.

18 thoughts on “Python packaging: a few observations, cabal for a solution ?”

  1. “I think too often people request multiple versions as a poor man’s replacement for backward compatibility. I don’t think it is manageable.” – yes! Well put.

    I also agree that whatever the solution is, it should not be a single tool. The solution should enable tools to work together without conflict. And, yes, this can be solved with metadata.

    But the metadata problem is sort of already solved. Similar to the .cabal file for Haskell there is a standard PKG-INFO file in Python :
    http://www.python.org/dev/peps/pep-0241/
    And setuptools adds some extra keywords to setup.py to store the metadata necessary to express dependencies: http://peak.telecommunity.com/DevCenter/setuptools#new-and-changed-setup-keywords

    But it would be nice to see a new metadata version for PKG-INFO that exposes these new key/value pairs. The current egg format provides them in files like requires.txt, but they should be in a single place like PKG-INFO.

    • But the metadata problem is sort of already solved. Similar to the .cabal file for Haskell there is a standard PKG-INFO file in Python :

      Yes, the idea is there, but for this kind of thing, the devil is in the details. Note how .cabal files are much richer than PKG-INFO (as I read it from PEP 241, at least). The richer the format is, the less packages need to do complicated and hairy stuff in setup.py. I am convinced that the vast majority of python packages do not need anything but a trivial setup.py + static metadata. I think we should really look at .cabal in detail, for things like “entry points” (at least for simple scripts), etc… I think a good test is how many packages’ setup.py can be converted to static metadata only.

      The other question is whether to allow for conditionals or not: after having thought more about it, I realized that’s something I miss in Debian packaging compared to RPM .spec files. Being able to condition on the python interpreter version, for example, would be useful. Since I am clueless about parsing technology, I don’t know the complexity brought by conditionals. Is it significant ? Does it prevent tools (maybe outside python) from parsing it easily ?

  2. “You can’t pass options between commands”

    I’d like to change that *asap*. I have started, in fact: register and upload now share the user’s password.

    Let’s think about something to change this.

    Cheers

  3. The other question is whether to allow for conditionals or not: after having thought more about it, I realized that’s something I miss in Debian packaging compared to RPM .spec files.

    you mean something like “python (>= 2.5) | python-celementtree | python-elementtree” or the fact that different applications need different library versions?

    (like package A depends on foo >=0.4, <=0.5 – that’s possible in Debian too but note that two different versions of the same Python library cannot be installed at the same time)

    • you mean something like “python (>= 2.5) | python-celementtree | python-elementtree” or the fact that different applications need different library versions?

      No, I mean something more general, not just conditions on version numbers. See here in cabal.

      It is useful in .spec files, for example to condition on distributions (FC 9 vs RHEL 5 vs …), like here. For python, it is even more useful: you would condition on Windows vs Mac, etc… (of course, it is better to be as general as possible, but sometimes it is just not possible).

      But again, I would guess this complicates the matter quite a bit. I see the advantages, but I don’t have a clear idea on the disadvantages.

  4. cournape: you are a system administrator, aren’t you? Python developers say “install as many module versions as your hard drive can keep and use virtualenv if your administrator is insane” ;-)

    • you are a system administrator, aren’t you?

      Actually, no; I am mostly a python developer these days, for scientific computing. Eggs and setuptools have made my life miserable when trying to distribute things to colleagues. It is too fragile, and if things don’t work right away after installation, it is broken. If uninstalling and reinstalling does not work, it is, again, broken.

      install as many module versions as your hard drive …

      I think this is fine, as long as it is not installed system-wide and is done by developers/administrators for developers/administrators. That’s what I tried to explain: there are many situations, and the problem is that what works in some cases doesn’t in others. Virtualenv, for example, is nice for bootstrapping, testing things, developing things. But it is way too fragile for distributing things to users who do not care (or even know) about things like PYTHONPATH. The possibility to control versions is nice for developers, and often needed; but it does not work for system maintainability. I have yet to see eggs used reliably as an end-to-end channel (from developers to end users) without a lot of trouble; the problems are not with the egg format itself (which is fine) but with all the hackery around it.

  5. […] Newsflash: chances are that higher Python officers will succeed in confusing the situation even more. PEP 382 is on the way. Good luck with explaining why Python needs two notions of packages, whereas it is difficult enough to justify even one. It’s not a coincidence, though, that this is placed in the context of setuptools. If Python’s import system is a tragedy, then distutils and the layers above it are the accompanying farce. […]

  6. Excellent comments, David. Where do you think we are, 6 months later ?
    pip, PyPM … entropy ?

    I’d like to see package trees made more *visible*:
    – list all directories and single files from “install X”
    – look at a tree of required packages, their requires … before installing: a tiny database as dbs go, displaying subtrees with “X>= 1.2” easy

    • AFAICS, the divide is even greater between people who see virtualenv-like tools (put everything under one directory, each application bundling everything) as the deployment model, that is people who prefer convenience over repeatability and robustness, and people who are more into binary installers (be it .msi, .exe or deb/rpm). The current direction of distribute is the wrong one IMO, but to be fair, the “web-dev crowd” are the ones doing the work, so it is logical that they care more about their use case. I just wish they were more concerned with the problems they are bringing to the whole python ecosystem.

      I recently came across a small post about the ruby gems situation (http://www.madstop.com/ruby/ruby_has_a_distribution_problem.html) which summarizes the problem quite well (you can replace rails community with setuptools/virtualenv community): “This is basically anathema to how I think about management, yet it’s the standard, recommended practice in the Rails community, because it makes it easy to “guarantee” behaviour in a given environment. Of course, your guarantee is only good if no one ever tries to run the software anywhere except an exact duplicate of where you run it.”

      This sentence nails it IMHO.

      • I think you overstate the difference between your perspective and the webdev perspective. We have very similar goals. The tools we work with have somewhat different challenges: you deal with a lot of compiled code and system libraries, and we don’t. It’s actually *really* hard to get adoption of compiled projects (as I’ve found with lxml).

        But virtualenv, pip, and buildout are all based on the same basic desires as you: get something running reliably on a variety of systems, and make it consistent over different machines and over time. And these tools aren’t just about hacking something together that works for one person, they are very much about serving *lots* of people… because while we’re web developers, we’re also developing reusable open source libraries, and it’s probably that second role that drives the direction more than the first.

        I read the Rails post, and it’s *not* how we’re going about things. I do prefer isolation and independence at the application level. I do give pushback to people who “require” a particular installation scheme for principled rather than practical reasons. But these are robust and repeatable systems that don’t punt on the issues he’s talking about.

        • Compiled code is indeed a challenge, but as you mentioned, this is relatively specific to our usage. I have thought about the problem quite a bit, but I don’t think that’s a very important topic for this discussion.

          When you say that virtualenv and pip serve the same desires as me, you would have to give me more details, because I don’t understand how that’s possible, the goals being so different. What I demand from a deployment tool is a very controlled way to specify each step of the installation, and something which integrates well with the target platform. Ideally, it should be resilient against failure, with a rollback mode, and should have a query system to deal with different kinds of dependencies. How can virtualenv/pip/buildout help me produce something I can redistribute to many people ? I myself use virtualenv for development quite a bit, and I think it’s a useful tool in that context. But for deployment, I don’t see it. I may just be using the tool backward. Whenever I saw a description of how people use it, though, it was antithetical to how I think about deployment (put everything in one self-contained directory), which is why I referred to the Rails post.

  7. How do you like SCons for building? http://www.scons.org/ It is both declarative and Python. Or I should say, the emphasis is on putting the declarative parts (although Python) in the makefile-like files, and specialized build and install procedural knowledge in SCons “Builder” classes.

    Do you really need to focus on Python packaging specifically? Why not wedge into some popular compiled-code distribution system (maybe rewriting one in Python) and make it more Python-friendly instead?

    It would seem easier to require only a Python interpreter and internet access, than saying you have to have a certain package installer, make and gcc…

  8. I wonder, after spending a few days in python packaging and deployment hell, why these most important things cannot be solved once and for all. We have been using computers for more than two generations now, we update software every day, and this is still a problem. What are all these computer scientists doing? If evolving software is such a problem, why isn’t it fixed in the language itself? Why do “different versions” that cause problems exist? Why fight with infrastructure if the core problem is people changing the names of classes, functions or variables? Why isn’t the abstraction level of computer languages abstract enough to get away from these kinds of ridiculous problems? In the end it’s only 1s and 0s, so the processor is not the problem; it is the abstraction. Why do I have to think about “refactoring” software?
    If we look at what really causes the problem, we have to think about how to improve compilers to make software changes easier to manage for humans, not how to trick around with infrastructure tools.
    Please, Mister Scientist, start inventing! We need a paradigm shift, because we need brilliant developers not to waste their time with dependency and deployment bullshit but to solve real problems, and maybe we have no time to waste.

  9. I just stumbled upon this post after being in distribution hell for a long time. Building recursive extension modules that depend on each other is sheer pain with distutils. Your argumentation seems sound to me, and your approach very reasonable for scientists like myself. I will follow your work closely and help out when I can.

    In the scientific computing community, we are still pretty far behind something like matlab which has a best practice for distributing and deploying “libraries”. Hopefully this can be changed a bit in the next few months.

    • To build recursive extension modules, you should look at waf; it has really good support for this through use_lib.

      Bento (ex-toydist) now has its own build mini-framework, so that at least basic dependencies (on headers and compilation flags) will be taken care of. But dealing with dependencies between generated libraries is much harder, and that’s exactly why an improved distutils should be usable with an arbitrary build system.
