Bento at PyCon 2011 and what’s coming in bento 0.0.5

I could not spend much time (if any) on bento during the last few weeks of 2010, but fortunately I got back some time to work on it this month. It is a good time to describe what I hope will happen in bento over the next few months.

Bento poster @ PyCon 2011

First, my bento talk proposal was rejected for PyCon 2011, so it will only be presented as a poster. It is a bit unfortunate, because I think it would have worked much better as a talk than as a poster. Nevertheless, I hope it will help bring awareness of bento outside the scipy community, and give me a better understanding of people’s packaging needs (a poster should actually work better for the latter).

Bento 0.0.5

Bento 0.0.5 should be coming soon (mid-February). Unlike the 0.0.4 release, this version won’t bring major user-visible features, but it includes a lot of internal redesign to make bento easier to use and extend:

Automatic command dependencies

You no longer need to run each command separately: if you run “bentomaker install”, it will automatically run configure and build on its own, in the right order. What’s interesting is how the dependencies are specified. In distutils, subcommand order is hardcoded inside the parent command, which makes it virtually impossible to extend. Bento does not suffer from this major deficiency:

  • Dependencies are specified outside the command classes: you just declare which command must run before/after which
  • Command order is then computed at run time using a simple topological sort (see the sketch below). Although the API is not there yet, this will enable arbitrary insertion of new commands between existing ones without monkey-patching anything
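
To make the idea concrete, here is a minimal sketch of ordering commands with a depth-first topological sort over externally declared dependencies. The names (DEPENDENCIES, command_order) are illustrative, not bento’s actual API:

# dependencies live outside the command classes: "build" needs "configure", etc.
DEPENDENCIES = {"build": ["configure"], "install": ["build"]}

def command_order(target):
    """Return the list of commands to run, dependencies first."""
    order, seen = [], set()
    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dep in DEPENDENCIES.get(name, ()):
            visit(dep)
        order.append(name)
    visit(target)
    return order

print(command_order("install"))  # ['configure', 'build', 'install']

Because the graph is data rather than hardcoded call order, a new command only needs to declare its before/after constraints to be slotted in.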

Virtualenv support

If a bento package is installed from inside a virtualenv, it will be installed into that virtualenv by default:

virtualenv .env
source .env/bin/activate
bentomaker install # this will install the package inside the virtualenv

Of course, if the install path has been customized (through prefix/eprefix), those settings take precedence over the virtualenv.
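
The detection itself boils down to something like the following sketch (an illustration of the mechanism, not bento’s actual code):

import sys

def in_virtualenv():
    # virtualenv's python sets sys.real_prefix to the original interpreter's prefix
    return hasattr(sys, "real_prefix")

def default_install_prefix():
    # inside a virtualenv, default to its prefix; otherwise the usual unix default
    return sys.prefix if in_virtualenv() else "/usr/local"

print(default_install_prefix())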

List files to be installed

The install command can optionally print the list of files to be installed, together with their actual installation paths. This can be used to check where things will go. By design, this list is exactly what bento would install, so it is much harder to hit weird corner cases where the list and what actually gets installed differ.

First steps toward uninstall

An initial “transaction-based” install is available: in this mode, a transaction log is generated, and can be used to roll back an install. For example, if the install fails in the middle, the already-installed files are removed so the system is left in a clean state. This is a first step toward uninstall support.
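
The underlying idea can be sketched in a few lines; the class and file names below are illustrative, not bento’s actual implementation:

import os
import shutil

class InstallTransaction:
    """Record every installed file so a failed install can be rolled back."""
    def __init__(self, logfile):
        self.logfile = logfile
        self.installed = []

    def install_file(self, src, dst):
        shutil.copy2(src, dst)
        self.installed.append(dst)
        with open(self.logfile, "a") as f:
            f.write(dst + "\n")  # the on-disk log survives a crash of the process

    def rollback(self):
        # remove files in reverse order, leaving the system as it was
        for path in reversed(self.installed):
            if os.path.exists(path):
                os.remove(path)

An installer would wrap its copy loop in try/except and call rollback() on failure; the log file means the same cleanup is possible even after a crash, and it is exactly the information an uninstall command needs.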

Refactoring to make it easier to use waf inside bento

Bento’s internals have been improved to enable easier customization of the build tool. I have a proof of concept where bento is customized to use waf to build extensions. The whole point, of course, is to be able to do so without changing bento’s own code. The same scheme can be used to build extensions with distutils (for compatibility reasons, to help complex packages move to bento one step at a time).
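
To illustrate the kind of decoupling meant here, the sketch below shows a pluggable build backend; all names are hypothetical, not bento’s actual hook API:

class WafBackend:
    def build_extension(self, name, sources):
        # a real backend would drive waf's build context here
        print("building %s with waf" % name)

class DistutilsBackend:
    def build_extension(self, name, sources):
        # a real backend would reuse distutils' compiler classes here
        print("building %s with distutils" % name)

# bento's core only sees the common interface; a package selects a backend
# from its own hook file, without patching bento itself
BACKENDS = {"waf": WafBackend(), "distutils": DistutilsBackend()}
BACKENDS["waf"].build_extension("foo._speedups", ["foo/_speedups.c"])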

Bentoshop: a framework to manage installed packages

I am hoping to have at least a proof of concept for a package manager built around bento ready for PyCon 2011. As already stated on this blog, there are a few non-negotiable features the design must follow:

  1. Robust by design: anything that can be installed can be removed, and synchronisation issues between metadata and installed packages are avoided by construction
  2. Transparent: it should play well with native packaging tools and not get in the way of anyone’s workflow
  3. No support whatsoever for multiple versions: this can be handled with virtualenv for trivial cases, and through native “virtualization” schemes when virtualenv is not enough (chroot for filesystem virtualization, or actual virtual machines for more)
  4. Efficient

This means PEP 376 is out of the question (it breaks points 1 and 4). For a first proof of concept, I will follow the Haskell (Cabal) and R (CRAN) systems, but backed by a database for performance.
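
As a rough sketch of what “backed by a database” could mean, here is the kind of schema involved, using sqlite; the table layout is my illustration, not a committed design:

import sqlite3

conn = sqlite3.connect(":memory:")  # the real index would live in a file on disk
conn.executescript("""
    CREATE TABLE packages (id INTEGER PRIMARY KEY, name TEXT UNIQUE, version TEXT);
    CREATE TABLE files (path TEXT PRIMARY KEY,
                        package_id INTEGER REFERENCES packages(id));
""")
conn.execute("INSERT INTO packages (name, version) VALUES (?, ?)", ("foo", "1.0"))
pkg_id = conn.execute("SELECT id FROM packages WHERE name = ?",
                      ("foo",)).fetchone()[0]
conn.execute("INSERT INTO files VALUES (?, ?)",
             ("/usr/lib/python2.6/site-packages/foo/__init__.py", pkg_id))

# apt-file style query: which package owns a given file?
owner = conn.execute("SELECT p.name FROM files f "
                     "JOIN packages p ON f.package_id = p.id "
                     "WHERE f.path = ?",
                     ("/usr/lib/python2.6/site-packages/foo/__init__.py",)).fetchone()
print(owner[0])  # -> foo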

The main design issue is point 2: ideally, one would want a user-specific, python-specific package manager to be aware of packages installed through the native system, but I am not sure that is really possible without breaking the other points.

14 thoughts on “Bento at PyCon 2011 and what’s coming in bento 0.0.5”

    • Although pip and bento both aim at doing the same thing, they are completely different in how they work. Pip is an easy_install replacement, which is itself based on setuptools, itself built on top of distutils. Those tools are too fragile and extremely buggy (we get weekly bug reports linked to their usage with numpy or scipy).

      Bento aims at replacing the whole toolchain, without depending on distutils at all. You can browse this blog for the issues with distutils and the rationale for bento, but it boils down to bento being based on clearly defined and separate components, so that complex packages can customize any part of the packaging.

      Concretely, bento already allows you to use different build tools to build C extensions (waf, scons, or distutils itself if you want), to include data files and install them where you want (customizable by the user), to describe your package in a simple format, and other things which are inherently difficult, if not impossible, in distutils.

  1. I am the author of an extension – APSW – that wraps SQLite. In general the SQLite on your machine is too old, so my setup.py includes a command to fetch the latest version and embed that inside the extension.

    The problem with all these systems (pip/easy_install/PyPI) is that I haven’t seen any way to tell them that you probably want to run extra commands. For example, the best way of just having everything work (i.e. download and use the latest) for APSW is:

    python setup.py fetch --all build --enable-all-extensions install test

    I can’t make that the default since I can’t tell whether the user wants SQLite etc. downloaded or not. You can see some of the complexity here:

    http://apidoc.apsw.googlecode.com/hg/build.html

    • I am not sure I understand exactly what you need to do for APSW, but would the following workflow work?

      * at configure time, look for sqlite (the C library), check its version and add the sqlite compilation options (--all, etc.)
      * insert a new fetch command to run between configure and build, which does something only if the available sqlite version does not match what you need
      * build the thing

      If so, it is perfectly possible to do this with bento, and not that difficult. One thing that is easier in bento than in distutils is having a configure stage where all customization is done. That’s one of the issues I wanted to fix in bento (and which is hard to fix in distutils proper, as the whole design is based on each command having independent options).

      • As a user I like the ‘setup.py install’ interface – type one thing and it just works.

        The problem for my own package is that it is virtually impossible for me to determine the right thing to do unless given further direction by the person doing the install. The existing distutils will likely fail with the line above, and so the installer (as in, the person) has to read the doc. pip etc. don’t help, since they don’t provide a way for me as a packager to say that this package requires (in 99% of cases) further directives from the installer.

        I’ve mostly been solving the problem by supplying binaries instead, which works well for Windows users, although the combination of Python versions and 32/64 bit is getting large; for Ubuntu I do a PPA, where Ubuntu deals with all the version/arch combinations.

        If the configure stage allows for some combination of probing and querying, it would make my life easier. At the moment I work around distutils having independent command options by using global variables.

        • We have exactly the same issue for numpy/scipy (with similar solutions based around distributing binaries), and your scenario sounds like one I want to solve with bento.

          You can already add flags and custom paths in the bento.info file, and you will be able to add custom options in configure (with default values), although the latter is not entirely implemented yet. Basically, it will look something like:

          @pre_configure()
          def pre(context):
              context.add_option("--all", help=...)

          @post_configure()
          def post(context):
              if context.options.all:
                  do_something()

          Bento is designed so that whatever you do in configure is available to the build/install commands. The latter is the hard part in distutils: you could always customize existing commands, albeit with a lot of boilerplate, but interaction between commands was never very practical.
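
          The mechanism amounts to persisting the configure stage’s decisions and reading them back in later commands; here is a minimal sketch of the idea (illustrative only, not bento’s actual implementation):

          import json

          def configure(options):
              # everything decided at configure time is written down once
              with open("config.json", "w") as f:
                  json.dump(options, f)

          def build():
              # build (or install) sees exactly what configure decided
              with open("config.json") as f:
                  options = json.load(f)
              print("building with", options)

          configure({"all": True})
          build()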

          As for querying capabilities: the internal build system used in bento supports most basic autoconf-like checks, and you will soon be able to use a real build system like waf if you want to. Waf is extremely powerful and quite simple (to the point that I am thinking about giving up my own internal build library and just always using waf).

          • Ironically enough, actually building my extension is trivial. Although there are several .c source files, there is a master one that #includes them all. If I downloaded SQLite, then its amalgamation is also #included. Consequently one file gets compiled.

            Other than communicating between setup commands, there are two areas where I had trouble. The first is doing an sdist, since I really have two kinds of sdist: one is the regular pristine source, and the second is one where the help has been built and SQLite has been downloaded, and they should be included so that anyone handed the sdist output can actually build it.

            The second problem is 64-bit Windows compilers. None of the logic in the existing distutils nor in the vcvars batch files actually works correctly together (I’m using the free downloads from Microsoft, not the paid-for versions), so I did a hack based on default install locations.

            I support Python 2.3 onwards, including Python 3, so getting everything working right is a pain.

            Documentation on how to setup a build machine for Windows: http://code.google.com/p/apsw/wiki/Win7build

            32kb of my setup.py: http://code.google.com/p/apsw/source/browse/setup.py

  2. I’m curious why you think PEP 376 breaks reliability and efficiency. What information do you think is missing, and what do you think would improve efficiency without imposing additional demands on system packaging tools (which generally have an easier time if files aren’t shared between packages)?

    (Apart from those questions, I think bento as a concept is cool and want to look into it more when I have the time.)

    • As far as efficiency goes, my main gripe is the lack of indexing. Because packages are discovered at runtime, it cannot really be efficient – yolk works following a design close to PEP 376, and it is quite slow, especially if you install things on NFS (as is common in universities). I don’t understand why the python community seems so against indexing for packaging – it is really unique to python; all other packaging solutions are based on an index.

      As for robustness, part of it is also indexing, although in a more subtle way. More fundamentally, I don’t think you can build a reliable packaging solution without a clearly defined and enforced set of metadata. This goes beyond versions to all the other metadata fields.

      Moreover, and this is not directly linked to PEP 376, getting a *reliable* list of installed files out of distutils once you extend it is very difficult. There are so many corner cases and special cases (egg vs install vs bdist_*) that I think 99% of people will get it wrong. I still get it wrong, and I have spent more time than most inside distutils. With that in mind, I don’t see how you can get something reliable for installation as it is.

  3. IOW, there isn’t anything wrong with the *format* wrt reliability; you’re saying that you don’t think distutils can reliably fill out the data *in* that format, yes?

    With respect to efficiency, I don’t see how you can actually get any faster than os.listdir() as the primary mechanism for determining whether a package is present, and even using PEP 376’s underpowered variant of .egg-info, you can still get that information. (setuptools’ approach of course also gives you version information from listdir).
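
    For reference, the discovery mechanism being described amounts to something like the following sketch (my illustration, not PEP 376’s reference code):

    import os

    def installed_packages(site_packages):
        # one listdir gives presence (and, with setuptools-style names, versions)
        for entry in os.listdir(site_packages):
            if entry.endswith((".dist-info", ".egg-info")):
                yield entry.rsplit(".", 1)[0]  # e.g. "foo-1.0"

    # example: print(list(installed_packages("/usr/lib/python2.6/site-packages")))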

    So I’m curious as to what use cases you have that would both need more metadata than package name/version, *and* which occur often enough that efficiency is even worth *thinking* about, let alone actually coding for. ;-)

    • I have plenty of reservations about the format itself, but I think it is indeed secondary to the problem that distutils cannot fill out the data in that format.

      As for efficiency, I was not precise enough in my answer: I meant it is too slow to get package information at runtime, not to find the packages themselves. Listing directories by itself is fine (especially since python has to do it anyway to import things, so most of the IO for listing directories will be cached by the OS by the time you are doing anything *in* python). The problem is when you need to read files in those directories. Having a single index makes things much faster, especially once you need detailed information such as which file is owned by which package (apt-file kind of information).

      As for whether it is worth thinking about – that’s a valid concern. I think it is, and it is not that costly in terms of development. Time will tell whether I am wrong.

      • So, you’re concerned about the performance of package management tools themselves, then? Or are you talking about finding plugins?

        What *actual* use cases, in other words, are you concerned about the performance of? Listing packages? Doing an install or uninstall? Locating potential plugins?

        The issue here isn’t that the code will be difficult to write — it probably isn’t that difficult at all. The issue is the burden placed on people integrating with system packaging tools, for the sake of a use case they might not even need (since their packaging system already has its own index and file listing mechanism).

        • I am ignoring plugins, if by plugins you mean something like eggs with entry points loaded at runtime. Nobody in the scipy community is using plugins, and I think they are out of scope for bento proper. I am mainly concerned about the efficiency and reliability of the packaging tools themselves (so indeed listing packages, install/uninstall, etc.).

          As for the issues related to system packaging, I think it is a matter of implementation rather than principle. Cabal (haskell) has a package index, as do other packaging systems. You can also imagine providing tools/conventions/APIs so that system packagers can integrate both systems. As you mentioned, packagers generally hate multiple packages sharing the same file(s), but here the issue is different, because the package owning the index would be separate. Writing files in a common area (e.g. config files for pkg-config, or .dist-info directories) is not that different from updating an index as far as maintenance goes. I also believe I have enough experience with debian and rpm packaging to avoid the common pitfalls and not make the situation worse than it currently is (the latter argument being kind of hand-wavy, obviously).
