Is science obsolete?

That’s basically what the Wired article “The End of Theory” argues (found through the article “La fin de la théorie?” on the excellent French econ blog econoclastes). The article itself is not that interesting: it tries to be provocative, but fails to give good arguments for its case. The main argument is that thanks to enormous amounts of data, number crunching, and computer farms such as the ones available at Google, it will be more effective to find patterns from data alone, without models. But the analysis is fundamentally flawed at several levels, both scientific and philosophical. It is true that some sciences, or more exactly some activities traditionally labeled as science, are endangered by number crunching; but computers have already made some repetitive activities obsolete, and that is hardly big news, except for the people concerned by the changes.

First, the epistemological arguments against this thesis: the reason why science first makes theories, and then makes experiments to confront the theory with reality, is not just practicality. There are fundamental reasons why this is the case, as mentioned in the article by econoclaste (warning: in French): data gathering itself is subject to various biases, and theory can somewhat alleviate them. There is also an ambiguity in what Chris Anderson means by scientific models: he argues that Google succeeded in getting reliable search results by avoiding models. But you could argue, on the contrary, that Google did better than everyone else because it had a better model for finding interesting pages related to keywords; indeed, the PageRank algorithm, which is the foundation of the Google search engine, is a better algorithm than what other search engines used. And PageRank is itself based on earlier work and theories, in particular citation analysis, which traces back to at least the 1950s. Another example given by Anderson is translation: he argues that translation with a statistical language model can work better than with linguistic knowledge. But arguing that translation can be done better without knowing the language is different from arguing it can be done without any model at all. Among people who work on machine translation, it is actually quite well known that you don’t need to know a language to be successful at translating into it.

Reductionism and the curse of dimensionality

But more significantly, in my opinion, the number crunching approach is fundamentally reductionist: it assumes you can explain a whole phenomenon from its smallest parts. A typical example of the failure of reductionism in science is fluid mechanics: in principle you could explain the behavior of a fluid from the behavior of each particle in the fluid, but in practice you can’t, because once you have a reasonable number of particles, it becomes intractable to work at the particle level. Tom Roud made a similar argument (in French) about the flaws of the reductionist approach in biology.

There are theoretical reasons why you will never manage to make a model of everything just from data. Most number crunching methods are statistical in nature, and rely on estimating some probability distribution. From this point of view, Anderson’s argument can be understood as “with enough data, you can estimate any probability distribution”. But this is not true, for several reasons. One is that complex problems often require computation in high-dimensional spaces, and high-dimensional spaces have some funky properties which are not intuitive and do not map well to our fundamentally three-dimensional world. One particularly significant property is the localization of volume in smooth solids: in high dimension, most of the volume of a sphere lies in a very thin shell, i.e. really near the surface. In dimension 1000, for a sphere of radius one, the volume contained in the sub-sphere of radius 0.5 is only about 1/10^301 of the total volume. This means that if you distributed all the atoms of the universe uniformly in the sphere, you would not get even one inside the sub-sphere. In statistics, this phenomenon is known as the curse of dimensionality: the amount of data necessary for estimation in high dimension grows exponentially with the number of dimensions.
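The sphere figure is easy to check numerically. Here is a small sketch (the function name is mine, made up for the example), using the fact that the volume of a d-dimensional ball of radius r scales as r^d:

```python
from fractions import Fraction

def inner_volume_fraction(d, r=Fraction(1, 2)):
    """Fraction of a unit d-ball's volume lying inside the
    concentric ball of radius r (volume scales as r**d)."""
    return r ** d

# In dimension 3, the half-radius ball still holds 1/8 of the volume...
print(inner_volume_fraction(3))
# ...but in dimension 1000, virtually nothing is left inside it:
# 2**-1000 is about 10**-301.
print(float(inner_volume_fraction(1000)))
```

Using exact `Fraction` arithmetic avoids any floating-point underflow issue in the intermediate computation; the final value is only converted to float for display.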

Also, more data does not always mean better information: a common quote in the data-mining community is “there is no data like more data”, but this is a fallacy. You want data which bring more information, and in some cases, the data you can easily get are not very informative. For example, when transcribing speech with computers (Automatic Speech Recognition, i.e. speaking to your computer instead of typing), you need to estimate the probability distribution of words, and you are particularly interested in the probabilities of words which do not appear often. After analyzing a few thousand examples, you will get a pretty good estimate of the “behavior” of common words like “the” and the like, but maybe not of words like “hermeneutic”. And for practical applications, those rare words are the ones that matter: if you miss “the” in a sentence, you can still understand it, but if you miss “hermeneutic”, that is much less likely.
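To illustrate, here is a toy sketch (assuming an idealized Zipf distribution over a 50,000-word vocabulary, not real speech data; all names and ranks are made up for the example). Since the expected count of a word after n observed words is n times its probability, the amount of data needed to see a word a given number of times grows linearly with its frequency rank:

```python
VOCAB_SIZE = 50_000
# Normalizing constant so the probabilities sum to one:
# p(rank) = (1 / rank) / NORM
NORM = sum(1.0 / r for r in range(1, VOCAB_SIZE + 1))

def zipf_prob(rank):
    """Probability of the word of a given frequency rank."""
    return (1.0 / rank) / NORM

def samples_for_expected_count(rank, target=10):
    """Observed words needed before the rank-th word is *expected*
    to appear `target` times (expected count after n words is n * p)."""
    return int(target / zipf_prob(rank))

# A very common word ("the", rank 1) vs a rare one ("hermeneutic",
# say rank 20,000): the required data scales linearly with the rank,
# so the rare word needs roughly 20,000 times more data here.
print(samples_for_expected_count(1))
print(samples_for_expected_count(20_000))
```

And ten expected occurrences is still a very crude estimate of a word’s behavior, which is the point: the useful tail of the distribution is exactly where data stays scarce.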

Is number crunching new?

So is this number crunching really the beginning of something new? Actually, similar theses were argued before Anderson; the fields of data mining and artificial intelligence (AI) have, since their inception, a history of making claims which never really materialize (AI, for example, has known several “AI winters”, periods of low funding, generally following periods of high funding and high claims about what AI could do). Anyone familiar with the data-mining and artificial intelligence communities should be skeptical about big announcements of paradigm shifts, or, as here, claims of making science obsolete. I would not be surprised if AI, data mining and associated fields were the ones using the expression “paradigm shift” most often.

It baffles me that people are still arguing the same points, with the same claims, as 50 years ago.

Linked articles

For more on this, you can also see Cosma’s blog.

A python 2.5.2 binary for Mac OS X with dtrace enabled

As promised a few days ago, I took the time to build a .dmg of python from the official sources plus my patch for dtrace. The binary is built with the build-script.py script in the Mac/ directory of the python sources, and except for the dtrace patch, no other modification was made, so it should be usable as a drop-in replacement for the official binary from python.org. You can find the binary here.

Again, use it at your own risk. If you prefer building it yourself, or with different options, the patch can be found here.

How to embed a manifest into a dll with mingw tools only

(DISCLAIMER: I am not a Windows guy; the discussion here reflects how I understand things from various sources.)

With Visual Studio 2005, MS introduced a mechanism called side-by-side assemblies and C/C++ isolated applications. Assembly is the MS term encompassing both usual DLLs and .Net modules targeting the CLR, the .Net runtime (i.e. anything programmed in C#). The idea is to provide a mechanism to deal with the well-known DLL hell, since there was no proper versioning scheme for DLLs on Windows. You can read more here.

Why should you care as a python developer? Concretely, starting from VS 2005, if you build a python extension with the mingw compiler, it will link against a runtime which is not available system-wide (e.g. in C:\Windows\system32 by default), causing a runtime error when loading the extension (msvcr80.dll not found). A simple way to reproduce the problem is to take a small dll, and try to link it into a simple executable against the MS runtime:

# This works:
gcc -shared hello.c -o hello.dll
gcc main.c hello.dll -o main.exe
# This does not:
gcc -shared hello.c -o hello.dll
gcc main.c hello.dll -o main.exe -lmsvcr90

If you build the second way, explicitly linking against msvcr90, you will get a dll-not-found error when running the executable, because the dll is not in the system paths (and should not be; the dll is not redistributable). Starting from VS 2005, the only way to refer to the VS runtime libraries is to use manifests, which are XML files embedded in the binary. Those manifests are automatically generated by the MS compiler. Assuming you already have the manifest, how can you generate a binary using it without using the MS compilers?
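For reference, such a manifest looks roughly like this (a sketch from memory, not authoritative: the exact version and publicKeyToken depend on the runtime shipped with your Visual Studio, so check a manifest generated by the MS tools rather than copying these values):

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly xmlns="urn:schemas-microsoft-com:asm.v1" manifestVersion="1.0">
  <dependency>
    <dependentAssembly>
      <!-- Identity of the VC9 C runtime this binary depends on;
           the version and publicKeyToken here are illustrative. -->
      <assemblyIdentity type="win32" name="Microsoft.VC90.CRT"
          version="9.0.21022.8" processorArchitecture="x86"
          publicKeyToken="1fc8b3b9a1e18e3b" />
    </dependentAssembly>
  </dependency>
</assembly>
```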

First, build the object file of the dll:

    gcc -c hello.c

Then, write a hello.rc file which refers to the manifest file (2 seems to refer to dlls, vs 1 for exes, but I am not sure):

    #include "winuser.h"
    2 RT_MANIFEST hello.dll.manifest

Next, build the .res file, which embeds the xml manifest into the resource file (.res):

    windres --input hello.rc --output hello.res --output-format=coff

Finally, link the whole thing together:

    gcc -shared -o hello.dll hello.o hello.res -lmsvcr90

Now, executing main.exe should be possible. There is still the problem of generating the manifest file in the first place. Since in our case the problem is mainly with the MSVC runtime, to stay compatible with the python.org binary, we may be able to just reuse the same manifest every time.

A few more links on the topic:

http://www.ddj.com/windows/184406482

http://msdn.microsoft.com/en-us/library/ms235591(VS.80).aspx

http://www.codeproject.com/KB/COM/regsvr42.aspx

Building dtrace-enabled python from sources on Mac OS X

One highlight of Mac OS X Leopard is dtrace. Providers for ruby and python are also available, but only for the “system” interpreters (the ones included out of the box). If you install python from http://www.python.org, you can’t use dtrace anymore. Since the code to make python dtrace-enabled is available in the open source corner of Apple, I thought it would be easy to apply it to the pristine sources available on python.org.

Unfortunately, for some strange reason, Apple only provides its changes in the form of ed scripts applied through a Makefile, and the changes are not feature-specific: you just get one script per modified file, with dtrace and Apple-specific changes all put together. I managed to extract the dtrace part of this mess, so that I can apply only the dtrace-related changes. The goal is to have a python as close as possible to the official binary available on python.org. The patch can be found there.

How to use?

• Untar the python 2.5.2 tarball
• Apply the patch
• Regenerate the configure script by running autoconf
• Configure as usual, with the additional option --enable-dtrace (the configuration is buggy, and will fail if you don’t enable dtrace, unfortunately)
• Build python (make, make install).

If time permits, I will post a .dmg. Needless to say, you run this at your own risk.