Why people should stop talking about git speed

As I have already written in a previous post, I have moved away from bzr to git for most of my software projects (I still prefer bzr for documents, like my research papers). Many, if not most, comparisons of git with other tools focus on speed. True, git is quite fast for source code management, but I think this kind of misses the point of git. It took me time to appreciate it, but one of git’s killer features for source code control is the notion of content tracking. Bzr (and I believe hg, although I could not find good information on that point) uses file ids, i.e. it tracks files, and a tree is a set of files. Git, on the contrary, tracks content, not files. In other words, it does not treat files individually, but always considers the whole tree internally.
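As a small illustration of what "content, not files" means at the storage level, here is a minimal sketch (the file names a.c and b.c are made up for the example). Git names the content of a file by a hash of that content alone; the file name is stored separately, in the tree. Two files with identical content therefore get the same object id:

```shell
#!/bin/sh
# Sketch: the object id of a file's content does not depend on the file name.
set -e

work=$(mktemp -d)          # throwaway directory, no repository needed
cd "$work"

printf 'double acc = 0;\n' > a.c
cp a.c b.c                 # same content, different name

# git hash-object computes the id git would use to store the content.
id_a=$(git hash-object a.c)
id_b=$(git hash-object b.c)

echo "a.c: $id_a"
echo "b.c: $id_b"
```

This only shows the storage side. The detection of code moves in git blame is an analysis done on top of the stored snapshots, not something recorded in them.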

This may seem like an internal detail, and an annoyance because it leaks at the UI level quite a lot (the so-called index is linked to this). But it means that git can record the history of code, rather than of files, quite accurately. This is especially visible with git blame. One example: I recently started a massive surgery on the numpy C source code. Because of some C limitations, the numpy core C code was in a couple of gigantic source files, and I split them into more logical units. But this breaks svn blame heavily. If you just rename a file, svn blame can follow the rename. But if you split one file into two, it becomes useless. Because git tracks the whole tree, the blame command can be asked to detect code moves across files. For example, git blame with code move detection gives me the following on one file in numpy:

dc35f24e numpy/core/src/arrayobject.c         1) #define PY_SSIZE_T_CLEAN
dc35f24e numpy/core/src/arrayobject.c         2) #include <Python.h>
dc35f24e numpy/core/src/arrayobject.c         3) #include "structmember.h"
dc35f24e numpy/core/src/arrayobject.c         4)
65d13826 numpy/core/src/arrayobject.c         5) /*#include <stdio.h>*/
5568f288 scipy/base/src/multiarraymodule.c    6) #define _MULTIARRAYMODULE
2f91f91e numpy/core/src/multiarraymodule.c    7) #define NPY_NO_PREFIX
2f91f91e numpy/core/src/multiarraymodule.c    8) #include "numpy/arrayobject.h"
dc35f24e numpy/core/src/arrayobject.c         9) #include "numpy/arrayscalars.h"
38f46d90 numpy/core/src/multiarray/common.c  10)
38f46d90 numpy/core/src/multiarray/common.c  11) #include "config.h"
0f81da6f numpy/core/src/multiarray/common.c  12)
71875d5c numpy/core/src/multiarray/common.c  13) #include "usertypes.h"
71875d5c numpy/core/src/multiarray/common.c  14)  
0f81da6f numpy/core/src/multiarray/common.c  15) #include "common.h"
5568f288 scipy/base/src/arrayobject.c        16)
65d13826 numpy/core/src/arrayobject.c        17) /*
65d13826 numpy/core/src/arrayobject.c        18)  * new reference
65d13826 numpy/core/src/arrayobject.c        19)  * doesn't alter refcount of chktype or mintype ---
65d13826 numpy/core/src/arrayobject.c        20)  * unless one of them is returned
65d13826 numpy/core/src/arrayobject.c        21)  */

As you can see, the original file is identified for every line of code in the new file. The original author and date are available as well; I just removed them for the blog post.

This is truly impressive, and is one of the reasons why git is so far ahead of the competition IMHO. This kind of feature is extremely useful for open source projects, much more than rename support. I am ready to deal with quite a few (real) Git UI annoyances for this.

Edit

It looks like my example was not very clear. I am not interested in following the renames of the file: in the example above, the file was not first arrayobject.c, then renamed to multiarraymodule.c, and later to common.c. The file was created from scratch, with content taken from those files at some point. You can try the following simplified example. First, create two files, prod.c and sum.c:

sum.c:

#include <math.h>
double sum(const double* in, int n)
{
        int i;
        double acc = 0;

        for(i = 0; i < n; ++i) {
                acc += in[i];
        }

        return acc;
}

prod.c:

#include <math.h>

double prod(const double* in, int n)
{
        int i;
        double acc = 1;

        for(i = 0; i < n; ++i) {
                acc *= in[i];
        }

        return acc;
}

Commit to your favorite VCS. Then, you reorganize the code, and in particular you put the code of both files into a new file common.c. So you create a new file common.c:

#include <math.h>

double prod(const double* in, int n)
{
        int i;
        double acc = 1;

        for(i = 0; i < n; ++i) {
                acc *= in[i];
        }

        return acc;
}

double sum(const double* in, int n)
{
        int i;
        double acc = 0;

        for(i = 0; i < n; ++i) {
                acc += in[i];
        }

        return acc;
}

And commit. Then, try blame. Rename tracking won’t help at all, since nothing was renamed. On this very simple example, you could improve things by first renaming, say, sum.c to common.c, then adding the content of prod.c to common.c, but you will still lose the fact that the prod function comes from prod.c. git blame -C -M gives me the following:

^ae7f28a prod.c  1) #include <math.h>
^ae7f28a prod.c  2)
^ae7f28a prod.c  3) double prod(const double* in, int n)
^ae7f28a prod.c  4) {
^ae7f28a prod.c  5)         int i;
^ae7f28a prod.c  6)         double acc = 1;
^ae7f28a prod.c  7)
^ae7f28a prod.c  8)         for(i = 0; i < n; ++i) {
^ae7f28a prod.c  9)                 acc *= in[i];
^ae7f28a prod.c 10)         }
^ae7f28a prod.c 11)
^ae7f28a prod.c 12)         return acc;
^ae7f28a prod.c 13) }
^ae7f28a sum.c  14)
^ae7f28a sum.c  15) double sum(const double* in, int n)
^ae7f28a sum.c  16) {
^ae7f28a sum.c  17)         int i;
^ae7f28a sum.c  18)         double acc = 0;
^ae7f28a sum.c  19)
^ae7f28a sum.c  20)         for(i = 0; i < n; ++i) {
^ae7f28a sum.c  21)                 acc += in[i];
^ae7f28a sum.c  22)         }
^ae7f28a sum.c  23)
^ae7f28a sum.c  24)         return acc;
^ae7f28a sum.c  25) }
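
For readers who want to try this themselves, the steps above can be scripted roughly as follows. This is only a sketch under a few assumptions of mine: it works in a throwaway directory, sets a dummy commit identity, and removes sum.c and prod.c in the same commit that adds common.c (the post does not say whether the old files were deleted). With -C given once, git blame looks for lines moved or copied from other files modified in the same commit, which is why the removal matters here.

```shell
#!/bin/sh
# Sketch: reproduce the sum.c / prod.c -> common.c experiment.
set -e

repo=$(mktemp -d)                      # throwaway repository
cd "$repo"
git init -q
# Dummy local identity so that committing works without global git config.
git config user.name "Blame Demo"
git config user.email "demo@example.invalid"

# First commit: two separate files.
cat > sum.c <<'EOF'
#include <math.h>

double sum(const double* in, int n)
{
        int i;
        double acc = 0;

        for(i = 0; i < n; ++i) {
                acc += in[i];
        }

        return acc;
}
EOF

cat > prod.c <<'EOF'
#include <math.h>

double prod(const double* in, int n)
{
        int i;
        double acc = 1;

        for(i = 0; i < n; ++i) {
                acc *= in[i];
        }

        return acc;
}
EOF

git add sum.c prod.c
git commit -q -m "Add sum.c and prod.c"

# Second commit: a brand new common.c made of all of prod.c followed by
# sum.c without its first line (the duplicate #include).
cat prod.c > common.c
tail -n +2 sum.c >> common.c
git rm -q sum.c prod.c                 # assumption: old files removed here
git add common.c
git commit -q -m "Move sum and prod into common.c"

# -M: detect lines moved within a file.
# -C: also detect lines moved or copied from other files changed in the
#     same commit.
blame_output=$(git blame -C -M common.c)
printf '%s\n' "$blame_output"
```

The output should look much like the listing above, with each line attributed to prod.c or sum.c in the first commit (the hashes will of course differ). If the old files are kept instead of removed, giving -C twice asks git to also look at other files in the commit that created common.c.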

hg blame, on the contrary, will tell me that everything comes from common.c. Even when using the rename trick, I cannot get more than the following with hg blame -f -c:

81c4468e59f9    sum.c: #include <math.h>
81c4468e59f9    sum.c:
81c4468e59f9    sum.c: double sum(const double* in, int n)
81c4468e59f9    sum.c: {
81c4468e59f9    sum.c:         int i;
81c4468e59f9    sum.c:         double acc = 0;
81c4468e59f9    sum.c:
81c4468e59f9    sum.c:         for(i = 0; i < n; ++i) {
81c4468e59f9    sum.c:                 acc += in[i];
81c4468e59f9    sum.c:         }
81c4468e59f9    sum.c:
81c4468e59f9    sum.c:         return acc;
81c4468e59f9    sum.c: }
3c1ac7db76ba common.c:
3c1ac7db76ba common.c: double prod(const double* in, int n)
3c1ac7db76ba common.c: {
3c1ac7db76ba common.c:         int i;
3c1ac7db76ba common.c:         double acc = 1;
3c1ac7db76ba common.c:
3c1ac7db76ba common.c:         for(i = 0; i < n; ++i) {
3c1ac7db76ba common.c:                 acc *= in[i];
3c1ac7db76ba common.c:         }
3c1ac7db76ba common.c:
3c1ac7db76ba common.c:         return acc;
3c1ac7db76ba common.c: }

24 responses to “Why people should stop talking about git speed”

  1. Linus described this at his talk in Google, it’s online here: http://www.youtube.com/watch?v=4XpnKHJAok8

  2. Interesting post. I’m interested in your comment about research papers, however. I write most of my research papers in latex and sometimes have to reorganize the text which leads to multiple latex files with paragraphs moving between them. It would seem that this kind of process would benefit greatly from git’s improved origin analysis. Why do you prefer bzr for your papers?

  3. Dave

    bzr really doesn’t do this?
    Even svn does this; I can’t imagine using a SCM without this functionality.

    • cournape

      > bzr really doesn’t do this? Even svn does this; I can’t imagine using a SCM without this functionality.

      svn does not do this. It does not even work after a file rename AFAIK. How would you obtain this information from svn? bzr, hg (and git by default) display the same information as svn, that is, the last revision *in the file* of each line. If the file was renamed, bzr follows the rename (svn does not). What git does is much more powerful (and useful).

      • Dave

        Oh I missed that git blame detects code moving between existing files (I’ve made do with git checkout rev^ -> git blame chains)
        svn blame does work for renamed and split files if you used svn move/copy, although I don’t think it will show what the file was called when the commit happened.

  4. Jason Dusek

    If C had a decent module system this feature would not be so cool.

    • cournape

      > If C had a decent module system this feature would not be so cool.

      The feature is equally useful for python source code.

  5. me

    Watch out, der’s a smiley in yer git log.

  6. Stephen Thorne

    If you rename a file in subversion, it keeps all the blame history. Same if you split a file in two, you just do a svn copy to retain the blame history. I do this often.

    • cournape

      > If you rename a file in subversion, it keeps all the blame history

      Yes, it looks like it can do as well as bzr if you tell svn about the renames. But my example is more complicated than just a file rename/split: I had several files with content, and then several new files, each of which had content from several older files. That’s not just rename and copy (the new file has sources from different files, so you can’t use copy). svn inherently can’t find this information (the file is taken from a svn repository, BTW, but to be fair, I don’t think the renames were done explicitly with svn mv).

  7. anonymouse

    could this not be done in svn with branching and merging?

    I realize this is a total horrible kludge, but:

    initial cond:
    files a b c

    branch.

    modify branch so (a,b,c) -> (d,e,f,g,h) (each of which only contains one of (a,b,c)).

    modify trunk so (a,b,c) -> (d,e,f,g,h)

    (one branch for every one of (a,b,c) in any of (d,e,f,g,h))

    merge all the branches back together again.

  8. angch

    Mercurial can do this via “hg blame -c -f”

    $ hg blame -c -f integers.txt
    728b6663ae06 integers.txt: 1
    4fb31bac9393 primes.txt: 2
    4fb31bac9393 primes.txt: 3
    728b6663ae06 integers.txt: 4
    4fb31bac9393 primes.txt: 5
    728b6663ae06 integers.txt: 6
    4fb31bac9393 primes.txt: 7
    728b6663ae06 integers.txt: 8
    728b6663ae06 integers.txt: 9
    728b6663ae06 integers.txt: 10

    • cournape

      > Mercurial can do this via “hg blame -c -f”

      It looks like my example was not clearly explained :) I can’t tell for sure from your example, but I think we are not talking about the same thing. In your case, I believe that integers.txt was primes.txt before, and that’s what the -f option does: it can follow renames. But what if the content of the file comes from another file (a file which was not renamed to integers.txt)? Bzr can follow renames too (and it looks like svn can too, I was wrong on this one, although svn blame does not have a follow-copy option, surprisingly). The example given is really different: the file I am showing was created from scratch, and I moved some content into it from arrayobject.c at some point, and later some content from another file.

      I will fix my example so it becomes clearer.

  9. Uncle T

    I was thinking about this myself and was wondering if you needed to duplicate the ENTIRE file several times, check those in, THEN prune the copies down to the desired contents. If you just created new files with partial content it seems to me the hashes would be different and thus it would lose track of where the code came from originally.

  10. RonnyPfannschmidt

    git itself does NOT track any content moves at all
    git is all about snapshots of the content

    the rest is just analysis tools on top of it that feed from the speed

    so stop the lie about git tracking content moves
    it doesn’t do that
    it just has the tools to analyze the snapshots well enough

    • cournape

      Please refrain from using strong language in the comments.

      I never said that git tracks content moves, but that it detects them. And certainly, it can do so only because it always considers the whole tree instead of each file separately (which is what most other VCS do).

      • RonnyPfannschmidt

        bzr and hg also do have whole trees at hand
        it’s simply the lack of such analysis tools that prevents them
        they simply don’t work at the level of the whole content but at the level of the trees

        to get a useful content move detection there is lots of guessing involved

        and git in general seems to love guessing around, as it doesn’t track what happens, but what is; thus things like renames have to be inferred from the trees

        • cournape

          > bzr and hg also do have whole trees at hand

          Well, at some level, they have, of course. But my understanding of git is that it deals with the whole tree internally quite pervasively; that’s why git cannot handle huge trees, for example (contrary to, say, subversion). Being able to deal with the whole tree internally requires quite a lot of optimizations: bzr, for example, needs several copies of the file about to be committed, and I doubt they could get away with it if they considered the tree snapshot as a whole for every operation. Another explanation can be found in this email from Linus:

          http://thread.gmane.org/gmane.comp.version-control.git/46341 (note the link between index, the need for file ID, etc…)

          But again, I don’t claim a big understanding of the internals. I have started to look at git implementation “for fun”, and I am far from having a good grasp of it yet. All I can see is that git can do it, today, and neither bzr nor hg can do it (nor do they plan to do it soon).

  11. David W-F

    > I still prefer bzr for documents, like my research papers

    Can I ask why? What makes it better for this?

    • cournape

      I prefer bzr for several reasons:
      – I rarely if ever use branches for papers
      – I am the only one having a copy of the repository, and it is not published anywhere, so if I screw up something, I have no way to get it back from independent sources. It is a bit easier to screw things up with git than with bzr.
      – most git advantages over bzr do not matter for papers

      So basically, bzr is simpler and easier, and git advantages are not useful in this case.

  12. Pingback: Weekly linkdump #175 - amaslov - блог разработчиков

  13. If I add a method in file A:

    + def __unicode__(self):
    +     return self.title

    and I remove a method in file B (which happens to be the same):

    - def __unicode__(self):
    -     return self.title

    is git going to tell me that the __unicode__ method in file A “came from” file B?

    • cournape

      Yes, exactly. Obviously, it works better if you don’t move and modify the code too much in the same commit. But it works extremely well in my experience, at least for C and python code. Since git infers this information, it works even for imported repositories, so it can show you things about a svn repo that you cannot see with svn itself.
