Looking back at 16 years of dpkg history with some figures

With Debian’s 19th anniversary approaching, I thought it would be nice to look back at dpkg’s history. After all, it’s one of the key components of any Debian system.

The figures in this article are all based on dpkg’s git repository (as of today, commit 9a06920). While the git repository doesn’t have all the history, we tried to integrate as much as possible when we created it in 2007. We have data going back to April 1996…

In this period between April 1996 and August 2012:

  • 146 persons contributed to dpkg (result of git log --pretty='%aN'|sort -u|wc -l)
  • 6948 commits have been made (result of git log --oneline | wc -l)
  • 3133612 lines have been written/modified (result of git log --stat|perl -ne 'END { print $c } $c += $1 if /(\d+) insertions/;')

Currently the dpkg source tree contains 28303 lines of C, 14956 lines of Perl and 6984 lines of shell (figures generated by David A. Wheeler’s ‘SLOCCount’) and is translated in 40 languages (but very few languages managed to translate everything, with all the manual pages there are 3997 strings to translate).

The top 5 contributors of all times (in number of commits) is the following (result of git log --pretty='%aN'|sort| uniq -c|sort -k1 -n -r|head -n 5):

  1. Guillem Jover with 2663 commits
  2. Raphaël Hertzog with 993 commits
  3. Wichert Akkerman with 682 commits
  4. Christian Perrier with 368 commits
  5. Adam Heath with 342 commits

I would like to point out that those statistics are not entirely representative as people like Ian Jackson (the original author of dpkg’s C reimplementation) or Scott James Remnant were important contributors in parts of the history that were recreated by importing tarballs. Each tarball counts for a single commit but usually bundles much more than one change. Also each contributor has its own habits in terms of crafting a work in multiple commits.

Last but not least, I have generated this 3 minutes gource visualization of dpkg git’s history (I used Planet’s head pictures for dpkg maintainers where I could find it).

Watching this video made me realize that I have been contributing to dpkg for 5 years already. I’m looking forward to the next 5 years :-)

And what about you? You could be the 147th contributor… see this wiki page to learn more about the team and to start contributing.

3-way merge of Debian changelog files

I’ve been considering for some time now to create a merge tool specifically suited for debian/changelog files. My goal was to let Git use it automatically thanks to gitattributes.

I’ve just gone ahead, so let me introduce you git-merge-dch. Grab it with git clone git://git.debian.org/~hertzog/git-merge-dch.git, you can build a package if you wish. Beware, you need to have a dpkg-dev 1.15.5 that is not yet published (so you need to build dpkg from its git repository, git clone git://git.debian.org/dpkg/dpkg.git) as I rely on features that I introduced recently… you will also need the libalgorithm-merge-perl package.

Using it in a git repository requires two changes:

  • defining a new merge driver somewhere in the git configuration (in .git/config or ~/.gitconfig for example):
    [merge "git-merge-dch"]
            name = debian/changelog merge driver
            driver = git-merge-dch -m %O %A %B %A
    
  • defining the merge attribute for debian/changelog files either in .gitattributes in the repository itself or in .git/info/attributes:
    debian/changelog merge=git-merge-dch
    

Now you can safely maintain two branches of a package with changelog files evolving separately and merge one into the other without creating undue conflicts. Suppose you created an experimental branch for version 2.28 (you use a version 2.28-1~exp1) when 2.26.2 was current stable in the master branch. In the mean time, 2.26.3 got out and was packaged in master. Next time you merge stable into experimental, the changelog entries for 2.28 and 2.26.3 won’t collide despite being at the same place in the changelog file compared to the common ancestor.

Let’s continue with this example, 2.28 is out. Instead of adding a new changelog entry with “New upstream release” without further changes, you keep the current changelog entry and simply change the version into 2.28-1. While preparing this you discover a branch with fixes that was based on 2.28-1~exp1, if you merge it it will reintroduce a 2.28-1~exp1 entry that you don’t want. Fortunately you can use the --merge-prereleases (-m) option of git-merge-dch so that it strip the prerelease part of the version string and considers 2.28-1~exp1 and 2.28-1 to be the same entry really.

The only limitation is that this merge tool will remove any lines which are not parsed by Dpkg::Changelog (and which in theory are not supposed to be there).

Feel free to test, share your comments, report bugs and send patches!

Update: the script has been merged in dpkg-dev (>= 1.15.7) under the name dpkg-mergechangelogs.

Git, CIA and branch merging

Dear Joey, we also had this problem for dpkg, that’s why I hacked the /usr/local/bin/git-commit-notice script that we’re using on Alioth to do something like this instead:

while read oldrev newrev refname; do
    branchname=${refname#refs/heads/}
    [ "$branchname" = "master" ] && branchname=""
    for merged in $(git rev-parse --not --branches | grep -v $(git rev-parse $refname) | git rev-list --reverse --stdin $oldrev..$newrev); do
         /usr/local/bin/git-ciabot.pl $merged $branchname
    done
done

It will stop git rev-list each time that it encounters a commit that is available in any of the other branches present in the repository and thus when you merge a branch, you only see the merge commit in CIA.

You should also note that the script is smarter as it calls CIA only for branch updates, not for tag creation (and other kinds of updates) where it only leads to strange errors IIRC.

Assembling bits of history with git: take two

Following my previous article, I had some interesting comments introducing me to git-filter-branch (which is a new function coming from cogito’s cg-admin-rewritehist). This command is really designed to rewrite the history and you can do much more changes… it enabled me to fix the dates/authors/committers/logs of all the commits that were created with git_load_dirs. It can also be used to add one or more “parent commits” to any commit.

In parallel I discovered some problems with the git repository that I created: the tags were no more pointing to my master branch. This is because git rebase won’t convert them while rewriting history.

This lead me to redo everything from scratch. This time I used git-filter-branch instead. The man page even gives an example of how to link two branches together as if one was the predecessor of the other. Here’s how you can do it: let’s bind together “old” and “new”… the resulting branch will be “new-rewritten”.

$ git rev-parse old
0975870bb1631379f2da798fa78736a4fe32960a
$ git checkout new
$ git-filter-branch --tag-name-filter=cat --parent-filter \
"sed -e 's/^$/-p 0975870bb1631379f2da798fa78736a4fe32960a/'" \
new-rewritten
[...]
Rewritten history saved to the new-rewritten branch

Short explanation: the only commit without a parent commit (thus matching the empty regex “^$”) is the root commit and this one is changed to have a parent (-p) which is the last commit of the branch “old”.

At the end, you remove all the temporary branches, keep only what’s needed and repack everything to save space:


$ git branch -D old new
$ git prune
$ git repack -a -d

Assembling bits of history with git

The dpkg team has a nice history of changing VCS over time. At the beginning, Ian Jackson simply uploaded new tarballs, then CVS was used during a few years, then Arch got used and up to now Subversion was used. When the subversion repository got created, the arch history has not been integrated as somehow the conversion tools didn’t work.

Now we’re likely to move over git for various reasons and we wanted to get back the various bits of history stored in the different VCS. Unfortunately we lost the arch repository. So we have disjoints bits of history and we want to put them all in a single nice git branch… git comes with git-cvsimport, git-archimport and git-svnimport, so converting CVS/SVN/Arch repositories is relatively easy. But you end up with several repositories and several branches.

Git comes with a nice feature called “git rebase” which is able to replay history over another branch, but for this to work you need to have a common ancestor in the branch used for the rebase. That’s not the case… so let’s try to create that common ancestor! Extracting the first tree from the newest branch and committing it on top on the oldest branch will give that common ancestor because two identical trees will have the same identifier. Using git_load_dirs you can easily load a tree in your git repository, and “git archive” will let you extract the first tree too.

In short, let’s see how I attach the “master” branch of my “git-svn” repository to the “master” branch of my “git-cvs” repository:

$ cd git-svn
$ git-rev-list --all | tail -1
0d6ec86c5d05f7e60a484c68d37fb5fc31146c40
$ git-archive --prefix=dpkg-1.13.11/ 0d6ec86c5d05f7e60a484c68d37fb5fc31146c40 | (cd /tmp && tar xf -)
$ cd ../git-cvs
$ git checkout master
$ git_load_dirs -L"Fake commit to link SVN to older CVS history" /tmp/dpkg-1.13.11
[...]
$ git fetch ../git-svn master:svn
$ git checkout svn
$ git rebase master

That’s it, your svn branch now contains the old cvs history. Repeat as many times as necessary…