Source control is an extremely valuable tool that most developers use nowadays. To me, its two most important features are that it makes working in a team easier and that it keeps a complete history of the changes made to the source code. Git has gained a lot of market share in the source control ecosystem over the years, and with good reason: it is well designed and handles workflows such as branching, or contributing code to a project without being part of that project’s team, more efficiently. This post isn’t meant to argue for Git over SVN, though. It’s about things that can go wrong, without anyone noticing, while migrating an existing codebase from SVN to Git.
Git’s complete history consumes a lot of disk space when large binary files are under version control and change frequently. With SVN, all of that load is centralized on the server. With Git, every clone carries the full history, so the problem grows linearly with the number of users. Fortunately, there is a Git extension called git-lfs (Git Large File Storage) that addresses this problem with a system of file pointers that are downloaded lazily. Basically, what is stored in the Git history for a large binary file is only a small pointer of a few bytes. When a branch is checked out, the pointer is resolved and the corresponding file is downloaded. Files that should be handled by git-lfs can be hand-picked or matched using wildcards. This solution can be very helpful if disk space becomes a problem for your team and you don’t mind reducing the redundancy of the repository internally.
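To make the pointer idea concrete, here is a minimal Python sketch that builds the kind of text git-lfs commits in place of a large file. The three-line layout (version, oid, size) follows the Git LFS pointer format as I understand it; the function name lfs_pointer and the sample bytes are purely illustrative. In practice you never write pointers by hand: the git-lfs client generates them once a pattern is tracked with git lfs track.

```python
import hashlib


def lfs_pointer(content: bytes) -> str:
    """Return the pointer text that stands in for `content` in the Git history.

    Illustrative helper only: real pointers are written by the git-lfs client.
    """
    oid = hashlib.sha256(content).hexdigest()  # the content hash identifies the file
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )


if __name__ == "__main__":
    big_file = b"\x00" * (10 * 1024 * 1024)  # stand-in for a 10 MiB binary asset
    pointer = lfs_pointer(big_file)
    print(pointer)
    print(f"{len(big_file)} bytes of content, {len(pointer)} bytes of pointer")
```

Whatever the size of the binary, the pointer stays at roughly a hundred bytes, which is why the history of a repository using git-lfs stays small.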
One repository, multiple projects
One of the choices you face when using SVN is whether to create a separate repository for each project, or to create a single repository that reproduces the SVN convention of having trunk, branches and tags folders at the root of each project, with a hierarchy of projects in the parent directories. Apache’s SVN repository is a perfect example of the latter. The subfolder svnroot/ant/core is home to one project structure, and there are many more projects all over the place. With that structure, SVN allows you to check out specific subfolders, or even to branch subfolders.
Git’s model is very different, and I don’t know of any easy way to clone or branch only a subset of a repository the way SVN allows. Because of this, there are two options: merge all the projects together into a single monolithic repository, or create a distinct repository for every project. Both choices have important downsides. Merging everything together changes your repository structure, which has an impact on the continuous integration system. You will need much more manpower to get the migration from SVN to Git done, because you will also need to reconfigure the build servers to account for the new project structure. Ultimately, a monolithic repository is the direction most large companies are heading in, because it offers a lot of advantages. I won’t deny that various aspects of this choice are debatable. The single argument I’d use is that those big companies are usually the ones that invest in toolchains, and when they do, they invest in solving the bottlenecks they encounter using a monolithic repository. Unless you outgrow one of those companies, you shouldn’t encounter any problem that hasn’t been successfully solved already. I’m not saying those problems are easy to solve, just that they can be solved.
While testing the migration process and its results, we discovered that the history didn’t go as far back as it should have. It turns out that some SVN move operations act as a barrier for the Git migration process. I’ve seen people describe this barrier as “the great relocation”, which sounds pretty accurate and matches the situation I encountered. I don’t have a final solution for this problem, but some people seem to have solved it. In the great relocation link above, Matthew McCullough mentions he went through a great deal of trouble to migrate the Groovy codebase while preserving the history. There are also other paths that look interesting. If anyone has experience solving this issue, I’d really like to hear about it.
Merge ancestor lobotomy
After the migration project was delivered and people started developing on Git, I discovered a problem we had missed while testing the migration results. Some branches were old enough to have been created on SVN, and merge commits bringing the trunk’s changes into the branch had occurred multiple times. When looking at such a branch in the Git repository, though, the merge commits were not recognized as merges from master; they were just ordinary commits. Naturally, that led to problems when trying to merge more changes from master into the branch, because most of the changes Git was trying to apply had already been applied in the past. That’s how I discovered Git grafts.
There are basically two very different approaches to fixing the merge history. One is to amend the mergeinfo in SVN so that the migration picks up those merge commits correctly, and then run the migration to Git again. The other is to fix the Git history by rewriting it. I explored the second option, as it seemed easier considering the team had already been working on Git for a while when the problem was discovered, and git grafts is the tool for the job.
A grafts file allows you to override a commit’s list of parents with a new list of hashes. First, you need to create a file at the path .git/info/grafts. Inside that file, every commit that must be modified gets its own line. Here is an example:
e5fa44f2b31c1fb553b6021e7360d07d5d91ff5e 7448d8798a4380162d4b56f9b452e2f6f9e24e7a a3db5c13ff90a36963278c6a39e4ee3c22e2a436
9c6b057a2b9d96a4067a749ee3b3b0158d390cf1 5d9474c0309b7ca09a182d888f73b37a8fe1362c ccf271b7830882da1791852baeca1737fcbe4b90
The first hash on a line is the hash of the commit that is modified. The rest of the line is a space-separated list of parent hashes for that commit.
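The line format described above is simple enough to check with a few lines of Python before you trust a hand-edited file. This is only a sketch: parse_grafts is a hypothetical helper name, and skipping blank lines is a convenience of this example, not something the text above relies on.

```python
from typing import Dict, List


def parse_grafts(text: str) -> Dict[str, List[str]]:
    """Map each grafted commit hash to the parent hashes that override its real ones."""
    grafts: Dict[str, List[str]] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines (a convenience of this sketch)
        commit, *parents = line.split()  # first hash is the commit, the rest are parents
        grafts[commit] = parents
    return grafts


if __name__ == "__main__":
    sample = (
        "e5fa44f2b31c1fb553b6021e7360d07d5d91ff5e "
        "7448d8798a4380162d4b56f9b452e2f6f9e24e7a "
        "a3db5c13ff90a36963278c6a39e4ee3c22e2a436\n"
    )
    for commit, parents in parse_grafts(sample).items():
        print(commit[:7], "gets", len(parents), "parents:", [p[:7] for p in parents])
```

A line with two parent hashes, like the ones in the example, is what turns an ordinary commit into a merge commit in Git’s eyes.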
To fix the merge history in your Git repository, start by finding the hash of each merge commit and putting each one on its own line. Then, for every line (every merge commit), retrieve its current parent from the log and append it to the line. Finally, locate the Git commit the merge was made from and append its hash to the line as well. When this is done, save the grafts file and check the result of your correction by running git log --graph.
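The steps above can be sketched in Python as follows. Treat it as a starting point under a few assumptions: the helper names commit_with_parents and graft_line are mine, the first one simply shells out to git rev-list --parents to read the commit’s full hash and its current parents, and you still have to identify by hand which master commit each merge came from.

```python
import subprocess
from typing import List


def commit_with_parents(rev: str, repo: str = ".") -> List[str]:
    """Return [full commit hash, current parent hashes...] for `rev`, as git reports them."""
    out = subprocess.run(
        ["git", "-C", repo, "rev-list", "--parents", "-n", "1", rev],
        check=True, capture_output=True, text=True,
    ).stdout
    return out.split()  # git prints "<commit> <parent1> <parent2> ..."


def graft_line(commit_and_parents: List[str], merged_from: str) -> str:
    """Build one grafts line: the commit, its existing parents, then the missing merge parent.

    `merged_from` must be the full hash of the commit the merge was made from.
    """
    hashes = list(commit_and_parents)
    if merged_from not in hashes[1:]:  # do not list the same parent twice
        hashes.append(merged_from)
    return " ".join(hashes)


if __name__ == "__main__":
    # Hashes reused from the example above. In a real repository the first
    # argument would come from commit_with_parents("<merge commit>").
    known = [
        "e5fa44f2b31c1fb553b6021e7360d07d5d91ff5e",
        "7448d8798a4380162d4b56f9b452e2f6f9e24e7a",
    ]
    print(graft_line(known, "a3db5c13ff90a36963278c6a39e4ee3c22e2a436"))
```

Each printed line can be appended to .git/info/grafts as is.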
Depending on how much information your merge commit messages contain, it may be hard to find the exact revisions that were merged and then the corresponding commits in the Git repository. You can use svn propget svn:mergeinfo to help you discover the revision numbers you need.
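The property value printed by that command is a list of source paths, each followed by the revisions merged from it. If the list is long, a small parser can make it easier to scan. The sketch below assumes the usual one-path-per-line layout with comma-separated revisions and ranges; parse_mergeinfo is a hypothetical helper, and the paths and revision numbers in the sample are made up.

```python
from typing import Dict, List, Tuple


def parse_mergeinfo(value: str) -> Dict[str, List[Tuple[int, int]]]:
    """Turn an svn:mergeinfo value into {source path: [(first_rev, last_rev), ...]}."""
    merged: Dict[str, List[Tuple[int, int]]] = {}
    for line in value.splitlines():
        line = line.strip()
        if not line:
            continue
        path, _, revs = line.rpartition(":")  # the revision list follows the last colon
        ranges = []
        for item in revs.split(","):
            item = item.rstrip("*")  # a trailing '*' marks a non-inheritable range
            start, _, end = item.partition("-")
            ranges.append((int(start), int(end or start)))  # single revision: start == end
        merged[path] = ranges
    return merged


if __name__ == "__main__":
    # Made-up sample in the shape svn propget svn:mergeinfo prints.
    sample = "/trunk:1001-1050,1060,1075-1090\n"
    for path, ranges in parse_mergeinfo(sample).items():
        print(path, ranges)
```

The last revision of each range is a good candidate when looking for the trunk commit a given merge was made from.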
A grafts file is not a permanent modification, or even a shared one: it only affects your local repository. But there is a way to make it permanent, using a command that rewrites the commit history with the changes specified in the grafts file. Modifying published history is very delicate, and I advise you to read up carefully on the subject before you go down that road. The command you will need is git filter-branch, and it must be used with care. But it is the tool that will allow you to make your grafts permanent and share them with your team so that the Git history is fixed. Please refer to this article for more details on how to make the