The Computer Corner Take II (#18) by Bill Kibler


Version Control 101 Part I

Some "Why" and "What For" of svn and git

A couple of years back, I was asked to continue Charley Shattuck's work on his "MyForth" software project, as it related to a project then being managed by Bob Nash. We are all members of the Forth community and have known each other and worked on many things together for a very long time. Both Charley and Bob had developed a style of working that is not uncommon among small and independent programmers. The basic structure of their work is to create a new directory whenever the next major change is needed.

This "new" directory method is probably the most common version control method and used equally with the tar/zip method. The basic idea is being able to go back to some past work you did, by either un-tarring a file set, or going into a previous directory. For single person projects and those that are simple by nature this has served many a programmer since programming started. This concept however can fall apart rather rapidly when new or more programmers are added. As the complexity of the project increases you will rapidly find it not possible to keep up.

My programming experience has covered both small projects and projects with programmers in several different time zones and countries. Nothing simple would have worked for those situations. I have probably used five or six different version control systems, and before joining the team I had spent six years as one of several ClearCase gurus on some major projects for a very big company. I am back at the big company now, on a project to convert a large ClearCase history and file set into git. We are talking about moving over 200,000 files and 280 branches out of ClearCase into git.

This new project has taught me all about git and how it works. For those who haven't heard about git, it is the product, if you will, of Linus and the Linux kernel team. They took a new approach to version control and developed some new ways of handling the problems. A major difference is that git is "distributed". Distributed means the entire repository of data is copied around and used by everyone - there is no central or single location for the repository. Here is an example of what that means.

I have been on the project now for over a year, and in the meantime Bob and Charley have hooked up to do some more work with MyForth. They have made changes and altered some of the basic operations such that some of the old code no longer works. I have started a new home project to collect data, using some of my old test code in MyForth as the starting point. I went to use the old code, only to find it wouldn't work. At the moment our svn version control server is down and I am stuck with my current version of the code base. Normally with svn, I could have simply figured out which version number I last used for the test120 code base and had the svn server check out that code stream for me. However, svn has a central repository structure, and when the main server is down, you're down.

"git" on the other hand is distributed and each person's copy is a complete repository. That means I have all the history and code variations in the one repository. Had we been using git, discovering that change which broken my current task, would have me finding the point in time where the changes happened and checking that selection of coding to use. Since I have it all, I have no need of a main repository or anyone elses repository. This however is just one of many features that git provides and which I hope to explain as we continue.

An email statement clears it all up

When Charley left and I started on the project, the current code base seemed beyond belief to me. I really can't fault them, as they had developed a way of working that suited them well, but not me. I found it impossible to tell the good code from the bad code, or even the new from the old. I found one file in twenty-some locations, with something like eight different versions and no idea which was the current one. Although it was simple to work on, since everything was in your current directory, fixing bugs across more than one directory was a nightmare.

I came up with a solution that used the directory structure as a means of separating the fixed sections of code from the project-specific coding. This spread things over several directories, but the main idea was to have only one version of a code module for the whole source tree. Thus a bug fix, once applied, would magically fix every project when recompiled. This also worked well with svn's branching setup, which is directory based. An added feature was dividing the code base into public (trunk) and private (branched). The trunk, or public side, could be tarred up and posted without fear of distributing some private or paid-for code.

At various times Charley had access and downloaded the svn tree, but he never liked what I did. This puzzled me for some time, until recently when I needed the updated code and mentioned thinking about moving from svn to git. Charley sent me this email:



"In my defense, it's not so much that I'm not interested in version control as 
that I'm not happy with the changes that were made to the originally simple myForth. 
I purposely kept everything in a single directory so that I wouldn't break my 
old applications by changing the system underneath. I learned that lesson at 
AM Research too many times. I won't use the split up version of myForth and it 
was a big pain to remember how to get it and put it back the way I like it 
so I just gave up."

It clearly makes me wonder how many other programmers have given up on version control simply because the structure is too complicated, or requires a master's degree to understand. Let me say up front that learning ClearCase is not for the uneducated programmer; however, Linus and team have certainly tried to make git simple. One of our problems at work with ClearCase is that past managers didn't understand its workings and thus created a mess that is practically impossible to fix and port to git. Using git forces simpler designs and processes.

Let us tear apart Charley's statements to see if we can find where I went wrong before and how to move forward. First I think we need to comment on his experience at AMR, where no version control was allowed. Since there was no version control, it was pretty easy to get bit by some changes, and thus Charley had to develop his own processes to make sure that didn't happen. Had they been using version control, the chances of getting bit might not go away, but the solution would be as simple as finding a good version in the history and using that. Simple problem, easily handled by version control.

Important Note - It is important to make sure that everyone understands what features of version control would have allowed Charley to recover from his problems at AMR. There are two aspects of version control that solve most problems, namely "history" and "branching". History is the ability to go back in time and work on your code base as it was then. Branching is the ability to work on a code base such that your work does not affect any other work. There is no way to prevent you from making mistakes, but there are plenty of ways, when using version control systems, to help you recover from them. I give some examples of these ideas in article #19.
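As a small taste of what those two features look like in git (a sketch only; the branch name is invented):

> git log --oneline          # history: every past state, ready to be revisited
> git checkout -b my_fix     # branching: new work that touches no one else's

One command shows every recorded state of the code base; the other opens a private line of work that can later be merged back or thrown away.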

I offer no defense for my actions, as I picked what at the time seemed like a good solution. I was able to get Bob to use the system, and for some time it worked well for him. It did require considerable hands-on work from me, as we ran into several issues related to using M/S Windows and svn together - not a good combination, as it turned out. I used the system and found it great, as I knew one change in trunk would fix issues in all the different modules I was using - sometimes five or six modules with one edit. I was doing my part remotely from my home, and all was well until I started work again and let too much time pass; accumulated problems ended up bringing down the server. It still remains down and thus inaccessible.

This article is an attempt to focus on the last part of what Charley said, mainly that svn, and how I set it up, was too complex. I thought I was pretty clever in figuring out how to use the gforth path statement and separating the public from the private. But the main motto, much like a doctor's, is to do no harm, and there I failed. How so? I think the real need, for which svn was not ideal, was keeping their current workflow while protecting it by tracking the changes. When looking at how git works, I see that it fits their style much more closely than svn. Now to explain and teach a bit of git usage.

git by example

I guess the first place to start is by explaining what git is as it relates to being a program. "git" is in no way similar to any other version control system. There is no database as such, no central toolset or servers, and it is not a single program, but a collection of tools that manage a collection of files which represent a structure of work over time. The idea is based on how the kernel team works, so it is important to understand that workflow. Linus has the ultimate git repository and final say over what goes into the kernel. His git repo is the main repo, from which a set of topic managers control and collect changes from sub-managers, who get changes from users and official developers. These changes are normally pushed up the chain of command as email patches. Each level tests and selects patches for inclusion in a future release of the kernel.

In git parlance, changes are pushed and pulled from one repo (repository) to another; in the kernel's case they are pulled from Linus's repo, but changes are emailed back - never pushed. A developer starts by cloning Linus's repo to their own desktop, finds what they want to work on, and creates a branch to do the work on. This temporary branch allows the changes they make to be kept separate from the original source. When a fix is finished and tested, the developer runs a git command to make a diff or patch file of their changes as they relate to the original. This diff or patch is then sent up the chain for review.
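A hedged sketch of that cycle (the branch name and commit message are invented; the clone URL follows the same git.kernel.org pattern as the git.git one shown later):

> git clone http://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> cd linux
> git checkout -b my_fix
> # edit and test, then:
> git commit -am "fix my little corner of the kernel"
> git format-patch master

The "git format-patch master" step writes one email-ready patch file for each commit made since branching from master; those files are what get mailed up the chain.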

While a developer is working, other changes are being added to various branches of pending changes, and these will be "fetch"ed and "rebase"d into the current repo so the developer can see what others have been doing. Most developers will keep their "master" version in sync by "pull"ing in changes regularly, while others may keep separate repos for as many different aspects of the work as they are dealing with. git can be used in lots of ways, since each repo is a standalone version of the source tree.
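Those three operations look like this (a sketch; "origin" is simply git's default name for the repo you cloned from):

> git fetch origin            # bring down new commits without touching your work
> git rebase origin/master    # replay your local commits on top of the updated master
> git pull                    # or do a fetch plus merge in one step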

A little term explanation might help at this point. git is composed of over a hundred separate programs that represent the entire tool set. The toolset is divided into two groups: "plumbing", or low-level tools, and "porcelain" tools that represent the simpler user tools. "git" is the command line interface, while git-gui provides a GUI interface and gitk is a GUI-based history browser. The tools are mainly for linux, but most will run on M/S Windows, and there are now several commercial versions such as SmartGit - which is very good. There is even a toolset called "gitolite" which can be used for access control of git repos down to the branch level. There are several ways to clone repos, and git is compatible with HTTP/HTTPS/SSH as well as its own git protocol server. There is even a gitweb toolset for displaying the entire repo as web pages.

There are several ways to create a git repo: take a directory tree and turn it into a repo, import from another version control system like "perforce" or "svn" where you can get all the history and changes, or start from an empty repository, adding file by file. Since git was influenced by svn, you will see that git has a built-in set of svn-to-git tools. You can use git as the frontend - or user interface - and svn as the backend - or storage container. Adding more features or tools is simple, as the entire system is so modular anything can be done. To show how simple it is, here is how you turn a directory into a git repo:


> cd proj_a
> git init
> git add .
> git commit -am "my new repo"
That is it - you have a new repo. What happened? The "git init" created a directory called ".git", and in it are numerous directories and files that together define the repo. Next we "add ." everything in the directory into the repo's inner workings - it now knows about the entire file structure. We then "commit" what it knows about into the actual ".git" repo structure, making it a permanent part of the repo. At this point we could "push" the entire repo somewhere or have other developers "pull" from it. It could even be "cloned", or copied entirely, by someone else. We check our "status" by doing "git status" and it will tell us what it currently knows about the repo. Man pages are available by doing "man git-status", as "git" is merely a command line reader/parser that calls the actual program as appropriate, in this case the "git-status" program.
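To back up the earlier claim about the built-in svn tools, here is a sketch of pulling an svn project into git (the URL is a placeholder, and the git-svn tool must be installed):

> git svn clone http://svn.example.com/myforth/trunk myforth
> cd myforth
> git log --oneline

"git svn clone" replays each svn revision as a git commit, so the whole svn history ends up in the local repo - no more being stuck when the svn server is down.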

More details

"git" is different in one way from all other version control systems, it uses "sha1" hash strings to define just about everything. When we commited the files, it computed the hash for the entire set of files it could see and uses that sha1 string as the ID of that commit and the file structure at that point in time. When you want to return to a given point in the history of the repo, you find the sha1 as displayed in the log and use it to "checkout" the repo to the same structure as it was when it was commited. You can use the sha1 of a given file's version to see what it looked like at the time of a specific commit. You can compare two files from different times by doing a "git diff 4321abf dcf1234 filename". You do not need to use the full 31 character sha1, a shorter 7 characters will work. Be advised however that git does not track empty directories, it only tracks files (links are treated as files). You must put a file in the directory for git to track it, I use "touch .dir" to track empty directories.

"git" is a bit unusal in another sense by making each repo a mix of personal settings and with data from the original repo. By this I mean a cloned repo is composed of two sets of internal structures in the ".git" directory, those from the clone source and those from your systems settings. The data objects are from the cloned source and represent the files in a compressed set of "packs" which were downloaded from the original source. The configuration settings are mostly from your local system's default settings and the commands you used to create the clone. There are some items that are not cloned unless you clearly request them from the "remote" source. This can catch the new user and cause puzzling results.

"git" branches are context switches. You enter a branched state by doing a "git checkout branch_name". At that point git will switch the reference sha1 pointer from where it was, to one that represents your branch state. If this is the first use of the branch, the sha1 will be the same as before the branch switch, or more simply put, your code is the same, it is just that git will track your changes separately from before the "checkout". Your directory tree and all the files will remain the same for a first time switch. However, if you have made "commit"s to this branch, or it is from some other user with lots of changes, the checkout operation can take several seconds as it updates all the files and directories to that branch's different structure. The normal use of branching in git is for simple or single project changes where only a few files are modified for the fix. There is no limit on number of branches so it is common to branch for any little change and when done merge the changes back to the orginal source and remove your temporay branch.

I think this is a good place to drop back and talk about how git might have been better than svn for helping out Charley and Bob. I chose to define a set of directories as the public area, or trunk, of the project. This set of files represents the basic set of modules that combined make up MyForth. Conceptually all other projects are derived from this set of modules. I felt there should be only one set of each of these files, and any updating should be seen by all other projects. This actually was not the case; some other projects could be broken by using the latest fixes in the base file set. In svn I would have to check out a select version of the entire tree, or use a different branch of the same files, to separate the good from the bad. In git it is simpler: just branch and merge or update accordingly, with your selected changes being limited to that branch. Explained with an example in article #19.

Since a git branch is a context switch, the first time Charley wanted to diverge his code, his code would still be the same after his branch checkout. The operation is pretty much the same as if he had just copied all his files to another directory. The difference comes into play as he makes his changes. If he edits half the files and doesn't edit the other half, the changed files exist only in the branch context, while the un-edited files remain the same in both the "master" or original structure and the newer branched structure. He may commit dozens of changes as he works through the set of needs, and when done maybe only a few of the changed files need to go back to the original master stream. He "merges" those changes back down to the master source, which makes them available to all future branches and the current main work. This is much closer to Charley's previous workflow and gives him much tighter control over what gets changed and when. Since every change has its own sha1 value, every change can be reverted, or the code brought back to some previous state, with one simple command. Remember, you have the entire repository at your disposal, and thus each version of everything is available to be used in some way.
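One hedged way to move just a few changes back, rather than merging the whole branch, is to pick individual commits (the sha1 is invented for illustration):

> git checkout master
> git cherry-pick dcf1234       # apply that one commit from the branch to master

"git cherry-pick" copies a single commit onto the current branch, which fits the "only some of it goes back" pattern described above.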

It is very important at this point to cover one aspect of git associated with branching, mainly "remote" or "source" branches versus private branches and your work. "git" is ideal for one-person projects or private repositories. In that case everything you do is your own private work, and the concept of pushing and pulling data means nothing to you. When we clone or share our work, branching takes on a new meaning and requires new actions on our part. Depending on how the repository was set up, when you clone you normally get all the branches as well. Some repos, however, are set up so that you see no branches unless you specifically request them. In the same sense, you can do all your work on private branches and never push any of those branches back to the remote (cloned-from) source. You may email changes, or only use the repo for tracking the source plus private changes you don't want anyone to see. All these optional ways of using git are available to you. Typically some combination of private and public branches is in play, where a set of release branches exists in the master repo which you pull down, giving you the ability to see or checkout each specific release and, if needed, make changes to just the release stream you need.
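A sketch of working with remote versus private branches (branch names invented):

> git branch -a                               # list both local and remote branches
> git checkout -b rel_1.2 origin/rel_1.2      # local branch tracking a remote release branch
> git checkout -b my_private_work             # purely local; no one else ever sees it
> git push origin my_private_work             # only now does it become public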

For the MyForth repo, I can see a set of directories in the "master" or base set of the repo. These would be base toolsets and documents, as well as a few working directories. Since there could be major differences based on the chipset, directories labeled 120 or 640 might be used. In the 120 directory we would have a complete example of MyForth for the 120 chip. A user would simply create a branch and make their changes in the 120 directory, say for a new stream of uses. As that stream matures, branches from that branch might become forks of the original work, and so on as the need arises. Don't forget that git handles branching well, so fixed files that never or rarely change (like debugging tools) could be linked from one directory to another instead of using the svn scheme I tried. I think this approach might get Charley back to the simpler form of development he craves and yet provide the protection from getting bit too often.

Getting git

The first thing that you might want to do is get git. On linux you can use the appropriate software management tool to install it if it is not already loaded, although most recent releases contain git (try "git --version" to see if it is on your system). The fun thing to do is to get git by doing a "clone" of it. Try this command, keeping in mind that it creates a directory where you run the command.


git clone http://git.kernel.org/pub/scm/git/git.git
The on-line book "pro-git" is great, even better if you buy a copy. We have found at work that viewing some of the references, videos, and talks listed below really helps a user come up to speed fast. There is so much material on-line covering git, at every user level, that I feel whatever skill level you are at, you should be able to master git in short order.

For a more detailed step-by-step exploration of git, follow along as I develop the new MyForth repo by reading my article #19.

Links


Kibler Electronics, PO Box 535, Lincoln, CA 95648-0535, USA.
Email: bill@kiblerelectronics.com
Copyright © 2011, Kibler Electronics