Archiving Changed Files in Git

Intro – Back to the Command Line

Believe it or not my favorite operating system was always VMS (later, OpenVMS), partly because of the clear and elegant online HELP facility, and partly due to the clear and elegant DCL command language.  I’ve used various flavors of Unix/Linux, including AIX, but while I got my work done, somehow I never quite got the hang of the MAN page format.  HELP in VMS always provides numerous examples and I found it much easier to wrap my head around them than the arcane lists of parameters favored on the *ix platforms.

In my current work, we’re using git for source control.  Git is a distributed version control system (DVCS) that is used on a wide variety of proprietary and open-source projects including the massive Linux kernel.  For more on why we like Git, please see the article “Why Git is better than X” (where X is any non-DVCS system).  Github.com is a very nice hosted git service with many features — issue tracking and wiki among them — that make for a very nice set of software development tools.

Most people using git on windows use the freely-available msysgit.  It has some GUI tools, but is mostly a command-line implementation, and that command-line is the same bash shell that is used on most linux systems.  Thus, I find myself back in the world of command-line interfaces for much of my day-to-day work — and I have to say that I am finding a lot of power and utility in the command line version.  And unlike back in my VMS days, I can find help in a lot more places than from MAN pages — with Google, I can simply search for a string, and immediately have several examples to use to learn a particular feature or function.

By the way, to learn more about git, try the Pro Git book (free online but please donate to the author if you like and use it).  You can also refer to these instructions for getting started with git and Github.com.

Files in the Cloud

This week we moved all of our graphics to the cloud — storage on RackSpace’s Cloud Files service and content delivery via a content delivery network (CDN).  We put all the images in a separate git repository (or ‘repo’ as the kids say), and can check in (‘commit’) versions as we add and change images.  Thus, we can deploy new images to cloud storage without changing the web code, and make the images available on the CDN.  A side benefit is that at scale, all image serving is kept off of the main web servers, leaving more performance capacity for the actual web traffic.

As I was setting this up, I quickly discovered the git archive command, which creates a nice zip or tar archive of files in your repository.  However, I wasn’t able to create an archive containing only the changes — what files were new or changed since the last deployment?  Right now, we have only a few hundred images and I could easily deploy the whole set each time, but that seemed both wasteful and risky — in a production system, I only like to change what needs to be changed, sort of a ‘minimally-invasive’ deployment strategy for software.

So after many Google searches, and much experimentation on the command line, I came up with the following process:

Initial Deployment

  • Commit your images to git. When you are ready to deploy, create a tag (I use tag names like “FirstDeployToCDN” but use what you like). Create an archive of the entire repository based on the tag:

$git archive --format tar -o FirstDeployToCDN.tar FirstDeployToCDN

This says to do the following:

  • git archive —format tar — Create an archive using the ‘tar’ format
  • -o FirstDeployToCDN.tar — Call the archive FirstDeployToCDN.tar
  • FirstDeployToCDN — Create the archive based on what is in the FirstDeployToCDN tag

In other words, this command creates a file called FirstDeployToCDN.tar containing all the files in FirstDeployToCDN tag.

Tip: if you want the archive name to contain a datestamp, you can use a command line trick to insert a formatted date string.  The bash ‘date’ command can be used with a format string, and you can insert those results in another command using the back-tick delimeters “`” (backwards apostrophe), to wit:

$git archive --format tar -o FirstDeployToCDN-`date +%Y-%m-%dT%H%M%Z`.tar FirstDeployToCDN

This would create a file called something like  FirstDeployToCDN-2010-08-26-2010-08-28T1114EDT.tar, which is a mouthful, but also contains info about when it was created.

  • Anyway, once you have your archive file, you can back it up, or extract the files to your server or staging area, then copy the files to the CDN/Cloud storage location.  For the RackSpace Cloud Files service, I find FireUploader to be very effective and efficient.

Uploading Changes (the point of this article)

  • Make changes to, add, and remove images, commiting as you go.  When you are ready to deploy an incremental set of images, you need to commit everything, adding a tag if you wish.  Remember, you want the changes since the last deployment (i.e., since that last tag, in our case FirstDeployToCDN).  The current state of commited files is represented in git by “HEAD”, so here’s the command I use to pull out the changes since the FirstDeployToCDN tag
$git archive --format tar -o cdn-diff-FirstDeployToCDN-`date +%Y-%m-%dT%H%M%Z`.tar HEAD `git diff FirstDeployToCDN --name-only`

So what’s this command all about?  Let’s break it down:

  • git archive —format tar — Ok, we’re creating a tar-format archive from git
  • -o cdn-diff-FirstDeployToCDN-`date +%Y-%m-%dT%H%M%Z`.tar — The output file will be called ‘cdn-diff-” plus the name of our prior tag, plus the current timestamp including timezone, plus the “.tar” extension.
  • HEAD — this says we’re pulling files out of the current latest commit in the current branch — the latest updates
  • `git diff FirstDeployToCDN –name-only` — this is a bit more command-line trickery, delimited with back-ticks (backwards apostrophes — ` not ‘).  This says, “Git, please give me a list of all changes between the current repository state (aka HEAD) and the set of files represented by the git tag FirstDeployToCDN” — that is, the changed and added files.  That list is, in turn, passed to the git archive command as a list of files to put in the archive.
  • So now you have a new file, called something like cdn-diff-since-FirstDeployToCDN-2010-08-28T1215EDT.tar.  It contains all the changed and added files since the FirstDeployToCDN tag, as of  August 28 at 12:15 EDT.

Final Notes

Of course this is a lot of typing, so I’ve created a script to handle most of this.  Please note that the script needs to be run from within your CDN repository, and assumes you have committed everything so that you want a diff between the specified tag and HEAD.  No warranty implied, as is, use at your own risk, YMMV, comments and suggestions welcome.

# make-cdn-deployment-file.sh
#
# Creates a dated tar file archive containing all changes between HEAD and
# a specified a git tag.
# If this code works, it is Copyright 2010 Carl Leubsdorf, Jr., Solvitor.com ► Problem Solved.
# If not, author is unknown.
#

if [[  -n $1 ]]; then

    TAG=$1

else

    echo "usage: $0 TAG-NAME"

    exit

fi

echo "Creating archive file for changes since tag $TAG

Archive Name: cdn-diff-$TAG-`date +%Y-%m-%dT%H%M%Z`.tar

Files:

`git diff $TAG --name-only`

"

git archive --format tar -o cdn-diff-$TAG-`date +%Y-%m-%dT%H%M%Z`.tar HEAD `git diff $TAG --name-only`

 

3 thoughts on “Archiving Changed Files in Git

  1. Thanks for the comment Lionel.

    You wrote:

    <<
    I just wanted to know how do you handle deleted files? For example, a renamed image between HEAD and the tag would echo that error :

    fatal: path not found: image.png
    >>

    In our case, we are publishing content to a public-facing web site, so as a policy we never rename or delete images, CSS, etc. We may use a new image (for example, a logo) or a new .css file, but in every case we leave the old version in place. In theory that could lead to a lot of clutter, but in practice it avoids broken links and image tags — it’s just too easy to miss an updateotherwise.

    <<
    Finally, I would suggest a gzipped version of the archive:
    git archive –format tar HEAD^{tree} `git diff $TAG –name-only` | gzip >cdn-diff-$TAG-`date +%Y-%m-%dT%H%M%Z`.tar.gz
    >>

    Great idea particularly if the fileset is very large. In our case, the number of changed files (and the size of the archive) tends to be quite small.

    Thanks again for your response.

    -C

  2. Great script! Thank you!

    I actually wrote a whole script to do what you’ve almost done in one single line… (https://github.com/lpointet/git_package)

    I just wanted to know how do you handle deleted files? For example, a renamed image between HEAD and the tag would echo that error :
    fatal: path not found: image.png

    Maybe you don’t have to handle those files, but I wanted such a script to help me in my deployments. So I have to delete files on the production server when I update it.

    Finally, I would suggest a gzipped version of the archive, without the PAX header:
    git archive –format tar HEAD^{tree} `git diff $TAG –name-only` | gzip >cdn-diff-$TAG-`date +%Y-%m-%dT%H%M%Z`.tar.gz

    Even better, a version which could take 2 arguments : the old tag and the new one (HEAD by default, but could be a more recent tag or other commit ref).

    Regards.

Leave a Reply to NCook Cancel reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.