Git Repo (Resuming) Tarballs
At various points the sysadmins have received requests to make repository tarballs available. The idea is that some repositories can be very large; I believe test runs of the KOffice conversion produced a repository that is 350MB in size. Once you have this on your system, updating with new refs is relatively fast, but what about that first download? What if you’re on a slow dial-up link somewhere (as some of our contributors are) and/or only have access to the Internet sporadically?
A solution to this is to provide users with repository tarballs that contain a snapshot of the repository. Once downloaded and expanded, you have a viable git clone that is pre-configured to fetch new refs from anongit.kde.org. To keep the tarball as small as possible, it doesn’t include a checked-out working tree, only the .git directory and a tiny init script that will a) delete itself and b) check out the working tree. So: run the init script, then run a pull, and you’re done.
Now, the idea is to be able to start a download at one time and finish it later. The key here is that HTTP file transfers are resumable, so if you start downloading the tarball you can pick up where you left off. However, there was a question: how do we show a consistent link on Projects without a priori knowledge of the tarball’s characteristics (since we can’t know any from within Redmine)? In Redmine, all we could really do without massive hackery is display a statically named tarball; in this case, the name always ends in “-latest.tar.gz”. You can see this at https://projects.kde.org/projects/playground/network/aki/repository or any other project’s repository tab; note the “Tarball” checkout method.
This means, however, that the client needs some way of knowing whether it is resuming the same tarball or whether the tarball has been regenerated since. An easy way to do this is to use the If-Unmodified-Since header; however, the web server currently in use doesn’t support it (I asked, and the server’s author didn’t believe the header was widely used; I pointed him to cURL/libcurl). The other problem with this approach is that it’s not very visible to the user.
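To make the header approach concrete, here is a minimal Python sketch of the request headers a resuming client could send. It assumes the client has remembered how many bytes it already holds and the modification time of the file it started downloading; the function name and its parameters are hypothetical and are not taken from any particular download tool.

```python
from email.utils import formatdate


def resume_headers(bytes_on_disk, last_modified_epoch):
    """Build headers for a conditional resume request (illustrative only).

    bytes_on_disk: size of the partial file already downloaded.
    last_modified_epoch: modification time (seconds since the epoch, UTC)
        of the remote file when the download first started.
    """
    if bytes_on_disk < 0:
        raise ValueError("bytes_on_disk must not be negative")
    return {
        # Ask only for the part of the file we do not have yet.
        "Range": "bytes=%d-" % bytes_on_disk,
        # Ask the server to refuse the request if the file changed since then.
        "If-Unmodified-Since": formatdate(last_modified_epoch, usegmt=True),
    }
```

A server that honours If-Unmodified-Since answers with 412 Precondition Failed when the file has changed, so the client knows its partial download is stale instead of silently appending bytes from a different tarball.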
So, I first changed the script so that the generated tarballs are no longer actually named *-latest.tar.gz; instead they get more descriptive names, and a -latest.tar.gz symlink points to the actual tarball. I then wrote a simple web service that the web server proxies to whenever a “-latest.tar.gz” file is requested. This service resolves the symlink and redirects the client, so the user sees immediately what the name of the actual file is. For instance, right now a request for the aki repository tarball via http://anongit1.kde.org/aki/aki-latest.tar.gz forwards to http://anongit1.kde.org/aki/aki_20101202050239_sha1-5f866b3a42872f8fd54adcf30bcb8a3a79d02542.tar.gz
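The symlink-resolving step could look roughly like the following Python sketch. This is not the actual service, only an illustration of the idea; the function name, the document-root argument, and the extra check that the target stays inside the document root are all assumptions.

```python
import os


def redirect_target(doc_root, url_path):
    """Return the URL path a '-latest.tar.gz' request should be redirected to.

    Returns None when the request is not for a '-latest.tar.gz' symlink,
    or when the symlink points outside doc_root.
    """
    if not url_path.endswith("-latest.tar.gz"):
        return None
    root = os.path.realpath(doc_root)
    link = os.path.join(root, url_path.lstrip("/"))
    if not os.path.islink(link):
        return None
    # Follow the symlink to the descriptively named tarball.
    real = os.path.realpath(link)
    # Assumed safety check: never redirect to a file outside the served tree.
    if os.path.commonpath([root, real]) != root:
        return None
    return "/" + os.path.relpath(real, root).replace(os.sep, "/")
```

The web service would then answer with an HTTP redirect whose Location is the returned path, which is what lets the user see the real filename right away.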
Note the components of that filename: the name, the date/time stamp (to the second), “sha1” indicating the algorithm to check the hash against, and the hash of the file for easy verification of the completed download. This should make it easy for both the client and the user to see whether it’s the same file, and easy for the user to verify the integrity of the complete download, since they can check the sha1sum against the filename itself.
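As a rough illustration of that check, the Python sketch below splits a tarball filename into its components and compares the embedded hash with the hash of the downloaded file. The regular expression is inferred from the single example filename above, so treat the exact pattern (and the function names) as assumptions.

```python
import hashlib
import os
import re
from datetime import datetime

# Pattern inferred from the example: <name>_<YYYYMMDDHHMMSS>_<algo>-<hexdigest>.tar.gz
NAME_RE = re.compile(
    r"^(?P<name>.+)_(?P<stamp>\d{14})_(?P<algo>[a-z0-9]+)-(?P<digest>[0-9a-f]+)\.tar\.gz$"
)


def parse_tarball_name(filename):
    """Split a tarball filename into (name, timestamp, algorithm, digest)."""
    match = NAME_RE.match(filename)
    if match is None:
        raise ValueError("unrecognised tarball name: %r" % filename)
    stamp = datetime.strptime(match["stamp"], "%Y%m%d%H%M%S")
    return match["name"], stamp, match["algo"], match["digest"]


def verify_tarball(path):
    """Return True if the file's hash matches the digest in its own name."""
    _, _, algo, digest = parse_tarball_name(os.path.basename(path))
    hasher = hashlib.new(algo)
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large tarballs do not need to fit in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == digest
```

A download tool could also compare the parsed name against the one it saw when the download started, and refuse to resume if the redirect now points somewhere else.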
I think this provides a pretty nice solution to the problem.
Most resuming clients send an ‘If-Match’ header, so make sure the server sets an ETag header in the response (using sha1 would be great). This is how HTTP clients don’t accidentally resume the wrong file.
If the ETag doesn’t match, then your solution above is still required for the client to cancel the download and find the correct file by hand.
Craig
Nginx doesn’t support ETags, and they’re also not recommended except in specific instances. See http://markmail.org/message/xiungpgciwvocl4w and http://nginx.org/pipermail/nginx/2009-April/011382.html and http://developer.yahoo.com/performance/rules.html#etags
At my inquiry the nginx developers have already created a patch that adds If-Unmodified-Since support, but I actually prefer my solution anyways because it makes it very clear and easy for the user to both realize that it’s a different file (instead of just the download program) and to verify the sha1sum. The If-Unmodified-Since support will be just an extra level of verification.
That’s sad that Nginx doesn’t support ETags. And I agree that using the Apache default of inode doesn’t help; I only ever use a checksum for my ETags.
My views are probably skewed by the scale I’m used to operating at; I assumed there would be multiple servers and/or failover. I’ve been bitten so often by a single server with clock skew that I hate basing things on wall-clock. Dates also make me think about time zones, ugh.
Also, be careful about users who download part of -latest.tar.gz and come back for the rest a month later. They’re probably not going to have cached the redirect, but will try -latest.tar.gz again, which now points at a new file.
Craig
Currently there is one server; I don’t really see the issue here with multiple servers/failover, or time zones. It doesn’t matter the exact time it was generated (in terms of time zone), only that it’s different than the next time it was generated.
Ultimately it’s up to the user to verify the sha1sum and use a decent method of downloading it. The good news is that later this month Nginx will release a version with support for If-Unmodified-Since, so even downloaders that can break with the current scheme (sadly, including KGet) should then be fully supported.
Thanks for this. It really helps us without good Internet access.