2016-05-13

Automatically keeping a GitHub fork up to date

We recently set up a departmental GitHub account for Hutton ICS, and one of the things we'll use this for is to showcase projects which ICS staff are contributing to - such as Biopython in my case.

To start with we have forked https://github.com/biopython/biopython as https://github.com/huttonics/biopython which we'll use as a read-only mirror - but now we want to keep it up to date with commits pushed to the upstream repository.

How can we automatically mirror the upstream repository? Enter GitHub Deploy Keys, which grant read/write access on a per-repository basis, and which a cron job can use to push changes to our mirrored git repository.

My plan requires an online server where we can set up a cron job which will essentially run:

$ git checkout master
$ git pull --ff-only origin master
$ git push mirror master --tags

(Update: Original version did not push new tags to the mirror)

This assumes you've left the default remote (origin) as the upstream repository you want to pull from, and added mirror as the downstream remote you want to push to. In this example, I did:

$ git clone https://github.com/biopython/biopython.git
$ cd biopython
$ git remote add mirror git@github.com:HuttonICS/biopython.git
$ git fetch mirror
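
To try the pull-then-push cycle without touching GitHub, here is a minimal offline sketch in which two local bare repositories stand in for the upstream and mirror repositories. The temporary directory layout, the "Demo" commit identity, and the demo-1.0 tag are all made up for illustration; only the last three git commands before the check come from the post.

```shell
#!/bin/bash
# Offline sketch: local bare repositories play the roles of the upstream
# (biopython/biopython) and mirror (HuttonICS/biopython) GitHub repositories.
# All paths, the commit identity, and the tag name are hypothetical.
set -e
demo=$(mktemp -d)

git init --quiet --bare "$demo/upstream.git"
git init --quiet --bare "$demo/mirror.git"
# Make sure a clone of upstream checks out master, whatever the local default is.
git -C "$demo/upstream.git" symbolic-ref HEAD refs/heads/master

# Seed upstream with one tagged commit from a scratch working copy.
git init --quiet "$demo/seed"
cd "$demo/seed"
git symbolic-ref HEAD refs/heads/master
git -c user.name=Demo -c user.email=demo@example.org \
    commit --quiet --allow-empty -m "Initial commit"
git tag demo-1.0
git push --quiet "$demo/upstream.git" master --tags

# The working clone a cron job would use: origin = upstream, mirror = push target.
git clone --quiet "$demo/upstream.git" "$demo/work"
cd "$demo/work"
git remote add mirror "$demo/mirror.git"

# The sync step itself, as in the post.
git checkout --quiet master
git pull --quiet --ff-only origin master
git push --quiet mirror master --tags

# Both repositories should now report the same commit for master.
git -C "$demo/upstream.git" rev-parse master
git -C "$demo/mirror.git" rev-parse master
```

The --ff-only flag makes the pull fail rather than create a merge commit if the local clone has somehow diverged, which is the behaviour you want for a read-only mirror.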

Given we're mirroring a public open source project, the fetch doesn't need any special permissions. However, we do need write access to push to our mirror repository. You could do this with a pass-phrase-less personal SSH key associated with your user's account, or with a GitHub account set up just for the script, but a GitHub Deploy Key seems the best option.

So, we'll make a new pass-phrase-less RSA SSH key just for this mirroring task:

$ ssh-keygen -t rsa -b 4096 -C "biopython key" -f biopython_key -N ""
Generating public/private rsa key pair.
Your identification has been saved in biopython_key.
Your public key has been saved in biopython_key.pub.
The key fingerprint is:
ce:9d:c4:de:aa:63:02:10:fe:a8:25:b6:ec:37:b5:dc biopython key
The key's randomart image is:
+--[ RSA 4096]----+
|                 |
|  .              |
| . .             |
|  o      .       |
|   +    S o      |
|..o o. o + o     |
|o+. o.o o + .    |
|.o o o.Eo  .     |
|... .  o.o.      |
+-----------------+

Then go into the GitHub settings for the mirror repository, and add this as a deploy key with write access enabled (copy and paste the biopython_key.pub file contents).



If this were a personal account, you could configure which SSH key to use with GitHub via your ~/.ssh/config file. At the command line, the easiest way seems to be the $GIT_SSH environment variable, which points at the binary or shell script to use in place of the default ssh command. So we have a simple shell script named mirror_ssh:

#!/bin/bash
# Call ssh using our GitHub repository deploy key (set via -i),
# using -F /dev/null to make sure this ignores ~/.ssh/config.
# "$@" forwards git's arguments to ssh unchanged.
ssh -i /path/to/deploy_key -F /dev/null -p 22 "$@"

The basic task script becomes:

$ export GIT_SSH=./mirror_ssh
$ git checkout master
$ git pull --ff-only origin master
$ git push mirror master --tags
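
As an alternative to a wrapper script, Git 2.3 and later also read a GIT_SSH_COMMAND environment variable containing the whole ssh command line. The sketch below assumes the private key lives at ~/cron/biopython_key (the path used in the crontab example in this post); the IdentitiesOnly option is an extra not used in the original wrapper.

```shell
# Hypothetical key location; substitute the real path to the private deploy key.
deploy_key="$HOME/cron/biopython_key"

# -F /dev/null ignores ~/.ssh/config, and IdentitiesOnly=yes stops ssh from
# offering any other keys (for example ones held by an ssh-agent).
export GIT_SSH_COMMAND="ssh -i $deploy_key -F /dev/null -o IdentitiesOnly=yes"

# The sync commands are then unchanged:
#   git checkout master
#   git pull --ff-only origin master
#   git push mirror master --tags
```

Git passes this value to the shell, so a key path containing spaces would need extra quoting inside the string.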

I wanted to be able to extend this to mirroring multiple repositories, each of which could (and perhaps should) have its own unique GitHub Deploy Key. I'll set up local git repositories and keys for each, and do the sync via a master script mirror_git (see mirror_git gist) taking the git folder and deploy key file path as arguments.

Then I add multiple calls to mirror_git to cron, one for each repository, e.g.

$ crontab -l
0 * * * * ~/cron/mirror_git ~/cron/biopython ~/cron/biopython_key

As written, mirror_git takes two arguments: the directory name where the temporary git repository is, and the location of the (private) SSH key used to push to the mirror repository. It does some minimal sanity checking before pulling and pushing to GitHub.
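
The gist itself is not reproduced here. As a rough sketch of what a two-argument helper with minimal sanity checks might look like, consider the following. The function name mirror_sync, the error messages, the exit codes, and the use of GIT_SSH_COMMAND (Git 2.3+) instead of a wrapper script are all assumptions for illustration, not the author's code.

```shell
#!/bin/bash
# Hypothetical sketch of a per-repository sync helper; NOT the mirror_git gist.
# Assumes the clone has remotes named "origin" (upstream) and "mirror".

mirror_sync() {
    if [ "$#" -ne 2 ]; then
        echo "Usage: mirror_sync /path/to/repo /path/to/deploy_key" >&2
        return 2
    fi
    local repo_dir="$1" key_file="$2"

    # Minimal sanity checks before touching the network.
    if [ ! -d "$repo_dir/.git" ]; then
        echo "ERROR: $repo_dir is not a git working directory" >&2
        return 1
    fi
    if [ ! -f "$key_file" ]; then
        echo "ERROR: deploy key $key_file not found" >&2
        return 1
    fi

    # Run in a subshell so the caller's directory and environment are untouched.
    (
        cd "$repo_dir" || exit 1
        # Single quotes protect spaces in the key path (but not single quotes).
        export GIT_SSH_COMMAND="ssh -i '$key_file' -F /dev/null -o IdentitiesOnly=yes"
        git checkout --quiet master &&
        git pull --quiet --ff-only origin master &&
        git push --quiet mirror master --tags
    )
}

# Example call, using the paths from the crontab entry above:
#   mirror_sync ~/cron/biopython ~/cron/biopython_key
```

Chaining the git commands with && means a failed pull (for example, a non-fast-forward) stops the run before anything is pushed.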

These cron jobs are running on an existing server under a non-user account (without any admin privileges).

Update (20 May 2016)


In testing when there were no new commits, everything looked fine - the deploy key seemed to be working. But now there are some upstream changes, as shown by a dry-run:

$ export GIT_SSH=./mirror_ssh
$ git push mirror master --dry-run
To git@github.com:HuttonICS/biopython.git
   1517344..c5b3309  master -> master

However, actually pushing the changes gives a novel git failure I've not seen before:

$ export GIT_SSH=./mirror_ssh
$ git push mirror master
Counting objects: 45, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (45/45), done.
Writing objects: 100% (45/45), 5.73 KiB | 0 bytes/s, done.
Total 45 (delta 34), reused 0 (delta 0)
remote: fatal error in commit_refs
To git@github.com:HuttonICS/biopython.git
 ! [remote rejected] master -> master (failure)
error: failed to push some refs to 'git@github.com:HuttonICS/biopython.git'

Adding --verbose didn't reveal any clues. Using ssh -v in the wrapper script confirmed the deploy key was accepted, and that some data was sent.

If we deliberately don't use the deploy key, even the dry-run will fail:

$ unset GIT_SSH
$ git push mirror master 
Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

$ git push mirror master --dry-run
Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

GitHub seemed to be working (pushing to other repositories as myself worked, and there were no problems reported on the GitHub Status page).

Update (23 May 2016)

A few other people also hit this "remote: fatal error in commit_refs" problem around the same time, e.g. Tomas Skogberg and Mona Jalal, and neither of them was trying to use a Deploy Key. This looks like a rare but more general problem at GitHub, so I have reported it to them.

Update (23 May 2016)


GitHub replied that this does look like a problem at their end, and they are looking into it.

Update (1 June 2016)


GitHub confirmed yesterday that this was due to their protected branch settings, and have updated their system to give a much more useful error message:

$ git push mirror master
Counting objects: 391, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (391/391), done.
Writing objects: 100% (391/391), 99.28 KiB | 0 bytes/s, done.
Total 391 (delta 298), reused 0 (delta 0)
remote: error: GH006: Protected branch update failed for refs/heads/master.
remote: error: You're not authorized to push to this branch. Visit https://help.github.com/articles/about-protected-branches/ for more information.
To git@github.com:HuttonICS/biopython.git
 ! [remote rejected] master -> master (protected branch hook declined)
error: failed to push some refs to 'git@github.com:HuttonICS/biopython.git'

The master branch on the mirror repository was protected (to prevent force-pushes etc.), and "Restrict who can push to this branch" was also ticked because I wanted to avoid any accidental updates. That had the perhaps unexpected side effect of preventing use of a Deploy Key. Undoing this restriction has fixed my automated deployment.


Update (12 July 2016)

I'm told that after some internal discussions, GitHub will now allow a Deploy Key to be used on a protected branch ignoring the "Restrict who can push to this branch" setting. Since you need Admin permissions to create the key, this seems better to me.

Update (9 October 2017)

Added --tags argument to the push command.

Update (2 February 2018)

Added --tags argument to the fetch command (in case any tags were changed).

Update (17 August 2018)


I've posted the mirror_setup script I used to simplify adding more repositories to the collection we mirror under HuttonICS. It is hard coded with HuttonICS as the mirror account username, but may be useful anyway.

2 comments:

  1. Interestingly, keeping a fork of a project with a significant number of modifications is quite difficult because you often can't simply pull from master with ease. Every time you try to pull from the master branch, it makes you re-resolve many of the merge conflicts that you have already solved. Some ancient git wizardry called "git rerere" can help, but otherwise the better solutions are just better software engineering and more careful forking.

    1. Interesting - Here I'm focusing on the easy case where the mirror will have no local changes, but thanks for mentioning the "git rerere" command, which I might one day have a need for.
