To start with we have forked https://github.com/biopython/biopython as https://github.com/huttonics/biopython which we'll use as a read-only mirror - but now we want to keep it up to date with commits pushed to the upstream repository.
How can we automatically mirror the upstream repository? Enter GitHub Deploy Keys, which we can use to grant read/write access on a repository basis - which a cron job can use to push changes to our mirrored git repository.
My plan requires an online server where we can set up a cron job which will essentially run:
$ git checkout master
$ git pull --ff-only origin master
$ git push mirror master --tags
(Update: Original version did not push new tags to the mirror)
This assumes you've left the default remote branch (origin) as the upstream repository you want to pull from, and added mirror as the downstream mirror you want to push to. In this example, I did:
$ git clone https://github.com/biopython/biopython.git
$ cd biopython
$ git remote add mirror git@github.com:HuttonICS/biopython.git
$ git fetch mirror
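The --ff-only flag used above means the pull does nothing unless the local branch can simply be fast-forwarded. As a rough sketch (the function name can_fast_forward is my own invention, not part of git), the same condition can be tested up front with git merge-base --is-ancestor, which exits with status 0 when the first commit is an ancestor of the second:

```shell
#!/bin/bash
# Hypothetical helper: succeed if ref $1 can be fast-forwarded to ref $2,
# i.e. $1 is an ancestor of (or the same commit as) $2.
can_fast_forward() {
    git merge-base --is-ancestor "$1" "$2"
}

# Example usage inside the clone, after "git fetch origin":
#   if can_fast_forward master origin/master; then
#       git merge --ff-only origin/master
#   else
#       echo "History diverged; needs manual review" >&2
#   fi
```

If the upstream history has been rewritten the check fails, which is the case where a person should look before anything gets mirrored.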
Given we're mirroring a public open source project, the fetch doesn't need any special permissions. However, we do need write access to push to our mirror repository. You could do this with a pass-phrase-less personal SSH key associated with your user account, or with a GitHub account set up just for the script, but a GitHub Deploy Key seems the best option.
So, we'll make a new pass-phrase-less RSA SSH key just for this mirroring task:
$ ssh-keygen -t rsa -b 4096 -C "biopython key" -f biopython_key -N ""
Generating public/private rsa key pair.
Your identification has been saved in biopython_key.
Your public key has been saved in biopython_key.pub.
The key fingerprint is:
ce:9d:c4:de:aa:63:02:10:fe:a8:25:b6:ec:37:b5:dc biopython key
The key's randomart image is:
+--[ RSA 4096]----+
| |
| . |
| . . |
| o . |
| + S o |
|..o o. o + o |
|o+. o.o o + . |
|.o o o.Eo . |
|... . o.o. |
+-----------------+
Then go into the GitHub settings for the mirror repository, and add this as a deploy key with write access (copy and paste the contents of the biopython_key.pub file).
If this was a personal account, you could configure which SSH key to use with GitHub via your ~/.ssh/config file, but to do this at the command line seems easiest via the $GIT_SSH environment variable which points at the binary or shell script to use in place of the default ssh command. So we have a simple shell script named mirror_ssh,
#!/bin/bash
# Call ssh using our GitHub repository deploy key (set via -i)
# using -F to make sure this ignores ~/.ssh/config
ssh -i /path/to/deploy_key -F /dev/null -p 22 $*
The basic task script becomes:
$ export GIT_SSH=./mirror_ssh
$ git checkout master
$ git pull --ff-only origin master
$ git push mirror master --tags
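On Git 2.3 or later, the GIT_SSH_COMMAND environment variable can take the place of a wrapper script. This is only a sketch of that alternative, not what the scripts in this post use: the helper name deploy_key_ssh_command and the key path are placeholders, and -o IdentitiesOnly=yes is my own addition so that keys held by a running ssh-agent are not offered ahead of the deploy key.

```shell
#!/bin/bash
# Hypothetical helper: print an ssh command line that uses only the given key.
# -F /dev/null ignores ~/.ssh/config, as in the mirror_ssh wrapper.
# Assumes the key path contains no spaces or shell metacharacters.
deploy_key_ssh_command() {
    local key_path=$1
    printf 'ssh -i %s -F /dev/null -o IdentitiesOnly=yes -p 22\n' "$key_path"
}

# Usage (placeholder path):
#   export GIT_SSH_COMMAND="$(deploy_key_ssh_command /path/to/deploy_key)"
#   git push mirror master --tags
```

The wrapper script approach works on older versions of git too, which may matter on a long-lived server.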
I wanted to be able to extend this to mirroring multiple repositories, each of which could (and perhaps should) have its own unique GitHub Deploy Key. I'll set up local git repositories and keys for each, and do the sync via a master script mirror_git (see mirror_git gist) taking the git folder, the deploy key file path, and a log file path as arguments:
#!/bin/bash
# Enable strict bash mode - halts on any error
set -euo pipefail
# Takes exactly three command line arguments:
git_dir=$1
ssh_key=$2
log_file=$3
# Start new log file
echo "$(date) Running git mirror script for ${git_dir}" > $log_file
if [ ! -d "${git_dir}/.git" ]; then
    echo "ERROR: The ${git_dir}/.git directory does not exist" >> $log_file
    exit 1
fi
if [ ! -f "${ssh_key}" ]; then
    echo "ERROR: The ${ssh_key} SSH key file does not exist" >> $log_file
    exit 1
fi
cd ${git_dir}
# Check the origin and mirror remotes exist
if [ ! `git remote | grep origin` ]; then
    echo "ERROR: No origin remote repository to pull from" >> $log_file
    exit 1
fi
if [ ! `git remote | grep mirror` ]; then
    echo "ERROR: No mirror remote repository to push to" >> $log_file
    echo "(Did you run 'git fetch mirror' yet?)" >> $log_file
    exit 1
fi
# Might as well check our local copy of origin is current:
echo "$(date) Fetching from origin..." >> $log_file
git fetch origin --tags >> $log_file 2>&1
if [ ! `git branch -a | grep remotes/origin/master` ]; then
    echo "ERROR: No origin/master to pull from" >> $log_file
    exit 1
fi
# Wait until setup ssh script and fetch mirror in case this is
# the first time the mirroring script has been run and no one
# has done a "git fetch mirror" yet:
#if [ ! `git branch -a | grep remotes/mirror/master` ]; then
#    echo "ERROR: No mirror/master to push to" >> $log_file
#    exit 1
#fi
# Create ssh wrapper script using the specified key, i.e.
#     #!/bin/bash
#     ssh -i /path/to/key -F /dev/null -p 22 $*
# Will use $GIT_SSH to get git to use this SSH key via -i argument,
# while -F ensures any ~/.ssh/config settings are ignored.
if [ -f "mirror_ssh" ]; then
    rm -f "mirror_ssh"
fi
touch "mirror_ssh"
echo "#!/bin/bash" >> "mirror_ssh"
echo "ssh -i ${ssh_key} -F /dev/null -p 22 \$*" >> "mirror_ssh"
chmod u+x "mirror_ssh"
# Ensure git will use ssh with our GitHub Deploy Key:
export GIT_SSH="${git_dir}/mirror_ssh"
if [ ! -x $GIT_SSH ]; then
    echo "ERROR: Failed to setup ssh wrapper script to use key" >> $log_file
    exit 1
fi
echo "$(date) Setup git ssh script" >> $log_file
# Might as well check our local copy of mirror is current,
# and possibly there has been no fetch since doing
# git remote add mirror ...
echo "$(date) Fetching from mirror..." >> $log_file
git fetch mirror --tags >> $log_file 2>&1
if [ ! `git branch -a | grep remotes/mirror/master` ]; then
    echo "ERROR: No mirror/master to push to" >> $log_file
    exit 1
fi
# Make sure we're on the master branch:
git reset --hard >> $log_file 2>&1
git checkout master >> $log_file 2>&1
# Get the latest changes from the original repository
# (Using fast forward only means this will fail if
# the git history was rewritten - which would likely
# need reviewing rather than blindly mirroring)
echo "$(date) Fetching from origin..." >> $log_file
# Already did fetch origin, could do merge --ff-only:
git pull --ff-only origin master >> $log_file 2>&1
# Push the latest changes to the master branch on our
# mirror repository (write access via GitHub Deploy Key)
echo "$(date) Pushing to mirror..." >> $log_file
git push mirror master --tags >> $log_file 2>&1
echo "$(date) GitHub mirror sync done." >> $log_file
$ crontab -l
0 * * * * ~/cron/mirror_git ~/cron/biopython ~/cron/biopython_key ~/cron/biopython.log
As written mirror_git takes three arguments: the directory where the local git repository is, the location of the (private) SSH key used to push to the mirror repository, and the log file to write to. It does some minimal sanity checking before pulling and pushing to GitHub.
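One caveat with the sanity checking: a test like git remote | grep mirror matches any remote whose name merely contains "mirror". A minimal sketch of a stricter test follows, assuming a helper I've called has_remote (it is not part of the gist), which reads the remote names on standard input and requires a whole-line match:

```shell
#!/bin/bash
# Hypothetical helper: read remote names from stdin (one per line, as printed
# by "git remote") and succeed only if one of them is exactly $1.
has_remote() {
    grep -qxF -- "$1"
}

# Usage inside the repository:
#   if ! git remote | has_remote mirror; then
#       echo "ERROR: No mirror remote repository to push to"
#       exit 1
#   fi
```

Here -x asks grep for whole-line matches and -F treats the name as a fixed string rather than a pattern.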
These cron-jobs are running on an existing server under a non-user account (without any admin privileges).
Update (20 May 2016)
In testing when there were no new commits, everything looked fine - the deploy key seemed to be working. But now there are some upstream changes, as shown by a dry-run:
$ export GIT_SSH=./mirror_ssh
$ git push mirror master --dry-run
To git@github.com:HuttonICS/biopython.git
1517344..c5b3309 master -> master
However actually pushing the changes gives a novel git failure I've not seen before:
$ export GIT_SSH=./mirror_ssh
$ git push mirror master
Counting objects: 45, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (45/45), done.
Writing objects: 100% (45/45), 5.73 KiB | 0 bytes/s, done.
Total 45 (delta 34), reused 0 (delta 0)
remote: fatal error in commit_refs
To git@github.com:HuttonICS/biopython.git
! [remote rejected] master -> master (failure)
error: failed to push some refs to 'git@github.com:HuttonICS/biopython.git'
Adding --verbose didn't reveal any clues. Using ssh -v in the wrapper script confirmed the deploy key was accepted, and that some data was sent.
If we deliberately don't use the deploy key, even the dry-run will fail:
$ unset GIT_SSH
$ git push mirror master
Permission denied (publickey).
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
$ git push mirror master --dry-run
Permission denied (publickey).
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
GitHub seemed to be working (pushing to other repositories as myself worked, and there were no problems reported on the GitHub Status page).
Update (23 May 2016)
A few other people also hit this remote: fatal error in commit_refs problem around the same time, e.g. Tomas Skogberg and Mona Jalal, and neither of them were trying to use a Deploy Key. This looks like a rare but more general problem at GitHub, so I have reported it to them.
Update (23 May 2016)
GitHub replied that this does look like a problem at their end, and they are looking into it.
Update (1 June 2016)
GitHub confirmed yesterday that this was due to their protected branch settings, and have updated their system to give a much more useful error message:
$ git push mirror master
Counting objects: 391, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (391/391), done.
Writing objects: 100% (391/391), 99.28 KiB | 0 bytes/s, done.
Total 391 (delta 298), reused 0 (delta 0)
remote: error: GH006: Protected branch update failed for refs/heads/master.
remote: error: You're not authorized to push to this branch. Visit https://help.github.com/articles/about-protected-branches/ for more information.
To git@github.com:HuttonICS/biopython.git
! [remote rejected] master -> master (protected branch hook declined)
error: failed to push some refs to 'git@github.com:HuttonICS/biopython.git'
The master branch on the mirror repository was protected (to prevent force-pushes etc), but also "Restrict who can push to this branch" was ticked (I wanted to avoid any accidental updates) which had the perhaps unexpected side effect of preventing use of a Deploy Key. Unpicking this has fixed my automated deployment.
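Since the cron job only leaves a log file behind, it may help to flag this particular failure explicitly. The following is a sketch under my own assumptions, not something the mirror_git script does: the function name push_blocked_by_branch_protection is hypothetical, and it relies on the GH006 message shown in the output above staying the same.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the given log file records a push that was
# rejected by GitHub's protected branch rules (error code GH006).
push_blocked_by_branch_protection() {
    # $1 = path to the log file written by the mirroring script
    grep -q 'GH006: Protected branch update failed' "$1"
}

# Usage (placeholder log path):
#   if push_blocked_by_branch_protection ~/cron/biopython.log; then
#       echo "Check the protected branch settings on the mirror" >&2
#   fi
```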
Update (12 July 2016)
I'm told that after some internal discussions, GitHub will now allow a Deploy Key to be used on a protected branch ignoring the "Restrict who can push to this branch" setting. Since you need Admin permissions to create the key, this seems better to me.
Update (9 October 2017)
Added --tags argument to the push command.
Update (2 February 2018)
Added --tags argument to the fetch command (in case any tags were changed).
Update (17 August 2018)
I've posted the mirror_setup script I used to simplify adding more repositories to the collection we mirror under HuttonICS:
#!/bin/bash
set -euo pipefail
# See https://blastedbio.blogspot.co.uk/2016/05/sync-github-mirror-with-cron.html and
# https://gist.github.com/peterjc/eccac1942a9709993040425d33680352 for mirroring script
#
# Usage:
#
# 1. Fork upstream repo under HuttonICS, disable wiki, projects, issues etc. Protect master branch.
# 2. Run:
#
#    ./mirror_setup.sh repo-name https://github.com/upstream-owner/repo-name.git
#
# 3. Copy and paste repo-name_key.pub into GitHub fork settings as deploy key with write permissions
# 4. Add crontab entry
#
#
# The script does this:
#
# 1. ssh-keygen -t rsa -b 4096 -C "repo-name key" -f repo-name_key -N ""
# 2. Clone upstream repo using HTTPS, cd repo-name
# 3. git remote add mirror *HuttonForkUsingGit*
# 4. git fetch mirror
name="$1"
upstream="$2"
if [ ! -f "${name}_key" ]; then
    echo "Generating ${name} SSH key"
    ssh-keygen -t rsa -b 4096 -C "huttonics/${name} deployment (Peter's iMac)" -f ${name}_key -N ""
fi
if [ ! -d "${name}/.git" ]; then
    echo "Cloning upstream ${name} repository ${upstream}"
    git clone "$upstream" "$name"
    cd $name
    git remote add mirror git@github.com:HuttonICS/${name}.git
    cd ..
fi
echo "======================================================="
echo
echo "For the GitHub deployment key:"
echo
cat ${name}_key.pub
echo
echo "Paste this into https://github.com/HuttonICS/${name}/settings/keys"
echo
echo "======================================================="
echo
echo "For the cron tab:"
echo "~/cron/mirror_git ~/cron/${name} ~/cron/${name}_key ~/cron/${name}.log"
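The repository name given as the first argument normally matches the last path component of the upstream URL, so it could be derived rather than typed twice. A small sketch of that idea, assuming the fork keeps the upstream name (the function names repo_name_from_url and mirror_url are mine, not part of mirror_setup):

```shell
#!/bin/bash
# Hypothetical helper: print the repository name from a clone URL, e.g.
# https://github.com/upstream-owner/repo-name.git gives repo-name
repo_name_from_url() {
    local url=${1%.git}          # drop a trailing ".git" if present
    printf '%s\n' "${url##*/}"   # keep only the text after the last "/"
}

# Hypothetical helper: print the SSH URL of the fork under a given owner.
# $1 = GitHub organisation or user owning the fork, $2 = upstream clone URL
mirror_url() {
    printf 'git@github.com:%s/%s.git\n' "$1" "$(repo_name_from_url "$2")"
}

# Usage:
#   git remote add mirror "$(mirror_url HuttonICS "$upstream")"
```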
Interestingly, keeping a fork of a project with a significant number of modifications is quite difficult because you often can't simply pull from master with ease. Every time you try to pull from the master branch, it will make you re-resolve many of the merge conflicts that you have already solved. Some ancient git wizardry called "git rerere" can help, but otherwise, the better solutions are just using better software engineering and more careful forking.
Interesting - Here I'm focusing on the easy case where the mirror will have no local changes, but thanks for mentioning the "git rerere" command which I might one day have a need for.