2024-06-07

FASTQ uploads to ENA FTP site with rclone

I've recently been working on what I considered to be a large-scale FASTQ upload to the European Nucleotide Archive (ENA), from where it will be mirrored to the NCBI Sequence Read Archive (SRA). Although the total size was "only" 37GB, this was about 3500 pairs of Illumina MiSeq FASTQ files - more than enough to make me worry about the job being interrupted and needing to resume without repeating uploads.

A colleague had shared a shell script for bulk uploads to the ENA FTP site using curl, which took care of multiple retries and was intended to be submitted to our cluster's long queue, where it could work away quietly through a list of FASTQ files. In hindsight, that would have sufficed, as my full upload only took an afternoon.

However, being pessimistic, I was looking for something like curl's --no-clobber switch but for uploads - curl only supports this for downloads. There didn't seem to be anything built into wget either, so I started an internet search for bulk FTP upload or synchronisation tools. After ruling out a couple of candidates as non-trivial to install, I settled on a solution: rclone.

The website makes it clear that rclone is focused on syncing files to cloud or virtual filesystems, but it supports a range of protocols including SFTP, HTTP, and good old FTP - which was what I needed for the ENA upload. Working through their documentation I could solve my task in just a few commands (and just one command next time round).

Installing rclone

First, I installed rclone via conda, thanks to the conda-forge channel (it is also available from standard Linux package repositories):

$ conda install rclone
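
A quick sanity check that the install worked is to ask rclone to print its version and build details:

$ rclone version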

Configuring a target

Next I defined a target configuration called ENA, where $ENAUSER was something like Webin-12345 (not my actual username) and $ENAPASS was my ENA password:

$ export ENAUSER="Webin-12345"
$ export ENAPASS="my-ftp-password"
$ rclone config create ena ftp host webin2.ebi.ac.uk port 21 \
  user "$ENAUSER" pass $(rclone obscure "$ENAPASS")
[ena]
type = ftp
host = webin2.ebi.ac.uk
port = 21
user = Webin-12345
pass = *** ENCRYPTED ***

If you've never used rclone before, there is a one-off warning about using default values without a $HOME/.config/rclone/rclone.conf configuration file (omitted here for clarity).

As this is a one-off manual step, you can of course type the username and password into the command directly, but expressing it with environment variables seemed clearer to me as documentation. The beauty of this approach is that hereafter you don't need to supply your FTP username and password for each action.

Note that the command to turn your password into an obscured token is non-deterministic; it seems to generate a different token for the same input on successive runs.
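
At this point you can check the target works (without being prompted for credentials) by listing the top level of the FTP account - an empty listing just means nothing has been uploaded yet, whereas a login problem shows up as an error:

$ rclone lsd ena: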

Copying files

The one-line solution in action, here copying all the FASTQ files under raw_data/ to the default directory on my FTP account:

$ rclone copy --copy-links --include "*.fastq.gz" raw_data/ ena: -v

The -v is optional, for verbose output. Adding --copy-links is important if the input folder actually contains symlinks to the FASTQ files rather than the files themselves. The --include argument avoids copying any non-FASTQ files present. Using --dry-run first is wise too.
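
For example, a trial run against the ena target configured above would look like this - the flags are unchanged, and with --dry-run nothing is actually transferred:

$ rclone copy --dry-run --copy-links --include "*.fastq.gz" raw_data/ ena: -v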

In my case I had a sub-directory for each Illumina MiSeq run, and preserved this with a loop:

$ for PLATE in plate_*; do \
echo $PLATE; rclone copy --copy-links --include "*.fastq.gz" \
$PLATE/ ena:$PLATE/ -v; \
done

Importantly, I could re-run these commands (e.g. if interrupted, or if I wanted to add a few more files at the last minute), and nothing gets overwritten or re-uploaded. As it happened, the only interruption was me aborting the run a few minutes in, to confirm it was working as intended by logging into the FTP server.
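
For extra reassurance, rclone can also report what made it to the other end. Listing the remote is the simplest check, and a size-based comparison of one run's sub-directory seems to be about the best you can do over plain FTP, which offers no checksums (plate_01 below just stands in for one of my run folders):

$ rclone ls ena:
$ rclone check --size-only plate_01/ ena:plate_01/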

I suspect my use case here could be shortened, but I didn't manage to work out how to run the copy command without first configuring a target.
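
Possibly rclone's connection string syntax would do the trick, letting the FTP details be spelled out inline rather than saved as a named target first. Something like the sketch below might work, though I have not tried it myself:

$ rclone copy --copy-links --include "*.fastq.gz" raw_data/ \
  ":ftp,host=webin2.ebi.ac.uk,user=$ENAUSER,pass=$(rclone obscure "$ENAPASS"):" -v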
