2012-08-16

Stop breaking NCBI BLAST searches!

Have you ever tried to use a BLAST database of protein sequences containing stop codons? If you work on nice model organisms with solid gene annotation maybe not. However, with draft annotations, mutation studies, or read through translation it is not unreasonable for the odd internal stop codon to appear in a protein sequence. And some translation pipelines do leave in a trailing * character. It turns out the BLAST+ suite has a rather nasty glitch with this sort of sequence.

Update: BLAST 2.2.27+ fixed this bug.

Making a test BLAST database

Here is a simple example protein FASTA file, where I have manually added some stop codons as asterisk characters to the amino acid sequence (highlighted here in red):

$ cat four_proteins_star.fasta
>original_Q9BS26 Unedited sequence
MHPAVFLSLPDLRCSLLLLVTWVFTPVTTEITSLDTENIDEILNNADVALVNFYADWCRF
SQMLHPIFEEASDVIKEEFPNENQVVFARVDCDQHSDIAQRYRISKYPTLKLFRNGMMMK
REYRGQRSVKALADYIRQQKSDPIQEIRDLAEITTLDRSKRNIIGYFEQKDSDNYRVFER
VANILHDDCAFLSAFGDVSKPERYSGDNIIYKPPGHSAPDMVYLGAMTNFDVTYNWIQDK
CVPLVREITFENGEELTEEGLPFLILFHMKEDTESLEIFQNEVARQLISEKGTINFLHAD
CDKFRHPLLHIQKTPADCPVIAIDSFRHMYVFGDFKDVLIPGKLKQFVFDLHSGKLHREF
HHGPDPTDTAPGEQAQDVASSPPESSFQKLAPSEYRYTLLRDRDEL
>tailing_stop_Q9NSY1 Added trailing * character (representing stop codon)
MKKFSRMPKSEGGSGGGAAGGGAGGAGAGAGCGSGGSSVGVRVFAVGRHQVTLEESLAEG
GFSTVFLVRTHGGIRCALKRMYVNNMPDLNVCKREITIMKELSGHKNIVGYLDCAVNSIS
DNVWEVLILMEYCRAGQVVNQMNKKLQTGFTEPEVLQIFCDTCEAVARLHQCKTPIIHRD
LKVENILLNDGGNYVLCDFGSATNKFLNPQKDGVNVVEEEIKKYTTLSYRAPEMINLYGG
KPITTKADIWALGCLLYKLCFFTLPFGESQVAICDGNFTIPDNSRYSRNIHCLIRFMLEP
DPEHRPDIFQVSYFAFKFAKKDCPVSNINNSSIPSALPEPMTASEAAARKSQIKARITDT
IGPTETSIAPRQRPKANSATTATPSVLTIQSSATPVKVLAPGEFGNHRPKGALRPGNGPE
ILLGQGPPQQPPQQHRVLQQLQQGDWRLQQLHLQHRHPHQQQQQQQQQQQQQQQQQQQQQ
QQQQQQHHHHHHHHLLQDAYMQQYQHATQQQQMLQQQFLMHSVYQPQPSASQYPTMMPQY
QQAFFQQQMLAQHQPSQQQASPEYLTSPQEFSPALVSYTSSLPAQVGTIMDSSYSANRSV
ADKEAIANFTNQKNISNPPDMSGWNPFGEDNFSKLTEEELLDREFDLLRSNRLEERASSD
KNVDSLSAPHNHPPEDPFGSVPFISHSGSPEKKAEHSSINQENGTANPIKNGKTSPASKD
QRTGKKTSVQGQVQKGNDESESDFESDPPSPKSSEEEEQDDEEVLQGEQGDFNDDDTEPE
NLGHRPLLMDSEDEEEEEKHSSDSDYEQAKAKYSDMSSVYRDRSGSGPTQDLNTILLTSA
QLSSDVAVETPKQEFDVFGAVPFFAVRAQQPQQEKNEKNLPQHRFPAAGLEQEEFDVFTK
APFSKKVNVQECHAVGPEAHTIPGYPKSVDVFGSTPFQPFLTSTSKSESNEDLFGLVPFD
EITGSQQQKVKQRSLQKLSSRQRRTKQDMSKSNGKRHHGTPTSTKKTLKPTYRTPERARR
HKKVGRRDSQSSNEFLTISDSKENISVALTDGKDRGNVLQPEESLLDPFGAKPFHSPDLS
WHPPHQGLSDIRADHNTVLPGRPRQNSLHGSFHSADVLKMDDFGAVPFTELVVQSITPHQ
SQQSQPVELDPFGAAPFPSKQ*
>internal_stop_P06213 Added two internal * characters
MATGGRRGAAAAPLLVAVAALLLGAAGHLYPGEVCPGMDIRNNLTRLHELENCSVIEGHL
QILLMFKTRPEDFRDLSFPKLIMITDYLLLFRVYGLESLKDLFPNLTVIRGSRLFFNYAL
VIFEMVHLKELGLYNLMNITRGSVRIEKNNELCYLATIDWSRILDSVEDNYIVLNKDDNE
ECGDICPGTAKGKTNCPATVINGQFVERCWTHSHCQKVCPTICKSHGCTAEGLCCHSECL
GNCSQPDDPTKCVACRNFYLDGRCVETCPPPYYHFQDWRCVNFSFCQDLHHKCKNSRRQG
CHQYVIHNNKCIPECPSGYTMNSSNLLCTPCLGPCPKVCHLLEGEKTIDSVTSAQELRGC
TVINGSLIINIRGGNNLAAELEANLGLIEEISGYLKIRRSYALVSLSFFRKLRLIRGETL
EIGNYSFYALDNQNLRQLWDWSKHNLTITQGKLFFHYNPKLCLSEIHKMEEVSGTKGRQE
RNDIALKTNGDQASCENELLKFSYIRTSFDKILLRWEPYWPPDFRDLLGFMLFYKEAPYQ
NVTEFDGQDACGSNSWTVVDIDPPLRSNDPKSQNHPGWLMRGLKPWTQYAIFVKTLVTFS
DERRTYGAKSDIIYVQTDATNPSVPLDPISVSNS*SQIILKWKPPSDPNGNITHYLVFWE
RQAEDSELFELDYCLKGLKLPSRTWSPPFESEDSQKHNQSEYEDSAGECCSCPKTDSQIL
KELEESSFRKTFEDYLHNVVFVPRKTSSGTGAEDPRPSRKRRSLGDVGNVTVAVPTVAAF
PNTSSTSVPTSPEEHRPFEKVVNKESLVISGLRHFTGYRIELQACNQDTPEERCSVAAYV
SARTMPEAKADDIVGPVTHEIFENNVVHLMWQEPKEPNGLIVLYEVSYRRYGDEELHLCV
SRKHFALERGCRLRGLSPGNYSVRIRATSLAGNGSWTEPTYFYVTDYLDVPSNIAKIIIG
PLIFVFLFSVVIGSIYLFLRKRQPDGPLGPLYASSNPEYLSASDVFPCSVYVPDEWEVSR
EKITLLRELGQGSFGMVYEGNARDIIKGEAETRVAVKTVNESASLRERIEFLNEASVMKG
FTCHHVVRLLGVVSKGQPTLVVMELMAHGDLKSYLRSLRPEAENNPGRP*PTLQEMIQMA
AEIADGMAYLNAKKFVHRDLAARNCMVAHDFTVKIGDFGMTRDIYETDYYRKGGKGLLPV
RWMAPESLKDGVFTTSSDMWSFGVVLWEITSLAEQPYQGLSNEQVLKFVMDGGYLDQPDN
CPERVTDLMRMCWQFNPKMRPTFLEIVNLLKDDLHPSFPEVSFFHSEENKAPESEELEME
FEDMENVPLDRSSHCQREEAGGRDGGSSLGFKRSYEEHIPYTHMNGGKKNGRILTLPRSN
PS
>internal_and_trailing_stop_P08100 Added internal and trailing * characters
MNGTEGPNFYVPFSNATGVVRSPFEYPQYYLAEPWQFSMLAAYMFLLIVLGFPINFLTLY
VTVQHKKLRTPLNYILLNLAVADLFMVLGGFTSTLYTSLHGYFVFGPTGCNLEGFFATLG
GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLAGWSRYIP
EGLQCSCGIDYYTLKPEVNNESFVIYMFVVHFTIPMIIIFFCYGQLVFT*KEAAAQQQES
ATTQKAEKEVTRMVIIMVIAFLICWVPYASVAFYIFTHQGSNFGPIFMTIPAFFAKSAAI
YNPVIYIMMNKQFRNCMLTTICCGKNPLGDDEASATVSKTETSQVAPA*

Now let's turn it into a BLAST+ database using makeblastdb v2.2.25+ on my Mac (but I also duplicated this whole process under Linux using the latest BLAST v2.2.26+ as well),

$ makeblastdb -version
makeblastdb: 2.2.25+
Package: blast 2.2.25, build Mar 21 2011 12:21:32


$ makeblastdb -in four_proteins_star.fasta -dbtype prot -out four_proteins_star


Building a new DB, current time: 08/16/2012 14:39:33
New DB name:   four_proteins_star
New DB title:  four_proteins_star.fasta
Sequence type: Protein
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1073741824B
Adding sequences from FASTA; added 4 sequences in 0.00264692 seconds.
 

$ ls four_proteins_star.*
four_proteins_star.fasta  four_proteins_star.phr    four_proteins_star.pin    four_proteins_star.psq

Let's use blastdbcmd to recover the FASTA sequence (and check it stored the stop codons):

$ blastdbcmd -db four_proteins_star -entry all
>gnl|BL_ORD_ID|0 original_Q9BS26 Unedited sequence
MHPAVFLSLPDLRCSLLLLVTWVFTPVTTEITSLDTENIDEILNNADVALVNFYADWCRFSQMLHPIFEEASDVIKEEFP
NENQVVFARVDCDQHSDIAQRYRISKYPTLKLFRNGMMMKREYRGQRSVKALADYIRQQKSDPIQEIRDLAEITTLDRSK
RNIIGYFEQKDSDNYRVFERVANILHDDCAFLSAFGDVSKPERYSGDNIIYKPPGHSAPDMVYLGAMTNFDVTYNWIQDK
CVPLVREITFENGEELTEEGLPFLILFHMKEDTESLEIFQNEVARQLISEKGTINFLHADCDKFRHPLLHIQKTPADCPV
IAIDSFRHMYVFGDFKDVLIPGKLKQFVFDLHSGKLHREFHHGPDPTDTAPGEQAQDVASSPPESSFQKLAPSEYRYTLL
RDRDEL
>gnl|BL_ORD_ID|1 tailing_stop_Q9NSY1 Added trailing * character (representing stop codon)
MKKFSRMPKSEGGSGGGAAGGGAGGAGAGAGCGSGGSSVGVRVFAVGRHQVTLEESLAEGGFSTVFLVRTHGGIRCALKR
MYVNNMPDLNVCKREITIMKELSGHKNIVGYLDCAVNSISDNVWEVLILMEYCRAGQVVNQMNKKLQTGFTEPEVLQIFC
DTCEAVARLHQCKTPIIHRDLKVENILLNDGGNYVLCDFGSATNKFLNPQKDGVNVVEEEIKKYTTLSYRAPEMINLYGG
KPITTKADIWALGCLLYKLCFFTLPFGESQVAICDGNFTIPDNSRYSRNIHCLIRFMLEPDPEHRPDIFQVSYFAFKFAK
KDCPVSNINNSSIPSALPEPMTASEAAARKSQIKARITDTIGPTETSIAPRQRPKANSATTATPSVLTIQSSATPVKVLA
PGEFGNHRPKGALRPGNGPEILLGQGPPQQPPQQHRVLQQLQQGDWRLQQLHLQHRHPHQQQQQQQQQQQQQQQQQQQQQ
QQQQQQHHHHHHHHLLQDAYMQQYQHATQQQQMLQQQFLMHSVYQPQPSASQYPTMMPQYQQAFFQQQMLAQHQPSQQQA
SPEYLTSPQEFSPALVSYTSSLPAQVGTIMDSSYSANRSVADKEAIANFTNQKNISNPPDMSGWNPFGEDNFSKLTEEEL
LDREFDLLRSNRLEERASSDKNVDSLSAPHNHPPEDPFGSVPFISHSGSPEKKAEHSSINQENGTANPIKNGKTSPASKD
QRTGKKTSVQGQVQKGNDESESDFESDPPSPKSSEEEEQDDEEVLQGEQGDFNDDDTEPENLGHRPLLMDSEDEEEEEKH
SSDSDYEQAKAKYSDMSSVYRDRSGSGPTQDLNTILLTSAQLSSDVAVETPKQEFDVFGAVPFFAVRAQQPQQEKNEKNL
PQHRFPAAGLEQEEFDVFTKAPFSKKVNVQECHAVGPEAHTIPGYPKSVDVFGSTPFQPFLTSTSKSESNEDLFGLVPFD
EITGSQQQKVKQRSLQKLSSRQRRTKQDMSKSNGKRHHGTPTSTKKTLKPTYRTPERARRHKKVGRRDSQSSNEFLTISD
SKENISVALTDGKDRGNVLQPEESLLDPFGAKPFHSPDLSWHPPHQGLSDIRADHNTVLPGRPRQNSLHGSFHSADVLKM
DDFGAVPFTELVVQSITPHQSQQSQPVELDPFGAAPFPSKQ*
>gnl|BL_ORD_ID|2 internal_stop_P06213 Added two internal * characters
MATGGRRGAAAAPLLVAVAALLLGAAGHLYPGEVCPGMDIRNNLTRLHELENCSVIEGHLQILLMFKTRPEDFRDLSFPK
LIMITDYLLLFRVYGLESLKDLFPNLTVIRGSRLFFNYALVIFEMVHLKELGLYNLMNITRGSVRIEKNNELCYLATIDW
SRILDSVEDNYIVLNKDDNEECGDICPGTAKGKTNCPATVINGQFVERCWTHSHCQKVCPTICKSHGCTAEGLCCHSECL
GNCSQPDDPTKCVACRNFYLDGRCVETCPPPYYHFQDWRCVNFSFCQDLHHKCKNSRRQGCHQYVIHNNKCIPECPSGYT
MNSSNLLCTPCLGPCPKVCHLLEGEKTIDSVTSAQELRGCTVINGSLIINIRGGNNLAAELEANLGLIEEISGYLKIRRS
YALVSLSFFRKLRLIRGETLEIGNYSFYALDNQNLRQLWDWSKHNLTITQGKLFFHYNPKLCLSEIHKMEEVSGTKGRQE
RNDIALKTNGDQASCENELLKFSYIRTSFDKILLRWEPYWPPDFRDLLGFMLFYKEAPYQNVTEFDGQDACGSNSWTVVD
IDPPLRSNDPKSQNHPGWLMRGLKPWTQYAIFVKTLVTFSDERRTYGAKSDIIYVQTDATNPSVPLDPISVSNS*SQIIL
KWKPPSDPNGNITHYLVFWERQAEDSELFELDYCLKGLKLPSRTWSPPFESEDSQKHNQSEYEDSAGECCSCPKTDSQIL
KELEESSFRKTFEDYLHNVVFVPRKTSSGTGAEDPRPSRKRRSLGDVGNVTVAVPTVAAFPNTSSTSVPTSPEEHRPFEK
VVNKESLVISGLRHFTGYRIELQACNQDTPEERCSVAAYVSARTMPEAKADDIVGPVTHEIFENNVVHLMWQEPKEPNGL
IVLYEVSYRRYGDEELHLCVSRKHFALERGCRLRGLSPGNYSVRIRATSLAGNGSWTEPTYFYVTDYLDVPSNIAKIIIG
PLIFVFLFSVVIGSIYLFLRKRQPDGPLGPLYASSNPEYLSASDVFPCSVYVPDEWEVSREKITLLRELGQGSFGMVYEG
NARDIIKGEAETRVAVKTVNESASLRERIEFLNEASVMKGFTCHHVVRLLGVVSKGQPTLVVMELMAHGDLKSYLRSLRP
EAENNPGRP*PTLQEMIQMAAEIADGMAYLNAKKFVHRDLAARNCMVAHDFTVKIGDFGMTRDIYETDYYRKGGKGLLPV
RWMAPESLKDGVFTTSSDMWSFGVVLWEITSLAEQPYQGLSNEQVLKFVMDGGYLDQPDNCPERVTDLMRMCWQFNPKMR
PTFLEIVNLLKDDLHPSFPEVSFFHSEENKAPESEELEMEFEDMENVPLDRSSHCQREEAGGRDGGSSLGFKRSYEEHIP
YTHMNGGKKNGRILTLPRSNPS
>gnl|BL_ORD_ID|3 internal_and_trailing_stop_P08100 Added internal and trailing * characters
MNGTEGPNFYVPFSNATGVVRSPFEYPQYYLAEPWQFSMLAAYMFLLIVLGFPINFLTLYVTVQHKKLRTPLNYILLNLA
VADLFMVLGGFTSTLYTSLHGYFVFGPTGCNLEGFFATLGGEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFT
WVMALACAAPPLAGWSRYIPEGLQCSCGIDYYTLKPEVNNESFVIYMFVVHFTIPMIIIFFCYGQLVFT*KEAAAQQQES
ATTQKAEKEVTRMVIIMVIAFLICWVPYASVAFYIFTHQGSNFGPIFMTIPAFFAKSAAIYNPVIYIMMNKQFRNCMLTT
ICCGKNPLGDDEASATVSKTETSQVAPA*


Notice the NCBI defaults to imposing their naming convention on the output - they've assigned made up ordinal identifiers (sequence 0, 1, 2 and 3). But that is not the problem I'm focussing on today.

So far so good - we've got a BLAST database using stop codons.

Using a BLAST database with stop codons

Now let's try a search, using a carp protein chosen to give a nice match:

$ cat carp.fasta
>gi|33355376|gb|AAN52151.1| IGF-I receptor subtype a [Cyprinus carpio]
MRSGTARDVWTLFWGPVLFLSTICLWSAHGEVCGPHIDIRNDISEFKKLENCTVVEGYLQILLIGDKNNN
LNQEHFRTLSFPKLTMVTDYLLLFRVSGLDSLSMLFPNLNVIRGRNLFYNYALVIFEMTSLKDIGLYNLR
NITRGAIRIEKNPELCYLDSVDWSLIMDAEFNNYIAGNKQSKECGDVCPGIMEDRPQCIRTSFNDNYSYR
CWTSNDCQKVCPKECEKRACTYKGKCCHPQCLGSCTEPDNDKACAACQHYFHEDRCVETCPQGTYKFEGW
RCITMDMCARVHLPSEVDFVIHNGECMPECPPGFTRNETQSMFCSACDGLCDKVCESKTIDSVDAAQSLQ
GCTVIKGNLNINIRRGNNIASELESFMGLIQTVTGYVKIKHSQTLGSLSFLKSLRYIHGEELTEQSYAFS
AIDNQNLQYLWDWSQHNLTIRAGKLFFRFNPKLCMSEIQKMWEKTGVKEKMEEDDARSNGERASCESYIL
KFKSNHTQSTRIKLTWERYRPPDYRDLISFIVYYKEAPFQNITEFDGQDGCGSNSWNMVDVDLPQDKSTD
PGVLLSSLKPWTQYAIFVKAVTLVVEDKHILGAKSEVVYIRTNASAPSMPLDARAYANSSSSLMVKWSPP
ISPNGNKTFYLLRWQQQAEDRELYQHNYCSKELKIPIRISATALSDMEGETKPTKSDVGGGEKGCCPCPK
TKEDLKAEAEDRSYRKVFENFLHNSIFTPRPPDRKRRDLFGVANGTLLQGASRNITGMESNITDPKLHEK
EYPFSEDKVYTEFMEISNLQPFTVYRIDIHACNYEIHRCSAAAFVFSRTKPADKADDIPGTVTQERDEKG
EGTVLLRWPEPHHPNGLILMYEIKYRLGTEPEKHECVSRQQYRVLRGAQLSNLPSGNYSARVRATSLAGN
GSWTEPVSFYVPPPKRNYDNAFLLAIIIPIIVLLLLTAVISVVIFVTKKRNSDRLGNGVLYASVNPEYFS
PFEMYVPDEWEVAREKITMCRELGQGSFGMVYEGIAKGVVKDEPETRVAIKTVNESASMRERIEFLNEAS
VMKEFNCHHVVRLLGVVSQRQPTLVIMELMTRGDLKSYLRSLRSKEQGYSTQSLPPLKKMIQMASEIADG
MAYLNANKFVHRDLAARNCMVAEDFTVKIGDFGMTRDIYETDYYRKGGKGLLPVRWMSPESLKDGVFTTN
SDVWSFGVVLWEIATLAKQPYQGMSNEQVLRFVMEGGLLDKPDNCPDMLFELMRMCWQYNPKMRPSFLEI
INSIKEELEPPFQEVSFFYSEENKPPDTEELDMEVENMENVPLEPASMLAPSSQGSTGAQTQSPQPLSPA
QGPGTPVGSAQTPSTSPPPSSSSASPGLVLDKHSGQVAANGPVLLLGPAFDETQPYAHMNGGRKNERALP
LPQSSAC

First, the default text output, which I have edited for conciseness - marking removed text with ellipsises, and then highlighting points of interest in red.

$ blastp -query carp.fasta -db four_proteins_star
BLASTP 2.2.25+
...
> internal_stop_P06213 Added two internal * characters
Length=1382

 Score = 1460 bits (3780),  Expect = 0.0, Method: Compositional matrix adjust.
 Identities = 758/1416 (54%), Positives = 960/1416 (68%), Gaps = 107/1416 (7%)

Query  30    GEVCGPHIDIRNDISEFKKLENCTVVEGYLQILLIGDKNNNLNQEHFRTLSFPKLTMVTD  89
             GEVC P +DIRN+++   +LENC+V+EG+LQILL+         E FR LSFPKL M+TD
Sbjct  32    GEVC-PGMDIRNNLTRLHELENCSVIEGHLQILLMFKTR----PEDFRDLSFPKLIMITD  86
...
Query  610   PLDARAYANSSSSLMVKWSPPISPNGNKTFYLLRWQQQAEDRELYQHNYCSKELKIPIRI  669
             PLD  + +NS S +++KW PP  PNGN T YL+ W++QAED EL++ +YC K LK+P R
Sbjct  625   PLDPISVSNS?SQIILKWKPPSDPNGNITHYLVFWERQAEDSELFELDYCLKGLKLPSRT  684
...
Query  1070  RQPTLVIMELMTRGDLKSYLRSLRSKEQGYSTQSLPPLKKMIQMASEIADGMAYLNANKF  1129
              QPTLV+MELM  GDLKSYLRSLR + +    +  P L++MIQMA+EIADGMAYLNA KF
Sbjct  1096  GQPTLVVMELMAHGDLKSYLRSLRPEAENNPGRP?PTLQEMIQMAAEIADGMAYLNAKKF  1155
...

When sending the output directly to the terminal like this, the stop codons appear in the text output as a question mark (marked in red). The same happens with the tabular output if you request the sseq column, and the XML. But all is not as it might seem - and there is a clue if you page the output though the 'more' command:

$ blastp -query carp.fasta -db four_proteins_star | more
BLASTP 2.2.25+
...
> internal_stop_P06213 Added two internal * characters
Length=1382

 Score = 1460 bits (3780),  Expect = 0.0, Method: Compositional matrix adjust.
 Identities = 758/1416 (54%), Positives = 960/1416 (68%), Gaps = 107/1416 (7%)

Query  30    GEVCGPHIDIRNDISEFKKLENCTVVEGYLQILLIGDKNNNLNQEHFRTLSFPKLTMVTD  89
             GEVC P +DIRN+++   +LENC+V+EG+LQILL+         E FR LSFPKL M+TD
Sbjct  32    GEVC-PGMDIRNNLTRLHELENCSVIEGHLQILLMFKTR----PEDFRDLSFPKLIMITD  86
...
Query  610   PLDARAYANSSSSLMVKWSPPISPNGNKTFYLLRWQQQAEDRELYQHNYCSKELKIPIRI  669
             PLD  + +NS S +++KW PP  PNGN T YL+ W++QAED EL++ +YC K LK+P R
Sbjct  625   PLDPISVSNS<FF>SQIILKWKPPSDPNGNITHYLVFWERQAEDSELFELDYCLKGLKLPSRT  684
...
Query  1070  RQPTLVIMELMTRGDLKSYLRSLRSKEQGYSTQSLPPLKKMIQMASEIADGMAYLNANKF  1129
              QPTLV+MELM  GDLKSYLRSLR + +    +  P L++MIQMA+EIADGMAYLNA KF
Sbjct  1096  GQPTLVVMELMAHGDLKSYLRSLRPEAENNPGRP<FF>PTLQEMIQMAAEIADGMAYLNAKKF  1155
...

 This was on Mac OS X, although 'more' doesn't actually use red it does print <FF> with reverse contrast highlighting. This indicates the non-printing character with hex code 0xFF, or in decimal 255. There's another way to show this - let's save the BLAST output to files:

$ blastp -query carp.fasta -db four_proteins_star -out carp.txt
$ blastp -query carp.fasta -db four_proteins_star -outfmt "6 std qseq sseq" -out carp.tsv
$ blastp -query carp.fasta -db four_proteins_star -outfmt 5 -out carp.xml

Now let's take a closer look at that first 'stop codon' character using 'hexdump':

$ hexdump -C carp.txt | grep -C 3 SQIILK
00000f40  2b 2b 51 41 45 44 20 45  4c 2b 2b 20 2b 59 43 20  |++QAED EL++ +YC |
00000f50  4b 20 4c 4b 2b 50 20 52  20 0a 53 62 6a 63 74 20  |K LK+P R .Sbjct |
00000f60  20 36 32 35 20 20 20 50  4c 44 50 49 53 56 53 4e  | 625   PLDPISVSN|
00000f70  53 ff 53 51 49 49 4c 4b  57 4b 50 50 53 44 50 4e  |S.SQIILKWKPPSDPN|
00000f80  47 4e 49 54 48 59 4c 56  46 57 45 52 51 41 45 44  |GNITHYLVFWERQAED|
00000f90  53 45 4c 46 45 4c 44 59  43 4c 4b 47 4c 4b 4c 50  |SELFELDYCLKGLKLP|
00000fa0  53 52 54 20 20 36 38 34  0a 0a 51 75 65 72 79 20  |SRT  684..Query |
$ hexdump -C carp.tsv | grep -C 3 SQIILK
00000810  52 47 4c 4b 50 57 54 51  59 41 49 46 56 4b 54 4c  |RGLKPWTQYAIFVKTL|
00000820  56 54 46 53 44 45 52 52  54 59 47 41 4b 53 44 49  |VTFSDERRTYGAKSDI|
00000830  49 59 56 51 54 44 41 54  4e 50 53 56 50 4c 44 50  |IYVQTDATNPSVPLDP|
00000840  49 53 56 53 4e 53 ff 53  51 49 49 4c 4b 57 4b 50  |ISVSNS.SQIILKWKP|
00000850  50 53 44 50 4e 47 4e 49  54 48 59 4c 56 46 57 45  |PSDPNGNITHYLVFWE|
00000860  52 51 41 45 44 53 45 4c  46 45 4c 44 59 43 4c 4b  |RQAEDSELFELDYCLK|
00000870  47 4c 4b 4c 50 53 52 54  57 53 50 50 46 2d 45 53  |GLKLPSRTWSPPF-ES|
$ hexdump -C carp.xml | grep -C 3 SQIILK
00001180  50 57 54 51 59 41 49 46  56 4b 54 4c 56 54 46 53  |PWTQYAIFVKTLVTFS|
00001190  44 45 52 52 54 59 47 41  4b 53 44 49 49 59 56 51  |DERRTYGAKSDIIYVQ|
000011a0  54 44 41 54 4e 50 53 56  50 4c 44 50 49 53 56 53  |TDATNPSVPLDPISVS|
000011b0  4e 53 ff 53 51 49 49 4c  4b 57 4b 50 50 53 44 50  |NS.SQIILKWKPPSDP|
000011c0  4e 47 4e 49 54 48 59 4c  56 46 57 45 52 51 41 45  |NGNITHYLVFWERQAE|
000011d0  44 53 45 4c 46 45 4c 44  59 43 4c 4b 47 4c 4b 4c  |DSELFELDYCLKGLKL|
000011e0  50 53 52 54 57 53 50 50  46 2d 45 53 45 44 53 51  |PSRTWSPPF-ESEDSQ|

The terminal rendered it as a question mark, but the stop codon is actually 0xff, character code 255, a non-printing character. Why is that a problem? Well I can live with it for the text and tabular output, but for XML it has the interesting side effect of rending the XML invalid - because this character is outside the default encoding.

That means any tools to parse the XML should reject this BLAST output!

For example, Biopython's BLAST XML parser will quite rightly reject this file:

$ python
Python 2.7.2 (default, Jun 20 2012, 16:23:33)
[GCC 4.2.1 Compatible Apple Clang 4.0 (tags/Apple/clang-418.0.60)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio.Blast import NCBIXML
>>> for record in NCBIXML.parse(open("carp.xml")): print record
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/Bio/Blast/NCBIXML.py", line 652, in parse
    expat_parser.Parse(text, False)       
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 50, column 634

XML may be a bit on the heavy handed side, but the fact it can be validated is very handy for investigating parsing issues. You can try this yourself by running this example, and then uploading the broken XML BLAST file to an online XML validator, e.g. http://validator.w3.org/check complains ascii "\xFF" does not map to Unicode on line 50 of the file.

I was first bitten by this NCBI BLAST+ bug over a year ago, but was recently reminded about it when my Biopython Google Summer of Code (GSoC) student Bow Arindrarto hit a very similar issue. So, better late than never, I am reporting this to the NCBI - by email and public blog post since the NCBI BLAST team don't have a public bug tracker.

Update: What does 'legacy' BLAST do?

On Twitter Nick Loman suggested sticking with blastall from the NCBI 'legacy' BLAST suite. So I tried it with BLAST 2.2.24 under Linux. First replace the database:

$ rm four_proteins_star.p* 
$ ~/Downloads/blast-2.2.24/bin/formatdb -p T -i four_proteins_star.fasta -n four_proteins_star

Now make new plain text and XML files (as far as I recall, you can't get the sequence in the tabular output from legacy BLAST so I skipped that):

$ ~/Downloads/blast-2.2.24/bin/blastall -p blastp -i carp.fasta -d four_proteins_star -o carp.txt
$ ~/Downloads/blast-2.2.24/bin/blastall -p blastp -i carp.fasta -d four_proteins_star -m 7 -o carp.xml
$ hexdump -C carp.txt | grep -C 3 SQIILK
00000f20  4c 2b 20 57 2b 2b 51 41  45 44 20 45 4c 2b 2b 20  |L+ W++QAED EL++ |
00000f30  2b 59 43 20 4b 20 4c 4b  2b 50 20 52 20 0a 53 62  |+YC K LK+P R .Sb|
00000f40  6a 63 74 3a 20 36 32 35  20 20 50 4c 44 50 49 53  |jct: 625  PLDPIS|
00000f50  56 53 4e 53 2a 53 51 49  49 4c 4b 57 4b 50 50 53  |VSNS*SQIILKWKPPS|
00000f60  44 50 4e 47 4e 49 54 48  59 4c 56 46 57 45 52 51  |DPNGNITHYLVFWERQ|
00000f70  41 45 44 53 45 4c 46 45  4c 44 59 43 4c 4b 47 4c  |AEDSELFELDYCLKGL|
00000f80  4b 4c 50 53 52 54 20 36  38 34 0a 0a 51 75 65 72  |KLPSRT 684..Quer|
$ hexdump -C carp.xml | grep -C 3 SQIILK
00001140  4c 4d 52 47 4c 4b 50 57  54 51 59 41 49 46 56 4b  |LMRGLKPWTQYAIFVK|
00001150  54 4c 56 54 46 53 44 45  52 52 54 59 47 41 4b 53  |TLVTFSDERRTYGAKS|
00001160  44 49 49 59 56 51 54 44  41 54 4e 50 53 56 50 4c  |DIIYVQTDATNPSVPL|
00001170  44 50 49 53 56 53 4e 53  2a 53 51 49 49 4c 4b 57  |DPISVSNS*SQIILKW|
00001180  4b 50 50 53 44 50 4e 47  4e 49 54 48 59 4c 56 46  |KPPSDPNGNITHYLVF|
00001190  57 45 52 51 41 45 44 53  45 4c 46 45 4c 44 59 43  |WERQAEDSELFELDYC|
000011a0  4c 4b 47 4c 4b 4c 50 53  52 54 57 53 50 50 46 2d  |LKGLKLPSRTWSPPF-|

Here we get the desired asterisk, *, whose ASCII code is 42 or in hex 0x42.

So, low and behold, this bug doesn't happen under 'legacy' BLAST - it is a regression in BLAST+.

Update

I had a reply from my email to the NCBI confirming this was being passed on to the BLAST developers to be fixed.

Update (12 September 2012)

The NCBI have just announced the release of BLAST 2.2.27+ which should fix this bug.

7 comments:

  1. Hi, thanks for your post!
    I've got the same exception from biopython when parsing some blast results: the big problem is that there is no way to known which sequence causes the problem :/

    ReplyDelete
    Replies
    1. The exception should give you the XML line number, so you can (manually) look in the file to find the problem.

      Bow's GSoC work may offer a workaround. You could index the XML file by query, and then request each query's results one by one, catching any exceptions. Doing it this way the new indexing code would extract just the XML fragment of interest, and parse that on its own. I've used this technique with Biopython's SeqIO to handle a corrupted sequence file. You still can't parse the broken results without fixing the XML though.

      Delete
  2. I've yet to find a compelling reason to switch to BLAST+ from blastall!

    ReplyDelete
    Replies
    1. I do like the flexible tabular output, where you can request any combination of the supported fields. It would be better still if they added the obvious BLAST+ omissions of the hit descriptions and taxon ID.

      Right now that's all I can think of as upsides of the switch - which is quite telling.

      If they hadn't done the BLAST+ -subject option as pairwise searches, I would have added FASTA vs FASTA searching as a nice shortcut for one-off searches.

      Delete
  3. I believe that this bug has already been fixed for 2.2.27+. I discovered this bug independently and corresponded with NCBI, documented here:
    https://github.com/yannickwurm/sequenceserver/issues/87#issuecomment-6112940

    ReplyDelete
  4. I'm not sure if this is already known, but another problem I've found in blast+ is that sometimes the tabular output doesn't give all the results that it should, unless you use the -max_target_seqs flag. I documented here:
    http://seqanswers.com/forums/showthread.php?t=21392

    ReplyDelete
  5. Yup, WU-Blast behaved a bit better with stops... and is not soo picky with the defline.

    ReplyDelete