Update: BLAST 2.2.27+ fixed this bug.
Making a test BLAST database
Here is a simple example protein FASTA file, where I have manually added some stop codons as asterisk characters to the amino acid sequence (highlighted here in red):
$ cat four_proteins_star.fasta
>original_Q9BS26 Unedited sequence
MHPAVFLSLPDLRCSLLLLVTWVFTPVTTEITSLDTENIDEILNNADVALVNFYADWCRF
SQMLHPIFEEASDVIKEEFPNENQVVFARVDCDQHSDIAQRYRISKYPTLKLFRNGMMMK
REYRGQRSVKALADYIRQQKSDPIQEIRDLAEITTLDRSKRNIIGYFEQKDSDNYRVFER
VANILHDDCAFLSAFGDVSKPERYSGDNIIYKPPGHSAPDMVYLGAMTNFDVTYNWIQDK
CVPLVREITFENGEELTEEGLPFLILFHMKEDTESLEIFQNEVARQLISEKGTINFLHAD
CDKFRHPLLHIQKTPADCPVIAIDSFRHMYVFGDFKDVLIPGKLKQFVFDLHSGKLHREF
HHGPDPTDTAPGEQAQDVASSPPESSFQKLAPSEYRYTLLRDRDEL
>tailing_stop_Q9NSY1 Added trailing * character (representing stop codon)
MKKFSRMPKSEGGSGGGAAGGGAGGAGAGAGCGSGGSSVGVRVFAVGRHQVTLEESLAEG
GFSTVFLVRTHGGIRCALKRMYVNNMPDLNVCKREITIMKELSGHKNIVGYLDCAVNSIS
DNVWEVLILMEYCRAGQVVNQMNKKLQTGFTEPEVLQIFCDTCEAVARLHQCKTPIIHRD
LKVENILLNDGGNYVLCDFGSATNKFLNPQKDGVNVVEEEIKKYTTLSYRAPEMINLYGG
KPITTKADIWALGCLLYKLCFFTLPFGESQVAICDGNFTIPDNSRYSRNIHCLIRFMLEP
DPEHRPDIFQVSYFAFKFAKKDCPVSNINNSSIPSALPEPMTASEAAARKSQIKARITDT
IGPTETSIAPRQRPKANSATTATPSVLTIQSSATPVKVLAPGEFGNHRPKGALRPGNGPE
ILLGQGPPQQPPQQHRVLQQLQQGDWRLQQLHLQHRHPHQQQQQQQQQQQQQQQQQQQQQ
QQQQQQHHHHHHHHLLQDAYMQQYQHATQQQQMLQQQFLMHSVYQPQPSASQYPTMMPQY
QQAFFQQQMLAQHQPSQQQASPEYLTSPQEFSPALVSYTSSLPAQVGTIMDSSYSANRSV
ADKEAIANFTNQKNISNPPDMSGWNPFGEDNFSKLTEEELLDREFDLLRSNRLEERASSD
KNVDSLSAPHNHPPEDPFGSVPFISHSGSPEKKAEHSSINQENGTANPIKNGKTSPASKD
QRTGKKTSVQGQVQKGNDESESDFESDPPSPKSSEEEEQDDEEVLQGEQGDFNDDDTEPE
NLGHRPLLMDSEDEEEEEKHSSDSDYEQAKAKYSDMSSVYRDRSGSGPTQDLNTILLTSA
QLSSDVAVETPKQEFDVFGAVPFFAVRAQQPQQEKNEKNLPQHRFPAAGLEQEEFDVFTK
APFSKKVNVQECHAVGPEAHTIPGYPKSVDVFGSTPFQPFLTSTSKSESNEDLFGLVPFD
EITGSQQQKVKQRSLQKLSSRQRRTKQDMSKSNGKRHHGTPTSTKKTLKPTYRTPERARR
HKKVGRRDSQSSNEFLTISDSKENISVALTDGKDRGNVLQPEESLLDPFGAKPFHSPDLS
WHPPHQGLSDIRADHNTVLPGRPRQNSLHGSFHSADVLKMDDFGAVPFTELVVQSITPHQ
SQQSQPVELDPFGAAPFPSKQ*
>internal_stop_P06213 Added two internal * characters
MATGGRRGAAAAPLLVAVAALLLGAAGHLYPGEVCPGMDIRNNLTRLHELENCSVIEGHL
QILLMFKTRPEDFRDLSFPKLIMITDYLLLFRVYGLESLKDLFPNLTVIRGSRLFFNYAL
VIFEMVHLKELGLYNLMNITRGSVRIEKNNELCYLATIDWSRILDSVEDNYIVLNKDDNE
ECGDICPGTAKGKTNCPATVINGQFVERCWTHSHCQKVCPTICKSHGCTAEGLCCHSECL
GNCSQPDDPTKCVACRNFYLDGRCVETCPPPYYHFQDWRCVNFSFCQDLHHKCKNSRRQG
CHQYVIHNNKCIPECPSGYTMNSSNLLCTPCLGPCPKVCHLLEGEKTIDSVTSAQELRGC
TVINGSLIINIRGGNNLAAELEANLGLIEEISGYLKIRRSYALVSLSFFRKLRLIRGETL
EIGNYSFYALDNQNLRQLWDWSKHNLTITQGKLFFHYNPKLCLSEIHKMEEVSGTKGRQE
RNDIALKTNGDQASCENELLKFSYIRTSFDKILLRWEPYWPPDFRDLLGFMLFYKEAPYQ
NVTEFDGQDACGSNSWTVVDIDPPLRSNDPKSQNHPGWLMRGLKPWTQYAIFVKTLVTFS
DERRTYGAKSDIIYVQTDATNPSVPLDPISVSNS*SQIILKWKPPSDPNGNITHYLVFWE
RQAEDSELFELDYCLKGLKLPSRTWSPPFESEDSQKHNQSEYEDSAGECCSCPKTDSQIL
KELEESSFRKTFEDYLHNVVFVPRKTSSGTGAEDPRPSRKRRSLGDVGNVTVAVPTVAAF
PNTSSTSVPTSPEEHRPFEKVVNKESLVISGLRHFTGYRIELQACNQDTPEERCSVAAYV
SARTMPEAKADDIVGPVTHEIFENNVVHLMWQEPKEPNGLIVLYEVSYRRYGDEELHLCV
SRKHFALERGCRLRGLSPGNYSVRIRATSLAGNGSWTEPTYFYVTDYLDVPSNIAKIIIG
PLIFVFLFSVVIGSIYLFLRKRQPDGPLGPLYASSNPEYLSASDVFPCSVYVPDEWEVSR
EKITLLRELGQGSFGMVYEGNARDIIKGEAETRVAVKTVNESASLRERIEFLNEASVMKG
FTCHHVVRLLGVVSKGQPTLVVMELMAHGDLKSYLRSLRPEAENNPGRP*PTLQEMIQMA
AEIADGMAYLNAKKFVHRDLAARNCMVAHDFTVKIGDFGMTRDIYETDYYRKGGKGLLPV
RWMAPESLKDGVFTTSSDMWSFGVVLWEITSLAEQPYQGLSNEQVLKFVMDGGYLDQPDN
CPERVTDLMRMCWQFNPKMRPTFLEIVNLLKDDLHPSFPEVSFFHSEENKAPESEELEME
FEDMENVPLDRSSHCQREEAGGRDGGSSLGFKRSYEEHIPYTHMNGGKKNGRILTLPRSN
PS
>internal_and_trailing_stop_P08100 Added internal and trailing * characters
MNGTEGPNFYVPFSNATGVVRSPFEYPQYYLAEPWQFSMLAAYMFLLIVLGFPINFLTLY
VTVQHKKLRTPLNYILLNLAVADLFMVLGGFTSTLYTSLHGYFVFGPTGCNLEGFFATLG
GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLAGWSRYIP
EGLQCSCGIDYYTLKPEVNNESFVIYMFVVHFTIPMIIIFFCYGQLVFT*KEAAAQQQES
ATTQKAEKEVTRMVIIMVIAFLICWVPYASVAFYIFTHQGSNFGPIFMTIPAFFAKSAAI
YNPVIYIMMNKQFRNCMLTTICCGKNPLGDDEASATVSKTETSQVAPA*
>original_Q9BS26 Unedited sequence
MHPAVFLSLPDLRCSLLLLVTWVFTPVTTEITSLDTENIDEILNNADVALVNFYADWCRF
SQMLHPIFEEASDVIKEEFPNENQVVFARVDCDQHSDIAQRYRISKYPTLKLFRNGMMMK
REYRGQRSVKALADYIRQQKSDPIQEIRDLAEITTLDRSKRNIIGYFEQKDSDNYRVFER
VANILHDDCAFLSAFGDVSKPERYSGDNIIYKPPGHSAPDMVYLGAMTNFDVTYNWIQDK
CVPLVREITFENGEELTEEGLPFLILFHMKEDTESLEIFQNEVARQLISEKGTINFLHAD
CDKFRHPLLHIQKTPADCPVIAIDSFRHMYVFGDFKDVLIPGKLKQFVFDLHSGKLHREF
HHGPDPTDTAPGEQAQDVASSPPESSFQKLAPSEYRYTLLRDRDEL
>tailing_stop_Q9NSY1 Added trailing * character (representing stop codon)
MKKFSRMPKSEGGSGGGAAGGGAGGAGAGAGCGSGGSSVGVRVFAVGRHQVTLEESLAEG
GFSTVFLVRTHGGIRCALKRMYVNNMPDLNVCKREITIMKELSGHKNIVGYLDCAVNSIS
DNVWEVLILMEYCRAGQVVNQMNKKLQTGFTEPEVLQIFCDTCEAVARLHQCKTPIIHRD
LKVENILLNDGGNYVLCDFGSATNKFLNPQKDGVNVVEEEIKKYTTLSYRAPEMINLYGG
KPITTKADIWALGCLLYKLCFFTLPFGESQVAICDGNFTIPDNSRYSRNIHCLIRFMLEP
DPEHRPDIFQVSYFAFKFAKKDCPVSNINNSSIPSALPEPMTASEAAARKSQIKARITDT
IGPTETSIAPRQRPKANSATTATPSVLTIQSSATPVKVLAPGEFGNHRPKGALRPGNGPE
ILLGQGPPQQPPQQHRVLQQLQQGDWRLQQLHLQHRHPHQQQQQQQQQQQQQQQQQQQQQ
QQQQQQHHHHHHHHLLQDAYMQQYQHATQQQQMLQQQFLMHSVYQPQPSASQYPTMMPQY
QQAFFQQQMLAQHQPSQQQASPEYLTSPQEFSPALVSYTSSLPAQVGTIMDSSYSANRSV
ADKEAIANFTNQKNISNPPDMSGWNPFGEDNFSKLTEEELLDREFDLLRSNRLEERASSD
KNVDSLSAPHNHPPEDPFGSVPFISHSGSPEKKAEHSSINQENGTANPIKNGKTSPASKD
QRTGKKTSVQGQVQKGNDESESDFESDPPSPKSSEEEEQDDEEVLQGEQGDFNDDDTEPE
NLGHRPLLMDSEDEEEEEKHSSDSDYEQAKAKYSDMSSVYRDRSGSGPTQDLNTILLTSA
QLSSDVAVETPKQEFDVFGAVPFFAVRAQQPQQEKNEKNLPQHRFPAAGLEQEEFDVFTK
APFSKKVNVQECHAVGPEAHTIPGYPKSVDVFGSTPFQPFLTSTSKSESNEDLFGLVPFD
EITGSQQQKVKQRSLQKLSSRQRRTKQDMSKSNGKRHHGTPTSTKKTLKPTYRTPERARR
HKKVGRRDSQSSNEFLTISDSKENISVALTDGKDRGNVLQPEESLLDPFGAKPFHSPDLS
WHPPHQGLSDIRADHNTVLPGRPRQNSLHGSFHSADVLKMDDFGAVPFTELVVQSITPHQ
SQQSQPVELDPFGAAPFPSKQ*
>internal_stop_P06213 Added two internal * characters
MATGGRRGAAAAPLLVAVAALLLGAAGHLYPGEVCPGMDIRNNLTRLHELENCSVIEGHL
QILLMFKTRPEDFRDLSFPKLIMITDYLLLFRVYGLESLKDLFPNLTVIRGSRLFFNYAL
VIFEMVHLKELGLYNLMNITRGSVRIEKNNELCYLATIDWSRILDSVEDNYIVLNKDDNE
ECGDICPGTAKGKTNCPATVINGQFVERCWTHSHCQKVCPTICKSHGCTAEGLCCHSECL
GNCSQPDDPTKCVACRNFYLDGRCVETCPPPYYHFQDWRCVNFSFCQDLHHKCKNSRRQG
CHQYVIHNNKCIPECPSGYTMNSSNLLCTPCLGPCPKVCHLLEGEKTIDSVTSAQELRGC
TVINGSLIINIRGGNNLAAELEANLGLIEEISGYLKIRRSYALVSLSFFRKLRLIRGETL
EIGNYSFYALDNQNLRQLWDWSKHNLTITQGKLFFHYNPKLCLSEIHKMEEVSGTKGRQE
RNDIALKTNGDQASCENELLKFSYIRTSFDKILLRWEPYWPPDFRDLLGFMLFYKEAPYQ
NVTEFDGQDACGSNSWTVVDIDPPLRSNDPKSQNHPGWLMRGLKPWTQYAIFVKTLVTFS
DERRTYGAKSDIIYVQTDATNPSVPLDPISVSNS*SQIILKWKPPSDPNGNITHYLVFWE
RQAEDSELFELDYCLKGLKLPSRTWSPPFESEDSQKHNQSEYEDSAGECCSCPKTDSQIL
KELEESSFRKTFEDYLHNVVFVPRKTSSGTGAEDPRPSRKRRSLGDVGNVTVAVPTVAAF
PNTSSTSVPTSPEEHRPFEKVVNKESLVISGLRHFTGYRIELQACNQDTPEERCSVAAYV
SARTMPEAKADDIVGPVTHEIFENNVVHLMWQEPKEPNGLIVLYEVSYRRYGDEELHLCV
SRKHFALERGCRLRGLSPGNYSVRIRATSLAGNGSWTEPTYFYVTDYLDVPSNIAKIIIG
PLIFVFLFSVVIGSIYLFLRKRQPDGPLGPLYASSNPEYLSASDVFPCSVYVPDEWEVSR
EKITLLRELGQGSFGMVYEGNARDIIKGEAETRVAVKTVNESASLRERIEFLNEASVMKG
FTCHHVVRLLGVVSKGQPTLVVMELMAHGDLKSYLRSLRPEAENNPGRP*PTLQEMIQMA
AEIADGMAYLNAKKFVHRDLAARNCMVAHDFTVKIGDFGMTRDIYETDYYRKGGKGLLPV
RWMAPESLKDGVFTTSSDMWSFGVVLWEITSLAEQPYQGLSNEQVLKFVMDGGYLDQPDN
CPERVTDLMRMCWQFNPKMRPTFLEIVNLLKDDLHPSFPEVSFFHSEENKAPESEELEME
FEDMENVPLDRSSHCQREEAGGRDGGSSLGFKRSYEEHIPYTHMNGGKKNGRILTLPRSN
PS
>internal_and_trailing_stop_P08100 Added internal and trailing * characters
MNGTEGPNFYVPFSNATGVVRSPFEYPQYYLAEPWQFSMLAAYMFLLIVLGFPINFLTLY
VTVQHKKLRTPLNYILLNLAVADLFMVLGGFTSTLYTSLHGYFVFGPTGCNLEGFFATLG
GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLAGWSRYIP
EGLQCSCGIDYYTLKPEVNNESFVIYMFVVHFTIPMIIIFFCYGQLVFT*KEAAAQQQES
ATTQKAEKEVTRMVIIMVIAFLICWVPYASVAFYIFTHQGSNFGPIFMTIPAFFAKSAAI
YNPVIYIMMNKQFRNCMLTTICCGKNPLGDDEASATVSKTETSQVAPA*
Now let's turn it into a BLAST+ database using makeblastdb v2.2.25+ on my Mac (but I also duplicated this whole process under Linux using the latest BLAST v2.2.26+ as well),
$ makeblastdb -version
makeblastdb: 2.2.25+
Package: blast 2.2.25, build Mar 21 2011 12:21:32
$ makeblastdb -in four_proteins_star.fasta -dbtype prot -out four_proteins_star
Building a new DB, current time: 08/16/2012 14:39:33
New DB name: four_proteins_star
New DB title: four_proteins_star.fasta
Sequence type: Protein
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1073741824B
Adding sequences from FASTA; added 4 sequences in 0.00264692 seconds.
$ ls four_proteins_star.*
four_proteins_star.fasta four_proteins_star.phr four_proteins_star.pin four_proteins_star.psq
makeblastdb: 2.2.25+
Package: blast 2.2.25, build Mar 21 2011 12:21:32
$ makeblastdb -in four_proteins_star.fasta -dbtype prot -out four_proteins_star
Building a new DB, current time: 08/16/2012 14:39:33
New DB name: four_proteins_star
New DB title: four_proteins_star.fasta
Sequence type: Protein
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1073741824B
Adding sequences from FASTA; added 4 sequences in 0.00264692 seconds.
$ ls four_proteins_star.*
four_proteins_star.fasta four_proteins_star.phr four_proteins_star.pin four_proteins_star.psq
Let's use blastdbcmd to recover the FASTA sequence (and check it stored the stop codons):
$ blastdbcmd -db four_proteins_star -entry all
>gnl|BL_ORD_ID|0 original_Q9BS26 Unedited sequence
MHPAVFLSLPDLRCSLLLLVTWVFTPVTTEITSLDTENIDEILNNADVALVNFYADWCRFSQMLHPIFEEASDVIKEEFP
NENQVVFARVDCDQHSDIAQRYRISKYPTLKLFRNGMMMKREYRGQRSVKALADYIRQQKSDPIQEIRDLAEITTLDRSK
RNIIGYFEQKDSDNYRVFERVANILHDDCAFLSAFGDVSKPERYSGDNIIYKPPGHSAPDMVYLGAMTNFDVTYNWIQDK
CVPLVREITFENGEELTEEGLPFLILFHMKEDTESLEIFQNEVARQLISEKGTINFLHADCDKFRHPLLHIQKTPADCPV
IAIDSFRHMYVFGDFKDVLIPGKLKQFVFDLHSGKLHREFHHGPDPTDTAPGEQAQDVASSPPESSFQKLAPSEYRYTLL
RDRDEL
>gnl|BL_ORD_ID|1 tailing_stop_Q9NSY1 Added trailing * character (representing stop codon)
MKKFSRMPKSEGGSGGGAAGGGAGGAGAGAGCGSGGSSVGVRVFAVGRHQVTLEESLAEGGFSTVFLVRTHGGIRCALKR
MYVNNMPDLNVCKREITIMKELSGHKNIVGYLDCAVNSISDNVWEVLILMEYCRAGQVVNQMNKKLQTGFTEPEVLQIFC
DTCEAVARLHQCKTPIIHRDLKVENILLNDGGNYVLCDFGSATNKFLNPQKDGVNVVEEEIKKYTTLSYRAPEMINLYGG
KPITTKADIWALGCLLYKLCFFTLPFGESQVAICDGNFTIPDNSRYSRNIHCLIRFMLEPDPEHRPDIFQVSYFAFKFAK
KDCPVSNINNSSIPSALPEPMTASEAAARKSQIKARITDTIGPTETSIAPRQRPKANSATTATPSVLTIQSSATPVKVLA
PGEFGNHRPKGALRPGNGPEILLGQGPPQQPPQQHRVLQQLQQGDWRLQQLHLQHRHPHQQQQQQQQQQQQQQQQQQQQQ
QQQQQQHHHHHHHHLLQDAYMQQYQHATQQQQMLQQQFLMHSVYQPQPSASQYPTMMPQYQQAFFQQQMLAQHQPSQQQA
SPEYLTSPQEFSPALVSYTSSLPAQVGTIMDSSYSANRSVADKEAIANFTNQKNISNPPDMSGWNPFGEDNFSKLTEEEL
LDREFDLLRSNRLEERASSDKNVDSLSAPHNHPPEDPFGSVPFISHSGSPEKKAEHSSINQENGTANPIKNGKTSPASKD
QRTGKKTSVQGQVQKGNDESESDFESDPPSPKSSEEEEQDDEEVLQGEQGDFNDDDTEPENLGHRPLLMDSEDEEEEEKH
SSDSDYEQAKAKYSDMSSVYRDRSGSGPTQDLNTILLTSAQLSSDVAVETPKQEFDVFGAVPFFAVRAQQPQQEKNEKNL
PQHRFPAAGLEQEEFDVFTKAPFSKKVNVQECHAVGPEAHTIPGYPKSVDVFGSTPFQPFLTSTSKSESNEDLFGLVPFD
EITGSQQQKVKQRSLQKLSSRQRRTKQDMSKSNGKRHHGTPTSTKKTLKPTYRTPERARRHKKVGRRDSQSSNEFLTISD
SKENISVALTDGKDRGNVLQPEESLLDPFGAKPFHSPDLSWHPPHQGLSDIRADHNTVLPGRPRQNSLHGSFHSADVLKM
DDFGAVPFTELVVQSITPHQSQQSQPVELDPFGAAPFPSKQ*
>gnl|BL_ORD_ID|2 internal_stop_P06213 Added two internal * characters
MATGGRRGAAAAPLLVAVAALLLGAAGHLYPGEVCPGMDIRNNLTRLHELENCSVIEGHLQILLMFKTRPEDFRDLSFPK
LIMITDYLLLFRVYGLESLKDLFPNLTVIRGSRLFFNYALVIFEMVHLKELGLYNLMNITRGSVRIEKNNELCYLATIDW
SRILDSVEDNYIVLNKDDNEECGDICPGTAKGKTNCPATVINGQFVERCWTHSHCQKVCPTICKSHGCTAEGLCCHSECL
GNCSQPDDPTKCVACRNFYLDGRCVETCPPPYYHFQDWRCVNFSFCQDLHHKCKNSRRQGCHQYVIHNNKCIPECPSGYT
MNSSNLLCTPCLGPCPKVCHLLEGEKTIDSVTSAQELRGCTVINGSLIINIRGGNNLAAELEANLGLIEEISGYLKIRRS
YALVSLSFFRKLRLIRGETLEIGNYSFYALDNQNLRQLWDWSKHNLTITQGKLFFHYNPKLCLSEIHKMEEVSGTKGRQE
RNDIALKTNGDQASCENELLKFSYIRTSFDKILLRWEPYWPPDFRDLLGFMLFYKEAPYQNVTEFDGQDACGSNSWTVVD
IDPPLRSNDPKSQNHPGWLMRGLKPWTQYAIFVKTLVTFSDERRTYGAKSDIIYVQTDATNPSVPLDPISVSNS*SQIIL
KWKPPSDPNGNITHYLVFWERQAEDSELFELDYCLKGLKLPSRTWSPPFESEDSQKHNQSEYEDSAGECCSCPKTDSQIL
KELEESSFRKTFEDYLHNVVFVPRKTSSGTGAEDPRPSRKRRSLGDVGNVTVAVPTVAAFPNTSSTSVPTSPEEHRPFEK
VVNKESLVISGLRHFTGYRIELQACNQDTPEERCSVAAYVSARTMPEAKADDIVGPVTHEIFENNVVHLMWQEPKEPNGL
IVLYEVSYRRYGDEELHLCVSRKHFALERGCRLRGLSPGNYSVRIRATSLAGNGSWTEPTYFYVTDYLDVPSNIAKIIIG
PLIFVFLFSVVIGSIYLFLRKRQPDGPLGPLYASSNPEYLSASDVFPCSVYVPDEWEVSREKITLLRELGQGSFGMVYEG
NARDIIKGEAETRVAVKTVNESASLRERIEFLNEASVMKGFTCHHVVRLLGVVSKGQPTLVVMELMAHGDLKSYLRSLRP
EAENNPGRP*PTLQEMIQMAAEIADGMAYLNAKKFVHRDLAARNCMVAHDFTVKIGDFGMTRDIYETDYYRKGGKGLLPV
RWMAPESLKDGVFTTSSDMWSFGVVLWEITSLAEQPYQGLSNEQVLKFVMDGGYLDQPDNCPERVTDLMRMCWQFNPKMR
PTFLEIVNLLKDDLHPSFPEVSFFHSEENKAPESEELEMEFEDMENVPLDRSSHCQREEAGGRDGGSSLGFKRSYEEHIP
YTHMNGGKKNGRILTLPRSNPS
>gnl|BL_ORD_ID|3 internal_and_trailing_stop_P08100 Added internal and trailing * characters
MNGTEGPNFYVPFSNATGVVRSPFEYPQYYLAEPWQFSMLAAYMFLLIVLGFPINFLTLYVTVQHKKLRTPLNYILLNLA
VADLFMVLGGFTSTLYTSLHGYFVFGPTGCNLEGFFATLGGEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFT
WVMALACAAPPLAGWSRYIPEGLQCSCGIDYYTLKPEVNNESFVIYMFVVHFTIPMIIIFFCYGQLVFT*KEAAAQQQES
ATTQKAEKEVTRMVIIMVIAFLICWVPYASVAFYIFTHQGSNFGPIFMTIPAFFAKSAAIYNPVIYIMMNKQFRNCMLTT
ICCGKNPLGDDEASATVSKTETSQVAPA*
Notice the NCBI defaults to imposing their naming convention on the output - they've assigned made up ordinal identifiers (sequence 0, 1, 2 and 3). But that is not the problem I'm focussing on today.
So far so good - we've got a BLAST database using stop codons.
Using a BLAST database with stop codons
Now let's try a search, using a carp protein chosen to give a nice match:
$ cat carp.fasta
>gi|33355376|gb|AAN52151.1| IGF-I receptor subtype a [Cyprinus carpio]
MRSGTARDVWTLFWGPVLFLSTICLWSAHGEVCGPHIDIRNDISEFKKLENCTVVEGYLQILLIGDKNNN
LNQEHFRTLSFPKLTMVTDYLLLFRVSGLDSLSMLFPNLNVIRGRNLFYNYALVIFEMTSLKDIGLYNLR
NITRGAIRIEKNPELCYLDSVDWSLIMDAEFNNYIAGNKQSKECGDVCPGIMEDRPQCIRTSFNDNYSYR
CWTSNDCQKVCPKECEKRACTYKGKCCHPQCLGSCTEPDNDKACAACQHYFHEDRCVETCPQGTYKFEGW
RCITMDMCARVHLPSEVDFVIHNGECMPECPPGFTRNETQSMFCSACDGLCDKVCESKTIDSVDAAQSLQ
GCTVIKGNLNINIRRGNNIASELESFMGLIQTVTGYVKIKHSQTLGSLSFLKSLRYIHGEELTEQSYAFS
AIDNQNLQYLWDWSQHNLTIRAGKLFFRFNPKLCMSEIQKMWEKTGVKEKMEEDDARSNGERASCESYIL
KFKSNHTQSTRIKLTWERYRPPDYRDLISFIVYYKEAPFQNITEFDGQDGCGSNSWNMVDVDLPQDKSTD
PGVLLSSLKPWTQYAIFVKAVTLVVEDKHILGAKSEVVYIRTNASAPSMPLDARAYANSSSSLMVKWSPP
ISPNGNKTFYLLRWQQQAEDRELYQHNYCSKELKIPIRISATALSDMEGETKPTKSDVGGGEKGCCPCPK
TKEDLKAEAEDRSYRKVFENFLHNSIFTPRPPDRKRRDLFGVANGTLLQGASRNITGMESNITDPKLHEK
EYPFSEDKVYTEFMEISNLQPFTVYRIDIHACNYEIHRCSAAAFVFSRTKPADKADDIPGTVTQERDEKG
EGTVLLRWPEPHHPNGLILMYEIKYRLGTEPEKHECVSRQQYRVLRGAQLSNLPSGNYSARVRATSLAGN
GSWTEPVSFYVPPPKRNYDNAFLLAIIIPIIVLLLLTAVISVVIFVTKKRNSDRLGNGVLYASVNPEYFS
PFEMYVPDEWEVAREKITMCRELGQGSFGMVYEGIAKGVVKDEPETRVAIKTVNESASMRERIEFLNEAS
VMKEFNCHHVVRLLGVVSQRQPTLVIMELMTRGDLKSYLRSLRSKEQGYSTQSLPPLKKMIQMASEIADG
MAYLNANKFVHRDLAARNCMVAEDFTVKIGDFGMTRDIYETDYYRKGGKGLLPVRWMSPESLKDGVFTTN
SDVWSFGVVLWEIATLAKQPYQGMSNEQVLRFVMEGGLLDKPDNCPDMLFELMRMCWQYNPKMRPSFLEI
INSIKEELEPPFQEVSFFYSEENKPPDTEELDMEVENMENVPLEPASMLAPSSQGSTGAQTQSPQPLSPA
QGPGTPVGSAQTPSTSPPPSSSSASPGLVLDKHSGQVAANGPVLLLGPAFDETQPYAHMNGGRKNERALP
LPQSSAC
>gi|33355376|gb|AAN52151.1| IGF-I receptor subtype a [Cyprinus carpio]
MRSGTARDVWTLFWGPVLFLSTICLWSAHGEVCGPHIDIRNDISEFKKLENCTVVEGYLQILLIGDKNNN
LNQEHFRTLSFPKLTMVTDYLLLFRVSGLDSLSMLFPNLNVIRGRNLFYNYALVIFEMTSLKDIGLYNLR
NITRGAIRIEKNPELCYLDSVDWSLIMDAEFNNYIAGNKQSKECGDVCPGIMEDRPQCIRTSFNDNYSYR
CWTSNDCQKVCPKECEKRACTYKGKCCHPQCLGSCTEPDNDKACAACQHYFHEDRCVETCPQGTYKFEGW
RCITMDMCARVHLPSEVDFVIHNGECMPECPPGFTRNETQSMFCSACDGLCDKVCESKTIDSVDAAQSLQ
GCTVIKGNLNINIRRGNNIASELESFMGLIQTVTGYVKIKHSQTLGSLSFLKSLRYIHGEELTEQSYAFS
AIDNQNLQYLWDWSQHNLTIRAGKLFFRFNPKLCMSEIQKMWEKTGVKEKMEEDDARSNGERASCESYIL
KFKSNHTQSTRIKLTWERYRPPDYRDLISFIVYYKEAPFQNITEFDGQDGCGSNSWNMVDVDLPQDKSTD
PGVLLSSLKPWTQYAIFVKAVTLVVEDKHILGAKSEVVYIRTNASAPSMPLDARAYANSSSSLMVKWSPP
ISPNGNKTFYLLRWQQQAEDRELYQHNYCSKELKIPIRISATALSDMEGETKPTKSDVGGGEKGCCPCPK
TKEDLKAEAEDRSYRKVFENFLHNSIFTPRPPDRKRRDLFGVANGTLLQGASRNITGMESNITDPKLHEK
EYPFSEDKVYTEFMEISNLQPFTVYRIDIHACNYEIHRCSAAAFVFSRTKPADKADDIPGTVTQERDEKG
EGTVLLRWPEPHHPNGLILMYEIKYRLGTEPEKHECVSRQQYRVLRGAQLSNLPSGNYSARVRATSLAGN
GSWTEPVSFYVPPPKRNYDNAFLLAIIIPIIVLLLLTAVISVVIFVTKKRNSDRLGNGVLYASVNPEYFS
PFEMYVPDEWEVAREKITMCRELGQGSFGMVYEGIAKGVVKDEPETRVAIKTVNESASMRERIEFLNEAS
VMKEFNCHHVVRLLGVVSQRQPTLVIMELMTRGDLKSYLRSLRSKEQGYSTQSLPPLKKMIQMASEIADG
MAYLNANKFVHRDLAARNCMVAEDFTVKIGDFGMTRDIYETDYYRKGGKGLLPVRWMSPESLKDGVFTTN
SDVWSFGVVLWEIATLAKQPYQGMSNEQVLRFVMEGGLLDKPDNCPDMLFELMRMCWQYNPKMRPSFLEI
INSIKEELEPPFQEVSFFYSEENKPPDTEELDMEVENMENVPLEPASMLAPSSQGSTGAQTQSPQPLSPA
QGPGTPVGSAQTPSTSPPPSSSSASPGLVLDKHSGQVAANGPVLLLGPAFDETQPYAHMNGGRKNERALP
LPQSSAC
First, the default text output, which I have edited for conciseness - marking removed text with ellipsises, and then highlighting points of interest in red.
$ blastp -query carp.fasta -db four_proteins_star
BLASTP 2.2.25+
...
BLASTP 2.2.25+
...
> internal_stop_P06213 Added two internal * characters
Length=1382
Score = 1460 bits (3780), Expect = 0.0, Method: Compositional matrix adjust.
Identities = 758/1416 (54%), Positives = 960/1416 (68%), Gaps = 107/1416 (7%)
Query 30 GEVCGPHIDIRNDISEFKKLENCTVVEGYLQILLIGDKNNNLNQEHFRTLSFPKLTMVTD 89
GEVC P +DIRN+++ +LENC+V+EG+LQILL+ E FR LSFPKL M+TD
Sbjct 32 GEVC-PGMDIRNNLTRLHELENCSVIEGHLQILLMFKTR----PEDFRDLSFPKLIMITD 86
Length=1382
Score = 1460 bits (3780), Expect = 0.0, Method: Compositional matrix adjust.
Identities = 758/1416 (54%), Positives = 960/1416 (68%), Gaps = 107/1416 (7%)
Query 30 GEVCGPHIDIRNDISEFKKLENCTVVEGYLQILLIGDKNNNLNQEHFRTLSFPKLTMVTD 89
GEVC P +DIRN+++ +LENC+V+EG+LQILL+ E FR LSFPKL M+TD
Sbjct 32 GEVC-PGMDIRNNLTRLHELENCSVIEGHLQILLMFKTR----PEDFRDLSFPKLIMITD 86
...
Query 610 PLDARAYANSSSSLMVKWSPPISPNGNKTFYLLRWQQQAEDRELYQHNYCSKELKIPIRI 669
PLD + +NS S +++KW PP PNGN T YL+ W++QAED EL++ +YC K LK+P R
Sbjct 625 PLDPISVSNS?SQIILKWKPPSDPNGNITHYLVFWERQAEDSELFELDYCLKGLKLPSRT 684
PLD + +NS S +++KW PP PNGN T YL+ W++QAED EL++ +YC K LK+P R
Sbjct 625 PLDPISVSNS?SQIILKWKPPSDPNGNITHYLVFWERQAEDSELFELDYCLKGLKLPSRT 684
...
Query 1070 RQPTLVIMELMTRGDLKSYLRSLRSKEQGYSTQSLPPLKKMIQMASEIADGMAYLNANKF 1129
QPTLV+MELM GDLKSYLRSLR + + + P L++MIQMA+EIADGMAYLNA KF
Sbjct 1096 GQPTLVVMELMAHGDLKSYLRSLRPEAENNPGRP?PTLQEMIQMAAEIADGMAYLNAKKF 1155
QPTLV+MELM GDLKSYLRSLR + + + P L++MIQMA+EIADGMAYLNA KF
Sbjct 1096 GQPTLVVMELMAHGDLKSYLRSLRPEAENNPGRP?PTLQEMIQMAAEIADGMAYLNAKKF 1155
...
When sending the output directly to the terminal like this, the stop codons appear in the text output as a question mark (marked in red). The same happens with the tabular output if you request the sseq column, and the XML. But all is not as it might seem - and there is a clue if you page the output though the 'more' command:
$ blastp -query carp.fasta -db four_proteins_star | more
BLASTP 2.2.25+
...
BLASTP 2.2.25+
...
> internal_stop_P06213 Added two internal * characters
Length=1382
Score = 1460 bits (3780), Expect = 0.0, Method: Compositional matrix adjust.
Identities = 758/1416 (54%), Positives = 960/1416 (68%), Gaps = 107/1416 (7%)
Query 30 GEVCGPHIDIRNDISEFKKLENCTVVEGYLQILLIGDKNNNLNQEHFRTLSFPKLTMVTD 89
GEVC P +DIRN+++ +LENC+V+EG+LQILL+ E FR LSFPKL M+TD
Sbjct 32 GEVC-PGMDIRNNLTRLHELENCSVIEGHLQILLMFKTR----PEDFRDLSFPKLIMITD 86
Length=1382
Score = 1460 bits (3780), Expect = 0.0, Method: Compositional matrix adjust.
Identities = 758/1416 (54%), Positives = 960/1416 (68%), Gaps = 107/1416 (7%)
Query 30 GEVCGPHIDIRNDISEFKKLENCTVVEGYLQILLIGDKNNNLNQEHFRTLSFPKLTMVTD 89
GEVC P +DIRN+++ +LENC+V+EG+LQILL+ E FR LSFPKL M+TD
Sbjct 32 GEVC-PGMDIRNNLTRLHELENCSVIEGHLQILLMFKTR----PEDFRDLSFPKLIMITD 86
...
Query 610 PLDARAYANSSSSLMVKWSPPISPNGNKTFYLLRWQQQAEDRELYQHNYCSKELKIPIRI 669
PLD + +NS S +++KW PP PNGN T YL+ W++QAED EL++ +YC K LK+P R
Sbjct 625 PLDPISVSNS<FF>SQIILKWKPPSDPNGNITHYLVFWERQAEDSELFELDYCLKGLKLPSRT 684
PLD + +NS S +++KW PP PNGN T YL+ W++QAED EL++ +YC K LK+P R
Sbjct 625 PLDPISVSNS<FF>SQIILKWKPPSDPNGNITHYLVFWERQAEDSELFELDYCLKGLKLPSRT 684
...
Query 1070 RQPTLVIMELMTRGDLKSYLRSLRSKEQGYSTQSLPPLKKMIQMASEIADGMAYLNANKF 1129
QPTLV+MELM GDLKSYLRSLR + + + P L++MIQMA+EIADGMAYLNA KF
Sbjct 1096 GQPTLVVMELMAHGDLKSYLRSLRPEAENNPGRP<FF>PTLQEMIQMAAEIADGMAYLNAKKF 1155
QPTLV+MELM GDLKSYLRSLR + + + P L++MIQMA+EIADGMAYLNA KF
Sbjct 1096 GQPTLVVMELMAHGDLKSYLRSLRPEAENNPGRP<FF>PTLQEMIQMAAEIADGMAYLNAKKF 1155
...
This was on Mac OS X, although 'more' doesn't actually use red it does print <FF> with reverse contrast highlighting. This indicates the non-printing character with hex code 0xFF, or in decimal 255. There's another way to show this - let's save the BLAST output to files:
$ blastp -query carp.fasta -db four_proteins_star -out carp.txt
$ blastp -query carp.fasta -db four_proteins_star -outfmt "6 std qseq sseq" -out carp.tsv
$ blastp -query carp.fasta -db four_proteins_star -outfmt 5 -out carp.xmlNow let's take a closer look at that first 'stop codon' character using 'hexdump':
$ hexdump -C carp.txt | grep -C 3 SQIILK
00000f40 2b 2b 51 41 45 44 20 45 4c 2b 2b 20 2b 59 43 20 |++QAED EL++ +YC |
00000f50 4b 20 4c 4b 2b 50 20 52 20 0a 53 62 6a 63 74 20 |K LK+P R .Sbjct |
00000f60 20 36 32 35 20 20 20 50 4c 44 50 49 53 56 53 4e | 625 PLDPISVSN|
00000f70 53 ff 53 51 49 49 4c 4b 57 4b 50 50 53 44 50 4e |S.SQIILKWKPPSDPN|
00000f80 47 4e 49 54 48 59 4c 56 46 57 45 52 51 41 45 44 |GNITHYLVFWERQAED|
00000f90 53 45 4c 46 45 4c 44 59 43 4c 4b 47 4c 4b 4c 50 |SELFELDYCLKGLKLP|
00000fa0 53 52 54 20 20 36 38 34 0a 0a 51 75 65 72 79 20 |SRT 684..Query |
00000f40 2b 2b 51 41 45 44 20 45 4c 2b 2b 20 2b 59 43 20 |++QAED EL++ +YC |
00000f50 4b 20 4c 4b 2b 50 20 52 20 0a 53 62 6a 63 74 20 |K LK+P R .Sbjct |
00000f60 20 36 32 35 20 20 20 50 4c 44 50 49 53 56 53 4e | 625 PLDPISVSN|
00000f70 53 ff 53 51 49 49 4c 4b 57 4b 50 50 53 44 50 4e |S.SQIILKWKPPSDPN|
00000f80 47 4e 49 54 48 59 4c 56 46 57 45 52 51 41 45 44 |GNITHYLVFWERQAED|
00000f90 53 45 4c 46 45 4c 44 59 43 4c 4b 47 4c 4b 4c 50 |SELFELDYCLKGLKLP|
00000fa0 53 52 54 20 20 36 38 34 0a 0a 51 75 65 72 79 20 |SRT 684..Query |
$ hexdump -C carp.tsv | grep -C 3 SQIILK
00000810 52 47 4c 4b 50 57 54 51 59 41 49 46 56 4b 54 4c |RGLKPWTQYAIFVKTL|
00000820 56 54 46 53 44 45 52 52 54 59 47 41 4b 53 44 49 |VTFSDERRTYGAKSDI|
00000830 49 59 56 51 54 44 41 54 4e 50 53 56 50 4c 44 50 |IYVQTDATNPSVPLDP|
00000840 49 53 56 53 4e 53 ff 53 51 49 49 4c 4b 57 4b 50 |ISVSNS.SQIILKWKP|
00000850 50 53 44 50 4e 47 4e 49 54 48 59 4c 56 46 57 45 |PSDPNGNITHYLVFWE|
00000860 52 51 41 45 44 53 45 4c 46 45 4c 44 59 43 4c 4b |RQAEDSELFELDYCLK|
00000870 47 4c 4b 4c 50 53 52 54 57 53 50 50 46 2d 45 53 |GLKLPSRTWSPPF-ES|
00000810 52 47 4c 4b 50 57 54 51 59 41 49 46 56 4b 54 4c |RGLKPWTQYAIFVKTL|
00000820 56 54 46 53 44 45 52 52 54 59 47 41 4b 53 44 49 |VTFSDERRTYGAKSDI|
00000830 49 59 56 51 54 44 41 54 4e 50 53 56 50 4c 44 50 |IYVQTDATNPSVPLDP|
00000840 49 53 56 53 4e 53 ff 53 51 49 49 4c 4b 57 4b 50 |ISVSNS.SQIILKWKP|
00000850 50 53 44 50 4e 47 4e 49 54 48 59 4c 56 46 57 45 |PSDPNGNITHYLVFWE|
00000860 52 51 41 45 44 53 45 4c 46 45 4c 44 59 43 4c 4b |RQAEDSELFELDYCLK|
00000870 47 4c 4b 4c 50 53 52 54 57 53 50 50 46 2d 45 53 |GLKLPSRTWSPPF-ES|
$ hexdump -C carp.xml | grep -C 3 SQIILK
00001180 50 57 54 51 59 41 49 46 56 4b 54 4c 56 54 46 53 |PWTQYAIFVKTLVTFS|
00001190 44 45 52 52 54 59 47 41 4b 53 44 49 49 59 56 51 |DERRTYGAKSDIIYVQ|
000011a0 54 44 41 54 4e 50 53 56 50 4c 44 50 49 53 56 53 |TDATNPSVPLDPISVS|
000011b0 4e 53 ff 53 51 49 49 4c 4b 57 4b 50 50 53 44 50 |NS.SQIILKWKPPSDP|
000011c0 4e 47 4e 49 54 48 59 4c 56 46 57 45 52 51 41 45 |NGNITHYLVFWERQAE|
000011d0 44 53 45 4c 46 45 4c 44 59 43 4c 4b 47 4c 4b 4c |DSELFELDYCLKGLKL|
000011e0 50 53 52 54 57 53 50 50 46 2d 45 53 45 44 53 51 |PSRTWSPPF-ESEDSQ|
00001180 50 57 54 51 59 41 49 46 56 4b 54 4c 56 54 46 53 |PWTQYAIFVKTLVTFS|
00001190 44 45 52 52 54 59 47 41 4b 53 44 49 49 59 56 51 |DERRTYGAKSDIIYVQ|
000011a0 54 44 41 54 4e 50 53 56 50 4c 44 50 49 53 56 53 |TDATNPSVPLDPISVS|
000011b0 4e 53 ff 53 51 49 49 4c 4b 57 4b 50 50 53 44 50 |NS.SQIILKWKPPSDP|
000011c0 4e 47 4e 49 54 48 59 4c 56 46 57 45 52 51 41 45 |NGNITHYLVFWERQAE|
000011d0 44 53 45 4c 46 45 4c 44 59 43 4c 4b 47 4c 4b 4c |DSELFELDYCLKGLKL|
000011e0 50 53 52 54 57 53 50 50 46 2d 45 53 45 44 53 51 |PSRTWSPPF-ESEDSQ|
The terminal rendered it as a question mark, but the stop codon is actually 0xff, character code 255, a non-printing character. Why is that a problem? Well I can live with it for the text and tabular output, but for XML it has the interesting side effect of rending the XML invalid - because this character is outside the default encoding.
That means any tools to parse the XML should reject this BLAST output!
For example, Biopython's BLAST XML parser will quite rightly reject this file:
$ python
Python 2.7.2 (default, Jun 20 2012, 16:23:33)
[GCC 4.2.1 Compatible Apple Clang 4.0 (tags/Apple/clang-418.0.60)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio.Blast import NCBIXML
>>> for record in NCBIXML.parse(open("carp.xml")): print record
...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.7/site-packages/Bio/Blast/NCBIXML.py", line 652, in parse
expat_parser.Parse(text, False)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 50, column 634
Python 2.7.2 (default, Jun 20 2012, 16:23:33)
[GCC 4.2.1 Compatible Apple Clang 4.0 (tags/Apple/clang-418.0.60)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio.Blast import NCBIXML
>>> for record in NCBIXML.parse(open("carp.xml")): print record
...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.7/site-packages/Bio/Blast/NCBIXML.py", line 652, in parse
expat_parser.Parse(text, False)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 50, column 634
XML may be a bit on the heavy handed side, but the fact it can be validated is very handy for investigating parsing issues. You can try this yourself by running this example, and then uploading the broken XML BLAST file to an online XML validator, e.g. http://validator.w3.org/check complains ascii "\xFF" does not map to Unicode on line 50 of the file.
I was first bitten by this NCBI BLAST+ bug over a year ago, but was recently reminded about it when my Biopython Google Summer of Code (GSoC) student Bow Arindrarto hit a very similar issue. So, better late than never, I am reporting this to the NCBI - by email and public blog post since the NCBI BLAST team don't have a public bug tracker.
Update: What does 'legacy' BLAST do?
On Twitter Nick Loman suggested sticking with blastall from the NCBI 'legacy' BLAST suite. So I tried it with BLAST 2.2.24 under Linux. First replace the database:
$ rm four_proteins_star.p*
$ ~/Downloads/blast-2.2.24/bin/formatdb -p T -i four_proteins_star.fasta -n four_proteins_star
Now make new plain text and XML files (as far as I recall, you can't get the sequence in the tabular output from legacy BLAST so I skipped that):
$ ~/Downloads/blast-2.2.24/bin/blastall -p blastp -i carp.fasta -d four_proteins_star -o carp.txt
$ ~/Downloads/blast-2.2.24/bin/blastall -p blastp -i carp.fasta -d four_proteins_star -m 7 -o carp.xml
$ hexdump -C carp.txt | grep -C 3 SQIILK
00000f20 4c 2b 20 57 2b 2b 51 41 45 44 20 45 4c 2b 2b 20 |L+ W++QAED EL++ |
00000f30 2b 59 43 20 4b 20 4c 4b 2b 50 20 52 20 0a 53 62 |+YC K LK+P R .Sb|
00000f40 6a 63 74 3a 20 36 32 35 20 20 50 4c 44 50 49 53 |jct: 625 PLDPIS|
00000f50 56 53 4e 53 2a 53 51 49 49 4c 4b 57 4b 50 50 53 |VSNS*SQIILKWKPPS|
00000f60 44 50 4e 47 4e 49 54 48 59 4c 56 46 57 45 52 51 |DPNGNITHYLVFWERQ|
00000f70 41 45 44 53 45 4c 46 45 4c 44 59 43 4c 4b 47 4c |AEDSELFELDYCLKGL|
00000f80 4b 4c 50 53 52 54 20 36 38 34 0a 0a 51 75 65 72 |KLPSRT 684..Quer|
$ hexdump -C carp.xml | grep -C 3 SQIILK
00001140 4c 4d 52 47 4c 4b 50 57 54 51 59 41 49 46 56 4b |LMRGLKPWTQYAIFVK|
00001150 54 4c 56 54 46 53 44 45 52 52 54 59 47 41 4b 53 |TLVTFSDERRTYGAKS|
00001160 44 49 49 59 56 51 54 44 41 54 4e 50 53 56 50 4c |DIIYVQTDATNPSVPL|
00001170 44 50 49 53 56 53 4e 53 2a 53 51 49 49 4c 4b 57 |DPISVSNS*SQIILKW|
00001180 4b 50 50 53 44 50 4e 47 4e 49 54 48 59 4c 56 46 |KPPSDPNGNITHYLVF|
00001190 57 45 52 51 41 45 44 53 45 4c 46 45 4c 44 59 43 |WERQAEDSELFELDYC|
000011a0 4c 4b 47 4c 4b 4c 50 53 52 54 57 53 50 50 46 2d |LKGLKLPSRTWSPPF-|
00000f20 4c 2b 20 57 2b 2b 51 41 45 44 20 45 4c 2b 2b 20 |L+ W++QAED EL++ |
00000f30 2b 59 43 20 4b 20 4c 4b 2b 50 20 52 20 0a 53 62 |+YC K LK+P R .Sb|
00000f40 6a 63 74 3a 20 36 32 35 20 20 50 4c 44 50 49 53 |jct: 625 PLDPIS|
00000f50 56 53 4e 53 2a 53 51 49 49 4c 4b 57 4b 50 50 53 |VSNS*SQIILKWKPPS|
00000f60 44 50 4e 47 4e 49 54 48 59 4c 56 46 57 45 52 51 |DPNGNITHYLVFWERQ|
00000f70 41 45 44 53 45 4c 46 45 4c 44 59 43 4c 4b 47 4c |AEDSELFELDYCLKGL|
00000f80 4b 4c 50 53 52 54 20 36 38 34 0a 0a 51 75 65 72 |KLPSRT 684..Quer|
$ hexdump -C carp.xml | grep -C 3 SQIILK
00001140 4c 4d 52 47 4c 4b 50 57 54 51 59 41 49 46 56 4b |LMRGLKPWTQYAIFVK|
00001150 54 4c 56 54 46 53 44 45 52 52 54 59 47 41 4b 53 |TLVTFSDERRTYGAKS|
00001160 44 49 49 59 56 51 54 44 41 54 4e 50 53 56 50 4c |DIIYVQTDATNPSVPL|
00001170 44 50 49 53 56 53 4e 53 2a 53 51 49 49 4c 4b 57 |DPISVSNS*SQIILKW|
00001180 4b 50 50 53 44 50 4e 47 4e 49 54 48 59 4c 56 46 |KPPSDPNGNITHYLVF|
00001190 57 45 52 51 41 45 44 53 45 4c 46 45 4c 44 59 43 |WERQAEDSELFELDYC|
000011a0 4c 4b 47 4c 4b 4c 50 53 52 54 57 53 50 50 46 2d |LKGLKLPSRTWSPPF-|
Here we get the desired asterisk, *, whose ASCII code is 42 or in hex 0x42.
So, low and behold, this bug doesn't happen under 'legacy' BLAST - it is a regression in BLAST+.
Update
I had a reply from my email to the NCBI confirming this was being passed on to the BLAST developers to be fixed.
Update (12 September 2012)
The NCBI have just announced the release of BLAST 2.2.27+ which should fix this bug.
Hi, thanks for your post!
ReplyDeleteI've got the same exception from biopython when parsing some blast results: the big problem is that there is no way to known which sequence causes the problem :/
The exception should give you the XML line number, so you can (manually) look in the file to find the problem.
DeleteBow's GSoC work may offer a workaround. You could index the XML file by query, and then request each query's results one by one, catching any exceptions. Doing it this way the new indexing code would extract just the XML fragment of interest, and parse that on its own. I've used this technique with Biopython's SeqIO to handle a corrupted sequence file. You still can't parse the broken results without fixing the XML though.
I've yet to find a compelling reason to switch to BLAST+ from blastall!
ReplyDeleteI do like the flexible tabular output, where you can request any combination of the supported fields. It would be better still if they added the obvious BLAST+ omissions of the hit descriptions and taxon ID.
DeleteRight now that's all I can think of as upsides of the switch - which is quite telling.
If they hadn't done the BLAST+ -subject option as pairwise searches, I would have added FASTA vs FASTA searching as a nice shortcut for one-off searches.
I believe that this bug has already been fixed for 2.2.27+. I discovered this bug independently and corresponded with NCBI, documented here:
ReplyDeletehttps://github.com/yannickwurm/sequenceserver/issues/87#issuecomment-6112940
I'm not sure if this is already known, but another problem I've found in blast+ is that sometimes the tabular output doesn't give all the results that it should, unless you use the -max_target_seqs flag. I documented here:
ReplyDeletehttp://seqanswers.com/forums/showthread.php?t=21392
Yup, WU-Blast behaved a bit better with stops... and is not soo picky with the defline.
ReplyDelete