python 学习之 fasta/fastq 处理利器–pyfastx
点击上方蓝字关注 老俊俊

1引言
今天来介绍一款能快速处理 fasta 和 fastq 文件的利器: pyfastx ,不用我们自己繁琐的去写代码了。
Pyfastx 是一个轻量级的 Python C 扩展,允许用户随机访问 纯文本 和 gzipped FASTA/Q 文件中的序列。该模块旨在为用户提供简单的 api,从 FASTA 中提取 seqeunce,并通过标识符和索引号从 FASTQ 中读取。Pyfastx 将建立存储在 sqlite3 数据库文件中的索引,以便随机访问,从而避免消耗过多的内存。此外,pyfastx 还可以解析标准(序列分散为相同长度的多行)和非标准(序列分散为不同长度的一行或多行) FASTA 格式。
特点:
-
Single file for the Python extension -
Lightweight, memory efficient for parsing FASTA/Q file -
Fast random access to sequences from gzipped FASTA/Q file -
Read sequences from FASTA file line by line -
Calculate N50 and L50 of sequences in FASTA file -
Calculate GC content and nucleotides composition -
Extract reverse, complement and antisense sequences -
Excellent compatibility, support for parsing nonstandard FASTA file -
Support for FASTQ quality score conversion -
Provide command line interface for splitting FASTA/Q file
语言构成:

github 地址:
https://github.com/lmdu/pyfastx
2安装
目前支持 Python 3.5, 3.6, 3.7, 3.8, 3.9:
pip install pyfastx
更新:
pip install -U pyfastx
3FASTX 序列迭代
fasta 序列迭代
打印出 ID,序列及 ID 后的信息:
# 导入模块
import pyfastx
# 读取文件
fa = pyfastx.Fastx('data/test.fa.gz')
# 输出信息
for name,seq,comment in fa:
print(name)
print(seq)
print(comment)
JZ822577.1
CTCTAGAGATTACTTCTTCACATTCCAGATCACTCAGGCTCTTTGTCATTTTAGTTTGACTAGGATATCGAGTATTCAAGCTCATCGCTTTTGGTAATCTTTGCGGTGCATGCCTTTGCATGCTGTATTGCTGCTTCATCATCCCCTTTGACTTGTGTGGCGGTGGCAAGACATCCGAAGAGTTAAGCGATGCTTGTCTAGTCAATTTCCCCATGTACAGAATCATTGTTGTCAATTGGTTGTTTCCTTGATGGTGAAGGGGCTTCAATACATGAGTTCCAAACTAACATTTCTTGACTAACACTTGAGGAAGAAGGACAAGGGTCCCCATGT
contig1 cDNA library of flower petals in tree peony by suppression subtractive hybridization Paeonia suffruticosa cDNA, mRNA sequence
...
如果含有小写序列,指定输出大写, uppercase=True:
#always output uppercase sequence
for item in pyfastx.Fastx('data/test.fa', uppercase=True):
print(item)
('JZ822577.1', 'CTCTAGAGATTACTTCTTCACATTCCAGATCACTCAGGCTCTTTGTCATTTTAGTTTGACTAGGATATCGAGTATTCAAGCTCATCGCTTTTGGTAATCTTTGCGGTGCATGCCTTTGCATGCTGTATTGCTGCTTCATCATCCCCTTTGACTTGTGTGGCGGTGGCAAGACATCCGAAGAGTTAAGCGATGCTTGTCTAGTCAATTTCCCCATGTACAGAATCATTGTTGTCAATTGGTTGTTTCCTTGATGGTGAAGGGGCTTCAATACATGAGTTCCAAACTAACATTTCTTGACTAACACTTGAGGAAGAAGGACAAGGGTCCCCATGT', 'contig1 cDNA library of flower petals in tree peony by suppression subtractive hybridization Paeonia suffruticosa cDNA, mRNA sequence')
...
手动指定类型:
#Manually specify sequence format
for item in pyfastx.Fastx('data/test.fa', format="fasta"):
print(item)
('JZ822577.1', 'CTCTAGAGATTACTTCTTCACATTCCAGATCACTCAGGCTCTTTGTCATTTTAGTTTGACTAGGATATCGAGTATTCAAGCTCATCGCTTTTGGTAATCTTTGCGGTGCATGCCTTTGCATGCTGTATTGCTGCTTCATCATCCCCTTTGACTTGTGTGGCGGTGGCAAGACATCCGAAGAGTTAAGCGATGCTTGTCTAGTCAATTTCCCCATGTACAGAATCATTGTTGTCAATTGGTTGTTTCCTTGATGGTGAAGGGGCTTCAATACATGAGTTCCAAACTAACATTTCTTGACTAACACTTGAGGAAGAAGGACAAGGGTCCCCATGT', 'contig1 cDNA library of flower petals in tree peony by suppression subtractive hybridization Paeonia suffruticosa cDNA, mRNA sequence')
...
fastq 序列迭代
打印 ID,序列,质量值及 ID 后的信息:
# 读入数据
fq = pyfastx.Fastx('data/test.fq.gz')
# 分别打印ID,序列,质量值及ID后的信息
for name,seq,qual,comment in fq:
print(name)
print(seq)
print(qual)
print(comment)
A00129:183:H77K2DMXX:1:1101:6804:1031
TGCACACGTAGGCGCGAGCGTCGCCGCCGGCGGCCTTGCAGGCAGCGACGGCTTCGTCGAGACGTTCGCGGTTGAGATCGACCAGGGCCAGCCTGGCGCCCTTGCCGGCGAGATATTCGCCCATTGCCCGGCCGAGCCCCTGGCAGCCGC
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,:FFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFF
1:N:0:CAATGGAA+CGAGGCTG
...
4FASTA 处理
读取 fasta 序列
读取普通或者 gzipped 的 FASTA 文件并建立索引,支持对 FASTA 的随机访问:
import pyfastx
fa = pyfastx.Fasta('data/test.fa.gz')
fa
<Fasta> data/test.fa.gz contains 211 sequences
迭代读取 fasta 序列
在不构建索引的情况下迭代纯文本或 gzipped FASTA 文件是最快方法,迭代将返回一个包含名称和序列的 tuple:
import pyfastx
for name, seq in pyfastx.Fasta('data/test.fa.gz', build_index=False):
print(name, seq)
JZ822577.1 CTCTAGAGATTACTTCTTCACATTCCAGATCACTCAGGCTCTTTGTCATTTTAGTTTGACTAGGATATCGAGTATTCAAGCTCATCGCTTTTGGTAATCTTTGCGGTGCATGCCTTTGCATGCTGTATTGCTGCTTCATCATCCCCTTTGACTTGTGTGGCGGTGGCAAGACATCCGAAGAGTTAAGCGATGCTTGTCTAGTCAATTTCCCCATGTACAGAATCATTGTTGTCAATTGGTTGTTTCCTTGATGGTGAAGGGGCTTCAATACATGAGTTCCAAACTAACATTTCTTGACTAACACTTGAGGAAGAAGGACAAGGGTCCCCATGT
JZ822578.1 TACATGGGGATGGCTCCCCAACTAAGCCATTTGCAGTTCCCAATACATGACAGTATAAAAAAGATTATATAACACAAGAACTCCATGTGGTTGTTTTTGACCTGAAGTGGGCGAAATCTTGTAGCAATGGAGCCAAATGATCAATCTTTTCTTTCACCTACTAGTAGTGATGAAAGGTTGGAAGTTGCTGGGGATCTTTTGGAAGAGTGCTGGTTCTTTGAAAACATACTTGATAGAAAAGCGAGGATGTTGAGATGTCATTCTGACCCTTGTCCTTCTTCCTCAAGTGTTAGTCAAGAAATGTTAGTTTGGAACTCATGTATTGAAGCCCCTTCACCATCAAGGAAACAACCAATTGACAACAATGATTCTGTACAACCAAAAGAAAGCGAATCGACTCAATCATCATCTCGCCATGGTTTGCTTCGAGCACCGTCCTTACCACCTTGTATAGGGAGAAAAGGAAGTGAACCCAGGACGAGCAAATTGACTAGACAAGCATCGCTTAACTCTTCGGATGTCTTGCCACCGCCACACAAGTCAAAGGGGATGATGAAGCAGCAATACAGCATGCAAAGGCATGCACCTCAAAGATTACCAAAAGCGATGAGCTTGAATACTCGATATCCTAGTCAAACTAAAATGACAAAGAGCCTGAGTGATCTGGAATGTGAAGAAGT
...
下面这种方式也可以,建立 index 需要花点时间:
import pyfastx
for seq in pyfastx.Fasta('data/test.fa.gz'):
print(seq.name)
print(seq.seq)
print(seq.description)
获取 fasta 信息
序列数量:
# get sequence counts in FASTA
len(fa)
211
所有序列长度:
# get total sequence length of FASTA
fa.size
86262
GC 含量:
# get GC content of DNA sequence of FASTA
fa.gc_content
43.529014587402344
# get GC skew of DNA sequences in FASTA
# New in pyfastx 0.3.8
fa.gc_skews
0.004287730902433395
碱基数量:
# get composition of nucleotides in FASTA
fa.composition
{'A': 24534, 'C': 18694, 'G': 18855, 'T': 24179}
序列类型:
# get fasta type (DNA, RNA, or protein)
fa.type
'DNA'
获取最长或最短序列
提取最长序列:
# get longest sequence
s = fa.longest
s
<Sequence> JZ822609.1 with length of 821
s.name
'JZ822609.1'
len(s)
821
提取最短序列:
# get shortest sequence
s = fa.shortest
s
<Sequence> JZ822617.1 with length of 118
s.name
'JZ822617.1'
len(s)
118
计算 N50 和 L50
# get FASTA N50 and L50
fa.nl(50)
(516, 66)
# get FASTA N90 and L90
fa.nl(90)
(231, 161)
# get FASTA N75 and L75
fa.nl(75)
(365, 117)
获取序列的平均长度和中位数
# get sequence average length
fa.mean
408
# get seqeunce median length
fa.median
430
计数一定阈值的序列
# get counts of sequences with length >= 200 bp
fa.count(200)
173
# get counts of sequences with length >= 500 bp
fa.count(500)
70
提取序列
# get subsequence with start and end position
interval = (1, 10)
fa.fetch('JZ822577.1', interval)
'CTCTAGAGAT'
# get subsequences with a list of start and end position
intervals = [(1, 10), (50, 60)]
fa.fetch('JZ822577.1', intervals)
'CTCTAGAGATTTTAGTTTGAC'
# get subsequences with reverse strand
fa.fetch('JZ822577.1', (1, 10), strand='-')
'ATCTCTAGAG'
Key function 提取部分 ID
有时候 fasta 的 ID 比较长如 “>JZ822577.1 contig1 cDNA library of flower petals in tree peony by suppression subtractive hybridization Paeonia suffruticosa cDNA, mRNA sequence”, 在这种情况下,“JZ822577.1” 和 “contig1” 都可以作为标识符。可以指定 Key function 来选择一个作为标识符:
#default use JZ822577.1 as identifier
#specify key_func to select contig1 as identifer
fa = pyfastx.Fasta('data/test.fa.gz', key_func=lambda x: x.split()[1])
fa
<Fasta> tests/data/test.fa.gz contains 211 seqs
提取序列
内容太多,暂时使用命令行格式:
>>> # get sequence like a dictionary by identifier
>>> s1 = fa['JZ822577.1']
>>> s1
<Sequence> JZ822577.1 with length of 333
>>> # get sequence like a list by index
>>> s2 = fa[2]
>>> s2
<Sequence> JZ822579.1 with length of 176
>>> # get last sequence
>>> s3 = fa[-1]
>>> s3
<Sequence> JZ840318.1 with length of 134
>>> # check a sequence name weather in FASTA file
>>> 'JZ822577.1' in fa
True
提取序列信息
>>> s = fa[-1]
>>> s
<Sequence> JZ840318.1 with length of 134
>>> # get sequence order number in FASTA file
>>> # New in pyfastx 0.3.7
>>> s.id
211
>>> # get sequence name
>>> s.name
'JZ840318.1'
>>> # get sequence description
>>> # New in pyfastx 0.3.1
>>> s.description
'R283 cDNA library of flower petals in tree peony by suppression subtractive hybridization Paeonia suffruticosa cDNA, mRNA sequence'
>>> # get sequence string
>>> s.seq
'ACTGGAGGTTCTTCTTCCTGTGGAAAGTAACTTGTTTTGCCTTCACCTGCCTGTTCTTCACATCAACCTTGTTCCCACACAAAACAATGGGAATGTTCTCACACACCCTGCAGAGATCACGATGCCATGTTGGT'
>>> # get sequence raw string, New in pyfastx 0.6.3
>>> print(s.raw)
>JZ840318.1 R283 cDNA library of flower petals in tree peony by suppression subtractive hybridization Paeonia suffruticosa cDNA, mRNA sequence
ACTGGAGGTTCTTCTTCCTGTGGAAAGTAACTTGTTTTGCCTTCACCTGCCTGTTCTTCACATCAACCTT
GTTCCCACACAAAACAATGGGAATGTTCTCACACACCCTGCAGAGATCACGATGCCATGTTGGT
>>> # get sequence length
>>> len(s)
134
>>> # get GC content if dna sequence
>>> s.gc_content
46.26865768432617
>>> # get nucleotide composition if dna sequence
>>> s.composition
{'A': 31, 'C': 37, 'G': 25, 'T': 41, 'N': 0}
序列切片
>>> # get a sub seq from sequence
>>> s = fa[-1]
>>> ss = s[10:30]
>>> ss
<Sequence> JZ840318.1 from 11 to 30
>>> ss.name
'JZ840318.1:11-30'
>>> ss.seq
'CTTCTTCCTGTGGAAAGTAA'
>>> ss = s[-10:]
>>> ss
<Sequence> JZ840318.1 from 125 to 134
>>> ss.name
'JZ840318.1:125-134'
>>> ss.seq
'CCATGTTGGT'
注意: Slicing start and end coordinates are 0-based.
获取反向和互补及反向互补序列
>>> # get sliced sequence
>>> fa[0][10:20].seq
'GTCAATTTCC'
>>> # get reverse of sliced sequence
>>> fa[0][10:20].reverse
'CCTTTAACTG'
>>> # get complement of sliced sequence
>>> fa[0][10:20].complement
'CAGTTAAAGG'
>>> # get reversed complement sequence, corresponding to sequence in antisense strand
>>> fa[0][10:20].antisense
'GGAAATTGAC'
查找序列
>>> # search subsequence in sense strand
>>> fa[0].search('GCTTCAATACA')
262
>>> # check subsequence weather in sequence
>>> 'GCTTCAATACA' in fa[0]
True
>>> # search subsequence in antisense strand
>>> fa[0].search('CCTCAAGT', '-')
301
获取序列 ID
>>> ids = fa.keys()
>>> ids
<FastaKeys> contains 211 keys
>>> # get count of sequence
>>> len(ids)
211
>>> # get key by index
>>> ids[0]
'JZ822577.1'
>>> # check key whether in fasta
>>> 'JZ822577.1' in ids
True
>>> # iterate over keys
>>> for name in ids:
>>> print(name)
>>> # convert to a list
>>> list(ids)
对 ID 进行排序
>>> # sort keys by length with descending order
>>> for name in ids.sort(by='length', reverse=True):
>>> print(name)
>>> # sort keys by name with ascending order
>>> for name in ids.sort(by='name'):
>>> print(name)
>>> # sort keys by id with descending order
>>> for name in ids.sort(by='id', reverse=True)
>>> print(name)
筛选 ID
>>> # get keys with length > 600
>>> ids.filter(ids > 600)
<FastaKeys> contains 48 keys
>>> # get keys with length >= 500 and <= 700
>>> ids.filter(ids>=500, ids<=700)
<FastaKeys> contains 48 keys
>>> # get keys with length > 500 and < 600
>>> ids.filter(500<ids<600)
<FastaKeys> contains 22 keys
>>> # get keys contain JZ8226
>>> ids.filter(ids % 'JZ8226')
<FastaKeys> contains 90 keys
>>> # get keys contain JZ8226 with length > 550
>>> ids.filter(ids % 'JZ8226', ids>550)
<FastaKeys> contains 17 keys
>>> # clear sort order and filters
>>> ids.reset()
<FastaKeys> contains 211 keys
>>> # list a filtered result
>>> ids.filter(ids % 'JZ8226', ids>730)
>>> list(ids)
['JZ822609.1', 'JZ822650.1', 'JZ822664.1', 'JZ822699.1']
>>> # list a filtered result with sort order
>>> ids.filter(ids % 'JZ8226', ids>730).sort('length', reverse=True)
>>> list(ids)
['JZ822609.1', 'JZ822699.1', 'JZ822664.1', 'JZ822650.1']
>>> ids.filter(ids % 'JZ8226', ids>730).sort('name', reverse=True)
>>> list(ids)
['JZ822699.1', 'JZ822664.1', 'JZ822650.1', 'JZ822609.1']
5FASTQ 处理
读取 fastq 序列
>>> import pyfastx
>>> fq = pyfastx.Fastq('data/test.fq.gz')
>>> fq
<Fastq> tests/data/test.fq.gz contains 100 reads
迭代读取
>>> import pyfastx
>>> for name,seq,qual in pyfastx.Fastq('data/test.fq.gz', build_index=False):
>>> print(name)
>>> print(seq)
>>> print(qual)
换种方式:
>>> import pyfastx
>>> for read in pyfastx.Fastq('data/test.fq.gz'):
>>> print(read.name)
>>> print(read.seq)
>>> print(read.qual)
>>> print(read.quali)
获取 fastq 信息
>>> # get read counts in FASTQ
>>> len(fq)
800
>>> # get total bases
>>> fq.size
120000
>>> # get GC content of FASTQ file
>>> fq.gc_content
66.17471313476562
>>> # get composition of bases in FASTQ
>>> fq.composition
{'A': 20501, 'C': 39705, 'G': 39704, 'T': 20089, 'N': 1}
>>> # New in pyfastx 0.6.10
>>> # get average length of reads
>>> fq.avglen
150.0
>>> # get maximum lenth of reads
>>> fq.maxlen
150
>>> # get minimum length of reas
>>> fq.minlen
150
>>> # get maximum quality score
>>> fq.maxqual
70
>>> # get minimum quality score
>>> fq.minqual
35
>>> # get phred which affects the quality score conversion
>>> fq.phred
33
>>> # Guess fastq quality encoding system
>>> # New in pyfastx 0.4.1
>>> fq.encoding_type
['Sanger Phred+33', 'Illumina 1.8+ Phred+33']
读取 read
>>> #get read like a dict by read name
>>> r1 = fq['A00129:183:H77K2DMXX:1:1101:4752:1047']
>>> r1
<Read> A00129:183:H77K2DMXX:1:1101:4752:1047 with length of 150
>>> # get read like a list by index
>>> r2 = fq[10]
>>> r2
<Read> A00129:183:H77K2DMXX:1:1101:18041:1078 with length of 150
>>> # get the last read
>>> r3 = fq[-1]
>>> r3
<Read> A00129:183:H77K2DMXX:1:1101:31575:4726 with length of 150
>>> # check a read weather in FASTQ file
>>> 'A00129:183:H77K2DMXX:1:1101:4752:1047' in fq
True
获取 read 信息
>>> r = fq[-10]
>>> r
<Read> A00129:183:H77K2DMXX:1:1101:1750:4711 with length of 150
>>> # get read order number in FASTQ file
>>> r.id
791
>>> # get read name
>>> r.name
'A00129:183:H77K2DMXX:1:1101:1750:4711'
>>> # get read full header line, New in pyfastx 0.6.3
>>> r.description
'@A00129:183:H77K2DMXX:1:1101:1750:4711 1:N:0:CAATGGAA+CGAGGCTG'
>>> # get read length
>>> len(r)
150
>>> # get read sequence
>>> r.seq
'CGAGGAAATCGACGTCACCGATCTGGAAGCCCTGCGCGCCCATCTCAACCAGAAATGGGGTGGCCAGCGCGGCAAGCTGACCCTGCTGCCGTTCCTGGTCCGCGCCATGGTCGTGGCGCTGCGCGACTTCCCGCAGTTGAACGCGCGCTA'
>>> # get raw string of read, New in pyfastx 0.6.3
>>> print(r.raw)
@A00129:183:H77K2DMXX:1:1101:1750:4711 1:N:0:CAATGGAA+CGAGGCTG
CGAGGAAATCGACGTCACCGATCTGGAAGCCCTGCGCGCCCATCTCAACCAGAAATGGGGTGGCCAGCGCGGCAAGCTGACCCTGCTGCCGTTCCTGGTCCGCGCCATGGTCGTGGCGCTGCGCGACTTCCCGCAGTTGAACGCGCGCTA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF,FFFFFFFFFFFFFFFFFFFFFFFFFF,F:FFFFFFFFF:
>>> # get read quality ascii string
>>> r.qual
'FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF,FFFFFFFFFFFFFFFFFFFFFFFFFF,F:FFFFFFFFF:'
>>> # get read quality integer value, ascii - 33 or 64
>>> r.quali
[37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 11, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 11, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25]
>>> # get read length
>>> len(r)
150
获取 ID
>>> ids = fq.keys()
>>> ids
<FastqKeys> contains 800 keys
>>> # get count of read
>>> len(ids)
800
>>> # get key by index
>>> ids[0]
'A00129:183:H77K2DMXX:1:1101:6804:1031'
>>> # check key whether in fasta
>>> 'A00129:183:H77K2DMXX:1:1101:14416:1031' in ids
True
6命令行版
最后还有命令行版的:
$ pyfastx -h
usage: pyfastx COMMAND [OPTIONS]
A command line tool for FASTA/Q file manipulation
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
Commands:
index build index for fasta/q file
stat show detailed statistics information of fasta/q file
split split fasta/q file into multiple files
fq2fa convert fastq file to fasta file
subseq get subsequences from fasta file by region
sample randomly sample sequences from fasta or fastq file
extract extract full sequences or reads from fasta/q file
具体功能结合大家自身的需求去探索吧!功能还是很强大的!

欢迎加入生信交流群。加我微信我也拉你进 微信群聊 老俊俊生信交流群
哦,数据代码已上传至QQ群,欢迎加入下载。
群二维码:

老俊俊微信:
知识星球:
所以今天你学习了吗?
欢迎小伙伴留言评论!
今天的分享就到这里了,敬请期待下一篇!
最后欢迎大家分享转发,您的点赞是对我的鼓励和肯定!
如果觉得对您帮助很大,赏杯快乐水喝喝吧!
往期回顾
◀python 学习之处理 fasta 和 fastq 文件
◀geneExpressionFromGEO 下载并注释芯片数据
◀...
请关注“恒诺新知”微信公众号,感谢“R语言“,”数据那些事儿“,”老俊俊的生信笔记“,”冷🈚️思“,“珞珈R”,“生信星球”的支持!