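1. cDNA-only index: also called salmon_index; a quasi-mapping index. It is the smallest of the three and needs the fewest resources to build, but it is the most prone to spurious mappings.
2. SA mashmap index: also called salmon_partial_sa_index (regions of genome that have high sequence similarity to the transcriptome); a Selective Alignment index. It adds the genomic regions that are highly similar to the transcriptome as decoys (a partial decoy index), which improves mapping accuracy.
3. SAF genome index: also called salmon_sa_index (the full genome is used as decoy); a Selective Alignment index. It uses the whole genome to further improve mapping accuracy and reduce false mappings. Its index files are the largest of the three, and its accuracy is the best (a full decoy index).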
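For the full decoy index (item 3), the input preparation can be sketched as follows: the decoy names are the genome's sequence names, and the sequences to index are the transcriptome followed by the genome. This is a minimal sketch rather than the authors' own script; `read_fasta` and `make_full_decoy` are hypothetical helper names, and the output file names simply mirror the decoys.txt and gentrome.fa files used elsewhere in this post.

```shell
# Print a FASTA file to stdout, whether or not it is gzip-compressed.
# (hypothetical helper; decides by file extension only)
read_fasta() {
  case "$1" in
    *.gz) gzip -cd "$1" ;;
    *)    cat "$1" ;;
  esac
}

# make_full_decoy TRANSCRIPTS GENOME OUTDIR   (hypothetical helper)
# Writes OUTDIR/decoys.txt and OUTDIR/gentrome.fa.
make_full_decoy() {
  local transcripts=$1 genome=$2 outdir=$3
  mkdir -p "$outdir"

  # Decoy names: the first whitespace-delimited token of every genome
  # header line, with the leading ">" removed.
  read_fasta "$genome" | grep '^>' | cut -d ' ' -f 1 | sed 's/^>//' \
    > "$outdir/decoys.txt"

  # Sequences to index: transcripts first, decoy (genome) sequences after.
  { read_fasta "$transcripts"; read_fasta "$genome"; } > "$outdir/gentrome.fa"
}
```

With the mouse files used in this post, a call might look like `make_full_decoy gencode.vM27.transcripts.fa.gz GRCm39.primary_assembly.genome.fa full_decoy`, followed by `salmon index -p 10 -k 31 --gencode -t full_decoy/gentrome.fa -i salmon_sa_index --decoys full_decoy/decoys.txt`. The directory and index names here are placeholders.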
salmon's subcommands include index (build an index), quant (quantification), alevin (single-cell analysis), and quantmerge (merge quantification results).
$ salmon
salmon v1.5.0

Usage:  salmon -h|--help or
        salmon -v|--version or
        salmon -c|--cite or
        salmon [--no-version-check] <COMMAND> [-h | options]

Commands:
     index      : create a salmon index
     quant      : quantify a sample
     alevin     : single cell analysis
     swim       : perform super-secret operation
     quantmerge : merge multiple quantifications into a single file
The authors also provide pre-built indices for download: pre-built versions of both the partial decoy and full decoy (i.e. using the whole genome) salmon indices for some common organisms are available via refgenie here: http://refgenomes.databio.org/
/mnt/d/rnaseq/salmon$ salmon quantmerge -h
Version Server Response: Not Found

quantmerge
==========
Merge multiple quantification results into
a single file.

salmon quantmerge options:

basic options:
  -v [ --version ]          print version string
  -h [ --help ]             produce help message
  --quants arg              List of quantification directories.
  --names arg               Optional list of names to give to the samples.
  -c [ --column ] arg (=TPM)
                            The name of the column that will be merged
                            together into the output files. The options are
                            {len, elen, tpm, numreads}
  --genes                   Use gene quantification instead of transcript.
  --missing arg (=NA)       The value of missing values.
  -o [ --output ] arg       Output quantification file.
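To show how the options above fit together, here is a minimal sketch that assembles a `salmon quantmerge` command from a directory of per-sample results. It only prints the command and does not run salmon. `build_quantmerge_cmd` is a hypothetical helper, and the sketch assumes that each sample directory holds salmon's `quant.sf` output and that the paths contain no whitespace.

```shell
# build_quantmerge_cmd ROOT OUTPUT [COLUMN]   (hypothetical helper)
# Prints a quantmerge command covering every ROOT/<sample>/ directory
# that contains a quant.sf file. Assumes paths without whitespace.
build_quantmerge_cmd() {
  local root=$1 output=$2 column=${3:-TPM}
  local quants="" names="" d

  for d in "$root"/*/; do
    d=${d%/}                              # drop the trailing slash
    [ -f "$d/quant.sf" ] || continue      # skip directories without results
    quants="$quants $d"
    names="$names $(basename "$d")"       # sample name = directory name
  done

  if [ -z "$quants" ]; then
    echo "no quant.sf found under $root" >&2
    return 1
  fi

  echo "salmon quantmerge --quants$quants --names$names --column $column -o $output"
}
```

For example, `build_quantmerge_cmd quants merged_numreads.tsv numreads` prints a command that merges the `numreads` column of every sample under a (placeholder) `quants/` directory; the printed line can be reviewed and then run by hand.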
An HTTP error occurred when trying to retrieve this URL.
HTTP errors are often intermittent, and a simple retry will get you on your way.
'http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge/linux-64'

# Edit the .condarc file and append win-64 to each mirror URL; that was enough to get past the error
/mnt/d/rnaseq/salmon$ cd && vi .condarc
channels:
  - http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge/win-64
  - http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/bioconda/win-64
  - http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/win-64
  - http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/win-64
show_channel_urls: true
ssl_verify: true
# Find the path of the bedtools executable
/mnt/d/rnaseq/salmon$ which bedtools
/root/miniconda3/bin/bedtools

# Find the path of the mashmap executable
/mnt/d/rnaseq/salmon$ which mashmap
/root/miniconda3/bin/mashmap

# Generate the decoy files
/mnt/d/rnaseq/salmon$ bash generateDecoyTranscriptome.sh -j 1 -b /root/miniconda3/bin/bedtools -m /root/miniconda3/bin/mashmap -a ./gencode.vM27.annotation.gtf -g ./GRCm39.primary_assembly.genome.fa -t ./gencode.vM27.transcripts.fa.gz -o decoy_file

# It crashed at this point, and crashed again on every retry: not enough memory. After running it to completion on a larger server:
/mnt/d/rnaseq/salmon$ tree decoy_file
decoy_file
├── decoys.txt
└── gentrome.fa

# Full log
****************
*** getDecoy ***
****************
-j <Concurrency level> = 10
-b <bedtools binary> = /home/zhoulab/anaconda3/envs/salmon/bin/bedtools
-m <mashmap binary> = /home/zhoulab/anaconda3/envs/salmon/bin/mashmap
-a <Annotation GTF file> = /home/zhoulab/salmon/gencode.vM27.annotation.gtf
-g <Genome fasta> = /home/zhoulab/salmon/GRCm39.primary_assembly.genome.fa
-t <Transcriptome fasta> = /home/zhoulab/salmon/gencode.vM27.transcripts.fa.gz
-o <Output files Path> = decoy_file
[1/10] Extracting exonic features from the gtf
[2/10] Masking the genome fasta
[3/10] Aligning transcriptome to genome
>>>>>>>>>>>>>>>>>>
Reference = [reference.masked.genome.fa]
Query = [/home/zhoulab/salmon/gencode.vM27.transcripts.fa.gz]
Kmer size = 16
Window size = 5
Segment length = 500 (read split allowed)
Alphabet = DNA
Percentage identity threshold = 80%
Mapping output file = mashmap.out
Filter mode = 1 (1 = map, 2 = one-to-one, 3 = none)
Execution threads = 10
>>>>>>>>>>>>>>>>>>
INFO, skch::Sketch::build, minimizers picked from reference = 844163332
INFO, skch::Sketch::index, unique minimizers = 276702648
INFO, skch::Sketch::computeFreqHist, Frequency histogram of minimizers = (1, 141693231) ... (2608258, 1)
INFO, skch::Sketch::computeFreqHist, With threshold 0.001%, ignore minimizers occurring >= 7364 times during lookup.
INFO, skch::main, Time spent computing the reference index: 346.634 sec
INFO, skch::Map::mapQuery, [count of mapped reads, reads qualified for mapping, total input reads] = [111792, 111958, 142375]
INFO, skch::main, Time spent mapping the query : 3487.89 sec
INFO, skch::main, mapping results saved in : mashmap.out
[4/10] Extracting intervals from mashmap alignments
[5/10] Merging the intervals
[6/10] Extracting sequences from the genome
index file reference.masked.genome.fa.fai not found, generating...
[7/10] Concatenating to get decoy sequences
[8/10] Making gentrome
[9/10] Extracting decoy sequence ids
[10/10] Removing temporary files
**********************************************
*** DONE Processing ...
*** You can use files `$outfolder/gentrome.fa`
*** and $outfolder/decoys.txt` with
*** `salmon index`
**********************************************
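The `which` lookups and the `-b` / `-m` arguments above can be combined so that the script is only started when both tools are actually installed. A minimal sketch, assuming a POSIX-style shell; `resolve_tool` is a hypothetical helper name, not part of salmon or of generateDecoyTranscriptome.sh.

```shell
# resolve_tool NAME   (hypothetical helper)
# Prints the path of NAME as found on PATH, or fails with a message.
resolve_tool() {
  command -v "$1" || {
    echo "missing required tool: $1" >&2
    return 1
  }
}

# Possible usage with the same arguments as the run above:
#   bedtools_bin=$(resolve_tool bedtools) &&
#   mashmap_bin=$(resolve_tool mashmap) &&
#   bash generateDecoyTranscriptome.sh -j 1 \
#     -b "$bedtools_bin" -m "$mashmap_bin" \
#     -a ./gencode.vM27.annotation.gtf \
#     -g ./GRCm39.primary_assembly.genome.fa \
#     -t ./gencode.vM27.transcripts.fa.gz \
#     -o decoy_file
```

This only guards against a missing binary. It does nothing about the out-of-memory crash described above, which still needed a larger machine.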
/mnt/d/rnaseq/salmon$ salmon index -p 10 -k 31 --gencode -t gencode.vM27.transcripts.fa.gz -i salmon_index_partial_decoy --decoys decoy_file/decoys.txt

# This failed with an error
[2021-06-15 20:24:24.893] [puff::index::jointLog] [critical] The decoy file contained the names of 49 decoy sequences, but 0 were matched by sequences in the reference file provided. To prevent unintentional errors downstream, please ensure that the decoy file exactly matches with the fasta file that is being indexed.
[2021-06-15 20:24:24.944] [puff::index::jointLog] [error] The fixFasta phase failed with exit code 1
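The error says that none of the names in decoys.txt appear in the FASTA passed to `-t`. The decoy script's closing message suggests indexing `gentrome.fa` together with `decoys.txt`, whereas the command above passes the transcript-only FASTA, which would plausibly explain the 0 matches. A minimal sketch for checking this before indexing: `count_decoy_matches` is a hypothetical helper that compares decoy names with the first token of each FASTA header, which is my assumption about how the names are matched.

```shell
# count_decoy_matches DECOYS FASTA   (hypothetical helper)
# Prints how many distinct FASTA header names (first token after ">")
# are listed in the DECOYS file. FASTA may be plain or gzip-compressed.
count_decoy_matches() {
  local decoys=$1 fasta=$2
  case "$fasta" in
    *.gz) gzip -cd "$fasta" ;;
    *)    cat "$fasta" ;;
  esac |
    grep '^>' |              # header lines only
    cut -d ' ' -f 1 |        # keep the sequence name, drop the description
    sed 's/^>//' |
    sort -u |
    grep -c -x -F -f "$decoys" || true   # grep exits 1 when the count is 0
}
```

Running `count_decoy_matches decoy_file/decoys.txt gencode.vM27.transcripts.fa.gz` and then the same check against `decoy_file/gentrome.fa` shows which of the two files actually contains the decoy sequences; the count for the file given to `-t` should equal the number of lines in decoys.txt.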