3.1 Reference Genome and its Annotation
Most scRNA-seq experiments are done using human or mouse tissues, organoids, or cell cultures. Even though first drafts of these genomes were published about 20 years ago, assembly and annotation updates happen fairly regularly. There are two popular sources of assembly files: UCSC (their assemblies are named hg19, hg38, mm10, etc), and GRC (GRCh37, GRCh38, GRCm38). Major releases of UCSC and GRC assemblies are matched in main chromosomes (e.g. chr1 from hg38 = chr1 from GRCh38), but differ in additional contigs and so-called ALT loci, which change between minor releases (e.g. GRCh38.p13). More information can be obtained here and here. Genome assembly is usually distributed as a fasta file – a simple text file that includes sequence names and sequences.
大多数scRNA-seq的实验是用人或者老鼠的组织、类器官和培养出来的细胞做的。尽管这些基因组的初稿在大约20年前就出版了，但组装和注释更新得相当高频。有两种流行的组装文件来源：UCSC（组装起来的东西叫作hg19，hg38，mm10等等），和GRC（GRCh37, GRCh38, GRCm38）。这两种来源在染色体上主要是相同的（比如chr1从hg38和GRCh38都是对应一样的），但额外的序列和ALT位点不同，所以它们之间有细微的不同。点击蓝色链接获得更多信息。基因组组装往往以fasta的形式分布，包括序列名称和序列。
Genome annotation process includes defining transcribed regions of genome (genes), as well as annotating exact transcripts with exon-intron boundaries, and assigning the newly defined features a type – e.g. protein coding, non-coding, etc. Example below shows one gene which consists of 5 transcripts: 3 protein coding (red), and two non-coding (blue). Genome annotation is usually provided in either GTF or GFF3 file format, which are organized hierarchically. Each gene is defined by a unique gene ID; each transcript is defined by a unique transcript ID and the gene it belongs to. Exons, UTRs, and coding sequences are in turn assigned to a particular transcript.
基因注释的过程包括定义基因组或基因的转录区域，以及利用外显子和内含子的边界注释精确的转录本，还会把新定义的特征归类到一个类型，比如蛋白质编码，非编码等等。下面这个例子展示了一个由5个转录本组成的基因：3个蛋白质编码（红色），和两个非编码的（蓝色）。基因注释常常展现成 GTF or GFF3 file format 这两种形式，按照层次组织好。每个基因都有独特的基因ID定义，每个转录本都有它独特的ID和它所属的基因定义，外显子，utr和编码序列一次分给特定的转录本。
Figure 3.1: Transcript and intron-exon structure of a typical eukaryotic gene
Popular sources of human and mouse genome annotation are RefSeq, ENSEMBL, and GENCODE. RefSeq is the most conservative of the three, and tends to have the fewest annotated transcripts per gene. RefSeq transcript IDs start with NM_ or NR, e.g. NM_12345. ENSEMBL and GENCODE are very similar to each other and can be used interchangeably for our purposes. Gene names in these start with ENSG (for human) and ENSMUSG (for mouse); transcripts start with ENST and ENSMUST, respectively.
In addition to gene IDs, most genes also have a common name (“gene symbol”) assigned to them; e.g. human actin B will have ensembl gene ID ENSG00000075624 and symbol ACTB. Human gene names are regularly updated and defined by HGNC. Mouse gene names are decided by a similar consortium, MGI.
Current ENSEMBL/GENCODE annotation of the human genome contains approximately 60k genes, 20k of which are protein coding, and 237k transcripts. Most genes can be coarsely divided by type into protein coding genes, long noncoding RNAs, short noncoding RNAs, and pseudogenes. On a higher resolution, over 40 biotypes are defined. Gene biotype annotation also often changes between annotation versions (see below).
目前的 ENSEMBL/GENCODE 人类基因注释包括约60k的基因，其中20k是蛋白质编码的，还有237k的转录本。大多数基因可以大致被分为蛋白质编码基因，长非编码RNA，短非编码RNA和假基因。从更高的角度来看，可以定义至少40种基因类型，不同的注释版本也经常不同。
Figure 3.2: Sankey diagram of gene type changes in GENCODE versions
3.2 Processing of Bulk RNA-seq and Full-length scRNA-seq Data
Raw read processing of bulk RNA-seq is usually done in two steps: read alignment and read counting. Both steps contain important caveats which can strongly influence expression estimates for individual genes. Read alignment can be done against a genome or transcriptome reference. Because of widespread splicing in animal genomes, read alignment against a genome has to be done with a splice-aware aligner; two most popular modern tools are STAR and hisat2. Typical read coverage is shown in panel A below; note that read coverage is relatively uniform across both 3’ and 5’ ends of the given gene. Some reads perfectly align to more than 1 location; these reads are often referred to as multimappers. The ambiguity is much greater when aligning to transcriptome because many transcripts are very similar to each other (e.g. differ by only one exon); however, it is noticeable even on a gene level (panel B below).
Figure 3.3: A: typical read coverage in a bulk RNA-seq or Smart-seq2 experiment; B: types of ambiguity that arise when reads are assigned to features
图3.3：A:典型的 bulk RNA-seq或Smart-seq2实验读数范围；B:当放大到具体特征时，模糊的东西搞清楚了。
After alignment to the genome or transcriptome, read counts can be summarized on a gene or transcript level. In case of genome alignment, the simplest strategy is to count only reads mapping to a unique location (non-multimappers), and only overlapping one gene. This, however, inevitably creates a bias in gene expression estimates (Pachter, 2011). Somewhat more advanced strategy includes splitting read counts between features it aligns to (e.g. if a read aligns equally well to 3 paralogous genes, each paralog gets ⅓ of a count). Strand-specific RNA-seq reduces the ambiguity of read assignment in case of overlapping features located on opposite strands. An example of a program that efficiently implements all of the above-mentioned counting methods is featureCounts from the Subread package.
When transcriptome alignment is used, read assignment ambiguity is too great for simple counting. Thus, per-transcript and per-gene abundances are calculated using maximum likelihood abundance estimates using the Expectation-Maximization (EM) algorithm. This approach results in varying fractions of the read being assigned to features it maps to, and substantially reduces the bias related to multimappers. Reads (and read fractions) that are assigned to transcripts are then summarized on gene level. The most widely used and well-supported program implementing this strategy is RSEM. Generally speaking, this is the most accurate way of bulk RNA-seq quantification (Pachter, 2011).
An alternative to a conventional approach described above (alignment, then read quantification) is based on so-called pseudo-alignment methods. Two popular tools, kallisto and salmon, use very similar approaches:
- Split the reference transcriptome into k-mers and make a De Bruijn graph;
- Convert RNA-seq reads into k-mers;
- Use the k-mers to assign reads to a transcript or several transcripts (“equivalence class”);
- Summarize the resulting counts on a transcript or gene level.
- 把参考的转录组分成k-mers，做一张De Bruijn图；
Expectation-Maximization algorithm is used to find optimal distribution of reads that map to multiple transcripts. Both tools are extremely memory- and CPU-efficient, and quite accurate, especially for paired-end or long single-end reads. Pseudoalignment does not generate an alignment BAM file, so if visualization is needed (e.g. when using RNA-seq for transcript annotation) alignment should also be performed separately.
Figure 3.4: Pseudoalignment to transcriptome De Bruijn graph
Several things should be noted about bulk RNA-seq quantification. First, it is generally assumed that the number of sequenced cDNA fragments is proportional to the amount of RNA present in a cell. Thus when paired-end reads are used, each read pair is counted only once, as it stems from the same cDNA fragment. For well-annotated genomes like human and mouse it is very common to use single-end reads for RNA-seq. Second, PCR duplicates are usually ignored in bulk RNA-seq, and UMI use also does not confer substantial benefits. Several independent studies indicated that duplicate removal or UMI use do not noticeably increase the statistical power of bulk RNA-seq.
关于bulk RNA-seq有几件事需要注意。首先，一般认为已测序的cDNA片段的数量与细胞中RNA的数量成正比。因此，如果用paired-end读数法，每端只读1次，因为他们是来自同一片段的cDNA。对于像人类和小鼠这样注释良好的基因组，使用单端读取来RNA测序是非常常见的。其次bulk RNA-seq中PCR复制通常被忽略，UMI的使用也不会带来实质性的好处。有些独立实验表明，重复去除或移除UMI不能显著增加bulk RNA-seq的统计能力。
Finally, while many differential expression methods use raw read counts, it is conventional to use within-sample normalization when doing clustering, PCA, and other types of exploratory analysis. The most popular method of such normalization is conversion of raw reads to Transcripts per Million (TPM). The conversion accounts for two biases: 1) different samples are sequenced at different depth, not directly related to gene expression differences; 2) long genes are expected to generate more cDNA fragments than the short ones. Thus, for TPM calculation, raw read counts are first divided by the effective transcript length, which is defined as transcript length – cDNA fragment size + 1. After this, the resulting numbers are scaled linearly to add up to one million. Thus, the sum of all TPM values for a particular sample is always equal to (approximately) 1,000,000.
Single cell RNA-seq data differ from bulk RNA seq in a number of ways (see Introduction to single cell RNA-Seq chapter above). Most modern scRNA-seq technologies generate read sequences containing three key pieces of information:
- cDNA fragment that identifies the RNA transcript;能够识别转录RNA的cDNA片段；
- Cell barcode (CB) that identifies the cell where the RNA was expressed;可以辨别RNA表达场所的细胞条码；
- Unique Molecular Identifier (UMI) that allows to collapse reads that are PCR duplicates.可折叠读取的PCR复制品UMI。
In contrast to bulk RNA-seq, scRNA-seq deals with a much smaller amount of RNA, and more PCR cycles are performed. Thus, UMI barcodes become very useful and are now widely accepted in scRNAseq. Library sequencing is often done with paired-end reads, with one read containing CB + UMI (read 1 in 10x Chromium), and the other containing actual transcript sequence (read 2 in 10x Chromium).
A classical scRNA-seq workflow contains four main steps:
- Mapping the cDNA fragments to a reference;给cDNA碎片绘图作为参考；
- Assigning reads to genes;给基因分别读数；
- Assigning reads to cells (cell barcode demultiplexing);把读数分到细胞（分解细胞条码）；
- Counting the number of unique RNA molecules (UMI deduplication).数独特的分子数量（UMI复制来的）
The outcome of this procedure is a gene/cell count matrix, which is used as an estimate of the number of RNA molecules in each cell for each gene.
Cell Ranger is the default tool for processing 10x Genomics Chromium scRNAseq data. It uses STAR aligner, which performs splicing-aware alignment of reads to the genome. After this, it uses the transcript annotation GTF to bucket the reads into exonic, intronic, and intergenic, and by whether the reads align (confidently) to the genome. A read is exonic if at least 50% of it intersects an exon, intronic if it is non-exonic and intersects an intron, and intergenic otherwise (shown below). Following the read type assignment, mapping quality adjustment is done: for reads that align to a single exonic locus but also align to 1 or more non-exonic loci, the exonic locus is prioritized and the read is considered to be confidently mapped to the exonic locus and given a maximum mapping quality score.
Cell Ranger是处理10x Genomics Chromium scRNAseq数据的默认工具。 它使用STAR比对器，对基因组进行拼接比对，然后它用转录注释GTF将读取的片段放入外显子、内含子和基因间,并根据所读取的片段是否对齐，如果一个基因片段中至少有50%与外显子相交，则为外显子，如果是非外显子且与内含子相交，则为内含子，否则为基因间(如下图所示)。 读数进行分类分配后，作图质量的调整就好了，读数和外显子的位点对齐也和非外现在的位点对齐，但是位点明显偏向外显子所以基本可以确信地认为对应外显子，给了它一个最大比重可能性的结果。
Figure 3.5: Classification of aligned reads in Cell Ranger
图3.5 Cell Ranger里的对比读数分类
Cell Ranger further aligns exonic and intronic confidently mapped reads to annotated transcripts by examining their compatibility with the transcriptome. Reads are classified based on whether they are sense or antisense and based on whether they are exonic, intronic or whether their splicing pattern is compatible with transcript annotations associated with that gene. By default, reads that are transcriptomic (blue in the figure above) are carried forward to UMI counting.
Cell Ranger 进一步对比了外显子和内含子通过检查他们转录组的兼容性对转录物进行读数的绘图注释。读数根据他们是否抗转录以及他们是否与外显子、内含子互补或者说他们的剪切模式在基因方面是否和准路注释兼容。默认来看，转录组的读数（上面的蓝色）将转入UMI计数。
When the input to the assay consists of nuclei, a high percentage of the reads comes from the unspliced transcripts and align to introns. In order to count these intronic reads, the “cellranger count” and “cellranger multi” pipelines can be run with the option include-introns. If this option is used, any reads that map in the sense orientation to a single gene – which include the reads labeled transcriptomic (blue), exonic (light blue), and intronic (red) in the diagram above – are carried forward to UMI counting. The include-introns option eliminates the need for a custom “pre-mRNA” reference that defines the entire gene body to be an exon. Importantly, a read is considered uniquely mapping if it is compatible with only a single gene. Only uniquely mapping reads are carried forward to UMI counting; multimapping genes are discarded by
Cell Ranger. In the Web Summary HTML output, the set of reads carried forward to UMI counting is referred to as “Reads mapped confidently to transcriptome.”
当检测的东西包括细胞核时，很大比例的读取物来自未剪接的转录本和对齐的内含子。 为了计数这些内含子读数，“cellranger count”和“cellranger multi”可以使用包含-内含子选项的运行途经。如果这么干，任何绘图出来的单个基因的读数——包括上面图表中标记为转录组(蓝色)、外显子(浅蓝色)和内含子(红色)的reads——都将进行UMI计数。包含内含子的方法消除了传统上对“pre-mRNA”参考的需求，该参考标准将整个基因体定义为一个外显子。 重要的是，如果一个读数绘出来的图和仅仅一个基因是兼容配对的，那么这就被认为是独特的。 只有唯一的配对读取出来的绘图才会被继续去UMI计数，多数据绘图基因的方法被Cell Ranger摒弃了。 在Web Summary HTML输出中，让读取的数据去UMI计数被称为“ Reads mapped confidently to transcriptome ”。
Before we dive into the details of reference processing, it’s important to note how default
Cell Ranger human and mouse references are prepared. Primary genome assembly versions (i.e. without the ALT loci) are used for the alignment in all versions. Annotation GTF files are filtered, using the scripts that could be found here. The following biotypes are retained: protein coding, long noncoding RNA, antisense, and all biotypes belonging to BCR/TCR (i.e. V/D/J) genes (note that older
Cell Ranger reference versions do not include the latter). All pseudogenes and small noncoding RNAs are removed.
在我们深入探讨准备过程的细节前，谈一谈默认的Cell Ranger人类和老鼠是怎么准备的。最开始的基因组装版本（即没有ALT位点）全都没有对齐的的步骤。用这里可以找到的脚本对GTF注释文件进行筛选。以下的生物类型被保留了：蛋白质编码，长非编码RNA，抗转录以及所有属于BCR/TCR的生物种类（即V/D/J）基因（注意较老的Cell Ranger参考版本不包括后者），所有假基因和小非编码RNA都被移除。
There are several versions of
Cell Ranger reference that come pre-packaged with the software; 2020-A is the newest version of reference to date. All individual assembly and annotation combinations used by
Cell Ranger previously are listed below. The unfiltered scRNAseq expression matrix generated using each reference is expected to contain the number of rows equal to the value in the “genes after filtering” column. Additionally,
Cell Ranger also contains human + mouse combined reference, which is useful for experiments involving both human and mouse cells.
|Cell Ranger Reference||Species||Assembly/Annotation||Genes before filtering||Genes after filtering|
Cellular barcode sequences are synthetic sequences attached to the beads that identify individual cells. The library of unique sequences is called a whitelist and depends on the Chromium library preparation kit version. The whitelist files are available from the Cell Ranger repository. There are three whitelists used for Chromium:
3M-february-2018.txt. CBs from the first list are 14 bp long, and two others are 16 bp. The table below provides cellular barcodes and UMI lengths, as well as appropriate whitelist files, for popular 10x single cell sequencing kits:
|Chemistry||CB, bp||UMI, bp||Whitelist file|
|10x Chromium Single Cell 3’ v1||14||10||737K-april-2014_rc.txt|
|10x Chromium Single Cell 3’ v2||16||10||737K-august-2016.txt|
|10x Chromium Single Cell 3’ v3||16||12||3M-february-2018.txt|
|10x Chromium Single Cell 3’ v3.1 (Next GEM)||16||12||3M-february-2018.txt|
|10x Chromium Single Cell 5’ v1.1||16||10||737K-august-2016.txt|
|10x Chromium Single Cell 5’ v2 (Next GEM)||16||10||737K-august-2016.txt|
Cell Ranger uses the following algorithm to correct putative barcode sequences against the whitelist :
- Count the observed frequency in the dataset of every barcode on the whitelist;
- For every observed barcode in the dataset that is not on the whitelist, find whitelist sequences that are 1-Hamming-distance away. For each such sequence:
- Compute the posterior probability that the observed barcode originated from the whitelist barcode with a sequencing error at the differing base (based on the base Q score);
- Replace the observed barcode with the whitelist barcode with the highest posterior probability that exceeds 0.975.
The corrected barcodes are used for all downstream analysis and output files. In the output BAM file, the original uncorrected barcode is encoded in the CR tag, and the corrected barcode sequence is encoded in the CB tag. Reads that are not able to be assigned a corrected barcode will not have a CB tag. In case you’re wondering why 3M-february-2018.txt file actually contains over 6 million unique sequences, the explanation can be found here.
What is usually referred to as “UMI counting” consists of read counting followed by PCR duplicate collapsing based on UMI sequences. Before counting UMIs, Cell Ranger attempts to correct for sequencing errors in the UMI sequences. Reads that were confidently mapped to the transcriptome are placed into groups that share the same barcode, UMI, and gene annotation. If two groups of reads have the same barcode and gene, but their UMIs differ by a single base (i.e., are Hamming distance 1 apart), then one of the UMIs was likely introduced by a substitution error in sequencing. In this case, the UMI of the less-supported read group is corrected to the UMI with higher support.
Cell Ranger again groups the reads by barcode, UMI (possibly corrected), and gene annotation. If two or more groups of reads have the same barcode and UMI, but different gene annotations, the gene annotation with the most supporting reads is kept for UMI counting, and the other read groups are discarded. In case of a tie for maximal read support, all read groups are discarded, as the gene cannot be confidently assigned.
After these two filtering steps, each observed barcode, UMI, gene combination is recorded as a UMI count in the unfiltered feature-barcode (i.e. gene-cell) matrix. The number of reads supporting each counted UMI is also recorded in the molecule info file.
Unfiltered (“raw”) feature-barcode matrix contains many columns that are in fact empty droplets. Gene expression counts in these droplets are not zero due to technical noise, e.g. the presence of ambient RNA from broken cells. However, they can usually be distinguished from bona fide cells by the amount of RNA present. There are two algorithms implemented for this cell filtering in
Cell Ranger, that we will refer to as “Cell Ranger 2.2” and “Cell Ranger 3.0” filtering.
The original algorithm (Cell Ranger 2.2) identified the first “knee point” in the “barcode count vs UMIs per barcode” plot. Cell Ranger 3.0 introduced an improved cell-calling algorithm that is better able to identify populations of low RNA content cells, especially when low RNA content cells are mixed into a population of high RNA content cells. For example, tumor samples often contain large tumor cells mixed with smaller tumor infiltrating lymphocytes (TIL) and researchers may be particularly interested in the TIL population. The new algorithm is based on the EmptyDrops method (Lun et al., 2018).
Figure 3.6: Knee plots and empty drop cutoffs identified by the Cell Ranger 2.2 and 3.0 filtering algorithms
Pseudoalignment (see above) can also be used to quickly quantify scRNA-seq datasets. Currently, there are two software suites that implement this approach: kallisto/BUStools, and Salmon/Alevin/Alevin-fry. In order to preserve the modular approach, both ecosystems introduced their own format that allows the storage of quantification results:
BUStools introduced the BUS (Barcode, UMI, and Set) file format (Melsted et al, 2019), while
Alevin-fry is using RAD format for the same purpose (Srivastava et al, 2019).
Alevin-fry implement their own algorithms for cell barcode and UMI error correction and demultiplexing – for example,
Alevin does not require (but can use) a cell barcode whitelist. However, the biggest difference from alignment-based methods stems from the lower accuracy of pseudoalignment, and the inclusion of multimapping reads.
BUStools supports many sequencing technologies, including low-throughput ones like CEL-seq, CEL-seq2, and SMART-seq. The full list of supported experiments can be printed with
Alevin currently supports only the two most popular droplet based single-cell protocols, Drop-seq and 10x Chromium.
Alevin are extremely efficient, allowing processing of human or mouse datasets with 2-4 Gb of RAM and at least an order of magnitude faster than
Cell Ranger. Both tools also handle multimapping reads correctly, reducing the quantification bias for the affected genes. However, it has been noted in several publications that pseudoalignment-based methods falsely map reads from retained introns to the transcriptome (Melsted et al, 2021; Srivastava et al, 2020). It is well known that scRNA-seq experiments, and, in particular, single-nucleus RNA-seq can contain a very high percentage of transcripts with retained introns. This erroneous assignment makes hundreds of non-expressed genes look weakly expressed, which may substantially influence the downstream analysis, particularly marker selection (Kaminow et al, 2021). Thus, there is still a push to develop methods that would be at least as accurate and much faster than
STARsolo is a standalone pipeline that is a part of
STAR RNA-seq aligner mentioned in this chapter previously. It was developed with a goal to generate results that are very similar to
Cell Ranger, while remaining computationally efficient. Normally,
STARsolo is several times faster than
Cell Ranger on the same dataset.
STARsolo methods for UMI collapsing, cell barcode demultiplexing, and cell filtering are purposefully re-implementing the algorithms used by
Cell Ranger. Starting from the
STAR version 2.7.9a,
STARsolo is also capable of quantifying multimapping read correctly, making it a very attractive option for fast and accurate scRNA-seq processing (Kaminow et al, 2021). Additional benefit of
STARsolo is its flexible implementation of cellular barcode and UMI search: knowing a relative location within a read, and length of each sequence, it’s possible to process the data generated by most scRNA-seq approaches.
Alevin developers has also acknowledged the above-mentioned problem of intronic reads, developing a special solution which involved the use of so-called decoy sequences. In a recent
Alevin with full genome decoy has shown accuracy that is very similar to that of
Cell Ranger (Kaminow et al, 2021).
It is becoming increasingly popular to use single-cell RNA-seq for characterization of less known multicellular organisms, especially as a part of de novo genome assembly projects of key species. Two things are important to note here. First of all, a correctly assembled and well-annotated mitochondrial sequence is crucial, because mitochondrial reads constitute a substantial fraction of every scRNA-seq library and are widely used for experiment quality control (see Introduction to scRNA-seq). A recent effort has put together a compendium of carefully assembled mitochondrial sequences for many non-model vertebrates (Formenti et al, 2021). MITOS2 is a specialized server that can be used to automatically generate good quality mitochondrial annotations for metazoans.
Second, it is important to notice that most annotation methods for de novo sequenced genomes generate gene models that do not include UTR sequences. Both 3’ and 5’ scRNA-seq methods are heavily biased in their read distribution towards either end of the gene (figure below). Thus, using gene annotation that does not have UTR sequences will dramatically distort the results of quantification and analysis.
Figure 3.7: Typical read coverage in 3’ and 5’ 10x scRNA-seq experiments
Cell Ranger is the default software suite offered by 10x Genomics, and it remains the most widely used tool for read alignment and quantification. If you lack experience in bioinformatics, or have many other samples processed with
Cell Ranger, stick with it. We would encourage you to use the latest
Cell Ranger version, and the latest annotation files it comes with (see 3.3 above). At the same time,
Alevin-full_decoy offer great computational speed-up and correct processing of multimappers, which reduces the quantification bias while retaining very high compatibility with
Cell Ranger. They probably are the best option for users experienced with terminal tools. Finally, if you’re working with a poorly annotated genome, make sure your gene models include UTRs, and that you have a well-assembled and annotated mitochondrion.
- 1 About the course 关于单细胞测序跟练课程
- 2 单细胞RNA-seq介绍
- 3 Processing Raw scRNA-Seq Sequencing Data: From Reads to a Count Matrix处理scRNA-seq测序的原始数据：把读取的数据转化为计数矩阵
- 5 scRNA-seq Analysis with Bioconductor
- 6 Basic Quality Control (QC) and Exploration of scRNA-seq Datasets
- 7 Biological Analysis
- 8 Single cell RNA-seq analysis using Seurat
- 9 scRNA-seq Dataset Integration
- 10 Resources
- 11 References
- 【单细胞数据分析】SCENIC 从单细胞数据推断基因调控网络和细胞类型