R|fastqcr QC数据处理
转自:https://mp.weixin.qq.com/s?__biz=MzIyNDI1MzgzOQ==&mid=2650393841&idx=1&sn=edffe1a043b34a36e9f052d955bbddbc&chksm=f01cae11c76b2707d1ea34bc68facae4d3c332ec7d72f484b74400fd05c926bb62e9bcf48210&scene=21#wechat_redirect
FastQC是一款较常用的高通量数据质控软件,每个样本会得到一个zip和html的结果文件,查看略有不便。
fastqcr R包可以整理并分析多样不的zip格式的结果报告,当然也可以直接做fastQC分析。

一 加载安装R包
install.packages("fastqcr")
library(fastqcr)
二 R跑FastQC程序
fastqc(fq.dir = "~/Documents/FASTQ", # FASTQ files directory
qc.dir = "~/Documents/FASTQC", # Results direcory
threads = 4 # Number of threads
)
三 Fastcq结果整理绘图
3.1 读入fastqc的zip结果文件
qc.dir <- system.file("fastqc_results", package = "fastqcr")
qc.dir
# [1] "C:/Users/MJ/Documents/R/win-library/3.3/fastqcr/fastqc_results"
查看文件夹中文件
list.files(qc.dir)
#[1] "A1.R2.clean_fastqc.zip" "A1_R1.clean_fastqc.zip" "A2.R1.clean_fastqc.zip"
#[4] "A2.R2.clean_fastqc.zip"
3.2 整体统计结果 Aggregate & summary
1)Aggregate函数,Sample的基本统计;qc_stats()类似
c <- qc_aggregate(qc.dir)
head(qc,3)
sample module status tot.seq seq.length pct.gc pct.dup
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 170197A-Ti-1.R1.clean.fq.gz Basic Statis~ PASS 364589~ 150 51. 18.0
2 170197A-Ti-1.R1.clean.fq.gz Per base seq~ PASS 364589~ 150 51. 18.0
3 170197A-Ti-1.R1.clean.fq.gz Per tile seq~ WARN 364589~ 150 51. 18.0
其中:
sample: sample names
module: fastqc modules
status: fastqc module status for each sample
tot.seq: total sequences (i.e.: the number of reads)
seq.length: sequence length
pct.gc: percentage of GC content
pct.dup: percentage of duplicate reads
2)summary函数,Module的基本统计
head(summary(qc),3)
module nb_samples nb_fail nb_pass nb_warn failed warned
<chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 Adapter Content 4. 0. 4. 0. NA NA
2 Basic Statistics 4. 0. 4. 0. NA NA
3 Kmer Content 4. 4. 0. 0. 170197A-Ti-1.R1.clean.~ NA
其中:
module: fastqc modules
nb_samples: the number of samples tested
nb_pass, nb_fail, nb_warn: the number of samples that passed, failed and warned, respectively.
failed, warned: the name of samples that failed and warned, respectively.
3)dplyr函数
aggregated完可以使用dplyr得到重点关注的信息:如WARN和FALL样本:
library(dplyr)
qc %>%select(sample, module, status) %>% filter(status %in% c("WARN", "FAIL")) %>%arrange(sample)
sample module status
<chr> <chr> <chr>
1 170197A-Ti-1.R1.clean.fq.gz Per tile sequence quality WARN
2 170197A-Ti-1.R1.clean.fq.gz Per sequence GC content WARN
3 170197A-Ti-1.R1.clean.fq.gz Kmer Content FAIL
附:dplyr简单介绍 dplyr
3.3 异常信息统计
1) Module异常:qc_warns & qc_problems函数
qc_warns(qc, "module") # See which module warned in the most samples
module nb_problems sample
<chr> <int> <chr>
1 Per sequence GC content 4 170197A-Ti-1.R1.clean.fq.gz, 170197A-Ti-1.R~
2 Per tile sequence quality 2 170197A-Ti-1.R1.clean.fq.gz, 170197A-Ti-10.~
查看特定模块的异常
qc_problems(qc, "module", name = "Per sequence GC content")
module nb_problems sample status
<chr> <int> <chr> <chr>
1 Per sequence GC content 4 170197A-Ti-1.R1.clean.fq.gz WARN
2 Per sequence GC content 4 170197A-Ti-1.R2.clean.fq.gz WARN
3 Per sequence GC content 4 170197A-Ti-10.R1.clean.fq.gz WARN
4 Per sequence GC content 4 170197A-Ti-10.R2.clean.fq.gz WARN
2)Sample 异常 : qc_fails & qc_problems 函数
qc_problems(qc, "sample", compact = FALSE) #fail 和 warn的 ,compact展示方式
sample nb_problems module status
<chr> <int> <chr> <chr>
1 170197A-Ti-1.R1.clean.fq.gz 3 Kmer Content FAIL
2 170197A-Ti-1.R1.clean.fq.gz 3 Per tile sequence quality WARN
3 170197A-Ti-1.R1.clean.fq.gz 3 Per sequence GC content WARN
4 170197A-Ti-1.R2.clean.fq.gz 3 Per tile sequence quality FAIL
5 170197A-Ti-1.R2.clean.fq.gz 3 Kmer Content FAIL
6 170197A-Ti-1.R2.clean.fq.gz 3 Per sequence GC content WARN
四 绘图及生成报告
4.1 导入单个样本fastq结果
qc.file <- system.file("fastqc_results", "170197A-Ti-1.R1.clean_fastqc.zip", package = "fastqcr")
qc <- qc_read(qc.file) #读入qc的结果,后续绘图使用
names(qc) #查看包含的元素
4.2 绘制样本基本图形
qc_plot(qc, "Per sequence GC content")
qc_plot(qc, "Per base sequence quality")
qc_plot(qc, "Per sequence quality scores")
qc_plot(qc, "Per base sequence content")
qc_plot(qc, "Sequence duplication levels")

4.3 生成单样本网页报告
1)图形有解释信息的报告
qc_report(qc.file, result.file = "one-sample-report-with-interpretation",interpret = TRUE)
在工作目录下生成html的网页报告,图形含有解释信息(如下),表格可交互。

2)图形无解释信息的报告
qc_report(qc.file, result.file = "one-sample-report",interpret = FALSE)
4.4 多样本网页报告
qc.dir <- system.file("fastqc_results", package = "fastqcr")
qc_report(qc.dir, result.file = "~/Desktop/multi-qc-result",experiment = "Exome sequencing")

请关注“恒诺新知”微信公众号,感谢“R语言“,”数据那些事儿“,”老俊俊的生信笔记“,”冷🈚️思“,“珞珈R”,“生信星球”的支持!