Template developed with materials from https://hbctraining.github.io/main/.
Code
# This set up the working directory to this file so all files can be foundlibrary(rstudioapi)setwd(fs::path_dir(getSourceEditorContext()$path))# NOTE: This code will check version, this is our recommendation, it may work# . other versionsstopifnot(R.version$major >=4) # requires R4if (compareVersion(R.version$minor, "3.1") <0) warning("We recommend >= R4.3.1")stopifnot(compareVersion(as.character(BiocManager::version()), "3.18") >=0)
This code is in this revision.
Code
# 1. set up factor_of_interest parameter from parameter above or manually# this is used to color plots, it needs to be part of the metadatafactor_of_interest <- params$factor_of_interestgenome <- params$genomesingle_end <- params$single_end# 2. Set input files in this filesource(params$params_file)# 3. If you set up this file, project information will be printed below and# . it can be reused for other Rmd files.source(params$project_file)# 4. Load custom functions to load data from coldata/metrics/countssource(params$functions_file)
1 Overview
Project: name_XXXXX
PI: person name
Analyst: person in the core
Experiment: short description
Code
library(tidyverse)library(janitor)library(knitr)library(rtracklayer)library(DESeq2)library(DEGreport)library(ggrepel)# library(RColorBrewer)library(DT)library(pheatmap)library(RColorBrewer)library(ggprism)library(grafify)ggplot2::theme_set(theme_prism(base_size =12))# https://grafify-vignettes.netlify.app/colour_palettes.html# NOTE change colors here if you wishscale_colour_discrete <-function(...) {scale_colour_manual(...,values =as.vector(grafify:::graf_palettes[["kelly"]]) )}scale_fill_discrete <-function(...) {scale_fill_manual(...,values =as.vector(grafify:::graf_palettes[["kelly"]]) )}opts_chunk[["set"]](cache =FALSE,cache.lazy =FALSE,dev =c("png", "pdf"),error =TRUE,highlight =TRUE,message =FALSE,prompt =FALSE,tidy =FALSE,warning =FALSE,fig.height =4)
Code
sanitize_datatable <-function(df, ...) {# remove dashes which cause wrapping DT::datatable(df, ...,rownames =gsub("-", "_", rownames(df)),colnames =gsub("-", "_", colnames(df)) )}
2 Samples and metadata
Code
# This code will load from bcbio or nf-core folder# TODO: make sure to set numerator and denominatorcoldata <-load_coldata(coldata_fn)# Change this line to change the levels to the desired order.# It will affect downstream colors in plots.coldata[[factor_of_interest]] <-as.factor(coldata[[factor_of_interest]])coldata$sample <-row.names(coldata)counts <-load_counts(counts_fn)counts <- counts[, colnames(counts) %in% coldata$sample]metrics <-load_metrics( se_object, multiqc_data_dir, gtf_fn, counts, single_end) %>%left_join(coldata, by =c("sample")) %>%as.data.frame()metrics <-subset(metrics, metrics$sample %in% coldata$sample)# TODO: change order as neededorder <-unique(metrics[["sample"]])rownames(metrics) <- metrics$sample# if the names don't match in order or string check files names and coldata informationcounts <- counts[, rownames(metrics)]coldata <- coldata[rownames(metrics), ]stopifnot(all(names(counts) ==rownames(metrics)))
# get min percent mapped reads for referencemin_pct_mapped <-round(min(metrics$mapped_reads / metrics$total_reads) *100, 1)max_pct_mapped <-round(max(metrics$mapped_reads / metrics$total_reads) *100, 1)
The genomic mapping rate represents the percentage of reads mapping to the reference genome. We want to see consistent mapping rates between samples and over 70% mapping (the grey line). These samples have mapping rates: 97.3% - 98.1 %.
In this section, we look at how well the different groups in the dataset cluster with each other. Samples from the same group should ideally be clustering together. We use Principal Component Analysis (PCA).
4.1 Principal component analysis (PCA)
Principal Component Analysis (PCA) is a statistical technique used to simplify high-dimensional data by identifying patterns and reducing the number of variables. In the context of gene expression, PCA helps analyze large datasets containing information about the expression levels of thousands of genes across different samples (e.g., tissues, cells).
Inter-correlation analysis (ICA) is another way to look at how well samples cluster by plotting the correlation between the expression profiles of the samples.
When there are multiple factors that can influence the results of a given experiment, it is useful to assess which of them is responsible for the most variance as determined by PCA. This method adapts the method described by Daily et al. for which they integrated a method to correlate covariates with principal components values to determine the importance of each factor.
RNA-seq counts were generated by the nf-core rnaseq pipeline [version] using Salmon (Patro et al. 2017). Downstream analyses were performed using R version 4.5.1 (2025-06-13). Counts were imported into R using DESeq2 version 1.46.0 (Love, Huber, and Anders 2014). Gene annotations were obtained from Ensembl. Plots were generated by ggplot2 (Wickham 2016). Heatmaps were generated by pheatmap (Kolde 2019).
7.1 R package references
Code
citation("DESeq2")
To cite package ‘DESeq2’ in publications use:
Love, M.I., Huber, W., Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2 Genome Biology 15(12):550 (2014)
A BibTeX entry for LaTeX users is
@Article{, title = {Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2}, author = {Michael I. Love and Wolfgang Huber and Simon Anders}, year = {2014}, journal = {Genome Biology}, doi = {10.1186/s13059-014-0550-8}, volume = {15}, issue = {12}, pages = {550}, }
Code
citation("ggplot2")
To cite ggplot2 in publications, please use
H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.
A BibTeX entry for LaTeX users is
@Book{, author = {Hadley Wickham}, title = {ggplot2: Elegant Graphics for Data Analysis}, publisher = {Springer-Verlag New York}, year = {2016}, isbn = {978-3-319-24277-4}, url = {https://ggplot2.tidyverse.org}, }
---title: "Quality Control"author: "Harvard Chan Bioinformatics Core"date: "`r Sys.Date()`"format: html: code-fold: true code-tools: true df-print: paged highlight-style: pygments number-sections: true self-contained: true theme: default toc: true toc-location: right toc-expand: falseparams: # Fill this file with the right paths to nfcore output # Put hg38, mm10, mm39, or other # params_file: ../00_params/params.R params_file: ../00_params/params-example.R # example data genome: hg38 single_end: false factor_of_interest: sample_type project_file: ../information.R functions_file: ../00_libs/load_data.R---Template developed with materials from https://hbctraining.github.io/main/.```{r, cache = FALSE, message = FALSE, warning=FALSE, eval = interactive()}# This set up the working directory to this file so all files can be foundlibrary(rstudioapi)setwd(fs::path_dir(getSourceEditorContext()$path))# NOTE: This code will check version, this is our recommendation, it may work# . other versionsstopifnot(R.version$major >=4) # requires R4if (compareVersion(R.version$minor, "3.1") <0) warning("We recommend >= R4.3.1")stopifnot(compareVersion(as.character(BiocManager::version()), "3.18") >=0)```This code is in this  revision.```{r source_params, cache = FALSE, message = FALSE, warning=FALSE}# 1. set up factor_of_interest parameter from parameter above or manually# this is used to color plots, it needs to be part of the metadatafactor_of_interest <- params$factor_of_interestgenome <- params$genomesingle_end <- params$single_end# 2. Set input files in this filesource(params$params_file)# 3. If you set up this file, project information will be printed below and# . it can be reused for other Rmd files.source(params$project_file)# 4. Load custom functions to load data from coldata/metrics/countssource(params$functions_file)```# Overview- Project: `r project`- PI: `r PI`- Analyst: `r analyst`- Experiment: `r experiment````{r load_libraries, cache = FALSE, message = FALSE, warning=FALSE}library(tidyverse)library(janitor)library(knitr)library(rtracklayer)library(DESeq2)library(DEGreport)library(ggrepel)# library(RColorBrewer)library(DT)library(pheatmap)library(RColorBrewer)library(ggprism)library(grafify)ggplot2::theme_set(theme_prism(base_size =12))# https://grafify-vignettes.netlify.app/colour_palettes.html# NOTE change colors here if you wishscale_colour_discrete <-function(...) {scale_colour_manual(...,values =as.vector(grafify:::graf_palettes[["kelly"]]) )}scale_fill_discrete <-function(...) {scale_fill_manual(...,values =as.vector(grafify:::graf_palettes[["kelly"]]) )}opts_chunk[["set"]](cache =FALSE,cache.lazy =FALSE,dev =c("png", "pdf"),error =TRUE,highlight =TRUE,message =FALSE,prompt =FALSE,tidy =FALSE,warning =FALSE,fig.height =4)``````{r sanitize-datatable}sanitize_datatable <-function(df, ...) {# remove dashes which cause wrapping DT::datatable(df, ...,rownames =gsub("-", "_", rownames(df)),colnames =gsub("-", "_", colnames(df)) )}```# Samples and metadata```{r load_data, message=F, warning=F}# This code will load from bcbio or nf-core folder# TODO: make sure to set numerator and denominatorcoldata <-load_coldata(coldata_fn)# Change this line to change the levels to the desired order.# It will affect downstream colors in plots.coldata[[factor_of_interest]] <-as.factor(coldata[[factor_of_interest]])coldata$sample <-row.names(coldata)counts <-load_counts(counts_fn)counts <- counts[, colnames(counts) %in% coldata$sample]metrics <-load_metrics( se_object, multiqc_data_dir, gtf_fn, counts, single_end) %>%left_join(coldata, by =c("sample")) %>%as.data.frame()metrics <-subset(metrics, metrics$sample %in% coldata$sample)# TODO: change order as neededorder <-unique(metrics[["sample"]])rownames(metrics) <- metrics$sample# if the names don't match in order or string check files names and coldata informationcounts <- counts[, rownames(metrics)]coldata <- coldata[rownames(metrics), ]stopifnot(all(names(counts) ==rownames(metrics)))``````{r load_metadata}meta_df <- coldataggplot(meta_df, aes(.data[[factor_of_interest]],fill = .data[[factor_of_interest]])) +geom_bar() +ylab("") +xlab("") +ylab("# of samples") +theme(axis.text.x =element_text(angle =90, vjust =0.5),legend.position ="none" )``````{r show_metadata}meta_sm <- meta_df %>%as.data.frame()meta_sm %>%sanitize_datatable()```# Read metrics::: {.panel-tabset}## Total readsHere, we want to see consistency and a minimum of 20 million reads (the grey line).```{r plot_total_reads}metrics %>%ggplot(aes(x =factor(sample, level = order),y = total_reads, # if total reads are not already in millions, divide by 10000000 herefill = .data[[factor_of_interest]] )) +geom_bar(stat ="identity") +coord_flip() +scale_y_continuous(name ="Millions of reads") +scale_x_discrete(limits = rev) +xlab("") +ggtitle("Total reads") +geom_hline(yintercept =20, color ="grey", linewidth =2)metrics %>%ggplot(aes(x = .data[[factor_of_interest]],y = total_reads,color = .data[[factor_of_interest]] )) +geom_point(alpha =0.5, size =4) +coord_flip() +scale_y_continuous(name ="million reads") +ggtitle("Total reads")``````{r calc_min_max_pct_mapped}# get min percent mapped reads for referencemin_pct_mapped <-round(min(metrics$mapped_reads / metrics$total_reads) *100, 1)max_pct_mapped <-round(max(metrics$mapped_reads / metrics$total_reads) *100, 1)```## Mapping rateThe genomic mapping rate represents the percentage of reads mapping to the reference genome. We want to see consistent mapping rates between samples and over 70% mapping (the grey line). These samples have mapping rates: `r min_pct_mapped`% - `r max_pct_mapped` %.```{r plot_mapping_rate}metrics$mapped_reads_pct <-round(metrics$mapped_reads / metrics$total_reads *100, 1)metrics %>%ggplot(aes(x =factor(sample, level = order),y = mapped_reads_pct,color = .data[[factor_of_interest]] )) +geom_point(alpha =0.5, size =4) +coord_flip() +ylab("Mapped Reads %") +scale_x_discrete(limits = rev) +ylim(0, 100) +ggtitle("Mapping rate") +xlab("") +geom_hline(yintercept =70, color ="grey", linewidth =2)```## Number of genes detectedThe number of genes represented in every sample is expected to be consistent and over 20K (grey line).```{r calc_genes_detected}genes_detected <-colSums(counts >0) %>%enframe()sample_names <- metrics[, c("sample"), drop = F]genes_detected <-left_join(genes_detected, sample_names, by =c("name"="sample"))genes_detected <- genes_detected %>%group_by(name)genes_detected <-summarise(genes_detected,n_genes =max(value))metrics <- metrics %>%left_join(genes_detected, by =c("sample"="name"))``````{r plot_genes_detected}ggplot(metrics, aes(x =factor(sample, level = order),y = n_genes, fill = .data[[factor_of_interest]])) +geom_bar(stat ="identity") +coord_flip() +scale_x_discrete(limits = rev) +ggtitle("Number of genes") +ylab("Number of genes") +xlab("") +geom_hline(yintercept =20000, color ="grey", linewidth =2)metrics %>%ggplot(aes(x = .data[[factor_of_interest]],y = n_genes,color = .data[[factor_of_interest]] )) +geom_point(alpha =0.5, size =4) +coord_flip() +scale_x_discrete(limits = rev) +scale_y_continuous(name ="million reads") +xlab("") +ggtitle("Number of Genes")```## Gene detection saturationThis plot shows how complex the samples are. We expect samples with more reads to detect more genes. ```{r plot_gene_saturation}metrics %>%ggplot(aes(x = total_reads /1000000,y = n_genes,color = .data[[factor_of_interest]] )) +geom_point(alpha =0.5, size =4) +scale_x_log10() +ggtitle("Gene saturation") +ylab("Number of genes") +xlab("Millions of reads")```## Exonic mapping rateHere we are looking for consistency, and exonic mapping rates around or above 70% (grey line). ```{r plot_exonic_mapping_rate}metrics %>%ggplot(aes(x =factor(sample, level = order),y = exonic_rate *100,color = .data[[factor_of_interest]] )) +geom_point(alpha =0.5, size =4) +ylab("Exonic rate %") +ggtitle("Exonic mapping rate") +scale_x_discrete(limits = rev) +coord_flip() +xlab("") +ylim(c(0, 100)) +geom_hline(yintercept =70, color ="grey", linewidth =2)```## Intronic mapping rateHere, we expect a low intronic mapping rate (≤ 15% - 20%). The grey line indicates 20%.```{r plot_intronic_mapping_rate}metrics %>%ggplot(aes(x =factor(sample, level = order),y = intronic_rate *100,color = .data[[factor_of_interest]] )) +geom_point(alpha =0.5, size =4) +ylab("Intronic rate %") +ggtitle("Intronic mapping rate") +scale_x_discrete(limits = rev) +coord_flip() +xlab("") +ylim(c(0, 100)) +geom_hline(yintercept =20, color ="grey", linewidth =2)```## Intergenic mapping rateHere, we expect a low intergenic mapping rate, which is true for all samples. The grey line indicates 15%```{r plot_intergenic_mapping_rate}metrics %>%ggplot(aes(x =factor(sample, level = order),y = intergenic_rate *100,color = .data[[factor_of_interest]] )) +geom_point(alpha =0.5, size =4) +ylab("Intergenic rate %") +ggtitle("Intergenic mapping rate") +coord_flip() +xlab("") +scale_x_discrete(limits = rev) +ylim(c(0, 100)) +geom_hline(yintercept =15, color ="grey", linewidth =2)```## tRNA/rRNA mapping rateSamples should have a ribosomal RNA (rRNA) "contamination" rate below 10% (the grey line).```{r plot_rrna_mapping_rate}rrna_ylim <-max(round(metrics$r_and_t_rna_rate *100, 2)) +10metrics %>%ggplot(aes(x =factor(sample, level = order),y = r_and_t_rna_rate *100,color = .data[[factor_of_interest]] )) +geom_point(alpha =0.5, size =4) +ylab("tRNA/rRNA rate, %") +ylim(0, rrna_ylim) +ggtitle("tRNA/rRNA mapping rate") +coord_flip() +scale_x_discrete(limits = rev) +ylim(c(0, 100)) +xlab("") +geom_hline(yintercept =10, color ="grey", linewidth =2)```## 5'->3' biasThere should be little bias, i.e. the values should be close to 1, or at least consistent among samples```{r plot_53_bias, eval=!all(is.na((metrics['r_and_t_rna_rate'])))}metrics %>%ggplot(aes(x =factor(sample, level = order),y = x5_3_bias,color = .data[[factor_of_interest]] )) +geom_point(alpha =0.5, size =4) +ggtitle("5'-3' bias") +coord_flip() +scale_x_discrete(limits = rev) +ylim(c(0.5, 1.5)) +xlab("") +ylab("5'-3' bias") +geom_hline(yintercept =1, color ="grey", linewidth =2)```## Counts per gene - all genesWe expect consistency in the box plots here between the samples, i.e. the distribution of counts across the genes is similar```{r plot_counts_per_gene}metrics_small <- metrics %>% dplyr::select(sample, .data[[factor_of_interest]])metrics_small <-left_join(sample_names, metrics_small)counts_lng <- counts %>%as_tibble() %>%filter(rowSums(.) !=0) %>%gather(name, counts)counts_lng <-left_join(counts_lng, metrics_small, by =c("name"="sample"))ggplot(counts_lng, aes(factor(name, level = order),log2(counts +1),fill = .data[[factor_of_interest]])) +geom_boxplot() +scale_x_discrete(limits = rev) +coord_flip() +xlab("") +ggtitle("Counts per gene, all non-zero genes")```:::# Sample similarity analysisIn this section, we look at how well the different groups in the dataset cluster with each other. Samples from the same group should ideally be clustering together. We use Principal Component Analysis (PCA).## Principal component analysis (PCA) Principal Component Analysis (PCA) is a statistical technique used to simplify high-dimensional data by identifying patterns and reducing the number of variables. In the context of gene expression, PCA helps analyze large datasets containing information about the expression levels of thousands of genes across different samples (e.g., tissues, cells).<!-- ### PCA, PCs 1-4, (labled) -->```{r PCA1:5 summary, all, unlabeled, fig.width= 7, fig.height = 5}vst <-vst(counts)coldat_for_pca <-as.data.frame(metrics)rownames(coldat_for_pca) <- coldat_for_pca$samplecoldat_for_pca <- coldat_for_pca[colnames(counts), ]pca1 <-degPCA(vst, coldat_for_pca,condition = factor_of_interest, data = T)[["plot"]]pca2 <-degPCA(vst, coldat_for_pca,condition = factor_of_interest, data = T, pc1 ="PC3", pc2 ="PC4")[["plot"]]pca1 +scale_color_grafify(palette ="kelly")pca2 +scale_color_grafify(palette ="kelly")```## Hierarchical clusteringInter-correlation analysis (ICA) is another way to look at how well samplescluster by plotting the correlation between the expression profiles of thesamples.```{r clustering fig, fig.width = 10, fig.asp = .62}vst_cor <-cor(vst)colma <- coldata %>%as.data.frame()rownames(colma) <- colma$samplecolma <- colma[rownames(vst_cor), ]colma <- colma %>% dplyr::select(.data[[factor_of_interest]])anno_colors <-lapply(colnames(colma), function(c) { l.col <- grafify:::graf_palettes[["kelly"]][1:length(unique(colma[[c]]))]names(l.col) <-levels(colma[[c]]) l.col})names(anno_colors) <-colnames(colma)p <-pheatmap(vst_cor,annotation = colma,annotation_colors = anno_colors,show_rownames = T,show_colnames = T,color =colorRampPalette(brewer.pal(11, "RdBu"))(15))p```# Covariates analysisWhen there are multiple factors that can influence the results of a given experiment, it is useful to assess which of them is responsible for the most variance as determined by PCA. This method adapts the method described by Daily et al. for which they integrated a method to correlate covariates with principal components values to determine the importance of each factor. ```{r covariate-plot,fig.height=12, fig.width=10}## Remove non-useful columns output by nf-corecoldat_2 <-data.frame(coldat_for_pca[, !(colnames(coldat_for_pca) %in%c("fastq_1", "fastq_2", "salmon_library_types", "salmon_compatible_fragment_ratio", "samtools_reads_mapped_percent", "samtools_reads_properly_paired_percent", "samtools_mapped_passed_pct", "strandedness", "qualimap_5_3_bias"))])# Remove missing datacoldat_2 <-na.omit(coldat_2)degCovariates(vst, metadata = coldat_2)```# Conclusions# MethodsRNA-seq counts were generated by the nf-core rnaseq pipeline [version] using Salmon (Patro et al. 2017). Downstream analyses were performed using `r version$version.string`. Counts were imported into R using DESeq2 version `r packageVersion("DESeq2")` (Love, Huber, and Anders 2014). Gene annotations were obtained from Ensembl. Plots were generated by ggplot2 (Wickham 2016). Heatmaps were generated by pheatmap (Kolde 2019).## R package references```{r citations, results='asis'}citation("DESeq2")citation("ggplot2")citation("pheatmap")```## R sessionList and version of tools used for the QC report generation.```{r}sessionInfo()```