Single-cell QC Report

Author

Harvard Chan Bioinformatics Core

Published

August 14, 2025

Code
stopifnot(R.version$major >= 4) # requires R4
if (compareVersion(R.version$minor, "3.1") < 0) warning("We recommend >= R4.3.1")
stopifnot(compareVersion(as.character(BiocManager::version()), "3.18") >= 0)
stopifnot(compareVersion(as.character(packageVersion("Seurat")), "5.0.0") >= 0)
invisible(list2env(params, environment()))
source(params_file)
source(project_file)
Code
sanitize_datatable <- function(df, ...) {
  DT::datatable(df, ...,
    rownames = gsub("-", "_", rownames(df)),
    colnames = gsub("-", "_", colnames(df))
  )
}

This code is in this revision.

READ ME FIRST

This is a template for scRNA QC to present to your client. The actual QC can be done using our rshiny app:

https://github.com/hbc/scRNAseq_qc_app/archive/refs/heads/main.zip

Please download the app, and execute it to identify parameters interactively

After you have decided on your QC metrics load your raw object (i.e. right after you first read data into seurat) and put the parameters.R file you got from the shiny app in the same folder as this rmd.

Overview

  • Project: name_hbcXXXXX
  • PI: person name
  • Analyst: person in the core
  • Experiment: short description
Code
if (isUrl(seurat_obj)) {
  seurat_raw <- readRDS(url(seurat_obj))
} else {
  seurat_raw <- readRDS(seurat_obj)
}
Code
# if you suspect that your dataset contains doublets, you can use this code to
# detect them and filter them out. if your seurat object contains data from more
# than one sample, it is important to pass the metadata column containing the
# sample name to scDblFinder using the "samples" argument

sce <- as.SingleCellExperiment(seurat_raw, assay = "RNA")
sce <- scDblFinder(sce, samples = "orig.ident")

meta_scdblfinder <- sce@colData@listData %>%
  as.data.frame() %>%
  dplyr::select(starts_with("scDblFinder")) %>%
  mutate(barcode = sce@colData@rownames)

seurat_raw_meta <- seurat_raw@meta.data

seurat_raw_meta_new <- seurat_raw_meta %>% left_join(meta_scdblfinder, by = "barcode")
rownames(seurat_raw_meta_new) <- seurat_raw_meta_new$barcode
seurat_raw@meta.data <- seurat_raw_meta_new

seurat_raw <- subset(seurat_raw, scDblFinder.class == "singlet")
Code
seurat_qc <- subset(
  x = seurat_raw,
  subset = (nCount_RNA >= nCount_RNA_cutoff) &
    (nFeature_RNA >= nFeature_RNA_cutoff) &
    (mitoRatio < mitoRatio_cutoff)
  ##  & (riboRatio < riboRatio_cutoff)
  & (Log10GenesPerUMI > Log10GenesPerUMI_cutoff)
)
Code
seurat_qc <- subset(
  x = seurat_raw,
  subset = (nCount_RNA >= nCount_RNA_cutoff) &
    (nFeature_RNA >= nFeature_RNA_cutoff) &
    (mitoRatio < mitoRatio_cutoff) &
    (riboRatio < riboRatio_cutoff) &
    (Log10GenesPerUMI > Log10GenesPerUMI_cutoff)
)
Code
## Save QC object
saveRDS(seurat_qc, file = params$seurat_obj_filt_fn)
Code
metadata0 <- seurat_raw@meta.data
metadata0 <- metadata0 %>% dplyr::rename(
  nUMI = nCount_RNA,
  nGene = nFeature_RNA
)
metadata1 <- seurat_qc@meta.data
metadata1 <- metadata1 %>% dplyr::rename(
  nUMI = nCount_RNA,
  nGene = nFeature_RNA
)

QC metrics: raw data

In this section, we review quality control (QC) metrics for the raw feature matrices generated by Cellranger. Only a low level filter excluding cells with <100 nUMIs (= number of unique molecular identifiers, or sequenced reads per cell) was applied when uploading the data into R.

Cells per sample

Code
metadata0 %>%
  group_by(orig.ident) %>%
  summarize(n_cells_pre_filt = n()) %>%
  sanitize_datatable()

UMIs per cell

Here, we look at the distribution of UMIs (unique molecular identifiers, or sequenced reads) per cell (droplet) in the dataset. Before QC, we expect a biomodal distribution with a first small peak at low numbers of UMIs (<250) corresponding to droplets that encapsulated background/dying cells, and a second higher peak centered at >1000. The line is at 0.

Code
metadata0 %>%
  ggplot(aes(x = nUMI, color = orig.ident, fill = orig.ident)) +
  geom_density(alpha = 0.2) +
  theme_classic() +
  ylab("Cell density") +
  scale_x_log10() +
  geom_vline(xintercept = nCount_RNA_cutoff) +
  facet_wrap(. ~ orig.ident) +
  ggtitle("UMIs per cell in raw dataset")

Code
# Visualize the distribution of nUMIs per cell (boxplot)
metadata0 %>%
  ggplot(aes(x = orig.ident, y = log10(nUMI), fill = orig.ident)) +
  geom_violin() +
  geom_boxplot(width = 0.1, fill = alpha("white", 0.7)) +
  theme_classic() +
  geom_hline(yintercept = log10(nCount_RNA_cutoff)) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

Genes per cell

Here, we look at the number of different genes that were detected in each cell. By “detected”, we mean genes with a non-zero read count measurement. Gene detection in the range of 500 to 5000 is normal for most single-cell experiments. The line is at 0.

Code
# Visualize the distribution of genes detected per cell (histogram)
metadata0 %>%
  ggplot(aes(x = nGene, color = orig.ident, fill = orig.ident)) +
  geom_density(alpha = 0.2) +
  theme_classic() +
  scale_x_log10() +
  geom_vline(xintercept = c(nFeature_RNA_cutoff)) +
  facet_wrap(. ~ orig.ident) +
  ggtitle("Detected genes per cell in raw dataset")

Code
# Visualize the distribution of nUMIs per cell (boxplot)
metadata0 %>%
  ggplot(aes(x = orig.ident, y = log10(nGene), fill = orig.ident)) +
  geom_violin() +
  geom_boxplot(width = 0.1, fill = alpha("white", 0.7)) +
  theme_classic() +
  geom_hline(yintercept = c(log10(nFeature_RNA_cutoff))) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

Mitochondrial ratio

We evaluate overall mitochondrial gene expression as a biomarker of cellular stress during sample preparation. Typically, we expect mitochondrial genes to account for <20% of overall transcripts in each cell. The line indicates 100 %.

Code
# Visualize the distribution of mitochondrial gene expression detected per cell
metadata0 %>%
  ggplot(aes(color = orig.ident, x = mitoRatio, fill = orig.ident)) +
  geom_density(alpha = 0.2) +
  scale_x_log10() +
  theme_classic() +
  geom_vline(xintercept = mitoRatio_cutoff) +
  facet_wrap(. ~ orig.ident) +
  ggtitle("Percentage of mitochondrial gene expression per cell in raw dataset")

Code
# Visualize the distribution of mitochondrial gene expression per cell (violin plot)
metadata0 %>%
  ggplot(aes(x = orig.ident, y = mitoRatio, fill = orig.ident)) +
  geom_violin() +
  geom_boxplot(width = 0.1, fill = alpha("white", 0.7)) +
  theme_classic() +
  geom_hline(yintercept = c(mitoRatio_cutoff)) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

Code
cat("## Ribosomal ratio \n")

cat("We evaluate overall ribosomal gene expression. Different cells types are expected to have different levels of ribosomal expression. Due to this, there is no suggested cutoff for ribosomal ratio. We merely expect it to be similar among samples with similar cellular composition. Note that extremely high levels can indicate low quality reads. \n")
# Visualize the distribution of mitochondrial gene expression detected per cell
metadata0 %>%
  ggplot(aes(color = orig.ident, x = riboRatio, fill = orig.ident)) +
  geom_density(alpha = 0.2) +
  scale_x_log10() +
  theme_classic() +
  facet_wrap(. ~ orig.ident) +
  ggtitle("Percentage of ribosomal gene expression per cell in raw dataset")

UMIs vs. Genes

By plotting the number of UMIs per cell (x-axis) vs. the number of genes per cell (y-axis), we can visually assess whether there is a large proportion of low quality cells with low read counts and/or gene detection (bottom left quadrant of the plot). In the following representation, cells are further color-coded based on the percentage of mitochondrial genes found among total detected genes. The line for nUMI is at 0 and the line for nGene is at 0.

Code
# Visualize the correlation between genes detected and number of UMIs and determine whether strong presence of cells with low numbers of genes/UMIs
metadata0 %>%
  ggplot(aes(x = nUMI, y = nGene, color = mitoRatio)) +
  geom_point() +
  stat_smooth(method = lm) +
  scale_x_log10() +
  scale_y_log10() +
  theme_classic() +
  geom_vline(xintercept = nCount_RNA_cutoff) +
  geom_hline(yintercept = nFeature_RNA_cutoff) +
  ggtitle("Genes vs. nUMIs in raw dataset") +
  facet_wrap(~orig.ident)

Complexity

Another way to assess the quality and purity of a single-cell dataset is to look for cells that have fewer detected genes per UMI than others. Typical values for this metric are >0.8 for most cells. Cells with lower diversity in the genes they express may be low-complexity cell types such as red blood cells. With sorted populations, we expect high purity and a very similar complexity distribution across samples. The line is at 0.

Code
# Visualize the overall novelty of the gene expression by visualizing the genes detected per UMI
metadata0 %>%
  ggplot(aes(x = Log10GenesPerUMI, color = orig.ident, fill = orig.ident)) +
  geom_density(alpha = 0.2) +
  theme_classic() +
  geom_vline(xintercept = Log10GenesPerUMI_cutoff) +
  facet_wrap(. ~ orig.ident) +
  ggtitle("log10(Genes per UMI) in raw dataset")

Code
# Visualize the overall novelty of the gene expression by visualizing the genes detected per UMI (boxplot)
metadata0 %>%
  ggplot(aes(x = orig.ident, Log10GenesPerUMI, fill = orig.ident)) +
  geom_violin() +
  geom_boxplot(width = 0.1, fill = alpha("white", 0.7)) +
  theme_classic() +
  geom_hline(yintercept = Log10GenesPerUMI_cutoff) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

QC metrics: Filtered data

Based on the above QC metrics, we filtered the dataset to isolate cells passing the following thresholds: >0 UMIs, >0 genes, <1 mitochondrial gene ratio, and >0 complexity.

In this section, we review QC metrics for our filtered dataset.

Cells per sample

Code
metadata0 %>%
  group_by(orig.ident) %>%
  summarize(n_cells_pre_filt = n()) %>%
  left_join(metadata1 %>% group_by(orig.ident) %>% summarize(n_cells_post_filt = n())) %>%
  mutate(pct_remaining = round(n_cells_post_filt / n_cells_pre_filt * 100, 3)) %>%
  sanitize_datatable()

UMIs per cell

Code
metadata1 %>%
  ggplot(aes(color = orig.ident, x = nUMI, fill = orig.ident)) +
  geom_density(alpha = 0.2) +
  scale_x_log10() +
  theme_classic() +
  ylab("Cell density") +
  xlab("nUMI") +
  facet_wrap(. ~ orig.ident) +
  ggtitle("UMIs per cell in filtered dataset")

Code
# Visualize the distribution of nUMIs per cell (boxplot)
metadata1 %>%
  ggplot(aes(x = orig.ident, y = log10(nUMI), fill = orig.ident)) +
  geom_violin() +
  geom_boxplot(width = 0.1, fill = alpha("white", 0.7)) +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

Genes per cell

Code
# Visualize the distribution of genes detected per cell via histogram
metadata1 %>%
  ggplot(aes(color = orig.ident, x = nGene, fill = orig.ident)) +
  geom_density(alpha = 0.2) +
  theme_classic() +
  scale_x_log10() +
  xlab("nGene") +
  facet_wrap(. ~ orig.ident) +
  ggtitle("Detected genes per cell in filtered dataset")

Code
# Visualize the distribution of genes detected per cell (boxplot)
metadata1 %>%
  ggplot(aes(x = orig.ident, y = log10(nGene), fill = orig.ident)) +
  geom_violin() +
  geom_boxplot(width = 0.1, fill = alpha("white", 0.7)) +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

Mitochondrial ratio

Code
# Visualize the distribution of mitochondrial gene expression detected per cell
metadata1 %>%
  ggplot(aes(color = orig.ident, x = mitoRatio, fill = orig.ident)) +
  geom_density(alpha = 0.2) +
  scale_x_log10() +
  theme_classic() +
  geom_vline(xintercept = 0.1)

Code
# +
#   facet_wrap(. ~ surgery)
Code
# Visualize the distribution of mitochondrial gene expression per cell (violin plot)
metadata1 %>%
  ggplot(aes(x = orig.ident, y = mitoRatio, fill = orig.ident)) +
  geom_violin() +
  geom_boxplot(width = 0.1, fill = alpha("white", 0.7)) +
  theme_classic() +
  geom_hline(yintercept = c(mitoRatio_cutoff)) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

Code
cat("## Ribosomal ratio \n")

# Visualize the distribution of ribosomal gene expression detected per cell
metadata1 %>%
  ggplot(aes(color = orig.ident, x = riboRatio, fill = orig.ident)) +
  geom_density(alpha = 0.2) +
  scale_x_log10() +
  theme_classic() +
  facet_wrap(. ~ orig.ident) +
  ggtitle("Percentage of ribosomal gene expression per cell in filtered dataset")
Code
metadata1 %>%
  ggplot(aes(x = orig.ident, y = riboRatio, fill = orig.ident)) +
  geom_violin() +
  geom_boxplot(width = 0.1, fill = alpha("white", 0.7)) +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

UMIs vs. Genes

Code
metadata1 %>%
  ggplot(aes(x = nUMI, y = nGene, color = mitoRatio)) +
  geom_point() +
  stat_smooth(method = lm) +
  scale_x_log10() +
  scale_y_log10() +
  theme_classic() +
  ggtitle("Genes vs. nUMIs in filtered dataset") +
  xlab("nUMI") +
  ylab("nGene") +
  facet_wrap(~orig.ident)

Complexity

Code
# Visualize the overall novelty of the gene expression by visualizing the genes detected per UMI
metadata1 %>%
  ggplot(aes(x = Log10GenesPerUMI, color = orig.ident, fill = orig.ident)) +
  geom_density(alpha = 0.2) +
  theme_classic() +
  facet_wrap(. ~ orig.ident) +
  ggtitle("log10(Genes per UMI) in filtered dataset")

Code
# Visualize the distribution of nUMIs per cell (boxplot)
metadata1 %>%
  ggplot(aes(x = orig.ident, Log10GenesPerUMI, fill = orig.ident)) +
  geom_violin() +
  geom_boxplot(width = 0.1, fill = alpha("white", 0.7)) +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

R session

Code
sessionInfo()
R version 4.5.1 (2025-06-13)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] R.utils_2.13.0     R.oo_1.27.1        R.methodsS3_1.8.2  lubridate_1.9.4   
 [5] forcats_1.0.0      stringr_1.5.1      dplyr_1.1.4        purrr_1.1.0       
 [9] readr_2.1.5        tidyr_1.3.1        tibble_3.3.0       ggplot2_3.5.2     
[13] tidyverse_2.0.0    Seurat_5.3.0       SeuratObject_5.1.0 sp_2.2-0          

loaded via a namespace (and not attached):
  [1] RColorBrewer_1.1-3     jsonlite_2.0.0         magrittr_2.0.3        
  [4] spatstat.utils_3.1-4   farver_2.1.2           rmarkdown_2.29        
  [7] vctrs_0.6.5            ROCR_1.0-11            spatstat.explore_3.4-3
 [10] htmltools_0.5.8.1      sass_0.4.10            sctransform_0.4.2     
 [13] parallelly_1.45.0      KernSmooth_2.23-26     bslib_0.9.0           
 [16] htmlwidgets_1.6.4      ica_1.0-3              plyr_1.8.9            
 [19] cachem_1.1.0           plotly_4.11.0          zoo_1.8-14            
 [22] igraph_2.1.4           mime_0.13              lifecycle_1.0.4       
 [25] pkgconfig_2.0.3        Matrix_1.7-3           R6_2.6.1              
 [28] fastmap_1.2.0          fitdistrplus_1.2-4     future_1.58.0         
 [31] shiny_1.10.0           digest_0.6.37          patchwork_1.3.1       
 [34] tensor_1.5.1           RSpectra_0.16-2        irlba_2.3.5.1         
 [37] crosstalk_1.2.1        labeling_0.4.3         progressr_0.15.1      
 [40] spatstat.sparse_3.1-0  timechange_0.3.0       mgcv_1.9-1            
 [43] httr_1.4.7             polyclip_1.10-7        abind_1.4-8           
 [46] compiler_4.5.1         withr_3.0.2            fastDummies_1.7.5     
 [49] MASS_7.3-65            tools_4.5.1            lmtest_0.9-40         
 [52] httpuv_1.6.16          future.apply_1.20.0    goftest_1.2-3         
 [55] glue_1.8.0             nlme_3.1-168           promises_1.3.2        
 [58] grid_4.5.1             Rtsne_0.17             cluster_2.1.8.1       
 [61] reshape2_1.4.4         generics_0.1.4         gtable_0.3.6          
 [64] spatstat.data_3.1-6    tzdb_0.5.0             data.table_1.17.8     
 [67] hms_1.1.3              spatstat.geom_3.4-1    RcppAnnoy_0.0.22      
 [70] ggrepel_0.9.6          RANN_2.6.2             pillar_1.10.2         
 [73] spam_2.11-1            RcppHNSW_0.6.0         later_1.4.2           
 [76] splines_4.5.1          lattice_0.22-6         survival_3.8-3        
 [79] deldir_2.0-4           tidyselect_1.2.1       miniUI_0.1.2          
 [82] pbapply_1.7-2          knitr_1.50             gridExtra_2.3         
 [85] scattermore_1.2        xfun_0.52              matrixStats_1.5.0     
 [88] DT_0.33                stringi_1.8.7          lazyeval_0.2.2        
 [91] yaml_2.3.10            evaluate_1.0.3         codetools_0.2-20      
 [94] BiocManager_1.30.25    cli_3.6.5              uwot_0.2.3            
 [97] xtable_1.8-4           reticulate_1.42.0      jquerylib_0.1.4       
[100] Rcpp_1.0.14            globals_0.18.0         spatstat.random_3.4-1 
[103] png_0.1-8              spatstat.univar_3.1-4  parallel_4.5.1        
[106] dotCall64_1.2          listenv_0.9.1          viridisLite_0.4.2     
[109] scales_1.4.0           ggridges_0.5.6         rlang_1.1.6           
[112] cowplot_1.1.3