scRNA pseudobulk
Template developed with materials from https://hbctraining.github.io/main/.
Code
# Create a function to clean up data frames
sanitize_datatable <- function(df, ...) {
# Remove dashes from row names and column names which cause wrapping
DT::datatable(df, ..., rownames = gsub("-", "_", rownames(df)), colnames = gsub("-",
"_", colnames(df))) %>%
formatRound(columns = names(df)[sapply(df, is.numeric)], digits = 3)
}
This code is in this revision.
Overview
- Project: name_hbcXXXXX
- PI: person name
- Analyst: person in the core
- Experiment: short description
- Aim: short description
Here we will apply a pseudobulk approach to look for differentially expressed genes in one of our cell types.
Using a pseudobulk approach involves the following steps:
- Subsetting to the cells for the cell type(s) of interest to perform the DE analysis.
- Extracting the raw counts after QC filtering of cells to be used for the DE analysis.
- Aggregating the counts and metadata to the sample level.
- Performing the DE analysis (you need at least two biological replicates per condition to perform the analysis, but more replicates are recommended).
Dataset
After filtering, each sample contributed the following number of cells to the analysis:
Aggregate counts
Aggregate metadata at the sample level
Aggregate counts
To aggregate the counts, we will use the AggregateExpression() function from Seurat. It will take as input a Seurat object, and return summed counts (“pseudobulk”) for each identity class. The default is to return a matrix with genes as rows, and identity classes as columns. We have set return.seurat to TRUE, which means rather than a matrix we will get an object of class Seurat. We have also specified which factors to aggregate on, using the group.by argument.
Add number of cells per sample per cluster to the metadata
Code
# Number of cells by sample and celltype
n_cells <- seurat@meta.data %>%
dplyr::count(sample, .data[[column]]) %>%
dplyr::rename(n_cells = "n")
# n_cells$sample <- str_replace(n_cells$sample, '_', '-')
## extra check if aggregated sample names start with numbers
n_cells$sample <- ifelse(grepl("^\\d", n_cells$sample), paste0("g", n_cells$sample),
n_cells$sample)
meta_bulk <- bulk@meta.data
meta_bulk$sample <- str_replace_all(meta_bulk$sample, "-", "_")
meta_bulk <- meta_bulk %>%
left_join(n_cells, by = c("sample", column))
rownames(meta_bulk) <- meta_bulk$orig.ident
bulk@meta.data <- meta_bulk
# Turn column into a factor bulk[[column]] <- as.factor(bulk[[column]]) test
# for consistency
stopifnot(all(Cells(bulk) == row.names(bulk@meta.data)))
bulk@meta.data %>%
head() %>%
sanitize_datatable()
DE analysis with DESeq2
Subset to cell type of interest
Check that we have enough cells
Before moving on to a pseudobulk DGE analysis, it is important to identify how many cells we aggregated for each sample. We need to make sure that we have enough cells per sample after subsetting to one celltype. We recommend 50 cells per sample to move forward with confidence.
Code
ggplot(bulk@meta.data, aes(x = orig.ident, y = n_cells)) + geom_bar(stat = "identity",
color = "black", aes(fill = .data[[column]])) + theme_classic() + theme(axis.text.x = element_text(angle = 45,
vjust = 1, hjust = 1)) + labs(x = "Sample name", y = "Number of cells") + geom_text(aes(label = n_cells),
vjust = -0.5)
Run DE analysis
Differential gene expression analysis of count data was performed using the Bioconductor R package, DESeq2, which fits the count data to a negative binomial model.
Before fitting the model, we often look at a metric called dispersion, which is a measure for variance which also takes into consideration mean expression. A dispersion value is estimated for each individual gene, then ‘shrunken’ to a more accurate value based on expected variation for a typical gene exhibiting that level of expression. Finally, the shrunken dispersion value is used in the final GLM fit.
We use the below dispersion plot, which should show an inverse relationship between dispersion and mean expression, to get an idea of whether our data is a good fit for the model.
Code
Because it is difficult to accurately detect and quantify the expression of lowly expressed genes, differences in their expression between treatment conditions can be unduly exaggerated after the model is fit. We correct for this so that gene LFC is not dependent overall on basal gene expression level.
In cases there are multiple groups and conditions across groups is recommended to use dummy variables instead of interaction terms: https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#interactions.
The LRT is useful for testing multiple terms at once, for example testing 3 or more levels of a factor at once, or all interactions between two variables. The LRT for count data is conceptually similar to an analysis of variance (ANOVA) calculation in linear regression, except that in the case of the Negative Binomial GLM, we use an analysis of deviance (ANODEV), where the deviance captures the difference in likelihood between a full and a reduced model.
Code
# NOTE As a note: Use `ashr` for comparisons with many groups to be able to
# pull out all the contrasts; otherwise `apeglm` is fine. It shrinks less.
# NOTE We recommend LRT for time series
# resultsNames(de) # check the order is right
names_to_use <- lapply(contrasts, function(contrast) {
coef <- paste0(contrast[1], "_", contrast[2], "_vs_", contrast[3])
})
names(contrasts) <- names_to_use
de_list <- lapply(contrasts, function(contrast) {
resLFC <- results(de, contrast = contrast)
coef <- paste0(contrast[1], "_", contrast[2], "_vs_", contrast[3])
# resLFCS <- lfcShrink(de, coef=coef, type='apeglm')
resLFCS <- lfcShrink(de, contrast = contrast, type = "ash")
res <- as.data.frame(resLFCS) %>%
rownames_to_column("gene_id") %>%
dplyr::rename(lfc = log2FoldChange) %>%
mutate(pi = abs(lfc) * -log10(padj)) %>%
arrange(-pi)
## Filter out genes that have no expression or were filtered out by DESEQ2
res <- res[res$baseMean > 0, ] %>%
drop_na(padj) %>%
drop_na(pvalue)
res_sig <- res %>%
filter(padj < 0.05) %>%
arrange(padj)
results <- list(lfc = resLFC, lfcs = resLFCS, all = res, sig = res_sig)
return(results)
})
# NOTE if you add manually any other comparison to the list with the following
# variables, the code below will make the plots for those as wells:
# de_list=c(de_list, new_comparison=list(lfc=resLFC, lfcs=resLFCS, all=res,
# sig=res_sig))
MA plot
This plot can help to:
- Identify Differential Expression: Genes that show a significant log-fold change (M value away from 0) indicate changes in expression between conditions.
- Assess Data Quality: The plot can help in identifying biases or systematic errors in the data. Ideally, most points should scatter around the M=0 line, indicating that there is no significant systematic difference between the conditions.
- Visualize data dispersion: The distribution of points along the A-axis gives a sense of the spread of expression levels and any patterns or anomalies in the dataset.
Volcano plot
This volcano plot shows the genes that are significantly up- and down-regulated as a result of the analysis comparison. The points highlighted in red are genes that have padj < 0.05 and a log2-fold change > 1. Points in blue have a padj < 0.05 and a log2-fold change < 1 and points in green have a padj > 0.05 and a log2-fold change > 2. Grey points are non-significant. The dashed lines correspond to the cutoff values of log2 foldchance and padj that we have chosen.
Heatmap
Differentially Expressed Genes
Plot top 16 genes - pseudobulk
Plot top 16 genes - single cell
Methods
Seurat (package, paper) is used to aggregate the single cell expression data into pseudobulk samples, and DESeq2 (package, paper) is used to perform statistical analysis.
R session
List and version of tools used for the QC report generation.
R version 4.5.1 (2025-06-13)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.5 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
time zone: UTC
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] ashr_2.2-63 bcbioR_0.4.5
[3] shiny_1.11.1 caTools_1.18.3
[5] viridis_0.6.5 viridisLite_0.4.2
[7] pheatmap_1.0.13 EnhancedVolcano_1.26.0
[9] ggrepel_0.9.6 DEGreport_1.44.0
[11] DESeq2_1.48.1 SummarizedExperiment_1.38.1
[13] Biobase_2.68.0 MatrixGenerics_1.20.0
[15] matrixStats_1.5.0 GenomicRanges_1.60.0
[17] GenomeInfoDb_1.44.1 IRanges_2.42.0
[19] S4Vectors_0.46.0 BiocGenerics_0.54.0
[21] generics_0.1.4 lubridate_1.9.4
[23] forcats_1.0.0 stringr_1.5.1
[25] dplyr_1.1.4 purrr_1.1.0
[27] readr_2.1.5 tidyr_1.3.1
[29] tibble_3.3.0 tidyverse_2.0.0
[31] R.utils_2.13.0 R.oo_1.27.1
[33] R.methodsS3_1.8.2 grafify_5.0.0.1
[35] ggprism_1.0.6 clustree_0.5.1
[37] ggraph_2.2.1 ggplot2_3.5.2
[39] patchwork_1.3.1 DT_0.33
[41] data.table_1.17.8 rmarkdown_2.29
[43] knitr_1.50 harmony_1.2.3
[45] Rcpp_1.1.0 Seurat_5.3.0
[47] SeuratObject_5.1.0 sp_2.2-0
loaded via a namespace (and not attached):
[1] fs_1.6.6 bitops_1.0-9
[3] spatstat.sparse_3.1-0 httr_1.4.7
[5] RColorBrewer_1.1-3 doParallel_1.0.17
[7] numDeriv_2016.8-1.1 tools_4.5.1
[9] sctransform_0.4.2 backports_1.5.0
[11] R6_2.6.1 lazyeval_0.2.2
[13] uwot_0.2.3 mgcv_1.9-3
[15] GetoptLong_1.0.5 withr_3.0.2
[17] gridExtra_2.3 progressr_0.15.1
[19] cli_3.6.5 formatR_1.14
[21] logging_0.10-108 spatstat.explore_3.5-2
[23] fastDummies_1.7.5 labeling_0.4.3
[25] sass_0.4.10 SQUAREM_2021.1
[27] mvtnorm_1.3-3 spatstat.data_3.1-6
[29] ggridges_0.5.6 pbapply_1.7-4
[31] mixsqp_0.3-54 foreign_0.8-90
[33] parallelly_1.45.1 invgamma_1.2
[35] limma_3.64.1 rstudioapi_0.17.1
[37] shape_1.4.6.1 crosstalk_1.2.1
[39] ica_1.0-3 spatstat.random_3.4-1
[41] car_3.1-3 Matrix_1.7-3
[43] abind_1.4-8 lifecycle_1.0.4
[45] whisker_0.4.1 yaml_2.3.10
[47] edgeR_4.6.3 carData_3.0-5
[49] SparseArray_1.8.1 Rtsne_0.17
[51] grid_4.5.1 promises_1.3.3
[53] crayon_1.5.3 miniUI_0.1.2
[55] lattice_0.22-7 cowplot_1.2.0
[57] pillar_1.11.0 ComplexHeatmap_2.24.1
[59] rjson_0.2.23 boot_1.3-31
[61] estimability_1.5.1 future.apply_1.20.0
[63] codetools_0.2-20 glue_1.8.0
[65] spatstat.univar_3.1-4 vctrs_0.6.5
[67] png_0.1-8 spam_2.11-1
[69] Rdpack_2.6.4 gtable_0.3.6
[71] cachem_1.1.0 xfun_0.52
[73] rbibutils_2.3 S4Arrays_1.8.1
[75] mime_0.13 tidygraph_1.3.1
[77] ConsensusClusterPlus_1.72.0 coda_0.19-4.1
[79] reformulas_0.4.1 survival_3.8-3
[81] iterators_1.0.14 statmod_1.5.0
[83] fitdistrplus_1.2-4 ROCR_1.0-11
[85] nlme_3.1-168 usethis_3.1.0
[87] RcppAnnoy_0.0.22 bslib_0.9.0
[89] irlba_2.3.5.1 KernSmooth_2.23-26
[91] rpart_4.1.24 colorspace_2.1-1
[93] Hmisc_5.2-3 nnet_7.3-20
[95] mnormt_2.1.1 tidyselect_1.2.1
[97] emmeans_1.11.2 compiler_4.5.1
[99] htmlTable_2.4.3 ggdendro_0.2.0
[101] DelayedArray_0.34.1 plotly_4.11.0
[103] checkmate_2.3.2 scales_1.4.0
[105] psych_2.5.6 lmtest_0.9-40
[107] digest_0.6.37 goftest_1.2-3
[109] spatstat.utils_3.1-5 minqa_1.2.8
[111] XVector_0.48.0 htmltools_0.5.8.1
[113] pkgconfig_2.0.3 base64enc_0.1-3
[115] lme4_1.1-37 fastmap_1.2.0
[117] GlobalOptions_0.1.2 rlang_1.1.6
[119] htmlwidgets_1.6.4 UCSC.utils_1.4.0
[121] jquerylib_0.1.4 farver_2.1.2
[123] zoo_1.8-14 jsonlite_2.0.0
[125] BiocParallel_1.42.1 magrittr_2.0.3
[127] Formula_1.2-5 GenomeInfoDbData_1.2.14
[129] dotCall64_1.2 reticulate_1.43.0
[131] stringi_1.8.7 MASS_7.3-65
[133] plyr_1.8.9 parallel_4.5.1
[135] listenv_0.9.1 deldir_2.0-4
[137] graphlayouts_1.2.2 splines_4.5.1
[139] tensor_1.5.1 circlize_0.4.16
[141] hms_1.1.3 locfit_1.5-9.12
[143] igraph_2.1.4 spatstat.geom_3.5-0
[145] RcppHNSW_0.6.0 reshape2_1.4.4
[147] evaluate_1.0.4 BiocManager_1.30.26
[149] nloptr_2.2.1 tzdb_0.5.0
[151] foreach_1.5.2 tweenr_2.0.3
[153] httpuv_1.6.16 RANN_2.6.2
[155] polyclip_1.10-7 reshape_0.8.10
[157] clue_0.3-66 future_1.58.0
[159] scattermore_1.2 ggforce_0.5.0
[161] broom_1.0.9 xtable_1.8-4
[163] RSpectra_0.16-2 later_1.4.2
[165] truncnorm_1.0-9 lmerTest_3.1-3
[167] memoise_2.0.1 cluster_2.1.8.1
[169] timechange_0.3.0 globals_0.18.0