scRNA pseudobulk

Author

Harvard Chan Bioinformatics Core

Published

August 14, 2025

Template developed with materials from https://hbctraining.github.io/main/.

Code
stopifnot(R.version$major >= 4)  # requires R4
if (compareVersion(R.version$minor, "3.1") < 0) warning("We recommend >= R4.3.1")
stopifnot(compareVersion(as.character(BiocManager::version()), "3.18") >= 0)
stopifnot(compareVersion(as.character(packageVersion("Seurat")), "5.0.0") >= 0)
Code
# Create a function to clean up data frames
sanitize_datatable <- function(df, ...) {
    # Remove dashes from row names and column names which cause wrapping
    DT::datatable(df, ..., rownames = gsub("-", "_", rownames(df)), colnames = gsub("-",
        "_", colnames(df))) %>%
        formatRound(columns = names(df)[sapply(df, is.numeric)], digits = 3)
}

This code is in this revision.

Overview

  • Project: name_hbcXXXXX
  • PI: person name
  • Analyst: person in the core
  • Experiment: short description
  • Aim: short description

Here we will apply a pseudobulk approach to look for differentially expressed genes in one of our cell types.

Using a pseudobulk approach involves the following steps:

  1. Subsetting to the cells for the cell type(s) of interest to perform the DE analysis.
  2. Extracting the raw counts after QC filtering of cells to be used for the DE analysis.
  3. Aggregating the counts and metadata to the sample level.
  4. Performing the DE analysis (you need at least two biological replicates per condition to perform the analysis, but more replicates are recommended).

Dataset

Code
# Loading object
if (isUrl(seurat_obj)) {
    seurat <- readRDS(url(seurat_obj))
} else {
    seurat <- readRDS(seurat_obj)
}

DefaultAssay(seurat) <- "RNA"

After filtering, each sample contributed the following number of cells to the analysis:

Code
seurat@meta.data %>%
    group_by(orig.ident) %>%
    summarize(n_bins = n()) %>%
    sanitize_datatable()

Aggregate counts

Aggregate metadata at the sample level

Code
meta_columns <- c("orig.ident", column)
meta <- seurat@meta.data %>%
    select(meta_columns) %>%
    unique() %>%
    remove_rownames()

meta %>%
    sanitize_datatable()

Aggregate counts

To aggregate the counts, we will use the AggregateExpression() function from Seurat. It will take as input a Seurat object, and return summed counts (“pseudobulk”) for each identity class. The default is to return a matrix with genes as rows, and identity classes as columns. We have set return.seurat to TRUE, which means rather than a matrix we will get an object of class Seurat. We have also specified which factors to aggregate on, using the group.by argument.

Code
seurat$sample <- seurat$orig.ident

bulk <- AggregateExpression(seurat, return.seurat = T, assays = "RNA", group.by = c("sample",
    column))

Add number of cells per sample per cluster to the metadata

Code
# Number of cells by sample and celltype
n_cells <- seurat@meta.data %>%
    dplyr::count(sample, .data[[column]]) %>%
    dplyr::rename(n_cells = "n")
# n_cells$sample <- str_replace(n_cells$sample, '_', '-')

## extra check if aggregated sample names start with numbers
n_cells$sample <- ifelse(grepl("^\\d", n_cells$sample), paste0("g", n_cells$sample),
    n_cells$sample)


meta_bulk <- bulk@meta.data
meta_bulk$sample <- str_replace_all(meta_bulk$sample, "-", "_")

meta_bulk <- meta_bulk %>%
    left_join(n_cells, by = c("sample", column))

rownames(meta_bulk) <- meta_bulk$orig.ident
bulk@meta.data <- meta_bulk

# Turn column into a factor bulk[[column]] <- as.factor(bulk[[column]]) test
# for consistency

stopifnot(all(Cells(bulk) == row.names(bulk@meta.data)))

bulk@meta.data %>%
    head() %>%
    sanitize_datatable()

DE analysis with DESeq2

Subset to cell type of interest

Code
Idents(object = bulk) <- column

Check that we have enough cells

Before moving on to a pseudobulk DGE analysis, it is important to identify how many cells we aggregated for each sample. We need to make sure that we have enough cells per sample after subsetting to one celltype. We recommend 50 cells per sample to move forward with confidence.

Code
ggplot(bulk@meta.data, aes(x = orig.ident, y = n_cells)) + geom_bar(stat = "identity",
    color = "black", aes(fill = .data[[column]])) + theme_classic() + theme(axis.text.x = element_text(angle = 45,
    vjust = 1, hjust = 1)) + labs(x = "Sample name", y = "Number of cells") + geom_text(aes(label = n_cells),
    vjust = -0.5)

Run DE analysis

Differential gene expression analysis of count data was performed using the Bioconductor R package, DESeq2, which fits the count data to a negative binomial model.

Before fitting the model, we often look at a metric called dispersion, which is a measure for variance which also takes into consideration mean expression. A dispersion value is estimated for each individual gene, then ‘shrunken’ to a more accurate value based on expected variation for a typical gene exhibiting that level of expression. Finally, the shrunken dispersion value is used in the final GLM fit.

We use the below dispersion plot, which should show an inverse relationship between dispersion and mean expression, to get an idea of whether our data is a good fit for the model.

Code
cluster_counts <- t(FetchData(bulk, layer = "counts", vars = rownames(bulk)))
formula <- as.formula(paste0("~ ", " + ", column))


dds_to_use <- DESeqDataSetFromMatrix(cluster_counts, bulk@meta.data, design = formula)
de <- DESeq(dds_to_use)
DESeq2::plotDispEsts(de)

Code
norm_matrix <- as.data.frame(counts(de, normalized = T))

vsd <- vst(de)
vsd_matrix <- assay(vsd)

Because it is difficult to accurately detect and quantify the expression of lowly expressed genes, differences in their expression between treatment conditions can be unduly exaggerated after the model is fit. We correct for this so that gene LFC is not dependent overall on basal gene expression level.

In cases there are multiple groups and conditions across groups is recommended to use dummy variables instead of interaction terms: https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#interactions.

The LRT is useful for testing multiple terms at once, for example testing 3 or more levels of a factor at once, or all interactions between two variables. The LRT for count data is conceptually similar to an analysis of variance (ANOVA) calculation in linear regression, except that in the case of the Negative Binomial GLM, we use an analysis of deviance (ANODEV), where the deviance captures the difference in likelihood between a full and a reduced model.

Code
# NOTE As a note: Use `ashr` for comparisons with many groups to be able to
# pull out all the contrasts; otherwise `apeglm` is fine. It shrinks less.

# NOTE We recommend LRT for time series

# resultsNames(de) # check the order is right
names_to_use <- lapply(contrasts, function(contrast) {
    coef <- paste0(contrast[1], "_", contrast[2], "_vs_", contrast[3])
})
names(contrasts) <- names_to_use
de_list <- lapply(contrasts, function(contrast) {
    resLFC <- results(de, contrast = contrast)
    coef <- paste0(contrast[1], "_", contrast[2], "_vs_", contrast[3])
    # resLFCS <- lfcShrink(de, coef=coef, type='apeglm')
    resLFCS <- lfcShrink(de, contrast = contrast, type = "ash")

    res <- as.data.frame(resLFCS) %>%
        rownames_to_column("gene_id") %>%
        dplyr::rename(lfc = log2FoldChange) %>%
        mutate(pi = abs(lfc) * -log10(padj)) %>%
        arrange(-pi)

    ## Filter out genes that have no expression or were filtered out by DESEQ2
    res <- res[res$baseMean > 0, ] %>%
        drop_na(padj) %>%
        drop_na(pvalue)

    res_sig <- res %>%
        filter(padj < 0.05) %>%
        arrange(padj)
    results <- list(lfc = resLFC, lfcs = resLFCS, all = res, sig = res_sig)
    return(results)
})

# NOTE if you add manually any other comparison to the list with the following
# variables, the code below will make the plots for those as wells:
# de_list=c(de_list, new_comparison=list(lfc=resLFC, lfcs=resLFCS, all=res,
# sig=res_sig))

MA plot

This plot can help to:

  • Identify Differential Expression: Genes that show a significant log-fold change (M value away from 0) indicate changes in expression between conditions.
  • Assess Data Quality: The plot can help in identifying biases or systematic errors in the data. Ideally, most points should scatter around the M=0 line, indicating that there is no significant systematic difference between the conditions.
  • Visualize data dispersion: The distribution of points along the A-axis gives a sense of the spread of expression levels and any patterns or anomalies in the dataset.

Volcano plot

This volcano plot shows the genes that are significantly up- and down-regulated as a result of the analysis comparison. The points highlighted in red are genes that have padj < 0.05 and a log2-fold change > 1. Points in blue have a padj < 0.05 and a log2-fold change < 1 and points in green have a padj > 0.05 and a log2-fold change > 2. Grey points are non-significant. The dashed lines correspond to the cutoff values of log2 foldchance and padj that we have chosen.

Heatmap

Differentially Expressed Genes

Plot top 16 genes - pseudobulk

Plot top 16 genes - single cell

Methods

Seurat (package, paper) is used to aggregate the single cell expression data into pseudobulk samples, and DESeq2 (package, paper) is used to perform statistical analysis.

R session

List and version of tools used for the QC report generation.

Code
sessionInfo()
R version 4.5.1 (2025-06-13)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: UTC
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] ashr_2.2-63                 bcbioR_0.4.5               
 [3] shiny_1.11.1                caTools_1.18.3             
 [5] viridis_0.6.5               viridisLite_0.4.2          
 [7] pheatmap_1.0.13             EnhancedVolcano_1.26.0     
 [9] ggrepel_0.9.6               DEGreport_1.44.0           
[11] DESeq2_1.48.1               SummarizedExperiment_1.38.1
[13] Biobase_2.68.0              MatrixGenerics_1.20.0      
[15] matrixStats_1.5.0           GenomicRanges_1.60.0       
[17] GenomeInfoDb_1.44.1         IRanges_2.42.0             
[19] S4Vectors_0.46.0            BiocGenerics_0.54.0        
[21] generics_0.1.4              lubridate_1.9.4            
[23] forcats_1.0.0               stringr_1.5.1              
[25] dplyr_1.1.4                 purrr_1.1.0                
[27] readr_2.1.5                 tidyr_1.3.1                
[29] tibble_3.3.0                tidyverse_2.0.0            
[31] R.utils_2.13.0              R.oo_1.27.1                
[33] R.methodsS3_1.8.2           grafify_5.0.0.1            
[35] ggprism_1.0.6               clustree_0.5.1             
[37] ggraph_2.2.1                ggplot2_3.5.2              
[39] patchwork_1.3.1             DT_0.33                    
[41] data.table_1.17.8           rmarkdown_2.29             
[43] knitr_1.50                  harmony_1.2.3              
[45] Rcpp_1.1.0                  Seurat_5.3.0               
[47] SeuratObject_5.1.0          sp_2.2-0                   

loaded via a namespace (and not attached):
  [1] fs_1.6.6                    bitops_1.0-9               
  [3] spatstat.sparse_3.1-0       httr_1.4.7                 
  [5] RColorBrewer_1.1-3          doParallel_1.0.17          
  [7] numDeriv_2016.8-1.1         tools_4.5.1                
  [9] sctransform_0.4.2           backports_1.5.0            
 [11] R6_2.6.1                    lazyeval_0.2.2             
 [13] uwot_0.2.3                  mgcv_1.9-3                 
 [15] GetoptLong_1.0.5            withr_3.0.2                
 [17] gridExtra_2.3               progressr_0.15.1           
 [19] cli_3.6.5                   formatR_1.14               
 [21] logging_0.10-108            spatstat.explore_3.5-2     
 [23] fastDummies_1.7.5           labeling_0.4.3             
 [25] sass_0.4.10                 SQUAREM_2021.1             
 [27] mvtnorm_1.3-3               spatstat.data_3.1-6        
 [29] ggridges_0.5.6              pbapply_1.7-4              
 [31] mixsqp_0.3-54               foreign_0.8-90             
 [33] parallelly_1.45.1           invgamma_1.2               
 [35] limma_3.64.1                rstudioapi_0.17.1          
 [37] shape_1.4.6.1               crosstalk_1.2.1            
 [39] ica_1.0-3                   spatstat.random_3.4-1      
 [41] car_3.1-3                   Matrix_1.7-3               
 [43] abind_1.4-8                 lifecycle_1.0.4            
 [45] whisker_0.4.1               yaml_2.3.10                
 [47] edgeR_4.6.3                 carData_3.0-5              
 [49] SparseArray_1.8.1           Rtsne_0.17                 
 [51] grid_4.5.1                  promises_1.3.3             
 [53] crayon_1.5.3                miniUI_0.1.2               
 [55] lattice_0.22-7              cowplot_1.2.0              
 [57] pillar_1.11.0               ComplexHeatmap_2.24.1      
 [59] rjson_0.2.23                boot_1.3-31                
 [61] estimability_1.5.1          future.apply_1.20.0        
 [63] codetools_0.2-20            glue_1.8.0                 
 [65] spatstat.univar_3.1-4       vctrs_0.6.5                
 [67] png_0.1-8                   spam_2.11-1                
 [69] Rdpack_2.6.4                gtable_0.3.6               
 [71] cachem_1.1.0                xfun_0.52                  
 [73] rbibutils_2.3               S4Arrays_1.8.1             
 [75] mime_0.13                   tidygraph_1.3.1            
 [77] ConsensusClusterPlus_1.72.0 coda_0.19-4.1              
 [79] reformulas_0.4.1            survival_3.8-3             
 [81] iterators_1.0.14            statmod_1.5.0              
 [83] fitdistrplus_1.2-4          ROCR_1.0-11                
 [85] nlme_3.1-168                usethis_3.1.0              
 [87] RcppAnnoy_0.0.22            bslib_0.9.0                
 [89] irlba_2.3.5.1               KernSmooth_2.23-26         
 [91] rpart_4.1.24                colorspace_2.1-1           
 [93] Hmisc_5.2-3                 nnet_7.3-20                
 [95] mnormt_2.1.1                tidyselect_1.2.1           
 [97] emmeans_1.11.2              compiler_4.5.1             
 [99] htmlTable_2.4.3             ggdendro_0.2.0             
[101] DelayedArray_0.34.1         plotly_4.11.0              
[103] checkmate_2.3.2             scales_1.4.0               
[105] psych_2.5.6                 lmtest_0.9-40              
[107] digest_0.6.37               goftest_1.2-3              
[109] spatstat.utils_3.1-5        minqa_1.2.8                
[111] XVector_0.48.0              htmltools_0.5.8.1          
[113] pkgconfig_2.0.3             base64enc_0.1-3            
[115] lme4_1.1-37                 fastmap_1.2.0              
[117] GlobalOptions_0.1.2         rlang_1.1.6                
[119] htmlwidgets_1.6.4           UCSC.utils_1.4.0           
[121] jquerylib_0.1.4             farver_2.1.2               
[123] zoo_1.8-14                  jsonlite_2.0.0             
[125] BiocParallel_1.42.1         magrittr_2.0.3             
[127] Formula_1.2-5               GenomeInfoDbData_1.2.14    
[129] dotCall64_1.2               reticulate_1.43.0          
[131] stringi_1.8.7               MASS_7.3-65                
[133] plyr_1.8.9                  parallel_4.5.1             
[135] listenv_0.9.1               deldir_2.0-4               
[137] graphlayouts_1.2.2          splines_4.5.1              
[139] tensor_1.5.1                circlize_0.4.16            
[141] hms_1.1.3                   locfit_1.5-9.12            
[143] igraph_2.1.4                spatstat.geom_3.5-0        
[145] RcppHNSW_0.6.0              reshape2_1.4.4             
[147] evaluate_1.0.4              BiocManager_1.30.26        
[149] nloptr_2.2.1                tzdb_0.5.0                 
[151] foreach_1.5.2               tweenr_2.0.3               
[153] httpuv_1.6.16               RANN_2.6.2                 
[155] polyclip_1.10-7             reshape_0.8.10             
[157] clue_0.3-66                 future_1.58.0              
[159] scattermore_1.2             ggforce_0.5.0              
[161] broom_1.0.9                 xtable_1.8-4               
[163] RSpectra_0.16-2             later_1.4.2                
[165] truncnorm_1.0-9             lmerTest_3.1-3             
[167] memoise_2.0.1               cluster_2.1.8.1            
[169] timechange_0.3.0            globals_0.18.0