Spatial transcriptomic data with the Visium (HD) platform is in many ways similar to scRNAseq data. The data is represented per spot/bin on the slide, as we have spatial barcode but no cellular barcodes.
For Visium data, each spot contains UMI counts for 5-20 cells instead of single cells, but is still quite sparse in the same way as scRNAseq data is, but with the additional information about spatial location in the tissue.
For Visium HD, the slide contain two 6.5 x 6.5 mm Capture Areas with a continuous lawn of oligonucleotides arrayed in millions of 2 x 2 µm barcoded squares without gaps, achieving single cell–scale spatial resolution. The data is output at 2 µm, as well as multiple bin sizes. You can choose the bin resolution for downstream visualization and analysis.
The term spot(s)/bin(s) are used throughout this tutorial which corresponds to two technology Visium and Visium HD.
The main objective of quality control is to filter the data so that we include only data from spots/bins that are of high quality. This makes it so that when we cluster our spots/bins, it is easier to identify distinct cell type populations.
In spatial transcriptomic data, the main challenge is in delineating spots/bins that are poor quality from spots/bins containing reads from less complex cells. If you expect a particular cell type in your dataset to be less transcriptionally active as compared other cell types in your dataset, the spots/bins underneath this cell type will naturally have fewer detected genes and transcripts. However, having fewer detected genes and transcripts can also be a technical artifact and not a result of biological signal.
Various metrics can be used to filter low-quality cells from high-quality ones, including:
UMI counts per spot/bin - This is the number of unique transcripts detected per spot/bin. Because the spot/bin are very small, this number is less than what we would expect for non-spatial scRNAseq data.
Genes detected per spot/bin - This is the number of unique genes detected per spot/bin. Again, because the spots/bins are very small, this number is less than what we would expect for non-spatial scRNAseq data.
Complexity (novelty score) - The novelty score is computed by taking the ratio of nGenes over nUMI. If there are many captured transcripts (high nUMI) and a low number of genes detected in a spot, this likely means that you only captured a low number of genes and simply sequenced transcripts from those lower number of genes over and over again. These low complexity (low novelty) spots/bins could represent a specific cell type (i.e. red blood cells which lack a typical transcriptome), or could be due to an artifact or contamination. Generally, we expect the novelty score to be above 0.80 for good quality spots/bins.
Mitochondrial counts ratio - This metric can identify whether there is a large amount of mitochondrial contamination from dead or dying cells. We define poor quality samples for mitochondrial counts as spots/bins which surpass the 0.2 (20%) mitochondrial ratio mark, unless of course you are expecting this in your sample.
Hemoglobin counts ratio - This metric can identify whether there is a large amount of hemoglobin gene contamination from blood. We define poor quality samples for hemoglobin counts as spots/bins which surpass the 0.2 (20%) hemoglobin ratio mark, unless of course you are expecting this in your sample.
Code
# This set up the working directory to this file so all files can be found# library(rstudioapi)# setwd(fs::path_dir(getSourceEditorContext()$path))stopifnot(R.version$major>=4) # requires R4if (compareVersion(R.version$minor,"3.1")<0) warning("We recommend >= R4.3.1") stopifnot(compareVersion(as.character(BiocManager::version()), "3.16")>=0)stopifnot(compareVersion(as.character(packageVersion("Seurat")), "5.1")>=0)
2 Project details
Project: name_hbcXXXXX
PI: person name
Analyst: person in the core
Experiment: short description
Aim: short description
Code
# Metrics like `nCount` and `nfeature` are named with the suffix of default assay name, to make the variable usage more generalizable, we removed the suffix by pulling out the default assay of the visium `seurat` object.visium <-qs_read(visiumHD_obj)visium <-PercentageFeatureSet(visium, "^mt-", col.name ="percent_mito")visium <-PercentageFeatureSet(visium, "^Hb.*-", col.name ="percent_hb")metaD <- visium@meta.datametaD$log10GenesPerUMI <-log10(metaD$nFeature)/log10(metaD$nCount)colnames(metaD)%<>%gsub(pattern=glue("_{DefaultAssay(visium)}"),replacement="")
Let’s take a quick look at the data and make a decision on whether we need to apply any filtering.
3 Quality control per spot/bin
3.1 Number of UMIs and genes detected per spot/bin
Those two metrics is really dependent on tissue type, RNA quality, and sequencing depth. Since the test data is generated from Visium HD technology, we use bin and corresponding reference thresholds in the plot. Reference line at 100 is plotted as the suggested cut-offs for both metrics.
3.2 Overall complexity of transcriptional profile per spot/bin
We can evaluate each spot/bin in terms of how complex the RNA species are by using a measure called the novelty score. The novelty score is computed by taking the ratio of nGenes over nUMI. If there are many captured transcripts (high nUMI) and a low number of genes detected in a cell, this likely means that you only captured a low number of genes and simply sequenced transcripts from those lower number of genes over and over again.
With scRNA-seq this is more easily interpreted for a single cell, but for spatial data this would give us complexity of the spot, which is across multiple cells.
Now, it is time to choose some cut-offs for QC metrics mentioned above and removing low-quality cells, as well as mitochondria, hemoglobin genes from the feature space and we can take a quick look at what are our top 20 expressed genes.
Code
GeneVar <-glue('nFeature_{DefaultAssay(visium)}')UMIVar <-glue('nCount_{DefaultAssay(visium)}')cutoffs <-list("nFeature"=100,"nCount"=100,"hb"=20,"mito"=20)Qced <- visium@meta.data[,GeneVar] > cutoffs$nFeature & visium@meta.data[,UMIVar] > cutoffs$nCount & visium$percent_hb < cutoffs$hb & visium$percent_mito < cutoffs$mitovisium <- visium[,Qced]# Filter Mitocondrialvisium <- visium[!grepl("^mt-", rownames(visium)), ]# Filter Hemoglobin gene (optional if that is a problem on your data)visium <- visium[!grepl("^Hb.*-", rownames(visium)), ]C <-GetAssayData(visium, slot ="counts")C@x <- C@x /rep.int(colSums(C), diff(C@p))most_expressed <-order(Matrix::rowSums(C), decreasing = T)[20:1]exprD <-as.data.frame(t(C[most_expressed, ])) %>% tibble::rownames_to_column("bin") %>% tidyr::gather(gene,expr,-bin)ggplot(exprD,aes(x=gene,y=expr,color=gene,fill=gene))+geom_violin(position=position_dodge(1),alpha=0.5,na.rm=TRUE,trim=FALSE)+geom_boxplot(width=0.1,outliers = F,color="black")+theme_minimal()+theme(axis.text.x =element_text(size=rel(1),face="bold"),plot.title =element_text(hjust =0.5),legend.position ="none" )+scale_y_log10(breaks=c(0.001,.01,.1,1),labels=c(0.1,1,10,100))+labs(x="",y="% of total UMIs/bin \n (log10 scaled)")+coord_flip()
We saved your qc-filled Seurat object in ./results/01_qc.qs.
6 Methods
6.1 Citation
Code
citation("Seurat")
To cite Seurat in publications, please use:
Hao et al. Dictionary learning for integrative, multimodal and
scalable single-cell analysis. Nature Biotechnology (2023) [Seurat
V5]
Hao and Hao et al. Integrated analysis of multimodal single-cell
data. Cell (2021) [Seurat V4]
Stuart and Butler et al. Comprehensive Integration of Single-Cell
Data. Cell (2019) [Seurat V3]
Butler et al. Integrating single-cell transcriptomic data across
different conditions, technologies, and species. Nat Biotechnol
(2018) [Seurat V2]
Satija and Farrell et al. Spatial reconstruction of single-cell gene
expression data. Nat Biotechnol (2015) [Seurat V1]
To see these entries in BibTeX format, use 'print(<citation>,
bibtex=TRUE)', 'toBibtex(.)', or set
'options(citation.bibtex.max=999)'.
---title: "Visium QC"author: "Harvard Chan Bioinformatics Core"date: "`r Sys.Date()`"format: html: code-fold: true code-tools: true code-overflow: wrap df-print: paged highlight-style: pygments number-sections: true self-contained: true theme: default toc: true toc-location: left toc-expand: false lightbox: true page-layout: fullparams: project_file: ../information.R results_dir: ./results visiumHD_obj: "../visium.qs"editor_options: chunk_output_type: console---This code is in this  revision.## Visium DescriptionThis report was adapted from [this tutorial](https://nbisweden.github.io/workshop-scRNAseq/labs/compiled/seurat/seurat_08_spatial.html). Spatial transcriptomic data with the Visium (HD) platform is in many ways similar to scRNAseq data. The data is represented per spot/bin on the slide, as we have spatial barcode but no cellular barcodes. For Visium data, each spot contains UMI counts for 5-20 cells instead of single cells, but is still quite sparse in the same way as scRNAseq data is, but with the additional information about spatial location in the tissue.For Visium HD, the slide contain two 6.5 x 6.5 mm Capture Areas with a continuous lawn of oligonucleotides arrayed in millions of 2 x 2 µm barcoded squares without gaps, achieving **single cell–scale spatial resolution**. The data is output at 2 µm, as well as multiple bin sizes. You can choose the bin resolution for downstream visualization and analysis. The term **spot(s)/bin(s)** are used throughout this tutorial which corresponds to two technology Visium and Visium HD.The main objective of quality control is to filter the data so that we include only data from spots/bins that are of high quality. This makes it so that when we cluster our spots/bins, it is easier to identify distinct cell type populations.In spatial transcriptomic data, the main challenge is in **delineating spots/bins that are poor quality from spots/bins containing reads from less complex cells**. If you expect a particular cell type in your dataset to be less transcriptionally active as compared other cell types in your dataset, the spots/bins underneath this cell type will naturally have fewer detected genes and transcripts. However, having fewer detected genes and transcripts can also be a technical artifact and not a result of biological signal. Various metrics can be used to filter low-quality cells from high-quality ones, including:- **UMI counts per spot/bin** - This is the number of unique transcripts detected per spot/bin. Because the spot/bin are very small, this number is less than what we would expect for non-spatial scRNAseq data.- **Genes detected per spot/bin** - This is the number of unique genes detected per spot/bin. Again, because the spots/bins are very small, this number is less than what we would expect for non-spatial scRNAseq data.- **Complexity (novelty score)** - The novelty score is computed by taking the ratio of nGenes over nUMI. If there are many captured transcripts (high nUMI) and a low number of genes detected in a spot, this likely means that you only captured a low number of genes and simply sequenced transcripts from those lower number of genes over and over again. These low complexity (low novelty) spots/bins could represent a specific cell type (i.e. red blood cells which lack a typical transcriptome), or could be due to an artifact or contamination. Generally, we expect the novelty score to be above 0.80 for good quality spots/bins.- **Mitochondrial counts ratio** - This metric can identify whether there is a large amount of mitochondrial contamination from dead or dying cells. We define poor quality samples for mitochondrial counts as spots/bins which surpass the 0.2 (20%) mitochondrial ratio mark, unless of course you are expecting this in your sample.- **Hemoglobin counts ratio** - This metric can identify whether there is a large amount of hemoglobin gene contamination from blood. We define poor quality samples for hemoglobin counts as spots/bins which surpass the 0.2 (20%) hemoglobin ratio mark, unless of course you are expecting this in your sample.```{r, cache = FALSE, message = FALSE, warning=FALSE}# This set up the working directory to this file so all files can be found# library(rstudioapi)# setwd(fs::path_dir(getSourceEditorContext()$path))stopifnot(R.version$major>=4) # requires R4if (compareVersion(R.version$minor,"3.1")<0) warning("We recommend >= R4.3.1") stopifnot(compareVersion(as.character(BiocManager::version()), "3.16")>=0)stopifnot(compareVersion(as.character(packageVersion("Seurat")), "5.1")>=0)``````{r load_libraries, cache = FALSE, message = FALSE, warning=FALSE, echo=FALSE,}library(knitr)library(import)library(glue)library(qs2)library(dplyr)library(purrr)library(ggplot2)library(ggprism)library(grafify)library(ggpubr)library(gridExtra)library(scales)library(Seurat)import::from(magrittr, set_colnames, set_rownames, "%<>%")invisible(list2env(params,environment()))source(project_file)ggplot2::theme_set(theme_prism(base_size =12))# https://grafify-vignettes.netlify.app/colour_palettes.html# NOTE change colors here if you wishscale_colour_discrete <-function(...)scale_colour_manual(..., values =as.vector(grafify:::graf_palettes[["kelly"]]))scale_fill_discrete <-function(...)scale_fill_manual(..., values =as.vector(grafify:::graf_palettes[["kelly"]]))opts_chunk[["set"]](cache = F,cache.lazy =FALSE,dev =c("png", "pdf"),error =TRUE,highlight =TRUE,message =FALSE,prompt =FALSE,tidy =FALSE,warning =FALSE,echo = T, fig.height =4)# set seed for reproducibilityset.seed(1234567890L)```## Project details- Project: `r project`- PI: `r PI`- Analyst: `r analyst`- Experiment: `r experiment`- Aim: `r aim````{r metric-calc}# Metrics like `nCount` and `nfeature` are named with the suffix of default assay name, to make the variable usage more generalizable, we removed the suffix by pulling out the default assay of the visium `seurat` object.visium <-qs_read(visiumHD_obj)visium <-PercentageFeatureSet(visium, "^mt-", col.name ="percent_mito")visium <-PercentageFeatureSet(visium, "^Hb.*-", col.name ="percent_hb")metaD <- visium@meta.datametaD$log10GenesPerUMI <-log10(metaD$nFeature)/log10(metaD$nCount)colnames(metaD)%<>%gsub(pattern=glue("_{DefaultAssay(visium)}"),replacement="")```Let's take a quick look at the data and make a decision on whether we need to apply any filtering.## Quality control per spot/bin{.tabset}### Number of UMIs and genes detected per spot/binThose two metrics is really dependent on tissue type, RNA quality, and sequencing depth. Since the test data is generated from Visium HD technology, we use bin and corresponding reference thresholds in the plot. Reference line at 100 is plotted as the suggested cut-offs for both metrics. ```{r}summary_metaD <-apply(metaD[,-1],2,mean)metacol_label <-list("nFeature"="Genes","nCount"="UMI")refs <-list("nFeature"=100,"nCount"=100)dists_before <-imap(metacol_label,\(label,col)ggdensity(metaD,x = col,xscale="log10",add ="mean", rug =TRUE,alpha =0.2,fill ="lightgray",xlab=glue("Number of {label} per bin(in log10 scale)"),ylab="Cell density",title=glue('Pre-QC {label}/Bin'))+geom_vline(xintercept = refs[[col]],color="darkred",cex=rel(1.3),linetype="dashed")+annotate("text",x=summary_metaD[col],y =Inf,label =glue("Mean \n = {round(summary_metaD[col],0)}"),vjust =1,hjust=2))dists_before[[1]] | dists_before[[2]]```### Overall complexity of transcriptional profile per spot/binWe can evaluate each spot/bin in terms of how complex the RNA species are by using a measure called the novelty score. The novelty score is computed by taking the ratio of nGenes over nUMI. If there are many captured transcripts (high nUMI) and a low number of genes detected in a cell, this likely means that you only captured a low number of genes and simply sequenced transcripts from those lower number of genes over and over again.With scRNA-seq this is more easily interpreted for a single cell, but for spatial data this would give us complexity of the spot, which is across multiple cells.```{r complexity}col <-"log10GenesPerUMI"ggdensity(metaD,x = col,add ="mean", rug =TRUE,alpha =0.2,fill ="lightgray",xlab="complexity",ylab="Cell density",title=glue('Novelty score'))+geom_vline(xintercept =0.8,color="darkred",cex=rel(1.3),linetype="dashed")+annotate("text",x=summary_metaD[col],y =Inf,label =glue("Mean = {round(summary_metaD[col],0)}"),vjust =1,hjust=2)+theme(plot.title =element_text(hjust=0.5, face="bold"))```### mitochondria & hemoglospot/bingene ratios```{r}ggplot(metaD %>%select(orig.ident,starts_with("percent_")) %>% tidyr::gather(class,percent_unexpected,-orig.ident), aes_string(x ="orig.ident", y ="percent_unexpected")) +geom_violin(position=position_dodge(1),alpha=1, na.rm=TRUE,trim=FALSE)+ ggbeeswarm::geom_quasirandom(na.rm=TRUE,dodge.width=0.5,method='quasirandom',alpha=0.01)+geom_boxplot(width=0.1,outliers = F)+geom_hline(yintercept=20)+facet_grid(~class,scales ="free")+theme(axis.text.x =element_text(size=rel(1),face="bold"),plot.title =element_text(hjust =0.5),strip.text.x =element_text(size =rel(1.5), colour ="black"),legend.position ="none" )+scale_y_log10(breaks=c(1,5,10,20,100))+# ylim(c(0,100))+labs(x="",y="% of contamination genes")```## QC metrics visualized on slides{.tabset}Here, we can look at all the QC metrics we discussed above on the individual tissue slide.```{r}features2check <-c(glue('nCount_{DefaultAssay(visium)}'),glue('nFeature_{DefaultAssay(visium)}'),"percent_mito","percent_hb")``````{r spatial-plot,fig.height=5,fig.width=5,eval=T,results='asis'}for(f in features2check){cat("### ", f, "\n\n") p1 <-SpatialFeaturePlot(visium, feature = f,pt.size.factor =4)print(p1)cat("\n\n")}```## Top expressed genesNow, it is time to choose some cut-offs for QC metrics mentioned above and removing low-quality cells, as well as mitochondria, hemoglobin genes from the feature space and we can take a quick look at what are our top 20 expressed genes.```{r filtering,fig.height=7,fig.width=7}GeneVar <-glue('nFeature_{DefaultAssay(visium)}')UMIVar <-glue('nCount_{DefaultAssay(visium)}')cutoffs <-list("nFeature"=100,"nCount"=100,"hb"=20,"mito"=20)Qced <- visium@meta.data[,GeneVar] > cutoffs$nFeature & visium@meta.data[,UMIVar] > cutoffs$nCount & visium$percent_hb < cutoffs$hb & visium$percent_mito < cutoffs$mitovisium <- visium[,Qced]# Filter Mitocondrialvisium <- visium[!grepl("^mt-", rownames(visium)), ]# Filter Hemoglobin gene (optional if that is a problem on your data)visium <- visium[!grepl("^Hb.*-", rownames(visium)), ]C <-GetAssayData(visium, slot ="counts")C@x <- C@x /rep.int(colSums(C), diff(C@p))most_expressed <-order(Matrix::rowSums(C), decreasing = T)[20:1]exprD <-as.data.frame(t(C[most_expressed, ])) %>% tibble::rownames_to_column("bin") %>% tidyr::gather(gene,expr,-bin)ggplot(exprD,aes(x=gene,y=expr,color=gene,fill=gene))+geom_violin(position=position_dodge(1),alpha=0.5,na.rm=TRUE,trim=FALSE)+geom_boxplot(width=0.1,outliers = F,color="black")+theme_minimal()+theme(axis.text.x =element_text(size=rel(1),face="bold"),plot.title =element_text(hjust =0.5),legend.position ="none" )+scale_y_log10(breaks=c(0.001,.01,.1,1),labels=c(0.1,1,10,100))+labs(x="",y="% of total UMIs/bin \n (log10 scaled)")+coord_flip()``````{r save_seurat}if(!dir.exists(results_dir)){system(glue("mkdir -p {results_dir}"))}qs_save(visium, file.path(results_dir, "01_qc.qs"))outputPath =file.path(results_dir, "01_qc.qs")```We saved your qc-filled Seurat object in **`r outputPath`**.## Methods### Citation```{r citations}citation("Seurat")```### Session Information```{r}sessionInfo()```