Redeconve manual¶
library(Redeconve)
Loading required package: foreach Loading required package: ggplot2 Loading required package: ggrepel Loading required package: igraph Attaching package: 'igraph' The following objects are masked from 'package:stats': decompose, spectrum The following object is masked from 'package:base': union Loading required package: scatterpie
1 Main function¶
This part describes how to use the main function deconvoluting
to perform single-cell deconvolution.
Usage¶
The usage of deconvoluting
is as follows:
nums <- deconvoluting(ref, st,
cellnames = NULL,
genemode, gene.list,
var_thresh = 0.025, exp_thresh = 0.003,
hpmode, hp,
normalize = T,
thre = 1e-10,
dopar = T, ncores,
realtime = F, dir = NULL
)
It does contains many parameters. Next, I will divide these parameters into several parts by function, and explain them one by one.
1.1 Necessary data for deconvolution¶
ref
and st
are the core data used for deconvolution. They are required as follows:
ref
, the scRNA-seq data served as reference for deconvolution. It is aMatrix
(ordgCMatrix
) of unprocessed count-level scRNA-seq data. One row represents one gene and one column represents one cell.st
, the spatial transcriptomics to be deconvoluted. It is aMatrix
(ordgCMatrix
) of unprocessed spatial transcriptomics, represents raw counts in each spot. One row represents one gene and one column represents one spot.
1.2 Mode of gene selection¶
genemode
, gene.list
, var_thresh
and exp_thresh
are about how to deal with genes. genemode
determines the mode of handling genes and gene.list
, var_thresh
, exp_thresh
are associated with specific modes. Redeconve offers 3 alternative modes of dealing with genes:
default
: Use the intersection of genes inref
andst
, without other treatment.customized
: Indicating gene list yourself. Parametergene.list
is the list of genes you indicated. Note that only those genes within the intersection ofref
andst
would be used.filtered
: We will use a built-in functiongene.filter
to screen some genes. This function will first take the intersection ofref
andst
, the use two indices,var_thresh
andexp_thresh
to filter genes. You can customize these two parameters as well.
var_thresh
considers variance of reference. Genes whose variance across all cells in reference do not reach that threshold will be filtered out. The default value is 0.025.exp_thresh
considers expression in spatial transcriptomics. Genes whose average count across all spots in spatial transcriptomics is less that this value will be filtered out. The default value is 0.003.
1.3 Mode of determining hyperparameter¶
The hyperparamter is our key to single-cell resolution (See Methods for details). Here we still offers 3 modes to determine the hyperparameter:
default
: We will calculate a hyperparameter according to the number of genes and cells in reference (See Methods for details).customized
: Indicating the hyperparameter yourself.autoselection
: Redeconve will use a procedure to select the optimal hyperparameter. In this procedure, a series of hyperparameter will be set in the vicinity of the hyperparameter selected by modedefault
, and Redeconve will use these hyperparameters to perform deconvolution separately, then return the result with the best hyperparameter. You can see Methods for details about how we determine the best hyperparameter.
Note that in this procedure, several rounds of deconvolution will be performed, so it may take a long time. Under such circumstances, parallel computing will be beneficial.
1.4 Parallel computing¶
Sometimes the reference will contain tens of thousands of cells, or the spatial transcriptomics will contain tens of thousands of spots (e.g. when the data is from Slide-seq), then parallel computing is useful. Redeconve uses the package "doSNOW" to achieve parallel computing (which means there is a progress bar). Related parameters are dopar
and ncores
.
dopar
determines whether to use parallel computing or not.ncores
indicates the number of cores to be used in parallel computing. It's recommended to manually set this parameter rather than use the functiondetectCores
to avoid underlying errors.
! Important tips for parallel computing: !
- Our underlying algorithm makes use of OpenBLAS, which may include parallel computing inside. Therefore, setting
system("export OPENBLAS_NUM_THREADS=1")
is necessary to avoid underlying errors. - An error may be reported when the number of threads is too large :
Error in socketAccept(socket = socket, blocking = TRUE, open = "a+b",: all connections are in use
. If such error occurs, please reduce the number of cores.
1.5 Writing real-time results¶
Even with parallel computing, some dataset is still time-consuming. Redeconve is able to write results into disk in real time at the cost of some running speed. Related parameters are realtime
and dir
.
realtime
determines whethers to write the results into disk in real time or not.dir
indicates the directory to write the results.
For real time results, the result of each spot will be write into a separate csv file, whose name is the barcode of the spot.
1.6 Other parameters¶
The left parameters are cellnames
, normalize
and thre
.
cellnames
: Chances are that you may not want to use all cells in reference to run deconvolution. Then you can indicate which cells will be used by this parameter. If you do not specify this parameter, all cells will be used.normalize
: Normalization is recommended because the scale of scRNA-seq and spatial transcriptomics data are different. This step is also recommended if you are doing bulk RNA-seq deconvolution.thre
: The estimated cell abundance will not be exactly 0. This parameter indicates that the abundance less than this value will be treated as 0. Generally this value does not need to be adjusted, and the result will remain the same within a relatively big range of this value.
A demo¶
Next we will use a demo to give an example of how to use this function.
## load the data
data(basic)
promises::promise_resolve(sc)
promises::promise_resolve(st)
## check the dimensions of sc and st
dim(sc)
dim(st)
<Promise [fulfilled: dgCMatrix]>
<Promise [fulfilled: dgCMatrix]>
- 19736
- 500
- 19736
- 428
## check the number cells in each cell type
table(annotations[, 2])
Acinar cells Cancer clone A 3 33 Cancer clone B Ductal - APOL1 high/hypoxic 44 56 Ductal - CRISP3 high/centroacinar like Ductal - MHC Class II 137 75 Ductal - terminal ductal like Endocrine cells 91 1 Endothelial cells Fibroblasts 3 1 Macrophages A Macrophages B 5 5 Mast cells mDCs A 4 3 mDCs B Monocytes 9 5 pDCs RBCs 3 4 T cells & NK cells Tuft cells 10 8
## deconvolution
# this may take a long time
res <- deconvoluting(sc, st, genemode = "def", hpmode = "def", dopar = T, ncores = 8)
[1] "Number of selected genes: 19736" [1] "Calculating correlation matrix ..." ---> Splitting data matrix: 99 splits of 19736x5 size ---> Splitting data matrix: 1 split of 19736x5 size [1] "Set hyperparameter as: 7894" [1] "Calculating Hessian matrix ..." [1] "Running deconvolution ..." |======================================================================| 100%
sc
andst
are separately reference and spatial transcriptomics.genemode
is set to"default"
to guarantee accuracy.hpmode
is set to"default"
to improve efficiency.dopar
is set toTRUE
(default value) andncores
is set to 8. You can raise the number of cores to improve efficiency.- For this dataset is not very large,
realtime
is set toFALSE
(default value). - We want to use all cells in deconvolution, so we do not need to specify
cellnames
. Also, we do not need to adjustthre
.
2 Dealing with large scRNA datasets¶
Nowadays the scale of scRNA datasets are getting increasingly larger. Sometimes the reference cells can reach to sub-million or million level. Although Redeconve possesses high efficiency, directly using so much cells as reference is both time consuming and low yield. Here we provide two ways to deal with such situation:
2.1 Cell-type deconvolution¶
Like other methods, Redeconve can also perform deconvolution at cell-type level. This part shows how to do so.
## get reference
ref <- get.ref(sc, annotations, dopar = F)
## deconvolution
res.ct <- deconvoluting(ref, st, genemode = "def", hpmode = "auto", dopar = T, ncores = 8)
[1] "Number of selected genes: 19736" [1] "Calculating correlation matrix ..." ---> Splitting data matrix: 9 splits of 19736x2 size ---> Splitting data matrix: 1 split of 19736x2 size [1] "iteration 1: hp = 49340" [1] "Calculating Hessian matrix ..." [1] "Running deconvolution ..." |======================================================================| 100% [1] "iteration 2: hp = 493400" [1] "Calculating Hessian matrix ..." [1] "Running deconvolution ..." |======================================================================| 100% [1] "iteration 3: hp = 4934000" [1] "Calculating Hessian matrix ..." [1] "Running deconvolution ..." |======================================================================| 100% [1] "iteration 4: hp = 49340000" [1] "Calculating Hessian matrix ..." [1] "Running deconvolution ..." |======================================================================| 100% [1] "iteration 5: hp = 493400000" [1] "Calculating Hessian matrix ..." [1] "Running deconvolution ..." |======================================================================| 100% [1] "iteration 6: hp = 4.934e+09" [1] "Calculating Hessian matrix ..." [1] "Running deconvolution ..." |======================================================================| 100% [1] "Select the hyperparameter as: 493400"
You can see that actually, only one more step is required to convert single-cell expression profile to that of cell type. The function get.ref
will take the average expression of all cells in one cell type as the profile of that cell type. Note that get.ref
also has parameters dopar
and ncores
, with which you can perform parallel computing when the scale of sc
is too large.
For the main function, every thing is the same except that we set hpmode
as "autoselection"
, for there are only tens of cell types, then the speed is fast enough for us to run several rounds of deconvolution.
2.2 Sampling¶
Down-sampling is another way to deal with large scRNA datasets. Function cell.sampling
is designed for this.
idx <- cell.sampling(ncells = 500, annotations, size = 200, prot = T)
sc.ds <- sc[, idx[, 1]]
[1] "cells actually taken: 201"
This function returns a 2-columned matrix showing the barcodes and annotations of selected cells. The first parameter ncells
is the original number of cells in scRNA dataset and the third size
is the number of cells you want. The fourth parameter prot
is to guarantee that at least one cell of each cell type in annotations
is selected when set to TRUE
.
res.ds <- deconvoluting(sc.ds, st, genemode = "def", hpmode = "def", dopar = T, ncores = 8)
[1] "Number of selected genes: 19736" [1] "Calculating correlation matrix ..." ---> Splitting data matrix: 99 splits of 19736x2 size ---> Splitting data matrix: 1 split of 19736x3 size [1] "Set hyperparameter as: 48850" [1] "Calculating Hessian matrix ..." [1] "Running deconvolution ..." |======================================================================| 100%
3 Seurat interface¶
This package provided two functions to extract information from Seurat object to satisfy Redeconve's input:
extract.seurat.sc
extracts expression and annotations (withSeurat::Idents
) to a list. When using this function, make sure you have set the desired ident to "active".extract.seurat.sc
extracts expression and spot coordinates (withSeurat::GetTissueCoordinates
) to a list.
# install.packages("SeuratData")
library(Seurat)
library(SeuratData)
## installing data
# InstallData("pbmc3k")
# InstallData("stxBrain")
## sc
pbmc3k <- UpdateSeuratObject(pbmc3k)
sclist <- extract.seurat.sc(pbmc3k)
## st
stxBrain <- LoadData("stxBrain", type = "anterior1")
stlist <- extract.seurat.st(stxBrain)
## just for example
# ref = get.ref(sclist[["expr"]],sclist[["annotations"]],dopar=F)
# res = deconvoluting(ref,stlist[["expr"],genemode="filt",hpmode="auto",dopar=T,ncores=8)
Loading required package: SeuratObject Loading required package: sp Loading required package: sp Attaching package: 'sp' The following object is masked from 'package:scatterpie': recenter 'SeuratObject' was built under R 4.3.3 but the current version is 4.4.1; it is recomended that you reinstall 'SeuratObject' as the ABI for R may have changed 'SeuratObject' was built with package 'Matrix' 1.6.5 but the current version is 1.7.0; it is recomended that you reinstall 'SeuratObject' as the ABI for 'Matrix' may have changed Attaching package: 'SeuratObject' The following objects are masked from 'package:base': %||%, intersect, t Attaching package: 'Seurat' The following object is masked from 'package:igraph': components The following object is masked from 'package:base': %||% ── Installed datasets ──────────────────────────────── SeuratData v0.2.2.9001 ── ✔ pbmc3k 3.1.4 ✔ stxBrain 0.1.1 ────────────────────────────────────── Key ───────────────────────────────────── ✔ Dataset loaded successfully ❯ Dataset built with a newer version of Seurat than installed ❓ Unknown version of Seurat installed Validating object structure Updating object slots Ensuring keys are in the proper structure Warning message: "Assay RNA changing from Assay to Assay" Ensuring keys are in the proper structure Ensuring feature names don't have underscores or pipes Updating slots in RNA Validating object structure for Assay 'RNA' Object representation is consistent with the most current Seurat version Validating object structure Updating object slots Ensuring keys are in the proper structure Warning message: "Assay Spatial changing from Assay to Assay" Ensuring keys are in the proper structure Ensuring feature names don't have underscores or pipes Updating slots in Spatial Updating slots in anterior1 Validating object structure for Assay 'Spatial' Validating object structure for VisiumV1 'anterior1' Object representation is consistent with the most current Seurat version
4 Downstream analysis and visualization¶
Redeconve offers many built-in functions for downstream analysis and visualization.
4.1 Gaining interpretability¶
For we included a normalization procedure, the result of deconvoluting
has only relative significance. Here are some functions related:
4.1.1 to.proportion
: This function converts the result to proportion, i.e., the sum of cell abundance per spot is 1. This is convenient for visualization.
res.prop <- to.proportion(res)
4.1.2 to.absolute.abundance
: This function converts the result to absolute cell abundance with some priori knowledge. Parameter aver.cell
is required to estimate the absolute abundance of each cell state. This value indicates the average number of cells in one spot. Users can determine it according to the platform of st data. For example, 10x Visium has a spot radius of about 50$\mu$m, so this value for a 10x Visium dataset is about 10.
res.ab <- to.absolute.abundance(res, aver.cell = 10)
4.1.3 sc2type
: This function merges single-cell result to cell-type result according to annotations. If you used get.ref
and got the cell-type level result as res.ct
, then there is no need to use this function. Note that this function itself does not provide interpretability.
res.ctmerge <- sc2type(res, annotations)
## If you have downsampled the data:
# res.ctmerge = sc2type(res.ds,annotations = annotations[idx,])
4.2 Cell occurrence¶
The function cell.occur
shows the number of cells actually used in deconvolution, and the number of spots that every used cell occurs.
cell.occur(res)
[1] "196 in 500 cells are used."
4.3 Find cells of interest¶
Redeconve offers a function to plot mean vs sd, and highlighting those outliers. Users can then explore those cells. Note that this function can only be run after cell.occur
, because it needs occurred.cells
as input, which is an output of cell.occur
.
show.cellsofinterest(res, occurred.cells)
[1] "Cancer.clone.A.117" [2] "Acinar.cells.10" [3] "Ductal...CRISP3.high.centroacinar.like.62" [4] "Macrophages.A" [5] "Fibroblasts.1" [6] "Ductal...CRISP3.high.centroacinar.like.311" [7] "Ductal...CRISP3.high.centroacinar.like.36" [8] "Ductal...CRISP3.high.centroacinar.like.58" [9] "Cancer.clone.A.115" [10] "Ductal...CRISP3.high.centroacinar.like.271"
4.4 Spatial distributiom of all or some specific cells¶
The function spatial.cell.number
can plot the abundance of all cells (or cell types) or some cells you indicate.
spatial.cell.number(res.ctmerge, cell.names = "T cells & NK cells", coords = coords, size = 4, pdfout = F)
[[1]]
Here we plotted the abundance of T cells & NK cells. When you want to plot the distribution of more than one cells, to set pdfout
as TRUE
is recommended.
4.5 Spatial pie chart¶
The function spatial.pie
can plot the spatial pie chart, showing the proportion of each cell type in each spatial spot.
# For this is a demo, we use default colors. See function documentation for details.
spatial.piechart(res.ctmerge, coords)
4.6 Spot pie chart¶
The function spot.pie
can plot the proportion of each cell type within a certain spot.
spot.pie(res[, "X27x18"], title = "X27x18")
4.7 Co-localization¶
Function coloc.network
infers a co-localization network from correlation of abundance of cells or cell types.
corr <- cor(t(res.ctmerge), method = "pearson")
corr[is.na(corr)] <- 0
celltypes <- levels(as.factor(annotations[, 2]))
g <- coloc.network(corr, thresh = 0.3, cell.type = T, annotations = celltypes, ntypes = length(celltypes))
Warning message in cor(t(res.ctmerge), method = "pearson"): "the standard deviation is zero"
The parameter thresh
means that only correlation coefficient larger than this value is regarded as an edge.
# plotting with igraph
plot(g,
vertex.size = 20,
vertex.label.cex = 0.7,
layout = layout_in_circle,
edge.width = E(g)$weight * 5,
edge.label = round(-E(g)$weight, 2),
edge.label.cex = 0.8,
edge.label.color = "red",
vertex.label = V(g)$name,
main = "Co-localization network"
)
Then, we can extract subplots containing a certain cell-of-interest and all other cells that co-localize with it by function extract.subgraph
.
subg <- extract.subgraph(g, "Macrophages A")
Warning message: "'from' is deprecated. Use '.from' instead. See help("Deprecated")"