MAGICAL (Multiome Accessible Gene Integration Calling And Looping) is a hierarchical Bayesian approach that leverages paired scRNA-seq and scATAC-seq data from different conditions to map disease-associated transcription factors, chromatin sites, and genes as regulatory circuits. By simultaneously modeling signal variation across cells and conditions in both omics data types, MAGICAL achieved high accuracy on circuit inference.
MAGICAL only requires gene symbols, peak coordinates, read count and cell meta information (including cell type, sample/subject ID and sample group/condition). These information are very fundamental and can be easily obtained from any single cell multioimc dataset. We provide a script Multiomics_input_for_MAGICAL.R to show how to prepare the necessary input files from the single cell multiomics data for use with MAGICAL. The script includes code to extract needed information from Seurat processed scRNA-seq data and ArchR or Signac processed scATAC-seq data.
To run this demo, please download the demo data and put them into the folder ’Demo input files” in current working directory.
MAGICAL infers regulatory circuits for each cell type. Therefore, the input scRNA-seq and scATAC-seq data should be cell type labelled. Users would need to select one cell type and use the provided R script to prepare the following input files for MAGICAL.
As differential calling is usually done seperately during the scRNA-seq and scATAC-seq data processing, we highly recommand preparing these two files using similar differential statistics cutoffs.
gene symbols
chr
, point1
, and point2
The scRNA-seq read count information from cells labelled to the selected cell type.
gene index
, cell index
, and
RNA read count
gene index
and gene name
.cell index
, cell barcode
,
cell type label
, sample/subject ID
, and
condition
Note, each sample must have a unique name and this name should be the same in the scATAC-seq data to allow MAGICAL to pair the two modalities together.
The scATAC-seq read count information from cells labelled to the selected cell type.
peak index
, cell index
, and
ATAC read count
peak index
, chr
, peak_point1
,
peak_point2
.cell index
, cell barcode
,
cell type label
, sample/subject ID
, and
condition
We map 870 motifs from the chromVARmotifs library to all peaks and get the binary mapping relationship.
peak index
, motif index
, and
binary binding
.motif index
and motif names
.Note, the motif names of the same TF can be very different if they are collected from different databases. To avoid duplicated motifs and minimize redundance, we required using the corresponding gene name for each TF motif.
TAD boundary information can be easily obatined from HiC profiles or
similar experiments. We include in our demo a GM12878 HiC-based TAD file
including ~6000 domains with median size 400kb for blood context
analysis. * TAD file: a three column matrix with
chr
, left_boundary
, and
right_boundary
In case no proper TAD information or HiC profile is available for the context being studied, another option to use relative distance to TSS (e.g. 500kb) as prior to initally pair peaks and genes. A hg38 RefSeq file is included in the demo for TSS reference.
# TAD prior
TAD_file_path = 'Demo input files/RaoGM12878_40kb_TopDomTADs_filtered_hg38.txt'
distance_control=5e5
# Refseq file for transcription starting site extraction
Ref_seq_file_path = 'Demo input files/hg38_Refseq.txt'
Load data:
loaded_data <- Data_loading(Candidate_gene_file_path, Candidate_peak_file_path,
scRNA_readcount_file_path, scRNA_gene_file_path, scRNA_cellmeta_file_path,
scATAC_readcount_file_path, scATAC_peak_file_path, scATAC_cellmeta_file_path,
Motif_mapping_file_path, Motif_name_file_path, Ref_seq_file_path)
#> [1] "loading candidate genes ..."
#>
#> [1] "1143 candidate genes are provided."
#>
#> [1] "loading scRNAseq data ..."
#>
#> [1] "We detected 2 conditions from the meta file"
#>
#> [1] "The input scRNAseq data includes 36601 Genes and 13248 cells from 12 samples"
#>
#> [1] "loading candidate peaks ..."
#>
#> [1] "945 candidate peaks are provided."
#>
#> [1] "loading scATACseq data ..."
#>
#> [1] "The input scATACseq data includes 144387 Peaks and 13248 cells from 12 samples"
#>
#> [1] "There are sample-paired single cell multiomics data for 12 samples."
#> [1] "Please check sample IDs if this number is lower than expected."
#>
#> [1] "loading TF motif match data ..."
#>
#> [1] "870 motifs are screened."
To identify candidate disease-modulated triads, candidate chromatin sites are associated with TFs by motif sequence matching. These sites are then linked to the candidate genes by requiring them to be within the same TAD or within a user controlled distance.
#Candidate circuits construction with TAD
Candidate_circuits <- Candidate_circuits_construction_with_TAD(loaded_data, TAD_file_path)
#> [1] "loading TAD data ..."
#>
#> [1] "5904 TAD segments provided"
#>
#> [1] "MAGICAL model initialization ..."
#> as(<dgCMatrix>, "dgTMatrix")不再有用。
#> 请用'as(., "TsparseMatrix")'。
#> 见help("Deprecated")和help("Matrix-deprecated")。
Or
#Candidate circuits construction without TAD
Candidate_circuits <- Candidate_circuits_construction_without_TAD(loaded_data, distance_control)
#> [1] "Naive pairing of peaks and genes if they are within 5e+05 bps"
#>
#> [1] "MAGICAL model initialization ..."
#> Joining with `by = join_by(chr, point1, point2)`
Then
MAGICAL uses a Bayesian framework to iteratively model chromatin accessibility and gene expression variation across cells and conditions and estimate the strength of the circuit TF-peak-gene linkages. The TF-peak binding confidence and the hidden TF activity are optimized to fit to the chromatin accessibility data. The estimated TF binding strength, TF activity and the input gene expression data are used to infer the peak-gene looping. The updated states of TF-peak-gene linkages based on the estimated strength are used to initialize the next round of estimations.
# Model parameter estimation
Circuits_linkage_posterior<-MAGICAL_estimation(loaded_data, Candidate_circuits, Initial_model, iteration_num = 1000)
#> [1] "MAGICAL integration starts ..."
#>
#> [1] "MAGICAL finished 10 percent"
#>
#> [1] "MAGICAL finished 20 percent"
#>
#> [1] "MAGICAL finished 30 percent"
#>
#> [1] "MAGICAL finished 40 percent"
#>
#> [1] "MAGICAL finished 50 percent"
#>
#> [1] "MAGICAL finished 60 percent"
#>
#> [1] "MAGICAL finished 70 percent"
#>
#> [1] "MAGICAL finished 80 percent"
#>
#> [1] "MAGICAL finished 90 percent"
#>
#> [1] "MAGICAL finished 100 percent"
Finally, optimized circuits fitting the variation in both modalities are selected. Circuit genes, associated chromatin sites and the regulatory TFs will be written to “MAGICAL_selected_regulatory_circuits.txt”. MAGICAL uses default thresholds (posterior probabilities on TF-peak binding and peak-gene looping) for circuit selection. Users can adjust these thresholds in the provided scripts to allow more or fewer outputs.
Note: running MAGICAL on our demo files may return very similar but slightly different results to the example output file we provided due the nature of Bayesian algorithm.