cibersortx/gep manual with usage examples

Usage

docker run <bind_mounts> cibersortx/gep [options] --mixture <file> --sigmatrix <file>

Manual

Brief introduction

CIBERSORTx is a computational method for characterizing the cell composition of complex tissues from their gene expression profiles. In group mode, CIBERSORTx compares the average expression profile of each cell type across all samples in a group. This mode is useful for identifying cell types that differ consistently between groups, even when the differences are subtle. It also suits large datasets because it reduces the computational burden. CIBERSORTx can also run in high-resolution mode, in which it analyzes each sample individually. High-resolution mode is useful for identifying cell types that vary widely between samples within a group. It provides a more detailed analysis of the cell composition in each sample, but it is more computationally intensive than the group mode covered by this page.

Docker or Singularity is required to run this tool. You can run

docker pull cibersortx/gep

to obtain a copy of this tool. You also need a token that you will provide every time you run the CIBERSORTx executables. You can obtain the token from the CIBERSORTx website.

Required arguments

--username string: Email used for login to cibersortx.stanford.edu
--token string: Token associated with current IP address (generated on website)
--mixture file: Gene expression profile (GEP) matrix for the mixtures (bulk RNA-seq samples). Formatting requirements:
- Tab-delimited tabular input format (.txt or .tsv) with no double quotations and no missing entries.
- Genes in column 1; Mixture labels (sample names) in row 1
- Given the significant difference between counts (e.g., CPM) and gene length-normalized expression data (e.g., TPM), we recommend that the signature matrix and mixture files be represented in the same normalization space whenever possible.
- Data should be in non-log space. Note: if the maximum expression value is less than 50, CIBERSORTx will assume that the data are in log space and will anti-log all expression values by $2^x$.
--sigmatrix file: Signature matrix. You can use an empirically determined signature matrix, or use CIBERSORTx to generate one if you have a single-cell RNA-seq reference available.
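
The formatting requirements above can be checked before a run is launched. The following is a minimal sketch, not part of CIBERSORTx: `check_mixture` and `looks_log_space` are hypothetical helper names, the file is assumed to have one header row followed by one gene per row, and the threshold of 50 is taken from the note above.

```python
import math


def _is_number(value):
    """True if the field parses as a finite number."""
    try:
        return math.isfinite(float(value))
    except ValueError:
        return False


def check_mixture(text):
    """Return a list of formatting problems in a tab-delimited mixture matrix.

    Hypothetical pre-flight check, not a CIBERSORTx function.
    """
    problems = []
    if '"' in text:
        problems.append("file contains double quotation marks")
    # Keep the original line numbers so messages point at the right row.
    rows = [
        (number, line.split("\t"))
        for number, line in enumerate(text.splitlines(), start=1)
        if line.strip()
    ]
    if len(rows) < 2:
        return problems + ["need a header row and at least one gene row"]
    width = len(rows[0][1])
    if width < 2:
        problems.append("header has no sample columns (is the file tab-delimited?)")
    for number, fields in rows[1:]:
        if len(fields) != width:
            problems.append(f"line {number}: expected {width} fields, found {len(fields)}")
            continue
        # Column 1 holds the gene name; every other field must be numeric.
        if not all(_is_number(value) for value in fields[1:]):
            problems.append(f"line {number}: missing or non-numeric entry")
    return problems


def looks_log_space(text):
    """True if the maximum expression value is below 50 (the rule quoted above)."""
    values = [
        float(value)
        for line in text.splitlines()[1:]
        for value in line.split("\t")[1:]
        if _is_number(value)
    ]
    return bool(values) and max(values) < 50
```

An empty list from `check_mixture` means no problem was detected by this sketch. A `True` result from `looks_log_space` means CIBERSORTx would be expected to anti-log the values, so it is worth confirming that the data really are in log space.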

Options

--classes file: Cell type groupings
--cibresults file: Previous CIBERSORTx cell fractions (default: run CIBERSORT)
--label char: Sample label
--rmbatchBmode bool: Run B-mode batch correction (default: FALSE)
- --sourceGEPs file: Signature matrix GEPs for B-mode batch correction (default: sigmatrix). Required when setting --rmbatchBmode TRUE.
--rmbatchSmode bool: Run S-mode batch correction (default: FALSE)
- --refsample file: Single-cell RNA-seq reference profiles for S-mode batch correction. Required when setting --rmbatchSmode TRUE.
--groundtruth file: Ground truth GEPs (same labels as --classes)
--threads int: Number of parallel processes (default: $\text{No. cores} - 1$)
--QN bool: Run quantile normalization (default: FALSE)
--outdir char: Output directory (default: "mixture dir/CIBERSORTx_output")
--nsampling int: Number of subsamples for NNLS (default: 30)
--degclasses file: Run on two classes, specified by 1, 2, 0=skip
--redo bool: Redo transcriptome imputation (default: FALSE)
--redocibersort bool: Redo CIBERSORT (default: FALSE)
--useadjustedmixtures bool: If doing batch correction, use adjusted mixtures for GEP imputation (default: FALSE)
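
Because some options depend on others, it can help to assemble the command programmatically. The sketch below is an illustration only: `build_gep_command` is a hypothetical helper, it uses only the flags listed on this page, and the mount targets /src/data and /src/outdir are copied from the examples further down. It enforces just one dependency stated above (--refsample with --rmbatchSmode TRUE) and does not validate anything else.

```python
import shlex


def build_gep_command(input_dir, output_dir, username, token, mixture, sigmatrix, **options):
    """Assemble a docker argument list for cibersortx/gep (hypothetical helper).

    Keyword options map directly to the documented flags,
    e.g. QN=True becomes "--QN TRUE".
    """
    if options.get("rmbatchSmode") and not options.get("refsample"):
        raise ValueError("--refsample is required when --rmbatchSmode is TRUE")
    command = [
        "docker", "run",
        "-v", f"{input_dir}:/src/data",
        "-v", f"{output_dir}:/src/outdir",
        "cibersortx/gep",
        "--username", username,
        "--token", token,
        "--mixture", mixture,
        "--sigmatrix", sigmatrix,
    ]
    for name, value in options.items():
        # The tool expects TRUE/FALSE rather than Python's True/False.
        if isinstance(value, bool):
            value = "TRUE" if value else "FALSE"
        command += [f"--{name}", str(value)]
    return command


if __name__ == "__main__":
    # Placeholder values; replace them with your own paths and credentials.
    example = build_gep_command(
        "/abs/input", "/abs/output", "user@example.org", "YOUR_TOKEN",
        "mixture.txt", "sigmatrix.txt", QN=True,
    )
    print(shlex.join(example))
```

The returned list can be passed to `subprocess.run` as-is, which avoids shell quoting problems with paths that contain spaces.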

Outputs

*_GEPs_Filtered.txt: The main result of CIBERSORTx Group Mode is a file for cell-type specific gene expression profiles where genes have been filtered out using a threshold to eliminate unreliably estimated genes for each cell type.
The "1" values in the expression matrix txt files are genes with insufficient evidence of expression (these genes are either not expressed or have inadequate statistical power to be imputed). The NA values are genes that have inadequate statistical power to be imputed.
*_GEPs.txt: the file for cell-type specific gene expression profiles where no filtering was done.
*_Fractions.txt: file enumerating the fractions of the different cell types in bulk samples.
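
When post-processing *_GEPs_Filtered.txt, the "1" and NA entries described above usually need to be set aside. The sketch below shows one way to do this. It assumes a layout that this page does not spell out (tab-delimited, gene names in the first column, one column per cell type), and `reliable_genes` is a hypothetical helper name; check the layout against your own output before relying on it.

```python
def reliable_genes(text):
    """Map each cell type to the genes whose value is neither "NA" nor 1.

    Assumes a tab-delimited matrix with genes in column 1 and one column
    per cell type (an assumption about the output layout).
    """
    lines = [line for line in text.splitlines() if line.strip()]
    cell_types = lines[0].split("\t")[1:]
    result = {cell_type: [] for cell_type in cell_types}
    for line in lines[1:]:
        gene, *values = line.split("\t")
        for cell_type, value in zip(cell_types, values):
            # Skip entries flagged as unexpressed or not imputable.
            if value == "NA" or float(value) == 1.0:
                continue
            result[cell_type].append(gene)
    return result
```

With toy data containing the row `CD19<TAB>812.4<TAB>1` under the header `GeneSymbol<TAB>B cells<TAB>T cells`, the gene CD19 would be listed for "B cells" only.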

The different statistics used for the filtering are saved in the files listed below. Refer to Supplementary Note in Newman et al. (submitted) for further details:

*_GEPs_StdErrs.txt: analytically derived standard errors of the regression coefficients.
*_GEPs_Pvals.txt: $p$-values used to determine the significance of the regression coefficients.
*_GEPs_Qvals.txt: adjusted $p$-values ($q$-values) after multiple hypothesis testing using the Benjamini-Hochberg method.
*_GEPs_CV.txt: To further reduce confounding noise, genes were filtered based on their geometric coefficient of variation (geometric c.v.), calculated using the natural logarithm of subsampled regression coefficients. The values are listed in this file. The geometric c.v. values were used to determine the adaptive cell-type specific noise threshold.
*_GEPs_ThresholdPlots.pdf: plot illustrating the adaptive noise threshold used for filtering.
CIBERSORTxGEP_Weights.txt: the fractions of the different cell types after merging them into major classes according to the merged classes file.
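
Two of the statistics above have standard textbook definitions that can be sketched in a few lines. The code below is a generic illustration, not the CIBERSORTx implementation: `benjamini_hochberg` applies the usual step-up adjustment, and `geometric_cv` uses the common formula sqrt(exp(s^2) - 1), where s is the standard deviation of the natural-log values. The use of the sample (n - 1) standard deviation is an assumption; the Supplementary Note cited above is the authority on what the tool actually computes.

```python
import math
import statistics


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that each q-value is the minimum
    # of p * m / rank over all ranks at or above its own.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        qvalues[index] = running_min
    return qvalues


def geometric_cv(coefficients):
    """Geometric c.v. of positive values: sqrt(exp(s^2) - 1), s = SD of ln(x)."""
    logs = [math.log(value) for value in coefficients]
    return math.sqrt(math.exp(statistics.variance(logs)) - 1.0)
```

Here `geometric_cv` expects at least two strictly positive values, such as the regression coefficients obtained from repeated subsampling for one gene and cell type.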

If ground truth was given as input:

*_CrossCorrelationMatrix.pdf: plots showing the correlation between estimated genes and ground truth for each cell type, for all genes (GEP) and for the signature matrix genes (SM). The corresponding *_CrossCorrelationMatrix.txt file is also given as output.
*_ScatterPlots.pdf: scatterplots showing the estimated expression values (non-zero only) versus the observed expression values for the whole gene expression profile (GEP) and for the signature matrix genes (SM) after noise filtering.
*_SM_GEPS_HeatMap.png: Heatmap illustrating the CIBERSORTx imputed gene expression values for the signature matrix genes ($y$ axis), compared to ground truth.
*_GEPs_Stats.txt: set of benchmark statistics used to compare with ground truth.

Examples

Group Level GEPs - FL (Fig. 3b-f)

This example imputes cell type specific gene expression profiles from bulk follicular lymphoma samples profiled on microarray, using the signature matrix LM22 collapsed to 4 major cell types. In addition, the results are compared to ground truth reference profiles obtained from FACS-sorted cell subsets.

docker run -v absolute/path/to/input/dir:/src/data -v absolute/path/to/output/dir:/src/outdir cibersortx/gep \
    --username email_address_registered_on_CIBERSORTx_website \
    --token token_obtained_from_CIBERSORTx_website \
    --mixture Fig3b-f-FL-arrays-mixture.txt \
    --sigmatrix LM22.txt \
    --groundtruth Fig3b-f-FL-arrays-groundtruth.RMA.txt \
    --classes Fig3b-f-LM4_merged_classes.txt --QN TRUE

Group Level GEPs - NSCLC (Fig. 3g)

This example imputes cell type specific gene expression profiles from bulk NSCLC samples profiled by RNA-Seq and compares the results to ground truth reference profiles obtained from FACS-sorted cell subsets.

docker run -v absolute/path/to/input/dir:/src/data -v absolute/path/to/output/dir:/src/outdir cibersortx/gep \
    --username email_address_registered_on_CIBERSORTx_website \
    --token token_obtained_from_CIBERSORTx_website \
    --mixture mixture_NSCLCbulk_Fig3g.txt \
    --sigmatrix sigmatrix_NSCLC_Fig3g.txt \
    --groundtruth groundtruth_NSCLCsubsets_Fig3g.txt \
    --classes merged_classes_NSCLC_Fig3g.txt

File formats this tool works with

TSV, TXT
