cibersortx/fractions manual with usage examples

Usage

docker run <bind_mounts> cibersortx/fractions [Options]

Manual

Brief introduction

CIBERSORTx is an analytical tool developed by Newman et al. that imputes gene expression profiles and estimates the abundances of member cell types in a mixed cell population from gene expression data. It takes two inputs: gene expression data representing a bulk admixture of different cell types, and a signature matrix file that enumerates the genes defining the expression profile of each cell type of interest. For the signature matrix, users can either use existing curated matrices for reference cell types or create custom signature gene files by providing reference gene expression profiles of pure cell populations. Given the increasing use of single-cell transcriptome sequencing, CIBERSORTx can also derive signature matrices from single-cell RNA sequencing data. The fractions module of CIBERSORTx enumerates the proportions of distinct cell subpopulations in bulk tissue expression profiles. Unlike its predecessor, CIBERSORTx supports deconvolution of bulk RNA-Seq data using signature genes derived from either single-cell transcriptomes or sorted cell populations.

Docker or Singularity is required to run this tool. You can run

docker pull cibersortx/fractions

to obtain a copy of this tool. You also need a token that you will provide every time you run the CIBERSORTx executables. You can obtain the token from the CIBERSORTx website.

Required arguments

--username string: Email used for login to cibersortx.stanford.edu.
--token string: Token associated with the current IP address (generated on the website).

Options

--mixture file_name: Gene expression profile (GEP) matrix for the mixtures (bulk RNA-seq samples). [required for running CIBERSORTx, optional for creating a custom signature matrix only]. Formatting requirements:
- Tab-delimited tabular input format (.txt or .tsv) with no double quotations and no missing entries.
- Genes in column 1; mixture labels (sample names) in row 1.
- Given the significant difference between counts (e.g., CPM) and gene length-normalized expression data (e.g., TPM), we recommend that the signature matrix and mixture files be represented in the same normalization space whenever possible.
- Data should be in non-log space. Note: if the maximum expression value is less than 50, CIBERSORTx will assume that the data are in log space and will anti-log all expression values by $2^x$.
- CIBERSORTx will add a unique identifier to each redundant gene symbol; however, we recommend that users remove redundancy prior to file upload.
- CIBERSORTx performs feature selection and therefore typically does not use all genes in the signature matrix. It is generally acceptable if some genes are missing from the user’s mixture file. If fewer than 50% of signature matrix genes overlap, CIBERSORTx will issue a warning.
--sigmatrix file_name: Signature matrix (cell type GEP barcode matrix): row 1 = sample labels; column 1 = gene symbols; no missing values [required: use preexisting matrix or create one].
--perm int: Number of permutations for p-value calculation [default: 0].
--label char: Sample label [default: none].
--sourceGEPs file_name: Signature matrix GEPs for batch correction [default: sigmatrix].
--QN bool: Run quantile normalization [default: FALSE].
--absolute bool: Run absolute mode [default: FALSE].
--abs_method char: Pick absolute method ['sig.score' (default) or 'no.sumto1'].
--verbose bool: Print verbose output to the terminal [default: FALSE].
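
The formatting requirements listed under --mixture can be checked before starting a run. The following is a minimal sketch in Python, not part of CIBERSORTx: the function names check_expression_matrix and signature_overlap are hypothetical, and the checks only mirror the rules stated above (no double quotes, no missing entries, redundant gene symbols, the log-space threshold of 50, and the 50% overlap warning).

```python
def check_expression_matrix(text):
    """Return a list of formatting problems found in a tab-delimited matrix.

    Hypothetical helper: `text` is the full content of a mixture or
    reference file, with genes in column 1 and sample labels in row 1.
    """
    problems = []
    if '"' in text:
        problems.append("double quotation marks present")
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return ["file is empty"]
    width = len(lines[0].split("\t"))  # gene column + one column per sample
    seen = set()
    max_value = None
    for line in lines[1:]:
        fields = line.split("\t")
        gene, values = fields[0], fields[1:]
        if gene in seen:
            problems.append(f"redundant gene symbol: {gene}")
        seen.add(gene)
        if len(fields) != width or any(v.strip() == "" for v in values):
            problems.append(f"missing entry for gene: {gene}")
            continue
        try:
            numbers = [float(v) for v in values]
        except ValueError:
            problems.append(f"non-numeric entry for gene: {gene}")
            continue
        for number in numbers:
            if max_value is None or number > max_value:
                max_value = number
    # Per the manual, a maximum below 50 makes CIBERSORTx assume log space.
    if max_value is not None and max_value < 50:
        problems.append("maximum value below 50: data will be treated as log space")
    return problems


def signature_overlap(mixture_genes, signature_genes):
    """Fraction of signature matrix genes that also appear in the mixture."""
    signature = set(signature_genes)
    if not signature:
        return 0.0
    return len(signature & set(mixture_genes)) / len(signature)
```

An empty list from check_expression_matrix means none of these particular checks failed; it does not guarantee that CIBERSORTx will accept the file. A signature_overlap result below 0.5 corresponds to the case in which CIBERSORTx issues a warning.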

Options for correcting batch effects

CIBERSORTx provides two options to address platform-specific variation (e.g., between scRNA-seq and RNA-seq). Enabling either option requires a minimum of three mixture samples; more than ten mixtures are recommended.

--rmbatchBmode bool: Run B-mode batch correction. B-mode (bulk mode) batch correction removes technical differences between a signature matrix derived from bulk sorted reference profiles (e.g., bulk RNA-Seq or microarrays) and an input set of mixture samples. The technique can also be applied to signature matrices derived from scRNA-Seq platforms, provided that transcripts are measured analogously to bulk mixture expression profiles (e.g., full-length transcripts without UMIs profiled by SMART-Seq2). In this mode, the mixture datasets will be adjusted and used in fraction estimation. [default: FALSE].
--rmbatchSmode bool: Run S-mode batch correction. S-mode (single cell mode) batch correction is tailored for single cell-derived signature matrices generated from droplet-based or UMI-based platforms, including 10x Chromium or protocols with high technical variation. The signature matrix will be adjusted in this mode. [default: FALSE].

Options for creating a custom signature matrix

--refsample file_name: Reference profiles (with replicates) or labeled scRNA-Seq data [required]. The reference sample file is a table of the gene expression profiles of reference sample cell populations that will be compared to each other as defined in the phenotype classes file (see --phenoclasses) to generate the custom signature genes file. Formatting requirements:
- Tab-delimited tabular input format (.txt) with no double quotations and no missing entries.
- Gene symbols in column 1; Reference cell phenotype labels in row 1.
- Cells with the same phenotype should have the same phenotypic label.
- Remove any non-assigned cells before uploading the file to CIBERSORTx.
- CIBERSORTx will automatically normalize the input data such that the sum of all normalized reads is the same for each transcriptome. If a gene length-normalized expression matrix is provided (e.g., RPKM), then the signature matrix will be in TPM (transcripts per million). If a count matrix is provided, the signature matrix will be in CPM (counts per million). Regardless of the input, the signature matrix and mixture files should be represented in the same normalization space.
- Data should be in non-log space. Note: if the maximum expression value is less than 50, CIBERSORTx will assume that the data are in log space and will anti-log all expression values by $2^x$.
--phenoclasses file_name: Rows correspond to the cell type classes that will be used to define the classes in the signature genes file. Columns correspond in exact order to the reference samples in the reference samples file. Each data point should have a value of "0", "1" or "2" (without the double quotes). A value of "1" indicates membership of the reference sample in the class defined in that row, a value of "2" indicates the class that the sample will be compared against, and a value of "0" indicates that the comparison will be ignored. If --single_cell TRUE is specified, the phenotype classes file is built by CIBERSORTx and is not required as input. [required if --single_cell FALSE].
--single_cell bool: Create a matrix from scRNA-Seq data [default: FALSE].
--G.min int: Minimum number of genes per cell type in the signature matrix [default: 50, if --single_cell TRUE: 300].
--G.max int: Maximum number of genes per cell type in the signature matrix [default: 150, if --single_cell TRUE: 500].
--q.value float: Q-value threshold for differential expression [default: 0.3, if --single_cell TRUE: 0.01].
--filter bool: Remove non-hematopoietic genes. This option was applied to building the LM22 signature matrix. For further details, please refer to Newman et al., Nature Methods (2015). [default: FALSE].
--k.max int: Maximum condition number [default: 999].
--remake bool: Remake signature gene matrix [default: FALSE].
--replicates int: Number of replicates to use for building the scRNA-Seq reference file [default: 5].
--sampling float: Fraction of available single cell GEPs selected using random sampling [default: 0.5].
--fraction float: Average gene expression threshold (in $\log_2$ space) for cells with the same identity/phenotype showing evidence of expression [default: 0.75]. Although appropriate for plate-based approaches (e.g., SmartSeq2), this threshold may be too high for single-cell experiments generated using droplet-based platforms, such as 10x Chromium or DropSeq, which generally capture a much smaller number of genes (e.g., $<1500$). For such data, we recommend reducing this parameter to 0.50 or even 0; otherwise, the sparsity of the data may yield too few genes for creating a reliable signature matrix.
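
When --single_cell is FALSE, the phenotype classes file described under --phenoclasses has to be written by hand. The sketch below, in Python, shows one way to generate it for the common case in which every class is compared against all other samples. It is not part of CIBERSORTx: the function name build_phenoclasses is hypothetical, and the layout (class name in the first field of each row, no header row) is an assumption that should be checked against an example file from the CIBERSORTx website.

```python
def build_phenoclasses(sample_classes):
    """Build phenotype classes text from per-sample class labels.

    `sample_classes` lists the class of each reference sample in the exact
    column order of the reference sample file. Each output row starts with
    the class name (assumed layout), followed by one code per sample:
    "1" = sample belongs to this class, "2" = sample is compared against it.
    The code "0" (ignore the comparison) is never produced by this sketch.
    """
    classes = list(dict.fromkeys(sample_classes))  # unique, first-seen order
    rows = []
    for cls in classes:
        codes = ["1" if label == cls else "2" for label in sample_classes]
        rows.append("\t".join([cls] + codes))
    return "\n".join(rows) + "\n"
```

For example, build_phenoclasses(["B", "B", "T", "NK"]) yields three rows, one per class, each with four codes in the same order as the reference samples.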

Tips

Avoid special symbols in gene names; otherwise, you may see error messages like:

In fread(X_file, header = F, sep = "\t") :
File '/src/outdir//temp.Fractions.coreSVR.X.tsv' has size 0. Returning a NULL data.table.
Warning message:
In fread(Y_file, header = F, sep = "\t") :
File '/src/outdir//temp.Fractions.coreSVR.Y.tsv' has size 0. Returning a NULL data.table.
Error: $ operator is invalid for atomic vectors
In addition: Warning message:
In mclapply(1:svn_itor, res, mc.cores = svn_itor) :
  all scheduled cores encountered errors in user code
Execution halted

Outputs

CIBERSORTx_{1}_inferred_phenoclasses.CIBERSORTx_{1}_inferred_refsample.bm.K{2}.txt: The newly created custom signature matrix file.
CIBERSORTx_{1}_inferred_phenoclasses.CIBERSORTx_{1}_inferred_refsample.bm.K{2}.pdf: Heatmap for the signature matrix.
CIBERSORTx_{1}_inferred_phenoclasses.txt: The phenotype classes file created by CIBERSORTx for building the custom signature matrix.
CIBERSORTx_{1}_refsample_inferred_refsample.txt: A reference file created from the reference sample you provided as input, using the parameters set for the number of replicates, the fraction of GEPs selected by random sampling, and the fraction of cells of the same identity that show evidence of expression for a given gene.
If no batch correction is performed:
- *_Results.txt: file enumerating the fractions of the different cell types in mixture samples.
If batch correction is performed (B-mode or S-mode):
- *_Adjusted.txt: file enumerating the fractions of the different cell types in mixture samples after batch correction.
- *_Mixtures_Adjusted.txt: the input mixture file after batch correction
If S-mode batch correction is performed:
- [signature matrix filename]_Adjusted: the signature matrix after batch correction

In the file names above, {1} is the file prefix of the --refsample file and {2} is the maximum condition number specified by --k.max.
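
The naming scheme above can be expressed as a small Python helper, which is convenient when a downstream script needs to locate the generated files. This is a sketch only: the function name signature_outputs is hypothetical, and it assumes that "file prefix" means the --refsample file name without its directory and final extension.

```python
from pathlib import PurePosixPath


def signature_outputs(refsample, k_max=999):
    """Expected text output names for a custom signature matrix build.

    Assumption: {1} is the refsample file name without directory or final
    extension; {2} is the value passed to --k.max (default 999).
    """
    prefix = PurePosixPath(refsample).stem
    inferred = f"CIBERSORTx_{prefix}_inferred_phenoclasses"
    return {
        "signature_matrix": (
            f"{inferred}.CIBERSORTx_{prefix}_inferred_refsample.bm.K{k_max}.txt"
        ),
        "phenoclasses": f"{inferred}.txt",
        "refsample": f"CIBERSORTx_{prefix}_refsample_inferred_refsample.txt",
    }
```

The returned names are relative to the directory mounted at /src/outdir.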

Examples

NSCLC PBMCs Single Cell RNA-Seq (Fig. 2a,b)

This example builds a signature matrix from single cell RNA sequencing data from NSCLC PBMCs and enumerates the proportions of the different cell types in an RNA-seq dataset profiled from whole blood using S-mode batch correction.

docker run -v absolute/path/to/input/dir:/src/data -v absolute/path/to/output/dir:/src/outdir cibersortx/fractions \
    --username email_address_registered_on_CIBERSORTx_website \
    --token token_obtained_from_CIBERSORTx_website \
    --single_cell TRUE \
    --refsample Fig2ab-NSCLC_PBMCs_scRNAseq_refsample.txt \
    --mixture Fig2b-WholeBlood_RNAseq.txt \
    --fraction 0 --rmbatchSmode TRUE

Single Cell RNA-Seq HNSCC (Fig. 2c,d)

This example builds a signature matrix from single cell RNA sequencing data from HNSCC tumors (Puram et al., Cell, 2017) and enumerates the proportions of the different cell types in bulk HNSCC tumors reconstituted from single cell RNA-Seq data.

docker run -v absolute/path/to/input/dir:/src/data -v absolute/path/to/output/dir:/src/outdir cibersortx/fractions \
    --username email_address_registered_on_CIBERSORTx_website \
    --token token_obtained_from_CIBERSORTx_website \
    --single_cell TRUE \
    --refsample scRNA-Seq_reference_HNSCC_Puram_et_al_Fig2cd.txt \
    --mixture mixture_HNSCC_Puram_et_al_Fig2cd.txt

Single Cell RNA-Seq Melanoma

This example builds a signature matrix from single cell RNA sequencing data from melanoma (Tirosh et al., Science, 2016) and enumerates the proportions of the different cell types in bulk melanoma tumors reconstituted from single cell RNA-Seq data.

docker run -v absolute/path/to/input/dir:/src/data -v absolute/path/to/output/dir:/src/outdir cibersortx/fractions \
    --username email_address_registered_on_CIBERSORTx_website \
    --token token_obtained_from_CIBERSORTx_website \
    --single_cell TRUE \
    --refsample scRNA-Seq_reference_melanoma_Tirosh_SuppFig_3b-d.txt \
    --mixture mixture_melanoma_Tirosh_SuppFig_3b-d.txt

Abbas et al.

This example builds a signature matrix from sorted cell populations profiled on microarrays and enumerates cell proportions in bulk samples, also profiled on microarrays.

docker run -v absolute/path/to/input/dir:/src/data -v absolute/path/to/output/dir:/src/outdir cibersortx/fractions \
    --username email_address_registered_on_CIBERSORTx_website \
    --token token_obtained_from_CIBERSORTx_website \
    --refsample reference_purified_GSE11103.txt \
    --phenoclasses phenoclasses_GSE11103.txt \
    --mixture mixture_GSE11103.txt --QN TRUE
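
The four commands above share the same structure, so they can be assembled programmatically when running many datasets. The following Python sketch builds the argument list for one run. The function name build_command is hypothetical; the mount targets /src/data and /src/outdir and the image name are taken from the examples above, and boolean options are written as TRUE or FALSE as in the manual.

```python
def build_command(input_dir, output_dir, username, token, options):
    """Return a `docker run` argument list for cibersortx/fractions.

    `options` maps option names without the leading dashes (for example
    "single_cell", "refsample", "G.min") to their values. No check is made
    that the names are valid CIBERSORTx options.
    """
    cmd = [
        "docker", "run",
        "-v", f"{input_dir}:/src/data",
        "-v", f"{output_dir}:/src/outdir",
        "cibersortx/fractions",
        "--username", username,
        "--token", token,
    ]
    for name, value in options.items():
        if isinstance(value, bool):
            value = "TRUE" if value else "FALSE"
        cmd += [f"--{name}", str(value)]
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, check=True). Docker expects absolute host paths for input_dir and output_dir.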

File formats this tool works with

TSV, TXT
