Category

Sequence Analysis


Usage

ame [options] <sequence_file> <motif_file>+


Manual

AME (Analysis of Motif Enrichment) scores a set of sequences with a motif, treating each subsequence (and its reverse complement for complementable alphabets) in the sequence as a possible match to the motif. AME supports several types of sequence scoring functions, and it treats motif occurrences the same, regardless of their locations within the sequences. AME supports several types of statistical enrichment functions.

Note: AME does not score sequence positions that contain ambiguous characters.
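
The scanning idea described above can be illustrated with a minimal sketch. It is not AME's implementation: the motif, the uniform background, and the function names are hypothetical, and skipping any window that contains an ambiguous character is one interpretation of the note above.

```python
# Hypothetical 3-column DNA motif: letter probabilities per motif position.
MOTIF = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
# Assumed uniform 0-order background.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(site):
    """Return the reverse complement of an unambiguous DNA string."""
    return "".join(COMPLEMENT[c] for c in reversed(site))


def odds_score(site):
    """Product of probability/background ratios across the motif columns."""
    score = 1.0
    for column, letter in zip(MOTIF, site):
        score *= column[letter] / BACKGROUND[letter]
    return score


def scan(sequence):
    """Yield (start, strand, odds) for every window that can be scored."""
    width = len(MOTIF)
    for start in range(len(sequence) - width + 1):
        site = sequence[start:start + width].upper()
        if any(c not in BACKGROUND for c in site):
            continue  # window contains an ambiguous character such as N
        yield start, "+", odds_score(site)
        yield start, "-", odds_score(reverse_complement(site))


hits = list(scan("ACGN"))
for start, strand, odds in hits:
    print(start, strand, round(odds, 3))
```

Only the window ACG is scored here, once on each strand; the window CGN is skipped because it contains N.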

Required arguments

  • sequence_file: A FASTA file with sequences in which you want to find enriched motifs. The sequences may have differing lengths.
  • motif_file: The particular motif database (in MEME format) you require. Precompiled MEME databases are available here.

Options

  • --o <output dir>: output directory; default: ame_out
  • --oc <output dir>: overwrite output; default: ame_out
  • --text: output TSV format to stdout; overrides --o and --oc; default: create HTML and TSV files in <output dir>
  • --control <control file|--shuffle-->: control sequences in FASTA format, or the keyword --shuffle-- (including the dashes) to use shuffled versions of the primary sequences.
    AME will determine if each motif is enriched in the primary sequences compared to the control sequences by labeling the primary sequences 'positive' and the control sequences 'negative', and then applying the enrichment method to that labeling. The keyword --shuffle-- causes AME to create (a minimum of 1000) control sequences by shuffling the letters in each primary sequence while preserving the frequencies of k-mers (see option --kmer). 
    Note: The control sequences should have (approximately) the same distribution of lengths as the primary sequences or AME may fail to correctly detect enriched motifs and will report inaccurate p-values.
  • --kmer <k>: preserve k-mer frequencies when shuffling. This option has an effect only when you set --control --shuffle--. default: 2
  • --seed <s>: random number seed (integer); default: 1
  • --method [fisher|3dmhg|4dmhg|ranksum|pearson|spearman]: statistical test; default: fisher.
    • fisher: the one-tailed Fisher's Exact test. By default, AME performs partition maximization, labeling sequences sorted by FASTA score, and classifies them using the hit threshold (see --hit-lo-fraction). If you specify which sequences are 'positive' using either --control or --fix-partition, AME instead maximizes over all possible PWM thresholds that are at least as large as the sequence threshold defined for the scoring method in use (see --scoring). (default)
    • ranksum: the one-tailed Wilcoxon rank-sum test, also known as the Mann-Whitney U test.
    • pearson: the significance of the Pearson correlation coefficient between the PWM score and the FASTA score. Requires FASTA scores in all sequence headers. If there are fewer than 30 sequences, AME computes the mean-squared error of the linear regression between the PWM score and the FASTA score instead. Not valid with --control.
    • spearman: the significance of Spearman's rank coefficient (ρ) between the PWM score ranks and the FASTA score ranks. Not valid with --control.
    • 3dmhg and 4dmhg: the 3-dimensional (3dmhg) and 4-dimensional (4dmhg) multi-hypergeometric tests are two-tailed tests described in McLeay and Bailey, "Motif Enrichment Analysis: a unified framework and an evaluation on ChIP data", BMC Bioinformatics 11:165, 2010. These tests require --scoring totalhits; the 3dmhg function discriminates among sequences with 0, 1 or ≥ 2 hits, and the 4dmhg function discriminates among sequences with 0, 1, 2 or ≥ 3 hits. Note: Motifs enriched in either the primary or control sequences (or at the top or bottom of the sequences if you only give one sequence file) are considered significant by these tests. Not valid with --control.
  • --scoring [avg|max|sum|totalhits]: The method for scoring a single sequence for matches to a motif's PWM. The PWM score assigned to a sequence is either:
    • avg: the average motif odds score of all positions in the sequence; the sequence threshold assumes that the sequence has one "hit" (see --hit-lo-fraction) and the rest of the sites in the sequence have an average odds of 1. (default)
    • max: the maximum motif odds score over all positions in the sequence; the sequence threshold is equal to hit threshold (see --hit-lo-fraction).
    • sum: the sum of the motif odds scores of all positions in the sequence; the sequence threshold assumes that the sequence has one "hit" (see --hit-lo-fraction) and the rest of the sites in the sequence have an average odds of 1.
    • totalhits: the total number of positions in the sequence whose odds score is at least hit score (see --hit-lo-fraction); the sequence threshold is 1.
  • --hit-lo-fraction <fraction>: The hit threshold for a motif is defined as fraction times the maximum possible log-odds score for the motif. A position is considered a "hit" if the log-odds score is greater than or equal to the hit threshold. default: 0.25
  • --evalue-report-threshold <ev>: motif significance reporting threshold; default: 10
  • --fasta-threshold <ft>: Applies to the Fisher's exact test only, when you use --poslist pwm and do not use --control or --fix-partition. AME will classify sequences with FASTA scores below <ft> as 'positives'. default: 0.001
  • --fix-partition <int>: Causes AME to evaluate only the single partition consisting of the first <int> sequences. May not be used with --control or --poslist pwm.
  • --poslist [fasta|pwm]: For partition maximization, test thresholds on either X (PWM score) or Y (FASTA score). May not be used with --control or --fix-partition.
    • pwm: Use PWM score (X).
    • fasta: Use FASTA score (Y).
    Hint: Be careful when switching the poslist. It switches between using X and Y for determining true positives in the contingency matrix, in addition to switching which of X and Y AME uses for partition maximization.
    Partition on affinity (fasta) or motif (pwm) scores; default: fasta
  • --log-fscores: use log of FASTA scores (pearson) or log of ranks (spearman)
  • --log-pwmscores: use log of PWM scores. Only relevant for the pearson method.
  • --linreg-switchxy: switch roles of X=FASTA scores and Y=PWM scores. Only relevant for the pearson and spearman methods.
  • --xalph <alph file>: If the input motifs are in a different alphabet than the input sequences, and the motif alphabet is a subset of the sequence alphabet, you can specify an alphabet file containing the sequence alphabet definition. The input motifs are converted to this new alphabet, with the probabilities for the new symbols set to zero prior to applying pseudocounts.
  • --bfile <bfile>: Specify the source of a 0-order background model for converting a frequency matrix to a log-odds score matrix and for use in estimating the p-values of match scores. The background model normalizes for biased distribution of individual letters in the sequences. The value of bfile is either the path to a file in Markov Background Model Format, or one of the keywords motif, motif-file or uniform. The first two keywords cause the 0-order letter frequencies contained in the first motif file to be used, and uniform causes uniform letter frequencies to be used. If the background model in bfile is higher than 0-order, only the 0-order portion is used. If both strands are being scored, the background model is modified by averaging the frequencies of letters and their reverse complements.
  • --motif-pseudo <pc>: Add this total pseudocount to the counts in each motif column when converting a frequency matrix to a log-odds score matrix. The pseudocount added to each count is pc times the background frequency of the letter (see option --bfile). default: 0.1.
    Note 1: Counts are computed from MEME formatted motifs by multiplying the frequency of the letter by the value of nsites given in the motif letter-probability matrix header line.
    Note 2: The synonym --pseudocount is also allowed.
  • --inc <pattern>: name pattern to select as motif; may be repeated; default: all motifs are used
  • --exc <pattern>: name pattern to exclude as motif; may be repeated; default: all motifs are used
  • --noseq: Do not output the TSV (tab-separated values) file sequences.tsv
    Note: This option is recommended when there are many motifs and many input sequences, as the TSV file can become extremely large.
  • --verbose [1|2|3|4|5]: controls program verbosity (5=maximum verbosity); default: 2
  • --help: print this message and exit
  • --version: print the version and exit

Output

AME writes its output to files in a directory named ame_out, which it creates if necessary. You can change the output directory using the --o or --oc options. The directory will contain the following files:

  • ame.html: an HTML file that provides the results in a human-readable format
  • ame.tsv: a TSV (tab-separated values) file that provides the results in a format suitable for parsing by scripts and viewing with Excel
  • sequences.tsv: (optional, --method fisher only) a TSV (tab-separated values) file that lists the true- and false-positive sequences identified by AME

In all output files, only results for significantly enriched motifs are reported.
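
As a rough illustration of parsing the TSV output with a script, the sketch below reads a tab-separated file into dictionaries keyed by its header line. The column names in the sample are invented for the demonstration and are not AME's actual columns, and skipping blank lines and lines starting with '#' is an assumption about the file layout.

```python
import csv
import io


def read_tsv_rows(handle):
    """Return the rows of a tab-separated file as dicts keyed by the header line."""
    # Assumption: blank lines and lines starting with '#' carry no data.
    lines = (ln for ln in handle if ln.strip() and not ln.startswith("#"))
    return list(csv.DictReader(lines, delimiter="\t"))


# In-memory stand-in for a results file; these column names are hypothetical.
sample = io.StringIO("motif\tscore\nM1\t0.5\n# trailing comment\n")
rows = read_tsv_rows(sample)
print(rows)

# For a real run, something like the following should work:
# with open("ame_out/ame.tsv", newline="") as fh:
#     rows = read_tsv_rows(fh)
```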

Examples

Identify enriched motifs with user defined control sequences
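
A possible invocation, written as a Python sketch that only assembles and prints the command line. The file names primary.fasta, control.fasta and motifs.meme are placeholders, not files shipped with AME; the options are those listed in the Options section above.

```python
import shlex

# Placeholder input files; replace them with your own data.
primary = "primary.fasta"
control = "control.fasta"
motifs = "motifs.meme"

cmd = [
    "ame",
    "--oc", "ame_out",      # write results to ame_out, overwriting it
    "--control", control,   # user-defined control sequences
    "--scoring", "avg",
    "--method", "fisher",
    primary,
    motifs,
]

# Print the equivalent shell command line.
print(shlex.join(cmd))

# To actually run it (requires AME to be installed and on PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```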

 

Identify enriched motifs with shuffled sequences
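
A possible invocation using the --shuffle-- keyword, again as a Python sketch that only assembles and prints the command line. The input file names are placeholders; --kmer and --seed are shown with their documented default values.

```python
import shlex


def shuffled_control_command(primary, motifs, kmer=2, seed=1, outdir="ame_out"):
    """Build an ame argument list that uses shuffled primary sequences as controls."""
    return [
        "ame",
        "--oc", outdir,
        "--control", "--shuffle--",  # keyword, including the dashes
        "--kmer", str(kmer),         # k-mer frequencies preserved when shuffling
        "--seed", str(seed),         # random number seed
        primary,
        motifs,
    ]


# Placeholder input files; replace them with your own data.
args = shuffled_control_command("primary.fasta", "motifs.meme")
print(shlex.join(args))
```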

 

 

