clustalo manual with usage examples | BioQueue Encyclopedia

Usage

clustalo [options] -i input_file.fasta -o output_file.fasta
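
As a minimal sketch of the usage line above, the following writes a small FASTA file and aligns it. The file names and the three peptide sequences are made up for illustration, and the alignment step is skipped when clustalo is not on the PATH.

```shell
# Hypothetical input: three short, made-up protein sequences.
cat > example.fasta <<'EOF'
>seq1
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ
>seq2
MKTAYIAKQRQISFVKAHFSRQLEERLGLIEV
>seq3
MKSAYIAKQRISFVKSHFSRQEERLGLIEVQ
EOF

# Align only if clustalo is installed; --force overwrites an existing output file.
if command -v clustalo >/dev/null 2>&1; then
    clustalo -i example.fasta -o example.aln.fasta --force
else
    echo "clustalo not found on PATH; skipping alignment" >&2
fi
```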

Manual

Clustal Omega is a multiple sequence alignment (MSA) program used in bioinformatics to align a set of biological sequences (protein, DNA, or RNA). It is the successor to the original ClustalW software, with improved accuracy and speed, and it handles large datasets efficiently. It also provides options for customization and for integration with other bioinformatics tools.

In default mode, users supply a file of sequences to be aligned. These are clustered to produce a guide tree, which is then used to guide a "progressive alignment" of the sequences. There are also facilities for aligning existing alignments to each other, for aligning a sequence to an alignment, and for using a hidden Markov model (HMM) to help guide an alignment of new sequences that are homologous to the sequences used to make the HMM. This last procedure is referred to as "external profile alignment" (EPA).

Clustal-Omega uses HMMs for the alignment engine, based on the HHalign package from Johannes Soeding. Guide trees are made using an enhanced version of mBed which can cluster very large numbers of sequences in $\mathcal{O}(N\log(N))$ time. Multiple alignment then proceeds by aligning larger and larger alignments using HHalign, following the clustering given by the guide tree.

Required arguments

Clustal-Omega accepts 3 types of sequence input:

1. a sequence file with un-aligned or aligned sequences (-i),
2. profiles (a multiple alignment in a file) of aligned sequences (--p1, --p2),
3. an HMM (--hmm-in).

Valid combinations of the above are:

one file with un-aligned or aligned sequences (1.); the sequences will be aligned, and the alignment will be written out. Use the -i flag for this mode. If the sequences are already aligned (all sequences have the same length and at least one sequence has at least one gap), the alignment is turned into an HMM, the sequences are de-aligned, and the now un-aligned sequences are re-aligned using the HMM as an external profile for External Profile Alignment (EPA). If no EPA is desired, use the --dealign flag. Use this mode to make a multiple alignment from a set of sequences. The sequence file must contain at least two sequences.
two profiles (2.)+(2.); the columns in each profile will be kept fixed and the alignment of the two profiles will be written out. Use the --p1 and --p2 flags for this mode. Use this option to align two alignments (profiles) together.
one file with un-aligned or aligned sequences (1.) and one profile (2.); the profile is converted into an HMM and the un-aligned sequences are multiply aligned (using the HMM background information) to form a profile. This constructed profile is then aligned with the input profile; the columns in each profile (the original one and the one created from the un-aligned sequences) are kept fixed, and the alignment of the two profiles is written out. Use the -i flag in conjunction with the --p1 flag for this mode. The sequence file (1.) must contain at least two sequences; to align a single sequence with a profile, use the two-profile mode (--p1 and --p2) instead. Use this mode to add new sequences to an existing alignment.
one file with un-aligned sequences (1.) and one HMM (3.); the un-aligned sequences will be aligned to form a profile, using the HMM as an external profile. The alignment is written out; the HMM information is discarded. Only HMMER2 and HMMER3 formats are accepted. Several HMMs can be given, but in the current version all but the first are ignored. Because only one HMM can be used, no HMM is produced from the input if the sequences are already aligned. Use the -i flag in conjunction with the --hmm-in flag for this mode. Use this mode to make a new multiple alignment of the sequences in the input file, with the HMM as a guide (EPA).
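
The four input combinations above can be sketched as command lines. All file names here are hypothetical placeholders, and the small run helper is an assumption added for illustration: it only prints each command unless DRY_RUN=0 is set, so the sketch can be read and tried without the input files.

```shell
# Dry-run helper (not part of clustalo): prints the command unless DRY_RUN=0.
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi; }

# One sequence file: align the sequences and write the alignment.
run clustalo -i seqs.fasta -o aln.fasta

# Two profiles: align two existing alignments, keeping the columns of each fixed.
run clustalo --p1 aln_a.fasta --p2 aln_b.fasta -o merged.fasta

# Sequence file plus one profile: add new sequences to an existing alignment.
run clustalo -i new_seqs.fasta --p1 aln_a.fasta -o extended.fasta

# Sequence file plus one HMM: align with the HMM as an external profile (EPA).
run clustalo -i seqs.fasta --hmm-in family.hmm -o guided.fasta
```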

Options

Sequence Input

-i, --in, --infile file: Multiple sequence input file (- for stdin), for example --in abc.fa. The file must contain at least two sequences.
--hmm-in file: HMM input files.
--hmm-batch file: Specify HMMs for individual sequences.
--dealign: Dealign input sequences.
--profile1, --p1 file: Pre-aligned multiple sequence file (aligned columns will be kept fixed).
--profile2, --p2 file: Pre-aligned multiple sequence file (aligned columns will be kept fixed).
--is-profile: Disable the check for whether the input is a profile and treat it as one (default: no).
-t, --seqtype Protein, RNA, DNA: Force a sequence type (default: auto).
--infmt a2m=fa[sta],clu[stal],msf,phy[lip],selex,st[ockholm],vie[nna]: Forced sequence input file format (default: auto).
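
A short sketch of how the input options above might be combined. The file names are hypothetical, and the run helper is an illustrative assumption that prints each command unless DRY_RUN=0 is set.

```shell
# Dry-run helper (not part of clustalo): prints the command unless DRY_RUN=0.
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi; }

# Read sequences from stdin ("-") and force the sequence type instead of
# relying on auto-detection. In a real run the input would be piped in.
run clustalo -i - --seqtype Protein -o aln.fasta

# Force the input format to Stockholm and discard the existing alignment
# before re-aligning.
run clustalo -i old_alignment.sto --infmt stockholm --dealign -o realigned.fasta
```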

Clustering

--distmat-in file: Pairwise distance matrix input file (skips distance computation).
--distmat-out file: Pairwise distance matrix output file.
--guidetree-in file: Guide tree input file (skips distance computation and guide-tree clustering step).
--guidetree-out file: Guide tree output file.
--pileup: Sequentially align sequences.
--full: Use full distance matrix for guide-tree calculation (might be slow; mBed is default).
--full-iter: Use full distance matrix for guide-tree calculation during iteration (might be slowish; mBed is default).
--cluster-size n: Soft maximum of sequences in sub-clusters.
--clustering-out file: Clustering output file.
--trans n: Use transitivity (default: 0).
--posterior-out file: Posterior probability output file.
--use-kimura: Use Kimura distance correction for aligned sequences (default no).
--percent-id: Convert distances into percent identities (default no).
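
One way the clustering options might be used is to save the guide tree from a first run and reuse it later. File names are hypothetical and the run helper is an illustrative assumption (it prints commands unless DRY_RUN=0). Pairing --distmat-out with --full is also an assumption here: the default mBed mode avoids computing the full matrix, so there may be nothing to write without it.

```shell
# Dry-run helper (not part of clustalo): prints the command unless DRY_RUN=0.
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi; }

# First run: compute the full distance matrix and keep both the matrix
# and the guide tree for later inspection or reuse.
run clustalo -i seqs.fasta -o aln.fasta --full \
    --distmat-out seqs.mat --guidetree-out seqs.dnd

# Later run: reuse the saved tree, skipping distance computation and clustering.
run clustalo -i seqs.fasta -o aln2.fasta --guidetree-in seqs.dnd
```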

Alignment Output

-o, --out, --outfile file: Multiple sequence alignment output file (default: stdout).
--outfmt a2m=fa[sta],clu[stal],msf,phy[lip],selex,st[ockholm],vie[nna]: MSA output file format (default: fasta).
--residuenumber, --resno: In Clustal format print residue numbers (default no).
--wrap n: Number of residues before line-wrap in output.
--output-order input-order,tree-order: MSA output order like in input/guide-tree.
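
The output options above might be combined as follows to produce a Clustal-format file. File names are hypothetical and the run helper is an illustrative assumption that prints the command unless DRY_RUN=0 is set.

```shell
# Dry-run helper (not part of clustalo): prints the command unless DRY_RUN=0.
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi; }

# Clustal format with residue numbers, 80 residues per line,
# sequences listed in guide-tree order rather than input order.
run clustalo -i seqs.fasta -o aln.clu --outfmt clu --resno \
    --wrap 80 --output-order tree-order
```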

Iteration

--iterations, --iter n: Number of (combined guide-tree/HMM) iterations.
--max-guidetree-iterations n: Maximum number of guidetree iterations.
--max-hmm-iterations n: Maximum number of HMM iterations.
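
A sketch of the iteration options, assuming the two --max-* flags cap the guide-tree and HMM parts of the combined iterations separately. File names and iteration counts are arbitrary, and the run helper is an illustrative assumption that prints commands unless DRY_RUN=0 is set.

```shell
# Dry-run helper (not part of clustalo): prints the command unless DRY_RUN=0.
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi; }

# Two combined guide-tree/HMM iterations.
run clustalo -i seqs.fasta -o aln_iter2.fasta --iter 2

# Three iterations, but rebuild the guide tree at most once.
run clustalo -i seqs.fasta -o aln_iter3.fasta --iter 3 \
    --max-guidetree-iterations 1 --max-hmm-iterations 3
```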

Limits

If the input exceeds a limit set in this section, the program exits early.

--maxnumseq n: Maximum allowed number of sequences.
--maxseqlen l: Maximum allowed sequence length.
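
Before setting limits it can help to know how many sequences a file holds and how long the longest one is. The following sketch measures both with standard tools and then shows a clustalo call with limits. The demo file, its contents, and the limit values are made up, and the run helper is an illustrative assumption that prints the command unless DRY_RUN=0 is set.

```shell
# Dry-run helper (not part of clustalo): prints the command unless DRY_RUN=0.
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi; }

# Hypothetical two-sequence FASTA file.
cat > limits_demo.fasta <<'EOF'
>a
MKTAYIAK
>b
MKTAYIAKQRQ
EOF

# Count header lines, and find the longest sequence (summing wrapped lines).
num_seqs=$(grep -c '^>' limits_demo.fasta)
max_len=$(awk '/^>/ { if (len > max) max = len; len = 0; next }
               { len += length($0) }
               END { if (len > max) max = len; print max + 0 }' limits_demo.fasta)
echo "sequences: $num_seqs, longest: $max_len"

# Exit early if the input has more than 1000 sequences or any longer than 5000.
run clustalo -i limits_demo.fasta -o limits_demo.aln.fasta \
    --maxnumseq 1000 --maxseqlen 5000
```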

Miscellaneous options

--auto: Set options automatically (might overwrite some of your options).
--threads n: Number of processors to use.
--pseudo file: Input file for pseudo-count parameters.
-l, --log file: Log all non-essential output to this file.
--version: Print version information and exit.
--long-version: Print long version information and exit.
--force: Force file overwriting.
-h, --help: Print this help and exit.
-v, --verbose: Verbose output (increases if given multiple times).
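
A sketch combining several of the miscellaneous options. The file names and thread count are arbitrary, and the run helper is an illustrative assumption that prints commands unless DRY_RUN=0 is set.

```shell
# Dry-run helper (not part of clustalo): prints the command unless DRY_RUN=0.
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi; }

# Use four processors, send non-essential output to a log file,
# be verbose, and overwrite aln.fasta if it already exists.
run clustalo -i seqs.fasta -o aln.fasta --threads 4 --log clustalo.log -v --force

# Check which version is installed.
run clustalo --version
```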

File formats this tool works with

FASTA
