plot_len.pl |
plot_len.pl input.clstr 1,2-4,5-9,10-19,20-49,50-99,100-299,500-99999 10-59,60-149,150-499,500-1999,2000-999999 |
This script prints the distributions of cluster sizes and sequence lengths, binned by the ranges given on the command line. |
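The binning idea can be illustrated with a minimal Python sketch. It is not the implementation of plot_len.pl; it assumes that the first range list bins cluster sizes and that a .clstr file consists of `>Cluster N` header lines, each followed by one line per member sequence. The function names and the sample data are hypothetical.

```python
from collections import Counter


def parse_bins(spec):
    """Turn a range list such as '1,2-4,5-9' into [(1, 1), (2, 4), (5, 9)]."""
    bins = []
    for part in spec.split(","):
        lo, _, hi = part.partition("-")
        bins.append((int(lo), int(hi or lo)))
    return bins


def cluster_sizes(lines):
    """Count the member lines under each '>Cluster N' header (assumed layout)."""
    sizes = []
    for line in lines:
        if line.startswith(">Cluster"):
            sizes.append(0)
        elif line.strip() and sizes:
            sizes[-1] += 1
    return sizes


def size_histogram(sizes, bins):
    """Return, for each inclusive bin, how many clusters have a size in it."""
    hist = Counter()
    for size in sizes:
        for lo, hi in bins:
            if lo <= size <= hi:
                hist[(lo, hi)] += 1
                break
    return [hist[b] for b in bins]


# Made-up .clstr content for illustration only.
example = (
    ">Cluster 0\n"
    "0\t120aa, >seqA... *\n"
    "1\t118aa, >seqB... at 95.00%\n"
    "2\t90aa, >seqC... at 91.00%\n"
    ">Cluster 1\n"
    "0\t300aa, >seqD... *\n"
).splitlines()

print(size_histogram(cluster_sizes(example), parse_bins("1,2-4,5-9")))
```

Clusters whose size falls outside every range are silently skipped in this sketch, which is worth keeping in mind when a range list leaves a gap.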
MAFFT |
mafft [arguments] input > output |
MAFFT is a multiple sequence alignment program for Unix-like operating systems. It offers a range of multiple alignment methods, including L-INS-i (accurate; for alignment of <∼200 sequences) and FFT-NS-2 (fast; for alignment of <∼30,000 sequences). |
CD-HIT-2D |
cd-hit-2d -i db1 -i2 db2 -o db2novel -c 0.9 -n 5 |
CD-HIT-2D compares two protein datasets (db1, db2). It identifies the sequences in db2 that are similar to db1 at a given threshold. The inputs are two protein datasets (db1, db2) in fasta format and the outputs are two files: a fasta file of proteins in db2 that are not similar to db1 and a text file that lists similar sequences between db1 and db2. |
make_multi_seq.pl |
make_multi_seq.pl seq_db dbout.clstr multi-seq 20 |
This script reads the .clstr file, generates a separate fasta file for each cluster over a certain size, and saves it in the designated subdirectory. To run this script correctly, the "-d 0" option should be used in the cd-hit run, and it is better to also use "-g 1" in the cd-hit run to get accurate clustering results. |
FIMO |
fimo [options] <motifs> <database> |
FIMO scans a sequence database for individual matches to each of the motifs you provide. |
CD-HIT-PARA |
cd-hit-para.pl -i nr90 -o nr60 -c 0.6 -n 4 --B hosts --S 64 |
CD-HIT-PARA is a script that runs cd-hit or cd-hit-est in parallel mode. It splits the input database, runs cd-hit or cd-hit-est in parallel on a computer cluster, and finally merges the outputs into a single file. You can run it as you would run cd-hit or cd-hit-est. The input is a protein or DNA/RNA dataset in fasta format and the outputs are two files: a fasta file of representative sequences and a text file listing the clusters. |
clstr_sort_prot_by.pl |
clstr_sort_prot_by.pl input.clstr id > input_sort.clstr |
This script sorts sequences within clusters in a .clstr file by length, name, etc. |
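As an illustration of sorting within clusters, the following Python sketch reorders the member lines of each cluster by sequence length. It is not the script's own code. It assumes each member line starts with an index, whitespace, and a length such as `120aa,` or `450nt,`; the longest-first order is an arbitrary choice for the example, and the function name is hypothetical.

```python
import re

# Assumed member-line layout: "<index><whitespace><length>aa|nt, >name... ..."
LENGTH = re.compile(r"^\d+\s+(\d+)(?:aa|nt),")


def sort_members_by_length(lines):
    """Reorder member lines inside each cluster, longest sequence first."""
    out, members = [], []

    def flush():
        # Stable sort, so members of equal length keep their original order.
        members.sort(key=lambda m: int(LENGTH.match(m).group(1)), reverse=True)
        out.extend(members)
        members.clear()

    for line in lines:
        if line.startswith(">Cluster"):
            flush()
            out.append(line)
        elif line.strip():
            members.append(line)
    flush()
    return out


# Made-up .clstr content for illustration only.
example = [
    ">Cluster 0",
    "0\t90aa, >seqC... at 91.00%",
    "1\t120aa, >seqA... *",
    ">Cluster 1",
    "0\t300aa, >seqD... *",
]

for line in sort_members_by_length(example):
    print(line)
```

The member indices are left as they were, so a renumbering pass would normally follow a sort like this.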
CD-HIT-2D-PARA |
cd-hit-2d-para.pl -i nr -i2 swissprot -o swissprot_vs_nr -c 0.6 -n 4 --Q 20 -T "SGE" --S 2 --S2 20 |
CD-HIT-2D-PARA is a script that runs cd-hit-2d or cd-hit-est-2d in parallel mode. It splits the input databases, runs cd-hit-2d or cd-hit-est-2d in parallel on a computer cluster, and finally merges the outputs into a single file. You can run it as you would run cd-hit-2d or cd-hit-est-2d. The input is a protein or DNA/RNA dataset in fasta format and the outputs are two files: a fasta file of representative sequences and a text file listing the clusters. |
clstr_renumber.pl |
clstr_renumber.pl input.clstr > input_ren.clstr |
This script renumbers clusters and the sequences within clusters in a .clstr file after a merge or other operations. |
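Renumbering can be sketched in a few lines of Python. This is an illustrative approximation, not the script itself; it assumes `>Cluster N` header lines and member lines whose index is separated from the rest of the line by a tab. The function name and sample data are hypothetical.

```python
def renumber(lines):
    """Number clusters 0, 1, 2, ... and restart member indices in each cluster."""
    out = []
    cluster_no = -1
    member_no = 0
    for line in lines:
        if line.startswith(">Cluster"):
            cluster_no += 1
            member_no = 0
            out.append(f">Cluster {cluster_no}")
        elif line.strip():
            # Drop the old index (everything before the first tab, by assumption).
            _, _, rest = line.partition("\t")
            out.append(f"{member_no}\t{rest}")
            member_no += 1
    return out


# Made-up .clstr content with out-of-order numbering, for illustration only.
example = [
    ">Cluster 7",
    "3\t120aa, >seqA... *",
    "0\t90aa, >seqC... at 91.00%",
    ">Cluster 2",
    "5\t300aa, >seqD... *",
]

for line in renumber(example):
    print(line)
```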
CD-HIT |
cd-hit -i db -o db90 -c 0.9 -n 5 |
CD-HIT groups proteins into clusters that meet a user-defined similarity threshold, usually a sequence identity. Each cluster has one representative sequence. The input is a protein dataset in fasta format and the outputs are two files: a fasta file of representative sequences and a text file listing the clusters. |
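The usage examples pair the identity threshold (-c) with a word size (-n), for instance -c 0.9 with -n 5 and -c 0.6 with -n 4. The Python sketch below builds a cd-hit command line with a matching word size. The threshold ranges follow the word-size recommendations of the CD-HIT user guide as best recalled here and should be checked against the installed version; the helper names are hypothetical, and the command is only assembled, not executed.

```python
def protein_word_size(identity):
    """Pick a -n word size for a protein identity threshold (assumed ranges)."""
    if not 0.4 <= identity <= 1.0:
        raise ValueError("expected an identity threshold between 0.4 and 1.0")
    if identity >= 0.7:
        return 5
    if identity >= 0.6:
        return 4
    if identity >= 0.5:
        return 3
    return 2


def cd_hit_command(infile, outfile, identity):
    """Assemble the argument list for a cd-hit run without running it."""
    return [
        "cd-hit",
        "-i", infile,
        "-o", outfile,
        "-c", str(identity),
        "-n", str(protein_word_size(identity)),
    ]


print(" ".join(cd_hit_command("db", "db90", 0.9)))
```

The resulting list can be passed to `subprocess.run` on a machine where cd-hit is installed.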
CD-HIT |
cd-hit -i nr -o nr100 -c 1.00 -n 5 -M 2000 |
CD-HIT groups proteins into clusters that meet a user-defined similarity threshold, usually a sequence identity. Each cluster has one representative sequence. The input is a protein dataset in fasta format and the outputs are two files: a fasta file of representative sequences and a text file listing the clusters. |
clstr2xml.pl |
clstr2xml.pl [-len|-size] input1.clstr [input2.clstr input3.clstr ...] |
This script converts a cluster file, or combines multiple cluster files from a hierarchical cd-hit run, to xml format. The output is sorted by sequence length (default) or cluster size. The input cluster files must be given in the order in which they were generated, that is, the cluster file with the higher identity cutoff comes first. |
CD-HIT-EST-2D |
cd-hit-est-2d -i mrna_human -i2 est_human -o est_human_novel -c 0.95 -n 8 |
CD-HIT-EST-2D compares two nucleotide datasets (db1, db2). It identifies the sequences in db2 that are similar to db1 at a given threshold. The inputs are two DNA/RNA datasets (db1, db2) in fasta format and the outputs are two files: a fasta file of sequences in db2 that are not similar to db1 and a text file that lists similar sequences between db1 and db2. For the same reason as CD-HIT-EST, CD-HIT-EST-2D is best suited to sequences without introns, such as ESTs. |
CD-HIT-EST |
cd-hit-est -i est_human -o est_human95 -c 0.95 -n 8 |
CD-HIT-EST groups a nucleotide dataset into clusters that meet a user-defined similarity threshold, usually a sequence identity. The input is a DNA/RNA dataset in fasta format and the outputs are two files: a fasta file of representative sequences and a text file listing the clusters. Eukaryotic genes usually have long introns, which cause long gaps and make full-length alignments difficult. CD-HIT-EST is therefore best suited to sequences without introns, such as ESTs. |
PSI-CD-HIT |
psi-cd-hit.pl -i nr60 -o nr30 -c 0.3 |
PSI-CD-HIT groups proteins into clusters that meet a user-defined similarity threshold, which can be an identity or an expect value. Each cluster has one representative sequence. The input is a protein dataset in fasta format and the outputs are two files: a fasta file of representative sequences and a text file listing the clusters. |
AME |
ame [options] <sequence_file> <motif_file>+ |
AME identifies motifs that are enriched in your sequences compared to control sequences. |
Tomtom |
tomtom [options] <query file> <target file>+ |
Tomtom compares one or more motifs against a database of known motifs (e.g., JASPAR). It ranks the motifs in the database and produces an alignment for each significant match. |