CD-HIT-2D compares two protein datasets (db1, db2). It identifies the sequences in db2 that are similar to sequences in db1 at a given threshold. The inputs are two protein datasets (db1, db2) in FASTA format, and the outputs are two files: a FASTA file of the proteins in db2 that are not similar to db1, and a text file that lists the similar sequences between db1 and db2.
cd-hit-2d -i db1 -i2 db2 -o db2novel -c 0.9 -n 5
where db1 and db2 are the inputs, db2novel is the output, 0.9 (meaning 90% identity) is the comparison threshold, and 5 is the word size.
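As a minimal sketch, assuming cd-hit-2d is on the PATH and that db1 and db2 are FASTA files in the current directory, the command above could be wrapped in a small shell script that also reports how many db2 sequences were kept as novel. The count_fasta helper and the variable names are illustrative only and are not part of CD-HIT.

```shell
#!/bin/sh
# Count FASTA records by counting header lines that start with '>'.
# (Illustrative helper, not part of CD-HIT.)
count_fasta() {
    grep -c '^>' "$1"
}

# Hypothetical file names; replace with your own datasets.
DB1=db1
DB2=db2
OUT=db2novel

# Run the comparison only if cd-hit-2d is installed and both inputs exist.
if command -v cd-hit-2d >/dev/null 2>&1 && [ -f "$DB1" ] && [ -f "$DB2" ]; then
    cd-hit-2d -i "$DB1" -i2 "$DB2" -o "$OUT" -c 0.9 -n 5
    echo "db2 sequences:   $(count_fasta "$DB2")"
    echo "novel sequences: $(count_fasta "$OUT")"
fi
```

The difference between the two counts is the number of db2 sequences that matched something in db1 at the chosen threshold.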
The options -b, -M, -l, -d, -t, -s, -S, -B, -p, -aL, -AL, -aS, -AS, -g, -G, and -T are the same as in CD-HIT. Here are a few more cd-hit-2d-specific options: