Category

Genomic Interval Manipulation


Usage

bedtools jaccard [OPTIONS] -a <BED/GFF/VCF> -b <BED/GFF/VCF>


Manual

This tool is part of the bedtools suite.

bedtools jaccard calculates the Jaccard statistic between two sets of genomic intervals. The Jaccard statistic (also called the Jaccard index or Jaccard similarity coefficient) measures the similarity of two sets as the ratio of their intersection to their union. For genomic intervals, both quantities are measured in base pairs, so the statistic ranges from 0 (no overlap) to 1 (complete overlap).

Required arguments

  • -a path: Path to the input file A containing genomic intervals (bed/gff/vcf format).
  • -b path: Path to the input file B containing genomic intervals (bed/gff/vcf format).

Options

  • -s: Require same strandedness. That is, only report hits in B that overlap A on the same strand. By default, overlaps are reported without respect to strand.
  • -S: Require different strandedness. That is, only report hits in B that overlap A on the opposite strand. By default, overlaps are reported without respect to strand.
  • -f float: Minimum overlap required as a fraction of A. Default is 1E-9 (i.e., 1bp).
  • -F float: Minimum overlap required as a fraction of B. Default is 1E-9 (i.e., 1bp).
  • -r: Require that the fraction overlap be reciprocal for A AND B. In other words, if -f 0.90 and -r is used, this requires that B overlap 90% of A and A also overlaps 90% of B.
  • -e: Require that the minimum fraction be satisfied for A OR B. In other words, if -e is used with -f 0.90 and -F 0.10, this requires that either 90% of A is covered OR 10% of B is covered. Without -e, both fractions would have to be satisfied.
  • -split: Treat "split" BAM or BED12 entries as distinct BED intervals.
  • -g genome_file: Provide a genome file to enforce consistent chromosome sort order across input files. Only applies when used with -sorted option.
  • -nonamecheck: For sorted data, don't throw an error if the files use different naming conventions for the same chromosome, e.g., "chr1" vs "chr01".
  • -bed: If using BAM input, write output as BED.
  • -header: Print the header from the A file prior to results.
  • -nobuf: Disable buffered output. Using this option will cause each line of output to be printed as it is generated, rather than saved in a buffer. This can be useful in conjunction with other software tools and scripts that need to process one line of bedtools output at a time.
  • -iobuf integer: Specify the amount of memory to use for input buffer. Optional suffixes K/M/G supported. Note: currently has no effect with compressed files.
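
The way -f, -F, -r and -e combine can be sketched in a few lines of Python. This is only an illustration of the semantics described above, not the bedtools implementation; the function name overlap_passes and its parameter names are hypothetical, and intervals are assumed to be half-open (start, end) pairs on the same chromosome.

```python
def overlap_passes(a, b, f=1e-9, F=1e-9, reciprocal=False, either=False):
    """Return True if the overlap between intervals a and b is large enough.

    f          -- minimum overlap as a fraction of A (like -f)
    F          -- minimum overlap as a fraction of B (like -F)
    reciprocal -- apply the A fraction to B as well (like -r)
    either     -- accept if A OR B meets its threshold (like -e)
    """
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return False  # the intervals do not overlap at all
    if reciprocal:
        F = f  # the same fraction must hold for both A and B
    ok_a = overlap / (a[1] - a[0]) >= f
    ok_b = overlap / (b[1] - b[0]) >= F
    return (ok_a or ok_b) if either else (ok_a and ok_b)


a = (0, 100)
b = (10, 200)  # overlaps 90 bp: 90% of A, about 47% of B

print(overlap_passes(a, b, f=0.90))                        # True
print(overlap_passes(a, b, f=0.90, reciprocal=True))       # False
print(overlap_passes(a, b, f=0.95, F=0.10))                # False
print(overlap_passes(a, b, f=0.95, F=0.10, either=True))   # True
```

In this sketch the reciprocal case fails because only 47% of B is covered, and the last call passes because the B threshold alone is met.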

Examples

Default behavior

By default, bedtools jaccard reports the length of the intersection, the length of the union (the summed lengths of A and B minus the intersection), the final Jaccard statistic reflecting the similarity of the two sets, and the number of intersections.

$ cat a.bed
chr1  10  20
chr1  30  40

$ cat b.bed
chr1  15   20

$ bedtools jaccard -a a.bed -b b.bed
intersection  union   jaccard n_intersections
5     20      0.25    1
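
The numbers in this output can be reproduced by hand. The following is a minimal Python sketch, not the bedtools implementation: the helper names (merge, total_length, intersection_length, jaccard) are hypothetical, and it assumes half-open BED-style intervals on a single chromosome and a union equal to the summed lengths of A and B minus the intersection, which is consistent with the output shown.

```python
def merge(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def total_length(intervals):
    """Total number of base pairs covered by non-overlapping intervals."""
    return sum(end - start for start, end in intervals)


def intersection_length(a, b):
    """Base pairs shared by two merged interval lists."""
    shared = 0
    for a_start, a_end in a:
        for b_start, b_end in b:
            shared += max(0, min(a_end, b_end) - max(a_start, b_start))
    return shared


def jaccard(a, b):
    """Return (intersection, union, jaccard) in base pairs."""
    a, b = merge(a), merge(b)
    inter = intersection_length(a, b)
    union = total_length(a) + total_length(b) - inter
    return inter, union, (inter / union if union else 0.0)


a = [(10, 20), (30, 40)]  # intervals from a.bed (chr1 only)
b = [(15, 20)]            # intervals from b.bed (chr1 only)
print(jaccard(a, b))      # (5, 20, 0.25)
```

Here A covers 20 bp, B covers 5 bp, and they share 5 bp, so the union is 20 + 5 - 5 = 20 bp and the statistic is 5 / 20 = 0.25.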

Controlling which intersections are included

One can also control which intersections are included in the statistic by requiring a minimum fraction of overlap with respect to the features in A (via the -f parameter), or by requiring that the fraction of overlap be reciprocal (-r) in A and B.

$ cat a.bed
chr1  10  20
chr1  30  40

$ cat b.bed
chr1  15   20

Require 10% overlap with respect to the intervals in A:

$ bedtools jaccard -a a.bed -b b.bed -f 0.1
intersection  union   jaccard n_intersections
5 20  0.25    1

Require 60% overlap with respect to the intervals in A:

$ bedtools jaccard -a a.bed -b b.bed -f 0.6
intersection  union   jaccard n_intersections
0 25  0    0
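
The effect of the -f threshold on these two runs can be traced with a short Python sketch. It is an illustration under stated assumptions rather than the bedtools algorithm: the function name filtered_jaccard and the parameter min_frac_a are hypothetical, intervals within each file are assumed not to overlap one another, and n_intersections is assumed to count the A/B pairs that pass the filter.

```python
def filtered_jaccard(a, b, min_frac_a=1e-9):
    """Jaccard over overlaps covering at least min_frac_a of the A interval.

    Returns (intersection, union, jaccard, n_intersections).
    """
    inter = 0
    n_intersections = 0
    for a_start, a_end in a:
        for b_start, b_end in b:
            overlap = min(a_end, b_end) - max(a_start, b_start)
            # Keep the overlap only if it covers enough of the A interval.
            if overlap > 0 and overlap / (a_end - a_start) >= min_frac_a:
                inter += overlap
                n_intersections += 1
    total = sum(e - s for s, e in a) + sum(e - s for s, e in b)
    union = total - inter
    stat = inter / union if union else 0.0
    return inter, union, stat, n_intersections


a = [(10, 20), (30, 40)]  # intervals from a.bed (chr1 only)
b = [(15, 20)]            # intervals from b.bed (chr1 only)

print(filtered_jaccard(a, b, 0.1))  # (5, 20, 0.25, 1)
print(filtered_jaccard(a, b, 0.6))  # (0, 25, 0.0, 0)
```

The single overlap covers 5 of the 10 bp in the first A interval (50%), so it passes a 10% threshold but fails a 60% threshold. In the second case nothing is subtracted from the summed lengths, which is why the union grows from 20 to 25 while the intersection drops to 0.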

