Summarize a dataset column based upon common column groupings. Akin to the SQL "group by" command.
bedtools groupby [OPTIONS] -i <input> -g <group columns> -c <op. column> -o <operation>
This tool is part of the bedtools suite.
bedtools groupby is a useful tool that mimics the group by clause in database systems. Given a file or stream that is sorted by the appropriate grouping columns (-g), groupby computes summary statistics on another column (-c) in the file or stream. It works with output from all BEDTools as well as any other tab-delimited file or stream. As such, it is generally useful for command-line analyses of any kind, not just genomics-related research.
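Why the input has to be sorted on the grouping columns can be seen in a small stand-in written in plain awk. This is only a sketch of the streaming idea, not bedtools' own implementation; the helper name group_sum and the toy gene/score data are made up for illustration. The helper compares each line's key with the key of the previous line and emits a sum whenever the key changes.

```shell
# Hypothetical helper (not part of bedtools): sum column COL for each run of
# consecutive lines that share the same value in column KEY.
# usage: group_sum KEY COL < tab-delimited input
group_sum() {
  awk -F '\t' -v OFS='\t' -v key="$1" -v col="$2" '
    { k = $key "" }                                        # key as a string
    NR > 1 && k != prev { print prev, total; total = 0 }   # key changed: flush
    { prev = k; total += $col }
    END { if (NR > 0) print prev, total }                  # flush the last group
  '
}

# Toy input, invented for this sketch: a name column and a score column.
toy() { printf 'geneA\t10\ngeneB\t5\ngeneA\t7\n'; }

toy | group_sum 1 2                # unsorted input
toy | sort -k1,1 | group_sum 1 2   # input sorted on the key column
```

On the unsorted stream geneA is reported twice (10 and 7) because its lines are not adjacent; after sorting it is reported once with a total of 17.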
Related tools: bedtools merge
When grouping on multiple columns (e.g., -grp 1,2,3), the data should be pre-grouped accordingly (with commands like sort -k1,1 -k2,2 -k3,3 data.txt). When bedtools groupby detects a change in the group columns, it summarizes all lines belonging to the group that just ended.
If there is only one column but multiple operations, all operations are applied to that column. Likewise, if there is only one operation but multiple columns, that operation is applied to all columns. Otherwise, the number of columns must match the number of operations, and they are applied in respective order. E.g., "-c 5,4,6 -o sum,mean,count" will give the sum of column 5, the mean of column 4, and the count of column 6. The order of output columns will match the ordering given in the command.
-delim "|"
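The way -c columns are paired with -o operations, as described above, can be written out as a short sketch. It only restates the rule from the text; it is not bedtools code, and the helper name pair_cols_ops and its "column:operation" output format are invented for illustration.

```shell
# Hypothetical helper: expand a -c list and an -o list into column:operation
# pairs following the rule in the text (one side may be a single item that is
# reused for every item on the other side; otherwise the counts must match).
# usage: pair_cols_ops COLS OPS     e.g. pair_cols_ops 5,4,6 sum,mean,count
pair_cols_ops() {
  awk -v cols="$1" -v ops="$2" 'BEGIN {
    nc = split(cols, c, ",")
    no = split(ops, o, ",")
    if (nc != no && nc != 1 && no != 1) {
      print "error: number of columns and operations differ" > "/dev/stderr"
      exit 1
    }
    n = (nc > no) ? nc : no
    for (i = 1; i <= n; i++) {
      col = (nc == 1) ? c[1] : c[i]
      op  = (no == 1) ? o[1] : o[i]
      print col ":" op
    }
  }'
}

pair_cols_ops 5,4,6 sum,mean,count                   # one operation per column
pair_cols_ops 9 min,max                              # both operations on column 9
pair_cols_ops 4,5 sum                                # same operation on both columns
pair_cols_ops 1,2 sum,mean,count || echo "rejected"  # counts do not match
```

The first call pairs the lists in order (5:sum, 4:mean, 6:count), the second and third reuse the single item, and the last one is rejected because two columns cannot be matched to three operations.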
Examples
Let’s imagine we have three incredibly interesting genetic variants that we are studying and we are interested in what annotated repeats these variants overlap.
$ cat variants.bed
chr21 9719758 9729320 variant1
chr21 9729310 9757478 variant2
chr21 9795588 9796685 variant3

$ bedtools intersect -a variants.bed -b repeats.bed -wa -wb > variantsToRepeats.bed
$ cat variantsToRepeats.bed
chr21 9719758 9729320 variant1 chr21 9719768 9721892 ALR/Alpha 1004 +
chr21 9719758 9729320 variant1 chr21 9721905 9725582 ALR/Alpha 1010 +
chr21 9719758 9729320 variant1 chr21 9725582 9725977 L1PA3 3288 +
chr21 9719758 9729320 variant1 chr21 9726021 9729309 ALR/Alpha 1051 +
chr21 9729310 9757478 variant2 chr21 9729320 9729809 L1PA3 3897 -
chr21 9729310 9757478 variant2 chr21 9729809 9730866 L1P1 8367 +
chr21 9729310 9757478 variant2 chr21 9730866 9734026 ALR/Alpha 1036 -
chr21 9729310 9757478 variant2 chr21 9734037 9757471 ALR/Alpha 1182 -
chr21 9795588 9796685 variant3 chr21 9795589 9795713 (GAATG)n 308 +
chr21 9795588 9796685 variant3 chr21 9795736 9795894 (GAATG)n 683 +
chr21 9795588 9796685 variant3 chr21 9795911 9796007 (GAATG)n 345 +
chr21 9795588 9796685 variant3 chr21 9796028 9796187 (GAATG)n 756 +
chr21 9795588 9796685 variant3 chr21 9796202 9796615 (GAATG)n 891 +
chr21 9795588 9796685 variant3 chr21 9796637 9796824 (GAATG)n 621 +
We can see that variant1 overlaps with 4 repeats, variant2 with 4 and variant3 with 6. We can use bedtools groupby to summarize the hits for each variant in several useful ways. By default, groupby computes the sum of the -opCols (-c) column.
$ bedtools groupby -i variantsToRepeats.bed -g 1,2,3 -c 9
chr21 9719758 9729320 6353
chr21 9729310 9757478 14482
chr21 9795588 9796685 3604
Now let’s find the minimum repeat score for each variant. We do this by grouping on the variant coordinate columns (i.e. cols. 1, 2 and 3) and asking for the min of the repeat score column (i.e. col. 9).
$ bedtools groupby -i variantsToRepeats.bed -g 1,2,3 -c 9 -o min
chr21 9719758 9729320 1004
chr21 9729310 9757478 1036
chr21 9795588 9796685 308
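To see several summaries of the score column side by side, the grouping logic can also be sketched in plain awk. This is an illustrative stand-in, not bedtools code: the helper names variants_to_repeats and group_stats are hypothetical, and the data is the variantsToRepeats.bed listing from above with spaces converted to tabs. For each group the sketch prints the number of lines, the minimum, and the maximum of the chosen column.

```shell
# Emit the variantsToRepeats.bed rows shown earlier as tab-delimited text.
variants_to_repeats() {
  tr ' ' '\t' <<'EOF'
chr21 9719758 9729320 variant1 chr21 9719768 9721892 ALR/Alpha 1004 +
chr21 9719758 9729320 variant1 chr21 9721905 9725582 ALR/Alpha 1010 +
chr21 9719758 9729320 variant1 chr21 9725582 9725977 L1PA3 3288 +
chr21 9719758 9729320 variant1 chr21 9726021 9729309 ALR/Alpha 1051 +
chr21 9729310 9757478 variant2 chr21 9729320 9729809 L1PA3 3897 -
chr21 9729310 9757478 variant2 chr21 9729809 9730866 L1P1 8367 +
chr21 9729310 9757478 variant2 chr21 9730866 9734026 ALR/Alpha 1036 -
chr21 9729310 9757478 variant2 chr21 9734037 9757471 ALR/Alpha 1182 -
chr21 9795588 9796685 variant3 chr21 9795589 9795713 (GAATG)n 308 +
chr21 9795588 9796685 variant3 chr21 9795736 9795894 (GAATG)n 683 +
chr21 9795588 9796685 variant3 chr21 9795911 9796007 (GAATG)n 345 +
chr21 9795588 9796685 variant3 chr21 9796028 9796187 (GAATG)n 756 +
chr21 9795588 9796685 variant3 chr21 9796202 9796615 (GAATG)n 891 +
chr21 9795588 9796685 variant3 chr21 9796637 9796824 (GAATG)n 621 +
EOF
}

# Hypothetical helper: for each run of lines sharing the same key columns,
# print the key, the line count, and the min and max of column COL.
# usage: group_stats KEYCOLS COL < tab-delimited input   (KEYCOLS like 1,2,3)
group_stats() {
  awk -F '\t' -v OFS='\t' -v keys="$1" -v col="$2" '
    BEGIN { nk = split(keys, k, ",") }
    {
      key = $(k[1]) ""                                  # build the group key
      for (i = 2; i <= nk; i++) key = key OFS $(k[i])
      if (NR > 1 && key != prev) { print prev, n, lo, hi; n = 0 }
      v = $col + 0
      if (n == 0) { lo = v; hi = v }                    # first line of a group
      if (v < lo) lo = v
      if (v > hi) hi = v
      n++
      prev = key
    }
    END { if (NR > 0) print prev, n, lo, hi }
  '
}

variants_to_repeats | group_stats 1,2,3 9
```

For the rows above this reports 4 hits with scores from 1004 to 3288 for the first variant, 4 hits from 1036 to 8367 for the second, and 6 hits from 308 to 891 for the third. Passing 4 instead of 1,2,3 groups on the name column.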
We can also group on just the name column with similar effect.
$ bedtools groupby -i variantsToRepeats.bed -g 4 -c 9 -o min
variant1 1004
variant2 1036
variant3 308