Manual page from htslib-1.20
released on 15 April 2024

NAME

annot-tsv - transfer annotations from one TSV (tab-separated values) file into another

SYNOPSIS

annot-tsv [OPTIONS]

DESCRIPTION

The program finds overlaps in two sets of genomic regions (for example two CNV call sets) and annotates regions of the target file (-t, --target-file) with information from overlapping regions of the source file (-s, --source-file).

It can transfer one or multiple columns (-f, --transfer), and the transfer can be made conditional on matching values in one or more columns (-m, --match). In addition to column transfer (-f) and special annotations (-a, --annotate), the program can operate in a simple grep-like mode and print matching lines (when neither -f nor -a is given) or drop matching lines (-x, --drop-overlaps).

All indexes and coordinates are 1-based and inclusive.
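
As an illustration of the 1-based, inclusive convention, the following minimal sketch computes how many base pairs two intervals share. The helper name overlap_bp is hypothetical and is not part of annot-tsv; it only shows the arithmetic implied by the convention, not the program's implementation.

```shell
# overlap_bp BEG1 END1 BEG2 END2
# Print the number of base pairs shared by two 1-based, inclusive intervals
# (0 when they do not overlap). Hypothetical helper, not part of annot-tsv.
overlap_bp() {
    beg=$(( $1 > $3 ? $1 : $3 ))   # larger of the two start positions
    end=$(( $2 < $4 ? $2 : $4 ))   # smaller of the two end positions
    if [ "$end" -ge "$beg" ]; then
        echo $(( end - beg + 1 ))  # +1 because both end points are included
    else
        echo 0
    fi
}

# Source region 100-200 against target region 150-200
overlap_bp 100 200 150 200
```

Because both end points are counted, the regions 100-200 and 150-200 share 51 positions rather than 50.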

OPTIONS

Common Options

-c, --core SRC:TGT List of names of the core columns, in the order of chromosome, start and end positions, irrespective of the header name and order in which they appear in source or target files (for example "chr,beg,end:CHROM,START,END"). If both files use the same header names, the TGT names can be omitted (for example "chr,beg,end"). If the SRC or TGT file has no header, 1-based indexes can be given instead (for example "chr,beg,end:3,1,2"). Note that regions are not required; the program can also work with a list of positions (for example "chr,beg,end:CHROM,POS,POS").

-f, --transfer SRC:TGT Comma-separated list of columns to transfer. If the SRC column does not exist, its name is interpreted as the default value to fill in when a match is found; a dot (".") is filled in when no match is found. If the TGT column does not exist, a new column is created. If the TGT column already exists, its values are overwritten when an overlap is found and left as is otherwise.

-m, --match SRC:TGT The columns whose values are required to be identical in the source and target files

-o, --output FILE Output file name; by default the result is printed on standard output

-s, --source-file FILE Source file with annotations to transfer

-t, --target-file FILE Target file to be extended with annotations from -s, --source-file

Other options

--allow-dups Add the same annotations multiple times if multiple overlaps are found

--max-annots INT Add at most INT annotations per column to save time when many overlaps are found with a single region

--version Print version string and exit

-a, --annotate LIST Add one or more special annotations and their target names, separated by ':' (for example "nbp,cnt:pair,count"). If no target name is given, the special annotation's name will be used in the output header.

cnt number of overlapping regions

frac fraction of the target region with an overlap

nbp number of source base pairs in the overlap

-H, --ignore-headers Ignore the headers completely and use numeric indexes even when a header exists

-O, --overlap FLOAT Minimum overlap as a fraction of region length in at least one of the overlapping regions. If -r, --reciprocal is also given, require at least FLOAT overlap with respect to both regions

-r, --reciprocal Require the -O, --overlap with respect to both overlapping regions

-x, --drop-overlaps Drop overlapping regions (cannot be combined with -f, --transfer)
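
The difference between -O, --overlap alone and in combination with -r, --reciprocal can be sketched as follows. This is an illustration of the documented semantics under stated assumptions, not the program's code: the function name passes_overlap and its "reciprocal" argument are hypothetical, and coordinates are taken to be 1-based and inclusive.

```shell
# passes_overlap TBEG TEND SBEG SEND MIN_FRAC [reciprocal]
# Exit status 0 when the overlap between a target and a source region reaches
# MIN_FRAC of at least one region (default), or of both regions when the
# optional sixth argument is "reciprocal". Hypothetical helper for illustration.
passes_overlap() {
    awk -v tb="$1" -v te="$2" -v sb="$3" -v se="$4" \
        -v min="$5" -v mode="${6:-either}" '
    BEGIN {
        beg = (tb > sb) ? tb : sb                # start of the shared part
        end = (te < se) ? te : se                # end of the shared part
        nbp = (end >= beg) ? end - beg + 1 : 0   # shared base pairs, inclusive
        ft  = nbp / (te - tb + 1)                # fraction of the target covered
        fs  = nbp / (se - sb + 1)                # fraction of the source covered
        if (mode == "reciprocal")
            ok = (nbp > 0 && ft >= min && fs >= min)
        else
            ok = (nbp > 0 && (ft >= min || fs >= min))
        exit ok ? 0 : 1
    }'
}

# Target 150-200 lies entirely within source 100-200: the overlap is 100% of
# the target but only about 50% of the source.
passes_overlap 150 200 100 200 0.8 && echo "kept with -O 0.8"
passes_overlap 150 200 100 200 0.8 reciprocal || echo "dropped with -O 0.8 -r"
```

In this sketch the pair passes the one-sided test because the target is fully covered, but fails the reciprocal test because only about half of the source region is covered.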

EXAMPLE

Both the SRC and TGT input files must be tab-delimited, with or without a header; their columns can be named differently and can appear in arbitrary order. For example, consider the source file

#chr   beg   end   sample   type   qual
chr1   100   200   smpl1    DEL    10
chr1   300   400   smpl2    DUP    30

and the target file

150   200   chr1   smpl1
150   200   chr1   smpl2
350   400   chr1   smpl1
350   400   chr1   smpl2

In the first example we transfer type and quality, but only for regions with a matching sample. Notice that the header is present in SRC but not in TGT, therefore we use column indexes for the latter:

annot-tsv -s src.txt.gz -t tgt.txt.gz -c chr,beg,end:3,1,2 -m sample:4 -f type,qual

150   200   chr1   smpl1   DEL   10
150   200   chr1   smpl2   .     .
350   400   chr1   smpl1   .     .
350   400   chr1   smpl2   DUP   30
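
To try the first example, the two input files can be created as sketched below. The listings above are shown with spaces for readability, whereas the program expects real tab characters, so printf with \t is used. The scratch directory and the uncompressed file names src.txt and tgt.txt are assumptions made for this sketch, and the annot-tsv command is run only if the program is found on the PATH.

```shell
# Work in a scratch directory so that no existing files are overwritten.
workdir=$(mktemp -d)

# Source file: header line plus two regions, tab-delimited.
printf '#chr\tbeg\tend\tsample\ttype\tqual\n' >  "$workdir/src.txt"
printf 'chr1\t100\t200\tsmpl1\tDEL\t10\n'     >> "$workdir/src.txt"
printf 'chr1\t300\t400\tsmpl2\tDUP\t30\n'     >> "$workdir/src.txt"

# Target file: no header, columns in the order start, end, chromosome, sample.
printf '150\t200\tchr1\tsmpl1\n' >  "$workdir/tgt.txt"
printf '150\t200\tchr1\tsmpl2\n' >> "$workdir/tgt.txt"
printf '350\t400\tchr1\tsmpl1\n' >> "$workdir/tgt.txt"
printf '350\t400\tchr1\tsmpl2\n' >> "$workdir/tgt.txt"

# Run the first example only when annot-tsv is installed.
if command -v annot-tsv >/dev/null 2>&1; then
    annot-tsv -s "$workdir/src.txt" -t "$workdir/tgt.txt" \
        -c chr,beg,end:3,1,2 -m sample:4 -f type,qual
fi
```

The target columns are given as 3,1,2 because the chromosome is in the third column and the start and end positions are in the first and second.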

The next example demonstrates the special annotations nbp and cnt, which are written to the output columns pair and count. In this case we use a target file with headers so that column names will be copied to the output:

#from	to	chrom	sample
150	200	chr1	smpl1
150	200	chr1	smpl2
350	400	chr1	smpl1
350	400	chr1	smpl2

annot-tsv -s src.txt.gz -t tgt_hdr.txt.gz -c chr,beg,end:chrom,from,to -m sample -f type,qual -a nbp,cnt:pair,count
#[1]from	[2]to	[3]chrom	[4]sample	[5]type	[6]qual	[7]pair	[8]count
150	200	chr1	smpl1	DEL	10	51	1
150	200	chr1	smpl2	.	.	0	0
350	400	chr1	smpl1	.	.	0	0
350	400	chr1	smpl2	DUP	30	51	1

Either the SRC or the TGT file can be streamed from standard input:

cat src.txt | annot-tsv -t tgt.txt -c chr,beg,end:3,1,2 -m sample:4 -f type,qual -o output.txt
cat tgt.txt | annot-tsv -s src.txt -c chr,beg,end:3,1,2 -m sample:4 -f type,qual -o output.txt

The program can be used in a grep-like mode to print only the matching regions of the target file, without modifying the records:

annot-tsv -s src.txt -t tgt.txt -c chr,beg,end:3,1,2 -m sample:4
150   200   chr1   smpl1
350   400   chr1   smpl2

AUTHORS

The program was written by Petr Danecek and was originally published on GitHub as annot-regs.

COPYING

The MIT/Expat License, see the LICENSE document for details.
Copyright (c) Genome Research Ltd.