Manual page from samtools-1.2
released on 2 February 2015

NAME

samtools – Utilities for the Sequence Alignment/Map (SAM) format

SYNOPSIS

samtools view -bt ref_list.txt -o aln.bam aln.sam.gz

samtools sort -T /tmp/aln.sorted -o aln.sorted.bam aln.bam

samtools index aln.sorted.bam

samtools idxstats aln.sorted.bam

samtools view aln.sorted.bam chr2:20,100,000-20,200,000

samtools merge out.bam in1.bam in2.bam in3.bam

samtools faidx ref.fasta

samtools fixmate in.namesorted.sam out.bam

samtools mpileup -C50 -gf ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam

samtools tview aln.sorted.bam ref.fasta

samtools flags PAIRED,UNMAP,MUNMAP

samtools bam2fq input.bam > output.fastq

DESCRIPTION

Samtools is a set of utilities that manipulate alignments in the BAM format. It imports from and exports to the SAM (Sequence Alignment/Map) format, does sorting, merging and indexing, and allows to retrieve reads in any regions swiftly.

Samtools is designed to work on a stream. It regards an input file `-' as the standard input (stdin) and an output file `-' as the standard output (stdout). Several commands can thus be combined with Unix pipes. Samtools always output warning and error messages to the standard error output (stderr).

Samtools is also able to open a BAM (not SAM) file on a remote FTP or HTTP server if the BAM file name starts with `ftp://' or `http://'. Samtools checks the current working directory for the index file and will download the index upon absence. Samtools does not retrieve the entire alignment file unless it is asked to do so.

COMMANDS AND OPTIONS

view

samtools view [options] in.bam|in.sam|in.cram [region...]

With no options or regions specified, prints all alignments in the specified input alignment file (in SAM, BAM, or CRAM format) to standard output in SAM format (with no header).

You may specify one or more space-separated region specifications after the input filename to restrict output to only those alignments which overlap the specified region(s). Use of region specifications requires a coordinate-sorted and indexed input file (in BAM or CRAM format).

The -b, -C, -1, -u, -h, -H, and -c options change the output format from the default of headerless SAM, and the -o and -U options set the output file name(s).

The -t and -T options provide additional reference data. One of these two options is required when SAM input does not contain @SQ headers, and the -T option is required whenever writing CRAM output.

The -L, -r, -R, -q, -l, -m, -f, and -F options filter the alignments that will be included in the output to only those alignments that match certain criteria.

The -x, -B, and -s options modify the data which is contained in each alignment.

Finally, the -@ option can be used to allocate additional threads to be used for compression, and the -? option requests a long help message.

REGIONS:

Regions can be specified as: RNAME[:STARTPOS[-ENDPOS]] and all position coordinates are 1-based.

Important note: when multiple regions are given, some alignments may be output multiple times if they overlap more than one of the specified regions.

Examples of region specifications:

`chr1'

Output all alignments mapped to the reference sequence named `chr1' (i.e. @SQ SN:chr1) .

`chr2:1000000'

The region on chr2 beginning at base position 1,000,000 and ending at the end of the chromosome.

`chr3:1000-2000'

The 1001bp region on chr3 beginning at base position 1,000 and ending at base position 2,000 (including both end positions).

OPTIONS:

-b

Output in the BAM format.

-C

Output in the CRAM format (requires -T).

-1

Enable fast BAM compression (implies -b).

-u

Output uncompressed BAM. This option saves time spent on compression/decompression and is thus preferred when the output is piped to another samtools command.

-h

Include the header in the output.

-H

Output the header only.

-c

Instead of printing the alignments, only count them and print the total number. All filter options, such as -f, -F, and -q, are taken into account.

-?

Output long help and exit immediately.

-o FILE

Output to FILE [stdout].

-U FILE

Write alignments that are not selected by the various filter options to FILE. When this option is used, all alignments (or all alignments intersecting the regions specified) are written to either the output file or this file, but never both.

-t FILE

A tab-delimited FILE. Each line must contain the reference name in the first column and the length of the reference in the second column, with one line for each distinct reference. Any additional fields beyond the second column are ignored. This file also defines the order of the reference sequences in sorting. If you run: `samtools faidx <ref.fa>', the resulting index file <ref.fa>.fai can be used as this FILE.

-T FILE

A FASTA format reference FILE, optionally compressed by bgzip and ideally indexed by samtools faidx. If an index is not present, one will be generated for you.

-L FILE

Only output alignments overlapping the input BED FILE [null].

-r STR

Only output alignments in read group STR [null].

-R FILE

Output alignments in read groups listed in FILE [null].

-q INT

Skip alignments with MAPQ smaller than INT [0].

-l STR

Only output alignments in library STR [null].

-m INT

Only output alignments with number of CIGAR bases consuming query sequence ≥ INT [0]

-f INT

Only output alignments with all bits set in INT present in the FLAG field. INT can be specified in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/) or in octal by beginning with `0' (i.e. /^0[0-7]+/) [0].

-F INT

Do not output alignments with any bits set in INT present in the FLAG field. INT can be specified in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/) or in octal by beginning with `0' (i.e. /^0[0-7]+/) [0].

-x STR

Read tag to exclude from output (repeatable) [null]

-B

Collapse the backward CIGAR operation.

-s FLOAT

Integer part is used to seed the random number generator [0]. Part after the decimal point sets the fraction of templates/pairs to subsample [no subsampling].

-@ INT

Number of BAM compression threads to use in addition to main thread [0].

-S

Ignored for compatibility with previous samtools versions. Previously this option was required if input was in SAM format, but now the correct format is automatically detected by examining the first few characters of input.

tview

samtools tview [-p chr:pos] [-s STR] [-d display] <in.sorted.bam> [ref.fasta]

Text alignment viewer (based on the ncurses library). In the viewer, press `?' for help and press `g' to check the alignment start from a region in the format like `chr10:10,000,000' or `=10,000,000' when viewing the same reference sequence.

Options:

-d display

Output as (H)tml or (C)urses or (T)ext

-p chr:pos

Go directly to this position

-s STR

Display only alignments from this sample or read group

mpileup

samtools mpileup [-EBugp] [-C capQcoef] [-r reg] [-f in.fa] [-l list] [-Q minBaseQ] [-q minMapQ] in.bam [in2.bam [...]]

Generate VCF, BCF or pileup for one or multiple BAM files. Alignment records are grouped by sample (SM) identifiers in @RG header lines. If sample identifiers are absent, each input file is regarded as one sample.

In the pileup format (without -u or -g), each line represents a genomic position, consisting of chromosome name, 1-based coordinate, reference base, the number of reads covering the site, read bases, base qualities and alignment mapping qualities. Information on match, mismatch, indel, strand, mapping quality and start and end of a read are all encoded at the read base column. At this column, a dot stands for a match to the reference base on the forward strand, a comma for a match on the reverse strand, a '>' or '<' for a reference skip, `ACGTN' for a mismatch on the forward strand and `acgtn' for a mismatch on the reverse strand. A pattern `\\+[0-9]+[ACGTNacgtn]+' indicates there is an insertion between this reference position and the next reference position. The length of the insertion is given by the integer in the pattern, followed by the inserted sequence. Similarly, a pattern `-[0-9]+[ACGTNacgtn]+' represents a deletion from the reference. The deleted bases will be presented as `*' in the following lines. Also at the read base column, a symbol `^' marks the start of a read. The ASCII of the character following `^' minus 33 gives the mapping quality. A symbol `$' marks the end of a read segment.

Input Options:

-6, --illumina1.3+

Assume the quality is in the Illumina 1.3+ encoding.

-A, --count-orphans

Do not skip anomalous read pairs in variant calling.

-b, --bam-list FILE

List of input BAM files, one file per line [null]

-B, --no-BAQ

Disable probabilistic realignment for the computation of base alignment quality (BAQ). BAQ is the Phred-scaled probability of a read base being misaligned. Applying this option greatly helps to reduce false SNPs caused by misalignments.

-C, --adjust-MQ INT

Coefficient for downgrading mapping quality for reads containing excessive mismatches. Given a read with a phred-scaled probability q of being generated from the mapped position, the new mapping quality is about sqrt((INT-q)/INT)*INT. A zero value disables this functionality; if enabled, the recommended value for BWA is 50. [0]

-d, --max-depth INT

At a position, read maximally INT reads per input BAM. [250]

-E, --redo-BAQ

Recalculate BAQ on the fly, ignore existing BQ tags

-f, --fasta-ref FILE

The faidx-indexed reference file in the FASTA format. The file can be optionally compressed by bgzip. [null]

-G, --exclude-RG FILE

Exclude reads from readgroups listed in FILE (one @RG-ID per line)

-l, --positions FILE

BED or position list file containing a list of regions or sites where pileup or BCF should be generated. If BED, positions are 0-based half-open [null]

-q, -min-MQ INT

Minimum mapping quality for an alignment to be used [0]

-Q, --min-BQ INT

Minimum base quality for a base to be considered [13]

-r, --region STR

Only generate pileup in region. Requires the BAM files to be indexed. If used in conjunction with -l then considers the intersection of the two requests. STR [all sites]

-R, --ignore-RG

Ignore RG tags. Treat all reads in one BAM as one sample.

--rf, --incl-flags STR|INT

Required flags: skip reads with mask bits unset [null]

--ff, --excl-flags STR|INT

Filter flags: skip reads with mask bits set [UNMAP,SECONDARY,QCFAIL,DUP]

-x, --ignore-overlaps

Disable read-pair overlap detection.

Output Options:

-o, --output FILE

Write pileup or VCF/BCF output to FILE, rather than the default of standard output.

(The same short option is used for both --open-prob and --output. If -o's argument contains any non-digit characters other than a leading + or - sign, it is interpreted as --output. Usually the filename extension will take care of this, but to write to an entirely numeric filename use -o ./123 or --output 123.)

-g, --BCF

Compute genotype likelihoods and output them in the binary call format (BCF). As of v1.0, this is BCF2 which is incompatible with the BCF1 format produced by previous (0.1.x) versions of samtools.

-v, --VCF

Compute genotype likelihoods and output them in the variant call format (VCF). Output is bgzip-compressed VCF unless -u option is set.

Output Options for mpileup format (without -g or -v):

-O, --output-BP

Output base positions on reads.

-s, --output-MQ

Output mapping quality.

Output Options for VCF/BCF format (with -g or -v):

-D

Output per-sample read depth [DEPRECATED - use -t DP instead]

-S

Output per-sample Phred-scaled strand bias P-value [DEPRECATED - use -t SP instead]

-t, --output-tags LIST

Comma-separated list of FORMAT and INFO tags to output (case-insensitive): DP (Number of high-quality bases, FORMAT), DV (Number of high-quality non-reference bases, FORMAT), DPR (Number of high-quality bases for each observed allele, FORMAT), INFO/DPR (Number of high-quality bases for each observed allele, INFO), DP4 (Number of high-quality ref-forward, ref-reverse, alt-forward and alt-reverse bases, FORMAT), SP (Phred-scaled strand bias P-value, FORMAT) [null]

-u, --uncompressed

Generate uncompressed VCF/BCF output, which is preferred for piping.

-V

Output per-sample number of non-reference reads [DEPRECATED - use -t DV instead]

Options for SNP/INDEL Genotype Likelihood Computation (for -g or -v):

-e, --ext-prob INT

Phred-scaled gap extension sequencing error probability. Reducing INT leads to longer indels. [20]

-F, --gap-frac FLOAT

Minimum fraction of gapped reads [0.002]

-h, --tandem-qual INT

Coefficient for modeling homopolymer errors. Given an l-long homopolymer run, the sequencing error of an indel of size s is modeled as INT*s/l. [100]

-I, --skip-indels

Do not perform INDEL calling

-L, --max-idepth INT

Skip INDEL calling if the average per-sample depth is above INT. [250]

-m, --min-ireads INT

Minimum number gapped reads for indel candidates INT. [1]

-o, --open-prob INT

Phred-scaled gap open sequencing error probability. Reducing INT leads to more indel calls. [40]

(The same short option is used for both --open-prob and --output. When -o's argument contains only an optional + or - sign followed by the digits 0 to 9, it is interpreted as --open-prob.)

-p, --per-sample-mF

Apply -m and -F thresholds per sample to increase sensitivity of calling. By default both options are applied to reads pooled from all samples.

-P, --platforms STR

Comma-delimited list of platforms (determined by @RG-PL) from which indel candidates are obtained. It is recommended to collect indel candidates from sequencing technologies that have low indel error rate such as ILLUMINA. [all]

reheader

samtools reheader <in.header.sam> <in.bam>

Replace the header in in.bam with the header in in.header.sam. This command is much faster than replacing the header with a BAM→SAM→BAM conversion.

cat

samtools cat [-h header.sam] [-o out.bam] <in1.bam> <in2.bam> [ ... ]

Concatenate BAMs. The sequence dictionary of each input BAM must be identical, although this command does not check this. This command uses a similar trick to reheader which enables fast BAM concatenation.

sort

samtools sort [-l level] [-m maxMem] [-o out.bam] [-O format] [-n] -T out.prefix [-@ threads] [in.bam]

Sort alignments by leftmost coordinates, or by read name when -n is used. An appropriate @HD-SO sort order header tag will be added or an existing one updated if necessary.

The sorted output is written to standard output by default, or to the specified file (out.bam) when -o is used. This command will also create temporary files out.prefix.%d.bam as needed when the entire alignment data cannot fit into memory (as controlled via the -m option).

Options:

-l INT

Set the desired compression level for the final output file, ranging from 0 (uncompressed) or 1 (fastest but minimal compression) to 9 (best compression but slowest to write), similarly to gzip(1)'s compression level setting.

If -l is not used, the default compression level will apply.

-m INT

Approximately the maximum required memory per thread, specified either in bytes or with a K, M, or G suffix. [768 MiB]

-n

Sort by read names (i.e., the QNAME field) rather than by chromosomal coordinates.

-o FILE

Write the final sorted output to FILE, rather than to standard output.

-O FORMAT

Write the final output as sam, bam, or cram.

By default, samtools tries to select a format based on the -o filename extension; if output is to standard output or no format can be deduced, -O must be used.

-T PREFIX

Write temporary files to PREFIX.nnnn.bam. This option is required.

-@ INT

Set number of sorting and compression threads. By default, operation is single-threaded.

For compatibility with existing scripts, samtools sort also accepts the previous less flexible way of specifying the final and temporary output filenames:

samtools sort [-nof] [-m maxMem] in.bam out.prefix

The sorted BAM output is written to out.prefix.bam (or as determined by the -o and -f options below) and any temporary files are written alongside as out.prefix.%d.bam.

-o

Output the final alignment to the standard output.

-f

Use out.prefix as the full output path and do not append .bam suffix.

-l, -m, -n, -@

Accepted with the same meanings as above.

This will eventually be removed; you should move to using the more flexible newer style of invocation.

merge

samtools merge [-nur1f] [-h inh.sam] [-R reg] [-b <list>] <out.bam> <in1.bam> <in2.bam> [<in3.bam> ... <inN.bam>]

Merge multiple sorted alignment files, producing a single sorted output file that contains all the input records and maintains the existing sort order.

If -h is specified the @SQ headers of input files will be merged into the specified header, otherwise they will be merged into a composite header created from the input headers. If in the process of merging @SQ lines for coordinate sorted input files, a conflict arises as to the order (for example input1.bam has @SQ for a,b,c and input2.bam has b,a,c) then the resulting output file will need to be re-sorted back into coordinate order.

Unless the -c or -p flags are specified then when merging @RG and @PG records into the output header then any IDs found to be duplicates of existing IDs in the output header will have a suffix appended to them to diffientiate them from similar header records from other files and the read records will be updated to reflect this.

OPTIONS:

-1

Use zlib compression level 1 to compress the output.

-b FILE

List of input BAM files, one file per line.

-f

Force to overwrite the output file if present.

-h FILE

Use the lines of FILE as `@' headers to be copied to out.bam, replacing any header lines that would otherwise be copied from in1.bam. (FILE is actually in SAM format, though any alignment records it may contain are ignored.)

-n

The input alignments are sorted by read names rather than by chromosomal coordinates

-R STR

Merge files in the specified region indicated by STR [null]

-r

Attach an RG tag to each alignment. The tag value is inferred from file names.

-u

Uncompressed BAM output

-c

Combine RG tags with colliding IDs rather than adding a suffix to differentiate them.

-p

Combine PG tags with colliding IDs rather than adding a suffix to differentiate them.

index

samtools index [-bc] [-m INT] aln.bam|aln.cram

Index a coordinate-sorted BAM or CRAM file for fast random access. This index is needed when region arguments are used to limit samtools view and similar commands to particular regions of interest.

For a CRAM file aln.cram, index file aln.cram.crai will be created. For a BAM file aln.bam, either aln.bam.bai or aln.bam.csi will be created, depending on the index format selected.

Options:

-b

Create a BAI index. This is currently the default when no format options are used.

-c

Create a CSI index. By default, the minimum interval size for the index is 2^14, which is the same as the fixed value used by the BAI format.

-m INT

Create a CSI index, with a minimum interval size of 2^INT.

idxstats

samtools idxstats <aln.bam>

Retrieve and print stats in the index file. The output is TAB-delimited with each line consisting of reference sequence name, sequence length, # mapped reads and # unmapped reads.

faidx

samtools faidx <ref.fasta> [region1 [...]]

Index reference sequence in the FASTA format or extract subsequence from indexed reference sequence. If no region is specified, faidx will index the file and create <ref.fasta>.fai on the disk. If regions are specified, the subsequences will be retrieved and printed to stdout in the FASTA format. The input file can be compressed in the BGZF format.

fixmate

samtools fixmate [-rpc] [-O format] in.nameSrt.bam out.bam

Fill in mate coordinates, ISIZE and mate related flags from a name-sorted alignment.

OPTIONS:

-r

Remove secondary and unmapped reads.

-p

Disable FR proper pair check.

-c

Add template cigar ct tag.

-O FORMAT

Write the final output as sam, bam, or cram.

By default, samtools tries to select a format based on the output filename extension; if output is to standard output or no format can be deduced, -O must be used.

rmdup

samtools rmdup [-sS] <input.srt.bam> <out.bam>

Remove potential PCR duplicates: if multiple read pairs have identical external coordinates, only retain the pair with highest mapping quality. In the paired-end mode, this command ONLY works with FR orientation and requires ISIZE is correctly set. It does not work for unpaired reads (e.g. two ends mapped to different chromosomes or orphan reads).

OPTIONS:

-s

Remove duplicates for single-end reads. By default, the command works for paired-end reads only.

-S

Treat paired-end reads and single-end reads.

calmd

samtools calmd [-EeubSr] [-C capQcoef] <aln.bam> <ref.fasta>

Generate the MD tag. If the MD tag is already present, this command will give a warning if the MD tag generated is different from the existing tag. Output SAM by default.

OPTIONS:

-A

When used jointly with -r this option overwrites the original base quality.

-e

Convert a the read base to = if it is identical to the aligned reference base. Indel caller does not support the = bases at the moment.

-u

Output uncompressed BAM

-b

Output compressed BAM

-S

The input is SAM with header lines

-C INT

Coefficient to cap mapping quality of poorly mapped reads. See the pileup command for details. [0]

-r

Compute the BQ tag (without -A) or cap base quality by BAQ (with -A).

-E

Extended BAQ calculation. This option trades specificity for sensitivity, though the effect is minor.

targetcut

samtools targetcut [-Q minBaseQ] [-i inPenalty] [-0 em0] [-1 em1] [-2 em2] [-f ref] <in.bam>

This command identifies target regions by examining the continuity of read depth, computes haploid consensus sequences of targets and outputs a SAM with each sequence corresponding to a target. When option -f is in use, BAQ will be applied. This command is only designed for cutting fosmid clones from fosmid pool sequencing [Ref. Kitzman et al. (2010)].

phase

samtools phase [-AF] [-k len] [-b prefix] [-q minLOD] [-Q minBaseQ] <in.bam>

Call and phase heterozygous SNPs.

OPTIONS:

-A

Drop reads with ambiguous phase.

-b STR

Prefix of BAM output. When this option is in use, phase-0 reads will be saved in file STR.0.bam and phase-1 reads in STR.1.bam. Phase unknown reads will be randomly allocated to one of the two files. Chimeric reads with switch errors will be saved in STR.chimeric.bam. [null]

-F

Do not attempt to fix chimeric reads.

-k INT

Maximum length for local phasing. [13]

-q INT

Minimum Phred-scaled LOD to call a heterozygote. [40]

-Q INT

Minimum base quality to be used in het calling. [13]

flags

samtools flags INT|STR[,...]

Convert between textual and numeric flag representation.

FLAGS:

0x1PAIRED.. paired-end (or multiple-segment) sequencing technology
0x2PROPER_PAIR.. each segment properly aligned according to the aligner
0x4UNMAP.. segment unmapped
0x8MUNMAP.. next segment in the template unmapped
0x10REVERSE.. SEQ is reverse complemented
0x20MREVERSE.. SEQ of the next segment in the template is reversed
0x40READ1.. the first segment in the template
0x80READ2.. the last segment in the template
0x100SECONDARY.. secondary alignment
0x200QCFAIL.. not passing quality controls
0x400DUP.. PCR or optical duplicate
0x800SUPPLEMENTARY.. supplementary alignment

bam2fq

samtools bam2fq [-nO] [-s <outSE.fq>] <in.bam>

Converts a bam into FASTQ format.

OPTIONS:

-n

By default, either '/1' or '/2' is added to the end of read names where the corresponding BAM_READ1 or BAM_READ2 flag is set. Using -n causes read names to be left as they are.

-O

Use quality values from OQ tags in preference to standard quality string if available.

-s FILE

Write singleton reads in FASTQ format to FILE instead of outputting them.

help, --help

Display a brief usage message listing the samtools commands available. If the name of a command is also given, e.g., samtools help view, the detailed usage message for that particular command is displayed.

--version

Display the version numbers and copyright information for samtools and the important libraries used by samtools.

--version-only

Display the full samtools version number in a machine-readable format.

REFERENCE SEQUENCES

The CRAM format requires use of a reference sequence for both reading and writing.

When reading a CRAM the @SQ headers are interrogated to identify the reference sequence MD5sum (M5: tag) and the local reference sequence filename (UR: tag). Note that http:// and ftp:// based URLs in the UR: field are not used, but local fasta filenames (with or without file://) can be used.

To create a CRAM the @SQ headers will also be read to identify the reference sequences, but M5: and UR: tags may not be present. In this case the -T and -t options of samtools view may be used to specify the fasta or fasta.fai filenames respectively (provided the .fasta.fai file is also backed up by a .fasta file).

The search order to obtain a reference is:

Use any local file specified by the command line options (eg -T).

Look for MD5 via REF_CACHE environment variable.

Look for MD5 in each element of the REF_PATH environment variable.

Look for a local file listed in the UR: header tag.

ENVIRONMENT VARIABLES

REF_PATH

A colon separated (semi-colon on Windows) list of locations in which to look for sequences identified by their MD5sums. This can be either a list of directories or URLs. Note that if a URL is included then the colon in http:// and ftp:// and the optional port number will be treated as part of the URL and not a PATH field separator. For URLs, the text %s will be replaced by the MD5sum being read.

If no REF_PATH has been specified it will default to http://www.ebi.ac.uk/ena/cram/md5/%s and if REF_CACHE is also unset, it will be set to $XDG_CACHE_HOME/hts-ref/%2s/%2s/%s. If $XDG_CACHE_HOME is unset, $HOME/.cache (or a local system temporary directory if no home directory is found) will be used similarly.

REF_CACHE

This can be defined to a single directory housing a local cache of references. Upon downloading a reference it will be stored in the location pointed to by REF_CACHE. When reading a reference it will be looked for in this directory before searching REF_PATH. To avoid many files being stored in the same directory, a pathname may be constructed using %nums and %s notation, consuming num characters of the MD5sum. For example /local/ref_cache/%2s/%2s/%s will create 2 nested subdirectories with the filenames in the deepest directory being the last 28 characters of the md5sum.

The REF_CACHE directory will be searched for before attempting to load via the REF_PATH search list. If no REF_PATH is defined, both REF_PATH and REF_CACHE will be automatically set (see above), but if REF_PATH is defined and REF_CACHE not then no local cache is used.

To aid population of the REF_CACHE directory a script misc/seq_ref_populate.pl is provided in the Samtools distribution. This takes a fasta file or a directory of fasta files and generates the MD5sum named files.

EXAMPLES

Import SAM to BAM when @SQ lines are present in the header:

samtools view -bS aln.sam > aln.bam
If @SQ lines are absent:
samtools faidx ref.fa
samtools view -bt ref.fa.fai aln.sam > aln.bam
where ref.fa.fai is generated automatically by the faidx command.

Convert a BAM file to a CRAM file using a local reference sequence.

samtools view -C -T ref.fa aln.bam > aln.cram

Attach the RG tag while merging sorted alignments:

perl -e 'print "@RG\\tID:ga\\tSM:hs\\tLB:ga\\tPL:Illumina\\n@RG\\tID:454\\tSM:hs\\tLB:454\\tPL:454\\n"' > rg.txt
samtools merge -rh rg.txt merged.bam ga.bam 454.bam
The value in a RG tag is determined by the file name the read is coming from. In this example, in the merged.bam, reads from ga.bam will be attached RG:Z:ga, while reads from 454.bam will be attached RG:Z:454.

Call SNPs and short INDELs:

samtools mpileup -uf ref.fa aln.bam | bcftools call -mv > var.raw.vcf
bcftools filter -s LowQual -e '%QUAL<20 || DP>100' var.raw.vcf  > var.flt.vcf
The bcftools filter command marks low quality sites and sites with the read depth exceeding a limit, which should be adjusted to about twice the average read depth (bigger read depths usually indicate problematic regions which are often enriched for artefacts). One may consider to add -C50 to mpileup if mapping quality is overestimated for reads containing excessive mismatches. Applying this option usually helps BWA-short but may not other mappers.

Individuals are identified from the SM tags in the @RG header lines. Individuals can be pooled in one alignment file; one individual can also be separated into multiple files. The -P option specifies that indel candidates should be collected only from read groups with the @RG-PL tag set to ILLUMINA. Collecting indel candidates from reads sequenced by an indel-prone technology may affect the performance of indel calling.

Generate the consensus sequence for one diploid individual:

samtools mpileup -uf ref.fa aln.bam | bcftools call -c | vcfutils.pl vcf2fq > cns.fq

Phase one individual:

samtools calmd -AEur aln.bam ref.fa | samtools phase -b prefix - > phase.out
The calmd command is used to reduce false heterozygotes around INDELs.

Dump BAQ applied alignment for other SNP callers:

samtools calmd -bAr aln.bam > aln.baq.bam
It adds and corrects the NM and MD tags at the same time. The calmd command also comes with the -C option, the same as the one in pileup and mpileup. Apply if it helps.

LIMITATIONS

Unaligned words used in bam_import.c, bam_endian.h, bam.c and bam_aux.c.

Samtools paired-end rmdup does not work for unpaired reads (e.g. orphan reads or ends mapped to different chromosomes). If this is a concern, please use Picard's MarkDuplicates which correctly handles these cases, although a little slower.

AUTHOR

Heng Li from the Sanger Institute wrote the C version of samtools. Bob Handsaker from the Broad Institute implemented the BGZF library. John Marshall and Petr Danecek contribute to the source code and various people from the 1000 Genomes Project have contributed to the SAM format specification.

SEE ALSO

bcftools(1), sam(5)

Samtools website: <http://samtools.sourceforge.net>
Samtools latest source: <https://github.com/samtools/samtools>
File format specification of SAM/BAM,CRAM,VCF/BCF: <http://samtools.github.io/hts-specs>
HTSlib website: <https://github.com/samtools/htslib>
Bcftools website: <http://samtools.github.io/bcftools>