Sequencing instruments produce unaligned data, typically in FASTQ format. It is possible to store unaligned data in BAM or CRAM, and indeed it may be preferable as it permits meta-data in the header and per-record auxiliary tags, however in this workflow we consider the end product to be a sorted aligned BAM or CRAM file so we cover that at the end.
There are two primary ways of producing this:
There are many tools for mapping sequences (reads) against a reference file. We do not recommend any single tool over another, but for illustrative purposes are using Minimap2.
There are processing steps which together make a pipeline:
We initially break these down into individual stages, reading and writing from disk, but it is worth considering using a UNIX pipeline to link these together avoiding temporary files.
A basic mapping tool may take a single or pair of FASTQ files and align against either a pre-built index or directly against a fasta file (as used below). The most naive approach is simply to save this file as SAM:
minimap2 -t 8 -a -x sr C.Elegans.fa SRR065390_1.fastq SRR065390_2.fastq -o CE.sam
This outputs in SAM (-a), uses 8 threads (-t 8), with options for paired end short read (-x sr).
This output file will be in the original input file order, hence the read pairs will be collated next to each other. This is important as the next step requires name-collated data. Note some aligners may shuffle the data a bit when multi-threading is enabled, but in all cases the output will still be name-collated.
Note if you use nohup
to run a command and redirect stdout using
e.g. > CE.sam
, then you may mix stdout and stderr together. This
can be prevented by also adding 2>err
to explicitly save stderr to a
separate file or preferably use an explicit output option, such as the
-o CE.sam
above.
The samtools fixmate
tool corrects any flaws in read-pairing that
may have been introduced by the aligner. Sadly a number of them have
subtle bugs and quirks, so this can be considered as a proof-reading
step. It ensures the SAM FLAG, RNEXT, PNEXT and TLEN fields are
correct and internally consistent.
Note samtools fixmate
, as with other sub-commands, can read SAM
directly without needing an explicit option. You may find older
tutorials which use -S
to indicate SAM or have an explicit SAM to
BAM conversion using samtools view
. This is not necessary.
samtools fixmate -O bam,level=1 CE.sam fixmate.bam
The -O bam,level=1
requests the fastest level of BAM compression
for the output file. We could also use level=0
(or use -u
) for
uncompressed output.
Marking duplicates requires some analysis performed on data in read-name collated order, and some performed in genome-position order. Rather than use a markdup tool that internally does sorting, we break it down into two stages so it can be slotted in to our pipeline without requiring any additional sorting requirements.
See Duplicate Marking for further details.
The first stage is also implemented with samtools fixmate
, so we can
amend the previous step with an additional flag:
samtools fixmate -O bam,level=1 -m CE.sam fixmate.bam
This adds mate cigar (MC) and mate score tags (ms) which will be used
later by samtools markdup
proper.
The data can now be converted to genome chromosome and coordinate order with:
samtools sort -l 1 -@8 -o pos.srt.bam -T /tmp/example_prefix fixmate.bam
Sort is highly parallel so the -@8
option here enables to use of 8
additional CPU threads. It can also be sped up by providing it with
more memory, but note the memory option (-m
) is per-thread. The -l
1
indicates level 1 compression again. We could also specify -O
bam,level=1
as used above.
The main core of marking duplicates may now be ran on the
position-sorted file, utilising the extra tags added during the
fixmate
step.
samtools markdup -O bam,level=1 pos.srt.bam markdup.bam
At this point you can convert to a more highly compressed BAM or to
CRAM with samtools view
samtools view -@8 markdup.bam -o final.bam
or
samtools view -T C.Elegans.fa -@8 markdup.bam -o final.cram
Note if there is no other processing to do after markdup, the final compression level and output format may be specified directly in that command. See below for an example.
You may notice that the output from each stage is used solely as the input to the next stage. We’d want to tidy up these intermediates at the end too. While the above may allow you to specify the commands in a workflow language (e.g. CWL) they are not the most efficient method.
Using Unix pipelines is a faster approach. Furthermore the lack of
needing temporary files on disk (excepting any internal temporary
files output by sort
) means we can use uncompressed BAM for maximum
speed. All samtools commands accept -
as a synonym for stdin and stdout.
Most commands have a -u
option to request uncompressed BAM,
but all should also accept a more explicit -O bam,level=0
.
An example pipeline would be:
minimap2 -t 8 -a -x sr C.Elegans.fa SRR065390_[12].fastq | \
samtools fixmate -u -m - - | \
samtools sort -u -@2 -T /tmp/example_prefix - | \
samtools markdup -@8 --reference C.Elegans.fa - final.cram
Adjusting the number of threads used by each step may be organism and system specific, but samtools will only use as many as it needs so specifying too many threads may not be too detrimental.
When estimating the total number of concurrent threads to allocate, consider that the sort step is a crunch point that separates the steps before it from the step afterwards. The mapping, fixmate and partial sort to temporary file steps will operate in parallel. Once complete, the sort merge (from temporary files) and markdup steps will then run in parallel.
The alignment is typically the bottleneck, so some pipelines may wish
to split the input into multiple jobs and distribute multiple mapping
jobs. These could be combined together (either before or after
fixmate
) using samtools cat
.
Finally we would recommend enabling pipefail
in your shell if
available. In Bash this is:
set -o pipefail
Normally the shell will only report the exit status of the final command in the pipeline. With pipefail, a failure in any part of the pipe will make the entire pipeline fail.
None of the steps above removed data provided the aligner included the unmapped reads rather than discarding them. (Most will do this automatically.)
The original FASTQ can be produced, albeit perhaps not in the exact
same order, by sorting your position-sorted back to name-sorted order
and then running samtools fastq
. The aligners will not care if the
data is precisely the same order as produced by the sequencing
instrument. Storing your FASTQ in aligned BAM or CRAM format can be
one very simple way of reducing long-term storage requirements.
samtools sort -n -@8 final.cram | \
samtools fastq - -1 dat_1.fq -2 dat_2.fq > /dev/null
We don’t wish to cover this in detail as it is a complex topic and readers should follow the procedures recommended by the authors of the assembly software.
However note that most sequence assemblers produce a consensus, in FASTA or FASTQ format and not individual alignments of every sequence. If you wish to get the BAM or CRAM file of data aligned against this consensus, for purposes of curation or downstream analysis, then simply follow the Mapping section above (with an additional step, if required, of building an index on the assembly consensus prior to mapping).
Note if using CRAM then care must be taken to ensure you do not lose the consensus file, or to amend the CRAM options to either embed the reference into the CRAM file or to use non-reference based compression. An example of embedding the reference is:
samtools view -O CRAM,embed_ref in.sam -o out.cram
Copyright © 2023 Genome Research Limited (reg no. 2742969) is a charity registered in England with number 1021457. Terms and conditions.