These benchmarks use the faster CRAM codecs, primarily deflate and rANS. For comparison we also include Io_lib's Scramble tool for bzip2 and lzma CRAM (not yet supported in htslib), and the Deez tool on one data set.
This test set is chr1 of NA12878_S1, downloaded from http://www.ebi.ac.uk/ena/data/view/ERP002490
Conversion from BAM to
All times are reported as wall-clock time. These algorithms are typically CPU bound, so the CPU time is comparable.
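As a rough illustration of the difference between the two measurements, the sketch below times a child process and reports both wall-clock and CPU seconds. It is not the harness used for these benchmarks: the function name `time_command` is hypothetical, and the `resource` module it relies on is only available on Unix-like systems.

```python
import resource
import subprocess
import time


def time_command(argv):
    """Run argv to completion and return (wall_seconds, cpu_seconds).

    CPU time is user + system time accumulated by child processes
    between the two getrusage() snapshots.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(argv, check=True)
    wall = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    return wall, cpu
```

For a single-threaded, CPU-bound conversion the two numbers should be close. A large gap would suggest the run was waiting on I/O, or using several threads if CPU time exceeds wall-clock time.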
To compare CRAM efficiency in a variety of circumstances we chose a smaller dataset to more completely explore the parameter space. MiSeq_Ecoli_DH10B_110721_PF.bam is the smallest example data taken from the Deez paper, so we also include Deez here for comparison.
|CRAM v3+bz2||850165878||124||45.4||Via Scramble -j|
Extra decoding time for CRAM v3 is largely explained by the additional CRC checksums.
The effect of varying compression levels:
Compression level “u” is uncompressed. Note there is almost no difference in speed between CRAM level 1 and the default level. Maybe we need to make -1 more aggressively fast at the expense of ratio.
Also note that BAM -1 is slower to encode than CRAM v3 at default levels, although it will be faster to decode. That makes me wonder how we should deal with temporary files. (Ideally with neither BAM nor CRAM compression, but with LZ4 or Snappy.)
### Embedding & Reference-less encoding
Embedded reference - no external file dependencies:
Note that because the sequence depth is high, this is almost the same as the default mode of using an external reference.
Non-reference encoding - all sequence bases are stored verbatim:
The significant speed difference between version 2.1 and 3.0 is due to improved ways of storing multi-base differences instead of requiring one CRAM feature for each base call.
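The page does not record the exact commands used for the external, embedded and reference-less modes. As a sketch only, the helper below builds a `samtools view` argument list for each mode using the `embed_ref` and `no_ref` CRAM output options. The helper name `cram_view_command` and the file names are hypothetical, and the options should be checked against the installed samtools version before use.

```python
def cram_view_command(in_bam, out_cram, mode="external", reference=None):
    """Build a samtools argv list for one of three CRAM reference modes.

    mode is "external" (default CRAM behaviour), "embedded" (reference
    stored inside the CRAM file) or "none" (bases stored verbatim).
    """
    argv = ["samtools", "view", "-C", "-o", out_cram]
    if mode in ("external", "embedded"):
        if reference is None:
            raise ValueError(f"{mode} mode needs a reference FASTA")
        argv += ["-T", reference]
        if mode == "embedded":
            argv += ["--output-fmt-option", "embed_ref=1"]
    elif mode == "none":
        argv += ["--output-fmt-option", "no_ref=1"]
    else:
        raise ValueError(f"unknown mode: {mode}")
    argv.append(in_bam)
    return argv
```

The returned list can be passed to `subprocess.run`. Building the command as a list rather than a shell string avoids quoting problems with file names.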
Human gut sample SAMEA728920, from http://www.ebi.ac.uk/ena/data/view/ERA000116, is unmapped data converted from FASTQ to SAM via biobambam.
|CRAM v3+bz2||289888505||32||-||Via Scramble -j|
|CRAM v3+lzma||282989638||105||-||Via Scramble -Z|
|CRAM v3 MAX||281666960||166||-||Via Scramble -9 -jZ (bzip2, lzma)|
Scramble was used to test bzip2, lzma and both combined along with compression level 9 for maximum shrinkage.
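To put the Scramble rows in proportion, the sketch below parses the three rows and expresses each size relative to the smallest. It assumes the second cell of each row is the output size in bytes, since the table header is not shown here. The names `parse_row` and `relative_sizes` are hypothetical.

```python
ROWS = [
    "|CRAM v3+bz2||289888505||32||-||Via Scramble -j|",
    "|CRAM v3+lzma||282989638||105||-||Via Scramble -Z|",
    "|CRAM v3 MAX||281666960||166||-||Via Scramble -9 -jZ (bzip2, lzma)|",
]


def parse_row(row):
    """Split a '|a||b||c|' style row into a list of stripped cell strings."""
    return [cell.strip() for cell in row.strip().strip("|").split("||")]


def relative_sizes(rows):
    """Map each format name to its size divided by the smallest size.

    Assumes cell 0 is the format name and cell 1 is the size in bytes.
    """
    sizes = {cells[0]: int(cells[1]) for cells in map(parse_row, rows)}
    smallest = min(sizes.values())
    return {name: size / smallest for name, size in sizes.items()}


for name, ratio in relative_sizes(ROWS).items():
    print(f"{name}: {ratio:.4f}x the smallest file")
```

On these figures the bzip2-only file is only about 3% larger than the maximum-compression file, which is worth weighing against the differences in the third column.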
Copyright © 2017 Genome Research Limited (reg no. 2742969) is a charity registered in England with number 1021457.