(Updated Sep 2019)
The CRAM 2.1 and 3.0 benchmarks use the faster CRAM codecs, primarily deflate and rANS.
Also included is the performance of the proposed CRAM v3.1 standard. This is not yet a ratified GA4GH standard, but these figures give indicative results.
The options listed below also include the newly proposed compression profiles (fast, normal (default), small and archive), which simplify the trade-off between speed, size and random access. Each profile is a synonym for a collection of existing options. At the time of writing, these profiles are:
|fast||3.0, 3.1||seqs_per_slice=1000, level=1|
|normal (default)||3.0, 3.1||seqs_per_slice=10000|
To demonstrate the absolute smallest size we add the “use_lzma” option to the archive profile tests. This adds considerable encode cost, but minimal decode cost.
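As a rough illustration of how a profile and extra options combine, the sketch below builds the value passed to samtools' `--output-fmt` option and a matching `samtools view` argument list. The helper names (`cram_output_fmt`, `view_command`) and the file names are hypothetical, and it assumes an htslib build that recognises the profile names; nothing is executed.

```python
def cram_output_fmt(version="3.0", profile=None, **options):
    """Build an --output-fmt value such as 'cram,version=3.1,archive,use_lzma'."""
    parts = ["cram", f"version={version}"]
    if profile:
        parts.append(profile)  # e.g. fast, normal, small or archive
    for key, value in options.items():
        # Boolean switches are written bare; everything else as key=value.
        parts.append(key if value is True else f"{key}={value}")
    return ",".join(parts)


def view_command(bam, cram, reference, fmt, threads=8):
    """Return the argument list for a BAM to CRAM conversion (not run here)."""
    return ["samtools", "view", "-@", str(threads), "-T", reference,
            "--output-fmt", fmt, "-o", cram, bam]


# The "fast" profile written out as its listed options (file names are placeholders).
fast_like = cram_output_fmt("3.0", seqs_per_slice=1000, level=1)
# The archive profile with use_lzma added, as used for the smallest-size tests.
archive_lzma = cram_output_fmt("3.1", "archive", use_lzma=True)

if __name__ == "__main__":
    print(" ".join(view_command("in.bam", "out.cram", "ref.fa", archive_lzma)))
```

Returning a list rather than a shell string keeps the sketch safe to pass to `subprocess.run` without quoting concerns.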
This test set is chr1 of NA12878_S1, downloaded from http://www.ebi.ac.uk/ena/data/view/ERP002490
Conversion from BAM to
All times are reported as wall-clock. Because the file is large, some I/O time will affect the results, particularly the decode times. In this first test we also show both vanilla Zlib and Libdeflate implementations of the gzip standard.
|CRAM v3.1 (proposed)||62150||1324||417|
|CRAM v3.1 (proposed)||small||56204||2465||1405|
|CRAM v3.1 (proposed)||archive,use_lzma||54237||3048||1395|
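To make the trade-off in these rows easier to read, the sketch below expresses each CRAM v3.1 row relative to the first one. It assumes the three numeric columns are file size, encode time and decode time (the header is not shown here, so units are left out), and that the row without an option field is the default profile.

```python
# (options, size, encode, decode) copied from the CRAM v3.1 rows above.
# Column meanings are an assumption; the table header is not shown.
ROWS = [
    ("default", 62150, 1324, 417),
    ("small", 56204, 2465, 1405),
    ("archive,use_lzma", 54237, 3048, 1395),
]


def relative_to_first(rows):
    """Express every row's figures as ratios of the first row's figures."""
    _, base_size, base_enc, base_dec = rows[0]
    return {
        name: (size / base_size, enc / base_enc, dec / base_dec)
        for name, size, enc, dec in rows
    }


if __name__ == "__main__":
    for name, (size, enc, dec) in relative_to_first(ROWS).items():
        print(f"{name:18s} size x{size:.3f}  encode x{enc:.3f}  decode x{dec:.3f}")
```

Under those assumptions the small profile comes out at roughly 0.90 of the default size for about 1.9 times the encode time, and archive with use_lzma at roughly 0.87 of the size for about 2.3 times the encode time.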
To compare CRAM efficiency in a variety of circumstances we chose a smaller dataset so that we could explore the parameter space more completely. MiSeq_Ecoli_DH10B_110721_PF.bam is the smallest example data taken from the Deez paper, so we also include Deez itself here for comparison.
The BAM/SAM.gz implementation here uses libdeflate. Again we use 8 threads, but note Deez was only able to use around 2.
With light-weight level 1 compression and uncompressed level 0 files we see CRAM 3 being slower for uncompressed data than CRAM 2. This is due to the additional CRC integrity checks.
Note the slower time for BAM level 0 than level 1 is purely down to increased disk I/O costs; CPU times double for level 1. Why SAM does not pay this penalty is unknown, but it is likely this picture would change given a large enough file.
By default aligned CRAM uses an external reference file. Portions of that reference can be embedded within each slice to remove this external file dependency. On deep data this has minimal impact as the reference is small in comparison to the alignments.
CRAM can also do non-reference-based compression, storing the sequence as-is (like BAM). This leads to larger files.
The significant speed difference of no_ref between version 2.1 and 3.0 is due to improved ways of storing multi-base differences instead of requiring one CRAM feature for each base call.
This is the same file as above, with aligned sequencing data, but sorted into name order using “samtools sort -n”. BAM is significantly larger as the sequences are no longer in sorted order, harming gzip, but CRAM does not change size considerably. This is due to the use of reference-based compression. With referenceless compression CRAM will grow in size, similar to BAM, as is visible with the “no_ref” option. Although “no_ref” makes minimal difference here, with a very large reference it may be preferable to use it on name sorted data to reduce memory usage.
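A minimal sketch of how such a name-sorted CRAM might be produced, with or without a reference, is given below. The function name and file names are hypothetical, and the argument list is only built, not run; it assumes a samtools build whose `sort` command accepts `--output-fmt` and `--reference`.

```python
def name_sort_to_cram(src, dst, reference=None, threads=8):
    """Return a 'samtools sort -n' argument list writing CRAM.

    Without a reference the no_ref option is used, so sequences are
    stored as-is rather than as differences against a reference.
    """
    fmt = "cram" if reference else "cram,no_ref"
    cmd = ["samtools", "sort", "-n", "-@", str(threads),
           "--output-fmt", fmt, "-o", dst]
    if reference:
        cmd += ["--reference", reference]
    cmd.append(src)
    return cmd


if __name__ == "__main__":
    # Placeholder file names, for illustration only.
    print(" ".join(name_sort_to_cram("aligned.bam", "by_name.cram")))
    print(" ".join(name_sort_to_cram("aligned.bam", "by_name.cram", "ref.fa")))
```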
This is the name sorted data above, but with alignments and all auxiliary tags stripped out. This was achieved by converting back to FASTQ via “samtools fastq” and from there back to unaligned BAM. As expected the CRAMs are broadly similar in size to the no_ref mapped name sorted files.
Note the fastq was compressed with pigz and bgzip both using 8 threads. Pigz is smaller, but bgzip’s use of libdeflate greatly improves the speed. Both are significantly behind unaligned CRAM though so our recommendation is against storing data in native FASTQ format.
As above, but passed through the experimental “samtools sort -M” command first. This clusters reads by a hash of their sequence, which groups similar-looking data together and so helps LZ compression algorithms. Note the data is still unaligned. It is a quick alternative to (a better) full genome assembly. With 8 threads this sort process took 21 seconds real time and 170 seconds CPU, although expect it to be less performant on a very large file, as it would spill temporary files to disk and require a large merge sort.
Note some aligners will need these files sorting (or collating) back to name order prior to converting back to FASTQ.
The effect of “sort -M” on a small deeply sequenced genome is profound, giving file sizes comparable to the aligned position sorted data and around half the size of the name sorted compressed FASTQ file. Expect this effect to be less pronounced on shallow data sets or much larger genomes.
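The reason clustering helps can be shown with a toy experiment. The sketch below is not the samtools implementation: it samples synthetic reads from a random genome, uses the lexicographically smallest k-mer of each read as a stand-in clustering key, and compares zlib output sizes before and after sorting by that key. All names and parameters are illustrative assumptions.

```python
import random
import zlib


def minimizer(read, k=12):
    """Return the lexicographically smallest k-mer of a read (toy clustering key)."""
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def deflate_size(reads):
    """Size in bytes of the newline-joined reads after zlib (LZ77-based) compression."""
    return len(zlib.compress("\n".join(reads).encode("ascii"), 6))


def compare_orders(genome_len=200_000, read_len=100, n_reads=20_000, seed=1):
    """Return (size in sampled order, size after clustering) for synthetic reads."""
    rng = random.Random(seed)
    genome = "".join(rng.choice("ACGT") for _ in range(genome_len))
    # Reads are sampled at random positions, so their order is effectively unsorted.
    starts = [rng.randrange(genome_len - read_len + 1) for _ in range(n_reads)]
    reads = [genome[s:s + read_len] for s in starts]
    # Overlapping reads tend to share a smallest k-mer, so sorting by it
    # places them next to each other, inside the compressor's match window.
    clustered = sorted(reads, key=minimizer)
    return deflate_size(reads), deflate_size(clustered)


if __name__ == "__main__":
    unsorted_size, clustered_size = compare_orders()
    print(f"unsorted: {unsorted_size} bytes, clustered: {clustered_size} bytes")
```

With deep coverage most reads have an overlapping neighbour after clustering, which is why the gain is expected to shrink on shallow data or much larger genomes.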
Copyright © 2021 Genome Research Limited (reg no. 2742969) is a charity registered in England with number 1021457.