CRAM 3.1 becomes the default CRAM output version

Release 1.22 of HTSlib and SAMtools switches to writing CRAM 3.1 by default. Read support for CRAM 3.1 has been available since HTSlib 1.12 (March 2021), and we announced the switch to making it the default in September 2024.

Support for CRAM 3.1 exists in other tools, including Noodles (Rust), and at the time of writing a GitHub pull request exists for adding it to HTSJDK.

If you need to continue using CRAM 3.0, please explicitly specify the version when using SAMtools. For example:

samtools view -O cram,version=3.0 file.bam -o file.cram

The same option string also controls the compression level and/or profile. For example:

samtools view -O cram,small,level=7 file.bam -o file.cram
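When these commands are generated from a script, the format string can be assembled from its parts. The following is a minimal sketch in Python that only builds the argument list and does not run SAMtools. The helper name `cram_output_args` is hypothetical, and the ordering of the comma-separated parts (version, then profile, then level) simply mirrors the two examples above.

```python
def cram_output_args(in_path, out_path, version=None, profile=None, level=None):
    """Build a `samtools view` argument list that writes CRAM.

    Hypothetical helper: it only reproduces the option syntax shown in the
    examples above (version=, a profile name, level=).
    """
    fmt = ["cram"]
    if version is not None:
        fmt.append(f"version={version}")
    if profile is not None:
        # Profile names as listed in this document.
        if profile not in ("fast", "normal", "small", "archive"):
            raise ValueError(f"unknown profile: {profile}")
        fmt.append(profile)
    if level is not None:
        fmt.append(f"level={level}")
    return ["samtools", "view", "-O", ",".join(fmt), in_path, "-o", out_path]


# Reproduces the two commands shown above.
print(" ".join(cram_output_args("file.bam", "file.cram", version="3.0")))
# samtools view -O cram,version=3.0 file.bam -o file.cram
print(" ".join(cram_output_args("file.bam", "file.cram", profile="small", level=7)))
# samtools view -O cram,small,level=7 file.bam -o file.cram
```

The returned list can be passed to `subprocess.run` unchanged, which avoids shell quoting issues with file names.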

Note that not all machine types benefit from CRAM 3.1, as the gain largely depends on the randomness of the data. However, some platforms may benefit considerably. As with CRAM 3.0 there is a tradeoff between speed and size, but CRAM 3.1 adds more codecs specific to sequencing data, which makes this tradeoff more worthwhile. The default output, however, favours the faster end of the speed/size tradeoff. A paper describing these codecs is available at https://doi.org/10.1093/bioinformatics/btac010.

CRAM 3.0 vs 3.1 benchmark summary

We compare CRAM 3.0 against CRAM 3.1 for a variety of sequencing platforms. Each line represents a different compression profile, targeting different positions on the speed-vs-size tradeoff.

[Figures: CRAM size vs encode CPU for NovaSeq, HiSeq 2500, Revio and ONT]

Details of these compression profiles and specific tables of data are below. Some earlier benchmarks from 2019 are also available.

CRAM benchmarking profiles

(Updated Apr 2025)

These benchmarks cover a variety of instrument types and show the differences between BAM and various CRAM versions. Each format also permits a variety of compression options and levels. For speed, we test only a subset of each full data set. Note that this may affect the expected compression ratios if the data set contains a large amount of unaligned data at the end, or if the chosen subset is not representative of the overall data.

The benchmarks below utilise the HTSlib supported compression profiles of “fast”, “normal” (the default), “small” and “archive”. These set a mix of options as defined here:

Profile	CRAM versions	options
fast	3.0	seqs_per_slice=10000,level=1
fast	3.1	seqs_per_slice=10000,level=1,use_tok=0
normal	3.0, 3.1	seqs_per_slice=10000
small	3.0	seqs_per_slice=25000,level=6,use_bzip2
small	3.1	seqs_per_slice=25000,level=6,use_bzip2,use_fqz
archive	3.0	seqs_per_slice=100000,level=7,use_bzip2
archive	3.1	seqs_per_slice=100000,level=7,use_bzip2,use_fqz,use_arith

If level 8 is specified before enabling “archive” mode, “use_lzma” is also added to the option list. We also provide benchmarks using lzma to show the maximum compression, but it is rarely worth the CPU cost.
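The profile table can be read as a lookup from (profile, version) to an option list. The following Python sketch encodes it that way purely for illustration; it is not how HTSlib implements profiles. The names `PROFILE_OPTIONS` and `expand_profile` are hypothetical. For "fast" with CRAM 3.0 it uses the row without use_tok, and it only records the extra "use_lzma" for the level 8 case without modelling how an explicit level interacts with the level listed in the table.

```python
# Options per (profile, CRAM version), transcribed from the table above.
PROFILE_OPTIONS = {
    ("fast", "3.0"): ["seqs_per_slice=10000", "level=1"],
    ("fast", "3.1"): ["seqs_per_slice=10000", "level=1", "use_tok=0"],
    ("normal", "3.0"): ["seqs_per_slice=10000"],
    ("normal", "3.1"): ["seqs_per_slice=10000"],
    ("small", "3.0"): ["seqs_per_slice=25000", "level=6", "use_bzip2"],
    ("small", "3.1"): ["seqs_per_slice=25000", "level=6", "use_bzip2", "use_fqz"],
    ("archive", "3.0"): ["seqs_per_slice=100000", "level=7", "use_bzip2"],
    ("archive", "3.1"): ["seqs_per_slice=100000", "level=7", "use_bzip2",
                         "use_fqz", "use_arith"],
}


def expand_profile(profile, version, level_before_profile=None):
    """Return the option list a profile stands for (illustrative only).

    level_before_profile is the level given before the profile keyword,
    if any; level 8 combined with "archive" adds use_lzma, as described above.
    """
    options = list(PROFILE_OPTIONS[(profile, version)])
    if profile == "archive" and level_before_profile == 8:
        options.append("use_lzma")
    return options


print(",".join(expand_profile("small", "3.1")))
print(",".join(expand_profile("archive", "3.1", level_before_profile=8)))
```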

Illumina NovaSeq

10 million coordinate-sorted alignments, taken from a dataset Illumina published when announcing the NovaSeq.

Format	Options	Size (MB)	Encoding real (s)	Encoding CPU (s)	Decoding real (s)	Decoding CPU (s)
BAM	level=1	577	5.6	18.3	0.8	4.4
BAM		515	9.2	61.7	0.8	4.2
BAM	level=7	508	15.0	109.9	0.9	4.3
BAM	level=9	481	209.9	1661.0	0.9	4.3
CRAM v3.0	fast	216	5.2	28.3	2.0	13.6
CRAM v3.0		207	5.4	33.4	2.1	13.8
CRAM v3.0	small	201	11.6	88.3	3.2	24.4
CRAM v3.0	archive	199	17.4	128.7	3.5	25.9
CRAM v3.0	archive,use_lzma	194	65.4	503.2	2.6	18.6
CRAM v3.1	fast	217	4.6	27.6	1.9	10.9
CRAM v3.1		176	5.4	36.4	1.9	11.6
CRAM v3.1	small	166	11.8	90.1	5.5	41.5
CRAM v3.1	archive	158	24.3	185.8	5.8	42.4
CRAM v3.1	archive,use_lzma	157	57.2	440.9	6.0	43.4

A breakdown of data types in CRAM 3.1 is:

Data type	File percentage	Bits per base
Quality	50%	0.46
Sequence	23%	0.22
Read names	18%	0.17
Aux tags	9%	0.08

The quantisation of quality values significantly reduces the storage they take compared to the earlier HiSeq (below).

Illumina HiSeq 2500

10 million alignments from https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/NIST_Illumina_2x250bps

Aligned and coordinate sorted.

Format	Options	Size (MB)	Encoding real (s)	Encoding CPU (s)	Decoding real (s)	Decoding CPU (s)
BAM	level=1	1671	12.6	43.7	1.8	10.4
BAM		1546	19.2	125.2	1.7	9.9
BAM	level=7	1536	28.9	204.4	1.7	9.9
BAM	level=9	1461	251.4	1992.1	1.8	9.9
CRAM v3.0	fast	902	7.7	45.9	3.1	21.2
CRAM v3.0		880	8.1	51.6	3.0	21.5
CRAM v3.0	small	870	17.8	133.6	4.3	31.4
CRAM v3.0	archive	868	28.1	210.4	4.5	32.6
CRAM v3.0	archive,use_lzma	861	137.7	1078.5	3.9	27.3
CRAM v3.1	fast	904	7.7	46.0	2.9	15.2
CRAM v3.1		852	10.1	71.4	3.1	22.2
CRAM v3.1	small	789	25.9	200.0	14.6	113.3
CRAM v3.1	archive	775	56.4	427.7	15.7	109.4
CRAM v3.1	archive,use_lzma	774	135.1	1073.0	15.7	109.4

Adding lzma as a codec choice helps the compression ratio of CRAM 3.0 a little, but at a large CPU cost, so it is likely not worth it. It is used even less in CRAM 3.1, where it costs CPU during encoder evaluation but mostly goes unused for any large data type thanks to the better range of codecs available.
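The cost of lzma can be made concrete from the HiSeq 2500 table above by dividing the extra encode CPU time by the size saved. The Python sketch below does this arithmetic using the "archive" and "archive,use_lzma" rows; the function name `lzma_marginal_cost` is hypothetical and sizes are taken in the table's units.

```python
def lzma_marginal_cost(size_without, cpu_without, size_with, cpu_with):
    """Return (size saved, extra encode CPU seconds, CPU seconds per unit saved)."""
    saved = size_without - size_with
    extra_cpu = cpu_with - cpu_without
    return saved, extra_cpu, extra_cpu / saved


# (size, encode CPU seconds) for "archive" and "archive,use_lzma",
# copied from the HiSeq 2500 table.
hiseq = {
    "CRAM v3.0": ((868, 210.4), (861, 1078.5)),
    "CRAM v3.1": ((775, 427.7), (774, 1073.0)),
}

for fmt, ((size_a, cpu_a), (size_l, cpu_l)) in hiseq.items():
    saved, extra, per_unit = lzma_marginal_cost(size_a, cpu_a, size_l, cpu_l)
    print(f"{fmt}: {saved} saved for {extra:.1f} extra CPU s "
          f"({per_unit:.0f} CPU s per unit of size)")
```

On these figures lzma costs roughly 124 CPU seconds per unit of size saved for CRAM 3.0 and roughly 645 for CRAM 3.1.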

A breakdown of data types in CRAM 3.1 (default) is:

Data type	File percentage	Bits per base
Quality	86%	2.37
Sequence	7%	0.18
Read names	4%	0.11
Aux tags	3%	0.08

The bulk of the storage cost is the quality values due to the 32 discrete values.

The QS data series shrinks from 735MB to 665MB when we enable archive mode, and this accounts for the bulk of the reduction. (Similarly for small mode which also uses the fqzcomp quality codec.)
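As a quick consistency check, the figures above can be combined with the HiSeq 2500 table. The Python sketch below assumes that the file percentage applies to the CRAM v3.1 default size of 852 and that table sizes are in the same units as the QS figures (MB); the variable names are illustrative.

```python
# Values copied from the HiSeq 2500 table and breakdown above.
default_size = 852      # CRAM v3.1, default profile
archive_size = 775      # CRAM v3.1, archive profile
quality_share = 0.86    # "Quality" file percentage
qs_default = 735        # QS data series, default
qs_archive = 665        # QS data series, archive

# Quality size implied by the percentage; close to the quoted 735.
estimated_quality = quality_share * default_size

# How much of the total shrinkage comes from the QS data series.
total_reduction = default_size - archive_size
qs_reduction = qs_default - qs_archive
qs_fraction = qs_reduction / total_reduction

print(f"estimated quality size: {estimated_quality:.0f}")
print(f"QS reduction {qs_reduction} of total {total_reduction} "
      f"({100 * qs_fraction:.0f}%)")
```

The QS series accounts for 70 of the 77 saved, about 91%, which matches the statement that it is the bulk of the reduction.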

PacBio Revio

Alignments from https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=T2T/HG002/assemblies/polishing/HG002/v1.0/mapping/hifi_revio_pbmay24/

The reference used for this is the T2T hg002v1.0.1.fasta.gz file.

Format	Options	Size (MB)	Encoding real (s)	Encoding CPU (s)	Decoding real (s)	Decoding CPU (s)
BAM	level=1	5347	39	154	6.7	46.4
BAM		4832	66	420	5.6	32.3
BAM	level=7	4785	108	753	6.7	35.7
BAM	level=9	4409	1219	9680	6.9	36.3
CRAM v3.0	fast	1400	28	167	12.4	80.2
CRAM v3.0		1386	27	168	12.1	77.3
CRAM v3.0	small	1379	29	195	12.1	79.9
CRAM v3.0	archive	1374	63	387	16.6	107.8
CRAM v3.0	archive,use_lzma	1369	248	1357	16.9	109.6
CRAM v3.1	fast	1403	31	174	10.9	48.8
CRAM v3.1		1288	27	164	10.7	44.1
CRAM v3.1	small	1211	69	531	37.6	288.6
CRAM v3.1	archive	1202	144	959	39.4	298.7
CRAM v3.1	archive,use_lzma	1197	310	1920	39.9	301.8

A breakdown of data types in CRAM 3.1 is:

Data type	File percentage	Bits per base
Quality	80%	0.67
MM+ML tags	14%	0.12
Sequence	3%	0.03
Read names	<1%	0.00
Other	2%	0.02

The quantised quality values compress almost as well as NovaSeq, but they make up a larger proportion of data partly due to the extreme compression of sequence values. To some degree this is an artifact of this specific data set which has been aligned against a diploid T2T assembly rather than the canonical GRCh38, but will also be in part due to the high quality of HiFi base calls.

The base modification tags take up a significant proportion, and the ML confidence values could benefit from quantisation similar to that applied to the quality scores.
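To illustrate what such a quantisation might look like, the Python sketch below maps byte-sized values onto a small number of evenly spaced levels. It is a hypothetical illustration only, not something HTSlib or any base caller is stated to do here. It assumes the ML values are integers in the range 0 to 255, and the function name `quantise` and the choice of 8 bins are arbitrary.

```python
def quantise(value, bins=8):
    """Map a 0..255 value to the midpoint of one of `bins` equal-width bins.

    Hypothetical lossy scheme; `bins` is assumed to divide 256 exactly.
    """
    if not 0 <= value <= 255:
        raise ValueError("value must be in 0..255")
    if 256 % bins != 0:
        raise ValueError("bins must divide 256")
    width = 256 // bins
    return (value // width) * width + width // 2


original = [3, 17, 120, 131, 250, 255]
print([quantise(v) for v in original])

# Fewer distinct symbols generally means less for a compressor to encode.
print(len({quantise(v) for v in range(256)}), "distinct values after quantisation")
```

With fewer distinct values the data becomes more predictable, which is the same effect that makes quantised quality scores cheaper to store. The tradeoff is that the original values cannot be recovered.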

Oxford Nanopore

This is the latest GIAB data, which uses a modern chemistry and base caller. The data comes from https://epi2me.nanoporetech.com/giab-2025.01/.

The data is approx 110,000 reads, from chr4:20M-50M.

Format	Options	Size (MB)	Encoding real (s)	Encoding CPU (s)	Decoding real (s)	Decoding CPU (s)
BAM	level=1	2347	17	55	1.8	11.9
BAM		2110	21	129	2.0	11.8
BAM	level=7	2100	38	207	2.2	12.5
BAM	level=9	1990	189	1434	2.2	12.7
CRAM v3.0	fast	1392	9	50	3.9	24.6
CRAM v3.0		1340	10	60	3.8	24.0
CRAM v3.0	small	1321	22	149	4.3	28.7
CRAM v3.0	archive	1302	39	280	4.1	28.4
CRAM v3.0	archive,use_lzma	1268	226	1767	5.2	36.6
CRAM v3.1	fast	1388	8	48	3.1	16.4
CRAM v3.1		1327	11	75	3.6	22.2
CRAM v3.1	small	1313	29	179	3.6	23.4
CRAM v3.1	archive	1294	72	546	11.0	83.6
CRAM v3.1	archive,use_lzma	1262	258	2021	10.1	75.3

A breakdown of data types in CRAM 3.1 is:

Data type	File percentage	Bits per base
Quality	50%	2.53
MM+ML tags	44%	2.22
Sequence	6%	0.30
Read names	<1%	0.01
Other	<1%	0.00

Base qualities are more costly to store here than for any other technology tested, mainly due to their high variability and the lack of quantisation.

Base modifications make up a surprisingly large share of the compressed data, possibly due to the number of base types present, but also because, as with Revio, there is no quantisation of the ML tag modification likelihoods.

The unpredictability of both of these data types means both CRAM 3.0 and 3.1 struggle to get effective compression.
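To compare platforms directly, the default-profile sizes from the four tables above can be reduced to a percentage saving of CRAM 3.1 over CRAM 3.0. The Python sketch below performs only that arithmetic; the dictionary layout and the function name `percent_saving` are illustrative.

```python
def percent_saving(size_v30, size_v31):
    """Percentage by which the CRAM 3.1 file is smaller than the CRAM 3.0 file."""
    return 100.0 * (size_v30 - size_v31) / size_v30


# (CRAM v3.0 default size, CRAM v3.1 default size), copied from the tables above.
default_sizes = {
    "Illumina NovaSeq": (207, 176),
    "Illumina HiSeq 2500": (880, 852),
    "PacBio Revio": (1386, 1288),
    "Oxford Nanopore": (1340, 1327),
}

for platform, (v30, v31) in default_sizes.items():
    print(f"{platform}: {percent_saving(v30, v31):.1f}% smaller with CRAM 3.1")
```

With the default profile this gives about 15.0% for NovaSeq, 7.1% for Revio, 3.2% for HiSeq 2500 and 1.0% for Oxford Nanopore, in line with the observation that the benefit depends heavily on the platform.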