Manual page from samtools-1.17
released on 21 February 2023

NAME

samtools cram-size – list a breakdown of data types in a CRAM file

SYNOPSIS

samtools cram-size [-ve] [-o file] in.cram

DESCRIPTION

Produces a summary of CRAM block Content ID numbers and the Data Series stored within each of them. Optionally, a more detailed breakdown of how each data series is encoded per container can also be listed using the -e or --encodings option.

CRAM permits mixing multiple Data Series into a single block. In this case it is not possible to tell the relative proportion that the Data Series consume within that block. CRAM also permits different encodings and block Content ID assignment per container, although this would be highly unusual. Htslib will always assign the same Data Series to a block with a consistent Content ID, although the CRAM Encoding may change.

Each CRAM block has a compression method. These may not be consistent between successive blocks with the same Content ID. Htslib learns which compression methods work, so a single Content ID may have multiple compression methods associated with it. The methods utilised are listed per line with a single character code, although the size breakdown per method and a more verbose description can be shown using the -v option. The compression codecs used in CRAM may have a variety of parameters, such as compression levels, inbuilt transformations, and choices of entropy encoding. An attempt is made to distinguish between these different method parameterisations.

The compression methods, with their short and long (verbose) names, are listed below:

Short  Long                Description
-----  ------------------  ----------------------------------------------
g      gzip                Gzip
_      gzip-min            Gzip -1
G      gzip-max            Gzip -9
b      bzip2               Bzip2
b      bzip2-1 to bzip2-8  Explicit bzip2 compression levels
B      bzip2-9             Bzip2 -9
l      lzma                LZMA
r      r4x8-o0             rANS 4x8 Order-0
R      r4x8-o1             rANS 4x8 Order-1
0      r4x16-o0            rANS 4x16 Order-0
0      r4x16-o0R           rANS 4x16 Order-0 with RLE
0      r4x16-o0P           rANS 4x16 Order-0 with PACK
0      r4x16-o0PR          rANS 4x16 Order-0 with PACK and RLE
1      r4x16-o1            rANS 4x16 Order-1
1      r4x16-o1R           rANS 4x16 Order-1 with RLE
1      r4x16-o1P           rANS 4x16 Order-1 with PACK
1      r4x16-o1PR          rANS 4x16 Order-1 with PACK and RLE
4      r32x16-o0           rANS 32x16 Order-0
4      r32x16-o0R          rANS 32x16 Order-0 with RLE
4      r32x16-o0P          rANS 32x16 Order-0 with PACK
4      r32x16-o0PR         rANS 32x16 Order-0 with PACK and RLE
5      r32x16-o1           rANS 32x16 Order-1
5      r32x16-o1R          rANS 32x16 Order-1 with RLE
5      r32x16-o1P          rANS 32x16 Order-1 with PACK
5      r32x16-o1PR         rANS 32x16 Order-1 with PACK and RLE
8      rNx16-xo0           rANS Nx16 STRIPED mode
2      rNx16-cat           rANS Nx16 CAT mode
a      arith-o0            Arithmetic coding Order-0
a      arith-o0R           Arithmetic coding Order-0 with RLE
a      arith-o0P           Arithmetic coding Order-0 with PACK
a      arith-o0PR          Arithmetic coding Order-0 with PACK and RLE
A      arith-o1            Arithmetic coding Order-1
A      arith-o1R           Arithmetic coding Order-1 with RLE
A      arith-o1P           Arithmetic coding Order-1 with PACK
A      arith-o1PR          Arithmetic coding Order-1 with PACK and RLE
a      arith-xo0           Arithmetic coding STRIPED mode
a      arith-cat           Arithmetic coding CAT mode
f      fqzcomp             FQZComp quality codec
n      tok3-rans           Name tokeniser with rANS encoding
n      tok3-arith          Name tokeniser with Arithmetic encoding

OPTIONS

-o FILE

Output size information to FILE.

-v

Verbose mode. This shows one line per combination of Content ID and compression method.

-e, --encodings

CRAM uses an Encoding, which describes how the data is serialised into a data block. This is distinct from the CRAM compression method, which is then applied to the block post-encoding. The encoding methods are stored per CRAM Container.

This option lists the CRAM record encoding map and the tag encoding map. It shows each data series, the associated CRAM encoding method (such as HUFFMAN, BETA or EXTERNAL), and any parameters associated with that encoding. The output may be large, as this information is reported per container rather than as a single set of summary statistics at the end of processing.

EXAMPLES

The basic summary of block Content ID sizes for a CRAM file:

$ samtools cram-size in.cram
#   Content_ID  Uncomp.size    Comp.size   Ratio Method  Data_series
BLOCK     CORE            0            0 100.00% .      
BLOCK       11    394734019     51023626  12.93% g       RN
BLOCK       12   1504781763     99158495   6.59% R       QS
BLOCK       13       330065        84195  25.51% _r.g    IN
BLOCK       14     26625602      6803930  25.55% Rrg     SC
...
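
The summary above can be post-processed with standard tools. The following is a minimal sketch, assuming the column layout shown in the example (BLOCK, Content_ID, Uncomp.size, Comp.size, ...). The function name cram_size_share is hypothetical and not part of samtools. It reads the summary on standard input and prints each Content ID with its compressed size and its share of the total compressed size.

```shell
# Hypothetical helper (not part of samtools): reads "samtools cram-size"
# summary output on stdin and prints, for each BLOCK line,
#   Content_ID  Comp.size  share-of-total-compressed-size
# Assumes the column layout shown in the example above.
cram_size_share() {
    awk '
        BEGIN { n = 0; total = 0 }
        # Only BLOCK lines carry sizes; the "#" header and "..." are skipped.
        $1 == "BLOCK" {
            id[n]   = $2
            comp[n] = $4
            total  += $4
            n++
        }
        END {
            for (i = 0; i < n; i++) {
                pct = (total > 0) ? 100 * comp[i] / total : 0
                printf "%s %s %.2f%%\n", id[i], comp[i], pct
            }
        }
    '
}
```

It might be used as "samtools cram-size in.cram | cram_size_share". Because several Data Series may share one block, the share is per Content ID and not necessarily per Data Series.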

Show the same file as above in verbose mode. Here we see the distinct compression methods that have been used for each block Content ID.

$ samtools cram-size -v in.cram
#   Content_ID  Uncomp.size    Comp.size   Ratio Method      Data_series
BLOCK     CORE            0            0 100.00% raw        
BLOCK       11    394734019     51023626  12.93% gzip        RN
BLOCK       12   1504781763     99158495   6.59% r4x8-o1     QS
BLOCK       13       275033        64343  23.39% gzip-min    IN
BLOCK       13        43327        15412  35.57% r4x8-o0     IN
BLOCK       13         2452         2452 100.00% raw         IN
BLOCK       13         9253         1988  21.49% gzip        IN
BLOCK       14     23106404      5903351  25.55% r4x8-o1     SC
BLOCK       14      1951616       513722  26.32% r4x8-o0     SC
BLOCK       14      1567582       386857  24.68% gzip        SC
...
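
In verbose mode the per-method lines for one Content ID add up to the single line reported by the basic summary. As an illustration, the sketch below collapses verbose output back to one line per Content ID. It assumes the column layout shown above, and the function name cram_size_collapse is hypothetical rather than part of samtools.

```shell
# Hypothetical helper (not part of samtools): reads "samtools cram-size -v"
# output on stdin and sums the sizes of all compression methods that share
# a Content ID, printing
#   Content_ID  Uncomp.size  Comp.size  Ratio
# Assumes the column layout shown in the verbose example above.
cram_size_collapse() {
    awk '
        BEGIN { n = 0 }
        $1 == "BLOCK" {
            # Remember the order in which Content IDs first appear.
            if (!($2 in uncomp)) order[n++] = $2
            uncomp[$2] += $3
            comp[$2]   += $4
        }
        END {
            for (i = 0; i < n; i++) {
                id = order[i]
                # Empty blocks are reported as 100%, as in the examples.
                ratio = (uncomp[id] > 0) ? 100 * comp[id] / uncomp[id] : 100
                printf "%s %.0f %.0f %.2f%%\n", id, uncomp[id], comp[id], ratio
            }
        }
    '
}
```

For the verbose listing above, Content ID 13 collapses to 330065 uncompressed and 84195 compressed bytes (25.51%), matching the corresponding line of the basic summary.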

List encoding methods per CRAM Data Series. The two-letter series are the standard CRAM Data Series and the three-letter ones are the optional auxiliary tags, with the tag name and type combined (for example, MDZ is the MD tag with type Z).

$ samtools cram-size -e in.cram
Container encodings
    RN      BYTE_ARRAY_STOP(stop=0,id=11)
    QS      EXTERNAL(id=12)
    IN      BYTE_ARRAY_STOP(stop=0,id=13)
    SC      BYTE_ARRAY_STOP(stop=0,id=14)
    BB      BYTE_ARRAY_LEN(len_codec={EXTERNAL(id=42)}, \
                           val_codec={EXTERNAL(id=37)})
    ...
    XAZ     BYTE_ARRAY_STOP(stop=9,id=5783898)
    MDZ     BYTE_ARRAY_STOP(stop=9,id=5063770)
    ASC     BYTE_ARRAY_LEN(len_codec={HUFFMAN(codes={1},lengths={0})}, \
                           val_codec={EXTERNAL(id=4281155)})
    ...
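
To relate the encodings to the block sizes reported earlier, it can help to list which block Content IDs each Data Series refers to. The sketch below is one possible way to do this, assuming the -e output looks like the example above, with "id=N" parameters and wrapped continuation lines. The function name cram_series_ids is hypothetical and not part of samtools.

```shell
# Hypothetical helper (not part of samtools): reads "samtools cram-size -e"
# output on stdin and prints one "SERIES CONTENT_ID" pair for every id=N
# parameter found, including those on wrapped continuation lines.
# Assumes the layout shown in the example above.
cram_series_ids() {
    awk '
        # A new entry starts with a series name followed by ENCODING(...
        $1 ~ /^[A-Za-z0-9]+$/ && $2 ~ /^[A-Z_]+\(/ { series = $1 }
        series != "" {
            rest = $0
            # Walk along the line, reporting each id=N in turn.
            while (match(rest, /id=[0-9]+/)) {
                print series, substr(rest, RSTART + 3, RLENGTH - 3)
                rest = substr(rest, RSTART + RLENGTH)
            }
        }
    '
}
```

As the encodings are listed per container, the same pair is likely to be printed many times for a real file; piping the result through "sort -u" would reduce it to the distinct pairs.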

AUTHOR

Written by James Bonfield from the Sanger Institute.

SEE ALSO

samtools(1)

Samtools website: <http://www.htslib.org/>