Manual page from htslib-1.22
released on 30 May 2025

NAME

ref-cache – CRAM reference caching proxy

SYNOPSIS

ref-cache [-bLUv] [-l LOG_DIR] [-u URL] -d CACHE_DIR -p PORT

DESCRIPTION

ref-cache is a caching proxy for reference sequences, for use when encoding and decoding CRAM format sequence alignment files.

CRAM can use reference-based compression, where individual bases in aligned records are compared against a known reference sequence and only the bases that differ are stored. This gives better compression, but requires the reference sequence to be supplied from an external source. One way to get these sequences is by querying a server implementing the GA4GH refget standard <https://ga4gh.github.io/refget/>; however, this can lead to excessive network traffic and server load if, as is often the case, the same reference is needed more than once. ref-cache makes reference handling easier by keeping copies of downloaded files, allowing them to be reused when they are needed again.

As it has been specifically designed to serve reference sequences for CRAM encoders and decoders, ref-cache behaves rather differently to general-purpose caching web proxies:

QUICK-START GUIDE

Create directories for the cache and (optionally) log files. Then start up the server in the background, listening on port 8080 and with the EBI's CRAM reference server as the upstream source.

mkdir cached_refs
mkdir logs
ref-cache -b -d cached_refs -l logs -p 8080 -u https://www.ebi.ac.uk/ena/cram/md5/

To make SAMtools and HTSlib use the server, set its URL in the REF_PATH environment variable. As REF_PATH is a colon-separated list, any colons inside the URL must be doubled up. Substitute the hostname of your actual server.

REF_PATH='http:://myserver.example.com::8080/%s'
export REF_PATH

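The colon doubling can be done mechanically. The following is a minimal sketch, not part of ref-cache or HTSlib: ref_path_escape is a hypothetical helper name, and the hostname is the same placeholder used above.

```shell
# Hypothetical helper (not shipped with ref-cache): escape a URL for use as
# a REF_PATH entry by doubling every colon it contains.
ref_path_escape() {
    printf '%s\n' "$1" | sed 's/:/::/g'
}

REF_PATH=$(ref_path_escape 'http://myserver.example.com:8080/%s')
export REF_PATH
echo "$REF_PATH"
# prints http:://myserver.example.com::8080/%s
```

Only the URL itself should be passed through the helper. The single colons that separate REF_PATH entries must be added afterwards, unescaped.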
If the cache directory can be made visible to SAMtools/HTSlib processes, it can also be added directly to REF_PATH by putting it before the web server URL. It is necessary to use the full path to the directory, followed by "/%2s/%2s/%s" for the file location, due to the way files are stored inside the cache.

REF_PATH='/path/to/cache/%2s/%2s/%s:http:://myserver.example.com::8080/%s'
export REF_PATH

This is useful as accessing the files directly is more efficient than using HTTP. Files are downloaded to a temporary name and then renamed after validation, so processes directly using the cache will never try to use a partly downloaded file. By putting the URL at the end, the web server will pick up any requests for references not already in the cache, download them, provide them to the requester, and store them in the cache.
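To illustrate the "/%2s/%2s/%s" layout: each %2s takes the next two characters of the reference's MD5 checksum and the final %s takes the remainder. The sketch below shows the resulting path. ref_cache_path is a hypothetical helper name, and the checksum is an arbitrary 32-character example rather than one taken from this manual.

```shell
# Hypothetical helper (not shipped with ref-cache): map an MD5 checksum to
# its location under the cache directory, following "%2s/%2s/%s".
ref_cache_path() {
    cache_dir=$1
    md5=$2
    first=$(printf '%s' "$md5" | cut -c1-2)    # characters 1-2
    second=$(printf '%s' "$md5" | cut -c3-4)   # characters 3-4
    rest=$(printf '%s' "$md5" | cut -c5-)      # everything that remains
    printf '%s/%s/%s/%s\n' "$cache_dir" "$first" "$second" "$rest"
}

ref_cache_path /path/to/cache d2ed829b8a1628d16cbeee88e88e39eb
# prints /path/to/cache/d2/ed/829b8a1628d16cbeee88e88e39eb
```

A path built this way can be tested with [ -f ... ] to see whether a given reference is already present in the cache.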

OPTIONS

-b

Run in the background as a System V-style daemon. This option must not be used with -s.

-d <dir>

Directory where cached files will be stored

-h

Show help

-l <dir>

Directory for log files. If not set and running in the foreground, logs will be sent to stdout

-L

Don't log

-m all|default|localhost|<network-list>

Reply to connections from the listed network(s). This option can be given more than once, with the final allow list being the union of all listed networks along with localhost (which is always enabled). See CLIENT ADDRESS CHECKING below.

-n <1-4>

Number of server processes to run

-p <port>

Port number to listen on

-r <num>

Number of request log files to keep

-R <num>

Maximum size of a request log file (MiB)

-s

Run as a systemd-style socket service. As the service manager handles socket allocation, the -p option is ignored when running in this mode. This option must not be used with -b.

-u <url>

URL of the upstream server. If this option is not given, and upstream access has not been disabled using -U, the EBI's server (https://www.ebi.ac.uk/ena/cram/md5/) will be used.

-U

Do not attempt to get files from an upstream server. Only files already in the local cache will be served.

-v

Turn on debugging output

CLIENT ADDRESS CHECKING

ref-cache is designed to serve references to local networks. To ensure that it only responds to the desired clients, it has an allow list of address ranges that it will talk to. If a connection attempt comes from an IP address not in the allowed set, it will be closed immediately. (N.B.: Rejected clients will see a connection open and immediately close, as it's necessary for connections to be opened for the server to discover the peer address. If you want to drop or reject unwanted requests without opening them, you will need to use your operating system's firewall.)

The address ranges can be set using the -m option, which may be used more than once. Networks can be specified either as a comma-separated list of CIDR-format blocks (e.g. 192.0.2.0/24, 2001:db8::/32) or using one of the following synonyms:

all

Any address (not recommended)

default

10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16 (the private ranges listed in RFC 1918); fc00::/7 (the local IPv6 Unicast address range in RFC 4193); and fe80::/10 (IPv6 link-local addresses)

localhost

127.0.0.0/8 and ::1/128 (loop-back addresses)

If no -m option is given, the "default" list will be used, as most organisations will be using one or more of these internally. This will be overridden if any -m option appears, in which case -m default will need to be specified explicitly if you also want to reply to addresses in the IPv4 and IPv6 private ranges. For example:

ref-cache -m 192.0.2.0/24 -m default ...

ref-cache will always respond to connections from the loop-back addresses, even if they were not specified. Using -m localhost on its own will limit it to responding only to loop-back requests.
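The idea of matching a client address against a CIDR block can be sketched in shell. This is an illustration only, not ref-cache's actual implementation: ip_to_int and in_cidr are hypothetical helper names, only IPv4 is handled, and octets are assumed to be written without leading zeros.

```shell
# Hypothetical helper: convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address into its four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Hypothetical helper: succeed if address $1 lies inside CIDR block $2.
in_cidr() {
    addr=$(ip_to_int "$1")
    net=$(ip_to_int "${2%/*}")     # network part, before the slash
    bits=${2#*/}                   # prefix length, after the slash
    mask=$(( (0xFFFFFFFF << (32 - $bits)) & 0xFFFFFFFF ))
    [ $(( addr & mask )) -eq $(( net & mask )) ]
}

if in_cidr 192.0.2.17 192.0.2.0/24; then
    echo "allowed"
else
    echo "rejected"
fi
# prints allowed
```

An allow list is then the union of several such blocks: a client is accepted if it matches any one of them.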

AUTHOR

Written by Rob Davies from the Wellcome Sanger Institute

SEE ALSO

samtools (1)

Samtools website: <http://www.htslib.org/>

CRAM specification: <https://samtools.github.io/hts-specs/CRAMv3.pdf>

Refget website: <https://ga4gh.github.io/refget/>