lookup_cache

Thread-safe, disk-backed key-value cache for external API lookups.

Designed to be shared by:

  • Gene sequence lookups (Ensembl, RefSeq) → namespaces ensembl_sequences / refseq_sequences

  • SMILES lookups (future) → namespace smiles_metabolite

  • Any other external API calls that are expensive and should survive process restarts

Cache files live in {project_root}/.lookup_cache/ by default, or in a directory set by the VmaxBuilder_CACHE_DIR environment variable. Each namespace is a separate JSON file, e.g. .lookup_cache/ensembl_sequences.json.
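
The resolution order described above could be sketched as follows. This is only an illustration under stated assumptions, not the module's actual code: the helper names resolve_cache_dir and namespace_file and the explicit project_root argument are hypothetical, and the real get_default_cache_dir() may locate the project root differently.

```python
import os
from pathlib import Path

ENV_VAR = "VmaxBuilder_CACHE_DIR"  # variable name taken from the text above


def resolve_cache_dir(project_root: Path) -> Path:
    """Hypothetical helper: the environment variable wins, else the default."""
    override = os.environ.get(ENV_VAR)
    if override:
        return Path(override)
    return project_root / ".lookup_cache"


def namespace_file(cache_dir: Path, namespace: str) -> Path:
    """Hypothetical helper: one JSON file per namespace."""
    return cache_dir / f"{namespace}.json"
```

With the variable unset, namespace_file(resolve_cache_dir(root), "ensembl_sequences") points at {root}/.lookup_cache/ensembl_sequences.json, matching the example path above.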

Thread-safety

All public methods acquire an internal threading.Lock so the cache can safely be used from ThreadPoolExecutor workers that call set() concurrently.
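
To make the locking pattern concrete, here is a minimal in-memory sketch. TinyLockedStore is a hypothetical stand-in, not LookupCache itself; it shows only how a single threading.Lock around each public method keeps concurrent set() calls from ThreadPoolExecutor workers consistent.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class TinyLockedStore:
    """Hypothetical stand-in that illustrates the locking pattern only."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def set(self, key, value):
        with self._lock:  # only one thread mutates the dict at a time
            self._data[key] = value

    def get(self, key, default=None):
        with self._lock:
            return self._data.get(key, default)

    def __len__(self):
        with self._lock:
            return len(self._data)


store = TinyLockedStore()
with ThreadPoolExecutor(max_workers=8) as pool:
    # Workers call set() concurrently; list() waits for all of them
    # and re-raises any exception raised inside a worker.
    list(pool.map(lambda i: store.set(f"key-{i}", i), range(200)))
```

A plain Lock is enough as long as no public method calls another public method while holding it; a design that nests such calls would need threading.RLock instead.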

Atomic writes

Saves are written to {file}.tmp and then moved into place with Path.replace(), so a crash mid-write never leaves a corrupt cache file.
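
The write-then-rename pattern described above could look roughly like this. The function name atomic_save_json is hypothetical and the sketch assumes the temporary file sits in the same directory as the target, which is what makes the rename atomic on POSIX filesystems.

```python
import json
from pathlib import Path


def atomic_save_json(path: Path, data: dict) -> None:
    """Hypothetical helper: write JSON so readers never see a partial file."""
    # Temporary sibling file, e.g. ensembl_sequences.json.tmp
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(data, indent=2), encoding="utf-8")
    # Path.replace() renames tmp over the target, overwriting any old copy.
    tmp.replace(path)
```

If the process dies during write_text(), only the .tmp file is incomplete and the previous cache file is untouched. A stricter variant would also flush and fsync the temporary file before the rename.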

Usage example

from src.VmaxBuilder.utils.lookup_cache import (
    LookupCache,
    gene_result_to_dict,
    get_default_cache_dir,
    sequence_cache_key,
)

cache = LookupCache(get_default_cache_dir(), "ensembl_sequences")
key = sequence_cache_key("homo_sapiens", "ENSG00000139618", "canonical_only")

if key not in cache:
    result = expensive_api_call(...)
    cache.set(key, gene_result_to_dict(result))   # saved to disk immediately

data = cache.get(key)  # returns the stored dict

Classes

GeneSequenceResult(gene_symbol, sequences, ...)

Container for sequence lookup results for one gene symbol.

LookupCache(cache_dir, namespace[, autosave])

Thread-safe, disk-backed key-value store for a single namespace.

SequenceRecord(sequence, source, accession)

JSON-friendly record describing one retrieved sequence.

Functions

dict_to_gene_result(data)

Rebuild a GeneSequenceResult from its dict form (the inverse of gene_result_to_dict).

gene_result_to_dict(result)

Convert a GeneSequenceResult to a JSON-friendly dict suitable for storing in the cache.

get_default_cache_dir()

Resolve the default cache directory.

sequence_cache_key(species, gene_symbol, mode)

Canonical cache key for a GeneSequenceResult.

smiles_cache_key(database, metabolite_id)

Canonical cache key for a metabolite SMILES lookup.