IdentifierTranslationService

class VmaxBuilder.database_retrieval.identifier_translation.IdentifierTranslationService

Generated: validation needed.

Description:

Translate identifiers between namespaces and build transcript-to-gene mapping tables by querying network APIs from worker threads.

Public Methods

build_gene_transcript_dataframe(…)

Generated: validation needed.

build_transcript_gene_dataframe(…)

Generated: validation needed.

translate_identifiers(…)

Generated: validation needed.

translate_identifiers(identifiers: Sequence[str], *, source_id_type: str, target_id_type: str, species: str | None = None, provider: str = 'auto', max_workers: int = 8, batch_size: int = 500) → IdentifierTranslationResult

Generated: validation needed.

Description:

Translate identifiers from one namespace into another with partial-result support.

Parameters:
  • identifiers (Sequence[str]) – Source identifiers to translate.

  • source_id_type (str) – Source identifier namespace.

  • target_id_type (str) – Target identifier namespace.

  • species (str | None) – Optional species hint forwarded to provider.

  • provider (str) – Translation provider key. Supported values: auto, mygene.

  • max_workers (int) – Maximum number of parallel worker threads.

  • batch_size (int) – Number of identifiers per provider query chunk.

Returns:

IdentifierTranslationResult – Mapping output and unresolved identifiers list.

Raises:

ValueError – If provider or id-type configuration is unsupported.
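The partial-result behaviour described above can be illustrated with a minimal sketch. It is not the library's implementation: `TranslationResult`, `collect_partial_result`, and the `lookup` callable are hypothetical stand-ins, and the real `IdentifierTranslationResult` may expose different attribute names. The sketch only shows the idea of returning resolved mappings alongside a list of unresolved identifiers, and of rejecting an unsupported provider key with `ValueError`.

```python
from __future__ import annotations

from collections.abc import Callable, Sequence
from dataclasses import dataclass, field


@dataclass
class TranslationResult:
    """Hypothetical stand-in for IdentifierTranslationResult."""

    mapping: dict[str, str] = field(default_factory=dict)
    unresolved: list[str] = field(default_factory=list)


# Provider keys listed in the parameter documentation.
SUPPORTED_PROVIDERS = ("auto", "mygene")


def collect_partial_result(
    identifiers: Sequence[str],
    lookup: Callable[[str], str | None],
    *,
    provider: str = "auto",
) -> TranslationResult:
    """Split identifiers into resolved mappings and an unresolved list."""
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(f"Unsupported provider: {provider!r}")
    result = TranslationResult()
    for identifier in identifiers:
        target = lookup(identifier)  # assumed to return None when nothing matches
        if target is None:
            result.unresolved.append(identifier)
        else:
            result.mapping[identifier] = target
    return result


# Placeholder identifiers; "id_c" has no translation and ends up unresolved.
known = {"id_a": "gene_a", "id_b": "gene_b"}
result = collect_partial_result(["id_a", "id_b", "id_c"], known.get)
print(result.mapping, result.unresolved)
```

Callers can then decide whether an incomplete translation is acceptable by inspecting the unresolved list instead of catching an exception.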

build_transcript_gene_dataframe(transcript_ids: Sequence[str], *, transcript_id_type: str, target_gene_id_type: str, species: str | None = None, provider: str = 'auto', max_workers: int = 8, batch_size: int = 500) → DataFrame

Generated: validation needed.

Description:

Build a transcript-to-gene mapping DataFrame for transcript-level expression inputs.

Parameters:
  • transcript_ids (Sequence[str]) – Transcript identifiers present in expression table.

  • transcript_id_type (str) – Transcript identifier namespace.

  • target_gene_id_type (str) – Target gene identifier namespace.

  • species (str | None) – Optional species hint forwarded to provider.

  • provider (str) – Translation provider key. Supported values: auto, mygene.

  • max_workers (int) – Maximum number of parallel worker threads.

  • batch_size (int) – Number of identifiers per provider query chunk.

Returns:

pd.DataFrame – Mapping table with transcript_id and gene_id columns.
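As an illustration of the documented return shape, the following sketch turns a transcript-to-gene dictionary into a two-column table. The helper name `mapping_to_dataframe` and the placeholder identifiers are assumptions made for this example; how the service assembles its DataFrame internally is not stated here.

```python
import pandas as pd


def mapping_to_dataframe(mapping: dict[str, str]) -> pd.DataFrame:
    """Build a table with transcript_id and gene_id columns from a dict."""
    rows = [
        {"transcript_id": transcript_id, "gene_id": gene_id}
        for transcript_id, gene_id in mapping.items()
    ]
    # Passing columns explicitly keeps the schema stable for an empty mapping.
    return pd.DataFrame(rows, columns=["transcript_id", "gene_id"])


# Placeholder identifiers: two transcripts that belong to the same gene.
table = mapping_to_dataframe({"tx_1": "gene_a", "tx_2": "gene_a"})
print(table)
```

Fixing the column list up front means downstream code can rely on both columns being present even when no transcript could be mapped.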

build_gene_transcript_dataframe(gene_ids: Sequence[str], *, gene_id_type: str, species: str | None = None, provider: str = 'auto', max_workers: int = 8, batch_size: int = 500) → DataFrame

Generated: validation needed.

Description:

Build a transcript metadata table for model genes, including the transcript-level annotation fields used by downstream transcript IFP expansion.

Parameters:
  • gene_ids (Sequence[str]) – Model gene identifiers.

  • gene_id_type (str) – Gene identifier namespace.

  • species (str | None) – Optional species hint forwarded to provider.

  • provider (str) – Translation provider key. Supported values: auto, mygene.

  • max_workers (int) – Maximum number of parallel worker threads.

  • batch_size (int) – Number of identifiers per provider query chunk.

Returns:

pd.DataFrame – Transcript metadata table with columns transcript_id, gene_id, is_protein_coding, is_canonical, peptide_len, cdna_len, peptide_seq, cdna_seq.

Raises:

ValueError – If provider or gene identifier namespace is unsupported.

static _deduplicate_identifiers(identifiers: Sequence[str]) → list[str]

Generated: validation needed.

Description:

Deduplicate and strip identifiers while preserving input encounter order.

Parameters:

identifiers (Sequence[str]) – Raw identifier sequence.

Returns:

list[str] – Deduplicated non-empty identifiers.
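A minimal sketch of order-preserving deduplication, written from the description above rather than from the actual source. The function name is chosen for illustration; the real method may treat edge cases differently.

```python
from collections.abc import Sequence


def deduplicate_identifiers(identifiers: Sequence[str]) -> list[str]:
    """Strip whitespace, drop empty entries, keep first occurrences in order."""
    seen: set[str] = set()
    unique: list[str] = []
    for raw in identifiers:
        cleaned = raw.strip()
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            unique.append(cleaned)
    return unique


print(deduplicate_identifiers([" tx_1 ", "tx_2", "", "tx_1", "   "]))
# → ['tx_1', 'tx_2']
```

Tracking seen values in a set keeps the pass linear in the number of identifiers, while the separate list preserves encounter order.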

_translate_with_mygene(*, identifiers: Sequence[str], source_id_type: str, target_id_type: str, species: str | None, max_workers: int, batch_size: int) → dict[str, str]

Generated: validation needed.

Description:

Translate identifier chunks through MyGene queries and merge first-hit mappings.

Parameters:
  • identifiers (Sequence[str]) – Identifiers to map.

  • source_id_type (str) – Source identifier namespace.

  • target_id_type (str) – Target identifier namespace.

  • species (str | None) – Optional species hint accepted by MyGene.

  • max_workers (int) – Maximum number of parallel worker threads.

  • batch_size (int) – Number of identifiers per provider query chunk.

Returns:

dict[str, str] – Source identifier to first resolved target identifier.

Raises:

ValueError – If source or target identifier namespace is unsupported.
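The chunk-and-merge pattern described for this method can be sketched as follows. This is an assumption-laden illustration, not the method's code: `translate_in_chunks`, `query_chunk`, and `extract_target` are hypothetical names, and the provider call is replaced by a local stub so the example runs offline. It assumes each hit carries the original identifier under a `"query"` key, as MyGene `querymany` hits do.

```python
from __future__ import annotations

from collections.abc import Callable, Sequence
from concurrent.futures import ThreadPoolExecutor
from typing import Any

Hit = dict[str, Any]


def translate_in_chunks(
    identifiers: Sequence[str],
    query_chunk: Callable[[list[str]], list[Hit]],
    extract_target: Callable[[Hit], str | None],
    *,
    max_workers: int = 8,
    batch_size: int = 500,
) -> dict[str, str]:
    """Query chunks in parallel and keep the first target seen per source."""
    chunks = [
        list(identifiers[start:start + batch_size])
        for start in range(0, len(identifiers), batch_size)
    ]
    mapping: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in chunk order, so the merge is deterministic.
        for hits in pool.map(query_chunk, chunks):
            for hit in hits:
                source = hit.get("query")
                target = extract_target(hit)
                if source and target and source not in mapping:
                    mapping[source] = target  # first hit wins
    return mapping


def fake_query(chunk: list[str]) -> list[Hit]:
    # Stub provider: "B" returns two hits, "C" returns none.
    table = {"A": ["g1"], "B": ["g2", "g3"], "C": []}
    return [{"query": q, "symbol": s} for q in chunk for s in table.get(q, [])]


result = translate_in_chunks(
    ["A", "B", "C"], fake_query, lambda hit: hit.get("symbol"), batch_size=2
)
print(result)
# → {'A': 'g1', 'B': 'g2'}
```

Merging on the calling thread avoids any need for locking around the shared dictionary; the worker threads only perform the network-bound queries.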

static _query_mygene_chunk(chunk: list[str], source_scope: str, field_string: str, species: str | None) → list[dict[str, Any]]

Generated: validation needed.

Description:

Execute one MyGene querymany call for one identifier chunk.

Parameters:
  • chunk (list[str]) – Identifier chunk.

  • source_scope (str) – MyGene scopes value.

  • field_string (str) – MyGene fields value.

  • species (str | None) – Optional species filter.

Returns:

list[dict[str, Any]] – Raw MyGene hits for chunk.
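For orientation, a single chunk query might look like the sketch below. The `query_chunk` helper and the injected `client` argument are assumptions made so the example can run without network access. With the real `mygene` package the client would be `mygene.MyGeneInfo()`, whose `querymany` method accepts `scopes`, `fields`, `species`, and `returnall` keyword arguments. Whether the service creates one client per call or shares one is not stated in this reference.

```python
from __future__ import annotations

from typing import Any


def query_chunk(
    client: Any,
    chunk: list[str],
    source_scope: str,
    field_string: str,
    species: str | None,
) -> list[dict[str, Any]]:
    """Run one querymany call and return the raw hit list."""
    kwargs: dict[str, Any] = {
        "scopes": source_scope,
        "fields": field_string,
        "returnall": False,  # ask for a plain list of hits
    }
    if species is not None:
        kwargs["species"] = species  # only forward the filter when provided
    return list(client.querymany(chunk, **kwargs))


class FakeClient:
    """Offline stand-in that echoes each query term as a hit."""

    def querymany(self, qterms: list[str], **kwargs: Any) -> list[dict[str, Any]]:
        self.last_kwargs = kwargs
        return [{"query": term} for term in qterms]


client = FakeClient()
hits = query_chunk(client, ["id_a", "id_b"], "ensembl.transcript", "symbol", None)
print(hits)
```

Injecting the client also makes the function easy to exercise in tests, since the network call can be replaced by a stub such as `FakeClient`.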

static _extract_target_identifier(hit: dict[str, Any], *, target_id_type: str) → str | None

Generated: validation needed.

Description:

Extract one target identifier from one MyGene hit record.

Parameters:
  • hit (dict[str, Any]) – MyGene hit record.

  • target_id_type (str) – Target namespace selector.

Returns:

str | None – First resolved target identifier when available.

static _extract_ensembl_gene_identifier(hit: dict[str, Any]) → str | None

Generated: validation needed.

Description:

Extract one Ensembl gene identifier from a MyGene hit whose structure may vary between records.

Parameters:

hit (dict[str, Any]) – MyGene hit record.

Returns:

str | None – First resolved Ensembl gene identifier.
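One way to handle hits of varying shape is sketched below. It assumes the `ensembl` field of a hit is either a single record or a list of records, each a dictionary with a `gene` key. Both the function name and this shape assumption are illustrative; the actual method may cover additional cases.

```python
from __future__ import annotations

from typing import Any


def extract_ensembl_gene(hit: dict[str, Any]) -> str | None:
    """Return the first Ensembl gene ID found in a hit, or None."""
    ensembl = hit.get("ensembl")
    # Normalise the single-record case to a list so one loop covers both shapes.
    records = ensembl if isinstance(ensembl, list) else [ensembl]
    for record in records:
        if isinstance(record, dict):
            gene = record.get("gene")
            if isinstance(gene, str) and gene:
                return gene
    return None


# Placeholder gene IDs covering the three shapes handled above.
print(extract_ensembl_gene({"ensembl": {"gene": "gene_a"}}))
print(extract_ensembl_gene({"ensembl": [{"gene": "gene_b"}, {"gene": "gene_c"}]}))
print(extract_ensembl_gene({"symbol": "no_ensembl_field"}))
```

Returning `None` instead of raising keeps a single malformed hit from aborting a whole batch; the caller can record the identifier as unresolved.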

_extract_transcript_rows_from_hit(hit: dict[str, Any]) → list[dict[str, Any]]

Generated: validation needed.

Description:

Extract transcript metadata rows from one MyGene hit payload.

Parameters:

hit (dict[str, Any]) – MyGene hit record.

Returns:

list[dict[str, Any]] – Transcript metadata rows.

static _extract_canonical_transcript_identifier(hit: dict[str, Any]) → str | None

Generated: validation needed.

Description:

Extract canonical transcript identifier from one MyGene hit payload.

Parameters:

hit (dict[str, Any]) – MyGene hit record.

Returns:

str | None – Canonical transcript identifier when available.