IdentifierTranslationService

class VmaxBuilder.database_retrieval.identifier_translation.IdentifierTranslationService

Generated: validation needed.

Description: Translate identifiers between namespaces and build transcript-to-gene mapping tables by querying network APIs from a pool of worker threads.

Public Methods

`build_gene_transcript_dataframe`(…)	Generated: validation needed.
`build_transcript_gene_dataframe`(…)	Generated: validation needed.
`translate_identifiers`(…)	Generated: validation needed.

translate_identifiers(identifiers: Sequence[str], *, source_id_type: str, target_id_type: str, species: str | None = None, provider: str = 'auto', max_workers: int = 8, batch_size: int = 500) → IdentifierTranslationResult

Generated: validation needed.

Description: Translate identifiers from one namespace into another, returning partial results when some identifiers cannot be resolved.

Parameters:

identifiers (Sequence[str]) – Source identifiers to translate.
source_id_type (str) – Source identifier namespace.
target_id_type (str) – Target identifier namespace.
species (str | None) – Optional species hint forwarded to provider.
provider (str) – Translation provider key. Supported values: auto, mygene.
max_workers (int) – Maximum number of parallel worker threads.
batch_size (int) – Number of identifiers per provider query chunk.

Returns:

IdentifierTranslationResult – Mapping output and unresolved identifiers list.

Raises:

ValueError – If provider or id-type configuration is unsupported.
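The partial-result behavior described above can be illustrated with a small, self-contained sketch. It does not reproduce IdentifierTranslationResult: the class name `TranslationResultSketch`, its field names `mapping` and `unresolved`, and the `lookup` callable are all hypothetical stand-ins chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Sequence


@dataclass
class TranslationResultSketch:
    """Hypothetical stand-in for IdentifierTranslationResult (field names assumed)."""

    mapping: dict[str, str] = field(default_factory=dict)
    unresolved: list[str] = field(default_factory=list)


def translate_with_partial_results(
    identifiers: Sequence[str],
    lookup: Callable[[str], str | None],
) -> TranslationResultSketch:
    """Translate what can be translated and record the rest as unresolved."""
    result = TranslationResultSketch()
    for identifier in identifiers:
        target = lookup(identifier)
        if target is None:
            # A miss does not abort the run; it is reported to the caller.
            result.unresolved.append(identifier)
        else:
            result.mapping[identifier] = target
    return result


# Toy lookup table standing in for a network provider.
toy_table = {"ENST0001": "GENE_A", "ENST0002": "GENE_B"}
result = translate_with_partial_results(
    ["ENST0001", "ENST9999", "ENST0002"], toy_table.get
)
```

Here `result.mapping` holds the two resolved identifiers and `result.unresolved` holds the one that the toy table does not know.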

build_transcript_gene_dataframe(transcript_ids: Sequence[str], *, transcript_id_type: str, target_gene_id_type: str, species: str | None = None, provider: str = 'auto', max_workers: int = 8, batch_size: int = 500) → DataFrame

Generated: validation needed.

Description: Build a transcript-to-gene mapping DataFrame for transcript-level expression inputs.

Parameters:

transcript_ids (Sequence[str]) – Transcript identifiers present in expression table.
transcript_id_type (str) – Transcript identifier namespace.
target_gene_id_type (str) – Target gene identifier namespace.
species (str | None) – Optional species hint forwarded to provider.
provider (str) – Translation provider key. Supported values: auto, mygene.
max_workers (int) – Maximum number of parallel worker threads.
batch_size (int) – Number of identifiers per provider query chunk.

Returns:

pd.DataFrame – Mapping table with transcript_id and gene_id columns.
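As a rough illustration of the returned table's shape, the sketch below turns a transcript-to-gene mapping into rows with the two documented columns. The helper name `build_mapping_rows` is hypothetical, and dropping transcripts that have no resolved gene is an assumption; the actual method may keep them with missing values instead.

```python
from __future__ import annotations

from typing import Mapping, Sequence


def build_mapping_rows(
    transcript_ids: Sequence[str],
    mapping: Mapping[str, str],
) -> list[dict[str, str]]:
    """Return one row per resolved transcript, in input order."""
    rows: list[dict[str, str]] = []
    for transcript_id in transcript_ids:
        gene_id = mapping.get(transcript_id)
        if gene_id is not None:  # assumption: unresolved transcripts are omitted
            rows.append({"transcript_id": transcript_id, "gene_id": gene_id})
    return rows


rows = build_mapping_rows(
    ["ENST0001", "ENST0002", "ENST9999"],
    {"ENST0001": "GENE_A", "ENST0002": "GENE_A"},
)
```

With pandas installed, `pd.DataFrame(rows, columns=["transcript_id", "gene_id"])` would turn these rows into a table with the documented columns.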

build_gene_transcript_dataframe(gene_ids: Sequence[str], *, gene_id_type: str, species: str | None = None, provider: str = 'auto', max_workers: int = 8, batch_size: int = 500) → DataFrame

Generated: validation needed.

Description: Build a transcript metadata table for model genes, including the transcript-level annotation fields used by downstream transcript IFP expansion.

Parameters:

gene_ids (Sequence[str]) – Model gene identifiers.
gene_id_type (str) – Gene identifier namespace.
species (str | None) – Optional species hint forwarded to provider.
provider (str) – Translation provider key. Supported values: auto, mygene.
max_workers (int) – Maximum number of parallel worker threads.
batch_size (int) – Number of identifiers per provider query chunk.

Returns:

pd.DataFrame – Transcript metadata table with columns: transcript_id, gene_id, is_protein_coding, is_canonical, peptide_len, cdna_len, peptide_seq, cdna_seq.

Raises:

ValueError – If provider or gene identifier namespace is unsupported.

static _deduplicate_identifiers(identifiers: Sequence[str]) → list[str]

Generated: validation needed.

Description: Strip and deduplicate identifiers, preserving the order in which they first appear in the input.

Parameters:

identifiers (Sequence[str]) – Raw identifier sequence.

Returns:

list[str] – Deduplicated non-empty identifiers.
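One way to deduplicate while preserving order is sketched below. It assumes that whitespace is stripped before duplicates are compared and that values left empty after stripping are dropped; the function name is illustrative and this is not the method's actual source.

```python
from __future__ import annotations

from typing import Sequence


def deduplicate_identifiers(identifiers: Sequence[str]) -> list[str]:
    """Strip, drop empties, and keep only the first occurrence of each value."""
    seen: set[str] = set()
    ordered: list[str] = []
    for raw in identifiers:
        cleaned = raw.strip()
        if not cleaned or cleaned in seen:
            continue  # skip blanks and repeats
        seen.add(cleaned)
        ordered.append(cleaned)
    return ordered


cleaned_ids = deduplicate_identifiers([" A", "B", "A ", "", "B", "C"])
```

The separate `seen` set keeps membership checks constant-time, while the list records first-encounter order.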

_translate_with_mygene(*, identifiers: Sequence[str], source_id_type: str, target_id_type: str, species: str | None, max_workers: int, batch_size: int) → dict[str, str]

Generated: validation needed.

Description: Translate identifiers in chunks through MyGene queries and merge the results, keeping the first hit for each source identifier.

Parameters:

identifiers (Sequence[str]) – Identifiers to map.
source_id_type (str) – Source identifier namespace.
target_id_type (str) – Target identifier namespace.
species (str | None) – Optional species hint accepted by MyGene.
max_workers (int) – Maximum number of parallel worker threads.
batch_size (int) – Number of identifiers per provider query chunk.

Returns:

dict[str, str] – Source identifier to first resolved target identifier.

Raises:

ValueError – If source or target identifier namespace is unsupported.
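The chunk-then-merge pattern described above might look like the following sketch. It uses a thread pool from the standard library and a caller-supplied `query_chunk` function in place of a real MyGene call; `chunked`, `translate_in_chunks`, and the pair-based return shape of `query_chunk` are assumptions made for illustration.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence

Pairs = list[tuple[str, str]]


def chunked(items: Sequence[str], size: int) -> list[list[str]]:
    """Split items into consecutive chunks of at most `size` elements."""
    if size < 1:
        raise ValueError("batch_size must be at least 1")
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def translate_in_chunks(
    identifiers: Sequence[str],
    query_chunk: Callable[[list[str]], Pairs],
    *,
    max_workers: int = 8,
    batch_size: int = 500,
) -> dict[str, str]:
    """Query chunks in parallel and keep the first target seen per source."""
    merged: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in submission order, so the merge is
        # deterministic even though the queries run concurrently.
        for pairs in pool.map(query_chunk, chunked(identifiers, batch_size)):
            for source_id, target_id in pairs:
                merged.setdefault(source_id, target_id)  # first hit wins
    return merged


def fake_query(chunk: list[str]) -> Pairs:
    """Offline stand-in for a provider call; returns a duplicate hit for 'A'."""
    pairs = [(identifier, identifier.lower()) for identifier in chunk]
    if "A" in chunk:
        pairs.append(("A", "second-hit"))
    return pairs


mapping = translate_in_chunks(
    ["A", "B", "C"], fake_query, max_workers=2, batch_size=2
)
```

Using `setdefault` during the merge is what implements the first-hit rule: later hits for an already-mapped source identifier are ignored.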

static _query_mygene_chunk(chunk: list[str], source_scope: str, field_string: str, species: str | None) → list[dict[str, Any]]

Generated: validation needed.

Description: Execute one MyGene querymany call for one identifier chunk.

Parameters:

chunk (list[str]) – Identifier chunk.
source_scope (str) – MyGene scopes value.
field_string (str) – MyGene fields value.
species (str | None) – Optional species filter.

Returns:

list[dict[str, Any]] – Raw MyGene hits for chunk.
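A single chunk query could be sketched as follows, using the `querymany` method of the `mygene` package's `MyGeneInfo` client with its `scopes`, `fields`, and `species` arguments. The `client` parameter is a hypothetical addition, not part of the documented signature, so that the example can run offline against a stand-in object.

```python
from __future__ import annotations

from typing import Any


def query_mygene_chunk(
    chunk: list[str],
    source_scope: str,
    field_string: str,
    species: str | None,
    client: Any = None,
) -> list[dict[str, Any]]:
    """Run one querymany call and return the raw hit records."""
    if client is None:
        import mygene  # third-party package; only needed for real queries

        client = mygene.MyGeneInfo()
    kwargs: dict[str, Any] = {"scopes": source_scope, "fields": field_string}
    if species is not None:
        kwargs["species"] = species  # omit the filter when no hint is given
    return list(client.querymany(chunk, **kwargs))


class FakeClient:
    """Offline stand-in that records calls instead of hitting the network."""

    def __init__(self) -> None:
        self.calls: list[tuple[list[str], dict[str, Any]]] = []

    def querymany(self, qterms: list[str], **kwargs: Any) -> list[dict[str, Any]]:
        self.calls.append((list(qterms), kwargs))
        return [{"query": term} for term in qterms]


fake = FakeClient()
hits = query_mygene_chunk(["CDK2", "BRCA1"], "symbol", "ensembl.gene", "human", client=fake)
```

Passing no `client` makes the function create a real `MyGeneInfo` instance, which requires the `mygene` package and network access.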

static _extract_target_identifier(hit: dict[str, Any], *, target_id_type: str) → str | None

Generated: validation needed.

Description: Extract one target identifier from one MyGene hit record.

Parameters:

hit (dict[str, Any]) – MyGene hit record.
target_id_type (str) – Target namespace selector.

Returns:

str | None – First resolved target identifier when available.

static _extract_ensembl_gene_identifier(hit: dict[str, Any]) → str | None

Generated: validation needed.

Description: Extract one Ensembl gene identifier from a MyGene hit, whose structure can vary between records.

Parameters:

hit (dict[str, Any]) – MyGene hit record.

Returns:

str | None – First resolved Ensembl gene identifier.
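MyGene hits can carry the `ensembl` field either as a single object or as a list of objects when a gene maps to several Ensembl records. The sketch below handles both shapes and returns the first gene identifier it finds; it is an illustration under that assumption, and the actual method may cover further cases.

```python
from __future__ import annotations

from typing import Any


def extract_ensembl_gene_identifier(hit: dict[str, Any]) -> str | None:
    """Return the first Ensembl gene ID in a hit, or None when absent."""
    ensembl = hit.get("ensembl")
    if isinstance(ensembl, dict):
        candidates = [ensembl]  # single-record shape
    elif isinstance(ensembl, list):
        candidates = [entry for entry in ensembl if isinstance(entry, dict)]
    else:
        return None  # field missing, e.g. for a not-found query
    for entry in candidates:
        gene = entry.get("gene")
        if isinstance(gene, str) and gene:
            return gene
    return None


single = extract_ensembl_gene_identifier({"ensembl": {"gene": "ENSG_EXAMPLE_1"}})
multiple = extract_ensembl_gene_identifier(
    {"ensembl": [{"gene": "ENSG_EXAMPLE_1"}, {"gene": "ENSG_EXAMPLE_2"}]}
)
missing = extract_ensembl_gene_identifier({"query": "unknown", "notfound": True})
```

The identifiers in the usage lines are placeholders, not real Ensembl accessions.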

_extract_transcript_rows_from_hit(hit: dict[str, Any]) → list[dict[str, Any]]

Generated: validation needed.

Description: Extract transcript metadata rows from one MyGene hit payload.

Parameters:

hit (dict[str, Any]) – MyGene hit record.

Returns:

list[dict[str, Any]] – Transcript metadata rows.

static _extract_canonical_transcript_identifier(hit: dict[str, Any]) → str | None

Generated: validation needed.

Description: Extract canonical transcript identifier from one MyGene hit payload.

Parameters:

hit (dict[str, Any]) – MyGene hit record.

Returns:

str | None – Canonical transcript identifier when available.