IdentifierTranslationService
- class VmaxBuilder.database_retrieval.identifier_translation.IdentifierTranslationService
Generated: validation needed.
- Description:
Translate identifier namespaces and build transcript-to-gene mapping tables using network APIs with threaded execution.
Public Methods
- translate_identifiers(identifiers: Sequence[str], *, source_id_type: str, target_id_type: str, species: str | None = None, provider: str = 'auto', max_workers: int = 8, batch_size: int = 500) -> IdentifierTranslationResult
Generated: validation needed.
- Description:
Translate identifiers from one namespace into another with partial-result support.
- Parameters:
identifiers (Sequence[str]) – Source identifiers to translate.
source_id_type (str) – Source identifier namespace.
target_id_type (str) – Target identifier namespace.
species (str | None) – Optional species hint forwarded to provider.
provider (str) – Translation provider key. Supported values: auto, mygene.
max_workers (int) – Maximum number of parallel worker threads.
batch_size (int) – Number of identifiers per provider query chunk.
- Returns:
IdentifierTranslationResult – Mapping output and unresolved identifiers list.
- Raises:
ValueError – If provider or id-type configuration is unsupported.
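The partial-result behaviour described above can be pictured with a small stand-alone sketch. It does not import the service: `TranslationResultSketch`, its `mapping` and `unresolved` fields, and the `lookup` callable are hypothetical stand-ins, since the real field names of `IdentifierTranslationResult` are not shown in this reference.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Sequence


@dataclass
class TranslationResultSketch:
    """Hypothetical stand-in for IdentifierTranslationResult."""

    mapping: dict[str, str] = field(default_factory=dict)
    unresolved: list[str] = field(default_factory=list)


def translate_with_partial_results(
    identifiers: Sequence[str],
    lookup: Callable[[str], str | None],
) -> TranslationResultSketch:
    """Keep every resolved identifier and record the rest as unresolved."""
    result = TranslationResultSketch()
    for identifier in identifiers:
        target = lookup(identifier)
        if target is None:
            # A miss does not abort the run; it is reported alongside the hits.
            result.unresolved.append(identifier)
        else:
            result.mapping[identifier] = target
    return result


# A dict lookup plays the role of the remote provider in this sketch.
table = {"ENSG00000141510": "TP53", "ENSG00000012048": "BRCA1"}
result = translate_with_partial_results(
    ["ENSG00000141510", "ENSG_UNKNOWN", "ENSG00000012048"], table.get
)
```

Returning misses next to the hits lets a caller decide whether an incomplete mapping is acceptable instead of failing the whole translation.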
- build_transcript_gene_dataframe(transcript_ids: Sequence[str], *, transcript_id_type: str, target_gene_id_type: str, species: str | None = None, provider: str = 'auto', max_workers: int = 8, batch_size: int = 500) -> DataFrame
Generated: validation needed.
- Description:
Build transcript-to-gene mapping dataframe for transcript-level expression inputs.
- Parameters:
transcript_ids (Sequence[str]) – Transcript identifiers present in expression table.
transcript_id_type (str) – Transcript identifier namespace.
target_gene_id_type (str) – Target gene identifier namespace.
species (str | None) – Optional species hint forwarded to provider.
provider (str) – Translation provider key. Supported values: auto, mygene.
max_workers (int) – Maximum number of parallel worker threads.
batch_size (int) – Number of identifiers per provider query chunk.
- Returns:
pd.DataFrame – Mapping table with transcript_id and gene_id columns.
- build_gene_transcript_dataframe(gene_ids: Sequence[str], *, gene_id_type: str, species: str | None = None, provider: str = 'auto', max_workers: int = 8, batch_size: int = 500) -> DataFrame
Generated: validation needed.
- Description:
Build transcript metadata table for model genes with transcript-level annotation fields used by downstream transcript IFP expansion.
- Parameters:
gene_ids (Sequence[str]) – Model gene identifiers.
gene_id_type (str) – Gene identifier namespace.
species (str | None) – Optional species hint forwarded to provider.
provider (str) – Translation provider key. Supported values: auto, mygene.
max_workers (int) – Maximum number of parallel worker threads.
batch_size (int) – Number of identifiers per provider query chunk.
- Returns:
pd.DataFrame – Transcript metadata table with columns: transcript_id, gene_id, is_protein_coding, is_canonical, peptide_len, cdna_len, peptide_seq, cdna_seq.
- Raises:
ValueError – If provider or gene identifier namespace is unsupported.
- static _deduplicate_identifiers(identifiers: Sequence[str]) -> list[str]
Generated: validation needed.
- Description:
Deduplicate and strip identifiers while preserving input encounter order.
- Parameters:
identifiers (Sequence[str]) – Raw identifier sequence.
- Returns:
list[str] – Deduplicated non-empty identifiers.
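An order-preserving deduplication of this kind is commonly written with a `seen` set next to an output list. The function below is a minimal sketch of that technique, not the method's actual source.

```python
from __future__ import annotations

from typing import Sequence


def deduplicate_identifiers(identifiers: Sequence[str]) -> list[str]:
    """Strip whitespace, drop empty strings, and keep the first occurrence of each identifier."""
    seen: set[str] = set()
    unique: list[str] = []
    for raw in identifiers:
        identifier = raw.strip()
        if identifier and identifier not in seen:
            seen.add(identifier)
            unique.append(identifier)
    return unique


# Made-up identifiers with stray whitespace, an empty entry, and a repeat.
cleaned = deduplicate_identifiers([" ENST1", "ENST2", "", "ENST1 ", "  "])
```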
- _translate_with_mygene(*, identifiers: Sequence[str], source_id_type: str, target_id_type: str, species: str | None, max_workers: int, batch_size: int) -> dict[str, str]
Generated: validation needed.
- Description:
Translate identifier chunks through MyGene queries and merge first-hit mappings.
- Parameters:
identifiers (Sequence[str]) – Identifiers to map.
source_id_type (str) – Source identifier namespace.
target_id_type (str) – Target identifier namespace.
species (str | None) – Optional species hint accepted by MyGene.
max_workers (int) – Maximum number of parallel worker threads.
batch_size (int) – Number of identifiers per provider query chunk.
- Returns:
dict[str, str] – Source identifier to first resolved target identifier.
- Raises:
ValueError – If source or target identifier namespace is unsupported.
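The chunk-and-merge pattern described above can be sketched with `concurrent.futures.ThreadPoolExecutor`. Everything here is an assumption made for illustration: `translate_in_chunks`, `chunked`, and `fake_query_chunk` are hypothetical names, the hits are reduced to a `query` key plus a `symbol` key, and the second hit for `1017` is invented to show the first-hit rule.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Sequence


def chunked(items: Sequence[str], size: int) -> list[list[str]]:
    """Split items into consecutive chunks of at most `size` elements."""
    return [list(items[i : i + size]) for i in range(0, len(items), size)]


def translate_in_chunks(
    identifiers: Sequence[str],
    query_chunk: Callable[[list[str]], list[dict[str, Any]]],
    *,
    max_workers: int = 8,
    batch_size: int = 500,
) -> dict[str, str]:
    """Query chunks in parallel threads and keep the first target seen per source."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    mapping: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in submission order, so the merge is deterministic.
        for hits in pool.map(query_chunk, chunked(identifiers, batch_size)):
            for hit in hits:
                source, target = hit.get("query"), hit.get("symbol")
                if source is not None and target is not None:
                    # setdefault implements "first hit wins".
                    mapping.setdefault(source, str(target))
    return mapping


def fake_query_chunk(chunk: list[str]) -> list[dict[str, Any]]:
    """Offline stand-in for a provider call; the extra CDK2 hit is made up."""
    known = {"1017": ["CDK2", "CDK2-ALT"], "1018": ["CDK3"]}
    hits: list[dict[str, Any]] = []
    for identifier in chunk:
        symbols = known.get(identifier)
        if symbols is None:
            hits.append({"query": identifier, "notfound": True})
        else:
            hits.extend({"query": identifier, "symbol": s} for s in symbols)
    return hits


mapping = translate_in_chunks(
    ["1017", "1018", "9999999"], fake_query_chunk, max_workers=2, batch_size=2
)
```

Threads suit this workload because each chunk spends most of its time waiting on the network rather than computing.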
- static _query_mygene_chunk(chunk: list[str], source_scope: str, field_string: str, species: str | None) -> list[dict[str, Any]]
Generated: validation needed.
- Description:
Execute one MyGene querymany call for one identifier chunk.
- Parameters:
chunk (list[str]) – Identifier chunk.
source_scope (str) – MyGene scopes value.
field_string (str) – MyGene fields value.
species (str | None) – Optional species filter.
- Returns:
list[dict[str, Any]] – Raw MyGene hits for chunk.
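A single-chunk query might look like the sketch below. It assumes a client object exposing `querymany(qterms, scopes=..., fields=..., species=...)`, as `mygene.MyGeneInfo` does. Passing the client in as an argument is a choice made for this sketch so it runs offline; `RecordingClient` is a hypothetical test double, and omitting `species` when it is `None` is likewise an assumption about how the optional filter is handled.

```python
from __future__ import annotations

from typing import Any


def query_chunk(
    client: Any,
    chunk: list[str],
    source_scope: str,
    field_string: str,
    species: str | None,
) -> list[dict[str, Any]]:
    """Run one querymany call for one chunk and return the raw hit list."""
    kwargs: dict[str, Any] = {
        "scopes": source_scope,
        "fields": field_string,
        "returnall": False,  # ask for the plain list of hits
        "verbose": False,
    }
    if species is not None:
        # Only forward the filter when the caller supplied one.
        kwargs["species"] = species
    return list(client.querymany(chunk, **kwargs))


class RecordingClient:
    """Hypothetical offline double that records calls instead of using the network."""

    def __init__(self) -> None:
        self.calls: list[tuple[list[str], dict[str, Any]]] = []

    def querymany(self, qterms: list[str], **kwargs: Any) -> list[dict[str, Any]]:
        self.calls.append((list(qterms), kwargs))
        return [{"query": term, "notfound": True} for term in qterms]


client = RecordingClient()
hits = query_chunk(client, ["1017", "1018"], "entrezgene", "symbol", None)
```

With the `mygene` package installed, a `mygene.MyGeneInfo()` instance could be passed in place of `RecordingClient()`.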
- static _extract_target_identifier(hit: dict[str, Any], *, target_id_type: str) -> str | None
Generated: validation needed.
- Description:
Extract one target identifier from one MyGene hit record.
- Parameters:
hit (dict[str, Any]) – MyGene hit record.
target_id_type (str) – Target namespace selector.
- Returns:
str | None – First resolved target identifier when available.
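One way to picture the selector logic is a lookup table from target namespace to hit field. In the sketch below, the selector names `"symbol"` and `"entrez"` are hypothetical (the namespaces the service actually accepts are not listed in this reference), while `symbol` and `entrezgene` are standard MyGene hit fields. Returning `None` for an unknown selector is also an assumption.

```python
from __future__ import annotations

from typing import Any

# Hypothetical selector names mapped to MyGene hit fields.
FIELD_BY_TARGET = {"symbol": "symbol", "entrez": "entrezgene"}


def extract_target_identifier(hit: dict[str, Any], *, target_id_type: str) -> str | None:
    """Return the first value of the field selected by target_id_type, if any."""
    field_name = FIELD_BY_TARGET.get(target_id_type)
    if field_name is None:
        return None
    value = hit.get(field_name)
    if isinstance(value, list):
        # Some fields can carry several values; keep the first one.
        value = value[0] if value else None
    if value is None:
        return None
    # entrezgene may arrive as an int, so normalise to str.
    return str(value)


symbol = extract_target_identifier({"symbol": "CDK2", "entrezgene": 1017}, target_id_type="symbol")
entrez = extract_target_identifier({"symbol": "CDK2", "entrezgene": 1017}, target_id_type="entrez")
missing = extract_target_identifier({"query": "x", "notfound": True}, target_id_type="symbol")
```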
- static _extract_ensembl_gene_identifier(hit: dict[str, Any]) -> str | None
Generated: validation needed.
- Description:
Extract one Ensembl gene identifier from variant MyGene hit structures.
- Parameters:
hit (dict[str, Any]) – MyGene hit record.
- Returns:
str | None – First resolved Ensembl gene identifier.
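The "variant hit structures" most likely refer to the fact that MyGene can return the `ensembl` field either as a single object or as a list of objects when a gene maps to several Ensembl records. The sketch below handles those two shapes. It is an illustration under that assumption, and the identifiers are placeholders rather than real Ensembl IDs.

```python
from __future__ import annotations

from typing import Any


def extract_ensembl_gene_identifier(hit: dict[str, Any]) -> str | None:
    """Return the first Ensembl gene ID found in a dict-shaped or list-shaped ensembl field."""
    ensembl = hit.get("ensembl")
    if isinstance(ensembl, dict):
        candidates = [ensembl]
    elif isinstance(ensembl, list):
        candidates = [entry for entry in ensembl if isinstance(entry, dict)]
    else:
        candidates = []
    for entry in candidates:
        gene = entry.get("gene")
        if isinstance(gene, str) and gene:
            return gene
    return None


# Placeholder identifiers, not real Ensembl IDs.
from_dict = extract_ensembl_gene_identifier({"ensembl": {"gene": "ENSG_A"}})
from_list = extract_ensembl_gene_identifier({"ensembl": [{"gene": "ENSG_B"}, {"gene": "ENSG_C"}]})
from_miss = extract_ensembl_gene_identifier({"query": "x", "notfound": True})
```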
- _extract_transcript_rows_from_hit(hit: dict[str, Any]) -> list[dict[str, Any]]
Generated: validation needed.
- Description:
Extract transcript metadata rows from one MyGene hit payload.
- Parameters:
hit (dict[str, Any]) – MyGene hit record.
- Returns:
list[dict[str, Any]] – Transcript metadata rows.
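As a rough illustration of row extraction, the sketch below flattens the `ensembl.transcript` values of a hit into one row per transcript. It covers only the `transcript_id` and `gene_id` columns. How the remaining annotation columns are populated is not described in this reference, so they are left out. The identifiers are placeholders.

```python
from __future__ import annotations

from typing import Any


def extract_transcript_rows(hit: dict[str, Any]) -> list[dict[str, Any]]:
    """Build one {transcript_id, gene_id} row per transcript listed in the hit."""
    ensembl = hit.get("ensembl")
    entries = ensembl if isinstance(ensembl, list) else [ensembl]
    rows: list[dict[str, Any]] = []
    for entry in entries:
        if not isinstance(entry, dict):
            continue
        transcripts = entry.get("transcript", [])
        if isinstance(transcripts, str):
            # A single transcript may be returned as a bare string.
            transcripts = [transcripts]
        for transcript_id in transcripts:
            rows.append({"transcript_id": transcript_id, "gene_id": entry.get("gene")})
    return rows


# Placeholder identifiers, not real Ensembl IDs.
rows = extract_transcript_rows(
    {"ensembl": {"gene": "ENSG_A", "transcript": ["ENST_A1", "ENST_A2"]}}
)
```

A list of row dictionaries like this can be handed directly to `pandas.DataFrame(rows)` to obtain a table.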
- static _extract_canonical_transcript_identifier(hit: dict[str, Any]) -> str | None
Generated: validation needed.
- Description:
Extract canonical transcript identifier from one MyGene hit payload.
- Parameters:
hit (dict[str, Any]) – MyGene hit record.
- Returns:
str | None – Canonical transcript identifier when available.