Species Identity Proposal¶
Summary¶
This proposal makes latin_name the canonical identity for every Species
object and treats prefixes, common names, historic prefixes, and source-local
codes as aliases rather than as identity.
The main reason is collision handling. Today, bare prefixes can map to multiple species and runtime lookup resolves that ambiguity with a heuristic. That is not stable enough for a curated ontology.
Current Problem¶
The repo already uses scientific names as the top-level keys in
mhcgnomes/data/species.yaml, but runtime lookup still conflates:
- canonical identity
- user-facing display prefix
- legacy aliases
- source-local external codes
This shows up in a few places:
- Species.name currently stores the scientific name, but that is not exposed explicitly as latin_name.
- Species.get() accepts prefixes, common names, and scientific names through a shared alias table.
- ambiguous aliases are sorted with create_species_sort_key() instead of being treated as ambiguous input.
- side tables such as gene_aliases.yaml are keyed by a mix of canonical prefixes and legacy identifiers rather than one stable species key.
The result is that collisions are handled as lookup accidents instead of as a first-class ontology concern.
Goals¶
- Make canonical species identity explicit and stable inside the codebase.
- Stop silently resolving colliding species aliases.
- Preserve parsing for unambiguous prefixes and common names.
- Keep normalized output based on the curated MHC prefix, not the scientific name.
- Allow a staged migration with minimal immediate breakage.
Non-Goals¶
- Full taxonomic authority management or synonym history.
- Reworking allele or gene ontology beyond species identity boundaries.
- Changing the public normalized string format from HLA-A*02:01 style output to scientific-name output.
Proposed Model¶
Canonical identity¶
Use scientific name as the canonical species key everywhere inside the runtime model.
Proposed Species contract:
- Species.latin_name: canonical identity field
- Species.name: compatibility alias for latin_name during migration
- Species.prefix: curated primary runtime MHC prefix
- Species.other_mhc_prefixes: non-primary runtime prefixes
- Species.old_mhc_prefix: historic display/parse prefix
- Species.common_name and Species.other_common_names: human-oriented aliases
The important change is semantic, not cosmetic: prefixes stop being identity.
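As a rough illustration of the contract above, the following sketch models a species whose equality and hashing depend only on latin_name. The class is a hypothetical stand-in, not the existing mhcgnomes implementation, and the "PLACEHOLDER" prefix is made up for the example.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass(frozen=True)
class Species:
    """Hypothetical stand-in for the real class; only latin_name defines identity."""

    latin_name: str
    # compare=False keeps alias fields out of the generated __eq__ and __hash__.
    prefix: str = field(compare=False)
    other_mhc_prefixes: Tuple[str, ...] = field(default=(), compare=False)
    old_mhc_prefix: Optional[str] = field(default=None, compare=False)
    common_name: Optional[str] = field(default=None, compare=False)
    other_common_names: Tuple[str, ...] = field(default=(), compare=False)

    @property
    def name(self) -> str:
        # Compatibility alias for latin_name during migration.
        return self.latin_name


human = Species(latin_name="Homo sapiens", prefix="HLA", common_name="human")
# Same scientific name, different (made-up) prefix: still the same species.
relabelled = Species(latin_name="Homo sapiens", prefix="PLACEHOLDER")
```

Here human == relabelled holds even though the prefixes differ, which is the point of the contract: changing a prefix changes an alias, not the species.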
Alias resolution¶
Split exact canonical lookup from alias lookup.
Proposed API shape:
- Species.get_by_latin_name(latin_name) -> Species | None
- Species.get_multiple(query) -> tuple[Species, ...]
- Species.resolve_alias(query) -> tuple[Species, ...]
- Species.get(query) -> Species | None
Behavior of Species.get() after migration:
- exact scientific name: return that species
- unique alias: return that species
- ambiguous alias: return None or raise a dedicated ambiguity error
The current heuristic winner-selection should be removed for ambiguous aliases.
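A minimal sketch of the intended lookup behavior, written as module-level functions over toy tables rather than as methods on the real Species class. Several things here are assumptions: case-insensitive matching, the raise_on_ambiguity flag, the AmbiguousSpeciesError type (still an open question), and the two candidate species for Bubu, which are illustrative guesses and not taken from species.yaml.

```python
from typing import Dict, Optional, Tuple


class AmbiguousSpeciesError(LookupError):
    """Hypothetical error type; the proposal leaves None-vs-raise open."""


# Toy tables keyed by lowercased query; values are scientific names.
LATIN_NAMES: Dict[str, str] = {
    "homo sapiens": "Homo sapiens",
    "bubalus bubalis": "Bubalus bubalis",
    "bubo bubo": "Bubo bubo",
}
ALIASES: Dict[str, Tuple[str, ...]] = {
    "hla": ("Homo sapiens",),
    "human": ("Homo sapiens",),
    "bubu": ("Bubalus bubalis", "Bubo bubo"),  # illustrative collision
}


def get_by_latin_name(latin_name: str) -> Optional[str]:
    # Exact canonical lookup: never consults the alias table.
    return LATIN_NAMES.get(latin_name.strip().lower())


def resolve_alias(query: str) -> Tuple[str, ...]:
    # Returns every candidate; an empty tuple means the query is unknown.
    exact = get_by_latin_name(query)
    if exact is not None:
        return (exact,)
    return ALIASES.get(query.strip().lower(), ())


def get_species(query: str, raise_on_ambiguity: bool = False) -> Optional[str]:
    candidates = resolve_alias(query)
    if len(candidates) == 1:
        return candidates[0]
    if len(candidates) > 1 and raise_on_ambiguity:
        raise AmbiguousSpeciesError(f"{query!r} matches {candidates}")
    return None  # unknown or ambiguous; no heuristic winner is picked
```

With these tables, get_species("HLA") resolves to the human entry, while get_species("Bubu") returns None and resolve_alias("Bubu") exposes both candidates to the caller.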
Parser Behavior¶
Species-only parsing¶
When the input is just a species token:
- HLA should still parse as human because it is unique
- Human should still parse as human because it is unique
- a colliding token such as Bubu should not silently pick a species
Species plus downstream context¶
Parser code should still be allowed to use downstream evidence to disambiguate. For example:
- if an alias maps to multiple species but only one candidate contains the observed canonical gene or gene alias, resolve to that species
- if more than one candidate still fits, return ambiguity rather than choosing by number of curated genes
This preserves practical parsing while making collisions explicit.
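The two rules above could be sketched as a filter over the alias candidates. The function names, the gene table, and the gene names (GENE1, GENE2, SHARED) are all hypothetical, and the candidate species for Bubu are again illustrative only.

```python
from typing import Mapping, Optional, Sequence, Set, Tuple


def narrow_by_gene(
    candidates: Sequence[str],
    gene: str,
    genes_by_species: Mapping[str, Set[str]],
) -> Tuple[str, ...]:
    # Keep only candidates that actually contain the observed gene or gene alias.
    return tuple(s for s in candidates if gene in genes_by_species.get(s, set()))


def resolve_with_gene(
    candidates: Sequence[str],
    gene: str,
    genes_by_species: Mapping[str, Set[str]],
) -> Optional[str]:
    remaining = narrow_by_gene(candidates, gene, genes_by_species)
    # Exactly one survivor resolves the alias; anything else stays unresolved.
    # The number of curated genes per species is deliberately not a tie-breaker.
    return remaining[0] if len(remaining) == 1 else None


# Toy data, not taken from the curated YAML files.
GENES_BY_SPECIES = {
    "Bubalus bubalis": {"GENE1", "SHARED"},
    "Bubo bubo": {"GENE2", "SHARED"},
}
CANDIDATES = ("Bubalus bubalis", "Bubo bubo")
```

In this toy setup GENE1 resolves to a single species, while SHARED leaves both candidates in place and the resolver returns None instead of guessing.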
Default species¶
default_species should accept:
- Species
- scientific name
- unique alias
Internally it should normalize to canonical latin_name as early as possible.
default_species remains lenient: it is only a fallback/default when the input
does not unambiguously identify another species.
Strict species¶
species should accept the same identifiers as default_species, but it is a
strict filter:
- mismatched Species parses should fail
- mismatched gene/allele parses should fail
- generic ancestor-scoped parses such as BoLA-... may be converted to a requested descendant species if reparsing there is valid
- descendant results should not be silently cast up to an ancestor taxon
- parent taxonomic prefixes should not be treated as aliases for child species
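To contrast the lenient default_species fallback with the strict species filter, here is a small sketch over scientific names. It rests on several assumptions: the function names are invented, failure is modeled as None, the toy taxonomy treats Bos taurus as a child of Bos sp., and a descendant result under an ancestor request is treated as a mismatch (one possible reading of "not silently cast up").

```python
from typing import Callable, Dict, Optional

# Toy taxonomy: child scientific name -> parent scientific name.
PARENT: Dict[str, str] = {"Bos taurus": "Bos sp."}


def is_ancestor(ancestor: str, descendant: str) -> bool:
    # Walk up the parent chain from the descendant.
    node = PARENT.get(descendant)
    while node is not None:
        if node == ancestor:
            return True
        node = PARENT.get(node)
    return False


def apply_default_species(parsed: Optional[str], default_species: str) -> str:
    # Lenient: only used when the input did not identify a species itself.
    return default_species if parsed is None else parsed


def apply_strict_species(
    parsed: Optional[str],
    species: str,
    reparse_is_valid: Callable[[str], bool] = lambda latin_name: True,
) -> Optional[str]:
    if parsed is None or parsed == species:
        return species
    # A generic ancestor-scoped parse may be narrowed to the requested descendant,
    # but only if reparsing in that species is valid.
    if is_ancestor(parsed, species) and reparse_is_valid(species):
        return species
    # Everything else, including descendant-to-ancestor, is a mismatch.
    return None
```

Under these assumptions, a parse scoped to Bos sp. can be narrowed to a requested Bos taurus, but a Bos taurus result is rejected when Bos sp. is requested.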
Data File Changes¶
species.yaml¶
Keep the file keyed by scientific name. That is already the right shape.
Recommended schema clarification:
- prefix: primary runtime prefix
- old prefix: historic prefix
- other prefixes: additional accepted runtime aliases
- name: common name or list of common names
No top-level entry should be keyed by prefix.
Alias-bearing runtime YAML¶
Files currently keyed by species alias should move toward canonical scientific-name keys:
- gene_aliases.yaml
- allele_aliases.yaml
- known_alleles.yaml
- haplotypes.yaml
- serotypes.yaml
- heterodimers.yaml
- supertypes.yaml
Migration rule:
- Canonical key is scientific name.
- Loader temporarily accepts legacy prefix keys for backward compatibility.
- Validation rejects ambiguous non-scientific-name keys once a collision exists.
This eliminates the current need to combine side tables by trying every identifier associated with a species.
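One way the transitional loader could apply the migration rule is sketched below. The function name and the shape of the side table (a mapping from species key to a dict of entries) are assumptions, and the alias values are toy data.

```python
from typing import Dict, Mapping, Sequence, Set


def rekey_by_latin_name(
    raw: Mapping[str, Mapping[str, str]],
    latin_names: Set[str],
    alias_to_latin_names: Mapping[str, Sequence[str]],
) -> Dict[str, Dict[str, str]]:
    """Return the side table rekeyed by scientific name."""
    rekeyed: Dict[str, Dict[str, str]] = {}
    for key, entries in raw.items():
        if key in latin_names:
            canonical = key
        else:
            # Legacy prefix key: accepted only while it maps to exactly one species.
            candidates = alias_to_latin_names.get(key, ())
            if len(candidates) != 1:
                raise ValueError(
                    f"{key!r} is not a scientific name and maps to "
                    f"{len(candidates)} species"
                )
            canonical = candidates[0]
        rekeyed.setdefault(canonical, {}).update(entries)
    return rekeyed


LATIN = {"Homo sapiens"}
ALIAS_TABLE = {"HLA": ("Homo sapiens",), "Bubu": ("Bubalus bubalis", "Bubo bubo")}
legacy_table = {"HLA": {"alias1": "GENE1"}, "Homo sapiens": {"alias2": "GENE2"}}
merged = rekey_by_latin_name(legacy_table, LATIN, ALIAS_TABLE)
```

After rekeying, both the legacy HLA entry and the scientific-name entry land under a single Homo sapiens key, while a colliding key such as Bubu fails validation instead of being merged into an arbitrary species.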
Source registry¶
underrepresented_taxa_source_registry.yaml currently serves a second purpose
in addition to curation planning: it is the machine-readable provenance ledger
for short runtime prefixes added from underrepresented taxa work.
Current requirement:
- top-level keyed by runtime prefix
- required scientific_name
- at least one source URL (taxonomy_sources, structured_sources, literature_sources, or representative_annotation_sources)
That gives every active short prefix an auditable source trail, even when the runtime ontology itself is keyed by scientific name.
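The registry requirement above lends itself to a simple validation pass. This is a sketch under the assumption that the registry is loaded into a dict keyed by runtime prefix; the function name, the Xxxx and Yyyy prefixes, the species name, and the example.org URL are placeholders.

```python
from typing import Any, List, Mapping

SOURCE_FIELDS = (
    "taxonomy_sources",
    "structured_sources",
    "literature_sources",
    "representative_annotation_sources",
)


def validate_registry(registry: Mapping[str, Mapping[str, Any]]) -> List[str]:
    """Return one human-readable error per violated requirement."""
    errors: List[str] = []
    for prefix, entry in registry.items():
        if not entry.get("scientific_name"):
            errors.append(f"{prefix}: missing scientific_name")
        # At least one of the source lists must be non-empty.
        if not any(entry.get(field) for field in SOURCE_FIELDS):
            errors.append(f"{prefix}: no source URL in any of {SOURCE_FIELDS}")
    return errors


# Placeholder entries, not real registry content.
registry = {
    "Xxxx": {
        "scientific_name": "Examplus placeholderi",
        "taxonomy_sources": ["https://example.org/taxon/1"],
    },
    "Yyyy": {"literature_sources": []},
}
problems = validate_registry(registry)
```

The first entry passes; the second is reported twice, once for the missing scientific_name and once for having no source URL.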
Runtime Data Structures¶
Introduce separate lookup indexes:
- latin_name_to_species
- alias_to_species_objects
- prefix_to_species_objects
- common_name_to_species_objects
This keeps the distinction between canonical identity and lookup aliases visible in code instead of collapsing everything into one bag of identifiers.
Species.all_identifiers can remain as a convenience helper, but runtime logic
should stop depending on it for identity-sensitive operations.
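The separate indexes might be built in one pass over the species.yaml contents, roughly as follows. This sketch uses scientific-name strings as stand-ins for Species objects, assumes the YAML has already been loaded into a dict, and assumes that historic prefixes belong in the prefix index; the colliding Bubu entries are illustrative only.

```python
from collections import defaultdict
from typing import Any, Dict, List, Mapping


def as_list(value: Any) -> List[str]:
    # species.yaml fields may hold a single string or a list of strings.
    if value is None:
        return []
    return [value] if isinstance(value, str) else list(value)


def build_indexes(species_yaml: Mapping[str, Mapping[str, Any]]) -> Dict[str, dict]:
    latin_name_to_species: Dict[str, str] = {}
    prefix_index: Dict[str, List[str]] = defaultdict(list)
    common_name_index: Dict[str, List[str]] = defaultdict(list)
    alias_index: Dict[str, List[str]] = defaultdict(list)

    def add(index: Dict[str, List[str]], key: str, latin_name: str) -> None:
        if latin_name not in index[key]:
            index[key].append(latin_name)

    for latin_name, entry in species_yaml.items():
        latin_name_to_species[latin_name] = latin_name
        prefixes = (
            as_list(entry.get("prefix"))
            + as_list(entry.get("old prefix"))
            + as_list(entry.get("other prefixes"))
        )
        for prefix in prefixes:
            add(prefix_index, prefix, latin_name)
            add(alias_index, prefix, latin_name)
        for common_name in as_list(entry.get("name")):
            add(common_name_index, common_name, latin_name)
            add(alias_index, common_name, latin_name)

    return {
        "latin_name_to_species": latin_name_to_species,
        "prefix_to_species_objects": dict(prefix_index),
        "common_name_to_species_objects": dict(common_name_index),
        "alias_to_species_objects": dict(alias_index),
    }


toy_species = {
    "Homo sapiens": {"prefix": "HLA", "name": ["human"]},
    "Bubalus bubalis": {"prefix": "Bubu"},
    "Bubo bubo": {"prefix": "Bubu"},
}
indexes = build_indexes(toy_species)
```

Because every alias index maps to a list, a collision such as Bubu is visible as a two-element entry rather than being overwritten by whichever species loads last.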
Serialization Changes¶
to_record() and any tabular output should expose scientific name explicitly.
Recommended record fields:
- species_latin_name
- species_prefix
- species_common_name
For compatibility, species_name can temporarily remain as an alias of
species_latin_name, but new code should stop relying on that ambiguity.
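For illustration, the species portion of a record could look like the following during the compatibility window. The helper name is hypothetical and this is not the existing to_record() implementation.

```python
from typing import Dict, Optional


def species_record_fields(
    latin_name: str,
    prefix: str,
    common_name: Optional[str] = None,
) -> Dict[str, Optional[str]]:
    return {
        "species_latin_name": latin_name,
        "species_prefix": prefix,
        "species_common_name": common_name,
        # Temporary compatibility alias; new code should read species_latin_name.
        "species_name": latin_name,
    }


record = species_record_fields("Homo sapiens", "HLA", "human")
```

Keeping species_name equal to species_latin_name lets existing consumers keep working while new code switches to the explicit field.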
Migration Plan¶
Phase 1: Make identity explicit¶
- add Species.latin_name
- add Species.get_by_latin_name()
- add tests that canonical lookup is scientific-name-based
- leave existing prefix parsing behavior in place for unique aliases
Phase 2: Separate alias resolution from identity¶
- add explicit alias-resolution helpers
- remove heuristic winner selection from Species.get() and infer_species_from_prefix()
- return ambiguity for colliding bare aliases
Phase 3: Migrate runtime side tables¶
- convert YAML files to scientific-name keys
- keep temporary loader support for legacy prefix keys
- add validation that new entries use scientific names
Phase 4: Tighten public behavior¶
- deprecate any API behavior that depends on heuristic alias resolution
- update docs and examples to prefer scientific-name-based internal references
Test Plan¶
Minimum tests to add:
- exact scientific name lookup returns the canonical species
- unique prefix lookup still works
- ambiguous prefix lookup does not silently choose a species
- parser can disambiguate an ambiguous alias when downstream gene context makes only one species valid
- parser returns ambiguity when more than one candidate species remains compatible with the downstream tokens
- side-table loading by scientific name matches current behavior for existing unambiguous species
Compatibility Notes¶
This proposal does not require changing normalized output strings. End users can
continue to see HLA, BoLA, Gaga, and similar curated prefixes in rendered
names.
One subtlety worth calling out explicitly: not every curated prefix is a
species-level identifier. Some historically important committee prefixes such as
DLA, SLA, OLA, BoLA, and CELA are intentionally modeled as umbrella
taxon nodes (Canis sp., Sus sp., Ovis sp., Bos sp., Cetacea sp.).
Those generic prefixes coexist with more specific descendant species prefixes
such as Calu, Susc, Ovar, Bota, and Tutr.
The internal change is that scientific name becomes the primary key and aliases become parse helpers.
Open Questions¶
- Should ambiguous alias lookup return None or raise a dedicated AmbiguousSpeciesError?
- Should Species.name remain indefinitely as a synonym for latin_name, or should it eventually be deprecated?
- Resolved: parent/child side-table inheritance should flow through ancestor scientific names only. Parent prefixes/common names should not become implicit child aliases.