Curation Guide¶
Purpose¶
This document records the rules for placing curated data in mhcgnomes
and the source strategy for taxa that are not yet well served by stable
committee-curated MHC nomenclature.
Most of the concrete examples here come from reptiles, amphibians, and other underrepresented vertebrate groups, but the same rules also apply to unusual mammals or any other taxon where source quality is ahead of stable nomenclature.
The key constraint is that mhcgnomes is strongest when it has:
- stable species nomenclature,
- stable gene nomenclature, and
- a curated or at least reproducible source of allele strings.
For most reptiles and amphibians, the third item is the weak link. There are now good genome assemblies and several strong survey papers, but there is not yet an IPD-MHC-style curated comparative allele registry for these clades.
Current Repo State¶
As of v3.8.0, mhcgnomes has 235 species across 152 genera, with 135 species having gene definitions and 100 as species-only placeholders. Coverage spans mammals, birds, fish, reptiles, amphibians, and sharks.
The core data files live in
mhcgnomes/data/:
- species.yaml: canonical species and gene definitions
- underrepresented_taxa_source_registry.yaml: curation ledger for taxa not yet fully ingested
The right design boundary is not "nonmammal". The right boundary is "taxa that are not yet well served by stable IPD-MHC-style nomenclature and therefore need source-aware ingestion rules". That can include unusual mammals such as marsupials or monotremes.
Curation Status¶
Every taxon in the registry has a curation_status field that tracks where it
sits in the curation pipeline.
| Status | Meaning | Example |
|---|---|---|
| active | In species.yaml, parses at runtime. | Gaga (chicken): BF, BLB, DMA, etc. all parse. |
| pending | We have sources downloaded and documented, but the entry hasn't been reviewed for runtime inclusion yet. No known blockers; it just needs someone to look at it. | A newly ingested species with papers and gene candidates listed but not yet vetted. |
| blocked | Reviewed and found a specific problem (blocked_on field) that prevents activation. | Zhom (Omei tree frog): gene name Rhom-beta1 preserves an old genus prefix; blocked on deciding a stable canonical name. |
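The status rules in the table can be checked mechanically. The following is a minimal Python sketch, not code from mhcgnomes: it assumes each registry entry has been loaded as a dict, reuses the curation_status and blocked_on field names from this guide, and introduces a hypothetical check_entry helper. The second rule (a blocker recorded on a non-blocked entry is suspicious) is an assumption of the sketch rather than a stated policy.

```python
from enum import Enum


class CurationStatus(Enum):
    ACTIVE = "active"
    PENDING = "pending"
    BLOCKED = "blocked"


def check_entry(entry: dict) -> list:
    """Return a list of problems found in one registry entry (empty if none)."""
    try:
        status = CurationStatus(entry.get("curation_status"))
    except ValueError:
        return [f"unknown curation_status: {entry.get('curation_status')!r}"]

    problems = []
    blocked_on = entry.get("blocked_on")
    if status is CurationStatus.BLOCKED and not blocked_on:
        problems.append("blocked entries must name a blocker in blocked_on")
    # Assumption of this sketch: a blocker on a non-blocked entry is flagged.
    if status is not CurationStatus.BLOCKED and blocked_on:
        problems.append("blocked_on is set but curation_status is not blocked")
    return problems
```

A check like this could run in CI over the registry so that a blocked entry never loses its explanation.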
Partial Capture Policy¶
Not every useful source belongs in mhcgnomes/data/species.yaml.
When a source gives us real signal but not runtime-ready nomenclature, the
information should be captured in
mhcgnomes/data/underrepresented_taxa_source_registry.yaml instead.
That registry is now the place to preserve:
- candidate species that are not ready for runtime ontology
- observed gene-family structure from papers or genome annotations
- representative annotation URLs or accessions
- blockers that prevent runtime ingestion
- example species for under-reviewed clades, including unusual mammals
This keeps partial information from being lost while avoiding premature canonization in runtime parsing tables.
Data Placement Rules¶
The rule is: put data in the narrowest runtime file that matches its level of stability.
Runtime-loaded files¶
These are loaded by mhcgnomes/data.py and directly affect parser behavior.
| File | Put this here | Do not put this here |
|---|---|---|
| species.yaml | Canonical species entries, canonical gene names, MHC class placement, stable parent/prefix relationships | Paper-local aliases, uncertain gene symbols, unresolved candidate loci |
| gene_aliases.yaml | Alternative gene spellings or retired/provisional names that normalize to an existing canonical gene in species.yaml | New genes that do not yet have a canonical destination in species.yaml |
| allele_aliases.yaml | Retired, shorthand, or formatting variants that normalize to a canonical allele string | New literature-only alleles with no stable canonical allele target |
| known_alleles.yaml | Curated known allele labels for a species/gene where the ontology already exists | A substitute for adding missing species or genes |
| haplotypes.yaml | Named haplotypes and their member alleles | Partial gene-family observations from papers |
| serotypes.yaml | Serotype-to-allele mappings | General class I / class II family structure |
| heterodimers.yaml | Explicit shorthand heterodimer mappings like DQ2.5 | Speculative alpha/beta combinations from weak literature evidence |
| supertypes.yaml | Functional supertype groupings with clear representative alleles | Serotypes or paper-local functional clusters |
Non-runtime curation files¶
These preserve source-backed information that is not yet ready to affect parser behavior.
| File | Put this here | Why |
|---|---|---|
| underrepresented_taxa_source_registry.yaml | Partial but useful source information: candidate species, observed gene-family structure, representative annotation URLs, blockers, example species | This is the holding area for real signal that is not stable enough for runtime ontology |
| docs/curation.md | Cross-file policy, source strategy, confidence tiers, implementation order | This explains decisions; it should not be the only place where concrete partial source facts live |
Practical decision tree¶
- If the species prefix and canonical gene names are stable, add them to species.yaml.
- If a new string can normalize to an existing canonical gene, put it in gene_aliases.yaml.
- If a new string can normalize to an existing canonical allele, put it in allele_aliases.yaml.
- If the source only tells us "this clade probably has class I / class IIbeta / TAP genes" but not stable canonical names, capture it in underrepresented_taxa_source_registry.yaml.
- If the source is a survey paper with paper-local allele IDs, do not put those IDs in runtime YAML unless they map cleanly onto stable canonical names.
- If the data encodes a derived concept like a serotype, supertype, haplotype, or heterodimer shorthand, use the dedicated file for that concept rather than overloading species.yaml.
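The decision tree can be read as a small routing function. The sketch below is one possible encoding in Python, offered as an illustration only: SourceFact and choose_destination are hypothetical names, the boolean flags stand in for judgments a curator makes by hand, and the order in which the checks run is an assumption. Only the file names come from this guide.

```python
from dataclasses import dataclass
from typing import Optional

# Dedicated files for derived concepts, as listed in the placement tables.
DERIVED_FILES = {
    "serotype": "serotypes.yaml",
    "supertype": "supertypes.yaml",
    "haplotype": "haplotypes.yaml",
    "heterodimer": "heterodimers.yaml",
}
REGISTRY = "underrepresented_taxa_source_registry.yaml"


@dataclass
class SourceFact:
    """Hypothetical summary of what a curator has established about one string."""
    derived_concept: Optional[str] = None
    stable_prefix_and_gene: bool = False
    normalizes_to_existing_gene: bool = False
    normalizes_to_existing_allele: bool = False


def choose_destination(fact: SourceFact) -> str:
    """Return the data file where this fact should be recorded."""
    if fact.derived_concept is not None:
        return DERIVED_FILES[fact.derived_concept]
    if fact.stable_prefix_and_gene:
        return "species.yaml"
    if fact.normalizes_to_existing_gene:
        return "gene_aliases.yaml"
    if fact.normalizes_to_existing_allele:
        return "allele_aliases.yaml"
    # Anything without a stable canonical target stays out of runtime YAML.
    return REGISTRY
```

The fall-through to the registry reflects the policy that paper-local or unresolved names are preserved without affecting parser behavior.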
Prefix Conflict Resolution Plan¶
Four-letter species prefixes are useful, but external datasets reuse them often
enough that mhcgnomes needs an explicit conflict policy instead of ad hoc
exceptions.
Current examples include:
- Phco: used in different datasets for both Phasianus colchicus and Phylloscopus collybita
- Cyca: reused across multiple taxa in external resources even though runtime now uses it for common carp
- Bubu, Chpi, and Mimi: reused across unrelated genera in different source collections
The rule is: runtime parsing should only accept a bare prefix when the prefix is
unambiguous inside mhcgnomes.
When a short prefix is source-attested but collides with another runtime owner,
it can be stored as a context-only prefix on the affected species. These
prefixes are not accepted by Species.get(...) or by bare parsing on their own,
but parse(..., species="<latin name>") may reinterpret them in the requested
species context. This is the right bucket for cases like Hymo, Moal, and
fish-side Orla, where the source string is real but the bare prefix is not a
safe global identifier.
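The difference between a bare prefix and a context-only prefix can be illustrated with a toy resolver. This is a self-contained Python sketch, not the mhcgnomes implementation: resolve_prefix and both lookup tables are hypothetical, the entries are a hand-picked subset of the examples in this guide, and refusing a bare prefix that contradicts the requested species is a choice made by the sketch.

```python
from typing import Optional

# Illustrative tables only; the real data lives in species.yaml.
BARE_PREFIX_OWNER = {"Gaga": "Gallus gallus", "Cyca": "Cyprinus carpio"}
CONTEXT_ONLY_PREFIXES = {"Oryzias latipes": {"Orla"}}


def resolve_prefix(prefix: str, species: Optional[str] = None) -> Optional[str]:
    """Return the Latin name a prefix resolves to, or None if it must not parse."""
    # A context-only prefix is honored only when the caller names the species.
    if species is not None and prefix in CONTEXT_ONLY_PREFIXES.get(species, set()):
        return species
    owner = BARE_PREFIX_OWNER.get(prefix)
    if owner is None:
        return None
    # Sketch choice: do not let a bare prefix override the requested species.
    if species is not None and owner != species:
        return None
    return owner
```

With these tables, resolve_prefix("Orla") returns None, while resolve_prefix("Orla", species="Oryzias latipes") returns the medaka name, mirroring the behavior described for parse(..., species="<latin name>").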
Corollary: do not auto-generate new 2+2 / 4-letter runtime aliases from Latin
names. Add a short alias only when an explicit source attests that exact
prefix, and record the provenance in
underrepresented_taxa_source_registry.yaml.
Related rule: taxonomic node labels such as Gnathostomata, Galliformes,
Crocodylia, Salmonidae, and Testudines are internal tree prefixes, not
species prefixes to inherit onto child species. Short canonical species
prefixes should either be tied to an explicit source in nearby section comments
or be clearly marked as generated/collision-avoidance prefixes. For
underrepresented/runtime-added species, the machine-readable source of truth is
underrepresented_taxa_source_registry.yaml: every active short prefix should
have a matching registry entry with scientific_name plus at least one source
URL.
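The provenance requirement above lends itself to an automated check. The following Python sketch assumes the registry has been loaded into a list of dicts; scientific_name is the field named in this guide, while the prefix and source_urls keys and the missing_provenance helper are hypothetical names chosen for illustration.

```python
def missing_provenance(active_prefixes, registry_entries):
    """Return active short prefixes that lack a documented registry entry.

    An entry counts as documented when it has a scientific_name and at least
    one http(s) source URL.
    """
    documented = set()
    for entry in registry_entries:
        has_name = bool(entry.get("scientific_name"))
        has_url = any(
            str(url).startswith(("http://", "https://"))
            for url in entry.get("source_urls", [])
        )
        if has_name and has_url:
            documented.add(entry.get("prefix"))
    return sorted(p for p in active_prefixes if p not in documented)


entries = [
    {"prefix": "Gaga", "scientific_name": "Gallus gallus",
     "source_urls": ["https://www.ebi.ac.uk/ipd/mhc/"]},
    {"prefix": "Chpi", "scientific_name": "Chrysolophus pictus",
     "source_urls": []},  # no source URL, so this prefix is reported
]
print(missing_provenance(["Gaga", "Chpi"], entries))
```

Running a check of this kind in the test suite would keep the registry and the runtime prefixes from drifting apart.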
Collision types¶
- Canonical-prefix conflict: A prefix is already claimed in species.yaml, and a new dataset uses the same prefix for a different species.
- External-only collision: Multiple external sources reuse the same prefix, but none of them has been admitted into runtime ontology yet.
- Case-only variant: The external source uses SAAL instead of Saal, MODO instead of Modo, and so on.
- Embedded or double prefix: The source string contains an old or repeated prefix inside the gene token, such as Tyal-MhcTyal-DAB1 or Zhom-Rhom-beta1.
Runtime policy¶
- Do not overwrite an existing canonical runtime prefix with a different species.
- Do not add a second species to species.yaml under an already-claimed prefix.
- Do not silently infer a winner for a colliding bare prefix.
- Keep colliding prefixes in mhcgnomes/data/underrepresented_taxa_source_registry.yaml until they are resolved.
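The first two policy points amount to a guard on prefix ownership. A minimal Python sketch of such a guard follows; claim_prefix and PrefixConflict are hypothetical names, and the owners dict is a stand-in for whatever structure holds runtime prefix assignments.

```python
class PrefixConflict(Exception):
    """Raised when a prefix is already owned by a different species."""


def claim_prefix(owners: dict, prefix: str, latin_name: str) -> None:
    """Record latin_name as the owner of prefix, refusing to displace an owner."""
    current = owners.get(prefix)
    if current is not None and current != latin_name:
        # Never overwrite or share: the caller must pick another prefix
        # or keep the string in the registry.
        raise PrefixConflict(
            f"{prefix!r} is already claimed by {current!r}; "
            f"cannot assign it to {latin_name!r}"
        )
    owners[prefix] = latin_name
```

For example, with Gaga owned by Gallus gallus, an attempt to claim Gaga for Gavialis gangeticus raises, while claiming GaviGang for the gharial succeeds.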
How to resolve each type¶
Canonical-prefix conflict¶
Use this when runtime already owns the prefix.
Required before ingestion:
- an authoritative source showing a better stable prefix for the new species, or
- an upstream source-specific translation layer keyed by organism metadata, not by the bare string alone
What not to do:
- do not remap old runtime data to the new species
- do not add an ambiguous alias that changes meaning based only on guesswork
Example:
Phco-UA stays out of runtime because Phco is reused externally for both pheasant and chiffchaff strings, and the chiffchaff source trail is not clean enough to justify taking over the prefix
External-only collision¶
Use this when the collision exists only in source data and no runtime species has claimed the prefix yet.
Required before ingestion:
- choose one canonical prefix backed by a stable source, and
- record the rejected alternatives and their provenance in the registry
If no stable winner exists, keep the group registry-only.
Case-only variant¶
Case normalization is safe only when the lowercase or titlecase form maps to a single runtime species.
Safe:
SAAL -> Saal if Saal is uniquely owned in runtime
Not safe:
- any normalization where the case-folded token could refer to multiple taxa or to a source-local code with different meaning
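The safety condition for case normalization can be expressed as "exactly one species after case folding". The Python sketch below illustrates this with a hypothetical normalize_case helper and toy ownership tables; the second table is deliberately ambiguous for demonstration and does not describe the actual runtime data.

```python
from typing import Dict, Optional


def normalize_case(token: str, owners: Dict[str, str]) -> Optional[str]:
    """Return the owning species if token matches exactly one species
    case-insensitively; otherwise return None and leave the token unparsed."""
    folded = token.casefold()
    species = {
        name for prefix, name in owners.items() if prefix.casefold() == folded
    }
    return next(iter(species)) if len(species) == 1 else None


unique = {"Gaga": "Gallus gallus"}
ambiguous = {"OrLA": "Pongo sp.", "Orla": "Oryzias latipes"}  # toy collision

print(normalize_case("GAGA", unique))     # one owner, so normalization is safe
print(normalize_case("ORLA", ambiguous))  # two candidate owners, so refuse
```

The same check covers the SAAL -> Saal case above: it succeeds only while Saal has a single runtime owner.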
Embedded or double prefix¶
These are acceptable only when the embedded prefix clearly refers to the same species or to a documented old genus synonym.
Safe example:
Tyal-MhcTyal-DAB1 -> Tyal-DAB1
Not safe without more evidence:
Zhom-Rhom-beta1, because the inner prefix preserves an old-genus label but the runtime canonical gene is still unsettled
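A conservative way to handle embedded prefixes is to collapse them only when the inner prefix equals the outer one or appears in an explicit synonym table. The sketch below shows this in Python; strip_embedded_prefix, the regular expression, and the synonyms parameter are all hypothetical, and the pattern only covers the two shapes quoted in this guide.

```python
import re
from typing import Dict, Iterable, Optional

# Matches 'Xxxx-Yyyy-GENE' and 'Xxxx-MhcYyyy-GENE' (shapes seen in this guide).
_EMBEDDED = re.compile(
    r"^(?P<outer>[A-Za-z]{4})-(?:Mhc)?(?P<inner>[A-Z][a-z]{3})-(?P<gene>.+)$"
)


def strip_embedded_prefix(
    name: str, synonyms: Optional[Dict[str, Iterable[str]]] = None
) -> Optional[str]:
    """Collapse an embedded prefix when it is the same species or a documented
    synonym; return None when the string should stay registry-only."""
    match = _EMBEDDED.match(name)
    if match is None:
        return None
    outer, inner = match["outer"], match["inner"]
    allowed = {outer} | set((synonyms or {}).get(outer, ()))
    if inner not in allowed:
        return None
    return f"{outer}-{match['gene']}"


print(strip_embedded_prefix("Tyal-MhcTyal-DAB1"))  # same-species prefix
print(strip_embedded_prefix("Zhom-Rhom-beta1"))    # undocumented inner prefix
```

Returning None rather than guessing keeps unresolved strings such as Zhom-Rhom-beta1 out of runtime until the canonical gene name is settled.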
Required evidence before runtime admission¶
At least one of these should be true:
- the exact prefix/gene string is used in a species-specific primary source
- the exact prefix/gene string is used in a structured protein or genome record with species metadata
- the string cleanly normalizes to an already-canonical runtime gene
If the exact string is not source-backed, but only the biology is plausible, store it in the registry and stop there.
Implementation order¶
1. Safe case-normalization for uniquely owned prefixes
2. Safe same-species embedded-prefix aliases
3. Source-specific alias handling where the calling code already knows the organism
4. Revisit true canonical collisions only after steps 1-3 reduce the backlog
Test policy¶
Every collision-resolution change should add both:
- a positive test for the accepted normalization or alias, and
- a negative test proving that colliding unresolved strings still do not parse
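A paired positive and negative test might look like the following. This is a sketch in plain Python with pytest-style function names; toy_resolve is a hypothetical stand-in for the real parser so that the example runs on its own, and RUNTIME_OWNERS is illustrative data rather than the contents of species.yaml.

```python
RUNTIME_OWNERS = {"Gaga": "Gallus gallus"}  # illustrative stand-in for runtime data


def toy_resolve(prefix):
    """Toy resolver: case-insensitive match that succeeds only for a unique owner."""
    hits = {
        name
        for known, name in RUNTIME_OWNERS.items()
        if known.casefold() == prefix.casefold()
    }
    return next(iter(hits)) if len(hits) == 1 else None


def test_accepts_safe_case_variant():
    # Positive test: the newly accepted normalization resolves.
    assert toy_resolve("GAGA") == "Gallus gallus"


def test_rejects_unresolved_collision():
    # Negative test: Phco is kept out of runtime, so it must still not resolve.
    assert toy_resolve("Phco") is None
```

In the real test suite the same pair would call the library's parser instead of the toy resolver.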
Known prefix collisions¶
Resolved collisions¶
These collisions have been resolved by assigning long (4+4) prefixes to the less-established species. See the Species Identity Proposal for the design rationale.
| 4-letter code | Species | Resolution | Literature reference |
|---|---|---|---|
| Bubu | Bubalus bubalis (water buffalo) | Keeps Bubu; well established in MHC literature | Bubu-DQA alleles in swamp buffaloes, Bubu-DRB polymorphism |
| Bubu | Bubo bubo (Eurasian eagle-owl) | Uses BuboBubo | MHC class II in Bubo owls |
| Orla / OrLA | Pongo sp. (orangutan) / Oryzias latipes (medaka) and several killifish | Orangutan keeps OrLA; fish use long canonical prefixes, and species= can rescue source-side Orla strings | |
| Gaga | Gallus gallus (chicken) / Gavialis gangeticus (gharial) | Chicken keeps Gaga (IPD-MHC chicken); gharial uses GaviGang | |
| Cyca | Cyprinus carpio (carp) / Cyclura carinata (iguana) / Cyanistes caeruleus (blue tit) | Carp keeps Cyca; iguana uses CyclCari; blue tit uses CyanCaer | All three attested in literature: carp in IPD-MHC, iguana in Glaberman et al., blue tit in Westerdahl et al. |
| Chpi | Chrysolophus pictus (golden pheasant) / Chrysemys picta (painted turtle) | Pheasant keeps Chpi; turtle uses ChryPict | |
Low-risk collisions¶
| Code | Species | Notes |
|---|---|---|
| Pren | Semnopithecus entellus / Theropithecus gelada | Old prefix only; both species have different primary prefixes |
Why four-letter prefixes are a weak identity model¶
Four-letter codes derived from genus + species binomials are a lossy encoding.
With 235+ species and growing, collisions are inevitable — sometimes even in
the published literature (e.g., Cyca is used independently for carp, iguana,
and blue tit by different research groups).
The species identity model now uses Latin
names as canonical identity (see Species.latin_name,
Species.get_by_latin_name()). Every species is also parseable via its full
concatenated Latin name (e.g., HomoSapiens-A*02:01) and a 4+4 truncated
form (e.g., HomoSapi-A*02:01). See the
prefix tier documentation
in the README.
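The lossiness of four-letter codes is easy to demonstrate. The Python sketch below derives a 2+2 code, a 4+4 code, and a full concatenated form from a Latin binomial and groups species by the 2+2 code. The helper names are hypothetical, and the derivation rules are inferred from the examples in this guide (Bubu, BuboBubo, GaviGang, HomoSapi, HomoSapiens) rather than taken from the mhcgnomes source. It is for diagnosing collisions, not for auto-generating runtime aliases.

```python
from collections import defaultdict


def short_prefix(latin_name: str) -> str:
    """2+2 code: first two letters of genus plus first two of species epithet."""
    genus, epithet = latin_name.split()[:2]
    return (genus[:2] + epithet[:2]).capitalize()


def long_prefix(latin_name: str) -> str:
    """4+4 code: first four letters of genus and epithet, each capitalized."""
    genus, epithet = latin_name.split()[:2]
    return genus[:4].capitalize() + epithet[:4].capitalize()


def full_prefix(latin_name: str) -> str:
    """Full concatenated Latin name with each word capitalized."""
    return "".join(word.capitalize() for word in latin_name.split())


def short_prefix_collisions(latin_names):
    """Map each 2+2 code shared by more than one species to those species."""
    groups = defaultdict(list)
    for name in latin_names:
        groups[short_prefix(name)].append(name)
    return {code: names for code, names in groups.items() if len(names) > 1}


names = ["Bubalus bubalis", "Bubo bubo", "Gallus gallus", "Gavialis gangeticus"]
print(short_prefix_collisions(names))
print([long_prefix(n) for n in names])
```

The 2+2 codes collide for both pairs, while the 4+4 forms stay distinct, which is the motivation for the long prefixes in the resolved-collisions table.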
Current special cases¶
- default_alleles.yaml exists but is currently minimal and not part of the main runtime loading path in data.py.
- Root-level transport and DM genes now live directly in species.yaml under Gnathostomata sp.. There is no separate runtime common_genes.yaml fallback anymore.
Source Inventory¶
Tier 1: Official structured databases¶
These are the best sources for species normalization and, where available, stable gene names.
| Source | What it is | Confidence | What we can use it for |
|---|---|---|---|
| IPD-MHC | Official curated comparative MHC database | High | Existing official groups, file formats, committee norms, future submission target |
| NCBI Datasets | Official genome/annotation/metadata download portal | High | Genome assemblies, annotations, proteins, transcripts, taxon metadata |
| Xenbase | Official Xenopus knowledgebase with downloads, BLAST, gene nomenclature | High | Frog gene names, genome coordinates, gene aliases, other amphibian genomes |
| The Reptile Database | Widely used reptile taxonomy authority | High for taxonomy | Species normalization and synonyms for snakes, lizards, turtles, crocodilians, tuatara |
| Amphibian Species of the World | Curated amphibian taxonomy reference | High for taxonomy | Species normalization and literature discovery for frogs and salamanders |
References:
- IPD-MHC home: https://www.ebi.ac.uk/ipd/mhc/
- IPD-MHC taxonomy: https://www.ebi.ac.uk/ipd/mhc/taxonomy/
- IPD-MHC downloads: https://www.ebi.ac.uk/ipd/mhc/download/
- NCBI Datasets: https://www.ncbi.nlm.nih.gov/datasets/
- Xenbase: https://www.xenbase.org/
- Xenbase data/download entry points: https://www.xenbase.org/xenbase/
- Reptile Database: https://www.reptile-database.org/
- Amphibian Species of the World: https://amphibiansoftheworld.amnh.org/
Notes:
- IPD-MHC is still the gold standard for canonical allele ingestion, but its current official groups do not yet cover reptiles or amphibians as first-class groups. The taxonomy page currently lists primates, felids, canids, salmonids, ovids, bovids, equids, suids, murids, Gallus, and cetaceans.
- Xenbase is the strongest structured source in this expansion set because it provides official gene nomenclature, gene search, BLAST, downloadable genomes, and "other amphibian genomes" links.
- NCBI Datasets is the best generic fallback when there is no clade-specific nomenclature database.
Tier 2: Strong clade-specific papers¶
These papers are useful for deciding which species and gene families are worth adding, but they are not automatically safe as canonical allele registries.
| Clade | Source | Evidence type | Best use |
|---|---|---|---|
| Sea turtles | Martin et al. 2026 | 4-species allele survey | Prioritize turtle species and coarse class I / class II support |
| Sea turtles | Martin et al. 2022 | Green/loggerhead class I survey | Backfill turtle class I alias handling |
| Crocodilians | Jaratlerdsiri et al. 2014 | Order-level class I evolution | Add crocodilian species/gene family aliases |
| Lizards | Card et al. 2022 | Two Anolis genomes | Add genome-backed lizard gene metadata |
| Snakes | Kirsch et al. 2025 | Rattlesnake genomes | Add modern snake gene structure assumptions |
| Amphibians | Kiemnec-Tyburczy et al. 2018 | Review | Guide frog/salamander scope and terminology |
| Salamanders | Migalska et al. 2022 | 30-species class I survey | Do not ingest canonical alleles; use for architecture expectations |
| Salamanders | Palomar et al. 2021 | MHC-I/APG coevolution | Add APG expectations and nonclassical expansion notes |
| Tuatara | Miller et al. 2015 | Genome organization study | Useful "other reptile" pilot species |
References:
- Sea turtles, 4 species, class I and class II: https://pubmed.ncbi.nlm.nih.gov/41575191/
- Sea turtle class I disease-association study: https://pubmed.ncbi.nlm.nih.gov/35154791/
- Crocodylia class I evolution: https://pubmed.ncbi.nlm.nih.gov/24253731/
- Squamate MHC in two Anolis genomes: https://pubmed.ncbi.nlm.nih.gov/36425073/
- Rattlesnake MHC architecture: https://pubmed.ncbi.nlm.nih.gov/39704347/
- Amphibian MHC review: https://pubmed.ncbi.nlm.nih.gov/28695290/
- Salamander MHC-I survey: https://pubmed.ncbi.nlm.nih.gov/36000494/
- Salamander MHC-I/APG coevolution: https://pubmed.ncbi.nlm.nih.gov/34375431/
- Tuatara MHC organization: https://pubmed.ncbi.nlm.nih.gov/25953959/
Diagnosis By Clade¶
Frogs¶
What we know:
- Xenopus laevis already exists in the ontology as Xela-UAA.
- Xenbase has official gene search, gene nomenclature, downloads, and genome browsers for X. laevis and X. tropicalis.
- Xenbase also links to "other amphibian genomes", which makes it the best structured on-ramp for frog expansion.
- The amphibian review shows that both class I and class II are relevant to disease susceptibility, especially chytridiomycosis.
What can be confidently ingested:
- Xenopus species prefixes
- official Xenbase gene symbols and aliases
- antigen-processing genes (TAP1, TAP2, TAPBP, B2M) where they are clearly annotated
- model-organism frog genes, even without extensive allele catalogs
What is not yet safe as canonical allele ontology:
- broad allele-level frog nomenclature outside Xenopus
- amplicon-only literature alleles with no stable locus naming
Recommendation:
- Start with Xenopus laevis and Xenopus tropicalis.
- Use Xenbase plus NCBI annotations to add gene-level ontology, not paper-local allele catalogs.
Salamanders¶
What we know:
- Salamanders show extreme class I expansion.
- One 30-species survey reported about 3000 class I variants and 2-22 gene copies per species.
- Salamanders also show coevolution between MHC-I and TAP1/TAP2.
What can be confidently ingested:
- species taxonomy
- coarse expectations that salamander MHC can include multigene class I families
- APG support (TAP1, TAP2, TAPBP, PSMB8, PSMB9) when genome annotations exist
What is not yet safe as canonical allele ontology:
- species-wide allele naming from large amplicon datasets
- locus-stable class I names across species
Recommendation:
- Treat salamanders as a later phase.
- Add species aliases and gene-family parsing only when genome annotations are available.
- Do not ingest paper-local salamander allele IDs as canonical mhcgnomes allele names.
Snakes¶
What we know:
- The repo currently has only Sica-DAA/Sica-DAB.
- Recent rattlesnake genome work identifies highly duplicated class I and class IIbeta loci localized in gene clusters on chromosome 2.
- This implies the current repo snake model is underpowered and probably structurally outdated.
What can be confidently ingested:
- snake species prefixes from The Reptile Database
- gene-family level support for class I and class IIbeta in species with chromosome-level assemblies or strong annotations
- genome-backed aliases from rattlesnake papers and NCBI annotations
What is not yet safe as canonical allele ontology:
- generic cross-snake locus names that pretend all snake orthology is settled
- single-paper allele names with no stable accession-backed registry
Recommendation:
- Revisit snake support from scratch rather than extending the current Sica-DAA/DAB pattern blindly.
- Pilot on rattlesnakes with modern genome-backed gene structure first.
Lizards¶
What we know:
- Two Anolis genomes show a core MHC region on chromosome 2 and include many homologs of mammalian core MHC genes.
- This is one of the clearest reptile cases for genome-backed, gene-level ingestion.
- The strongest current comparative paper is Card et al. 2022, which analyzes the green and brown anole MHC using genome structure, BAC evidence, and comparative annotation.
What can be confidently ingested:
- Anolis carolinensis and Anolis sagrei species entries
- genome-backed gene names and aliases from annotated assemblies
- core MHC framework genes and antigen-processing genes if annotation quality is good enough
What is not yet safe as canonical allele ontology:
- paper-specific lizard allele numbering not grounded in stable external records
- paper-local homolog numbering from Card et al. 2022
- homology-derived NCBI LOC... model records treated as if they were settled community locus names
What is ambiguous specifically for Anolis:
- The 2022 paper identifies mhc1 and mhc2β homologs, but its phylogenies label homologs with sequential within-paper numbers and point readers to a supplementary mapping table. That is useful comparative biology, but not yet a community nomenclature standard.
- The same paper reports that one of two mhc2β homologs in each anole lacks exon 2, which means that even "gene copy count" is not equivalent to "intact canonical class IIbeta loci".
- NCBI annotations for these species are still mostly model-based and use names such as LOC103282626 ("major histocompatibility complex class I-related gene protein-like") and LOC132766334 ("RLA class II histocompatibility antigen, DP alpha-1 chain-like"). Those are useful evidence for gene-family presence, but they are not strong enough to canonize as runtime gene symbols in mhcgnomes.
- The green anole annotations have already moved between assemblies (AnoCar2.0 to rAnoCar3.1.pri), which is a good sign for the assembly but a reason not to freeze unstable identifiers too early.
Recommendation:
- Lizards are one of the best first reptile targets because the evidence is genome-structural, not just amplicon diversity.
- For Anolis, stay at species-level support until we curate a small set of exact gene symbols backed by stable accessions, not just model LOC records.
Primary sources for Anolis:
- Card et al. 2022, squamate MHC in two Anolis genomes: https://pubmed.ncbi.nlm.nih.gov/36425073/
- Eckalbar et al. 2013, green anole genome reannotation: https://pubmed.ncbi.nlm.nih.gov/23343042/
- NCBI green anole class I-like model gene example: https://www.ncbi.nlm.nih.gov/gene/103282626
- NCBI brown anole class II alpha-like model gene example: https://www.ncbi.nlm.nih.gov/gene/132766334
Turtles¶
What we know:
- Sea turtles now have a strong 4-species class I / class II survey with 162 functionally distinct class I alleles and 308 class II alleles across more than 300 individuals.
- Earlier work characterized class I variation in green and loggerhead turtles.
- This is strong evidence that turtle MHC diversity is tractable, but the nomenclature is still literature-driven rather than committee-curated.
What can be confidently ingested:
- turtle taxonomy from The Reptile Database
- species prefixes for common study species
- coarse gene-family parsing for class I and class II where locus names are explicitly given and stable
- paper aliases with provenance
What is not yet safe as canonical allele ontology:
- treating survey-paper alleles as if they were official, cross-study canonical names
- inferring stable locus orthology across all turtles from short amplicon studies
Recommendation:
- Turtles are a strong second-wave target after Xenopus and genome-backed reptiles.
- Start with species/prefix support plus paper-alias parsing, not canonical allele registries.
Crocodilians¶
What we know:
- Crocodilian class I evolution has been studied across the order.
- Additional genome papers show structured MHC organization in crocodilians.
- This is enough to justify species and gene-family support, but still not enough for an IPD-MHC-style allele ontology.
What can be confidently ingested:
- crocodilian species prefixes from The Reptile Database
- class I and class II family-level genes from genome-backed sources
- accession-backed aliases from strong genomic studies
What is not yet safe as canonical allele ontology:
- broad allele naming across crocodilians from partial-exon studies
- pretending locus labels are standardized across the order when they are not
Recommendation:
- Make crocodilians a genome-backed parser target, not an official allele registry target.
Tuatara and other reptiles¶
What we know:
- Tuatara has a mapped core MHC region with class I and class IIbeta copies on two chromosomes.
- This is useful as a design test case for handling reptiles with dispersed MHC architecture.
What can be confidently ingested:
- species prefix
- gene-family level support for class I and class IIbeta
What is not yet safe as canonical allele ontology:
- full allele sets from old BAC-based or clone-based studies unless matched to stable modern accessions
Recommendation:
- Keep tuatara as an "advanced architecture" pilot after the first reptile wave.
Confidence Tiers For Ingestion¶
Tier A: Safe to ingest as canonical ontology now¶
These have either official nomenclature support or strong gene-level database support.
- species taxonomy from The Reptile Database and Amphibian Species of the World
- Xenopus species and gene symbols from Xenbase
- genome-annotated genes from NCBI Datasets when the annotation uses stable gene names and there is no evidence of paper-local naming only
Tier B: Safe to ingest as gene-level aliases, not canonical alleles¶
- genome-backed reptile genes from Anolis, rattlesnakes, crocodilians, and tuatara
- turtle gene-family labels from recent multi-species studies
- APG genes associated with MHC loci (TAP1, TAP2, TAPBP, PSMB8, PSMB9)
Tier C: Not safe to ingest as canonical ontology without extra curation¶
- amplicon-only allele sets from survey papers
- study-local allele IDs that are not mirrored in GenBank or a curated database
- locus names with uncertain orthology across species
- copy-number-based labels that are not stable across assemblies or haplotypes
What We Can Build Confidently First¶
Phase 1: Taxonomy and source registry¶
Deliverables:
- add a source registry file for reptiles/amphibians
- define species prefixes using current scientific names and common aliases
- record provenance for every new species entry
Targets:
- frogs: Xenopus laevis, Xenopus tropicalis
- lizards: Anolis carolinensis, Anolis sagrei
- snakes: Crotalus horridus and close rattlesnake references from the genome paper
- turtles: Caretta caretta, Chelonia mydas, Dermochelys coriacea, Lepidochelys kempii
- crocodilians: Crocodylus porosus plus species from the Crocodylia survey
- other reptiles: Sphenodon punctatus
Phase 2: Gene-level ontology only¶
Deliverables:
- species entries with gene families, not large allele registries
- tests for species parsing and representative gene parsing
- APG support where source annotation is clear
Targets:
- Xenopus genes from Xenbase
- Anolis core MHC genes from genome-backed annotations
- rattlesnake class I and class IIbeta family parsing
- crocodilian class I / class II family parsing
Phase 3: Alias-level parsing for literature names¶
Deliverables:
- optional alias tables per clade
- provenance in comments or sidecar files
- tests from paper examples and GenBank-backed names
Targets:
- sea turtle class I / class II survey names
- selected crocodilian and salamander paper aliases
Constraint:
- these aliases should not be presented as official comparative nomenclature if they are only paper-local.
Phase 4: Canonical allele ingestion only where a registry exists¶
This phase should happen only when one of the following is true:
- the clade enters IPD-MHC,
- the species has a stable curated allele registry,
- or we build a clearly provenance-annotated internal registry with accession-level traceability and conservative scope.
Recommended First Implementation Order¶
1. Xenopus via Xenbase
2. Anolis via genome papers and NCBI Datasets
3. rattlesnakes via genome-backed class I / class IIbeta parsing
4. sea turtles as species + paper-alias support
5. crocodilians as species + gene-family support
6. salamanders only after a clear strategy for multigene class I handling
7. tuatara as an architecture stress-test species
Design Constraints For mhcgnomes¶
To keep mhcgnomes coherent, new underrepresented-taxa ingestion should follow
these rules:
- Separate canonical ontology from paper-local aliases.
- Never invent stable allele names where the source community has not.
- Prefer gene-family parsing over aggressive allele normalization when locus orthology is unresolved.
- Keep provenance close to the data source.
- Add one or more parser tests per source family before expanding breadth.
Proposed Next Concrete Tasks¶
- Add a source registry module or YAML sidecar for underrepresented taxa.
- Add tests for existing Sica and Xela coverage so the current baseline is explicit.
- Add Xenopus tropicalis and expand Xenopus gene metadata from Xenbase.
- Add a first reptile pilot with Anolis carolinensis.
- Add a second reptile pilot with Crotalus horridus.
- Decide whether paper-local aliases should live in the main ontology or in separate alias tables.