Genotype sampling & population priors¶
Two related ways to get a per-individual genotype without pinning every gene by
hand: sample one from population priors, or author a population prior on
the cartridge so sampling is data-driven. Both produce an ordinary Genotype
you attach with with_genotype (see the Genotypes overview for
the core model, and Genotype cohorts to draw many at
once). Examples assume the imports and cfg from the overview.
Sampling from population priors¶
Instead of specifying every gene, draw a plausible diploid genotype with
Genotype.sample. It uses an independent per-gene, per-chromosome
Hardy-Weinberg model: each gene on each chromosome is independently deleted
(haplotype_deletion_prob) or assigned an allele from that gene's frequencies, so
homozygous / heterozygous / hemizygous / deleted states emerge at the expected
rates.
g = Genotype.sample(
cfg,
seed=7,
allele_frequencies={ # {segment: {gene: {allele: weight}}}
"V": {"IGHVF1-G1": {"IGHVF1-G1*01": 8, "IGHVF1-G1*02": 2}},
}, # unspecified genes -> uniform within gene
haplotype_deletion_prob=0.05, # 5% per-haplotype gene-absence
subject_id="DONOR_R1",
)
res = ga.Experiment.on(cfg).with_genotype(g).recombine().run_records(n=500, seed=1)
Genotype.sample always returns a fully-specified, runnable genotype. By
default ensure_viable=True re-draws (up to max_resamples, deterministically)
until at least one expressible (positive-weight) chromosome carries every
required segment, raising a clear error only if your deletion settings make that
impossible; pass ensure_viable=False to allow infeasible draws.
Default sampling is HW conditioned on viability
With the default ensure_viable=True, draws that leave no complete usable
haplotype are rejected, so per-gene deletion/zygosity rates are Hardy-Weinberg
conditioned on at least one viable chromosome - not the unconditional HW
rates. This only matters at high haplotype_deletion_prob (e.g. a single-J
cartridge with a high J-deletion prob). Use ensure_viable=False for the raw,
unconditional draw (which may be infeasible and rejected at compile).
Frequency priors accept the segment-aware
{segment: {gene: {allele: weight}}} shape (or a flat {gene: …} when the gene is
unambiguous); a supplied gene's listed alleles define its distribution (weight 0
excludes an allele), and unspecified genes fall back to uniform.
haplotype_deletion_prob is a float or a per-gene/per-segment dict.
What this model is - and isn't
This is an independent per-gene sampler. It does not model linkage
disequilibrium, gene co-deletion blocks, ancestry, or donor-specific haplotype
structure, and it samples catalogue alleles only (no novel alleles) and
deletion only (no duplication). Supply explicit allele_frequencies from a
population source (e.g. VDJbase) for realistic per-gene frequencies; the default
is uniform within each gene. allele_frequencies="usage_as_prior" is an
explicit opt-in that reuses the cartridge's recombination allele_usage as a
frequency proxy - convenient but biologically approximate.
Loading population frequencies¶
GenAIRR has no built-in population-frequency loader yet (that is a deferred
feature), but the allele_frequencies argument is a plain nested dict, so any
table you assemble works. For realistic per-allele frequencies, build the dict
from a population source - for example a frequency table you export from
VDJbase (which reports per-individual genotypes and
allele frequencies across populations). Given a CSV with
segment,gene,allele,frequency columns:
import csv
from collections import defaultdict
freqs: dict = defaultdict(lambda: defaultdict(dict))
with open("allele_frequencies.csv") as fh: # your exported table
for row in csv.DictReader(fh):
freqs[row["segment"]][row["gene"]][row["allele"]] = float(row["frequency"])
g = Genotype.sample(
cfg, seed=7,
allele_frequencies={s: dict(genes) for s, genes in freqs.items()},
)
Allele names in the table must match the cartridge's names (see the naming note in the overview); genes you omit fall back to uniform.
Population genotype models on a cartridge¶
The allele_frequencies / haplotype_deletion_prob you pass to
Genotype.sample can instead be authored once on the cartridge as a
population genotype model - a donor-population germline prior. It is a distinct
plane from reference_models.allele_usage: allele_usage weights how often each
allele is expressed during recombination, whereas a genotype prior describes
which alleles a donor population carries (frequencies, gene-deletion rates, and
population novel alleles). It lives on DataConfig.genotype_priors.
Authoring and attaching a model¶
from GenAIRR.genotype_priors import PopulationGenotypeModel, PopulationNovelAllele
model = PopulationGenotypeModel(
model_id="IGH-toy-1", source="hand-authored", # identity is required
allele_frequencies={"V": {"IGHVF1-G1": {"IGHVF1-G1*01": 3.0, "IGHVF1-G1*02": 1.0}}},
haplotype_deletion_prob={"V": {"IGHVF1-G1": 0.1}},
)
# Attach via the cartridge builder (validated against the chain type + catalogue):
# cfg = builder.set_genotype_priors(model).build()
set_genotype_priors validates the model against the cartridge: unknown
genes/alleles raise, and any population novel allele goes through the same
functional validation as add_novel_allele (conserved anchor, stop-free frame,
base-allele match). A non-None plane becomes part of cartridge identity -
it folds into the cartridge's compute_checksum() (the content hash that changes
whenever the reference changes) and its manifest (the cartridge's
machine-readable provenance/identity record), so a prior baked onto a cartridge
travels and versions with it.
Estimating a model from observed genotypes¶
Given a list of observed Genotype objects (e.g. one per donor), estimate a
prior directly:
model = PopulationGenotypeModel.from_genotypes(
[g_donor1, g_donor2, g_donor3],
cfg=cfg, model_id="cohort-est", source="my-cohort",
)
# or, attaching to a builder in one chained step:
# builder.estimate_genotype_priors([g_donor1, g_donor2, g_donor3], source="my-cohort")
Estimator conventions: allele frequencies are counted per carried chromosome
(homozygous contributes 2, hemizygous 1, deleted 0); gene deletion probability is
deleted_haplotypes / (2 × n_subjects). pseudocount smooths the per-gene
catalogue-allele counts only (deletion gets none). Genotypes carrying a
duplicated gene are rejected - the plane is deletion-only.
Drawing from the cartridge plane¶
When a cartridge carries a plane, Genotype.sample(cfg) auto-uses it and
records where every input came from:
g = Genotype.sample(cfg, seed=7) # plane supplies freqs / deletion / weights
print(g.prior_provenance)
# {'allele_frequencies': 'cartridge', 'haplotype_deletion_prob': 'cartridge',
# 'chromosome_weights': 'cartridge', 'novel_alleles': 'none', # 'cartridge' if the model has novels
# 'model_id': 'IGH-toy-1', 'model_checksum': '…'}
print(g.to_metadata()) # subject_id + provenance + source/effective refdata hashes
Pass use_cartridge_priors=False for a clean uniform, catalogue-only draw (all
plane consumption - including novels - disabled). Each input is sourced
independently, so you can mix explicit and cartridge values:
g = Genotype.sample(
cfg, seed=7,
allele_frequencies={"V": {"IGHVF1-G1": {"IGHVF1-G1*01": 1.0}}}, # explicit
# haplotype_deletion_prob and chromosome_weights left to the plane
)
print(g.prior_provenance["allele_frequencies"]) # 'explicit'
print(g.prior_provenance["haplotype_deletion_prob"]) # 'cartridge'
print(g.prior_provenance["chromosome_weights"]) # 'cartridge'
Population novel alleles on the plane are candidate alleles for the draw: a
novel that gets sampled is carried (and flows into v_call / reads / truth like
any allele); a novel that isn't drawn never pollutes the output reference. By
default (include_cartridge_novel_alleles="auto") novels are injected only when
the allele frequencies are cartridge- or uniform-sourced - supplying an
explicit frequency table keeps it explicit. Pass
include_cartridge_novel_alleles=True to inject them anyway, or False to never.
Auditing the plane¶
cfg.cartridge_manifest()["models"]["genotype_priors"] reports availability,
model_id / source / version, the plane's model_checksum, per-segment gene
counts, the novel-allele count, and source_field ("DataConfig.genotype_priors").