lacan.gen
gen.py — Random molecule generation and adaptive genetic algorithm for LACAN.
This module has two responsibilities:
Random generation — assemble drug-like molecules from scratch by recursively sampling fragments from the corpus and joining them through their dummy-atom attachment points.
Optimisation — run an adaptive GA (
generate_optimized_molecules()) that iteratively improves molecules toward a user-defined scoring function while maintaining population diversity.
Random generation
generate_random_molecule() builds a molecule by starting from a randomly
chosen corpus fragment and recursively filling its open attachment points with
rings, linkers, or substituents sampled proportionally to their corpus
frequency. The recursion terminates when all dummies are consumed.
generate_filtered_molecule() wraps the above with a rejection loop that
discards molecules outside the atom-count window or below the LACAN score
threshold.
generate_filtered_molecules() parallelises the above using
multiprocessing.Pool, giving each worker a unique seed offset so the
molecules are independent.
Corpus biasing
get_corpus_from_mols() / bias_corpus() build a merged corpus that
over-represents fragments from a reference set (e.g. known actives). The
ratio parameter controls the boost: at ratio=2 the reference fragments’
effective frequency is doubled relative to the background ChEMBL corpus.
The resulting corpus can be passed as fragcorpus= to any generation
function to steer sampling toward the desired chemotype.
Adaptive GA
See generate_optimized_molecules() for the full description.
GA design
A
HallOfFamecollects the all-time best diverse molecules; the final result is simply the top-N from it.scoring_budget — a hard cap on molecules sent to the scoring function per generation. Candidates are pre-screened with cheap property filters and MaxMin diversity sampling before scoring, keeping the budget tight even when raw operation output is large.
Per-operation Thompson Sampling bandit — seven arms (ring replacement, linker replacement, substituent replacement, scaffold decoration, crossover, random injection, and atom-level mutation) each maintain a Beta posterior over their hit rate. Budget is allocated proportionally to sampled weights, so productive arms get more budget while all arms remain explored.
Smooth explore_fraction — a float in [0,1] controls the mix of exploration arms (ring/linker/sub-replace, decoration, crossover, random) vs exploitation arms (mutation). On plateau it shifts toward mutation (highest empirical hit rate); on diversity collapse it shifts toward exploration. It decays back toward a user-set baseline otherwise.
Presets —
preset="ml"/"medium"/"docking"/"guacamol"set a coherent bundle of defaults appropriate for fast, medium, slow, and unlimited-budget scoring functions respectively. Individual kwargs override preset values.
- lacan.gen.load_corpus(min_count=200)[source]
Load the ChEMBL fragment corpus from
data/rls.csv, with caching.The CSV is read only on the first call for a given
min_countvalue; subsequent calls return the cached result immediately.
- lacan.gen.generate_random_molecule(fragments=[], fragcorpus=None, needed_sample=[], bondstring='', mol=None)[source]
Recursively assemble a random molecule from corpus fragments.
- Parameters:
fragments (list — accumulates SMILES of all fragments used so far)
fragcorpus (list of corpus entries (default: ChEMBL rls.csv))
needed_sample (list of lists — allowed fragment types for each open point)
bondstring (str — bond types for the current open attachment points)
mol (RDKit Mol — molecule built so far)
- Return type:
(mol, fragments)
- lacan.gen.generate_filtered_molecule(profile, fragcorpus=None, threshold=0.5, seed=None, min_atoms=14, max_atoms=35)[source]
Generate one random molecule that passes the LACAN score and size filters.
- Parameters:
profile (LACAN profile dict)
fragcorpus (fragment corpus (default: ChEMBL rls.csv))
threshold (minimum LACAN score to accept (default 0.5))
seed (int or None)
min_atoms (minimum heavy-atom count, exclusive (default 14))
max_atoms (maximum heavy-atom count, exclusive (default 35))
- Return type:
RDKit Mol
- lacan.gen.generate_filtered_molecules(profile, fragcorpus=None, threshold=0.5, seed=None, min_atoms=14, max_atoms=35, n_jobs=1, n_molecules=10)[source]
Generate multiple filtered molecules, optionally in parallel.
- Parameters:
profile (LACAN profile dict)
fragcorpus (fragment corpus (default: ChEMBL rls.csv))
threshold (minimum LACAN score (default 0.5))
seed (int or None — base seed)
min_atoms (minimum heavy-atom count, exclusive (default 14))
max_atoms (maximum heavy-atom count, exclusive (default 35))
n_jobs (int — parallel workers; 1 = sequential, -1 = all CPU cores)
n_molecules (int — number of molecules to generate (default 10))
- Return type:
list of RDKit Mol
- lacan.gen.get_corpus_from_mols(mols, fragcorpus=None, ratio=1)[source]
Merge a custom fragment corpus from reference molecules into the background corpus.
- Parameters:
mols (iterable of RDKit Mol objects)
fragcorpus (background corpus (default: ChEMBL rls.csv))
ratio (float — frequency multiplier for reference fragments (default 1))
- Return type:
list of corpus entries
- lacan.gen.bias_corpus(mols, fragcorpus=None, ratio=2.0)[source]
Build a fragment corpus biased toward the chemistry of the provided molecules.
- Parameters:
mols (iterable of RDKit Mol objects — reference molecules)
fragcorpus (background corpus (default: ChEMBL rls.csv))
ratio (float — frequency multiplier (default 2.0))
- Return type:
list of corpus entries
- class lacan.gen.ThompsonBandit(arm_names, stats_file=None, prior_alpha=None, prior_beta=None)[source]
Bases:
objectBeta-Bernoulli Thompson Sampling bandit over GA operations.
Models each arm as a Bernoulli process: did this arm produce at least one improvement this generation? Maintains Beta(α, β) posteriors over each arm’s true hit rate. Allocation weights are drawn from these posteriors each generation, so uncertain arms are still explored but arms with consistently higher hit rates get proportionally more budget over time.
Thompson Sampling is well-suited here because most moves are non-productive (hit rate ~1–5%), so the Bernoulli posterior handles sparse signals natively. Budget allocation is proportional (soft) rather than winner-take-all, so all arms remain explored. Statistics persist across runs (saved to JSON), letting the bandit accumulate meaningful signal over many optimisation campaigns.
Prior α₀ / β₀
The prior encodes domain knowledge about each arm’s expected hit rate, based on empirical observations from fragment-based drug design:
sub, mutate : slightly higher prior hit rate (α=2, β=18 → ~10%)
ring, linker : moderate (α=1, β=14 → ~7%)
decorate : moderate (α=1, β=14 → ~7%)
crossover : lower prior (α=1, β=24 → ~4%)
random : lowest (α=1, β=49 → ~2%) — random injection rarely beats an optimised population directly
These are weak priors (total pseudo-counts 15–25) so they are quickly overridden by real data once enough generations accumulate.
- param arm_names:
- type arm_names:
list of str
- param stats_file:
If provided, α/β counts are loaded on init and saved after each update. Pass None to disable persistence.
- type stats_file:
str or None — path to JSON file for persistent statistics.
- param prior_alpha:
- type prior_alpha:
dict mapping arm name -> α₀ (default: built-in domain priors)
- param prior_beta:
- type prior_beta:
dict mapping arm name -> β₀ (default: built-in domain priors)
- sample_weights()[source]
Draw one sample from each arm’s Beta posterior.
Returns a dict arm -> sampled weight. Use these as proportional allocation weights for the scoring budget.
- property hit_rate
α / (α + β).
- Type:
Posterior mean hit rate per arm
- class lacan.gen.HallOfFame(max_size=100, sim_threshold=0.55, higher_is_better=True)[source]
Bases:
objectDiversity-gated collection of the best molecules seen across all generations.
A new molecule is accepted only if its Tanimoto similarity to every current member is below
sim_threshold. When full, a new molecule displaces the worst member if it scores better.- Parameters:
max_size (int — maximum number of entries (default 100))
sim_threshold (float — max allowed Tanimoto similarity to existing members) – (default 0.70)
higher_is_better (bool (default True))
- offer(smi, score, mol)[source]
Try to add (smi, score) to the HallOfFame.
Diversity gate: the candidate is accepted only if its maximum Tanimoto similarity to the
_DIVERSITY_GATE_Kmost similar existing members is belowsim_threshold. Using the top-K rather than all members prevents mediocre early entries from permanently blocking better later ones that happen to be structurally related.Returns True if accepted, False if rejected.
- property best_score
- lacan.gen.generate_optimized_molecules(scoring_function, profile, seed=123, startN=None, generations=None, popsize=None, scoring_budget=None, higher_is_better=True, diversity_threshold=0.35, plateau_patience=None, base_explore_fraction=None, explore_ratio=None, hof_size=None, hof_sim_threshold=None, sim_threshold=None, pool_diversity_k=2, n_replacements=None, conservative=None, fragcorpus=None, n_jobs=-1, quiet=False, callback=None, seed_mols=None, preset='ml', bandit_stats_file=None, maxmin_cap_factor=None, win_threshold=None, **kwargs)[source]
Adaptive GA with per-operation Thompson Sampling bandit, scoring budget, and HallOfFame.
Quick start
For most use cases, set
preset=and override individual parameters as needed:# Fast ML model winners = gen.generate_optimized_molecules(rf_score, profile, preset="ml") # Slow docking function — tight budget winners = gen.generate_optimized_molecules(dock, profile, preset="docking", scoring_budget=15) # GuacaMol benchmark — unlimited scoring budget winners = gen.generate_optimized_molecules(guac_score, profile, preset="guacamol")
Presets
preset="ml"— fast QSAR/ML models (0.1–10 ms/mol)preset="medium"— ADMET, shape, ML ensembles (0.1–2 s/mol)preset="docking"— physics-based docking (5–30 s/mol)preset="guacamol"— unlimited scoring budget, large population, mutation-dominant allocationAll preset values can be overridden by passing explicit kwargs.
Key parameters
- scoring_budgetint
Hard cap on molecules sent to the scoring function per generation. Candidates are pre-screened with cheap property filters and MaxMin diversity sampling before scoring. This is the most important knob for slow scoring functions.
- base_explore_fractionfloat in [0,1]
Baseline fraction of the scoring budget allocated to exploration arms (ring/linker/sub replacement, decoration, crossover, random injection) vs exploitation arms (atom-level mutation). On scoring plateau this shifts toward mutation (highest empirical hit rate); on diversity collapse it shifts toward exploration.
- hof_sizeint
Maximum size of the HallOfFame. The HoF collects the best diverse molecules found across all generations; the final result is the top-N from it sorted best-first.
- hof_sim_thresholdfloat
Maximum Tanimoto similarity allowed between two HallOfFame members (diversity gate). Lower = more diverse but smaller effective HoF.
- diversity_thresholdfloat
If mean pairwise Tanimoto distance of the pool falls below this, explore_fraction is boosted.
- plateau_patienceint
After this many generations without improvement, explore_fraction is shifted toward mutation to exploit the arm with the highest hit rate.
- returns:
All HallOfFame members sorted best-first.
- rtype:
list of (smiles, score) tuples
- class lacan.gen.GAReporter(label='run')[source]
Bases:
objectCollects per-generation statistics from
generate_optimized_molecules()and plots them.Pass an instance as the
callback=argument to the GA. After the run, callplot()to visualise the results, or compare multiple runs by passing a list of reporters tocompare().- Parameters:
label (str) – Name for this run, shown in plot legends (default
"run").
Example
reporter = GAReporter(label="docking_preset") winners = gen.generate_optimized_molecules( my_score_fn, profile, preset="docking", callback=reporter, ) reporter.plot()
- plot(ax=None, show=True)[source]
Plot score, diversity, explore fraction, and per-arm hit rates over time.
- lacan.gen.optimize_from_mol(mol, scoring_function, profile, generations=20, beam_width=10, n_replacements=15, win_threshold=0.8, higher_is_better=True, conservative=True, fragcorpus=None, plateau_patience=5, quiet=False, callback=None, protect_smarts=None)[source]
Optimise a single seed molecule using fragment-level operations.
Unlike
generate_optimized_molecules(), which explores chemical space from random starting structures, this function starts from one specific molecule and iteratively improves it by applying ring, linker, substituent, and decoration operations — then retaining the best results as parents for the next round (beam search).- Parameters:
mol (RDKit Mol — the seed / lead molecule)
scoring_function (callable —
fn(list[Mol]) -> list[float])profile (LACAN profile dict)
generations (int — maximum number of rounds (default 20))
beam_width (int — top molecules kept as parents each round (default 10))
n_replacements (int — fragment operation attempts per parent (default 15))
win_threshold (float — score at which a molecule is reported as winner) – (default 0.8). If no winners found, best beam returned.
higher_is_better (bool (default True))
conservative (bool — prefer similar fragment replacements (default True))
fragcorpus (fragment corpus (default: ChEMBL rls.csv))
plateau_patience (int — stop early after this many generations without) – improvement (default 5)
quiet (bool (default False))
callback (callable or None — called each generation with stats dict)
protect_smarts (str or None — SMARTS pattern; atoms matching it are skipped) – by all fragment operations every generation. The protected set is re-derived from each parent mol on every call via
get_protected_atoms(), so it survives the SMILES round-trip in the beam transparently.Nonedisables atom exclusion entirely (default).
- Return type:
list of (smiles, score) tuples, sorted best-first