lacan.gen

gen.py — Random molecule generation and adaptive genetic algorithm for LACAN.

This module has two responsibilities:

Random generation — assemble drug-like molecules from scratch by recursively sampling fragments from the corpus and joining them through their dummy-atom attachment points.
Optimisation — run an adaptive GA (generate_optimized_molecules()) that iteratively improves molecules toward a user-defined scoring function while maintaining population diversity.

Random generation

generate_random_molecule() builds a molecule by starting from a randomly chosen corpus fragment and recursively filling its open attachment points with rings, linkers, or substituents sampled proportionally to their corpus frequency. The recursion terminates when all dummies are consumed.

generate_filtered_molecule() wraps the above with a rejection loop that discards molecules outside the atom-count window or below the LACAN score threshold.

generate_filtered_molecules() parallelises the above using multiprocessing.Pool, giving each worker a unique seed offset so the molecules are independent.

Corpus biasing

get_corpus_from_mols() / bias_corpus() build a merged corpus that over-represents fragments from a reference set (e.g. known actives). The ratio parameter controls the boost: at ratio=2 the reference fragments’ effective frequency is doubled relative to the background ChEMBL corpus.

The resulting corpus can be passed as fragcorpus= to any generation function to steer sampling toward the desired chemotype.

Adaptive GA

See generate_optimized_molecules() for the full description.

GA design

A HallOfFame collects the all-time best diverse molecules; the final result is simply the top-N from it.
scoring_budget — a hard cap on molecules sent to the scoring function per generation. Candidates are pre-screened with cheap property filters and MaxMin diversity sampling before scoring, keeping the budget tight even when raw operation output is large.
Per-operation Thompson Sampling bandit — seven arms (ring replacement, linker replacement, substituent replacement, scaffold decoration, crossover, random injection, and atom-level mutation) each maintain a Beta posterior over their hit rate. Budget is allocated proportionally to sampled weights, so productive arms get more budget while all arms remain explored.
Smooth explore_fraction — a float in [0,1] controls the mix of exploration arms (ring/linker/sub-replace, decoration, crossover, random) vs exploitation arms (mutation). On plateau it shifts toward mutation (highest empirical hit rate); on diversity collapse it shifts toward exploration. It decays back toward a user-set baseline otherwise.
Presets — preset="ml" / "medium" / "docking" / "guacamol" set a coherent bundle of defaults appropriate for fast, medium, slow, and unlimited-budget scoring functions respectively. Individual kwargs override preset values.

lacan.gen.load_corpus(min_count=200)[source]

Load the ChEMBL fragment corpus from data/rls.csv, with caching.

The CSV is read only on the first call for a given min_count value; subsequent calls return the cached result immediately.

Parameters:: min_count (int) – Minimum occurrence count for a fragment to be included (default 200).
Return type:: list of [smiles, count, degree, ftype, bonds]

lacan.gen.generate_random_molecule(fragments=[], fragcorpus=None, needed_sample=[], bondstring='', mol=None)[source]

Recursively assemble a random molecule from corpus fragments.

Parameters:

fragments (list — accumulates SMILES of all fragments used so far)
fragcorpus (list of corpus entries (default: ChEMBL rls.csv))
needed_sample (list of lists — allowed fragment types for each open point)
bondstring (str — bond types for the current open attachment points)
mol (RDKit Mol — molecule built so far)

Return type:

(mol, fragments)

lacan.gen.generate_filtered_molecule(profile, fragcorpus=None, threshold=0.5, seed=None, min_atoms=14, max_atoms=35)[source]

Generate one random molecule that passes the LACAN score and size filters.

Parameters:

profile (LACAN profile dict)
fragcorpus (fragment corpus (default: ChEMBL rls.csv))
threshold (minimum LACAN score to accept (default 0.5))
seed (int or None)
min_atoms (minimum heavy-atom count, exclusive (default 14))
max_atoms (maximum heavy-atom count, exclusive (default 35))

Return type:

RDKit Mol

lacan.gen.generate_filtered_molecules(profile, fragcorpus=None, threshold=0.5, seed=None, min_atoms=14, max_atoms=35, n_jobs=1, n_molecules=10)[source]

Generate multiple filtered molecules, optionally in parallel.

Parameters:

profile (LACAN profile dict)
fragcorpus (fragment corpus (default: ChEMBL rls.csv))
threshold (minimum LACAN score (default 0.5))
seed (int or None — base seed)
min_atoms (minimum heavy-atom count, exclusive (default 14))
max_atoms (maximum heavy-atom count, exclusive (default 35))
n_jobs (int — parallel workers; 1 = sequential, -1 = all CPU cores)
n_molecules (int — number of molecules to generate (default 10))

Return type:

list of RDKit Mol

lacan.gen.get_corpus_from_mols(mols, fragcorpus=None, ratio=1)[source]

Merge a custom fragment corpus from reference molecules into the background corpus.

Parameters:

mols (iterable of RDKit Mol objects)
fragcorpus (background corpus (default: ChEMBL rls.csv))
ratio (float — frequency multiplier for reference fragments (default 1))

Return type:

list of corpus entries

lacan.gen.bias_corpus(mols, fragcorpus=None, ratio=2.0)[source]

Build a fragment corpus biased toward the chemistry of the provided molecules.

Parameters:

mols (iterable of RDKit Mol objects — reference molecules)
fragcorpus (background corpus (default: ChEMBL rls.csv))
ratio (float — frequency multiplier (default 2.0))

Return type:

list of corpus entries

class lacan.gen.ThompsonBandit(arm_names, stats_file=None, prior_alpha=None, prior_beta=None)[source]

Bases: object

Beta-Bernoulli Thompson Sampling bandit over GA operations.

Models each arm as a Bernoulli process: did this arm produce at least one improvement this generation? Maintains Beta(α, β) posteriors over each arm’s true hit rate. Allocation weights are drawn from these posteriors each generation, so uncertain arms are still explored but arms with consistently higher hit rates get proportionally more budget over time.

Thompson Sampling is well-suited here because most moves are non-productive (hit rate ~1–5%), so the Bernoulli posterior handles sparse signals natively. Budget allocation is proportional (soft) rather than winner-take-all, so all arms remain explored. Statistics persist across runs (saved to JSON), letting the bandit accumulate meaningful signal over many optimisation campaigns.

Prior α₀ / β₀

The prior encodes domain knowledge about each arm’s expected hit rate, based on empirical observations from fragment-based drug design:

sub, mutate : slightly higher prior hit rate (α=2, β=18 → ~10%)
ring, linker : moderate (α=1, β=14 → ~7%)
decorate : moderate (α=1, β=14 → ~7%)
crossover : lower prior (α=1, β=24 → ~4%)
random : lowest (α=1, β=49 → ~2%) — random injection rarely beats an optimised population directly

These are weak priors (total pseudo-counts 15–25) so they are quickly overridden by real data once enough generations accumulate.

param arm_names:
type arm_names:: list of str
param stats_file:: If provided, α/β counts are loaded on init and saved after each update. Pass None to disable persistence.
type stats_file:: str or None — path to JSON file for persistent statistics.
param prior_alpha:
type prior_alpha:: dict mapping arm name -> α₀ (default: built-in domain priors)
param prior_beta:
type prior_beta:: dict mapping arm name -> β₀ (default: built-in domain priors)

save()[source]: Persist current α/β counts to stats_file.

reset_stats()[source]: Reset posteriors to priors and delete stats_file if present.

sample_weights()[source]

Draw one sample from each arm’s Beta posterior.

Returns a dict arm -> sampled weight. Use these as proportional allocation weights for the scoring budget.

update(arm, success)[source]

Update posterior for arm.

Parameters:

arm (str)
success (bool or int — True/1 if the arm produced at least one) – improvement this generation, False/0 otherwise.

property hit_rate

α / (α + β).

Type:: Posterior mean hit rate per arm

summary()[source]

class lacan.gen.HallOfFame(max_size=100, sim_threshold=0.55, higher_is_better=True)[source]

Bases: object

Diversity-gated collection of the best molecules seen across all generations.

A new molecule is accepted only if its Tanimoto similarity to every current member is below sim_threshold. When full, a new molecule displaces the worst member if it scores better.

Parameters:

max_size (int — maximum number of entries (default 100))
sim_threshold (float — max allowed Tanimoto similarity to existing members) – (default 0.70)
higher_is_better (bool (default True))

offer(smi, score, mol)[source]

Try to add (smi, score) to the HallOfFame.

Diversity gate: the candidate is accepted only if its maximum Tanimoto similarity to the _DIVERSITY_GATE_K most similar existing members is below sim_threshold. Using the top-K rather than all members prevents mediocre early entries from permanently blocking better later ones that happen to be structurally related.

Returns True if accepted, False if rejected.

top(n=None)[source]: Return top-n (smiles, score) pairs sorted best-first.

property best_score

lacan.gen.generate_optimized_molecules(scoring_function, profile, seed=123, startN=None, generations=None, popsize=None, scoring_budget=None, higher_is_better=True, diversity_threshold=0.35, plateau_patience=None, base_explore_fraction=None, explore_ratio=None, hof_size=None, hof_sim_threshold=None, sim_threshold=None, pool_diversity_k=2, n_replacements=None, conservative=None, fragcorpus=None, n_jobs=-1, quiet=False, callback=None, seed_mols=None, preset='ml', bandit_stats_file=None, maxmin_cap_factor=None, win_threshold=None, **kwargs)[source]

Adaptive GA with per-operation Thompson Sampling bandit, scoring budget, and HallOfFame.

Quick start

For most use cases, set preset= and override individual parameters as needed:

# Fast ML model
winners = gen.generate_optimized_molecules(rf_score, profile, preset="ml")

# Slow docking function — tight budget
winners = gen.generate_optimized_molecules(dock, profile, preset="docking",
                                            scoring_budget=15)

# GuacaMol benchmark — unlimited scoring budget
winners = gen.generate_optimized_molecules(guac_score, profile, preset="guacamol")

Presets

preset="ml" — fast QSAR/ML models (0.1–10 ms/mol) preset="medium" — ADMET, shape, ML ensembles (0.1–2 s/mol) preset="docking" — physics-based docking (5–30 s/mol) preset="guacamol" — unlimited scoring budget, large population, mutation-dominant allocation

All preset values can be overridden by passing explicit kwargs.

Key parameters

scoring_budgetint: Hard cap on molecules sent to the scoring function per generation. Candidates are pre-screened with cheap property filters and MaxMin diversity sampling before scoring. This is the most important knob for slow scoring functions.
base_explore_fractionfloat in [0,1]: Baseline fraction of the scoring budget allocated to exploration arms (ring/linker/sub replacement, decoration, crossover, random injection) vs exploitation arms (atom-level mutation). On scoring plateau this shifts toward mutation (highest empirical hit rate); on diversity collapse it shifts toward exploration.
hof_sizeint: Maximum size of the HallOfFame. The HoF collects the best diverse molecules found across all generations; the final result is the top-N from it sorted best-first.
hof_sim_thresholdfloat: Maximum Tanimoto similarity allowed between two HallOfFame members (diversity gate). Lower = more diverse but smaller effective HoF.
diversity_thresholdfloat: If mean pairwise Tanimoto distance of the pool falls below this, explore_fraction is boosted.
plateau_patienceint: After this many generations without improvement, explore_fraction is shifted toward mutation to exploit the arm with the highest hit rate.

returns:: All HallOfFame members sorted best-first.
rtype:: list of (smiles, score) tuples

class lacan.gen.GAReporter(label='run')[source]

Bases: object

Collects per-generation statistics from generate_optimized_molecules() and plots them.

Pass an instance as the callback= argument to the GA. After the run, call plot() to visualise the results, or compare multiple runs by passing a list of reporters to compare().

Parameters:: label (str) – Name for this run, shown in plot legends (default "run").

Example

reporter = GAReporter(label="docking_preset")
winners = gen.generate_optimized_molecules(
    my_score_fn, profile,
    preset="docking",
    callback=reporter,
)
reporter.plot()

plot(ax=None, show=True)[source]: Plot score, diversity, explore fraction, and per-arm hit rates over time.

static compare(reporters, metric='best_pool', show=True)[source]: Overlay score curves from multiple runs.

summary()[source]: Print a compact summary table of the run.

lacan.gen.optimize_from_mol(mol, scoring_function, profile, generations=20, beam_width=10, n_replacements=15, win_threshold=0.8, higher_is_better=True, conservative=True, fragcorpus=None, plateau_patience=5, quiet=False, callback=None, protect_smarts=None)[source]

Optimise a single seed molecule using fragment-level operations.

Unlike generate_optimized_molecules(), which explores chemical space from random starting structures, this function starts from one specific molecule and iteratively improves it by applying ring, linker, substituent, and decoration operations — then retaining the best results as parents for the next round (beam search).

Parameters:

mol (RDKit Mol — the seed / lead molecule)
scoring_function (callable — fn(list[Mol]) -> list[float])
profile (LACAN profile dict)
generations (int — maximum number of rounds (default 20))
beam_width (int — top molecules kept as parents each round (default 10))
n_replacements (int — fragment operation attempts per parent (default 15))
win_threshold (float — score at which a molecule is reported as winner) – (default 0.8). If no winners found, best beam returned.
higher_is_better (bool (default True))
conservative (bool — prefer similar fragment replacements (default True))
fragcorpus (fragment corpus (default: ChEMBL rls.csv))
plateau_patience (int — stop early after this many generations without) – improvement (default 5)
quiet (bool (default False))
callback (callable or None — called each generation with stats dict)
protect_smarts (str or None — SMARTS pattern; atoms matching it are skipped) – by all fragment operations every generation. The protected set is re-derived from each parent mol on every call via get_protected_atoms(), so it survives the SMILES round-trip in the beam transparently. None disables atom exclusion entirely (default).

Return type:

list of (smiles, score) tuples, sorted best-first

lacan.gen.next_population(population)[source]: Placeholder for a future incremental population-advance API.