Overview
LACAN (Leveraging Adjacent Co-occurrence of Atomic Neighborhoods) is a statistical molecular filter and generative toolkit. It scores molecules by asking: how likely is each bond, given the chemical environments on both sides of it? A model trained on ~27 million bonds from ChEMBL assigns a pointwise mutual information (PMI) score to every bond environment pair. The molecule-level score is derived from the minimum per-bond PMI, so a single unusual bond is enough to flag a molecule.
Scoring model
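For each bond, the model computes two ECFP2-like atom environment hashes, one per endpoint, and looks them up in a pre-built profile. The PMI score is computed as a plain observed-to-expected ratio (no logarithm is taken), so it is non-negative and equals 1 when the pair occurs exactly as often as chance predicts:

    observed = pairs[(e1, e2)] / setsize
    expected = (idx[e1] / setsize / 2) * (idx[e2] / setsize / 2)
    PMI = observed / expected
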
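As a worked illustration of the formulas above, the sketch below evaluates them on a toy profile. The counts, the environment labels and the helper name bond_pmi are invented for illustration; only pairs, idx and setsize come from the formulas. It also assumes that idx holds per-environment occurrence counts and that a pair missing from pairs counts as zero, neither of which is confirmed here.

```python
def bond_pmi(pairs, idx, setsize, e1, e2):
    """Observed/expected ratio for one bond, following the formulas above."""
    # Assumption: an environment pair absent from the profile has count 0.
    observed = pairs.get((e1, e2), 0) / setsize
    expected = (idx[e1] / setsize / 2) * (idx[e2] / setsize / 2)
    return observed / expected


# Toy profile with made-up numbers: 1000 bonds in total.
setsize = 1000
idx = {"envA": 200, "envB": 100}   # assumed: occurrences of each environment
pairs = {("envA", "envB"): 20}     # assumed: times the two are bonded together

pmi = bond_pmi(pairs, idx, setsize, "envA", "envB")
print(f"{pmi:.1f}")  # prints 4.0
```

A ratio above 1 means the pair is seen more often than chance would predict. A ratio of 0 means the pair never appears in the profile.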
The molecule-level score

    score = min_PMI / (1 + min_PMI)

saturates toward 1.0 as the worst-bond PMI grows large, and approaches 0 when the worst-bond PMI is near zero.
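The saturation behaviour can be checked numerically. This is a minimal sketch of the score formula above, not the library's code: the helper name molecule_score and the example PMI values are hypothetical.

```python
def molecule_score(bond_pmis):
    """Map the worst (minimum) per-bond PMI to a score in [0, 1)."""
    min_pmi = min(bond_pmis)  # raises ValueError for a molecule with no bonds
    return min_pmi / (1 + min_pmi)


print(round(molecule_score([4.0, 9.0, 1.0]), 3))  # worst bond at chance level -> 0.5
print(round(molecule_score([4.0, 9.0, 0.0]), 3))  # one never-seen bond -> 0.0
print(round(molecule_score([99.0, 200.0]), 3))    # every bond common -> 0.99
```

Because only the minimum enters the formula, the two well-supported bonds in the second example do nothing to rescue the molecule.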
Module overview
| Module | Purpose |
|---|---|
| lacan.lacan | Core scoring: |
| lacan.mutate | Atom-level mutations (40+ reaction SMARTS): the EXPLOIT step |
| lacan.replace | Coarse fragment swaps (ring / substituent / linker): the EXPLORE step |
| | Molecular crossover via fragment recombination |
| lacan.gen | Random generation, corpus biasing, adaptive GA |
| | SMARTS-based atom exclusion, bond protection, |
| | Molecule fragmentation; corpus building |
Quick start

    from rdkit import Chem
    from lacan import lacan, gen, mutate, replace

    # Load the pre-built ChEMBL profile and score a single molecule
    profile = lacan.load_profile("chembl")
    mol = Chem.MolFromSmiles("CCCc1nn(C)c2c(=O)[nH]c(-c3ccccc3)nc12")
    score, info = lacan.score_mol(mol, profile)
    print(f"Score: {score:.3f} bad bonds: {info['bad_bonds']}")

    # Generate drug-like molecules
    mols = gen.generate_filtered_molecules(profile, n_molecules=10, n_jobs=-1)

    # Optimise toward a scoring function (takes a list of molecules, returns a list of scores)
    def my_score(mols):
        return [lacan.score_mol(m, profile)[0] for m in mols]

    winners = gen.generate_optimized_molecules(my_score, profile,
                                               startN=20, generations=5)

Genetic algorithm
generate_optimized_molecules() runs an adaptive GA that balances exploration and exploitation each generation using two mechanisms:
- Smooth explore fraction
  A float explore_fraction (0–1) controls the budget split between exploration arms (ring/substituent/linker replacement, scaffold decoration, crossover, random injection) and exploitation arms (atom-level mutation from lacan.mutate). It shifts toward mutation when the population plateaus and toward exploration on diversity collapse, decaying back to a user-set baseline otherwise.
- Per-operation Thompson Sampling bandit
  Each operation is treated as an independent arm with a Beta posterior over its hit rate. Budget is allocated proportionally to sampled weights each generation, so productive arms receive more budget while all arms remain explored. Statistics can optionally persist across runs.
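
The two mechanisms above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the code inside generate_optimized_molecules(). The names update_explore_fraction and ThompsonBandit, the step and decay constants, the uniform Beta(1, 1) prior and the arm labels are all hypothetical.

```python
import random


def update_explore_fraction(current, baseline, plateau, diversity_collapse,
                            step=0.1, decay=0.5):
    """Nudge the explore fraction, then clamp it to [0, 1]."""
    if diversity_collapse:
        current += step                                    # more exploration
    elif plateau:
        current -= step                                    # more mutation
    else:
        current = baseline + (current - baseline) * decay  # relax to baseline
    return min(1.0, max(0.0, current))


class ThompsonBandit:
    """One Beta(hits + 1, misses + 1) posterior per operation (assumed prior)."""

    def __init__(self, arms, seed=None):
        self.stats = {arm: [0, 0] for arm in arms}  # arm -> [hits, misses]
        self.rng = random.Random(seed)

    def update(self, arm, hits, misses):
        self.stats[arm][0] += hits
        self.stats[arm][1] += misses

    def allocate(self, budget):
        """Split an integer budget in proportion to one posterior sample per arm."""
        samples = {arm: self.rng.betavariate(h + 1, m + 1)
                   for arm, (h, m) in self.stats.items()}
        total = sum(samples.values())
        shares = {arm: int(budget * s / total) for arm, s in samples.items()}
        # Give the rounding remainder to the arm with the largest sample.
        best = max(samples, key=samples.get)
        shares[best] += budget - sum(shares.values())
        return shares


bandit = ThompsonBandit(["mutate", "replace_ring", "crossover"], seed=0)
bandit.update("mutate", hits=8, misses=2)      # a productive arm
bandit.update("crossover", hits=1, misses=9)   # an unproductive arm
shares = bandit.allocate(budget=100)
print(shares)
```

Sampling from each posterior, rather than using its mean, is what keeps rarely tried arms in play: an arm with little evidence has a wide posterior and will occasionally draw a large weight.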
Results are collected in a HallOfFame that retains the
all-time best diverse molecules with a Tanimoto diversity gate.
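
One way to picture a diversity gate is the sketch below. It is not the library's HallOfFame: the class name DiverseHallOfFame, its parameters and the 0.7 similarity cut-off are hypothetical, and fingerprints are modelled as plain Python sets of bit indices so that the example runs without RDKit.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class DiverseHallOfFame:
    """Keep the best-scoring entries that are mutually dissimilar."""

    def __init__(self, max_size=10, max_similarity=0.7):
        self.max_size = max_size
        self.max_similarity = max_similarity
        self.entries = []  # (score, name, fingerprint), best first

    def add(self, name, score, fingerprint):
        """Try to admit a candidate; return True if it ends up in the hall."""
        candidate = (score, name, frozenset(fingerprint))
        pool = sorted(self.entries + [candidate],
                      key=lambda e: e[0], reverse=True)
        kept = []
        for entry in pool:
            # Diversity gate: skip anything too close to a better-scoring entry.
            if all(tanimoto(entry[2], k[2]) < self.max_similarity for k in kept):
                kept.append(entry)
        self.entries = kept[:self.max_size]
        return candidate in self.entries


hof = DiverseHallOfFame(max_size=2, max_similarity=0.7)
hof.add("A", 0.90, {1, 2, 3, 4})
hof.add("B", 0.80, {1, 2, 3, 4, 5})  # similarity 0.8 to A: rejected
hof.add("C", 0.70, {7, 8, 9})        # dissimilar: admitted
hof.add("D", 0.95, {1, 2, 3, 5})     # similarity 0.6 to A: admitted, C drops out
print([name for _, name, _ in hof.entries])  # ['D', 'A']
```

In this sketch a new candidate that outscores a similar incumbent displaces it, so the hall tracks the best representative of each neighbourhood.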
Presets: preset="ml", "medium", "docking", and "guacamol" provide sensible defaults for fast, medium, slow, and unlimited-budget scoring functions, respectively. Individual parameters always override preset values.
See generate_optimized_molecules() for the full parameter
reference.
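
The override rule for presets amounts to a dictionary merge in which explicit arguments win. The sketch below shows only that rule. The helper resolve_params and the PRESETS table are hypothetical, and the numbers in it are placeholders rather than the library's actual defaults. Only the parameter names startN and generations are taken from the Quick start.

```python
# Placeholder preset table: the values are invented, not the real defaults.
PRESETS = {
    "ml": {"startN": 100, "generations": 50},
    "docking": {"startN": 20, "generations": 5},
}


def resolve_params(preset=None, **overrides):
    """Start from the preset (if any), then let explicit arguments win."""
    params = dict(PRESETS[preset]) if preset else {}
    params.update(overrides)
    return params


print(resolve_params(preset="docking", generations=10))
# {'startN': 20, 'generations': 10}
```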