Overview

LACAN (Leveraging Adjacent Co-occurrence of Atomic Neighborhoods) is a statistical molecular filter and generative toolkit. It scores molecules by asking: how likely is each bond, given the chemical environments on both sides of it? A model trained on ~27 million bonds from ChEMBL assigns a pointwise mutual information (PMI) score to every bond environment pair. The molecule-level score is derived from the minimum per-bond PMI, so a single unusual bond is enough to flag a molecule.

Scoring model

For each bond the model computes two ECFP2-like atom environment hashes — one per endpoint — and looks them up in a pre-built profile. The PMI score is:

observed  = pairs[(e1, e2)] / setsize
expected  = (idx[e1] / setsize / 2) * (idx[e2] / setsize / 2)
PMI       = observed / expected

The molecule-level score:

score = min_PMI / (1 + min_PMI)

saturates toward 1.0 as the worst-bond PMI grows large, and approaches 0 when the worst bond is near zero.
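The two formulas above can be traced on a toy profile. The sketch below is illustrative only: it assumes `pairs` and `idx` behave like plain dictionaries of counts, that pairs are looked up as ordered tuples, and that unseen environments count as zero. None of these details are confirmed by the library. The function names `bond_pmi` and `molecule_score` are hypothetical and are not part of `lacan.lacan`. The ratio is used directly, as written above, with no logarithm.

```python
def bond_pmi(e1, e2, pairs, idx, setsize):
    """Ratio-form PMI for one bond, following the formulas in this section."""
    # Assumption: an unseen pair or environment counts as zero occurrences.
    observed = pairs.get((e1, e2), 0) / setsize
    expected = (idx.get(e1, 0) / setsize / 2) * (idx.get(e2, 0) / setsize / 2)
    if expected == 0:
        return 0.0
    return observed / expected


def molecule_score(bond_pmis):
    """Map the worst (minimum) per-bond PMI into the range [0, 1)."""
    min_pmi = min(bond_pmis)
    return min_pmi / (1 + min_pmi)


# Toy profile: three environments, two observed bond types (made-up counts).
setsize = 100
idx = {"A": 40, "B": 20, "C": 40}
pairs = {("A", "B"): 8, ("B", "C"): 1}

pmis = [
    bond_pmi("A", "B", pairs, idx, setsize),  # common pairing, PMI near 4
    bond_pmi("B", "C", pairs, idx, setsize),  # rare pairing, PMI near 0.5
]
score = molecule_score(pmis)
print(f"per-bond PMI: {pmis}, molecule score: {score:.3f}")
```

Because only the minimum enters the score, the rare B-C bond alone pulls the result down to roughly 0.33, however favourable the A-B bond is.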

Module overview

Module            Purpose
----------------  ----------------------------------------------------------------------
lacan.lacan       Core scoring: score_mol, assess_per_bond, profile I/O
lacan.mutate      Atom-level mutations (40+ reaction SMARTS), the EXPLOIT step
lacan.replace     Coarse fragment swaps (ring / substituent / linker), the EXPLORE step
lacan.breed       Molecular crossover via fragment recombination
lacan.gen         Random generation, corpus biasing, adaptive GA
lacan.protect     SMARTS-based atom exclusion, bond protection, mol_cleaner
lacan.decompose   Molecule fragmentation; corpus building

Quick start

from rdkit import Chem
from lacan import lacan, gen, mutate, replace

profile = lacan.load_profile("chembl")

mol = Chem.MolFromSmiles("CCCc1nn(C)c2c(=O)[nH]c(-c3ccccc3)nc12")
score, info = lacan.score_mol(mol, profile)
print(f"Score: {score:.3f}  bad bonds: {info['bad_bonds']}")

# Generate drug-like molecules
mols = gen.generate_filtered_molecules(profile, n_molecules=10, n_jobs=-1)

# Optimise toward a scoring function
def my_score(mols):
    return [lacan.score_mol(m, profile)[0] for m in mols]

winners = gen.generate_optimized_molecules(my_score, profile,
                                           startN=20, generations=5)

Genetic algorithm

generate_optimized_molecules() runs an adaptive GA that balances exploration and exploitation each generation using two mechanisms:

Smooth explore fraction

A float explore_fraction (0–1) controls the budget split between exploration arms (ring/substituent/linker replacement, scaffold decoration, crossover, random injection) and exploitation arms (atom-level mutation from lacan.mutate). It shifts toward mutation when the population plateaus, and toward exploration on diversity collapse, decaying back to a user-set baseline otherwise.
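One way to picture this behaviour is a small update rule applied once per generation. The sketch below is a guess at the general shape, not the library's implementation: the function name `update_explore_fraction`, the `step` and `decay` parameters, and the choice to give diversity collapse priority over a plateau are all assumptions made for illustration.

```python
def update_explore_fraction(current, baseline, plateaued, diversity_collapsed,
                            step=0.1, decay=0.5):
    """Return the explore fraction for the next generation (hypothetical rule)."""
    if diversity_collapsed:
        # Population has become too similar: spend more on exploration arms.
        target = current + step
    elif plateaued:
        # Best score has stalled: spend more on atom-level mutation.
        target = current - step
    else:
        # No signal: relax part of the way back to the user-set baseline.
        target = current + decay * (baseline - current)
    return min(1.0, max(0.0, target))


def split_budget(budget, explore_fraction):
    """Split a per-generation budget into (exploration, exploitation) counts."""
    n_explore = round(budget * explore_fraction)
    return n_explore, budget - n_explore


fraction = update_explore_fraction(0.5, baseline=0.5, plateaued=True,
                                   diversity_collapsed=False)
print(fraction, split_budget(200, fraction))
```

Clamping to the 0 to 1 range keeps repeated plateau or collapse signals from pushing the fraction outside the interval described above.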

Per-operation Thompson Sampling bandit

Each operation is treated as an independent arm with a Beta posterior over its hit rate. Budget is allocated proportionally to sampled weights each generation, so productive arms receive more budget while all arms remain explored. Statistics can optionally persist across runs.
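The allocation step can be sketched with the standard library alone. This is a generic Beta-Bernoulli Thompson Sampling example under stated assumptions, not the code in `lacan.gen`: the arm names, the uniform Beta(1, 1) prior, the definition of a "hit", and the helper names `allocate_budget` and `record_outcome` are all hypothetical.

```python
import random


def allocate_budget(arms, budget, rng):
    """Split `budget` across arms in proportion to one posterior sample each."""
    # Draw one hit-rate sample per arm from Beta(1 + hits, 1 + misses).
    samples = {
        name: rng.betavariate(1 + s["hits"], 1 + s["misses"])
        for name, s in arms.items()
    }
    total = sum(samples.values())
    alloc = {name: int(budget * w / total) for name, w in samples.items()}
    # Hand the rounding remainder to the arms with the largest samples.
    leftover = max(0, budget - sum(alloc.values()))
    for name in sorted(samples, key=samples.get, reverse=True)[:leftover]:
        alloc[name] += 1
    return alloc


def record_outcome(arms, name, tried, hits):
    """Update one arm's counts after a generation."""
    arms[name]["hits"] += hits
    arms[name]["misses"] += tried - hits


# Hypothetical arms with made-up statistics.
arms = {
    "mutate": {"hits": 12, "misses": 30},
    "ring_replace": {"hits": 3, "misses": 40},
    "crossover": {"hits": 0, "misses": 0},  # untried arm, prior only
}
rng = random.Random(0)
alloc = allocate_budget(arms, 100, rng)
print(alloc)
```

Because each generation draws fresh samples, an arm with few trials has a wide posterior and still occasionally wins a large share, which is what keeps weak or untried arms in play.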

Results are collected in a HallOfFame that retains the all-time best diverse molecules with a Tanimoto diversity gate.
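A diversity gate of this kind can be illustrated without RDKit by treating fingerprints as sets of on-bits. The class below is a minimal sketch, not the library's HallOfFame: the name `DiverseHallOfFame`, the `max_similarity` threshold, and the rule that a better molecule displaces its near-duplicates are assumptions chosen for illustration.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of fingerprint bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


class DiverseHallOfFame:
    """Keep the best-scoring entries, with no two more similar than a threshold."""

    def __init__(self, max_size=10, max_similarity=0.7):
        self.max_size = max_size
        self.max_similarity = max_similarity
        self.entries = []  # (score, name, fingerprint), best first

    def offer(self, name, score, fp):
        """Try to add a candidate; return True if it is retained."""
        similar = [e for e in self.entries
                   if tanimoto(fp, e[2]) > self.max_similarity]
        # Gate: reject if an equally good or better near-duplicate exists.
        if any(e[0] >= score for e in similar):
            return False
        # Otherwise the candidate displaces its weaker near-duplicates.
        self.entries = [e for e in self.entries if e not in similar]
        self.entries.append((score, name, fp))
        self.entries.sort(key=lambda e: e[0], reverse=True)
        del self.entries[self.max_size:]
        return any(e[1] == name for e in self.entries)


hof = DiverseHallOfFame(max_size=2, max_similarity=0.5)
results = [
    hof.offer("a", 0.5, frozenset({1, 2, 3})),
    hof.offer("b", 0.4, frozenset({1, 2, 3, 4})),  # similar to "a" and worse
    hof.offer("c", 0.9, frozenset({1, 2, 3, 4})),  # similar to "a" but better
    hof.offer("d", 0.7, frozenset({7, 8})),        # dissimilar
    hof.offer("e", 0.1, frozenset({20})),          # dissimilar, but hall is full
]
print(results, [name for _, name, _ in hof.entries])
```

With real molecules, the set-based `tanimoto` would be replaced by a similarity computed on RDKit fingerprints; the gating logic stays the same.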

Presets

preset="ml" / "medium" / "docking" / "guacamol" provide sensible defaults for fast, medium, slow, and unlimited-budget scoring functions respectively. Individual parameters always override preset values.

See generate_optimized_molecules() for the full parameter reference.