lacan.replace
replace.py — Coarse-grained fragment replacement operations for LACAN.
This module provides four scaffold-editing operations that swap entire fragments (rings, substituents, linkers) for alternatives drawn from the fragment corpus (rls.csv). Together they form the EXPLORE step of the adaptive GA in gen.py.
All four operations share these behaviours:
Random site selection: when a molecule has multiple rings, linkers, or substituents that could be operated on, one is chosen at random each call. This ensures diverse output across repeated calls and prevents the same site from being targeted every time.
Conservative sampling (conservative=True, the default in the replace_* functions): replacement fragments are drawn with probability proportional to both their corpus frequency and their structural similarity to the fragment being replaced (Morgan fingerprint Tanimoto, radius 2, including dummy atoms). Set conservative=False for purely frequency-weighted sampling.
Protection: pass protect_smarts (a SMARTS string) to any function to skip atoms matching the pattern. The exclusion is re-derived from the molecule on every call, so it survives SMILES round-trips transparently.
Deduplication: output molecules are deduplicated by InChIKey and scored by the LACAN profile; molecules scoring 0 are discarded.
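The conservative sampling rule can be illustrated without RDKit. The sketch below assumes the weight is the product of corpus count and Tanimoto similarity (the text only says "proportional to both"), and it represents fingerprints as plain Python sets of feature ids. The names tanimoto and sampling_weights and the toy corpus are hypothetical and not part of lacan.replace.

```python
import random


def tanimoto(a, b):
    """Tanimoto similarity of two feature sets (stand-ins for Morgan bits)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def sampling_weights(query_fp, corpus, conservative=True):
    """Return one sampling weight per corpus entry.

    corpus is a list of (name, count, fingerprint_set) tuples.
    Assumption: the conservative weight is count * similarity.
    """
    if not conservative:
        return [count for _, count, _ in corpus]
    return [count * tanimoto(query_fp, fp) for _, count, fp in corpus]


# Toy corpus: names, counts, and feature sets are made up for illustration.
corpus = [
    ("methyl", 900, {1, 2}),
    ("ethyl", 600, {1, 2, 3, 4}),
    ("nitro", 300, {7, 8}),
]
query = {1, 2, 3, 4}  # fingerprint of the fragment being replaced

w_cons = sampling_weights(query, corpus, conservative=True)
w_freq = sampling_weights(query, corpus, conservative=False)

# Weighted draw; entries with zero weight are never selected.
rng = random.Random(0)
picks = rng.choices([name for name, _, _ in corpus], weights=w_cons, k=5)
```

With conservative weighting the dissimilar fragment drops out entirely even though it is present in the corpus, while frequency-only weighting would still select it in proportion to its count.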
Reaction conventions
The _breakbond SMARTS [!#0:0]!@[R:1] matches exo-bonds as the pair
(exo atom, ring atom), so in every match tuple b[0] is the
exo atom and b[1] is the ring atom. Note that [!#0] excludes dummy
atoms, not ring atoms, so b[0] can itself be a ring atom when the
bond joins two ring systems. The ordering is important for
replace_ring() and replace_linker(), which use these indices to
identify which atoms belong to the scaffold and which to the side chains.
- lacan.replace.rxn1 = <rdkit.Chem.rdChemReactions.ChemicalReaction object>
Connect two single-bond dummy attachment points.
- lacan.replace.rxn2 = <rdkit.Chem.rdChemReactions.ChemicalReaction object>
Connect two double-bond dummy attachment points.
- lacan.replace.htodummy = <rdkit.Chem.rdChemReactions.ChemicalReaction object>
Convert one hydrogen to a single-bond dummy attachment point.
- lacan.replace.decompose1 = <rdkit.Chem.rdChemReactions.ChemicalReaction object>
Cut a single-bond ring-exo bond, yielding (exo-fragment, ring-fragment).
- lacan.replace.singledummy = <rdkit.Chem.rdchem.Mol object>
Matches a dummy atom attached by a single bond.
- lacan.replace.doubledummy = <rdkit.Chem.rdchem.Mol object>
Matches a dummy atom attached by a double bond.
- lacan.replace.load_corpus(min_count=200, path=None)
Load a fragment corpus CSV, with caching.
By default loads the built-in ChEMBL corpus from data/rls.csv. Pass path to load a custom corpus built with python -m lacan.decompose (e.g. from COCONUT or your own dataset).
The CSV is read only on the first call for a given (path, min_count) combination; subsequent calls return the cached list immediately.
- Parameters:
min_count (int) – Minimum occurrence count for a fragment to be included (default 200). Lower values allow rarer fragments into the replacement pool.
path (str or None) – Path to a corpus CSV produced by lacan.decompose. If None (default) the built-in ChEMBL corpus is used.
- Return type:
list of [smiles, count, degree, ftype, bonds]
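The caching behaviour described above can be sketched with a dictionary keyed on (path, min_count). This is an illustration only: load_corpus_sketch and _CACHE are hypothetical names, and the CSV layout (a header row with the columns smiles, count, degree, ftype, bonds) is an assumption inferred from the return type, not a documented file format.

```python
import csv
import os
import tempfile

_CACHE = {}  # maps (path, min_count) -> list of rows


def load_corpus_sketch(path, min_count=200):
    """Read a corpus CSV once per (path, min_count) and cache the result.

    Assumption: the CSV has a header row with the columns
    smiles, count, degree, ftype, bonds.
    """
    key = (path, min_count)
    if key in _CACHE:
        return _CACHE[key]
    rows = []
    with open(path, newline="") as handle:
        for rec in csv.DictReader(handle):
            count = int(rec["count"])
            if count >= min_count:
                rows.append([rec["smiles"], count, int(rec["degree"]),
                             rec["ftype"], rec["bonds"]])
    _CACHE[key] = rows
    return rows


# Build a tiny made-up corpus file to exercise the function.
tmp = tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="")
tmp.write("smiles,count,degree,ftype,bonds\n")
tmp.write("*C,500,1,Substituent,-\n")
tmp.write("*CC*,250,2,Linker,--\n")
tmp.write("*C(F)(F)F,50,1,Substituent,-\n")
tmp.close()

first = load_corpus_sketch(tmp.name)
second = load_corpus_sketch(tmp.name)    # same key, served from the cache
rare = load_corpus_sketch(tmp.name, 10)  # different key, file is read again
os.unlink(tmp.name)
```

Because the cache returns the same list object on repeated calls, callers that mutate the returned list would affect later calls; copying before modification avoids that.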
- lacan.replace.decorate_scaffold(mol, p, score_threshold, fragcorpus=None, n_replacements=100, mode='Hydrogen', replacements_per_mol=1, chance_of_linker=0.5, protect_smarts=None)
Add substituents to a scaffold by decorating free positions.
Two modes are supported:
- Hydrogen mode (default)
For each of the n_replacements attempts, one H-bearing atom is chosen at random and its hydrogen replaced with a dummy, then a single-attachment corpus fragment is joined at that position. replacements_per_mol controls how many H → fragment substitutions are made per attempt. Protected atoms are skipped when selecting the H position.
- Dummy mode
The scaffold must already contain * dummy atoms marking desired decoration sites (e.g. c1c(*)c(*)co1). Each dummy is filled with a corpus fragment. With probability chance_of_linker a two-attachment linker fragment is inserted before the terminal substituent, extending the chain length.
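The role of chance_of_linker in Dummy mode can be sketched as a per-site coin flip. The function below only plans which fragments would be chained at each dummy; it does not join anything, and it picks fragments uniformly rather than by corpus frequency. plan_dummy_fill and the fragment lists are hypothetical.

```python
import random


def plan_dummy_fill(n_sites, substituents, linkers, chance_of_linker=0.5, rng=None):
    """For each dummy site, decide which corpus fragments form the new chain.

    Returns one list per site: [substituent] or [linker, substituent].
    Fragment joining itself is not modelled here.
    """
    rng = rng or random.Random()
    plan = []
    for _ in range(n_sites):
        chain = []
        if rng.random() < chance_of_linker:
            chain.append(rng.choice(linkers))    # two-attachment fragment first
        chain.append(rng.choice(substituents))   # terminal one-attachment fragment
        plan.append(chain)
    return plan


subs = ["*C", "*O", "*F"]      # made-up single-attachment fragments
links = ["*C*", "*C(=O)N*"]    # made-up two-attachment fragments

never = plan_dummy_fill(2, subs, links, chance_of_linker=0.0, rng=random.Random(1))
always = plan_dummy_fill(2, subs, links, chance_of_linker=1.0, rng=random.Random(1))
```

Setting the probability to 0.0 yields bare substituents at every site, while 1.0 always places a linker between the scaffold and the terminal substituent.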
- Parameters:
mol – RDKit Mol (scaffold to decorate)
p – LACAN profile dict
score_threshold – minimum LACAN score required for output molecules; set 0.0 in the GA explore phase to accept all structures
fragcorpus – list of corpus entries (default: ChEMBL rls.csv)
n_replacements – number of decorated molecules to attempt generating
mode – "Hydrogen" or "Dummy"
replacements_per_mol – (Hydrogen mode only) H → fragment substitutions per attempt
chance_of_linker – (Dummy mode only) probability of prepending a linker fragment
protect_smarts – SMARTS string; atoms matching it are skipped when selecting decoration sites. None disables the check (default).
- Returns:
Deduplicated molecules that pass score_threshold.
- Return type:
list of RDKit Mol
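The output filtering (deduplication, then the score check) might look roughly like the following. It is a sketch under stated assumptions: key_fn stands in for an InChIKey generator and score_fn for LACAN scoring, the threshold comparison is assumed to be inclusive, and the order of the two steps in the real module is not documented. dedupe_and_filter and the toy tables are hypothetical.

```python
def dedupe_and_filter(candidates, key_fn, score_fn, score_threshold):
    """Keep the first candidate per key whose score is positive and >= threshold.

    key_fn stands in for an InChIKey generator, score_fn for LACAN scoring.
    Assumption: the threshold comparison is inclusive.
    """
    seen = set()
    kept = []
    for cand in candidates:
        key = key_fn(cand)
        if key in seen:
            continue  # duplicate structure, already handled
        seen.add(key)
        score = score_fn(cand)
        if score > 0 and score >= score_threshold:
            kept.append(cand)
    return kept


# Toy stand-ins: two spellings of one structure share a key; scores are made up.
toy_keys = {"CCO": "K1", "OCC": "K1", "CCN": "K2", "CC(C)C": "K3"}
toy_scores = {"CCO": 0.8, "OCC": 0.8, "CCN": 0.4, "CC(C)C": 0.0}

explore = dedupe_and_filter(["CCO", "OCC", "CCN", "CC(C)C"],
                            toy_keys.get, toy_scores.get, score_threshold=0.0)
strict = dedupe_and_filter(["CCO", "OCC", "CCN"],
                           toy_keys.get, toy_scores.get, score_threshold=0.5)
```

Even with a threshold of 0.0, a candidate scoring exactly 0 is dropped, which matches the module-level note that molecules scoring 0 are discarded.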
- lacan.replace.replace_substituent(mol, p, score_threshold, fragcorpus=None, n_replacements=100, conservative=True, protect_smarts=None)
Replace one non-ring substituent with a randomly sampled corpus fragment.
A substituent is defined here strictly: a fragment produced by cutting a single ring-exo bond that (a) has exactly one attachment point (one * dummy) and (b) contains no ring atoms from the original molecule. This prevents entire ring systems from being mistaken for substituents when the SMARTS fires on inter-ring bonds.
Algorithm
Run decompose1 ([!#0]!@;-[R]>>[!R]-[*].[R]-[*]) on the molecule to enumerate all possible ring-exo single-bond cuts.
For each cut, classify the two fragments: smaller = substituent, larger = scaffold.
Filter: discard any pair where the smaller fragment contains ring atoms from the original molecule (it is a ring system, not a substituent).
Also skip if the scaffold’s attachment atom is protected.
Randomly pick one valid substituent site for each of the n_replacements attempts, so all sites are sampled across repeated calls.
Draw a replacement from corpus substituents (degree 1, single bond), optionally similarity-weighted, and join it to the scaffold dummy.
- Parameters:
mol – RDKit Mol
p – LACAN profile dict
score_threshold – minimum LACAN score for output molecules
fragcorpus – fragment corpus
n_replacements – number of replacement attempts
conservative – if True, bias sampling toward structurally similar fragments
protect_smarts – SMARTS string; atoms matching it are excluded as substituent sites and scaffold attachment points. None = no exclusion.
- Return type:
list of RDKit Mol
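The classification and filtering steps of the algorithm above can be sketched on plain atom-index sets, leaving out RDKit entirely. Each cut is described by the two sets of original atom indices on either side of the broken bond. valid_substituent_sites and the hand-written toy molecule (two six-membered rings at atoms 1-6 and 7-12, a methyl on atom 1, a fluorine on atom 10) are hypothetical.

```python
import random


def valid_substituent_sites(cuts, ring_atoms, protected=frozenset()):
    """Filter ring-exo cuts down to true substituent sites.

    Each cut is (frag_a, frag_b, bond): frag_a and frag_b are sets of original
    atom indices on either side of the cut, bond is the (exo_atom, ring_atom)
    pair that was broken.
    """
    sites = []
    for frag_a, frag_b, bond in cuts:
        # Smaller side is treated as the substituent, larger as the scaffold.
        sub, scaffold = sorted((frag_a, frag_b), key=len)
        if sub & ring_atoms:
            continue  # the "substituent" is really a ring system
        attach = bond[0] if bond[0] in scaffold else bond[1]
        if attach in protected or sub & protected:
            continue  # protected attachment atom or protected substituent
        sites.append((sub, scaffold))
    return sites


ring_atoms = set(range(1, 13))
cuts = [
    ({0}, set(range(1, 14)), (0, 1)),                # methyl off the first ring
    (set(range(0, 7)), set(range(7, 14)), (4, 7)),   # inter-ring bond
    ({13}, set(range(0, 13)), (13, 10)),             # fluorine off the second ring
]

sites = valid_substituent_sites(cuts, ring_atoms)
protected_sites = valid_substituent_sites(cuts, ring_atoms, protected={1})

# One site is drawn independently for every attempt.
rng = random.Random(0)
chosen = [rng.choice(sites) for _ in range(10)]
```

The inter-ring cut is rejected because both sides contain ring atoms, and protecting atom 1 removes the methyl site while leaving the fluorine site available.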
- lacan.replace.replace_linker(mol, p, score_threshold, fragcorpus=None, n_replacements=100, conservative=True, protect_smarts=None)
Replace one non-ring linker connecting two ring systems.
A linker is a non-ring fragment with exactly two attachment points (degree 2) that bridges two ring systems. The decompose_molecule utility identifies them as SMARTS strings with two * dummies.
Algorithm
Decompose the molecule to get linker SMARTS strings.
For each linker SMARTS, find all substructure matches in the original mol.
For each match, identify the two exo-bonds using _breakbond: bonds whose b[0] (the exo/linker atom) appears in the linker match set are the cut points. Exactly two such bonds are required.
Collect all valid sites (linker SMARTS + two ring fragments after cutting).
Randomly pick one site from all valid sites so that molecules with multiple linkers sample them uniformly across calls.
Draw n_replacements replacement linkers from the corpus (degree 2, type “Linker”, bond string “--”) and stitch the two ring fragments back through each.
_breakbond convention: b[0] = non-ring (linker) atom, b[1] = ring atom. Only b[0] is checked against the linker match set to locate the two linker-to-ring attachment bonds.
- Parameters:
mol – RDKit Mol
p – LACAN profile dict
score_threshold – minimum LACAN score for output molecules
fragcorpus – fragment corpus
n_replacements – number of replacement linkers to try at the chosen site
conservative – if True, similarity-bias the linker sampling
protect_smarts – SMARTS string; linker atoms or bond-endpoint atoms matching it are skipped. None = no exclusion (default).
- Return type:
list of RDKit Mol, or [] if no replaceable linker is found.
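The b[0] check used to locate the two linker-to-ring bonds can be sketched on index pairs alone. The input mimics the (exo atom, ring atom) tuples described for _breakbond. linker_cut_bonds and the toy indices (ring atoms 0-5 and 8-13, linker atoms 6 and 7, a methyl at atom 14) are hypothetical.

```python
def linker_cut_bonds(breakbond_matches, linker_match):
    """Return the two (linker_atom, ring_atom) bonds that attach a linker.

    breakbond_matches: iterable of (b0, b1) index pairs, b0 = exo atom,
    b1 = ring atom, following the _breakbond convention.
    linker_match: atom indices covered by one linker substructure match.
    Returns None unless exactly two bonds qualify.
    """
    linker_atoms = set(linker_match)
    cuts = [(b0, b1) for b0, b1 in breakbond_matches if b0 in linker_atoms]
    return cuts if len(cuts) == 2 else None


# Exo-bond pairs for the toy molecule: two linker bonds plus one methyl bond.
matches = [(6, 5), (7, 8), (14, 10)]

site = linker_cut_bonds(matches, (6, 7))
# Same result if the match also covers ring atoms 5 and 8,
# because only b0 is tested against the match set.
site_with_ring_ends = linker_cut_bonds(matches, (5, 6, 7, 8))
not_a_linker = linker_cut_bonds(matches, (14,))
```

A terminal group such as the methyl yields only one qualifying bond, so it is rejected by the exactly-two requirement.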
- lacan.replace.replace_ring(mol, p, score_threshold, fragcorpus=None, n_replacements=100, conservative=True, protect_smarts=None)
Replace one ring system with a different ring drawn from the corpus.
Algorithm
Find fused ring systems via _get_ring_systems().
For each ring system walk bonds to find exo-bonds (leaving the system, not themselves in a ring).
Skip protected systems; randomly pick one.
Cut the exo-bonds with FragmentOnBonds. Use GetMolFrags() (without asMols) to get tuples of original atom indices for each fragment; these are stable and comparable to ring_system. The fragment whose non-dummy atoms are all in ring_system is the old ring; everything else is a side chain to reattach. Build side-chain Mol objects with GetMolFrags(asMols=True) in the same order.
Sample corpus rings; stitch side chains back.
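The first two steps (grouping rings into systems, then finding each system's exo-bonds) can be sketched on plain index data. This is not the module's _get_ring_systems(); it is a generic ring-merging routine that assumes rings sharing any atom belong to one system, so spiro junctions are merged too. ring_systems, exo_bonds, and the toy molecule are hypothetical.

```python
def ring_systems(atom_rings):
    """Merge rings that share at least one atom into ring systems.

    atom_rings is a sequence of atom-index tuples, one per ring.
    Assumption: sharing any atom counts as belonging to the same system.
    """
    systems = []
    for ring in atom_rings:
        merged = set(ring)
        rest = []
        for system in systems:
            if system & merged:
                merged |= system  # absorb every overlapping system
            else:
                rest.append(system)
        systems = rest + [merged]
    return systems


def exo_bonds(system, bonds):
    """Non-ring bonds with exactly one end inside the system.

    bonds is a list of (atom_a, atom_b, is_ring_bond) tuples.
    Each result is ordered (inside_atom, outside_atom).
    """
    found = []
    for a, b, is_ring_bond in bonds:
        if is_ring_bond:
            continue
        if a in system and b not in system:
            found.append((a, b))
        elif b in system and a not in system:
            found.append((b, a))
    return found


# Toy molecule: two fused rings, a one-atom bridge (atom 10), and a third ring.
rings = [(0, 1, 2, 3, 4, 5), (4, 5, 6, 7, 8, 9), (11, 12, 13, 14, 15, 16)]
bonds = [(0, 1, True), (9, 10, False), (10, 11, False)]

systems = ring_systems(rings)
exo_first = exo_bonds(systems[0], bonds)
exo_second = exo_bonds(systems[1], bonds)
```

The two fused rings collapse into a single ten-atom system, and each system reports only the bond that leaves it toward the bridging atom.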
- Parameters:
mol – RDKit Mol
p – LACAN profile dict
score_threshold – minimum LACAN score for output molecules
fragcorpus – fragment corpus (default: ChEMBL rls.csv)
n_replacements – number of replacement rings to sample
conservative – if True, bias sampling toward structurally similar rings
protect_smarts – SMARTS string; ring systems containing any matching atom are skipped entirely. None = no exclusion (default).
- Return type:
list of RDKit Mol
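The fragment bookkeeping in the cutting step can be sketched on the index tuples alone. The input mimics what GetMolFrags() returns without asMols after the exo-bonds are cut. The sketch assumes that the dummy atoms added by the cut carry indices at or above the original atom count, which is how it tells them apart from real atoms. split_ring_and_side_chains and the toy data are hypothetical.

```python
def split_ring_and_side_chains(frag_atom_tuples, ring_system, n_original_atoms):
    """Locate the old ring among fragments given as atom-index tuples.

    Assumption: dummy atoms added by the cut have indices >= n_original_atoms.
    Returns (ring_index, side_chain_indices), both indexing frag_atom_tuples.
    """
    ring_index = None
    side_chains = []
    for i, atoms in enumerate(frag_atom_tuples):
        real = [a for a in atoms if a < n_original_atoms]  # drop dummies
        if real and all(a in ring_system for a in real):
            ring_index = i
        else:
            side_chains.append(i)
    return ring_index, side_chains


# Toy case: 8 original atoms (0 = methyl, 1-6 = ring, 7 = hydroxyl oxygen).
# Two exo-bonds were cut, so four dummies (indices 8-11) were appended.
ring_system = {1, 2, 3, 4, 5, 6}
frags = ((0, 8), (1, 2, 3, 4, 5, 6, 9, 10), (7, 11))

ring_index, side_chains = split_ring_and_side_chains(frags, ring_system, 8)
```

Because the index tuples and the Mol objects from GetMolFrags(asMols=True) come back in the same order, the returned positions can be used to pick the matching side-chain Mol objects.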