lacan.replace

replace.py — Coarse-grained fragment replacement operations for LACAN.

This module provides four scaffold-editing operations that swap entire fragments (rings, substituents, linkers) for alternatives drawn from the fragment corpus (rls.csv). Together they form the EXPLORE step of the adaptive GA in gen.py.

All four operations share these behaviours:

  • Random site selection — when a molecule has multiple rings, linkers, or substituents that could be operated on, one is chosen at random each call. This ensures diverse output across repeated calls and prevents the same site from being targeted every time.

  • Conservative sampling (conservative=True, the default) — replacement fragments are drawn with probability proportional to both their corpus frequency and their structural similarity to the fragment being replaced (Morgan fp Tanimoto, radius 2, including dummy atoms). Set conservative=False for purely frequency-weighted sampling.

  • Protection — pass protect_smarts (a SMARTS string) to any function to skip atoms matching the pattern. The exclusion is re-derived from the molecule on every call, so it survives SMILES round-trips transparently.

  • Deduplication — output molecules are deduplicated by InChIKey and scored by the LACAN profile; molecules scoring 0 are discarded.
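The conservative sampling rule above can be sketched in plain Python. This is an illustration, not the code in replace.py: it assumes fingerprints are held as sets of on-bit indices and that the weight is simply count × Tanimoto similarity (the exact weighting formula is not given here). The names tanimoto, sample_fragments, and toy_corpus are hypothetical.

```python
import random

def tanimoto(a, b):
    """Tanimoto similarity between two sets of fingerprint on-bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def sample_fragments(corpus, query_bits, k, conservative=True, rng=random):
    """Draw k fragment SMILES from corpus, with replacement.

    corpus: list of (smiles, count, bits) tuples (hypothetical layout).
    Assumption: the conservative weight is count * similarity.
    """
    if conservative:
        weights = [count * tanimoto(bits, query_bits) for _, count, bits in corpus]
        if sum(weights) == 0:
            # Nothing resembles the query: fall back to frequency only.
            weights = [count for _, count, _ in corpus]
    else:
        weights = [count for _, count, _ in corpus]
    return [frag[0] for frag in rng.choices(corpus, weights=weights, k=k)]

# Toy corpus: the third fragment shares no bits with the query,
# so its conservative weight is zero and it is never drawn.
toy_corpus = [
    ("*C", 900, {1, 2}),
    ("*OC", 500, {1, 2, 3}),
    ("*C(F)(F)F", 300, {7, 8, 9}),
]
picks = sample_fragments(toy_corpus, query_bits={1, 2, 3}, k=5, rng=random.Random(0))
```

With conservative=False the same call reduces to frequency-weighted sampling, so the dissimilar fragment becomes eligible again.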

Reaction conventions

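The _breakbond SMARTS [!#0:0]!@[R:1] matches exo-bonds as the pair (exo atom, ring atom): the first atom is any non-dummy atom joined to a ring atom by a non-ring bond, so in every match tuple b[0] is the exo atom and b[1] is the ring atom. Because the first atom is only required to be non-dummy, b[0] can itself be a ring atom when the bond joins two ring systems. The ordering matters for replace_ring() and replace_linker(), which use these indices to tell scaffold atoms from side-chain atoms.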

lacan.replace.rxn1 = <rdkit.Chem.rdChemReactions.ChemicalReaction object>

Connect two single-bond dummy attachment points.

lacan.replace.rxn2 = <rdkit.Chem.rdChemReactions.ChemicalReaction object>

Connect two double-bond dummy attachment points.

lacan.replace.htodummy = <rdkit.Chem.rdChemReactions.ChemicalReaction object>

Convert one hydrogen to a single-bond dummy attachment point.

lacan.replace.decompose1 = <rdkit.Chem.rdChemReactions.ChemicalReaction object>

Cut a single-bond ring-exo bond, yielding (exo-fragment, ring-fragment).

lacan.replace.singledummy = <rdkit.Chem.rdchem.Mol object>

Matches a dummy atom attached by a single bond.

lacan.replace.doubledummy = <rdkit.Chem.rdchem.Mol object>

Matches a dummy atom attached by a double bond.

lacan.replace.load_corpus(min_count=200, path=None)[source]

Load a fragment corpus CSV, with caching.

By default loads the built-in ChEMBL corpus from data/rls.csv. Pass path to load a custom corpus built with python -m lacan.decompose (e.g. from COCONUT or your own dataset).

The CSV is read only on the first call for a given (path, min_count) combination; subsequent calls return the cached list immediately.

Parameters:
  • min_count (int) – Minimum occurrence count for a fragment to be included (default 200). Lower values allow rarer fragments into the replacement pool.

  • path (str or None) – Path to a corpus CSV produced by lacan.decompose. If None (default) the built-in ChEMBL corpus is used.

Return type:

list of [smiles, count, degree, ftype, bonds]

lacan.replace.decorate_scaffold(mol, p, score_threshold, fragcorpus=None, n_replacements=100, mode='Hydrogen', replacements_per_mol=1, chance_of_linker=0.5, protect_smarts=None)[source]

Add substituents to a scaffold by decorating free positions.

Two modes are supported:

Hydrogen mode (default)

For each of the n_replacements attempts, one H-bearing atom is chosen at random and its hydrogen replaced with a dummy, then a single-attachment corpus fragment is joined at that position. replacements_per_mol controls how many H → fragment substitutions are made per attempt. Protected atoms are skipped when selecting the H position.

Dummy mode

The scaffold must already contain * dummy atoms marking desired decoration sites (e.g. c1c(*)c(*)co1). Each dummy is filled with a corpus fragment. With probability chance_of_linker a two-attachment linker fragment is inserted before the terminal substituent, extending the chain length.

Parameters:
  • mol (RDKit Mol (scaffold to decorate))

  • p (LACAN profile dict)

  • score_threshold (minimum LACAN score required for output molecules; set 0.0 in the GA explore phase to accept all structures)

  • fragcorpus (list of corpus entries (default: ChEMBL rls.csv))

  • n_replacements (number of decorated molecules to attempt generating)

  • mode ("Hydrogen" or "Dummy")

  • replacements_per_mol ((Hydrogen mode only) H → fragment substitutions per attempt)

  • chance_of_linker ((Dummy mode only) probability of prepending a linker fragment)

  • protect_smarts (SMARTS string; atoms matching it are skipped when selecting decoration sites. None disables the check (default).)

Returns:

Deduplicated molecules that pass score_threshold.

Return type:

list of RDKit Mol
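The deduplicate-then-threshold step behind this return value can be sketched generically. The sketch assumes that a molecule is kept when its score is above zero and at least score_threshold; the exact comparison used by the module is not stated. key_fn stands in for an InChIKey generator, score_fn for the LACAN scorer, and the function name filter_candidates is hypothetical.

```python
def filter_candidates(candidates, key_fn, score_fn, score_threshold):
    """Keep the first candidate per key whose score passes the threshold.

    Assumption: scores of exactly 0 are always discarded, and the
    threshold comparison is inclusive (score >= score_threshold).
    """
    seen = set()
    kept = []
    for cand in candidates:
        key = key_fn(cand)
        if key in seen:
            continue  # duplicate structure, already handled
        seen.add(key)
        score = score_fn(cand)
        if score > 0 and score >= score_threshold:
            kept.append(cand)
    return kept

# Toy stand-ins: strings instead of Mol objects, a sorted-character
# key instead of an InChIKey, and a lookup table instead of the scorer.
toy_scores = {"CCO": 0.8, "OCC": 0.8, "CC": 0.0, "c1ccccc1": 0.5}
toy_key = lambda s: "".join(sorted(s))

accept_all = filter_candidates(toy_scores, toy_key, toy_scores.get, 0.0)
strict = filter_candidates(toy_scores, toy_key, toy_scores.get, 0.6)
```

Even with a threshold of 0.0, the zero-scoring candidate is dropped, which matches the shared deduplication behaviour described at the top of the module.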

lacan.replace.replace_substituent(mol, p, score_threshold, fragcorpus=None, n_replacements=100, conservative=True, protect_smarts=None)[source]

Replace one non-ring substituent with a randomly sampled corpus fragment.

A substituent is defined here strictly: a fragment produced by cutting a single ring-exo bond that (a) has exactly one attachment point (one * dummy) and (b) contains no ring atoms from the original molecule. This prevents entire ring systems from being mistaken for substituents when the SMARTS fires on inter-ring bonds.

Algorithm

  1. Run decompose1 ([!#0]!@;-[R]>>[!R]-[*].[R]-[*]) on the molecule to enumerate all possible ring-exo single-bond cuts.

  2. For each cut, classify the two fragments: smaller = substituent, larger = scaffold.

  3. Filter: discard any pair where the smaller fragment contains ring atoms from the original molecule (it is a ring system, not a substituent).

  4. Also skip if the scaffold’s attachment atom is protected.

  5. Randomly pick one valid substituent site for each of the n_replacements attempts, so all sites are sampled across repeated calls.

  6. Draw a replacement from corpus substituents (degree 1, single bond), optionally similarity-weighted, and join it to the scaffold dummy.
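Steps 2 to 5 can be illustrated on plain atom-index sets, without RDKit. In this sketch each cut is given as two (fragment atoms, bond endpoint) pairs; that layout, the tie-breaking rule for equal-sized fragments, and the name valid_substituent_sites are assumptions made for the example.

```python
import random

def valid_substituent_sites(cuts, ring_atoms, protected=frozenset()):
    """Return (substituent_atoms, scaffold_attachment_atom) for each usable cut.

    cuts: list of ((frag_a, atom_a), (frag_b, atom_b)), where frag_* are
    frozensets of original atom indices and atom_* is the bond endpoint
    that lies in that fragment (hypothetical layout).
    """
    sites = []
    for (frag_a, atom_a), (frag_b, atom_b) in cuts:
        # Step 2: smaller fragment = substituent, larger = scaffold.
        if len(frag_a) <= len(frag_b):
            sub, scaffold_atom = frag_a, atom_b
        else:
            sub, scaffold_atom = frag_b, atom_a
        # Step 3: a "substituent" containing ring atoms is a ring system.
        if sub & ring_atoms:
            continue
        # Step 4: skip protected scaffold attachment atoms.
        if scaffold_atom in protected:
            continue
        sites.append((sub, scaffold_atom))
    return sites

# Toy molecule: 4-methylbiphenyl, Cc1ccc(cc1)-c1ccccc1.
# Atom 0 is the methyl carbon, 1-6 the first ring, 7-12 the second.
ring_atoms = frozenset(range(1, 13))
cuts = [
    ((frozenset({0}), 0), (frozenset(range(1, 13)), 1)),         # methyl-ring bond
    ((frozenset(range(0, 7)), 4), (frozenset(range(7, 13)), 7)),  # inter-ring bond
]
sites = valid_substituent_sites(cuts, ring_atoms)
blocked = valid_substituent_sites(cuts, ring_atoms, protected={1})

# Step 5: one independent random site per attempt.
rng = random.Random(0)
chosen = [rng.choice(sites) for _ in range(3)]
```

Only the methyl group survives the filter: the inter-ring cut is rejected because its smaller half is the second phenyl ring.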

Parameters:
  • mol (RDKit Mol)

  • p (LACAN profile dict)

  • score_threshold (minimum LACAN score for output molecules)

  • fragcorpus (fragment corpus)

  • n_replacements (number of replacement attempts)

  • conservative (if True, bias sampling toward structurally similar fragments)

  • protect_smarts (SMARTS string; atoms matching it are excluded as substituent sites and scaffold attachment points. None = no exclusion.)

Return type:

list of RDKit Mol

lacan.replace.replace_linker(mol, p, score_threshold, fragcorpus=None, n_replacements=100, conservative=True, protect_smarts=None)[source]

Replace one non-ring linker connecting two ring systems.

A linker is a non-ring fragment with exactly two attachment points (degree 2) that bridges two ring systems. The decompose_molecule utility identifies them as SMARTS strings with two * dummies.

Algorithm

  1. Decompose the molecule to get linker SMARTS strings.

  2. For each linker SMARTS, find all substructure matches in the original mol.

  3. For each match, identify the two exo-bonds using _breakbond: the _breakbond matches whose b[0] (the exo/linker atom) appears in the linker match set are the cut points. Exactly two such bonds are required.

  4. Collect all valid sites (linker SMARTS + two ring fragments after cutting).

  5. Randomly pick one site from all valid sites so that molecules with multiple linkers sample them uniformly across calls.

  6. Draw n_replacements replacement linkers from the corpus (degree 2, type “Linker”, bond string “--”) and stitch the two ring fragments back through each.

_breakbond convention: b[0] = non-ring (linker) atom, b[1] = ring atom. Only b[0] is checked against the linker match set to locate the two linker-to-ring attachment bonds.
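The b[0] check in step 3 amounts to a small filter over the _breakbond match tuples. The sketch below works on plain index tuples and assumes the linker match set contains only the linker's own atoms; linker_cut_bonds is a hypothetical name.

```python
def linker_cut_bonds(breakbond_matches, linker_match):
    """Return the two (linker_atom, ring_atom) bonds bounding a linker.

    breakbond_matches: iterable of (b0, b1) tuples with b0 = exo atom and
    b1 = ring atom. Returns None unless exactly two bonds start inside
    the linker match set.
    """
    linker_atoms = set(linker_match)
    cuts = [b for b in breakbond_matches if b[0] in linker_atoms]
    return cuts if len(cuts) == 2 else None

# Toy molecule: bibenzyl with one extra methyl group.
# Atoms 0-5 form ring A, 6-7 the CH2CH2 linker, 8-13 ring B,
# and atom 14 is a methyl carbon attached to ring atom 10.
matches = [(6, 5), (7, 8), (14, 10)]

linker_site = linker_cut_bonds(matches, linker_match=(6, 7))
not_a_linker = linker_cut_bonds(matches, linker_match=(14,))
```

The methyl group yields only one exo-bond, so it fails the exactly-two requirement and is not treated as a linker site.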

Parameters:
  • mol (RDKit Mol)

  • p (LACAN profile dict)

  • score_threshold (minimum LACAN score for output molecules)

  • fragcorpus (fragment corpus)

  • n_replacements (number of replacement linkers to try at the chosen site)

  • conservative (if True, similarity-bias the linker sampling)

  • protect_smarts (SMARTS string; linker atoms or bond-endpoint atoms matching it are skipped. None = no exclusion (default).)

Return type:

list of RDKit Mol, or [] if no replaceable linker is found.

lacan.replace.replace_ring(mol, p, score_threshold, fragcorpus=None, n_replacements=100, conservative=True, protect_smarts=None)[source]

Replace one ring system with a different ring drawn from the corpus.

Algorithm

  1. Find fused ring systems via _get_ring_systems().

  2. For each ring system walk bonds to find exo-bonds (leaving the system, not themselves in a ring).

  3. Skip protected systems; randomly pick one.

  4. Cut the exo-bonds with FragmentOnBonds. Use GetMolFrags() (without asMols) to get tuples of original atom indices for each fragment — these are stable and comparable to ring_system. The fragment whose non-dummy atoms are all in ring_system is the old ring; everything else is a side chain to reattach. Build side-chain Mol objects with GetMolFrags(asMols=True) in the same order.

  5. Sample corpus rings; stitch side chains back.
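The fragment classification in step 4 can be shown on index tuples alone. This sketch assumes that the dummy atoms added when cutting receive indices at or above the original atom count, so they can be told apart from real atoms by index; split_ring_and_side_chains is a hypothetical name.

```python
def split_ring_and_side_chains(fragments, ring_system, n_original_atoms):
    """Locate the old ring among fragment index tuples.

    fragments: tuples of atom indices, one per fragment, in the order the
    fragments are reported. Indices >= n_original_atoms are treated as
    dummies added by the cut (assumption).
    Returns (position_of_old_ring, positions_of_side_chains).
    """
    ring_system = set(ring_system)
    old_ring = None
    side_chains = []
    for pos, frag in enumerate(fragments):
        real_atoms = [i for i in frag if i < n_original_atoms]
        if real_atoms and all(i in ring_system for i in real_atoms):
            old_ring = pos
        else:
            side_chains.append(pos)
    return old_ring, side_chains

# Toy molecule: a benzene ring (atoms 1-6) carrying a methyl carbon
# (atom 0) and a hydroxyl oxygen (atom 7); cutting both exo-bonds
# adds four dummies with indices 8-11.
fragments = [(0, 8), (1, 2, 3, 4, 5, 6, 9, 10), (7, 11)]
old_ring, side_chains = split_ring_and_side_chains(
    fragments, ring_system={1, 2, 3, 4, 5, 6}, n_original_atoms=8
)
```

Returning positions rather than atoms lets the caller index into the matching list of fragment Mol objects, which are produced in the same order.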

Parameters:
  • mol (RDKit Mol)

  • p (LACAN profile dict)

  • score_threshold (minimum LACAN score for output molecules)

  • fragcorpus (fragment corpus (default: ChEMBL rls.csv))

  • n_replacements (number of replacement rings to sample)

  • conservative (if True, bias sampling toward structurally similar rings)

  • protect_smarts (SMARTS string; ring systems containing any matching atom are skipped entirely. None = no exclusion (default).)

Return type:

list of RDKit Mol