lacan.decompose
decompose.py — Molecule fragmentation into rings, linkers, and substituents.
This module breaks drug-like molecules into three canonical fragment types and provides utilities for building and saving fragment corpora.
Fragment taxonomy
After exhaustive recursive cutting at all non-ring exo-bonds, each resulting
fragment is classified by the number and position of its dummy atoms (*):
- Ring
Fragment contains at least one ring atom adjacent to a dummy (matched by
ringdummy). Examples: [*]c1ccccc1, [*]c1ccncc1[*].
- Linker
Fragment has no ring atoms adjacent to any dummy, but has two or more non-ring dummy neighbours (nonringdummy matches twice or more). Examples: [*]CC[*], [*]C(=O)[*].
- Substituent
Fragment has exactly one non-ring dummy neighbour, i.e. it is a terminal group. Examples:
[*]C (methyl), [*]C(F)(F)F (trifluoromethyl), [*]O (hydroxyl).
Single-atom molecules, molecules that fail sanitisation, and molecules that do not decompose further (no non-ring exo-bonds) are silently ignored.
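The classification rules above can be sketched with plain RDKit atom queries. This is a minimal illustration rather than the module's own code: classify_fragment is a hypothetical helper, and it inspects the neighbours of each dummy atom directly instead of using the ringdummy and nonringdummy patterns, whose SMARTS are not reproduced on this page.

```python
from rdkit import Chem


def classify_fragment(smi):
    """Return "Ring", "Linker", or "Sub" for a fragment SMILES, else None.

    Hypothetical helper: mirrors the taxonomy text, not the library source.
    """
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        return None
    # Collect the neighbour of every dummy atom (atomic number 0).
    neighbours = [
        nbr
        for atom in mol.GetAtoms()
        if atom.GetAtomicNum() == 0
        for nbr in atom.GetNeighbors()
    ]
    if not neighbours:
        return None  # no attachment point, so not a fragment
    if any(nbr.IsInRing() for nbr in neighbours):
        return "Ring"
    # Only non-ring dummy neighbours remain: two or more means a linker.
    return "Linker" if len(neighbours) >= 2 else "Sub"
```

The ring test comes first because a single ring-adjacent dummy is enough to make a fragment a Ring, regardless of how many other dummies it carries.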
Reaction SMARTS
decompose1 cuts single-bond ring-exo bonds: [!#0]!@;-[R] >> [!R]-[*].[R]-[*]
decompose2 cuts double-bond ring-exo bonds: [!#0]!@;=[R] >> [!R]=[*].[R]=[*]
Note that the !@ (not-in-ring) flag ensures that only exo-bonds are cut, not ring
bonds themselves. The !#0 guard excludes dummy atoms, so attachment points
created by an earlier cut are not cut again.
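To see which bonds the single-bond pattern selects, the sketch below applies the reactant side of decompose1 as a substructure query and cuts the matched bonds with RDKit's Chem.FragmentOnBonds. This is a swapped-in technique chosen for illustration, not the reaction-based recursion the module describes, and cut_exo_single_bonds is a hypothetical name. Double-bond exo cuts (decompose2) are not covered.

```python
from rdkit import Chem

# Reactant side of decompose1: non-dummy atom, non-ring single bond, ring atom.
EXO_SINGLE = Chem.MolFromSmarts("[!#0]!@;-[R]")


def cut_exo_single_bonds(smi):
    """Cut every non-ring single bond to a ring atom; return fragment SMILES."""
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        return []
    # A set guards against the same bond being reported more than once.
    bond_ids = sorted({
        mol.GetBondBetweenAtoms(a, b).GetIdx()
        for a, b in mol.GetSubstructMatches(EXO_SINGLE)
    })
    if not bond_ids:
        return [Chem.MolToSmiles(mol)]  # nothing to cut
    # dummyLabels=(0, 0) yields plain, unlabelled * atoms at both cut ends.
    cut = Chem.FragmentOnBonds(mol, bond_ids, dummyLabels=[(0, 0)] * len(bond_ids))
    return [Chem.MolToSmiles(frag) for frag in Chem.GetMolFrags(cut, asMols=True)]
```

For toluene (Cc1ccccc1) this produces two fragments, a methyl and a phenyl, each carrying one dummy atom; benzene has no exo-bond and is returned unchanged.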
Corpus format
Each corpus entry is a list:
[smiles, count, degree, ftype, bonds]
where smiles is the canonical fragment SMILES (with * dummies),
count is its occurrence frequency, degree is the number of attachment
points, ftype is "Ring", "Linker", or "Sub", and bonds is
a string like "-", "--", or "=-" encoding the bond types at each
dummy in order of dummy atom index.
- lacan.decompose.decompose1 = <rdkit.Chem.rdChemReactions.ChemicalReaction object>
Cut a single-bond ring-exo bond.
Yields two fragments: the non-ring part with a single-bond dummy, and the ring part with a single-bond dummy. The !#0 guard prevents re-cutting dummies from a previous iteration. Modified from the original to also decompose biphenyls and other directly linked ring systems.
- lacan.decompose.decompose2 = <rdkit.Chem.rdChemReactions.ChemicalReaction object>
Cut a double-bond ring-exo bond (e.g. a carbonyl attached to a ring).
- lacan.decompose.ringdummy = <rdkit.Chem.rdchem.Mol object>
Matches a dummy atom adjacent to a ring atom — identifies Ring fragments.
- lacan.decompose.nonringdummy = <rdkit.Chem.rdchem.Mol object>
Matches a dummy atom adjacent to a non-ring atom — identifies Linker/Sub fragments.
- lacan.decompose.singledummy = <rdkit.Chem.rdchem.Mol object>
Matches a single-bond dummy attachment point.
- lacan.decompose.doubledummy = <rdkit.Chem.rdchem.Mol object>
Matches a double-bond dummy attachment point.
- lacan.decompose.get_bonds_string(smi)[source]
Encode the bond types at all dummy attachment points as a string.
The string contains one character per dummy atom, sorted by dummy atom index:
"-" for single bonds and "=" for double bonds. This is stored in the corpus so that fragment matching can filter by compatible bond types.
Examples:
"[*]CC" → "-", "[*]CC[*]" → "--", "[*]C(=O)[*]" → "--" (both dummies are singly bonded; the carbonyl double bond does not involve a dummy).
- Parameters:
smi (str) – fragment SMILES containing at least one * dummy atom
- Returns:
Bond-type string, one character per dummy, sorted by dummy atom index.
- Return type:
str
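The encoding described for get_bonds_string() can be approximated as follows. This is a sketch under stated assumptions, not the library source: bonds_string is a hypothetical stand-in, it walks atoms in index order, and it treats every non-double bond at a dummy as single.

```python
from rdkit import Chem


def bonds_string(smi):
    """Encode the bond type at each dummy atom, in dummy atom index order."""
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        raise ValueError(f"could not parse fragment SMILES: {smi!r}")
    chars = []
    for atom in mol.GetAtoms():  # GetAtoms() iterates in atom index order
        if atom.GetAtomicNum() != 0:
            continue  # only dummy atoms mark attachment points
        for bond in atom.GetBonds():
            is_double = bond.GetBondType() == Chem.BondType.DOUBLE
            chars.append("=" if is_double else "-")
    return "".join(chars)
```

Only bonds that touch a dummy atom contribute a character, so a double bond elsewhere in the fragment (such as a carbonyl) leaves the string unchanged.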
- lacan.decompose.decompose_molecule(smi, asSmi=True)[source]
Recursively decompose a molecule into rings, linkers, and substituents.
The algorithm repeatedly applies decompose1 and decompose2 to cut all non-ring exo-bonds, one at a time, until no further cuts are possible. The resulting terminal fragments are then classified into rings, linkers, and substituents based on their dummy-atom topology (see module docstring).
Stereochemistry is removed before decomposition so that stereoisomers map to the same fragment SMILES.
Single-fragment decompositions (molecules with no exo-bonds, e.g. bare rings) are discarded because they yield no useful corpus entries.
- Parameters:
smi (str or RDKit Mol) – the molecule to decompose
asSmi (bool) – if True (default), treat smi as a SMILES string and parse it; if False, treat it directly as an RDKit Mol object
- Returns:
(rings, linkers, subs)
- Return type:
three lists of canonical SMILES strings
- lacan.decompose.get_all_decompositions(smis, n_jobs=-1)[source]
Decompose a large list of SMILES strings in parallel.
Distributes decompose_molecule() across a multiprocessing pool using imap_unordered for memory-efficient streaming (important for millions of ChEMBL molecules).
- Parameters:
smis (iterable of SMILES strings)
n_jobs (int) – number of worker processes; -1 (default) uses all CPU cores
- Returns:
(all_rings, all_linkers, all_subs) – unsorted, may contain duplicates; use save_frags() to count and filter
- Return type:
three flat lists of SMILES strings
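The streaming pattern described above might look roughly like this. It is a sketch only: decompose_all and flatten_results are hypothetical names, the chunksize value is an arbitrary choice, and the worker is assumed to return a (rings, linkers, subs) triple for every input, which this page does not confirm for molecules that fail to parse.

```python
from multiprocessing import Pool, cpu_count


def flatten_results(triples):
    """Merge an iterable of (rings, linkers, subs) triples into three flat lists."""
    all_rings, all_linkers, all_subs = [], [], []
    for rings, linkers, subs in triples:
        all_rings.extend(rings)
        all_linkers.extend(linkers)
        all_subs.extend(subs)
    return all_rings, all_linkers, all_subs


def decompose_all(smis, worker, n_jobs=-1, chunksize=1000):
    """Stream `worker` over `smis` with imap_unordered and flatten the output.

    `worker` must be a picklable, module-level function that returns a
    (rings, linkers, subs) triple (assumption, see the text above).
    """
    processes = cpu_count() if n_jobs == -1 else n_jobs
    with Pool(processes=processes) as pool:
        # Results are consumed lazily, so only a few chunks are in flight.
        return flatten_results(pool.imap_unordered(worker, smis, chunksize=chunksize))
```

Because imap_unordered yields results as workers finish, the output order does not follow the input order, which matches the "unsorted" note in the return description. Call decompose_all from under an `if __name__ == "__main__":` guard on platforms that spawn worker processes.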
- lacan.decompose.save_frags(r, l, s, output_path, minN=10)[source]
Count fragment occurrences and save a filtered corpus CSV.
Only fragments appearing more than minN times are written. The output CSV has columns smiles, occurrence, degree, type, bonds and is sorted by occurrence descending. The resulting file can be loaded with replace.load_corpus(path="my_corpus.csv") as a drop-in replacement for the built-in ChEMBL corpus.
- Parameters:
r (list of ring SMILES) – from get_all_decompositions()
l (list of linker SMILES)
s (list of substituent SMILES)
output_path (str) – path to write the CSV
minN (int) – minimum occurrence count to include (default 10)
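The count-filter-sort step can be sketched with the standard library alone. build_rows and write_corpus are hypothetical helpers, not the save_frags() implementation. The sketch makes three assumptions: the filter is the strict "more than minN" reading, degree is taken as the number of * characters in the SMILES, and the bonds column is filled by a caller-supplied function (left empty if none is given).

```python
import csv
from collections import Counter


def build_rows(rings, linkers, subs, min_n=10, bonds_fn=None):
    """Count fragments and return [smiles, count, degree, ftype, bonds] rows."""
    rows = []
    for ftype, frags in (("Ring", rings), ("Linker", linkers), ("Sub", subs)):
        for smi, n in Counter(frags).items():
            if n > min_n:  # assumption: strictly more than min_n
                bonds = bonds_fn(smi) if bonds_fn else ""
                # assumption: one "*" per attachment point
                rows.append([smi, n, smi.count("*"), ftype, bonds])
    rows.sort(key=lambda row: row[1], reverse=True)  # most frequent first
    return rows


def write_corpus(rows, output_path):
    """Write rows to a CSV with the column names listed in the text above."""
    with open(output_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["smiles", "occurrence", "degree", "type", "bonds"])
        writer.writerows(rows)
```

Counting each fragment type separately keeps the type label attached to every row without a second lookup.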
- lacan.decompose.get_corpus(mols)[source]
Build a fragment corpus from a list of RDKit Mol objects.
Decomposes each molecule and counts all fragment occurrences. Unlike save_frags(), this function returns the corpus as a list in memory (no minimum count filter, no file I/O) so it can be passed directly to get_corpus_from_mols() for corpus biasing.
- Parameters:
mols (iterable of RDKit Mol objects) – None entries are silently skipped
- Returns:
Sorted by count descending. Contains all fragment types with count ≥ 1.
- Return type:
list of [smiles, count, degree, ftype, bonds]