lacan.decompose

decompose.py — Molecule fragmentation into rings, linkers, and substituents.

This module breaks drug-like molecules into three canonical fragment types and provides utilities for building and saving fragment corpora.

Fragment taxonomy

After exhaustive recursive cutting of every ring-exo bond (an acyclic single or double bond with a ring atom at one or both ends), each resulting fragment is classified by the number and position of its dummy atoms (*):

Ring: Fragment contains at least one ring atom adjacent to a dummy (matched by ringdummy). Examples: [*]c1ccccc1, [*]c1ccncc1[*].
Linker: Fragment has no ring atoms adjacent to any dummy, but has two or more non-ring dummy neighbours (matched by nonringdummy twice or more). Examples: [*]CC[*], [*]C(=O)[*].
Substituent: Fragment has exactly one dummy, attached to a non-ring atom; it is a terminal group. Examples: [*]C (methyl), [*]C(F)(F)F (trifluoromethyl), [*]O (hydroxyl).
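The precedence of the three rules above can be written as a small decision function. This is a minimal sketch, not the module's code: `classify_fragment` is a hypothetical name, and it assumes the two match counts have already been obtained, for example by substructure matching with ringdummy and nonringdummy.

```python
from typing import Optional


def classify_fragment(n_ring_dummies: int, n_nonring_dummies: int) -> Optional[str]:
    """Classify a fragment from its dummy-atom match counts (hypothetical helper).

    n_ring_dummies:    dummies attached to a ring atom
    n_nonring_dummies: dummies attached to a non-ring atom
    """
    if n_ring_dummies >= 1:
        return "Ring"      # any ring attachment makes it a Ring fragment
    if n_nonring_dummies >= 2:
        return "Linker"    # two or more chain attachments
    if n_nonring_dummies == 1:
        return "Sub"       # a single chain attachment: terminal group
    return None            # no dummies: the fragment was never cut


print(classify_fragment(1, 0))  # Ring    e.g. [*]c1ccccc1
print(classify_fragment(0, 2))  # Linker  e.g. [*]CC[*]
print(classify_fragment(0, 1))  # Sub     e.g. [*]C
```

The ring test comes first because the Ring rule only requires at least one ring-attached dummy, regardless of any other dummies on the fragment.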

Single-atom molecules, molecules that fail sanitisation, and molecules that cannot be decomposed (no ring-exo bonds to cut) are silently ignored.

Reaction SMARTS

decompose1 cuts single-bond ring-exo bonds: [!#0]!@;-[R] >> [!R]-[*].[R]-[*]
decompose2 cuts double-bond ring-exo bonds: [!#0]!@;=[R] >> [!R]=[*].[R]=[*]

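The !@ (not-a-ring-bond) primitive restricts the match to bonds that lie outside any ring, so only exo-bonds are cut and the rings themselves stay intact. The !#0 guard excludes dummy atoms, so a bond that already ends in a dummy from an earlier cut is not cut again.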

Corpus format

Each corpus entry is a list:

[smiles, count, degree, ftype, bonds]

where smiles is the canonical fragment SMILES (with * dummies), count is its occurrence frequency, degree is the number of attachment points, ftype is "Ring", "Linker", or "Sub", and bonds is a string like "-", "--", or "=-" encoding the bond types at each dummy in order of dummy atom index.
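As an illustration of this layout, the sketch below builds three entries by hand and checks the relationships the description implies: degree equals the number of dummies, and bonds has one character per dummy. The counts are made up for illustration, and `check_entry` is a hypothetical helper, not part of the module.

```python
# Hand-written entries in [smiles, count, degree, ftype, bonds] order.
# The counts are invented for illustration only.
corpus = [
    ["[*]c1ccccc1", 120, 1, "Ring", "-"],
    ["[*]CC[*]", 45, 2, "Linker", "--"],
    ["[*]C", 300, 1, "Sub", "-"],
]


def check_entry(entry):
    """Return True if an entry is internally consistent (hypothetical helper)."""
    smiles, count, degree, ftype, bonds = entry
    return (
        ftype in {"Ring", "Linker", "Sub"}
        and count >= 1
        and degree == smiles.count("*")   # one attachment point per dummy
        and len(bonds) == degree          # one bond character per dummy
        and set(bonds) <= {"-", "="}
    )


# Group fragment SMILES by their ftype field (index 3).
by_type = {}
for entry in corpus:
    by_type.setdefault(entry[3], []).append(entry[0])

print(all(check_entry(e) for e in corpus))  # True
print(by_type["Linker"])                    # ['[*]CC[*]']
```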

lacan.decompose.decompose1 = <rdkit.Chem.rdChemReactions.ChemicalReaction object>

Cut a single-bond ring-exo bond.

Yields two fragments, each carrying a single-bond dummy at the cut: the non-ring part and the ring part. The !#0 guard prevents re-cutting dummies from a previous iteration. Modified from the original so that single bonds between two ring atoms are also cut, which decomposes biphenyls and other directly linked ring systems.

lacan.decompose.decompose2 = <rdkit.Chem.rdChemReactions.ChemicalReaction object>: Cut a double-bond ring-exo bond (e.g. an exocyclic C=O on a ring atom, as in cyclohexanone).

lacan.decompose.ringdummy = <rdkit.Chem.rdchem.Mol object>: Matches a dummy atom adjacent to a ring atom — identifies Ring fragments.

lacan.decompose.nonringdummy = <rdkit.Chem.rdchem.Mol object>: Matches a dummy atom adjacent to a non-ring atom — identifies Linker/Sub fragments.

lacan.decompose.singledummy = <rdkit.Chem.rdchem.Mol object>: Matches a single-bond dummy attachment point.

lacan.decompose.doubledummy = <rdkit.Chem.rdchem.Mol object>: Matches a double-bond dummy attachment point.

lacan.decompose.get_bonds_string(smi)[source]

Encode the bond types at all dummy attachment points as a string.

The string contains one character per dummy atom, sorted by dummy atom index: "-" for single bonds and "=" for double bonds. This is stored in the corpus so that fragment matching can filter by compatible bond types.

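Examples: "[*]CC" → "-", "[*]CC[*]" → "--", "[*]CC=[*]" → "-=". Only bonds that end in a dummy are encoded: "[*]C(=O)[*]" gives "--" because its double bond goes to the oxygen, not to a dummy.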

Parameters: smi (str) – fragment SMILES containing at least one * dummy atom
Returns: Bond-type string, one character per dummy, sorted by dummy atom index.
Return type: str
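The encoding can be approximated at the string level without a chemistry toolkit. The sketch below is not the module's implementation (which presumably relies on the singledummy and doubledummy patterns). `bonds_string_sketch` is a hypothetical name, and it assumes every dummy is written as `[*]`, has exactly one neighbour, and that the order of dummies in the SMILES string matches their atom-index order.

```python
import re


def bonds_string_sketch(smi: str) -> str:
    """String-level approximation of the bond-type encoding (hypothetical helper)."""
    out = []
    for match in re.finditer(r"\[\*\]", smi):
        start, end = match.span()
        if start == 0:
            # A leading dummy is followed by its bond symbol, if any.
            symbol = smi[end:end + 1]
        else:
            # Any other dummy has its bond symbol, if any, written just before it.
            symbol = smi[start - 1]
        # An omitted bond symbol means a single bond to the dummy.
        out.append("=" if symbol == "=" else "-")
    return "".join(out)


print(bonds_string_sketch("[*]CC"))      # -
print(bonds_string_sketch("[*]CC[*]"))   # --
print(bonds_string_sketch("[*]CC=[*]"))  # -=
```

For real input, reading the bond type from a parsed molecule is more robust than scanning the string, because SMILES writers may emit dummies unbracketed or with atom-map numbers.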

lacan.decompose.decompose_molecule(smi, asSmi=True)[source]

Recursively decompose a molecule into rings, linkers, and substituents.

The algorithm repeatedly applies decompose1 and decompose2 to cut all non-ring exo-bonds, one at a time, until no further cuts are possible. The resulting terminal fragments are then classified into rings, linkers, and substituents based on their dummy-atom topology (see module docstring).

Stereochemistry is removed before decomposition so that stereoisomers map to the same fragment SMILES.

Single-fragment decompositions (molecules with no exo-bonds, e.g. bare rings) are discarded — they yield no useful corpus entries.

Parameters:

smi (str or RDKit Mol) – the molecule to decompose
asSmi (bool) – if True (default), treat smi as a SMILES string and parse it; if False, treat it directly as an RDKit Mol object

Returns:

(rings, linkers, subs)

Return type:

three lists of canonical SMILES strings
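The effect of exhaustive cutting can be illustrated on a plain graph, independent of RDKit. The sketch below is a simplified stand-in for the reaction-based procedure: `cut_ring_exo_bonds` is a hypothetical name, ring membership is supplied by the caller rather than perceived, bond order is ignored, and all qualifying bonds are cut in one pass, which should reach the same final fragments as cutting them one at a time.

```python
from collections import defaultdict


def cut_ring_exo_bonds(bonds, ring_atoms):
    """Cut every non-ring bond that touches a ring atom (hypothetical helper).

    bonds:      iterable of (atom_a, atom_b, in_ring) tuples
    ring_atoms: set of atom indices that belong to a ring
    Returns a list of (sorted atom indices, attachment-point count) pairs,
    or [] when nothing is cut (mirroring the discard of single-fragment results).
    """
    kept = defaultdict(set)     # adjacency over the bonds that survive
    dummies = defaultdict(int)  # attachment points created on each atom
    atoms, n_cut = set(), 0
    for a, b, in_ring in bonds:
        atoms.update((a, b))
        if not in_ring and (a in ring_atoms or b in ring_atoms):
            dummies[a] += 1
            dummies[b] += 1
            n_cut += 1
        else:
            kept[a].add(b)
            kept[b].add(a)
    if n_cut == 0:
        return []

    # Connected components over the surviving bonds are the fragments.
    fragments, seen = [], set()
    for start in sorted(atoms):
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:
            atom = stack.pop()
            component.append(atom)
            for nbr in kept[atom]:
                if nbr not in seen:
                    seen.add(nbr)
                    stack.append(nbr)
        fragments.append((sorted(component), sum(dummies[x] for x in component)))
    return fragments


# Diphenylmethane as a toy graph: ring atoms 0-5, CH2 atom 6, ring atoms 7-12.
ring_a = [(i, (i + 1) % 6, True) for i in range(6)]
ring_b = [(7 + i, 7 + (i + 1) % 6, True) for i in range(6)]
bridge = [(0, 6, False), (6, 7, False)]
frags = cut_ring_exo_bonds(ring_a + bridge + ring_b, set(range(6)) | set(range(7, 13)))
for atom_ids, n_attach in frags:
    print(atom_ids, n_attach)
```

Here the two rings each end up with one attachment point and the central carbon with two, which corresponds to two Ring fragments and one Linker.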

lacan.decompose.get_all_decompositions(smis, n_jobs=-1)[source]

Decompose a large list of SMILES strings in parallel.

Distributes decompose_molecule() across a multiprocessing pool using imap_unordered for memory-efficient streaming (important for millions of ChEMBL molecules).

Parameters:

smis (iterable of str) – SMILES strings to decompose
n_jobs (int) – number of worker processes; -1 (default) uses all CPU cores

Returns:

(all_rings, all_linkers, all_subs) – unsorted and possibly containing duplicates; use save_frags() to count and filter

Return type:

three flat lists of SMILES strings
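The streaming pattern described here can be sketched with the standard library alone. This is an illustration rather than the module's code: `decompose_all` and its `worker`, `use_threads`, and `chunksize` parameters are hypothetical, and `worker` stands in for decompose_molecule(), assumed to return a (rings, linkers, subs) tuple for each input.

```python
import os
from multiprocessing import Pool
from multiprocessing.pool import ThreadPool


def decompose_all(smis, worker, n_jobs=-1, use_threads=False, chunksize=64):
    """Stream worker results into three flat lists (hypothetical helper).

    worker must return a (rings, linkers, subs) tuple of lists per input.
    use_threads=True swaps in a thread pool, which is handy for quick tests
    because the worker then does not need to be picklable.
    """
    if n_jobs < 1:
        n_jobs = os.cpu_count() or 1  # -1 means "use every core"
    pool_cls = ThreadPool if use_threads else Pool
    all_rings, all_linkers, all_subs = [], [], []
    with pool_cls(n_jobs) as pool:
        # imap_unordered yields results as they finish, so only the flat
        # output lists grow; per-molecule results are not held in order.
        for rings, linkers, subs in pool.imap_unordered(worker, smis, chunksize=chunksize):
            all_rings.extend(rings)
            all_linkers.extend(linkers)
            all_subs.extend(subs)
    return all_rings, all_linkers, all_subs
```

With a process pool, `worker` has to be a module-level function, and on platforms that spawn worker processes the call should sit under an `if __name__ == "__main__":` guard. Because results arrive in completion order, the output lists are not aligned with the input order.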

lacan.decompose.save_frags(r, l, s, output_path, minN=10)[source]

Count fragment occurrences and save a filtered corpus CSV.

Only fragments appearing more than minN times are written. The output CSV has columns smiles, occurrence, degree, type, bonds and is sorted by occurrence descending. The resulting file can be loaded with replace.load_corpus(path="my_corpus.csv") as a drop-in replacement for the built-in ChEMBL corpus.

Parameters:

r (list of str) – ring SMILES, as returned by get_all_decompositions()
l (list of str) – linker SMILES
s (list of str) – substituent SMILES
output_path (str) – path to write the CSV
minN (int) – minimum occurrence count to include (default 10)
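The count, filter, and sort steps can be sketched as follows. This is an assumption-laden illustration, not the module's code: `write_corpus_csv` is a hypothetical name, it writes to any file-like object instead of a path, the bond string comes from a caller-supplied `bonds_of` callable, and it keeps fragments whose count is strictly greater than `min_n`, following the "more than minN" wording above.

```python
import csv
import io
from collections import Counter


def write_corpus_csv(rings, linkers, subs, out, bonds_of, min_n=10):
    """Count fragments, filter by occurrence, and write CSV rows (hypothetical helper)."""
    rows = []
    for ftype, frags in (("Ring", rings), ("Linker", linkers), ("Sub", subs)):
        for smi, n in Counter(frags).items():
            if n > min_n:  # assumption: strict threshold, per "more than minN"
                rows.append([smi, n, smi.count("*"), ftype, bonds_of(smi)])
    rows.sort(key=lambda row: row[1], reverse=True)  # most frequent first

    writer = csv.writer(out)
    writer.writerow(["smiles", "occurrence", "degree", "type", "bonds"])
    writer.writerows(rows)
    return len(rows)


# Tiny in-memory demonstration; the placeholder bonds_of assumes single bonds only.
buf = io.StringIO()
n_written = write_corpus_csv(
    ["[*]c1ccccc1"] * 3, ["[*]CC[*]"] * 2, ["[*]C"] * 5,
    buf, bonds_of=lambda smi: "-" * smi.count("*"), min_n=2,
)
print(buf.getvalue())
```

To write a file instead, pass `open(path, "w", newline="")` as `out`; the `newline=""` argument is what the csv module documentation recommends for writers.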

lacan.decompose.get_corpus(mols)[source]

Build a fragment corpus from a list of RDKit Mol objects.

Decomposes each molecule and counts all fragment occurrences. Unlike save_frags(), this function returns the corpus as a list in memory (no minimum count filter, no file I/O) so it can be passed directly to get_corpus_from_mols() for corpus biasing.

Parameters: mols (iterable of RDKit Mol objects) – None entries are silently skipped
Returns: Corpus entries sorted by count descending. Contains all fragment types with count ≥ 1.
Return type: list of [smiles, count, degree, ftype, bonds]
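A minimal sketch of the in-memory variant is shown below, under stated assumptions: `build_corpus` is a hypothetical name, `decompose` stands in for decompose_molecule() called with a Mol, and `bonds_of` stands in for get_bonds_string(). The demonstration replaces both with lookups so it runs without RDKit; the "molecules" are just dictionary keys.

```python
from collections import Counter


def build_corpus(mols, decompose, bonds_of):
    """Count fragments over all molecules, unfiltered (hypothetical helper)."""
    counts = Counter()
    for mol in mols:
        if mol is None:  # unparsable molecules are skipped silently
            continue
        rings, linkers, subs = decompose(mol)
        counts.update(("Ring", smi) for smi in rings)
        counts.update(("Linker", smi) for smi in linkers)
        counts.update(("Sub", smi) for smi in subs)
    # most_common() returns the keys ordered by descending count.
    return [
        [smi, n, smi.count("*"), ftype, bonds_of(smi)]
        for (ftype, smi), n in counts.most_common()
    ]


# Stand-in decompositions, keyed by name instead of real Mol objects.
fake_decompositions = {
    "toluene": (["[*]c1ccccc1"], [], ["[*]C"]),
    "ethylbenzene": (["[*]c1ccccc1"], [], ["[*]CC"]),
}
corpus = build_corpus(
    ["toluene", None, "ethylbenzene"],
    decompose=fake_decompositions.__getitem__,
    bonds_of=lambda smi: "-" * smi.count("*"),  # placeholder: single bonds only
)
print(corpus[0])  # ['[*]c1ccccc1', 2, 1, 'Ring', '-']
```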