lacan.lacan
lacan.py — Core scoring module for LACAN.
LACAN (Leveraging Adjacent Co-occurrence of Atomic Neighborhoods) scores molecules by asking how statistically likely each bond is, given the chemical environments on both sides of it.
Scoring model
For each bond in a molecule, LACAN computes a pair of ECFP2-like atom environment identifiers — one for each endpoint — and looks up their co-occurrence statistics in a pre-built profile.
The atom environment identifier encodes five features:
Atomic number
Heavy-atom degree
Total hydrogen count (explicit + implicit)
Formal charge
Ring type: 0 if acyclic, 1 if in a non-aromatic ring, 2 if in an aromatic ring
These five integers are hashed to a 32-bit integer via MD5
(hash_invariants()). The two hashes for a bond are sorted and stored as
a tuple — the bond pair — so that the bond (A→B) and (B→A) map to the same
key.
A profile is a dict with three entries built from a molecule set:
idx{hash: count}— how often each atom environment appears.pairs{(hash1, hash2): count}— how often each bond pair co-occurs.setsizeTotal number of bonds seen, used to normalise counts to frequencies.
For a bond with environments e1 and e2 the pointwise mutual information (PMI) score is:
observed = pairs[(e1, e2)] / setsize
expected = (idx[e1] / setsize / 2) * (idx[e2] / setsize / 2)
PMI = observed / expected
A PMI > 1 means the two environments co-occur more often than chance; a PMI near 0 means the combination is chemically unusual.
The molecule-level score is derived from the minimum per-bond PMI:
score = min_PMI / (1 + min_PMI)
This saturates toward 1.0 as the worst-bond PMI grows large, and approaches
0 when the worst bond is near zero. Bonds below a threshold t (default
0.05) are reported as bad_bonds in score_mol().
CLI usage
python -m lacan.lacan -i molecules.smi -p chembl -m score -t 0.05
Modes:
scorePrint per-molecule score and list of failing bond indices.
profileBuild a new profile from the input SMILES file and save it to
data/<profile_name>.pickle.
- lacan.lacan.hash_invariants(invs)[source]
Hash a list of integers to a reproducible, portable 32-bit signed int.
Uses the first 4 bytes of an MD5 digest so that the hash is identical across Python versions and platforms — a requirement for profiles built on one machine to be usable on another.
- lacan.lacan.mol_to_pairs(mol)[source]
Compute the bond-pair identifiers for all bonds in a molecule.
For each bond the function calculates two ECFP2-like atom environment hashes — one per endpoint — using the immediate neighbourhood (atom + all directly bonded neighbours, along with bond type to each). The two hashes are sorted and stored as a tuple so that bond direction does not matter.
This replaces RDKit’s fingerprint generators for this task to avoid the overhead of bond fracturing and aromatic re-sanitisation that would be needed with the generator API.
- lacan.lacan.get_atom_invariants(mol)[source]
Compute ECFP2-like atom environment descriptors for every atom.
Each descriptor is a list of five integers:
Atomic number
Heavy-atom degree (number of non-H neighbours)
Total hydrogen count (explicit + implicit)
Formal charge
Ring type: 0 if acyclic, 1 if in a non-aromatic ring, 2 if in an aromatic ring
These are used directly as the innermost layer of the bond-pair hash (see
mol_to_pairs()).
- lacan.lacan.get_profile_for_mols(suppl, profile_name, size=8192, n_jobs=1)[source]
Build a LACAN profile from a molecule supplier and save it to disk.
The profile is a dict
{"idx": ..., "pairs": ..., "setsize": ...}(see module docstring) pickled todata/<profile_name>.pickleinside the package directory.- Parameters:
suppl (iterable of RDKit Mol objects (e.g. SmilesMolSupplier))
profile_name (str — name under which to save the profile)
size (int — keep only the size-1 most common atom environments) – in
idx(reduces profile file size; default 8192)n_jobs (int — parallel workers; set < 1 for all CPU cores)
- Returns:
The profile dict (same object that is written to disk).
- Return type:
- lacan.lacan.assess_per_bond(mol, profile=None)[source]
Return the PMI score for every bond in mol.
For each bond the PMI is:
PMI = (pairs[(e1,e2)] / setsize) / ((idx[e1]/setsize/2) * (idx[e2]/setsize/2))
Unknown environments (not seen in training) contribute an observed occurrence of 0, giving PMI = 0 for that bond.
- lacan.lacan.load_profile(profile_name)[source]
Load a pickled LACAN profile from the package data directory.
- Parameters:
profile_name (str) – Name of the profile (without
.pickleextension). The file must exist at<package_dir>/data/<profile_name>.pickle.- Returns:
Profile dict with keys
"idx","pairs","setsize".- Return type:
- Raises:
FileNotFoundError – If the pickle file does not exist.
- lacan.lacan.get_default_profile()[source]
Lazy-load the default ChEMBL profile the first time it is requested.
Caches the result in the module-level
_DEFAULT_PROFILEvariable so subsequent calls return immediately without re-reading disk.- Returns:
The ChEMBL LACAN profile.
- Return type:
- lacan.lacan.score_mol(mol, profile=None, mode='score', t=0.05)[source]
Score a molecule using the LACAN profile, respecting protected bonds.
Protected bonds (marked with
_lpvialacan.protect) are excluded from the score calculation. This means a molecule with required-but-unusual motifs (e.g. a covalent warhead) can be scored fairly by protecting those bonds first withprotect_rejected_bonds().The score is derived from the minimum per-bond PMI over unprotected bonds only. Two modes are available:
"score"(default)Continuous score in [0, 1] computed as:
score = min_PMI / (1 + min_PMI)
where
min_PMIis the lowest per-bond PMI over unprotected bonds. Saturates toward 1.0 for well-profiled bonds; approaches 0 for unusual ones. The threshold t does not affect this value — it only determines which bond indices appear ininfo["bad_bonds"]."threshold"Binary: 1 if all unprotected bonds pass (PMI ≥ t), else 0.
- Parameters:
mol (RDKit Mol (may have bonds with
_lpprotection property))profile (LACAN profile dict; loads the default ChEMBL profile if None)
mode (
"score"or"threshold")t (bond PMI threshold (default 0.05))
- Returns:
score— float in [0, 1] (or 0/1 in threshold mode).info— dict with key"bad_bonds": list of unprotected bond indices whose PMI is below t.- Return type:
(score, info)