lacan.lacan

lacan.py — Core scoring module for LACAN.

LACAN (Leveraging Adjacent Co-occurrence of Atomic Neighborhoods) scores molecules by asking how statistically likely each bond is, given the chemical environments on both sides of it.

Scoring model

For each bond in a molecule, LACAN computes a pair of ECFP2-like atom environment identifiers — one for each endpoint — and looks up their co-occurrence statistics in a pre-built profile.

The atom environment identifier encodes five features:

  • Atomic number

  • Heavy-atom degree

  • Total hydrogen count (explicit + implicit)

  • Formal charge

  • Ring type: 0 if acyclic, 1 if in a non-aromatic ring, 2 if in an aromatic ring

These five integers are hashed to a 32-bit integer via MD5 (hash_invariants()). The two hashes for a bond are sorted and stored as a tuple — the bond pair — so that the bond (A→B) and (B→A) map to the same key.

A profile is a dict with three entries built from a molecule set:

idx

{hash: count} — how often each atom environment appears.

pairs

{(hash1, hash2): count} — how often each bond pair co-occurs.

setsize

Total number of bonds seen, used to normalise counts to frequencies.

For a bond with environments e1 and e2 the pointwise mutual information (PMI) score is:

observed  = pairs[(e1, e2)] / setsize
expected  = (idx[e1] / setsize / 2) * (idx[e2] / setsize / 2)
PMI       = observed / expected

A PMI > 1 means the two environments co-occur more often than chance; a PMI near 0 means the combination is chemically unusual.

The molecule-level score is derived from the minimum per-bond PMI:

score = min_PMI / (1 + min_PMI)

This saturates toward 1.0 as the worst-bond PMI grows large, and approaches 0 when the worst bond is near zero. Bonds below a threshold t (default 0.05) are reported as bad_bonds in score_mol().

CLI usage

python -m lacan.lacan -i molecules.smi -p chembl -m score -t 0.05

Modes:

score

Print per-molecule score and list of failing bond indices.

profile

Build a new profile from the input SMILES file and save it to data/<profile_name>.pickle.

lacan.lacan.hash_invariants(invs)[source]

Hash a list of integers to a reproducible, portable 32-bit signed int.

Uses the first 4 bytes of an MD5 digest so that the hash is identical across Python versions and platforms — a requirement for profiles built on one machine to be usable on another.

Parameters:

invs (list of int) – Flattened atom-environment descriptor (atomic number, degree, H count, charge, ring type, bond type, …).

Returns:

32-bit signed integer hash.

Return type:

int

lacan.lacan.mol_to_pairs(mol)[source]

Compute the bond-pair identifiers for all bonds in a molecule.

For each bond the function calculates two ECFP2-like atom environment hashes — one per endpoint — using the immediate neighbourhood (atom + all directly bonded neighbours, along with bond type to each). The two hashes are sorted and stored as a tuple so that bond direction does not matter.

This replaces RDKit’s fingerprint generators for this task to avoid the overhead of bond fracturing and aromatic re-sanitisation that would be needed with the generator API.

Parameters:

mol (RDKit Mol)

Returns:

One sorted (h1, h2) bond-pair per bond, in bond-index order. Returns [] if mol is None or unparseable.

Return type:

list of tuple(int, int)

lacan.lacan.get_neighbors(mol)[source]

Return the neighbour atom indices for every atom in mol.

Parameters:

mol (RDKit Mol)

Returns:

result[i] is the list of atom indices directly bonded to atom i.

Return type:

list of list of int

lacan.lacan.get_atom_invariants(mol)[source]

Compute ECFP2-like atom environment descriptors for every atom.

Each descriptor is a list of five integers:

  1. Atomic number

  2. Heavy-atom degree (number of non-H neighbours)

  3. Total hydrogen count (explicit + implicit)

  4. Formal charge

  5. Ring type: 0 if acyclic, 1 if in a non-aromatic ring, 2 if in an aromatic ring

These are used directly as the innermost layer of the bond-pair hash (see mol_to_pairs()).

Parameters:

mol (RDKit Mol)

Returns:

One 5-element descriptor per atom, in atom-index order.

Return type:

list of list of int

lacan.lacan.get_profile_for_mols(suppl, profile_name, size=8192, n_jobs=1)[source]

Build a LACAN profile from a molecule supplier and save it to disk.

The profile is a dict {"idx": ..., "pairs": ..., "setsize": ...} (see module docstring) pickled to data/<profile_name>.pickle inside the package directory.

Parameters:
  • suppl (iterable of RDKit Mol objects (e.g. SmilesMolSupplier))

  • profile_name (str — name under which to save the profile)

  • size (int — keep only the size-1 most common atom environments) – in idx (reduces profile file size; default 8192)

  • n_jobs (int — parallel workers; set < 1 for all CPU cores)

Returns:

The profile dict (same object that is written to disk).

Return type:

dict

lacan.lacan.assess_per_bond(mol, profile=None)[source]

Return the PMI score for every bond in mol.

For each bond the PMI is:

PMI = (pairs[(e1,e2)] / setsize) / ((idx[e1]/setsize/2) * (idx[e2]/setsize/2))

Unknown environments (not seen in training) contribute an observed occurrence of 0, giving PMI = 0 for that bond.

Parameters:
  • mol (RDKit Mol)

  • profile (LACAN profile dict; loads the default ChEMBL profile if None)

Returns:

One PMI value per bond, in bond-index order. Values near 0 indicate unusual bond environments; values > 1 indicate commonly co-occurring environments.

Return type:

list of float

lacan.lacan.load_profile(profile_name)[source]

Load a pickled LACAN profile from the package data directory.

Parameters:

profile_name (str) – Name of the profile (without .pickle extension). The file must exist at <package_dir>/data/<profile_name>.pickle.

Returns:

Profile dict with keys "idx", "pairs", "setsize".

Return type:

dict

Raises:

FileNotFoundError – If the pickle file does not exist.

lacan.lacan.get_default_profile()[source]

Lazy-load the default ChEMBL profile the first time it is requested.

Caches the result in the module-level _DEFAULT_PROFILE variable so subsequent calls return immediately without re-reading disk.

Returns:

The ChEMBL LACAN profile.

Return type:

dict

lacan.lacan.score_mol(mol, profile=None, mode='score', t=0.05)[source]

Score a molecule using the LACAN profile, respecting protected bonds.

Protected bonds (marked with _lp via lacan.protect) are excluded from the score calculation. This means a molecule with required-but-unusual motifs (e.g. a covalent warhead) can be scored fairly by protecting those bonds first with protect_rejected_bonds().

The score is derived from the minimum per-bond PMI over unprotected bonds only. Two modes are available:

"score" (default)

Continuous score in [0, 1] computed as:

score = min_PMI / (1 + min_PMI)

where min_PMI is the lowest per-bond PMI over unprotected bonds. Saturates toward 1.0 for well-profiled bonds; approaches 0 for unusual ones. The threshold t does not affect this value — it only determines which bond indices appear in info["bad_bonds"].

"threshold"

Binary: 1 if all unprotected bonds pass (PMI ≥ t), else 0.

Parameters:
  • mol (RDKit Mol (may have bonds with _lp protection property))

  • profile (LACAN profile dict; loads the default ChEMBL profile if None)

  • mode ("score" or "threshold")

  • t (bond PMI threshold (default 0.05))

Returns:

score — float in [0, 1] (or 0/1 in threshold mode). info — dict with key "bad_bonds": list of unprotected bond indices whose PMI is below t.

Return type:

(score, info)