ZINC20 (700M) subset with similarity threshold¶
This example shows how to use OnlineDiversityPicker
to choose a diverse subset of ZINC20, keeping only molecules whose pairwise Tanimoto distance is at least 0.75. The exact subset size is not fixed in advance; only a maximum size (the picker capacity) is set to bound resource usage.
The notebook also covers handling datasets too large to load into memory. OnlineDiversityPicker
operates online, so its runtime scales linearly with dataset size.
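To make the criterion concrete, here is a tiny illustration of the one-minus-Tanimoto distance the picker uses. It is not part of the pipeline; the two hand-made bit vectors are invented for this sketch:
import numpy as np

a = np.array([1, 1, 0, 1, 0], dtype=bool)
b = np.array([1, 0, 0, 1, 1], dtype=bool)

intersection = (a & b).sum()  # bits set in both fingerprints
union = (a | b).sum()         # bits set in either fingerprint
1 - intersection / union      # 0.5 here; a molecule is picked only if its
                              # distance to every picked one is >= 0.75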
import os
from datetime import datetime
from glob import glob
from pathlib import Path

import pandas as pd
from tqdm.auto import tqdm

from moll.pick import OnlineDiversityPicker
from moll.small import Molecule
from moll.utils import (
    iter_lines,
    iter_slices,
    map_concurrently,
    no_warnings,
    unpack_arguments,
)
# Fingerprint properties
FINGERPRINT_SIZE = 2048
FINGERPRINT_RADIUS = 2
FINGERPRINT_FOLD = 1024
# Molecules per batch
BATCH_SIZE = 30_000
# Glob pattern matching the SMILES files
GLOB = "/data/zinc-smiles/*.smi"
# Number of parallel jobs
N_WORKERS = os.cpu_count() - 4 # leave some cores free
N_WORKERS
20
SMILES_FILES = [Path(f) for f in glob(GLOB)]  # files matching the glob pattern
N_LINES = sum(1 for f in SMILES_FILES for _ in f.read_text().splitlines())
N_BATCHES = N_LINES // BATCH_SIZE
Pick molecules¶
Define the picker object:
picker = OnlineDiversityPicker(
    capacity=50_000,  # limit the number of picked molecules
    k_neighbors=300,
    similarity_fn="one_minus_tanimoto",
    threshold=0.75,  # distance threshold
    dtype=bool,
)
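As a quick sanity check before the long run, the picker can be exercised on toy data. This is illustrative only: the sparse random vectors stand in for real fingerprints, and defaults are assumed for the parameters not passed here:
import numpy as np

rng = np.random.default_rng(42)
toy_vectors = rng.random((100, FINGERPRINT_FOLD)) < 0.05  # sparse random "fingerprints"
toy_labels = list(range(100))

toy_picker = OnlineDiversityPicker(
    capacity=10,
    similarity_fn="one_minus_tanimoto",
    threshold=0.75,
    dtype=bool,
)
toy_picker.update(toy_vectors, toy_labels)
len(toy_picker.labels)  # at most `capacity` labels survive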
Define a function that converts one line of a SMILES file into a fingerprint and a label:
@unpack_arguments
@no_warnings
def processed_line(source, line_no, line):
    smiles, _, mol_id = line.partition(" ")  # each line is "<SMILES> <id>"
    fp = Molecule.from_smiles(smiles).to_fp(
        "morgan",
        radius=FINGERPRINT_RADIUS,
        size=FINGERPRINT_SIZE,
        fold_size=FINGERPRINT_FOLD,
    )
    label = (source, line_no, mol_id)
    return fp, label
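For illustration, the function can be called on a hand-made line. The SMILES and id below are invented; the (source, line_no, line) tuple shape matches what iter_lines yields, which @unpack_arguments is assumed to unpack into the three arguments:
fp, label = processed_line(("AAABMO", 1, "CCO ZINC000000000001"))
label  # ('AAABMO', 1, 'ZINC000000000001')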
Use the built-in data utilities to load the data in parallel:
# Iterate over lines
lines_iterator = iter_lines(
    GLOB,  # .smi files glob pattern
    skip_rows=1,  # skip header
    source_fn="stem",  # return file stem as the source name
)

# Parallelize line processing
map_iterator = map_concurrently(
    processed_line,  # function to apply to each line
    lines_iterator,  # iterator over lines
    proc=True,  # use multiprocessing
    n_workers=N_WORKERS,  # number of workers
    exception_fn="ignore",  # skip lines that raise (e.g. unparsable SMILES)
)

# Combine processed lines into batches
batches_iterator = iter_slices(
    map_iterator,  # iterator over processed lines
    BATCH_SIZE,  # batch size
    transform_fn="transpose",  # transpose batches to (fps, labels)
)
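Optionally, peek at a single batch to check shapes before committing to the long run. Note that this consumes the first batch, so re-create the iterators before the real pass:
vectors, labels = next(batches_iterator)
len(vectors), labels[0]  # e.g. 30000 fingerprints and one (source, line_no, id) label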
Start the picking process:
%%time
for vectors, labels in tqdm(batches_iterator, total=N_BATCHES):
    picker.update(vectors, labels)
CPU times: user 1d 4h 25min 6s, sys: 2h 19min 59s, total: 1d 6h 45min 6s Wall time: 18h 22min 25s
Save results¶
The picked molecule labels are:
df = pd.DataFrame(picker.labels, columns=["file_stem", "line_no", "id"])
df
| | file_stem | line_no | id |
| --- | --- | --- | --- |
| 0 | AAABMO | 1 | 5273827 |
| 1 | AAABMO | 2 | 380227274 |
| 2 | AAABMO | 3 | 215393865 |
| 3 | AAABMO | 5 | 6003141 |
| 4 | AAABMO | 7 | 38363127 |
| ... | ... | ... | ... |
| 20304 | JJEBRN | 4672 | ZINC000012654104 |
| 20305 | JJEBRO | 1 | ZINC000057291984 |
| 20306 | JJEBRO | 26 | ZINC000016779476 |
| 20307 | JJEDMN | 56 | ZINC001164728887 |
| 20308 | JJEDRN | 4330 | ZINC001464591504 |

20309 rows × 3 columns
Save the results to a CSV file:
timestamp = datetime.now().strftime("%Y%m%dT%H%M%SZ")
timestamp
'20240202T144453Z'
df.to_csv(
    f"zinc20-50K-{timestamp}.csv",
    sep=" ",
    index=False,
    mode="x",  # fail if the file already exists, to avoid overwriting
)
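As a quick round-trip check (plain pandas, nothing moll-specific), the file can be read back with the same separator:
df_check = pd.read_csv(f"zinc20-50K-{timestamp}.csv", sep=" ")
df_check.shape  # (20309, 3)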
What's next?¶
The strategy could be improved by:
- shuffling the molecules before picking, so that similar molecules are not processed consecutively and the picker is initialized with a more representative sample (see the sketch after this list);
- lowering the threshold to select more molecules, which can (in theory) increase diversity; the picker can then be re-run with a smaller capacity to trim the excess;
- decreasing the picker capacity when working with a non-diverse dataset, which also speeds up picking.
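A hedged sketch of the shuffling idea. It assumes iter_lines also accepts an explicit list of paths in place of a glob pattern; if it does not, shuffling would instead need a preprocessing pass over the files:
import random

shuffled_files = [str(f) for f in SMILES_FILES]
random.shuffle(shuffled_files)  # randomize file order before picking

lines_iterator = iter_lines(
    shuffled_files,  # assumed: a list of paths is accepted like a glob pattern
    skip_rows=1,
    source_fn="stem",
)
Note that this only shuffles whole files; true line-level shuffling would require rewriting the dataset.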