ZINC20 (700M) subset with similarity threshold¶
This example shows how to use OnlineDiversityPicker
to choose a diverse subset of ZINC20, keeping only molecules whose pairwise Tanimoto distance is at least 0.75. The exact subset size is not fixed in advance; only a maximum size (the picker capacity) is set to bound resource usage.
The notebook also covers handling datasets too large to load into memory. OnlineDiversityPicker
operates online, so its runtime scales linearly with dataset size.
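To make the criterion concrete, here is a tiny illustration of the one-minus-Tanimoto distance the picker uses. It is not part of the pipeline; the two hand-made bit vectors are invented for this sketch:
import numpy as np

a = np.array([1, 1, 0, 1, 0], dtype=bool)
b = np.array([1, 0, 0, 1, 1], dtype=bool)

intersection = (a & b).sum()  # bits set in both fingerprints
union = (a | b).sum()         # bits set in either fingerprint
1 - intersection / union      # 0.5 here; a molecule is picked only if its
                              # distance to every picked one is >= 0.75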
import os
from datetime import datetime
from glob import glob
from pathlib import Path

import pandas as pd
from tqdm.auto import tqdm

from moll.pick import OnlineDiversityPicker
from moll.small import Molecule
from moll.utils import (
    iter_lines,
    iter_slices,
    map_concurrently,
    no_warnings,
    unpack_arguments,
)
# Fingerprint properties
FINGERPRINT_SIZE = 2048
FINGERPRINT_RADIUS = 2
FINGERPRINT_FOLD = 1024
# Molecules per batch
BATCH_SIZE = 30_000
# Glob pattern matching the SMILES files
GLOB = "/data/zinc-smiles/*.smi"
# Number of parallel jobs
N_WORKERS = os.cpu_count() - 4 # leave some cores free
N_WORKERS
20
SMILES_FILES = [Path(f) for f in glob(GLOB)]  # files matching the glob pattern
N_LINES = sum(1 for f in SMILES_FILES for _ in f.read_text().splitlines())
N_BATCHES = N_LINES // BATCH_SIZE
Pick molecules¶
Define the picker object:
picker = OnlineDiversityPicker(
    capacity=50_000,  # limit the number of picked molecules
    k_neighbors=300,
    similarity_fn="one_minus_tanimoto",
    threshold=0.75,  # distance threshold
    dtype=bool,
)
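As a quick sanity check before the long run, the picker can be exercised on toy data. This is illustrative only: the sparse random vectors stand in for real fingerprints, and defaults are assumed for the parameters not passed here:
import numpy as np

rng = np.random.default_rng(42)
toy_vectors = rng.random((100, FINGERPRINT_FOLD)) < 0.05  # sparse random "fingerprints"
toy_labels = list(range(100))

toy_picker = OnlineDiversityPicker(
    capacity=10,
    similarity_fn="one_minus_tanimoto",
    threshold=0.75,
    dtype=bool,
)
toy_picker.update(toy_vectors, toy_labels)
len(toy_picker.labels)  # at most `capacity` labels survive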
Define a function that converts one line of a SMILES file into a fingerprint and a label:
@unpack_arguments
@no_warnings
def processed_line(source, line_no, line):
    smiles, _, mol_id = line.partition(" ")  # each line is "<SMILES> <id>"
    fp = Molecule.from_smiles(smiles).to_fp(
        "morgan",
        radius=FINGERPRINT_RADIUS,
        size=FINGERPRINT_SIZE,
        fold_size=FINGERPRINT_FOLD,
    )
    label = (source, line_no, mol_id)
    return fp, label
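For illustration, the function can be called on a hand-made line. The SMILES and id below are invented; the (source, line_no, line) tuple shape matches what iter_lines yields, which @unpack_arguments is assumed to unpack into the three arguments:
fp, label = processed_line(("AAABMO", 1, "CCO ZINC000000000001"))
label  # ('AAABMO', 1, 'ZINC000000000001')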
Use the built-in data utilities to load the data in parallel:
# Iterate over lines
lines_iterator = iter_lines(
    GLOB,  # .smi files glob pattern
    skip_rows=1,  # skip header
    source_fn="stem",  # return file stem as the source name
)

# Parallelize line processing
map_iterator = map_concurrently(
    processed_line,  # function to apply to each line
    lines_iterator,  # iterator over lines
    proc=True,  # use multiprocessing
    n_workers=N_WORKERS,  # number of workers
    exception_fn="ignore",  # skip lines that raise (e.g. unparsable SMILES)
)

# Combine processed lines into batches
batches_iterator = iter_slices(
    map_iterator,  # iterator over processed lines
    BATCH_SIZE,  # batch size
    transform_fn="transpose",  # transpose batches to (fps, labels)
)
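Optionally, peek at a single batch to check shapes before committing to the long run. Note that this consumes the first batch, so re-create the iterators before the real pass:
vectors, labels = next(batches_iterator)
len(vectors), labels[0]  # e.g. 30000 fingerprints and one (source, line_no, id) label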
Start the picking process:
%%time
for vectors, labels in tqdm(batches_iterator, total=N_BATCHES):
    picker.update(vectors, labels)
CPU times: user 1d 4h 25min 6s, sys: 2h 19min 59s, total: 1d 6h 45min 6s Wall time: 18h 22min 25s
Save results¶
The picked molecule labels are:
df = pd.DataFrame(picker.labels, columns=["file_stem", "line_no", "id"])
df
| | file_stem | line_no | id |
| --- | --- | --- | --- |
| 0 | AAABMO | 1 | 5273827 |
| 1 | AAABMO | 2 | 380227274 |
| 2 | AAABMO | 3 | 215393865 |
| 3 | AAABMO | 5 | 6003141 |
| 4 | AAABMO | 7 | 38363127 |
| ... | ... | ... | ... |
| 20304 | JJEBRN | 4672 | ZINC000012654104 |
| 20305 | JJEBRO | 1 | ZINC000057291984 |
| 20306 | JJEBRO | 26 | ZINC000016779476 |
| 20307 | JJEDMN | 56 | ZINC001164728887 |
| 20308 | JJEDRN | 4330 | ZINC001464591504 |

20309 rows × 3 columns
Save the results to a CSV file:
timestamp = datetime.now().strftime("%Y%m%dT%H%M%SZ")
timestamp
'20240202T144453Z'
df.to_csv(
    f"zinc20-50K-{timestamp}.csv",
    sep=" ",
    index=False,
    mode="x",  # fail if the file already exists, to avoid overwriting
)
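As a quick round-trip check (plain pandas, nothing moll-specific), the file can be read back with the same separator:
df_check = pd.read_csv(f"zinc20-50K-{timestamp}.csv", sep=" ")
df_check.shape  # (20309, 3)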
What's next?¶
The strategy could be improved by:
- shuffling the molecules before picking, so that similar molecules are not processed consecutively and the picker is initialized with a more representative sample (see the sketch after this list);
- lowering the threshold to select more molecules, which can (in theory) increase diversity; the picker can then be re-run with a smaller capacity to trim the excess;
- decreasing the picker capacity when working with a non-diverse dataset, which also speeds up picking.
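A hedged sketch of the shuffling idea. It assumes iter_lines also accepts an explicit list of paths in place of a glob pattern; if it does not, shuffling would instead need a preprocessing pass over the files:
import random

shuffled_files = [str(f) for f in SMILES_FILES]
random.shuffle(shuffled_files)  # randomize file order before picking

lines_iterator = iter_lines(
    shuffled_files,  # assumed: a list of paths is accepted like a glob pattern
    skip_rows=1,
    source_fn="stem",
)
Note that this only shuffles whole files; true line-level shuffling would require rewriting the dataset.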