晴耕雨讀

Zonveld & Regenboek

Molecular Fingerprints: Encoding Chemical Similarity for Drug Discovery

Posted at # Cheminformatics

Introduction

In drug discovery, one of the most fundamental questions is: “Which molecules are similar to my lead compound?” Molecular fingerprints provide a computational answer by encoding chemical structures into binary vectors that enable rapid similarity calculations across large chemical databases.

What Are Molecular Fingerprints?

Molecular fingerprints are binary vectors where each bit represents the presence or absence of a specific structural feature in a molecule. Think of them as chemical “barcodes” that capture essential structural information in a format computers can quickly process.

The Similarity Principle

The foundation of fingerprint-based approaches is the similarity principle: structurally similar compounds tend to have similar biological activities. The principle does not always hold (see the activity cliffs discussed under Limitations and Considerations), but it provides a powerful starting point for drug discovery.

Types of Molecular Fingerprints

1. Substructure-Based Fingerprints

MACCS Keys (166 bits)

Example MACCS patterns:

2. Path-Based and Circular Fingerprints

Daylight Fingerprints
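As a rough illustration of the path-based idea, the sketch below enumerates linear atom paths in a tiny molecular graph and hashes each path to a bit position. It is a toy under stated assumptions, not Daylight's actual algorithm: the input format (atom symbols plus bond index pairs), the function names, the CRC32 hash, and the 64-bit length are all choices made for this example, and bond orders are ignored.

```python
import zlib

def enumerate_paths(atoms, bonds, max_len=4):
    """Return the set of linear atom paths (as strings) with up to max_len atoms."""
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    paths = set()

    def walk(path):
        label = "".join(atoms[i] for i in path)
        # A path and its reverse describe the same feature; keep one canonical form
        paths.add(min(label, label[::-1]))
        if len(path) == max_len:
            return
        for nxt in neighbors[path[-1]]:
            if nxt not in path:
                walk(path + [nxt])

    for start in range(len(atoms)):
        walk([start])
    return paths

def path_fingerprint(atoms, bonds, n_bits=64, max_len=4):
    """Hash each path to a bit position; returns the set of on-bit indices."""
    return {zlib.crc32(p.encode()) % n_bits
            for p in enumerate_paths(atoms, bonds, max_len)}

# Toy heavy-atom graphs (hypothetical input format)
ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])
propanol = (["C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)])

print(sorted(enumerate_paths(*ethanol)))
print(sorted(enumerate_paths(*propanol)))
```

Every path found in ethanol also occurs in propanol, so ethanol's on-bits are a subset of propanol's. Because different paths can hash to the same position, a set bit suggests a feature is present but cannot prove it.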

RDKit Morgan Fingerprints (Extended Connectivity)

3. Pharmacophore Fingerprints

Encode 3D arrangements of pharmacophoric features:

Calculating Molecular Fingerprints

Using RDKit (Python)

from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs
import numpy as np

# Load a molecule
mol = Chem.MolFromSmiles('CCO')  # Ethanol

# Calculate different fingerprint types
morgan_fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=1024)  # circular, ECFP4-like
maccs_fp = AllChem.GetMACCSKeysFingerprint(mol)  # RDKit returns 167 bits; bit 0 is unused
rdkit_fp = Chem.RDKFingerprint(mol)  # path-based, Daylight-like

# Convert to numpy array (one 0/1 entry per bit) for analysis
morgan_array = np.array(morgan_fp)
print(f"Morgan fingerprint: {int(morgan_array.sum())} bits set out of {len(morgan_array)}")

Similarity Calculation

# Compare two molecules
mol1 = Chem.MolFromSmiles('CCO')      # Ethanol
mol2 = Chem.MolFromSmiles('CCCO')     # Propanol

fp1 = AllChem.GetMorganFingerprintAsBitVect(mol1, 2)
fp2 = AllChem.GetMorganFingerprintAsBitVect(mol2, 2)

# Tanimoto similarity
similarity = DataStructs.TanimotoSimilarity(fp1, fp2)
print(f"Tanimoto similarity: {similarity:.3f}")

Similarity Metrics

Tanimoto Coefficient

The Tanimoto coefficient is the most commonly used similarity metric for fingerprints:

Tanimoto = |A ∩ B| / |A ∪ B|

Where A and B are sets of bits set to 1 in each fingerprint.
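The arithmetic is easy to check by hand. The sketch below treats each fingerprint as a Python set of on-bit positions; the function name, the example bit positions, and the convention of returning 0.0 for two empty fingerprints are assumptions made for illustration.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union

fp_a = {1, 4, 7, 9, 12}
fp_b = {1, 4, 9, 15}

# Shared bits: {1, 4, 9} (3); distinct bits overall: {1, 4, 7, 9, 12, 15} (6)
print(tanimoto(fp_a, fp_b))  # 3 / 6 = 0.5
```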

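Properties: the coefficient ranges from 0 (no bits in common) to 1 (identical fingerprints), and it is symmetric, so swapping A and B gives the same value.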

Other Metrics

Dice Coefficient:

Dice = 2|A ∩ B| / (|A| + |B|)

Cosine Similarity:

Cosine = A·B / (||A|| ||B||)

Manhattan Distance:

Manhattan = Σ|A_i - B_i|
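For binary fingerprints, all three formulas reduce to simple bit counts. The sketch below assumes fingerprints are stored as sets of on-bit positions; the function names and example bits are illustrative only.

```python
import math

def dice(a, b):
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0

def cosine(a, b):
    # For 0/1 vectors, A·B is the shared-bit count and ||A|| is sqrt(|A|)
    denom = math.sqrt(len(a) * len(b))
    return len(a & b) / denom if denom else 0.0

def manhattan(a, b):
    # For 0/1 vectors, the sum of |A_i - B_i| counts the bits that differ
    return len(a ^ b)

fp_a = {1, 4, 7, 9, 12}
fp_b = {1, 4, 9, 15}

print(dice(fp_a, fp_b))       # 2*3 / (5 + 4), about 0.667
print(cosine(fp_a, fp_b))     # 3 / sqrt(5 * 4), about 0.671
print(manhattan(fp_a, fp_b))  # 3 differing bits: 7, 12, 15
```

Dice and Tanimoto always rank compounds in the same order, since Dice = 2T / (1 + T) for a Tanimoto value T. Manhattan is a distance, so smaller values mean more similar fingerprints.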

Applications in Drug Discovery

1. Virtual Screening

Find compounds similar to known active molecules:

def virtual_screening(query_smiles, database_smiles, threshold=0.7):
    """Screen database for compounds similar to query"""
    query_mol = Chem.MolFromSmiles(query_smiles)
    if query_mol is None:
        raise ValueError(f"Could not parse query SMILES: {query_smiles}")
    query_fp = AllChem.GetMorganFingerprintAsBitVect(query_mol, 2)

    hits = []
    for smiles in database_smiles:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            continue

        fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2)
        similarity = DataStructs.TanimotoSimilarity(query_fp, fp)

        if similarity >= threshold:
            hits.append((smiles, similarity))

    return sorted(hits, key=lambda x: x[1], reverse=True)

2. Clustering Chemical Libraries

Group structurally similar compounds:

from sklearn.cluster import KMeans

def cluster_molecules(smiles_list, n_clusters=10):
    """Cluster molecules based on structural similarity"""
    # Calculate fingerprints
    fingerprints = []
    for smiles in smiles_list:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            # Fail loudly so the returned labels stay aligned with smiles_list
            raise ValueError(f"Could not parse SMILES: {smiles}")
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)
        fingerprints.append(np.array(fp))

    # Perform clustering
    # Note: KMeans uses Euclidean distance on the bit vectors, not Tanimoto
    X = np.array(fingerprints)
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    clusters = kmeans.fit_predict(X)

    return clusters
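KMeans groups fingerprints by Euclidean distance. When clusters should follow Tanimoto similarity instead, a greedy neighbor-count scheme in the spirit of Taylor-Butina clustering is a common alternative. The sketch below is a simplified variant written for illustration, not a reference implementation: fingerprints are plain sets of on-bit positions, and the function names, cutoff, and toy data are assumptions.

```python
def tanimoto(a, b):
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def butina_style_clusters(fps, cutoff=0.5):
    """Greedy clustering: repeatedly take the unassigned fingerprint with the most
    unassigned neighbors (similarity >= cutoff) as a centroid and claim its neighbors."""
    n = len(fps)
    neighbors = [
        {j for j in range(n) if j != i and tanimoto(fps[i], fps[j]) >= cutoff}
        for i in range(n)
    ]
    unassigned = set(range(n))
    clusters = []
    while unassigned:
        # Ties are broken by lowest index because max() keeps the first maximum
        centroid = max(sorted(unassigned),
                       key=lambda i: len(neighbors[i] & unassigned))
        members = [centroid] + sorted(neighbors[centroid] & unassigned)
        clusters.append(members)
        unassigned -= set(members)
    return clusters

# Toy fingerprints as sets of on-bit positions (hypothetical data)
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 4, 5},
       {10, 11, 12}, {10, 11, 13}, {20, 21}]
print(butina_style_clusters(fps, cutoff=0.5))
```

Unlike KMeans, this approach does not need the number of clusters up front; the similarity cutoff controls how tight the clusters are, and compounds with no neighbors end up as singletons.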

3. Diversity Selection

Choose diverse compounds for screening:

def diversity_selection(smiles_list, n_select=100):
    """Select diverse subset using MaxMin algorithm"""
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2) for m in mols]

    # Calculate distance matrix
    n_mols = len(fps)
    dist_matrix = np.zeros((n_mols, n_mols))

    for i in range(n_mols):
        for j in range(i+1, n_mols):
            similarity = DataStructs.TanimotoSimilarity(fps[i], fps[j])
            distance = 1 - similarity
            dist_matrix[i,j] = dist_matrix[j,i] = distance

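    # MaxMin selection
    if n_mols == 0:
        return []
    n_select = min(n_select, n_mols)  # cannot select more compounds than exist
    selected = [0]  # Start with first compound

    for _ in range(n_select - 1):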
        min_distances = []
        for i in range(n_mols):
            if i not in selected:
                min_dist = min(dist_matrix[i][j] for j in selected)
                min_distances.append((min_dist, i))

        # Select compound with maximum minimum distance
        selected.append(max(min_distances)[1])

    return [smiles_list[i] for i in selected]
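The full distance matrix above grows quadratically with library size. A lighter variant keeps only each candidate's distance to its nearest selected compound and updates it after every pick. The sketch below assumes fingerprints stored as sets of on-bit positions; the function names and toy data are illustrative.

```python
def tanimoto_distance(a, b):
    union = len(a | b)
    return 1.0 - (len(a & b) / union if union else 0.0)

def maxmin_pick(fps, n_select, first=0):
    """MaxMin selection that tracks each candidate's distance to its nearest pick."""
    n_select = min(n_select, len(fps))
    if n_select == 0:
        return []
    selected = [first]
    # min_dist[i] = distance from compound i to the closest selected compound
    min_dist = [tanimoto_distance(fp, fps[first]) for fp in fps]
    min_dist[first] = -1.0  # negative value marks "already selected"
    while len(selected) < n_select:
        nxt = max(range(len(fps)), key=lambda i: min_dist[i])
        selected.append(nxt)
        # Only distances to the newly added compound can lower a minimum
        for i, fp in enumerate(fps):
            if min_dist[i] >= 0.0:
                min_dist[i] = min(min_dist[i], tanimoto_distance(fp, fps[nxt]))
        min_dist[nxt] = -1.0
    return selected

# Toy fingerprints (hypothetical data)
fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 2, 3, 4}, {7, 8, 10}]
print(maxmin_pick(fps, 3))  # [0, 2, 1]
```

This needs memory proportional to the number of compounds rather than its square, at the cost of recomputing distances to each newly picked compound.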

Advanced Fingerprint Methods

1. Learned Fingerprints

Neural Network-Based:

# Example using Deep Learning (conceptual)
import torch
import torch.nn as nn

class MolecularAutoencoder(nn.Module):
    def __init__(self, input_dim=1024, latent_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 512),
            nn.ReLU(),
            nn.Linear(512, latent_dim)
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512),
            nn.ReLU(),
            nn.Linear(512, input_dim),
            nn.Sigmoid()
        )

    def forward(self, x):
        latent = self.encoder(x)
        reconstructed = self.decoder(latent)
        return reconstructed, latent

2. 3D Pharmacophore Fingerprints

Capture spatial arrangements of features:

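# Using RDKit's feature factory on a 3D conformer
import os
from rdkit import RDConfig
from rdkit.Chem import ChemicalFeatures

def calculate_3d_fingerprint(mol):
    """Extract pharmacophore features (with 3D positions) from an embedded conformer"""
    # Generate 3D coordinates
    mol = Chem.AddHs(mol)
    if AllChem.EmbedMolecule(mol, randomSeed=42) == -1:
        raise ValueError("3D embedding failed")
    AllChem.MMFFOptimizeMolecule(mol)

    # Extract pharmacophore features using the feature definitions bundled with RDKit
    fdef_path = os.path.join(RDConfig.RDDataDir, 'BaseFeatures.fdef')
    factory = ChemicalFeatures.BuildFeatureFactory(fdef_path)
    features = factory.GetFeaturesForMol(mol)

    # Note: this returns the feature list, the raw input for a 3D fingerprint
    return features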

Best Practices

1. Choosing the Right Fingerprint

For general similarity:

For substructure queries:

For 3D similarity:

2. Optimization Strategies

def optimize_fingerprint_search(query_fp, database_fps, threshold=0.7):
    """Optimized similarity search using numpy"""
    # Convert to numpy arrays
    query_array = np.array(query_fp)
    db_arrays = np.array([np.array(fp) for fp in database_fps])

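    # Vectorized Tanimoto calculation (the query is broadcast against every database row)
    intersection = np.sum(query_array & db_arrays, axis=1)
    union = np.sum(query_array | db_arrays, axis=1)
    # Guard against division by zero when neither fingerprint has any bit set
    similarities = np.divide(intersection, union, out=np.zeros(len(union)), where=union > 0)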

    # Return indices above threshold
    hits = np.where(similarities >= threshold)[0]
    return list(zip(hits, similarities[hits]))
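Another common speed-up skips the comparison entirely when bit counts alone rule out a hit: the Tanimoto coefficient can never exceed min(|A|, |B|) / max(|A|, |B|). The sketch below packs each fingerprint into a Python integer; the function names and the toy 16-bit fingerprints are assumptions made for illustration.

```python
def popcount(x):
    return bin(x).count("1")

def tanimoto_int(a, b):
    """Tanimoto on fingerprints packed into Python ints (one bit per position)."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 0.0

def bounded_search(query, database, threshold=0.7):
    """Skip candidates whose bit counts alone rule out reaching the threshold."""
    q_bits = popcount(query)
    hits, skipped = [], 0
    for idx, fp in enumerate(database):
        d_bits = popcount(fp)
        larger = max(q_bits, d_bits)
        # Upper bound on Tanimoto from the two bit counts
        upper = min(q_bits, d_bits) / larger if larger else 0.0
        if upper < threshold:
            skipped += 1
            continue
        sim = tanimoto_int(query, fp)
        if sim >= threshold:
            hits.append((idx, sim))
    return hits, skipped

query = 0b11110000
database = [0b11110000, 0b11100000, 0b00000001, 0b1111111111110000, 0b11110001]
hits, skipped = bounded_search(query, database, threshold=0.7)
print(hits)     # [(0, 1.0), (1, 0.75), (4, 0.8)]
print(skipped)  # 2
```

If the database is sorted by bit count in advance, the same bound lets a search restrict itself to a contiguous range of candidates.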

3. Validation and Benchmarking

Always validate fingerprint methods:
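One widely used retrospective check is the enrichment factor: rank a set of known actives and decoys by similarity to a query, then compare the active rate in the top fraction of the list with the active rate overall. The sketch below is a minimal version under stated assumptions; the function name, the rounding of the top-fraction size, and the example ranking are hypothetical.

```python
def enrichment_factor(ranked_labels, fraction=0.1):
    """EF = (active rate in the top fraction) / (active rate in the whole list)."""
    n_total = len(ranked_labels)
    n_actives = sum(ranked_labels)
    if n_total == 0 or n_actives == 0:
        return 0.0
    n_top = max(1, int(round(n_total * fraction)))
    top_hits = sum(ranked_labels[:n_top])
    return (top_hits / n_top) / (n_actives / n_total)

# Hypothetical screening result, ordered by descending similarity:
# 1 = known active, 0 = decoy
ranked = [1, 1, 0, 1, 0, 0, 0, 0, 1, 0,
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# Top 20% holds 4 compounds with 3 actives; overall 4 of 20 are active,
# so the value is (3/4) / (4/20), roughly 3.75
print(enrichment_factor(ranked, fraction=0.2))
```

A value near 1 means the fingerprint ranks actives no better than random ordering; larger values indicate that actives are concentrated near the top of the list.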

Limitations and Considerations

1. Activity Cliffs

Structurally similar compounds can have very different activities:
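Such pairs can be flagged by combining fingerprint similarity with measured potency. The sketch below uses the Structure-Activity Landscape Index, SALI = |activity difference| / (1 - similarity), on fingerprints stored as sets of on-bit positions. The compound names, activity values, and both thresholds are hypothetical and chosen only to make the example concrete.

```python
def tanimoto(a, b):
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def find_activity_cliffs(compounds, min_similarity=0.7, min_activity_gap=2.0):
    """Return (name_i, name_j, similarity, SALI) for similar pairs with a large potency gap."""
    cliffs = []
    for i in range(len(compounds)):
        for j in range(i + 1, len(compounds)):
            name_i, fp_i, act_i = compounds[i]
            name_j, fp_j, act_j = compounds[j]
            sim = tanimoto(fp_i, fp_j)
            gap = abs(act_i - act_j)
            if sim >= min_similarity and gap >= min_activity_gap:
                # SALI is unbounded when the two fingerprints are identical
                sali = gap / (1 - sim) if sim < 1.0 else float("inf")
                cliffs.append((name_i, name_j, sim, sali))
    return cliffs

# Hypothetical compounds: (name, on-bit set, pIC50)
compounds = [
    ("cpd_A", {1, 2, 3, 4, 5, 6, 7, 8}, 8.5),
    ("cpd_B", {1, 2, 3, 4, 5, 6, 7, 9}, 5.0),
    ("cpd_C", {20, 21, 22}, 8.4),
]

for name_i, name_j, sim, sali in find_activity_cliffs(compounds):
    print(f"{name_i} / {name_j}: similarity {sim:.2f}, SALI {sali:.2f}")
```

Here cpd_A and cpd_B share 7 of 9 distinct bits yet differ by 3.5 log units in potency, so the pair is reported; cpd_C is as potent as cpd_A but shares no bits with it, so it is not.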

2. Scaffold Hopping

Fingerprints may miss:

3. Computational Considerations

Conclusion

Molecular fingerprints remain a cornerstone of cheminformatics, enabling rapid similarity searches across millions of compounds. While they have limitations, when used thoughtfully they provide powerful tools for drug discovery.

Key takeaways:

As machine learning advances, we’re seeing hybrid approaches that combine traditional fingerprints with learned representations, offering the best of both interpretability and power.


In the vast chemical space of possible molecules, fingerprints provide the compass to navigate toward biologically relevant compounds.