Spaces: MooseML/homo-lumo-gap-predictor

MooseML committed on 30 days ago

Commit

e3eae4d

1 Parent(s): 6568ddd

Initial Streamlit Docker app

Files changed (11)

.dockerignore +9 -0
Dockerfile +46 -0
README.md +82 -7
__pycache__/model.cpython-38.pyc +0 -0
__pycache__/utils.cpython-38.pyc +0 -0
app.py +129 -0
best_hybridgnn.pt +3 -0
model.py +50 -0
predictions.db +0 -0
requirements.txt +8 -0
utils.py +47 -0

.dockerignore ADDED

@@ -0,0 +1,9 @@

+__pycache__
+*.pyc
+*.pkl
+*.sqlite
+.git
+*.csv
+*.db
+*.log
+venv/

Dockerfile ADDED

@@ -0,0 +1,46 @@

+#  Dockerfile for Hugging Face Space: Streamlit + RDKit + PyG
+FROM python:3.10-slim
+#  system libraries (needed by RDKit / Pillow)
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends \
+        build-essential \
+        libxrender1 \
+        libxext6 \
+        libsm6 \
+        libx11-6 \
+        libglib2.0-0 \
+        libfreetype6 \
+        libpng-dev \
+        wget && \
+    rm -rf /var/lib/apt/lists/*
+#  Python packages
+RUN pip install --no-cache-dir --upgrade pip && \
+    pip install --no-cache-dir \
+        streamlit==1.45.0 \
+        rdkit-pypi==2022.9.5 \
+        pandas==2.2.3 \
+        numpy==1.26.4 \
+        torch==2.2.0 \
+        torch-geometric==2.5.2 \
+        ogb==1.3.6 \
+        pillow==10.3.0
+#  working directory & app code
+WORKDIR /app
+COPY . .
+#  Streamlit configuration for Spaces
+ENV \
+    STREAMLIT_SERVER_HEADLESS=true \
+    STREAMLIT_SERVER_ADDRESS=0.0.0.0 \
+    STREAMLIT_SERVER_PORT=7860 \
+    STREAMLIT_BROWSER_GATHER_USAGE_STATS=false
+EXPOSE 7860
+#  launch
+CMD ["streamlit", "run", "app.py"]

README.md CHANGED

@@ -1,10 +1,85 @@
 ---
-title: Homo Lumo Gap Predictor
-emoji: 🐨
-colorFrom: green
-colorTo: gray
-sdk: docker
-pinned: false
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

+# HOMO–LUMO Gap Predictor
+This web app uses a trained Graph Neural Network (GNN) to predict HOMO–LUMO energy gaps from molecular SMILES strings. Built with [Streamlit](https://streamlit.io), it enables fast single or batch predictions with visualization.
+### Live App
+[Click here to launch the app](https://www.willfillinoncedeployed.com)
+---
+## Features
+- Predict HOMO–LUMO gap for one or many molecules
+- Accepts comma-separated SMILES or CSV uploads
+- RDKit rendering of molecule structures
+- Downloadable CSV of predictions
+- Powered by a trained hybrid GNN model with RDKit descriptors
 ---
+## Usage
+1. **Input Options**:
+   - Type one or more SMILES strings separated by commas
+   - OR upload a `.csv` file with a single column of SMILES
+2. **Example SMILES**: CC(=O)Oc1ccccc1C(=O)O, C1=CC=CC=C1
+3. **CSV Format**:
+   - One column
+   - No header
+   - Each row contains a SMILES string
+4. **Output**:
+   - Predictions displayed in-browser (up to 10 molecules shown)
+   - Full results available for download as CSV
+---
+## Project Structure
+streamlit-app/
+│
+├── app.py # Main Streamlit app
+├── model.py # Hybrid GNN architecture and model loader
+├── utils.py # RDKit and SMILES processing
+├── requirements.txt # Python dependencies
+└── predictions.db # SQLite log of predictions
 ---
+## Requirements
+To run locally:
+```
+pip install -r requirements.txt
+streamlit run app.py
+```
+## Model Info
+The app uses a trained hybrid GNN model combining:
+* AtomEncoder and BondEncoder from OGB
+* GINEConv layers from PyTorch Geometric
+* Global mean pooling
+* RDKit-based physicochemical descriptors
+Trained on the [OGB PCQM4Mv2 dataset](https://ogb.stanford.edu/docs/lsc/pcqm4mv2/), with hyperparameters optimized using Optuna.
+## Author
+Developed by [Matthew Graham](https://github.com/MooseML)
+For inquiries, collaborations, or ideas, feel free to reach out!
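
The input rules in the README's Usage section (comma-separated text, or a one-column CSV with no header) can be sketched with the standard library alone. The function names `parse_smiles_text` and `parse_smiles_csv` are hypothetical and do not appear in the committed code; this is a minimal illustration under those stated rules, not the app's implementation.

```python
import csv
import io


def parse_smiles_text(text):
    """Split typed input on commas and newlines, dropping empty entries."""
    parts = text.replace("\n", ",").split(",")
    return [p.strip() for p in parts if p.strip()]


def parse_smiles_csv(csv_text):
    """Read a header-less, single-column CSV; every row is one SMILES string."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if any(len(row) != 1 for row in rows):
        raise ValueError("CSV should have exactly one column of SMILES strings")
    return [row[0].strip() for row in rows if row[0].strip()]


# The two example molecules from the README, entered both ways.
typed = parse_smiles_text("CC(=O)Oc1ccccc1C(=O)O, C1=CC=CC=C1")
uploaded = parse_smiles_csv("CC(=O)Oc1ccccc1C(=O)O\nC1=CC=CC=C1\n")
print(typed == uploaded)
```

Because the file has no header row, the first line is treated as data, so both input paths yield the same list.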

__pycache__/model.cpython-38.pyc ADDED

Binary file (2.16 kB)

__pycache__/utils.cpython-38.pyc ADDED

Binary file (1.62 kB)

app.py ADDED

@@ -0,0 +1,129 @@

+import streamlit as st
+import pandas as pd
+import torch
+import sqlite3
+from datetime import datetime
+from rdkit import Chem
+from rdkit.Chem import Draw
+from model import load_model
+from utils import smiles_to_data
+from torch_geometric.loader import DataLoader
+# Config
+DEVICE = "cpu"
+RDKIT_DIM = 6
+MODEL_PATH = "best_hybridgnn.pt"
+MAX_DISPLAY = 10
+# Load Model
+model = load_model(rdkit_dim=RDKIT_DIM, path=MODEL_PATH, device=DEVICE)
+# SQLite Setup
+@st.cache_resource
+def init_db():
+    conn = sqlite3.connect("predictions.db", check_same_thread=False)
+    c = conn.cursor()
+    c.execute("""
+        CREATE TABLE IF NOT EXISTS predictions (
+            id INTEGER PRIMARY KEY AUTOINCREMENT,
+            smiles TEXT,
+            prediction REAL,
+            timestamp TEXT
+        )
+    """)
+    conn.commit()
+    return conn
+conn = init_db()
+cursor = conn.cursor()
+# Streamlit UI
+st.title("HOMO-LUMO Gap Predictor")
+st.markdown("""
+This app predicts the HOMO-LUMO energy gap for molecules using a trained Graph Neural Network (GNN).
+**Instructions:**
+- Enter a **single SMILES** string or **comma-separated list** in the box below.
+- Or **upload a CSV file** containing a single column of SMILES strings.
+- **Note**: If you've uploaded a CSV and want to switch to typing SMILES, please click the “X” next to the uploaded file to clear it.
+- SMILES format should look like: `CC(=O)Oc1ccccc1C(=O)O` (for aspirin).
+- The app will display predictions and molecule images (up to 10 shown at once).
+""")
+# Text Input
+smiles_input = st.text_area("Enter SMILES string(s)", placeholder="C1=CC=CC=C1, CC(=O)Oc1ccccc1C(=O)O")
+# File Upload
+uploaded_file = st.file_uploader("...or upload a CSV file", type=["csv"])
+smiles_list = []
+if uploaded_file:
+    try:
+        df = pd.read_csv(uploaded_file, header=None)  # README: the CSV has no header row
+        if df.shape[1] != 1:
+            st.error("CSV should have only one column with SMILES strings.")
+        else:
+            smiles_list = df.iloc[:, 0].dropna().astype(str).tolist()
+            st.success(f"{len(smiles_list)} SMILES loaded from file.")
+    except Exception as e:
+        st.error(f"Error reading CSV: {e}")
+elif smiles_input:
+    raw_input = smiles_input.strip().replace("\n", ",")
+    smiles_list = [smi.strip() for smi in raw_input.split(",") if smi.strip()]
+    st.success(f"{len(smiles_list)} SMILES parsed from input.")
+# Run Inference
+if smiles_list:
+    with st.spinner("Processing molecules..."):
+        data_list = smiles_to_data(smiles_list, device=DEVICE)
+        # Filter only valid molecules and keep aligned SMILES
+        valid_pairs = [(smi, data) for smi, data in zip(smiles_list, data_list) if data is not None]
+        if not valid_pairs:
+            st.warning("No valid molecules found.")
+        else:
+            valid_smiles, valid_data = zip(*valid_pairs)
+            loader = DataLoader(valid_data, batch_size=64)
+            predictions = []
+            for batch in loader:
+                batch = batch.to(DEVICE)
+                with torch.no_grad():
+                    pred = model(batch).view(-1).cpu().numpy()
+                    predictions.extend(pred.tolist())
+            # Display Results
+            st.subheader(f"Predictions (showing up to {MAX_DISPLAY} molecules):")
+            for i, (smi, pred) in enumerate(zip(valid_smiles, predictions)):
+                if i >= MAX_DISPLAY:
+                    st.info(f"...only showing the first {MAX_DISPLAY} molecules")
+                    break
+                mol = Chem.MolFromSmiles(smi)
+                if mol:
+                    st.image(Draw.MolToImage(mol, size=(250, 250)))
+                st.write(f"**SMILES**: `{smi}`")
+                st.write(f"**Predicted HOMO-LUMO Gap**: `{pred:.4f} eV`")
+                # Log to SQLite
+                cursor.execute("INSERT INTO predictions (smiles, prediction, timestamp) VALUES (?, ?, ?)",
+                               (smi, pred, str(datetime.now())))
+                conn.commit()
+            # Download Results
+            result_df = pd.DataFrame({
+                "SMILES": valid_smiles,
+                "Predicted HOMO-LUMO Gap (eV)": [round(p, 4) for p in predictions]
+            })
+            st.download_button(
+                label="Download Predictions as CSV",
+                data=result_df.to_csv(index=False).encode('utf-8'),
+                file_name="homolumo_predictions.csv",
+                mime="text/csv"
+            )
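
In app.py the `INSERT` and `commit` sit inside the display loop, so rows are written one at a time and only for the molecules shown before the `MAX_DISPLAY` break. One possible alternative, sketched here under the assumption that every prediction should be logged, batches the inserts in a single transaction. The helper name `log_predictions`, the in-memory database, and the numeric values are illustrative only.

```python
import sqlite3
from datetime import datetime


def log_predictions(conn, smiles, predictions):
    """Insert every (smiles, prediction) pair in one transaction."""
    now = datetime.now().isoformat()
    rows = [(smi, float(pred), now) for smi, pred in zip(smiles, predictions)]
    with conn:  # commits on success, rolls back if an insert fails
        conn.executemany(
            "INSERT INTO predictions (smiles, prediction, timestamp) VALUES (?, ?, ?)",
            rows,
        )
    return len(rows)


conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS predictions (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        smiles TEXT,
        prediction REAL,
        timestamp TEXT
    )
    """
)
# Made-up prediction values; they are not model output.
inserted = log_predictions(conn, ["CCO", "C1=CC=CC=C1"], [7.1, 6.5])
count = conn.execute("SELECT COUNT(*) FROM predictions").fetchone()[0]
print(inserted, count)
```

Calling such a helper once after inference, outside the display loop, would decouple what is logged from what is rendered.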

best_hybridgnn.pt ADDED

@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:e3cd6a7f4297f6451cf159ac6a5745ae0edde7c6c481308ff407e065eec2828c
+size 5259810

model.py ADDED

@@ -0,0 +1,50 @@

+import torch
+from torch.nn import Linear, Dropout, Module, Sequential
+from torch_geometric.nn import GINEConv, global_mean_pool
+from ogb.graphproppred.mol_encoder import AtomEncoder, BondEncoder
+class HybridGNN(Module):
+    def __init__(self, gnn_dim, rdkit_dim, hidden_dim, dropout_rate=0.2, activation='ReLU'):
+        super().__init__()
+        act_map = {'Swish': torch.nn.SiLU(), 'ReLU': torch.nn.ReLU()}
+        act_fn = act_map[activation]
+        self.gnn_dim = gnn_dim
+        self.rdkit_dim = rdkit_dim
+        self.atom_encoder = AtomEncoder(emb_dim=gnn_dim)
+        self.bond_encoder = BondEncoder(emb_dim=gnn_dim)
+        self.conv1 = GINEConv(Sequential(Linear(gnn_dim, gnn_dim), act_fn, Linear(gnn_dim, gnn_dim)))
+        self.conv2 = GINEConv(Sequential(Linear(gnn_dim, gnn_dim), act_fn, Linear(gnn_dim, gnn_dim)))
+        self.pool = global_mean_pool
+        self.mlp = Sequential(Linear(gnn_dim + rdkit_dim, hidden_dim), act_fn,
+                              Dropout(dropout_rate),
+                              Linear(hidden_dim, hidden_dim // 2), act_fn,
+                              Dropout(dropout_rate),
+                              Linear(hidden_dim // 2, 1))
+    def forward(self, data):
+        x = self.atom_encoder(data.x)
+        edge_attr = self.bond_encoder(data.edge_attr)
+        x = self.conv1(x, data.edge_index, edge_attr)
+        x = self.conv2(x, data.edge_index, edge_attr)
+        x = self.pool(x, data.batch)
+        rdkit_feats = getattr(data, 'rdkit_feats', None)
+        if rdkit_feats is not None:
+            if x.shape[0] != rdkit_feats.shape[0]:
+                raise ValueError(f"Shape mismatch: GNN output ({x.shape}) vs rdkit_feats ({rdkit_feats.shape})")
+            x = torch.cat([x, rdkit_feats], dim=1)
+        else:
+            raise ValueError("RDKit features not found in the data object.")
+        return self.mlp(x)
+def load_model(rdkit_dim: int, path: str = "best_hybridgnn.pt", device: str = "cpu"):
+    model = HybridGNN(gnn_dim=512, rdkit_dim=rdkit_dim, hidden_dim=256, dropout_rate=0.29, activation='Swish')
+    model.load_state_dict(torch.load(path, map_location=device))
+    model.to(device)
+    model.eval()
+    return model
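
The "hybrid" step in `HybridGNN.forward` pools node embeddings to one row per graph and then concatenates that row with the graph's RDKit descriptor row. The shape bookkeeping can be illustrated without PyTorch. The functions below are hypothetical plain-Python stand-ins for `global_mean_pool` and `torch.cat(..., dim=1)`, with toy numbers, and assume every graph id in `batch` owns at least one node.

```python
def mean_pool_per_graph(node_feats, batch):
    """Average node feature rows per graph; batch[i] is the graph id of node i."""
    num_graphs = max(batch) + 1
    dim = len(node_feats[0])
    sums = [[0.0] * dim for _ in range(num_graphs)]
    counts = [0] * num_graphs
    for feats, g in zip(node_feats, batch):
        counts[g] += 1
        for j, value in enumerate(feats):
            sums[g][j] += value
    return [[s / counts[g] for s in sums[g]] for g in range(num_graphs)]


def concat_descriptors(pooled, descriptor_rows):
    """Append each graph's descriptor row to its pooled embedding."""
    if len(pooled) != len(descriptor_rows):
        raise ValueError("one descriptor row is needed per graph")
    return [p + d for p, d in zip(pooled, descriptor_rows)]


# Two toy graphs: nodes 0 and 1 belong to graph 0, node 2 to graph 1.
node_feats = [[1.0, 3.0], [3.0, 5.0], [10.0, 20.0]]
batch = [0, 0, 1]
pooled = mean_pool_per_graph(node_feats, batch)
hybrid = concat_descriptors(pooled, [[0.5], [0.25]])  # made-up descriptor values
print(hybrid)
```

In the committed model the pooled width is `gnn_dim` and the descriptor width is `rdkit_dim`, which is why the first MLP layer takes `gnn_dim + rdkit_dim` inputs.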

predictions.db ADDED

Binary file (24.6 kB)

requirements.txt ADDED

@@ -0,0 +1,8 @@

+streamlit==1.45.0
+rdkit-pypi==2022.9.5
+pandas==2.2.3
+numpy==1.26.4
+torch==2.2.0
+torch-geometric==2.5.2
+ogb==1.3.6
+pillow==10.3.0

utils.py ADDED

@@ -0,0 +1,47 @@

+import numpy as np
+import torch
+from rdkit import Chem
+from rdkit.Chem import Descriptors
+from torch_geometric.data import Data
+from ogb.utils.features import get_atom_feature_dims, get_bond_feature_dims
+from ogb.graphproppred.mol_encoder import AtomEncoder, BondEncoder
+from ogb.lsc import PCQM4Mv2Evaluator
+from ogb.utils import smiles2graph
+from torch_geometric.loader import DataLoader
+def compute_rdkit_features(smiles):
+    mol = Chem.MolFromSmiles(smiles)
+    if mol is None:
+        raise ValueError("Invalid SMILES")
+    return [
+        Descriptors.MolWt(mol),
+        Descriptors.NumRotatableBonds(mol),
+        Descriptors.TPSA(mol),
+        Descriptors.NumHAcceptors(mol),
+        Descriptors.NumHDonors(mol),
+        Descriptors.RingCount(mol)
+    ]
+def smiles_to_data(smiles_list, device="cpu"):
+    graph_list = []
+    rdkit_list = []
+    for smi in smiles_list:
+        try:
+            graph = smiles2graph(smi)
+            rdkit_feats = compute_rdkit_features(smi)
+            data = Data(
+                x=torch.tensor(graph['node_feat'], dtype=torch.long),
+                edge_index=torch.tensor(graph['edge_index'], dtype=torch.long),
+                edge_attr=torch.tensor(graph['edge_feat'], dtype=torch.long),
+                rdkit_feats=torch.tensor(rdkit_feats, dtype=torch.float32).unsqueeze(0),
+                num_nodes=graph['num_nodes']
+            )
+            graph_list.append(data)
+        except Exception as e:
+            print(f"Error with SMILES '{smi}': {e}")
+            graph_list.append(None)  # placeholder keeps results aligned with smiles_list
+    return graph_list
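
app.py pairs `smiles_list` with the list returned by `smiles_to_data` via `zip` and filters on `data is not None`, which only lines up if the converter returns exactly one entry per input. A minimal sketch of that contract follows. `convert_all` and `toy_convert` are hypothetical names, and `toy_convert` is a stand-in for the real `smiles2graph` plus descriptor computation, not a SMILES validator.

```python
def convert_all(smiles_list, convert):
    """Return one entry per input: the converted object, or None on failure."""
    results = []
    for smi in smiles_list:
        try:
            results.append(convert(smi))
        except Exception as exc:
            print(f"Error with SMILES '{smi}': {exc}")
            results.append(None)  # placeholder keeps positions aligned
    return results


def toy_convert(smi):
    """Hypothetical converter: rejects any string that contains a space."""
    if " " in smi:
        raise ValueError("Invalid SMILES")
    return {"smiles": smi, "length": len(smi)}


inputs = ["CCO", "not a smiles", "C1=CC=CC=C1"]
outputs = convert_all(inputs, toy_convert)
valid_pairs = [(s, d) for s, d in zip(inputs, outputs) if d is not None]
print([s for s, _ in valid_pairs])
```

With a placeholder for each failure, a bad entry in the middle of the list cannot shift later molecules onto the wrong prediction.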