Spaces:

ImedHa
/

hackathon_acvss

Sleeping

App Files Files Community

ImedHa commited on 19 days ago

Commit

ee412eb

verified ·

1 Parent(s): 02bb97a

Upload 7 files

Browse files

Files changed (6) hide show

about_page.py +148 -0
app.py +29 -128
datasets_page.py +91 -0
main_dashboard.py +37 -0
s2-swinunetr-weights.pth +3 -0
system_test_page.py +262 -0

about_page.py ADDED Viewed

	@@ -0,0 +1,148 @@

+import streamlit as st
+import os
+def show():
+    st.markdown('<div class="main-header">ℹ️ About This Project</div>', unsafe_allow_html=True)
+    # ACVSS Hackathon Information
+    st.markdown("## ACVSS 2025 Summer School Hackathon Project")
+    st.info(
+        "This project was developed by **Team SATOR** as part of the **ACVSS 2025 - The 4th Summer School on Advanced Computer Vision** hackathon. "
+        "Our goal was to build a functional prototype for surgical scene understanding in a limited time frame."
+    )
+    # ACVSS Description
+    st.markdown("""
+    ### About ACVSS
+    The **African Computer Vision Summer School (ACVSS)** is an intensive program designed to advance computer vision research and applications across Africa. The summer school brings together researchers, students, and industry professionals to explore cutting-edge technologies in computer vision, machine learning, and artificial intelligence.
+    **Learn more**: [acvss.ai](https://www.acvss.ai/) | **Year**: 2025 | **Edition**: 4th Summer School
+    """)
+    st.markdown("---")
+    # Team Section
+    st.markdown("## 👥 Meet Team SATOR")
+    # Add team description
+    st.markdown("""
+    **Team SATOR** is a diverse group of professionals brought together for the ACVSS 2025 hackathon.
+    Our team combines expertise in AI/ML, software engineering, data science, and quality assurance to deliver
+    innovative solutions in surgical scene understanding.
+    """)
+    st.markdown("### Team Members")
+    # Team Member Profiles
+    team_members = [
+        {
+            "name": "MEM1",
+            "role": "Team Lead & System Architect",
+            "desc": "Led the project, designed the overall system architecture, and ensured seamless integration of all components. Her vision guided the project's success.",
+            "email": "[email protected]",
+            "linkedin": "https://www.linkedin.com/in/evelyn-reed-acvss",
+            "github": "https://github.com/evelyn-reed",
+            "img": "https://i.pravatar.cc/150?img=1"
+        },
+        {
+            "name": "MEM2",
+            "role": "AI/ML Specialist",
+            "desc": "Focused on developing and training the core SwinUnet and scene understanding models. Responsible for the AI-powered analysis and insights.",
+            "email": "[email protected]",
+            "linkedin": "https://www.linkedin.com/in/kenji-tanaka-ml",
+            "github": "https://github.com/kenji-tanaka",
+            "img": "https://i.pravatar.cc/150?img=2"
+        },
+        {
+            "name": "MEM3",
+            "role": "UI/UX & Frontend Developer",
+            "desc": "Designed and built the Streamlit dashboard, focusing on creating an intuitive and informative user interface for surgeons and researchers.",
+            "email": "[email protected]",
+            "linkedin": "https://www.linkedin.com/in/sofia-rossi-ui",
+            "github": "https://github.com/sofia-rossi",
+            "img": "https://i.pravatar.cc/150?img=3"
+        },
+        {
+            "name": "MEM4",
+            "role": "Data Engineer",
+            "desc": "Managed the data pipeline, from processing the MM-OR dataset to ensuring the models received clean, well-structured data for training and testing.",
+            "email": "[email protected]",
+            "linkedin": "https://www.linkedin.com/in/david-chen-data",
+            "github": "https://github.com/david-chen",
+            "img": "https://i.pravatar.cc/150?img=4"
+        },
+        {
+            "name": "MEM5",
+            "role": "QA & Testing Lead",
+            "desc": "Oversaw the testing and validation of the entire pipeline, ensuring the system was robust, accurate, and met the project's objectives.",
+            "email": "[email protected]",
+            "linkedin": "https://www.linkedin.com/in/aisha-bello-qa",
+            "github": "https://github.com/aisha-bello",
+            "img": "https://i.pravatar.cc/150?img=5"
+        }
+    ]
+    # Display team members in columns
+    # Display team members in a responsive grid
+    cols = st.columns(5)
+    for i, member in enumerate(team_members):
+        with cols[i]:
+            st.markdown(f"##### {member['name']}")
+            st.image(member['img'], width=120)
+            st.markdown(f"**{member['role']}**")
+            st.caption(member['desc'])
+            st.markdown(f"✉️ [{member['email']}](mailto:{member['email']})")
+            st.markdown(f"💼 [LinkedIn]({member['linkedin']})")
+            st.markdown(f"💻 [GitHub]({member['github']})")
+    st.markdown("---")
+    # Project Overview Section
+    st.markdown("## 🎯 Project Overview")
+    col1, col2 = st.columns(2)
+    with col1:
+        st.markdown("""
+        ### 🏥 Video Surgical Scene Understanding
+        Our project focuses on developing an advanced computer vision system capable of:
+        - **Scene Analysis**: Understanding surgical environments
+        - **Tool Recognition**: Identifying medical instruments
+        - **Workflow Tracking**: Monitoring surgical procedures
+        - **Real-time Processing**: Immediate analysis and feedback
+        """)
+    with col2:
+        st.markdown("""
+        ### 🛠️ Technical Stack
+        - **Frontend**: Streamlit Dashboard
+        - **Backend**: Python
+        - **ML Models**: SwinUnet, Scene Graphs
+        - **Dataset**: MM-OR (Multimodal Operating Room)
+        - **Version**: v1.0 (July 2025)
+        """)
+    st.markdown("---")
+    # Hackathon Achievement Section
+    st.markdown("## 🏆 Hackathon Achievement")
+    achievement_col1, achievement_col2, achievement_col3 = st.columns(3)
+    with achievement_col1:
+        st.metric("Pipeline Version", "v1.0", "Completed")
+    with achievement_col2:
+        st.metric("Models Integrated", "2/2", "✅ Working")
+    with achievement_col3:
+        st.metric("Development Time", "Hackathon", "July 2025")
+    st.markdown("---")
+    st.markdown("© 2025 Team SATOR - ACVSS Hackathon. All Rights Reserved.")

app.py CHANGED Viewed

@@ -1,128 +1,29 @@
-import streamlit as st
-from PIL import Image
-import torch
-import os
-from io import StringIO
-import sys
-# --- TorchDynamo Fix for Unsloth/MedGemma ---
-import torch._dynamo
-torch._dynamo.config.capture_scalar_outputs = True
-torch.compiler.disable()
-# --- Dependency Handling ---
-try:
-    from unsloth import FastVisionModel
-    from transformers import TextStreamer
-except ImportError as e:
-    st.error(f"A required library is not installed. Please install dependencies. Error: {e}")
-    st.stop()
-@st.cache_resource
-def load_medgemma_model():
-    """Loads the MedGemma vision-language model in eager mode."""
-    try:
-        model, processor = FastVisionModel.from_pretrained(
-            "fiqqy/MedGemma-MM-OR-FT10",
-            load_in_4bit=False,
-            use_gradient_checkpointing="unsloth",
-        )
-        return model, processor
-    except Exception as e:
-        st.error(f"Error loading MedGemma model: {e}")
-        return None, None
-def run_captioning(medgemma_model, processor, frames, instruction):
-    """Runs MedGemma inference using 3 frames and an instruction."""
-    st.write("Preparing inputs for MedGemma...")
-    images = [f.convert("RGB") for f in frames]
-    messages = [
-        {"role": "user", "content": [
-            {"type": "image"}, {"type": "image"}, {"type": "image"},
-            {"type": "text", "text": instruction},
-        ]},
-    ]
-    input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
-    device = "cuda" if torch.cuda.is_available() else "cpu"
-    inputs = processor(
-        images, input_text, add_special_tokens=False, return_tensors="pt",
-    ).to(device)
-    text_streamer = TextStreamer(processor, skip_prompt=True)
-    old_stdout = sys.stdout
-    sys.stdout = captured_output = StringIO()
-    st.write("Running MedGemma Analysis...")
-    torch._dynamo.disable()
-    medgemma_model.generate(
-        **inputs, streamer=text_streamer, max_new_tokens=768,
-        use_cache=True, temperature=1.0, top_p=0.95, top_k=64
-    )
-    sys.stdout = old_stdout
-    result = captured_output.getvalue()
-    return result
-def show():
-    """Main function to render the Streamlit UI."""
-    st.title("MedGemma Scene Analysis System")
-    st.write("A system to test MedGemma vision-language captioning model.")
-    st.header("1. Load MedGemma Model")
-    if "medgemma_model" not in st.session_state:
-        st.session_state.medgemma_model, st.session_state.processor = None, None
-    if st.button("Load MedGemma Model"):
-        with st.spinner("Loading MedGemma... This can take several minutes."):
-            st.session_state.medgemma_model, st.session_state.processor = load_medgemma_model()
-    if st.session_state.get("medgemma_model") and st.session_state.get("processor"):
-        st.success("MedGemma model is loaded.")
-    else:
-        st.warning("MedGemma model is not loaded.")
-    st.header("2. Upload Data")
-    st.subheader("Upload Three Sequential Surgical Video Frames")
-    col1, col2, col3 = st.columns(3)
-    uploaded_files = [
-        col1.file_uploader("Upload Frame 1", type=["png", "jpg", "jpeg"], key="frame1"),
-        col2.file_uploader("Upload Frame 2", type=["png", "jpg", "jpeg"], key="frame2"),
-        col3.file_uploader("Upload Frame 3", type=["png", "jpg", "jpeg"], key="frame3")
-    ]
-    frames = [Image.open(f) for f in uploaded_files if f is not None]
-    display_size = (256, 256)
-    if len(frames) == 3:
-        st.success("All three frames have been uploaded successfully.")
-        img_cols = st.columns(3)
-        for i, frame in enumerate(frames):
-            img_cols[i].image(frame.resize(display_size), caption=f"Frame {i+1}", use_container_width=True)
-    else:
-        st.info("Please upload all three frames to proceed.")
-    st.header("3. Generate Scene Analysis")
-    instruction_prompt = st.text_area(
-        "Enter your custom instruction prompt:",
-        "Provide a detailed summary of the surgical action, noting the instruments used and their interactions."
-    )
-    can_run_analysis = (
-        st.session_state.get("medgemma_model") is not None and
-        len(frames) == 3 and
-        bool(instruction_prompt)
-    )
-    if st.button("Run Analysis", disabled=not can_run_analysis):
-        with st.spinner("Running MedGemma analysis... This may take a moment."):
-            result = run_captioning(
-                st.session_state.medgemma_model, st.session_state.processor,
-                frames, instruction_prompt
-            )
-            st.subheader("Analysis Result")
-            st.write(result)
-    if not can_run_analysis:
-        st.warning("Please ensure the MedGemma model is loaded, three frames are uploaded, and a prompt is provided.")
-if __name__ == "__main__":
-    show()

+import streamlit as st
+import main_dashboard
+import about_page
+import datasets_page
+import system_test_page
+st.set_page_config(page_title="Surgical Scene Understanding", page_icon="🩺", layout="wide")
+with st.sidebar:
+    st.markdown("## 🩺 Surgical Scene Understanding")
+    page = st.radio(
+        "Navigation",
+        [
+            "🏠 Main Dashboard",
+            "🧪 Test System",
+            "📂 Dataset",
+            "ℹ️ About"
+        ],
+        label_visibility="collapsed"
+    )
+if page.startswith("🏠"):
+    main_dashboard.show()
+elif page.startswith("🧪"):
+    system_test_page.show()
+elif page.startswith("📂"):
+    datasets_page.show()
+elif page.startswith("ℹ️"):
+    about_page.show()

datasets_page.py ADDED Viewed

	@@ -0,0 +1,91 @@

+import streamlit as st
+import pandas as pd
+import numpy as np
+import plotly.express as px
+import plotly.graph_objects as go
+def show():
+    st.markdown('<div class="main-header">📁 Dataset: MM-OR</div>', unsafe_allow_html=True)
+    st.markdown("---")
+    st.markdown("## 🗂️ MM-OR: A Large-scale Multimodal Operating Room Dataset")
+    st.markdown("""
+    This project utilizes the **MM-OR** dataset, a comprehensive collection of data recorded in a realistic operating room environment.
+    It is designed to support research in surgical workflow analysis, human activity recognition, and context-aware systems in healthcare.
+    """)
+    # Dataset overview
+    st.markdown("### 📊 Dataset High-Level Statistics")
+    col1, col2, col3, col4 = st.columns(4)
+    with col1:
+        st.metric(
+            label="📹 Surgical Procedures",
+            value="10",
+        )
+    with col2:
+        st.metric(
+            label="⏱️ Total Duration",
+            value=">100 hours",
+        )
+    with col3:
+        st.metric(
+            label="🏷️ Modalities",
+            value="3 (Video, Audio, Depth)",
+        )
+    with col4:
+        st.metric(
+            label="📂 Total Size",
+            value="~12 TB",
+        )
+    st.markdown("---")
+    # Dataset categories
+    st.markdown("### 🏥 Dataset Details")
+    st.info("The MM-OR dataset is the primary source of data for training and evaluating the models in this system.")
+    col1, col2 = st.columns(2)
+    with col1:
+        st.markdown("#### Key Features")
+        st.markdown("""
+        - **Multimodal Data**: Includes synchronized video, multi-channel audio, and depth information.
+        - **Multiple Views**: Video captured from multiple camera perspectives to provide a comprehensive view of the operating room.
+        - **Rich Annotations**: Detailed annotations of:
+            - Surgical roles (e.g., primary surgeon, assistant, nurse).
+            - Atomic actions and complex activities.
+            - Interactions between team members.
+        - **Realistic Environment**: Data was collected in a high-fidelity simulated operating room.
+        """)
+    with col2:
+        st.markdown("#### Data Modalities")
+        st.image("https://www.researchgate.net/publication/359174963/figure/fig1/AS:1143128108556288@1649553881835/An-overview-of-our-data-acquisition-system-in-the-operating-room-OR-We-record.jpg",
+                 caption="Overview of the data acquisition system in the operating room.")
+    st.markdown("---")
+    st.markdown("### 📈 Data Distribution")
+    # Create sample data for visualization
+    procedure_data = {
+        'Surgical Procedure': [f'Procedure {i+1}' for i in range(10)],
+        'Duration (hours)': np.random.uniform(8, 12, 10).round(1),
+        'Number of Annotations': np.random.randint(1500, 3000, 10)
+    }
+    df_procedures = pd.DataFrame(procedure_data)
+    fig = px.bar(df_procedures, x='Surgical Procedure', y='Duration (hours)',
+                 title='Duration per Surgical Procedure',
+                 labels={'Duration (hours)': 'Duration (hours)'},
+                 color='Surgical Procedure')
+    st.plotly_chart(fig, use_container_width=True)
+    st.markdown("For more information, please refer to the original publication: *MM-OR: A Large-scale Multimodal Operating Room Dataset for Human Activity Recognition*.")
+    st.markdown("The dataset is available on GitHub: [MM-OR Dataset](https://github.com/egeozsoy/MM-OR)")

main_dashboard.py ADDED Viewed

	@@ -0,0 +1,37 @@

+import streamlit as st
+def show():
+    st.markdown('<div class="main-header">🏥 Video Surgical Scene Understanding Dashboard</div>', unsafe_allow_html=True)
+    st.markdown("---")
+    # Welcome and overall description
+    st.markdown("## Welcome to the Surgical Scene Analysis Platform")
+    st.markdown("""
+    This platform demonstrates an end-to-end pipeline for automated understanding of surgical scenes from video data.
+    The system leverages advanced computer vision and AI models to analyze surgical workflows, recognize tools, and generate scene-level captions.
+    Navigate through the sidebar to test the system, explore datasets, or learn more about the project.
+    """)
+    st.markdown("---")
+    st.markdown("## 🔄 Pipeline Overview")
+    st.markdown("""
+    The surgical scene understanding pipeline consists of the following main steps:
+    1. **Frame Extraction**: Select or upload three consecutive frames from a surgical video.
+    2. **Segmentation**: Use the SwinUNETR model to generate a segmentation mask for the scene.
+    3. **Captioning**: Input the frames and mask into the MedGemma model to generate a descriptive caption or scene graph.
+    4. **Results & Analysis**: Review the generated mask and caption to understand the surgical context.
+    """)
+    st.markdown("---")
+    st.markdown("## 📚 Project Description")
+    st.markdown("""
+    This project was developed by **Team SATOR** for the ACVSS 2025 Hackathon.
+    Our goal is to provide an accessible, interactive demonstration of state-of-the-art surgical scene understanding using deep learning.
+    - **Frontend**: Streamlit Dashboard
+    - **Backend**: Python, PyTorch, MONAI, HuggingFace Transformers
+    - **Models**: SwinUNETR (segmentation), MedGemma (captioning)
+    - **Dataset**: MM-OR (Multimodal Operating Room)
+    """)
+    st.markdown("---")
+    st.info("Use the sidebar to start testing the system or to learn more about the dataset and team.")

s2-swinunetr-weights.pth ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:af70d2fd82d8184036623e936723bca2c80305b3b2b4e6d3c32692adc17866c7
+size 114911598

system_test_page.py ADDED Viewed

	@@ -0,0 +1,262 @@

+import streamlit as st
+from PIL import Image
+import torch
+import numpy as np
+import os
+from io import StringIO
+import sys
+import torch.nn as nn
+# --- TorchDynamo Fix for Unsloth/MedGemma ---
+import torch._dynamo
+torch._dynamo.config.capture_scalar_outputs = True
+# --- DEFINITIVE FIX FOR JIT COMPILER ERRORS ---
+torch.compiler.disable()
+# --- Dependency Handling ---
+try:
+    from monai.networks.nets import SwinUNETR
+    import torchvision.transforms as T
+    from unsloth import FastVisionModel
+    from transformers import TextStreamer
+    from s2wrapper import forward as multiscale_forward
+except ImportError as e:
+    st.error(f"A required library is not installed. Please install dependencies. Error: {e}")
+    st.stop()
+# --- Config and Model Definition ---
+class Config:
+    ORIGINAL_LABELS = [0,3,6,9,12,15,18,21,24,27,30,33,36,39,42,45,48,51,54,57,60]
+    LABEL_MAP = {val: i for i, val in enumerate(ORIGINAL_LABELS)}
+    NUM_CLASSES = len(ORIGINAL_LABELS)
+    IMG_SIZE = (256, 256)
+    FEATURE_SIZE = 48
+    DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+class multiscaleSwinUNETR(nn.Module):
+    def __init__(self, num_classes, scales=[1]):
+        super().__init__()
+        self.scales = scales
+        self.num_classes = num_classes
+        self.model = SwinUNETR(
+            spatial_dims=2,
+            in_channels=3,
+            out_channels=num_classes,
+            feature_size=Config.FEATURE_SIZE,
+            drop_rate=0.0,
+            attn_drop_rate=0.0,
+            dropout_path_rate=0.0,
+            use_checkpoint=True,
+            use_v2=True
+        )
+        self.segmentation_head = nn.Sequential(
+            nn.Conv2d(len(scales)*num_classes, num_classes, 3, padding=1),
+            nn.BatchNorm2d(num_classes),
+            nn.ReLU(inplace=True),
+            nn.Conv2d(num_classes, num_classes, 1)
+        )
+    def forward(self, x):
+        outs = multiscale_forward(self.model, x, scales=self.scales, output_shape="bchw")
+        if isinstance(outs, (list, tuple)):
+            normed = []
+            for f in outs:
+                f = f / (f.std(dim=(2, 3), keepdim=True) + 1e-6)
+                normed.append(f)
+            feats = torch.cat(normed, dim=1)
+        elif isinstance(outs, torch.Tensor) and outs.dim() == 4:
+            if len(self.scales) == 1:
+                return outs
+            feats = outs / (outs.std(dim=(2, 3), keepdim=True) + 1e-6)
+        else:
+            raise ValueError(f"Unexpected output shape/type from multiscale_forward: {type(outs)}, {getattr(outs,'shape',None)}")
+        logits = self.segmentation_head(feats)
+        return logits
+# --- Model Loading ---
+@st.cache_resource
+def load_swinunetr_model():
+    """Loads the multiscale SwinUNETR segmentation model."""
+    model_path = 's2-swinunetr-weights.pth'
+    if not os.path.exists(model_path):
+        st.error(f"Segmentation model file not found at {model_path}")
+        return None, None
+    try:
+        model = multiscaleSwinUNETR(num_classes=Config.NUM_CLASSES, scales=[1])
+        model.load_state_dict(torch.load(model_path, map_location=Config.DEVICE))
+        model.eval()
+        return model, Config
+    except Exception as e:
+        st.error(f"Error loading segmentation model: {e}")
+        return None, None
+@st.cache_resource
+def load_medgemma_model():
+    """Loads the MedGemma vision-language model in eager mode."""
+    try:
+        model, processor = FastVisionModel.from_pretrained(
+            "fiqqy/MedGemma-MM-OR-FT10",
+            load_in_4bit=False,
+            use_gradient_checkpointing="unsloth",
+        )
+        return model, processor
+    except Exception as e:
+        st.error(f"Error loading MedGemma model: {e}")
+        return None, None
+# --- Preprocessing ---
+def preprocess_frames(frames, config):
+    """Prepares image frames for the segmentation model."""
+    transform = T.Compose([
+        T.Resize(config.IMG_SIZE, antialias=True),
+        T.ToTensor(),
+        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
+    ])
+    tensors = [transform(frame.convert("RGB")) for frame in frames]
+    batch = torch.stack(tensors)
+    return batch
+# --- Color Palette for Mask Visualization ---
+def make_palette(num_classes):
+    rng = np.random.default_rng(0)
+    colors = rng.integers(0, 255, size=(num_classes, 3), dtype=np.uint8)
+    colors[0] = np.array([0, 0, 0])
+    return colors
+# --- Inference ---
+def run_segmentation(model, config, frames):
+    """Runs segmentation on the uploaded frames and visualizes with a color palette."""
+    st.write("Running segmentation...")
+    batch = preprocess_frames(frames, config)
+    device = config.DEVICE
+    batch = batch.to(device)
+    model = model.to(device)
+    with torch.no_grad():
+        logits = model(batch)
+        preds = torch.argmax(logits, 1).cpu().numpy()
+    mask = preds[0]
+    st.write(f"Mask unique values: {np.unique(mask)}")
+    palette = make_palette(config.NUM_CLASSES)
+    color_mask = palette[mask]
+    mask_img = Image.fromarray(color_mask.astype(np.uint8))
+    return mask_img
+# --- MedGemma Captioning ---
+def run_captioning(medgemma_model, processor, frames, mask_img, instruction):
+    """Runs MedGemma inference using 3 frames, 1 mask, and an instruction."""
+    st.write("Preparing inputs for MedGemma...")
+    images = [f.convert("RGB") for f in frames]
+    mask_img = mask_img.convert("RGB")
+    messages = [
+        {"role": "user", "content": [
+            {"type": "image"}, {"type": "image"}, {"type": "image"}, {"type": "image"},
+            {"type": "text", "text": instruction},
+        ]},
+    ]
+    input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
+    device = "cuda" if torch.cuda.is_available() else "cpu"
+    all_images = images + [mask_img]
+    inputs = processor(
+        all_images, input_text, add_special_tokens=False, return_tensors="pt",
+    ).to(device)
+    text_streamer = TextStreamer(processor, skip_prompt=True)
+    old_stdout = sys.stdout
+    sys.stdout = captured_output = StringIO()
+    st.write("Running MedGemma Analysis...")
+    torch._dynamo.disable()
+    medgemma_model.generate(
+        **inputs, streamer=text_streamer, max_new_tokens=768,
+        use_cache=True, temperature=1.0, top_p=0.95, top_k=64
+    )
+    sys.stdout = old_stdout
+    result = captured_output.getvalue()
+    return result
+# --- Streamlit UI ---
+def show():
+    """Main function to render the Streamlit UI."""
+    st.title("Surgical Scene Analysis System")
+    st.write("A system to test surgical scene segmentation and captioning models.")
+    st.header("1. Load Models")
+    if "seg_model" not in st.session_state or "seg_config" not in st.session_state:
+        st.session_state.seg_model, st.session_state.seg_config = None, None
+    if st.button("Load Segmentation Model"):
+        with st.spinner("Loading SwinUNETR..."):
+            st.session_state.seg_model, st.session_state.seg_config = load_swinunetr_model()
+    if st.session_state.seg_model is not None:
+        st.success("Segmentation model is loaded.")
+    else:
+        st.warning("Segmentation model is not loaded.")
+    if "medgemma_model" not in st.session_state:
+        st.session_state.medgemma_model, st.session_state.processor = None, None
+    if st.button("Load MedGemma Model"):
+        with st.spinner("Loading MedGemma... This can take several minutes."):
+            st.session_state.medgemma_model, st.session_state.processor = load_medgemma_model()
+    if st.session_state.get("medgemma_model") and st.session_state.get("processor"):
+        st.success("MedGemma model is loaded.")
+    else:
+        st.warning("MedGemma model is not loaded.")
+    st.header("2. Upload Data & Generate Mask")
+    st.subheader("Upload Three Sequential Surgical Video Frames")
+    col1, col2, col3 = st.columns(3)
+    uploaded_files = [
+        col1.file_uploader("Upload Frame 1", type=["png", "jpg", "jpeg"], key="frame1"),
+        col2.file_uploader("Upload Frame 2", type=["png", "jpg", "jpeg"], key="frame2"),
+        col3.file_uploader("Upload Frame 3", type=["png", "jpg", "jpeg"], key="frame3")
+    ]
+    frames = [Image.open(f) for f in uploaded_files if f is not None]
+    display_size = (256, 256)
+    if "mask_img" not in st.session_state:
+        st.session_state.mask_img = None
+    if len(frames) == 3:
+        st.success("All three frames have been uploaded successfully.")
+        img_cols = st.columns(4)
+        for i, frame in enumerate(frames):
+            img_cols[i].image(frame.resize(display_size), caption=f"Frame {i+1}", use_container_width=True)
+        if st.session_state.seg_model and st.session_state.seg_config and st.button("Run Segmentation"):
+            with st.spinner("Generating segmentation mask..."):
+                st.session_state.mask_img = run_segmentation(st.session_state.seg_model, st.session_state.seg_config, frames)
+        if st.session_state.mask_img is not None:
+            img_cols[3].image(st.session_state.mask_img.resize(display_size), caption="Segmentation Mask", use_container_width=True)
+    else:
+        st.info("Please upload all three frames to proceed.")
+    st.header("3. Generate Scene Analysis")
+    instruction_prompt = st.text_area(
+        "Enter your custom instruction prompt:",
+        "Provide a detailed summary of the surgical action, noting the instruments used and their interactions."
+    )
+    can_run_analysis = (
+        st.session_state.get("medgemma_model") is not None and
+        len(frames) == 3 and
+        st.session_state.get("mask_img") is not None and
+        bool(instruction_prompt)
+    )
+    if st.button("Run Analysis", disabled=not can_run_analysis):
+        with st.spinner("Running MedGemma analysis... This may take a moment."):
+            result = run_captioning(
+                st.session_state.medgemma_model, st.session_state.processor,
+                frames, st.session_state.mask_img, instruction_prompt
+            )
+            st.subheader("Analysis Result")
+            st.write(result)
+    if not can_run_analysis:
+        st.warning("Please ensure the MedGemma model is loaded, three frames are uploaded, segmentation is complete, and a prompt is provided.")
+if __name__ == "__main__":
+    show()