YanBoChen committed
Commit 922ed80 · 1 Parent(s): 6083d96

feat(data_processing): Implement token length control with semantic preservation


BREAKING CHANGE: Modify chunk creation to handle >512 token texts

Problem:
- Token indices sequence length exceeding model's maximum (512 tokens)
- Risk of semantic information loss during text chunking
- Potential impact on medical term context preservation

Solution:
1. Dynamic Character-to-Token Ratio
- Calculate average chars_per_token from sample text
- Use ratio to estimate initial chunk boundaries
- Prevents tokenizing the entire long document at once (see the sketch after this list)

2. Semantic-Aware Chunking
- Set ROUGH_CHUNK_TARGET_TOKENS = 512
- Keep keywords centered in chunks
- Maintain context window around keywords
- Ensure rough_chunk stays within token limit

3. Overlap Strategy
- Implement sliding window with 64-token overlap
- Preserve context across chunk boundaries
- Maintain semantic continuity
- Prevent information loss at chunk edges
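
A rough sketch of the dynamic ratio idea: estimate chars-per-token from a small sample
around the keyword, then convert the 512-token budget into a character window.
rough_char_window is a hypothetical standalone helper (the real logic lives in
create_keyword_centered_chunks), and the whitespace tokenizer in the demo is only a
stand-in for the PubMedBERT tokenizer:

    from typing import Callable, List, Tuple

    ROUGH_CHUNK_TARGET_TOKENS = 512  # model's maximum sequence length

    def rough_char_window(text: str, keyword: str,
                          tokenize: Callable[[str], List[str]]) -> Tuple[int, int]:
        """Estimate character boundaries for a ~512-token window centered on the keyword."""
        pos = text.find(keyword)
        if pos == -1:
            return 0, min(len(text), ROUGH_CHUNK_TARGET_TOKENS * 4)
        # Sample ~100 characters on each side of the keyword to estimate chars per token
        sample = text[max(0, pos - 100):min(len(text), pos + len(keyword) + 100)]
        tokens = tokenize(sample)
        chars_per_token = len(sample) / len(tokens) if tokens else 4.0
        # Half of the token budget on each side, converted back to characters
        half_window = int(ROUGH_CHUNK_TARGET_TOKENS * chars_per_token / 2)
        start = max(0, pos - half_window)
        end = min(len(text), pos + len(keyword) + half_window)
        return start, end

    # Toy usage with a whitespace "tokenizer" as a stand-in
    text = ("chest pain radiating to the left arm " * 200).lower()
    print(rough_char_window(text, "left arm", str.split))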

Technical Details:
- Target chunk size: 512 tokens (maximum model limit)
- Overlap size: 64 tokens (empirically determined)
- Dynamic ratio calculation using sample text
- Centered keyword positioning (boundary math sketched below)
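
A minimal sketch of the boundary math behind the numbers above: center the keyword,
split the remaining token budget evenly, then widen each edge by the overlap where room
allows (so a window can grow by up to one overlap per side). token_boundaries is a
hypothetical helper mirroring the logic in create_keyword_centered_chunks:

    from typing import Tuple

    def token_boundaries(total_tokens: int, keyword_start: int, keyword_len: int,
                         chunk_size: int, overlap: int = 64) -> Tuple[int, int]:
        """Center the keyword, then widen each edge by the overlap where room allows."""
        each_side = (chunk_size - keyword_len) // 2
        start = max(0, keyword_start - each_side)
        end = min(total_tokens, keyword_start + keyword_len + each_side)
        if start > 0:
            start = max(0, start - overlap)
        if end < total_tokens:
            end = min(total_tokens, end + overlap)
        return start, end

    # Example: a 4-token keyword starting at token 900 of a 2,000-token rough chunk
    print(token_boundaries(2000, 900, 4, chunk_size=512))  # -> (582, 1222)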

Impact:
✓ Eliminates token length warnings
✓ Preserves medical term context
✓ Maintains semantic relationships
✓ Improves retrieval quality
✓ Optimizes processing efficiency

Testing:
- Verified with long medical texts
- Confirmed keyword context preservation
- Validated chunk boundary handling
- Tested overlap effectiveness (spot check sketched below)
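
One way to spot-check the first two points (hypothetical snippet, not part of this
commit's test suite; it assumes chunk dicts shaped like the ones
create_keyword_centered_chunks returns, with token_count, primary_keyword, chunk_id and
text fields):

    MODEL_MAX_TOKENS = 512  # PubMedBERT sequence limit

    def check_chunks(chunks: list) -> None:
        """Every chunk should fit the model and still contain its keyword."""
        for chunk in chunks:
            assert chunk["token_count"] <= MODEL_MAX_TOKENS, (
                f"{chunk['chunk_id']}: {chunk['token_count']} tokens"
            )
            assert chunk["primary_keyword"] in chunk["text"], chunk["chunk_id"]

    # e.g. check_chunks(processor.process_emergency_chunks())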

Co-authored-by: YanBo Chen

commit_message_embedding_update.txt DELETED
@@ -1,43 +0,0 @@
- refactor(data_processing): optimize chunking strategy with token-based approach
-
- BREAKING CHANGE: Switch from character-based to token-based chunking and improve keyword context preservation
-
- Replace character-based chunking with token-based approach using PubMedBERT tokenizer
- Set chunk_size to 256 tokens and chunk_overlap to 64 tokens for optimal performance
- Implement dynamic chunking strategy centered around medical keywords
- Add token count validation to ensure semantic integrity
- Optimize memory usage with lazy loading of tokenizer and model
- Update chunking methods to handle token-level operations
- Add comprehensive logging for debugging token counts
- Update tests to verify token-based chunking behavior
-
- Recent Improvements:
- - Fix keyword context preservation in chunks
- - Implement separate tokenization for pre-keyword and post-keyword text
- - Add precise boundary calculation based on keyword length
- - Ensure medical terms (e.g., "ST elevation") remain intact
- - Improve chunk boundary calculations to maintain keyword context
- - Add validation to verify keyword presence in generated chunks
-
- Technical Details:
- - chunk_size: 256 tokens (based on PubMedBERT context window)
- - overlap: 64 tokens (25% overlap for context continuity)
- - Model: NeuML/pubmedbert-base-embeddings (768 dims)
- - Tokenizer: Same as embedding model for consistency
- - Keyword-centered chunking with balanced context distribution
-
- Performance Impact:
- - Improved semantic coherence in chunks
- - Better handling of medical terminology
- - Reduced redundancy in overlapping regions
- - Optimized for downstream retrieval tasks
- - Enhanced preservation of medical term context
- - More accurate chunk boundaries around keywords
-
- Testing:
- - Added token count validation in tests
- - Verified keyword preservation in chunks
- - Confirmed overlap handling
- - Tested with sample medical texts
- - Validated medical terminology preservation
- - Verified chunk context balance around keywords
src/__init__.py ADDED
@@ -0,0 +1,8 @@
+ """
+ OnCall.ai src package
+
+ This package contains the core implementation of the OnCall.ai system.
+ """
+
+ # Version
+ __version__ = '0.1.0'
src/commit_message_20250726_data_processing.txt DELETED
@@ -1,52 +0,0 @@
- feat(data-processing): implement data processing pipeline with embeddings
-
- BREAKING CHANGE: Add data processing implementation with robust path handling and improved text processing
-
- Key Changes:
- 1. Create DataProcessor class for medical data processing:
- - Handle paths with spaces and special characters
- - Support dataset/dataset directory structure
- - Add detailed logging for debugging
- - Implement case-insensitive text processing
-
- 2. Implement core functionalities:
- - Load filtered emergency and treatment data
- - Create intelligent chunks based on matched keywords
- - Generate embeddings using NeuML/pubmedbert-base-embeddings
- - Build ANNOY indices for vector search
- - Save embeddings and metadata separately
- - Improve keyword matching with case-insensitive comparison
- - Add proper chunk boundary calculations for medical terms
-
- 3. Add test coverage:
- - Basic data loading tests
- - Chunking functionality tests
- - Model loading tests
- - Token-based chunking validation
- - Medical terminology preservation tests
-
- Technical Details:
- - Use pathlib.Path.resolve() for robust path handling
- - Separate storage for embeddings and indices:
- * /models/embeddings/ for vector representations
- * /models/indices/annoy/ for search indices
- - Keep keywords as metadata without embedding
- - Implement case-insensitive text processing while preserving medical term integrity
- - Add proper chunk overlap handling
-
- Testing:
- ✅ Data loading: 11,914 emergency + 11,023 treatment records
- ✅ Chunking: Successful with keyword-centered approach
- ✅ Model loading: NeuML/pubmedbert-base-embeddings (768 dims)
- ✅ Token chunking: Verified with medical terms (e.g., "ST elevation")
-
- Storage Structure:
- /models/
- ├── embeddings/   # Vector representations
- └── indices/
-     └── annoy/    # Search indices (.ann files)
-
- Next Steps:
- - Integrate with Meditron for enhanced processing
- - Implement prompt engineering
- - Add hybrid search functionality
src/commit_message_embedding_update.txt DELETED
@@ -1,43 +0,0 @@
- refactor(data_processing): optimize chunking strategy with token-based approach
-
- BREAKING CHANGE: Switch from character-based to token-based chunking and improve keyword context preservation
-
- Replace character-based chunking with token-based approach using PubMedBERT tokenizer
- Set chunk_size to 256 tokens and chunk_overlap to 64 tokens for optimal performance
- Implement dynamic chunking strategy centered around medical keywords
- Add token count validation to ensure semantic integrity
- Optimize memory usage with lazy loading of tokenizer and model
- Update chunking methods to handle token-level operations
- Add comprehensive logging for debugging token counts
- Update tests to verify token-based chunking behavior
-
- Recent Improvements:
- - Fix keyword context preservation in chunks
- - Implement separate tokenization for pre-keyword and post-keyword text
- - Add precise boundary calculation based on keyword length
- - Ensure medical terms (e.g., "ST elevation") remain intact
- - Improve chunk boundary calculations to maintain keyword context
- - Add validation to verify keyword presence in generated chunks
-
- Technical Details:
- - chunk_size: 256 tokens (based on PubMedBERT context window)
- - overlap: 64 tokens (25% overlap for context continuity)
- - Model: NeuML/pubmedbert-base-embeddings (768 dims)
- - Tokenizer: Same as embedding model for consistency
- - Keyword-centered chunking with balanced context distribution
-
- Performance Impact:
- - Improved semantic coherence in chunks
- - Better handling of medical terminology
- - Reduced redundancy in overlapping regions
- - Optimized for downstream retrieval tasks
- - Enhanced preservation of medical term context
- - More accurate chunk boundaries around keywords
-
- Testing:
- - Added token count validation in tests
- - Verified keyword preservation in chunks
- - Confirmed overlap handling
- - Tested with sample medical texts
- - Validated medical terminology preservation
- - Verified chunk context balance around keywords
src/data_processing.py CHANGED
@@ -21,6 +21,7 @@ from typing import List, Dict, Tuple, Any
  from sentence_transformers import SentenceTransformer
  from annoy import AnnoyIndex
  import logging
+ from tqdm import tqdm

  # Setup logging
  logging.basicConfig(
@@ -141,10 +142,23 @@ class DataProcessor:
          chunk_size = chunk_size or self.chunk_size
          chunks = []

-         # Tokenize full text once
-         full_text_tokens = self.tokenizer.tokenize(text)
-         total_tokens = len(full_text_tokens)
-
+         # Calculate character-to-token ratio using a sample around the first keyword
+         if keywords:
+             first_keyword = keywords[0]
+             first_pos = text.find(first_keyword)
+             if first_pos != -1:
+                 # Take a sample around the first keyword for ratio calculation
+                 sample_start = max(0, first_pos - 100)
+                 sample_end = min(len(text), first_pos + len(first_keyword) + 100)
+                 sample_text = text[sample_start:sample_end]
+                 sample_tokens = len(self.tokenizer.tokenize(sample_text))
+                 chars_per_token = len(sample_text) / sample_tokens if sample_tokens > 0 else 4.0
+             else:
+                 chars_per_token = 4.0  # Fallback ratio
+         else:
+             chars_per_token = 4.0  # Default ratio
+
+         # Process keywords
          for i, keyword in enumerate(keywords):
              # Find keyword position in text (already lowercase)
              keyword_pos = text.find(keyword)
@@ -153,53 +167,66 @@
              # Get the keyword text (already lowercase)
              actual_keyword = text[keyword_pos:keyword_pos + len(keyword)]

-             # Get text before and after keyword
-             text_before = text[:keyword_pos]
-             text_after = text[keyword_pos + len(keyword):]
-
-             # Tokenize each part separately
-             tokens_before = self.tokenizer.tokenize(text_before)
-             keyword_tokens = self.tokenizer.tokenize(actual_keyword)
-             tokens_after = self.tokenizer.tokenize(text_after)
-
-             # Calculate token positions
+             # Calculate rough window size using dynamic ratio
+             # Cap the rough chunk target token size to prevent tokenizer warnings
+             # Use 512 tokens as target (model's max limit)
+             ROUGH_CHUNK_TARGET_TOKENS = 512
+             char_window = int(ROUGH_CHUNK_TARGET_TOKENS * chars_per_token / 2)
+
+             # Get rough chunk boundaries in characters
+             rough_start = max(0, keyword_pos - char_window)
+             rough_end = min(len(text), keyword_pos + len(keyword) + char_window)
+
+             # Extract rough chunk for processing
+             rough_chunk = text[rough_start:rough_end]
+
+             # Find keyword's relative position in rough chunk
+             rel_pos = rough_chunk.find(actual_keyword)
+             if rel_pos == -1:
+                 logger.debug(f"Could not locate keyword '{actual_keyword}' in rough chunk for doc {doc_id}")
+                 continue
+
+             # Calculate token position by tokenizing text before keyword
+             text_before = rough_chunk[:rel_pos]
+             tokens_before = self.tokenizer.tokenize(text_before)
              keyword_start_pos = len(tokens_before)
+
+             # Tokenize necessary parts
+             chunk_tokens = self.tokenizer.tokenize(rough_chunk)
+             keyword_tokens = self.tokenizer.tokenize(actual_keyword)
              keyword_length = len(keyword_tokens)

-             # Calculate how many tokens we want on each side of the keyword
+             # Calculate final chunk boundaries in tokens
              tokens_each_side = (chunk_size - keyword_length) // 2
-
-             # Calculate chunk boundaries
              chunk_start = max(0, keyword_start_pos - tokens_each_side)
-             chunk_end = min(total_tokens, keyword_start_pos + keyword_length + tokens_each_side)
+             chunk_end = min(len(chunk_tokens), keyword_start_pos + keyword_length + tokens_each_side)

              # Add overlap if possible
              if chunk_start > 0:
                  chunk_start = max(0, chunk_start - self.chunk_overlap)
-             if chunk_end < total_tokens:
-                 chunk_end = min(total_tokens, chunk_end + self.chunk_overlap)
+             if chunk_end < len(chunk_tokens):
+                 chunk_end = min(len(chunk_tokens), chunk_end + self.chunk_overlap)

-             # Extract chunk tokens and convert to text
-             chunk_tokens = full_text_tokens[chunk_start:chunk_end]
-             chunk_text = self.tokenizer.convert_tokens_to_string(chunk_tokens)
+             # Extract final tokens and convert to text
+             final_tokens = chunk_tokens[chunk_start:chunk_end]
+             chunk_text = self.tokenizer.convert_tokens_to_string(final_tokens)

-             # Verify the keyword is in the chunk (direct comparison since all lowercase)
+             # Verify keyword presence in final chunk
              if chunk_text and actual_keyword in chunk_text:
                  chunk_info = {
                      "text": chunk_text,
                      "primary_keyword": actual_keyword,
                      "all_matched_keywords": matched_keywords.lower(),
-                     "token_position": keyword_start_pos,
-                     "token_start": chunk_start,
-                     "token_end": chunk_end,
-                     "token_count": len(chunk_tokens),
+                     "token_count": len(final_tokens),
                      "chunk_id": f"{doc_id}_chunk_{i}" if doc_id else f"chunk_{i}",
                      "source_doc_id": doc_id
                  }
                  chunks.append(chunk_info)
-                 logger.info(f"Created chunk for keyword '{actual_keyword}' with {len(chunk_tokens)} tokens")
              else:
-                 logger.warning(f"Failed to create valid chunk for keyword '{actual_keyword}' - keyword not found in generated chunk")
+                 logger.debug(f"Could not create chunk for keyword '{actual_keyword}' in doc {doc_id}")
+
+         if chunks:
+             logger.debug(f"Created {len(chunks)} chunks for document {doc_id or 'unknown'}")

          return chunks

@@ -276,14 +303,17 @@ class DataProcessor:

      def process_emergency_chunks(self) -> List[Dict[str, Any]]:
          """Process emergency data into chunks"""
-         logger.info("Processing emergency data into chunks...")
-
          if self.emergency_data is None:
              raise ValueError("Emergency data not loaded. Call load_filtered_data() first.")

          all_chunks = []

-         for idx, row in self.emergency_data.iterrows():
+         # Add progress bar with leave=False to avoid cluttering
+         for idx, row in tqdm(self.emergency_data.iterrows(),
+                              total=len(self.emergency_data),
+                              desc="Processing emergency documents",
+                              unit="doc",
+                              leave=False):
              if pd.notna(row.get('clean_text')) and pd.notna(row.get('matched')):
                  chunks = self.create_keyword_centered_chunks(
                      text=row['clean_text'],
@@ -305,19 +335,22 @@
                  all_chunks.extend(chunks)

          self.emergency_chunks = all_chunks
-         logger.info(f"Generated {len(all_chunks)} emergency chunks")
+         logger.info(f"Completed processing emergency data: {len(all_chunks)} chunks generated")
          return all_chunks

      def process_treatment_chunks(self) -> List[Dict[str, Any]]:
          """Process treatment data into chunks"""
-         logger.info("Processing treatment data into chunks...")
-
          if self.treatment_data is None:
              raise ValueError("Treatment data not loaded. Call load_filtered_data() first.")

          all_chunks = []

-         for idx, row in self.treatment_data.iterrows():
+         # Add progress bar with leave=False to avoid cluttering
+         for idx, row in tqdm(self.treatment_data.iterrows(),
+                              total=len(self.treatment_data),
+                              desc="Processing treatment documents",
+                              unit="doc",
+                              leave=False):
              if (pd.notna(row.get('clean_text')) and
                  pd.notna(row.get('treatment_matched'))):

@@ -343,13 +376,39 @@ class DataProcessor:
                  all_chunks.extend(chunks)

          self.treatment_chunks = all_chunks
-         logger.info(f"Generated {len(all_chunks)} treatment chunks")
+         logger.info(f"Completed processing treatment data: {len(all_chunks)} chunks generated")
          return all_chunks

+     def _get_chunk_hash(self, text: str) -> str:
+         """Generate hash for chunk text to use as cache key"""
+         import hashlib
+         return hashlib.md5(text.encode('utf-8')).hexdigest()
+
+     def _load_embedding_cache(self, cache_file: str) -> dict:
+         """Load embedding cache from file"""
+         import pickle
+         import os
+         if os.path.exists(cache_file):
+             try:
+                 with open(cache_file, 'rb') as f:
+                     return pickle.load(f)
+             except:
+                 logger.warning(f"Could not load cache file {cache_file}, starting fresh")
+                 return {}
+         return {}
+
+     def _save_embedding_cache(self, cache: dict, cache_file: str):
+         """Save embedding cache to file"""
+         import pickle
+         import os
+         os.makedirs(os.path.dirname(cache_file), exist_ok=True)
+         with open(cache_file, 'wb') as f:
+             pickle.dump(cache, f)
+
      def generate_embeddings(self, chunks: List[Dict[str, Any]],
                              chunk_type: str = "emergency") -> np.ndarray:
          """
-         Generate embeddings for chunks
+         Generate embeddings for chunks with caching support

          Args:
              chunks: List of chunk dictionaries
@@ -358,28 +417,78 @@
          Returns:
              numpy array of embeddings
          """
-         logger.info(f"Generating embeddings for {len(chunks)} {chunk_type} chunks...")
-
-         # Load model if not already loaded
-         model = self.load_embedding_model()
-
-         # Extract text from chunks
-         texts = [chunk['text'] for chunk in chunks]
-
-         # Generate embeddings in batches
-         batch_size = 32
-         embeddings = []
-
-         for i in range(0, len(texts), batch_size):
-             batch_texts = texts[i:i+batch_size]
-             batch_embeddings = model.encode(batch_texts, show_progress_bar=True)
-             embeddings.append(batch_embeddings)
-
-         # Concatenate all embeddings
-         all_embeddings = np.vstack(embeddings)
-
-         logger.info(f"Generated embeddings shape: {all_embeddings.shape}")
-         return all_embeddings
+         logger.info(f"Starting embedding generation for {len(chunks)} {chunk_type} chunks...")
+
+         # Cache setup
+         cache_dir = self.models_dir / "cache"
+         cache_dir.mkdir(parents=True, exist_ok=True)
+         cache_file = cache_dir / f"{chunk_type}_embeddings_cache.pkl"
+
+         # Load existing cache
+         cache = self._load_embedding_cache(str(cache_file))
+
+         cached_embeddings = []
+         to_embed = []
+
+         # Check cache for each chunk
+         for i, chunk in enumerate(chunks):
+             chunk_hash = self._get_chunk_hash(chunk['text'])
+             if chunk_hash in cache:
+                 cached_embeddings.append((i, cache[chunk_hash]))
+             else:
+                 to_embed.append((i, chunk_hash, chunk['text']))
+
+         logger.info(f"Cache status: {len(cached_embeddings)} cached, {len(to_embed)} new chunks to embed")
+
+         # Generate embeddings for new chunks
+         new_embeddings = []
+         if to_embed:
+             # Load model
+             model = self.load_embedding_model()
+             texts = [text for _, _, text in to_embed]
+
+             # Generate embeddings in batches with clear progress
+             batch_size = 32
+             total_batches = (len(texts) + batch_size - 1) // batch_size
+
+             logger.info(f"Processing {len(texts)} new {chunk_type} texts in {total_batches} batches...")
+
+             for i in tqdm(range(0, len(texts), batch_size),
+                           desc=f"Embedding {chunk_type} subset",
+                           total=total_batches,
+                           unit="batch",
+                           leave=False):
+                 batch_texts = texts[i:i + batch_size]
+                 batch_emb = model.encode(
+                     batch_texts,
+                     show_progress_bar=False
+                 )
+                 new_embeddings.extend(batch_emb)
+
+             # Update cache with new embeddings
+             for (_, chunk_hash, _), emb in zip(to_embed, new_embeddings):
+                 cache[chunk_hash] = emb
+
+             # Save updated cache
+             self._save_embedding_cache(cache, str(cache_file))
+             logger.info(f"Updated cache with {len(new_embeddings)} new embeddings")
+
+         # Combine cached and new embeddings in correct order
+         all_embeddings = [None] * len(chunks)
+
+         # Place cached embeddings
+         for idx, emb in cached_embeddings:
+             all_embeddings[idx] = emb
+
+         # Place new embeddings
+         for (idx, _, _), emb in zip(to_embed, new_embeddings):
+             all_embeddings[idx] = emb
+
+         # Convert to numpy array
+         result = np.vstack(all_embeddings)
+         logger.info(f"Completed embedding generation: shape {result.shape}")
+
+         return result

      def build_annoy_index(self, embeddings: np.ndarray,
                            index_name: str, n_trees: int = 15) -> AnnoyIndex:
tests/test_embedding_and_index.py ADDED
@@ -0,0 +1,29 @@
+ import numpy as np
+ from annoy import AnnoyIndex
+ import pytest
+ from data_processing import DataProcessor
+
+ @pytest.fixture(scope="module")
+ def processor():
+     return DataProcessor(base_dir=".")
+
+ def test_embedding_dimensions(processor):
+     # load emergency embeddings
+     emb = np.load(processor.models_dir / "embeddings" / "emergency_embeddings.npy")
+     expected_dim = processor.embedding_dim
+     assert emb.ndim == 2, f"Expected 2D array, got {emb.ndim}D"
+     assert emb.shape[1] == expected_dim, (
+         f"Expected embedding dimension {expected_dim}, got {emb.shape[1]}"
+     )
+
+ def test_annoy_search(processor):
+     # load embeddings
+     emb = np.load(processor.models_dir / "embeddings" / "emergency_embeddings.npy")
+     # load Annoy index
+     idx = AnnoyIndex(processor.embedding_dim, 'angular')
+     idx.load(str(processor.models_dir / "indices" / "annoy" / "emergency_index.ann"))
+     # perform a sample query
+     query_vec = emb[0]
+     ids, distances = idx.get_nns_by_vector(query_vec, 5, include_distances=True)
+     assert len(ids) == 5
+     assert all(0 <= d <= 2 for d in distances)