YanBoChen committed on
Commit 6083d96 · 1 Parent(s): 87dcd9d

refactor(data_processing): add token-based chunking strategy for improved keyword context

src/commit_message_20250726_data_processing.txt CHANGED

@@ -1,12 +1,13 @@
 feat(data-processing): implement data processing pipeline with embeddings

-BREAKING CHANGE: Add data processing implementation with robust path handling
+BREAKING CHANGE: Add data processing implementation with robust path handling and improved text processing

 Key Changes:
 1. Create DataProcessor class for medical data processing:
    - Handle paths with spaces and special characters
    - Support dataset/dataset directory structure
    - Add detailed logging for debugging
+   - Implement case-insensitive text processing

 2. Implement core functionalities:
    - Load filtered emergency and treatment data
@@ -14,11 +15,15 @@ Key Changes:
    - Generate embeddings using NeuML/pubmedbert-base-embeddings
    - Build ANNOY indices for vector search
    - Save embeddings and metadata separately
+   - Improve keyword matching with case-insensitive comparison
+   - Add proper chunk boundary calculations for medical terms

 3. Add test coverage:
    - Basic data loading tests
    - Chunking functionality tests
    - Model loading tests
+   - Token-based chunking validation
+   - Medical terminology preservation tests

 Technical Details:
    - Use pathlib.Path.resolve() for robust path handling
@@ -26,11 +31,20 @@ Technical Details:
    * /models/embeddings/ for vector representations
    * /models/indices/annoy/ for search indices
    - Keep keywords as metadata without embedding
+   - Implement case-insensitive text processing while preserving medical term integrity
+   - Add proper chunk overlap handling

 Testing:
 ✅ Data loading: 11,914 emergency + 11,023 treatment records
 ✅ Chunking: Successful with keyword-centered approach
 ✅ Model loading: NeuML/pubmedbert-base-embeddings (768 dims)
+✅ Token chunking: Verified with medical terms (e.g., "ST elevation")
+
+Storage Structure:
+/models/
+├── embeddings/          # Vector representations
+└── indices/
+    └── annoy/           # Search indices (.ann files)

 Next Steps:
    - Integrate with Meditron for enhanced processing
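
Note: the Technical Details and Storage Structure above outline an embed-and-index step (pathlib-based paths, PubMedBERT embeddings, ANNOY indices). Below is a minimal sketch of that flow, assuming sentence-transformers and the annoy package are used; the helper name embed_and_index, the directory constants, the tree count, and the file names are illustrative and not taken from the repository.

```python
# Illustrative sketch only: the commit does not include code, so the helper
# name and the use of sentence-transformers here are assumptions.
from pathlib import Path

import numpy as np
from annoy import AnnoyIndex                          # ANNOY indices, as named in the commit
from sentence_transformers import SentenceTransformer

# Robust path handling via pathlib.Path.resolve(), per "Technical Details";
# the project-root layout is assumed.
BASE_DIR = Path(__file__).resolve().parent.parent
EMBEDDINGS_DIR = BASE_DIR / "models" / "embeddings"
INDICES_DIR = BASE_DIR / "models" / "indices" / "annoy"

def embed_and_index(chunks: list[str], name: str) -> None:
    """Embed text chunks and build an ANNOY index (hypothetical helper)."""
    EMBEDDINGS_DIR.mkdir(parents=True, exist_ok=True)
    INDICES_DIR.mkdir(parents=True, exist_ok=True)

    model = SentenceTransformer("NeuML/pubmedbert-base-embeddings")  # 768-dim embeddings
    vectors = model.encode(chunks, show_progress_bar=True)

    # Save embeddings separately from the index, as the commit describes.
    np.save(EMBEDDINGS_DIR / f"{name}_embeddings.npy", vectors)

    index = AnnoyIndex(vectors.shape[1], "angular")   # cosine-like distance
    for i, vec in enumerate(vectors):
        index.add_item(i, vec)
    index.build(10)                                   # tree count is a placeholder
    index.save(str(INDICES_DIR / f"{name}.ann"))
```
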
src/commit_message_embedding_update.txt ADDED
@@ -0,0 +1,43 @@
+refactor(data_processing): optimize chunking strategy with token-based approach
+
+BREAKING CHANGE: Switch from character-based to token-based chunking and improve keyword context preservation
+
+- Replace character-based chunking with token-based approach using PubMedBERT tokenizer
+- Set chunk_size to 256 tokens and chunk_overlap to 64 tokens for optimal performance
+- Implement dynamic chunking strategy centered around medical keywords
+- Add token count validation to ensure semantic integrity
+- Optimize memory usage with lazy loading of tokenizer and model
+- Update chunking methods to handle token-level operations
+- Add comprehensive logging for debugging token counts
+- Update tests to verify token-based chunking behavior
+
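
To make the bullets above concrete, here is a minimal sketch of token-based chunking with a 256-token window and 64-token overlap, plus lazy loading of the tokenizer. It assumes transformers' AutoTokenizer and the tokenizer shipped with NeuML/pubmedbert-base-embeddings; the function names are illustrative, not the repository's actual code.

```python
# Minimal sketch of token-based chunking with overlap; names are assumptions.
from transformers import AutoTokenizer

_tokenizer = None  # lazy-loaded, as the commit message describes

def get_tokenizer():
    global _tokenizer
    if _tokenizer is None:
        # Same tokenizer as the embedding model, for consistency.
        _tokenizer = AutoTokenizer.from_pretrained("NeuML/pubmedbert-base-embeddings")
    return _tokenizer

def chunk_by_tokens(text: str, chunk_size: int = 256, overlap: int = 64) -> list[str]:
    """Split text into chunks of `chunk_size` tokens with `overlap` tokens shared."""
    tokenizer = get_tokenizer()
    ids = tokenizer.encode(text, add_special_tokens=False)
    chunks, start = [], 0
    while start < len(ids):
        window = ids[start:start + chunk_size]
        chunks.append(tokenizer.decode(window))
        if start + chunk_size >= len(ids):
            break
        start += chunk_size - overlap  # slide the window, keeping a 64-token overlap
    return chunks
```
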
+Recent Improvements:
+- Fix keyword context preservation in chunks
+- Implement separate tokenization for pre-keyword and post-keyword text
+- Add precise boundary calculation based on keyword length
+- Ensure medical terms (e.g., "ST elevation") remain intact
+- Improve chunk boundary calculations to maintain keyword context
+- Add validation to verify keyword presence in generated chunks
+
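
A sketch of the keyword-centered boundary calculation described above: the text before and after the keyword is tokenized separately so the keyword itself (e.g. "ST elevation") is never split, the remaining token budget is balanced across both sides, and keyword presence is validated in the result. It reuses get_tokenizer() from the previous sketch; all names are illustrative and the real implementation may differ.

```python
# Sketch of keyword-centered chunking, reusing get_tokenizer() from the
# previous snippet. Names are illustrative, not the repository's code.
def chunk_around_keyword(text: str, keyword: str, chunk_size: int = 256) -> str:
    """Build one chunk whose token budget is balanced around `keyword`."""
    tokenizer = get_tokenizer()
    pos = text.lower().find(keyword.lower())          # case-insensitive match
    if pos == -1:
        return ""                                     # caller decides how to handle misses

    # Tokenize pre-keyword, keyword, and post-keyword text separately so the
    # keyword can never be split across a chunk boundary.
    pre_ids = tokenizer.encode(text[:pos], add_special_tokens=False)
    kw_ids = tokenizer.encode(text[pos:pos + len(keyword)], add_special_tokens=False)
    post_ids = tokenizer.encode(text[pos + len(keyword):], add_special_tokens=False)

    # Split the remaining token budget roughly evenly on both sides.
    budget = max(chunk_size - len(kw_ids), 0)
    half = budget // 2
    left = pre_ids[-half:] if half else []
    right = post_ids[:budget - len(left)]

    chunk = tokenizer.decode(left + kw_ids + right)
    assert keyword.lower() in chunk.lower()           # validate keyword presence
    return chunk
```
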
+Technical Details:
+- chunk_size: 256 tokens (based on PubMedBERT context window)
+- overlap: 64 tokens (25% overlap for context continuity)
+- Model: NeuML/pubmedbert-base-embeddings (768 dims)
+- Tokenizer: Same as embedding model for consistency
+- Keyword-centered chunking with balanced context distribution
+
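
A hypothetical usage snippet tying the parameters above together (256-token chunks, 64-token / 25% overlap, within PubMedBERT's 512-token context window) and logging per-chunk token counts, in the spirit of the "comprehensive logging" bullet. It continues the sketches above; the sample text and logger name are invented.

```python
# Hypothetical usage of the sketches above; the sample text is invented.
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("data_processing")

CHUNK_SIZE = 256      # tokens; fits comfortably inside PubMedBERT's 512-token window
CHUNK_OVERLAP = 64    # tokens; 25% of CHUNK_SIZE

text = "Patient presented with crushing chest pain; ECG showed ST elevation in V1-V4 ..."
for i, chunk in enumerate(chunk_by_tokens(text, CHUNK_SIZE, CHUNK_OVERLAP)):
    n_tokens = len(get_tokenizer().encode(chunk, add_special_tokens=False))
    logger.info("chunk %d: %d tokens", i, n_tokens)
```
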
+Performance Impact:
+- Improved semantic coherence in chunks
+- Better handling of medical terminology
+- Reduced redundancy in overlapping regions
+- Optimized for downstream retrieval tasks
+- Enhanced preservation of medical term context
+- More accurate chunk boundaries around keywords
+
+Testing:
+- Added token count validation in tests
+- Verified keyword preservation in chunks
+- Confirmed overlap handling
+- Tested with sample medical texts
+- Validated medical terminology preservation
+- Verified chunk context balance around keywords
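
The test items above could translate into something like the following pytest-style sketch, reusing the hypothetical chunking helpers from the earlier snippets; the test names and sample passages are invented, not the project's actual test suite.

```python
# Pytest-style sketch of the checks listed above; chunk_by_tokens,
# chunk_around_keyword, and get_tokenizer come from the earlier snippets.
def test_chunks_respect_token_budget():
    sample = ("The 12-lead ECG showed ST elevation in the anterior leads, "
              "prompting immediate activation of the cath lab. ") * 30
    for chunk in chunk_by_tokens(sample, chunk_size=256, overlap=64):
        n = len(get_tokenizer().encode(chunk, add_special_tokens=False))
        # decode/re-encode can shift counts by a token or two at subword
        # boundaries, so allow a small slack around the 256-token budget
        assert n <= 256 + 2

def test_keyword_preserved_in_chunk():
    sample = "Acute MI was suspected after ST elevation appeared on the monitor."
    chunk = chunk_around_keyword(sample, "ST elevation", chunk_size=256)
    assert "st elevation" in chunk.lower()
```
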