oncall-guide-ai / commit_message_embedding_update.txt
YanBoChen
refactor(data_processing): optimize chunking strategy with token-based approach
87dcd9d
BREAKING CHANGE: Switch from character-based to token-based chunking and improve keyword context preservation
- Replace character-based chunking with token-based approach using PubMedBERT tokenizer
- Set chunk_size to 256 tokens and chunk_overlap to 64 tokens (25% overlap between consecutive chunks)
- Implement dynamic chunking strategy centered around medical keywords
- Add token count validation to ensure semantic integrity
- Optimize memory usage with lazy loading of tokenizer and model
- Update chunking methods to handle token-level operations
- Add comprehensive logging for debugging token counts
- Update tests to verify token-based chunking behavior
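
Illustrative sketch (not the code from this commit): the token-based sliding window described above could look roughly like the following. The function name chunk_tokens is hypothetical, and the token list is assumed to come from the embedding model's tokenizer, for example AutoTokenizer.from_pretrained("NeuML/pubmedbert-base-embeddings").tokenize(text) from the transformers library. The sketch itself works on any token sequence.

```python
from typing import List, Sequence


def chunk_tokens(
    tokens: Sequence[str],
    chunk_size: int = 256,
    chunk_overlap: int = 64,
) -> List[List[str]]:
    """Split a token sequence into overlapping windows (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= overlap < chunk_size")

    # Each new window starts `stride` tokens after the previous one.
    stride = chunk_size - chunk_overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), stride):
        chunks.append(list(tokens[start:start + chunk_size]))
        # Stop once a window has reached the end of the sequence, so no
        # trailing chunk consists only of already-covered overlap.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

With the defaults, consecutive chunks share 64 tokens and each new chunk starts 192 tokens after the previous one.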
Recent Improvements:
- Fix keyword context preservation in chunks
- Implement separate tokenization for pre-keyword and post-keyword text
- Add precise boundary calculation based on keyword length
- Ensure medical terms (e.g., "ST elevation") remain intact
- Improve chunk boundary calculations to maintain keyword context
- Add validation to verify keyword presence in generated chunks
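
Illustrative sketch (an assumption about the approach, not the committed implementation): one way to tokenize the pre-keyword and post-keyword text separately and size the window from the keyword's token length. The name keyword_centered_chunk and the whitespace default tokenizer are hypothetical. In practice the tokenize argument would be the PubMedBERT tokenizer's tokenize method.

```python
import re
from typing import Callable, List, Optional


def keyword_centered_chunk(
    text: str,
    keyword: str,
    tokenize: Callable[[str], List[str]] = str.split,
    chunk_size: int = 256,
) -> Optional[List[str]]:
    """Return up to chunk_size tokens centered on the first keyword match."""
    match = re.search(re.escape(keyword), text, flags=re.IGNORECASE)
    if match is None:
        return None

    # Tokenize the three regions separately so the keyword's tokens are
    # never split across a chunk boundary.
    pre = tokenize(text[:match.start()])
    kw = tokenize(text[match.start():match.end()])
    post = tokenize(text[match.end():])
    if len(kw) > chunk_size:
        raise ValueError("keyword alone exceeds chunk_size")

    # The context budget is whatever the keyword leaves free.
    budget = chunk_size - len(kw)
    left = min(len(pre), budget // 2)
    right = min(len(post), budget - left)
    # If the right side is short, hand its unused budget back to the left.
    left = min(len(pre), budget - right)

    return pre[len(pre) - left:] + kw + post[:right]
```

Splitting the budget after subtracting the keyword length keeps a multi-token term such as "ST elevation" intact and roughly centered, and a keyword near the start or end of the text still yields a full-size chunk when enough context exists on the other side.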
Technical Details:
- chunk_size: 256 tokens (half of PubMedBERT's 512-token context window)
- overlap: 64 tokens (25% overlap for context continuity)
- Model: NeuML/pubmedbert-base-embeddings (768 dims)
- Tokenizer: Same as embedding model for consistency
- Keyword-centered chunking with balanced context distribution
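
Illustrative arithmetic for the parameters above, assuming a plain sliding window over the token sequence. The helper expected_chunk_count is hypothetical and only shows how chunk_size and overlap determine the stride and the number of chunks.

```python
import math

CHUNK_SIZE = 256
CHUNK_OVERLAP = 64
STRIDE = CHUNK_SIZE - CHUNK_OVERLAP          # 192 new tokens per chunk
OVERLAP_RATIO = CHUNK_OVERLAP / CHUNK_SIZE   # 0.25, i.e. 25% overlap


def expected_chunk_count(
    n_tokens: int,
    chunk_size: int = CHUNK_SIZE,
    chunk_overlap: int = CHUNK_OVERLAP,
) -> int:
    """Number of sliding-window chunks needed to cover n_tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= overlap < chunk_size")
    if n_tokens <= 0:
        return 0
    if n_tokens <= chunk_size:
        return 1
    stride = chunk_size - chunk_overlap
    # The first chunk covers chunk_size tokens; each later one adds `stride`.
    return 1 + math.ceil((n_tokens - chunk_size) / stride)
```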
Performance Impact:
- Improved semantic coherence in chunks
- Better handling of medical terminology
- Reduced redundancy in overlapping regions
- Optimized for downstream retrieval tasks
- Enhanced preservation of medical term context
- More accurate chunk boundaries around keywords
Testing:
- Added token count validation in tests
- Verified keyword preservation in chunks
- Confirmed overlap handling
- Tested with sample medical texts
- Validated medical terminology preservation
- Verified chunk context balance around keywords
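
Illustrative sketch of the kind of checks listed above (token count validation and keyword presence). The helper names contains_subsequence and validate_chunk are hypothetical and are not taken from the project's test suite.

```python
from typing import List, Sequence


def contains_subsequence(tokens: Sequence[str], sub: Sequence[str]) -> bool:
    """True if `sub` occurs as a contiguous run inside `tokens`."""
    n = len(sub)
    if n == 0:
        return True
    return any(
        list(tokens[i:i + n]) == list(sub)
        for i in range(len(tokens) - n + 1)
    )


def validate_chunk(
    tokens: Sequence[str],
    keyword_tokens: Sequence[str],
    chunk_size: int = 256,
) -> List[str]:
    """Return a list of problems found in a chunk; empty means it passed."""
    errors: List[str] = []
    if len(tokens) == 0:
        errors.append("chunk is empty")
    if len(tokens) > chunk_size:
        errors.append(f"chunk has {len(tokens)} tokens, limit is {chunk_size}")
    if not contains_subsequence(tokens, keyword_tokens):
        errors.append("keyword tokens not found intact in chunk")
    return errors
```

Returning a list of messages rather than a boolean makes it easy to log which check failed for a given chunk.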