refactor(data_processing): optimize chunking strategy with token-based approach

BREAKING CHANGE: Switch from character-based to token-based chunking and improve keyword context preservation
- Replace character-based chunking with token-based approach using PubMedBERT tokenizer
- Set chunk_size to 256 tokens and chunk_overlap to 64 tokens
- Implement dynamic chunking strategy centered around medical keywords
- Add token count validation to ensure semantic integrity
- Optimize memory usage with lazy loading of tokenizer and model
- Update chunking methods to handle token-level operations
- Add comprehensive logging for debugging token counts
- Update tests to verify token-based chunking behavior
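
For illustration, the token-based windowing described above could look roughly like the sketch below. It is not the code from this commit: the function name `chunk_token_ids` is hypothetical, and it works on a plain sequence of token IDs so that it runs without downloading a tokenizer.

```python
from typing import List, Sequence


def chunk_token_ids(
    token_ids: Sequence[int],
    chunk_size: int = 256,
    chunk_overlap: int = 64,
) -> List[List[int]]:
    """Split token IDs into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")

    # Each new window starts this many tokens after the previous one
    # (256 - 64 = 192 with the defaults).
    stride = chunk_size - chunk_overlap
    chunks: List[List[int]] = []
    for start in range(0, len(token_ids), stride):
        chunks.append(list(token_ids[start:start + chunk_size]))
        if start + chunk_size >= len(token_ids):
            break  # this window already reaches the end of the input
    return chunks
```

In practice the IDs would presumably come from the PubMedBERT tokenizer, for example `AutoTokenizer.from_pretrained("NeuML/pubmedbert-base-embeddings").encode(text, add_special_tokens=False)`, and each window would be turned back into text with `tokenizer.decode(...)`.
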

Recent Improvements:
- Fix keyword context preservation in chunks
- Implement separate tokenization for pre-keyword and post-keyword text
- Add precise boundary calculation based on keyword length
- Ensure medical terms (e.g., "ST elevation") remain intact
- Improve chunk boundary calculations to maintain keyword context
- Add validation to verify keyword presence in generated chunks
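
A minimal sketch of the keyword-centered idea, under stated assumptions: `keyword_centered_chunk` is a hypothetical name, the `tokenize` argument stands in for the real tokenizer (any callable from string to token list), and the rule for handing unused budget to the other side is one plausible reading of "balanced context", not necessarily what the commit implements. The text before and after the keyword is tokenized separately, so the keyword's own tokens are never split by a chunk boundary.

```python
import re
from typing import Callable, List, Optional


def keyword_centered_chunk(
    text: str,
    keyword: str,
    tokenize: Callable[[str], List[str]],
    chunk_size: int = 256,
) -> Optional[List[str]]:
    """Return up to chunk_size tokens with the first keyword hit kept whole."""
    match = re.search(re.escape(keyword), text, flags=re.IGNORECASE)
    if match is None:
        return None  # keyword does not occur in the text

    # Tokenize the three regions separately so the keyword stays intact.
    pre = tokenize(text[:match.start()])
    kw = tokenize(match.group(0))
    post = tokenize(text[match.end():])
    if len(kw) > chunk_size:
        raise ValueError("keyword alone exceeds chunk_size")

    # Split the remaining budget evenly, then let either side reuse
    # whatever the other side could not fill.
    budget = chunk_size - len(kw)
    left = min(len(pre), budget // 2)
    right = min(len(post), budget - left)
    left = min(len(pre), budget - right)

    return pre[len(pre) - left:] + kw + post[:right]
```

With a whitespace tokenizer as a stand-in, `keyword_centered_chunk("chest pain with ST elevation noted", "ST elevation", str.split, chunk_size=4)` keeps one token of context on each side of the keyword.
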

Technical Details:
- chunk_size: 256 tokens (fits within PubMedBERT's 512-token context window)
- chunk_overlap: 64 tokens (25% of chunk_size, for context continuity)
- Model: NeuML/pubmedbert-base-embeddings (768 dims)
- Tokenizer: Same as embedding model for consistency
- Keyword-centered chunking with balanced context distribution
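
One way the lazy loading of the tokenizer and model could be arranged, while keeping both tied to the same model name, is sketched below. The class `LazyEmbeddingResources` and its optional loader arguments are hypothetical and exist only for illustration; the default branches use `AutoTokenizer.from_pretrained` and `SentenceTransformer`, which is an assumption about how the project loads this model.

```python
from functools import cached_property
from typing import Any, Callable, Optional

MODEL_NAME = "NeuML/pubmedbert-base-embeddings"


class LazyEmbeddingResources:
    """Load the tokenizer and embedding model only on first access."""

    def __init__(
        self,
        model_name: str = MODEL_NAME,
        tokenizer_loader: Optional[Callable[[str], Any]] = None,
        model_loader: Optional[Callable[[str], Any]] = None,
    ) -> None:
        self.model_name = model_name
        # Optional overrides (hypothetical), handy for tests without downloads.
        self._tokenizer_loader = tokenizer_loader
        self._model_loader = model_loader

    @cached_property
    def tokenizer(self) -> Any:
        if self._tokenizer_loader is not None:
            return self._tokenizer_loader(self.model_name)
        # Deferred import: nothing heavy is loaded until this is first used.
        from transformers import AutoTokenizer
        return AutoTokenizer.from_pretrained(self.model_name)

    @cached_property
    def model(self) -> Any:
        if self._model_loader is not None:
            return self._model_loader(self.model_name)
        from sentence_transformers import SentenceTransformer
        return SentenceTransformer(self.model_name)
```

Because `cached_property` stores the result on the instance, each resource is loaded at most once per instance, and code paths that only count tokens never pay for loading the embedding model.
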

Performance Impact:
- Improved semantic coherence in chunks
- Better handling of medical terminology
- Reduced redundancy in overlapping regions
- Optimized for downstream retrieval tasks
- Enhanced preservation of medical term context
- More accurate chunk boundaries around keywords

Testing:
- Added token count validation in tests
- Verified keyword preservation in chunks
- Confirmed overlap handling
- Tested with sample medical texts
- Validated medical terminology preservation
- Verified chunk context balance around keywords
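
The token count and keyword presence checks listed above might be expressed along these lines. This is a sketch only: `validate_chunk` and `contains_subsequence` are hypothetical helpers, not the test code added in this commit, and they operate on already-tokenized chunks.

```python
from typing import List, Sequence


def contains_subsequence(tokens: Sequence[str], needle: Sequence[str]) -> bool:
    """True if needle occurs in tokens as one contiguous run."""
    n = len(needle)
    if n == 0:
        return True
    return any(
        list(tokens[i:i + n]) == list(needle)
        for i in range(len(tokens) - n + 1)
    )


def validate_chunk(
    chunk_tokens: Sequence[str],
    keyword_tokens: Sequence[str],
    chunk_size: int = 256,
) -> List[str]:
    """Return a list of problems; an empty list means the chunk passed."""
    problems: List[str] = []
    if len(chunk_tokens) > chunk_size:
        problems.append(
            f"chunk has {len(chunk_tokens)} tokens, limit is {chunk_size}"
        )
    if not contains_subsequence(chunk_tokens, keyword_tokens):
        problems.append("keyword tokens not found as a contiguous run")
    return problems
```

Requiring a contiguous run is what catches a split term: a chunk containing "ST", "segment", "elevation" fails the check for the keyword "ST elevation" even though both words are present.
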