refactor(data_processing): optimize chunking strategy with token-based approach

BREAKING CHANGE: Switch from character-based to token-based chunking and improve keyword context preservation

- Replace character-based chunking with token-based approach using PubMedBERT tokenizer
- Set chunk_size to 256 tokens and chunk_overlap to 64 tokens for optimal performance
- Implement dynamic chunking strategy centered around medical keywords
- Add token count validation to ensure semantic integrity
- Optimize memory usage with lazy loading of tokenizer and model
- Update chunking methods to handle token-level operations
- Add comprehensive logging for debugging token counts
- Update tests to verify token-based chunking behavior

Recent Improvements:
- Fix keyword context preservation in chunks
- Implement separate tokenization for pre-keyword and post-keyword text
- Add precise boundary calculation based on keyword length
- Ensure medical terms (e.g., "ST elevation") remain intact
- Improve chunk boundary calculations to maintain keyword context
- Add validation to verify keyword presence in generated chunks

Technical Details:
- chunk_size: 256 tokens (based on PubMedBERT context window)
- overlap: 64 tokens (25% overlap for context continuity)
- Model: NeuML/pubmedbert-base-embeddings (768 dims)
- Tokenizer: Same as embedding model for consistency
- Keyword-centered chunking with balanced context distribution

Performance Impact:
- Improved semantic coherence in chunks
- Better handling of medical terminology
- Reduced redundancy in overlapping regions
- Optimized for downstream retrieval tasks
- Enhanced preservation of medical term context
- More accurate chunk boundaries around keywords

Testing:
- Added token count validation in tests
- Verified keyword preservation in chunks
- Confirmed overlap handling
- Tested with sample medical texts
- Validated medical terminology preservation
- Verified chunk context balance around keywords
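
Illustrative sketch (not the code in this commit): the two ideas described above, fixed-size token windows with overlap and a keyword-centered chunk built from separately tokenized pre-keyword and post-keyword text, might look roughly like this. The function names `chunk_by_tokens` and `keyword_centered_chunk` are hypothetical, and a whitespace split stands in for the PubMedBERT tokenizer so the example runs without downloading a model; token counts from the real tokenizer will differ.

```python
from typing import Callable, List, Optional


def chunk_by_tokens(tokens: List[str],
                    chunk_size: int = 256,
                    chunk_overlap: int = 64) -> List[List[str]]:
    """Split a token list into windows of at most chunk_size tokens,
    where consecutive windows share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


def keyword_centered_chunk(text: str,
                           keyword: str,
                           chunk_size: int = 256,
                           tokenize: Callable[[str], List[str]] = str.split
                           ) -> Optional[List[str]]:
    """Return at most chunk_size tokens with the keyword intact and the
    remaining token budget balanced on both sides of it.

    Assumption: `tokenize` is a stand-in (whitespace split by default);
    any callable mapping str -> list of tokens can be passed instead.
    Assumption: lowercasing does not change string length (true for ASCII).
    """
    idx = text.lower().find(keyword.lower())
    if idx == -1:
        return None
    end = idx + len(keyword)
    # Tokenize the three spans separately so the keyword is never split
    # across a chunk boundary.
    before = tokenize(text[:idx])
    kw = tokenize(text[idx:end])
    after = tokenize(text[end:])
    budget = chunk_size - len(kw)
    if budget < 0:
        raise ValueError("keyword is longer than chunk_size tokens")
    left = budget // 2
    right = budget - left
    # Hand unused budget on a short side to the other side.
    if len(before) < left:
        right += left - len(before)
        left = len(before)
    elif len(after) < right:
        left += right - len(after)
        right = len(after)
    left = min(left, len(before))
    right = min(right, len(after))
    return before[len(before) - left:] + kw + after[:right]
```

With the default settings each window advances by 192 tokens (256 minus 64), so the last 64 tokens of one chunk are the first 64 of the next. To approximate the setup described above, one could pass the `tokenize` method of a Hugging Face tokenizer loaded for the embedding model in place of `str.split`; that wiring is an assumption and is not shown here.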