oncall-guide-ai / commit_message_embedding_update.txt

YanBoChen

refactor(data_processing): optimize chunking strategy with token-based approach

87dcd9d about 2 months ago

	refactor(data_processing): optimize chunking strategy with token-based approach

	BREAKING CHANGE: Switch from character-based to token-based chunking and improve keyword context preservation

	- Replace character-based chunking with token-based approach using PubMedBERT tokenizer
	- Set chunk_size to 256 tokens and chunk_overlap to 64 tokens (25% overlap between consecutive chunks)
	- Implement dynamic chunking strategy centered around medical keywords
	- Add token count validation to ensure semantic integrity
	- Optimize memory usage with lazy loading of tokenizer and model
	- Update chunking methods to handle token-level operations
	- Add comprehensive logging for debugging token counts
	- Update tests to verify token-based chunking behavior
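The token-based windowing described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the function name `chunk_tokens` is hypothetical, and it works on any pre-tokenized sequence, so the PubMedBERT tokenizer output would be passed in by the caller.

```python
from typing import List, Sequence


def chunk_tokens(
    tokens: Sequence[str],
    chunk_size: int = 256,
    chunk_overlap: int = 64,
) -> List[List[str]]:
    """Split a token sequence into overlapping windows (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    # Each window starts `stride` tokens after the previous one.
    stride = chunk_size - chunk_overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), stride):
        chunks.append(list(tokens[start:start + chunk_size]))
        # Stop once a window has reached the end of the sequence.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

With the defaults, consecutive windows share 64 tokens and the final window may be shorter than 256 tokens.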

	Recent Improvements:
	- Fix keyword context preservation in chunks
	- Implement separate tokenization for pre-keyword and post-keyword text
	- Add precise boundary calculation based on keyword length
	- Ensure medical terms (e.g., "ST elevation") remain intact
	- Improve chunk boundary calculations to maintain keyword context
	- Add validation to verify keyword presence in generated chunks
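One way to picture the keyword-centered strategy above is the sketch below. It is an assumption-laden illustration rather than the committed implementation: `keyword_centered_chunk` is a hypothetical name, and whitespace splitting stands in for the PubMedBERT tokenizer so that the example runs without downloading a model.

```python
from typing import Callable, List, Optional


def keyword_centered_chunk(
    text: str,
    keyword: str,
    chunk_size: int = 256,
    tokenize: Callable[[str], List[str]] = str.split,  # stand-in tokenizer (assumption)
) -> Optional[str]:
    """Build one chunk around the first occurrence of `keyword` (hypothetical helper)."""
    pos = text.lower().find(keyword.lower())
    if pos == -1:
        return None
    end = pos + len(keyword)

    # Tokenize the text before and after the keyword separately,
    # so the keyword itself is never split across a boundary.
    pre_tokens = tokenize(text[:pos])
    post_tokens = tokenize(text[end:])
    keyword_text = text[pos:end]

    # The context budget is what remains after reserving room for the keyword.
    budget = max(chunk_size - len(tokenize(keyword_text)), 0)
    left = min(len(pre_tokens), budget // 2)
    right = min(len(post_tokens), budget - left)
    # Hand any budget the right side could not use back to the left side.
    left = min(len(pre_tokens), budget - right)

    selected_pre = pre_tokens[len(pre_tokens) - left:]
    chunk = " ".join(selected_pre + [keyword_text] + post_tokens[:right])

    # Validate that the keyword survived intact.
    if keyword.lower() not in chunk.lower():
        raise ValueError(f"keyword {keyword!r} missing from generated chunk")
    return chunk
```

Because the keyword span is copied verbatim from the source text, a multi-word term such as "ST elevation" stays in one piece regardless of how the surrounding context is tokenized.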

	Technical Details:
	- chunk_size: 256 tokens (half of the 512-token input limit of BERT-base models such as PubMedBERT)
	- chunk_overlap: 64 tokens (25% of chunk_size, for context continuity)
	- Model: NeuML/pubmedbert-base-embeddings (768 dims)
	- Tokenizer: Same as embedding model for consistency
	- Keyword-centered chunking with balanced context distribution
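The parameters above could be gathered in a small configuration object, which also makes the derived quantities explicit. The class name `ChunkingConfig` and its methods are hypothetical and only illustrate the arithmetic; the values come from the list above.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkingConfig:
    """Hypothetical container for the chunking parameters listed above."""
    chunk_size: int = 256
    chunk_overlap: int = 64
    model_name: str = "NeuML/pubmedbert-base-embeddings"
    embedding_dim: int = 768

    @property
    def stride(self) -> int:
        # Number of new tokens each successive window contributes.
        return self.chunk_size - self.chunk_overlap

    @property
    def overlap_ratio(self) -> float:
        return self.chunk_overlap / self.chunk_size

    def num_chunks(self, n_tokens: int) -> int:
        # Window count for a plain sliding window over n_tokens tokens.
        if n_tokens <= 0:
            return 0
        if n_tokens <= self.chunk_size:
            return 1
        return 1 + math.ceil((n_tokens - self.chunk_size) / self.stride)
```

With the defaults the stride is 192 tokens and the overlap ratio is 0.25, so a 600-token document yields three windows under a plain sliding window.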

	Performance Impact:
	- Improved semantic coherence in chunks
	- Better handling of medical terminology
	- Reduced redundancy in overlapping regions
	- Optimized for downstream retrieval tasks
	- Enhanced preservation of medical term context
	- More accurate chunk boundaries around keywords

	Testing:
	- Added token count validation in tests
	- Verified keyword preservation in chunks
	- Confirmed overlap handling
	- Tested with sample medical texts
	- Validated medical terminology preservation
	- Verified chunk context balance around keywords
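The token count and keyword checks listed above might look something like the following. This is a sketch under stated assumptions, not the project's test suite: `validate_chunk` is a hypothetical helper, and whitespace splitting again stands in for the real tokenizer.

```python
from typing import Callable, List


def validate_chunk(
    chunk: str,
    keyword: str,
    chunk_size: int = 256,
    tokenize: Callable[[str], List[str]] = str.split,  # stand-in tokenizer (assumption)
) -> List[str]:
    """Return a list of problems found in a chunk; an empty list means it passed."""
    problems: List[str] = []

    n_tokens = len(tokenize(chunk))
    if n_tokens == 0:
        problems.append("chunk is empty")
    if n_tokens > chunk_size:
        problems.append(f"token count {n_tokens} exceeds chunk_size {chunk_size}")

    # Case-insensitive check that the medical term survived chunking intact.
    if keyword.lower() not in chunk.lower():
        problems.append(f"keyword {keyword!r} not found in chunk")

    return problems
```

Returning a list of problems instead of raising on the first failure lets a test report every violated constraint for a chunk at once.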