File size: 1,962 Bytes
87dcd9d
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
refactor(data_processing): optimize chunking strategy with token-based approach

BREAKING CHANGE: Switch from character-based to token-based chunking and improve keyword context preservation

- Replace character-based chunking with token-based approach using PubMedBERT tokenizer
- Set chunk_size to 256 tokens and chunk_overlap to 64 tokens for optimal performance
- Implement dynamic chunking strategy centered around medical keywords
- Add token count validation to ensure semantic integrity
- Optimize memory usage with lazy loading of tokenizer and model
- Update chunking methods to handle token-level operations
- Add comprehensive logging for debugging token counts
- Update tests to verify token-based chunking behavior

Recent Improvements:
- Fix keyword context preservation in chunks
- Implement separate tokenization for pre-keyword and post-keyword text
- Add precise boundary calculation based on keyword length
- Ensure medical terms (e.g., "ST elevation") remain intact
- Improve chunk boundary calculations to maintain keyword context
- Add validation to verify keyword presence in generated chunks

Technical Details:
- chunk_size: 256 tokens (based on PubMedBERT context window)
- overlap: 64 tokens (25% overlap for context continuity)
- Model: NeuML/pubmedbert-base-embeddings (768 dims)
- Tokenizer: Same as embedding model for consistency
- Keyword-centered chunking with balanced context distribution

Performance Impact:
- Improved semantic coherence in chunks
- Better handling of medical terminology
- Reduced redundancy in overlapping regions
- Optimized for downstream retrieval tasks
- Enhanced preservation of medical term context
- More accurate chunk boundaries around keywords

Testing:
- Added token count validation in tests
- Verified keyword preservation in chunks
- Confirmed overlap handling
- Tested with sample medical texts
- Validated medical terminology preservation
- Verified chunk context balance around keywords