YanBoChen committed on
Commit 6083d96 · 1 Parent(s): 87dcd9d

refactor(data_processing): add token-based chunking strategy for improved keyword context

src/commit_message_20250726_data_processing.txt CHANGED

@@ -1,12 +1,13 @@
 feat(data-processing): implement data processing pipeline with embeddings

-BREAKING CHANGE: Add data processing implementation with robust path handling
+BREAKING CHANGE: Add data processing implementation with robust path handling and improved text processing

 Key Changes:
 1. Create DataProcessor class for medical data processing:
    - Handle paths with spaces and special characters
    - Support dataset/dataset directory structure
    - Add detailed logging for debugging
+   - Implement case-insensitive text processing

 2. Implement core functionalities:
    - Load filtered emergency and treatment data
@@ -14,11 +15,15 @@ Key Changes:
    - Generate embeddings using NeuML/pubmedbert-base-embeddings
    - Build ANNOY indices for vector search
    - Save embeddings and metadata separately
+   - Improve keyword matching with case-insensitive comparison
+   - Add proper chunk boundary calculations for medical terms

 3. Add test coverage:
    - Basic data loading tests
    - Chunking functionality tests
    - Model loading tests
+   - Token-based chunking validation
+   - Medical terminology preservation tests

 Technical Details:
    - Use pathlib.Path.resolve() for robust path handling
@@ -26,11 +31,20 @@ Technical Details:
    * /models/embeddings/ for vector representations
    * /models/indices/annoy/ for search indices
    - Keep keywords as metadata without embedding
+   - Implement case-insensitive text processing while preserving medical term integrity
+   - Add proper chunk overlap handling

 Testing:
 ✅ Data loading: 11,914 emergency + 11,023 treatment records
 ✅ Chunking: Successful with keyword-centered approach
 ✅ Model loading: NeuML/pubmedbert-base-embeddings (768 dims)
+✅ Token chunking: Verified with medical terms (e.g., "ST elevation")
+
+Storage Structure:
+/models/
+├── embeddings/          # Vector representations
+└── indices/
+    └── annoy/           # Search indices (.ann files)

 Next Steps:
    - Integrate with Meditron for enhanced processing
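
Note: the Technical Details and Storage Structure above outline an embed-and-index step (pathlib-based paths, PubMedBERT embeddings, ANNOY indices). Below is a minimal sketch of that flow, assuming sentence-transformers and the annoy package are used; the helper name embed_and_index, the directory constants, the tree count, and the file names are illustrative and not taken from the repository.

```python
# Illustrative sketch only: the commit does not include code, so the helper
# name and the use of sentence-transformers here are assumptions.
from pathlib import Path

import numpy as np
from annoy import AnnoyIndex                          # ANNOY indices, as named in the commit
from sentence_transformers import SentenceTransformer

# Robust path handling via pathlib.Path.resolve(), per "Technical Details";
# the project-root layout is assumed.
BASE_DIR = Path(__file__).resolve().parent.parent
EMBEDDINGS_DIR = BASE_DIR / "models" / "embeddings"
INDICES_DIR = BASE_DIR / "models" / "indices" / "annoy"

def embed_and_index(chunks: list[str], name: str) -> None:
    """Embed text chunks and build an ANNOY index (hypothetical helper)."""
    EMBEDDINGS_DIR.mkdir(parents=True, exist_ok=True)
    INDICES_DIR.mkdir(parents=True, exist_ok=True)

    model = SentenceTransformer("NeuML/pubmedbert-base-embeddings")  # 768-dim embeddings
    vectors = model.encode(chunks, show_progress_bar=True)

    # Save embeddings separately from the index, as the commit describes.
    np.save(EMBEDDINGS_DIR / f"{name}_embeddings.npy", vectors)

    index = AnnoyIndex(vectors.shape[1], "angular")   # cosine-like distance
    for i, vec in enumerate(vectors):
        index.add_item(i, vec)
    index.build(10)                                   # tree count is a placeholder
    index.save(str(INDICES_DIR / f"{name}.ann"))
```
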
src/commit_message_embedding_update.txt ADDED
@@ -0,0 +1,43 @@
+refactor(data_processing): optimize chunking strategy with token-based approach
+
+BREAKING CHANGE: Switch from character-based to token-based chunking and improve keyword context preservation
+
+- Replace character-based chunking with token-based approach using PubMedBERT tokenizer
+- Set chunk_size to 256 tokens and chunk_overlap to 64 tokens for optimal performance
+- Implement dynamic chunking strategy centered around medical keywords
+- Add token count validation to ensure semantic integrity
+- Optimize memory usage with lazy loading of tokenizer and model
+- Update chunking methods to handle token-level operations
+- Add comprehensive logging for debugging token counts
+- Update tests to verify token-based chunking behavior
+
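
To make the bullets above concrete, here is a minimal sketch of token-based chunking with a 256-token window and 64-token overlap, plus lazy loading of the tokenizer. It assumes transformers' AutoTokenizer and the tokenizer shipped with NeuML/pubmedbert-base-embeddings; the function names are illustrative, not the repository's actual code.

```python
# Minimal sketch of token-based chunking with overlap; names are assumptions.
from transformers import AutoTokenizer

_tokenizer = None  # lazy-loaded, as the commit message describes

def get_tokenizer():
    global _tokenizer
    if _tokenizer is None:
        # Same tokenizer as the embedding model, for consistency.
        _tokenizer = AutoTokenizer.from_pretrained("NeuML/pubmedbert-base-embeddings")
    return _tokenizer

def chunk_by_tokens(text: str, chunk_size: int = 256, overlap: int = 64) -> list[str]:
    """Split text into chunks of `chunk_size` tokens with `overlap` tokens shared."""
    tokenizer = get_tokenizer()
    ids = tokenizer.encode(text, add_special_tokens=False)
    chunks, start = [], 0
    while start < len(ids):
        window = ids[start:start + chunk_size]
        chunks.append(tokenizer.decode(window))
        if start + chunk_size >= len(ids):
            break
        start += chunk_size - overlap  # slide the window, keeping a 64-token overlap
    return chunks
```
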
+Recent Improvements:
+- Fix keyword context preservation in chunks
+- Implement separate tokenization for pre-keyword and post-keyword text
+- Add precise boundary calculation based on keyword length
+- Ensure medical terms (e.g., "ST elevation") remain intact
+- Improve chunk boundary calculations to maintain keyword context
+- Add validation to verify keyword presence in generated chunks
+
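
A sketch of the keyword-centered boundary calculation described above: the text before and after the keyword is tokenized separately so the keyword itself (e.g. "ST elevation") is never split, the remaining token budget is balanced across both sides, and keyword presence is validated in the result. It reuses get_tokenizer() from the previous sketch; all names are illustrative and the real implementation may differ.

```python
# Sketch of keyword-centered chunking, reusing get_tokenizer() from the
# previous snippet. Names are illustrative, not the repository's code.
def chunk_around_keyword(text: str, keyword: str, chunk_size: int = 256) -> str:
    """Build one chunk whose token budget is balanced around `keyword`."""
    tokenizer = get_tokenizer()
    pos = text.lower().find(keyword.lower())          # case-insensitive match
    if pos == -1:
        return ""                                     # caller decides how to handle misses

    # Tokenize pre-keyword, keyword, and post-keyword text separately so the
    # keyword can never be split across a chunk boundary.
    pre_ids = tokenizer.encode(text[:pos], add_special_tokens=False)
    kw_ids = tokenizer.encode(text[pos:pos + len(keyword)], add_special_tokens=False)
    post_ids = tokenizer.encode(text[pos + len(keyword):], add_special_tokens=False)

    # Split the remaining token budget roughly evenly on both sides.
    budget = max(chunk_size - len(kw_ids), 0)
    half = budget // 2
    left = pre_ids[-half:] if half else []
    right = post_ids[:budget - len(left)]

    chunk = tokenizer.decode(left + kw_ids + right)
    assert keyword.lower() in chunk.lower()           # validate keyword presence
    return chunk
```
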
+Technical Details:
+- chunk_size: 256 tokens (based on PubMedBERT context window)
+- overlap: 64 tokens (25% overlap for context continuity)
+- Model: NeuML/pubmedbert-base-embeddings (768 dims)
+- Tokenizer: Same as embedding model for consistency
+- Keyword-centered chunking with balanced context distribution
+
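
A hypothetical usage snippet tying the parameters above together (256-token chunks, 64-token / 25% overlap, within PubMedBERT's 512-token context window) and logging per-chunk token counts, in the spirit of the "comprehensive logging" bullet. It continues the sketches above; the sample text and logger name are invented.

```python
# Hypothetical usage of the sketches above; the sample text is invented.
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("data_processing")

CHUNK_SIZE = 256      # tokens; fits comfortably inside PubMedBERT's 512-token window
CHUNK_OVERLAP = 64    # tokens; 25% of CHUNK_SIZE

text = "Patient presented with crushing chest pain; ECG showed ST elevation in V1-V4 ..."
for i, chunk in enumerate(chunk_by_tokens(text, CHUNK_SIZE, CHUNK_OVERLAP)):
    n_tokens = len(get_tokenizer().encode(chunk, add_special_tokens=False))
    logger.info("chunk %d: %d tokens", i, n_tokens)
```
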
+Performance Impact:
+- Improved semantic coherence in chunks
+- Better handling of medical terminology
+- Reduced redundancy in overlapping regions
+- Optimized for downstream retrieval tasks
+- Enhanced preservation of medical term context
+- More accurate chunk boundaries around keywords
+
+Testing:
+- Added token count validation in tests
+- Verified keyword preservation in chunks
+- Confirmed overlap handling
+- Tested with sample medical texts
+- Validated medical terminology preservation
+- Verified chunk context balance around keywords
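
The test items above could translate into something like the following pytest-style sketch, reusing the hypothetical chunking helpers from the earlier snippets; the test names and sample passages are invented, not the project's actual test suite.

```python
# Pytest-style sketch of the checks listed above; chunk_by_tokens,
# chunk_around_keyword, and get_tokenizer come from the earlier snippets.
def test_chunks_respect_token_budget():
    sample = ("The 12-lead ECG showed ST elevation in the anterior leads, "
              "prompting immediate activation of the cath lab. ") * 30
    for chunk in chunk_by_tokens(sample, chunk_size=256, overlap=64):
        n = len(get_tokenizer().encode(chunk, add_special_tokens=False))
        # decode/re-encode can shift counts by a token or two at subword
        # boundaries, so allow a small slack around the 256-token budget
        assert n <= 256 + 2

def test_keyword_preserved_in_chunk():
    sample = "Acute MI was suspected after ST elevation appeared on the monitor."
    chunk = chunk_around_keyword(sample, "ST elevation", chunk_size=256)
    assert "st elevation" in chunk.lower()
```
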