YanBoChen committed
Commit 6083d96 · 1 Parent(s): 87dcd9d
refactor(data_processing): add token-based chunking strategy for improved keyword context
src/commit_message_20250726_data_processing.txt
CHANGED
@@ -1,12 +1,13 @@
 feat(data-processing): implement data processing pipeline with embeddings
 
-BREAKING CHANGE: Add data processing implementation with robust path handling
+BREAKING CHANGE: Add data processing implementation with robust path handling and improved text processing
 
 Key Changes:
 1. Create DataProcessor class for medical data processing:
    - Handle paths with spaces and special characters
    - Support dataset/dataset directory structure
    - Add detailed logging for debugging
+   - Implement case-insensitive text processing
 
 2. Implement core functionalities:
    - Load filtered emergency and treatment data
@@ -14,11 +15,15 @@ Key Changes:
    - Generate embeddings using NeuML/pubmedbert-base-embeddings
    - Build ANNOY indices for vector search
    - Save embeddings and metadata separately
+   - Improve keyword matching with case-insensitive comparison
+   - Add proper chunk boundary calculations for medical terms
 
 3. Add test coverage:
    - Basic data loading tests
    - Chunking functionality tests
    - Model loading tests
+   - Token-based chunking validation
+   - Medical terminology preservation tests
 
 Technical Details:
 - Use pathlib.Path.resolve() for robust path handling
@@ -26,11 +31,20 @@ Technical Details:
   * /models/embeddings/ for vector representations
   * /models/indices/annoy/ for search indices
 - Keep keywords as metadata without embedding
+- Implement case-insensitive text processing while preserving medical term integrity
+- Add proper chunk overlap handling
 
 Testing:
 ✅ Data loading: 11,914 emergency + 11,023 treatment records
 ✅ Chunking: Successful with keyword-centered approach
 ✅ Model loading: NeuML/pubmedbert-base-embeddings (768 dims)
+✅ Token chunking: Verified with medical terms (e.g., "ST elevation")
+
+Storage Structure:
+/models/
+├── embeddings/        # Vector representations
+└── indices/
+    └── annoy/         # Search indices (.ann files)
 
 Next Steps:
 - Integrate with Meditron for enhanced processing
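The keyword-centered, case-insensitive chunking this diff describes can be illustrated with a minimal sketch. The function name and parameters below are hypothetical, not the repository's actual DataProcessor API; the sketch only shows the case-insensitive matching and balanced boundary calculation the commit message mentions.

import re

def keyword_centered_chunk(text: str, keyword: str, chunk_size: int = 512):
    """Return a window of about chunk_size characters centered on the first
    case-insensitive occurrence of keyword, or None if keyword is absent."""
    match = re.search(re.escape(keyword), text, flags=re.IGNORECASE)
    if match is None:
        return None
    # Split the remaining budget evenly before and after the keyword so the
    # term keeps balanced context on both sides.
    budget = max(chunk_size - (match.end() - match.start()), 0)
    start = max(match.start() - budget // 2, 0)
    end = min(match.end() + budget // 2, len(text))
    return text[start:end]

# The medical term stays intact even when it appears in a different case.
doc = "Patient presented with chest pain. ECG showed st elevation in leads V1-V4."
chunk = keyword_centered_chunk(doc, "ST elevation", chunk_size=40)
assert chunk is not None and "st elevation" in chunk.lower()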
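The embedding-and-indexing flow and the storage layout can likewise be sketched. The model name, directory structure, and 768-dim size come from the commit message; file names such as chunks.npy, metadata.json, and chunks.ann are illustrative assumptions.

import json
from pathlib import Path

import numpy as np
from annoy import AnnoyIndex
from sentence_transformers import SentenceTransformer

# resolve() yields an absolute, normalized path even when the input contains
# spaces or special characters, per the commit's robust path handling.
base_dir = Path("./models").resolve()
(base_dir / "embeddings").mkdir(parents=True, exist_ok=True)
(base_dir / "indices" / "annoy").mkdir(parents=True, exist_ok=True)

model = SentenceTransformer("NeuML/pubmedbert-base-embeddings")
chunks = [
    "Aspirin is indicated for acute coronary syndrome.",
    "ST elevation suggests acute myocardial infarction.",
]
embeddings = model.encode(chunks)  # shape (n_chunks, 768)
np.save(base_dir / "embeddings" / "chunks.npy", embeddings)

# Keywords live in metadata only; they are never embedded themselves.
metadata = [{"chunk_id": i, "keywords": []} for i in range(len(chunks))]
(base_dir / "embeddings" / "metadata.json").write_text(json.dumps(metadata))

# ANNOY index over the 768-dim vectors; angular distance suits cosine-style search.
index = AnnoyIndex(768, "angular")
for i, vector in enumerate(embeddings):
    index.add_item(i, vector.tolist())
index.build(10)  # number of trees; more trees = better recall, larger index
index.save(str(base_dir / "indices" / "annoy" / "chunks.ann"))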
src/commit_message_embedding_update.txt
ADDED
@@ -0,0 +1,43 @@
+refactor(data_processing): optimize chunking strategy with token-based approach
+
+BREAKING CHANGE: Switch from character-based to token-based chunking and improve keyword context preservation
+
+- Replace character-based chunking with token-based approach using PubMedBERT tokenizer
+- Set chunk_size to 256 tokens and chunk_overlap to 64 tokens for optimal performance
+- Implement dynamic chunking strategy centered around medical keywords
+- Add token count validation to ensure semantic integrity
+- Optimize memory usage with lazy loading of tokenizer and model
+- Update chunking methods to handle token-level operations
+- Add comprehensive logging for debugging token counts
+- Update tests to verify token-based chunking behavior
+
+Recent Improvements:
+- Fix keyword context preservation in chunks
+- Implement separate tokenization for pre-keyword and post-keyword text
+- Add precise boundary calculation based on keyword length
+- Ensure medical terms (e.g., "ST elevation") remain intact
+- Improve chunk boundary calculations to maintain keyword context
+- Add validation to verify keyword presence in generated chunks
+
+Technical Details:
+- chunk_size: 256 tokens (based on PubMedBERT context window)
+- overlap: 64 tokens (25% overlap for context continuity)
+- Model: NeuML/pubmedbert-base-embeddings (768 dims)
+- Tokenizer: same as the embedding model, for consistency
+- Keyword-centered chunking with balanced context distribution
+
+Performance Impact:
+- Improved semantic coherence in chunks
+- Better handling of medical terminology
+- Reduced redundancy in overlapping regions
+- Optimized for downstream retrieval tasks
+- Enhanced preservation of medical term context
+- More accurate chunk boundaries around keywords
+
+Testing:
+- Added token count validation in tests
+- Verified keyword preservation in chunks
+- Confirmed overlap handling
+- Tested with sample medical texts
+- Validated medical terminology preservation
+- Verified chunk context balance around keywords
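To make the stated parameters concrete, here is a minimal sketch of token-based chunking under the commit's settings (chunk_size=256, overlap=64, tokenizer shared with the embedding model and lazily loaded). Function names are illustrative assumptions, not the module's actual API.

from functools import lru_cache

from transformers import AutoTokenizer

CHUNK_SIZE = 256    # tokens, within PubMedBERT's 512-token context window
CHUNK_OVERLAP = 64  # tokens, i.e. 25% overlap for context continuity

@lru_cache(maxsize=1)
def get_tokenizer():
    # Lazy-load so importing the module stays cheap; using the embedding
    # model's own tokenizer keeps token boundaries consistent.
    return AutoTokenizer.from_pretrained("NeuML/pubmedbert-base-embeddings")

def chunk_by_tokens(text: str) -> list[str]:
    """Split text into overlapping windows of at most CHUNK_SIZE tokens."""
    tokenizer = get_tokenizer()
    ids = tokenizer.encode(text, add_special_tokens=False)
    step = CHUNK_SIZE - CHUNK_OVERLAP  # advance 192 tokens per window
    chunks = []
    for start in range(0, len(ids), step):
        chunks.append(tokenizer.decode(ids[start:start + CHUNK_SIZE]))
        if start + CHUNK_SIZE >= len(ids):
            break  # this window already reached the end of the text
    return chunks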
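And a hypothetical test in the spirit of the "token count validation" and "keyword preservation" items, reusing the sketch above. The small slack is an assumption: decoding a token window and re-encoding it can split wordpieces differently at the window edges.

def test_chunks_respect_token_budget():
    tokenizer = get_tokenizer()
    text = "ST elevation myocardial infarction requires immediate reperfusion. " * 40
    chunks = chunk_by_tokens(text)
    assert chunks, "expected at least one chunk"
    for chunk in chunks:
        n_tokens = len(tokenizer.encode(chunk, add_special_tokens=False))
        # Allow a couple of tokens of slack for wordpiece boundary shifts.
        assert n_tokens <= CHUNK_SIZE + 2
    # Keyword preservation: the medical term must survive intact in some chunk.
    assert any("st elevation" in chunk.lower() for chunk in chunks)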