feat(dataset): Implement emergency subset extraction with enhanced matching

Implement initial data preprocessing pipeline for RAG system evaluation.
Key Changes:
- Enhance keyword matching with findall and non-capturing groups
- Add matched column for tracking all keyword occurrences
- Implement basic statistics calculation
- Prepare for data exploration phase
Technical Details:
1. Keyword Matching Enhancement:
- Use non-capturing groups (?:...) to handle multiple matches
- Implement proper regex pattern with word boundaries
- Handle NaN values explicitly
2. Data Flow:
```
Raw Data (guidelines_source_filtered.jsonl)
│
▼
Keyword Matching (emergency_keywords.txt)
│ ┌─ Pattern: \b(?:keyword1|keyword2)\b
│ └─ Flags: re.IGNORECASE
▼
Multiple Match Extraction
│ ┌─ Use str.findall
│ └─ Join multiple matches with |
▼
Subset Creation
│ ┌─ matched column: "keyword1|keyword2"
│ └─ has_emergency flag
▼
Output Files
├─ emergency_subset.jsonl
└─ emergency_subset.csv
```
3. Next Steps:
- Run data_explorer.py for detailed analysis
- Evaluate subset quality against draft_offlineSubsetbuilding.md
- Consider implementing treatment subset with similar approach
Performance Metrics:
- Capture all keyword matches (not just first occurrence)
- Calculate average keywords per document
- Prepare for co-occurrence analysis
This approach aligns with the RAG system requirements:
1. Maintain semantic relationships (multiple keyword tracking)
2. Enable detailed analysis (matched column)
3. Support future enhancements (treatment subset)
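
For reference, the enhanced matching described above reduces to the minimal sketch below (illustrative rows and a three-keyword list only; the column names mirror the scripts in this commit, and `re.escape` is added here as a precaution that the actual script does not apply):

```python
import re
import pandas as pd

# Illustrative keywords; the real list lives in emergency_keywords.txt
keywords = ["cardiac arrest", "sepsis", "hypotension"]

# Word-bounded, non-capturing alternation, matched case-insensitively
pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"

df = pd.DataFrame({"clean_text": ["Suspected sepsis with hypotension.", None, "Routine follow-up."]})

# findall keeps every occurrence; NaN is handled explicitly via fillna
df["matched"] = (
    df["clean_text"]
    .fillna("")
    .str.findall(pattern, flags=re.IGNORECASE)
    .apply("|".join)
)
df["has_emergency"] = df["matched"].str.len() > 0

# Average keywords per matched document (the pipe must be escaped in str.count)
avg = df.loc[df["has_emergency"], "matched"].str.count(r"\|").add(1).mean()
print(df[["matched", "has_emergency"]])
print(f"Average keywords per matched document: {avg:.2f}")
```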
- dataset/check_source.py +18 -0
- dataset/filter_guidelines.py +31 -0
- dataset/keywords/emergency_keywords.txt +44 -0
- dataset/keywords/treatment_keywords.txt +113 -0
- dataset/scripts/01_filter_emergency.py +54 -0
- dataset/scripts/02_filter_treatment.py +37 -0
- dataset/scripts/20250722_datesetA_emergency_subset_preprocessing_commit_message.txt +52 -0
- dataset/scripts/data_explorer.py +92 -0

dataset/check_source.py (new file, +18 lines)

```python
import pandas as pd

# Load the JSONL file that was just downloaded and filtered
df = pd.read_json("dataset/guidelines_source_filtered.jsonl", lines=True)

# Show how many records each source contributes
print("📊 Record counts per source:")
print(df["source"].value_counts())

# Verify that only the nine approved sources are present
expected_sources = {"cco", "cdc", "cma", "icrc", "nice", "pubmed", "spor", "who", "wikidoc"}
actual_sources = set(df["source"].unique())

# Report the verification result
if actual_sources == expected_sources:
    print("✅ Sources match expectations exactly; no unexpected sources.")
else:
    print(f"❌ Unexpected sources found: {actual_sources - expected_sources}")
```

dataset/filter_guidelines.py (new file, +31 lines)

```python
# filter_guidelines.py

from datasets import load_dataset
import pandas as pd
import os

# ✅ Trusted source abbreviations (the `source` field in the Hugging Face dataset)
approved_sources = ["cco", "cdc", "cma", "icrc", "nice", "pubmed", "spor", "who", "wikidoc"]

# Step 1: Load the dataset from Hugging Face
print("⏳ Loading dataset...")
ds = load_dataset("epfl-llm/guidelines", split="train")

# Step 2: Filter on the source field
print("🔍 Filtering for trusted sources...")
ds_filtered = ds.filter(lambda ex: ex["source"] in approved_sources)
print(f"✅ Filtering complete: {len(ds_filtered)} records in total.")

# Step 3: Convert to a pandas DataFrame
print("📄 Converting to DataFrame...")
df = ds_filtered.to_pandas()

# Step 4: Create the dataset/ folder if it does not exist
os.makedirs("dataset", exist_ok=True)

# Step 5: Save as JSONL and CSV under dataset/
print("💾 Saving to the dataset/ folder...")
df.to_json("dataset/guidelines_source_filtered.jsonl", orient="records", lines=True)
df.to_csv("dataset/guidelines_source_filtered.csv", index=False)

print("🎉 Done! Records from trusted sources have been saved.")
```

dataset/keywords/emergency_keywords.txt (new file, +44 lines)

```text
Acute abdomen
Acute bleeding
Acute Coronary Syndrome
Acute Kidney Injury
Acute pancreatitis
Acute respiratory distress syndrome
Acute stroke
Anaphylaxis
Anaphylactic Shock
Arrhythmia
Atrial fibrillation
Bradycardia
Cardiac arrest
Cardiogenic Shock
Chest pain
Dyspnea
Fever
Gastrointestinal Hemorrhage (GI bleeding)
Hemorrhage
Hemorrhagic stroke
Hyperthermia
Hypovolemic Shock
Hypotension
Hypothermia
Internal bleeding
Intracranial Hemorrhages
Ischemic stroke
Loss of consciousness
Myocardial Infarction
MI
Pulmonary Edema
Pulmonary Embolism
Respiratory distress
Respiratory failure
Sepsis
Sepsis, Severe
Septic Shock
Shock
Status Epilepticus
Syncope
Tachycardia
Tachypnea
Traumatic Brain Injury
Ventricular Tachycardia
```

dataset/keywords/treatment_keywords.txt (new file, +113 lines)

```text
iv fluids
Infusion Intravenous
fluid resuscitation
Infusion Intravenous
normal saline
Infusion Intravenous
crystalloids
Infusion Intravenous
vasopressors
Vasoconstrictor Agents
Epinephrine
Ondansetron
Ibuprofen
Morphine
Lidocaine
Airway Management
intubation
Intubation Intratracheal
ventilation support
Ventilators
oxygen therapy
Oxygen Inhalation Therapy
cpap
Continuous Positive Airway Pressure
bipap
Bi-level Positive Airway Pressure
Nebulization
cpr
Cardiopulmonary Resuscitation
ACLS
Advanced Cardiac Life Support
Defibrillation
Cardioversion
Blood Transfusion
transfusion
hemodynamic monitoring
Hemodynamics
central line placement
Catheterization Central Venous
arterial line placement
Catheterization Arterial
Hemostasis
wound care
Wound Management
Suturing
Tourniquet
compression dressing
Wound Dressing
splinting
Splints
radiologic imaging
Radiography
point-of-care ultrasound
POCUS
Ultrasonography Point-of-Care
x-ray
Radiography
ct scan
Tomography X-Ray Computed
laboratory testing
Laboratory Techniques
Sedation
analgesia
Analgesia
procedural sedation
Anesthesia Procedural
ketamine
Ketamine
midazolam
Midazolam
supportive care
Supportive Care
monitoring
Patient Monitoring
vital signs monitoring
Vital Signs
icu transfer
Intensive Care Units
treatment
Therapeutics
manage
Patient Management
management
Patient Management
intervention
Therapeutic Intervention
Therapy
medication
Drug Therapy
procedure
Surgical Procedures Operative
resuscitation
Cardiopulmonary Resuscitation
administer
Drug Administration Routes
dose
Dosage Forms
monitor
Patient Monitoring
Oxygen
fluid
Infusion Intravenous
surgery
Surgical Procedures
antibiotic
Anti-Bacterial Agents
Dopamine
Amiodarone
levophed
Norepinephrine
Epinephrine
Bosmin
Adrenaline
```

dataset/scripts/01_filter_emergency.py (new file, +54 lines)

```python
# scripts/01_filter_emergency.py

import os
import re
import pandas as pd

# Helper: load keywords and report progress
def load_keywords(path):
    print(f"📥 Loading keywords: {path}")
    with open(path, "r", encoding="utf-8") as f:
        kws = [line.strip() for line in f if line.strip()]
    print(f"   Loaded {len(kws)} keywords")
    return kws

# Step 1: Read the source data
print("1️⃣ Reading source data…")
source_path = "../dataset/guidelines_source_filtered.jsonl"
df = pd.read_json(source_path, lines=True)
print(f"   Read {len(df)} records")

# Step 2: Load emergency keywords and match them
print("2️⃣ Loading emergency keywords and matching…")
keywords = load_keywords("../keywords/emergency_keywords.txt")
pattern = r"\b(?:" + "|".join(keywords) + r")\b"  # non-capturing group (?:...)
# Note: keywords are joined unescaped, so entries containing regex
# metacharacters (e.g. parentheses) are interpreted as regex syntax.

# Match the keywords
df["matched"] = (
    df["clean_text"]
    .fillna("")  # turn NaN into ""
    .str.findall(pattern, flags=re.IGNORECASE)
    .apply(lambda lst: "|".join(lst) if lst else "")
)
df["has_emergency"] = df["matched"].str.len() > 0
cnt_em = df["has_emergency"].sum()

# Average number of keyword matches per matched record (mind the escaping)
avg_matches = (
    df[df["has_emergency"]]["matched"]
    .str.count(r"\|")  # the pipe must be escaped here
    .add(1)
    .mean()
)

print(f"   Matched {cnt_em} emergency-related records")
print(f"   Each matched record contains {avg_matches:.2f} keywords on average")

# Step 3: Save the emergency subset
print("3️⃣ Saving emergency subset…")
out_dir = "../dataset/emergency"
os.makedirs(out_dir, exist_ok=True)
subset = df[df["has_emergency"]]
subset.to_json(f"{out_dir}/emergency_subset.jsonl", orient="records", lines=True)
subset.to_csv(f"{out_dir}/emergency_subset.csv", index=False)
print(f"✅ Done! Emergency subset with {len(subset)} records saved to `{out_dir}`")
```

dataset/scripts/02_filter_treatment.py (new file, +37 lines)

```python
# scripts/02_filter_treatment.py

import os
import pandas as pd

# Helper: load keywords
def load_keywords(path):
    print(f"📥 Loading keywords: {path}")
    with open(path, "r") as f:
        kws = [line.strip() for line in f if line.strip()]
    print(f"   Loaded {len(kws)} keywords")
    return kws

# Step 1: Load the emergency subset
print("1️⃣ Reading emergency subset…")
emergency_path = "../dataset/emergency/emergency_subset.jsonl"
df = pd.read_json(emergency_path, lines=True)
print(f"   Read {len(df)} emergency-related records")

# Step 2: Load treatment/management keywords and filter
print("2️⃣ Reading treatment/management keywords and filtering…")
treatment_keywords = load_keywords("../keywords/treatment_keywords.txt")
pattern2 = "|".join(treatment_keywords)
df["has_treatment"] = df["clean_text"].str.contains(pattern2, case=False, na=False)
cnt_treat = df["has_treatment"].sum()
print(f"   Matched {cnt_treat} records containing treatment/management descriptions")

# Step 3: Save the emergency + treatment subset
print("3️⃣ Saving emergency + treatment subset…")
out_dir = "../dataset/emergency_treatment"
os.makedirs(out_dir, exist_ok=True)
subset2 = df[df["has_treatment"]]
subset2.to_json(f"{out_dir}/emergency_treatment_subset.jsonl", orient="records", lines=True)
subset2.to_csv(f"{out_dir}/emergency_treatment_subset.csv", index=False)
print(f"   Saved {len(subset2)} records to `{out_dir}`")

print("✅ Done! Emergency + treatment subset generated.")
```

dataset/scripts/20250722_datesetA_emergency_subset_preprocessing_commit_message.txt (new file, +52 lines)

````text
feat(dataset): Implement emergency subset extraction with enhanced matching

Implement initial data preprocessing pipeline for RAG system evaluation.

Key Changes:
- Enhance keyword matching with findall and non-capturing groups
- Add matched column for tracking all keyword occurrences
- Implement basic statistics calculation
- Prepare for data exploration phase

Technical Details:
1. Keyword Matching Enhancement:
- Use non-capturing groups (?:...) to handle multiple matches
- Implement proper regex pattern with word boundaries
- Handle NaN values explicitly

2. Data Flow:
```
Raw Data (guidelines_source_filtered.jsonl)
│
▼
Keyword Matching (emergency_keywords.txt)
│ ┌─ Pattern: \b(?:keyword1|keyword2)\b
│ └─ Flags: re.IGNORECASE
▼
Multiple Match Extraction
│ ┌─ Use str.findall
│ └─ Join multiple matches with |
▼
Subset Creation
│ ┌─ matched column: "keyword1|keyword2"
│ └─ has_emergency flag
▼
Output Files
├─ emergency_subset.jsonl
└─ emergency_subset.csv
```

3. Next Steps:
- Run data_explorer.py for detailed analysis
- Evaluate subset quality against draft_offlineSubsetbuilding.md
- Consider implementing treatment subset with similar approach

Performance Metrics:
- Capture all keyword matches (not just first occurrence)
- Calculate average keywords per document
- Prepare for co-occurrence analysis

This approach aligns with the RAG system requirements:
1. Maintain semantic relationships (multiple keyword tracking)
2. Enable detailed analysis (matched column)
3. Support future enhancements (treatment subset)
````

dataset/scripts/data_explorer.py (new file, +92 lines)

```python
# /scripts/data_explorer.py
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns  # added
import numpy as np  # added
from pathlib import Path
import json  # added

def analyze_subset(file_path, keywords_path, output_dir="analysis"):
    """Analyze the quality and distribution of a subset."""
    print(f"Analyzing: {file_path}")

    # Load the data
    df = pd.read_csv(file_path)
    output_dir = Path(output_dir)

    # 1. Basic statistics (kept from before)
    print(f"Total records: {len(df)}")
    df['text_length'] = df['clean_text'].str.len()  # moved here
    print(f"Average text length: {df['text_length'].mean():.2f}")

    # 2. Keyword analysis (kept from before)
    with open(keywords_path, 'r') as f:
        keywords = [line.strip() for line in f if line.strip()]

    keyword_stats = {}
    for keyword in keywords:
        count = df['clean_text'].str.contains(keyword, case=False).sum()
        keyword_stats[keyword] = int(count)  # cast so the stats serialize to JSON
        print(f"{keyword}: {count} records")

    # 3. Visualization
    output_path = Path(output_dir) / "plots"
    output_path.mkdir(parents=True, exist_ok=True)

    # 3.1 Keyword distribution plot (kept from before)
    plt.figure(figsize=(15, 8))
    plt.bar(keyword_stats.keys(), keyword_stats.values())
    plt.xticks(rotation=45, ha='right')
    plt.title('Keyword match distribution')
    plt.xlabel('Keyword')
    plt.ylabel('Match count')
    # TODO: change the name of the file to the name of the subset
    plt.savefig(output_path / "keyword_distribution_emergency_subset.png", bbox_inches='tight')
    plt.close()

    # 3.2 Text length distribution (new)
    plt.figure(figsize=(10, 6))
    df['text_length'].hist(bins=50)
    plt.title('Text length distribution')
    plt.xlabel('Text length')
    plt.ylabel('Frequency')
    plt.savefig(output_path / "text_length_dist.png", bbox_inches='tight')
    plt.close()

    # 3.3 Keyword co-occurrence analysis (new)
    cooccurrence_matrix = np.zeros((len(keywords), len(keywords)))
    for text in df['clean_text']:
        present_keywords = [k for k in keywords if k.lower() in text.lower()]
        for i, k1 in enumerate(present_keywords):
            for j, k2 in enumerate(present_keywords):
                if i != j:
                    cooccurrence_matrix[keywords.index(k1)][keywords.index(k2)] += 1

    plt.figure(figsize=(12, 8))
    sns.heatmap(cooccurrence_matrix,
                xticklabels=keywords,
                yticklabels=keywords,
                cmap='YlOrRd')
    plt.title('Keyword co-occurrence heatmap')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    # TODO: change the name of the file to the name of the subset
    plt.savefig(output_path / "keyword_cooccurrence_emergency_subset.png", bbox_inches='tight')
    plt.close()

    # 4. Save the statistics (extended from before)
    stats_path = Path(output_dir) / "stats"
    stats_path.mkdir(parents=True, exist_ok=True)

    stats = {
        'basic_stats': {
            'total_records': len(df),
            'avg_text_length': float(df['text_length'].mean()),
            'text_length_quantiles': {k: float(v) for k, v in df['text_length'].describe().items()}
        },
        'keyword_stats': keyword_stats
    }

    # TODO: change the name of the file to the name of the subset
    with open(stats_path / "analysis_stats_emergency_subset.json", 'w', encoding='utf-8') as f:
        json.dump(stats, f, indent=2, ensure_ascii=False)
```
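
data_explorer.py only defines analyze_subset and never calls it; a hypothetical invocation, assuming the module is importable and reusing the relative paths from the scripts above, might look like this:

```python
# Hypothetical usage sketch: assumes data_explorer.py is importable and that the
# emergency subset has already been generated at these (assumed) paths.
from data_explorer import analyze_subset

analyze_subset(
    file_path="../dataset/emergency/emergency_subset.csv",
    keywords_path="../keywords/emergency_keywords.txt",
    output_dir="analysis",
)
```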
|