Daniel Wojahn committed
Commit 75e8f38 · 1 Parent(s): 54934d5
feat(ui): Add preset-based analysis UI and Gradio 6 compatibility
- README.md +104 -48
- app.py +387 -197
- pipeline/hf_embedding.py +10 -8
- pipeline/llm_service.py +95 -95
- pipeline/metrics.py +239 -89
- pipeline/normalize_bo.py +106 -0
- pipeline/process.py +81 -63
- pipeline/progressive_loader.py +14 -14
- pipeline/progressive_ui.py +36 -36
- pipeline/stopwords_bo.py +23 -12
- pipeline/stopwords_lite_bo.py +14 -4
- pipeline/tokenize.py +15 -15
- pipeline/visualize.py +7 -7
- pyproject.toml +29 -4
- requirements.txt +2 -1
- setup.py +21 -27
- theme.py +369 -234
README.md
CHANGED
|
@@ -15,27 +15,53 @@ app_file: app.py
|
|
| 15 |
[](https://creativecommons.org/licenses/by/4.0/)
|
| 16 |
[](https://www.repostatus.org/#active)
|
| 17 |
|
| 18 |
-
|
| 19 |
|
| 20 |
## Background
|
| 21 |
|
| 22 |
-
The Tibetan Text Metrics project
|
| 23 |
|
| 24 |
## Key Features of the Web App
|
| 25 |
|
| 26 |
- **Easy File Upload**: Upload one or more Tibetan `.txt` files directly through the browser.
|
| 27 |
- **Automatic Segmentation**: Uses Tibetan section markers (e.g., `༈`) to automatically split texts into comparable chapters or sections.
|
| 28 |
- **Core Metrics Computed**:
|
| 29 |
-
- **Jaccard Similarity (%)**: Measures vocabulary overlap between segments. *Common Tibetan stopwords can be filtered out to focus on meaningful lexical similarity.*
|
| 30 |
-
- **Normalized Longest Common Subsequence (LCS)**: Identifies the longest shared sequence of words, indicating direct textual parallels.
|
| 31 |
-
- **Fuzzy Similarity**: Uses fuzzy
|
| 32 |
-
- **Semantic Similarity**: Uses sentence-transformer embeddings (
|
| 33 |
- **Handles Long Texts**: Implements automated handling for long segments when computing semantic embeddings.
|
| 34 |
-
- **Model Selection**: Semantic similarity
|
| 35 |
- **Stopword Filtering**: Three levels of filtering for Tibetan words:
|
| 36 |
- **None**: No filtering, includes all words
|
| 37 |
- **Standard**: Filters only common particles and punctuation
|
| 38 |
- **Aggressive**: Filters all function words including particles, pronouns, and auxiliaries
|
|
|
|
| 39 |
- **Interactive Visualizations**:
|
| 40 |
- Heatmaps for Jaccard, LCS, Fuzzy, and Semantic similarity metrics, providing a quick overview of inter-segment relationships.
|
| 41 |
- Bar chart displaying word counts per segment.
|
|
@@ -91,7 +117,10 @@ To obtain meaningful results, it is highly recommended to divide your Tibetan te
|
|
| 91 |
## Implemented Metrics
|
| 92 |
|
| 93 |
**Stopword Filtering:**
|
| 94 |
-
To enhance the accuracy and relevance of similarity scores,
|
| 95 |
|
| 96 |
The comprehensive list of Tibetan stopwords used is adapted and compiled from the following valuable resources:
|
| 97 |
- The **Divergent Discourses** (specifically, their Tibetan stopwords list available at [Zenodo Record 10148636](https://zenodo.org/records/10148636)).
|
|
@@ -99,7 +128,7 @@ The comprehensive list of Tibetan stopwords used is adapted and compiled from th
|
|
| 99 |
|
| 100 |
We extend our gratitude to the creators and maintainers of these projects for making their work available to the community.
|
| 101 |
|
| 102 |
-
Feel free to edit this list of stopwords to better suit your needs. The list is stored in the `pipeline/
|
| 103 |
|
| 104 |
### The application computes and visualizes the following similarity metrics between corresponding chapters/segments of the uploaded texts:
|
| 105 |
|
|
@@ -116,29 +145,33 @@ A higher percentage indicates a greater overlap in the significant vocabularies
|
|
| 116 |
|
| 117 |
This helps focus on meaningful content words rather than grammatical elements.
|
| 118 |
|
| 119 |
-
2. **Normalized LCS (Longest Common Subsequence)**: This metric measures the length of the longest sequence of words that appears in *both* text segments, maintaining their original relative order. Importantly, these words do not need to be directly adjacent (contiguous) in either text.
|
| 120 |
-
* *Note on Interpretation*: It's possible for Normalized LCS to be higher than Jaccard Similarity. This often happens when texts share a substantial 'narrative backbone' or common ordered phrases (leading to a high LCS), even if they use varied surrounding vocabulary or introduce many unique words not part of these core sequences (which would lower the Jaccard score). LCS highlights this sequential, structural similarity, while Jaccard focuses on the overall shared vocabulary regardless of its arrangement.
|
| 121 |
-
3. **Fuzzy Similarity**: This metric uses fuzzy string matching algorithms to detect approximate matches between words, making it particularly valuable for Tibetan texts where spelling variations, dialectal differences, or scribal errors might be present. Unlike exact matching methods (such as Jaccard), fuzzy similarity can recognize when words are similar but not identical. The implementation offers multiple matching methods:
|
| 122 |
-
- **Token Set Ratio** (default): Compares the sets of words regardless of order, finding the best alignment between them
|
| 123 |
-
- **Token Sort Ratio**: Sorts the words alphabetically before comparing, useful for texts with similar vocabulary in different orders
|
| 124 |
-
- **Partial Ratio**: Finds the best matching substring, helpful for detecting when one text contains parts of another
|
| 125 |
-
- **Simple Ratio**: Performs character-by-character comparison, best for detecting minor spelling variations
|
| 126 |
|
| 127 |
-
|
| 128 |
|
| 129 |
-
|
| 130 |
-
|
| 131 |
-
|
| 132 |
-
|
| 133 |
|
| 134 |
-
|
| 135 |
|
| 136 |
-
|
|
|
|
|
|
|
| 137 |
- **None**: No filtering, includes all words in the comparison
|
| 138 |
- **Standard**: Filters only common particles and punctuation
|
| 139 |
- **Aggressive**: Filters all function words including particles, pronouns, and auxiliaries
|
| 140 |
|
| 141 |
-
|
|
|
|
|
|
|
| 142 |
|
| 143 |
## Getting Started (if Run Locally)
|
| 144 |
|
|
@@ -173,40 +206,63 @@ This helps focus on meaningful content words rather than grammatical elements.
|
|
| 173 |
|
| 174 |
## Usage
|
| 175 |
|
| 176 |
-
|
| 177 |
-
|
| 178 |
-
|
| 179 |
-
|
| 180 |
-
|
| 181 |
-
|
| 182 |
-
|
| 183 |
-
|
| 184 |
-
|
| 185 |
-
|
| 186 |
-
|
| 187 |
-
|
| 188 |
-
|
| 189 |
-
|
| 190 |
-
|
| 191 |
-
|
| 192 |
-
|
| 193 |
-
|
| 194 |
-
|
| 195 |
-
|
| 196 |
|
| 197 |
## Embedding Model
|
| 198 |
|
| 199 |
-
Semantic similarity uses Hugging Face sentence-transformer models
|
| 200 |
|
| 201 |
## Structure
|
| 202 |
|
| 203 |
- `app.py` — Gradio web app entry point and UI definition.
|
| 204 |
- `pipeline/` — Modules for file handling, text processing, metrics calculation, and visualization.
|
| 205 |
- `process.py`: Core logic for segmenting texts and orchestrating metric computation.
|
| 206 |
-
- `metrics.py`: Implementation of Jaccard, LCS, and Semantic Similarity.
|
| 207 |
- `hf_embedding.py`: Handles loading and using sentence-transformer models.
|
| 208 |
- `tokenize.py`: Tibetan text tokenization using `botok`.
|
| 209 |
-
- `
|
|
|
|
| 210 |
- `visualize.py`: Generates heatmaps and word count plots.
|
| 211 |
- `requirements.txt` — Python dependencies for the web application.
|
| 212 |
|
|
@@ -228,7 +284,7 @@ If you use this web application or the underlying TTM tool in your research, ple
|
|
| 228 |
author = {Daniel Wojahn},
|
| 229 |
year = {2025},
|
| 230 |
url = {https://github.com/daniel-wojahn/tibetan-text-metrics},
|
| 231 |
-
version = {0.
|
| 232 |
}
|
| 233 |
```
|
| 234 |
|
|
|
|
| 15 |
[](https://creativecommons.org/licenses/by/4.0/)
|
| 16 |
[](https://www.repostatus.org/#active)
|
| 17 |
|
| 18 |
+
Compare Tibetan texts to discover how similar they are. This tool helps scholars identify shared passages, textual variations, and relationships between different versions of Tibetan manuscripts — no programming required.
|
| 19 |
+
|
| 20 |
+
## Quick Start (3 Steps)
|
| 21 |
+
|
| 22 |
+
1. **Upload** two or more Tibetan text files (.txt format)
|
| 23 |
+
2. **Click** "Compare My Texts"
|
| 24 |
+
3. **View** the results — higher scores mean more similarity
|
| 25 |
+
|
| 26 |
+
That's it! The default settings work well for most cases. See the results section for colorful heatmaps showing which chapters are most similar.
|
| 27 |
+
|
| 28 |
+
> **Tip:** If your texts have chapters, separate them with the ༈ marker so the tool can compare chapter-by-chapter.
|
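The chapter-by-chapter comparison mentioned in the tip can be pictured with a minimal sketch. It assumes segmentation is a plain split on the ༈ marker; the helper name `split_into_segments` is hypothetical and is not taken from `pipeline/process.py`.

```python
# Hypothetical helper: split a text into chapters on the Tibetan section marker.
SECTION_MARKER = "\u0f08"  # ༈

def split_into_segments(text: str, marker: str = SECTION_MARKER) -> list[str]:
    """Split on the marker, trim whitespace, and drop empty pieces."""
    return [part.strip() for part in text.split(marker) if part.strip()]

sample = "chapter one ༈ chapter two ༈ "
segments = split_into_segments(sample)
print(len(segments), segments)
```

A text without any marker comes back as a single segment, which is why marking chapters matters for a chapter-level comparison.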
| 29 |
+
|
| 30 |
+
## What's New (v0.4.0)
|
| 31 |
+
|
| 32 |
+
- **New preset-based UI**: Choose "Quick Start" for simple analysis or "Custom" for full control
|
| 33 |
+
- **Three analysis presets**: Standard, Deep (with AI), and Quick (fastest)
|
| 34 |
+
- **Word-level tokenization** is now the default (recommended for Jaccard similarity)
|
| 35 |
+
- **Particle normalization**: Treat grammatical particle variants as equivalent (གི/ཀྱི/གྱི → གི)
|
| 36 |
+
- **LCS normalization options**: Choose how to handle texts of different lengths
|
| 37 |
+
- **Improved stopword matching**: Fixed tsek (་) handling for consistent filtering
|
| 38 |
+
- **Tibetan-optimized fuzzy matching**: Syllable-level methods only (removed character-level methods)
|
| 39 |
+
- **Dharmamitra models**: Buddhist-specific semantic similarity models as default
|
| 40 |
+
- **Modernized theme**: Cleaner UI with better responsive design
|
| 41 |
|
| 42 |
## Background
|
| 43 |
|
| 44 |
+
The Tibetan Text Metrics project provides quantitative methods for assessing textual similarities at the chapter or segment level, helping researchers understand patterns of textual evolution. This web application makes these capabilities accessible through an intuitive interface — no command-line or Python experience needed.
|
| 45 |
|
| 46 |
## Key Features of the Web App
|
| 47 |
|
| 48 |
- **Easy File Upload**: Upload one or more Tibetan `.txt` files directly through the browser.
|
| 49 |
- **Automatic Segmentation**: Uses Tibetan section markers (e.g., `༈`) to automatically split texts into comparable chapters or sections.
|
| 50 |
- **Core Metrics Computed**:
|
| 51 |
+
- **Jaccard Similarity (%)**: Measures vocabulary overlap between segments. Word-level tokenization recommended. *Common Tibetan stopwords can be filtered out to focus on meaningful lexical similarity.*
|
| 52 |
+
- **Normalized Longest Common Subsequence (LCS)**: Identifies the longest shared sequence of words, indicating direct textual parallels. Supports multiple normalization modes (average, min, max).
|
| 53 |
+
- **Fuzzy Similarity**: Uses syllable-level fuzzy matching to detect approximate matches, accommodating spelling variations and scribal differences in Tibetan text.
|
| 54 |
+
- **Semantic Similarity**: Uses Buddhist-specific sentence-transformer embeddings (Dharmamitra) to compare the contextual meaning of segments.
|
| 55 |
- **Handles Long Texts**: Implements automated handling for long segments when computing semantic embeddings.
|
| 56 |
+
- **Model Selection**: Semantic similarity uses Hugging Face sentence-transformer models. Default is Dharmamitra's `buddhist-nlp/buddhist-sentence-similarity`, trained specifically for Buddhist texts.
|
| 57 |
+
- **Tokenization Modes**:
|
| 58 |
+
- **Word** (default, recommended): Keeps multi-syllable words together for more meaningful comparison
|
| 59 |
+
- **Syllable**: Splits into individual syllables for finer-grained analysis
|
| 60 |
- **Stopword Filtering**: Three levels of filtering for Tibetan words:
|
| 61 |
- **None**: No filtering, includes all words
|
| 62 |
- **Standard**: Filters only common particles and punctuation
|
| 63 |
- **Aggressive**: Filters all function words including particles, pronouns, and auxiliaries
|
| 64 |
+
- **Particle Normalization**: Optional normalization of grammatical particles to canonical forms (e.g., གི/ཀྱི/གྱི → གི, ལ/ར/སུ/ཏུ/དུ → ལ). Reduces false negatives from sandhi variation.
|
| 65 |
- **Interactive Visualizations**:
|
| 66 |
- Heatmaps for Jaccard, LCS, Fuzzy, and Semantic similarity metrics, providing a quick overview of inter-segment relationships.
|
| 67 |
- Bar chart displaying word counts per segment.
|
|
|
|
| 117 |
## Implemented Metrics
|
| 118 |
|
| 119 |
**Stopword Filtering:**
|
| 120 |
+
To enhance the accuracy and relevance of similarity scores, the Jaccard Similarity and Fuzzy Similarity calculations incorporate a stopword filtering step. This process removes high-frequency, low-information Tibetan words (e.g., common particles, pronouns, and grammatical markers) before the metrics are computed. Stopwords are normalized to handle tsek (་) variations consistently.
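As a rough illustration of the tsek handling described here, the sketch below strips a trailing tsek from both the stopword entries and the tokens before comparing them. The three-word stopword set and the function names are illustrative assumptions, not the contents of `pipeline/stopwords_bo.py`.

```python
TSEK = "\u0f0b"  # the Tibetan syllable delimiter (tsek)

def strip_tsek(token: str) -> str:
    """Remove trailing tsek characters so both spellings of a token compare equal."""
    return token.rstrip(TSEK)

# Hypothetical mini stopword list; the real list is far longer.
STOPWORDS = {strip_tsek(w) for w in ["གི" + TSEK, "ནི", "དང" + TSEK]}

def filter_stopwords(tokens: list[str]) -> list[str]:
    """Keep only tokens whose tsek-normalized form is not a stopword."""
    return [t for t in tokens if strip_tsek(t) not in STOPWORDS]

tokens = ["བླ་མ" + TSEK, "གི", "ནི" + TSEK, "ཆོས"]
print(filter_stopwords(tokens))
```

Normalizing both sides means a stopword is filtered whether or not the tokenizer left a tsek attached to it.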
|
| 121 |
+
|
| 122 |
+
**Particle Normalization:**
|
| 123 |
+
Tibetan grammatical particles change form based on the preceding syllable (sandhi). For example, the genitive particle appears as གི, ཀྱི, གྱི, ཡི, or འི depending on context. When particle normalization is enabled, all variants are treated as equivalent, reducing false negatives when comparing texts with different scribal conventions.
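A minimal sketch of this idea follows, assuming particles arrive as standalone tokens without a trailing tsek. The mapping covers only the variants named in this README, and the names `PARTICLE_MAP` and `normalize_particles` are hypothetical rather than taken from `pipeline/normalize_bo.py`.

```python
# Illustrative mapping from particle variants to one canonical form.
PARTICLE_MAP = {
    # genitive variants -> གི
    "ཀྱི": "གི", "གྱི": "གི", "ཡི": "གི", "འི": "གི",
    # la-don variants -> ལ
    "ར": "ལ", "སུ": "ལ", "ཏུ": "ལ", "དུ": "ལ",
}

def normalize_particles(tokens: list[str]) -> list[str]:
    """Replace known particle variants; leave every other token unchanged."""
    return [PARTICLE_MAP.get(token, token) for token in tokens]

print(normalize_particles(["ཆོས", "ཀྱི", "སྐུ"]))
```

In running text some variants (such as འི) attach directly to the preceding syllable, so a real implementation would need more than a token lookup; the sketch only shows the equivalence-class idea.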
|
| 124 |
|
| 125 |
The comprehensive list of Tibetan stopwords used is adapted and compiled from the following valuable resources:
|
| 126 |
- The **Divergent Discourses** (specifically, their Tibetan stopwords list available at [Zenodo Record 10148636](https://zenodo.org/records/10148636)).
|
|
|
|
| 128 |
|
| 129 |
We extend our gratitude to the creators and maintainers of these projects for making their work available to the community.
|
| 130 |
|
| 131 |
+
Feel free to edit this list of stopwords to better suit your needs. The list is stored in the `pipeline/stopwords_bo.py` file.
|
| 132 |
|
| 133 |
### The application computes and visualizes the following similarity metrics between corresponding chapters/segments of the uploaded texts:
|
| 134 |
|
|
|
|
| 145 |
|
| 146 |
This helps focus on meaningful content words rather than grammatical elements.
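For orientation, Jaccard similarity over two token lists can be sketched as the size of the shared vocabulary divided by the size of the combined vocabulary, expressed as a percentage. The function name is hypothetical, and stopword filtering is assumed to have happened beforehand.

```python
def jaccard_percent(tokens_a: list[str], tokens_b: list[str]) -> float:
    """Return 100 * |A ∩ B| / |A ∪ B| over the unique tokens of each segment."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty segments share nothing measurable
    return 100.0 * len(set_a & set_b) / len(union)

print(jaccard_percent(["a", "b", "c"], ["b", "c", "d"]))  # 50.0
```

Because the metric works on sets, word order and repetition do not affect the score.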
|
| 147 |
|
| 148 |
+
2. **Normalized LCS (Longest Common Subsequence)**: This metric measures the length of the longest sequence of words that appears in *both* text segments, maintaining their original relative order. Importantly, these words do not need to be directly adjacent (contiguous) in either text.
|
| 149 |
|
| 150 |
+
**Normalization options:**
|
| 151 |
+
- **Average** (default): Divides LCS length by the average of both text lengths. Balanced comparison.
|
| 152 |
+
- **Min**: Divides by the shorter text length. Useful for detecting whether one text contains the other (e.g., quotes within a commentary). Returns 1.0 when every token of the shorter text appears, in order, in the longer one.
|
| 153 |
+
- **Max**: Divides by the longer text length. Stricter metric that penalizes length differences.
|
| 154 |
|
| 155 |
+
A higher Normalized LCS score suggests more significant shared phrasing, direct textual borrowing, or strong structural parallelism.
|
| 156 |
+
|
| 157 |
+
*Note on Interpretation*: It's possible for Normalized LCS to be higher than Jaccard Similarity. This often happens when texts share a substantial 'narrative backbone' or common ordered phrases (leading to a high LCS), even if they use varied surrounding vocabulary.
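The three normalization modes can be compared with a short sketch. It uses the textbook dynamic-programming LCS over token lists; the function names and the `avg`/`min`/`max` mode strings are assumptions for illustration, not the signatures in `pipeline/metrics.py`.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, keeping one DP row in memory."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def normalized_lcs(a: list[str], b: list[str], mode: str = "avg") -> float:
    """Divide the LCS length by the average, shorter, or longer segment length."""
    if not a or not b:
        return 0.0
    denominators = {
        "avg": (len(a) + len(b)) / 2,
        "min": min(len(a), len(b)),
        "max": max(len(a), len(b)),
    }
    return lcs_length(a, b) / denominators[mode]

long_text = ["a", "b", "c", "d"]
quote = ["b", "d"]
print(normalized_lcs(long_text, quote, "min"))  # 1.0
print(normalized_lcs(long_text, quote, "max"))  # 0.5
```

The example shows why `min` suits quotation detection: the two-token quote is fully contained, so it scores 1.0, while `max` penalizes the length difference.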
|
| 158 |
+
3. **Fuzzy Similarity**: This metric uses syllable-level fuzzy matching algorithms to detect approximate matches, making it particularly valuable for Tibetan texts where spelling variations, dialectal differences, or scribal errors might be present. Unlike exact matching methods (such as Jaccard), fuzzy similarity can recognize when words are similar but not identical.
|
| 159 |
|
| 160 |
+
**Available methods (all work at syllable level):**
|
| 161 |
+
- **Syllable N-gram Overlap** (default, recommended): Compares syllable bigrams between texts. Best for detecting shared phrases and local patterns.
|
| 162 |
+
- **Syllable-level Edit Distance**: Computes Levenshtein distance at the syllable/token level. Detects minor variations while respecting syllable boundaries.
|
| 163 |
+
- **Weighted Jaccard**: Like standard Jaccard but considers token frequency, giving more weight to frequently shared terms.
|
| 164 |
|
| 165 |
+
Scores range from 0 to 1, where 1 indicates perfect or near-perfect matches. All methods work at the syllable level, which is linguistically appropriate for Tibetan.
|
| 166 |
+
|
| 167 |
+
**Stopword Filtering**: The same three levels of filtering used for Jaccard Similarity are applied to fuzzy matching:
|
| 168 |
- **None**: No filtering, includes all words in the comparison
|
| 169 |
- **Standard**: Filters only common particles and punctuation
|
| 170 |
- **Aggressive**: Filters all function words including particles, pronouns, and auxiliaries
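The three fuzzy methods listed above can be sketched over lists of syllables. The exact formulas used by the tool are not given in this README, so the definitions below are assumptions: a Jaccard ratio over bigram sets, a min/max ratio over token counts, and a Levenshtein distance scaled by the longer length. All function names are hypothetical.

```python
from collections import Counter

def bigram_overlap(a: list[str], b: list[str]) -> float:
    """Jaccard ratio over the sets of adjacent syllable pairs."""
    grams_a, grams_b = set(zip(a, a[1:])), set(zip(b, b[1:]))
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0

def weighted_jaccard(a: list[str], b: list[str]) -> float:
    """Sum of per-token minimum counts divided by sum of maximum counts."""
    count_a, count_b = Counter(a), Counter(b)
    keys = set(count_a) | set(count_b)
    total_max = sum(max(count_a[k], count_b[k]) for k in keys)
    if total_max == 0:
        return 0.0
    return sum(min(count_a[k], count_b[k]) for k in keys) / total_max

def edit_similarity(a: list[str], b: list[str]) -> float:
    """1 - (syllable-level Levenshtein distance / longer length)."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))

a = ["ka", "kha", "ga", "nga"]
b = ["ka", "kha", "ga", "ca"]
print(bigram_overlap(a, b), weighted_jaccard(a, b), edit_similarity(a, b))
```

All three treat a syllable as an indivisible unit, so a one-syllable substitution costs exactly one edit regardless of how many characters differ.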
|
| 171 |
|
| 172 |
+
4. **Semantic Similarity**: Embeds each text segment as a high-dimensional vector with a sentence-transformer model (Dharmamitra's Buddhist-specific models by default) and compares segments by the cosine similarity of their embeddings. Scores closer to 1 indicate a higher degree of semantic overlap.
|
| 173 |
+
|
| 174 |
+
*Note*: Semantic similarity operates on the raw text and is not affected by stopword filtering settings.
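The comparison step itself is ordinary cosine similarity. The sketch below applies it to two small hand-written vectors standing in for model embeddings; producing real embeddings requires loading one of the sentence-transformer models, which is outside this example.

```python
import math

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_u * norm_v)

# Toy vectors, not real embeddings.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

The score depends only on the angle between the vectors, so segment length affects it only through how the model pools long inputs.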
|
| 175 |
|
| 176 |
## Getting Started (if Run Locally)
|
| 177 |
|
|
|
|
| 206 |
|
| 207 |
## Usage
|
| 208 |
|
| 209 |
+
### Quick Start (Recommended for Most Users)
|
| 210 |
+
|
| 211 |
+
1. **Upload Files**: Select two or more `.txt` files containing Tibetan Unicode text.
|
| 212 |
+
2. **Choose a Preset**: In the "Quick Start" tab, select an analysis type:
|
| 213 |
+
|
| 214 |
+
| Preset | What it does | Best for |
|
| 215 |
+
|--------|--------------|----------|
|
| 216 |
+
| 📊 **Standard** | Vocabulary + Sequences + Fuzzy matching | Most comparisons |
|
| 217 |
+
| 🧠 **Deep** | All metrics including AI meaning analysis | Finding semantic parallels |
|
| 218 |
+
| ⚡ **Quick** | Vocabulary overlap only | Fast initial scan |
|
| 219 |
+
|
| 220 |
+
3. **Click "Compare My Texts"**: Results appear below with heatmaps and downloadable CSV.
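One way to picture the presets is as a lookup from preset name to enabled metrics. The sketch below is an assumption based on the table alone: it reads "Vocabulary" as Jaccard, "Sequences" as LCS, and "AI meaning analysis" as semantic similarity, and the names `PRESETS` and `enabled_metrics` are hypothetical rather than taken from `app.py`.

```python
# Hypothetical preset table inferred from the README, not copied from app.py.
PRESETS = {
    "Standard": {"jaccard": True, "lcs": True, "fuzzy": True, "semantic": False},
    "Deep": {"jaccard": True, "lcs": True, "fuzzy": True, "semantic": True},
    "Quick": {"jaccard": True, "lcs": False, "fuzzy": False, "semantic": False},
}

def enabled_metrics(preset: str) -> list[str]:
    """List the metric names switched on for a preset, in table order."""
    return [name for name, enabled in PRESETS[preset].items() if enabled]

for name in PRESETS:
    print(name, enabled_metrics(name))
```

Keeping presets as data rather than branching logic makes it easy for a "Custom" tab to start from a preset and override individual switches.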
|
| 221 |
+
|
| 222 |
+
### Custom Analysis (Advanced Users)
|
| 223 |
+
|
| 224 |
+
For fine-grained control, use the "Custom" tab:
|
| 225 |
+
|
| 226 |
+
- **Lexical Metrics**: Configure tokenization (word/syllable), stopword filtering, and particle normalization
|
| 227 |
+
- **Sequence Matching (LCS)**: Enable/disable and choose normalization mode (avg/min/max)
|
| 228 |
+
- **Fuzzy Matching**: Choose method (N-gram, Syllable Edit, or Weighted Jaccard)
|
| 229 |
+
- **Semantic Analysis**: Enable AI-based meaning comparison with model selection
|
| 230 |
+
|
| 231 |
+
### Viewing Results
|
| 232 |
+
|
| 233 |
+
- **Metrics Preview**: Summary table of similarity scores
|
| 234 |
+
- **Heatmaps**: Visual comparison across all chapter pairs (darker = more similar)
|
| 235 |
+
- **Word Counts**: Bar chart showing segment lengths
|
| 236 |
+
- **CSV Download**: Full results for further analysis
|
| 237 |
+
|
| 238 |
+
### AI Interpretation (Optional)
|
| 239 |
+
|
| 240 |
+
After running analysis, click "Help Interpret Results" for scholarly insights:
|
| 241 |
+
- Pattern identification across chapters
|
| 242 |
+
- Notable textual relationships
|
| 243 |
+
- Suggestions for further investigation
|
| 244 |
|
| 245 |
## Embedding Model
|
| 246 |
|
| 247 |
+
Semantic similarity uses Hugging Face sentence-transformer models. The following models are available:
|
| 248 |
+
|
| 249 |
+
- **`buddhist-nlp/buddhist-sentence-similarity`** (default, recommended): Developed by [Dharmamitra](https://huggingface.co/buddhist-nlp), this model is specifically trained for sentence similarity on Buddhist texts in Tibetan, Buddhist Chinese, Sanskrit (IAST), and Pāli. Best choice for Tibetan Buddhist manuscripts.
|
| 250 |
+
- **`buddhist-nlp/bod-eng-similarity`**: Also from Dharmamitra, optimized for Tibetan-English bitext alignment tasks.
|
| 251 |
+
- **`sentence-transformers/LaBSE`**: General multilingual model, good baseline for non-Buddhist texts.
|
| 252 |
+
- **`BAAI/bge-m3`**: Strong multilingual alternative with broad language coverage.
|
| 253 |
+
|
| 254 |
+
These models provide context-aware, segment-level embeddings suitable for comparing Tibetan text passages.
|
| 255 |
|
| 256 |
## Structure
|
| 257 |
|
| 258 |
- `app.py` — Gradio web app entry point and UI definition.
|
| 259 |
- `pipeline/` — Modules for file handling, text processing, metrics calculation, and visualization.
|
| 260 |
- `process.py`: Core logic for segmenting texts and orchestrating metric computation.
|
| 261 |
+
- `metrics.py`: Implementation of Jaccard, LCS, Fuzzy, and Semantic Similarity.
|
| 262 |
- `hf_embedding.py`: Handles loading and using sentence-transformer models.
|
| 263 |
- `tokenize.py`: Tibetan text tokenization using `botok`.
|
| 264 |
+
- `normalize_bo.py`: Tibetan particle normalization for grammatical variants.
|
| 265 |
+
- `stopwords_bo.py`: Comprehensive Tibetan stopword list with tsek normalization.
|
| 266 |
- `visualize.py`: Generates heatmaps and word count plots.
|
| 267 |
- `requirements.txt` — Python dependencies for the web application.
|
| 268 |
|
|
|
|
| 284 |
author = {Daniel Wojahn},
|
| 285 |
year = {2025},
|
| 286 |
url = {https://github.com/daniel-wojahn/tibetan-text-metrics},
|
| 287 |
+
version = {0.4.0}
|
| 288 |
}
|
| 289 |
```
|
| 290 |
|
app.py
CHANGED
|
@@ -17,14 +17,16 @@ load_dotenv()
|
|
| 17 |
|
| 18 |
logger = logging.getLogger(__name__)
|
| 19 |
def main_interface():
|
|
|
|
|
|
|
| 20 |
with gr.Blocks(
|
| 21 |
theme=tibetan_theme,
|
| 22 |
-
|
| 23 |
-
|
| 24 |
) as demo:
|
| 25 |
gr.Markdown(
|
| 26 |
-
"""# Tibetan Text Metrics
|
| 27 |
-
<span style='font-size:18px;'>
|
| 28 |
""",
|
| 29 |
|
| 30 |
elem_classes="gr-markdown",
|
|
@@ -35,93 +37,174 @@ def main_interface():
|
|
| 35 |
with gr.Group(elem_classes="step-box"):
|
| 36 |
gr.Markdown(
|
| 37 |
"""
|
| 38 |
-
## Step 1: Upload Your
|
| 39 |
-
<span style='font-size:16px;'>Upload two or more
|
| 40 |
""",
|
| 41 |
elem_classes="gr-markdown",
|
| 42 |
)
|
| 43 |
file_input = gr.File(
|
| 44 |
-
label="
|
| 45 |
file_types=[".txt"],
|
| 46 |
file_count="multiple",
|
| 47 |
)
|
| 48 |
gr.Markdown(
|
| 49 |
-
"<small>
|
| 50 |
elem_classes="gr-markdown"
|
| 51 |
)
|
| 52 |
with gr.Column(scale=1, elem_classes="step-column"):
|
| 53 |
with gr.Group(elem_classes="step-box"):
|
| 54 |
gr.Markdown(
|
| 55 |
-
"""## Step 2:
|
| 56 |
-
<span style='font-size:16px;'>
|
| 57 |
""",
|
| 58 |
elem_classes="gr-markdown",
|
| 59 |
)
|
| 60 |
-
semantic_toggle_radio = gr.Radio(
|
| 61 |
-
label="Compute semantic similarity? (Experimental)",
|
| 62 |
-
choices=["Yes", "No"],
|
| 63 |
-
value="No",
|
| 64 |
-
info="Semantic similarity will be time-consuming. Choose 'No' to speed up analysis if these metrics are not required.",
|
| 65 |
-
elem_id="semantic-radio-group",
|
| 66 |
-
)
|
| 67 |
-
|
| 68 |
-
model_dropdown = gr.Dropdown(
|
| 69 |
-
choices=[
|
| 70 |
-
"sentence-transformers/LaBSE"
|
| 71 |
-
],
|
| 72 |
-
label="Select Embedding Model",
|
| 73 |
-
value="sentence-transformers/LaBSE",
|
| 74 |
-
info="Select the embedding model to use for semantic similarity analysis. Only Hugging Face sentence-transformers are supported."
|
| 75 |
-
)
|
| 76 |
-
|
| 77 |
-
with gr.Accordion("Advanced Options", open=False):
|
| 78 |
-
batch_size_slider = gr.Slider(
|
| 79 |
-
minimum=1,
|
| 80 |
-
maximum=64,
|
| 81 |
-
value=8,
|
| 82 |
-
step=1,
|
| 83 |
-
label="Batch Size (for Hugging Face models)",
|
| 84 |
-
info="Adjust based on your hardware (VRAM). Lower this if you encounter memory issues."
|
| 85 |
-
)
|
| 86 |
-
progress_bar_checkbox = gr.Checkbox(
|
| 87 |
-
label="Show Embedding Progress Bar",
|
| 88 |
-
value=False,
|
| 89 |
-
info="Display a progress bar during embedding generation. Useful for large datasets."
|
| 90 |
-
)
|
| 91 |
-
|
| 92 |
-
stopwords_dropdown = gr.Dropdown(
|
| 93 |
-
label="Stopword Filtering",
|
| 94 |
-
choices=[
|
| 95 |
-
"None (No filtering)",
|
| 96 |
-
"Standard (Common particles only)",
|
| 97 |
-
"Aggressive (All function words)"
|
| 98 |
-
],
|
| 99 |
-
value="Standard (Common particles only)", # Default
|
| 100 |
-
info="Choose how aggressively to filter out common Tibetan particles and function words when calculating similarity. This helps focus on meaningful content words."
|
| 101 |
-
)
|
| 102 |
-
|
| 103 |
-
fuzzy_toggle_radio = gr.Radio(
|
| 104 |
-
label="Enable Fuzzy String Matching",
|
| 105 |
-
choices=["Yes", "No"],
|
| 106 |
-
value="Yes",
|
| 107 |
-
info="Fuzzy matching helps detect similar but not identical text segments. Useful for identifying variations and modifications."
|
| 108 |
-
)
|
| 109 |
-
|
| 110 |
-
fuzzy_method_dropdown = gr.Dropdown(
|
| 111 |
-
label="Fuzzy Matching Method",
|
| 112 |
-
choices=[
|
| 113 |
-
"token_set - Order-independent matching",
|
| 114 |
-
"token_sort - Order-normalized matching",
|
| 115 |
-
"partial - Best partial matching",
|
| 116 |
-
"ratio - Simple ratio matching"
|
| 117 |
-
],
|
| 118 |
-
value="token_set - Order-independent matching",
|
| 119 |
-
info="Select the fuzzy matching algorithm to use:\n\n• token_set: Best for texts with different word orders and partial overlaps. Compares unique words regardless of their order (recommended for Tibetan texts).\n\n• token_sort: Good for texts with different word orders but similar content. Sorts words alphabetically before comparing.\n\n• partial: Best for finding shorter strings within longer ones. Useful when one text is a fragment of another.\n\n• ratio: Simple Levenshtein distance ratio. Best for detecting small edits and typos in otherwise identical texts."
|
| 120 |
-
)
|
| 121 |
|
| 122 |
-
|
| 123 |
-
|
| 124 |
-
|
| 125 |
|
| 126 |
gr.Markdown(
|
| 127 |
"""## Results
|
|
@@ -131,165 +214,208 @@ def main_interface():
|
|
| 131 |
# The heatmap_titles and metric_tooltips dictionaries are defined here
|
| 132 |
# heatmap_titles = { ... }
|
| 133 |
# metric_tooltips = { ... }
|
| 134 |
-
csv_output = gr.File(label="Download CSV
|
| 135 |
metrics_preview = gr.Dataframe(
|
| 136 |
-
label="
|
| 137 |
)
|
| 138 |
# States for data persistence
|
| 139 |
state_text_data = gr.State()
|
| 140 |
state_df_results = gr.State()
|
| 141 |
-
|
| 142 |
# LLM Interpretation components
|
| 143 |
with gr.Row():
|
| 144 |
with gr.Column():
|
| 145 |
gr.Markdown(
|
| 146 |
-
"##
|
| 147 |
elem_classes="gr-markdown"
|
| 148 |
)
|
| 149 |
-
|
| 150 |
# Add the interpret button
|
| 151 |
with gr.Row():
|
| 152 |
interpret_btn = gr.Button(
|
| 153 |
-
"
|
| 154 |
variant="primary",
|
| 155 |
elem_id="interpret-btn"
|
| 156 |
)
|
| 157 |
# Create a placeholder message with proper formatting and structure
|
| 158 |
initial_message = """
|
| 159 |
-
##
|
| 160 |
|
| 161 |
-
<small>*
|
| 162 |
"""
|
| 163 |
interpretation_output = gr.Markdown(
|
| 164 |
value=initial_message,
|
| 165 |
elem_id="llm-analysis"
|
| 166 |
)
|
| 167 |
-
|
| 168 |
# Heatmap tabs for each metric
|
| 169 |
heatmap_titles = {
|
| 170 |
-
"Jaccard Similarity (%)": "Higher
|
| 171 |
-
"Normalized LCS": "Higher
|
| 172 |
-
"Fuzzy Similarity": "
|
| 173 |
-
"Semantic Similarity": "Higher
|
| 174 |
-
"Word Counts": "
|
| 175 |
}
|
| 176 |
|
| 177 |
metric_tooltips = {
|
| 178 |
"Jaccard Similarity (%)": """
|
| 179 |
-
### Jaccard Similarity
|
| 180 |
-
|
|
|
|
| 181 |
|
| 182 |
-
|
| 183 |
|
| 184 |
-
|
|
|
|
|
|
|
|
|
|
| 185 |
|
| 186 |
-
**
|
|
|
|
|
|
|
|
|
|
|
|
|
| 187 |
""",
|
| 188 |
"Fuzzy Similarity": """
|
| 189 |
-
### Fuzzy Similarity
|
| 190 |
-
|
|
|
|
|
|
|
|
|
|
| 191 |
|
| 192 |
-
|
|
|
|
|
|
|
|
|
|
| 193 |
|
| 194 |
-
**
|
| 195 |
-
-
|
| 196 |
-
-
|
| 197 |
-
-
|
| 198 |
-
- **Simple Ratio**: Direct character-by-character comparison (best for detecting minor variations)
|
| 199 |
|
| 200 |
-
**
|
|
|
|
|
|
|
|
|
|
| 201 |
""",
|
| 202 |
"Normalized LCS": """
|
| 203 |
-
###
|
| 204 |
-
|
| 205 |
-
|
| 206 |
-
|
| 207 |
-
|
| 208 |
-
A higher Normalized LCS score suggests more significant shared phrasing, direct textual borrowing, or strong structural parallelism, as it reflects similarities in how ideas are ordered and expressed sequentially.
|
| 209 |
|
| 210 |
-
**
|
| 211 |
|
| 212 |
-
**
|
| 213 |
""",
|
| 214 |
"Semantic Similarity": """
|
| 215 |
-
### Semantic
|
| 216 |
-
|
| 217 |
-
|
| 218 |
-
|
| 219 |
-
|
| 220 |
-
|
| 221 |
-
|
| 222 |
-
-
|
| 223 |
""",
|
| 224 |
"Word Counts": """
|
| 225 |
-
###
|
| 226 |
-
|
|
|
|
|
|
|
|
|
|
| 227 |
|
| 228 |
-
|
|
|
|
|
|
|
|
|
|
| 229 |
|
| 230 |
-
**
|
| 231 |
-
- Longer bars indicate segments with more words
|
| 232 |
-
- Segments are grouped by source document
|
| 233 |
-
- Useful for identifying structural patterns and content distribution
|
| 234 |
-
- Can help explain similarity metric variations (longer texts may show different patterns)
|
| 235 |
""",
|
| 236 |
"Structural Analysis": """
|
| 237 |
-
###
|
| 238 |
-
This advanced analysis examines the structural relationships between text segments across your documents. It identifies patterns of similarity and difference that may indicate textual dependencies, common sources, or editorial modifications.
|
| 239 |
|
| 240 |
-
|
| 241 |
|
| 242 |
-
**
|
| 243 |
-
-
|
| 244 |
-
-
|
| 245 |
-
-
|
| 246 |
-
- Provides insights into textual transmission and editorial history
|
| 247 |
|
| 248 |
-
**
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 249 |
"""
|
| 250 |
|
| 251 |
}
|
| 252 |
heatmap_tabs = {}
|
| 253 |
-
gr.Markdown("##
|
| 254 |
-
|
| 255 |
with gr.Tabs(elem_id="heatmap-tab-group"):
|
| 256 |
# Process all metrics
|
| 257 |
metrics_to_display = heatmap_titles
|
| 258 |
-
|
| 259 |
for metric_key, descriptive_title in metrics_to_display.items():
|
| 260 |
with gr.Tab(metric_key):
|
| 261 |
# Set CSS class based on metric type
|
| 262 |
if metric_key == "Jaccard Similarity (%)":
|
| 263 |
css_class = "metric-info-accordion jaccard-info"
|
| 264 |
-
accordion_title = "
|
| 265 |
elif metric_key == "Normalized LCS":
|
| 266 |
css_class = "metric-info-accordion lcs-info"
|
| 267 |
-
accordion_title = "
|
| 268 |
elif metric_key == "Fuzzy Similarity":
|
| 269 |
css_class = "metric-info-accordion fuzzy-info"
|
| 270 |
-
accordion_title = "
|
| 271 |
elif metric_key == "Semantic Similarity":
|
| 272 |
css_class = "metric-info-accordion semantic-info"
|
| 273 |
-
accordion_title = "
|
| 274 |
elif metric_key == "Word Counts":
|
| 275 |
css_class = "metric-info-accordion wordcount-info"
|
| 276 |
-
accordion_title = "
|
| 277 |
else:
|
| 278 |
css_class = "metric-info-accordion"
|
| 279 |
-
accordion_title = f"About {metric_key}"
|
| 280 |
-
|
| 281 |
# Create the accordion with appropriate content
|
| 282 |
with gr.Accordion(accordion_title, open=False, elem_classes=css_class):
|
| 283 |
if metric_key == "Word Counts":
|
| 284 |
gr.Markdown("""
|
| 285 |
-
|
| 286 |
-
|
| 287 |
""")
|
| 288 |
elif metric_key in metric_tooltips:
|
| 289 |
gr.Markdown(value=metric_tooltips[metric_key], elem_classes="metric-description")
|
| 290 |
else:
|
| 291 |
gr.Markdown(value=f"### {metric_key}\nDescription not found.")
|
| 292 |
-
|
| 293 |
# Add the appropriate plot
|
| 294 |
if metric_key == "Word Counts":
|
| 295 |
word_count_plot = gr.Plot(label="Word Counts per Segment", show_label=False, scale=1, elem_classes="metric-description")
|
|
@@ -302,26 +428,28 @@ The structural analysis combines multiple similarity metrics to create a compreh
|
|
| 302 |
# The visual placement depends on how Gradio renders children of gr.Tab or if there's another container.
|
| 303 |
|
| 304 |
warning_box = gr.Markdown(visible=False)
|
| 305 |
-
|
| 306 |
# Create a container for metric progress indicators
|
| 307 |
with gr.Row(visible=False) as progress_container:
|
| 308 |
# Progress indicators will be created dynamically by ProgressiveUI
|
| 309 |
gr.Markdown("Metric progress will appear here during analysis")
|
| 310 |
|
| 311 |
-
def run_pipeline(files, enable_semantic, enable_fuzzy, fuzzy_method, model_name, stopwords_option, batch_size, show_progress, progress=gr.Progress()):
|
| 312 |
"""Processes uploaded files, computes metrics, generates visualizations, and prepares outputs for the UI.
|
| 313 |
-
|
| 314 |
Args:
|
| 315 |
files: A list of file objects uploaded by the user.
|
| 316 |
enable_semantic: Whether to compute semantic similarity.
|
| 317 |
enable_fuzzy: Whether to compute fuzzy string similarity.
|
| 318 |
fuzzy_method: The fuzzy matching method to use.
|
| 319 |
model_name: Name of the embedding model to use.
|
|
|
|
| 320 |
stopwords_option: Stopword filtering level (None, Standard, or Aggressive).
|
|
|
|
| 321 |
batch_size: Batch size for embedding generation.
|
| 322 |
show_progress: Whether to show progress bars during embedding.
|
| 323 |
progress: Gradio progress indicator.
|
| 324 |
-
|
| 325 |
Returns:
|
| 326 |
tuple: Results for UI components including metrics, visualizations, and state.
|
| 327 |
"""
|
|
@@ -336,7 +464,7 @@ The structural analysis combines multiple similarity metrics to create a compreh
|
|
| 336 |
warning_update_res = gr.update(visible=False)
|
| 337 |
state_text_data_res = None
|
| 338 |
state_df_results_res = None
|
| 339 |
-
|
| 340 |
# Create a ProgressiveUI instance for handling progressive updates
|
| 341 |
progressive_ui = ProgressiveUI(
|
| 342 |
metrics_preview=metrics_preview,
|
|
@@ -349,10 +477,10 @@ The structural analysis combines multiple similarity metrics to create a compreh
|
|
| 349 |
progress_container=progress_container,
|
| 350 |
heatmap_titles=heatmap_titles
|
| 351 |
)
|
| 352 |
-
|
| 353 |
# Make progress container visible during analysis
|
| 354 |
progress_container.update(visible=True)
|
| 355 |
-
|
| 356 |
# Create a progressive callback function
|
| 357 |
progressive_callback = create_progressive_callback(progressive_ui)
|
| 358 |
# Check if files are provided
|
|
@@ -369,7 +497,7 @@ The structural analysis combines multiple similarity metrics to create a compreh
|
|
| 369 |
None, # state_text_data
|
| 370 |
None # state_df_results
|
| 371 |
)
|
| 372 |
-
|
| 373 |
# Check file size limits (10MB per file)
|
| 374 |
for file in files:
|
| 375 |
file_size_mb = Path(file.name).stat().st_size / (1024 * 1024)
|
|
@@ -393,13 +521,13 @@ The structural analysis combines multiple similarity metrics to create a compreh
|
|
| 393 |
progress(0.1, desc="Preparing files...")
|
| 394 |
except Exception as e:
|
| 395 |
logger.warning(f"Progress update error (non-critical): {e}")
|
| 396 |
-
|
| 397 |
# Get filenames and read file contents
|
| 398 |
filenames = [
|
| 399 |
Path(file.name).name for file in files
|
| 400 |
] # Use Path().name to get just the filename
|
| 401 |
text_data = {}
|
| 402 |
-
|
| 403 |
# Read files with progress updates
|
| 404 |
for i, file in enumerate(files):
|
| 405 |
file_path = Path(file.name)
|
|
@@ -409,7 +537,7 @@ The structural analysis combines multiple similarity metrics to create a compreh
|
|
| 409 |
progress(0.1 + (0.1 * (i / len(files))), desc=f"Reading file: {filename}")
|
| 410 |
except Exception as e:
|
| 411 |
logger.warning(f"Progress update error (non-critical): {e}")
|
| 412 |
-
|
| 413 |
try:
|
| 414 |
text_data[filename] = file_path.read_text(encoding="utf-8-sig")
|
| 415 |
except UnicodeDecodeError:
|
|
@@ -433,21 +561,27 @@ The structural analysis combines multiple similarity metrics to create a compreh
|
|
| 433 |
# Configure semantic similarity and fuzzy matching
|
| 434 |
enable_semantic_bool = enable_semantic == "Yes"
|
| 435 |
enable_fuzzy_bool = enable_fuzzy == "Yes"
|
| 436 |
-
|
| 437 |
# Extract the fuzzy method from the dropdown value
|
| 438 |
-
fuzzy_method_value = fuzzy_method.split(' - ')[0] if fuzzy_method else '
|
| 439 |
-
|
| 440 |
if progress is not None:
|
| 441 |
try:
|
| 442 |
progress(0.2, desc="Loading model..." if enable_semantic_bool else "Processing text...")
|
| 443 |
except Exception as e:
|
| 444 |
logger.warning(f"Progress update error (non-critical): {e}")
|
| 445 |
-
|
| 446 |
# Process texts with selected model
|
| 447 |
# Convert stopword option to appropriate parameters
|
| 448 |
use_stopwords = stopwords_option != "None (No filtering)"
|
| 449 |
use_lite_stopwords = stopwords_option == "Standard (Common particles only)"
|
| 450 |
-
|
| 451 |
# For Hugging Face models, the UI value is the correct model ID
|
| 452 |
internal_model_id = model_name
|
| 453 |
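The label-to-flag conversion in this hunk can be isolated into a small pure function, which makes it testable without launching Gradio. The following is a minimal sketch, not the app's actual code: `ParsedOptions` and `parse_ui_options` are hypothetical names introduced for illustration, and because the diff does not show the fallback fuzzy method in full, the sketch returns `None` when no method is selected.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ParsedOptions:
    """Hypothetical container for the flags derived from UI label strings."""
    enable_semantic: bool
    enable_fuzzy: bool
    fuzzy_method: Optional[str]
    use_stopwords: bool
    use_lite_stopwords: bool


def parse_ui_options(enable_semantic, enable_fuzzy, fuzzy_method, stopwords_option):
    """Convert Gradio label strings into plain flags (illustrative helper)."""
    # Dropdown labels look like "ngram - Syllable pairs (recommended)";
    # the machine-readable key is the part before " - ".
    method = fuzzy_method.split(" - ")[0] if fuzzy_method else None
    return ParsedOptions(
        enable_semantic=(enable_semantic == "Yes"),
        enable_fuzzy=(enable_fuzzy == "Yes"),
        fuzzy_method=method,
        # Any choice other than "None" turns filtering on.
        use_stopwords=(stopwords_option != "None (No filtering)"),
        # Only the "Standard" choice selects the lighter stopword list.
        use_lite_stopwords=(stopwords_option == "Standard (Common particles only)"),
    )
```

Keeping the string comparisons in one place means a renamed dropdown label only has to be updated here, rather than in the middle of the pipeline function.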
|
|
@@ -457,9 +591,12 @@ The structural analysis combines multiple similarity metrics to create a compreh
|
|
| 457 |
enable_semantic=enable_semantic_bool,
|
| 458 |
enable_fuzzy=enable_fuzzy_bool,
|
| 459 |
fuzzy_method=fuzzy_method_value,
|
|
|
|
| 460 |
model_name=internal_model_id,
|
| 461 |
use_stopwords=use_stopwords,
|
| 462 |
use_lite_stopwords=use_lite_stopwords,
|
|
|
|
|
|
|
| 463 |
progress_callback=progress,
|
| 464 |
progressive_callback=progressive_callback,
|
| 465 |
batch_size=batch_size,
|
|
@@ -479,12 +616,12 @@ The structural analysis combines multiple similarity metrics to create a compreh
|
|
| 479 |
progress(0.8, desc="Generating visualizations...")
|
| 480 |
except Exception as e:
|
| 481 |
logger.warning(f"Progress update error (non-critical): {e}")
|
| 482 |
-
|
| 483 |
# heatmap_titles is already defined in the outer scope of main_interface
|
| 484 |
heatmaps_data = generate_visualizations(
|
| 485 |
df_results, descriptive_titles=heatmap_titles
|
| 486 |
)
|
| 487 |
-
|
| 488 |
# Generate word count chart
|
| 489 |
if progress is not None:
|
| 490 |
try:
|
|
@@ -492,12 +629,12 @@ The structural analysis combines multiple similarity metrics to create a compreh
|
|
| 492 |
except Exception as e:
|
| 493 |
logger.warning(f"Progress update error (non-critical): {e}")
|
| 494 |
word_count_fig_res = generate_word_count_chart(word_counts_df_data)
|
| 495 |
-
|
| 496 |
# Store state data for potential future use
|
| 497 |
state_text_data_res = text_data
|
| 498 |
state_df_results_res = df_results
|
| 499 |
logger.info("Analysis complete, storing state data")
|
| 500 |
-
|
| 501 |
# Save results to CSV
|
| 502 |
if progress is not None:
|
| 503 |
try:
|
|
@@ -506,7 +643,7 @@ The structural analysis combines multiple similarity metrics to create a compreh
|
|
| 506 |
logger.warning(f"Progress update error (non-critical): {e}")
|
| 507 |
csv_path_res = "results.csv"
|
| 508 |
df_results.to_csv(csv_path_res, index=False)
|
| 509 |
-
|
| 510 |
# Prepare final output
|
| 511 |
warning_md = f"**⚠️ Warning:** {warning_raw}" if warning_raw else ""
|
| 512 |
metrics_preview_df_res = df_results.head(10)
|
|
@@ -514,10 +651,7 @@ The structural analysis combines multiple similarity metrics to create a compreh
|
|
| 514 |
jaccard_heatmap_res = heatmaps_data.get("Jaccard Similarity (%)")
|
| 515 |
lcs_heatmap_res = heatmaps_data.get("Normalized LCS")
|
| 516 |
fuzzy_heatmap_res = heatmaps_data.get("Fuzzy Similarity")
|
| 517 |
-
semantic_heatmap_res = heatmaps_data.get(
|
| 518 |
-
"Semantic Similarity"
|
| 519 |
-
)
|
| 520 |
-
# TF-IDF has been completely removed
|
| 521 |
warning_update_res = gr.update(
|
| 522 |
visible=bool(warning_raw), value=warning_md
|
| 523 |
)
|
|
@@ -546,27 +680,27 @@ The structural analysis combines multiple similarity metrics to create a compreh
|
|
| 546 |
try:
|
| 547 |
if not csv_path or not Path(csv_path).exists():
|
| 548 |
return "Please run the analysis first to generate results."
|
| 549 |
-
|
| 550 |
# Read the CSV file
|
| 551 |
df_results = pd.read_csv(csv_path)
|
| 552 |
-
|
| 553 |
# Show detailed progress messages with percentages
|
| 554 |
progress(0, desc="Preparing data for analysis...")
|
| 555 |
progress(0.1, desc="Analyzing similarity patterns...")
|
| 556 |
progress(0.2, desc="Connecting to Mistral 7B via OpenRouter...")
|
| 557 |
-
|
| 558 |
# Get interpretation from LLM (using OpenRouter API)
|
| 559 |
progress(0.3, desc="Generating scholarly interpretation (this may take 20-40 seconds)...")
|
| 560 |
llm_service = LLMService()
|
| 561 |
interpretation = llm_service.analyze_similarity(df_results)
|
| 562 |
-
|
| 563 |
# Simulate completion steps
|
| 564 |
progress(0.9, desc="Formatting results...")
|
| 565 |
progress(0.95, desc="Applying scholarly formatting...")
|
| 566 |
-
|
| 567 |
# Completed
|
| 568 |
progress(1.0, desc="Analysis complete!")
|
| 569 |
-
|
| 570 |
# Add a timestamp to the interpretation
|
| 571 |
timestamp = datetime.now().strftime("%Y-%m-%d %H:%M")
|
| 572 |
interpretation = f"{interpretation}\n\n<small>Analysis generated on {timestamp}</small>"
|
|
@@ -574,36 +708,92 @@ The structural analysis combines multiple similarity metrics to create a compreh
|
|
| 574 |
except Exception as e:
|
| 575 |
logger.error(f"Error in interpret_results: {e}", exc_info=True)
|
| 576 |
return f"Error interpreting results: {str(e)}"
|
| 577 |
-
|
| 578 |
-
|
| 579 |
fn=run_pipeline,
|
| 580 |
-
inputs=[
|
| 581 |
-
|
| 582 |
-
|
| 583 |
-
|
| 584 |
-
|
| 585 |
-
|
| 586 |
-
|
| 587 |
-
|
| 588 |
-
|
| 589 |
-
|
| 590 |
-
|
| 591 |
-
|
| 592 |
-
]
|
|
|
|
| 593 |
)
|
| 594 |
|
| 595 |
# Structural analysis functionality removed - see dedicated collation app
|
| 596 |
-
|
| 597 |
# Connect the interpret button
|
| 598 |
interpret_btn.click(
|
| 599 |
fn=interpret_results,
|
| 600 |
inputs=[csv_output],
|
| 601 |
outputs=interpretation_output
|
| 602 |
)
|
| 603 |
-
|
| 604 |
return demo
|
| 605 |
|
| 606 |
|
| 607 |
if __name__ == "__main__":
|
| 608 |
demo = main_interface()
|
| 609 |
-
demo.launch()
|
|
|
|
| 17 |
|
| 18 |
logger = logging.getLogger(__name__)
|
| 19 |
def main_interface():
|
| 20 |
+
# Theme and CSS applied here for Gradio 5.x compatibility
|
| 21 |
+
# For Gradio 6.x, these will move to launch() - see migration guide
|
| 22 |
with gr.Blocks(
|
| 23 |
theme=tibetan_theme,
|
| 24 |
+
css=tibetan_theme.get_css_string(),
|
| 25 |
+
title="Tibetan Text Metrics Web App"
|
| 26 |
) as demo:
|
| 27 |
gr.Markdown(
|
| 28 |
+
"""# Tibetan Text Metrics
|
| 29 |
+
<span style='font-size:18px;'>Compare Tibetan texts to discover how similar they are. This tool helps scholars identify shared passages, textual variations, and relationships between different versions of Tibetan manuscripts. Part of the <a href="https://github.com/daniel-wojahn/tibetan-text-metrics" target="_blank">TTM project</a>.</span>
|
| 30 |
""",
|
| 31 |
|
| 32 |
elem_classes="gr-markdown",
|
|
|
|
| 37 |
with gr.Group(elem_classes="step-box"):
|
| 38 |
gr.Markdown(
|
| 39 |
"""
|
| 40 |
+
## Step 1: Upload Your Texts
|
| 41 |
+
<span style='font-size:16px;'>Upload two or more Tibetan text files (.txt format). If your texts have chapters, separate them with the ༈ marker so the tool can compare chapter-by-chapter.</span>
|
| 42 |
""",
|
| 43 |
elem_classes="gr-markdown",
|
| 44 |
)
|
| 45 |
file_input = gr.File(
|
| 46 |
+
label="Choose your Tibetan text files",
|
| 47 |
file_types=[".txt"],
|
| 48 |
file_count="multiple",
|
| 49 |
)
|
| 50 |
gr.Markdown(
|
| 51 |
+
"<small>Tip: Files should be under 1MB for best performance. Use UTF-8 encoded .txt files.</small>",
|
| 52 |
elem_classes="gr-markdown"
|
| 53 |
)
|
| 54 |
with gr.Column(scale=1, elem_classes="step-column"):
|
| 55 |
with gr.Group(elem_classes="step-box"):
|
| 56 |
gr.Markdown(
|
| 57 |
+
"""## Step 2: Choose Analysis Type
|
| 58 |
+
<span style='font-size:16px;'>Pick a preset for quick results, or use Custom for full control.</span>
|
| 59 |
""",
|
| 60 |
elem_classes="gr-markdown",
|
| 61 |
)
|
| 62 |
|
| 63 |
+
with gr.Tabs():
|
| 64 |
+
# ===== QUICK START TAB =====
|
| 65 |
+
with gr.Tab("Quick Start", id="quick_tab"):
|
| 66 |
+
analysis_preset = gr.Radio(
|
| 67 |
+
label="What kind of analysis do you need?",
|
| 68 |
+
choices=[
|
| 69 |
+
"📊 Standard — Vocabulary + Sequences + Fuzzy matching",
|
| 70 |
+
"🧠 Deep — All metrics including AI meaning analysis",
|
| 71 |
+
"⚡ Quick — Vocabulary overlap only (fastest)"
|
| 72 |
+
],
|
| 73 |
+
value="📊 Standard — Vocabulary + Sequences + Fuzzy matching",
|
| 74 |
+
info="Standard is recommended for most users. Deep analysis takes longer but finds texts with similar meaning even when words differ."
|
| 75 |
+
)
|
| 76 |
+
|
| 77 |
+
gr.Markdown("""
|
| 78 |
+
**What each preset includes:**
|
| 79 |
+
|
| 80 |
+
| Preset | Jaccard | LCS | Fuzzy | Semantic AI |
|
| 81 |
+
|--------|---------|-----|-------|-------------|
|
| 82 |
+
| 📊 Standard | ✓ | ✓ | ✓ | — |
|
| 83 |
+
| 🧠 Deep | ✓ | ✓ | ✓ | ✓ |
|
| 84 |
+
| ⚡ Quick | ✓ | — | — | — |
|
| 85 |
+
""", elem_classes="preset-table")
|
| 86 |
+
|
| 87 |
+
process_btn_quick = gr.Button(
|
| 88 |
+
"🔍 Compare My Texts", elem_id="run-btn-quick", variant="primary"
|
| 89 |
+
)
|
| 90 |
+
|
| 91 |
+
# ===== CUSTOM TAB =====
|
| 92 |
+
with gr.Tab("Custom", id="custom_tab"):
|
| 93 |
+
gr.Markdown("**Fine-tune each metric and option:**", elem_classes="custom-header")
|
| 94 |
+
|
| 95 |
+
with gr.Accordion("📊 Lexical Metrics", open=True):
|
| 96 |
+
gr.Markdown("*Compare the actual words used in texts*")
|
| 97 |
+
|
| 98 |
+
tokenization_mode_dropdown = gr.Dropdown(
|
| 99 |
+
label="How to split text?",
|
| 100 |
+
choices=[
|
| 101 |
+
"word - Whole words (recommended)",
|
| 102 |
+
"syllable - Individual syllables (finer detail)"
|
| 103 |
+
],
|
| 104 |
+
value="word - Whole words (recommended)",
|
| 105 |
+
info="'Word' keeps multi-syllable words together — recommended for Jaccard."
|
| 106 |
+
)
|
| 107 |
+
|
| 108 |
+
stopwords_dropdown = gr.Dropdown(
|
| 109 |
+
label="Filter common words?",
|
| 110 |
+
choices=[
|
| 111 |
+
"None (No filtering)",
|
| 112 |
+
"Standard (Common particles only)",
|
| 113 |
+
"Aggressive (All function words)"
|
| 114 |
+
],
|
| 115 |
+
value="Standard (Common particles only)",
|
| 116 |
+
info="Remove common particles (གི, ལ, ནི) before comparing."
|
| 117 |
+
)
|
| 118 |
+
|
| 119 |
+
particle_normalization_checkbox = gr.Checkbox(
|
| 120 |
+
label="Normalize grammatical particles?",
|
| 121 |
+
value=False,
|
| 122 |
+
info="Treat variants as equivalent (གི/ཀྱི/གྱི → གི). Useful for different scribal conventions."
|
| 123 |
+
)
|
| 124 |
+
|
| 125 |
+
with gr.Accordion("📏 Sequence Matching (LCS)", open=True):
|
| 126 |
+
gr.Markdown("*Find shared passages in the same order*")
|
| 127 |
+
|
| 128 |
+
gr.Checkbox(
|
| 129 |
+
label="Enable sequence matching",
|
| 130 |
+
value=True,
|
| 131 |
+
info="Finds the longest sequence of words appearing in both texts."
|
| 132 |
+
) # LCS is always computed as a core metric
|
| 133 |
+
|
| 134 |
+
lcs_normalization_dropdown = gr.Dropdown(
|
| 135 |
+
label="How to handle different text lengths?",
|
| 136 |
+
choices=[
|
| 137 |
+
"avg - Balanced comparison (default)",
|
| 138 |
+
"min - Detect if one text contains the other",
|
| 139 |
+
"max - Stricter, penalizes length differences"
|
| 140 |
+
],
|
| 141 |
+
value="avg - Balanced comparison (default)",
|
| 142 |
+
info="'min' is useful for finding quotes or excerpts."
|
| 143 |
+
)
|
| 144 |
+
|
| 145 |
+
with gr.Accordion("🔍 Fuzzy Matching", open=True):
|
| 146 |
+
gr.Markdown("*Detect similar but not identical text*")
|
| 147 |
+
|
| 148 |
+
fuzzy_toggle_radio = gr.Radio(
|
| 149 |
+
label="Find approximate matches?",
|
| 150 |
+
choices=["Yes", "No"],
|
| 151 |
+
value="Yes",
|
| 152 |
+
info="Useful for spelling variations and scribal differences."
|
| 153 |
+
)
|
| 154 |
+
|
| 155 |
+
fuzzy_method_dropdown = gr.Dropdown(
|
| 156 |
+
label="Matching method",
|
| 157 |
+
choices=[
|
| 158 |
+
"ngram - Syllable pairs (recommended)",
|
| 159 |
+
"syllable_edit - Count syllable changes",
|
| 160 |
+
"weighted_jaccard - Word frequency comparison"
|
| 161 |
+
],
|
| 162 |
+
value="ngram - Syllable pairs (recommended)",
|
| 163 |
+
info="All options work at the Tibetan syllable level."
|
| 164 |
+
)
|
| 165 |
+
|
| 166 |
+
with gr.Accordion("🧠 Semantic Analysis", open=False):
|
| 167 |
+
gr.Markdown("*Compare meaning using AI (slower)*")
|
| 168 |
+
|
| 169 |
+
semantic_toggle_radio = gr.Radio(
|
| 170 |
+
label="Analyze meaning similarity?",
|
| 171 |
+
choices=["Yes", "No"],
|
| 172 |
+
value="No",
|
| 173 |
+
info="Finds texts that say similar things in different words."
|
| 174 |
+
)
|
| 175 |
+
|
| 176 |
+
model_dropdown = gr.Dropdown(
|
| 177 |
+
choices=[
|
| 178 |
+
"buddhist-nlp/buddhist-sentence-similarity",
|
| 179 |
+
"buddhist-nlp/bod-eng-similarity",
|
| 180 |
+
"sentence-transformers/LaBSE",
|
| 181 |
+
"BAAI/bge-m3"
|
| 182 |
+
],
|
| 183 |
+
label="AI Model",
|
| 184 |
+
value="buddhist-nlp/buddhist-sentence-similarity",
|
| 185 |
+
info="'buddhist-sentence-similarity' works best for Buddhist texts."
|
| 186 |
+
)
|
| 187 |
+
|
| 188 |
+
batch_size_slider = gr.Slider(
|
| 189 |
+
minimum=1,
|
| 190 |
+
maximum=64,
|
| 191 |
+
value=8,
|
| 192 |
+
step=1,
|
| 193 |
+
label="Processing batch size",
|
| 194 |
+
info="Higher = faster but uses more memory."
|
| 195 |
+
)
|
| 196 |
+
|
| 197 |
+
progress_bar_checkbox = gr.Checkbox(
|
| 198 |
+
label="Show detailed progress",
|
| 199 |
+
value=False,
|
| 200 |
+
info="See step-by-step progress during analysis."
|
| 201 |
+
)
|
| 202 |
+
|
| 203 |
+
process_btn_custom = gr.Button(
|
| 204 |
+
"🔍 Compare My Texts (Custom)", elem_id="run-btn-custom", variant="primary"
|
| 205 |
+
)
|
| 206 |
+
|
| 207 |
+
# Note: Both process_btn_quick and process_btn_custom are wired below
|
| 208 |
|
| 209 |
gr.Markdown(
|
| 210 |
"""## Results
|
|
|
|
| 214 |
# The heatmap_titles and metric_tooltips dictionaries are defined here
|
| 215 |
# heatmap_titles = { ... }
|
| 216 |
# metric_tooltips = { ... }
|
| 217 |
+
csv_output = gr.File(label="📥 Download Full Results (CSV spreadsheet)")
|
| 218 |
metrics_preview = gr.Dataframe(
|
| 219 |
+
label="Results Summary — Compare chapters across your texts", interactive=False, visible=True
|
| 220 |
)
|
| 221 |
# States for data persistence
|
| 222 |
state_text_data = gr.State()
|
| 223 |
state_df_results = gr.State()
|
| 224 |
+
|
| 225 |
# LLM Interpretation components
|
| 226 |
with gr.Row():
|
| 227 |
with gr.Column():
|
| 228 |
gr.Markdown(
|
| 229 |
+
"## Get Expert Insights\n*Let AI help you understand what the numbers mean and what patterns they reveal about your texts.*",
|
| 230 |
elem_classes="gr-markdown"
|
| 231 |
)
|
| 232 |
+
|
| 233 |
# Add the interpret button
|
| 234 |
with gr.Row():
|
| 235 |
interpret_btn = gr.Button(
|
| 236 |
+
"📊 Explain My Results",
|
| 237 |
variant="primary",
|
| 238 |
elem_id="interpret-btn"
|
| 239 |
)
|
| 240 |
# Create a placeholder message with proper formatting and structure
|
| 241 |
initial_message = """
|
| 242 |
+
## Understanding Your Results
|
| 243 |
|
| 244 |
+
<small>*After running the analysis, click "Explain My Results" to get a plain-language interpretation of what the similarity scores mean for your texts.*</small>
|
| 245 |
"""
|
| 246 |
interpretation_output = gr.Markdown(
|
| 247 |
value=initial_message,
|
| 248 |
elem_id="llm-analysis"
|
| 249 |
)
|
| 250 |
+
|
| 251 |
# Heatmap tabs for each metric
|
| 252 |
heatmap_titles = {
|
| 253 |
+
"Jaccard Similarity (%)": "Shows how much vocabulary the texts share. Higher = more words in common.",
|
| 254 |
+
"Normalized LCS": "Shows shared phrases in the same order. Higher = more passages appear in both texts.",
|
| 255 |
+
"Fuzzy Similarity": "Finds similar text even with spelling differences. Higher = more alike.",
|
| 256 |
+
"Semantic Similarity": "Compares actual meaning using AI. Higher = texts say similar things.",
|
| 257 |
+
"Word Counts": "How long is each section? Helps you understand text structure.",
|
| 258 |
}
|
| 259 |
|
| 260 |
metric_tooltips = {
|
| 261 |
"Jaccard Similarity (%)": """
|
| 262 |
+
### Vocabulary Overlap (Jaccard Similarity)
|
| 263 |
+
|
| 264 |
+
**What it measures:** How many unique words appear in both texts.
|
| 265 |
|
| 266 |
+
**How to read it:** A score of 70% means 70% of all unique words found in either text appear in both. Higher scores = more shared vocabulary.
|
| 267 |
|
| 268 |
+
**What it tells you:**
|
| 269 |
+
- High scores (>70%): Texts use very similar vocabulary — possibly the same source or direct copying
|
| 270 |
+
- Medium scores (40-70%): Texts share significant vocabulary — likely related topics or traditions
|
| 271 |
+
- Low scores (<40%): Texts use different words — different sources or heavily edited versions
|
| 272 |
|
| 273 |
+
**Good to know:** This metric ignores word order and how often words repeat. It only asks "does this word appear in both texts?"
|
| 274 |
+
|
| 275 |
+
**Tips:**
|
| 276 |
+
- Use the "Filter common words" option to focus on meaningful content words rather than grammatical particles.
|
| 277 |
+
- **Word mode is recommended** for Jaccard. Syllable mode may inflate scores because common syllables (like ས, ར, ན) appear in many different words.
|
| 278 |
""",
|
| 279 |
"Fuzzy Similarity": """
|
| 280 |
+
### Approximate Matching (Fuzzy Similarity)
|
| 281 |
+
|
| 282 |
+
**What it measures:** How similar texts are, even when they're not exactly the same.
|
| 283 |
+
|
| 284 |
+
**How to read it:** Scores from 0 to 1. Higher = more similar. A score of 0.85 means the texts are 85% alike.
|
| 285 |
|
| 286 |
+
**What it tells you:**
|
| 287 |
+
- High scores (>0.8): Very similar texts with minor differences (spelling, small edits)
|
| 288 |
+
- Medium scores (0.5-0.8): Noticeably different but clearly related
|
| 289 |
+
- Low scores (<0.5): Substantially different texts
|
| 290 |
|
| 291 |
+
**Why it matters for Tibetan texts:**
|
| 292 |
+
- Catches spelling variations between manuscripts
|
| 293 |
+
- Finds scribal differences and regional conventions
|
| 294 |
+
- Identifies passages that were slightly modified
|
|
|
|
| 295 |
|
| 296 |
+
**Recommended methods:**
|
| 297 |
+
- **Syllable pairs (ngram)**: Best for Tibetan — compares pairs of syllables
|
| 298 |
+
- **Count syllable changes**: Good for finding minor edits
|
| 299 |
+
- **Word frequency**: Useful when certain words repeat often
|
| 300 |
""",
|
| 301 |
"Normalized LCS": """
|
| 302 |
+
### Shared Sequences (Longest Common Subsequence)
|
| 303 |
+
|
| 304 |
+
**What it measures:** The longest chain of words that appears in both texts *in the same order*.
|
| 305 |
+
|
| 306 |
+
**How to read it:** Higher scores mean longer shared passages. A score of 0.6 means 60% of the text follows the same word sequence.
|
|
|
|
| 307 |
|
| 308 |
+
**Example:** If Text A says "the quick brown fox" and Text B says "the lazy brown dog", the shared sequence is "the brown" — words that appear in both, in the same order.
|
| 309 |
|
| 310 |
+
**What it tells you:**
|
| 311 |
+
- High scores (>0.6): Texts share substantial passages — likely direct copying or common source
|
| 312 |
+
- Medium scores (0.3-0.6): Some shared phrasing — possibly related traditions
|
| 313 |
+
- Low scores (<0.3): Different word ordering — independent compositions or heavy editing
|
| 314 |
+
|
| 315 |
+
**Why this is different from vocabulary overlap:**
|
| 316 |
+
- Vocabulary overlap asks: "Do they use the same words?"
|
| 317 |
+
- Sequence matching asks: "Do they say things in the same order?"
|
| 318 |
+
|
| 319 |
+
Two texts might share many words (high Jaccard) but arrange them differently (low LCS), suggesting they discuss similar topics but were composed independently.
|
| 320 |
""",
|
| 321 |
"Semantic Similarity": """
|
| 322 |
+
### Meaning Similarity (Semantic Analysis)
|
| 323 |
+
|
| 324 |
+
**What it measures:** Whether texts convey similar *meaning*, even if they use different words.
|
| 325 |
+
|
| 326 |
+
**How to read it:** Scores from 0 to 1. Higher = more similar meaning. A score of 0.8 means the texts express very similar ideas.
|
| 327 |
+
|
| 328 |
+
**What it tells you:**
|
| 329 |
+
- High scores (>0.75): Texts say similar things, even if worded differently
|
| 330 |
+
- Medium scores (0.5-0.75): Related topics or themes
|
| 331 |
+
- Low scores (<0.5): Different subject matter
|
| 332 |
+
|
| 333 |
+
**How it works:** An AI model (trained on Buddhist texts) reads both passages and judges how similar their meaning is. This catches similarities that word-matching would miss.
|
| 334 |
+
|
| 335 |
+
**When to use it:**
|
| 336 |
+
- Finding paraphrased passages
|
| 337 |
+
- Identifying texts that discuss the same concepts differently
|
| 338 |
+
- Comparing translations or commentaries
|
| 339 |
+
|
| 340 |
+
**Note:** This takes longer to compute but provides insights the other metrics can't.
|
| 341 |
""",
|
| 342 |
"Word Counts": """
|
| 343 |
+
### Text Length by Section
|
| 344 |
+
|
| 345 |
+
**What it shows:** How many words are in each chapter or section of your texts.
|
| 346 |
+
|
| 347 |
+
**How to read it:** Taller bars = longer sections. Compare bars to see which parts of your texts are longer or shorter.
|
| 348 |
|
| 349 |
+
**What it tells you:**
|
| 350 |
+
- Similar bar heights across texts suggest similar structure
|
| 351 |
+
- Very different lengths might explain why similarity scores vary
|
| 352 |
+
- Helps identify which sections to examine more closely
|
| 353 |
|
| 354 |
+
**Tip:** If one text has much longer chapters, it might contain additional material not in the other version.
|
| 355 |
""",
|
| 356 |
"Structural Analysis": """
|
| 357 |
+
### How Texts Relate to Each Other
|
|
|
|
| 358 |
|
| 359 |
+
**What it shows:** An overview of how your text sections connect and relate across documents.
|
| 360 |
|
| 361 |
+
**What it tells you:**
|
| 362 |
+
- Which sections are most similar to each other
|
| 363 |
+
- Possible patterns of copying or shared sources
|
| 364 |
+
- How texts might have evolved or been edited over time
|
|
|
|
| 365 |
|
| 366 |
+
**Useful for:**
|
| 367 |
+
- Understanding textual transmission history
|
| 368 |
+
- Identifying which version might be older or more original
|
| 369 |
+
- Finding sections that were added, removed, or modified
|
| 370 |
+
|
| 371 |
+
**Note:** This analysis combines all the other metrics to give you the big picture.
|
| 372 |
"""
|
| 373 |
|
| 374 |
}
|
| 375 |
heatmap_tabs = {}
|
| 376 |
+
gr.Markdown("## Visual Comparison", elem_classes="gr-markdown")
|
| 377 |
+
|
| 378 |
with gr.Tabs(elem_id="heatmap-tab-group"):
|
| 379 |
# Process all metrics
|
| 380 |
metrics_to_display = heatmap_titles
|
| 381 |
+
|
| 382 |
for metric_key, descriptive_title in metrics_to_display.items():
|
| 383 |
with gr.Tab(metric_key):
|
| 384 |
# Set CSS class based on metric type
|
| 385 |
if metric_key == "Jaccard Similarity (%)":
|
| 386 |
css_class = "metric-info-accordion jaccard-info"
|
| 387 |
+
accordion_title = "ℹ️ What does this mean?"
|
| 388 |
elif metric_key == "Normalized LCS":
|
| 389 |
css_class = "metric-info-accordion lcs-info"
|
| 390 |
+
accordion_title = "ℹ️ What does this mean?"
|
| 391 |
elif metric_key == "Fuzzy Similarity":
|
| 392 |
css_class = "metric-info-accordion fuzzy-info"
|
| 393 |
+
accordion_title = "ℹ️ What does this mean?"
|
| 394 |
elif metric_key == "Semantic Similarity":
|
| 395 |
css_class = "metric-info-accordion semantic-info"
|
| 396 |
+
accordion_title = "ℹ️ What does this mean?"
|
| 397 |
elif metric_key == "Word Counts":
|
| 398 |
css_class = "metric-info-accordion wordcount-info"
|
| 399 |
+
accordion_title = "ℹ️ What does this mean?"
|
| 400 |
else:
|
| 401 |
css_class = "metric-info-accordion"
|
| 402 |
+
accordion_title = f"ℹ️ About {metric_key}"
|
| 403 |
+
|
| 404 |
# Create the accordion with appropriate content
|
| 405 |
with gr.Accordion(accordion_title, open=False, elem_classes=css_class):
|
| 406 |
if metric_key == "Word Counts":
|
| 407 |
gr.Markdown("""
|
| 408 |
+
### Text Length by Section
|
| 409 |
+
|
| 410 |
+
This chart shows how many words are in each chapter or section. Taller bars = longer sections.
|
| 411 |
+
|
| 412 |
+
**Why it matters:** If sections have very different lengths, it might explain differences in similarity scores.
|
| 413 |
""")
|
| 414 |
elif metric_key in metric_tooltips:
|
| 415 |
gr.Markdown(value=metric_tooltips[metric_key], elem_classes="metric-description")
|
| 416 |
else:
|
| 417 |
gr.Markdown(value=f"### {metric_key}\nDescription not found.")
|
| 418 |
+
|
| 419 |
# Add the appropriate plot
|
| 420 |
if metric_key == "Word Counts":
|
| 421 |
word_count_plot = gr.Plot(label="Word Counts per Segment", show_label=False, scale=1, elem_classes="metric-description")
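The Jaccard tooltip in this hunk describes the score as the share of unique words that occur in both texts. A minimal sketch of that definition follows; it assumes the texts are already tokenized into lists, and `jaccard_percent` is a hypothetical helper, not the function in `pipeline/metrics.py` (which this part of the diff does not show).

```python
def jaccard_percent(tokens_a, tokens_b, stopwords=frozenset()):
    """Jaccard similarity of two token lists, as a percentage (illustrative)."""
    # Sets discard word order and repetition, as the tooltip notes.
    set_a = set(tokens_a) - stopwords
    set_b = set(tokens_b) - stopwords
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here for two empty inputs
    return 100.0 * len(set_a & set_b) / len(union)


text_a = "the quick brown fox".split()
text_b = "the lazy brown dog".split()

# 2 shared words out of 6 unique words.
print(round(jaccard_percent(text_a, text_b), 1))
# With "the" filtered out: 1 shared word out of 5 unique words.
print(round(jaccard_percent(text_a, text_b, frozenset({"the"})), 1))
```

The second call illustrates why the stopword filter matters: removing a word that both texts share changes the score, so the filter setting should be kept the same across runs that are meant to be compared.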
|
|
|
|
| 428 |
# The visual placement depends on how Gradio renders children of gr.Tab or if there's another container.
|
| 429 |
|
| 430 |
warning_box = gr.Markdown(visible=False)
|
| 431 |
+
|
| 432 |
# Create a container for metric progress indicators
|
| 433 |
with gr.Row(visible=False) as progress_container:
|
| 434 |
# Progress indicators will be created dynamically by ProgressiveUI
|
| 435 |
gr.Markdown("Metric progress will appear here during analysis")
|
| 436 |
|
| 437 |
+
def run_pipeline(files, enable_semantic, enable_fuzzy, fuzzy_method, lcs_normalization, model_name, tokenization_mode, stopwords_option, normalize_particles, batch_size, show_progress, progress=gr.Progress()):
|
| 438 |
"""Processes uploaded files, computes metrics, generates visualizations, and prepares outputs for the UI.
|
| 439 |
+
|
| 440 |
Args:
|
| 441 |
files: A list of file objects uploaded by the user.
|
| 442 |
enable_semantic: Whether to compute semantic similarity.
|
| 443 |
enable_fuzzy: Whether to compute fuzzy string similarity.
|
| 444 |
fuzzy_method: The fuzzy matching method to use.
lcs_normalization: How to normalize LCS for differing text lengths (avg, min, or max).
|
| 445 |
model_name: Name of the embedding model to use.
|
| 446 |
+
tokenization_mode: How to tokenize text (syllable or word).
|
| 447 |
stopwords_option: Stopword filtering level (None, Standard, or Aggressive).
|
| 448 |
+
normalize_particles: Whether to normalize grammatical particles.
|
| 449 |
batch_size: Batch size for embedding generation.
|
| 450 |
show_progress: Whether to show progress bars during embedding.
|
| 451 |
progress: Gradio progress indicator.
|
| 452 |
+
|
| 453 |
Returns:
|
| 454 |
tuple: Results for UI components including metrics, visualizations, and state.
|
| 455 |
"""
|
|
|
|
| 464 |
warning_update_res = gr.update(visible=False)
|
| 465 |
state_text_data_res = None
|
| 466 |
state_df_results_res = None
|
| 467 |
+
|
| 468 |
# Create a ProgressiveUI instance for handling progressive updates
|
| 469 |
progressive_ui = ProgressiveUI(
|
| 470 |
metrics_preview=metrics_preview,
|
|
|
|
| 477 |
progress_container=progress_container,
|
| 478 |
heatmap_titles=heatmap_titles
|
| 479 |
)
|
| 480 |
+
|
| 481 |
# Make progress container visible during analysis
|
| 482 |
progress_container.update(visible=True)
|
| 483 |
+
|
| 484 |
# Create a progressive callback function
|
| 485 |
progressive_callback = create_progressive_callback(progressive_ui)
|
| 486 |
# Check if files are provided
|
|
|
|
| 497 |
None, # state_text_data
|
| 498 |
None # state_df_results
|
| 499 |
)
|
| 500 |
+
|
| 501 |
# Check file size limits (10MB per file)
|
| 502 |
for file in files:
|
| 503 |
file_size_mb = Path(file.name).stat().st_size / (1024 * 1024)
|
|
|
|
| 521 |
progress(0.1, desc="Preparing files...")
|
| 522 |
except Exception as e:
|
| 523 |
logger.warning(f"Progress update error (non-critical): {e}")
|
| 524 |
+
|
| 525 |
# Get filenames and read file contents
|
| 526 |
filenames = [
|
| 527 |
Path(file.name).name for file in files
|
| 528 |
] # Use Path().name to get just the filename
|
| 529 |
text_data = {}
|
| 530 |
+
|
| 531 |
# Read files with progress updates
|
| 532 |
for i, file in enumerate(files):
|
| 533 |
file_path = Path(file.name)
|
|
|
|
| 537 |
progress(0.1 + (0.1 * (i / len(files))), desc=f"Reading file: {filename}")
|
| 538 |
except Exception as e:
|
| 539 |
logger.warning(f"Progress update error (non-critical): {e}")
|
| 540 |
+
|
| 541 |
try:
|
| 542 |
text_data[filename] = file_path.read_text(encoding="utf-8-sig")
|
| 543 |
except UnicodeDecodeError:
|
|
|
|
| 561 |
# Configure semantic similarity and fuzzy matching
|
| 562 |
enable_semantic_bool = enable_semantic == "Yes"
|
| 563 |
enable_fuzzy_bool = enable_fuzzy == "Yes"
|
| 564 |
+
|
| 565 |
# Extract the fuzzy method from the dropdown value
|
| 566 |
+
fuzzy_method_value = fuzzy_method.split(' - ')[0] if fuzzy_method else 'ngram'
|
| 567 |
+
|
| 568 |
+
# Extract the LCS normalization from the dropdown value
|
| 569 |
+
lcs_normalization_value = lcs_normalization.split(' - ')[0] if lcs_normalization else 'avg'
|
| 570 |
+
|
| 571 |
+
# Extract the tokenization mode from the dropdown value
|
| 572 |
+
tokenization_mode_value = tokenization_mode.split(' - ')[0] if tokenization_mode else 'syllable'
|
| 573 |
+
|
| 574 |
if progress is not None:
|
| 575 |
try:
|
| 576 |
progress(0.2, desc="Loading model..." if enable_semantic_bool else "Processing text...")
|
| 577 |
except Exception as e:
|
| 578 |
logger.warning(f"Progress update error (non-critical): {e}")
|
| 579 |
+
|
| 580 |
# Process texts with selected model
|
| 581 |
# Convert stopword option to appropriate parameters
|
| 582 |
use_stopwords = stopwords_option != "None (No filtering)"
|
| 583 |
use_lite_stopwords = stopwords_option == "Standard (Common particles only)"
|
| 584 |
+
|
| 585 |
# For Hugging Face models, the UI value is the correct model ID
|
| 586 |
internal_model_id = model_name
|
| 587 |
|
|
|
|
| 591 |
enable_semantic=enable_semantic_bool,
|
| 592 |
enable_fuzzy=enable_fuzzy_bool,
|
| 593 |
fuzzy_method=fuzzy_method_value,
|
| 594 |
+
lcs_normalization=lcs_normalization_value,
|
| 595 |
model_name=internal_model_id,
|
| 596 |
use_stopwords=use_stopwords,
|
| 597 |
use_lite_stopwords=use_lite_stopwords,
|
| 598 |
+
normalize_particles=normalize_particles,
|
| 599 |
+
tokenization_mode=tokenization_mode_value,
|
| 600 |
progress_callback=progress,
|
| 601 |
progressive_callback=progressive_callback,
|
| 602 |
batch_size=batch_size,
|
|
|
|
| 616 |
progress(0.8, desc="Generating visualizations...")
|
| 617 |
except Exception as e:
|
| 618 |
logger.warning(f"Progress update error (non-critical): {e}")
|
| 619 |
+
|
| 620 |
# heatmap_titles is already defined in the outer scope of main_interface
|
| 621 |
heatmaps_data = generate_visualizations(
|
| 622 |
df_results, descriptive_titles=heatmap_titles
|
| 623 |
)
|
| 624 |
+
|
| 625 |
# Generate word count chart
|
| 626 |
if progress is not None:
|
| 627 |
try:
|
|
|
|
| 629 |
except Exception as e:
|
| 630 |
logger.warning(f"Progress update error (non-critical): {e}")
|
| 631 |
word_count_fig_res = generate_word_count_chart(word_counts_df_data)
|
| 632 |
+
|
| 633 |
# Store state data for potential future use
|
| 634 |
state_text_data_res = text_data
|
| 635 |
state_df_results_res = df_results
|
| 636 |
logger.info("Analysis complete, storing state data")
|
| 637 |
+
|
| 638 |
# Save results to CSV
|
| 639 |
if progress is not None:
|
| 640 |
try:
|
|
|
|
| 643 |
logger.warning(f"Progress update error (non-critical): {e}")
|
| 644 |
csv_path_res = "results.csv"
|
| 645 |
df_results.to_csv(csv_path_res, index=False)
|
| 646 |
+
|
| 647 |
# Prepare final output
|
| 648 |
warning_md = f"**⚠️ Warning:** {warning_raw}" if warning_raw else ""
|
| 649 |
metrics_preview_df_res = df_results.head(10)
|
|
|
|
| 651 |
jaccard_heatmap_res = heatmaps_data.get("Jaccard Similarity (%)")
|
| 652 |
lcs_heatmap_res = heatmaps_data.get("Normalized LCS")
|
| 653 |
fuzzy_heatmap_res = heatmaps_data.get("Fuzzy Similarity")
|
| 654 |
+
semantic_heatmap_res = heatmaps_data.get("Semantic Similarity")
|
|
|
|
|
|
|
|
|
|
| 655 |
warning_update_res = gr.update(
|
| 656 |
visible=bool(warning_raw), value=warning_md
|
| 657 |
)
|
|
|
|
| 680 |
try:
|
| 681 |
if not csv_path or not Path(csv_path).exists():
|
| 682 |
return "Please run the analysis first to generate results."
|
| 683 |
+
|
| 684 |
# Read the CSV file
|
| 685 |
df_results = pd.read_csv(csv_path)
|
| 686 |
+
|
| 687 |
# Show detailed progress messages with percentages
|
| 688 |
progress(0, desc="Preparing data for analysis...")
|
| 689 |
progress(0.1, desc="Analyzing similarity patterns...")
|
| 690 |
progress(0.2, desc="Connecting to Mistral 7B via OpenRouter...")
|
| 691 |
+
|
| 692 |
# Get interpretation from LLM (using OpenRouter API)
|
| 693 |
progress(0.3, desc="Generating scholarly interpretation (this may take 20-40 seconds)...")
|
| 694 |
llm_service = LLMService()
|
| 695 |
interpretation = llm_service.analyze_similarity(df_results)
|
| 696 |
+
|
| 697 |
# Simulate completion steps
|
| 698 |
progress(0.9, desc="Formatting results...")
|
| 699 |
progress(0.95, desc="Applying scholarly formatting...")
|
| 700 |
+
|
| 701 |
# Completed
|
| 702 |
progress(1.0, desc="Analysis complete!")
|
| 703 |
+
|
| 704 |
# Add a timestamp to the interpretation
|
| 705 |
timestamp = datetime.now().strftime("%Y-%m-%d %H:%M")
|
| 706 |
interpretation = f"{interpretation}\n\n<small>Analysis generated on {timestamp}</small>"
|
|
|
|
| 708 |
except Exception as e:
|
| 709 |
logger.error(f"Error in interpret_results: {e}", exc_info=True)
|
| 710 |
return f"Error interpreting results: {str(e)}"
|
| 711 |
+
|
+def run_pipeline_preset(files, preset, progress=gr.Progress()):
+    """Wrapper that converts preset selection to pipeline parameters."""
+    # Determine settings based on preset
+    if "Quick" in preset:
+        # Quick: Jaccard only
+        enable_semantic = "No"
+        enable_fuzzy = "No"
+    elif "Deep" in preset:
+        # Deep: All metrics including semantic
+        enable_semantic = "Yes"
+        enable_fuzzy = "Yes"
+    else:
+        # Standard: Jaccard + LCS + Fuzzy (no semantic)
+        enable_semantic = "No"
+        enable_fuzzy = "Yes"
+
+    # Use sensible defaults for preset mode
+    fuzzy_method = "ngram - Syllable pairs (recommended)"
+    lcs_normalization = "avg - Balanced comparison (default)"
+    model_name = "buddhist-nlp/buddhist-sentence-similarity"
+    tokenization_mode = "word - Whole words (recommended)"
+    stopwords_option = "Standard (Common particles only)"
+    normalize_particles = False
+    batch_size = 8
+    show_progress = False
+
+    return run_pipeline(
+        files, enable_semantic, enable_fuzzy, fuzzy_method,
+        lcs_normalization, model_name, tokenization_mode,
+        stopwords_option, normalize_particles, batch_size,
+        show_progress, progress
+    )
|
| 744 |
+
|
+# Output components for both buttons
+pipeline_outputs = [
+    csv_output,
+    metrics_preview,
+    word_count_plot,
+    heatmap_tabs["Jaccard Similarity (%)"],
+    heatmap_tabs["Normalized LCS"],
+    heatmap_tabs["Fuzzy Similarity"],
+    heatmap_tabs["Semantic Similarity"],
+    warning_box,
+    state_text_data,
+    state_df_results,
+]
+
+# Quick Start button uses presets
+process_btn_quick.click(
+    fn=run_pipeline_preset,
+    inputs=[file_input, analysis_preset],
+    outputs=pipeline_outputs
+)
+
+# Custom button uses all the detailed settings
+process_btn_custom.click(
     fn=run_pipeline,
+    inputs=[
+        file_input,
+        semantic_toggle_radio,
+        fuzzy_toggle_radio,
+        fuzzy_method_dropdown,
+        lcs_normalization_dropdown,
+        model_dropdown,
+        tokenization_mode_dropdown,
+        stopwords_dropdown,
+        particle_normalization_checkbox,
+        batch_size_slider,
+        progress_bar_checkbox
+    ],
+    outputs=pipeline_outputs
 )
|
| 784 |
|
| 785 |
# Structural analysis functionality removed - see dedicated collation app
|
| 786 |
+
|
| 787 |
# Connect the interpret button
|
| 788 |
interpret_btn.click(
|
| 789 |
fn=interpret_results,
|
| 790 |
inputs=[csv_output],
|
| 791 |
outputs=interpretation_output
|
| 792 |
)
|
| 793 |
+
|
| 794 |
return demo
|
| 795 |
|
| 796 |
|
| 797 |
if __name__ == "__main__":
|
| 798 |
demo = main_interface()
|
| 799 |
+
demo.launch()
|
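The preset wrapper added to app.py can be read as two small, testable pieces: a lookup from preset label to the "Yes"/"No" toggles, and a helper that strips the description from a dropdown label such as `ngram - Syllable pairs (recommended)`. The sketch below is a minimal illustration of that idea, not the code in this commit. `PresetSettings`, `resolve_preset`, and `dropdown_key` are hypothetical names, and the preset labels in the usage lines are assumed; the diff only shows that the labels contain "Quick" or "Deep".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PresetSettings:
    """Hypothetical container for the two toggles a preset controls."""

    enable_semantic: str
    enable_fuzzy: str


def resolve_preset(preset: str) -> PresetSettings:
    """Map a preset label to the "Yes"/"No" strings run_pipeline compares against."""
    if "Quick" in preset:
        # Quick: neither semantic nor fuzzy similarity
        return PresetSettings(enable_semantic="No", enable_fuzzy="No")
    if "Deep" in preset:
        # Deep: all metrics, including semantic similarity
        return PresetSettings(enable_semantic="Yes", enable_fuzzy="Yes")
    # Anything else is treated as Standard: fuzzy but no semantic
    return PresetSettings(enable_semantic="No", enable_fuzzy="Yes")


def dropdown_key(value: Optional[str], default: str) -> str:
    """Return the short key before ' - ' in a dropdown label, or the default."""
    return value.split(" - ")[0] if value else default


if __name__ == "__main__":
    # Assumed label text; only the "Deep" substring matters here.
    print(resolve_preset("Deep analysis"))
    print(dropdown_key("ngram - Syllable pairs (recommended)", "ngram"))
    print(dropdown_key(None, "avg"))
```

Keeping the mapping separate from the Gradio callback would make it possible to unit-test the preset logic without constructing any UI components.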
pipeline/hf_embedding.py
CHANGED
|
@@ -10,8 +10,10 @@ _model_cache = {}
|
|
| 10 |
|
| 11 |
# Model version mapping
|
| 12 |
MODEL_VERSIONS = {
|
| 13 |
-
"sentence-
|
| 14 |
-
"
|
|
|
|
|
|
|
| 15 |
}
|
| 16 |
|
| 17 |
def get_model(model_id: str) -> Tuple[Optional[SentenceTransformer], Optional[str]]:
|
|
@@ -28,7 +30,7 @@ def get_model(model_id: str) -> Tuple[Optional[SentenceTransformer], Optional[st
|
|
| 28 |
# Include version information in cache key
|
| 29 |
model_version = MODEL_VERSIONS.get(model_id, "unknown")
|
| 30 |
cache_key = f"{model_id}@{model_version}"
|
| 31 |
-
|
| 32 |
if cache_key in _model_cache:
|
| 33 |
logger.info(f"Returning cached model: {model_id} (version: {model_version})")
|
| 34 |
return _model_cache[cache_key], "sentence-transformer"
|
|
@@ -44,9 +46,9 @@ def get_model(model_id: str) -> Tuple[Optional[SentenceTransformer], Optional[st
|
|
| 44 |
return None, None
|
| 45 |
|
| 46 |
def generate_embeddings(
|
| 47 |
-
texts: List[str],
|
| 48 |
-
model: SentenceTransformer,
|
| 49 |
-
batch_size: int = 32,
|
| 50 |
show_progress_bar: bool = False
|
| 51 |
) -> np.ndarray:
|
| 52 |
"""
|
|
@@ -70,9 +72,9 @@ def generate_embeddings(
|
|
| 70 |
logger.info(f"Generating embeddings for {len(texts)} texts with {type(model).__name__}...")
|
| 71 |
try:
|
| 72 |
embeddings = model.encode(
|
| 73 |
-
texts,
|
| 74 |
batch_size=batch_size,
|
| 75 |
-
convert_to_numpy=True,
|
| 76 |
show_progress_bar=show_progress_bar
|
| 77 |
)
|
| 78 |
logger.info(f"Embeddings generated with shape: {embeddings.shape}")
|
|
|
|
| 10 |
|
| 11 |
# Model version mapping
|
| 12 |
MODEL_VERSIONS = {
|
| 13 |
+
"buddhist-nlp/buddhist-sentence-similarity": "v1.0", # Dharmamitra - best for Tibetan Buddhist texts
|
| 14 |
+
"buddhist-nlp/bod-eng-similarity": "v1.0", # Dharmamitra - Tibetan-English bitext alignment
|
| 15 |
+
"sentence-transformers/LaBSE": "v1.0", # Multilingual baseline
|
| 16 |
+
"BAAI/bge-m3": "v1.0", # Strong multilingual alternative
|
| 17 |
}
|
| 18 |
|
| 19 |
def get_model(model_id: str) -> Tuple[Optional[SentenceTransformer], Optional[str]]:
|
|
|
|
| 30 |
# Include version information in cache key
|
| 31 |
model_version = MODEL_VERSIONS.get(model_id, "unknown")
|
| 32 |
cache_key = f"{model_id}@{model_version}"
|
| 33 |
+
|
| 34 |
if cache_key in _model_cache:
|
| 35 |
logger.info(f"Returning cached model: {model_id} (version: {model_version})")
|
| 36 |
return _model_cache[cache_key], "sentence-transformer"
|
|
|
|
| 46 |
return None, None
|
| 47 |
|
| 48 |
def generate_embeddings(
|
| 49 |
+
texts: List[str],
|
| 50 |
+
model: SentenceTransformer,
|
| 51 |
+
batch_size: int = 32,
|
| 52 |
show_progress_bar: bool = False
|
| 53 |
) -> np.ndarray:
|
| 54 |
"""
|
|
|
|
| 72 |
logger.info(f"Generating embeddings for {len(texts)} texts with {type(model).__name__}...")
|
| 73 |
try:
|
| 74 |
embeddings = model.encode(
|
| 75 |
+
texts,
|
| 76 |
batch_size=batch_size,
|
| 77 |
+
convert_to_numpy=True,
|
| 78 |
show_progress_bar=show_progress_bar
|
| 79 |
)
|
| 80 |
logger.info(f"Embeddings generated with shape: {embeddings.shape}")
|
pipeline/llm_service.py
CHANGED
|
@@ -39,11 +39,11 @@ class LLMService:
|
|
| 39 |
"""
|
| 40 |
Service for analyzing text similarity metrics using LLMs and rule-based methods.
|
| 41 |
"""
|
| 42 |
-
|
| 43 |
def __init__(self, api_key: str = None):
|
| 44 |
"""
|
| 45 |
Initialize the LLM service.
|
| 46 |
-
|
| 47 |
Args:
|
| 48 |
api_key: Optional API key for OpenRouter. If not provided, will try to load from environment.
|
| 49 |
"""
|
|
@@ -51,19 +51,19 @@ class LLMService:
|
|
| 51 |
self.models = PREFERRED_MODELS
|
| 52 |
self.temperature = DEFAULT_TEMPERATURE
|
| 53 |
self.top_p = DEFAULT_TOP_P
|
| 54 |
-
|
| 55 |
def analyze_similarity(
|
| 56 |
-
self,
|
| 57 |
-
results_df: pd.DataFrame,
|
| 58 |
use_llm: bool = True,
|
| 59 |
) -> str:
|
| 60 |
"""
|
| 61 |
Analyze similarity metrics using either LLM or rule-based approach.
|
| 62 |
-
|
| 63 |
Args:
|
| 64 |
results_df: DataFrame containing similarity metrics
|
| 65 |
use_llm: Whether to use LLM for analysis (falls back to rule-based if False or on error)
|
| 66 |
-
|
| 67 |
Returns:
|
| 68 |
str: Analysis of the metrics in markdown format with appropriate fallback messages
|
| 69 |
"""
|
|
@@ -71,19 +71,19 @@ class LLMService:
|
|
| 71 |
if not use_llm:
|
| 72 |
logger.info("LLM analysis disabled. Using rule-based analysis.")
|
| 73 |
return self._analyze_with_rules(results_df)
|
| 74 |
-
|
| 75 |
# Try LLM analysis if enabled
|
| 76 |
try:
|
| 77 |
if not self.api_key:
|
| 78 |
raise ValueError("No OpenRouter API key provided. Please set the OPENROUTER_API_KEY environment variable.")
|
| 79 |
-
|
| 80 |
logger.info("Attempting LLM-based analysis...")
|
| 81 |
return self._analyze_with_llm(results_df, max_tokens=DEFAULT_MAX_TOKENS)
|
| 82 |
-
|
| 83 |
except Exception as e:
|
| 84 |
error_msg = str(e)
|
| 85 |
logger.error(f"Error in LLM analysis: {error_msg}")
|
| 86 |
-
|
| 87 |
# Create a user-friendly error message
|
| 88 |
if "no openrouter api key" in error_msg.lower():
|
| 89 |
error_note = "OpenRouter API key not found. Please set the `OPENROUTER_API_KEY` environment variable to use this feature."
|
|
@@ -95,42 +95,42 @@ class LLMService:
|
|
| 95 |
error_note = "API rate limit exceeded. Falling back to rule-based analysis."
|
| 96 |
else:
|
| 97 |
error_note = f"LLM analysis failed: {error_msg[:200]}. Falling back to rule-based analysis."
|
| 98 |
-
|
| 99 |
# Get rule-based analysis
|
| 100 |
rule_based_analysis = self._analyze_with_rules(results_df)
|
| 101 |
-
|
| 102 |
# Combine the error message with the rule-based analysis
|
| 103 |
return f"## Analysis of Tibetan Text Similarity Metrics\n\n*Note: {error_note}*\n\n{rule_based_analysis}"
|
| 104 |
-
|
| 105 |
def _prepare_dataframe(self, df: pd.DataFrame) -> pd.DataFrame:
|
| 106 |
"""
|
| 107 |
Prepare the DataFrame for analysis.
|
| 108 |
-
|
| 109 |
Args:
|
| 110 |
df: Input DataFrame with similarity metrics
|
| 111 |
-
|
| 112 |
Returns:
|
| 113 |
pd.DataFrame: Cleaned and prepared DataFrame
|
| 114 |
"""
|
| 115 |
# Make a copy to avoid modifying the original
|
| 116 |
df = df.copy()
|
| 117 |
-
|
| 118 |
# Clean text columns
|
| 119 |
text_cols = ['Text A', 'Text B']
|
| 120 |
for col in text_cols:
|
| 121 |
if col in df.columns:
|
| 122 |
df[col] = df[col].fillna('Unknown').astype(str)
|
| 123 |
df[col] = df[col].str.replace('.txt$', '', regex=True)
|
| 124 |
-
|
| 125 |
# Filter out perfect matches (likely empty cells)
|
| 126 |
metrics_cols = ['Jaccard Similarity (%)', 'Normalized LCS']
|
| 127 |
if all(col in df.columns for col in metrics_cols):
|
| 128 |
-
mask = ~((df['Jaccard Similarity (%)'] == 100.0) &
|
| 129 |
(df['Normalized LCS'] == 1.0))
|
| 130 |
df = df[mask].copy()
|
| 131 |
-
|
| 132 |
return df
|
| 133 |
-
|
| 134 |
def _analyze_with_llm(self, df: pd.DataFrame, max_tokens: int) -> str:
|
| 135 |
"""
|
| 136 |
Analyze metrics using an LLM via OpenRouter API, with fallback models.
|
|
@@ -181,65 +181,65 @@ class LLMService:
|
|
| 181 |
raise last_error
|
| 182 |
else:
|
| 183 |
raise Exception("LLM analysis failed for all available models.")
|
| 184 |
-
|
| 185 |
def _analyze_with_rules(self, df: pd.DataFrame) -> str:
|
| 186 |
"""
|
| 187 |
Analyze metrics using rule-based approach.
|
| 188 |
-
|
| 189 |
Args:
|
| 190 |
df: Prepared DataFrame with metrics
|
| 191 |
-
|
| 192 |
Returns:
|
| 193 |
str: Rule-based analysis in markdown format
|
| 194 |
"""
|
| 195 |
analysis = ["## Tibetan Text Similarity Analysis (Rule-Based)"]
|
| 196 |
-
|
| 197 |
# Basic stats
|
| 198 |
text_a_col = 'Text A' if 'Text A' in df.columns else None
|
| 199 |
text_b_col = 'Text B' if 'Text B' in df.columns else None
|
| 200 |
-
|
| 201 |
if text_a_col and text_b_col:
|
| 202 |
unique_texts = set(df[text_a_col].unique()) | set(df[text_b_col].unique())
|
| 203 |
analysis.append(f"- **Texts analyzed:** {', '.join(sorted(unique_texts))}")
|
| 204 |
-
|
| 205 |
# Analyze each metric
|
| 206 |
metric_analyses = []
|
| 207 |
-
|
| 208 |
if 'Jaccard Similarity (%)' in df.columns:
|
| 209 |
jaccard_analysis = self._analyze_jaccard(df)
|
| 210 |
metric_analyses.append(jaccard_analysis)
|
| 211 |
-
|
| 212 |
if 'Normalized LCS' in df.columns:
|
| 213 |
lcs_analysis = self._analyze_lcs(df)
|
| 214 |
metric_analyses.append(lcs_analysis)
|
| 215 |
-
|
| 216 |
# TF-IDF analysis removed
|
| 217 |
-
|
| 218 |
# Add all metric analyses
|
| 219 |
if metric_analyses:
|
| 220 |
analysis.extend(metric_analyses)
|
| 221 |
-
|
| 222 |
# Add overall interpretation
|
| 223 |
analysis.append("\n## Overall Interpretation")
|
| 224 |
analysis.append(self._generate_overall_interpretation(df))
|
| 225 |
-
|
| 226 |
return "\n\n".join(analysis)
|
| 227 |
-
|
| 228 |
def _analyze_jaccard(self, df: pd.DataFrame) -> str:
|
| 229 |
"""Analyze Jaccard similarity scores."""
|
| 230 |
jaccard = df['Jaccard Similarity (%)'].dropna()
|
| 231 |
if jaccard.empty:
|
| 232 |
return ""
|
| 233 |
-
|
| 234 |
mean_jaccard = jaccard.mean()
|
| 235 |
max_jaccard = jaccard.max()
|
| 236 |
min_jaccard = jaccard.min()
|
| 237 |
-
|
| 238 |
analysis = [
|
| 239 |
"### Jaccard Similarity Analysis",
|
| 240 |
f"- **Range:** {min_jaccard:.1f}% to {max_jaccard:.1f}% (mean: {mean_jaccard:.1f}%)"
|
| 241 |
]
|
| 242 |
-
|
| 243 |
# Interpret the scores
|
| 244 |
if mean_jaccard > 60:
|
| 245 |
analysis.append("- **High vocabulary overlap** suggests texts share significant content or are from the same tradition.")
|
|
@@ -247,7 +247,7 @@ class LLMService:
|
|
| 247 |
analysis.append("- **Moderate vocabulary overlap** indicates some shared content or themes.")
|
| 248 |
else:
|
| 249 |
analysis.append("- **Low vocabulary overlap** suggests texts are on different topics or from different traditions.")
|
| 250 |
-
|
| 251 |
# Add top pairs
|
| 252 |
top_pairs = df.nlargest(3, 'Jaccard Similarity (%)')
|
| 253 |
if not top_pairs.empty:
|
|
@@ -257,24 +257,24 @@ class LLMService:
|
|
| 257 |
text_b = row.get('Text B', 'Text 2')
|
| 258 |
score = row['Jaccard Similarity (%)']
|
| 259 |
analysis.append(f"- {text_a} ↔ {text_b}: {score:.1f}%")
|
| 260 |
-
|
| 261 |
return "\n".join(analysis)
|
| 262 |
-
|
| 263 |
def _analyze_lcs(self, df: pd.DataFrame) -> str:
|
| 264 |
"""Analyze Longest Common Subsequence scores."""
|
| 265 |
lcs = df['Normalized LCS'].dropna()
|
| 266 |
if lcs.empty:
|
| 267 |
return ""
|
| 268 |
-
|
| 269 |
mean_lcs = lcs.mean()
|
| 270 |
max_lcs = lcs.max()
|
| 271 |
min_lcs = lcs.min()
|
| 272 |
-
|
| 273 |
analysis = [
|
| 274 |
"### Structural Similarity (LCS) Analysis",
|
| 275 |
f"- **Range:** {min_lcs:.2f} to {max_lcs:.2f} (mean: {mean_lcs:.2f})"
|
| 276 |
]
|
| 277 |
-
|
| 278 |
# Interpret the scores
|
| 279 |
if mean_lcs > 0.7:
|
| 280 |
analysis.append("- **High structural similarity** suggests texts follow similar organizational patterns.")
|
|
@@ -282,7 +282,7 @@ class LLMService:
|
|
| 282 |
analysis.append("- **Moderate structural similarity** indicates some shared organizational elements.")
|
| 283 |
else:
|
| 284 |
analysis.append("- **Low structural similarity** suggests different organizational approaches.")
|
| 285 |
-
|
| 286 |
# Add top pairs
|
| 287 |
top_pairs = df.nlargest(3, 'Normalized LCS')
|
| 288 |
if not top_pairs.empty:
|
|
@@ -292,19 +292,19 @@ class LLMService:
|
|
| 292 |
text_b = row.get('Text B', 'Text 2')
|
| 293 |
score = row['Normalized LCS']
|
| 294 |
analysis.append(f"- {text_a} ↔ {text_b}: {score:.2f}")
|
| 295 |
-
|
| 296 |
return "\n".join(analysis)
|
| 297 |
-
|
| 298 |
# TF-IDF analysis method removed
|
| 299 |
-
|
| 300 |
def _generate_overall_interpretation(self, df: pd.DataFrame) -> str:
|
| 301 |
"""Generate an overall interpretation of the metrics."""
|
| 302 |
interpretations = []
|
| 303 |
-
|
| 304 |
# Get metrics if they exist
|
| 305 |
has_jaccard = 'Jaccard Similarity (%)' in df.columns
|
| 306 |
has_lcs = 'Normalized LCS' in df.columns
|
| 307 |
-
|
| 308 |
# Calculate means for available metrics
|
| 309 |
metrics = {}
|
| 310 |
if has_jaccard:
|
|
@@ -312,51 +312,51 @@ class LLMService:
|
|
| 312 |
if has_lcs:
|
| 313 |
metrics['lcs'] = df['Normalized LCS'].mean()
|
| 314 |
# TF-IDF metrics removed
|
| 315 |
-
|
| 316 |
# Generate interpretation based on metrics
|
| 317 |
if metrics:
|
| 318 |
interpretations.append("Based on the analysis of similarity metrics:")
|
| 319 |
-
|
| 320 |
if has_jaccard and metrics['jaccard'] > 60:
|
| 321 |
interpretations.append("- The high Jaccard similarity indicates significant vocabulary overlap between texts, "
|
| 322 |
"suggesting they may share common sources or be part of the same textual tradition.")
|
| 323 |
-
|
| 324 |
if has_lcs and metrics['lcs'] > 0.7:
|
| 325 |
interpretations.append("- The high LCS score indicates strong structural similarity, "
|
| 326 |
"suggesting the texts may follow similar organizational patterns or share common structural elements.")
|
| 327 |
-
|
| 328 |
# TF-IDF interpretation removed
|
| 329 |
-
|
| 330 |
# Add cross-metric interpretations
|
| 331 |
if has_jaccard and has_lcs and metrics['jaccard'] > 60 and metrics['lcs'] > 0.7:
|
| 332 |
interpretations.append("\nThe combination of high Jaccard and LCS similarities strongly suggests "
|
| 333 |
"that these texts are closely related, possibly being different versions or "
|
| 334 |
"transmissions of the same work or sharing a common source.")
|
| 335 |
-
|
| 336 |
# TF-IDF cross-metric interpretation removed
|
| 337 |
-
|
| 338 |
# Add general guidance if no specific patterns found
|
| 339 |
if not interpretations:
|
| 340 |
interpretations.append("The analysis did not reveal strong patterns in the similarity metrics. "
|
| 341 |
"This could indicate that the texts are either very similar or very different "
|
| 342 |
"across all measured dimensions.")
|
| 343 |
-
|
| 344 |
return "\n\n".join(interpretations)
|
| 345 |
-
|
| 346 |
def _create_llm_prompt(self, df: pd.DataFrame, model_name: str) -> str:
|
| 347 |
"""
|
| 348 |
Create a prompt for the LLM based on the DataFrame.
|
| 349 |
-
|
| 350 |
Args:
|
| 351 |
df: Prepared DataFrame with metrics
|
| 352 |
model_name: Name of the model being used
|
| 353 |
-
|
| 354 |
Returns:
|
| 355 |
str: Formatted prompt for the LLM
|
| 356 |
"""
|
| 357 |
# Convert DataFrame to markdown for the prompt
|
| 358 |
md_table = df.to_markdown(index=False)
|
| 359 |
-
|
| 360 |
# Create the prompt
|
| 361 |
prompt = f"""
|
| 362 |
# Tibetan Text Similarity Analysis
|
|
@@ -372,19 +372,19 @@ You will be provided with a table of text similarity scores in Markdown format.
|
|
| 372 |
|
| 373 |
Your analysis will be performed using the `{model_name}` model. Provide a concise, scholarly analysis in well-structured markdown.
|
| 374 |
"""
|
| 375 |
-
|
| 376 |
|
| 377 |
-
|
|
|
|
| 378 |
return prompt
|
| 379 |
-
|
| 380 |
def _get_system_prompt(self) -> str:
|
| 381 |
"""Get the system prompt for the LLM."""
|
| 382 |
return """You are a senior scholar of Tibetan Buddhist texts, specializing in textual criticism. Your task is to analyze the provided similarity metrics and provide expert insights into the relationships between these texts. Ground your analysis in the data, be precise, and focus on what the metrics reveal about the texts' transmission and history."""
|
| 383 |
-
|
| 384 |
def _call_openrouter_api(self, model: str, prompt: str, system_message: str = None, max_tokens: int = None, temperature: float = None, top_p: float = None) -> str:
|
| 385 |
"""
|
| 386 |
Call the OpenRouter API.
|
| 387 |
-
|
| 388 |
Args:
|
| 389 |
model: Model to use for the API call
|
| 390 |
prompt: The user prompt
|
|
@@ -392,10 +392,10 @@ Your analysis will be performed using the `{model_name}` model. Provide a concis
|
|
| 392 |
max_tokens: Maximum tokens for the response
|
| 393 |
temperature: Sampling temperature
|
| 394 |
top_p: Nucleus sampling parameter
|
| 395 |
-
|
| 396 |
Returns:
|
| 397 |
str: The API response
|
| 398 |
-
|
| 399 |
Raises:
|
| 400 |
ValueError: If API key is missing or invalid
|
| 401 |
requests.exceptions.RequestException: For network-related errors
|
|
@@ -405,21 +405,21 @@ Your analysis will be performed using the `{model_name}` model. Provide a concis
|
|
| 405 |
error_msg = "OpenRouter API key not provided. Please set the OPENROUTER_API_KEY environment variable."
|
| 406 |
logger.error(error_msg)
|
| 407 |
raise ValueError(error_msg)
|
| 408 |
-
|
| 409 |
url = "https://openrouter.ai/api/v1/chat/completions"
|
| 410 |
-
|
| 411 |
headers = {
|
| 412 |
"Authorization": f"Bearer {self.api_key}",
|
| 413 |
"Content-Type": "application/json",
|
| 414 |
"HTTP-Referer": "https://github.com/daniel-wojahn/tibetan-text-metrics",
|
| 415 |
"X-Title": "Tibetan Text Metrics"
|
| 416 |
}
|
| 417 |
-
|
| 418 |
messages = []
|
| 419 |
if system_message:
|
| 420 |
messages.append({"role": "system", "content": system_message})
|
| 421 |
messages.append({"role": "user", "content": prompt})
|
| 422 |
-
|
| 423 |
data = {
|
| 424 |
"model": model, # Use the model parameter here
|
| 425 |
"messages": messages,
|
|
@@ -427,11 +427,11 @@ Your analysis will be performed using the `{model_name}` model. Provide a concis
|
|
| 427 |
"temperature": temperature or self.temperature,
|
| 428 |
"top_p": top_p or self.top_p,
|
| 429 |
}
|
| 430 |
-
|
| 431 |
try:
|
| 432 |
logger.info(f"Calling OpenRouter API with model: {model}")
|
| 433 |
response = requests.post(url, headers=headers, json=data, timeout=60)
|
| 434 |
-
|
| 435 |
# Handle different HTTP status codes
|
| 436 |
if response.status_code == 200:
|
| 437 |
result = response.json()
|
|
@@ -441,53 +441,53 @@ Your analysis will be performed using the `{model_name}` model. Provide a concis
|
|
| 441 |
error_msg = "Unexpected response format from OpenRouter API"
|
| 442 |
logger.error(f"{error_msg}: {result}")
|
| 443 |
raise ValueError(error_msg)
|
| 444 |
-
|
| 445 |
elif response.status_code == 401:
|
| 446 |
error_msg = "Invalid OpenRouter API key. Please check your API key and try again."
|
| 447 |
logger.error(error_msg)
|
| 448 |
raise ValueError(error_msg)
|
| 449 |
-
|
| 450 |
elif response.status_code == 402:
|
| 451 |
error_msg = "OpenRouter API payment required. Please check your OpenRouter account balance or billing status."
|
| 452 |
logger.error(error_msg)
|
| 453 |
raise ValueError(error_msg)
|
| 454 |
-
|
| 455 |
elif response.status_code == 429:
|
| 456 |
error_msg = "API rate limit exceeded. Please try again later or check your OpenRouter rate limits."
|
| 457 |
logger.error(error_msg)
|
| 458 |
raise ValueError(error_msg)
|
| 459 |
-
|
| 460 |
else:
|
| 461 |
error_msg = f"OpenRouter API error: {response.status_code} - {response.text}"
|
| 462 |
logger.error(error_msg)
|
| 463 |
raise Exception(error_msg)
|
| 464 |
-
|
| 465 |
except requests.exceptions.RequestException as e:
|
| 466 |
error_msg = f"Failed to connect to OpenRouter API: {str(e)}"
|
| 467 |
logger.error(error_msg)
|
| 468 |
raise Exception(error_msg) from e
|
| 469 |
-
|
| 470 |
except json.JSONDecodeError as e:
|
| 471 |
error_msg = f"Failed to parse OpenRouter API response: {str(e)}"
|
| 472 |
logger.error(error_msg)
|
| 473 |
raise Exception(error_msg) from e
|
| 474 |
-
|
| 475 |
def _format_llm_response(self, response: str, df: pd.DataFrame, model_name: str) -> str:
|
| 476 |
"""
|
| 477 |
Format the LLM response for display.
|
| 478 |
-
|
| 479 |
Args:
|
| 480 |
response: Raw LLM response
|
| 481 |
df: Original DataFrame for reference
|
| 482 |
model_name: Name of the model used
|
| 483 |
-
|
| 484 |
Returns:
|
| 485 |
str: Formatted response with fallback if needed
|
| 486 |
"""
|
| 487 |
# Basic validation
|
| 488 |
if not response or len(response) < 100:
|
| 489 |
raise ValueError("Response too short or empty")
|
| 490 |
-
|
| 491 |
# Check for garbled output (random numbers, nonsensical patterns)
|
| 492 |
# This is a simple heuristic - look for long sequences of numbers or strange patterns
|
| 493 |
suspicious_patterns = [
|
|
@@ -495,24 +495,24 @@ Your analysis will be performed using the `{model_name}` model. Provide a concis
|
|
| 495 |
r'[0-9,.]{20,}', # Long sequences of digits, commas and periods
|
| 496 |
r'[\W]{20,}', # Long sequences of non-word characters
|
| 497 |
]
|
| 498 |
-
|
| 499 |
for pattern in suspicious_patterns:
|
| 500 |
if re.search(pattern, response):
|
| 501 |
logger.warning(f"Detected potentially garbled output matching pattern: {pattern}")
|
| 502 |
# Don't immediately raise - we'll do a more comprehensive check
|
| 503 |
-
|
| 504 |
# Check for content quality - ensure it has expected sections
|
| 505 |
expected_content = [
|
| 506 |
"introduction", "analysis", "similarity", "patterns", "conclusion", "question"
|
| 507 |
]
|
| 508 |
-
|
| 509 |
# Count how many expected content markers we find
|
| 510 |
content_matches = sum(1 for term in expected_content if term.lower() in response.lower())
|
| 511 |
-
|
| 512 |
# If we find fewer than 3 expected content markers, log a warning
|
| 513 |
if content_matches < 3:
|
| 514 |
logger.warning(f"LLM response missing expected content sections (found {content_matches}/6)")
|
| 515 |
-
|
| 516 |
# Check for text names from the dataset
|
| 517 |
# Extract text names from the Text Pair column
|
| 518 |
text_names = set()
|
|
@@ -521,22 +521,22 @@ Your analysis will be performed using the `{model_name}` model. Provide a concis
|
|
| 521 |
if isinstance(pair, str) and " vs " in pair:
|
| 522 |
texts = pair.split(" vs ")
|
| 523 |
text_names.update(texts)
|
| 524 |
-
|
| 525 |
# Check if at least some text names appear in the response
|
| 526 |
text_name_matches = sum(1 for name in text_names if name in response)
|
| 527 |
if text_names and text_name_matches == 0:
|
| 528 |
logger.warning("LLM response does not mention any of the text names from the dataset. The analysis may be generic.")
|
| 529 |
-
|
| 530 |
# Ensure basic markdown structure
|
| 531 |
if '##' not in response:
|
| 532 |
response = f"## Analysis of Tibetan Text Similarity\n\n{response}"
|
| 533 |
-
|
| 534 |
# Add styling to make the output more readable
|
| 535 |
response = f"<div class='llm-analysis'>\n{response}\n</div>"
|
| 536 |
-
|
| 537 |
# Format the response into a markdown block
|
| 538 |
formatted_response = f"""## AI-Powered Analysis (Model: {model_name})\n\n{response}"""
|
| 539 |
-
|
| 540 |
return formatted_response
|
| 541 |
-
|
| 542 |
|
|
|
|
| 39 |
"""
|
| 40 |
Service for analyzing text similarity metrics using LLMs and rule-based methods.
|
| 41 |
"""
|
| 42 |
+
|
| 43 |
def __init__(self, api_key: str = None):
|
| 44 |
"""
|
| 45 |
Initialize the LLM service.
|
| 46 |
+
|
| 47 |
Args:
|
| 48 |
api_key: Optional API key for OpenRouter. If not provided, will try to load from environment.
|
| 49 |
"""
|
|
|
|
| 51 |
self.models = PREFERRED_MODELS
|
| 52 |
self.temperature = DEFAULT_TEMPERATURE
|
| 53 |
self.top_p = DEFAULT_TOP_P
|
| 54 |
+
|
| 55 |
def analyze_similarity(
|
| 56 |
+
self,
|
| 57 |
+
results_df: pd.DataFrame,
|
| 58 |
use_llm: bool = True,
|
| 59 |
) -> str:
|
| 60 |
"""
|
| 61 |
Analyze similarity metrics using either LLM or rule-based approach.
|
| 62 |
+
|
| 63 |
Args:
|
| 64 |
results_df: DataFrame containing similarity metrics
|
| 65 |
use_llm: Whether to use LLM for analysis (falls back to rule-based if False or on error)
|
| 66 |
+
|
| 67 |
Returns:
|
| 68 |
str: Analysis of the metrics in markdown format with appropriate fallback messages
|
| 69 |
"""
|
|
|
|
| 71 |
if not use_llm:
|
| 72 |
logger.info("LLM analysis disabled. Using rule-based analysis.")
|
| 73 |
return self._analyze_with_rules(results_df)
|
| 74 |
+
|
| 75 |
# Try LLM analysis if enabled
|
| 76 |
try:
|
| 77 |
if not self.api_key:
|
| 78 |
raise ValueError("No OpenRouter API key provided. Please set the OPENROUTER_API_KEY environment variable.")
|
| 79 |
+
|
| 80 |
logger.info("Attempting LLM-based analysis...")
|
| 81 |
return self._analyze_with_llm(results_df, max_tokens=DEFAULT_MAX_TOKENS)
|
| 82 |
+
|
| 83 |
except Exception as e:
|
| 84 |
error_msg = str(e)
|
| 85 |
logger.error(f"Error in LLM analysis: {error_msg}")
|
| 86 |
+
|
| 87 |
# Create a user-friendly error message
|
| 88 |
if "no openrouter api key" in error_msg.lower():
|
| 89 |
error_note = "OpenRouter API key not found. Please set the `OPENROUTER_API_KEY` environment variable to use this feature."
|
|
|
|
| 95 |
error_note = "API rate limit exceeded. Falling back to rule-based analysis."
|
| 96 |
else:
|
| 97 |
error_note = f"LLM analysis failed: {error_msg[:200]}. Falling back to rule-based analysis."
|
| 98 |
+
|
| 99 |
# Get rule-based analysis
|
| 100 |
rule_based_analysis = self._analyze_with_rules(results_df)
|
| 101 |
+
|
| 102 |
# Combine the error message with the rule-based analysis
|
| 103 |
return f"## Analysis of Tibetan Text Similarity Metrics\n\n*Note: {error_note}*\n\n{rule_based_analysis}"
|
| 104 |
+
|
| 105 |
def _prepare_dataframe(self, df: pd.DataFrame) -> pd.DataFrame:
|
| 106 |
"""
|
| 107 |
Prepare the DataFrame for analysis.
|
| 108 |
+
|
| 109 |
Args:
|
| 110 |
df: Input DataFrame with similarity metrics
|
| 111 |
+
|
| 112 |
Returns:
|
| 113 |
pd.DataFrame: Cleaned and prepared DataFrame
|
| 114 |
"""
|
| 115 |
# Make a copy to avoid modifying the original
|
| 116 |
df = df.copy()
|
| 117 |
+
|
| 118 |
# Clean text columns
|
| 119 |
text_cols = ['Text A', 'Text B']
|
| 120 |
for col in text_cols:
|
| 121 |
if col in df.columns:
|
| 122 |
df[col] = df[col].fillna('Unknown').astype(str)
|
| 123 |
df[col] = df[col].str.replace(r'\.txt$', '', regex=True)
|
| 124 |
+
|
| 125 |
# Filter out perfect matches (likely empty cells)
|
| 126 |
metrics_cols = ['Jaccard Similarity (%)', 'Normalized LCS']
|
| 127 |
if all(col in df.columns for col in metrics_cols):
|
| 128 |
+
mask = ~((df['Jaccard Similarity (%)'] == 100.0) &
|
| 129 |
(df['Normalized LCS'] == 1.0))
|
| 130 |
df = df[mask].copy()
|
| 131 |
+
|
| 132 |
return df
|
| 133 |
+
|
| 134 |
def _analyze_with_llm(self, df: pd.DataFrame, max_tokens: int) -> str:
|
| 135 |
"""
|
| 136 |
Analyze metrics using an LLM via OpenRouter API, with fallback models.
|
|
|
|
| 181 |
raise last_error
|
| 182 |
else:
|
| 183 |
raise Exception("LLM analysis failed for all available models.")
|
| 184 |
+
|
| 185 |
def _analyze_with_rules(self, df: pd.DataFrame) -> str:
|
| 186 |
"""
|
| 187 |
Analyze metrics using rule-based approach.
|
| 188 |
+
|
| 189 |
Args:
|
| 190 |
df: Prepared DataFrame with metrics
|
| 191 |
+
|
| 192 |
Returns:
|
| 193 |
str: Rule-based analysis in markdown format
|
| 194 |
"""
|
| 195 |
analysis = ["## Tibetan Text Similarity Analysis (Rule-Based)"]
|
| 196 |
+
|
| 197 |
# Basic stats
|
| 198 |
text_a_col = 'Text A' if 'Text A' in df.columns else None
|
| 199 |
text_b_col = 'Text B' if 'Text B' in df.columns else None
|
| 200 |
+
|
| 201 |
if text_a_col and text_b_col:
|
| 202 |
unique_texts = set(df[text_a_col].unique()) | set(df[text_b_col].unique())
|
| 203 |
analysis.append(f"- **Texts analyzed:** {', '.join(sorted(unique_texts))}")
|
| 204 |
+
|
| 205 |
# Analyze each metric
|
| 206 |
metric_analyses = []
|
| 207 |
+
|
| 208 |
if 'Jaccard Similarity (%)' in df.columns:
|
| 209 |
jaccard_analysis = self._analyze_jaccard(df)
|
| 210 |
metric_analyses.append(jaccard_analysis)
|
| 211 |
+
|
| 212 |
if 'Normalized LCS' in df.columns:
|
| 213 |
lcs_analysis = self._analyze_lcs(df)
|
| 214 |
metric_analyses.append(lcs_analysis)
|
| 215 |
+
|
| 216 |
# TF-IDF analysis removed
|
| 217 |
+
|
| 218 |
# Add all metric analyses
|
| 219 |
if metric_analyses:
|
| 220 |
analysis.extend(metric_analyses)
|
| 221 |
+
|
| 222 |
# Add overall interpretation
|
| 223 |
analysis.append("\n## Overall Interpretation")
|
| 224 |
analysis.append(self._generate_overall_interpretation(df))
|
| 225 |
+
|
| 226 |
return "\n\n".join(analysis)
|
| 227 |
+
|
| 228 |
def _analyze_jaccard(self, df: pd.DataFrame) -> str:
|
| 229 |
"""Analyze Jaccard similarity scores."""
|
| 230 |
jaccard = df['Jaccard Similarity (%)'].dropna()
|
| 231 |
if jaccard.empty:
|
| 232 |
return ""
|
| 233 |
+
|
| 234 |
mean_jaccard = jaccard.mean()
|
| 235 |
max_jaccard = jaccard.max()
|
| 236 |
min_jaccard = jaccard.min()
|
| 237 |
+
|
| 238 |
analysis = [
|
| 239 |
"### Jaccard Similarity Analysis",
|
| 240 |
f"- **Range:** {min_jaccard:.1f}% to {max_jaccard:.1f}% (mean: {mean_jaccard:.1f}%)"
|
| 241 |
]
|
| 242 |
+
|
| 243 |
# Interpret the scores
|
| 244 |
if mean_jaccard > 60:
|
| 245 |
analysis.append("- **High vocabulary overlap** suggests texts share significant content or are from the same tradition.")
|
|
|
|
| 247 |
analysis.append("- **Moderate vocabulary overlap** indicates some shared content or themes.")
|
| 248 |
else:
|
| 249 |
analysis.append("- **Low vocabulary overlap** suggests texts are on different topics or from different traditions.")
|
| 250 |
+
|
| 251 |
# Add top pairs
|
| 252 |
top_pairs = df.nlargest(3, 'Jaccard Similarity (%)')
|
| 253 |
if not top_pairs.empty:
|
|
|
|
| 257 |
text_b = row.get('Text B', 'Text 2')
|
| 258 |
score = row['Jaccard Similarity (%)']
|
| 259 |
analysis.append(f"- {text_a} ↔ {text_b}: {score:.1f}%")
|
| 260 |
+
|
| 261 |
return "\n".join(analysis)
|
| 262 |
+
|
| 263 |
def _analyze_lcs(self, df: pd.DataFrame) -> str:
|
| 264 |
"""Analyze Longest Common Subsequence scores."""
|
| 265 |
lcs = df['Normalized LCS'].dropna()
|
| 266 |
if lcs.empty:
|
| 267 |
return ""
|
| 268 |
+
|
| 269 |
mean_lcs = lcs.mean()
|
| 270 |
max_lcs = lcs.max()
|
| 271 |
min_lcs = lcs.min()
|
| 272 |
+
|
| 273 |
analysis = [
|
| 274 |
"### Structural Similarity (LCS) Analysis",
|
| 275 |
f"- **Range:** {min_lcs:.2f} to {max_lcs:.2f} (mean: {mean_lcs:.2f})"
|
| 276 |
]
|
| 277 |
+
|
| 278 |
# Interpret the scores
|
| 279 |
if mean_lcs > 0.7:
|
| 280 |
analysis.append("- **High structural similarity** suggests texts follow similar organizational patterns.")
|
|
|
|
| 282 |
analysis.append("- **Moderate structural similarity** indicates some shared organizational elements.")
|
| 283 |
else:
|
| 284 |
analysis.append("- **Low structural similarity** suggests different organizational approaches.")
|
| 285 |
+
|
| 286 |
# Add top pairs
|
| 287 |
top_pairs = df.nlargest(3, 'Normalized LCS')
|
| 288 |
if not top_pairs.empty:
|
|
|
|
| 292 |
text_b = row.get('Text B', 'Text 2')
|
| 293 |
score = row['Normalized LCS']
|
| 294 |
analysis.append(f"- {text_a} ↔ {text_b}: {score:.2f}")
|
| 295 |
+
|
| 296 |
return "\n".join(analysis)
|
| 297 |
+
|
| 298 |
# TF-IDF analysis method removed
|
| 299 |
+
|
| 300 |
def _generate_overall_interpretation(self, df: pd.DataFrame) -> str:
|
| 301 |
"""Generate an overall interpretation of the metrics."""
|
| 302 |
interpretations = []
|
| 303 |
+
|
| 304 |
# Get metrics if they exist
|
| 305 |
has_jaccard = 'Jaccard Similarity (%)' in df.columns
|
| 306 |
has_lcs = 'Normalized LCS' in df.columns
|
| 307 |
+
|
| 308 |
# Calculate means for available metrics
|
| 309 |
metrics = {}
|
| 310 |
if has_jaccard:
|
|
|
|
| 312 |
if has_lcs:
|
| 313 |
metrics['lcs'] = df['Normalized LCS'].mean()
|
| 314 |
# TF-IDF metrics removed
|
| 315 |
+
|
| 316 |
# Generate interpretation based on metrics
|
| 317 |
if metrics:
|
| 318 |
interpretations.append("Based on the analysis of similarity metrics:")
|
| 319 |
+
|
| 320 |
if has_jaccard and metrics['jaccard'] > 60:
|
| 321 |
interpretations.append("- The high Jaccard similarity indicates significant vocabulary overlap between texts, "
|
| 322 |
"suggesting they may share common sources or be part of the same textual tradition.")
|
| 323 |
+
|
| 324 |
if has_lcs and metrics['lcs'] > 0.7:
|
| 325 |
interpretations.append("- The high LCS score indicates strong structural similarity, "
|
| 326 |
"suggesting the texts may follow similar organizational patterns or share common structural elements.")
|
| 327 |
+
|
| 328 |
# TF-IDF interpretation removed
|
| 329 |
+
|
| 330 |
# Add cross-metric interpretations
|
| 331 |
if has_jaccard and has_lcs and metrics['jaccard'] > 60 and metrics['lcs'] > 0.7:
|
| 332 |
interpretations.append("\nThe combination of high Jaccard and LCS similarities strongly suggests "
|
| 333 |
"that these texts are closely related, possibly being different versions or "
|
| 334 |
"transmissions of the same work or sharing a common source.")
|
| 335 |
+
|
| 336 |
# TF-IDF cross-metric interpretation removed
|
| 337 |
+
|
| 338 |
# Add general guidance if no specific patterns found
|
| 339 |
if not interpretations:
|
| 340 |
interpretations.append("The analysis did not reveal strong patterns in the similarity metrics. "
|
| 341 |
"This could indicate that the texts are either very similar or very different "
|
| 342 |
"across all measured dimensions.")
|
| 343 |
+
|
| 344 |
return "\n\n".join(interpretations)
|
| 345 |
+
|
| 346 |
def _create_llm_prompt(self, df: pd.DataFrame, model_name: str) -> str:
|
| 347 |
"""
|
| 348 |
Create a prompt for the LLM based on the DataFrame.
|
| 349 |
+
|
| 350 |
Args:
|
| 351 |
df: Prepared DataFrame with metrics
|
| 352 |
model_name: Name of the model being used
|
| 353 |
+
|
| 354 |
Returns:
|
| 355 |
str: Formatted prompt for the LLM
|
| 356 |
"""
|
| 357 |
# Convert DataFrame to markdown for the prompt
|
| 358 |
md_table = df.to_markdown(index=False)
|
| 359 |
+
|
| 360 |
# Create the prompt
|
| 361 |
prompt = f"""
|
| 362 |
# Tibetan Text Similarity Analysis
|
|
|
|
| 372 |
|
| 373 |
Your analysis will be performed using the `{model_name}` model. Provide a concise, scholarly analysis in well-structured markdown.
|
| 374 |
"""
|
|
|
|
| 375 |
|
| 376 |
+
|
| 377 |
+
|
| 378 |
return prompt
|
| 379 |
+
|
| 380 |
def _get_system_prompt(self) -> str:
|
| 381 |
"""Get the system prompt for the LLM."""
|
| 382 |
return """You are a senior scholar of Tibetan Buddhist texts, specializing in textual criticism. Your task is to analyze the provided similarity metrics and provide expert insights into the relationships between these texts. Ground your analysis in the data, be precise, and focus on what the metrics reveal about the texts' transmission and history."""
|
| 383 |
+
|
| 384 |
def _call_openrouter_api(self, model: str, prompt: str, system_message: str = None, max_tokens: int = None, temperature: float = None, top_p: float = None) -> str:
|
| 385 |
"""
|
| 386 |
Call the OpenRouter API.
|
| 387 |
+
|
| 388 |
Args:
|
| 389 |
model: Model to use for the API call
|
| 390 |
prompt: The user prompt
|
|
|
|
| 392 |
max_tokens: Maximum tokens for the response
|
| 393 |
temperature: Sampling temperature
|
| 394 |
top_p: Nucleus sampling parameter
|
| 395 |
+
|
| 396 |
Returns:
|
| 397 |
str: The API response
|
| 398 |
+
|
| 399 |
Raises:
|
| 400 |
ValueError: If API key is missing or invalid
|
| 401 |
requests.exceptions.RequestException: For network-related errors
|
|
|
|
| 405 |
error_msg = "OpenRouter API key not provided. Please set the OPENROUTER_API_KEY environment variable."
|
| 406 |
logger.error(error_msg)
|
| 407 |
raise ValueError(error_msg)
|
| 408 |
+
|
| 409 |
url = "https://openrouter.ai/api/v1/chat/completions"
|
| 410 |
+
|
| 411 |
headers = {
|
| 412 |
"Authorization": f"Bearer {self.api_key}",
|
| 413 |
"Content-Type": "application/json",
|
| 414 |
"HTTP-Referer": "https://github.com/daniel-wojahn/tibetan-text-metrics",
|
| 415 |
"X-Title": "Tibetan Text Metrics"
|
| 416 |
}
|
| 417 |
+
|
| 418 |
messages = []
|
| 419 |
if system_message:
|
| 420 |
messages.append({"role": "system", "content": system_message})
|
| 421 |
messages.append({"role": "user", "content": prompt})
|
| 422 |
+
|
| 423 |
data = {
|
| 424 |
"model": model, # Use the model parameter here
|
| 425 |
"messages": messages,
|
|
|
|
| 427 |
"temperature": temperature if temperature is not None else self.temperature,
|
| 428 |
"top_p": top_p if top_p is not None else self.top_p,
|
| 429 |
}
|
| 430 |
+
|
| 431 |
try:
|
| 432 |
logger.info(f"Calling OpenRouter API with model: {model}")
|
| 433 |
response = requests.post(url, headers=headers, json=data, timeout=60)
|
| 434 |
+
|
| 435 |
# Handle different HTTP status codes
|
| 436 |
if response.status_code == 200:
|
| 437 |
result = response.json()
|
|
|
|
| 441 |
error_msg = "Unexpected response format from OpenRouter API"
|
| 442 |
logger.error(f"{error_msg}: {result}")
|
| 443 |
raise ValueError(error_msg)
|
| 444 |
+
|
| 445 |
elif response.status_code == 401:
|
| 446 |
error_msg = "Invalid OpenRouter API key. Please check your API key and try again."
|
| 447 |
logger.error(error_msg)
|
| 448 |
raise ValueError(error_msg)
|
| 449 |
+
|
| 450 |
elif response.status_code == 402:
|
| 451 |
error_msg = "OpenRouter API payment required. Please check your OpenRouter account balance or billing status."
|
| 452 |
logger.error(error_msg)
|
| 453 |
raise ValueError(error_msg)
|
| 454 |
+
|
| 455 |
elif response.status_code == 429:
|
| 456 |
error_msg = "API rate limit exceeded. Please try again later or check your OpenRouter rate limits."
|
| 457 |
logger.error(error_msg)
|
| 458 |
raise ValueError(error_msg)
|
| 459 |
+
|
| 460 |
else:
|
| 461 |
error_msg = f"OpenRouter API error: {response.status_code} - {response.text}"
|
| 462 |
logger.error(error_msg)
|
| 463 |
raise Exception(error_msg)
|
| 464 |
+
|
| 465 |
except requests.exceptions.RequestException as e:
|
| 466 |
error_msg = f"Failed to connect to OpenRouter API: {str(e)}"
|
| 467 |
logger.error(error_msg)
|
| 468 |
raise Exception(error_msg) from e
|
| 469 |
+
|
| 470 |
except json.JSONDecodeError as e:
|
| 471 |
error_msg = f"Failed to parse OpenRouter API response: {str(e)}"
|
| 472 |
logger.error(error_msg)
|
| 473 |
raise Exception(error_msg) from e
|
| 474 |
+
|
| 475 |
def _format_llm_response(self, response: str, df: pd.DataFrame, model_name: str) -> str:
|
| 476 |
"""
|
| 477 |
Format the LLM response for display.
|
| 478 |
+
|
| 479 |
Args:
|
| 480 |
response: Raw LLM response
|
| 481 |
df: Original DataFrame for reference
|
| 482 |
model_name: Name of the model used
|
| 483 |
+
|
| 484 |
Returns:
|
| 485 |
str: Formatted response with fallback if needed
|
| 486 |
"""
|
| 487 |
# Basic validation
|
| 488 |
if not response or len(response) < 100:
|
| 489 |
raise ValueError("Response too short or empty")
|
| 490 |
+
|
| 491 |
# Check for garbled output (random numbers, nonsensical patterns)
|
| 492 |
# This is a simple heuristic - look for long sequences of numbers or strange patterns
|
| 493 |
suspicious_patterns = [
|
|
|
|
| 495 |
r'[0-9,.]{20,}', # Long sequences of digits, commas and periods
|
| 496 |
r'[\W]{20,}', # Long sequences of non-word characters
|
| 497 |
]
|
| 498 |
+
|
| 499 |
for pattern in suspicious_patterns:
|
| 500 |
if re.search(pattern, response):
|
| 501 |
logger.warning(f"Detected potentially garbled output matching pattern: {pattern}")
|
| 502 |
# Don't immediately raise - we'll do a more comprehensive check
|
| 503 |
+
|
| 504 |
# Check for content quality - ensure it has expected sections
|
| 505 |
expected_content = [
|
| 506 |
"introduction", "analysis", "similarity", "patterns", "conclusion", "question"
|
| 507 |
]
|
| 508 |
+
|
| 509 |
# Count how many expected content markers we find
|
| 510 |
content_matches = sum(1 for term in expected_content if term.lower() in response.lower())
|
| 511 |
+
|
| 512 |
# If we find fewer than 3 expected content markers, log a warning
|
| 513 |
if content_matches < 3:
|
| 514 |
logger.warning(f"LLM response missing expected content sections (found {content_matches}/6)")
|
| 515 |
+
|
| 516 |
# Check for text names from the dataset
|
| 517 |
# Extract text names from the Text Pair column
|
| 518 |
text_names = set()
|
|
|
|
| 521 |
if isinstance(pair, str) and " vs " in pair:
|
| 522 |
texts = pair.split(" vs ")
|
| 523 |
text_names.update(texts)
|
| 524 |
+
|
| 525 |
# Check if at least some text names appear in the response
|
| 526 |
text_name_matches = sum(1 for name in text_names if name in response)
|
| 527 |
if text_names and text_name_matches == 0:
|
| 528 |
logger.warning("LLM response does not mention any of the text names from the dataset. The analysis may be generic.")
|
| 529 |
+
|
| 530 |
# Ensure basic markdown structure
|
| 531 |
if '##' not in response:
|
| 532 |
response = f"## Analysis of Tibetan Text Similarity\n\n{response}"
|
| 533 |
+
|
| 534 |
# Add styling to make the output more readable
|
| 535 |
response = f"<div class='llm-analysis'>\n{response}\n</div>"
|
| 536 |
+
|
| 537 |
# Format the response into a markdown block
|
| 538 |
formatted_response = f"""## AI-Powered Analysis (Model: {model_name})\n\n{response}"""
|
| 539 |
+
|
| 540 |
return formatted_response
|
| 541 |
+
|
| 542 |
|
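The validation heuristics in `_format_llm_response` above can be sketched in isolation as follows. This is a minimal illustration, not the project's code: the helper names `find_suspicious_patterns` and `count_expected_terms` are hypothetical, and only the two regular expressions visible in the diff are included (the original list has at least one entry hidden by the hunk boundary).

```python
import re
from typing import List

# Only the two patterns visible in the diff; the original list may contain more.
SUSPICIOUS_PATTERNS = [
    r"[0-9,.]{20,}",  # long runs of digits, commas and periods
    r"[\W]{20,}",     # long runs of non-word characters
]

# Keywords the diff treats as markers of a well-structured analysis.
EXPECTED_TERMS = [
    "introduction", "analysis", "similarity", "patterns", "conclusion", "question"
]


def find_suspicious_patterns(response: str) -> List[str]:
    """Return every pattern that matches somewhere in the response (hypothetical helper)."""
    return [p for p in SUSPICIOUS_PATTERNS if re.search(p, response)]


def count_expected_terms(response: str) -> int:
    """Count how many expected keywords occur, ignoring case (hypothetical helper)."""
    lowered = response.lower()
    return sum(1 for term in EXPECTED_TERMS if term in lowered)
```

As in the diff, both checks are advisory: a match or a low keyword count would only be logged as a warning, and the response would still be formatted and returned.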
pipeline/metrics.py
CHANGED
|
@@ -4,14 +4,19 @@ from typing import List, Dict, Union
|
|
| 4 |
from itertools import combinations
|
| 5 |
|
| 6 |
from sklearn.metrics.pairwise import cosine_similarity
|
| 7 |
-
from thefuzz import fuzz
|
| 8 |
from .hf_embedding import generate_embeddings as generate_hf_embeddings
|
| 9 |
from .stopwords_bo import TIBETAN_STOPWORDS_SET
|
| 10 |
from .stopwords_lite_bo import TIBETAN_STOPWORDS_LITE_SET
|
|
|
|
| 11 |
|
| 12 |
import logging
|
| 13 |
|
| 14 |
|
| 15 |
# Attempt to import the Cython-compiled fast_lcs module
|
| 16 |
try:
|
| 17 |
from .fast_lcs import compute_lcs_fast
|
|
@@ -25,19 +30,37 @@ logger = logging.getLogger(__name__)
|
|
| 25 |
|
| 26 |
|
| 27 |
|
| 28 |
-
def compute_normalized_lcs(words1: List[str], words2: List[str]) -> float:
|
| 29 |
-
|
| 30 |
-
|
| 31 |
m, n = len(words1), len(words2)
|
| 32 |
|
| 33 |
if USE_CYTHON_LCS:
|
| 34 |
-
# Use the Cython-compiled version if available
|
| 35 |
lcs_length = compute_lcs_fast(words1, words2)
|
| 36 |
else:
|
| 37 |
-
#
|
| 38 |
-
# m, n = len(words1), len(words2) # Moved to the beginning of the function
|
| 39 |
-
# Using numpy array for dp table can be slightly faster than list of lists for large inputs
|
| 40 |
-
# but the primary bottleneck is the Python loop itself compared to Cython.
|
| 41 |
dp = np.zeros((m + 1, n + 1), dtype=np.int32)
|
| 42 |
|
| 43 |
for i in range(1, m + 1):
|
|
@@ -47,63 +70,192 @@ def compute_normalized_lcs(words1: List[str], words2: List[str]) -> float:
|
|
| 47 |
else:
|
| 48 |
dp[i, j] = max(dp[i - 1, j], dp[i, j - 1])
|
| 49 |
lcs_length = int(dp[m, n])
|
| 50 |
-
avg_length = (m + n) / 2
|
| 51 |
-
return lcs_length / avg_length if avg_length > 0 else 0.0
|
| 52 |
|
| 53 |
|
| 54 |
-
|
| 55 |
"""
|
| 56 |
-
Computes
|
| 57 |
-
| 58 |
Args:
|
| 59 |
words1: First list of tokens
|
| 60 |
words2: Second list of tokens
|
| 61 |
method: The fuzzy matching method to use:
|
| 62 |
-
'
|
| 63 |
-
'
|
| 64 |
-
'
|
| 65 |
-
|
| 66 |
-
|
| 67 |
Returns:
|
| 68 |
float: Fuzzy similarity score between 0.0 and 1.0
|
| 69 |
"""
|
| 70 |
if not words1 or not words2:
|
| 71 |
return 0.0
|
| 72 |
-
|
| 73 |
-
|
| 74 |
-
|
| 75 |
-
|
| 76 |
-
|
| 77 |
-
|
| 78 |
-
|
| 79 |
-
|
| 80 |
-
|
| 81 |
-
|
| 82 |
-
|
| 83 |
-
|
| 84 |
-
|
| 85 |
-
# Best for finding shorter strings within longer ones
|
| 86 |
-
score = fuzz.partial_ratio(text1, text2)
|
| 87 |
-
else: # 'ratio'
|
| 88 |
-
# Simple Levenshtein distance ratio
|
| 89 |
-
score = fuzz.ratio(text1, text2)
|
| 90 |
-
|
| 91 |
-
# Convert score from 0-100 scale to 0-1 scale
|
| 92 |
-
return score / 100.0
|
| 93 |
|
| 94 |
|
| 95 |
|
| 96 |
def compute_semantic_similarity(
|
| 97 |
text1_segment: str,
|
| 98 |
text2_segment: str,
|
| 99 |
-
tokens1: List[str],
|
| 100 |
-
tokens2: List[str],
|
| 101 |
model,
|
| 102 |
batch_size: int = 32,
|
| 103 |
show_progress_bar: bool = False
|
| 104 |
) -> float:
|
| 105 |
-
"""
|
|
|
|
| 106 |
|
| 107 |
if model is None:
|
| 108 |
logger.warning(
|
| 109 |
"Embedding model not available for semantic similarity. Skipping calculation."
|
|
@@ -116,38 +268,27 @@ def compute_semantic_similarity(
|
|
| 116 |
)
|
| 117 |
return 0.0
|
| 118 |
|
| 119 |
-
def
|
| 120 |
-
|
| 121 |
-
|
| 122 |
-
|
| 123 |
-
batch_size_param: int,
|
| 124 |
-
show_progress_bar_param: bool
|
| 125 |
-
) -> Union[np.ndarray, None]:
|
| 126 |
-
"""Helper to get a single embedding for a text using Sentence Transformers."""
|
| 127 |
-
if not raw_text_segment.strip():
|
| 128 |
-
logger.info(
|
| 129 |
-
f"Text segment is empty or only whitespace: {raw_text_segment[:100]}... Returning None for embedding."
|
| 130 |
-
)
|
| 131 |
return None
|
| 132 |
-
|
| 133 |
embedding = generate_hf_embeddings(
|
| 134 |
-
texts=[
|
| 135 |
-
model=
|
| 136 |
-
batch_size=
|
| 137 |
-
show_progress_bar=
|
| 138 |
)
|
| 139 |
-
|
| 140 |
-
if embedding is None or embedding.size == 0:
|
| 141 |
-
logger.error(
|
| 142 |
-
f"Failed to generate embedding for text: {raw_text_segment[:100]}..."
|
| 143 |
-
)
|
| 144 |
return None
|
| 145 |
return embedding
|
| 146 |
|
| 147 |
try:
|
| 148 |
-
|
| 149 |
-
|
| 150 |
-
emb2 = _get_aggregated_embedding(text2_segment, tokens2, model, batch_size, show_progress_bar)
|
| 151 |
|
| 152 |
if emb1 is None or emb2 is None or emb1.size == 0 or emb2.size == 0:
|
| 153 |
logger.error(
|
|
@@ -168,7 +309,7 @@ def compute_semantic_similarity(
|
|
| 168 |
if np.all(emb1 == 0) or np.all(emb2 == 0):
|
| 169 |
logger.info("One of the embeddings is zero. Semantic similarity is 0.0.")
|
| 170 |
return 0.0
|
| 171 |
-
|
| 172 |
# Handle NaN or Inf in embeddings
|
| 173 |
if np.isnan(emb1).any() or np.isinf(emb1).any() or \
|
| 174 |
np.isnan(emb2).any() or np.isinf(emb2).any():
|
|
@@ -180,9 +321,9 @@ def compute_semantic_similarity(
|
|
| 180 |
emb1 = emb1.reshape(1, -1)
|
| 181 |
if emb2.ndim == 1:
|
| 182 |
emb2 = emb2.reshape(1, -1)
|
| 183 |
-
|
| 184 |
similarity_score = cosine_similarity(emb1, emb2)[0][0]
|
| 185 |
-
|
| 186 |
return max(0.0, float(similarity_score))
|
| 187 |
|
| 188 |
except Exception as e:
|
|
@@ -202,8 +343,10 @@ def compute_all_metrics(
|
|
| 202 |
enable_semantic: bool = True,
|
| 203 |
enable_fuzzy: bool = True,
|
| 204 |
fuzzy_method: str = 'token_set',
|
|
|
|
| 205 |
use_stopwords: bool = True,
|
| 206 |
use_lite_stopwords: bool = False,
|
|
|
|
| 207 |
batch_size: int = 32,
|
| 208 |
show_progress_bar: bool = False
|
| 209 |
) -> pd.DataFrame:
|
|
@@ -218,10 +361,13 @@ def compute_all_metrics(
|
|
| 218 |
Defaults to None.
|
| 219 |
enable_semantic (bool): Whether to compute semantic similarity. Defaults to True.
|
| 220 |
enable_fuzzy (bool): Whether to compute fuzzy string similarity. Defaults to True.
|
| 221 |
-
fuzzy_method (str): The fuzzy matching method to use ('
|
| 222 |
Defaults to 'token_set'.
|
|
|
|
| 223 |
use_stopwords (bool): Whether to filter stopwords for Jaccard similarity. Defaults to True.
|
| 224 |
use_lite_stopwords (bool): Whether to use the lite version of stopwords. Defaults to False.
|
|
|
|
|
|
|
| 225 |
batch_size (int): Batch size for semantic similarity computation. Defaults to 32.
|
| 226 |
show_progress_bar (bool): Whether to show progress bar for semantic similarity. Defaults to False.
|
| 227 |
|
|
@@ -232,14 +378,7 @@ def compute_all_metrics(
|
|
| 232 |
"""
|
| 233 |
files = list(texts.keys())
|
| 234 |
results = []
|
| 235 |
-
corpus_for_sklearn_tfidf = [] # Kept for potential future use
|
| 236 |
-
|
| 237 |
-
for fname, content in texts.items():
|
| 238 |
-
# Use the pre-computed tokens from the token_lists dictionary
|
| 239 |
-
current_tokens_for_file = token_lists.get(fname, [])
|
| 240 |
-
corpus_for_sklearn_tfidf.append(" ".join(current_tokens_for_file) if current_tokens_for_file else "")
|
| 241 |
|
| 242 |
-
|
| 243 |
for i, j in combinations(range(len(files)), 2):
|
| 244 |
f1, f2 = files[i], files[j]
|
| 245 |
words1_raw, words2_raw = token_lists[f1], token_lists[f2]
|
|
@@ -254,21 +393,33 @@ def compute_all_metrics(
|
|
| 254 |
else:
|
| 255 |
# If stopwords are disabled, use an empty set
|
| 256 |
stopwords_set_to_use = set()
|
| 257 |
-
|
| 258 |
-
# Filter stopwords for Jaccard calculation
|
| 259 |
-
|
| 260 |
-
|
|
| 261 |
|
| 262 |
jaccard = (
|
| 263 |
len(set(words1_jaccard) & set(words2_jaccard)) / len(set(words1_jaccard) | set(words2_jaccard))
|
| 264 |
if set(words1_jaccard) | set(words2_jaccard) # Ensure denominator is not zero
|
| 265 |
else 0.0
|
| 266 |
)
|
| 267 |
-
# LCS uses raw tokens (words1_raw, words2_raw) to provide a complementary metric.
|
| 268 |
-
# Semantic similarity also uses raw text and its botok tokens for chunking decisions.
|
| 269 |
jaccard_percent = jaccard * 100.0
|
| 270 |
-
|
| 271 |
-
|
|
|
|
|
|
|
| 272 |
# Fuzzy Similarity Calculation
|
| 273 |
if enable_fuzzy:
|
| 274 |
fuzzy_sim = compute_fuzzy_similarity(words1_jaccard, words2_jaccard, method=fuzzy_method)
|
|
@@ -277,9 +428,8 @@ def compute_all_metrics(
|
|
| 277 |
|
| 278 |
# Semantic Similarity Calculation
|
| 279 |
if enable_semantic:
|
| 280 |
-
# Pass raw texts and their pre-computed botok tokens
|
| 281 |
semantic_sim = compute_semantic_similarity(
|
| 282 |
-
texts[f1], texts[f2],
|
| 283 |
batch_size=batch_size,
|
| 284 |
show_progress_bar=show_progress_bar
|
| 285 |
)
|
|
|
|
| 4 |
from itertools import combinations
|
| 5 |
|
| 6 |
from sklearn.metrics.pairwise import cosine_similarity
|
|
|
|
| 7 |
from .hf_embedding import generate_embeddings as generate_hf_embeddings
|
| 8 |
from .stopwords_bo import TIBETAN_STOPWORDS_SET
|
| 9 |
from .stopwords_lite_bo import TIBETAN_STOPWORDS_LITE_SET
|
| 10 |
+
from .normalize_bo import normalize_particles
|
| 11 |
|
| 12 |
import logging
|
| 13 |
|
| 14 |
|
| 15 |
+
def _normalize_token_for_stopwords(token: str) -> str:
|
| 16 |
+
"""Normalize token by removing trailing tsek for stopword matching."""
|
| 17 |
+
return token.rstrip('་')
|
| 18 |
+
|
| 19 |
+
|
| 20 |
# Attempt to import the Cython-compiled fast_lcs module
|
| 21 |
try:
|
| 22 |
from .fast_lcs import compute_lcs_fast
|
|
|
|
| 30 |
|
| 31 |
|
| 32 |
|
| 33 |
+
def compute_normalized_lcs(words1: List[str], words2: List[str], normalization: str = "avg") -> float:
|
| 34 |
+
"""
|
| 35 |
+
Computes the Longest Common Subsequence (LCS) similarity between two token lists.
|
| 36 |
+
|
| 37 |
+
Args:
|
| 38 |
+
words1: First list of tokens
|
| 39 |
+
words2: Second list of tokens
|
| 40 |
+
normalization: How to normalize the LCS length. Options:
|
| 41 |
+
'avg' - Divide by average length (default, balanced)
|
| 42 |
+
'min' - Divide by shorter text (detects if one text contains the other)
|
| 43 |
+
'max' - Divide by longer text (stricter, penalizes length differences)
|
| 44 |
+
|
| 45 |
+
Returns:
|
| 46 |
+
float: Normalized LCS score between 0.0 and 1.0
|
| 47 |
+
|
| 48 |
+
Note on normalization choice:
|
| 49 |
+
- 'avg': Good general-purpose choice, treats both texts equally
|
| 50 |
+
- 'min': Use when looking for containment (e.g., quotes within commentary)
|
| 51 |
+
Can return 1.0 if shorter text is fully contained in longer
|
| 52 |
+
- 'max': Use when you want to penalize length differences
|
| 53 |
+
Will be lower when texts have very different lengths
|
| 54 |
+
"""
|
| 55 |
m, n = len(words1), len(words2)
|
| 56 |
|
| 57 |
+
if m == 0 or n == 0:
|
| 58 |
+
return 0.0
|
| 59 |
+
|
| 60 |
if USE_CYTHON_LCS:
|
|
|
|
| 61 |
lcs_length = compute_lcs_fast(words1, words2)
|
| 62 |
else:
|
| 63 |
+
# Pure Python implementation using dynamic programming
|
| 64 |
dp = np.zeros((m + 1, n + 1), dtype=np.int32)
|
| 65 |
|
| 66 |
for i in range(1, m + 1):
|
|
|
|
| 70 |
else:
|
| 71 |
dp[i, j] = max(dp[i - 1, j], dp[i, j - 1])
|
| 72 |
lcs_length = int(dp[m, n])
|
|
|
|
|
|
|
| 73 |
|
| 74 |
+
# Apply selected normalization
|
| 75 |
+
if normalization == "min":
|
| 76 |
+
divisor = min(m, n)
|
| 77 |
+
elif normalization == "max":
|
| 78 |
+
divisor = max(m, n)
|
| 79 |
+
else: # "avg" (default)
|
| 80 |
+
divisor = (m + n) / 2
|
| 81 |
+
|
| 82 |
+
return lcs_length / divisor if divisor > 0 else 0.0
|
| 83 |
+
|
| 84 |
+
|
| 85 |
+
def compute_ngram_similarity(tokens1: List[str], tokens2: List[str], n: int = 2) -> float:
|
| 86 |
+
"""
|
| 87 |
+
Computes syllable/token n-gram overlap similarity (Jaccard on n-grams).
|
| 88 |
+
|
| 89 |
+
This is more effective for Tibetan than character-level fuzzy matching because
|
| 90 |
+
it preserves syllable boundaries and captures local word patterns.
|
| 91 |
+
|
| 92 |
+
Args:
|
| 93 |
+
tokens1: First list of tokens (syllables or words)
|
| 94 |
+
tokens2: Second list of tokens (syllables or words)
|
| 95 |
+
n: Size of n-grams (default: 2 for bigrams)
|
| 96 |
+
|
| 97 |
+
Returns:
|
| 98 |
+
float: N-gram similarity score between 0.0 and 1.0
|
| 99 |
+
"""
|
| 100 |
+
if not tokens1 or not tokens2:
|
| 101 |
+
return 0.0
|
| 102 |
+
|
| 103 |
+
# Handle edge case where text is shorter than n
|
| 104 |
+
if len(tokens1) < n or len(tokens2) < n:
|
| 105 |
+
# Fall back to unigram comparison
|
| 106 |
+
set1, set2 = set(tokens1), set(tokens2)
|
| 107 |
+
if not set1 or not set2:
|
| 108 |
+
return 0.0
|
| 109 |
+
intersection = len(set1 & set2)
|
| 110 |
+
union = len(set1 | set2)
|
| 111 |
+
return intersection / union if union > 0 else 0.0
|
| 112 |
+
|
| 113 |
+
def get_ngrams(tokens: List[str], size: int) -> set:
|
| 114 |
+
return set(tuple(tokens[i:i+size]) for i in range(len(tokens) - size + 1))
|
| 115 |
+
|
| 116 |
+
ngrams1 = get_ngrams(tokens1, n)
|
| 117 |
+
ngrams2 = get_ngrams(tokens2, n)
|
| 118 |
|
| 119 |
+
intersection = len(ngrams1 & ngrams2)
|
| 120 |
+
union = len(ngrams1 | ngrams2)
|
| 121 |
+
return intersection / union if union > 0 else 0.0
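A small example may help show why bigram overlap behaves differently from plain vocabulary overlap. This is a sketch under assumptions: `ngrams` and `ngram_jaccard` are illustrative names, the romanized syllables are placeholders, and the unigram fallback for very short inputs is omitted.

```python
from typing import List, Set, Tuple


def ngrams(tokens: List[str], size: int) -> Set[Tuple[str, ...]]:
    """Set of contiguous token n-grams of the given size."""
    return {tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)}


def ngram_jaccard(t1: List[str], t2: List[str], n: int = 2) -> float:
    """Jaccard similarity computed over n-gram sets instead of single tokens."""
    g1, g2 = ngrams(t1, n), ngrams(t2, n)
    union = g1 | g2
    return len(g1 & g2) / len(union) if union else 0.0


a = ["ka", "kha", "ga", "nga"]
b = ["ka", "kha", "ga", "ca"]      # last syllable differs
reversed_a = list(reversed(a))     # same vocabulary, opposite order

print(ngram_jaccard(a, b))           # 0.5: two of four distinct bigrams are shared
print(ngram_jaccard(a, reversed_a))  # 0.0: no bigram survives the reordering
```

Plain Jaccard on the token sets of `a` and `reversed_a` would be 1.0, so the bigram variant is the one that is sensitive to local word order.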
|
| 122 |
+
|
| 123 |
+
|
| 124 |
+
def compute_syllable_edit_similarity(syls1: List[str], syls2: List[str]) -> float:
|
| 125 |
"""
|
| 126 |
+
Computes edit distance at the syllable/token level rather than character level.
|
| 127 |
+
|
| 128 |
+
This is more appropriate for Tibetan because:
|
| 129 |
+
- Tibetan syllables are meaningful units (unlike individual characters)
|
| 130 |
+
- Character-level Levenshtein over-penalizes syllable differences
|
| 131 |
+
- Syllable-level comparison better captures textual variation patterns
|
| 132 |
+
|
| 133 |
+
Args:
|
| 134 |
+
syls1: First list of syllables/tokens
|
| 135 |
+
syls2: Second list of syllables/tokens
|
| 136 |
+
|
| 137 |
+
Returns:
|
| 138 |
+
float: Syllable-level similarity score between 0.0 and 1.0
|
| 139 |
+
"""
|
| 140 |
+
if not syls1 and not syls2:
|
| 141 |
+
return 1.0
|
| 142 |
+
if not syls1 or not syls2:
|
| 143 |
+
return 0.0
|
| 144 |
+
|
| 145 |
+
m, n = len(syls1), len(syls2)
|
| 146 |
+
|
| 147 |
+
# Create DP table for syllable-level edit distance
|
| 148 |
+
dp = np.zeros((m + 1, n + 1), dtype=np.int32)
|
| 149 |
+
|
| 150 |
+
# Initialize base cases
|
| 151 |
+
for i in range(m + 1):
|
| 152 |
+
dp[i, 0] = i
|
| 153 |
+
for j in range(n + 1):
|
| 154 |
+
dp[0, j] = j
|
| 155 |
+
|
| 156 |
+
# Fill DP table
|
| 157 |
+
for i in range(1, m + 1):
|
| 158 |
+
for j in range(1, n + 1):
|
| 159 |
+
if syls1[i - 1] == syls2[j - 1]:
|
| 160 |
+
dp[i, j] = dp[i - 1, j - 1]
|
| 161 |
+
else:
|
| 162 |
+
dp[i, j] = 1 + min(
|
| 163 |
+
dp[i - 1, j], # deletion
|
| 164 |
+
dp[i, j - 1], # insertion
|
| 165 |
+
dp[i - 1, j - 1] # substitution
|
| 166 |
+
)
|
| 167 |
+
|
| 168 |
+
edit_distance = dp[m, n]
|
| 169 |
+
max_len = max(m, n)
|
| 170 |
+
return 1.0 - (edit_distance / max_len) if max_len > 0 else 1.0
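To illustrate the syllable-level edit distance, the sketch below counts one substituted syllable as a single edit, however many characters differ inside it. The function name and the romanized sample syllables are assumptions for illustration; the logic is standard Levenshtein distance over list items with a rolling row.

```python
from typing import List


def syllable_edit_similarity(s1: List[str], s2: List[str]) -> float:
    """1 - (Levenshtein distance over syllables / length of the longer list)."""
    if not s1 and not s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    prev = list(range(len(s2) + 1))  # distance from the empty prefix of s1
    for i, x in enumerate(s1, start=1):
        curr = [i]
        for j, y in enumerate(s2, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return 1.0 - prev[-1] / max(len(s1), len(s2))


witness_a = ["sangs", "rgyas", "kyi", "bka"]
witness_b = ["sangs", "rgyas", "gyi", "bka"]  # one particle variant differs
print(syllable_edit_similarity(witness_a, witness_b))  # 0.75
```

One differing syllable out of four gives a distance of 1 and a similarity of 0.75, regardless of how many characters that syllable contains.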
|
| 171 |
+
|
| 172 |
+
|
| 173 |
+
def compute_weighted_jaccard(tokens1: List[str], tokens2: List[str]) -> float:
|
| 174 |
+
"""
|
| 175 |
+
Computes weighted Jaccard similarity using token frequencies.
|
| 176 |
+
|
| 177 |
+
Unlike standard Jaccard which treats all tokens as binary (present/absent),
|
| 178 |
+
this considers how often each token appears, giving more weight to
|
| 179 |
+
frequently shared terms.
|
| 180 |
+
|
| 181 |
+
Args:
|
| 182 |
+
tokens1: First list of tokens
|
| 183 |
+
tokens2: Second list of tokens
|
| 184 |
+
|
| 185 |
+
Returns:
|
| 186 |
+
float: Weighted Jaccard similarity between 0.0 and 1.0
|
| 187 |
+
"""
|
| 188 |
+
from collections import Counter
|
| 189 |
+
|
| 190 |
+
if not tokens1 or not tokens2:
|
| 191 |
+
return 0.0
|
| 192 |
+
|
| 193 |
+
c1, c2 = Counter(tokens1), Counter(tokens2)
|
| 194 |
+
|
| 195 |
+
# Intersection: min count for each shared token
|
| 196 |
+
intersection = sum((c1 & c2).values())
|
| 197 |
+
# Union: max count for each token
|
| 198 |
+
union = sum((c1 | c2).values())
|
| 199 |
+
|
| 200 |
+
return intersection / union if union > 0 else 0.0
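The difference between set-based and frequency-weighted Jaccard is easiest to see on repeated tokens. A minimal sketch with placeholder tokens follows; `weighted_jaccard` and `set_jaccard` are illustrative names. It relies on `collections.Counter` supporting `&` (element-wise minimum) and `|` (element-wise maximum).

```python
from collections import Counter
from typing import List


def weighted_jaccard(t1: List[str], t2: List[str]) -> float:
    """Sum of per-token minimum counts divided by sum of per-token maximum counts."""
    c1, c2 = Counter(t1), Counter(t2)
    union = sum((c1 | c2).values())
    return sum((c1 & c2).values()) / union if union else 0.0


def set_jaccard(t1: List[str], t2: List[str]) -> float:
    """Standard Jaccard on token sets, ignoring how often each token occurs."""
    s1, s2 = set(t1), set(t2)
    return len(s1 & s2) / len(s1 | s2) if s1 | s2 else 0.0


a = ["x", "x", "x", "y"]
b = ["x", "y", "y"]
print(set_jaccard(a, b))       # 1.0: identical vocabularies
print(weighted_jaccard(a, b))  # 0.4: min counts (1 + 1) over max counts (3 + 2)
```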
|
| 201 |
+
|
| 202 |
+
|
| 203 |
+
def compute_fuzzy_similarity(words1: List[str], words2: List[str], method: str = 'ngram') -> float:
|
| 204 |
+
"""
|
| 205 |
+
Computes fuzzy string similarity between two lists of words.
|
| 206 |
+
|
| 207 |
+
All methods work at the syllable/token level, which is linguistically
|
| 208 |
+
appropriate for Tibetan text.
|
| 209 |
+
|
| 210 |
Args:
|
| 211 |
words1: First list of tokens
|
| 212 |
words2: Second list of tokens
|
| 213 |
method: The fuzzy matching method to use:
|
| 214 |
+
'ngram' - Syllable bigram overlap (default, recommended)
|
| 215 |
+
'syllable_edit' - Syllable-level edit distance
|
| 216 |
+
'weighted_jaccard' - Frequency-weighted Jaccard
|
| 217 |
+
|
|
|
|
| 218 |
Returns:
|
| 219 |
float: Fuzzy similarity score between 0.0 and 1.0
|
| 220 |
"""
|
| 221 |
if not words1 or not words2:
|
| 222 |
return 0.0
|
| 223 |
+
|
| 224 |
+
if method == 'ngram':
|
| 225 |
+
# Syllable bigram overlap - good for detecting shared phrases
|
| 226 |
+
return compute_ngram_similarity(words1, words2, n=2)
|
| 227 |
+
elif method == 'syllable_edit':
|
| 228 |
+
# Syllable-level edit distance - good for detecting minor variations
|
| 229 |
+
return compute_syllable_edit_similarity(words1, words2)
|
| 230 |
+
elif method == 'weighted_jaccard':
|
| 231 |
+
# Frequency-weighted Jaccard - good for repeated terms
|
| 232 |
+
return compute_weighted_jaccard(words1, words2)
|
| 233 |
+
else:
|
| 234 |
+
# Default to ngram for any unrecognized method
|
| 235 |
+
return compute_ngram_similarity(words1, words2, n=2)
|
| 236 |
|
| 237 |
|
| 238 |
|
| 239 |
def compute_semantic_similarity(
|
| 240 |
text1_segment: str,
|
| 241 |
text2_segment: str,
|
|
|
|
|
|
|
| 242 |
model,
|
| 243 |
batch_size: int = 32,
|
| 244 |
show_progress_bar: bool = False
|
| 245 |
) -> float:
|
| 246 |
+
"""
|
| 247 |
+
Computes semantic similarity using a Sentence Transformer model.
|
| 248 |
|
| 249 |
+
Args:
|
| 250 |
+
text1_segment: First text segment
|
| 251 |
+
text2_segment: Second text segment
|
| 252 |
+
model: Pre-loaded SentenceTransformer model
|
| 253 |
+
batch_size: Batch size for encoding
|
| 254 |
+
show_progress_bar: Whether to show progress bar
|
| 255 |
+
|
| 256 |
+
Returns:
|
| 257 |
+
float: Cosine similarity between embeddings (0.0 to 1.0), or np.nan on error
|
| 258 |
+
"""
|
| 259 |
if model is None:
|
| 260 |
logger.warning(
|
| 261 |
"Embedding model not available for semantic similarity. Skipping calculation."
|
|
|
|
| 268 |
)
|
| 269 |
return 0.0
|
| 270 |
|
| 271 |
+
def _get_embedding(raw_text: str) -> Union[np.ndarray, None]:
|
| 272 |
+
"""Helper to get embedding for a single text."""
|
| 273 |
+
if not raw_text.strip():
|
| 274 |
+
logger.info("Text is empty or whitespace. Returning None.")
|
| 275 |
return None
|
| 276 |
+
|
| 277 |
embedding = generate_hf_embeddings(
|
| 278 |
+
texts=[raw_text],
|
| 279 |
+
model=model,
|
| 280 |
+
batch_size=batch_size,
|
| 281 |
+
show_progress_bar=show_progress_bar
|
| 282 |
)
|
| 283 |
+
|
| 284 |
+
if embedding is None or embedding.size == 0:
|
| 285 |
+
logger.error(f"Failed to generate embedding for text: {raw_text[:100]}...")
|
|
|
|
|
|
|
| 286 |
return None
|
| 287 |
return embedding
|
| 288 |
|
| 289 |
try:
|
| 290 |
+
emb1 = _get_embedding(text1_segment)
|
| 291 |
+
emb2 = _get_embedding(text2_segment)
|
|
|
|
| 292 |
|
| 293 |
if emb1 is None or emb2 is None or emb1.size == 0 or emb2.size == 0:
|
| 294 |
logger.error(
|
|
|
|
| 309 |
if np.all(emb1 == 0) or np.all(emb2 == 0):
|
| 310 |
logger.info("One of the embeddings is zero. Semantic similarity is 0.0.")
|
| 311 |
return 0.0
|
| 312 |
+
|
| 313 |
# Handle NaN or Inf in embeddings
|
| 314 |
if np.isnan(emb1).any() or np.isinf(emb1).any() or \
|
| 315 |
np.isnan(emb2).any() or np.isinf(emb2).any():
|
|
|
|
| 321 |
emb1 = emb1.reshape(1, -1)
|
| 322 |
if emb2.ndim == 1:
|
| 323 |
emb2 = emb2.reshape(1, -1)
|
| 324 |
+
|
| 325 |
similarity_score = cosine_similarity(emb1, emb2)[0][0]
|
| 326 |
+
|
| 327 |
return max(0.0, float(similarity_score))
|
| 328 |
|
| 329 |
except Exception as e:
|
|
|
|
| 343 |
enable_semantic: bool = True,
|
| 344 |
enable_fuzzy: bool = True,
|
| 345 |
fuzzy_method: str = 'token_set',
|
| 346 |
+
lcs_normalization: str = 'avg',
|
| 347 |
use_stopwords: bool = True,
|
| 348 |
use_lite_stopwords: bool = False,
|
| 349 |
+
normalize_particles_opt: bool = False,
|
| 350 |
batch_size: int = 32,
|
| 351 |
show_progress_bar: bool = False
|
| 352 |
) -> pd.DataFrame:
|
|
|
|
| 361 |
Defaults to None.
|
| 362 |
enable_semantic (bool): Whether to compute semantic similarity. Defaults to True.
|
| 363 |
enable_fuzzy (bool): Whether to compute fuzzy string similarity. Defaults to True.
|
| 364 |
+
fuzzy_method (str): The fuzzy matching method to use ('ngram', 'syllable_edit', 'weighted_jaccard').
|
| 365 |
Defaults to 'token_set', an unrecognized value that falls back to 'ngram'.
|
| 366 |
+
lcs_normalization (str): How to normalize LCS ('avg', 'min', 'max'). Defaults to 'avg'.
|
| 367 |
use_stopwords (bool): Whether to filter stopwords for Jaccard similarity. Defaults to True.
|
| 368 |
use_lite_stopwords (bool): Whether to use the lite version of stopwords. Defaults to False.
|
| 369 |
+
normalize_particles_opt (bool): Whether to normalize grammatical particles (གི/ཀྱི/གྱི → གི).
|
| 370 |
+
Reduces false negatives from sandhi variation. Defaults to False.
|
| 371 |
batch_size (int): Batch size for semantic similarity computation. Defaults to 32.
|
| 372 |
show_progress_bar (bool): Whether to show progress bar for semantic similarity. Defaults to False.
|
| 373 |
|
|
|
|
| 378 |
"""
|
| 379 |
files = list(texts.keys())
|
| 380 |
results = []
|
| 381 |
|
|
|
|
| 382 |
for i, j in combinations(range(len(files)), 2):
|
| 383 |
f1, f2 = files[i], files[j]
|
| 384 |
words1_raw, words2_raw = token_lists[f1], token_lists[f2]
|
|
|
|
| 393 |
else:
|
| 394 |
# If stopwords are disabled, use an empty set
|
| 395 |
stopwords_set_to_use = set()
|
| 396 |
+
|
| 397 |
+
# Filter stopwords for Jaccard calculation (normalize tokens for consistent matching)
|
| 398 |
+
words1_filtered = [word for word in words1_raw if _normalize_token_for_stopwords(word) not in stopwords_set_to_use]
|
| 399 |
+
words2_filtered = [word for word in words2_raw if _normalize_token_for_stopwords(word) not in stopwords_set_to_use]
|
| 400 |
+
|
| 401 |
+
# Apply particle normalization if enabled
|
| 402 |
+
if normalize_particles_opt:
|
| 403 |
+
words1_jaccard = normalize_particles(words1_filtered)
|
| 404 |
+
words2_jaccard = normalize_particles(words2_filtered)
|
| 405 |
+
words1_lcs = normalize_particles(words1_raw)
|
| 406 |
+
words2_lcs = normalize_particles(words2_raw)
|
| 407 |
+
else:
|
| 408 |
+
words1_jaccard = words1_filtered
|
| 409 |
+
words2_jaccard = words2_filtered
|
| 410 |
+
words1_lcs = words1_raw
|
| 411 |
+
words2_lcs = words2_raw
|
| 412 |
|
| 413 |
jaccard = (
|
| 414 |
len(set(words1_jaccard) & set(words2_jaccard)) / len(set(words1_jaccard) | set(words2_jaccard))
|
| 415 |
if set(words1_jaccard) | set(words2_jaccard) # Ensure denominator is not zero
|
| 416 |
else 0.0
|
| 417 |
)
|
|
|
|
|
|
|
| 418 |
jaccard_percent = jaccard * 100.0
|
| 419 |
+
|
| 420 |
+
# LCS uses tokens (with optional particle normalization)
|
| 421 |
+
norm_lcs = compute_normalized_lcs(words1_lcs, words2_lcs, normalization=lcs_normalization)
|
| 422 |
+
|
| 423 |
# Fuzzy Similarity Calculation
|
| 424 |
if enable_fuzzy:
|
| 425 |
fuzzy_sim = compute_fuzzy_similarity(words1_jaccard, words2_jaccard, method=fuzzy_method)
|
|
|
|
| 428 |
|
| 429 |
# Semantic Similarity Calculation
|
| 430 |
if enable_semantic:
|
|
|
|
| 431 |
semantic_sim = compute_semantic_similarity(
|
| 432 |
+
texts[f1], texts[f2], model,
|
| 433 |
batch_size=batch_size,
|
| 434 |
show_progress_bar=show_progress_bar
|
| 435 |
)
|
pipeline/normalize_bo.py
ADDED
|
@@ -0,0 +1,106 @@
|
| 1 |
+
# -*- coding: utf-8 -*-
|
| 2 |
+
"""
|
| 3 |
+
Tibetan text normalization for improved text comparison.
|
| 4 |
+
|
| 5 |
+
This module provides normalization functions for Tibetan grammatical particles,
|
| 6 |
+
which change form based on the preceding syllable (sandhi). Normalizing these
|
| 7 |
+
allows more accurate comparison between texts that may use different particle
|
| 8 |
+
forms for grammatical reasons rather than semantic differences.
|
| 9 |
+
"""
|
| 10 |
+
|
| 11 |
+
from typing import List
|
| 12 |
+
|
| 13 |
+
# Particle equivalence classes
|
| 14 |
+
# All forms in each class are grammatically equivalent
|
| 15 |
+
# The first form in each list is the canonical/normalized form
|
| 16 |
+
PARTICLE_CLASSES = {
|
| 17 |
+
# Genitive particles (གི་སྒྲ) - "of"
|
| 18 |
+
# Form depends on final letter of preceding syllable
|
| 19 |
+
"genitive": ["གི", "ཀྱི", "གྱི", "ཡི", "འི"],
|
| 20 |
+
|
| 21 |
+
# Agentive/instrumental particles (བྱེད་སྒྲ) - "by"
|
| 22 |
+
"agentive": ["གིས", "ཀྱིས", "གྱིས", "ཡིས", "ས"],
|
| 23 |
+
|
| 24 |
+
# Dative/locative particles (ལ་དོན) - "to/at/in"
|
| 25 |
+
"dative": ["ལ", "ར", "སུ", "ཏུ", "དུ", "རུ"],
|
| 26 |
+
|
| 27 |
+
# Ablative particles (འབྱུང་ཁུངས) - "from"
|
| 28 |
+
"ablative": ["ནས", "ལས"],
|
| 29 |
+
|
| 30 |
+
# Conjunctive particles (སྦྱོར་སྒྲ) - verbal connective "and/while"
|
| 31 |
+
"conjunctive": ["ཅིང", "ཤིང", "ཞིང"],
|
| 32 |
+
|
| 33 |
+
# Terminative particles (མཐའ་སྒྲ) - clause ending
|
| 34 |
+
"terminative": ["སྟེ", "ཏེ", "དེ"],
|
| 35 |
+
|
| 36 |
+
# Concessive particles - "even/also"
|
| 37 |
+
"concessive": ["ཀྱང", "ཡང", "འང"],
|
| 38 |
+
|
| 39 |
+
# Imperative particles
|
| 40 |
+
"imperative": ["ཅིག", "ཤིག", "ཞིག"],
|
| 41 |
+
}
|
| 42 |
+
|
| 43 |
+
|
| 44 |
+
def _build_particle_map() -> dict:
|
| 45 |
+
"""Build mapping from all particle variants to canonical form."""
|
| 46 |
+
mapping = {}
|
| 47 |
+
for class_name, forms in PARTICLE_CLASSES.items():
|
| 48 |
+
canonical = forms[0] # First form is canonical
|
| 49 |
+
for variant in forms:
|
| 50 |
+
# Strip tsek for matching (will be normalized anyway)
|
| 51 |
+
variant_clean = variant.rstrip('་')
|
| 52 |
+
mapping[variant_clean] = canonical
|
| 53 |
+
return mapping
|
| 54 |
+
|
| 55 |
+
|
| 56 |
+
# Pre-built mapping for efficiency
|
| 57 |
+
PARTICLE_NORMALIZATION_MAP = _build_particle_map()
|
| 58 |
+
|
| 59 |
+
|
| 60 |
+
def normalize_particles(tokens: List[str]) -> List[str]:
|
| 61 |
+
"""
|
| 62 |
+
Normalize grammatical particles to canonical forms.
|
| 63 |
+
|
| 64 |
+
This treats all sandhi variants of a particle as equivalent:
|
| 65 |
+
- གི, ཀྱི, གྱི, ཡི, འི → གི (genitive)
|
| 66 |
+
- གིས, ཀྱིས, གྱིས, ཡིས, ས → གིས (agentive)
|
| 67 |
+
- ལ, ར, སུ, ཏུ, དུ, རུ → ལ (dative)
|
| 68 |
+
- etc.
|
| 69 |
+
|
| 70 |
+
This is useful when comparing texts that may use different particle forms
|
| 71 |
+
based on phonological context rather than semantic differences.
|
| 72 |
+
|
| 73 |
+
Args:
|
| 74 |
+
tokens: List of Tibetan tokens (syllables or words)
|
| 75 |
+
|
| 76 |
+
Returns:
|
| 77 |
+
List of tokens with particles normalized to canonical forms
|
| 78 |
+
"""
|
| 79 |
+
normalized = []
|
| 80 |
+
for token in tokens:
|
| 81 |
+
# Strip tsek for lookup
|
| 82 |
+
token_clean = token.rstrip('་')
|
| 83 |
+
# Check if it's a particle that should be normalized
|
| 84 |
+
if token_clean in PARTICLE_NORMALIZATION_MAP:
|
| 85 |
+
normalized.append(PARTICLE_NORMALIZATION_MAP[token_clean])
|
| 86 |
+
else:
|
| 87 |
+
normalized.append(token_clean)
|
| 88 |
+
return normalized
|
| 89 |
+
|
| 90 |
+
|
| 91 |
+
def get_particle_class(token: str) -> str:
|
| 92 |
+
"""
|
| 93 |
+
Get the grammatical class of a particle.
|
| 94 |
+
|
| 95 |
+
Args:
|
| 96 |
+
token: A Tibetan token
|
| 97 |
+
|
| 98 |
+
Returns:
|
| 99 |
+
The particle class name (e.g., 'genitive', 'agentive') or None
|
| 100 |
+
"""
|
| 101 |
+
token_clean = token.rstrip('་')
|
| 102 |
+
for class_name, forms in PARTICLE_CLASSES.items():
|
| 103 |
+
clean_forms = [f.rstrip('་') for f in forms]
|
| 104 |
+
if token_clean in clean_forms:
|
| 105 |
+
return class_name
|
| 106 |
+
return None
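The particle normalization in this new module can be illustrated with a reduced mapping. The sketch below copies only the genitive and agentive classes from `PARTICLE_CLASSES`; the sample tokens and the names `TSEK` and `PARTICLE_MAP` are assumptions for illustration, and the lookup mirrors the strip-tsek-then-map logic of `normalize_particles`.

```python
from typing import Dict, List

TSEK = "་"  # Tibetan syllable delimiter, stripped before lookup

# Reduced copy of two equivalence classes; the first form is canonical.
PARTICLE_CLASSES: Dict[str, List[str]] = {
    "genitive": ["གི", "ཀྱི", "གྱི", "ཡི", "འི"],
    "agentive": ["གིས", "ཀྱིས", "གྱིས", "ཡིས", "ས"],
}

# Map every variant (including the canonical form itself) to the canonical form.
PARTICLE_MAP: Dict[str, str] = {
    form.rstrip(TSEK): forms[0]
    for forms in PARTICLE_CLASSES.values()
    for form in forms
}


def normalize_particles(tokens: List[str]) -> List[str]:
    """Strip the trailing tsek and replace particle variants by their canonical form."""
    return [PARTICLE_MAP.get(t.rstrip(TSEK), t.rstrip(TSEK)) for t in tokens]


tokens_a = ["སངས་", "རྒྱས་", "ཀྱི་", "བཀའ་"]
tokens_b = ["སངས་", "རྒྱས་", "གྱི་", "བཀའ་"]  # differs only in the genitive variant

print(normalize_particles(tokens_a) == normalize_particles(tokens_b))  # True
```

Because the lookup is purely string-based, any token that happens to be spelled like a particle variant is mapped as well, which may be worth keeping in mind when the option is enabled.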
|
pipeline/process.py
CHANGED
|
@@ -1,4 +1,5 @@
|
|
| 1 |
import pandas as pd
|
|
|
|
| 2 |
from typing import Dict, List, Tuple
|
| 3 |
from .metrics import compute_all_metrics
|
| 4 |
from .hf_embedding import get_model as get_hf_model
|
|
@@ -13,7 +14,7 @@ import re
|
|
| 13 |
|
| 14 |
def get_botok_tokens_for_single_text(text: str, mode: str = "syllable") -> list[str]:
|
| 15 |
"""
|
| 16 |
-
A wrapper around tokenize_texts to make it suitable for tokenize_fn
|
| 17 |
in generate_embeddings, which expects a function that tokenizes a single string.
|
| 18 |
Accepts a 'mode' argument ('syllable' or 'word') to pass to tokenize_texts.
|
| 19 |
"""
|
|
@@ -46,14 +47,17 @@ logger = logging.getLogger(__name__)
|
|
| 46 |
|
| 47 |
|
| 48 |
def process_texts(
|
| 49 |
-
text_data: Dict[str, str],
|
| 50 |
-
filenames: List[str],
|
| 51 |
enable_semantic: bool = True,
|
| 52 |
enable_fuzzy: bool = True,
|
| 53 |
-
fuzzy_method: str = '
|
| 54 |
-
|
|
|
|
| 55 |
use_stopwords: bool = True,
|
| 56 |
use_lite_stopwords: bool = False,
|
|
|
|
|
|
|
| 57 |
progress_callback = None,
|
| 58 |
progressive_callback = None,
|
| 59 |
batch_size: int = 32,
|
|
@@ -61,11 +65,11 @@ def process_texts(
|
|
| 61 |
) -> Tuple[pd.DataFrame, pd.DataFrame, str]:
|
| 62 |
"""
|
| 63 |
Processes uploaded texts, segments them by chapter marker, and computes metrics between chapters of different files.
|
| 64 |
-
|
| 65 |
Args:
|
| 66 |
text_data (Dict[str, str]): A dictionary mapping filenames to their content.
|
| 67 |
filenames (List[str]): A list of filenames that were uploaded.
|
| 68 |
-
enable_semantic (bool, optional): Whether to compute semantic similarity metrics.
|
| 69 |
Requires loading a sentence-transformer model, which can be time-consuming. Defaults to True.
|
| 70 |
enable_fuzzy (bool, optional): Whether to compute fuzzy string similarity metrics.
|
| 71 |
Uses TheFuzz library for approximate string matching. Defaults to True.
|
|
@@ -74,16 +78,28 @@ def process_texts(
|
|
| 74 |
'token_sort' - Order-normalized token matching
|
| 75 |
'partial' - Best partial token matching
|
| 76 |
'ratio' - Simple ratio matching
|
| 77 |
model_name (str, optional): The Hugging Face sentence-transformer model to use for semantic similarity.
|
| 78 |
-
Must be a valid model identifier on Hugging Face. Defaults to "sentence-
|
| 79 |
use_stopwords (bool, optional): Whether to use stopwords in the metrics calculation. Defaults to True.
|
| 80 |
use_lite_stopwords (bool, optional): Whether to use the lite stopwords list (common particles only)
|
| 81 |
instead of the comprehensive list. Only applies if use_stopwords is True. Defaults to False.
|
|
| 82 |
progress_callback (callable, optional): A callback function for reporting progress updates.
|
| 83 |
Should accept a float between 0 and 1 and a description string. Defaults to None.
|
| 84 |
progressive_callback (callable, optional): A callback function for sending incremental results.
|
| 85 |
Used for progressive loading of metrics as they become available. Defaults to None.
|
| 86 |
-
|
| 87 |
Returns:
|
| 88 |
Tuple[pd.DataFrame, pd.DataFrame, str]:
|
| 89 |
- metrics_df: DataFrame with similarity metrics between corresponding chapters of file pairs.
|
|
@@ -92,7 +108,7 @@ def process_texts(
|
|
| 92 |
- word_counts_df: DataFrame with word counts for each segment (chapter) in each file.
|
| 93 |
Contains columns: 'Filename', 'ChapterNumber', 'SegmentID', 'WordCount'.
|
| 94 |
- warning: A string containing any warnings generated during processing (e.g., missing chapter markers).
|
| 95 |
-
|
| 96 |
Raises:
|
| 97 |
RuntimeError: If the botok tokenizer fails to initialize.
|
| 98 |
ValueError: If the input files cannot be processed or if metrics computation fails.
|
|
@@ -132,7 +148,7 @@ def process_texts(
|
|
| 132 |
progress_callback(0.3, desc="Unsupported model, continuing without semantic similarity.")
|
| 133 |
except Exception as e:
|
| 134 |
logger.warning(f"Progress callback error (non-critical): {e}")
|
| 135 |
-
|
| 136 |
except Exception as e: # General catch-all for unexpected errors during model loading attempts
|
| 137 |
model_warning = f"An unexpected error occurred while attempting to load model '{model_name}': {e}. Semantic similarity will be disabled."
|
| 138 |
logger.error(model_warning, exc_info=True)
|
|
@@ -156,38 +172,38 @@ def process_texts(
|
|
| 156 |
progress_callback(0.35, desc="Segmenting texts by chapters...")
|
| 157 |
except Exception as e:
|
| 158 |
logger.warning(f"Progress callback error (non-critical): {e}")
|
| 159 |
-
|
| 160 |
chapter_marker = "༈"
|
| 161 |
fallback = False
|
| 162 |
segment_texts = {}
|
| 163 |
-
|
| 164 |
# Process each file
|
| 165 |
for i, fname in enumerate(filenames):
|
| 166 |
if progress_callback is not None and len(filenames) > 1:
|
| 167 |
try:
|
| 168 |
-
progress_callback(0.35 + (0.05 * (i / len(filenames))),
|
| 169 |
desc=f"Segmenting file {i+1}/{len(filenames)}: {fname}")
|
| 170 |
except Exception as e:
|
| 171 |
logger.warning(f"Progress callback error (non-critical): {e}")
|
| 172 |
-
|
| 173 |
content = text_data[fname]
|
| 174 |
-
|
| 175 |
# Check if content is empty
|
| 176 |
if not content.strip():
|
| 177 |
logger.warning(f"File '{fname}' is empty or contains only whitespace.")
|
| 178 |
continue
|
| 179 |
-
|
| 180 |
# Split by chapter marker if present
|
| 181 |
if chapter_marker in content:
|
| 182 |
segments = [
|
| 183 |
seg.strip() for seg in content.split(chapter_marker) if seg.strip()
|
| 184 |
]
|
| 185 |
-
|
| 186 |
# Check if we have valid segments after splitting
|
| 187 |
if not segments:
|
| 188 |
logger.warning(f"File '{fname}' contains chapter markers but no valid text segments.")
|
| 189 |
continue
|
| 190 |
-
|
| 191 |
for idx, seg in enumerate(segments):
|
| 192 |
seg_id = f"{fname}|chapter {idx+1}"
|
| 193 |
cleaned_seg = clean_tibetan_text(seg)
|
|
@@ -198,7 +214,7 @@ def process_texts(
|
|
| 198 |
cleaned_content = clean_tibetan_text(content.strip())
|
| 199 |
segment_texts[seg_id] = cleaned_content
|
| 200 |
fallback = True
|
| 201 |
-
|
| 202 |
# Generate warning if no chapter markers found
|
| 203 |
warning = model_warning # Include any model warnings
|
| 204 |
if fallback:
|
|
@@ -208,7 +224,7 @@ def process_texts(
|
|
| 208 |
"For best results, add a unique marker (e.g., ༈) to separate chapters or sections."
|
| 209 |
)
|
| 210 |
warning = warning + " " + chapter_warning if warning else chapter_warning
|
| 211 |
-
|
| 212 |
# Check if we have any valid segments
|
| 213 |
if not segment_texts:
|
| 214 |
logger.error("No valid text segments found in any of the uploaded files.")
|
|
@@ -216,90 +232,90 @@ def process_texts(
|
|
| 216 |
# Tokenize all segments at once for efficiency
|
| 217 |
if progress_callback is not None:
|
| 218 |
try:
|
| 219 |
-
progress_callback(0.
|
| 220 |
except Exception as e:
|
| 221 |
logger.warning(f"Progress callback error (non-critical): {e}")
|
| 222 |
|
| 223 |
all_segment_ids = list(segment_texts.keys())
|
| 224 |
all_segment_contents = list(segment_texts.values())
|
| 225 |
-
tokenized_segments_list = tokenize_texts(all_segment_contents)
|
| 226 |
|
| 227 |
segment_tokens = dict(zip(all_segment_ids, tokenized_segments_list))
|
| 228 |
|
| 229 |
# Group chapters by filename (preserving order)
|
| 230 |
if progress_callback is not None:
|
| 231 |
try:
|
| 232 |
-
progress_callback(0.
|
| 233 |
except Exception as e:
|
| 234 |
logger.warning(f"Progress callback error (non-critical): {e}")
|
| 235 |
-
|
| 236 |
file_to_chapters = {}
|
| 237 |
for seg_id in segment_texts:
|
| 238 |
fname = seg_id.split("|")[0]
|
| 239 |
file_to_chapters.setdefault(fname, []).append(seg_id)
|
| 240 |
-
|
| 241 |
# For each pair of files, compare corresponding chapters (by index)
|
| 242 |
if progress_callback is not None:
|
| 243 |
try:
|
| 244 |
progress_callback(0.45, desc="Computing similarity metrics...")
|
| 245 |
except Exception as e:
|
| 246 |
logger.warning(f"Progress callback error (non-critical): {e}")
|
| 247 |
-
|
| 248 |
results = []
|
| 249 |
files = list(file_to_chapters.keys())
|
| 250 |
-
|
| 251 |
# Check if we have at least two files to compare
|
| 252 |
if len(files) < 2:
|
| 253 |
logger.warning("Need at least two files to compute similarity metrics.")
|
| 254 |
return pd.DataFrame(), pd.DataFrame(), "Need at least two files to compute similarity metrics."
|
| 255 |
-
|
| 256 |
# Track total number of comparisons for progress reporting
|
| 257 |
total_comparisons = 0
|
| 258 |
for file1, file2 in combinations(files, 2):
|
| 259 |
chaps1 = file_to_chapters[file1]
|
| 260 |
chaps2 = file_to_chapters[file2]
|
| 261 |
total_comparisons += min(len(chaps1), len(chaps2))
|
| 262 |
-
|
| 263 |
# Initialize results DataFrame for progressive updates
|
| 264 |
results_columns = ['Text Pair', 'Chapter', 'Jaccard Similarity (%)', 'Normalized LCS']
|
| 265 |
if enable_fuzzy:
|
| 266 |
results_columns.append('Fuzzy Similarity')
|
| 267 |
if enable_semantic:
|
| 268 |
results_columns.append('Semantic Similarity')
|
| 269 |
-
|
| 270 |
# Create empty DataFrame with the correct columns
|
| 271 |
progressive_df = pd.DataFrame(columns=results_columns)
|
| 272 |
-
|
| 273 |
# Track which metrics have been completed for progressive updates
|
| 274 |
completed_metrics = []
|
| 275 |
-
|
| 276 |
# Process each file pair
|
| 277 |
comparison_count = 0
|
| 278 |
for file1, file2 in combinations(files, 2):
|
| 279 |
chaps1 = file_to_chapters[file1]
|
| 280 |
chaps2 = file_to_chapters[file2]
|
| 281 |
min_chaps = min(len(chaps1), len(chaps2))
|
| 282 |
-
|
| 283 |
if progress_callback is not None:
|
| 284 |
try:
|
| 285 |
progress_callback(0.45, desc=f"Comparing {file1} with {file2}...")
|
| 286 |
except Exception as e:
|
| 287 |
logger.warning(f"Progress callback error (non-critical): {e}")
|
| 288 |
-
|
| 289 |
for idx in range(min_chaps):
|
| 290 |
seg1 = chaps1[idx]
|
| 291 |
seg2 = chaps2[idx]
|
| 292 |
-
|
| 293 |
# Update progress
|
| 294 |
comparison_count += 1
|
| 295 |
if progress_callback is not None and total_comparisons > 0:
|
| 296 |
try:
|
| 297 |
progress_percentage = 0.45 + (0.25 * (comparison_count / total_comparisons))
|
| 298 |
-
progress_callback(progress_percentage,
|
| 299 |
desc=f"Computing metrics for chapter {idx+1} ({comparison_count}/{total_comparisons})")
|
| 300 |
except Exception as e:
|
| 301 |
logger.warning(f"Progress callback error (non-critical): {e}")
|
| 302 |
-
|
| 303 |
try:
|
| 304 |
# Compute metrics for this chapter pair
|
| 305 |
metrics_df = compute_all_metrics(
|
|
@@ -309,10 +325,12 @@ def process_texts(
|
|
| 309 |
enable_semantic=enable_semantic,
|
| 310 |
enable_fuzzy=enable_fuzzy,
|
| 311 |
fuzzy_method=fuzzy_method,
|
|
|
|
| 312 |
use_stopwords=use_stopwords,
|
| 313 |
use_lite_stopwords=use_lite_stopwords,
|
|
|
|
| 314 |
)
|
| 315 |
-
|
| 316 |
# Extract metrics from the DataFrame (should have only one row)
|
| 317 |
if not metrics_df.empty:
|
| 318 |
pair_metrics = metrics_df.iloc[0].to_dict()
|
|
@@ -325,57 +343,57 @@ def process_texts(
|
|
| 325 |
"Fuzzy Similarity": 0.0 if enable_fuzzy else np.nan,
|
| 326 |
"Semantic Similarity": 0.0 if enable_semantic else np.nan
|
| 327 |
}
|
| 328 |
-
|
| 329 |
# Format the results
|
| 330 |
text_pair = f"{file1} vs {file2}"
|
| 331 |
chapter_num = idx + 1
|
| 332 |
-
|
| 333 |
result_row = {
|
| 334 |
"Text Pair": text_pair,
|
| 335 |
"Chapter": chapter_num,
|
| 336 |
"Jaccard Similarity (%)": pair_metrics["Jaccard Similarity (%)"], # Already in percentage
|
| 337 |
"Normalized LCS": pair_metrics["Normalized LCS"],
|
| 338 |
}
|
| 339 |
-
|
| 340 |
# Add fuzzy similarity if enabled
|
| 341 |
if enable_fuzzy:
|
| 342 |
result_row["Fuzzy Similarity"] = pair_metrics["Fuzzy Similarity"]
|
| 343 |
-
|
| 344 |
# Add semantic similarity if enabled and available
|
| 345 |
if enable_semantic and "Semantic Similarity" in pair_metrics:
|
| 346 |
result_row["Semantic Similarity"] = pair_metrics["Semantic Similarity"]
|
| 347 |
-
|
| 348 |
# Convert the dictionary to a DataFrame before appending
|
| 349 |
result_df = pd.DataFrame([result_row])
|
| 350 |
results.append(result_df)
|
| 351 |
-
|
| 352 |
# Update progressive DataFrame and send update if callback is provided
|
| 353 |
progressive_df = pd.concat(results, ignore_index=True)
|
| 354 |
-
|
| 355 |
# Send progressive update if callback is provided
|
| 356 |
if progressive_callback is not None:
|
| 357 |
# Determine which metrics are complete in this update
|
| 358 |
current_metrics = []
|
| 359 |
-
|
| 360 |
# Always include these basic metrics
|
| 361 |
if "Jaccard Similarity (%)" in progressive_df.columns and MetricType.JACCARD not in completed_metrics:
|
| 362 |
current_metrics.append(MetricType.JACCARD)
|
| 363 |
completed_metrics.append(MetricType.JACCARD)
|
| 364 |
-
|
| 365 |
if "Normalized LCS" in progressive_df.columns and MetricType.LCS not in completed_metrics:
|
| 366 |
current_metrics.append(MetricType.LCS)
|
| 367 |
completed_metrics.append(MetricType.LCS)
|
| 368 |
-
|
| 369 |
# Add fuzzy if enabled and available
|
| 370 |
if enable_fuzzy and "Fuzzy Similarity" in progressive_df.columns and MetricType.FUZZY not in completed_metrics:
|
| 371 |
current_metrics.append(MetricType.FUZZY)
|
| 372 |
completed_metrics.append(MetricType.FUZZY)
|
| 373 |
-
|
| 374 |
# Add semantic if enabled and available
|
| 375 |
if enable_semantic and "Semantic Similarity" in progressive_df.columns and MetricType.SEMANTIC not in completed_metrics:
|
| 376 |
current_metrics.append(MetricType.SEMANTIC)
|
| 377 |
completed_metrics.append(MetricType.SEMANTIC)
|
| 378 |
-
|
| 379 |
# Create word counts DataFrame for progressive update
|
| 380 |
word_counts_data = []
|
| 381 |
for seg_id, tokens in segment_tokens.items():
|
|
@@ -388,7 +406,7 @@ def process_texts(
|
|
| 388 |
"WordCount": len(tokens)
|
| 389 |
})
|
| 390 |
word_counts_df_progressive = pd.DataFrame(word_counts_data)
|
| 391 |
-
|
| 392 |
# Send the update
|
| 393 |
try:
|
| 394 |
progressive_callback(
|
|
@@ -400,12 +418,12 @@ def process_texts(
|
|
| 400 |
)
|
| 401 |
except Exception as e:
|
| 402 |
logger.warning(f"Progressive callback error (non-critical): {e}")
|
| 403 |
-
|
| 404 |
except Exception as e:
|
| 405 |
logger.error(f"Error computing metrics for {seg1} vs {seg2}: {e}", exc_info=True)
|
| 406 |
# Continue with other segment comparisons instead of failing completely
|
| 407 |
continue
|
| 408 |
-
|
| 409 |
# Create the metrics DataFrame
|
| 410 |
if results:
|
| 411 |
# Results are already DataFrames, so we can concatenate them directly
|
|
@@ -420,9 +438,9 @@ def process_texts(
|
|
| 420 |
progress_callback(0.75, desc="Calculating word counts...")
|
| 421 |
except Exception as e:
|
| 422 |
logger.warning(f"Progress callback error (non-critical): {e}")
|
| 423 |
-
|
| 424 |
word_counts_data = []
|
| 425 |
-
|
| 426 |
# Process each segment
|
| 427 |
for i, (seg_id, text_content) in enumerate(segment_texts.items()):
|
| 428 |
# Update progress
|
|
@@ -432,10 +450,10 @@ def process_texts(
|
|
| 432 |
progress_callback(progress_percentage, desc=f"Counting words in segment {i+1}/{len(segment_texts)}")
|
| 433 |
except Exception as e:
|
| 434 |
logger.warning(f"Progress callback error (non-critical): {e}")
|
| 435 |
-
|
| 436 |
fname, chapter_info = seg_id.split("|", 1)
|
| 437 |
chapter_num = int(chapter_info.replace("chapter ", ""))
|
| 438 |
-
|
| 439 |
try:
|
| 440 |
# Use botok for accurate word count for raw Tibetan text
|
| 441 |
tokenized_segments = tokenize_texts([text_content]) # Returns a list of lists
|
|
@@ -443,7 +461,7 @@ def process_texts(
|
|
| 443 |
word_count = len(tokenized_segments[0])
|
| 444 |
else:
|
| 445 |
word_count = 0
|
| 446 |
-
|
| 447 |
word_counts_data.append(
|
| 448 |
{
|
| 449 |
"Filename": fname.replace(".txt", ""),
|
|
@@ -463,20 +481,20 @@ def process_texts(
|
|
| 463 |
"WordCount": 0,
|
| 464 |
}
|
| 465 |
)
|
| 466 |
-
|
| 467 |
# Create and sort the word counts DataFrame
|
| 468 |
word_counts_df = pd.DataFrame(word_counts_data)
|
| 469 |
if not word_counts_df.empty:
|
| 470 |
word_counts_df = word_counts_df.sort_values(
|
| 471 |
by=["Filename", "ChapterNumber"]
|
| 472 |
).reset_index(drop=True)
|
| 473 |
-
|
| 474 |
if progress_callback is not None:
|
| 475 |
try:
|
| 476 |
progress_callback(0.95, desc="Analysis complete!")
|
| 477 |
except Exception as e:
|
| 478 |
logger.warning(f"Progress callback error (non-critical): {e}")
|
| 479 |
-
|
| 480 |
# Send final progressive update if callback is provided
|
| 481 |
if progressive_callback is not None:
|
| 482 |
try:
|
|
@@ -490,6 +508,6 @@ def process_texts(
|
|
| 490 |
)
|
| 491 |
except Exception as e:
|
| 492 |
logger.warning(f"Final progressive callback error (non-critical): {e}")
|
| 493 |
-
|
| 494 |
# Return the results
|
| 495 |
return metrics_df, word_counts_df, warning
|
|
|
|
| 1 |
import pandas as pd
|
| 2 |
+
import numpy as np
|
| 3 |
from typing import Dict, List, Tuple
|
| 4 |
from .metrics import compute_all_metrics
|
| 5 |
from .hf_embedding import get_model as get_hf_model
|
|
|
|
| 14 |
|
| 15 |
def get_botok_tokens_for_single_text(text: str, mode: str = "syllable") -> list[str]:
|
| 16 |
"""
|
| 17 |
+
A wrapper around tokenize_texts to make it suitable for tokenize_fn
|
| 18 |
in generate_embeddings, which expects a function that tokenizes a single string.
|
| 19 |
Accepts a 'mode' argument ('syllable' or 'word') to pass to tokenize_texts.
|
| 20 |
"""
|
|
|
|
| 47 |
|
| 48 |
|
| 49 |
def process_texts(
|
| 50 |
+
text_data: Dict[str, str],
|
| 51 |
+
filenames: List[str],
|
| 52 |
enable_semantic: bool = True,
|
| 53 |
enable_fuzzy: bool = True,
|
| 54 |
+
fuzzy_method: str = 'ngram',
|
| 55 |
+
lcs_normalization: str = 'avg',
|
| 56 |
+
model_name: str = "buddhist-nlp/buddhist-sentence-similarity",
|
| 57 |
use_stopwords: bool = True,
|
| 58 |
use_lite_stopwords: bool = False,
|
| 59 |
+
normalize_particles: bool = False,
|
| 60 |
+
tokenization_mode: str = "word",
|
| 61 |
progress_callback = None,
|
| 62 |
progressive_callback = None,
|
| 63 |
batch_size: int = 32,
|
|
|
|
| 65 |
) -> Tuple[pd.DataFrame, pd.DataFrame, str]:
|
| 66 |
"""
|
| 67 |
Processes uploaded texts, segments them by chapter marker, and computes metrics between chapters of different files.
|
| 68 |
+
|
| 69 |
Args:
|
| 70 |
text_data (Dict[str, str]): A dictionary mapping filenames to their content.
|
| 71 |
filenames (List[str]): A list of filenames that were uploaded.
|
| 72 |
+
enable_semantic (bool, optional): Whether to compute semantic similarity metrics.
|
| 73 |
Requires loading a sentence-transformer model, which can be time-consuming. Defaults to True.
|
| 74 |
enable_fuzzy (bool, optional): Whether to compute fuzzy string similarity metrics.
|
| 75 |
Uses TheFuzz library for approximate string matching. Defaults to True.
|
|
|
|
| 78 |
'token_sort' - Order-normalized token matching
|
| 79 |
'partial' - Best partial token matching
|
| 80 |
'ratio' - Simple ratio matching
|
| 81 |
+
'ngram' - Syllable bigram overlap (recommended for Tibetan)
|
| 82 |
+
'syllable_edit' - Syllable-level edit distance
|
| 83 |
+
'weighted_jaccard' - Frequency-weighted Jaccard
|
| 84 |
+
lcs_normalization (str, optional): How to normalize LCS length. Options:
|
| 85 |
+
'avg' - Divide by average length (default, balanced)
|
| 86 |
+
'min' - Divide by shorter text (detects containment)
|
| 87 |
+
'max' - Divide by longer text (stricter)
|
| 88 |
model_name (str, optional): The Hugging Face sentence-transformer model to use for semantic similarity.
|
| 89 |
+
Must be a valid model identifier on Hugging Face. Defaults to "buddhist-nlp/buddhist-sentence-similarity".
|
| 90 |
use_stopwords (bool, optional): Whether to filter stopwords out before computing the metrics. Defaults to True.
|
| 91 |
use_lite_stopwords (bool, optional): Whether to use the lite stopwords list (common particles only)
|
| 92 |
instead of the comprehensive list. Only applies if use_stopwords is True. Defaults to False.
|
| 93 |
+
normalize_particles (bool, optional): Whether to normalize grammatical particles to canonical forms.
|
| 94 |
+
Treats གི/ཀྱི/གྱི as equivalent, ལ/ར/སུ/ཏུ/དུ as equivalent, etc. Defaults to False.
|
| 95 |
+
tokenization_mode (str, optional): How to tokenize the text. Options are:
|
| 96 |
+
'word' - Keep multi-syllable words together (default, recommended for Jaccard)
|
| 97 |
+
'syllable' - Split into individual syllables (finer granularity)
|
| 98 |
progress_callback (callable, optional): A callback function for reporting progress updates.
|
| 99 |
Should accept a float between 0 and 1 and a description string. Defaults to None.
|
| 100 |
progressive_callback (callable, optional): A callback function for sending incremental results.
|
| 101 |
Used for progressive loading of metrics as they become available. Defaults to None.
|
| 102 |
+
|
| 103 |
Returns:
|
| 104 |
Tuple[pd.DataFrame, pd.DataFrame, str]:
|
| 105 |
- metrics_df: DataFrame with similarity metrics between corresponding chapters of file pairs.
|
|
|
|
| 108 |
- word_counts_df: DataFrame with word counts for each segment (chapter) in each file.
|
| 109 |
Contains columns: 'Filename', 'ChapterNumber', 'SegmentID', 'WordCount'.
|
| 110 |
- warning: A string containing any warnings generated during processing (e.g., missing chapter markers).
|
| 111 |
+
|
| 112 |
Raises:
|
| 113 |
RuntimeError: If the botok tokenizer fails to initialize.
|
| 114 |
ValueError: If the input files cannot be processed or if metrics computation fails.
|
|
|
|
| 148 |
progress_callback(0.3, desc="Unsupported model, continuing without semantic similarity.")
|
| 149 |
except Exception as e:
|
| 150 |
logger.warning(f"Progress callback error (non-critical): {e}")
|
| 151 |
+
|
| 152 |
except Exception as e: # General catch-all for unexpected errors during model loading attempts
|
| 153 |
model_warning = f"An unexpected error occurred while attempting to load model '{model_name}': {e}. Semantic similarity will be disabled."
|
| 154 |
logger.error(model_warning, exc_info=True)
|
|
|
|
| 172 |
progress_callback(0.35, desc="Segmenting texts by chapters...")
|
| 173 |
except Exception as e:
|
| 174 |
logger.warning(f"Progress callback error (non-critical): {e}")
|
| 175 |
+
|
| 176 |
chapter_marker = "༈"
|
| 177 |
fallback = False
|
| 178 |
segment_texts = {}
|
| 179 |
+
|
| 180 |
# Process each file
|
| 181 |
for i, fname in enumerate(filenames):
|
| 182 |
if progress_callback is not None and len(filenames) > 1:
|
| 183 |
try:
|
| 184 |
+
progress_callback(0.35 + (0.05 * (i / len(filenames))),
|
| 185 |
desc=f"Segmenting file {i+1}/{len(filenames)}: {fname}")
|
| 186 |
except Exception as e:
|
| 187 |
logger.warning(f"Progress callback error (non-critical): {e}")
|
| 188 |
+
|
| 189 |
content = text_data[fname]
|
| 190 |
+
|
| 191 |
# Check if content is empty
|
| 192 |
if not content.strip():
|
| 193 |
logger.warning(f"File '{fname}' is empty or contains only whitespace.")
|
| 194 |
continue
|
| 195 |
+
|
| 196 |
# Split by chapter marker if present
|
| 197 |
if chapter_marker in content:
|
| 198 |
segments = [
|
| 199 |
seg.strip() for seg in content.split(chapter_marker) if seg.strip()
|
| 200 |
]
|
| 201 |
+
|
| 202 |
# Check if we have valid segments after splitting
|
| 203 |
if not segments:
|
| 204 |
logger.warning(f"File '{fname}' contains chapter markers but no valid text segments.")
|
| 205 |
continue
|
| 206 |
+
|
| 207 |
for idx, seg in enumerate(segments):
|
| 208 |
seg_id = f"{fname}|chapter {idx+1}"
|
| 209 |
cleaned_seg = clean_tibetan_text(seg)
|
|
|
|
| 214 |
cleaned_content = clean_tibetan_text(content.strip())
|
| 215 |
segment_texts[seg_id] = cleaned_content
|
| 216 |
fallback = True
|
| 217 |
+
|
| 218 |
# Generate warning if no chapter markers found
|
| 219 |
warning = model_warning # Include any model warnings
|
| 220 |
if fallback:
|
|
|
|
| 224 |
"For best results, add a unique marker (e.g., ༈) to separate chapters or sections."
|
| 225 |
)
|
| 226 |
warning = warning + " " + chapter_warning if warning else chapter_warning
|
| 227 |
+
|
| 228 |
# Check if we have any valid segments
|
| 229 |
if not segment_texts:
|
| 230 |
logger.error("No valid text segments found in any of the uploaded files.")
|
|
|
|
| 232 |
# Tokenize all segments at once for efficiency
|
| 233 |
if progress_callback is not None:
|
| 234 |
try:
|
| 235 |
+
progress_callback(0.40, desc="Tokenizing all text segments...")
|
| 236 |
except Exception as e:
|
| 237 |
logger.warning(f"Progress callback error (non-critical): {e}")
|
| 238 |
|
| 239 |
all_segment_ids = list(segment_texts.keys())
|
| 240 |
all_segment_contents = list(segment_texts.values())
|
| 241 |
+
tokenized_segments_list = tokenize_texts(all_segment_contents, mode=tokenization_mode)
|
| 242 |
|
| 243 |
segment_tokens = dict(zip(all_segment_ids, tokenized_segments_list))
|
| 244 |
|
| 245 |
# Group chapters by filename (preserving order)
|
| 246 |
if progress_callback is not None:
|
| 247 |
try:
|
| 248 |
+
progress_callback(0.42, desc="Organizing text segments...")
|
| 249 |
except Exception as e:
|
| 250 |
logger.warning(f"Progress callback error (non-critical): {e}")
|
| 251 |
+
|
| 252 |
file_to_chapters = {}
|
| 253 |
for seg_id in segment_texts:
|
| 254 |
fname = seg_id.split("|")[0]
|
| 255 |
file_to_chapters.setdefault(fname, []).append(seg_id)
|
| 256 |
+
|
| 257 |
# For each pair of files, compare corresponding chapters (by index)
|
| 258 |
if progress_callback is not None:
|
| 259 |
try:
|
| 260 |
progress_callback(0.45, desc="Computing similarity metrics...")
|
| 261 |
except Exception as e:
|
| 262 |
logger.warning(f"Progress callback error (non-critical): {e}")
|
| 263 |
+
|
| 264 |
results = []
|
| 265 |
files = list(file_to_chapters.keys())
|
| 266 |
+
|
| 267 |
# Check if we have at least two files to compare
|
| 268 |
if len(files) < 2:
|
| 269 |
logger.warning("Need at least two files to compute similarity metrics.")
|
| 270 |
return pd.DataFrame(), pd.DataFrame(), "Need at least two files to compute similarity metrics."
|
| 271 |
+
|
| 272 |
# Track total number of comparisons for progress reporting
|
| 273 |
total_comparisons = 0
|
| 274 |
for file1, file2 in combinations(files, 2):
|
| 275 |
chaps1 = file_to_chapters[file1]
|
| 276 |
chaps2 = file_to_chapters[file2]
|
| 277 |
total_comparisons += min(len(chaps1), len(chaps2))
|
| 278 |
+
|
| 279 |
# Initialize results DataFrame for progressive updates
|
| 280 |
results_columns = ['Text Pair', 'Chapter', 'Jaccard Similarity (%)', 'Normalized LCS']
|
| 281 |
if enable_fuzzy:
|
| 282 |
results_columns.append('Fuzzy Similarity')
|
| 283 |
if enable_semantic:
|
| 284 |
results_columns.append('Semantic Similarity')
|
| 285 |
+
|
| 286 |
# Create empty DataFrame with the correct columns
|
| 287 |
progressive_df = pd.DataFrame(columns=results_columns)
|
| 288 |
+
|
| 289 |
# Track which metrics have been completed for progressive updates
|
| 290 |
completed_metrics = []
|
| 291 |
+
|
| 292 |
# Process each file pair
|
| 293 |
comparison_count = 0
|
| 294 |
for file1, file2 in combinations(files, 2):
|
| 295 |
chaps1 = file_to_chapters[file1]
|
| 296 |
chaps2 = file_to_chapters[file2]
|
| 297 |
min_chaps = min(len(chaps1), len(chaps2))
|
| 298 |
+
|
| 299 |
if progress_callback is not None:
|
| 300 |
try:
|
| 301 |
progress_callback(0.45, desc=f"Comparing {file1} with {file2}...")
|
| 302 |
except Exception as e:
|
| 303 |
logger.warning(f"Progress callback error (non-critical): {e}")
|
| 304 |
+
|
| 305 |
for idx in range(min_chaps):
|
| 306 |
seg1 = chaps1[idx]
|
| 307 |
seg2 = chaps2[idx]
|
| 308 |
+
|
| 309 |
# Update progress
|
| 310 |
comparison_count += 1
|
| 311 |
if progress_callback is not None and total_comparisons > 0:
|
| 312 |
try:
|
| 313 |
progress_percentage = 0.45 + (0.25 * (comparison_count / total_comparisons))
|
| 314 |
+
progress_callback(progress_percentage,
|
| 315 |
desc=f"Computing metrics for chapter {idx+1} ({comparison_count}/{total_comparisons})")
|
| 316 |
except Exception as e:
|
| 317 |
logger.warning(f"Progress callback error (non-critical): {e}")
|
| 318 |
+
|
| 319 |
try:
|
| 320 |
# Compute metrics for this chapter pair
|
| 321 |
metrics_df = compute_all_metrics(
|
|
|
|
| 325 |
enable_semantic=enable_semantic,
|
| 326 |
enable_fuzzy=enable_fuzzy,
|
| 327 |
fuzzy_method=fuzzy_method,
|
| 328 |
+
lcs_normalization=lcs_normalization,
|
| 329 |
use_stopwords=use_stopwords,
|
| 330 |
use_lite_stopwords=use_lite_stopwords,
|
| 331 |
+
normalize_particles_opt=normalize_particles,
|
| 332 |
)
|
| 333 |
+
|
| 334 |
# Extract metrics from the DataFrame (should have only one row)
|
| 335 |
if not metrics_df.empty:
|
| 336 |
pair_metrics = metrics_df.iloc[0].to_dict()
|
|
|
|
| 343 |
"Fuzzy Similarity": 0.0 if enable_fuzzy else np.nan,
|
| 344 |
"Semantic Similarity": 0.0 if enable_semantic else np.nan
|
| 345 |
}
|
| 346 |
+
|
| 347 |
# Format the results
|
| 348 |
text_pair = f"{file1} vs {file2}"
|
| 349 |
chapter_num = idx + 1
|
| 350 |
+
|
| 351 |
result_row = {
|
| 352 |
"Text Pair": text_pair,
|
| 353 |
"Chapter": chapter_num,
|
| 354 |
"Jaccard Similarity (%)": pair_metrics["Jaccard Similarity (%)"], # Already in percentage
|
| 355 |
"Normalized LCS": pair_metrics["Normalized LCS"],
|
| 356 |
}
|
| 357 |
+
|
| 358 |
# Add fuzzy similarity if enabled
|
| 359 |
if enable_fuzzy:
|
| 360 |
result_row["Fuzzy Similarity"] = pair_metrics["Fuzzy Similarity"]
|
| 361 |
+
|
| 362 |
# Add semantic similarity if enabled and available
|
| 363 |
if enable_semantic and "Semantic Similarity" in pair_metrics:
|
| 364 |
result_row["Semantic Similarity"] = pair_metrics["Semantic Similarity"]
|
| 365 |
+
|
| 366 |
# Convert the dictionary to a DataFrame before appending
|
| 367 |
result_df = pd.DataFrame([result_row])
|
| 368 |
results.append(result_df)
|
| 369 |
+
|
| 370 |
# Update progressive DataFrame and send update if callback is provided
|
| 371 |
progressive_df = pd.concat(results, ignore_index=True)
|
| 372 |
+
|
| 373 |
# Send progressive update if callback is provided
|
| 374 |
if progressive_callback is not None:
|
| 375 |
# Determine which metrics are complete in this update
|
| 376 |
current_metrics = []
|
| 377 |
+
|
| 378 |
# Always include these basic metrics
|
| 379 |
if "Jaccard Similarity (%)" in progressive_df.columns and MetricType.JACCARD not in completed_metrics:
|
| 380 |
current_metrics.append(MetricType.JACCARD)
|
| 381 |
completed_metrics.append(MetricType.JACCARD)
|
| 382 |
+
|
| 383 |
if "Normalized LCS" in progressive_df.columns and MetricType.LCS not in completed_metrics:
|
| 384 |
current_metrics.append(MetricType.LCS)
|
| 385 |
completed_metrics.append(MetricType.LCS)
|
| 386 |
+
|
| 387 |
# Add fuzzy if enabled and available
|
| 388 |
if enable_fuzzy and "Fuzzy Similarity" in progressive_df.columns and MetricType.FUZZY not in completed_metrics:
|
| 389 |
current_metrics.append(MetricType.FUZZY)
|
| 390 |
completed_metrics.append(MetricType.FUZZY)
|
| 391 |
+
|
| 392 |
# Add semantic if enabled and available
|
| 393 |
if enable_semantic and "Semantic Similarity" in progressive_df.columns and MetricType.SEMANTIC not in completed_metrics:
|
| 394 |
current_metrics.append(MetricType.SEMANTIC)
|
| 395 |
completed_metrics.append(MetricType.SEMANTIC)
|
| 396 |
+
|
| 397 |
# Create word counts DataFrame for progressive update
|
| 398 |
word_counts_data = []
|
| 399 |
for seg_id, tokens in segment_tokens.items():
|
|
|
|
| 406 |
"WordCount": len(tokens)
|
| 407 |
})
|
| 408 |
word_counts_df_progressive = pd.DataFrame(word_counts_data)
|
| 409 |
+
|
| 410 |
# Send the update
|
| 411 |
try:
|
| 412 |
progressive_callback(
|
|
|
|
| 418 |
)
|
| 419 |
except Exception as e:
|
| 420 |
logger.warning(f"Progressive callback error (non-critical): {e}")
|
| 421 |
+
|
| 422 |
except Exception as e:
|
| 423 |
logger.error(f"Error computing metrics for {seg1} vs {seg2}: {e}", exc_info=True)
|
| 424 |
# Continue with other comparisons instead of failing completely
|
| 425 |
continue
|
| 426 |
+
|
| 427 |
# Create the metrics DataFrame
|
| 428 |
if results:
|
| 429 |
# Results are already DataFrames, so we can concatenate them directly
|
|
|
|
| 438 |
progress_callback(0.75, desc="Calculating word counts...")
|
| 439 |
except Exception as e:
|
| 440 |
logger.warning(f"Progress callback error (non-critical): {e}")
|
| 441 |
+
|
| 442 |
word_counts_data = []
|
| 443 |
+
|
| 444 |
# Process each segment
|
| 445 |
for i, (seg_id, text_content) in enumerate(segment_texts.items()):
|
| 446 |
# Update progress
|
|
|
|
| 450 |
progress_callback(progress_percentage, desc=f"Counting words in segment {i+1}/{len(segment_texts)}")
|
| 451 |
except Exception as e:
|
| 452 |
logger.warning(f"Progress callback error (non-critical): {e}")
|
| 453 |
+
|
| 454 |
fname, chapter_info = seg_id.split("|", 1)
|
| 455 |
chapter_num = int(chapter_info.replace("chapter ", ""))
|
| 456 |
+
|
| 457 |
try:
|
| 458 |
# Use botok for accurate word count for raw Tibetan text
|
| 459 |
tokenized_segments = tokenize_texts([text_content]) # Returns a list of lists
|
|
|
|
| 461 |
word_count = len(tokenized_segments[0])
|
| 462 |
else:
|
| 463 |
word_count = 0
|
| 464 |
+
|
| 465 |
word_counts_data.append(
|
| 466 |
{
|
| 467 |
"Filename": fname.replace(".txt", ""),
|
|
|
|
| 481 |
"WordCount": 0,
|
| 482 |
}
|
| 483 |
)
|
| 484 |
+
|
| 485 |
# Create and sort the word counts DataFrame
|
| 486 |
word_counts_df = pd.DataFrame(word_counts_data)
|
| 487 |
if not word_counts_df.empty:
|
| 488 |
word_counts_df = word_counts_df.sort_values(
|
| 489 |
by=["Filename", "ChapterNumber"]
|
| 490 |
).reset_index(drop=True)
|
| 491 |
+
|
| 492 |
if progress_callback is not None:
|
| 493 |
try:
|
| 494 |
progress_callback(0.95, desc="Analysis complete!")
|
| 495 |
except Exception as e:
|
| 496 |
logger.warning(f"Progress callback error (non-critical): {e}")
|
| 497 |
+
|
| 498 |
# Send final progressive update if callback is provided
|
| 499 |
if progressive_callback is not None:
|
| 500 |
try:
|
|
|
|
| 508 |
)
|
| 509 |
except Exception as e:
|
| 510 |
logger.warning(f"Final progressive callback error (non-critical): {e}")
|
| 511 |
+
|
| 512 |
# Return the results
|
| 513 |
return metrics_df, word_counts_df, warning
|
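The `lcs_normalization` options documented in the `process_texts` docstring ('avg', 'min', 'max') can be illustrated with a minimal sketch. The helper names `lcs_length` and `normalized_lcs` are assumptions made for this example; the actual implementation lives in `pipeline/metrics.py` and may differ.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                # Tokens match: extend the LCS of both prefixes by one.
                curr.append(prev[j - 1] + 1)
            else:
                # No match: carry over the best result seen so far.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def normalized_lcs(a: List[str], b: List[str], normalization: str = "avg") -> float:
    """Hypothetical helper: divide the LCS length by the avg, min, or max token count."""
    if not a or not b:
        return 0.0
    lengths = {
        "avg": (len(a) + len(b)) / 2,   # balanced
        "min": min(len(a), len(b)),     # detects containment
        "max": max(len(a), len(b)),     # stricter
    }
    if normalization not in lengths:
        raise ValueError(f"Unknown normalization: {normalization!r}")
    return lcs_length(a, b) / lengths[normalization]
```

With 'min', a short text fully contained in a longer one scores 1.0, which is why the docstring describes it as detecting containment; 'max' penalizes the same pair for the unmatched remainder.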
pipeline/progressive_loader.py
CHANGED
|
@@ -36,15 +36,15 @@ class ProgressiveResult:
|
|
| 36 |
class ProgressiveLoader:
|
| 37 |
"""
|
| 38 |
Manages progressive loading of metrics computation results.
|
| 39 |
-
|
| 40 |
This class handles the incremental updates of metrics as they are computed,
|
| 41 |
allowing the UI to display partial results before the entire computation is complete.
|
| 42 |
"""
|
| 43 |
-
|
| 44 |
def __init__(self, update_callback: Optional[Callable[[ProgressiveResult], None]] = None):
|
| 45 |
"""
|
| 46 |
Initialize the ProgressiveLoader.
|
| 47 |
-
|
| 48 |
Args:
|
| 49 |
update_callback: Function to call when new results are available.
|
| 50 |
Should accept a ProgressiveResult object.
|
|
@@ -57,16 +57,16 @@ class ProgressiveLoader:
|
|
| 57 |
self.is_complete = False
|
| 58 |
self.last_update_time = 0
|
| 59 |
self.update_interval = 0.5 # Minimum seconds between updates to avoid UI thrashing
|
| 60 |
-
|
| 61 |
-
def update(self,
|
| 62 |
metrics_df: Optional[pd.DataFrame] = None,
|
| 63 |
-
word_counts_df: Optional[pd.DataFrame] = None,
|
| 64 |
completed_metric: Optional[MetricType] = None,
|
| 65 |
warning: Optional[str] = None,
|
| 66 |
is_complete: bool = False) -> None:
|
| 67 |
"""
|
| 68 |
Update the progressive results and trigger the callback if enough time has passed.
|
| 69 |
-
|
| 70 |
Args:
|
| 71 |
metrics_df: Updated metrics DataFrame
|
| 72 |
word_counts_df: Updated word counts DataFrame
|
|
@@ -75,27 +75,27 @@ class ProgressiveLoader:
|
|
| 75 |
is_complete: Whether the computation is complete
|
| 76 |
"""
|
| 77 |
current_time = time.time()
|
| 78 |
-
|
| 79 |
# Update internal state
|
| 80 |
if metrics_df is not None:
|
| 81 |
self.metrics_df = metrics_df
|
| 82 |
-
|
| 83 |
if word_counts_df is not None:
|
| 84 |
self.word_counts_df = word_counts_df
|
| 85 |
-
|
| 86 |
if completed_metric is not None and completed_metric not in self.completed_metrics:
|
| 87 |
self.completed_metrics.append(completed_metric)
|
| 88 |
-
|
| 89 |
if warning:
|
| 90 |
self.warning = warning
|
| 91 |
-
|
| 92 |
self.is_complete = is_complete
|
| 93 |
-
|
| 94 |
# Only trigger update if enough time has passed or if this is the final update
|
| 95 |
if (current_time - self.last_update_time >= self.update_interval) or is_complete:
|
| 96 |
self._trigger_update()
|
| 97 |
self.last_update_time = current_time
|
| 98 |
-
|
| 99 |
def _trigger_update(self) -> None:
|
| 100 |
"""Trigger the update callback with the current state."""
|
| 101 |
if self.update_callback:
|
|
|
|
| 36 |
class ProgressiveLoader:
|
| 37 |
"""
|
| 38 |
Manages progressive loading of metrics computation results.
|
| 39 |
+
|
| 40 |
This class handles the incremental updates of metrics as they are computed,
|
| 41 |
allowing the UI to display partial results before the entire computation is complete.
|
| 42 |
"""
|
| 43 |
+
|
| 44 |
def __init__(self, update_callback: Optional[Callable[[ProgressiveResult], None]] = None):
|
| 45 |
"""
|
| 46 |
Initialize the ProgressiveLoader.
|
| 47 |
+
|
| 48 |
Args:
|
| 49 |
update_callback: Function to call when new results are available.
|
| 50 |
Should accept a ProgressiveResult object.
|
|
|
|
| 57 |
self.is_complete = False
|
| 58 |
self.last_update_time = 0
|
| 59 |
self.update_interval = 0.5 # Minimum seconds between updates to avoid UI thrashing
|
| 60 |
+
|
| 61 |
+
def update(self,
|
| 62 |
metrics_df: Optional[pd.DataFrame] = None,
|
| 63 |
+
word_counts_df: Optional[pd.DataFrame] = None,
|
| 64 |
completed_metric: Optional[MetricType] = None,
|
| 65 |
warning: Optional[str] = None,
|
| 66 |
is_complete: bool = False) -> None:
|
| 67 |
"""
|
| 68 |
Update the progressive results and trigger the callback if enough time has passed.
|
| 69 |
+
|
| 70 |
Args:
|
| 71 |
metrics_df: Updated metrics DataFrame
|
| 72 |
word_counts_df: Updated word counts DataFrame
|
|
|
|
| 75 |
is_complete: Whether the computation is complete
|
| 76 |
"""
|
| 77 |
current_time = time.time()
|
| 78 |
+
|
| 79 |
# Update internal state
|
| 80 |
if metrics_df is not None:
|
| 81 |
self.metrics_df = metrics_df
|
| 82 |
+
|
| 83 |
if word_counts_df is not None:
|
| 84 |
self.word_counts_df = word_counts_df
|
| 85 |
+
|
| 86 |
if completed_metric is not None and completed_metric not in self.completed_metrics:
|
| 87 |
self.completed_metrics.append(completed_metric)
|
| 88 |
+
|
| 89 |
if warning:
|
| 90 |
self.warning = warning
|
| 91 |
+
|
| 92 |
self.is_complete = is_complete
|
| 93 |
+
|
| 94 |
# Only trigger update if enough time has passed or if this is the final update
|
| 95 |
if (current_time - self.last_update_time >= self.update_interval) or is_complete:
|
| 96 |
self._trigger_update()
|
| 97 |
self.last_update_time = current_time
|
| 98 |
+
|
| 99 |
def _trigger_update(self) -> None:
|
| 100 |
"""Trigger the update callback with the current state."""
|
| 101 |
if self.update_callback:
|
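The throttling rule in `ProgressiveLoader.update` (fire the callback only when `update_interval` seconds have passed, or when the update is final) can be sketched in isolation. `ThrottledNotifier` and its injectable `clock` argument are hypothetical names for this example. Unlike the code in the diff, the sketch uses `time.monotonic` and a `None` sentinel for the first call.

```python
import time
from typing import Callable, Optional


class ThrottledNotifier:
    """Hypothetical stand-in for the throttling logic, not the real ProgressiveLoader."""

    def __init__(self, callback: Callable[[str], None], interval: float = 0.5,
                 clock: Callable[[], float] = time.monotonic):
        self.callback = callback
        self.interval = interval          # minimum seconds between forwarded updates
        self.clock = clock                # injectable so the behaviour can be tested
        self.last_update_time: Optional[float] = None

    def update(self, payload: str, is_complete: bool = False) -> bool:
        """Forward payload if the interval has elapsed or this is the final update."""
        now = self.clock()
        due = (self.last_update_time is None
               or now - self.last_update_time >= self.interval)
        if due or is_complete:
            self.callback(payload)
            self.last_update_time = now
            return True
        return False  # update suppressed to avoid UI thrashing
```

Passing the clock in as a parameter keeps the sketch deterministic under test; a caller would normally rely on the default.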
pipeline/progressive_ui.py
CHANGED
|
@@ -17,25 +17,25 @@ logger = logging.getLogger(__name__)
|
|
| 17 |
class ProgressiveUI:
|
| 18 |
"""
|
| 19 |
Manages progressive UI updates for the Tibetan Text Metrics app.
|
| 20 |
-
|
| 21 |
This class handles the incremental updates of UI components as metrics
|
| 22 |
are computed, allowing for a more responsive user experience.
|
| 23 |
"""
|
| 24 |
-
|
| 25 |
-
def __init__(self,
|
| 26 |
metrics_preview: gr.Dataframe,
|
| 27 |
word_count_plot: gr.Plot,
|
| 28 |
jaccard_heatmap: gr.Plot,
|
| 29 |
lcs_heatmap: gr.Plot,
|
| 30 |
fuzzy_heatmap: gr.Plot,
|
| 31 |
-
semantic_heatmap: gr.Plot,
|
| 32 |
-
warning_box: gr.Markdown,
|
| 33 |
-
progress_container: gr.Row,
|
| 34 |
-
heatmap_titles: Dict[str, str],
|
| 35 |
structural_btn=None):
|
| 36 |
"""
|
| 37 |
Initialize the ProgressiveUI.
|
| 38 |
-
|
| 39 |
Args:
|
| 40 |
metrics_preview: Gradio Dataframe component for metrics preview
|
| 41 |
word_count_plot: Gradio Plot component for word count visualization
|
|
@@ -55,9 +55,9 @@ class ProgressiveUI:
|
|
| 55 |
self.semantic_heatmap = semantic_heatmap
|
| 56 |
self.warning_box = warning_box
|
| 57 |
self.progress_container = progress_container
|
| 58 |
-
self.heatmap_titles = heatmap_titles
|
| 59 |
self.structural_btn = structural_btn
|
| 60 |
-
|
| 61 |
# Create progress indicators for each metric
|
| 62 |
with self.progress_container:
|
| 63 |
self.jaccard_progress = gr.Markdown("🔄 **Jaccard Similarity:** Waiting...", elem_id="jaccard_progress")
|
|
@@ -65,90 +65,90 @@ class ProgressiveUI:
|
|
| 65 |
self.fuzzy_progress = gr.Markdown("🔄 **Fuzzy Similarity:** Waiting...", elem_id="fuzzy_progress")
|
| 66 |
self.semantic_progress = gr.Markdown("🔄 **Semantic Similarity:** Waiting...", elem_id="semantic_progress")
|
| 67 |
self.word_count_progress = gr.Markdown("🔄 **Word Counts:** Waiting...", elem_id="word_count_progress")
|
| 68 |
-
|
| 69 |
# Track which components have been updated
|
| 70 |
self.updated_components = set()
|
| 71 |
-
|
| 72 |
def update(self, result: ProgressiveResult) -> Dict[gr.components.Component, Any]:
|
| 73 |
"""
|
| 74 |
Update UI components based on progressive results.
|
| 75 |
-
|
| 76 |
Args:
|
| 77 |
result: ProgressiveResult object containing the current state of computation
|
| 78 |
-
|
| 79 |
Returns:
|
| 80 |
Dictionary mapping Gradio components to their updated values
|
| 81 |
"""
|
| 82 |
updates = {}
|
| 83 |
-
|
| 84 |
# Always update metrics preview if we have data
|
| 85 |
if not result.metrics_df.empty:
|
| 86 |
updates[self.metrics_preview] = result.metrics_df.head(10)
|
| 87 |
-
|
| 88 |
# Update warning if present
|
| 89 |
if result.warning:
|
| 90 |
warning_md = f"**⚠️ Warning:** {result.warning}" if result.warning else ""
|
| 91 |
updates[self.warning_box] = gr.update(value=warning_md, visible=True)
|
| 92 |
-
|
| 93 |
# Generate visualizations for completed metrics
|
| 94 |
if not result.metrics_df.empty:
|
| 95 |
# Generate heatmaps for available metrics
|
| 96 |
heatmaps_data = generate_visualizations(
|
| 97 |
result.metrics_df, descriptive_titles=self.heatmap_titles
|
| 98 |
)
|
| 99 |
-
|
| 100 |
# Update heatmaps and progress indicators for completed metrics
|
| 101 |
for metric_type in result.completed_metrics:
|
| 102 |
if metric_type == MetricType.JACCARD:
|
| 103 |
# Update progress indicator
|
| 104 |
updates[self.jaccard_progress] = "✅ **Jaccard Similarity:** Complete"
|
| 105 |
-
|
| 106 |
# Update heatmap if not already updated
|
| 107 |
if self.jaccard_heatmap not in self.updated_components:
|
| 108 |
if "Jaccard Similarity (%)" in heatmaps_data:
|
| 109 |
updates[self.jaccard_heatmap] = heatmaps_data["Jaccard Similarity (%)"]
|
| 110 |
self.updated_components.add(self.jaccard_heatmap)
|
| 111 |
-
|
| 112 |
elif metric_type == MetricType.LCS:
|
| 113 |
# Update progress indicator
|
| 114 |
updates[self.lcs_progress] = "✅ **Normalized LCS:** Complete"
|
| 115 |
-
|
| 116 |
# Update heatmap if not already updated
|
| 117 |
if self.lcs_heatmap not in self.updated_components:
|
| 118 |
if "Normalized LCS" in heatmaps_data:
|
| 119 |
updates[self.lcs_heatmap] = heatmaps_data["Normalized LCS"]
|
| 120 |
self.updated_components.add(self.lcs_heatmap)
|
| 121 |
-
|
| 122 |
elif metric_type == MetricType.FUZZY:
|
| 123 |
# Update progress indicator
|
| 124 |
updates[self.fuzzy_progress] = "✅ **Fuzzy Similarity:** Complete"
|
| 125 |
-
|
| 126 |
# Update heatmap if not already updated
|
| 127 |
if self.fuzzy_heatmap not in self.updated_components:
|
| 128 |
if "Fuzzy Similarity" in heatmaps_data:
|
| 129 |
updates[self.fuzzy_heatmap] = heatmaps_data["Fuzzy Similarity"]
|
| 130 |
self.updated_components.add(self.fuzzy_heatmap)
|
| 131 |
-
|
| 132 |
elif metric_type == MetricType.SEMANTIC:
|
| 133 |
# Update progress indicator
|
| 134 |
updates[self.semantic_progress] = "✅ **Semantic Similarity:** Complete"
|
| 135 |
-
|
| 136 |
# Update heatmap if not already updated
|
| 137 |
if self.semantic_heatmap not in self.updated_components:
|
| 138 |
if "Semantic Similarity" in heatmaps_data:
|
| 139 |
updates[self.semantic_heatmap] = heatmaps_data["Semantic Similarity"]
|
| 140 |
self.updated_components.add(self.semantic_heatmap)
|
| 141 |
-
|
| 142 |
# Generate word count chart if we have data
|
| 143 |
if not result.word_counts_df.empty:
|
| 144 |
# Update progress indicator
|
| 145 |
updates[self.word_count_progress] = "✅ **Word Counts:** Complete"
|
| 146 |
-
|
| 147 |
# Update chart if not already updated
|
| 148 |
if self.word_count_plot not in self.updated_components:
|
| 149 |
updates[self.word_count_plot] = generate_word_count_chart(result.word_counts_df)
|
| 150 |
self.updated_components.add(self.word_count_plot)
|
| 151 |
-
|
| 152 |
# Update progress indicators for metrics in progress
|
| 153 |
if not result.is_complete:
|
| 154 |
# Update progress indicators for metrics that are still in progress
|
|
@@ -167,28 +167,28 @@ class ProgressiveUI:
|
|
| 167 |
if self.structural_btn is not None:
|
| 168 |
updates[self.structural_btn] = gr.update(interactive=True)
|
| 169 |
logger.info("Enabling structural analysis button via progressive UI")
|
| 170 |
-
|
| 171 |
return updates
|
| 172 |
|
| 173 |
|
| 174 |
def create_progressive_callback(progressive_ui: ProgressiveUI) -> Callable:
|
| 175 |
"""
|
| 176 |
Create a callback function for progressive updates.
|
| 177 |
-
|
| 178 |
Args:
|
| 179 |
progressive_ui: ProgressiveUI instance to handle updates
|
| 180 |
-
|
| 181 |
Returns:
|
| 182 |
Callback function that can be passed to process_texts
|
| 183 |
"""
|
| 184 |
-
def callback(metrics_df: pd.DataFrame,
|
| 185 |
word_counts_df: pd.DataFrame,
|
| 186 |
completed_metrics: List[MetricType],
|
| 187 |
warning: str,
|
| 188 |
is_complete: bool) -> None:
|
| 189 |
"""
|
| 190 |
Callback function for progressive updates.
|
| 191 |
-
|
| 192 |
Args:
|
| 193 |
metrics_df: DataFrame with current metrics
|
| 194 |
word_counts_df: DataFrame with word counts
|
|
@@ -203,10 +203,10 @@ def create_progressive_callback(progressive_ui: ProgressiveUI) -> Callable:
|
|
| 203 |
warning=warning,
|
| 204 |
is_complete=is_complete
|
| 205 |
)
|
| 206 |
-
|
| 207 |
# Get updates for UI components
|
| 208 |
updates = progressive_ui.update(result)
|
| 209 |
-
|
| 210 |
# Apply updates to UI components
|
| 211 |
for component, value in updates.items():
|
| 212 |
try:
|
|
@@ -228,5 +228,5 @@ def create_progressive_callback(progressive_ui: ProgressiveUI) -> Callable:
|
|
| 228 |
logger.warning(f"Cannot update component of type {type(component)}")
|
| 229 |
except Exception as e:
|
| 230 |
logger.warning(f"Error updating component: {e}")
|
| 231 |
-
|
| 232 |
return callback
|
|
|
|
| 17 |
class ProgressiveUI:
|
| 18 |
"""
|
| 19 |
Manages progressive UI updates for the Tibetan Text Metrics app.
|
| 20 |
+
|
| 21 |
This class handles the incremental updates of UI components as metrics
|
| 22 |
are computed, allowing for a more responsive user experience.
|
| 23 |
"""
|
| 24 |
+
|
| 25 |
+
def __init__(self,
|
| 26 |
metrics_preview: gr.Dataframe,
|
| 27 |
word_count_plot: gr.Plot,
|
| 28 |
jaccard_heatmap: gr.Plot,
|
| 29 |
lcs_heatmap: gr.Plot,
|
| 30 |
fuzzy_heatmap: gr.Plot,
|
| 31 |
+
semantic_heatmap: gr.Plot = None,
|
| 32 |
+
warning_box: gr.Markdown = None,
|
| 33 |
+
progress_container: gr.Row = None,
|
| 34 |
+
heatmap_titles: Dict[str, str] = None,
|
| 35 |
structural_btn=None):
|
| 36 |
"""
|
| 37 |
Initialize the ProgressiveUI.
|
| 38 |
+
|
| 39 |
Args:
|
| 40 |
metrics_preview: Gradio Dataframe component for metrics preview
|
| 41 |
word_count_plot: Gradio Plot component for word count visualization
|
|
|
|
| 55 |
self.semantic_heatmap = semantic_heatmap
|
| 56 |
self.warning_box = warning_box
|
| 57 |
self.progress_container = progress_container
|
| 58 |
+
self.heatmap_titles = heatmap_titles or {}
|
| 59 |
self.structural_btn = structural_btn
|
| 60 |
+
|
| 61 |
# Create progress indicators for each metric
|
| 62 |
with self.progress_container:
|
| 63 |
self.jaccard_progress = gr.Markdown("🔄 **Jaccard Similarity:** Waiting...", elem_id="jaccard_progress")
|
|
|
|
| 65 |
self.fuzzy_progress = gr.Markdown("🔄 **Fuzzy Similarity:** Waiting...", elem_id="fuzzy_progress")
|
| 66 |
self.semantic_progress = gr.Markdown("🔄 **Semantic Similarity:** Waiting...", elem_id="semantic_progress")
|
| 67 |
self.word_count_progress = gr.Markdown("🔄 **Word Counts:** Waiting...", elem_id="word_count_progress")
|
| 68 |
+
|
| 69 |
# Track which components have been updated
|
| 70 |
self.updated_components = set()
|
| 71 |
+
|
| 72 |
def update(self, result: ProgressiveResult) -> Dict[gr.components.Component, Any]:
|
| 73 |
"""
|
| 74 |
Update UI components based on progressive results.
|
| 75 |
+
|
| 76 |
Args:
|
| 77 |
result: ProgressiveResult object containing the current state of computation
|
| 78 |
+
|
| 79 |
Returns:
|
| 80 |
Dictionary mapping Gradio components to their updated values
|
| 81 |
"""
|
| 82 |
updates = {}
|
| 83 |
+
|
| 84 |
# Always update metrics preview if we have data
|
| 85 |
if not result.metrics_df.empty:
|
| 86 |
updates[self.metrics_preview] = result.metrics_df.head(10)
|
| 87 |
+
|
| 88 |
# Update warning if present
|
| 89 |
if result.warning:
|
| 90 |
warning_md = f"**⚠️ Warning:** {result.warning}"
|
| 91 |
updates[self.warning_box] = gr.update(value=warning_md, visible=True)
|
| 92 |
+
|
| 93 |
# Generate visualizations for completed metrics
|
| 94 |
if not result.metrics_df.empty:
|
| 95 |
# Generate heatmaps for available metrics
|
| 96 |
heatmaps_data = generate_visualizations(
|
| 97 |
result.metrics_df, descriptive_titles=self.heatmap_titles
|
| 98 |
)
|
| 99 |
+
|
| 100 |
# Update heatmaps and progress indicators for completed metrics
|
| 101 |
for metric_type in result.completed_metrics:
|
| 102 |
if metric_type == MetricType.JACCARD:
|
| 103 |
# Update progress indicator
|
| 104 |
updates[self.jaccard_progress] = "✅ **Jaccard Similarity:** Complete"
|
| 105 |
+
|
| 106 |
# Update heatmap if not already updated
|
| 107 |
if self.jaccard_heatmap not in self.updated_components:
|
| 108 |
if "Jaccard Similarity (%)" in heatmaps_data:
|
| 109 |
updates[self.jaccard_heatmap] = heatmaps_data["Jaccard Similarity (%)"]
|
| 110 |
self.updated_components.add(self.jaccard_heatmap)
|
| 111 |
+
|
| 112 |
elif metric_type == MetricType.LCS:
|
| 113 |
# Update progress indicator
|
| 114 |
updates[self.lcs_progress] = "✅ **Normalized LCS:** Complete"
|
| 115 |
+
|
| 116 |
# Update heatmap if not already updated
|
| 117 |
if self.lcs_heatmap not in self.updated_components:
|
| 118 |
if "Normalized LCS" in heatmaps_data:
|
| 119 |
updates[self.lcs_heatmap] = heatmaps_data["Normalized LCS"]
|
| 120 |
self.updated_components.add(self.lcs_heatmap)
|
| 121 |
+
|
| 122 |
elif metric_type == MetricType.FUZZY:
|
| 123 |
# Update progress indicator
|
| 124 |
updates[self.fuzzy_progress] = "✅ **Fuzzy Similarity:** Complete"
|
| 125 |
+
|
| 126 |
# Update heatmap if not already updated
|
| 127 |
if self.fuzzy_heatmap not in self.updated_components:
|
| 128 |
if "Fuzzy Similarity" in heatmaps_data:
|
| 129 |
updates[self.fuzzy_heatmap] = heatmaps_data["Fuzzy Similarity"]
|
| 130 |
self.updated_components.add(self.fuzzy_heatmap)
|
| 131 |
+
|
| 132 |
elif metric_type == MetricType.SEMANTIC:
|
| 133 |
# Update progress indicator
|
| 134 |
updates[self.semantic_progress] = "✅ **Semantic Similarity:** Complete"
|
| 135 |
+
|
| 136 |
# Update heatmap if not already updated
|
| 137 |
if self.semantic_heatmap not in self.updated_components:
|
| 138 |
if "Semantic Similarity" in heatmaps_data:
|
| 139 |
updates[self.semantic_heatmap] = heatmaps_data["Semantic Similarity"]
|
| 140 |
self.updated_components.add(self.semantic_heatmap)
|
| 141 |
+
|
| 142 |
# Generate word count chart if we have data
|
| 143 |
if not result.word_counts_df.empty:
|
| 144 |
# Update progress indicator
|
| 145 |
updates[self.word_count_progress] = "✅ **Word Counts:** Complete"
|
| 146 |
+
|
| 147 |
# Update chart if not already updated
|
| 148 |
if self.word_count_plot not in self.updated_components:
|
| 149 |
updates[self.word_count_plot] = generate_word_count_chart(result.word_counts_df)
|
| 150 |
self.updated_components.add(self.word_count_plot)
|
| 151 |
+
|
| 152 |
# Update progress indicators for metrics in progress
|
| 153 |
if not result.is_complete:
|
| 154 |
# Update progress indicators for metrics that are still in progress
|
|
|
|
| 167 |
if self.structural_btn is not None:
|
| 168 |
updates[self.structural_btn] = gr.update(interactive=True)
|
| 169 |
logger.info("Enabling structural analysis button via progressive UI")
|
| 170 |
+
|
| 171 |
return updates
|
| 172 |
|
| 173 |
|
| 174 |
def create_progressive_callback(progressive_ui: ProgressiveUI) -> Callable:
|
| 175 |
"""
|
| 176 |
Create a callback function for progressive updates.
|
| 177 |
+
|
| 178 |
Args:
|
| 179 |
progressive_ui: ProgressiveUI instance to handle updates
|
| 180 |
+
|
| 181 |
Returns:
|
| 182 |
Callback function that can be passed to process_texts
|
| 183 |
"""
|
| 184 |
+
def callback(metrics_df: pd.DataFrame,
|
| 185 |
word_counts_df: pd.DataFrame,
|
| 186 |
completed_metrics: List[MetricType],
|
| 187 |
warning: str,
|
| 188 |
is_complete: bool) -> None:
|
| 189 |
"""
|
| 190 |
Callback function for progressive updates.
|
| 191 |
+
|
| 192 |
Args:
|
| 193 |
metrics_df: DataFrame with current metrics
|
| 194 |
word_counts_df: DataFrame with word counts
|
|
|
|
| 203 |
warning=warning,
|
| 204 |
is_complete=is_complete
|
| 205 |
)
|
| 206 |
+
|
| 207 |
# Get updates for UI components
|
| 208 |
updates = progressive_ui.update(result)
|
| 209 |
+
|
| 210 |
# Apply updates to UI components
|
| 211 |
for component, value in updates.items():
|
| 212 |
try:
|
|
|
|
| 228 |
logger.warning(f"Cannot update component of type {type(component)}")
|
| 229 |
except Exception as e:
|
| 230 |
logger.warning(f"Error updating component: {e}")
|
| 231 |
+
|
| 232 |
return callback
|
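The once-only update pattern used by `ProgressiveUI.update` (an `updated_components` set that keeps a heatmap from being re-sent after its metric has completed) can be sketched without Gradio. This is a minimal illustration under stated assumptions: `Metric`, `OnceOnlyUpdater`, and the string placeholders standing in for figures are hypothetical names, not part of the project's code.

```python
from enum import Enum, auto


class Metric(Enum):
    """Hypothetical stand-in for the project's MetricType enum."""
    JACCARD = auto()
    LCS = auto()


class OnceOnlyUpdater:
    """Emit a figure for each completed metric at most once."""

    def __init__(self):
        # Metrics whose figure has already been sent to the UI
        self.updated = set()

    def update(self, completed, figures):
        """Return only the figures that have not been emitted before."""
        updates = {}
        for metric in completed:
            if metric not in self.updated and metric in figures:
                updates[metric] = figures[metric]
                self.updated.add(metric)
        return updates
```

Calling `update` repeatedly with a growing list of completed metrics returns each figure only on the first call in which it is available, which is the behavior the progress callback relies on.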
pipeline/stopwords_bo.py
CHANGED
|
@@ -21,13 +21,13 @@ ORDINAL_NUMBERS = [
|
|
| 21 |
|
| 22 |
# Additional stopwords from the comprehensive list, categorized for readability
|
| 23 |
MORE_PARTICLES_SUFFIXES = [
|
| 24 |
-
"འི་", "དུ་", "གིས་", "ཏེ", "གི་", "ཡི་", "ཀྱི་", "པས་", "ཀྱིས་", "ཡི", "ལ", "ནི་", "ར", "དུ",
|
| 25 |
-
"ལས", "གྱིས་", "ས", "ཏེ་", "གྱི་", "དེ", "ཀ་", "སྟེ", "སྟེ་", "ངམ", "ཏོ", "དོ", "དམ་",
|
| 26 |
-
"གྱིན་", "ན", "འམ་", "ཀྱིན་", "ལོ", "ཀྱིས", "བས་", "ཤིག", "གིས", "ཀི་", "ཡིས་", "གྱི", "གི",
|
| 27 |
-
"བམ་", "ཤིག་", "ནམ", "མིན་", "ནམ་", "ངམ་", "རུ་", "ཤས་", "ཏུ", "ཡིས", "གིན་", "གམ་",
|
| 28 |
-
"གྱིས", "ཅང་", "སམ་", "ཞིག", "འང", "རུ", "དང", "ཡ", "འག", "སམ", "ཀ", "འམ", "མམ",
|
| 29 |
-
"དམ", "ཀྱི", "ལམ", "ནོ་", "སོ་", "རམ་", "བོ་", "ཨང་", "ཕྱི", "ཏོ་", "གེ", "གོ", "རོ་", "བོ",
|
| 30 |
-
"པས", "འི", "རམ", "བས", "མཾ་", "པོ", "ག་", "ག", "གམ", "བམ", "མོ་", "མམ་", "ཏམ་", "ངོ",
|
| 31 |
"ཏམ", "གིང་", "ཀྱང" # ཀྱང also in PARTICLES_INITIAL, set() will handle duplicates
|
| 32 |
]
|
| 33 |
|
|
@@ -36,13 +36,13 @@ PRONOUNS_DEMONSTRATIVES = ["འདི", "གཞན་", "དེ་", "རང་"
|
|
| 36 |
VERBS_AUXILIARIES = ["ཡིན་", "མི་", "ལགས་པ", "ཡིན་པ", "ལགས་", "མིན་", "ཡིན་པ་", "མིན", "ཡིན་བ", "ཡིན་ལུགས་"]
|
| 37 |
|
| 38 |
ADVERBS_QUALIFIERS_INTENSIFIERS = [
|
| 39 |
-
"སོགས་", "ཙམ་", "ཡང་", "ཉིད་", "ཞིང་", "རུང་", "ན་རེ", "འང་", "ཁོ་ན་", "འཕྲལ་", "བར་",
|
| 40 |
"ཅུང་ཟད་", "ཙམ་པ་", "ཤ་སྟག་"
|
| 41 |
]
|
| 42 |
|
| 43 |
QUANTIFIERS_DETERMINERS_COLLECTIVES = [
|
| 44 |
-
"རྣམས་", "ཀུན་", "སྙེད་", "བཅས་", "ཡོངས་", "མཐའ་དག་", "དག་", "ཚུ", "ཚང་མ", "ཐམས་ཅད་",
|
| 45 |
-
"ཅིག་", "སྣ་ཚོགས་", "སྙེད་པ", "རེ་རེ་", "འགའ་", "སྤྱི", "དུ་མ", "མ", "ཁོ་ན", "ཚོ", "ལ་ལ་",
|
| 46 |
"སྙེད་པ་", "འབའ་", "སྙེད", "གྲང་", "ཁ་", "ངེ་", "ཅོག་", "རིལ་", "ཉུང་ཤས་", "ཚ་"
|
| 47 |
]
|
| 48 |
|
|
@@ -64,8 +64,19 @@ _ALL_STOPWORDS_CATEGORIZED = (
|
|
| 64 |
INTERJECTIONS_EXCLAMATIONS
|
| 65 |
)
|
| 66 |
|
| 67 |
-
|
| 68 |
-
|
| 69 |
|
| 70 |
# Final set of unique stopwords for efficient Jaccard/LCS filtering (as a set)
|
|
|
|
| 71 |
TIBETAN_STOPWORDS_SET = set(TIBETAN_STOPWORDS)
|
|
|
|
| 21 |
|
| 22 |
# Additional stopwords from the comprehensive list, categorized for readability
|
| 23 |
MORE_PARTICLES_SUFFIXES = [
|
| 24 |
+
"འི་", "དུ་", "གིས་", "ཏེ", "གི་", "ཡི་", "ཀྱི་", "པས་", "ཀྱིས་", "ཡི", "ལ", "ནི་", "ར", "དུ",
|
| 25 |
+
"ལས", "གྱིས་", "ས", "ཏེ་", "གྱི་", "དེ", "ཀ་", "སྟེ", "སྟེ་", "ངམ", "ཏོ", "དོ", "དམ་",
|
| 26 |
+
"གྱིན་", "ན", "འམ་", "ཀྱིན་", "ལོ", "ཀྱིས", "བས་", "ཤིག", "གིས", "ཀི་", "ཡིས་", "གྱི", "གི",
|
| 27 |
+
"བམ་", "ཤིག་", "ནམ", "མིན་", "ནམ་", "ངམ་", "རུ་", "ཤས་", "ཏུ", "ཡིས", "གིན་", "གམ་",
|
| 28 |
+
"གྱིས", "ཅང་", "སམ་", "ཞིག", "འང", "རུ", "དང", "ཡ", "འག", "སམ", "ཀ", "འམ", "མམ",
|
| 29 |
+
"དམ", "ཀྱི", "ལམ", "ནོ་", "སོ་", "རམ་", "བོ་", "ཨང་", "ཕྱི", "ཏོ་", "གེ", "གོ", "རོ་", "བོ",
|
| 30 |
+
"པས", "འི", "རམ", "བས", "མཾ་", "པོ", "ག་", "ག", "གམ", "བམ", "མོ་", "མམ་", "ཏམ་", "ངོ",
|
| 31 |
"ཏམ", "གིང་", "ཀྱང" # ཀྱང also in PARTICLES_INITIAL, set() will handle duplicates
|
| 32 |
]
|
| 33 |
|
|
|
|
| 36 |
VERBS_AUXILIARIES = ["ཡིན་", "མི་", "ལགས་པ", "ཡིན་པ", "ལགས་", "མིན་", "ཡིན་པ་", "མིན", "ཡིན་བ", "ཡིན་ལུགས་"]
|
| 37 |
|
| 38 |
ADVERBS_QUALIFIERS_INTENSIFIERS = [
|
| 39 |
+
"སོགས་", "ཙམ་", "ཡང་", "ཉིད་", "ཞིང་", "རུང་", "ན་རེ", "འང་", "ཁོ་ན་", "འཕྲལ་", "བར་",
|
| 40 |
"ཅུང་ཟད་", "ཙམ་པ་", "ཤ་སྟག་"
|
| 41 |
]
|
| 42 |
|
| 43 |
QUANTIFIERS_DETERMINERS_COLLECTIVES = [
|
| 44 |
+
"རྣམས་", "ཀུན་", "སྙེད་", "བཅས་", "ཡོངས་", "མཐའ་དག་", "དག་", "ཚུ", "ཚང་མ", "ཐམས་ཅད་",
|
| 45 |
+
"ཅིག་", "སྣ་ཚོགས་", "སྙེད་པ", "རེ་རེ་", "འགའ་", "སྤྱི", "དུ་མ", "མ", "ཁོ་ན", "ཚོ", "ལ་ལ་",
|
| 46 |
"སྙེད་པ་", "འབའ་", "སྙེད", "གྲང་", "ཁ་", "ངེ་", "ཅོག་", "རིལ་", "ཉུང་ཤས་", "ཚ་"
|
| 47 |
]
|
| 48 |
|
|
|
|
| 64 |
INTERJECTIONS_EXCLAMATIONS
|
| 65 |
)
|
| 66 |
|
| 67 |
+
def _normalize_tibetan_token(token: str) -> str:
|
| 68 |
+
"""
|
| 69 |
+
Normalize a Tibetan token by removing trailing tsek (་).
|
| 70 |
+
|
| 71 |
+
This ensures consistent matching regardless of whether the tokenizer
|
| 72 |
+
preserves or strips the tsek. Botok's behavior can vary, so we normalize
|
| 73 |
+
both the stopwords and the tokens being compared.
|
| 74 |
+
"""
|
| 75 |
+
return token.rstrip('་')
|
| 76 |
+
|
| 77 |
+
# Final flat list of unique stopwords (normalized to remove trailing tsek)
|
| 78 |
+
TIBETAN_STOPWORDS = list(set(_normalize_tibetan_token(sw) for sw in _ALL_STOPWORDS_CATEGORIZED))
|
| 79 |
|
| 80 |
# Final set of unique stopwords for efficient Jaccard/LCS filtering (as a set)
|
| 81 |
+
# Normalized to match tokenizer output regardless of tsek handling
|
| 82 |
TIBETAN_STOPWORDS_SET = set(TIBETAN_STOPWORDS)
|
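The effect of the tsek normalization added in this file can be shown in isolation. The sketch below is illustrative only: `normalize_token` and `filter_stopwords` are hypothetical names, and the three-entry stopword sample is taken from the particle lists in this diff, not the full set.

```python
TSEK = "\u0f0b"  # Tibetan tsek (་), the syllable delimiter


def normalize_token(token: str) -> str:
    """Strip trailing tsek so a token compares equal with or without it."""
    return token.rstrip(TSEK)


# Small sample from the categorized lists; the first two entries differ
# only by a trailing tsek and collapse into one after normalization.
raw_stopwords = ["གི" + TSEK, "གི", "ཀྱི" + TSEK]
stopwords = {normalize_token(sw) for sw in raw_stopwords}


def filter_stopwords(tokens):
    """Drop tokens whose normalized form is a stopword; keep the rest as-is."""
    return [t for t in tokens if normalize_token(t) not in stopwords]
```

Normalizing both sides of the comparison means the filter works whether or not the tokenizer keeps the trailing tsek, which is the stated motivation for the change.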
pipeline/stopwords_lite_bo.py
CHANGED
|
@@ -15,8 +15,8 @@ MARKERS_AND_PUNCTUATION = ["༈", "།", "༎", "༑"]
|
|
| 15 |
|
| 16 |
# Reduced list of particles and suffixes
|
| 17 |
MORE_PARTICLES_SUFFIXES_LITE = [
|
| 18 |
-
"འི་", "དུ་", "གིས་", "ཏེ", "གི་", "ཡི་", "ཀྱི་", "པས་", "ཀྱིས་", "ཡི", "ལ", "ནི་", "ར", "དུ",
|
| 19 |
-
"ལས", "གྱིས་", "ས", "ཏེ་", "གྱི་", "དེ", "ཀ་", "སྟེ", "སྟེ་", "ངམ", "ཏོ", "དོ", "དམ་",
|
| 20 |
"ན", "འམ་", "ལོ", "ཀྱིས", "བས་", "ཤིག", "གིས", "ཀི་", "ཡིས་", "གྱི", "གི"
|
| 21 |
]
|
| 22 |
|
|
@@ -27,8 +27,18 @@ _ALL_STOPWORDS_CATEGORIZED_LITE = (
|
|
| 27 |
MORE_PARTICLES_SUFFIXES_LITE
|
| 28 |
)
|
| 29 |
|
| 30 |
-
|
| 31 |
-
|
| 32 |
|
| 33 |
# Final set of unique stopwords for efficient Jaccard/LCS filtering (as a set)
|
|
|
|
| 34 |
TIBETAN_STOPWORDS_LITE_SET = set(TIBETAN_STOPWORDS_LITE)
|
|
|
|
| 15 |
|
| 16 |
# Reduced list of particles and suffixes
|
| 17 |
MORE_PARTICLES_SUFFIXES_LITE = [
|
| 18 |
+
"འི་", "དུ་", "གིས་", "ཏེ", "གི་", "ཡི་", "ཀྱི་", "པས་", "ཀྱིས་", "ཡི", "ལ", "ནི་", "ར", "དུ",
|
| 19 |
+
"ལས", "གྱིས་", "ས", "ཏེ་", "གྱི་", "དེ", "ཀ་", "སྟེ", "སྟེ་", "ངམ", "ཏོ", "དོ", "དམ་",
|
| 20 |
"ན", "འམ་", "ལོ", "ཀྱིས", "བས་", "ཤིག", "གིས", "ཀི་", "ཡིས་", "གྱི", "གི"
|
| 21 |
]
|
| 22 |
|
|
|
|
| 27 |
MORE_PARTICLES_SUFFIXES_LITE
|
| 28 |
)
|
| 29 |
|
| 30 |
+
def _normalize_tibetan_token(token: str) -> str:
|
| 31 |
+
"""
|
| 32 |
+
Normalize a Tibetan token by removing trailing tsek (་).
|
| 33 |
+
|
| 34 |
+
This ensures consistent matching regardless of whether the tokenizer
|
| 35 |
+
preserves or strips the tsek.
|
| 36 |
+
"""
|
| 37 |
+
return token.rstrip('་')
|
| 38 |
+
|
| 39 |
+
# Final flat list of unique stopwords (normalized to remove trailing tsek)
|
| 40 |
+
TIBETAN_STOPWORDS_LITE = list(set(_normalize_tibetan_token(sw) for sw in _ALL_STOPWORDS_CATEGORIZED_LITE))
|
| 41 |
|
| 42 |
# Final set of unique stopwords for efficient Jaccard/LCS filtering (as a set)
|
| 43 |
+
# Normalized to match tokenizer output regardless of tsek handling
|
| 44 |
TIBETAN_STOPWORDS_LITE_SET = set(TIBETAN_STOPWORDS_LITE)
|
pipeline/tokenize.py
CHANGED
|
@@ -29,10 +29,10 @@ except ImportError:
|
|
| 29 |
def _get_text_hash(text: str) -> str:
|
| 30 |
"""
|
| 31 |
Generate a hash for the input text to use as a cache key.
|
| 32 |
-
|
| 33 |
Args:
|
| 34 |
text: The input text to hash
|
| 35 |
-
|
| 36 |
Returns:
|
| 37 |
A string representation of the MD5 hash of the input text
|
| 38 |
"""
|
|
@@ -42,17 +42,17 @@ def _get_text_hash(text: str) -> str:
|
|
| 42 |
def tokenize_texts(texts: List[str], mode: str = "syllable") -> List[List[str]]:
|
| 43 |
"""
|
| 44 |
Tokenizes a list of raw Tibetan texts using botok, with caching for performance.
|
| 45 |
-
|
| 46 |
This function maintains an in-memory cache of previously tokenized texts to avoid
|
| 47 |
redundant processing of the same content. The cache uses MD5 hashes of the input
|
| 48 |
texts as keys.
|
| 49 |
-
|
| 50 |
Args:
|
| 51 |
texts: List of raw text strings to tokenize.
|
| 52 |
-
|
| 53 |
Returns:
|
| 54 |
List of tokenized texts (each as a list of tokens).
|
| 55 |
-
|
| 56 |
Raises:
|
| 57 |
RuntimeError: If the botok tokenizer failed to initialize.
|
| 58 |
"""
|
|
@@ -68,18 +68,18 @@ def tokenize_texts(texts: List[str], mode: str = "syllable") -> List[List[str]]:
|
|
| 68 |
if mode not in ["word", "syllable"]:
|
| 69 |
logger.warning(f"Invalid tokenization mode: '{mode}'. Defaulting to 'syllable'.")
|
| 70 |
mode = "syllable"
|
| 71 |
-
|
| 72 |
# Process each text
|
| 73 |
for text_content in texts:
|
| 74 |
# Skip empty texts
|
| 75 |
if not text_content.strip():
|
| 76 |
tokenized_texts_list.append([])
|
| 77 |
continue
|
| 78 |
-
|
| 79 |
# Generate hash for cache lookup
|
| 80 |
cache_key_string = text_content + f"_{mode}" # Include mode in string for hashing
|
| 81 |
text_hash = _get_text_hash(cache_key_string)
|
| 82 |
-
|
| 83 |
# Check if we have this text in cache
|
| 84 |
if text_hash in _tokenization_cache:
|
| 85 |
# Cache hit - use cached tokens
|
|
@@ -91,7 +91,7 @@ def tokenize_texts(texts: List[str], mode: str = "syllable") -> List[List[str]]:
|
|
| 91 |
current_tokens = []
|
| 92 |
if BOTOK_TOKENIZER:
|
| 93 |
raw_botok_items = list(BOTOK_TOKENIZER.tokenize(text_content))
|
| 94 |
-
|
| 95 |
if mode == "word":
|
| 96 |
for item_idx, w in enumerate(raw_botok_items):
|
| 97 |
if hasattr(w, 'text') and isinstance(w.text, str):
|
|
@@ -125,7 +125,7 @@ def tokenize_texts(texts: List[str], mode: str = "syllable") -> List[List[str]]:
|
|
| 125 |
f"for hash {text_hash[:8]}. Skipping this syllable."
|
| 126 |
)
|
| 127 |
continue
|
| 128 |
-
|
| 129 |
if syllable_to_process is not None:
|
| 130 |
stripped_syl = syllable_to_process.strip()
|
| 131 |
if stripped_syl:
|
|
@@ -155,20 +155,20 @@ def tokenize_texts(texts: List[str], mode: str = "syllable") -> List[List[str]]:
|
|
| 155 |
else:
|
| 156 |
logger.error(f"BOTOK_TOKENIZER is None for text hash {text_hash[:8]}, cannot tokenize (mode: {mode}).")
|
| 157 |
tokens = []
|
| 158 |
-
|
| 159 |
# Store in cache if not empty
|
| 160 |
if tokens:
|
| 161 |
# If the cache is full, evict the oldest entry (simple FIFO strategy)
|
| 162 |
if len(_tokenization_cache) >= MAX_CACHE_SIZE:
|
| 163 |
# Remove the first key; a plain dict preserves insertion order (Python 3.7+), so this is the oldest entry
|
| 164 |
_tokenization_cache.pop(next(iter(_tokenization_cache)))
|
| 165 |
-
|
| 166 |
_tokenization_cache[text_hash] = tokens
|
| 167 |
logger.debug(f"Added tokens to cache with hash {text_hash[:8]}... (mode: {mode})")
|
| 168 |
except Exception as e:
|
| 169 |
logger.error(f"Error tokenizing text (mode: {mode}): {e}")
|
| 170 |
tokens = []
|
| 171 |
-
|
| 172 |
tokenized_texts_list.append(tokens)
|
| 173 |
-
|
| 174 |
return tokenized_texts_list
|
|
|
|
| 29 |
def _get_text_hash(text: str) -> str:
|
| 30 |
"""
|
| 31 |
Generate a hash for the input text to use as a cache key.
|
| 32 |
+
|
| 33 |
Args:
|
| 34 |
text: The input text to hash
|
| 35 |
+
|
| 36 |
Returns:
|
| 37 |
A string representation of the MD5 hash of the input text
|
| 38 |
"""
|
|
|
|
| 42 |
def tokenize_texts(texts: List[str], mode: str = "syllable") -> List[List[str]]:
|
| 43 |
"""
|
| 44 |
Tokenizes a list of raw Tibetan texts using botok, with caching for performance.
|
| 45 |
+
|
| 46 |
This function maintains an in-memory cache of previously tokenized texts to avoid
|
| 47 |
redundant processing of the same content. The cache uses MD5 hashes of the input
|
| 48 |
texts as keys.
|
| 49 |
+
|
| 50 |
Args:
|
| 51 |
texts: List of raw text strings to tokenize.
|
| 52 |
+
|
| 53 |
Returns:
|
| 54 |
List of tokenized texts (each as a list of tokens).
|
| 55 |
+
|
| 56 |
Raises:
|
| 57 |
RuntimeError: If the botok tokenizer failed to initialize.
|
| 58 |
"""
|
|
|
|
| 68 |
if mode not in ["word", "syllable"]:
|
| 69 |
logger.warning(f"Invalid tokenization mode: '{mode}'. Defaulting to 'syllable'.")
|
| 70 |
mode = "syllable"
|
| 71 |
+
|
| 72 |
# Process each text
|
| 73 |
for text_content in texts:
|
| 74 |
# Skip empty texts
|
| 75 |
if not text_content.strip():
|
| 76 |
tokenized_texts_list.append([])
|
| 77 |
continue
|
| 78 |
+
|
| 79 |
# Generate hash for cache lookup
|
| 80 |
cache_key_string = text_content + f"_{mode}" # Include mode in string for hashing
|
| 81 |
text_hash = _get_text_hash(cache_key_string)
|
| 82 |
+
|
| 83 |
# Check if we have this text in cache
|
| 84 |
if text_hash in _tokenization_cache:
|
| 85 |
# Cache hit - use cached tokens
|
|
|
|
| 91 |
current_tokens = []
|
| 92 |
if BOTOK_TOKENIZER:
|
| 93 |
raw_botok_items = list(BOTOK_TOKENIZER.tokenize(text_content))
|
| 94 |
+
|
| 95 |
if mode == "word":
|
| 96 |
for item_idx, w in enumerate(raw_botok_items):
|
| 97 |
if hasattr(w, 'text') and isinstance(w.text, str):
|
|
|
|
| 125 |
f"for hash {text_hash[:8]}. Skipping this syllable."
|
| 126 |
)
|
| 127 |
continue
|
| 128 |
+
|
| 129 |
if syllable_to_process is not None:
|
| 130 |
stripped_syl = syllable_to_process.strip()
|
| 131 |
if stripped_syl:
|
|
|
|
| 155 |
else:
|
| 156 |
logger.error(f"BOTOK_TOKENIZER is None for text hash {text_hash[:8]}, cannot tokenize (mode: {mode}).")
|
| 157 |
tokens = []
|
| 158 |
+
|
| 159 |
# Store in cache if not empty
|
| 160 |
if tokens:
|
| 161 |
# If the cache is full, evict the oldest entry (simple FIFO strategy)
|
| 162 |
if len(_tokenization_cache) >= MAX_CACHE_SIZE:
|
| 163 |
# Remove the first key; a plain dict preserves insertion order (Python 3.7+), so this is the oldest entry
|
| 164 |
_tokenization_cache.pop(next(iter(_tokenization_cache)))
|
| 165 |
+
|
| 166 |
_tokenization_cache[text_hash] = tokens
|
| 167 |
logger.debug(f"Added tokens to cache with hash {text_hash[:8]}... (mode: {mode})")
|
| 168 |
except Exception as e:
|
| 169 |
logger.error(f"Error tokenizing text (mode: {mode}): {e}")
|
| 170 |
tokens = []
|
| 171 |
+
|
| 172 |
tokenized_texts_list.append(tokens)
|
| 173 |
+
|
| 174 |
return tokenized_texts_list
|
pipeline/visualize.py
CHANGED
|
@@ -40,29 +40,29 @@ def generate_visualizations(metrics_df: pd.DataFrame, descriptive_titles: dict =
|
|
| 40 |
continue
|
| 41 |
|
| 42 |
cleaned_columns = [col.replace(".txt", "") for col in pivot.columns]
|
| 43 |
-
|
| 44 |
# For consistent interpretation: higher values (more similarity) = darker colors
|
| 45 |
# Using 'Reds' colormap for all metrics (dark red = high similarity)
|
| 46 |
-
cmap = "Reds"
|
| 47 |
-
|
| 48 |
# Format values for display
|
| 49 |
text = [
|
| 50 |
[f"{val:.2f}" if pd.notnull(val) else "" for val in row]
|
| 51 |
for row in pivot.values
|
| 52 |
]
|
| 53 |
-
|
| 54 |
# Create a copy of the pivot data for visualization
|
| 55 |
# The same color scale is used for every metric (including LCS and
|
| 56 |
# Semantic Similarity), so the values are copied without reversal
|
| 57 |
viz_values = pivot.values.copy()
|
| 58 |
-
|
| 59 |
# Determine if we need to reverse the values for consistent color interpretation
|
| 60 |
# (darker = more similar across all metrics)
|
| 61 |
reverse_colorscale = False
|
| 62 |
-
|
| 63 |
# All metrics should have darker colors for higher similarity
|
| 64 |
# No need to reverse values anymore - we'll use the same scale for all
|
| 65 |
-
|
| 66 |
fig = go.Figure(
|
| 67 |
data=go.Heatmap(
|
| 68 |
z=viz_values,
|
|
|
|
| 40 |
continue
|
| 41 |
|
| 42 |
cleaned_columns = [col.replace(".txt", "") for col in pivot.columns]
|
| 43 |
+
|
| 44 |
# For consistent interpretation: higher values (more similarity) = darker colors
|
| 45 |
# Using 'Reds' colormap for all metrics (dark red = high similarity)
|
| 46 |
+
cmap = "Reds"
|
| 47 |
+
|
| 48 |
# Format values for display
|
| 49 |
text = [
|
| 50 |
[f"{val:.2f}" if pd.notnull(val) else "" for val in row]
|
| 51 |
for row in pivot.values
|
| 52 |
]
|
| 53 |
+
|
| 54 |
# Create a copy of the pivot data for visualization
|
| 55 |
# The same color scale is used for every metric (including LCS and
|
| 56 |
# Semantic Similarity), so the values are copied without reversal
|
| 57 |
viz_values = pivot.values.copy()
|
| 58 |
+
|
| 59 |
# Determine if we need to reverse the values for consistent color interpretation
|
| 60 |
# (darker = more similar across all metrics)
|
| 61 |
reverse_colorscale = False
|
| 62 |
+
|
| 63 |
# All metrics should have darker colors for higher similarity
|
| 64 |
# No need to reverse values anymore - we'll use the same scale for all
|
| 65 |
+
|
| 66 |
fig = go.Figure(
|
| 67 |
data=go.Heatmap(
|
| 68 |
z=viz_values,
|
pyproject.toml
CHANGED
|
@@ -1,8 +1,33 @@
|
|
| 1 |
[build-system]
|
| 2 |
requires = [
|
| 3 |
-
"setuptools>=
|
| 4 |
-
"Cython>=0
|
| 5 |
-
"numpy>=1.
|
| 6 |
]
|
| 7 |
build-backend = "setuptools.build_meta"
|
| 8 |
-
|
|
|
|
| 1 |
[build-system]
|
| 2 |
requires = [
|
| 3 |
+
"setuptools>=65",
|
| 4 |
+
"Cython>=3.0",
|
| 5 |
+
"numpy>=1.24"
|
| 6 |
]
|
| 7 |
build-backend = "setuptools.build_meta"
|
| 8 |
+
|
| 9 |
+
[project]
|
| 10 |
+
name = "tibetan-text-metrics-webapp"
|
| 11 |
+
version = "0.4.0"
|
| 12 |
+
description = "Web application for computing text similarity metrics on Tibetan texts"
|
| 13 |
+
readme = "README.md"
|
| 14 |
+
license = {text = "CC-BY-4.0"}
|
| 15 |
+
requires-python = ">=3.10"
|
| 16 |
+
authors = [
|
| 17 |
+
{name = "Daniel Wojahn", email = "[email protected]"}
|
| 18 |
+
]
|
| 19 |
+
keywords = ["tibetan", "nlp", "text-similarity", "buddhist-texts"]
|
| 20 |
+
classifiers = [
|
| 21 |
+
"Development Status :: 4 - Beta",
|
| 22 |
+
"Intended Audience :: Science/Research",
|
| 23 |
+
"License :: OSI Approved",
|
| 24 |
+
"Programming Language :: Python :: 3",
|
| 25 |
+
"Programming Language :: Python :: 3.10",
|
| 26 |
+
"Programming Language :: Python :: 3.11",
|
| 27 |
+
"Programming Language :: Python :: 3.12",
|
| 28 |
+
"Topic :: Text Processing :: Linguistic",
|
| 29 |
+
]
|
| 30 |
+
|
| 31 |
+
[project.urls]
|
| 32 |
+
Homepage = "https://github.com/daniel-wojahn/tibetan-text-metrics"
|
| 33 |
+
Repository = "https://github.com/daniel-wojahn/tibetan-text-metrics"
|
requirements.txt
CHANGED
|
@@ -1,5 +1,6 @@
|
|
| 1 |
# Core application and UI
|
| 2 |
-
|
|
|
|
| 3 |
pandas==2.2.3
|
| 4 |
|
| 5 |
# Plotting and visualization
|
|
|
|
| 1 |
# Core application and UI
|
| 2 |
+
# Gradio 5.x (code is forward-compatible with Gradio 6)
|
| 3 |
+
gradio>=5.0.0
|
| 4 |
pandas==2.2.3
|
| 5 |
|
| 6 |
# Plotting and visualization
|
setup.py
CHANGED
|
@@ -1,45 +1,39 @@
|
|
| 1 |
import numpy
|
| 2 |
from setuptools import Extension, setup
|
| 3 |
from Cython.Build import cythonize
|
| 4 |
|
| 5 |
-
# It's good practice to specify encoding for portability
|
| 6 |
-
with open("README.md", "r", encoding="utf-8") as fh:
|
| 7 |
-
long_description = fh.read()
|
| 8 |
-
|
| 9 |
setup(
|
| 10 |
-
name="tibetan
|
| 11 |
-
version="0.
|
| 12 |
-
author="Daniel Wojahn
|
| 13 |
author_email="[email protected]",
|
| 14 |
-
description="Cython
|
| 15 |
-
long_description=long_description,
|
| 16 |
-
long_description_content_type="text/markdown",
|
| 17 |
url="https://github.com/daniel-wojahn/tibetan-text-metrics",
|
| 18 |
ext_modules=cythonize(
|
| 19 |
[
|
| 20 |
Extension(
|
| 21 |
-
"pipeline.fast_lcs",
|
| 22 |
["pipeline/fast_lcs.pyx"],
|
| 23 |
include_dirs=[numpy.get_include()],
|
| 24 |
)
|
| 25 |
],
|
| 26 |
-
compiler_directives={
|
| 27 |
),
|
| 28 |
-
|
| 29 |
-
|
| 30 |
-
|
| 31 |
-
# Although this setup.py is in webapp, it's building modules for the 'pipeline' sub-package
|
| 32 |
-
# We don't list packages here as this setup.py is just for the extension.
|
| 33 |
-
# The main app will treat 'pipeline' as a regular package.
|
| 34 |
-
zip_safe=False, # Cython extensions are generally not zip-safe
|
| 35 |
-
classifiers=[
|
| 36 |
-
"Programming Language :: Python :: 3",
|
| 37 |
-
"License :: OSI Approved :: MIT License",
|
| 38 |
-
"Operating System :: OS Independent",
|
| 39 |
-
],
|
| 40 |
-
python_requires='>=3.8',
|
| 41 |
install_requires=[
|
| 42 |
-
"numpy>=1.
|
| 43 |
],
|
| 44 |
-
# setup_requires is deprecated, use pyproject.toml for build-system requirements
|
| 45 |
)
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Setup script for building Cython extensions.
|
| 3 |
+
|
| 4 |
+
This setup.py is used to compile the fast_lcs Cython extension for
|
| 5 |
+
improved LCS calculation performance. The main project metadata is
|
| 6 |
+
in pyproject.toml.
|
| 7 |
+
|
| 8 |
+
Usage:
|
| 9 |
+
python setup.py build_ext --inplace
|
| 10 |
+
"""
|
| 11 |
+
|
| 12 |
import numpy
|
| 13 |
from setuptools import Extension, setup
|
| 14 |
from Cython.Build import cythonize
|
| 15 |
|
| 16 |
setup(
|
| 17 |
+
name="tibetan-text-metrics-webapp",
|
| 18 |
+
version="0.4.0",
|
| 19 |
+
author="Daniel Wojahn",
|
| 20 |
author_email="[email protected]",
|
| 21 |
+
description="Cython LCS extension for Tibetan Text Metrics Webapp",
|
| 22 |
url="https://github.com/daniel-wojahn/tibetan-text-metrics",
|
| 23 |
ext_modules=cythonize(
|
| 24 |
[
|
| 25 |
Extension(
|
| 26 |
+
"pipeline.fast_lcs",
|
| 27 |
["pipeline/fast_lcs.pyx"],
|
| 28 |
include_dirs=[numpy.get_include()],
|
| 29 |
)
|
| 30 |
],
|
| 31 |
+
compiler_directives={"language_level": "3"}
|
| 32 |
),
|
| 33 |
+
include_package_data=True,
|
| 34 |
+
zip_safe=False,
|
| 35 |
+
python_requires=">=3.10",
|
| 36 |
install_requires=[
|
| 37 |
+
"numpy>=1.24",
|
| 38 |
],
|
|
|
|
| 39 |
)
|
theme.py
CHANGED
|
@@ -1,273 +1,408 @@
|
|
| 1 |
import gradio as gr
|
| 2 |
from gradio.themes.utils import colors, sizes, fonts
|
| 3 |
|
| 4 |
|
| 5 |
class TibetanAppTheme(gr.themes.Soft):
|
| 6 |
def __init__(self):
|
| 7 |
super().__init__(
|
| 8 |
-
primary_hue=colors.blue,
|
| 9 |
-
secondary_hue=colors.orange,
|
| 10 |
-
neutral_hue=colors.slate,
|
| 11 |
font=[
|
| 12 |
fonts.GoogleFont("Inter"),
|
| 13 |
"ui-sans-serif",
|
| 14 |
"system-ui",
|
| 15 |
"sans-serif",
|
| 16 |
],
|
| 17 |
-
radius_size=sizes.radius_md,
|
| 18 |
-
text_size=sizes.text_md,
|
| 19 |
)
|
|
|
|
|
|
|
| 20 |
self.theme_vars_for_set = {
|
| 21 |
# Global & Body Styles
|
| 22 |
"body_background_fill": "#f0f2f5",
|
| 23 |
"body_text_color": "#333333",
|
| 24 |
-
#
|
|
|
|
|
|
|
| 25 |
"block_background_fill": "#ffffff",
|
| 26 |
-
"block_radius": "16px",
|
| 27 |
"block_shadow": "0 4px 12px rgba(0, 0, 0, 0.08)",
|
| 28 |
"block_padding": "24px",
|
| 29 |
"block_border_width": "0px",
|
| 30 |
-
|
| 31 |
-
|
| 32 |
-
# Button Styles
|
| 33 |
"button_secondary_background_fill": "#ffffff",
|
| 34 |
"button_secondary_text_color": "#374151",
|
| 35 |
"button_secondary_border_color": "#d1d5db",
|
| 36 |
"button_secondary_border_color_hover": "#adb5bd",
|
| 37 |
"button_secondary_background_fill_hover": "#f9fafb",
|
| 38 |
-
|
|
|
|
| 39 |
"button_primary_background_fill": "#2563eb",
|
| 40 |
"button_primary_text_color": "#ffffff",
|
| 41 |
"button_primary_border_color": "transparent",
|
| 42 |
"button_primary_background_fill_hover": "#1d4ed8",
|
| 43 |
-
|
|
|
|
| 44 |
"border_color_accent_subdued": "#e5e7eb",
|
| 45 |
-
|
| 46 |
-
super().set(**self.theme_vars_for_set)
|
| 47 |
|
| 48 |
-
|
| 49 |
-
|
| 50 |
-
"
|
| 51 |
-
|
| 52 |
-
"font-size": "16px !important",
|
| 53 |
-
"line-height": "1.6 !important",
|
| 54 |
-
"color": "#333333 !important",
|
| 55 |
-
},
|
| 56 |
-
".gr-group": {"margin-bottom": "24px !important"}, # min-height removed
|
| 57 |
-
".gr-markdown": {
|
| 58 |
-
"background": "transparent !important",
|
| 59 |
-
"font-size": "1em !important",
|
| 60 |
-
"margin-bottom": "16px !important",
|
| 61 |
-
},
|
| 62 |
-
".gr-markdown h1": {
|
| 63 |
-
"font-size": "28px !important",
|
| 64 |
-
"font-weight": "600 !important",
|
| 65 |
-
"margin-bottom": "8px !important",
|
| 66 |
-
"color": "#111827 !important",
|
| 67 |
-
},
|
| 68 |
-
".gr-markdown h2": {
|
| 69 |
-
"font-size": "26px !important",
|
| 70 |
-
"font-weight": "600 !important",
|
| 71 |
-
"color": "var(--primary-600, #2563eb) !important",
|
| 72 |
-
"margin-top": "32px !important",
|
| 73 |
-
"margin-bottom": "16px !important",
|
| 74 |
-
},
|
| 75 |
-
".gr-markdown h3": {
|
| 76 |
-
"font-size": "22px !important",
|
| 77 |
-
"font-weight": "600 !important",
|
| 78 |
-
"color": "#1f2937 !important",
|
| 79 |
-
"margin-top": "24px !important",
|
| 80 |
-
"margin-bottom": "12px !important",
|
| 81 |
-
},
|
| 82 |
-
".gr-markdown p, .gr-markdown span": {
|
| 83 |
-
"font-size": "16px !important",
|
| 84 |
-
"color": "#4b5563 !important",
|
| 85 |
-
},
|
| 86 |
-
".gr-button button": {
|
| 87 |
-
"border-radius": "8px !important",
|
| 88 |
-
"padding": "10px 20px !important",
|
| 89 |
-
"font-weight": "500 !important",
|
| 90 |
-
"box-shadow": "0 1px 2px 0 rgba(0, 0, 0, 0.05) !important",
|
| 91 |
-
"border": "1px solid #d1d5db !important",
|
| 92 |
-
"background-color": "#ffffff !important",
|
| 93 |
-
"color": "#374151 !important",
|
| 94 |
-
},
|
| 95 |
-
"#run-btn": {
|
| 96 |
-
"background": "var(--button-primary-background-fill) !important",
|
| 97 |
-
"color": "var(--button-primary-text-color) !important",
|
| 98 |
-
"font-weight": "bold !important",
|
| 99 |
-
"font-size": "24px !important",
|
| 100 |
-
"border": "none !important",
|
| 101 |
-
"box-shadow": "var(--button-primary-shadow) !important",
|
| 102 |
-
},
|
| 103 |
-
"#run-btn:hover": { # Changed selector
|
| 104 |
-
"background": "var(--button-primary-background-fill-hover) !important",
|
| 105 |
-
"box-shadow": "0px 4px 12px rgba(0, 0, 0, 0.15) !important",
|
| 106 |
-
"transform": "translateY(-1px) !important",
|
| 107 |
-
},
|
| 108 |
-
".gr-button button:hover": {
|
| 109 |
-
"background-color": "#f9fafb !important",
|
| 110 |
-
"border-color": "#adb5bd !important",
|
| 111 |
-
},
|
| 112 |
-
"hr": {
|
| 113 |
-
"margin": "32px 0 !important",
|
| 114 |
-
"border": "none !important",
|
| 115 |
-
"border-top": "1px solid var(--border-color-accent-subdued) !important",
|
| 116 |
-
},
|
| 117 |
-
".gr-slider, .gr-radio, .gr-file": {"margin-bottom": "20px !important"},
|
| 118 |
-
".gr-radio .gr-form button": {
|
| 119 |
-
"background-color": "#f3f4f6 !important",
|
| 120 |
-
"color": "#374151 !important",
|
| 121 |
-
"border": "1px solid #d1d5db !important",
|
| 122 |
-
"border-radius": "6px !important",
|
| 123 |
-
"padding": "8px 16px !important",
|
| 124 |
-
"font-weight": "500 !important",
|
| 125 |
-
},
|
| 126 |
-
".gr-radio .gr-form button:hover": {
|
| 127 |
-
"background-color": "#e5e7eb !important",
|
| 128 |
-
"border-color": "#9ca3af !important",
|
| 129 |
-
},
|
| 130 |
-
".gr-radio .gr-form button.selected": {
|
| 131 |
-
"background-color": "var(--primary-500, #3b82f6) !important",
|
| 132 |
-
"color": "#ffffff !important",
|
| 133 |
-
"border-color": "var(--primary-500, #3b82f6) !important",
|
| 134 |
-
},
|
| 135 |
-
".gr-radio .gr-form button.selected:hover": {
|
| 136 |
-
"background-color": "var(--primary-600, #2563eb) !important",
|
| 137 |
-
"border-color": "var(--primary-600, #2563eb) !important",
|
| 138 |
-
},
|
| 139 |
-
"#semantic-radio-group span": { # General selector, refined size
|
| 140 |
-
"font-size": "17px !important",
|
| 141 |
-
"font-weight": "500 !important",
|
| 142 |
-
},
|
| 143 |
-
"#semantic-radio-group div": { # General selector, refined size
|
| 144 |
-
"font-size": "14px !important"
|
| 145 |
-
},
|
| 146 |
-
# Row and Column flex styles for equal height
|
| 147 |
-
"#steps-row": {
|
| 148 |
-
"display": "flex !important",
|
| 149 |
-
"align-items": "stretch !important",
|
| 150 |
-
},
|
| 151 |
-
".step-column": {
|
| 152 |
-
"display": "flex !important",
|
| 153 |
-
"flex-direction": "column !important",
|
| 154 |
-
},
|
| 155 |
-
".step-column > .gr-group": {
|
| 156 |
-
"flex-grow": "1 !important",
|
| 157 |
-
"display": "flex !important",
|
| 158 |
-
"flex-direction": "column !important",
|
| 159 |
-
},
|
| 160 |
-
".tabs > .tab-nav": {"border-bottom": "1px solid #d1d5db !important"},
|
| 161 |
-
".tabs > .tab-nav > button.selected": {
|
| 162 |
-
"border-bottom": "2px solid var(--primary-500) !important",
|
| 163 |
-
"color": "var(--primary-500) !important",
|
| 164 |
-
"background-color": "transparent !important",
|
| 165 |
-
},
|
| 166 |
-
".tabs > .tab-nav > button": {
|
| 167 |
-
"color": "#6b7280 !important",
|
| 168 |
-
"background-color": "transparent !important",
|
| 169 |
-
"padding": "10px 15px !important",
|
| 170 |
-
"border-bottom": "2px solid transparent !important",
|
| 171 |
-
},
|
| 172 |
-
|
| 173 |
-
# Custom styling for metric accordions
|
| 174 |
-
".metric-info-accordion": {
|
| 175 |
-
"border-left": "4px solid #3B82F6 !important",
|
| 176 |
-
"margin-bottom": "1rem !important",
|
| 177 |
-
"background-color": "#F8FAFC !important",
|
| 178 |
-
"border-radius": "6px !important",
|
| 179 |
-
"overflow": "hidden !important",
|
| 180 |
-
},
|
| 181 |
-
".jaccard-info": {
|
| 182 |
-
"border-left-color": "#3B82F6 !important", # Blue
|
| 183 |
-
},
|
| 184 |
-
".lcs-info": {
|
| 185 |
-
"border-left-color": "#10B981 !important", # Green
|
| 186 |
-
},
|
| 187 |
-
".semantic-info": {
|
| 188 |
-
"border-left-color": "#8B5CF6 !important", # Purple
|
| 189 |
-
},
|
| 190 |
-
".wordcount-info": {
|
| 191 |
-
"border-left-color": "#EC4899 !important", # Pink
|
| 192 |
-
},
|
| 193 |
-
|
| 194 |
-
# Accordion header styling
|
| 195 |
-
".metric-info-accordion > .label-wrap": {
|
| 196 |
-
"font-weight": "600 !important",
|
| 197 |
-
"padding": "12px 16px !important",
|
| 198 |
-
"background-color": "#F1F5F9 !important",
|
| 199 |
-
"border-bottom": "1px solid #E2E8F0 !important",
|
| 200 |
-
},
|
| 201 |
-
|
| 202 |
-
# Accordion content styling
|
| 203 |
-
".metric-info-accordion > .wrap": {
|
| 204 |
-
"padding": "16px !important",
|
| 205 |
-
},
|
| 206 |
-
|
| 207 |
-
# Word count plot styling - full width
|
| 208 |
-
".tabs > .tab-content > div[data-testid='tabitem'] > .plot": {
|
| 209 |
-
"width": "100% !important",
|
| 210 |
-
},
|
| 211 |
-
|
| 212 |
-
# Heatmap plot styling - responsive sizing
|
| 213 |
-
".tabs > .tab-content > div[data-testid='tabitem'] > .plotly": {
|
| 214 |
-
"width": "100% !important",
|
| 215 |
-
"height": "auto !important",
|
| 216 |
-
},
|
| 217 |
-
|
| 218 |
-
# Specific heatmap container styling
|
| 219 |
-
".metric-heatmap": {
|
| 220 |
-
"max-width": "100% !important",
|
| 221 |
-
"overflow-x": "auto !important",
|
| 222 |
-
},
|
| 223 |
-
|
| 224 |
-
# LLM Analysis styling
|
| 225 |
-
".llm-analysis": {
|
| 226 |
-
"background-color": "#f8f9fa !important",
|
| 227 |
-
"border-left": "4px solid #3B82F6 !important",
|
| 228 |
-
"border-radius": "8px !important",
|
| 229 |
-
"padding": "20px 24px !important",
|
| 230 |
-
"margin": "16px 0 !important",
|
| 231 |
-
"box-shadow": "0 2px 8px rgba(0, 0, 0, 0.05) !important",
|
| 232 |
-
},
|
| 233 |
-
".llm-analysis h2": {
|
| 234 |
-
"color": "#1e40af !important",
|
| 235 |
-
"font-size": "24px !important",
|
| 236 |
-
"margin-bottom": "16px !important",
|
| 237 |
-
"border-bottom": "1px solid #e5e7eb !important",
|
| 238 |
-
"padding-bottom": "8px !important",
|
| 239 |
-
},
|
| 240 |
-
".llm-analysis h3, .llm-analysis h4": {
|
| 241 |
-
"color": "#1e3a8a !important",
|
| 242 |
-
"margin-top": "20px !important",
|
| 243 |
-
"margin-bottom": "12px !important",
|
| 244 |
-
},
|
| 245 |
-
".llm-analysis p": {
|
| 246 |
-
"line-height": "1.7 !important",
|
| 247 |
-
"margin-bottom": "12px !important",
|
| 248 |
-
},
|
| 249 |
-
".llm-analysis ul, .llm-analysis ol": {
|
| 250 |
-
"margin-left": "24px !important",
|
| 251 |
-
"margin-bottom": "16px !important",
|
| 252 |
-
},
|
| 253 |
-
".llm-analysis li": {
|
| 254 |
-
"margin-bottom": "6px !important",
|
| 255 |
-
},
|
| 256 |
-
".llm-analysis strong, .llm-analysis b": {
|
| 257 |
-
"color": "#1f2937 !important",
|
| 258 |
-
"font-weight": "600 !important",
|
| 259 |
-
},
|
| 260 |
}
|
| 261 |
|
| 262 |
def get_css_string(self) -> str:
|
| 263 |
-
"""
|
| 264 |
-
|
| 265 |
-
|
| 266 |
-
|
| 267 |
-
|
| 268 |
-
|
| 269 |
-
|
| 270 |
-
|
| 271 |
|
| 272 |
|
| 273 |
# Instantiate the theme for easy import
|
|
|
|
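The removed side of this diff keeps its style rules as a Python dict that maps each selector to its declarations, while the added side returns one literal CSS string from `get_css_string`. As a minimal sketch of how a dict of that shape can be flattened into CSS text: the helper name `render_css` is hypothetical and is not taken from this commit, and the sample rules are copied from the removed entries only for illustration.

```python
from typing import Dict


def render_css(rules: Dict[str, Dict[str, str]]) -> str:
    """Render {selector: {property: value}} as a CSS string (hypothetical helper)."""
    blocks = []
    for selector, declarations in rules.items():
        # One "property: value;" line per declaration, indented inside the rule.
        body = "\n".join(
            f"    {prop}: {value};" for prop, value in declarations.items()
        )
        blocks.append(f"{selector} {{\n{body}\n}}")
    # Separate rules with a blank line for readability.
    return "\n\n".join(blocks)


# Two sample rules in the same shape as the removed dict entries.
rules = {
    "#steps-row": {
        "display": "flex !important",
        "align-items": "stretch !important",
    },
    ".metric-heatmap": {"max-width": "100% !important"},
}

css = render_css(rules)
print(css)
```

A literal string, as used on the added side, avoids this conversion step and allows constructs such as `@media` blocks that a flat selector-to-declarations dict cannot express directly.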
| 1 |
+
"""
|
| 2 |
+
Tibetan Text Metrics Theme - Gradio 6 Compatible
|
| 3 |
+
|
| 4 |
+
This theme provides a clean, professional look for the TTM application.
|
| 5 |
+
Updated for Gradio 6.x compatibility where theme/css are passed to launch().
|
| 6 |
+
"""
|
| 7 |
+
|
| 8 |
import gradio as gr
|
| 9 |
from gradio.themes.utils import colors, sizes, fonts
|
| 10 |
|
| 11 |
|
| 12 |
class TibetanAppTheme(gr.themes.Soft):
|
| 13 |
+
"""
|
| 14 |
+
Custom theme for Tibetan Text Metrics application.
|
| 15 |
+
|
| 16 |
+
Gradio 6 Migration Notes:
|
| 17 |
+
- Theme is now passed to demo.launch(theme=...) instead of gr.Blocks(theme=...)
|
| 18 |
+
- CSS is now passed to demo.launch(css=...) instead of gr.Blocks(css=...)
|
| 19 |
+
- Use elem_id and elem_classes for stable CSS targeting
|
| 20 |
+
"""
|
| 21 |
+
|
| 22 |
def __init__(self):
|
| 23 |
super().__init__(
|
| 24 |
+
primary_hue=colors.blue,
|
| 25 |
+
secondary_hue=colors.orange,
|
| 26 |
+
neutral_hue=colors.slate,
|
| 27 |
font=[
|
| 28 |
fonts.GoogleFont("Inter"),
|
| 29 |
"ui-sans-serif",
|
| 30 |
"system-ui",
|
| 31 |
"sans-serif",
|
| 32 |
],
|
| 33 |
+
radius_size=sizes.radius_md,
|
| 34 |
+
text_size=sizes.text_md,
|
| 35 |
)
|
| 36 |
+
|
| 37 |
+
# Theme variable overrides using Gradio's theming system
|
| 38 |
self.theme_vars_for_set = {
|
| 39 |
# Global & Body Styles
|
| 40 |
"body_background_fill": "#f0f2f5",
|
| 41 |
"body_text_color": "#333333",
|
| 42 |
+
"body_text_color_subdued": "#4b5563",
|
| 43 |
+
|
| 44 |
+
# Block/Card Styles
|
| 45 |
"block_background_fill": "#ffffff",
|
| 46 |
+
"block_radius": "16px",
|
| 47 |
"block_shadow": "0 4px 12px rgba(0, 0, 0, 0.08)",
|
| 48 |
"block_padding": "24px",
|
| 49 |
"block_border_width": "0px",
|
| 50 |
+
|
| 51 |
+
# Button Styles - Secondary
|
|
| 52 |
"button_secondary_background_fill": "#ffffff",
|
| 53 |
"button_secondary_text_color": "#374151",
|
| 54 |
"button_secondary_border_color": "#d1d5db",
|
| 55 |
"button_secondary_border_color_hover": "#adb5bd",
|
| 56 |
"button_secondary_background_fill_hover": "#f9fafb",
|
| 57 |
+
|
| 58 |
+
# Button Styles - Primary
|
| 59 |
"button_primary_background_fill": "#2563eb",
|
| 60 |
"button_primary_text_color": "#ffffff",
|
| 61 |
"button_primary_border_color": "transparent",
|
| 62 |
"button_primary_background_fill_hover": "#1d4ed8",
|
| 63 |
+
|
| 64 |
+
# Border colors
|
| 65 |
"border_color_accent_subdued": "#e5e7eb",
|
| 66 |
+
"border_color_primary": "#d1d5db",
|
|
| 67 |
|
| 68 |
+
# Input styles
|
| 69 |
+
"input_background_fill": "#ffffff",
|
| 70 |
+
"input_border_color": "#d1d5db",
|
| 71 |
+
"input_border_color_focus": "#2563eb",
|
| 72 |
}
|
| 73 |
+
super().set(**self.theme_vars_for_set)
|
| 74 |
|
| 75 |
def get_css_string(self) -> str:
|
| 76 |
+
"""
|
| 77 |
+
Returns custom CSS string for additional styling.
|
| 78 |
+
|
| 79 |
+
Gradio 6 uses different class naming conventions. This CSS uses:
|
| 80 |
+
- elem_id selectors (#id) for specific components
|
| 81 |
+
- elem_classes selectors (.class) for groups of components
|
| 82 |
+
- Gradio 6 native classes where stable
|
| 83 |
+
"""
|
| 84 |
+
return """
|
| 85 |
+
/* ============================================
|
| 86 |
+
GLOBAL STYLES
|
| 87 |
+
============================================ */
|
| 88 |
+
|
| 89 |
+
.gradio-container {
|
| 90 |
+
font-family: 'Inter', ui-sans-serif, system-ui, sans-serif !important;
|
| 91 |
+
max-width: 1400px !important;
|
| 92 |
+
margin: 0 auto !important;
|
| 93 |
+
}
|
| 94 |
+
|
| 95 |
+
/* ============================================
|
| 96 |
+
TYPOGRAPHY
|
| 97 |
+
============================================ */
|
| 98 |
+
|
| 99 |
+
h1 {
|
| 100 |
+
font-size: 28px !important;
|
| 101 |
+
font-weight: 600 !important;
|
| 102 |
+
color: #111827 !important;
|
| 103 |
+
margin-bottom: 8px !important;
|
| 104 |
+
}
|
| 105 |
+
|
| 106 |
+
h2 {
|
| 107 |
+
font-size: 24px !important;
|
| 108 |
+
font-weight: 600 !important;
|
| 109 |
+
color: var(--primary-600, #2563eb) !important;
|
| 110 |
+
margin-top: 24px !important;
|
| 111 |
+
margin-bottom: 16px !important;
|
| 112 |
+
}
|
| 113 |
+
|
| 114 |
+
h3 {
|
| 115 |
+
font-size: 20px !important;
|
| 116 |
+
font-weight: 600 !important;
|
| 117 |
+
color: #1f2937 !important;
|
| 118 |
+
margin-top: 20px !important;
|
| 119 |
+
margin-bottom: 12px !important;
|
| 120 |
+
}
|
| 121 |
+
|
| 122 |
+
/* ============================================
|
| 123 |
+
LAYOUT - Steps Row
|
| 124 |
+
============================================ */
|
| 125 |
+
|
| 126 |
+
#steps-row {
|
| 127 |
+
display: flex !important;
|
| 128 |
+
align-items: stretch !important;
|
| 129 |
+
gap: 24px !important;
|
| 130 |
+
}
|
| 131 |
+
|
| 132 |
+
.step-column {
|
| 133 |
+
display: flex !important;
|
| 134 |
+
flex-direction: column !important;
|
| 135 |
+
flex: 1 !important;
|
| 136 |
+
}
|
| 137 |
+
|
| 138 |
+
.step-box {
|
| 139 |
+
padding: 1.5rem !important;
|
| 140 |
+
flex-grow: 1 !important;
|
| 141 |
+
display: flex !important;
|
| 142 |
+
flex-direction: column !important;
|
| 143 |
+
}
|
| 144 |
+
|
| 145 |
+
/* ============================================
|
| 146 |
+
BUTTONS
|
| 147 |
+
============================================ */
|
| 148 |
+
|
| 149 |
+
/* Primary action buttons */
|
| 150 |
+
#run-btn-quick, #run-btn-custom {
|
| 151 |
+
background: var(--button-primary-background-fill, #2563eb) !important;
|
| 152 |
+
color: var(--button-primary-text-color, #ffffff) !important;
|
| 153 |
+
font-weight: 600 !important;
|
| 154 |
+
font-size: 18px !important;
|
| 155 |
+
padding: 12px 24px !important;
|
| 156 |
+
border: none !important;
|
| 157 |
+
border-radius: 8px !important;
|
| 158 |
+
box-shadow: 0 2px 4px rgba(37, 99, 235, 0.2) !important;
|
| 159 |
+
transition: all 0.2s ease !important;
|
| 160 |
+
margin-top: 16px !important;
|
| 161 |
+
}
|
| 162 |
+
|
| 163 |
+
#run-btn-quick:hover, #run-btn-custom:hover {
|
| 164 |
+
background: var(--button-primary-background-fill-hover, #1d4ed8) !important;
|
| 165 |
+
box-shadow: 0 4px 12px rgba(37, 99, 235, 0.3) !important;
|
| 166 |
+
transform: translateY(-1px) !important;
|
| 167 |
+
}
|
| 168 |
+
|
| 169 |
+
/* Secondary buttons */
|
| 170 |
+
button.secondary {
|
| 171 |
+
background-color: #ffffff !important;
|
| 172 |
+
color: #374151 !important;
|
| 173 |
+
border: 1px solid #d1d5db !important;
|
| 174 |
+
border-radius: 8px !important;
|
| 175 |
+
padding: 10px 20px !important;
|
| 176 |
+
font-weight: 500 !important;
|
| 177 |
+
}
|
| 178 |
+
|
| 179 |
+
button.secondary:hover {
|
| 180 |
+
background-color: #f9fafb !important;
|
| 181 |
+
border-color: #adb5bd !important;
|
| 182 |
+
}
|
| 183 |
+
|
| 184 |
+
/* ============================================
|
| 185 |
+
TABS
|
| 186 |
+
============================================ */
|
| 187 |
+
|
| 188 |
+
.tabs {
|
| 189 |
+
margin-top: 8px !important;
|
| 190 |
+
}
|
| 191 |
+
|
| 192 |
+
.tab-nav {
|
| 193 |
+
border-bottom: 1px solid #e5e7eb !important;
|
| 194 |
+
margin-bottom: 16px !important;
|
| 195 |
+
}
|
| 196 |
+
|
| 197 |
+
.tab-nav button {
|
| 198 |
+
color: #6b7280 !important;
|
| 199 |
+
background-color: transparent !important;
|
| 200 |
+
padding: 12px 20px !important;
|
| 201 |
+
border: none !important;
|
| 202 |
+
border-bottom: 2px solid transparent !important;
|
| 203 |
+
font-weight: 500 !important;
|
| 204 |
+
transition: all 0.2s ease !important;
|
| 205 |
+
}
|
| 206 |
+
|
| 207 |
+
.tab-nav button:hover {
|
| 208 |
+
color: #374151 !important;
|
| 209 |
+
}
|
| 210 |
+
|
| 211 |
+
.tab-nav button.selected {
|
| 212 |
+
border-bottom: 2px solid var(--primary-500, #3b82f6) !important;
|
| 213 |
+
color: var(--primary-600, #2563eb) !important;
|
| 214 |
+
background-color: transparent !important;
|
| 215 |
+
}
|
| 216 |
+
|
| 217 |
+
/* ============================================
|
| 218 |
+
ACCORDIONS
|
| 219 |
+
============================================ */
|
| 220 |
+
|
| 221 |
+
.accordion {
|
| 222 |
+
border: 1px solid #e5e7eb !important;
|
| 223 |
+
border-radius: 8px !important;
|
| 224 |
+
margin-bottom: 12px !important;
|
| 225 |
+
overflow: hidden !important;
|
| 226 |
+
}
|
| 227 |
+
|
| 228 |
+
/* Metric info accordions with colored borders */
|
| 229 |
+
.metric-info-accordion {
|
| 230 |
+
border-left: 4px solid #3B82F6 !important;
|
| 231 |
+
margin-bottom: 1rem !important;
|
| 232 |
+
background-color: #F8FAFC !important;
|
| 233 |
+
border-radius: 6px !important;
|
| 234 |
+
}
|
| 235 |
+
|
| 236 |
+
.jaccard-info { border-left-color: #3B82F6 !important; }
|
| 237 |
+
.lcs-info { border-left-color: #10B981 !important; }
|
| 238 |
+
.fuzzy-info { border-left-color: #F59E0B !important; }
|
| 239 |
+
.semantic-info { border-left-color: #8B5CF6 !important; }
|
| 240 |
+
.wordcount-info { border-left-color: #EC4899 !important; }
|
| 241 |
+
|
| 242 |
+
/* ============================================
|
| 243 |
+
FORM ELEMENTS
|
| 244 |
+
============================================ */
|
| 245 |
+
|
| 246 |
+
/* Radio buttons */
|
| 247 |
+
.radio-group label {
|
| 248 |
+
display: flex !important;
|
| 249 |
+
align-items: center !important;
|
| 250 |
+
padding: 10px 16px !important;
|
| 251 |
+
border: 1px solid #e5e7eb !important;
|
| 252 |
+
border-radius: 8px !important;
|
| 253 |
+
margin-bottom: 8px !important;
|
| 254 |
+
cursor: pointer !important;
|
| 255 |
+
transition: all 0.2s ease !important;
|
| 256 |
+
}
|
| 257 |
+
|
| 258 |
+
.radio-group label:hover {
|
| 259 |
+
background-color: #f9fafb !important;
|
| 260 |
+
border-color: #d1d5db !important;
|
| 261 |
+
}
|
| 262 |
+
|
| 263 |
+
.radio-group input:checked + label,
|
| 264 |
+
.radio-group label.selected {
|
| 265 |
+
background-color: var(--primary-50, #eff6ff) !important;
|
| 266 |
+
border-color: var(--primary-500, #3b82f6) !important;
|
| 267 |
+
}
|
| 268 |
+
|
| 269 |
+
/* Dropdowns */
|
| 270 |
+
select, .dropdown {
|
| 271 |
+
border: 1px solid #d1d5db !important;
|
| 272 |
+
border-radius: 8px !important;
|
| 273 |
+
padding: 10px 12px !important;
|
| 274 |
+
background-color: #ffffff !important;
|
| 275 |
+
}
|
| 276 |
+
|
| 277 |
+
/* Checkboxes */
|
| 278 |
+
input[type="checkbox"] {
|
| 279 |
+
width: 18px !important;
|
| 280 |
+
height: 18px !important;
|
| 281 |
+
accent-color: var(--primary-500, #3b82f6) !important;
|
| 282 |
+
}
|
| 283 |
+
|
| 284 |
+
/* ============================================
|
| 285 |
+
PRESET TABLE
|
| 286 |
+
============================================ */
|
| 287 |
+
|
| 288 |
+
.preset-table table {
|
| 289 |
+
font-size: 14px !important;
|
| 290 |
+
margin-top: 12px !important;
|
| 291 |
+
width: 100% !important;
|
| 292 |
+
border-collapse: collapse !important;
|
| 293 |
+
}
|
| 294 |
+
|
| 295 |
+
.preset-table th, .preset-table td {
|
| 296 |
+
padding: 10px 14px !important;
|
| 297 |
+
text-align: center !important;
|
| 298 |
+
border-bottom: 1px solid #e5e7eb !important;
|
| 299 |
+
}
|
| 300 |
+
|
| 301 |
+
.preset-table th {
|
| 302 |
+
background-color: #f9fafb !important;
|
| 303 |
+
font-weight: 600 !important;
|
| 304 |
+
color: #374151 !important;
|
| 305 |
+
}
|
| 306 |
+
|
| 307 |
+
.preset-table tr:hover {
|
| 308 |
+
background-color: #f9fafb !important;
|
| 309 |
+
}
|
| 310 |
+
|
| 311 |
+
/* ============================================
|
| 312 |
+
RESULTS SECTION
|
| 313 |
+
============================================ */
|
| 314 |
+
|
| 315 |
+
/* Heatmaps and plots */
|
| 316 |
+
.plot-container {
|
| 317 |
+
width: 100% !important;
|
| 318 |
+
overflow-x: auto !important;
|
| 319 |
+
}
|
| 320 |
+
|
| 321 |
+
.metric-heatmap {
|
| 322 |
+
max-width: 100% !important;
|
| 323 |
+
}
|
| 324 |
+
|
| 325 |
+
/* ============================================
|
| 326 |
+
LLM ANALYSIS OUTPUT
|
| 327 |
+
============================================ */
|
| 328 |
+
|
| 329 |
+
#llm-analysis {
|
| 330 |
+
background-color: #f8f9fa !important;
|
| 331 |
+
border-left: 4px solid #3B82F6 !important;
|
| 332 |
+
border-radius: 8px !important;
|
| 333 |
+
padding: 20px 24px !important;
|
| 334 |
+
margin: 16px 0 !important;
|
| 335 |
+
box-shadow: 0 2px 8px rgba(0, 0, 0, 0.05) !important;
|
| 336 |
+
}
|
| 337 |
+
|
| 338 |
+
#llm-analysis h2 {
|
| 339 |
+
color: #1e40af !important;
|
| 340 |
+
font-size: 22px !important;
|
| 341 |
+
margin-bottom: 16px !important;
|
| 342 |
+
border-bottom: 1px solid #e5e7eb !important;
|
| 343 |
+
padding-bottom: 8px !important;
|
| 344 |
+
}
|
| 345 |
+
|
| 346 |
+
#llm-analysis h3, #llm-analysis h4 {
|
| 347 |
+
color: #1e3a8a !important;
|
| 348 |
+
margin-top: 18px !important;
|
| 349 |
+
margin-bottom: 10px !important;
|
| 350 |
+
}
|
| 351 |
+
|
| 352 |
+
#llm-analysis p {
|
| 353 |
+
line-height: 1.7 !important;
|
| 354 |
+
margin-bottom: 12px !important;
|
| 355 |
+
color: #374151 !important;
|
| 356 |
+
}
|
| 357 |
+
|
| 358 |
+
#llm-analysis ul, #llm-analysis ol {
|
| 359 |
+
margin-left: 24px !important;
|
| 360 |
+
margin-bottom: 16px !important;
|
| 361 |
+
}
|
| 362 |
+
|
| 363 |
+
#llm-analysis li {
|
| 364 |
+
margin-bottom: 6px !important;
|
| 365 |
+
}
|
| 366 |
+
|
| 367 |
+
#llm-analysis strong, #llm-analysis b {
|
| 368 |
+
color: #1f2937 !important;
|
| 369 |
+
font-weight: 600 !important;
|
| 370 |
+
}
|
| 371 |
+
|
| 372 |
+
/* ============================================
|
| 373 |
+
RESPONSIVE ADJUSTMENTS
|
| 374 |
+
============================================ */
|
| 375 |
+
|
| 376 |
+
@media (max-width: 768px) {
|
| 377 |
+
#steps-row {
|
| 378 |
+
flex-direction: column !important;
|
| 379 |
+
}
|
| 380 |
+
|
| 381 |
+
.step-column {
|
| 382 |
+
width: 100% !important;
|
| 383 |
+
}
|
| 384 |
+
|
| 385 |
+
#run-btn-quick, #run-btn-custom {
|
| 386 |
+
font-size: 16px !important;
|
| 387 |
+
padding: 10px 20px !important;
|
| 388 |
+
}
|
| 389 |
+
}
|
| 390 |
+
|
| 391 |
+
/* ============================================
|
| 392 |
+
UTILITY CLASSES
|
| 393 |
+
============================================ */
|
| 394 |
+
|
| 395 |
+
.custom-header {
|
| 396 |
+
margin-bottom: 12px !important;
|
| 397 |
+
color: #374151 !important;
|
| 398 |
+
}
|
| 399 |
+
|
| 400 |
+
.info-text {
|
| 401 |
+
font-size: 14px !important;
|
| 402 |
+
color: #6b7280 !important;
|
| 403 |
+
margin-top: 4px !important;
|
| 404 |
+
}
|
| 405 |
+
"""
|
| 406 |
|
| 407 |
|
| 408 |
# Instantiate the theme for easy import
|