Commit 75e8f38 by Daniel Wojahn · 1 Parent(s): 54934d5

feat(ui): Add preset-based analysis UI and Gradio 6 compatibility

README.md CHANGED
@@ -15,27 +15,53 @@ app_file: app.py
15
  [![License: CC BY 4.0](https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)
16
  [![Project Status: Active – Web app version for accessible text analysis.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)
17
 
18
- A user-friendly web application for analyzing textual similarities and variations in Tibetan manuscripts. This tool provides a graphical interface to the core text comparison functionalities of the [Tibetan Text Metrics (TTM)](https://github.com/daniel-wojahn/tibetan-text-metrics) project, making it accessible to researchers without Python or command-line experience. Built with Python, Cython, and Gradio.
19
 
20
  ## Background
21
 
22
- The Tibetan Text Metrics project aims to provide quantitative methods for assessing textual similarities at the chapter or segment level, helping researchers understand patterns of textual evolution. This web application extends these capabilities by offering an intuitive interface, removing the need for manual script execution and environment setup for end-users.
23
 
24
  ## Key Features of the Web App
25
 
26
  - **Easy File Upload**: Upload one or more Tibetan `.txt` files directly through the browser.
27
  - **Automatic Segmentation**: Uses Tibetan section markers (e.g., `༈`) to automatically split texts into comparable chapters or sections.
28
  - **Core Metrics Computed**:
29
- - **Jaccard Similarity (%)**: Measures vocabulary overlap between segments. *Common Tibetan stopwords can be filtered out to focus on meaningful lexical similarity.*
30
- - **Normalized Longest Common Subsequence (LCS)**: Identifies the longest shared sequence of words, indicating direct textual parallels.
31
- - **Fuzzy Similarity**: Uses fuzzy string matching to detect approximate matches between words, accommodating spelling variations and minor differences in Tibetan text.
32
- - **Semantic Similarity**: Uses sentence-transformer embeddings (e.g., LaBSE) to compare the contextual meaning of segments. *Note: This metric works best when combined with other metrics for a more comprehensive analysis.*
33
  - **Handles Long Texts**: Implements automated handling for long segments when computing semantic embeddings.
34
- - **Model Selection**: Semantic similarity analysis uses Hugging Face sentence-transformer models (e.g., LaBSE).
 
 
 
35
  - **Stopword Filtering**: Three levels of filtering for Tibetan words:
36
  - **None**: No filtering, includes all words
37
  - **Standard**: Filters only common particles and punctuation
38
  - **Aggressive**: Filters all function words including particles, pronouns, and auxiliaries
 
39
  - **Interactive Visualizations**:
40
  - Heatmaps for Jaccard, LCS, Fuzzy, and Semantic similarity metrics, providing a quick overview of inter-segment relationships.
41
  - Bar chart displaying word counts per segment.
@@ -91,7 +117,10 @@ To obtain meaningful results, it is highly recommended to divide your Tibetan te
91
  ## Implemented Metrics
92
 
93
  **Stopword Filtering:**
94
- To enhance the accuracy and relevance of similarity scores, both the Jaccard Similarity and TF-IDF Cosine Similarity calculations incorporate a stopword filtering step. This process removes high-frequency, low-information Tibetan words (e.g., common particles, pronouns, and grammatical markers) before the metrics are computed. This ensures that the resulting scores are more reflective of meaningful lexical and thematic similarities between texts, rather than being skewed by the presence of ubiquitous common words.
 
 
 
95
 
96
  The comprehensive list of Tibetan stopwords used is adapted and compiled from the following valuable resources:
97
  - The **Divergent Discourses** (specifically, their Tibetan stopwords list available at [Zenodo Record 10148636](https://zenodo.org/records/10148636)).
@@ -99,7 +128,7 @@ The comprehensive list of Tibetan stopwords used is adapted and compiled from th
99
 
100
  We extend our gratitude to the creators and maintainers of these projects for making their work available to the community.
101
 
102
- Feel free to edit this list of stopwords to better suit your needs. The list is stored in the `pipeline/stopwords.py` file.
103
 
104
  ### The application computes and visualizes the following similarity metrics between corresponding chapters/segments of the uploaded texts:
105
 
@@ -116,29 +145,33 @@ A higher percentage indicates a greater overlap in the significant vocabularies
116
 
117
  This helps focus on meaningful content words rather than grammatical elements.
118
 
119
- 2. **Normalized LCS (Longest Common Subsequence)**: This metric measures the length of the longest sequence of words that appears in *both* text segments, maintaining their original relative order. Importantly, these words do not need to be directly adjacent (contiguous) in either text. For example, if Text A is '<u>the</u> quick <u>brown</u> fox <u>jumps</u>' and Text B is '<u>the</u> lazy cat and <u>brown</u> dog <u>jumps</u> high', the LCS is 'the brown jumps'. The length of this common subsequence is then normalized (in this tool, by dividing by the length of the longer of the two segments) to provide a score, which is then presented as a percentage. A higher Normalized LCS score suggests more significant shared phrasing, direct textual borrowing, or strong structural parallelism, as it reflects similarities in how ideas are ordered and expressed sequentially.
120
- * *Note on Interpretation*: It's possible for Normalized LCS to be higher than Jaccard Similarity. This often happens when texts share a substantial 'narrative backbone' or common ordered phrases (leading to a high LCS), even if they use varied surrounding vocabulary or introduce many unique words not part of these core sequences (which would lower the Jaccard score). LCS highlights this sequential, structural similarity, while Jaccard focuses on the overall shared vocabulary regardless of its arrangement.
121
- 3. **Fuzzy Similarity**: This metric uses fuzzy string matching algorithms to detect approximate matches between words, making it particularly valuable for Tibetan texts where spelling variations, dialectal differences, or scribal errors might be present. Unlike exact matching methods (such as Jaccard), fuzzy similarity can recognize when words are similar but not identical. The implementation offers multiple matching methods:
122
- - **Token Set Ratio** (default): Compares the sets of words regardless of order, finding the best alignment between them
123
- - **Token Sort Ratio**: Sorts the words alphabetically before comparing, useful for texts with similar vocabulary in different orders
124
- - **Partial Ratio**: Finds the best matching substring, helpful for detecting when one text contains parts of another
125
- - **Simple Ratio**: Performs character-by-character comparison, best for detecting minor spelling variations
126
 
127
- Scores range from 0 to 1, where 1 indicates perfect or near-perfect matches. This metric is particularly useful for identifying textual relationships that might be missed by exact matching methods, especially in manuscripts with orthographic variations.
 
 
 
128
 
129
- **Stopword Filtering**: The same three levels of filtering used for Jaccard Similarity are applied to fuzzy matching:
130
- - **None**: No filtering, includes all words in the comparison
131
- - **Standard**: Filters only common particles and punctuation
132
- - **Aggressive**: Filters all function words including particles, pronouns, and auxiliaries
133
 
134
- 4. **Semantic Similarity**: Computes the cosine similarity between sentence-transformer embeddings (e.g., LaBSE) of text segments. Segments are embedded into high-dimensional vectors and compared via cosine similarity. Scores closer to 1 indicate a higher degree of semantic overlap.
 
 
 
135
 
136
- **Stopword Filtering**: Three levels of filtering are available:
 
 
137
  - **None**: No filtering, includes all words in the comparison
138
  - **Standard**: Filters only common particles and punctuation
139
  - **Aggressive**: Filters all function words including particles, pronouns, and auxiliaries
140
 
141
- This helps focus on meaningful content words rather than grammatical elements.
 
 
142
 
143
  ## Getting Started (if run Locally)
144
 
@@ -173,40 +206,63 @@ This helps focus on meaningful content words rather than grammatical elements.
173
 
174
  ## Usage
175
 
176
- 1. **Upload Files**: Use the file upload interface to select one or more `.txt` files containing Tibetan Unicode text.
177
- 2. **Configure Options**:
178
- - Choose whether to compute semantic similarity
179
- - Choose whether to compute fuzzy string similarity
180
- - Select a fuzzy matching method (Token Set, Token Sort, Partial, or Simple Ratio)
181
- - Select an embedding model for semantic analysis
182
- - Choose a stopword filtering level (None, Standard, or Aggressive)
183
- 3. **Run Analysis**: Click the "Run Analysis" button.
184
- 3. **View Results**:
185
- - A preview of the similarity metrics will be displayed.
186
- - Download the full results as a CSV file.
187
- - Interactive heatmaps for Jaccard Similarity, Normalized LCS, Fuzzy Similarity, and Semantic Similarity will be generated. All heatmaps use a consistent color scheme where darker colors represent higher similarity.
188
- - A bar chart showing word counts per segment will also be available.
189
- - Any warnings (e.g., regarding missing chapter markers) will be displayed.
190
-
191
- 4. **Get Interpretation** (Optional):
192
- - After running the analysis, click the "Help Interpret Results" button.
193
- - No API key or internet connection required! The system uses a built-in rule-based analysis engine.
194
- - The system will analyze your metrics and provide insights about patterns, relationships, and notable findings in your data.
195
- - This feature helps researchers understand the significance of the metrics and identify interesting textual relationships between chapters.
196
 
197
  ## Embedding Model
198
 
199
- Semantic similarity uses Hugging Face sentence-transformer models (default: `sentence-transformers/LaBSE`). These models provide context-aware, segment-level embeddings suitable for comparing Tibetan text passages.
200
 
201
  ## Structure
202
 
203
  - `app.py` — Gradio web app entry point and UI definition.
204
  - `pipeline/` — Modules for file handling, text processing, metrics calculation, and visualization.
205
  - `process.py`: Core logic for segmenting texts and orchestrating metric computation.
206
- - `metrics.py`: Implementation of Jaccard, LCS, and Semantic Similarity.
207
  - `hf_embedding.py`: Handles loading and using sentence-transformer models.
208
  - `tokenize.py`: Tibetan text tokenization using `botok`.
209
- - `upload.py`: File upload handling (currently minimal).
 
210
  - `visualize.py`: Generates heatmaps and word count plots.
211
  - `requirements.txt` — Python dependencies for the web application.
212
 
@@ -228,7 +284,7 @@ If you use this web application or the underlying TTM tool in your research, ple
228
  author = {Daniel Wojahn},
229
  year = {2025},
230
  url = {https://github.com/daniel-wojahn/tibetan-text-metrics},
231
- version = {0.3.0}
232
  }
233
  ```
234
 
 
15
  [![License: CC BY 4.0](https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)
16
  [![Project Status: Active – Web app version for accessible text analysis.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)
17
 
18
+ Compare Tibetan texts to discover how similar they are. This tool helps scholars identify shared passages, textual variations, and relationships between different versions of Tibetan manuscripts, with no programming required.
19
+
20
+ ## Quick Start (3 Steps)
21
+
22
+ 1. **Upload** two or more Tibetan text files (.txt format)
23
+ 2. **Click** "Compare My Texts"
24
+ 3. **View** the results — higher scores mean more similarity
25
+
26
+ That's it! The default settings work well for most cases. See the results section for colorful heatmaps showing which chapters are most similar.
27
+
28
+ > **Tip:** If your texts have chapters, separate them with the ༈ marker so the tool can compare chapter-by-chapter.
29
+
30
+ ## What's New (v0.4.0)
31
+
32
+ - **New preset-based UI**: Choose "Quick Start" for simple analysis or "Custom" for full control
33
+ - **Three analysis presets**: Standard, Deep (with AI), and Quick (fastest)
34
+ - **Word-level tokenization** is now the default (recommended for Jaccard similarity)
35
+ - **Particle normalization**: Treat grammatical particle variants as equivalent (གི/ཀྱི/གྱི → གི)
36
+ - **LCS normalization options**: Choose how to handle texts of different lengths
37
+ - **Improved stopword matching**: Fixed tsek (་) handling for consistent filtering
38
+ - **Tibetan-optimized fuzzy matching**: Syllable-level methods only (removed character-level methods)
39
+ - **Dharmamitra models**: Buddhist-specific semantic similarity models as default
40
+ - **Modernized theme**: Cleaner UI with better responsive design
41
 
42
  ## Background
43
 
44
+ The Tibetan Text Metrics project provides quantitative methods for assessing textual similarities at the chapter or segment level, helping researchers understand patterns of textual evolution. This web application makes these capabilities accessible through an intuitive interface, with no command-line or Python experience needed.
45
 
46
  ## Key Features of the Web App
47
 
48
  - **Easy File Upload**: Upload one or more Tibetan `.txt` files directly through the browser.
49
  - **Automatic Segmentation**: Uses Tibetan section markers (e.g., `༈`) to automatically split texts into comparable chapters or sections.
50
  - **Core Metrics Computed**:
51
+ - **Jaccard Similarity (%)**: Measures vocabulary overlap between segments. Word-level tokenization recommended. *Common Tibetan stopwords can be filtered out to focus on meaningful lexical similarity.*
52
+ - **Normalized Longest Common Subsequence (LCS)**: Identifies the longest shared sequence of words, indicating direct textual parallels. Supports multiple normalization modes (average, min, max).
53
+ - **Fuzzy Similarity**: Uses syllable-level fuzzy matching to detect approximate matches, accommodating spelling variations and scribal differences in Tibetan text.
54
+ - **Semantic Similarity**: Uses Buddhist-specific sentence-transformer embeddings (Dharmamitra) to compare the contextual meaning of segments.
55
  - **Handles Long Texts**: Implements automated handling for long segments when computing semantic embeddings.
56
+ - **Model Selection**: Semantic similarity uses Hugging Face sentence-transformer models. Default is Dharmamitra's `buddhist-nlp/buddhist-sentence-similarity`, trained specifically for Buddhist texts.
57
+ - **Tokenization Modes**:
58
+ - **Word** (default, recommended): Keeps multi-syllable words together for more meaningful comparison
59
+ - **Syllable**: Splits into individual syllables for finer-grained analysis
60
  - **Stopword Filtering**: Three levels of filtering for Tibetan words:
61
  - **None**: No filtering, includes all words
62
  - **Standard**: Filters only common particles and punctuation
63
  - **Aggressive**: Filters all function words including particles, pronouns, and auxiliaries
64
+ - **Particle Normalization**: Optional normalization of grammatical particles to canonical forms (e.g., གི/ཀྱི/གྱི → གི, ལ/ར/སུ/ཏུ/དུ → ལ). Reduces false negatives from sandhi variation.
65
  - **Interactive Visualizations**:
66
  - Heatmaps for Jaccard, LCS, Fuzzy, and Semantic similarity metrics, providing a quick overview of inter-segment relationships.
67
  - Bar chart displaying word counts per segment.
 
117
  ## Implemented Metrics
118
 
119
  **Stopword Filtering:**
120
+ To enhance the accuracy and relevance of similarity scores, the Jaccard Similarity and Fuzzy Similarity calculations incorporate a stopword filtering step. This process removes high-frequency, low-information Tibetan words (e.g., common particles, pronouns, and grammatical markers) before the metrics are computed. Stopwords are normalized to handle tsek (་) variations consistently.
121
+
122
+ **Particle Normalization:**
123
+ Tibetan grammatical particles change form based on the preceding syllable (sandhi). For example, the genitive particle appears as གི, ཀྱི, གྱི, ཡི, or འི depending on context. When particle normalization is enabled, all variants are treated as equivalent, reducing false negatives when comparing texts with different scribal conventions.
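
As an illustration of the particle normalization described above, here is a minimal sketch. It is not taken from `pipeline/normalize_bo.py`: the names `PARTICLE_MAP` and `normalize_particles` are hypothetical, and the variant lists are only the ones mentioned in this README. It assumes the text has already been split into tokens and that a trailing tsek can be dropped before lookup.

```python
TSEK = "\u0f0b"  # Tibetan syllable delimiter (tsek)

# Variant lists copied from the README text; the project's real table may be larger.
GENITIVE_VARIANTS = ("གི", "ཀྱི", "གྱི", "ཡི", "འི")
LA_DON_VARIANTS = ("ལ", "ར", "སུ", "ཏུ", "དུ")

# Map every variant to the first (canonical) form of its group.
PARTICLE_MAP = {v: GENITIVE_VARIANTS[0] for v in GENITIVE_VARIANTS}
PARTICLE_MAP.update({v: LA_DON_VARIANTS[0] for v in LA_DON_VARIANTS})


def normalize_particles(tokens):
    """Strip trailing tseks and replace known particle variants (hypothetical helper)."""
    normalized = []
    for token in tokens:
        bare = token.rstrip(TSEK)
        normalized.append(PARTICLE_MAP.get(bare, bare))
    return normalized
```

A plain lookup like this cannot tell a particle from a content syllable that happens to have the same spelling, so a real implementation would probably need part-of-speech information from the tokenizer.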
124
 
125
  The comprehensive list of Tibetan stopwords used is adapted and compiled from the following valuable resources:
126
  - The **Divergent Discourses** (specifically, their Tibetan stopwords list available at [Zenodo Record 10148636](https://zenodo.org/records/10148636)).
 
128
 
129
  We extend our gratitude to the creators and maintainers of these projects for making their work available to the community.
130
 
131
+ Feel free to edit this list of stopwords to better suit your needs. The list is stored in the `pipeline/stopwords_bo.py` file.
132
 
133
  ### The application computes and visualizes the following similarity metrics between corresponding chapters/segments of the uploaded texts:
134
 
 
145
 
146
  This helps focus on meaningful content words rather than grammatical elements.
147
 
148
+ 2. **Normalized LCS (Longest Common Subsequence)**: This metric measures the length of the longest sequence of words that appears in *both* text segments, maintaining their original relative order. Importantly, these words do not need to be directly adjacent (contiguous) in either text.
149
 
150
+ **Normalization options:**
151
+ - **Average** (default): Divides LCS length by the average of both text lengths. Balanced comparison.
152
+ - **Min**: Divides by the shorter text length. Useful for detecting if one text contains the other (e.g., quotes within commentary). Can return 1.0 if shorter text is fully contained.
153
+ - **Max**: Divides by the longer text length. Stricter metric that penalizes length differences.
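
The three normalization options above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the code in `pipeline/metrics.py`: the function names and the `mode` strings (`"avg"`, `"min"`, `"max"`) are hypothetical, and segments are assumed to be lists of already tokenized words.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def normalized_lcs(a, b, mode="avg"):
    """Divide the LCS length by the average, shorter, or longer segment length."""
    if not a or not b:
        return 0.0
    if mode == "avg":
        denom = (len(a) + len(b)) / 2
    elif mode == "min":
        denom = min(len(a), len(b))
    elif mode == "max":
        denom = max(len(a), len(b))
    else:
        raise ValueError(f"unknown mode: {mode}")
    return lcs_length(a, b) / denom


text_a = "the quick brown fox jumps".split()
text_b = "the lazy cat and brown dog jumps high".split()
print(normalized_lcs(text_a, text_b, "min"))  # LCS "the brown jumps" over 5 tokens
```

With the example from the previous README version (5 and 8 tokens, LCS of length 3), `min` gives 3/5, `max` gives 3/8, and `avg` gives 3/6.5, which shows how the choice of denominator changes the score for texts of unequal length.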
154
 
155
+ A higher Normalized LCS score suggests more significant shared phrasing, direct textual borrowing, or strong structural parallelism.
156
+
157
+ *Note on Interpretation*: It's possible for Normalized LCS to be higher than Jaccard Similarity. This often happens when texts share a substantial 'narrative backbone' or common ordered phrases (leading to a high LCS), even if they use varied surrounding vocabulary.
158
+ 3. **Fuzzy Similarity**: This metric uses syllable-level fuzzy matching algorithms to detect approximate matches, making it particularly valuable for Tibetan texts where spelling variations, dialectal differences, or scribal errors might be present. Unlike exact matching methods (such as Jaccard), fuzzy similarity can recognize when words are similar but not identical.
159
 
160
+ **Available methods (all work at syllable level):**
161
+ - **Syllable N-gram Overlap** (default, recommended): Compares syllable bigrams between texts. Best for detecting shared phrases and local patterns.
162
+ - **Syllable-level Edit Distance**: Computes Levenshtein distance at the syllable/token level. Detects minor variations while respecting syllable boundaries.
163
+ - **Weighted Jaccard**: Like standard Jaccard but considers token frequency, giving more weight to frequently shared terms.
164
 
165
+ Scores range from 0 to 1, where 1 indicates perfect or near-perfect matches. All methods work at the syllable level, which is linguistically appropriate for Tibetan.
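
To make the first two syllable-level methods concrete, here is a rough sketch. The README does not specify the exact formulas, so both are assumptions: bigram overlap is computed as Jaccard over sets of syllable bigrams, and edit distance is turned into a similarity by dividing by the longer sequence length. The function names are hypothetical and the syllables are romanized placeholders.

```python
def syllable_bigrams(syllables):
    """Set of adjacent syllable pairs."""
    return set(zip(syllables, syllables[1:]))


def ngram_overlap(a, b):
    """Assumed definition: shared bigrams divided by all distinct bigrams."""
    grams_a, grams_b = syllable_bigrams(a), syllable_bigrams(b)
    if not grams_a and not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def syllable_edit_similarity(a, b):
    """1 minus the Levenshtein distance over syllables, scaled by the longer length."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return 1 - prev[-1] / max(len(a), len(b))


seg_a = ["bkra", "shis", "bde", "legs"]
seg_b = ["bkra", "shis", "bde", "leg"]  # one variant spelling
print(ngram_overlap(seg_a, seg_b), syllable_edit_similarity(seg_a, seg_b))
```

Because the unit of comparison is the syllable, a single variant spelling counts as one substitution regardless of how many characters differ.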
166
+
167
+ **Stopword Filtering**: The same three levels of filtering used for Jaccard Similarity are applied to fuzzy matching:
168
  - **None**: No filtering, includes all words in the comparison
169
  - **Standard**: Filters only common particles and punctuation
170
  - **Aggressive**: Filters all function words including particles, pronouns, and auxiliaries
171
 
172
+ 4. **Semantic Similarity**: Computes the cosine similarity between sentence-transformer embeddings of text segments. Uses Dharmamitra's Buddhist-specific models by default. Segments are embedded into high-dimensional vectors and compared via cosine similarity. Scores closer to 1 indicate a higher degree of semantic overlap.
173
+
174
+ *Note*: Semantic similarity operates on the raw text and is not affected by stopword filtering settings.
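
The cosine comparison itself can be sketched without loading a model. In the snippet below the vectors are hand-made stand-ins for the embeddings a sentence-transformer model would produce, and `cosine_similarity` is a hypothetical helper, not the project's function.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_u * norm_v)


# Toy 2-dimensional "embeddings"; real ones have hundreds of dimensions.
same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
```

Only the direction of the vectors matters, which is why segments of different lengths can still score close to 1.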
175
 
176
  ## Getting Started (if run Locally)
177
 
 
206
 
207
  ## Usage
208
 
209
+ ### Quick Start (Recommended for Most Users)
210
+
211
+ 1. **Upload Files**: Select one or more `.txt` files containing Tibetan Unicode text.
212
+ 2. **Choose a Preset**: In the "Quick Start" tab, select an analysis type:
213
+
214
+ | Preset | What it does | Best for |
215
+ |--------|--------------|----------|
216
+ | 📊 **Standard** | Vocabulary + Sequences + Fuzzy matching | Most comparisons |
217
+ | 🧠 **Deep** | All metrics including AI meaning analysis | Finding semantic parallels |
218
+ | **Quick** | Vocabulary overlap only | Fast initial scan |
219
+
220
+ 3. **Click "Compare My Texts"**: Results appear below with heatmaps and downloadable CSV.
221
+
222
+ ### Custom Analysis (Advanced Users)
223
+
224
+ For fine-grained control, use the "Custom" tab:
225
+
226
+ - **Lexical Metrics**: Configure tokenization (word/syllable), stopword filtering, and particle normalization
227
+ - **Sequence Matching (LCS)**: Enable/disable and choose normalization mode (avg/min/max)
228
+ - **Fuzzy Matching**: Choose method (N-gram, Syllable Edit, or Weighted Jaccard)
229
+ - **Semantic Analysis**: Enable AI-based meaning comparison with model selection
230
+
231
+ ### Viewing Results
232
+
233
+ - **Metrics Preview**: Summary table of similarity scores
234
+ - **Heatmaps**: Visual comparison across all chapter pairs (darker = more similar)
235
+ - **Word Counts**: Bar chart showing segment lengths
236
+ - **CSV Download**: Full results for further analysis
237
+
238
+ ### AI Interpretation (Optional)
239
+
240
+ After running analysis, click "Help Interpret Results" for scholarly insights:
241
+ - Pattern identification across chapters
242
+ - Notable textual relationships
243
+ - Suggestions for further investigation
244
 
245
  ## Embedding Model
246
 
247
+ Semantic similarity uses Hugging Face sentence-transformer models. The following models are available:
248
+
249
+ - **`buddhist-nlp/buddhist-sentence-similarity`** (default, recommended): Developed by [Dharmamitra](https://huggingface.co/buddhist-nlp), this model is specifically trained for sentence similarity on Buddhist texts in Tibetan, Buddhist Chinese, Sanskrit (IAST), and Pāli. Best choice for Tibetan Buddhist manuscripts.
250
+ - **`buddhist-nlp/bod-eng-similarity`**: Also from Dharmamitra, optimized for Tibetan-English bitext alignment tasks.
251
+ - **`sentence-transformers/LaBSE`**: General multilingual model, good baseline for non-Buddhist texts.
252
+ - **`BAAI/bge-m3`**: Strong multilingual alternative with broad language coverage.
253
+
254
+ These models provide context-aware, segment-level embeddings suitable for comparing Tibetan text passages.
255
 
256
  ## Structure
257
 
258
  - `app.py` — Gradio web app entry point and UI definition.
259
  - `pipeline/` — Modules for file handling, text processing, metrics calculation, and visualization.
260
  - `process.py`: Core logic for segmenting texts and orchestrating metric computation.
261
+ - `metrics.py`: Implementation of Jaccard, LCS, Fuzzy, and Semantic Similarity.
262
  - `hf_embedding.py`: Handles loading and using sentence-transformer models.
263
  - `tokenize.py`: Tibetan text tokenization using `botok`.
264
+ - `normalize_bo.py`: Tibetan particle normalization for grammatical variants.
265
+ - `stopwords_bo.py`: Comprehensive Tibetan stopword list with tsek normalization.
266
  - `visualize.py`: Generates heatmaps and word count plots.
267
  - `requirements.txt` — Python dependencies for the web application.
268
 
 
284
  author = {Daniel Wojahn},
285
  year = {2025},
286
  url = {https://github.com/daniel-wojahn/tibetan-text-metrics},
287
+ version = {0.4.0}
288
  }
289
  ```
290
 
app.py CHANGED
@@ -17,14 +17,16 @@ load_dotenv()
17
 
18
  logger = logging.getLogger(__name__)
19
  def main_interface():
 
 
20
  with gr.Blocks(
21
  theme=tibetan_theme,
22
- title="Tibetan Text Metrics Web App",
23
- css=tibetan_theme.get_css_string() + ".metric-description, .step-box { padding: 1.5rem !important; }"
24
  ) as demo:
25
  gr.Markdown(
26
- """# Tibetan Text Metrics Web App
27
- <span style='font-size:18px;'>A user-friendly web application for analyzing textual similarities and variations in Tibetan manuscripts, providing a graphical interface to the core functionalities of the [Tibetan Text Metrics (TTM)](https://github.com/daniel-wojahn/tibetan-text-metrics) project. Powered by advanced language models via OpenRouter for in-depth text analysis.</span>
28
  """,
29
 
30
  elem_classes="gr-markdown",
@@ -35,93 +37,174 @@ def main_interface():
35
  with gr.Group(elem_classes="step-box"):
36
  gr.Markdown(
37
  """
38
- ## Step 1: Upload Your Tibetan Text Files
39
- <span style='font-size:16px;'>Upload two or more `.txt` files. Each file should contain Unicode Tibetan text, segmented into chapters/sections if possible using the marker '༈' (<i>sbrul shad</i>).</span>
40
  """,
41
  elem_classes="gr-markdown",
42
  )
43
  file_input = gr.File(
44
- label="Upload Tibetan .txt files",
45
  file_types=[".txt"],
46
  file_count="multiple",
47
  )
48
  gr.Markdown(
49
- "<small>Note: Maximum file size: 10MB per file. For optimal performance, use files under 1MB.</small>",
50
  elem_classes="gr-markdown"
51
  )
52
  with gr.Column(scale=1, elem_classes="step-column"):
53
  with gr.Group(elem_classes="step-box"):
54
  gr.Markdown(
55
- """## Step 2: Configure and run the analysis
56
- <span style='font-size:16px;'>Choose your analysis options and click the button below to compute metrics and view results. For meaningful analysis, ensure your texts are segmented by chapter or section using the marker '༈' (<i>sbrul shad</i>). The tool will split files based on this marker.</span>
57
  """,
58
  elem_classes="gr-markdown",
59
  )
60
- semantic_toggle_radio = gr.Radio(
61
- label="Compute semantic similarity? (Experimental)",
62
- choices=["Yes", "No"],
63
- value="No",
64
- info="Semantic similarity will be time-consuming. Choose 'No' to speed up analysis if these metrics are not required.",
65
- elem_id="semantic-radio-group",
66
- )
67
-
68
- model_dropdown = gr.Dropdown(
69
- choices=[
70
- "sentence-transformers/LaBSE"
71
- ],
72
- label="Select Embedding Model",
73
- value="sentence-transformers/LaBSE",
74
- info="Select the embedding model to use for semantic similarity analysis. Only Hugging Face sentence-transformers are supported."
75
- )
76
-
77
- with gr.Accordion("Advanced Options", open=False):
78
- batch_size_slider = gr.Slider(
79
- minimum=1,
80
- maximum=64,
81
- value=8,
82
- step=1,
83
- label="Batch Size (for Hugging Face models)",
84
- info="Adjust based on your hardware (VRAM). Lower this if you encounter memory issues."
85
- )
86
- progress_bar_checkbox = gr.Checkbox(
87
- label="Show Embedding Progress Bar",
88
- value=False,
89
- info="Display a progress bar during embedding generation. Useful for large datasets."
90
- )
91
-
92
- stopwords_dropdown = gr.Dropdown(
93
- label="Stopword Filtering",
94
- choices=[
95
- "None (No filtering)",
96
- "Standard (Common particles only)",
97
- "Aggressive (All function words)"
98
- ],
99
- value="Standard (Common particles only)", # Default
100
- info="Choose how aggressively to filter out common Tibetan particles and function words when calculating similarity. This helps focus on meaningful content words."
101
- )
102
-
103
- fuzzy_toggle_radio = gr.Radio(
104
- label="Enable Fuzzy String Matching",
105
- choices=["Yes", "No"],
106
- value="Yes",
107
- info="Fuzzy matching helps detect similar but not identical text segments. Useful for identifying variations and modifications."
108
- )
109
-
110
- fuzzy_method_dropdown = gr.Dropdown(
111
- label="Fuzzy Matching Method",
112
- choices=[
113
- "token_set - Order-independent matching",
114
- "token_sort - Order-normalized matching",
115
- "partial - Best partial matching",
116
- "ratio - Simple ratio matching"
117
- ],
118
- value="token_set - Order-independent matching",
119
- info="Select the fuzzy matching algorithm to use:\n\n• token_set: Best for texts with different word orders and partial overlaps. Compares unique words regardless of their order (recommended for Tibetan texts).\n\n• token_sort: Good for texts with different word orders but similar content. Sorts words alphabetically before comparing.\n\n• partial: Best for finding shorter strings within longer ones. Useful when one text is a fragment of another.\n\n• ratio: Simple Levenshtein distance ratio. Best for detecting small edits and typos in otherwise identical texts."
120
- )
121
 
122
- process_btn = gr.Button(
123
- "Run Analysis", elem_id="run-btn", variant="primary"
124
- )
125
 
126
  gr.Markdown(
127
  """## Results
@@ -131,165 +214,208 @@ def main_interface():
131
  # The heatmap_titles and metric_tooltips dictionaries are defined here
132
  # heatmap_titles = { ... }
133
  # metric_tooltips = { ... }
134
- csv_output = gr.File(label="Download CSV Results")
135
  metrics_preview = gr.Dataframe(
136
- label="Similarity Metrics Preview", interactive=False, visible=True
137
  )
138
  # States for data persistence
139
  state_text_data = gr.State()
140
  state_df_results = gr.State()
141
-
142
  # LLM Interpretation components
143
  with gr.Row():
144
  with gr.Column():
145
  gr.Markdown(
146
- "## AI Analysis\n*The AI will analyze your text similarities and provide insights into patterns and relationships.*",
147
  elem_classes="gr-markdown"
148
  )
149
-
150
  # Add the interpret button
151
  with gr.Row():
152
  interpret_btn = gr.Button(
153
- "Help Interpret Results",
154
  variant="primary",
155
  elem_id="interpret-btn"
156
  )
157
  # Create a placeholder message with proper formatting and structure
158
  initial_message = """
159
- ## Analysis of Tibetan Text Similarity Metrics
160
 
161
- <small>*Click the 'Help Interpret Results' button above to generate an AI-powered analysis of your similarity metrics.*</small>
162
  """
163
  interpretation_output = gr.Markdown(
164
  value=initial_message,
165
  elem_id="llm-analysis"
166
  )
167
-
168
  # Heatmap tabs for each metric
169
  heatmap_titles = {
170
- "Jaccard Similarity (%)": "Higher scores mean more shared unique words.",
171
- "Normalized LCS": "Higher scores mean longer shared sequences of words.",
172
- "Fuzzy Similarity": "Higher scores mean more similar text with fuzzy matching tolerance for variations.",
173
- "Semantic Similarity": "Higher scores mean more similar meanings.",
174
- "Word Counts": "Word Counts: Bar chart showing the number of words in each segment after tokenization.",
175
  }
176
 
177
  metric_tooltips = {
178
  "Jaccard Similarity (%)": """
179
- ### Jaccard Similarity (%)
180
- This metric quantifies the lexical overlap between two text segments by comparing their sets of *unique* words, optionally filtering out common Tibetan stopwords.
 
181
 
182
- It essentially answers the question: 'Of all the distinct words found across these two segments, what proportion of them are present in both?' It is calculated as `(Number of common unique words) / (Total number of unique words in both texts combined) * 100`.
183
 
184
- Jaccard Similarity is insensitive to word order and word frequency; it only cares whether a unique word is present or absent. A higher percentage indicates a greater overlap in the vocabularies used in the two segments.
 
 
 
185
 
186
- **Stopword Filtering**: When enabled (via the "Filter Stopwords" checkbox), common Tibetan particles and function words are filtered out before comparison. This helps focus on meaningful content words rather than grammatical elements.
 
 
 
 
187
  """,
188
  "Fuzzy Similarity": """
189
- ### Fuzzy Similarity
190
- This metric measures the approximate string similarity between text segments using fuzzy matching algorithms from TheFuzz library. Unlike exact matching metrics, fuzzy similarity can detect similarities even when texts contain variations, misspellings, or different word orders.
 
 
 
191
 
192
- Fuzzy similarity is particularly useful for Tibetan texts that may have orthographic variations, scribal differences, or regional spelling conventions. It provides a score between 0 and 1, where higher values indicate greater similarity.
 
 
 
193
 
194
- **Available Methods**:
195
- - **Token Set Ratio**: Compares the unique words in each text regardless of order (best for texts with different word arrangements)
196
- - **Token Sort Ratio**: Normalizes word order before comparison (good for texts with similar content but different ordering)
197
- - **Partial Ratio**: Finds the best matching substring (useful for texts where one is contained within the other)
198
- - **Simple Ratio**: Direct character-by-character comparison (best for detecting minor variations)
199
 
200
- **Stopword Filtering**: When enabled, common Tibetan particles and function words are filtered out before comparison, focusing on meaningful content words.
 
 
 
201
  """,
202
  "Normalized LCS": """
203
- ### Normalized LCS (Longest Common Subsequence)
204
- This metric measures the length of the longest sequence of words that appears in *both* text segments, maintaining their original relative order.
205
- Importantly, these words do not need to be directly adjacent (contiguous) in either text.
206
- For example, if Text A is '<u>the</u> quick <u>brown</u> fox <u>jumps</u>' and Text B is '<u>the</u> lazy cat and <u>brown</u> dog <u>jumps</u> high', the LCS is 'the brown jumps'.
207
- The length of this common subsequence is then normalized (in this tool, by dividing by the length of the longer of the two segments) to provide a score, which is then presented as a percentage.
208
- A higher Normalized LCS score suggests more significant shared phrasing, direct textual borrowing, or strong structural parallelism, as it reflects similarities in how ideas are ordered and expressed sequentially.
209
 
210
- **No Stopword Filtering.** Unlike metrics such as Jaccard Similarity or TF-IDF Cosine Similarity (which typically filter out common stopwords to focus on content-bearing words), the LCS calculation in this tool intentionally uses the raw, unfiltered sequence of tokens from your texts. This design choice allows LCS to capture structural similarities and the flow of language, including the use of particles and common words that contribute to sentence construction and narrative sequence. By not removing stopwords, LCS can reveal similarities in phrasing and textual structure that might otherwise be obscured, making it a valuable complement to metrics that focus purely on lexical overlap of keywords.
211
 
212
- **Note on Interpretation**: It is possible for Normalized LCS to be higher than Jaccard Similarity. This often happens when texts share a substantial 'narrative backbone' or common ordered phrases (leading to a high LCS), even if they use varied surrounding vocabulary or introduce many unique words not part of these core sequences (which would lower the Jaccard score). LCS highlights this sequential, structural similarity, while Jaccard focuses on the overall shared vocabulary regardless of its arrangement.
 
 
 
 
 
 
 
 
 
213
  """,
214
  "Semantic Similarity": """
215
- ### Semantic Similarity
216
- This metric measures similarity in meaning between text segments using sentence-transformer models from Hugging Face (e.g., LaBSE). Text segments are embedded into high-dimensional vectors and compared via cosine similarity. Scores closer to 1 indicate higher semantic overlap.
217
-
218
- Key points:
219
- - Context-aware embeddings capture nuanced meanings and relationships.
220
- - Designed for sentence/segment-level representations, not just words.
221
- - Works well alongside Jaccard and LCS for a holistic view.
222
- - Stopword filtering: When enabled, common Tibetan particles and function words are filtered before embedding to focus on content-bearing terms.
 
 
 
 
 
 
 
 
 
 
 
223
  """,
224
  "Word Counts": """
225
- ### Word Counts per Segment
226
- This chart displays the number of words in each segment of your texts after tokenization.
 
 
 
227
 
228
- The word count is calculated after applying the selected tokenization and stopword filtering options. This visualization helps you understand the relative sizes of different text segments and can reveal patterns in text structure across your documents.
 
 
 
229
 
230
- **Key points**:
231
- - Longer bars indicate segments with more words
232
- - Segments are grouped by source document
233
- - Useful for identifying structural patterns and content distribution
234
- - Can help explain similarity metric variations (longer texts may show different patterns)
235
  """,
236
  "Structural Analysis": """
237
- ### Structural Analysis
238
- This advanced analysis examines the structural relationships between text segments across your documents. It identifies patterns of similarity and difference that may indicate textual dependencies, common sources, or editorial modifications.
239
 
240
- The structural analysis combines multiple similarity metrics to create a comprehensive view of how text segments relate to each other, highlighting potential stemmatic relationships and textual transmission patterns.
241
 
242
- **Key points**:
243
- - Identifies potential source-target relationships between texts
244
- - Visualizes text reuse patterns across segments
245
- - Helps reconstruct possible stemmatic relationships
246
- - Provides insights into textual transmission and editorial history
247
 
248
- **Note**: This analysis is computationally intensive and only available after the initial metrics calculation is complete.
 
 
 
 
 
249
  """
250
 
251
  }
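Reviewer sketch for the Normalized LCS tooltip above: a minimal, self-contained version of the calculation it describes, using the tooltip's own example sentences. The function names `lcs_length` and `normalized_lcs` are illustrative assumptions and are not taken from the TTM codebase, which may use a different (for example Cython) implementation.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    # prev[j] holds the LCS length for the tokens of `a` seen so far vs b[:j].
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def normalized_lcs(a: Sequence[str], b: Sequence[str]) -> float:
    """LCS length divided by the length of the longer segment."""
    longer = max(len(a), len(b))
    return lcs_length(a, b) / longer if longer else 0.0


text_a = "the quick brown fox jumps".split()
text_b = "the lazy cat and brown dog jumps high".split()
score = normalized_lcs(text_a, text_b)
```

Here the shared subsequence is "the brown jumps" (3 tokens) and the longer segment has 8 tokens, so the score is 3/8.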
252
  heatmap_tabs = {}
253
- gr.Markdown("## Detailed Metric Analysis", elem_classes="gr-markdown")
254
-
255
  with gr.Tabs(elem_id="heatmap-tab-group"):
256
  # Process all metrics
257
  metrics_to_display = heatmap_titles
258
-
259
  for metric_key, descriptive_title in metrics_to_display.items():
260
  with gr.Tab(metric_key):
261
  # Set CSS class based on metric type
262
  if metric_key == "Jaccard Similarity (%)":
263
  css_class = "metric-info-accordion jaccard-info"
264
- accordion_title = "Understanding Vocabulary Overlap"
265
  elif metric_key == "Normalized LCS":
266
  css_class = "metric-info-accordion lcs-info"
267
- accordion_title = "Understanding Sequence Patterns"
268
  elif metric_key == "Fuzzy Similarity":
269
  css_class = "metric-info-accordion fuzzy-info"
270
- accordion_title = "Understanding Fuzzy Matching"
271
  elif metric_key == "Semantic Similarity":
272
  css_class = "metric-info-accordion semantic-info"
273
- accordion_title = "Understanding Meaning Similarity"
274
  elif metric_key == "Word Counts":
275
  css_class = "metric-info-accordion wordcount-info"
276
- accordion_title = "Understanding Text Length"
277
  else:
278
  css_class = "metric-info-accordion"
279
- accordion_title = f"About {metric_key}"
280
-
281
  # Create the accordion with appropriate content
282
  with gr.Accordion(accordion_title, open=False, elem_classes=css_class):
283
  if metric_key == "Word Counts":
284
  gr.Markdown("""
285
- ### Word Counts per Segment
286
- This chart displays the number of words in each segment of your texts after tokenization.
 
 
 
287
  """)
288
  elif metric_key in metric_tooltips:
289
  gr.Markdown(value=metric_tooltips[metric_key], elem_classes="metric-description")
290
  else:
291
  gr.Markdown(value=f"### {metric_key}\nDescription not found.")
292
-
293
  # Add the appropriate plot
294
  if metric_key == "Word Counts":
295
  word_count_plot = gr.Plot(label="Word Counts per Segment", show_label=False, scale=1, elem_classes="metric-description")
@@ -302,26 +428,28 @@ The structural analysis combines multiple similarity metrics to create a compreh
302
  # The visual placement depends on how Gradio renders children of gr.Tab or if there's another container.
303
 
304
  warning_box = gr.Markdown(visible=False)
305
-
306
  # Create a container for metric progress indicators
307
  with gr.Row(visible=False) as progress_container:
308
  # Progress indicators will be created dynamically by ProgressiveUI
309
  gr.Markdown("Metric progress will appear here during analysis")
310
 
311
- def run_pipeline(files, enable_semantic, enable_fuzzy, fuzzy_method, model_name, stopwords_option, batch_size, show_progress, progress=gr.Progress()):
312
  """Processes uploaded files, computes metrics, generates visualizations, and prepares outputs for the UI.
313
-
314
  Args:
315
  files: A list of file objects uploaded by the user.
316
  enable_semantic: Whether to compute semantic similarity.
317
  enable_fuzzy: Whether to compute fuzzy string similarity.
318
  fuzzy_method: The fuzzy matching method to use.
319
  model_name: Name of the embedding model to use.
 
320
  stopwords_option: Stopword filtering level (None, Standard, or Aggressive).
 
321
  batch_size: Batch size for embedding generation.
322
  show_progress: Whether to show progress bars during embedding.
323
  progress: Gradio progress indicator.
324
-
325
  Returns:
326
  tuple: Results for UI components including metrics, visualizations, and state.
327
  """
@@ -336,7 +464,7 @@ The structural analysis combines multiple similarity metrics to create a compreh
336
  warning_update_res = gr.update(visible=False)
337
  state_text_data_res = None
338
  state_df_results_res = None
339
-
340
  # Create a ProgressiveUI instance for handling progressive updates
341
  progressive_ui = ProgressiveUI(
342
  metrics_preview=metrics_preview,
@@ -349,10 +477,10 @@ The structural analysis combines multiple similarity metrics to create a compreh
349
  progress_container=progress_container,
350
  heatmap_titles=heatmap_titles
351
  )
352
-
353
  # Make progress container visible during analysis
354
  progress_container.update(visible=True)
355
-
356
  # Create a progressive callback function
357
  progressive_callback = create_progressive_callback(progressive_ui)
358
  # Check if files are provided
@@ -369,7 +497,7 @@ The structural analysis combines multiple similarity metrics to create a compreh
369
  None, # state_text_data
370
  None # state_df_results
371
  )
372
-
373
  # Check file size limits (10MB per file)
374
  for file in files:
375
  file_size_mb = Path(file.name).stat().st_size / (1024 * 1024)
@@ -393,13 +521,13 @@ The structural analysis combines multiple similarity metrics to create a compreh
393
  progress(0.1, desc="Preparing files...")
394
  except Exception as e:
395
  logger.warning(f"Progress update error (non-critical): {e}")
396
-
397
  # Get filenames and read file contents
398
  filenames = [
399
  Path(file.name).name for file in files
400
  ] # Use Path().name to get just the filename
401
  text_data = {}
402
-
403
  # Read files with progress updates
404
  for i, file in enumerate(files):
405
  file_path = Path(file.name)
@@ -409,7 +537,7 @@ The structural analysis combines multiple similarity metrics to create a compreh
409
  progress(0.1 + (0.1 * (i / len(files))), desc=f"Reading file: {filename}")
410
  except Exception as e:
411
  logger.warning(f"Progress update error (non-critical): {e}")
412
-
413
  try:
414
  text_data[filename] = file_path.read_text(encoding="utf-8-sig")
415
  except UnicodeDecodeError:
@@ -433,21 +561,27 @@ The structural analysis combines multiple similarity metrics to create a compreh
433
  # Configure semantic similarity and fuzzy matching
434
  enable_semantic_bool = enable_semantic == "Yes"
435
  enable_fuzzy_bool = enable_fuzzy == "Yes"
436
-
437
  # Extract the fuzzy method from the dropdown value
438
- fuzzy_method_value = fuzzy_method.split(' - ')[0] if fuzzy_method else 'token_set'
439
-
 
 
 
 
 
 
440
  if progress is not None:
441
  try:
442
  progress(0.2, desc="Loading model..." if enable_semantic_bool else "Processing text...")
443
  except Exception as e:
444
  logger.warning(f"Progress update error (non-critical): {e}")
445
-
446
  # Process texts with selected model
447
  # Convert stopword option to appropriate parameters
448
  use_stopwords = stopwords_option != "None (No filtering)"
449
  use_lite_stopwords = stopwords_option == "Standard (Common particles only)"
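Reviewer sketch: the dropdown-to-parameter conversions in this hunk can be isolated into small pure functions, which makes them easier to unit-test than inline expressions. The helper names `parse_dropdown_key` and `stopword_flags` are hypothetical; only the string literals come from the diff.

```python
from typing import Optional, Tuple


def parse_dropdown_key(value: Optional[str], default: str) -> str:
    """Return the machine key before ' - ' in a 'key - description' label."""
    if not value:
        return default
    return value.split(" - ", 1)[0].strip()


def stopword_flags(option: str) -> Tuple[bool, bool]:
    """Map the stopword dropdown label to (use_stopwords, use_lite_stopwords)."""
    use_stopwords = option != "None (No filtering)"
    use_lite_stopwords = option == "Standard (Common particles only)"
    return use_stopwords, use_lite_stopwords
```

Keeping the label strings in one place would also reduce the risk of the UI text and the comparison literals drifting apart.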
450
-
451
  # For Hugging Face models, the UI value is the correct model ID
452
  internal_model_id = model_name
453
 
@@ -457,9 +591,12 @@ The structural analysis combines multiple similarity metrics to create a compreh
457
  enable_semantic=enable_semantic_bool,
458
  enable_fuzzy=enable_fuzzy_bool,
459
  fuzzy_method=fuzzy_method_value,
 
460
  model_name=internal_model_id,
461
  use_stopwords=use_stopwords,
462
  use_lite_stopwords=use_lite_stopwords,
 
 
463
  progress_callback=progress,
464
  progressive_callback=progressive_callback,
465
  batch_size=batch_size,
@@ -479,12 +616,12 @@ The structural analysis combines multiple similarity metrics to create a compreh
479
  progress(0.8, desc="Generating visualizations...")
480
  except Exception as e:
481
  logger.warning(f"Progress update error (non-critical): {e}")
482
-
483
  # heatmap_titles is already defined in the outer scope of main_interface
484
  heatmaps_data = generate_visualizations(
485
  df_results, descriptive_titles=heatmap_titles
486
  )
487
-
488
  # Generate word count chart
489
  if progress is not None:
490
  try:
@@ -492,12 +629,12 @@ The structural analysis combines multiple similarity metrics to create a compreh
492
  except Exception as e:
493
  logger.warning(f"Progress update error (non-critical): {e}")
494
  word_count_fig_res = generate_word_count_chart(word_counts_df_data)
495
-
496
  # Store state data for potential future use
497
  state_text_data_res = text_data
498
  state_df_results_res = df_results
499
  logger.info("Analysis complete, storing state data")
500
-
501
  # Save results to CSV
502
  if progress is not None:
503
  try:
@@ -506,7 +643,7 @@ The structural analysis combines multiple similarity metrics to create a compreh
506
  logger.warning(f"Progress update error (non-critical): {e}")
507
  csv_path_res = "results.csv"
508
  df_results.to_csv(csv_path_res, index=False)
509
-
510
  # Prepare final output
511
  warning_md = f"**⚠️ Warning:** {warning_raw}" if warning_raw else ""
512
  metrics_preview_df_res = df_results.head(10)
@@ -514,10 +651,7 @@ The structural analysis combines multiple similarity metrics to create a compreh
514
  jaccard_heatmap_res = heatmaps_data.get("Jaccard Similarity (%)")
515
  lcs_heatmap_res = heatmaps_data.get("Normalized LCS")
516
  fuzzy_heatmap_res = heatmaps_data.get("Fuzzy Similarity")
517
- semantic_heatmap_res = heatmaps_data.get(
518
- "Semantic Similarity"
519
- )
520
- # TF-IDF has been completely removed
521
  warning_update_res = gr.update(
522
  visible=bool(warning_raw), value=warning_md
523
  )
@@ -546,27 +680,27 @@ The structural analysis combines multiple similarity metrics to create a compreh
546
  try:
547
  if not csv_path or not Path(csv_path).exists():
548
  return "Please run the analysis first to generate results."
549
-
550
  # Read the CSV file
551
  df_results = pd.read_csv(csv_path)
552
-
553
  # Show detailed progress messages with percentages
554
  progress(0, desc="Preparing data for analysis...")
555
  progress(0.1, desc="Analyzing similarity patterns...")
556
  progress(0.2, desc="Connecting to Mistral 7B via OpenRouter...")
557
-
558
  # Get interpretation from LLM (using OpenRouter API)
559
  progress(0.3, desc="Generating scholarly interpretation (this may take 20-40 seconds)...")
560
  llm_service = LLMService()
561
  interpretation = llm_service.analyze_similarity(df_results)
562
-
563
  # Simulate completion steps
564
  progress(0.9, desc="Formatting results...")
565
  progress(0.95, desc="Applying scholarly formatting...")
566
-
567
  # Completed
568
  progress(1.0, desc="Analysis complete!")
569
-
570
  # Add a timestamp to the interpretation
571
  timestamp = datetime.now().strftime("%Y-%m-%d %H:%M")
572
  interpretation = f"{interpretation}\n\n<small>Analysis generated on {timestamp}</small>"
@@ -574,36 +708,92 @@ The structural analysis combines multiple similarity metrics to create a compreh
574
  except Exception as e:
575
  logger.error(f"Error in interpret_results: {e}", exc_info=True)
576
  return f"Error interpreting results: {str(e)}"
577
-
578
- process_btn.click(
579
  fn=run_pipeline,
580
- inputs=[file_input, semantic_toggle_radio, fuzzy_toggle_radio, fuzzy_method_dropdown, model_dropdown, stopwords_dropdown, batch_size_slider, progress_bar_checkbox],
581
- outputs=[
582
- csv_output,
583
- metrics_preview,
584
- word_count_plot,
585
- heatmap_tabs["Jaccard Similarity (%)"],
586
- heatmap_tabs["Normalized LCS"],
587
- heatmap_tabs["Fuzzy Similarity"],
588
- heatmap_tabs["Semantic Similarity"],
589
- warning_box,
590
- state_text_data,
591
- state_df_results,
592
- ]
 
593
  )
594
 
595
  # Structural analysis functionality removed - see dedicated collation app
596
-
597
  # Connect the interpret button
598
  interpret_btn.click(
599
  fn=interpret_results,
600
  inputs=[csv_output],
601
  outputs=interpretation_output
602
  )
603
-
604
  return demo
605
 
606
 
607
  if __name__ == "__main__":
608
  demo = main_interface()
609
- demo.launch()
 
17
 
18
  logger = logging.getLogger(__name__)
19
  def main_interface():
20
+ # Theme and CSS applied here for Gradio 5.x compatibility
21
+ # For Gradio 6.x, these will move to launch() - see migration guide
22
  with gr.Blocks(
23
  theme=tibetan_theme,
24
+ css=tibetan_theme.get_css_string(),
25
+ title="Tibetan Text Metrics Web App"
26
  ) as demo:
27
  gr.Markdown(
28
+ """# Tibetan Text Metrics
29
+ <span style='font-size:18px;'>Compare Tibetan texts to discover how similar they are. This tool helps scholars identify shared passages, textual variations, and relationships between different versions of Tibetan manuscripts. Part of the <a href="https://github.com/daniel-wojahn/tibetan-text-metrics" target="_blank">TTM project</a>.</span>
30
  """,
31
 
32
  elem_classes="gr-markdown",
 
37
  with gr.Group(elem_classes="step-box"):
38
  gr.Markdown(
39
  """
40
+ ## Step 1: Upload Your Texts
41
+ <span style='font-size:16px;'>Upload two or more Tibetan text files (.txt format). If your texts have chapters, separate them with the ༈ marker so the tool can compare chapter-by-chapter.</span>
42
  """,
43
  elem_classes="gr-markdown",
44
  )
45
  file_input = gr.File(
46
+ label="Choose your Tibetan text files",
47
  file_types=[".txt"],
48
  file_count="multiple",
49
  )
50
  gr.Markdown(
51
+ "<small>Tip: Files should be under 1MB for best performance. Use UTF-8 encoded .txt files.</small>",
52
  elem_classes="gr-markdown"
53
  )
54
  with gr.Column(scale=1, elem_classes="step-column"):
55
  with gr.Group(elem_classes="step-box"):
56
  gr.Markdown(
57
+ """## Step 2: Choose Analysis Type
58
+ <span style='font-size:16px;'>Pick a preset for quick results, or use Custom for full control.</span>
59
  """,
60
  elem_classes="gr-markdown",
61
  )
62
 
63
+ with gr.Tabs():
64
+ # ===== QUICK START TAB =====
65
+ with gr.Tab("Quick Start", id="quick_tab"):
66
+ analysis_preset = gr.Radio(
67
+ label="What kind of analysis do you need?",
68
+ choices=[
69
+ "📊 Standard — Vocabulary + Sequences + Fuzzy matching",
70
+ "🧠 Deep — All metrics including AI meaning analysis",
71
+ "⚡ Quick — Vocabulary overlap only (fastest)"
72
+ ],
73
+ value="📊 Standard — Vocabulary + Sequences + Fuzzy matching",
74
+ info="Standard is recommended for most users. Deep analysis takes longer but finds texts with similar meaning even when words differ."
75
+ )
76
+
77
+ gr.Markdown("""
78
+ **What each preset includes:**
79
+
80
+ | Preset | Jaccard | LCS | Fuzzy | Semantic AI |
81
+ |--------|---------|-----|-------|-------------|
82
+ | 📊 Standard | ✓ | ✓ | ✓ | — |
83
+ | 🧠 Deep | ✓ | ✓ | ✓ | ✓ |
84
+ | ⚡ Quick | ✓ | — | — | — |
85
+ """, elem_classes="preset-table")
86
+
87
+ process_btn_quick = gr.Button(
88
+ "🔍 Compare My Texts", elem_id="run-btn-quick", variant="primary"
89
+ )
90
+
91
+ # ===== CUSTOM TAB =====
92
+ with gr.Tab("Custom", id="custom_tab"):
93
+ gr.Markdown("**Fine-tune each metric and option:**", elem_classes="custom-header")
94
+
95
+ with gr.Accordion("📊 Lexical Metrics", open=True):
96
+ gr.Markdown("*Compare the actual words used in texts*")
97
+
98
+ tokenization_mode_dropdown = gr.Dropdown(
99
+ label="How to split text?",
100
+ choices=[
101
+ "word - Whole words (recommended)",
102
+ "syllable - Individual syllables (finer detail)"
103
+ ],
104
+ value="word - Whole words (recommended)",
105
+ info="'Word' keeps multi-syllable words together — recommended for Jaccard."
106
+ )
107
+
108
+ stopwords_dropdown = gr.Dropdown(
109
+ label="Filter common words?",
110
+ choices=[
111
+ "None (No filtering)",
112
+ "Standard (Common particles only)",
113
+ "Aggressive (All function words)"
114
+ ],
115
+ value="Standard (Common particles only)",
116
+ info="Remove common particles (གི, ལ, ནི) before comparing."
117
+ )
118
+
119
+ particle_normalization_checkbox = gr.Checkbox(
120
+ label="Normalize grammatical particles?",
121
+ value=False,
122
+ info="Treat variants as equivalent (གི/ཀྱི/གྱི → གི). Useful for different scribal conventions."
123
+ )
124
+
125
+ with gr.Accordion("📏 Sequence Matching (LCS)", open=True):
126
+ gr.Markdown("*Find shared passages in the same order*")
127
+
128
+ gr.Checkbox(
129
+ label="Enable sequence matching",
130
+ value=True,
131
+ info="Finds the longest sequence of words appearing in both texts."
132
+ ) # LCS is always computed as a core metric
133
+
134
+ lcs_normalization_dropdown = gr.Dropdown(
135
+ label="How to handle different text lengths?",
136
+ choices=[
137
+ "avg - Balanced comparison (default)",
138
+ "min - Detect if one text contains the other",
139
+ "max - Stricter, penalizes length differences"
140
+ ],
141
+ value="avg - Balanced comparison (default)",
142
+ info="'min' is useful for finding quotes or excerpts."
143
+ )
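Reviewer sketch: one plausible reading of the three normalization options, assuming each divides the raw LCS length by the average, minimum, or maximum of the two segment lengths. The function `normalize_lcs` is hypothetical and the actual pipeline may define the modes differently.

```python
def normalize_lcs(lcs_len: int, len_a: int, len_b: int, mode: str = "avg") -> float:
    """Scale a raw LCS length by a denominator chosen by `mode`."""
    if mode == "avg":
        denom = (len_a + len_b) / 2
    elif mode == "min":
        denom = min(len_a, len_b)
    elif mode == "max":
        denom = max(len_a, len_b)
    else:
        raise ValueError(f"Unknown normalization mode: {mode!r}")
    return lcs_len / denom if denom else 0.0


# A 4-token quote fully contained in a 20-token segment.
quote_scores = {m: normalize_lcs(4, 4, 20, m) for m in ("avg", "min", "max")}
```

Under this assumption, "min" scores the contained quote as a full match, while "max" penalizes the length difference, which matches the dropdown descriptions.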
144
+
145
+ with gr.Accordion("🔍 Fuzzy Matching", open=True):
146
+ gr.Markdown("*Detect similar but not identical text*")
147
+
148
+ fuzzy_toggle_radio = gr.Radio(
149
+ label="Find approximate matches?",
150
+ choices=["Yes", "No"],
151
+ value="Yes",
152
+ info="Useful for spelling variations and scribal differences."
153
+ )
154
+
155
+ fuzzy_method_dropdown = gr.Dropdown(
156
+ label="Matching method",
157
+ choices=[
158
+ "ngram - Syllable pairs (recommended)",
159
+ "syllable_edit - Count syllable changes",
160
+ "weighted_jaccard - Word frequency comparison"
161
+ ],
162
+ value="ngram - Syllable pairs (recommended)",
163
+ info="All options work at the Tibetan syllable level."
164
+ )
165
+
166
+ with gr.Accordion("🧠 Semantic Analysis", open=False):
167
+ gr.Markdown("*Compare meaning using AI (slower)*")
168
+
169
+ semantic_toggle_radio = gr.Radio(
170
+ label="Analyze meaning similarity?",
171
+ choices=["Yes", "No"],
172
+ value="No",
173
+ info="Finds texts that say similar things in different words."
174
+ )
175
+
176
+ model_dropdown = gr.Dropdown(
177
+ choices=[
178
+ "buddhist-nlp/buddhist-sentence-similarity",
179
+ "buddhist-nlp/bod-eng-similarity",
180
+ "sentence-transformers/LaBSE",
181
+ "BAAI/bge-m3"
182
+ ],
183
+ label="AI Model",
184
+ value="buddhist-nlp/buddhist-sentence-similarity",
185
+ info="'buddhist-sentence-similarity' works best for Buddhist texts."
186
+ )
187
+
188
+ batch_size_slider = gr.Slider(
189
+ minimum=1,
190
+ maximum=64,
191
+ value=8,
192
+ step=1,
193
+ label="Processing batch size",
194
+ info="Higher = faster but uses more memory."
195
+ )
196
+
197
+ progress_bar_checkbox = gr.Checkbox(
198
+ label="Show detailed progress",
199
+ value=False,
200
+ info="See step-by-step progress during analysis."
201
+ )
202
+
203
+ process_btn_custom = gr.Button(
204
+ "🔍 Compare My Texts (Custom)", elem_id="run-btn-custom", variant="primary"
205
+ )
206
+
207
+ # Note: Both process_btn_quick and process_btn_custom are wired below
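Reviewer sketch: the preset table in the Quick Start tab could be backed by a small lookup so the radio label is resolved to metric flags in one place. `PresetConfig`, `PRESETS`, and `resolve_preset` are hypothetical names for illustration; the flag values mirror the table in this diff, where Jaccard is enabled in every preset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PresetConfig:
    enable_lcs: bool
    enable_fuzzy: bool
    enable_semantic: bool


# Keys are the preset names that appear inside the radio labels.
PRESETS = {
    "Standard": PresetConfig(enable_lcs=True, enable_fuzzy=True, enable_semantic=False),
    "Deep": PresetConfig(enable_lcs=True, enable_fuzzy=True, enable_semantic=True),
    "Quick": PresetConfig(enable_lcs=False, enable_fuzzy=False, enable_semantic=False),
}


def resolve_preset(label: str) -> PresetConfig:
    """Find the preset whose name occurs in the label; fall back to Standard."""
    for name, config in PRESETS.items():
        if name in label:
            return config
    return PRESETS["Standard"]
```

Matching on the preset name rather than the full label keeps the lookup stable if the emoji or description text changes.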
208
 
209
  gr.Markdown(
210
  """## Results
 
214
  # The heatmap_titles and metric_tooltips dictionaries are defined here
215
  # heatmap_titles = { ... }
216
  # metric_tooltips = { ... }
217
+ csv_output = gr.File(label="📥 Download Full Results (CSV spreadsheet)")
218
  metrics_preview = gr.Dataframe(
219
+ label="Results Summary — Compare chapters across your texts", interactive=False, visible=True
220
  )
221
  # States for data persistence
222
  state_text_data = gr.State()
223
  state_df_results = gr.State()
224
+
225
  # LLM Interpretation components
226
  with gr.Row():
227
  with gr.Column():
228
  gr.Markdown(
229
+ "## Get Expert Insights\n*Let AI help you understand what the numbers mean and what patterns they reveal about your texts.*",
230
  elem_classes="gr-markdown"
231
  )
232
+
233
  # Add the interpret button
234
  with gr.Row():
235
  interpret_btn = gr.Button(
236
+ "📊 Explain My Results",
237
  variant="primary",
238
  elem_id="interpret-btn"
239
  )
240
  # Create a placeholder message with proper formatting and structure
241
  initial_message = """
242
+ ## Understanding Your Results
243
 
244
+ <small>*After running the analysis, click "Explain My Results" to get a plain-language interpretation of what the similarity scores mean for your texts.*</small>
245
  """
246
  interpretation_output = gr.Markdown(
247
  value=initial_message,
248
  elem_id="llm-analysis"
249
  )
250
+
251
  # Heatmap tabs for each metric
252
  heatmap_titles = {
253
+ "Jaccard Similarity (%)": "Shows how much vocabulary the texts share. Higher = more words in common.",
254
+ "Normalized LCS": "Shows shared phrases in the same order. Higher = more passages appear in both texts.",
255
+ "Fuzzy Similarity": "Finds similar text even with spelling differences. Higher = more alike.",
256
+ "Semantic Similarity": "Compares actual meaning using AI. Higher = texts say similar things.",
257
+ "Word Counts": "How long is each section? Helps you understand text structure.",
258
  }
259
 
260
  metric_tooltips = {
261
  "Jaccard Similarity (%)": """
262
+ ### Vocabulary Overlap (Jaccard Similarity)
263
+
264
+ **What it measures:** How many unique words appear in both texts.
265
 
266
+ **How to read it:** A score of 70% means 70% of all unique words found in either text appear in both. Higher scores = more shared vocabulary.
267
 
268
+ **What it tells you:**
269
+ - High scores (>70%): Texts use very similar vocabulary — possibly the same source or direct copying
270
+ - Medium scores (40-70%): Texts share significant vocabulary — likely related topics or traditions
271
+ - Low scores (<40%): Texts use different words — different sources or heavily edited versions
272
 
273
+ **Good to know:** This metric ignores word order and how often words repeat. It only asks "does this word appear in both texts?"
274
+
275
+ **Tips:**
276
+ - Use the "Filter common words" option to focus on meaningful content words rather than grammatical particles.
277
+ - **Word mode is recommended** for Jaccard. Syllable mode may inflate scores because common syllables (like ས, ར, ན) appear in many different words.
278
  """,
279
  "Fuzzy Similarity": """
280
+ ### Approximate Matching (Fuzzy Similarity)
281
+
282
+ **What it measures:** How similar texts are, even when they're not exactly the same.
283
+
284
+ **How to read it:** Scores from 0 to 1. Higher = more similar. A score of 0.85 means the texts are 85% alike.
285
 
286
+ **What it tells you:**
287
+ - High scores (>0.8): Very similar texts with minor differences (spelling, small edits)
288
+ - Medium scores (0.5-0.8): Noticeably different but clearly related
289
+ - Low scores (<0.5): Substantially different texts
290
 
291
+ **Why it matters for Tibetan texts:**
292
+ - Catches spelling variations between manuscripts
293
+ - Finds scribal differences and regional conventions
294
+ - Identifies passages that were slightly modified
 
295
 
296
+ **Recommended methods:**
297
+ - **Syllable pairs (ngram)**: Best for Tibetan — compares pairs of syllables
298
+ - **Count syllable changes**: Good for finding minor edits
299
+ - **Word frequency**: Useful when certain words repeat often
300
  """,
301
  "Normalized LCS": """
302
+ ### Shared Sequences (Longest Common Subsequence)
303
+
304
+ **What it measures:** The longest chain of words that appears in both texts *in the same order*.
305
+
306
+ **How to read it:** Higher scores mean longer shared passages. A score of 0.6 means 60% of the text follows the same word sequence.
 
307
 
308
+ **Example:** If Text A says "the quick brown fox" and Text B says "the lazy brown dog", the shared sequence is "the brown": the words that appear in both, in the same order.
309
 
310
+ **What it tells you:**
311
+ - High scores (>0.6): Texts share substantial passages — likely direct copying or common source
312
+ - Medium scores (0.3-0.6): Some shared phrasing — possibly related traditions
313
+ - Low scores (<0.3): Different word ordering — independent compositions or heavy editing
314
+
315
+ **Why this is different from vocabulary overlap:**
316
+ - Vocabulary overlap asks: "Do they use the same words?"
317
+ - Sequence matching asks: "Do they say things in the same order?"
318
+
319
+ Two texts might share many words (high Jaccard) but arrange them differently (low LCS), suggesting they discuss similar topics but were composed independently.
320
  """,
321
  "Semantic Similarity": """
322
+ ### Meaning Similarity (Semantic Analysis)
323
+
324
+ **What it measures:** Whether texts convey similar *meaning*, even if they use different words.
325
+
326
+ **How to read it:** Scores from 0 to 1. Higher = more similar meaning. A score of 0.8 means the texts express very similar ideas.
327
+
328
+ **What it tells you:**
329
+ - High scores (>0.75): Texts say similar things, even if worded differently
330
+ - Medium scores (0.5-0.75): Related topics or themes
331
+ - Low scores (<0.5): Different subject matter
332
+
333
+ **How it works:** An AI embedding model (by default one trained on Buddhist texts) converts each passage into a vector, and the score is the cosine similarity between the two vectors. This catches similarities that word-matching would miss.
334
+
335
+ **When to use it:**
336
+ - Finding paraphrased passages
337
+ - Identifying texts that discuss the same concepts differently
338
+ - Comparing translations or commentaries
339
+
340
+ **Note:** This takes longer to compute but provides insights the other metrics can't.
341
  """,
342
  "Word Counts": """
343
+ ### Text Length by Section
344
+
345
+ **What it shows:** How many words are in each chapter or section of your texts.
346
+
347
+ **How to read it:** Taller bars = longer sections. Compare bars to see which parts of your texts are longer or shorter.
348
 
349
+ **What it tells you:**
350
+ - Similar bar heights across texts suggest similar structure
351
+ - Very different lengths might explain why similarity scores vary
352
+ - Helps identify which sections to examine more closely
353
 
354
+ **Tip:** If one text has much longer chapters, it might contain additional material not in the other version.
 
 
 
 
355
  """,
356
  "Structural Analysis": """
357
+ ### How Texts Relate to Each Other
 
358
 
359
+ **What it shows:** An overview of how your text sections connect and relate across documents.
360
 
361
+ **What it tells you:**
362
+ - Which sections are most similar to each other
363
+ - Possible patterns of copying or shared sources
364
+ - How texts might have evolved or been edited over time
 
365
 
366
+ **Useful for:**
367
+ - Understanding textual transmission history
368
+ - Identifying which version might be older or more original
369
+ - Finding sections that were added, removed, or modified
370
+
371
+ **Note:** This analysis combines all the other metrics to give you the big picture.
372
  """
373
 
374
  }
375
  heatmap_tabs = {}
376
+ gr.Markdown("## Visual Comparison", elem_classes="gr-markdown")
377
+
378
  with gr.Tabs(elem_id="heatmap-tab-group"):
379
  # Process all metrics
380
  metrics_to_display = heatmap_titles
381
+
382
  for metric_key, descriptive_title in metrics_to_display.items():
383
  with gr.Tab(metric_key):
384
  # Set CSS class based on metric type
385
  if metric_key == "Jaccard Similarity (%)":
386
  css_class = "metric-info-accordion jaccard-info"
387
+ accordion_title = "ℹ️ What does this mean?"
388
  elif metric_key == "Normalized LCS":
389
  css_class = "metric-info-accordion lcs-info"
390
+ accordion_title = "ℹ️ What does this mean?"
391
  elif metric_key == "Fuzzy Similarity":
392
  css_class = "metric-info-accordion fuzzy-info"
393
+ accordion_title = "ℹ️ What does this mean?"
394
  elif metric_key == "Semantic Similarity":
395
  css_class = "metric-info-accordion semantic-info"
396
+ accordion_title = "ℹ️ What does this mean?"
397
  elif metric_key == "Word Counts":
398
  css_class = "metric-info-accordion wordcount-info"
399
+ accordion_title = "ℹ️ What does this mean?"
400
  else:
401
  css_class = "metric-info-accordion"
402
+ accordion_title = f"ℹ️ About {metric_key}"
403
+
404
  # Create the accordion with appropriate content
405
  with gr.Accordion(accordion_title, open=False, elem_classes=css_class):
406
  if metric_key == "Word Counts":
407
  gr.Markdown("""
408
+ ### Text Length by Section
409
+
410
+ This chart shows how many words are in each chapter or section. Taller bars = longer sections.
411
+
412
+ **Why it matters:** If sections have very different lengths, it might explain differences in similarity scores.
413
  """)
414
  elif metric_key in metric_tooltips:
415
  gr.Markdown(value=metric_tooltips[metric_key], elem_classes="metric-description")
416
  else:
417
  gr.Markdown(value=f"### {metric_key}\nDescription not found.")
418
+
419
  # Add the appropriate plot
420
  if metric_key == "Word Counts":
421
  word_count_plot = gr.Plot(label="Word Counts per Segment", show_label=False, scale=1, elem_classes="metric-description")
 
428
  # The visual placement depends on how Gradio renders children of gr.Tab or if there's another container.
429
 
430
  warning_box = gr.Markdown(visible=False)
431
+
432
  # Create a container for metric progress indicators
433
  with gr.Row(visible=False) as progress_container:
434
  # Progress indicators will be created dynamically by ProgressiveUI
435
  gr.Markdown("Metric progress will appear here during analysis")
436
 
437
+ def run_pipeline(files, enable_semantic, enable_fuzzy, fuzzy_method, lcs_normalization, model_name, tokenization_mode, stopwords_option, normalize_particles, batch_size, show_progress, progress=gr.Progress()):
438
  """Processes uploaded files, computes metrics, generates visualizations, and prepares outputs for the UI.
439
+
440
  Args:
441
  files: A list of file objects uploaded by the user.
442
  enable_semantic: Whether to compute semantic similarity.
443
  enable_fuzzy: Whether to compute fuzzy string similarity.
444
  fuzzy_method: The fuzzy matching method to use.
445
  model_name: Name of the embedding model to use.
446
+ tokenization_mode: How to tokenize text (syllable or word).
447
  stopwords_option: Stopword filtering level (None, Standard, or Aggressive).
448
+ normalize_particles: Whether to normalize grammatical particles.
449
  batch_size: Batch size for embedding generation.
450
  show_progress: Whether to show progress bars during embedding.
451
  progress: Gradio progress indicator.
452
+
453
  Returns:
454
  tuple: Results for UI components including metrics, visualizations, and state.
455
  """
 
464
  warning_update_res = gr.update(visible=False)
465
  state_text_data_res = None
466
  state_df_results_res = None
467
+
468
  # Create a ProgressiveUI instance for handling progressive updates
469
  progressive_ui = ProgressiveUI(
470
  metrics_preview=metrics_preview,
 
477
  progress_container=progress_container,
478
  heatmap_titles=heatmap_titles
479
  )
480
+
481
  # Make progress container visible during analysis
482
  progress_container.update(visible=True)
483
+
484
  # Create a progressive callback function
485
  progressive_callback = create_progressive_callback(progressive_ui)
486
  # Check if files are provided
 
497
  None, # state_text_data
498
  None # state_df_results
499
  )
500
+
501
  # Check file size limits (10MB per file)
502
  for file in files:
503
  file_size_mb = Path(file.name).stat().st_size / (1024 * 1024)
 
521
  progress(0.1, desc="Preparing files...")
522
  except Exception as e:
523
  logger.warning(f"Progress update error (non-critical): {e}")
524
+
525
  # Get filenames and read file contents
526
  filenames = [
527
  Path(file.name).name for file in files
528
  ] # Use Path().name to get just the filename
529
  text_data = {}
530
+
531
  # Read files with progress updates
532
  for i, file in enumerate(files):
533
  file_path = Path(file.name)
 
537
  progress(0.1 + (0.1 * (i / len(files))), desc=f"Reading file: {filename}")
538
  except Exception as e:
539
  logger.warning(f"Progress update error (non-critical): {e}")
540
+
541
  try:
542
  text_data[filename] = file_path.read_text(encoding="utf-8-sig")
543
  except UnicodeDecodeError:
 
561
  # Configure semantic similarity and fuzzy matching
562
  enable_semantic_bool = enable_semantic == "Yes"
563
  enable_fuzzy_bool = enable_fuzzy == "Yes"
564
+
565
  # Extract the fuzzy method from the dropdown value
566
+ fuzzy_method_value = fuzzy_method.split(' - ')[0] if fuzzy_method else 'ngram'
567
+
568
+ # Extract the LCS normalization from the dropdown value
569
+ lcs_normalization_value = lcs_normalization.split(' - ')[0] if lcs_normalization else 'avg'
570
+
571
+ # Extract the tokenization mode from the dropdown value
572
+ tokenization_mode_value = tokenization_mode.split(' - ')[0] if tokenization_mode else 'syllable'
573
+
574
  if progress is not None:
575
  try:
576
  progress(0.2, desc="Loading model..." if enable_semantic_bool else "Processing text...")
577
  except Exception as e:
578
  logger.warning(f"Progress update error (non-critical): {e}")
579
+
580
  # Process texts with selected model
581
  # Convert stopword option to appropriate parameters
582
  use_stopwords = stopwords_option != "None (No filtering)"
583
  use_lite_stopwords = stopwords_option == "Standard (Common particles only)"
584
+
585
  # For Hugging Face models, the UI value is the correct model ID
586
  internal_model_id = model_name
587
 
 
591
  enable_semantic=enable_semantic_bool,
592
  enable_fuzzy=enable_fuzzy_bool,
593
  fuzzy_method=fuzzy_method_value,
594
+ lcs_normalization=lcs_normalization_value,
595
  model_name=internal_model_id,
596
  use_stopwords=use_stopwords,
597
  use_lite_stopwords=use_lite_stopwords,
598
+ normalize_particles=normalize_particles,
599
+ tokenization_mode=tokenization_mode_value,
600
  progress_callback=progress,
601
  progressive_callback=progressive_callback,
602
  batch_size=batch_size,
 
616
  progress(0.8, desc="Generating visualizations...")
617
  except Exception as e:
618
  logger.warning(f"Progress update error (non-critical): {e}")
619
+
620
  # heatmap_titles is already defined in the outer scope of main_interface
621
  heatmaps_data = generate_visualizations(
622
  df_results, descriptive_titles=heatmap_titles
623
  )
624
+
625
  # Generate word count chart
626
  if progress is not None:
627
  try:
 
629
  except Exception as e:
630
  logger.warning(f"Progress update error (non-critical): {e}")
631
  word_count_fig_res = generate_word_count_chart(word_counts_df_data)
632
+
633
  # Store state data for potential future use
634
  state_text_data_res = text_data
635
  state_df_results_res = df_results
636
  logger.info("Analysis complete, storing state data")
637
+
638
  # Save results to CSV
639
  if progress is not None:
640
  try:
 
643
  logger.warning(f"Progress update error (non-critical): {e}")
644
  csv_path_res = "results.csv"
645
  df_results.to_csv(csv_path_res, index=False)
646
+
647
  # Prepare final output
648
  warning_md = f"**⚠️ Warning:** {warning_raw}" if warning_raw else ""
649
  metrics_preview_df_res = df_results.head(10)
 
651
  jaccard_heatmap_res = heatmaps_data.get("Jaccard Similarity (%)")
652
  lcs_heatmap_res = heatmaps_data.get("Normalized LCS")
653
  fuzzy_heatmap_res = heatmaps_data.get("Fuzzy Similarity")
654
+ semantic_heatmap_res = heatmaps_data.get("Semantic Similarity")
 
 
 
655
  warning_update_res = gr.update(
656
  visible=bool(warning_raw), value=warning_md
657
  )
 
680
  try:
681
  if not csv_path or not Path(csv_path).exists():
682
  return "Please run the analysis first to generate results."
683
+
684
  # Read the CSV file
685
  df_results = pd.read_csv(csv_path)
686
+
687
  # Show detailed progress messages with percentages
688
  progress(0, desc="Preparing data for analysis...")
689
  progress(0.1, desc="Analyzing similarity patterns...")
690
  progress(0.2, desc="Connecting to Mistral 7B via OpenRouter...")
691
+
692
  # Get interpretation from LLM (using OpenRouter API)
693
  progress(0.3, desc="Generating scholarly interpretation (this may take 20-40 seconds)...")
694
  llm_service = LLMService()
695
  interpretation = llm_service.analyze_similarity(df_results)
696
+
697
  # Simulate completion steps
698
  progress(0.9, desc="Formatting results...")
699
  progress(0.95, desc="Applying scholarly formatting...")
700
+
701
  # Completed
702
  progress(1.0, desc="Analysis complete!")
703
+
704
  # Add a timestamp to the interpretation
705
  timestamp = datetime.now().strftime("%Y-%m-%d %H:%M")
706
  interpretation = f"{interpretation}\n\n<small>Analysis generated on {timestamp}</small>"
 
708
  except Exception as e:
709
  logger.error(f"Error in interpret_results: {e}", exc_info=True)
710
  return f"Error interpreting results: {str(e)}"
711
+
712
+ def run_pipeline_preset(files, preset, progress=gr.Progress()):
713
+ """Wrapper that converts preset selection to pipeline parameters."""
714
+ # Determine settings based on preset
715
+ if "Quick" in preset:
716
+ # Quick: Jaccard only
717
+ enable_semantic = "No"
718
+ enable_fuzzy = "No"
719
+ elif "Deep" in preset:
720
+ # Deep: All metrics including semantic
721
+ enable_semantic = "Yes"
722
+ enable_fuzzy = "Yes"
723
+ else:
724
+ # Standard: Jaccard + LCS + Fuzzy (no semantic)
725
+ enable_semantic = "No"
726
+ enable_fuzzy = "Yes"
727
+
728
+ # Use sensible defaults for preset mode
729
+ fuzzy_method = "ngram - Syllable pairs (recommended)"
730
+ lcs_normalization = "avg - Balanced comparison (default)"
731
+ model_name = "buddhist-nlp/buddhist-sentence-similarity"
732
+ tokenization_mode = "word - Whole words (recommended)"
733
+ stopwords_option = "Standard (Common particles only)"
734
+ normalize_particles = False
735
+ batch_size = 8
736
+ show_progress = False
737
+
738
+ return run_pipeline(
739
+ files, enable_semantic, enable_fuzzy, fuzzy_method,
740
+ lcs_normalization, model_name, tokenization_mode,
741
+ stopwords_option, normalize_particles, batch_size,
742
+ show_progress, progress
743
+ )
744
+
745
+ # Output components for both buttons
746
+ pipeline_outputs = [
747
+ csv_output,
748
+ metrics_preview,
749
+ word_count_plot,
750
+ heatmap_tabs["Jaccard Similarity (%)"],
751
+ heatmap_tabs["Normalized LCS"],
752
+ heatmap_tabs["Fuzzy Similarity"],
753
+ heatmap_tabs["Semantic Similarity"],
754
+ warning_box,
755
+ state_text_data,
756
+ state_df_results,
757
+ ]
758
+
759
+ # Quick Start button uses presets
760
+ process_btn_quick.click(
761
+ fn=run_pipeline_preset,
762
+ inputs=[file_input, analysis_preset],
763
+ outputs=pipeline_outputs
764
+ )
765
+
766
+ # Custom button uses all the detailed settings
767
+ process_btn_custom.click(
768
  fn=run_pipeline,
769
+ inputs=[
770
+ file_input,
771
+ semantic_toggle_radio,
772
+ fuzzy_toggle_radio,
773
+ fuzzy_method_dropdown,
774
+ lcs_normalization_dropdown,
775
+ model_dropdown,
776
+ tokenization_mode_dropdown,
777
+ stopwords_dropdown,
778
+ particle_normalization_checkbox,
779
+ batch_size_slider,
780
+ progress_bar_checkbox
781
+ ],
782
+ outputs=pipeline_outputs
783
  )
784
 
785
  # Structural analysis functionality removed - see dedicated collation app
786
+
787
  # Connect the interpret button
788
  interpret_btn.click(
789
  fn=interpret_results,
790
  inputs=[csv_output],
791
  outputs=interpretation_output
792
  )
793
+
794
  return demo
795
 
796
 
797
  if __name__ == "__main__":
798
  demo = main_interface()
799
+ demo.launch()
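
The preset wrapper in the `app.py` diff above maps a preset label to the two toggle strings and reuses the `value - description` dropdown labels. A minimal table-driven sketch of that mapping follows. It is not the committed code: `PRESET_FLAGS`, `STANDARD_FLAGS`, `resolve_preset`, and `option_code` are hypothetical names, and example labels such as "Quick (fast)" are assumptions, since the actual preset choices are not shown in this part of the diff.

```python
from typing import Dict, Optional

# Hypothetical lookup table mirroring the if/elif chain in run_pipeline_preset.
# Keys are keywords searched for inside the preset label, checked in order.
PRESET_FLAGS: Dict[str, Dict[str, str]] = {
    "Quick": {"enable_semantic": "No", "enable_fuzzy": "No"},
    "Deep": {"enable_semantic": "Yes", "enable_fuzzy": "Yes"},
}

# Fallback used when no keyword matches (the "Standard" case in the diff).
STANDARD_FLAGS: Dict[str, str] = {"enable_semantic": "No", "enable_fuzzy": "Yes"}


def resolve_preset(preset: str) -> Dict[str, str]:
    """Return the toggle strings for a preset label (substring match)."""
    for keyword, flags in PRESET_FLAGS.items():
        if keyword in preset:
            return dict(flags)  # copy so callers cannot mutate the table
    return dict(STANDARD_FLAGS)


def option_code(label: Optional[str], default: str) -> str:
    """Extract the short code from a 'code - description' dropdown label."""
    return label.split(" - ")[0] if label else default


if __name__ == "__main__":
    print(resolve_preset("Quick (fast)"))  # label text is an assumption
    print(option_code("avg - Balanced comparison (default)", "avg"))
```

Keeping the keyword lookup in one table means a new preset only needs a new entry, though it still relies on substring matches in the label, as the wrapper in the diff does.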
pipeline/hf_embedding.py CHANGED
@@ -10,8 +10,10 @@ _model_cache = {}
10
 
11
  # Model version mapping
12
  MODEL_VERSIONS = {
13
- "sentence-transformers/LaBSE": "v1.0",
14
- "intfloat/e5-base-v2": "v1.0",
 
 
15
  }
16
 
17
  def get_model(model_id: str) -> Tuple[Optional[SentenceTransformer], Optional[str]]:
@@ -28,7 +30,7 @@ def get_model(model_id: str) -> Tuple[Optional[SentenceTransformer], Optional[st
28
  # Include version information in cache key
29
  model_version = MODEL_VERSIONS.get(model_id, "unknown")
30
  cache_key = f"{model_id}@{model_version}"
31
-
32
  if cache_key in _model_cache:
33
  logger.info(f"Returning cached model: {model_id} (version: {model_version})")
34
  return _model_cache[cache_key], "sentence-transformer"
@@ -44,9 +46,9 @@ def get_model(model_id: str) -> Tuple[Optional[SentenceTransformer], Optional[st
44
  return None, None
45
 
46
  def generate_embeddings(
47
- texts: List[str],
48
- model: SentenceTransformer,
49
- batch_size: int = 32,
50
  show_progress_bar: bool = False
51
  ) -> np.ndarray:
52
  """
@@ -70,9 +72,9 @@ def generate_embeddings(
70
  logger.info(f"Generating embeddings for {len(texts)} texts with {type(model).__name__}...")
71
  try:
72
  embeddings = model.encode(
73
- texts,
74
  batch_size=batch_size,
75
- convert_to_numpy=True,
76
  show_progress_bar=show_progress_bar
77
  )
78
  logger.info(f"Embeddings generated with shape: {embeddings.shape}")
 
10
 
11
  # Model version mapping
12
  MODEL_VERSIONS = {
13
+ "buddhist-nlp/buddhist-sentence-similarity": "v1.0", # Dharmamitra - best for Tibetan Buddhist texts
14
+ "buddhist-nlp/bod-eng-similarity": "v1.0", # Dharmamitra - Tibetan-English bitext alignment
15
+ "sentence-transformers/LaBSE": "v1.0", # Multilingual baseline
16
+ "BAAI/bge-m3": "v1.0", # Strong multilingual alternative
17
  }
18
 
19
  def get_model(model_id: str) -> Tuple[Optional[SentenceTransformer], Optional[str]]:
 
30
  # Include version information in cache key
31
  model_version = MODEL_VERSIONS.get(model_id, "unknown")
32
  cache_key = f"{model_id}@{model_version}"
33
+
34
  if cache_key in _model_cache:
35
  logger.info(f"Returning cached model: {model_id} (version: {model_version})")
36
  return _model_cache[cache_key], "sentence-transformer"
 
46
  return None, None
47
 
48
  def generate_embeddings(
49
+ texts: List[str],
50
+ model: SentenceTransformer,
51
+ batch_size: int = 32,
52
  show_progress_bar: bool = False
53
  ) -> np.ndarray:
54
  """
 
72
  logger.info(f"Generating embeddings for {len(texts)} texts with {type(model).__name__}...")
73
  try:
74
  embeddings = model.encode(
75
+ texts,
76
  batch_size=batch_size,
77
+ convert_to_numpy=True,
78
  show_progress_bar=show_progress_bar
79
  )
80
  logger.info(f"Embeddings generated with shape: {embeddings.shape}")
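
The `get_model` hunks above build the cache key from the model ID plus its entry in `MODEL_VERSIONS`, so changing a version string causes a fresh load instead of a cache hit. A rough sketch of that pattern, assuming a caller-supplied loader: `cache_key`, `get_cached`, and `fake_loader` are hypothetical stand-ins (the real function loads a `SentenceTransformer`), and only the `model_id@version` key format is taken from the diff.

```python
from typing import Any, Callable, Dict, List

# One entry copied from the diff; other models fall back to "unknown".
MODEL_VERSIONS: Dict[str, str] = {
    "buddhist-nlp/buddhist-sentence-similarity": "v1.0",
}

_model_cache: Dict[str, Any] = {}


def cache_key(model_id: str) -> str:
    """Build the versioned cache key in the 'model_id@version' format."""
    version = MODEL_VERSIONS.get(model_id, "unknown")
    return f"{model_id}@{version}"


def get_cached(model_id: str, loader: Callable[[str], Any]) -> Any:
    """Return the cached object for model_id, calling loader only on a miss."""
    key = cache_key(model_id)
    if key not in _model_cache:
        _model_cache[key] = loader(model_id)
    return _model_cache[key]


# Hypothetical loader that records its calls; stands in for real model loading.
load_calls: List[str] = []


def fake_loader(model_id: str) -> object:
    load_calls.append(model_id)
    return object()
```

Calling `get_cached` twice with the same ID should invoke the loader once; editing the version in `MODEL_VERSIONS` changes the key, so the next call loads again.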
pipeline/llm_service.py CHANGED
@@ -39,11 +39,11 @@ class LLMService:
39
  """
40
  Service for analyzing text similarity metrics using LLMs and rule-based methods.
41
  """
42
-
43
  def __init__(self, api_key: str = None):
44
  """
45
  Initialize the LLM service.
46
-
47
  Args:
48
  api_key: Optional API key for OpenRouter. If not provided, will try to load from environment.
49
  """
@@ -51,19 +51,19 @@ class LLMService:
51
  self.models = PREFERRED_MODELS
52
  self.temperature = DEFAULT_TEMPERATURE
53
  self.top_p = DEFAULT_TOP_P
54
-
55
  def analyze_similarity(
56
- self,
57
- results_df: pd.DataFrame,
58
  use_llm: bool = True,
59
  ) -> str:
60
  """
61
  Analyze similarity metrics using either LLM or rule-based approach.
62
-
63
  Args:
64
  results_df: DataFrame containing similarity metrics
65
  use_llm: Whether to use LLM for analysis (falls back to rule-based if False or on error)
66
-
67
  Returns:
68
  str: Analysis of the metrics in markdown format with appropriate fallback messages
69
  """
@@ -71,19 +71,19 @@ class LLMService:
71
  if not use_llm:
72
  logger.info("LLM analysis disabled. Using rule-based analysis.")
73
  return self._analyze_with_rules(results_df)
74
-
75
  # Try LLM analysis if enabled
76
  try:
77
  if not self.api_key:
78
  raise ValueError("No OpenRouter API key provided. Please set the OPENROUTER_API_KEY environment variable.")
79
-
80
  logger.info("Attempting LLM-based analysis...")
81
  return self._analyze_with_llm(results_df, max_tokens=DEFAULT_MAX_TOKENS)
82
-
83
  except Exception as e:
84
  error_msg = str(e)
85
  logger.error(f"Error in LLM analysis: {error_msg}")
86
-
87
  # Create a user-friendly error message
88
  if "no openrouter api key" in error_msg.lower():
89
  error_note = "OpenRouter API key not found. Please set the `OPENROUTER_API_KEY` environment variable to use this feature."
@@ -95,42 +95,42 @@ class LLMService:
95
  error_note = "API rate limit exceeded. Falling back to rule-based analysis."
96
  else:
97
  error_note = f"LLM analysis failed: {error_msg[:200]}. Falling back to rule-based analysis."
98
-
99
  # Get rule-based analysis
100
  rule_based_analysis = self._analyze_with_rules(results_df)
101
-
102
  # Combine the error message with the rule-based analysis
103
  return f"## Analysis of Tibetan Text Similarity Metrics\n\n*Note: {error_note}*\n\n{rule_based_analysis}"
104
-
105
  def _prepare_dataframe(self, df: pd.DataFrame) -> pd.DataFrame:
106
  """
107
  Prepare the DataFrame for analysis.
108
-
109
  Args:
110
  df: Input DataFrame with similarity metrics
111
-
112
  Returns:
113
  pd.DataFrame: Cleaned and prepared DataFrame
114
  """
115
  # Make a copy to avoid modifying the original
116
  df = df.copy()
117
-
118
  # Clean text columns
119
  text_cols = ['Text A', 'Text B']
120
  for col in text_cols:
121
  if col in df.columns:
122
  df[col] = df[col].fillna('Unknown').astype(str)
123
  df[col] = df[col].str.replace(r'\.txt$', '', regex=True)
124
-
125
  # Filter out perfect matches (likely empty cells)
126
  metrics_cols = ['Jaccard Similarity (%)', 'Normalized LCS']
127
  if all(col in df.columns for col in metrics_cols):
128
- mask = ~((df['Jaccard Similarity (%)'] == 100.0) &
129
  (df['Normalized LCS'] == 1.0))
130
  df = df[mask].copy()
131
-
132
  return df
133
-
134
  def _analyze_with_llm(self, df: pd.DataFrame, max_tokens: int) -> str:
135
  """
136
  Analyze metrics using an LLM via OpenRouter API, with fallback models.
@@ -181,65 +181,65 @@ class LLMService:
181
  raise last_error
182
  else:
183
  raise Exception("LLM analysis failed for all available models.")
184
-
185
  def _analyze_with_rules(self, df: pd.DataFrame) -> str:
186
  """
187
  Analyze metrics using rule-based approach.
188
-
189
  Args:
190
  df: Prepared DataFrame with metrics
191
-
192
  Returns:
193
  str: Rule-based analysis in markdown format
194
  """
195
  analysis = ["## Tibetan Text Similarity Analysis (Rule-Based)"]
196
-
197
  # Basic stats
198
  text_a_col = 'Text A' if 'Text A' in df.columns else None
199
  text_b_col = 'Text B' if 'Text B' in df.columns else None
200
-
201
  if text_a_col and text_b_col:
202
  unique_texts = set(df[text_a_col].unique()) | set(df[text_b_col].unique())
203
  analysis.append(f"- **Texts analyzed:** {', '.join(sorted(unique_texts))}")
204
-
205
  # Analyze each metric
206
  metric_analyses = []
207
-
208
  if 'Jaccard Similarity (%)' in df.columns:
209
  jaccard_analysis = self._analyze_jaccard(df)
210
  metric_analyses.append(jaccard_analysis)
211
-
212
  if 'Normalized LCS' in df.columns:
213
  lcs_analysis = self._analyze_lcs(df)
214
  metric_analyses.append(lcs_analysis)
215
-
216
  # TF-IDF analysis removed
217
-
218
  # Add all metric analyses
219
  if metric_analyses:
220
  analysis.extend(metric_analyses)
221
-
222
  # Add overall interpretation
223
  analysis.append("\n## Overall Interpretation")
224
  analysis.append(self._generate_overall_interpretation(df))
225
-
226
  return "\n\n".join(analysis)
227
-
228
  def _analyze_jaccard(self, df: pd.DataFrame) -> str:
229
  """Analyze Jaccard similarity scores."""
230
  jaccard = df['Jaccard Similarity (%)'].dropna()
231
  if jaccard.empty:
232
  return ""
233
-
234
  mean_jaccard = jaccard.mean()
235
  max_jaccard = jaccard.max()
236
  min_jaccard = jaccard.min()
237
-
238
  analysis = [
239
  "### Jaccard Similarity Analysis",
240
  f"- **Range:** {min_jaccard:.1f}% to {max_jaccard:.1f}% (mean: {mean_jaccard:.1f}%)"
241
  ]
242
-
243
  # Interpret the scores
244
  if mean_jaccard > 60:
245
  analysis.append("- **High vocabulary overlap** suggests texts share significant content or are from the same tradition.")
@@ -247,7 +247,7 @@ class LLMService:
247
  analysis.append("- **Moderate vocabulary overlap** indicates some shared content or themes.")
248
  else:
249
  analysis.append("- **Low vocabulary overlap** suggests texts are on different topics or from different traditions.")
250
-
251
  # Add top pairs
252
  top_pairs = df.nlargest(3, 'Jaccard Similarity (%)')
253
  if not top_pairs.empty:
@@ -257,24 +257,24 @@ class LLMService:
257
  text_b = row.get('Text B', 'Text 2')
258
  score = row['Jaccard Similarity (%)']
259
  analysis.append(f"- {text_a} ↔ {text_b}: {score:.1f}%")
260
-
261
  return "\n".join(analysis)
262
-
263
  def _analyze_lcs(self, df: pd.DataFrame) -> str:
264
  """Analyze Longest Common Subsequence scores."""
265
  lcs = df['Normalized LCS'].dropna()
266
  if lcs.empty:
267
  return ""
268
-
269
  mean_lcs = lcs.mean()
270
  max_lcs = lcs.max()
271
  min_lcs = lcs.min()
272
-
273
  analysis = [
274
  "### Structural Similarity (LCS) Analysis",
275
  f"- **Range:** {min_lcs:.2f} to {max_lcs:.2f} (mean: {mean_lcs:.2f})"
276
  ]
277
-
278
  # Interpret the scores
279
  if mean_lcs > 0.7:
280
  analysis.append("- **High structural similarity** suggests texts follow similar organizational patterns.")
@@ -282,7 +282,7 @@ class LLMService:
282
  analysis.append("- **Moderate structural similarity** indicates some shared organizational elements.")
283
  else:
284
  analysis.append("- **Low structural similarity** suggests different organizational approaches.")
285
-
286
  # Add top pairs
287
  top_pairs = df.nlargest(3, 'Normalized LCS')
288
  if not top_pairs.empty:
@@ -292,19 +292,19 @@ class LLMService:
292
  text_b = row.get('Text B', 'Text 2')
293
  score = row['Normalized LCS']
294
  analysis.append(f"- {text_a} ↔ {text_b}: {score:.2f}")
295
-
296
  return "\n".join(analysis)
297
-
298
  # TF-IDF analysis method removed
299
-
300
  def _generate_overall_interpretation(self, df: pd.DataFrame) -> str:
301
  """Generate an overall interpretation of the metrics."""
302
  interpretations = []
303
-
304
  # Get metrics if they exist
305
  has_jaccard = 'Jaccard Similarity (%)' in df.columns
306
  has_lcs = 'Normalized LCS' in df.columns
307
-
308
  # Calculate means for available metrics
309
  metrics = {}
310
  if has_jaccard:
@@ -312,51 +312,51 @@ class LLMService:
312
  if has_lcs:
313
  metrics['lcs'] = df['Normalized LCS'].mean()
314
  # TF-IDF metrics removed
315
-
316
  # Generate interpretation based on metrics
317
  if metrics:
318
  interpretations.append("Based on the analysis of similarity metrics:")
319
-
320
  if has_jaccard and metrics['jaccard'] > 60:
321
  interpretations.append("- The high Jaccard similarity indicates significant vocabulary overlap between texts, "
322
  "suggesting they may share common sources or be part of the same textual tradition.")
323
-
324
  if has_lcs and metrics['lcs'] > 0.7:
325
  interpretations.append("- The high LCS score indicates strong structural similarity, "
326
  "suggesting the texts may follow similar organizational patterns or share common structural elements.")
327
-
328
  # TF-IDF interpretation removed
329
-
330
  # Add cross-metric interpretations
331
  if has_jaccard and has_lcs and metrics['jaccard'] > 60 and metrics['lcs'] > 0.7:
332
  interpretations.append("\nThe combination of high Jaccard and LCS similarities strongly suggests "
333
  "that these texts are closely related, possibly being different versions or "
334
  "transmissions of the same work or sharing a common source.")
335
-
336
  # TF-IDF cross-metric interpretation removed
337
-
338
  # Add general guidance if no specific patterns found
339
  if not interpretations:
340
  interpretations.append("No similarity metrics were available to interpret. "
341
  "Check that the results include a 'Jaccard Similarity (%)' or "
342
  "'Normalized LCS' column.")
343
-
344
  return "\n\n".join(interpretations)
345
-
346
  def _create_llm_prompt(self, df: pd.DataFrame, model_name: str) -> str:
347
  """
348
  Create a prompt for the LLM based on the DataFrame.
349
-
350
  Args:
351
  df: Prepared DataFrame with metrics
352
  model_name: Name of the model being used
353
-
354
  Returns:
355
  str: Formatted prompt for the LLM
356
  """
357
  # Convert DataFrame to markdown for the prompt
358
  md_table = df.to_markdown(index=False)
359
-
360
  # Create the prompt
361
  prompt = f"""
362
  # Tibetan Text Similarity Analysis
@@ -372,19 +372,19 @@ You will be provided with a table of text similarity scores in Markdown format.
372
 
373
  Your analysis will be performed using the `{model_name}` model. Provide a concise, scholarly analysis in well-structured markdown.
374
  """
375
-
376
 
377
-
 
378
  return prompt
379
-
380
  def _get_system_prompt(self) -> str:
381
  """Get the system prompt for the LLM."""
382
  return """You are a senior scholar of Tibetan Buddhist texts, specializing in textual criticism. Your task is to analyze the provided similarity metrics and provide expert insights into the relationships between these texts. Ground your analysis in the data, be precise, and focus on what the metrics reveal about the texts' transmission and history."""
383
-
384
  def _call_openrouter_api(self, model: str, prompt: str, system_message: str = None, max_tokens: int = None, temperature: float = None, top_p: float = None) -> str:
385
  """
386
  Call the OpenRouter API.
387
-
388
  Args:
389
  model: Model to use for the API call
390
  prompt: The user prompt
@@ -392,10 +392,10 @@ Your analysis will be performed using the `{model_name}` model. Provide a concis
392
  max_tokens: Maximum tokens for the response
393
  temperature: Sampling temperature
394
  top_p: Nucleus sampling parameter
395
-
396
  Returns:
397
  str: The API response
398
-
399
  Raises:
400
  ValueError: If API key is missing or invalid
401
  requests.exceptions.RequestException: For network-related errors
@@ -405,21 +405,21 @@ Your analysis will be performed using the `{model_name}` model. Provide a concis
405
  error_msg = "OpenRouter API key not provided. Please set the OPENROUTER_API_KEY environment variable."
406
  logger.error(error_msg)
407
  raise ValueError(error_msg)
408
-
409
  url = "https://openrouter.ai/api/v1/chat/completions"
410
-
411
  headers = {
412
  "Authorization": f"Bearer {self.api_key}",
413
  "Content-Type": "application/json",
414
  "HTTP-Referer": "https://github.com/daniel-wojahn/tibetan-text-metrics",
415
  "X-Title": "Tibetan Text Metrics"
416
  }
417
-
418
  messages = []
419
  if system_message:
420
  messages.append({"role": "system", "content": system_message})
421
  messages.append({"role": "user", "content": prompt})
422
-
423
  data = {
424
  "model": model, # Use the model parameter here
425
  "messages": messages,
@@ -427,11 +427,11 @@ Your analysis will be performed using the `{model_name}` model. Provide a concis
427
  "temperature": temperature if temperature is not None else self.temperature,
428
  "top_p": top_p if top_p is not None else self.top_p,
429
  }
430
-
431
  try:
432
  logger.info(f"Calling OpenRouter API with model: {model}")
433
  response = requests.post(url, headers=headers, json=data, timeout=60)
434
-
435
  # Handle different HTTP status codes
436
  if response.status_code == 200:
437
  result = response.json()
@@ -441,53 +441,53 @@ Your analysis will be performed using the `{model_name}` model. Provide a concis
441
  error_msg = "Unexpected response format from OpenRouter API"
442
  logger.error(f"{error_msg}: {result}")
443
  raise ValueError(error_msg)
444
-
445
  elif response.status_code == 401:
446
  error_msg = "Invalid OpenRouter API key. Please check your API key and try again."
447
  logger.error(error_msg)
448
  raise ValueError(error_msg)
449
-
450
  elif response.status_code == 402:
451
  error_msg = "OpenRouter API payment required. Please check your OpenRouter account balance or billing status."
452
  logger.error(error_msg)
453
  raise ValueError(error_msg)
454
-
455
  elif response.status_code == 429:
456
  error_msg = "API rate limit exceeded. Please try again later or check your OpenRouter rate limits."
457
  logger.error(error_msg)
458
  raise ValueError(error_msg)
459
-
460
  else:
461
  error_msg = f"OpenRouter API error: {response.status_code} - {response.text}"
462
  logger.error(error_msg)
463
  raise Exception(error_msg)
464
-
465
  except requests.exceptions.RequestException as e:
466
  error_msg = f"Failed to connect to OpenRouter API: {str(e)}"
467
  logger.error(error_msg)
468
  raise Exception(error_msg) from e
469
-
470
  except json.JSONDecodeError as e:
471
  error_msg = f"Failed to parse OpenRouter API response: {str(e)}"
472
  logger.error(error_msg)
473
  raise Exception(error_msg) from e
474
-
475
  def _format_llm_response(self, response: str, df: pd.DataFrame, model_name: str) -> str:
476
  """
477
  Format the LLM response for display.
478
-
479
  Args:
480
  response: Raw LLM response
481
  df: Original DataFrame for reference
482
  model_name: Name of the model used
483
-
484
  Returns:
485
  str: Formatted response with fallback if needed
486
  """
487
  # Basic validation
488
  if not response or len(response) < 100:
489
  raise ValueError("Response too short or empty")
490
-
491
  # Check for garbled output (random numbers, nonsensical patterns)
492
  # This is a simple heuristic - look for long sequences of numbers or strange patterns
493
  suspicious_patterns = [
@@ -495,24 +495,24 @@ Your analysis will be performed using the `{model_name}` model. Provide a concis
495
  r'[0-9,.]{20,}', # Long sequences of digits, commas and periods
496
  r'[\W]{20,}', # Long sequences of non-word characters
497
  ]
498
-
499
  for pattern in suspicious_patterns:
500
  if re.search(pattern, response):
501
  logger.warning(f"Detected potentially garbled output matching pattern: {pattern}")
502
  # Don't immediately raise - we'll do a more comprehensive check
503
-
504
  # Check for content quality - ensure it has expected sections
505
  expected_content = [
506
  "introduction", "analysis", "similarity", "patterns", "conclusion", "question"
507
  ]
508
-
509
  # Count how many expected content markers we find
510
  content_matches = sum(1 for term in expected_content if term.lower() in response.lower())
511
-
512
  # If we find fewer than 3 expected content markers, log a warning
513
  if content_matches < 3:
514
  logger.warning(f"LLM response missing expected content sections (found {content_matches}/6)")
515
-
516
  # Check for text names from the dataset
517
  # Extract text names from the Text Pair column
518
  text_names = set()
@@ -521,22 +521,22 @@ Your analysis will be performed using the `{model_name}` model. Provide a concis
521
  if isinstance(pair, str) and " vs " in pair:
522
  texts = pair.split(" vs ")
523
  text_names.update(texts)
524
-
525
  # Check if at least some text names appear in the response
526
  text_name_matches = sum(1 for name in text_names if name in response)
527
  if text_names and text_name_matches == 0:
528
  logger.warning("LLM response does not mention any of the text names from the dataset. The analysis may be generic.")
529
-
530
  # Ensure basic markdown structure
531
  if '##' not in response:
532
  response = f"## Analysis of Tibetan Text Similarity\n\n{response}"
533
-
534
  # Add styling to make the output more readable
535
  response = f"<div class='llm-analysis'>\n{response}\n</div>"
536
-
537
  # Format the response into a markdown block
538
  formatted_response = f"""## AI-Powered Analysis (Model: {model_name})\n\n{response}"""
539
-
540
  return formatted_response
541
-
542
 
 
39
  """
40
  Service for analyzing text similarity metrics using LLMs and rule-based methods.
41
  """
42
+
43
  def __init__(self, api_key: str = None):
44
  """
45
  Initialize the LLM service.
46
+
47
  Args:
48
  api_key: Optional API key for OpenRouter. If not provided, will try to load from environment.
49
  """
 
51
  self.models = PREFERRED_MODELS
52
  self.temperature = DEFAULT_TEMPERATURE
53
  self.top_p = DEFAULT_TOP_P
54
+
55
  def analyze_similarity(
56
+ self,
57
+ results_df: pd.DataFrame,
58
  use_llm: bool = True,
59
  ) -> str:
60
  """
61
  Analyze similarity metrics using either LLM or rule-based approach.
62
+
63
  Args:
64
  results_df: DataFrame containing similarity metrics
65
  use_llm: Whether to use LLM for analysis (falls back to rule-based if False or on error)
66
+
67
  Returns:
68
  str: Analysis of the metrics in markdown format with appropriate fallback messages
69
  """
 
71
  if not use_llm:
72
  logger.info("LLM analysis disabled. Using rule-based analysis.")
73
  return self._analyze_with_rules(results_df)
74
+
75
  # Try LLM analysis if enabled
76
  try:
77
  if not self.api_key:
78
  raise ValueError("No OpenRouter API key provided. Please set the OPENROUTER_API_KEY environment variable.")
79
+
80
  logger.info("Attempting LLM-based analysis...")
81
  return self._analyze_with_llm(results_df, max_tokens=DEFAULT_MAX_TOKENS)
82
+
83
  except Exception as e:
84
  error_msg = str(e)
85
  logger.error(f"Error in LLM analysis: {error_msg}")
86
+
87
  # Create a user-friendly error message
88
  if "no openrouter api key" in error_msg.lower():
89
  error_note = "OpenRouter API key not found. Please set the `OPENROUTER_API_KEY` environment variable to use this feature."
 
95
  error_note = "API rate limit exceeded. Falling back to rule-based analysis."
96
  else:
97
  error_note = f"LLM analysis failed: {error_msg[:200]}. Falling back to rule-based analysis."
98
+
99
  # Get rule-based analysis
100
  rule_based_analysis = self._analyze_with_rules(results_df)
101
+
102
  # Combine the error message with the rule-based analysis
103
  return f"## Analysis of Tibetan Text Similarity Metrics\n\n*Note: {error_note}*\n\n{rule_based_analysis}"
104
+
105
  def _prepare_dataframe(self, df: pd.DataFrame) -> pd.DataFrame:
106
  """
107
  Prepare the DataFrame for analysis.
108
+
109
  Args:
110
  df: Input DataFrame with similarity metrics
111
+
112
  Returns:
113
  pd.DataFrame: Cleaned and prepared DataFrame
114
  """
115
  # Make a copy to avoid modifying the original
116
  df = df.copy()
117
+
118
  # Clean text columns
119
  text_cols = ['Text A', 'Text B']
120
  for col in text_cols:
121
  if col in df.columns:
122
  df[col] = df[col].fillna('Unknown').astype(str)
123
  df[col] = df[col].str.replace(r'\.txt$', '', regex=True)  # escape the dot so only a literal ".txt" suffix is stripped
124
+
125
  # Filter out perfect matches (likely empty cells)
126
  metrics_cols = ['Jaccard Similarity (%)', 'Normalized LCS']
127
  if all(col in df.columns for col in metrics_cols):
128
+ mask = ~((df['Jaccard Similarity (%)'] == 100.0) &
129
  (df['Normalized LCS'] == 1.0))
130
  df = df[mask].copy()
131
+
132
  return df
133
+
134
  def _analyze_with_llm(self, df: pd.DataFrame, max_tokens: int) -> str:
135
  """
136
  Analyze metrics using an LLM via OpenRouter API, with fallback models.
 
181
  raise last_error
182
  else:
183
  raise Exception("LLM analysis failed for all available models.")
184
+
185
  def _analyze_with_rules(self, df: pd.DataFrame) -> str:
186
  """
187
  Analyze metrics using rule-based approach.
188
+
189
  Args:
190
  df: Prepared DataFrame with metrics
191
+
192
  Returns:
193
  str: Rule-based analysis in markdown format
194
  """
195
  analysis = ["## Tibetan Text Similarity Analysis (Rule-Based)"]
196
+
197
  # Basic stats
198
  text_a_col = 'Text A' if 'Text A' in df.columns else None
199
  text_b_col = 'Text B' if 'Text B' in df.columns else None
200
+
201
  if text_a_col and text_b_col:
202
  unique_texts = set(df[text_a_col].unique()) | set(df[text_b_col].unique())
203
  analysis.append(f"- **Texts analyzed:** {', '.join(sorted(unique_texts))}")
204
+
205
  # Analyze each metric
206
  metric_analyses = []
207
+
208
  if 'Jaccard Similarity (%)' in df.columns:
209
  jaccard_analysis = self._analyze_jaccard(df)
210
  metric_analyses.append(jaccard_analysis)
211
+
212
  if 'Normalized LCS' in df.columns:
213
  lcs_analysis = self._analyze_lcs(df)
214
  metric_analyses.append(lcs_analysis)
215
+
216
  # TF-IDF analysis removed
217
+
218
  # Add all metric analyses
219
  if metric_analyses:
220
  analysis.extend(metric_analyses)
221
+
222
  # Add overall interpretation
223
  analysis.append("\n## Overall Interpretation")
224
  analysis.append(self._generate_overall_interpretation(df))
225
+
226
  return "\n\n".join(analysis)
227
+
228
  def _analyze_jaccard(self, df: pd.DataFrame) -> str:
229
  """Analyze Jaccard similarity scores."""
230
  jaccard = df['Jaccard Similarity (%)'].dropna()
231
  if jaccard.empty:
232
  return ""
233
+
234
  mean_jaccard = jaccard.mean()
235
  max_jaccard = jaccard.max()
236
  min_jaccard = jaccard.min()
237
+
238
  analysis = [
239
  "### Jaccard Similarity Analysis",
240
  f"- **Range:** {min_jaccard:.1f}% to {max_jaccard:.1f}% (mean: {mean_jaccard:.1f}%)"
241
  ]
242
+
243
  # Interpret the scores
244
  if mean_jaccard > 60:
245
  analysis.append("- **High vocabulary overlap** suggests texts share significant content or are from the same tradition.")
 
247
  analysis.append("- **Moderate vocabulary overlap** indicates some shared content or themes.")
248
  else:
249
  analysis.append("- **Low vocabulary overlap** suggests texts are on different topics or from different traditions.")
250
+
251
  # Add top pairs
252
  top_pairs = df.nlargest(3, 'Jaccard Similarity (%)')
253
  if not top_pairs.empty:
 
257
  text_b = row.get('Text B', 'Text 2')
258
  score = row['Jaccard Similarity (%)']
259
  analysis.append(f"- {text_a} ↔ {text_b}: {score:.1f}%")
260
+
261
  return "\n".join(analysis)
262
+
263
  def _analyze_lcs(self, df: pd.DataFrame) -> str:
264
  """Analyze Longest Common Subsequence scores."""
265
  lcs = df['Normalized LCS'].dropna()
266
  if lcs.empty:
267
  return ""
268
+
269
  mean_lcs = lcs.mean()
270
  max_lcs = lcs.max()
271
  min_lcs = lcs.min()
272
+
273
  analysis = [
274
  "### Structural Similarity (LCS) Analysis",
275
  f"- **Range:** {min_lcs:.2f} to {max_lcs:.2f} (mean: {mean_lcs:.2f})"
276
  ]
277
+
278
  # Interpret the scores
279
  if mean_lcs > 0.7:
280
  analysis.append("- **High structural similarity** suggests texts follow similar organizational patterns.")
 
282
  analysis.append("- **Moderate structural similarity** indicates some shared organizational elements.")
283
  else:
284
  analysis.append("- **Low structural similarity** suggests different organizational approaches.")
285
+
286
  # Add top pairs
287
  top_pairs = df.nlargest(3, 'Normalized LCS')
288
  if not top_pairs.empty:
 
292
  text_b = row.get('Text B', 'Text 2')
293
  score = row['Normalized LCS']
294
  analysis.append(f"- {text_a} ↔ {text_b}: {score:.2f}")
295
+
296
  return "\n".join(analysis)
297
+
298
  # TF-IDF analysis method removed
299
+
300
  def _generate_overall_interpretation(self, df: pd.DataFrame) -> str:
301
  """Generate an overall interpretation of the metrics."""
302
  interpretations = []
303
+
304
  # Get metrics if they exist
305
  has_jaccard = 'Jaccard Similarity (%)' in df.columns
306
  has_lcs = 'Normalized LCS' in df.columns
307
+
308
  # Calculate means for available metrics
309
  metrics = {}
310
  if has_jaccard:
 
312
  if has_lcs:
313
  metrics['lcs'] = df['Normalized LCS'].mean()
314
  # TF-IDF metrics removed
315
+
316
  # Generate interpretation based on metrics
317
  if metrics:
318
  interpretations.append("Based on the analysis of similarity metrics:")
319
+
320
  if has_jaccard and metrics['jaccard'] > 60:
321
  interpretations.append("- The high Jaccard similarity indicates significant vocabulary overlap between texts, "
322
  "suggesting they may share common sources or be part of the same textual tradition.")
323
+
324
  if has_lcs and metrics['lcs'] > 0.7:
325
  interpretations.append("- The high LCS score indicates strong structural similarity, "
326
  "suggesting the texts may follow similar organizational patterns or share common structural elements.")
327
+
328
  # TF-IDF interpretation removed
329
+
330
  # Add cross-metric interpretations
331
  if has_jaccard and has_lcs and metrics['jaccard'] > 60 and metrics['lcs'] > 0.7:
332
  interpretations.append("\nThe combination of high Jaccard and LCS similarities strongly suggests "
333
  "that these texts are closely related, possibly being different versions or "
334
  "transmissions of the same work or sharing a common source.")
335
+
336
  # TF-IDF cross-metric interpretation removed
337
+
338
  # Add general guidance if no specific patterns found
339
  if not interpretations:
340
  interpretations.append("The analysis did not reveal strong patterns in the similarity metrics. "
341
  "This could indicate that the texts are either very similar or very different "
342
  "across all measured dimensions.")
343
+
344
  return "\n\n".join(interpretations)
345
+
346
  def _create_llm_prompt(self, df: pd.DataFrame, model_name: str) -> str:
347
  """
348
  Create a prompt for the LLM based on the DataFrame.
349
+
350
  Args:
351
  df: Prepared DataFrame with metrics
352
  model_name: Name of the model being used
353
+
354
  Returns:
355
  str: Formatted prompt for the LLM
356
  """
357
  # Convert DataFrame to markdown for the prompt
358
  md_table = df.to_markdown(index=False)
359
+
360
  # Create the prompt
361
  prompt = f"""
362
  # Tibetan Text Similarity Analysis
 
372
 
373
  Your analysis will be performed using the `{model_name}` model. Provide a concise, scholarly analysis in well-structured markdown.
374
  """
 
375
 
376
+
377
+
378
  return prompt
379
+
380
  def _get_system_prompt(self) -> str:
381
  """Get the system prompt for the LLM."""
382
  return """You are a senior scholar of Tibetan Buddhist texts, specializing in textual criticism. Your task is to analyze the provided similarity metrics and provide expert insights into the relationships between these texts. Ground your analysis in the data, be precise, and focus on what the metrics reveal about the texts' transmission and history."""
383
+
384
  def _call_openrouter_api(self, model: str, prompt: str, system_message: str = None, max_tokens: int = None, temperature: float = None, top_p: float = None) -> str:
385
  """
386
  Call the OpenRouter API.
387
+
388
  Args:
389
  model: Model to use for the API call
390
  prompt: The user prompt
 
392
  max_tokens: Maximum tokens for the response
393
  temperature: Sampling temperature
394
  top_p: Nucleus sampling parameter
395
+
396
  Returns:
397
  str: The API response
398
+
399
  Raises:
400
  ValueError: If API key is missing or invalid
401
  requests.exceptions.RequestException: For network-related errors
 
405
  error_msg = "OpenRouter API key not provided. Please set the OPENROUTER_API_KEY environment variable."
406
  logger.error(error_msg)
407
  raise ValueError(error_msg)
408
+
409
  url = "https://openrouter.ai/api/v1/chat/completions"
410
+
411
  headers = {
412
  "Authorization": f"Bearer {self.api_key}",
413
  "Content-Type": "application/json",
414
  "HTTP-Referer": "https://github.com/daniel-wojahn/tibetan-text-metrics",
415
  "X-Title": "Tibetan Text Metrics"
416
  }
417
+
418
  messages = []
419
  if system_message:
420
  messages.append({"role": "system", "content": system_message})
421
  messages.append({"role": "user", "content": prompt})
422
+
423
  data = {
424
  "model": model, # Use the model parameter here
425
  "messages": messages,
 
427
  "temperature": temperature if temperature is not None else self.temperature,  # an explicit 0.0 must not fall back to the default
428
  "top_p": top_p if top_p is not None else self.top_p,
429
  }
430
+
431
  try:
432
  logger.info(f"Calling OpenRouter API with model: {model}")
433
  response = requests.post(url, headers=headers, json=data, timeout=60)
434
+
435
  # Handle different HTTP status codes
436
  if response.status_code == 200:
437
  result = response.json()
 
441
  error_msg = "Unexpected response format from OpenRouter API"
442
  logger.error(f"{error_msg}: {result}")
443
  raise ValueError(error_msg)
444
+
445
  elif response.status_code == 401:
446
  error_msg = "Invalid OpenRouter API key. Please check your API key and try again."
447
  logger.error(error_msg)
448
  raise ValueError(error_msg)
449
+
450
  elif response.status_code == 402:
451
  error_msg = "OpenRouter API payment required. Please check your OpenRouter account balance or billing status."
452
  logger.error(error_msg)
453
  raise ValueError(error_msg)
454
+
455
  elif response.status_code == 429:
456
  error_msg = "API rate limit exceeded. Please try again later or check your OpenRouter rate limits."
457
  logger.error(error_msg)
458
  raise ValueError(error_msg)
459
+
460
  else:
461
  error_msg = f"OpenRouter API error: {response.status_code} - {response.text}"
462
  logger.error(error_msg)
463
  raise Exception(error_msg)
464
+
465
  except requests.exceptions.RequestException as e:
466
  error_msg = f"Failed to connect to OpenRouter API: {str(e)}"
467
  logger.error(error_msg)
468
  raise Exception(error_msg) from e
469
+
470
  except json.JSONDecodeError as e:
471
  error_msg = f"Failed to parse OpenRouter API response: {str(e)}"
472
  logger.error(error_msg)
473
  raise Exception(error_msg) from e
474
+
475
  def _format_llm_response(self, response: str, df: pd.DataFrame, model_name: str) -> str:
476
  """
477
  Format the LLM response for display.
478
+
479
  Args:
480
  response: Raw LLM response
481
  df: Original DataFrame for reference
482
  model_name: Name of the model used
483
+
484
  Returns:
485
  str: Formatted response with fallback if needed
486
  """
487
  # Basic validation
488
  if not response or len(response) < 100:
489
  raise ValueError("Response too short or empty")
490
+
491
  # Check for garbled output (random numbers, nonsensical patterns)
492
  # This is a simple heuristic - look for long sequences of numbers or strange patterns
493
  suspicious_patterns = [
 
495
  r'[0-9,.]{20,}', # Long sequences of digits, commas and periods
496
  r'[\W]{20,}', # Long sequences of non-word characters
497
  ]
498
+
499
  for pattern in suspicious_patterns:
500
  if re.search(pattern, response):
501
  logger.warning(f"Detected potentially garbled output matching pattern: {pattern}")
502
  # Don't immediately raise - we'll do a more comprehensive check
503
+
504
  # Check for content quality - ensure it has expected sections
505
  expected_content = [
506
  "introduction", "analysis", "similarity", "patterns", "conclusion", "question"
507
  ]
508
+
509
  # Count how many expected content markers we find
510
  content_matches = sum(1 for term in expected_content if term.lower() in response.lower())
511
+
512
  # If we find fewer than 3 expected content markers, log a warning
513
  if content_matches < 3:
514
  logger.warning(f"LLM response missing expected content sections (found {content_matches}/6)")
515
+
516
  # Check for text names from the dataset
517
  # Extract text names from the Text Pair column
518
  text_names = set()
 
521
  if isinstance(pair, str) and " vs " in pair:
522
  texts = pair.split(" vs ")
523
  text_names.update(texts)
524
+
525
  # Check if at least some text names appear in the response
526
  text_name_matches = sum(1 for name in text_names if name in response)
527
  if text_names and text_name_matches == 0:
528
  logger.warning("LLM response does not mention any of the text names from the dataset. The analysis may be generic.")
529
+
530
  # Ensure basic markdown structure
531
  if '##' not in response:
532
  response = f"## Analysis of Tibetan Text Similarity\n\n{response}"
533
+
534
  # Add styling to make the output more readable
535
  response = f"<div class='llm-analysis'>\n{response}\n</div>"
536
+
537
  # Format the response into a markdown block
538
  formatted_response = f"""## AI-Powered Analysis (Model: {model_name})\n\n{response}"""
539
+
540
  return formatted_response
541
+
542
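The response checks in `_format_llm_response` above can be tried in isolation. The following is a minimal sketch, not the service's own code: `response_warnings` is a hypothetical helper name, and it returns the warnings as a list instead of logging them. The two regular expressions and the six expected terms are copied from the diff.

```python
import re
from typing import List

# Patterns and terms copied from _format_llm_response in the diff above.
SUSPICIOUS_PATTERNS = [
    r"[0-9,.]{20,}",  # long runs of digits, commas and periods
    r"[\W]{20,}",     # long runs of non-word characters
]
EXPECTED_TERMS = [
    "introduction", "analysis", "similarity", "patterns", "conclusion", "question",
]


def response_warnings(response: str) -> List[str]:
    """Hypothetical helper: collect the warnings the heuristics would log."""
    warnings = []
    for pattern in SUSPICIOUS_PATTERNS:
        if re.search(pattern, response):
            warnings.append(f"garbled: {pattern}")
    lowered = response.lower()
    found = sum(1 for term in EXPECTED_TERMS if term in lowered)
    if found < 3:
        warnings.append(f"missing sections: {found}/{len(EXPECTED_TERMS)}")
    return warnings


clean = response_warnings("Introduction. The analysis shows high similarity.")
table = response_warnings("|---|---|---|---|---|---| analysis similarity patterns")
```

One thing this sketch makes visible: a markdown table separator row consists only of non-word characters, so a well-formed table can match the second pattern. Because the service only logs a warning and does not raise, this appears harmless, but the log message may be misleading for responses that contain tables.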
 
pipeline/metrics.py CHANGED
@@ -4,14 +4,19 @@ from typing import List, Dict, Union
4
  from itertools import combinations
5
 
6
  from sklearn.metrics.pairwise import cosine_similarity
7
- from thefuzz import fuzz
8
  from .hf_embedding import generate_embeddings as generate_hf_embeddings
9
  from .stopwords_bo import TIBETAN_STOPWORDS_SET
10
  from .stopwords_lite_bo import TIBETAN_STOPWORDS_LITE_SET
 
11
 
12
  import logging
13
 
14
 
 
 
 
 
 
15
  # Attempt to import the Cython-compiled fast_lcs module
16
  try:
17
  from .fast_lcs import compute_lcs_fast
@@ -25,19 +30,37 @@ logger = logging.getLogger(__name__)
25
 
26
 
27
 
28
- def compute_normalized_lcs(words1: List[str], words2: List[str]) -> float:
29
- # Calculate m and n (lengths) here, so they are available for normalization
30
- # regardless of which LCS implementation is used.
31
  m, n = len(words1), len(words2)
32
 
 
 
 
33
  if USE_CYTHON_LCS:
34
- # Use the Cython-compiled version if available
35
  lcs_length = compute_lcs_fast(words1, words2)
36
  else:
37
- # Fallback to pure Python implementation
38
- # m, n = len(words1), len(words2) # Moved to the beginning of the function
39
- # Using numpy array for dp table can be slightly faster than list of lists for large inputs
40
- # but the primary bottleneck is the Python loop itself compared to Cython.
41
  dp = np.zeros((m + 1, n + 1), dtype=np.int32)
42
 
43
  for i in range(1, m + 1):
@@ -47,63 +70,192 @@ def compute_normalized_lcs(words1: List[str], words2: List[str]) -> float:
47
  else:
48
  dp[i, j] = max(dp[i - 1, j], dp[i, j - 1])
49
  lcs_length = int(dp[m, n])
50
- avg_length = (m + n) / 2
51
- return lcs_length / avg_length if avg_length > 0 else 0.0
52

53
 
54
- def compute_fuzzy_similarity(words1: List[str], words2: List[str], method: str = 'token_set') -> float:
55
  """
56
- Computes fuzzy string similarity between two lists of words using TheFuzz.
57
-
58
  Args:
59
  words1: First list of tokens
60
  words2: Second list of tokens
61
  method: The fuzzy matching method to use:
62
- 'token_set' - Order-independent token matching (default)
63
- 'token_sort' - Order-normalized token matching
64
- 'partial' - Best partial token matching
65
- 'ratio' - Simple ratio matching
66
-
67
  Returns:
68
  float: Fuzzy similarity score between 0.0 and 1.0
69
  """
70
  if not words1 or not words2:
71
  return 0.0
72
-
73
- # Join tokens into strings for fuzzy matching
74
- text1 = " ".join(words1)
75
- text2 = " ".join(words2)
76
-
77
- # Apply the selected fuzzy matching method
78
- if method == 'token_set':
79
- # Best for texts with different word orders and partial overlaps
80
- score = fuzz.token_set_ratio(text1, text2)
81
- elif method == 'token_sort':
82
- # Good for texts with different word orders but similar content
83
- score = fuzz.token_sort_ratio(text1, text2)
84
- elif method == 'partial':
85
- # Best for finding shorter strings within longer ones
86
- score = fuzz.partial_ratio(text1, text2)
87
- else: # 'ratio'
88
- # Simple Levenshtein distance ratio
89
- score = fuzz.ratio(text1, text2)
90
-
91
- # Convert score from 0-100 scale to 0-1 scale
92
- return score / 100.0
93
 
94
 
95
 
96
  def compute_semantic_similarity(
97
  text1_segment: str,
98
  text2_segment: str,
99
- tokens1: List[str],
100
- tokens2: List[str],
101
  model,
102
  batch_size: int = 32,
103
  show_progress_bar: bool = False
104
  ) -> float:
105
- """Computes semantic similarity using a Sentence Transformer model only."""
 
106

107
  if model is None:
108
  logger.warning(
109
  "Embedding model not available for semantic similarity. Skipping calculation."
@@ -116,38 +268,27 @@ def compute_semantic_similarity(
116
  )
117
  return 0.0
118
 
119
- def _get_aggregated_embedding(
120
- raw_text_segment: str,
121
- _botok_tokens: List[str],
122
- model_obj,
123
- batch_size_param: int,
124
- show_progress_bar_param: bool
125
- ) -> Union[np.ndarray, None]:
126
- """Helper to get a single embedding for a text using Sentence Transformers."""
127
- if not raw_text_segment.strip():
128
- logger.info(
129
- f"Text segment is empty or only whitespace: {raw_text_segment[:100]}... Returning None for embedding."
130
- )
131
  return None
132
-
133
  embedding = generate_hf_embeddings(
134
- texts=[raw_text_segment],
135
- model=model_obj,
136
- batch_size=batch_size_param,
137
- show_progress_bar=show_progress_bar_param
138
  )
139
-
140
- if embedding is None or embedding.size == 0:
141
- logger.error(
142
- f"Failed to generate embedding for text: {raw_text_segment[:100]}..."
143
- )
144
  return None
145
  return embedding
146
 
147
  try:
148
- # Pass all relevant parameters to _get_aggregated_embedding
149
- emb1 = _get_aggregated_embedding(text1_segment, tokens1, model, batch_size, show_progress_bar)
150
- emb2 = _get_aggregated_embedding(text2_segment, tokens2, model, batch_size, show_progress_bar)
151
 
152
  if emb1 is None or emb2 is None or emb1.size == 0 or emb2.size == 0:
153
  logger.error(
@@ -168,7 +309,7 @@ def compute_semantic_similarity(
168
  if np.all(emb1 == 0) or np.all(emb2 == 0):
169
  logger.info("One of the embeddings is zero. Semantic similarity is 0.0.")
170
  return 0.0
171
-
172
  # Handle NaN or Inf in embeddings
173
  if np.isnan(emb1).any() or np.isinf(emb1).any() or \
174
  np.isnan(emb2).any() or np.isinf(emb2).any():
@@ -180,9 +321,9 @@ def compute_semantic_similarity(
180
  emb1 = emb1.reshape(1, -1)
181
  if emb2.ndim == 1:
182
  emb2 = emb2.reshape(1, -1)
183
-
184
  similarity_score = cosine_similarity(emb1, emb2)[0][0]
185
-
186
  return max(0.0, float(similarity_score))
187
 
188
  except Exception as e:
@@ -202,8 +343,10 @@ def compute_all_metrics(
202
  enable_semantic: bool = True,
203
  enable_fuzzy: bool = True,
204
  fuzzy_method: str = 'token_set',
 
205
  use_stopwords: bool = True,
206
  use_lite_stopwords: bool = False,
 
207
  batch_size: int = 32,
208
  show_progress_bar: bool = False
209
  ) -> pd.DataFrame:
@@ -218,10 +361,13 @@ def compute_all_metrics(
218
  Defaults to None.
219
  enable_semantic (bool): Whether to compute semantic similarity. Defaults to True.
220
  enable_fuzzy (bool): Whether to compute fuzzy string similarity. Defaults to True.
221
- fuzzy_method (str): The fuzzy matching method to use ('token_set', 'token_sort', 'partial', 'ratio').
222
  Defaults to 'token_set'.
 
223
  use_stopwords (bool): Whether to filter stopwords for Jaccard similarity. Defaults to True.
224
  use_lite_stopwords (bool): Whether to use the lite version of stopwords. Defaults to False.
 
 
225
  batch_size (int): Batch size for semantic similarity computation. Defaults to 32.
226
  show_progress_bar (bool): Whether to show progress bar for semantic similarity. Defaults to False.
227
 
@@ -232,14 +378,7 @@ def compute_all_metrics(
232
  """
233
  files = list(texts.keys())
234
  results = []
235
- corpus_for_sklearn_tfidf = [] # Kept for potential future use
236
-
237
- for fname, content in texts.items():
238
- # Use the pre-computed tokens from the token_lists dictionary
239
- current_tokens_for_file = token_lists.get(fname, [])
240
- corpus_for_sklearn_tfidf.append(" ".join(current_tokens_for_file) if current_tokens_for_file else "")
241
 
242
-
243
  for i, j in combinations(range(len(files)), 2):
244
  f1, f2 = files[i], files[j]
245
  words1_raw, words2_raw = token_lists[f1], token_lists[f2]
@@ -254,21 +393,33 @@ def compute_all_metrics(
254
  else:
255
  # If stopwords are disabled, use an empty set
256
  stopwords_set_to_use = set()
257
-
258
- # Filter stopwords for Jaccard calculation
259
- words1_jaccard = [word for word in words1_raw if word not in stopwords_set_to_use]
260
- words2_jaccard = [word for word in words2_raw if word not in stopwords_set_to_use]
261
 
262
  jaccard = (
263
  len(set(words1_jaccard) & set(words2_jaccard)) / len(set(words1_jaccard) | set(words2_jaccard))
264
  if set(words1_jaccard) | set(words2_jaccard) # Ensure denominator is not zero
265
  else 0.0
266
  )
267
- # LCS uses raw tokens (words1_raw, words2_raw) to provide a complementary metric.
268
- # Semantic similarity also uses raw text and its botok tokens for chunking decisions.
269
  jaccard_percent = jaccard * 100.0
270
- norm_lcs = compute_normalized_lcs(words1_raw, words2_raw)
271
-
 
 
272
  # Fuzzy Similarity Calculation
273
  if enable_fuzzy:
274
  fuzzy_sim = compute_fuzzy_similarity(words1_jaccard, words2_jaccard, method=fuzzy_method)
@@ -277,9 +428,8 @@ def compute_all_metrics(
277
 
278
  # Semantic Similarity Calculation
279
  if enable_semantic:
280
- # Pass raw texts and their pre-computed botok tokens
281
  semantic_sim = compute_semantic_similarity(
282
- texts[f1], texts[f2], words1_raw, words2_raw, model,
283
  batch_size=batch_size,
284
  show_progress_bar=show_progress_bar
285
  )
 
4
  from itertools import combinations
5
 
6
  from sklearn.metrics.pairwise import cosine_similarity
 
7
  from .hf_embedding import generate_embeddings as generate_hf_embeddings
8
  from .stopwords_bo import TIBETAN_STOPWORDS_SET
9
  from .stopwords_lite_bo import TIBETAN_STOPWORDS_LITE_SET
10
+ from .normalize_bo import normalize_particles
11
 
12
  import logging
13
 
14
 
15
+ def _normalize_token_for_stopwords(token: str) -> str:
16
+ """Normalize token by removing trailing tsek for stopword matching."""
17
+ return token.rstrip('་')
18
+
19
+
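To illustrate why the trailing tsek is stripped before stopword lookup, here is a small sketch. The stopword set below is a one-entry stand-in invented for the example (the real `TIBETAN_STOPWORDS_SET` is not shown in this diff, and how its entries are stored is an assumption), and `normalize_for_stopwords` is a hypothetical name that mirrors `_normalize_token_for_stopwords`.

```python
TSEK = "\u0f0b"  # Tibetan intersyllabic mark (tsek)


def normalize_for_stopwords(token: str) -> str:
    """Mirror of the diff's helper: drop any trailing tsek characters."""
    return token.rstrip(TSEK)


# Hypothetical stand-in for the project's stopword set, stored without tsek.
DEMO_STOPWORDS = {"\u0f51\u0f44"}  # "dang" ("and")

tokens = ["\u0f51\u0f44\u0f0b", "\u0f56\u0f7c\u0f51\u0f0b"]  # "dang" + tsek, "bod" + tsek
kept = [t for t in tokens if normalize_for_stopwords(t) not in DEMO_STOPWORDS]
```

Without the normalization, the first token would not match the set entry, because the tokenizer output carries the tsek and the stored form (in this sketch) does not.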
20
  # Attempt to import the Cython-compiled fast_lcs module
21
  try:
22
  from .fast_lcs import compute_lcs_fast
 
30
 
31
 
32
 
33
+ def compute_normalized_lcs(words1: List[str], words2: List[str], normalization: str = "avg") -> float:
34
+ """
35
+ Computes the Longest Common Subsequence (LCS) similarity between two token lists.
36
+
37
+ Args:
38
+ words1: First list of tokens
39
+ words2: Second list of tokens
40
+ normalization: How to normalize the LCS length. Options:
41
+ 'avg' - Divide by average length (default, balanced)
42
+ 'min' - Divide by shorter text (detects if one text contains the other)
43
+ 'max' - Divide by longer text (stricter, penalizes length differences)
44
+
45
+ Returns:
46
+ float: Normalized LCS score between 0.0 and 1.0
47
+
48
+ Note on normalization choice:
49
+ - 'avg': Good general-purpose choice, treats both texts equally
50
+ - 'min': Use when looking for containment (e.g., quotes within commentary)
51
+ Can return 1.0 if shorter text is fully contained in longer
52
+ - 'max': Use when you want to penalize length differences
53
+ Will be lower when texts have very different lengths
54
+ """
55
  m, n = len(words1), len(words2)
56
 
57
+ if m == 0 or n == 0:
58
+ return 0.0
59
+
60
  if USE_CYTHON_LCS:
 
61
  lcs_length = compute_lcs_fast(words1, words2)
62
  else:
63
+ # Pure Python implementation using dynamic programming
 
 
 
64
  dp = np.zeros((m + 1, n + 1), dtype=np.int32)
65
 
66
  for i in range(1, m + 1):
 
70
  else:
71
  dp[i, j] = max(dp[i - 1, j], dp[i, j - 1])
72
  lcs_length = int(dp[m, n])
 
 
73
 
74
+ # Apply selected normalization
75
+ if normalization == "min":
76
+ divisor = min(m, n)
77
+ elif normalization == "max":
78
+ divisor = max(m, n)
79
+ else: # "avg" (default)
80
+ divisor = (m + n) / 2
81
+
82
+ return lcs_length / divisor if divisor > 0 else 0.0
83
+
84
+
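The effect of the three normalization modes is easiest to see on a containment case. The sketch below is a self-contained pure-Python illustration, not the module's Cython path; `lcs_length` and `normalized_lcs` are hypothetical names, and the token lists are placeholder strings, not Tibetan data.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, using two rolling rows."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def normalized_lcs(a: List[str], b: List[str], normalization: str = "avg") -> float:
    """Divide the LCS length by the min, max, or average input length."""
    if not a or not b:
        return 0.0
    if normalization == "min":
        divisor = min(len(a), len(b))
    elif normalization == "max":
        divisor = max(len(a), len(b))
    else:  # "avg"
        divisor = (len(a) + len(b)) / 2
    return lcs_length(a, b) / divisor


quote = ["a", "b", "c"]                       # short text
commentary = ["x", "a", "b", "y", "c", "z"]   # longer text containing the quote in order
scores = {mode: normalized_lcs(quote, commentary, mode) for mode in ("avg", "min", "max")}
```

Here the LCS length is 3, so `min` gives 3/3, `max` gives 3/6, and `avg` gives 3/4.5. This matches the docstring's guidance that `min` suits containment detection while `max` penalizes length differences.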
85
+ def compute_ngram_similarity(tokens1: List[str], tokens2: List[str], n: int = 2) -> float:
86
+ """
87
+ Computes syllable/token n-gram overlap similarity (Jaccard on n-grams).
88
+
89
+ This is more effective for Tibetan than character-level fuzzy matching because
90
+ it preserves syllable boundaries and captures local word patterns.
91
+
92
+ Args:
93
+ tokens1: First list of tokens (syllables or words)
94
+ tokens2: Second list of tokens (syllables or words)
95
+ n: Size of n-grams (default: 2 for bigrams)
96
+
97
+ Returns:
98
+ float: N-gram similarity score between 0.0 and 1.0
99
+ """
100
+ if not tokens1 or not tokens2:
101
+ return 0.0
102
+
103
+ # Handle edge case where text is shorter than n
104
+ if len(tokens1) < n or len(tokens2) < n:
105
+ # Fall back to unigram comparison
106
+ set1, set2 = set(tokens1), set(tokens2)
107
+ if not set1 or not set2:
108
+ return 0.0
109
+ intersection = len(set1 & set2)
110
+ union = len(set1 | set2)
111
+ return intersection / union if union > 0 else 0.0
112
+
113
+ def get_ngrams(tokens: List[str], size: int) -> set:
114
+ return {tuple(tokens[i:i+size]) for i in range(len(tokens) - size + 1)}
115
+
116
+ ngrams1 = get_ngrams(tokens1, n)
117
+ ngrams2 = get_ngrams(tokens2, n)
118
 
119
+ intersection = len(ngrams1 & ngrams2)
120
+ union = len(ngrams1 | ngrams2)
121
+ return intersection / union if union > 0 else 0.0
122
+
123
+
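A short worked example may help show how bigram overlap differs from plain token overlap. This is a sketch with hypothetical names (`token_ngrams`, `ngram_jaccard`) and romanized placeholder syllables; it follows the same logic as `compute_ngram_similarity`, including the unigram fallback for inputs shorter than `n`.

```python
from typing import List, Set, Tuple


def token_ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """All contiguous n-token windows, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_jaccard(tokens1: List[str], tokens2: List[str], n: int = 2) -> float:
    """Jaccard overlap of token n-grams; falls back to unigrams for short inputs."""
    if not tokens1 or not tokens2:
        return 0.0
    if len(tokens1) < n or len(tokens2) < n:
        grams1, grams2 = set(tokens1), set(tokens2)
    else:
        grams1, grams2 = token_ngrams(tokens1, n), token_ngrams(tokens2, n)
    union = grams1 | grams2
    return len(grams1 & grams2) / len(union) if union else 0.0


# Placeholder syllables: the two sequences differ only in the last one.
a = ["ka", "kha", "ga", "nga"]
b = ["ka", "kha", "ga", "ca"]
bigram_score = ngram_jaccard(a, b, n=2)   # 2 shared bigrams out of 4 distinct
unigram_score = ngram_jaccard(a, b, n=1)  # 3 shared tokens out of 5 distinct
```

A single changed syllable removes one bigram from each side, so the bigram score (0.5) drops further than the unigram score (0.6). This is the sense in which n-grams capture local order as well as vocabulary.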
+def compute_syllable_edit_similarity(syls1: List[str], syls2: List[str]) -> float:
     """
+    Computes edit distance at the syllable/token level rather than character level.
+
+    This is more appropriate for Tibetan because:
+    - Tibetan syllables are meaningful units (unlike individual characters)
+    - Character-level Levenshtein over-penalizes syllable differences
+    - Syllable-level comparison better captures textual variation patterns
+
+    Args:
+        syls1: First list of syllables/tokens
+        syls2: Second list of syllables/tokens
+
+    Returns:
+        float: Syllable-level similarity score between 0.0 and 1.0
+    """
+    if not syls1 and not syls2:
+        return 1.0
+    if not syls1 or not syls2:
+        return 0.0
+
+    m, n = len(syls1), len(syls2)
+
+    # Create DP table for syllable-level edit distance
+    dp = np.zeros((m + 1, n + 1), dtype=np.int32)
+
+    # Initialize base cases
+    for i in range(m + 1):
+        dp[i, 0] = i
+    for j in range(n + 1):
+        dp[0, j] = j
+
+    # Fill DP table
+    for i in range(1, m + 1):
+        for j in range(1, n + 1):
+            if syls1[i - 1] == syls2[j - 1]:
+                dp[i, j] = dp[i - 1, j - 1]
+            else:
+                dp[i, j] = 1 + min(
+                    dp[i - 1, j],      # deletion
+                    dp[i, j - 1],      # insertion
+                    dp[i - 1, j - 1]   # substitution
+                )
+
+    edit_distance = dp[m, n]
+    max_len = max(m, n)
+    return 1.0 - (edit_distance / max_len) if max_len > 0 else 1.0
+
+
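To make the normalization concrete, here is a small worked sketch of token-level Levenshtein similarity. It is an assumption-laden restatement, not the committed function: it uses a two-row table in pure Python instead of the NumPy matrix, and the name `token_edit_similarity` is hypothetical.

```python
from typing import List


def token_edit_similarity(syls1: List[str], syls2: List[str]) -> float:
    """1 - (token-level Levenshtein distance / longer length). Illustrative only."""
    if not syls1 and not syls2:
        return 1.0
    if not syls1 or not syls2:
        return 0.0

    # prev[j] holds the distance between the processed prefix of syls1 and syls2[:j].
    prev = list(range(len(syls2) + 1))
    for i, s1 in enumerate(syls1, start=1):
        curr = [i]
        for j, s2 in enumerate(syls2, start=1):
            cost = 0 if s1 == s2 else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return 1.0 - prev[-1] / max(len(syls1), len(syls2))


# One substituted token out of four, then one dropped token out of three.
substituted = token_edit_similarity(["a", "b", "c", "d"], ["a", "x", "c", "d"])
dropped = token_edit_similarity(["a", "b", "c"], ["a", "c"])
print(substituted, dropped)
```

Because each whole token counts as one edit, a variant spelling of a single syllable costs the same as replacing it outright, whatever its character length.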
+def compute_weighted_jaccard(tokens1: List[str], tokens2: List[str]) -> float:
+    """
+    Computes weighted Jaccard similarity using token frequencies.
+
+    Unlike standard Jaccard which treats all tokens as binary (present/absent),
+    this considers how often each token appears, giving more weight to
+    frequently shared terms.
+
+    Args:
+        tokens1: First list of tokens
+        tokens2: Second list of tokens
+
+    Returns:
+        float: Weighted Jaccard similarity between 0.0 and 1.0
+    """
+    from collections import Counter
+
+    if not tokens1 or not tokens2:
+        return 0.0
+
+    c1, c2 = Counter(tokens1), Counter(tokens2)
+
+    # Intersection: min count for each shared token
+    intersection = sum((c1 & c2).values())
+    # Union: max count for each token
+    union = sum((c1 | c2).values())
+
+    return intersection / union if union > 0 else 0.0
+
+
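The `Counter` operators carry the whole computation here: `&` keeps the minimum count per token and `|` keeps the maximum. A short worked example, using placeholder tokens chosen only for illustration, shows how the frequency-weighted score differs from set-based Jaccard on the same input.

```python
from collections import Counter

tokens1 = ["a", "a", "b"]
tokens2 = ["a", "b", "b", "c"]

c1, c2 = Counter(tokens1), Counter(tokens2)

# Multiset intersection: min count per token -> {"a": 1, "b": 1}
shared = c1 & c2
# Multiset union: max count per token -> {"a": 2, "b": 2, "c": 1}
combined = c1 | c2

weighted = sum(shared.values()) / sum(combined.values())

# Set-based Jaccard on the same tokens ignores repetition entirely.
plain = len(set(tokens1) & set(tokens2)) / len(set(tokens1) | set(tokens2))
print(weighted, plain)
```

The weighted score is lower in this case because the two lists repeat different tokens, a difference the set-based version cannot see.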
+def compute_fuzzy_similarity(words1: List[str], words2: List[str], method: str = 'ngram') -> float:
+    """
+    Computes fuzzy string similarity between two lists of words.
+
+    All methods work at the syllable/token level, which is linguistically
+    appropriate for Tibetan text.
+
     Args:
         words1: First list of tokens
         words2: Second list of tokens
         method: The fuzzy matching method to use:
+            'ngram' - Syllable bigram overlap (default, recommended)
+            'syllable_edit' - Syllable-level edit distance
+            'weighted_jaccard' - Frequency-weighted Jaccard
+
     Returns:
         float: Fuzzy similarity score between 0.0 and 1.0
     """
     if not words1 or not words2:
         return 0.0
+
+    if method == 'ngram':
+        # Syllable bigram overlap - good for detecting shared phrases
+        return compute_ngram_similarity(words1, words2, n=2)
+    elif method == 'syllable_edit':
+        # Syllable-level edit distance - good for detecting minor variations
+        return compute_syllable_edit_similarity(words1, words2)
+    elif method == 'weighted_jaccard':
+        # Frequency-weighted Jaccard - good for repeated terms
+        return compute_weighted_jaccard(words1, words2)
+    else:
+        # Default to ngram for any unrecognized method
+        return compute_ngram_similarity(words1, words2, n=2)
236
 
237
 
238
 
239
  def compute_semantic_similarity(
240
  text1_segment: str,
241
  text2_segment: str,
 
 
242
  model,
243
  batch_size: int = 32,
244
  show_progress_bar: bool = False
245
  ) -> float:
246
+ """
247
+ Computes semantic similarity using a Sentence Transformer model.
248
 
249
+ Args:
250
+ text1_segment: First text segment
251
+ text2_segment: Second text segment
252
+ model: Pre-loaded SentenceTransformer model
253
+ batch_size: Batch size for encoding
254
+ show_progress_bar: Whether to show progress bar
255
+
256
+ Returns:
257
+ float: Cosine similarity between embeddings (0.0 to 1.0), or np.nan on error
258
+ """
259
  if model is None:
260
  logger.warning(
261
  "Embedding model not available for semantic similarity. Skipping calculation."
 
268
  )
269
  return 0.0
270
 
271
+ def _get_embedding(raw_text: str) -> Union[np.ndarray, None]:
272
+ """Helper to get embedding for a single text."""
273
+ if not raw_text.strip():
274
+ logger.info("Text is empty or whitespace. Returning None.")
 
 
 
 
 
 
 
 
275
  return None
276
+
277
  embedding = generate_hf_embeddings(
278
+ texts=[raw_text],
279
+ model=model,
280
+ batch_size=batch_size,
281
+ show_progress_bar=show_progress_bar
282
  )
283
+
284
+ if embedding is None or embedding.size == 0:
285
+ logger.error(f"Failed to generate embedding for text: {raw_text[:100]}...")
 
 
286
  return None
287
  return embedding
288
 
289
  try:
290
+ emb1 = _get_embedding(text1_segment)
291
+ emb2 = _get_embedding(text2_segment)
 
292
 
293
  if emb1 is None or emb2 is None or emb1.size == 0 or emb2.size == 0:
294
  logger.error(
 
309
  if np.all(emb1 == 0) or np.all(emb2 == 0):
310
  logger.info("One of the embeddings is zero. Semantic similarity is 0.0.")
311
  return 0.0
312
+
313
  # Handle NaN or Inf in embeddings
314
  if np.isnan(emb1).any() or np.isinf(emb1).any() or \
315
  np.isnan(emb2).any() or np.isinf(emb2).any():
 
321
  emb1 = emb1.reshape(1, -1)
322
  if emb2.ndim == 1:
323
  emb2 = emb2.reshape(1, -1)
324
+
325
  similarity_score = cosine_similarity(emb1, emb2)[0][0]
326
+
327
  return max(0.0, float(similarity_score))
328
 
329
  except Exception as e:
 
343
  enable_semantic: bool = True,
344
  enable_fuzzy: bool = True,
345
  fuzzy_method: str = 'token_set',
346
+ lcs_normalization: str = 'avg',
347
  use_stopwords: bool = True,
348
  use_lite_stopwords: bool = False,
349
+ normalize_particles_opt: bool = False,
350
  batch_size: int = 32,
351
  show_progress_bar: bool = False
352
  ) -> pd.DataFrame:
 
361
  Defaults to None.
362
  enable_semantic (bool): Whether to compute semantic similarity. Defaults to True.
363
  enable_fuzzy (bool): Whether to compute fuzzy string similarity. Defaults to True.
364
+ fuzzy_method (str): The fuzzy matching method to use ('ngram', 'syllable_edit', 'weighted_jaccard').
365
  Defaults to 'token_set', an unrecognized value that compute_fuzzy_similarity treats as 'ngram'.
366
+ lcs_normalization (str): How to normalize LCS ('avg', 'min', 'max'). Defaults to 'avg'.
367
  use_stopwords (bool): Whether to filter stopwords for Jaccard similarity. Defaults to True.
368
  use_lite_stopwords (bool): Whether to use the lite version of stopwords. Defaults to False.
369
+ normalize_particles_opt (bool): Whether to normalize grammatical particles (གི/ཀྱི/གྱི → གི).
370
+ Reduces false negatives from sandhi variation. Defaults to False.
371
  batch_size (int): Batch size for semantic similarity computation. Defaults to 32.
372
  show_progress_bar (bool): Whether to show progress bar for semantic similarity. Defaults to False.
373
 
 
378
  """
379
  files = list(texts.keys())
380
  results = []
 
 
 
 
 
 
381
 
 
382
  for i, j in combinations(range(len(files)), 2):
383
  f1, f2 = files[i], files[j]
384
  words1_raw, words2_raw = token_lists[f1], token_lists[f2]
 
393
  else:
394
  # If stopwords are disabled, use an empty set
395
  stopwords_set_to_use = set()
396
+
397
+ # Filter stopwords for Jaccard calculation (normalize tokens for consistent matching)
398
+ words1_filtered = [word for word in words1_raw if _normalize_token_for_stopwords(word) not in stopwords_set_to_use]
399
+ words2_filtered = [word for word in words2_raw if _normalize_token_for_stopwords(word) not in stopwords_set_to_use]
400
+
401
+ # Apply particle normalization if enabled
402
+ if normalize_particles_opt:
403
+ words1_jaccard = normalize_particles(words1_filtered)
404
+ words2_jaccard = normalize_particles(words2_filtered)
405
+ words1_lcs = normalize_particles(words1_raw)
406
+ words2_lcs = normalize_particles(words2_raw)
407
+ else:
408
+ words1_jaccard = words1_filtered
409
+ words2_jaccard = words2_filtered
410
+ words1_lcs = words1_raw
411
+ words2_lcs = words2_raw
412
 
413
  jaccard = (
414
  len(set(words1_jaccard) & set(words2_jaccard)) / len(set(words1_jaccard) | set(words2_jaccard))
415
  if set(words1_jaccard) | set(words2_jaccard) # Ensure denominator is not zero
416
  else 0.0
417
  )
 
 
418
  jaccard_percent = jaccard * 100.0
419
+
420
+ # LCS uses tokens (with optional particle normalization)
421
+ norm_lcs = compute_normalized_lcs(words1_lcs, words2_lcs, normalization=lcs_normalization)
422
+
423
  # Fuzzy Similarity Calculation
424
  if enable_fuzzy:
425
  fuzzy_sim = compute_fuzzy_similarity(words1_jaccard, words2_jaccard, method=fuzzy_method)
 
428
 
429
  # Semantic Similarity Calculation
430
  if enable_semantic:
 
431
  semantic_sim = compute_semantic_similarity(
432
+ texts[f1], texts[f2], model,
433
  batch_size=batch_size,
434
  show_progress_bar=show_progress_bar
435
  )
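The body of `compute_normalized_lcs` is not part of this hunk, so the following is only a sketch of what the three `lcs_normalization` options could mean, based on their descriptions in the `process_texts` docstring ('avg' divides by the average length, 'min' by the shorter text, 'max' by the longer). The helper names `lcs_length` and `normalized_lcs` are assumptions, not the project's API.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, using a two-row DP table."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def normalized_lcs(a: List[str], b: List[str], normalization: str = "avg") -> float:
    """Hypothetical normalization: divide LCS length by avg, min, or max length."""
    if not a or not b:
        return 0.0
    if normalization == "min":
        denominator = min(len(a), len(b))
    elif normalization == "max":
        denominator = max(len(a), len(b))
    else:  # treat anything else as 'avg'
        denominator = (len(a) + len(b)) / 2
    return lcs_length(a, b) / denominator


# A short text fully contained, in order, in a longer one.
long_text = ["a", "b", "c", "d"]
short_text = ["b", "d"]
scores = {mode: normalized_lcs(long_text, short_text, mode) for mode in ("avg", "min", "max")}
print(scores)
```

Under these assumptions, 'min' reports full containment for the pair above, while 'max' penalizes the length difference. This matches the "detects containment" and "stricter" labels in the docstring.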
pipeline/normalize_bo.py ADDED
@@ -0,0 +1,106 @@
+# -*- coding: utf-8 -*-
+"""
+Tibetan text normalization for improved text comparison.
+
+This module provides normalization functions for Tibetan grammatical particles,
+which change form based on the preceding syllable (sandhi). Normalizing these
+allows more accurate comparison between texts that may use different particle
+forms for grammatical reasons rather than semantic differences.
+"""
+
+from typing import List
+
+# Particle equivalence classes
+# All forms in each class are grammatically equivalent
+# The first form in each list is the canonical/normalized form
+PARTICLE_CLASSES = {
+    # Genitive particles (གི་སྒྲ) - "of"
+    # Form depends on final letter of preceding syllable
+    "genitive": ["གི", "ཀྱི", "གྱི", "ཡི", "འི"],
+
+    # Agentive/instrumental particles (བྱེད་སྒྲ) - "by"
+    "agentive": ["གིས", "ཀྱིས", "གྱིས", "ཡིས", "ས"],
+
+    # Dative/locative particles (ལ་དོན) - "to/at/in"
+    "dative": ["ལ", "ར", "སུ", "ཏུ", "དུ", "རུ"],
+
+    # Ablative particles (འབྱུང་ཁུངས) - "from"
+    "ablative": ["ནས", "ལས"],
+
+    # Conjunctive particles (སྦྱོར་སྒྲ) - verbal connective "and/while"
+    "conjunctive": ["ཅིང", "ཤིང", "ཞིང"],
+
+    # Terminative particles (མཐའ་སྒྲ) - clause ending
+    "terminative": ["སྟེ", "ཏེ", "དེ"],
+
+    # Concessive particles - "even/also"
+    "concessive": ["ཀྱང", "ཡང", "འང"],
+
+    # Imperative particles
+    "imperative": ["ཅིག", "ཤིག", "ཞིག"],
+}
+
+
+def _build_particle_map() -> dict:
+    """Build mapping from all particle variants to canonical form."""
+    mapping = {}
+    for class_name, forms in PARTICLE_CLASSES.items():
+        canonical = forms[0]  # First form is canonical
+        for variant in forms:
+            # Strip tsek for matching (will be normalized anyway)
+            variant_clean = variant.rstrip('་')
+            mapping[variant_clean] = canonical
+    return mapping
+
+
+# Pre-built mapping for efficiency
+PARTICLE_NORMALIZATION_MAP = _build_particle_map()
+
+
+def normalize_particles(tokens: List[str]) -> List[str]:
+    """
+    Normalize grammatical particles to canonical forms.
+
+    This treats all sandhi variants of a particle as equivalent:
+    - གི, ཀྱི, གྱི, ཡི, འི → གི (genitive)
+    - གིས, ཀྱིས, གྱིས, ཡིས, ས → གིས (agentive)
+    - ལ, ར, སུ, ཏུ, དུ, རུ → ལ (dative)
+    - etc.
+
+    This is useful when comparing texts that may use different particle forms
+    based on phonological context rather than semantic differences.
+
+    Args:
+        tokens: List of Tibetan tokens (syllables or words)
+
+    Returns:
+        List of tokens with particles normalized to canonical forms
+    """
+    normalized = []
+    for token in tokens:
+        # Strip tsek for lookup
+        token_clean = token.rstrip('་')
+        # Check if it's a particle that should be normalized
+        if token_clean in PARTICLE_NORMALIZATION_MAP:
+            normalized.append(PARTICLE_NORMALIZATION_MAP[token_clean])
+        else:
+            normalized.append(token_clean)
+    return normalized
+
+
+def get_particle_class(token: str) -> str:
+    """
+    Get the grammatical class of a particle.
+
+    Args:
+        token: A Tibetan token
+
+    Returns:
+        The particle class name (e.g., 'genitive', 'agentive') or None
+    """
+    token_clean = token.rstrip('་')
+    for class_name, forms in PARTICLE_CLASSES.items():
+        clean_forms = [f.rstrip('་') for f in forms]
+        if token_clean in clean_forms:
+            return class_name
+    return None
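The effect of particle normalization on a set-based metric can be seen on a toy pair. The sketch below is a reduced, self-contained illustration restricted to the genitive class. The helper names (`normalize_genitive`, `jaccard`) and the two three-token "witnesses" are assumptions made up for the example, not data from the project.

```python
from typing import Dict, List

TSEK = "་"
# Genitive variants from the diff above; the first entry is the canonical form.
GENITIVE_FORMS = ["གི", "ཀྱི", "གྱི", "ཡི", "འི"]
GENITIVE_MAP: Dict[str, str] = {form: GENITIVE_FORMS[0] for form in GENITIVE_FORMS}


def normalize_genitive(tokens: List[str]) -> List[str]:
    """Strip the trailing tsek and map genitive variants to the canonical form."""
    result = []
    for token in tokens:
        bare = token.rstrip(TSEK)
        result.append(GENITIVE_MAP.get(bare, bare))
    return result


def jaccard(tokens1: List[str], tokens2: List[str]) -> float:
    """Plain set-based Jaccard similarity."""
    s1, s2 = set(tokens1), set(tokens2)
    return len(s1 & s2) / len(s1 | s2) if (s1 | s2) else 0.0


# Two hypothetical witnesses that differ only in the genitive particle form.
witness_a = ["ཆོས་", "ཀྱི་", "དོན་"]
witness_b = ["ཆོས་", "གི་", "དོན་"]

before = jaccard(witness_a, witness_b)
after = jaccard(normalize_genitive(witness_a), normalize_genitive(witness_b))
print(before, after)
```

Note that the lookup strips the trailing tsek from every token, not only from particles, so normalized and un-normalized token lists should not be mixed in one comparison.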
pipeline/process.py CHANGED
@@ -1,4 +1,5 @@
1
  import pandas as pd
 
2
  from typing import Dict, List, Tuple
3
  from .metrics import compute_all_metrics
4
  from .hf_embedding import get_model as get_hf_model
@@ -13,7 +14,7 @@ import re
13
 
14
  def get_botok_tokens_for_single_text(text: str, mode: str = "syllable") -> list[str]:
15
  """
16
- A wrapper around tokenize_texts to make it suitable for tokenize_fn
17
  in generate_embeddings, which expects a function that tokenizes a single string.
18
  Accepts a 'mode' argument ('syllable' or 'word') to pass to tokenize_texts.
19
  """
@@ -46,14 +47,17 @@ logger = logging.getLogger(__name__)
46
 
47
 
48
  def process_texts(
49
- text_data: Dict[str, str],
50
- filenames: List[str],
51
  enable_semantic: bool = True,
52
  enable_fuzzy: bool = True,
53
- fuzzy_method: str = 'token_set',
54
- model_name: str = "sentence-transformers/LaBSE",
 
55
  use_stopwords: bool = True,
56
  use_lite_stopwords: bool = False,
 
 
57
  progress_callback = None,
58
  progressive_callback = None,
59
  batch_size: int = 32,
@@ -61,11 +65,11 @@ def process_texts(
61
  ) -> Tuple[pd.DataFrame, pd.DataFrame, str]:
62
  """
63
  Processes uploaded texts, segments them by chapter marker, and computes metrics between chapters of different files.
64
-
65
  Args:
66
  text_data (Dict[str, str]): A dictionary mapping filenames to their content.
67
  filenames (List[str]): A list of filenames that were uploaded.
68
- enable_semantic (bool, optional): Whether to compute semantic similarity metrics.
69
  Requires loading a sentence-transformer model, which can be time-consuming. Defaults to True.
70
  enable_fuzzy (bool, optional): Whether to compute fuzzy string similarity metrics.
71
  Uses TheFuzz library for approximate string matching. Defaults to True.
@@ -74,16 +78,28 @@ def process_texts(
74
  'token_sort' - Order-normalized token matching
75
  'partial' - Best partial token matching
76
  'ratio' - Simple ratio matching
 
 
 
 
 
 
 
77
  model_name (str, optional): The Hugging Face sentence-transformer model to use for semantic similarity.
78
- Must be a valid model identifier on Hugging Face. Defaults to "sentence-transformers/LaBSE".
79
  use_stopwords (bool, optional): Whether to use stopwords in the metrics calculation. Defaults to True.
80
  use_lite_stopwords (bool, optional): Whether to use the lite stopwords list (common particles only)
81
  instead of the comprehensive list. Only applies if use_stopwords is True. Defaults to False.
 
 
 
 
 
82
  progress_callback (callable, optional): A callback function for reporting progress updates.
83
  Should accept a float between 0 and 1 and a description string. Defaults to None.
84
  progressive_callback (callable, optional): A callback function for sending incremental results.
85
  Used for progressive loading of metrics as they become available. Defaults to None.
86
-
87
  Returns:
88
  Tuple[pd.DataFrame, pd.DataFrame, str]:
89
  - metrics_df: DataFrame with similarity metrics between corresponding chapters of file pairs.
@@ -92,7 +108,7 @@ def process_texts(
92
  - word_counts_df: DataFrame with word counts for each segment (chapter) in each file.
93
  Contains columns: 'Filename', 'ChapterNumber', 'SegmentID', 'WordCount'.
94
  - warning: A string containing any warnings generated during processing (e.g., missing chapter markers).
95
-
96
  Raises:
97
  RuntimeError: If the botok tokenizer fails to initialize.
98
  ValueError: If the input files cannot be processed or if metrics computation fails.
@@ -132,7 +148,7 @@ def process_texts(
132
  progress_callback(0.3, desc="Unsupported model, continuing without semantic similarity.")
133
  except Exception as e:
134
  logger.warning(f"Progress callback error (non-critical): {e}")
135
-
136
  except Exception as e: # General catch-all for unexpected errors during model loading attempts
137
  model_warning = f"An unexpected error occurred while attempting to load model '{model_name}': {e}. Semantic similarity will be disabled."
138
  logger.error(model_warning, exc_info=True)
@@ -156,38 +172,38 @@ def process_texts(
156
  progress_callback(0.35, desc="Segmenting texts by chapters...")
157
  except Exception as e:
158
  logger.warning(f"Progress callback error (non-critical): {e}")
159
-
160
  chapter_marker = "༈"
161
  fallback = False
162
  segment_texts = {}
163
-
164
  # Process each file
165
  for i, fname in enumerate(filenames):
166
  if progress_callback is not None and len(filenames) > 1:
167
  try:
168
- progress_callback(0.35 + (0.05 * (i / len(filenames))),
169
  desc=f"Segmenting file {i+1}/{len(filenames)}: {fname}")
170
  except Exception as e:
171
  logger.warning(f"Progress callback error (non-critical): {e}")
172
-
173
  content = text_data[fname]
174
-
175
  # Check if content is empty
176
  if not content.strip():
177
  logger.warning(f"File '{fname}' is empty or contains only whitespace.")
178
  continue
179
-
180
  # Split by chapter marker if present
181
  if chapter_marker in content:
182
  segments = [
183
  seg.strip() for seg in content.split(chapter_marker) if seg.strip()
184
  ]
185
-
186
  # Check if we have valid segments after splitting
187
  if not segments:
188
  logger.warning(f"File '{fname}' contains chapter markers but no valid text segments.")
189
  continue
190
-
191
  for idx, seg in enumerate(segments):
192
  seg_id = f"{fname}|chapter {idx+1}"
193
  cleaned_seg = clean_tibetan_text(seg)
@@ -198,7 +214,7 @@ def process_texts(
198
  cleaned_content = clean_tibetan_text(content.strip())
199
  segment_texts[seg_id] = cleaned_content
200
  fallback = True
201
-
202
  # Generate warning if no chapter markers found
203
  warning = model_warning # Include any model warnings
204
  if fallback:
@@ -208,7 +224,7 @@ def process_texts(
208
  "For best results, add a unique marker (e.g., ༈) to separate chapters or sections."
209
  )
210
  warning = warning + " " + chapter_warning if warning else chapter_warning
211
-
212
  # Check if we have any valid segments
213
  if not segment_texts:
214
  logger.error("No valid text segments found in any of the uploaded files.")
@@ -216,90 +232,90 @@ def process_texts(
216
  # Tokenize all segments at once for efficiency
217
  if progress_callback is not None:
218
  try:
219
- progress_callback(0.42, desc="Tokenizing all text segments...")
220
  except Exception as e:
221
  logger.warning(f"Progress callback error (non-critical): {e}")
222
 
223
  all_segment_ids = list(segment_texts.keys())
224
  all_segment_contents = list(segment_texts.values())
225
- tokenized_segments_list = tokenize_texts(all_segment_contents)
226
 
227
  segment_tokens = dict(zip(all_segment_ids, tokenized_segments_list))
228
 
229
  # Group chapters by filename (preserving order)
230
  if progress_callback is not None:
231
  try:
232
- progress_callback(0.4, desc="Organizing text segments...")
233
  except Exception as e:
234
  logger.warning(f"Progress callback error (non-critical): {e}")
235
-
236
  file_to_chapters = {}
237
  for seg_id in segment_texts:
238
  fname = seg_id.split("|")[0]
239
  file_to_chapters.setdefault(fname, []).append(seg_id)
240
-
241
  # For each pair of files, compare corresponding chapters (by index)
242
  if progress_callback is not None:
243
  try:
244
  progress_callback(0.45, desc="Computing similarity metrics...")
245
  except Exception as e:
246
  logger.warning(f"Progress callback error (non-critical): {e}")
247
-
248
  results = []
249
  files = list(file_to_chapters.keys())
250
-
251
  # Check if we have at least two files to compare
252
  if len(files) < 2:
253
  logger.warning("Need at least two files to compute similarity metrics.")
254
  return pd.DataFrame(), pd.DataFrame(), "Need at least two files to compute similarity metrics."
255
-
256
  # Track total number of comparisons for progress reporting
257
  total_comparisons = 0
258
  for file1, file2 in combinations(files, 2):
259
  chaps1 = file_to_chapters[file1]
260
  chaps2 = file_to_chapters[file2]
261
  total_comparisons += min(len(chaps1), len(chaps2))
262
-
263
  # Initialize results DataFrame for progressive updates
264
  results_columns = ['Text Pair', 'Chapter', 'Jaccard Similarity (%)', 'Normalized LCS']
265
  if enable_fuzzy:
266
  results_columns.append('Fuzzy Similarity')
267
  if enable_semantic:
268
  results_columns.append('Semantic Similarity')
269
-
270
  # Create empty DataFrame with the correct columns
271
  progressive_df = pd.DataFrame(columns=results_columns)
272
-
273
  # Track which metrics have been completed for progressive updates
274
  completed_metrics = []
275
-
276
  # Process each file pair
277
  comparison_count = 0
278
  for file1, file2 in combinations(files, 2):
279
  chaps1 = file_to_chapters[file1]
280
  chaps2 = file_to_chapters[file2]
281
  min_chaps = min(len(chaps1), len(chaps2))
282
-
283
  if progress_callback is not None:
284
  try:
285
  progress_callback(0.45, desc=f"Comparing {file1} with {file2}...")
286
  except Exception as e:
287
  logger.warning(f"Progress callback error (non-critical): {e}")
288
-
289
  for idx in range(min_chaps):
290
  seg1 = chaps1[idx]
291
  seg2 = chaps2[idx]
292
-
293
  # Update progress
294
  comparison_count += 1
295
  if progress_callback is not None and total_comparisons > 0:
296
  try:
297
  progress_percentage = 0.45 + (0.25 * (comparison_count / total_comparisons))
298
- progress_callback(progress_percentage,
299
  desc=f"Computing metrics for chapter {idx+1} ({comparison_count}/{total_comparisons})")
300
  except Exception as e:
301
  logger.warning(f"Progress callback error (non-critical): {e}")
302
-
303
  try:
304
  # Compute metrics for this chapter pair
305
  metrics_df = compute_all_metrics(
@@ -309,10 +325,12 @@ def process_texts(
309
  enable_semantic=enable_semantic,
310
  enable_fuzzy=enable_fuzzy,
311
  fuzzy_method=fuzzy_method,
 
312
  use_stopwords=use_stopwords,
313
  use_lite_stopwords=use_lite_stopwords,
 
314
  )
315
-
316
  # Extract metrics from the DataFrame (should have only one row)
317
  if not metrics_df.empty:
318
  pair_metrics = metrics_df.iloc[0].to_dict()
@@ -325,57 +343,57 @@ def process_texts(
325
  "Fuzzy Similarity": 0.0 if enable_fuzzy else np.nan,
326
  "Semantic Similarity": 0.0 if enable_semantic else np.nan
327
  }
328
-
329
  # Format the results
330
  text_pair = f"{file1} vs {file2}"
331
  chapter_num = idx + 1
332
-
333
  result_row = {
334
  "Text Pair": text_pair,
335
  "Chapter": chapter_num,
336
  "Jaccard Similarity (%)": pair_metrics["Jaccard Similarity (%)"], # Already in percentage
337
  "Normalized LCS": pair_metrics["Normalized LCS"],
338
  }
339
-
340
  # Add fuzzy similarity if enabled
341
  if enable_fuzzy:
342
  result_row["Fuzzy Similarity"] = pair_metrics["Fuzzy Similarity"]
343
-
344
  # Add semantic similarity if enabled and available
345
  if enable_semantic and "Semantic Similarity" in pair_metrics:
346
  result_row["Semantic Similarity"] = pair_metrics["Semantic Similarity"]
347
-
348
  # Convert the dictionary to a DataFrame before appending
349
  result_df = pd.DataFrame([result_row])
350
  results.append(result_df)
351
-
352
  # Update progressive DataFrame and send update if callback is provided
353
  progressive_df = pd.concat(results, ignore_index=True)
354
-
355
  # Send progressive update if callback is provided
356
  if progressive_callback is not None:
357
  # Determine which metrics are complete in this update
358
  current_metrics = []
359
-
360
  # Always include these basic metrics
361
  if "Jaccard Similarity (%)" in progressive_df.columns and MetricType.JACCARD not in completed_metrics:
362
  current_metrics.append(MetricType.JACCARD)
363
  completed_metrics.append(MetricType.JACCARD)
364
-
365
  if "Normalized LCS" in progressive_df.columns and MetricType.LCS not in completed_metrics:
366
  current_metrics.append(MetricType.LCS)
367
  completed_metrics.append(MetricType.LCS)
368
-
369
  # Add fuzzy if enabled and available
370
  if enable_fuzzy and "Fuzzy Similarity" in progressive_df.columns and MetricType.FUZZY not in completed_metrics:
371
  current_metrics.append(MetricType.FUZZY)
372
  completed_metrics.append(MetricType.FUZZY)
373
-
374
  # Add semantic if enabled and available
375
  if enable_semantic and "Semantic Similarity" in progressive_df.columns and MetricType.SEMANTIC not in completed_metrics:
376
  current_metrics.append(MetricType.SEMANTIC)
377
  completed_metrics.append(MetricType.SEMANTIC)
378
-
379
  # Create word counts DataFrame for progressive update
380
  word_counts_data = []
381
  for seg_id, tokens in segment_tokens.items():
@@ -388,7 +406,7 @@ def process_texts(
388
  "WordCount": len(tokens)
389
  })
390
  word_counts_df_progressive = pd.DataFrame(word_counts_data)
391
-
392
  # Send the update
393
  try:
394
  progressive_callback(
@@ -400,12 +418,12 @@ def process_texts(
400
  )
401
  except Exception as e:
402
  logger.warning(f"Progressive callback error (non-critical): {e}")
403
-
404
  except Exception as e:
405
  logger.error(f"Error computing metrics for {seg1} vs {seg2}: {e}", exc_info=True)
406
  # Continue with other comparisons instead of failing completely
407
  continue
408
-
409
  # Create the metrics DataFrame
410
  if results:
411
  # Results are already DataFrames, so we can concatenate them directly
@@ -420,9 +438,9 @@ def process_texts(
420
  progress_callback(0.75, desc="Calculating word counts...")
421
  except Exception as e:
422
  logger.warning(f"Progress callback error (non-critical): {e}")
423
-
424
  word_counts_data = []
425
-
426
  # Process each segment
427
  for i, (seg_id, text_content) in enumerate(segment_texts.items()):
428
  # Update progress
@@ -432,10 +450,10 @@ def process_texts(
432
  progress_callback(progress_percentage, desc=f"Counting words in segment {i+1}/{len(segment_texts)}")
433
  except Exception as e:
434
  logger.warning(f"Progress callback error (non-critical): {e}")
435
-
436
  fname, chapter_info = seg_id.split("|", 1)
437
  chapter_num = int(chapter_info.replace("chapter ", ""))
438
-
439
  try:
440
  # Use botok for accurate word count for raw Tibetan text
441
  tokenized_segments = tokenize_texts([text_content]) # Returns a list of lists
@@ -443,7 +461,7 @@ def process_texts(
443
  word_count = len(tokenized_segments[0])
444
  else:
445
  word_count = 0
446
-
447
  word_counts_data.append(
448
  {
449
  "Filename": fname.replace(".txt", ""),
@@ -463,20 +481,20 @@ def process_texts(
463
  "WordCount": 0,
464
  }
465
  )
466
-
467
  # Create and sort the word counts DataFrame
468
  word_counts_df = pd.DataFrame(word_counts_data)
469
  if not word_counts_df.empty:
470
  word_counts_df = word_counts_df.sort_values(
471
  by=["Filename", "ChapterNumber"]
472
  ).reset_index(drop=True)
473
-
474
  if progress_callback is not None:
475
  try:
476
  progress_callback(0.95, desc="Analysis complete!")
477
  except Exception as e:
478
  logger.warning(f"Progress callback error (non-critical): {e}")
479
-
480
  # Send final progressive update if callback is provided
481
  if progressive_callback is not None:
482
  try:
@@ -490,6 +508,6 @@ def process_texts(
490
  )
491
  except Exception as e:
492
  logger.warning(f"Final progressive callback error (non-critical): {e}")
493
-
494
  # Return the results
495
  return metrics_df, word_counts_df, warning
 
1
  import pandas as pd
2
+ import numpy as np
3
  from typing import Dict, List, Tuple
4
  from .metrics import compute_all_metrics
5
  from .hf_embedding import get_model as get_hf_model
 
14
 
15
  def get_botok_tokens_for_single_text(text: str, mode: str = "syllable") -> list[str]:
16
  """
17
+ A wrapper around tokenize_texts to make it suitable for tokenize_fn
18
  in generate_embeddings, which expects a function that tokenizes a single string.
19
  Accepts a 'mode' argument ('syllable' or 'word') to pass to tokenize_texts.
20
  """
 
47
 
48
 
49
  def process_texts(
50
+ text_data: Dict[str, str],
51
+ filenames: List[str],
52
  enable_semantic: bool = True,
53
  enable_fuzzy: bool = True,
54
+ fuzzy_method: str = 'ngram',
55
+ lcs_normalization: str = 'avg',
56
+ model_name: str = "buddhist-nlp/buddhist-sentence-similarity",
57
  use_stopwords: bool = True,
58
  use_lite_stopwords: bool = False,
59
+ normalize_particles: bool = False,
60
+ tokenization_mode: str = "word",
61
  progress_callback = None,
62
  progressive_callback = None,
63
  batch_size: int = 32,
 
65
  ) -> Tuple[pd.DataFrame, pd.DataFrame, str]:
66
  """
67
  Processes uploaded texts, segments them by chapter marker, and computes metrics between chapters of different files.
68
+
69
  Args:
70
  text_data (Dict[str, str]): A dictionary mapping filenames to their content.
71
  filenames (List[str]): A list of filenames that were uploaded.
72
+ enable_semantic (bool, optional): Whether to compute semantic similarity metrics.
73
  Requires loading a sentence-transformer model, which can be time-consuming. Defaults to True.
74
  enable_fuzzy (bool, optional): Whether to compute fuzzy string similarity metrics.
75
  Uses TheFuzz library for approximate string matching. Defaults to True.
 
78
  'token_sort' - Order-normalized token matching
79
  'partial' - Best partial token matching
80
  'ratio' - Simple ratio matching
81
+ 'ngram' - Syllable bigram overlap (recommended for Tibetan)
82
+ 'syllable_edit' - Syllable-level edit distance
83
+ 'weighted_jaccard' - Frequency-weighted Jaccard
84
+ lcs_normalization (str, optional): How to normalize LCS length. Options:
85
+ 'avg' - Divide by average length (default, balanced)
86
+ 'min' - Divide by shorter text (detects containment)
87
+ 'max' - Divide by longer text (stricter)
88
  model_name (str, optional): The Hugging Face sentence-transformer model to use for semantic similarity.
89
+ Must be a valid model identifier on Hugging Face. Defaults to "buddhist-nlp/buddhist-sentence-similarity".
90
  use_stopwords (bool, optional): Whether to use stopwords in the metrics calculation. Defaults to True.
91
  use_lite_stopwords (bool, optional): Whether to use the lite stopwords list (common particles only)
92
  instead of the comprehensive list. Only applies if use_stopwords is True. Defaults to False.
93
+ normalize_particles (bool, optional): Whether to normalize grammatical particles to canonical forms.
94
+ Treats གི/ཀྱི/གྱི as equivalent, ལ/ར/སུ/ཏུ/དུ as equivalent, etc. Defaults to False.
95
+ tokenization_mode (str, optional): How to tokenize the text. Options are:
96
+ 'word' - Keep multi-syllable words together (default, recommended for Jaccard)
97
+ 'syllable' - Split into individual syllables (finer granularity)
98
  progress_callback (callable, optional): A callback function for reporting progress updates.
99
  Should accept a float between 0 and 1 and a description string. Defaults to None.
100
  progressive_callback (callable, optional): A callback function for sending incremental results.
101
  Used for progressive loading of metrics as they become available. Defaults to None.
102
+
103
  Returns:
104
  Tuple[pd.DataFrame, pd.DataFrame, str]:
105
  - metrics_df: DataFrame with similarity metrics between corresponding chapters of file pairs.
 
108
  - word_counts_df: DataFrame with word counts for each segment (chapter) in each file.
109
  Contains columns: 'Filename', 'ChapterNumber', 'SegmentID', 'WordCount'.
110
  - warning: A string containing any warnings generated during processing (e.g., missing chapter markers).
111
+
112
  Raises:
113
  RuntimeError: If the botok tokenizer fails to initialize.
114
  ValueError: If the input files cannot be processed or if metrics computation fails.
 
148
  progress_callback(0.3, desc="Unsupported model, continuing without semantic similarity.")
149
  except Exception as e:
150
  logger.warning(f"Progress callback error (non-critical): {e}")
151
+
152
  except Exception as e: # General catch-all for unexpected errors during model loading attempts
153
  model_warning = f"An unexpected error occurred while attempting to load model '{model_name}': {e}. Semantic similarity will be disabled."
154
  logger.error(model_warning, exc_info=True)
 
172
  progress_callback(0.35, desc="Segmenting texts by chapters...")
173
  except Exception as e:
174
  logger.warning(f"Progress callback error (non-critical): {e}")
175
+
176
  chapter_marker = "༈"
177
  fallback = False
178
  segment_texts = {}
179
+
180
  # Process each file
181
  for i, fname in enumerate(filenames):
182
  if progress_callback is not None and len(filenames) > 1:
183
  try:
184
+ progress_callback(0.35 + (0.05 * (i / len(filenames))),
185
  desc=f"Segmenting file {i+1}/{len(filenames)}: {fname}")
186
  except Exception as e:
187
  logger.warning(f"Progress callback error (non-critical): {e}")
188
+
189
  content = text_data[fname]
190
+
191
  # Check if content is empty
192
  if not content.strip():
193
  logger.warning(f"File '{fname}' is empty or contains only whitespace.")
194
  continue
195
+
196
  # Split by chapter marker if present
197
  if chapter_marker in content:
198
  segments = [
199
  seg.strip() for seg in content.split(chapter_marker) if seg.strip()
200
  ]
201
+
202
  # Check if we have valid segments after splitting
203
  if not segments:
204
  logger.warning(f"File '{fname}' contains chapter markers but no valid text segments.")
205
  continue
206
+
207
  for idx, seg in enumerate(segments):
208
  seg_id = f"{fname}|chapter {idx+1}"
209
  cleaned_seg = clean_tibetan_text(seg)
 
214
  cleaned_content = clean_tibetan_text(content.strip())
215
  segment_texts[seg_id] = cleaned_content
216
  fallback = True
217
+
218
  # Generate warning if no chapter markers found
219
  warning = model_warning # Include any model warnings
220
  if fallback:
 
224
  "For best results, add a unique marker (e.g., ༈) to separate chapters or sections."
225
  )
226
  warning = warning + " " + chapter_warning if warning else chapter_warning
227
+
228
  # Check if we have any valid segments
229
  if not segment_texts:
230
  logger.error("No valid text segments found in any of the uploaded files.")
 
232
  # Tokenize all segments at once for efficiency
233
  if progress_callback is not None:
234
  try:
235
+ progress_callback(0.40, desc="Tokenizing all text segments...")
236
  except Exception as e:
237
  logger.warning(f"Progress callback error (non-critical): {e}")
238
 
239
  all_segment_ids = list(segment_texts.keys())
240
  all_segment_contents = list(segment_texts.values())
241
+ tokenized_segments_list = tokenize_texts(all_segment_contents, mode=tokenization_mode)
242
 
243
  segment_tokens = dict(zip(all_segment_ids, tokenized_segments_list))
244
 
245
  # Group chapters by filename (preserving order)
246
  if progress_callback is not None:
247
  try:
248
+ progress_callback(0.42, desc="Organizing text segments...")
249
  except Exception as e:
250
  logger.warning(f"Progress callback error (non-critical): {e}")
251
+
252
  file_to_chapters = {}
253
  for seg_id in segment_texts:
254
  fname = seg_id.split("|")[0]
255
  file_to_chapters.setdefault(fname, []).append(seg_id)
256
+
257
  # For each pair of files, compare corresponding chapters (by index)
258
  if progress_callback is not None:
259
  try:
260
  progress_callback(0.45, desc="Computing similarity metrics...")
261
  except Exception as e:
262
  logger.warning(f"Progress callback error (non-critical): {e}")
263
+
264
  results = []
265
  files = list(file_to_chapters.keys())
266
+
267
  # Check if we have at least two files to compare
268
  if len(files) < 2:
269
  logger.warning("Need at least two files to compute similarity metrics.")
270
  return pd.DataFrame(), pd.DataFrame(), "Need at least two files to compute similarity metrics."
271
+
272
  # Track total number of comparisons for progress reporting
273
  total_comparisons = 0
274
  for file1, file2 in combinations(files, 2):
275
  chaps1 = file_to_chapters[file1]
276
  chaps2 = file_to_chapters[file2]
277
  total_comparisons += min(len(chaps1), len(chaps2))
278
+
279
  # Initialize results DataFrame for progressive updates
280
  results_columns = ['Text Pair', 'Chapter', 'Jaccard Similarity (%)', 'Normalized LCS']
281
  if enable_fuzzy:
282
  results_columns.append('Fuzzy Similarity')
283
  if enable_semantic:
284
  results_columns.append('Semantic Similarity')
285
+
286
  # Create empty DataFrame with the correct columns
287
  progressive_df = pd.DataFrame(columns=results_columns)
288
+
289
  # Track which metrics have been completed for progressive updates
290
  completed_metrics = []
291
+
292
  # Process each file pair
293
  comparison_count = 0
294
  for file1, file2 in combinations(files, 2):
295
  chaps1 = file_to_chapters[file1]
296
  chaps2 = file_to_chapters[file2]
297
  min_chaps = min(len(chaps1), len(chaps2))
298
+
299
  if progress_callback is not None:
300
  try:
301
  progress_callback(0.45, desc=f"Comparing {file1} with {file2}...")
302
  except Exception as e:
303
  logger.warning(f"Progress callback error (non-critical): {e}")
304
+
305
  for idx in range(min_chaps):
306
  seg1 = chaps1[idx]
307
  seg2 = chaps2[idx]
308
+
309
  # Update progress
310
  comparison_count += 1
311
  if progress_callback is not None and total_comparisons > 0:
312
  try:
313
  progress_percentage = 0.45 + (0.25 * (comparison_count / total_comparisons))
314
+ progress_callback(progress_percentage,
315
  desc=f"Computing metrics for chapter {idx+1} ({comparison_count}/{total_comparisons})")
316
  except Exception as e:
317
  logger.warning(f"Progress callback error (non-critical): {e}")
318
+
319
  try:
320
  # Compute metrics for this chapter pair
321
  metrics_df = compute_all_metrics(
 
325
  enable_semantic=enable_semantic,
326
  enable_fuzzy=enable_fuzzy,
327
  fuzzy_method=fuzzy_method,
328
+ lcs_normalization=lcs_normalization,
329
  use_stopwords=use_stopwords,
330
  use_lite_stopwords=use_lite_stopwords,
331
+ normalize_particles_opt=normalize_particles,
332
  )
333
+
334
  # Extract metrics from the DataFrame (should have only one row)
335
  if not metrics_df.empty:
336
  pair_metrics = metrics_df.iloc[0].to_dict()
 
343
  "Fuzzy Similarity": 0.0 if enable_fuzzy else np.nan,
344
  "Semantic Similarity": 0.0 if enable_semantic else np.nan
345
  }
346
+
347
  # Format the results
348
  text_pair = f"{file1} vs {file2}"
349
  chapter_num = idx + 1
350
+
351
  result_row = {
352
  "Text Pair": text_pair,
353
  "Chapter": chapter_num,
354
  "Jaccard Similarity (%)": pair_metrics["Jaccard Similarity (%)"], # Already in percentage
355
  "Normalized LCS": pair_metrics["Normalized LCS"],
356
  }
357
+
358
  # Add fuzzy similarity if enabled
359
  if enable_fuzzy:
360
  result_row["Fuzzy Similarity"] = pair_metrics["Fuzzy Similarity"]
361
+
362
  # Add semantic similarity if enabled and available
363
  if enable_semantic and "Semantic Similarity" in pair_metrics:
364
  result_row["Semantic Similarity"] = pair_metrics["Semantic Similarity"]
365
+
366
  # Convert the dictionary to a DataFrame before appending
367
  result_df = pd.DataFrame([result_row])
368
  results.append(result_df)
369
+
370
  # Update progressive DataFrame and send update if callback is provided
371
  progressive_df = pd.concat(results, ignore_index=True)
372
+
373
  # Send progressive update if callback is provided
374
  if progressive_callback is not None:
375
  # Determine which metrics are complete in this update
376
  current_metrics = []
377
+
378
  # Always include these basic metrics
379
  if "Jaccard Similarity (%)" in progressive_df.columns and MetricType.JACCARD not in completed_metrics:
380
  current_metrics.append(MetricType.JACCARD)
381
  completed_metrics.append(MetricType.JACCARD)
382
+
383
  if "Normalized LCS" in progressive_df.columns and MetricType.LCS not in completed_metrics:
384
  current_metrics.append(MetricType.LCS)
385
  completed_metrics.append(MetricType.LCS)
386
+
387
  # Add fuzzy if enabled and available
388
  if enable_fuzzy and "Fuzzy Similarity" in progressive_df.columns and MetricType.FUZZY not in completed_metrics:
389
  current_metrics.append(MetricType.FUZZY)
390
  completed_metrics.append(MetricType.FUZZY)
391
+
392
  # Add semantic if enabled and available
393
  if enable_semantic and "Semantic Similarity" in progressive_df.columns and MetricType.SEMANTIC not in completed_metrics:
394
  current_metrics.append(MetricType.SEMANTIC)
395
  completed_metrics.append(MetricType.SEMANTIC)
396
+
397
  # Create word counts DataFrame for progressive update
398
  word_counts_data = []
399
  for seg_id, tokens in segment_tokens.items():
 
406
  "WordCount": len(tokens)
407
  })
408
  word_counts_df_progressive = pd.DataFrame(word_counts_data)
409
+
410
  # Send the update
411
  try:
412
  progressive_callback(
 
418
  )
419
  except Exception as e:
420
  logger.warning(f"Progressive callback error (non-critical): {e}")
421
+
422
  except Exception as e:
423
  logger.error(f"Error computing metrics for {seg1} vs {seg2}: {e}", exc_info=True)
424
  # Continue with other comparisons instead of failing completely
425
  continue
426
+
427
  # Create the metrics DataFrame
428
  if results:
429
  # Results are already DataFrames, so we can concatenate them directly
 
438
  progress_callback(0.75, desc="Calculating word counts...")
439
  except Exception as e:
440
  logger.warning(f"Progress callback error (non-critical): {e}")
441
+
442
  word_counts_data = []
443
+
444
  # Process each segment
445
  for i, (seg_id, text_content) in enumerate(segment_texts.items()):
446
  # Update progress
 
450
  progress_callback(progress_percentage, desc=f"Counting words in segment {i+1}/{len(segment_texts)}")
451
  except Exception as e:
452
  logger.warning(f"Progress callback error (non-critical): {e}")
453
+
454
  fname, chapter_info = seg_id.split("|", 1)
455
  chapter_num = int(chapter_info.replace("chapter ", ""))
456
+
457
  try:
458
  # Use botok for accurate word count for raw Tibetan text
459
  tokenized_segments = tokenize_texts([text_content]) # Returns a list of lists
 
461
  word_count = len(tokenized_segments[0])
462
  else:
463
  word_count = 0
464
+
465
  word_counts_data.append(
466
  {
467
  "Filename": fname.replace(".txt", ""),
 
481
  "WordCount": 0,
482
  }
483
  )
484
+
485
  # Create and sort the word counts DataFrame
486
  word_counts_df = pd.DataFrame(word_counts_data)
487
  if not word_counts_df.empty:
488
  word_counts_df = word_counts_df.sort_values(
489
  by=["Filename", "ChapterNumber"]
490
  ).reset_index(drop=True)
491
+
492
  if progress_callback is not None:
493
  try:
494
  progress_callback(0.95, desc="Analysis complete!")
495
  except Exception as e:
496
  logger.warning(f"Progress callback error (non-critical): {e}")
497
+
498
  # Send final progressive update if callback is provided
499
  if progressive_callback is not None:
500
  try:
 
508
  )
509
  except Exception as e:
510
  logger.warning(f"Final progressive callback error (non-critical): {e}")
511
+
512
  # Return the results
513
  return metrics_df, word_counts_df, warning
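The segmentation and pairing logic in the hunk above can be read in isolation. The following is a minimal sketch, not the project's implementation: `split_chapters` and `pair_chapters` are hypothetical helper names, the cleaning step (`clean_tibetan_text`) and botok tokenization are left out, and the segment ID for a file without markers is assumed to be `chapter 1`. It shows how texts are split on the `༈` marker and how chapters of each file pair are matched by index, so that only the first `min(len(chaps1), len(chaps2))` chapters are compared.

```python
from itertools import combinations

CHAPTER_MARKER = "༈"


def split_chapters(fname, content):
    """Return {segment_id: text} for one file (hypothetical helper)."""
    if not content.strip():
        return {}  # empty files contribute no segments
    if CHAPTER_MARKER in content:
        parts = [seg.strip() for seg in content.split(CHAPTER_MARKER) if seg.strip()]
    else:
        parts = [content.strip()]  # fallback: whole file as a single segment
    return {f"{fname}|chapter {i + 1}": seg for i, seg in enumerate(parts)}


def pair_chapters(text_data):
    """Pair chapters by index for every combination of two files."""
    file_to_chapters = {}
    for fname, content in text_data.items():
        seg_ids = list(split_chapters(fname, content))
        if seg_ids:
            file_to_chapters[fname] = seg_ids

    pairs = []
    for file1, file2 in combinations(file_to_chapters, 2):
        # zip stops at the shorter list, which matches the min() in the diff
        pairs.extend(zip(file_to_chapters[file1], file_to_chapters[file2]))
    return pairs
```

One consequence of index-based pairing is that surplus chapters in the longer file are silently left out of the comparison, which is worth keeping in mind when two witnesses differ in chapter count.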
pipeline/progressive_loader.py CHANGED
@@ -36,15 +36,15 @@ class ProgressiveResult:
36
  class ProgressiveLoader:
37
  """
38
  Manages progressive loading of metrics computation results.
39
-
40
  This class handles the incremental updates of metrics as they are computed,
41
  allowing the UI to display partial results before the entire computation is complete.
42
  """
43
-
44
  def __init__(self, update_callback: Optional[Callable[[ProgressiveResult], None]] = None):
45
  """
46
  Initialize the ProgressiveLoader.
47
-
48
  Args:
49
  update_callback: Function to call when new results are available.
50
  Should accept a ProgressiveResult object.
@@ -57,16 +57,16 @@ class ProgressiveLoader:
57
  self.is_complete = False
58
  self.last_update_time = 0
59
  self.update_interval = 0.5 # Minimum seconds between updates to avoid UI thrashing
60
-
61
- def update(self,
62
  metrics_df: Optional[pd.DataFrame] = None,
63
- word_counts_df: Optional[pd.DataFrame] = None,
64
  completed_metric: Optional[MetricType] = None,
65
  warning: Optional[str] = None,
66
  is_complete: bool = False) -> None:
67
  """
68
  Update the progressive results and trigger the callback if enough time has passed.
69
-
70
  Args:
71
  metrics_df: Updated metrics DataFrame
72
  word_counts_df: Updated word counts DataFrame
@@ -75,27 +75,27 @@ class ProgressiveLoader:
75
  is_complete: Whether the computation is complete
76
  """
77
  current_time = time.time()
78
-
79
  # Update internal state
80
  if metrics_df is not None:
81
  self.metrics_df = metrics_df
82
-
83
  if word_counts_df is not None:
84
  self.word_counts_df = word_counts_df
85
-
86
  if completed_metric is not None and completed_metric not in self.completed_metrics:
87
  self.completed_metrics.append(completed_metric)
88
-
89
  if warning:
90
  self.warning = warning
91
-
92
  self.is_complete = is_complete
93
-
94
  # Only trigger update if enough time has passed or if this is the final update
95
  if (current_time - self.last_update_time >= self.update_interval) or is_complete:
96
  self._trigger_update()
97
  self.last_update_time = current_time
98
-
99
  def _trigger_update(self) -> None:
100
  """Trigger the update callback with the current state."""
101
  if self.update_callback:
 
36
  class ProgressiveLoader:
37
  """
38
  Manages progressive loading of metrics computation results.
39
+
40
  This class handles the incremental updates of metrics as they are computed,
41
  allowing the UI to display partial results before the entire computation is complete.
42
  """
43
+
44
  def __init__(self, update_callback: Optional[Callable[[ProgressiveResult], None]] = None):
45
  """
46
  Initialize the ProgressiveLoader.
47
+
48
  Args:
49
  update_callback: Function to call when new results are available.
50
  Should accept a ProgressiveResult object.
 
57
  self.is_complete = False
58
  self.last_update_time = 0
59
  self.update_interval = 0.5 # Minimum seconds between updates to avoid UI thrashing
60
+
61
+ def update(self,
62
  metrics_df: Optional[pd.DataFrame] = None,
63
+ word_counts_df: Optional[pd.DataFrame] = None,
64
  completed_metric: Optional[MetricType] = None,
65
  warning: Optional[str] = None,
66
  is_complete: bool = False) -> None:
67
  """
68
  Update the progressive results and trigger the callback if enough time has passed.
69
+
70
  Args:
71
  metrics_df: Updated metrics DataFrame
72
  word_counts_df: Updated word counts DataFrame
 
75
  is_complete: Whether the computation is complete
76
  """
77
  current_time = time.time()
78
+
79
  # Update internal state
80
  if metrics_df is not None:
81
  self.metrics_df = metrics_df
82
+
83
  if word_counts_df is not None:
84
  self.word_counts_df = word_counts_df
85
+
86
  if completed_metric is not None and completed_metric not in self.completed_metrics:
87
  self.completed_metrics.append(completed_metric)
88
+
89
  if warning:
90
  self.warning = warning
91
+
92
  self.is_complete = is_complete
93
+
94
  # Only trigger update if enough time has passed or if this is the final update
95
  if (current_time - self.last_update_time >= self.update_interval) or is_complete:
96
  self._trigger_update()
97
  self.last_update_time = current_time
98
+
99
  def _trigger_update(self) -> None:
100
  """Trigger the update callback with the current state."""
101
  if self.update_callback:
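The throttling in `ProgressiveLoader.update` (state is always stored, but the callback fires only when `update_interval` seconds have passed or the run is complete) can be illustrated with a small stand-alone sketch. The class name `ThrottledNotifier`, the injectable `clock` parameter, and the use of `time.monotonic` are assumptions made for the example; the code in the diff uses `time.time()` and a `last_update_time` initialised to 0.

```python
import time


class ThrottledNotifier:
    """Forward state to a callback at most once per min_interval seconds."""

    def __init__(self, callback, min_interval=0.5, clock=time.monotonic):
        self.callback = callback
        self.min_interval = min_interval
        self.clock = clock          # injectable so the behaviour can be tested
        self.last_sent = None       # None means nothing has been sent yet
        self.state = None

    def update(self, state, is_complete=False):
        """Store the latest state; return True if the callback was invoked."""
        self.state = state
        now = self.clock()
        due = self.last_sent is None or (now - self.last_sent) >= self.min_interval
        if due or is_complete:
            self.callback(state)
            self.last_sent = now
            return True
        return False
```

Because the final update bypasses the interval check, the UI always receives the completed result even if it arrives right after a previous update.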
pipeline/progressive_ui.py CHANGED
@@ -17,25 +17,25 @@ logger = logging.getLogger(__name__)
17
  class ProgressiveUI:
18
  """
19
  Manages progressive UI updates for the Tibetan Text Metrics app.
20
-
21
  This class handles the incremental updates of UI components as metrics
22
  are computed, allowing for a more responsive user experience.
23
  """
24
-
25
- def __init__(self,
26
  metrics_preview: gr.Dataframe,
27
  word_count_plot: gr.Plot,
28
  jaccard_heatmap: gr.Plot,
29
  lcs_heatmap: gr.Plot,
30
  fuzzy_heatmap: gr.Plot,
31
- semantic_heatmap: gr.Plot,
32
- warning_box: gr.Markdown,
33
- progress_container: gr.Row,
34
- heatmap_titles: Dict[str, str],
35
  structural_btn=None):
36
  """
37
  Initialize the ProgressiveUI.
38
-
39
  Args:
40
  metrics_preview: Gradio Dataframe component for metrics preview
41
  word_count_plot: Gradio Plot component for word count visualization
@@ -55,9 +55,9 @@ class ProgressiveUI:
55
  self.semantic_heatmap = semantic_heatmap
56
  self.warning_box = warning_box
57
  self.progress_container = progress_container
58
- self.heatmap_titles = heatmap_titles
59
  self.structural_btn = structural_btn
60
-
61
  # Create progress indicators for each metric
62
  with self.progress_container:
63
  self.jaccard_progress = gr.Markdown("🔄 **Jaccard Similarity:** Waiting...", elem_id="jaccard_progress")
@@ -65,90 +65,90 @@ class ProgressiveUI:
65
  self.fuzzy_progress = gr.Markdown("🔄 **Fuzzy Similarity:** Waiting...", elem_id="fuzzy_progress")
66
  self.semantic_progress = gr.Markdown("🔄 **Semantic Similarity:** Waiting...", elem_id="semantic_progress")
67
  self.word_count_progress = gr.Markdown("🔄 **Word Counts:** Waiting...", elem_id="word_count_progress")
68
-
69
  # Track which components have been updated
70
  self.updated_components = set()
71
-
72
  def update(self, result: ProgressiveResult) -> Dict[gr.components.Component, Any]:
73
  """
74
  Update UI components based on progressive results.
75
-
76
  Args:
77
  result: ProgressiveResult object containing the current state of computation
78
-
79
  Returns:
80
  Dictionary mapping Gradio components to their updated values
81
  """
82
  updates = {}
83
-
84
  # Always update metrics preview if we have data
85
  if not result.metrics_df.empty:
86
  updates[self.metrics_preview] = result.metrics_df.head(10)
87
-
88
  # Update warning if present
89
  if result.warning:
90
  warning_md = f"**⚠️ Warning:** {result.warning}" if result.warning else ""
91
  updates[self.warning_box] = gr.update(value=warning_md, visible=True)
92
-
93
  # Generate visualizations for completed metrics
94
  if not result.metrics_df.empty:
95
  # Generate heatmaps for available metrics
96
  heatmaps_data = generate_visualizations(
97
  result.metrics_df, descriptive_titles=self.heatmap_titles
98
  )
99
-
100
  # Update heatmaps and progress indicators for completed metrics
101
  for metric_type in result.completed_metrics:
102
  if metric_type == MetricType.JACCARD:
103
  # Update progress indicator
104
  updates[self.jaccard_progress] = "✅ **Jaccard Similarity:** Complete"
105
-
106
  # Update heatmap if not already updated
107
  if self.jaccard_heatmap not in self.updated_components:
108
  if "Jaccard Similarity (%)" in heatmaps_data:
109
  updates[self.jaccard_heatmap] = heatmaps_data["Jaccard Similarity (%)"]
110
  self.updated_components.add(self.jaccard_heatmap)
111
-
112
  elif metric_type == MetricType.LCS:
113
  # Update progress indicator
114
  updates[self.lcs_progress] = "✅ **Normalized LCS:** Complete"
115
-
116
  # Update heatmap if not already updated
117
  if self.lcs_heatmap not in self.updated_components:
118
  if "Normalized LCS" in heatmaps_data:
119
  updates[self.lcs_heatmap] = heatmaps_data["Normalized LCS"]
120
  self.updated_components.add(self.lcs_heatmap)
121
-
122
  elif metric_type == MetricType.FUZZY:
123
  # Update progress indicator
124
  updates[self.fuzzy_progress] = "✅ **Fuzzy Similarity:** Complete"
125
-
126
  # Update heatmap if not already updated
127
  if self.fuzzy_heatmap not in self.updated_components:
128
  if "Fuzzy Similarity" in heatmaps_data:
129
  updates[self.fuzzy_heatmap] = heatmaps_data["Fuzzy Similarity"]
130
  self.updated_components.add(self.fuzzy_heatmap)
131
-
132
  elif metric_type == MetricType.SEMANTIC:
133
  # Update progress indicator
134
  updates[self.semantic_progress] = "✅ **Semantic Similarity:** Complete"
135
-
136
  # Update heatmap if not already updated
137
  if self.semantic_heatmap not in self.updated_components:
138
  if "Semantic Similarity" in heatmaps_data:
139
  updates[self.semantic_heatmap] = heatmaps_data["Semantic Similarity"]
140
  self.updated_components.add(self.semantic_heatmap)
141
-
142
  # Generate word count chart if we have data
143
  if not result.word_counts_df.empty:
144
  # Update progress indicator
145
  updates[self.word_count_progress] = "✅ **Word Counts:** Complete"
146
-
147
  # Update chart if not already updated
148
  if self.word_count_plot not in self.updated_components:
149
  updates[self.word_count_plot] = generate_word_count_chart(result.word_counts_df)
150
  self.updated_components.add(self.word_count_plot)
151
-
152
  # Update progress indicators for metrics in progress
153
  if not result.is_complete:
154
  # Update progress indicators for metrics that are still in progress
@@ -167,28 +167,28 @@ class ProgressiveUI:
167
  if self.structural_btn is not None:
168
  updates[self.structural_btn] = gr.update(interactive=True)
169
  logger.info("Enabling structural analysis button via progressive UI")
170
-
171
  return updates
172
 
173
 
174
  def create_progressive_callback(progressive_ui: ProgressiveUI) -> Callable:
175
  """
176
  Create a callback function for progressive updates.
177
-
178
  Args:
179
  progressive_ui: ProgressiveUI instance to handle updates
180
-
181
  Returns:
182
  Callback function that can be passed to process_texts
183
  """
184
- def callback(metrics_df: pd.DataFrame,
185
  word_counts_df: pd.DataFrame,
186
  completed_metrics: List[MetricType],
187
  warning: str,
188
  is_complete: bool) -> None:
189
  """
190
  Callback function for progressive updates.
191
-
192
  Args:
193
  metrics_df: DataFrame with current metrics
194
  word_counts_df: DataFrame with word counts
@@ -203,10 +203,10 @@ def create_progressive_callback(progressive_ui: ProgressiveUI) -> Callable:
203
  warning=warning,
204
  is_complete=is_complete
205
  )
206
-
207
  # Get updates for UI components
208
  updates = progressive_ui.update(result)
209
-
210
  # Apply updates to UI components
211
  for component, value in updates.items():
212
  try:
@@ -228,5 +228,5 @@ def create_progressive_callback(progressive_ui: ProgressiveUI) -> Callable:
228
  logger.warning(f"Cannot update component of type {type(component)}")
229
  except Exception as e:
230
  logger.warning(f"Error updating component: {e}")
231
-
232
  return callback
 
17
  class ProgressiveUI:
18
  """
19
  Manages progressive UI updates for the Tibetan Text Metrics app.
20
+
21
  This class handles the incremental updates of UI components as metrics
22
  are computed, allowing for a more responsive user experience.
23
  """
24
+
25
+ def __init__(self,
26
  metrics_preview: gr.Dataframe,
27
  word_count_plot: gr.Plot,
28
  jaccard_heatmap: gr.Plot,
29
  lcs_heatmap: gr.Plot,
30
  fuzzy_heatmap: gr.Plot,
31
+ semantic_heatmap: gr.Plot = None,
32
+ warning_box: gr.Markdown = None,
33
+ progress_container: gr.Row = None,
34
+ heatmap_titles: Dict[str, str] = None,
35
  structural_btn=None):
36
  """
37
  Initialize the ProgressiveUI.
38
+
39
  Args:
40
  metrics_preview: Gradio Dataframe component for metrics preview
41
  word_count_plot: Gradio Plot component for word count visualization
 
55
  self.semantic_heatmap = semantic_heatmap
56
  self.warning_box = warning_box
57
  self.progress_container = progress_container
58
+ self.heatmap_titles = heatmap_titles or {}
59
  self.structural_btn = structural_btn
60
+
61
  # Create progress indicators for each metric
62
  with self.progress_container:
63
  self.jaccard_progress = gr.Markdown("🔄 **Jaccard Similarity:** Waiting...", elem_id="jaccard_progress")
 
65
  self.fuzzy_progress = gr.Markdown("🔄 **Fuzzy Similarity:** Waiting...", elem_id="fuzzy_progress")
66
  self.semantic_progress = gr.Markdown("🔄 **Semantic Similarity:** Waiting...", elem_id="semantic_progress")
67
  self.word_count_progress = gr.Markdown("🔄 **Word Counts:** Waiting...", elem_id="word_count_progress")
68
+
69
  # Track which components have been updated
70
  self.updated_components = set()
71
+
72
  def update(self, result: ProgressiveResult) -> Dict[gr.components.Component, Any]:
73
  """
74
  Update UI components based on progressive results.
75
+
76
  Args:
77
  result: ProgressiveResult object containing the current state of computation
78
+
79
  Returns:
80
  Dictionary mapping Gradio components to their updated values
81
  """
82
  updates = {}
83
+
84
  # Always update metrics preview if we have data
85
  if not result.metrics_df.empty:
86
  updates[self.metrics_preview] = result.metrics_df.head(10)
87
+
88
  # Update warning if present
89
  if result.warning:
90
  warning_md = f"**⚠️ Warning:** {result.warning}" if result.warning else ""
91
  updates[self.warning_box] = gr.update(value=warning_md, visible=True)
92
+
93
  # Generate visualizations for completed metrics
94
  if not result.metrics_df.empty:
95
  # Generate heatmaps for available metrics
96
  heatmaps_data = generate_visualizations(
97
  result.metrics_df, descriptive_titles=self.heatmap_titles
98
  )
99
+
100
  # Update heatmaps and progress indicators for completed metrics
101
  for metric_type in result.completed_metrics:
102
  if metric_type == MetricType.JACCARD:
103
  # Update progress indicator
104
  updates[self.jaccard_progress] = "✅ **Jaccard Similarity:** Complete"
105
+
106
  # Update heatmap if not already updated
107
  if self.jaccard_heatmap not in self.updated_components:
108
  if "Jaccard Similarity (%)" in heatmaps_data:
109
  updates[self.jaccard_heatmap] = heatmaps_data["Jaccard Similarity (%)"]
110
  self.updated_components.add(self.jaccard_heatmap)
111
+
112
  elif metric_type == MetricType.LCS:
113
  # Update progress indicator
114
  updates[self.lcs_progress] = "✅ **Normalized LCS:** Complete"
115
+
116
  # Update heatmap if not already updated
117
  if self.lcs_heatmap not in self.updated_components:
118
  if "Normalized LCS" in heatmaps_data:
119
  updates[self.lcs_heatmap] = heatmaps_data["Normalized LCS"]
120
  self.updated_components.add(self.lcs_heatmap)
121
+
122
  elif metric_type == MetricType.FUZZY:
123
  # Update progress indicator
124
  updates[self.fuzzy_progress] = "✅ **Fuzzy Similarity:** Complete"
125
+
126
  # Update heatmap if not already updated
127
  if self.fuzzy_heatmap not in self.updated_components:
128
  if "Fuzzy Similarity" in heatmaps_data:
129
  updates[self.fuzzy_heatmap] = heatmaps_data["Fuzzy Similarity"]
130
  self.updated_components.add(self.fuzzy_heatmap)
131
+
132
  elif metric_type == MetricType.SEMANTIC:
133
  # Update progress indicator
134
  updates[self.semantic_progress] = "✅ **Semantic Similarity:** Complete"
135
+
136
  # Update heatmap if not already updated
137
  if self.semantic_heatmap not in self.updated_components:
138
  if "Semantic Similarity" in heatmaps_data:
139
  updates[self.semantic_heatmap] = heatmaps_data["Semantic Similarity"]
140
  self.updated_components.add(self.semantic_heatmap)
141
+
142
  # Generate word count chart if we have data
143
  if not result.word_counts_df.empty:
144
  # Update progress indicator
145
  updates[self.word_count_progress] = "✅ **Word Counts:** Complete"
146
+
147
  # Update chart if not already updated
148
  if self.word_count_plot not in self.updated_components:
149
  updates[self.word_count_plot] = generate_word_count_chart(result.word_counts_df)
150
  self.updated_components.add(self.word_count_plot)
151
+
152
  # Update progress indicators for metrics in progress
153
  if not result.is_complete:
154
  # Update progress indicators for metrics that are still in progress
 
167
  if self.structural_btn is not None:
168
  updates[self.structural_btn] = gr.update(interactive=True)
169
  logger.info("Enabling structural analysis button via progressive UI")
170
+
171
  return updates
172
 
173
 
174
  def create_progressive_callback(progressive_ui: ProgressiveUI) -> Callable:
175
  """
176
  Create a callback function for progressive updates.
177
+
178
  Args:
179
  progressive_ui: ProgressiveUI instance to handle updates
180
+
181
  Returns:
182
  Callback function that can be passed to process_texts
183
  """
184
+ def callback(metrics_df: pd.DataFrame,
185
  word_counts_df: pd.DataFrame,
186
  completed_metrics: List[MetricType],
187
  warning: str,
188
  is_complete: bool) -> None:
189
  """
190
  Callback function for progressive updates.
191
+
192
  Args:
193
  metrics_df: DataFrame with current metrics
194
  word_counts_df: DataFrame with word counts
 
203
  warning=warning,
204
  is_complete=is_complete
205
  )
206
+
207
  # Get updates for UI components
208
  updates = progressive_ui.update(result)
209
+
210
  # Apply updates to UI components
211
  for component, value in updates.items():
212
  try:
 
228
  logger.warning(f"Cannot update component of type {type(component)}")
229
  except Exception as e:
230
  logger.warning(f"Error updating component: {e}")
231
+
232
  return callback
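The `updated_components` set in `ProgressiveUI.update` ensures that each heatmap is pushed to the UI only once, the first time its data is available. A Gradio-free sketch of that pattern follows; `collect_updates` and its arguments are hypothetical names, and plain strings stand in for the metric types and components.

```python
def collect_updates(completed, available, already_sent):
    """Return updates for completed metrics that have data and were not sent before.

    completed:    iterable of metric names reported as finished
    available:    dict mapping metric name -> rendered payload (e.g. a figure)
    already_sent: set that is mutated to remember what has been delivered
    """
    updates = {}
    for metric in completed:
        if metric in available and metric not in already_sent:
            updates[metric] = available[metric]
            already_sent.add(metric)
    return updates
```

A metric that is reported as complete before its payload exists is not marked as sent, so it is picked up by a later call once the payload is available.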
pipeline/stopwords_bo.py CHANGED
@@ -21,13 +21,13 @@ ORDINAL_NUMBERS = [
21
 
22
  # Additional stopwords from the comprehensive list, categorized for readability
23
  MORE_PARTICLES_SUFFIXES = [
24
- "འི་", "དུ་", "གིས་", "ཏེ", "གི་", "ཡི་", "ཀྱི་", "པས་", "ཀྱིས་", "ཡི", "ལ", "ནི་", "ར", "དུ",
25
- "ལས", "གྱིས་", "ས", "ཏེ་", "གྱི་", "དེ", "ཀ་", "སྟེ", "སྟེ་", "ངམ", "ཏོ", "དོ", "དམ་",
26
- "གྱིན་", "ན", "འམ་", "ཀྱིན་", "ལོ", "ཀྱིས", "བས་", "ཤིག", "གིས", "ཀི་", "ཡིས་", "གྱི", "གི",
27
- "བམ་", "ཤིག་", "ནམ", "མིན་", "ནམ་", "ངམ་", "རུ་", "ཤས་", "ཏུ", "ཡིས", "གིན་", "གམ་",
28
- "གྱིས", "ཅང་", "སམ་", "ཞིག", "འང", "རུ", "དང", "ཡ", "འག", "སམ", "ཀ", "འམ", "མམ",
29
- "དམ", "ཀྱི", "ལམ", "ནོ་", "སོ་", "རམ་", "བོ་", "ཨང་", "ཕྱི", "ཏོ་", "གེ", "གོ", "རོ་", "བོ",
30
- "པས", "འི", "རམ", "བས", "མཾ་", "པོ", "ག་", "ག", "གམ", "བམ", "མོ་", "མམ་", "ཏམ་", "ངོ",
31
  "ཏམ", "གིང་", "ཀྱང" # ཀྱང also in PARTICLES_INITIAL, set() will handle duplicates
32
  ]
33
 
@@ -36,13 +36,13 @@ PRONOUNS_DEMONSTRATIVES = ["འདི", "གཞན་", "དེ་", "རང་"
36
  VERBS_AUXILIARIES = ["ཡིན་", "མི་", "ལགས་པ", "ཡིན་པ", "ལགས་", "མིན་", "ཡིན་པ་", "མིན", "ཡིན་བ", "ཡིན་ལུགས་"]
37
 
38
  ADVERBS_QUALIFIERS_INTENSIFIERS = [
39
- "སོགས་", "ཙམ་", "ཡང་", "ཉིད་", "ཞིང་", "རུང་", "ན་རེ", "འང་", "ཁོ་ན་", "འཕྲལ་", "བར་",
40
  "ཅུང་ཟད་", "ཙམ་པ་", "ཤ་སྟག་"
41
  ]
42
 
43
  QUANTIFIERS_DETERMINERS_COLLECTIVES = [
44
- "རྣམས་", "ཀུན་", "སྙེད་", "བཅས་", "ཡོངས་", "མཐའ་དག་", "དག་", "ཚུ", "ཚང་མ", "ཐམས་ཅད་",
45
- "ཅིག་", "སྣ་ཚོགས་", "སྙེད་པ", "རེ་རེ་", "འགའ་", "སྤྱི", "དུ་མ", "མ", "ཁོ་ན", "ཚོ", "ལ་ལ་",
46
  "སྙེད་པ་", "འབའ་", "སྙེད", "གྲང་", "ཁ་", "ངེ་", "ཅོག་", "རིལ་", "ཉུང་ཤས་", "ཚ་"
47
  ]
48
 
@@ -64,8 +64,19 @@ _ALL_STOPWORDS_CATEGORIZED = (
64
  INTERJECTIONS_EXCLAMATIONS
65
  )
66
 
67
- # Final flat list of unique stopwords
68
- TIBETAN_STOPWORDS = list(set(_ALL_STOPWORDS_CATEGORIZED))
69
 
70
  # Final set of unique stopwords for efficient Jaccard/LCS filtering (as a set)
 
71
  TIBETAN_STOPWORDS_SET = set(TIBETAN_STOPWORDS)
 
21
 
22
  # Additional stopwords from the comprehensive list, categorized for readability
23
  MORE_PARTICLES_SUFFIXES = [
24
+ "འི་", "དུ་", "གིས་", "ཏེ", "གི་", "ཡི་", "ཀྱི་", "པས་", "ཀྱིས་", "ཡི", "ལ", "ནི་", "ར", "དུ",
25
+ "ལས", "གྱིས་", "ས", "ཏེ་", "གྱི་", "དེ", "ཀ་", "སྟེ", "སྟེ་", "ངམ", "ཏོ", "དོ", "དམ་",
26
+ "གྱིན་", "ན", "འམ་", "ཀྱིན་", "ལོ", "ཀྱིས", "བས་", "ཤིག", "གིས", "ཀི་", "ཡིས་", "གྱི", "གི",
27
+ "བམ་", "ཤིག་", "ནམ", "མིན་", "ནམ་", "ངམ་", "རུ་", "ཤས་", "ཏུ", "ཡིས", "གིན་", "གམ་",
28
+ "གྱིས", "ཅང་", "སམ་", "ཞིག", "འང", "རུ", "དང", "ཡ", "འག", "སམ", "ཀ", "འམ", "མམ",
29
+ "དམ", "ཀྱི", "ལམ", "ནོ་", "སོ་", "རམ་", "བོ་", "ཨང་", "ཕྱི", "ཏོ་", "གེ", "གོ", "རོ་", "བོ",
30
+ "པས", "འི", "རམ", "བས", "མཾ་", "པོ", "ག་", "ག", "གམ", "བམ", "མོ་", "མམ་", "ཏམ་", "ངོ",
31
  "ཏམ", "གིང་", "ཀྱང" # ཀྱང also in PARTICLES_INITIAL, set() will handle duplicates
32
  ]
33
 
 
36
  VERBS_AUXILIARIES = ["ཡིན་", "མི་", "ལགས་པ", "ཡིན་པ", "ལགས་", "མིན་", "ཡིན་པ་", "མིན", "ཡིན་བ", "ཡིན་ལུགས་"]
37
 
38
  ADVERBS_QUALIFIERS_INTENSIFIERS = [
39
+ "སོགས་", "ཙམ་", "ཡང་", "ཉིད་", "ཞིང་", "རུང་", "ན་རེ", "འང་", "ཁོ་ན་", "འཕྲལ་", "བར་",
40
  "ཅུང་ཟད་", "ཙམ་པ་", "ཤ་སྟག་"
41
  ]
42
 
43
  QUANTIFIERS_DETERMINERS_COLLECTIVES = [
44
+ "རྣམས་", "ཀུན་", "སྙེད་", "བཅས་", "ཡོངས་", "མཐའ་དག་", "དག་", "ཚུ", "ཚང་མ", "ཐམས་ཅད་",
45
+ "ཅིག་", "སྣ་ཚོགས་", "སྙེད་པ", "རེ་རེ་", "འགའ་", "སྤྱི", "དུ་མ", "མ", "ཁོ་ན", "ཚོ", "ལ་ལ་",
46
  "སྙེད་པ་", "འབའ་", "སྙེད", "གྲང་", "ཁ་", "ངེ་", "ཅོག་", "རིལ་", "ཉུང་ཤས་", "ཚ་"
47
  ]
48
 
 
64
  INTERJECTIONS_EXCLAMATIONS
65
  )
66
 
67
+ def _normalize_tibetan_token(token: str) -> str:
68
+ """
69
+ Normalize a Tibetan token by removing trailing tsek (་).
70
+
71
+ This ensures consistent matching regardless of whether the tokenizer
72
+ preserves or strips the tsek. Botok's behavior can vary, so we normalize
73
+ both the stopwords and the tokens being compared.
74
+ """
75
+ return token.rstrip('་')
76
+
77
+ # Final flat list of unique stopwords (normalized to remove trailing tsek)
78
+ TIBETAN_STOPWORDS = list(set(_normalize_tibetan_token(sw) for sw in _ALL_STOPWORDS_CATEGORIZED))
79
 
80
  # Final set of unique stopwords for efficient Jaccard/LCS filtering (as a set)
81
+ # Normalized to match tokenizer output regardless of tsek handling
82
  TIBETAN_STOPWORDS_SET = set(TIBETAN_STOPWORDS)
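Review note: the stopword set above is normalized, but matching only works if tokens are normalized the same way at lookup time. The following is a minimal sketch of that filtering step, not code from this commit. The names `normalize_token` and `filter_stopwords` and the three-entry stopword list are illustrative assumptions; the real lists live in `pipeline/stopwords_bo.py`.

```python
TSEK = "\u0f0b"  # Tibetan intersyllabic mark (tsek)


def normalize_token(token: str) -> str:
    """Strip trailing tsek so a token compares equal with or without it."""
    return token.rstrip(TSEK)


# Illustrative entries only (hypothetical subset of the real stopword lists).
raw_stopwords = ["དང", "ནི་", "ཡང་"]
stopword_set = {normalize_token(sw) for sw in raw_stopwords}


def filter_stopwords(tokens):
    """Drop tokens whose normalized form is a stopword; keep the rest unchanged."""
    return [t for t in tokens if normalize_token(t) not in stopword_set]


# The tokenizer may or may not keep the trailing tsek; both forms are handled.
tokens = ["ཆོས་", "དང", "ནི་", "སེམས"]
filtered = filter_stopwords(tokens)
```

Normalizing on both sides keeps the comparison independent of tokenizer behavior, at the cost of one `rstrip` call per token.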
pipeline/stopwords_lite_bo.py CHANGED
@@ -15,8 +15,8 @@ MARKERS_AND_PUNCTUATION = ["༈", "།", "༎", "༑"]
15
 
16
  # Reduced list of particles and suffixes
17
  MORE_PARTICLES_SUFFIXES_LITE = [
18
- "འི་", "དུ་", "གིས་", "ཏེ", "གི་", "ཡི་", "ཀྱི་", "པས་", "ཀྱིས་", "ཡི", "ལ", "ནི་", "ར", "དུ",
19
- "ལས", "གྱིས་", "ས", "ཏེ་", "གྱི་", "དེ", "ཀ་", "སྟེ", "སྟེ་", "ངམ", "ཏོ", "དོ", "དམ་",
20
  "ན", "འམ་", "ལོ", "ཀྱིས", "བས་", "ཤིག", "གིས", "ཀི་", "ཡིས་", "གྱི", "གི"
21
  ]
22
 
@@ -27,8 +27,18 @@ _ALL_STOPWORDS_CATEGORIZED_LITE = (
27
  MORE_PARTICLES_SUFFIXES_LITE
28
  )
29
 
30
- # Final flat list of unique stopwords
31
- TIBETAN_STOPWORDS_LITE = list(set(_ALL_STOPWORDS_CATEGORIZED_LITE))
32
 
33
  # Final set of unique stopwords for efficient Jaccard/LCS filtering (as a set)
 
34
  TIBETAN_STOPWORDS_LITE_SET = set(TIBETAN_STOPWORDS_LITE)
 
15
 
16
  # Reduced list of particles and suffixes
17
  MORE_PARTICLES_SUFFIXES_LITE = [
18
+ "འི་", "དུ་", "གིས་", "ཏེ", "གི་", "ཡི་", "ཀྱི་", "པས་", "ཀྱིས་", "ཡི", "ལ", "ནི་", "ར", "དུ",
19
+ "ལས", "གྱིས་", "ས", "ཏེ་", "གྱི་", "དེ", "ཀ་", "སྟེ", "སྟེ་", "ངམ", "ཏོ", "དོ", "དམ་",
20
  "ན", "འམ་", "ལོ", "ཀྱིས", "བས་", "ཤིག", "གིས", "ཀི་", "ཡིས་", "གྱི", "གི"
21
  ]
22
 
 
27
  MORE_PARTICLES_SUFFIXES_LITE
28
  )
29
 
30
+ def _normalize_tibetan_token(token: str) -> str:
31
+ """
32
+ Normalize a Tibetan token by removing trailing tsek (་).
33
+
34
+ This ensures consistent matching regardless of whether the tokenizer
35
+ preserves or strips the tsek.
36
+ """
37
+ return token.rstrip('་')
38
+
39
+ # Final flat list of unique stopwords (normalized to remove trailing tsek)
40
+ TIBETAN_STOPWORDS_LITE = list(set(_normalize_tibetan_token(sw) for sw in _ALL_STOPWORDS_CATEGORIZED_LITE))
41
 
42
  # Final set of unique stopwords for efficient Jaccard/LCS filtering (as a set)
43
+ # Normalized to match tokenizer output regardless of tsek handling
44
  TIBETAN_STOPWORDS_LITE_SET = set(TIBETAN_STOPWORDS_LITE)
pipeline/tokenize.py CHANGED
@@ -29,10 +29,10 @@ except ImportError:
29
  def _get_text_hash(text: str) -> str:
30
  """
31
  Generate a hash for the input text to use as a cache key.
32
-
33
  Args:
34
  text: The input text to hash
35
-
36
  Returns:
37
  A string representation of the MD5 hash of the input text
38
  """
@@ -42,17 +42,17 @@ def _get_text_hash(text: str) -> str:
42
  def tokenize_texts(texts: List[str], mode: str = "syllable") -> List[List[str]]:
43
  """
44
  Tokenizes a list of raw Tibetan texts using botok, with caching for performance.
45
-
46
  This function maintains an in-memory cache of previously tokenized texts to avoid
47
  redundant processing of the same content. The cache uses MD5 hashes of the input
48
  texts as keys.
49
-
50
  Args:
51
  texts: List of raw text strings to tokenize.
52
-
53
  Returns:
54
  List of tokenized texts (each as a list of tokens).
55
-
56
  Raises:
57
  RuntimeError: If the botok tokenizer failed to initialize.
58
  """
@@ -68,18 +68,18 @@ def tokenize_texts(texts: List[str], mode: str = "syllable") -> List[List[str]]:
68
  if mode not in ["word", "syllable"]:
69
  logger.warning(f"Invalid tokenization mode: '{mode}'. Defaulting to 'syllable'.")
70
  mode = "syllable"
71
-
72
  # Process each text
73
  for text_content in texts:
74
  # Skip empty texts
75
  if not text_content.strip():
76
  tokenized_texts_list.append([])
77
  continue
78
-
79
  # Generate hash for cache lookup
80
  cache_key_string = text_content + f"_{mode}" # Include mode in string for hashing
81
  text_hash = _get_text_hash(cache_key_string)
82
-
83
  # Check if we have this text in cache
84
  if text_hash in _tokenization_cache:
85
  # Cache hit - use cached tokens
@@ -91,7 +91,7 @@ def tokenize_texts(texts: List[str], mode: str = "syllable") -> List[List[str]]:
91
  current_tokens = []
92
  if BOTOK_TOKENIZER:
93
  raw_botok_items = list(BOTOK_TOKENIZER.tokenize(text_content))
94
-
95
  if mode == "word":
96
  for item_idx, w in enumerate(raw_botok_items):
97
  if hasattr(w, 'text') and isinstance(w.text, str):
@@ -125,7 +125,7 @@ def tokenize_texts(texts: List[str], mode: str = "syllable") -> List[List[str]]:
125
  f"for hash {text_hash[:8]}. Skipping this syllable."
126
  )
127
  continue
128
-
129
  if syllable_to_process is not None:
130
  stripped_syl = syllable_to_process.strip()
131
  if stripped_syl:
@@ -155,20 +155,20 @@ def tokenize_texts(texts: List[str], mode: str = "syllable") -> List[List[str]]:
155
  else:
156
  logger.error(f"BOTOK_TOKENIZER is None for text hash {text_hash[:8]}, cannot tokenize (mode: {mode}).")
157
  tokens = []
158
-
159
  # Store in cache if not empty
160
  if tokens:
161
  # If cache is full, remove a random entry (simple strategy)
162
  if len(_tokenization_cache) >= MAX_CACHE_SIZE:
163
  # Remove first key (oldest if ordered dict, random otherwise)
164
  _tokenization_cache.pop(next(iter(_tokenization_cache)))
165
-
166
  _tokenization_cache[text_hash] = tokens
167
  logger.debug(f"Added tokens to cache with hash {text_hash[:8]}... (mode: {mode})")
168
  except Exception as e:
169
  logger.error(f"Error tokenizing text (mode: {mode}): {e}")
170
  tokens = []
171
-
172
  tokenized_texts_list.append(tokens)
173
-
174
  return tokenized_texts_list
 
29
  def _get_text_hash(text: str) -> str:
30
  """
31
  Generate a hash for the input text to use as a cache key.
32
+
33
  Args:
34
  text: The input text to hash
35
+
36
  Returns:
37
  A string representation of the MD5 hash of the input text
38
  """
 
42
  def tokenize_texts(texts: List[str], mode: str = "syllable") -> List[List[str]]:
43
  """
44
  Tokenizes a list of raw Tibetan texts using botok, with caching for performance.
45
+
46
  This function maintains an in-memory cache of previously tokenized texts to avoid
47
  redundant processing of the same content. The cache uses MD5 hashes of the input
48
  texts as keys.
49
+
50
  Args:
51
  texts: List of raw text strings to tokenize.
52
+
53
  Returns:
54
  List of tokenized texts (each as a list of tokens).
55
+
56
  Raises:
57
  RuntimeError: If the botok tokenizer failed to initialize.
58
  """
 
68
  if mode not in ["word", "syllable"]:
69
  logger.warning(f"Invalid tokenization mode: '{mode}'. Defaulting to 'syllable'.")
70
  mode = "syllable"
71
+
72
  # Process each text
73
  for text_content in texts:
74
  # Skip empty texts
75
  if not text_content.strip():
76
  tokenized_texts_list.append([])
77
  continue
78
+
79
  # Generate hash for cache lookup
80
  cache_key_string = text_content + f"_{mode}" # Include mode in string for hashing
81
  text_hash = _get_text_hash(cache_key_string)
82
+
83
  # Check if we have this text in cache
84
  if text_hash in _tokenization_cache:
85
  # Cache hit - use cached tokens
 
91
  current_tokens = []
92
  if BOTOK_TOKENIZER:
93
  raw_botok_items = list(BOTOK_TOKENIZER.tokenize(text_content))
94
+
95
  if mode == "word":
96
  for item_idx, w in enumerate(raw_botok_items):
97
  if hasattr(w, 'text') and isinstance(w.text, str):
 
125
  f"for hash {text_hash[:8]}. Skipping this syllable."
126
  )
127
  continue
128
+
129
  if syllable_to_process is not None:
130
  stripped_syl = syllable_to_process.strip()
131
  if stripped_syl:
 
155
  else:
156
  logger.error(f"BOTOK_TOKENIZER is None for text hash {text_hash[:8]}, cannot tokenize (mode: {mode}).")
157
  tokens = []
158
+
159
  # Store in cache if not empty
160
  if tokens:
161
  # If cache is full, remove a random entry (simple strategy)
162
  if len(_tokenization_cache) >= MAX_CACHE_SIZE:
163
  # Remove first key (oldest if ordered dict, random otherwise)
164
  _tokenization_cache.pop(next(iter(_tokenization_cache)))
165
+
166
  _tokenization_cache[text_hash] = tokens
167
  logger.debug(f"Added tokens to cache with hash {text_hash[:8]}... (mode: {mode})")
168
  except Exception as e:
169
  logger.error(f"Error tokenizing text (mode: {mode}): {e}")
170
  tokens = []
171
+
172
  tokenized_texts_list.append(tokens)
173
+
174
  return tokenized_texts_list
pipeline/visualize.py CHANGED
@@ -40,29 +40,29 @@ def generate_visualizations(metrics_df: pd.DataFrame, descriptive_titles: dict =
40
  continue
41
 
42
  cleaned_columns = [col.replace(".txt", "") for col in pivot.columns]
43
-
44
  # For consistent interpretation: higher values (more similarity) = darker colors
45
  # Using 'Reds' colormap for all metrics (dark red = high similarity)
46
- cmap = "Reds"
47
-
48
  # Format values for display
49
  text = [
50
  [f"{val:.2f}" if pd.notnull(val) else "" for val in row]
51
  for row in pivot.values
52
  ]
53
-
54
  # Create a copy of the pivot data for visualization
55
  # For LCS and Semantic Similarity, we need to reverse the color scale
56
  # so that higher values (more similarity) are darker
57
  viz_values = pivot.values.copy()
58
-
59
  # Determine if we need to reverse the values for consistent color interpretation
60
  # (darker = more similar across all metrics)
61
  reverse_colorscale = False
62
-
63
  # All metrics should have darker colors for higher similarity
64
  # No need to reverse values anymore - we'll use the same scale for all
65
-
66
  fig = go.Figure(
67
  data=go.Heatmap(
68
  z=viz_values,
 
40
  continue
41
 
42
  cleaned_columns = [col.replace(".txt", "") for col in pivot.columns]
43
+
44
  # For consistent interpretation: higher values (more similarity) = darker colors
45
  # Using 'Reds' colormap for all metrics (dark red = high similarity)
46
+ cmap = "Reds"
47
+
48
  # Format values for display
49
  text = [
50
  [f"{val:.2f}" if pd.notnull(val) else "" for val in row]
51
  for row in pivot.values
52
  ]
53
+
54
  # Create a copy of the pivot data for visualization
55
  # For LCS and Semantic Similarity, we need to reverse the color scale
56
  # so that higher values (more similarity) are darker
57
  viz_values = pivot.values.copy()
58
+
59
  # Determine if we need to reverse the values for consistent color interpretation
60
  # (darker = more similar across all metrics)
61
  reverse_colorscale = False
62
+
63
  # All metrics should have darker colors for higher similarity
64
  # No need to reverse values anymore - we'll use the same scale for all
65
+
66
  fig = go.Figure(
67
  data=go.Heatmap(
68
  z=viz_values,
pyproject.toml CHANGED
@@ -1,8 +1,33 @@
1
  [build-system]
2
  requires = [
3
- "setuptools>=42",
4
- "Cython>=0.29.21",
5
- "numpy>=1.20"
6
  ]
7
  build-backend = "setuptools.build_meta"
8
- backend-path = ["."] # Specifies that setuptools.build_meta is in the current directory's PYTHONPATH
1
  [build-system]
2
  requires = [
3
+ "setuptools>=65",
4
+ "Cython>=3.0",
5
+ "numpy>=1.24"
6
  ]
7
  build-backend = "setuptools.build_meta"
8
+
9
+ [project]
10
+ name = "tibetan-text-metrics-webapp"
11
+ version = "0.4.0"
12
+ description = "Web application for computing text similarity metrics on Tibetan texts"
13
+ readme = "README.md"
14
+ license = {text = "CC-BY-4.0"}
15
+ requires-python = ">=3.10"
16
+ authors = [
17
+ {name = "Daniel Wojahn", email = "[email protected]"}
18
+ ]
19
+ keywords = ["tibetan", "nlp", "text-similarity", "buddhist-texts"]
20
+ classifiers = [
21
+ "Development Status :: 4 - Beta",
22
+ "Intended Audience :: Science/Research",
23
+ "License :: OSI Approved",
24
+ "Programming Language :: Python :: 3",
25
+ "Programming Language :: Python :: 3.10",
26
+ "Programming Language :: Python :: 3.11",
27
+ "Programming Language :: Python :: 3.12",
28
+ "Topic :: Text Processing :: Linguistic",
29
+ ]
30
+
31
+ [project.urls]
32
+ Homepage = "https://github.com/daniel-wojahn/tibetan-text-metrics"
33
+ Repository = "https://github.com/daniel-wojahn/tibetan-text-metrics"
requirements.txt CHANGED
@@ -1,5 +1,6 @@
1
  # Core application and UI
2
- gradio
 
3
  pandas==2.2.3
4
 
5
  # Plotting and visualization
 
1
  # Core application and UI
2
+ # Gradio 5.x (code is forward-compatible with Gradio 6)
3
+ gradio>=5.0.0
4
  pandas==2.2.3
5
 
6
  # Plotting and visualization
setup.py CHANGED
@@ -1,45 +1,39 @@
1
  import numpy
2
  from setuptools import Extension, setup
3
  from Cython.Build import cythonize
4
 
5
- # It's good practice to specify encoding for portability
6
- with open("README.md", "r", encoding="utf-8") as fh:
7
- long_description = fh.read()
8
-
9
  setup(
10
- name="tibetan text metrics webapp",
11
- version="0.1.0",
12
- author="Daniel Wojahn / Tibetan Text Metrics",
13
  author_email="[email protected]",
14
- description="Cython components for the Tibetan Text Metrics Webapp",
15
- long_description=long_description,
16
- long_description_content_type="text/markdown",
17
  url="https://github.com/daniel-wojahn/tibetan-text-metrics",
18
  ext_modules=cythonize(
19
  [
20
  Extension(
21
- "pipeline.fast_lcs", # Module name to import: from pipeline.fast_lcs import ...
22
  ["pipeline/fast_lcs.pyx"],
23
  include_dirs=[numpy.get_include()],
24
  )
25
  ],
26
- compiler_directives={'language_level' : "3"} # For Python 3 compatibility
27
  ),
28
- # Indicates that package data (like .pyx files) should be included if specified in MANIFEST.in
29
- # For simple cases like this, Cythonize usually handles it.
30
- include_package_data=True,
31
- # Although this setup.py is in webapp, it's building modules for the 'pipeline' sub-package
32
- # We don't list packages here as this setup.py is just for the extension.
33
- # The main app will treat 'pipeline' as a regular package.
34
- zip_safe=False, # Cython extensions are generally not zip-safe
35
- classifiers=[
36
- "Programming Language :: Python :: 3",
37
- "License :: OSI Approved :: MIT License",
38
- "Operating System :: OS Independent",
39
- ],
40
- python_requires='>=3.8',
41
  install_requires=[
42
- "numpy>=1.20", # Ensure numpy is available for runtime if not just build time
43
  ],
44
- # setup_requires is deprecated, use pyproject.toml for build-system requirements
45
  )
 
1
+ """
2
+ Setup script for building Cython extensions.
3
+
4
+ This setup.py is used to compile the fast_lcs Cython extension for
5
+ improved LCS calculation performance. The main project metadata is
6
+ in pyproject.toml.
7
+
8
+ Usage:
9
+ python setup.py build_ext --inplace
10
+ """
11
+
12
  import numpy
13
  from setuptools import Extension, setup
14
  from Cython.Build import cythonize
15
 
 
 
 
 
16
  setup(
17
+ name="tibetan-text-metrics-webapp",
18
+ version="0.4.0",
19
+ author="Daniel Wojahn",
20
  author_email="[email protected]",
21
+ description="Cython LCS extension for Tibetan Text Metrics Webapp",
 
 
22
  url="https://github.com/daniel-wojahn/tibetan-text-metrics",
23
  ext_modules=cythonize(
24
  [
25
  Extension(
26
+ "pipeline.fast_lcs",
27
  ["pipeline/fast_lcs.pyx"],
28
  include_dirs=[numpy.get_include()],
29
  )
30
  ],
31
+ compiler_directives={"language_level": "3"}
32
  ),
33
+ include_package_data=True,
34
+ zip_safe=False,
35
+ python_requires=">=3.10",
36
  install_requires=[
37
+ "numpy>=1.24",
38
  ],
 
39
  )
theme.py CHANGED
@@ -1,273 +1,408 @@
 
 
 
 
 
 
 
1
  import gradio as gr
2
  from gradio.themes.utils import colors, sizes, fonts
3
 
4
 
5
  class TibetanAppTheme(gr.themes.Soft):
 
 
 
 
 
 
 
 
 
6
  def __init__(self):
7
  super().__init__(
8
- primary_hue=colors.blue, # Primary interactive elements (e.g., #2563eb)
9
- secondary_hue=colors.orange, # For accents if needed, or default buttons
10
- neutral_hue=colors.slate, # For backgrounds, borders, and text
11
  font=[
12
  fonts.GoogleFont("Inter"),
13
  "ui-sans-serif",
14
  "system-ui",
15
  "sans-serif",
16
  ],
17
- radius_size=sizes.radius_md, # General radius, can be overridden (16px was for cards)
18
- text_size=sizes.text_md, # Base font size (16px)
19
  )
 
 
20
  self.theme_vars_for_set = {
21
  # Global & Body Styles
22
  "body_background_fill": "#f0f2f5",
23
  "body_text_color": "#333333",
24
- # Card Styles (.gr-group)
 
 
25
  "block_background_fill": "#ffffff",
26
- "block_radius": "16px", # May need to be removed if not a valid settable CSS var
27
  "block_shadow": "0 4px 12px rgba(0, 0, 0, 0.08)",
28
  "block_padding": "24px",
29
  "block_border_width": "0px",
30
- # Markdown Styles
31
- "body_text_color_subdued": "#4b5563",
32
- # Button Styles
33
  "button_secondary_background_fill": "#ffffff",
34
  "button_secondary_text_color": "#374151",
35
  "button_secondary_border_color": "#d1d5db",
36
  "button_secondary_border_color_hover": "#adb5bd",
37
  "button_secondary_background_fill_hover": "#f9fafb",
38
- # Primary Button
 
39
  "button_primary_background_fill": "#2563eb",
40
  "button_primary_text_color": "#ffffff",
41
  "button_primary_border_color": "transparent",
42
  "button_primary_background_fill_hover": "#1d4ed8",
43
- # HR style
 
44
  "border_color_accent_subdued": "#e5e7eb",
45
- }
46
- super().set(**self.theme_vars_for_set)
47
 
48
- # Store CSS overrides; these will be converted to a string and applied via gr.Blocks(css=...)
49
- self.css_overrides = {
50
- ".gradio-container, .gr-block, .gr-markdown, label, input, .gr-slider, .gr-radio, .gr-button": {
51
- "font-family": ", ".join(self.font),
52
- "font-size": "16px !important",
53
- "line-height": "1.6 !important",
54
- "color": "#333333 !important",
55
- },
56
- ".gr-group": {"margin-bottom": "24px !important"}, # min-height removed
57
- ".gr-markdown": {
58
- "background": "transparent !important",
59
- "font-size": "1em !important",
60
- "margin-bottom": "16px !important",
61
- },
62
- ".gr-markdown h1": {
63
- "font-size": "28px !important",
64
- "font-weight": "600 !important",
65
- "margin-bottom": "8px !important",
66
- "color": "#111827 !important",
67
- },
68
- ".gr-markdown h2": {
69
- "font-size": "26px !important",
70
- "font-weight": "600 !important",
71
- "color": "var(--primary-600, #2563eb) !important",
72
- "margin-top": "32px !important",
73
- "margin-bottom": "16px !important",
74
- },
75
- ".gr-markdown h3": {
76
- "font-size": "22px !important",
77
- "font-weight": "600 !important",
78
- "color": "#1f2937 !important",
79
- "margin-top": "24px !important",
80
- "margin-bottom": "12px !important",
81
- },
82
- ".gr-markdown p, .gr-markdown span": {
83
- "font-size": "16px !important",
84
- "color": "#4b5563 !important",
85
- },
86
- ".gr-button button": {
87
- "border-radius": "8px !important",
88
- "padding": "10px 20px !important",
89
- "font-weight": "500 !important",
90
- "box-shadow": "0 1px 2px 0 rgba(0, 0, 0, 0.05) !important",
91
- "border": "1px solid #d1d5db !important",
92
- "background-color": "#ffffff !important",
93
- "color": "#374151 !important",
94
- },
95
- "#run-btn": {
96
- "background": "var(--button-primary-background-fill) !important",
97
- "color": "var(--button-primary-text-color) !important",
98
- "font-weight": "bold !important",
99
- "font-size": "24px !important",
100
- "border": "none !important",
101
- "box-shadow": "var(--button-primary-shadow) !important",
102
- },
103
- "#run-btn:hover": { # Changed selector
104
- "background": "var(--button-primary-background-fill-hover) !important",
105
- "box-shadow": "0px 4px 12px rgba(0, 0, 0, 0.15) !important",
106
- "transform": "translateY(-1px) !important",
107
- },
108
- ".gr-button button:hover": {
109
- "background-color": "#f9fafb !important",
110
- "border-color": "#adb5bd !important",
111
- },
112
- "hr": {
113
- "margin": "32px 0 !important",
114
- "border": "none !important",
115
- "border-top": "1px solid var(--border-color-accent-subdued) !important",
116
- },
117
- ".gr-slider, .gr-radio, .gr-file": {"margin-bottom": "20px !important"},
118
- ".gr-radio .gr-form button": {
119
- "background-color": "#f3f4f6 !important",
120
- "color": "#374151 !important",
121
- "border": "1px solid #d1d5db !important",
122
- "border-radius": "6px !important",
123
- "padding": "8px 16px !important",
124
- "font-weight": "500 !important",
125
- },
126
- ".gr-radio .gr-form button:hover": {
127
- "background-color": "#e5e7eb !important",
128
- "border-color": "#9ca3af !important",
129
- },
130
- ".gr-radio .gr-form button.selected": {
131
- "background-color": "var(--primary-500, #3b82f6) !important",
132
- "color": "#ffffff !important",
133
- "border-color": "var(--primary-500, #3b82f6) !important",
134
- },
135
- ".gr-radio .gr-form button.selected:hover": {
136
- "background-color": "var(--primary-600, #2563eb) !important",
137
- "border-color": "var(--primary-600, #2563eb) !important",
138
- },
139
- "#semantic-radio-group span": { # General selector, refined size
140
- "font-size": "17px !important",
141
- "font-weight": "500 !important",
142
- },
143
- "#semantic-radio-group div": { # General selector, refined size
144
- "font-size": "14px !important"
145
- },
146
- # Row and Column flex styles for equal height
147
- "#steps-row": {
148
- "display": "flex !important",
149
- "align-items": "stretch !important",
150
- },
151
- ".step-column": {
152
- "display": "flex !important",
153
- "flex-direction": "column !important",
154
- },
155
- ".step-column > .gr-group": {
156
- "flex-grow": "1 !important",
157
- "display": "flex !important",
158
- "flex-direction": "column !important",
159
- },
160
- ".tabs > .tab-nav": {"border-bottom": "1px solid #d1d5db !important"},
161
- ".tabs > .tab-nav > button.selected": {
162
- "border-bottom": "2px solid var(--primary-500) !important",
163
- "color": "var(--primary-500) !important",
164
- "background-color": "transparent !important",
165
- },
166
- ".tabs > .tab-nav > button": {
167
- "color": "#6b7280 !important",
168
- "background-color": "transparent !important",
169
- "padding": "10px 15px !important",
170
- "border-bottom": "2px solid transparent !important",
171
- },
172
-
173
- # Custom styling for metric accordions
174
- ".metric-info-accordion": {
175
- "border-left": "4px solid #3B82F6 !important",
176
- "margin-bottom": "1rem !important",
177
- "background-color": "#F8FAFC !important",
178
- "border-radius": "6px !important",
179
- "overflow": "hidden !important",
180
- },
181
- ".jaccard-info": {
182
- "border-left-color": "#3B82F6 !important", # Blue
183
- },
184
- ".lcs-info": {
185
- "border-left-color": "#10B981 !important", # Green
186
- },
187
- ".semantic-info": {
188
- "border-left-color": "#8B5CF6 !important", # Purple
189
- },
190
- ".wordcount-info": {
191
- "border-left-color": "#EC4899 !important", # Pink
192
- },
193
-
194
- # Accordion header styling
195
- ".metric-info-accordion > .label-wrap": {
196
- "font-weight": "600 !important",
197
- "padding": "12px 16px !important",
198
- "background-color": "#F1F5F9 !important",
199
- "border-bottom": "1px solid #E2E8F0 !important",
200
- },
201
-
202
- # Accordion content styling
203
- ".metric-info-accordion > .wrap": {
204
- "padding": "16px !important",
205
- },
206
-
207
- # Word count plot styling - full width
208
- ".tabs > .tab-content > div[data-testid='tabitem'] > .plot": {
209
- "width": "100% !important",
210
- },
211
-
212
- # Heatmap plot styling - responsive sizing
213
- ".tabs > .tab-content > div[data-testid='tabitem'] > .plotly": {
214
- "width": "100% !important",
215
- "height": "auto !important",
216
- },
217
-
218
- # Specific heatmap container styling
219
- ".metric-heatmap": {
220
- "max-width": "100% !important",
221
- "overflow-x": "auto !important",
222
- },
223
-
224
- # LLM Analysis styling
225
- ".llm-analysis": {
226
- "background-color": "#f8f9fa !important",
227
- "border-left": "4px solid #3B82F6 !important",
228
- "border-radius": "8px !important",
229
- "padding": "20px 24px !important",
230
- "margin": "16px 0 !important",
231
- "box-shadow": "0 2px 8px rgba(0, 0, 0, 0.05) !important",
232
- },
233
- ".llm-analysis h2": {
234
- "color": "#1e40af !important",
235
- "font-size": "24px !important",
236
- "margin-bottom": "16px !important",
237
- "border-bottom": "1px solid #e5e7eb !important",
238
- "padding-bottom": "8px !important",
239
- },
240
- ".llm-analysis h3, .llm-analysis h4": {
241
- "color": "#1e3a8a !important",
242
- "margin-top": "20px !important",
243
- "margin-bottom": "12px !important",
244
- },
245
- ".llm-analysis p": {
246
- "line-height": "1.7 !important",
247
- "margin-bottom": "12px !important",
248
- },
249
- ".llm-analysis ul, .llm-analysis ol": {
250
- "margin-left": "24px !important",
251
- "margin-bottom": "16px !important",
252
- },
253
- ".llm-analysis li": {
254
- "margin-bottom": "6px !important",
255
- },
256
- ".llm-analysis strong, .llm-analysis b": {
257
- "color": "#1f2937 !important",
258
- "font-weight": "600 !important",
259
- },
260
  }
 
261
 
262
  def get_css_string(self) -> str:
263
- """Converts the self.css_overrides dictionary into a CSS string."""
264
- css_parts = []
265
- for selector, properties in self.css_overrides.items():
266
- props_str = "\n".join(
267
- [f" {prop}: {value};" for prop, value in properties.items()]
268
- )
269
- css_parts.append(f"{selector} {{\n{props_str}\n}}")
270
- return "\n\n".join(css_parts)
271
 
272
 
273
  # Instantiate the theme for easy import
 
1
+ """
2
+ Tibetan Text Metrics Theme - Gradio 6 Compatible
3
+
4
+ This theme provides a clean, professional look for the TTM application.
5
+ Updated for Gradio 6.x compatibility where theme/css are passed to launch().
6
+ """
7
+
8
  import gradio as gr
9
  from gradio.themes.utils import colors, sizes, fonts
10
 
11
 
12
  class TibetanAppTheme(gr.themes.Soft):
13
+ """
14
+ Custom theme for Tibetan Text Metrics application.
15
+
16
+ Gradio 6 Migration Notes:
17
+ - Theme is now passed to demo.launch(theme=...) instead of gr.Blocks(theme=...)
18
+ - CSS is now passed to demo.launch(css=...) instead of gr.Blocks(css=...)
19
+ - Use elem_id and elem_classes for stable CSS targeting
20
+ """
21
+
22
  def __init__(self):
23
  super().__init__(
24
+ primary_hue=colors.blue,
25
+ secondary_hue=colors.orange,
26
+ neutral_hue=colors.slate,
27
  font=[
28
  fonts.GoogleFont("Inter"),
29
  "ui-sans-serif",
30
  "system-ui",
31
  "sans-serif",
32
  ],
33
+ radius_size=sizes.radius_md,
34
+ text_size=sizes.text_md,
35
  )
36
+
37
+ # Theme variable overrides using Gradio's theming system
38
  self.theme_vars_for_set = {
39
  # Global & Body Styles
40
  "body_background_fill": "#f0f2f5",
41
  "body_text_color": "#333333",
42
+ "body_text_color_subdued": "#4b5563",
43
+
44
+ # Block/Card Styles
45
  "block_background_fill": "#ffffff",
46
+ "block_radius": "16px",
47
  "block_shadow": "0 4px 12px rgba(0, 0, 0, 0.08)",
48
  "block_padding": "24px",
49
  "block_border_width": "0px",
50
+
51
+ # Button Styles - Secondary
 
52
  "button_secondary_background_fill": "#ffffff",
53
  "button_secondary_text_color": "#374151",
54
  "button_secondary_border_color": "#d1d5db",
55
  "button_secondary_border_color_hover": "#adb5bd",
56
  "button_secondary_background_fill_hover": "#f9fafb",
57
+
58
+ # Button Styles - Primary
59
  "button_primary_background_fill": "#2563eb",
60
  "button_primary_text_color": "#ffffff",
61
  "button_primary_border_color": "transparent",
62
  "button_primary_background_fill_hover": "#1d4ed8",
63
+
64
+ # Border colors
65
  "border_color_accent_subdued": "#e5e7eb",
66
+ "border_color_primary": "#d1d5db",
 
67
 
68
+ # Input styles
69
+ "input_background_fill": "#ffffff",
70
+ "input_border_color": "#d1d5db",
71
+ "input_border_color_focus": "#2563eb",
72
  }
73
+ super().set(**self.theme_vars_for_set)
74
 
75
  def get_css_string(self) -> str:
76
+ """
77
+ Returns custom CSS string for additional styling.
78
+
79
+ Gradio 6 uses different class naming conventions. This CSS uses:
80
+ - elem_id selectors (#id) for specific components
81
+ - elem_classes selectors (.class) for groups of components
82
+ - Gradio 6 native classes where stable
83
+ """
84
+ return """
85
+ /* ============================================
86
+ GLOBAL STYLES
87
+ ============================================ */
88
+
89
+ .gradio-container {
90
+ font-family: 'Inter', ui-sans-serif, system-ui, sans-serif !important;
91
+ max-width: 1400px !important;
92
+ margin: 0 auto !important;
93
+ }
94
+
95
+ /* ============================================
96
+ TYPOGRAPHY
97
+ ============================================ */
98
+
99
+ h1 {
100
+ font-size: 28px !important;
101
+ font-weight: 600 !important;
102
+ color: #111827 !important;
103
+ margin-bottom: 8px !important;
104
+ }
105
+
106
+ h2 {
107
+ font-size: 24px !important;
108
+ font-weight: 600 !important;
109
+ color: var(--primary-600, #2563eb) !important;
110
+ margin-top: 24px !important;
111
+ margin-bottom: 16px !important;
112
+ }
113
+
114
+ h3 {
115
+ font-size: 20px !important;
116
+ font-weight: 600 !important;
117
+ color: #1f2937 !important;
118
+ margin-top: 20px !important;
119
+ margin-bottom: 12px !important;
120
+ }
121
+
122
+ /* ============================================
123
+ LAYOUT - Steps Row
124
+ ============================================ */
125
+
126
+ #steps-row {
127
+ display: flex !important;
128
+ align-items: stretch !important;
129
+ gap: 24px !important;
130
+ }
131
+
132
+ .step-column {
133
+ display: flex !important;
134
+ flex-direction: column !important;
135
+ flex: 1 !important;
136
+ }
137
+
138
+ .step-box {
139
+ padding: 1.5rem !important;
140
+ flex-grow: 1 !important;
141
+ display: flex !important;
142
+ flex-direction: column !important;
143
+ }
144
+
+ /* ============================================
+    BUTTONS
+    ============================================ */
+
+ /* Primary action buttons */
+ #run-btn-quick, #run-btn-custom {
+     background: var(--button-primary-background-fill, #2563eb) !important;
+     color: var(--button-primary-text-color, #ffffff) !important;
+     font-weight: 600 !important;
+     font-size: 18px !important;
+     padding: 12px 24px !important;
+     border: none !important;
+     border-radius: 8px !important;
+     box-shadow: 0 2px 4px rgba(37, 99, 235, 0.2) !important;
+     transition: all 0.2s ease !important;
+     margin-top: 16px !important;
+ }
+
+ #run-btn-quick:hover, #run-btn-custom:hover {
+     background: var(--button-primary-background-fill-hover, #1d4ed8) !important;
+     box-shadow: 0 4px 12px rgba(37, 99, 235, 0.3) !important;
+     transform: translateY(-1px) !important;
+ }
+
+ /* Secondary buttons */
+ button.secondary {
+     background-color: #ffffff !important;
+     color: #374151 !important;
+     border: 1px solid #d1d5db !important;
+     border-radius: 8px !important;
+     padding: 10px 20px !important;
+     font-weight: 500 !important;
+ }
+
+ button.secondary:hover {
+     background-color: #f9fafb !important;
+     border-color: #adb5bd !important;
+ }
+
+ /* ============================================
+    TABS
+    ============================================ */
+
+ .tabs {
+     margin-top: 8px !important;
+ }
+
+ .tab-nav {
+     border-bottom: 1px solid #e5e7eb !important;
+     margin-bottom: 16px !important;
+ }
+
+ .tab-nav button {
+     color: #6b7280 !important;
+     background-color: transparent !important;
+     padding: 12px 20px !important;
+     border: none !important;
+     border-bottom: 2px solid transparent !important;
+     font-weight: 500 !important;
+     transition: all 0.2s ease !important;
+ }
+
+ .tab-nav button:hover {
+     color: #374151 !important;
+ }
+
+ .tab-nav button.selected {
+     border-bottom: 2px solid var(--primary-500, #3b82f6) !important;
+     color: var(--primary-600, #2563eb) !important;
+     background-color: transparent !important;
+ }
+
+ /* ============================================
+    ACCORDIONS
+    ============================================ */
+
+ .accordion {
+     border: 1px solid #e5e7eb !important;
+     border-radius: 8px !important;
+     margin-bottom: 12px !important;
+     overflow: hidden !important;
+ }
+
+ /* Metric info accordions with colored borders */
+ .metric-info-accordion {
+     border-left: 4px solid #3B82F6 !important;
+     margin-bottom: 1rem !important;
+     background-color: #F8FAFC !important;
+     border-radius: 6px !important;
+ }
+
+ .jaccard-info { border-left-color: #3B82F6 !important; }
+ .lcs-info { border-left-color: #10B981 !important; }
+ .fuzzy-info { border-left-color: #F59E0B !important; }
+ .semantic-info { border-left-color: #8B5CF6 !important; }
+ .wordcount-info { border-left-color: #EC4899 !important; }
+
+ /* ============================================
+    FORM ELEMENTS
+    ============================================ */
+
+ /* Radio buttons */
+ .radio-group label {
+     display: flex !important;
+     align-items: center !important;
+     padding: 10px 16px !important;
+     border: 1px solid #e5e7eb !important;
+     border-radius: 8px !important;
+     margin-bottom: 8px !important;
+     cursor: pointer !important;
+     transition: all 0.2s ease !important;
+ }
+
+ .radio-group label:hover {
+     background-color: #f9fafb !important;
+     border-color: #d1d5db !important;
+ }
+
+ .radio-group input:checked + label,
+ .radio-group label.selected {
+     background-color: var(--primary-50, #eff6ff) !important;
+     border-color: var(--primary-500, #3b82f6) !important;
+ }
+
+ /* Dropdowns */
+ select, .dropdown {
+     border: 1px solid #d1d5db !important;
+     border-radius: 8px !important;
+     padding: 10px 12px !important;
+     background-color: #ffffff !important;
+ }
+
+ /* Checkboxes */
+ input[type="checkbox"] {
+     width: 18px !important;
+     height: 18px !important;
+     accent-color: var(--primary-500, #3b82f6) !important;
+ }
+
+ /* ============================================
+    PRESET TABLE
+    ============================================ */
+
+ .preset-table table {
+     font-size: 14px !important;
+     margin-top: 12px !important;
+     width: 100% !important;
+     border-collapse: collapse !important;
+ }
+
+ .preset-table th, .preset-table td {
+     padding: 10px 14px !important;
+     text-align: center !important;
+     border-bottom: 1px solid #e5e7eb !important;
+ }
+
+ .preset-table th {
+     background-color: #f9fafb !important;
+     font-weight: 600 !important;
+     color: #374151 !important;
+ }
+
+ .preset-table tr:hover {
+     background-color: #f9fafb !important;
+ }
+
+ /* ============================================
+    RESULTS SECTION
+    ============================================ */
+
+ /* Heatmaps and plots */
+ .plot-container {
+     width: 100% !important;
+     overflow-x: auto !important;
+ }
+
+ .metric-heatmap {
+     max-width: 100% !important;
+ }
+
+ /* ============================================
+    LLM ANALYSIS OUTPUT
+    ============================================ */
+
+ #llm-analysis {
+     background-color: #f8f9fa !important;
+     border-left: 4px solid #3B82F6 !important;
+     border-radius: 8px !important;
+     padding: 20px 24px !important;
+     margin: 16px 0 !important;
+     box-shadow: 0 2px 8px rgba(0, 0, 0, 0.05) !important;
+ }
+
+ #llm-analysis h2 {
+     color: #1e40af !important;
+     font-size: 22px !important;
+     margin-bottom: 16px !important;
+     border-bottom: 1px solid #e5e7eb !important;
+     padding-bottom: 8px !important;
+ }
+
+ #llm-analysis h3, #llm-analysis h4 {
+     color: #1e3a8a !important;
+     margin-top: 18px !important;
+     margin-bottom: 10px !important;
+ }
+
+ #llm-analysis p {
+     line-height: 1.7 !important;
+     margin-bottom: 12px !important;
+     color: #374151 !important;
+ }
+
+ #llm-analysis ul, #llm-analysis ol {
+     margin-left: 24px !important;
+     margin-bottom: 16px !important;
+ }
+
+ #llm-analysis li {
+     margin-bottom: 6px !important;
+ }
+
+ #llm-analysis strong, #llm-analysis b {
+     color: #1f2937 !important;
+     font-weight: 600 !important;
+ }
+
+ /* ============================================
+    RESPONSIVE ADJUSTMENTS
+    ============================================ */
+
+ @media (max-width: 768px) {
+     #steps-row {
+         flex-direction: column !important;
+     }
+
+     .step-column {
+         width: 100% !important;
+     }
+
+     #run-btn-quick, #run-btn-custom {
+         font-size: 16px !important;
+         padding: 10px 20px !important;
+     }
+ }
+
+ /* ============================================
+    UTILITY CLASSES
+    ============================================ */
+
+ .custom-header {
+     margin-bottom: 12px !important;
+     color: #374151 !important;
+ }
+
+ .info-text {
+     font-size: 14px !important;
+     color: #6b7280 !important;
+     margin-top: 4px !important;
+ }
+ """
 
 
  # Instantiate the theme for easy import