Spaces: daniel-wojahn/ttm-webapp-hf

Daniel Wojahn committed 8 days ago

Commit 75e8f38, 1 parent: 54934d5

feat(ui): Add preset-based analysis UI and Gradio 6 compatibility

Files changed (17)

README.md +104 -48
app.py +387 -197
pipeline/hf_embedding.py +10 -8
pipeline/llm_service.py +95 -95
pipeline/metrics.py +239 -89
pipeline/normalize_bo.py +106 -0
pipeline/process.py +81 -63
pipeline/progressive_loader.py +14 -14
pipeline/progressive_ui.py +36 -36
pipeline/stopwords_bo.py +23 -12
pipeline/stopwords_lite_bo.py +14 -4
pipeline/tokenize.py +15 -15
pipeline/visualize.py +7 -7
pyproject.toml +29 -4
requirements.txt +2 -1
setup.py +21 -27
theme.py +369 -234

README.md CHANGED

@@ -15,27 +15,53 @@ app_file: app.py
 [![License: CC BY 4.0](https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)
 [![Project Status: Active – Web app version for accessible text analysis.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)
-A user-friendly web application for analyzing textual similarities and variations in Tibetan manuscripts. This tool provides a graphical interface to the core text comparison functionalities of the [Tibetan Text Metrics (TTM)](https://github.com/daniel-wojahn/tibetan-text-metrics) project, making it accessible to researchers without Python or command-line experience. Built with Python, Cython, and Gradio.
 ## Background
-The Tibetan Text Metrics project aims to provide quantitative methods for assessing textual similarities at the chapter or segment level, helping researchers understand patterns of textual evolution. This web application extends these capabilities by offering an intuitive interface, removing the need for manual script execution and environment setup for end-users.
 ## Key Features of the Web App
 -   **Easy File Upload**: Upload one or more Tibetan `.txt` files directly through the browser.
 -   **Automatic Segmentation**: Uses Tibetan section markers (e.g., `༈`) to automatically split texts into comparable chapters or sections.
 -   **Core Metrics Computed**:
-    -   **Jaccard Similarity (%)**: Measures vocabulary overlap between segments. *Common Tibetan stopwords can be filtered out to focus on meaningful lexical similarity.*
-    -   **Normalized Longest Common Subsequence (LCS)**: Identifies the longest shared sequence of words, indicating direct textual parallels.
-    -   **Fuzzy Similarity**: Uses fuzzy string matching to detect approximate matches between words, accommodating spelling variations and minor differences in Tibetan text.
-    -   **Semantic Similarity**: Uses sentence-transformer embeddings (e.g., LaBSE) to compare the contextual meaning of segments. *Note: This metric works best when combined with other metrics for a more comprehensive analysis.*
 -   **Handles Long Texts**: Implements automated handling for long segments when computing semantic embeddings.
--   **Model Selection**: Semantic similarity analysis uses Hugging Face sentence-transformer models (e.g., LaBSE).
 -   **Stopword Filtering**: Three levels of filtering for Tibetan words:
     -   **None**: No filtering, includes all words
     -   **Standard**: Filters only common particles and punctuation
     -   **Aggressive**: Filters all function words including particles, pronouns, and auxiliaries
 -   **Interactive Visualizations**:
     -   Heatmaps for Jaccard, LCS, Fuzzy, and Semantic similarity metrics, providing a quick overview of inter-segment relationships.
     -   Bar chart displaying word counts per segment.
@@ -91,7 +117,10 @@ To obtain meaningful results, it is highly recommended to divide your Tibetan te
 ## Implemented Metrics
 **Stopword Filtering:**
-To enhance the accuracy and relevance of similarity scores, both the Jaccard Similarity and TF-IDF Cosine Similarity calculations incorporate a stopword filtering step. This process removes high-frequency, low-information Tibetan words (e.g., common particles, pronouns, and grammatical markers) before the metrics are computed. This ensures that the resulting scores are more reflective of meaningful lexical and thematic similarities between texts, rather than being skewed by the presence of ubiquitous common words.
 The comprehensive list of Tibetan stopwords used is adapted and compiled from the following valuable resources:
 - The **Divergent Discourses** (specifically, their Tibetan stopwords list available at [Zenodo Record 10148636](https://zenodo.org/records/10148636)).
@@ -99,7 +128,7 @@ The comprehensive list of Tibetan stopwords used is adapted and compiled from th
 We extend our gratitude to the creators and maintainers of these projects for making their work available to the community.
-Feel free to edit this list of stopwords to better suit your needs. The list is stored in the `pipeline/stopwords.py` file.
 ### The application computes and visualizes the following similarity metrics between corresponding chapters/segments of the uploaded texts:
@@ -116,29 +145,33 @@ A higher percentage indicates a greater overlap in the significant vocabularies
 This helps focus on meaningful content words rather than grammatical elements.
-2.  **Normalized LCS (Longest Common Subsequence)**: This metric measures the length of the longest sequence of words that appears in *both* text segments, maintaining their original relative order. Importantly, these words do not need to be directly adjacent (contiguous) in either text. For example, if Text A is '<u>the</u> quick <u>brown</u> fox <u>jumps</u>' and Text B is '<u>the</u> lazy cat and <u>brown</u> dog <u>jumps</u> high', the LCS is 'the brown jumps'. The length of this common subsequence is then normalized (in this tool, by dividing by the length of the longer of the two segments) to provide a score, which is then presented as a percentage. A higher Normalized LCS score suggests more significant shared phrasing, direct textual borrowing, or strong structural parallelism, as it reflects similarities in how ideas are ordered and expressed sequentially.
-    *   *Note on Interpretation*: It's possible for Normalized LCS to be higher than Jaccard Similarity. This often happens when texts share a substantial 'narrative backbone' or common ordered phrases (leading to a high LCS), even if they use varied surrounding vocabulary or introduce many unique words not part of these core sequences (which would lower the Jaccard score). LCS highlights this sequential, structural similarity, while Jaccard focuses on the overall shared vocabulary regardless of its arrangement.
-3.  **Fuzzy Similarity**: This metric uses fuzzy string matching algorithms to detect approximate matches between words, making it particularly valuable for Tibetan texts where spelling variations, dialectal differences, or scribal errors might be present. Unlike exact matching methods (such as Jaccard), fuzzy similarity can recognize when words are similar but not identical. The implementation offers multiple matching methods:
-    - **Token Set Ratio** (default): Compares the sets of words regardless of order, finding the best alignment between them
-    - **Token Sort Ratio**: Sorts the words alphabetically before comparing, useful for texts with similar vocabulary in different orders
-    - **Partial Ratio**: Finds the best matching substring, helpful for detecting when one text contains parts of another
-    - **Simple Ratio**: Performs character-by-character comparison, best for detecting minor spelling variations
-    Scores range from 0 to 1, where 1 indicates perfect or near-perfect matches. This metric is particularly useful for identifying textual relationships that might be missed by exact matching methods, especially in manuscripts with orthographic variations.
-**Stopword Filtering**: The same three levels of filtering used for Jaccard Similarity are applied to fuzzy matching:
-- **None**: No filtering, includes all words in the comparison
-- **Standard**: Filters only common particles and punctuation
-- **Aggressive**: Filters all function words including particles, pronouns, and auxiliaries
-4.  **Semantic Similarity**: Computes the cosine similarity between sentence-transformer embeddings (e.g., LaBSE) of text segments. Segments are embedded into high-dimensional vectors and compared via cosine similarity. Scores closer to 1 indicate a higher degree of semantic overlap.
-**Stopword Filtering**: Three levels of filtering are available:
 - **None**: No filtering, includes all words in the comparison
 - **Standard**: Filters only common particles and punctuation
 - **Aggressive**: Filters all function words including particles, pronouns, and auxiliaries
-This helps focus on meaningful content words rather than grammatical elements.
 ## Getting Started (if run Locally)
@@ -173,40 +206,63 @@ This helps focus on meaningful content words rather than grammatical elements.
 ## Usage
-1.  **Upload Files**: Use the file upload interface to select one or more `.txt` files containing Tibetan Unicode text.
-2.  **Configure Options**:
-    - Choose whether to compute semantic similarity
-    - Choose whether to compute fuzzy string similarity
-    - Select a fuzzy matching method (Token Set, Token Sort, Partial, or Simple Ratio)
-    - Select an embedding model for semantic analysis
-    - Choose a stopword filtering level (None, Standard, or Aggressive)
-3.  **Run Analysis**: Click the "Run Analysis" button.
-3.  **View Results**:
-    -   A preview of the similarity metrics will be displayed.
-    -   Download the full results as a CSV file.
-    -   Interactive heatmaps for Jaccard Similarity, Normalized LCS, Fuzzy Similarity, and Semantic Similarity will be generated. All heatmaps use a consistent color scheme where darker colors represent higher similarity.
-    -   A bar chart showing word counts per segment will also be available.
-    -   Any warnings (e.g., regarding missing chapter markers) will be displayed.
-4.  **Get Interpretation** (Optional):
-    -   After running the analysis, click the "Help Interpret Results" button.
-    -   No API key or internet connection required! The system uses a built-in rule-based analysis engine.
-    -   The system will analyze your metrics and provide insights about patterns, relationships, and notable findings in your data.
-    -   This feature helps researchers understand the significance of the metrics and identify interesting textual relationships between chapters.
 ## Embedding Model
-Semantic similarity uses Hugging Face sentence-transformer models (default: `sentence-transformers/LaBSE`). These models provide context-aware, segment-level embeddings suitable for comparing Tibetan text passages.
 ## Structure
 -   `app.py` — Gradio web app entry point and UI definition.
 -   `pipeline/` — Modules for file handling, text processing, metrics calculation, and visualization.
     -   `process.py`: Core logic for segmenting texts and orchestrating metric computation.
-    -   `metrics.py`: Implementation of Jaccard, LCS, and Semantic Similarity.
     -   `hf_embedding.py`: Handles loading and using sentence-transformer models.
     -   `tokenize.py`: Tibetan text tokenization using `botok`.
-    -   `upload.py`: File upload handling (currently minimal).
     -   `visualize.py`: Generates heatmaps and word count plots.
 -   `requirements.txt` — Python dependencies for the web application.
@@ -228,7 +284,7 @@ If you use this web application or the underlying TTM tool in your research, ple
   author = {Daniel Wojahn},
   year = {2025},
   url = {https://github.com/daniel-wojahn/tibetan-text-metrics},
-  version = {0.3.0}
 }
 ```

 [![License: CC BY 4.0](https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)
 [![Project Status: Active – Web app version for accessible text analysis.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)
+Compare Tibetan texts to discover how similar they are. This tool helps scholars identify shared passages, textual variations, and relationships between different versions of Tibetan manuscripts — no programming required.
+## Quick Start (3 Steps)
+1. **Upload** two or more Tibetan text files (.txt format)
+2. **Click** "Compare My Texts"
+3. **View** the results — higher scores mean more similarity
+That's it! The default settings work well for most cases. See the results section for colorful heatmaps showing which chapters are most similar.
+> **Tip:** If your texts have chapters, separate them with the ༈ marker so the tool can compare chapter-by-chapter.
+## What's New (v0.4.0)
+- **New preset-based UI**: Choose "Quick Start" for simple analysis or "Custom" for full control
+- **Three analysis presets**: Standard, Deep (with AI), and Quick (fastest)
+- **Word-level tokenization** is now the default (recommended for Jaccard similarity)
+- **Particle normalization**: Treat grammatical particle variants as equivalent (གི/ཀྱི/གྱི → གི)
+- **LCS normalization options**: Choose how to handle texts of different lengths
+- **Improved stopword matching**: Fixed tsek (་) handling for consistent filtering
+- **Tibetan-optimized fuzzy matching**: Syllable-level methods only (removed character-level methods)
+- **Dharmamitra models**: Buddhist-specific semantic similarity models as default
+- **Modernized theme**: Cleaner UI with better responsive design
 ## Background
+The Tibetan Text Metrics project provides quantitative methods for assessing textual similarities at the chapter or segment level, helping researchers understand patterns of textual evolution. This web application makes these capabilities accessible through an intuitive interface — no command-line or Python experience needed.
 ## Key Features of the Web App
 -   **Easy File Upload**: Upload one or more Tibetan `.txt` files directly through the browser.
 -   **Automatic Segmentation**: Uses Tibetan section markers (e.g., `༈`) to automatically split texts into comparable chapters or sections.
 -   **Core Metrics Computed**:
+    -   **Jaccard Similarity (%)**: Measures vocabulary overlap between segments. Word-level tokenization recommended. *Common Tibetan stopwords can be filtered out to focus on meaningful lexical similarity.*
+    -   **Normalized Longest Common Subsequence (LCS)**: Identifies the longest shared sequence of words, indicating direct textual parallels. Supports multiple normalization modes (average, min, max).
+    -   **Fuzzy Similarity**: Uses syllable-level fuzzy matching to detect approximate matches, accommodating spelling variations and scribal differences in Tibetan text.
+    -   **Semantic Similarity**: Uses Buddhist-specific sentence-transformer embeddings (Dharmamitra) to compare the contextual meaning of segments.
 -   **Handles Long Texts**: Implements automated handling for long segments when computing semantic embeddings.
+-   **Model Selection**: Semantic similarity uses Hugging Face sentence-transformer models. Default is Dharmamitra's `buddhist-nlp/buddhist-sentence-similarity`, trained specifically for Buddhist texts.
+-   **Tokenization Modes**:
+    -   **Word** (default, recommended): Keeps multi-syllable words together for more meaningful comparison
+    -   **Syllable**: Splits into individual syllables for finer-grained analysis
 -   **Stopword Filtering**: Three levels of filtering for Tibetan words:
     -   **None**: No filtering, includes all words
     -   **Standard**: Filters only common particles and punctuation
     -   **Aggressive**: Filters all function words including particles, pronouns, and auxiliaries
+-   **Particle Normalization**: Optional normalization of grammatical particles to canonical forms (e.g., གི/ཀྱི/གྱི → གི, ལ/ར/སུ/ཏུ/དུ → ལ). Reduces false negatives from sandhi variation.
 -   **Interactive Visualizations**:
     -   Heatmaps for Jaccard, LCS, Fuzzy, and Semantic similarity metrics, providing a quick overview of inter-segment relationships.
     -   Bar chart displaying word counts per segment.
 ## Implemented Metrics
 **Stopword Filtering:**
+To enhance the accuracy and relevance of similarity scores, the Jaccard Similarity and Fuzzy Similarity calculations incorporate a stopword filtering step. This process removes high-frequency, low-information Tibetan words (e.g., common particles, pronouns, and grammatical markers) before the metrics are computed. Stopwords are normalized to handle tsek (་) variations consistently.
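As a rough illustration of the filtering step described above, the sketch below strips a trailing tsek before looking a token up in the stopword set, then computes Jaccard similarity on what remains, as common unique words over total unique words times 100. The names `strip_tsek`, `filter_stopwords`, and `jaccard_percent` and the three-entry `STOPWORDS` set are hypothetical; the real list lives in `pipeline/stopwords_bo.py`, and its matching logic may differ.

```python
TSEK = "\u0f0b"  # Tibetan tsek, the syllable delimiter

# Hypothetical mini stopword list, for illustration only.
STOPWORDS = {"གི", "ནི", "དང"}


def strip_tsek(token):
    """Remove trailing tsek so that a token matches with or without it."""
    return token.rstrip(TSEK)


def filter_stopwords(tokens):
    """Drop tokens whose tsek-stripped form is in the stopword set."""
    return [t for t in tokens if strip_tsek(t) not in STOPWORDS]


def jaccard_percent(tokens_a, tokens_b):
    """Shared unique tokens / all unique tokens, as a percentage."""
    set_a = {strip_tsek(t) for t in filter_stopwords(tokens_a)}
    set_b = {strip_tsek(t) for t in filter_stopwords(tokens_b)}
    union = set_a | set_b
    if not union:
        return 0.0  # both segments empty after filtering
    return 100.0 * len(set_a & set_b) / len(union)
```

Normalizing both the stopword lookup and the compared tokens in the same way keeps a word from being counted as two different items just because one occurrence carries a tsek.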
+**Particle Normalization:**
+Tibetan grammatical particles change form based on the preceding syllable (sandhi). For example, the genitive particle appears as གི, ཀྱི, གྱི, ཡི, or འི depending on context. When particle normalization is enabled, all variants are treated as equivalent, reducing false negatives when comparing texts with different scribal conventions.
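A minimal sketch of what such a normalization could look like, assuming tokens arrive as a list of strings and particles appear as standalone tokens. The variant groups are taken from the README text; the names `PARTICLE_MAP` and `normalize_particles` are hypothetical and are not claimed to match `pipeline/normalize_bo.py`.

```python
TSEK = "\u0f0b"  # Tibetan tsek, the syllable delimiter

# Variant groups as listed in the README; the first entry is the canonical form.
GENITIVE = ["གི", "ཀྱི", "གྱི", "ཡི", "འི"]
LOCATIVE = ["ལ", "ར", "སུ", "ཏུ", "དུ"]

PARTICLE_MAP = {variant: GENITIVE[0] for variant in GENITIVE}
PARTICLE_MAP.update({variant: LOCATIVE[0] for variant in LOCATIVE})


def normalize_particles(tokens):
    """Strip trailing tsek from every token and map particle variants
    to their canonical form; other tokens pass through unchanged."""
    normalized = []
    for token in tokens:
        bare = token.rstrip(TSEK)
        normalized.append(PARTICLE_MAP.get(bare, bare))
    return normalized
```

This sketch only handles particles that the tokenizer emits as separate tokens. Forms such as འི that attach to the preceding syllable would need to be split off first.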
 The comprehensive list of Tibetan stopwords used is adapted and compiled from the following valuable resources:
 - The **Divergent Discourses** (specifically, their Tibetan stopwords list available at [Zenodo Record 10148636](https://zenodo.org/records/10148636)).
 We extend our gratitude to the creators and maintainers of these projects for making their work available to the community.
+Feel free to edit this list of stopwords to better suit your needs. The list is stored in the `pipeline/stopwords_bo.py` file.
 ### The application computes and visualizes the following similarity metrics between corresponding chapters/segments of the uploaded texts:
 This helps focus on meaningful content words rather than grammatical elements.
+2.  **Normalized LCS (Longest Common Subsequence)**: This metric measures the length of the longest sequence of words that appears in *both* text segments, maintaining their original relative order. Importantly, these words do not need to be directly adjacent (contiguous) in either text.
+    **Normalization options:**
+    - **Average** (default): Divides LCS length by the average of both text lengths. Balanced comparison.
+    - **Min**: Divides by the shorter text length. Useful for detecting if one text contains the other (e.g., quotes within commentary). Can return 1.0 if shorter text is fully contained.
+    - **Max**: Divides by the longer text length. Stricter metric that penalizes length differences.
+    A higher Normalized LCS score suggests more significant shared phrasing, direct textual borrowing, or strong structural parallelism.
+    *Note on Interpretation*: It's possible for Normalized LCS to be higher than Jaccard Similarity. This often happens when texts share a substantial 'narrative backbone' or common ordered phrases (leading to a high LCS), even if they use varied surrounding vocabulary.
+3.  **Fuzzy Similarity**: This metric uses syllable-level fuzzy matching algorithms to detect approximate matches, making it particularly valuable for Tibetan texts where spelling variations, dialectal differences, or scribal errors might be present. Unlike exact matching methods (such as Jaccard), fuzzy similarity can recognize when words are similar but not identical.
+    **Available methods (all work at syllable level):**
+    - **Syllable N-gram Overlap** (default, recommended): Compares syllable bigrams between texts. Best for detecting shared phrases and local patterns.
+    - **Syllable-level Edit Distance**: Computes Levenshtein distance at the syllable/token level. Detects minor variations while respecting syllable boundaries.
+    - **Weighted Jaccard**: Like standard Jaccard but considers token frequency, giving more weight to frequently shared terms.
+    Scores range from 0 to 1, where 1 indicates perfect or near-perfect matches. All methods work at the syllable level, which is linguistically appropriate for Tibetan.
+**Stopword Filtering**: The same three levels of filtering used for Jaccard Similarity are applied to fuzzy matching:
 - **None**: No filtering, includes all words in the comparison
 - **Standard**: Filters only common particles and punctuation
 - **Aggressive**: Filters all function words including particles, pronouns, and auxiliaries
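To make the syllable n-gram method more concrete, here is a small sketch of bigram overlap. The README only says that syllable bigrams are compared; scoring the overlap as a Jaccard ratio over the two bigram sets is an assumption made here, and `syllable_bigrams` and `bigram_overlap` are hypothetical names.

```python
def syllable_bigrams(syllables):
    """Return the set of adjacent syllable pairs in a segment."""
    return {tuple(syllables[i:i + 2]) for i in range(len(syllables) - 1)}


def bigram_overlap(syllables_a, syllables_b):
    """Shared bigrams divided by all distinct bigrams (assumed scoring)."""
    grams_a = syllable_bigrams(syllables_a)
    grams_b = syllable_bigrams(syllables_b)
    union = grams_a | grams_b
    if not union:
        return 0.0  # fewer than two syllables on both sides
    return len(grams_a & grams_b) / len(union)
```

Because a bigram only matches when two neighbouring syllables agree, the score rewards shared phrases rather than shared vocabulary alone.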
+4.  **Semantic Similarity**: Computes the cosine similarity between sentence-transformer embeddings of text segments. Uses Dharmamitra's Buddhist-specific models by default. Segments are embedded into high-dimensional vectors and compared via cosine similarity. Scores closer to 1 indicate a higher degree of semantic overlap.
+    *Note*: Semantic similarity operates on the raw text and is not affected by stopword filtering settings.
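The three LCS normalization modes can be sketched as follows. This is an illustrative pure-Python version using the standard dynamic-programming recurrence; the function names and the `mode` strings (`"avg"`, `"min"`, `"max"`) are assumptions, and the project's own implementation in `pipeline/metrics.py` may be structured differently.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def normalized_lcs(a, b, mode="avg"):
    """Divide the LCS length by the average, shorter, or longer length."""
    if not a or not b:
        return 0.0
    denominators = {
        "avg": (len(a) + len(b)) / 2,
        "min": min(len(a), len(b)),
        "max": max(len(a), len(b)),
    }
    if mode not in denominators:
        raise ValueError(f"unknown normalization mode: {mode!r}")
    return lcs_length(a, b) / denominators[mode]
```

For the example from the earlier README text, "the quick brown fox jumps" against "the lazy cat and brown dog jumps high", the LCS is "the brown jumps" (length 3), so the score is 3/5 under min, 3/8 under max, and 3/6.5 under average.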
 ## Getting Started (if run Locally)
 ## Usage
+### Quick Start (Recommended for Most Users)
+1.  **Upload Files**: Select one or more `.txt` files containing Tibetan Unicode text.
+2.  **Choose a Preset**: In the "Quick Start" tab, select an analysis type:
+| Preset | What it does | Best for |
+|--------|--------------|----------|
+| 📊 **Standard** | Vocabulary + Sequences + Fuzzy matching | Most comparisons |
+| 🧠 **Deep** | All metrics including AI meaning analysis | Finding semantic parallels |
+| ⚡ **Quick** | Vocabulary overlap only | Fast initial scan |
+3.  **Click "Compare My Texts"**: Results appear below with heatmaps and downloadable CSV.
+### Custom Analysis (Advanced Users)
+For fine-grained control, use the "Custom" tab:
+-   **Lexical Metrics**: Configure tokenization (word/syllable), stopword filtering, and particle normalization
+-   **Sequence Matching (LCS)**: Enable/disable and choose normalization mode (avg/min/max)
+-   **Fuzzy Matching**: Choose method (N-gram, Syllable Edit, or Weighted Jaccard)
+-   **Semantic Analysis**: Enable AI-based meaning comparison with model selection
+### Viewing Results
+-   **Metrics Preview**: Summary table of similarity scores
+-   **Heatmaps**: Visual comparison across all chapter pairs (darker = more similar)
+-   **Word Counts**: Bar chart showing segment lengths
+-   **CSV Download**: Full results for further analysis
+### AI Interpretation (Optional)
+After running analysis, click "Help Interpret Results" for scholarly insights:
+-   Pattern identification across chapters
+-   Notable textual relationships
+-   Suggestions for further investigation
 ## Embedding Model
+Semantic similarity uses Hugging Face sentence-transformer models. The following models are available:
+- **`buddhist-nlp/buddhist-sentence-similarity`** (default, recommended): Developed by [Dharmamitra](https://huggingface.co/buddhist-nlp), this model is specifically trained for sentence similarity on Buddhist texts in Tibetan, Buddhist Chinese, Sanskrit (IAST), and Pāli. Best choice for Tibetan Buddhist manuscripts.
+- **`buddhist-nlp/bod-eng-similarity`**: Also from Dharmamitra, optimized for Tibetan-English bitext alignment tasks.
+- **`sentence-transformers/LaBSE`**: General multilingual model, good baseline for non-Buddhist texts.
+- **`BAAI/bge-m3`**: Strong multilingual alternative with broad language coverage.
+These models provide context-aware, segment-level embeddings suitable for comparing Tibetan text passages.
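Whichever model produces the embeddings, the comparison step is a cosine similarity between two vectors. The sketch below shows that step in plain Python on lists of floats; it deliberately leaves out model loading, and `cosine_similarity` is a hypothetical helper rather than the function used in `pipeline/hf_embedding.py`.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_u * norm_v)
```

Cosine similarity depends only on the direction of the vectors, not their length, so a long and a short segment can still score close to 1 if their embeddings point the same way.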
 ## Structure
 -   `app.py` — Gradio web app entry point and UI definition.
 -   `pipeline/` — Modules for file handling, text processing, metrics calculation, and visualization.
     -   `process.py`: Core logic for segmenting texts and orchestrating metric computation.
+    -   `metrics.py`: Implementation of Jaccard, LCS, Fuzzy, and Semantic Similarity.
     -   `hf_embedding.py`: Handles loading and using sentence-transformer models.
     -   `tokenize.py`: Tibetan text tokenization using `botok`.
+    -   `normalize_bo.py`: Tibetan particle normalization for grammatical variants.
+    -   `stopwords_bo.py`: Comprehensive Tibetan stopword list with tsek normalization.
     -   `visualize.py`: Generates heatmaps and word count plots.
 -   `requirements.txt` — Python dependencies for the web application.
   author = {Daniel Wojahn},
   year = {2025},
   url = {https://github.com/daniel-wojahn/tibetan-text-metrics},
+  version = {0.4.0}
 }
 ```

app.py CHANGED

@@ -17,14 +17,16 @@ load_dotenv()
 logger = logging.getLogger(__name__)
 def main_interface():
     with gr.Blocks(
         theme=tibetan_theme,
-        title="Tibetan Text Metrics Web App",
-        css=tibetan_theme.get_css_string() + ".metric-description, .step-box { padding: 1.5rem !important; }"
     ) as demo:
         gr.Markdown(
-            """# Tibetan Text Metrics Web App
-<span style='font-size:18px;'>A user-friendly web application for analyzing textual similarities and variations in Tibetan manuscripts, providing a graphical interface to the core functionalities of the [Tibetan Text Metrics (TTM)](https://github.com/daniel-wojahn/tibetan-text-metrics) project. Powered by advanced language models via OpenRouter for in-depth text analysis.</span>
         """,
             elem_classes="gr-markdown",
@@ -35,93 +37,174 @@ def main_interface():
                 with gr.Group(elem_classes="step-box"):
                     gr.Markdown(
                         """
-                    ## Step 1: Upload Your Tibetan Text Files
-                    <span style='font-size:16px;'>Upload two or more `.txt` files. Each file should contain Unicode Tibetan text, segmented into chapters/sections if possible using the marker '༈' (<i>sbrul shad</i>).</span>
                     """,
                         elem_classes="gr-markdown",
                     )
                     file_input = gr.File(
-                        label="Upload Tibetan .txt files",
                         file_types=[".txt"],
                         file_count="multiple",
                     )
                     gr.Markdown(
-                        "<small>Note: Maximum file size: 10MB per file. For optimal performance, use files under 1MB.</small>",
                         elem_classes="gr-markdown"
                     )
             with gr.Column(scale=1, elem_classes="step-column"):
                 with gr.Group(elem_classes="step-box"):
                     gr.Markdown(
-                        """## Step 2: Configure and run the analysis
-<span style='font-size:16px;'>Choose your analysis options and click the button below to compute metrics and view results. For meaningful analysis, ensure your texts are segmented by chapter or section using the marker '༈' (<i>sbrul shad</i>). The tool will split files based on this marker.</span>
                     """,
                         elem_classes="gr-markdown",
                     )
-                    semantic_toggle_radio = gr.Radio(
-                        label="Compute semantic similarity? (Experimental)",
-                        choices=["Yes", "No"],
-                        value="No",
-                        info="Semantic similarity will be time-consuming. Choose 'No' to speed up analysis if these metrics are not required.",
-                        elem_id="semantic-radio-group",
-                    )
-                    model_dropdown = gr.Dropdown(
-                        choices=[
-                            "sentence-transformers/LaBSE"
-                        ],
-                        label="Select Embedding Model",
-                        value="sentence-transformers/LaBSE",
-                        info="Select the embedding model to use for semantic similarity analysis. Only Hugging Face sentence-transformers are supported."
-                    )
-                    with gr.Accordion("Advanced Options", open=False):
-                        batch_size_slider = gr.Slider(
-                            minimum=1,
-                            maximum=64,
-                            value=8,
-                            step=1,
-                            label="Batch Size (for Hugging Face models)",
-                            info="Adjust based on your hardware (VRAM). Lower this if you encounter memory issues."
-                        )
-                        progress_bar_checkbox = gr.Checkbox(
-                            label="Show Embedding Progress Bar",
-                            value=False,
-                            info="Display a progress bar during embedding generation. Useful for large datasets."
-                        )
-                    stopwords_dropdown = gr.Dropdown(
-                        label="Stopword Filtering",
-                        choices=[
-                            "None (No filtering)",
-                            "Standard (Common particles only)",
-                            "Aggressive (All function words)"
-                        ],
-                        value="Standard (Common particles only)",  # Default
-                        info="Choose how aggressively to filter out common Tibetan particles and function words when calculating similarity. This helps focus on meaningful content words."
-                    )
-                    fuzzy_toggle_radio = gr.Radio(
-                        label="Enable Fuzzy String Matching",
-                        choices=["Yes", "No"],
-                        value="Yes",
-                        info="Fuzzy matching helps detect similar but not identical text segments. Useful for identifying variations and modifications."
-                    )
-                    fuzzy_method_dropdown = gr.Dropdown(
-                        label="Fuzzy Matching Method",
-                        choices=[
-                            "token_set - Order-independent matching",
-                            "token_sort - Order-normalized matching",
-                            "partial - Best partial matching",
-                            "ratio - Simple ratio matching"
-                        ],
-                        value="token_set - Order-independent matching",
-                        info="Select the fuzzy matching algorithm to use:\n\n• token_set: Best for texts with different word orders and partial overlaps. Compares unique words regardless of their order (recommended for Tibetan texts).\n\n• token_sort: Good for texts with different word orders but similar content. Sorts words alphabetically before comparing.\n\n• partial: Best for finding shorter strings within longer ones. Useful when one text is a fragment of another.\n\n• ratio: Simple Levenshtein distance ratio. Best for detecting small edits and typos in otherwise identical texts."
-                    )
-                    process_btn = gr.Button(
-                        "Run Analysis", elem_id="run-btn", variant="primary"
-                    )
         gr.Markdown(
             """## Results
@@ -131,165 +214,208 @@ def main_interface():
         # The heatmap_titles and metric_tooltips dictionaries are defined here
         # heatmap_titles = { ... }
         # metric_tooltips = { ... }
-        csv_output = gr.File(label="Download CSV Results")
         metrics_preview = gr.Dataframe(
-            label="Similarity Metrics Preview", interactive=False, visible=True
         )
         # States for data persistence
         state_text_data = gr.State()
         state_df_results = gr.State()
         # LLM Interpretation components
         with gr.Row():
             with gr.Column():
                 gr.Markdown(
-                    "## AI Analysis\n*The AI will analyze your text similarities and provide insights into patterns and relationships.*",
                     elem_classes="gr-markdown"
                 )
                 # Add the interpret button
                 with gr.Row():
                     interpret_btn = gr.Button(
-                        "Help Interpret Results",
                         variant="primary",
                         elem_id="interpret-btn"
                     )
                 # Create a placeholder message with proper formatting and structure
                 initial_message = """
-## Analysis of Tibetan Text Similarity Metrics
-<small>*Click the 'Help Interpret Results' button above to generate an AI-powered analysis of your similarity metrics.*</small>
 """
                 interpretation_output = gr.Markdown(
                     value=initial_message,
                     elem_id="llm-analysis"
                 )
         # Heatmap tabs for each metric
         heatmap_titles = {
-            "Jaccard Similarity (%)": "Higher scores mean more shared unique words.",
-            "Normalized LCS": "Higher scores mean longer shared sequences of words.",
-            "Fuzzy Similarity": "Higher scores mean more similar text with fuzzy matching tolerance for variations.",
-            "Semantic Similarity": "Higher scores mean more similar meanings.",
-            "Word Counts": "Word Counts: Bar chart showing the number of words in each segment after tokenization.",
         }
         metric_tooltips = {
             "Jaccard Similarity (%)": """
-### Jaccard Similarity (%)
-This metric quantifies the lexical overlap between two text segments by comparing their sets of *unique* words, optionally filtering out common Tibetan stopwords.
-It essentially answers the question: 'Of all the distinct words found across these two segments, what proportion of them are present in both?' It is calculated as `(Number of common unique words) / (Total number of unique words in both texts combined) * 100`.
-Jaccard Similarity is insensitive to word order and word frequency; it only cares whether a unique word is present or absent. A higher percentage indicates a greater overlap in the vocabularies used in the two segments.
-**Stopword Filtering**: When enabled (via the "Filter Stopwords" checkbox), common Tibetan particles and function words are filtered out before comparison. This helps focus on meaningful content words rather than grammatical elements.
 """,
             "Fuzzy Similarity": """
-### Fuzzy Similarity
-This metric measures the approximate string similarity between text segments using fuzzy matching algorithms from TheFuzz library. Unlike exact matching metrics, fuzzy similarity can detect similarities even when texts contain variations, misspellings, or different word orders.
-Fuzzy similarity is particularly useful for Tibetan texts that may have orthographic variations, scribal differences, or regional spelling conventions. It provides a score between 0 and 1, where higher values indicate greater similarity.
-**Available Methods**:
-- **Token Set Ratio**: Compares the unique words in each text regardless of order (best for texts with different word arrangements)
-- **Token Sort Ratio**: Normalizes word order before comparison (good for texts with similar content but different ordering)
-- **Partial Ratio**: Finds the best matching substring (useful for texts where one is contained within the other)
-- **Simple Ratio**: Direct character-by-character comparison (best for detecting minor variations)
-**Stopword Filtering**: When enabled, common Tibetan particles and function words are filtered out before comparison, focusing on meaningful content words.
 """,
             "Normalized LCS": """
-### Normalized LCS (Longest Common Subsequence)
-This metric measures the length of the longest sequence of words that appears in *both* text segments, maintaining their original relative order.
-Importantly, these words do not need to be directly adjacent (contiguous) in either text.
-For example, if Text A is '<u>the</u> quick <u>brown</u> fox <u>jumps</u>' and Text B is '<u>the</u> lazy cat and <u>brown</u> dog <u>jumps</u> high', the LCS is 'the brown jumps'.
-The length of this common subsequence is then normalized (in this tool, by dividing by the length of the longer of the two segments) to provide a score, which is then presented as a percentage.
-A higher Normalized LCS score suggests more significant shared phrasing, direct textual borrowing, or strong structural parallelism, as it reflects similarities in how ideas are ordered and expressed sequentially.
-**No Stopword Filtering.** Unlike metrics such as Jaccard Similarity or TF-IDF Cosine Similarity (which typically filter out common stopwords to focus on content-bearing words), the LCS calculation in this tool intentionally uses the raw, unfiltered sequence of tokens from your texts. This design choice allows LCS to capture structural similarities and the flow of language, including the use of particles and common words that contribute to sentence construction and narrative sequence. By not removing stopwords, LCS can reveal similarities in phrasing and textual structure that might otherwise be obscured, making it a valuable complement to metrics that focus purely on lexical overlap of keywords.
-**Note on Interpretation**: It is possible for Normalized LCS to be higher than Jaccard Similarity. This often happens when texts share a substantial 'narrative backbone' or common ordered phrases (leading to a high LCS), even if they use varied surrounding vocabulary or introduce many unique words not part of these core sequences (which would lower the Jaccard score). LCS highlights this sequential, structural similarity, while Jaccard focuses on the overall shared vocabulary regardless of its arrangement.
 """,
             "Semantic Similarity": """
-### Semantic Similarity
-This metric measures similarity in meaning between text segments using sentence-transformer models from Hugging Face (e.g., LaBSE). Text segments are embedded into high-dimensional vectors and compared via cosine similarity. Scores closer to 1 indicate higher semantic overlap.
-Key points:
-- Context-aware embeddings capture nuanced meanings and relationships.
-- Designed for sentence/segment-level representations, not just words.
-- Works well alongside Jaccard and LCS for a holistic view.
-- Stopword filtering: When enabled, common Tibetan particles and function words are filtered before embedding to focus on content-bearing terms.
 """,
             "Word Counts": """
-### Word Counts per Segment
-This chart displays the number of words in each segment of your texts after tokenization.
-The word count is calculated after applying the selected tokenization and stopword filtering options. This visualization helps you understand the relative sizes of different text segments and can reveal patterns in text structure across your documents.
-**Key points**:
-- Longer bars indicate segments with more words
-- Segments are grouped by source document
-- Useful for identifying structural patterns and content distribution
-- Can help explain similarity metric variations (longer texts may show different patterns)
 """,
             "Structural Analysis": """
-### Structural Analysis
-This advanced analysis examines the structural relationships between text segments across your documents. It identifies patterns of similarity and difference that may indicate textual dependencies, common sources, or editorial modifications.
-The structural analysis combines multiple similarity metrics to create a comprehensive view of how text segments relate to each other, highlighting potential stemmatic relationships and textual transmission patterns.
-**Key points**:
-- Identifies potential source-target relationships between texts
-- Visualizes text reuse patterns across segments
-- Helps reconstruct possible stemmatic relationships
-- Provides insights into textual transmission and editorial history
-**Note**: This analysis is computationally intensive and only available after the initial metrics calculation is complete.
 """
         }
         heatmap_tabs = {}
-        gr.Markdown("## Detailed Metric Analysis", elem_classes="gr-markdown")
         with gr.Tabs(elem_id="heatmap-tab-group"):
             # Process all metrics
             metrics_to_display = heatmap_titles
             for metric_key, descriptive_title in metrics_to_display.items():
                 with gr.Tab(metric_key):
                     # Set CSS class based on metric type
                     if metric_key == "Jaccard Similarity (%)":
                         css_class = "metric-info-accordion jaccard-info"
-                        accordion_title = "Understanding Vocabulary Overlap"
                     elif metric_key == "Normalized LCS":
                         css_class = "metric-info-accordion lcs-info"
-                        accordion_title = "Understanding Sequence Patterns"
                     elif metric_key == "Fuzzy Similarity":
                         css_class = "metric-info-accordion fuzzy-info"
-                        accordion_title = "Understanding Fuzzy Matching"
                     elif metric_key == "Semantic Similarity":
                         css_class = "metric-info-accordion semantic-info"
-                        accordion_title = "Understanding Meaning Similarity"
                     elif metric_key == "Word Counts":
                         css_class = "metric-info-accordion wordcount-info"
-                        accordion_title = "Understanding Text Length"
                     else:
                         css_class = "metric-info-accordion"
-                        accordion_title = f"About {metric_key}"
                     # Create the accordion with appropriate content
                     with gr.Accordion(accordion_title, open=False, elem_classes=css_class):
                         if metric_key == "Word Counts":
                             gr.Markdown("""
-                            ### Word Counts per Segment
-                            This chart displays the number of words in each segment of your texts after tokenization.
                             """)
                         elif metric_key in metric_tooltips:
                             gr.Markdown(value=metric_tooltips[metric_key], elem_classes="metric-description")
                         else:
                             gr.Markdown(value=f"### {metric_key}\nDescription not found.")
                     # Add the appropriate plot
                     if metric_key == "Word Counts":
                         word_count_plot = gr.Plot(label="Word Counts per Segment", show_label=False, scale=1, elem_classes="metric-description")
@@ -302,26 +428,28 @@ The structural analysis combines multiple similarity metrics to create a compreh
         # The visual placement depends on how Gradio renders children of gr.Tab or if there's another container.
         warning_box = gr.Markdown(visible=False)
         # Create a container for metric progress indicators
         with gr.Row(visible=False) as progress_container:
             # Progress indicators will be created dynamically by ProgressiveUI
             gr.Markdown("Metric progress will appear here during analysis")
-        def run_pipeline(files, enable_semantic, enable_fuzzy, fuzzy_method, model_name, stopwords_option, batch_size, show_progress, progress=gr.Progress()):
             """Processes uploaded files, computes metrics, generates visualizations, and prepares outputs for the UI.
             Args:
                 files: A list of file objects uploaded by the user.
                 enable_semantic: Whether to compute semantic similarity.
                 enable_fuzzy: Whether to compute fuzzy string similarity.
                 fuzzy_method: The fuzzy matching method to use.
                 model_name: Name of the embedding model to use.
                 stopwords_option: Stopword filtering level (None, Standard, or Aggressive).
                 batch_size: Batch size for embedding generation.
                 show_progress: Whether to show progress bars during embedding.
                 progress: Gradio progress indicator.
             Returns:
                 tuple: Results for UI components including metrics, visualizations, and state.
             """
@@ -336,7 +464,7 @@ The structural analysis combines multiple similarity metrics to create a compreh
             warning_update_res = gr.update(visible=False)
             state_text_data_res = None
             state_df_results_res = None
             # Create a ProgressiveUI instance for handling progressive updates
             progressive_ui = ProgressiveUI(
                 metrics_preview=metrics_preview,
@@ -349,10 +477,10 @@ The structural analysis combines multiple similarity metrics to create a compreh
                 progress_container=progress_container,
                 heatmap_titles=heatmap_titles
             )
             # Make progress container visible during analysis
             progress_container.update(visible=True)
             # Create a progressive callback function
             progressive_callback = create_progressive_callback(progressive_ui)
             # Check if files are provided
@@ -369,7 +497,7 @@ The structural analysis combines multiple similarity metrics to create a compreh
                     None,  # state_text_data
                     None  # state_df_results
                 )
             # Check file size limits (10MB per file)
             for file in files:
                 file_size_mb = Path(file.name).stat().st_size / (1024 * 1024)
@@ -393,13 +521,13 @@ The structural analysis combines multiple similarity metrics to create a compreh
                         progress(0.1, desc="Preparing files...")
                     except Exception as e:
                         logger.warning(f"Progress update error (non-critical): {e}")
                 # Get filenames and read file contents
                 filenames = [
                     Path(file.name).name for file in files
                 ]  # Use Path().name to get just the filename
                 text_data = {}
                 # Read files with progress updates
                 for i, file in enumerate(files):
                     file_path = Path(file.name)
@@ -409,7 +537,7 @@ The structural analysis combines multiple similarity metrics to create a compreh
                             progress(0.1 + (0.1 * (i / len(files))), desc=f"Reading file: {filename}")
                         except Exception as e:
                             logger.warning(f"Progress update error (non-critical): {e}")
                     try:
                         text_data[filename] = file_path.read_text(encoding="utf-8-sig")
                     except UnicodeDecodeError:
@@ -433,21 +561,27 @@ The structural analysis combines multiple similarity metrics to create a compreh
                 # Configure semantic similarity and fuzzy matching
                 enable_semantic_bool = enable_semantic == "Yes"
                 enable_fuzzy_bool = enable_fuzzy == "Yes"
                 # Extract the fuzzy method from the dropdown value
-                fuzzy_method_value = fuzzy_method.split(' - ')[0] if fuzzy_method else 'token_set'
                 if progress is not None:
                     try:
                         progress(0.2, desc="Loading model..." if enable_semantic_bool else "Processing text...")
                     except Exception as e:
                         logger.warning(f"Progress update error (non-critical): {e}")
                 # Process texts with selected model
                 # Convert stopword option to appropriate parameters
                 use_stopwords = stopwords_option != "None (No filtering)"
                 use_lite_stopwords = stopwords_option == "Standard (Common particles only)"
                 # For Hugging Face models, the UI value is the correct model ID
                 internal_model_id = model_name
@@ -457,9 +591,12 @@ The structural analysis combines multiple similarity metrics to create a compreh
                     enable_semantic=enable_semantic_bool,
                     enable_fuzzy=enable_fuzzy_bool,
                     fuzzy_method=fuzzy_method_value,
                     model_name=internal_model_id,
                     use_stopwords=use_stopwords,
                     use_lite_stopwords=use_lite_stopwords,
                     progress_callback=progress,
                     progressive_callback=progressive_callback,
                     batch_size=batch_size,
@@ -479,12 +616,12 @@ The structural analysis combines multiple similarity metrics to create a compreh
                             progress(0.8, desc="Generating visualizations...")
                         except Exception as e:
                             logger.warning(f"Progress update error (non-critical): {e}")
                     # heatmap_titles is already defined in the outer scope of main_interface
                     heatmaps_data = generate_visualizations(
                         df_results, descriptive_titles=heatmap_titles
                     )
                     # Generate word count chart
                     if progress is not None:
                         try:
@@ -492,12 +629,12 @@ The structural analysis combines multiple similarity metrics to create a compreh
                         except Exception as e:
                             logger.warning(f"Progress update error (non-critical): {e}")
                     word_count_fig_res = generate_word_count_chart(word_counts_df_data)
                     # Store state data for potential future use
                     state_text_data_res = text_data
                     state_df_results_res = df_results
                     logger.info("Analysis complete, storing state data")
                     # Save results to CSV
                     if progress is not None:
                         try:
@@ -506,7 +643,7 @@ The structural analysis combines multiple similarity metrics to create a compreh
                             logger.warning(f"Progress update error (non-critical): {e}")
                     csv_path_res = "results.csv"
                     df_results.to_csv(csv_path_res, index=False)
                     # Prepare final output
                     warning_md = f"**⚠️ Warning:** {warning_raw}" if warning_raw else ""
                     metrics_preview_df_res = df_results.head(10)
@@ -514,10 +651,7 @@ The structural analysis combines multiple similarity metrics to create a compreh
                     jaccard_heatmap_res = heatmaps_data.get("Jaccard Similarity (%)")
                     lcs_heatmap_res = heatmaps_data.get("Normalized LCS")
                     fuzzy_heatmap_res = heatmaps_data.get("Fuzzy Similarity")
-                    semantic_heatmap_res = heatmaps_data.get(
-                        "Semantic Similarity"
-                    )
-                    # TF-IDF has been completely removed
                     warning_update_res = gr.update(
                         visible=bool(warning_raw), value=warning_md
                     )
@@ -546,27 +680,27 @@ The structural analysis combines multiple similarity metrics to create a compreh
             try:
                 if not csv_path or not Path(csv_path).exists():
                     return "Please run the analysis first to generate results."
                 # Read the CSV file
                 df_results = pd.read_csv(csv_path)
                 # Show detailed progress messages with percentages
                 progress(0, desc="Preparing data for analysis...")
                 progress(0.1, desc="Analyzing similarity patterns...")
                 progress(0.2, desc="Connecting to Mistral 7B via OpenRouter...")
                 # Get interpretation from LLM (using OpenRouter API)
                 progress(0.3, desc="Generating scholarly interpretation (this may take 20-40 seconds)...")
                 llm_service = LLMService()
                 interpretation = llm_service.analyze_similarity(df_results)
                 # Simulate completion steps
                 progress(0.9, desc="Formatting results...")
                 progress(0.95, desc="Applying scholarly formatting...")
                 # Completed
                 progress(1.0, desc="Analysis complete!")
                 # Add a timestamp to the interpretation
                 timestamp = datetime.now().strftime("%Y-%m-%d %H:%M")
                 interpretation = f"{interpretation}\n\n<small>Analysis generated on {timestamp}</small>"
@@ -574,36 +708,92 @@ The structural analysis combines multiple similarity metrics to create a compreh
             except Exception as e:
                 logger.error(f"Error in interpret_results: {e}", exc_info=True)
                 return f"Error interpreting results: {str(e)}"
-        process_btn.click(
             fn=run_pipeline,
-            inputs=[file_input, semantic_toggle_radio, fuzzy_toggle_radio, fuzzy_method_dropdown, model_dropdown, stopwords_dropdown, batch_size_slider, progress_bar_checkbox],
-            outputs=[
-                csv_output,
-                metrics_preview,
-                word_count_plot,
-                heatmap_tabs["Jaccard Similarity (%)"],
-                heatmap_tabs["Normalized LCS"],
-                heatmap_tabs["Fuzzy Similarity"],
-                heatmap_tabs["Semantic Similarity"],
-                warning_box,
-                state_text_data,
-                state_df_results,
-            ]
         )
         # Structural analysis functionality removed - see dedicated collation app
         # Connect the interpret button
         interpret_btn.click(
             fn=interpret_results,
             inputs=[csv_output],
             outputs=interpretation_output
         )
     return demo
 if __name__ == "__main__":
     demo = main_interface()
-    demo.launch()
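
Note on the metric tooltips above: the Jaccard formula and the LCS worked example can be checked with a few lines of Python. This is a minimal sketch, not the code in `pipeline/metrics.py`, which this diff does not show. The function names are hypothetical. The mapping of `avg`, `min` and `max` to denominators is an assumption based on the labels of the new LCS normalization dropdown; the removed tooltip only states that the old version divided by the length of the longer segment.

```python
def jaccard_percent(tokens_a, tokens_b):
    """Shared unique tokens divided by all unique tokens, as a percentage."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return 100.0 * len(set_a & set_b) / len(union)


def lcs_length(tokens_a, tokens_b):
    """Length of the longest common subsequence; tokens need not be adjacent."""
    prev = [0] * (len(tokens_b) + 1)
    for a in tokens_a:
        curr = [0]
        for j, b in enumerate(tokens_b, start=1):
            if a == b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def normalized_lcs(tokens_a, tokens_b, mode="max"):
    """LCS length divided by a length-based denominator (assumed mapping)."""
    if not tokens_a or not tokens_b:
        return 0.0
    len_a, len_b = len(tokens_a), len(tokens_b)
    denominators = {
        "max": max(len_a, len_b),    # longer segment, as in the removed tooltip
        "min": min(len_a, len_b),    # assumption: shorter segment
        "avg": (len_a + len_b) / 2,  # assumption: mean of both lengths
    }
    return lcs_length(tokens_a, tokens_b) / denominators[mode]


# The example sentences from the Normalized LCS tooltip.
text_a = "the quick brown fox jumps".split()
text_b = "the lazy cat and brown dog jumps high".split()

print(lcs_length(text_a, text_b))             # 3 ("the brown jumps")
print(jaccard_percent(text_a, text_b))        # 30.0
print(normalized_lcs(text_a, text_b, "max"))  # 0.375
```

With these two sentences the LCS score (37.5% under `max`) is higher than the Jaccard score (30%), which is the situation the tooltip's "Note on Interpretation" describes.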

 logger = logging.getLogger(__name__)
 def main_interface():
+    # Theme and CSS applied here for Gradio 5.x compatibility
+    # For Gradio 6.x, these will move to launch() - see migration guide
     with gr.Blocks(
         theme=tibetan_theme,
+        css=tibetan_theme.get_css_string(),
+        title="Tibetan Text Metrics Web App"
     ) as demo:
         gr.Markdown(
+            """# Tibetan Text Metrics
+<span style='font-size:18px;'>Compare Tibetan texts to discover how similar they are. This tool helps scholars identify shared passages, textual variations, and relationships between different versions of Tibetan manuscripts. Part of the <a href="https://github.com/daniel-wojahn/tibetan-text-metrics" target="_blank">TTM project</a>.</span>
         """,
             elem_classes="gr-markdown",
                 with gr.Group(elem_classes="step-box"):
                     gr.Markdown(
                         """
+                    ## Step 1: Upload Your Texts
+                    <span style='font-size:16px;'>Upload two or more Tibetan text files (.txt format). If your texts have chapters, separate them with the ༈ marker so the tool can compare chapter-by-chapter.</span>
                     """,
                         elem_classes="gr-markdown",
                     )
                     file_input = gr.File(
+                        label="Choose your Tibetan text files",
                         file_types=[".txt"],
                         file_count="multiple",
                     )
                     gr.Markdown(
+                        "<small>Tip: Files should be under 1MB for best performance. Use UTF-8 encoded .txt files.</small>",
                         elem_classes="gr-markdown"
                     )
             with gr.Column(scale=1, elem_classes="step-column"):
                 with gr.Group(elem_classes="step-box"):
                     gr.Markdown(
+                        """## Step 2: Choose Analysis Type
+<span style='font-size:16px;'>Pick a preset for quick results, or use Custom for full control.</span>
                     """,
                         elem_classes="gr-markdown",
                     )
+                    with gr.Tabs():
+                        # ===== QUICK START TAB =====
+                        with gr.Tab("Quick Start", id="quick_tab"):
+                            analysis_preset = gr.Radio(
+                                label="What kind of analysis do you need?",
+                                choices=[
+                                    "📊 Standard — Vocabulary + Sequences + Fuzzy matching",
+                                    "🧠 Deep — All metrics including AI meaning analysis",
+                                    "⚡ Quick — Vocabulary overlap only (fastest)"
+                                ],
+                                value="📊 Standard — Vocabulary + Sequences + Fuzzy matching",
+                                info="Standard is recommended for most users. Deep analysis takes longer but finds texts with similar meaning even when words differ."
+                            )
+                            gr.Markdown("""
+**What each preset includes:**
+| Preset | Jaccard | LCS | Fuzzy | Semantic AI |
+|--------|---------|-----|-------|-------------|
+| 📊 Standard | ✓ | ✓ | ✓ | — |
+| 🧠 Deep | ✓ | ✓ | ✓ | ✓ |
+| ⚡ Quick | ✓ | — | — | — |
+                            """, elem_classes="preset-table")
+                            process_btn_quick = gr.Button(
+                                "🔍 Compare My Texts", elem_id="run-btn-quick", variant="primary"
+                            )
+                        # ===== CUSTOM TAB =====
+                        with gr.Tab("Custom", id="custom_tab"):
+                            gr.Markdown("**Fine-tune each metric and option:**", elem_classes="custom-header")
+                            with gr.Accordion("📊 Lexical Metrics", open=True):
+                                gr.Markdown("*Compare the actual words used in texts*")
+                                tokenization_mode_dropdown = gr.Dropdown(
+                                    label="How to split text?",
+                                    choices=[
+                                        "word - Whole words (recommended)",
+                                        "syllable - Individual syllables (finer detail)"
+                                    ],
+                                    value="word - Whole words (recommended)",
+                                    info="'Word' keeps multi-syllable words together — recommended for Jaccard."
+                                )
+                                stopwords_dropdown = gr.Dropdown(
+                                    label="Filter common words?",
+                                    choices=[
+                                        "None (No filtering)",
+                                        "Standard (Common particles only)",
+                                        "Aggressive (All function words)"
+                                    ],
+                                    value="Standard (Common particles only)",
+                                    info="Remove common particles (གི, ལ, ནི) before comparing."
+                                )
+                                particle_normalization_checkbox = gr.Checkbox(
+                                    label="Normalize grammatical particles?",
+                                    value=False,
+                                    info="Treat variants as equivalent (གི/ཀྱི/གྱི → གི). Useful for different scribal conventions."
+                                )
+                            with gr.Accordion("📏 Sequence Matching (LCS)", open=True):
+                                gr.Markdown("*Find shared passages in the same order*")
+                                gr.Checkbox(
+                                    label="Enable sequence matching",
+                                    value=True,
+                                    info="Finds the longest sequence of words appearing in both texts."
+                                )  # Display-only: this checkbox is not bound to a variable, so it cannot be passed to run_pipeline; LCS is always computed as a core metric
+                                lcs_normalization_dropdown = gr.Dropdown(
+                                    label="How to handle different text lengths?",
+                                    choices=[
+                                        "avg - Balanced comparison (default)",
+                                        "min - Detect if one text contains the other",
+                                        "max - Stricter, penalizes length differences"
+                                    ],
+                                    value="avg - Balanced comparison (default)",
+                                    info="'min' is useful for finding quotes or excerpts."
+                                )
+                            with gr.Accordion("🔍 Fuzzy Matching", open=True):
+                                gr.Markdown("*Detect similar but not identical text*")
+                                fuzzy_toggle_radio = gr.Radio(
+                                    label="Find approximate matches?",
+                                    choices=["Yes", "No"],
+                                    value="Yes",
+                                    info="Useful for spelling variations and scribal differences."
+                                )
+                                fuzzy_method_dropdown = gr.Dropdown(
+                                    label="Matching method",
+                                    choices=[
+                                        "ngram - Syllable pairs (recommended)",
+                                        "syllable_edit - Count syllable changes",
+                                        "weighted_jaccard - Word frequency comparison"
+                                    ],
+                                    value="ngram - Syllable pairs (recommended)",
+                                    info="All options work at the Tibetan syllable level."
+                                )
+                            with gr.Accordion("🧠 Semantic Analysis", open=False):
+                                gr.Markdown("*Compare meaning using AI (slower)*")
+                                semantic_toggle_radio = gr.Radio(
+                                    label="Analyze meaning similarity?",
+                                    choices=["Yes", "No"],
+                                    value="No",
+                                    info="Finds texts that say similar things in different words."
+                                )
+                                model_dropdown = gr.Dropdown(
+                                    choices=[
+                                        "buddhist-nlp/buddhist-sentence-similarity",
+                                        "buddhist-nlp/bod-eng-similarity",
+                                        "sentence-transformers/LaBSE",
+                                        "BAAI/bge-m3"
+                                    ],
+                                    label="AI Model",
+                                    value="buddhist-nlp/buddhist-sentence-similarity",
+                                    info="'buddhist-sentence-similarity' works best for Buddhist texts."
+                                )
+                                batch_size_slider = gr.Slider(
+                                    minimum=1,
+                                    maximum=64,
+                                    value=8,
+                                    step=1,
+                                    label="Processing batch size",
+                                    info="Higher = faster but uses more memory."
+                                )
+                                progress_bar_checkbox = gr.Checkbox(
+                                    label="Show detailed progress",
+                                    value=False,
+                                    info="See step-by-step progress during analysis."
+                                )
+                            process_btn_custom = gr.Button(
+                                "🔍 Compare My Texts (Custom)", elem_id="run-btn-custom", variant="primary"
+                            )
+                    # Note: Both process_btn_quick and process_btn_custom are wired below
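
Note on the preset table above: the table can be read as a mapping from preset label to metric flags, and the Custom dropdowns all follow a `key - description` pattern. The sketch below shows one way a handler might resolve both. The names `PRESET_METRICS`, `resolve_preset` and `option_key` are hypothetical and are not taken from this commit. The flags follow the preset table; the comment on the sequence-matching checkbox says LCS is always computed, so the real handler may treat `lcs` differently.

```python
# Flags per preset, transcribed from the "What each preset includes" table.
PRESET_METRICS = {
    "Standard": {"jaccard": True, "lcs": True, "fuzzy": True, "semantic": False},
    "Deep": {"jaccard": True, "lcs": True, "fuzzy": True, "semantic": True},
    "Quick": {"jaccard": True, "lcs": False, "fuzzy": False, "semantic": False},
}


def resolve_preset(label):
    """Return the metric flags for a Radio label such as '⚡ Quick — ...'."""
    for name, flags in PRESET_METRICS.items():
        if name in label:
            return dict(flags)  # copy so callers cannot mutate the table
    raise ValueError(f"Unknown preset label: {label!r}")


def option_key(dropdown_value, default):
    """Extract the machine-readable key from 'key - description' values."""
    return dropdown_value.split(" - ")[0] if dropdown_value else default


# Labels and dropdown values copied from the UI definitions in this diff.
print(resolve_preset("🧠 Deep — All metrics including AI meaning analysis"))
print(option_key("ngram - Syllable pairs (recommended)", "ngram"))  # ngram
print(option_key("min - Detect if one text contains the other", "avg"))  # min
```

Matching on the preset name rather than the full label keeps the lookup working if the emoji or the description text changes, at the cost of requiring preset names that are not substrings of each other.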
         gr.Markdown(
             """## Results
         # The heatmap_titles and metric_tooltips dictionaries are defined here
         # heatmap_titles = { ... }
         # metric_tooltips = { ... }
+        csv_output = gr.File(label="📥 Download Full Results (CSV spreadsheet)")
         metrics_preview = gr.Dataframe(
+            label="Results Summary — Compare chapters across your texts", interactive=False, visible=True
         )
         # States for data persistence
         state_text_data = gr.State()
         state_df_results = gr.State()
         # LLM Interpretation components
         with gr.Row():
             with gr.Column():
                 gr.Markdown(
+                    "## Get Expert Insights\n*Let AI help you understand what the numbers mean and what patterns they reveal about your texts.*",
                     elem_classes="gr-markdown"
                 )
                 # Add the interpret button
                 with gr.Row():
                     interpret_btn = gr.Button(
+                        "📊 Explain My Results",
                         variant="primary",
                         elem_id="interpret-btn"
                     )
                 # Create a placeholder message with proper formatting and structure
                 initial_message = """
+## Understanding Your Results
+<small>*After running the analysis, click "Explain My Results" to get a plain-language interpretation of what the similarity scores mean for your texts.*</small>
 """
                 interpretation_output = gr.Markdown(
                     value=initial_message,
                     elem_id="llm-analysis"
                 )
         # Heatmap tabs for each metric
         heatmap_titles = {
+            "Jaccard Similarity (%)": "Shows how much vocabulary the texts share. Higher = more words in common.",
+            "Normalized LCS": "Shows shared phrases in the same order. Higher = more passages appear in both texts.",
+            "Fuzzy Similarity": "Finds similar text even with spelling differences. Higher = more alike.",
+            "Semantic Similarity": "Compares actual meaning using AI. Higher = texts say similar things.",
+            "Word Counts": "How long is each section? Helps you understand text structure.",
         }
         metric_tooltips = {
             "Jaccard Similarity (%)": """
+### Vocabulary Overlap (Jaccard Similarity)
+**What it measures:** How many unique words appear in both texts.
+**How to read it:** A score of 70% means 70% of all unique words found in either text appear in both. Higher scores = more shared vocabulary.
+**What it tells you:**
+- High scores (>70%): Texts use very similar vocabulary — possibly the same source or direct copying
+- Medium scores (40-70%): Texts share significant vocabulary — likely related topics or traditions
+- Low scores (<40%): Texts use different words — different sources or heavily edited versions
+**Good to know:** This metric ignores word order and how often words repeat. It only asks "does this word appear in both texts?"
+**Tips:**
+- Use the "Filter common words" option to focus on meaningful content words rather than grammatical particles.
+- **Word mode is recommended** for Jaccard. Syllable mode may inflate scores because common syllables (like ས, ར, ན) appear in many different words.
 """,
             "Fuzzy Similarity": """
+### Approximate Matching (Fuzzy Similarity)
+**What it measures:** How similar texts are, even when they're not exactly the same.
+**How to read it:** Scores from 0 to 1. Higher = more similar. A score of 0.85 means the texts are 85% alike.
+**What it tells you:**
+- High scores (>0.8): Very similar texts with minor differences (spelling, small edits)
+- Medium scores (0.5-0.8): Noticeably different but clearly related
+- Low scores (<0.5): Substantially different texts
+**Why it matters for Tibetan texts:**
+- Catches spelling variations between manuscripts
+- Finds scribal differences and regional conventions
+- Identifies passages that were slightly modified
+**Recommended methods:**
+- **Syllable pairs (ngram)**: Best for Tibetan — compares pairs of syllables
+- **Count syllable changes**: Good for finding minor edits
+- **Word frequency**: Useful when certain words repeat often
 """,
             "Normalized LCS": """
+### Shared Sequences (Longest Common Subsequence)
+**What it measures:** The longest chain of words that appears in both texts *in the same order*, even if other words come in between.
+**How to read it:** Higher scores mean longer shared passages. A score of 0.6 means the shared chain is about 60% as long as the texts being compared (the exact baseline depends on the normalization setting).
+**Example:** If Text A says "the quick brown fox" and Text B says "the lazy brown dog", the shared sequence is "the brown" — words that appear in both, in the same order.
+**What it tells you:**
+- High scores (>0.6): Texts share substantial passages — likely direct copying or common source
+- Medium scores (0.3-0.6): Some shared phrasing — possibly related traditions
+- Low scores (<0.3): Different word ordering — independent compositions or heavy editing
+**Why this is different from vocabulary overlap:**
+- Vocabulary overlap asks: "Do they use the same words?"
+- Sequence matching asks: "Do they say things in the same order?"
+Two texts might share many words (high Jaccard) but arrange them differently (low LCS), suggesting they discuss similar topics but were composed independently.
 """,
             "Semantic Similarity": """
+### Meaning Similarity (Semantic Analysis)
+**What it measures:** Whether texts convey similar *meaning*, even if they use different words.
+**How to read it:** Scores from 0 to 1. Higher = more similar meaning. A score of 0.8 means the texts express very similar ideas.
+**What it tells you:**
+- High scores (>0.75): Texts say similar things, even if worded differently
+- Medium scores (0.5-0.75): Related topics or themes
+- Low scores (<0.5): Different subject matter
+**How it works:** An AI model (by default, one trained on Buddhist texts) reads both passages and judges how similar their meaning is. This catches similarities that word-matching would miss.
+**When to use it:**
+- Finding paraphrased passages
+- Identifying texts that discuss the same concepts differently
+- Comparing translations or commentaries
+**Note:** This takes longer to compute but provides insights the other metrics can't.
 """,
             "Word Counts": """
+### Text Length by Section
+**What it shows:** How many words are in each chapter or section of your texts.
+**How to read it:** Taller bars = longer sections. Compare bars to see which parts of your texts are longer or shorter.
+**What it tells you:**
+- Similar bar heights across texts suggest similar structure
+- Very different lengths might explain why similarity scores vary
+- Helps identify which sections to examine more closely
+**Tip:** If one text has much longer chapters, it might contain additional material not in the other version.
 """,
             "Structural Analysis": """
+### How Texts Relate to Each Other
+**What it shows:** An overview of how your text sections connect and relate across documents.
+**What it tells you:**
+- Which sections are most similar to each other
+- Possible patterns of copying or shared sources
+- How texts might have evolved or been edited over time
+**Useful for:**
+- Understanding textual transmission history
+- Identifying which version might be older or more original
+- Finding sections that were added, removed, or modified
+**Note:** This analysis combines all the other metrics to give you the big picture.
 """
         }
         heatmap_tabs = {}
+        gr.Markdown("## Visual Comparison", elem_classes="gr-markdown")
         with gr.Tabs(elem_id="heatmap-tab-group"):
             # Process all metrics
             metrics_to_display = heatmap_titles
             for metric_key, descriptive_title in metrics_to_display.items():
                 with gr.Tab(metric_key):
                     # Set CSS class based on metric type
                     if metric_key == "Jaccard Similarity (%)":
                         css_class = "metric-info-accordion jaccard-info"
+                        accordion_title = "ℹ️ What does this mean?"
                     elif metric_key == "Normalized LCS":
                         css_class = "metric-info-accordion lcs-info"
+                        accordion_title = "ℹ️ What does this mean?"
                     elif metric_key == "Fuzzy Similarity":
                         css_class = "metric-info-accordion fuzzy-info"
+                        accordion_title = "ℹ️ What does this mean?"
                     elif metric_key == "Semantic Similarity":
                         css_class = "metric-info-accordion semantic-info"
+                        accordion_title = "ℹ️ What does this mean?"
                     elif metric_key == "Word Counts":
                         css_class = "metric-info-accordion wordcount-info"
+                        accordion_title = "ℹ️ What does this mean?"
                     else:
                         css_class = "metric-info-accordion"
+                        accordion_title = f"ℹ️ About {metric_key}"
                     # Create the accordion with appropriate content
                     with gr.Accordion(accordion_title, open=False, elem_classes=css_class):
                         if metric_key == "Word Counts":
                             gr.Markdown("""
+### Text Length by Section
+This chart shows how many words are in each chapter or section. Taller bars = longer sections.
+**Why it matters:** If sections have very different lengths, it might explain differences in similarity scores.
                             """)
                         elif metric_key in metric_tooltips:
                             gr.Markdown(value=metric_tooltips[metric_key], elem_classes="metric-description")
                         else:
                             gr.Markdown(value=f"### {metric_key}\nDescription not found.")
                     # Add the appropriate plot
                     if metric_key == "Word Counts":
                         word_count_plot = gr.Plot(label="Word Counts per Segment", show_label=False, scale=1, elem_classes="metric-description")
         # The visual placement depends on how Gradio renders children of gr.Tab or if there's another container.
         warning_box = gr.Markdown(visible=False)
         # Create a container for metric progress indicators
         with gr.Row(visible=False) as progress_container:
             # Progress indicators will be created dynamically by ProgressiveUI
             gr.Markdown("Metric progress will appear here during analysis")
+        def run_pipeline(files, enable_semantic, enable_fuzzy, fuzzy_method, lcs_normalization, model_name, tokenization_mode, stopwords_option, normalize_particles, batch_size, show_progress, progress=gr.Progress()):
             """Processes uploaded files, computes metrics, generates visualizations, and prepares outputs for the UI.
             Args:
                 files: A list of file objects uploaded by the user.
                 enable_semantic: Whether to compute semantic similarity.
                 enable_fuzzy: Whether to compute fuzzy string similarity.
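                 fuzzy_method: The fuzzy matching method to use.
+                lcs_normalization: LCS normalization choice from the dropdown (the key before ' - ', e.g. 'avg', is used).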
                 model_name: Name of the embedding model to use.
+                tokenization_mode: How to tokenize text (syllable or word).
                 stopwords_option: Stopword filtering level (None, Standard, or Aggressive).
+                normalize_particles: Whether to normalize grammatical particles.
                 batch_size: Batch size for embedding generation.
                 show_progress: Whether to show progress bars during embedding.
                 progress: Gradio progress indicator.
             Returns:
                 tuple: Results for UI components including metrics, visualizations, and state.
             """
             warning_update_res = gr.update(visible=False)
             state_text_data_res = None
             state_df_results_res = None
             # Create a ProgressiveUI instance for handling progressive updates
             progressive_ui = ProgressiveUI(
                 metrics_preview=metrics_preview,
                 progress_container=progress_container,
                 heatmap_titles=heatmap_titles
             )
             # Make progress container visible during analysis
             progress_container.update(visible=True)
             # Create a progressive callback function
             progressive_callback = create_progressive_callback(progressive_ui)
             # Check if files are provided
                     None,  # state_text_data
                     None  # state_df_results
                 )
             # Check file size limits (10MB per file)
             for file in files:
                 file_size_mb = Path(file.name).stat().st_size / (1024 * 1024)
                         progress(0.1, desc="Preparing files...")
                     except Exception as e:
                         logger.warning(f"Progress update error (non-critical): {e}")
                 # Get filenames and read file contents
                 filenames = [
                     Path(file.name).name for file in files
                 ]  # Use Path().name to get just the filename
                 text_data = {}
                 # Read files with progress updates
                 for i, file in enumerate(files):
                     file_path = Path(file.name)
                             progress(0.1 + (0.1 * (i / len(files))), desc=f"Reading file: {filename}")
                         except Exception as e:
                             logger.warning(f"Progress update error (non-critical): {e}")
                     try:
                         text_data[filename] = file_path.read_text(encoding="utf-8-sig")
                     except UnicodeDecodeError:
                 # Configure semantic similarity and fuzzy matching
                 enable_semantic_bool = enable_semantic == "Yes"
                 enable_fuzzy_bool = enable_fuzzy == "Yes"
                 # Extract the fuzzy method from the dropdown value
+                fuzzy_method_value = fuzzy_method.split(' - ')[0] if fuzzy_method else 'ngram'
+                # Extract the LCS normalization from the dropdown value
+                lcs_normalization_value = lcs_normalization.split(' - ')[0] if lcs_normalization else 'avg'
+                # Extract the tokenization mode from the dropdown value
+                tokenization_mode_value = tokenization_mode.split(' - ')[0] if tokenization_mode else 'syllable'
                 if progress is not None:
                     try:
                         progress(0.2, desc="Loading model..." if enable_semantic_bool else "Processing text...")
                     except Exception as e:
                         logger.warning(f"Progress update error (non-critical): {e}")
                 # Process texts with selected model
                 # Convert stopword option to appropriate parameters
                 use_stopwords = stopwords_option != "None (No filtering)"
                 use_lite_stopwords = stopwords_option == "Standard (Common particles only)"
                 # For Hugging Face models, the UI value is the correct model ID
                 internal_model_id = model_name
                     enable_semantic=enable_semantic_bool,
                     enable_fuzzy=enable_fuzzy_bool,
                     fuzzy_method=fuzzy_method_value,
+                    lcs_normalization=lcs_normalization_value,
                     model_name=internal_model_id,
                     use_stopwords=use_stopwords,
                     use_lite_stopwords=use_lite_stopwords,
+                    normalize_particles=normalize_particles,
+                    tokenization_mode=tokenization_mode_value,
                     progress_callback=progress,
                     progressive_callback=progressive_callback,
                     batch_size=batch_size,
                             progress(0.8, desc="Generating visualizations...")
                         except Exception as e:
                             logger.warning(f"Progress update error (non-critical): {e}")
                     # heatmap_titles is already defined in the outer scope of main_interface
                     heatmaps_data = generate_visualizations(
                         df_results, descriptive_titles=heatmap_titles
                     )
                     # Generate word count chart
                     if progress is not None:
                         try:
                         except Exception as e:
                             logger.warning(f"Progress update error (non-critical): {e}")
                     word_count_fig_res = generate_word_count_chart(word_counts_df_data)
                     # Store state data for potential future use
                     state_text_data_res = text_data
                     state_df_results_res = df_results
                     logger.info("Analysis complete, storing state data")
                     # Save results to CSV
                     if progress is not None:
                         try:
                             logger.warning(f"Progress update error (non-critical): {e}")
                     csv_path_res = "results.csv"
                     df_results.to_csv(csv_path_res, index=False)
                     # Prepare final output
                     warning_md = f"**⚠️ Warning:** {warning_raw}" if warning_raw else ""
                     metrics_preview_df_res = df_results.head(10)
                     jaccard_heatmap_res = heatmaps_data.get("Jaccard Similarity (%)")
                     lcs_heatmap_res = heatmaps_data.get("Normalized LCS")
                     fuzzy_heatmap_res = heatmaps_data.get("Fuzzy Similarity")
+                    semantic_heatmap_res = heatmaps_data.get("Semantic Similarity")
                     warning_update_res = gr.update(
                         visible=bool(warning_raw), value=warning_md
                     )
             try:
                 if not csv_path or not Path(csv_path).exists():
                     return "Please run the analysis first to generate results."
                 # Read the CSV file
                 df_results = pd.read_csv(csv_path)
                 # Show detailed progress messages with percentages
                 progress(0, desc="Preparing data for analysis...")
                 progress(0.1, desc="Analyzing similarity patterns...")
                 progress(0.2, desc="Connecting to Mistral 7B via OpenRouter...")
                 # Get interpretation from LLM (using OpenRouter API)
                 progress(0.3, desc="Generating scholarly interpretation (this may take 20-40 seconds)...")
                 llm_service = LLMService()
                 interpretation = llm_service.analyze_similarity(df_results)
                 # Simulate completion steps
                 progress(0.9, desc="Formatting results...")
                 progress(0.95, desc="Applying scholarly formatting...")
                 # Completed
                 progress(1.0, desc="Analysis complete!")
                 # Add a timestamp to the interpretation
                 timestamp = datetime.now().strftime("%Y-%m-%d %H:%M")
                 interpretation = f"{interpretation}\n\n<small>Analysis generated on {timestamp}</small>"
             except Exception as e:
                 logger.error(f"Error in interpret_results: {e}", exc_info=True)
                 return f"Error interpreting results: {str(e)}"
+        def run_pipeline_preset(files, preset, progress=gr.Progress()):
+            """Wrapper that converts preset selection to pipeline parameters."""
+            # Determine settings based on preset
+            if "Quick" in preset:
+                # Quick: base metrics only (fuzzy and semantic disabled)
+                enable_semantic = "No"
+                enable_fuzzy = "No"
+            elif "Deep" in preset:
+                # Deep: All metrics including semantic
+                enable_semantic = "Yes"
+                enable_fuzzy = "Yes"
+            else:
+                # Standard: Jaccard + LCS + Fuzzy (no semantic)
+                enable_semantic = "No"
+                enable_fuzzy = "Yes"
+            # Use sensible defaults for preset mode
+            fuzzy_method = "ngram - Syllable pairs (recommended)"
+            lcs_normalization = "avg - Balanced comparison (default)"
+            model_name = "buddhist-nlp/buddhist-sentence-similarity"
+            tokenization_mode = "word - Whole words (recommended)"
+            stopwords_option = "Standard (Common particles only)"
+            normalize_particles = False
+            batch_size = 8
+            show_progress = False
+            return run_pipeline(
+                files, enable_semantic, enable_fuzzy, fuzzy_method,
+                lcs_normalization, model_name, tokenization_mode,
+                stopwords_option, normalize_particles, batch_size,
+                show_progress, progress
+            )
+        # Output components for both buttons
+        pipeline_outputs = [
+            csv_output,
+            metrics_preview,
+            word_count_plot,
+            heatmap_tabs["Jaccard Similarity (%)"],
+            heatmap_tabs["Normalized LCS"],
+            heatmap_tabs["Fuzzy Similarity"],
+            heatmap_tabs["Semantic Similarity"],
+            warning_box,
+            state_text_data,
+            state_df_results,
+        ]
+        # Quick Start button uses presets
+        process_btn_quick.click(
+            fn=run_pipeline_preset,
+            inputs=[file_input, analysis_preset],
+            outputs=pipeline_outputs
+        )
+        # Custom button uses all the detailed settings
+        process_btn_custom.click(
             fn=run_pipeline,
+            inputs=[
+                file_input,
+                semantic_toggle_radio,
+                fuzzy_toggle_radio,
+                fuzzy_method_dropdown,
+                lcs_normalization_dropdown,
+                model_dropdown,
+                tokenization_mode_dropdown,
+                stopwords_dropdown,
+                particle_normalization_checkbox,
+                batch_size_slider,
+                progress_bar_checkbox
+            ],
+            outputs=pipeline_outputs
         )
         # Structural analysis functionality removed - see dedicated collation app
         # Connect the interpret button
         interpret_btn.click(
             fn=interpret_results,
             inputs=[csv_output],
             outputs=interpretation_output
         )
     return demo
 if __name__ == "__main__":
     demo = main_interface()
+    demo.launch()
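
The tooltips in app.py describe vocabulary overlap (Jaccard) and shared sequences (LCS) in plain language. As a rough illustration, and not the code in pipeline/metrics.py, the two scores could be computed as below. The function names `jaccard_percent`, `lcs_length`, and `normalized_lcs` are hypothetical, tokens are assumed to be pre-split words, and the `avg` normalization is assumed to divide by the mean of the two text lengths.

```python
from typing import Sequence


def jaccard_percent(tokens_a: Sequence[str], tokens_b: Sequence[str]) -> float:
    """Share of unique tokens that occur in both texts, as a percentage."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return 100.0 * len(set_a & set_b) / len(union)


def lcs_length(tokens_a: Sequence[str], tokens_b: Sequence[str]) -> int:
    """Length of the longest common subsequence (order kept, gaps allowed)."""
    prev = [0] * (len(tokens_b) + 1)
    for a in tokens_a:
        curr = [0]
        for j, b in enumerate(tokens_b, start=1):
            if a == b:
                # Extend the best chain that ended before both tokens.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise carry over the better of the two neighbours.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def normalized_lcs(tokens_a: Sequence[str], tokens_b: Sequence[str]) -> float:
    """LCS length divided by the average text length (assumed 'avg' mode)."""
    avg_len = (len(tokens_a) + len(tokens_b)) / 2
    if avg_len == 0:
        return 0.0
    return lcs_length(tokens_a, tokens_b) / avg_len


text_a = "the quick brown fox".split()
text_b = "the lazy brown dog".split()
print(round(jaccard_percent(text_a, text_b), 1))  # 33.3
print(normalized_lcs(text_a, text_b))             # 0.5
```

With the example from the LCS tooltip, the shared chain is "the brown" (two tokens out of an average length of four), and two of the six unique words appear in both texts.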

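The preset wrapper and the dropdown parsing in app.py both rely on string conventions: a preset is matched by substring, and a dropdown label such as `"ngram - Syllable pairs (recommended)"` is reduced to the key before `" - "`. The sketch below isolates those two conventions so they can be checked without Gradio. The helper names `option_key` and `preset_flags` and the preset labels used in the example are hypothetical; only the substring checks and the `split(' - ')[0]` rule come from the diff.

```python
from typing import Dict, Optional


def option_key(label: Optional[str], default: str) -> str:
    """Return the part of a dropdown label before ' - ', or a default."""
    return label.split(" - ")[0] if label else default


def preset_flags(preset: str) -> Dict[str, str]:
    """Map a preset label to the 'Yes'/'No' toggles run_pipeline expects."""
    if "Quick" in preset:
        return {"enable_semantic": "No", "enable_fuzzy": "No"}
    if "Deep" in preset:
        return {"enable_semantic": "Yes", "enable_fuzzy": "Yes"}
    # Anything else is treated as the Standard preset.
    return {"enable_semantic": "No", "enable_fuzzy": "Yes"}


# Hypothetical labels; the real Radio choices are not shown in this diff.
print(option_key("avg - Balanced comparison (default)", "avg"))
print(preset_flags("Deep Analysis"))
```

Because matching is by substring, renaming a preset label in the UI without keeping the words "Quick" or "Deep" would silently fall through to the Standard settings.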
pipeline/hf_embedding.py CHANGED Viewed

@@ -10,8 +10,10 @@ _model_cache = {}
 # Model version mapping
 MODEL_VERSIONS = {
-    "sentence-transformers/LaBSE": "v1.0",
-    "intfloat/e5-base-v2": "v1.0",
 }
 def get_model(model_id: str) -> Tuple[Optional[SentenceTransformer], Optional[str]]:
@@ -28,7 +30,7 @@ def get_model(model_id: str) -> Tuple[Optional[SentenceTransformer], Optional[st
     # Include version information in cache key
     model_version = MODEL_VERSIONS.get(model_id, "unknown")
     cache_key = f"{model_id}@{model_version}"
     if cache_key in _model_cache:
         logger.info(f"Returning cached model: {model_id} (version: {model_version})")
         return _model_cache[cache_key], "sentence-transformer"
@@ -44,9 +46,9 @@ def get_model(model_id: str) -> Tuple[Optional[SentenceTransformer], Optional[st
         return None, None
 def generate_embeddings(
-    texts: List[str],
-    model: SentenceTransformer,
-    batch_size: int = 32,
     show_progress_bar: bool = False
 ) -> np.ndarray:
     """
@@ -70,9 +72,9 @@ def generate_embeddings(
     logger.info(f"Generating embeddings for {len(texts)} texts with {type(model).__name__}...")
     try:
         embeddings = model.encode(
-            texts,
             batch_size=batch_size,
-            convert_to_numpy=True,
             show_progress_bar=show_progress_bar
         )
         logger.info(f"Embeddings generated with shape: {embeddings.shape}")

 # Model version mapping
 MODEL_VERSIONS = {
+    "buddhist-nlp/buddhist-sentence-similarity": "v1.0",  # Dharmamitra - best for Tibetan Buddhist texts
+    "buddhist-nlp/bod-eng-similarity": "v1.0",  # Dharmamitra - Tibetan-English bitext alignment
+    "sentence-transformers/LaBSE": "v1.0",  # Multilingual baseline
+    "BAAI/bge-m3": "v1.0",  # Strong multilingual alternative
 }
 def get_model(model_id: str) -> Tuple[Optional[SentenceTransformer], Optional[str]]:
     # Include version information in cache key
     model_version = MODEL_VERSIONS.get(model_id, "unknown")
     cache_key = f"{model_id}@{model_version}"
     if cache_key in _model_cache:
         logger.info(f"Returning cached model: {model_id} (version: {model_version})")
         return _model_cache[cache_key], "sentence-transformer"
         return None, None
 def generate_embeddings(
+    texts: List[str],
+    model: SentenceTransformer,
+    batch_size: int = 32,
     show_progress_bar: bool = False
 ) -> np.ndarray:
     """
     logger.info(f"Generating embeddings for {len(texts)} texts with {type(model).__name__}...")
     try:
         embeddings = model.encode(
+            texts,
             batch_size=batch_size,
+            convert_to_numpy=True,
             show_progress_bar=show_progress_bar
         )
         logger.info(f"Embeddings generated with shape: {embeddings.shape}")

pipeline/llm_service.py CHANGED Viewed

@@ -39,11 +39,11 @@ class LLMService:
     """
     Service for analyzing text similarity metrics using LLMs and rule-based methods.
     """
     def __init__(self, api_key: str = None):
         """
         Initialize the LLM service.
         Args:
             api_key: Optional API key for OpenRouter. If not provided, will try to load from environment.
         """
@@ -51,19 +51,19 @@ class LLMService:
         self.models = PREFERRED_MODELS
         self.temperature = DEFAULT_TEMPERATURE
         self.top_p = DEFAULT_TOP_P
     def analyze_similarity(
-        self,
-        results_df: pd.DataFrame,
         use_llm: bool = True,
     ) -> str:
         """
         Analyze similarity metrics using either LLM or rule-based approach.
         Args:
             results_df: DataFrame containing similarity metrics
             use_llm: Whether to use LLM for analysis (falls back to rule-based if False or on error)
         Returns:
             str: Analysis of the metrics in markdown format with appropriate fallback messages
         """
@@ -71,19 +71,19 @@ class LLMService:
         if not use_llm:
             logger.info("LLM analysis disabled. Using rule-based analysis.")
             return self._analyze_with_rules(results_df)
         # Try LLM analysis if enabled
         try:
             if not self.api_key:
                 raise ValueError("No OpenRouter API key provided. Please set the OPENROUTER_API_KEY environment variable.")
             logger.info("Attempting LLM-based analysis...")
             return self._analyze_with_llm(results_df, max_tokens=DEFAULT_MAX_TOKENS)
         except Exception as e:
             error_msg = str(e)
             logger.error(f"Error in LLM analysis: {error_msg}")
             # Create a user-friendly error message
             if "no openrouter api key" in error_msg.lower():
                 error_note = "OpenRouter API key not found. Please set the `OPENROUTER_API_KEY` environment variable to use this feature."
@@ -95,42 +95,42 @@ class LLMService:
                 error_note = "API rate limit exceeded. Falling back to rule-based analysis."
             else:
                 error_note = f"LLM analysis failed: {error_msg[:200]}. Falling back to rule-based analysis."
             # Get rule-based analysis
             rule_based_analysis = self._analyze_with_rules(results_df)
             # Combine the error message with the rule-based analysis
             return f"## Analysis of Tibetan Text Similarity Metrics\n\n*Note: {error_note}*\n\n{rule_based_analysis}"
     def _prepare_dataframe(self, df: pd.DataFrame) -> pd.DataFrame:
         """
         Prepare the DataFrame for analysis.
         Args:
             df: Input DataFrame with similarity metrics
         Returns:
             pd.DataFrame: Cleaned and prepared DataFrame
         """
         # Make a copy to avoid modifying the original
         df = df.copy()
         # Clean text columns
         text_cols = ['Text A', 'Text B']
         for col in text_cols:
             if col in df.columns:
                 df[col] = df[col].fillna('Unknown').astype(str)
                 df[col] = df[col].str.replace(r'\.txt$', '', regex=True)
         # Filter out perfect matches (likely empty cells)
         metrics_cols = ['Jaccard Similarity (%)', 'Normalized LCS']
         if all(col in df.columns for col in metrics_cols):
-            mask = ~((df['Jaccard Similarity (%)'] == 100.0) &
                     (df['Normalized LCS'] == 1.0))
             df = df[mask].copy()
         return df
     def _analyze_with_llm(self, df: pd.DataFrame, max_tokens: int) -> str:
         """
         Analyze metrics using an LLM via OpenRouter API, with fallback models.
@@ -181,65 +181,65 @@ class LLMService:
             raise last_error
         else:
             raise Exception("LLM analysis failed for all available models.")
     def _analyze_with_rules(self, df: pd.DataFrame) -> str:
         """
         Analyze metrics using rule-based approach.
         Args:
             df: Prepared DataFrame with metrics
         Returns:
             str: Rule-based analysis in markdown format
         """
         analysis = ["## Tibetan Text Similarity Analysis (Rule-Based)"]
         # Basic stats
         text_a_col = 'Text A' if 'Text A' in df.columns else None
         text_b_col = 'Text B' if 'Text B' in df.columns else None
         if text_a_col and text_b_col:
             unique_texts = set(df[text_a_col].unique()) | set(df[text_b_col].unique())
             analysis.append(f"- **Texts analyzed:** {', '.join(sorted(unique_texts))}")
         # Analyze each metric
         metric_analyses = []
         if 'Jaccard Similarity (%)' in df.columns:
             jaccard_analysis = self._analyze_jaccard(df)
             metric_analyses.append(jaccard_analysis)
         if 'Normalized LCS' in df.columns:
             lcs_analysis = self._analyze_lcs(df)
             metric_analyses.append(lcs_analysis)
         # TF-IDF analysis removed
         # Add all metric analyses
         if metric_analyses:
             analysis.extend(metric_analyses)
         # Add overall interpretation
         analysis.append("\n## Overall Interpretation")
         analysis.append(self._generate_overall_interpretation(df))
         return "\n\n".join(analysis)
     def _analyze_jaccard(self, df: pd.DataFrame) -> str:
         """Analyze Jaccard similarity scores."""
         jaccard = df['Jaccard Similarity (%)'].dropna()
         if jaccard.empty:
             return ""
         mean_jaccard = jaccard.mean()
         max_jaccard = jaccard.max()
         min_jaccard = jaccard.min()
         analysis = [
             "### Jaccard Similarity Analysis",
             f"- **Range:** {min_jaccard:.1f}% to {max_jaccard:.1f}% (mean: {mean_jaccard:.1f}%)"
         ]
         # Interpret the scores
         if mean_jaccard > 60:
             analysis.append("- **High vocabulary overlap** suggests texts share significant content or are from the same tradition.")
@@ -247,7 +247,7 @@ class LLMService:
             analysis.append("- **Moderate vocabulary overlap** indicates some shared content or themes.")
         else:
             analysis.append("- **Low vocabulary overlap** suggests texts are on different topics or from different traditions.")
         # Add top pairs
         top_pairs = df.nlargest(3, 'Jaccard Similarity (%)')
         if not top_pairs.empty:
@@ -257,24 +257,24 @@ class LLMService:
                 text_b = row.get('Text B', 'Text 2')
                 score = row['Jaccard Similarity (%)']
                 analysis.append(f"- {text_a} ↔ {text_b}: {score:.1f}%")
         return "\n".join(analysis)
     def _analyze_lcs(self, df: pd.DataFrame) -> str:
         """Analyze Longest Common Subsequence scores."""
         lcs = df['Normalized LCS'].dropna()
         if lcs.empty:
             return ""
         mean_lcs = lcs.mean()
         max_lcs = lcs.max()
         min_lcs = lcs.min()
         analysis = [
             "### Structural Similarity (LCS) Analysis",
             f"- **Range:** {min_lcs:.2f} to {max_lcs:.2f} (mean: {mean_lcs:.2f})"
         ]
         # Interpret the scores
         if mean_lcs > 0.7:
             analysis.append("- **High structural similarity** suggests texts follow similar organizational patterns.")
@@ -282,7 +282,7 @@ class LLMService:
             analysis.append("- **Moderate structural similarity** indicates some shared organizational elements.")
         else:
             analysis.append("- **Low structural similarity** suggests different organizational approaches.")
         # Add top pairs
         top_pairs = df.nlargest(3, 'Normalized LCS')
         if not top_pairs.empty:
@@ -292,19 +292,19 @@ class LLMService:
                 text_b = row.get('Text B', 'Text 2')
                 score = row['Normalized LCS']
                 analysis.append(f"- {text_a} ↔ {text_b}: {score:.2f}")
         return "\n".join(analysis)
     # TF-IDF analysis method removed
     def _generate_overall_interpretation(self, df: pd.DataFrame) -> str:
         """Generate an overall interpretation of the metrics."""
         interpretations = []
         # Get metrics if they exist
         has_jaccard = 'Jaccard Similarity (%)' in df.columns
         has_lcs = 'Normalized LCS' in df.columns
         # Calculate means for available metrics
         metrics = {}
         if has_jaccard:
@@ -312,51 +312,51 @@ class LLMService:
         if has_lcs:
             metrics['lcs'] = df['Normalized LCS'].mean()
         # TF-IDF metrics removed
         # Generate interpretation based on metrics
         if metrics:
             interpretations.append("Based on the analysis of similarity metrics:")
             if has_jaccard and metrics['jaccard'] > 60:
                 interpretations.append("- The high Jaccard similarity indicates significant vocabulary overlap between texts, "
                                      "suggesting they may share common sources or be part of the same textual tradition.")
             if has_lcs and metrics['lcs'] > 0.7:
                 interpretations.append("- The high LCS score indicates strong structural similarity, "
                                      "suggesting the texts may follow similar organizational patterns or share common structural elements.")
             # TF-IDF interpretation removed
             # Add cross-metric interpretations
             if has_jaccard and has_lcs and metrics['jaccard'] > 60 and metrics['lcs'] > 0.7:
                 interpretations.append("\nThe combination of high Jaccard and LCS similarities strongly suggests "
                                      "that these texts are closely related, possibly being different versions or "
                                      "transmissions of the same work or sharing a common source.")
             # TF-IDF cross-metric interpretation removed
         # Add general guidance if no specific patterns found
         if not interpretations:
             interpretations.append("The analysis did not reveal strong patterns in the similarity metrics. "
                                  "This could indicate that the texts are either very similar or very different "
                                  "across all measured dimensions.")
         return "\n\n".join(interpretations)
     def _create_llm_prompt(self, df: pd.DataFrame, model_name: str) -> str:
         """
         Create a prompt for the LLM based on the DataFrame.
         Args:
             df: Prepared DataFrame with metrics
             model_name: Name of the model being used
         Returns:
             str: Formatted prompt for the LLM
         """
         # Convert DataFrame to markdown for the prompt
         md_table = df.to_markdown(index=False)
         # Create the prompt
         prompt = f"""
 # Tibetan Text Similarity Analysis
@@ -372,19 +372,19 @@ You will be provided with a table of text similarity scores in Markdown format.
 Your analysis will be performed using the `{model_name}` model. Provide a concise, scholarly analysis in well-structured markdown.
 """
         return prompt
     def _get_system_prompt(self) -> str:
         """Get the system prompt for the LLM."""
         return """You are a senior scholar of Tibetan Buddhist texts, specializing in textual criticism. Your task is to analyze the provided similarity metrics and provide expert insights into the relationships between these texts. Ground your analysis in the data, be precise, and focus on what the metrics reveal about the texts' transmission and history."""
     def _call_openrouter_api(self, model: str, prompt: str, system_message: str = None, max_tokens: int = None, temperature: float = None, top_p: float = None) -> str:
         """
         Call the OpenRouter API.
         Args:
             model: Model to use for the API call
             prompt: The user prompt
@@ -392,10 +392,10 @@ Your analysis will be performed using the `{model_name}` model. Provide a concis
             max_tokens: Maximum tokens for the response
             temperature: Sampling temperature
             top_p: Nucleus sampling parameter
         Returns:
             str: The API response
         Raises:
             ValueError: If API key is missing or invalid
             requests.exceptions.RequestException: For network-related errors
@@ -405,21 +405,21 @@ Your analysis will be performed using the `{model_name}` model. Provide a concis
             error_msg = "OpenRouter API key not provided. Please set the OPENROUTER_API_KEY environment variable."
             logger.error(error_msg)
             raise ValueError(error_msg)
         url = "https://openrouter.ai/api/v1/chat/completions"
         headers = {
             "Authorization": f"Bearer {self.api_key}",
             "Content-Type": "application/json",
             "HTTP-Referer": "https://github.com/daniel-wojahn/tibetan-text-metrics",
             "X-Title": "Tibetan Text Metrics"
         }
         messages = []
         if system_message:
             messages.append({"role": "system", "content": system_message})
         messages.append({"role": "user", "content": prompt})
         data = {
             "model": model,  # Use the model parameter here
             "messages": messages,
@@ -427,11 +427,11 @@ Your analysis will be performed using the `{model_name}` model. Provide a concis
             "temperature": temperature or self.temperature,
             "top_p": top_p or self.top_p,
         }
         try:
             logger.info(f"Calling OpenRouter API with model: {model}")
             response = requests.post(url, headers=headers, json=data, timeout=60)
             # Handle different HTTP status codes
             if response.status_code == 200:
                 result = response.json()
@@ -441,53 +441,53 @@ Your analysis will be performed using the `{model_name}` model. Provide a concis
                     error_msg = "Unexpected response format from OpenRouter API"
                     logger.error(f"{error_msg}: {result}")
                     raise ValueError(error_msg)
             elif response.status_code == 401:
                 error_msg = "Invalid OpenRouter API key. Please check your API key and try again."
                 logger.error(error_msg)
                 raise ValueError(error_msg)
             elif response.status_code == 402:
                 error_msg = "OpenRouter API payment required. Please check your OpenRouter account balance or billing status."
                 logger.error(error_msg)
                 raise ValueError(error_msg)
             elif response.status_code == 429:
                 error_msg = "API rate limit exceeded. Please try again later or check your OpenRouter rate limits."
                 logger.error(error_msg)
                 raise ValueError(error_msg)
             else:
                 error_msg = f"OpenRouter API error: {response.status_code} - {response.text}"
                 logger.error(error_msg)
                 raise Exception(error_msg)
         except requests.exceptions.RequestException as e:
             error_msg = f"Failed to connect to OpenRouter API: {str(e)}"
             logger.error(error_msg)
             raise Exception(error_msg) from e
         except json.JSONDecodeError as e:
             error_msg = f"Failed to parse OpenRouter API response: {str(e)}"
             logger.error(error_msg)
             raise Exception(error_msg) from e
     def _format_llm_response(self, response: str, df: pd.DataFrame, model_name: str) -> str:
         """
         Format the LLM response for display.
         Args:
             response: Raw LLM response
             df: Original DataFrame for reference
             model_name: Name of the model used
         Returns:
             str: Formatted response with fallback if needed
         """
         # Basic validation
         if not response or len(response) < 100:
             raise ValueError("Response too short or empty")
         # Check for garbled output (random numbers, nonsensical patterns)
         # This is a simple heuristic - look for long sequences of numbers or strange patterns
         suspicious_patterns = [
@@ -495,24 +495,24 @@ Your analysis will be performed using the `{model_name}` model. Provide a concis
             r'[0-9,.]{20,}',  # Long sequences of digits, commas and periods
             r'[\W]{20,}',  # Long sequences of non-word characters
         ]
         for pattern in suspicious_patterns:
             if re.search(pattern, response):
                 logger.warning(f"Detected potentially garbled output matching pattern: {pattern}")
                 # Don't immediately raise - we'll do a more comprehensive check
         # Check for content quality - ensure it has expected sections
         expected_content = [
             "introduction", "analysis", "similarity", "patterns", "conclusion", "question"
         ]
         # Count how many expected content markers we find
         content_matches = sum(1 for term in expected_content if term.lower() in response.lower())
         # If we find fewer than 3 expected content markers, log a warning
         if content_matches < 3:
             logger.warning(f"LLM response missing expected content sections (found {content_matches}/6)")
         # Check for text names from the dataset
         # Extract text names from the Text Pair column
         text_names = set()
@@ -521,22 +521,22 @@ Your analysis will be performed using the `{model_name}` model. Provide a concis
                 if isinstance(pair, str) and " vs " in pair:
                     texts = pair.split(" vs ")
                     text_names.update(texts)
         # Check if at least some text names appear in the response
         text_name_matches = sum(1 for name in text_names if name in response)
         if text_names and text_name_matches == 0:
             logger.warning("LLM response does not mention any of the text names from the dataset. The analysis may be generic.")
         # Ensure basic markdown structure
         if '##' not in response:
             response = f"## Analysis of Tibetan Text Similarity\n\n{response}"
         # Add styling to make the output more readable
         response = f"<div class='llm-analysis'>\n{response}\n</div>"
         # Format the response into a markdown block
         formatted_response = f"""## AI-Powered Analysis (Model: {model_name})\n\n{response}"""
         return formatted_response

     """
     Service for analyzing text similarity metrics using LLMs and rule-based methods.
     """
     def __init__(self, api_key: str = None):
         """
         Initialize the LLM service.
         Args:
             api_key: Optional API key for OpenRouter. If not provided, will try to load from environment.
         """
         self.models = PREFERRED_MODELS
         self.temperature = DEFAULT_TEMPERATURE
         self.top_p = DEFAULT_TOP_P
     def analyze_similarity(
+        self,
+        results_df: pd.DataFrame,
         use_llm: bool = True,
     ) -> str:
         """
         Analyze similarity metrics using either LLM or rule-based approach.
         Args:
             results_df: DataFrame containing similarity metrics
             use_llm: Whether to use LLM for analysis (falls back to rule-based if False or on error)
         Returns:
             str: Analysis of the metrics in markdown format with appropriate fallback messages
         """
         if not use_llm:
             logger.info("LLM analysis disabled. Using rule-based analysis.")
             return self._analyze_with_rules(results_df)
         # Try LLM analysis if enabled
         try:
             if not self.api_key:
                 raise ValueError("No OpenRouter API key provided. Please set the OPENROUTER_API_KEY environment variable.")
             logger.info("Attempting LLM-based analysis...")
             return self._analyze_with_llm(results_df, max_tokens=DEFAULT_MAX_TOKENS)
         except Exception as e:
             error_msg = str(e)
             logger.error(f"Error in LLM analysis: {error_msg}")
             # Create a user-friendly error message
             if "no openrouter api key" in error_msg.lower():
                 error_note = "OpenRouter API key not found. Please set the `OPENROUTER_API_KEY` environment variable to use this feature."
                 error_note = "API rate limit exceeded. Falling back to rule-based analysis."
             else:
                 error_note = f"LLM analysis failed: {error_msg[:200]}. Falling back to rule-based analysis."
             # Get rule-based analysis
             rule_based_analysis = self._analyze_with_rules(results_df)
             # Combine the error message with the rule-based analysis
             return f"## Analysis of Tibetan Text Similarity Metrics\n\n*Note: {error_note}*\n\n{rule_based_analysis}"
     def _prepare_dataframe(self, df: pd.DataFrame) -> pd.DataFrame:
         """
         Prepare the DataFrame for analysis.
         Args:
             df: Input DataFrame with similarity metrics
         Returns:
             pd.DataFrame: Cleaned and prepared DataFrame
         """
         # Make a copy to avoid modifying the original
         df = df.copy()
         # Clean text columns
         text_cols = ['Text A', 'Text B']
         for col in text_cols:
             if col in df.columns:
                 df[col] = df[col].fillna('Unknown').astype(str)
                 df[col] = df[col].str.replace('.txt$', '', regex=True)
         # Filter out perfect matches (likely empty cells)
         metrics_cols = ['Jaccard Similarity (%)', 'Normalized LCS']
         if all(col in df.columns for col in metrics_cols):
+            mask = ~((df['Jaccard Similarity (%)'] == 100.0) &
                     (df['Normalized LCS'] == 1.0))
             df = df[mask].copy()
         return df
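A note on the filename clean-up in `_prepare_dataframe`: with `regex=True`, the pattern `'.txt$'` treats the dot as a wildcard, so a name that ends in `txt` preceded by any character is trimmed as well. The sketch below shows the difference using the standard `re` module instead of pandas; `strip_txt_suffix` is a hypothetical helper for illustration and is not part of this commit.

```python
import re


def strip_txt_suffix(name: str, escaped: bool = True) -> str:
    """Remove a trailing '.txt' from a file name.

    escaped=False reproduces the unescaped pattern from the diff,
    where '.' matches any character, for comparison only.
    """
    pattern = r'\.txt$' if escaped else '.txt$'
    return re.sub(pattern, '', name)
```

With the escaped pattern, `strip_txt_suffix("text1.txt")` gives `"text1"` and `"chapter_txt"` is left alone; with `escaped=False`, `"chapter_txt"` is shortened to `"chapter"`.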
     def _analyze_with_llm(self, df: pd.DataFrame, max_tokens: int) -> str:
         """
         Analyze metrics using an LLM via OpenRouter API, with fallback models.
             raise last_error
         else:
             raise Exception("LLM analysis failed for all available models.")
     def _analyze_with_rules(self, df: pd.DataFrame) -> str:
         """
         Analyze metrics using rule-based approach.
         Args:
             df: Prepared DataFrame with metrics
         Returns:
             str: Rule-based analysis in markdown format
         """
         analysis = ["## Tibetan Text Similarity Analysis (Rule-Based)"]
         # Basic stats
         text_a_col = 'Text A' if 'Text A' in df.columns else None
         text_b_col = 'Text B' if 'Text B' in df.columns else None
         if text_a_col and text_b_col:
             unique_texts = set(df[text_a_col].unique()) | set(df[text_b_col].unique())
             analysis.append(f"- **Texts analyzed:** {', '.join(sorted(unique_texts))}")
         # Analyze each metric
         metric_analyses = []
         if 'Jaccard Similarity (%)' in df.columns:
             jaccard_analysis = self._analyze_jaccard(df)
             metric_analyses.append(jaccard_analysis)
         if 'Normalized LCS' in df.columns:
             lcs_analysis = self._analyze_lcs(df)
             metric_analyses.append(lcs_analysis)
         # TF-IDF analysis removed
         # Add all metric analyses
         if metric_analyses:
             analysis.extend(metric_analyses)
         # Add overall interpretation
         analysis.append("\n## Overall Interpretation")
         analysis.append(self._generate_overall_interpretation(df))
         return "\n\n".join(analysis)
     def _analyze_jaccard(self, df: pd.DataFrame) -> str:
         """Analyze Jaccard similarity scores."""
         jaccard = df['Jaccard Similarity (%)'].dropna()
         if jaccard.empty:
             return ""
         mean_jaccard = jaccard.mean()
         max_jaccard = jaccard.max()
         min_jaccard = jaccard.min()
         analysis = [
             "### Jaccard Similarity Analysis",
             f"- **Range:** {min_jaccard:.1f}% to {max_jaccard:.1f}% (mean: {mean_jaccard:.1f}%)"
         ]
         # Interpret the scores
         if mean_jaccard > 60:
             analysis.append("- **High vocabulary overlap** suggests texts share significant content or are from the same tradition.")
             analysis.append("- **Moderate vocabulary overlap** indicates some shared content or themes.")
         else:
             analysis.append("- **Low vocabulary overlap** suggests texts are on different topics or from different traditions.")
         # Add top pairs
         top_pairs = df.nlargest(3, 'Jaccard Similarity (%)')
         if not top_pairs.empty:
                 text_b = row.get('Text B', 'Text 2')
                 score = row['Jaccard Similarity (%)']
                 analysis.append(f"- {text_a} ↔ {text_b}: {score:.1f}%")
         return "\n".join(analysis)
     def _analyze_lcs(self, df: pd.DataFrame) -> str:
         """Analyze Longest Common Subsequence scores."""
         lcs = df['Normalized LCS'].dropna()
         if lcs.empty:
             return ""
         mean_lcs = lcs.mean()
         max_lcs = lcs.max()
         min_lcs = lcs.min()
         analysis = [
             "### Structural Similarity (LCS) Analysis",
             f"- **Range:** {min_lcs:.2f} to {max_lcs:.2f} (mean: {mean_lcs:.2f})"
         ]
         # Interpret the scores
         if mean_lcs > 0.7:
             analysis.append("- **High structural similarity** suggests texts follow similar organizational patterns.")
             analysis.append("- **Moderate structural similarity** indicates some shared organizational elements.")
         else:
             analysis.append("- **Low structural similarity** suggests different organizational approaches.")
         # Add top pairs
         top_pairs = df.nlargest(3, 'Normalized LCS')
         if not top_pairs.empty:
                 text_b = row.get('Text B', 'Text 2')
                 score = row['Normalized LCS']
                 analysis.append(f"- {text_a} ↔ {text_b}: {score:.2f}")
         return "\n".join(analysis)
     # TF-IDF analysis method removed
     def _generate_overall_interpretation(self, df: pd.DataFrame) -> str:
         """Generate an overall interpretation of the metrics."""
         interpretations = []
         # Get metrics if they exist
         has_jaccard = 'Jaccard Similarity (%)' in df.columns
         has_lcs = 'Normalized LCS' in df.columns
         # Calculate means for available metrics
         metrics = {}
         if has_jaccard:
         if has_lcs:
             metrics['lcs'] = df['Normalized LCS'].mean()
         # TF-IDF metrics removed
         # Generate interpretation based on metrics
         if metrics:
             interpretations.append("Based on the analysis of similarity metrics:")
             if has_jaccard and metrics['jaccard'] > 60:
                 interpretations.append("- The high Jaccard similarity indicates significant vocabulary overlap between texts, "
                                      "suggesting they may share common sources or be part of the same textual tradition.")
             if has_lcs and metrics['lcs'] > 0.7:
                 interpretations.append("- The high LCS score indicates strong structural similarity, "
                                      "suggesting the texts may follow similar organizational patterns or share common structural elements.")
             # TF-IDF interpretation removed
             # Add cross-metric interpretations
             if has_jaccard and has_lcs and metrics['jaccard'] > 60 and metrics['lcs'] > 0.7:
                 interpretations.append("\nThe combination of high Jaccard and LCS similarities strongly suggests "
                                      "that these texts are closely related, possibly being different versions or "
                                      "transmissions of the same work or sharing a common source.")
             # TF-IDF cross-metric interpretation removed
         # Add general guidance if no specific patterns found
         if not interpretations:
             interpretations.append("The analysis did not reveal strong patterns in the similarity metrics. "
                                  "This could indicate that the texts are either very similar or very different "
                                  "across all measured dimensions.")
         return "\n\n".join(interpretations)
     def _create_llm_prompt(self, df: pd.DataFrame, model_name: str) -> str:
         """
         Create a prompt for the LLM based on the DataFrame.
         Args:
             df: Prepared DataFrame with metrics
             model_name: Name of the model being used
         Returns:
             str: Formatted prompt for the LLM
         """
         # Convert DataFrame to markdown for the prompt
         md_table = df.to_markdown(index=False)
         # Create the prompt
         prompt = f"""
 # Tibetan Text Similarity Analysis
 Your analysis will be performed using the `{model_name}` model. Provide a concise, scholarly analysis in well-structured markdown.
 """
         return prompt
     def _get_system_prompt(self) -> str:
         """Get the system prompt for the LLM."""
         return """You are a senior scholar of Tibetan Buddhist texts, specializing in textual criticism. Your task is to analyze the provided similarity metrics and provide expert insights into the relationships between these texts. Ground your analysis in the data, be precise, and focus on what the metrics reveal about the texts' transmission and history."""
     def _call_openrouter_api(self, model: str, prompt: str, system_message: str = None, max_tokens: int = None, temperature: float = None, top_p: float = None) -> str:
         """
         Call the OpenRouter API.
         Args:
             model: Model to use for the API call
             prompt: The user prompt
             max_tokens: Maximum tokens for the response
             temperature: Sampling temperature
             top_p: Nucleus sampling parameter
         Returns:
             str: The API response
         Raises:
             ValueError: If API key is missing or invalid
             requests.exceptions.RequestException: For network-related errors
             error_msg = "OpenRouter API key not provided. Please set the OPENROUTER_API_KEY environment variable."
             logger.error(error_msg)
             raise ValueError(error_msg)
         url = "https://openrouter.ai/api/v1/chat/completions"
         headers = {
             "Authorization": f"Bearer {self.api_key}",
             "Content-Type": "application/json",
             "HTTP-Referer": "https://github.com/daniel-wojahn/tibetan-text-metrics",
             "X-Title": "Tibetan Text Metrics"
         }
         messages = []
         if system_message:
             messages.append({"role": "system", "content": system_message})
         messages.append({"role": "user", "content": prompt})
         data = {
             "model": model,  # Use the model parameter here
             "messages": messages,
             "temperature": temperature or self.temperature,
             "top_p": top_p or self.top_p,
         }
         try:
             logger.info(f"Calling OpenRouter API with model: {model}")
             response = requests.post(url, headers=headers, json=data, timeout=60)
             # Handle different HTTP status codes
             if response.status_code == 200:
                 result = response.json()
                     error_msg = "Unexpected response format from OpenRouter API"
                     logger.error(f"{error_msg}: {result}")
                     raise ValueError(error_msg)
             elif response.status_code == 401:
                 error_msg = "Invalid OpenRouter API key. Please check your API key and try again."
                 logger.error(error_msg)
                 raise ValueError(error_msg)
             elif response.status_code == 402:
                 error_msg = "OpenRouter API payment required. Please check your OpenRouter account balance or billing status."
                 logger.error(error_msg)
                 raise ValueError(error_msg)
             elif response.status_code == 429:
                 error_msg = "API rate limit exceeded. Please try again later or check your OpenRouter rate limits."
                 logger.error(error_msg)
                 raise ValueError(error_msg)
             else:
                 error_msg = f"OpenRouter API error: {response.status_code} - {response.text}"
                 logger.error(error_msg)
                 raise Exception(error_msg)
         except requests.exceptions.RequestException as e:
             error_msg = f"Failed to connect to OpenRouter API: {str(e)}"
             logger.error(error_msg)
             raise Exception(error_msg) from e
         except json.JSONDecodeError as e:
             error_msg = f"Failed to parse OpenRouter API response: {str(e)}"
             logger.error(error_msg)
             raise Exception(error_msg) from e
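One detail of the request payload in `_call_openrouter_api` deserves a note: `temperature or self.temperature` falls back to the default whenever the argument is falsy, so an explicit `0.0` is replaced too. Below is a minimal sketch of a variant that falls back only on `None`. The helper `build_payload` and the two `SKETCH_*` defaults are hypothetical, chosen for illustration, and do not reflect the values used in the repository.

```python
from typing import Optional

# Illustrative defaults only; the real constants live in the service module.
SKETCH_TEMPERATURE = 0.3
SKETCH_TOP_P = 0.9


def build_payload(
    model: str,
    prompt: str,
    system_message: Optional[str] = None,
    temperature: Optional[float] = None,
    top_p: Optional[float] = None,
) -> dict:
    """Assemble a chat-completions request body without sending it."""
    messages = []
    if system_message:
        messages.append({"role": "system", "content": system_message})
    messages.append({"role": "user", "content": prompt})
    return {
        "model": model,
        "messages": messages,
        # 'is None' keeps an explicit 0.0 instead of silently overriding it
        "temperature": SKETCH_TEMPERATURE if temperature is None else temperature,
        "top_p": SKETCH_TOP_P if top_p is None else top_p,
    }
```

Keeping payload construction separate from the HTTP call also makes this part testable without network access.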
     def _format_llm_response(self, response: str, df: pd.DataFrame, model_name: str) -> str:
         """
         Format the LLM response for display.
         Args:
             response: Raw LLM response
             df: Original DataFrame for reference
             model_name: Name of the model used
         Returns:
             str: Formatted response with fallback if needed
         """
         # Basic validation
         if not response or len(response) < 100:
             raise ValueError("Response too short or empty")
         # Check for garbled output (random numbers, nonsensical patterns)
         # This is a simple heuristic - look for long sequences of numbers or strange patterns
         suspicious_patterns = [
             r'[0-9,.]{20,}',  # Long sequences of digits, commas and periods
             r'[\W]{20,}',  # Long sequences of non-word characters
         ]
         for pattern in suspicious_patterns:
             if re.search(pattern, response):
                 logger.warning(f"Detected potentially garbled output matching pattern: {pattern}")
                 # Don't immediately raise - we'll do a more comprehensive check
         # Check for content quality - ensure it has expected sections
         expected_content = [
             "introduction", "analysis", "similarity", "patterns", "conclusion", "question"
         ]
         # Count how many expected content markers we find
         content_matches = sum(1 for term in expected_content if term.lower() in response.lower())
         # If we find fewer than 3 expected content markers, log a warning
         if content_matches < 3:
             logger.warning(f"LLM response missing expected content sections (found {content_matches}/6)")
         # Check for text names from the dataset
         # Extract text names from the Text Pair column
         text_names = set()
                 if isinstance(pair, str) and " vs " in pair:
                     texts = pair.split(" vs ")
                     text_names.update(texts)
         # Check if at least some text names appear in the response
         text_name_matches = sum(1 for name in text_names if name in response)
         if text_names and text_name_matches == 0:
             logger.warning("LLM response does not mention any of the text names from the dataset. The analysis may be generic.")
         # Ensure basic markdown structure
         if '##' not in response:
             response = f"## Analysis of Tibetan Text Similarity\n\n{response}"
         # Add styling to make the output more readable
         response = f"<div class='llm-analysis'>\n{response}\n</div>"
         # Format the response into a markdown block
         formatted_response = f"""## AI-Powered Analysis (Model: {model_name})\n\n{response}"""
         return formatted_response
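The response checks in `_format_llm_response` can be exercised on their own. The following is a minimal sketch, assuming a hypothetical helper `check_llm_response` that is not part of the commit. It mirrors the length check, the keyword count, and only the two regex patterns visible in the diff; any pattern hidden in the elided hunk is not reproduced. Unlike the original, it returns its findings instead of logging them.

```python
import re

# Only the two patterns shown in the diff
SUSPICIOUS_PATTERNS = [
    r'[0-9,.]{20,}',  # long runs of digits, commas and periods
    r'[\W]{20,}',     # long runs of non-word characters
]
EXPECTED_CONTENT = [
    "introduction", "analysis", "similarity", "patterns", "conclusion", "question"
]


def check_llm_response(response: str) -> dict:
    """Return which heuristics a raw LLM response trips."""
    if not response or len(response) < 100:
        raise ValueError("Response too short or empty")
    flagged = [p for p in SUSPICIOUS_PATTERNS if re.search(p, response)]
    lowered = response.lower()
    content_matches = sum(1 for term in EXPECTED_CONTENT if term in lowered)
    return {"flagged_patterns": flagged, "content_matches": content_matches}
```

As in the original method, the heuristics only inform. A flagged pattern or a low keyword count does not reject the response by itself.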

pipeline/metrics.py CHANGED

@@ -4,14 +4,19 @@ from typing import List, Dict, Union
 from itertools import combinations
 from sklearn.metrics.pairwise import cosine_similarity
-from thefuzz import fuzz
 from .hf_embedding import generate_embeddings as generate_hf_embeddings
 from .stopwords_bo import TIBETAN_STOPWORDS_SET
 from .stopwords_lite_bo import TIBETAN_STOPWORDS_LITE_SET
 import logging
 # Attempt to import the Cython-compiled fast_lcs module
 try:
     from .fast_lcs import compute_lcs_fast
@@ -25,19 +30,37 @@ logger = logging.getLogger(__name__)
-def compute_normalized_lcs(words1: List[str], words2: List[str]) -> float:
-    # Calculate m and n (lengths) here, so they are available for normalization
-    # regardless of which LCS implementation is used.
     m, n = len(words1), len(words2)
     if USE_CYTHON_LCS:
-        # Use the Cython-compiled version if available
         lcs_length = compute_lcs_fast(words1, words2)
     else:
-        # Fallback to pure Python implementation
-        # m, n = len(words1), len(words2) # Moved to the beginning of the function
-        # Using numpy array for dp table can be slightly faster than list of lists for large inputs
-        # but the primary bottleneck is the Python loop itself compared to Cython.
         dp = np.zeros((m + 1, n + 1), dtype=np.int32)
         for i in range(1, m + 1):
@@ -47,63 +70,192 @@ def compute_normalized_lcs(words1: List[str], words2: List[str]) -> float:
                 else:
                     dp[i, j] = max(dp[i - 1, j], dp[i, j - 1])
         lcs_length = int(dp[m, n])
-    avg_length = (m + n) / 2
-    return lcs_length / avg_length if avg_length > 0 else 0.0
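For reference, the removed lines above normalize the LCS length by the average of the two token counts. Here is a self-contained sketch of that calculation, using plain lists instead of the NumPy table so it runs without dependencies. The match branch of the recurrence is elided in the hunk and is assumed here to be the standard LCS update; `normalized_lcs_avg` is a hypothetical name.

```python
from typing import List


def normalized_lcs_avg(words1: List[str], words2: List[str]) -> float:
    """LCS length divided by the average length of the two token lists."""
    m, n = len(words1), len(words2)
    # dp[i][j] holds the LCS length of words1[:i] and words2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if words1[i - 1] == words2[j - 1]:
                # assumed standard match step (not visible in the hunk)
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    avg_length = (m + n) / 2
    return dp[m][n] / avg_length if avg_length > 0 else 0.0
```

For example, `["a", "b", "c", "d"]` against `["a", "c", "d"]` has an LCS of length 3 and an average length of 3.5, so the score is 3 / 3.5.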
-def compute_fuzzy_similarity(words1: List[str], words2: List[str], method: str = 'token_set') -> float:
     """
-    Computes fuzzy string similarity between two lists of words using TheFuzz.
     Args:
         words1: First list of tokens
         words2: Second list of tokens
         method: The fuzzy matching method to use:
-                'token_set' - Order-independent token matching (default)
-                'token_sort' - Order-normalized token matching
-                'partial' - Best partial token matching
-                'ratio' - Simple ratio matching
     Returns:
         float: Fuzzy similarity score between 0.0 and 1.0
     """
     if not words1 or not words2:
         return 0.0
-    # Join tokens into strings for fuzzy matching
-    text1 = " ".join(words1)
-    text2 = " ".join(words2)
-    # Apply the selected fuzzy matching method
-    if method == 'token_set':
-        # Best for texts with different word orders and partial overlaps
-        score = fuzz.token_set_ratio(text1, text2)
-    elif method == 'token_sort':
-        # Good for texts with different word orders but similar content
-        score = fuzz.token_sort_ratio(text1, text2)
-    elif method == 'partial':
-        # Best for finding shorter strings within longer ones
-        score = fuzz.partial_ratio(text1, text2)
-    else:  # 'ratio'
-        # Simple Levenshtein distance ratio
-        score = fuzz.ratio(text1, text2)
-    # Convert score from 0-100 scale to 0-1 scale
-    return score / 100.0
 def compute_semantic_similarity(
     text1_segment: str,
     text2_segment: str,
-    tokens1: List[str],
-    tokens2: List[str],
     model,
     batch_size: int = 32,
     show_progress_bar: bool = False
 ) -> float:
-    """Computes semantic similarity using a Sentence Transformer model only."""
     if model is None:
         logger.warning(
             "Embedding model not available for semantic similarity. Skipping calculation."
@@ -116,38 +268,27 @@ def compute_semantic_similarity(
         )
         return 0.0
-    def _get_aggregated_embedding(
-        raw_text_segment: str,
-        _botok_tokens: List[str],
-        model_obj,
-        batch_size_param: int,
-        show_progress_bar_param: bool
-    ) -> Union[np.ndarray, None]:
-        """Helper to get a single embedding for a text using Sentence Transformers."""
-        if not raw_text_segment.strip():
-            logger.info(
-                f"Text segment is empty or only whitespace: {raw_text_segment[:100]}... Returning None for embedding."
-            )
             return None
         embedding = generate_hf_embeddings(
-            texts=[raw_text_segment],
-            model=model_obj,
-            batch_size=batch_size_param,
-            show_progress_bar=show_progress_bar_param
         )
-        if embedding is None or embedding.size == 0:
-            logger.error(
-                f"Failed to generate embedding for text: {raw_text_segment[:100]}..."
-            )
             return None
         return embedding
     try:
-        # Pass all relevant parameters to _get_aggregated_embedding
-        emb1 = _get_aggregated_embedding(text1_segment, tokens1, model, batch_size, show_progress_bar)
-        emb2 = _get_aggregated_embedding(text2_segment, tokens2, model, batch_size, show_progress_bar)
         if emb1 is None or emb2 is None or emb1.size == 0 or emb2.size == 0:
             logger.error(
@@ -168,7 +309,7 @@ def compute_semantic_similarity(
         if np.all(emb1 == 0) or np.all(emb2 == 0):
             logger.info("One of the embeddings is zero. Semantic similarity is 0.0.")
             return 0.0
         # Handle NaN or Inf in embeddings
         if np.isnan(emb1).any() or np.isinf(emb1).any() or \
            np.isnan(emb2).any() or np.isinf(emb2).any():
@@ -180,9 +321,9 @@ def compute_semantic_similarity(
             emb1 = emb1.reshape(1, -1)
         if emb2.ndim == 1:
             emb2 = emb2.reshape(1, -1)
         similarity_score = cosine_similarity(emb1, emb2)[0][0]
         return max(0.0, float(similarity_score))
     except Exception as e:
@@ -202,8 +343,10 @@ def compute_all_metrics(
     enable_semantic: bool = True,
     enable_fuzzy: bool = True,
     fuzzy_method: str = 'token_set',
     use_stopwords: bool = True,
     use_lite_stopwords: bool = False,
     batch_size: int = 32,
     show_progress_bar: bool = False
 ) -> pd.DataFrame:
@@ -218,10 +361,13 @@ def compute_all_metrics(
                                               Defaults to None.
         enable_semantic (bool): Whether to compute semantic similarity. Defaults to True.
         enable_fuzzy (bool): Whether to compute fuzzy string similarity. Defaults to True.
-        fuzzy_method (str): The fuzzy matching method to use ('token_set', 'token_sort', 'partial', 'ratio').
                            Defaults to 'token_set'.
         use_stopwords (bool): Whether to filter stopwords for Jaccard similarity. Defaults to True.
         use_lite_stopwords (bool): Whether to use the lite version of stopwords. Defaults to False.
         batch_size (int): Batch size for semantic similarity computation. Defaults to 32.
         show_progress_bar (bool): Whether to show progress bar for semantic similarity. Defaults to False.
@@ -232,14 +378,7 @@ def compute_all_metrics(
     """
     files = list(texts.keys())
     results = []
-    corpus_for_sklearn_tfidf = []  # Kept for potential future use
-    for fname, content in texts.items():
-        # Use the pre-computed tokens from the token_lists dictionary
-        current_tokens_for_file = token_lists.get(fname, [])
-        corpus_for_sklearn_tfidf.append(" ".join(current_tokens_for_file) if current_tokens_for_file else "")
     for i, j in combinations(range(len(files)), 2):
         f1, f2 = files[i], files[j]
         words1_raw, words2_raw = token_lists[f1], token_lists[f2]
@@ -254,21 +393,33 @@ def compute_all_metrics(
         else:
             # If stopwords are disabled, use an empty set
             stopwords_set_to_use = set()
-        # Filter stopwords for Jaccard calculation
-        words1_jaccard = [word for word in words1_raw if word not in stopwords_set_to_use]
-        words2_jaccard = [word for word in words2_raw if word not in stopwords_set_to_use]
         jaccard = (
             len(set(words1_jaccard) & set(words2_jaccard)) / len(set(words1_jaccard) | set(words2_jaccard))
             if set(words1_jaccard) | set(words2_jaccard)  # Ensure denominator is not zero
             else 0.0
         )
-        # LCS uses raw tokens (words1_raw, words2_raw) to provide a complementary metric.
-        # Semantic similarity also uses raw text and its botok tokens for chunking decisions.
         jaccard_percent = jaccard * 100.0
-        norm_lcs = compute_normalized_lcs(words1_raw, words2_raw)
         # Fuzzy Similarity Calculation
         if enable_fuzzy:
             fuzzy_sim = compute_fuzzy_similarity(words1_jaccard, words2_jaccard, method=fuzzy_method)
@@ -277,9 +428,8 @@ def compute_all_metrics(
         # Semantic Similarity Calculation
         if enable_semantic:
-            # Pass raw texts and their pre-computed botok tokens
             semantic_sim = compute_semantic_similarity(
-                texts[f1], texts[f2], words1_raw, words2_raw, model,
                 batch_size=batch_size,
                 show_progress_bar=show_progress_bar
             )

 from itertools import combinations
 from sklearn.metrics.pairwise import cosine_similarity
 from .hf_embedding import generate_embeddings as generate_hf_embeddings
 from .stopwords_bo import TIBETAN_STOPWORDS_SET
 from .stopwords_lite_bo import TIBETAN_STOPWORDS_LITE_SET
+from .normalize_bo import normalize_particles
 import logging
+def _normalize_token_for_stopwords(token: str) -> str:
+    """Normalize token by removing trailing tsek for stopword matching."""
+    return token.rstrip('་')
 # Attempt to import the Cython-compiled fast_lcs module
 try:
     from .fast_lcs import compute_lcs_fast
+def compute_normalized_lcs(words1: List[str], words2: List[str], normalization: str = "avg") -> float:
+    """
+    Computes the Longest Common Subsequence (LCS) similarity between two token lists.
+    Args:
+        words1: First list of tokens
+        words2: Second list of tokens
+        normalization: How to normalize the LCS length. Options:
+            'avg' - Divide by average length (default, balanced)
+            'min' - Divide by shorter text (detects if one text contains the other)
+            'max' - Divide by longer text (stricter, penalizes length differences)
+    Returns:
+        float: Normalized LCS score between 0.0 and 1.0
+    Note on normalization choice:
+        - 'avg': Good general-purpose choice, treats both texts equally
+        - 'min': Use when looking for containment (e.g., quotes within commentary)
+                 Can return 1.0 if shorter text is fully contained in longer
+        - 'max': Use when you want to penalize length differences
+                 Will be lower when texts have very different lengths
+    """
     m, n = len(words1), len(words2)
+    if m == 0 or n == 0:
+        return 0.0
     if USE_CYTHON_LCS:
         lcs_length = compute_lcs_fast(words1, words2)
     else:
+        # Pure Python implementation using dynamic programming
         dp = np.zeros((m + 1, n + 1), dtype=np.int32)
         for i in range(1, m + 1):
                 else:
                     dp[i, j] = max(dp[i - 1, j], dp[i, j - 1])
         lcs_length = int(dp[m, n])
+    # Apply selected normalization
+    if normalization == "min":
+        divisor = min(m, n)
+    elif normalization == "max":
+        divisor = max(m, n)
+    else:  # "avg" (default)
+        divisor = (m + n) / 2
+    return lcs_length / divisor if divisor > 0 else 0.0
+def compute_ngram_similarity(tokens1: List[str], tokens2: List[str], n: int = 2) -> float:
+    """
+    Computes syllable/token n-gram overlap similarity (Jaccard on n-grams).
+    This is more effective for Tibetan than character-level fuzzy matching because
+    it preserves syllable boundaries and captures local word patterns.
+    Args:
+        tokens1: First list of tokens (syllables or words)
+        tokens2: Second list of tokens (syllables or words)
+        n: Size of n-grams (default: 2 for bigrams)
+    Returns:
+        float: N-gram similarity score between 0.0 and 1.0
+    """
+    if not tokens1 or not tokens2:
+        return 0.0
+    # Handle edge case where text is shorter than n
+    if len(tokens1) < n or len(tokens2) < n:
+        # Fall back to unigram comparison
+        set1, set2 = set(tokens1), set(tokens2)
+        if not set1 or not set2:
+            return 0.0
+        intersection = len(set1 & set2)
+        union = len(set1 | set2)
+        return intersection / union if union > 0 else 0.0
+    def get_ngrams(tokens: List[str], size: int) -> set:
+        return set(tuple(tokens[i:i+size]) for i in range(len(tokens) - size + 1))
+    ngrams1 = get_ngrams(tokens1, n)
+    ngrams2 = get_ngrams(tokens2, n)
+    intersection = len(ngrams1 & ngrams2)
+    union = len(ngrams1 | ngrams2)
+    return intersection / union if union > 0 else 0.0
+def compute_syllable_edit_similarity(syls1: List[str], syls2: List[str]) -> float:
     """
+    Computes edit distance at the syllable/token level rather than character level.
+    This is more appropriate for Tibetan because:
+    - Tibetan syllables are meaningful units (unlike individual characters)
+    - Character-level Levenshtein over-penalizes syllable differences
+    - Syllable-level comparison better captures textual variation patterns
+    Args:
+        syls1: First list of syllables/tokens
+        syls2: Second list of syllables/tokens
+    Returns:
+        float: Syllable-level similarity score between 0.0 and 1.0
+    """
+    if not syls1 and not syls2:
+        return 1.0
+    if not syls1 or not syls2:
+        return 0.0
+    m, n = len(syls1), len(syls2)
+    # Create DP table for syllable-level edit distance
+    dp = np.zeros((m + 1, n + 1), dtype=np.int32)
+    # Initialize base cases
+    for i in range(m + 1):
+        dp[i, 0] = i
+    for j in range(n + 1):
+        dp[0, j] = j
+    # Fill DP table
+    for i in range(1, m + 1):
+        for j in range(1, n + 1):
+            if syls1[i - 1] == syls2[j - 1]:
+                dp[i, j] = dp[i - 1, j - 1]
+            else:
+                dp[i, j] = 1 + min(
+                    dp[i - 1, j],      # deletion
+                    dp[i, j - 1],      # insertion
+                    dp[i - 1, j - 1]   # substitution
+                )
+    edit_distance = dp[m, n]
+    max_len = max(m, n)
+    return 1.0 - (edit_distance / max_len) if max_len > 0 else 1.0
+def compute_weighted_jaccard(tokens1: List[str], tokens2: List[str]) -> float:
+    """
+    Computes weighted Jaccard similarity using token frequencies.
+    Unlike standard Jaccard which treats all tokens as binary (present/absent),
+    this considers how often each token appears, giving more weight to
+    frequently shared terms.
+    Args:
+        tokens1: First list of tokens
+        tokens2: Second list of tokens
+    Returns:
+        float: Weighted Jaccard similarity between 0.0 and 1.0
+    """
+    from collections import Counter
+    if not tokens1 or not tokens2:
+        return 0.0
+    c1, c2 = Counter(tokens1), Counter(tokens2)
+    # Intersection: min count for each shared token
+    intersection = sum((c1 & c2).values())
+    # Union: max count for each token
+    union = sum((c1 | c2).values())
+    return intersection / union if union > 0 else 0.0
+def compute_fuzzy_similarity(words1: List[str], words2: List[str], method: str = 'ngram') -> float:
+    """
+    Computes fuzzy string similarity between two lists of words.
+    All methods work at the syllable/token level, which is linguistically
+    appropriate for Tibetan text.
     Args:
         words1: First list of tokens
         words2: Second list of tokens
         method: The fuzzy matching method to use:
+                'ngram' - Syllable bigram overlap (default, recommended)
+                'syllable_edit' - Syllable-level edit distance
+                'weighted_jaccard' - Frequency-weighted Jaccard
     Returns:
         float: Fuzzy similarity score between 0.0 and 1.0
     """
     if not words1 or not words2:
         return 0.0
+    if method == 'ngram':
+        # Syllable bigram overlap - good for detecting shared phrases
+        return compute_ngram_similarity(words1, words2, n=2)
+    elif method == 'syllable_edit':
+        # Syllable-level edit distance - good for detecting minor variations
+        return compute_syllable_edit_similarity(words1, words2)
+    elif method == 'weighted_jaccard':
+        # Frequency-weighted Jaccard - good for repeated terms
+        return compute_weighted_jaccard(words1, words2)
+    else:
+        # Default to ngram for any unrecognized method
+        return compute_ngram_similarity(words1, words2, n=2)
 def compute_semantic_similarity(
     text1_segment: str,
     text2_segment: str,
     model,
     batch_size: int = 32,
     show_progress_bar: bool = False
 ) -> float:
+    """
+    Computes semantic similarity using a Sentence Transformer model.
+    Args:
+        text1_segment: First text segment
+        text2_segment: Second text segment
+        model: Pre-loaded SentenceTransformer model
+        batch_size: Batch size for encoding
+        show_progress_bar: Whether to show progress bar
+    Returns:
+        float: Cosine similarity between embeddings (0.0 to 1.0), or np.nan on error
+    """
     if model is None:
         logger.warning(
             "Embedding model not available for semantic similarity. Skipping calculation."
         )
         return 0.0
+    def _get_embedding(raw_text: str) -> Union[np.ndarray, None]:
+        """Helper to get embedding for a single text."""
+        if not raw_text.strip():
+            logger.info("Text is empty or whitespace. Returning None.")
             return None
         embedding = generate_hf_embeddings(
+            texts=[raw_text],
+            model=model,
+            batch_size=batch_size,
+            show_progress_bar=show_progress_bar
         )
+        if embedding is None or embedding.size == 0:
+            logger.error(f"Failed to generate embedding for text: {raw_text[:100]}...")
             return None
         return embedding
     try:
+        emb1 = _get_embedding(text1_segment)
+        emb2 = _get_embedding(text2_segment)
         if emb1 is None or emb2 is None or emb1.size == 0 or emb2.size == 0:
             logger.error(
         if np.all(emb1 == 0) or np.all(emb2 == 0):
             logger.info("One of the embeddings is zero. Semantic similarity is 0.0.")
             return 0.0
         # Handle NaN or Inf in embeddings
         if np.isnan(emb1).any() or np.isinf(emb1).any() or \
            np.isnan(emb2).any() or np.isinf(emb2).any():
             emb1 = emb1.reshape(1, -1)
         if emb2.ndim == 1:
             emb2 = emb2.reshape(1, -1)
         similarity_score = cosine_similarity(emb1, emb2)[0][0]
         return max(0.0, float(similarity_score))
     except Exception as e:
     enable_semantic: bool = True,
     enable_fuzzy: bool = True,
     fuzzy_method: str = 'token_set',
+    lcs_normalization: str = 'avg',
     use_stopwords: bool = True,
     use_lite_stopwords: bool = False,
+    normalize_particles_opt: bool = False,
     batch_size: int = 32,
     show_progress_bar: bool = False
 ) -> pd.DataFrame:
                                               Defaults to None.
         enable_semantic (bool): Whether to compute semantic similarity. Defaults to True.
         enable_fuzzy (bool): Whether to compute fuzzy string similarity. Defaults to True.
+        fuzzy_method (str): The fuzzy matching method to use ('ngram', 'syllable_edit', 'weighted_jaccard').
                            Defaults to 'token_set', which compute_fuzzy_similarity no longer recognizes, so it falls back to 'ngram'.
+        lcs_normalization (str): How to normalize LCS ('avg', 'min', 'max'). Defaults to 'avg'.
         use_stopwords (bool): Whether to filter stopwords for Jaccard similarity. Defaults to True.
         use_lite_stopwords (bool): Whether to use the lite version of stopwords. Defaults to False.
+        normalize_particles_opt (bool): Whether to normalize grammatical particles (གི/ཀྱི/གྱི → གི).
+                                       Reduces false negatives from sandhi variation. Defaults to False.
         batch_size (int): Batch size for semantic similarity computation. Defaults to 32.
         show_progress_bar (bool): Whether to show progress bar for semantic similarity. Defaults to False.
     """
     files = list(texts.keys())
     results = []
     for i, j in combinations(range(len(files)), 2):
         f1, f2 = files[i], files[j]
         words1_raw, words2_raw = token_lists[f1], token_lists[f2]
         else:
             # If stopwords are disabled, use an empty set
             stopwords_set_to_use = set()
+        # Filter stopwords for the Jaccard and fuzzy metrics (trailing tsek is stripped only for the stopword lookup)
+        words1_filtered = [word for word in words1_raw if _normalize_token_for_stopwords(word) not in stopwords_set_to_use]
+        words2_filtered = [word for word in words2_raw if _normalize_token_for_stopwords(word) not in stopwords_set_to_use]
+        # Apply particle normalization if enabled
+        if normalize_particles_opt:
+            words1_jaccard = normalize_particles(words1_filtered)
+            words2_jaccard = normalize_particles(words2_filtered)
+            words1_lcs = normalize_particles(words1_raw)
+            words2_lcs = normalize_particles(words2_raw)
+        else:
+            words1_jaccard = words1_filtered
+            words2_jaccard = words2_filtered
+            words1_lcs = words1_raw
+            words2_lcs = words2_raw
         jaccard = (
             len(set(words1_jaccard) & set(words2_jaccard)) / len(set(words1_jaccard) | set(words2_jaccard))
             if set(words1_jaccard) | set(words2_jaccard)  # Ensure denominator is not zero
             else 0.0
         )
         jaccard_percent = jaccard * 100.0
+        # LCS uses the unfiltered tokens (stopwords kept), with optional particle normalization
+        norm_lcs = compute_normalized_lcs(words1_lcs, words2_lcs, normalization=lcs_normalization)
         # Fuzzy Similarity Calculation
         if enable_fuzzy:
             fuzzy_sim = compute_fuzzy_similarity(words1_jaccard, words2_jaccard, method=fuzzy_method)
         # Semantic Similarity Calculation
         if enable_semantic:
             semantic_sim = compute_semantic_similarity(
+                texts[f1], texts[f2], model,
                 batch_size=batch_size,
                 show_progress_bar=show_progress_bar
             )
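
Reviewer note: the new `normalization` argument of `compute_normalized_lcs` is easiest to judge on a small case. The following is a minimal sketch, assuming a pure-Python LCS; the helper names `lcs_length` and `normalized_lcs` and the Latin placeholder tokens are hypothetical and are not part of the commit. Only the divisor logic is taken from the diff.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, computed row by row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def normalized_lcs(a: List[str], b: List[str], normalization: str = "avg") -> float:
    """Same divisor choice as in the diff: 'min', 'max', anything else is the average."""
    m, n = len(a), len(b)
    if m == 0 or n == 0:
        return 0.0
    if normalization == "min":
        divisor = min(m, n)
    elif normalization == "max":
        divisor = max(m, n)
    else:
        divisor = (m + n) / 2
    return lcs_length(a, b) / divisor


# Placeholder tokens: a short "quote" fully contained, in order, in a longer "commentary".
quote = ["a", "b", "c"]
commentary = ["x", "a", "b", "y", "c", "z"]

scores = {mode: normalized_lcs(quote, commentary, mode) for mode in ("avg", "min", "max")}
print(scores)
```

With an LCS length of 3 and token counts of 3 and 6, `'min'` gives 3/3, `'max'` gives 3/6, and `'avg'` gives 3/4.5, which matches the docstring's description of `'min'` as a containment check.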

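Reviewer note: two of the token-level "fuzzy" methods introduced above respond to different kinds of variation. The sketch below is an illustration under stated assumptions, not the commit's code: `ngram_set`, `bigram_jaccard`, and `weighted_jaccard` are hypothetical names, and the Latin letters stand in for Tibetan syllables. The set and `Counter` operations follow the same idea as `compute_ngram_similarity` and `compute_weighted_jaccard` in the diff.

```python
from collections import Counter
from typing import List, Set, Tuple


def ngram_set(tokens: List[str], n: int = 2) -> Set[Tuple[str, ...]]:
    """All contiguous token n-grams, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def bigram_jaccard(t1: List[str], t2: List[str]) -> float:
    """Jaccard overlap of token bigrams; sensitive to local word order."""
    g1, g2 = ngram_set(t1), ngram_set(t2)
    union = g1 | g2
    return len(g1 & g2) / len(union) if union else 0.0


def weighted_jaccard(t1: List[str], t2: List[str]) -> float:
    """Multiset Jaccard: sum of min counts over sum of max counts."""
    c1, c2 = Counter(t1), Counter(t2)
    union = sum((c1 | c2).values())
    return sum((c1 & c2).values()) / union if union else 0.0


# Same tokens, last two swapped: only one of three bigrams survives.
original = ["a", "b", "c", "d"]
reordered = ["a", "b", "d", "c"]
order_scores = (bigram_jaccard(original, reordered), weighted_jaccard(original, reordered))

# Same vocabulary, different repetition counts.
repeat_scores = weighted_jaccard(["a", "a", "b"], ["a", "b", "b"])
print(order_scores, repeat_scores)
```

The reordered pair shares 1 of 5 distinct bigrams but has identical token counts, so the bigram score drops while the weighted Jaccard stays at 1.0. The second pair would score 1.0 under plain set Jaccard, but the weighted variant gives 2/4 because the counts differ.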
pipeline/normalize_bo.py ADDED

	@@ -0,0 +1,106 @@

+# -*- coding: utf-8 -*-
+"""
+Tibetan text normalization for improved text comparison.
+This module provides normalization functions for Tibetan grammatical particles,
+which change form based on the preceding syllable (sandhi). Normalizing these
+allows more accurate comparison between texts that may use different particle
+forms for grammatical reasons rather than semantic differences.
+"""
+from typing import List
+# Particle equivalence classes
+# All forms in each class are grammatically equivalent
+# The first form in each list is the canonical/normalized form
+PARTICLE_CLASSES = {
+    # Genitive particles (གི་སྒྲ) - "of"
+    # Form depends on final letter of preceding syllable
+    "genitive": ["གི", "ཀྱི", "གྱི", "ཡི", "འི"],
+    # Agentive/instrumental particles (བྱེད་སྒྲ) - "by"
+    "agentive": ["གིས", "ཀྱིས", "གྱིས", "ཡིས", "ས"],
+    # Dative/locative particles (ལ་དོན) - "to/at/in"
+    "dative": ["ལ", "ར", "སུ", "ཏུ", "དུ", "རུ"],
+    # Ablative particles (འབྱུང་ཁུངས) - "from"
+    "ablative": ["ནས", "ལས"],
+    # Conjunctive particles (སྦྱོར་སྒྲ) - verbal connective "and/while"
+    "conjunctive": ["ཅིང", "ཤིང", "ཞིང"],
+    # Terminative particles (མཐའ་སྒྲ) - clause ending
+    "terminative": ["སྟེ", "ཏེ", "དེ"],
+    # Concessive particles - "even/also"
+    "concessive": ["ཀྱང", "ཡང", "འང"],
+    # Imperative particles
+    "imperative": ["ཅིག", "ཤིག", "ཞིག"],
+}
+def _build_particle_map() -> dict:
+    """Build mapping from all particle variants to canonical form."""
+    mapping = {}
+    for class_name, forms in PARTICLE_CLASSES.items():
+        canonical = forms[0]  # First form is canonical
+        for variant in forms:
+            # Strip tsek for matching (will be normalized anyway)
+            variant_clean = variant.rstrip('་')
+            mapping[variant_clean] = canonical
+    return mapping
+# Pre-built mapping for efficiency
+PARTICLE_NORMALIZATION_MAP = _build_particle_map()
+def normalize_particles(tokens: List[str]) -> List[str]:
+    """
+    Normalize grammatical particles to canonical forms.
+    This treats all sandhi variants of a particle as equivalent:
+    - གི, ཀྱི, གྱི, ཡི, འི → གི (genitive)
+    - གིས, ཀྱིས, གྱིས, ཡིས, ས → གིས (agentive)
+    - ལ, ར, སུ, ཏུ, དུ, རུ → ལ (dative)
+    - etc.
+    This is useful when comparing texts that may use different particle forms
+    based on phonological context rather than semantic differences.
+    Args:
+        tokens: List of Tibetan tokens (syllables or words)
+    Returns:
+        List of tokens with particles normalized to canonical forms
+    """
+    normalized = []
+    for token in tokens:
+        # Strip tsek for lookup
+        token_clean = token.rstrip('་')
+        # Check if it's a particle that should be normalized
+        if token_clean in PARTICLE_NORMALIZATION_MAP:
+            normalized.append(PARTICLE_NORMALIZATION_MAP[token_clean])
+        else:
+            normalized.append(token_clean)
+    return normalized
+def get_particle_class(token: str) -> "str | None":
+    """
+    Get the grammatical class of a particle.
+    Args:
+        token: A Tibetan token
+    Returns:
+        The particle class name (e.g., 'genitive', 'agentive') or None
+    """
+    token_clean = token.rstrip('་')
+    for class_name, forms in PARTICLE_CLASSES.items():
+        clean_forms = [f.rstrip('་') for f in forms]
+        if token_clean in clean_forms:
+            return class_name
+    return None
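
Reviewer note: the effect of `normalize_particles` on a comparison can be sketched with the genitive class alone. This is a reduced, self-contained illustration: `GENITIVE_FORMS`, `CANONICAL`, `normalize`, and the two witness readings are hypothetical stand-ins, and only the lookup logic (strip the trailing tsek, then map to the first form in the class) follows the module above.

```python
from typing import Dict, List

TSEK = "་"

# Genitive forms as listed in the diff; the first entry is the canonical one.
GENITIVE_FORMS = ["གི", "ཀྱི", "གྱི", "ཡི", "འི"]
CANONICAL: Dict[str, str] = {form: GENITIVE_FORMS[0] for form in GENITIVE_FORMS}


def normalize(tokens: List[str]) -> List[str]:
    """Strip the trailing tsek, then map known particle variants to the canonical form."""
    out = []
    for token in tokens:
        bare = token.rstrip(TSEK)
        out.append(CANONICAL.get(bare, bare))
    return out


# Two hypothetical witness readings that differ only in the genitive variant.
witness_a = ["ཆོས་", "ཀྱི་"]
witness_b = ["ཆོས་", "གི་"]

print(normalize(witness_a) == normalize(witness_b))
```

As in the module's `else` branch, tokens that are not particles also lose their trailing tsek, so normalized and unnormalized token lists should not be mixed in one comparison.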

pipeline/process.py CHANGED

@@ -1,4 +1,5 @@
 import pandas as pd
 from typing import Dict, List, Tuple
 from .metrics import compute_all_metrics
 from .hf_embedding import get_model as get_hf_model
@@ -13,7 +14,7 @@ import re
 def get_botok_tokens_for_single_text(text: str, mode: str = "syllable") -> list[str]:
     """
-    A wrapper around tokenize_texts to make it suitable for tokenize_fn
     in generate_embeddings, which expects a function that tokenizes a single string.
     Accepts a 'mode' argument ('syllable' or 'word') to pass to tokenize_texts.
     """
@@ -46,14 +47,17 @@ logger = logging.getLogger(__name__)
 def process_texts(
-    text_data: Dict[str, str],
-    filenames: List[str],
     enable_semantic: bool = True,
     enable_fuzzy: bool = True,
-    fuzzy_method: str = 'token_set',
-    model_name: str = "sentence-transformers/LaBSE",
     use_stopwords: bool = True,
     use_lite_stopwords: bool = False,
     progress_callback = None,
     progressive_callback = None,
     batch_size: int = 32,
@@ -61,11 +65,11 @@ def process_texts(
 ) -> Tuple[pd.DataFrame, pd.DataFrame, str]:
     """
     Processes uploaded texts, segments them by chapter marker, and computes metrics between chapters of different files.
     Args:
         text_data (Dict[str, str]): A dictionary mapping filenames to their content.
         filenames (List[str]): A list of filenames that were uploaded.
-        enable_semantic (bool, optional): Whether to compute semantic similarity metrics.
             Requires loading a sentence-transformer model, which can be time-consuming. Defaults to True.
         enable_fuzzy (bool, optional): Whether to compute fuzzy string similarity metrics.
             Uses TheFuzz library for approximate string matching. Defaults to True.
@@ -74,16 +78,28 @@ def process_texts(
             'token_sort' - Order-normalized token matching
             'partial' - Best partial token matching
             'ratio' - Simple ratio matching
         model_name (str, optional): The Hugging Face sentence-transformer model to use for semantic similarity.
-            Must be a valid model identifier on Hugging Face. Defaults to "sentence-transformers/LaBSE".
         use_stopwords (bool, optional): Whether to use stopwords in the metrics calculation. Defaults to True.
         use_lite_stopwords (bool, optional): Whether to use the lite stopwords list (common particles only)
             instead of the comprehensive list. Only applies if use_stopwords is True. Defaults to False.
         progress_callback (callable, optional): A callback function for reporting progress updates.
             Should accept a float between 0 and 1 and a description string. Defaults to None.
         progressive_callback (callable, optional): A callback function for sending incremental results.
             Used for progressive loading of metrics as they become available. Defaults to None.
     Returns:
         Tuple[pd.DataFrame, pd.DataFrame, str]:
             - metrics_df: DataFrame with similarity metrics between corresponding chapters of file pairs.
@@ -92,7 +108,7 @@ def process_texts(
             - word_counts_df: DataFrame with word counts for each segment (chapter) in each file.
                 Contains columns: 'Filename', 'ChapterNumber', 'SegmentID', 'WordCount'.
             - warning: A string containing any warnings generated during processing (e.g., missing chapter markers).
     Raises:
         RuntimeError: If the botok tokenizer fails to initialize.
         ValueError: If the input files cannot be processed or if metrics computation fails.
@@ -132,7 +148,7 @@ def process_texts(
                         progress_callback(0.3, desc="Unsupported model, continuing without semantic similarity.")
                     except Exception as e:
                         logger.warning(f"Progress callback error (non-critical): {e}")
         except Exception as e:  # General catch-all for unexpected errors during model loading attempts
             model_warning = f"An unexpected error occurred while attempting to load model '{model_name}': {e}. Semantic similarity will be disabled."
             logger.error(model_warning, exc_info=True)
@@ -156,38 +172,38 @@ def process_texts(
             progress_callback(0.35, desc="Segmenting texts by chapters...")
         except Exception as e:
             logger.warning(f"Progress callback error (non-critical): {e}")
     chapter_marker = "༈"
     fallback = False
     segment_texts = {}
     # Process each file
     for i, fname in enumerate(filenames):
         if progress_callback is not None and len(filenames) > 1:
             try:
-                progress_callback(0.35 + (0.05 * (i / len(filenames))),
                                 desc=f"Segmenting file {i+1}/{len(filenames)}: {fname}")
             except Exception as e:
                 logger.warning(f"Progress callback error (non-critical): {e}")
         content = text_data[fname]
         # Check if content is empty
         if not content.strip():
             logger.warning(f"File '{fname}' is empty or contains only whitespace.")
             continue
         # Split by chapter marker if present
         if chapter_marker in content:
             segments = [
                 seg.strip() for seg in content.split(chapter_marker) if seg.strip()
             ]
             # Check if we have valid segments after splitting
             if not segments:
                 logger.warning(f"File '{fname}' contains chapter markers but no valid text segments.")
                 continue
             for idx, seg in enumerate(segments):
                 seg_id = f"{fname}|chapter {idx+1}"
                 cleaned_seg = clean_tibetan_text(seg)
@@ -198,7 +214,7 @@ def process_texts(
             cleaned_content = clean_tibetan_text(content.strip())
             segment_texts[seg_id] = cleaned_content
             fallback = True
     # Generate warning if no chapter markers found
     warning = model_warning  # Include any model warnings
     if fallback:
@@ -208,7 +224,7 @@ def process_texts(
             "For best results, add a unique marker (e.g., ༈) to separate chapters or sections."
         )
         warning = warning + " " + chapter_warning if warning else chapter_warning
     # Check if we have any valid segments
     if not segment_texts:
         logger.error("No valid text segments found in any of the uploaded files.")
@@ -216,90 +232,90 @@ def process_texts(
     # Tokenize all segments at once for efficiency
     if progress_callback is not None:
         try:
-            progress_callback(0.42, desc="Tokenizing all text segments...")
         except Exception as e:
             logger.warning(f"Progress callback error (non-critical): {e}")
     all_segment_ids = list(segment_texts.keys())
     all_segment_contents = list(segment_texts.values())
-    tokenized_segments_list = tokenize_texts(all_segment_contents)
     segment_tokens = dict(zip(all_segment_ids, tokenized_segments_list))
     # Group chapters by filename (preserving order)
     if progress_callback is not None:
         try:
-            progress_callback(0.4, desc="Organizing text segments...")
         except Exception as e:
             logger.warning(f"Progress callback error (non-critical): {e}")
     file_to_chapters = {}
     for seg_id in segment_texts:
         fname = seg_id.split("|")[0]
         file_to_chapters.setdefault(fname, []).append(seg_id)
     # For each pair of files, compare corresponding chapters (by index)
     if progress_callback is not None:
         try:
             progress_callback(0.45, desc="Computing similarity metrics...")
         except Exception as e:
             logger.warning(f"Progress callback error (non-critical): {e}")
     results = []
     files = list(file_to_chapters.keys())
     # Check if we have at least two files to compare
     if len(files) < 2:
         logger.warning("Need at least two files to compute similarity metrics.")
         return pd.DataFrame(), pd.DataFrame(), "Need at least two files to compute similarity metrics."
     # Track total number of comparisons for progress reporting
     total_comparisons = 0
     for file1, file2 in combinations(files, 2):
         chaps1 = file_to_chapters[file1]
         chaps2 = file_to_chapters[file2]
         total_comparisons += min(len(chaps1), len(chaps2))
     # Initialize results DataFrame for progressive updates
     results_columns = ['Text Pair', 'Chapter', 'Jaccard Similarity (%)', 'Normalized LCS']
     if enable_fuzzy:
         results_columns.append('Fuzzy Similarity')
     if enable_semantic:
         results_columns.append('Semantic Similarity')
     # Create empty DataFrame with the correct columns
     progressive_df = pd.DataFrame(columns=results_columns)
     # Track which metrics have been completed for progressive updates
     completed_metrics = []
     # Process each file pair
     comparison_count = 0
     for file1, file2 in combinations(files, 2):
         chaps1 = file_to_chapters[file1]
         chaps2 = file_to_chapters[file2]
         min_chaps = min(len(chaps1), len(chaps2))
         if progress_callback is not None:
             try:
                 progress_callback(0.45, desc=f"Comparing {file1} with {file2}...")
             except Exception as e:
                 logger.warning(f"Progress callback error (non-critical): {e}")
         for idx in range(min_chaps):
             seg1 = chaps1[idx]
             seg2 = chaps2[idx]
             # Update progress
             comparison_count += 1
             if progress_callback is not None and total_comparisons > 0:
                 try:
                     progress_percentage = 0.45 + (0.25 * (comparison_count / total_comparisons))
-                    progress_callback(progress_percentage,
                                     desc=f"Computing metrics for chapter {idx+1} ({comparison_count}/{total_comparisons})")
                 except Exception as e:
                     logger.warning(f"Progress callback error (non-critical): {e}")
             try:
                 # Compute metrics for this chapter pair
                 metrics_df = compute_all_metrics(
@@ -309,10 +325,12 @@ def process_texts(
                     enable_semantic=enable_semantic,
                     enable_fuzzy=enable_fuzzy,
                     fuzzy_method=fuzzy_method,
                     use_stopwords=use_stopwords,
                     use_lite_stopwords=use_lite_stopwords,
                 )
                 # Extract metrics from the DataFrame (should have only one row)
                 if not metrics_df.empty:
                     pair_metrics = metrics_df.iloc[0].to_dict()
@@ -325,57 +343,57 @@ def process_texts(
                         "Fuzzy Similarity": 0.0 if enable_fuzzy else np.nan,
                         "Semantic Similarity": 0.0 if enable_semantic else np.nan
                     }
                 # Format the results
                 text_pair = f"{file1} vs {file2}"
                 chapter_num = idx + 1
                 result_row = {
                     "Text Pair": text_pair,
                     "Chapter": chapter_num,
                     "Jaccard Similarity (%)": pair_metrics["Jaccard Similarity (%)"],  # Already in percentage
                     "Normalized LCS": pair_metrics["Normalized LCS"],
                 }
                 # Add fuzzy similarity if enabled
                 if enable_fuzzy:
                     result_row["Fuzzy Similarity"] = pair_metrics["Fuzzy Similarity"]
                 # Add semantic similarity if enabled and available
                 if enable_semantic and "Semantic Similarity" in pair_metrics:
                     result_row["Semantic Similarity"] = pair_metrics["Semantic Similarity"]
                 # Convert the dictionary to a DataFrame before appending
                 result_df = pd.DataFrame([result_row])
                 results.append(result_df)
                 # Update progressive DataFrame and send update if callback is provided
                 progressive_df = pd.concat(results, ignore_index=True)
                 # Send progressive update if callback is provided
                 if progressive_callback is not None:
                     # Determine which metrics are complete in this update
                     current_metrics = []
                     # Always include these basic metrics
                     if "Jaccard Similarity (%)" in progressive_df.columns and MetricType.JACCARD not in completed_metrics:
                         current_metrics.append(MetricType.JACCARD)
                         completed_metrics.append(MetricType.JACCARD)
                     if "Normalized LCS" in progressive_df.columns and MetricType.LCS not in completed_metrics:
                         current_metrics.append(MetricType.LCS)
                         completed_metrics.append(MetricType.LCS)
                     # Add fuzzy if enabled and available
                     if enable_fuzzy and "Fuzzy Similarity" in progressive_df.columns and MetricType.FUZZY not in completed_metrics:
                         current_metrics.append(MetricType.FUZZY)
                         completed_metrics.append(MetricType.FUZZY)
                     # Add semantic if enabled and available
                     if enable_semantic and "Semantic Similarity" in progressive_df.columns and MetricType.SEMANTIC not in completed_metrics:
                         current_metrics.append(MetricType.SEMANTIC)
                         completed_metrics.append(MetricType.SEMANTIC)
                     # Create word counts DataFrame for progressive update
                     word_counts_data = []
                     for seg_id, tokens in segment_tokens.items():
@@ -388,7 +406,7 @@ def process_texts(
                             "WordCount": len(tokens)
                         })
                     word_counts_df_progressive = pd.DataFrame(word_counts_data)
                     # Send the update
                     try:
                         progressive_callback(
@@ -400,12 +418,12 @@ def process_texts(
                         )
                     except Exception as e:
                         logger.warning(f"Progressive callback error (non-critical): {e}")
             except Exception as e:
                 logger.error(f"Error computing metrics for {seg1} vs {seg2}: {e}", exc_info=True)
                 # Continue with the remaining comparisons instead of failing completely
                 continue
     # Create the metrics DataFrame
     if results:
         # Results are already DataFrames, so we can concatenate them directly
@@ -420,9 +438,9 @@ def process_texts(
             progress_callback(0.75, desc="Calculating word counts...")
         except Exception as e:
             logger.warning(f"Progress callback error (non-critical): {e}")
     word_counts_data = []
     # Process each segment
     for i, (seg_id, text_content) in enumerate(segment_texts.items()):
         # Update progress
@@ -432,10 +450,10 @@ def process_texts(
                 progress_callback(progress_percentage, desc=f"Counting words in segment {i+1}/{len(segment_texts)}")
             except Exception as e:
                 logger.warning(f"Progress callback error (non-critical): {e}")
         fname, chapter_info = seg_id.split("|", 1)
         chapter_num = int(chapter_info.replace("chapter ", ""))
         try:
             # Use botok for accurate word count for raw Tibetan text
             tokenized_segments = tokenize_texts([text_content])  # Returns a list of lists
@@ -443,7 +461,7 @@ def process_texts(
                 word_count = len(tokenized_segments[0])
             else:
                 word_count = 0
             word_counts_data.append(
                 {
                     "Filename": fname.replace(".txt", ""),
@@ -463,20 +481,20 @@ def process_texts(
                     "WordCount": 0,
                 }
             )
     # Create and sort the word counts DataFrame
     word_counts_df = pd.DataFrame(word_counts_data)
     if not word_counts_df.empty:
         word_counts_df = word_counts_df.sort_values(
             by=["Filename", "ChapterNumber"]
         ).reset_index(drop=True)
     if progress_callback is not None:
         try:
             progress_callback(0.95, desc="Analysis complete!")
         except Exception as e:
             logger.warning(f"Progress callback error (non-critical): {e}")
     # Send final progressive update if callback is provided
     if progressive_callback is not None:
         try:
@@ -490,6 +508,6 @@ def process_texts(
             )
         except Exception as e:
             logger.warning(f"Final progressive callback error (non-critical): {e}")
     # Return the results
     return metrics_df, word_counts_df, warning
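
Note on the pairing logic above: files are split on the chapter marker, segments are keyed as `filename|chapter N`, and each pair of files is compared chapter by chapter up to the shorter file's chapter count. The following is a minimal, self-contained sketch of just that step, not the project's code. The helper names `segment_by_marker` and `pair_chapters` are hypothetical, and cleaning via `clean_tibetan_text` is omitted.

```python
from itertools import combinations

CHAPTER_MARKER = "༈"  # same marker the diff uses

def segment_by_marker(text_data):
    """Map 'filename|chapter N' ids to stripped segment text (hypothetical helper)."""
    segments = {}
    for fname, content in text_data.items():
        if not content.strip():
            continue  # skip empty files, as the original does
        # A file without the marker yields a single segment (chapter 1).
        parts = [p.strip() for p in content.split(CHAPTER_MARKER) if p.strip()]
        for idx, part in enumerate(parts):
            segments[f"{fname}|chapter {idx + 1}"] = part
    return segments

def pair_chapters(segments):
    """Pair corresponding chapters (by index) for every pair of files."""
    file_to_chapters = {}
    for seg_id in segments:
        file_to_chapters.setdefault(seg_id.split("|")[0], []).append(seg_id)
    pairs = []
    for file1, file2 in combinations(file_to_chapters, 2):
        # zip stops at the shorter list, mirroring min(len(chaps1), len(chaps2))
        pairs.extend(zip(file_to_chapters[file1], file_to_chapters[file2]))
    return pairs
```

Chapters beyond the shorter file's length are never compared, so a file with an extra trailing chapter contributes no row for it.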

 import pandas as pd
+import numpy as np
 from typing import Dict, List, Tuple
 from .metrics import compute_all_metrics
 from .hf_embedding import get_model as get_hf_model
 def get_botok_tokens_for_single_text(text: str, mode: str = "syllable") -> list[str]:
     """
+    A wrapper around tokenize_texts to make it suitable for tokenize_fn
     in generate_embeddings, which expects a function that tokenizes a single string.
     Accepts a 'mode' argument ('syllable' or 'word') to pass to tokenize_texts.
     """
 def process_texts(
+    text_data: Dict[str, str],
+    filenames: List[str],
     enable_semantic: bool = True,
     enable_fuzzy: bool = True,
+    fuzzy_method: str = 'ngram',
+    lcs_normalization: str = 'avg',
+    model_name: str = "buddhist-nlp/buddhist-sentence-similarity",
     use_stopwords: bool = True,
     use_lite_stopwords: bool = False,
+    normalize_particles: bool = False,
+    tokenization_mode: str = "word",
     progress_callback = None,
     progressive_callback = None,
     batch_size: int = 32,
 ) -> Tuple[pd.DataFrame, pd.DataFrame, str]:
     """
     Processes uploaded texts, segments them by chapter marker, and computes metrics between chapters of different files.
     Args:
         text_data (Dict[str, str]): A dictionary mapping filenames to their content.
         filenames (List[str]): A list of filenames that were uploaded.
+        enable_semantic (bool, optional): Whether to compute semantic similarity metrics.
             Requires loading a sentence-transformer model, which can be time-consuming. Defaults to True.
         enable_fuzzy (bool, optional): Whether to compute fuzzy string similarity metrics.
             Uses TheFuzz library for approximate string matching. Defaults to True.
             'token_sort' - Order-normalized token matching
             'partial' - Best partial token matching
             'ratio' - Simple ratio matching
+            'ngram' - Syllable bigram overlap (recommended for Tibetan)
+            'syllable_edit' - Syllable-level edit distance
+            'weighted_jaccard' - Frequency-weighted Jaccard
+        lcs_normalization (str, optional): How to normalize LCS length. Options:
+            'avg' - Divide by average length (default, balanced)
+            'min' - Divide by shorter text (detects containment)
+            'max' - Divide by longer text (stricter)
         model_name (str, optional): The Hugging Face sentence-transformer model to use for semantic similarity.
+            Must be a valid model identifier on Hugging Face. Defaults to "buddhist-nlp/buddhist-sentence-similarity".
         use_stopwords (bool, optional): Whether to use stopwords in the metrics calculation. Defaults to True.
         use_lite_stopwords (bool, optional): Whether to use the lite stopwords list (common particles only)
             instead of the comprehensive list. Only applies if use_stopwords is True. Defaults to False.
+        normalize_particles (bool, optional): Whether to normalize grammatical particles to canonical forms.
+            Treats གི/ཀྱི/གྱི as equivalent, ལ/ར/སུ/ཏུ/དུ as equivalent, etc. Defaults to False.
+        tokenization_mode (str, optional): How to tokenize the text. Options are:
+            'word' - Keep multi-syllable words together (default, recommended for Jaccard)
+            'syllable' - Split into individual syllables (finer granularity)
         progress_callback (callable, optional): A callback function for reporting progress updates.
             Should accept a float between 0 and 1 and a description string. Defaults to None.
         progressive_callback (callable, optional): A callback function for sending incremental results.
             Used for progressive loading of metrics as they become available. Defaults to None.
     Returns:
         Tuple[pd.DataFrame, pd.DataFrame, str]:
             - metrics_df: DataFrame with similarity metrics between corresponding chapters of file pairs.
             - word_counts_df: DataFrame with word counts for each segment (chapter) in each file.
                 Contains columns: 'Filename', 'ChapterNumber', 'SegmentID', 'WordCount'.
             - warning: A string containing any warnings generated during processing (e.g., missing chapter markers).
     Raises:
         RuntimeError: If the botok tokenizer fails to initialize.
         ValueError: If the input files cannot be processed or if metrics computation fails.
                         progress_callback(0.3, desc="Unsupported model, continuing without semantic similarity.")
                     except Exception as e:
                         logger.warning(f"Progress callback error (non-critical): {e}")
         except Exception as e:  # General catch-all for unexpected errors during model loading attempts
             model_warning = f"An unexpected error occurred while attempting to load model '{model_name}': {e}. Semantic similarity will be disabled."
             logger.error(model_warning, exc_info=True)
             progress_callback(0.35, desc="Segmenting texts by chapters...")
         except Exception as e:
             logger.warning(f"Progress callback error (non-critical): {e}")
     chapter_marker = "༈"
     fallback = False
     segment_texts = {}
     # Process each file
     for i, fname in enumerate(filenames):
         if progress_callback is not None and len(filenames) > 1:
             try:
+                progress_callback(0.35 + (0.05 * (i / len(filenames))),
                                 desc=f"Segmenting file {i+1}/{len(filenames)}: {fname}")
             except Exception as e:
                 logger.warning(f"Progress callback error (non-critical): {e}")
         content = text_data[fname]
         # Check if content is empty
         if not content.strip():
             logger.warning(f"File '{fname}' is empty or contains only whitespace.")
             continue
         # Split by chapter marker if present
         if chapter_marker in content:
             segments = [
                 seg.strip() for seg in content.split(chapter_marker) if seg.strip()
             ]
             # Check if we have valid segments after splitting
             if not segments:
                 logger.warning(f"File '{fname}' contains chapter markers but no valid text segments.")
                 continue
             for idx, seg in enumerate(segments):
                 seg_id = f"{fname}|chapter {idx+1}"
                 cleaned_seg = clean_tibetan_text(seg)
             cleaned_content = clean_tibetan_text(content.strip())
             segment_texts[seg_id] = cleaned_content
             fallback = True
     # Generate warning if no chapter markers found
     warning = model_warning  # Include any model warnings
     if fallback:
             "For best results, add a unique marker (e.g., ༈) to separate chapters or sections."
         )
         warning = warning + " " + chapter_warning if warning else chapter_warning
     # Check if we have any valid segments
     if not segment_texts:
         logger.error("No valid text segments found in any of the uploaded files.")
     # Tokenize all segments at once for efficiency
     if progress_callback is not None:
         try:
+            progress_callback(0.40, desc="Tokenizing all text segments...")
         except Exception as e:
             logger.warning(f"Progress callback error (non-critical): {e}")
     all_segment_ids = list(segment_texts.keys())
     all_segment_contents = list(segment_texts.values())
+    tokenized_segments_list = tokenize_texts(all_segment_contents, mode=tokenization_mode)
     segment_tokens = dict(zip(all_segment_ids, tokenized_segments_list))
     # Group chapters by filename (preserving order)
     if progress_callback is not None:
         try:
+            progress_callback(0.42, desc="Organizing text segments...")
         except Exception as e:
             logger.warning(f"Progress callback error (non-critical): {e}")
     file_to_chapters = {}
     for seg_id in segment_texts:
         fname = seg_id.split("|")[0]
         file_to_chapters.setdefault(fname, []).append(seg_id)
     # For each pair of files, compare corresponding chapters (by index)
     if progress_callback is not None:
         try:
             progress_callback(0.45, desc="Computing similarity metrics...")
         except Exception as e:
             logger.warning(f"Progress callback error (non-critical): {e}")
     results = []
     files = list(file_to_chapters.keys())
     # Check if we have at least two files to compare
     if len(files) < 2:
         logger.warning("Need at least two files to compute similarity metrics.")
         return pd.DataFrame(), pd.DataFrame(), "Need at least two files to compute similarity metrics."
     # Track total number of comparisons for progress reporting
     total_comparisons = 0
     for file1, file2 in combinations(files, 2):
         chaps1 = file_to_chapters[file1]
         chaps2 = file_to_chapters[file2]
         total_comparisons += min(len(chaps1), len(chaps2))
     # Initialize results DataFrame for progressive updates
     results_columns = ['Text Pair', 'Chapter', 'Jaccard Similarity (%)', 'Normalized LCS']
     if enable_fuzzy:
         results_columns.append('Fuzzy Similarity')
     if enable_semantic:
         results_columns.append('Semantic Similarity')
     # Create empty DataFrame with the correct columns
     progressive_df = pd.DataFrame(columns=results_columns)
     # Track which metrics have been completed for progressive updates
     completed_metrics = []
     # Process each file pair
     comparison_count = 0
     for file1, file2 in combinations(files, 2):
         chaps1 = file_to_chapters[file1]
         chaps2 = file_to_chapters[file2]
         min_chaps = min(len(chaps1), len(chaps2))
         if progress_callback is not None:
             try:
                 progress_callback(0.45, desc=f"Comparing {file1} with {file2}...")
             except Exception as e:
                 logger.warning(f"Progress callback error (non-critical): {e}")
         for idx in range(min_chaps):
             seg1 = chaps1[idx]
             seg2 = chaps2[idx]
             # Update progress
             comparison_count += 1
             if progress_callback is not None and total_comparisons > 0:
                 try:
                     progress_percentage = 0.45 + (0.25 * (comparison_count / total_comparisons))
+                    progress_callback(progress_percentage,
                                     desc=f"Computing metrics for chapter {idx+1} ({comparison_count}/{total_comparisons})")
                 except Exception as e:
                     logger.warning(f"Progress callback error (non-critical): {e}")
             try:
                 # Compute metrics for this chapter pair
                 metrics_df = compute_all_metrics(
                     enable_semantic=enable_semantic,
                     enable_fuzzy=enable_fuzzy,
                     fuzzy_method=fuzzy_method,
+                    lcs_normalization=lcs_normalization,
                     use_stopwords=use_stopwords,
                     use_lite_stopwords=use_lite_stopwords,
+                    normalize_particles_opt=normalize_particles,
                 )
                 # Extract metrics from the DataFrame (should have only one row)
                 if not metrics_df.empty:
                     pair_metrics = metrics_df.iloc[0].to_dict()
                         "Fuzzy Similarity": 0.0 if enable_fuzzy else np.nan,
                         "Semantic Similarity": 0.0 if enable_semantic else np.nan
                     }
                 # Format the results
                 text_pair = f"{file1} vs {file2}"
                 chapter_num = idx + 1
                 result_row = {
                     "Text Pair": text_pair,
                     "Chapter": chapter_num,
                     "Jaccard Similarity (%)": pair_metrics["Jaccard Similarity (%)"],  # Already in percentage
                     "Normalized LCS": pair_metrics["Normalized LCS"],
                 }
                 # Add fuzzy similarity if enabled
                 if enable_fuzzy:
                     result_row["Fuzzy Similarity"] = pair_metrics["Fuzzy Similarity"]
                 # Add semantic similarity if enabled and available
                 if enable_semantic and "Semantic Similarity" in pair_metrics:
                     result_row["Semantic Similarity"] = pair_metrics["Semantic Similarity"]
                 # Convert the dictionary to a DataFrame before appending
                 result_df = pd.DataFrame([result_row])
                 results.append(result_df)
                 # Update progressive DataFrame and send update if callback is provided
                 progressive_df = pd.concat(results, ignore_index=True)
                 # Send progressive update if callback is provided
                 if progressive_callback is not None:
                     # Determine which metrics are complete in this update
                     current_metrics = []
                     # Always include these basic metrics
                     if "Jaccard Similarity (%)" in progressive_df.columns and MetricType.JACCARD not in completed_metrics:
                         current_metrics.append(MetricType.JACCARD)
                         completed_metrics.append(MetricType.JACCARD)
                     if "Normalized LCS" in progressive_df.columns and MetricType.LCS not in completed_metrics:
                         current_metrics.append(MetricType.LCS)
                         completed_metrics.append(MetricType.LCS)
                     # Add fuzzy if enabled and available
                     if enable_fuzzy and "Fuzzy Similarity" in progressive_df.columns and MetricType.FUZZY not in completed_metrics:
                         current_metrics.append(MetricType.FUZZY)
                         completed_metrics.append(MetricType.FUZZY)
                     # Add semantic if enabled and available
                     if enable_semantic and "Semantic Similarity" in progressive_df.columns and MetricType.SEMANTIC not in completed_metrics:
                         current_metrics.append(MetricType.SEMANTIC)
                         completed_metrics.append(MetricType.SEMANTIC)
                     # Create word counts DataFrame for progressive update
                     word_counts_data = []
                     for seg_id, tokens in segment_tokens.items():
                             "WordCount": len(tokens)
                         })
                     word_counts_df_progressive = pd.DataFrame(word_counts_data)
                     # Send the update
                     try:
                         progressive_callback(
                         )
                     except Exception as e:
                         logger.warning(f"Progressive callback error (non-critical): {e}")
             except Exception as e:
                 logger.error(f"Error computing metrics for {seg1} vs {seg2}: {e}", exc_info=True)
                 # Continue with the remaining comparisons instead of failing completely
                 continue
     # Create the metrics DataFrame
     if results:
         # Results are already DataFrames, so we can concatenate them directly
             progress_callback(0.75, desc="Calculating word counts...")
         except Exception as e:
             logger.warning(f"Progress callback error (non-critical): {e}")
     word_counts_data = []
     # Process each segment
     for i, (seg_id, text_content) in enumerate(segment_texts.items()):
         # Update progress
                 progress_callback(progress_percentage, desc=f"Counting words in segment {i+1}/{len(segment_texts)}")
             except Exception as e:
                 logger.warning(f"Progress callback error (non-critical): {e}")
         fname, chapter_info = seg_id.split("|", 1)
         chapter_num = int(chapter_info.replace("chapter ", ""))
         try:
             # Use botok for accurate word count for raw Tibetan text
             tokenized_segments = tokenize_texts([text_content])  # Returns a list of lists
                 word_count = len(tokenized_segments[0])
             else:
                 word_count = 0
             word_counts_data.append(
                 {
                     "Filename": fname.replace(".txt", ""),
                     "WordCount": 0,
                 }
             )
     # Create and sort the word counts DataFrame
     word_counts_df = pd.DataFrame(word_counts_data)
     if not word_counts_df.empty:
         word_counts_df = word_counts_df.sort_values(
             by=["Filename", "ChapterNumber"]
         ).reset_index(drop=True)
     if progress_callback is not None:
         try:
             progress_callback(0.95, desc="Analysis complete!")
         except Exception as e:
             logger.warning(f"Progress callback error (non-critical): {e}")
     # Send final progressive update if callback is provided
     if progressive_callback is not None:
         try:
             )
         except Exception as e:
             logger.warning(f"Final progressive callback error (non-critical): {e}")
     # Return the results
     return metrics_df, word_counts_df, warning
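
Note on the new `lcs_normalization` parameter: the docstring describes three ways of dividing the LCS length ('avg', 'min', 'max'). The implementation in `pipeline/metrics.py` is not shown in this hunk, so the sketch below is only an illustration of those three options under that reading. The function names `lcs_length` and `normalized_lcs` are assumptions, and it uses a plain Python dynamic-programming LCS rather than whatever the project uses.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def normalized_lcs(a, b, normalization="avg"):
    """Divide LCS length by the average, shorter, or longer text length."""
    if not a or not b:
        return 0.0
    if normalization == "avg":
        denom = (len(a) + len(b)) / 2
    elif normalization == "min":
        denom = min(len(a), len(b))
    elif normalization == "max":
        denom = max(len(a), len(b))
    else:
        raise ValueError(f"Unknown normalization: {normalization!r}")
    return lcs_length(a, b) / denom
```

With token lists of length 3 and 5 sharing a common subsequence of length 2, 'min' gives 2/3, 'avg' gives 0.5, and 'max' gives 0.4, which is why 'min' is the option that highlights containment of a short text in a longer one.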

pipeline/progressive_loader.py CHANGED Viewed

@@ -36,15 +36,15 @@ class ProgressiveResult:
 class ProgressiveLoader:
     """
     Manages progressive loading of metrics computation results.
     This class handles the incremental updates of metrics as they are computed,
     allowing the UI to display partial results before the entire computation is complete.
     """
     def __init__(self, update_callback: Optional[Callable[[ProgressiveResult], None]] = None):
         """
         Initialize the ProgressiveLoader.
         Args:
             update_callback: Function to call when new results are available.
                             Should accept a ProgressiveResult object.
@@ -57,16 +57,16 @@ class ProgressiveLoader:
         self.is_complete = False
         self.last_update_time = 0
         self.update_interval = 0.5  # Minimum seconds between updates to avoid UI thrashing
-    def update(self,
                metrics_df: Optional[pd.DataFrame] = None,
-               word_counts_df: Optional[pd.DataFrame] = None,
                completed_metric: Optional[MetricType] = None,
                warning: Optional[str] = None,
                is_complete: bool = False) -> None:
         """
         Update the progressive results and trigger the callback if enough time has passed.
         Args:
             metrics_df: Updated metrics DataFrame
             word_counts_df: Updated word counts DataFrame
@@ -75,27 +75,27 @@ class ProgressiveLoader:
             is_complete: Whether the computation is complete
         """
         current_time = time.time()
         # Update internal state
         if metrics_df is not None:
             self.metrics_df = metrics_df
         if word_counts_df is not None:
             self.word_counts_df = word_counts_df
         if completed_metric is not None and completed_metric not in self.completed_metrics:
             self.completed_metrics.append(completed_metric)
         if warning:
             self.warning = warning
         self.is_complete = is_complete
         # Only trigger update if enough time has passed or if this is the final update
         if (current_time - self.last_update_time >= self.update_interval) or is_complete:
             self._trigger_update()
             self.last_update_time = current_time
     def _trigger_update(self) -> None:
         """Trigger the update callback with the current state."""
         if self.update_callback:

 class ProgressiveLoader:
     """
     Manages progressive loading of metrics computation results.
     This class handles the incremental updates of metrics as they are computed,
     allowing the UI to display partial results before the entire computation is complete.
     """
     def __init__(self, update_callback: Optional[Callable[[ProgressiveResult], None]] = None):
         """
         Initialize the ProgressiveLoader.
         Args:
             update_callback: Function to call when new results are available.
                             Should accept a ProgressiveResult object.
         self.is_complete = False
         self.last_update_time = 0
         self.update_interval = 0.5  # Minimum seconds between updates to avoid UI thrashing
+    def update(self,
                metrics_df: Optional[pd.DataFrame] = None,
+               word_counts_df: Optional[pd.DataFrame] = None,
                completed_metric: Optional[MetricType] = None,
                warning: Optional[str] = None,
                is_complete: bool = False) -> None:
         """
         Update the progressive results and trigger the callback if enough time has passed.
         Args:
             metrics_df: Updated metrics DataFrame
             word_counts_df: Updated word counts DataFrame
             is_complete: Whether the computation is complete
         """
         current_time = time.time()
         # Update internal state
         if metrics_df is not None:
             self.metrics_df = metrics_df
         if word_counts_df is not None:
             self.word_counts_df = word_counts_df
         if completed_metric is not None and completed_metric not in self.completed_metrics:
             self.completed_metrics.append(completed_metric)
         if warning:
             self.warning = warning
         self.is_complete = is_complete
         # Only trigger update if enough time has passed or if this is the final update
         if (current_time - self.last_update_time >= self.update_interval) or is_complete:
             self._trigger_update()
             self.last_update_time = current_time
     def _trigger_update(self) -> None:
         """Trigger the update callback with the current state."""
         if self.update_callback:
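
Note on the throttling in `ProgressiveLoader.update`: the callback fires only if `update_interval` seconds have passed since the last update, or if the update is final. A minimal standalone sketch of that pattern follows. The class `ThrottledNotifier` and its injectable `clock` argument are hypothetical and added here only so the behaviour can be checked without sleeping.

```python
import time

class ThrottledNotifier:
    """Forward payloads to a callback at most once per `interval` seconds."""

    def __init__(self, callback, interval=0.5, clock=time.monotonic):
        self.callback = callback
        self.interval = interval
        self.clock = clock          # injectable for testing (assumption)
        self.last_update_time = None

    def update(self, payload, is_complete=False):
        """Return True if the callback was triggered for this payload."""
        now = self.clock()
        due = (
            self.last_update_time is None
            or now - self.last_update_time >= self.interval
        )
        # A final update always goes through, even inside the interval.
        if due or is_complete:
            self.callback(payload)
            self.last_update_time = now
            return True
        return False
```

Updates that arrive inside the interval are dropped rather than queued, so the final `is_complete=True` call is what guarantees the UI ends up with the full result.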

pipeline/progressive_ui.py CHANGED Viewed

@@ -17,25 +17,25 @@ logger = logging.getLogger(__name__)
 class ProgressiveUI:
     """
     Manages progressive UI updates for the Tibetan Text Metrics app.
     This class handles the incremental updates of UI components as metrics
     are computed, allowing for a more responsive user experience.
     """
-    def __init__(self,
                  metrics_preview: gr.Dataframe,
                  word_count_plot: gr.Plot,
                  jaccard_heatmap: gr.Plot,
                  lcs_heatmap: gr.Plot,
                  fuzzy_heatmap: gr.Plot,
-                 semantic_heatmap: gr.Plot,
-                 warning_box: gr.Markdown,
-                 progress_container: gr.Row,
-                 heatmap_titles: Dict[str, str],
                  structural_btn=None):
         """
         Initialize the ProgressiveUI.
         Args:
             metrics_preview: Gradio Dataframe component for metrics preview
             word_count_plot: Gradio Plot component for word count visualization
@@ -55,9 +55,9 @@ class ProgressiveUI:
         self.semantic_heatmap = semantic_heatmap
         self.warning_box = warning_box
         self.progress_container = progress_container
-        self.heatmap_titles = heatmap_titles
         self.structural_btn = structural_btn
         # Create progress indicators for each metric
         with self.progress_container:
             self.jaccard_progress = gr.Markdown("🔄 **Jaccard Similarity:** Waiting...", elem_id="jaccard_progress")
@@ -65,90 +65,90 @@ class ProgressiveUI:
             self.fuzzy_progress = gr.Markdown("🔄 **Fuzzy Similarity:** Waiting...", elem_id="fuzzy_progress")
             self.semantic_progress = gr.Markdown("🔄 **Semantic Similarity:** Waiting...", elem_id="semantic_progress")
             self.word_count_progress = gr.Markdown("🔄 **Word Counts:** Waiting...", elem_id="word_count_progress")
         # Track which components have been updated
         self.updated_components = set()
     def update(self, result: ProgressiveResult) -> Dict[gr.components.Component, Any]:
         """
         Update UI components based on progressive results.
         Args:
             result: ProgressiveResult object containing the current state of computation
         Returns:
             Dictionary mapping Gradio components to their updated values
         """
         updates = {}
         # Always update metrics preview if we have data
         if not result.metrics_df.empty:
             updates[self.metrics_preview] = result.metrics_df.head(10)
         # Update warning if present
         if result.warning:
             warning_md = f"**⚠️ Warning:** {result.warning}" if result.warning else ""
             updates[self.warning_box] = gr.update(value=warning_md, visible=True)
         # Generate visualizations for completed metrics
         if not result.metrics_df.empty:
             # Generate heatmaps for available metrics
             heatmaps_data = generate_visualizations(
                 result.metrics_df, descriptive_titles=self.heatmap_titles
             )
             # Update heatmaps and progress indicators for completed metrics
             for metric_type in result.completed_metrics:
                 if metric_type == MetricType.JACCARD:
                     # Update progress indicator
                     updates[self.jaccard_progress] = "✅ **Jaccard Similarity:** Complete"
                     # Update heatmap if not already updated
                     if self.jaccard_heatmap not in self.updated_components:
                         if "Jaccard Similarity (%)" in heatmaps_data:
                             updates[self.jaccard_heatmap] = heatmaps_data["Jaccard Similarity (%)"]
                             self.updated_components.add(self.jaccard_heatmap)
                 elif metric_type == MetricType.LCS:
                     # Update progress indicator
                     updates[self.lcs_progress] = "✅ **Normalized LCS:** Complete"
                     # Update heatmap if not already updated
                     if self.lcs_heatmap not in self.updated_components:
                         if "Normalized LCS" in heatmaps_data:
                             updates[self.lcs_heatmap] = heatmaps_data["Normalized LCS"]
                             self.updated_components.add(self.lcs_heatmap)
                 elif metric_type == MetricType.FUZZY:
                     # Update progress indicator
                     updates[self.fuzzy_progress] = "✅ **Fuzzy Similarity:** Complete"
                     # Update heatmap if not already updated
                     if self.fuzzy_heatmap not in self.updated_components:
                         if "Fuzzy Similarity" in heatmaps_data:
                             updates[self.fuzzy_heatmap] = heatmaps_data["Fuzzy Similarity"]
                             self.updated_components.add(self.fuzzy_heatmap)
                 elif metric_type == MetricType.SEMANTIC:
                     # Update progress indicator
                     updates[self.semantic_progress] = "✅ **Semantic Similarity:** Complete"
                     # Update heatmap if not already updated
                     if self.semantic_heatmap not in self.updated_components:
                         if "Semantic Similarity" in heatmaps_data:
                             updates[self.semantic_heatmap] = heatmaps_data["Semantic Similarity"]
                             self.updated_components.add(self.semantic_heatmap)
         # Generate word count chart if we have data
         if not result.word_counts_df.empty:
             # Update progress indicator
             updates[self.word_count_progress] = "✅ **Word Counts:** Complete"
             # Update chart if not already updated
             if self.word_count_plot not in self.updated_components:
                 updates[self.word_count_plot] = generate_word_count_chart(result.word_counts_df)
                 self.updated_components.add(self.word_count_plot)
         # Update progress indicators for metrics in progress
         if not result.is_complete:
             # Update progress indicators for metrics that are still in progress
@@ -167,28 +167,28 @@ class ProgressiveUI:
             if self.structural_btn is not None:
                 updates[self.structural_btn] = gr.update(interactive=True)
                 logger.info("Enabling structural analysis button via progressive UI")
         return updates
 def create_progressive_callback(progressive_ui: ProgressiveUI) -> Callable:
     """
     Create a callback function for progressive updates.
     Args:
         progressive_ui: ProgressiveUI instance to handle updates
     Returns:
         Callback function that can be passed to process_texts
     """
-    def callback(metrics_df: pd.DataFrame,
                 word_counts_df: pd.DataFrame,
                 completed_metrics: List[MetricType],
                 warning: str,
                 is_complete: bool) -> None:
         """
         Callback function for progressive updates.
         Args:
             metrics_df: DataFrame with current metrics
             word_counts_df: DataFrame with word counts
@@ -203,10 +203,10 @@ def create_progressive_callback(progressive_ui: ProgressiveUI) -> Callable:
             warning=warning,
             is_complete=is_complete
         )
         # Get updates for UI components
         updates = progressive_ui.update(result)
         # Apply updates to UI components
         for component, value in updates.items():
             try:
@@ -228,5 +228,5 @@ def create_progressive_callback(progressive_ui: ProgressiveUI) -> Callable:
                     logger.warning(f"Cannot update component of type {type(component)}")
             except Exception as e:
                 logger.warning(f"Error updating component: {e}")
     return callback

 class ProgressiveUI:
     """
     Manages progressive UI updates for the Tibetan Text Metrics app.
     This class handles the incremental updates of UI components as metrics
     are computed, allowing for a more responsive user experience.
     """
+    def __init__(self,
                  metrics_preview: gr.Dataframe,
                  word_count_plot: gr.Plot,
                  jaccard_heatmap: gr.Plot,
                  lcs_heatmap: gr.Plot,
                  fuzzy_heatmap: gr.Plot,
+                 semantic_heatmap: gr.Plot = None,
+                 warning_box: gr.Markdown = None,
+                 progress_container: gr.Row = None,
+                 heatmap_titles: Dict[str, str] = None,
                  structural_btn=None):
         """
         Initialize the ProgressiveUI.
         Args:
             metrics_preview: Gradio Dataframe component for metrics preview
             word_count_plot: Gradio Plot component for word count visualization
         self.semantic_heatmap = semantic_heatmap
         self.warning_box = warning_box
         self.progress_container = progress_container
+        self.heatmap_titles = heatmap_titles or {}
         self.structural_btn = structural_btn
         # Create progress indicators for each metric
         with self.progress_container:
             self.jaccard_progress = gr.Markdown("🔄 **Jaccard Similarity:** Waiting...", elem_id="jaccard_progress")
             self.fuzzy_progress = gr.Markdown("🔄 **Fuzzy Similarity:** Waiting...", elem_id="fuzzy_progress")
             self.semantic_progress = gr.Markdown("🔄 **Semantic Similarity:** Waiting...", elem_id="semantic_progress")
             self.word_count_progress = gr.Markdown("🔄 **Word Counts:** Waiting...", elem_id="word_count_progress")
         # Track which components have been updated
         self.updated_components = set()
     def update(self, result: ProgressiveResult) -> Dict[gr.components.Component, Any]:
         """
         Update UI components based on progressive results.
         Args:
             result: ProgressiveResult object containing the current state of computation
         Returns:
             Dictionary mapping Gradio components to their updated values
         """
         updates = {}
         # Always update metrics preview if we have data
         if not result.metrics_df.empty:
             updates[self.metrics_preview] = result.metrics_df.head(10)
         # Update warning if present
         if result.warning:
             warning_md = f"**⚠️ Warning:** {result.warning}" if result.warning else ""
             updates[self.warning_box] = gr.update(value=warning_md, visible=True)
         # Generate visualizations for completed metrics
         if not result.metrics_df.empty:
             # Generate heatmaps for available metrics
             heatmaps_data = generate_visualizations(
                 result.metrics_df, descriptive_titles=self.heatmap_titles
             )
             # Update heatmaps and progress indicators for completed metrics
             for metric_type in result.completed_metrics:
                 if metric_type == MetricType.JACCARD:
                     # Update progress indicator
                     updates[self.jaccard_progress] = "✅ **Jaccard Similarity:** Complete"
                     # Update heatmap if not already updated
                     if self.jaccard_heatmap not in self.updated_components:
                         if "Jaccard Similarity (%)" in heatmaps_data:
                             updates[self.jaccard_heatmap] = heatmaps_data["Jaccard Similarity (%)"]
                             self.updated_components.add(self.jaccard_heatmap)
                 elif metric_type == MetricType.LCS:
                     # Update progress indicator
                     updates[self.lcs_progress] = "✅ **Normalized LCS:** Complete"
                     # Update heatmap if not already updated
                     if self.lcs_heatmap not in self.updated_components:
                         if "Normalized LCS" in heatmaps_data:
                             updates[self.lcs_heatmap] = heatmaps_data["Normalized LCS"]
                             self.updated_components.add(self.lcs_heatmap)
                 elif metric_type == MetricType.FUZZY:
                     # Update progress indicator
                     updates[self.fuzzy_progress] = "✅ **Fuzzy Similarity:** Complete"
                     # Update heatmap if not already updated
                     if self.fuzzy_heatmap not in self.updated_components:
                         if "Fuzzy Similarity" in heatmaps_data:
                             updates[self.fuzzy_heatmap] = heatmaps_data["Fuzzy Similarity"]
                             self.updated_components.add(self.fuzzy_heatmap)
                 elif metric_type == MetricType.SEMANTIC:
                     # Update progress indicator
                     updates[self.semantic_progress] = "✅ **Semantic Similarity:** Complete"
                     # Update heatmap if not already updated
                     if self.semantic_heatmap not in self.updated_components:
                         if "Semantic Similarity" in heatmaps_data:
                             updates[self.semantic_heatmap] = heatmaps_data["Semantic Similarity"]
                             self.updated_components.add(self.semantic_heatmap)
         # Generate word count chart if we have data
         if not result.word_counts_df.empty:
             # Update progress indicator
             updates[self.word_count_progress] = "✅ **Word Counts:** Complete"
             # Update chart if not already updated
             if self.word_count_plot not in self.updated_components:
                 updates[self.word_count_plot] = generate_word_count_chart(result.word_counts_df)
                 self.updated_components.add(self.word_count_plot)
         # Update progress indicators for metrics in progress
         if not result.is_complete:
             # Update progress indicators for metrics that are still in progress
             if self.structural_btn is not None:
                 updates[self.structural_btn] = gr.update(interactive=True)
                 logger.info("Enabling structural analysis button via progressive UI")
         return updates
 def create_progressive_callback(progressive_ui: ProgressiveUI) -> Callable:
     """
     Create a callback function for progressive updates.
     Args:
         progressive_ui: ProgressiveUI instance to handle updates
     Returns:
         Callback function that can be passed to process_texts
     """
+    def callback(metrics_df: pd.DataFrame,
                 word_counts_df: pd.DataFrame,
                 completed_metrics: List[MetricType],
                 warning: str,
                 is_complete: bool) -> None:
         """
         Callback function for progressive updates.
         Args:
             metrics_df: DataFrame with current metrics
             word_counts_df: DataFrame with word counts
             warning=warning,
             is_complete=is_complete
         )
         # Get updates for UI components
         updates = progressive_ui.update(result)
         # Apply updates to UI components
         for component, value in updates.items():
             try:
                     logger.warning(f"Cannot update component of type {type(component)}")
             except Exception as e:
                 logger.warning(f"Error updating component: {e}")
     return callback
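
Note on the new defaults: `progress_container` may now be `None`, yet the constructor shown in this diff still enters `with self.progress_container:`. Unless a part of the file not visible here guards against it, that would raise at runtime when no container is passed. The following is a minimal sketch of one possible guard, not the author's implementation. `FakeRow`, `container_context` and `resolve_titles` are hypothetical names introduced for illustration; only the `heatmap_titles or {}` idiom comes from the diff.

```python
from contextlib import nullcontext


class FakeRow:
    """Hypothetical stand-in for gr.Row so the sketch runs without Gradio."""

    def __init__(self):
        self.entered = False

    def __enter__(self):
        self.entered = True
        return self

    def __exit__(self, exc_type, exc, tb):
        return False  # do not swallow exceptions


def container_context(progress_container=None):
    """Return the container itself, or a no-op context when it is None."""
    if progress_container is not None:
        return progress_container
    return nullcontext()


def resolve_titles(heatmap_titles=None):
    """Same idiom as the diff: fall back to an empty dict when None is passed."""
    return heatmap_titles or {}
```

With such a helper, the constructor could write `with container_context(self.progress_container):` and behave the same whether or not a row was supplied. Whether progress indicators should be created at all without a container is a separate design decision.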

pipeline/stopwords_bo.py CHANGED

@@ -21,13 +21,13 @@ ORDINAL_NUMBERS = [
 # Additional stopwords from the comprehensive list, categorized for readability
 MORE_PARTICLES_SUFFIXES = [
-    "འི་", "དུ་", "གིས་", "ཏེ", "གི་", "ཡི་", "ཀྱི་", "པས་", "ཀྱིས་", "ཡི", "ལ", "ནི་", "ར", "དུ",
-    "ལས", "གྱིས་", "ས", "ཏེ་", "གྱི་", "དེ", "ཀ་", "སྟེ", "སྟེ་", "ངམ", "ཏོ", "དོ", "དམ་",
-    "གྱིན་", "ན", "འམ་", "ཀྱིན་", "ལོ", "ཀྱིས", "བས་", "ཤིག", "གིས", "ཀི་", "ཡིས་", "གྱི", "གི",
-    "བམ་", "ཤིག་", "ནམ", "མིན་", "ནམ་", "ངམ་", "རུ་", "ཤས་", "ཏུ", "ཡིས", "གིན་", "གམ་",
-    "གྱིས", "ཅང་", "སམ་", "ཞིག", "འང", "རུ", "དང", "ཡ", "འག", "སམ", "ཀ", "འམ", "མམ",
-    "དམ", "ཀྱི", "ལམ", "ནོ་", "སོ་", "རམ་", "བོ་", "ཨང་", "ཕྱི", "ཏོ་", "གེ", "གོ", "རོ་", "བོ",
-    "པས", "འི", "རམ", "བས", "མཾ་", "པོ", "ག་", "ག", "གམ", "བམ", "མོ་", "མམ་", "ཏམ་", "ངོ",
     "ཏམ", "གིང་", "ཀྱང" # ཀྱང also in PARTICLES_INITIAL, set() will handle duplicates
 ]
@@ -36,13 +36,13 @@ PRONOUNS_DEMONSTRATIVES = ["འདི", "གཞན་", "དེ་", "རང་"
 VERBS_AUXILIARIES = ["ཡིན་", "མི་", "ལགས་པ", "ཡིན་པ", "ལགས་", "མིན་", "ཡིན་པ་", "མིན", "ཡིན་བ", "ཡིན་ལུགས་"]
 ADVERBS_QUALIFIERS_INTENSIFIERS = [
-    "སོགས་", "ཙམ་", "ཡང་", "ཉིད་", "ཞིང་", "རུང་", "ན་རེ", "འང་", "ཁོ་ན་", "འཕྲལ་", "བར་",
     "ཅུང་ཟད་", "ཙམ་པ་", "ཤ་སྟག་"
 ]
 QUANTIFIERS_DETERMINERS_COLLECTIVES = [
-    "རྣམས་", "ཀུན་", "སྙེད་", "བཅས་", "ཡོངས་", "མཐའ་དག་", "དག་", "ཚུ", "ཚང་མ", "ཐམས་ཅད་",
-    "ཅིག་", "སྣ་ཚོགས་", "སྙེད་པ", "རེ་རེ་", "འགའ་", "སྤྱི", "དུ་མ", "མ", "ཁོ་ན", "ཚོ", "ལ་ལ་",
     "སྙེད་པ་", "འབའ་", "སྙེད", "གྲང་", "ཁ་", "ངེ་", "ཅོག་", "རིལ་", "ཉུང་ཤས་", "ཚ་"
 ]
@@ -64,8 +64,19 @@ _ALL_STOPWORDS_CATEGORIZED = (
     INTERJECTIONS_EXCLAMATIONS
 )
-# Final flat list of unique stopwords
-TIBETAN_STOPWORDS = list(set(_ALL_STOPWORDS_CATEGORIZED))
 # Final set of unique stopwords for efficient Jaccard/LCS filtering (as a set)
 TIBETAN_STOPWORDS_SET = set(TIBETAN_STOPWORDS)

 # Additional stopwords from the comprehensive list, categorized for readability
 MORE_PARTICLES_SUFFIXES = [
+    "འི་", "དུ་", "གིས་", "ཏེ", "གི་", "ཡི་", "ཀྱི་", "པས་", "ཀྱིས་", "ཡི", "ལ", "ནི་", "ར", "དུ",
+    "ལས", "གྱིས་", "ས", "ཏེ་", "གྱི་", "དེ", "ཀ་", "སྟེ", "སྟེ་", "ངམ", "ཏོ", "དོ", "དམ་",
+    "གྱིན་", "ན", "འམ་", "ཀྱིན་", "ལོ", "ཀྱིས", "བས་", "ཤིག", "གིས", "ཀི་", "ཡིས་", "གྱི", "གི",
+    "བམ་", "ཤིག་", "ནམ", "མིན་", "ནམ་", "ངམ་", "རུ་", "ཤས་", "ཏུ", "ཡིས", "གིན་", "གམ་",
+    "གྱིས", "ཅང་", "སམ་", "ཞིག", "འང", "རུ", "དང", "ཡ", "འག", "སམ", "ཀ", "འམ", "མམ",
+    "དམ", "ཀྱི", "ལམ", "ནོ་", "སོ་", "རམ་", "བོ་", "ཨང་", "ཕྱི", "ཏོ་", "གེ", "གོ", "རོ་", "བོ",
+    "པས", "འི", "རམ", "བས", "མཾ་", "པོ", "ག་", "ག", "གམ", "བམ", "མོ་", "མམ་", "ཏམ་", "ངོ",
     "ཏམ", "གིང་", "ཀྱང" # ཀྱང also in PARTICLES_INITIAL, set() will handle duplicates
 ]
 VERBS_AUXILIARIES = ["ཡིན་", "མི་", "ལགས་པ", "ཡིན་པ", "ལགས་", "མིན་", "ཡིན་པ་", "མིན", "ཡིན་བ", "ཡིན་ལུགས་"]
 ADVERBS_QUALIFIERS_INTENSIFIERS = [
+    "སོགས་", "ཙམ་", "ཡང་", "ཉིད་", "ཞིང་", "རུང་", "ན་རེ", "འང་", "ཁོ་ན་", "འཕྲལ་", "བར་",
     "ཅུང་ཟད་", "ཙམ་པ་", "ཤ་སྟག་"
 ]
 QUANTIFIERS_DETERMINERS_COLLECTIVES = [
+    "རྣམས་", "ཀུན་", "སྙེད་", "བཅས་", "ཡོངས་", "མཐའ་དག་", "དག་", "ཚུ", "ཚང་མ", "ཐམས་ཅད་",
+    "ཅིག་", "སྣ་ཚོགས་", "སྙེད་པ", "རེ་རེ་", "འགའ་", "སྤྱི", "དུ་མ", "མ", "ཁོ་ན", "ཚོ", "ལ་ལ་",
     "སྙེད་པ་", "འབའ་", "སྙེད", "གྲང་", "ཁ་", "ངེ་", "ཅོག་", "རིལ་", "ཉུང་ཤས་", "ཚ་"
 ]
     INTERJECTIONS_EXCLAMATIONS
 )
+def _normalize_tibetan_token(token: str) -> str:
+    """
+    Normalize a Tibetan token by removing trailing tsek (་).
+    This ensures consistent matching regardless of whether the tokenizer
+    preserves or strips the tsek. Botok's behavior can vary, so we normalize
+    both the stopwords and the tokens being compared.
+    """
+    return token.rstrip('་')
+# Final flat list of unique stopwords (normalized to remove trailing tsek)
+TIBETAN_STOPWORDS = list(set(_normalize_tibetan_token(sw) for sw in _ALL_STOPWORDS_CATEGORIZED))
 # Final set of unique stopwords for efficient Jaccard/LCS filtering (as a set)
+# Normalized to match tokenizer output regardless of tsek handling
 TIBETAN_STOPWORDS_SET = set(TIBETAN_STOPWORDS)
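
The effect of the tsek normalization can be shown with a small, self-contained sketch. The sample stopword list and the `filter_stopwords` helper are assumptions made for illustration. Only the `rstrip('་')` behavior is taken from the diff, and the real lists live in `pipeline/stopwords_bo.py`.

```python
TSEK = "་"  # Tibetan tsek, the syllable delimiter stripped by the diff


def normalize_tibetan_token(token: str) -> str:
    """Strip trailing tsek so that 'གི་' and 'གི' compare equal."""
    return token.rstrip(TSEK)


# Hypothetical sample; entries with and without tsek collapse to one form.
raw_stopwords = ["གི་", "གི", "ཀྱི་", "དང"]
stopword_set = {normalize_tibetan_token(sw) for sw in raw_stopwords}


def filter_stopwords(tokens):
    """Drop tokens whose normalized form is a stopword (illustrative helper)."""
    return [t for t in tokens if normalize_tibetan_token(t) not in stopword_set]
```

Because both the stopwords and the incoming tokens pass through the same function, a match no longer depends on whether the tokenizer kept the trailing tsek. A tsek inside a token is left untouched, since `rstrip` only removes trailing characters.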

pipeline/stopwords_lite_bo.py CHANGED

@@ -15,8 +15,8 @@ MARKERS_AND_PUNCTUATION = ["༈", "།", "༎", "༑"]
 # Reduced list of particles and suffixes
 MORE_PARTICLES_SUFFIXES_LITE = [
-    "འི་", "དུ་", "གིས་", "ཏེ", "གི་", "ཡི་", "ཀྱི་", "པས་", "ཀྱིས་", "ཡི", "ལ", "ནི་", "ར", "དུ",
-    "ལས", "གྱིས་", "ས", "ཏེ་", "གྱི་", "དེ", "ཀ་", "སྟེ", "སྟེ་", "ངམ", "ཏོ", "དོ", "དམ་",
     "ན", "འམ་", "ལོ", "ཀྱིས", "བས་", "ཤིག", "གིས", "ཀི་", "ཡིས་", "གྱི", "གི"
 ]
@@ -27,8 +27,18 @@ _ALL_STOPWORDS_CATEGORIZED_LITE = (
     MORE_PARTICLES_SUFFIXES_LITE
 )
-# Final flat list of unique stopwords
-TIBETAN_STOPWORDS_LITE = list(set(_ALL_STOPWORDS_CATEGORIZED_LITE))
 # Final set of unique stopwords for efficient Jaccard/LCS filtering (as a set)
 TIBETAN_STOPWORDS_LITE_SET = set(TIBETAN_STOPWORDS_LITE)

 # Reduced list of particles and suffixes
 MORE_PARTICLES_SUFFIXES_LITE = [
+    "འི་", "དུ་", "གིས་", "ཏེ", "གི་", "ཡི་", "ཀྱི་", "པས་", "ཀྱིས་", "ཡི", "ལ", "ནི་", "ར", "དུ",
+    "ལས", "གྱིས་", "ས", "ཏེ་", "གྱི་", "དེ", "ཀ་", "སྟེ", "སྟེ་", "ངམ", "ཏོ", "དོ", "དམ་",
     "ན", "འམ་", "ལོ", "ཀྱིས", "བས་", "ཤིག", "གིས", "ཀི་", "ཡིས་", "གྱི", "གི"
 ]
     MORE_PARTICLES_SUFFIXES_LITE
 )
+def _normalize_tibetan_token(token: str) -> str:
+    """
+    Normalize a Tibetan token by removing trailing tsek (་).
+    This ensures consistent matching regardless of whether the tokenizer
+    preserves or strips the tsek.
+    """
+    return token.rstrip('་')
+# Final flat list of unique stopwords (normalized to remove trailing tsek)
+TIBETAN_STOPWORDS_LITE = list(set(_normalize_tibetan_token(sw) for sw in _ALL_STOPWORDS_CATEGORIZED_LITE))
 # Final set of unique stopwords for efficient Jaccard/LCS filtering (as a set)
+# Normalized to match tokenizer output regardless of tsek handling
 TIBETAN_STOPWORDS_LITE_SET = set(TIBETAN_STOPWORDS_LITE)

pipeline/tokenize.py CHANGED

@@ -29,10 +29,10 @@ except ImportError:
 def _get_text_hash(text: str) -> str:
     """
     Generate a hash for the input text to use as a cache key.
     Args:
         text: The input text to hash
     Returns:
         A string representation of the MD5 hash of the input text
     """
@@ -42,17 +42,17 @@ def _get_text_hash(text: str) -> str:
 def tokenize_texts(texts: List[str], mode: str = "syllable") -> List[List[str]]:
     """
     Tokenizes a list of raw Tibetan texts using botok, with caching for performance.
     This function maintains an in-memory cache of previously tokenized texts to avoid
     redundant processing of the same content. The cache uses MD5 hashes of the input
     texts as keys.
     Args:
         texts: List of raw text strings to tokenize.
     Returns:
         List of tokenized texts (each as a list of tokens).
     Raises:
         RuntimeError: If the botok tokenizer failed to initialize.
     """
@@ -68,18 +68,18 @@ def tokenize_texts(texts: List[str], mode: str = "syllable") -> List[List[str]]:
     if mode not in ["word", "syllable"]:
         logger.warning(f"Invalid tokenization mode: '{mode}'. Defaulting to 'syllable'.")
         mode = "syllable"
     # Process each text
     for text_content in texts:
         # Skip empty texts
         if not text_content.strip():
             tokenized_texts_list.append([])
             continue
         # Generate hash for cache lookup
         cache_key_string = text_content + f"_{mode}" # Include mode in string for hashing
         text_hash = _get_text_hash(cache_key_string)
         # Check if we have this text in cache
         if text_hash in _tokenization_cache:
             # Cache hit - use cached tokens
@@ -91,7 +91,7 @@ def tokenize_texts(texts: List[str], mode: str = "syllable") -> List[List[str]]:
                 current_tokens = []
                 if BOTOK_TOKENIZER:
                     raw_botok_items = list(BOTOK_TOKENIZER.tokenize(text_content))
                     if mode == "word":
                         for item_idx, w in enumerate(raw_botok_items):
                             if hasattr(w, 'text') and isinstance(w.text, str):
@@ -125,7 +125,7 @@ def tokenize_texts(texts: List[str], mode: str = "syllable") -> List[List[str]]:
                                                 f"for hash {text_hash[:8]}. Skipping this syllable."
                                             )
                                             continue
                                     if syllable_to_process is not None:
                                         stripped_syl = syllable_to_process.strip()
                                         if stripped_syl:
@@ -155,20 +155,20 @@ def tokenize_texts(texts: List[str], mode: str = "syllable") -> List[List[str]]:
                 else:
                     logger.error(f"BOTOK_TOKENIZER is None for text hash {text_hash[:8]}, cannot tokenize (mode: {mode}).")
                     tokens = []
                 # Store in cache if not empty
                 if tokens:
                     # If cache is full, remove a random entry (simple strategy)
                     if len(_tokenization_cache) >= MAX_CACHE_SIZE:
                         # Remove first key (oldest if ordered dict, random otherwise)
                         _tokenization_cache.pop(next(iter(_tokenization_cache)))
                     _tokenization_cache[text_hash] = tokens
                     logger.debug(f"Added tokens to cache with hash {text_hash[:8]}... (mode: {mode})")
             except Exception as e:
                 logger.error(f"Error tokenizing text (mode: {mode}): {e}")
                 tokens = []
         tokenized_texts_list.append(tokens)
     return tokenized_texts_list

 def _get_text_hash(text: str) -> str:
     """
     Generate a hash for the input text to use as a cache key.
     Args:
         text: The input text to hash
     Returns:
         A string representation of the MD5 hash of the input text
     """
 def tokenize_texts(texts: List[str], mode: str = "syllable") -> List[List[str]]:
     """
     Tokenizes a list of raw Tibetan texts using botok, with caching for performance.
     This function maintains an in-memory cache of previously tokenized texts to avoid
     redundant processing of the same content. The cache uses MD5 hashes of the input
     texts as keys.
     Args:
         texts: List of raw text strings to tokenize.
     Returns:
         List of tokenized texts (each as a list of tokens).
     Raises:
         RuntimeError: If the botok tokenizer failed to initialize.
     """
     if mode not in ["word", "syllable"]:
         logger.warning(f"Invalid tokenization mode: '{mode}'. Defaulting to 'syllable'.")
         mode = "syllable"
     # Process each text
     for text_content in texts:
         # Skip empty texts
         if not text_content.strip():
             tokenized_texts_list.append([])
             continue
         # Generate hash for cache lookup
         cache_key_string = text_content + f"_{mode}" # Include mode in string for hashing
         text_hash = _get_text_hash(cache_key_string)
         # Check if we have this text in cache
         if text_hash in _tokenization_cache:
             # Cache hit - use cached tokens
                 current_tokens = []
                 if BOTOK_TOKENIZER:
                     raw_botok_items = list(BOTOK_TOKENIZER.tokenize(text_content))
                     if mode == "word":
                         for item_idx, w in enumerate(raw_botok_items):
                             if hasattr(w, 'text') and isinstance(w.text, str):
                                                 f"for hash {text_hash[:8]}. Skipping this syllable."
                                             )
                                             continue
                                     if syllable_to_process is not None:
                                         stripped_syl = syllable_to_process.strip()
                                         if stripped_syl:
                 else:
                     logger.error(f"BOTOK_TOKENIZER is None for text hash {text_hash[:8]}, cannot tokenize (mode: {mode}).")
                     tokens = []
                 # Store in cache if not empty
                 if tokens:
                     # If cache is full, remove a random entry (simple strategy)
                     if len(_tokenization_cache) >= MAX_CACHE_SIZE:
                         # Remove first key (oldest if ordered dict, random otherwise)
                         _tokenization_cache.pop(next(iter(_tokenization_cache)))
                     _tokenization_cache[text_hash] = tokens
                     logger.debug(f"Added tokens to cache with hash {text_hash[:8]}... (mode: {mode})")
             except Exception as e:
                 logger.error(f"Error tokenizing text (mode: {mode}): {e}")
                 tokens = []
         tokenized_texts_list.append(tokens)
     return tokenized_texts_list

pipeline/visualize.py CHANGED

@@ -40,29 +40,29 @@ def generate_visualizations(metrics_df: pd.DataFrame, descriptive_titles: dict =
             continue
         cleaned_columns = [col.replace(".txt", "") for col in pivot.columns]
         # For consistent interpretation: higher values (more similarity) = darker colors
         # Using 'Reds' colormap for all metrics (dark red = high similarity)
-        cmap = "Reds"
         # Format values for display
         text = [
             [f"{val:.2f}" if pd.notnull(val) else "" for val in row]
             for row in pivot.values
         ]
         # Create a copy of the pivot data for visualization
         # For LCS and Semantic Similarity, we need to reverse the color scale
         # so that higher values (more similarity) are darker
         viz_values = pivot.values.copy()
         # Determine if we need to reverse the values for consistent color interpretation
         # (darker = more similar across all metrics)
         reverse_colorscale = False
         # All metrics should have darker colors for higher similarity
         # No need to reverse values anymore - we'll use the same scale for all
         fig = go.Figure(
             data=go.Heatmap(
                 z=viz_values,

             continue
         cleaned_columns = [col.replace(".txt", "") for col in pivot.columns]
         # For consistent interpretation: higher values (more similarity) = darker colors
         # Using 'Reds' colormap for all metrics (dark red = high similarity)
+        cmap = "Reds"
         # Format values for display
         text = [
             [f"{val:.2f}" if pd.notnull(val) else "" for val in row]
             for row in pivot.values
         ]
         # Create a copy of the pivot data for visualization
         # For LCS and Semantic Similarity, we need to reverse the color scale
         # so that higher values (more similarity) are darker
         viz_values = pivot.values.copy()
         # Determine if we need to reverse the values for consistent color interpretation
         # (darker = more similar across all metrics)
         reverse_colorscale = False
         # All metrics should have darker colors for higher similarity
         # No need to reverse values anymore - we'll use the same scale for all
         fig = go.Figure(
             data=go.Heatmap(
                 z=viz_values,

pyproject.toml CHANGED

@@ -1,8 +1,33 @@
 [build-system]
 requires = [
-    "setuptools>=42",
-    "Cython>=0.29.21",
-    "numpy>=1.20"
 ]
 build-backend = "setuptools.build_meta"
-backend-path = ["."] # Specifies that setuptools.build_meta is in the current directory's PYTHONPATH

 [build-system]
 requires = [
+    "setuptools>=65",
+    "Cython>=3.0",
+    "numpy>=1.24"
 ]
 build-backend = "setuptools.build_meta"
+[project]
+name = "tibetan-text-metrics-webapp"
+version = "0.4.0"
+description = "Web application for computing text similarity metrics on Tibetan texts"
+readme = "README.md"
+license = {text = "CC-BY-4.0"}
+requires-python = ">=3.10"
+authors = [
+    {name = "Daniel Wojahn", email = "[email protected]"}
+]
+keywords = ["tibetan", "nlp", "text-similarity", "buddhist-texts"]
+classifiers = [
+    "Development Status :: 4 - Beta",
+    "Intended Audience :: Science/Research",
+    "License :: OSI Approved",
+    "Programming Language :: Python :: 3",
+    "Programming Language :: Python :: 3.10",
+    "Programming Language :: Python :: 3.11",
+    "Programming Language :: Python :: 3.12",
+    "Topic :: Text Processing :: Linguistic",
+]
+[project.urls]
+Homepage = "https://github.com/daniel-wojahn/tibetan-text-metrics"
+Repository = "https://github.com/daniel-wojahn/tibetan-text-metrics"

requirements.txt CHANGED

@@ -1,5 +1,6 @@
 # Core application and UI
-gradio
 pandas==2.2.3
 # Plotting and visualization

 # Core application and UI
+# Gradio 5.x (code is forward-compatible with Gradio 6)
+gradio>=5.0.0
 pandas==2.2.3
 # Plotting and visualization

setup.py CHANGED

@@ -1,45 +1,39 @@
 import numpy
 from setuptools import Extension, setup
 from Cython.Build import cythonize
-# It's good practice to specify encoding for portability
-with open("README.md", "r", encoding="utf-8") as fh:
-    long_description = fh.read()
 setup(
-    name="tibetan text metrics webapp",
-    version="0.1.0",
-    author="Daniel Wojahn / Tibetan Text Metrics",
     author_email="[email protected]",
-    description="Cython components for the Tibetan Text Metrics Webapp",
-    long_description=long_description,
-    long_description_content_type="text/markdown",
     url="https://github.com/daniel-wojahn/tibetan-text-metrics",
     ext_modules=cythonize(
         [
             Extension(
-                "pipeline.fast_lcs",  # Module name to import: from pipeline.fast_lcs import ...
                 ["pipeline/fast_lcs.pyx"],
                 include_dirs=[numpy.get_include()],
             )
         ],
-        compiler_directives={'language_level' : "3"} # For Python 3 compatibility
     ),
-    # Indicates that package data (like .pyx files) should be included if specified in MANIFEST.in
-    # For simple cases like this, Cythonize usually handles it.
-    include_package_data=True,
-    # Although this setup.py is in webapp, it's building modules for the 'pipeline' sub-package
-    # We don't list packages here as this setup.py is just for the extension.
-    # The main app will treat 'pipeline' as a regular package.
-    zip_safe=False, # Cython extensions are generally not zip-safe
-    classifiers=[
-        "Programming Language :: Python :: 3",
-        "License :: OSI Approved :: MIT License",
-        "Operating System :: OS Independent",
-    ],
-    python_requires='>=3.8',
     install_requires=[
-        "numpy>=1.20", # Ensure numpy is available for runtime if not just build time
     ],
-    # setup_requires is deprecated, use pyproject.toml for build-system requirements
 )

+"""
+Setup script for building Cython extensions.
+This setup.py is used to compile the fast_lcs Cython extension for
+improved LCS calculation performance. The main project metadata is
+in pyproject.toml.
+Usage:
+    python setup.py build_ext --inplace
+"""
 import numpy
 from setuptools import Extension, setup
 from Cython.Build import cythonize
 setup(
+    name="tibetan-text-metrics-webapp",
+    version="0.4.0",
+    author="Daniel Wojahn",
     author_email="[email protected]",
+    description="Cython LCS extension for Tibetan Text Metrics Webapp",
     url="https://github.com/daniel-wojahn/tibetan-text-metrics",
     ext_modules=cythonize(
         [
             Extension(
+                "pipeline.fast_lcs",
                 ["pipeline/fast_lcs.pyx"],
                 include_dirs=[numpy.get_include()],
             )
         ],
+        compiler_directives={"language_level": "3"}
     ),
+    include_package_data=True,
+    zip_safe=False,
+    python_requires=">=3.10",
     install_requires=[
+        "numpy>=1.24",
     ],
 )
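
Note on the setup.py change: the new docstring says the extension exists to speed up LCS calculation, which implies the app should still work when the extension has not been built. A minimal sketch of that pattern follows, assuming a pure-Python fallback is acceptable. The imported name `compute_lcs_fast` and its two-sequence signature are hypothetical; this diff does not show what `pipeline.fast_lcs` actually exports.

```python
from typing import Sequence

try:
    # Hypothetical export name and signature: the diff does not show the
    # contents of pipeline/fast_lcs.pyx, so adjust this to the real function.
    from pipeline.fast_lcs import compute_lcs_fast as _fast_lcs
except ImportError:
    # Extension not built (e.g. `python setup.py build_ext --inplace` was not run).
    _fast_lcs = None


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Return the length of the longest common subsequence of two token lists."""
    if _fast_lcs is not None:
        return int(_fast_lcs(a, b))

    # Pure-Python fallback: classic dynamic programme, keeping only two rows.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]
```

The fallback runs in O(len(a) * len(b)) time, which is the cost the compiled extension is meant to reduce for long segments.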

theme.py CHANGED

@@ -1,273 +1,408 @@
 import gradio as gr
 from gradio.themes.utils import colors, sizes, fonts
 class TibetanAppTheme(gr.themes.Soft):
     def __init__(self):
         super().__init__(
-            primary_hue=colors.blue,  # Primary interactive elements (e.g., #2563eb)
-            secondary_hue=colors.orange,  # For accents if needed, or default buttons
-            neutral_hue=colors.slate,  # For backgrounds, borders, and text
             font=[
                 fonts.GoogleFont("Inter"),
                 "ui-sans-serif",
                 "system-ui",
                 "sans-serif",
             ],
-            radius_size=sizes.radius_md,  # General radius, can be overridden (16px was for cards)
-            text_size=sizes.text_md,  # Base font size (16px)
         )
         self.theme_vars_for_set = {
             # Global & Body Styles
             "body_background_fill": "#f0f2f5",
             "body_text_color": "#333333",
-            # Card Styles (.gr-group)
             "block_background_fill": "#ffffff",
-            "block_radius": "16px",  # May need to be removed if not a valid settable CSS var
             "block_shadow": "0 4px 12px rgba(0, 0, 0, 0.08)",
             "block_padding": "24px",
             "block_border_width": "0px",
-            # Markdown Styles
-            "body_text_color_subdued": "#4b5563",
-            # Button Styles
             "button_secondary_background_fill": "#ffffff",
             "button_secondary_text_color": "#374151",
             "button_secondary_border_color": "#d1d5db",
             "button_secondary_border_color_hover": "#adb5bd",
             "button_secondary_background_fill_hover": "#f9fafb",
-            # Primary Button
             "button_primary_background_fill": "#2563eb",
             "button_primary_text_color": "#ffffff",
             "button_primary_border_color": "transparent",
             "button_primary_background_fill_hover": "#1d4ed8",
-            # HR style
             "border_color_accent_subdued": "#e5e7eb",
-        }
-        super().set(**self.theme_vars_for_set)
-        # Store CSS overrides; these will be converted to a string and applied via gr.Blocks(css=...)
-        self.css_overrides = {
-            ".gradio-container, .gr-block, .gr-markdown, label, input, .gr-slider, .gr-radio, .gr-button": {
-                "font-family": ", ".join(self.font),
-                "font-size": "16px !important",
-                "line-height": "1.6 !important",
-                "color": "#333333 !important",
-            },
-            ".gr-group": {"margin-bottom": "24px !important"},  # min-height removed
-            ".gr-markdown": {
-                "background": "transparent !important",
-                "font-size": "1em !important",
-                "margin-bottom": "16px !important",
-            },
-            ".gr-markdown h1": {
-                "font-size": "28px !important",
-                "font-weight": "600 !important",
-                "margin-bottom": "8px !important",
-                "color": "#111827 !important",
-            },
-            ".gr-markdown h2": {
-                "font-size": "26px !important",
-                "font-weight": "600 !important",
-                "color": "var(--primary-600, #2563eb) !important",
-                "margin-top": "32px !important",
-                "margin-bottom": "16px !important",
-            },
-            ".gr-markdown h3": {
-                "font-size": "22px !important",
-                "font-weight": "600 !important",
-                "color": "#1f2937 !important",
-                "margin-top": "24px !important",
-                "margin-bottom": "12px !important",
-            },
-            ".gr-markdown p, .gr-markdown span": {
-                "font-size": "16px !important",
-                "color": "#4b5563 !important",
-            },
-            ".gr-button button": {
-                "border-radius": "8px !important",
-                "padding": "10px 20px !important",
-                "font-weight": "500 !important",
-                "box-shadow": "0 1px 2px 0 rgba(0, 0, 0, 0.05) !important",
-                "border": "1px solid #d1d5db !important",
-                "background-color": "#ffffff !important",
-                "color": "#374151 !important",
-            },
-            "#run-btn": {
-                "background": "var(--button-primary-background-fill) !important",
-                "color": "var(--button-primary-text-color) !important",
-                "font-weight": "bold !important",
-                "font-size": "24px !important",
-                "border": "none !important",
-                "box-shadow": "var(--button-primary-shadow) !important",
-            },
-            "#run-btn:hover": {  # Changed selector
-                "background": "var(--button-primary-background-fill-hover) !important",
-                "box-shadow": "0px 4px 12px rgba(0, 0, 0, 0.15) !important",
-                "transform": "translateY(-1px) !important",
-            },
-            ".gr-button button:hover": {
-                "background-color": "#f9fafb !important",
-                "border-color": "#adb5bd !important",
-            },
-            "hr": {
-                "margin": "32px 0 !important",
-                "border": "none !important",
-                "border-top": "1px solid var(--border-color-accent-subdued) !important",
-            },
-            ".gr-slider, .gr-radio, .gr-file": {"margin-bottom": "20px !important"},
-            ".gr-radio .gr-form button": {
-                "background-color": "#f3f4f6 !important",
-                "color": "#374151 !important",
-                "border": "1px solid #d1d5db !important",
-                "border-radius": "6px !important",
-                "padding": "8px 16px !important",
-                "font-weight": "500 !important",
-            },
-            ".gr-radio .gr-form button:hover": {
-                "background-color": "#e5e7eb !important",
-                "border-color": "#9ca3af !important",
-            },
-            ".gr-radio .gr-form button.selected": {
-                "background-color": "var(--primary-500, #3b82f6) !important",
-                "color": "#ffffff !important",
-                "border-color": "var(--primary-500, #3b82f6) !important",
-            },
-            ".gr-radio .gr-form button.selected:hover": {
-                "background-color": "var(--primary-600, #2563eb) !important",
-                "border-color": "var(--primary-600, #2563eb) !important",
-            },
-            "#semantic-radio-group span": {  # General selector, refined size
-                "font-size": "17px !important",
-                "font-weight": "500 !important",
-            },
-            "#semantic-radio-group div": {  # General selector, refined size
-                "font-size": "14px !important"
-            },
-            # Row and Column flex styles for equal height
-            "#steps-row": {
-                "display": "flex !important",
-                "align-items": "stretch !important",
-            },
-            ".step-column": {
-                "display": "flex !important",
-                "flex-direction": "column !important",
-            },
-            ".step-column > .gr-group": {
-                "flex-grow": "1 !important",
-                "display": "flex !important",
-                "flex-direction": "column !important",
-            },
-            ".tabs > .tab-nav": {"border-bottom": "1px solid #d1d5db !important"},
-            ".tabs > .tab-nav > button.selected": {
-                "border-bottom": "2px solid var(--primary-500) !important",
-                "color": "var(--primary-500) !important",
-                "background-color": "transparent !important",
-            },
-            ".tabs > .tab-nav > button": {
-                "color": "#6b7280 !important",
-                "background-color": "transparent !important",
-                "padding": "10px 15px !important",
-                "border-bottom": "2px solid transparent !important",
-            },
-            # Custom styling for metric accordions
-            ".metric-info-accordion": {
-                "border-left": "4px solid #3B82F6 !important",
-                "margin-bottom": "1rem !important",
-                "background-color": "#F8FAFC !important",
-                "border-radius": "6px !important",
-                "overflow": "hidden !important",
-            },
-            ".jaccard-info": {
-                "border-left-color": "#3B82F6 !important",  # Blue
-            },
-            ".lcs-info": {
-                "border-left-color": "#10B981 !important",  # Green
-            },
-            ".semantic-info": {
-                "border-left-color": "#8B5CF6 !important",  # Purple
-            },
-            ".wordcount-info": {
-                "border-left-color": "#EC4899 !important",  # Pink
-            },
-            # Accordion header styling
-            ".metric-info-accordion > .label-wrap": {
-                "font-weight": "600 !important",
-                "padding": "12px 16px !important",
-                "background-color": "#F1F5F9 !important",
-                "border-bottom": "1px solid #E2E8F0 !important",
-            },
-            # Accordion content styling
-            ".metric-info-accordion > .wrap": {
-                "padding": "16px !important",
-            },
-            # Word count plot styling - full width
-            ".tabs > .tab-content > div[data-testid='tabitem'] > .plot": {
-                "width": "100% !important",
-            },
-            # Heatmap plot styling - responsive sizing
-            ".tabs > .tab-content > div[data-testid='tabitem'] > .plotly": {
-                "width": "100% !important",
-                "height": "auto !important",
-            },
-            # Specific heatmap container styling
-            ".metric-heatmap": {
-                "max-width": "100% !important",
-                "overflow-x": "auto !important",
-            },
-            # LLM Analysis styling
-            ".llm-analysis": {
-                "background-color": "#f8f9fa !important",
-                "border-left": "4px solid #3B82F6 !important",
-                "border-radius": "8px !important",
-                "padding": "20px 24px !important",
-                "margin": "16px 0 !important",
-                "box-shadow": "0 2px 8px rgba(0, 0, 0, 0.05) !important",
-            },
-            ".llm-analysis h2": {
-                "color": "#1e40af !important",
-                "font-size": "24px !important",
-                "margin-bottom": "16px !important",
-                "border-bottom": "1px solid #e5e7eb !important",
-                "padding-bottom": "8px !important",
-            },
-            ".llm-analysis h3, .llm-analysis h4": {
-                "color": "#1e3a8a !important",
-                "margin-top": "20px !important",
-                "margin-bottom": "12px !important",
-            },
-            ".llm-analysis p": {
-                "line-height": "1.7 !important",
-                "margin-bottom": "12px !important",
-            },
-            ".llm-analysis ul, .llm-analysis ol": {
-                "margin-left": "24px !important",
-                "margin-bottom": "16px !important",
-            },
-            ".llm-analysis li": {
-                "margin-bottom": "6px !important",
-            },
-            ".llm-analysis strong, .llm-analysis b": {
-                "color": "#1f2937 !important",
-                "font-weight": "600 !important",
-            },
         }
     def get_css_string(self) -> str:
-        """Converts the self.css_overrides dictionary into a CSS string."""
-        css_parts = []
-        for selector, properties in self.css_overrides.items():
-            props_str = "\n".join(
-                [f"    {prop}: {value};" for prop, value in properties.items()]
-            )
-            css_parts.append(f"{selector} {{\n{props_str}\n}}")
-        return "\n\n".join(css_parts)
 # Instantiate the theme for easy import

+"""
+Tibetan Text Metrics Theme - Gradio 6 Compatible
+This theme provides a clean, professional look for the TTM application.
+Updated for Gradio 6.x compatibility where theme/css are passed to launch().
+"""
 import gradio as gr
 from gradio.themes.utils import colors, sizes, fonts
 class TibetanAppTheme(gr.themes.Soft):
+    """
+    Custom theme for Tibetan Text Metrics application.
+    Gradio 6 Migration Notes:
+    - Theme is now passed to demo.launch(theme=...) instead of gr.Blocks(theme=...)
+    - CSS is now passed to demo.launch(css=...) instead of gr.Blocks(css=...)
+    - Use elem_id and elem_classes for stable CSS targeting
+    """
     def __init__(self):
         super().__init__(
+            primary_hue=colors.blue,
+            secondary_hue=colors.orange,
+            neutral_hue=colors.slate,
             font=[
                 fonts.GoogleFont("Inter"),
                 "ui-sans-serif",
                 "system-ui",
                 "sans-serif",
             ],
+            radius_size=sizes.radius_md,
+            text_size=sizes.text_md,
         )
+        # Theme variable overrides using Gradio's theming system
         self.theme_vars_for_set = {
             # Global & Body Styles
             "body_background_fill": "#f0f2f5",
             "body_text_color": "#333333",
+            "body_text_color_subdued": "#4b5563",
+            # Block/Card Styles
             "block_background_fill": "#ffffff",
+            "block_radius": "16px",
             "block_shadow": "0 4px 12px rgba(0, 0, 0, 0.08)",
             "block_padding": "24px",
             "block_border_width": "0px",
+            # Button Styles - Secondary
             "button_secondary_background_fill": "#ffffff",
             "button_secondary_text_color": "#374151",
             "button_secondary_border_color": "#d1d5db",
             "button_secondary_border_color_hover": "#adb5bd",
             "button_secondary_background_fill_hover": "#f9fafb",
+            # Button Styles - Primary
             "button_primary_background_fill": "#2563eb",
             "button_primary_text_color": "#ffffff",
             "button_primary_border_color": "transparent",
             "button_primary_background_fill_hover": "#1d4ed8",
+            # Border colors
             "border_color_accent_subdued": "#e5e7eb",
+            "border_color_primary": "#d1d5db",
+            # Input styles
+            "input_background_fill": "#ffffff",
+            "input_border_color": "#d1d5db",
+            "input_border_color_focus": "#2563eb",
         }
+        super().set(**self.theme_vars_for_set)
     def get_css_string(self) -> str:
+        """
+        Returns custom CSS string for additional styling.
+        Gradio 6 uses different class naming conventions. This CSS uses:
+        - elem_id selectors (#id) for specific components
+        - elem_classes selectors (.class) for groups of components
+        - Gradio 6 native classes where stable
+        """
+        return """
+/* ============================================
+   GLOBAL STYLES
+   ============================================ */
+.gradio-container {
+    font-family: 'Inter', ui-sans-serif, system-ui, sans-serif !important;
+    max-width: 1400px !important;
+    margin: 0 auto !important;
+}
+/* ============================================
+   TYPOGRAPHY
+   ============================================ */
+h1 {
+    font-size: 28px !important;
+    font-weight: 600 !important;
+    color: #111827 !important;
+    margin-bottom: 8px !important;
+}
+h2 {
+    font-size: 24px !important;
+    font-weight: 600 !important;
+    color: var(--primary-600, #2563eb) !important;
+    margin-top: 24px !important;
+    margin-bottom: 16px !important;
+}
+h3 {
+    font-size: 20px !important;
+    font-weight: 600 !important;
+    color: #1f2937 !important;
+    margin-top: 20px !important;
+    margin-bottom: 12px !important;
+}
+/* ============================================
+   LAYOUT - Steps Row
+   ============================================ */
+#steps-row {
+    display: flex !important;
+    align-items: stretch !important;
+    gap: 24px !important;
+}
+.step-column {
+    display: flex !important;
+    flex-direction: column !important;
+    flex: 1 !important;
+}
+.step-box {
+    padding: 1.5rem !important;
+    flex-grow: 1 !important;
+    display: flex !important;
+    flex-direction: column !important;
+}
+/* ============================================
+   BUTTONS
+   ============================================ */
+/* Primary action buttons */
+#run-btn-quick, #run-btn-custom {
+    background: var(--button-primary-background-fill, #2563eb) !important;
+    color: var(--button-primary-text-color, #ffffff) !important;
+    font-weight: 600 !important;
+    font-size: 18px !important;
+    padding: 12px 24px !important;
+    border: none !important;
+    border-radius: 8px !important;
+    box-shadow: 0 2px 4px rgba(37, 99, 235, 0.2) !important;
+    transition: all 0.2s ease !important;
+    margin-top: 16px !important;
+}
+#run-btn-quick:hover, #run-btn-custom:hover {
+    background: var(--button-primary-background-fill-hover, #1d4ed8) !important;
+    box-shadow: 0 4px 12px rgba(37, 99, 235, 0.3) !important;
+    transform: translateY(-1px) !important;
+}
+/* Secondary buttons */
+button.secondary {
+    background-color: #ffffff !important;
+    color: #374151 !important;
+    border: 1px solid #d1d5db !important;
+    border-radius: 8px !important;
+    padding: 10px 20px !important;
+    font-weight: 500 !important;
+}
+button.secondary:hover {
+    background-color: #f9fafb !important;
+    border-color: #adb5bd !important;
+}
+/* ============================================
+   TABS
+   ============================================ */
+.tabs {
+    margin-top: 8px !important;
+}
+.tab-nav {
+    border-bottom: 1px solid #e5e7eb !important;
+    margin-bottom: 16px !important;
+}
+.tab-nav button {
+    color: #6b7280 !important;
+    background-color: transparent !important;
+    padding: 12px 20px !important;
+    border: none !important;
+    border-bottom: 2px solid transparent !important;
+    font-weight: 500 !important;
+    transition: all 0.2s ease !important;
+}
+.tab-nav button:hover {
+    color: #374151 !important;
+}
+.tab-nav button.selected {
+    border-bottom: 2px solid var(--primary-500, #3b82f6) !important;
+    color: var(--primary-600, #2563eb) !important;
+    background-color: transparent !important;
+}
+/* ============================================
+   ACCORDIONS
+   ============================================ */
+.accordion {
+    border: 1px solid #e5e7eb !important;
+    border-radius: 8px !important;
+    margin-bottom: 12px !important;
+    overflow: hidden !important;
+}
+/* Metric info accordions with colored borders */
+.metric-info-accordion {
+    border-left: 4px solid #3B82F6 !important;
+    margin-bottom: 1rem !important;
+    background-color: #F8FAFC !important;
+    border-radius: 6px !important;
+}
+.jaccard-info { border-left-color: #3B82F6 !important; }
+.lcs-info { border-left-color: #10B981 !important; }
+.fuzzy-info { border-left-color: #F59E0B !important; }
+.semantic-info { border-left-color: #8B5CF6 !important; }
+.wordcount-info { border-left-color: #EC4899 !important; }
+/* ============================================
+   FORM ELEMENTS
+   ============================================ */
+/* Radio buttons */
+.radio-group label {
+    display: flex !important;
+    align-items: center !important;
+    padding: 10px 16px !important;
+    border: 1px solid #e5e7eb !important;
+    border-radius: 8px !important;
+    margin-bottom: 8px !important;
+    cursor: pointer !important;
+    transition: all 0.2s ease !important;
+}
+.radio-group label:hover {
+    background-color: #f9fafb !important;
+    border-color: #d1d5db !important;
+}
+.radio-group input:checked + label,
+.radio-group label.selected {
+    background-color: var(--primary-50, #eff6ff) !important;
+    border-color: var(--primary-500, #3b82f6) !important;
+}
+/* Dropdowns */
+select, .dropdown {
+    border: 1px solid #d1d5db !important;
+    border-radius: 8px !important;
+    padding: 10px 12px !important;
+    background-color: #ffffff !important;
+}
+/* Checkboxes */
+input[type="checkbox"] {
+    width: 18px !important;
+    height: 18px !important;
+    accent-color: var(--primary-500, #3b82f6) !important;
+}
+/* ============================================
+   PRESET TABLE
+   ============================================ */
+.preset-table table {
+    font-size: 14px !important;
+    margin-top: 12px !important;
+    width: 100% !important;
+    border-collapse: collapse !important;
+}
+.preset-table th, .preset-table td {
+    padding: 10px 14px !important;
+    text-align: center !important;
+    border-bottom: 1px solid #e5e7eb !important;
+}
+.preset-table th {
+    background-color: #f9fafb !important;
+    font-weight: 600 !important;
+    color: #374151 !important;
+}
+.preset-table tr:hover {
+    background-color: #f9fafb !important;
+}
+/* ============================================
+   RESULTS SECTION
+   ============================================ */
+/* Heatmaps and plots */
+.plot-container {
+    width: 100% !important;
+    overflow-x: auto !important;
+}
+.metric-heatmap {
+    max-width: 100% !important;
+}
+/* ============================================
+   LLM ANALYSIS OUTPUT
+   ============================================ */
+#llm-analysis {
+    background-color: #f8f9fa !important;
+    border-left: 4px solid #3B82F6 !important;
+    border-radius: 8px !important;
+    padding: 20px 24px !important;
+    margin: 16px 0 !important;
+    box-shadow: 0 2px 8px rgba(0, 0, 0, 0.05) !important;
+}
+#llm-analysis h2 {
+    color: #1e40af !important;
+    font-size: 22px !important;
+    margin-bottom: 16px !important;
+    border-bottom: 1px solid #e5e7eb !important;
+    padding-bottom: 8px !important;
+}
+#llm-analysis h3, #llm-analysis h4 {
+    color: #1e3a8a !important;
+    margin-top: 18px !important;
+    margin-bottom: 10px !important;
+}
+#llm-analysis p {
+    line-height: 1.7 !important;
+    margin-bottom: 12px !important;
+    color: #374151 !important;
+}
+#llm-analysis ul, #llm-analysis ol {
+    margin-left: 24px !important;
+    margin-bottom: 16px !important;
+}
+#llm-analysis li {
+    margin-bottom: 6px !important;
+}
+#llm-analysis strong, #llm-analysis b {
+    color: #1f2937 !important;
+    font-weight: 600 !important;
+}
+/* ============================================
+   RESPONSIVE ADJUSTMENTS
+   ============================================ */
+@media (max-width: 768px) {
+    #steps-row {
+        flex-direction: column !important;
+    }
+    .step-column {
+        width: 100% !important;
+    }
+    #run-btn-quick, #run-btn-custom {
+        font-size: 16px !important;
+        padding: 10px 20px !important;
+    }
+}
+/* ============================================
+   UTILITY CLASSES
+   ============================================ */
+.custom-header {
+    margin-bottom: 12px !important;
+    color: #374151 !important;
+}
+.info-text {
+    font-size: 14px !important;
+    color: #6b7280 !important;
+    margin-top: 4px !important;
+}
+"""
 # Instantiate the theme for easy import
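
Note on the theme.py change: the migration notes in the class docstring say that Gradio 6 expects `theme` and `css` on `demo.launch()` rather than on `gr.Blocks()`, while requirements.txt still allows Gradio 5.x. One way to support both is to route the styling arguments by major version. The sketch below is an illustration under that assumption, not the code used in app.py; the helper name `theme_kwargs_for` is hypothetical.

```python
from typing import Any, Dict, Tuple


def theme_kwargs_for(
    gradio_version: str, theme: Any, css: str
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split styling arguments into (Blocks kwargs, launch kwargs).

    Hypothetical helper: Gradio 6 and later receive theme/css via launch(),
    as described in the theme.py migration notes; earlier versions receive
    them via the Blocks constructor.
    """
    major = int(gradio_version.split(".")[0])
    styling = {"theme": theme, "css": css}
    if major >= 6:
        return {}, styling
    return styling, {}


# Intended use (left as comments so the sketch runs without Gradio installed):
#
#   import gradio as gr
#   theme = TibetanAppTheme()
#   blocks_kw, launch_kw = theme_kwargs_for(
#       gr.__version__, theme, theme.get_css_string()
#   )
#   with gr.Blocks(**blocks_kw) as demo:
#       ...
#   demo.launch(**launch_kw)
```

Keeping the version check in one helper means the rest of the app does not need to know which Gradio major version is installed.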