daniel-wojahn committed on
Commit 2e45dc8 · 1 Parent(s): b44d470
Files changed (4)
  1. academic_article.md +51 -0
  2. app.py +0 -14
  3. pipeline/metrics.py +5 -10
  4. pipeline/process.py +13 -1
academic_article.md ADDED
@@ -0,0 +1,51 @@
+ # A Computational Toolkit for Tibetan Textual Analysis: Methods and Applications of the Tibetan Text Metrics (TTM) Application
+
+ **Abstract:** The study of the Tibetan textual tradition, with its vast and complex corpus, presents unique challenges for quantitative analysis. Traditional philological methods, while essential, can be enhanced by computational tools that reveal large-scale patterns of similarity and variation. This paper introduces the Tibetan Text Metrics (TTM) web application, an accessible, open-source toolkit designed to bridge this gap. TTM provides a suite of similarity measures, including Jaccard similarity, Normalized Longest Common Subsequence (LCS), TF-IDF cosine similarity, and semantic similarity based on embedding models (FastText and Sentence Transformers). A novel feature of the application is its AI-powered interpretation engine, which translates quantitative results into scholarly insights, making complex metrics accessible to a broader audience. By offering a user-friendly interface for sophisticated textual analysis, TTM empowers researchers to explore manuscript relationships, track textual transmission, and uncover new avenues for inquiry within Tibetan studies and the broader digital humanities landscape.
+
+ ## 1. Introduction
+
+ ### 1.1. The Challenge of Tibetan Textual Scholarship
+
+ The Tibetan literary corpus is one of the world's most extensive, encompassing centuries of philosophy, history, jurisprudence, and religious doctrine. The transmission of these texts has been a complex process, involving manual copying by scribes across different regions and monastic traditions. This has resulted in a rich but challenging textual landscape, characterized by numerous variations, scribal errors, and divergent manuscript lineages. For scholars, tracing the history of a text and understanding its evolution requires meticulous philological work. While traditional methods are indispensable, the sheer scale of the available material necessitates computational approaches that can identify patterns and relationships that are not immediately apparent to the human eye.
+
+ ### 1.2. Digital Humanities and Under-Resourced Languages
+
+ The rise of digital humanities has brought a wealth of computational tools to literary analysis. However, many of these tools are designed for well-resourced languages like English, leaving languages with fewer digital resources, such as Tibetan, underserved. The unique characteristics of the Tibetan script and language, including its syllabic nature and complex orthography, require specialized tools for effective processing and analysis. The Tibetan Text Metrics (TTM) project addresses this need by providing a tailored solution that respects the linguistic nuances of Tibetan, thereby making a vital contribution to the growing field of global digital humanities.
+
+ ### 1.3. The Tibetan Text Metrics (TTM) Application
+
+ This paper introduces the Tibetan Text Metrics (TTM) web application, a user-friendly, open-source tool designed to make sophisticated textual analysis accessible to scholars of Tibetan, regardless of their technical background. The application allows researchers to upload Tibetan texts, automatically segment them into meaningful sections, and compare them using a range of quantitative metrics. This article will detail the methodologies underpinning the TTM application, describe its key features, and demonstrate its practical utility through a case study. By doing so, we aim to show how TTM can serve as a valuable assistant in the scholar's toolkit, augmenting traditional research methods and opening up new possibilities for the study of Tibetan textual history.
+
+ ## 2. Methodology: A Multi-faceted Approach to Text Similarity
+
+ To provide a holistic view of textual relationships, the TTM application employs a multi-faceted methodology that combines lexical, structural, and semantic analysis. This approach is built upon a foundation of Tibetan-specific text processing, ensuring that each metric is applied in a linguistically sound manner.
+
+ ### 2.1. Text Pre-processing and Segmentation
+
+ Meaningful computational analysis begins with careful pre-processing. The TTM application automates several crucial steps to prepare texts for comparison.
+
+ **Segmentation:** Comparing entire texts, especially long ones, can obscure significant internal variations. TTM therefore defaults to a chapter-level analysis. It automatically segments texts using the Tibetan *sbrul shad* (༈), a common marker for section breaks. This allows for a more granular comparison, revealing similarities and differences at a structural level that mirrors the text's own divisions. If no marker is found, the application treats the entire file as a single segment and issues a warning.
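+
+ The idea can be illustrated with a minimal sketch (illustrative only; the function and variable names below are not taken from the TTM codebase):
+
+ ```python
+ def segment_by_sbrul_shad(raw_text: str) -> list[str]:
+     """Split a Tibetan text into sections at each sbrul shad marker."""
+     marker = "༈"
+     parts = [part.strip() for part in raw_text.split(marker)]
+     segments = [part for part in parts if part]
+     if len(segments) <= 1:
+         # Mirrors the fallback described above: no marker found, so the
+         # whole file is treated as a single segment and a warning is issued.
+         print("Warning: no sbrul shad found; using the whole file as one segment.")
+     return segments
+ ```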
+
+ **Tokenization:** To analyze a text computationally, it must be broken down into individual units, or tokens. In the Tibetan script, the *tsheg* (་) delimits syllables rather than words, so simple whitespace tokenization is inadequate. TTM leverages the `botok` library, a state-of-the-art tokenizer for Tibetan, which accurately identifies word boundaries, ensuring that the subsequent analysis is based on meaningful linguistic units.
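+
+ A typical use of `botok` looks roughly like the following (a sketch based on the library's standard interface; TTM wraps this step in its own pipeline code):
+
+ ```python
+ from botok import WordTokenizer
+
+ wt = WordTokenizer()  # loads botok's default Tibetan word list on first use
+
+ def tokenize(text: str) -> list[str]:
+     """Return the surface forms of the words botok finds in a Tibetan string."""
+     tokens = wt.tokenize(text, split_affixes=False)
+     return [token.text for token in tokens]
+
+ words = tokenize("བཀྲ་ཤིས་བདེ་ལེགས།")
+ ```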
+
+ **Stopword Filtering:** Many words in a language (e.g., particles, pronouns) are grammatically necessary but carry little unique semantic weight. These "stopwords" can skew similarity scores by creating an illusion of similarity based on common grammatical structures. TTM provides two levels of optional stopword filtering: a *Standard* list containing the most common particles, and an *Aggressive* list that also removes frequent function words. This allows researchers to focus their analysis on the substantive vocabulary of a text.
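+
+ The filtering step itself is simple once a list is chosen; the handful of particles below is only a stand-in for TTM's curated lists:
+
+ ```python
+ # Hypothetical miniature stopword set; TTM ships much larger "standard"
+ # and "aggressive" lists of Tibetan particles and function words.
+ STOPWORDS = {"གི", "ཀྱི", "གྱི", "ནི", "དང", "ལ"}
+
+ def remove_stopwords(tokens: list[str]) -> list[str]:
+     return [tok for tok in tokens if tok not in STOPWORDS]
+ ```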
+
+ ### 2.2. Lexical and Thematic Similarity Metrics
+
+ These metrics focus on the vocabulary and key terms within the texts.
+
+ **Jaccard Similarity:** This metric measures the direct overlap in vocabulary between two segments. It is calculated as the size of the intersection of the word sets divided by the size of their union. The result is a score from 0 to 1, representing the proportion of unique words that are common to both texts. Jaccard similarity is a straightforward and effective measure of shared vocabulary, independent of word order or frequency.
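+
+ In code, the definition is only a few lines (a direct transcription of the formula above, not TTM's exact implementation):
+
+ ```python
+ def jaccard_similarity(tokens_a: list[str], tokens_b: list[str]) -> float:
+     """|A ∩ B| / |A ∪ B| over the sets of unique tokens in two segments."""
+     set_a, set_b = set(tokens_a), set(tokens_b)
+     if not set_a and not set_b:
+         return 0.0  # two empty segments: define the overlap as zero
+     return len(set_a & set_b) / len(set_a | set_b)
+ ```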
+
+ **TF-IDF Cosine Similarity:** Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (a corpus). It gives higher weight to terms that are frequent in one document but rare across the corpus, thus identifying the words that are most characteristic of that document. TTM calculates a TF-IDF vector for each text segment and then uses cosine similarity to measure the angle between these vectors. A higher score indicates that two segments share more of the same characteristic terms, suggesting a thematic similarity.
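+
+ Because the segments have already been tokenized by `botok`, they can be handed to scikit-learn as space-joined tokens with a plain whitespace tokenizer. The sketch below compares just two segments; the application builds its TF-IDF vocabulary over the whole corpus of segments:
+
+ ```python
+ from sklearn.feature_extraction.text import TfidfVectorizer
+ from sklearn.metrics.pairwise import cosine_similarity
+
+ def tfidf_cosine(tokens_a: list[str], tokens_b: list[str]) -> float:
+     """Cosine similarity between the TF-IDF vectors of two token lists."""
+     docs = [" ".join(tokens_a), " ".join(tokens_b)]
+     # Whitespace tokenizer: the Tibetan-aware word segmentation has already happened.
+     vectorizer = TfidfVectorizer(tokenizer=str.split, token_pattern=None)
+     matrix = vectorizer.fit_transform(docs)
+     return float(cosine_similarity(matrix[0], matrix[1])[0, 0])
+ ```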
+
+ ### 2.3. Structural Similarity Metric
+
+ **Normalized Longest Common Subsequence (LCS):** This metric moves beyond vocabulary to assess structural parallels. The LCS algorithm finds the longest sequence of words that appears in both texts in the same relative order, though not necessarily contiguously. For example, the LCS of "the brown fox jumps" and "the lazy brown dog jumps" is "the brown jumps". TTM normalizes the length of this subsequence to produce a score that reflects shared phrasing and narrative structure. A high LCS score can indicate direct textual borrowing or a shared structural template. To ensure performance, the LCS calculation is optimized with a custom Cython implementation.
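+
+ The underlying computation is the classic dynamic-programming LCS, shown here in pure Python with one plausible normalization (TTM's Cython version is equivalent in spirit but faster, and its normalization may differ in detail):
+
+ ```python
+ def normalized_lcs(tokens_a: list[str], tokens_b: list[str]) -> float:
+     """LCS length over word tokens, normalized by the average segment length."""
+     m, n = len(tokens_a), len(tokens_b)
+     if m == 0 or n == 0:
+         return 0.0
+     # Row-by-row DP table: prev[j] is the LCS length of tokens_a[:i-1] and
+     # tokens_b[:j]; curr[j] is the same quantity for tokens_a[:i].
+     prev = [0] * (n + 1)
+     for i in range(1, m + 1):
+         curr = [0] * (n + 1)
+         for j in range(1, n + 1):
+             if tokens_a[i - 1] == tokens_b[j - 1]:
+                 curr[j] = prev[j - 1] + 1
+             else:
+                 curr[j] = max(prev[j], curr[j - 1])
+         prev = curr
+     return prev[n] / ((m + n) / 2)
+ ```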
+
+ ### 2.4. Semantic Similarity
+
+ To capture similarities in meaning that may not be apparent from lexical overlap, TTM employs semantic similarity using word and sentence embeddings.
+
+ **FastText Embeddings:** The application utilizes the official Facebook FastText model for Tibetan, which represents words as dense vectors in a high-dimensional space. A key advantage of FastText is its use of character n-grams, allowing it to generate meaningful vectors even for out-of-vocabulary words, a crucial feature for handling the orthographic variations common in Tibetan manuscripts. To create a single vector for an entire text segment, TTM applies TF-IDF-weighted averaging to the word vectors, giving more weight to the embeddings of characteristic terms.
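+
+ A simplified version of this weighting scheme is sketched below, assuming the pre-trained Tibetan vectors (e.g. `cc.bo.300.bin` from fasttext.cc) have been downloaded locally; TTM's own weighting and pooling details may differ:
+
+ ```python
+ import numpy as np
+ import fasttext
+ from sklearn.feature_extraction.text import TfidfVectorizer
+
+ ft_model = fasttext.load_model("cc.bo.300.bin")  # official Tibetan FastText vectors
+
+ def segment_embedding(tokens: list[str], corpus: list[list[str]]) -> np.ndarray:
+     """IDF-weighted average of FastText word vectors for one tokenized segment."""
+     if not tokens:
+         return np.zeros(ft_model.get_dimension())
+     vectorizer = TfidfVectorizer(tokenizer=str.split, token_pattern=None)
+     vectorizer.fit([" ".join(toks) for toks in corpus])
+     idf = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))
+     weights = np.array([idf.get(tok, 1.0) for tok in tokens])
+     vectors = np.array([ft_model.get_word_vector(tok) for tok in tokens])
+     return (weights[:, None] * vectors).sum(axis=0) / weights.sum()
+ ```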
+
+ **Hugging Face Models:** In addition to FastText, TTM integrates the `sentence-transformers` library, providing access to a wide range of pre-trained models from the Hugging Face Hub. This allows researchers to leverage powerful, context-aware models like LaBSE or XLM-RoBERTa, which can capture nuanced semantic relationships between entire sentences and paragraphs.
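+
+ Scoring a pair of segments with such a model takes only a few lines; the model ID below is one of the options mentioned above and can be swapped for any other Hub model supported by `sentence-transformers`:
+
+ ```python
+ from sentence_transformers import SentenceTransformer, util
+
+ model = SentenceTransformer("sentence-transformers/LaBSE")
+
+ def semantic_similarity(segment_a: str, segment_b: str) -> float:
+     """Cosine similarity between multilingual sentence embeddings."""
+     embeddings = model.encode([segment_a, segment_b], convert_to_tensor=True)
+     return util.cos_sim(embeddings[0], embeddings[1]).item()
+ ```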
app.py CHANGED
@@ -133,20 +133,6 @@ def main_interface():
                     variant="primary",
                     elem_id="interpret-btn"
                 )
-
-                # About AI Analysis section
-                with gr.Accordion("ℹ️ About AI Analysis", open=False):
-                    gr.Markdown("""
-                    ### AI-Powered Analysis
-
-                    The AI analysis is powered by **Mistral 7B Instruct** via the OpenRouter API. To use this feature:
-
-                    1. Get an API key from [OpenRouter](https://openrouter.ai/keys)
-                    2. Create a `.env` file in the webapp directory
-                    3. Add: `OPENROUTER_API_KEY=your_api_key_here`
-
-                    The AI will automatically analyze your text similarities and provide insights into patterns and relationships.
-                    """)
                 # Create a placeholder message with proper formatting and structure
                 initial_message = """
                 ## Analysis of Tibetan Text Similarity Metrics
pipeline/metrics.py CHANGED
@@ -176,12 +176,13 @@ def compute_semantic_similarity(
 
 def compute_all_metrics(
     texts: Dict[str, str],
+    token_lists: Dict[str, List[str]],
     model=None,
-    enable_semantic: bool = True,
+    enable_semantic: bool = True,
     model_type: str = "fasttext",
     use_stopwords: bool = True,
     use_lite_stopwords: bool = False,
-    fasttext_tokenize_fn=None,
+    fasttext_tokenize_fn=None,
     batch_size: int = 32,
     show_progress_bar: bool = False
 ) -> pd.DataFrame:
@@ -203,8 +204,6 @@
     """
     files = list(texts.keys())
     results = []
-    # Prepare token lists (always use tokenize_texts for raw Unicode)
-    token_lists = {}  # Stores botok tokens for each text_id, used for Jaccard, LCS, and semantic sim
     corpus_for_sklearn_tfidf = []  # For storing space-joined tokens for scikit-learn's TF-IDF
 
     # For FastText TF-IDF related statistics
@@ -222,12 +221,8 @@
     stopwords_set_for_fasttext_stats_calc = TIBETAN_STOPWORDS_SET
 
     for fname, content in texts.items():
-        current_tokens_for_file = []
-        tokenized_content_list_of_lists = tokenize_texts([content])
-        if tokenized_content_list_of_lists and tokenized_content_list_of_lists[0]:
-            current_tokens_for_file = tokenized_content_list_of_lists[0]
-        token_lists[fname] = current_tokens_for_file
-
+        # Use the pre-computed tokens from the token_lists dictionary
+        current_tokens_for_file = token_lists.get(fname, [])
         corpus_for_sklearn_tfidf.append(" ".join(current_tokens_for_file) if current_tokens_for_file else "")
 
     if model_type == "fasttext":
pipeline/process.py CHANGED
@@ -205,6 +205,19 @@ def process_texts(
     if not segment_texts:
         logger.error("No valid text segments found in any of the uploaded files.")
         return pd.DataFrame(), pd.DataFrame(), "No valid text segments found in the uploaded files. Please check your files and try again."
+    # Tokenize all segments at once for efficiency
+    if progress_callback is not None:
+        try:
+            progress_callback(0.42, desc="Tokenizing all text segments...")
+        except Exception as e:
+            logger.warning(f"Progress callback error (non-critical): {e}")
+
+    all_segment_ids = list(segment_texts.keys())
+    all_segment_contents = list(segment_texts.values())
+    tokenized_segments_list = tokenize_texts(all_segment_contents)
+
+    segment_tokens = dict(zip(all_segment_ids, tokenized_segments_list))
+
     # Group chapters by filename (preserving order)
     if progress_callback is not None:
         try:
@@ -283,7 +296,6 @@
             pair_metrics = compute_all_metrics(
                 texts={seg1: segment_texts[seg1], seg2: segment_texts[seg2]},
                 token_lists={seg1: segment_tokens[seg1], seg2: segment_tokens[seg2]},
-                metrics_to_compute=["jaccard", "lcs", "tfidf"],
                 model=model,
                 enable_semantic=enable_semantic,
                 model_type=model_type,