daniel-wojahn committed on
Commit 2e45dc8 · 1 Parent(s): b44d470
Files changed (4)
  1. academic_article.md +51 -0
  2. app.py +0 -14
  3. pipeline/metrics.py +5 -10
  4. pipeline/process.py +13 -1
academic_article.md ADDED
@@ -0,0 +1,51 @@
+ # A Computational Toolkit for Tibetan Textual Analysis: Methods and Applications of the Tibetan Text Metrics (TTM) Application
+
+ **Abstract:** The study of the Tibetan textual tradition, with its vast and complex corpus, presents unique challenges for quantitative analysis. Traditional philological methods, while essential, can be enhanced by computational tools that reveal large-scale patterns of similarity and variation. This paper introduces the Tibetan Text Metrics (TTM) web application, an accessible, open-source toolkit designed to bridge this gap. TTM provides a suite of similarity measures, including Jaccard similarity, Normalized Longest Common Subsequence (LCS), TF-IDF cosine similarity, and semantic similarity based on embedding models (FastText and Sentence Transformers). A novel feature of the application is its AI-powered interpretation engine, which translates quantitative results into scholarly insights, making complex metrics accessible to a broader audience. By offering a user-friendly interface for sophisticated textual analysis, TTM empowers researchers to explore manuscript relationships, track textual transmission, and uncover new avenues for inquiry within Tibetan studies and the broader digital humanities landscape.
+
+ ## 1. Introduction
+
+ ### 1.1. The Challenge of Tibetan Textual Scholarship
+
+ The Tibetan literary corpus is one of the world's most extensive, encompassing centuries of philosophy, history, jurisprudence, and religious doctrine. The transmission of these texts has been a complex process, involving manual copying by scribes across different regions and monastic traditions. This has resulted in a rich but challenging textual landscape, characterized by numerous variations, scribal errors, and divergent manuscript lineages. For scholars, tracing the history of a text and understanding its evolution requires meticulous philological work. While traditional methods are indispensable, the sheer scale of the available material necessitates computational approaches that can identify patterns and relationships that are not immediately apparent to the human eye.
+
+ ### 1.2. Digital Humanities and Under-Resourced Languages
+
+ The rise of digital humanities has brought a wealth of computational tools to literary analysis. However, many of these tools are designed for well-resourced languages like English, leaving languages with fewer digital resources, such as Tibetan, underserved. The unique characteristics of the Tibetan script and language, including its syllabic nature and complex orthography, require specialized tools for effective processing and analysis. The Tibetan Text Metrics (TTM) project addresses this need by providing a tailored solution that respects the linguistic nuances of Tibetan, thereby making a vital contribution to the growing field of global digital humanities.
+
+ ### 1.3. The Tibetan Text Metrics (TTM) Application
+
+ This paper introduces the Tibetan Text Metrics (TTM) web application, a user-friendly, open-source tool designed to make sophisticated textual analysis accessible to scholars of Tibetan, regardless of their technical background. The application allows researchers to upload Tibetan texts, automatically segment them into meaningful sections, and compare them using a range of quantitative metrics. This article will detail the methodologies underpinning the TTM application, describe its key features, and demonstrate its practical utility through a case study. By doing so, we aim to show how TTM can serve as a valuable assistant in the scholar's toolkit, augmenting traditional research methods and opening up new possibilities for the study of Tibetan textual history.
+
+ ## 2. Methodology: A Multi-faceted Approach to Text Similarity
+
+ To provide a holistic view of textual relationships, the TTM application employs a multi-faceted methodology that combines lexical, structural, and semantic analysis. This approach is built upon a foundation of Tibetan-specific text processing, ensuring that each metric is applied in a linguistically sound manner.
+
+ ### 2.1. Text Pre-processing and Segmentation
+
+ Meaningful computational analysis begins with careful pre-processing. The TTM application automates several crucial steps to prepare texts for comparison.
+
+ **Segmentation:** Comparing entire texts, especially long ones, can obscure significant internal variations. TTM therefore defaults to a chapter-level analysis. It automatically segments texts using the Tibetan *sbrul shad* (༈), a common marker for section breaks. This allows for a more granular comparison, revealing similarities and differences at a structural level that mirrors the text's own divisions. If no marker is found, the application treats the entire file as a single segment and issues a warning.
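+
+ The idea can be illustrated with a minimal sketch (illustrative only; the function and variable names below are not taken from the TTM codebase):
+
+ ```python
+ def segment_by_sbrul_shad(raw_text: str) -> list[str]:
+     """Split a Tibetan text into sections at each sbrul shad marker."""
+     marker = "༈"
+     parts = [part.strip() for part in raw_text.split(marker)]
+     segments = [part for part in parts if part]
+     if len(segments) <= 1:
+         # Mirrors the fallback described above: no marker found, so the
+         # whole file is treated as a single segment and a warning is issued.
+         print("Warning: no sbrul shad found; using the whole file as one segment.")
+     return segments
+ ```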
+
+ **Tokenization:** To analyze a text computationally, it must be broken down into individual units, or tokens. In the Tibetan script, the *tsheg* (་) delimits syllables rather than words, so simple whitespace tokenization is inadequate. TTM leverages the `botok` library, a state-of-the-art tokenizer for Tibetan, which accurately identifies word boundaries, ensuring that the subsequent analysis is based on meaningful linguistic units.
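+
+ A typical use of `botok` looks roughly like the following (a sketch based on the library's standard interface; TTM wraps this step in its own pipeline code):
+
+ ```python
+ from botok import WordTokenizer
+
+ wt = WordTokenizer()  # loads botok's default Tibetan word list on first use
+
+ def tokenize(text: str) -> list[str]:
+     """Return the surface forms of the words botok finds in a Tibetan string."""
+     tokens = wt.tokenize(text, split_affixes=False)
+     return [token.text for token in tokens]
+
+ words = tokenize("བཀྲ་ཤིས་བདེ་ལེགས།")
+ ```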
+
+ **Stopword Filtering:** Many words in a language (e.g., particles, pronouns) are grammatically necessary but carry little unique semantic weight. These "stopwords" can skew similarity scores by creating an illusion of similarity based on common grammatical structures. TTM provides two levels of optional stopword filtering: a *Standard* list containing the most common particles, and an *Aggressive* list that also removes frequent function words. This allows researchers to focus their analysis on the substantive vocabulary of a text.
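+
+ The filtering step itself is simple once a list is chosen; the handful of particles below is only a stand-in for TTM's curated lists:
+
+ ```python
+ # Hypothetical miniature stopword set; TTM ships much larger "standard"
+ # and "aggressive" lists of Tibetan particles and function words.
+ STOPWORDS = {"གི", "ཀྱི", "གྱི", "ནི", "དང", "ལ"}
+
+ def remove_stopwords(tokens: list[str]) -> list[str]:
+     return [tok for tok in tokens if tok not in STOPWORDS]
+ ```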
+
+ ### 2.2. Lexical and Thematic Similarity Metrics
+
+ These metrics focus on the vocabulary and key terms within the texts.
+
+ **Jaccard Similarity:** This metric measures the direct overlap in vocabulary between two segments. It is calculated as the size of the intersection of the word sets divided by the size of their union. The result is a score from 0 to 1, representing the proportion of unique words that are common to both texts. Jaccard similarity is a straightforward and effective measure of shared vocabulary, independent of word order or frequency.
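+
+ In code, the definition is only a few lines (a direct transcription of the formula above, not TTM's exact implementation):
+
+ ```python
+ def jaccard_similarity(tokens_a: list[str], tokens_b: list[str]) -> float:
+     """|A ∩ B| / |A ∪ B| over the sets of unique tokens in two segments."""
+     set_a, set_b = set(tokens_a), set(tokens_b)
+     if not set_a and not set_b:
+         return 0.0  # two empty segments: define the overlap as zero
+     return len(set_a & set_b) / len(set_a | set_b)
+ ```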
+
+ **TF-IDF Cosine Similarity:** Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (a corpus). It gives higher weight to terms that are frequent in one document but rare across the corpus, thus identifying the words that are most characteristic of that document. TTM calculates a TF-IDF vector for each text segment and then uses cosine similarity to measure the angle between these vectors. A higher score indicates that two segments share more of the same characteristic terms, suggesting a thematic similarity.
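+
+ Because the segments have already been tokenized by `botok`, they can be handed to scikit-learn as space-joined tokens with a plain whitespace tokenizer. The sketch below compares just two segments; the application builds its TF-IDF vocabulary over the whole corpus of segments:
+
+ ```python
+ from sklearn.feature_extraction.text import TfidfVectorizer
+ from sklearn.metrics.pairwise import cosine_similarity
+
+ def tfidf_cosine(tokens_a: list[str], tokens_b: list[str]) -> float:
+     """Cosine similarity between the TF-IDF vectors of two token lists."""
+     docs = [" ".join(tokens_a), " ".join(tokens_b)]
+     # Whitespace tokenizer: the Tibetan-aware word segmentation has already happened.
+     vectorizer = TfidfVectorizer(tokenizer=str.split, token_pattern=None)
+     matrix = vectorizer.fit_transform(docs)
+     return float(cosine_similarity(matrix[0], matrix[1])[0, 0])
+ ```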
+
+ ### 2.3. Structural Similarity Metric
+
+ **Normalized Longest Common Subsequence (LCS):** This metric moves beyond vocabulary to assess structural parallels. The LCS algorithm finds the longest sequence of words that appears in both texts in the same relative order, though not necessarily contiguously. For example, the LCS of "the brown fox jumps" and "the lazy brown dog jumps" is "the brown jumps". TTM normalizes the length of this subsequence to produce a score that reflects shared phrasing and narrative structure. A high LCS score can indicate direct textual borrowing or a shared structural template. To ensure performance, the LCS calculation is optimized with a custom Cython implementation.
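+
+ The underlying computation is the classic dynamic-programming LCS, shown here in pure Python with one plausible normalization (TTM's Cython version is equivalent in spirit but faster, and its normalization may differ in detail):
+
+ ```python
+ def normalized_lcs(tokens_a: list[str], tokens_b: list[str]) -> float:
+     """LCS length over word tokens, normalized by the average segment length."""
+     m, n = len(tokens_a), len(tokens_b)
+     if m == 0 or n == 0:
+         return 0.0
+     # Row-by-row DP table: prev[j] is the LCS length of tokens_a[:i-1] and
+     # tokens_b[:j]; curr[j] is the same quantity for tokens_a[:i].
+     prev = [0] * (n + 1)
+     for i in range(1, m + 1):
+         curr = [0] * (n + 1)
+         for j in range(1, n + 1):
+             if tokens_a[i - 1] == tokens_b[j - 1]:
+                 curr[j] = prev[j - 1] + 1
+             else:
+                 curr[j] = max(prev[j], curr[j - 1])
+         prev = curr
+     return prev[n] / ((m + n) / 2)
+ ```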
+
+ ### 2.4. Semantic Similarity
+
+ To capture similarities in meaning that may not be apparent from lexical overlap, TTM employs semantic similarity using word and sentence embeddings.
+
+ **FastText Embeddings:** The application utilizes the official Facebook FastText model for Tibetan, which represents words as dense vectors in a high-dimensional space. A key advantage of FastText is its use of character n-grams, allowing it to generate meaningful vectors even for out-of-vocabulary words, a crucial feature for handling the orthographic variations common in Tibetan manuscripts. To create a single vector for an entire text segment, TTM applies TF-IDF-weighted averaging to the word vectors, giving more weight to the embeddings of characteristic terms.
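+
+ A simplified version of this weighting scheme is sketched below, assuming the pre-trained Tibetan vectors (e.g. `cc.bo.300.bin` from fasttext.cc) have been downloaded locally; TTM's own weighting and pooling details may differ:
+
+ ```python
+ import numpy as np
+ import fasttext
+ from sklearn.feature_extraction.text import TfidfVectorizer
+
+ ft_model = fasttext.load_model("cc.bo.300.bin")  # official Tibetan FastText vectors
+
+ def segment_embedding(tokens: list[str], corpus: list[list[str]]) -> np.ndarray:
+     """IDF-weighted average of FastText word vectors for one tokenized segment."""
+     if not tokens:
+         return np.zeros(ft_model.get_dimension())
+     vectorizer = TfidfVectorizer(tokenizer=str.split, token_pattern=None)
+     vectorizer.fit([" ".join(toks) for toks in corpus])
+     idf = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))
+     weights = np.array([idf.get(tok, 1.0) for tok in tokens])
+     vectors = np.array([ft_model.get_word_vector(tok) for tok in tokens])
+     return (weights[:, None] * vectors).sum(axis=0) / weights.sum()
+ ```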
+
+ **Hugging Face Models:** In addition to FastText, TTM integrates the `sentence-transformers` library, providing access to a wide range of pre-trained models from the Hugging Face Hub. This allows researchers to leverage powerful, context-aware models like LaBSE or XLM-RoBERTa, which can capture nuanced semantic relationships between entire sentences and paragraphs.
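+
+ Scoring a pair of segments with such a model takes only a few lines; the model ID below is one of the options mentioned above and can be swapped for any other Hub model supported by `sentence-transformers`:
+
+ ```python
+ from sentence_transformers import SentenceTransformer, util
+
+ model = SentenceTransformer("sentence-transformers/LaBSE")
+
+ def semantic_similarity(segment_a: str, segment_b: str) -> float:
+     """Cosine similarity between multilingual sentence embeddings."""
+     embeddings = model.encode([segment_a, segment_b], convert_to_tensor=True)
+     return util.cos_sim(embeddings[0], embeddings[1]).item()
+ ```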
app.py CHANGED
@@ -133,20 +133,6 @@ def main_interface():
                     variant="primary",
                     elem_id="interpret-btn"
                 )
-
-                # About AI Analysis section
-                with gr.Accordion("ℹ️ About AI Analysis", open=False):
-                    gr.Markdown("""
-                    ### AI-Powered Analysis
-
-                    The AI analysis is powered by **Mistral 7B Instruct** via the OpenRouter API. To use this feature:
-
-                    1. Get an API key from [OpenRouter](https://openrouter.ai/keys)
-                    2. Create a `.env` file in the webapp directory
-                    3. Add: `OPENROUTER_API_KEY=your_api_key_here`
-
-                    The AI will automatically analyze your text similarities and provide insights into patterns and relationships.
-                    """)
                 # Create a placeholder message with proper formatting and structure
                 initial_message = """
                 ## Analysis of Tibetan Text Similarity Metrics
pipeline/metrics.py CHANGED
@@ -176,12 +176,13 @@ def compute_semantic_similarity(
 
 def compute_all_metrics(
     texts: Dict[str, str],
+    token_lists: Dict[str, List[str]],
     model=None,
-    enable_semantic: bool = True,
+    enable_semantic: bool = True,
     model_type: str = "fasttext",
     use_stopwords: bool = True,
     use_lite_stopwords: bool = False,
-    fasttext_tokenize_fn=None,
+    fasttext_tokenize_fn=None,
     batch_size: int = 32,
     show_progress_bar: bool = False
 ) -> pd.DataFrame:
@@ -203,8 +204,6 @@
     """
     files = list(texts.keys())
     results = []
-    # Prepare token lists (always use tokenize_texts for raw Unicode)
-    token_lists = {}  # Stores botok tokens for each text_id, used for Jaccard, LCS, and semantic sim
     corpus_for_sklearn_tfidf = []  # For storing space-joined tokens for scikit-learn's TF-IDF
 
     # For FastText TF-IDF related statistics
@@ -222,12 +221,8 @@
     stopwords_set_for_fasttext_stats_calc = TIBETAN_STOPWORDS_SET
 
     for fname, content in texts.items():
-        current_tokens_for_file = []
-        tokenized_content_list_of_lists = tokenize_texts([content])
-        if tokenized_content_list_of_lists and tokenized_content_list_of_lists[0]:
-            current_tokens_for_file = tokenized_content_list_of_lists[0]
-        token_lists[fname] = current_tokens_for_file
-
+        # Use the pre-computed tokens from the token_lists dictionary
+        current_tokens_for_file = token_lists.get(fname, [])
         corpus_for_sklearn_tfidf.append(" ".join(current_tokens_for_file) if current_tokens_for_file else "")
 
     if model_type == "fasttext":
pipeline/process.py CHANGED
@@ -205,6 +205,19 @@ def process_texts(
     if not segment_texts:
         logger.error("No valid text segments found in any of the uploaded files.")
         return pd.DataFrame(), pd.DataFrame(), "No valid text segments found in the uploaded files. Please check your files and try again."
+    # Tokenize all segments at once for efficiency
+    if progress_callback is not None:
+        try:
+            progress_callback(0.42, desc="Tokenizing all text segments...")
+        except Exception as e:
+            logger.warning(f"Progress callback error (non-critical): {e}")
+
+    all_segment_ids = list(segment_texts.keys())
+    all_segment_contents = list(segment_texts.values())
+    tokenized_segments_list = tokenize_texts(all_segment_contents)
+
+    segment_tokens = dict(zip(all_segment_ids, tokenized_segments_list))
+
     # Group chapters by filename (preserving order)
     if progress_callback is not None:
         try:
@@ -283,7 +296,6 @@
             pair_metrics = compute_all_metrics(
                 texts={seg1: segment_texts[seg1], seg2: segment_texts[seg2]},
                 token_lists={seg1: segment_tokens[seg1], seg2: segment_tokens[seg2]},
-                metrics_to_compute=["jaccard", "lcs", "tfidf"],
                 model=model,
                 enable_semantic=enable_semantic,
                 model_type=model_type,