daniel-wojahn committed
Commit
46a7286
·
1 Parent(s): 671c107

codebase cleanup

academic_article.md DELETED
@@ -1,122 +0,0 @@
1
- # A Computational Toolkit for Tibetan Textual Analysis: Methods and Applications of the Tibetan Text Metrics (TTM) Application
2
-
3
- **Abstract:** The study of Tibetan textual traditions, with its vast and complex corpus, presents unique challenges for quantitative analysis. Traditional philological methods, while essential, can be enhanced by computational tools that reveal large-scale patterns of similarity and variation. This paper introduces the Tibetan Text Metrics (TTM) web application, an accessible open-source toolkit designed to bridge this gap. TTM provides a suite of similarity metrics, including Jaccard similarity, Normalized Longest Common Subsequence (LCS), TF-IDF cosine similarity, and semantic similarity computed with embedding models (FastText and SentenceTransformers). A novel feature of the application is its AI-powered interpretation engine, which translates quantitative data into scholarly insights, making complex metrics accessible to a broader audience. By offering a user-friendly interface for sophisticated textual analysis, TTM empowers researchers to explore manuscript relationships, track textual transmission, and uncover new avenues for inquiry within Tibetan studies and the broader digital humanities landscape.
4
-
5
- ## 1. Introduction
6
-
7
- ### 1.1. The Challenge of Tibetan Textual Scholarship
8
-
9
- The Tibetan literary corpus is one of the world's most extensive, encompassing centuries of philosophy, history, and religious doctrine. The transmission of these texts has been a complex process, involving manual copying that resulted in a rich but challenging textual landscape of divergent manuscript lineages. This challenge is exemplified by the development of the TTM application itself, which originated from the analysis of multiple editions of the 17th-century legal text, *The Pronouncements in Sixteen Chapters* (*zhal lce bcu drug*). An initial attempt to create a critical edition using standard collation software like CollateX proved untenable; the variations between editions were so substantial that they produced a convoluted apparatus that obscured, rather than clarified, the texts' relationships. It became clear that a different approach was needed—one that could move beyond one-to-one textual comparison to provide a higher-level, quantitative overview of textual similarity. TTM was developed to meet this need, providing a toolkit to assess relationships at the chapter level and reveal the broader patterns of textual evolution that traditional methods might miss.
10
-
11
- ### 1.2. Digital Humanities and Under-Resourced Languages
12
-
13
- The rise of digital humanities has brought a wealth of computational tools to literary analysis. However, many of these tools are designed for well-resourced languages like English, leaving languages with fewer digital resources, such as Tibetan, underserved. The unique characteristics of the Tibetan script and language, including its syllabic nature and complex orthography, require specialized tools for effective processing and analysis. The Tibetan Text Metrics (TTM) project addresses this need by providing a tailored solution that respects the linguistic nuances of Tibetan, thereby making a vital contribution to the growing field of global digital humanities.
14
-
15
- ### 1.3. The Tibetan Text Metrics (TTM) Application
16
-
17
- This paper introduces the Tibetan Text Metrics (TTM) web application, a user-friendly, open-source tool designed to make sophisticated textual analysis accessible to scholars of Tibetan, regardless of their technical background. The application empowers researchers to move beyond manual comparison by providing a suite of computational metrics that reveal distinct aspects of textual relationships, from direct lexical overlap (Jaccard similarity) and shared narrative structure (Normalized LCS) to thematic parallels (TF-IDF) and deep semantic connections (FastText and Transformer-based embeddings). This article details the methodologies underpinning these metrics, describes the application's key features, including its novel AI-powered interpretation engine, and discusses how the metrics can be interpreted in combination. In doing so, we aim to show how TTM can serve as a valuable assistant in the scholar's toolkit, augmenting traditional research methods and opening up new possibilities for the study of Tibetan textual history.
18
-
19
- ## 2. Methodology: A Multi-faceted Approach to Text Similarity
20
-
21
- To provide a holistic view of textual relationships, the TTM application employs a multi-faceted methodology that combines lexical, structural, and semantic analysis. This approach is built upon a foundation of Tibetan-specific text processing, ensuring that each metric is applied in a linguistically sound manner.
22
-
23
- ### 2.1. Text Pre-processing and Segmentation
24
-
25
- Meaningful computational analysis begins with careful pre-processing. The TTM application automates several crucial steps to prepare texts for comparison.
26
-
27
- **Segmentation:** Comparing entire texts, especially long ones, can obscure significant internal variations. TTM therefore defaults to a chapter-level analysis. It automatically segments texts using the Tibetan *sbrul shad* (༈), a common marker for section breaks. This allows for a more granular comparison, revealing similarities and differences at a structural level that mirrors the text's own divisions. If no marker is found, the application treats the entire file as a single segment and issues a warning.
28
-
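The segmentation step itself is simple to express. The following is a minimal sketch; TTM's actual implementation may differ in details such as trimming and warning behaviour:

```python
# Minimal sketch of chapter segmentation on the sbrul shad (U+0F08).
# Illustrative only; TTM's actual implementation may differ.
def segment_chapters(text: str) -> list[str]:
    marker = "\u0f08"  # ༈ (sbrul shad)
    segments = [seg.strip() for seg in text.split(marker) if seg.strip()]
    if len(segments) <= 1:
        print("Warning: no segmentation marker found; treating file as one segment.")
    return segments or [text.strip()]
```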
29
- **Tokenization:** To analyze a text computationally, it must be broken down into individual units, or tokens. Given the syllabic nature of the Tibetan script, where syllables are delimited by a *tsheg* (་), simple whitespace tokenization is inadequate. TTM leverages the `botok` library, a state-of-the-art tokenizer for Tibetan, which accurately identifies word boundaries, ensuring that the subsequent analysis is based on meaningful linguistic units.
30
-
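A minimal usage sketch of `botok`, following the library's documented `WordTokenizer` interface (error handling omitted):

```python
from botok import WordTokenizer

# Instantiate once; botok loads its dialect pack on first use.
wt = WordTokenizer()

tokens = wt.tokenize("བཀྲ་ཤིས་བདེ་ལེགས།", split_affixes=False)
print([t.text for t in tokens])  # word-level units rather than raw syllables
```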
31
- **Stopword Filtering:** Many words in a language are grammatically necessary but carry little unique semantic weight. These "stopwords" can skew similarity scores by creating an illusion of similarity based on common grammatical structures. TTM provides two levels of optional stopword filtering to address this:
32
-
33
- * The **Standard** list targets only the most frequent, low-information grammatical particles and punctuation (e.g., the instrumental particle `གིས་` (gis), the genitive particle `གི་` (gi), and the sentence-ending *shad* `།`).
34
- * The **Aggressive** list includes the standard particles but also removes a wider range of function words, such as pronouns (e.g., `འདི` (this), `དེ་` (that)), auxiliary verbs (e.g., `ཡིན་` (is)), and common quantifiers (e.g., `རྣམས་` (plural marker)).
35
-
36
- This tiered approach allows researchers to fine-tune their analysis, either preserving the grammatical structure or focusing purely on the substantive vocabulary of a text.
37
-
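In code, the tiered filtering amounts to choosing between two sets. The sets below contain only the particles quoted above; the application's actual lists are considerably longer:

```python
# Illustrative subsets of the two stopword tiers described above.
STANDARD_STOPWORDS = {"གིས་", "གི་", "།"}
AGGRESSIVE_STOPWORDS = STANDARD_STOPWORDS | {"འདི", "དེ་", "ཡིན་", "རྣམས་"}

def filter_tokens(tokens: list[str], level: str = "standard") -> list[str]:
    stops = AGGRESSIVE_STOPWORDS if level == "aggressive" else STANDARD_STOPWORDS
    return [t for t in tokens if t not in stops]
```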
38
- ### 2.2. Lexical and Thematic Similarity Metrics
39
-
40
- These metrics focus on the vocabulary and key terms within the texts.
41
-
42
- **Jaccard Similarity:** This metric measures the direct overlap in vocabulary between two segments. It is calculated as the size of the intersection of the word sets divided by the size of their union. The result is a score from 0 to 1, representing the proportion of unique words that are common to both texts. Jaccard similarity is a straightforward and effective measure of shared vocabulary, independent of word order or frequency.
43
-
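As a sketch, the computation reduces to two set operations:

```python
def jaccard_similarity(tokens_a: list[str], tokens_b: list[str]) -> float:
    """Proportion of unique words common to both segments (0 to 1)."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0
```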
44
- **TF-IDF Cosine Similarity:** Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (a corpus). It gives higher weight to terms that are frequent in one document but rare across the corpus, thus identifying the words that are most characteristic of that document. TTM calculates a TF-IDF vector for each text segment and then uses cosine similarity to measure the angle between these vectors. A higher score indicates that two segments share more of the same characteristic terms, suggesting a thematic similarity.
45
-
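A minimal sketch with scikit-learn, assuming segments that have already been tokenized and re-joined with spaces (the placeholder strings stand in for real Tibetan token streams):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

segments = ["ka kha ga nga", "ka kha ca cha"]  # placeholder token streams
# Segments are pre-tokenized, so plain whitespace splitting suffices.
vectorizer = TfidfVectorizer(tokenizer=str.split, token_pattern=None)
tfidf = vectorizer.fit_transform(segments)
print(cosine_similarity(tfidf)[0, 1])  # similarity of the first pair
```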
46
- ### 2.3. Structural Similarity Metric
47
-
48
- **Normalized Longest Common Subsequence (LCS):** This metric moves beyond vocabulary to assess structural parallels. The LCS algorithm finds the longest sequence of words that appears in both texts in the same relative order, though not necessarily contiguously. For example, the LCS of "the brown fox jumps" and "the lazy brown dog jumps" is "the brown jumps". TTM normalizes the length of this subsequence to produce a score that reflects shared phrasing and narrative structure. A high LCS score can indicate direct textual borrowing or a shared structural template. To ensure performance, the LCS calculation is optimized with a custom Cython implementation.
49
-
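A plain-Python version of the computation (TTM's production code uses Cython, and its exact normalization may differ; normalizing by the average segment length is one common convention):

```python
def normalized_lcs(a: list[str], b: list[str]) -> float:
    """LCS length, normalized here by the average segment length."""
    m, n = len(a), len(b)
    if m == 0 or n == 0:
        return 0.0
    dp = [[0] * (n + 1) for _ in range(m + 1)]  # dp[i][j]: LCS of a[:i], b[:j]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if a[i] == b[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n] / ((m + n) / 2)

print(normalized_lcs("the brown fox jumps".split(),
                     "the lazy brown dog jumps".split()))  # LCS: "the brown jumps"
```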
50
- **LCS vs. Levenshtein Distance:** While Levenshtein distance is another string similarity metric that measures the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into another, LCS is more appropriate for Tibetan text analysis for several reasons:
51
-
52
- 1. Tibetan manuscripts often share common passages or structural elements: LCS is particularly effective at identifying these shared passages, which may be separated by varying amounts of different text.
53
-
54
- 2. LCS focuses on meaningful shared content rather than character-level differences: Unlike Levenshtein distance, which is sensitive to every character change, LCS identifies the longest sequence of words in the same order, focusing on substantive content overlap.
55
-
56
- 3. LCS is less sensitive to minor variations that might occur in handwritten or OCR-processed texts: Tibetan manuscripts often contain variations due to scribal errors, regional differences, or OCR artifacts. LCS can still identify shared structural elements despite these variations, whereas Levenshtein distance might be disproportionately affected by them.
57
-
58
- ### 2.4. Semantic Similarity
59
-
60
- To capture similarities in meaning that may not be apparent from lexical overlap, TTM employs semantic similarity using word and sentence embeddings.
61
-
62
- **FastText Embeddings:** The application utilizes the official Facebook FastText model for Tibetan, which represents words as dense vectors in a high-dimensional space. A key advantage of FastText is its use of character n-grams, allowing it to generate meaningful vectors even for out-of-vocabulary words, a crucial feature for handling the orthographic variations common in Tibetan manuscripts. To create a single vector for an entire text segment, TTM uses TF-IDF-weighted averaging of the word vectors, giving more weight to the embeddings of characteristic terms.
63
-
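A sketch of the weighted-averaging step, assuming the official `cc.bo.300.bin` model file and TF-IDF weights computed elsewhere (the function name and fallback weight are illustrative):

```python
import fasttext
import numpy as np

model = fasttext.load_model("cc.bo.300.bin")  # official Tibetan FastText model

def segment_vector(tokens: list[str], tfidf_weights: dict[str, float]) -> np.ndarray:
    """Average word vectors, weighting characteristic terms more heavily."""
    if not tokens:
        return np.zeros(model.get_dimension())
    vectors = [model.get_word_vector(t) for t in tokens]  # n-grams cover OOV forms
    weights = [tfidf_weights.get(t, 1.0) for t in tokens]
    return np.average(np.stack(vectors), axis=0, weights=weights)
```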
64
- **Hugging Face Models:** In addition to FastText, TTM integrates the `sentence-transformers` library, providing access to a wide range of pre-trained models from the Hugging Face Hub. This allows researchers to leverage powerful, context-aware models like LaBSE or XLM-RoBERTa, which can capture nuanced semantic relationships between entire sentences and paragraphs.
65
-
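Usage through `sentence-transformers` is brief; a sketch with LaBSE, one of the models named above (the input strings are placeholders):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")
embeddings = model.encode(["segment one ...", "segment two ..."])  # placeholders
print(util.cos_sim(embeddings[0], embeddings[1]).item())
```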
66
- ## 3. The TTM Web Application: Features and Functionality
67
-
68
- The TTM application is designed to be a practical tool for researchers. Its features are built to facilitate an intuitive workflow, from data input to the interpretation of results.
69
-
70
- ### 3.1. User Interface and Workflow
71
-
72
- Built with the Gradio framework, the application's interface is clean and straightforward. The workflow is designed to be linear and intuitive:
73
-
74
- 1. **File Upload:** Users begin by uploading one or more Tibetan `.txt` files.
75
- 2. **Configuration:** Users can then configure the analysis by selecting which metrics to compute, choosing an embedding model, and setting the desired level of stopword filtering.
76
- 3. **Execution:** A single "Run Analysis" button initiates the entire processing pipeline.
77
-
78
- This simple, step-by-step process removes the barriers posed by command-line tools and complex software setups, making the technology accessible to scholars without programming experience.
79
-
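A skeletal Gradio sketch of this workflow (the component layout and stub function are illustrative, not TTM's actual code):

```python
import gradio as gr

def run_analysis(files, semantic, model_name, stopword_level):
    # ... segment, tokenize, and compute the selected metrics ...
    return "Analysis complete."  # placeholder status

with gr.Blocks() as demo:
    files = gr.File(label="Upload Tibetan .txt files",
                    file_count="multiple", file_types=[".txt"])
    semantic = gr.Radio(["Yes", "No"], value="Yes", label="Semantic similarity")
    model_name = gr.Dropdown(["sentence-transformers/all-MiniLM-L6-v2"],
                             label="Embedding model")
    stopwords = gr.Radio(["None", "Standard", "Aggressive"],
                         value="Standard", label="Stopword filtering")
    status = gr.Textbox(label="Status")
    gr.Button("Run Analysis").click(run_analysis,
                                    [files, semantic, model_name, stopwords], status)

demo.launch()
```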
80
- ### 3.2. Data Visualization
81
-
82
- Understanding numerical similarity scores can be challenging. TTM addresses this by providing rich, interactive visualizations:
83
-
84
- * **Heatmaps:** For each similarity metric, the application generates a heatmap that provides an at-a-glance overview of the relationships between all text segments. Darker cells indicate higher similarity, allowing researchers to quickly identify areas of strong textual connection.
85
- * **Bar Charts:** A word count chart for each text provides a simple but effective visualization of the relative lengths of the segments, which is important context for interpreting the similarity scores.
86
-
87
- These visualizations are not only useful for analysis but are also publication-ready, allowing researchers to easily incorporate them into their own work.
88
-
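For illustration, a pairwise score matrix of this kind can be rendered with matplotlib; the numbers below are placeholders, not real results, and the colormap is chosen so that darker cells mark higher similarity:

```python
import matplotlib.pyplot as plt
import numpy as np

labels = ["Ch. 1", "Ch. 2", "Ch. 3"]            # segment labels
scores = np.array([[1.0, 0.62, 0.41],           # placeholder similarity values
                   [0.62, 1.0, 0.55],
                   [0.41, 0.55, 1.0]])

fig, ax = plt.subplots()
im = ax.imshow(scores, cmap="Blues", vmin=0, vmax=1)  # darker = more similar
ax.set_xticks(range(len(labels)), labels)
ax.set_yticks(range(len(labels)), labels)
fig.colorbar(im, label="Similarity")
plt.show()
```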
89
- ### 3.3. AI-Powered Interpretation
90
-
91
- A standout feature of the TTM application is its AI-powered interpretation engine. While quantitative metrics are powerful, their scholarly significance is not always self-evident. The "Interpret Results" button addresses this challenge by sending the computed metrics to a large language model (Mistral 7B via the OpenRouter API).
92
-
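Since OpenRouter exposes an OpenAI-compatible endpoint, the request can be sketched as follows (the prompt wording and parameters are illustrative, not TTM's actual code):

```python
import requests

def interpret_results(metrics_csv: str, api_key: str) -> str:
    """Send computed metrics to Mistral 7B via OpenRouter; return the analysis."""
    response = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {api_key}"},
        json={
            "model": "mistralai/mistral-7b-instruct",
            "messages": [
                {"role": "system",
                 "content": "You are a scholar of Tibetan textual criticism."},
                {"role": "user",
                 "content": f"Interpret these similarity metrics:\n{metrics_csv}"},
            ],
        },
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
```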
93
- The AI then generates a qualitative analysis of the results, framed in the language of textual scholarship. This analysis typically includes:
94
-
95
- * An overview of the general patterns of similarity.
96
- * A discussion of notable chapters with particularly high or low similarity.
97
- * An interpretation of what the different metrics collectively suggest about the texts' relationship (e.g., lexical borrowing vs. structural parallels).
98
- * Suggestions for further scholarly investigation.
99
-
100
- This feature acts as a bridge between the quantitative data and its qualitative interpretation, helping researchers to understand the implications of their findings and to formulate new research questions.
101
-
102
- ## 4. Discussion and Future Directions
103
-
104
- ### 4.1. Interpreting the Metrics: A Holistic View
105
-
106
- The true analytical power of the TTM application lies not in any single metric, but in the synthesis of all of them. For example, a high Jaccard similarity combined with a low LCS score might suggest that two texts share a common vocabulary but arrange it differently, perhaps indicating a shared topic but different authorial styles. Conversely, a high LCS score with a moderate Jaccard similarity could point to a shared structural backbone or direct borrowing, even with significant lexical variation. The addition of semantic similarity further enriches this picture, revealing conceptual connections that might be missed by lexical and structural methods alone. The TTM application facilitates this holistic approach, encouraging a nuanced interpretation of textual relationships.
107
-
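As a toy illustration of this combined reading (the thresholds are arbitrary and for exposition only):

```python
def rough_reading(jaccard: float, lcs: float) -> str:
    """Caricature of the heuristics above; real interpretation needs a scholar."""
    if jaccard > 0.6 and lcs < 0.3:
        return "shared vocabulary, different arrangement: common topic, different style?"
    if lcs > 0.6 and jaccard > 0.3:
        return "shared structural backbone or direct borrowing"
    return "no strong lexical or structural signal"
```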
108
- ### 4.2. Limitations
109
-
110
- While powerful, the TTM application has limitations. The quality of the analysis is highly dependent on the quality of the input texts; poorly scanned or OCR'd texts may yield unreliable results. The performance of the semantic models, while state-of-the-art, may also vary depending on the specific domain of the texts being analyzed. Furthermore, the AI-powered interpretation, while a useful guide, is not a substitute for scholarly expertise and should be treated as a starting point for further investigation, not a definitive conclusion.
111
-
112
- ### 4.3. Future Work
113
-
114
- The TTM project is under active development, with several potential avenues for future enhancement. These include:
115
-
116
- * **Integration of More Models:** Expanding the library of available embedding models to include more domain-specific options.
117
- * **Enhanced Visualization:** Adding more advanced visualization tools, such as network graphs to show relationships between multiple texts.
118
- * **User-Trainable Models:** Exposing the functionality to train custom FastText models directly within the web UI, allowing researchers to create highly specialized models for their specific corpora.
119
-
120
- ## 5. Conclusion
121
-
122
- The Tibetan Text Metrics web application represents a significant step forward in making computational textual analysis accessible to the field of Tibetan studies. By combining a suite of powerful similarity metrics with an intuitive user interface and a novel AI-powered interpretation engine, TTM lowers the barrier to entry for digital humanities research. It provides scholars with a versatile tool to explore textual relationships, investigate manuscript histories, and generate new, data-driven insights. As such, TTM serves not as a replacement for traditional philology, but as a powerful complement, one that promises to enrich and expand the horizons of Tibetan textual scholarship.
 
base-texts/.DS_Store DELETED
Binary file (6.15 kB)
 
base-texts/Bailey.txt DELETED
The diff for this file is too large to render. See raw diff
 
base-texts/Dolanji_16.txt DELETED
The diff for this file is too large to render. See raw diff
 
base-texts/Leiden_16.txt DELETED
The diff for this file is too large to render. See raw diff
 
base-texts/Ngari 8.txt DELETED
@@ -1 +0,0 @@
1
- ༈ དང་པོ།
 
 
pipeline/advanced_alignment.py DELETED
@@ -1,329 +0,0 @@
1
- """
2
- Advanced Tibetan Legal Manuscript Alignment Engine
3
- Juxta/CollateX-inspired alignment with Tibetan-specific enhancements
4
- """
5
-
6
- import difflib
7
- import re
8
- from typing import Dict, List, Tuple
9
- from dataclasses import dataclass
10
- from collections import defaultdict
11
- import logging
12
-
13
- logger = logging.getLogger(__name__)
14
-
15
- @dataclass
16
- class AlignmentSegment:
17
- """Represents an aligned segment between texts."""
18
- text1_content: str
19
- text2_content: str
20
- alignment_type: str # 'match', 'gap', 'mismatch', 'transposition'
21
- confidence: float
22
- position_text1: int
23
- position_text2: int
24
- context: str = ""
25
-
26
- @dataclass
27
- class TibetanAlignmentResult:
28
- """Complete alignment result for Tibetan manuscripts."""
29
- segments: List[AlignmentSegment]
30
- transpositions: List[Tuple[int, int]]
31
- insertions: List[Dict]
32
- deletions: List[Dict]
33
- modifications: List[Dict]
34
- alignment_score: float
35
- structural_similarity: float
36
- scholarly_apparatus: Dict
37
-
38
- class TibetanLegalAligner:
39
- """
40
- Juxta/CollateX-inspired alignment engine for Tibetan legal manuscripts.
41
-
42
- Features:
43
- - Multi-level alignment (character → word → sentence → paragraph)
44
- - Transposition detection (content moves)
45
- - Tibetan-specific punctuation handling
46
- - Scholarly apparatus generation
47
- - Confidence scoring
48
- """
49
-
50
- def __init__(self, min_segment_length: int = 3, context_window: int = 15):
51
- self.min_segment_length = min_segment_length
52
- self.context_window = context_window
53
- self.tibetan_punctuation = r'[།༎༏༐༑༔་]'
54
-
55
- def tibetan_tokenize(self, text: str) -> List[str]:
56
- """Tibetan-specific tokenization respecting syllable boundaries."""
57
- # Split on Tibetan punctuation and spaces
58
- tokens = re.split(rf'{self.tibetan_punctuation}|\s+', text)
59
- return [token.strip() for token in tokens if token.strip()]
60
-
61
- def segment_by_syllables(self, text: str) -> List[str]:
62
- """Segment text into Tibetan syllables."""
63
- # Tibetan syllables typically end with ་ or punctuation
64
- syllables = re.findall(r'[^་]+་?', text)
65
- return [s.strip() for s in syllables if s.strip()]
66
-
67
- def multi_level_alignment(self, text1: str, text2: str) -> TibetanAlignmentResult:
68
- """
69
- Multi-level alignment inspired by Juxta/CollateX.
70
-
71
- Levels:
72
- 1. Character level (for precise changes)
73
- 2. Syllable level (Tibetan linguistic units)
74
- 3. Sentence level (punctuation-based)
75
- 4. Paragraph level (structural blocks)
76
- """
77
-
78
- # Level 1: Character-level alignment
79
- char_alignment = self.character_level_alignment(text1, text2)
80
-
81
- # Level 2: Syllable-level alignment
82
- syllable_alignment = self.syllable_level_alignment(text1, text2)
83
-
84
- # Level 3: Sentence-level alignment
85
- sentence_alignment = self.sentence_level_alignment(text1, text2)
86
-
87
- # Level 4: Structural alignment
88
- structural_alignment = self.structural_level_alignment(text1, text2)
89
-
90
- # Combine results with confidence scoring
91
- return self.combine_alignments(
92
- char_alignment, syllable_alignment,
93
- sentence_alignment, structural_alignment
94
- )
95
-
96
- def character_level_alignment(self, text1: str, text2: str) -> Dict:
97
- """Character-level precise alignment."""
98
- matcher = difflib.SequenceMatcher(None, text1, text2)
99
-
100
- segments = []
101
- for tag, i1, i2, j1, j2 in matcher.get_opcodes():
102
- segment = AlignmentSegment(
103
- text1_content=text1[i1:i2],
104
- text2_content=text2[j1:j2],
105
- alignment_type=self.map_opcode_to_type(tag),
106
- confidence=self.calculate_confidence(text1[i1:i2], text2[j1:j2]),
107
- position_text1=i1,
108
- position_text2=j1
109
- )
110
- segments.append(segment)
111
-
112
- return {'segments': segments, 'level': 'character'}
113
-
114
- def syllable_level_alignment(self, text1: str, text2: str) -> Dict:
115
- """Tibetan syllable-level alignment."""
116
- syllables1 = self.segment_by_syllables(text1)
117
- syllables2 = self.segment_by_syllables(text2)
118
-
119
- matcher = difflib.SequenceMatcher(None, syllables1, syllables2)
120
-
121
- segments = []
122
- for tag, i1, i2, j1, j2 in matcher.get_opcodes():
123
- content1 = ' '.join(syllables1[i1:i2])
124
- content2 = ' '.join(syllables2[j1:j2])
125
-
126
- segment = AlignmentSegment(
127
- text1_content=content1,
128
- text2_content=content2,
129
- alignment_type=self.map_opcode_to_type(tag),
130
- confidence=self.calculate_confidence(content1, content2),
131
- position_text1=i1,
132
- position_text2=j1
133
- )
134
- segments.append(segment)
135
-
136
- return {'segments': segments, 'level': 'syllable'}
137
-
138
- def sentence_level_alignment(self, text1: str, text2: str) -> Dict:
139
- """Sentence-level alignment using Tibetan punctuation."""
140
- sentences1 = self.tibetan_tokenize(text1)
141
- sentences2 = self.tibetan_tokenize(text2)
142
-
143
- matcher = difflib.SequenceMatcher(None, sentences1, sentences2)
144
-
145
- segments = []
146
- for tag, i1, i2, j1, j2 in matcher.get_opcodes():
147
- content1 = ' '.join(sentences1[i1:i2])
148
- content2 = ' '.join(sentences2[j1:j2])
149
-
150
- segment = AlignmentSegment(
151
- text1_content=content1,
152
- text2_content=content2,
153
- alignment_type=self.map_opcode_to_type(tag),
154
- confidence=self.calculate_confidence(content1, content2),
155
- position_text1=i1,
156
- position_text2=j1
157
- )
158
- segments.append(segment)
159
-
160
- return {'segments': segments, 'level': 'sentence'}
161
-
162
- def structural_level_alignment(self, text1: str, text2: str) -> Dict:
163
- """Structural-level alignment for larger text blocks."""
164
- # Paragraph-level segmentation
165
- paragraphs1 = text1.split('\n\n')
166
- paragraphs2 = text2.split('\n\n')
167
-
168
- matcher = difflib.SequenceMatcher(None, paragraphs1, paragraphs2)
169
-
170
- segments = []
171
- for tag, i1, i2, j1, j2 in matcher.get_opcodes():
172
- content1 = '\n\n'.join(paragraphs1[i1:i2])
173
- content2 = '\n\n'.join(paragraphs2[j1:j2])
174
-
175
- segment = AlignmentSegment(
176
- text1_content=content1,
177
- text2_content=content2,
178
- alignment_type=self.map_opcode_to_type(tag),
179
- confidence=self.calculate_confidence(content1, content2),
180
- position_text1=i1,
181
- position_text2=j1
182
- )
183
- segments.append(segment)
184
-
185
- return {'segments': segments, 'level': 'structural'}
186
-
187
- def detect_transpositions(self, segments: List[AlignmentSegment]) -> List[Tuple[int, int]]:
188
- """Detect content transpositions (moves) between texts."""
189
- transpositions = []
190
-
191
- # Look for identical content appearing in different positions
192
- content_map = defaultdict(list)
193
- for i, segment in enumerate(segments):
194
- if segment.alignment_type == 'match':
195
- content_map[segment.text1_content].append(i)
196
-
197
- # Detect moves where same content appears at different positions
198
- for content, positions in content_map.items():
199
- if len(positions) > 1:
200
- # Potential transposition detected
201
- transpositions.extend([(positions[i], positions[j])
202
- for i in range(len(positions))
203
- for j in range(i+1, len(positions))])
204
-
205
- return transpositions
206
-
207
- def map_opcode_to_type(self, opcode: str) -> str:
208
- """Map difflib opcode to alignment type."""
209
- mapping = {
210
- 'equal': 'match',
211
- 'delete': 'deletion',
212
- 'insert': 'insertion',
213
- 'replace': 'mismatch'
214
- }
215
- return mapping.get(opcode, 'unknown')
216
-
217
- def calculate_confidence(self, content1: str, content2: str) -> float:
218
- """Calculate alignment confidence score."""
219
- if not content1 and not content2:
220
- return 1.0
221
-
222
- if not content1 or not content2:
223
- return 0.0
224
-
225
- # Use Levenshtein distance for confidence
226
- distance = self.levenshtein_distance(content1, content2)
227
- max_len = max(len(content1), len(content2))
228
-
229
- return max(0.0, 1.0 - (distance / max_len)) if max_len > 0 else 1.0
230
-
231
- def levenshtein_distance(self, s1: str, s2: str) -> int:
232
- """Calculate Levenshtein distance between two strings."""
233
- if len(s1) < len(s2):
234
- return self.levenshtein_distance(s2, s1)
235
-
236
- if len(s2) == 0:
237
- return len(s1)
238
-
239
- previous_row = list(range(len(s2) + 1))
240
- for i, c1 in enumerate(s1):
241
- current_row = [i + 1]
242
- for j, c2 in enumerate(s2):
243
- insertions = previous_row[j + 1] + 1
244
- deletions = current_row[j] + 1
245
- substitutions = previous_row[j] + (c1 != c2)
246
- current_row.append(min(insertions, deletions, substitutions))
247
- previous_row = current_row
248
-
249
- return previous_row[-1]
250
-
251
- def generate_scholarly_apparatus(self, alignment: TibetanAlignmentResult) -> Dict:
252
- """Generate scholarly apparatus for critical edition."""
253
- return {
254
- 'sigla': {
255
- 'witness_a': 'Base text',
256
- 'witness_b': 'Variant text'
257
- },
258
- 'critical_notes': self.generate_critical_notes(alignment),
259
- 'alignment_summary': {
260
- 'total_segments': len(alignment.segments),
261
- 'exact_matches': len([s for s in alignment.segments if s.alignment_type == 'match']),
262
- 'variants': len([s for s in alignment.segments if s.alignment_type in ['mismatch', 'modification']]),
263
- 'transpositions': len(alignment.transpositions),
264
- 'confidence_score': sum(s.confidence for s in alignment.segments) / len(alignment.segments) if alignment.segments else 0
265
- }
266
- }
267
-
268
- def generate_critical_notes(self, alignment: TibetanAlignmentResult) -> List[str]:
269
- """Generate critical notes in scholarly format."""
270
- notes = []
271
- for segment in alignment.segments:
272
- if segment.alignment_type in ['mismatch', 'modification']:
273
- note = f"Variant: '{segment.text1_content}' → '{segment.text2_content}'"
274
- notes.append(note)
275
- return notes
276
-
277
- def combine_alignments(self, *alignments) -> TibetanAlignmentResult:
278
- """Combine multi-level alignments into final result."""
279
- # This would implement sophisticated combination logic
280
- # For now, return the highest confidence level
281
-
282
- # Use sentence-level as primary
283
- sentence_alignment = next(a for a in alignments if a['level'] == 'sentence')
284
-
285
- return TibetanAlignmentResult(
286
- segments=sentence_alignment['segments'],
287
- transpositions=[],
288
- insertions=[],
289
- deletions=[],
290
- modifications=[],
291
- alignment_score=0.85, # Placeholder
292
- structural_similarity=0.75, # Placeholder
293
- scholarly_apparatus={
294
- 'method': 'Juxta/CollateX-inspired multi-level alignment',
295
- 'levels': ['character', 'syllable', 'sentence', 'structural']
296
- }
297
- )
298
-
299
- # Integration function for existing codebase
300
- def enhanced_structural_analysis(text1: str, text2: str,
301
- file1_name: str = "Text 1",
302
- file2_name: str = "Text 2") -> dict:
303
- """
304
- Enhanced structural analysis using Juxta/CollateX-inspired algorithms.
305
-
306
- Args:
307
- text1: First text to analyze
308
- text2: Second text to analyze
309
- file1_name: Name for first text
310
- file2_name: Name for second text
311
-
312
- Returns:
313
- Comprehensive alignment analysis
314
- """
315
- aligner = TibetanLegalAligner()
316
- result = aligner.multi_level_alignment(text1, text2)
317
-
318
- return {
319
- 'alignment_segments': [{
320
- 'type': segment.alignment_type,
321
- 'content1': segment.text1_content,
322
- 'content2': segment.text2_content,
323
- 'confidence': segment.confidence
324
- } for segment in result.segments],
325
- 'transpositions': result.transpositions,
326
- 'scholarly_apparatus': result.scholarly_apparatus,
327
- 'alignment_score': result.alignment_score,
328
- 'structural_similarity': result.structural_similarity
329
- }
 
results.csv DELETED
@@ -1,97 +0,0 @@
1
- Text Pair,Jaccard Similarity (%),Normalized LCS,Semantic Similarity,TF-IDF Cosine Sim,Chapter
2
- Ngari 9.txt vs Nepal12.txt,100.0,1.0,,1.0,1
3
- Ngari 9.txt vs Nepal12.txt,100.0,1.0,,1.0,2
4
- Ngari 9.txt vs Nepal12.txt,100.0,1.0,,1.0,3
5
- Ngari 9.txt vs Nepal12.txt,46.42857142857143,0.6112115732368897,,0.8407944127395544,4
6
- Ngari 9.txt vs Nepal12.txt,40.42553191489361,0.5191256830601093,,0.5026984774848224,5
7
- Ngari 9.txt vs Nepal12.txt,47.28260869565217,0.6107784431137725,,0.8380742568060093,6
8
- Ngari 9.txt vs Nepal12.txt,49.29178470254957,0.5285565939771547,,0.8409605475909782,7
9
- Ngari 9.txt vs Nepal12.txt,46.07218683651805,0.6053169734151329,,0.9306016557862976,8
10
- Ngari 9.txt vs Nepal12.txt,51.7557251908397,0.7000429737859906,,0.9600630844581352,9
11
- Ngari 9.txt vs Nepal12.txt,52.760736196319016,0.710204081632653,,0.9135878707769712,10
12
- Ngari 9.txt vs Nepal12.txt,14.92842535787321,0.08302507192766133,,0.698638890914812,11
13
- Ngari 9.txt vs Nepal12.txt,0.0,0.0,,0.0,12
14
- Ngari 9.txt vs Nepal12.txt,0.0,0.0,,0.0,13
15
- Ngari 9.txt vs Nepal12.txt,0.0,0.0,,0.0,14
16
- Ngari 9.txt vs Nepal12.txt,0.0,0.0,,0.0,15
17
- Ngari 9.txt vs Nepal12.txt,100.0,1.0,,1.0,16
18
- Ngari 9.txt vs LTWA.txt,0.0,0.0,,0.0,1
19
- Ngari 9.txt vs LTWA.txt,0.0,0.0,,0.0,2
20
- Ngari 9.txt vs LTWA.txt,0.0,0.0,,0.0,3
21
- Ngari 9.txt vs LTWA.txt,47.752808988764045,0.603648424543947,,0.8414077093281586,4
22
- Ngari 9.txt vs LTWA.txt,48.40764331210191,0.6094808126410836,,0.6526135410649626,5
23
- Ngari 9.txt vs LTWA.txt,49.13294797687861,0.6297872340425532,,0.8252183235183391,6
24
- Ngari 9.txt vs LTWA.txt,35.53459119496855,0.4071058475203553,,0.8403529862077375,7
25
- Ngari 9.txt vs LTWA.txt,45.0,0.601965601965602,,0.9452297806160965,8
26
- Ngari 9.txt vs LTWA.txt,37.89126853377265,0.29986320109439124,,0.8760838478443608,9
27
- Ngari 9.txt vs LTWA.txt,51.632047477744806,0.6395222584147665,,0.9317016829510952,10
28
- Ngari 9.txt vs LTWA.txt,14.979757085020243,0.10742761225346202,,0.7111189597708231,11
29
- Ngari 9.txt vs LTWA.txt,0.0,0.0,,0.0,12
30
- Ngari 9.txt vs LTWA.txt,0.0,0.0,,0.0,13
31
- Ngari 9.txt vs LTWA.txt,0.0,0.0,,0.0,14
32
- Ngari 9.txt vs LTWA.txt,0.0,0.0,,0.0,15
33
- Ngari 9.txt vs LTWA.txt,0.0,0.0,,0.0,16
34
- Ngari 9.txt vs Leiden.txt,0.0,0.0,,0.0,1
35
- Ngari 9.txt vs Leiden.txt,0.0,0.0,,0.0,2
36
- Ngari 9.txt vs Leiden.txt,0.0,0.0,,0.0,3
37
- Ngari 9.txt vs Leiden.txt,41.340782122905026,0.5282331511839709,,0.8525095366316284,4
38
- Ngari 9.txt vs Leiden.txt,36.80555555555556,0.4734042553191489,,0.5634721694429372,5
39
- Ngari 9.txt vs Leiden.txt,44.047619047619044,0.4728132387706856,,0.7698959290709281,6
40
- Ngari 9.txt vs Leiden.txt,35.67251461988304,0.3208020050125313,,0.784262930792386,7
41
- Ngari 9.txt vs Leiden.txt,41.01123595505618,0.4241099312929419,,0.9275267086147868,8
42
- Ngari 9.txt vs Leiden.txt,40.31209362808843,0.20184790334044064,,0.9076572014074583,9
43
- Ngari 9.txt vs Leiden.txt,50.445103857566764,0.6045733407696597,,0.9284684903895061,10
44
- Ngari 9.txt vs Leiden.txt,16.363636363636363,0.08736942070275404,,0.6999802304139516,11
45
- Ngari 9.txt vs Leiden.txt,0.0,0.0,,0.0,12
46
- Ngari 9.txt vs Leiden.txt,0.0,0.0,,0.0,13
47
- Ngari 9.txt vs Leiden.txt,0.0,0.0,,0.0,14
48
- Ngari 9.txt vs Leiden.txt,0.0,0.0,,0.0,15
49
- Ngari 9.txt vs Leiden.txt,0.0,0.0,,0.0,16
50
- Nepal12.txt vs LTWA.txt,0.0,0.0,,0.0,1
51
- Nepal12.txt vs LTWA.txt,0.0,0.0,,0.0,2
52
- Nepal12.txt vs LTWA.txt,0.0,0.0,,0.0,3
53
- Nepal12.txt vs LTWA.txt,56.493506493506494,0.6959706959706959,,0.8321482637014176,4
54
- Nepal12.txt vs LTWA.txt,39.71631205673759,0.5386666666666666,,0.7104447145077406,5
55
- Nepal12.txt vs LTWA.txt,48.795180722891565,0.5898004434589801,,0.8168699067131293,6
56
- Nepal12.txt vs LTWA.txt,34.954407294832826,0.3365548607163161,,0.861391898750807,7
57
- Nepal12.txt vs LTWA.txt,51.41509433962265,0.5873239436619718,,0.9310750730815768,8
58
- Nepal12.txt vs LTWA.txt,41.18705035971223,0.3156208277703605,,0.9075961630628558,9
59
- Nepal12.txt vs LTWA.txt,60.066006600660074,0.7040533037201555,,0.921390350997517,10
60
- Nepal12.txt vs LTWA.txt,63.6986301369863,0.7454220634211701,,0.9803189694519824,11
61
- Nepal12.txt vs LTWA.txt,48.275862068965516,0.5102639296187683,,0.7725258306356406,12
62
- Nepal12.txt vs LTWA.txt,58.203125,0.7364921030756443,,0.9543942889292814,13
63
- Nepal12.txt vs LTWA.txt,41.732283464566926,0.4332449160035367,,0.8497746214132795,14
64
- Nepal12.txt vs LTWA.txt,17.983651226158038,0.1474820143884892,,0.5779105517118261,15
65
- Nepal12.txt vs LTWA.txt,0.0,0.0,,0.0,16
66
- Nepal12.txt vs Leiden.txt,0.0,0.0,,0.0,1
67
- Nepal12.txt vs Leiden.txt,0.0,0.0,,0.0,2
68
- Nepal12.txt vs Leiden.txt,0.0,0.0,,0.0,3
69
- Nepal12.txt vs Leiden.txt,57.14285714285714,0.6788617886178862,,0.8403894964769358,4
70
- Nepal12.txt vs Leiden.txt,38.793103448275865,0.4935064935064935,,0.4684416871587978,5
71
- Nepal12.txt vs Leiden.txt,60.416666666666664,0.6386138613861386,,0.8441982917223785,6
72
- Nepal12.txt vs Leiden.txt,43.24324324324324,0.33632734530938124,,0.8839876637274263,7
73
- Nepal12.txt vs Leiden.txt,50.8235294117647,0.4953338119167265,,0.9373191281412603,8
74
- Nepal12.txt vs Leiden.txt,44.03927068723703,0.2242042672263029,,0.9196700291527228,9
75
- Nepal12.txt vs Leiden.txt,67.59581881533101,0.7226027397260274,,0.9462708958951278,10
76
- Nepal12.txt vs Leiden.txt,60.42780748663101,0.7003094501309212,,0.9722895878422901,11
77
- Nepal12.txt vs Leiden.txt,23.502304147465438,0.27245508982035926,,0.6893488630692246,12
78
- Nepal12.txt vs Leiden.txt,67.08333333333333,0.7506382978723404,,0.9466019120384076,13
79
- Nepal12.txt vs Leiden.txt,42.67782426778243,0.418426103646833,,0.8023010077421123,14
80
- Nepal12.txt vs Leiden.txt,31.17206982543641,0.2664756446991404,,0.757778410785804,15
81
- Nepal12.txt vs Leiden.txt,0.0,0.0,,0.0,16
82
- LTWA.txt vs Leiden.txt,53.5064935064935,0.6359163591635917,,0.9623337315161734,1
83
- LTWA.txt vs Leiden.txt,60.909090909090914,0.7578659370725034,,0.8852155398192683,2
84
- LTWA.txt vs Leiden.txt,64.1025641025641,0.7001044932079414,,0.8986878289296542,3
85
- LTWA.txt vs Leiden.txt,51.21951219512195,0.6568265682656826,,0.8596233249504219,4
86
- LTWA.txt vs Leiden.txt,40.0,0.5298701298701298,,0.6287776677298036,5
87
- LTWA.txt vs Leiden.txt,46.308724832214764,0.5415549597855228,,0.7796920776498958,6
88
- LTWA.txt vs Leiden.txt,43.233082706766915,0.49545136459062283,,0.8819142330949857,7
89
- LTWA.txt vs Leiden.txt,54.03050108932462,0.5373230373230373,,0.9477445373252964,8
90
- LTWA.txt vs Leiden.txt,38.75598086124402,0.1898707353252808,,0.887072472781142,9
91
- LTWA.txt vs Leiden.txt,66.32996632996633,0.7823310271420969,,0.9693004524579277,10
92
- LTWA.txt vs Leiden.txt,63.537906137184116,0.754516983859311,,0.9830176756030125,11
93
- LTWA.txt vs Leiden.txt,24.299065420560748,0.18152350081037277,,0.6278532648577805,12
94
- LTWA.txt vs Leiden.txt,60.1593625498008,0.7367521367521368,,0.9381662329597793,13
95
- LTWA.txt vs Leiden.txt,59.44444444444444,0.6746987951807228,,0.8771500136505623,14
96
- LTWA.txt vs Leiden.txt,35.37735849056604,0.39255014326647564,,0.6834100468628878,15
97
- LTWA.txt vs Leiden.txt,60.45081967213115,0.6875444839857652,,0.9482911929631709,16
 
user_guide.md DELETED
@@ -1,190 +0,0 @@
1
- # Tibetan Text Metrics Web Application User Guide
2
-
3
- ## Introduction
4
-
5
- Welcome to the Tibetan Text Metrics Web Application! This user-friendly tool allows you to analyze textual similarities and variations in Tibetan manuscripts using multiple computational approaches. The application provides a graphical interface to the core functionalities of the Tibetan Text Metrics (TTM) project.
6
-
7
- ## Getting Started
8
-
9
- ### System Requirements
10
-
11
- - Modern web browser (Chrome, Firefox, Safari, or Edge)
12
- - For local installation: Python 3.10 or newer
13
- - Sufficient RAM for processing large texts (4GB minimum, 8GB recommended)
14
-
15
- ### Installation and Setup
16
-
17
- #### Online Demo
18
-
19
- The easiest way to try the application is through our Hugging Face Spaces demo:
20
- [daniel-wojahn/ttm-webapp-hf](https://huggingface.co/spaces/daniel-wojahn/ttm-webapp-hf)
21
-
22
- Note: The free tier of Hugging Face Spaces may have performance limitations compared to running locally.
23
-
24
- #### Local Installation
25
-
26
- 1. Clone the repository:
27
- ```bash
28
- git clone https://github.com/daniel-wojahn/tibetan-text-metrics.git
29
- cd tibetan-text-metrics/webapp
30
- ```
31
-
32
- 2. Create and activate a virtual environment:
33
- ```bash
34
- python -m venv venv
35
- source venv/bin/activate # On Windows: venv\Scripts\activate
36
- ```
37
-
38
- 3. Install dependencies:
39
- ```bash
40
- pip install -r requirements.txt
41
- ```
42
-
43
- 4. Run the application:
44
- ```bash
45
- python app.py
46
- ```
47
-
48
- 5. Open your browser and navigate to:
49
- ```
50
- http://localhost:7860
51
- ```
52
-
53
- ## Using the Application
54
-
55
- ### Step 1: Upload Your Tibetan Text Files
56
-
57
- 1. Click the "Upload Tibetan .txt files" button to select one or more `.txt` files containing Tibetan text.
58
- 2. Files should be in UTF-8 or UTF-16 encoding.
59
- 3. Maximum file size: 10MB per file (for optimal performance, use files under 1MB).
60
- 4. For best results, your texts should be segmented into chapters/sections using the Tibetan marker '༈' (*sbrul shad*).
61
-
62
- ### Step 2: Configure Analysis Options
63
-
64
- 1. **Semantic Similarity**: Choose whether to compute semantic similarity metrics.
65
- - "Yes" (default): Includes semantic similarity in the analysis (slower but more comprehensive).
66
- - "No": Skips semantic similarity calculation for faster processing.
67
-
68
- 2. **Embedding Model**: Select the model to use for semantic similarity analysis.
69
- - **sentence-transformers/all-MiniLM-L6-v2** (default): General purpose sentence embedding model (fastest option).
70
- - **sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2**: Multilingual model with good performance for many languages.
71
- - **buddhist-nlp/buddhist-sentence-similarity**: Optimized for Buddhist text similarity.
72
- - **xlm-roberta-base**: Multilingual model that includes Tibetan.
73
-
74
- 3. Click the "Run Analysis" button to start processing.
75
-
76
- ### Step 3: View and Interpret Results
77
-
78
- After processing, the application displays several visualizations and metrics:
79
-
80
- #### Word Count Chart
81
-
82
- Shows the number of words in each chapter/segment of each file, allowing you to compare the relative lengths of different texts.
83
-
84
- #### Similarity Metrics
85
-
86
- The application computes four different similarity metrics between corresponding chapters of different files:
87
-
88
- 1. **Jaccard Similarity (%)**: Measures vocabulary overlap between segments after filtering out common Tibetan stopwords. A higher percentage indicates a greater overlap in the significant vocabularies used in the two segments.
89
-
90
- 2. **Normalized LCS (Longest Common Subsequence)**: Measures the length of the longest sequence of words that appears in both text segments, maintaining their original relative order. A higher score suggests more significant shared phrasing, direct textual borrowing, or strong structural parallelism.
91
-
92
- 3. **Semantic Similarity**: Uses a transformer-based model to compute the cosine similarity between the semantic embeddings of text segments. This captures similarities in meaning even when different vocabulary is used.
93
-
94
- 4. **TF-IDF Cosine Similarity**: Compares texts based on their important, characteristic terms by giving higher weight to words that are frequent within a particular segment but relatively rare across the entire collection.
95
-
96
- #### Heatmap Visualizations
97
-
98
- Each metric has a corresponding heatmap visualization where:
99
- - Rows represent chapters/segments
100
- - Columns represent text pairs being compared
101
- - Color intensity indicates similarity (brighter = more similar)
102
-
103
- ### Tips for Effective Analysis
104
-
105
- 1. **Text Segmentation**: For meaningful chapter-level comparisons, ensure your texts are segmented using the Tibetan marker '༈' (*sbrul shad*).
106
-
107
- 2. **File Naming**: Use descriptive filenames to make the comparison results easier to interpret.
108
-
109
- 3. **Model Selection**:
110
- - For faster processing, use the default model or disable semantic similarity.
111
- - For Buddhist texts, the buddhist-nlp/buddhist-sentence-similarity model may provide better results.
112
-
113
- 4. **File Size**:
114
- - Keep individual files under 1MB for optimal performance.
115
- - Very large files (>10MB) are not supported and will trigger an error.
116
-
117
- 5. **Comparing Multiple Texts**: The application requires at least two text files to compute similarity metrics.
118
-
119
- ## Understanding the Metrics
120
-
121
- ### Jaccard Similarity (%)
122
-
123
- This metric quantifies the lexical overlap between two text segments by comparing their sets of unique words, after filtering out common Tibetan stopwords. It essentially answers the question: 'Of all the distinct, meaningful words found across these two segments, what proportion of them are present in both?'
124
-
125
- It is calculated as:
126
- ```
127
- (Number of common unique meaningful words) / (Total number of unique meaningful words in both texts combined) * 100
128
- ```
129
-
130
- Jaccard Similarity is insensitive to word order and word frequency; it only cares whether a unique meaningful word is present or absent. A higher percentage indicates a greater overlap in the significant vocabularies used in the two segments.
131
-
132
- ### Normalized LCS (Longest Common Subsequence)
133
-
134
- This metric measures the length of the longest sequence of words that appears in both text segments, maintaining their original relative order. Importantly, these words do not need to be directly adjacent (contiguous) in either text.
135
-
136
- For example, if Text A is 'the quick brown fox jumps' and Text B is 'the lazy cat and brown dog jumps high', the LCS is 'the brown jumps'.
137
-
138
- The length of this common subsequence is then normalized to provide a score. A higher Normalized LCS score suggests more significant shared phrasing, direct textual borrowing, or strong structural parallelism, as it reflects similarities in how ideas are ordered and expressed sequentially.
139
-
140
- Unlike other metrics, LCS does not filter out stopwords, allowing it to capture structural similarities and the flow of language, including the use of particles and common words that contribute to sentence construction.
141
-
142
- ### Semantic Similarity
143
-
144
- This metric utilizes transformer-based models to compute the cosine similarity between the semantic embeddings of text segments. The model converts each text segment into a high-dimensional vector that captures its semantic meaning.
145
-
146
- For texts exceeding the model's token limit, an automated chunking strategy is employed: texts are divided into overlapping chunks, each chunk is embedded, and the resulting chunk embeddings are averaged to produce a single representative vector for the entire segment before comparison.
147
-
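A simplified sketch of this chunking strategy (the chunk size and overlap below are illustrative; the application's actual values may differ):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def embed_long_text(text: str, model: SentenceTransformer,
                    chunk_len: int = 256, overlap: int = 64) -> np.ndarray:
    """Embed overlapping chunks and average them into one segment vector."""
    words = text.split()
    step = chunk_len - overlap
    chunks = [" ".join(words[i:i + chunk_len])
              for i in range(0, max(len(words), 1), step)]
    vectors = model.encode(chunks)   # one embedding per chunk
    return np.mean(vectors, axis=0)  # mean-pool into a single vector
```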
148
- A higher score indicates that the texts express similar concepts or ideas, even if they use different vocabulary or phrasing.
149
-
150
- ### TF-IDF Cosine Similarity
151
-
152
- This metric first calculates Term Frequency-Inverse Document Frequency (TF-IDF) scores for each word in each text segment, after filtering out common Tibetan stopwords. TF-IDF gives higher weight to words that are frequent within a particular segment but relatively rare across the entire collection of segments.
153
-
154
- Each segment is then represented as a vector of these TF-IDF scores, and the cosine similarity is computed between these vectors. A score closer to 1 indicates that the two segments share more of these important, distinguishing terms, suggesting they cover similar specific topics or themes.
155
-
156
- ## Troubleshooting
157
-
158
- ### Common Issues and Solutions
159
-
160
- 1. **"Empty vocabulary" error**:
161
- - This can occur if a text contains only stopwords or if tokenization fails.
162
- - Solution: Check your input text to ensure it contains valid Tibetan content.
163
-
164
- 2. **Model loading errors**:
165
- - If a model fails to load, the application will continue without semantic similarity.
166
- - Solution: Try a different model or disable semantic similarity.
167
-
168
- 3. **Performance issues with large files**:
169
- - Solution: Split large files into smaller ones or use fewer files at once.
170
-
171
- 4. **No results displayed**:
172
- - Solution: Ensure you have uploaded at least two valid text files and that they contain comparable content.
173
-
174
- 5. **Encoding issues**:
175
- - If your text appears garbled, it may have encoding problems.
176
- - Solution: Ensure your files are saved in UTF-8 or UTF-16 encoding.
177
-
178
- ### Getting Help
179
-
180
- If you encounter issues not covered in this guide, please:
181
- 1. Check the [GitHub repository](https://github.com/daniel-wojahn/tibetan-text-metrics) for updates or known issues.
182
- 2. Submit an issue on GitHub with details about your problem.
183
-
184
- ## Acknowledgments
185
-
186
- The Tibetan Text Metrics project was developed as part of the [Law in Historic Tibet](https://www.law.ox.ac.uk/law-historic-tibet) project at the Centre for Socio-Legal Studies at the University of Oxford.
187
-
188
- ## License
189
-
190
- This project is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).