# Tibetan Text Metrics Web Application User Guide

## Introduction

Welcome to the Tibetan Text Metrics Web Application! This user-friendly tool allows you to analyze textual similarities and variations in Tibetan manuscripts using multiple computational approaches. The application provides a graphical interface to the core functionalities of the Tibetan Text Metrics (TTM) project.

## Getting Started

### System Requirements

- Modern web browser (Chrome, Firefox, Safari, or Edge)
- For local installation: Python 3.10 or newer
- Sufficient RAM for processing large texts (4GB minimum, 8GB recommended)

### Installation and Setup

#### Online Demo

The easiest way to try the application is through our Hugging Face Spaces demo:
[daniel-wojahn/ttm-webapp-hf](https://huggingface.co/spaces/daniel-wojahn/ttm-webapp-hf)

Note: The free tier of Hugging Face Spaces may have performance limitations compared to running locally.

#### Local Installation

1. Clone the repository:
   ```bash
   git clone https://github.com/daniel-wojahn/tibetan-text-metrics.git
   cd tibetan-text-metrics/webapp
   ```

2. Create and activate a virtual environment:
   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```

4. Run the application:
   ```bash
   python app.py
   ```

5. Open your browser and navigate to:
   ```
   http://localhost:7860
   ```

## Using the Application

### Step 1: Upload Your Tibetan Text Files

1. Click the "Upload Tibetan .txt files" button to select one or more `.txt` files containing Tibetan text.
2. Files should be in UTF-8 or UTF-16 encoding.
3. Maximum file size: 10MB per file (for optimal performance, use files under 1MB).
4. For best results, your texts should be segmented into chapters/sections using the Tibetan marker '༈' (*sbrul shad*).
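If your files are not yet segmented, splitting on the marker is straightforward. The sketch below is illustrative only (the function name and the choice to drop empty sections are assumptions, not the application's actual implementation):

```python
def split_into_sections(text: str, marker: str = "༈") -> list[str]:
    """Split a Tibetan text into sections on the sbrul shad marker."""
    # Drop empty sections produced by leading/trailing markers.
    return [s.strip() for s in text.split(marker) if s.strip()]

raw = "༈ བཀྲ་ཤིས་ ༈ བདེ་ལེགས་"
sections = split_into_sections(raw)
print(len(sections))  # 2
```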

### Step 2: Configure Analysis Options

1. **Semantic Similarity**: Choose whether to compute semantic similarity metrics.
   - "Yes" (default): Includes semantic similarity in the analysis (slower but more comprehensive).
   - "No": Skips semantic similarity calculation for faster processing.

2. **Embedding Model**: Select the model to use for semantic similarity analysis.
   - **sentence-transformers/all-MiniLM-L6-v2** (default): General purpose sentence embedding model (fastest option).
   - **sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2**: Multilingual model with good performance for many languages.
   - **buddhist-nlp/buddhist-sentence-similarity**: Optimized for Buddhist text similarity.
   - **xlm-roberta-base**: Multilingual model that includes Tibetan.

3. Click the "Run Analysis" button to start processing.

### Step 3: View and Interpret Results

After processing, the application displays several visualizations and metrics:

#### Word Count Chart

Shows the number of words in each chapter/segment of each file, allowing you to compare the relative lengths of different texts.

#### Similarity Metrics

The application computes four different similarity metrics between corresponding chapters of different files:

1. **Jaccard Similarity (%)**: Measures vocabulary overlap between segments after filtering out common Tibetan stopwords. A higher percentage indicates a greater overlap in the significant vocabularies used in the two segments.

2. **Normalized LCS (Longest Common Subsequence)**: Measures the length of the longest sequence of words that appears in both text segments, maintaining their original relative order. A higher score suggests more significant shared phrasing, direct textual borrowing, or strong structural parallelism.

3. **Semantic Similarity**: Uses a transformer-based model to compute the cosine similarity between the semantic embeddings of text segments. This captures similarities in meaning even when different vocabulary is used.

4. **TF-IDF Cosine Similarity**: Compares texts based on their important, characteristic terms by giving higher weight to words that are frequent within a particular segment but relatively rare across the entire collection.

#### Heatmap Visualizations

Each metric has a corresponding heatmap visualization where:
- Rows represent chapters/segments
- Columns represent text pairs being compared
- Color intensity indicates similarity (brighter = more similar)

### Tips for Effective Analysis

1. **Text Segmentation**: For meaningful chapter-level comparisons, ensure your texts are segmented using the Tibetan marker '༈' (*sbrul shad*).

2. **File Naming**: Use descriptive filenames to make the comparison results easier to interpret.

3. **Model Selection**: 
   - For faster processing, use the default model or disable semantic similarity.
   - For Buddhist texts, the buddhist-nlp/buddhist-sentence-similarity model may provide better results.

4. **File Size**: 
   - Keep individual files under 1MB for optimal performance.
   - Very large files (>10MB) are not supported and will trigger an error.

5. **Comparing Multiple Texts**: The application requires at least two text files to compute similarity metrics.

## Understanding the Metrics

### Jaccard Similarity (%)

This metric quantifies the lexical overlap between two text segments by comparing their sets of unique words, after filtering out common Tibetan stopwords. It essentially answers the question: 'Of all the distinct, meaningful words found across these two segments, what proportion of them are present in both?'

It is calculated as:
```
(Number of common unique meaningful words) / (Total number of unique meaningful words in both texts combined) * 100
```

Jaccard Similarity is insensitive to word order and word frequency; it only cares whether a unique meaningful word is present or absent. A higher percentage indicates a greater overlap in the significant vocabularies used in the two segments.
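The calculation can be sketched in a few lines of Python. The English tokens and the tiny stopword set below are placeholders; the application applies a Tibetan stopword list:

```python
def jaccard_similarity(tokens_a, tokens_b, stopwords=frozenset()):
    """Percentage overlap of unique, non-stopword tokens."""
    set_a = set(tokens_a) - stopwords
    set_b = set(tokens_b) - stopwords
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union) * 100

a = "the snow lion roams the mountains".split()
b = "the snow lion guards the pass".split()
# 2 shared words out of 6 unique meaningful words -> 33.3%
print(round(jaccard_similarity(a, b, stopwords={"the"}), 1))  # 33.3
```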

### Normalized LCS (Longest Common Subsequence)

This metric measures the length of the longest sequence of words that appears in both text segments, maintaining their original relative order. Importantly, these words do not need to be directly adjacent (contiguous) in either text.

For example, if Text A is 'the quick brown fox jumps' and Text B is 'the lazy cat and brown dog jumps high', the LCS is 'the brown jumps'.

The length of this common subsequence is then normalized for segment length, so that scores are comparable across segments of different sizes. A higher Normalized LCS score suggests more significant shared phrasing, direct textual borrowing, or strong structural parallelism, as it reflects similarities in how ideas are ordered and expressed sequentially.

Unlike other metrics, LCS does not filter out stopwords, allowing it to capture structural similarities and the flow of language, including the use of particles and common words that contribute to sentence construction.
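The standard dynamic-programming computation of LCS length, applied to the example above, can be sketched as follows. Normalizing by the average segment length is one plausible scheme, not necessarily the one the application uses:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    m, n = len(a), len(b)
    # dp[i][j] = LCS length of a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def normalized_lcs(a, b):
    """Normalize LCS length by the average segment length (illustrative)."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / ((len(a) + len(b)) / 2)

text_a = "the quick brown fox jumps".split()
text_b = "the lazy cat and brown dog jumps high".split()
print(lcs_length(text_a, text_b))  # 3 ("the brown jumps")
```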

### Semantic Similarity

This metric utilizes transformer-based models to compute the cosine similarity between the semantic embeddings of text segments. The model converts each text segment into a high-dimensional vector that captures its semantic meaning.

For texts exceeding the model's token limit, an automated chunking strategy is employed: texts are divided into overlapping chunks, each chunk is embedded, and the resulting chunk embeddings are averaged to produce a single representative vector for the entire segment before comparison.

A higher score indicates that the texts express similar concepts or ideas, even if they use different vocabulary or phrasing.
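The chunk-and-average strategy can be sketched independently of any particular model. The window and overlap sizes below are illustrative assumptions, and in the real pipeline each chunk would be passed to the selected sentence-transformers model rather than handled as plain tokens:

```python
def chunk_tokens(tokens, max_len=128, overlap=32):
    """Split a long token list into overlapping windows."""
    step = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
    return chunks

def mean_pool(vectors):
    """Average per-chunk embedding vectors into one segment vector."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

tokens = [f"tok{i}" for i in range(300)]
chunks = chunk_tokens(tokens)
print(len(chunks))  # 3 overlapping chunks for a 300-token segment
```

The single pooled vector for each segment is then compared to other segments' vectors via cosine similarity.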

### TF-IDF Cosine Similarity

This metric first calculates Term Frequency-Inverse Document Frequency (TF-IDF) scores for each word in each text segment, after filtering out common Tibetan stopwords. TF-IDF gives higher weight to words that are frequent within a particular segment but relatively rare across the entire collection of segments.

Each segment is then represented as a vector of these TF-IDF scores, and the cosine similarity is computed between these vectors. A score closer to 1 indicates that the two segments share more of these important, distinguishing terms, suggesting they cover similar specific topics or themes.
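A self-contained sketch of this computation is shown below, using a common smoothed-IDF variant; the application's actual weighting and tokenization may differ, and real input would be Tibetan tokens with stopwords already removed:

```python
import math
from collections import Counter

def tfidf_vectors(segments):
    """Build TF-IDF vectors for a list of token lists (smoothed IDF)."""
    n = len(segments)
    df = Counter()
    for seg in segments:
        df.update(set(seg))  # document frequency per unique word
    vocab = sorted(df)
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    return [[Counter(seg)[w] * idf[w] for w in vocab] for seg in segments]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

segs = ["law code judgment fine".split(), "law code oath witness".split()]
v1, v2 = tfidf_vectors(segs)
print(round(cosine(v1, v2), 3))
```

Shared but ubiquitous words ("law", "code") get low weight, so the score is driven by the rarer, more characteristic terms.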

## Troubleshooting

### Common Issues and Solutions

1. **"Empty vocabulary" error**:
   - This can occur if a text contains only stopwords or if tokenization fails.
   - Solution: Check your input text to ensure it contains valid Tibetan content.

2. **Model loading errors**:
   - If a model fails to load, the application will continue without semantic similarity.
   - Solution: Try a different model or disable semantic similarity.

3. **Performance issues with large files**:
   - Solution: Split large files into smaller ones or use fewer files at once.

4. **No results displayed**:
   - Solution: Ensure you have uploaded at least two valid text files and that they contain comparable content.

5. **Encoding issues**:
   - If your text appears garbled, it may have encoding problems.
   - Solution: Ensure your files are saved in UTF-8 or UTF-16 encoding.

### Getting Help

If you encounter issues not covered in this guide, please:
1. Check the [GitHub repository](https://github.com/daniel-wojahn/tibetan-text-metrics) for updates or known issues.
2. Submit an issue on GitHub with details about your problem.

## Acknowledgments

The Tibetan Text Metrics project was developed as part of the [Law in Historic Tibet](https://www.law.ox.ac.uk/law-historic-tibet) project at the Centre for Socio-Legal Studies at the University of Oxford.

## License

This project is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).