uBaby4life committed
Commit 3785cde · 1 Parent(s): 22f37d2

Add Flask application with Docker setup for transliterator

Dockerfile ADDED
@@ -0,0 +1,42 @@
+ # Start from a Python base image
+ FROM python:3.9-slim
+
+ # Set environment variables
+ ENV PYTHONUNBUFFERED=1 \
+     # Sends Python output straight to the terminal without buffering,
+     # which helps with logging.
+     PIP_NO_CACHE_DIR=off \
+     # Disables pip caching, which can reduce image size.
+     PIP_DISABLE_PIP_VERSION_CHECK=on \
+     # Disables the check for a new version of pip, speeding up builds.
+     PIP_DEFAULT_TIMEOUT=100 \
+     # Increases the default timeout for pip.
+     HF_HUB_DISABLE_SYMLINKS_WARNING=1
+ # Suppresses the symlink warning from huggingface_hub
+
+ # Create a non-root user and switch to it
+ RUN useradd -m -u 1000 user
+ USER user
+ # Add the user's local bin directory to PATH. The comment sits on its own line
+ # because Dockerfiles do not support trailing comments after an instruction.
+ ENV PATH="/home/user/.local/bin:$PATH"
+
+ # Set the working directory in the container
+ WORKDIR /app
+
+ # Copy requirements.txt first to leverage Docker cache
+ COPY --chown=user ./requirements.txt requirements.txt
+
+ # Install dependencies
+ # Using --no-cache-dir to reduce image size further
+ RUN pip install --no-cache-dir --upgrade -r requirements.txt
+
+ # Copy the rest of the application code into the container
+ # This includes app.py, model_files/, static/, templates/, LICENSE, README.md
+ COPY --chown=user . .
+
+ # Expose the port the app will run on. HF Spaces expects 7860 for Docker.
+ EXPOSE 7860
+
+ # Command to run the application using Gunicorn
+ # It will listen on all interfaces (0.0.0.0) on port 7860.
+ # app:app means "in the file app.py, use the Flask instance named app".
+ CMD ["gunicorn", "--bind", "0.0.0.0:7860", "--workers", "1", "--threads", "2", "--timeout", "0", "app:app"]
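Note on the `app:app` target in the `CMD` above: Gunicorn resolves it as "module `app`, attribute `app`" and expects that attribute to be a WSGI callable, which a Flask instance is. The sketch below is a hypothetical stand-in, not the Flask app from this commit; `call_wsgi` is an illustrative helper that invokes the callable directly, roughly the way a WSGI server would.

```python
# Illustrative stand-in for app.py; the real file defines `app = Flask(__name__)`.

def app(environ, start_response):
    """Minimal WSGI callable: the kind of object "app:app" must point at."""
    body = b"ok"
    headers = [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


def call_wsgi(application, path="/"):
    """Call a WSGI application once with a hand-built environ (hypothetical helper)."""
    captured = {}

    def start_response(status, headers, exc_info=None):
        # Record what the application reports instead of writing to a socket.
        captured["status"] = status
        captured["headers"] = dict(headers)

    environ = {
        "REQUEST_METHOD": "GET",
        "PATH_INFO": path,
        "SERVER_NAME": "localhost",
        "SERVER_PORT": "7860",
        "wsgi.url_scheme": "http",
    }
    chunks = application(environ, start_response)
    return captured["status"], b"".join(chunks)


if __name__ == "__main__":
    print(call_wsgi(app))
```

If the Flask object were renamed or moved to another module, the last argument of the `CMD` would need to change to match.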
README.md CHANGED
@@ -1,12 +1,39 @@
  ---
- title: BanglaFeel
- emoji: 🚀
- colorFrom: red
- colorTo: purple
  sdk: docker
  pinned: false
  license: apache-2.0
- short_description: A customized Back Transliteration Model
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
+ title: BanglaFeel Translator
+ emoji: 🌍💬
+ colorFrom: blue
+ colorTo: green
  sdk: docker
  pinned: false
  license: apache-2.0
+ app_port: 7860 # IMPORTANT: Tell Hugging Face which port your app EXPOSES
  ---

+ # BanglaFeel Translator
+
+ A Flask web application for English to Bengali transliteration using a custom-trained DualEncoderDecoder model.
+
+ ## How to Use
+
+ Visit the deployed Space URL and type or paste English text into the input box. Click "Translate" to see the Bengali transliteration.
+
+ ## Model Details
+
+ This model is a custom architecture (DualEncoderDecoder) combining T5 (csebuetnlp/banglat5) with a hybrid character CNN and word LSTM encoder.
+ * **Base T5 Model:** `csebuetnlp/banglat5`
+ * **Base Encoder Tokenizer:** `csebuetnlp/banglabert`
+ * **Custom Components:** CharCNN, WordLSTM, HybridEncoder
+ * Trained for English to Bengali transliteration.
+
+ ## Intended Uses & Limitations
+
+ * **Intended Use:** Transliteration of English text (phonetically representing Bengali words) into Bengali script.
+ * **Limitations:**
+     * May not handle all English phonetic variations perfectly.
+     * Performance depends on the training data.
+     * Currently handles inputs up to 500 characters.
+     * The free hosting tier might experience cold starts.
+
+ ## License
+
+ The code and model are licensed under the Apache License 2.0. See the `LICENSE` file for details.
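Note on the character-level branch named under Model Details: a CharCNN consumes a fixed-width grid of character IDs per token. The sketch below shows one way to build such a grid, mirroring the `tokenize_characters` and `pad_sequence` helpers added in `app.py`. The vocabulary `TOY_CHAR_TO_ID`, the function names, and the whitespace splitting are assumptions for illustration; the app itself derives tokens from the encoder tokenizer and loads `char_to_id` from the checkpoint.

```python
from typing import Dict, List

# Hypothetical toy vocabulary; the real mapping comes from the checkpoint.
TOY_CHAR_TO_ID: Dict[str, int] = {
    "<PAD>": 0, "<UNK>": 1, "a": 2, "m": 3, "i": 4, "r": 5,
}


def encode_word(word: str, char_to_id: Dict[str, int], max_char_len: int) -> List[int]:
    """Map one word to exactly max_char_len character IDs (truncate, then pad)."""
    unk_id = char_to_id["<UNK>"]
    pad_id = char_to_id["<PAD>"]
    ids = [char_to_id.get(ch, unk_id) for ch in word][:max_char_len]
    return ids + [pad_id] * (max_char_len - len(ids))


def encode_text(text: str, char_to_id: Dict[str, int], max_char_len: int) -> List[List[int]]:
    """Encode each whitespace-separated word; rows form a (num_words, max_char_len) grid."""
    return [encode_word(word, char_to_id, max_char_len) for word in text.split()]


if __name__ == "__main__":
    for row in encode_text("ami amar", TOY_CHAR_TO_ID, max_char_len=5):
        print(row)
```

Characters outside the vocabulary map to `<UNK>`, and words longer than `max_char_len` are truncated, so every row has the same width.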
app.py ADDED
@@ -0,0 +1,627 @@
+ import os
+ import sys
+ import random
+ import torch
+ import numpy as np
+ from flask import Flask, request, jsonify, render_template
+
+ os.environ['HF_HUB_DISABLE_SYMLINKS_WARNING'] = '1'
+ import torch.nn as nn
+ import torch.nn.functional as F
+ from transformers import T5Tokenizer, AutoTokenizer, T5ForConditionalGeneration
+ from transformers.modeling_outputs import BaseModelOutputWithPastAndCrossAttentions
+
+ # Get the directory of the current script (app.py)
+ APP_ROOT = os.path.dirname(os.path.abspath(__file__))
+ MODEL_FILES_DIR = os.path.join(APP_ROOT, 'model_files')  # Path to your model_files directory
+
+ # Ensure CFG.device is set to CPU for Hugging Face Spaces free tier
+ # device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') # Original
+ device = torch.device('cpu')  # MODIFIED FOR HF SPACES
+
+ class CFG:
+     model_name = 'csebuetnlp/banglat5'  # This is used for initial T5 model loading
+     encoder_name = 'csebuetnlp/banglabert'  # This is used for initial encoder tokenizer loading
+     batch_size = 1
+     max_len = 512
+     seed = 42
+     device = device  # Use the modified device
+
+ # ... (rest of your imports and set_seed function, CharCNNEncoder, WordLSTMEncoder, HybridEncoder, DualEncoderDecoder classes remain the same)
+ # Ensure these classes are present in your actual app.py
+
+ # The initial tokenizer loading below will try to download from the hub.
+ # This is okay, as load_checkpoint will later load your specific saved tokenizers
+ # from local files using local_files_only=True.
+ # If you wanted to avoid ANY hub download, you'd need to ensure your model_files/tokenizers
+ # are sufficient for T5Tokenizer.from_pretrained to work with local_files_only=True
+ # from the very start, which might require more config files in those dirs.
+ # For now, this setup is fine.
+
+ CFG.t5_tokenizer = T5Tokenizer.from_pretrained(
+     CFG.model_name,
+     legacy=False,
+     model_max_length=CFG.max_len
+ )
+ if CFG.t5_tokenizer.pad_token is None:
+     CFG.t5_tokenizer.pad_token = CFG.t5_tokenizer.eos_token
+ if CFG.t5_tokenizer.bos_token is None:
+     CFG.t5_tokenizer.bos_token = CFG.t5_tokenizer.eos_token
+
+ CFG.encoder_tokenizer = AutoTokenizer.from_pretrained(
+     CFG.encoder_name,
+     model_max_length=CFG.max_len
+ )
+ if CFG.encoder_tokenizer.pad_token is None:
+     CFG.encoder_tokenizer.add_special_tokens({'pad_token': '[PAD]'})  # This line might change vocab size if not already done
+     CFG.encoder_tokenizer.pad_token = '[PAD]'
+
+
+ def compute_max_char_len(texts):  # Ensure this function is present
+     # Returns the length of the longest whitespace-separated word across all
+     # string entries in `texts`. Falls back to 50 when there is nothing to
+     # measure: None, an empty list, no string entries, or only empty or
+     # whitespace-only strings.
+     # If not used at inference time, this can be removed; it was in the original training code.
+     if not texts:
+         return 50
+     word_lengths = [
+         len(word)
+         for text in texts
+         if isinstance(text, str)
+         for word in text.split()
+     ]
+     return max(word_lengths, default=50)
+
+
+ # --- START: PASTE YOUR CharCNNEncoder, WordLSTMEncoder, HybridEncoder, DualEncoderDecoder classes here ---
+ # (As provided in your original app.py)
+ # Make sure they are correctly defined before load_checkpoint
+ class CharCNNEncoder(nn.Module):
+     def __init__(self, char_vocab_size, char_embedding_dim, char_cnn_output_dim, kernel_sizes, num_filters, dropout=0.1):
+         super(CharCNNEncoder, self).__init__()
+         self.char_embedding = nn.Embedding(char_vocab_size, char_embedding_dim, padding_idx=0)
+         self.conv_layers = nn.ModuleList()
+         for ks, nf in zip(kernel_sizes, num_filters):
+             self.conv_layers.append(
+                 nn.Sequential(
+                     nn.Conv1d(char_embedding_dim, nf, kernel_size=ks, padding=ks // 2),
+                     nn.ReLU(),
+                     nn.AdaptiveMaxPool1d(1)
+                 )
+             )
+         self.dropout = nn.Dropout(dropout)
+         self.output_projection = nn.Linear(sum(num_filters), char_cnn_output_dim)
+
+     def forward(self, char_input):
+         # char_input: (batch_size, seq_len, char_len)
+         batch_size, seq_len, char_len = char_input.size()
+         char_input = char_input.view(-1, char_len)
+         char_emb = self.char_embedding(char_input)
+         char_emb = char_emb.permute(0, 2, 1)
+         conv_outputs = [conv(char_emb) for conv in self.conv_layers]
+         concat_output = torch.cat(conv_outputs, dim=1)
+         concat_output = concat_output.squeeze(-1)
+         concat_output = self.dropout(concat_output)
+         char_cnn_output = self.output_projection(concat_output)
+         char_cnn_output = char_cnn_output.view(batch_size, seq_len, -1)
+         return char_cnn_output
+
+ class WordLSTMEncoder(nn.Module):
+     def __init__(self, word_vocab_size, word_embedding_dim, word_lstm_hidden_dim, num_lstm_layers, dropout):
+         super(WordLSTMEncoder, self).__init__()
+         # Ensure CFG.encoder_tokenizer is loaded before this class is instantiated if padding_idx relies on it.
+         # The current code structure loads CFG.encoder_tokenizer globally first.
+         padding_idx_val = CFG.encoder_tokenizer.pad_token_id if hasattr(CFG, 'encoder_tokenizer') and CFG.encoder_tokenizer.pad_token_id is not None else 0
+         self.word_embedding = nn.Embedding(
+             word_vocab_size,
+             word_embedding_dim,
+             padding_idx=padding_idx_val
+         )
+         self.lstm = nn.LSTM(
+             word_embedding_dim,
+             word_lstm_hidden_dim,
+             num_layers=num_lstm_layers,
+             batch_first=True,
+             dropout=dropout,
+             bidirectional=True
+         )
+         self.output_projection = nn.Linear(2 * word_lstm_hidden_dim, word_lstm_hidden_dim)
+
+     def forward(self, word_input, sequence_lengths):
+         batch_size = word_input.size(0)
+         word_emb = self.word_embedding(word_input)
+
+         # Ensure sequence_lengths is on CPU for sorting and pack_padded_sequence
+         sequence_lengths_cpu = sequence_lengths.cpu()
+
+         # Handle cases where all sequence lengths might be zero, which can cause issues with sorting
+         if torch.all(sequence_lengths_cpu == 0):
+             # If all lengths are 0, LSTM output will be zeros.
+             # We need to create zero tensors of the expected shape.
+             # This is a simplified handling; a more robust solution might be needed
+             # depending on how zero-length sequences are meant to be processed.
+             lstm_out = torch.zeros(batch_size, word_input.size(1), self.lstm.hidden_size * 2, device=word_input.device)
+             hidden = torch.zeros(batch_size, self.lstm.hidden_size * 2, device=word_input.device)
+             return self.output_projection(hidden), lstm_out
+
+         sorted_lengths, sort_idx = sequence_lengths_cpu.sort(0, descending=True)
+         sorted_word_emb = word_emb[sort_idx]
+
+         # Filter out zero-length sequences before packing if pack_padded_sequence requires it
+         # For PyTorch versions where pack_padded_sequence handles zero lengths in sorted_lengths:
+         packed_word_emb = nn.utils.rnn.pack_padded_sequence(
+             sorted_word_emb,
+             sorted_lengths.clamp(min=1),  # Ensure lengths are at least 1 for packing if issues arise
+             batch_first=True,
+             enforce_sorted=True  # This is important
+         )
+         packed_lstm_out, (hidden_state, cell_state) = self.lstm(packed_word_emb)
+         lstm_out, _ = nn.utils.rnn.pad_packed_sequence(
+             packed_lstm_out,
+             batch_first=True,
+             total_length=word_input.size(1)
+         )
+         _, unsort_idx = sort_idx.sort(0)
+         lstm_out = lstm_out[unsort_idx]
+
+         # Process hidden state correctly for bidirectional LSTM
+         # hidden_state is (num_layers * num_directions, batch, hidden_size)
+         # We want the last layer's hidden states (forward and backward)
+         hidden_state = hidden_state.view(self.lstm.num_layers, 2, batch_size, self.lstm.hidden_size)  # 2 for bidirectional
+         hidden_state_last_layer = hidden_state[-1]  # Get the last layer
+         # Concatenate forward and backward hidden states: (batch, 2 * hidden_size)
+         final_hidden = torch.cat((hidden_state_last_layer[0], hidden_state_last_layer[1]), dim=1)
+         final_hidden = final_hidden[unsort_idx]  # Unsort to original batch order
+
+         return self.output_projection(final_hidden), lstm_out
+
+
+ class HybridEncoder(nn.Module):
+     def __init__(self, char_cnn_encoder, word_lstm_encoder, hybrid_encoder_output_dim):
+         super(HybridEncoder, self).__init__()
+         self.char_cnn_encoder = char_cnn_encoder
+         self.word_lstm_encoder = word_lstm_encoder
+         self.char_hidden_size = char_cnn_encoder.output_projection.out_features
+         # Size of the projected final hidden state returned by WordLSTMEncoder (word_lstm_hidden_dim)
+         self.lstm_projected_hidden_size = word_lstm_encoder.output_projection.out_features
+         # The per-token LSTM sequence output is not projected, so it has 2 * lstm.hidden_size features (bidirectional)
+         self.lstm_sequence_output_size = word_lstm_encoder.lstm.hidden_size * 2
+
+         # The output_projection combines char_cnn_output and the sequence output of the LSTM
+         self.output_projection = nn.Linear(self.char_hidden_size + self.lstm_sequence_output_size, hybrid_encoder_output_dim)
+
+
+     def forward(self, char_input, word_input, sequence_lengths):
+         batch_size = char_input.size(0)
+         max_seq_len = word_input.size(1)  # Assuming word_input determines max_seq_len
+
+         char_cnn_output = self.char_cnn_encoder(char_input)  # (batch_size, char_seq_len, char_cnn_output_dim)
+
+         # Ensure sequence_lengths is on the same device as the model/input
+         sequence_lengths = sequence_lengths.to(word_input.device)
+         _, lstm_sequence_output = self.word_lstm_encoder(word_input, sequence_lengths)  # (batch_size, word_seq_len, 2 * lstm_hidden_dim)
+
+         # Pad/truncate char_cnn_output and lstm_sequence_output to a common max_seq_len if they differ.
+         # char_input and word_input might correspond to different tokenization granularities;
+         # for simplicity, word_input's seq_len is taken as the target.
+
+         # Pad CharCNN outputs if its sequence length is less than max_seq_len from word_input
+         if char_cnn_output.size(1) < max_seq_len:
+             padding_size = max_seq_len - char_cnn_output.size(1)
+             char_cnn_output = F.pad(char_cnn_output, (0, 0, 0, padding_size), "constant", 0)
+         elif char_cnn_output.size(1) > max_seq_len:
+             char_cnn_output = char_cnn_output[:, :max_seq_len, :]
+
+         # Pad LSTM outputs if its sequence length is less than max_seq_len (should not happen if total_length in pad_packed_sequence is max_seq_len)
+         # This check is more of a safeguard.
+         if lstm_sequence_output.size(1) < max_seq_len:
+             padding_size = max_seq_len - lstm_sequence_output.size(1)
+             lstm_sequence_output = F.pad(lstm_sequence_output, (0, 0, 0, padding_size), "constant", 0)
+         elif lstm_sequence_output.size(1) > max_seq_len:
+             lstm_sequence_output = lstm_sequence_output[:, :max_seq_len, :]
+
+         hybrid_output_concat = torch.cat((char_cnn_output, lstm_sequence_output), dim=2)
+         hybrid_encoder_output = self.output_projection(hybrid_output_concat)
+         return hybrid_encoder_output
+
+ class DualEncoderDecoder(nn.Module):
+     def __init__(self, t5_model_name, hybrid_encoder, t5_tokenizer, freeze_t5=False):
+         super(DualEncoderDecoder, self).__init__()
+         self.t5 = T5ForConditionalGeneration.from_pretrained(t5_model_name)
+         self.t5_tokenizer = t5_tokenizer  # Store tokenizer if needed for vocab size etc.
+         self.hybrid_encoder = hybrid_encoder
+
+         encoder_hidden_size = self.t5.config.d_model
+         hybrid_hidden_size = hybrid_encoder.output_projection.out_features  # This is hybrid_encoder_output_dim
+
+         self.encoder_projection = nn.Linear(encoder_hidden_size + hybrid_hidden_size, encoder_hidden_size)
+
+         if freeze_t5:
+             for param in self.t5.parameters():
+                 param.requires_grad = False
+
+         # Resize T5 token embeddings if tokenizer vocab size changed (e.g., by adding special tokens)
+         # This should ideally be done *after* loading CFG.t5_tokenizer in load_checkpoint
+         # if CFG.t5_tokenizer is the one tied to the model.
+         # self.t5.resize_token_embeddings(len(self.t5_tokenizer)) # Moved to load_checkpoint
+
+     def forward(self, input_ids, attention_mask, char_input, word_input, sequence_lengths, labels=None):
+         # T5 Encoder
+         t5_encoder_outputs_dict = self.t5.encoder(input_ids=input_ids, attention_mask=attention_mask, return_dict=True)
+         t5_encoder_last_hidden_state = t5_encoder_outputs_dict.last_hidden_state  # (batch, seq_len_t5, d_model)
+
+         # Hybrid Encoder
+         # Ensure sequence_lengths is on the correct device for hybrid_encoder
+         sequence_lengths = sequence_lengths.to(char_input.device)
+         hybrid_encoder_output = self.hybrid_encoder(char_input, word_input, sequence_lengths)  # (batch, seq_len_hybrid, hybrid_output_dim)
+
+         # Determine common sequence length for concatenation
+         # Typically, input_ids for T5 and word_input for hybrid encoder should have compatible sequence lengths.
+         # If they are from different tokenizations, alignment or choosing one as primary is needed.
+         # Assuming t5_encoder_last_hidden_state's seq_len is the target.
+         common_seq_len = t5_encoder_last_hidden_state.size(1)
+
+         # Pad or truncate hybrid_encoder_output to match common_seq_len
+         if hybrid_encoder_output.size(1) < common_seq_len:
+             padding_size = common_seq_len - hybrid_encoder_output.size(1)
+             hybrid_encoder_output = F.pad(hybrid_encoder_output, (0, 0, 0, padding_size), "constant", 0)
+         elif hybrid_encoder_output.size(1) > common_seq_len:
+             hybrid_encoder_output = hybrid_encoder_output[:, :common_seq_len, :]
+
+         # Concatenate along the feature dimension
+         concat_encoder_output = torch.cat((t5_encoder_last_hidden_state, hybrid_encoder_output), dim=2)
+         projected_encoder_output = self.encoder_projection(concat_encoder_output)  # (batch, common_seq_len, d_model)
+
+         # Create BaseModelOutputWithPastAndCrossAttentions for T5 decoder
+         # The attention_mask here should correspond to the projected_encoder_output.
+         # If t5_encoder_last_hidden_state's seq_len was used, its attention_mask is appropriate.
+         encoder_outputs_for_decoder = BaseModelOutputWithPastAndCrossAttentions(
+             last_hidden_state=projected_encoder_output,
+             # past_key_values=None, # T5 internal
+             # hidden_states=None, # T5 internal
+             # attentions=None # T5 internal
+         )
+
+         # T5 Decoder
+         # The `attention_mask` passed to the T5 model here is for the *decoder's* cross-attention
+         # to the `encoder_outputs_for_decoder`. So it should match the sequence length of `projected_encoder_output`.
+         decoder_outputs = self.t5(
+             encoder_outputs=encoder_outputs_for_decoder,  # Pass the combined & projected outputs
+             attention_mask=attention_mask,  # This is the original T5 input attention mask, matching its seq_len
+             labels=labels,
+             return_dict=True,
+             use_cache=False  # Important for training, can be True for faster inference if handled
+         )
+         return decoder_outputs
+
+     def generate(self, input_ids, attention_mask, char_input, word_input, sequence_lengths, max_length, num_beams):
+         # Similar to forward pass for encoder part
+         t5_encoder_outputs_dict = self.t5.encoder(input_ids=input_ids, attention_mask=attention_mask, return_dict=True)
+         t5_encoder_last_hidden_state = t5_encoder_outputs_dict.last_hidden_state
+
+         sequence_lengths = sequence_lengths.to(char_input.device)
+         hybrid_encoder_output = self.hybrid_encoder(char_input, word_input, sequence_lengths)
+
+         common_seq_len = t5_encoder_last_hidden_state.size(1)
+
+         if hybrid_encoder_output.size(1) < common_seq_len:
+             padding_size = common_seq_len - hybrid_encoder_output.size(1)
+             hybrid_encoder_output = F.pad(hybrid_encoder_output, (0, 0, 0, padding_size), "constant", 0)
+         elif hybrid_encoder_output.size(1) > common_seq_len:
+             hybrid_encoder_output = hybrid_encoder_output[:, :common_seq_len, :]
+
+         concat_encoder_output = torch.cat((t5_encoder_last_hidden_state, hybrid_encoder_output), dim=2)
+         projected_encoder_output = self.encoder_projection(concat_encoder_output)
+
+         encoder_outputs_for_generate = BaseModelOutputWithPastAndCrossAttentions(
+             last_hidden_state=projected_encoder_output
+         )
+
+         # Use T5's generate method
+         generated_ids_dict = self.t5.generate(
+             encoder_outputs=encoder_outputs_for_generate,
+             attention_mask=attention_mask,  # Original T5 input attention mask
+             max_length=max_length,
+             num_beams=num_beams,
+             early_stopping=True,
+             use_cache=True,  # Can be True for generation
+             return_dict_in_generate=True,  # Ensures output is a dict-like object
+             # eos_token_id=self.t5_tokenizer.eos_token_id, # Good practice
+             # pad_token_id=self.t5_tokenizer.pad_token_id # Good practice
+         )
+         return generated_ids_dict.sequences  # .sequences attribute contains the generated token ids
+
+ def load_checkpoint(path_to_checkpoint_file):
+     # path_to_checkpoint_file is now the full path to best_model.pth
+     if not os.path.exists(path_to_checkpoint_file):
+         print("No checkpoint file found at:", path_to_checkpoint_file)
+         # sys.exit(1) # Avoid exiting in a web app, raise an error or handle
+         raise FileNotFoundError(f"No checkpoint file found at: {path_to_checkpoint_file}")
+
+     print(f"Loading checkpoint from: {path_to_checkpoint_file}")
+     checkpoint = torch.load(path_to_checkpoint_file, map_location=CFG.device)
+
+     # checkpoint_dir is the directory containing best_model.pth, which is MODEL_FILES_DIR
+     checkpoint_dir = os.path.dirname(path_to_checkpoint_file)
+
+     # Path to the 'tokenizers' subdirectory within checkpoint_dir
+     tokenizer_base_save_path = os.path.join(checkpoint_dir, 'tokenizers')
+
+     t5_tokenizer_dir_path = os.path.join(tokenizer_base_save_path, 't5_tokenizer')
+     encoder_tokenizer_dir_path = os.path.join(tokenizer_base_save_path, 'encoders_tokenizer')
+
+     if not os.path.isdir(t5_tokenizer_dir_path):
+         raise FileNotFoundError(
+             f"T5 tokenizer directory not found: {t5_tokenizer_dir_path}. "
+             "Ensure tokenizers were saved correctly (e.g., using tokenizer.save_pretrained())."
+         )
+     if not os.path.isdir(encoder_tokenizer_dir_path):
+         raise FileNotFoundError(
+             f"Encoder tokenizer directory not found: {encoder_tokenizer_dir_path}. "
+             "Ensure tokenizers were saved correctly."
+         )
+     print(f"Loading T5 tokenizer from: {t5_tokenizer_dir_path}")
+     CFG.t5_tokenizer = T5Tokenizer.from_pretrained(
+         t5_tokenizer_dir_path,
+         legacy=False,
+         model_max_length=CFG.max_len,
+         local_files_only=True
+     )
+     if CFG.t5_tokenizer.pad_token is None: CFG.t5_tokenizer.pad_token = CFG.t5_tokenizer.eos_token
+     if CFG.t5_tokenizer.bos_token is None: CFG.t5_tokenizer.bos_token = CFG.t5_tokenizer.eos_token
+
+     print(f"Loading encoder tokenizer from: {encoder_tokenizer_dir_path}")
+     CFG.encoder_tokenizer = AutoTokenizer.from_pretrained(
+         encoder_tokenizer_dir_path,
+         model_max_length=CFG.max_len,
+         local_files_only=True
+     )
+     if CFG.encoder_tokenizer.pad_token is None:
+         print(f"Warning: Loaded encoder tokenizer from {encoder_tokenizer_dir_path} has no pad_token defined in its config.")
+         # If it was added during training and saved, it should be there.
+         # If it's missing, and your WordLSTMEncoder relies on a specific pad_token_id,
+         # you might need to manually set it here if the saved config doesn't have it.
+         # e.g., CFG.encoder_tokenizer.pad_token = '[PAD]'
+         # CFG.encoder_tokenizer.pad_token_id = CFG.encoder_tokenizer.convert_tokens_to_ids('[PAD]')
+         # However, if add_special_tokens({'pad_token': '[PAD]'}) was called before saving,
+         # it should be part of the saved tokenizer's configuration.
+
+     loaded_config_from_checkpoint = checkpoint['config']  # Renamed to avoid conflict
+     loaded_char_to_id = checkpoint['char_to_id']
+     loaded_id_to_char = checkpoint['id_to_char']
+     model_architecture = checkpoint['model_architecture']
+
+     # Update CFG with specifics from the loaded config if they exist
+     for key, value in loaded_config_from_checkpoint.items():
+         setattr(CFG, key, value)
+
+     # CRITICAL: Re-assign CFG.device after loading config, in case it was saved in checkpoint
+     # Or better, ensure device is not part of saved 'config' if you want to control it externally.
+     # For HF Spaces, we want CPU.
+     CFG.device = device  # Ensure our desired device (CPU) is set
+
+     loaded_max_char_len = model_architecture.get('max_char_len', 50)  # Default if not in checkpoint
+
+     # Re-initialize model components with parameters from the checkpoint
+     char_cnn_encoder = CharCNNEncoder(
+         char_vocab_size=model_architecture['char_vocab_size'],
+         char_embedding_dim=model_architecture['char_embedding_dim'],
+         char_cnn_output_dim=model_architecture['char_cnn_output_dim'],
+         kernel_sizes=model_architecture['kernel_sizes'],
+         num_filters=model_architecture['num_filters'],
+         dropout=model_architecture.get('dropout', 0.1)  # Use .get for robustness
+     )
+
+     # Ensure word_vocab_size matches the re-loaded encoder_tokenizer's vocab size
+     # The one in model_architecture was from training time.
+     current_encoder_vocab_size = len(CFG.encoder_tokenizer)
+     if model_architecture['word_vocab_size'] != current_encoder_vocab_size:
+         print(f"Warning: Word vocab size mismatch. Checkpoint: {model_architecture['word_vocab_size']}, "
+               f"Loaded CFG.encoder_tokenizer: {current_encoder_vocab_size}. Using loaded tokenizer's size.")
+
+     word_lstm_encoder = WordLSTMEncoder(
+         word_vocab_size=current_encoder_vocab_size,  # Use current vocab size
+         word_embedding_dim=model_architecture['word_embedding_dim'],
+         word_lstm_hidden_dim=model_architecture['word_lstm_hidden_dim'],
+         num_lstm_layers=model_architecture['num_lstm_layers'],
+         dropout=model_architecture.get('dropout', 0.1)
+     )
+     hybrid_encoder = HybridEncoder(
+         char_cnn_encoder,
+         word_lstm_encoder,
+         hybrid_encoder_output_dim=model_architecture['hybrid_encoder_output_dim']
+     )
+
+     # Use the model_name from the *checkpoint's config* for T5 base
+     # This ensures consistency with the trained model's base.
+     model_base_name_for_t5 = loaded_config_from_checkpoint.get('model_name', CFG.model_name)
+     print(f"Initializing DualEncoderDecoder with T5 base: {model_base_name_for_t5}")
+
+     model = DualEncoderDecoder(
+         t5_model_name=model_base_name_for_t5,
+         hybrid_encoder=hybrid_encoder,
+         t5_tokenizer=CFG.t5_tokenizer  # Pass the loaded T5 tokenizer
+     )
+
+     # Resize T5 token embeddings based on the *loaded* CFG.t5_tokenizer
+     # This is important if CFG.t5_tokenizer (loaded from local files) has a different vocab size
+     # than the one from T5ForConditionalGeneration.from_pretrained(model_base_name_for_t5)
+     model.t5.resize_token_embeddings(len(CFG.t5_tokenizer))
+
+     print("Loading model state_dict...")
+     # Use strict=False if you have intentional mismatches, e.g., if encoder_tokenizer vocab changed
+     # and WordLSTM embedding size changed. Otherwise, strict=True is safer.
+     # Given the warnings and adjustments for vocab sizes, strict=False might be necessary
+     # if the embedding layer for WordLSTMEncoder was reinitialized with a different size.
+     model.load_state_dict(checkpoint['model_state_dict'], strict=False)
+     model.to(CFG.device)
+     model.eval()
+     print("Model loaded successfully.")
+     return model, loaded_char_to_id, loaded_id_to_char, loaded_max_char_len
+
+
+ # -------------
+ # Helper methods (tokenize_characters, pad_sequence, process_input)
+ # Ensure these are present and correct as per your original file
+ # -------------
+ def tokenize_characters(word, char_to_id):  # Ensure char_to_id has <UNK> and <PAD>
+     # Default to <UNK> if char_to_id is not available or char not found
+     unk_token_id = char_to_id.get("<UNK>", 0)  # Assuming 0 could be a fallback for <UNK> if not explicitly defined
+     return [char_to_id.get(char, unk_token_id) for char in word]
+
+ def pad_sequence(sequence, max_length, pad_value):  # Ensure pad_value is valid
+     if len(sequence) > max_length:
+         sequence = sequence[:max_length]
+     return sequence + [pad_value] * (max_length - len(sequence))
+
474
+ def process_input(text, t5_tokenizer, encoder_tokenizer, char_to_id, current_max_char_len, max_token_len):
+     # Use current_max_char_len passed from loaded checkpoint
+     # Use max_token_len for t5_tokenizer and encoder_tokenizer max_length
+
+     t5_inputs = t5_tokenizer(
+         text,
+         return_tensors='pt',
+         padding='max_length',
+         truncation=True,
+         max_length=max_token_len,  # Use max_token_len
+         add_special_tokens=True
+     )
+     encoder_inputs = encoder_tokenizer(
+         text,
+         return_tensors='pt',
+         padding='max_length',
+         truncation=True,
+         max_length=max_token_len  # Use max_token_len
+     )
+
+     # Squeeze to remove batch dim if batch_size is 1, then unsqueeze later if model expects batch dim
+     # Assuming single instance processing here.
+     t5_input_ids_squeezed = t5_inputs['input_ids'].squeeze(0)  # (seq_len)
+     t5_attention_mask_squeezed = t5_inputs['attention_mask'].squeeze(0)  # (seq_len)
+     encoder_input_ids_squeezed = encoder_inputs['input_ids'].squeeze(0)  # (seq_len)
+     # encoder_attention_mask_squeezed = encoder_inputs['attention_mask'].squeeze(0)  # (seq_len)
+
+     # Max sequence length after tokenization for this specific input
+     actual_max_seq_len = encoder_input_ids_squeezed.shape[0]
+
+     char_input_tensor = torch.zeros((actual_max_seq_len, current_max_char_len), dtype=torch.long)
+
+     # Get pad_token_id for characters, ensure <PAD> is in char_to_id
+     char_pad_id = char_to_id.get("<PAD>", 0)  # Default to 0 if <PAD> not in char_to_id
+
+     for j in range(actual_max_seq_len):
+         token_id = encoder_input_ids_squeezed[j].item()
+         # Avoid decoding special tokens like [PAD], [CLS], [SEP] into words for char tokenization
+         # if token_id in encoder_tokenizer.all_special_ids:
+         if token_id == encoder_tokenizer.pad_token_id or \
+            token_id == encoder_tokenizer.cls_token_id or \
+            token_id == encoder_tokenizer.sep_token_id or \
+            token_id == encoder_tokenizer.eos_token_id or \
+            token_id == encoder_tokenizer.bos_token_id:
+             word = ""  # Treat special tokens as empty for char processing or handle as needed
+         else:
+             word = encoder_tokenizer.decode([token_id], skip_special_tokens=True).strip()
+
+         if not word:  # Empty word or special token
+             char_ids = [char_pad_id] * current_max_char_len  # Pad with <PAD> char id
+         else:
+             char_ids = tokenize_characters(word, char_to_id)
+             char_ids = pad_sequence(char_ids, current_max_char_len, char_pad_id)
+
+         char_input_tensor[j, :] = torch.tensor(char_ids, dtype=torch.long)
+
+     # sequence_lengths should be the sum of attention_mask for the encoder_input_ids
+     # This is for the word-level sequence length used by WordLSTMEncoder
+     # Squeeze if it's shape (1, 1) from sum, to get a scalar tensor if batch size is 1
+     sequence_lengths_tensor = encoder_inputs['attention_mask'].sum(dim=1).long().squeeze()
+     if sequence_lengths_tensor.ndim == 0:  # If it became a 0-dim tensor (scalar)
+         sequence_lengths_tensor = sequence_lengths_tensor.unsqueeze(0)  # Make it (1,) for consistency if batching
+
+     return {
+         # Unsqueeze(0) to add batch dimension back for the model
+         't5_input_ids': t5_input_ids_squeezed.unsqueeze(0).to(CFG.device),
+         't5_attention_mask': t5_attention_mask_squeezed.unsqueeze(0).to(CFG.device),
+         'encoder_input_ids': encoder_input_ids_squeezed.unsqueeze(0).to(CFG.device),
+         # 'encoder_attention_mask' is not directly used by your model.generate, t5_attention_mask is used
+         'char_input': char_input_tensor.unsqueeze(0).to(CFG.device),
+         'sequence_lengths': sequence_lengths_tensor.to(CFG.device)  # Should be (batch_size,)
+     }
+
+
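`process_input` builds one row of character ids per encoder token, gives special tokens an all-pad row, and takes the word-level sequence length from the sum of the attention mask. The sketch below reproduces that shape logic with plain Python lists so it runs without `torch` or a tokenizer. `chars_to_ids` is a hypothetical stand-in for `tokenize_characters`, whose body is not shown here, and the `<UNK>` fallback is an assumption.

```python
from typing import Dict, List

PAD, UNK = "<PAD>", "<UNK>"  # hypothetical vocabulary entries


def chars_to_ids(word: str, char_to_id: Dict[str, int]) -> List[int]:
    """Stand-in for tokenize_characters: unknown characters map to <UNK>."""
    unk_id = char_to_id[UNK]
    return [char_to_id.get(ch, unk_id) for ch in word]


def build_char_matrix(words: List[str], char_to_id: Dict[str, int],
                      max_char_len: int) -> List[List[int]]:
    """One row of character ids per token; empty strings become all-pad rows."""
    pad_id = char_to_id.get(PAD, 0)
    rows = []
    for word in words:
        ids = chars_to_ids(word, char_to_id)[:max_char_len]
        rows.append(ids + [pad_id] * (max_char_len - len(ids)))
    return rows


def sequence_length(attention_mask: List[int]) -> int:
    """Number of real (non-padding) tokens: the sum of the attention mask."""
    return sum(attention_mask)


if __name__ == "__main__":
    vocab = {PAD: 0, UNK: 1, "a": 2, "b": 3}
    # Special tokens are represented as "" so they produce all-pad rows.
    print(build_char_matrix(["", "ab", "abz", ""], vocab, 3))
    print(sequence_length([1, 1, 1, 1, 0, 0]))
```

Every row has the same width, which is what lets the real function write each one into a preallocated `(seq_len, max_char_len)` tensor.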
+ # ----------------------------
+ # FLASK SETUP
+ # ----------------------------
+ app = Flask(__name__)
+
+ # MODIFIED: Define checkpoint path relative to MODEL_FILES_DIR
+ checkpoint_file_path = os.path.join(MODEL_FILES_DIR, "best_model.pth")
+
+ # Load your trained model and dictionaries ONCE at startup
+ print("Initializing model...")
+ try:
+     # model, char_to_id, id_to_char, max_char_len_loaded = load_checkpoint(checkpoint_file_path)
+     # Re-assign to global/module-level variables if you need them outside this scope,
+     # or pass them around. For Flask app, making them global for handlers is common.
+     loaded_model, loaded_char_to_id, loaded_id_to_char, loaded_max_char_len = load_checkpoint(checkpoint_file_path)
+ except Exception as e:
+     print(f"FATAL: Could not load model on startup: {e}")
+     # In a real app, you might want to prevent Flask from starting or return errors
+     # sys.exit(1)  # Not ideal for a web server trying to start
+     loaded_model = None  # Indicate model loading failed
+
+ @app.route('/')
+ def index():
+     return render_template('index.html')
+
+ @app.route('/translate', methods=['POST'])
+ def translate_text():
+     if loaded_model is None:  # Check if model failed to load
+         return jsonify({"error": "Model is not available. Please check server logs."}), 500
+
+     # silent=True returns None for a missing or malformed JSON body instead of raising;
+     # the isinstance check also rejects valid JSON that is not an object (e.g. a list).
+     data = request.get_json(silent=True)
+     input_text = data.get('text', '') if isinstance(data, dict) else ''
+     if not input_text:
+         return jsonify({"error": "No text provided"}), 400
+
+     try:
+         # Process the input through your pipeline
+         # Use the globally loaded CFG.t5_tokenizer, CFG.encoder_tokenizer,
+         # loaded_char_to_id, and loaded_max_char_len
+         inputs = process_input(
+             input_text,
+             CFG.t5_tokenizer,
+             CFG.encoder_tokenizer,
+             loaded_char_to_id,
+             loaded_max_char_len,  # Use the max_char_len loaded from checkpoint
+             CFG.max_len  # Use CFG.max_len for token sequence length
+         )
+
+         # Generate translation
+         with torch.no_grad():
+             generated_ids = loaded_model.generate(  # Use the loaded_model
+                 inputs['t5_input_ids'],
+                 inputs['t5_attention_mask'],
+                 inputs['char_input'],
+                 inputs['encoder_input_ids'],  # This is word_input for HybridEncoder
+                 inputs['sequence_lengths'],
+                 max_length=CFG.max_len,  # Max generation length
+                 num_beams=4
+             )
+
+         translation = CFG.t5_tokenizer.decode(
+             generated_ids[0],
+             skip_special_tokens=True,
+             clean_up_tokenization_spaces=True
+         ).strip()
+
+         return jsonify({"translation": translation})
+     except Exception as e:
+         print(f"Error during translation: {e}")  # Log the error
+         # import traceback
+         # traceback.print_exc()  # For more detailed logs during debugging
+         return jsonify({"error": "An error occurred during translation."}), 500
+
+ if __name__ == '__main__':
+     # Port for Hugging Face Spaces is usually set via PORT environment variable
+     port = int(os.environ.get("PORT", 7860))
+     # For local testing, debug=True is fine. For HF Spaces, it will be run by their infrastructure.
+     # Setting debug=False for production-like environments.
+     # The host='0.0.0.0' makes it accessible externally (needed for Docker/HF Spaces).
+     app.run(host='0.0.0.0', port=port, debug=False)
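The `/translate` route accepts a POST with a JSON body containing a `text` field and answers with either `{"translation": ...}` or `{"error": ...}`. A minimal client-side sketch using only the standard library is shown below. The helper name `build_translate_request` and the `http://localhost:7860` base URL are assumptions for a local run, not part of the commit.

```python
import json
import urllib.request


def build_translate_request(base_url: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /translate endpoint."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/translate",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_translate_request("http://localhost:7860", "example input")
    print(req.get_method(), req.full_url)
    # Sending needs a running server, so it is left commented out:
    # with urllib.request.urlopen(req) as resp:
    #     print(json.load(resp))
```

Setting the `Content-Type` header matters here because Flask's `request.get_json()` only parses bodies declared as JSON unless told otherwise.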
model_files/tokenizers/encoders_tokenizer/special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
model_files/tokenizers/encoders_tokenizer/tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
model_files/tokenizers/encoders_tokenizer/tokenizer_config.json ADDED
@@ -0,0 +1,63 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "4": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": false,
+   "extra_special_tokens": {},
+   "full_tokenizer_file": null,
+   "mask_token": "[MASK]",
+   "max_length": 512,
+   "model_max_length": 512,
+   "never_split": null,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "stride": 0,
+   "strip_accents": null,
+   "tokenize_chinese_chars": false,
+   "tokenizer_class": "ElectraTokenizer",
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first",
+   "unk_token": "[UNK]"
+ }
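The `added_tokens_decoder` section of this config is what fixes the numeric ids of the encoder's special tokens, which is what `process_input` compares against when it skips `[PAD]`, `[CLS]`, and `[SEP]`. A small sketch of reading that mapping with the standard `json` module follows. The excerpt is abbreviated to the two fields the code needs, and `special_token_ids` is a hypothetical helper name.

```python
import json

# Abbreviated excerpt of the config above; only the fields used below are kept.
CONFIG_EXCERPT = """
{
  "added_tokens_decoder": {
    "0": {"content": "[PAD]", "special": true},
    "1": {"content": "[UNK]", "special": true},
    "2": {"content": "[CLS]", "special": true},
    "3": {"content": "[SEP]", "special": true},
    "4": {"content": "[MASK]", "special": true}
  }
}
"""


def special_token_ids(config_text: str) -> dict:
    """Map each special token string to its integer id."""
    config = json.loads(config_text)
    return {
        entry["content"]: int(token_id)
        for token_id, entry in config["added_tokens_decoder"].items()
        if entry.get("special")
    }


if __name__ == "__main__":
    print(special_token_ids(CONFIG_EXCERPT))
```

JSON object keys are always strings, so the ids are converted with `int` before being compared with tensor values.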
model_files/tokenizers/encoders_tokenizer/vocab.txt ADDED
The diff for this file is too large to render. See raw diff
 
model_files/tokenizers/t5_tokenizer/added_tokens.json ADDED
@@ -0,0 +1,130 @@
+ {
+   "<extra_id_0>": 32099,
+   "<extra_id_10>": 32089,
+   "<extra_id_11>": 32088,
+   "<extra_id_12>": 32087,
+   "<extra_id_13>": 32086,
+   "<extra_id_14>": 32085,
+   "<extra_id_15>": 32084,
+   "<extra_id_16>": 32083,
+   "<extra_id_17>": 32082,
+   "<extra_id_18>": 32081,
+   "<extra_id_19>": 32080,
+   "<extra_id_1>": 32098,
+   "<extra_id_20>": 32079,
+   "<extra_id_21>": 32078,
+   "<extra_id_22>": 32077,
+   "<extra_id_23>": 32076,
+   "<extra_id_24>": 32075,
+   "<extra_id_25>": 32074,
+   "<extra_id_26>": 32073,
+   "<extra_id_27>": 32072,
+   "<extra_id_28>": 32071,
+   "<extra_id_29>": 32070,
+   "<extra_id_2>": 32097,
+   "<extra_id_30>": 32069,
+   "<extra_id_31>": 32068,
+   "<extra_id_32>": 32067,
+   "<extra_id_33>": 32066,
+   "<extra_id_34>": 32065,
+   "<extra_id_35>": 32064,
+   "<extra_id_36>": 32063,
+   "<extra_id_37>": 32062,
+   "<extra_id_38>": 32061,
+   "<extra_id_39>": 32060,
+   "<extra_id_3>": 32096,
+   "<extra_id_40>": 32059,
+   "<extra_id_41>": 32058,
+   "<extra_id_42>": 32057,
+   "<extra_id_43>": 32056,
+   "<extra_id_44>": 32055,
+   "<extra_id_45>": 32054,
+   "<extra_id_46>": 32053,
+   "<extra_id_47>": 32052,
+   "<extra_id_48>": 32051,
+   "<extra_id_49>": 32050,
+   "<extra_id_4>": 32095,
+   "<extra_id_50>": 32049,
+   "<extra_id_51>": 32048,
+   "<extra_id_52>": 32047,
+   "<extra_id_53>": 32046,
+   "<extra_id_54>": 32045,
+   "<extra_id_55>": 32044,
+   "<extra_id_56>": 32043,
+   "<extra_id_57>": 32042,
+   "<extra_id_58>": 32041,
+   "<extra_id_59>": 32040,
+   "<extra_id_5>": 32094,
+   "<extra_id_60>": 32039,
+   "<extra_id_61>": 32038,
+   "<extra_id_62>": 32037,
+   "<extra_id_63>": 32036,
+   "<extra_id_64>": 32035,
+   "<extra_id_65>": 32034,
+   "<extra_id_66>": 32033,
+   "<extra_id_67>": 32032,
+   "<extra_id_68>": 32031,
+   "<extra_id_69>": 32030,
+   "<extra_id_6>": 32093,
+   "<extra_id_70>": 32029,
+   "<extra_id_71>": 32028,
+   "<extra_id_72>": 32027,
+   "<extra_id_73>": 32026,
+   "<extra_id_74>": 32025,
+   "<extra_id_75>": 32024,
+   "<extra_id_76>": 32023,
+   "<extra_id_77>": 32022,
+   "<extra_id_78>": 32021,
+   "<extra_id_79>": 32020,
+   "<extra_id_7>": 32092,
+   "<extra_id_80>": 32019,
+   "<extra_id_81>": 32018,
+   "<extra_id_82>": 32017,
+   "<extra_id_83>": 32016,
+   "<extra_id_84>": 32015,
+   "<extra_id_85>": 32014,
+   "<extra_id_86>": 32013,
+   "<extra_id_87>": 32012,
+   "<extra_id_88>": 32011,
+   "<extra_id_89>": 32010,
+   "<extra_id_8>": 32091,
+   "<extra_id_90>": 32009,
+   "<extra_id_91>": 32008,
+   "<extra_id_92>": 32007,
+   "<extra_id_93>": 32006,
+   "<extra_id_94>": 32005,
+   "<extra_id_95>": 32004,
+   "<extra_id_96>": 32003,
+   "<extra_id_97>": 32002,
+   "<extra_id_98>": 32001,
+   "<extra_id_99>": 32000,
+   "<extra_id_9>": 32090,
+   "<extra_token_0>": 32101,
+   "<extra_token_10>": 32111,
+   "<extra_token_11>": 32112,
+   "<extra_token_12>": 32113,
+   "<extra_token_13>": 32114,
+   "<extra_token_14>": 32115,
+   "<extra_token_15>": 32116,
+   "<extra_token_16>": 32117,
+   "<extra_token_17>": 32118,
+   "<extra_token_18>": 32119,
+   "<extra_token_19>": 32120,
+   "<extra_token_1>": 32102,
+   "<extra_token_20>": 32121,
+   "<extra_token_21>": 32122,
+   "<extra_token_22>": 32123,
+   "<extra_token_23>": 32124,
+   "<extra_token_24>": 32125,
+   "<extra_token_25>": 32126,
+   "<extra_token_26>": 32127,
+   "<extra_token_2>": 32103,
+   "<extra_token_3>": 32104,
+   "<extra_token_4>": 32105,
+   "<extra_token_5>": 32106,
+   "<extra_token_6>": 32107,
+   "<extra_token_7>": 32108,
+   "<extra_token_8>": 32109,
+   "<extra_token_9>": 32110,
+   "<s>": 32100
+ }
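The ids in `added_tokens.json` follow a regular pattern: the `<extra_id_N>` sentinels count down from 32099, `<s>` sits at 32100, and the `<extra_token_N>` entries count up from 32101. The sketch below states that pattern as code; the function names are hypothetical and the constants are read off the listing above.

```python
BOS_ID = 32100  # "<s>" in the listing above


def extra_id(n: int) -> int:
    """Id of <extra_id_n> for n in 0..99; the ids count down from 32099."""
    if not 0 <= n <= 99:
        raise ValueError("n must be between 0 and 99")
    return 32099 - n


def extra_token(n: int) -> int:
    """Id of <extra_token_n> for n in 0..26; the ids count up from 32101."""
    if not 0 <= n <= 26:
        raise ValueError("n must be between 0 and 26")
    return 32101 + n


if __name__ == "__main__":
    print(extra_id(0), extra_id(99), extra_token(0), extra_token(26))
```

Taken together the three groups fill the range 32000 to 32127 with no gaps or overlaps, which is a quick sanity check when the tokenizer and the model's embedding size need to agree.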
model_files/tokenizers/t5_tokenizer/special_tokens_map.json ADDED
@@ -0,0 +1,135 @@
+ {
+   "additional_special_tokens": [
+     "<s>",
+     "</s>",
+     "<pad>",
+     "<extra_id_0>",
+     "<extra_id_1>",
+     "<extra_id_2>",
+     "<extra_id_3>",
+     "<extra_id_4>",
+     "<extra_id_5>",
+     "<extra_id_6>",
+     "<extra_id_7>",
+     "<extra_id_8>",
+     "<extra_id_9>",
+     "<extra_id_10>",
+     "<extra_id_11>",
+     "<extra_id_12>",
+     "<extra_id_13>",
+     "<extra_id_14>",
+     "<extra_id_15>",
+     "<extra_id_16>",
+     "<extra_id_17>",
+     "<extra_id_18>",
+     "<extra_id_19>",
+     "<extra_id_20>",
+     "<extra_id_21>",
+     "<extra_id_22>",
+     "<extra_id_23>",
+     "<extra_id_24>",
+     "<extra_id_25>",
+     "<extra_id_26>",
+     "<extra_id_27>",
+     "<extra_id_28>",
+     "<extra_id_29>",
+     "<extra_id_30>",
+     "<extra_id_31>",
+     "<extra_id_32>",
+     "<extra_id_33>",
+     "<extra_id_34>",
+     "<extra_id_35>",
+     "<extra_id_36>",
+     "<extra_id_37>",
+     "<extra_id_38>",
+     "<extra_id_39>",
+     "<extra_id_40>",
+     "<extra_id_41>",
+     "<extra_id_42>",
+     "<extra_id_43>",
+     "<extra_id_44>",
+     "<extra_id_45>",
+     "<extra_id_46>",
+     "<extra_id_47>",
+     "<extra_id_48>",
+     "<extra_id_49>",
+     "<extra_id_50>",
+     "<extra_id_51>",
+     "<extra_id_52>",
+     "<extra_id_53>",
+     "<extra_id_54>",
+     "<extra_id_55>",
+     "<extra_id_56>",
+     "<extra_id_57>",
+     "<extra_id_58>",
+     "<extra_id_59>",
+     "<extra_id_60>",
+     "<extra_id_61>",
+     "<extra_id_62>",
+     "<extra_id_63>",
+     "<extra_id_64>",
+     "<extra_id_65>",
+     "<extra_id_66>",
+     "<extra_id_67>",
+     "<extra_id_68>",
+     "<extra_id_69>",
+     "<extra_id_70>",
+     "<extra_id_71>",
+     "<extra_id_72>",
+     "<extra_id_73>",
+     "<extra_id_74>",
+     "<extra_id_75>",
+     "<extra_id_76>",
+     "<extra_id_77>",
+     "<extra_id_78>",
+     "<extra_id_79>",
+     "<extra_id_80>",
+     "<extra_id_81>",
+     "<extra_id_82>",
+     "<extra_id_83>",
+     "<extra_id_84>",
+     "<extra_id_85>",
+     "<extra_id_86>",
+     "<extra_id_87>",
+     "<extra_id_88>",
+     "<extra_id_89>",
+     "<extra_id_90>",
+     "<extra_id_91>",
+     "<extra_id_92>",
+     "<extra_id_93>",
+     "<extra_id_94>",
+     "<extra_id_95>",
+     "<extra_id_96>",
+     "<extra_id_97>",
+     "<extra_id_98>",
+     "<extra_id_99>"
+   ],
+   "bos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<pad>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
model_files/tokenizers/t5_tokenizer/spiece.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7dcab96935a2a51b1461c84e44c952ea8a3640c8bc3e2c6ae7a21d855454ae27
+ size 1111492
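The three lines committed for `spiece.model` are a Git LFS pointer, not the SentencePiece model itself: each line is a key followed by a space and a value, and the real binary is fetched by its hash. A minimal parsing sketch is shown below; `parse_lfs_pointer` is a hypothetical helper, not part of Git LFS or this repository.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into its version, hash, and size."""
    fields = {}
    for line in text.strip().splitlines():
        # Each line is "<key> <value>"; split on the first space only.
        key, _, value = line.partition(" ")
        fields[key] = value
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "oid_algorithm": algorithm,
        "oid": digest,
        "size": int(fields["size"]),  # size of the real file in bytes
    }


if __name__ == "__main__":
    pointer = (
        "version https://git-lfs.github.com/spec/v1\n"
        "oid sha256:7dcab96935a2a51b1461c84e44c952ea8a3640c8bc3e2c6ae7a21d855454ae27\n"
        "size 1111492\n"
    )
    print(parse_lfs_pointer(pointer))
```

A check like this can help catch the case where a deployment copied the pointer file instead of the roughly 1.1 MB model, which would make tokenizer loading fail.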
model_files/tokenizers/t5_tokenizer/tokenizer_config.json ADDED
@@ -0,0 +1,1169 @@
1
+ {
2
+ "add_prefix_space": true,
3
+ "added_tokens_decoder": {
4
+ "0": {
5
+ "content": "<pad>",
6
+ "lstrip": false,
7
+ "normalized": false,
8
+ "rstrip": false,
9
+ "single_word": false,
10
+ "special": true
11
+ },
12
+ "1": {
13
+ "content": "</s>",
14
+ "lstrip": false,
15
+ "normalized": false,
16
+ "rstrip": false,
17
+ "single_word": false,
18
+ "special": true
19
+ },
20
+ "2": {
21
+ "content": "<unk>",
22
+ "lstrip": false,
23
+ "normalized": false,
24
+ "rstrip": false,
25
+ "single_word": false,
26
+ "special": true
27
+ },
28
+ "32000": {
29
+ "content": "<extra_id_99>",
30
+ "lstrip": true,
31
+ "normalized": false,
32
+ "rstrip": true,
33
+ "single_word": false,
34
+ "special": true
35
+ },
36
+ "32001": {
37
+ "content": "<extra_id_98>",
38
+ "lstrip": true,
39
+ "normalized": false,
40
+ "rstrip": true,
41
+ "single_word": false,
42
+ "special": true
43
+ },
44
+ "32002": {
45
+ "content": "<extra_id_97>",
46
+ "lstrip": true,
47
+ "normalized": false,
48
+ "rstrip": true,
49
+ "single_word": false,
50
+ "special": true
51
+ },
52
+ "32003": {
53
+ "content": "<extra_id_96>",
54
+ "lstrip": true,
55
+ "normalized": false,
56
+ "rstrip": true,
57
+ "single_word": false,
58
+ "special": true
59
+ },
60
+ "32004": {
61
+ "content": "<extra_id_95>",
62
+ "lstrip": true,
63
+ "normalized": false,
64
+ "rstrip": true,
65
+ "single_word": false,
66
+ "special": true
67
+ },
68
+ "32005": {
69
+ "content": "<extra_id_94>",
70
+ "lstrip": true,
71
+ "normalized": false,
72
+ "rstrip": true,
73
+ "single_word": false,
74
+ "special": true
75
+ },
76
+ "32006": {
77
+ "content": "<extra_id_93>",
78
+ "lstrip": true,
79
+ "normalized": false,
80
+ "rstrip": true,
81
+ "single_word": false,
82
+ "special": true
83
+ },
84
+ "32007": {
85
+ "content": "<extra_id_92>",
86
+ "lstrip": true,
87
+ "normalized": false,
88
+ "rstrip": true,
89
+ "single_word": false,
90
+ "special": true
91
+ },
92
+ "32008": {
93
+ "content": "<extra_id_91>",
94
+ "lstrip": true,
95
+ "normalized": false,
96
+ "rstrip": true,
97
+ "single_word": false,
98
+ "special": true
99
+ },
100
+ "32009": {
101
+ "content": "<extra_id_90>",
102
+ "lstrip": true,
103
+ "normalized": false,
104
+ "rstrip": true,
105
+ "single_word": false,
106
+ "special": true
107
+ },
108
+ "32010": {
109
+ "content": "<extra_id_89>",
110
+ "lstrip": true,
111
+ "normalized": false,
112
+ "rstrip": true,
113
+ "single_word": false,
114
+ "special": true
115
+ },
116
+ "32011": {
117
+ "content": "<extra_id_88>",
118
+ "lstrip": true,
119
+ "normalized": false,
120
+ "rstrip": true,
121
+ "single_word": false,
122
+ "special": true
123
+ },
124
+ "32012": {
125
+ "content": "<extra_id_87>",
126
+ "lstrip": true,
127
+ "normalized": false,
128
+ "rstrip": true,
129
+ "single_word": false,
130
+ "special": true
131
+ },
132
+ "32013": {
133
+ "content": "<extra_id_86>",
134
+ "lstrip": true,
135
+ "normalized": false,
136
+ "rstrip": true,
137
+ "single_word": false,
138
+ "special": true
139
+ },
140
+ "32014": {
141
+ "content": "<extra_id_85>",
142
+ "lstrip": true,
143
+ "normalized": false,
144
+ "rstrip": true,
145
+ "single_word": false,
146
+ "special": true
147
+ },
148
+ "32015": {
149
+ "content": "<extra_id_84>",
150
+ "lstrip": true,
151
+ "normalized": false,
152
+ "rstrip": true,
153
+ "single_word": false,
154
+ "special": true
155
+ },
156
+ "32016": {
157
+ "content": "<extra_id_83>",
158
+ "lstrip": true,
159
+ "normalized": false,
160
+ "rstrip": true,
161
+ "single_word": false,
162
+ "special": true
163
+ },
164
+ "32017": {
165
+ "content": "<extra_id_82>",
166
+ "lstrip": true,
167
+ "normalized": false,
168
+ "rstrip": true,
169
+ "single_word": false,
170
+ "special": true
171
+ },
172
+ "32018": {
173
+ "content": "<extra_id_81>",
174
+ "lstrip": true,
175
+ "normalized": false,
176
+ "rstrip": true,
177
+ "single_word": false,
178
+ "special": true
179
+ },
180
+ "32019": {
181
+ "content": "<extra_id_80>",
182
+ "lstrip": true,
183
+ "normalized": false,
184
+ "rstrip": true,
185
+ "single_word": false,
186
+ "special": true
187
+ },
188
+ "32020": {
189
+ "content": "<extra_id_79>",
190
+ "lstrip": true,
191
+ "normalized": false,
192
+ "rstrip": true,
193
+ "single_word": false,
194
+ "special": true
195
+ },
196
+ "32021": {
197
+ "content": "<extra_id_78>",
198
+ "lstrip": true,
199
+ "normalized": false,
200
+ "rstrip": true,
201
+ "single_word": false,
202
+ "special": true
203
+ },
204
+ "32022": {
205
+ "content": "<extra_id_77>",
206
+ "lstrip": true,
207
+ "normalized": false,
208
+ "rstrip": true,
209
+ "single_word": false,
210
+ "special": true
211
+ },
212
+ "32023": {
213
+ "content": "<extra_id_76>",
214
+ "lstrip": true,
215
+ "normalized": false,
216
+ "rstrip": true,
217
+ "single_word": false,
218
+ "special": true
219
+ },
220
+ "32024": {
221
+ "content": "<extra_id_75>",
222
+ "lstrip": true,
223
+ "normalized": false,
224
+ "rstrip": true,
225
+ "single_word": false,
226
+ "special": true
227
+ },
228
+ "32025": {
229
+ "content": "<extra_id_74>",
230
+ "lstrip": true,
231
+ "normalized": false,
232
+ "rstrip": true,
233
+ "single_word": false,
234
+ "special": true
235
+ },
236
+ "32026": {
237
+ "content": "<extra_id_73>",
238
+ "lstrip": true,
239
+ "normalized": false,
240
+ "rstrip": true,
241
+ "single_word": false,
242
+ "special": true
243
+ },
244
+ "32027": {
245
+ "content": "<extra_id_72>",
246
+ "lstrip": true,
247
+ "normalized": false,
248
+ "rstrip": true,
249
+ "single_word": false,
250
+ "special": true
251
+ },
252
+ "32028": {
253
+ "content": "<extra_id_71>",
254
+ "lstrip": true,
255
+ "normalized": false,
256
+ "rstrip": true,
257
+ "single_word": false,
258
+ "special": true
259
+ },
260
+ "32029": {
261
+ "content": "<extra_id_70>",
262
+ "lstrip": true,
263
+ "normalized": false,
264
+ "rstrip": true,
265
+ "single_word": false,
266
+ "special": true
267
+ },
268
+ "32030": {
269
+ "content": "<extra_id_69>",
270
+ "lstrip": true,
271
+ "normalized": false,
272
+ "rstrip": true,
273
+ "single_word": false,
274
+ "special": true
275
+ },
276
+ "32031": {
277
+ "content": "<extra_id_68>",
278
+ "lstrip": true,
279
+ "normalized": false,
280
+ "rstrip": true,
281
+ "single_word": false,
282
+ "special": true
283
+ },
284
+ "32032": {
285
+ "content": "<extra_id_67>",
286
+ "lstrip": true,
287
+ "normalized": false,
288
+ "rstrip": true,
289
+ "single_word": false,
290
+ "special": true
291
+ },
292
+ "32033": {
293
+ "content": "<extra_id_66>",
294
+ "lstrip": true,
295
+ "normalized": false,
296
+ "rstrip": true,
297
+ "single_word": false,
298
+ "special": true
299
+ },
300
+ "32034": {
301
+ "content": "<extra_id_65>",
302
+ "lstrip": true,
303
+ "normalized": false,
304
+ "rstrip": true,
305
+ "single_word": false,
306
+ "special": true
307
+ },
308
+ "32035": {
309
+ "content": "<extra_id_64>",
310
+ "lstrip": true,
311
+ "normalized": false,
312
+ "rstrip": true,
313
+ "single_word": false,
314
+ "special": true
315
+ },
316
+ "32036": {
317
+ "content": "<extra_id_63>",
318
+ "lstrip": true,
319
+ "normalized": false,
320
+ "rstrip": true,
321
+ "single_word": false,
322
+ "special": true
323
+ },
324
+ "32037": {
325
+ "content": "<extra_id_62>",
326
+ "lstrip": true,
327
+ "normalized": false,
328
+ "rstrip": true,
329
+ "single_word": false,
330
+ "special": true
331
+ },
332
+ "32038": {
333
+ "content": "<extra_id_61>",
334
+ "lstrip": true,
335
+ "normalized": false,
336
+ "rstrip": true,
337
+ "single_word": false,
338
+ "special": true
339
+ },
340
+ "32039": {
341
+ "content": "<extra_id_60>",
342
+ "lstrip": true,
343
+ "normalized": false,
344
+ "rstrip": true,
345
+ "single_word": false,
346
+ "special": true
347
+ },
348
+ "32040": {
349
+ "content": "<extra_id_59>",
350
+ "lstrip": true,
351
+ "normalized": false,
352
+ "rstrip": true,
353
+ "single_word": false,
354
+ "special": true
355
+ },
356
+ "32041": {
357
+ "content": "<extra_id_58>",
358
+ "lstrip": true,
359
+ "normalized": false,
360
+ "rstrip": true,
361
+ "single_word": false,
362
+ "special": true
363
+ },
364
+ "32042": {
365
+ "content": "<extra_id_57>",
366
+ "lstrip": true,
367
+ "normalized": false,
368
+ "rstrip": true,
369
+ "single_word": false,
370
+ "special": true
371
+ },
372
+ "32043": {
373
+ "content": "<extra_id_56>",
374
+ "lstrip": true,
375
+ "normalized": false,
376
+ "rstrip": true,
377
+ "single_word": false,
378
+ "special": true
379
+ },
380
+ "32044": {
381
+ "content": "<extra_id_55>",
382
+ "lstrip": true,
383
+ "normalized": false,
384
+ "rstrip": true,
385
+ "single_word": false,
386
+ "special": true
387
+ },
388
+ "32045": {
389
+ "content": "<extra_id_54>",
390
+ "lstrip": true,
391
+ "normalized": false,
392
+ "rstrip": true,
393
+ "single_word": false,
394
+ "special": true
395
+ },
396
+ "32046": {
397
+ "content": "<extra_id_53>",
398
+ "lstrip": true,
399
+ "normalized": false,
400
+ "rstrip": true,
401
+ "single_word": false,
402
+ "special": true
403
+ },
404
+ "32047": {
405
+ "content": "<extra_id_52>",
406
+ "lstrip": true,
407
+ "normalized": false,
408
+ "rstrip": true,
409
+ "single_word": false,
410
+ "special": true
411
+ },
412
+ "32048": {
413
+ "content": "<extra_id_51>",
414
+ "lstrip": true,
415
+ "normalized": false,
416
+ "rstrip": true,
417
+ "single_word": false,
418
+ "special": true
419
+ },
420
+ "32049": {
421
+ "content": "<extra_id_50>",
422
+ "lstrip": true,
423
+ "normalized": false,
424
+ "rstrip": true,
425
+ "single_word": false,
426
+ "special": true
427
+ },
428
+ "32050": {
429
+ "content": "<extra_id_49>",
430
+ "lstrip": true,
431
+ "normalized": false,
432
+ "rstrip": true,
433
+ "single_word": false,
434
+ "special": true
435
+ },
436
+ "32051": {
437
+ "content": "<extra_id_48>",
438
+ "lstrip": true,
439
+ "normalized": false,
440
+ "rstrip": true,
441
+ "single_word": false,
442
+ "special": true
443
+ },
444
+ "32052": {
445
+ "content": "<extra_id_47>",
446
+ "lstrip": true,
447
+ "normalized": false,
448
+ "rstrip": true,
449
+ "single_word": false,
450
+ "special": true
451
+ },
452
+ "32053": {
453
+ "content": "<extra_id_46>",
454
+ "lstrip": true,
455
+ "normalized": false,
456
+ "rstrip": true,
457
+ "single_word": false,
458
+ "special": true
459
+ },
460
+ "32054": {
461
+ "content": "<extra_id_45>",
462
+ "lstrip": true,
463
+ "normalized": false,
464
+ "rstrip": true,
465
+ "single_word": false,
466
+ "special": true
467
+ },
468
+ "32055": {
469
+ "content": "<extra_id_44>",
470
+ "lstrip": true,
471
+ "normalized": false,
472
+ "rstrip": true,
473
+ "single_word": false,
474
+ "special": true
475
+ },
476
+ "32056": {
477
+ "content": "<extra_id_43>",
478
+ "lstrip": true,
479
+ "normalized": false,
480
+ "rstrip": true,
481
+ "single_word": false,
482
+ "special": true
483
+ },
484
+ "32057": {
485
+ "content": "<extra_id_42>",
486
+ "lstrip": true,
487
+ "normalized": false,
488
+ "rstrip": true,
489
+ "single_word": false,
490
+ "special": true
491
+ },
492
+ "32058": {
493
+ "content": "<extra_id_41>",
494
+ "lstrip": true,
495
+ "normalized": false,
496
+ "rstrip": true,
497
+ "single_word": false,
498
+ "special": true
499
+ },
500
+ "32059": {
501
+ "content": "<extra_id_40>",
502
+ "lstrip": true,
503
+ "normalized": false,
504
+ "rstrip": true,
505
+ "single_word": false,
506
+ "special": true
507
+ },
508
+ "32060": {
509
+ "content": "<extra_id_39>",
510
+ "lstrip": true,
511
+ "normalized": false,
512
+ "rstrip": true,
513
+ "single_word": false,
514
+ "special": true
515
+ },
516
+ "32061": {
517
+ "content": "<extra_id_38>",
518
+ "lstrip": true,
519
+ "normalized": false,
520
+ "rstrip": true,
521
+ "single_word": false,
522
+ "special": true
523
+ },
524
+ "32062": {
525
+ "content": "<extra_id_37>",
526
+ "lstrip": true,
527
+ "normalized": false,
528
+ "rstrip": true,
529
+ "single_word": false,
530
+ "special": true
531
+ },
532
+ "32063": {
533
+ "content": "<extra_id_36>",
534
+ "lstrip": true,
535
+ "normalized": false,
536
+ "rstrip": true,
537
+ "single_word": false,
538
+ "special": true
539
+ },
540
+ "32064": {
541
+ "content": "<extra_id_35>",
542
+ "lstrip": true,
543
+ "normalized": false,
544
+ "rstrip": true,
545
+ "single_word": false,
546
+ "special": true
547
+ },
548
+ "32065": {
549
+ "content": "<extra_id_34>",
550
+ "lstrip": true,
551
+ "normalized": false,
552
+ "rstrip": true,
553
+ "single_word": false,
554
+ "special": true
555
+ },
556
+ "32066": {
557
+ "content": "<extra_id_33>",
558
+ "lstrip": true,
559
+ "normalized": false,
560
+ "rstrip": true,
561
+ "single_word": false,
562
+ "special": true
563
+ },
564
+ "32067": {
565
+ "content": "<extra_id_32>",
566
+ "lstrip": true,
567
+ "normalized": false,
568
+ "rstrip": true,
569
+ "single_word": false,
570
+ "special": true
571
+ },
572
+ "32068": {
573
+ "content": "<extra_id_31>",
574
+ "lstrip": true,
575
+ "normalized": false,
576
+ "rstrip": true,
577
+ "single_word": false,
578
+ "special": true
579
+ },
580
+ "32069": {
581
+ "content": "<extra_id_30>",
582
+ "lstrip": true,
583
+ "normalized": false,
584
+ "rstrip": true,
585
+ "single_word": false,
586
+ "special": true
587
+ },
588
+ "32070": {
589
+ "content": "<extra_id_29>",
590
+ "lstrip": true,
591
+ "normalized": false,
592
+ "rstrip": true,
593
+ "single_word": false,
594
+ "special": true
595
+ },
596
+ "32071": {
597
+ "content": "<extra_id_28>",
598
+ "lstrip": true,
599
+ "normalized": false,
600
+ "rstrip": true,
601
+ "single_word": false,
602
+ "special": true
603
+ },
604
+ "32072": {
605
+ "content": "<extra_id_27>",
606
+ "lstrip": true,
607
+ "normalized": false,
608
+ "rstrip": true,
609
+ "single_word": false,
610
+ "special": true
611
+ },
612
+ "32073": {
613
+ "content": "<extra_id_26>",
614
+ "lstrip": true,
615
+ "normalized": false,
616
+ "rstrip": true,
617
+ "single_word": false,
618
+ "special": true
619
+ },
620
+ "32074": {
621
+ "content": "<extra_id_25>",
622
+ "lstrip": true,
623
+ "normalized": false,
624
+ "rstrip": true,
625
+ "single_word": false,
626
+ "special": true
627
+ },
628
+ "32075": {
629
+ "content": "<extra_id_24>",
630
+ "lstrip": true,
631
+ "normalized": false,
632
+ "rstrip": true,
633
+ "single_word": false,
634
+ "special": true
635
+ },
636
+ "32076": {
637
+ "content": "<extra_id_23>",
638
+ "lstrip": true,
639
+ "normalized": false,
640
+ "rstrip": true,
641
+ "single_word": false,
642
+ "special": true
643
+ },
644
+ "32077": {
645
+ "content": "<extra_id_22>",
646
+ "lstrip": true,
647
+ "normalized": false,
648
+ "rstrip": true,
649
+ "single_word": false,
650
+ "special": true
651
+ },
652
+ "32078": {
653
+ "content": "<extra_id_21>",
654
+ "lstrip": true,
655
+ "normalized": false,
656
+ "rstrip": true,
657
+ "single_word": false,
658
+ "special": true
659
+ },
660
+ "32079": {
661
+ "content": "<extra_id_20>",
662
+ "lstrip": true,
663
+ "normalized": false,
664
+ "rstrip": true,
665
+ "single_word": false,
666
+ "special": true
667
+ },
668
+ "32080": {
669
+ "content": "<extra_id_19>",
670
+ "lstrip": true,
671
+ "normalized": false,
672
+ "rstrip": true,
673
+ "single_word": false,
674
+ "special": true
675
+ },
676
+ "32081": {
677
+ "content": "<extra_id_18>",
678
+ "lstrip": true,
679
+ "normalized": false,
680
+ "rstrip": true,
681
+ "single_word": false,
682
+ "special": true
683
+ },
684
+ "32082": {
685
+ "content": "<extra_id_17>",
686
+ "lstrip": true,
687
+ "normalized": false,
688
+ "rstrip": true,
689
+ "single_word": false,
690
+ "special": true
691
+ },
692
+ "32083": {
693
+ "content": "<extra_id_16>",
694
+ "lstrip": true,
695
+ "normalized": false,
696
+ "rstrip": true,
697
+ "single_word": false,
698
+ "special": true
699
+ },
700
+ "32084": {
701
+ "content": "<extra_id_15>",
702
+ "lstrip": true,
703
+ "normalized": false,
704
+ "rstrip": true,
705
+ "single_word": false,
706
+ "special": true
707
+ },
708
+ "32085": {
709
+ "content": "<extra_id_14>",
710
+ "lstrip": true,
711
+ "normalized": false,
712
+ "rstrip": true,
713
+ "single_word": false,
714
+ "special": true
715
+ },
716
+ "32086": {
717
+ "content": "<extra_id_13>",
718
+ "lstrip": true,
719
+ "normalized": false,
720
+ "rstrip": true,
721
+ "single_word": false,
722
+ "special": true
723
+ },
724
+ "32087": {
725
+ "content": "<extra_id_12>",
726
+ "lstrip": true,
727
+ "normalized": false,
728
+ "rstrip": true,
729
+ "single_word": false,
730
+ "special": true
731
+ },
732
+ "32088": {
733
+ "content": "<extra_id_11>",
734
+ "lstrip": true,
735
+ "normalized": false,
736
+ "rstrip": true,
737
+ "single_word": false,
738
+ "special": true
739
+ },
740
+ "32089": {
741
+ "content": "<extra_id_10>",
742
+ "lstrip": true,
743
+ "normalized": false,
744
+ "rstrip": true,
745
+ "single_word": false,
746
+ "special": true
747
+ },
748
+ "32090": {
749
+ "content": "<extra_id_9>",
750
+ "lstrip": true,
751
+ "normalized": false,
752
+ "rstrip": true,
753
+ "single_word": false,
754
+ "special": true
755
+ },
756
+ "32091": {
757
+ "content": "<extra_id_8>",
758
+ "lstrip": true,
759
+ "normalized": false,
760
+ "rstrip": true,
761
+ "single_word": false,
762
+ "special": true
763
+ },
764
+ "32092": {
765
+ "content": "<extra_id_7>",
766
+ "lstrip": true,
767
+ "normalized": false,
768
+ "rstrip": true,
769
+ "single_word": false,
770
+ "special": true
771
+ },
772
+ "32093": {
773
+ "content": "<extra_id_6>",
774
+ "lstrip": true,
775
+ "normalized": false,
776
+ "rstrip": true,
777
+ "single_word": false,
778
+ "special": true
779
+ },
780
+ "32094": {
781
+ "content": "<extra_id_5>",
782
+ "lstrip": true,
783
+ "normalized": false,
784
+ "rstrip": true,
785
+ "single_word": false,
786
+ "special": true
787
+ },
788
+ "32095": {
789
+ "content": "<extra_id_4>",
790
+ "lstrip": true,
791
+ "normalized": false,
792
+ "rstrip": true,
793
+ "single_word": false,
794
+ "special": true
795
+ },
796
+ "32096": {
797
+ "content": "<extra_id_3>",
798
+ "lstrip": true,
799
+ "normalized": false,
800
+ "rstrip": true,
801
+ "single_word": false,
802
+ "special": true
803
+ },
804
+ "32097": {
805
+ "content": "<extra_id_2>",
806
+ "lstrip": true,
807
+ "normalized": false,
808
+ "rstrip": true,
809
+ "single_word": false,
810
+ "special": true
811
+ },
812
+ "32098": {
813
+ "content": "<extra_id_1>",
814
+ "lstrip": true,
815
+ "normalized": false,
816
+ "rstrip": true,
817
+ "single_word": false,
818
+ "special": true
819
+ },
820
+ "32099": {
821
+ "content": "<extra_id_0>",
822
+ "lstrip": true,
823
+ "normalized": false,
824
+ "rstrip": true,
825
+ "single_word": false,
826
+ "special": true
827
+ },
828
+ "32100": {
829
+ "content": "<s>",
830
+ "lstrip": false,
831
+ "normalized": false,
832
+ "rstrip": false,
833
+ "single_word": false,
834
+ "special": true
835
+ },
836
+ "32101": {
837
+ "content": "<extra_token_0>",
838
+ "lstrip": false,
839
+ "normalized": true,
840
+ "rstrip": false,
841
+ "single_word": false,
842
+ "special": false
843
+ },
844
+ "32102": {
845
+ "content": "<extra_token_1>",
846
+ "lstrip": false,
847
+ "normalized": true,
848
+ "rstrip": false,
849
+ "single_word": false,
850
+ "special": false
851
+ },
852
+ "32103": {
853
+ "content": "<extra_token_2>",
854
+ "lstrip": false,
855
+ "normalized": true,
856
+ "rstrip": false,
857
+ "single_word": false,
858
+ "special": false
859
+ },
860
+ "32104": {
861
+ "content": "<extra_token_3>",
862
+ "lstrip": false,
863
+ "normalized": true,
864
+ "rstrip": false,
865
+ "single_word": false,
866
+ "special": false
867
+ },
868
+ "32105": {
869
+ "content": "<extra_token_4>",
870
+ "lstrip": false,
871
+ "normalized": true,
872
+ "rstrip": false,
873
+ "single_word": false,
874
+ "special": false
875
+ },
876
+ "32106": {
877
+ "content": "<extra_token_5>",
878
+ "lstrip": false,
879
+ "normalized": true,
880
+ "rstrip": false,
881
+ "single_word": false,
882
+ "special": false
883
+ },
884
+ "32107": {
885
+ "content": "<extra_token_6>",
886
+ "lstrip": false,
887
+ "normalized": true,
888
+ "rstrip": false,
889
+ "single_word": false,
890
+ "special": false
891
+ },
892
+ "32108": {
893
+ "content": "<extra_token_7>",
894
+ "lstrip": false,
895
+ "normalized": true,
896
+ "rstrip": false,
897
+ "single_word": false,
898
+ "special": false
899
+ },
900
+ "32109": {
901
+ "content": "<extra_token_8>",
902
+ "lstrip": false,
903
+ "normalized": true,
904
+ "rstrip": false,
905
+ "single_word": false,
906
+ "special": false
907
+ },
908
+ "32110": {
909
+ "content": "<extra_token_9>",
910
+ "lstrip": false,
911
+ "normalized": true,
912
+ "rstrip": false,
913
+ "single_word": false,
914
+ "special": false
915
+ },
916
+ "32111": {
917
+ "content": "<extra_token_10>",
918
+ "lstrip": false,
919
+ "normalized": true,
920
+ "rstrip": false,
921
+ "single_word": false,
922
+ "special": false
923
+ },
924
+ "32112": {
925
+ "content": "<extra_token_11>",
926
+ "lstrip": false,
927
+ "normalized": true,
928
+ "rstrip": false,
929
+ "single_word": false,
930
+ "special": false
931
+ },
932
+ "32113": {
933
+ "content": "<extra_token_12>",
934
+ "lstrip": false,
935
+ "normalized": true,
936
+ "rstrip": false,
937
+ "single_word": false,
938
+ "special": false
939
+ },
940
+ "32114": {
941
+ "content": "<extra_token_13>",
942
+ "lstrip": false,
943
+ "normalized": true,
944
+ "rstrip": false,
945
+ "single_word": false,
946
+ "special": false
947
+ },
948
+ "32115": {
949
+ "content": "<extra_token_14>",
950
+ "lstrip": false,
951
+ "normalized": true,
952
+ "rstrip": false,
953
+ "single_word": false,
954
+ "special": false
955
+ },
956
+ "32116": {
957
+ "content": "<extra_token_15>",
958
+ "lstrip": false,
959
+ "normalized": true,
960
+ "rstrip": false,
961
+ "single_word": false,
962
+ "special": false
963
+ },
964
+ "32117": {
965
+ "content": "<extra_token_16>",
966
+ "lstrip": false,
967
+ "normalized": true,
968
+ "rstrip": false,
969
+ "single_word": false,
970
+ "special": false
971
+ },
972
+ "32118": {
973
+ "content": "<extra_token_17>",
974
+ "lstrip": false,
975
+ "normalized": true,
976
+ "rstrip": false,
977
+ "single_word": false,
978
+ "special": false
979
+ },
980
+ "32119": {
981
+ "content": "<extra_token_18>",
982
+ "lstrip": false,
983
+ "normalized": true,
984
+ "rstrip": false,
985
+ "single_word": false,
986
+ "special": false
987
+ },
988
+ "32120": {
989
+ "content": "<extra_token_19>",
990
+ "lstrip": false,
991
+ "normalized": true,
992
+ "rstrip": false,
993
+ "single_word": false,
994
+ "special": false
995
+ },
996
+ "32121": {
997
+ "content": "<extra_token_20>",
998
+ "lstrip": false,
999
+ "normalized": true,
1000
+ "rstrip": false,
1001
+ "single_word": false,
1002
+ "special": false
1003
+ },
1004
+ "32122": {
1005
+ "content": "<extra_token_21>",
1006
+ "lstrip": false,
1007
+ "normalized": true,
1008
+ "rstrip": false,
1009
+ "single_word": false,
1010
+ "special": false
1011
+ },
1012
+ "32123": {
1013
+ "content": "<extra_token_22>",
1014
+ "lstrip": false,
1015
+ "normalized": true,
1016
+ "rstrip": false,
1017
+ "single_word": false,
1018
+ "special": false
1019
+ },
1020
+ "32124": {
1021
+ "content": "<extra_token_23>",
1022
+ "lstrip": false,
1023
+ "normalized": true,
1024
+ "rstrip": false,
1025
+ "single_word": false,
1026
+ "special": false
1027
+ },
1028
+ "32125": {
1029
+ "content": "<extra_token_24>",
1030
+ "lstrip": false,
1031
+ "normalized": true,
1032
+ "rstrip": false,
1033
+ "single_word": false,
1034
+ "special": false
1035
+ },
1036
+ "32126": {
1037
+ "content": "<extra_token_25>",
1038
+ "lstrip": false,
1039
+ "normalized": true,
1040
+ "rstrip": false,
1041
+ "single_word": false,
1042
+ "special": false
1043
+ },
1044
+ "32127": {
1045
+ "content": "<extra_token_26>",
1046
+ "lstrip": false,
1047
+ "normalized": true,
1048
+ "rstrip": false,
1049
+ "single_word": false,
1050
+ "special": false
1051
+ }
1052
+ },
1053
+ "additional_special_tokens": [
1054
+ "<s>",
1055
+ "</s>",
1056
+ "<pad>",
1057
+ "<extra_id_0>",
1058
+ "<extra_id_1>",
1059
+ "<extra_id_2>",
1060
+ "<extra_id_3>",
1061
+ "<extra_id_4>",
1062
+ "<extra_id_5>",
1063
+ "<extra_id_6>",
1064
+ "<extra_id_7>",
1065
+ "<extra_id_8>",
1066
+ "<extra_id_9>",
1067
+ "<extra_id_10>",
1068
+ "<extra_id_11>",
1069
+ "<extra_id_12>",
1070
+ "<extra_id_13>",
1071
+ "<extra_id_14>",
1072
+ "<extra_id_15>",
1073
+ "<extra_id_16>",
1074
+ "<extra_id_17>",
1075
+ "<extra_id_18>",
1076
+ "<extra_id_19>",
1077
+ "<extra_id_20>",
1078
+ "<extra_id_21>",
1079
+ "<extra_id_22>",
1080
+ "<extra_id_23>",
1081
+ "<extra_id_24>",
1082
+ "<extra_id_25>",
1083
+ "<extra_id_26>",
1084
+ "<extra_id_27>",
1085
+ "<extra_id_28>",
1086
+ "<extra_id_29>",
1087
+ "<extra_id_30>",
1088
+ "<extra_id_31>",
1089
+ "<extra_id_32>",
1090
+ "<extra_id_33>",
1091
+ "<extra_id_34>",
1092
+ "<extra_id_35>",
1093
+ "<extra_id_36>",
1094
+ "<extra_id_37>",
1095
+ "<extra_id_38>",
1096
+ "<extra_id_39>",
1097
+ "<extra_id_40>",
1098
+ "<extra_id_41>",
1099
+ "<extra_id_42>",
1100
+ "<extra_id_43>",
1101
+ "<extra_id_44>",
1102
+ "<extra_id_45>",
1103
+ "<extra_id_46>",
1104
+ "<extra_id_47>",
1105
+ "<extra_id_48>",
1106
+ "<extra_id_49>",
1107
+ "<extra_id_50>",
1108
+ "<extra_id_51>",
1109
+ "<extra_id_52>",
1110
+ "<extra_id_53>",
1111
+ "<extra_id_54>",
1112
+ "<extra_id_55>",
1113
+ "<extra_id_56>",
1114
+ "<extra_id_57>",
1115
+ "<extra_id_58>",
1116
+ "<extra_id_59>",
1117
+ "<extra_id_60>",
1118
+ "<extra_id_61>",
1119
+ "<extra_id_62>",
1120
+ "<extra_id_63>",
1121
+ "<extra_id_64>",
1122
+ "<extra_id_65>",
1123
+ "<extra_id_66>",
1124
+ "<extra_id_67>",
1125
+ "<extra_id_68>",
1126
+ "<extra_id_69>",
1127
+ "<extra_id_70>",
1128
+ "<extra_id_71>",
1129
+ "<extra_id_72>",
1130
+ "<extra_id_73>",
1131
+ "<extra_id_74>",
1132
+ "<extra_id_75>",
1133
+ "<extra_id_76>",
1134
+ "<extra_id_77>",
1135
+ "<extra_id_78>",
1136
+ "<extra_id_79>",
1137
+ "<extra_id_80>",
1138
+ "<extra_id_81>",
1139
+ "<extra_id_82>",
1140
+ "<extra_id_83>",
1141
+ "<extra_id_84>",
1142
+ "<extra_id_85>",
1143
+ "<extra_id_86>",
1144
+ "<extra_id_87>",
1145
+ "<extra_id_88>",
1146
+ "<extra_id_89>",
1147
+ "<extra_id_90>",
1148
+ "<extra_id_91>",
1149
+ "<extra_id_92>",
1150
+ "<extra_id_93>",
1151
+ "<extra_id_94>",
1152
+ "<extra_id_95>",
1153
+ "<extra_id_96>",
1154
+ "<extra_id_97>",
1155
+ "<extra_id_98>",
1156
+ "<extra_id_99>"
1157
+ ],
1158
+ "bos_token": "</s>",
1159
+ "clean_up_tokenization_spaces": false,
1160
+ "eos_token": "</s>",
1161
+ "extra_ids": 100,
1162
+ "extra_special_tokens": {},
1163
+ "legacy": false,
1164
+ "model_max_length": 512,
1165
+ "pad_token": "<pad>",
1166
+ "sp_model_kwargs": {},
1167
+ "tokenizer_class": "T5Tokenizer",
1168
+ "unk_token": "<unk>"
1169
+ }
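The `added_tokens_decoder` entries above follow a regular pattern: the sentinel `<extra_id_n>` sits at ID `32099 - n` (for example `<extra_id_36>` at 32063 and `<extra_id_0>` at 32099), and each `<extra_token_k>` sits at `32101 + k`. A minimal sketch of that mapping in plain Python follows. The helper names `sentinel_id` and `extra_token_id` are hypothetical and not part of `transformers`. The countdown formula is assumed to hold for all 100 sentinels declared by `"extra_ids": 100`, although only part of the table is visible in this hunk.

```python
# Hypothetical helpers illustrating the ID layout visible in tokenizer_config.json.
# They are not part of the transformers API.

SENTINEL_TOP_ID = 32099      # ID of <extra_id_0> in the config above
EXTRA_TOKEN_BASE_ID = 32101  # ID of <extra_token_0> in the config above
NUM_SENTINELS = 100          # "extra_ids": 100


def sentinel_id(n: int) -> int:
    """Return the assumed vocabulary ID of <extra_id_n> (IDs count down as n grows)."""
    if not 0 <= n < NUM_SENTINELS:
        raise ValueError(f"sentinel index out of range: {n}")
    return SENTINEL_TOP_ID - n


def extra_token_id(k: int) -> int:
    """Return the vocabulary ID of <extra_token_k> (IDs count up as k grows)."""
    if k < 0:
        raise ValueError(f"extra token index must be non-negative: {k}")
    return EXTRA_TOKEN_BASE_ID + k


if __name__ == "__main__":
    # Spot-check against entries listed in the config.
    print(sentinel_id(36))      # ID for <extra_id_36>
    print(extra_token_id(26))   # ID for <extra_token_26>
```

A check like this can catch an accidentally reordered or truncated token table before the model is loaded.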
requirements.txt ADDED
@@ -0,0 +1,8 @@
1
+ flask
2
+ --extra-index-url https://download.pytorch.org/whl/cpu # CPU-only torch wheels; pip ignores index options placed on a requirement line
3
+ torch
4
+ transformers
5
+ numpy
6
+ protobuf
7
+ sentencepiece
8
+ gunicorn
static/style.css ADDED
@@ -0,0 +1,274 @@
1
+ :root {
2
+ --primary-color: #2962ff;
3
+ --secondary-color: #f8fafe;
4
+ --text-color: #1a1f36;
5
+ --border-color: #e1e6ef;
6
+ --hover-color: #edf2ff;
7
+ --shadow-color: rgba(0, 0, 0, 0.06);
8
+ }
9
+
10
+ * {
11
+ margin: 0;
12
+ padding: 0;
13
+ box-sizing: border-box;
14
+ }
15
+
16
+ body {
17
+ font-family: 'Google Sans', 'Roboto', sans-serif;
18
+ background: #fff;
19
+ color: var(--text-color);
20
+ min-height: 100vh;
21
+ }
22
+
23
+ .page-container {
24
+ min-height: 100vh;
25
+ max-width: 1200px;
26
+ margin: 0 auto;
27
+ padding: 40px 20px;
28
+ display: flex;
29
+ flex-direction: column;
30
+ align-items: center;
31
+ }
32
+
33
+ .container {
34
+ max-width: 800px;
35
+ width: 100%;
36
+ background: white;
37
+ border-radius: 20px;
38
+ padding: 30px;
39
+ box-shadow: 0 10px 30px var(--shadow-color);
40
+ }
41
+
42
+ header {
43
+ text-align: center;
44
+ margin-bottom: 20px;
45
+ }
46
+
47
+ h1 {
48
+ font-size: 2.5em;
49
+ font-weight: 600;
50
+ background: linear-gradient(135deg, var(--primary-color), #1e88e5);
51
+ -webkit-background-clip: text;
52
+ background-clip: text;
53
+ -webkit-text-fill-color: transparent;
54
+ margin-bottom: 10px;
55
+ }
56
+
57
+ .subtitle {
58
+ color: #666;
59
+ font-size: 1.1em;
60
+ }
61
+
62
+ .translation-box {
63
+ background: white;
64
+ border-radius: 16px;
65
+ box-shadow: 0 8px 30px var(--shadow-color);
66
+ width: 100%;
67
+ max-width: 900px;
68
+ margin-top: 30px;
69
+ overflow: hidden;
70
+ border: 1px solid var(--border-color);
71
+ }
72
+
73
+ .input-section, .output-section {
74
+ padding: 30px;
75
+ }
76
+
77
+ .input-section {
78
+ background: var(--secondary-color);
79
+ border-bottom: 1px solid var(--border-color);
80
+ }
81
+
82
+ .input-header, .output-header {
83
+ margin-bottom: 15px;
84
+ }
85
+
86
+ label {
87
+ font-weight: 500;
88
+ color: var(--text-color);
89
+ display: flex;
90
+ align-items: center;
91
+ gap: 8px;
92
+ }
93
+
94
+ .char-count {
95
+ color: #5f6368;
96
+ font-size: 0.8em;
97
+ }
98
+
99
+ textarea {
100
+ width: 100%;
101
+ min-height: 160px;
102
+ padding: 15px;
103
+ border: 1px solid var(--border-color);
104
+ border-radius: 12px;
105
+ background: white;
106
+ font-size: 1.1em;
107
+ line-height: 1.5;
108
+ transition: border-color 0.3s ease;
109
+ }
110
+
111
+ textarea:focus {
112
+ outline: none;
113
+ border-color: var(--primary-color);
114
+ box-shadow: 0 0 0 3px rgba(41, 98, 255, 0.1);
115
+ }
116
+
117
+ .controls {
118
+ padding: 15px 30px;
119
+ background: white;
120
+ border-top: 1px solid var(--border-color);
121
+ display: flex;
122
+ justify-content: space-between;
123
+ align-items: center;
124
+ }
125
+
126
+ .primary-btn, .secondary-btn, .icon-btn {
127
+ padding: 12px 25px;
128
+ border: none;
129
+ border-radius: 8px;
130
+ cursor: pointer;
131
+ font-size: 1em;
132
+ font-weight: 500;
133
+ display: flex;
134
+ align-items: center;
135
+ gap: 8px;
136
+ transition: all 0.3s ease;
137
+ }
138
+
139
+ .primary-btn {
140
+ background: var(--primary-color);
141
+ color: white;
142
+ padding: 12px 32px;
143
+ border-radius: 8px;
144
+ font-size: 1rem;
145
+ font-weight: 500;
146
+ letter-spacing: 0.3px;
147
+ transition: transform 0.2s ease, background 0.2s ease;
148
+ }
149
+
150
+ .primary-btn:hover {
151
+ background: #1e4bd8;
152
+ transform: translateY(-1px);
153
+ }
154
+
155
+ .secondary-btn {
156
+ background: #e0e0e0;
157
+ color: #666;
158
+ }
159
+
160
+ .secondary-btn:hover {
161
+ background: #d0d0d0;
162
+ }
163
+
164
+ .icon-btn {
165
+ width: 40px;
166
+ height: 40px;
167
+ display: flex;
168
+ align-items: center;
169
+ justify-content: center;
170
+ border-radius: 10px;
171
+ transition: all 0.2s ease;
172
+ }
173
+
174
+ .icon-btn:hover {
175
+ background: var(--hover-color);
176
+ transform: scale(1.05);
177
+ }
178
+
179
+ .translation-result {
180
+ min-height: 120px;
181
+ background: white;
182
+ border-radius: 12px;
183
+ padding: 20px;
184
+ font-size: 1.1em;
185
+ line-height: 1.6;
186
+ }
187
+
188
+ /* Loading Spinner */
189
+ .loading-spinner {
190
+ display: flex;
191
+ justify-content: center;
192
+ padding: 30px;
193
+ }
194
+
195
+ .spinner {
196
+ width: 30px;
197
+ height: 30px;
198
+ border: 3px solid var(--secondary-color);
199
+ border-top: 3px solid var(--primary-color);
200
+ border-radius: 50%;
201
+ animation: spin 1s linear infinite;
202
+ }
203
+
204
+ /* Toast Notification */
205
+ .toast {
206
+ position: fixed;
207
+ bottom: 24px;
208
+ left: 50%;
209
+ transform: translateX(-50%);
210
+ background: #323232;
211
+ color: white;
212
+ padding: 12px 30px;
213
+ border-radius: 8px;
214
+ font-size: 0.95rem;
215
+ box-shadow: 0 2px 5px rgba(0,0,0,0.2);
216
+ animation: slideUp 0.3s ease;
217
+ }
218
+
219
+ /* Animations */
220
+ @keyframes spin {
221
+ 0% { transform: rotate(0deg); }
222
+ 100% { transform: rotate(360deg); }
223
+ }
224
+
225
+ @keyframes fadeIn {
226
+ from { opacity: 0; transform: translate(-50%, 20px); }
227
+ to { opacity: 1; transform: translate(-50%, 0); }
228
+ }
229
+
230
+ @keyframes fadeOut {
231
+ from { opacity: 1; transform: translate(-50%, 0); }
232
+ to { opacity: 0; transform: translate(-50%, 20px); }
233
+ }
234
+
235
+ @keyframes slideUp {
236
+ from { transform: translate(-50%, 100%); opacity: 0; }
237
+ to { transform: translate(-50%, 0); opacity: 1; }
238
+ }
239
+
240
+ /* Responsive Design */
241
+ @media (max-width: 768px) {
242
+ .page-container {
243
+ padding: 20px 15px;
244
+ }
245
+
246
+ .translation-box {
247
+ border-radius: 12px;
248
+ border-left: none;
249
+ border-right: none;
250
+ }
251
+
252
+ .container {
253
+ padding: 20px;
254
+ margin: 10px;
255
+ }
256
+
257
+ h1 {
258
+ font-size: 2em;
259
+ }
260
+
261
+ .input-section, .output-section {
262
+ padding: 20px;
263
+ }
264
+
265
+ .controls {
266
+ flex-direction: column;
267
+ padding: 15px 20px;
268
+ }
269
+
270
+ .primary-btn, .secondary-btn {
271
+ width: 100%;
272
+ justify-content: center;
273
+ }
274
+ }
templates/index.html ADDED
@@ -0,0 +1,144 @@
1
+ <!DOCTYPE html>
2
+ <html lang="en">
3
+ <head>
4
+ <meta charset="UTF-8" />
5
+ <meta name="viewport" content="width=device-width, initial-scale=1.0">
6
+ <title>BanglaFeel Translator</title>
7
+ <link rel="stylesheet" href="/static/style.css" />
8
+ <link href="https://fonts.googleapis.com/css2?family=Google+Sans:wght@400;500&family=Roboto:wght@400;500&display=swap" rel="stylesheet">
9
+ <link rel="stylesheet" href="https://fonts.googleapis.com/icon?family=Material+Icons">
10
+ </head>
11
+ <body>
12
+ <div class="page-container">
13
+ <header>
14
+ <h1>BanglaFeel Translator</h1>
15
+ <p class="subtitle">Translate English to Bengali with ease</p>
16
+ </header>
17
+
18
+ <div class="translation-box">
19
+ <div class="input-section">
20
+ <div class="input-header">
21
+ <label>
22
+ <span class="material-icons">language</span>
23
+ English
24
+ </label>
25
+ </div>
26
+ <textarea id="userInput" placeholder="Type or paste your text here..." maxlength="500"></textarea>
27
+ </div>
28
+
29
+ <div class="controls">
30
+ <div class="left-controls">
31
+ <button id="clearBtn" class="icon-btn" title="Clear text">
32
+ <span class="material-icons">clear</span>
33
+ </button>
34
+ </div>
35
+ <div class="right-controls">
36
+ <span class="char-count">0/500</span>
37
+ <button id="translateBtn" class="primary-btn">
38
+ <span class="material-icons">translate</span>
39
+ Translate
40
+ </button>
41
+ </div>
42
+ </div>
43
+
44
+ <div class="output-section">
45
+ <div class="output-header">
46
+ <label>
47
+ <span class="material-icons">translate</span>
48
+ Bengali
49
+ </label>
50
+ <button id="copyBtn" class="icon-btn" title="Copy translation">
51
+ <span class="material-icons">content_copy</span>
52
+ </button>
53
+ </div>
54
+ <div class="translation-result">
55
+ <p id="translationResult"></p>
56
+ <div class="loading-spinner" style="display: none;">
57
+ <div class="spinner"></div>
58
+ </div>
59
+ </div>
60
+ </div>
61
+ </div>
62
+ </div>
63
+
64
+ <script>
65
+ document.addEventListener('DOMContentLoaded', function() {
66
+ const userInput = document.getElementById('userInput');
67
+ const charCount = document.querySelector('.char-count');
68
+ const translateBtn = document.getElementById('translateBtn');
69
+ const clearBtn = document.getElementById('clearBtn');
70
+ const copyBtn = document.getElementById('copyBtn');
71
+ const translationResult = document.getElementById('translationResult');
72
+ const loadingSpinner = document.querySelector('.loading-spinner');
73
+
74
+ // Character counter
75
+ userInput.addEventListener('input', function() {
76
+ charCount.textContent = `${this.value.length}/500`;
77
+ });
78
+
79
+ // Clear button
80
+ clearBtn.addEventListener('click', function() {
81
+ userInput.value = '';
82
+ translationResult.textContent = '';
83
+ charCount.textContent = '0/500';
84
+ });
85
+
86
+ // Copy button
87
+ copyBtn.addEventListener('click', function() {
88
+ if (translationResult.textContent) {
89
+ navigator.clipboard.writeText(translationResult.textContent)
90
+ .then(() => {
91
+ copyBtn.innerHTML = '<span class="material-icons">check</span>';
92
+ setTimeout(() => {
93
+ copyBtn.innerHTML = '<span class="material-icons">content_copy</span>';
94
+ }, 2000);
95
+ }).catch(() => showToast('Could not copy the translation.'));
96
+ }
97
+ });
98
+
99
+ // Translation
100
+ translateBtn.addEventListener('click', async function() {
101
+ const text = userInput.value.trim();
102
+ if (!text) {
103
+ showToast('Please enter some text first.');
104
+ return;
105
+ }
106
+
107
+ // Show loading state
108
+ loadingSpinner.style.display = 'flex';
109
+ translateBtn.disabled = true;
110
+ translationResult.style.opacity = '0.5';
111
+
112
+ try {
113
+ const response = await fetch('/translate', {
114
+ method: 'POST',
115
+ headers: { 'Content-Type': 'application/json' },
116
+ body: JSON.stringify({ text: text })
117
+ });
118
+
119
+ if (!response.ok) throw new Error(`Server error: ${response.status}`);
120
+
121
+ const data = await response.json();
122
+ translationResult.textContent = data.translation;
123
+ } catch (error) {
124
+ console.error('Error:', error);
125
+ translationResult.textContent = 'An error occurred during translation.';
126
+ showToast('Translation failed. Please try again.');
127
+ } finally {
128
+ loadingSpinner.style.display = 'none';
129
+ translateBtn.disabled = false;
130
+ translationResult.style.opacity = '1';
131
+ }
132
+ });
133
+
134
+ function showToast(message) {
135
+ const toast = document.createElement('div');
136
+ toast.className = 'toast';
137
+ toast.textContent = message;
138
+ document.body.appendChild(toast);
139
+ setTimeout(() => toast.remove(), 3000);
140
+ }
141
+ });
142
+ </script>
143
+ </body>
144
+ </html>
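The page script above posts `{"text": ...}` as JSON to `/translate` and reads `data.translation` from the reply. The server side lives in `app.py`, which is not shown in this part of the diff, so the following is only a sketch of a handler that would satisfy that contract. `handle_translate` and `fake_translate` are hypothetical names, the 500-character limit mirrors the `maxlength` on the textarea, and the shape of the error payload is an assumption.

```python
import json
from typing import Callable, Tuple

MAX_CHARS = 500  # mirrors maxlength="500" on the textarea


def handle_translate(body: bytes, translate: Callable[[str], str]) -> Tuple[int, str]:
    """Validate a /translate request body and return (status_code, json_reply).

    `translate` stands in for the real model call, which is not shown here.
    """
    try:
        payload = json.loads(body)
    except (json.JSONDecodeError, UnicodeDecodeError):
        return 400, json.dumps({"error": "Body must be valid JSON."})

    text = payload.get("text") if isinstance(payload, dict) else None
    if not isinstance(text, str) or not text.strip():
        return 400, json.dumps({"error": "Field 'text' must be a non-empty string."})
    if len(text) > MAX_CHARS:
        return 400, json.dumps({"error": f"Text exceeds {MAX_CHARS} characters."})

    # The front end reads `data.translation`, so that key must be present on success.
    return 200, json.dumps({"translation": translate(text.strip())}, ensure_ascii=False)


def fake_translate(text: str) -> str:
    """Placeholder for the model; it only upper-cases the input."""
    return text.upper()


if __name__ == "__main__":
    status, reply = handle_translate(b'{"text": "hello"}', fake_translate)
    print(status, reply)
```

Keeping validation in a plain function like this makes it testable without starting a server. In a Flask view the raw body could be passed in from `request.get_data()`, but that wiring is left out because `app.py` is not visible here.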