lucalp committed
Commit 360537f · 1 Parent(s): d90ad1e

More tweaks and info

Files changed (1)
app.py +19 -1
app.py CHANGED
@@ -249,7 +249,25 @@ with gr.Blocks(title="BLT vs BPE FLOPs Comparison") as demo:
  - **BPE (Byte Pair Encoding)**: Traditional transformer architecture
  - **BLT (Byte Latent Transformer)**: Novel architecture with Global and Local components and a dynamic patch size to segment bytes.
 
- Have a look at the paper's [BLT architecture configurations](https://arxiv.org/html/2412.09871v1#:~:text=%5Cbeginappendix-,11,Table%C2%A010%20shows%20different%20hyper%20parameter%20settings%20for%20BLT%20models.,-Encoder) for some inspiration.
+ For inspiration, have a look at the paper's [BLT architecture configurations](https://arxiv.org/html/2412.09871v1#:~:text=%5Cbeginappendix-,11,Table%C2%A010%20shows%20different%20hyper%20parameter%20settings%20for%20BLT%20models.,-Encoder).
+
+ <details><summary>[INFO] What does this tool show us?</summary>
+
+ The **purpose** of this tool is to show the relationship between patch size, global
+ model dimension and local model layers in terms of FLOPs and parameters. This tool
+ implies _nothing_ about the **effectiveness** of the FLOPs relative to loss (cf.
+ [FLOPs/BPB plots from the paper](https://arxiv.org/html/2412.09871v1#:~:text=Introduction-,Figure%201%3A,-Scaling%20trends%20for)) or downstream benchmarks. To
+ fully compare BPE-based transformers and BLT, you'll need to investigate those
+ claims in the paper itself.
+
+ A core hypothesis of the paper is "that larger models taking fewer steps on larger
+ patches might perform better than smaller models taking more steps." [source](https://arxiv.org/html/2412.09871v1#:~:text=the%20hypothesis%20that%20larger%20models%20taking%20fewer%20steps%20on%20larger%20patches%20might%20perform%20better%20than%20smaller%20models%20taking%20more%20steps)
+ </details>
+
 
  A few things you'll notice:
  1. Patch size reduces global model FLOPs but not local model
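
To make that first observation concrete, here is a minimal back-of-the-envelope sketch. It is not the app's actual code: `transformer_params` and `blt_flops_per_byte` are hypothetical helpers, the model dimensions are made up rather than taken from the paper, and it leans on the common rough approximations of ~12·d_model² parameters per transformer layer and ~2·params FLOPs per token for a forward pass, ignoring the attention O(T²) term.

```python
# Back-of-the-envelope FLOPs sketch (illustrative only, not the app's code).
# Assumptions: ~12 * d_model^2 params per layer, ~2 * params FLOPs per token,
# attention's O(T^2) term ignored; all sizes below are invented for illustration.

def transformer_params(d_model: int, n_layers: int) -> int:
    """Rough parameter count: ~12 * d_model^2 per layer (attention + MLP)."""
    return n_layers * 12 * d_model**2

def blt_flops_per_byte(patch_size: float,
                       d_global: int = 2048, n_global_layers: int = 24,
                       d_local: int = 512, n_local_layers: int = 4) -> float:
    global_params = transformer_params(d_global, n_global_layers)
    # Local encoder + local decoder, modeled here as two small stacks.
    local_params = transformer_params(d_local, 2 * n_local_layers)
    # Global model runs once per *patch*: its per-byte cost shrinks as 1/patch_size.
    # Local models run once per *byte*: patch size doesn't touch their cost.
    return 2 * global_params / patch_size + 2 * local_params

for ps in (2, 4, 8):
    print(f"patch_size={ps}: {blt_flops_per_byte(ps) / 1e9:.2f} GFLOPs/byte")
```

Under these assumptions, doubling the patch size roughly halves the global term while the local term stays fixed, which is the relationship behind observation 1 above.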