Spaces: lucalp/byte-latent-transformer-flops

lucalp committed on May 25

Commit 360537f

1 Parent(s): d90ad1e

More tweaks and info

Files changed (1)

app.py +19 -1

app.py

@@ -249,7 +249,25 @@ with gr.Blocks(title="BLT vs BPE FLOPs Comparison") as demo:
     - **BPE (Byte Pair Encoding)**: Traditional transformer architecture
     - **BLT (Byte Latent Transformer)**: Novel architecture with Global and Local components with a dynamic patch size to segment bytes.
-    Have a look at the paper's [BLT architecture configurations](https://arxiv.org/html/2412.09871v1#:~:text=%5Cbeginappendix-,11,Table%C2%A010%20shows%20different%20hyper%20parameter%20settings%20for%20BLT%20models.,-Encoder) for some inspiration.
+    Have a look at the paper's [BLT architecture configurations](https://arxiv.org/html/2412.09871v1#:~:text=%5Cbeginappendix-,11,Table%C2%A010%20shows%20different%20hyper%20parameter%20settings%20for%20BLT%20models.,-Encoder) for some inspiration.
+    <details><summary>[INFO] What does this tool show us?</summary>
+    The **purpose** of this tool is to show the relationship between patch size, global
+    model dimension and local model layers in terms of FLOPs and parameters. This tool
+    implies _nothing_ about the **effectiveness** of the FLOPs relative to loss (cf.
+    [FLOPs/BPB plots from the paper](https://arxiv.org/html/2412.09871v1#:~:text=Introduction-,Figure%201%3A,-Scaling%20trends%20for)) or downstream benchmarks. In order to
+    fully compare BPE-based transformers and BLT, you'll need to investigate those
+    claims in the paper itself.
+    A core
+    hypothesis of the paper is "that larger models taking fewer steps on larger patches
+    might perform better than smaller models taking more steps." [source](https://arxiv.org/html/2412.09871v1#:~:text=the%20hypothesis%20that%20larger%20models%20taking%20fewer%20steps%20on%20larger%20patches%20might%20perform%20better%20than%20smaller%20models%20taking%20more%20steps)
+    </details>
     A few things you'll notice:
     1. Patch size reduces global model FLOPs but not local model
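
The first observation in the list (patch size reduces global model FLOPs but not local model FLOPs) can be illustrated with a rough sketch. This is not the formula used in `app.py`. It assumes the common approximation that a forward pass costs about 2 FLOPs per parameter per step, that the global model takes one step per patch, and that the local models take one step per byte. The function name and the parameter counts are hypothetical and chosen only for illustration.

```python
def approx_forward_flops(params):
    """Assumed approximation: about 2 FLOPs per parameter per step."""
    return 2 * params


def flops_per_byte(global_params, local_params, patch_size):
    """Return (global, local) forward FLOPs per input byte.

    Assumptions (not taken from app.py):
    - the global model runs once per patch, so its cost is
      amortized over patch_size bytes;
    - the local models run once per byte, so patch size
      does not change their cost.
    """
    global_flops = approx_forward_flops(global_params) / patch_size
    local_flops = approx_forward_flops(local_params)
    return global_flops, local_flops


if __name__ == "__main__":
    # Hypothetical parameter counts, for illustration only.
    for patch_size in (2, 4, 8):
        g, l = flops_per_byte(
            global_params=1_000_000_000,
            local_params=100_000_000,
            patch_size=patch_size,
        )
        print(f"patch_size={patch_size}: global={g:.3e}, local={l:.3e}")
```

Under these assumptions, doubling the patch size halves the global FLOPs per byte while the local FLOPs per byte stay constant.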