Spaces: lucalp/byte-latent-transformer-flops
lucalp committed on Jun 22

Commit e83db0f (1 parent: 3140e72)

Moving description

Files changed (1)

app.py CHANGED

@@ -243,15 +243,6 @@ with gr.Blocks(title="BLT vs BPE FLOPs Comparison") as demo:
     gr.Markdown("""
     # BLT vs BPE FLOPs Comparison
     Companion blog post [can be found here](https://lucalp.dev/bitter-lesson-tokenization-and-blt).
-    For inspiration, have a look at the paper's [BLT architecture configurations](https://arxiv.org/html/2412.09871v1#:~:text=%5Cbeginappendix-,11,Table%C2%A010%20shows%20different%20hyper%20parameter%20settings%20for%20BLT%20models.,-Encoder) for some inspiration.
-    A few things you'll notice:
-    1. Patch size reduces global model FLOPs but not local model
-    2. Increasing patch size and global model dimension doesn't change total FLOPs
-    3. In smaller BLTs, local models constitute a larger portion of the total FLOPs
-    Parameter counts are displayed below each bar.
     """)
     with gr.Row():
@@ -285,6 +276,15 @@ with gr.Blocks(title="BLT vs BPE FLOPs Comparison") as demo:
             )
             gr.Markdown("""
+    For inspiration, have a look at the paper's [BLT architecture configurations](https://arxiv.org/html/2412.09871v1#:~:text=%5Cbeginappendix-,11,Table%C2%A010%20shows%20different%20hyper%20parameter%20settings%20for%20BLT%20models.,-Encoder) for some inspiration.
+    A few things you'll notice:
+    1. Patch size reduces global model FLOPs but not local model
+    2. Increasing patch size and global model dimension doesn't change total FLOPs
+    3. In smaller BLTs, local models constitute a larger portion of the total FLOPs
+    Parameter counts are displayed below each bar.
     A core
     hypothesis of the paper is "that larger models taking fewer steps on larger patches
     might perform better than smaller models taking more steps." [source](https://arxiv.org/html/2412.09871v1#:~:text=the%20hypothesis%20that%20larger%20models%20taking%20fewer%20steps%20on%20larger%20patches%20might%20perform%20better%20than%20smaller%20models%20taking%20more%20steps)
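
The three observations in the relocated description can be illustrated with a back-of-the-envelope sketch. This is not the formula `app.py` uses (the diff does not show it); it assumes the common approximation of `12 * n_layers * d_model**2` non-embedding parameters and roughly `2` forward FLOPs per parameter per token, and it ignores attention-over-context, cross-attention, and embedding costs. The function names and all layer and dimension values below are hypothetical, chosen only for illustration.

```python
def transformer_flops_per_step(n_layers: int, d_model: int) -> float:
    """Rough forward FLOPs for one step of a transformer stack.

    Assumption: ~12 * n_layers * d_model**2 non-embedding parameters and
    ~2 FLOPs per parameter per step; attention and embeddings are ignored.
    """
    return 2 * 12 * n_layers * d_model**2


def blt_flops_per_byte(patch_size, global_layers, global_dim, local_layers, local_dim):
    """Per-byte FLOPs split for a BLT-like model (hypothetical simplification).

    The local models run once per byte; the global model runs once per patch,
    so its cost is amortized over `patch_size` bytes.
    """
    local = transformer_flops_per_step(local_layers, local_dim)
    global_ = transformer_flops_per_step(global_layers, global_dim) / patch_size
    return {"local": local, "global": global_, "total": local + global_}


# 1. Doubling the patch size halves global FLOPs and leaves local FLOPs alone.
base = blt_flops_per_byte(4, 24, 1024, 6, 512)
bigger_patch = blt_flops_per_byte(8, 24, 1024, 6, 512)

# 2. Under this approximation, a 4x larger patch with a 2x wider global model
#    costs the same per byte, because d_model**2 also grows 4x.
wider_global = blt_flops_per_byte(16, 24, 2048, 6, 512)

# 3. The local share of total FLOPs is larger in the smaller configuration.
small = blt_flops_per_byte(4, 12, 768, 6, 512)
large = blt_flops_per_byte(4, 32, 4096, 6, 1024)
small_local_share = small["local"] / small["total"]
large_local_share = large["local"] / large["total"]

print(base, bigger_patch, wider_global)
print(f"local share: small={small_local_share:.2f}, large={large_local_share:.2f}")
```

Read this way, the second observation holds only when patch size and global dimension are scaled together (patch size growing with the square of the dimension under this approximation), which is the trade-off behind the quoted hypothesis about larger models taking fewer steps.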