Moving around more info section
app.py
CHANGED
@@ -251,24 +251,6 @@ with gr.Blocks(title="BLT vs BPE FLOPs Comparison") as demo:
 
 For inspiration, have a look at the paper's [BLT architecture configurations](https://arxiv.org/html/2412.09871v1#:~:text=%5Cbeginappendix-,11,Table%C2%A010%20shows%20different%20hyper%20parameter%20settings%20for%20BLT%20models.,-Encoder).
 
-
-<details><summary>[INFO] What does this tool show us?</summary>
-
-
-The **purpose** of this tool is to show the relationship between patch size, global
-model dimension and local model layers in terms of FLOPs and parameters. This tool
-implies _nothing_ about the **effectiveness** of the FLOPs relative to loss (c.f
-[FLOPs/BPB plots from the paper](https://arxiv.org/html/2412.09871v1#:~:text=Introduction-,Figure%201%3A,-Scaling%20trends%20for)) or downstream benchmarks. In order to
-fully compare BPE-based transformers and BLT, you'll need to investigate those
-claims in the paper itself.
-
-A core
-hypothesis of the paper is "that larger models taking fewer steps on larger patches
-might perform better than smaller models taking more steps." [source](https://arxiv.org/html/2412.09871v1#:~:text=the%20hypothesis%20that%20larger%20models%20taking%20fewer%20steps%20on%20larger%20patches%20might%20perform%20better%20than%20smaller%20models%20taking%20more%20steps)
-</details>
-
-
-
 A few things you'll notice:
 1. Patch size reduces global model FLOPs but not local model
 2. Increasing patch size and global model dimension doesn't change total FLOPs
@@ -306,6 +288,18 @@ with gr.Blocks(title="BLT vs BPE FLOPs Comparison") as demo:
 info="Number of layers in the BLT's local model"
 )
 
+gr.Markdown("""
+A core
+hypothesis of the paper is "that larger models taking fewer steps on larger patches
+might perform better than smaller models taking more steps." [source](https://arxiv.org/html/2412.09871v1#:~:text=the%20hypothesis%20that%20larger%20models%20taking%20fewer%20steps%20on%20larger%20patches%20might%20perform%20better%20than%20smaller%20models%20taking%20more%20steps)
+
+The **purpose** of this tool is to show the relationship between patch size, global
+model dimension and local model layers in terms of FLOPs and parameters. This tool
+implies _nothing_ about the **effectiveness** of the FLOPs relative to loss (cf.
+[FLOPs/BPB plots from the paper](https://arxiv.org/html/2412.09871v1#:~:text=Introduction-,Figure%201%3A,-Scaling%20trends%20for)) or downstream benchmarks. In order to
+fully compare BPE-based transformers and BLT, you'll need to investigate those
+claims in the paper itself.
+""")
 gr.Markdown("### Fixed Parameters")
 gr.Markdown(f"""
 - **BPE's bytes per token (bpe_ps)**: {bpe_ps}