Moving around more info section
app.py
CHANGED
@@ -251,24 +251,6 @@ with gr.Blocks(title="BLT vs BPE FLOPs Comparison") as demo:
 
 For inspiration, have a look at the paper's [BLT architecture configurations](https://arxiv.org/html/2412.09871v1#:~:text=%5Cbeginappendix-,11,Table%C2%A010%20shows%20different%20hyper%20parameter%20settings%20for%20BLT%20models.,-Encoder).
 
-
-<details><summary>[INFO] What does this tool show us?</summary>
-
-
-The **purpose** of this tool is to show the relationship between patch size, global
-model dimension and local model layers in terms of FLOPs and parameters. This tool
-implies _nothing_ about the **effectiveness** of the FLOPs relative to loss (c.f
-[FLOPs/BPB plots from the paper](https://arxiv.org/html/2412.09871v1#:~:text=Introduction-,Figure%201%3A,-Scaling%20trends%20for)) or downstream benchmarks. In order to
-fully compare BPE-based transformers and BLT, you'll need to investigate those
-claims in the paper itself.
-
-A core
-hypothesis of the paper is "that larger models taking fewer steps on larger patches
-might perform better than smaller models taking more steps." [source](https://arxiv.org/html/2412.09871v1#:~:text=the%20hypothesis%20that%20larger%20models%20taking%20fewer%20steps%20on%20larger%20patches%20might%20perform%20better%20than%20smaller%20models%20taking%20more%20steps)
-</details>
-
-
-
 A few things you'll notice:
 1. Patch size reduces global model FLOPs but not local model
 2. Increasing patch size and global model dimension doesn't change total FLOPs
@@ -306,6 +288,18 @@ with gr.Blocks(title="BLT vs BPE FLOPs Comparison") as demo:
 info="Number of layers in the BLT's local model"
 )
 
+gr.Markdown("""
+A core
+hypothesis of the paper is "that larger models taking fewer steps on larger patches
+might perform better than smaller models taking more steps." [source](https://arxiv.org/html/2412.09871v1#:~:text=the%20hypothesis%20that%20larger%20models%20taking%20fewer%20steps%20on%20larger%20patches%20might%20perform%20better%20than%20smaller%20models%20taking%20more%20steps)
+
+The **purpose** of this tool is to show the relationship between patch size, global
+model dimension and local model layers in terms of FLOPs and parameters. This tool
+implies _nothing_ about the **effectiveness** of the FLOPs relative to loss (cf.
+[FLOPs/BPB plots from the paper](https://arxiv.org/html/2412.09871v1#:~:text=Introduction-,Figure%201%3A,-Scaling%20trends%20for)) or downstream benchmarks. In order to
+fully compare BPE-based transformers and BLT, you'll need to investigate those
+claims in the paper itself.
+""")
 gr.Markdown("### Fixed Parameters")
 gr.Markdown(f"""
 - **BPE's bytes per token (bpe_ps)**: {bpe_ps}