lucalp committed on
Commit 86aec55 · 1 Parent(s): 360537f

Moving around more info section

Files changed (1)
  1. app.py +12 -18
app.py CHANGED
```diff
@@ -251,24 +251,6 @@ with gr.Blocks(title="BLT vs BPE FLOPs Comparison") as demo:
 
  For inspiration, have a look at the paper's [BLT architecture configurations](https://arxiv.org/html/2412.09871v1#:~:text=%5Cbeginappendix-,11,Table%C2%A010%20shows%20different%20hyper%20parameter%20settings%20for%20BLT%20models.,-Encoder) for some inspiration.
 
-
- <details><summary>[INFO] What does this tool show us?</summary>
-
-
- The **purpose** of this tool is to show the relationship between patch size, global
- model dimension and local model layers in terms of FLOPs and parameters. This tool
- implies _nothing_ about the **effectiveness** of the FLOPs relative to loss (c.f
- [FLOPs/BPB plots from the paper](https://arxiv.org/html/2412.09871v1#:~:text=Introduction-,Figure%201%3A,-Scaling%20trends%20for)) or downstream benchmarks. In order to
- fully compare BPE-based transformers and BLT, you'll need to investigate those
- claims in the paper itself.
-
- A core
- hypothesis of the paper is "that larger models taking fewer steps on larger patches
- might perform better than smaller models taking more steps." [source](https://arxiv.org/html/2412.09871v1#:~:text=the%20hypothesis%20that%20larger%20models%20taking%20fewer%20steps%20on%20larger%20patches%20might%20perform%20better%20than%20smaller%20models%20taking%20more%20steps)
- </details>
-
-
-
  A few things you'll notice:
  1. Patch size reduces global model FLOPs but not local model
  2. Increasing patch size and global model dimension doesn't change total FLOPs
@@ -306,6 +288,18 @@ with gr.Blocks(title="BLT vs BPE FLOPs Comparison") as demo:
  info="Number of layers in the BLT's local model"
  )
 
+ gr.Markdown("""
+ A core
+ hypothesis of the paper is "that larger models taking fewer steps on larger patches
+ might perform better than smaller models taking more steps." [source](https://arxiv.org/html/2412.09871v1#:~:text=the%20hypothesis%20that%20larger%20models%20taking%20fewer%20steps%20on%20larger%20patches%20might%20perform%20better%20than%20smaller%20models%20taking%20more%20steps)
+
+ The **purpose** of this tool is to show the relationship between patch size, global
+ model dimension and local model layers in terms of FLOPs and parameters. This tool
+ implies _nothing_ about the **effectiveness** of the FLOPs relative to loss (c.f
+ [FLOPs/BPB plots from the paper](https://arxiv.org/html/2412.09871v1#:~:text=Introduction-,Figure%201%3A,-Scaling%20trends%20for)) or downstream benchmarks. In order to
+ fully compare BPE-based transformers and BLT, you'll need to investigate those
+ claims in the paper itself.
+ """)
  gr.Markdown("### Fixed Parameters")
  gr.Markdown(f"""
  - **BPE's bytes per token (bpe_ps)**: {bpe_ps}
```
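The text being moved describes how patch size, global model dimension and local model layers trade off in FLOPs. As a rough illustration of that relationship (not the app's actual code), here is a minimal Python sketch using the common 2-FLOPs-per-parameter forward-pass approximation with ~12·d² parameters per transformer layer; the function names, the example dimensions and layer counts, and these approximations are all illustrative assumptions.

```python
# Hedged sketch, not the formulas used in app.py: approximate forward-pass
# FLOPs per *byte* for a BPE transformer vs a BLT-style model.

def flops_per_token(d_model: int, n_layers: int) -> float:
    # Assumption: ~2 FLOPs per parameter per forward pass, ~12 * d_model^2
    # parameters per layer (attention + MLP); attention-score FLOPs ignored.
    return 2 * 12 * d_model ** 2 * n_layers

def bpe_flops_per_byte(d_model: int, n_layers: int, bpe_ps: float) -> float:
    # A BPE transformer takes one step per token, i.e. one step per bpe_ps bytes.
    return flops_per_token(d_model, n_layers) / bpe_ps

def blt_flops_per_byte(d_global: int, n_global: int,
                       d_local: int, n_local: int,
                       patch_size: float) -> float:
    # Local encoder/decoder run on every byte; the global model runs once per patch.
    local_part = flops_per_token(d_local, n_local)            # unaffected by patch size
    global_part = flops_per_token(d_global, n_global) / patch_size
    return local_part + global_part

if __name__ == "__main__":
    # Illustrative sizes only: growing the patch shrinks the global term while
    # the local term stays fixed, so the global dimension can grow at roughly
    # constant total FLOPs/byte.
    for ps in (4, 6, 8):
        total = blt_flops_per_byte(d_global=2048, n_global=24,
                                   d_local=512, n_local=4, patch_size=ps)
        print(f"patch_size={ps}: {total / 1e6:.1f} MFLOPs/byte")
```

Under these assumptions, doubling the patch size roughly halves the global-model term but leaves the local-model term untouched, which is the behavior the moved info section asks users to notice.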