lucalp committed
Commit e83db0f · Parent(s): 3140e72

Moving description

Files changed (1): app.py (+9 -9)
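For context on why moving a `gr.Markdown(...)` call changes the page: in a Gradio `Blocks` app, components render in the order they are created, so relocating the call moves the description text from above the controls to below them. A minimal sketch of that behavior (the button label and layout here are illustrative, not the app's real components):

```python
import gradio as gr

# In gr.Blocks, components render in creation order, so moving a
# gr.Markdown(...) call relocates that text on the page.
with gr.Blocks(title="BLT vs BPE FLOPs Comparison") as demo:
    gr.Markdown("# BLT vs BPE FLOPs Comparison")  # title stays at the top
    with gr.Row():
        gr.Button("Compare")                      # hypothetical control row
    gr.Markdown("Observations and notes...")      # description now renders below the row

demo.launch()
```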
app.py CHANGED

```diff
@@ -243,15 +243,6 @@ with gr.Blocks(title="BLT vs BPE FLOPs Comparison") as demo:
     gr.Markdown("""
     # BLT vs BPE FLOPs Comparison
     Companion blog post [can be found here](https://lucalp.dev/bitter-lesson-tokenization-and-blt).
-
-    Have a look at the paper's [BLT architecture configurations](https://arxiv.org/html/2412.09871v1#:~:text=%5Cbeginappendix-,11,Table%C2%A010%20shows%20different%20hyper%20parameter%20settings%20for%20BLT%20models.,-Encoder) for some inspiration.
-
-    A few things you'll notice:
-    1. Increasing the patch size reduces global model FLOPs but not local model FLOPs
-    2. Increasing patch size and global model dimension together leaves total FLOPs unchanged
-    3. In smaller BLTs, local models constitute a larger portion of the total FLOPs
-
-    Parameter counts are displayed below each bar.
     """)
 
     with gr.Row():
@@ -285,6 +276,15 @@ with gr.Blocks(title="BLT vs BPE FLOPs Comparison") as demo:
     )
 
     gr.Markdown("""
+    Have a look at the paper's [BLT architecture configurations](https://arxiv.org/html/2412.09871v1#:~:text=%5Cbeginappendix-,11,Table%C2%A010%20shows%20different%20hyper%20parameter%20settings%20for%20BLT%20models.,-Encoder) for some inspiration.
+
+    A few things you'll notice:
+    1. Increasing the patch size reduces global model FLOPs but not local model FLOPs
+    2. Increasing patch size and global model dimension together leaves total FLOPs unchanged
+    3. In smaller BLTs, local models constitute a larger portion of the total FLOPs
+
+    Parameter counts are displayed below each bar.
+
     A core
     hypothesis of the paper is "that larger models taking fewer steps on larger patches
     might perform better than smaller models taking more steps." [source](https://arxiv.org/html/2412.09871v1#:~:text=the%20hypothesis%20that%20larger%20models%20taking%20fewer%20steps%20on%20larger%20patches%20might%20perform%20better%20than%20smaller%20models%20taking%20more%20steps)
```
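The three observations in the moved description follow from how FLOPs split between the byte-level local models and the patch-level global model: the local encoder/decoder run on every byte, while the global model runs once per patch. A minimal sketch of that accounting, assuming the standard ~6 × params FLOPs-per-position training approximation; the function name and parameter counts are illustrative, and this is not the accounting implemented in app.py:

```python
def blt_flops_per_byte(local_params: float, global_params: float, patch_size: float) -> dict:
    """Approximate training FLOPs per input byte for a BLT-style model."""
    local_flops = 6 * local_params                 # local models see every byte
    global_flops = 6 * global_params / patch_size  # global model sees one position per patch
    return {"local": local_flops, "global": global_flops, "total": local_flops + global_flops}

# Observation 1: a larger patch size shrinks only the global share.
p4 = blt_flops_per_byte(local_params=100e6, global_params=1e9, patch_size=4)
p8 = blt_flops_per_byte(local_params=100e6, global_params=1e9, patch_size=8)
assert p8["local"] == p4["local"] and p8["global"] < p4["global"]

# Observation 2: doubling patch size while doubling global parameters
# leaves total FLOPs per byte unchanged.
p8_big = blt_flops_per_byte(local_params=100e6, global_params=2e9, patch_size=8)
assert p8_big["total"] == p4["total"]

# Observation 3: with a smaller global model, the fixed per-byte local cost
# becomes a larger fraction of the total.
small = blt_flops_per_byte(local_params=100e6, global_params=400e6, patch_size=4)
print(small["local"] / small["total"])  # larger local share than in the 1e9-param case
```

The 6 × params constant cancels out of all the comparisons, so only the params / patch_size structure of the global term matters for these observations.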