Moving description
app.py
CHANGED
@@ -243,15 +243,6 @@ with gr.Blocks(title="BLT vs BPE FLOPs Comparison") as demo:
|
|
243 |
gr.Markdown("""
|
244 |
# BLT vs BPE FLOPs Comparison
|
245 |
Companion blog post [can be found here](https://lucalp.dev/bitter-lesson-tokenization-and-blt).
|
246 |
-
|
247 |
-
For inspiration, have a look at the paper's [BLT architecture configurations](https://arxiv.org/html/2412.09871v1#:~:text=%5Cbeginappendix-,11,Table%C2%A010%20shows%20different%20hyper%20parameter%20settings%20for%20BLT%20models.,-Encoder) for some inspiration.
|
248 |
-
|
249 |
-
A few things you'll notice:
|
250 |
-
1. Patch size reduces global model FLOPs but not local model
|
251 |
-
2. Increasing patch size and global model dimension doesn't change total FLOPs
|
252 |
-
3. In smaller BLTs, local models constitute a larger portion of the total FLOPs
|
253 |
-
|
254 |
-
Parameter counts are displayed below each bar.
|
255 |
""")
|
256 |
|
257 |
with gr.Row():
|
@@ -285,6 +276,15 @@ with gr.Blocks(title="BLT vs BPE FLOPs Comparison") as demo:
|
|
285 |
)
|
286 |
|
287 |
gr.Markdown("""
|
288 |
A core
|
289 |
hypothesis of the paper is "that larger models taking fewer steps on larger patches
|
290 |
might perform better than smaller models taking more steps." [source](https://arxiv.org/html/2412.09871v1#:~:text=the%20hypothesis%20that%20larger%20models%20taking%20fewer%20steps%20on%20larger%20patches%20might%20perform%20better%20than%20smaller%20models%20taking%20more%20steps)
|
|
|
243 |
gr.Markdown("""
|
244 |
# BLT vs BPE FLOPs Comparison
|
245 |
Companion blog post [can be found here](https://lucalp.dev/bitter-lesson-tokenization-and-blt).
|
246 |
""")
|
247 |
|
248 |
with gr.Row():
|
|
|
276 |
)
|
277 |
|
278 |
gr.Markdown("""
|
279 |
+
For inspiration, have a look at the paper's [BLT architecture configurations](https://arxiv.org/html/2412.09871v1#:~:text=%5Cbeginappendix-,11,Table%C2%A010%20shows%20different%20hyper%20parameter%20settings%20for%20BLT%20models.,-Encoder) for some inspiration.
|
280 |
+
|
281 |
+
A few things you'll notice:
|
282 |
+
1. Patch size reduces global model FLOPs but not local model
|
283 |
+
2. Increasing patch size and global model dimension doesn't change total FLOPs
|
284 |
+
3. In smaller BLTs, local models constitute a larger portion of the total FLOPs
|
285 |
+
|
286 |
+
Parameter counts are displayed below each bar.
|
287 |
+
|
288 |
A core
|
289 |
hypothesis of the paper is "that larger models taking fewer steps on larger patches
|
290 |
might perform better than smaller models taking more steps." [source](https://arxiv.org/html/2412.09871v1#:~:text=the%20hypothesis%20that%20larger%20models%20taking%20fewer%20steps%20on%20larger%20patches%20might%20perform%20better%20than%20smaller%20models%20taking%20more%20steps)
|
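The three observations in the moved description can be checked with a rough back-of-the-envelope model. The sketch below is not the calculation used in `app.py`; it assumes the common approximation that a transformer's forward pass costs about 2 FLOPs per non-embedding parameter per processed position, with roughly `12 * n_layers * d_model**2` such parameters, and it ignores attention-score and embedding costs. The function names and the layer/dimension values are hypothetical, chosen only for illustration.

```python
def transformer_flops_per_position(n_layers: int, d_model: int) -> float:
    """Rough forward FLOPs per processed position (assumption: 2 * non-embedding params)."""
    params = 12 * n_layers * d_model ** 2  # standard rough parameter count
    return 2.0 * params


def blt_flops_per_byte(patch_size, global_layers, global_dim, local_layers, local_dim):
    """Return (global, local) FLOPs per byte under the assumptions above.

    The global model runs once per patch, so its cost is amortized over
    patch_size bytes. The local models run once per byte.
    """
    global_per_byte = transformer_flops_per_position(global_layers, global_dim) / patch_size
    local_per_byte = transformer_flops_per_position(local_layers, local_dim)
    return global_per_byte, local_per_byte


def local_share(patch_size, global_layers, global_dim, local_layers, local_dim):
    """Fraction of total per-byte FLOPs spent in the local models."""
    g, l = blt_flops_per_byte(patch_size, global_layers, global_dim, local_layers, local_dim)
    return l / (g + l)


# Hypothetical configurations (not taken from the paper or the app).
base = blt_flops_per_byte(4, 24, 1024, 8, 512)
bigger_patch = blt_flops_per_byte(8, 24, 1024, 8, 512)
# Patch size x4 together with global dimension x2 leaves d_model**2 / patch_size unchanged.
traded = blt_flops_per_byte(16, 24, 2048, 8, 512)

small_blt_share = local_share(4, 8, 512, 8, 512)
large_blt_share = local_share(4, 32, 4096, 8, 512)
```

Under these assumptions, `bigger_patch` halves the global term while the local term stays fixed (observation 1), `traded` matches `base` exactly (observation 2, which holds when patch size and global dimension grow together in the right ratio), and `small_blt_share` exceeds `large_blt_share` (observation 3).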