We LZ4-compress everything automatically, but when we encounter a model file, we also apply a format-agnostic byte grouping (inspired by ZipNN) before the LZ4 step. Empirically, this saves about 20%.
https://github.com/huggingface/xet-core/blob/main/cas_object/src/byte_grouping/bg4.rs
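To make the idea concrete, here is a minimal sketch of 4-way byte grouping in Python. It is an illustration only, not the code in `bg4.rs`: the function names `group_bytes` and `ungroup_bytes` and the `width` parameter are made up for this example, and the real implementation may lay out lanes or handle trailing bytes differently. The intuition is that for float weights, bytes at the same position within each value (for example the exponent bytes) tend to look alike, so putting them next to each other gives the compressor longer repetitive runs.

```python
def group_bytes(data: bytes, width: int = 4) -> bytes:
    """Gather byte 0 of every `width`-byte word, then byte 1, and so on.

    Hypothetical helper for illustration; not the xet-core API.
    """
    return b"".join(data[i::width] for i in range(width))


def ungroup_bytes(grouped: bytes, width: int = 4) -> bytes:
    """Invert group_bytes, restoring the original interleaved order."""
    n = len(grouped)
    out = bytearray(n)
    start = 0
    for i in range(width):
        # Lane i holds every width-th byte of the original, starting at offset i.
        size = len(range(i, n, width))
        out[i::width] = grouped[start:start + size]
        start += size
    return bytes(out)
```

In a pipeline like the one described above, the grouped buffer would be what gets handed to the LZ4 compressor, and `ungroup_bytes` would run after decompression. The transform is lossless and needs no knowledge of the file format, which is what makes it format-agnostic.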
yuchenglow