feature request: download as Parquet?

#1
by julien-c - opened

CSV is OK, but Parquet is Hub-native :)

Hugging Face Sheets org

Should be really easy to add! cc @frascuchon

Hugging Face Sheets org

It's done, @julien-c
(screenshot attached)

nice!!! will use it

julien-c changed discussion status to closed

@kszucs @lhoestq is there something special to check here in AISheets to make sure the exported Parquet will chunk/deduplicate nicely?

Not sure what the app uses for the Parquet export. It's likely in JS and therefore unlikely to use the Parquet writer for Xet from Arrow

Hugging Face Sheets org

Exactly, it's all TS code.

Current export is straightforward/simplistic.

  1. Data is stored in DuckDB.
  2. We export the file locally:
     `COPY (SELECT ...) TO 'file.parquet' (FORMAT PARQUET)`
  3. We push the file to the Hub with the hub JS library.

Here's an example of the result: https://huggingface.co/datasets/dvilasuero/hands_playing_instruments
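The three steps above could be sketched as follows in Python, assuming the `duckdb` and `huggingface_hub` packages (the actual app does this in TS; table, file, and repo names here are hypothetical):

```python
# Sketch of the export flow described above, assuming the duckdb and
# huggingface_hub Python packages. Table/repo names are hypothetical.


def export_sql(table: str, path: str) -> str:
    """Build the DuckDB COPY statement that writes a table to Parquet."""
    return f"COPY (SELECT * FROM {table}) TO '{path}' (FORMAT PARQUET)"


def export_and_push(table: str, repo_id: str) -> None:
    import duckdb  # assumed dependency
    from huggingface_hub import HfApi  # assumed dependency

    local_path = "file.parquet"
    con = duckdb.connect("data.db")             # 1. data lives in DuckDB
    con.execute(export_sql(table, local_path))  # 2. export the file locally
    HfApi().upload_file(                        # 3. push the file to the Hub
        path_or_fileobj=local_path,
        path_in_repo="data/train.parquet",
        repo_id=repo_id,
        repo_type="dataset",
    )
```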

Hugging Face Sheets org

Thanks, @lhoestq! Does the datasets library make any special preparations before uploading Parquet files, or is using huggingface_hub + hf_xet enough?

> and therefore unlikely to use the Parquet writer for Xet from Arrow

Is there any way to bring this into a JS implementation or into DuckDB, or even to compile it to WASM on its own?

The JavaScript Parquet implementations don't seem to be maintained, especially for Parquet writing.
The most straightforward way would be to support Parquet CDC writing in parquet-wasm by adding support to the Rust Parquet implementation. That would also extend support across the Rust/DataFusion ecosystem.

In order to support Parquet CDC writing in DuckDB, we would need to submit a PR to their own Parquet implementation.
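For context on why CDC matters here: Xet-style deduplication works best when file bytes are split at content-defined boundaries, so that identical regions chunk identically even after insertions shift offsets. A toy content-defined chunker might look like the sketch below (illustrative only, not the actual Xet or Parquet CDC algorithm; the window size, mask, and hash are arbitrary choices):

```python
# Toy content-defined chunking (CDC) sketch: emit a chunk boundary where a
# hash of the bytes since the last boundary matches a mask. Illustrative
# only -- not the actual Xet/Parquet CDC algorithm; real CDC uses a
# windowed Rabin-style fingerprint rather than this cumulative hash.

WINDOW = 16  # minimum chunk size (arbitrary for this sketch)
MASK = 0x3F  # boundary on average every 64 bytes (arbitrary)


def cdc_chunks(data: bytes) -> list[bytes]:
    chunks = []
    start = 0
    h = 0
    for i, b in enumerate(data):
        # Cumulative polynomial hash, reset at each boundary.
        h = (h * 31 + b) & 0xFFFFFFFF
        if i - start + 1 >= WINDOW and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks
```

Because boundaries depend on content rather than fixed offsets, unchanged regions of a rewritten Parquet file tend to produce the same chunks, which is what lets the Hub deduplicate them.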
