feature request: download as Parquet?
CSV is OK, but Parquet is Hub-native :)
nice!!! will use it
Not sure what the app uses for the Parquet export. It's likely in JS and therefore unlikely to use the Parquet writer for Xet from Arrow
Exactly, it's all TS code.
Current export is straightforward/simplistic.
- Data is stored in DuckDB.
- We export the file locally:
COPY (SELECT ...) TO 'file.parquet' (FORMAT PARQUET)
- We push the file to the hub with the hub js library
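For anyone wanting to reproduce the flow, here is a minimal sketch of the two steps above. It is not the app's actual code: `RunSql`, `Upload`, `buildParquetCopy`, and `exportAndPush` are hypothetical names, and the DuckDB connection and the Hub upload call are passed in as plain functions, so the sketch assumes nothing about either library's API.

```typescript
// Sketch only: all names below are hypothetical, not the app's real code.

// Runs one SQL statement against DuckDB (assumed to be supplied by the caller).
type RunSql = (sql: string) => Promise<void>;

// Uploads a local file to a path in a Hub repo (assumed to wrap the hub JS library).
type Upload = (localPath: string, pathInRepo: string) => Promise<void>;

// Build the DuckDB statement that writes a query result to a Parquet file.
function buildParquetCopy(selectSql: string, outPath: string): string {
  // Double any single quotes so the path stays a valid SQL string literal.
  const quotedPath = outPath.replace(/'/g, "''");
  return `COPY (${selectSql}) TO '${quotedPath}' (FORMAT PARQUET)`;
}

// Step 1: export the query result locally. Step 2: push the file to the Hub.
async function exportAndPush(
  runSql: RunSql,
  upload: Upload,
  selectSql: string,
  outPath: string,
  pathInRepo: string,
): Promise<void> {
  await runSql(buildParquetCopy(selectSql, outPath));
  await upload(outPath, pathInRepo);
}
```

In a real app, `runSql` would presumably wrap a DuckDB connection and `upload` would wrap the hub JS library's upload function. Injecting them keeps the export logic easy to test without a database or network access.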
Here's an example of the result: https://huggingface.co/datasets/dvilasuero/hands_playing_instruments
Thanks!
@lhoestq, does the datasets library make any special preparations before sending Parquet files? Or is it enough to use huggingface_hub + hf_xet?
> and therefore unlikely to use the Parquet writer for Xet from Arrow
Is there any way to bring this downstream into a JS implementation or into DuckDB, or even to compile it to WASM on its own?
The JavaScript Parquet implementations don't seem to be maintained, especially for Parquet writing.
The most straightforward way would be to support Parquet CDC (content-defined chunking) writing in parquet-wasm by adding support to the Rust Parquet implementation. That would also expand support across the Rust/DataFusion ecosystem.
To support Parquet CDC writing in DuckDB, we would need to submit a PR to DuckDB's own Parquet implementation.