Support loading datasets saved via save_to_disk (#1432) e634118 unverified Keith Stevens committed on Mar 29, 2024
fix(dataset): normalize tokenizer config and change hash from tokenizer class to tokenizer path (#1298) ff939d8 unverified Nanobit committed on Mar 25, 2024
Add a config not to shuffle merged dataset (#1394) [skip ci] 43bdc5d unverified seungduk winglian committed on Mar 19, 2024
Support user-defined prompt processing strategies for dpo (#1248) 1e3d530 unverified nopperl winglian committed on Feb 26, 2024
relora: magnitude pruning of the optimizer (#1245) 8c2e05a unverified winglian committed on Feb 6, 2024
Fix and document test_datasets (#1228) 5787e1a unverified DreamGenX winglian committed on Jan 31, 2024
make sure to register the base chatml template even if no system message is provided (#1207) badda37 unverified winglian committed on Jan 25, 2024
more dpo fixes for dataset loading and docs (#1185) [skip ci] 5bce45f unverified winglian committed on Jan 24, 2024
support for explicit test_dataset definition for evals (#786) cda52dc unverified winglian committed on Jan 23, 2024
feat(dataset): add config to keep processed dataset in memory (#1152) 3db5f2f unverified Nanobit committed on Jan 20, 2024
fix(preprocess): Make sure dataset not loaded from cache when using preprocess cli (#1136) 1e56b88 unverified Nanobit committed on Jan 17, 2024
Efficiently get the length of the tokenized docs (#1063) 81d3845 unverified ricdomolm winglian committed on Jan 8, 2024
streaming multipack for pretraining dataset (#959) 553c80f unverified whooray [email protected] winglian committed on Jan 6, 2024
Update data.py for signature generation (#851) 48630f5 unverified MilesQLi winglian committed on Nov 15, 2023
update table for rwkv4 support, fix process count for dataset (#822) cdc71f7 unverified winglian committed on Nov 5, 2023
catch ConnectionError when checking dataset from HuggingFace (#743) 992d57f unverified Napuh committed on Oct 19, 2023
improve handling of the prepared ds path and other cfg defaults (#701) 1c412c7 unverified winglian committed on Oct 13, 2023
Fix: Future deprecation warning with use_auth_token (#680) 69fac9a unverified Nanobit committed on Oct 5, 2023
prepared dataset caching, other misc fixes (#665) e50a64e unverified winglian committed on Oct 3, 2023
Feat(data): Allow loading local csv and text (#594) 00dce35 unverified Nanobit committed on Sep 17, 2023
support custom field for completion from yml (#580) f7a2263 unverified winglian committed on Sep 15, 2023
remove columns after tokenizing for pretraining (#571) 1157950 unverified winglian committed on Sep 14, 2023
Fix pretraining with iterable/streaming Dataset (#556) 2f586d1 unverified Jan Philipp Harries committed on Sep 13, 2023
support user defined prompters, pretokenized datasets in config, local parquet, local arrow files (#348) d2e7f27 unverified winglian committed on Aug 20, 2023
use context manager to run things on rank0 before others (#397) fc2d6be unverified winglian committed on Aug 15, 2023
Attention mask and position id fixes for packing (#285) 2bb0b78 unverified winglian committed on Aug 12, 2023
experimental llama 2 chat support (#296) 3392270 unverified Jan Philipp Harries committed on Aug 6, 2023