---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
---
Pretrained from scratch on English-only text, with no synthetic data and no code: 3 epochs over 1 GB of data for the ~125M-parameter model.
This is a test network using [Tensor Product Attention](https://arxiv.org/abs/2501.06425) (TPA). Apart from the changes to the attention, namely using TPA and 16 heads instead of 9, this is the same setup as https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct
# Scripts:
- `inference.py` to run the model with some test prompts
- `test_train.py` is the reproduction script; it runs with the exact configuration used to train this model. Data is assumed to be in JSONL format with a `"text"` field, e.g. `"text":"example text", "text":"..."`
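
As a rough illustration of the data format described above, here is a minimal sketch of a JSONL reader. It assumes one JSON object per line with a `"text"` key. The function name `iter_jsonl_texts` is hypothetical and is not taken from `test_train.py`, which may load data differently.

```python
import json
from typing import Iterator


def iter_jsonl_texts(path: str) -> Iterator[str]:
    """Yield the "text" field from each non-empty line of a JSONL file.

    Assumption: every line is a JSON object such as {"text": "example text"}.
    """
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines between records
            record = json.loads(line)
            yield record["text"]
```

Streaming line by line keeps memory use flat even when the file is around 1 GB.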
# Notes:
One of the main reported benefits of TPA is at inference time, which this model does not really leverage, although you can probably fit a larger batch size than with traditional MHA/GQA. TPA did save about 5% on parameters, and that saving should grow considerably as the network size increases. The run time is very similar to MHA/GQA at this scale.
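
To make the inference-side benefit more concrete, here is a back-of-the-envelope sketch of per-token, per-layer KV cache size. It assumes the factorized cache described in the TPA paper, where each token stores `rank` pairs of a head-dimension factor and a feature-dimension factor for keys and for values. The head dimension and ranks below are illustrative assumptions; this card does not state the values used for this model.

```python
def mha_cache_per_token(n_heads: int, head_dim: int) -> int:
    """Cached values per token per layer for standard MHA (keys + values)."""
    return 2 * n_heads * head_dim


def gqa_cache_per_token(n_kv_heads: int, head_dim: int) -> int:
    """Same as MHA, but only the shared KV heads are cached."""
    return 2 * n_kv_heads * head_dim


def tpa_cache_per_token(n_heads: int, head_dim: int, rank_k: int, rank_v: int) -> int:
    """Cached values per token per layer when only the TPA factors are stored.

    Each rank contributes one factor of size n_heads and one of size head_dim.
    """
    return (rank_k + rank_v) * (n_heads + head_dim)


# 16 heads comes from this card; head_dim=36 and ranks of 2 are assumptions.
n_heads, head_dim = 16, 36
mha = mha_cache_per_token(n_heads, head_dim)
tpa = tpa_cache_per_token(n_heads, head_dim, rank_k=2, rank_v=2)
print(f"MHA: {mha} values/token/layer, TPA: {tpa} values/token/layer")
```

A smaller cache per token is what would allow a larger batch size or longer context in the same memory budget.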
# Training Metrics
## Dataset Information
- Training data per epoch: 1 GB
- Total tokens trained: 48,261,120
- No synthetic data
## Training Results
- Final Train Loss: 3.0421
- Final Train Perplexity: 20.95
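
The two figures above are consistent if the loss is the mean per-token cross-entropy in nats, in which case perplexity is simply its exponential. A small sketch under that assumption (the helper name `perplexity` is illustrative):

```python
import math


def perplexity(loss_nats: float) -> float:
    """Convert a mean cross-entropy loss (in nats per token) to perplexity."""
    return math.exp(loss_nats)


print(f"{perplexity(3.0421):.2f}")  # prints 20.95
```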

# Code
The code for Tensor Product Attention is available at: https://github.com/tensorgi/T6.