Test network using [Tensor Product Attention](https://arxiv.org/abs/2501.06425). Other than some alterations to the attention, such as 16 heads instead of 9 and the use of TPA, this is the same setup as https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct
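A minimal NumPy sketch of the query-side factorization described in the TPA paper, using illustrative dimensions (not this model's actual configuration): each token's per-head query matrix is built as an average of rank-`R` outer products between a head-axis factor and a feature-axis factor.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions only -- not the trained model's config
d_model, n_heads, d_head, rank = 64, 16, 8, 2
seq_len = 4

x = rng.standard_normal((seq_len, d_model))

# TPA factorizes each token's per-head query matrix as
# Q_t = (1/R) * sum_r a_r(x_t) (outer) b_r(x_t)
W_a = rng.standard_normal((d_model, rank * n_heads)) / np.sqrt(d_model)
W_b = rng.standard_normal((d_model, rank * d_head)) / np.sqrt(d_model)

a = (x @ W_a).reshape(seq_len, rank, n_heads)  # head-axis factors
b = (x @ W_b).reshape(seq_len, rank, d_head)   # feature-axis factors

# Contract over the rank axis: Q has shape (seq, heads, d_head)
Q = np.einsum('trh,trd->thd', a, b) / rank
print(Q.shape)  # (4, 16, 8)
```

The same factorization is applied to keys and values in the paper; only the query side is sketched here.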

# Scripts:
- `inference.py` runs the model with some test prompts
- `test_train.py` runs with the exact configuration used to train this model and is the reproduction script. Data is assumed to be in JSONL format, one `{"text": "example text"}` object per line
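A quick sketch of loading data in the assumed JSONL layout (one JSON object with a `"text"` field per line); the sample strings here are hypothetical.

```python
import io
import json

# Hypothetical in-memory sample standing in for a .jsonl file
sample = io.StringIO('{"text": "example text"}\n{"text": "another line"}\n')

# One JSON object per line; pull out the "text" field from each
texts = [json.loads(line)["text"] for line in sample if line.strip()]
print(texts)  # ['example text', 'another line']
```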

# Notes:
Compared to the SmolLM2 control model, this is bordering on incoherent. Potentially this model size is too small to correctly leverage tensor product attention. It has clearly picked up on some ideas in language, but in terms of human-readable output it is generally worse than the control model, which uses GQA.

# Training Metrics

## Dataset Information
- Training data per epoch: 1 GB
- Total tokens trained: 48,261,120
- No synthetic data

## Training Results
- Final Train Loss: 3.0421
- Final Train Perplexity: 20.95

![image/png](https://cdn-uploads.huggingface.co/production/uploads/637f3b03932a61b89aefbf5c/8iTSQFvwgbn5or6LdNT9G.png)