add docs for `input_output` format (#1367) [skip ci]
- README.md +9 -0
- docs/input_output.md +260 -0
    	
        README.md
    CHANGED
    
    | @@ -385,6 +385,15 @@ pretraining_dataset: # hf path only | |
| 385 |  | 
| 386 | 
             
            </details>
         | 
| 387 |  | 
| 388 | 
             
            ##### Conversation
         | 
| 389 |  | 
| 390 | 
             
            - `sharegpt`: conversations where `from` is `human`/`gpt`. (optional: first row with role `system` to override default system prompt)
         | 
|  | |
| 385 |  | 
| 386 | 
             
            </details>
         | 
| 387 |  | 
| 388 | 
            +
            ##### Template-Free
         | 
| 389 | 
            +
             | 
| 390 | 
            +
            - `input_output`: template-free prompt construction
         | 
| 391 | 
            +
              ```json
         | 
| 392 | 
            +
               {"segments": [{"label": true|false, "text": "..."}]}
         | 
| 393 | 
            +
              ```
         | 
| 394 | 
            +
             | 
| 395 | 
            +
            This is a special format that allows you to construct prompts without using templates. This is for advanced users who want more freedom with prompt construction.  See [these docs](docs/input_output.md) for more details.
         | 
| 396 | 
            +
             | 
| 397 | 
             
            ##### Conversation
         | 
| 398 |  | 
| 399 | 
             
            - `sharegpt`: conversations where `from` is `human`/`gpt`. (optional: first row with role `system` to override default system prompt)
         | 
    	
        docs/input_output.md
    ADDED
    
    | @@ -0,0 +1,260 @@ | |
| 1 | 
            +
            # Template-free prompt construction with the `input_output` format
         | 
| 2 | 
            +
             | 
| 3 | 
            +
            <!-- TOC -->
         | 
| 4 | 
            +
             | 
| 5 | 
            +
            - [Background](#background)
         | 
| 6 | 
            +
                - [Masking Inputs](#masking-inputs)
         | 
| 7 | 
            +
                - [You may not want prompt templates](#you-may-not-want-prompt-templates)
         | 
| 8 | 
            +
                - [The `input_output` format](#the-input_output-format)
         | 
| 9 | 
            +
            - [Usage](#usage)
         | 
| 10 | 
            +
                - [1. Prepare Data](#1-prepare-data)
         | 
| 11 | 
            +
                - [2. Use `type: input_output`](#2-use-type-input_output)
         | 
| 12 | 
            +
                - [3. Check the prompts](#3-check-the-prompts)
         | 
| 13 | 
            +
             | 
| 14 | 
            +
            <!-- /TOC -->
         | 
| 15 | 
            +
             | 
| 16 | 
            +
            <a id="markdown-background" name="background"></a>
         | 
| 17 | 
            +
             | 
| 18 | 
            +
            ## Background
         | 
| 19 | 
            +
             | 
| 20 | 
            +
            <a id="markdown-masking-inputs" name="masking-inputs"></a>
         | 
| 21 | 
            +
             | 
| 22 | 
            +
            ### Masking Inputs
         | 
| 23 | 
            +
             | 
| 24 | 
            +
            One of the most popular features of
         | 
| 25 | 
            +
            [axolotl](https://github.com/OpenAccess-AI-Collective/axolotl) is
         | 
| 26 | 
            +
            setting the following configuration value:
         | 
| 27 | 
            +
             | 
| 28 | 
            +
             | 
| 29 | 
            +
            ```yaml
         | 
| 30 | 
            +
            train_on_inputs: false
         | 
| 31 | 
            +
            ```
         | 
| 32 | 
            +
             | 
| 33 | 
            +
            If you declare a [dataset format](https://github.com/OpenAccess-AI-Collective/axolotl?tab=readme-ov-file#dataset)
         | 
| 34 | 
            +
            such as `alpaca` or `chatml`, axolotl knows what is an input
         | 
| 35 | 
            +
            (i.e. human) vs. an output (i.e. the assistant) and masks the input
         | 
| 36 | 
            +
            labels so that your model can focus on predicting the outputs only.
         | 
| 37 | 
            +
             | 
| 38 | 
            +
            <a id="markdown-you-may-not-want-prompt-templates" name="you-may-not-want-prompt-templates"></a>
         | 
| 39 | 
            +
             | 
| 40 | 
            +
            ### You may not want prompt templates
         | 
| 41 | 
            +
             | 
| 42 | 
            +
            However, there are many situations where you don't want to use one of
         | 
| 43 | 
            +
            these formats or templates (I usually don't!). This is because they can:
         | 
| 44 | 
            +
             | 
| 45 | 
            +
            -   Add unnecessary boilerplate to your prompts.
         | 
| 46 | 
            +
            -   Create artifacts like special delimiters `<|im_start|>` that can
         | 
| 47 | 
            +
                quickly become footguns if you don't include them correctly at
         | 
| 48 | 
            +
                inference time.
         | 
| 49 | 
            +
            -   Enforce a *chat* interface when you do not want one. Sometimes you
         | 
| 50 | 
            +
                just want to fine-tune a model to a very specific task and do NOT
         | 
| 51 | 
            +
                want multi-turn conversations, roles, etc.
         | 
| 52 | 
            +
            -   Limit you to only certain roles that the template allows.
         | 
| 53 | 
            +
             | 
| 54 | 
            +
            <a id="markdown-the-inputoutput-format" name="the-inputoutput-format"></a>
         | 
| 55 | 
            +
             | 
| 56 | 
            +
            ### The `input_output` format
         | 
| 57 | 
            +
             | 
| 58 | 
            +
            You can construct your prompts without a template by using the
         | 
| 59 | 
            +
            `input_output` format, by setting `type: input_output` in your
         | 
| 60 | 
            +
            configuration file like this:
         | 
| 61 | 
            +
             | 
| 62 | 
            +
            **config.yml**
         | 
| 63 | 
            +
             | 
| 64 | 
            +
            ```yaml
         | 
| 65 | 
            +
            train_on_inputs: false # Mask segments of your data
         | 
| 66 | 
            +
            datasets:
         | 
| 67 | 
            +
              - path: output.jsonl
         | 
| 68 | 
            +
                type: input_output  # use template-free prompt construction
         | 
| 69 | 
            +
            ```
         | 
| 70 | 
            +
             | 
| 71 | 
            +
            Unlike `type: completion`, which is also template-free,
         | 
| 72 | 
            +
            `type: input_output` allows you to mask segments of your text. More
         | 
| 73 | 
            +
            details on how this works are described below.
         | 
| 74 | 
            +
             | 
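To make the masking idea concrete, here is a minimal sketch of how per-segment labels could be expanded into token-level training labels. It is not axolotl's actual implementation: `toy_tokenize` and `build_example` are hypothetical names, and the whitespace "tokenizer" only stands in for a real one. The only details taken from this document are the `segments` / `label` / `text` structure and the use of `-100` as the ignored label.

```python
IGNORE_INDEX = -100  # label value that is skipped during training


def toy_tokenize(text, vocab):
    """Hypothetical stand-in for a real tokenizer: one id per whitespace-separated word."""
    return [vocab.setdefault(word, len(vocab) + 1) for word in text.split()]


def build_example(segments):
    """Concatenate segments in order; masked segments get IGNORE_INDEX as their labels."""
    vocab = {}
    input_ids, labels = [], []
    for seg in segments:
        ids = toy_tokenize(seg["text"], vocab)
        input_ids.extend(ids)
        # Trained segments copy their token ids into the labels; masked ones do not.
        labels.extend(ids if seg["label"] else [IGNORE_INDEX] * len(ids))
    return {"input_ids": input_ids, "labels": labels}


example = build_example([
    {"label": True, "text": "Hello hi there"},
    {"label": False, "text": "goodbye"},
    {"label": True, "text": "farewell"},
])
print(example["input_ids"])  # [1, 2, 3, 4, 5]
print(example["labels"])     # [1, 2, 3, -100, 5]
```

Every token stays in `input_ids`, so the model still sees the masked text as context. Only the labels change.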
| 75 | 
            +
            <a id="markdown-usage" name="usage"></a>
         | 
| 76 | 
            +
             | 
| 77 | 
            +
            ## Usage
         | 
| 78 | 
            +
             | 
| 79 | 
            +
            This is how you can use the `input_output` format:
         | 
| 80 | 
            +
             | 
| 81 | 
            +
            <a id="markdown-1-prepare-data" name="1-prepare-data"></a>
         | 
| 82 | 
            +
             | 
| 83 | 
            +
            ### 1. Prepare Data
         | 
| 84 | 
            +
             | 
| 85 | 
            +
            To use the `input_output` format, collect your data in the following
         | 
| 86 | 
            +
            format into a jsonl file (below is the first row from the file
         | 
| 87 | 
            +
            `output.jsonl`, pretty-printed):
         | 
| 88 | 
            +
             | 
| 89 | 
            +
            ```bash
         | 
| 90 | 
            +
            $ head -n1 output.jsonl | python -m json.tool
         | 
| 91 | 
            +
             | 
| 92 | 
            +
         | 
| 93 | 
            +
                {
         | 
| 94 | 
            +
                    "segments": [
         | 
| 95 | 
            +
                        {
         | 
| 96 | 
            +
                            "label": true,
         | 
| 97 | 
            +
                            "text": "<s>Hello\n"
         | 
| 98 | 
            +
                        },
         | 
| 99 | 
            +
                        {
         | 
| 100 | 
            +
                            "label": true,
         | 
| 101 | 
            +
                            "text": "hi there!. "
         | 
| 102 | 
            +
                        },
         | 
| 103 | 
            +
                        {
         | 
| 104 | 
            +
                            "label": false,
         | 
| 105 | 
            +
                            "text": "goodbye "
         | 
| 106 | 
            +
                        },
         | 
| 107 | 
            +
                        {
         | 
| 108 | 
            +
                            "label": true,
         | 
| 109 | 
            +
                            "text": "farewell</s>"
         | 
| 110 | 
            +
                        }
         | 
| 111 | 
            +
                    ]
         | 
| 112 | 
            +
                }
         | 
| 113 | 
            +
            ```
         | 
| 114 | 
            +
             | 
| 115 | 
            +
            Set `"label": false` when you want to mask a segment of text so that the
         | 
| 116 | 
            +
            model isn't trained on it. Some things to keep in mind:
         | 
| 117 | 
            +
             | 
| 118 | 
            +
            > [!IMPORTANT]
         | 
| 119 | 
            +
            > 1.  **EOS, BOS, spaces, newlines etc. are entirely up to you. Axolotl
         | 
| 120 | 
            +
                concatenates all the segments as-is.** The tokenizer doesn't add
         | 
| 121 | 
            +
                anything additional. Notice how I added spaces, newlines, `<s>`
         | 
| 122 | 
            +
                (BOS), and `</s>` (EOS) myself.
         | 
| 123 | 
            +
            > 2.  Make sure you check the materialized output to validate that the
         | 
| 124 | 
            +
                prompt is getting assembled how you like.
         | 
| 125 | 
            +
             | 
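One possible way to produce such a file is sketched below, using only the Python standard library. The segment contents are copied from the example row above. Writing a single row to `output.jsonl` in the current directory is an assumption made for illustration, as a real dataset would have one such line per training example.

```python
import json

segments = [
    {"label": True, "text": "<s>Hello\n"},
    {"label": True, "text": "hi there!. "},
    {"label": False, "text": "goodbye "},   # masked: not trained on
    {"label": True, "text": "farewell</s>"},
]

# One JSON object per line; json.dumps escapes the newline inside the text as \n.
line = json.dumps({"segments": segments})
with open("output.jsonl", "w", encoding="utf-8") as f:
    f.write(line + "\n")

# Segments are joined as-is, so BOS/EOS, spaces and newlines must already be in the text.
prompt = "".join(seg["text"] for seg in segments)
print(repr(prompt))  # '<s>Hello\nhi there!. goodbye farewell</s>'
```

Printing the joined string with `repr` makes stray or missing whitespace easy to spot before any tokenization happens.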
| 126 | 
            +
            <a id="markdown-2-use-type-inputoutput" name="2-use-type-inputoutput"></a>
         | 
| 127 | 
            +
             | 
| 128 | 
            +
            ### 2. Use `type: input_output`
         | 
| 129 | 
            +
             | 
| 130 | 
            +
            Let's materialize data with our `output.jsonl` file by setting
         | 
| 131 | 
            +
            `type: input_output` in our axolotl config:
         | 
| 132 | 
            +
             | 
| 133 | 
            +
            ```yaml
         | 
| 134 | 
            +
            # training_config.yaml
         | 
| 135 | 
            +
            base_model: mistralai/Mistral-7B-v0.1
         | 
| 136 | 
            +
            data_seed: 49
         | 
| 137 | 
            +
            seed: 49
         | 
| 138 | 
            +
             | 
| 139 | 
            +
            datasets:
         | 
| 140 | 
            +
              - path: output.jsonl
         | 
| 141 | 
            +
                type: input_output
         | 
| 142 | 
            +
            val_set_size: 0.1
         | 
| 143 | 
            +
             | 
| 144 | 
            +
            sequence_len: 896
         | 
| 145 | 
            +
            sample_packing: false
         | 
| 146 | 
            +
             | 
| 147 | 
            +
            micro_batch_size: 2
         | 
| 148 | 
            +
            gradient_accumulation_steps: 3
         | 
| 149 | 
            +
            eval_batch_size: 2
         | 
| 150 | 
            +
            num_epochs: 1
         | 
| 151 | 
            +
            learning_rate: 0.0002
         | 
| 152 | 
            +
             | 
| 153 | 
            +
            train_on_inputs: false
         | 
| 154 | 
            +
            special_tokens:
         | 
| 155 | 
            +
              bos_token: "<s>"
         | 
| 156 | 
            +
              eos_token: "</s>"
         | 
| 157 | 
            +
              unk_token: "<unk>"
         | 
| 158 | 
            +
            ```
         | 
| 159 | 
            +
             | 
| 160 | 
            +
            You can use the following command to materialize your data. The
         | 
| 161 | 
            +
            `--debug` flag will print the tokens, along with the labels so you can
         | 
| 162 | 
            +
            verify that the correct items are being ignored:
         | 
| 163 | 
            +
             | 
| 164 | 
            +
            ```bash
         | 
| 165 | 
            +
            $ python -m axolotl.cli.preprocess training_config.yaml --debug
         | 
| 166 | 
            +
             | 
| 167 | 
            +
            ...
         | 
| 168 | 
            +
            [2024-03-05 23:36:46,969] [INFO] [axolotl.check_example_labels:35] [PID:607731] [RANK:0] <s>(1, 1) Hello(22557, 22557)
         | 
| 169 | 
            +
            (13, 13) hi(12014, 12014) there(736, 736) !(28808, 28808) .(28723, 28723) (28705, 28705) good(-100, 1179) bye(-100, 17664) (-100, 28705) fare(19111, 19111) well(5458, 5458) </s>(2, 2)
         | 
| 170 | 
            +
             | 
| 171 | 
            +
            ```
         | 
| 172 | 
            +
             | 
| 173 | 
            +
            The format is `decoded_token`(`label`, `token_id`). For example,
         | 
| 174 | 
            +
            `<s>(1, 1)` means that the token is `<s>`, the label is `1` and the
         | 
| 175 | 
            +
            token_id is `1`. When the label is `-100` then that token is ignored for
         | 
| 176 | 
            +
            training.
         | 
| 177 | 
            +
             | 
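For longer examples, scanning this output by eye gets tedious. The sketch below parses a debug line into `(token, label, token_id)` tuples and lists the masked token ids. `parse_debug` and the regular expression are hypothetical helpers, not part of axolotl. They assume the `token(label, token_id)` layout shown above and would misparse tokens that themselves contain parentheses.

```python
import re

# Sample copied from the debug output above (log prefix omitted).
debug_line = (
    "<s>(1, 1) Hello(22557, 22557) \n(13, 13) hi(12014, 12014) there(736, 736) "
    "!(28808, 28808) .(28723, 28723) (28705, 28705) good(-100, 1179) "
    "bye(-100, 17664) (-100, 28705) fare(19111, 19111) well(5458, 5458) </s>(2, 2)"
)

# token text (possibly empty for whitespace tokens), then "(label, token_id)"
PAIR = re.compile(r"(\S*?)\((-?\d+), (\d+)\)")


def parse_debug(line):
    """Return a list of (token, label, token_id) tuples found in a debug line."""
    return [(tok, int(label), int(tid)) for tok, label, tid in PAIR.findall(line)]


parsed = parse_debug(debug_line)
masked_ids = [tid for _, label, tid in parsed if label == -100]
print(len(parsed))   # 14
print(masked_ids)    # [1179, 17664, 28705]
```

Here the three masked ids correspond to `good`, `bye`, and the trailing space of the `"goodbye "` segment, which is the only segment with `"label": false`.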
| 178 | 
            +
            <a id="markdown-3-check-the-prompts" name="3-check-the-prompts"></a>
         | 
| 179 | 
            +
             | 
| 180 | 
            +
            ### 3. Check the prompts
         | 
| 181 | 
            +
             | 
| 182 | 
            +
            Here is another way to check the materialized output:
         | 
| 183 | 
            +
             | 
| 184 | 
            +
            ```python
         | 
| 185 | 
            +
            from transformers import AutoTokenizer
         | 
| 186 | 
            +
            from datasets import load_from_disk
         | 
| 187 | 
            +
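            import os
            import yaml
         | 
| 188 | 
            +
             | 
| 189 | 
            +
            # List the prepared dataset folders (plain-Python replacement for the IPython-only `!ls last_run_prepared/`)
            directory = sorted(os.listdir('last_run_prepared'))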
         | 
| 190 | 
            +
            with open('training_config.yaml', 'r') as f:
         | 
| 191 | 
            +
                cfg = yaml.safe_load(f)
         | 
| 192 | 
            +
            model_id = cfg['base_model']
         | 
| 193 | 
            +
            tok = AutoTokenizer.from_pretrained(model_id)
         | 
| 194 | 
            +
            ds = load_from_disk(f'last_run_prepared/{directory[0]}/')
         | 
| 195 | 
            +
            ```
         | 
| 196 | 
            +
             | 
| 197 | 
            +
            ```python
         | 
| 198 | 
            +
            >>> row = ds[0]
         | 
| 199 | 
            +
            >>> print(tok.decode(row['input_ids']))
         | 
| 200 | 
            +
            <s> Hello
         | 
| 201 | 
            +
                hi there!.  goodbye  farewell</s>
         | 
| 202 | 
            +
            ```
         | 
| 203 | 
            +
             | 
| 204 | 
            +
            We can check that the right tokens are ignored by comparing the labels
         | 
| 205 | 
            +
            to each token:
         | 
| 206 | 
            +
             | 
| 207 | 
            +
            ```python
         | 
| 208 | 
            +
            import pandas as pd
         | 
| 209 | 
            +
            pd.DataFrame([{'token': tok.decode(i), 'label': l, 'id':i} for i,l in
         | 
| 210 | 
            +
                          zip(row['input_ids'], row['labels'])])
         | 
| 211 | 
            +
            ```
         | 
| 212 | 
            +
             | 
| 213 | 
            +
            |    | token  | label | id    |
         | 
| 214 | 
            +
            |----|--------|-------|-------|
         | 
| 215 | 
            +
            | 0  | \<s\>  | 1     | 1     |
         | 
| 216 | 
            +
            | 1  | Hello  | 22557 | 22557 |
         | 
| 217 | 
            +
            | 2  | \\n    | 13    | 13    |
         | 
| 218 | 
            +
            | 3  | hi     | 12014 | 12014 |
         | 
| 219 | 
            +
            | 4  | there  | 736   | 736   |
         | 
| 220 | 
            +
            | 5  | !      | 28808 | 28808 |
         | 
| 221 | 
            +
            | 6  | .      | 28723 | 28723 |
         | 
| 222 | 
            +
            | 7  |        | 28705 | 28705 |
         | 
| 223 | 
            +
            | 8  | good   | -100  | 1179  |
         | 
| 224 | 
            +
            | 9  | bye    | -100  | 17664 |
         | 
| 225 | 
            +
            | 10 |        | -100  | 28705 |
         | 
| 226 | 
            +
            | 11 | fare   | 19111 | 19111 |
         | 
| 227 | 
            +
            | 12 | well   | 5458  | 5458  |
         | 
| 228 | 
            +
            | 13 | \</s\> | 2     | 2     |
         | 
| 229 | 
            +
             | 
| 230 | 
            +
             | 
| 231 | 
            +
             | 
| 232 | 
            +
            If we look at the input data, the above table seems correct! (The jsonl
         | 
| 233 | 
            +
            version is repeated below for reference):
         | 
| 234 | 
            +
             | 
| 235 | 
            +
             | 
| 236 | 
            +
            ```bash
         | 
| 237 | 
            +
            $ head -n1 output.jsonl | python -m json.tool
         | 
| 238 | 
            +
             | 
| 239 | 
            +
         | 
| 240 | 
            +
                {
         | 
| 241 | 
            +
                    "segments": [
         | 
| 242 | 
            +
                        {
         | 
| 243 | 
            +
                            "label": true,
         | 
| 244 | 
            +
                            "text": "<s>Hello\n"
         | 
| 245 | 
            +
                        },
         | 
| 246 | 
            +
                        {
         | 
| 247 | 
            +
                            "label": true,
         | 
| 248 | 
            +
                            "text": "hi there!. "
         | 
| 249 | 
            +
                        },
         | 
| 250 | 
            +
                        {
         | 
| 251 | 
            +
                            "label": false,
         | 
| 252 | 
            +
                            "text": "goodbye "
         | 
| 253 | 
            +
                        },
         | 
| 254 | 
            +
                        {
         | 
| 255 | 
            +
                            "label": true,
         | 
| 256 | 
            +
                            "text": "farewell</s>"
         | 
| 257 | 
            +
                        }
         | 
| 258 | 
            +
                    ]
         | 
| 259 | 
            +
                }
         | 
| 260 | 
            +
            ```
         |