Spaces:

hardiktiwari
/

tensora-autotrain

Running

File size: 6,107 Bytes

33d4721

# Understanding Column Mapping

Column mapping is a critical setup process in AutoTrain that informs the system 
about the roles of different columns in your dataset. Whether it's a tabular 
dataset, text classification data, or another type, the need for precise 
column mapping ensures that AutoTrain processes each dataset element correctly.

## How Column Mapping Works

AutoTrain has no way of knowing what the columns in your dataset represent. 
AutoTrain requires a clear understanding of each column's function within 
your dataset to train models effectively. This is managed through a 
straightforward mapping system in the user interface, represented as a dictionary. 
Here's a typical example:

```
{"text": "text", "label": "target"}
```

In this example, the `text` column in your dataset corresponds to the text data 
AutoTrain uses for processing, and the `target` column is treated as the 
label for training.

But let's not get confused! AutoTrain has a way to understand what each column in your dataset represents.
If your data is already in AutoTrain format, you dont need to change column mappings.
If not, you can easily map the columns in your dataset to the correct AutoTrain format.

In the UI, you will see column mapping as a dictionary:

```
{"text": "text", "label": "target"}
```

Here, the column `text` in your dataset is mapped to the AutoTrain column `text`, 
and the column `target` in your dataset is mapped to the AutoTrain column `label`.

Let's say you are training a text classification model and your dataset has the following columns:

```
full_text, target_sentiment
"this movie is great", positive
"this movie is bad", negative
```

You can map these columns to the AutoTrain format as follows:

```
{"text": "full_text", "label": "target_sentiment"}
```

If your dataset has the columns: `text` and `label`, you don't need to change the column mapping.

Let's take a look at column mappings for each task:

## LLM

Note: For all LLM tasks, if the text column(s) is not formatted i.e. if contains samples in chat format (dict or json), then you 
should use `chat_template` parameter. Read more about it in LLM Parameters Section.


### SFT / Generic Trainer

```
{"text": "text"}
```

`text`: The column in your dataset that contains the text data.


### Reward Trainer

```
{"text": "text", "rejected_text": "rejected_text"}
```

`text`: The column in your dataset that contains the text data.

`rejected_text`: The column in your dataset that contains the rejected text data.

### DPO / ORPO Trainer

```
{"prompt": "prompt", "text": "text", "rejected_text": "rejected_text"}
```

`prompt`: The column in your dataset that contains the prompt data.

`text`: The column in your dataset that contains the text data.

`rejected_text`: The column in your dataset that contains the rejected text data.


## Text Classification & Regression, Seq2Seq

For text classification and regression, the column mapping should be as follows:

```
{"text": "dataset_text_column", "label": "dataset_target_column"}
```

`text`: The column in your dataset that contains the text data.

`label`: The column in your dataset that contains the target variable.


## Token Classification


```
{"text": "tokens", "label": "tags"}
```

`text`: The column in your dataset that contains the tokens. These tokens must be a list of strings.

`label`: The column in your dataset that contains the tags. These tags must be a list of strings.

For token classification, if you are using a CSV, make sure that the columns are stringified lists.

## Tabular Classification & Regression

```
{"id": "id", "label": ["target"]}
```

`id`: The column in your dataset that contains the unique identifier for each row.

`label`: The column in your dataset that contains the target variable. This should be a list of strings.

For a single target column, you can pass a list with a single element.

For multiple target columns, e.g. a multi label classification task, you can pass a list with multiple elements.


# Image Classification

For image classification, the column mapping should be as follows:

```
{"image": "image_column", "label": "label_column"}
```

Image classification requires column mapping only when you are using a dataset from Hugging Face Hub.
For uploaded datasets, leave column mapping as it is.

# Sentence Transformers

For all sentence transformers tasks, one needs to map columns to `sentence1_column`, `sentence2_column`, `sentence3_column` & `target_column` column.
Not all columns need to be mapped for all trainers of sentence transformers.

## `pair`:

```
{"sentence1_column": "anchor", "sentence2_column": "positive"}
```

## `pair_class`:

```
{"sentence1_column": "premise", "sentence2_column": "hypothesis", "target_column": "label"}
```

## `pair_score`:

```
{"sentence1_column": "sentence1", "sentence2_column": "sentence2", "target_column": "score"}
```

## `triplet`:

```
{"sentence1_column": "anchor", "sentence2_column": "positive", "sentence3_column": "negative"}
```

## `qa`:

```
{"sentence1_column": "query", "sentence2_column": "answer"}
```


# Extractive Question Answering

For extractive question answering, the column mapping should be as follows:

```
{"text": "context", "question": "question", "answer": "answers"}
```

where `answer` is a dictionary with keys `text` and `answer_start`.


## Ensuring Accurate Mapping

To ensure your model trains correctly:

- Verify Column Names: Double-check that the names used in the mapping dictionary accurately reflect those in your dataset.

- Format Appropriately: Especially in token classification, ensure your data format matches expectations (e.g., lists of strings).

- Update Mappings for New Datasets: Each new dataset might require its unique mappings based on its structure and the task at hand.

By following these guidelines and using the provided examples as templates, 
you can effectively instruct AutoTrain on how to interpret and handle your 
data for various machine learning tasks. This process is fundamental for 
achieving optimal results from your model training endeavors.