Spaces:
Running
Running
# Understanding Column Mapping | |
Column mapping is a critical setup process in AutoTrain that informs the system | |
about the roles of different columns in your dataset. Whether it's a tabular | |
dataset, text classification data, or another type, the need for precise | |
column mapping ensures that AutoTrain processes each dataset element correctly. | |
## How Column Mapping Works | |
AutoTrain has no way of knowing what the columns in your dataset represent. | |
AutoTrain requires a clear understanding of each column's function within | |
your dataset to train models effectively. This is managed through a | |
straightforward mapping system in the user interface, represented as a dictionary. | |
Here's a typical example: | |
``` | |
{"text": "text", "label": "target"} | |
``` | |
In this example, the `text` column in your dataset corresponds to the text data | |
AutoTrain uses for processing, and the `target` column is treated as the | |
label for training. | |
But let's not get confused! AutoTrain has a way to understand what each column in your dataset represents. | |
If your data is already in AutoTrain format, you dont need to change column mappings. | |
If not, you can easily map the columns in your dataset to the correct AutoTrain format. | |
In the UI, you will see column mapping as a dictionary: | |
``` | |
{"text": "text", "label": "target"} | |
``` | |
Here, the column `text` in your dataset is mapped to the AutoTrain column `text`, | |
and the column `target` in your dataset is mapped to the AutoTrain column `label`. | |
Let's say you are training a text classification model and your dataset has the following columns: | |
``` | |
full_text, target_sentiment | |
"this movie is great", positive | |
"this movie is bad", negative | |
``` | |
You can map these columns to the AutoTrain format as follows: | |
``` | |
{"text": "full_text", "label": "target_sentiment"} | |
``` | |
If your dataset has the columns: `text` and `label`, you don't need to change the column mapping. | |
Let's take a look at column mappings for each task: | |
## LLM | |
Note: For all LLM tasks, if the text column(s) is not formatted i.e. if contains samples in chat format (dict or json), then you | |
should use `chat_template` parameter. Read more about it in LLM Parameters Section. | |
### SFT / Generic Trainer | |
``` | |
{"text": "text"} | |
``` | |
`text`: The column in your dataset that contains the text data. | |
### Reward Trainer | |
``` | |
{"text": "text", "rejected_text": "rejected_text"} | |
``` | |
`text`: The column in your dataset that contains the text data. | |
`rejected_text`: The column in your dataset that contains the rejected text data. | |
### DPO / ORPO Trainer | |
``` | |
{"prompt": "prompt", "text": "text", "rejected_text": "rejected_text"} | |
``` | |
`prompt`: The column in your dataset that contains the prompt data. | |
`text`: The column in your dataset that contains the text data. | |
`rejected_text`: The column in your dataset that contains the rejected text data. | |
## Text Classification & Regression, Seq2Seq | |
For text classification and regression, the column mapping should be as follows: | |
``` | |
{"text": "dataset_text_column", "label": "dataset_target_column"} | |
``` | |
`text`: The column in your dataset that contains the text data. | |
`label`: The column in your dataset that contains the target variable. | |
## Token Classification | |
``` | |
{"text": "tokens", "label": "tags"} | |
``` | |
`text`: The column in your dataset that contains the tokens. These tokens must be a list of strings. | |
`label`: The column in your dataset that contains the tags. These tags must be a list of strings. | |
For token classification, if you are using a CSV, make sure that the columns are stringified lists. | |
## Tabular Classification & Regression | |
``` | |
{"id": "id", "label": ["target"]} | |
``` | |
`id`: The column in your dataset that contains the unique identifier for each row. | |
`label`: The column in your dataset that contains the target variable. This should be a list of strings. | |
For a single target column, you can pass a list with a single element. | |
For multiple target columns, e.g. a multi label classification task, you can pass a list with multiple elements. | |
# Image Classification | |
For image classification, the column mapping should be as follows: | |
``` | |
{"image": "image_column", "label": "label_column"} | |
``` | |
Image classification requires column mapping only when you are using a dataset from Hugging Face Hub. | |
For uploaded datasets, leave column mapping as it is. | |
# Sentence Transformers | |
For all sentence transformers tasks, one needs to map columns to `sentence1_column`, `sentence2_column`, `sentence3_column` & `target_column` column. | |
Not all columns need to be mapped for all trainers of sentence transformers. | |
## `pair`: | |
``` | |
{"sentence1_column": "anchor", "sentence2_column": "positive"} | |
``` | |
## `pair_class`: | |
``` | |
{"sentence1_column": "premise", "sentence2_column": "hypothesis", "target_column": "label"} | |
``` | |
## `pair_score`: | |
``` | |
{"sentence1_column": "sentence1", "sentence2_column": "sentence2", "target_column": "score"} | |
``` | |
## `triplet`: | |
``` | |
{"sentence1_column": "anchor", "sentence2_column": "positive", "sentence3_column": "negative"} | |
``` | |
## `qa`: | |
``` | |
{"sentence1_column": "query", "sentence2_column": "answer"} | |
``` | |
# Extractive Question Answering | |
For extractive question answering, the column mapping should be as follows: | |
``` | |
{"text": "context", "question": "question", "answer": "answers"} | |
``` | |
where `answer` is a dictionary with keys `text` and `answer_start`. | |
## Ensuring Accurate Mapping | |
To ensure your model trains correctly: | |
- Verify Column Names: Double-check that the names used in the mapping dictionary accurately reflect those in your dataset. | |
- Format Appropriately: Especially in token classification, ensure your data format matches expectations (e.g., lists of strings). | |
- Update Mappings for New Datasets: Each new dataset might require its unique mappings based on its structure and the task at hand. | |
By following these guidelines and using the provided examples as templates, | |
you can effectively instruct AutoTrain on how to interpret and handle your | |
data for various machine learning tasks. This process is fundamental for | |
achieving optimal results from your model training endeavors. | |