Spaces:

hardiktiwari
/

tensora-autotrain

Running

App Files Files Community

tensora-autotrain / docs /source /col_map.mdx

hardiktiwari

Upload 244 files

33d4721 verified about 2 months ago

raw

history blame contribute delete

6.11 kB

	# Understanding Column Mapping

	Column mapping is a critical setup process in AutoTrain that informs the system
	about the roles of different columns in your dataset. Whether it's a tabular
	dataset, text classification data, or another type, the need for precise
	column mapping ensures that AutoTrain processes each dataset element correctly.

	## How Column Mapping Works

	AutoTrain has no way of knowing what the columns in your dataset represent.
	AutoTrain requires a clear understanding of each column's function within
	your dataset to train models effectively. This is managed through a
	straightforward mapping system in the user interface, represented as a dictionary.
	Here's a typical example:

	```
	{"text": "text", "label": "target"}
	```

	In this example, the `text` column in your dataset corresponds to the text data
	AutoTrain uses for processing, and the `target` column is treated as the
	label for training.

	But let's not get confused! AutoTrain has a way to understand what each column in your dataset represents.
	If your data is already in AutoTrain format, you dont need to change column mappings.
	If not, you can easily map the columns in your dataset to the correct AutoTrain format.

	In the UI, you will see column mapping as a dictionary:

	```
	{"text": "text", "label": "target"}
	```

	Here, the column `text` in your dataset is mapped to the AutoTrain column `text`,
	and the column `target` in your dataset is mapped to the AutoTrain column `label`.

	Let's say you are training a text classification model and your dataset has the following columns:

	```
	full_text, target_sentiment
	"this movie is great", positive
	"this movie is bad", negative
	```

	You can map these columns to the AutoTrain format as follows:

	```
	{"text": "full_text", "label": "target_sentiment"}
	```

	If your dataset has the columns: `text` and `label`, you don't need to change the column mapping.

	Let's take a look at column mappings for each task:

	## LLM

	Note: For all LLM tasks, if the text column(s) is not formatted i.e. if contains samples in chat format (dict or json), then you
	should use `chat_template` parameter. Read more about it in LLM Parameters Section.


	### SFT / Generic Trainer

	```
	{"text": "text"}
	```

	`text`: The column in your dataset that contains the text data.


	### Reward Trainer

	```
	{"text": "text", "rejected_text": "rejected_text"}
	```

	`text`: The column in your dataset that contains the text data.

	`rejected_text`: The column in your dataset that contains the rejected text data.

	### DPO / ORPO Trainer

	```
	{"prompt": "prompt", "text": "text", "rejected_text": "rejected_text"}
	```

	`prompt`: The column in your dataset that contains the prompt data.

	`text`: The column in your dataset that contains the text data.

	`rejected_text`: The column in your dataset that contains the rejected text data.


	## Text Classification & Regression, Seq2Seq

	For text classification and regression, the column mapping should be as follows:

	```
	{"text": "dataset_text_column", "label": "dataset_target_column"}
	```

	`text`: The column in your dataset that contains the text data.

	`label`: The column in your dataset that contains the target variable.


	## Token Classification


	```
	{"text": "tokens", "label": "tags"}
	```

	`text`: The column in your dataset that contains the tokens. These tokens must be a list of strings.

	`label`: The column in your dataset that contains the tags. These tags must be a list of strings.

	For token classification, if you are using a CSV, make sure that the columns are stringified lists.

	## Tabular Classification & Regression

	```
	{"id": "id", "label": ["target"]}
	```

	`id`: The column in your dataset that contains the unique identifier for each row.

	`label`: The column in your dataset that contains the target variable. This should be a list of strings.

	For a single target column, you can pass a list with a single element.

	For multiple target columns, e.g. a multi label classification task, you can pass a list with multiple elements.


	# Image Classification

	For image classification, the column mapping should be as follows:

	```
	{"image": "image_column", "label": "label_column"}
	```

	Image classification requires column mapping only when you are using a dataset from Hugging Face Hub.
	For uploaded datasets, leave column mapping as it is.

	# Sentence Transformers

	For all sentence transformers tasks, one needs to map columns to `sentence1_column`, `sentence2_column`, `sentence3_column` & `target_column` column.
	Not all columns need to be mapped for all trainers of sentence transformers.

	## `pair`:

	```
	{"sentence1_column": "anchor", "sentence2_column": "positive"}
	```

	## `pair_class`:

	```
	{"sentence1_column": "premise", "sentence2_column": "hypothesis", "target_column": "label"}
	```

	## `pair_score`:

	```
	{"sentence1_column": "sentence1", "sentence2_column": "sentence2", "target_column": "score"}
	```

	## `triplet`:

	```
	{"sentence1_column": "anchor", "sentence2_column": "positive", "sentence3_column": "negative"}
	```

	## `qa`:

	```
	{"sentence1_column": "query", "sentence2_column": "answer"}
	```


	# Extractive Question Answering

	For extractive question answering, the column mapping should be as follows:

	```
	{"text": "context", "question": "question", "answer": "answers"}
	```

	where `answer` is a dictionary with keys `text` and `answer_start`.


	## Ensuring Accurate Mapping

	To ensure your model trains correctly:

	- Verify Column Names: Double-check that the names used in the mapping dictionary accurately reflect those in your dataset.

	- Format Appropriately: Especially in token classification, ensure your data format matches expectations (e.g., lists of strings).

	- Update Mappings for New Datasets: Each new dataset might require its unique mappings based on its structure and the task at hand.

	By following these guidelines and using the provided examples as templates,
	you can effectively instruct AutoTrain on how to interpret and handle your
	data for various machine learning tasks. This process is fundamental for
	achieving optimal results from your model training endeavors.