# Transformer with Pointer-Generator Network

This page describes the `transformer_pointer_generator` model that incorporates
a pointing mechanism in the Transformer model that facilitates copying of input
words to the output. This architecture is described in [Enarvi et al. (2020)](https://www.aclweb.org/anthology/2020.nlpmc-1.4/).

## Background

The pointer-generator network was introduced in [See et al. (2017)](https://arxiv.org/abs/1704.04368)
for RNN encoder-decoder attention models. A similar mechanism can be
incorporated in a Transformer model by reusing one of the many attention
distributions for pointing. The attention distribution over the input words is
interpolated with the normal output distribution over the vocabulary words. This
allows the model to generate words that appear in the input, even if they don't
appear in the vocabulary, helping especially with small vocabularies.
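
Concretely, following the notation of See et al. (2017), the distribution over
output words is a mixture of the generator's vocabulary distribution and the
attention (copy) distribution, weighted by a generation probability:

$$
P(w) = p_{\mathrm{gen}} \, P_{\mathrm{vocab}}(w) + (1 - p_{\mathrm{gen}}) \sum_{i : x_i = w} a_i
$$

where $a_i$ is the attention weight on source position $i$, $x_i$ is the source
word at that position, and $p_{\mathrm{gen}} \in [0, 1]$ is the generation
probability that weights the two distributions.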

## Implementation

The mechanism for copying out-of-vocabulary words from the input has been
implemented differently from See et al. In their [implementation](https://github.com/abisee/pointer-generator)
they convey the word identities through the model in order to be able to produce
words that appear in the input sequence but not in the vocabulary. A different
approach was taken in the Fairseq implementation to keep it self-contained in
the model file, avoiding any changes to the rest of the code base. Copying
out-of-vocabulary words is instead made possible by pre-processing the input and
post-processing the output. This is described in detail in the next section.

## Usage

The training and evaluation procedure is outlined below. You can also find a
more detailed example for the XSum dataset on [this page](README.xsum.md).

##### 1. Create a vocabulary and extend it with source position markers

The pointing mechanism is especially helpful with small vocabularies, if we are
able to recover the identities of any out-of-vocabulary words that are copied
from the input. For this purpose, the model allows extending the vocabulary with
special tokens that can be used in place of `<unk>` tokens to identify different
input positions. For example, the user may add `<unk-0>`, `<unk-1>`, `<unk-2>`,
etc. to the end of the vocabulary, after the normal words. Below is an example
of how to create a vocabulary of the 10000 most common words and add 1000 input
position markers.
```bash
vocab_size=10000
position_markers=1000
export LC_ALL=C
# Count word frequencies in the training data and keep the most common words.
# Four slots are reserved for fairseq's special symbols (<s>, <pad>, </s>, <unk>).
cat train.src train.tgt |
  tr -s '[:space:]' '\n' |
  sort |
  uniq -c |
  sort -k1,1bnr -k2 |
  head -n "$((vocab_size - 4))" |
  awk '{ print $2 " " $1 }' >dict.pg.txt
# Append the source position markers after the normal words.
python3 -c "[print('<unk-{}> 0'.format(n)) for n in range($position_markers)]" >>dict.pg.txt
```
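
As a quick sanity check (an illustrative snippet, not part of the provided
tooling), the resulting `dict.pg.txt` should list one `<word> <count>` pair per
line, with the regular words first and the position markers at the end:

```python
# Illustrative check: regular words first, then the position markers in order.
with open("dict.pg.txt", encoding="utf-8") as f:
    symbols = [line.split()[0] for line in f]

markers = [s for s in symbols if s.startswith("<unk-")]
print(f"{len(symbols) - len(markers)} regular words, {len(markers)} position markers")
assert markers == [f"<unk-{n}>" for n in range(len(markers))]
```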

##### 2. Preprocess the text data

The idea is that any `<unk>` token in the text is replaced with `<unk-0>` if
it appears in the first input position, `<unk-1>` if it appears in the second
input position, and so on. This can be achieved using the `preprocess.py` script
that is provided in this directory.
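
To make the replacement concrete, here is a minimal sketch of the idea for a
single input sentence. It is an illustration rather than the provided
`preprocess.py`, and the fallback to a plain `<unk>` when no marker is available
is an assumption made for this sketch.

```python
# A minimal sketch of the replacement described above; preprocess.py in this
# directory handles the actual files.
def add_position_markers(tokens, vocab, max_markers=1000):
    out = []
    for i, tok in enumerate(tokens):
        if tok in vocab:
            out.append(tok)
        elif i < max_markers:
            out.append(f"<unk-{i}>")  # out-of-vocabulary word at input position i
        else:
            out.append("<unk>")       # assumed fallback when markers run out
    return out

vocab = {"the", "cat", "sat", "on", "mat", "."}
print(add_position_markers("the gorpul sat on the mat .".split(), vocab))
# ['the', '<unk-1>', 'sat', 'on', 'the', 'mat', '.']
```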

##### 3. Train a model

The number of these special tokens is given to the model with the
`--source-position-markers` argument; the model simply maps all of these to the
same word embedding as `<unk>`.

The attention distribution that is used for pointing is selected using the
`--alignment-heads` and `--alignment-layer` command-line arguments in the same
way as with the `transformer_align` model.
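
The embedding sharing described above can be pictured with a small sketch. This
is a conceptual illustration only, not the fairseq implementation: the position
marker indices are redirected to the ordinary `<unk>` row before the embedding
lookup, so the markers add no parameters of their own.

```python
import torch
import torch.nn as nn

class MarkerSharingEmbedding(nn.Module):
    """Sketch: embed every <unk-N> index with the ordinary <unk> embedding."""

    def __init__(self, vocab_size, embed_dim, unk_idx, num_markers):
        super().__init__()
        # Only the regular vocabulary (without the markers) gets parameters.
        self.embed = nn.Embedding(vocab_size - num_markers, embed_dim)
        self.unk_idx = unk_idx
        self.first_marker_idx = vocab_size - num_markers

    def forward(self, tokens):
        # Redirect position-marker indices to <unk> before the lookup.
        tokens = torch.where(
            tokens >= self.first_marker_idx,
            torch.full_like(tokens, self.unk_idx),
            tokens,
        )
        return self.embed(tokens)
```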

##### 4. Generate text and postprocess it

When using the model to generate text, you want to preprocess the input text in
the same way that the training data was processed, replacing out-of-vocabulary
words with `<unk-N>` tokens. If any of these tokens are copied to the output, the
actual words can be retrieved from the unprocessed input text. Any `<unk-N>`
token should be replaced with the word at position N in the original input
sequence. This can be achieved using the `postprocess.py` script.
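
A minimal sketch of this recovery rule is shown below; it is an illustration
only, while the provided `postprocess.py` handles the actual files.

```python
import re

# Each <unk-N> in the generated output is replaced with the word at position N
# of the original, unprocessed input sentence.
def recover_words(output_tokens, original_input_tokens):
    recovered = []
    for tok in output_tokens:
        match = re.fullmatch(r"<unk-(\d+)>", tok)
        if match is not None:
            recovered.append(original_input_tokens[int(match.group(1))])
        else:
            recovered.append(tok)
    return recovered

original_input = "the gorpul sat on the mat .".split()
generated = "the <unk-1> sat .".split()
print(" ".join(recover_words(generated, original_input)))  # the gorpul sat .
```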