Overview

This is a fine-tuned CodeT5+ (220m) bimodal model, trained on a dataset of 59,000 Python code-docstring pairs. The docstrings are in the Google style format. A Google-style docstring is formatted as follows:

<Description of the code>

Args:
<var1> (<data-type>): <description of var1>
<var2> (<data-type>): <description of var2>

Returns:
<var3> (<data-type>): <description of var3>

Raises:
<ExceptionType>: <description of when it is raised>
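
For concreteness, here is a small, hypothetical function documented in this style (the function and its names are illustrative only, not taken from the dataset):

def scale_vector(values, factor):
    """Scales every element of a vector by a constant factor.

    Args:
        values (list[float]): The numbers to scale.
        factor (float): The multiplier applied to each element.

    Returns:
        list[float]: A new list with every element multiplied by factor.

    Raises:
        TypeError: If values is not a list of numbers.
    """
    if not isinstance(values, list):
        raise TypeError("values must be a list of numbers")
    return [v * factor for v in values]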

For more information on the dataset, please see the dataset referenced on this model's page.

You can test the model using this:

from transformers import T5ForConditionalGeneration, AutoTokenizer

checkpoint = "Mir-2002/codet5p-google-style-docstrings"
device = "cuda" # or CPU

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint).to(device)

code = """
def calculate_sum(a, b):
    return a + b
"""

inputs = tokenizer.encode(code, return_tensors="pt").to(device)
outputs = model.generate(
            inputs,
            max_length=128,
            num_beams=8,
            early_stopping=True,
            no_repeat_ngram_size=3,
            pad_token_id=tokenizer.pad_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Calculate the sum of two numbers.

# Args:
# a (int): The first number.
# b (int): The second number.
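
For convenience, the generation call above can be wrapped in a small helper with a CPU fallback. This is only a sketch built on the snippet above (the helper name is mine, not part of the model's API):

import torch

def generate_docstring(code: str) -> str:
    # Fall back to CPU when no GPU is available
    device = "cuda" if torch.cuda.is_available() else "cpu"
    inputs = tokenizer.encode(code, return_tensors="pt").to(device)
    outputs = model.to(device).generate(
        inputs,
        max_length=128,
        num_beams=8,
        early_stopping=True,
        no_repeat_ngram_size=3,
        pad_token_id=tokenizer.pad_token_id,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generate_docstring("def multiply(a, b):\n    return a * b"))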

Fine-tuning

When fine-tuning the model, I used the special token <tdec>. According to the CodeT5+ paper:

" Specifically, when the input is a text sample, we prepend a [CDec] token to the input sequence to the decoder. In this case, the decoder operates under code generation functionality. Alternatively, when the input is a code sample, we prepend a [TDec] token to the input sequence to the decoder. The decoder operates under text generation functionality in this case. This type of Causal LM has been shown to be an effective learning objective to close the pretrain-finetune gap for generative downstream tasks"

Generally speaking, the <tdec> token was prepended to the target (the docstring) to signal to the decoder that it is operating in text generation mode. A sample target row looks like this:

<s><tdec> Creates a task that to retry a previously abandoned task.

Returns:
Task: a task that was abandoned but should be retried or None if there are
no abandoned tasks that should be retried.</s>
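
For illustration, target rows like the sample above could be produced by a preprocessing step along these lines. This is a sketch of the idea rather than the actual training code, and it assumes a tokenizer to which <tdec> has already been added (the script for that is shown further below); the tokenizer itself wraps the sequence in <s> and </s>.

def build_target(docstring: str) -> str:
    # Prepend <tdec> so the decoder operates in text generation mode
    return "<tdec> " + docstring.strip()

# Tokenize one target docstring; the tokenizer adds <s> ... </s> around it
target_ids = tokenizer(
    build_target("Calculates the sum of two numbers."),
    max_length=128,
    truncation=True,
)["input_ids"]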

Prepending the token this way helps the decoder know which downstream task it is currently being fine-tuned on, improving the process. However, the paper doesn't clearly state whether the token is already included in the tokenizer's vocabulary. To be safe, I manually added the token to the tokenizer's vocabulary using this script:

import os

from transformers import AutoTokenizer, T5ForConditionalGeneration

model_name = "Salesforce/codet5p-220m-bimodal"
model_path = "/path/to/your/model"

os.makedirs(model_path, exist_ok=True)

# Load base model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Add special token(s)
tokenizer.add_special_tokens({"additional_special_tokens": ["<tdec>"]})

# Resize embeddings to match new vocab size
model.resize_token_embeddings(len(tokenizer))

# Save both to a custom directory (or keep using them in memory for the current run)
tokenizer.save_pretrained(model_path)
model.save_pretrained(model_path)

I then verified the token was added using this script:

print("Token ID for <tdec>:", tokenizer.convert_tokens_to_ids("<tdec>"))
print("Tokenized form of '<tdec>':", tokenizer.tokenize("<tdec>"))

# Token ID for <tdec>: 32103
# Tokenized form of '<tdec>': ['<tdec>']

These scripts were run beforehand, and the modified model and tokenizer were used during fine-tuning.

Hyperparameters

MAX_SOURCE_LENGTH = 256
MAX_TARGET_LENGTH = 128
BATCH_SIZE = 16
NUM_EPOCHS = 35
LEARNING_RATE = 3e-5
GRADIENT_ACCUMULATION_STEPS = 4
EARLY_STOPPING_PATIENCE = 2
WEIGHT_DECAY = 0.01
OPTIMIZER = ADAFACTOR
LR_SCHEDULER = LINEAR

The model was trained via Colab Pro on an L4 GPU. Gradient accumulation over 4 steps was used to simulate an effective batch size of 64 (16 * 4).
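
The full training script isn't part of this card. Under the hyperparameters above, a Seq2SeqTrainer setup would look roughly like the sketch below; dataset loading and tokenization are omitted, train_ds and val_ds are placeholder names for the tokenized splits, and model is the modified checkpoint saved earlier:

from transformers import (
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    EarlyStoppingCallback,
)

training_args = Seq2SeqTrainingArguments(
    output_dir="codet5p-google-docstrings",
    per_device_train_batch_size=16,   # BATCH_SIZE
    gradient_accumulation_steps=4,    # effective batch size 64
    num_train_epochs=35,              # NUM_EPOCHS
    learning_rate=3e-5,               # LEARNING_RATE
    weight_decay=0.01,                # WEIGHT_DECAY
    optim="adafactor",                # OPTIMIZER
    lr_scheduler_type="linear",       # LR_SCHEDULER
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,      # required for early stopping
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,   # placeholder: tokenized training split
    eval_dataset=val_ds,      # placeholder: tokenized validation split
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],  # EARLY_STOPPING_PATIENCE
)
trainer.train()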

Loss

On the 35th epoch, the model achieved the following loss:

Epoch | Training Loss | Validation Loss
35    | 0.894800      | 1.268536

BLEU and ROUGE Scores

SacreBLEU | ROUGE-1 | ROUGE-2 | ROUGE-L
35.40     | 58.55   | 39.46   | 52.43
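
The exact evaluation script isn't included here; scores like these are typically computed with the evaluate library, along the lines of the sketch below (the prediction and reference lists are placeholders for the decoded test-set docstrings):

import evaluate

sacrebleu = evaluate.load("sacrebleu")
rouge = evaluate.load("rouge")

predictions = ["Calculate the sum of two numbers."]             # placeholder model outputs
references = [["Calculates the sum of two integers a and b."]]  # placeholder ground truth

bleu = sacrebleu.compute(predictions=predictions, references=references)
rouge_scores = rouge.compute(
    predictions=predictions,
    references=[r[0] for r in references],
)

print(bleu["score"])          # corpus-level SacreBLEU
print(rouge_scores["rougeL"]) # ROUGE-L F-measure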

While a SacreBLEU score of 35 is only moderate, it is important to consider that Google-style docstrings vary widely. Some outliers contain extra sections that are not usually present in the rest of the dataset, which leads the model to generate "hallucinations". One example is this particular sample:

Reference:  Validate timestamp specified by request.

See `validate.request` for additional info.

Args:
stamp: str. Time request was made as ISO 8601 timestamp.
tolerance: int. Number of seconds request remains valid from timestamp.

Returns
bool: True if valid, False otherwise.
-----------------------------------------------------------------------
Prediction:  Validate timestamp.

Args:
stamp (str): A date string in the format YYYY-MM-DDThh:mm:ss.######[+-]##:##

Returns:
bool: True if valid, False otherwise.

As you can see, the model generated gibberish in the prediction's Args section, specifically in the string format for the date.
