FinSeek
An experiment in fine-tuning a DeepSeek-R1-Distill-Llama-8B variant for financial sentiment analysis.
Model Details
Introduction and Model Description
Accurate, automated sentiment analysis is one of the most sought-after tools in NLP, particularly in finance. Previous approaches generally relied on simpler methods, using lexicon-based or slightly more advanced rule-based hybrid methods with heuristics (e.g. VADER). More recently, causal LLMs have proven effective at capturing sentiment, particularly with training, with Llama-3 and GPT-4 performing well on FOMC minutes, as outlined in the paper Is Small Really Beautiful for Central Bank Communication? (Spörer). However, one ongoing limitation of many models is an over-reliance on Neutral designations, possibly due to the relatively nuanced and measured language used in much financial text. DeepSeek-R1's distilled models promise superior performance to their base counterparts, and thus the goal was to fine-tune DeepSeek-R1-Distill-Llama-8B and compare it against other similarly-sized options, attempting to shift the final sentiment classification towards a more decisive Positive or Negative.
DeepSeek-R1-Distill-Llama-8B was trained on a set of roughly 5000 datapoints, primarily financial news headlines with associated sentiment labels, sourced from the Financial Phrasebank as well as additional tweet microblogs. Base and tuned models were evaluated against other similar options on both matching and related financial sentiment tasks. Results show that model performance can be greatly improved relative to baseline, but not across all tasks, and not entirely competitively against similar offerings.
- Model type: Causal LM
- Language(s) (NLP): English
- License: MIT
- Finetuned from model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
Uses
This model was trained to directly return Positive/Neutral/Negative sentiment labels for financially-related sentences. It is intended for use only as an educational tool, primarily as an experiment, not for legitimate analysis. The author neither endorses nor advises the use of this tool in any scenario which may have direct or indirect financial consequences.
Direct Use
This model expects prompts that explicitly outline the task and text to be classified. It can return either pure labels or floating-point values. Primary usage is with respect to news headlines or statements from the FOMC minutes, using either Positive/Neutral/Negative or Dovish/Neutral/Hawkish labels.
Prompt Format Example:
Examine the excerpt from a central bank's release below. Classify it as HAWKISH if it advocates for a tightening of monetary policy, DOVISH if it suggests an easing of monetary policy, or NEUTRAL if the stance is unbiased. Your response should return only HAWKISH, DOVISH, or NEUTRAL.
Text: {YOUR_TEXT}
Answer:
Code Example:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("dms3g/FinSeek-Llama-8B")
model = AutoModelForCausalLM.from_pretrained("dms3g/FinSeek-Llama-8B")

# Use the EOS token for padding, as no dedicated pad token is defined
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

def process_query(query, new_tokens=1500, skip=True):
    input_ids = tokenizer.encode(query, return_tensors="pt").to(model.device)
    input_len = input_ids.shape[1]
    # Greedy decoding of the response
    output_ids = model.generate(
        input_ids,
        max_new_tokens=new_tokens,
        pad_token_id=tokenizer.pad_token_id,
        do_sample=False
    )
    # Optionally skip the prompt tokens, returning only the generated answer
    if skip:
        output_text = tokenizer.decode(output_ids[0][input_len:], skip_special_tokens=True)
    else:
        output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return output_text

query = "Given the following financial text, return a sentiment score for Ashtead as a floating-point number ranging from -1 (indicating a very negative or bearish sentiment) to 1 (indicating a very positive or bullish sentiment), with 0 designating neutral sentiment. Return only the numerical score first, follow it with a brief reasoning behind your score.\nText: Ashtead to buy back shares, full-year profit beats estimates\nAnswer:"
process_query(query)
Expected Output
Output is either a bare response (a label or a floating-point score) with no additional text, or the response followed by brief reasoning.
Example output:
0.52
Reasoning: The company announced a share buyback program, which is generally positive for shareholders. Additionally, the company's profit exceeded analyst expectations, which is also positive. The score of 0.52 reflects a moderately positive sentiment.
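When a floating-point score is requested, the leading number can be separated from any trailing reasoning with a small post-processing helper. The sketch below is illustrative only; parse_score is a hypothetical helper, not part of the released code.

import re

def parse_score(response: str):
    """Extract the leading floating-point score from a model response.
    Returns None if the response starts with a label or other text instead."""
    match = re.match(r"\s*(-?\d+(?:\.\d+)?)", response)
    return float(match.group(1)) if match else None

parse_score("0.52\nReasoning: The company announced a share buyback program...")  # -> 0.52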
Out-of-Scope Use
The model cannot be guaranteed to provide accurate assessments. It should never be used for legitimate financial advice.
Bias, Risks, and Limitations
Although the model has improved performance over the base DeepSeek-R1-Distill-Llama-8B on the benchmark datasets, it still struggles with prompts that deviate from the expected format. Depending on prompt format, it can fail to provide accurate responses, fall into repetition, or take considerably longer to process. This model was intended primarily as an experiment to assess the performance of the R1 distillation and to compare it against other contemporary models. The other models tested, such as Qwen, showed considerably better performance, in both response quality and processing speed, without any additional fine-tuning. This may be a limitation of the underlying Llama model used for the distillation. Further experimentation may be necessary, such as with improved training data or an alternative fine-tuning approach.
How to Get Started with the Model
You can load the model as follows:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("dms3g/FinSeek-Llama-8B")
model = AutoModelForCausalLM.from_pretrained("dms3g/FinSeek-Llama-8B")
Training Details
Training Data
This model was trained on a combination of headlines from the Financial Phrasebank dataset (FPB) mixed with Twitter headlines from the Financial Sentiment Analysis Dataset (originally from Twitter Financial News). Several mixtures and variations were trialed, initially combining the entirety of both datasets, then differing proportions with different styles of tweets. Ultimately, to reduce noise caused by the large variation in tweet consistency and the limited useful information many tweets contain, the final mixture was 1:4 tweets:FPB, where only tweets containing financial news headlines were considered. The full base dataset was created by using all 4846 examples within the Financial Phrasebank and sampling 1000 Twitter Financial News headlines at random (using seed 42). When building the instruction set for fine-tuning, an additional primer sentence was added to the prompt for each statement, depending on whether it was sourced from the Financial Phrasebank or from Twitter Financial News.
Financial Phrasebank
"Returning only labels Positive, Neutral, or Negative, please return the sentiment of the following news headline.\nText:"
Twitter Financial News
"Returning only labels Positive, Neutral, or Negative, please return the sentiment of the following Tweet.\nText:"
For training, any existing labels (Bullish/Neutral/Bearish) were remapped to Positive/Neutral/Negative where necessary.
The dataset was then split into training, validation, and test sets: a train/test split with test proportion 0.1 and random_state 42 first created a 585-example test set; the remaining training split was then shuffled (again using random_state 42), with the first 4800 examples mapped to the final training set and the remaining 461 examples set aside for validation.
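A minimal sketch of this dataset construction and split is shown below, assuming the raw examples are available as (text, label, source) tuples; the helper and variable names (build_example, raw_examples) and the placeholder entries are illustrative, not the exact training code.

import random
from sklearn.model_selection import train_test_split

FPB_PRIMER = ("Returning only labels Positive, Neutral, or Negative, please return "
              "the sentiment of the following news headline.\nText: ")
TWEET_PRIMER = ("Returning only labels Positive, Neutral, or Negative, please return "
                "the sentiment of the following Tweet.\nText: ")
LABEL_MAP = {"Bullish": "Positive", "Bearish": "Negative"}  # remap tweet labels where necessary

def build_example(text, label, source):
    # Prepend the source-specific primer sentence and normalize the label
    primer = FPB_PRIMER if source == "fpb" else TWEET_PRIMER
    return {"prompt": primer + text, "label": LABEL_MAP.get(label, label)}

# raw_examples: all 4846 FPB sentences plus 1000 tweets sampled with seed 42
raw_examples = [
    ("Ashtead to buy back shares, full-year profit beats estimates", "Positive", "fpb"),
    ("Company X raises full-year guidance after strong quarter", "Bullish", "tweet"),
    # ... remaining FPB sentences and sampled Twitter Financial News headlines
]
examples = [build_example(t, l, s) for t, l, s in raw_examples]

# 10% held out for testing (585 examples on the full data), remainder shuffled into 4800/461
train_val, test = train_test_split(examples, test_size=0.1, random_state=42)
random.Random(42).shuffle(train_val)  # exact shuffling mechanism is an assumption
train, val = train_val[:4800], train_val[4800:]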
Training Procedure
The initial DeepSeek-R1-Distill-Llama-8B model was fine-tuned using AdaLoRA PEFT. AdaLoRA was chosen over a more standard LoRA approach for several reasons: AdaLoRA performs better than LoRA with the same parameter budget on a variety of benchmarks, including tasks that involve sentiment. LoRA, although reasonably effective, can underfit in NLP tasks such as sentiment analysis, as its uniform rank allocation across layers can exert unwanted influence on low-level layers, when the goal in this task is to shift final classification decisions away from Neutral. Because AdaLoRA dynamically prioritizes and deprioritizes layers, it may better capture nuance in sentiment. Additionally, AdaLoRA has been shown to perform notably better than LoRA on tasks such as GLUE SST-2, as detailed in its introductory paper. The expectation is that this would result in better generalization and reduce the risk of catastrophic forgetting. Implementing AdaLoRA only requires swapping in an additional configuration class alongside the standard PEFT library used with Transformers (a configuration sketch is shown after the hyperparameters below). The model was then trained on the training set of 4800 examples and validated on 461, as detailed above.
Training Hyperparameters
Training regime:
- The final hyperparameters were as follows:
init_r=16,
target_r=4,
beta1=0.9,
beta2=0.9,
tinit=100,
tfinal=1600,
lora_alpha=32,
lora_dropout=0.01,
task_type="CAUSAL_LM"
- With the training parameters set to the following:
learning_rate=8e-4
epochs=4
weight_decay=0.001
warmup_ratio=0.01
lr_scheduler_type="linear"
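The sketch below shows how these values might map onto peft's AdaLoraConfig and a standard TrainingArguments object. It is a sketch only: total_step, the output path, and the implied batch size are assumptions not reported above, and the actual training loop used for FinSeek is not included.

from transformers import AutoModelForCausalLM, TrainingArguments
from peft import AdaLoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Llama-8B")

# Placeholder: the AdaLoRA budget schedule needs the total number of optimizer steps,
# which must exceed tinit + tfinal; 2400 assumes 4800 examples x 4 epochs at batch size 8
# (the actual batch size is not reported above).
total_training_steps = 2400

peft_config = AdaLoraConfig(
    init_r=16,
    target_r=4,
    beta1=0.9,
    beta2=0.9,
    tinit=100,
    tfinal=1600,
    total_step=total_training_steps,
    lora_alpha=32,
    lora_dropout=0.01,
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, peft_config)

training_args = TrainingArguments(
    output_dir="finseek-adalora",  # placeholder output path
    learning_rate=8e-4,
    num_train_epochs=4,
    weight_decay=0.001,
    warmup_ratio=0.01,
    lr_scheduler_type="linear",
)
# AdaLoRA also expects peft_model.base_model.update_and_allocate(step) to be called at
# each optimizer step (e.g. from a Trainer subclass or callback) so ranks can be reallocated.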
Evaluation
In addition to the base DeepSeek model and the fine-tuned FinSeek, two additional models were chosen to provide a reasonable comparison:
- Llama 3.1 8B Instruct
- Qwen 2 7B Instruct
Both of these models have comparable size and performance overhead, with reasonable baseline performance and the ability to process the prompts with acceptable responses. Llama 3.1 8B Instruct is the closest relative of the original DeepSeek-R1-Distill-Llama-8B, which was distilled from Llama-3.1-8B, while still being able to follow the instructions and format outlined in the benchmarks in the same way that the instruct-tuned DeepSeek distillation does.
Testing Data, Factors & Metrics
Testing Data
Three primary task-related benchmarks were chosen to evaluate sentiment labeling performance:
- The FOMC dataset used in Is Small Really Beautiful for Central Bank Communication? (Spörer)
- FinBen-FOMC from FinAI's FinBen
- SemEval-2017 Task 5 Track 2 as rehosted by FinAI for FinBen
For the first FOMC benchmark, following the example prompt format provided in Appendix B of the Small paper, a prompt containing a single example sentence was provided in addition to the text to be evaluated:
Sentence: Imports rose in December, with an increased volume of petroleum imports, but declined in January, driven by lower prices and volumes for petroleum.
Output: {"growth_sentiment": "neutral", "employment_sentiment": "neutral", "inflation_sentiment": "positive"}
Given the above example, process the following prompt, returning only the JSON formatted dictionary of sentiments:
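For scoring, the JSON dictionary can be recovered from the model's free-form response with a small helper. The sketch below is illustrative; parse_fomc_response is a hypothetical name rather than the actual evaluation code.

import json

def parse_fomc_response(response):
    """Pull the JSON dictionary of sentiments out of a model response.
    Returns None if no parseable JSON object is present."""
    start, end = response.find("{"), response.rfind("}")
    if start == -1 or end == -1:
        return None
    try:
        return json.loads(response[start:end + 1])
    except json.JSONDecodeError:
        return None

parse_fomc_response('{"growth_sentiment": "neutral", "employment_sentiment": "neutral", "inflation_sentiment": "positive"}')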
The primary inspiration for this experimentation was the Small paper, which graciously provides its test dataset as linked above. Its methodology also involved training on the Financial Phrasebank as well as X (formerly Twitter) texts, which makes its FOMC test set a suitable evaluation standard for direct sentiment labeling. FinBen-FOMC was chosen because it more directly relates to the language used by the FOMC, rating statements overall as Hawkish/Neutral/Dovish, an adjacent task that should provide insight into both the financial knowledge of the model and its generalization capability. Finally, SemEval-2017 Task 5 Track 2 was chosen because, although it is more directly in line with the training data, rating financial news headlines as Positive, Negative, or Neutral, it does so with the added layer of a floating-point score. This should provide another, more granular evaluation of model performance beyond simple labeling.
In addition to these three benchmarks, the models were evaluated on the test split of the training data, as well as on SuperGLUE, in order to assess baseline performance across unrelated tasks. As a common benchmark, SuperGLUE should provide a reasonable evaluation of simple NLP tasks unrelated to finance, gauging the effects of fine-tuning on generalization without an overly long processing time.
Metrics
Models were evaluated primarily on accuracy, comparing the response label with the gold label for a given prompt. However, additional considerations were taken into account. In some cases, such as the FOMC test dataset from Small (Spörer), per-class F1 scores were also captured to match the confusion matrices found in the paper. Furthermore, for SemEval-2017, in order to properly evaluate floating-point responses, a weighted cosine score was calculated based on the formulas found in the task's results paper. They are replicated here for convenience:
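The weighted score below follows the cosine-with-coverage formulation described in the task overview paper; G and P are the gold and predicted score vectors, and |P|/|G| is the fraction of instances for which the model actually returned a score:

$$\operatorname{cosine}(G, P) = \frac{\sum_i G_i P_i}{\sqrt{\sum_i G_i^2}\,\sqrt{\sum_i P_i^2}}, \qquad \text{score} = \operatorname{cosine}(G, P) \cdot \frac{|P|}{|G|}$$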
Results
- Test split, FinBen-FOMC, SemEval-2017
Models | Test Split (Acc) | FinBen (Acc) | SemEval-2017 Task 5 Track 2 (Score) |
---|---|---|---|
Base (DeepSeek-R1-Distill-Llama-8B) | 0.590 | 0.442 | 0.5892 |
Tuned (FinSeek) | 0.697 | 0.550 | 0.7438 |
Llama 3.1 8B Instruct | 0.660 | 0.571 | 0.5852 |
Qwen 2 7B Instruct | 0.655 | 0.645 | 0.8394 |
- FOMC
Metric | Base (DeepSeek-R1-Distill-Llama-8B) | Tuned (FinSeek) | Llama 3.1 8B Instruct | Qwen 2 7B Instruct |
---|---|---|---|---|
F1 (Neg) | 0.3820 | 0.0559 | 0.5168 | 0.5812 |
F1 (Neu) | 0.7336 | 0.8365 | 0.8573 | 0.8840 |
F1 (Pos) | 0.4329 | 0.0784 | 0.4838 | 0.5581 |
F1 (Avg) | 0.5162 | 0.3236 | 0.6193 | 0.6745 |
Accuracy | 0.5756 | 0.7055 | 0.7531 | 0.7875 |
- SuperGLUE
Models | BoolQ (Acc) | CB (Acc) | COPA (Acc) | MultiRC (Acc) | ReCoRD (F1) | RTE (Acc) | WiC (Acc) | WSC (Acc) |
---|---|---|---|---|---|---|---|---|
Base (DeepSeek-R1-Distill-Llama-8B) | 0.829 | 0.625 | 0.880 | 0.528 | 0.858 | 0.697 | 0.541 | 0.433 |
Tuned (FinSeek) | 0.792 | 0.696 | 0.850 | 0.548 | 0.870 | 0.664 | 0.498 | 0.356 |
Llama 3.1 8B Instruct | 0.840 | 0.714 | 0.930 | 0.271 | 0.919 | 0.686 | 0.625 | 0.635 |
Qwen 2 7B Instruct | 0.855 | 0.750 | 0.870 | 0.169 | 0.900 | 0.783 | 0.608 | 0.740 |
Summary
The fine-tuned FinSeek model generally outperformed the base R1-distilled model on the first set of benchmarks, at a slight cost in generalization performance as seen in the SuperGLUE results. Processing time was significantly reduced on the FinBen and SemEval benchmarks, from several hours to well under an hour on the same hardware (generally A100 or A40 GPUs). The response format also better matched the request in many cases, returning only the label as asked by the prompt rather than a label and full explanation, which contributes to the reduced processing time.
However, this performance is only impressive in isolation, as the tuned model still severely underperforms the other similarly capable, non-fine-tuned models. Furthermore, there is a severe degradation in performance on the FOMC dataset: although accuracy improved, F1 dropped significantly for the Positive and Negative classes, which suggests an underfit model, with the increase in Neutral F1 a direct result of the increased proportion of Neutral responses from the model. This is possibly a result of a relatively unbalanced dataset, as Neutral is the most common gold assessment, unsurprising given the FOMC's relatively guarded language.
It is interesting to note that the base distillation of R1 onto Llama appears to reduce performance in this case, though only in the context of this task, dataset, and setup. Overall, given the nature of the responses from Llama 3.1, the base distilled model, and the fine-tuned model, there may be a limitation in this form and application of Llama in particular, and perhaps an upper limit to achievable performance. Further testing is necessary, and an additional comparison against a fine-tuned Llama 3.1 8B would provide a more comprehensive picture.