# Sharing BERTopic models on the Hugging Face Hub

This notebook shows the steps involved in sharing a BERTopic model on the Hugging Face Hub. As an example, we'll train a topic model on GitHub issue titles for the Transformers library. 

First we need to install `BERTopic` along with the `huggingface_hub` library. We can optionally also install [`safetensors`](https://huggingface.co/docs/safetensors/index). Safetensors is a simple format for storing tensors safely (as opposed to pickle) that is still fast (zero-copy). If this library is installed, BERTopic can use the `safetensors` format for model serialization.

In [None]:
%pip install git+https://github.com/MaartenGr/BERTopic huggingface_hub safetensors -qqq
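Since `safetensors` is optional, it can be useful to confirm what is actually available in the current environment before relying on it. The snippet below is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper name, not part of BERTopic or `huggingface_hub`.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if `package_name` can be found in this environment."""
    # find_spec locates the package without importing it.
    return importlib.util.find_spec(package_name) is not None


# Report which of the libraries used in this notebook are available.
for name in ("bertopic", "huggingface_hub", "safetensors"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If `safetensors` is reported as missing, re-running the install cell above is the simplest fix.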


We can use a [dataset](https://github.com/nlp-with-transformers/notebooks) that has been created for the [Natural Language Processing with Transformers](https://github.com/nlp-with-transformers/notebooks) book. This dataset contains issue titles, along with some metadata for the Transformers library GitHub repository. 

GitHub issues are an example of a domain where we might assume some sort of topics exist in the corpus, but we probably don't have an exact sense of what all of these topics would be. This is the type of problem where topic modelling can give us a better sense of the corpus and potentially be useful for classifying new issues into topics.

We'll start by loading the data into a pandas DataFrame. 

In [None]:
import pandas as pd

dataset_url = "https://raw.githubusercontent.com/nlp-with-transformers/notebooks/main/data/github-issues-transformers.jsonl"
df_issues = pd.read_json(dataset_url, lines=True)


In [None]:
df_issues.head(4)

| | url | repository_url | labels_url | comments_url | events_url | html_url | id | node_id | number | title | ... | milestone | comments | created_at | updated_at | closed_at | author_association | active_lock_reason | body | performed_via_github_app | pull_request |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://api.github.com/repos/huggingface/trans... | https://api.github.com/repos/huggingface/trans... | https://api.github.com/repos/huggingface/trans... | https://api.github.com/repos/huggingface/trans... | https://api.github.com/repos/huggingface/trans... | https://github.com/huggingface/transformers/is... | 849568459 | MDU6SXNzdWU4NDk1Njg0NTk= | 11046 | Potential incorrect application of layer norm ... | ... | | 0 | 2021-04-03 03:37:32 | 2021-04-03 03:37:32 | NaT | NONE | | In BlenderbotSmallDecoder, layer norm is appl... | | |
| 1 | https://api.github.com/repos/huggingface/trans... | https://api.github.com/repos/huggingface/trans... | https://api.github.com/repos/huggingface/trans... | https://api.github.com/repos/huggingface/trans... | https://api.github.com/repos/huggingface/trans... | https://github.com/huggingface/transformers/is... | 849544374 | MDU6SXNzdWU4NDk1NDQzNzQ= | 11045 | Multi-GPU seq2seq example evaluation significa... | ... | | 0 | 2021-04-03 00:52:24 | 2021-04-03 00:52:24 | NaT | NONE | | \r\n### Who can help\r\n@patil-suraj @sgugger ... | | |
| 2 | https://api.github.com/repos/huggingface/trans... | https://api.github.com/repos/huggingface/trans... | https://api.github.com/repos/huggingface/trans... | https://api.github.com/repos/huggingface/trans... | https://api.github.com/repos/huggingface/trans... | https://github.com/huggingface/transformers/is... | 849529761 | MDU6SXNzdWU4NDk1Mjk3NjE= | 11044 | [DeepSpeed] ZeRO stage 3 integration: getting ... | ... | | 0 | 2021-04-02 23:40:42 | 2021-04-03 00:00:18 | NaT | COLLABORATOR | | **[This is not yet alive, preparing for the re... | | |
| 3 | https://api.github.com/repos/huggingface/trans... | https://api.github.com/repos/huggingface/trans... | https://api.github.com/repos/huggingface/trans... | https://api.github.com/repos/huggingface/trans... | https://api.github.com/repos/huggingface/trans... | https://github.com/huggingface/transformers/is... | 849499734 | MDU6SXNzdWU4NDk0OTk3MzQ= | 11043 | Can't load model to estimater | ... | | 0 | 2021-04-02 21:51:44 | 2021-04-02 21:51:44 | NaT | NONE | | I was trying to follow the Sagemaker instructi... | | |


We can train our topic model on a subset of the data and hold back some examples which we can treat as new data. This mirrors a situation where we might use a BERTopic model in a production setting.

In [None]:
len(df_issues)

9930

In [None]:
df_issues_train = df_issues[:9000]

In [None]:
df_issues_test = df_issues[9000:]
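The slices above split the data by row position, so the held-out rows all come from one end of the file. The first rows shown earlier date from April 2021, which suggests (as an assumption, not something verified here) that the file is ordered newest first and the held-out issues are the oldest ones. If a random hold-out set is preferred, a shuffled split is one alternative. The sketch below shows the idea on a plain list; `shuffled_split` is a hypothetical helper and the seed value is arbitrary.

```python
import random


def shuffled_split(items, n_train, seed=42):
    """Shuffle a copy of `items` and split it into (train, held_out)."""
    shuffled = list(items)
    # A dedicated Random instance keeps the split reproducible.
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Toy stand-in for the issue titles.
titles = [f"issue {i}" for i in range(10)]
train, held_out = shuffled_split(titles, n_train=8)
print(len(train), len(held_out))  # 8 2
```

With a DataFrame, `df_issues.sample(frac=1, random_state=42)` shuffles the rows in a similar way before slicing. Whether a positional or a random split is more appropriate depends on whether the held-out data should mimic issues from a different time period.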

BERTopic expects a list of strings as input, so let's grab the `title` column and turn it into a list.

In [None]:
issue_titles = df_issues_train['title'].to_list()

In [None]:
issue_titles[:3]

['Potential incorrect application of layer norm in BlenderbotSmallDecoder',
 'Multi-GPU seq2seq example evaluation significantly slower than legacy example evaluation',
 '[DeepSpeed] ZeRO stage 3 integration: getting started and issues']
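Because the input has to be a list of strings, missing or empty titles can cause problems if a dataset contains them. Whether this dataset has any is not checked here; the sketch below simply shows one way to guard against them. `clean_titles` is a hypothetical helper, not a BERTopic function.

```python
def clean_titles(titles):
    """Keep only non-empty string titles, with surrounding whitespace removed."""
    cleaned = []
    for title in titles:
        # Skip None, NaN and other non-string values, as well as blank strings.
        if isinstance(title, str) and title.strip():
            cleaned.append(title.strip())
    return cleaned


raw = ["  Fix typo ", None, "", "Add model card", float("nan")]
print(clean_titles(raw))  # ['Fix typo', 'Add model card']
```

Note that dropping entries changes the length of the list, so any parallel data used later (such as the `created_at` timestamps) would need the same filter applied to stay aligned.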

## Training our model

We'll train a BERTopic model using fairly standard settings.

In [None]:
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired

In [None]:
representation_model = KeyBERTInspired()

In [None]:
topic_model = BERTopic("english", verbose=True, nr_topics=30, representation_model=representation_model)

In [None]:
topics, probs = topic_model.fit_transform(issue_titles)


2023-05-30 10:28:46,335 - BERTopic - Transformed documents to Embeddings
2023-05-30 10:29:26,811 - BERTopic - Reduced dimensionality
2023-05-30 10:29:27,188 - BERTopic - Clustered reduced embeddings
2023-05-30 10:29:32,644 - BERTopic - Reduced number of topics from 181 to 30


We can quickly explore the topics our model found.

In [None]:
freq = topic_model.get_topic_info()

In [None]:
freq.head(10)

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 2106 | -1_bert_tensorflow_model_models | [bert, tensorflow, model, models, tf, tokenize... | [t5 model card, TFDistilBERT ValueError when l... |
| 1 | 0 | 1774 | 0_bert_bertforsequenceclassification_berttoken... | [bert, bertforsequenceclassification, berttoke... | [The output to be used for getting sentence em... |
| 2 | 1 | 1122 | 1_gpt2_trainertrain_gpt_trainer | [gpt2, trainertrain, gpt, trainer, training, c... | [Training GPT2 and Reformer from scratch. , A... |
| 3 | 2 | 516 | 2_typos_typo_fix_fixed | [typos, typo, fix, fixed, correction, error, c... | [fix typo, Fix doc link in README, [doc] typo ... |
| 4 | 3 | 464 | 3_s2s_seq2seq_examplesseq2seq_seq2seqdataset | [s2s, seq2seq, examplesseq2seq, seq2seqdataset... | [[s2s] --eval_max_generate_length, [s2s] s/alp... |
| 5 | 4 | 404 | 4_modelcard_modelcards_card_model | [modelcard, modelcards, card, model, cards, mo... | [Add model card, Add model card, Model Card fo... |
| 6 | 5 | 368 | 5_attributeerror_valueerror_typeerror_error | [attributeerror, valueerror, typeerror, error,... | [TypeError: on_init_end() got an unexpected ke... |
| 7 | 6 | 347 | 6_summarization_summaries_questionansweringpip... | [summarization, summaries, questionansweringpi... | [Bug in the question answering pipeline, Add t... |
| 8 | 7 | 329 | 7_longformer_tf_longformers_tftrainer | [longformer, tf, longformers, tftrainer, longf... | [TF Longformer, Fix TF Longformer, Fix TF Long... |
| 9 | 8 | 227 | 8_testing_ci_tests_tf | [testing, ci, tests, tf, test, slow, t5, bench... | [Fix the CI, Ci test tf super slow, TF Slow te... |
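In BERTopic, topic `-1` collects the documents that were not assigned to any cluster (outliers), so its count is worth checking. In the table above it holds 2106 of the 9000 training titles. The sketch below computes that share from a list of topic assignments; `outlier_share` is a hypothetical helper, and the toy list only mirrors the two counts above rather than the real assignments.

```python
from collections import Counter


def outlier_share(topics):
    """Fraction of documents assigned to the outlier topic -1."""
    if not topics:
        return 0.0
    return Counter(topics)[-1] / len(topics)


# Toy assignments: 2106 outliers out of 9000 documents, all others in topic 0.
toy_topics = [-1] * 2106 + [0] * 6894
print(f"{outlier_share(toy_topics):.1%}")  # 23.4%
```

On the real model, the same function can be applied to the `topics` list returned by `fit_transform`.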


In [None]:
topic_model.visualize_topics()

We can also view topics over time.

In [None]:
timestamps = df_issues_train['created_at']

In [None]:
topics_over_time = topic_model.topics_over_time(issue_titles, timestamps, nr_bins=20)

20it [00:14,  1.36it/s]
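The `nr_bins=20` argument groups the timestamps into a fixed number of intervals, so that topic frequencies are computed per interval rather than per unique timestamp. To illustrate what binning into equal-width intervals means, here is a plain-Python sketch. It is not BERTopic's implementation, and the library's exact bin edges may differ; `equal_width_bin` is a hypothetical helper.

```python
from datetime import datetime


def equal_width_bin(timestamps, nr_bins):
    """Assign each timestamp to one of `nr_bins` equal-width time intervals."""
    start, end = min(timestamps), max(timestamps)
    width = (end - start) / nr_bins
    bins = []
    for ts in timestamps:
        if width.total_seconds() == 0:
            # All timestamps are identical, so everything falls in one bin.
            bins.append(0)
        else:
            # The latest timestamp would land in bin `nr_bins`, so clamp it.
            bins.append(min(int((ts - start) / width), nr_bins - 1))
    return bins


stamps = [
    datetime(2021, 1, 1),
    datetime(2021, 1, 11),
    datetime(2021, 1, 21),
    datetime(2021, 1, 31),
]
print(equal_width_bin(stamps, nr_bins=3))  # [0, 1, 2, 2]
```

Fewer bins give smoother curves in the visualization, while more bins show finer changes at the cost of noisier counts.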


In [None]:
topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=10)

## Pushing our BERTopic model to the Hugging Face Hub 🤗

We can use the new BERTopic Hub integration to push our models to the Hugging Face Hub. Sharing models on the Hub makes it easier for others (or our future selves) to use or adapt our topic models.

In [None]:
from huggingface_hub import notebook_login

In [None]:
notebook_login()


In [None]:
HF_USER_NAME = "" # add your hub username here

In [None]:
topic_model.push_to_hf_hub(f'{HF_USER_NAME}/transformers_issues_topics')


'https://huggingface.co/davanstrien/transformers_issues_topics/tree/main/'
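The repository name passed to `push_to_hf_hub` is built from `HF_USER_NAME`, which is left as an empty string in the cell above. If it is not filled in, the f-string produces `/transformers_issues_topics`, which is not a valid `user/model` identifier. A small check before pushing can catch this early. The helper below is a hypothetical sketch, not part of BERTopic or `huggingface_hub`.

```python
def build_repo_id(user_name: str, model_name: str) -> str:
    """Return '<user>/<model>' or raise if either part is empty."""
    user_name = user_name.strip()
    model_name = model_name.strip()
    if not user_name or not model_name:
        raise ValueError("Both a Hub username and a model name are required.")
    return f"{user_name}/{model_name}"


print(build_repo_id("my-user", "transformers_issues_topics"))
# my-user/transformers_issues_topics

# With a trained model and a logged-in session, the push would then be:
# topic_model.push_to_hf_hub(build_repo_id(HF_USER_NAME, "transformers_issues_topics"))
```

Here `my-user` is a placeholder; replace it with your own Hub username.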

## Loading models from the Hugging Face Hub 🤗

We can similarly load models from the Hub.

In [None]:
from bertopic import BERTopic
topic_model = BERTopic.load("davanstrien/transformers_issues_topics")


We can then use this model to predict the topics of new unseen documents. 

In [None]:
new_issue_titles = df_issues_test['title'].to_list()

In [None]:
examples = new_issue_titles[5:15]

In [None]:
examples

['Changing the number of hidden layers for BERT',
 'Tokenization in quickstart guide fails',
 'Add NER TF2 example.',
 'Remove dead code in tests.',
 'CLI for authenticated file sharing',
 'Missing xlm-mlm-100-1280',
 "UnboundLocalError: local variable 'extended_attention_mask' referenced before assignment",
 'How do I load a pretrained file offline?',
 'XLM-R Support',
 'Meaning of run_lm_finetuning.py output']

In [None]:
topics, prob = topic_model.transform(examples)


In [None]:
for example, topic in zip(examples,topics):
    print(f"TEXT: {example}")
    print(f"TOPIC: {topic_model.get_topic_info(int(topic)).loc[0,'Representation']}")
    print('--*--'*9)


TEXT: Changing the number of hidden layers for BERT
TOPIC: ['tokenizer', 'tokenizers', 'tokenization', 'tokenize', 'berttokenizer', 'token', 'bertforsequenceclassification', 'tokens', 'bert', 'bart']
--*----*----*----*----*----*----*----*----*--
TEXT: Tokenization in quickstart guide fails
TOPIC: ['tokenizer', 'tokenizers', 'tokenization', 'tokenize', 'berttokenizer', 'token', 'bertforsequenceclassification', 'tokens', 'bert', 'bart']
--*----*----*----*----*----*----*----*----*--
TEXT: Add NER TF2 example.
TOPIC: ['t5', 't5model', 't5base', 't5large', 'tf', 't5forconditionalgeneration', 'mt5', 'tftrainer', 'tpu', 't511b']
--*----*----*----*----*----*----*----*----*--
TEXT: Remove dead code in tests.
TOPIC: ['tests', 'testing', 'speedup', 'test', 'testgeneratefp16', 'testst', 'slow', 'installationtest', 'testenrogenerate', 'testoutputstxt']
--*----*----*----*----*----*----*----*----*--
TEXT: CLI for authenticated file sharing
TOPIC: ['readmemd', 'readmetxt', 'readme', 'docstring', 'docs
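The loop above calls `get_topic_info` once per example. When many documents are scored, it may be simpler to build a topic-to-keywords lookup once and reuse it. The sketch below shows that pattern on toy data; `describe_predictions` is a hypothetical helper, and the topic ids and keywords are made up for illustration.

```python
def describe_predictions(docs, topics, representations):
    """Pair each document with the keywords of its predicted topic."""
    rows = []
    for doc, topic in zip(docs, topics):
        # Fall back to an empty keyword list for unknown topic ids.
        keywords = representations.get(int(topic), [])
        rows.append((doc, int(topic), keywords))
    return rows


# Hypothetical topic ids and keywords, for illustration only.
toy_representations = {3: ["tests", "testing", "slow"], -1: ["bert", "model"]}
toy_docs = ["Remove dead code in tests.", "XLM-R Support"]
toy_topics = [3, -1]

for doc, topic, keywords in describe_predictions(toy_docs, toy_topics, toy_representations):
    print(f"{topic:>2} | {doc} | {', '.join(keywords)}")
```

With the real model, the lookup could be built from the `Topic` and `Representation` columns shown earlier, for example `info = topic_model.get_topic_info()` followed by `dict(zip(info["Topic"], info["Representation"]))`.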

## Next steps

You can try training your own topic model and pushing it to the Hub. BERTopic is a very flexible library, so you can swap out many of the components.

You can easily grab a dataset from Hugging Face and extract the text you want to use for training a topic model. For example, we can train a topic model on the German subset of the [amazon_reviews_multi](https://huggingface.co/datasets/amazon_reviews_multi) dataset.

In [None]:
%pip install datasets

In [None]:
from datasets import load_dataset

dataset = load_dataset("amazon_reviews_multi", "de")


In [None]:
docs = dataset['train']['review_body']

In [None]:
docs[0:5]

['Armband ist leider nach 1 Jahr kaputt gegangen',
 'In der Lieferung war nur Ein Akku!',
 'Ein Stern, weil gar keine geht nicht. Es handelt sich um gebraucht Waren, die Stein haben so ein Belag drauf, wo man sich dabei denken kann, dass jemand schon die benutzt und nicht Mal richtig gewaschen. Bei ein paar ist die Qualität Mangelhaft, siehe Bild. Ein habe ich ausprobiert, richtig gewaschen, dann verfärbt sich..... Wärme halt nicht lange. Deswegen wird es zurückgeschickt.',
 'Dachte, das wären einfach etwas festere Binden, vielleicht größere Always. Aber die Verpackung ist derartig riesig - wie als hätte man einen riesigen Karton Windeln gekauft... nicht das, was ich wollte ;-)',
 'Meine Kinder haben kaum damit gespielt und nach 6 Monaten riss es an der Naht obwohl ich sehr leichte Kinder habe.']

In [None]:
topic_model = BERTopic("german")

In [None]:
topics, probs = topic_model.fit_transform(docs)


2023-05-30 11:08:41,116 - BERTopic - Transformed documents to Embeddings
2023-05-30 11:13:31,147 - BERTopic - Reduced dimensionality
2023-05-30 11:13:57,557 - BERTopic - Clustered reduced embeddings
