---
library_name: transformers
tags:
- topic
- topic-classification
license: mit
datasets:
- valurank/Topic_Classification
language:
- en
metrics:
- accuracy
- f1
- precision
- recall
base_model:
- distilbert/distilbert-base-uncased
---

# Model Card for Topic Classification Model

A fine-tuned DistilBERT model for multi-class topic classification. This model predicts the most relevant topic label from a predefined set based on input text. It was trained using 🤗 Transformers and PyTorch on a custom dataset derived from academic and news-style corpora.

## Model Details

### Model Description

This model was developed by Daniel (@AfroLogicInsect) to classify text into one of several predefined topics. It builds on the `distilbert-base-uncased` architecture and was fine-tuned for multi-class classification using a softmax output layer.

- **Developed by:** Daniel 🇳🇬 (@AfroLogicInsect)
- **Model type:** DistilBERT-based multi-class sequence classifier
- **Language(s):** English
- **License:** MIT
- **Finetuned from:** distilbert-base-uncased

### Model Sources

- **Repository:** [AfroLogicInsect/topic-model-analysis-model](https://huggingface.co/AfroLogicInsect/topic-model-analysis-model)
- **Paper:** arXiv:1910.01108 (DistilBERT)
- **Demo:** [Coming soon]

## Uses

### Direct Use

- Classify academic or news-style text into topics such as AI, finance, sports, climate, etc.
- Embed in dashboards or content moderation tools for automatic tagging

### Downstream Use

- Can be extended to hierarchical topic classification
- Useful for building recommendation engines or content filters

### Out-of-Scope Use

- Not suitable for sentiment or emotion classification
- May not generalize well to informal or slang-heavy text

## Bias, Risks, and Limitations

- Trained on curated corpora — may reflect biases in source material
- Topics are predefined and static — emerging topics may be misclassified
- Confidence scores are probabilistic, not definitive

### Recommendations

- Pass `top_k=5` (or `top_k=None` for scores on all labels) to the pipeline to retrieve multiple topic predictions; `return_all_scores=True` is deprecated in recent 🤗 Transformers releases
- Consider fine-tuning on domain-specific data for improved accuracy

## How to Get Started

```python
from transformers import pipeline

# top_k=None returns scores for every label
# (replaces the deprecated return_all_scores=True)
classifier = pipeline(
    "text-classification",
    model="AfroLogicInsect/topic-model-analysis-model",
    top_k=None,
)

text = "New AI breakthrough in natural language processing"
results = classifier(text)

# Sort explicitly by score and keep the five most likely topics
top_5 = sorted(results[0], key=lambda x: x["score"], reverse=True)[:5]
for i, res in enumerate(top_5):
    print(f"Top {i+1}: {res['label']} ({res['score']:.3f})")
```

## Training Details

### Dataset

- Custom multi-class topic dataset based on arXiv abstracts and news articles
- Labels include domains like AI, finance, sports, climate, etc.

### Hyperparameters

- Epochs: 3
- Batch size: 16
- Learning rate: 2e-5
- Evaluation every 200 steps
- Metric: F1 score

### Trainer Setup

Training used the Hugging Face `Trainer` API with `TrainingArguments` configured for periodic evaluation, early stopping, and best-model selection.
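A minimal sketch of a comparable setup, using the hyperparameters listed above. The variable names `train_ds`, `eval_ds`, and `num_labels` are illustrative placeholders, not taken from the model repository:

```python
# Illustrative Trainer configuration; not the exact script used for this model.
# train_ds, eval_ds, and num_labels are placeholders you must supply.
from transformers import (
    AutoModelForSequenceClassification,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=num_labels
)

args = TrainingArguments(
    output_dir="topic-model",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    eval_strategy="steps",        # "evaluation_strategy" in older releases
    eval_steps=200,               # evaluate every 200 steps
    save_strategy="steps",
    save_steps=200,
    load_best_model_at_end=True,  # restore the best checkpoint when done
    metric_for_best_model="f1",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```

Note that `EarlyStoppingCallback` requires `load_best_model_at_end=True` and a `metric_for_best_model` to know which checkpoint to keep.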

## Evaluation

The model achieved strong performance across multiple topic categories on its held-out evaluation set:

- **Accuracy:** ~90.8%
- **F1 Score:** ~0.91
- **Precision:** ~0.89
- **Recall:** ~0.93
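These aggregate metrics can be reproduced from model predictions with a `compute_metrics` helper of the kind commonly passed to `Trainer`. This is an assumed implementation using scikit-learn, not code from the model repository:

```python
# Hypothetical compute_metrics helper (weighted averaging assumed) showing how
# accuracy, F1, precision, and recall of the kind reported above are computed.
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)  # predicted class per example
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="weighted", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1,
        "precision": precision,
        "recall": recall,
    }

# Toy sanity check with 3 classes and 4 examples
logits = np.array([[2.0, 0.1, 0.1], [0.1, 2.0, 0.1], [0.1, 0.1, 2.0], [2.0, 0.1, 0.1]])
labels = np.array([0, 1, 2, 1])
print(compute_metrics((logits, labels)))  # accuracy 0.75
```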

## Environmental Impact

- **Hardware:** Google Colab (NVIDIA T4 GPU)
- **Training Time:** ~2.5 hours
- **Carbon Emitted:** ~0.3 kg CO₂eq (estimated via [ML Impact Calculator](https://mlco2.github.io/impact#compute))

## Citation

```bibtex
@misc{afrologicinsect2025topicmodel,
  title = {AfroLogicInsect Topic Classification Model},
  author = {Akan Daniel},
  year = {2025},
  howpublished = {\url{https://huggingface.co/AfroLogicInsect/topic-model-analysis-model}},
}
```

## Contact

- Name: Daniel (@AfroLogicInsect)
- Location: Lagos, Nigeria
- Contact: GitHub / Hugging Face / email ([email protected])