---
license: mit
language:
- en
tags:
- topic-modeling
datasets:
- CCRss/arxiv_papers_cs
---




# Top2Vec Scientific Texts Model

![MindMap](markmap-main.png)

This repository hosts the `top2vec_scientific_texts` model, a specialized Top2Vec model trained on scientific texts for topic modeling and semantic search.

## Model Overview

The `top2vec_scientific_texts` model is built for analyzing scientific literature. It leverages the Universal Sentence Encoder for embedding texts and uses Top2Vec for topic modeling.

### Key Features:

- **Domain-Specific:** Tailored for scientific texts.
- **Base Model:** Utilizes the Universal Sentence Encoder for effective text embeddings.
- **Topic Modeling:** Employs Top2Vec for discovering topics in scientific documents.

## Installation

To use the model, you need to install the following dependencies:

```bash
pip install top2vec
pip install "top2vec[sentence_encoders]"
pip install tensorflow==2.8.0
pip install tensorflow-probability==0.16.0
```

## Model Training Process

The entire process of model training, dataset creation, and visualization is documented in the `main.ipynb` Jupyter notebook. To explore the code and replicate the results:

- Open the `main.ipynb` notebook in Jupyter Lab or Jupyter Notebook.
- Execute the cells in sequence to run different stages of the analysis.
- The results, including thematic group analysis, trend analysis, and visualizations of interest dynamics over the years, are presented in the form of tables and graphs within the notebook.

For more details, please refer to the `main.ipynb` notebook in this repository.


## Usage

The example below trains a new Top2Vec model on your own documents and saves it to disk:

```python
from top2vec import Top2Vec

# Load your documents (a list of strings)
docs = ["Document 1 text", "Document 2 text", ...]

# Train the Top2Vec model
model = Top2Vec(
    documents=docs,
    speed='learn',  # only used by the doc2vec embedding model
    workers=80,  # number of worker threads; adjust to your machine
    embedding_model='universal-sentence-encoder',
    umap_args={'n_neighbors': 15, 'n_components': 5, 'metric': 'cosine', 'min_dist': 0.0, 'random_state': 42},
    hdbscan_args={'min_cluster_size': 15, 'metric': 'euclidean', 'cluster_selection_method': 'eom'}
)

# Save the model
model.save('top2vec_scientific_texts_model')
```

## Dataset

The model was trained on a dataset of scientific abstracts sourced from [arXiv](https://arxiv.org/). The dataset covers a range of topics within the field of computer science from 2010 to 2024.

You can access the dataset [arxiv_papers_cs](https://huggingface.co/datasets/CCRss/arxiv_papers_cs).

## Use Cases

The `top2vec_scientific_texts` model can be used for various purposes, including:

- **Topic Discovery:** Identify the main topics within a collection of scientific texts.
- **Semantic Search:** Find documents that are semantically similar to a query text.
- **Trend Analysis:** Analyze the evolution of topics over time.
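
As a rough illustration of the first two use cases, the sketch below wraps a few Top2Vec methods (`get_topic_sizes`, `get_topics`, `search_documents_by_keywords`) in small helpers. The helper names `summarize_topics` and `find_similar_documents` are hypothetical and not part of this repository; they accept any trained Top2Vec model, for example one restored with `Top2Vec.load`.

```python
def summarize_topics(model, top_n_words=5):
    """Return one (topic_num, size, leading_words) tuple per topic.

    `model` is assumed to be a trained Top2Vec model.
    """
    topic_sizes, size_nums = model.get_topic_sizes()
    topic_words, _word_scores, topic_nums = model.get_topics()
    # Map topic number -> size so both results are matched explicitly.
    size_by_topic = {int(n): int(s) for n, s in zip(size_nums, topic_sizes)}
    return [
        (int(num), size_by_topic[int(num)], list(words[:top_n_words]))
        for num, words in zip(topic_nums, topic_words)
    ]


def find_similar_documents(model, keywords, num_docs=5):
    """Return (doc_id, score) pairs for documents closest to the keywords.

    Every keyword must be in the model's vocabulary, otherwise Top2Vec
    raises an error.
    """
    doc_scores, doc_ids = model.search_documents_by_keywords(
        keywords=keywords, num_docs=num_docs, return_documents=False
    )
    return [(doc_id, float(score)) for doc_id, score in zip(doc_ids, doc_scores)]


# Usage (assumes a model file saved as in the Usage section exists on disk):
# from top2vec import Top2Vec
# model = Top2Vec.load("top2vec_scientific_texts_model")
# for row in summarize_topics(model):
#     print(row)
# print(find_similar_documents(model, ["uav", "rescue"], num_docs=3))
```

Passing `return_documents=False` keeps the result small and also works when the model was saved without the original documents.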

## Examples

Here are some examples of the model's output for the thematic group "UAV in Disasters and Emergency":

### Trend Analysis for "UAV in Disasters and Emergency"

![Trend Analysis](disasters_and_emergency_plot.png)

This graph shows the trend of interest in the use of UAVs in disaster and emergency situations over time.
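
This README does not show how the notebook derives the yearly counts behind such a plot. One plausible sketch, assuming every document has a publication year and a single assigned topic number (for example, as returned by Top2Vec's `get_documents_topics`), and assuming a thematic group is a set of topic numbers, is given below. The function name `publications_per_year` and the sample values are hypothetical.

```python
from collections import Counter


def publications_per_year(doc_years, doc_topics, group_topics):
    """Count documents per year whose topic belongs to a thematic group.

    doc_years:    publication year of each document
    doc_topics:   topic number assigned to each document (same order)
    group_topics: topic numbers that make up the thematic group
    """
    group = set(group_topics)
    counts = Counter(
        year for year, topic in zip(doc_years, doc_topics) if topic in group
    )
    # Sort by year so the result can be plotted or tabulated directly.
    return dict(sorted(counts.items()))


# Illustrative values only, not taken from the dataset.
years = [2010, 2010, 2011, 2012, 2012]
topics = [3, 5, 3, 7, 3]
print(publications_per_year(years, topics, group_topics={3, 7}))
```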

### Key Metrics Table

Analysis for Thematic Group: Disasters & Emergency

|   Year |   Number of Publications |   Growth Acceleration |   Change in Number of Publications | Relative Growth   |
|-------:|-------------------------:|----------------------:|-----------------------------------:|:------------------|
|   2010 |                       19 |                     0 |                                  0 | 0.0%              |
|   2011 |                       15 |                    -4 |                                 -4 | -21.05%           |
|   2012 |                       28 |                    17 |                                 13 | 86.67%            |
|   2013 |                       38 |                    -3 |                                 10 | 35.71%            |
|   2014 |                       28 |                   -20 |                                -10 | -26.32%           |
|   2015 |                       47 |                    29 |                                 19 | 67.86%            |
|   2016 |                       63 |                    -3 |                                 16 | 34.04%            |
|   2017 |                       94 |                    15 |                                 31 | 49.21%            |
|   2018 |                      173 |                    48 |                                 79 | 84.04%            |
|   2019 |                      266 |                    14 |                                 93 | 53.76%            |
|   2020 |                      337 |                   -22 |                                 71 | 26.69%            |
|   2021 |                      380 |                   -28 |                                 43 | 12.76%            |
|   2022 |                      453 |                    30 |                                 73 | 19.21%            |
|   2023 |                      509 |                   -17 |                                 56 | 12.36%            |

## Contributions

We welcome contributions to the `top2vec_scientific_texts` model. If you have suggestions or improvements, or run into any issues, please open an issue or submit a pull request.

## License

This project is licensed under the MIT License.