devrim committed on
Commit 2f3914e · verified
1 Parent(s): b81b7a2

Update README.md

Files changed (1):
  1. README.md +34 -1

README.md CHANGED
@@ -7,4 +7,37 @@ sdk: static
  pinned: false
  ---

- Edit this `README.md` markdown file to author your organization card.
+ # NONWESTLIT
+
+ Project codebase for the paper [A multi-level multi-label text classification dataset of 19th century Ottoman and Russian literary and critical texts](https://aclanthology.org/2024.findings-acl.393/).
+
+ The objectives:
+
+ - Linear probing of SOTA LLMs (e.g. Llama-2, Falcon).
+ - Fine-tuning adapters (e.g. LoRA).
+
+ ## Citation
+
+ If you use the dataset or code in your research, please cite our paper:
+
+ ```bibtex
+ @inproceedings{gokceoglu-etal-2024-multi,
+ title = "A multi-level multi-label text classification dataset of 19th century Ottoman and {R}ussian literary and critical texts",
+ author = {Gokceoglu, Gokcen and
+ {\c{C}}avu{\c{s}}o{\u{g}}lu, Devrim and
+ Akbas, Emre and
+ Dolcerocca, {\"O}zen},
+ editor = "Ku, Lun-Wei and
+ Martins, Andre and
+ Srikumar, Vivek",
+ booktitle = "Findings of the Association for Computational Linguistics: ACL 2024",
+ month = aug,
+ year = "2024",
+ address = "Bangkok, Thailand",
+ publisher = "Association for Computational Linguistics",
+ url = "https://aclanthology.org/2024.findings-acl.393/",
+ doi = "10.18653/v1/2024.findings-acl.393",
+ pages = "6585--6596",
+ abstract = "This paper introduces a multi-level, multi-label text classification dataset comprising over 3000 documents. The dataset features literary and critical texts from 19th-century Ottoman Turkish and Russian. It is the first study to apply large language models (LLMs) to this dataset, sourced from prominent literary periodicals of the era. The texts have been meticulously organized and labeled. This was done according to a taxonomic framework that takes into account both their structural and semantic attributes. Articles are categorized and tagged with bibliometric metadata by human experts. We present baseline classification results using a classical bag-of-words (BoW) naive Bayes model and three modern LLMs: multilingual BERT, Falcon, and Llama-v2. We found that in certain cases, Bag of Words (BoW) outperforms Large Language Models (LLMs), emphasizing the need for additional research, especially in low-resource language settings. This dataset is expected to be a valuable resource for researchers in natural language processing and machine learning, especially for historical and low-resource languages. The dataset is publicly available."
+ }
+ ```
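
The linear-probing objective named in the README can be illustrated with a small, self-contained sketch. It is not taken from the project codebase. It assumes the frozen model's embeddings have already been extracted, so the toy two-dimensional vectors below merely stand in for them. `train_linear_probe` and `predict` are hypothetical helper names. One independent logistic-regression head is trained per label, which is one common way to set up a multi-label probe and may differ from what the paper's code does.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(features, labels, epochs=100, lr=0.5):
    """Fit one logistic-regression head per label on fixed (frozen) features.

    Only the probe weights are updated; the features are never modified,
    which is what distinguishes probing from fine-tuning.
    """
    n = len(features)
    n_feat = len(features[0])
    n_lab = len(labels[0])
    weights = [[0.0] * n_feat for _ in range(n_lab)]
    biases = [0.0] * n_lab
    for _ in range(epochs):
        for k in range(n_lab):
            grad_w = [0.0] * n_feat
            grad_b = 0.0
            for x, y in zip(features, labels):
                score = sum(w * xi for w, xi in zip(weights[k], x)) + biases[k]
                err = sigmoid(score) - y[k]  # gradient of binary cross-entropy
                for j in range(n_feat):
                    grad_w[j] += err * x[j]
                grad_b += err
            # Full-batch gradient descent step for label k.
            for j in range(n_feat):
                weights[k][j] -= lr * grad_w[j] / n
            biases[k] -= lr * grad_b / n
    return weights, biases


def predict(features, weights, biases, threshold=0.5):
    """Return a 0/1 decision per label for each feature vector."""
    predictions = []
    for x in features:
        row = []
        for w_row, b in zip(weights, biases):
            score = sum(w * xi for w, xi in zip(w_row, x)) + b
            row.append(int(sigmoid(score) >= threshold))
        predictions.append(row)
    return predictions


# Toy stand-ins for frozen embeddings (assumption: real ones would come
# from an LLM's hidden states). Label 0 follows the sign of the first
# dimension, label 1 the sign of the second.
features = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
labels = [[1, 1], [1, 0], [0, 1], [0, 0]]

weights, biases = train_linear_probe(features, labels)
predictions = predict(features, weights, biases)
```

In practice the feature vectors would be much higher-dimensional and the probe would usually be trained with a deep-learning framework, but the separation of roles is the same: the representation stays fixed and only the linear heads learn.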