
Retrieval-based learning chatbot

CSC525 - Module 8 Option 2 - Retrieval-based Learning Chatbot - Joseph Armani

TODO

A Python tool to generate high-quality dialogue variations.

This package automatically downloads the following models during installation:

  • Universal Sentence Encoder v4 (TensorFlow Hub)
  • ChatGPT Paraphraser T5-base
  • Helsinki-NLP translation models (en-de, de-es, es-en)
  • spaCy en_core_web_sm, en_core_web_md
  • NLTK wordnet and averaged_perceptron_tagger_eng models
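The three Helsinki-NLP pairs listed above (en-de, de-es, es-en) suggest a round trip from English through German and Spanish and back to English. The sketch below only illustrates how such a chain could be composed. The names `back_translate` and `make_toy_translator` and the word tables are hypothetical stand-ins, not this package's API, and a real pipeline would call the translation models instead of dictionary lookups.

```python
from typing import Callable, Dict, List

# A "translator" here is any function that maps a string to a string.
Translator = Callable[[str], str]


def back_translate(text: str, hops: List[Translator]) -> str:
    """Pass text through each translation hop in order and return the result."""
    for hop in hops:
        text = hop(text)
    return text


def make_toy_translator(table: Dict[str, str]) -> Translator:
    """Build a word-by-word lookup 'translator' (a stand-in for a real MT model)."""
    def translate(text: str) -> str:
        # Words missing from the table pass through unchanged.
        return " ".join(table.get(word, word) for word in text.split())
    return translate


# Hypothetical toy vocabularies, assumed only for this illustration.
en_de = make_toy_translator({"hello": "hallo", "friend": "freund"})
de_es = make_toy_translator({"hallo": "hola", "freund": "amigo"})
es_en = make_toy_translator({"hola": "hi", "amigo": "buddy"})

augmented = back_translate("hello friend", [en_de, de_es, es_en])
print(augmented)
```

Keeping each hop as a plain callable means the chain order can change, or a hop can be swapped for a real model wrapper, without touching `back_translate`.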

Install package

pip install -e .

On Linux with CUDA/GPU: pip install "faiss-gpu>=1.7.0"

Description

This Python script demonstrates a complete dialogue augmentation pipeline covering validation, optimization, and data augmentation. It processes a dataset of dialogues and generates multiple high-quality augmented versions of each one, using two text augmentation techniques: paraphrasing and back-translation.

Quality control checks evaluate the augmented text against thresholds for semantic similarity, grammar, fluency, diversity, and content preservation, and the pipeline ensures that duplicate dialogues are not generated. Very short texts such as greetings and farewells receive special handling: their variations are predefined and filtered for quality.
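The duplicate check and the semantic-similarity threshold described above can be sketched as a single filtering step. This is a minimal illustration under stated assumptions, not the package's implementation: `toy_embed` is a hypothetical bag-of-words stand-in for a sentence encoder such as the Universal Sentence Encoder, and the 0.5 threshold is an arbitrary example value.

```python
import math
from collections import Counter
from typing import List


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for a sentence embedding: lowercase word counts."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_candidates(original: str, candidates: List[str],
                      min_similarity: float = 0.5) -> List[str]:
    """Drop duplicates (case-insensitive) and candidates that drift too far."""
    seen = {original.strip().lower()}  # the original counts as already seen
    reference = toy_embed(original)
    kept = []
    for candidate in candidates:
        key = candidate.strip().lower()
        if key in seen:
            continue  # duplicate of the original or an earlier candidate
        seen.add(key)
        if cosine_similarity(reference, toy_embed(candidate)) >= min_similarity:
            kept.append(candidate)
    return kept


kept = filter_candidates(
    "I would like to book a table",
    [
        "I would like to book a table",  # identical to the original
        "I want to book a table",        # close paraphrase
        "i want to book a table",        # duplicate after normalization
        "The weather is nice",           # unrelated content
    ],
)
print(kept)
```

Only the close paraphrase survives here. A real pipeline would add the other checks named above (grammar, fluency, diversity, content preservation) as further predicates in the same loop.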

References

Accsany, P. (2024). Working with JSON data in Python. Real Python. https://realpython.com/python-json/

Explosion AI Team. (n.d.). Spacy · industrial-strength natural language processing in python. https://spacy.io/

GeeksforGeeks. (2024). Text augmentation techniques in NLP. GeeksforGeeks. https://www.geeksforgeeks.org/text-augmentation-techniques-in-nlp/

Helsinki-NLP. (2024). Opus-MT [Computer software]. GitHub. https://github.com/Helsinki-NLP/Opus-MT

Hugging Face. (n.d.). Transformers. Hugging Face. https://huggingface.co/docs/transformers/en/index

Humarin. (2023). ChatGPT paraphraser on T5-base [Computer software]. Hugging Face. https://huggingface.co/humarin/chatgpt_paraphraser_on_T5_base

Keita, Z. (2022). Data augmentation in NLP using back-translation with MarianMT. Towards Data Science. https://towardsdatascience.com/data-augmentation-in-nlp-using-back-translation-with-marianmt-a8939dfea50a

Memgraph. (2023). Cosine similarity in Python with scikit-learn. Memgraph. https://memgraph.com/blog/cosine-similarity-python-scikit-learn

Morris, J. (n.d.). language-tool-python (Version 2.8.1) [Computer software]. PyPI. https://pypi.org/project/language-tool-python/

TensorFlow. (n.d.). Universal sentence encoder. TensorFlow Hub. https://www.tensorflow.org/hub/tutorials/semantic_similarity_with_tf_hub_universal_encoder

Waheed, A. (2023). How to calculate ROUGE score in Python. Python Code. https://thepythoncode.com/article/calculate-rouge-score-in-python