	# Retrieval-based learning chatbot

	CSC525 - Module 8 Option 2 - Retrieval-based Learning Chatbot - Joseph Armani

	## TODO

	A Python tool for generating high-quality dialogue variations.

	This package automatically downloads the following models during installation:

	- Universal Sentence Encoder v4 (TensorFlow Hub)
	- ChatGPT Paraphraser T5-base
	- Helsinki-NLP translation models (en-de, de-es, es-en)
	- GPT-2 (for perplexity scoring)
	- spaCy en_core_web_sm
	- NLTK wordnet corpus and averaged_perceptron_tagger_eng model
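
The three Helsinki-NLP models listed above suggest a round-trip chain of English to German, German to Spanish, and Spanish back to English. The following is a minimal sketch of such a chain, not the package's actual code. It assumes the Hugging Face `transformers` MarianMT classes (with `torch` and `sentencepiece` installed) and the `Helsinki-NLP/opus-mt-*` checkpoint names; `marian_translate`, `back_translate`, and `HOPS` are hypothetical names chosen for illustration.

```python
from typing import Callable, Sequence

# Assumed hop order, matching the en-de, de-es, es-en models listed above.
HOPS = ("en-de", "de-es", "es-en")


def marian_translate(text: str, pair: str) -> str:
    """Translate `text` with one MarianMT checkpoint.

    Assumption: checkpoints are named Helsinki-NLP/opus-mt-<pair>.
    Weights are downloaded on first use, so this needs network access.
    """
    # Imported lazily so the rest of the sketch runs without transformers.
    from transformers import MarianMTModel, MarianTokenizer

    name = f"Helsinki-NLP/opus-mt-{pair}"
    tokenizer = MarianTokenizer.from_pretrained(name)
    model = MarianMTModel.from_pretrained(name)
    batch = tokenizer([text], return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    return tokenizer.decode(generated[0], skip_special_tokens=True)


def back_translate(
    text: str,
    translate: Callable[[str, str], str] = marian_translate,
    hops: Sequence[str] = HOPS,
) -> str:
    """Pass `text` through each translation hop in order and return the result."""
    for pair in hops:
        text = translate(text, pair)
    return text
```

Passing the translator in as a parameter keeps the chain logic separate from model loading, so the hop order can be checked with a stub function before any weights are downloaded. A real implementation would likely cache each tokenizer and model rather than reloading them per call.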

	## Install package

	pip install -e .

	## Description

	This Python script implements a complete dialogue augmentation pipeline that covers validation, optimization, and data augmentation.
	It takes a dataset of dialogues and generates multiple high-quality augmented versions of each one.
	Two text augmentation techniques are used: paraphrasing and back-translation. Each augmented text is then scored with quality metrics.
	Very short texts such as greetings and farewells receive special handling: their variations are predefined and filtered for quality.
	The pipeline does not generate duplicate dialogues, and it keeps only output that meets quality thresholds for semantic similarity, grammar, fluency, diversity, and content preservation.
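
The duplicate check and the semantic-similarity threshold described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the package's implementation: the function names, the `embed` callback (standing in for an encoder such as Universal Sentence Encoder), and the `0.8` threshold are all hypothetical. The `perplexity` helper shows one common way a fluency score can be derived from per-token log-probabilities of a language model such as GPT-2.

```python
import math
from typing import Callable, Iterable, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity as exp of the mean negative log-likelihood per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants count as duplicates."""
    return " ".join(text.lower().split())


def filter_candidates(
    original: str,
    candidates: Iterable[str],
    embed: Callable[[str], Sequence[float]],
    min_similarity: float = 0.8,  # hypothetical threshold, not from the source
) -> List[str]:
    """Keep candidates that are new and semantically close to the original."""
    seen = {normalize(original)}
    original_vec = embed(original)
    kept: List[str] = []
    for candidate in candidates:
        key = normalize(candidate)
        if key in seen:
            continue  # duplicate of the original or of an earlier candidate
        if cosine_similarity(original_vec, embed(candidate)) < min_similarity:
            continue  # meaning drifted too far from the original
        seen.add(key)
        kept.append(candidate)
    return kept
```

In a fuller pipeline, the same loop would also apply the grammar, fluency, diversity, and content-preservation checks, each with its own threshold.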

	## References

	Acsany, P. (2024). Working with JSON data in Python. Real Python. <https://realpython.com/python-json/>
	Explosion AI Team. (n.d.). spaCy: Industrial-strength natural language processing in Python. <https://spacy.io/>
	GeeksforGeeks. (2024). Text augmentation techniques in NLP. GeeksforGeeks. <https://www.geeksforgeeks.org/text-augmentation-techniques-in-nlp/>
	Helsinki-NLP. (2024). Opus-MT [Computer software]. GitHub. <https://github.com/Helsinki-NLP/Opus-MT>
	Hugging Face. (n.d.). Transformers. Hugging Face. <https://huggingface.co/docs/transformers/en/index>
	Humarin. (2023). ChatGPT paraphraser on T5-base [Computer software]. Hugging Face. <https://huggingface.co/humarin/chatgpt_paraphraser_on_T5_base>
	Keita, Z. (2022). Data augmentation in NLP using back-translation with MarianMT. Towards Data Science. <https://towardsdatascience.com/data-augmentation-in-nlp-using-back-translation-with-marianmt-a8939dfea50a>
	Memgraph. (2023). Cosine similarity in Python with scikit-learn. Memgraph. <https://memgraph.com/blog/cosine-similarity-python-scikit-learn>
	Morris, J. (n.d.). language-tool-python (Version 2.8.1) [Computer software]. PyPI. <https://pypi.org/project/language-tool-python/>
	TensorFlow. (n.d.). Universal sentence encoder. TensorFlow Hub. <https://www.tensorflow.org/hub/tutorials/semantic_similarity_with_tf_hub_universal_encoder>
	Waheed, A. (2023). How to calculate ROUGE score in Python. Python Code. <https://thepythoncode.com/article/calculate-rouge-score-in-python>