Spaces: b289zhan/OntoChat


Bohui Zhang committed on Feb 22, 2024

Commit 9257999

1 Parent(s): 1a4d7a2

Update the fourth version


Files changed (4)

app.py +22 -37
data/music_meta_cqs.txt +28 -0
ontochat/analysis.py +19 -84
ontochat/functions.py +69 -26

app.py CHANGED

@@ -37,6 +37,18 @@ with gr.Blocks() as user_story_interface:
                 label="Chatbot input",
                 placeholder="Please type your message here and press Enter to interact with the chatbot :)"
             )
         user_story = gr.TextArea(
             label="User story",
             interactive=True
@@ -56,33 +68,9 @@ with gr.Blocks() as cq_interface:
     gr.Markdown(
         """
         # OntoChat
-        This is the second step of OntoChat. Please copy the generated user story from the previous
-        step and use it here. You can also modify the user story before using it for generating competency questions.
-        **Recommended prompt workflow:**
-        1. Obtain competency questions from the user story.
-        - Zero-shot learning:
-            - Prompt template: Given the user story: {user story}, generate {number} competency questions base on it.
-        - Few-shot learning (i.e., provide examples to give more instructions on how to generate competency questions):
-            - Prompt template: Here are some good examples of competency questions generated from example data.
-              Formatted in {"Example data": "Competency questions"}.
-              {"Yesterday was performed by Armando Rocca.": "Who performs the song?"},
-              {"The Church was built in 1619.": "When (what year) was the building built?"},
-              {"The Church is located in a periurban context.": "In which context is the building located?"},
-              {"The mounting system of the bells is the falling clapper.": "Which is the mounting system of the bell?"}
-        2. Clean and refine competency questions.
-        - Obtain multiple competency questions.
-            - Prompt template: Take the generated competency questions and check if any of them can be divided into
-              multiple questions. If they do, split the competency question into multiple competency questions. If it
-              does not, leave the competency question as it is. For example, the competency question "Who wrote The
-              Hobbit and in what year was the book written?" must be split into two competency questions: "Who wrote
-              the book?" and "In what year was the book written?". Another example is the competency question, "When
-              was the person born?". This competency question cannot be divided into multiple questions.
-        - Remove specific named entities.
-            - Prompt template: Take the competency questions and check if they contain real-world entities, like
-              "Freddy Mercury" or "1837". If they do, change those real-world entities from these competency questions
-              to more general concepts. For example, the competency question "Which is the author of Harry Potter?"
-              should be changed to "Which is the author of the book?". Similarly, the competency question "Who wrote
-              the book in 2018?" should be changed to "Who wrote the book, and in what year was the book written?"
         """
     )
@@ -100,7 +88,8 @@ with gr.Blocks() as cq_interface:
         with gr.Column():
             cq_chatbot = gr.Chatbot([
                 [None, "I am OntoChat, your conversational ontology engineering assistant. Here is the second step of "
-                 "the system. Please give me your user story and tell me how many competency questions you want."]
             ])
             cq_input = gr.Textbox(
                 label="Chatbot input",
@@ -145,18 +134,14 @@ clustering_interface = gr.Interface(
         ),
         gr.Dropdown(
             value="LLM clustering",
-            choices=["LLM clustering", "Agglomerative clustering", "HDBSCAN"],
             label="Clustering method",
             info="Please select the clustering method."
         ),
-        gr.Slider(
-            minimum=2,
-            maximum=50,
-            step=1,
-            label="Number of clusters",
-            info="Please select the number of clusters you want to generate. Please note that for HDBSCAN, this value "
-                 "is used as the minimum size of a cluster. And please do not input a number that exceeds the total "
-                 "number of competency questions."
         )
     ],
     outputs=[

                 label="Chatbot input",
                 placeholder="Please type your message here and press Enter to interact with the chatbot :)"
             )
+            # gr.Markdown(
+            #     """
+            #     ### User story generation prompt
+            #     Click the button below to use a user story generation prompt that provides better instructions to the chatbot.
+            #     """
+            # )
+            # prompt_btn = gr.Button(value="User story generation prompt")
+            # prompt_btn.click(
+            #     fn=load_user_story_prompt,
+            #     inputs=[],
+            #     outputs=[user_story_input]
+            # )
         user_story = gr.TextArea(
             label="User story",
             interactive=True
     gr.Markdown(
         """
         # OntoChat
+        This is the second step of OntoChat. This functionality provides support for the extraction of competency
+        questions from a user story. Please, provide a user story to start extracting competency questions with the
+        chatbot, or simply load the example story below.
         """
     )
         with gr.Column():
             cq_chatbot = gr.Chatbot([
                 [None, "I am OntoChat, your conversational ontology engineering assistant. Here is the second step of "
+                       "the system. Please give me your user story and tell me how many competency questions you want "
+                       "me to generate from the user story."]
             ])
             cq_input = gr.Textbox(
                 label="Chatbot input",
         ),
         gr.Dropdown(
             value="LLM clustering",
+            choices=["LLM clustering", "Agglomerative clustering"],
             label="Clustering method",
             info="Please select the clustering method."
         ),
+        gr.Textbox(
+            label="Number of clusters (optional for LLM clustering)",
+            info="Please input the number of clusters you want to generate. And please do not input a number that "
+                 "exceeds the total number of competency questions."
         )
     ],
     outputs=[

data/music_meta_cqs.txt ADDED

	@@ -0,0 +1,28 @@

+Which is the composer of a musical piece?
+Is the composer of a musical piece known?
+Which are the members of a music ensemble?
+Which role a music artist played within a music ensemble?
+In which time interval has a music artist been a member of a music ensemble?
+Where was a music ensemble formed?
+Which award was a music artist nominated for?
+Which award was received by a music artist?
+Which music artists has a music artist been influenced by?
+Which music artist has a music artist collaborated with?
+Which is the start date of the activity of a music artist?
+Which is the end date of the activity of a music artist?
+Which is the name of a music artist?
+Which is the alias of a music artist?
+Which is the language of the name/alias of a music artist?
+Which music dataset has a music algorithm been trained on?
+Which is the process that led to the creation of a musical piece?
+In which time interval did the creation process took place?
+Where did the creation process took place?
+Which are the creative actions composing the creation process of a musical piece?
+Which task was executed by a creative action?
+Which are the parts of a musical piece?
+Which collection is a musical piece member of?
+Where was a musical piece performed?
+When was a musical piece performed?
+Which music artists took part to a musical performance?
+Which is the recording process that recorded a musical performance?
+Which is the recording produced by a recording process?
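Note: the questions in this file carry no leading index (no "1. " prefix). That seems to be why this commit comments out the index-stripping line in `preprocess_competency_questions`: taking element `[1]` after `re.split(r'\.\s', cq, 1)` raises an `IndexError` when a question has no such prefix. A minimal sketch of a parser that accepts both numbered and unnumbered input follows. The helper name `parse_competency_questions` is hypothetical and not part of the commit.

```python
import re

# Hypothetical helper, not part of the commit: strips an optional "N. " index.
_INDEX_PREFIX = re.compile(r"^\s*\d+\.\s+")


def parse_competency_questions(text):
    """Return one cleaned question per non-empty line."""
    questions = []
    for line in text.splitlines():
        # Remove a leading index such as "12. " only when it is present.
        line = _INDEX_PREFIX.sub("", line).strip()
        if line:
            questions.append(line)
    return questions


sample = (
    "1. Which is the composer of a musical piece?\n"
    "\n"
    "Where was a music ensemble formed?\n"
)
print(parse_competency_questions(sample))
```

Only a leading run of digits followed by a period and whitespace is removed, so a period later in a question is left untouched.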

ontochat/analysis.py CHANGED

@@ -26,7 +26,7 @@ def preprocess_competency_questions(cqs):
     # # keep index
     # cqs = [re.split(r'\.\s', cq, 1) for cq in cqs]
     # cqs = [{cq[0]: cq[1]} for cq in cqs]
-    cqs = [re.split(r'\.\s', cq, 1)[1] for cq in cqs]
     # clean
     cleaned_cqs = []
@@ -139,81 +139,6 @@ def plot_dendrogram(model, **kwargs):
     return Image.open(buf)
-def hdbscan_clustering(cqs, embeddings, min_cluster_size=2):
-    """
-    :param cqs:
-    :param embeddings:
-    :param min_cluster_size:
-    :return:
-    """
-    clusterer = HDBSCAN(
-        min_cluster_size=min_cluster_size
-    )
-    clusterer.fit(embeddings)
-    cluster_assignment = clusterer.labels_
-    clustered_cqs = defaultdict(list)
-    for sentence_id, cluster_id in enumerate(cluster_assignment):
-        clustered_cqs[str(cluster_id)].append(cqs[sentence_id])
-    fig, axis = plt.subplots(1, 1)
-    image = plot_hdbscan_scatter(embeddings, cluster_assignment, parameters={"scale": 3, "eps": 0.9}, ax=axis)
-    return clustered_cqs, image
-def plot_hdbscan_scatter(data, labels, probabilities=None, parameters=None, ground_truth=False, ax=None):
-    """
-    source: https://scikit-learn.org/stable/auto_examples/cluster/plot_hdbscan.html
-    :param data:
-    :param labels:
-    :param probabilities:
-    :param parameters:
-    :param ground_truth:
-    :param ax:
-    :return:
-    """
-    if ax is None:
-        _, ax = plt.subplots(figsize=(10, 4))
-    labels = labels if labels is not None else np.ones(data.shape[0])
-    probabilities = probabilities if probabilities is not None else np.ones(data.shape[0])
-    # Black removed and is used for noise instead.
-    unique_labels = set(labels)
-    colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]
-    # The probability of a point belonging to its labeled cluster determines
-    # the size of its marker
-    proba_map = {idx: probabilities[idx] for idx in range(len(labels))}
-    for k, col in zip(unique_labels, colors):
-        if k == -1:
-            # Black used for noise.
-            col = [0, 0, 0, 1]
-        class_index = np.where(labels == k)[0]
-        for ci in class_index:
-            ax.plot(
-                data[ci, 0],
-                data[ci, 1],
-                "x" if k == -1 else "o",
-                markerfacecolor=tuple(col),
-                markeredgecolor="k",
-                markersize=4 if k == -1 else 1 + 5 * proba_map[ci],
-            )
-    n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
-    preamble = "True" if ground_truth else "Estimated"
-    title = f"{preamble} number of clusters: {n_clusters_}"
-    if parameters is not None:
-        parameters_str = ", ".join(f"{k}={v}" for k, v in parameters.items())
-        title += f" | {parameters_str}"
-    ax.set_title(title)
-    plt.tight_layout()
-    fig = plt.gcf()
-    buf = io.BytesIO()
-    fig.savefig(buf)
-    buf.seek(0)
-    return Image.open(buf)
 def response_parser(response):
     try:
         response = ast.literal_eval(response)
@@ -222,7 +147,7 @@ def response_parser(response):
     return response
-def llm_cq_clustering(cqs: str, n_clusters: int, api_key, paraphrase_detection=False):
     """
     :param cqs:
@@ -241,21 +166,31 @@ def llm_cq_clustering(cqs: str, n_clusters: int, api_key, paraphrase_detection=F
                    "Return a Python list of duplicate competency questions.".format(cqs)
         conversation_history.append({"role": "user", "content": prompt_1})
-        response = chat_completion(conversation_history)
         print("{} CQs remaining after paraphrase detection.".format(len(cqs) - len(response_parser(response))))
         # 2. clustering
-        prompt_2 = f"Clustering the competency questions into {n_clusters} clusters based on their topics. " \
-                   "Keep the granularity of the topic in each cluster at a similar level. " \
-                   "Return in JSON format, such as: {'cluster 1 topic': " \
-                   "['competency question 1', 'competency question 2']}:"
         conversation_history.append({"role": "assistant", "content": response})  # previous response
         conversation_history.append({"role": "user", "content": prompt_2})
-        response = chat_completion(conversation_history)
         # print("Output is: \"{}\"".format(response))
     else:  # clustering only
-        prompt_2 = f"Given the competency questions: {cqs}, clustering them into {n_clusters} clusters based on the topics."
         prompt_2 += "Keep the granularity of the topic in each cluster at a similar level. " \
                     "Return in JSON format, such as: {'cluster 1 topic': " \
                     "['competency question 1', 'competency question 2']}:"

     # # keep index
     # cqs = [re.split(r'\.\s', cq, 1) for cq in cqs]
     # cqs = [{cq[0]: cq[1]} for cq in cqs]
+    # cqs = [re.split(r'\.\s', cq, 1)[1] for cq in cqs]
     # clean
     cleaned_cqs = []
     return Image.open(buf)
 def response_parser(response):
     try:
         response = ast.literal_eval(response)
     return response
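Note: the clustering prompts ask for "JSON format", but their example uses single quotes, which is a Python literal rather than valid JSON. `response_parser` handles this with `ast.literal_eval`; its `except` branch is not shown in this diff. The sketch below is one possible parser that tries strict JSON first and falls back to a Python literal. The name `parse_llm_structure` and the raw-string fallback are assumptions, not the repository's behaviour.

```python
import ast
import json


def parse_llm_structure(response):
    """Hypothetical parser: JSON first, then Python literal, else the raw input."""
    try:
        return json.loads(response)
    except (json.JSONDecodeError, TypeError):
        pass  # not strict JSON (for example, single-quoted keys)
    try:
        # literal_eval only evaluates literals, so it does not execute code.
        return ast.literal_eval(response)
    except (ValueError, SyntaxError, TypeError):
        return response  # assumption: hand back the unparsed text


print(parse_llm_structure("{'cluster 1 topic': ['competency question 1']}"))
```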
+def llm_cq_clustering(cqs, n_clusters, api_key, paraphrase_detection=False):
     """
     :param cqs:
                    "Return a Python list of duplicate competency questions.".format(cqs)
         conversation_history.append({"role": "user", "content": prompt_1})
+        response = chat_completion(api_key, conversation_history)
         print("{} CQs remaining after paraphrase detection.".format(len(cqs) - len(response_parser(response))))
         # 2. clustering
+        if n_clusters:
+            prompt_2 = f"Clustering the competency questions into {n_clusters} clusters based on their topics. " \
+                        "Keep the granularity of the topic in each cluster at a similar level. " \
+                        "Return in JSON format, such as: {'cluster 1 topic': " \
+                        "['competency question 1', 'competency question 2']}:"
+        else:
+            prompt_2 = f"Clustering the competency questions into clusters based on their topics. " \
+                       "Keep the granularity of the topic in each cluster at a similar level. " \
+                       "Return in JSON format, such as: {'cluster 1 topic': " \
+                       "['competency question 1', 'competency question 2']}:"
         conversation_history.append({"role": "assistant", "content": response})  # previous response
         conversation_history.append({"role": "user", "content": prompt_2})
+        response = chat_completion(api_key, conversation_history)
         # print("Output is: \"{}\"".format(response))
     else:  # clustering only
+        if n_clusters:
+            prompt_2 = f"Given the competency questions: {cqs}, clustering them into {n_clusters} clusters based on " \
+                       f"the topics."
+        else:
+            prompt_2 = f"Given the competency questions: {cqs}, clustering them into clusters based on the topics."
         prompt_2 += "Keep the granularity of the topic in each cluster at a similar level. " \
                     "Return in JSON format, such as: {'cluster 1 topic': " \
                     "['competency question 1', 'competency question 2']}:"

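Note: the new side builds four near-identical prompt strings. In the clustering-only branch the first part ends with "topics." and the appended part starts with "Keep", so the two sentences are joined without a space. A minimal sketch of a single helper that covers all four cases and inserts the separator is shown below. `build_clustering_prompt` and `FORMAT_HINT` are hypothetical names, and the wording is copied from the diff.

```python
FORMAT_HINT = (
    "Keep the granularity of the topic in each cluster at a similar level. "
    "Return in JSON format, such as: {'cluster 1 topic': "
    "['competency question 1', 'competency question 2']}:"
)


def build_clustering_prompt(n_clusters=None, cqs=None):
    """Hypothetical helper that assembles the clustering prompt in one place."""
    # An empty or missing n_clusters lets the model choose the cluster count.
    target = f"{n_clusters} clusters" if n_clusters else "clusters"
    if cqs is None:
        # Questions are already in the conversation (paraphrase-detection path).
        head = f"Clustering the competency questions into {target} based on their topics."
    else:
        head = (
            f"Given the competency questions: {cqs}, "
            f"clustering them into {target} based on the topics."
        )
    return head + " " + FORMAT_HINT


print(build_clustering_prompt(n_clusters=3))
```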
ontochat/functions.py CHANGED

@@ -5,7 +5,7 @@ Interface functions
 import json
 from ontochat.chatbot import chat_completion, build_messages
-from ontochat.analysis import compute_embeddings, agglomerative_clustering, hdbscan_clustering, llm_cq_clustering
 from ontochat.verbaliser import verbalise_ontology
@@ -27,7 +27,9 @@ def user_story_generator(message, history):
                    "Persona: What are the name, occupation, skills and interests of the user? 2. The Goal: What is "
                    "the goal of the user? Are they facing specific issues? 3. Example Data: Do you have examples of "
                    "the specific data available? Make sure you have answers to all three questions before providing "
-                   "a user story. Only ask the next question once I have responded. And you should also ask questions "
                    "to elaborate on more information after the user provides the initial information, and ask for "
                    "feedback and suggestions after the user story is generated."
     }]
@@ -37,12 +39,42 @@ def user_story_generator(message, history):
         "content": message
     })
     bot_message = chat_completion(openai_api_key, instructions + messages)
-    # post-processing response
     history.append([message, bot_message])
-    print(history)
     return bot_message, history, ""
 def cq_generator(message, history):
     """
     generate competency questions based on the user story
@@ -51,25 +83,35 @@ def cq_generator(message, history):
     :param history:
     :return:
     """
-    if (len(history)) == 1:  # initial round
-        messages = [
-            {
-                "role": "system",
-                "content": "I am OntoChat, your conversational ontology engineering assistant. Here is the second step "
-                           "of the system. Please give me your user story and tell me how many competency questions "
-                           "you want."
-            }, {
-                "role": "user",
-                "content": message
-            }
-        ]
-    else:
-        messages = build_messages(history)
-        messages.append({
-            "role": "user",
-            "content": message
-        })
-    bot_message = chat_completion(openai_api_key, messages)
     history.append([message, bot_message])
     return bot_message, history, ""
@@ -89,15 +131,16 @@ def clustering_generator(cqs, cluster_method, n_clusters):
     :param cqs:
     :param cluster_method:
-    :param n_clusters:
     :return:
     """
     cqs, cq_embeddings = compute_embeddings(cqs)
     if cluster_method == "Agglomerative clustering":
         cq_clusters, cluster_image = agglomerative_clustering(cqs, cq_embeddings, n_clusters)
-    elif cluster_method == "HDBSCAN":
-        cq_clusters, cluster_image = hdbscan_clustering(cqs, cq_embeddings, n_clusters)
     else:  # cluster_method == "LLM clustering"
         cq_clusters, cluster_image = llm_cq_clustering(cqs, n_clusters, openai_api_key)

 import json
 from ontochat.chatbot import chat_completion, build_messages
+from ontochat.analysis import compute_embeddings, agglomerative_clustering, llm_cq_clustering
 from ontochat.verbaliser import verbalise_ontology
                    "Persona: What are the name, occupation, skills and interests of the user? 2. The Goal: What is "
                    "the goal of the user? Are they facing specific issues? 3. Example Data: Do you have examples of "
                    "the specific data available? Make sure you have answers to all three questions before providing "
+                   "a user story. The user story should be written in the following structure: title, persona, goal, "
+                   "scenario (where the user could use a structured knowledge base to help with their work), and "
+                   "example data. Only ask the next question once I have responded. And you should also ask questions "
                    "to elaborate on more information after the user provides the initial information, and ask for "
                    "feedback and suggestions after the user story is generated."
     }]
         "content": message
     })
     bot_message = chat_completion(openai_api_key, instructions + messages)
     history.append([message, bot_message])
     return bot_message, history, ""
+# def load_user_story_prompt():
+#     """
+#
+#     :return:
+#     """
+#     prompt = """
+#     Now create the full user story.The user story should be written in the following structure:
+#
+#     Title: Which topics are covered by the user story?
+#
+#     Persona: What is the occupation of the user and what are their goals?
+#
+#     Goal:
+#     Keywords: provide 5-10 keywords related to the user story
+#     Provide the issues a user is facing and how our application can help reach their goals.
+#
+#     Scenario:
+#     Write out a scenario, where the user could use a structured knowledge base to help with their work.
+#
+#     Example Data:
+#
+#     Think of a list of requirements and provide example data for each requirement. Structure the example data by requirements
+#     Example data should by simple sentences.
+#     These are possible formats:
+#     One sonata is a “Salmo alla Romana”.
+#     A concert played in San Pietro di Sturla for exhibition was recorded by ethnomusicologist Mauro Balma in 1994.
+#     The Church of San Pietro di Sturla is located in Carasco, Genova Province.
+#     The Sistema Ligure is described in the text “Campanari, campane e campanili di Liguria” By Mauro Balma, 1996.
+#     """
+#     return prompt
 def cq_generator(message, history):
     """
     generate competency questions based on the user story
     :param history:
     :return:
     """
+    instructions = [{
+        "role": "system",
+        "content": "You are a conversational ontology engineering assistant."
+    }, {
+        "role": "user",
+        "content": "Here are instructions for you on how to generate high-quality competency questions. First, here "
+                   "are some good examples of competency questions generated from example data. Who performs the song? "
+                   "from the data Yesterday was performed by Armando Rocca, When (what year) was the building built? "
+                   "from the data The Church was built in 1619, In which context is the building located? from the "
+                   "data The Church is located in a periurban context. Second, how to make them less complex. Take the "
+                   "generated competency questions and check if any of them can be divided into multiple questions. If "
+                   "they do, split the competency question into multiple competency questions. If it does not, leave "
+                   "the competency question as it is. For example, the competency question Who wrote The Hobbit and in "
+                   "what year was the book written? must be split into two competency questions: Who wrote the book? "
+                   "and In what year was the book written?. Another example is the competency question, When was the "
+                   "person born?. This competency question cannot be divided into multiple questions. Third, how to "
+                   "remove real entities to abstract them. Take the competency questions and check if they contain "
+                   "real-world entities, like Freddy Mercury or 1837. If they do, change those real-world entities "
+                   "from these competency questions to more general concepts. For example, the competency question "
+                   "Which is the author of Harry Potter? should be changed to Which is the author of the book?. "
+                   "Similarly, the competency question Who wrote the book in 2018? should be changed to Who wrote the "
+                   "book, and in what year was the book written?"
+    }]
+    messages = build_messages(history)
+    messages.append({
+        "role": "user",
+        "content": message
+    })
+    bot_message = chat_completion(openai_api_key, instructions + messages)
     history.append([message, bot_message])
     return bot_message, history, ""
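Note: `cq_generator` now sends fixed instructions on every call, followed by the prior turns and the new message, instead of special-casing the first round. The sketch below illustrates that ordering. `build_messages` is not shown in this diff, so `history_to_messages` is a hypothetical stand-in that assumes the Gradio chatbot history is a list of `[user, bot]` pairs, with `None` for a missing side.

```python
def history_to_messages(history):
    """Hypothetical stand-in for build_messages: [[user, bot], ...] -> chat messages."""
    messages = []
    for user_text, bot_text in history:
        if user_text is not None:
            messages.append({"role": "user", "content": user_text})
        if bot_text is not None:
            messages.append({"role": "assistant", "content": bot_text})
    return messages


def assemble_request(instructions, history, message):
    """Fixed instructions first, then prior turns, then the new user message."""
    new_turn = [{"role": "user", "content": message}]
    return instructions + history_to_messages(history) + new_turn


instructions = [
    {"role": "system", "content": "You are a conversational ontology engineering assistant."}
]
history = [[None, "I am OntoChat, your conversational ontology engineering assistant."]]
request = assemble_request(instructions, history, "Here is my user story.")
print([m["role"] for m in request])
```

Because the lists are concatenated rather than extended in place, neither `instructions` nor `history` is modified by building the request.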
     :param cqs:
     :param cluster_method:
+    :param n_clusters: default ''
     :return:
     """
+    if n_clusters:
+        n_clusters = int(n_clusters)
     cqs, cq_embeddings = compute_embeddings(cqs)
     if cluster_method == "Agglomerative clustering":
         cq_clusters, cluster_image = agglomerative_clustering(cqs, cq_embeddings, n_clusters)
     else:  # cluster_method == "LLM clustering"
         cq_clusters, cluster_image = llm_cq_clustering(cqs, n_clusters, openai_api_key)
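Note: the cluster count now comes from a free-text `gr.Textbox`, so `int(n_clusters)` raises `ValueError` on non-numeric input, and an empty value is passed unchanged to `agglomerative_clustering`. Whether that function accepts an empty value is not shown in this diff. The sketch below shows one possible way to validate the field before dispatching. `parse_n_clusters` and its `n_items` parameter are hypothetical, not part of the commit.

```python
def parse_n_clusters(raw, n_items=None):
    """Hypothetical validator: '' or None -> None; otherwise a positive int."""
    if raw is None or str(raw).strip() == "":
        return None  # caller decides whether a missing count is acceptable
    try:
        value = int(str(raw).strip())
    except ValueError:
        raise ValueError(
            f"Number of clusters must be a whole number, got {raw!r}."
        ) from None
    if value < 1:
        raise ValueError("Number of clusters must be at least 1.")
    if n_items is not None and value > n_items:
        raise ValueError(
            f"Number of clusters ({value}) exceeds the number of questions ({n_items})."
        )
    return value


print(parse_n_clusters(" 5 ", n_items=28))
```

A caller could require a non-`None` result for "Agglomerative clustering" and allow `None` for "LLM clustering", matching the "optional for LLM clustering" label in the interface.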