Yakobus Iryanto Prasethio committed
Commit 13cba26 · unverified
2 Parent(s): b7e61ac ac62fa8

Merge pull request #14 from Sistem-Cerdas-Recruitment/main

README.md CHANGED
@@ -1 +1,72 @@
- # interview-ai-detector
+ # Interview AI Detector
+
+ ## Overview
+
+ Interview AI Detector is a machine learning model designed to distinguish between human and AI-generated responses during interviews. The system is composed of two models:
+
+ 1. **ALBERT Model**: Processes text features extracted from responses.
+ 2. **Logistic Regression Model (LogReg)**: Utilizes the output from the ALBERT model along with additional behavioral features to make the final prediction.
+
+ The model is deployed on Google Vertex AI, with integration managed by a Kafka consumer deployed on Google Compute Engine. Both the model and the Kafka consumer use FastAPI for API management.
+
+ ## Architecture
+
+ ### ALBERT Model
+
+ - **Source**: HuggingFace
+ - **Input**: 25 numerical features extracted from the text, including:
+   - Part-of-Speech (POS) tags
+   - Readability scores
+   - Sentiment analysis
+   - Perplexity scores
+ - **Output**: Features used as input for the Logistic Regression model
+
+ ### Logistic Regression Model
+
+ - **Input**:
+   - Output from the ALBERT model
+   - 4 additional features, including typing behavior metrics such as backspace count and key presses per letter
+ - **Output**: Final prediction indicating whether the response is human or AI-generated
+
+ ## Deployment
+
+ - **Model Deployment**: Vertex AI
+ - **Kafka Consumer Deployment**: Compute Engine
+ - **API Framework**: FastAPI
+ - **Training**:
+   - **Epochs**: 8
+   - **Dataset**: 2000 data points (1000 human responses, 1000 AI-generated responses)
+   - **Framework**: PyTorch
+
+ ## Usage
+
+ ### API Endpoints
+
+ - **POST /predict**:
+   - **Description**: Receives a question-answer pair along with typing behavior metrics, runs the prediction pipeline, and returns the result.
+   - **Input**:
+     ```json
+     {
+       "question": "Your question text",
+       "answer": "The given answer",
+       "backspace_count": 5,
+       "letter_click_counts": {"a": 27, "b": 4, "c": 9, "d": 17, "e": 54, "f": 12, "g": 4, "h": 15, "i": 25, "j": 2, "k": 2, "l": 14, "m": 10, "n": 23, "o": 23, "p": 9, "q": 1, "r": 24, "s": 19, "t": 36, "u": 9, "v": 6, "w": 8, "x": 1, "y": 7, "z": 0}
+     }
+     ```
+   - **Output**:
+     ```json
+     {
+       "predicted_class": "HUMAN" or "AI",
+       "main_model_probability": "0.85",
+       "secondary_model_probability": "0.75",
+       "confidence": "High Confidence" or "Partially Confident" or "Low Confidence"
+     }
+     ```
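For illustration, a call to this endpoint could look like the following snippet. This is not repository code; the base URL and port are assumptions for a deployment that exposes the container's port 8080.

```python
import requests

# Hypothetical client call; replace the base URL with the deployed service address.
payload = {
    "question": "Your question text",
    "answer": "The given answer",
    "backspace_count": 5,
    "letter_click_counts": {"a": 27, "e": 54, "t": 36},  # trimmed for brevity; the README example lists all 26 letters
}

response = requests.post("http://localhost:8080/predict", json=payload)
print(response.json())  # e.g. {"predicted_class": "HUMAN", "confidence": "High Confidence", ...}
```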
+
+ ## Limitations
+
+ - The model is not designed for retraining. The current implementation focuses solely on deployment and prediction.
+ - The repository is meant for deployment purposes only and does not support local installation for development.
+
+ ## Author
+ Yakobus Iryanto Prasethio
cloudbuild-model.yaml → cloudbuild.yaml RENAMED
@@ -4,10 +4,13 @@ steps:
    args:
      [
        "build",
+       "--build-arg",
+       "HF_TOKEN=${_HF_TOKEN}",
        "-t",
        "us-central1-docker.pkg.dev/${PROJECT_ID}/interview-ai-detector/model-prediction:latest",
        ".",
      ]
+   secretEnv: ["HF_TOKEN"]

  - name: "gcr.io/cloud-builders/docker"
    args:
@@ -18,3 +21,8 @@ steps:

images:
  - "us-central1-docker.pkg.dev/${PROJECT_ID}/interview-ai-detector/model-prediction:latest"
+
+ availableSecrets:
+   secretManager:
+     - versionName: "projects/${PROJECT_ID}/secrets/HF_TOKEN/versions/1"
+       env: "HF_TOKEN"
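The build now reads HF_TOKEN from Secret Manager. To confirm that the secret version referenced above resolves, one could use the same Secret Manager client API that the old gemma2b_dependencies.py used (illustrative only; the project ID below is a placeholder):

```python
# Illustrative check that the Secret Manager entry referenced in cloudbuild.yaml exists.
from google.cloud import secretmanager

client = secretmanager.SecretManagerServiceClient()
name = "projects/YOUR_PROJECT_ID/secrets/HF_TOKEN/versions/1"  # placeholder project ID
response = client.access_secret_version(request={"name": name})
print("HF_TOKEN secret payload is", len(response.payload.data), "bytes long")
```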
core-model-prediction/Dockerfile CHANGED
@@ -1,3 +1,6 @@
+ # HF Token args
+ ARG HF_TOKEN
+
  # Use an official Python runtime as a base image
  FROM pytorch/pytorch:2.1.2-cuda12.1-cudnn8-runtime

@@ -17,6 +20,16 @@ RUN python -m nltk.downloader punkt wordnet averaged_perceptron_tagger
  # Unzip wordnet
  RUN unzip /root/nltk_data/corpora/wordnet.zip -d /root/nltk_data/corpora/

+ # Download HuggingFace model
+ RUN python -c "from transformers import AutoTokenizer, AutoModelForCausalLM; \
+     tokenizer = AutoTokenizer.from_pretrained('google/gemma-2b', token='$HF_TOKEN'); \
+     model = AutoModelForCausalLM.from_pretrained('google/gemma-2b', token='$HF_TOKEN'); \
+     tokenizer.save_pretrained('/app/gemma-2b'); \
+     model.save_pretrained('/app/gemma-2b')"
+
+ # Model env
+ ENV MODEL_DIR=/app/gemma-2b
+
  # Make port 8080 available to the world outside this container
  EXPOSE 8080

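For readability, the new RUN step above is roughly equivalent to the following standalone script. This is a sketch only; in the Dockerfile the token comes from the HF_TOKEN build argument rather than an environment variable.

```python
# Sketch of what the Docker build step bakes into the image; not a file in this repo.
import os
from transformers import AutoTokenizer, AutoModelForCausalLM

hf_token = os.environ.get("HF_TOKEN")  # stand-in for the build argument used in the Dockerfile

# Download google/gemma-2b once at build time...
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b", token=hf_token)
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", token=hf_token)

# ...and save a local copy so the runtime code can load it from MODEL_DIR without a token.
tokenizer.save_pretrained("/app/gemma-2b")
model.save_pretrained("/app/gemma-2b")
```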
core-model-prediction/gemma2b_dependencies.py CHANGED
@@ -1,10 +1,10 @@
+ import os
  from transformers import AutoTokenizer, AutoModelForCausalLM
  import torch
  from torch.nn.functional import cosine_similarity
  from collections import Counter
  import numpy as np
  from device_manager import DeviceManager
- from google.cloud import secretmanager


  class Gemma2BDependencies:
@@ -13,21 +13,13 @@ class Gemma2BDependencies:
      def __new__(cls):
          if cls._instance is None:
              cls._instance = super(Gemma2BDependencies, cls).__new__(cls)
-             token = cls._instance.access_hf_token_secret()
-             cls._instance.tokenizer = AutoTokenizer.from_pretrained(
-                 "google/gemma-2b", token=token)
-             cls._instance.model = AutoModelForCausalLM.from_pretrained(
-                 "google/gemma-2b", token=token)
+             model_dir = os.getenv("MODEL_DIR", "/app/gemma-2b")
+             cls._instance.tokenizer = AutoTokenizer.from_pretrained(model_dir)
+             cls._instance.model = AutoModelForCausalLM.from_pretrained(model_dir)
              cls._instance.device = DeviceManager()
              cls._instance.model.to(cls._instance.device)
          return cls._instance

-     def access_hf_token_secret(self):
-         client = secretmanager.SecretManagerServiceClient()
-         name = "projects/ta-2-sistem-cerdas/secrets/HF_TOKEN/versions/1"
-         response = client.access_secret_version(request={"name": name})
-         return response.payload.data.decode('UTF-8')
-
      def calculate_perplexity(self, text: str):
          inputs = self.tokenizer(text, return_tensors="pt",
                                  truncation=True, max_length=1024)
@@ -42,7 +34,6 @@ class Gemma2BDependencies:
          return perplexity.item()

      def calculate_burstiness(self, text: str):
-         # Tokenize the text using GPT-2 tokenizer
          tokens = self.tokenizer.encode(text, add_special_tokens=False)

          # Count token frequencies
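Since the singleton now loads the weights baked into the image, a minimal usage sketch (not from this commit; class and method names as defined above) would be:

```python
# Minimal usage sketch: the singleton reuses the tokenizer/model loaded from MODEL_DIR.
from gemma2b_dependencies import Gemma2BDependencies

deps = Gemma2BDependencies()       # first call loads from MODEL_DIR (default /app/gemma-2b)
same_deps = Gemma2BDependencies()  # subsequent calls return the same instance
assert deps is same_deps

text = "Sample answer text to score."
print(deps.calculate_perplexity(text))
print(deps.calculate_burstiness(text))
```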
public-prediction/kafka_consumer.py CHANGED
@@ -52,7 +52,6 @@ def send_results_back(full_results: dict[str, any], job_application_id: str):

      response = requests.patch(url, json=body, headers=headers)
      print(f"Data sent with status code {response.status_code}")
-     print(response.content)


  def consume_messages():
@@ -62,6 +61,7 @@ def consume_messages():
          auto_offset_reset='earliest',
          client_id="ai-detector-1",
          group_id="ai-detector",
+         api_version=(0, 10, 2)
      )

      print("Successfully connected to Kafka at", os.environ.get("KAFKA_IP"))
@@ -71,7 +71,7 @@ def consume_messages():

      for message in consumer:
          try:
-             incoming_message = json.loads(message.value.decode("utf-8"))
+             incoming_message = json.loads(json.loads(message.value.decode("utf-8")))
              full_batch = incoming_message["data"]
          except json.JSONDecodeError:
              print("Failed to decode JSON from message:", message.value)
@@ -83,6 +83,7 @@ def consume_messages():

      full_results = []
      for i in range(0, len(full_batch), BATCH_SIZE):
+         print(f"Processing batch {i} to {i+BATCH_SIZE}")
          batch = full_batch[i:i+BATCH_SIZE]
          batch_results = process_batch(batch, BATCH_SIZE, gpt_helper)
          full_results.extend(batch_results)
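The switch to a double json.loads suggests the incoming Kafka value is JSON that has itself been serialized twice. A small illustration of that assumption (not repository code):

```python
import json

# Producer side (assumed): the payload dict is dumped twice, yielding a JSON string of a JSON string.
payload = {"data": [{"question": "q", "answer": "a"}]}
value = json.dumps(json.dumps(payload)).encode("utf-8")

# Consumer side mirrors the two decode steps used in consume_messages above.
incoming_message = json.loads(json.loads(value.decode("utf-8")))
assert incoming_message["data"][0]["answer"] == "a"
```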
public-prediction/predict_custom_model.py CHANGED
@@ -1,4 +1,5 @@
  from typing import Dict, List, Union
+ import os
  from google.cloud import aiplatform
  from google.protobuf import json_format
  from google.protobuf.struct_pb2 import Value
@@ -19,13 +20,9 @@ def predict_custom_trained_model(
      # The AI Platform services require regional API endpoints.
      client_options = {"api_endpoint": api_endpoint}

-     credentials = service_account.Credentials.from_service_account_file(
-         "steady-climate-416810-ea1536e1868c.json")
      # Initialize client that will be used to create and send requests.
      # This client only needs to be created once, and can be reused for multiple requests.
-     client = aiplatform.gapic.PredictionServiceClient(
-         credentials=credentials,
-         client_options=client_options)
+     client = aiplatform.gapic.PredictionServiceClient(client_options=client_options)
      # The format of each instance should conform to the deployed model's prediction input schema.
      instances = instances if isinstance(instances, list) else [instances]
      instances = [
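With the hard-coded service-account key removed, the Vertex AI client falls back to Application Default Credentials: the attached service account on Compute Engine, or GOOGLE_APPLICATION_CREDENTIALS / `gcloud auth application-default login` when running elsewhere. A quick way to confirm credentials are available (not part of this commit):

```python
# Verify Application Default Credentials before calling predict_custom_trained_model.
import google.auth

credentials, project_id = google.auth.default()
print(f"Using Application Default Credentials for project: {project_id}")
```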