Yakobus Iryanto Prasethio committed
Commit 13cba26 · unverified
2 Parent(s): b7e61ac ac62fa8

Merge pull request #14 from Sistem-Cerdas-Recruitment/main

README.md CHANGED
@@ -1 +1,72 @@
- # interview-ai-detector
+ # Interview AI Detector
+
+ ## Overview
+
+ Interview AI Detector is a machine learning model designed to distinguish between human and AI-generated responses during interviews. The system is composed of two models:
+
+ 1. **ALBERT Model**: Processes text features extracted from responses.
+ 2. **Logistic Regression Model (LogReg)**: Utilizes the output from the ALBERT model along with additional behavioral features to make the final prediction.
+
+ The model is deployed on Google Vertex AI, with integration managed by a Kafka consumer deployed on Google Compute Engine. Both the model and the Kafka consumer use FastAPI for API management.
+
+ ## Architecture
+
+ ### ALBERT Model
+
+ - **Source**: HuggingFace
+ - **Input**: 25 numerical features extracted from the text, including:
+   - Part-of-Speech (POS) tags
+   - Readability scores
+   - Sentiment analysis
+   - Perplexity scores
+ - **Output**: Features used as input for the Logistic Regression model
+
+ ### Logistic Regression Model
+
+ - **Input**:
+   - Output from the ALBERT model
+   - 4 additional features, including typing behavior metrics such as backspace count and key presses per letter
+ - **Output**: Final prediction indicating whether the response is human or AI-generated
+
+ ## Deployment
+
+ - **Model Deployment**: Vertex AI
+ - **Kafka Consumer Deployment**: Compute Engine
+ - **API Framework**: FastAPI
+ - **Training**:
+   - **Epochs**: 8
+   - **Dataset**: 2000 data points (1000 human responses, 1000 AI-generated responses)
+   - **Framework**: PyTorch
+
+ ## Usage
+
+ ### API Endpoints
+
+ - **POST /predict**:
+   - **Description**: Receives a question-answer pair along with typing behavior metrics, runs the prediction pipeline, and returns the result.
+   - **Input**:
+     ```json
+     {
+       "question": "Your question text",
+       "answer": "The given answer",
+       "backspace_count": 5,
+       "letter_click_counts": {"a": 27, "b": 4, "c": 9, "d": 17, "e": 54, "f": 12, "g": 4, "h": 15, "i": 25, "j": 2, "k": 2, "l": 14, "m": 10, "n": 23, "o": 23, "p": 9, "q": 1, "r": 24, "s": 19, "t": 36, "u": 9, "v": 6, "w": 8, "x": 1, "y": 7, "z": 0}
+     }
+     ```
+   - **Output**:
+     ```json
+     {
+       "predicted_class": "HUMAN" or "AI",
+       "main_model_probability": "0.85",
+       "secondary_model_probability": "0.75",
+       "confidence": "High Confidence" or "Partially Confident" or "Low Confidence"
+     }
+     ```
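For illustration, a call to this endpoint could look like the following snippet. This is not repository code; the base URL and port are assumptions for a deployment that exposes the container's port 8080.

```python
import requests

# Hypothetical client call; replace the base URL with the deployed service address.
payload = {
    "question": "Your question text",
    "answer": "The given answer",
    "backspace_count": 5,
    "letter_click_counts": {"a": 27, "e": 54, "t": 36},  # trimmed for brevity; the README example lists all 26 letters
}

response = requests.post("http://localhost:8080/predict", json=payload)
print(response.json())  # e.g. {"predicted_class": "HUMAN", "confidence": "High Confidence", ...}
```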
+
+ ## Limitations
+
+ - The model is not designed for retraining. The current implementation focuses solely on deployment and prediction.
+ - The repository is meant for deployment purposes only and does not support local installation for development.
+
+ ## Author
+ Yakobus Iryanto Prasethio
cloudbuild-model.yaml → cloudbuild.yaml RENAMED
@@ -4,10 +4,13 @@ steps:
    args:
      [
        "build",
+       "--build-arg",
+       "HF_TOKEN=${_HF_TOKEN}",
        "-t",
        "us-central1-docker.pkg.dev/${PROJECT_ID}/interview-ai-detector/model-prediction:latest",
        ".",
      ]
+   secretEnv: ["HF_TOKEN"]

  - name: "gcr.io/cloud-builders/docker"
    args:
@@ -18,3 +21,8 @@ steps:

images:
  - "us-central1-docker.pkg.dev/${PROJECT_ID}/interview-ai-detector/model-prediction:latest"
+
+ availableSecrets:
+   secretManager:
+     - versionName: "projects/${PROJECT_ID}/secrets/HF_TOKEN/versions/1"
+       env: "HF_TOKEN"
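The build now reads HF_TOKEN from Secret Manager. To confirm that the secret version referenced above resolves, one could use the same Secret Manager client API that the old gemma2b_dependencies.py used (illustrative only; the project ID below is a placeholder):

```python
# Illustrative check that the Secret Manager entry referenced in cloudbuild.yaml exists.
from google.cloud import secretmanager

client = secretmanager.SecretManagerServiceClient()
name = "projects/YOUR_PROJECT_ID/secrets/HF_TOKEN/versions/1"  # placeholder project ID
response = client.access_secret_version(request={"name": name})
print("HF_TOKEN secret payload is", len(response.payload.data), "bytes long")
```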
core-model-prediction/Dockerfile CHANGED
@@ -1,3 +1,6 @@
+ # HF Token args
+ ARG HF_TOKEN
+
  # Use an official Python runtime as a base image
  FROM pytorch/pytorch:2.1.2-cuda12.1-cudnn8-runtime

@@ -17,6 +20,16 @@ RUN python -m nltk.downloader punkt wordnet averaged_perceptron_tagger
  # Unzip wordnet
  RUN unzip /root/nltk_data/corpora/wordnet.zip -d /root/nltk_data/corpora/

+ # Download HuggingFace model
+ RUN python -c "from transformers import AutoTokenizer, AutoModelForCausalLM; \
+     tokenizer = AutoTokenizer.from_pretrained('google/gemma-2b', token='$HF_TOKEN'); \
+     model = AutoModelForCausalLM.from_pretrained('google/gemma-2b', token='$HF_TOKEN'); \
+     tokenizer.save_pretrained('/app/gemma-2b'); \
+     model.save_pretrained('/app/gemma-2b')"
+
+ # Model env
+ ENV MODEL_DIR=/app/gemma-2b
+
  # Make port 8080 available to the world outside this container
  EXPOSE 8080

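For readability, the new RUN step above is roughly equivalent to the following standalone script. This is a sketch only; in the Dockerfile the token comes from the HF_TOKEN build argument rather than an environment variable.

```python
# Sketch of what the Docker build step bakes into the image; not a file in this repo.
import os
from transformers import AutoTokenizer, AutoModelForCausalLM

hf_token = os.environ.get("HF_TOKEN")  # stand-in for the build argument used in the Dockerfile

# Download google/gemma-2b once at build time...
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b", token=hf_token)
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", token=hf_token)

# ...and save a local copy so the runtime code can load it from MODEL_DIR without a token.
tokenizer.save_pretrained("/app/gemma-2b")
model.save_pretrained("/app/gemma-2b")
```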
core-model-prediction/gemma2b_dependencies.py CHANGED
@@ -1,10 +1,10 @@
+ import os
  from transformers import AutoTokenizer, AutoModelForCausalLM
  import torch
  from torch.nn.functional import cosine_similarity
  from collections import Counter
  import numpy as np
  from device_manager import DeviceManager
- from google.cloud import secretmanager


  class Gemma2BDependencies:
@@ -13,21 +13,13 @@ class Gemma2BDependencies:
      def __new__(cls):
          if cls._instance is None:
              cls._instance = super(Gemma2BDependencies, cls).__new__(cls)
-             token = cls._instance.access_hf_token_secret()
-             cls._instance.tokenizer = AutoTokenizer.from_pretrained(
-                 "google/gemma-2b", token=token)
-             cls._instance.model = AutoModelForCausalLM.from_pretrained(
-                 "google/gemma-2b", token=token)
+             model_dir = os.getenv("MODEL_DIR", "/app/gemma-2b")
+             cls._instance.tokenizer = AutoTokenizer.from_pretrained(model_dir)
+             cls._instance.model = AutoModelForCausalLM.from_pretrained(model_dir)
              cls._instance.device = DeviceManager()
              cls._instance.model.to(cls._instance.device)
          return cls._instance

-     def access_hf_token_secret(self):
-         client = secretmanager.SecretManagerServiceClient()
-         name = "projects/ta-2-sistem-cerdas/secrets/HF_TOKEN/versions/1"
-         response = client.access_secret_version(request={"name": name})
-         return response.payload.data.decode('UTF-8')
-
      def calculate_perplexity(self, text: str):
          inputs = self.tokenizer(text, return_tensors="pt",
                                  truncation=True, max_length=1024)
@@ -42,7 +34,6 @@ class Gemma2BDependencies:
          return perplexity.item()

      def calculate_burstiness(self, text: str):
-         # Tokenize the text using GPT-2 tokenizer
          tokens = self.tokenizer.encode(text, add_special_tokens=False)

          # Count token frequencies
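Since the singleton now loads the weights baked into the image, a minimal usage sketch (not from this commit; class and method names as defined above) would be:

```python
# Minimal usage sketch: the singleton reuses the tokenizer/model loaded from MODEL_DIR.
from gemma2b_dependencies import Gemma2BDependencies

deps = Gemma2BDependencies()       # first call loads from MODEL_DIR (default /app/gemma-2b)
same_deps = Gemma2BDependencies()  # subsequent calls return the same instance
assert deps is same_deps

text = "Sample answer text to score."
print(deps.calculate_perplexity(text))
print(deps.calculate_burstiness(text))
```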
public-prediction/kafka_consumer.py CHANGED
@@ -52,7 +52,6 @@ def send_results_back(full_results: dict[str, any], job_application_id: str):

      response = requests.patch(url, json=body, headers=headers)
      print(f"Data sent with status code {response.status_code}")
-     print(response.content)


  def consume_messages():
@@ -62,6 +61,7 @@ def consume_messages():
          auto_offset_reset='earliest',
          client_id="ai-detector-1",
          group_id="ai-detector",
+         api_version=(0, 10, 2)
      )

      print("Successfully connected to Kafka at", os.environ.get("KAFKA_IP"))
@@ -71,7 +71,7 @@ def consume_messages():

      for message in consumer:
          try:
-             incoming_message = json.loads(message.value.decode("utf-8"))
+             incoming_message = json.loads(json.loads(message.value.decode("utf-8")))
              full_batch = incoming_message["data"]
          except json.JSONDecodeError:
              print("Failed to decode JSON from message:", message.value)
@@ -83,6 +83,7 @@ def consume_messages():

      full_results = []
      for i in range(0, len(full_batch), BATCH_SIZE):
+         print(f"Processing batch {i} to {i+BATCH_SIZE}")
          batch = full_batch[i:i+BATCH_SIZE]
          batch_results = process_batch(batch, BATCH_SIZE, gpt_helper)
          full_results.extend(batch_results)
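The switch to a double json.loads suggests the incoming Kafka value is JSON that has itself been serialized twice. A small illustration of that assumption (not repository code):

```python
import json

# Producer side (assumed): the payload dict is dumped twice, yielding a JSON string of a JSON string.
payload = {"data": [{"question": "q", "answer": "a"}]}
value = json.dumps(json.dumps(payload)).encode("utf-8")

# Consumer side mirrors the two decode steps used in consume_messages above.
incoming_message = json.loads(json.loads(value.decode("utf-8")))
assert incoming_message["data"][0]["answer"] == "a"
```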
public-prediction/predict_custom_model.py CHANGED
@@ -1,4 +1,5 @@
  from typing import Dict, List, Union
+ import os
  from google.cloud import aiplatform
  from google.protobuf import json_format
  from google.protobuf.struct_pb2 import Value
@@ -19,13 +20,9 @@ def predict_custom_trained_model(
      # The AI Platform services require regional API endpoints.
      client_options = {"api_endpoint": api_endpoint}

-     credentials = service_account.Credentials.from_service_account_file(
-         "steady-climate-416810-ea1536e1868c.json")
      # Initialize client that will be used to create and send requests.
      # This client only needs to be created once, and can be reused for multiple requests.
-     client = aiplatform.gapic.PredictionServiceClient(
-         credentials=credentials,
-         client_options=client_options)
+     client = aiplatform.gapic.PredictionServiceClient(client_options=client_options)
      # The format of each instance should conform to the deployed model's prediction input schema.
      instances = instances if isinstance(instances, list) else [instances]
      instances = [
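With the hard-coded service-account key removed, the Vertex AI client falls back to Application Default Credentials: the attached service account on Compute Engine, or GOOGLE_APPLICATION_CREDENTIALS / `gcloud auth application-default login` when running elsewhere. A quick way to confirm credentials are available (not part of this commit):

```python
# Verify Application Default Credentials before calling predict_custom_trained_model.
import google.auth

credentials, project_id = google.auth.default()
print(f"Using Application Default Credentials for project: {project_id}")
```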