# Codingo - AI-Powered Smart Recruitment System

This repository contains the implementation of Codingo, an AI-powered online recruitment platform designed to automate and enhance the hiring process through a virtual HR assistant named LUNA.

## Project Overview

Codingo addresses the challenges of traditional recruitment processes by offering:
- Automated CV screening and skill-based shortlisting
- AI-led interviews through the virtual assistant LUNA
- Real-time cheating detection during assessments
- Gamified practice tools for candidates
- Secure administration interface for hiring managers

## Getting Started

This guide outlines the development process, starting with local model training before moving to AWS deployment.

### Prerequisites

- Python 3.8+
- pip (Python package manager)
- Git

### Development Process

We'll implement the project in phases. Phase 1 is covered in this guide; the later phases are outlined under Next Steps.

#### Phase 1: Local Training and Feature Extraction (Current Phase)

This initial phase focuses on building and training the model locally before AWS deployment.

### Project Structure

```
Codingo/
├── backend/                     # Flask API backend
│   ├── app.py                   # Flask server
│   ├── predict.py               # Predict using trained model
│   ├── train_model.py           # Model training script
│   ├── model/                   # Trained model artifacts
│   │   └── cv_classifier.pkl
│   └── utils/
│       ├── text_extractor.py    # PDF/DOCX to text
│       └── preprocessor.py      # Cleaning, tokenizing
│
├── data/
│   ├── training.csv             # Your training dataset
│   └── raw_cvs/                 # CV files (PDF/DOCX/txt)
│
├── notebooks/
│   └── eda.ipynb                # Data exploration & feature work
│
├── requirements.txt             # Python dependencies
└── README.md                    # Project overview
```

## Step-by-Step Implementation Guide

### Step 1: Create Training Dataset

Start by manually collecting ~50-100 CV-like text samples with position labels.

**File:** `data/training.csv`

Example format (wrap any `text` value that contains commas in double quotes):
```
text,position
"Experienced in Python, Flask, AWS",Backend Developer
"Built dashboards with React and TypeScript",Frontend Developer
"ML projects using pandas, scikit-learn",Data Scientist
```
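
Before training, it can help to check that the file really follows this format. The sketch below is not part of the project; the function name `summarize_training_csv` is hypothetical. It uses only the standard library to verify the header, reject empty fields, and count samples per position so that under-represented labels are easy to spot.

```python
import csv
import io
from collections import Counter


def summarize_training_csv(csv_text):
    """Validate CSV text in the `text,position` format and count rows per label."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["text", "position"]:
        raise ValueError(f"Unexpected header: {reader.fieldnames}")

    counts = Counter()
    for row_number, row in enumerate(reader, start=1):
        # Missing fields come back as None, so normalize before stripping
        text = (row["text"] or "").strip()
        position = (row["position"] or "").strip()
        if not text or not position:
            raise ValueError(f"Empty text or position in data row {row_number}")
        counts[position] += 1
    return counts


# The three example rows from this step, used here as in-memory input
sample = '''text,position
"Experienced in Python, Flask, AWS",Backend Developer
"Built dashboards with React and TypeScript",Frontend Developer
"ML projects using pandas, scikit-learn",Data Scientist
'''

label_counts = summarize_training_csv(sample)
for position, count in label_counts.items():
    print(f"{position}: {count}")
```

For the real dataset, the same function could be called with the contents of `data/training.csv` read as UTF-8 text.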

### Step 2: Train Model

Implement a classifier using scikit-learn to predict job roles from CV text.

**File:** `backend/train_model.py`

```python
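import os

import joblib
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Paths are relative to the repository root; run as `python backend/train_model.py`
DATA_PATH = 'data/training.csv'
MODEL_PATH = 'backend/model/cv_classifier.pkl'  # matches the project structure

# Load training data (expects `text` and `position` columns)
df = pd.read_csv(DATA_PATH)

# Define model pipeline: TF-IDF features (unigrams and bigrams) + logistic regression
model = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
    ('classifier', LogisticRegression(max_iter=1000))
])

# Train model
model.fit(df['text'], df['position'])

# Save model, creating the output directory if it does not exist yet
os.makedirs(os.path.dirname(MODEL_PATH), exist_ok=True)
joblib.dump(model, MODEL_PATH)

print("Model trained and saved successfully!")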
```
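
The training script fits on the whole dataset, so it gives no indication of how well the classifier generalizes. One possible check, sketched below, is stratified cross-validation on the same pipeline shape. The helper names `build_pipeline` and `cross_validate` and the toy samples are assumptions for illustration, not project code; with only 50-100 real samples, the scores will be noisy.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline


def build_pipeline():
    """Build the same TF-IDF + logistic regression pipeline as train_model.py."""
    return Pipeline([
        ('tfidf', TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
        ('classifier', LogisticRegression(max_iter=1000)),
    ])


def cross_validate(texts, labels, folds=2):
    """Return one accuracy score per stratified fold."""
    splitter = StratifiedKFold(n_splits=folds, shuffle=True, random_state=0)
    return cross_val_score(build_pipeline(), texts, labels, cv=splitter, scoring='accuracy')


# Toy data for illustration only; a real run would load data/training.csv
texts = [
    "Experienced in Python, Flask, AWS",
    "Designed REST APIs with Django and PostgreSQL",
    "Maintained Flask microservices on AWS",
    "Python backend services with Docker",
    "Built dashboards with React and TypeScript",
    "Responsive layouts with HTML, CSS and React",
    "TypeScript components for single-page apps",
    "Frontend work with React hooks and CSS",
    "ML projects using pandas, scikit-learn",
    "Trained models with scikit-learn and NumPy",
    "Data analysis with pandas and statistics",
    "Machine learning experiments in Jupyter with pandas",
]
labels = ["Backend Developer"] * 4 + ["Frontend Developer"] * 4 + ["Data Scientist"] * 4

scores = cross_validate(texts, labels, folds=2)
print(f"Accuracy per fold: {scores}")
print(f"Mean accuracy: {scores.mean():.2f}")
```

Each class needs at least as many samples as there are folds, which is why the toy data uses two folds.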

### Step 3: Test Prediction Locally

Create a script to verify your model works correctly.

**File:** `backend/predict.py`

```python
import sys

import joblib

# Path is relative to the repository root; run as `python backend/predict.py`
MODEL_PATH = 'backend/model/cv_classifier.pkl'  # matches the project structure


def predict_role(cv_text):
    """Return the predicted position and its probability for one CV text."""
    # Load the trained model
    model = joblib.load(MODEL_PATH)

    # Compute class probabilities once and pick the most likely class
    probabilities = model.predict_proba([cv_text])[0]
    best_index = probabilities.argmax()

    return {
        'predicted_position': model.classes_[best_index],
        'confidence': f"{probabilities[best_index] * 100:.2f}%"
    }


if __name__ == "__main__":
    if len(sys.argv) > 1:
        # Get CV text from command line argument
        cv_text = sys.argv[1]
    else:
        # Example CV text
        cv_text = "Experienced Python developer with 5 years of experience in Flask and AWS."

    result = predict_role(cv_text)
    print(f"Predicted Position: {result['predicted_position']}")
    print(f"Confidence: {result['confidence']}")
```
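
The script reports only the single most likely position. When two roles score closely, it may be more informative to list the runners-up as well. The helper below is a hypothetical addition (`top_k_positions` is not in the project) that ranks class labels by probability; the hard-coded values stand in for `model.classes_` and one row of `model.predict_proba(...)`.

```python
def top_k_positions(classes, probabilities, k=3):
    """Return the k most probable (position, probability) pairs, highest first."""
    ranked = sorted(zip(classes, probabilities), key=lambda pair: pair[1], reverse=True)
    return [(str(position), float(probability)) for position, probability in ranked[:k]]


# Hypothetical values standing in for model.classes_ and model.predict_proba([cv_text])[0]
classes = ["Backend Developer", "Data Scientist", "Frontend Developer"]
probabilities = [0.62, 0.27, 0.11]

for position, probability in top_k_positions(classes, probabilities, k=2):
    print(f"{position}: {probability * 100:.2f}%")
```

Converting to `str` and `float` keeps the result free of NumPy types, which makes it straightforward to return as JSON later.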

### Step 4: Add Text Extraction Utility

Create utilities to extract text from PDF, DOCX, and plain-text files.

**File:** `backend/utils/text_extractor.py`

```python
import os

import docx  # python-docx
import fitz  # PyMuPDF


def extract_text_from_pdf(path):
    """Extract text from a PDF file."""
    # The context manager closes the document even if extraction fails
    with fitz.open(path) as doc:
        return "".join(page.get_text() for page in doc).strip()


def extract_text_from_docx(path):
    """Extract text from a DOCX file."""
    doc = docx.Document(path)
    return "\n".join(paragraph.text for paragraph in doc.paragraphs).strip()


def extract_text(file_path):
    """Extract text from a PDF, DOCX, or TXT file."""
    extension = os.path.splitext(file_path)[1].lower()

    if extension == '.pdf':
        return extract_text_from_pdf(file_path)
    elif extension == '.docx':
        # python-docx reads only .docx; legacy .doc files must be converted first
        return extract_text_from_docx(file_path)
    elif extension == '.txt':
        with open(file_path, 'r', encoding='utf-8') as f:
            return f.read().strip()
    else:
        raise ValueError(f"Unsupported file extension: {extension}")
```
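
The project structure also lists `backend/utils/preprocessor.py` for cleaning and tokenizing, but this guide does not show its contents. The following is a minimal sketch of what such a module might contain, assuming regex-based cleanup is sufficient for Phase 1. The function names and patterns are illustrative assumptions, not the project's actual implementation.

```python
import re

URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')
EMAIL_PATTERN = re.compile(r'\S+@\S+\.\S+')
# Keeps technology names such as c++, c#, and node.js as single tokens
TOKEN_PATTERN = re.compile(r'[a-z0-9][a-z0-9+#.]*')


def clean_text(text):
    """Lowercase, drop URLs and email addresses, and collapse whitespace."""
    text = text.lower()
    text = URL_PATTERN.sub(' ', text)
    text = EMAIL_PATTERN.sub(' ', text)
    return re.sub(r'\s+', ' ', text).strip()


def tokenize(text):
    """Split cleaned text into tokens, trimming trailing periods."""
    return [token.rstrip('.') for token in TOKEN_PATTERN.findall(clean_text(text))]


sample = "Contact: jane@example.com\nSkills: Python, C++, Node.js. See https://example.com/cv"
print(clean_text(sample))
print(tokenize(sample))
```

Note that `TfidfVectorizer` already lowercases and tokenizes on its own, so this cleanup only takes effect if it is applied before training and prediction, for example by passing `clean_text` as the vectorizer's `preprocessor` argument.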

### Step 5: Add Flask API (Simple)

Create a basic Flask API to accept CV uploads and return predictions.

**File:** `backend/app.py`

```python
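import os

import joblib
from flask import Flask, request, jsonify
from werkzeug.utils import secure_filename

from utils.text_extractor import extract_text

# Resolve paths from this file so the server works from any working directory
BASE_DIR = os.path.dirname(os.path.abspath(__file__))
MODEL_PATH = os.path.join(BASE_DIR, "model", "cv_classifier.pkl")
UPLOAD_DIR = os.path.join(BASE_DIR, "..", "data", "raw_cvs")

# Ensure the upload directory exists
os.makedirs(UPLOAD_DIR, exist_ok=True)

app = Flask(__name__)
# Load the trained model once at startup (run train_model.py first)
model = joblib.load(MODEL_PATH)


@app.route("/predict", methods=["POST"])
def predict():
    if 'file' not in request.files:
        return jsonify({"error": "No file provided"}), 400

    file = request.files["file"]
    # Sanitize the client-supplied name to prevent path traversal
    filename = secure_filename(file.filename or "")
    if not filename:
        return jsonify({"error": "Invalid file name"}), 400

    file_path = os.path.join(UPLOAD_DIR, filename)
    file.save(file_path)

    try:
        text = extract_text(file_path)
        prediction = model.predict([text])[0]
        confidence = max(model.predict_proba([text])[0]) * 100

        return jsonify({
            "predicted_position": str(prediction),
            "confidence": f"{confidence:.2f}%"
        })
    except Exception as e:
        return jsonify({"error": str(e)}), 500


if __name__ == "__main__":
    # debug=True is for local development only
    app.run(debug=True)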
```
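
The endpoint above only finds out that a file type is unsupported after the upload has been written to disk, and it then reports the problem as a server error. A small extension check, sketched below, could reject such uploads up front. `ALLOWED_EXTENSIONS` and `is_allowed_file` are hypothetical names, not part of the project.

```python
import os

# Extensions the extraction libraries in this guide can actually read
# (python-docx does not read legacy .doc files)
ALLOWED_EXTENSIONS = {'.pdf', '.docx', '.txt'}


def is_allowed_file(filename):
    """Return True if the file name ends with a supported extension."""
    extension = os.path.splitext(filename or '')[1].lower()
    return extension in ALLOWED_EXTENSIONS


for name in ['cv.pdf', 'CV.DOCX', 'notes.txt', 'photo.png', 'resume']:
    print(f"{name}: {is_allowed_file(name)}")
```

In the Flask view, this check would run before `file.save(...)` and return a 400 response when it fails.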

### Step 6: Install Dependencies

**File:** `requirements.txt`

```
flask
scikit-learn
pandas
joblib
PyMuPDF
python-docx
```

Run: `pip install -r requirements.txt`

## Next Steps

After completing Phase 1, we'll move to:

1. **Phase 2: Enhanced Model & NLP Features**
   - Implement BERT or DistilBERT for improved semantic understanding
   - Add skill extraction from CVs
   - Develop job-CV matching scoring

2. **Phase 3: Web Interface & Chatbot**
   - Develop user interface for admin and candidates
   - Implement LUNA virtual assistant using LangChain
   - Add interview scheduling functionality

3. **Phase 4: Video Interview & Proctoring**
   - Add video interview capabilities
   - Implement cheating detection using computer vision
   - Develop automated scoring system

4. **Phase 5: AWS Deployment**
   - Set up AWS infrastructure using Terraform
   - Deploy application to EC2/Lambda
   - Configure S3 for file storage

## Authors

- Hussein El Saadi
- Nour Ali Shaito

## Supervisor

- Dr. Ali Ezzedine

## License

This project is licensed under the MIT License - see the LICENSE file for details.