metadata

license: apache-2.0
datasets:
  - Novora/CodeClassifier_v1
pipeline_tag: text-classification

Introduction

Novora Code Classifier v1 Tiny is a small text classification model that assigns a given code snippet to 1 of 31 classes (programming languages).

The model is designed to run on a CPU, but it performs best on a GPU.

Info

1 of 31 classes output
512 token input dimension
64 hidden dimensions
2 linear layers
The snowflake-arctic-embed-xs model is used as the embeddings model.
The dataset is split into an 80% training set and a 20% test set.
The combined training and test data contains roughly 1,000 chunks per programming language, 31,100 chunks (entries) in total. Each chunk is a code snippet of 512 tokens.
The released weights come from the 18th of 20 training epochs.
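
The dataset figures above can be checked with a little arithmetic. The sketch below assumes a plain 80/20 split over the 31,100 chunks; the exact per-split counts and any per-language stratification are not stated here, so the numbers are illustrative.

```python
# Figures taken from the list above
TOTAL_CHUNKS = 31_100
NUM_LANGUAGES = 31
TRAIN_FRACTION = 0.8

# Assumption: the split is applied to the whole dataset, not per language
train_chunks = round(TOTAL_CHUNKS * TRAIN_FRACTION)
test_chunks = TOTAL_CHUNKS - train_chunks
avg_per_language = TOTAL_CHUNKS / NUM_LANGUAGES

print(f"train: {train_chunks}, test: {test_chunks}")
print(f"average chunks per language: {avg_per_language:.1f}")
```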

Architecture

The CodeClassifier-v1-Tiny model is a small neural network that classifies the programming language of a code snippet from its embedding. It includes:

Bidirectional LSTM Feature Extractor: A bidirectional LSTM processes the input embedding in both forward and reverse directions. In the example code below, each snippet is passed in as a single embedding vector, so the LSTM sees a sequence of length one.
Fully Connected Layers: Two linear layers follow the LSTM. The first projects the LSTM output into the 64-dimensional hidden space, and the second maps that to the 31 output classes, one per programming language. A dropout layer with a rate of 0.5 between the two layers helps mitigate overfitting.

Together, the pretrained embeddings and the bidirectional LSTM give the classifier access to the syntax and structure cues that distinguish programming languages.
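
To make the layer sizes concrete, the sketch below traces tensor shapes and counts parameters with plain arithmetic. It assumes that snowflake-arctic-embed-xs produces 384-dimensional embeddings (an assumption, not stated above) and that the LSTM follows PyTorch's `nn.LSTM` parameter layout; the helper names are hypothetical.

```python
# Assumption: the embedding model emits 384-dimensional vectors
EMBEDDING_DIM = 384
HIDDEN_DIM = 64        # "64 hidden dimensions"
NUM_CLASSES = 31       # "1 of 31 classes output"
NUM_LSTM_LAYERS = 2    # matches num_layers=2 in the example code
DIRECTIONS = 2         # bidirectional


def lstm_params(input_dim, hidden_dim, num_layers, directions):
    """Count LSTM parameters: 4 gates, input and recurrent weights, 2 bias vectors."""
    total = 0
    for layer in range(num_layers):
        # Layers after the first receive the concatenated outputs of both directions
        layer_input = input_dim if layer == 0 else hidden_dim * directions
        per_direction = 4 * hidden_dim * (layer_input + hidden_dim) + 8 * hidden_dim
        total += per_direction * directions
    return total


def linear_params(in_features, out_features):
    """Count weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


# Shape of one example as it moves through the network
shapes = [
    ("embedding", (1, EMBEDDING_DIM)),
    ("lstm output", (1, HIDDEN_DIM * DIRECTIONS)),
    ("fc1 output", (1, HIDDEN_DIM)),
    ("logits", (1, NUM_CLASSES)),
]

param_counts = {
    "lstm": lstm_params(EMBEDDING_DIM, HIDDEN_DIM, NUM_LSTM_LAYERS, DIRECTIONS),
    "fc1": linear_params(HIDDEN_DIM * DIRECTIONS, HIDDEN_DIM),
    "fc2": linear_params(HIDDEN_DIM, NUM_CLASSES),
}
total_params = sum(param_counts.values())

for name, shape in shapes:
    print(f"{name}: {shape}")
print(param_counts, total_params)
```

Under these assumptions the classifier head is small compared with the embedding model, which is consistent with the "Tiny" label.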

Example Code

import torch.nn as nn

class CodeClassifier(nn.Module):
    """LSTM feature extractor followed by two linear layers."""

    def __init__(self, num_classes, embedding_dim, hidden_dim, num_layers, bidirectional=False):
        super().__init__()
        self.feature_extractor = nn.LSTM(embedding_dim, hidden_dim, num_layers, batch_first=True, bidirectional=bidirectional)
        self.dropout = nn.Dropout(0.5)  # Applied between the two linear layers
        # A bidirectional LSTM concatenates both directions, doubling the feature size
        self.fc1 = nn.Linear(hidden_dim * (2 if bidirectional else 1), hidden_dim)  # Intermediate layer
        self.fc2 = nn.Linear(hidden_dim, num_classes)  # Output layer

    def forward(self, x):
        x = x.unsqueeze(1)  # (batch, embedding_dim) -> (batch, 1, embedding_dim)
        x, _ = self.feature_extractor(x)  # (batch, 1, hidden_dim * num_directions)
        x = x.squeeze(1)  # Remove the length-1 sequence dimension
        x = self.fc1(x)
        x = self.dropout(x)  # Active only in training mode
        x = self.fc2(x)  # Raw logits, one per class
        return x

import torch
from pathlib import Path
from safetensors.torch import load_file
from transformers import AutoTokenizer, AutoModel

def infer(text, model_path, embedding_model_name):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Load tokenizer and embedding model
    tokenizer = AutoTokenizer.from_pretrained(embedding_model_name)
    embedding_model = AutoModel.from_pretrained(embedding_model_name).to(device)
    embedding_model.eval()

    # Tokenize, truncating to the 512-token input size
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Generate embeddings: keep the hidden state of the first token for each input
    with torch.no_grad():
        embeddings = embedding_model(**inputs)[0][:, 0]

    # Load classifier model
    model = CodeClassifier(num_classes=31, embedding_dim=embeddings.size(-1), hidden_dim=64, num_layers=2, bidirectional=True)
    # A .safetensors file cannot be read by torch.load, so pick the loader by file extension
    if Path(model_path).suffix == ".safetensors":
        state_dict = load_file(str(model_path))
    else:
        state_dict = torch.load(model_path, map_location=device)
    model.load_state_dict(state_dict)
    model = model.to(device)
    model.eval()

    # Predict class
    with torch.no_grad():
        output = model(embeddings)
        _, predicted = torch.max(output, dim=1)

    # Language labels
    languages = [
        'Ada', 'Assembly', 'C', 'C#', 'C++', 'COBOL', 'Common Lisp', 'Dart', 'Erlang', 'F#',
        'Fortran', 'Go', 'Haskell', 'Java', 'JavaScript', 'Julia', 'Kotlin', 'Lua', 'MATLAB',
        'Objective-C', 'PHP', 'Perl', 'Prolog', 'Python', 'R', 'Ruby', 'Rust', 'SQL', 'Scala',
        'Swift', 'TypeScript'
    ]
    
    return languages[predicted.item()]

# Example usage
if __name__ == "__main__":
    example_text = "print('Hello, world!')"  # Replace with actual text for inference
    model_file_path = Path("./model.safetensors")
    predicted_language = infer(example_text, model_file_path, "Snowflake/snowflake-arctic-embed-xs")
    print(f"Predicted programming language: {predicted_language}")
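
The `infer` function above returns only the single most likely language. As a minimal sketch, the raw classifier output could also be converted into ranked probabilities; `top_k_languages` is a hypothetical helper that is not part of the released code, and the toy logits and labels are made up for illustration.

```python
import torch


def top_k_languages(logits, labels, k=3):
    """Hypothetical helper: rank labels by softmax probability for one input."""
    probs = torch.softmax(logits, dim=-1).squeeze(0)  # (num_classes,)
    k = min(k, probs.numel())
    values, indices = torch.topk(probs, k)
    return [(labels[i], p) for p, i in zip(values.tolist(), indices.tolist())]


# Toy logits for three made-up classes; a real call would pass the classifier output
demo = top_k_languages(torch.tensor([[0.0, 2.0, 1.0]]), ["A", "B", "C"], k=2)
print(demo)
```

In practice the function would receive `output` from the classifier together with the `languages` list, which requires `infer` to return those values instead of a single label.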