license: apache-2.0
datasets:
- Novora/CodeClassifier_v1
pipeline_tag: text-classification
Introduction
Novora Code Classifier v1 Tiny, is a tiny Text Classification
model, which classifies given code text input under 1 of 31
different classes (programming languages).
This model is designed to be able to run on CPU, but optimally runs on GPUs.
Info
- 1 of 31 classes output
- 512 token input dimension
- 64 hidden dimensions
- 2 linear layers
- The
snowflake-arctic-embed-xs
model is used as the embeddings model. - Dataset split into 80% training set, 20% testing set.
- The combined test and training data is around 1000 chunks per programming language, the data is 31,100 chunks (entries) as 512 tokens per chunk, being a snippet of the code.
- Picked from the 18th epoch out of 20 done.
Architecture
The CodeClassifier-v1-Tiny
model employs a neural network architecture optimized for text classification tasks, specifically for classifying programming languages from code snippets. This model includes:
Bidirectional LSTM Feature Extractor: This bidirectional LSTM layer processes input embeddings, effectively capturing contextual relationships in both forward and reverse directions within the code snippets.
Fully Connected Layers: The network includes two linear layers. The first projects the pooled features into a hidden feature space, and the second linear layer maps these to the output classes, which correspond to different programming languages. A dropout layer with a rate of 0.5 between these layers helps mitigate overfitting.
The model's bidirectional nature and architectural components make it adept at understanding the syntax and structure crucial for code classification.
Example Code
import torch.nn as nn
import torch.nn.functional as F
class CodeClassifier(nn.Module):
def __init__(self, num_classes, embedding_dim, hidden_dim, num_layers, bidirectional=False):
super(CodeClassifier, self).__init__()
self.feature_extractor = nn.LSTM(embedding_dim, hidden_dim, num_layers, batch_first=True, bidirectional=bidirectional)
self.dropout = nn.Dropout(0.5) # Reintroduce dropout
self.fc1 = nn.Linear(hidden_dim * (2 if bidirectional else 1), hidden_dim) # Intermediate layer
self.fc2 = nn.Linear(hidden_dim, num_classes) # Output layer
def forward(self, x):
x = x.unsqueeze(1) # Add sequence dimension
x, _ = self.feature_extractor(x)
x = x.squeeze(1) # Remove sequence dimension
x = self.fc1(x)
x = self.dropout(x) # Apply dropout
x = self.fc2(x)
return x
import torch
from transformers import AutoTokenizer, AutoModel
from pathlib import Path
def infer(text, model_path, embedding_model_name):
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load tokenizer and embedding model
tokenizer = AutoTokenizer.from_pretrained(embedding_model_name)
embedding_model = AutoModel.from_pretrained(embedding_model_name).to(device)
embedding_model.eval()
# Prepare inputs
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
inputs = {k: v.to(device) for k, v in inputs.items()}
# Generate embeddings
with torch.no_grad():
embeddings = embedding_model(**inputs)[0][:, 0]
# Load classifier model
model = CodeClassifier(num_classes=31, embedding_dim=embeddings.size(-1), hidden_dim=64, num_layers=2, bidirectional=True)
model.load_state_dict(torch.load(model_path, map_location=device))
model = model.to(device)
model.eval()
# Predict class
with torch.no_grad():
output = model(embeddings)
_, predicted = torch.max(output, dim=1)
# Language labels
languages = [
'Ada', 'Assembly', 'C', 'C#', 'C++', 'COBOL', 'Common Lisp', 'Dart', 'Erlang', 'F#',
'Fortran', 'Go', 'Haskell', 'Java', 'JavaScript', 'Julia', 'Kotlin', 'Lua', 'MATLAB',
'Objective-C', 'PHP', 'Perl', 'Prolog', 'Python', 'R', 'Ruby', 'Rust', 'SQL', 'Scala',
'Swift', 'TypeScript'
]
return languages[predicted.item()]
# Example usage
if __name__ == "__main__":
example_text = "print('Hello, world!')" # Replace with actual text for inference
model_file_path = Path("./model.safetensors")
predicted_language = infer(example_text, model_file_path, "Snowflake/snowflake-arctic-embed-xs")
print(f"Predicted programming language: {predicted_language}")