Impulse2000 committed
Commit 554a75a · unverified · 1 Parent(s): 2b20a67

Added code example

Files changed (1)
  1. README.md +54 -3
README.md CHANGED
@@ -18,7 +18,7 @@ This model is designed to be able to run on CPU, but optimally runs on GPUs.
  - 2 linear layers
  - The `snowflake-arctic-embed-xs` model is used as the embeddings model.
  - Dataset split into 80% training set, 20% testing set.
- - The combined training and test data is 100 chunks per programming language, 3,100 chunks (entries) in total; each chunk is a 512-token snippet of code.
+ - The combined training and test data is around 1,000 chunks per programming language, 31,100 chunks (entries) in total; each chunk is a 512-token snippet of code.
 
 # Architecture
 
@@ -26,9 +26,60 @@ The `CodeClassifier-v1-Tiny` model employs a neural network architecture optimiz
 
  - **Bidirectional LSTM Feature Extractor**: This bidirectional LSTM layer processes input embeddings, effectively capturing contextual relationships in both forward and reverse directions within the code snippets.
 
- - **Adaptive Pooling**: Following the LSTM, adaptive average pooling reduces the feature dimension to a fixed size, accommodating variable-length inputs.
-
  - **Fully Connected Layers**: The network includes two linear layers. The first projects the pooled features into a hidden feature space, and the second linear layer maps these to the output classes, which correspond to different programming languages. A dropout layer with a rate of 0.5 between these layers helps mitigate overfitting.
 
  The model's bidirectional nature and architectural components make it adept at understanding the syntax and structure crucial for code classification.
 
+ # Example Code
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModel
+ from safetensors.torch import load_file
+ from pathlib import Path
+ from model import CodeClassifier
+
+ def infer(text, model_path, embedding_model_name):
+     device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+
+     # Load tokenizer and embedding model
+     tokenizer = AutoTokenizer.from_pretrained(embedding_model_name)
+     embedding_model = AutoModel.from_pretrained(embedding_model_name).to(device)
+     embedding_model.eval()
+
+     # Tokenize the snippet and move the tensors to the target device
+     inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
+     inputs = {k: v.to(device) for k, v in inputs.items()}
+
+     # Generate embeddings (CLS-token vector of the last hidden state)
+     with torch.no_grad():
+         embeddings = embedding_model(**inputs)[0][:, 0]
+
+     # Load the classifier; the weights are stored in safetensors format,
+     # so use safetensors' load_file rather than torch.load
+     model = CodeClassifier(num_classes=31, embedding_dim=embeddings.size(-1), hidden_dim=64, num_layers=2, bidirectional=True)
+     model.load_state_dict(load_file(model_path))
+     model = model.to(device)
+     model.eval()
+
+     # Predict the class with the highest logit
+     with torch.no_grad():
+         output = model(embeddings)
+         _, predicted = torch.max(output, dim=1)
+
+     # Language labels (31 classes, matching num_classes above)
+     languages = [
+         'Ada', 'Assembly', 'C', 'C#', 'C++', 'COBOL', 'Common Lisp', 'Dart', 'Erlang', 'F#',
+         'Fortran', 'Go', 'Haskell', 'Java', 'JavaScript', 'Julia', 'Kotlin', 'Lua', 'MATLAB',
+         'Objective-C', 'PHP', 'Perl', 'Prolog', 'Python', 'R', 'Ruby', 'Rust', 'SQL', 'Scala',
+         'Swift', 'TypeScript'
+     ]
+
+     return languages[predicted.item()]
+
+ # Example usage
+ if __name__ == "__main__":
+     example_text = "print('Hello, world!')"  # Replace with actual text for inference
+     model_file_path = Path("./model.safetensors")
+     predicted_language = infer(example_text, model_file_path, "Snowflake/snowflake-arctic-embed-xs")
+     print(f"Predicted programming language: {predicted_language}")
+ ```
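
The example imports `CodeClassifier` from the repository's `model.py`, which is not part of this commit. As a rough guide only, below is a minimal sketch of what such a module could look like, pieced together from the architecture section (a bidirectional LSTM feature extractor, two linear layers, and a dropout rate of 0.5) and from the constructor call in the example. The pooling step, the activation, and the exact layer wiring are assumptions, not the repository's actual implementation.

```python
import torch
import torch.nn as nn

class CodeClassifier(nn.Module):
    """Hypothetical reconstruction: a bidirectional LSTM feature extractor
    followed by two linear layers with dropout (p=0.5) in between."""

    def __init__(self, num_classes, embedding_dim, hidden_dim=64, num_layers=2, bidirectional=True):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=embedding_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            bidirectional=bidirectional,
            batch_first=True,
        )
        lstm_out = hidden_dim * (2 if bidirectional else 1)
        self.fc1 = nn.Linear(lstm_out, hidden_dim)     # project into a hidden feature space
        self.dropout = nn.Dropout(0.5)                 # dropout rate of 0.5, per the README
        self.fc2 = nn.Linear(hidden_dim, num_classes)  # map to the 31 language classes

    def forward(self, x):
        # The inference example passes one embedding vector per snippet,
        # so treat it as a length-1 sequence for the LSTM (an assumption).
        if x.dim() == 2:
            x = x.unsqueeze(1)        # (batch, 1, embedding_dim)
        out, _ = self.lstm(x)         # (batch, seq_len, lstm_out)
        pooled = out.mean(dim=1)      # average over the sequence dimension
        return self.fc2(self.dropout(torch.relu(self.fc1(pooled))))
```

Under these assumptions, the module accepts the `(batch, embedding_dim)` CLS embeddings produced by `snowflake-arctic-embed-xs` (the example derives `embedding_dim` at runtime via `embeddings.size(-1)`) and returns `(batch, 31)` logits, one per supported language.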