asadsandhu committed
Commit def1c2c · 1 Parent(s): 6a450b3

Finalized.

Files changed (4)
  1. README.md +187 -1
  2. app.py +215 -0
  3. model.pth +3 -0
  4. requirements.txt +2 -0
README.md CHANGED
@@ -11,4 +11,190 @@ license: mit
  short_description: Convert C++ to Pseudocode using a Transformer Model.
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # 🔄 Code2Pseudo: Transformer-based C++ to Pseudocode Converter
+
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
+ [![Python](https://img.shields.io/badge/Python-3.10+-blue.svg)](https://www.python.org/)
+ [![Hugging Face](https://img.shields.io/badge/HuggingFace-Spaces-orange)](https://huggingface.co/spaces/asadsandhu/Code2Pseudo)
+ [![GitHub Repo](https://img.shields.io/badge/GitHub-asadsandhu/Code2Pseudo-black?logo=github)](https://github.com/asadsandhu/Code2Pseudo)
+
+ > A fully custom Transformer sequence-to-sequence model, built from scratch in PyTorch, that converts executable C++ code into high-level pseudocode. Trained on the [SPoC dataset](https://arxiv.org/abs/1906.04908) from Stanford.
+
+ ---
+
+ ## 🖼️ Demo
+
+ Try it live on **Hugging Face Spaces**:
+ 👉 https://huggingface.co/spaces/asadsandhu/Code2Pseudo
+
+ ![App Demo](assets/demo.png)
+
+ ---
+
+ ## 🧠 Model Architecture
+
+ - Built from scratch on the **Transformer** encoder-decoder architecture (PyTorch)
+ - No pre-trained models or seq2seq libraries; 100% custom code
+ - Token-level sequence generation with greedy decoding
+ - Custom tokenization and vocabulary building for both C++ and pseudocode (sketched below)
+
+ ```
+ Input:  C++ source, one line at a time
+ Model:  Transformer (Encoder-Decoder)
+ Output: Corresponding pseudocode line
+ ```
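+
+ Each input line is tokenized with a small regex: runs of letters and digits stay whole, and every other non-space character becomes its own token. This is the exact tokenizer from `app.py`, shown with a sample call:
+
+ ```python
+ import re
+
+ def tokenize_line(text):
+     # Alphanumeric runs stay together; all other non-space chars split apart.
+     return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)
+
+ print(tokenize_line("cin >> n;"))  # ['cin', '>', '>', 'n', ';']
+ ```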
+
+ ---
+
+ ## 📊 Dataset
+
+ We trained on the **SPoC dataset**:
+
+ - ✅ Cleanly aligned C++ ↔ pseudocode line pairs
+ - ✅ High-quality syntactic coverage
+ - ✅ Multiple test splits available
+ - ✅ Custom preprocessing and token handling
+
+ > 📎 Licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
+
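+ The training files are tab-separated, one aligned pair per row. A minimal way to peek at the data (the `text`/pseudocode and `code`/C++ column names are assumed from the released SPoC files; adjust if your copy differs):
+
+ ```python
+ import csv
+
+ # Print the first aligned C++/pseudocode pair from the SPoC training split.
+ with open("spoc/train/spoc-train.tsv", newline="", encoding="utf-8") as f:
+     for row in csv.DictReader(f, delimiter="\t"):
+         print(row["code"], "->", row["text"])
+         break
+ ```
+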
+ ---
+
+ ## 📁 Directory Structure
+
+ ```
+ .
+ ├── app.py        # Gradio web app (C++ → Pseudocode)
+ ├── train.py      # Training script for the code-to-pseudocode model
+ ├── model.pth     # Trained model and vocab checkpoint
+ ├── spoc/
+ │   └── train/
+ │       ├── spoc-train.tsv
+ │       └── split/spoc-train-eval.tsv
+ ├── assets/
+ │   └── demo.png  # Screenshot for README
+ └── README.md     # This file
+ ```
+
+ ---
+
+ ## 🛠️ How to Run Locally
+
+ ### ⚙️ 1. Clone the Repo
+
+ ```bash
+ git clone https://github.com/asadsandhu/Code2Pseudo.git
+ cd Code2Pseudo
+ pip install torch gradio tqdm
+ ```
+
+ ### 🚀 2. Launch the Web App
+
+ Make sure `model.pth` exists (or train it first):
+
+ ```bash
+ python app.py
+ ```
+
+ The Gradio interface will open in your browser.
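+
+ To run a conversion without the UI, call the same function the app uses from a Python shell (importing `app` loads the model, so `model.pth` must be present):
+
+ ```python
+ from app import convert_cpp_to_pseudocode
+
+ # Convert a snippet line-by-line, exactly as the web app does.
+ print(convert_cpp_to_pseudocode("cin >> n ;"))
+ ```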
+
+ ---
+
+ ## 🧪 Training the Model
+
+ To retrain the transformer model:
+
+ ```bash
+ python train.py
+ ```
+
+ By default, the script:
+
+ * Downloads the SPoC dataset from GitHub
+ * Trains for 10 epochs
+ * Produces `model.pth` with the weights and vocabulary (format sketched below)
+
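+ `app.py` expects `model.pth` to be a single checkpoint dict holding both the weights and the vocabularies. A minimal sketch of a compatible save call (the key names follow what `app.py` reads; `train.py` is the authoritative version):
+
+ ```python
+ import torch
+
+ torch.save({
+     "model_state_dict": model.state_dict(),      # Transformer weights
+     "src_stoi": src_stoi, "src_itos": src_itos,  # C++ token vocabulary
+     "tgt_stoi": tgt_stoi, "tgt_itos": tgt_itos,  # pseudocode vocabulary
+ }, "model.pth")
+ ```
+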
+ ---
+
+ ## 🔧 Key Hyperparameters
+
+ | Parameter      | Value       |
+ | -------------- | ----------- |
+ | Model Type     | Transformer |
+ | Max Length     | 128         |
+ | Embedding Dim  | 256         |
+ | FFN Dim        | 512         |
+ | Heads          | 4           |
+ | Encoder Layers | 2           |
+ | Decoder Layers | 2           |
+ | Batch Size     | 64          |
+ | Epochs         | 10          |
+ | Optimizer      | Adam        |
+ | Learning Rate  | 1e-4        |
+
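+ These mirror the constants at the top of `app.py`, where the model is built as:
+
+ ```python
+ # Instantiate the seq2seq Transformer with the hyperparameters above.
+ model = TransformerSeq2Seq(
+     src_vocab_size=len(src_stoi),
+     tgt_vocab_size=len(tgt_stoi),
+     d_model=256, n_heads=4,
+     num_encoder_layers=2, num_decoder_layers=2,
+     dim_feedforward=512,
+ )
+ ```
+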
+ ---
+
+ ## 🧩 Example Input
+
+ The input below is in SPoC's tokenized style, with every token space-separated, which is how the training data is formatted (the app automatically skips `int main() {`, `}`, and `return 0;`):
+
+ ```cpp
+ int main() {
+     int n , nn , ans = 0 ;
+     cin > > n ;
+     for ( int i = 2 ; i < = n - 1 ; i + + ) {
+         nn = n ;
+         while ( nn = = 0 ) ans + = nn % i , nn / = i ;
+     }
+     o = gcd ( ans , n - 2 ) ;
+     cout < < ans / 2 / o ( n - 2 ) / o < < endl ;
+     return 0;
+ }
+ ```
+
+ ### ⏩ Output Pseudocode
+
+ ```text
+ create integers n , nn , ans with ans = 0
+ read n
+ for i = 2 to n - 1 inclusive
+ set nn to n
+ while nn is 0 , set ans to nn % 12 , set ans to nn % nn , set nn to nn / i
+ set value of gcd to ans and n - 2
+ print ans / 2 / ( n - 2 ) / o
+ ```
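+
+ Each line is converted independently. Under the hood, the app runs the same three steps per line, using the functions defined in `app.py`:
+
+ ```python
+ from app import (numericalize, pad_sequence, greedy_decode,
+                  model, src_stoi, tgt_stoi, tgt_itos, MAX_LEN)
+
+ # 1) tokenize + map to vocab ids, 2) pad/append EOS, 3) greedy decode.
+ ids = numericalize("cin > > n ;", src_stoi)
+ ids = pad_sequence(ids, MAX_LEN, src_stoi["<pad>"], src_stoi["<eos>"])
+ print(greedy_decode(model, ids, src_stoi, tgt_stoi, tgt_itos))  # "read n" above
+ ```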
+
+ ---
+
+ ## 📦 Deployment
+
+ Live demo hosted on:
+
+ * **Hugging Face Spaces**: [Code2Pseudo](https://huggingface.co/spaces/asadsandhu/Code2Pseudo)
+ * **GitHub**: [github.com/asadsandhu/Code2Pseudo](https://github.com/asadsandhu/Code2Pseudo)
+
+ ---
+
+ ## 🙌 Acknowledgements
+
+ * 📘 **SPoC Dataset** by Stanford University
+   Kulal, S., Pasupat, P., et al. (2019). [SPoC: Search-based Pseudocode to Code](https://arxiv.org/abs/1906.04908)
+
+ * 🧠 Transformer Paper: ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762)
+
+ ---
+
+ ## 🧑‍💻 Author
+
+ **Asad Ali**
+ [GitHub: asadsandhu](https://github.com/asadsandhu)
+ [Hugging Face: asadsandhu](https://huggingface.co/asadsandhu)
+ [LinkedIn: asadxali](https://www.linkedin.com/in/asadxali)
+
+ ---
+
+ ## 📄 License
+
+ This project is licensed under the MIT License.
+ Use, remix, and distribute freely with attribution.
app.py ADDED
@@ -0,0 +1,215 @@
+ # app.py (for C++ to Pseudocode task)
+ import gradio as gr
+ import torch
+ import os
+ import math
+ import torch.nn as nn
+ import re
+ import sys
+ import asyncio
+
+ # Windows needs the selector event loop policy for asyncio compatibility.
+ if sys.platform.startswith('win'):
+     asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
+
+ DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ MAX_LEN = 128
+ EMBED_DIM = 256
+ NHEAD = 4
+ NUM_ENCODER_LAYERS = 2
+ NUM_DECODER_LAYERS = 2
+ FF_DIM = 512
+
+ PAD_TOKEN = "<pad>"
+ SOS_TOKEN = "<sos>"
+ EOS_TOKEN = "<eos>"
+ UNK_TOKEN = "<unk>"
+
+ def tokenize_line(text):
+     # Keep alphanumeric runs whole; every other non-space char is a token.
+     return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)
+
+ def numericalize(text, stoi):
+     tokens = tokenize_line(text)
+     return [stoi.get(tok, stoi[UNK_TOKEN]) for tok in tokens]
+
+ def pad_sequence(seq, max_len, pad_id, eos_id):
+     # Truncate, append EOS, then right-pad to a fixed length.
+     seq = seq[:max_len - 1]
+     seq = seq + [eos_id]
+     if len(seq) < max_len:
+         seq += [pad_id] * (max_len - len(seq))
+     return seq
+
+ class PositionalEncoding(nn.Module):
+     def __init__(self, d_model, max_len=5000):
+         super().__init__()
+         pe = torch.zeros(max_len, d_model)
+         position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
+         div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
+         pe[:, 0::2] = torch.sin(position * div_term)
+         pe[:, 1::2] = torch.cos(position * div_term)
+         pe = pe.unsqueeze(0)
+         self.register_buffer("pe", pe)
+     def forward(self, x):
+         return x + self.pe[:, :x.size(1), :]
+
+ class MultiHeadAttention(nn.Module):
+     def __init__(self, d_model, n_heads):
+         super().__init__()
+         assert d_model % n_heads == 0
+         self.head_dim = d_model // n_heads
+         self.n_heads = n_heads
+         self.query_linear = nn.Linear(d_model, d_model)
+         self.key_linear = nn.Linear(d_model, d_model)
+         self.value_linear = nn.Linear(d_model, d_model)
+         self.out_linear = nn.Linear(d_model, d_model)
+     def forward(self, query, key, value, mask=None):
+         B, Q_len, _ = query.size()
+         Q = self.query_linear(query).view(B, Q_len, self.n_heads, self.head_dim).transpose(1, 2)
+         K = self.key_linear(key).view(B, key.size(1), self.n_heads, self.head_dim).transpose(1, 2)
+         V = self.value_linear(value).view(B, value.size(1), self.n_heads, self.head_dim).transpose(1, 2)
+         scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.head_dim)
+         if mask is not None:
+             scores = scores.masked_fill(mask == 0, float('-inf'))
+         attn = torch.softmax(scores, dim=-1)
+         context = torch.matmul(attn, V).transpose(1, 2).contiguous().view(B, Q_len, -1)
+         return self.out_linear(context)
+
+ class FeedForward(nn.Module):
+     def __init__(self, d_model, dim_feedforward):
+         super().__init__()
+         self.fc1 = nn.Linear(d_model, dim_feedforward)
+         self.fc2 = nn.Linear(dim_feedforward, d_model)
+         self.relu = nn.ReLU()
+     def forward(self, x):
+         return self.fc2(self.relu(self.fc1(x)))
+
+ class EncoderLayer(nn.Module):
+     def __init__(self, d_model, n_heads, dim_feedforward):
+         super().__init__()
+         self.self_attn = MultiHeadAttention(d_model, n_heads)
+         self.ff = FeedForward(d_model, dim_feedforward)
+         self.norm1 = nn.LayerNorm(d_model)
+         self.norm2 = nn.LayerNorm(d_model)
+         self.dropout = nn.Dropout(0.1)
+     def forward(self, src, src_mask=None):
+         src = self.norm1(src + self.dropout(self.self_attn(src, src, src, mask=src_mask)))
+         src = self.norm2(src + self.dropout(self.ff(src)))
+         return src
+
+ class DecoderLayer(nn.Module):
+     def __init__(self, d_model, n_heads, dim_feedforward):
+         super().__init__()
+         self.self_attn = MultiHeadAttention(d_model, n_heads)
+         self.cross_attn = MultiHeadAttention(d_model, n_heads)
+         self.ff = FeedForward(d_model, dim_feedforward)
+         self.norm1 = nn.LayerNorm(d_model)
+         self.norm2 = nn.LayerNorm(d_model)
+         self.norm3 = nn.LayerNorm(d_model)
+         self.dropout = nn.Dropout(0.1)
+     def forward(self, tgt, memory, tgt_mask=None, memory_mask=None):
+         tgt = self.norm1(tgt + self.dropout(self.self_attn(tgt, tgt, tgt, mask=tgt_mask)))
+         tgt = self.norm2(tgt + self.dropout(self.cross_attn(tgt, memory, memory, mask=memory_mask)))
+         tgt = self.norm3(tgt + self.dropout(self.ff(tgt)))
+         return tgt
+
+ class Encoder(nn.Module):
+     def __init__(self, vocab_size, d_model, n_heads, num_layers, dim_feedforward):
+         super().__init__()
+         self.embedding = nn.Embedding(vocab_size, d_model)
+         self.pos_encoding = PositionalEncoding(d_model)
+         self.layers = nn.ModuleList([EncoderLayer(d_model, n_heads, dim_feedforward) for _ in range(num_layers)])
+     def forward(self, src, src_mask=None):
+         x = self.embedding(src)
+         x = self.pos_encoding(x)
+         for layer in self.layers:
+             x = layer(x, src_mask)
+         return x
+
+ class Decoder(nn.Module):
+     def __init__(self, vocab_size, d_model, n_heads, num_layers, dim_feedforward):
+         super().__init__()
+         self.embedding = nn.Embedding(vocab_size, d_model)
+         self.pos_encoding = PositionalEncoding(d_model)
+         self.layers = nn.ModuleList([DecoderLayer(d_model, n_heads, dim_feedforward) for _ in range(num_layers)])
+         self.fc_out = nn.Linear(d_model, vocab_size)
+     def forward(self, tgt, memory, tgt_mask=None, memory_mask=None):
+         x = self.embedding(tgt)
+         x = self.pos_encoding(x)
+         for layer in self.layers:
+             x = layer(x, memory, tgt_mask, memory_mask)
+         return self.fc_out(x)
+
+ class TransformerSeq2Seq(nn.Module):
+     def __init__(self, src_vocab_size, tgt_vocab_size, d_model, n_heads,
+                  num_encoder_layers, num_decoder_layers, dim_feedforward):
+         super().__init__()
+         self.encoder = Encoder(src_vocab_size, d_model, n_heads, num_encoder_layers, dim_feedforward)
+         self.decoder = Decoder(tgt_vocab_size, d_model, n_heads, num_decoder_layers, dim_feedforward)
+     def forward(self, src, tgt, src_mask=None, tgt_mask=None):
+         memory = self.encoder(src, src_mask)
+         return self.decoder(tgt, memory, tgt_mask)
+
+ def generate_subsequent_mask(size):
+     # Causal mask: True = position may be attended to, False = blocked.
+     mask = torch.triu(torch.ones(size, size), diagonal=1).bool()
+     return ~mask
+
+ def greedy_decode(model, src, src_stoi, tgt_stoi, tgt_itos, max_len=MAX_LEN):
+     # Encode once, then emit one token at a time until EOS or max_len.
+     model.eval()
+     src = torch.tensor(src, dtype=torch.long, device=DEVICE).unsqueeze(0)
+     memory = model.encoder(src)
+     ys = torch.tensor([tgt_stoi[SOS_TOKEN]], dtype=torch.long, device=DEVICE).unsqueeze(0)
+     for _ in range(max_len - 1):
+         tgt_mask = generate_subsequent_mask(ys.size(1)).to(DEVICE)
+         out = model.decoder(ys, memory, tgt_mask)
+         next_token = torch.argmax(out[:, -1, :], dim=-1).item()
+         ys = torch.cat([ys, torch.tensor([[next_token]], device=DEVICE)], dim=1)
+         if next_token == tgt_stoi[EOS_TOKEN]:
+             break
+     out_tokens = ys.squeeze(0).tolist()[1:]
+     if tgt_stoi[EOS_TOKEN] in out_tokens:
+         out_tokens = out_tokens[:out_tokens.index(tgt_stoi[EOS_TOKEN])]
+     return " ".join(tgt_itos[t] for t in out_tokens)
+
+ # Load model checkpoint (weights + both vocabularies)
+ checkpoint = torch.load("model.pth", map_location=DEVICE)
+ src_stoi = checkpoint['src_stoi']
+ src_itos = checkpoint['src_itos']
+ tgt_stoi = checkpoint['tgt_stoi']
+ tgt_itos = checkpoint['tgt_itos']
+
+ model = TransformerSeq2Seq(
+     src_vocab_size=len(src_stoi),
+     tgt_vocab_size=len(tgt_stoi),
+     d_model=EMBED_DIM,
+     n_heads=NHEAD,
+     num_encoder_layers=NUM_ENCODER_LAYERS,
+     num_decoder_layers=NUM_DECODER_LAYERS,
+     dim_feedforward=FF_DIM
+ ).to(DEVICE)
+ model.load_state_dict(checkpoint['model_state_dict'])
+ model.eval()
+
+ def convert_cpp_to_pseudocode(code_text):
+     lines = code_text.strip().split('\n')
+     outputs = []
+     for i, line in enumerate(lines):
+         line = line.strip()
+         # Skip empty lines and boilerplate the model was not trained on.
+         if not line or line in ["int main() {", "}", "return 0;"]:
+             continue
+         try:
+             src_ids = numericalize(line, src_stoi)
+             src_ids = pad_sequence(src_ids, MAX_LEN, src_stoi[PAD_TOKEN], src_stoi[EOS_TOKEN])
+             out_line = greedy_decode(model, src_ids, src_stoi, tgt_stoi, tgt_itos)
+             outputs.append(out_line)
+         except Exception as e:
+             outputs.append(f"// Error in line {i+1}: {e}")
+     return "\n".join(outputs)
+
+ iface = gr.Interface(
+     fn=convert_cpp_to_pseudocode,
+     inputs=gr.Textbox(label="Enter C++ Code", lines=10),
+     outputs=gr.Textbox(label="Generated Pseudocode"),
+     title="C++ to Pseudocode Converter (Transformer from Scratch)"
+ )
+
+ if __name__ == "__main__":
+     iface.launch()
model.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:51b116f0bd343a4ad7738033120b923b91c4c52c78affec025490be5adaf974b
+ size 42692428
requirements.txt ADDED
@@ -0,0 +1,2 @@
+ gradio
+ torch==2.2.2