---
title: Code2Pseudo
emoji: π’
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 5.35.0
app_file: app.py
pinned: false
license: mit
short_description: Convert C++ to Pseudocode using a Transformer Model.
---
# Code2Pseudo: Transformer-based C++ to Pseudocode Converter
[License: MIT](LICENSE) · [Python](https://www.python.org/) · [Hugging Face Space](https://huggingface.co/spaces/asadsandhu/Code2Pseudo) · [GitHub](https://github.com/asadsandhu/Code2Pseudo)
> A fully custom Transformer-based sequence-to-sequence model, built from scratch in PyTorch, that converts executable C++ code into high-level pseudocode. Trained on the [SPoC dataset](https://arxiv.org/abs/1906.04908) from Stanford.
---
## Demo
Try it live on **Hugging Face Spaces**:
https://huggingface.co/spaces/asadsandhu/Code2Pseudo
![Demo screenshot](assets/demo.png)
---
## Model Architecture
- Built from scratch using the **Transformer** encoder-decoder architecture (PyTorch)
- No pre-trained models or libraries; 100% custom code
- Token-level sequence generation with greedy decoding
- Custom tokenization and vocabulary building for both C++ and pseudocode
```
Input: C++ lines (line-by-line)
Model: Transformer (Encoder-Decoder)
Output: Corresponding pseudocode line
```
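The repository's exact module and function names are not spelled out in this README, so the snippet below is only a minimal sketch of the token-level greedy decoding step described above. It assumes a seq2seq model whose forward pass `model(src, tgt)` returns logits shaped `(tgt_len, batch, vocab_size)` and a target vocabulary dict containing `<sos>` and `<eos>` tokens; those names are assumptions, not the project's actual API.

```python
import torch

def greedy_decode(model, src_ids, tgt_vocab, max_len=128, device="cpu"):
    """Generate one pseudocode line for one tokenized C++ line.

    Hypothetical interface: model(src, tgt) -> (tgt_len, batch, vocab) logits;
    tgt_vocab maps tokens (including <sos>/<eos>) to integer ids.
    """
    model.eval()
    sos, eos = tgt_vocab["<sos>"], tgt_vocab["<eos>"]
    src = torch.tensor(src_ids, device=device).unsqueeze(1)          # (src_len, 1)
    out_ids = [sos]
    with torch.no_grad():
        for _ in range(max_len - 1):
            tgt = torch.tensor(out_ids, device=device).unsqueeze(1)  # (cur_len, 1)
            logits = model(src, tgt)                                 # (cur_len, 1, vocab)
            next_id = logits[-1, 0].argmax().item()                  # most likely next token
            out_ids.append(next_id)
            if next_id == eos:
                break
    id_to_tok = {i: t for t, i in tgt_vocab.items()}
    return " ".join(id_to_tok[i] for i in out_ids[1:] if i != eos)   # drop <sos>/<eos>
```

Greedy decoding keeps inference fast and deterministic by always taking the argmax token, which is the simplest reasonable choice for line-by-line generation like this.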
---
## Dataset
We trained on the **SPoC dataset**:
- ✅ Cleanly aligned C++ ↔ pseudocode line pairs
- ✅ High-quality syntactic coverage
- ✅ Multiple test splits available
- ✅ Custom preprocessing and token handling (a loading sketch follows below)
> The SPoC dataset is licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
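The preprocessing code itself is not reproduced in this README; the sketch below shows one plausible way to load the aligned pairs and build the vocabularies. The `text` and `code` column names are an assumption about the SPoC TSV layout, so check the actual header before relying on them.

```python
import csv
from collections import Counter

def load_pairs(tsv_path):
    """Read aligned (C++ line, pseudocode line) pairs from a SPoC TSV file.

    Assumes the pseudocode sits in a `text` column and the C++ line in a
    `code` column; rows without pseudocode are skipped.
    """
    pairs = []
    with open(tsv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            if row.get("text") and row.get("code"):
                pairs.append((row["code"].strip(), row["text"].strip()))
    return pairs

def build_vocab(lines, min_freq=1, specials=("<pad>", "<sos>", "<eos>", "<unk>")):
    """Whitespace-tokenize lines and map each sufficiently frequent token to an id."""
    counts = Counter(tok for line in lines for tok in line.split())
    vocab = {tok: i for i, tok in enumerate(specials)}
    for tok, freq in counts.most_common():
        if freq >= min_freq:
            vocab.setdefault(tok, len(vocab))
    return vocab
```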
---
## Directory Structure
```
.
├── app.py                          # Gradio web app (C++ → Pseudocode)
├── train.py                        # Training script for code-to-pseudocode model
├── model.pth                       # Trained model and vocab checkpoint
├── spoc/
│   └── train/
│       ├── spoc-train.tsv
│       └── split/spoc-train-eval.tsv
├── assets/
│   └── demo.png                    # Screenshot for README
└── README.md                       # This file
```
---
## How to Run Locally
### 1. Clone the Repo and Install Dependencies
```bash
git clone https://github.com/asadsandhu/Code2Pseudo.git
cd Code2Pseudo
pip install torch gradio tqdm
```
### 2. Launch the Web App
Make sure `model.pth` exists (or train it first):
```bash
python app.py
```
The interface will open in your browser.
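`app.py` itself is not reproduced here; a minimal Gradio wrapper in the same spirit might look like the sketch below, where `translate` is a hypothetical stand-in for the real logic that loads `model.pth` and greedy-decodes each tokenized C++ line.

```python
import gradio as gr

def translate(cpp_code: str) -> str:
    # Hypothetical stand-in: the real app loads model.pth once and runs the
    # Transformer on every non-empty input line, returning pseudocode lines.
    lines = [l.strip() for l in cpp_code.splitlines() if l.strip()]
    return "\n".join(f"pseudocode for: {l}" for l in lines)

demo = gr.Interface(
    fn=translate,
    inputs=gr.Textbox(lines=10, label="C++ code (one statement per line)"),
    outputs=gr.Textbox(label="Pseudocode"),
    title="Code2Pseudo",
)

if __name__ == "__main__":
    demo.launch()
```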
---
## Training the Model
To retrain the transformer model:
```bash
python train.py
```
By default:
* Downloads SPoC dataset from GitHub
* Trains for 10 epochs
* Produces `model.pth` with weights and vocabulary
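`train.py`'s internals are not shown in this README; the sketch below condenses the defaults listed above (10 epochs, Adam at 1e-4, and a single `model.pth` bundling weights plus both vocabularies), with standard teacher forcing assumed. `model`, `train_loader`, `src_vocab`, and `tgt_vocab` are assumed to exist, e.g. as in the sketches under Dataset and Key Hyperparameters.

```python
import torch
import torch.nn as nn

PAD = 0                                   # assumed <pad> index in the target vocab
device = "cuda" if torch.cuda.is_available() else "cpu"
criterion = nn.CrossEntropyLoss(ignore_index=PAD)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(10):
    model.train()
    total = 0.0
    for src, tgt in train_loader:         # src: (src_len, B), tgt: (tgt_len, B)
        src, tgt = src.to(device), tgt.to(device)
        logits = model(src, tgt[:-1])     # teacher forcing: feed target shifted right
        loss = criterion(logits.reshape(-1, logits.size(-1)), tgt[1:].reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total += loss.item()
    print(f"epoch {epoch + 1}: loss {total / len(train_loader):.4f}")

# Bundle weights and both vocabularies into one checkpoint file
# (the dict key names here are illustrative, not the repo's actual format).
torch.save({"model_state": model.state_dict(),
            "src_vocab": src_vocab,
            "tgt_vocab": tgt_vocab}, "model.pth")
```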
---
## Key Hyperparameters
| Parameter | Value |
| -------------- | ----------- |
| Model Type | Transformer |
| Max Length | 128 |
| Embedding Dim | 256 |
| FFN Dim | 512 |
| Heads | 4 |
| Encoder Layers | 2 |
| Decoder Layers | 2 |
| Batch Size | 64 |
| Epochs | 10 |
| Optimizer | Adam |
| Learning Rate | 1e-4 |
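For reference, the hyperparameters above map onto PyTorch's stock `nn.Transformer` roughly as sketched below. The project describes its model as written from scratch, so treat this class and its argument names purely as an illustrative assumption, not the repository's actual implementation.

```python
import math
import torch
import torch.nn as nn

class Seq2SeqTransformer(nn.Module):
    """Illustrative encoder-decoder wired up with the values from the table."""

    def __init__(self, src_vocab_size, tgt_vocab_size, d_model=256, nhead=4,
                 num_layers=2, dim_ff=512, max_len=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab_size, d_model)
        self.tgt_emb = nn.Embedding(tgt_vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)       # learned positions
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            dim_feedforward=dim_ff)
        self.out = nn.Linear(d_model, tgt_vocab_size)
        self.d_model = d_model

    def _add_pos(self, emb):
        # emb: (seq_len, batch, d_model); scale and add positional embeddings
        pos = torch.arange(emb.size(0), device=emb.device).unsqueeze(1)
        return emb * math.sqrt(self.d_model) + self.pos_emb(pos)

    def forward(self, src, tgt):
        # src: (src_len, batch), tgt: (tgt_len, batch) of token ids
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt.size(0)).to(tgt.device)
        hidden = self.transformer(self._add_pos(self.src_emb(src)),
                                  self._add_pos(self.tgt_emb(tgt)),
                                  tgt_mask=tgt_mask)
        return self.out(hidden)              # (tgt_len, batch, tgt_vocab_size)
```

A model this size (2+2 layers, 256-dim embeddings) is small enough to train comfortably on a single consumer GPU with the batch size listed above.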
---
## Example Input
```cpp
int main() {
int n , nn , ans = 0 ;
cin > > n ;
for ( int i = 2 ; i < = n - 1 ; i + + ) {
nn = n ;
while ( nn = = 0 ) ans + = nn % i , nn / = i ;
}
o = gcd ( ans , n - 2 ) ;
cout < < ans / 2 / o ( n - 2 ) / o < < endl ;
return 0;
}
```
### Output Pseudocode
```text
create integers n , nn , ans with ans = 0
read n
for i = 2 to n - 1 inclusive
set nn to n
while nn is 0 , set ans to nn % 12 , set ans to nn % nn , set nn to nn / i
set value of gcd to ans and n - 2
print ans / 2 / ( n - 2 ) / o
```
---
## Deployment
Live demo hosted on:
* **Hugging Face Spaces**: [Code2Pseudo](https://huggingface.co/spaces/asadsandhu/Code2Pseudo)
* **GitHub**: [github.com/asadsandhu/Code2Pseudo](https://github.com/asadsandhu/Code2Pseudo)
---
## Acknowledgements
* **SPoC Dataset** by Stanford University
  Kulal, S., Pasupat, P., Chandra, K., Lee, M., Padon, O., Aiken, A., & Liang, P. (2019). [SPoC: Search-based Pseudocode to Code](https://arxiv.org/abs/1906.04908). NeurIPS 2019.
* Transformer paper: ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762)
---
## Author
**Asad Ali**
[GitHub: asadsandhu](https://github.com/asadsandhu)
[Hugging Face: asadsandhu](https://huggingface.co/asadsandhu)
[LinkedIn: asadxali](https://www.linkedin.com/in/asadxali)
---
## License
This project is licensed under the MIT License.
Use, remix, and distribute freely with attribution.