---
language:
- en
license: mit
tags:
- code-generation
- transformer
- ast
- cfg
- langchain
- ollama
model_name: MiniCoderX
datasets:
- the-stack
- codesearchnet
- humaneval
- mbpp
- bugs2fix
- java-python
pipeline_tag: text-generation
---
# MiniCoderX: A Lightweight Transformer for Code Generation

MiniCoderX is a structure-aware, transformer-based small language model (SLM) for code generation. It combines modern architectural techniques with efficient local deployment via LangChain and Ollama, making it well suited to rapid local experimentation.

Demo: https://v0-mini-coder-x.vercel.app/
## Features

- Transformer-based encoder-decoder (TinyCodeT5 / DistilGPT2)
- AST/CFG-aware encoding for code structure understanding
- Syntax-constrained decoding using grammar rules and trees
- Multi-task heads: generation, summarization, translation, bug fixing
- LangChain + Ollama integration for fast local deployment
- Evaluated on HumanEval, CodeXGLUE, MBPP
## Model Architecture

| Component | Description |
|---|---|
| Base | Tiny encoder-decoder (MiniLM, DistilGPT2, TinyCodeT5) |
| Structure-aware | AST and control-flow-graph embeddings + positional masks |
| Heads | Multi-task heads for flexible downstream use |
| Decoder | Syntax-aware beam search (grammar constraints) |
| Tokenizer | BPE or SentencePiece trained on code + comments |
## Architectural Additions (SOTA Techniques)

### AST/CFG Embeddings

Enhances understanding of code structure by:

- Adding AST node/edge embeddings to token inputs
- Including path embeddings between syntactic elements
- Using graph-aware positional encodings

Inspired by: StructCoder, AST-T5, Code4Struct
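As a minimal sketch of the first idea, Python's built-in `ast` module can map each node to a type ID that could then be added to the token embeddings. The vocabulary and function names below are illustrative, not MiniCoderX's actual interface:

```python
import ast

# Hypothetical node-type vocabulary; a real model would cover the full grammar.
NODE_TYPES = ["Module", "FunctionDef", "Return", "BinOp", "Name", "Constant", "arguments", "arg"]
NODE2ID = {t: i for i, t in enumerate(NODE_TYPES)}

def ast_type_ids(source: str) -> list[int]:
    """Walk the AST and map each node's type name to a small vocabulary ID."""
    tree = ast.parse(source)
    ids = []
    for node in ast.walk(tree):
        # Unknown node types fall back to an out-of-vocabulary ID.
        ids.append(NODE2ID.get(type(node).__name__, len(NODE_TYPES)))
    return ids

ids = ast_type_ids("def add(a, b):\n    return a + b")
print(ids[0])  # the root node is always Module -> ID 0
```

In the full model, these IDs would index an embedding table whose vectors are summed with (or concatenated to) the token embeddings.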
### Syntax-Constrained Decoding

Improves generation accuracy and reduces invalid code by:

- Restricting token outputs with grammar constraints (BNF/PEG)
- Applying custom decoding logic (e.g., tree traversal)
- Building dynamic decoding masks from the current token state

Inspired by: TreeGen, Code4Struct
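The dynamic-mask idea can be illustrated with a toy constraint (this is not the project's actual grammar engine): at each decoding step, filter the candidate vocabulary so that emitting a token can never unbalance parentheses.

```python
def allowed_tokens(prefix: str, vocab: list[str]) -> list[str]:
    """Return the vocab entries that keep parentheses balanced (a toy grammar rule)."""
    depth = prefix.count("(") - prefix.count(")")
    allowed = []
    for tok in vocab:
        # Depth after hypothetically appending this token.
        d = depth + tok.count("(") - tok.count(")")
        if d >= 0:  # never close more parens than were opened
            allowed.append(tok)
    return allowed

vocab = ["(", ")", "x", "+"]
print(allowed_tokens("f(x", vocab))   # depth 1: every candidate stays balanced
print(allowed_tokens("f(x)", vocab))  # depth 0: ")" would go negative, so it is masked
```

A real implementation would apply the same filter as a logit mask over the full tokenizer vocabulary, driven by a grammar state machine rather than a single bracket counter.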
### Multi-Task Learning Heads

Supports multiple tasks:

- Code generation (NL → Code)
- Summarization (Code → NL)
- Translation (Java → Python)
- Code repair and completion

Inspired by: CodeT5+, CoTexT
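The routing pattern behind multi-task heads can be sketched without any ML framework (everything below is a stand-in: the real model uses a shared transformer backbone and learned output heads):

```python
def shared_encoder(text: str) -> list[float]:
    # Stand-in for the shared transformer backbone: a fixed-size character histogram.
    return [text.count(c) / max(len(text), 1) for c in "(){}=:"]

# One lightweight "head" per task, all consuming the same shared representation.
HEADS = {
    "generate": lambda h: f"<code stub from {len(h)}-dim input>",     # NL -> Code
    "summarize": lambda h: f"<summary stub from {len(h)}-dim input>", # Code -> NL
    "repair": lambda h: f"<patch stub from {len(h)}-dim input>",
}

def run_task(task: str, text: str) -> str:
    hidden = shared_encoder(text)  # computed once, shared across all tasks
    return HEADS[task](hidden)     # routed to the task-specific head

print(run_task("summarize", "def f(): return 1"))
```

The point of the pattern is that the expensive backbone pass is shared, while each head stays small and task-specific.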
## LangChain + Ollama Integration

### Why?

To enable:

- Local testing and chaining of models via LangChain
- Fast prototyping with Ollama as a backend for custom transformers
- Easy switching between small local models and larger remote APIs
### Integration Plan

```python
from langchain.llms import Ollama
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# Load MiniCoderX served locally via Ollama
llm = Ollama(model="minicoderx")

# Define the code-generation prompt
prompt = PromptTemplate(
    input_variables=["instruction"],
    template="Generate Python code for the task: {instruction}",
)

chain = LLMChain(llm=llm, prompt=prompt)
result = chain.run("Sort a list of integers using quicksort")
print(result)
```
- Ollama serves the fine-tuned SLM locally
- LangChain wraps it with prompts, chains, and memory features for interactivity
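On the serving side, registering the model with Ollama follows the standard Modelfile workflow. The sketch below assumes the fine-tuned weights have already been exported to GGUF; the filename and system prompt are placeholders:

```
FROM ./minicoderx.gguf
PARAMETER temperature 0.2
SYSTEM "You are MiniCoderX, a small code-generation assistant."
```

Build and run it with `ollama create minicoderx -f Modelfile`, then `ollama run minicoderx`; the LangChain snippet above can then reach it by model name.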
## Datasets

| Dataset | Use |
|---|---|
| The Stack (subset) | Pretraining corpus |
| CodeSearchNet | Summarization, search |
| HumanEval | Code generation benchmark |
| MBPP | Python programming prompts |
| Bugs2Fix | Code repair |
| Java-Python | Cross-language translation |
## Training Objectives

- Span masking (CodeT5-style)
- Contrastive pretraining
- Instruction tuning (natural prompt formatting)
- Autoregressive generation
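The span-masking objective can be illustrated in a few lines. Span length, sentinel naming, and span count below are assumptions for the demo, not the project's actual training configuration:

```python
import random

def mask_spans(tokens, span_len=3, n_spans=1, seed=0):
    """Replace random contiguous spans with sentinel tokens, CodeT5-style.

    Returns the corrupted input and the (sentinel, original span) targets
    the decoder must reconstruct.
    """
    rng = random.Random(seed)
    tokens = list(tokens)
    targets = []
    for i in range(n_spans):
        start = rng.randrange(0, len(tokens) - span_len)
        targets.append((f"<extra_id_{i}>", tokens[start:start + span_len]))
        tokens[start:start + span_len] = [f"<extra_id_{i}>"]
    return tokens, targets

inp = "def add ( a , b ) : return a + b".split()
masked, targets = mask_spans(inp)
print(masked)    # input sequence with one 3-token span collapsed to a sentinel
print(targets)   # what the decoder is trained to emit
```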
## Evaluation Benchmarks

| Benchmark | Metric |
|---|---|
| HumanEval | Pass@1, BLEU |
| MBPP | Accuracy |
| CodeXGLUE | CodeBLEU, EM |
| Unit tests | Pass rate |
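Pass@k on HumanEval is conventionally computed with the unbiased estimator from the original HumanEval paper (Chen et al., 2021): given n samples per problem of which c pass, pass@k = 1 − C(n−c, k) / C(n, k). A direct implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 0, 1))   # 0.0: no correct samples at all
print(pass_at_k(10, 10, 1))  # 1.0: every sample passes
print(pass_at_k(2, 1, 1))    # 0.5: one of two samples passes
```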
## Project Roadmap

### Phase 1: MVP Model

- Train a TinyCodeT5 model with span masking
- Evaluate on MBPP and HumanEval-lite
- Serve via an Ollama + LangChain prompt chain

### Phase 2: Structural Learning

- Add AST/CFG encodings
- Introduce grammar-constrained decoding
- Multi-task training (generation, summarization, repair)

### Phase 3: Optimization & Packaging

- Distill from a larger model (e.g., StarCoder)
- Add reinforcement fine-tuning driven by test cases
- Export to Hugging Face + Ollama integration
## Tools & Frameworks

- Hugging Face Transformers
- LangChain
- Ollama
- SentencePiece / BPE
- NetworkX for building AST/CFG graphs
## Contributing

Want to help with grammar decoders, AST integration, or evaluation? PRs welcome!

## License

MIT License. Built for research and open experimentation.

## Contact

Open an issue or discussion on GitHub!