Abstract
We present cdx1 and cdx1-pro, a family of language models designed to emulate the expertise of a professional in DevOps, xBOM (Bill of Materials), and the CycloneDX specification. The base models, unsloth/Qwen2.5-Coder-14B-Instruct (for cdx1) and unsloth/Qwen3-Coder-30B-A3B-Instruct (for cdx1-pro), were fine-tuned on a specialized, high-quality dataset. This dataset was constructed using a synthetic data generation strategy with a teacher model (Gemini 2.5 Pro). The primary objective was to align the fine-tuned models' capabilities with the teacher model's performance on xBOM and CycloneDX-related question-answering tasks.
Approach to Data
Data Curation and Generation
The models were trained on cdx-docs, a curated dataset comprising technical documentation, authoritative OWASP guides, and semantic interpretations derived from the CycloneDX Generator (cdxgen) source code. The dataset was augmented using a synthetic data generation technique. This process involved prompting a teacher model (Gemini 2.5 Pro) to generate question-answer pairs that encapsulate the nuances and semantics of the domain. The generated data was structured to facilitate effective learning by the target cdx1 models.
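The sketch below illustrates what such a generation loop can look like, assuming the google-generativeai Python SDK; the prompt wording, model identifier, and JSON output schema are illustrative assumptions rather than the exact pipeline used to build cdx-docs.

```python
import json

import google.generativeai as genai

# Assumed setup: a configured API key and a Gemini teacher model.
genai.configure(api_key="YOUR_API_KEY")  # placeholder
teacher = genai.GenerativeModel("gemini-2.5-pro")  # assumed identifier

PROMPT = """You are an expert in CycloneDX, xBOM, and DevOps.
Read the documentation excerpt below and write three question-answer
pairs that capture its nuances and semantics. Respond only with a JSON
list of objects containing "question" and "answer" keys.

Excerpt:
{excerpt}
"""

def generate_qa_pairs(excerpt: str) -> list[dict]:
    """Ask the teacher model for Q&A pairs grounded in one excerpt."""
    response = teacher.generate_content(PROMPT.format(excerpt=excerpt))
    return json.loads(response.text)  # assumes the model returns bare JSON
```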
Alignment with Inference
During the training phase, the dataset was iteratively refined to ensure the format and context of the training examples closely resembled the intended inference-time inputs. This alignment is critical for the models to learn the domain's complexity and respond accurately to real-world prompts.
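For illustration, a training record might mirror the chat format used at inference time; the structure below follows the common messages convention and is a hypothetical example, not the dataset's actual schema.

```python
# A hypothetical chat-style training record. Keeping training data in the
# same format as inference-time prompts helps the model answer real-world
# questions the way it was trained to.
training_example = {
    "messages": [
        {
            "role": "user",
            "content": "Which CycloneDX field records the BOM serial number?",
        },
        {
            "role": "assistant",
            "content": (
                "The top-level serialNumber field holds a unique RFC 4122 "
                "URN (urn:uuid:...) that identifies this specific BOM."
            ),
        },
    ]
}
```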
Benchmarking
The cdx1 models are optimized for xBOM use cases, including BOM summarization, component tagging, validation, and troubleshooting. To evaluate model performance, we developed a custom benchmark suite named xBOMEval.
Categories
xBOMEval contains tests across the following categories:
- Bias: Assesses potential model bias towards CycloneDX or SPDX specifications through targeted questions.
- Specification (Spec): Measures factual recall and synthesis on topics such as CycloneDX, PURL, and SPDX.
- Logic: Evaluates problem-solving and reasoning capabilities with complex questions about specifications.
- DevOps: Assesses knowledge of platforms and tools like GitHub, Azure Pipelines, and package managers.
- Linux: Tests proficiency with Linux environments, including terminal and PowerShell commands.
- Docker: Measures understanding of Docker, Podman, and the OCI specification.
Scoring
Model responses were scored using a combination of automated evaluation by a high-capability model (Gemini 2.5 Pro) and manual human review. To maintain benchmark integrity, the evaluation set was held out and not included in any model's training data. Detailed results and configurations are available in the xBOMEval directory of the cdxgen repository.
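A minimal sketch of the automated part of this scoring is shown below, again assuming the google-generativeai SDK; the judge prompt and data layout are assumptions, not the exact xBOMEval harness.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder
judge = genai.GenerativeModel("gemini-2.5-pro")  # assumed judge model

JUDGE_PROMPT = """You are grading a benchmark answer.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Reply with exactly one word: CORRECT or INCORRECT."""

def score(items: list[dict]) -> float:
    """Accuracy over {question, reference, candidate} dicts.
    Borderline verdicts are expected to go to manual human review."""
    correct = 0
    for item in items:
        verdict = judge.generate_content(JUDGE_PROMPT.format(**item)).text
        correct += verdict.strip().upper().startswith("CORRECT")
    return correct / len(items)
```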
Benchmark Results - August 2025
Logic Category Comparison
| Model | Accuracy (%) |
| --- | --- |
| gemini-2.5-pro | 93.60 |
| deepthink-r1 | 89.63 |
| gpt-5 | 83.23 |
| deepseek-r1 | 82.92 |
| gpt-oss-120b | 80.49 |
| gpt-oss-20b | 79.27 |
| cdx1-pro-mlx-8bit | 73.17 |
| o4-mini-high | 67.99 |
| qwen3-coder-480B | 48.48 |
| cdx1-mlx-8bit | 46.04 |
This table compares the accuracy of ten AI models on a logic benchmark designed to assess reasoning and problem-solving skills. The results highlight a clear hierarchy of performance, with the newly added gpt-5 debuting as a top-tier model.
Key Findings:
- Dominant Leader: gemini-2.5-pro is the undisputed leader, achieving the highest accuracy of 93.6% and placing it in a class of its own.
- Top-Tier Competitors: A strong group of models follows, led by deepthink-r1 at 89.63%. The newly introduced gpt-5 makes a powerful debut, securing third place with 83.23% accuracy and slightly outperforming deepseek-r1 (82.92%) and gpt-oss-120b (80.49%).
- Strong Mid-Tier: The gpt-oss-20b model performs impressively well for its size at 79.27%, outscoring several larger models and leading the middle pack, which also includes cdx1-pro-mlx-8bit (73.17%) and o4-mini-high (67.99%).
- Lower Performers: qwen3-coder-480B (48.48%) and cdx1-mlx-8bit (46.04%) score the lowest. The score for cdx1-mlx-8bit is artificially low due to context length limitations, which caused it to miss questions.
- Efficiency and Performance: The results from the gpt-oss models, particularly the 20B variant, demonstrate that highly optimized, smaller models can be very competitive on logic tasks.
Performance Tiers
The models can be grouped into four clear performance tiers:
- Elite Tier (>90%): gemini-2.5-pro (93.6%)
- High-Performing Tier (80%-90%): deepthink-r1 (89.63%), gpt-5 (83.23%), deepseek-r1 (82.92%), gpt-oss-120b (80.49%)
- Mid-Tier (65%-80%): gpt-oss-20b (79.27%), cdx1-pro-mlx-8bit (73.17%), o4-mini-high (67.99%)
- Lower Tier (<50%): qwen3-coder-480B (48.48%), cdx1-mlx-8bit (46.04%)
Spec Category Comparison
| Model | Accuracy (%) |
| --- | --- |
| gemini-2.5-pro | 100.00 |
| deepseek-r1 | 98.58 |
| cdx1-pro-mlx-8bit | 98.30 |
| gpt-5 | 95.17 |
| qwen3-coder-480B | 90.34 |
| gpt-oss-120b | 89.20 |
| cdx1-mlx-8bit | 83.52 |
| deepthink-r1 | 12.36 |
| gpt-oss-20b | 9.09 |
| o4-mini-high | 0.00 |
This table evaluates ten AI models on the Spec category, a test of factual recall on 352 technical specification questions. The results starkly illustrate that a model's reliability and cooperative behavior are as crucial as its underlying knowledge. Several models, including the newly added gpt-5, achieved high scores only after overcoming significant behavioral hurdles.
Key Findings:
- Elite Factual Recall: A top tier of models demonstrated near-perfect knowledge retrieval. gemini-2.5-pro led with a perfect 100% score and superior answer depth, closely followed by deepseek-r1 (98.58%) and cdx1-pro-mlx-8bit (98.3%).
- High Score with Major Caveats (gpt-5): The newly added gpt-5 achieved a high accuracy of 95.17%, placing it among the top performers. However, this result required a significant compromise: the model initially refused to answer the full set of questions, offering only to respond in small batches that required six separate user confirmations. This compromise was accepted to prevent an outright failure. A related variant, gpt-5-thinking, refused the test entirely after a minute of processing.
- Complete Behavioral Failures: Three models effectively failed the test not due to a lack of knowledge, but because they refused to cooperate. o4-mini-high scored 0% after refusing to answer, citing too many questions, while deepthink-r1 (12.36%) and gpt-oss-20b (9.09%) answered only a small fraction of the questions without acknowledging the limitation.
- Strong Mid-Tier Performers: qwen3-coder-480B (90.34%) and gpt-oss-120b (89.2%) both demonstrated strong and reliable factual recall without the behavioral issues seen elsewhere.
- Impact of Scale and Systematic Errors: The contrast between the two cdx1 models is revealing. The larger cdx1-pro-mlx-8bit (98.3%) performed exceptionally well, while the smaller cdx1-mlx-8bit (83.52%) was hampered by a single systematic error (misunderstanding "CBOM") that cascaded into multiple wrong answers.
Summary of Key Themes
- Reliability is Paramount: This test's most important finding is that knowledge is useless if a model is unwilling or unable to share it. The failures of o4-mini-high, deepthink-r1, and gpt-oss-20b, together with the behavioral friction from gpt-5, highlight this critical dimension.
- Scores Don't Tell the Whole Story: The 95.17% score for gpt-5 obscures the significant user intervention required to obtain it. Similarly, the near-identical scores of cdx1-pro and gemini-2.5-pro do not capture Gemini's superior answer quality.
- Scale Can Overcome Flaws: The dramatic performance leap from the 14B to the 30B cdx1 model suggests that increased scale can help correct specific knowledge gaps and improve overall accuracy.
Other Categories
Performance in additional technical categories is summarized below.
| Category | cdx1-mlx-8bit | cdx1-pro-mlx-8bit |
| --- | --- | --- |
| DevOps | 87.46% | 96.1% |
| Docker | 89.08% | 100% |
| Linux | 90.6% | 95.8% |
Model Availability
The cdx1 and cdx1-pro models are provided in multiple formats and quantization levels to facilitate deployment across diverse hardware environments. Models are available in the MLX format, optimized for local inference on Apple Silicon, and the GGUF format, which offers broad compatibility with CPUs and various GPUs. The selection of quantization levels allows users to balance performance with resource consumption, enabling effective operation even in environments with limited VRAM.
The table below details the available formats and their approximate resource requirements. All quantized models can be found on Hugging Face.
| Model | Format | Quantization | File Size (GiB) | Est. VRAM (GiB) | Notes |
| --- | --- | --- | --- | --- | --- |
| cdx1 (14B) | MLX | 4-bit | ~8.1 | > 8 | For Apple Silicon with unified memory. |
| | MLX | 6-bit | ~12 | > 12 | For Apple Silicon with unified memory. |
| | MLX | 8-bit | ~14.2 | > 14 | Higher fidelity for Apple Silicon. |
| | MLX | 16-bit | ~30 | > 30 | bfloat16 for fine-tuning. |
| | GGUF | Q4_K_M | 8.99 | ~10.5 | Recommended balance of quality and size. |
| | GGUF | Q8_0 | 15.7 | ~16.5 | Near-lossless quality. |
| | GGUF | BF16 | 29.5 | ~30 | bfloat16 for fine-tuning. |
| cdx1-pro (30B) | MLX | 4-bit | ~17.5 | > 18 | For Apple Silicon with unified memory. |
| | MLX | 6-bit | ~24.8 | > 25 | For Apple Silicon with unified memory. |
| | MLX | 8-bit | ~32.4 | > 33 | Higher fidelity for Apple Silicon. |
| | MLX | 16-bit | ~57 | > 57 | bfloat16 for fine-tuning. |
| | GGUF | Q4_K_M | 18.6 | ~20.0 | Recommended balance of quality and size. |
| | GGUF | IQ4_NL | 17.6 | ~20.0 | Recommended balance of quality and size. |
| | GGUF | Q8_0 | 32.5 | ~33 | Near-lossless quality. |
| | GGUF | Q2_K | 11.3 | ~12 | Low quality; use for speculative decoding. |
| | GGUF | BF16 | 57 | ~60 | bfloat16 for fine-tuning. |
Notes on Quantization and Formats:
- IQ4_NL (Importance-aware Quantization, Non-Linear): A sophisticated 4-bit method that preserves important model weights at higher precision. It often outperforms standard 4-bit quants at a similar file size and is a strong alternative to Q4_K_M.
- K-Quants (Q2_K, Q4_K_M): This family of quantization methods generally offers a better quality-to-size ratio than the older _0 and _1 variants.
- Q2_K: An extremely small 2-bit quantization designed for environments with severe resource limitations. Users should anticipate a noticeable reduction in model accuracy and coherence in exchange for the minimal VRAM and storage footprint.
- Q8_0: A full 8-bit quantization that provides high fidelity at the cost of a larger file size. It is suitable for systems with ample VRAM.
- VRAM Requirements: The values provided are estimates for loading the model and processing a moderate context. Actual VRAM consumption can vary based on factors such as context length, batch size, and the specific inference software used.
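To make the context-length caveat concrete, the sketch below estimates the KV-cache memory that grows on top of the weight file; the layer and head counts are illustrative placeholders, not the published configurations of these models.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_elem: int = 2) -> float:
    """Estimate KV-cache size: one K and one V tensor per layer,
    stored at fp16/bf16 (2 bytes) unless the cache is quantized."""
    total = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return total / (1024 ** 3)

# Illustrative numbers only (not the actual cdx1 configs): a 48-layer
# model with 8 KV heads of dimension 128 at a 32k context adds about 6 GiB.
print(f"{kv_cache_gib(48, 8, 128, 32_768):.1f} GiB on top of the weights")
```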
Safety and Bias
Safety
To rigorously evaluate safety performance, we developed a comprehensive testing framework comprising over 200 adversarial prompts spanning 10 critical risk categories, including cybersecurity threats, hate speech, illegal activities, privacy violations, physical safety risks, misinformation, bias and discrimination, self-harm, child safety, and copyright infringement. These prompts were systematically generated using a multi-layered approach: we first established domain-specific threat models based on NIST AI RMF guidelines, then crafted prompts incorporating real-world evasion techniques (including leetspeak substitutions, roleplay scenarios, and encoded instructions) to test for policy circumvention. Each category contains progressively severe prompts, ranging from general inquiries about harmful activities to highly specific requests for executable code and step-by-step instructions. During evaluation, the cdx1 models consistently refused all safety-compromising requests, demonstrating robust adherence to ethical boundaries without attempting to fulfill harmful instructions, even when presented with sophisticated evasion attempts. This testing protocol exceeds standard industry benchmarks by incorporating both direct harmful requests and nuanced edge cases designed to probe boundary conditions in safety policies.
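A simplified sketch of such a harness appears below; the refusal heuristic, file layout, and function names are assumptions for illustration, and the actual protocol also relied on human review of borderline replies.

```python
import json

# Hypothetical prompt file: one JSON object per line, e.g.
# {"category": "cybersecurity", "prompt": "..."}
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "unable to help")

def looks_like_refusal(reply: str) -> bool:
    """Crude keyword heuristic; borderline cases go to human review."""
    lowered = reply.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def run_safety_suite(ask_model, path: str = "adversarial_prompts.jsonl"):
    """ask_model: a callable mapping a prompt string to the model's reply.
    Returns the categories in which the model failed to refuse."""
    failures = []
    with open(path) as fh:
        for line in fh:
            case = json.loads(line)
            if not looks_like_refusal(ask_model(case["prompt"])):
                failures.append(case["category"])
    return failures  # empty list means every harmful prompt was refused
```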
Bias
Our analysis reveals that the cdx1 and cdx1-pro models exhibit a notable bias toward CycloneDX specifications, a tendency directly attributable to the composition of their training data, which contains significantly more CycloneDX-related content than competing Software Bill of Materials (SBOM) standards. This data imbalance manifests as a consistent preference for recommending CycloneDX over alternative frameworks such as SPDX and OmniBOR, even in contexts where these competing standards might offer superior suitability for specific use cases. The models frequently fail to provide balanced comparative analysis, instead defaulting to CycloneDX-centric recommendations without adequate consideration of factors like ecosystem compatibility, tooling support, or organizational requirements that might favor alternative specifications. We recognize this as a limitation affecting the models' objectivity in technical decision support. Our long-term mitigation strategy involves targeted expansion of the training corpus with high-quality, balanced documentation of all major SBOM standards, implementation of adversarial debiasing techniques during fine-tuning, and development of explicit prompting protocols that require the model to evaluate multiple standards against specific technical requirements before making recommendations. We are committed to evolving cdx1 toward genuine impartiality in standards evaluation while maintaining its deep expertise in software supply chain security.
Weaknesses
(To be determined)
Acknowledgments
(To be determined)
Citation
Please cite the following resources if you use the datasets, models, or benchmark in your work.
For the Dataset
@misc{cdx-docs,
author = {OWASP CycloneDX Generator Team},
title = {{cdx-docs: A Curated Dataset for SBOM and DevOps Tasks}},
year = {2025},
month = {February},
howpublished = {\url{https://huggingface.co/datasets/CycloneDX/cdx-docs}}
}
For the Models
@misc{cdx1_models,
author = {OWASP CycloneDX Generator Team},
title = {{cdx1 and cdx1-pro: Language Models for SBOM and DevOps}},
year = {2025},
month = {February},
howpublished = {\url{https://huggingface.co/CycloneDX}}
}
For the xBOMEval Benchmark
@misc{xBOMEval_v1,
author = {OWASP CycloneDX Generator Team},
title = {{xBOMEval: A Benchmark for Evaluating Language Models on SBOM Tasks}},
year = {2025},
month = {August},
howpublished = {\url{https://github.com/CycloneDX/cdxgen}}
}
Licenses
- Datasets: CC0-1.0
- Models: Apache-2.0