Regex-Helper: Fine-Tuned Model for Regular Expression Generation

A 3B-parameter fine-tuned model that generates regular expressions, pattern explanations, and test cases from natural language descriptions, built on Qwen2.5-Coder-3B-Instruct using LoRA.

Regex-Helper

Regex-Helper generates regular expressions from natural language descriptions, along with pattern explanations, optimization suggestions, and test cases. Regex generation is deceptively difficult: a pattern must be simultaneously syntactically exact, contextually correct, and logically sound, with no margin for single-character errors. This model delivers reliable patterns across common use cases.

πŸ” Problem

Regex generation differs fundamentally from general code generation due to:

Perfect Syntax Precision — One wrong character produces an invalid pattern, or one that silently matches the wrong strings; there is no partially correct output

Contextual Understanding — The same specification means different things across regex engines (PCRE, ERE, Python re, JavaScript)

Logical Consistency — Patterns must be mathematically sound, not merely syntactically valid

Performance Awareness — Naive patterns cause catastrophic backtracking on certain inputs
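The backtracking point is easy to demonstrate with Python's standard `re` module. The pathological pattern below is a textbook illustration, not output from the model:

```python
import re
import time

evil = re.compile(r"(a+)+$")   # nested quantifiers: exponential backtracking on failure
safe = re.compile(r"a+$")      # matches the same strings, but without the blowup

text = "a" * 18 + "b"          # no possible match, forcing a full backtracking search

start = time.perf_counter()
assert evil.search(text) is None
evil_time = time.perf_counter() - start

start = time.perf_counter()
assert safe.search(text) is None
safe_time = time.perf_counter() - start

print(f"evil: {evil_time:.4f}s, safe: {safe_time:.6f}s")
```

Lengthening the input by a few characters multiplies the evil pattern's runtime, while the safe pattern stays effectively constant.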

The failure mode is often silent: the model produces a syntactically valid pattern that matches the wrong strings. A model that generates semantically incorrect but syntactically valid regex is more dangerous than one producing obvious errors.
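As a concrete illustration (mine, not model output), the classic silent failure is an unescaped metacharacter: the pattern compiles cleanly and matches every intended string, but also accepts strings it should reject:

```python
import re

# Intended: word chars, '@', word chars, a literal dot, word chars.
naive = re.compile(r"\w+@\w+.\w+")    # bug: unescaped '.' matches ANY character
fixed = re.compile(r"\w+@\w+\.\w+")   # '\.' matches only a literal dot

assert naive.fullmatch("user@example.com")    # looks correct...
assert naive.fullmatch("user@example!com")    # ...but silently accepts garbage

assert fixed.fullmatch("user@example.com")
assert not fixed.fullmatch("user@example!com")
```

A test suite that only checks positive examples would never catch the naive version, which is why the model emits non-matching test cases as well.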

πŸ› οΈ Architecture & Training

Base Model: Qwen2.5-Coder-3B-Instruct

Framework: Apple MLX on Apple Silicon

Technique: LoRA (Low-Rank Adaptation)

Training Data: Regex patterns, explanations, optimizations, and test cases

Output Format: JSON with pattern, explanation, alternatives, and test cases
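For concreteness, here is one plausible shape of a single training record, validated the way a data-cleaning pass might check it. The field names and nesting are hypothetical; the actual dataset schema is not published:

```python
import json
import re

# Hypothetical training record; field names and structure are illustrative only.
record = {
    "instruction": "Match a 4-digit year from 1900 to 2099",
    "pattern": r"\b(?:19|20)\d{2}\b",
    "explanation": "A non-capturing group for the century ('19' or '20'), "
                   "then two digits, bounded by word boundaries.",
    "test_cases": {"match": ["1999", "2024"], "no_match": ["1899", "20245"]},
}

line = json.dumps(record)              # one line of a JSONL training file
parsed = json.loads(line)

# Verify the labeled example is internally consistent before training on it.
compiled = re.compile(parsed["pattern"])
assert all(compiled.search(s) for s in parsed["test_cases"]["match"])
assert not any(compiled.search(s) for s in parsed["test_cases"]["no_match"])
```

Structural consistency of this kind is what lets a small dataset outperform a larger noisy one.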

Base model selection was the highest-leverage decision. Qwen2.5-Coder carries prior knowledge of regex syntax, tokenizes special characters more effectively than general-purpose models, and has seen regex patterns in pretraining. Switching from a general model to Qwen2.5-Coder produced the most significant accuracy improvement across all iterations.

Training Progression:

Iteration 1 — Invalid regex syntax in the majority of cases. A base model without a code-focused foundation could not reliably produce correct special-character sequences.

Iteration 2 — Syntax improved substantially after the base model switch. Pattern logic remained poor for lookahead, lookbehind, and non-capturing group constructs.

Iteration 3 — Acceptable accuracy on common patterns. Performance remained inconsistent on nested quantifiers and alternation edge cases.

Final Version — Stable output with reliable pattern generation for the common use cases represented in the training data.
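The constructs called out above (lookarounds, non-capturing groups) are exactly where small mistakes stay invisible. For reference, correct versions in Python's `re` look like this (examples mine, not model output):

```python
import re

# Lookaheads: password must contain an uppercase letter and a digit.
pwd = re.compile(r"^(?=.*[A-Z])(?=.*\d)[A-Za-z\d]{8,}$")
assert pwd.match("Passw0rdX")
assert not pwd.match("password")          # no uppercase, no digit

# Lookbehind plus a non-capturing group: price digits after a '$' sign.
price = re.compile(r"(?<=\$)\d+(?:\.\d{2})?")
assert price.search("total: $19.99").group() == "19.99"
```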

📊 Capabilities

The model generates multiple regex components in structured JSON:

Primary Pattern with named capture groups where appropriate

Plain-language Explanation breaking down each pattern component

Alternative Patterns where performance trade-offs exist

Test Cases with both matching and non-matching examples

Example input: "Extract email addresses from text, allowing only .com and .org domains"

Example output: Primary regex pattern, detailed explanation, 2 alternative patterns with performance trade-offs, 8 test cases (4 valid, 4 invalid).
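A hypothetical response for the prompt above, with field names chosen purely for illustration (the model's exact JSON schema may differ), can be sanity-checked in Python before use:

```python
import re

# Hypothetical example of the structured JSON output; field names are
# illustrative, not the model's exact schema.
output = {
    "pattern": r"\b[\w.+-]+@[\w-]+\.(?:com|org)\b",
    "explanation": "An email local part, '@', a domain name, and a .com or .org TLD.",
    "alternatives": [r"[\w.+-]+@[\w-]+\.(?:com|org)"],
    "test_cases": {
        "match": ["alice@example.com", "bob.smith@nonprofit.org"],
        "no_match": ["carol@example.net", "not-an-email"],
    },
}

# Compile the primary pattern and run the bundled test cases.
compiled = re.compile(output["pattern"])
assert all(compiled.search(s) for s in output["test_cases"]["match"])
assert not any(compiled.search(s) for s in output["test_cases"]["no_match"])
```

Running the generated test cases against the generated pattern is a cheap guard against the silent-failure mode described earlier.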

πŸ›‘οΈ Limitations

  • Trained on common patterns — highly specialized regex for uncommon domains may require human expertise
  • English language only (multilingual character class handling not covered)
  • Complex nested patterns require manual review and optimization
  • Performance varies with pattern complexity — expert regex knowledge needed for edge cases
  • Not a substitute for domain expertise in security-sensitive pattern matching

See the model card on Hugging Face for complete technical details.

🚀 Quick Start

Hugging Face

Access the model and documentation

Ollama

ollama pull fahidnasir/Regex-Helper && ollama run fahidnasir/Regex-Helper

Python

from mlx_lm import load, generate

# Load the fine-tuned model, then generate from a single prompt.
model, tokenizer = load("fahidnasir/Regex-Helper")
prompt = "Extract email addresses from text, allowing only .com and .org domains"
print(generate(model, tokenizer, prompt=prompt))

💡 Key Takeaways

  1. Base model selection is the highest-leverage decision — a code-specialized base improved accuracy more than any single data or hyperparameter change.
  2. Data quality matters more than dataset size — smaller sets of correctly labeled, structurally consistent examples produce better output than larger noisy datasets.
  3. Four training iterations is normal, not failure — systematic output failures at each stage revealed specific gaps that targeted data additions addressed.
  4. Domain-specific challenges require domain-specific solutions — general fine-tuning strategies require adaptation when output has zero tolerance for partial correctness.
  5. Silent failures are the primary risk — syntactically valid but logically incorrect patterns are more dangerous in production than obvious errors.