Regex-Helper: Fine-Tuned Model for Regular Expression Generation
A 3B-parameter fine-tuned model that generates regular expressions, pattern explanations, and test cases from natural language descriptions, built on Qwen2.5-Coder-3B-Instruct using LoRA.
Regex-Helper
Regex-Helper generates regular expressions from natural language descriptions, along with pattern explanations, optimization suggestions, and test cases. Regex generation is deceptively difficult: a pattern must be syntactically exact, contextually correct, and logically sound all at once, with no margin for single-character errors. This model delivers reliable patterns across common use cases.
Problem
Regex generation differs fundamentally from general code generation due to:
Perfect Syntax Precision – One wrong character produces an invalid or silently incorrect pattern, with no partial output to fall back on
Contextual Understanding – The same specification means different things across regex engines (PCRE, ERE, Python re, JavaScript)
Logical Consistency – Patterns must be mathematically sound, not merely syntactically valid
Performance Awareness – Naive patterns cause catastrophic backtracking on certain inputs
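The backtracking risk is easy to demonstrate. A minimal sketch using Python's `re` module (the patterns are illustrative, not taken from the model's training data):

```python
import re

# Naive pattern: the nested quantifier (a+)+ creates exponentially many
# ways to partition a run of 'a's, so a failing match backtracks
# catastrophically on long inputs.
naive = re.compile(r"^(a+)+b$")

# Equivalent unambiguous pattern: one quantifier, linear-time matching.
safe = re.compile(r"^a+b$")

# Both patterns accept exactly the same language on small inputs...
for s in ["ab", "aaab", "aaaaab"]:
    assert naive.match(s) and safe.match(s)
for s in ["b", "aba", "aa"]:
    assert not naive.match(s) and not safe.match(s)

# ...but on a long non-matching input such as "a" * 40 (no trailing 'b'),
# the naive pattern takes exponential time, while the safe one fails fast.
print(safe.match("a" * 40))  # None, returned immediately
```

An alternative-patterns field in the model's output (described below) is one place such trade-offs can surface.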
The failure mode is often silent: the model produces a syntactically valid pattern that matches the wrong strings. A model that generates semantically incorrect but syntactically valid regex is more dangerous than one producing obvious errors.
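This silent-failure mode can be shown in a few lines. A hedged sketch in Python (the buggy pattern is invented for illustration, not actual model output):

```python
import re

# Intended spec: match a version string like "1.2.3".
# The dot is left unescaped, so it matches ANY character. The pattern
# compiles fine and passes the happy-path check, which hides the bug.
wrong = re.compile(r"^\d+.\d+.\d+$")
right = re.compile(r"^\d+\.\d+\.\d+$")

assert wrong.match("1.2.3")          # looks correct on valid input
assert right.match("1.2.3")

# But the wrong pattern also accepts strings the spec should reject:
assert wrong.match("1x2y3")          # silently wrong
assert right.match("1x2y3") is None  # correctly rejected
```

This is why the model bundles test cases with each pattern: semantic errors only show up when the pattern is exercised against strings it should reject.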
Architecture & Training
Base Model: Qwen2.5-Coder-3B-Instruct
Framework: Apple MLX on Apple Silicon
Technique: LoRA (Low-Rank Adaptation)
Training Data: Regex patterns, explanations, optimizations, and test cases
Output Format: JSON with pattern, explanation, alternatives, and test cases
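A sketch of how a caller might consume the structured JSON output. The field names mirror the documented components (pattern, explanation, alternatives, test cases), but the exact keys and this sample response are assumptions for illustration:

```python
import json
import re

# Hypothetical model response; the schema is an assumption based on the
# documented output components, not the model's exact key names.
raw = r"""{
  "pattern": "^\\d+\\.\\d+\\.\\d+$",
  "explanation": "Three dot-separated groups of digits.",
  "alternatives": ["^(?:\\d+\\.){2}\\d+$"],
  "test_cases": {"match": ["1.2.3", "10.0.1"], "no_match": ["1.2", "a.b.c"]}
}"""

result = json.loads(raw)
pattern = re.compile(result["pattern"])

# The bundled test cases let the caller verify the pattern mechanically,
# catching the "syntactically valid but semantically wrong" failure mode.
assert all(pattern.match(s) for s in result["test_cases"]["match"])
assert not any(pattern.match(s) for s in result["test_cases"]["no_match"])
print(result["pattern"])  # ^\d+\.\d+\.\d+$
```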
Base model selection was the highest-leverage decision. Qwen2.5-Coder carries prior knowledge of regex syntax, tokenizes special characters more effectively than general-purpose models, and has seen regex patterns in pretraining. Switching from a general model to Qwen2.5-Coder produced the most significant accuracy improvement across all iterations.
Training Progression:
Iteration 1 – Invalid regex syntax in the majority of cases. A base model without a code-focused foundation could not reliably produce correct special-character sequences.
Iteration 2 – Syntax improved substantially after the base-model switch. Pattern logic remained poor for lookahead, lookbehind, and non-capturing group constructs.
Iteration 3 – Acceptable accuracy on common patterns. Performance remained inconsistent on nested quantifiers and alternation edge cases.
Final Version – Stable output with reliable pattern generation for the common use cases represented in the training data.
Capabilities
The model generates multiple regex components in structured JSON:
Primary Pattern with named capture groups where appropriate
Plain-language Explanation breaking down each pattern component
Alternative Patterns where performance trade-offs exist
Test Cases with both matching and non-matching examples
Example input: "Extract email addresses from text, allowing only .com and .org domains"
Example output: Primary regex pattern, detailed explanation, 2 alternative patterns with performance trade-offs, 8 test cases (4 valid, 4 invalid).
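For the email prompt above, a generated pattern can be checked against its bundled test cases the same way. A hedged sketch (the pattern and test strings are illustrative, not the model's actual output):

```python
import re

# Illustrative pattern for the prompt above: email addresses restricted
# to .com and .org domains, with named capture groups for the parts.
pattern = re.compile(
    r"^(?P<local>[A-Za-z0-9._%+-]+)@(?P<domain>[A-Za-z0-9.-]+\.(?:com|org))$"
)

valid = ["alice@example.com", "bob.smith@nonprofit.org"]
invalid = ["carol@example.net", "no-at-sign.com"]

assert all(pattern.match(s) for s in valid)
assert not any(pattern.match(s) for s in invalid)

# Named groups make the extracted parts self-documenting.
print(pattern.match("alice@example.com")["domain"])  # example.com
```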
Limitations
- Trained on common patterns – highly specialized regex for uncommon domains may require human expertise
- English language only (multilingual character class handling not covered)
- Complex nested patterns require manual review and optimization
- Performance varies with pattern complexity – expert regex knowledge needed for edge cases
- Not a substitute for domain expertise in security-sensitive pattern matching
See the official Hugging Face documentation for complete technical details.
Quick Start
Hugging Face
Access the model and documentation
Ollama
ollama pull fahidnasir/Regex-Helper && ollama run fahidnasir/Regex-Helper
Python
from mlx_lm import load, generate

model, tokenizer = load("fahidnasir/Regex-Helper")
response = generate(
    model,
    tokenizer,
    prompt="Extract email addresses from text, allowing only .com and .org domains",
)
print(response)
Key Takeaways
- Base model selection is the highest-leverage decision – a code-specialized base improved accuracy more than any single data or hyperparameter change.
- Data quality matters more than dataset size – smaller sets of correctly labeled, structurally consistent examples produce better output than larger noisy datasets.
- Four training iterations is normal, not failure – systematic output failures at each stage revealed specific gaps that targeted data additions addressed.
- Domain-specific challenges require domain-specific solutions – general fine-tuning strategies require adaptation when output has zero tolerance for partial correctness.
- Silent failures are the primary risk – syntactically valid but logically incorrect patterns are more dangerous in production than obvious errors.