Regex-Helper: Fine-Tuned Model for Regular Expression Generation

A 3B-parameter fine-tuned model that generates regular expressions, pattern explanations, and test cases from natural language descriptions, built on Qwen2.5-Coder-3B-Instruct using LoRA.

Regex-Helper

Regex-Helper generates regular expressions from natural language descriptions, along with pattern explanations, optimization suggestions, and test cases. Regex generation is deceptively difficult: a pattern must be simultaneously syntactically exact, contextually correct, and logically sound, with no margin for single-character errors. This model delivers reliable patterns across common use cases.

πŸ” Problem

Regex generation differs fundamentally from general code generation due to:

Perfect Syntax Precision — One wrong character produces an invalid pattern, or one that silently matches the wrong strings; there is no partially correct output

Contextual Understanding — The same specification means different things across regex engines (PCRE, ERE, Python re, JavaScript)

Logical Consistency — Patterns must be mathematically sound, not merely syntactically valid

Performance Awareness — Naive patterns cause catastrophic backtracking on certain inputs
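The backtracking point is easy to demonstrate with Python's standard `re` module. The pathological pattern below is a textbook illustration, not output from the model:

```python
import re
import time

evil = re.compile(r"(a+)+$")   # nested quantifiers: exponential backtracking on failure
safe = re.compile(r"a+$")      # matches the same strings, but without the blowup

text = "a" * 18 + "b"          # no possible match, forcing a full backtracking search

start = time.perf_counter()
assert evil.search(text) is None
evil_time = time.perf_counter() - start

start = time.perf_counter()
assert safe.search(text) is None
safe_time = time.perf_counter() - start

print(f"evil: {evil_time:.4f}s, safe: {safe_time:.6f}s")
```

Lengthening the input by a few characters multiplies the evil pattern's runtime, while the safe pattern stays effectively constant.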

The failure mode is often silent: the model produces a syntactically valid pattern that matches the wrong strings. A model that generates semantically incorrect but syntactically valid regex is more dangerous than one producing obvious errors.
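As a concrete illustration (mine, not model output), the classic silent failure is an unescaped metacharacter: the pattern compiles cleanly and matches every intended string, but also accepts strings it should reject:

```python
import re

# Intended: word chars, '@', word chars, a literal dot, word chars.
naive = re.compile(r"\w+@\w+.\w+")    # bug: unescaped '.' matches ANY character
fixed = re.compile(r"\w+@\w+\.\w+")   # '\.' matches only a literal dot

assert naive.fullmatch("user@example.com")    # looks correct...
assert naive.fullmatch("user@example!com")    # ...but silently accepts garbage

assert fixed.fullmatch("user@example.com")
assert not fixed.fullmatch("user@example!com")
```

A test suite that only checks positive examples would never catch the naive version, which is why the model emits non-matching test cases as well.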

πŸ› οΈ Architecture & Training

Base Model: Qwen2.5-Coder-3B-Instruct

Framework: Apple MLX on Apple Silicon

Technique: LoRA (Low-Rank Adaptation)

Training Data: Regex patterns, explanations, optimizations, and test cases

Output Format: JSON with pattern, explanation, alternatives, and test cases
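For concreteness, here is one plausible shape of a single training record, validated the way a data-cleaning pass might check it. The field names and nesting are hypothetical; the actual dataset schema is not published:

```python
import json
import re

# Hypothetical training record; field names and structure are illustrative only.
record = {
    "instruction": "Match a 4-digit year from 1900 to 2099",
    "pattern": r"\b(?:19|20)\d{2}\b",
    "explanation": "A non-capturing group for the century ('19' or '20'), "
                   "then two digits, bounded by word boundaries.",
    "test_cases": {"match": ["1999", "2024"], "no_match": ["1899", "20245"]},
}

line = json.dumps(record)              # one line of a JSONL training file
parsed = json.loads(line)

# Verify the labeled example is internally consistent before training on it.
compiled = re.compile(parsed["pattern"])
assert all(compiled.search(s) for s in parsed["test_cases"]["match"])
assert not any(compiled.search(s) for s in parsed["test_cases"]["no_match"])
```

Structural consistency of this kind is what lets a small dataset outperform a larger noisy one.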

Base model selection was the highest-leverage decision. Qwen2.5-Coder carries prior knowledge of regex syntax, tokenizes special characters more effectively than general-purpose models, and has seen regex patterns in pretraining. Switching from a general model to Qwen2.5-Coder produced the most significant accuracy improvement across all iterations.

Training Progression:

Iteration 1 — Invalid regex syntax in the majority of cases. A base model without a code-focused foundation could not reliably produce correct special-character sequences.

Iteration 2 — Syntax improved substantially after the base model switch. Pattern logic remained poor for lookahead, lookbehind, and non-capturing group constructs.

Iteration 3 — Acceptable accuracy on common patterns. Performance remained inconsistent on nested quantifiers and alternation edge cases.

Final Version — Stable output with reliable pattern generation for the common use cases represented in the training data.
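The constructs called out above (lookarounds, non-capturing groups) are exactly where small mistakes stay invisible. For reference, correct versions in Python's `re` look like this (examples mine, not model output):

```python
import re

# Lookaheads: password must contain an uppercase letter and a digit.
pwd = re.compile(r"^(?=.*[A-Z])(?=.*\d)[A-Za-z\d]{8,}$")
assert pwd.match("Passw0rdX")
assert not pwd.match("password")          # no uppercase, no digit

# Lookbehind plus a non-capturing group: price digits after a '$' sign.
price = re.compile(r"(?<=\$)\d+(?:\.\d{2})?")
assert price.search("total: $19.99").group() == "19.99"
```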

📊 Capabilities

The model generates multiple regex components in structured JSON:

Primary Pattern with named capture groups where appropriate

Plain-language Explanation breaking down each pattern component

Alternative Patterns where performance trade-offs exist

Test Cases with both matching and non-matching examples

Example input: "Extract email addresses from text, allowing only .com and .org domains"

Example output: Primary regex pattern, detailed explanation, 2 alternative patterns with performance trade-offs, 8 test cases (4 valid, 4 invalid).
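A hypothetical response for the prompt above, with field names chosen purely for illustration (the model's exact JSON schema may differ), can be sanity-checked in Python before use:

```python
import re

# Hypothetical example of the structured JSON output; field names are
# illustrative, not the model's exact schema.
output = {
    "pattern": r"\b[\w.+-]+@[\w-]+\.(?:com|org)\b",
    "explanation": "An email local part, '@', a domain name, and a .com or .org TLD.",
    "alternatives": [r"[\w.+-]+@[\w-]+\.(?:com|org)"],
    "test_cases": {
        "match": ["alice@example.com", "bob.smith@nonprofit.org"],
        "no_match": ["carol@example.net", "not-an-email"],
    },
}

# Compile the primary pattern and run the bundled test cases.
compiled = re.compile(output["pattern"])
assert all(compiled.search(s) for s in output["test_cases"]["match"])
assert not any(compiled.search(s) for s in output["test_cases"]["no_match"])
```

Running the generated test cases against the generated pattern is a cheap guard against the silent-failure mode described earlier.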

πŸ›‘οΈ Limitations

  • Trained on common patterns — highly specialized regex for uncommon domains may require human expertise
  • English language only (multilingual character class handling not covered)
  • Complex nested patterns require manual review and optimization
  • Performance varies with pattern complexity — expert regex knowledge needed for edge cases
  • Not a substitute for domain expertise in security-sensitive pattern matching

See the model card on Hugging Face for complete technical details.

🚀 Quick Start

Hugging Face

Access the model and documentation

Ollama

ollama pull fahidnasir/Regex-Helper && ollama run fahidnasir/Regex-Helper

Python

from mlx_lm import load, generate

# Load the fine-tuned model, then generate from a single prompt.
model, tokenizer = load("fahidnasir/Regex-Helper")
prompt = "Extract email addresses from text, allowing only .com and .org domains"
print(generate(model, tokenizer, prompt=prompt))

💡 Key Takeaways

  1. Base model selection is the highest-leverage decision — a code-specialized base improved accuracy more than any single data or hyperparameter change.
  2. Data quality matters more than dataset size — smaller sets of correctly labeled, structurally consistent examples produce better output than larger noisy datasets.
  3. Four training iterations is normal, not failure — systematic output failures at each stage revealed specific gaps that targeted data additions addressed.
  4. Domain-specific challenges require domain-specific solutions — general fine-tuning strategies require adaptation when output has zero tolerance for partial correctness.
  5. Silent failures are the primary risk — syntactically valid but logically incorrect patterns are more dangerous in production than obvious errors.