Regex Engines
- Summary
-
Discussion
- Is my regex executed directly by a regex engine?
- Could you explain how automata theory is applied to regexes?
- What are the different types of regex engines?
- How do the different regex engines compare?
- What are the essential rules that a regex engine follows?
- Could you explain the concepts "greedy" and "eager" with examples?
- Could you explain backtracking with an example?
- Milestones
- References
- Further Reading
- Article Stats
- Cite As
A regular expression describes a search pattern that can be applied on textual data to find matches. A regex is typically compiled to a form that can be executed efficiently on a computer. The actual search operation is performed by the regex engine, which makes use of the compiled regex.
To write good regexes, it's helpful for programmers to know how these engines work. There are a few types of engines, often implemented as a Finite Automaton. In fact, regexes are related to Automata Theory and Regular Languages.
The syntax and semantics of regexes have been standardized by IEEE as POSIX BRE and ERE. However, there are many non-standard variants. Often, the differences are subtle. Programmers who design regexes must be aware of the variant being used by the engine.
Discussion
-
Is my regex executed directly by a regex engine? A regex engine receives two inputs: the regex pattern plus the input string. Programmers specify regexes as strings. While an engine could be designed to work with strings directly, there's a better and more efficient way.
It's been shown that for every regex there's an equivalent Finite State Machine (FSM) or Finite Automaton (FA). In other words, the regex can be modelled as a finite set of states and transitions among these states based on inputs received at a state. Therefore, the job of a compiler is to take the original regex string and compile it into a finite automaton, which can be more easily executed by the engine.
In some implementations, a preprocessor may be invoked before the compiler. This substitutes macros or character classes, such as, replacing
\p{Alpha}
with[a-zA-Z]
. A preprocessor also does locale-specific substitutions.Let's note that there's no standard definition of what's a regex engine. Some may consider parsing, compiling and execution as part of the engine.
-
Could you explain how automata theory is applied to regexes? There are two types of automata:
- Deterministic Finite Automaton (DFA): Given a state and an input, there's a well-defined output state.
- Non-Deterministic Finite Automaton (NFA): Given a state and an input, there could be multiple possible output states. A variant of NFA is ε-NFA, where a state transition can happen even without an input.
It's been proven that every DFA can be converted to an NFA, and vice versa. In the accompanying figure, all three automata are equivalent and represent the regex
abb*a
(orab+a
). For NFA, when 'b' is received in state q1, the automaton can remain in q1 or move to q2. Thus, it's non-deterministic. For ε-NFA, the automaton can move from q2 to q1 even without an input. It's been said,Regular expressions can be thought of as a user-friendly alternative to finite automata for describing patterns in text.
-
What are the different types of regex engines? Regex engines could be implemented as DFA or NFA. However, in simpler language, a regex engine can be classified as follows:
- Text-Directed: Engine attempts all paths of the regex before moving to next character of input. Thus, this engine doesn't backtrack. Since all paths are attempted at once, it will return the longest match. For example,
(Set|SetValue)
on input "SetValue" will match "SetValue". - Regex-directed: If the engine fails at a position, it backtracks to attempt an alternative path. Paths are attempted in left-to-right order. Thus, it returns the leftmost match even if there's a longer match in another path later. For example,
(Set|SetValue)
on input "SetValue" will match "Set".
Most modern engines are regex-directed because this is the only way to implement useful features such as lazy quantifiers and backreferences; and atomic grouping and possessive quantifiers that give extra control to backtracking.
Today's regexes are feature rich and can't always to implemented efficiently as an automaton. Lookaround assertions and backreferences are hard to implement as NFA. Most regex engines use recursive backtracking instead.
- Text-Directed: Engine attempts all paths of the regex before moving to next character of input. Thus, this engine doesn't backtrack. Since all paths are attempted at once, it will return the longest match. For example,
-
How do the different regex engines compare? With recursive backtracking, pathological regexes result in lots of backtracking and searching through alternative paths. The time complexity grows exponentially. Thompson's NFA (or its equivalent DFA) is more efficient and maintains linear-time complexity. A compromise is to use Thompson algorithm and backtrack only when needed for backreferences. GNU's awk and grep tools use DFA normally and switch to NFA when backreferences are used.
Ruby uses non-recursive backtracking but this too grows exponentially for pathological regexes. Ruby's engine is called Oniguruma.
DFA is more efficient than NFA since the automaton is in only one state at any given time. A traditional NFA tries every path before failing. A POSIX NFA tries every path to find the longest match. A text-directed DFA spends more time and memory analyzing and compiling the regex but this could be optimized to compile on the fly.
In terms of code size, NFA regex in ed (1979) was about 350 lines of C code. Henry Spencer's 1986 implementation was 1900 lines and his 1992 POSIX DFA was 4500 lines.
-
What are the essential rules that a regex engine follows? A regex engine executes the regex one character at a time in left-to-right order. This input string itself is parsed one character at a time, in left-to-right order. Once a character is matched, it's said to be consumed from the input, and the engine moves to the next input character.
The engine is by default greedy. When quantifiers (such as
* + ? {m,n}
) are used, it will try to match as many characters from the input string as possible. The engine is also eager, reporting the first match it finds.If the regex doesn't match a character in input, it does two things. It will backtrack to an earlier greedy operation and see if a less greedy match will result in a match. Otherwise, the engine will move to the next character and attempt to match the regex all over again at this position. Either way, the engine always knows its current position within the regex. If the regex specifies alternatives, if one search path fails, the engine will backtrack to match the next alternative. Therefore, the engine also stores backtracking positions.
-
Could you explain the concepts "greedy" and "eager" with examples? Take regex
a.*o
on input "cat dog mouse". Even though "at do" is a valid match, since the engine is greedy, the match it gives is "at dog mo". To avoid this greedy behaviour, we can use a lazy or non-greedy match:a.*?o
will match "at do". Ungreedy match in some flavours such as PCRE can be specified using 'U' flag.A non-greedy
\w{2,3}?
on input "abc" will match "ab" rather than "abc". Suppose the regex is\w{2,3}?$
, then the match is "abc" and not "bc", even though the regex is non-greedy. This is because the engine is eager to report the first match it finds. Thus, it first matches "ab", then sees that$
doesn't match "c". At this point, the engine will not backtrack for position "b". It will remain at "a" and compare the third character due to{2,3}
. Thus, it finds "abc" as match. It eagerly reports this match.Another example is
(self)?(selfish)?
applied on input "selfish". Because of eagerness, engine will report "self" as the match. However, a text-directed engine will report "selfish". -
Could you explain backtracking with an example? Let's take the regex
/-\d+$/g
and input string "212-244-7688". The regex engine will match-\d+
to "-244" but when it sees$
it declares no match. At this point, it will backtrack to the start of the regex and current position in input string will advance from "-" to "2". In this example, only one backtracking happens. Suppose we apply/\d-\d+$/g
on the same input, we'll have five backtracks as shown in the figure.Engine can also backtrack part of the way. Let's apply
/A\d+\D?./g
to "A1234". The engine will matchA\d+\D?
to "A1234" but when it sees.
there's no match. It will backtrack to\d+
and give up one character so thatA\d+
now matches only "A123". As the engine continues, it will match.
with "4".Another example of backtracking is
pic(ket|nic)
. If the string is "Let's picnic.", the engine will matchpic
to "pic" but will fail the next character match (k vs. n). The engine knows there's an alternative. It will backtrack to end ofpic
and process the second alternative.
Milestones
Ken Thompson shows in an ACM paper how a regex can be converted to a NFA. He presents an engine that can track alternative paths in the NFA simultaneously. This is now called Thompson NFA. For a regex of length m and input of length n, Thompson NFA requires \(O(mn)\) time. In comparison, the backtracking regex implementation requires \(O(2^n)\) time when there are n alternative paths. An NFA can be built from simple operations (such as concatenation, alternation, looping) on partial NFAs.
The regexp(3) library of Unix (Eighth Edition) adapts Thompson's algorithm to extract submatches or capturing groups. This work is credited to Rob Pike and Dave Presotto. This goes unnoticed and is not widely used. However, a year later, Henry Spencer reimplements the library interface from scratch using recursive backtracking. This is later adopted by Perl, PCRE, Python, and others.
2010
Google open sources RE2 that's based on automata theory, has linear-time complexity and uses a fixed size stack. Because Google uses regexes for customer-facing tools such as Code Search, backtracking and exponential-time complexity of earlier implementations could lead to Denial-of-Service (DoS) attacks. In addition, recursive backtracking can lead to stack overflows. Work on RE2 can be traced to the work on Code Search in 2006.
References
- Cox, Russ. 2007. "Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...)." January. Accessed 2019-08-09.
- Cox, Russ. 2010a. "Regular Expression Matching in the Wild." March. Accessed 2019-08-09.
- Cox, Russ. 2010b. "RE2: a principled approach to regular expression matching." Google Open Source Blog, March 11. Accessed 2019-08-12.
- Debuggex. 2019. "PCRE Regex Cheatsheet." Accessed 2019-08-09.
- Friedl, Jeffrey E. F. 2006. "Mastering Regular Expressions." Third Edition, O'Reilly Media. Accessed 2019-05-10.
- Goyvaerts, Jan. 2016. "Alternation with The Vertical Bar." Regular-Expressions.info, December 05. Accessed 2019-05-04.
- Goyvaerts, Jan. 2017. "How a Regex Engine Works Internally." Regular-Expressions.info, June 09. Accessed 2019-05-04.
- Goyvaerts, Jan. 2018. "Regular Expressions Reference." Regular-Expressions.info, May 18. Accessed 2019-05-04.
- Houston, Gary. 2019. "regex - Henry Spencer's regular expression libraries." Via GitHub.io. Accessed 2019-08-09.
- Karakoidas, Vassilios and Diomidis Spinellis. 2008. "FIRE/J - Optimizing Regular Expression Searches with Generative Programming." Software: Practice and Experience, 38(6):557–573, May. Accessed 2019-08-09.
- Kyashif, Denis. 2019. "Implementing a Regular Expression Engine." February 17. Accessed 2019-08-09.
- Maccaferri, Leniel. 2009. "Regex engine in C# - the Regex Parser." February 24. Accessed 2019-05-08.
- Schulz, Marius. 2014. "Why Using the Greedy .* in Regular Expressions Is Almost Never What You Actually Want." June 03. Accessed 2019-08-09.
- Shaughnessy, Pat. 2012. "Exploring Ruby’s Regular Expression Algorithm." April 03. Accessed 2019-08-09.
Further Reading
- Wikipedia. 2019a. "Comparison of regular expression engines." Wikipedia, April 05. Accessed 2019-05-10.
- Wikipedia. 2019b. "Thompson's construction." Wikipedia, March 14. Accessed 2019-05-10.
- Qiu, Roger. 2016. "Regular Expression Engine Comparison Chart." GitHub Gist, May 06. Accessed 2019-05-10.
- Leipzig, Rust. 2017. "A comparison of regex engines." GitHub.io, March 28. Accessed 2019-05-10.
- Becker, Pete. 2006. "Regular Expressions." Dr.Dobb's, April 11. Accessed 2019-08-09.
- Moser, Jeff. 2009. "How .NET Regular Expressions Really Work." Moserware, March 16. Accessed 2019-08-12.
Article Stats
Cite As
See Also
- Regular Expression
- Regular Expressions in Python
- Regular Expressions in Perl
- Regex Optimization
- PCRE
- String Searching Algorithm