Regular Expression

Article Info

Contributed by
2 authors

Last updated on
2022-02-15 11:53:08


Regular expressions are essentially search patterns defined by a sequence of characters. Let's say, we wish to search for the substring 'grey' in a text document. This is a simple string search. What if we wish to search for both 'grey' and 'gray'? With simple string searches, we would need to do two separate searches and collate the results. With regular expressions, we can define a single search pattern that will give us matches for either of the two substrings.

In the above example, any of these patterns will work: gr[ae]y, gr(a|e)y, (gray|grey), gray|grey
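As a quick check, the four patterns above can be tried with Python's built-in re module. This is only a minimal sketch; the sample sentence is made up for illustration.

```python
import re

# Hypothetical sample text: two target words plus a near miss ("green").
text = "The grey cat sat by the gray wall near a green door."

patterns = [r"gr[ae]y", r"gr(a|e)y", r"(gray|grey)", r"gray|grey"]

# group(0) is the full match, regardless of any capturing groups in the pattern.
results = [[m.group(0) for m in re.finditer(p, text)] for p in patterns]

for pattern, found in zip(patterns, results):
    print(pattern, found)
```

Each pattern finds 'grey' and 'gray' in a single pass and skips 'green'.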

A regular expression is commonly called a regex. Although regex has a history going back to the 1950s, it was popularized in computer science in the 1990s by the Perl programming language. Apart from Perl's regex, many other variants exist.

Discussion

Could you give examples where regular expressions are used?
A regex to match email addresses. Source: Computer Hope 2017.
Regex is useful typically when the format or syntax of data is known. It's to this expected syntax that a regex is written. For example, email addresses are of the form username@domainname.tld. We can make a regex to extract all emails coming from a specific domain.
A common use of regex is for search-and-replace. For example, developers can use regex to quickly change the order of arguments across hundreds of source code files.
Regex can be used to process log files and filter log messages that match a particular signature. It can be used to filter reports on Google Analytics Dashboard.
Regex can also be used within database commands, say, to obtain usernames that contain non-alphanumeric characters in them.
Suppose we've reorganized the file structure in a web application. In web servers, regex can be used to match requested URL patterns and redirect them to new locations. Regex is also useful in web scraping tasks.
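Two of the uses above, extracting emails from one domain and reordering function arguments, might look like this in Python. This is a rough sketch: the domain example.com, the function name resize and the sample strings are all hypothetical, and the email pattern is deliberately simplistic.

```python
import re

# Extract only the addresses that belong to one (hypothetical) domain.
message = "Contact alice@example.com, bob@other.org or carol.smith@example.com."
emails = re.findall(r"[\w.+-]+@example\.com\b", message)

# Search-and-replace: swap the two arguments of a hypothetical resize() call.
source_line = "resize(width, height)"
swapped = re.sub(r"resize\((\w+),\s*(\w+)\)", r"resize(\2, \1)", source_line)

print(emails)
print(swapped)
```

The backreferences \1 and \2 in the replacement string refer to the two captured arguments.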
What are the basic building blocks of regular expressions?
Regex basics. Source: Upscene 2015.
A regex has these basic building blocks:
- Anchor: A regex is processed by a regex engine. Anchors make assertions about the current position of the engine. Common anchors are ^ (beginning of line) and $ (end of line). For a multiline input, we can use \A and \z to match beginning and end of string respectively.
- Character: . matches any character; \s matches any whitespace; [12a-c] matches the set of characters '1 2 a b c'; (a|e)s matches 'a' or 'e' followed by 's'.
- Group: A sequence of characters grouped within ().
- Quantifier: Specify how many matches of the character or group are allowed. For zero or more matches, use *; + for one or more matches; ? for zero or one match; {m,n} for m to n matches.
- Modifier/Flag: Modify the regex in specific ways. Use i for case insensitive match; m is for multiline match; s for single line match so that . matches newlines as well.
For example, /^model{1,2}(ing)? /i will match any line starting with 'model ', 'modell ', 'modeling ' or 'modelling ', in any mix of uppercase and lowercase.
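The example regex can be tried in Python, where the flags are passed as constants rather than written after a closing slash. A minimal sketch with made-up input lines:

```python
import re

# /^model{1,2}(ing)? /i expressed with Python flags.
# MULTILINE lets ^ match at the start of every line of the input.
pattern = re.compile(r"^model{1,2}(ing)? ", re.IGNORECASE | re.MULTILINE)

lines = "model T\nModelling clay\nmodels of cars\nremodel now\nMODELING data"
matches = [m.group(0) for m in pattern.finditer(lines)]

print(matches)
```

The lines 'models of cars' and 'remodel now' are skipped: the first lacks the space after 'model', and the second does not start with 'model'.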
Which are the special characters in regex?
Regex special characters are often called metacharacters. The common ones include \ ^ $ . * + ? [ { ( ) | characters.
Note that characters } ] have special meaning too when used with their counterparts { [; but on their own, they are treated literally. For example, a(b{2,})?c will match 'ac', 'abbc', 'abbbbc' and more; ab}?c will match exactly either 'abc' or 'ab}c', and nothing else.
Since metacharacters have special meaning, to use them literally, we would have to escape them with the backslash \ character. For example, to match the dollar value in string 'This costs $22.50 after discount', the regex would be \$(\d+(?:\.\d+)?), where the dollar symbol is escaped. It's treated literally and not as end of the string.
Metacharacter pair [] defines a character class. When a range is specified, such as [0-6], the hyphen is a special character. However, if the hyphen occurs without either start or end, such as [-6], it's interpreted literally. Likewise, most other metacharacters, such as . $ * +, are interpreted literally within a character class, which is therefore an alternative to escaping with a backslash. The exceptions are \ and ] and a leading ^, which keep their special meaning inside a class.
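The escaping rules above can be verified with a short Python sketch. The last test string is made up for illustration.

```python
import re

# Escaped dollar sign: matched literally, not as an end-of-line anchor.
price = re.search(r"\$(\d+(?:\.\d+)?)", "This costs $22.50 after discount")
amount = price.group(1)

# A lone } is literal, so ab}?c accepts exactly 'abc' and 'ab}c'.
optional_brace = [bool(re.fullmatch(r"ab}?c", s)) for s in ("abc", "ab}c", "abbc")]

# Inside a character class, a leading hyphen, the dot and the dollar sign are literal.
literal_in_class = re.findall(r"[-6.$]", "a-b 6.5 $x")

print(amount, optional_brace, literal_in_class)
```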
What are regex groups and how are they useful?
Parts of a regex can be grouped within parentheses, (). A quantifier can be applied to an entire group. Thus, ab+ will match 'abb', whereas (ab)+ will match 'ab' and 'abab' but not the whole of 'abb'. Groups are also useful to restrict alternation. Thus, cat|dogs will match either 'cat' or 'dogs' but (cat|dog)s will match either 'cats' or 'dogs'.
Groups are generally of three types:
- Numbered Capturing Group: Each group within a regex is given a number starting from 1. The number 0 is reserved for the entire regex. This number can later be used within the regex for subsequent matches or in a replacement string. These are called backreferences.
- Named Capturing Group: First introduced in Python, naming the groups makes it easier to backreference them. For example, we match one or more digits into a named group called 'score' using (?P<score>\d+) and backreference it as (?P=score).
- Non-Capturing Group: Sometimes we wish to match a group but we're not going to backreference it later. Therefore, there's no need to capture the group. For example, here we match one or more word characters without capture, (?:\w+).
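The three group types can be illustrated in Python as follows. This is a small sketch; the sample strings and the group names first and last are invented for the example.

```python
import re

# Quantifier applied to a whole group.
group_repeat = [bool(re.fullmatch(r"(ab)+", s)) for s in ("ab", "abab", "abb")]

# Numbered backreference: \1 must repeat whatever group 1 captured.
repeated_word = re.search(r"\b(\w+) \1\b", "this is is a test").group(1)

# Named group with a named backreference inside the same regex.
score = re.search(r"(?P<score>\d+)-(?P=score)", "final 21-21 tie").group("score")

# Named groups used in a replacement string.
reordered = re.sub(r"(?P<first>\w+) (?P<last>\w+)", r"\g<last>, \g<first>", "Ada Lovelace")

# The non-capturing group is matched but not numbered, so only one group is returned.
captured = re.search(r"(?:\w+)@(\w+)", "user@host").groups()

print(group_repeat, repeated_word, score, reordered, captured)
```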
Could you explain the concepts of lookahead and lookbehind in regex?
Illustrating regex lookaround assertions. Source: Adapted from Jain 2017, slide 16.
Sometimes we wish to match a string but only if it occurs in a certain context. For example, we wish to match all dark versions of colours that start with 'g': darkgreen, darkgrey, darkgray, etc. While dark(g[a-z]+) will do the job, an alternative is to match the base colour and look behind to see if it's dark: (?<=dark)g[a-z]+. The full match will return only the colour without the 'dark' prefix.
Looking before or after a particular match is called a lookaround assertion. It's called an assertion because it doesn't consume characters from the input. It only declares whether there's a match or not. We illustrate examples of the four lookaround assertions:
- Positive lookahead: Match 'q' that's followed by 'u', q(?=u).
- Negative lookahead: Match 'q' that's not followed by 'u', q(?!u).
- Positive lookbehind: Match 'b' that follows 'a', (?<=a)b.
- Negative lookbehind: Match 'b' that doesn't follow 'a', (?<!a)b.
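The four assertions, and the colour example, can be tried in Python. A minimal sketch with made-up test strings; match positions are reported as zero-based indices.

```python
import re

# Lookbehind keeps 'dark' out of the match itself.
colours = "darkgreen lightgrey darkgray darkblue grey"
dark_g = re.findall(r"(?<=dark)g[a-z]+", colours)

def starts(pattern, text):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

q_text = "quit Iraq qi"
b_text = "ab cb b"

q_followed_by_u = starts(r"q(?=u)", q_text)      # positive lookahead
q_not_followed_by_u = starts(r"q(?!u)", q_text)  # negative lookahead
b_after_a = starts(r"(?<=a)b", b_text)           # positive lookbehind
b_not_after_a = starts(r"(?<!a)b", b_text)       # negative lookbehind

print(dark_g, q_followed_by_u, q_not_followed_by_u, b_after_a, b_not_after_a)
```

In every case the match is a single 'q' or 'b', or the bare colour name. The lookaround text is checked but never becomes part of the match.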
What are some other advanced regex features?
Some advanced regex features include the following:
- Laziness: A regex such as .*, is greedy. It will match as many characters as possible preceding a comma. A lazy or non-greedy match is achieved using '?', such as .*?,. This will match only till the first comma.
- Possessive Quantifier: Using '+', such as .*+, we can tell the regex engine to avoid unnecessary backtracking. This can be seen as a notational convenience for atomic grouping.
- Atomic Grouping: These are special non-capturing groups that prevent unnecessary backtracking. Once an atomic group has matched, the engine discards the backtracking positions inside it, which can improve performance. An example of this is a(?>bc|b)c. It matches 'abcc' but not 'abc': after the group has matched 'bc', the engine will not go back and try 'b'.
- Conditionals: We can use lookaround assertions for specifying conditions. Depending on condition's result (true or false), other parts of the regex can be processed.
- Recursion: Use (?R) to match nested constructs.
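Greedy versus lazy matching and a simple conditional can be sketched in Python. Note the assumptions: Python's re module expresses conditionals as a test on whether a numbered group took part in the match, which is one flavour-specific form of the idea, and it has no (?R) recursion, so that feature is not shown. The test strings are invented.

```python
import re

text = "a,b,c"
greedy = re.match(r".*,", text).group(0)  # runs up to the last comma
lazy = re.match(r".*?,", text).group(0)   # stops at the first comma

# Conditional on group 1: a closing ")" is required only if "(" was matched.
bracketed = re.compile(r"^(\()?\d+(?(1)\))$")
accepted = [s for s in ("123", "(123)", "(123", "123)") if bracketed.search(s)]

print(greedy, lazy, accepted)
```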
Do regular expressions differ across programming languages?
Regular expressions are processed by regex engines, of which there are many flavours. A programming language typically adopts one of these engines. Differences across these flavours are not easy to remember. Programmers should be aware that migrating regex from one engine to another must be accompanied with proper testing.
Some of these flavours include JGsoft, .NET, Perl, PHP, R, JavaScript, Python, Tcl ARE, POSIX BRE, POSIX ERE, GNU BRE and GNU ERE. The oldest of these is the Basic Regular Expression (BRE) that has limited power and expressiveness. BRE was later extended to the Extended Regular Expression (ERE). To study how these various engines differ, take a look at Roger Qiu's comparison chart or Wikipedia's entry.
As an example, the sed tool uses BRE by default, where + matches a literal plus sign. But with option -E it uses ERE, where + is a quantifier and \+ must be used for a literal match.
Since Perl popularized regex, one of the popular engines that came about is called Perl Compatible Regular Expression (PCRE). PCRE was later updated to PCRE2. PCRE and its variants have been adopted by many programming languages.
What are some tips or best practices for using regex?
For readable and maintainable regex, use the flag x to ignore whitespace. Thus, a complex regex can be expanded with useful comments.
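In Python the x flag is spelled re.VERBOSE. The sketch below expands a price-matching regex across several commented lines; the split into whole units and cents is an illustrative choice.

```python
import re

# With re.VERBOSE, whitespace in the pattern is ignored and # starts a comment.
price_pattern = re.compile(
    r"""
    \$                # literal dollar sign
    (\d+)             # group 1: whole units
    (?:\.(\d{2}))?    # optional cents, captured as group 2
    """,
    re.VERBOSE,
)

parts = price_pattern.search("This costs $22.50 after discount").groups()
print(parts)
```

To match a literal space or # in this mode, escape it or place it inside a character class.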
To find a match in the right place, use anchors. The use of .* can make the engine backtrack often, so construct a more specific regex. Likewise, the use of lazy quantifier {.*?} can be inefficient. Instead, say what you don't want to match, {[^}]*}. Also, atomic groups can save on backtracking. However, lazy quantifiers can be better in some simple scenarios: *?'s is faster than *'s.
Regex must be designed to fail fast. For example, (?=.*fleas).* does a lot of backtracking on lines that don't contain 'fleas'. On the other hand, ^(?=.*fleas).* has a lookahead anchored at the start of the string and will fail faster.
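Use contrast, that is, be explicit about what characters to match and what not to match. For example, to match 'ABC123' the regex ^.+\d{3}$ is a poor choice since . and \d are not mutually exclusive: .+ first consumes the digits as well, forcing the engine to backtrack, and the pattern also accepts strings such as 'ABC1234'. Instead, use ^\D+\d{3}$.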
When using alternations (use of |), put the more common patterns earlier.
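Two of these tips, the negated character class and the use of contrast, can be compared in Python. This is a sketch with made-up strings; the braces are escaped here for safety, although a lone brace is also treated literally by Python's re module.

```python
import re

text = "{alpha} and {beta}"

# Same matches, but the negated class states directly what must not be consumed.
lazy = re.findall(r"\{.*?\}", text)
negated = re.findall(r"\{[^}]*\}", text)

# Contrast: \D and \d are mutually exclusive, so the second pattern is stricter.
loose = bool(re.search(r"^.+\d{3}$", "ABC1234"))    # .+ can swallow a digit
strict = bool(re.search(r"^\D+\d{3}$", "ABC1234"))  # exactly three digits allowed
exact = bool(re.search(r"^\D+\d{3}$", "ABC123"))

print(lazy, negated, loose, strict, exact)
```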
Where should I not use regular expressions?
While regex is useful, it can be overused and abused. An extreme example is a 6343-character long regex to match an email address. Here's a quote from 1997, attributed to Jamie Zawinski,
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
While URL paths and email addresses can be parsed using regex, there are dedicated and mature libraries for these tasks. You should prefer to use them instead. Regex is also not the best choice for parsing HTML or code since there are better tools to generate tokenized outputs.
Humans write in a number of different ways. A regex will not adequately capture all the different variations. In general, when code is read and maintained by many developers, complex regex will be problematic. Regex is generally not expressive enough to match balanced parentheses, such as (aa (bbb) (bbb) aa).
Could you point me to useful resources for working with regex?
Debuggex helps in visualizing your regex. Source: Debuggex 2019.
Among the places to learn regex are RegexOne, RexEgg, Regex Crossword and Jan Goyvaerts' Regular-Expressions.info. Beginners might like Net Ninja's series of sixteen videos on regex.
For absolute basics, consider reading Mike Malone's blog post from 2007.
A classic book on regex is Mastering Regular Expressions by Jeffrey Friedl. Another good book is Regular Expressions Cookbook by Jan Goyvaerts and Steven Levithan. Dave Child has published a handy regex cheatsheet.
There are many websites that help with debugging and visualizing regex patterns and their matches. A few recommended ones include Regex101, Debuggex and RegExr. These typically support multiple regex flavours. Regex Storm is particular to .NET regex flavour; Rubular is for Ruby regex. You can also read a comparison of some online regex testers.

Milestones

1956

Mathematician Stephen Cole Kleene coins the term regular expressions as a notation for expressing the algebra of regular sets. The regex metacharacter * is named Kleene star in his honour.

1967

Ken Thompson at Bell Labs writes a new version of QED text editor for the MIT CTSS system. He introduces regular expressions to QED, thus bringing regex from the world of mathematics to computer science for the first time. Regex in QED is also compiled on the fly, for which Thompson receives a US patent.

1970

In this decade, regex makes its way into some Unix programs and utilities such as sed, awk and grep. It's been said that awk is the first language to make regex a first-class programming construct. In 1975, Al Aho creates the egrep command with a much more expressive syntax than the basic one supported by grep.

1986

Henry Spencer expands the regex syntax and provides an engine for the same. His regex library could be freely included in other programs. He later goes on to create an even better regex engine for Tcl.

1992

POSIX character classes mapped to equivalent regex. Source: Goyvaerts 2018d.

IEEE defines POSIX BRE and POSIX ERE as part of the standard IEEE Std 1003.1-1992.

1997

Philip Hazel releases the PCRE regex library, which follows the syntax and semantics of Perl 5. It's later adopted by PHP, Apache and many others. In 2015, PCRE2 is released.

1998

In the 1990s, Larry Wall, the creator of Perl, adopts and expands on Spencer's library. Perl 5.005 is released in 1998 and it includes enhancements to the regex engine. Perl's innovations in regex include lazy quantifiers, non-capturing parentheses, inline mode modifiers, lookahead, and a readability mode. ColdFusion, Java, JavaScript, the .NET Framework, PHP, Python, and Ruby are some of the languages that have since adopted Perl's regex syntax and features.


Cite As

Devopedia. 2022. "Regular Expression." Version 8, February 15. Accessed 2023-11-12. https://devopedia.org/regular-expression
