REGULAR EXPRESSION SYNTAX SUMMARY
The full syntax and semantics of the regular expressions that are supported by PCRE2 are described in the pcre2pattern documentation. This document contains a quick-reference summary of the syntax.
QUOTING
\x where x is non-alphanumeric is a literal x
\Q...\E treat enclosed characters as literalESCAPED CHARACTERS
This table applies to ASCII and Unicode environments.
\a alarm, that is, the BEL character (hex 07)
\cx "control-x", where x is any ASCII printing character
\e escape (hex 1B)
\f form feed (hex 0C)
\n newline (hex 0A)
\r carriage return (hex 0D)
\t tab (hex 09)
\0dd character with octal code 0dd
\ddd character with octal code ddd, or backreference
\o{ddd..} character with octal code ddd..
\U "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
\uhhhh character with hex code hhhh (if PCRE2_ALT_BSUX is set)
\xhh character with hex code hh
\x{hhh..} character with hex code hhh..Note that \0dd is always an octal code. The treatment of backslash followed by a non-zero digit is complicated; for details see the section "Non-printing characters" in the pcre2pattern documentation, where details of escape processing in EBCDIC environments are also given.
When \x is not followed by {, from zero to two hexadecimal digits are read, but if PCRE2_ALT_BSUX is set, \x must be followed by two hexadecimal digits to be recognized as a hexadecimal escape; otherwise it matches a literal "x". Likewise, if \u (in ALT_BSUX mode) is not followed by four hexadecimal digits, it matches a literal "u".
CHARACTER TYPES
\C is dangerous because it may leave the current matching point in the middle of a UTF-8 or UTF-16 character. The application can lock out the use of \C by setting the PCRE2_NEVER_BACKSLASH_C option. It is also possible to build PCRE2 with the use of \C permanently disabled.
By default, \d, \s, and \w match only ASCII characters, even in UTF-8 mode or in the 16-bit and 32-bit libraries. However, if locale-specific matching is happening, \s and \w may also match characters with code points in the range 128-255. If the PCRE2_UCP option is set, the behaviour of these escape sequences is changed to use Unicode properties and they match many more characters.
GENERAL CATEGORY PROPERTIES FOR \p and \P
PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P
Perl and POSIX space are now the same. Perl added VT to its space character set at release 5.18.
SCRIPT NAMES FOR \p AND \P
Ahom, Anatolian_Hieroglyphs, Arabic, Armenian, Avestan, Balinese, Bamum, Bassa_Vah, Batak, Bengali, Bopomofo, Brahmi, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Caucasian_Albanian, Chakma, Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari, Duployan, Egyptian_Hieroglyphs, Elbasan, Ethiopic, Georgian, Glagolitic, Gothic, Grantha, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hatran, Hebrew, Hiragana, Imperial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese, Kaithi, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Khojki, Khudawadi, Lao, Latin, Lepcha, Limbu, Linear_A, Linear_B, Lisu, Lycian, Lydian, Mahajani, Malayalam, Mandaic, Manichaean, Meetei_Mayek, Mende_Kikakui, Meroitic_Cursive, Meroitic_Hieroglyphs, Miao, Modi, Mongolian, Mro, Multani, Myanmar, Nabataean, New_Tai_Lue, Nko, Ogham, Ol_Chiki, Old_Hungarian, Old_Italic, Old_North_Arabian, Old_Permic, Old_Persian, Old_South_Arabian, Old_Turkic, Oriya, Osmanya, Pahawh_Hmong, Palmyrene, Pau_Cin_Hau, Phags_Pa, Phoenician, Psalter_Pahlavi, Rejang, Runic, Samaritan, Saurashtra, Sharada, Shavian, Siddham, SignWriting, Sinhala, Sora_Sompeng, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham, Tai_Viet, Takri, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Tirhuta, Ugaritic, Vai, Warang_Citi, Yi.
CHARACTER CLASSES
In PCRE2, POSIX character set names recognize only ASCII characters by default, but some of them use Unicode properties if PCRE2_UCP is set. You can use \Q...\E inside a character class.
QUANTIFIERS
ANCHORS AND SIMPLE ASSERTIONS
MATCH POINT RESET
\K is honoured in positive assertions, but ignored in negative ones.
ALTERNATION
CAPTURING
ATOMIC GROUPS
COMMENT
OPTION SETTING
The following are recognized only at the very start of a pattern or after one of the newline or \R options with similar syntax. More than one of them may appear. (*LIMIT_MATCH=d) set the match limit to d (decimal number)
Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the limits set by the caller of pcre2_match(), not increase them. The application can lock out the use of (*UTF) and (*UCP) by setting the PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options, respectively, at compile time.
NEWLINE CONVENTION
These are recognized only at the very start of the pattern or after option settings with a similar syntax.
WHAT \R MATCHES
These are recognized only at the very start of the pattern or after option setting with a similar syntax.
LOOKAHEAD AND LOOKBEHIND ASSERTIONS
Each top-level branch of a look behind must be of a fixed length.
BACKREFERENCES
SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)
CONDITIONAL PATTERNS
BACKTRACKING CONTROL
The following act immediately they are reached:
The following act only when a subsequent match failure causes a backtrack to reach them. They all force a match failure, but they differ in what happens afterwards. Those that advance the start-of-match point do so only if the pattern is not anchored. (*COMMIT) overall failure, no advance of starting point
CALLOUTS
The allowed string delimiters are ` ' " ^ % # $ (which are the same for the start and the end), and the starting delimiter { matched with the ending delimiter }. To encode the ending delimiter within the string, double it.
Last updated
Was this helpful?
