What is a regular expression
A regular expression is a group of characters and symbols used to find a specific pattern in a text.
A regular expression is a pattern that is matched against a subject string from left to right. "Regular expression" is the full term, but it is often abbreviated to "regex" or "regexp". Regular expressions can be used to replace text within a string, to validate forms, to extract substrings based on a pattern match, and so on.
Imagine you are writing an application and want to set rules for the usernames people can choose: a username may contain letters, numbers, underscores, and hyphens, and its length is limited so that it does not look ugly. We use the following regular expression to validate a username:
The above regular expression will accept `john_doe`, `jo-hn_doe`, and `john12_as`.
But it does not match `Jo`, because that string contains an uppercase letter and is too short.
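The pattern itself is not shown in the text, so the sketch below uses an assumed one, `^[a-z0-9_-]{3,15}$` (the length limits 3 and 15 are illustrative guesses), together with a hypothetical helper `is_valid_username`. It only illustrates how such a check might look in Python and that the example names behave as described.

```python
import re

# Assumed pattern: the original expression is not shown in the text.
# Lowercase letters, digits, underscores and hyphens, 3 to 15 characters.
USERNAME_RE = re.compile(r"^[a-z0-9_-]{3,15}$")

def is_valid_username(name):
    """Return True if the name matches the assumed username pattern."""
    return USERNAME_RE.match(name) is not None

for candidate in ["john_doe", "jo-hn_doe", "john12_as", "Jo"]:
    print(candidate, is_valid_username(candidate))
```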
1. Basic matching
A regular expression is simply a pattern of characters that we use to perform a search in a text.
For example, the regular expression `the` represents a rule: the letter `t`, followed by `h`, followed by `e`.
"the" => The fat cat sat on the mat.
The regular expression `123` matches the string `123`. The regular expression is compared with the input string character by character.
Regular expressions are normally case-sensitive, so `The` will not match `the`.
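As a rough illustration, literal matching and case sensitivity can be tried out in Python. The choice of Python's `re` module is an assumption made for these sketches; other regex engines behave the same way for this simple case.

```python
import re

text = "The fat cat sat on the mat."

# A literal pattern matches exactly the same characters, in the same order.
lower = re.findall(r"the", text)
upper = re.findall(r"The", text)

print(lower)  # ['the']
print(upper)  # ['The']
```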
"The" => The fat cat sat on the mat.
2. Metacharacters
Metacharacters are the building blocks of regular expressions.
A metacharacter does not stand for itself; it is interpreted in a special way. Some metacharacters take on a different meaning when written inside square brackets. The metacharacters are as follows:
Metacharacter | Description |
---|---|
. | A period matches any single character except a line break. |
[ ] | Character class. Matches any character contained between the square brackets. |
[^ ] | Negated character class. Matches any character not contained between the square brackets. |
* | Matches 0 or more repetitions of the preceding symbol. |
+ | Matches 1 or more repetitions of the preceding symbol. |
? | Makes the preceding symbol optional. |
{n,m} | Matches at least n but no more than m repetitions of the preceding symbol. |
(xyz) | Character group. Matches the characters xyz in that exact order. |
\| | Alternation. Matches either the characters before or the characters after the symbol. |
\ | Escapes the next character, so that reserved characters such as [ ] ( ) { } . * + ? ^ $ \ \| can be matched literally. |
^ | Matches the beginning of the input. |
$ | Matches the end of the input. |
2.1 The dot operator `.`
`.` is the simplest example of a metacharacter.
`.` matches any single character except a line break.
For example, the expression `.ar` matches any character, followed by `a`, followed by `r`.
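A minimal Python sketch of the dot operator, using the sample sentence from this section. The second call is an added illustration of the line-break rule.

```python
import re

text = "The car parked in the garage."

# "." stands for any single character except a line break.
matches = re.findall(r".ar", text)
print(matches)  # ['car', 'par', 'gar']

# By default "." does not match "\n", so nothing is found here.
no_match = re.findall(r".ar", "\nar")
print(no_match)  # []
```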
".ar" => The car parked in the garage.
2.2 Character sets
Character sets are also called character classes.
Square brackets are used to specify a character set.
A hyphen inside the square brackets specifies a range of characters.
The order of the characters inside the square brackets does not matter.
For example, the expression `[Tt]he` matches both `the` and `The`.
"[Tt]he" => The car parked in the garage.
A period inside square brackets means a literal period.
The expression `ar[.]` matches the string `ar.`: the letter `a`, followed by `r`, followed by a period.
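The character-set rules above can be sketched in Python as follows. The range example with `"a1 B2 c3"` is a made-up input added for illustration.

```python
import re

# Either "T" or "t", followed by "he".
the_words = re.findall(r"[Tt]he", "The car parked in the garage.")
print(the_words)  # ['The', 'the']

# Inside brackets the period is literal, so only "ar." matches.
literal_dot = re.findall(r"ar[.]", "A garage is a good place to park a car.")
print(literal_dot)  # ['ar.']

# A hyphen gives a range: one lowercase letter followed by one digit.
ranged = re.findall(r"[a-z][0-9]", "a1 B2 c3")
print(ranged)  # ['a1', 'c3']
```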
"ar[.]" => A garage is a good place to park a car.
2.2.1 Negating character sets
In general, `^` indicates the beginning of a string, but when it is the first character inside square brackets it negates the character set.
For example, the expression `[^c]ar` matches any character other than `c`, followed by `ar`.
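A short Python sketch of a negated character set, applied to the sentence used in this section:

```python
import re

text = "The car parked in the garage."

# "[^c]" matches any one character that is not "c".
matches = re.findall(r"[^c]ar", text)
print(matches)  # ['par', 'gar']
```

Note that `car` is skipped because its first character is the excluded `c`.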
"car" => The car parked in the garage.
2.3 Number of repetitions
The metacharacters `+`, `*`, and `?` specify how many times the preceding subpattern may occur.
These metacharacters behave differently in different situations.
2.3.1 The `*` sign
The `*` sign matches the preceding element zero or more times.
For example, the expression `a*` matches zero or more repetitions of the letter `a`; because zero repetitions are allowed, it can also match an empty string. When `*` follows a character set, it repeats the whole set: the expression `[a-z]*` matches any run of lowercase letters in a line.
"[a-z]*" => The car parked in the garage #21.
The `*` sign can be combined with the `.` metacharacter to match any string of characters: `.*`.
The `*` sign can also be combined with the whitespace shorthand `\s` to match a run of whitespace. For example, the expression `\s*cat\s*` matches `cat` preceded by zero or more whitespace characters and followed by zero or more whitespace characters.
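The behaviour of `*` can be checked with a small Python sketch. The sentence is an assumed sample in which `cat` appears both as a word and inside `concatenation`.

```python
import re

# Zero repetitions are allowed, so "a*" matches the empty string too.
empty_ok = re.fullmatch(r"a*", "") is not None
many_ok = re.fullmatch(r"a*", "aaa") is not None
print(empty_ok, many_ok)  # True True

# Optional whitespace around "cat"; the second hit is inside "concatenation".
text = "The fat cat sat on the concatenation."
matches = re.findall(r"\s*cat\s*", text)
print(matches)  # [' cat ', 'cat']
```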
"\scat\s" => The fatcatsat on the concatenation.
2.3.2 The `+` sign
The `+` sign matches the preceding character one or more times.
For example, the expression `c.+t` matches the letter `c`, followed by at least one character, followed by the letter `t`. Because `+` is greedy, the match runs up to the last `t` in the sentence.
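The greedy behaviour of `+` is easiest to see by running the expression. A minimal Python sketch:

```python
import re

text = "The fat cat sat on the mat."

# ".+" is greedy: it grabs as much as possible, then backs off to the last "t".
match = re.search(r"c.+t", text)
print(match.group())  # cat sat on the mat
```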
"c.+t" => The fat cat sat on the mat.
2.3.3 The `?` sign
In regular expressions, the metacharacter `?` makes the preceding character optional, i.e. it matches zero or one occurrence of it.
For example, the expression `[T]?he` matches the strings `he` and `The`.
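The difference between a required and an optional `T` can be sketched in Python like this:

```python
import re

text = "The car is parked in the garage."

with_t = re.findall(r"[T]he", text)       # "T" is required
optional_t = re.findall(r"[T]?he", text)  # "T" may be absent

print(with_t)      # ['The']
print(optional_t)  # ['The', 'he']
```

The second result contains `he` because the lowercase `the` matches only from its `h` onward.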
"[T]he" => The car is parked in the garage.
"[T]?he" => The car is parked in the garage.
2.4 The `{}` quantifier
In regular expressions, braces `{}` are a quantifier: they specify how many times a character or a group of characters can be repeated.
For example, the expression `[0-9]{2,3}` matches at least 2 but no more than 3 digits in the range 0 to 9.
"[0-9]{2,3}" => The number was 9.9997 but we rounded it off to 10.0.
We can omit the second number.
For example, `[0-9]{2,}` matches 2 or more digits.
If the comma is also omitted, the preceding element is repeated an exact number of times.
For example, `[0-9]{3}` matches exactly 3 digits.
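The three forms of the brace quantifier can be compared side by side. A minimal Python sketch using the sentence from this section:

```python
import re

text = "The number was 9.9997 but we rounded it off to 10.0."

two_to_three = re.findall(r"[0-9]{2,3}", text)
two_or_more = re.findall(r"[0-9]{2,}", text)
exactly_three = re.findall(r"[0-9]{3}", text)

print(two_to_three)   # ['999', '10']
print(two_or_more)    # ['9997', '10']
print(exactly_three)  # ['999']
```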
"[0-9]{2,}" => The number was 9.9997 but we rounded it off to 10.0.
"[0-9]{3}" => The number was 9.9997 but we rounded it off to 10.0.
2.5 `(...)` Capturing groups
A capturing group is a set of subpatterns written inside parentheses `(...)`. As described above, a quantifier such as `{}` applies to the preceding character. When a quantifier is placed after a capturing group, it repeats the whole group instead. For example, the expression `(ab)*` matches zero or more consecutive occurrences of `ab`.
We can also use the or character `|` inside `()`. For example, `(c|g|p)ar` matches `car`, `gar`, or `par`.
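A rough Python sketch of both uses of parentheses: repeating a whole group, and alternation inside a group. The inputs `"ababab"` and `"aba"` are made up for illustration.

```python
import re

# A quantifier after a group repeats the whole group.
repeated = re.fullmatch(r"(ab)*", "ababab") is not None
broken = re.fullmatch(r"(ab)*", "aba") is not None
print(repeated, broken)  # True False

# group(0) is the full match, group(1) the letter captured by the group.
text = "The car is parked in the garage."
found = [(m.group(0), m.group(1)) for m in re.finditer(r"(c|g|p)ar", text)]
print(found)  # [('car', 'c'), ('par', 'p'), ('gar', 'g')]
```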
"(c|g|p)ar" => The car is parked in the garage.
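2.6 The `|` or operator
The `|` operator defines alternation: it works like a condition between alternative expressions, matching either one or the other.
For example, `(T|t)he|car` matches either `(T|t)he` (that is, `The` or `the`) or `car`.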
"(T|t)he|car" => The car is parked in the garage.
2.7 Escaping special characters
The backslash `\` is used to escape the character immediately following it in an expression. This allows the reserved characters `{ } [ ] / \ + * . $ ^ | ?` to be matched literally.
To match one of these special characters, precede it with a backslash `\`.
For example, `.` matches any character except a newline. To match a literal `.`, write `\.`.
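A minimal Python sketch of escaping. The pattern `(f|c|m)at\.?` is used here as an illustrative choice, and `re.escape` is the standard-library helper that adds the backslashes for you.

```python
import re

text = "The fat cat sat on the mat."

# "\." is a literal period; the trailing "?" makes it optional.
matches = [m.group(0) for m in re.finditer(r"(f|c|m)at\.?", text)]
print(matches)  # ['fat', 'cat', 'mat.']

# re.escape adds the backslashes needed to match a string literally.
escaped = re.escape("10.0")
print(escaped)  # 10\.0
```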
"(f|c|m)at...?" => The fat cat sat on the mat.
2.8 Anchors
In regular expressions, anchors are used to check whether a match is at the beginning or at the end of the input string. `^` specifies the beginning, `$` the end.
2.8.1 The `^` sign
`^` checks that the match starts at the beginning of the input string.
For example, applying the expression `^a` to `abc` gives the result `a`. But `^b` gives no result, because the string `abc` does not start with `b`.
For example, `^(T|t)he` matches `The` or `the` only at the start of the input string.
"(T|t)he" => The car is parked in the garage.
"^(T|t)he" => The car is parked in the garage.
2.8.2 The `$` sign
Similar to the `^` sign, the `$` sign checks that the match ends at the end of the input string.
For example, `(at\.)$` matches `at.` only at the end of the input string.
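Both anchors can be tried out with a small Python sketch. `all_matches` is a hypothetical helper introduced only to keep the example short.

```python
import re

def all_matches(pattern, text):
    """Hypothetical helper: return the full text of every match."""
    return [m.group(0) for m in re.finditer(pattern, text)]

print(all_matches(r"^a", "abc"))  # ['a']
print(all_matches(r"^b", "abc"))  # []

sentence = "The car is parked in the garage."
print(all_matches(r"(T|t)he", sentence))   # ['The', 'the']
print(all_matches(r"^(T|t)he", sentence))  # ['The']

dotted = "The fat cat. sat. on the mat."
print(all_matches(r"(at\.)", dotted))   # ['at.', 'at.', 'at.']
print(all_matches(r"(at\.)$", dotted))  # ['at.']
```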
"(at.)" => The fat cat. sat. on the mat.
"(at.) $" => The fat cat. sat. on the mat.
3. Shorthand character sets
Regular expressions provide shorthands for some commonly used character sets. These are as follows:
Shorthand | Description |
---|---|
. | Matches any character except a newline |
\w | Matches alphanumeric characters and the underscore, equivalent to [a-zA-Z0-9_] |
\W | Matches non-alphanumeric characters, equivalent to [^\w] |
\d | Matches digits, equivalent to [0-9] |
\D | Matches non-digits, equivalent to [^\d] |
\s | Matches whitespace characters, equivalent to [\t\n\f\r\p{Z}] |
\S | Matches non-whitespace characters, equivalent to [^\s] |
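A few of these shorthands in a Python sketch, with made-up inputs. One caveat: in Python, `\w` and `\d` also match non-ASCII letters and digits in text strings unless the `re.ASCII` flag is given, so the bracket equivalents in the table hold exactly only for ASCII input.

```python
import re

digits = re.findall(r"\d+", "Room 42, floor 7")
words = re.findall(r"\w+", "john_doe-99")
pieces = re.split(r"\s+", "a  b\tc")
non_digits = re.findall(r"\D+", "2024-05")

print(digits)      # ['42', '7']
print(words)       # ['john_doe', '99']
print(pieces)      # ['a', 'b', 'c']
print(non_digits)  # ['-']
```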
4. Lookaround (lookahead and lookbehind)
Lookahead and lookbehind, together called lookaround, are specific types of non-capturing groups: they are used to match a pattern but are not included in the match result. Lookarounds are used when a pattern must be preceded or followed by another pattern.
For example, to get all the numbers that follow the `$` symbol, we can use the positive lookbehind `(?<=\$)[0-9\.]*`.
This expression matches text that is preceded by `$` and consists of the characters 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, and `.`.
These characters can occur zero or more times.
The lookaround constructs are as follows:
Symbol | Description |
---|---|
?= | Positive lookahead |
?! | Negative lookahead |
?<= | Positive lookbehind |
?<! | Negative lookbehind |
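The dollar-sign example can be tried in Python. Note that the construct is written `(?<=\$)` without any spaces. The sample text and its amounts are made up for illustration.

```python
import re

# Hypothetical sample text; the amounts are invented.
text = "Coffee is $4.50 and tea is $3"

# The lookbehind requires a "$" just before the match but does not include it.
amounts = re.findall(r"(?<=\$)[0-9\.]*", text)
print(amounts)  # ['4.50', '3']
```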
4.1 `?=...` Positive lookahead
A positive lookahead `?=...` asserts that the first part of the expression must be followed by the lookahead expression.
The returned match contains only the text matched by the first part of the expression.
To define a positive lookahead, use parentheses with a question mark and an equals sign inside them: `(?=...)`.
The lookahead expression is written after the equals sign inside the parentheses.
For example, the expression `(T|t)he(?=\sfat)` matches `The` or `the`, and the positive lookahead `(?=\sfat)` in the parentheses requires that `The` or `the` is immediately followed by a space and `fat`.
"[T|t]he(? =\sfat)" => The fat cat sat on the mat.
4.2 `?!...` Negative lookahead
A negative lookahead `?!` is used to get all matches that are not followed by a given pattern.
A negative lookahead is defined in the same way as a positive lookahead, except that `=` is replaced by `!`, giving `(?!...)`.
The expression `(T|t)he(?!\sfat)` matches `The` or `the` only where it is not followed by a space and `fat`.
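The two lookahead forms can be compared in a short Python sketch. The first part of the expression is written here as `(T|t)he`, and the constructs contain no spaces.

```python
import re

text = "The fat cat sat on the mat."

# Positive lookahead: keep "The"/"the" only when " fat" comes next.
followed = [m.group(0) for m in re.finditer(r"(T|t)he(?=\sfat)", text)]
# Negative lookahead: keep "The"/"the" only when " fat" does NOT come next.
not_followed = [m.group(0) for m in re.finditer(r"(T|t)he(?!\sfat)", text)]

print(followed)      # ['The']
print(not_followed)  # ['the']
```

In both cases the text inspected by the lookahead is not part of the returned match.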
"[T|t]he(?! \sfat)" => The fat cat sat on the mat.
4.3 `?<=...` Positive lookbehind
A positive lookbehind, written `(?<=...)`, is used to get all matches that are preceded by a given pattern.
For example, the expression `(?<=(T|t)he\s)(fat|mat)` matches `fat` or `mat` only where it is preceded by `The` or `the` and a space.
"(? <=[T|t]he\s)(fat|mat)" => The fat cat sat on the mat.
4.4 `?<!...` Negative lookbehind
A negative lookbehind, written `(?<!...)`, is used to get all matches that are not preceded by a given pattern.
For example, the expression `(?<!(T|t)he\s)(cat)` matches `cat` only where it is not preceded by `The` or `the` and a space.
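Both lookbehind forms in a rough Python sketch. Python's `re` module only accepts lookbehind patterns of fixed width, which holds here because `(T|t)he\s` always spans four characters. Other engines may be more or less strict.

```python
import re

# Positive lookbehind: "fat" or "mat" preceded by "The " or "the ".
after_the = [m.group(0) for m in
             re.finditer(r"(?<=(T|t)he\s)(fat|mat)", "The fat cat sat on the mat.")]
print(after_the)  # ['fat', 'mat']

# Negative lookbehind: "cat" NOT preceded by "The " or "the ".
hits = [(m.start(), m.group(0)) for m in
        re.finditer(r"(?<!(T|t)he\s)(cat)", "The cat sat on cat.")]
print(hits)  # [(15, 'cat')]
```

Only the second `cat` (at index 15) is reported, because the first one follows `The `.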
"(? <! [T|t]he\s)(cat)" => The cat sat on cat.
5. Flags
Flags are also called modifiers, because they modify the output of a regular expression. These flags can be used in any order or combination, and are an integral part of the regular expression.
Flag | Description |
---|---|
i | Case insensitive: matching ignores case. |
g | Global search: finds all matches in the input string. |
m | Multiline: the anchor metacharacters ^ and $ work on each line. |
5.1 Case insensitive
The modifier `i` is used to ignore case.
For example, the expression `/The/gi` searches for `The`; the `i` flag makes the match case-insensitive, so it finds both `the` and `The`, and the `g` flag makes the search global, covering the whole input string.
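The `/.../gi` notation is the delimiter-and-flags syntax used by languages such as JavaScript. In Python, taken here as an assumed example language, a comparable effect comes from passing `re.IGNORECASE`:

```python
import re

text = "The fat cat sat on the mat."

case_sensitive = re.findall(r"The", text)
case_insensitive = re.findall(r"The", text, flags=re.IGNORECASE)

print(case_sensitive)    # ['The']
print(case_insensitive)  # ['The', 'the']
```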
"The" => The fat cat sat on the mat.
"/The/gi" => The fat cat sat on the mat.
5.2 Global search
The modifier `g` is used to perform a global search, that is, to return all matches rather than only the first one.
For example, the expression `/.(at)/g` means: any character except a newline, followed by `at`. Because of the `g` flag, all such matches in the input string are returned.
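Python's `re` module has no `g` flag. As a sketch of the same idea, `re.search` returns only the first match, while `re.finditer` walks through all of them:

```python
import re

text = "The fat cat sat on the mat."

# Without a global search: only the first match.
first = re.search(r".(at)", text).group(0)
# Comparable to the g flag: every match in the input.
every = [m.group(0) for m in re.finditer(r".(at)", text)]

print(first)  # fat
print(every)  # ['fat', 'cat', 'sat', 'mat']
```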
"/. (at)/" => The fat cat sat on the mat.
"/. (at)/g" => The fat cat sat on the mat.
5.3 Multiline
The multiline modifier `m` is used to perform a multi-line match.
As described earlier, the anchors `^` and `$` check whether the pattern is at the beginning or the end of the input string. If we want them to work at the beginning and end of each line instead, we need the multi-line modifier `m`.
For example, the expression `/at(.)?$/gm` means: the letter `a`, followed by `t`, optionally followed by any single character except a newline, at the end of a line. Because of the `m` flag, the pattern is matched at the end of every line of the input string, and because of `g`, all results are returned.
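The effect of the `m` flag only shows on input that actually spans several lines. The Python sketch below assumes a three-line version of the sample sentence and uses `re.MULTILINE`, Python's counterpart of `m`.

```python
import re

# Assumed three-line input, so that "end of each line" is visible.
text = "The fat\ncat sat\non the mat."

end_of_string = [m.group(0) for m in re.finditer(r".at(.)?$", text)]
end_of_lines = [m.group(0) for m in re.finditer(r".at(.)?$", text, flags=re.MULTILINE)]

print(end_of_string)  # ['mat.']
print(end_of_lines)   # ['fat', 'sat', 'mat.']
```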
"/.at(.) ? $/" => The fat cat sat on the mat.
"/.at(.) ? $/gm" => The fat cat sat on the mat.
Contributions
Thanks to Learn-regex for the project.
License
MIT © Zeeshan Ahmad