
Introduction to regular expressions

Regular expressions exist to search a piece of text (usually a line). Unlike plain text searches, regular expressions, also known as regex, can include special characters that describe a pattern rather than exact text.

hello

This regex will match the text “hello”.
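As a minimal sketch of how such a plain regex could be tried out, the snippet below uses Python's standard re module. The choice of Python and the sample sentences are assumptions made for illustration; the tutorial itself is not tied to any language.

```python
import re

# A regex without special characters matches exactly that text.
pattern = re.compile(r"hello")

# search() scans the whole string and returns a match object, or None.
found = pattern.search("say hello to regex")
missing = pattern.search("say goodbye")

print(found.group())  # hello
print(missing)        # None
```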

The dot (.)

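The dot is one of the most used special characters in a regex. A dot matches any single character: a letter, digit, white space, underscore or any other character you can think of (including the dot itself). The one common exception is the line break, which the dot does not match by default in most regex engines.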

G-.ava

The above regex would match G-java, but also G-Java, G-Tava, G-%ava, G-.ava, G-?ava, G- ava, and many more.
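The dot example can be checked with a short sketch like the one below, again assuming Python's re module. The list of candidate texts is made up for illustration. Note that in Python the dot does not match a line break unless the re.DOTALL flag is set, so the last candidate is rejected.

```python
import re

pattern = re.compile(r"G-.ava")

# Hypothetical sample texts; the last two should NOT match.
candidates = ["G-java", "G-Java", "G-%ava", "G-.ava", "G- ava", "G-ava", "G-\nava"]

# fullmatch() succeeds only when the whole text fits the pattern.
matched = [text for text in candidates if pattern.fullmatch(text)]

print(matched)  # ['G-java', 'G-Java', 'G-%ava', 'G-.ava', 'G- ava']
```

"G-ava" fails because the dot needs exactly one character to stand in for, and "G-\nava" fails because that character is a line break.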

The literal sign (\)

Imagine we wanted to match a dot.

.

wouldn’t do it, because it matches every other character as well as the dot. The solution is the literal sign (\)! The literal sign is used to escape special characters such as the dot, so that they are treated as plain text.

For example,

\.

would match only a dot. The literal sign works on every special character, including the literal sign itself.

\\

would match only a single backslash (\)
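The difference between an escaped and an unescaped dot can be seen in a small sketch such as this one, assuming Python's re module. The sample texts are invented for illustration. The patterns are written as raw strings (the r prefix) so that Python itself leaves the backslashes alone and passes them to the regex engine.

```python
import re

# An unescaped dot matches every character in "a.b".
any_char_matches = re.findall(r".", "a.b")

# An escaped dot matches only real dots.
dot_matches = re.findall(r"\.", "a.b.c")

# An escaped backslash matches only real backslashes.
backslash_matches = re.findall(r"\\", r"C:\temp\file")

print(any_char_matches)        # ['a', '.', 'b']
print(dot_matches)             # ['.', '.']
print(len(backslash_matches))  # 2
```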

White space (\s)

We might want to match the space character, but also the tab and other kinds of white space. To do that, we can use \s, which matches exactly one white space character.

Hello\sworld!

would match Hello world! with the two words separated by a single space, a single tab, or another single white space character such as a line break.
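A quick way to see what \s accepts is a sketch like the following, assuming Python's re module. The labels and sample texts are made up for illustration. It also shows that a single \s stands for exactly one white space character.

```python
import re

pattern = re.compile(r"Hello\sworld!")

# Hypothetical samples, keyed by the separator between the two words.
samples = {
    "space": "Hello world!",
    "tab": "Hello\tworld!",
    "line break": "Hello\nworld!",
    "no separator": "Helloworld!",
    "two spaces": "Hello  world!",
}

results = {label: bool(pattern.fullmatch(text)) for label, text in samples.items()}

for label, ok in results.items():
    print(label, ok)
# space True
# tab True
# line break True
# no separator False
# two spaces False
```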

The unlimited *

One of the most used signs in regular expressions is the *. The * means repetition, and applies to the previous character. It matches the previous character zero or more times: not at all, once, twice, or any number of times.

.*

matches any run of characters, including an empty one.
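To see the "zero or more" behaviour in practice, here is a minimal sketch assuming Python's re module. The pattern ab*c and the sample texts are not from the tutorial; they are invented here so that the repeated character is easy to spot.

```python
import re

# b* means: the letter b, repeated zero or more times.
pattern = re.compile(r"ab*c")

samples = ["ac", "abc", "abbbc", "adc"]
star_results = [bool(pattern.fullmatch(text)) for text in samples]

print(star_results)  # [True, True, True, False]

# .* accepts any single-line text, even an empty one.
matches_empty = re.fullmatch(r".*", "") is not None
matches_text = re.fullmatch(r".*", "anything at all") is not None

print(matches_empty, matches_text)  # True True
```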

The strict unlimited +

Suppose you wanted to rewrite the above regex, requiring at least one character:

..*

Writing this out can get VERY tedious with longer patterns, so there is a shortcut: the +, which matches the previous character one or more times.

.+

Both * and + can work with any character, not just the dot (.).

For example,

aa*

is a synonym of

a+

Both match any text that consists only of “a”s and contains at least one “a”.
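The claim that aa* and a+ behave the same can be spot-checked with a sketch like this one, assuming Python's re module. The sample texts are made up for illustration, and agreeing on a handful of samples is of course a sanity check rather than a proof.

```python
import re

plus_form = re.compile(r"a+")
long_form = re.compile(r"aa*")

# Hypothetical samples: empty text, valid texts, and texts with other letters.
samples = ["", "a", "aaaa", "aab", "b"]

plus_results = [bool(plus_form.fullmatch(text)) for text in samples]
long_results = [bool(long_form.fullmatch(text)) for text in samples]

print(plus_results)                  # [False, True, True, False, False]
print(plus_results == long_results)  # True
```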

The ? sign

The ? sign can have two meanings. One applies when it comes directly after a * or +, and the other applies when it is not linked to a * or +.

I will start with the unlinked meaning.

cats?

would match both cat and cats. In its unlinked meaning, the ? makes the previous character optional.
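The optional meaning of ? can be tried out with a short sketch such as the one below, assuming Python's re module. The extra sample words are invented for illustration.

```python
import re

# s? means: the letter s, either once or not at all.
pattern = re.compile(r"cats?")

words = ["cat", "cats", "catss", "ca"]
optional_results = {word: bool(pattern.fullmatch(word)) for word in words}

print(optional_results)
# {'cat': True, 'cats': True, 'catss': False, 'ca': False}
```

Only the s is optional: the ? applies to the single character before it, not to the whole word.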

When it is after * or +, it has another meaning.

Think about this regular expression

<.+>

to match an HTML tag and this piece of text

ee<b><u>a</u></b>ee

Instead of matching only <b> or only <u>, it matches <b><u>a</u></b>!

That happens because the * and + try to match as much as they can. The ? offers a way out.

<.+?>

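matches as little as possible, so each match stops at the first > it finds. That also means that in the text

<input type="submit" value="Hi>There">

it would match only <input type="submit" value="Hi> because the > inside the attribute value ends the match early.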

These situations are a reason to be very careful when using regular expressions.
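Both behaviours described in this section can be reproduced with a sketch like the following, assuming Python's re module. The variable names are chosen here for illustration; the patterns and texts are the ones used above.

```python
import re

text = "ee<b><u>a</u></b>ee"

# With a plain +, the match runs on to the LAST > in the text.
greedy = re.findall(r"<.+>", text)

# With +?, each match stops at the FIRST > that follows.
lazy = re.findall(r"<.+?>", text)

print(greedy)  # ['<b><u>a</u></b>']
print(lazy)    # ['<b>', '<u>', '</u>', '</b>']

# The same +? cuts a tag short when an attribute value contains a >.
tag = '<input type="submit" value="Hi>There">'
cut_short = re.search(r"<.+?>", tag).group()

print(cut_short)  # <input type="submit" value="Hi>
```

Neither form understands HTML: one takes too much, the other can take too little, which is the reason for the caution above.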

The end of this tutorial

There is much more to regular expressions, but this tutorial has already covered enough. After all, this is an introductory tutorial.