Regular Expressions (regex)

This module presents pattern searching using regular expressions.

Introduction

Why does this exist? Have you ever wanted to search for a phrase, word or pattern without having to spell it out exactly? Regular expressions are your answer! That may not sound very useful at first, but a typical use case in science is a data file whose variables and values are not laid out conveniently for plotting; a regular expression lets you pull out just the parts you need.

There are several variants: Basic (BRE), Extended (ERE) and Perl Compatible Regular Expressions (PCRE). To keep things simple I will go over Basic and Extended. I will leave it up to the reader to learn PCRE: languages such as Perl and Python support Perl-style regular expressions, but they extend the framework a lot and there are more in-depth guides on using them.

Regex

BRE Metacharacters

Metacharacter   Description
^               Matches the pattern at the beginning of a line
.               Matches any single character
[ ]             Bracket expression; matches a single character listed within the brackets
[^ ]            Matches a single character not listed within the brackets
$               Matches the end of a line, before the newline character
( )             A subexpression that can be recalled later with \n. BRE requires \( \)
\n              Matches the text of the nth subexpression, where n ranges from 1 to 9
*               Matches the preceding element zero or more times
{m,n}           Matches the preceding element at least m and not more than n times. BRE requires \{m,n\}

The backslashes in \( \) and \{ \} are easy to forget when using BRE.

Examples
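
A minimal sketch of the BRE metacharacters from the table above, using plain grep (which uses BRE by default). The sample lines and the variable names are made up for illustration only.

```shell
# Hypothetical sample lines, used only for illustration.
lines=$(printf '%s\n' 'cat' 'caat' 'ct' 'c.t' 'dog dog')

# ^ and $ anchor the match to the whole line; . matches any single character.
any_char=$(printf '%s\n' "$lines" | grep '^c.t$')         # cat and c.t

# * repeats the preceding element zero or more times.
star=$(printf '%s\n' "$lines" | grep '^ca*t$')            # cat, caat and ct

# In BRE the interval braces need backslashes: \{m,n\}.
interval=$(printf '%s\n' "$lines" | grep '^ca\{2,3\}t$')  # caat

# In BRE the group parentheses need backslashes; \1 recalls the first group.
backref=$(printf '%s\n' "$lines" | grep '^\(dog\) \1$')   # dog dog

printf '%s\n' "$any_char" "$star" "$interval" "$backref"
```

Note that the unescaped . in the first pattern also matches the literal dot in c.t; write \. if you only want a literal dot.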

ERE Metacharacters

Typically ERE is more convenient and flexible for regular expression searches and is preferred. In ERE, ( ) and { } are metacharacters without needing the backslash \. If you want to search for a literal parenthesis or brace, escape it with a backslash, such as \( or \}. This is the opposite of BRE.

Metacharacter   Description
?               Matches the preceding element zero or one time, same as {0,1}
+               Matches the preceding element one or more times, same as {1,}
|               Matches the expression before or after it, e.g. cat|dog matches cat or dog

Escape the +, ? and | characters with a backslash to search for them literally.

Examples
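
A minimal sketch of the ERE metacharacters from the table above, using grep -E. The sample lines and the variable names are made up for illustration only.

```shell
# Hypothetical sample lines, used only for illustration.
lines=$(printf '%s\n' 'color' 'colour' 'colouur' 'cat' 'dog' 'bird' '1+1=2')

# ? makes the preceding element optional (zero or one time).
optional=$(printf '%s\n' "$lines" | grep -E '^colou?r$')  # color and colour

# + requires the preceding element one or more times.
one_plus=$(printf '%s\n' "$lines" | grep -E '^colou+r$')  # colour and colouur

# | is alternation; ( ) groups without backslashes in ERE.
either=$(printf '%s\n' "$lines" | grep -E '^(cat|dog)$')  # cat and dog

# An escaped \+ matches a literal plus sign.
literal=$(printf '%s\n' "$lines" | grep -E '1\+1')        # 1+1=2

printf '%s\n' "$optional" "$one_plus" "$either" "$literal"
```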

Definitions

This is a subset consisting of the more common definitions. These character classes work in both BRE and ERE, and must be written inside a bracket expression, e.g. [[:digit:]]. For more information see the Wikipedia regex article.

Character class   Equivalent range   Description
[:alnum:]         [A-Za-z0-9]        Alphanumeric characters
[:alpha:]         [A-Za-z]           Alphabetic characters
[:blank:]         [ \t]              Space and tab characters
[:digit:]         [0-9]              Digits
[:lower:]         [a-z]              Lowercase letters
[:space:]         [ \t\r\n\v\f]      Whitespace characters
[:upper:]         [A-Z]              Uppercase letters
[:xdigit:]        [A-Fa-f0-9]        Hexadecimal digits
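
A short sketch of how these classes are used in practice. The class name goes inside an outer pair of brackets, so the pattern reads [[:digit:]] rather than [:digit:]. The sample line and the variable names are made up for illustration, and -o (print only the matched text) is assumed to be available, as it is in GNU and BSD grep.

```shell
# Hypothetical sample line, used only for illustration.
line='sample 42 0xBEEF'

# Leading run of alphabetic characters.
word=$(printf '%s\n' "$line" | grep -E -o '^[[:alpha:]]+')   # sample

# A hexadecimal literal: 0x followed by one or more hex digits.
hex=$(printf '%s\n' "$line" | grep -E -o '0x[[:xdigit:]]+')  # 0xBEEF

# Collapse any run of spaces and tabs into a single space.
squeezed=$(printf 'a \t b\n' | sed -E 's/[[:blank:]]+/ /g')  # a b

printf '%s\n' "$word" "$hex" "$squeezed"
```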

Commands typically used with regex

Command   Description
grep      file pattern searcher (global regular expression print)
egrep     grep with ERE support, same as grep -E
sed       stream editor, sed -E for ERE support
awk       pattern-directed scanning and processing language

Since ERE is more convenient, the remaining examples will use the ERE format, i.e., egrep or sed -E. The awk language is very detailed and outside the scope of this module (for now).

egrep - extended global regular expression print

You can use egrep to find the lines in a file that match a specified pattern and print them to stdout. For example, if we have a file:

$ cat regex.txt
gravity                = 9.81 m/s
gravitational constant = 6.67408E-11 m^3 kg^-1 s^-2
pi                     = 3.141592
atmospheric pressure is 7.60E+2 Torr

If we want to get the lines related to gravity:

$ egrep "^grav" regex.txt
gravity                = 9.81 m/s
gravitational constant = 6.67408E-11 m^3 kg^-1 s^-2

But what if we only want the numerical values and not the entire line? Then pipe the output to another egrep command and use -o, which displays only the matched text:

$ egrep "^grav" regex.txt | egrep -o "[[:digit:]]+\.[[:digit:]]+[eE]?-?[[:digit:]]*"
9.81
6.67408E-11
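
The pattern above is deliberately loose: it has no way to match a + sign in the exponent, so on the pressure line it would stop at 7.60E. As a sketch of one possible tightening (my own variation, not the only way to write it), the exponent can be grouped so that the sign and digits are only accepted together with the E. The sample lines are the contents of regex.txt, copied into a variable so the block is self-contained; the variable names are made up.

```shell
# Same four lines as regex.txt above, held in a variable for this sketch.
data=$(printf '%s\n' \
  'gravity                = 9.81 m/s' \
  'gravitational constant = 6.67408E-11 m^3 kg^-1 s^-2' \
  'pi                     = 3.141592' \
  'atmospheric pressure is 7.60E+2 Torr')

# Original pattern: the exponent parts are independent and + is not allowed.
loose=$(printf '%s\n' "$data" | grep -E '^atmos' \
  | grep -E -o '[[:digit:]]+\.[[:digit:]]+[eE]?-?[[:digit:]]*')

# Variation: an optional group of E, an optional sign, and at least one digit.
strict=$(printf '%s\n' "$data" \
  | grep -E -o '[[:digit:]]+\.[[:digit:]]+([eE][+-]?[[:digit:]]+)?')

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
```

Inside the bracket expression [+-] the + is an ordinary character, and the - sits at the end so it is not read as a range.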

NB When you want to include a literal - inside [ ], put it at the end so that it is not read as part of a range like [0-9], e.g., [0-9-] will match any digit or -.
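
A small sketch of the difference a trailing - makes inside a bracket expression. The sample text and the variable names are made up for illustration only.

```shell
# Hypothetical sample text, used only for illustration.
text='T = -40 C, range 10-20'

# Digits only: the hyphens are not part of any match.
digits_only=$(printf '%s\n' "$text" | grep -E -o '[0-9]+')    # 40, 10, 20

# Digits or a literal hyphen: the - is last, so it is not a range.
with_hyphen=$(printf '%s\n' "$text" | grep -E -o '[0-9-]+')   # -40, 10-20

printf '%s\n' "$digits_only" "$with_hyphen"
```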

sed - stream editor

sed is a stream editor. That means it can read text as a stream and perform operations on that stream of text. The most common use case is to replace portions of a line using regex. The substitution command has the syntax

`s/<RE or pattern>/<replacement pattern>/`

This will replace the first matching instance on a line. If a g is appended at the end, such that

`s/<RE or pattern>/<replacement pattern>/g`

it will replace all non-overlapping occurrences on the line.
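
A minimal sketch of the difference between the two forms. The sample strings and the variable names are made up for illustration only.

```shell
# Hypothetical sample line, used only for illustration.
line='a=1, b=2, c=3'

# Without g: only the first match on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed -E 's/=/ : /')   # a : 1, b=2, c=3

# With g: every match on the line is replaced.
every=$(printf '%s\n' "$line" | sed -E 's/=/ : /g')       # a : 1, b : 2, c : 3

# Non-overlapping: aaaa contains two separate aa matches, not three.
pairs=$(printf 'aaaa\n' | sed -E 's/aa/b/g')              # bb

printf '%s\n' "$first_only" "$every" "$pairs"
```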

For example, if you want to replace every equals sign surrounded by spaces or tabs with a space, a colon and another space, you can do the following:

$ sed -E 's/[[:blank:]]+=[[:blank:]]+/ : /g' regex.txt
gravity : 9.81 m/s
gravitational constant : 6.67408E-11 m^3 kg^-1 s^-2
pi : 3.141592
atmospheric pressure is 7.60E+2 Torr

sed doesn't overwrite the file; it just prints the result to stdout. You can use > to redirect stdout to a new file:

$ sed -E 's/[[:blank:]]+=[[:blank:]]+/ : /g' regex.txt > regex_mod.txt

sed can mimic grep behaviour to search for a regex and print the matching lines to stdout. Use -E for ERE and -n to suppress the automatic printing of every input line; without -n, sed prints every line in the file and prints each matching line a second time. egrep is just more convenient.

$ sed -E -n '/^grav/p' regex.txt
gravity                = 9.81 m/s
gravitational constant = 6.67408E-11 m^3 kg^-1 s^-2
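
A small sketch of what -n changes. The two sample lines are a shortened, made-up stand-in for regex.txt, and the variable names are hypothetical.

```shell
# Hypothetical two-line input, used only for illustration.
input=$(printf '%s\n' 'gravity = 9.81' 'pi = 3.141592')

# With -n: only lines selected by /^grav/p are printed.
with_n=$(printf '%s\n' "$input" | sed -E -n '/^grav/p')

# Without -n: every line is printed once, and matching lines a second time.
without_n=$(printf '%s\n' "$input" | sed -E '/^grav/p')

printf 'with -n:\n%s\nwithout -n:\n%s\n' "$with_n" "$without_n"
```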

sed has many features. If you want to know more see the man page for sed.

© 2017–2022 David Kalliecharan