Regular Expressions (regex)

This module presents pattern searching using regular expressions.

Introduction

Why does this exist? Have you ever wanted to search for a phrase, word or pattern without pinning down the exact text? Regular expressions are your answer! That may not sound very useful at first, but a typical use case in science is a data file whose variables and values are not laid out conveniently for plotting.

There are several variants: Basic (BRE), Extended (ERE) and Perl Compatible Regular Expressions (PCRE). To keep things simple I will go over Basic and Extended. I will leave it up to the reader to learn PCRE; languages such as Perl and Python support Perl-style expressions, but they extend the framework a lot and there are more in-depth guides on using them.

Regex

BRE Metacharacters

Metacharacter	Description
`^`	Match pattern starting at the beginning of a line
`.`	Match any single character
`[ ]`	Bracket expression, matches a single character within the brackets
`[^ ]`	Matches a single character not inside brackets
`$`	Matches the ending position of a line before the newline character
`( )`	Groups a subexpression that can be recalled later with `\n`. BRE requires `\( \)`
`\n`	Matches the nth subexpression, where n ranges from 1–9
`*`	Matches the preceding element zero or more times
`{m,n}`	Matches the preceding element at least m and not more than n times. BRE requires `\{m,n\}`

The backslashes in \( \) and \{ \} are easy to forget when using BRE.

Examples

[bch]at will match bat, cat and hat
Re\{1,2\}d will match Red and Reed
^[Pp]arrot will only match if a line begins with Parrot or parrot
treats\{0,1\}$ will only match if a line ends with treat or treats
[^b]ar matches any three-character string ending in ar, excluding bar
[0-9]* will match any run of digits, e.g., 1, 109, 65535, or nothing at all (the empty string)
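The BRE examples above can be tried out with plain grep, which uses BRE by default. This is only a sketch: the word list and the variable names are made up for illustration, and the last pattern adds a back-reference (`\(.\)\1`) to show the `\n` metacharacter from the table.

```shell
# Made-up sample words, one per line.
words='bat
cat
hat
rat
Red
Reed
Reeed
Parrot fish
a parrot'

# Bracket expression: one of b, c or h, followed by "at".
bracket=$(printf '%s\n' "$words" | grep '[bch]at')
echo "$bracket"     # bat, cat, hat

# Interval: in BRE the braces need backslashes.
interval=$(printf '%s\n' "$words" | grep 'Re\{1,2\}d')
echo "$interval"    # Red, Reed (Reeed has three e's)

# Anchor: only lines that begin with Parrot or parrot.
anchored=$(printf '%s\n' "$words" | grep '^[Pp]arrot')
echo "$anchored"    # Parrot fish

# Back-reference: \(.\) captures one character, \1 requires the same one again.
doubled=$(printf '%s\n' "$words" | grep '\(.\)\1')
echo "$doubled"     # Reed, Reeed, Parrot fish, a parrot
```

grep prints the whole matching line, which is why "Parrot fish" shows up in full rather than just "Parrot".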

ERE Metacharacters

Typically ERE is more convenient and flexible for regular expression searches and is preferred. In ERE, ( ) and { } are metacharacters without needing the backslash \. If you want to search for a literal parenthesis or brace, escape it with a backslash, such as \( or \}. This is the opposite of BRE.

Metacharacter	Description
`?`	Match preceding element zero or one time, same as `{0,1}`
`+`	Match preceding element one or more times, same as `{1,}`
`|`	Match the expression before or after it, e.g. `cat|dog` matches cat or dog

To search for a literal +, ? or |, escape it with a backslash.

Examples

[0-9]+\.[0-9]+ will match a decimal number with digits on both sides of the point, e.g., 1.0, 3.141592 or 101.325 (NB the period is escaped so that it is matched as a literal character rather than as "any single character")
b?rave will match brave and rave
Re{1,2}d will match Red and Reed
Bre(e|t)t will match Brett and Breet
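The ERE examples above can be checked the same way with `grep -E`. Again this is a sketch with a made-up word list and hypothetical variable names; it also runs the interval example in both BRE and ERE form to show that only the escaping differs.

```shell
# Made-up sample lines.
words='brave
rave
Red
Reed
Brett
Breet
Brent
pi = 3.141592
n = 42'

# ? makes the leading b optional; the anchors force a whole-line match.
optional=$(printf '%s\n' "$words" | grep -E '^b?rave$')
echo "$optional"       # brave, rave

# Alternation inside a group: Bre, then e or t, then t.
alternation=$(printf '%s\n' "$words" | grep -E 'Bre(e|t)t')
echo "$alternation"    # Brett, Breet (Brent does not match)

# The same interval in BRE (escaped braces) and ERE (plain braces).
bre=$(printf '%s\n' "$words" | grep 'Re\{1,2\}d')
ere=$(printf '%s\n' "$words" | grep -E 'Re{1,2}d')
echo "$ere"            # Red, Reed

# Escaped period: digits, a literal dot, digits. 42 has no dot, so it is skipped.
decimal=$(printf '%s\n' "$words" | grep -E -o '[0-9]+\.[0-9]+')
echo "$decimal"        # 3.141592
```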

Definitions

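This is a subset consisting of the more common definitions. These POSIX character classes work in both BRE and ERE, and they must be placed inside a bracket expression, e.g., `[[:digit:]]+`. For more information see the Wikipedia regex article.

POSIX class	ASCII equivalent	Description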
`[:alnum:]`	`[A-Za-z0-9]`	Alphanumeric characters
`[:alpha:]`	`[A-Za-z]`	Alphabetic characters
`[:blank:]`	`[ \t]`	Space and Tab characters
`[:digit:]`	`[0-9]`	Digits
`[:lower:]`	`[a-z]`	Lowercase letters
`[:space:]`	`[ \t\r\n\v\f]`	Whitespace characters
`[:upper:]`	`[A-Z]`	Uppercase letters
`[:xdigit:]`	`[A-Fa-f0-9]`	Hexadecimal digits
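A short sketch of the character classes in use. The sample line and variable names are made up for illustration. Note the doubled brackets: the outer pair is the bracket expression, the inner pair belongs to the class name, and other characters (here an underscore) can sit next to a class inside the same brackets.

```shell
# Made-up sample line.
line='Sample_42 weighs 3.5 kg'

# Runs of digits.
digits=$(printf '%s\n' "$line" | grep -E -o '[[:digit:]]+')
echo "$digits"     # 42, 3, 5

# Runs of letters.
letters=$(printf '%s\n' "$line" | grep -E -o '[[:alpha:]]+')
echo "$letters"    # Sample, weighs, kg

# A class combined with a literal underscore in one bracket expression.
tokens=$(printf '%s\n' "$line" | grep -E -o '[[:alnum:]_]+')
echo "$tokens"     # Sample_42, weighs, 3, 5, kg
```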

Commands typically used with regex

Command	Description
`grep`	file pattern searcher (globally search for a regular expression and print)
`egrep`	grep with ERE support, same as `grep -E`
`sed`	stream editor, `sed -E` for ERE support
`awk`	pattern-directed scanning and processing language

Since ERE is more convenient, the remaining examples will use the ERE format, i.e., egrep or sed -E. The awk language is very detailed and outside of the scope of this module (for now).

egrep - extended global regular expression print

grep selects the lines of a file that match the given pattern and prints them to stdout. For example, suppose we have the file

$ cat regex.txt
gravity                = 9.81 m/s
gravitational constant = 6.67408E-11 m^3 kg^-1 s^-2
pi                     = 3.141592
atmospheric pressure is 7.60E+2 Torr

If we want to get the lines related to gravity

$ egrep "^grav" regex.txt
gravity                = 9.81 m/s
gravitational constant = 6.67408E-11 m^3 kg^-1 s^-2

But what if we only want the numerical values and not the entire line? Then pipe the output to another egrep command and use -o, which prints only the matched text instead of the whole line.

$ egrep "^grav" regex.txt | egrep -o "[[:digit:]]+\.[[:digit:]]+[eE]?-?[[:digit:]]*"
9.81
6.67408E-11

NB To include a literal - inside [ ], put it at the end (or the beginning) so that it is not read as part of a range like [0-9], e.g., [0-9-] will match any digit and -.
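The hyphen rule can be seen in a small sketch. The date string and the variable names are made up, and the second pattern is a variant of the one used above (an assumption of this sketch, not part of the original example) that allows an exponent sign of either + or - by placing the hyphen last in `[+-]`.

```shell
# Hyphen at the end of the brackets is a literal -, not a range.
dates=$(printf '%s\n' 'run 2024-05-17 step 3' | grep -E -o '[0-9-]+')
echo "$dates"     # 2024-05-17, 3

# Same contents as regex.txt, kept in a variable so the sketch is self-contained.
data='gravity                = 9.81 m/s
gravitational constant = 6.67408E-11 m^3 kg^-1 s^-2
pi                     = 3.141592
atmospheric pressure is 7.60E+2 Torr'

# Digits, a literal dot, digits, then an optional exponent with an optional sign.
values=$(printf '%s\n' "$data" | grep -E -o '[[:digit:]]+\.[[:digit:]]+([eE][+-]?[[:digit:]]+)?')
echo "$values"    # 9.81, 6.67408E-11, 3.141592, 7.60E+2
```

Grouping the exponent as `([eE][+-]?[[:digit:]]+)?` means the E is only taken when digits actually follow it.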

sed - stream editor

Sed is a stream editor. That means it can read text as a stream and perform operations on that stream of text. The most common use case is to replace portions of a line using regex. The substitute command has the syntax

`s/<RE or pattern>/<replacement pattern>/`

This replaces the first matching instance on each line. If a g is appended at the end, such that

`s/<RE or pattern>/<replacement pattern>/g`

it will replace all non-overlapping occurrences on the line.

For example, to replace every equals sign surrounded by spaces or tabs with a single space, a colon and another single space, you can do the following
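A minimal sketch of the difference between the two forms, using a made-up input line and hypothetical variable names. The last command is an extra illustration of the `\n` subexpressions from the metacharacter table, reused in the replacement.

```shell
# Made-up sample line.
line='a=1, b=2, c=3'

# Without g: only the first = on the line is replaced.
first=$(printf '%s\n' "$line" | sed 's/=/ : /')
echo "$first"      # a : 1, b=2, c=3

# With g: every = on the line is replaced.
all=$(printf '%s\n' "$line" | sed 's/=/ : /g')
echo "$all"        # a : 1, b : 2, c : 3

# Non-overlapping: "aaaa" holds two separate "aa" matches, not three.
pairs=$(printf 'aaaa\n' | sed 's/aa/b/g')
echo "$pairs"      # bb

# Groups from the pattern can be reused in the replacement as \1, \2, ...
swapped=$(printf '%s\n' "$line" | sed -E 's/([a-z])=([0-9])/\2=\1/g')
echo "$swapped"    # 1=a, 2=b, 3=c
```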

$ sed -E 's/[[:blank:]]+=[[:blank:]]+/ : /g' regex.txt
gravity : 9.81 m/s
gravitational constant : 6.67408E-11 m^3 kg^-1 s^-2
pi : 3.141592
atmospheric pressure is 7.60E+2 Torr

sed doesn’t overwrite the file. It just prints the result to stdout. You could use > to redirect stdout to a new file

$ sed -E 's/[[:blank:]]+=[[:blank:]]+/ : /g' regex.txt > regex_mod.txt
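To confirm that the input file really is left alone, here is a small sketch that works in a throwaway directory. The directory, the one-line file contents and the variable names are assumptions made for this illustration.

```shell
# Work in a temporary directory so no real files are touched.
workdir=$(mktemp -d)
printf 'pi   =   3.141592\n' > "$workdir/regex.txt"

# Write the edited text to a new file.
sed -E 's/[[:blank:]]+=[[:blank:]]+/ : /g' "$workdir/regex.txt" > "$workdir/regex_mod.txt"

original=$(cat "$workdir/regex.txt")
modified=$(cat "$workdir/regex_mod.txt")
echo "$original"    # pi   =   3.141592
echo "$modified"    # pi : 3.141592

# Clean up the temporary directory.
rm -r "$workdir"
```

Always redirect to a different file name. Redirecting to the input file itself empties it, because the shell truncates the file before sed gets to read it.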

sed can mimic grep by searching for a regex and printing the matching lines to stdout. Use -E for ERE and -n to suppress the default echo of every input line; without -n, sed prints all lines in the file and the matching lines appear twice. egrep is just more convenient.

$ sed -E -n '/^grav/p' regex.txt
gravity                = 9.81 m/s
gravitational constant = 6.67408E-11 m^3 kg^-1 s^-2
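The effect of -n can be checked with a quick sketch. The data is the same as regex.txt but kept in a variable so the snippet is self-contained; the variable names are hypothetical.

```shell
data='gravity                = 9.81 m/s
gravitational constant = 6.67408E-11 m^3 kg^-1 s^-2
pi                     = 3.141592
atmospheric pressure is 7.60E+2 Torr'

# grep -E and sed -E -n with the p command select the same lines.
with_grep=$(printf '%s\n' "$data" | grep -E '^grav')
with_sed=$(printf '%s\n' "$data" | sed -E -n '/^grav/p')

# Without -n every line is echoed once and the two matches are printed again:
# 4 input lines + 2 matching lines = 6 output lines.
without_n=$(printf '%s\n' "$data" | sed -E '/^grav/p' | wc -l | tr -d ' ')
echo "$without_n"    # 6
```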

sed has many features. If you want to know more see the man page for sed.