This module presents pattern searching using regular expressions.
Why does this exist? Ever wanted to search for a phrase, word or pattern without having to spell it out exactly? Regular expressions are your answer! That may not sound very useful at first, but a typical use case in science is that variables and values in a data file are not laid out conveniently for plotting.
There are several variants: Basic (BRE), Extended (ERE) and Perl Compatible Regular Expressions (PCRE). To keep things simple I will go over Basic and Extended. I will leave it up to the reader to learn PCRE; languages such as Perl and Python support PCRE-style syntax, but it extends the framework a lot and there are more in-depth guides on using it.
Metacharacter | Description |
---|---|
`^` | Matches the starting position of a line |
`.` | Matches any single character |
`[ ]` | Bracket expression, matches a single character listed within the brackets |
`[^ ]` | Matches a single character not listed within the brackets |
`$` | Matches the ending position of a line, before the newline character |
`( )` | Defines a subexpression that can be recalled later as the nth subexpression with `\n`. BRE requires `\( \)` |
`\n` | Matches the nth subexpression, where n ranges from 1–9 |
`*` | Matches the preceding element zero or more times |
`{m,n}` | Matches the preceding element at least m and not more than n times. BRE requires `\{m,n\}` |
The backslashes in `\( \)` and `\{ \}` are easy to forget when using BRE.
- `[bch]at` will match bat, cat and hat
- `Re\{1,2\}d` will match Red and Reed
- `^[Pp]arrot` will only match if a line begins with Parrot or parrot
- `treats\{0,1\}$` will only match if a line ends with treat or treats
- `[^b]ar` matches any three-character string ending in ar, excluding bar
- `[0-9]*` will match any run of digits, e.g., 1, 109, 65535 or nothing at all

Typically ERE is more convenient and flexible for regular expression searches and is preferred. In ERE, `( )` and `{ }` act as metacharacters without needing the backslash `\`. If you want to search for a literal parenthesis or brace, escape it with a backslash, such as `\(` or `\}`. This is the opposite of BRE.
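The difference in escaping can be checked directly with `grep` (BRE by default) and `grep -E` (ERE). The following is a minimal sketch; the word list in `words` is made up for illustration and is not part of the original examples.

```shell
# Hypothetical word list, including one line containing literal braces
words='Rd
Red
Reed
Reeed
Re{1,2}d'

# BRE: the interval needs backslashes
bre=$(printf '%s\n' "$words" | grep 'Re\{1,2\}d')
# ERE: the interval is written without backslashes
ere=$(printf '%s\n' "$words" | grep -E 'Re{1,2}d')

# Swapping the escaping searches for literal braces instead
literal_bre=$(printf '%s\n' "$words" | grep 'Re{1,2}d')
literal_ere=$(printf '%s\n' "$words" | grep -E 'Re\{1,2\}d')

printf 'BRE interval:\n%s\n' "$bre"
printf 'ERE interval:\n%s\n' "$ere"
printf 'BRE literal: %s\n' "$literal_bre"
printf 'ERE literal: %s\n' "$literal_ere"
```

Both interval searches should print Red and Reed, while both literal searches should print only the line that contains the braces.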
Metacharacter | Description |
---|---|
`?` | Matches the preceding element zero or one time, same as `{0,1}` |
`+` | Matches the preceding element one or more times, same as `{1,}` |
`\|` | Matches either the expression before it or the expression after it, e.g. `cat\|dog` matches cat or dog |
Escape the `+`, `?` and `|` characters with a backslash to search for them literally.
- `[0-9]+\.[0-9]+` will match any float/double number written in decimal notation, e.g., 1.0, 3.141592 or 101.325 (NB the period is escaped to be searchable as a character)
- `b?rave` will match brave and rave
- `Re{1,2}d` will match Red and Reed
- `Bre(e|t)t` will match Brett and Breet

This is a subset consisting of the more common definitions. For more information see the Wikipedia regex article.
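The ERE patterns above can be tried out with `grep -E`. This is a small sketch under the assumption that the sample lines in `samples` (invented here) stand in for real data; `-o` prints only the matched text.

```shell
# Hypothetical sample lines for trying the ERE patterns
samples='brave
rave
Brett
Breet
Brent
value 101.325 kPa'

# Optional leading b; anchored so the whole line must match
optional=$(printf '%s\n' "$samples" | grep -E '^b?rave$')
# Alternation inside a group
either=$(printf '%s\n' "$samples" | grep -E 'Bre(e|t)t')
# One or more digits, a literal period, one or more digits
decimal=$(printf '%s\n' "$samples" | grep -E -o '[0-9]+\.[0-9]+')

printf '%s\n' "$optional" "$either" "$decimal"
```

Brent is not matched by `Bre(e|t)t`, because the character after Bre is neither e nor t.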
POSIX class | ASCII equivalent | Description |
---|---|---|
`[:alnum:]` | `[A-Za-z0-9]` | Alphanumeric characters |
`[:alpha:]` | `[A-Za-z]` | Alphabetic characters |
`[:blank:]` | `[ \t]` | Space and Tab characters |
`[:digit:]` | `[0-9]` | Digits |
`[:lower:]` | `[a-z]` | Lowercase letters |
`[:space:]` | `[ \t\r\n\v\f]` | Whitespace characters |
`[:upper:]` | `[A-Z]` | Uppercase letters |
`[:xdigit:]` | `[A-Fa-f0-9]` | Hexadecimal digits |
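The bracketed class names in the table above are meant to be used inside a bracket expression, so a digit is written `[[:digit:]]` rather than `[:digit:]` on its own. A minimal sketch, with an invented input string:

```shell
# Hypothetical input string
text='Run 42 took 3 hrs'

# Runs of digits: note the double brackets
digits=$(printf '%s\n' "$text" | grep -E -o '[[:digit:]]+')
# An uppercase letter followed by one or more lowercase letters
capitalised=$(printf '%s\n' "$text" | grep -E -o '[[:upper:]][[:lower:]]+')

printf '%s\n' "$digits" "$capitalised"
```

Classes can be combined with other characters in the same brackets, e.g. `[[:digit:].]` for a digit or a period.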
Command | Description |
---|---|
`grep` | file pattern searcher (global regular expression print) |
`egrep` | `grep` with ERE support, same as `grep -E` (the form newer GNU grep releases recommend) |
`sed` | stream editor, `sed -E` for ERE support |
`awk` | pattern-directed scanning and processing language |
Since ERE is more convenient, the remaining examples will use the ERE format, i.e., `egrep` or `sed -E`. The `awk` language is very detailed and outside of the scope of this module (for now).
You can use `grep` to find the lines in a file that match a pattern and print them to stdout. For example, suppose we have the file
$ cat regex.txt
gravity = 9.81 m/s
gravitational constant = 6.67408E-11 m^3 kg^-1 s^-2
pi = 3.141592
atmospheric pressure is 7.60E+2 Torr
If we want to get the lines related to gravity
$ egrep "^grav" regex.txt
gravity = 9.81 m/s
gravitational constant = 6.67408E-11 m^3 kg^-1 s^-2
But what if we only want the numerical values and not the entire line? Then pipe the output to another `egrep` command and use `-o`, which displays only the matched text
$ egrep "^grav" regex.txt | egrep -o "[[:digit:]]+\.[[:digit:]]+[eE]?-?[[:digit:]]*"
9.81
6.67408E-11
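Note that the pattern above only allows a minus sign in the exponent, so for a value such as 7.60E+2 it would stop at 7.60E. The sketch below is one possible generalisation, not the author's pattern: it makes the whole exponent optional as a group and accepts either sign. The variable names are made up, and the data is copied from regex.txt.

```shell
# Same content as regex.txt in the text
data='gravity = 9.81 m/s
gravitational constant = 6.67408E-11 m^3 kg^-1 s^-2
pi = 3.141592
atmospheric pressure is 7.60E+2 Torr'

# Digits, a period, digits, then an optional exponent with an optional sign
number_re='[[:digit:]]+\.[[:digit:]]+([eE][+-]?[[:digit:]]+)?'

values=$(printf '%s\n' "$data" | grep -E -o "$number_re")
printf '%s\n' "$values"
```

Grouping the exponent means an `E` is only consumed when digits follow it, so plain decimals such as 9.81 still match unchanged.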
NB When you want to search for a literal `-` within `[ ]`, put it at the end so that it is not read as part of a range like `[0-9]`; e.g., `[0-9-]` will match any digit and `-`.
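As a quick check of the hyphen placement, the following sketch extracts runs of digits and hyphens from an invented line of text:

```shell
# Hypothetical input containing hyphenated numbers
text='date 2024-01-15 ref 555-0100'

# The trailing - inside the brackets is a literal hyphen, not a range
tokens=$(printf '%s\n' "$text" | grep -E -o '[0-9-]+')
printf '%s\n' "$tokens"
```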
Sed is a stream editor. That means it can read text as a stream and perform operations on that stream of text. The most common use case is replacing the portions of a line that match a regex. The substitution command has the syntax
`s/<RE or pattern>/<replacement pattern>/`
This replaces the first match on each line. If a `g` is appended at the end, such that
`s/<RE or pattern>/<replacement pattern>/g`
it will replace all non-overlapping occurrences on the line.
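The effect of the trailing `g` is easiest to see side by side. A minimal sketch on an invented line with several equals signs:

```shell
# Hypothetical input with three equals signs on one line
line='a=1, b=2, c=3'

# Without g: only the first match on the line is replaced
first=$(printf '%s\n' "$line" | sed -E 's/=/ : /')
# With g: every non-overlapping match on the line is replaced
all=$(printf '%s\n' "$line" | sed -E 's/=/ : /g')

printf '%s\n' "$first" "$all"
```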
For example, to replace every equals sign surrounded by spaces or tabs with a single space, a colon and another space, you can do the following
$ sed -E 's/[[:blank:]]+=[[:blank:]]+/ : /g' regex.txt
gravity : 9.81 m/s
gravitational constant : 6.67408E-11 m^3 kg^-1 s^-2
pi : 3.141592
atmospheric pressure is 7.60E+2 Torr
`sed` doesn't overwrite the file. It just prints the result to stdout. You could use `>` to redirect stdout to a new file
$ sed -E 's/[[:blank:]]+=[[:blank:]]+/ : /g' regex.txt > regex_mod.txt
`sed` can mimic `grep` behaviour to search for a regex and print the matching lines to stdout. Use `-E` for ERE and `-n` to suppress the automatic echoing of input to stdout; without `-n`, every line in the file is printed and the matching lines appear twice. `egrep` is just more convenient.
$ sed -E -n '/^grav/p' regex.txt
gravity = 9.81 m/s
gravitational constant = 6.67408E-11 m^3 kg^-1 s^-2
`sed` has many features. If you want to know more, see the man page for `sed`.
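One such feature ties back to the subexpression entries in the metacharacter table: groups captured with `( )` can be recalled in the replacement as `\1`, `\2` and so on. The sketch below swaps a name and a value; the input line and variable names are invented for illustration.

```shell
# Hypothetical input in the same style as regex.txt
line='gravity = 9.81'

# \1 recalls the name, \2 recalls the number; the replacement swaps them
swapped=$(printf '%s\n' "$line" | sed -E 's/^([[:alpha:]]+) = ([0-9.]+)$/\2 = \1/')
printf '%s\n' "$swapped"
```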
© 2017–2024 David Kalliecharan