Introduction to Regular Expressions

This is a brief introduction to the use of regular expressions, written by Edwin Brady.



Introduction

What's a regular expression, and why are regular expressions so useful? A regular expression (often abbreviated to regex to save typing) is, according to the grep manual page, "A pattern which describes a set of strings". This document attempts to explain such patterns: how to construct them and how to use them to make your life easier. First the basic building blocks are described, along with some common uses in Unix software tools, then more advanced usage in programming languages.

Basic Regular Expressions

The basic building blocks of regular expressions are those which match a single character. Nearly every character (apart from the few with special meanings, of which more later) is a regular expression which matches itself. Characters with special meaning (metacharacters) will match themselves if preceded by a backslash.

Before going any further, perhaps I should explain what I mean when I say 'matching.' In typical use, a regular expression is matched against an input string. If any part of the input string (not necessarily the whole string) matches the regular expression, then we have a successful match.
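As a rough illustration of both points, here is a small sketch using grep from the shell. The test strings other than the Harvey one are made up for the example, and it assumes that . is one of the metacharacters (it is in grep).

```shell
# The sample line used in the examples section of this document.
line='Harvey the wonder Hamster'

# "wonder" only appears in the middle of the line, but a partial match is enough.
if printf '%s\n' "$line" | grep -q 'wonder'; then
  partial='match'
else
  partial='no match'
fi
echo "$partial"

# Unescaped, . matches any single character, so both test lines are counted.
any_char=$(printf '%s\n' 'a.c' 'abc' | grep -c 'a.c')
echo "$any_char"

# Preceded by a backslash, . only matches a literal full stop.
literal_dot=$(printf '%s\n' 'a.c' 'abc' | grep 'a\.c')
echo "$literal_dot"
```

The -q option just suppresses output, so the exit status of grep tells us whether there was a match.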

In a basic regular expression (as understood by the grep command) there are very few metacharacters. These are:

There are also some predefined named classes of characters which can be used inside square brackets, so that, for example, [[:digit:]] matches any single digit. These are [:alpha:], [:alnum:], [:cntrl:] (control characters), [:digit:], [:graph:], [:lower:], [:print:] (printable characters), [:punct:], [:space:], [:upper:] and [:xdigit:] (hex digit).

Furthermore, you can specify a range of ASCII characters inside square brackets. For example [A-Z] will match any upper case letter, while [^A-Z] will match anything but an upper case letter.
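A quick sketch of ranges and predefined classes in action, again with grep. The input lines are invented for the example, and the locale is forced to C so that [A-Z] really does mean the ASCII range.

```shell
# Force the C locale so that ranges such as [A-Z] mean plain ASCII ranges.
export LC_ALL=C

# grep -o prints each matched part on its own line; tr joins them back together.
upper=$(printf '%s\n' 'Harvey the wonder Hamster' | grep -o '[A-Z]' | tr -d '\n')
echo "$upper"          # HH

# A predefined class goes inside the brackets: lines with at least one digit.
with_digit=$(printf '%s\n' 'room 101' 'no numbers here' | grep '[[:digit:]]')
echo "$with_digit"     # room 101

# [^A-Z] needs at least one character which is not an upper case letter.
not_all_upper=$(printf '%s\n' 'ABC' 'ABc' | grep '[^A-Z]')
echo "$not_all_upper"  # ABc
```

Note that -o is not required by POSIX, although the common grep implementations provide it.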

Examples

Consider the input string "Harvey the wonder Hamster". The following regular expressions would match successfully:

The following, however, wouldn't:

Extended Regular Expressions

You can have much more fun with extended regular expressions, though, there being more metacharacters to play with. There are the following repetition operators (in addition to the * described above):

Also, the | character can be used to join regular expressions. This character means 'OR': either of the expressions on either side of the operator can match.

There are precedence rules for regular expressions, just like arithmetic expressions. Repetition takes precedence over concatenation, which takes precedence over alternation (ORing). These can be overridden with parentheses, just like arithmetic expressions.
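The precedence rules are easiest to see by experiment. The following sketch uses grep -E (extended regular expressions) with -x, which insists that the whole line matches; the word list is made up for the example.

```shell
# Hypothetical test words, one per line.
words='ab
abb
abab
cat
dog
catdog'

# Repetition binds tighter than concatenation: ab+ means a(b+), not (ab)+.
rep=$(printf '%s\n' "$words" | grep -Ex 'ab+' | tr '\n' ' ')
echo "$rep"

# Parentheses override this, so the pair "ab" is repeated as a unit.
grp=$(printf '%s\n' "$words" | grep -Ex '(ab)+' | tr '\n' ' ')
echo "$grp"

# Alternation binds loosest: cat|dog means (cat)|(dog), not ca(t|d)og.
alt=$(printf '%s\n' "$words" | grep -Ex 'cat|dog' | tr '\n' ' ')
echo "$alt"
```

Here ab+ picks out "ab" and "abb", (ab)+ picks out "ab" and "abab", and cat|dog picks out "cat" and "dog" but not "catdog".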

The grep command

The grep command searches files for lines matching a regular expression. By default, the output is every line in the file which matches the given expression. There are three versions of grep, for different scarinesses of regular expressions: fixed strings, basic regular expressions and extended regular expressions (as defined earlier). These are invoked by the commands fgrep, grep and egrep respectively (or, equivalently, by grep -F, grep and grep -E). Some of the more interesting options you can give to grep are:

For more details on usage of grep consult the man pages - they are very comprehensive!
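To give a flavour of grep in use, here is a minimal sketch with a handful of standard options (-c, -i, -v and -n). The file contents are invented sample data in a name:college format, and the temporary file is only there to make the example self-contained.

```shell
# Hypothetical sample data, written to a temporary file.
tmp=$(mktemp)
printf '%s\n' 'Alice Smith:Grey' 'Bob Jones:University' 'Carol White:grey' > "$tmp"

count=$(grep -c 'Grey' "$tmp")           # -c: count the matching lines
count_i=$(grep -c -i 'grey' "$tmp")      # -i: ignore case while matching
others=$(grep -v -i 'grey' "$tmp")       # -v: print the lines which do NOT match
numbered=$(grep -n 'University' "$tmp")  # -n: prefix each line with its line number

echo "$count"     # 1
echo "$count_i"   # 2
echo "$others"    # Bob Jones:University
echo "$numbered"  # 2:Bob Jones:University

rm -f "$tmp"
```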

The sed command

sed is a stream editor. This means that it performs transformations on the input stream and sends it to the output stream according to commands which are given either on the command line or in a script. A typical use would be to perform search and replace operations on a file.

Basic usage of sed is achieved using a command of one of the following forms:

sed command file > newfile
sed -f script-file file > newfile
sed -e command file > newfile
cat file | sed -e command > newfile
...
and so on. These will basically perform the transformations described either on the command line (with the -e option, or just the command) or in a script file (with the -f option). sed actually has a large number of clever features including branching and block structure which make it almost a programming language. These features are beyond the scope of this document, but some of the basic actions which you can perform on the input stream are as follows:

That's all I have to say about that :). Again, for fuller details consult the sed man page.
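For the impatient, here is a small sketch of three standard sed commands at work: s (substitute), y (translate characters) and d (delete lines). The input text is made up for the example.

```shell
input='the cat sat on the mat'

# s/regex/replacement/ replaces only the first match on each line...
first=$(printf '%s\n' "$input" | sed -e 's/the/a/')
echo "$first"   # a cat sat on the mat

# ...unless the g flag is given, in which case every match is replaced.
all=$(printf '%s\n' "$input" | sed -e 's/the/a/g')
echo "$all"     # a cat sat on a mat

# y/source/destination/ maps each source character to the one in the same position.
upper=$(printf '%s\n' "$input" | sed -e 'y/abc/ABC/')
echo "$upper"   # the CAt sAt on the mAt

# /regex/d deletes every line which matches the regex.
kept=$(printf '%s\n' 'keep me' 'drop me' | sed -e '/drop/d')
echo "$kept"    # keep me
```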

Perl Regular Expressions

The real power of perl lies in its regular expression handling functions and operators. Perl regular expressions are in some ways slightly different from those described above, but the basic principles are the same. In particular, parentheses achieve slightly more than they do in a standard regular expression, and the predefined classes of names don't achieve anything useful. Basic knowledge of perl is assumed here!

Pattern Matching

Pattern matching is achieved using the m// operator. The syntax for this operator is variable =~ m/regex/options. The options which can be given include, among other things, case insensitive matching and global searching. More details of these later. The following code gives an example use of this operator:

if ($thing=~m/[0-9]+/) {
  print "Number!";
}

So if anything in the variable $thing has a sequence of 1 or more digits, we'll tell the user that we've spotted a number.

Substitution

Substitution is achieved through the use of the s/// operator. This substitutes whatever matches a regular expression with a new string, and the syntax is variable =~ s/regex/replacement/options. More on the options later. Here's an example usage:

$input=<STDIN>;  # read one line from standard input
$input=~s/[0-9]+/some number/;
print $input;

This, of course, will replace the first occurrence of a number in the input string with the string "some number". Useful, eh? Notice I said, "the first occurrence." To replace every occurrence, we must use the g option, which means global replace. So we would say $input=~s/[0-9]+/some number/g;.

More on options

The most useful options you can give to the s/// and m// operators are as follows:

Translation

Translation is achieved by the tr/// operator. The syntax, which should look strangely familiar by now, is variable =~ tr/source/destination/options. You can also use the y/// operator to do the same thing, which should be a clue that this behaves the same way as the sed command of the same name. There are three options which can be given to the translation operator:

The clever bit

Wouldn't it be useful if we could remember bits of the patterns we've just matched? Or if a pattern consists of a number of distinct sections (such as a URL) and we want to remember what each section is? Indeed it would, and perl can do it for us...

We can do this by grouping sections of regular expressions in parentheses (). Then we can refer back to these sections by using the predefined variables $1, $2, $3 etc.

An example may help here... Suppose we need to parse a file where each line is in the format name:college. We can use the match operator to get the name and college out of each line, like so:

while(<FILE>) {
  if (m/([A-Za-z ]*):([A-Za-z ]*)/) {
    print "Name: $1\nCollege: $2\n";
  }
}

Note that the m// operator is acting on the default input and pattern searching variable $_ here.

This also works with the substitution operator, and indeed any function which takes regular expressions as arguments. For example, to swap two words around:

while(<FILE>) {
  s/([A-Za-z]+) +([A-Za-z]+)/$2 $1/g;
  print;
}

The g option, of course, means that this swap will work for every pair of words in the input string (in this case read from the FILE filehandle). If you're not convinced, try it! :)

The split function

This is an example of a function which takes a regular expression as an argument. There are more of these - see the perl manual pages for more details. split splits the given string into an array of strings, using the given regex as the delimiter. The syntax is split /regex/,string. A typical use may be to get all the fields from an input file in CSV (comma separated values) format. For example, split /, */,"egg, beans, spam,sausage"; would return the array ("egg","beans","spam","sausage").

If this example looks strangely similar to the use of the m// operator above, then that's because it is. Remember that the philosophy behind perl is TMTOWTDI (there's more than one way to do it).

Python Regular Expressions

I may fill this section in at some point soon, you never know...

Tyrannosaurus Regex

Here's a fun regular expression, which may look a bit like line noise:

([a-z]+)\:\/\/([^\/]*)\/(.*)\/?

Taking each bracketed bit in turn may help work out what we're matching here:

  1. ([a-z]+) - this, trivially, will match any word in lower case letters.
  2. ([^\/]*) - will match any sequence of characters not containing a forward slash.
  3. (.*) - will match the rest of the string.

So what we're matching, put more simply, is (word)://(up to next forward slash)/(the rest) plus possibly a trailing forward slash. It is now, possibly, clear that we're looking at a URL here, where $1 will give us the protocol, $2 the machine name, and $3 the path to the file. So it will match strings such as:

Er, naturally, this doesn't perform any error checking on the URL - doing so is left as an exercise for the reader!

Exercises

  1. Find out how many people at the university are at Grey College, using grep. You can find a file containing useful data for this at /usr/local/mail/names/current/extra.list.
  2. Use sed to change this file to exclude anyone from University College.
  3. Use sed to ROT13 an input file. (ie A becomes N, B becomes O, etc).
  4. Write a perl script to do all of the above.
  5. Write a perl script to strip HTML tags from a file.
  6. Write a perl script to print every HTML tag in a file.

ecb@lbft.demon.co.uk
7th November 1999

Last edit: Wed 11th May, 12:58 a.m.
