Regular Expressions

Pierre Rioux

ACE lab Developer Meeting
November 2017, January 2018

"Les regex sont vos amis", hein Nic?

Table of content

  • About strings
  • Brief history of regular expressions
  • How they are used
  • Basic matching
  • Special characters
  • Grouping and capturing
  • Examples

About strings

As programmers, we all use them.

  "pierre.rioux@mcgill.ca"             "It was the best of times, it was the worst
                                        of times, it was the time of the ADM."
  "Restaurant L'Hémorragie"

  "y\n"          "699.99"              "/data/assembly/1234/V2/new.mnc"        "0"

  "2017-08-23 12:11:40 CBRAIN ERROR: MySql server has gone away"

  "922734,V6,04/03/98,'subject was drunk',/data/store/922734/rest_1b.mnc.gz"

  "AACCCTAACCCTAACCCTAACCCTAAGTTAGGATC...GATTATTATAATCGGATAGCTTTTAGGATATAGGGTATAG"

  • Simple, short, with specific meaning
  • Arbitrary prose or text
  • Multi line or single line
  • Structured or not
  • Really large data

What we do with them

  • Validate them:
    "pierre.rioux@mcgill.ca" (legal email address)
    "pierre@rioux@mcgill.ca" (improper email address)

  • Check properties:
    "2017-08-30" (date starts with a 4 digit year)
    "novel.txt" (filename ends with a known extension)
    "/data/cbrain/error.log" (seems to be a pathname)

  • Extract information:
    "2017-08-30 13:45 Added new user prioux"
    Date = "2017-08-30"
    Time = "13:45"
    Action = "Added new user"
    Argument = "prioux"

The substring libraries

Most programming languages have libraries that
provide simple substring operations.

  • Scan for fixed strings
  • Take and return string positions
  • Must build your own code to handle complex cases
  • Code becomes brittle and breaks on unexpected variations
  • Good enough for simple properties
                   "2017-08-20 13:45 Added new user prioux"
                   "2017-08-21  9:23 Removed user 'prioux'"

scanf()

Old solution, provided by the C library
  int year, month, day;
  char *action = "xxxxxxxxxxx";

  sscanf("Log: 2017-04-24 Added new user prioux",
         "Log: %d-%d-%d %s", &year, &month, &day, action);
This solution still suffers from brittleness. You are also limited to just the set of %X characters sequences that scanf() supports.

There is little need to use scanf() in modern times.

Quick history

  • Regular expressions come from Formal Language theory
  • They basically encode a finite automaton
  • Kleene, 1950s
  • Later, they started being used as programming tools
  • Basic support in early UNIX command (ed, sed, grep...)

From C to Perl and beyond

Henry Spencer's regex library

  • Released in the late 80s
  • Became quite popular, used in other projects
  • As C programmers, we must #include and ld...

Larry Wall's Perl interpreter

  • Larry took Henry's library and 'butchered' it
  • Perl interpreter came with the regex lib inside it
  • The Perl language has built-in operators to use it

Now they're everywhere

  • Perl and TCL started it
  • Newer languages included regexes as a basic type
  • Ruby, Python, PHP, Javascript...
  • In C, you still need an external library, like PCRE

How they are used



Note: called matching...

My first regex

/abc/

Some observations:

  • We could see this in a program
  • The regex is three characters long: abc
  • The / are delimiters, not part of the regex
  • The delimiters can be different, and vary by language

Some examples

  • Perl:
    $mystring = "I am the eggman";
    $yesno    = $mystring =~ /abc/;
    
  • Ruby:
    mystring = "I am the eggman"
    yesno    = mystring =~ /abc/
    yesno    = mystring.match(/abc/)
    yesno    = /abc/.match(mystring)
    
  • PHP:
    $mystring = "I am the walrus";
    $yesno    = preg_match('/abc/', $mystring);
    
  • Javascript:
    var mystring = "Goo goo g'joob";
    var yesno    = /abc/.test(string);
    

What does /abc/ even mean?

This is a simple substring match. The regexp
will search a string and return true if:

  • The string contains the exact substring "abc"
  • The substring can be anywhere
  • The letters MUST be lowercase

Match examples in yellow:

  "abc are three letters"

  "I learned my abc when I was three"

❌  "Buy three CDs at ABC Music!"

Basics 1

Each character in a regex is important.

Some are just meant to match themselves:

  • Letters of the alphabet: a, b, c, ..., A, B, C, ... Z
  • The digits: 0, 1, 2, 3, ..., 9
  • The blank characters (space, tab etc)
  • A few special characters: =, @, #, !, -

These two regex match their exact string inside:

/I am clown@circus/

/H2Y 4A2/

(This is basically like a substring search)

Basics 2

Most special characters and punctuation have special meanings.

  • The . (dot, period) match ANY character except the newline
  • The [ and ] are used for character ranges
  • The * is a repeat indicator
  • The ? is a optional indicator
  • The \ (backslash) introduces special matching rules

The next few slides will give examples for each of these cases.

The . in a regex

/st..k/ will match any these 5 letter strings:

stack stalk stank stark steak stick
stink stock stork stuck stunk stEAk
st!2k st87k st__k st[]k st--k stp\k

And also:

"tallest skyscraper"

The [ ] square brackets

Between the [ ] are the characters allowed at that position.

[ac/dc]       Matches the character "a", "c", "/" or "d"
[0123459876]  Matches a single digit
[0-9]         Matches a single digit
[dr.]         Matches "d", "r" or "."
[a-z]         All lowercase letters
[a-zA-Z]      All letters
[a-z-]        All lowercase letters and "-" too
  • The . just means the period character
  • The - can be used to represent a range
  • Allowed characters can be listed in any order
  • Important: whatever's inside will always match just one character in the string

Postal Code:
/[ABCEGHJKLMNPRSTVXY][0-9][A-Z] [0-9][A-Z][0-9]/

The ^ in square brackets

If a [ ] range starts with the ^ character,
it means all characters not in the range.

[^XYZ]      Any character other than "X", "Y" or "Z"
[^a-z]      One that is not a lowercase letter
[^.]        Not a "."
[^0-9]      Not a digit

The ? means zero or once

The ? character applies to whatever precedes it.

It allows the regex to:

  • match it 0 times (make it optional)
  • match it 1 time (like if the ? wasn't there)

/Pierr?e Rioux?/

 "Piere Riou"
 "Pierre Rioux"
 "Pierre Riou"
 "Piee Rioux"

The ? means zero or once

/X-[A-Z]?[a-z0-9]?-X/

 "X--X"
 "X-G-X
 "X-9-X"
 "X-Pa-X"
 "X-P8-X"
 "X-PA-X"
 "X-4Z-X"

The * means zero or more

The * character applies to whatever precedes it.

It allows the regex to:

  • match it 0 times (make it optional)
  • match it 1 time (like if the * wasn't there)
  • match two or more times

/Pierr*e Rioux*/

 "Piere Riou"
 "Pierre Rioux"
 "Pierrrrrrrrre Riouxxxxxxxx"
 "Piee Riouxxxxxxxx"

The * means zero or more

/X-[A-Z]*[a-z0-9]*-X/

 "X--X"
 "X-G-X
 "X-9-X"
 "X-Pa-X"
 "X-P8-X"
 "X-COOKIE-X"
 "X-m0nst3r-X"
 "X-COOKIEm0nst3r-X"
 "X-CookieMonst3r-X"

The + means one or more

The + character applies to whatever precedes it.

/[0-9]+_V[0-9]+/

 "0_V0"
 "528222_V12"
 "_V7165"
 "622122_V"

Generally, /x+/ is the same as /xx*/ for all x.

The {} mean repeat counts

The {} characters can be used
to specify how many times something must match.

Regex Meaning
/[0-9]{3}/ Three digits. Same as /[0-9][0-9][0-9]/
/[0-9]{3,}/ At least 3 digits
/[0-9]{,7}/ Up to 7 digits (including 0)
/[0-9]{3,7}/ Between 3 to 7 digits

The \ backslash character

In front of special characters, it just makes them less special.

/abc\*def\\xyz\.txt/

This just matches the string 'abc*def\xyz.txt' (15 characters)

The \ backslash character

In front of some letters, it allows
expressing unprintable or blank characters.

Unprintable character
\n newline (ASCII 10)
\t tab (ASCII 9)
\r carriage return (ASCII 13)
\e escape (ASCII 27)
... there are many others...

The \ backslash character

In front of some other letters, it
matches useful character sets or ranges.

Set of characters
\s Any whitespace: space, tab, newline, etc
\d Any digit. Similar to [0-9]
\w Any character used in an identifier: [a-zA-Z0-9_]
\S The opposite of \s
\D The opposite of \d
\W The opposite of \w

Anchoring

The ^ and $ characters forces the matching to happen, respectively, at the beginning or end of the string.

/\w+@\w+\.ca/

 "pierre@circus.ca"
 "My email is pierre@circus.ca, thanks"

/^\w+@\w+\.ca$/

 "pierre@circus.ca"
 "My email is pierre@circus.ca, thanks"

Anchoring

Within a multi-line string:

  • ^ matches the beginning of any line; it is a zero-width match.
  • $ matches the end of any line, it is a zero-width match just before the newline character.

   mystring = "amaryjo\nmary\nryjo\n"

 mystring =~ /ary$/    # "amaryjo\nmary\nryjo\n"
 mystring =~ /^mar/    # "amaryjo\nmary\nryjo\n"
 mystring =~ /^ryjo$/  # "amaryjo\nmary\nryjo\n"

Anchoring a regex can significantly improve performance when scanning large strings, or with complex regular expressions.

The | vertical bar

Within a regex, the | symbol means to try to match what's on the left or what's on the right; each side is a full regex.

/taxi$|tra?in|h.ver.raft/

 "My hovercraft is full of eels"
 "Planes, trains, and automobiles"
 "I took a taxi and paid $20"

Grouping with ( and )

The parentheses create a group;
the group can be modified with ?, * etc.

/(Malkovitch)+/

 "My name is Malkovitch, indeed."
 "Being John MalkovitchMalkovitchMalkovitchMalkovitch."

Grouping is useful with |

/a (robot|(human)+|vegetable)? truly/

 "I am a robot truly"
 "I am a vegetable truly"
 "I am a humanhuman truly"
 "I am a  truly"
 "I am a robothumanvegetable truly"

Each regex within a group is independent

/(a[0-9]z)+/

 "a0za0za0za0za0z"
 "a0za1za9za3za7za4z"

The + applies to the regular expression, not the matched string.

Capturing with ()

Capturing with ()

These are the string values for six task status in CBRAIN:
status = "Restarting Setup"
status = "Restarting Cluster"
status = "Restarting PostProcessing"
status = "Recovering Setup"
status = "Recovering Cluster"
status = "Recovering PostProcessing"
matchinfo = status.match /(Recovering|Restarting) (\S+)/
operation = matchinfo[0]
stage     = matchinfo[1]

When matching using Ruby, we get the first word into variable operation and the second word into variable stage.

Capturing with ()

Perl uses global variables $1, $2, $3, etc

my $status    = "Restarting PostProcessing"
$status       =~ /(Recovering|Restarting) (\S+)/;
my $operation = $1;
my $stage     = $2;

Backreferences:
\1 match the first capture

The special \1, \2 etc refer to the values
captured by the first, second etc parenthesis.

/([a-z]+)-\d+-([a-z]+)/

 "Product code abc-123-def is sold out"
 "Product code abc-123-abc is sold out"

/([a-z]+)-\d+-\1/

 "Product code abc-123-def is sold out"
 "Product code abc-123-abc is sold out"

By default, regex are greedy

/[a-z]+x/

 "0123abcdxyzabcdxyzabcdxyz0123"
 "0123abcdxyzabcdxyzabcdxyz0123"

/[0-9]+$/

 "Beverly Hills 90210"
 "Beverly Hills 90210"

Some engines support non-greedy matching with ?

/[a-z]+?x/

 "0123abcdxyzabcdxyzabcdxyz0123"
 "0123abcdxyzabcdxyzabcdxyz0123"

Modifiers: for the regex engine and parser

Modifier Effect Example
i Regex is case-insensitive
"H2y" =~ /[a-z][0-9][a-z]/i
g Regex match globally, in turn from position of last match
# This is a Perl example
while ("one two three" =~ /(\w+)/gi) {
  print "$1\n"; # will execute 3 times
}
x Allow spaces and comments for lisibility
email =~ /
     ([a-z0-9\.\-]+)  # username
     @
     ([a-z0-9\.\-]+)  # hostname
     /x

Example of x in CBRAIN

  # Returns true if the given +userfile+ is named according
  # to the LORIS CCNA Phantom convention for integration with CBRAIN,
  # which means we can find a subject ID and a visit ID.
  #
  # Example: 'ibis_417879_Initial_MRI_t1w_001.mnc'
  #          'ibis_123456_lego_phantom_SD_ABCD_20001212_12CH_t1w_0183.mnc
  #
  # The visit ID can be a complex string with underscores in it.
  # See the regex in the code.
  #
  # Returns the subject ID and the visit ID if the convention
  # is respected; for the example, the "417879" and "Initial_MRI" parts.
  def named_according_to_LORIS_convention?(userfile)
    if userfile.name =~ /\A
                             [a-zA-Z0-9]+            # arbitrary prefix
                             _(\d\d\d\d\d\d)         # $1 Subject ID
                             _(Initial_MRI|          # $2 Visit, one of "Initial_MRI" or ...
                               (lego|human)
                               _phantom
                               _(L1|SD)
                               _([A-Z]{3,4})         # three or four uppercase letters
                               _\d\d\d\d \d\d \d\d   # year month day
                               (_(12CH|32CH))?       # optional suffix of Visit
                              )
                        /x
      [ Regexp.last_match[1], Regexp.last_match[2] ]
    end
  end

Regex engines differ

Software Differences Notes
Perl has everything!
Ruby mostly like Perl
PHP mostly like Perl
Javascript more limited no comments with /x
grep, sed, etc Basic: No +, |, \w, \s, \d, etc Use grep -P to get Perl's engine; modifiers are options.
egrep is grep, but adds |

Note: Wikipedia has some nice, dedailed comparison tables.

Performance and
catastrophic backtracking

/(x+x+)+Y/

"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxY"

"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

Ref: https://www.regular-expressions.info/catastrophic.html

Security: perl embedded code!

/abc(?{ print "hello" })def/

Example 1: /st..k/

How did I find the words in a previous slide?

I have a file at home called 's'. Its content:

s
sa
sabbath
sabbatical
Sabina
Sabine
sable
sabotage
sabra
sac
saccharine
sachem
Sachs
sack
(about 2600 more lines omitted)

Example 1: /st..k/

Here's part of my bash bash history:
   62  Nov-15 23:00:19 grep 's....' s
   63  Nov-15 23:00:24 grep 's....$' s
   64  Nov-15 23:00:31 grep '^s....$' s
   65  Nov-15 23:00:46 grep '^s....$' s | cut -c2
   66  Nov-15 23:00:53 grep '^s....$' s | cut -c2 | uniq -c
   67  Nov-15 23:01:10 grep '^s....$' s | grep '^st'
   68  Nov-15 23:01:56 grep '^s....$' s | grep '^st..k'

  • Command 62: too many words
  • Command 63: some extra, like suspect
  • Command 64: ok, 364 results
  • Command 65: extract second letter
  • Command 66: count second letters; t=71
  • Command 67: have a quick look
  • Command 68: choose the 11 that end in k

Example 2: CBRAIN code scan

Changes in the Pathname Ruby library forced the
CBRAIN team to review all calls to these four methods:

.present?()
.presence()
.blank?()
.empty?()

The details why are not important here,
suffice to say it was for the R5 release.

Example 2: CBRAIN code scan

I ran this bash pipe in the CBRAIN source directory:
find . -type f -name "*.rb" -print | \
xargs grep -B3 -A3 -E '\.empty\?|\.present\?|\.presence|\.blank\?' | \
highlight -l -r '\.empty\?|\.present\?|\.presence|\.blank\?' yellow | \
less

This created a report of about 9000 lines:

--
./BrainPortal/app/models/cluster_task.rb-  # to a location different than the original task directory.
./BrainPortal/app/models/cluster_task.rb-  def make_available(userfile, file_path, userfile_sub_path = nil, start_dir = nil)
./BrainPortal/app/models/cluster_task.rb-    cb_error "File path argument must be relative" if
./BrainPortal/app/models/cluster_task.rb:      file_path.to_s.blank? || file_path.to_s =~ /\A\//
./BrainPortal/app/models/cluster_task.rb-
./BrainPortal/app/models/cluster_task.rb-    # Determine starting dir for relative symlink target calculation
./BrainPortal/app/models/cluster_task.rb-    base_dir = start_dir || self.full_cluster_workdir
--
--
./BrainPortal/app/models/cluster_task.rb-    workdir_path   = Pathname.new(self.cluster_shared_dir)
./BrainPortal/app/models/cluster_task.rb-    dp_cache_path  = Pathname.new(RemoteResource.current_resource.dp_cache_dir)
./BrainPortal/app/models/cluster_task.rb-    userfile_path  = Pathname.new(userfile.cache_full_path)
./BrainPortal/app/models/cluster_task.rb:    userfile_path += Pathname.new(userfile_sub_path) if userfile_sub_path.to_s.present?
./BrainPortal/app/models/cluster_task.rb-
./BrainPortal/app/models/cluster_task.rb-    # Try to figure out if the DataProvider of userfile is local, or remote.
./BrainPortal/app/models/cluster_task.rb-    # SmartDataProviders supply the real implementation under a real_provider method;
--

The end

/(thank\s+you|thanks)!/i