Regular Expressions

Pierre Rioux

ACE lab Developer Meeting
November 2017, January 2018

"Les regex sont vos amis", hein Nic?

Table of content

About strings
Brief history of regular expressions
How they are used
Basic matching
Special characters
Grouping and capturing
Examples

About strings

As programmers, we all use them.


  "pierre.rioux@mcgill.ca"             "It was the best of times, it was the worst
                                        of times, it was the time of the ADM."
  "Restaurant L'Hémorragie"

  "y\n"          "699.99"              "/data/assembly/1234/V2/new.mnc"        "0"

  "2017-08-23 12:11:40 CBRAIN ERROR: MySql server has gone away"

  "922734,V6,04/03/98,'subject was drunk',/data/store/922734/rest_1b.mnc.gz"

  "AACCCTAACCCTAACCCTAACCCTAAGTTAGGATC...GATTATTATAATCGGATAGCTTTTAGGATATAGGGTATAG"

Simple, short, with specific meaning
Arbitrary prose or text
Multi line or single line
Structured or not
Really large data

What we do with them

Validate them:
✔ "pierre.rioux@mcgill.ca" (legal email address)
❌ "pierre@rioux@mcgill.ca" (improper email address)
Check properties:
✔ "2017-08-30" (date starts with a 4 digit year)
✔ "novel.txt" (filename ends with a known extension)
✔ "/data/cbrain/error.log" (seems to be a pathname)
Extract information:
"2017-08-30 13:45 Added new user prioux"
✔ Date = "2017-08-30"
✔ Time = "13:45"
✔ Action = "Added new user"
✔ Argument = "prioux"

The substring libraries

Most programming languages have libraries that
provide simple substring operations.

Scan for fixed strings
Take and return string positions
Must build your own code to handle complex cases
Code becomes brittle and breaks on unexpected variations
Good enough for simple properties

                   "2017-08-20 13:45 Added new user prioux"
                   "2017-08-21  9:23 Removed user 'prioux'"

scanf()

Old solution, provided by the C library

  int year, month, day;
  char *action = "xxxxxxxxxxx";

  sscanf("Log: 2017-04-24 Added new user prioux",
         "Log: %d-%d-%d %s", &year, &month, &day, action);

This solution still suffers from brittleness. You are also limited to just the set of %X characters sequences that scanf() supports.

There is little need to use scanf() in modern times.

Quick history

Regular expressions come from Formal Language theory
They basically encode a finite automaton
Kleene, 1950s
Later, they started being used as programming tools
Basic support in early UNIX command (ed, sed, grep...)

From C to Perl and beyond

Henry Spencer's regex library

Released in the late 80s
Became quite popular, used in other projects
As C programmers, we must #include and ld...

Larry Wall's Perl interpreter

Larry took Henry's library and 'butchered' it
Perl interpreter came with the regex lib inside it
The Perl language has built-in operators to use it

Now they're everywhere

Perl and TCL started it
Newer languages included regexes as a basic type
Ruby, Python, PHP, Javascript...
In C, you still need an external library, like PCRE

How they are used

Note: called matching...

My first regex

/abc/

Some observations:

We could see this in a program
The regex is three characters long: abc
The / are delimiters, not part of the regex
The delimiters can be different, and vary by language

Some examples

Perl:

$mystring = "I am the eggman";
$yesno    = $mystring =~ /abc/;

Ruby:

mystring = "I am the eggman"
yesno    = mystring =~ /abc/
yesno    = mystring.match(/abc/)
yesno    = /abc/.match(mystring)

PHP:

$mystring = "I am the walrus";
$yesno    = preg_match('/abc/', $mystring);

Javascript:

var mystring = "Goo goo g'joob";
var yesno    = /abc/.test(string);

What does /abc/ even mean?

This is a simple substring match. The regexp
will search a string and return true if:

The string contains the exact substring "abc"
The substring can be anywhere
The letters MUST be lowercase

Match examples in yellow:

✔  "abc are three letters"

✔  "I learned my abc when I was three"

❌  "Buy three CDs at ABC Music!"

Basics 1

Each character in a regex is important.

Some are just meant to match themselves:

Letters of the alphabet: a, b, c, ..., A, B, C, ... Z
The digits: 0, 1, 2, 3, ..., 9
The blank characters (space, tab etc)
A few special characters: =, @, #, !, -

These two regex match their exact string inside:

/I am clown@circus/

/H2Y 4A2/

(This is basically like a substring search)

Basics 2

Most special characters and punctuation have special meanings.

The . (dot, period) match ANY character except the newline
The [ and ] are used for character ranges
The * is a repeat indicator
The ? is a optional indicator
The \ (backslash) introduces special matching rules

The next few slides will give examples for each of these cases.

The . in a regex

/st..k/ will match any these 5 letter strings:

stack stalk stank stark steak stick
stink stock stork stuck stunk stEAk
st!2k st87k st__k st[]k st--k stp\k

And also:

"tallest skyscraper"

The [ ] square brackets

Between the [ ] are the characters allowed at that position.

[ac/dc]       Matches the character "a", "c", "/" or "d"
[0123459876]  Matches a single digit
[0-9]         Matches a single digit
[dr.]         Matches "d", "r" or "."
[a-z]         All lowercase letters
[a-zA-Z]      All letters
[a-z-]        All lowercase letters and "-" too

The . just means the period character
The - can be used to represent a range
Allowed characters can be listed in any order
Important: whatever's inside will always match just one character in the string

Postal Code:
/[ABCEGHJKLMNPRSTVXY][0-9][A-Z] [0-9][A-Z][0-9]/

The ^ in square brackets

If a [ ] range starts with the ^ character,
it means all characters not in the range.

[^XYZ]      Any character other than "X", "Y" or "Z"
[^a-z]      One that is not a lowercase letter
[^.]        Not a "."
[^0-9]      Not a digit

The ? means zero or once

The ? character applies to whatever precedes it.

It allows the regex to:

match it 0 times (make it optional)
match it 1 time (like if the ? wasn't there)

/Pierr?e Rioux?/

✔ "Piere Riou"
✔ "Pierre Rioux"
✔ "Pierre Riou"
❌ "Piee Rioux"

The ? means zero or once

/X-[A-Z]?[a-z0-9]?-X/

✔ "X--X"
✔ "X-G-X
✔ "X-9-X"
✔ "X-Pa-X"
✔ "X-P8-X"
❌ "X-PA-X"
❌ "X-4Z-X"

The * means zero or more

The * character applies to whatever precedes it.

It allows the regex to:

match it 0 times (make it optional)
match it 1 time (like if the * wasn't there)
match two or more times

/Pierr*e Rioux*/

✔ "Piere Riou"
✔ "Pierre Rioux"
✔ "Pierrrrrrrrre Riouxxxxxxxx"
❌ "Piee Riouxxxxxxxx"

The * means zero or more

/X-[A-Z]*[a-z0-9]*-X/

✔ "X--X"
✔ "X-G-X
✔ "X-9-X"
✔ "X-Pa-X"
✔ "X-P8-X"
✔ "X-COOKIE-X"
✔ "X-m0nst3r-X"
✔ "X-COOKIEm0nst3r-X"
❌ "X-CookieMonst3r-X"

The + means one or more

The + character applies to whatever precedes it.

/[0-9]+_V[0-9]+/

✔ "0_V0"
✔ "528222_V12"
❌ "_V7165"
❌ "622122_V"

Generally, /x+/ is the same as /xx*/ for all x.

The {} mean repeat counts

The {} characters can be used
to specify how many times something must match.

Regex	Meaning
/[0-9]{3}/	Three digits. Same as /[0-9][0-9][0-9]/
/[0-9]{3,}/	At least 3 digits
/[0-9]{,7}/	Up to 7 digits (including 0)
/[0-9]{3,7}/	Between 3 to 7 digits

The \ backslash character

In front of special characters, it just makes them less special.

/abc\*def\\xyz\.txt/

This just matches the string 'abc*def\xyz.txt' (15 characters)

The \ backslash character

In front of some letters, it allows
expressing unprintable or blank characters.

	Unprintable character
\n	newline (ASCII 10)
\t	tab (ASCII 9)
\r	carriage return (ASCII 13)
\e	escape (ASCII 27)
...	there are many others...

The \ backslash character

In front of some other letters, it
matches useful character sets or ranges.

	Set of characters
\s	Any whitespace: space, tab, newline, etc
\d	Any digit. Similar to [0-9]
\w	Any character used in an identifier: [a-zA-Z0-9_]
\S	The opposite of \s
\D	The opposite of \d
\W	The opposite of \w

Anchoring

The ^ and $ characters forces the matching to happen, respectively, at the beginning or end of the string.

/\w+@\w+\.ca/

✔ "pierre@circus.ca"
✔ "My email is pierre@circus.ca, thanks"

/^\w+@\w+\.ca$/

✔ "pierre@circus.ca"
❌ "My email is pierre@circus.ca, thanks"

Anchoring

Within a multi-line string:

^ matches the beginning of any line; it is a zero-width match.
$ matches the end of any line, it is a zero-width match just before the newline character.

   mystring = "amaryjo\nmary\nryjo\n"

✔ mystring =~ /ary$/    # "amaryjo\nmary\nryjo\n"
✔ mystring =~ /^mar/    # "amaryjo\nmary\nryjo\n"
✔ mystring =~ /^ryjo$/  # "amaryjo\nmary\nryjo\n"

Anchoring a regex can significantly improve performance when scanning large strings, or with complex regular expressions.

The | vertical bar

Within a regex, the | symbol means to try to match what's on the left or what's on the right; each side is a full regex.

/taxi$|tra?in|h.ver.raft/

✔ "My hovercraft is full of eels"
✔ "Planes, trains, and automobiles"
❌ "I took a taxi and paid $20"

Grouping with ( and )

The parentheses create a group;
the group can be modified with ?, * etc.

/(Malkovitch)+/

✔ "My name is Malkovitch, indeed."
✔ "Being John MalkovitchMalkovitchMalkovitchMalkovitch."

Grouping is useful with |

/a (robot|(human)+|vegetable)? truly/

✔ "I am a robot truly"
✔ "I am a vegetable truly"
✔ "I am a humanhuman truly"
✔ "I am a  truly"
❌ "I am a robothumanvegetable truly"

Each regex within a group is independent

/(a[0-9]z)+/

✔ "a0za0za0za0za0z"
✔ "a0za1za9za3za7za4z"

The + applies to the regular expression, not the matched string.

Capturing with ()

These are the string values for six task status in CBRAIN:

status = "Restarting Setup"
status = "Restarting Cluster"
status = "Restarting PostProcessing"
status = "Recovering Setup"
status = "Recovering Cluster"
status = "Recovering PostProcessing"

matchinfo = status.match /(Recovering|Restarting) (\S+)/
operation = matchinfo[0]
stage     = matchinfo[1]

When matching using Ruby, we get the first word into variable operation and the second word into variable stage.

Capturing with ()

Perl uses global variables $1, $2, $3, etc

my $status    = "Restarting PostProcessing"
$status       =~ /(Recovering|Restarting) (\S+)/;
my $operation = $1;
my $stage     = $2;

Backreferences:
\1 match the first capture

The special \1, \2 etc refer to the values
captured by the first, second etc parenthesis.

/([a-z]+)-\d+-([a-z]+)/

✔ "Product code abc-123-def is sold out"
✔ "Product code abc-123-abc is sold out"

/([a-z]+)-\d+-\1/

❌ "Product code abc-123-def is sold out"
✔ "Product code abc-123-abc is sold out"

By default, regex are greedy

/[a-z]+x/

❌ "0123abcdxyzabcdxyzabcdxyz0123"
✔ "0123abcdxyzabcdxyzabcdxyz0123"

/[0-9]+$/

❌ "Beverly Hills 90210"
✔ "Beverly Hills 90210"

Some engines support non-greedy matching with ?

/[a-z]+?x/

❌ "0123abcdxyzabcdxyzabcdxyz0123"
✔ "0123abcdxyzabcdxyzabcdxyz0123"

Modifiers: for the regex engine and parser

Modifier	Effect	Example
i	Regex is case-insensitive	"H2y" =~ /[a-z][0-9][a-z]/i
g	Regex match globally, in turn from position of last match	# This is a Perl example while ("one two three" =~ /(\w+)/gi) { print "$1\n"; # will execute 3 times }
x	Allow spaces and comments for lisibility	email =~ / ([a-z0-9\.\-]+) # username @ ([a-z0-9\.\-]+) # hostname /x

Modifier

Effect

Example

Regex is case-insensitive

"H2y" =~ /[a-z][0-9][a-z]/i

Regex match globally, in turn from position of last match

# This is a Perl example
while ("one two three" =~ /(\w+)/gi) {
  print "$1\n"; # will execute 3 times
}

Allow spaces and comments for lisibility

email =~ /
     ([a-z0-9\.\-]+)  # username
     @
     ([a-z0-9\.\-]+)  # hostname
     /x

Example of x in CBRAIN

  # Returns true if the given +userfile+ is named according
  # to the LORIS CCNA Phantom convention for integration with CBRAIN,
  # which means we can find a subject ID and a visit ID.
  #
  # Example: 'ibis_417879_Initial_MRI_t1w_001.mnc'
  #          'ibis_123456_lego_phantom_SD_ABCD_20001212_12CH_t1w_0183.mnc
  #
  # The visit ID can be a complex string with underscores in it.
  # See the regex in the code.
  #
  # Returns the subject ID and the visit ID if the convention
  # is respected; for the example, the "417879" and "Initial_MRI" parts.
  def named_according_to_LORIS_convention?(userfile)
    if userfile.name =~ /\A
                             [a-zA-Z0-9]+            # arbitrary prefix
                             _(\d\d\d\d\d\d)         # $1 Subject ID
                             _(Initial_MRI|          # $2 Visit, one of "Initial_MRI" or ...
                               (lego|human)
                               _phantom
                               _(L1|SD)
                               _([A-Z]{3,4})         # three or four uppercase letters
                               _\d\d\d\d \d\d \d\d   # year month day
                               (_(12CH|32CH))?       # optional suffix of Visit
                              )
                        /x
      [ Regexp.last_match[1], Regexp.last_match[2] ]
    end
  end

Regex engines differ

Software	Differences	Notes
Perl	has everything!
Ruby	mostly like Perl
PHP	mostly like Perl
Javascript	more limited	no comments with /x
grep, sed, etc	Basic: No +, \|, \w, \s, \d, etc	Use grep -P to get Perl's engine; modifiers are options.
egrep	is grep, but adds \|

Note: Wikipedia has some nice, dedailed comparison tables.

Performance and
catastrophic backtracking

/(x+x+)+Y/

"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxY"

"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

Ref: https://www.regular-expressions.info/catastrophic.html

Security: perl embedded code!

/abc(?{ print "hello" })def/

Example 1: /st..k/

How did I find the words in a previous slide?

I have a file at home called 's'. Its content:

s
sa
sabbath
sabbatical
Sabina
Sabine
sable
sabotage
sabra
sac
saccharine
sachem
Sachs
sack
(about 2600 more lines omitted)

Example 1: /st..k/

Here's part of my bash bash history:

   62  Nov-15 23:00:19 grep 's....' s
   63  Nov-15 23:00:24 grep 's....$' s
   64  Nov-15 23:00:31 grep '^s....$' s
   65  Nov-15 23:00:46 grep '^s....$' s | cut -c2
   66  Nov-15 23:00:53 grep '^s....$' s | cut -c2 | uniq -c
   67  Nov-15 23:01:10 grep '^s....$' s | grep '^st'
   68  Nov-15 23:01:56 grep '^s....$' s | grep '^st..k'

Command 62: too many words
Command 63: some extra, like suspect
Command 64: ok, 364 results
Command 65: extract second letter
Command 66: count second letters; t=71
Command 67: have a quick look
Command 68: choose the 11 that end in k

Example 2: CBRAIN code scan

Changes in the Pathname Ruby library forced the
CBRAIN team to review all calls to these four methods:

.present?()
.presence()
.blank?()
.empty?()

The details why are not important here,
suffice to say it was for the R5 release.

Example 2: CBRAIN code scan

I ran this bash pipe in the CBRAIN source directory:

find . -type f -name "*.rb" -print | \
xargs grep -B3 -A3 -E '\.empty\?|\.present\?|\.presence|\.blank\?' | \
highlight -l -r '\.empty\?|\.present\?|\.presence|\.blank\?' yellow | \
less

This created a report of about 9000 lines:

--
./BrainPortal/app/models/cluster_task.rb-  # to a location different than the original task directory.
./BrainPortal/app/models/cluster_task.rb-  def make_available(userfile, file_path, userfile_sub_path = nil, start_dir = nil)
./BrainPortal/app/models/cluster_task.rb-    cb_error "File path argument must be relative" if
./BrainPortal/app/models/cluster_task.rb:      file_path.to_s.blank? || file_path.to_s =~ /\A\//
./BrainPortal/app/models/cluster_task.rb-
./BrainPortal/app/models/cluster_task.rb-    # Determine starting dir for relative symlink target calculation
./BrainPortal/app/models/cluster_task.rb-    base_dir = start_dir || self.full_cluster_workdir
--
--
./BrainPortal/app/models/cluster_task.rb-    workdir_path   = Pathname.new(self.cluster_shared_dir)
./BrainPortal/app/models/cluster_task.rb-    dp_cache_path  = Pathname.new(RemoteResource.current_resource.dp_cache_dir)
./BrainPortal/app/models/cluster_task.rb-    userfile_path  = Pathname.new(userfile.cache_full_path)
./BrainPortal/app/models/cluster_task.rb:    userfile_path += Pathname.new(userfile_sub_path) if userfile_sub_path.to_s.present?
./BrainPortal/app/models/cluster_task.rb-
./BrainPortal/app/models/cluster_task.rb-    # Try to figure out if the DataProvider of userfile is local, or remote.
./BrainPortal/app/models/cluster_task.rb-    # SmartDataProviders supply the real implementation under a real_provider method;
--

The end

/(thank\s+you|thanks)!/i

Regular Expressions

Pierre Rioux

ACE lab Developer MeetingNovember 2017, January 2018

Table of content

About strings

What we do with them

The substring libraries

scanf()

Quick history

From C to Perl and beyond

Now they're everywhere

How they are used

My first regex

Some examples

What does /abc/ even mean?

Basics 1

Basics 2

The . in a regex

The [ ] square brackets

The ^ in square brackets

The ? means zero or once

The ? means zero or once

The * means zero or more

The * means zero or more

The + means one or more

The {} mean repeat counts

The \ backslash character

The \ backslash character

The \ backslash character

Anchoring

Anchoring

The | vertical bar

Grouping with ( and )

Grouping is useful with |

Each regex within a group is independent

Capturing with ()

Capturing with ()

Capturing with ()

Backreferences:\1 match the first capture

By default, regex are greedy

Some engines support non-greedy matching with ?

Modifiers: for the regex engine and parser

Example of x in CBRAIN

Regex engines differ

Performance andcatastrophic backtracking

Security: perl embedded code!

Example 1: /st..k/

Example 1: /st..k/

Example 2: CBRAIN code scan

Example 2: CBRAIN code scan

The end

ACE lab Developer Meeting
November 2017, January 2018

Backreferences:
\1 match the first capture

Performance and
catastrophic backtracking