Table of contents
- About strings
- Brief history of regular expressions
- How they are used
- Basic matching
- Special characters
- Grouping and capturing
- Examples
"Regexes are your friends", right Nic?
""
"It was the best of times, it was the worst of times, it was the time of the ADM."
"Restaurant L'Hémorragie"
"y\n"
"699.99"
"/data/assembly/1234/V2/new.mnc"
"0"
"2017-08-23 12:11:40 CBRAIN ERROR: MySql server has gone away"
"922734,V6,04/03/98,'subject was drunk',/data/store/922734/rest_1b.mnc.gz"
"AACCCTAACCCTAACCCTAACCCTAAGTTAGGATC...GATTATTATAATCGGATAGCTTTTAGGATATAGGGTATAG"
"2017-08-20 13:45 Added new user prioux"
"2017-08-21 9:23 Removed user 'prioux'"
int year, month, day;
char action[32];  /* writable buffer; a string literal must not be written to */
sscanf("Log: 2017-04-24 Added new user prioux",
       "Log: %d-%d-%d %31s", &year, &month, &day, action);  /* %31s cannot overflow action */
This solution is still brittle. You are also limited to
the set of %X conversion sequences that scanf() supports.
There is little need to use scanf() in modern times.
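For comparison, here is a minimal Ruby sketch of the same extraction done with a regular expression (the notation is explained later in this deck). The names LOG_LINE and parse_log_line are made up for this illustration.

```ruby
# Hypothetical helper: parse one log line such as
# "Log: 2017-04-24 Added new user prioux" with a regex instead of sscanf().
LOG_LINE = /\ALog: (\d{4})-(\d{1,2})-(\d{1,2}) (\S+)/

def parse_log_line(line)
  m = LOG_LINE.match(line)
  return nil unless m   # no match: say so instead of leaving variables unset
  { year: m[1].to_i, month: m[2].to_i, day: m[3].to_i, action: m[4] }
end

p parse_log_line("Log: 2017-04-24 Added new user prioux")
p parse_log_line("something else entirely")
```

Unlike the sscanf() version, a line that does not fit the pattern is reported explicitly (nil here).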
Larry Wall's Perl interpreter
Some observations:
$mystring = "I am the eggman";
$yesno = $mystring =~ /abc/;
mystring = "I am the eggman"
yesno = mystring =~ /abc/
yesno = mystring.match(/abc/)
yesno = /abc/.match(mystring)
$mystring = "I am the walrus";
$yesno = preg_match('/abc/', $mystring);
var mystring = "Goo goo g'joob";
var yesno = /abc/.test(mystring);
Match examples in yellow:
✔ "abc are three letters" ✔ "I learned my abc when I was three" ❌ "Buy three CDs at ABC Music!"
Some are just meant to match themselves:
These two regexes match the exact string they contain:
/I am clown@circus/
/H2Y 4A2/
(This is basically like a substring search)
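A small Ruby sketch of that equivalence. The helper name same_as_substring_search? and the sample address are made up for illustration.

```ruby
# A regex made only of ordinary characters behaves like a substring search.
def same_as_substring_search?(text, literal)
  regex = Regexp.new(Regexp.escape(literal))   # escape so every character matches itself
  regex.match?(text) == text.include?(literal)
end

address = "Send it to H2Y 4A2, Montreal"
puts address.match?(/H2Y 4A2/)      # true
puts address.include?("H2Y 4A2")    # true
puts "h2y 4a2".match?(/H2Y 4A2/)    # false: matching is case-sensitive by default
```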
The next few slides will give examples for each of these cases.
stack stalk stank stark steak stick stink stock stork stuck stunk stEAk st!2k st87k st__k st[]k st--k stp\k
And also:
"tallest skyscraper"
[ac/dc]       Matches the character "a", "c", "/" or "d"
[0123459876]  Matches a single digit
[0-9]         Matches a single digit
[dr.]         Matches "d", "r" or "."
[a-z]         All lowercase letters
[a-zA-Z]      All letters
[a-z-]        All lowercase letters and "-" too
Postal Code:
/[ABCEGHJKLMNPRSTVXY][0-9][A-Z] [0-9][A-Z][0-9]/
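The postal code regex can be tried out in Ruby as follows. This is only a sketch: the constant and method names are invented here, and the sample codes are arbitrary.

```ruby
# The postal code regex from the slide, unchanged.
POSTAL_CODE = /[ABCEGHJKLMNPRSTVXY][0-9][A-Z] [0-9][A-Z][0-9]/

def contains_postal_code?(text)
  POSTAL_CODE.match?(text)
end

puts contains_postal_code?("H2Y 4A2")       # true
puts contains_postal_code?("Z2Y 4A2")       # false: "Z" is not in the first class
puts contains_postal_code?("h2y 4a2")       # false: the classes list uppercase letters only
puts contains_postal_code?("xxH2Y 4A2xx")   # true: the regex is not anchored
```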
[^XYZ]  Any character other than "X", "Y" or "Z"
[^a-z]  One that is not a lowercase letter
[^.]    Not a "."
[^0-9]  Not a digit
It allows the regex to:
/Pierr?e Rioux?/
✔ "Piere Riou" ✔ "Pierre Rioux" ✔ "Pierre Riou" ❌ "Piee Rioux"
✔ "X--X" ✔ "X-G-X" ✔ "X-9-X" ✔ "X-Pa-X" ✔ "X-P8-X" ❌ "X-PA-X" ❌ "X-4Z-X"
It allows the regex to:
/Pierr*e Rioux*/
✔ "Piere Riou" ✔ "Pierre Rioux" ✔ "Pierrrrrrrrre Riouxxxxxxxx" ❌ "Piee Riouxxxxxxxx"
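A quick Ruby check of the two name regexes, side by side. The constant names are invented for this sketch.

```ruby
NAME_OPTIONAL = /Pierr?e Rioux?/   # ? : zero or one of the preceding character
NAME_REPEATED = /Pierr*e Rioux*/   # * : zero or more of the preceding character

names = ["Piere Riou", "Pierre Rioux", "Pierrrrrrrrre Riouxxxxxxxx", "Piee Rioux"]
names.each do |name|
  puts "#{name.inspect}: ?=#{NAME_OPTIONAL.match?(name)} *=#{NAME_REPEATED.match?(name)}"
end
```

Only the * version accepts the long run of "r"; neither accepts "Piee", because one "r" is always required.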
✔ "X--X" ✔ "X-G-X" ✔ "X-9-X" ✔ "X-Pa-X" ✔ "X-P8-X" ✔ "X-COOKIE-X" ✔ "X-m0nst3r-X" ✔ "X-COOKIEm0nst3r-X" ❌ "X-CookieMonst3r-X"
✔ "0_V0" ✔ "528222_V12" ❌ "_V7165" ❌ "622122_V"
Generally, /x+/ is the same as /xx*/ for all x.
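The equivalence can be spot-checked in Ruby. This sketch only compares the two forms on a handful of sample strings chosen here; it is not a proof.

```ruby
# Compare what /x+/ and /xx*/ match on a few sample strings.
PLUS_SAMPLES = ["", "x", "xxx", "axxxb", "abc"]

def plus_equals_star?(samples)
  # String#[] with a regex returns the matched text, or nil when nothing matches
  samples.all? { |s| s[/x+/] == s[/xx*/] }
end

puts plus_equals_star?(PLUS_SAMPLES)   # true
```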
Regex | Meaning |
/[0-9]{3}/ | Three digits. Same as /[0-9][0-9][0-9]/ |
/[0-9]{3,}/ | At least 3 digits |
/[0-9]{,7}/ | Up to 7 digits (including none at all) |
/[0-9]{3,7}/ | Between 3 and 7 digits |
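The counts in this table can be verified in Ruby. In this sketch the regexes are wrapped in \A and \z (an addition made here) so that the whole string must consist of the counted digits; the helper name is invented. Ruby accepts {,7}, but some engines do not, so {0,7} is the more portable spelling.

```ruby
# True when the whole string satisfies the given regex.
def digits_count_ok?(text, regex)
  regex.match?(text)
end

puts digits_count_ok?("123",      /\A[0-9]{3}\z/)     # true
puts digits_count_ok?("1234",     /\A[0-9]{3}\z/)     # false: one digit too many
puts digits_count_ok?("1234",     /\A[0-9]{3,}\z/)    # true
puts digits_count_ok?("",         /\A[0-9]{,7}\z/)    # true: zero digits is allowed
puts digits_count_ok?("12345678", /\A[0-9]{3,7}\z/)   # false: eight digits
```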
This just matches the string 'abc*def\xyz.txt' (15 characters)
Unprintable character | |
\n | newline (ASCII 10) |
\t | tab (ASCII 9) |
\r | carriage return (ASCII 13) |
\e | escape (ASCII 27) |
... | there are many others... |
Set of characters | |
\s | Any whitespace: space, tab, newline, etc |
\d | Any digit. Similar to [0-9] |
\w | Any character used in an identifier: [a-zA-Z0-9_] |
\S | The opposite of \s |
\D | The opposite of \d |
\W | The opposite of \w |
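A short Ruby sketch that sorts single characters using these shorthand classes. The method name classify and the sample string are made up for illustration.

```ruby
# Classify one character with the shorthand classes \s, \d and \w.
def classify(char)
  case char
  when /\s/ then :whitespace
  when /\d/ then :digit
  when /\w/ then :word        # letters and "_" (digits were caught above)
  else           :other
  end
end

"a_7 \t-".each_char { |c| puts "#{c.inspect} => #{classify(c)}" }
```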
✔ "" ✔ "My email is, thanks"
✔ "" ❌ "My email is, thanks"
mystring = "amaryjo\nmary\nryjo\n"
✔ mystring =~ /ary$/    # "amaryjo\nmary\nryjo\n"
✔ mystring =~ /^mar/    # "amaryjo\nmary\nryjo\n"
✔ mystring =~ /^ryjo$/  # "amaryjo\nmary\nryjo\n"
Anchoring a regex can significantly improve performance when scanning large strings, or with complex regular expressions.
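In Ruby, ^ and $ anchor at line boundaries, while \A, \z and \Z anchor at the ends of the whole string. A minimal sketch of the difference, reusing the three-line string from the example (the constant name is invented here); it shows behaviour only, not performance.

```ruby
ANCHOR_TEXT = "amaryjo\nmary\nryjo\n"

puts ANCHOR_TEXT.match?(/^ryjo$/)     # true: ^ and $ match at line boundaries
puts ANCHOR_TEXT.match?(/\Aryjo\z/)   # false: \A and \z match only at the string's ends
puts ANCHOR_TEXT.match?(/\Aamar/)     # true: the string does start with "amar"
puts ANCHOR_TEXT.match?(/ryjo\Z/)     # true: \Z tolerates one trailing newline
```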
✔ "My hovercraft is full of eels" ✔ "Planes, trains, and automobiles" ❌ "I took a taxi and paid $20"
✔ "My name is Malkovitch, indeed." ✔ "Being John MalkovitchMalkovitchMalkovitchMalkovitch."
/a (robot|(human)+|vegetable)? truly/
✔ "I am a robot truly" ✔ "I am a vegetable truly" ✔ "I am a humanhuman truly" ✔ "I am a  truly" (two spaces: the optional group matched nothing) ❌ "I am a robothumanvegetable truly"
✔ "a0za0za0za0za0z" ✔ "a0za1za9za3za7za4z"
The + applies to the regular expression, not the matched string.
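A Ruby sketch of this point. The regex below is an assumption made for illustration (the letter "a", one digit, the letter "z", grouped and repeated); the constant name is invented too.

```ruby
# Assumed regex: the group "a, one digit, z" repeated one or more times.
REPEATED_GROUP = /\A(a[0-9]z)+\z/

puts REPEATED_GROUP.match?("a0za0za0za0za0z")      # true
puts REPEATED_GROUP.match?("a0za1za9za3za7za4z")   # true: each repetition may pick a different digit
puts REPEATED_GROUP.match("a0za1za9z")[1]          # a9z: the group keeps its last repetition
```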
status = "Restarting Setup"
status = "Restarting Cluster"
status = "Restarting PostProcessing"
status = "Recovering Setup"
status = "Recovering Cluster"
status = "Recovering PostProcessing"
matchinfo = status.match(/(Recovering|Restarting) (\S+)/)
operation = matchinfo[1]  # first capture group; matchinfo[0] is the whole match
stage     = matchinfo[2]  # second capture group
When matching using Ruby, we get the first word into variable operation and the second word into variable stage.
my $status = "Restarting PostProcessing";
$status =~ /(Recovering|Restarting) (\S+)/;
my $operation = $1;
my $stage = $2;
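The same capture logic as a small Ruby sketch that also handles a non-matching status. The names STATUS_REGEX and split_status are invented for illustration.

```ruby
STATUS_REGEX = /(Recovering|Restarting) (\S+)/

# Returns [operation, stage], or nil when the status does not fit the pattern.
def split_status(status)
  m = STATUS_REGEX.match(status)
  return nil unless m
  [m[1], m[2]]    # m[0] would be the whole matched text
end

p split_status("Restarting PostProcessing")   # ["Restarting", "PostProcessing"]
p split_status("Recovering Setup")            # ["Recovering", "Setup"]
p split_status("Idle")                        # nil
```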
✔ "Product code abc-123-def is sold out" ✔ "Product code abc-123-abc is sold out"
❌ "Product code abc-123-def is sold out" ✔ "Product code abc-123-abc is sold out"
❌ "0123abcdxyzabcdxyzabcdxyz0123" ✔ "0123abcdxyzabcdxyzabcdxyz0123"
❌ "Beverly Hills 90210" ✔ "Beverly Hills 90210"
❌ "0123abcdxyzabcdxyzabcdxyz0123" ✔ "0123abcdxyzabcdxyzabcdxyz0123"
Modifier | Effect | Example |
i | Regex is case-insensitive | "H2y" =~ /[a-z][0-9][a-z]/i |
g | Regex matches globally, each match starting from the position of the last one | # This is a Perl example while ("one two three" =~ /(\w+)/gi) { print "$1\n"; # will execute 3 times } |
x | Allow spaces and comments for readability | email =~ / ([a-z0-9\.\-]+) # username @ ([a-z0-9\.\-]+) # hostname /x |
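The three modifiers, sketched in Ruby. Ruby has no g modifier, so String#scan is used here to collect every match. The names all_words and EMAIL_PARTS and the address user@example.org are made up for illustration.

```ruby
# i: case-insensitive matching
puts "H2y".match?(/[a-z][0-9][a-z]/i)   # true

# No g modifier in Ruby; String#scan returns every match in turn.
def all_words(text)
  text.scan(/\w+/)
end
p all_words("one two three")            # ["one", "two", "three"]

# x: whitespace and comments inside the regex are ignored
EMAIL_PARTS = /
  ([a-z0-9.\-]+)   # username
  @
  ([a-z0-9.\-]+)   # hostname
/x
m = EMAIL_PARTS.match("user@example.org")
puts m[1]   # user
puts m[2]   # example.org
```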
# Returns true if the given +userfile+ is named according
# to the LORIS CCNA Phantom convention for integration with CBRAIN,
# which means we can find a subject ID and a visit ID.
# Example: 'ibis_417879_Initial_MRI_t1w_001.mnc'
#          'ibis_123456_lego_phantom_SD_ABCD_20001212_12CH_t1w_0183.mnc'
# The visit ID can be a complex string with underscores in it.
# See the regex in the code.
# Returns the subject ID and the visit ID if the convention
# is respected; for the example, the "417879" and "Initial_MRI" parts.
def named_according_to_LORIS_convention?(userfile)
if =~ /\A
[a-zA-Z0-9]+ # arbitrary prefix
_(\d\d\d\d\d\d) # $1 Subject ID
_(Initial_MRI| # $2 Visit, one of "Initial_MRI" or ...
_([A-Z]{3,4}) # three or four uppercase letters
_\d\d\d\d \d\d \d\d # year month day
(_(12CH|32CH))? # optional suffix of Visit
[ Regexp.last_match[1], Regexp.last_match[2] ]
Software | Differences | Notes |
Perl | has everything! | |
Ruby | mostly like Perl | |
PHP | mostly like Perl | |
Javascript | more limited | no comments with /x |
grep, sed, etc | Basic: No +, |, \w, \s, \d, etc | Use grep -P to get Perl's engine; modifiers are options. |
egrep | is grep, but adds | |
Note: Wikipedia has some nice, detailed comparison tables.
I have a file at home called 's'. Its content:
s sa sabbath sabbatical Sabina Sabine sable sabotage sabra sac saccharine sachem Sachs sack (about 2600 more lines omitted)
62 Nov-15 23:00:19 grep 's....' s
63 Nov-15 23:00:24 grep 's....$' s
64 Nov-15 23:00:31 grep '^s....$' s
65 Nov-15 23:00:46 grep '^s....$' s | cut -c2
66 Nov-15 23:00:53 grep '^s....$' s | cut -c2 | uniq -c
67 Nov-15 23:01:10 grep '^s....$' s | grep '^st'
68 Nov-15 23:01:56 grep '^s....$' s | grep '^st..k'
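The last command of that session can be mimicked in Ruby with Enumerable#grep. This is a sketch: the short word list stands in for the file 's', and the names S_WORDS and grep_words are invented.

```ruby
# A few words standing in for the file "s" (assumption: one word per line).
S_WORDS = %w[s sa sabbath sable sack stack stalk steak stick stork sting stuck]

def grep_words(words, regex)
  words.grep(regex)       # Enumerable#grep keeps the elements that match
end

five_letters = grep_words(S_WORDS, /^s....$/)   # like: grep '^s....$' s
p five_letters
p grep_words(five_letters, /^st..k/)            # like: ... | grep '^st..k'
```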
.present?() .presence() .blank?() .empty?()
The reasons are not important here;
suffice it to say it was for the R5 release.
find . -type f -name "*.rb" -print | \
  xargs grep -B3 -A3 -E '\.empty\?|\.present\?|\.presence|\.blank\?' | \
  highlight -l -r '\.empty\?|\.present\?|\.presence|\.blank\?' yellow | \
  less
This created a report of about 9000 lines:
--
./BrainPortal/app/models/cluster_task.rb- # to a location different than the original task directory.
./BrainPortal/app/models/cluster_task.rb- def make_available(userfile, file_path, userfile_sub_path = nil, start_dir = nil)
./BrainPortal/app/models/cluster_task.rb- cb_error "File path argument must be relative" if
./BrainPortal/app/models/cluster_task.rb: file_path.to_s.blank? || file_path.to_s =~ /\A\//
./BrainPortal/app/models/cluster_task.rb-
./BrainPortal/app/models/cluster_task.rb- # Determine starting dir for relative symlink target calculation
./BrainPortal/app/models/cluster_task.rb- base_dir = start_dir || self.full_cluster_workdir
--
--
./BrainPortal/app/models/cluster_task.rb- workdir_path =
./BrainPortal/app/models/cluster_task.rb- dp_cache_path =
./BrainPortal/app/models/cluster_task.rb- userfile_path =
./BrainPortal/app/models/cluster_task.rb: userfile_path += if userfile_sub_path.to_s.present?
./BrainPortal/app/models/cluster_task.rb-
./BrainPortal/app/models/cluster_task.rb- # Try to figure out if the DataProvider of userfile is local, or remote.
./BrainPortal/app/models/cluster_task.rb- # SmartDataProviders supply the real implementation under a real_provider method;
--
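The core of that search, the alternation of four escaped method names, can be tried on a string of source code in Ruby. This sketch does not walk the file system or print context lines; the names METHOD_CALLS and matching_lines are invented, and the sample text is adapted from the report above.

```ruby
# Same alternation as in the grep -E command: literal dots and question marks are escaped.
METHOD_CALLS = /\.empty\?|\.present\?|\.presence|\.blank\?/

# Returns the matching lines of +source+, each prefixed with its line number.
def matching_lines(source)
  source.each_line.with_index(1)
        .select { |line, _n| line =~ METHOD_CALLS }
        .map    { |line, n| "#{n}: #{line.chomp}" }
end

sample = <<~'RUBY'
  cb_error "File path argument must be relative" if
    file_path.to_s.blank? || file_path.to_s =~ /\A\//
  base_dir = start_dir || self.full_cluster_workdir
RUBY
puts matching_lines(sample)
```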