Table of contents
- About strings
- Brief history of regular expressions
- How they are used
- Basic matching
- Special characters
- Grouping and capturing
- Examples
"Les regex sont vos amis", hein Nic?
"pierre.rioux@mcgill.ca" "It was the best of times, it was the worst of times, it was the time of the ADM." "Restaurant L'Hémorragie" "y\n" "699.99" "/data/assembly/1234/V2/new.mnc" "0" "2017-08-23 12:11:40 CBRAIN ERROR: MySql server has gone away" "922734,V6,04/03/98,'subject was drunk',/data/store/922734/rest_1b.mnc.gz" "AACCCTAACCCTAACCCTAACCCTAAGTTAGGATC...GATTATTATAATCGGATAGCTTTTAGGATATAGGGTATAG"
"2017-08-20 13:45 Added new user prioux" "2017-08-21 9:23 Removed user 'prioux'"
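int year, month, day;
char action[32];  /* writable buffer: sscanf() must not write into a string literal */
sscanf("Log: 2017-04-24 Added new user prioux",
       "Log: %d-%d-%d %31s", &year, &month, &day, action);  /* %s stops at the first space, so action is "Added" */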
This solution is still brittle: any change in the layout of the line breaks it. You are also limited to
the set of %X conversion sequences that scanf() supports.
There is little need to use scanf() in modern times.
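For comparison, here is a minimal Ruby sketch of the same extraction done with a regular expression. The helper name `parse_log_line` and the constant `LOG_LINE` are hypothetical, and the pattern assumes the "Log: YYYY-MM-DD action" layout of the sscanf() example.

```ruby
# Hypothetical helper: pulls the date parts and the action text out of a
# line shaped like "Log: 2017-04-24 Added new user prioux".
LOG_LINE = /\ALog: (\d+)-(\d+)-(\d+) (.+)\z/

def parse_log_line(line)
  m = LOG_LINE.match(line)
  return nil unless m # the line does not have the expected shape
  { year: m[1].to_i, month: m[2].to_i, day: m[3].to_i, action: m[4] }
end

p parse_log_line("Log: 2017-04-24 Added new user prioux")
p parse_log_line("Something else entirely")
```

Unlike the `%s` conversion, the last group captures the whole action, spaces included.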
Larry Wall's Perl interpreter
Some observations:
$mystring = "I am the eggman";
$yesno = $mystring =~ /abc/;
mystring = "I am the eggman"
yesno = mystring =~ /abc/
yesno = mystring.match(/abc/)
yesno = /abc/.match(mystring)
$mystring = "I am the walrus";
$yesno = preg_match('/abc/', $mystring);
var mystring = "Goo goo g'joob";
var yesno = /abc/.test(mystring);
Match examples in yellow:
✔ "abc are three letters" ✔ "I learned my abc when I was three" ❌ "Buy three CDs at ABC Music!"
Some are just meant to match themselves:
These two regex match their exact string inside:
/I am clown@circus/
/H2Y 4A2/
(This is basically like a substring search)
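A small Ruby sketch of that substring-like behaviour, using the two patterns above. The helper `found?` and the sample sentences are made up for illustration.

```ruby
# Two literal patterns: every character in them just matches itself.
CLOWN  = /I am clown@circus/
POSTAL = /H2Y 4A2/

# True when +regex+ matches somewhere inside +text+.
def found?(regex, text)
  text.match?(regex)
end

puts found?(CLOWN,  "Yes, I am clown@circus today")
puts found?(POSTAL, "Montreal, QC H2Y 4A2")
# String#include? gives the same verdict for such literal patterns
puts "Montreal, QC H2Y 4A2".include?("H2Y 4A2")
# Matching is case-sensitive by default
puts found?(POSTAL, "Montreal, QC h2y 4a2")
```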
The next few slides will give examples for each of these cases.
stack stalk stank stark steak stick stink stock stork stuck stunk stEAk st!2k st87k st__k st[]k st--k stp\k
And also:
"tallest skyscraper"
[ac/dc]       Matches the character "a", "c", "/" or "d"
[0123459876]  Matches a single digit
[0-9]         Matches a single digit
[dr.]         Matches "d", "r" or "."
[a-z]         All lowercase letters
[a-zA-Z]      All letters
[a-z-]        All lowercase letters and "-" too
Postal Code:
/[ABCEGHJKLMNPRSTVXY][0-9][A-Z] [0-9][A-Z][0-9]/
[^XYZ]  Any character other than "X", "Y" or "Z"
[^a-z]  One that is not a lowercase letter
[^.]    Not a "."
[^0-9]  Not a digit
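As a rough illustration of character classes and negated classes, the following Ruby sketch tests the postal code pattern and uses `[^0-9]` to detect non-digits. The helper names `postal_code?` and `all_digits?` are hypothetical.

```ruby
POSTAL_CODE = /[ABCEGHJKLMNPRSTVXY][0-9][A-Z] [0-9][A-Z][0-9]/
NON_DIGIT   = /[^0-9]/

# True when a postal code appears somewhere in +text+.
def postal_code?(text)
  text.match?(POSTAL_CODE)
end

# True when no character falls outside 0-9 (also true for the empty string).
def all_digits?(text)
  !text.match?(NON_DIGIT)
end

puts postal_code?("H2Y 4A2")  # first letter is in the allowed set
puts postal_code?("Z2Y 4A2")  # "Z" is not in the first class
puts all_digits?("90210")
puts all_digits?("699.99")    # the "." is caught by [^0-9]
```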
It allows the regex to:
/Pierr?e Rioux?/
✔ "Piere Riou" ✔ "Pierre Rioux" ✔ "Pierre Riou" ❌ "Piee Rioux"
✔ "X--X" ✔ "X-G-X" ✔ "X-9-X" ✔ "X-Pa-X" ✔ "X-P8-X" ❌ "X-PA-X" ❌ "X-4Z-X"
It allows the regex to:
/Pierr*e Rioux*/
✔ "Piere Riou" ✔ "Pierre Rioux" ✔ "Pierrrrrrrrre Riouxxxxxxxx" ❌ "Piee Riouxxxxxxxx"
✔ "X--X" ✔ "X-G-X" ✔ "X-9-X" ✔ "X-Pa-X" ✔ "X-P8-X" ✔ "X-COOKIE-X" ✔ "X-m0nst3r-X" ✔ "X-COOKIEm0nst3r-X" ❌ "X-CookieMonst3r-X"
/[0-9]+_V[0-9]+/
✔ "0_V0" ✔ "528222_V12" ❌ "_V7165" ❌ "622122_V"
Generally, /x+/ is the same as /xx*/ for all x.
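The three quantifiers can be checked side by side with a short Ruby sketch. The helper `verdicts` is hypothetical, and the sample strings are the ones used in the examples above.

```ruby
OPTIONAL    = /Pierr?e Rioux?/   # ? : zero or one of the preceding item
ANY_NUMBER  = /Pierr*e Rioux*/   # * : zero or more
ONE_OR_MORE = /[0-9]+_V[0-9]+/   # + : one or more

# Returns one true/false verdict per sample string.
def verdicts(regex, samples)
  samples.map { |s| s.match?(regex) }
end

p verdicts(OPTIONAL,    ["Piere Riou", "Pierre Rioux", "Pierre Riou", "Piee Rioux"])
p verdicts(ANY_NUMBER,  ["Piere Riou", "Pierrrrrrrrre Riouxxxxxxxx", "Piee Riouxxxxxxxx"])
p verdicts(ONE_OR_MORE, ["0_V0", "528222_V12", "_V7165", "622122_V"])

# /x+/ and /xx*/ accept exactly the same strings
PLUS_SAMPLES = ["", "x", "xxx", "yyy", "axb"]
p verdicts(/x+/, PLUS_SAMPLES) == verdicts(/xx*/, PLUS_SAMPLES)
```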
Regex | Meaning |
---|---|
/[0-9]{3}/ | Three digits. Same as /[0-9][0-9][0-9]/ |
/[0-9]{3,}/ | At least 3 digits |
/[0-9]{,7}/ | Up to 7 digits (including 0). Some engines, such as JavaScript's, lack this form; /[0-9]{0,7}/ is the portable spelling |
/[0-9]{3,7}/ | Between 3 and 7 digits |
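A quick Ruby sketch of the four counted forms. The `\A` and `\z` anchors are an addition made here so that the counts apply to the whole string; without them, "1234" would still contain a match for /[0-9]{3}/. The constant names are hypothetical.

```ruby
EXACTLY_3   = /\A[0-9]{3}\z/
AT_LEAST_3  = /\A[0-9]{3,}\z/
UP_TO_7     = /\A[0-9]{,7}\z/    # Ruby accepts the {,n} form
FROM_3_TO_7 = /\A[0-9]{3,7}\z/

puts "123".match?(EXACTLY_3)         # three digits
puts "1234".match?(EXACTLY_3)        # one too many
puts "12".match?(AT_LEAST_3)         # too few
puts "".match?(UP_TO_7)              # zero digits is allowed
puts "12345678".match?(FROM_3_TO_7)  # eight digits is too many
```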
/abc\*def\\xyz\.txt/
This just matches the string 'abc*def\xyz.txt' (15 characters)
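In Ruby, the escaping can be written by hand or delegated to `Regexp.escape`. The sketch below compares both against the 15-character string, and shows what the unescaped `*` would mean instead. The constant names are made up for this example.

```ruby
LITERAL   = 'abc*def\xyz.txt'     # single quotes keep the backslash as-is
BY_HAND   = /abc\*def\\xyz\.txt/  # the regex from above
BY_RUBY   = Regexp.new(Regexp.escape(LITERAL))
UNESCAPED = /abc*def/             # here * means "zero or more c"

puts LITERAL.length
puts BY_HAND.match?(LITERAL)
puts BY_RUBY.match?(LITERAL)
puts UNESCAPED.match?("abdef")    # matches, with zero "c"
puts UNESCAPED.match?("abc*def")  # the literal star is not matched
```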
Unprintable character | |
---|---|
\n | newline (ASCII 10) |
\t | tab (ASCII 9) |
\r | carriage return (ASCII 13) |
\e | escape (ASCII 27) |
... | there are many others... |
Set of characters | |
---|---|
\s | Any whitespace: space, tab, newline, etc |
\d | Any digit. Similar to [0-9] |
\w | Any character used in an identifier: [a-zA-Z0-9_] |
\S | The opposite of \s |
\D | The opposite of \d |
\W | The opposite of \w |
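The shorthand sets combine well with `String#scan` and `String#split` in Ruby. A small sketch, with hypothetical helper names and sample strings borrowed from earlier slides:

```ruby
# All runs of digits found in +text+.
def digit_runs(text)
  text.scan(/\d+/)
end

# All identifier-like words (letters, digits, underscore) found in +text+.
def identifiers(text)
  text.scan(/\w+/)
end

# Splits +text+ on any run of whitespace.
def fields(text)
  text.split(/\s+/)
end

p digit_runs("2017-08-23 12:11:40 CBRAIN ERROR: MySql server has gone away")
p identifiers("Added new user prioux_1!")
p fields("a b\tc\nd")
puts "699.99".match?(/\D/)  # true: the "." is not a digit
```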
/\w+@\w+\.ca/
✔ "pierre@circus.ca" ✔ "My email is pierre@circus.ca, thanks"
/^\w+@\w+\.ca$/
✔ "pierre@circus.ca" ❌ "My email is pierre@circus.ca, thanks"
mystring = "amaryjo\nmary\nryjo\n"
✔ mystring =~ /ary$/    # "amaryjo\nmary\nryjo\n"
✔ mystring =~ /^mar/    # "amaryjo\nmary\nryjo\n"
✔ mystring =~ /^ryjo$/  # "amaryjo\nmary\nryjo\n"
Anchoring a regex can significantly improve performance when scanning large strings, or with complex regular expressions.
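One detail worth a sketch: in Ruby, `^` and `$` match at the start and end of every line inside the string, while `\A` and `\z` anchor to the whole string. The constant names below are hypothetical.

```ruby
EMAIL_ANYWHERE = /\w+@\w+\.ca/
EMAIL_ONLY     = /\A\w+@\w+\.ca\z/  # whole-string anchors
LINES          = "amaryjo\nmary\nryjo\n"

puts "My email is pierre@circus.ca, thanks".match?(EMAIL_ANYWHERE)
puts "My email is pierre@circus.ca, thanks".match?(EMAIL_ONLY)
puts "pierre@circus.ca".match?(EMAIL_ONLY)

# Line anchors succeed on the second line; string anchors do not
puts LINES.match?(/^mar/)
puts LINES.match?(/\Amar/)
puts LINES.match?(/ary$/)
puts LINES.match?(/ary\z/)
```

When validating a whole value in Ruby, `\A` and `\z` are usually the safer choice, since `^` and `$` would also accept a matching line buried in a multi-line string.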
/taxi$|tra?in|h.ver.raft/
✔ "My hovercraft is full of eels" ✔ "Planes, trains, and automobiles" ❌ "I took a taxi and paid $20"
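To see which alternative actually matched, Ruby's `String#[]` can return the matched text. A minimal sketch, where `transport_word` is a hypothetical helper:

```ruby
TRANSPORT = /taxi$|tra?in|h.ver.raft/

# Returns the text matched by one of the alternatives, or nil.
def transport_word(text)
  text[TRANSPORT]
end

p transport_word("My hovercraft is full of eels")
p transport_word("Planes, trains, and automobiles")
p transport_word("I took a taxi and paid $20")  # "taxi" is not at the end
p transport_word("Call me a taxi")
```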
/(Malkovitch)+/
✔ "My name is Malkovitch, indeed." ✔ "Being John MalkovitchMalkovitchMalkovitchMalkovitch."
/a (robot|(human)+|vegetable)? truly/
✔ "I am a robot truly" ✔ "I am a vegetable truly" ✔ "I am a humanhuman truly" ✔ "I am a  truly" (two spaces) ❌ "I am a robothumanvegetable truly"
/(a[0-9]z)+/
✔ "a0za0za0za0za0z" ✔ "a0za1za9za3za7za4z"
The + applies to the regular expression, not the matched string.
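A Ruby sketch of that point: the whole run of repetitions is matched, each repetition may use a different digit, and the capture group remembers only the last repetition. The helper `repeated_match` is hypothetical.

```ruby
REPEATED      = /(a[0-9]z)+/
OPTIONAL_KIND = /a (robot|(human)+|vegetable)? truly/

# Returns [whole match, content of group 1], or nil when nothing matches.
def repeated_match(text)
  m = REPEATED.match(text)
  m && [m[0], m[1]]
end

p repeated_match("a0za1za9za3za7za4z")

puts "I am a humanhuman truly".match?(OPTIONAL_KIND)
puts "I am a robothumanvegetable truly".match?(OPTIONAL_KIND)
# With the optional group absent, both literal spaces must still be there
puts "I am a  truly".match?(OPTIONAL_KIND)
puts "I am a truly".match?(OPTIONAL_KIND)
```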
status = "Restarting Setup"
status = "Restarting Cluster"
status = "Restarting PostProcessing"
status = "Recovering Setup"
status = "Recovering Cluster"
status = "Recovering PostProcessing"
matchinfo = status.match(/(Recovering|Restarting) (\S+)/)
operation = matchinfo[1]   # matchinfo[0] is the whole matched text
stage     = matchinfo[2]
When matching using Ruby, we get the first word into variable operation and the second word into variable stage.
my $status = "Restarting PostProcessing";
$status =~ /(Recovering|Restarting) (\S+)/;
my $operation = $1;
my $stage = $2;
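For reference, a self-contained Ruby sketch of the same capture, once with numbered groups and once with named groups. The helper names are hypothetical; named groups are a Ruby feature not otherwise covered on these slides.

```ruby
STATUS       = /(Recovering|Restarting) (\S+)/
NAMED_STATUS = /(?<operation>Recovering|Restarting) (?<stage>\S+)/

# Numbered groups: index 0 is the whole match, 1 and 2 are the captures.
def split_status(status)
  m = STATUS.match(status)
  return nil unless m
  [m[1], m[2]]
end

# Named groups: the captures are looked up by name instead of position.
def split_status_named(status)
  m = NAMED_STATUS.match(status)
  return nil unless m
  { operation: m[:operation], stage: m[:stage] }
end

p split_status("Restarting PostProcessing")
p split_status_named("Recovering Cluster")
p split_status("Completed")
```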
/([a-z]+)-\d+-([a-z]+)/
✔ "Product code abc-123-def is sold out" ✔ "Product code abc-123-abc is sold out"
/([a-z]+)-\d+-\1/
❌ "Product code abc-123-def is sold out" ✔ "Product code abc-123-abc is sold out"
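The backreference can be tried out with a few lines of Ruby. `same_ends?` is a hypothetical helper name.

```ruby
SAME_ENDS = /([a-z]+)-\d+-\1/

# True when the letters after the number repeat the letters captured before it.
def same_ends?(text)
  text.match?(SAME_ENDS)
end

puts same_ends?("Product code abc-123-def is sold out")
puts same_ends?("Product code abc-123-abc is sold out")
```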
❌ "0123abcdxyzabcdxyzabcdxyz0123" ✔ "0123abcdxyzabcdxyzabcdxyz0123"
/[0-9]+$/
❌ "Beverly Hills 90210" ✔ "Beverly Hills 90210"
❌ "0123abcdxyzabcdxyzabcdxyz0123" ✔ "0123abcdxyzabcdxyzabcdxyz0123"
Modifier | Effect | Example |
---|---|---|
i | Regex is case-insensitive | "H2y" =~ /[a-z][0-9][a-z]/i |
g | Regex matches globally, each match resuming from the position of the previous one | while ("one two three" =~ /(\w+)/gi) { print "$1\n"; } # Perl example; the loop body executes 3 times |
x | Allow spaces and comments for readability | email =~ / ([a-z0-9\.\-]+) # username @ ([a-z0-9\.\-]+) # hostname /x |
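A Ruby sketch of the three modifiers. Ruby has no `g` modifier; `String#scan` plays the same role as the Perl loop. The names `all_words`, `split_email` and the constants are hypothetical.

```ruby
POSTAL_PART = /[a-z][0-9][a-z]/i  # i: case-insensitive

# Ruby's counterpart to Perl's /g loop: collect every match in turn.
def all_words(text)
  text.scan(/\w+/)
end

# x: spaces and comments inside the regex are ignored
EMAIL = /
  ([a-z0-9.\-]+)   # username
  @
  ([a-z0-9.\-]+)   # hostname
/xi

# Returns [username, hostname], or nil when no address is found.
def split_email(text)
  m = EMAIL.match(text)
  m && [m[1], m[2]]
end

puts "H2y".match?(POSTAL_PART)
p all_words("one two three")
p split_email("pierre.rioux@mcgill.ca")
```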
# Returns true if the given +userfile+ is named according
# to the LORIS CCNA Phantom convention for integration with CBRAIN,
# which means we can find a subject ID and a visit ID.
#
# Example: 'ibis_417879_Initial_MRI_t1w_001.mnc'
# 'ibis_123456_lego_phantom_SD_ABCD_20001212_12CH_t1w_0183.mnc'
#
# The visit ID can be a complex string with underscores in it.
# See the regex in the code.
#
# Returns the subject ID and the visit ID if the convention
# is respected; for the example, the "417879" and "Initial_MRI" parts.
def named_according_to_LORIS_convention?(userfile)
if userfile.name =~ /\A
[a-zA-Z0-9]+ # arbitrary prefix
_(\d\d\d\d\d\d) # $1 Subject ID
_(Initial_MRI| # $2 Visit, one of "Initial_MRI" or ...
(lego|human)
_phantom
_(L1|SD)
_([A-Z]{3,4}) # three or four uppercase letters
_\d\d\d\d \d\d \d\d # year month day
(_(12CH|32CH))? # optional suffix of Visit
)
/x
[ Regexp.last_match[1], Regexp.last_match[2] ]
end
end
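To try the method outside CBRAIN, the following sketch copies the pattern into a constant and feeds it the two example names from the comment. `FakeUserfile` is a hypothetical stand-in that models only the `name` attribute, and `loris_ids` is a made-up name for this standalone version.

```ruby
# Stand-in for a CBRAIN userfile; only the name attribute is modelled.
FakeUserfile = Struct.new(:name)

LORIS_NAME = /\A
  [a-zA-Z0-9]+           # arbitrary prefix
  _(\d\d\d\d\d\d)        # capture 1, the subject ID
  _(Initial_MRI|         # capture 2, the visit: Initial_MRI or a phantom name
    (lego|human)
    _phantom
    _(L1|SD)
    _([A-Z]{3,4})        # three or four uppercase letters
    _\d\d\d\d \d\d \d\d  # year month day, the spaces are ignored under x
    (_(12CH|32CH))?      # optional suffix of the visit
  )
/x

# Returns [subject ID, visit ID], or nil when the name does not fit.
def loris_ids(userfile)
  m = LORIS_NAME.match(userfile.name)
  m && [m[1], m[2]]
end

p loris_ids(FakeUserfile.new("ibis_417879_Initial_MRI_t1w_001.mnc"))
p loris_ids(FakeUserfile.new("ibis_123456_lego_phantom_SD_ABCD_20001212_12CH_t1w_0183.mnc"))
p loris_ids(FakeUserfile.new("notes.txt"))
```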
Software | Differences | Notes |
---|---|---|
Perl | has everything! | |
Ruby | mostly like Perl | |
PHP | mostly like Perl | |
JavaScript | more limited | no comments with /x |
grep, sed, etc | Basic: no +, no alternation (the pipe character), no \w, \s, \d, etc | Use grep -P to get Perl's engine; modifiers are options. |
egrep | is grep, but adds +, ?, alternation and grouping | Same as grep -E |
Note: Wikipedia has some nice, detailed comparison tables.
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxY"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
I have a file at home called 's'. Its content:
s
sa
sabbath
sabbatical
Sabina
Sabine
sable
sabotage
sabra
sac
saccharine
sachem
Sachs
sack
(about 2600 more lines omitted)
62 Nov-15 23:00:19 grep 's....' s
63 Nov-15 23:00:24 grep 's....$' s
64 Nov-15 23:00:31 grep '^s....$' s
65 Nov-15 23:00:46 grep '^s....$' s | cut -c2
66 Nov-15 23:00:53 grep '^s....$' s | cut -c2 | uniq -c
67 Nov-15 23:01:10 grep '^s....$' s | grep '^st'
68 Nov-15 23:01:56 grep '^s....$' s | grep '^st..k'
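The same filtering steps can be sketched in Ruby with `Array#grep`. The word list below is a small illustrative sample, not the real content of the file 's', and the helper names are hypothetical.

```ruby
# Illustrative sample only; the real file has about 2600 more words.
SAMPLE_WORDS = %w[s sa sabbath Sabina sable sabra sac Sachs sack stack stick stork]

# Like: grep '^s....$' s
def five_letter_s_words(words)
  words.grep(/^s....$/)
end

# Like: ... | cut -c2 | uniq -c   (on sorted input, where equal letters are adjacent)
def second_letter_counts(words)
  five_letter_s_words(words).each_with_object(Hash.new(0)) { |w, counts| counts[w[1]] += 1 }
end

# Like: ... | grep '^st..k'
def st_k_words(words)
  five_letter_s_words(words).grep(/^st..k/)
end

p five_letter_s_words(SAMPLE_WORDS)
p second_letter_counts(SAMPLE_WORDS)
p st_k_words(SAMPLE_WORDS)
```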
.present?() .presence() .blank?() .empty?()
The details of why are not important here;
suffice it to say it was for the R5 release.
find . -type f -name "*.rb" -print | \
  xargs grep -B3 -A3 -E '\.empty\?|\.present\?|\.presence|\.blank\?' | \
  highlight -l -r '\.empty\?|\.present\?|\.presence|\.blank\?' yellow | \
  less
This created a report of about 9000 lines:
--
./BrainPortal/app/models/cluster_task.rb- # to a location different than the original task directory.
./BrainPortal/app/models/cluster_task.rb- def make_available(userfile, file_path, userfile_sub_path = nil, start_dir = nil)
./BrainPortal/app/models/cluster_task.rb- cb_error "File path argument must be relative" if
./BrainPortal/app/models/cluster_task.rb: file_path.to_s.blank? || file_path.to_s =~ /\A\//
./BrainPortal/app/models/cluster_task.rb-
./BrainPortal/app/models/cluster_task.rb- # Determine starting dir for relative symlink target calculation
./BrainPortal/app/models/cluster_task.rb- base_dir = start_dir || self.full_cluster_workdir
--
--
./BrainPortal/app/models/cluster_task.rb- workdir_path = Pathname.new(self.cluster_shared_dir)
./BrainPortal/app/models/cluster_task.rb- dp_cache_path = Pathname.new(RemoteResource.current_resource.dp_cache_dir)
./BrainPortal/app/models/cluster_task.rb- userfile_path = Pathname.new(userfile.cache_full_path)
./BrainPortal/app/models/cluster_task.rb: userfile_path += Pathname.new(userfile_sub_path) if userfile_sub_path.to_s.present?
./BrainPortal/app/models/cluster_task.rb-
./BrainPortal/app/models/cluster_task.rb- # Try to figure out if the DataProvider of userfile is local, or remote.
./BrainPortal/app/models/cluster_task.rb- # SmartDataProviders supply the real implementation under a real_provider method;
--