xkcd: Regular Expressions (CC-BY-NC 2.5)

How to make a regex pattern that matches except if…

Short answer

We will take an example: match all /* ... */-type comments except when they appear inside a string (in our example the language is JavaScript, so both single-quoted and double-quoted strings have to be excluded).

  1. Write the regex for what you want to match (later called m; our example: /\*.*?\*/, see note 1 at the end).
  2. Write the regular expressions for matching your exclusions (later called e1, e2, e3; our example: '[^']+' and "[^"]+").

Regex #1

  1. Write your final regex: e1|e2|(m) (our example: '[^']+'|"[^"]+"|(/\*.*?\*/)).
  2. Match group 1 instead of matching group 0.
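The two steps above can be sketched in Python as follows. This is only an illustration: the function name find_comments and the sample string are made up for the example, and the patterns are the ones from the article, so empty strings, escaped quotes and multi-line comments are not handled.

```python
import re

# e1 | e2 | (m): exclusions first, the wanted pattern last, captured in group 1.
PATTERN = re.compile(r"""'[^']+'|"[^"]+"|(/\*.*?\*/)""")

def find_comments(source):
    """Return the /* ... */ comments that are not inside a string literal."""
    # Group 1 is None whenever one of the exclusions matched instead of m.
    return [m.group(1) for m in PATTERN.finditer(source) if m.group(1) is not None]

sample = """var a = '/* not a comment */'; /* real comment */ var b = "/* nope */";"""
print(find_comments(sample))
```

Filtering on group 1 is what "match group 1 instead of group 0" amounts to in practice: the strings are still matched, they are just thrown away.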

Note: This requires code to inspect the result. If you need something simpler, such as a one-line command (where you might reach for sed and end up using Perl instead) or a pattern you can use with a plain match, see below.

Regex #2: alternative for Perl and PCRE (PHP)

  1. Write your final regex: (?:e1|e2)(*SKIP)(?!)|m (our example: (?:'[^']+'|"[^"]+")(*SKIP)(?!)|/\*.*?\*/).
  2. Match group 0 as usual.

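Note: You can use this expression in a Perl one-liner, for example to delete the matched comments: perl -pe 's{(?:\x27[^\x27]+\x27|"[^"]+")(*SKIP)(?!)|/\*.*?\*/}{}g' (\x27 stands for the single quote, which cannot be escaped inside a single-quoted shell string).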

You can test the regex and have it explained at this link.

Something more detailed

Regex #1

How does it work?

This is quite readable: the regex first tries each of your exclusions, then tests what you are looking for last. Once an exclusion has matched, the engine has consumed that text and resumes after it, so the pattern you want cannot match inside it. Quite simple, when you come to think of it.

Using group 1 makes sure only the pattern you are interested in is matched.
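The same idea also works for replacements. Here is a minimal Python sketch, assuming you want to strip comments while leaving strings untouched; strip_comments is a hypothetical helper, not something from the article.

```python
import re

PATTERN = re.compile(r"""'[^']+'|"[^"]+"|(/\*.*?\*/)""")

def strip_comments(source):
    """Remove /* ... */ comments, leaving string literals as they are."""
    def replace(match):
        # Group 1 is set only when the comment alternative matched.
        if match.group(1) is not None:
            return ""
        # Otherwise an exclusion matched: put the string back unchanged.
        return match.group(0)
    return PATTERN.sub(replace, source)

print(strip_comments("x = '/* keep */'; /* drop */ y = 1;"))
```

The callback is needed because the exclusions are part of every match; returning group 0 for them is what keeps the strings in the output.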

This has been called The Greatest Regex Trick Ever and it is pretty neat. You just need to think of it.

Caveat

Only group 1 holds what you want: the overall match (group 0) also includes the exclusions, so you need code to check group 1 and cannot use a plain match.
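To see the caveat concretely, here is a small Python sketch (the sample string is made up) that lists what group 0 and group 1 contain for each match.

```python
import re

PATTERN = re.compile(r"""'[^']+'|"[^"]+"|(/\*.*?\*/)""")
sample = "a = 'x'; /* c */ b = \"y\";"

# Group 0: every match, exclusions included.
whole_matches = [m.group(0) for m in PATTERN.finditer(sample)]

# With one capturing group, findall returns group 1 for every match,
# which is an empty string when an exclusion matched.
group_one = PATTERN.findall(sample)

print(whole_matches)
print(group_one)
```

The strings show up in the first list and as empty entries in the second, which is why a filtering step is unavoidable with this approach.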

Regex #2

How does it work?

To make things short (a more detailed explanation is available here), you write your exclusions and follow them with (*SKIP) and an empty negative lookahead, (?!), which always fails. In English: when an exclusion matches, (?!) forces that match to fail, and (*SKIP) tells the engine not to retry from any position inside the text it has just consumed, so the search resumes after the exclusion.

To make things easier to read: (?!) can be replaced with (*F) or (*FAIL).

Caveat

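It is not available in every language: (*SKIP) and (*FAIL) are backtracking control verbs supported by Perl and PCRE, but many other engines (JavaScript's, for instance) do not know them.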


Sources:


  1. .*? is the non-greedy version of .*, meaning the engine will search for the shortest text matching the expression. This allows us not to use a lookahead to ensure the regex does not match something like /* comment 1 */ some code /* comment 2 */ as a single comment. Word of caution: as I recently discovered, the non-greedy operator is not supported by all engines (e.g. sed does not know it, and you may need to use Perl instead).
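The difference between the greedy and non-greedy forms can be checked quickly in Python, using the example line from the note above.

```python
import re

code = "/* comment 1 */ some code /* comment 2 */"

# Greedy: .* runs to the last */ on the line, swallowing the code in between.
greedy = re.findall(r"/\*.*\*/", code)

# Non-greedy: .*? stops at the first */ it can, giving one match per comment.
lazy = re.findall(r"/\*.*?\*/", code)

print(greedy)
print(lazy)
```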

Published by

Cyrille Chopelet

Programming addict, UX philosopher, casual gamer, sci-fi enthusiast, hi-tech dilettante, ... Some people even call me a geek.
