Where's my negation?

Anybody interested in a programming puzzle? Des? Warning: This post is full of seemingly incomprehensible strings of symbols that programmers call regular expressions, which actually do something very useful. If that doesn’t turn you on, better to quit now. If you’re still here, consider this…

Have you ever wondered why programming languages that have good regular expression support (Perl, Javascript, Ruby, etc.) nevertheless omit syntax for negation or conjunction within regexes? I’m sure there must be a good reason for this. Has it been found that people just don’t intuitively get what negation does to a regex? Are language designers unwilling to complicate their syntax? Are they avoiding the processing it takes to complement or intersect finite state machines? They do compile to finite state machines, right?

Let me give you an example of where negation would be useful. Say I’m scanning through some text for some sub-sections which have been explicitly quoted. If the delimiting characters are angle brackets then I can write /<([^>]*)>/. The bit in the middle [^>] matches any character which is not a ‘>‘. This is a limited form of negation on a single character. Taken as a whole the expression says: Find me the strings of non-‘>‘ characters which are surrounded by a ‘<' on the left and a '>' on the right. So far so good.

But what if the sub-sections I'm interested in are bounded by multi-character strings? '<<<' and '>>>' for argument's sake. In analogy to the example above I want to be able to write /<<<(?^.*>>>.*)>>>/. I've invented some syntax for negation here: (?^R) means match anything that the regex R doesn't match. The R in my example (.*>>>.*) matches any string that contains the sequence '>>>' so the expression as a whole says: Find me the strings that do not contain '>>>' that are bordered by '<<<' on the left and '>>>' on the right. Now, in this example, the same can be accomplished with a "lazy star" (if your programming language supports that) which just makes the regex non-greedy within the bounding strings: /<<<(.*?)>>>/. But I would much rather state the patterns I'm looking for declaratively than change the ambiguity-resolution mechanism of the processor. Why can't I?

Any finite-state toolkit with regex support has syntax for negation and conjunction. Do you know of any programming languages that do? If not, why not?

Comments are closed.