There is a persistent meme out there that matching regular expressions with back-references is NP-Hard. There is a post about this and the claim is repeated by Russ Cox so this is now part of received wisdom.

None of these claims are false; they just don’t apply to regular expression matching in the sense that most people would imagine (any more than, say, someone would claim, “colloquially” that summing a list of N integers is O(N^2) since it’s quite possible that each integer might be N bits long). It depends on the generally unfamiliar notion that the regular expression being *matched *might be arbitrarily varied to add more back-references.

These constructions rely on being able to add more things to the regular expression as the size of the problem that’s being reduced to ‘regex matching with back-references’ gets bigger.

Suppose, instead, as per more common practice, we are considering the difficulty of matching a fixed regular expressions with one or more back-references against an input of size N.

Is **this** task is in P? That is, is there a polynomial-time algorithm in the size of the input that will tell us whether this back-reference containing regular expression matched?

Note that back-references in a regular expression don’t “lock” – so the pattern /((\wx)\2)z/ will match “axaxbxbxz” (EDIT: sorry, I originally fat-fingered this example). So, sadly, we can’t just enumerate all starts and ending positions of every back-reference (say there are k backreferences) for a bad but polynomial-time algorithm (this would be O(N^2k) runs of our algorithm without back-references, so if we had a O(N) algorithm we could solve it in O(N^(2k+1)). Unfortunately, this construction doesn’t work – the capturing parentheses to which the back-references occur update, and so there can be numerous instances of them.

Note that even a lousy algorithm for establishing that this is possible suffices. So if there’s a construction that shows that we can match regular expressions with k backreferences in O(N^(100k^2+10000)) we’d still be in P, even if the algorithm is rubbish. I’ve read that (I forget the source) that, informally, a lousy poly-time algorithm can often be improved, but an exponential-time algorithm is intractable. So knowing that this problem was in P would be helpful.

So I’m curious – are there any either (a) results showing that *fixed* regex matching with back-references is also NP-hard, or (b) results, possibly the construction of a dreadfully naive algorithm, showing that it can be polynomial?

I think matching regex with backreferences, with a fixed number of captured groups k, is in P.

Here’s an implementation which I think achieves that:

https://github.com/travisdowns/polyregex

The basic idea is the same as the proof sketch on Twitter:

The bound I found is O(n^(2k+2)) time and O(n^(2k+1)) space, which is very slightly different than the bound in the Twitter thread (because of the way actual backreference instances are expanded).

This isn’t meant to be a useful regex matcher, just a proof of concept! Even apart from being totally unoptimized, an O(n^20) algorithm (with 9 backrefs), might as well be exponential for most inputs.

Still, it may be the first matcher that doesn’t explode exponentially and yet supports backreferences.

LikeLike

I am not satisfied with the idea that there are n^(2k) start/stop pairs in the input for k backreferences. If a capturing subexpression and the corresponding backref appear inside a loop it will take on multiple different values – potentially O(n) different values.

LikeLike

I probably should have been more precise with my language: at any one time (while handing a given character in the input), for a single state (aka “path”), there is a single start/stop position (including the possibility of “not captured”) for each capturing group. As you move on to later characters, that can definitely change – so the start/stop pair for each backreference can change up to n times for an n-length string. That’s fine though, and in fact it doesn’t even end up changing the order.

They key is that capturing groups have no “memory” – when a group gets captured for the second time, what got captured the first time doesn’t matter any more, later behavior only depends on the last match. That prevents the exponential blowup and allows us to represent everything in O(n^(2k+1)) states (since the state only depends on the last match).

Yes, there are a lot of paths, but only polynomially many, if you do it right. I have put a more detailed explanation along with results from actually running polyregex on the issue you created:

https://github.com/travisdowns/polyregex/issues/2

LikeLike