Tried solving today's using only regex, and it turns out that it's not easy to match a string containing any character consecutively repeated exactly 2 times (but no more).

For example, I want to match "abccd", but not "abcccd".

Follow

It's simple for a single character, e.g. the "c" from the example above:

(?<!c)cc(?!c)

This uses negative lookbehind and lookahead to make sure the two c's are not preceeded or followed by any other c's.

But if you want to make it generic, e.g. match any letter, you might try using backreferences:

(?<!\1)(\w)\1(?!\1)

This produces the error: "lookbehind assertion is not fixed length" since the regex engine cannot know how big the backref is.

Another attempt was using the \K delimiter which allows for variable-length lookabehinds by matching anything before it and discarding it. However there is no negative variant of \K so it's of no use here.

Finally, I tried matching the character in the lookbehind, and while this kind of thing works for a positive lookbehind, e.g. this matches any letter preceeded by itself:

(?<=(.))\1

It does not work for negative lookbehind. This does not match anything:

(?<!(.))\1\1

So, an hour later I'm no wiser. :) Sounds like it should be pretty easy, and I'm possibly missing something obvious. If anyone knows the solution let me know.

This is trivially solved in code, and I've done so, but I like messing around with regexes and this would make me happy.

@ihabunek this would have been my first thought

(?<=(.))[\1]{2}[^\1]

@hirojin I did think of something similar, but [\1] matches character with ASCII code 1 (base 80), and not the backref. E.g. [\80] would match 0.

As far as I can tell, it's not possible to use back-references in square brackets.

@ihabunek
You're doing it wrong.
Lookahead/lookbehind are regex extensions that make it not really regex - you can match some non-regular languages with them.

For a single character, I'd instead write:

^(.*[^c])?cc([^c].*)?$

or, if you want these 2 c's to be the _only_ occurence of c:

^[^c]*cc[^c]*$

Now, to do it for any characters, you should take advantage of the fact that your alphabet is finite and fixed-size, and build a big regex which is the above regex for each character, all joined with a big alternative ('|').

You could eg. generate that regex with Python:

regex = '|'.join('^(.*[^%s])?%s%s([^%s].*)?$' % (c, c, c, c) for c in alphabet)

But you could also do it by hand, or with sed.

If alphabet contains special characters, some of them might need escaping, but that should be relatively easy.

@Wolf480pl I did generate the pattern in the end (in my case the alphabet was [0-9] so only 10 options) but that is pretty ugly, and I was wondering if it's possible to make a shorter regex pattern.

I don't agree about not using extensions though, if they're available I don't know of a good reason not to use them.

@ihabunek what were you been doing on your Automata class?

@ihabunek
Sorry for the post above, that was unhelpful and rude of me.

Regex extensions are fine if you can use them well, but then what you write shouldn't really be called "regular expression".

In CompSci, "regular expression" is something that can be evaluated by a finite state machine. Some regex extensions allow you to match things like a^n b^n ('a' n times, then 'b' n times) which requires a stack, and couldn't possibly be a regular language.
With Perl regexes you can[1] even match a^n b^n c^n, which is a context-sensitive language and requires a Turing Machine.
And that means you don't know how much time or RAM it'll take to evaluate such regex.

[1]: stackoverflow.com/questions/29

Sign in to participate in the conversation
Mastodon

Server run by the main developers of the project 🐘 It is not focused on any particular niche interest - everyone is welcome as long as you follow our code of conduct!