Follow

I made a regex to match all multiples of 7, but it was >10,000 characters so grep couldn't handle it.

@olligobber I tried writing this myself and couldn't get it down below 30k characters for the multiples of 7. And my code spits out something 200k chars long for the trivial multiples of 10, when "[0-9]*0" would work. Its output complexity for base \(b\) appears to be something like \(4^b\). Maybe there's a better way?

@11011110 I minimise my DFA before converting it to regex, which helps with cases like multiples of 10, but I still get extremely large answers for small numbers, like 6 million characters for multiples of 13.

@11011110 Also, it seems the order I remove the DFA states is very important to the size. 10k was the best I could do by trying all orders, but just trying random orders gets results from 13k to 60k.

@11011110 I can see where \(4^n\) would come from in my method; removing each state approximately quadruples the size of each regex.

@olligobber I didn't realize the state elimination ordering was important — I was just doing it in the most obvious loop ordering, and without DFA minimization (although I do also have code for that already written somewhere). Anyway, the standard conversion algorithm is en.wikipedia.org/wiki/Kleene%2

@11011110 That looks like the algorithm I'm using. My algorithm removes the states in a random order, and has options to try multiple orders and take the smallest regex, or to try all orders and take the smallest regex. I notice that article says something about the output being about \(4^n\) in length.

@olligobber According to this paper one can do significantly better than \(4^n\) (although still necessarily exponential) for binary alphabets: doi.org/10.1007/978-3-540-8578

Sign in to participate in the conversation
Mathstodon

A Mastodon instance for maths people. The kind of people who make \(\pi z^2 \times a\) jokes. Use \( and \) for inline LaTeX, and \[ and \] for display mode.