Regular Expressions Enhancements

Programmer’s Notepad currently uses two different regex engines for different parts of the code:

  1. The excellent PCRE: Used by the PN code for matching output strings and in a couple of other internal bits of code.

  2. The tiny engine built into Scintilla. This is a very limited regular expressions engine, designed for embedded use of Scintilla rather than for a fully featured text editor. It’s currently used for all user regex searches.

The original plan was to switch to PCRE as the engine for searching in the editor as well. However, PCRE has a rather unfortunate design issue: it expects the text being searched to be a single contiguous char* buffer in memory. Scintilla (quite rightly) doesn’t provide access to the text you are editing as a single memory buffer, so there is a fundamental incompatibility between Scintilla and PCRE. I could of course simply copy the entire document into an extra memory buffer and run PCRE on that, but it’s a very wasteful solution and not one that I’m willing to entertain.

Other libraries work in a much nicer way, using iterators. This allows you to define a custom iterator that walks over Scintilla’s data store, neatly avoiding the need to hand the regex engine a full buffer.
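To make the iterator idea concrete, here is a minimal sketch. It makes a few assumptions: it uses std::regex as a stand-in, since it takes a pair of bidirectional iterators in the same style as Boost::Regex, and it does not touch Scintilla’s real API at all. SegmentedText, DocIterator and find_first are hypothetical names invented for this example. The document is held in two separate pieces, so no single char* covers all of it.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical stand-in for an editor's text store: the text lives in two
// separate pieces (as in a gap buffer), so there is no contiguous buffer.
struct SegmentedText {
    std::string before_gap;
    std::string after_gap;

    std::size_t size() const { return before_gap.size() + after_gap.size(); }

    // Map a logical document position onto whichever piece holds it.
    const char& at(std::size_t pos) const {
        assert(pos < size());
        return pos < before_gap.size() ? before_gap[pos]
                                       : after_gap[pos - before_gap.size()];
    }
};

// Bidirectional iterator that reads characters by logical position.
class DocIterator {
public:
    using iterator_category = std::bidirectional_iterator_tag;
    using value_type = char;
    using difference_type = std::ptrdiff_t;
    using pointer = const char*;
    using reference = const char&;

    DocIterator() = default;
    DocIterator(const SegmentedText* doc, std::size_t pos)
        : doc_(doc), pos_(pos) {}

    reference operator*() const { return doc_->at(pos_); }
    DocIterator& operator++() { ++pos_; return *this; }
    DocIterator operator++(int) { DocIterator old = *this; ++pos_; return old; }
    DocIterator& operator--() { --pos_; return *this; }
    DocIterator operator--(int) { DocIterator old = *this; --pos_; return old; }
    bool operator==(const DocIterator& other) const { return pos_ == other.pos_; }
    bool operator!=(const DocIterator& other) const { return pos_ != other.pos_; }

    std::size_t position() const { return pos_; }

private:
    const SegmentedText* doc_ = nullptr;
    std::size_t pos_ = 0;
};

struct Match {
    bool found;
    std::size_t start;  // logical position of the match in the document
    std::string text;   // matched text, copied out of the document
};

// Search the document in place; only the matched text is ever copied.
Match find_first(const SegmentedText& doc, const std::string& pattern) {
    const std::regex re(pattern);
    std::match_results<DocIterator> results;
    const DocIterator first(&doc, 0);
    const DocIterator last(&doc, doc.size());
    if (!std::regex_search(first, last, results, re)) {
        return {false, 0, ""};
    }
    return {true, results[0].first.position(), results[0].str()};
}
```

With a document split as "hello wo" and "rld 2024", calling find_first(doc, "wor\\w+") finds "world" even though the match straddles the two pieces. A real adapter would read through whatever accessor the editor component provides rather than a two-string struct, but the shape of the iterator would be much the same.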

Other libraries to consider:

  1. Boost::Regex: PN already has a boost dependency so I don’t have a big issue with adding regex. There are two current issues:

     a. Boost::Regex supports Unicode expressions by using ICU. Bundling ICU will add at least 1 MB of code (I’m still building it to find a total) to the distribution. This is a lot compared to the rest of PN!

     b. Currently Boost::Regex does not support named groups. This is an important regex feature that PN makes use of to support arbitrary output matching.

  2. GRETA: A regular expressions library from Microsoft with features similar to those of Boost::Regex. This also doesn’t support named groups, and doesn’t seem to have UTF-8 Unicode support either, relying instead on wchar_t, which is no use to PN.

Others I’ve discarded due to lack of iterator support include the engine built into ICU, Oniguruma, and GNU regex.

I haven’t yet decided which way to go. There is a Google Summer of Code project to add named groups to Boost::Regex, which would at least remove that blocker, leaving only the size cost of Unicode support. Alternatively I could try to retrofit iterators to PCRE - something that sounds like a lot of hard work!

One way or another, PN will transition to full regex support in the editor.

P.S. In the comments Sebastian points out the highly useful Wikipedia article comparing regular expression engines. This would have saved me a bunch of time if I’d found it earlier!