Boost::Xpressive and Scintilla

Programmer’s Notepad has long needed an improved Regular Expressions engine. Currently PN uses PCRE for all tasks but searching Scintilla. This is because PCRE doesn’t support searching anything but a memory buffer - i.e. it doesn’t support iterators. We need iterator (or indirect access) support because a regex engine for a text editor can’t expect all text for the editor to be in a single contiguous memory block.

Boost::Regex has been suggested several times, but it still doesn’t support named captures. When users are allowed to specify regular expressions for use in parsing, named captures can significantly simplify the process. For example, when using a regular expression to parse compiler output we have two alternatives:

  1. \s*(?P<f>.+)(?P<l>[0-9]+)(,(?P<c>[0-9]+))?\s*:

This uses the standard named capture syntax to name the three capture blocks: “f” for filename, “l” for line and “c” for column. The single expression can be parsed and understood by PN without the user having to understand capture indexing.

  2. \s*(.+)([0-9]+)(,([0-9]+))?\s*:

This uses basic regular expression capture groups and results in the user having to enter three additional pieces of non-obvious data: the capture index for each capture. In this case these would be 1, 2, and 4 but this would potentially change for each output pattern.
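To make the index problem concrete, here is a minimal sketch that reads the three values by position. It uses std::regex purely as a stand-in (it is not the engine PN uses), and the pattern is a hypothetical variant with parentheses around the position, chosen so that output such as "main.cpp(12,5):" splits cleanly. The struct and function names are made up for illustration.

```cpp
#include <regex>
#include <string>

// Result of parsing one line of (hypothetical) compiler output.
struct ErrorLocation
{
    bool matched;
    std::string file;
    std::string line;
    std::string column;  // empty when the output carries no column
};

// Hypothetical pattern for output such as "main.cpp(12,5): error ...".
// Groups: 1 = file, 2 = line, 3 = the ",column" wrapper, 4 = column.
ErrorLocation parseWithIndexes(const std::string& output)
{
    static const std::regex pattern(R"(\s*(.+)\(([0-9]+)(,([0-9]+))?\)\s*:)");

    ErrorLocation loc = { false, "", "", "" };
    std::smatch m;
    if (std::regex_search(output, m, pattern))
    {
        loc.matched = true;
        loc.file = m[1].str();
        loc.line = m[2].str();
        loc.column = m[4].str();  // 4, not 3: group 3 includes the comma
    }
    return loc;
}
```

The caller has to know that the column lives in group 4, and that number shifts as soon as the pattern gains or loses a pair of parentheses.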

I believe that named captures significantly improve the user experience here. PN already uses %f, %l and %c to represent the three named capture groups, so users don’t even need to understand regular expression capture syntax to use them.
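One way the %f, %l and %c placeholders could be turned into named captures is a simple textual expansion before the pattern is compiled. The sketch below is an assumption about how such an expansion might look, not PN's actual code: the function name is hypothetical, and the group bodies (".+" for the filename, "[0-9]+" for the numbers) are simply borrowed from the example expression above.

```cpp
#include <string>

// Expands %f, %l and %c into named capture groups.
// Assumption: ".+" and "[0-9]+" are illustrative group bodies only.
std::string expandPlaceholders(const std::string& pattern)
{
    std::string out;
    for (std::size_t i = 0; i < pattern.size(); ++i)
    {
        if (pattern[i] == '%' && i + 1 < pattern.size())
        {
            switch (pattern[i + 1])
            {
            case 'f': out += "(?P<f>.+)";     ++i; continue;
            case 'l': out += "(?P<l>[0-9]+)"; ++i; continue;
            case 'c': out += "(?P<c>[0-9]+)"; ++i; continue;
            default: break;  // unknown placeholder: copy the '%' as-is
            }
        }
        out += pattern[i];
    }
    return out;
}
```

With this in place a user could write something like %f\(%l(,%c)?\)\s*: and never type a capture group for the three values by hand.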

Boost 1.35 introduces version 2 of Boost.Xpressive, the other Boost regular expression engine. Boost.Xpressive supports iterators by design, and version 2 adds named captures.

Implementing a Scintilla Iterator

Xpressive requires a bi-directional iterator class (one that can move forwards and backwards over the contents). I’ve currently implemented a very simple, naive iterator to prove that this can work:

#include <iterator>

/**
 * std::iterator compatible iterator for Scintilla contents
 */
class ScintillaIterator :
    public std::iterator<std::bidirectional_iterator_tag, char>
{
public:
    ScintillaIterator() : 
        m_scintilla(0), 
        m_pos(0),
        m_end(0)
    {
    }

    ScintillaIterator(CScintilla* scintilla, int pos) : 
        m_scintilla(scintilla),
        m_pos(pos),
        m_end(scintilla->GetLength())
    {
    }

    ScintillaIterator(const ScintillaIterator& copy) :
        m_scintilla(copy.m_scintilla),
        m_pos(copy.m_pos),
        m_end(copy.m_end)
    {
    }

    bool operator == (const ScintillaIterator& other) const
    {
        return (ended() == other.ended()) 
            && (m_scintilla == other.m_scintilla) 
            && (m_pos == other.m_pos);
    }

    bool operator != (const ScintillaIterator& other) const
    {
        return !(*this == other);
    }

    char operator * () const
    {
        return charAt(m_pos);
    }

    ScintillaIterator& operator ++ ()
    {
        m_pos++;
        return *this;
    }

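    ScintillaIterator& operator -- ()
    {
        m_pos--;
        return *this;
    }

    // Postfix forms, required of a bidirectional iterator
    ScintillaIterator operator ++ (int)
    {
        ScintillaIterator previous(*this);
        m_pos++;
        return previous;
    }

    ScintillaIterator operator -- (int)
    {
        ScintillaIterator previous(*this);
        m_pos--;
        return previous;
    }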

    int pos() const
    {
        return m_pos;
    }

private:
    char charAt(int position) const
    {
        return m_scintilla->GetCharAt(position);
    }

    bool ended() const
    {
        return m_pos == m_end;
    }

    // Declared in the same order as the constructor initializer lists
    CScintilla* m_scintilla;
    int m_pos;
    int m_end;
};

This can then be used with Xpressive like this:

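#include <boost/xpressive/xpressive.hpp>

typedef boost::xpressive::basic_regex<ScintillaIterator> sciregex;
typedef boost::xpressive::match_results<ScintillaIterator> scimatch;
typedef boost::xpressive::sub_match<ScintillaIterator> scisub_match;

// m_scintilla is the CScintilla* held by the enclosing class
void test()
{
    sciregex regex = sciregex::compile("[0-9]+");
    scimatch match;

    // Xpressive works on an iterator range, so wrap the document in a begin/end pair
    ScintillaIterator begin(m_scintilla, 0);
    ScintillaIterator end(m_scintilla, m_scintilla->GetLength());

    // regex_search finds the first match anywhere in the document;
    // regex_match would require the whole document to match
    if (boost::xpressive::regex_search(begin, end, match, regex))
    {
        LOG("Yay!");
    }
}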

This code is now available in the Programmer’s Notepad Subversion repository, and it seems to work. The iterator still needs improvement: it should either buffer data from Scintilla or be moved so that it doesn’t have to send a Windows message for every character access. However, as a proof of concept it’s a good one, and it suggests that we should be able to replace the current lacklustre regex searching support with fully featured multi-line support for the next release.
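The buffering idea mentioned above could look roughly like the sketch below. It is not the code in the repository: TextSource is a hypothetical stand-in for the Scintilla wrapper, GetRange represents a single bulk read (in real Scintilla something like SCI_GETTEXTRANGE would presumably fill that role), and the window size of 256 is an arbitrary choice. The iterator keeps a small window of text and only goes back to the source when the position leaves that window.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical stand-in for the editor control. One GetRange call represents
// one bulk request to the control; fetches counts them for illustration.
struct TextSource
{
    std::string text;
    mutable int fetches = 0;

    int GetLength() const { return static_cast<int>(text.size()); }

    // Copies the characters in [start, end) into out.
    void GetRange(int start, int end, std::string& out) const
    {
        ++fetches;
        out.assign(text, start, end - start);
    }
};

class BufferedIterator
{
public:
    typedef std::bidirectional_iterator_tag iterator_category;
    typedef char value_type;
    typedef std::ptrdiff_t difference_type;
    typedef const char* pointer;
    typedef char reference;

    BufferedIterator() : m_source(0), m_pos(0), m_bufStart(0) {}
    BufferedIterator(const TextSource* source, int pos) :
        m_source(source), m_pos(pos), m_bufStart(0) {}

    char operator * () const
    {
        // Refill only when the position falls outside the cached window.
        if (m_pos < m_bufStart || m_pos >= m_bufStart + static_cast<int>(m_buf.size()))
            refill();
        return m_buf[m_pos - m_bufStart];
    }

    BufferedIterator& operator ++ () { ++m_pos; return *this; }
    BufferedIterator& operator -- () { --m_pos; return *this; }
    BufferedIterator operator ++ (int) { BufferedIterator old(*this); ++m_pos; return old; }
    BufferedIterator operator -- (int) { BufferedIterator old(*this); --m_pos; return old; }

    bool operator == (const BufferedIterator& other) const
    {
        return m_source == other.m_source && m_pos == other.m_pos;
    }
    bool operator != (const BufferedIterator& other) const { return !(*this == other); }

    int pos() const { return m_pos; }

private:
    enum { WindowSize = 256 };

    void refill() const
    {
        // Centre the window on m_pos so steps in either direction hit the cache.
        int start = std::max(0, m_pos - WindowSize / 2);
        int end = std::min(m_source->GetLength(), start + WindowSize);
        m_source->GetRange(start, end, m_buf);
        m_bufStart = start;
    }

    const TextSource* m_source;
    int m_pos;
    mutable std::string m_buf;   // cached window of text
    mutable int m_bufStart;      // document position of m_buf[0]
};
```

Walking a document from start to end with this iterator touches the source once per window rather than once per character. One trade-off of this simple version is that every copy of the iterator carries its own window, and regex engines copy iterators freely, so sharing the window between copies would be a sensible next step.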