REGEXP enhancements

Description

MariaDB has the RLIKE predicate which allows to specify a
regular expression pattern for a complex search, e.g.:

RLIKE uses the Henry Spencer's implementation of regular expressions,
which is aimed at conformance with POSIX 1003.2.
This library has a number of limitations:

  • it works byte-wise thus supports only 8-bit character sets

  • it does not work well with accent insensitive MariaDB collations

  • it does not have features that present in the modern regex
    libraries such as look-ahead, look-behind, non-greedy modifier,
    recursion, etc.

The goal of this task is to replace the old regex library
with some modern Unicode-aware library.

Current implementation pointers

The current regexp library is bundled with the MariaDB sources
and resides in the regex/ sub-directory.
The RLIKE predicate is implemented with help of the class Item_func_regex
which is defined in the files sql/item_cmpfunc.h and
sql/item_cmpfunc.cc

Possible library candidates

The list of the possible candidates is a subject to extending.
Currently we consider the following libraries:

  • mb_regex (from PHP)

  • TRE

  • RE2

  • pcre

  • ICU regex

  • oniguruma

One of the goals of the task is to choose the library which
suites better our needs, taking into account that it will
need some coding to integrate with MariaDB character sets and collations.

The implementer should provide feature and performance comparison
and proves that the library of the choice suites our needs better
than the other candidates.

Library License requirements

The library should be distributed under some permissive open source license,
such as LGPL, BSD, MIT, etc. GPL is not desirable.

Library portability requirements

The library should be a thread-safe C or C++ library that compiles
on all operating systems supported by MariaDB
(e.g. Linux, Windows, *BSD, Mac, Solaris, etc).

Library technical requirements

After replacing the regexp library, the RLIKE predicate should be able to:

1. Support modern regexp features.

  • look-ahead

  • look-behind

  • non-greedy modifiers

  • recursion would be nice

  • Unicode features [[10]|#unicoderegex] would be nice:

    • Unicode code point specification

    • Unicode Character Properties

    • Unicode Scripts

2. Work with all MariaDB character sets

  • 8bit character sets (e.g. latin1, cp1250, koi8r, etc)

  • Unicode character sets: utf8, utf16, utf16le, utf32

  • Chinese-Japanese-Korean (CJK) multi-byte character sets: sjis, cp932, ujis, eucjpms, gbk, gb2312, euckr

Preferably, the library should be able to handle all character sets natively.
Another way would be to convert non-Unicode data to Unicode and then pass to
the library. For example, if a library supports utf16le but does not support
cp932, we can convert cp932 to utf16le, and then pass utf16le data to the
library.

3. Follow the case sensitivity rules defined by the MariaDB collations.

That is, take into account case sensitivity of the collation of the operation:

It was decided to use PCRE as a new library

The REGEXP_LIKE function (deferred)

This functions won't be implemented under terms of this MDEV. See the explanation in the end of this section, or just skip this section
Currently MariaDB supports regexp based
comparison in the form of the RLIKE predicate:

which can only accept two arguments.

This task should introduce a new function REGEXP_LIKE with
the optional third parameter:

If the mode parameter is not specified, then the behavior of REGEXP_LIKE
should be similar to RLIKE, so these two expressions bring the same result:

The mode argument of REGEXP_LIKE will be a string literal
that allows to change the default matching behavior.
It will understand one or more letter flags as follows:

n

force dot (.) to match a newline character (by default dot should not match newline).

m

treat the source string as multiple lines, so ^ and $ match the beginning and the end of any line inside text. The default behavior should be to match ^ and $ only at the beginning and the end of the entire text value.

x

ignore white-spaces (by default white-spaces should match themselves).

i

force case insensitive match (even if the collation is case sensitive)

c

force case sensitive match (even if the collation is case insensitive)

Examples:

Note 1: Some other letter flags may be added as well, depending on features
provided by the library of the choice. This will be discussed after
the decision of which library to use is made.

Note 2: mode will be a string literal (i.e. a constant).
Expressions will not be supported. The parser will return
a syntax error on attempt to use non-literal in the third parameter
to REGEXP_LIKE:

but

Note 3: If some unknown letter is given in the 'mode' parameter to
REGEXP_LIKE, then an error should be returned:

Note

It was decided not to implement the REGEXP_LIKE function,
because PCRE supports flags in the pattern syntax. There is
no a need for an extra argument that would contain flags.

For example:

The REGEXP_REPLACE function

Additionally, a much requested REGEXP_REPLACE function should be implemented
under terms of this task, with the following arguments:

where:

text

the search value (string)

pattern

the regular expression (string)

replace

the replacement (string), with sub-expressions in the form of \n, where n is a number from 1 to 9.

position

a positive integer indicating the position in text where to start search from. The default value will be 1 (the first character).

occurrence

a non-negative number indicating the occurrence to replace: 0 to replace all occurrences, or a positive number n to replace the n-th occurrence.

mode

a string literal that changes the default match behavior, with exactly the same set of flag letters that REGEXP_LIKE has.

The first version will be implemented only with the first three parameters "text", "pattern" and "replace" and won't include the optional parameters "position", "occurrence" and "mode", and it will replace all occurrences of the searched pattern, starting from the beginning of the text

Examples:

The REGEXP_INSTR function

The REGEXP_REPLACE function will be implemented,
with the following arguments:

It will return the position of the regexp "pattern" inside "text".
The positions will start with 1.

Examples:

The REGEXP_SUBSTR function

The REGEXP_SUBSTR function will be be implemented,
with the following arguments:

It will return the part of "text" that matches "pattern".

Examples:

  1. The description of the RLIKE predicate in the MySQL manual:
    http://dev.mysql.com/doc/refman/5.6/en/regexp.html

  2. pcre:\\http://www.pcre.org/\\http://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions

  3. Oniguruma:\\http://www.geocities.jp/kosako3/oniguruma/\\http://en.wikipedia.org/wiki/Oniguruma

  4. Regular expression UDFs for MySQL
    http://sourceforge.net/projects/udf-regexp/ https://launchpad.net/mysql-udf-regexp

  5. TRE:\\http://laurikari.net/tre/\\http://en.wikipedia.org/wiki/TRE_%28computing%29

  6. RE2:
    http://code.google.com/p/re2/

  7. ICU regexp:
    http://userguide.icu-project.org/strings/regexp

  8. Comparison of a number of regexp implementations:
    http://en.wikipedia.org/wiki/Comparison_of_regular_expression_engines

  9. Benchmark of Regex Libraries
    http://lh3lh3.users.sourceforge.net/reb.shtml

  10. Unicode features in regexp:
    http://www.regular-expressions.info/unicode.html

  11. Performance comparison of regular expression engines:
    http://sljit.sourceforge.net/regex_perf.html

  12. Regular Expression Performance Comparison:
    http://www.boost.org/doc/libs/1_41_0/libs/regex/doc/gcc-performance.html

Environment

None

Status

Assignee

Alexander Barkov

Reporter

Alexander Barkov

Labels

External issue ID

None

External issue ID

None

Time tracking

720h

Fix versions

Priority

Major
Configure