MariaDB has the RLIKE predicate which allows to specify a
regular expression pattern for a complex search, e.g.:
RLIKE uses the Henry Spencer's implementation of regular expressions,
which is aimed at conformance with POSIX 1003.2.
This library has a number of limitations:
- it works byte-wise thus supports only 8-bit character sets
- it does not work well with accent insensitive MariaDB collations
- it does not have features that present in the modern regex
libraries such as look-ahead, look-behind, non-greedy modifier,
The goal of this task is to replace the old regex library
with some modern Unicode-aware library.
The current regexp library is bundled with the MariaDB sources
and resides in the regex/ sub-directory.
The RLIKE predicate is implemented with help of the class Item_func_regex
which is defined in the files sql/item_cmpfunc.h and
The list of the possible candidates is a subject to extending.
Currently we consider the following libraries:
- mb_regex (from PHP)
- ICU regex
One of the goals of the task is to choose the library which
suites better our needs, taking into account that it will
need some coding to integrate with MariaDB character sets and collations.
The implementer should provide feature and performance comparison
and proves that the library of the choice suites our needs better
than the other candidates.
The library should be distributed under some permissive open source license,
such as LGPL, BSD, MIT, etc. GPL is not desirable.
The library should be a thread-safe C or C++ library that compiles
on all operating systems supported by MariaDB
(e.g. Linux, Windows, *BSD, Mac, Solaris, etc).
After replacing the regexp library, the RLIKE predicate should be able to:
- non-greedy modifiers
- recursion would be nice
- Unicode features  would be nice:
- Unicode code point specification
- Unicode Character Properties
- Unicode Scripts
- 8bit character sets (e.g. latin1, cp1250, koi8r, etc)
- Unicode character sets: utf8, utf16, utf16le, utf32
- Chinese-Japanese-Korean (CJK) multi-byte character sets: sjis, cp932, ujis, eucjpms, gbk, gb2312, euckr
Preferably, the library should be able to handle all character sets natively.
Another way would be to convert non-Unicode data to Unicode and then pass to
the library. For example, if a library supports utf16le but does not support
cp932, we can convert cp932 to utf16le, and then pass utf16le data to the
That is, take into account case sensitivity of the collation of the operation:
It was decided to use PCRE as a new library
This functions won't be implemented under terms of this MDEV. See the explanation in the end of this section, or just skip this section
Currently MariaDB supports regexp based
comparison in the form of the RLIKE predicate:
which can only accept two arguments.
This task should introduce a new function REGEXP_LIKE with
the optional third parameter:
If the mode parameter is not specified, then the behavior of REGEXP_LIKE
should be similar to RLIKE, so these two expressions bring the same result:
The mode argument of REGEXP_LIKE will be a string literal
that allows to change the default matching behavior.
It will understand one or more letter flags as follows:
|n||force dot (.) to match a newline character (by default dot should not match newline).|
|m||treat the source string as multiple lines, so ^ and $ match the beginning and the end of any line inside text. The default behavior should be to match ^ and $ only at the beginning and the end of the entire text value.|
|x||ignore white-spaces (by default white-spaces should match themselves).|
|i||force case insensitive match (even if the collation is case sensitive)|
|c||force case sensitive match (even if the collation is case insensitive)|
Note 1: Some other letter flags may be added as well, depending on features
provided by the library of the choice. This will be discussed after
the decision of which library to use is made.
Note 2: mode will be a string literal (i.e. a constant).
Expressions will not be supported. The parser will return
a syntax error on attempt to use non-literal in the third parameter
Note 3: If some unknown letter is given in the 'mode' parameter to
REGEXP_LIKE, then an error should be returned:
Additionally, a much requested REGEXP_REPLACE function should be implemented
under terms of this task, with the following arguments:
|text||the search value (string)|
|pattern||the regular expression (string)|
|replace||the replacement (string), with sub-expressions in the form of \n, where n is a number from 1 to 9.|
|position||a positive integer indicating the position in text where to start search from. The default value will be 1 (the first character).|
|occurrence||a non-negative number indicating the occurrence to replace: 0 to replace all occurrences, or a positive number n to replace the n-th occurrence.|
|mode||a string literal that changes the default match behavior, with exactly the same set of flag letters that REGEXP_LIKE has.|
The first version will be implemented only with the first three parameters "text", "pattern" and "replace" and won't include the optional parameters "position", "occurrence" and "mode", and it will replace all occurrences of the searched pattern, starting from the beginning of the text
The REGEXP_REPLACE function will be implemented,
with the following arguments:
It will return the position of the regexp "pattern" inside "text".
The positions will start with 1.
The REGEXP_SUBSTR function will be be implemented,
with the following arguments:
It will return the part of "text" that matches "pattern".
- The description of the RLIKE predicate in the MySQL manual:
- Regular expression UDFs for MySQL
- ICU regexp:
- Comparison of a number of regexp implementations:
- Benchmark of Regex Libraries
- Unicode features in regexp:
- Performance comparison of regular expression engines:
- Regular Expression Performance Comparison: