Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-4850

MATCH uses a wrong fulltext index with mismatching collation

    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 5.5.32, 5.1.67, 5.2.14, 5.3.12
    • Fix Version/s: 5.5
    • Component/s: None
    • Labels:
      None

      Description

      This script:

       DROP TABLE IF EXISTS t1;
      CREATE TABLE t1 (s1 VARCHAR (60) CHARACTER SET UTF8 COLLATE UTF8_UNICODE_CI) ENGINE = MyISAM;
      CREATE FULLTEXT INDEX i ON t1 (s1);INSERT INTO t1 VALUES ('a'),('b'),('c'),('d'),('ÓÓÓÓ'),('OOOO'),(NULL),('ÓÓÓÓ ÓÓÓÓ'),('OOOOOOOO'); 
      SELECT * FROM t1 WHERE MATCH(s1) AGAINST ('OOOO' COLLATE UTF8_POLISH_CI);
      

      returns wrong results:

      +-------------------+
      | s1                |
      +-------------------+
      | ÓÓÓÓ              |
      | OOOO              |
      | ÓÓÓÓ ÓÓÓÓ         |
      +-------------------+
      

      The problem is that in the collation utf8_polish_ci
      the letter "O WITH ACUTE" is not equal to "O":

      MariaDB [test]> SELECT 'O'='Ó' COLLATE utf8_polish_ci;
      +---------------------------------+
      | 'O'='Ó' COLLATE utf8_polish_ci  |
      +---------------------------------+
      |                               0 |
      +---------------------------------+
      

      Therefore, returning the records containing accented letter
      in the above query is wrong.

      Further investigation with EXPLAIN:

      MySQL [test]> EXPLAIN SELECT * FROM t1 WHERE MATCH(s1) AGAINST ('OOOO' COLLATE UTF8_POLISH_CI);
      +----+-------------+-------+----------+---------------+------+---------+------+------+-------------+
      | id | select_type | table | type     | possible_keys | key  | key_len | ref  | rows | Extra       |
      +----+-------------+-------+----------+---------------+------+---------+------+------+-------------+
      |  1 | SIMPLE      | t1    | fulltext | i             | i    | 0       | NULL |    1 | Using where |
      +----+-------------+-------+----------+---------------+------+---------+------+------+-------------+
      

      tells that MATCH erroneously chooses the non-suitable index "i"
      whose collation is NOT equal to the collation of the operation.

      If I further drop the index and use BOOLEAN MODE:

      ALTER TABLE t1 DROP INDEX i;
      SELECT * FROM t1 WHERE MATCH(s1) AGAINST ('OOOO' COLLATE UTF8_POLISH_CI IN BOOLEAN MODE);
      

      it still returns wrong results:

      +-------------------+
      | s1                |
      +-------------------+
      | ÓÓÓÓ              |
      | OOOO              |
      | ÓÓÓÓ ÓÓÓÓ         |
      +-------------------+
      

      It seems MATCH always uses the collation of the field and does not follow the
      standard argument collation aggregation rules.

        Gliffy Diagrams

          Attachments

            Activity

            There are no comments yet on this issue.

              People

              • Assignee:
                bar Alexander Barkov
                Reporter:
                bar Alexander Barkov
              • Votes:
                0 Vote for this issue
                Watchers:
                2 Start watching this issue

                Dates

                • Created:
                  Updated: