MariaDB Development
  1. MariaDB Development
  2. MDEV-4850

MATCH uses a wrong fulltext index with mismatching collation

    Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Minor Minor
    • Resolution: Unresolved
    • Affects Version/s: 5.5.32, 5.3.12, 5.2.14, 5.1.67
    • Fix Version/s: 5.5.38
    • Labels:
      None
    • Global Rank:
      3413

      Description

      This script:

       DROP TABLE IF EXISTS t1;
      CREATE TABLE t1 (s1 VARCHAR (60) CHARACTER SET UTF8 COLLATE UTF8_UNICODE_CI) ENGINE = MyISAM;
      CREATE FULLTEXT INDEX i ON t1 (s1);INSERT INTO t1 VALUES ('a'),('b'),('c'),('d'),('ÓÓÓÓ'),('OOOO'),(NULL),('ÓÓÓÓ ÓÓÓÓ'),('OOOOOOOO'); 
      SELECT * FROM t1 WHERE MATCH(s1) AGAINST ('OOOO' COLLATE UTF8_POLISH_CI);
      

      returns wrong results:

      +-------------------+
      | s1                |
      +-------------------+
      | ÓÓÓÓ              |
      | OOOO              |
      | ÓÓÓÓ ÓÓÓÓ         |
      +-------------------+
      

      The problem is that in the collation utf8_polish_ci
      the letter "O WITH ACUTE" is not equal to "O":

      MariaDB [test]> SELECT 'O'='Ó' COLLATE utf8_polish_ci;
      +---------------------------------+
      | 'O'='Ó' COLLATE utf8_polish_ci  |
      +---------------------------------+
      |                               0 |
      +---------------------------------+
      

      Therefore, returning the records containing accented letter
      in the above query is wrong.

      Further investigation with EXPLAIN:

      MySQL [test]> EXPLAIN SELECT * FROM t1 WHERE MATCH(s1) AGAINST ('OOOO' COLLATE UTF8_POLISH_CI);
      +----+-------------+-------+----------+---------------+------+---------+------+------+-------------+
      | id | select_type | table | type     | possible_keys | key  | key_len | ref  | rows | Extra       |
      +----+-------------+-------+----------+---------------+------+---------+------+------+-------------+
      |  1 | SIMPLE      | t1    | fulltext | i             | i    | 0       | NULL |    1 | Using where |
      +----+-------------+-------+----------+---------------+------+---------+------+------+-------------+
      

      tells that MATCH erroneously chooses the non-suitable index "i"
      whose collation is NOT equal to the collation of the operation.

      If I further drop the index and use BOOLEAN MODE:

      ALTER TABLE t1 DROP INDEX i;
      SELECT * FROM t1 WHERE MATCH(s1) AGAINST ('OOOO' COLLATE UTF8_POLISH_CI IN BOOLEAN MODE);
      

      it still returns wrong results:

      +-------------------+
      | s1                |
      +-------------------+
      | ÓÓÓÓ              |
      | OOOO              |
      | ÓÓÓÓ ÓÓÓÓ         |
      +-------------------+
      

      It seems MATCH always uses the collation of the field and does not follow the
      standard argument collation aggregation rules.

        Activity

        There are no comments yet on this issue.

          People

          • Assignee:
            Alexander Barkov
            Reporter:
            Alexander Barkov
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated: