Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-6218

Wrong result of CHAR_LENGTH(non-BMP-character) with 3-byte utf8

    Details

    • Type: Bug
    • Status: Stalled
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.0.10
    • Fix Version/s: 10.2
    • Component/s: None
    • Labels:

      Description

      mysql> SET NAMES utf8;
      Query OK, 0 rows affected (0.00 sec)
      
      mysql> SELECT hex('😁'), char_length('😁'), octet_length('😁');
      +-------------+---------------------+----------------------+
      | hex('😁')     | char_length('😁')     | octet_length('😁')     |
      +-------------+---------------------+----------------------+
      | F09F9881    |                   4 |                    4 |
      +-------------+---------------------+----------------------+
      1 row in set (0.00 sec)
      

      Notice, I use "SET NAMES utf8" (which is a 3-byte character set
      and supports only BMP characters), but then input a 4-byte character.
      The result of CHAR_LENGTH() is wrong.

      0xF09F9881 is a wrong byte sequence of utf8 (it's correct for utf8mb4 only)

      The expected result would be:

      • either return error for the entire query
      • or replace the character to '?' and thus make CHAR_LENGTH() return 1.

        Gliffy Diagrams

          Attachments

            Issue Links

              Activity

              Hide
              psergey Sergei Petrunia added a comment -

              Repeatable on mysql-5.6.17

              Show
              psergey Sergei Petrunia added a comment - Repeatable on mysql-5.6.17
              Hide
              bar Alexander Barkov added a comment -

              LEFT also returns a wrong result:

              MariaDB [test]> SELECT hex(left('😁',2));
              +---------------------+
              | hex(left('😁',2))     |
              +---------------------+
              | F09F                |
              +---------------------+
              1 row in set (0.00 sec)
              
              Show
              bar Alexander Barkov added a comment - LEFT also returns a wrong result: MariaDB [test]> SELECT hex(left('😁',2)); +---------------------+ | hex(left('😁',2)) | +---------------------+ | F09F | +---------------------+ 1 row in set (0.00 sec)
              Hide
              bar Alexander Barkov added a comment -

              RIGHT returns a wrong result:

              MariaDB [test]> SELECT hex(right('😁',2));
              +----------------------+
              | hex(right('😁',2))     |
              +----------------------+
              | 9881                 |
              +----------------------+
              
              Show
              bar Alexander Barkov added a comment - RIGHT returns a wrong result: MariaDB [test]> SELECT hex(right('😁',2)); +----------------------+ | hex(right('😁',2)) | +----------------------+ | 9881 | +----------------------+
              Hide
              bar Alexander Barkov added a comment -

              SUBSTRING returns a wrong result:

              MariaDB [test]> SELECT hex(substring('😁',2,1));
              +----------------------------+
              | hex(substring('😁',2,1))     |
              +----------------------------+
              | 9F                         |
              +----------------------------+
              
              Show
              bar Alexander Barkov added a comment - SUBSTRING returns a wrong result: MariaDB [test]> SELECT hex(substring('😁',2,1)); +----------------------------+ | hex(substring('😁',2,1)) | +----------------------------+ | 9F | +----------------------------+
              Hide
              bar Alexander Barkov added a comment -

              In this example, the returned string is also bad formed:

              MariaDB [test]> SELECT '11😁2😁22222';
              +------------------+
              | 11😁2😁22222         |
              +------------------+
              | 11😁2😁22222         |
              +------------------+
              

              It should probably replace unknown bytes to question marks.

              Show
              bar Alexander Barkov added a comment - In this example, the returned string is also bad formed: MariaDB [test]> SELECT '11😁2😁22222'; +------------------+ | 11😁2😁22222 | +------------------+ | 11😁2😁22222 | +------------------+ It should probably replace unknown bytes to question marks.

                People

                • Assignee:
                  bar Alexander Barkov
                  Reporter:
                  bar Alexander Barkov
                • Votes:
                  0 Vote for this issue
                  Watchers:
                  3 Start watching this issue

                  Dates

                  • Created:
                    Updated: