  1. MariaDB Server
  2. MDEV-7128

Configuring charsets or collations as utf8 yields surprising result and leads to data loss

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.0
    • Fix Version/s: 10.2
    • Component/s: None
    • Labels:
      None

      Description

      Configuring databases and collations as utf8 and utf8_unicode_ci, respectively, leads to the very surprising result that real UTF-8 is NOT used. What is used instead is a custom variant of utf8 that cannot represent the full range of Unicode characters.

      This of course leads to hard-to-debug problems, as one doesn't even suspect that such a problem exists.

      The problem is that the name utf-8 is reused to mean something that it is not according to the relevant standards (the RFC, the Unicode Consortium).
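
To make the gap concrete, here is a minimal Python sketch (not MariaDB code). It lists the characters in a string whose UTF-8 encoding needs four bytes, that is, code points above U+FFFF, which a variant limited to three bytes per character cannot represent. The helper name `four_byte_chars` is made up for illustration.

```python
def four_byte_chars(text):
    """Return the characters whose UTF-8 encoding is four bytes long."""
    return [ch for ch in text if len(ch.encode("utf-8")) == 4]


# e-acute takes 2 bytes, the euro sign 3 bytes, the thumbs-up emoji 4 bytes.
sample = "caf\u00e9 \u20ac \U0001F44D"
print(four_byte_chars(sample))
```

Only the emoji is reported. Accented letters and most CJK text fit in three bytes, which is one reason the limitation can go unnoticed for a long time.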

      Instead, one has to work around the problem by a) actually knowing about it and b) configuring something like this:

      -- snip --
      [mysqld]
      # switch to 4 byte utf-8 as default
      # See: https://mathiasbynens.be/notes/mysql-utf8mb4
      init_connect  = "SET NAMES utf8mb4"
      collation_server = utf8mb4_unicode_ci
      character_set_server = utf8mb4
      
      [client]
      default-character-set = utf8mb4
      
      [mysql]
      default-character-set = utf8mb4
      -- snap --
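
Since the workaround depends on every relevant option being set, a small check can help. The following is only a sketch in Python, not an official tool. It parses an option-file fragment with the standard `configparser` module and reports charset or collation options that are not set to a utf8mb4 value. The function name and the list of option names are assumptions taken from the fragment above, and real option files may contain directives such as `!includedir` that `configparser` does not understand.

```python
import configparser

# Option names from the fragment above, with dashes normalized to underscores
# (option files accept either spelling).
CHARSET_KEYS = {"character_set_server", "collation_server", "default_character_set"}


def non_utf8mb4_settings(config_text):
    """Return (section, option, value) for charset options not set to utf8mb4."""
    parser = configparser.ConfigParser(allow_no_value=True, interpolation=None)
    parser.read_string(config_text)
    findings = []
    for section in parser.sections():
        for option, value in parser.items(section):
            if option.replace("-", "_") not in CHARSET_KEYS or value is None:
                continue
            value = value.strip("\"'")
            if not value.startswith("utf8mb4"):
                findings.append((section, option, value))
    return findings


example = """\
[mysqld]
character_set_server = utf8
collation_server = utf8mb4_unicode_ci

[client]
default-character-set = utf8mb4
"""
print(non_utf8mb4_settings(example))
```

In this example only `character_set_server = utf8` in the `[mysqld]` section is flagged.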
      

      What do I expect as a user? If I configure MariaDB to use utf8, I expect it to do so, and not to use some custom subset of Unicode that automatically discards data my users entered as UTF-8, trusting it to be handled correctly.
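
What "discards data" can look like in practice is illustrated by the rough Python simulation below. It is not MariaDB code. It assumes the behavior commonly reported for non-strict SQL modes, where a value written to a three-byte utf8 column is cut off at the first character that needs four bytes; in strict mode the statement would be expected to fail instead. The function name is hypothetical.

```python
def simulate_utf8mb3_store(text):
    """Cut text at the first code point above U+FFFF (assumed non-strict behavior)."""
    for index, ch in enumerate(text):
        if ord(ch) > 0xFFFF:
            return text[:index]
    return text


comment = "nice work \U0001F44D see you tomorrow"
print(repr(simulate_utf8mb3_store(comment)))  # prints 'nice work '
```

Everything after the emoji is lost, not just the emoji itself.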

      Options to handle this:
      a) Just actually use Unicode if Unicode is specified. As the utf8mb3 encoding is fully binary compatible with utf8mb4 (actual UTF-8), this would just work. People who need utf8mb3 can configure their system accordingly and likely know what it is anyway if they actually require it.
      b) If the name of the encoding cannot simply be redefined, a lengthy deprecation cycle can be started right now. Deprecate all the utf8 encodings that don't specify whether they mean utf8mb3 or utf8mb4, and emit warnings for the next several years (I'd think something like 4, or whatever floats your boat) asking users to fully specify the wanted encoding. After that, make the utf8 name invalid for the same length of time to get it out of configurations. Only then (now we're 8 years into the future) can the name utf8 be re-enabled as a new encoding that actually is UTF-8 and doesn't surprise users.
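
The compatibility claim in a) rests on the three-byte variant using ordinary UTF-8 byte sequences for the code points it does cover. Taking that as an assumption, the Python sketch below checks the byte lengths involved: every code point up to U+FFFF fits in at most three bytes, and only the code points above it need four. The function name is made up for illustration.

```python
SURROGATES = range(0xD800, 0xE000)  # cannot be encoded on their own


def max_utf8_length(first, last):
    """Longest UTF-8 encoding, in bytes, among code points first..last inclusive."""
    return max(
        len(chr(cp).encode("utf-8"))
        for cp in range(first, last + 1)
        if cp not in SURROGATES
    )


print(max_utf8_length(0x0000, 0xFFFF))     # Basic Multilingual Plane: prints 3
print(max_utf8_length(0x10000, 0x10FFFF))  # supplementary planes: prints 4
```

Under that assumption, existing three-byte data is already valid four-byte-capable UTF-8, so widening the meaning of the name would not require re-encoding stored text.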

      Well, a) and b) are certainly extremes, so I suspect that you will choose something in between. But right now the situation is very bad, as nobody who configures utf8 or utf8-ci actually suspects that they're not getting what they want. Just switching utf8 to mean utf8mb4 will, I suspect, do the right thing for a lot of people. So I would like to suggest choosing an option that is closer to a) than to b).

              Activity

              dwt Martin Häcker added a comment -

              Re: bug report with oracle

              I have to say that I'm not going to deal with Oracle; that's simply something that I refuse to have the time in my life for. On the other hand, I assume that you guys (MariaDB developers) have a very good connection to them.

              So why do you not bring this up on the shared mailing list that you have somewhere?

              dwt Martin Häcker added a comment -

              I love how inconsistent behavior like this so often also has security implications. See this for example: https://cedricvb.be/post/wordpress-stored-xss-vulnerability-4-1-2/

              Since MySQL and MariaDB treat utf-8 as meaning a three-byte encoding instead of a four-byte one, many people have this configured wrong, but assume that it means the same thing as when they say encode as 'utf-8' in the frontend.

              And that leads to subtle security bugs. Here's hoping that this raises the importance of this issue.

              bar Alexander Barkov added a comment -

              The problem mentioned in https://cedricvb.be/post/wordpress-stored-xss-vulnerability-4-1-2/
              is fixed in 5.5.43. See MDEV-7649 for details.

              There will be more fixes in 10.x under terms of MDEV-8036.

              bar Alexander Barkov added a comment -

              See also MDEV-8334 Rename utf8 to utf8mb3

              dwt Martin Häcker added a comment -

              To stay in the style of this bug: 👍


                People

                • Assignee:
                  bar Alexander Barkov
                  Reporter:
                  dwt Martin Häcker