Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-8402

Bug #77473 Truncated data with subquery & UTF8

    Details

    • Type: Bug
    • Status: Confirmed
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.1, 10.0, 5.5
    • Fix Version/s: 10.1, 10.0
    • Component/s: None
    • Labels:

      Description

      REPEAT() and RPAD() results in a subquery get truncated if the length parameter is between 32768 and 65536 for 2-byte UTF-8 sequences, or between 21845 and 65535 . Below and above that point the result is correct.

      set names utf8;
      select length(data) as len from ( select repeat('ä', ...) as data ) as Sub;
      
      parameter -> result
      
      36766 -> 65532
      32767 -> 65534
      32678 -> 65534
      ...
      65535 -> 65534
      65536 -> 131072
      65537 -> 131074
      

      This is not a problem of the length function in the outer query, it is already the string returned from the inner query that gets truncated as can be seen with

      mysql -Ne "set names utf8; select data from (select repeat('é', 40000) as data ) as Sub;" | wc
            1       1   65535
      

      Looks as if the intermediate result length is calculated by character count and not by byte length, and so the wrong data type seems to be used to store the intermediate result. E.g. when using a unicode character that has a 3 byte UTF-8 representation:

      select length(data) as len from ( select repeat('☃', ...) as data ) as Sub;
      
      parameter -> result
      
      21844 -> 65532
      21845 -> 65535
      21846 -> 65535 
      ...
      65535 -> 65535
      65536 -> 196608
      

        Gliffy Diagrams

          Attachments

            Issue Links

              Activity

              Hide
              hholzgra Hartmut Holzgraefe added a comment -
              Show
              hholzgra Hartmut Holzgraefe added a comment - Upstream bug is http://bugs.mysql.com/77473
              Hide
              elenst Elena Stepanova added a comment -

              There is also a warning issued when the problem occurs. In the output below, the third result is erroneous.

              set names latin1;
              select length(data) as len from ( select repeat('ä', 65535) as data ) as Sub;
              len
              131070
              select length(data) as len from ( select repeat('ä', 65536) as data ) as Sub;
              len
              131072
              set names utf8;
              select length(data) as len from ( select repeat('ä', 65535) as data ) as Sub;
              len
              65534
              Warnings:
              Warning	1366	Incorrect string value: '\xC3\xA4\xC3\xA4\xC3\xA4...' for column 'data' at row 1
              select length(data) as len from ( select repeat('ä', 65536) as data ) as Sub;
              len
              131072
              
              Show
              elenst Elena Stepanova added a comment - There is also a warning issued when the problem occurs. In the output below, the third result is erroneous. set names latin1; select length(data) as len from ( select repeat('ä', 65535) as data ) as Sub; len 131070 select length(data) as len from ( select repeat('ä', 65536) as data ) as Sub; len 131072 set names utf8; select length(data) as len from ( select repeat('ä', 65535) as data ) as Sub; len 65534 Warnings: Warning 1366 Incorrect string value: '\xC3\xA4\xC3\xA4\xC3\xA4...' for column 'data' at row 1 select length(data) as len from ( select repeat('ä', 65536) as data ) as Sub; len 131072

                People

                • Assignee:
                  Unassigned
                  Reporter:
                  hholzgra Hartmut Holzgraefe
                • Votes:
                  0 Vote for this issue
                  Watchers:
                  5 Start watching this issue

                  Dates

                  • Created:
                    Updated: