Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-4742

UDF - Native natural sorting / natural compare

    Details

    • Type: Task
    • Status: Open
    • Priority: Trivial
    • Resolution: Unresolved
    • Fix Version/s: None
    • Component/s: None
    • Labels:

      Description

      Hi guys, could be nice a natural sorting inside mariadb
      today some guys use SQL procedures/functions to execute this cpu intensive task
      php released a nice function to do this job, maybe we could port part of it and implement in udf function, and release as a default udf function?

      functions:
      1) natural sort, used in ORDER BY / GROUP BY
      2) natural sort compare, could be used as an operator, like the "SOUNDS LIKE" operator, in other words, the function that return a canonical form of the string could be used to compare, something that could be rewrite "field NATURAL LIKE value" to "natual(field)=natural(value)"

      i didn't found (2) in internet, but it's used in php, and the (1) should use it in some place to order/group correctly
      --------------
      example of (2):
      http://www.php.net/manual/en/function.strnatcmp.php
      https://github.com/php/php-src/blob/master/ext/standard/strnatcmp.c
      --------------

      example of (1)
      Here a example in STORED PROCEDURE/FUNCTIONS for ORDER BY, but not exaust tested:
      source: http://stackoverflow.com/questions/153633/natural-sort-in-mysql

      DROP FUNCTION IF EXISTS `udf_FirstNumberPos`;
      DELIMITER ;;
      CREATE FUNCTION `udf_FirstNumberPos` (`instring` varchar(4000)) 
      RETURNS int
      LANGUAGE SQL
      DETERMINISTIC
      NO SQL
      SQL SECURITY INVOKER
      BEGIN
          DECLARE position int;
          DECLARE tmp_position int;
          SET position = 5000;
          SET tmp_position = LOCATE('0', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF; 
          SET tmp_position = LOCATE('1', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
          SET tmp_position = LOCATE('2', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
          SET tmp_position = LOCATE('3', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
          SET tmp_position = LOCATE('4', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
          SET tmp_position = LOCATE('5', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
          SET tmp_position = LOCATE('6', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
          SET tmp_position = LOCATE('7', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
          SET tmp_position = LOCATE('8', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
          SET tmp_position = LOCATE('9', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
      
          IF (position = 5000) THEN RETURN 0; END IF;
          RETURN position;
      END
      ;;
      
      DROP FUNCTION IF EXISTS `udf_NaturalSortFormat`;
      DELIMITER ;;
      CREATE FUNCTION `udf_NaturalSortFormat` (`instring` varchar(4000), `numberLength` int, `sameOrderChars` char(50)) 
      RETURNS varchar(4000)
      LANGUAGE SQL
      DETERMINISTIC
      NO SQL
      SQL SECURITY INVOKER
      BEGIN
          DECLARE sortString varchar(4000);
          DECLARE numStartIndex int;
          DECLARE numEndIndex int;
          DECLARE padLength int;
          DECLARE totalPadLength int;
          DECLARE i int;
          DECLARE sameOrderCharsLen int;
      
          SET totalPadLength = 0;
          SET instring = TRIM(instring);
          SET sortString = instring;
          SET numStartIndex = udf_FirstNumberPos(instring);
          SET numEndIndex = 0;
          SET i = 1;
          SET sameOrderCharsLen = LENGTH(sameOrderChars);
      
          WHILE (i <= sameOrderCharsLen) DO
              SET sortString = REPLACE(sortString, SUBSTRING(sameOrderChars, i, 1), ' ');
              SET i = i + 1;
          END WHILE;
      
          WHILE (numStartIndex <> 0) DO
              SET numStartIndex = numStartIndex + numEndIndex;
              SET numEndIndex = numStartIndex;
      
              WHILE (udf_FirstNumberPos(SUBSTRING(instring, numEndIndex, 1)) = 1) DO
                  SET numEndIndex = numEndIndex + 1;
              END WHILE;
      
              SET numEndIndex = numEndIndex - 1;
      
              SET padLength = numberLength - (numEndIndex + 1 - numStartIndex);
      
              IF padLength < 0 THEN
                  SET padLength = 0;
              END IF;
      
              SET sortString = INSERT(sortString, numStartIndex + totalPadLength, 0, REPEAT('0', padLength));
      
              SET totalPadLength = totalPadLength + padLength;
              SET numStartIndex = udf_FirstNumberPos(RIGHT(instring, LENGTH(instring) - numEndIndex));
          END WHILE;
      
          RETURN sortString;
      END
      ;;
      

        Gliffy Diagrams

          Attachments

            Issue Links

              Activity

              Hide
              rspadim roberto spadim added a comment -

              a partial patch, todo:
              natstring need fraction implementation
              use collations

              rewrite, it's too ugly now

              Show
              rspadim roberto spadim added a comment - a partial patch, todo: natstring need fraction implementation use collations rewrite, it's too ugly now
              Hide
              rspadim roberto spadim added a comment -

              i done some job, as a start point...

              natstring = canonical string (can use in order by natstring(some_string))
              what need to be changed: the number is rewrite as a string with zeropad, but it's a fixed size (20), the size could be a optional parameter example: natstring("aasdfasdf",10), when number is bigger than 10, it's just copied not rewrite with zeropad)

              what this function should do:

              natstring("hello 1 person in 2 tables",13);
              return:
              "hello 0000000000001 person in 0000000000002 tables"


              how to rewrite numbers?
              1) 1 vs 01 (today this function do)
              -> the first idea is use bigint precision 20 digits => "18446744073709551615"/"+-9223372036854775807"
              "1" and "01" = "00000000000000000001"
              2) decimal: 0.1 (today this function don't do in the right way)
              using the same idea of (1):
              0.1 = "0"."1" => "00000000000000000000.00000000000000000001" (today it do, but it's not right...)
              or maybe understand that's as a fraction:
              0.1 = "00000000000000000000.10000000000000000000" (today it don't do, that's the right way)
              3) 10e-1 vs 0.1
              that's a problem since 10e-1 and 0.1 is the same value,
              but when you read 10e-1 you don't think about 0.1, i'm right?
              maybe we can use (2) here
              -> rewrite to 00000000000000000010e-00000000000000000001

              00000000000000000010e-00000000000000000001 vs 00000000000000000000.00000000000000000001
              strnatcasecmp('10e-1','0.1') => return "1" (right)
              strnatcmp('10e-1','0.1') => return "1" (right)
              4) what about numbers with more than 20 digits? example:
              PI = "3.1415926535897932384626433832795028841971693993751"
              rewrite with length = ceil(size/20)*20 ?
              no! just copy! (today this function do this part)
              5)leading zeros (today it do)
              "0000000000000000000000000000000000000000000000000000000001"
              must be rewrite to "1" => "00000000000000000001"
              6) special numbers (version and others kinds), example: (it's part of (2) problem)
              '1.1.1' should be understood as:
              integer.integer.integer instead of
              integer.fraction.integer
              1.1.1 (especial) = 00000000000000000001.00000000000000000001.00000000000000000001
              1.1 (not special) = 00000000000000000001.10000000000000000000
              check that if second or third number is bigger than 20 (max number size) it's just copied, but it's not a fraction, it's just a integer

              fraction = number+'.'+number + (next pos!='.' and next next pos is not a digit)
              especial = number+'.'number'.'+number... (end when don't find "."+digit)
              examples
              6.1)"1.1.1" => number+'.'number'.'+number + end of string (!='.'+digit) = special
              6.2)"1.1.1." => "1.1.1" (special) + "."
              number+'.'number'.'number (!='.'+digit) [non special part= '.']
              6.3)"1.1.1.a" => "1.1.1" (special) + ".a"
              number+'.'number'.'number (!='.'+digit) [non special part = '.a']

              Show
              rspadim roberto spadim added a comment - i done some job, as a start point... natstring = canonical string (can use in order by natstring(some_string)) what need to be changed: the number is rewrite as a string with zeropad, but it's a fixed size (20), the size could be a optional parameter example: natstring("aasdfasdf",10), when number is bigger than 10, it's just copied not rewrite with zeropad) what this function should do: natstring("hello 1 person in 2 tables",13); return: "hello 0000000000001 person in 0000000000002 tables" — how to rewrite numbers? 1) 1 vs 01 (today this function do) -> the first idea is use bigint precision 20 digits => "18446744073709551615"/"+-9223372036854775807" "1" and "01" = "00000000000000000001" 2) decimal: 0.1 (today this function don't do in the right way) using the same idea of (1): 0.1 = "0"."1" => "00000000000000000000.00000000000000000001" (today it do, but it's not right...) or maybe understand that's as a fraction: 0.1 = "00000000000000000000.10000000000000000000" (today it don't do, that's the right way) 3) 10e-1 vs 0.1 that's a problem since 10e-1 and 0.1 is the same value, but when you read 10e-1 you don't think about 0.1, i'm right? maybe we can use (2) here -> rewrite to 00000000000000000010e-00000000000000000001 00000000000000000010e-00000000000000000001 vs 00000000000000000000.00000000000000000001 strnatcasecmp('10e-1','0.1') => return "1" (right) strnatcmp('10e-1','0.1') => return "1" (right) 4) what about numbers with more than 20 digits? example: PI = "3.1415926535897932384626433832795028841971693993751" rewrite with length = ceil(size/20)*20 ? no! just copy! (today this function do this part) 5)leading zeros (today it do) "0000000000000000000000000000000000000000000000000000000001" must be rewrite to "1" => "00000000000000000001" 6) special numbers (version and others kinds), example: (it's part of (2) problem) '1.1.1' should be understood as: integer.integer.integer instead of integer.fraction.integer 1.1.1 (especial) = 00000000000000000001.00000000000000000001.00000000000000000001 1.1 (not special) = 00000000000000000001.10000000000000000000 check that if second or third number is bigger than 20 (max number size) it's just copied, but it's not a fraction, it's just a integer fraction = number+'.'+number + (next pos!='.' and next next pos is not a digit) especial = number+'.' number '.'+number... (end when don't find "."+digit) examples 6.1)"1.1.1" => number+'.' number '.'+number + end of string (!='.'+digit) = special 6.2)"1.1.1." => "1.1.1" (special) + "." number+'.' number '.' number (!='.'+digit) [non special part= '.'] 6.3)"1.1.1.a" => "1.1.1" (special) + ".a" number+'.' number '.' number (!='.'+digit) [non special part = '.a']
              Hide
              rspadim roberto spadim added a comment -

              i'm re thinking about the source code of canonical form (natural string), the first scratch was very ugly

              the easier way is:
              find first digit, find where the number ends

              • number= start with [0-9], end with [0-9], may have [0-9] in middle, may have one "." before and after two digits
                copy previous string without double space
                write the number from start digit to end rewriting with the zero padding, and espacial cases / fraction
                find next number or copy the end string removing the double spaces
                end
              Show
              rspadim roberto spadim added a comment - i'm re thinking about the source code of canonical form (natural string), the first scratch was very ugly the easier way is: find first digit, find where the number ends number= start with [0-9] , end with [0-9] , may have [0-9] in middle, may have one "." before and after two digits copy previous string without double space write the number from start digit to end rewriting with the zero padding, and espacial cases / fraction find next number or copy the end string removing the double spaces end

                People

                • Assignee:
                  Unassigned
                  Reporter:
                  rspadim roberto spadim
                • Votes:
                  0 Vote for this issue
                  Watchers:
                  3 Start watching this issue

                  Dates

                  • Created:
                    Updated: