Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-7383

[patch] engine-independent-stats column_stats has limited values for max/min values

    Details

    • Type: Bug
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.0.15
    • Fix Version/s: 10.0
    • Component/s: Optimizer
    • Labels:
    • Environment:
      Debian Wheezy 10.0.15
    • Sprint:
      10.0.20

      Description

      Engine-independent stats can't parse some values into min/max_value in column stats. The result is a NON NULL empty string. I suspect query plans based on this information won't work too well.

      MariaDB [logsleads]> analyze table GeoIPOrg persistent for all;
      +--------------------+---------+----------+---------------------------------------------------------------------------+
      | Table              | Op      | Msg_type | Msg_text                                                                  |
      +--------------------+---------+----------+---------------------------------------------------------------------------+
      | logsleads.GeoIPOrg | analyze | status   | Engine-independent statistics collected                                   |
      | logsleads.GeoIPOrg | analyze | Warning  | Incorrect string value: '\xD8ybj\xF8r...' for column 'max_value' at row 1 |
      | logsleads.GeoIPOrg | analyze | status   | Table is already up to date                                               |
      +--------------------+---------+----------+---------------------------------------------------------------------------+
      3 rows in set (9.52 sec)
      
      
      | GeoIPOrg | CREATE TABLE `GeoIPOrg` (
        `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
        `ipv4n_start` int(10) unsigned NOT NULL,
        `ipv4n_end` int(10) unsigned NOT NULL,
        `org_name` varchar(100) NOT NULL,
        `bucket` smallint(5) unsigned NOT NULL,
        PRIMARY KEY (`id`),
        UNIQUE KEY `ipv4n_start` (`ipv4n_start`,`ipv4n_end`),
        KEY `bucket` (`bucket`)
      ) ENGINE=MyISAM AUTO_INCREMENT=1292213 DEFAULT CHARSET=latin1 |
      
      MariaDB [mysql]> select *,IF(max_value IS NULL,'NULL','NOT NULL') as max_value_null,hex(max_value) as max_value_hex,length(max_value) as max_value_len from column_stats where db_name='logsleads' and table_name='GeoIPOrg';
      +-----------+------------+-------------+--------------------------------------------------+------------+-------------+------------+---------------+-----------+-----------+-----------+----------------+----------------------+---------------+
      | db_name   | table_name | column_name | min_value                                        | max_value  | nulls_ratio | avg_length | avg_frequency | hist_size | hist_type | histogram | max_value_null | max_value_hex        | max_value_len |
      +-----------+------------+-------------+--------------------------------------------------+------------+-------------+------------+---------------+-----------+-----------+-----------+----------------+----------------------+---------------+
      | logsleads | GeoIPOrg   | bucket      | 256                                              | 57343      |      0.0000 |     2.0000 |       49.4683 |         0 | NULL      | NULL      | NOT NULL       | 3537333433           |             5 |
      | logsleads | GeoIPOrg   | id          | 1                                                | 1292212    |      0.0000 |     4.0000 |        1.0000 |         0 | NULL      | NULL      | NOT NULL       | 31323932323132       |             7 |
      | logsleads | GeoIPOrg   | ipv4n_end   | 16777471                                         | 3758096127 |      0.0000 |     4.0000 |        1.0000 |         0 | NULL      | NULL      | NOT NULL       | 33373538303936313237 |            10 |
      | logsleads | GeoIPOrg   | ipv4n_start | 16777216                                         | 3758095872 |      0.0000 |     4.0000 |        1.0000 |         0 | NULL      | NULL      | NOT NULL       | 33373538303935383732 |            10 |
      | logsleads | GeoIPOrg   | org_name    | ! atlngahpla1_g8/7 to atlngahpla2_g8/7 ospf area |            |      0.0000 |    21.6174 |        2.6580 |         0 | NULL      | NULL      | NOT NULL       |                      |             0 |
      +-----------+------------+-------------+--------------------------------------------------+------------+-------------+------------+---------------+-----------+-----------+-----------+----------------+----------------------+---------------+
      
      select org_name,hex(org_name) from GeoIPOrg order by org_name desc limit 1;
      +------------------+--------------------------+
      | org_name         | hex(org_name)            |
      +------------------+--------------------------+
      | Øybjørg_sæbø     | D879626AF872675F73E662F8 |
      +------------------+--------------------------+
      

        Gliffy Diagrams

          Attachments

            Activity

            Hide
            danblack Daniel Black added a comment -
            MariaDB [mysql]> alter table column_stats modify min_value VARBINARY(255) DEFAULT NULL, modify max_value VARBINARY(255) DEFAULT NULL;
            Query OK, 6907 rows affected (0.03 sec)                
            Records: 6907  Duplicates: 0  Warnings: 0
            
            MariaDB [mysql]> analyze table logsleads.GeoIPOrg persistent for all;
            +--------------------+---------+----------+-----------------------------------------+
            | Table              | Op      | Msg_type | Msg_text                                |
            +--------------------+---------+----------+-----------------------------------------+
            | logsleads.GeoIPOrg | analyze | status   | Engine-independent statistics collected |
            | logsleads.GeoIPOrg | analyze | status   | Table is already up to date             |
            

            Is varbinary(255) a sufficient solution?

            Show
            danblack Daniel Black added a comment - MariaDB [mysql]> alter table column_stats modify min_value VARBINARY(255) DEFAULT NULL, modify max_value VARBINARY(255) DEFAULT NULL; Query OK, 6907 rows affected (0.03 sec) Records: 6907 Duplicates: 0 Warnings: 0 MariaDB [mysql]> analyze table logsleads.GeoIPOrg persistent for all; +--------------------+---------+----------+-----------------------------------------+ | Table | Op | Msg_type | Msg_text | +--------------------+---------+----------+-----------------------------------------+ | logsleads.GeoIPOrg | analyze | status | Engine-independent statistics collected | | logsleads.GeoIPOrg | analyze | status | Table is already up to date | Is varbinary(255) a sufficient solution?
            Hide
            elenst Elena Stepanova added a comment -
            create table t1 (a varchar(100)) engine=MyISAM;
            insert into t1 values(unhex('D879626AF872675F73E662F8'));
            analyze table t1 persistent for all;
            
            +---------+---------+----------+---------------------------------------------------------------------------+
            | Table   | Op      | Msg_type | Msg_text                                                                  |
            +---------+---------+----------+---------------------------------------------------------------------------+
            | test.t1 | analyze | status   | Engine-independent statistics collected                                   |
            | test.t1 | analyze | Warning  | Incorrect string value: '\xD8ybj\xF8r...' for column 'min_value' at row 1 |
            | test.t1 | analyze | Warning  | Incorrect string value: '\xD8ybj\xF8r...' for column 'max_value' at row 1 |
            | test.t1 | analyze | status   | OK                                                                        |
            +---------+---------+----------+---------------------------------------------------------------------------+
            4 rows in set (0.59 sec)
            
            Show
            elenst Elena Stepanova added a comment - create table t1 (a varchar(100)) engine=MyISAM; insert into t1 values(unhex('D879626AF872675F73E662F8')); analyze table t1 persistent for all; +---------+---------+----------+---------------------------------------------------------------------------+ | Table | Op | Msg_type | Msg_text | +---------+---------+----------+---------------------------------------------------------------------------+ | test.t1 | analyze | status | Engine-independent statistics collected | | test.t1 | analyze | Warning | Incorrect string value: '\xD8ybj\xF8r...' for column 'min_value' at row 1 | | test.t1 | analyze | Warning | Incorrect string value: '\xD8ybj\xF8r...' for column 'max_value' at row 1 | | test.t1 | analyze | status | OK | +---------+---------+----------+---------------------------------------------------------------------------+ 4 rows in set (0.59 sec)
            Hide
            danblack Daniel Black added a comment -

            assuming the move to varbinary(255) is an acceptable solution here is the patch and test case.

            Show
            danblack Daniel Black added a comment - assuming the move to varbinary(255) is an acceptable solution here is the patch and test case.
            Hide
            danblack Daniel Black added a comment -

            patch 2 of the statistics.cc to handle as my_charset_bin rather than my_charset_utf8_bin.

            Show
            danblack Daniel Black added a comment - patch 2 of the statistics.cc to handle as my_charset_bin rather than my_charset_utf8_bin.
            Hide
            psergey Sergei Petrunia added a comment -

            Trying to figure out what column_stats.min/max_value should store

            • For numeric types, it stores a string representation of the value (in a latin1-compatible charset).
            • For string-based types, it stores the min/max value in the charset of the field.
            Show
            psergey Sergei Petrunia added a comment - Trying to figure out what column_stats.min/max_value should store For numeric types, it stores a string representation of the value (in a latin1-compatible charset). For string-based types, it stores the min/max value in the charset of the field.
            Hide
            danblack Daniel Black added a comment -

            For string-based types, it stores the min/max value in the charset of the field.

            Makes sense to me. Which would mean changing to a binary type for storage (for compatibility with direct manipulation of the stats tables) (patch as attached) and changing the store/retrieve methods in sql_statistics.cc

            Show
            danblack Daniel Black added a comment - For string-based types, it stores the min/max value in the charset of the field. Makes sense to me. Which would mean changing to a binary type for storage (for compatibility with direct manipulation of the stats tables) (patch as attached) and changing the store/retrieve methods in sql_statistics.cc
            Hide
            danblack Daniel Black added a comment - - edited

            From one of our clients that does a lot of multilingual work:

            MariaDB [mysql]> select count(*) as isnulltotal  from information_schema.columns i join information_schema.tables tt on i.TABLE_SCHEMA=tt.TABLE_SCHEMA and i.TABLE_NAME=tt.TABLE_NAME and TABLE_ROWS>0   join column_stats cs         on (i.TABLE_SCHEMA=cs.db_name and i.TABLE_NAME=cs.table_name and i.column_name=cs.column_name)  and nulls_ratio<1.00   where max_value is null or min_value is null ;
            +-------------+
            | isnulltotal |
            +-------------+
            |        4083 |
            +-------------+
            
            without where clause - total not empty tables 69526
            
            MariaDB [mysql]> select 4083/69526;
            +------------+
            | 4083/69526 |
            +------------+
            |     0.0587 |
            +------------+
            
            

            5.87% of columns couldn't set the value of min or max in the column stats.

            note: updated to exclude empty tables and columsn where all was null which warped results significantly

            Show
            danblack Daniel Black added a comment - - edited From one of our clients that does a lot of multilingual work: MariaDB [mysql]> select count(*) as isnulltotal from information_schema.columns i join information_schema.tables tt on i.TABLE_SCHEMA=tt.TABLE_SCHEMA and i.TABLE_NAME=tt.TABLE_NAME and TABLE_ROWS>0 join column_stats cs on (i.TABLE_SCHEMA=cs.db_name and i.TABLE_NAME=cs.table_name and i.column_name=cs.column_name) and nulls_ratio<1.00 where max_value is null or min_value is null ; +-------------+ | isnulltotal | +-------------+ | 4083 | +-------------+ without where clause - total not empty tables 69526 MariaDB [mysql]> select 4083/69526; +------------+ | 4083/69526 | +------------+ | 0.0587 | +------------+ 5.87% of columns couldn't set the value of min or max in the column stats. note: updated to exclude empty tables and columsn where all was null which warped results significantly
            Hide
            danblack Daniel Black added a comment - - edited

            breakdown:

            MariaDB [mysql]> select i.character_set_name as charset,data_type as dt,count(*) as isnulltotal, t as total
               from information_schema.columns i
               join information_schema.tables tt
                    on i.TABLE_SCHEMA=tt.TABLE_SCHEMA and i.TABLE_NAME=tt.TABLE_NAME and TABLE_ROWS>0
               join column_stats cs
                    on (i.TABLE_SCHEMA=cs.db_name and i.TABLE_NAME=cs.table_name and i.column_name=cs.column_name) and nulls_ratio<1.00
               join (select count(*) as t,character_set_name as c, data_type as dt
                                from information_schema.columns i
                                 join column_stats cs
                                       on (i.TABLE_SCHEMA=cs.db_name and i.TABLE_NAME=cs.table_name and i.column_name=cs.column_name) and nulls_ratio<1.00
                                group by c,dt) itots
                    on (i.character_set_name=itots.c or (i.character_set_name is null and itots.c is null)) and data_type=itots.dt
               where max_value is null or min_value is null
               group by i.character_set_name,dt
               with rollup;
            
            | charset | dt         | isnulltotal | total |
            +---------+------------+-------------+-------+
            | NULL    | blob       |         207 |   207 |
            | NULL    | longblob   |         500 |   509 |
            | latin1  | longtext   |           7 |     7 |
            | latin1  | text       |          18 |    18 |
            | latin1  | tinytext   |           1 |     1 |
            | utf8    | longtext   |        1331 |  1331 |
            | utf8    | mediumtext |         712 |   712 |
            | utf8    | text       |        1293 |  1293 |
            | utf8    | tinytext   |          14 |    14 |
            

            So it looks to need text truncation to the right length at the same time.

            updated to exclude empty tables and those colums with all nulls:

            Show
            danblack Daniel Black added a comment - - edited breakdown: MariaDB [mysql]> select i.character_set_name as charset,data_type as dt,count(*) as isnulltotal, t as total from information_schema.columns i join information_schema.tables tt on i.TABLE_SCHEMA=tt.TABLE_SCHEMA and i.TABLE_NAME=tt.TABLE_NAME and TABLE_ROWS>0 join column_stats cs on (i.TABLE_SCHEMA=cs.db_name and i.TABLE_NAME=cs.table_name and i.column_name=cs.column_name) and nulls_ratio<1.00 join (select count(*) as t,character_set_name as c, data_type as dt from information_schema.columns i join column_stats cs on (i.TABLE_SCHEMA=cs.db_name and i.TABLE_NAME=cs.table_name and i.column_name=cs.column_name) and nulls_ratio<1.00 group by c,dt) itots on (i.character_set_name=itots.c or (i.character_set_name is null and itots.c is null)) and data_type=itots.dt where max_value is null or min_value is null group by i.character_set_name,dt with rollup; | charset | dt | isnulltotal | total | +---------+------------+-------------+-------+ | NULL | blob | 207 | 207 | | NULL | longblob | 500 | 509 | | latin1 | longtext | 7 | 7 | | latin1 | text | 18 | 18 | | latin1 | tinytext | 1 | 1 | | utf8 | longtext | 1331 | 1331 | | utf8 | mediumtext | 712 | 712 | | utf8 | text | 1293 | 1293 | | utf8 | tinytext | 14 | 14 | So it looks to need text truncation to the right length at the same time. updated to exclude empty tables and those colums with all nulls:

              People

              • Assignee:
                psergey Sergei Petrunia
                Reporter:
                danblack Daniel Black
              • Votes:
                1 Vote for this issue
                Watchers:
                4 Start watching this issue

                Dates

                • Created:
                  Updated:

                  Agile