Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-4362

Selectivity estimates for IN (...) do not depend on whether the values are in range

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 10.0.10
    • Component/s: None
    • Labels:

      Description

      Create the dataset:

      create table t5 (col1 int);
      set @a=-1;
      create table one_k (a int)  select (@a:=@a+1) as a from information_schema.session_variables A, information_schema.session_variables B limit 1000;
      insert into t5 select A.a from one_k A, one_k B where A.a < 100 and B.a < 100;
      set histogram_size=100;
      analyze table t5 persistent for all;
      select *, hex(histogram) from mysql.column_stats where table_name='t5'\G
      *************************** 1. row ***************************
             db_name: j10
          table_name: t5
         column_name: col1
           min_value: 0
           max_value: 99
         nulls_ratio: 0.0000
          avg_length: 4.0000
       avg_frequency: 100.0000
           hist_size: 100
           hist_type: 
           histogram:  (100 bytes here)
      

      Ok, so we've got a table with 100 rows of 0, 100 rows of 1, and so forth up to 99.

      Let's see how estimating works:

      MariaDB [j10]> explain extended select * from t5 where col1 in (1,2,3);
      +------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
      | id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows  | filtered | Extra       |
      +------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
      |    1 | SIMPLE      | t5    | ALL  | NULL          | NULL | NULL    | NULL | 10000 |     3.79 | Using where |
      +------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
      1 row in set, 1 warning (10.64 sec)
      

      The real selectivity is 3%, we've got 3.79. Good.

      Now, let's try values that are certainly not in the table:

      MariaDB [j10]> explain extended select * from t5 where col1 in (-1,-2,-3);
      +------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
      | id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows  | filtered | Extra       |
      +------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
      |    1 | SIMPLE      | t5    | ALL  | NULL          | NULL | NULL    | NULL | 10000 |     3.79 | Using where |
      +------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
      

      OOps, again 3.79%.

      Let's see what non-equality range shows:

      MariaDB [j10]> explain extended select * from t5 where col1<=-1;
      +------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
      | id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows  | filtered | Extra       |
      +------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
      |    1 | SIMPLE      | t5    | ALL  | NULL          | NULL | NULL    | NULL | 10000 |     0.99 | Using where |
      +------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
      

      1%. It's better.

      I consider selecitivity obtained for "where col1 in (-1,-2,-3)" to be a bug.

        Gliffy Diagrams

          Attachments

            Issue Links

              Activity

              Hide
              psergey Sergei Petrunia added a comment -

              That is, the issue from comment #2 where filtered=100% suddenly, is gone.

              Show
              psergey Sergei Petrunia added a comment - That is, the issue from comment #2 where filtered=100% suddenly, is gone.
              Hide
              psergey Sergei Petrunia added a comment -

              Patryk Pomykalski, I was leaning towards that idea also.. but then I questioned the selectivity formula, tried to come up with my own formula, and, surprise, it didn't have anything resembling the "width" that the old formula used.

              Show
              psergey Sergei Petrunia added a comment - Patryk Pomykalski , I was leaning towards that idea also.. but then I questioned the selectivity formula, tried to come up with my own formula, and, surprise, it didn't have anything resembling the "width" that the old formula used.
              Hide
              psergey Sergei Petrunia added a comment -

              So, we still have col1 =

              {value_less_than_tables_minimum}

              return the same estimate as col1=

              { value_within_range_of_table_values}

              . (The data distribution is uniform).

              This might be not as bad as it sounds, because minimum and maximum are statistics. The actual table can have values which are greater than the maximum value we've had when collecting statistics (or, correspondingly, less than the minimum... ).

              I have also discovered a problem with very skewed distributions, MDEV-5950. My opinion is that estimates that are produced for this MDEV are satisfactory, while estimates that are produced for MDEV-5950 are not. However, when we fix MDEV-5950, we might fix this one also.

              Show
              psergey Sergei Petrunia added a comment - So, we still have col1 = {value_less_than_tables_minimum} return the same estimate as col1= { value_within_range_of_table_values} . (The data distribution is uniform). This might be not as bad as it sounds, because minimum and maximum are statistics . The actual table can have values which are greater than the maximum value we've had when collecting statistics (or, correspondingly, less than the minimum... ). I have also discovered a problem with very skewed distributions, MDEV-5950 . My opinion is that estimates that are produced for this MDEV are satisfactory, while estimates that are produced for MDEV-5950 are not. However, when we fix MDEV-5950 , we might fix this one also.
              Hide
              psergey Sergei Petrunia added a comment -

              With Fix for MDEV-5926, "Attempt#2", the division-by-zero problem started to appear again (it fails an assert that I've added specifically to catch this).

              Developed a fix for it. getting value_pos=0 (or 1) and the first (or the last) bucket of histogram having zero width is a special case.

              Show
              psergey Sergei Petrunia added a comment - With Fix for MDEV-5926 , "Attempt#2", the division-by-zero problem started to appear again (it fails an assert that I've added specifically to catch this). Developed a fix for it. getting value_pos=0 (or 1) and the first (or the last) bucket of histogram having zero width is a special case.
              Hide
              psergey Sergei Petrunia added a comment -

              Pushed the second variant of the fix for "division by zero" problem. Pushed a testcase.

              Show
              psergey Sergei Petrunia added a comment - Pushed the second variant of the fix for "division by zero" problem. Pushed a testcase.

                People

                • Assignee:
                  psergey Sergei Petrunia
                  Reporter:
                  psergey Sergei Petrunia
                • Votes:
                  0 Vote for this issue
                  Watchers:
                  2 Start watching this issue

                  Dates

                  • Created:
                    Updated:
                    Resolved: