Details
-
Type:
Task
-
Status: Open
-
Priority:
Minor
-
Resolution: Unresolved
-
Fix Version/s: None
-
Component/s: Optimizer
-
Labels:None
Description
I believe this is essentially the same issue as MySQL bug #41064, or at least related to it.
The MariaDB optimizer should, at times, reverse a condition to make better use of an index, when all possible values are known, as is the case with an enum.
Consider the following:
CREATE TABLE foobars (
id_foobar int(10) unsigned NOT NULL auto_increment,
status enum('red', 'green', 'blue', 'white') NOT NULL,
PRIMARY KEY (id_foobar),
KEY (status)
) ENGINE=InnodB;
delimiter $$
CREATE PROCEDURE populate_foobars (
c int(10)
)
BEGIN
DECLARE chance smallint(3) unsigned;
DECLARE weighted smallint(3) unsigned;
populate: LOOP
SET chance = FLOOR(RAND() * 100);
IF chance < 80 THEN
SET weighted = 1;
ELSEIF chance < 90 THEN
SET weighted = 2;
ELSEIF chance < 97 THEN
SET weighted = 3;
ELSE
SET weighted = 4;
END IF;
INSERT INTO foobars
(status)
VALUES (weighted);
SET c = c - 1;
IF c = 0 THEN
LEAVE populate;
END IF;
END LOOP populate;
END $$
delimiter ;
CALL populate_foobars(100000);
OPTIMIZE TABLE foobars;
EXPLAIN SELECT *
FROM foobars
WHERE status IN ('blue', 'white');
EXPLAIN SELECT *
FROM foobars
WHERE status NOT IN ('red', 'green');
EXPLAIN SELECT *
FROM foobars
WHERE status != 'red';
EXPLAIN SELECT *
FROM foobars
WHERE status = 'green' OR status = 'blue' OR status = 'white';
In the above, the table's status enum has multiple values with an uneven distribution. 80% are red, while only 3% are white.
Based on the results of the EXPLAIN, MariaDB can detect that IN ('blue', 'white') will examine significantly fewer rows (of the index, presumably) than NOT IN ('red', 'green'), although both options use the same index.
I've seen cases however where it does not choose to use the index at all (probably when there's even greater weight, such as an enum with "active", "inactive", and "preview".) This is more similar to the MySQL bug report noted above. This can cause significant performance concern in larger databases.
On my test machine, the SELECT queries in the above example take approximately 4-5 times as long when issued in their NOT IN variant, compared to the equivalent IN version. This is easiest to test with these:
SELECT COUNT(*) FROM foobars WHERE status IN ('blue', 'white'); SELECT COUNT(*) FROM foobars WHERE status NOT IN ('red', 'green');
Obviously, it is easy for the user to replace NOT IN with IN, but there are semantic reasons for using NOT IN. Aside from the code possibly being more straight-forward to read, it's also more future proof - suppose I might add "yellow" to the list of colors later. Correctly using IN and NOT IN in SQL code means that this future change would have a lower risk of introducing new bugs.
-[Unknown]
Gliffy Diagrams
Attachments
Activity
- All
- Comments
- Work Log
- History
- Activity
- Transitions