Details
Description
Contents:
- Problem description
- High-level design
- Main assumptions in this task
- General idea
- Detailed design
- Low-level design
- Testing
Problem description
If the evaluation of a dependent single-row subquery is performed always after accessing the last table the subquery depends on the total query execution time may happen to be suboptimal. We can see it in the report for LP bug #914569 .
With default settings (optimizer_switch='semijoin=on,materialization=on') for Q20 of the DBT-3
select sql_calc_found_rows s_name, s_address from supplier, nation where s_suppkey in (select ps_suppkey from partsupp where ps_partkey in (select p_partkey from part where p_name like 'forest%') and ps_availqty > (select 0.5 * sum(l_quantity) from lineitem where l_partkey = ps_partkey and l_suppkey = ps_suppkey and l_shipdate >= date('1994-01-01') and l_shipdate < date('1994-01-01') + interval '1' year )) and s_nationkey = n_nationkey and n_name = 'CANADA' order by s_name limit 10;
the 5.3 optimizer chooses the following plan for the database dbt3x10_myisam
(a myisam DBT-3 database of scale factor 10):
+----+--------------------+----------+--------+--------------------------------------+---------------+---------+------------------------------------+------+----------------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+--------------------+----------+--------+--------------------------------------+---------------+---------+------------------------------------+------+----------------------------------------------+ | 1 | PRIMARY | nation | ALL | PRIMARY | NULL | NULL | NULL | 25 | Using where; Using temporary; Using filesort | | 1 | PRIMARY | supplier | ref | PRIMARY,i_s_nationkey | i_s_nationkey | 5 | dbt3x10_myisam.nation.n_nationkey | 4000 | | | 1 | PRIMARY | partsupp | ref | PRIMARY,i_ps_partkey,i_ps_suppkey | i_ps_suppkey | 4 | dbt3x10_myisam.supplier.s_suppkey | 80 | Using where | | 1 | PRIMARY | part | eq_ref | PRIMARY | PRIMARY | 4 | dbt3x10_myisam.partsupp.ps_partkey | 1 | Using where; FirstMatch(supplier) | | 4 | DEPENDENT SUBQUERY | lineitem | ref | i_l_shipdate,i_l_partkey,i_l_suppkey | i_l_partkey | 5 | dbt3x10_myisam.partsupp.ps_partkey | 30 | Using where | +----+--------------------+----------+--------+--------------------------------------+---------------+---------+------------------------------------+------+--------------------------------------------+
The execution by this plan takes significantly more time ( almost 4 times more) then the execution by the plan:
+----+--------------------+----------+-----------------+--------------------------------------+--------------+---------+-------------------------------------+--------+-----------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+--------------------+----------+-----------------+--------------------------------------+--------------+---------+-------------------------------------+--------+-----------------------------+ | 1 | PRIMARY | supplier | ALL | i_s_nationkey | NULL | NULL | NULL | 100000 | Using where; Using filesort | | 1 | PRIMARY | nation | eq_ref | PRIMARY | PRIMARY | 4 | dbt3x10_myisam.supplier.s_nationkey | 1 | Using where | | 2 | DEPENDENT SUBQUERY | partsupp | index_subquery | i_ps_suppkey | i_ps_suppkey | 4 | func | 80 | Using where | | 4 | DEPENDENT SUBQUERY | lineitem | ref | i_l_shipdate,i_l_partkey,i_l_suppkey | i_l_partkey | 5 | dbt3x10_myisam.partsupp.ps_partkey | 30 | Using where | | 3 | DEPENDENT SUBQUERY | part | unique_subquery | PRIMARY | PRIMARY | 4 | func | 1 | Using where | +----+--------------------+----------+-----------------+--------------------------------------+--------------+---------+-------------------------------------+--------+-----------------------------+
that is chosen with the settings optimizer_switch='semijoin=off,materialization=off'.
The main difference in these two plans is that with the first execution plan the dependent subquery:
(select 0.5 * sum(l_quantity) from lineitem where l_partkey = ps_partkey and l_suppkey = ps_suppkey and l_shipdate >= date('1994-01-01') and l_shipdate < date('1994-01-01') + interval '1' year ))
is evaluated before the table part is accessed while many rows where (p_name like 'forest%') is not true are filtered out and evaluation of the dependent subquery is not needed for them.
The main goal of this task is to implement a cost based decision whether the subquery predicate should be pushed to the table partsupp or to the table part.
High-level design
Main assumptions in this task
- It is not possible to estimate the selectivity of a subquery
predicate during optimization.
- It is possible to estimate the cost of a subquery (mdev-402).
- While currently we can estimate only the selectivity of predicates for which
there exists suitable indexes, the task will assume that it will be possible
to estimate selectivity of most predicates based on the work in mwl#248.
Until mwl#248 can be used, predicate selectivity is estimated through
quick_condition_rows when there is an index (mdev-446).
General idea
The goal of this task is to minimize the number of executions of subqueries in
a join plan. Since join conditions are pushed to the earliest possible
table in the join plan, an expensive condition may be re-attached to any
table later in the plan (but not earlier) in order to minimize the number of
times it is executed.
- The pushdown of subquery predicates is decided for each complete join plan
during join optimization. - For each complete join plan, we find the partial join with the least
cardinality. - Depending on table dependencies, a subquery can be pushed to a join
no earlier than a certain position in the plan. We search the partial
join with the least cardinality after the first position where the
subquery can be pushed. The subquery is marked with the number of this
optimal join position. - The total cost of the plan is updated taking into account that size of
the partial join result, and the subquery cost. - After join optimization, but before the predicate pushdown phase, each
subquery is set to depend on the table that is accessed by the partial
join with least cardinality. - The predicate pushdown phase automatically pushes the subquery to the
right partial join, because it is already marked to depend on it.
Gliffy Diagrams
Attachments
Issue Links
- includes
-
MDEV-4612 SQ pushdown: Server crashes in make_join_statistics with materialization+semijoin, IN subqueries, constant table, impossible condition
-
- Closed
-
-
MDEV-5524 Fix "Subqueries: n" in EXPLAIN for condition pushdown
-
- Open
-
-
MDEV-387 Move the optimization of subqueries earlier, before make_join_select()
-
- Closed
-
- is blocked by
-
MDEV-4145 Take into account the selectivity of single-table range predicates on non-indexed columns when searching for the best execution plan
-
- Closed
-
-
MDEV-5123 Remove duplicated conditions pushed both to join_tab->select_cond and join_tab->cache_select->cond for blocked joins.
-
- Closed
-
- relates to
-
MDEV-4612 SQ pushdown: Server crashes in make_join_statistics with materialization+semijoin, IN subqueries, constant table, impossible condition
-
- Closed
-
-
MDEV-4648 SQ pushdown: Wrong result (missing rows) with materialization+semijoin, IN and ALL subqueries, UNION
-
- Closed
-
-
MDEV-4657 SQ pushdown: Valgrind warnings (Conditional jump or move depends on uninitialised value) in compare_items_by_cost
-
- Closed
-
-
MDEV-4659 SQ pushdown: Valgrind warnings (Conditional jump or move depends on uninitialised value) in best_extension_by_limited_search / greedy_search with expensive_pred_static_pushdown=on
-
- Closed
-
-
MDEV-5178 Valgrind warnings (Conditional jump or move depends on uninitialised value) with static_cond_pushdown=on, FROM subquery
-
- Closed
-
-
MDEV-5201 Assertion `!(*exec_tab) || (*exec_tab >= first_tab && *exec_tab < last_tab)' fails in Item_func_dynamic_cond::val_int() with static_cond_pushdown=on
-
- Closed
-
-
MDEV-5203 Different results with record_cond_statistics=on and record_cond_statistics=off with SOME subquery
-
- Closed
-
-
MDEV-5470 Cost-based subquery item pushdown must take into account semi-joins
-
- Open
-
-
MDEV-383 Evaluate subquery predicates earlier or later depending on their SELECTIVITY
-
- Open
-
Activity
- All
- Comments
- Work Log
- History
- Activity
- Transitions
Discussion results:
We've implemented a solution which worked this way:
S1. choose join plan
S2. attach subquery predicates in a way that minimizes the total cost to execute them.
This solution doesn't work for us, because step S1 chooses the wrong query plan. We need a solution where costs of subquery predicate evaluation affect the join plan choice.
There are two possible solutions:
== Early-opt-1 ==
"given a complete join plan, attach subquery predicates in a way that minimizes their cost" —
note that any two complete join plans are not prefixes/suffixes of each other. Hence, the task
needs to be done from scratch.
We need to handle best plans. That means, we'll need to store two subquery attachments at a time:
1. The one for the best known query plan so far (the plan itself is in join->best_positions)
2. The one for the "incumbent" query plan (the plan itself is in join->positions).
After join optimization is done, we can put the attachment info into Item_subselect items (and let make_join_select() attach them appropriately).
Note: "attachement" here means the table that the subquery predicate depends on, in addition to the tables that it depends on syntactically.
Conclusion: Early-opt-1 is much easier than Early-opt-2.
== Early-opt-2 ==
, and generated optimal subquery predicate attachment OPT_SUBQ_t1t2. Now, we add t3. We need to generate new optimal subquery predicate attachment OPT_SUBQ_t1t2t3. One likely further action is to back track, ie. remove t3 from the join prefix, where we will need to restore OPT_SUBQ_t1t2.
== Real-subquery-location problem ==
A related, but different problem. Suppose the condition has form
func(tbl1, subquery(tbl2, tbl3))
then
are available.
Q: doesn't subquery cache make it irrelevant?
Practically, we will observe difference of this kind:
a join order of tbl2, tbl3, tbl1
we think subquery is attached to tbl3
actually it is attached to tbl1.
However, due to subquery cache, subquery will be executed #rows(tbl2) * #rows(tbl3) times.
Hence, Real-suquery-location-problem can be ignored for now.