Description
Some ideas about using multiple threads to run a query.
== Position at N% of table/index ==
Consider the following queries:
select sum(a) from tbl group by non_key_col
select sum(a) from tbl where key between C1 and C2 group by non_key_col
If we want to run these with N threads, we need to give each thread 1/N of the table. (An alternative is to run a single "reader" thread that distributes work to multiple compute threads. The problem is that reading from the table is then not parallel, which puts a cap on performance.)
To do that, we will need storage engine calls that can:
- "position at N% in the table"
- "position at N% in the index range [C1, C2]".
These calls would also let us build equi-height histograms based on sampling.
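The splitting idea above can be sketched in Python as follows. This is only an illustration under stated assumptions: `ROWS` is a hypothetical in-memory stand-in for a table sorted by key, and `position_at_fraction` is a made-up analogue of the proposed "position at N%" call, not an existing storage engine API. Python threads are used only to show the split/merge structure, not to demonstrate real speedup.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a table: rows of (key, non_key_col, a), sorted by key.
ROWS = [(k, k % 3, k * 10) for k in range(1, 101)]

def position_at_fraction(rows, fraction):
    """Hypothetical analogue of 'position at N% of the table': returns a row offset."""
    return int(len(rows) * fraction)

def partial_sum(rows, start, end):
    """One worker: sum(a) group by non_key_col over rows[start:end]."""
    acc = Counter()
    for _key, group, a in rows[start:end]:
        acc[group] += a
    return acc

def parallel_group_sum(rows, n_threads):
    # Chunk boundaries at 0/N, 1/N, ..., N/N of the table.
    bounds = [position_at_fraction(rows, i / n_threads)
              for i in range(n_threads + 1)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        parts = pool.map(partial_sum,
                         [rows] * n_threads, bounds[:-1], bounds[1:])
        # Merge step: add up the per-thread partial aggregates.
        for part in parts:
            total.update(part)
    return dict(total)
```

The same boundaries, taken over a sorted sample of a column, would also serve as the bucket edges of an equi-height histogram, since each chunk holds roughly the same number of rows.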
== General execution ==
There is a lot of existing work on converting SQL into MapReduce jobs. Is it relevant to this task? The difference seems to be in the Map phase: we assume that the source data is equidistant from all worker threads.
== Evaluation ==
It would be nice to assess how much speedup we will get. To get an idea, we could break the query apart and run the parts manually. The merge step could also be done manually in some cases (by writing to, and reading from, temporary tables).
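A manual experiment of this kind might look like the sketch below. It uses Python's `sqlite3` module with an in-memory database purely for illustration. The table contents, the `partial_sums` temporary table, and the `run_in_parts` helper are all hypothetical. The query is split into contiguous key sub-ranges, which only yields equal-sized parts when keys are evenly distributed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl (key INTEGER PRIMARY KEY, non_key_col INTEGER, a INTEGER)")
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?)",
    [(k, k % 3, k * 10) for k in range(1, 101)])  # made-up sample data

def run_in_parts(conn, c1, c2, n_parts):
    """Run the range query as n_parts sub-queries and merge via a temp table."""
    conn.execute(
        "CREATE TEMP TABLE partial_sums (non_key_col INTEGER, s INTEGER)")
    # Split [c1, c2] into contiguous key sub-ranges (ceiling division).
    step = (c2 - c1 + 1 + n_parts - 1) // n_parts
    for lo in range(c1, c2 + 1, step):
        hi = min(lo + step - 1, c2)
        # Each of these statements is one "part" that a worker would run.
        conn.execute(
            "INSERT INTO partial_sums "
            "SELECT non_key_col, sum(a) FROM tbl "
            "WHERE key BETWEEN ? AND ? GROUP BY non_key_col",
            (lo, hi))
    # Merge step: re-aggregate the partial results.
    merged = conn.execute(
        "SELECT non_key_col, sum(s) FROM partial_sums "
        "GROUP BY non_key_col ORDER BY non_key_col").fetchall()
    conn.execute("DROP TABLE partial_sums")
    return merged
```

Timing each part separately, and the merge on its own, would give a rough upper bound on the speedup: the slowest part plus the merge, compared with the original single query.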
Issue Links
- relates to MDEV-5004 Support parallel read transactions on the same snapshot (Open)
- Comments
Shard-query is able to run a query on multiple CPUs if the queried tables are partitioned. There is some data about how much speedup this brings: http://www.mysqlperformanceblog.com/2014/05/01/parallel-query-mysql-shard-query/ (this is probably not the only data available).
Shard-query has been around for a long time but, for some reason, has not gained much momentum. (An "obvious" technical explanation is that people don't want to partition their tables. But why wouldn't they?)