Details

    • Type: Task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Fix Version/s: 10.2
    • Component/s: None
    • Labels:

      Description

      Some ideas about using multiple threads to run a query.

      == Position at N% of table/index ==
      Consider the following queries:

      select sum(a) from tbl group by non_key_col
      
      select sum(a) from tbl where key between C1 and C2 group by non_key_col
      

      If we want to run these with N threads, we need to give 1/Nth of the table to each thread. (An alternative is to run one "reader" thread and distribute the work to multiple compute threads. The problem with this is that reading from the table won't be parallel, which puts a cap on performance.)

      In order to do that, we will need storage engine calls that can:

      • "position at N% of the table"
      • "position at N% of the index range between C1 and C2"

      These calls would also let us build equi-height histograms based on sampling.
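
      No such handler call exists today. As a rough illustration of the intended semantics, the position can be approximated in plain SQL; the sketch below assumes an index on a column `key`, and the constants are illustrative stand-ins for C1, C2 and the range cardinality. A real storage engine call would find the position by descending the B-tree instead of scanning up to the offset.

      -- Approximate "position at 25% of the index range" by scanning.
      -- 100 and 200000 stand in for C1 and C2; 1,000,000 is the
      -- estimated number of rows between the bounds.
      select `key` from tbl
      where `key` between 100 and 200000
      order by `key`
      limit 1 offset 250000;  -- 25% of the estimated 1M rows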

      == General execution ==
      There is a lot of prior work on converting SQL into MapReduce jobs. Is it relevant to this task? The difference seems to be in the Map phase: here we assume that the source data is equally close (equi-distant) to all worker threads.

      == Evaluation ==
      It would be nice to assess how much speedup we would get. To get an idea, we could break the query apart and run the parts manually. In some cases the merge step could also be done manually, by writing to and reading from temporary tables, as sketched below.
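
      For example, the first query above could be split by hand into two range scans plus a merge step. This is only a sketch: it assumes an indexed column `pk` whose values cover [0, 1000000), and it uses ordinary tables for the intermediate results because TEMPORARY tables are not visible across connections.

      -- Worker 1 (run in its own connection):
      create table part1 as
        select non_key_col, sum(a) as s from tbl
        where pk >= 0 and pk < 500000 group by non_key_col;

      -- Worker 2 (run concurrently in a second connection):
      create table part2 as
        select non_key_col, sum(a) as s from tbl
        where pk >= 500000 and pk < 1000000 group by non_key_col;

      -- Merge step: SUM() is decomposable, so re-aggregating the
      -- partial sums gives the same result as the original query.
      select non_key_col, sum(s) as sum_a
      from (select * from part1 union all select * from part2) parts
      group by non_key_col;

      Timing the workers running concurrently against the original single-threaded query gives a first estimate of the achievable speedup.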

              Activity

              stephane@skysql.com VAROQUI Stephane added a comment - edited

              In my experience, 99.999% of workloads use ORDER BY ... LIMIT, and many have trouble with such queries. Most clients stop using an RDBMS for that processing: they now use external indexers (Sphinx, Solr, Elasticsearch), not for full-text search but for this. It is the heart of every back office that presents data in a single page based on the user's choice of filters and sort order.

              Partitioning is indeed used for most big tables, because a B-tree stops performing once the table no longer fits in memory. I would say that many workloads do not use covering indexes, so every secondary-index lookup references the clustered table, keeping such tables hot most of the time.

              There are more reasons to use a parallel plan. Most storage engines have no I/O materialization layer for reads: every read that goes from a secondary index to the clustered table is queued in a single execution thread, so on a SAN or a networked FS, 5 ms of latency is added for every missed page. The total latency of such queries can be divided by the concurrency, as long as we don't ask for more than the total I/O read capacity. Prefetching is only used when scanning an index or the clustered index, not when using both on the same table, or for joins to a secondary index without MRR. So we have added MRR, but InnoDB does exactly what it did before, because of this two-level index scan from secondary to primary index.

              Now, Spider partitioning can be created on the fly with a temporary table, to create a virtual partitioning based on the query plan. What needs to be added is partition condition push-down to every node in the reduce thread. That way, every thread can work on a different range of the same table by inheriting the range conditions.
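
              For reference, a minimal sketch of what Spider range partitioning over two backends looks like today; the server names, host addresses and split point are illustrative, and per-partition condition push-down is the part that still needs work.

              -- Declare the remote backends once.
              create server node1 foreign data wrapper mysql
                options (host '10.0.0.1', database 'test', user 'spider', password '', port 3306);
              create server node2 foreign data wrapper mysql
                options (host '10.0.0.2', database 'test', user 'spider', password '', port 3306);

              -- A Spider table whose partitions map ranges of pk to the two
              -- backends; a range condition on pk lets each node work on its
              -- own slice of the same logical table.
              create table tbl (
                pk int not null,
                a int,
                primary key (pk)
              ) engine=spider
              partition by range (pk) (
                partition p0 values less than (500000) comment = 'srv "node1", table "tbl"',
                partition p1 values less than maxvalue comment = 'srv "node2", table "tbl"'
              );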

              rspadim roberto spadim added a comment - edited

              1) I don't like the idea that "multi-threading only works if you partition your data", or "create a table to use multiple threads". Maybe implicit optimizations could be done, and the documentation could explain the other methods. We could also use parallel execution for the ORDER BY part (why not?).

              2) I think this MDEV will solve problems like the following.

              To "get" data (a single SELECT):
              2.1) To execute query "xyz" we need 99999 I/Os; we get 99 IOPS with 1 thread, 99 with 2 threads, and 99 with 9999 threads.
              2.2) To execute query "xyzq" we need 99999 I/Os; we get 9 IOPS with 1 thread, 99 with 2 threads, and 999 with 9999 threads.
              2.3) What is the best IOPS/thread, how many threads do we need, and in what order should the data be fetched from the tables?
              From 2.1: 99/1 = 99 IOPS/thread, 99/2 = 49.5 IOPS/thread, ..., 99/9999 = 0.009 IOPS/thread => use 1 thread.
              From 2.2: 9/1 = 9 IOPS/thread, 99/2 = 49.5 IOPS/thread, ..., 999/9999 = 0.09 IOPS/thread => use 2 threads.

              For ORDER BY, GROUP BY, LIMIT, UNION (after fetching the data, or when deciding how to execute many selects):
              2.4) With 1 thread we can sort this data in 10 seconds, with 2 threads in 5 seconds, with 1000 threads in 1 second, and with 100000 threads in 0.99 seconds.
              Again, something like seconds/thread, or a cutoff on the marginal gain of adding a thread, could pick the best thread count (like evaluating sin/cos to a given error tolerance: going from 1 to 2 threads cuts the time by 50%, going from 2 to 3 cuts it by only 1%, so use 2 threads, since 1% is not a worthwhile reduction).

              I don't know how to easily expose these statistics to the user/optimizer, or how to limit the number of threads that can be created for each part of the query (fetch, order, group, limit, union, etc.).

              3) About the ORDER BY / LIMIT workload: I agree, read/write contention while executing an ORDER BY is a killer with some apps, and I don't know whether we can remove it. My current workaround is to reduce lock contention by committing before ordering, and then ordering with other engines or other tools.

              I really like the idea of "database, please order by the fastest way you can; I don't care whether the ordering is perfect, just order it with a low response time".
              Example with big data (or with a union of fast and slow queries): for a big ORDER BY, share the data across 4 threads and let each thread sort 25% of it. If we return the result without a final merge, we get something like "abd, cde, def, efg" (not totally ordered, but a useful order at least); with the final merge we get "abcdddeeeffg". See the sketch below.
              I could accept the first result if it saves, say, a minute, or whatever time is critical to other threads, resources or users.
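
              A hedged sketch of that idea in plain SQL (the column names and range bounds are illustrative). Each branch sorts its own quarter of the data, and the branches are concatenated without a final ORDER BY. Note that servers are free to ignore an ORDER BY inside a UNION ALL branch unless a LIMIT is attached, so this only illustrates the intent; a real implementation would live in the executor.

              -- Each branch is ordered within its pk range only; the outer
              -- query applies no final sort, giving the "abd, cde, def, efg"
              -- style of result. 1000000 is just a large-enough LIMIT.
              (select * from tbl where pk < 250000 order by sort_col limit 1000000)
              union all
              (select * from tbl where pk >= 250000 and pk < 500000 order by sort_col limit 1000000)
              union all
              (select * from tbl where pk >= 500000 and pk < 750000 order by sort_col limit 1000000)
              union all
              (select * from tbl where pk >= 750000 order by sort_col limit 1000000);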

              stephane@skysql.com VAROQUI Stephane added a comment -

              From what I have learned, ORDER BY is not such a costly operation: it is the creation of an index on the results, which costs many CPU cycles without much latency between cycles, and InnoDB can do this in background threads by merging the insert buffer. If you don't have the memory you are dead anyway, and a parallel query will not help. Also, ORDER BY ... LIMIT can be optimized by keeping the best-ordered rows from all streams (LRU-style), so attention should go to fetching data in parallel, and to the granularity of such fetches. If you take for granted that a single core can fetch 600K to 1M records/sec from memory, then even with enough memory, SUM(C1) over a billion-record table still takes around 20 minutes. With a 64-core machine and enough memory I would like that result in a few seconds, and with 20 slaves I would like the answer in less than a second. Instead of trying to make every type of operation multi-threaded, it is easier to partition the big table.
              What is the right number of partitions? The number of partitions defines the size of a chunk, or job. This can be decided by a coordinator thread (the mapper in MapReduce). The job size should limit the number of round trips, so that prefetching and MRR work correctly, and limit the CPU the coordinator needs to reduce the worker results; but it should also be small enough to let the coordinator schedule a dynamic pool of threads driven by monitoring the jobs/s rate, like what we do today when running benchmarks: we increase concurrency until it stops helping. Backing off once peak performance is reached can be done by oscillating the thread count (-1, +1, -1, +1). On a busy concurrent server this would allocate no more than about 1.5 threads to each range query.

              The optimizer should be adapted for range queries. Historically the optimizer computes the plan with the lowest cost without knowing the disk/memory I/O distribution. For most OLAP queries this does not work, and the best plan almost always goes from the big table to the small tables. Small tables should be materialized so that filters can be applied and hash or MRR joins kept as small as possible.
              The range optimizer could try to detect this situation by computing (number of records in the range) * (average record size) * (memory-loss factor, about 2 for InnoDB) and comparing it to the memory available to the storage engine. That can be enough to invalidate a plan that would jump from a secondary index into such a big table. Say the optimizer found a plan of 1K reads + 10K refs going from the small table to the big table, but from the previous computation we deduce a 50% page-miss ratio: that produces on the order of 1000 random I/Os, which takes 10 s on a spinning disk. In those 10 seconds I can prefetch 8 million rows, minus the cost of 8 million eq_ref lookups in memory, and with 64 threads I can prefetch 64 * 8M records with only 640 jumps on disk. That drastically changes some optimizer rules on spinning disks!
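
              As a worked version of that check (all numbers illustrative), the heuristic amounts to arithmetic the optimizer could run per range:

              -- Hypothetical back-of-envelope test: does the touched range
              -- fit in the memory available to the storage engine?
              select 10000               -- records in the range
                   * 200                 -- average record size, bytes
                   * 2                   -- InnoDB memory-loss factor
                   < 128 * 1024 * 1024   -- memory available to the engine
                as range_fits_in_memory; -- 4 MB < 128 MB => 1 (plan stands)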

              rspadim roberto spadim added a comment -

              I agree, and I see that, given an I/O scheduler doing a good job, multiple threads generate more IOPS than a single thread, and a sharded/clustered table engine creates more "workers" than a single thread; no doubt about it.

              My doubt remains: how many threads should be used for each part of the query? How do we optimize that number? Should we execute every part multi-threaded, or should some parts stay single-threaded?

              Some people in the ARM world talk about 65K cores in a single machine; on Intel x86-64 we talk about 128 cores. I don't know if we should consider only the size of the data: CPU power will create some complex scenarios in the future, like many cores with small compute power (ARM) versus fewer cores with big compute power (x86-64 Xeon).
              I have no idea yet how to optimize the number of threads. We have the number of cores, the compute power of each core (ARM/x86/etc.), the size of the data, storage IOPS, storage read/write rates, and memory (cache/buffers/access rate/CPU cache/etc.).

              f_razzoli Federico Razzoli added a comment -

              I totally agree with Varoqui about the need for hash joins - this is probably the most important limitation that makes MariaDB not viable for big data warehouses.

              Any realistic chance of seeing this in 10.2?


                People

                • Assignee:
                  Unassigned
                  Reporter:
                  psergey Sergei Petrunia
                • Votes:
                  5 Vote for this issue
                  Watchers:
                  7 Start watching this issue

                  Dates

                  • Created:
                    Updated: