CSE-4/562 Spring 2019 - Cost-Based Optimization

Cost-Based Optimization

CSE-4/562 Spring 2019

March 1, 2019

Textbook: Ch. 15.1-15.5, 16.2-16.3, 16.7

Checkpoint 2

sif$ javac -cp build:*.jar -D build {all .java files} sif$ java -cp build:jsqlparser.jar:evallib.jar dubstep.Main
 - --preload --in-mem $> CREATE TABLE R( A int, B int, C int, PRIMARY KEY (A), INDEX B_INDEX (B) ); $> SELECT A, SUM(C) FROM R WHERE B < 5 GROUP BY A; 1|5 ... $>

Checkpoint 2

  • Two parts, one in-memory, one with more data than memory.
  • New SQL features (Group-By/Aggregates, Limit, Order-By)
  • Tighter Constraints (Bigger data and less time/query)
--in-mem
Java will be allowed to use 1GB of heap space (-Xmx1g), and the CSV representation of all input data data will be under 50MB (slightly bigger than checkpoint 1).
--on-disk
Java will be allowed to use 150MB of heap space (-Xmx150m), and the textual representation of your data will be over 100MB.

As a general guideline, Java becomes incredibly slow once you hit 50% memory usage as you spend more time in the garbage collector than in your code.

New SQL Features: Sort

ORDER BY col1 asc/desc, col2 asc/desc, ...
Sort the data on col1 (using col2, col3, ... as tiebreakers) in ascending or descending order.
  • You will likely need 2 implementations:
    • In-Memory (Reference Impl Uses Java's Collections.sort)
    • On-Disk (Reference Impl Uses 2-Pass Sort)

New SQL Features: Limit

LIMIT N
Return only the first N rows. If the data is sorted, return the first rows according to the sort order. If not, return an arbitrary N rows.

JSQLParser also supports more expressive limit clauses (e.g., including offsets). You will not be required to support anything more complex than LIMIT N.

New SQL Features: Group-By/Aggregates

SELECT SUM(C), ... FROM ...
Apply the aggregate function to all rows of data.
SELECT A, B, SUM(C), ... FROM ... GROUP BY A, B
Group the data by the A and B columns and apply the aggregate functions to each group.

Handling Tighter Constraints

Selection Pushdown
Push selections down through projections, joins, unions; as close to the data as possible
Join Conversion
Identify useful join conditions (e.g., $R.B = S.B$) and replace results with a better Join algorithm.

$\sigma_{c_1 \wedge c_2 \wedge c_3}(R \times S) = \sigma_{c_1}(R) \bowtie_{c_2} \sigma_{c_3}(S)$

$\sigma(\pi(...)) \rightarrow \pi(\sigma(...))$?

Pattern: A SelectIterator with a ProjectIterator as an input.

Pattern Matching

Remember pattern matching on Statement and Expression objects? Try it on Iterators!

    Iterator optimize(Iterator query) {
      if(query instanceof SelectIterator){

        // ...

      }
    }
    

Use instanceof to identify the root of your pattern


    Iterator optimize(Iterator query) {
      if(query instanceof SelectIterator){
        SelectIterator select = (SelectIterator)query;

        // ...

      }
    }
    

Cast to the desired type.


    Iterator optimize(Iterator query) {
      if(query instanceof SelectIterator){
        SelectIterator select = (SelectIterator)query;
        if(select.input instanceof ProjectIterator){
          ProjectIterator project = (ProjectIterator)select.input;

          // ...

        }
      }
    }
    

Match children (e.g., ProjectIterator).


    Iterator optimize(Iterator query) {
      if(query instanceof SelectIterator){
        SelectIterator select = (SelectIterator)query;
        if(select.input instanceof ProjectIterator){
          ProjectIterator project = (ProjectIterator)select.input;

          query = new ProjectIterator(
            new SelectIterator(
              project.input
            )
          );
        }
      }
    }
    

Compute a replacement


    Iterator optimize(Iterator query) {
      if(query instanceof SelectIterator){
        SelectIterator select = (SelectIterator)query;
        if(select.input instanceof ProjectIterator){
          ProjectIterator project = (ProjectIterator)select.input;

          query = new ProjectIterator(
            new SelectIterator(
              project.input
            )
          );
        }
      }
      query.input = optimize(query.input);
      return query;
    }
    

Recur to find patterns nested in the children

Handling Tighter Constraints

Selection Pushdown
Push selections down through projections, joins, unions; as close to the data as possible
Join Conversion
Identify useful join conditions (e.g., $R.B = S.B$) and replace results with a better Join algorithm.

One-Pass Hash Join

The reference implementation uses Java's HashMap.

Keep in-mind that you may need to hash multiple tuples to the same join key.

Sort-Merge Join

The join itself is easy once you have a sort operator. When running --on-disk the reference implementation uses Sort-Merge.

Extended Deadline

Due: March 29

Questions?