Checkpoint 2

Two parts, one in-memory, one with more data than memory.
New SQL features (Group-By/Aggregates, Limit, Order-By)
Tighter Constraints (Bigger data and less time/query)

--in-mem: Java will be allowed to use 1GB of heap space (-Xmx1g), and the CSV representation of all input data data will be under 50MB (slightly bigger than checkpoint 1).
--on-disk: Java will be allowed to use 150MB of heap space (-Xmx150m), and the textual representation of your data will be over 100MB.

As a general guideline, Java becomes incredibly slow once you hit 50% memory usage as you spend more time in the garbage collector than in your code.

New SQL Features: Sort

ORDER BY col1 asc/desc, col2 asc/desc, ...: Sort the data on col1 (using col2, col3, ... as tiebreakers) in ascending or descending order.

You will likely need 2 implementations:
- In-Memory (Reference Impl Uses Java's Collections.sort)
- On-Disk (Reference Impl Uses 2-Pass Sort)

New SQL Features: Limit

LIMIT N: Return only the first N rows. If the data is sorted, return the first rows according to the sort order. If not, return an arbitrary N rows.

JSQLParser also supports more expressive limit clauses (e.g., including offsets). You will not be required to support anything more complex than LIMIT N.

New SQL Features: Group-By/Aggregates

SELECT SUM(C), ... FROM ...: Apply the aggregate function to all rows of data.
SELECT A, B, SUM(C), ... FROM ... GROUP BY A, B: Group the data by the A and B columns and apply the aggregate functions to each group.

Handling Tighter Constraints

Selection Pushdown: Push selections down through projections, joins, unions; as close to the data as possible
Join Conversion: Identify useful join conditions (e.g., $R.B = S.B$) and replace results with a better Join algorithm.

$\sigma_{c_1 \wedge c_2 \wedge c_3}(R \times S) = \sigma_{c_1}(R) \bowtie_{c_2} \sigma_{c_3}(S)$

$\sigma(\pi(...)) \rightarrow \pi(\sigma(...))$?

Pattern: A SelectIterator with a ProjectIterator as an input.

Pattern Matching

Remember pattern matching on Statement and Expression objects? Try it on Iterators!


    Iterator optimize(Iterator query) {
      if(query instanceof SelectIterator){

        // ...

      }
    }

Use instanceof to identify the root of your pattern


    Iterator optimize(Iterator query) {
      if(query instanceof SelectIterator){
        SelectIterator select = (SelectIterator)query;

        // ...

      }
    }

Cast to the desired type.


    Iterator optimize(Iterator query) {
      if(query instanceof SelectIterator){
        SelectIterator select = (SelectIterator)query;
        if(select.input instanceof ProjectIterator){
          ProjectIterator project = (ProjectIterator)select.input;

          // ...

        }
      }
    }

Match children (e.g., ProjectIterator).


    Iterator optimize(Iterator query) {
      if(query instanceof SelectIterator){
        SelectIterator select = (SelectIterator)query;
        if(select.input instanceof ProjectIterator){
          ProjectIterator project = (ProjectIterator)select.input;

          query = new ProjectIterator(
            new SelectIterator(
              project.input
            )
          );
        }
      }
    }

Compute a replacement


    Iterator optimize(Iterator query) {
      if(query instanceof SelectIterator){
        SelectIterator select = (SelectIterator)query;
        if(select.input instanceof ProjectIterator){
          ProjectIterator project = (ProjectIterator)select.input;

          query = new ProjectIterator(
            new SelectIterator(
              project.input
            )
          );
        }
      }
      query.input = optimize(query.input);
      return query;
    }

Recur to find patterns nested in the children

Handling Tighter Constraints

Selection Pushdown: Push selections down through projections, joins, unions; as close to the data as possible
Join Conversion: Identify useful join conditions (e.g., $R.B = S.B$) and replace results with a better Join algorithm.

One-Pass Hash Join

The reference implementation uses Java's HashMap.

Keep in-mind that you may need to hash multiple tuples to the same join key.

Sort-Merge Join

The join itself is easy once you have a sort operator. When running --on-disk the reference implementation uses Sort-Merge.

Extended Deadline

Due: March 29

Cost-Based Optimization

CSE-4/562 Spring 2019

Textbook: Ch. 15.1-15.5, 16.2-16.3, 16.7

Checkpoint 2

Checkpoint 2

New SQL Features: Sort

New SQL Features: Limit

New SQL Features: Group-By/Aggregates

Handling Tighter Constraints

Pattern Matching

Handling Tighter Constraints

One-Pass Hash Join

Sort-Merge Join

Extended Deadline

Questions?