CSE-4/562 Spring 2021 - Cost-Based Optimization

March 9, 2021

General Query Optimizers

1. Apply blind heuristics (e.g., push down selections)
2. Enumerate all possible execution plans (or a reasonable subset) by varying:
• Join/Union Evaluation Order (commutativity, associativity, distributivity)
• Algorithms for Joins, Aggregates, Sort, Distinct, and others
• Data Access Paths
3. Estimate the cost of each execution plan
4. Pick the execution plan with the lowest cost
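
The enumerate / estimate / pick loop above can be sketched in a few lines. This is a toy illustration, not any real optimizer: the table sizes, the single shared selectivity, and the "sum of intermediate result sizes" cost function are all made-up assumptions.

```python
from itertools import permutations

# Hypothetical table sizes in tuples (illustrative, not from the lecture).
SIZES = {"R": 1000, "S": 100, "T": 10}

def plan_cost(order, sizes, selectivity=0.01):
    """Toy cost: total size of the intermediate results of a left-deep join.

    Assumes every join predicate has the same (made-up) selectivity.
    """
    rows = sizes[order[0]]
    cost = 0.0
    for name in order[1:]:
        rows = rows * sizes[name] * selectivity  # estimated join output size
        cost += rows
    return cost

def best_plan(sizes):
    candidates = permutations(sizes)  # step 2: enumerate join orders
    # Steps 3 and 4: estimate each candidate's cost and keep the cheapest.
    return min(candidates, key=lambda order: plan_cost(order, sizes))

best = best_plan(SIZES)
print(best, plan_cost(best, SIZES))
```

With these made-up sizes, the cheapest orders join the two small tables first and bring in the largest table last.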

Idea 1: Run each plan

If we can't get the exact cost of a plan, what can we do?

Idea 2: Run each plan on a small sample of the data.

Idea 3: Analytically estimate the cost of a plan.

Plan Cost

CPU Time
How much time is spent processing.
# of IOs
How many random reads + writes go to disk.
Memory Required
How much memory do you need.
Image credit: Randall Munroe (cc-by-nc)

Remember the Real Goals

1. Accurately rank the plans.
2. Don't spend more time optimizing than you get back.
3. Don't pick a plan that uses more memory than you have.

Accounting

Figure out the IO cost of the entire* subtree.

Only count the amount of memory added by each operator.

* Different from earlier in the semester.

Operation | RA | Total IOs (#pages) | Memory (#tuples)
Table Scan | $R$ | $\frac{|R|}{\mathcal P}$ | $O(1)$
Projection | $\pi(R)$ | $\textbf{io}(R)$ | $O(1)$
Selection | $\sigma(R)$ | $\textbf{io}(R)$ | $O(1)$
Union | $R \uplus S$ | $\textbf{io}(R) + \textbf{io}(S)$ | $O(1)$
Sort (In-Mem) | $\tau(R)$ | $\textbf{io}(R)$ | $O(|R|)$
Sort (On-Disk) | $\tau(R)$ | $\frac{2 \cdot \lfloor \log_{\mathcal B}(|R|) \rfloor}{\mathcal P} + \textbf{io}(R)$ | $O(\mathcal B)$
(B+Tree) Index Scan | $Index(R, c)$ | $\log_{\mathcal I}(|R|) + \frac{|\sigma_c(R)|}{\mathcal P}$ | $O(1)$
(Hash) Index Scan | $Index(R, c)$ | $1$ | $O(1)$
1. Tuples per Page ($\mathcal P$) – Normally defined per-schema
2. Size of $R$ ($|R|$)
3. Pages of Buffer ($\mathcal B$)
4. Keys per Index Page ($\mathcal I$)
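
As a rough sketch of how two of these formulas might be evaluated, the snippet below compares a full table scan against a B+Tree index scan. Rounding up to whole pages is an assumption added here, and all the numbers in the example are hypothetical.

```python
import math

def io_table_scan(n_tuples, tuples_per_page):
    """|R| / P, rounded up to whole pages (the rounding is an assumption)."""
    return math.ceil(n_tuples / tuples_per_page)

def io_btree_index_scan(n_tuples, n_matching, tuples_per_page, keys_per_index_page):
    """log_I(|R|) + |sigma_c(R)| / P, each term rounded up to whole pages."""
    descent = math.ceil(math.log(n_tuples, keys_per_index_page))
    leaf_pages = math.ceil(n_matching / tuples_per_page)
    return descent + leaf_pages

# Hypothetical parameters: |R| = 1,000,000, P = 100, I = 200, 1,000 matches.
scan_ios = io_table_scan(1_000_000, 100)
index_ios = io_btree_index_scan(1_000_000, 1_000, 100, 200)
print(scan_ios, index_ios)
```

With a selective condition the index scan touches far fewer pages, which is why the estimate of $|\sigma_c(R)|$ matters so much.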
Operation | RA | Total IOs (#pages) | Memory (#tuples)
Nested Loop Join (Buffer $S$ in mem) | $R \times_{mem} S$ | $\textbf{io}(R)+\textbf{io}(S)$ | $O(|S|)$
Block NLJ (Buffer $S$ on disk) | $R \times_{disk} S$ | $\frac{|R|}{\mathcal B} \cdot \frac{|S|}{\mathcal P} + \textbf{io}(R) + \textbf{io}(S)$ | $O(1)$
Block NLJ (Recompute $S$) | $R \times_{redo} S$ | $\textbf{io}(R) + \frac{|R|}{\mathcal B} \cdot \textbf{io}(S)$ | $O(1)$
1-Pass Hash Join | $R \bowtie_{1PH, c} S$ | $\textbf{io}(R) + \textbf{io}(S)$ | $O(|S|)$
2-Pass Hash Join | $R \bowtie_{2PH, c} S$ | $\frac{2|R| + 2|S|}{\mathcal P} + \textbf{io}(R) + \textbf{io}(S)$ | $O(1)$
Sort-Merge Join | $R \bowtie_{SM, c} S$ | [Sort] | [Sort]
(Tree) Index NLJ | $R \bowtie_{INL, c} S$ | $|R| \cdot (\log_{\mathcal I}(|S|) + \frac{|\sigma_c(S)|}{\mathcal P})$ | $O(1)$
(Hash) Index NLJ | $R \bowtie_{INL, c} S$ | $|R| \cdot 1$ | $O(1)$
(In-Mem) Aggregate | $\gamma_A(R)$ | $\textbf{io}(R)$ | $adom(A)$
(Sort/Merge) Aggregate | $\gamma_A(R)$ | [Sort] | [Sort]
1. Tuples per Page ($\mathcal P$) – Normally defined per-schema
2. Size of $R$ ($|R|$)
3. Pages of Buffer ($\mathcal B$)
4. Keys per Index Page ($\mathcal I$)
5. Number of distinct values of $A$ ($adom(A)$)
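
A minimal sketch of how the join rows might be compared, following the formulas as written and assuming both inputs are base tables (so $\textbf{io}$ is just a table scan). The function names, the memory budget, and the example sizes are hypothetical, and $O(1)$ memory is represented as a single tuple.

```python
import math

def io_scan(n_tuples, tuples_per_page):
    return math.ceil(n_tuples / tuples_per_page)

def join_options(r, s, P, B):
    """Return {algorithm: (total IOs, memory in tuples)} for three joins."""
    io_r, io_s = io_scan(r, P), io_scan(s, P)
    return {
        # Buffer all of S in memory: read each input once.
        "nlj_in_mem": (io_r + io_s, s),
        # Recompute S once per block of R.
        "block_nlj_redo": (io_r + math.ceil(r / B) * io_s, 1),
        # Write and re-read partitions of both inputs.
        "two_pass_hash": (math.ceil((2 * r + 2 * s) / P) + io_r + io_s, 1),
    }

def pick(options, memory_budget):
    """Cheapest plan among those that fit in the memory budget."""
    feasible = {k: v for k, v in options.items() if v[1] <= memory_budget}
    return min(feasible, key=lambda k: feasible[k][0])

# Hypothetical sizes: |R| = 100,000, |S| = 50,000, P = 100, B = 1,000.
opts = join_options(100_000, 50_000, 100, 1_000)
print(pick(opts, memory_budget=100_000), pick(opts, memory_budget=1_000))
```

The memory filter reflects the goal of never picking a plan that needs more memory than is available: the in-memory join wins only when $S$ fits.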
Symbol | Parameter | Type
$\mathcal P$ | Tuples per Page | Fixed ($\frac{|\text{page}|}{|\text{tuple}|}$)
$|R|$ | Size of $R$ | Precomputed$^*$ ($|R|$)
$\mathcal B$ | Pages of Buffer | Configurable Parameter
$\mathcal I$ | Keys per Index Page | Fixed ($\frac{|\text{page}|}{|\text{key+pointer}|}$)
$adom(A)$ | Number of distinct values of $A$ | Precomputed$^*$ ($|\delta_A(R)|$)

* unless $R$ is a query
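
For instance, the two fixed parameters might be derived as follows. The byte sizes are hypothetical, and using integer division (no partial tuples or keys on a page) is an assumption.

```python
PAGE_BYTES = 4096     # hypothetical page size
TUPLE_BYTES = 64      # hypothetical fixed-width tuple
KEY_PTR_BYTES = 16    # hypothetical key + pointer size

tuples_per_page = PAGE_BYTES // TUPLE_BYTES          # P
keys_per_index_page = PAGE_BYTES // KEY_PTR_BYTES    # I
print(tuples_per_page, keys_per_index_page)
```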

Estimating IOs requires estimating $|Q(R)|$ and $|\delta_A(Q(R))|$

Cardinality Estimation

Unlike estimating IOs, cardinality estimation doesn't care about the algorithm, so we'll just be working with raw RA.

Operator | RA | Estimated Size
Table | $R$ | $|R|$
Projection | $\pi(Q)$ | $|Q|$
Union | $Q_1 \uplus Q_2$ | $|Q_1| + |Q_2|$
Cross Product | $Q_1 \times Q_2$ | $|Q_1| \times |Q_2|$
Sort | $\tau(Q)$ | $|Q|$
Limit | $\texttt{LIMIT}_N(Q)$ | $N$
Selection | $\sigma_c(Q)$ | $|Q| \times \texttt{SEL}(c, Q)$
Join | $Q_1 \bowtie_c Q_2$ | $|Q_1| \times |Q_2| \times \texttt{SEL}(c, Q_1\times Q_2)$
Distinct | $\delta_A(Q)$ | $\texttt{UNIQ}(A, Q)$
Aggregate | $\gamma_{A, B \leftarrow \Sigma}(Q)$ | $\texttt{UNIQ}(A, Q)$
• $\texttt{SEL}(c, Q)$: Selectivity of $c$ on $Q$, or $\frac{|\sigma_c(Q)|}{|Q|}$
• $\texttt{UNIQ}(A, Q)$: # of distinct values of $A$ in $Q$.
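
The table translates fairly directly into a recursive estimator over an RA tree. In this sketch the nested-tuple encoding of the tree is a hypothetical choice, and $\texttt{SEL}$ and $\texttt{UNIQ}$ are simplified to plain lookups (in general both depend on the input $Q$).

```python
def estimate(node, sizes, sel, uniq):
    """Estimate |Q| for an RA tree encoded as nested tuples."""
    op = node[0]
    if op == "table":                      # ("table", name)
        return sizes[node[1]]
    if op in ("project", "sort"):          # (op, child)
        return estimate(node[1], sizes, sel, uniq)
    if op == "union":                      # ("union", left, right)
        return (estimate(node[1], sizes, sel, uniq)
                + estimate(node[2], sizes, sel, uniq))
    if op == "cross":                      # ("cross", left, right)
        return (estimate(node[1], sizes, sel, uniq)
                * estimate(node[2], sizes, sel, uniq))
    if op == "limit":                      # ("limit", n, child)
        return node[1]                     # the table uses N directly
    if op == "select":                     # ("select", cond, child)
        return estimate(node[2], sizes, sel, uniq) * sel[node[1]]
    if op == "join":                       # ("join", cond, left, right)
        return (estimate(node[2], sizes, sel, uniq)
                * estimate(node[3], sizes, sel, uniq) * sel[node[1]])
    if op in ("distinct", "aggregate"):    # (op, attr, child)
        return uniq[node[1]]
    raise ValueError(f"unknown operator: {op}")

# Hypothetical inputs.
sizes = {"R": 1000, "S": 200}
sel = {"c": 0.1, "j": 0.005}
uniq = {"A": 40}
plan = ("join", "j", ("select", "c", ("table", "R")), ("table", "S"))
print(estimate(plan, sizes, sel, uniq))
```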

Cardinality Estimation

(The Hard Parts)

$\sigma_c(Q)$ (Cardinality Estimation)
How many tuples will a condition $c$ allow to pass?
$\delta_A(Q)$ (Distinct Values Estimation)
How many distinct values of attribute(s) $A$ exist?

Idea 1: Assume each selection filters down to 10% of the data.

no... really!

... there are problems

Inconsistent estimation

$|\sigma_{c_1}(\sigma_{c_2}(R))| \neq |\sigma_{c_1 \wedge c_2}(R)|$
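
A small sketch of why the two estimates disagree: under the 10% rule, two stacked selections are each charged 10%, while one selection with a conjunctive condition is charged 10% only once. The table size is hypothetical.

```python
TEN_PERCENT = 0.10

def est_select(input_size):
    """10% rule: every selection keeps 10% of its input, whatever the condition."""
    return input_size * TEN_PERCENT

n = 10_000                            # hypothetical |R|
nested = est_select(est_select(n))    # sigma_c1(sigma_c2(R))
combined = est_select(n)              # sigma_{c1 AND c2}(R)
print(nested, combined)
```

Both expressions return the same tuples, yet the estimates differ by a factor of ten.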

Too consistent estimation

$|\sigma_{id = 1}(\texttt{STUDENTS})| = |\sigma_{residence = 'NY'}(\texttt{STUDENTS})|$

... but remember that all we need is to rank plans.

Many major databases (Oracle, Postgres, Teradata, etc...) use something like the 10% rule if they have nothing better.

(The specific % varies by DBMS.)

(Teradata uses 10% for the first AND clause, cut by another 75% for every subsequent clause)

(Some) Estimation Techniques

The 10% rule
Rules of thumb if you have no other options...
Uniform Prior
Use basic statistics to make a very rough guess.
Sampling / History
Small, Quick Sampling Runs (or prior executions of the query).
Histograms
Using more detailed statistics for improved guesses.
Constraints
Using rules about the data for improved guesses.

Uniform Prior

We assume that for $\sigma_c(Q)$ or $\delta_A(Q)$...

1. Basic statistics are known about $Q$:
• COUNT(*)
• COUNT(DISTINCT A) (for each A)
• MIN(A), MAX(A) (for each numeric A)
2. Attribute values are uniformly distributed.
3. No inter-attribute correlations.

If necessary statistics aren't available (point 1), fall back to the 10% rule.

If statistical assumptions (points 2, 3) aren't perfectly true, we'll still likely be getting a better estimate than the 10% rule.
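
A minimal sketch of the fallback, assuming per-attribute statistics are kept in a small record. The `ColumnStats` class and function name are hypothetical, not part of any particular DBMS.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ColumnStats:
    count: int                        # COUNT(*)
    distinct: int                     # COUNT(DISTINCT A)
    minimum: Optional[float] = None   # MIN(A), numeric attributes only
    maximum: Optional[float] = None   # MAX(A), numeric attributes only

DEFAULT_SELECTIVITY = 0.10            # the 10% rule

def equality_selectivity(stats: Optional[ColumnStats]) -> float:
    """Selectivity of A = constant: uniform prior if possible, else 10%."""
    if stats is None or stats.distinct <= 0:
        return DEFAULT_SELECTIVITY
    return 1.0 / stats.distinct

print(equality_selectivity(None), equality_selectivity(ColumnStats(1000, 50)))
```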

COUNT(DISTINCT A)

$\texttt{UNIQ}(A, \pi_{A, \ldots}(R)) = \texttt{UNIQ}(A, R)$

$\texttt{UNIQ}(A, \sigma(R)) \approx \texttt{UNIQ}(A, R)$

$\texttt{UNIQ}(A, R \times S) = \texttt{UNIQ}(A, R)$ or $\texttt{UNIQ}(A, S)$ (whichever input contains $A$)

$$\max(\texttt{UNIQ}(A, R), \texttt{UNIQ}(A, S)) \leq \texttt{UNIQ}(A, R \uplus S) \leq \texttt{UNIQ}(A, R) + \texttt{UNIQ}(A, S)$$

MIN(A), MAX(A)

$\min_A(\pi_{A, \ldots}(R)) = \min_A(R)$

$\min_A(\sigma(R)) \approx \min_A(R)$

$\min_A(R \times S) = \min_A(R)$ or $\min_A(S)$ (whichever input contains $A$)

$\min_A(R \uplus S) = \min(\min_A(R), \min_A(S))$

Estimating $\delta_A(Q)$ requires only COUNT(DISTINCT A)
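
The union rules for distinct counts and minimums can be checked on a toy example. The helper names and sample values below are made up for illustration.

```python
def uniq_union_bounds(uniq_r, uniq_s):
    """Lower and upper bounds on UNIQ(A, R union S)."""
    return max(uniq_r, uniq_s), uniq_r + uniq_s

def min_union(min_r, min_s):
    """MIN(A) over a bag union is the smaller of the two input minimums."""
    return min(min_r, min_s)

# Toy values of attribute A in two relations.
r_vals = [1, 2, 3, 3]
s_vals = [3, 4]

low, high = uniq_union_bounds(len(set(r_vals)), len(set(s_vals)))
actual = len(set(r_vals + s_vals))
print(low, actual, high)
print(min_union(min(r_vals), min(s_vals)))
```

The lower bound is reached when one relation's values are a subset of the other's, the upper bound when the two sets of values are disjoint.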

Estimating Selectivity

Selectivity is a probability ($\texttt{SEL}(c, Q) = P(c)$)

$P(A = x_1) = \frac{1}{\texttt{COUNT(DISTINCT A)}}$

$P(A \in (x_1, x_2, \ldots, x_N)) = \frac{N}{\texttt{COUNT(DISTINCT A)}}$

$P(A \leq x_1) = \frac{x_1 - \texttt{MIN(A)}}{\texttt{MAX(A)} - \texttt{MIN(A)}}$

$P(x_1 \leq A \leq x_2) = \frac{x_2 - x_1}{\texttt{MAX(A)} - \texttt{MIN(A)}}$

$P(A = B) = \min\left( \frac{1}{\texttt{COUNT(DISTINCT A)}}, \frac{1}{\texttt{COUNT(DISTINCT B)}} \right)$

$P(c_1 \wedge c_2) = P(c_1) \cdot P(c_2)$

$P(c_1 \vee c_2) = 1 - (1 - P(c_1)) \cdot (1 - P(c_2))$

(With constants $x_1$, $x_2$, ...)
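
These rules might be coded up roughly as follows. The function names are hypothetical, clamping results into $[0, 1]$ is an added assumption (the formulas themselves do not clamp), and the range rules assume MAX(A) > MIN(A).

```python
def _clamp(p):
    return max(0.0, min(1.0, p))

def sel_eq(distinct):                 # P(A = x1)
    return 1.0 / distinct

def sel_in(n, distinct):              # P(A in (x1, ..., xN))
    return _clamp(n / distinct)

def sel_le(x1, lo, hi):               # P(A <= x1), lo = MIN(A), hi = MAX(A)
    return _clamp((x1 - lo) / (hi - lo))

def sel_between(x1, x2, lo, hi):      # P(x1 <= A <= x2)
    return _clamp((x2 - x1) / (hi - lo))

def sel_eq_attr(distinct_a, distinct_b):   # P(A = B)
    return min(1.0 / distinct_a, 1.0 / distinct_b)

def sel_and(p1, p2):                  # assumes c1 and c2 are independent
    return p1 * p2

def sel_or(p1, p2):                   # assumes c1 and c2 are independent
    return 1.0 - (1.0 - p1) * (1.0 - p2)

# Hypothetical statistics: 50 distinct values, MIN(A) = 0, MAX(A) = 200.
p = sel_and(sel_le(50, 0, 200), sel_eq(50))
print(p)
```

Multiplying `p` by the input size $|Q|$ gives the estimated output size of the selection.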

Limitations

Don't always have statistics for $Q$
For example, $\pi_{A \leftarrow (B \cdot C)}(R)$
Don't always have clear rules for $c$
For example, $\sigma_{\texttt{FitsModel}(A, B, C)}(R)$
Attribute values are not always uniformly distributed.
For example, $|\sigma_{SPC\_COMMON = 'pin\ oak'}(T)|$ vs $|\sigma_{SPC\_COMMON = 'honeylocust'}(T)|$
Attribute values are sometimes correlated.
For example, $\sigma_{(stump < 5) \wedge (diam > 3)}(T)$
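
To see how far off the independence assumption can be, here is a sketch on synthetic data in which two attributes are perfectly correlated. The data is made up for illustration and is unrelated to the tree dataset in the examples above.

```python
# Synthetic relation with attributes (A, B), where B always equals A.
rows = [(a, a) for a in range(100)]

def count(pred):
    return sum(1 for row in rows if pred(row))

p_a = count(lambda r: r[0] < 50) / len(rows)     # P(A < 50)
p_b = count(lambda r: r[1] >= 50) / len(rows)    # P(B >= 50)

# Uniform-prior estimate multiplies the two probabilities.
estimated = len(rows) * p_a * p_b
# Actual result size of sigma_{(A < 50) AND (B >= 50)}.
true_count = count(lambda r: r[0] < 50 and r[1] >= 50)
print(estimated, true_count)
```

The estimate predicts a quarter of the rows, but no row can satisfy both conditions at once.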

...but handles most usage patterns

... more next class!