Cost-Based Optimization

March 9, 2021

Garcia-Molina/Ullman/Widom: Ch. 16

General Query Optimizers

Apply blind heuristics (e.g., push down selections)
Enumerate all possible execution plans (or a reasonable subset) by varying:
- Join/Union Evaluation Order (commutativity, associativity, distributivity)
- Algorithms for Joins, Aggregates, Sort, Distinct, and others
- Data Access Paths
Estimate the cost of each execution plan
Pick the execution plan with the lowest cost
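
The enumerate, estimate, and pick loop above can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular optimizer is implemented: the table sizes, the fixed join selectivity, and the toy cost function (the sum of intermediate result sizes) are all assumptions made for the example, and only left-deep join orders are enumerated.

```python
from itertools import permutations

# Hypothetical table sizes in tuples (assumed for illustration only).
TABLE_SIZES = {"R": 1000, "S": 100, "T": 10}


def left_deep_cost(order, sizes, selectivity=0.01):
    """Toy cost: sum of the estimated intermediate result sizes."""
    size = sizes[order[0]]
    cost = 0.0
    for name in order[1:]:
        # Estimated join output size: |Q1| * |Q2| * SEL (SEL is assumed fixed).
        size = size * sizes[name] * selectivity
        cost += size
    return cost


def best_plan(sizes):
    """Enumerate every left-deep join order and keep the cheapest one."""
    return min(permutations(sizes), key=lambda order: left_deep_cost(order, sizes))


plan = best_plan(TABLE_SIZES)
plan_cost = left_deep_cost(plan, TABLE_SIZES)
print(plan, plan_cost)
```

With these assumed sizes, joining the two small tables first keeps the intermediate results small, which is why that order wins under this toy cost function.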

Idea 1: Run each plan

If we can't get the exact cost of a plan, what can we do?

Idea 2: Run each plan on a small sample of the data.

Idea 3: Analytically estimate the cost of a plan.

Plan Cost

CPU Time: How much time is spent processing.
# of IOs: How many random reads + writes go to disk.
Memory Required: How much memory do you need.

Randall Munroe (cc-by-nc)

Remember the Real Goals

Accurately rank the plans.
Don't spend more time optimizing than you get back.
Don't pick a plan that uses more memory than you have.
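
One way to read the last goal is as a filter applied before ranking: discard plans whose estimated memory exceeds the budget, then take the lowest estimated IO cost among the rest. The sketch below assumes each candidate already carries an IO and a memory estimate; the `PlanEstimate` type, the plan names, and the numbers are hypothetical.

```python
from typing import NamedTuple, Optional


class PlanEstimate(NamedTuple):
    name: str
    io_pages: float      # estimated total IOs, in pages
    memory_tuples: int   # estimated memory needed, in tuples


def pick_plan(candidates, memory_budget) -> Optional[PlanEstimate]:
    """Return the cheapest plan that fits in memory, or None if none fits."""
    feasible = [p for p in candidates if p.memory_tuples <= memory_budget]
    return min(feasible, key=lambda p: p.io_pages, default=None)


candidates = [
    PlanEstimate("1-pass hash join", io_pages=150, memory_tuples=5000),
    PlanEstimate("2-pass hash join", io_pages=450, memory_tuples=1),
]
chosen = pick_plan(candidates, memory_budget=1000)
print(chosen)
```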

Accounting

Figure out the IO cost of the entire* subtree.

Only count the amount of memory added by each operator.

* Different from earlier in the semester.

Operation	RA	Total IOs (#pages)	Memory (#tuples)
Table Scan	$R$	$\frac{|R|}{\mathcal P}$	$O(1)$
Projection	$\pi(R)$	$\textbf{io}(R)$	$O(1)$
Selection	$\sigma(R)$	$\textbf{io}(R)$	$O(1)$
Union	$R \uplus S$	$\textbf{io}(R) + \textbf{io}(S)$	$O(1)$
Sort (In-Mem)	$\tau(R)$	$\textbf{io}(R)$	$O(|R|)$
Sort (On-Disk)	$\tau(R)$	$\frac{2 \cdot |R| \cdot \lfloor \log_{\mathcal B}(|R|) \rfloor}{\mathcal P} + \textbf{io}(R)$	$O(\mathcal B)$
(B+Tree) Index Scan	$Index(R, c)$	$\log_{\mathcal I}(|R|) + \frac{|\sigma_c(R)|}{\mathcal P}$	$O(1)$
(Hash) Index Scan	$Index(R, c)$	$1$	$O(1)$

Tuples per Page ($\mathcal P$) – Normally defined per-schema
Size of $R$ ($|R|$)
Pages of Buffer ($\mathcal B$)
Keys per Index Page ($\mathcal I$)
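
As a rough illustration, the table-scan and index-scan rows of the table above can be turned directly into functions. The function and parameter names are made up for this sketch, and the example numbers (a million tuples, 100 tuples per page, 100 keys per index page, 500 matching tuples) are assumptions, not measurements.

```python
import math


def io_table_scan(num_tuples, tuples_per_page):
    """IOs for a full scan of R: |R| / P pages."""
    return num_tuples / tuples_per_page


def io_btree_index_scan(num_tuples, matching_tuples, tuples_per_page, keys_per_index_page):
    """IOs for a B+Tree index scan: log_I(|R|) + |sigma_c(R)| / P."""
    descent = math.log(num_tuples, keys_per_index_page)  # index pages read on the way down
    return descent + matching_tuples / tuples_per_page


def io_hash_index_scan():
    """IOs for a hash index lookup, per the table: a single page."""
    return 1


scan_ios = io_table_scan(1_000_000, 100)
index_ios = io_btree_index_scan(1_000_000, 500, 100, 100)
print(f"table scan: {scan_ios:.0f} IOs, B+Tree index scan: {index_ios:.0f} IOs")
```

Under these assumed parameters the index scan touches a handful of pages while the table scan touches thousands, which is the kind of gap the optimizer needs the estimates to expose.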

Operation	RA	Total IOs (#pages)	Mem (#tuples)
Nested Loop Join (Buffer $S$ in mem)	$R \times_{mem} S$	$\textbf{io}(R)+\textbf{io}(S)$	$O(|S|)$
Block NLJ (Buffer $S$ on disk)	$R \times_{disk} S$	$\frac{|R|}{\mathcal B} \cdot \frac{|S|}{\mathcal P} + \textbf{io}(R) + \textbf{io}(S)$	$O(1)$
Block NLJ (Recompute $S$)	$R \times_{redo} S$	$\textbf{io}(R) + \frac{|R|}{\mathcal B} \cdot \textbf{io}(S)$	$O(1)$
1-Pass Hash Join	$R \bowtie_{1PH, c} S$	$\textbf{io}(R) + \textbf{io}(S)$	$O(|S|)$
2-Pass Hash Join	$R \bowtie_{2PH, c} S$	$\frac{2|R| + 2|S|}{\mathcal P} + \textbf{io}(R) + \textbf{io}(S)$	$O(1)$
Sort-Merge Join	$R \bowtie_{SM, c} S$	[Sort]	[Sort]
(Tree) Index NLJ	$R \bowtie_{INL, c} S$	$|R| \cdot (\log_{\mathcal I}(|S|) + \frac{|\sigma_c(S)|}{\mathcal P})$	$O(1)$
(Hash) Index NLJ	$R \bowtie_{INL, c} S$	$|R| \cdot 1$	$O(1)$
(In-Mem) Aggregate	$\gamma_A(R)$	$\textbf{io}(R)$	$O(adom(A))$
(Sort/Merge) Aggregate	$\gamma_A(R)$	[Sort]	[Sort]

Number of distinct values of $A$ ($adom(A)$)
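
The join rows above trade IOs against memory. The sketch below evaluates three of them and picks the cheapest one that fits, assuming $\textbf{io}(R)$ and $\textbf{io}(S)$ have already been computed for the input subtrees. Function names, parameter names, and the example sizes are all hypothetical.

```python
def io_block_nlj_recompute(io_r, io_s, r_tuples, buffer_pages):
    """Block NLJ (recompute S): io(R) + |R| / B * io(S)."""
    return io_r + (r_tuples / buffer_pages) * io_s


def io_one_pass_hash(io_r, io_s):
    """1-pass hash join: io(R) + io(S); needs O(|S|) memory."""
    return io_r + io_s


def io_two_pass_hash(io_r, io_s, r_tuples, s_tuples, tuples_per_page):
    """2-pass hash join: (2|R| + 2|S|) / P + io(R) + io(S)."""
    return (2 * r_tuples + 2 * s_tuples) / tuples_per_page + io_r + io_s


def cheapest_join(io_r, io_s, r_tuples, s_tuples, tuples_per_page, buffer_pages, memory_tuples):
    """Return (algorithm name, estimated IOs) for the cheapest feasible option."""
    options = {
        "block NLJ (recompute S)": io_block_nlj_recompute(io_r, io_s, r_tuples, buffer_pages),
        "2-pass hash join": io_two_pass_hash(io_r, io_s, r_tuples, s_tuples, tuples_per_page),
    }
    if s_tuples <= memory_tuples:  # the 1-pass hash join must hold S in memory
        options["1-pass hash join"] = io_one_pass_hash(io_r, io_s)
    name = min(options, key=options.get)
    return name, options[name]


# Assumed inputs: |R| = 10,000, |S| = 5,000, P = 100, B = 50, both inputs are table scans.
small_memory = cheapest_join(100, 50, 10_000, 5_000, 100, 50, memory_tuples=1_000)
large_memory = cheapest_join(100, 50, 10_000, 5_000, 100, 50, memory_tuples=10_000)
print(small_memory, large_memory)
```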

Symbol	Parameter	Type
$\mathcal P$	Tuples Per Page	Fixed ($\frac{|\text{page}|}{|\text{tuple}|}$)
$|R|$	Size of $R$	Precomputed$^*$ ($|R|$)
$\mathcal B$	Pages of Buffer	Configurable Parameter
$\mathcal I$	Keys per Index Page	Fixed ($\frac{|\text{page}|}{|\text{key+pointer}|}$)
$adom(A)$	Number of distinct values of $A$	Precomputed$^*$ ($|\delta_A(R)|$)

* unless $R$ is a query

Estimating IOs requires estimating $|Q(R)|$ and $|\delta_A(Q(R))|$

Cardinality Estimation

Unlike estimating IOs, cardinality estimation doesn't care about the algorithm, so we'll just be working with raw RA.

Operator	RA	Estimated Size
Table	$R$	$|R|$
Projection	$\pi(Q)$	$|Q|$
Union	$Q_1 \uplus Q_2$	$|Q_1| + |Q_2|$
Cross Product	$Q_1 \times Q_2$	$|Q_1| \times |Q_2|$
Sort	$\tau(Q)$	$|Q|$
Limit	$\texttt{LIMIT}_N(Q)$	$N$
Selection	$\sigma_c(Q)$	$|Q| \times \texttt{SEL}(c, Q)$
Join	$Q_1 \bowtie_c Q_2$	$|Q_1| \times |Q_2| \times \texttt{SEL}(c, Q_1\times Q_2)$
Distinct	$\delta_A(Q)$	$\texttt{UNIQ}(A, Q)$
Aggregate	$\gamma_{A, B \leftarrow \Sigma}(Q)$	$\texttt{UNIQ}(A, Q)$

$\texttt{SEL}(c, Q)$: Selectivity of $c$ on $Q$, or $\frac{|\sigma_c(Q)|}{|Q|}$
$\texttt{UNIQ}(A, Q)$: # of distinct values of $A$ in $Q$.
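
Because these rules depend only on the RA operator, they can be applied bottom-up over an expression tree. The following is a minimal sketch covering a few of the rows above; the node classes are invented for the example, and the selectivities are supplied by hand rather than estimated.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Table:
    size: int


@dataclass
class Select:
    child: "Node"
    selectivity: float  # SEL(c, Q), assumed to be given


@dataclass
class Join:
    left: "Node"
    right: "Node"
    selectivity: float  # SEL(c, Q1 x Q2), assumed to be given


@dataclass
class BagUnion:
    left: "Node"
    right: "Node"


@dataclass
class Limit:
    child: "Node"
    n: int


Node = Union[Table, Select, Join, BagUnion, Limit]


def estimate_size(q):
    """Apply the estimated-size rules from the table, recursively."""
    if isinstance(q, Table):
        return q.size
    if isinstance(q, Select):
        return estimate_size(q.child) * q.selectivity
    if isinstance(q, Join):
        return estimate_size(q.left) * estimate_size(q.right) * q.selectivity
    if isinstance(q, BagUnion):
        return estimate_size(q.left) + estimate_size(q.right)
    if isinstance(q, Limit):
        return q.n  # the table estimates LIMIT_N(Q) as N
    raise TypeError(f"unsupported operator: {q!r}")


filtered_join = Select(Join(Table(1000), Table(500), selectivity=0.001), selectivity=0.1)
union_with_limit = BagUnion(Table(1000), Limit(Table(500), n=10))
print(estimate_size(filtered_join), estimate_size(union_with_limit))
```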

Cardinality Estimation

(The Hard Parts)

$\sigma_c(Q)$ (Cardinality Estimation): How many tuples will a condition $c$ allow to pass?
$\delta_A(Q)$ (Distinct Values Estimation): How many distinct values of attribute(s) $A$ exist?

Idea 1: Assume each selection filters down to 10% of the data.

no... really!

... there are problems

Inconsistent estimation

$|\sigma_{c_1}(\sigma_{c_2}(R))| \neq |\sigma_{c_1 \wedge c_2}(R)|$

Too consistent estimation

$|\sigma_{id = 1}(\texttt{STUDENTS})| = |\sigma_{residence = 'NY'}(\texttt{STUDENTS})|$
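
Both problems show up with simple arithmetic. The sketch below applies a flat 10% per selection operator to a hypothetical 1,000-row table: two stacked selections and one selection with a conjunctive condition get different estimates, while any two single-condition selections get the same one.

```python
TEN_PERCENT = 0.10


def estimate_stacked(table_size, num_selections):
    """Each selection operator independently keeps 10% of its input."""
    return table_size * TEN_PERCENT ** num_selections


students = 1000  # assumed table size

# Inconsistent: sigma_c1(sigma_c2(R)) vs. sigma_{c1 AND c2}(R)
nested = estimate_stacked(students, 2)    # two operators, so 10% of 10%
combined = estimate_stacked(students, 1)  # one operator, so just 10%

# Too consistent: id = 1 and residence = 'NY' both count as "one selection"
by_id = estimate_stacked(students, 1)
by_residence = estimate_stacked(students, 1)

print(nested, combined, by_id, by_residence)
```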

... but remember that all we need is to rank plans.

Many major databases (Oracle, Postgres, Teradata, etc.) use something like the 10% rule if they have nothing better.

(The specific % varies by DBMS.)

(Teradata uses 10% for the first AND clause, cut by another 75% for every subsequent clause)

(Some) Estimation Techniques

The 10% rule: Rules of thumb if you have no other options...
Uniform Prior: Use basic statistics to make a very rough guess.
Sampling / History: Small, Quick Sampling Runs (or prior executions of the query).
Histograms: Using more detailed statistics for improved guesses.
Constraints: Using rules about the data for improved guesses.

Uniform Prior

We assume that for $\sigma_c(Q)$ or $\delta_A(Q)$...

Basic statistics are known about $Q$:
- COUNT(*)
- COUNT(DISTINCT A) (for each A)
- MIN(A), MAX(A) (for each numeric A)
Attribute values are uniformly distributed.
No inter-attribute correlations.

If the necessary statistics aren't available (the first assumption), fall back to the 10% rule.

If the statistical assumptions (uniformity and no correlations) aren't perfectly true, we'll still likely be getting a better estimate than the 10% rule.

COUNT(DISTINCT A)

$\texttt{UNIQ}(A, \pi_{A, \ldots}(R)) = \texttt{UNIQ}(A, R)$

$\texttt{UNIQ}(A, \sigma(R)) \approx \texttt{UNIQ}(A, R)$

$\texttt{UNIQ}(A, R \times S) = \texttt{UNIQ}(A, R)$ or $\texttt{UNIQ}(A, S)$ (whichever of $R$ or $S$ contains $A$)

$$\max(\texttt{UNIQ}(A, R), \texttt{UNIQ}(A, S)) \leq \texttt{UNIQ}(A, R \uplus S) \leq \texttt{UNIQ}(A, R) + \texttt{UNIQ}(A, S)$$

MIN(A), MAX(A)

$min_A(\pi_{A, \ldots}(R)) = min_A(R)$

$min_A(\sigma_c(R)) \approx min_A(R)$

$min_A(R \times S) = min_A(R)$ or $min_A(S)$ (whichever of $R$ or $S$ contains $A$)

$min_A(R \uplus S) = min(min_A(R), min_A(S))$

Estimating $\delta_A(Q)$ requires only COUNT(DISTINCT A)
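
The union rule only gives a range, since the true count depends on how much the two inputs overlap. A quick check on made-up data, treating bag union as list concatenation, shows the actual distinct count landing between the two bounds:

```python
def uniq_union_bounds(uniq_r, uniq_s):
    """Lower and upper bounds on UNIQ(A, R ⊎ S) from the per-input counts."""
    return max(uniq_r, uniq_s), uniq_r + uniq_s


# Made-up values of attribute A in R and in S (they share the value 3).
r_values = [1, 2, 2, 3]
s_values = [3, 4]

low, high = uniq_union_bounds(len(set(r_values)), len(set(s_values)))
actual = len(set(r_values + s_values))  # bag union = concatenation
print(low, actual, high)
```

The lower bound is reached when one input's values are a subset of the other's, and the upper bound when the two inputs share no values.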

Estimating Selectivity

Selectivity is a probability ($\texttt{SEL}(c, Q) = P(c)$)

$P(A = x_1)$	$=$	$\frac{1}{\texttt{COUNT(DISTINCT A)}}$
$P(A \in (x_1, x_2, \ldots, x_N))$	$=$	$\frac{N}{\texttt{COUNT(DISTINCT A)}}$
$P(A \leq x_1)$	$=$	$\frac{x_1 - \texttt{MIN(A)}}{\texttt{MAX(A)} - \texttt{MIN(A)}}$
$P(x_1 \leq A \leq x_2)$	$=$	$\frac{x_2 - x_1}{\texttt{MAX(A)} - \texttt{MIN(A)}}$
$P(A = B)$	$=$	$\textbf{min}\left( \frac{1}{\texttt{COUNT(DISTINCT A)}}, \frac{1}{\texttt{COUNT(DISTINCT B)}} \right)$
$P(c_1 \wedge c_2)$	$=$	$P(c_1) \cdot P(c_2)$
$P(c_1 \vee c_2)$	$=$	$1 - (1 - P(c_1)) \cdot (1 - P(c_2))$

(With constants $x_1$, $x_2$, ...)
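
These rules translate almost line for line into code. The sketch below is one possible rendering: the `ColumnStats` class, the function names, and the example statistics are assumptions for illustration, and clamping range estimates to $[0, 1]$ is an addition that the formulas above do not include.

```python
from dataclasses import dataclass


@dataclass
class ColumnStats:
    distinct: int    # COUNT(DISTINCT A)
    minimum: float   # MIN(A)
    maximum: float   # MAX(A)


def _clamp(p):
    """Keep an estimate inside [0, 1] (an addition, not part of the formulas)."""
    return max(0.0, min(1.0, p))


def sel_eq(a):
    """P(A = x)"""
    return 1 / a.distinct


def sel_in(a, n):
    """P(A IN (x_1, ..., x_n))"""
    return _clamp(n / a.distinct)


def sel_le(a, x):
    """P(A <= x)"""
    return _clamp((x - a.minimum) / (a.maximum - a.minimum))


def sel_between(a, x1, x2):
    """P(x1 <= A <= x2)"""
    return _clamp((x2 - x1) / (a.maximum - a.minimum))


def sel_eq_columns(a, b):
    """P(A = B)"""
    return min(1 / a.distinct, 1 / b.distinct)


def sel_and(p1, p2):
    """P(c1 AND c2), assuming independence"""
    return p1 * p2


def sel_or(p1, p2):
    """P(c1 OR c2), assuming independence"""
    return 1 - (1 - p1) * (1 - p2)


age = ColumnStats(distinct=50, minimum=0, maximum=100)  # hypothetical column
p_eq = sel_eq(age)
p_le = sel_le(age, 25)
print(p_eq, p_le, sel_and(p_eq, p_le), sel_or(p_eq, p_le))
```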

Limitations

Don't always have statistics for $Q$: For example, $\pi_{A \leftarrow (B \cdot C)}(R)$
Don't always have clear rules for $c$: For example, $\sigma_{\texttt{FitsModel}(A, B, C)}(R)$
Attribute values are not always uniformly distributed: For example, $|\sigma_{SPC\_COMMON = 'pin\ oak'}(T)|$ vs $|\sigma_{SPC\_COMMON = 'honeylocust'}(T)|$
Attribute values are sometimes correlated: For example, $\sigma_{(stump < 5) \wedge (diam > 3)}(T)$
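
The correlation problem is easy to reproduce on synthetic data. In the sketch below (made-up rows, not the tree dataset from the examples above), attributes $A$ and $B$ always hold the same value, so the independence assumption behind $P(c_1 \wedge c_2) = P(c_1) \cdot P(c_2)$ underestimates the result by a factor of ten.

```python
# Synthetic table: 100 rows where B is always equal to A (perfect correlation).
rows = [(i % 10, i % 10) for i in range(100)]


def uniform_eq_selectivity(values):
    """P(A = x) under the uniform prior: 1 / COUNT(DISTINCT A)."""
    return 1 / len(set(values))


a_values = [a for a, _ in rows]
b_values = [b for _, b in rows]

# Estimate for sigma_{A = 3 AND B = 3}, treating the two conditions as independent.
estimated = len(rows) * uniform_eq_selectivity(a_values) * uniform_eq_selectivity(b_values)

# What the selection actually returns on this data.
actual = sum(1 for a, b in rows if a == 3 and b == 3)

print(estimated, actual)
```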

...but handles most usage patterns

... more next class!