CSE-4/562 Spring 2021 - Parallel Query Execution

Parallel Query Execution

CSE-4/562 Spring 2021

March 18, 2021

Garcia-Molina/Ullman/Widom: Ch. 20.1, 20.3, 20.4

Why Scale?

Scan 1 PB at 300MB/s (SATA r2)

Communication Models

Communication Models

Shared Memory

Non-Uniform Memory Access

Shared Nothing / Message Passing

Shared Memory
HDFS, S3, RAM+Modern OSes
Shared Nothing
MPP, Actor Model

Today we want to clearly see the communications.

The Basic Model

Putting Workers Together

Wikipedia - Public Domain Image - Photographer Unknown

No Parallelism


"Pipeline" Parallelism


"Pipeline" Parallelism




Distributing Data

Replication Partitioning (Sharding)

How to Partition Data?

  • Arbitrarily
  • Range
  • Hash

look familiar?

Can we run each worker on one partition?



$N$ partitions in, $N$ partitions out



$N$ partitions in, $N$ partitions out



Trick question, just combines partitions!

(Single-row) Aggregate


$N$ partitions in, $1$ partition out

Partial Aggregation

Algebraic Aggregates (Count, Sum, Avg, Min, Max)
Bounded-size intermediate state
Holistic Aggregates (Median, Mode, Count-Distinct)
Unbounded-size intermediate state
$$\Sigma_{SUM} (R_1 \uplus R_2 \uplus \ldots \uplus R_N)$$

Aggregate needs to process $N$ partitions.

$$\Sigma_{SUM} (\Sigma_{SUM}(R_1) \uplus \Sigma_{SUM}(R_2) \uplus \ldots \uplus \Sigma_{SUM}(R_N))$$

Final aggregate only needs to process $N$ tuples.

$$\Sigma_{AVG} (R_1 \uplus R_2 \uplus \ldots \uplus R_N)$$


$$\Sigma_{\frac{SUM(A)}{SUM(B)}} (\Sigma_{A \leftarrow SUM,\; B \leftarrow COUNT}(R_1) \ldots \Sigma_{A \leftarrow SUM,\; B \leftarrow COUNT}(R_N))$$

Basic Aggregate Pattern

Define a starting value for the accumulator
Fold(Accum, New)
Merge a new value into the accumulator
Extract the aggregate from the accumulator.
Merge(Accum, Accum)
Merge two accumulators together.



Every partition from one table needs to pair
with every partition from the other.

$$(R_1 \uplus \ldots \uplus R_N) \bowtie (S_1 \uplus \ldots \uplus S_K)$$


$$(R_1 \bowtie S_1) \uplus \ldots \uplus (R_1 \bowtie S_K)$$ $$\ldots\uplus \ldots \uplus \ldots$$ $$(R_N \bowtie S_1) \uplus \ldots \uplus (R_N \bowtie S_K)$$
$R_1$ $R_1\bowtie S_1$$R_1\bowtie S_2$$R_1\bowtie S_3$$R_1\bowtie S_4$
$R_2$ $R_2\bowtie S_1$$R_2\bowtie S_2$$R_2\bowtie S_3$$R_2\bowtie S_4$
$R_3$ $R_3\bowtie S_1$$R_3\bowtie S_2$$R_3\bowtie S_3$$R_3\bowtie S_4$
$R_4$ $R_4\bowtie S_1$$R_4\bowtie S_2$$R_4\bowtie S_3$$R_4\bowtie S_4$

$N$ workers gets us $\sqrt{N}$ scaling

How to Partition Data?

  • Arbitrarily
  • Range
  • Hash
$$R \bowtie_{A} S$$ $$R_i = \sigma_{\texttt{hash}(A) = i}(R)$$ $$S_i = \sigma_{\texttt{hash}(A) = i}(S)$$
$R_1$ $R_1\bowtie S_1$ $R_1\bowtie S_2$ $R_1\bowtie S_3$ $R_1\bowtie S_4$
$R_2$ $R_2\bowtie S_1$ $R_2\bowtie S_2$ $R_2\bowtie S_3$ $R_2\bowtie S_4$
$R_3$ $R_3\bowtie S_1$ $R_3\bowtie S_2$ $R_3\bowtie S_3$ $R_3\bowtie S_4$
$R_4$ $R_4\bowtie S_1$ $R_4\bowtie S_2$ $R_4\bowtie S_3$ $R_4\bowtie S_4$

Back to $N$ scaling for $N$ workers

What if the partitions aren't aligned so nicely?

Can we do better?

Focus on $R_1 \bowtie_B S_1$

Problem: All tuples in $R_1$ and $S_1$ need to be
sent to the same worker.

Data Transfer

  • Limited IO/Network bandwidth
  • Compute needed to receive data

Idea 1: Put the worker on the node that has the data!

Problem: What if the data is on 2 different nodes?

Idea 1.b: Put the worker on one of the nodes with data.

Can we reduce network use more?

Problem: Worker 2 is still sending a lot of data.

Idea: Compress $\pi_B(S_1)$

Lossy Compression

(not all errors are equal)

False Positives
($b \in \pi_B(S_1)$ when it isn't)
not ideal, but ok
False Negatives
($b \not\in \pi_B(S_1)$ when it is)
bad, wrong answer!

Bloom Filters

$$filter \leftarrow \texttt{Bloom}(\textbf{Alice}, \textbf{Bob}, \textbf{Carol}, \textbf{Dave})$$
User: Is Alice part of the set? $filter$: Yes
User: Is Eve part of the set? $filter$: No
User: Is Fred part of the set? $filter$: Yes

Bloom Filter

Test always returns Yes if the element is in the set.

Test usually returns No if the element is not in the set.

Bloom Filters

A bloom filter is an array of bits.

$M$: Number of bits in the array.

$K$: Number of hash functions.

For one record/key

  1. $\forall i \in [M] : filter[i] = 0$
  2. $\forall j \in [K] : filter[\texttt{hash}_j(key)] = 1$

Each bit vector has $\sim K$ bits set.

$Key_1$ 00101010
$Key_2$ 01010110
$Key_3$ 10000110
$Key_4$ 01001100

Filters are combined by Bitwise-OR

$$Key_1 \;|\; Key_2 = 01111110$$

Test for inclusion by checking for bits

$$Key_i \;\&\; filter = Key_i$$
$Key_1$ 00101010
$Key_2$ 01010110
$Key_3$ 10000110
$Key_4$ 01001100
$$Key_1 \;|\; Key_2 = 01111110$$
$Key_1 \;\&\; 01111110$ 00101010
$Key_3 \;\&\; 01111110$ 00101010
$Key_4 \;\&\; 01111110$ 01001100

(False positive)

Next time: Online Aggregation/AQP