February 4, 2019
Replace [Thing A] with better, but equivalent [Thing B].
Replace [Query A] with better, but equivalent [Query B].
... but first a few definitions.
| Set
 (Unique Only) | Bag
 | List
 (Order Matters) | 
| Declarative | Imperative | 
|---|---|
| Say what you want | Say how you want to get it | 
| "Get me the TPS Reports" |  | 
| SQL, RA, R, … | C, Scala, Java, Python, … | 
Declarative languages make it easier to explore equivalent computations to find the best one.
                      SELECT R.A FROM R, S 
                      WHERE R.B = S.B AND S.C = 10
    
          with open_csv("R.csv") as R:
            with open_csv("S.csv") as S:
              for r_row in R:
                for s_row in S:
                  if r_row[1] == s_row[0] and s_row[1] == "10":
                    print(r_row[0])
    $$Q(A) := R(A, B), S(B, 10)$$
We start with a database instance with a fixed schema
Queries are applied to Relations $$Q(\textbf{Trees}, \textbf{SpeciesInfo})$$
Queries are also Relations! $$Q_2(\textbf{SpeciesInfo}, Q_1(\textbf{Trees}))$$ (Relational Algebra is Closed)
| Operation | Sym | Meaning | 
|---|---|---|
| Selection | $\sigma$ | Select a subset of the input rows | 
| Projection | $\pi$ | Delete unwanted columns | 
| Cross-product | $\times$ | Combine two relations | 
| Set-difference | $-$ | Tuples in Rel 1, but not Rel 2 | 
| Union | $\cup$ | Tuples either in Rel 1 or in Rel 2 | 
Also: Intersection, Join, Division, Renaming
(Not essential, but can be useful)
| Input | Query Language | Output | ||
|---|---|---|---|---|
| Sets of Tuples | $\rightarrow$ | Set Relational Algebra | $\rightarrow$ | Set of Tuples | 
| Bags of Tuples | $\rightarrow$ | Bag Relational Algebra | $\rightarrow$ | Bag of Tuples | 
| Lists of Tuples | $\rightarrow$ | Extended Relational Algebra | $\rightarrow$ | List of Tuples | 
First we focus on sets and bags.
Delete rows that fail the condition $c$.
| TREE_ID | SPC_COMMON | BORONAME | ... | 
|---|---|---|---|
| 204026 | 'honeylocust' | 'Brooklyn' | ... | 
| 204337 | 'honeylocust' | 'Brooklyn' | ... | 
| 189565 | 'American linden' | 'Brooklyn' | ... | 
| 192755 | 'London planetree' | 'Brooklyn' | ... | 
| 189465 | 'London planetree' | 'Brooklyn' | ... | 
| ... and 177287 more | |||
          SELECT * FROM Trees [[ WHERE BORONAME = 'Brooklyn' ]]
      Delete attributes not in the projection list $A$.
| BORONAME | 
|---|
| Queens | 
| Brooklyn | 
| Manhattan | 
| Bronx | 
| Staten Island | 
Only 5 results... not 683788?
Set and Bag Projection are different
              SELECT [[ DISTINCT BORONAME ]] FROM Trees;
      What are these queries schemas?
Takes two relations that are union-compatible...
(Both relations have the same number of fields with the same types)
... and returns all tuples appearing in either relation
We use $\uplus$ if we explicitly mean bag union
Return all tuples appearing in both 
of two union-compatible relations
What is this query asking?
Return all tuples appearing in the first, but not the second
of two union-compatible relations
What is this query asking?
What is the schema of the result of any of these operators?
Create all pairs of tuples.
TreeInfo
| SPC_COMMON | AVG_HEIGHT | 
|---|---|
| cedar elm | 60 | 
| lacebark elm | 45 | 
| ... and more | |
| SPC_COMMON | BORONAME | SPC_COMMON | AVG_HEIGHT | 
|---|---|---|---|
| honeylocust | Brooklyn | cedar elm | 60 | 
| honeylocust | Brooklyn | cedar elm | 60 | 
| American linden | Brooklyn | cedar elm | 60 | 
| London planetree | Manhattan | cedar elm | 60 | 
| London planetree | Manhattan | cedar elm | 60 | 
| ... | |||
| honeylocust | Brooklyn | lacebark elm | 45 | 
| honeylocust | Brooklyn | lacebark elm | 45 | 
| American linden | Brooklyn | lacebark elm | 45 | 
| London planetree | Manhattan | lacebark elm | 45 | 
| London planetree | Manhattan | lacebark elm | 45 | 
| ... and more | |||
What is the schema of the resulting relation?
The relation has a naming conflict
(two attributes with the same name)
What is the schema of the resulting relation?
When writing cross-products on the board,
I will use implicit renaming
Pair tuples according to a condition c.
Equi-joins are joins with only equality tests in the condition.
(Which operators behave differently in Set- and Bag-RA?)
| Operator | Symbol | Creates Duplicates? | 
|---|---|---|
| Selection | $\sigma$ | No | 
| Projection | $\pi$ | Yes | 
| Cross-product | $\times$ | No | 
| Set-difference | $-$ | No | 
| Union | $\cup$ | Yes | 
| Join | $\bowtie$ | No | 
Using Set-Relational Algebra, find the BORONAMEs of all boroughs that do have tree species with an average height of below 45 inches
TreeInfo
| SPC_COMMON | AVG_HEIGHT | 
|---|---|
| cedar elm | 60 | 
| lacebark elm | 45 | 
| ... and more | |
Trees
| SPC_COMMON | BORONAME | 
|---|---|
| 'honeylocust' | 'Brooklyn' | 
| 'American linden' | 'Brooklyn' | 
| 'London planetree' | 'Manhattan' | 
| ... and more | |
Not typically supported as a primitive operator,
but useful for expressing queries like:
Find species that appear in all boroughs
$$R / S \equiv \{\; \left<\vec t\right> \;|\; \forall \left<\vec s\right> \in S, \left< \vec t \vec s \right> \in R \;\}$$
| BORO | SPC_COMMON | 
|---|---|
| Brooklyn | honeylocust | 
| Brooklyn | American linden | 
| Brooklyn | London planetree | 
| Manhattan | honeylocust | 
| Manhattan | American linden | 
| Manhattan | pin oak | 
| Queens | honeylocust | 
| Queens | American linden | 
| Bronx | honeylocust | 
| / 
 
 | |||||||
| / 
 
 | |||||||
| / 
 
 | 
If time permits: Implement division using other operators.
    A simple way to think about and work with
    computations over collections. 
  
… simple → easy to evaluate
… simple → easy to optimize
Next time, Optimizing RA