Show HN: Data algebra, going back to Codd’s relational operators

By John Mount

When working with multiple data tables, we often need to know, for a given set of keys, how many rows each table has.

I would like to work through such an example in Python, as yet another introduction to the data algebra (an alternative to direct Pandas or direct SQL notation).

First let’s import our packages and set up our example data.

In [1]:

from typing import List

import pandas
from data_algebra.data_ops import *
import data_algebra.SQLite


# some example data
d1 = pandas.DataFrame({
    'ID': [2, 3, 7, 7],
    'OP': ['A', 'B', 'B', 'D'],
})

d2 = pandas.DataFrame({
    'ID': [1, 1, 2, 3, 4, 2, 4, 4, 5, 5, 6],
    'OP': ['A', 'B', 'A', 'D', 'C', 'A', 'D', 'B', 'A', 'B', 'B'],
})

And let’s declare our list of key columns; for this example the only key is ID.
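
In [2]:

# key columns used to group and join tables (just 'ID' in this example)
id_cols = ['ID']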

What we want to do is:

  • Count how many rows in a given table have a given set of key values. This operation is called project.
  • Join counts from one table to counts from another table. This operation is called natural_join.
  • Replace missing counts, which arise from keys present in one table but not the other, with zero. This operation is called coalesce, and we perform it in an extend node.

First, we write code that implements the join step.

In [3]:

def merge_in_counts(
        pipeline: ViewRepresentation,
        id_cols: List[str],
        new_table_descr: TableDescription):
    # count rows per key combination in the new table,
    # then full-join those counts into the pipeline built so far
    return pipeline.natural_join(
            b=new_table_descr
                .project(
                    {f'count_{new_table_descr.table_name}': '(1).sum()'},
                    group_by=id_cols),
            by=id_cols,
            jointype='full')

Now we use that step to define our operator pipeline. The key point of the data algebra is that we apply operations incrementally to an operator pipeline, instead of directly to data.

In [4]:

ops = (
    data(d1=d1)
        .project({'count_d1': '(1).sum()'}, group_by=['ID'])
        .use(merge_in_counts, ['ID'], data(d2=d2))
)

The operations are built by standard Python method chaining. The data(d1=d1) step starts the pipeline with our d1 data frame. The .use() step treats its first argument as if it were a method, passing along the arguments that follow. This lets us treat user code such as merge_in_counts() as if it were a class-method extension.
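
In other words, pipeline.use(f, *args) behaves like f(pipeline, *args). So the .use() step above could equivalently (if less readably) have been written as:

ops = merge_in_counts(
    data(d1=d1)
        .project({'count_d1': '(1).sum()'}, group_by=['ID']),
    ['ID'],
    data(d2=d2))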

Now we insert the commands to clean up any count columns that may have picked up missing values. This is made easy as the operator pipeline tracks used and produced columns for us.

In [5]:

count_cols = [c for c in ops.column_names if c.startswith('count_')]
ops = (
    ops
        .extend({f'{c}': f'{c}.coalesce_0()' for c in count_cols})
        .order_rows(['ID'])
)

We now have our operator pipeline. Let’s pause and take a look at it.

(
    TableDescription(table_name="d1", column_names=["ID", "OP"])
    .project({"count_d1": "(1).sum()"}, group_by=["ID"])
    .natural_join(
        b=TableDescription(table_name="d2", column_names=["ID", "OP"]).project(
            {"count_d2": "(1).sum()"}, group_by=["ID"]
        ),
        by=["ID"],
        jointype="FULL",
    )
    .extend({"count_d1": "count_d1.coalesce(0)", "count_d2": "count_d2.coalesce(0)"})
    .order_rows(["ID"])
)

Notice we are treating the pipeline as inspectable metadata. We can print it, and we can even use its records to neaten our code (for example, by letting it track which columns hold counts).

To execute the pipeline we simply call .ex(). This executes all the operations on the data captured by the data() statements.
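
In [7]:

ops.ex()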

Out[7]:

   ID  count_d1  count_d2
0   1       0.0       2.0
1   2       1.0       2.0
2   3       1.0       1.0
3   4       0.0       3.0
4   5       0.0       2.0
5   6       0.0       1.0
6   7       2.0       0.0

If we do not want to capture so much data, but just the first few rows, we could use descr() instead of data().
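
For instance (a sketch, assuming descr() is available from the same imports as data()), the same pipeline built from table descriptions looks like:

ops_descr = (
    descr(d1=d1)
        .project({'count_d1': '(1).sum()'}, group_by=['ID'])
        .use(merge_in_counts, ['ID'], descr(d2=d2))
)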

In all cases, to execute on new data we would call .eval(), like so.

In [8]:

ops.eval({'d1': d1, 'd2': d2})

Out[8]:

   ID  count_d1  count_d2
0   1       0.0       2.0
1   2       1.0       2.0
2   3       1.0       1.0
3   4       0.0       3.0
4   5       0.0       2.0
5   6       0.0       1.0
6   7       2.0       0.0

The point is that operator pipelines can be used and re-used in many contexts.
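
For example, since d1 and d2 happen to share the same column structure, we could even apply the same pipeline with the two tables swapped:

ops.eval({'d1': d2, 'd2': d1})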

Data algebra pipelines also have built-in SQL translators for many common SQL dialects (currently PostgreSQL, Google BigQuery, MySQL, SQLite, and SparkSQL; extension to other dialects is easy).

In [9]:

db_handle = data_algebra.SQLite.example_handle()

_ = db_handle.insert_table(d1, table_name='d1')
_ = db_handle.insert_table(d2, table_name='d2')

We are inserting the tables as an example. In most database applications the data is already in the database and it is critical to avoid round-tripping the data to and from Python.

For example, we can use the data algebra generated query to create a new table in the database, with no additional round-tripping.

In [10]:

db_handle.execute(f'CREATE TABLE result AS {db_handle.to_sql(ops)}')

We can then look at a few rows of the result as follows.

In [ ]:

db_handle.read_query('SELECT * FROM result ORDER BY ID LIMIT 3')

We didn’t show the produced SQL string here, as it is particularly ugly due to the “UNION ALL” strategy the data algebra uses to simulate a full join in SQLite. However, the SQL can be accessed via db_handle.to_sql(ops) (in fact, we don’t even need a live handle; the data algebra also includes explicit, user-accessible database models). In databases that support a full join, the full join is used directly. Adapting to divergent dialects is one of the purposes of the data algebra. The Pandas interface itself is also parameterized to support replacement.
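
To inspect the generated SQL directly:

In [ ]:

print(db_handle.to_sql(ops))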

The design idea is: what if we took Codd’s relational algebra and kept it as described in his 1970 article (instead of aggressively combining unrelated operations, as SQL does)? R users will see this as related to dplyr, which is one of our reference systems and likely also owes ideas to Codd.

And that is the data algebra. It is a set of Codd-style relational operators that compose with each other, and are then applicable to data sources. The primary focus of the data algebra is composition of operations, with the intent of making it easier to decompose problems. The data algebra is particularly useful for long-term query maintenance, as the emphasis on explicit metadata and composition makes queries easy to modify.
