By John Mount
When working with multiple data tables, we often need to know, for a given set of keys, how many rows each table has.
I would like to use such an example in Python as yet another introduction to the data algebra (an alternative to direct Pandas or direct SQL notation).
First let’s import our packages and set up our example data.
In [1]:
from typing import List

import pandas

from data_algebra.data_ops import *
import data_algebra.SQLite

# some example data
d1 = pandas.DataFrame({
    'ID': [2, 3, 7, 7],
    'OP': ['A', 'B', 'B', 'D'],
})
d2 = pandas.DataFrame({
    'ID': [1, 1, 2, 3, 4, 2, 4, 4, 5, 5, 6],
    'OP': ['A', 'B', 'A', 'D', 'C', 'A', 'D', 'B', 'A', 'B', 'B'],
})
And let’s declare our list of key columns.
In [2]:
id_cols = ['ID']
What we want to do is:
- count, for each key (here ID), how many rows each table has,
- join these per-table counts into a single summary table, and
- clean up any missing counts by replacing them with zero.
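For orientation, here is a direct Pandas sketch of the same calculation (for comparison only; counts1, counts2, and res are names introduced just for this aside):

# direct Pandas version of the target result
counts1 = d1.groupby('ID').size().rename('count_d1').reset_index()
counts2 = d2.groupby('ID').size().rename('count_d2').reset_index()
res = counts1.merge(counts2, on='ID', how='outer').fillna(0).sort_values('ID')

The data algebra builds the same result, but as a composable, inspectable pipeline.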
First, we write code that implements the join step.
In [3]:
def merge_in_counts(
        pipeline: ViewRepresentation,
        id_cols: List[str],
        new_table_descr: TableDescription):
    return pipeline.natural_join(
        b=new_table_descr
            .project(
                {f'count_{new_table_descr.table_name}': '(1).sum()'},
                group_by=id_cols),
        by=id_cols,
        jointype='full')
Now we use that step to define our operator pipeline. The key point of the data algebra is: we apply operations incrementally to an operator pipeline, instead of directly to data.
In [4]:
ops = (
    data(d1=d1)
        .project({'count_d1': '(1).sum()'}, group_by=['ID'])
        .use(merge_in_counts, ['ID'], data(d2=d2))
)
The operations are built by standard Python method chaining. The data(d1=d1) step starts the pipeline with our d1 data frame. The .use() step treats its first argument as if it were a method taking the arguments that follow; this allows us to treat user code such as merge_in_counts() as if it were a class method extension. Now we insert the commands to clean up any count columns that may have picked up missing values. This is made easy, as the operator pipeline tracks used and produced columns for us.
In [5]:
count_cols = [c for c in ops.column_names if c.startswith('count_')]
ops = (
    ops
        .extend({f'{c}': f'{c}.coalesce_0()' for c in count_cols})
        .order_rows(['ID'])
)
We now have our operator pipeline. Let’s pause and take a look at it.
In [6]:
ops
Out[6]:
(
    TableDescription(table_name="d1", column_names=["ID", "OP"])
    .project({"count_d1": "(1).sum()"}, group_by=["ID"])
    .natural_join(
        b=TableDescription(table_name="d2", column_names=["ID", "OP"]).project(
            {"count_d2": "(1).sum()"}, group_by=["ID"]
        ),
        by=["ID"],
        jointype="FULL",
    )
    .extend({"count_d1": "count_d1.coalesce(0)", "count_d2": "count_d2.coalesce(0)"})
    .order_rows(["ID"])
)
Notice we are treating the pipeline as inspectable metadata. We can print it, and we can even use its records to neaten our code (such as using it to track which columns hold counts).
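For example, the produced column names can be read directly off the pipeline (this is the same attribute the count-column cleanup step above relied on):

# the pipeline tracks its produced columns
print(ops.column_names)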
To execute the pipeline we simply call .ex(). This executes all the operations on the data captured by the data() statements.
In [7]:
ops.ex()
Out[7]:
   ID  count_d1  count_d2
0   1       0.0       2.0
1   2       1.0       2.0
2   3       1.0       1.0
3   4       0.0       3.0
4   5       0.0       2.0
5   6       0.0       1.0
6   7       2.0       0.0
If one does not want to capture so much data, but just the first few rows, we could use descr() instead of data(). In all cases, to execute on new data we would call .eval(), like so.
In [8]:
ops.eval({'d1': d1, 'd2': d2})
Out[8]:
   ID  count_d1  count_d2
0   1       0.0       2.0
1   2       1.0       2.0
2   3       1.0       1.0
3   4       0.0       3.0
4   5       0.0       2.0
5   6       0.0       1.0
6   7       2.0       0.0
The point being, operator pipelines can be used and re-used in many contexts.
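For instance, the same ops pipeline can be evaluated against fresh data frames (a small sketch; d1_new and d2_new are hypothetical stand-ins for new data with the same columns):

# hypothetical new data with the same column structure
d1_new = pandas.DataFrame({'ID': [1, 9], 'OP': ['A', 'B']})
d2_new = pandas.DataFrame({'ID': [9, 9, 9], 'OP': ['A', 'B', 'C']})

# re-use the already-built pipeline, unchanged
ops.eval({'d1': d1_new, 'd2': d2_new})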
Data algebra pipelines also have built-in SQL translators for many common SQL dialects (currently PostgreSQL, Google BigQuery, MySQL, SQLite, and SparkSQL; extension is easy).
In [9]:
db_handle = data_algebra.SQLite.example_handle()
db_handle.insert_table(d1, table_name='d1')
_ = db_handle.insert_table(d2, table_name='d2')
We are inserting the tables here only for the example. In most database applications the data is already in the database, and it is critical to avoid round-tripping the data to and from Python.
For example, we can use the data algebra generated query to create a new table in the database, with no additional round-tripping.
In [10]:
db_handle.execute(f'CREATE TABLE result AS {db_handle.to_sql(ops)}')
We can then look at a few rows of the result as follows.
In [ ]:
db_handle.read_query('SELECT * FROM result ORDER BY ID LIMIT 3')
We didn’t show the produced SQL string, as this one is particularly ugly due to the “UNION ALL” strategy the data algebra inserts to simulate a full join in SQLite. However, the SQL can be accessed by db_handle.to_sql(ops) (in fact, we don’t even need a live handle: the data algebra also includes explicit, user-accessible database models). In databases that do have a full join, that join is used directly; adapting to divergent data dialects is one of the purposes of the data algebra. The Pandas interface itself is also parameterized to support replacement. The design idea is: what if we took Codd’s relational algebra and left it as described in his 1970 article, instead of aggressively combining unrelated operations as SQL does? R users will see this as being related to dplyr, which is one of our reference systems and likely also owes ideas to Codd.
And that is the data algebra. It is a series of Codd relational data operators that compose with each other, and are then applicable to data sources. The primary focus of the data algebra is composition of operations, with the intent of making it easier to decompose problems. The data algebra is particularly useful in long-term query maintenance, as modifying queries is easy given the emphasis on explicit metadata and composition.
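As a small illustration of that composability, an existing pipeline can be extended without revisiting its earlier steps (a sketch; count_total is a column name introduced just for this example):

# extend the finished pipeline with a derived column
ops_with_total = ops.extend({'count_total': 'count_d1 + count_d2'})

# the extended pipeline runs like any other
ops_with_total.eval({'d1': d1, 'd2': d2})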
Categories: Coding, data science, Exciting Techniques, Opinion, Practical Data Science, Pragmatic Data Science, Tutorials
Tagged as: Codd, data algebra, Pandas, pydata, SQL
