This is another fancy add-on.
Static Data is Different
A couple weeks ago, I came across a
blog from Retool
on their experience migrating a 4TB database. They put in place some good
procedures and managed a successful migration, but the whole experience was
complicated by the size of the database. The size of the database was the result
of a couple of very large “logging” tables: an edit log and an audit log.
The thing about log tables is, they don’t change much. They are append-only by
design. They are also queried fairly irregularly, and the queries are often time
ranged: “tell me what happened then” or “show me the activities between these
dates”.
So, one way the Retool migration could have been easier is if their log tables
were constructed as time-ranged partitions. That way there’d would only be one
“live” table in the partition set (the one with the recent entries) and a larger
collection of historical tables. The migration could move the live partition as
part of the critical path, and do all the historical partitions later.
Even after breaking up the log tables into manageable chunks they still remain,
in aggregate, pretty big! The PostgreSQL
documentation on partitioning
has some harsh opinions about stale data living at the end of a partition
collection:
The simplest option for removing old data is to drop the partition that is no
longer necessary.
There’s something to that! All those old historical records just fluff up your
base backups, and maybe you almost never have occasion to query it.
Is there an alternative to dropping the tables?
Dump Your Data in the Lake
What if there was a storage option that was still durable, allowed access via
multiple query tools, and could integrate transparently into your operational
transactional database?
How about: storing the static data in
Parquet format but
retaining database access to the data via the
parquet_fdw?
Sounds a bit crazy, but:
A foreign parquet table can participate in a partition along with a native
PostgreSQL table.
A parquet file can also be consumed by
R,
Python,
Go and a host of cloud
applications.
Modern PostgreSQL (14+) can parallelize access to foreign tables, so even
collections of Parquet files can be scanned effectively.
Parquet stores data compressed, so you can get way more raw data into less
storage.
Wait, Parquet?
Parquet is a language-independent storage format, designed for online analytics,
so:
Column oriented
Typed
Binary
Compressed
A standard table in PostgreSQL will be row-oriented on disk.
This layout is good for things PostgreSQL is expected to do, like query, insert,
update and delete data a “few” records at a time. (The value of “a few” can run
into the hundreds of thousands or millions, depending on the operation.)
A Parquet file stores data column-oriented on the disk, in batches called “row
groups“.
You can see where the Parquet format gets its name: the data is grouped into
little squares, like a parquet floor. One of the advantages of grouping data
together, is that compression routines tend to work better on data of the same
type, and even more so when the data elements have the same values.
Does This Even Work?
In a word “yes”, but with some caveats: Parquet has been around for several
years, but the ecosystem supporting it is still, relatively, in flux. New
releases of the underlying C++ libraries are still coming out regularly, the
parquet_fdw is only a couple years old,
and so on.
However, I was able to demonstrate to my own satisfaction that things were baked
enough to be interesting.
Loading Data
I started with a handy data table of Philadelphia parking infractions, that I
used in a previous
blog post on spatial joins,
and sorted the file by date of infraction, issue_datetime.
Data Download and Sort
#
# Download Philadelphia parking infraction data
#
curl “https://phl.carto.com/api/v2/sql?filename=parking_violations&format=csv&skipfields=cartodb_id,the_geom,the_geom_webmercator&q=SELECT%20*%20FROM%20parking_violations%20WHERE%20issue_datetime%20%3E=%20%272012-01-01%27%20AND%20issue_datetime%20%3C%20%272017-12-31%27″> phl_parking_raw.csv
#
# Sort it
#
sort -k2 -t, phl_parking_raw.csv> phl_parking.csv
Sorting the data by issue_datetime will make queries that filter against that
column go faster in the column-oriented Parquet setup.
— Create parking infractions table
CREATE TABLE phl_parking (
anon_ticket_number integer,
issue_datetime timestamptz,
state text,
anon_plate_id integer,
division text,
location text,
violation_desc text,
fine float8,
issuing_agency text,
lat float8,
lon float8,
gps boolean,
zip_code text
);
— Read in the parking data
copy phl_parking FROM ‘phl_parking.csv’ WITH (FORMAT csv, HEADER true);
OK, so now I have an 8M record data table, good for some bulk data experiments.
How big is the table?
SELECT pg_size_pretty(pg_relation_size(‘phl_parking’)) AS pg_table_size;
pg_table_size
—————-
1099 MB
Just over 1GB!
Generating Parquet
How do I get a Parquet file?
This turns out to be way harder than I expected. Most internet advice was around
using Python or Spark to convert CSV files into Parquet. In the end, I used the
very new (currently unreleased, coming in GDAL 3.5)
support for Parquet in GDAL library,
and the ogr2ogr command to do the conversion.
ogr2ogr -f Parquet
/tmp/phl_parking.parquet
PG:”dbname=phl host=localhost”
phl_parking
For these tests the Parquet file will reside on my local disk in /tmp, though
for cloud purposes it might reside on a cloud volume, or even (with the
right software) in an object
store.
% ls -lh /tmp/phl_parking.parquet
-rw-r–r– 1 pramsey wheel 216M 29 Apr 10:44 /tmp/phl_parking.parquet
Thanks to data compression, the Parquet file is 20% the size of the database
table!
Querying Parquet
Querying Parquet in PostgreSQL involves a number of parts, which can be
challenging to build right now.
Apache libarrow, built with Parquet
support enabled.
parquet_fdw itself.
Note that parquet_fdw requires libarrow version 6, not the recently released
version 7.
Once the FDW and supporting libraries are built, though, everything works just
like other FDW extensions.
CREATE EXTENSION parquet_fdw;
CREATE SERVER parquet_srv FOREIGN DATA WRAPPER parquet_fdw;
CREATE FOREIGN TABLE phl_parking_pq (
anon_ticket_number integer,
issue_datetime timestamptz,
state text,
anon_plate_id integer,
division text,
location text,
violation_desc text,
fine float8,
issuing_agency text,
lat float8,
lon float8,
gps boolean,
zip_code text
)
SERVER parquet_srv
OPTIONS (filename ‘/tmp/phl_parking.parquet’,
sorted ‘issue_datetime’,
use_threads ‘true’);
Compared to the raw table, the Parquet file is similar in performance, usually a
little slower. Just blasting through a row count (when the tables are pre-cached
in memory).
— Give native table same indexing advantage
— as the parquet file
CREATE INDEX ON phl_parking USING BRIN (issue_datetime);
SELECT Count(*) FROM phl_parking_pq;
— Time: 1230 ms
SELECT Count(*) FROM phl_parking;
— Time: 928 ms
Similarly, a filter also is slightly faster on PostgreSQL.
SELECT Sum(fine), Count(1)
FROM phl_parking_pq
WHERE issue_datetime BETWEEN ‘2014-01-01’ AND ‘2015-01-01’;
— Time: 692 ms
SELECT Sum(fine), Count(1)
FROM phl_parking
WHERE issue_datetime BETWEEN ‘2014-01-01’ AND ‘2015-01-01’;
— Time: 571 ms
The parquet_fdw is very nicely implemented, and will even tell you the
execution plan that will be used on the file for a given filter. For example,
the previous filter involves opening about 20% of the 132 row groups in the
Parquet file.
EXPLAIN SELECT Sum(fine), Count(1)
FROM phl_parking_pq
WHERE issue_datetime BETWEEN ‘2014-01-01’ AND ‘2015-01-01′;
Finalize Aggregate (cost=6314.77..6314.78 rows=1 width=16)
-> Gather (cost=6314.55..6314.76 rows=2 width=16)
Workers Planned: 2
-> Partial Aggregate (cost=5314.55..5314.56 rows=1 width=16)
-> Parallel Foreign Scan on phl_parking_pq (cost=0.00..5242.88 rows=14333 width=8)
Filter: ((issue_datetime>=’2014-01-01 00:00:00-08’)
AND (issue_datetime
Read More
Share this on knowasiak.com to discuss with people on this topicSign up on Knowasiak.com now if you’re not registered yet.