Show HN: git-history, for analyzing scraped data collected using Git and SQLite

I described Git scraping last year: a technique for writing scrapers where you periodically snapshot a source of data to a Git repository in order to record changes to that source over time.

The open problem was how to analyze that data once it was collected. git-history is my new tool designed to tackle that problem.

Git scraping, a refresher

A neat thing about scraping to a Git repository is that the scrapers themselves can be really simple. I demonstrated how to run scrapers for free using GitHub Actions in this five minute lightning talk back in March.

Here’s a concrete example: California’s state fire department, Cal Fire, maintains an incident map at fire.ca.gov/incidents showing the status of current large fires in the state.

I found the underlying data here:

curl https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents

Then I built a simple scraper that grabs a copy of that every 20 minutes and commits it to Git. I’ve been running that for 14 months now, and it has collected 1,559 commits!
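A scraper of this kind can be very small. The sketch below is a hypothetical Python version, not the scraper actually used for this repository: the helper names write_if_changed and scrape_once, the default filename and the commit message are all assumptions made for illustration. It only commits when the fetched content differs from what is already on disk.

```python
import subprocess
import urllib.request
from pathlib import Path

URL = "https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents"


def write_if_changed(path, content):
    """Write bytes to path; return True only if they differ from the file on disk."""
    path = Path(path)
    if path.exists() and path.read_bytes() == content:
        return False
    path.write_bytes(content)
    return True


def scrape_once(repo_dir, filename="incidents.json", url=URL):
    """Fetch the feed and commit it if it changed.

    Assumes repo_dir is an existing Git checkout with a configured author.
    """
    with urllib.request.urlopen(url) as response:
        content = response.read()
    if not write_if_changed(Path(repo_dir) / filename, content):
        return False  # nothing new, so no commit
    subprocess.run(["git", "add", filename], cwd=repo_dir, check=True)
    subprocess.run(["git", "commit", "-m", "Latest data"], cwd=repo_dir, check=True)
    return True
```

Calling scrape_once(".") from a scheduler every 20 minutes would build up the kind of commit history described here.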

The thing that excites me most about Git scraping is that it can create truly unique datasets. It’s common for organizations not to keep detailed archives of what changed and where, so by scraping their data into a Git repository you can often end up with a more detailed history than they maintain themselves.

There’s one big challenge though: having collected that data, how can you best analyze it? Reading through thousands of commit differences and eyeballing changes to JSON or CSV files isn’t a great way of finding the interesting stories that have been captured.

git-history

git-history is the new CLI tool I’ve built to answer that question. It reads through the entire history of a file and generates a SQLite database reflecting changes to that file over time. You can then use Datasette to explore the resulting data.

Here’s an example database created by running the tool against my ca-fires-history repository. I created the SQLite database by running this in the repository directory:

git-history file ca-fires.db incidents.json \
  --namespace incident \
  --id UniqueId \
  --convert 'json.loads(content)["Incidents"]'

Animated gif showing the progress bar

In this example we’re processing the history of a single file called incidents.json.

We use the UniqueId column to identify which records are changed over time as opposed to newly created.

Specifying --namespace incident causes the created database tables to be called incident and incident_version rather than the default of item and item_version.

And we have a fragment of Python code that knows how to turn each version stored in that commit history into a list of objects compatible with the tool; see --convert in the documentation for details.

Let’s use the database to answer some questions about fires in California over the past 14 months.

The incident table contains a copy of the latest record for every incident. We can use that to see a map of every fire:

A map showing 250 fires in California

This uses the datasette-cluster-map plugin, which draws a map of every row with a valid latitude and longitude column.

Where things get interesting is the incident_version table. This is where changes between different scraped versions of each item are recorded.

Those 250 fires have 2,060 recorded versions. If we facet by _item we can see which fires had the most versions recorded. Here are the top ten:

This looks about right: the larger the number of versions, the longer the fire must have been burning. The Dixie Fire has its own Wikipedia page!

Clicking through to the Dixie Fire lands us on a page showing every “version” that we captured, ordered by version number.

git-history only writes values to this table that have changed since the previous version. This means you can glance at the table grid and get a feel for which pieces of information were updated over time:

The table showing changes to that fire over time

The ConditionStatement is a text description that changes frequently, but the other two interesting columns look to be AcresBurned and PercentContained.

That _commit column is a foreign key to commits, which records commits that have been processed by the tool, mainly so that when you run it a second time it can pick up where it finished last time.

We can join against commits to see the date that each version was created. Or we can use the incident_version_detail view which performs that join for us.

Using that view, we can filter for just rows where _item is 174 and AcresBurned is not blank, then use the datasette-vega plugin to visualize the _commit_at date column against the AcresBurned numeric column... and we get a graph of the growth of the Dixie Fire over time!

The chart plugin showing a line chart
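The same filter can be written by hand as a join between the version table and commits. The sketch below uses Python's sqlite3 module with an in-memory database; the table definitions are trimmed to the columns involved, and every row (hashes, dates and acreages) is a made-up placeholder rather than real data from ca-fires.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE commits (id INTEGER PRIMARY KEY, hash TEXT, commit_at TEXT);
    CREATE TABLE incident_version (
        _id INTEGER PRIMARY KEY,
        _item INTEGER,
        _version INTEGER,
        _commit INTEGER REFERENCES commits(id),
        AcresBurned INTEGER
    );
    -- Placeholder rows, invented for this sketch
    INSERT INTO commits VALUES
        (1, 'aaa', '2000-01-01T00:00:00'),
        (2, 'bbb', '2000-01-02T00:00:00'),
        (3, 'ccc', '2000-01-03T00:00:00');
    INSERT INTO incident_version VALUES
        (1, 174, 1, 1, 100),
        (2, 174, 2, 2, NULL),  -- AcresBurned did not change in this version
        (3, 174, 3, 3, 500),
        (4, 9, 1, 1, 20);
    """
)

# One (date, acres) pair per version of item 174 that recorded a new AcresBurned
rows = db.execute(
    """
    SELECT commits.commit_at, incident_version.AcresBurned
    FROM incident_version
      JOIN commits ON incident_version._commit = commits.id
    WHERE incident_version._item = ?
      AND incident_version.AcresBurned IS NOT NULL
    ORDER BY commits.commit_at
    """,
    (174,),
).fetchall()
```

Because unchanged values are stored as null, the IS NOT NULL filter leaves only the versions where the acreage actually moved, which is what makes the resulting line chart readable.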

To review: we started out with a GitHub Actions scheduled workflow grabbing a copy of a JSON API endpoint every 20 minutes. Thanks to git-history, Datasette and datasette-vega we now have a chart showing the growth of the longest-lived California wildfire of the last 14 months over time.

A note on schema design

One of the hardest problems in designing git-history was deciding on an appropriate schema for storing version changes over time.

I ended up with the following (edited for clarity):

CREATE TABLE [commits] (
   [id] INTEGER PRIMARY KEY,
   [hash] TEXT,
   [commit_at] TEXT
);
CREATE TABLE [item] (
   [_id] INTEGER PRIMARY KEY,
   [_item_id] TEXT,
   [IncidentID] TEXT,
   [Location] TEXT,
   [Type] TEXT,
   [_commit] INTEGER
);
CREATE TABLE [item_version] (
   [_id] INTEGER PRIMARY KEY,
   [_item] INTEGER REFERENCES [item]([_id]),
   [_version] INTEGER,
   [_commit] INTEGER REFERENCES [commits]([id]),
   [IncidentID] TEXT,
   [Location] TEXT,
   [Type] TEXT
);
CREATE TABLE [columns] (
   [id] INTEGER PRIMARY KEY,
   [namespace] INTEGER REFERENCES [namespaces]([id]),
   [name] TEXT
);
CREATE TABLE [item_changed] (
   [item_version] INTEGER REFERENCES [item_version]([_id]),
   [column] INTEGER REFERENCES [columns]([id]),
   PRIMARY KEY ([item_version], [column])
);

As shown earlier, records in the item_version table represent snapshots over time, but to save on database space and provide a neater interface for browsing versions, they only record columns that had changed since their previous version. Any unchanged columns are stored as null.

There’s one catch with this schema: what do we do if a new version of an item sets one of the columns to null? How can we tell the difference between that and a column that didn’t change?

I ended up solving that with an item_changed many-to-many table, which uses pairs of integers (hopefully taking up as little space as possible) to record exactly which columns were modified in which item_version records.
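To make the ambiguity concrete, here is a minimal Python sketch of the idea. It is not git-history's own code: the function name sparse_version and the sample values are invented for illustration. It produces both the sparse row (unchanged columns as None) and the separate list of changed column names that plays the role of item_changed.

```python
def sparse_version(previous, current):
    """Return (row, changed) for a new version of an item.

    row holds only the values that changed, with None everywhere else;
    changed lists the column names that really were modified.
    """
    changed = sorted(
        key
        for key in current
        if key not in previous or previous[key] != current[key]
    )
    row = {key: (current[key] if key in changed else None) for key in current}
    return row, changed


# Invented sample values
previous = {"AcresBurned": 100, "PercentContained": 10, "Helicopters": 2}
current = {"AcresBurned": 100, "PercentContained": None, "Helicopters": 3}
row, changed = sparse_version(previous, current)
```

In row, both AcresBurned and PercentContained are None, but only PercentContained appears in changed, so a reader can tell "set to null" apart from "did not change".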

The item_version_detail view displays columns from that many-to-many table as JSON. Here’s a filtered example showing which columns were changed in which versions of which items:

This table shows a JSON list of column names against items and versions

Here’s a SQL query that shows, for ca-fires, which columns were updated most often:

select columns.name, count(*)
from incident_changed
  join incident_version on incident_changed.item_version = incident_version._id
  join columns on incident_changed.column = columns.id
where incident_version._version > 1
group by columns.name
order by count(*) desc

  • Updated: 1785
  • PercentContained: 740
  • ConditionStatement: 734
  • AcresBurned: 616
  • Started: 327
  • PersonnelInvolved: 286
  • Engines: 274
  • CrewsInvolved: 256
  • WaterTenders: 225
  • Dozers: 211
  • AirTankers: 181
  • StructuresDestroyed: 125
  • Helicopters: 122

Helicopters are exciting! Let’s find all of the fires which had at least one record where the number of helicopters changed (after the first version). We’ll use a nested SQL query:

select * from incident
where _id in (
  select _item from incident_version
  where _id in (
    select item_version from incident_changed where column = 15
  )
  and _version > 1
)

That returned 19 fires that were significant enough to involve helicopters. Here they are on a map:

A map of 19 fires that involved helicopters
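The nested query hard-codes 15, the ID of the Helicopters row in the columns table. A variation that looks the column up by name avoids that. The sketch below is one way to write it using Python's sqlite3 module; the helper name items_where_column_changed is hypothetical, the tables are trimmed to the columns involved, and the rows are invented for the demonstration.

```python
import sqlite3


def items_where_column_changed(db, column_name):
    """Return the _item ids with a change to column_name after version 1."""
    sql = """
    SELECT DISTINCT incident_version._item
    FROM incident_changed
      JOIN incident_version
        ON incident_changed.item_version = incident_version._id
      JOIN columns ON incident_changed."column" = columns.id
    WHERE columns.name = ?
      AND incident_version._version > 1
    ORDER BY incident_version._item
    """
    return [row[0] for row in db.execute(sql, (column_name,))]


db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE columns (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE incident_version (
        _id INTEGER PRIMARY KEY, _item INTEGER, _version INTEGER
    );
    CREATE TABLE incident_changed (
        item_version INTEGER, "column" INTEGER,
        PRIMARY KEY (item_version, "column")
    );
    -- Invented rows: item 1 changes Helicopters in version 2, item 2 does not
    INSERT INTO columns VALUES (3, 'AcresBurned'), (15, 'Helicopters');
    INSERT INTO incident_version VALUES (1, 1, 1), (2, 1, 2), (3, 2, 1), (4, 2, 2);
    INSERT INTO incident_changed VALUES (1, 15), (2, 15), (3, 15), (4, 3);
    """
)
helicopter_items = items_where_column_changed(db, "Helicopters")
```

The returned IDs could then be used in a "where _id in (...)" filter against the incident table, as in the nested query.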

Advanced usage of --convert

Drew Breunig has been running a Git scraper for the past 8 months in dbreunig/511-events-history against 511.org, a site showing traffic incidents in the San Francisco Bay Area. I loaded his data into this example sf-bay-511 database.

The sf-bay-511 example is useful for digging more into the --convert option to git-history.

git-history requires recorded data to be in a specific shape: it needs a JSON list of JSON objects, where each object has a column that can be treated as a unique ID for purposes of tracking changes to that specific record over time.

The ideal tracked JSON file would look something like this:

[
  {
    "IncidentID": "abc123",
    "Location": "Corner of 4th and Vermont",
    "Type": "fire"
  },
  {
    "IncidentID": "cde448",
    "Location": "555 West Example Drive",
    "Type": "medical"
  }
]

It’s common for data that has been scraped to not fit this ideal shape.
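A quick way to find out whether a snapshot already has the ideal shape is to check it before running the tool. The Python sketch below is a hypothetical helper, not part of git-history: the name check_shape and the decision to treat duplicate IDs within one snapshot as an error are assumptions.

```python
import json


def check_shape(content, id_key):
    """Parse content and confirm it is a list of objects that each carry id_key."""
    records = json.loads(content)
    if not isinstance(records, list):
        raise ValueError("expected a JSON list at the top level")
    seen = set()
    for record in records:
        if not isinstance(record, dict):
            raise ValueError("expected every list entry to be a JSON object")
        if id_key not in record:
            raise ValueError("record is missing the ID column " + repr(id_key))
        if record[id_key] in seen:
            raise ValueError("duplicate ID " + repr(record[id_key]))
        seen.add(record[id_key])
    return records


sample = """
[
  {"IncidentID": "abc123", "Location": "Corner of 4th and Vermont", "Type": "fire"},
  {"IncidentID": "cde448", "Location": "555 West Example Drive", "Type": "medical"}
]
"""
records = check_shape(sample, "IncidentID")
```

A snapshot that fails a check like this is a candidate for a --convert snippet that reshapes it first.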

The 511.org JSON feed can be found here. It’s a pretty complicated nested set of objects, and there’s a bunch of data in there that’s quite noisy without adding much to the overall analysis: things like an updated timestamp field that changes in every version even if there are no changes, or a deeply nested "extension" object full of duplicate data.

I wrote a snippet of Python to transform each of those recorded snapshots into a simpler structure, and then passed that Python code to the --convert option to the script:

#!/bin/bash
git-history file sf-bay-511.db 511-events-history/events.json \
  --repo 511-events-history \
  --id id \
  --convert '
data = json.loads(content)
if data.get("error"):
    # {"code": 500, "error": "Error accessing remote data..."}
    return
for event in data["Events"]:
    event["id"] = event["extension"]["event-reference"]["event-identifier"]
    # Remove noisy updated timestamp
    del event["updated"]
    # Drop extension block entirely
    del event["extension"]
    # "schedule" block is noisy but not interesting
    del event["schedule"]
    # Flatten nested subtypes
    event["event_subtypes"] = event["event_subtypes"]["event_subtype"]
    if not isinstance(event["event_subtypes"], list):
        event["event_subtypes"] = [event["event_subtypes"]]
    yield event
'

The single-quoted string passed to --convert is compiled into a Python function and run against each Git version in turn. My code loops through the nested Events list, modifying each record and then outputting them as an iterable sequence using yield.

A few of the records in the history were server 500 errors, so the code block knows how to identify and skip those as well.
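For readers curious how a string of code can become a function, here is a simplified Python sketch of the general pattern. It is an illustration under assumptions, not git-history's actual implementation: the name compile_convert and the rule "a single expression gets an implicit return" are choices made for this example.

```python
import json
import textwrap


def compile_convert(code):
    """Wrap a --convert style string in a function that takes `content`."""
    body = textwrap.dedent(code).strip()
    try:
        # A single expression, e.g. json.loads(content)["Incidents"]
        compile(body, "<convert>", "eval")
        body = "return " + body
    except SyntaxError:
        # Otherwise treat it as a block of statements using return or yield
        pass
    source = "def convert(content):\n" + textwrap.indent(body, "    ")
    namespace = {"json": json}  # make json available to the snippet
    exec(source, namespace)
    return namespace["convert"]


simple = compile_convert('json.loads(content)["Incidents"]')

skipping = compile_convert(
    """
    data = json.loads(content)
    if data.get("error"):
        return
    for event in data["Events"]:
        yield event
    """
)
```

Because the second snippet contains yield, the wrapped function is a generator, so the bare return simply ends it with no records. That is why an error snapshot is skipped rather than crashing the run.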

When working with git-history I find myself spending most of my time iterating on these conversion scripts. Passing strings of Python code to tools like this is a pretty fun pattern; I also used it for sqlite-utils convert earlier this year.

Trying this out yourself

If you want to try this out for yourself, the git-history tool has an extensive README describing the other options, and the scripts used to create these demos can be found in the demos folder.

The git-scraping topic on GitHub now has over 200 repos built by dozens of different people. That’s a lot of interesting scraped data sat there waiting to be explored!
