How to analyse and aggregate data from DynamoDB (2020)

02 Feb 2020

DynamoDB is not a database designed to let you run analysis queries. We can, however, use DynamoDB streams and Lambda functions to run these analyses each time data changes.

This article explains how you can build such an analysis pipeline and demonstrates it with two examples. You should be familiar with DynamoDB tables and AWS Lambda.

Pipeline setup

Assuming we already have a DynamoDB table, there are two more components we need to set up: a DynamoDB stream and a Lambda function. The stream emits changes such as inserts, updates and deletes.

DynamoDB Stream

To set up the DynamoDB stream, we'll go through the AWS management console. Open the settings of your table and click the button called "Manage Stream".

Stream details

By default you can go with "New and old images" which will give you the most data to work with. Once you have enabled the stream, you can copy its ARN, which we will use in the next step.

Create a Lambda Function

If you work with the serverless framework, you can simply set the stream as an event source for your function by including the ARN as a stream in the events section.

    handler: analysis.handle
    events:
      - stream: arn:aws:dynamodb:us-east-1:xxxxxxx:table/my-table/stream/2020-02-02T20:20:02.002

Deploy the changes with sls deploy and your function is ready to process the incoming events. It's a good idea to start by just printing the data from DynamoDB and then building your function around that input.
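A minimal sketch of such a starter handler (assuming the `analysis.handle` name from the config above; the event follows the standard DynamoDB stream record format):

```python
def handle(event, context):
    """Log every change record the DynamoDB stream delivers."""
    for record in event["Records"]:
        # eventName is INSERT, MODIFY or REMOVE; NewImage is only present
        # for stream view types that include it ("New and old images" does).
        print(record["eventName"], record["dynamodb"].get("NewImage"))
    return {"processed": len(event["Records"])}
```

Once you see what the records look like, you can replace the print with real processing.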

Data Design

With DynamoDB it's super important to think about your data access patterns first, or you'll have to rebuild your tables more often than necessary. Also see Rick Houlihan's excellent design patterns for DynamoDB from re:Invent 2018 and re:Invent 2019.

Example 1: Price calculation

EVE Online

In EVE Online's player-driven economy, items can be traded through contracts. A passion project of mine uses EVE Online's API to collect data about item exchange contracts in order to calculate prices for these items. It collected more than 1.5 million contracts over the last year and derived prices for roughly 7000 items.


To calculate an average price, we need more than one price point. For this reason the single contract we collect is not enough; we need all the price points for an item. A raw contract from the API looks roughly like this:

    {
      "contract_id": 152838252,
      "date_issued": "2020-01-05T20:47:40Z",
      "issuer_id": 1273892852,
      "price": 69000000,
      "location_id": 60003760,
      "contract_items": [2047]
    }

Because the API's response is not in an optimal structure, we have to do some pre-processing to get rid of unnecessary data and put the key data into the table's partition and sort keys. Keep in mind that table scans can get expensive, and smaller entries mean more records per query result.

|type_id (pk)|date (sk)           |price   |location|
|2047        |2020-01-05T20:47:40Z|69000000|60003760|

In this case I chose to use the item's ID (e.g. 2047) as the partition key and the date as the sort key. That way my analyser can fetch all the records for one item and limit them to the most recent entries.
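The pre-processing step can be sketched as a pure mapping function (field names are assumed from the contract snippet above; the resulting dict would be passed to a boto3 `table.put_item(Item=...)` call):

```python
def to_price_point(contract, type_id):
    """Map a raw API contract to the slim price-point table item."""
    return {
        "type_id": type_id,               # partition key: the item's ID
        "date": contract["date_issued"],  # sort key: ISO dates sort correctly
        "price": contract["price"],
        "location": contract["location_id"],
    }
```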


The attached Lambda function receives an event from the stream. This event contains, among other things, the item's ID for which it should calculate a new price. Using this ID, the function queries the pre-processed data and receives a list of items from which it can calculate averages and other valuable data.
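The aggregation itself can be as simple as this sketch (the statistics chosen here are illustrative; the article only requires an average):

```python
def aggregate_prices(price_points):
    """Derive average, minimum and maximum from queried price points."""
    prices = [p["price"] for p in price_points]
    return {
        "average": sum(prices) / len(prices),
        "min": min(prices),
        "max": max(prices),
    }
```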

Attention: Don't do a scan here! That can get expensive fast. Design your data so that you can use queries.

The aggregated result is persisted in another table, from which we can serve a pricing API.


Without further adjustment, the analysis function will get linearly slower as more price points come in. It can, however, limit the number of price points it loads. By scanning the date in the sort key backwards, we load only the newest, most relevant entries. Depending on our requirements, we can then choose to load only one or two pages, or go for the most recent 1000 entries. This way we can enforce an upper bound on the runtime per item.
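In boto3 terms this corresponds to a query with `ScanIndexForward=False` and a `Limit`. A small helper building those arguments (attribute names match the price-point table sketched above; pass the result to `table.query(**...)`):

```python
def recent_price_query(type_id, limit=1000):
    """Build query arguments that load only the newest price points."""
    return {
        "KeyConditionExpression": "type_id = :t",
        "ExpressionAttributeValues": {":t": type_id},
        "ScanIndexForward": False,  # walk the date sort key backwards
        "Limit": limit,             # bounds entries read, and thus runtime
    }
```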

Example 2: Leaderboard

Another example, based on a Twitter discussion, is a leaderboard. In the German soccer league Bundesliga, the club from Cologne won 4:0 against the club from Freiburg today. This means that Cologne gets three points while Freiburg gets zero. Loading all the matches and then calculating the score on the fly would result in terrible performance once we get deeper into the season. That's why we should again use streams.

Analysis Pipeline

Data Design

We can assume that our first table holds raw data in the following structure:

|league (pk)|match_id (sk)|first_party|second_party|score|
|Bundesliga |1            |Cologne    |Freiburg    |4:0  |
|Bundesliga |2            |Hoffenheim |Bayer       |2:1  |

We design the leaderboard table with a structure where we can store multiple leagues and paginate over the participants. We're going with a composite sort key, as we want the database to sort the leaderboard first by score, then by the number of goals scored, and finally by name.

|league (pk)|score#goals#name (sk)|score|goals_shot|name (GSI)|
|Bundesliga |003#004#Cologne      |3    |4         |Cologne   |
|Bundesliga |003#002#Hoffenheim   |3    |2         |Hoffenheim|
|Bundesliga |000#001#Bayer        |0    |1         |Bayer     |
|Bundesliga |000#000#Freiburg     |0    |0         |Freiburg  |

Because the sort key (sk) is a string, we have to zero-pad the numbers. Sorting strings containing numbers won't give the same result as sorting plain numbers. Choose the padding wisely and go for a couple of orders of magnitude more than you expect the score to reach. Note that this approach won't work well if your scores can grow indefinitely. If you have a solution for that, please share it and I'll reference you here!
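Building the composite key is a one-liner; a sketch with a configurable padding width:

```python
def leaderboard_sort_key(score, goals, name, width=3):
    """Build the zero-padded composite sort key score#goals#name."""
    return f"{score:0{width}d}#{goals:0{width}d}#{name}"
```

With padding, `"010#..."` correctly sorts after `"009#..."`; without it, the string `"10"` would sort before `"9"`.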

We're also including a GSI on the club's name, to have better access to a single club's leaderboard entry.


Whenever a match result is inserted into the first table, the stream fires an event for the analysis function. This entry contains the match and its score, from which we can derive who gets how many points.

Based on the clubs' names, we load the old leaderboard entries. Using these entries, we first delete the current records, then take the current scores and goals, add the new ones, and write the new leaderboard records.

|league (pk)|score#goals#name (sk)|score|goals_shot|name (GSI)|
|Bundesliga |006#005#Cologne      |6    |5         |Cologne   |
|Bundesliga |003#005#Bayer        |3    |5         |Bayer     |
|Bundesliga |003#002#Hoffenheim   |3    |2         |Hoffenheim|
|Bundesliga |000#000#Freiburg     |0    |0         |Freiburg  |
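The per-club update can be sketched as a pure function (assuming the standard 3/1/0 points rule; the article only states win = 3 and loss = 0; the write itself would then delete the old item and put the returned one):

```python
def points(goals_for, goals_against):
    """Three points for a win, one for a draw, none for a loss."""
    if goals_for > goals_against:
        return 3
    return 1 if goals_for == goals_against else 0

def updated_entry(old_entry, goals_for, goals_against):
    """Return the new leaderboard item for one club after one match."""
    score = old_entry["score"] + points(goals_for, goals_against)
    goals = old_entry["goals_shot"] + goals_for
    return {
        "league": old_entry["league"],
        "sk": f"{score:03d}#{goals:03d}#{old_entry['name']}",
        "score": score,
        "goals_shot": goals,
        "name": old_entry["name"],
    }
```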


As every match results in one or two queries and one or two updates, the time needed to update the scores stays bounded.

When we display the leaderboard, it's a good idea to use pagination. That way the user sees a reasonable amount of data, and our requests have a bounded runtime as well.

Enjoyed this article? I publish a new article every month. Connect with me on Twitter and get new articles delivered to your inbox!
