Show HN: Documentation Generator for Data Pipelines


This package generates documentation and data lineage charts for data pipelines. It is source-agnostic and uses a predefined JSON/YAML format to represent dependencies and business logic. The resulting Markdown files can be used standalone or as part of a documentation site built with tools like MkDocs or VuePress.

npm install -g @datayoga-io/lineage

To quickly get started with Lineage, scaffold a new project. This will create the folder structure along with sample files.

dy-lineage scaffold ./my-project

To generate the documentation for the new project:

dy-lineage build ./my-project --dest ./docs

Lineage models the data ecosystem using the following entities:

Datastore – A datastore represents a source or target of data that can hold data at rest or data in motion. Datastores include entities such as a table in a database, a file, or a stream in Kafka. A Datastore can act either as a source or a target of a pipeline.

File – A file is a type of Datastore that represents information stored in files. Files contain metadata about their structure and schema.

Dimension – A dimension table / file is typically used for lookup and constant information that is managed as part of the application code. This often includes lookup values such as country codes.

Runner – A runner is a processing engine capable of running data operations. Every Runner supports one or more programming languages. Some Runners, like a database engine, only support SQL, while others like Spark may support Python, Scala, and Java.

Consumer – A consumer consumes data and presents it to a user. Consumers include reports, dashboards, and interactive applications.

Pipeline – A pipeline represents a series of Jobs that operate on a single Runner.

Job – A job is composed of a series of Steps that fetch information from one or more Datastores, transform it, and store the result in a target Datastore, or perform actions such as sending out alerts or making HTTP calls.

Job Step – Every step in a job performs a single action. A step can be of a certain type representing the action it performs. A step can be an SQL statement, a Python statement, or a callout to a library. Steps can be chained to create a Directed Acyclic Graph (DAG).
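The relationship between Pipeline, Job, and Job Step can be sketched in a few lines of Python. This is only an illustration of the concepts described above, not Lineage's internal model: the class names, the `kind` labels, and the `depends_on` field are assumptions made for this example. The `execution_order` method shows what it means for chained steps to form a DAG: a valid order exists only if no step depends on itself, directly or indirectly.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One action in a job; `depends_on` names the steps it follows."""
    name: str
    kind: str  # illustrative labels such as "sql" or "python"
    depends_on: list[str] = field(default_factory=list)


@dataclass
class Job:
    """A series of steps whose dependencies must form a DAG."""
    name: str
    steps: list[Step]

    def execution_order(self) -> list[str]:
        """Return step names so that every step follows its dependencies."""
        remaining = {s.name: set(s.depends_on) for s in self.steps}
        order = []
        while remaining:
            # Steps with no unmet dependencies can run next.
            ready = sorted(n for n, deps in remaining.items() if not deps)
            if not ready:
                raise ValueError(
                    f"job {self.name!r} has a cycle or an unknown dependency"
                )
            for name in ready:
                order.append(name)
                del remaining[name]
            for deps in remaining.values():
                deps.difference_update(ready)
        return order


@dataclass
class Pipeline:
    """A series of jobs that all run on a single runner."""
    name: str
    runner: str
    jobs: list[Job]


# Hypothetical job: the step names are made up for this sketch.
job = Job("load_orders", [
    Step("write_orders", "sql", depends_on=["clean"]),
    Step("read_raw", "sql"),
    Step("clean", "python", depends_on=["read_raw"]),
])
pipeline = Pipeline("order_mgmt.load_orders", runner="spark", jobs=[job])
print(job.execution_order())  # ['read_raw', 'clean', 'write_orders']
```

A job whose steps depend on each other in a loop raises `ValueError` instead of returning an order.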

Lineage scans the input file(s) and attempts to load the catalog information. In addition, a folder of YAML files describing each pipeline's business logic can be provided.

See the Example folder for sample input files and generated output files.

Structure of input folder

.
├── .dyrc
├── datastores
├── files
├── pipelines
└── relations

  • .dyrc: Used to store global configuration.
  • datastores: Catalog file(s) with information about datastore entities and their metadata.
  • files: Catalog file(s) with information about file datastore entities and their metadata.
  • pipelines: Optional files containing information about the pipelines and business logic flow.
  • relations: Information about the relations between the data entities.

Structure of catalog file

The catalog file contains the entity definition and metadata for each of the entities.
The node naming convention is: <entity type>:<module name>.<entity name>. The module name can be nested, e.g. order_mgmt.weekdays.inbound.load_orders

Example:

pipeline:order_mgmt.load_orders:
datastore:orders:
datastore:raw_orders:
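To make the naming convention concrete, here is a small Python sketch that splits a node name into its parts. It assumes, based only on the examples above, that the text before the first colon is the entity type, that the last dotted segment is the entity name, and that anything in between is the (possibly nested, possibly absent) module. `NodeName` and `parse_node_name` are hypothetical helpers, not part of Lineage. The trailing colon in the catalog example is YAML key syntax and is not part of the node name.

```python
from typing import NamedTuple


class NodeName(NamedTuple):
    entity_type: str  # e.g. "pipeline" or "datastore"
    module: str       # dotted module path, "" when there is none
    name: str         # last dotted segment


def parse_node_name(node: str) -> NodeName:
    """Split '<type>:<module>.<name>' into its three parts."""
    entity_type, sep, path = node.partition(":")
    if not sep or not entity_type or not path:
        raise ValueError(f"expected '<type>:<name>', got {node!r}")
    # Everything before the last dot is treated as the module path.
    module, _, name = path.rpartition(".")
    return NodeName(entity_type, module, name)


print(parse_node_name("pipeline:order_mgmt.load_orders"))
print(parse_node_name("datastore:orders"))
```

For `datastore:orders` the module comes back as an empty string, which matches catalog entries that are not placed in a module.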

Structure of relations file

The relations file holds the relationships between the entities.

Example:

- source: datastore:raw_orders
  target: pipeline:order_mgmt.load_orders
- source: pipeline:order_mgmt.load_orders
  target: datastore:orders
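Because each relation is a source/target pair, the relations file describes a directed graph, and lineage questions become graph traversals. The following Python sketch is an assumption about how such data could be queried, not a description of how Lineage builds its charts. It takes the example above as a list of dictionaries, the shape a YAML parser would typically return, and collects everything upstream of a node. The `upstream` function is a hypothetical helper.

```python
from collections import defaultdict, deque

# The relations example above, as a list of dicts.
relations = [
    {"source": "datastore:raw_orders", "target": "pipeline:order_mgmt.load_orders"},
    {"source": "pipeline:order_mgmt.load_orders", "target": "datastore:orders"},
]


def upstream(node, relations):
    """Return every node that feeds `node`, directly or indirectly."""
    parents = defaultdict(set)
    for rel in relations:
        parents[rel["target"]].add(rel["source"])

    seen = set()
    queue = deque([node])
    while queue:  # breadth-first walk against the edge direction
        current = queue.popleft()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


print(sorted(upstream("datastore:orders", relations)))
# ['datastore:raw_orders', 'pipeline:order_mgmt.load_orders']
```

Swapping `source` and `target` when building the map gives the downstream view instead, which answers the question of what is affected when a datastore changes.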

Adding metadata and business logic flow to pipelines

TBD

Lineage collectors export lineage knowledge from external systems so that it can be processed and documented.

Coming soon

Informatica

Database data dictionary

Tableau


1 Comment

  1. We are often tasked with generating documentation for our ETL/ELT projects. These involve a large number of sources, transformations, and targets, each in a different system and format.
    We created this open source project to ease the creation of pipeline documentation using a source-agnostic yaml format. Since it uses a CLI, it can easily be embedded in a CI/CD process to make sure the documentation is always up to date.
    Would love to hear feedback and ideas for improvement!
