In recent years, the web has become very hostile to the lowly web scraper. It’s a natural result of web technologies moving away from statically rendered pages towards dynamic apps built with tools like React and CSS-in-JS. Developers no longer need to label their data with class names or ids – it’s only a courtesy to screen readers now.
There’s also been a concerted effort by large companies to protect their public data. Facebook, for example, employs a team of over 100 people to make sure it is as difficult as possible for any data to escape the black hole. Granted, some of these large companies do offer APIs for their data, but access is rarely unrestricted. You’re usually at the whim of their app review process, or granted access only to a partial view of the data, even though that same data would be public if you did a Google search and clicked through to their website manually.
How HTML looks nowadays
This can be frustrating if you’re like me – somebody who wanted to build a small, local, non-profit app that uses data hosted on a closed platform. The data is public but completely inaccessible to machines because of aggressive anti-web-scraping measures. That gave me two options – input the data manually or play the web-scraping game. Of course, I chose the latter.
puppeteer-heap-snapshot is born.
After a couple of attempts at extracting the data using the usual CSS selector method of web-scraping – that is, fetching the raw HTML or booting up a browser via something like Puppeteer and trying to pick the data out of the HTML structure – I was close to admitting defeat. That was until I had an epiphany: the data is already inside the web page.
Chrome Dev Tools’ Memory Profiler
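The snapshot that the Memory profiler takes can also be requested programmatically over the Chrome DevTools Protocol, using the `HeapProfiler` domain. What follows is a minimal sketch of that idea, not the code puppeteer-heap-snapshot itself uses. The names `captureHeapSnapshotJson` and `makeFakeSession` are hypothetical, and the fake session exists only so the example runs without a browser.

```javascript
// Sketch: collect a heap snapshot through a CDP-session-like object.
// Assumption: `session` exposes `on(event, handler)` and `send(method, params)`,
// which is the shape of Puppeteer's CDPSession.
async function captureHeapSnapshotJson(session) {
  const chunks = [];
  // The snapshot arrives as a series of string chunks of one large JSON document.
  session.on("HeapProfiler.addHeapSnapshotChunk", ({ chunk }) => chunks.push(chunk));
  await session.send("HeapProfiler.enable");
  // In practice this call resolves after the chunks have been emitted.
  await session.send("HeapProfiler.takeHeapSnapshot", { reportProgress: false });
  await session.send("HeapProfiler.disable");
  return JSON.parse(chunks.join(""));
}

// Hypothetical stand-in for a real session: replays `payload` as two chunks.
function makeFakeSession(payload) {
  const handlers = {};
  return {
    on(event, handler) {
      handlers[event] = handler;
    },
    async send(method) {
      if (method === "HeapProfiler.takeHeapSnapshot") {
        const mid = Math.floor(payload.length / 2);
        handlers["HeapProfiler.addHeapSnapshotChunk"]({ chunk: payload.slice(0, mid) });
        handlers["HeapProfiler.addHeapSnapshotChunk"]({ chunk: payload.slice(mid) });
      }
    },
  };
}

captureHeapSnapshotJson(makeFakeSession('{"nodes":[],"edges":[],"strings":[]}'))
  .then((heap) => console.log(Object.keys(heap)));
```

With a real browser, the session would come from something like `await page.target().createCDPSession()` in Puppeteer. Real snapshots can run to tens of megabytes, so a production version would probably stream the chunks rather than join them in memory.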
puppeteer-heap-snapshot is a Node.js module that, given a Puppeteer browser page, can capture and parse a heap snapshot and deserialize objects that contain a set of properties. It comes with a nifty CLI tool too so we can quickly prototype scrapers from our terminal.
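To give a feel for what "objects that contain a set of properties" means at the snapshot level, here is a simplified sketch. It assumes the V8-style layout, where `nodes` and `edges` are flat integer arrays described by `snapshot.meta` and names live in a shared `strings` table. The function `findNodesWithProperties` and the toy snapshot are hypothetical, written for illustration, and the sketch only locates matching nodes. It does not rebuild their values the way the real module does.

```javascript
// Sketch: find heap nodes that own a "property" edge for every requested name.
function findNodesWithProperties(heap, wanted) {
  const { node_fields, edge_fields, edge_types } = heap.snapshot.meta;
  const nodeSize = node_fields.length;
  const edgeSize = edge_fields.length;
  const edgeCountOffset = node_fields.indexOf("edge_count");
  const nodeNameOffset = node_fields.indexOf("name");
  const edgeTypeOffset = edge_fields.indexOf("type");
  const edgeNameOffset = edge_fields.indexOf("name_or_index");
  const propertyType = edge_types[0].indexOf("property");

  const matches = [];
  let edgeCursor = 0; // edges are stored in the same order as their owning nodes
  for (let n = 0; n < heap.nodes.length; n += nodeSize) {
    const edgeCount = heap.nodes[n + edgeCountOffset];
    const names = new Set();
    for (let e = 0; e < edgeCount; e++) {
      const base = edgeCursor + e * edgeSize;
      if (heap.edges[base + edgeTypeOffset] === propertyType) {
        names.add(heap.strings[heap.edges[base + edgeNameOffset]]);
      }
    }
    edgeCursor += edgeCount * edgeSize;
    if (wanted.every((p) => names.has(p))) {
      matches.push({
        nodeIndex: n,
        name: heap.strings[heap.nodes[n + nodeNameOffset]],
        properties: [...names],
      });
    }
  }
  return matches;
}

// Tiny hand-built snapshot (made-up data, trimmed field lists).
const toyHeap = {
  snapshot: {
    meta: {
      node_fields: ["type", "name", "id", "self_size", "edge_count"],
      edge_fields: ["type", "name_or_index", "to_node"],
      edge_types: [["context", "element", "property"]],
    },
  },
  strings: ["Object", "channelId", "viewCount", "title", "Other"],
  nodes: [
    0, 0, 1, 16, 2, // node "Object" with two edges
    0, 4, 3, 16, 1, // node "Other" with one edge
  ],
  edges: [
    2, 1, 5, // property "channelId"
    2, 2, 5, // property "viewCount"
    2, 3, 0, // property "title"
  ],
};

console.log(findNodesWithProperties(toyHeap, ["channelId", "viewCount"]));
```

The appeal of this approach is that it keys on property names in memory rather than on markup, so it keeps working when class names or DOM structure change, as long as the app keeps the same data shape.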
For example, let’s fetch the metadata from the above video:
$ puppeteer-heap-snapshot query \
    --properties channelId,viewCount,keywords --no-headless