I Have to Find an Apartment

Apr 5, 2022

Long story short: I've taken a brand new position in Luxembourg and I really have to
find an apartment, in a different country, in a reasonably short time.

TL;DR: I hate apartment hunting, so I've tried to make it as interesting
as possible by creating a pipeline with different tools and languages:
scrape data from a website using Puppeteer, load the data into
a SQLite database using Go, and visualize all the data with Python and Folium.

If you want to look at the final code or follow along with me,
you can check out the project repo on
GitHub.

First of all, I sketched out what I was looking for, by priority.

After that it was time for the boring stuff: looking for an apartment
that best fits my requirements.

I have a few friends in Lux and they all suggested I look at the well-known
website AtHome.lu to get an idea of the apartments
available in Lux, so I did.

I don't like doing this, so I try to make things easier by just
looking at the photos, and if the price looks good enough I'll just bookmark
the listing so that I can look at it later for comparisons.

This quickly becomes annoying. Some agencies will post only part of the
monthly price, some of them will post the actual monthly price. Every
agency fee is different and it's not immediately visible. The map
on the website is just awful and it won't let me compare an apartment's
position with the others. Needless to say, it's pretty much impossible to pick
the right one with this messy data.

What I'd love to have is a big map with all the apartments that I like on
it, and maybe a database to query apartments and filter them by different parameters.

Let's see if we can scrape some data off of athome.lu!

After a quick look at the website I found out that, by selecting
rental apartments in the Luxembourg area, the links start with the URL path
www.athome.lu/en/rent/apartment/luxembourg, every apartment
has a different id, and the full address for an apartment looks like this:
www.athome.lu/en/rent/apartment/luxembourg/id-642398588.html.

This is a good start: it means that I can pretty much navigate to a specific
apartment's page just by knowing its id.

If I look at the HTML source of a single page, I immediately see that there
is a HUGE JSON object with a lot of data in it at the bottom of the page,

[screenshot: the __INITIAL_STATE__ object at the bottom of the page source]

let's see if we can find something interesting in it.

[screenshot: exploring __INITIAL_STATE__ in the browser console]

Cool, it looks like a big object used to set up the whole web page. If we look at
every sub-object of INITIAL_STATE we find detail, which
looks to be our lucky card.

[screenshot: the INITIAL_STATE.detail object in the console]

Great! We don't even have to scrape data from the HTML: we have all the data about the
apartment in this INITIAL_STATE.detail object!

How can I access that variable through code, though? I don't have much experience with this
and I don't write a lot of JS, but I heard that Puppeteer
is the right tool for the job. Let me try something out:

const puppeteer = require('puppeteer');
const fs = require('fs').promises;

async function scrape_json_data(browser, link) {
    const page = await browser.newPage();
    await page.goto(link);
    const obj = await page.evaluate(() => {
        var { detail } = __INITIAL_STATE__;
        return detail;
    });
    return obj;
}

(async () => {
    const browser = await puppeteer.launch({ headless: true });
    // File where I'm going to keep all my favourite apartments' links
    const data = await fs.readFile(process.env.LINKS_PATH, "utf-8");
    // Read line by line
    const links = data.split(/\r?\n/);

    var objs = [];
    for (var i in links) {
        let id_pattern = /id-(\d+)/;
        // Get the id from the last part of each link
        let id = links[i].match(id_pattern)[1];

        console.log("scraping: " + id);
        var obj = await scrape_json_data(browser, links[i]);

        if (obj.found) {
            // Strip all the superfluous data from the detail obj
            obj = clean_obj(obj);
            // Add the link to the obj, somehow it's not included in the detail obj :/
            obj.link = links[i];
            objs.push(obj);
        } else {
            objs.push({ found: false, listingId: parseInt(id) });
        }
    }

    // Save the objs to a json file
    await fs.writeFile(process.env.JSON_OUT, JSON.stringify(objs));
    await browser.close();
})();

The code above will launch a headless instance of Chrome, go to every
apartment link saved in the file at LINKS_PATH and
spit out an array with all the apartments' data into a file at JSON_OUT.

We were lucky this time: we didn't have to fight through scraping HTML,
which would probably have been the most boring part of the whole process.
The next steps will be about storing the data in a database and visualizing it,
but first let's write a justfile
(an alternative to a Makefile) that will make our life easier whenever we
have to run commands.

base     := justfile_directory()
json_out := "/tmp/res.json"
links    := base + "/properties.txt"

scrape:
    LINKS_PATH={{links}} \
    JSON_OUT={{json_out}} \
    node scraper/main.js

I can now scrape data by just typing:
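$ just scrape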

I want to save all the data to a SQLite database
so that I can easily check, query and fetch
apartments' data whenever I want and however I want.

Let's move away from JS and switch to a compiled language:
Go will fit perfectly for this, it's fast and easy to use.

The binary will parse the whole JSON file that the scraper created
and load every apartment into the apartment table in SQLite.

I didn't show it before, but here is my final, cleaned-from-unnecessary-stuff
Apartment struct, with some tag annotations to read it from JSON and load it into
SQLite via sqlx.

type Apartment struct {
	Found                  bool      `json:"found,omitempty" db:"found,omitempty"`
	ListingId              uint32    `json:"listingId,omitempty" db:"listingId,omitempty"`
	ListingAgencyReference string    `json:"listingAgencyReference,omitempty" db:"listingAgencyReference,omitempty"`
	IsSoldProperty         bool      `json:"isSoldProperty,omitempty" db:"isSoldProperty,omitempty"`
	Location               string    `json:"location,omitempty" db:"location,omitempty"`
	CityName               string    `json:"cityName,omitempty" db:"cityName,omitempty"`
	Lon                    float64   `json:"lon,omitempty" db:"lon,omitempty"`
	Lat                    float64   `json:"lat,omitempty" db:"lat,omitempty"`
	Price                  int       `json:"price,omitempty" db:"price,omitempty"`
	ChargesPrice           int       `json:"chargesPrice,omitempty" db:"chargesPrice,omitempty"`
	Caution                float32   `json:"caution,omitempty" db:"caution,omitempty"`
	AgencyFee              string    `json:"agency_fee,omitempty" db:"agency_fee,omitempty"`
	PropertySubType        string    `json:"propertySubType,omitempty" db:"propertySubType,omitempty"`
	PublisherId            int       `json:"publisher_id,omitempty" db:"publisher_id,omitempty"`
	PublisherRemoteVisit   bool      `json:"publisher_remote_visit,omitempty" db:"publisher_remote_visit,omitempty"`
	PublisherPhone         string    `json:"publisher_phone,omitempty" db:"publisher_phone,omitempty"`
	PublisherName          string    `json:"publisher_name,omitempty" db:"publisher_name,omitempty"`
	PublisherAthomeId      string    `json:"publisher_athome_id,omitempty" db:"publisher_athome_id,omitempty"`
	PropertySurface        float64   `json:"propertySurface,omitempty" db:"propertySurface,omitempty"`
	BuildingYear           string    `json:"buildingYear,omitempty" db:"buildingYear,omitempty"`
	FloorNumber            string    `json:"floorNumber,omitempty" db:"floorNumber,omitempty"`
	BathroomsCount         int       `json:"bathroomsCount,omitempty" db:"bathroomsCount,omitempty"`
	BedroomsCount          int       `json:"bedroomsCount,omitempty" db:"bedroomsCount,omitempty"`
	BalconiesCount         int       `json:"balconiesCount,omitempty" db:"balconiesCount,omitempty"`
	CarparkCount           int       `json:"carparkCount,omitempty" db:"carparkCount,omitempty"`
	GaragesCount           int       `json:"garagesCount,omitempty" db:"garagesCount,omitempty"`
	HasLivingRoom          bool      `json:"hasLivingRoom,omitempty" db:"hasLivingRoom,omitempty"`
	HasKitchen             bool      `json:"hasKitchen,omitempty" db:"hasKitchen,omitempty"`
	Availability           string    `json:"availability,omitempty" db:"availability,omitempty"`
	Media                  *[]string `json:"media,omitempty" db:"media,omitempty"`
	Description            string    `json:"description,omitempty" db:"description,omitempty"`
	Link                   string    `json:"link,omitempty" db:"link,omitempty"`
	CreatedAt              string    `json:"createdAt,omitempty" db:"createdAt,omitempty"`
	UpdatedAt              string    `json:"updatedAt,omitempty" db:"updatedAt,omitempty"`
}

I might change my mind later down the road about the data that I want to keep
in every Apartment struct, so I might have to make changes to the database structure,
and therefore to the insert and update queries too. To make this a little more
flexible I will use a yaml file to store the database migration and the insert/update
queries.

migrations: |
  CREATE TABLE IF NOT EXISTS apartment(
      found BOOL,
      listingId INTEGER PRIMARY KEY,
      ...
      description TEXT,
      link TEXT,
      createdAt TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
      updatedAt TIMESTAMP DEFAULT CURRENT_TIMESTAMP
  );


insertQuery: |
    INSERT INTO apartment(found,listingId,listingAgencyReference,isSoldProperty,location,cityName,
                          lon,lat,price,chargesPrice,caution,agency_fee,propertySubType,publisher_id,
                          publisher_remote_visit,publisher_phone,publisher_name,publisher_athome_id,
                          propertySurface,buildingYear,floorNumber,bathroomsCount,bedroomsCount,balconiesCount,
                          garagesCount,carparkCount,hasLivingRoom,hasKitchen,availability,media,description,link)
    VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)


updateQuery: |
    UPDATE apartment
    SET found=?, listingId=?, listingAgencyReference=?, isSoldProperty=?, location=?, cityName=?, lon=?, lat=?, price=?,
        chargesPrice=?, caution=?, agency_fee=?, propertySubType=?, publisher_id=?, publisher_remote_visit=?, publisher_phone=?,
        publisher_name=?, publisher_athome_id=?, propertySurface=?, buildingYear=?, floorNumber=?, bathroomsCount=?,
        bedroomsCount=?, balconiesCount=?, garagesCount=?, carparkCount=?, hasLivingRoom=?, hasKitchen=?,
        availability=?, media=?, description=?, link=?, updatedAt=CURRENT_TIMESTAMP
    WHERE listingId=?

After setting up these basic pieces, and with a little more code,
I can compile the program and run it so that it loads the previous
JSON file into my SQLite apartment table.
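I won't paste the whole loader here (it's in the repo), but as a rough sketch of the wiring, assuming a hypothetical Config struct that mirrors the yaml file above and that the Apartment struct lives in the same package, it looks something like this:

package main

import (
	"encoding/json"
	"log"
	"os"

	"github.com/jmoiron/sqlx"
	_ "github.com/mattn/go-sqlite3"
	"gopkg.in/yaml.v3"
)

// Config mirrors the yaml file above (hypothetical names)
type Config struct {
	Migrations  string `yaml:"migrations"`
	InsertQuery string `yaml:"insertQuery"`
	UpdateQuery string `yaml:"updateQuery"`
}

func main() {
	// Read the migration and query strings from the yaml config
	rawCfg, err := os.ReadFile(os.Getenv("CONFIG_PATH"))
	if err != nil {
		log.Fatal(err)
	}
	var cfg Config
	if err := yaml.Unmarshal(rawCfg, &cfg); err != nil {
		log.Fatal(err)
	}

	// Open the database and apply the migrations on startup
	db := sqlx.MustConnect("sqlite3", os.Getenv("DB_PATH"))
	db.MustExec(cfg.Migrations)

	// Parse the array of apartments that the scraper produced
	rawJSON, err := os.ReadFile(os.Getenv("JSON_OUT"))
	if err != nil {
		log.Fatal(err)
	}
	var apartments []Apartment
	if err := json.Unmarshal(rawJSON, &apartments); err != nil {
		log.Fatal(err)
	}
	log.Printf("parsed %d apartments", len(apartments))

	// From here, each apartment's fields get bound to cfg.InsertQuery
	// (or cfg.UpdateQuery when the listingId already exists) via db.Exec.
}

The nice part is that changing a query or adding a migration only touches the yaml file, not the Go code.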

Let's add some more commands to the justfile that we
created earlier.

db_path  := base + "/db.sqlite"

gobuild:
    cd {{base}}/loader; go build cmd/main.go

load: gobuild
    CONFIG_PATH={{base}}/loader/config.yaml \
    JSON_OUT={{json_out}} \
    DB_PATH={{db_path}} \
    {{base}}/loader/main

fetch: scrape load

Let's load the data into the database:

$ just load
> OR
$ just fetch
> which will first scrape data and then load it into the database
> justfiles are cool!

Just to get some specs, this runs fast. Have a look:

$ cat home.txt | wc
  65      66    4469
$ time just load
just load  0.38s user 0.52s system 220% cpu 0.408 total

I now have all the data that I scraped in my nice and super fast
database, ready to be queried with the craziest query that comes
to mind. I'll pick some.

We're at a good point right now: I have a lot of parameters
with which I can query the apartments that I like. I can sort them by
non-decreasing price or by location, and if I add some more complex Haversine
formulae I could even sort them by distance from the city centre or any
other map coordinates.
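Just to sketch that last idea (a hypothetical example, not something the pipeline needs): I could pull the coordinates into Python, which I'll be using for the maps below anyway, and sort the rows by Haversine distance from the centre.

import math
import os
import sqlite3

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance in km between two (lat, lon) points
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(a))

# Luxembourg City centre (the same coordinates I'll centre the map on later)
centre = (49.611622, 6.131935)

db = sqlite3.connect(os.environ["DB_PATH"])
rows = db.execute(
    "SELECT listingId, price, lat, lon FROM apartment WHERE found = TRUE"
).fetchall()

# Closest apartments first
rows.sort(key=lambda r: haversine_km(r[2], r[3], centre[0], centre[1]))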

I won't stop here though. I have some interesting little vars in
every apartment's data: lat, lon. I don't want to waste geo data!
It's nice and fun to just look at tabular data, but I think I can
get a much better idea of the locations just by plotting stuff on a map.

I want to code something fast with the least amount of code, so I'll
go with Python and a Jupyter notebook, along with
Folium, which is
a library that generates Leaflet maps.

Let's set up the map with my point of interest:

import folium

lux_coords = [49.611622, 6.131935]
map_ = folium.Map(location=lux_coords, zoom_start=10)

interesting_coords = [49.630033, 6.168936]
folium.Marker(location=interesting_coords, popup="Point of interest", icon=folium.Icon(color='red')).add_to(map_)

folium.Circle(location=interesting_coords, radius=5000, color='green', opacity=0.5, weight=2).add_to(map_)
folium.Circle(location=interesting_coords, radius=10000, color='yellow', opacity=0.5, weight=2).add_to(map_)
folium.Circle(location=interesting_coords, radius=15000, color='orange', opacity=0.5, weight=2).add_to(map_)
folium.Circle(location=interesting_coords, radius=20000, color='red', opacity=0.5, weight=2).add_to(map_)

This will show a map centered on Lux, with a cool red pin on my point of interest;
to get a better idea of the distances, I also added some circles with radii of
5km, 10km, 15km and 20km. This is extremely useful because, just by looking
at the map, I can immediately discard the apartments that are too far from my point of interest.

[map: the point of interest with the 5/10/15/20km circles]

Before going crazy with SQL I want to add my scraped apartments
to the map, and for the sake of simplicity I will query all of them here:

import os
import sqlite3


def getApartments(db):
    cur = db.cursor()
    cur.execute(
        """
        SELECT * FROM apartment
        WHERE
            found=TRUE
        """
    )

    # Apartment is a small wrapper class around a row (defined in the repo)
    return [Apartment(row) for row in cur.fetchall()]


def addApartment(map_, a):
    popup = folium.Popup(a._popup_(), max_width=450)
    folium.Marker(
        location=[a.lat, a.lon],
        popup=popup,
        # I will use fontawesome to change the pin icon
        icon=folium.Icon(color=a._get_color(), icon=a._get_icon(), prefix="fa")
    ).add_to(map_)


db = sqlite3.connect(os.environ["DB_PATH"])
apartments = getApartments(db)
for a in apartments:
    addApartment(map_, a)
map_

[map: the point of interest and all the scraped apartments]

And here we have it! Definitely a considerably better experience
than going back and forth on the website to place
all the apartments on a map one by one, right?

In the code above you can see that I've used a custom popup for every
apartment. With Folium we can use HTML to customize the pin's popup
with the most important data I want to see (i.e. monthly total price, initial fee,
caution, etc.).

def _popup_(self):
    # The popup is plain HTML; the original markup was lost when this post
    # was extracted, so the tags below are a minimal reconstruction of the
    # same content (the Gallery link target didn't survive either, so both
    # anchors point at the listing page here).
    return f"""
    <b>Info</b><br>
    ID: {self.listingId}<br>
    Monthly Price: {self.price}<br>
    Monthly Charges: {self.chargesPrice}<br>
    Caution: {self.caution}<br>
    Agency Fee: {self.agencyFee}<br>
    <br>
    <b>Total</b><br>
    Monthly: {self.price + self.chargesPrice}<br>
    Initial: {self.caution + self.agencyFee}<br>
    <br>
    <a href="{self.link}">Page</a> | <a href="{self.link}">Gallery</a>
    """

[screenshot: the custom HTML popup]

That's just what I needed: I can now see on the map which apartments
in Lux are the best located, and immediately get to the data that I'm
most interested in!

Why would I store the data in a database if I don't use SQL at all?
Let's say that I have a base budget of 1000€ and I want to show only
the apartments on which I'd spend an incremental amount of at most 200€;
I can simply change the SQL query to

SELECT *
FROM apartment
WHERE
    found = TRUE AND
    listingId IN (
        SELECT listingId
        FROM apartment
        WHERE
            found = TRUE AND
            price + chargesPrice <= 1000 + 200
    )

Phewww, if you're still here reading all this, you deserve a bonus point.

Imagine I saw a really cool apartment that looks like a really good deal, but…
