I Have to Find an Apartment



My mentor says it's beautiful here.

Apr 5, 2022

Long story short: I've taken a new position in Luxembourg and I have to
find an apartment, in a different country, in a reasonably short time.

TL;DR: I hate apartment hunting, so I've tried to make it as interesting
as possible by creating a pipeline with different tools and languages:
scrape data from a website using Puppeteer, load the data into
a SQLite database using Go, and visualize it all with Python and Folium.

If you want to look at the final code or follow along with me,
you can check out the project repo.

I first sketched out what I was looking for, by priority.

After that it was time for the boring stuff: looking for an apartment
that best fits my requirements.

I have a couple of friends in Lux and they all suggested I look at the
website AtHome.lu to get an idea of the apartments
available in Lux, so I did.

I don't like doing this, so I try to make things easier by just
looking at the photos, and if the price looks good enough I'll just bookmark
the listing so that I can look at it later for comparisons.

This quickly becomes annoying to do. Some agencies will publish part of the
monthly price, some of them will publish the actual monthly price. Every
agency fee is different and it's not immediately visible. The map
on the website is just awful and it won't let me compare positions between
apartments. Needless to say, it's pretty much impossible to pick
the right one with this messy data.

What I'd love to have is a big map with all the apartments that I like on
it, and maybe a database to query apartments, filtering by different parameters.

Let's see if we can scrape some data off of athome.lu!

After giving the website a quick look, I found out that by selecting
apartments for rent in the Luxembourg area, links start with this URL path:
www.athome.lu/en/rent/apartment/luxembourg, every apartment
has a unique id, and the full address for an apartment looks like this:

That's a good start: it means I can pretty much navigate to a specific apartment's
page just by knowing its id.

If I look at the HTML source of a single page, I immediately see that there
is a HUGE JSON object with a lot of data in it at the bottom of the page;
let's see if we can find something interesting in it.

[screenshot: the __INITIAL_STATE__ object in the browser console]

Cool, it looks like a big object used to set up the whole page. If we look at
every sub-object of INITIAL_STATE we find detail, which
looks to be our lucky card.

[screenshot: the detail object in the browser console]

Great! We don't even need to scrape data from the HTML, we have all the data of the
apartment in this INITIAL_STATE.detail object!

How can I access that variable through code, though? I don't have a lot of experience with it
and I don't write a lot of JS, but I heard that Puppeteer
is the right tool for the job. Let me try something out.

const puppeteer = require('puppeteer');
const fs = require('fs').promises;

async function scrape_json_data(browser, link) {
    const page = await browser.newPage();
    await page.goto(link);
    const obj = await page.evaluate(() => {
        // __INITIAL_STATE__ is a global on the listing page
        var { detail } = __INITIAL_STATE__;
        return detail;
    });
    return obj;
}

(async () => {
    const browser = await puppeteer.launch({ headless: true });
    // File where I keep all my favourite apartments' links
    const data = await fs.readFile(process.env.LINKS_PATH, "utf-8");
    // Read line by line
    const links = data.split(/\r?\n/);

    var objs = [];
    for (var i in links) {
        let id_pattern = /id-(\d+)/;
        // Get the id from the last part of each link
        let id = links[i].match(id_pattern)[1];

        console.log("scraping: " + id);
        var obj = await scrape_json_data(browser, links[i]);

        if (obj.found) {
            // Strip all the superfluous data from the detail obj
            obj = clean_obj(obj);
            // Add the link to the obj, somehow it's not included in the detail obj :/
            obj.link = links[i];
            objs.push(obj);
        } else {
            objs.push({ found: false, listingId: parseInt(id) });
        }
    }

    // Save obj data to a json file
    await fs.writeFile(process.env.JSON_OUT, JSON.stringify(objs));
    await browser.close();
})();

The code above will launch a headless instance of Chrome, go to every
apartment link saved in the file at LINKS_PATH and
spit an array with all the apartments' data into a file at JSON_OUT.

We were lucky this time: we didn't have to fight through scraping HTML,
which would probably have been the most boring part of the whole process.
The next steps will be about storing the data in a database and visualizing it,
but first let's write a justfile
(an alternative to a Makefile) that will make our life easier when we
need to run commands.

base     := justfile_directory()
json_out := "/tmp/res.json"
links    := base + "/properties.txt"

scrape:
    node scraper/main.js

I can now scrape data by just typing

$ just scrape

I want to save all the data to a SQLite database
so that I can easily check, query and retrieve
apartments' data whenever and however I like.

Let's move away from JS and switch to a compiled language;
Go will fit perfectly for this, it's fast and easy to use.

The binary will parse the whole JSON file that the scraper created
and load every apartment into the apartment table in SQLite.
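The actual loader is Go, but its logic is simple enough to sketch in a few lines of Python. This is only an illustration: the in-memory database, the trimmed-down schema and the inline JSON stand in for the real db.sqlite, apartment table and /tmp/res.json, and I've used SQLite's upsert where the post keeps separate insert/update queries.

```python
import json
import sqlite3

# Stand-in for db.sqlite, with only a few of the real columns
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE apartment(listingId INTEGER PRIMARY KEY, found BOOL, price INT)")

# Stand-in for the scraper's JSON dump at /tmp/res.json
apartments = json.loads('[{"listingId": 1, "found": true, "price": 950}]')
for a in apartments:
    # Insert each apartment, updating the row if the listing id already exists
    db.execute(
        "INSERT INTO apartment(listingId, found, price) VALUES (?,?,?) "
        "ON CONFLICT(listingId) DO UPDATE SET found=excluded.found, price=excluded.price",
        (a["listingId"], a["found"], a["price"]),
    )
db.commit()
```

Re-running the loader on a fresh scrape then refreshes existing rows instead of failing on the primary key.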

I didn't show it before, but here is my final, cleaned-from-unnecessary-stuff
Apartment struct, with some tag annotations to read from JSON and load into
SQLite via sqlx.

type Apartment struct {
	Found                  bool      `json:"found,omitempty" db:"found,omitempty"`
	ListingId              uint32    `json:"listingId,omitempty" db:"listingId,omitempty"`
	ListingAgencyReference string    `json:"listingAgencyReference,omitempty" db:"listingAgencyReference,omitempty"`
	IsSoldProperty         bool      `json:"isSoldProperty,omitempty" db:"isSoldProperty,omitempty"`
	Region                 string    `json:"region,omitempty" db:"region,omitempty"`
	CityName               string    `json:"cityName,omitempty" db:"cityName,omitempty"`
	Lon                    float64   `json:"lon,omitempty" db:"lon,omitempty"`
	Lat                    float64   `json:"lat,omitempty" db:"lat,omitempty"`
	Price                  int       `json:"price,omitempty" db:"price,omitempty"`
	ChargesPrice           int       `json:"chargesPrice,omitempty" db:"chargesPrice,omitempty"`
	Caution                float32   `json:"caution,omitempty" db:"caution,omitempty"`
	AgencyFee              string    `json:"agency_fee,omitempty" db:"agency_fee,omitempty"`
	PropertySubType        string    `json:"propertySubType,omitempty" db:"propertySubType,omitempty"`
	PublisherId            int       `json:"publisher_id,omitempty" db:"publisher_id,omitempty"`
	PublisherRemoteVisit   bool      `json:"publisher_remote_visit,omitempty" db:"publisher_remote_visit,omitempty"`
	PublisherPhone         string    `json:"publisher_phone,omitempty" db:"publisher_phone,omitempty"`
	PublisherName          string    `json:"publisher_name,omitempty" db:"publisher_name,omitempty"`
	PublisherAthomeId      string    `json:"publisher_athome_id,omitempty" db:"publisher_athome_id,omitempty"`
	PropertySurface        float64   `json:"propertySurface,omitempty" db:"propertySurface,omitempty"`
	BuildingYear           string    `json:"buildingYear,omitempty" db:"buildingYear,omitempty"`
	FloorNumber            string    `json:"floorNumber,omitempty" db:"floorNumber,omitempty"`
	BathroomsCount         int       `json:"bathroomsCount,omitempty" db:"bathroomsCount,omitempty"`
	BedroomsCount          int       `json:"bedroomsCount,omitempty" db:"bedroomsCount,omitempty"`
	BalconiesCount         int       `json:"balconiesCount,omitempty" db:"balconiesCount,omitempty"`
	CarparkCount           int       `json:"carparkCount,omitempty" db:"carparkCount,omitempty"`
	GaragesCount           int       `json:"garagesCount,omitempty" db:"garagesCount,omitempty"`
	HasLivingRoom          bool      `json:"hasLivingRoom,omitempty" db:"hasLivingRoom,omitempty"`
	HasKitchen             bool      `json:"hasKitchen,omitempty" db:"hasKitchen,omitempty"`
	Availability           string    `json:"availability,omitempty" db:"availability,omitempty"`
	Media                  *[]string `json:"media,omitempty" db:"media,omitempty"`
	Description            string    `json:"description,omitempty" db:"description,omitempty"`
	Link                   string    `json:"link,omitempty" db:"link,omitempty"`
	CreatedAt              string    `json:"createdAt,omitempty" db:"createdAt,omitempty"`
	UpdatedAt              string    `json:"updatedAt,omitempty" db:"updatedAt,omitempty"`
}
I might change my mind later down the road about the data that I want to keep
in every Apartment struct, so I might need to make changes to the database structure,
and therefore to the queries that insert into and update the database too. To make this a little more
flexible I'll use a YAML file to store any database migration and the insert/update
queries for the database.

migrations: |
    CREATE TABLE IF NOT EXISTS apartment(
      found BOOL,
      listingId INTEGER PRIMARY KEY,
      -- ...the other columns from the Apartment struct...
      description TEXT,
      link TEXT
    );

insertQuery: |
    INSERT INTO apartment(found,listingId,listingAgencyReference,isSoldProperty,region,cityName,
        lon,lat,price,chargesPrice,caution,agency_fee,propertySubType,publisher_id,publisher_remote_visit,
        publisher_phone,publisher_name,publisher_athome_id,propertySurface,buildingYear,floorNumber,
        bathroomsCount,bedroomsCount,balconiesCount,garagesCount,carparkCount,hasLivingRoom,hasKitchen,
        availability,media,description,link)
    VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)

updateQuery: |
    UPDATE apartment
    SET found=?, listingId=?, listingAgencyReference=?, isSoldProperty=?, region=?, cityName=?, lon=?, lat=?, price=?,
        chargesPrice=?, caution=?, agency_fee=?, propertySubType=?, publisher_id=?, publisher_remote_visit=?, publisher_phone=?,
        publisher_name=?, publisher_athome_id=?, propertySurface=?, buildingYear=?, floorNumber=?, bathroomsCount=?,
        bedroomsCount=?, balconiesCount=?, garagesCount=?, carparkCount=?, hasLivingRoom=?, hasKitchen=?,
        availability=?, media=?, description=?, link=?, updatedAt=CURRENT_TIMESTAMP
    WHERE listingId=?

After setting up these basic parts, and with a little more code,
I can compile the program and run it so that it loads the previous
JSON file into my SQLite apartment table.

Let's add some more commands to the justfile we
created before.

db_path  := base + "/db.sqlite"

gobuild:
    cd {{base}}/loader; go build cmd/main.go

load: gobuild
    {{base}}/loader/main

fetch: scrape load

Let's load the data into the database

$ just load
> OR
$ just fetch
> which will first scrape the data and then load it into the database
> justfiles are cool!

Just to get some specs, this runs fast. Have a look

$ cat home.txt | wc
  65      66    4469
$ time just load
just load  0.38s user 0.52s system 220% cpu 0.408 total

I now have all the data that I scraped in my nice and super fast
database, ready to be queried with the craziest query that comes
to mind. I'll pick a few.

We're at a good point now: I have a lot of parameters
with which to query the apartments that I like. I can sort them by
non-decreasing price or by location, and if I add some more complex Haversine
formulae I could also sort them by distance from the city centre or any
other map coordinates.
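For reference, the Haversine distance mentioned above fits in a few lines of Python; the coordinates below are the city-centre and point-of-interest values used later in the post.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius, km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Distance from the Luxembourg City centre to my point of interest: about 3.4 km
distance = haversine_km(49.611622, 6.131935, 49.630033, 6.168936)
```

SQLite has no trigonometric functions built in, so sorting by this distance would happen either in application code or via a registered user-defined function.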

I won't stop here, though. I have some interesting little vars in
every apartment's data: lat, lon. I don't want to waste geo data!
It's nice and fun to just look at tabular data, but I think I could
get a much better idea of the locations just by plotting stuff on a map.

I want to code something quickly with the least amount of code, so I'll
go with Python and a Jupyter notebook, along with
Folium, which is
a library that generates Leaflet maps.

Let's set up the map with my point of interest

import folium

lux_coords = [49.611622, 6.131935]
map_ = folium.Map(location=lux_coords, zoom_start=10)

interesting_coords = [49.630033, 6.168936]
folium.Marker(location=interesting_coords, popup="Point of interest", icon=folium.Icon(color='red')).add_to(map_)

folium.Circle(location=interesting_coords, radius=5000, color='green', opacity=0.5, weight=2).add_to(map_)
folium.Circle(location=interesting_coords, radius=10000, color='yellow', opacity=0.5, weight=2).add_to(map_)
folium.Circle(location=interesting_coords, radius=15000, color='orange', opacity=0.5, weight=2).add_to(map_)
folium.Circle(location=interesting_coords, radius=20000, color='red', opacity=0.5, weight=2).add_to(map_)

This will show a map centered on Lux, with a cool red pin on my point of interest,
and to get a better idea of the distances I also added some circles with radii of
5km, 10km, 15km and 20km. This is extremely useful because, just by looking at the
map, I can immediately discard the apartments that are too far from my point of interest.


Before going crazy with SQL, I want to add my scraped apartments
to the map, and for the sake of simplicity I'll query all of them here

import os
import sqlite3

def getApartments(db):
    cur = db.cursor()
    cur.execute("""
        SELECT * FROM apartment
        WHERE found=TRUE
    """)
    return [Apartment(row) for row in cur.fetchall()]

def addApartment(map_, a):
    popup = folium.Popup(a._popup_(), max_width=450)
    folium.Marker(
        location=[a.lat, a.lon],
        popup=popup,
        # I use fontawesome to change the pin icon
        icon=folium.Icon(color=a._get_color(), icon=a._get_icon(), prefix="fa")
    ).add_to(map_)

db = sqlite3.connect(os.environ["DB_PATH"])
apartments = getApartments(db)
for a in apartments:
    addApartment(map_, a)

[screenshot: map with the point of interest and the apartment markers]

And here we have it! Definitely a considerably better experience
than going back and forth on the website and placing
all the apartments on a map one by one, right?

In the code above you can see that I've used a custom popup for every
apartment. With Folium we can use HTML to customize the pin's popup
with the most important data I want to see (i.e. monthly total price, initial fee,
caution, etc.)

def _popup_(self):
    # HTML popup with the monthly total (price + charges)
    # and the one-off costs (caution + agency fee)
    return f"""
    <b>Monthly Price:</b> {self.price + self.chargesPrice}<br>
    <b>Caution + Agency Fee:</b> {self.caution + self.agencyFee}<br>
    """
That's just what I needed. I can now see on the map which are the best
located apartments in Lux and immediately get to the data that I'm
most interested in!

Why would I store the data in a database if I don't use SQL at all?
Let's say that I have a base budget of 1000€ and I want to show only
the apartments on which I'd spend an incremental amount of at most 200€;
I can simply change the SQL query to

SELECT * FROM apartment
WHERE
    found = TRUE AND
    listingId IN (
        SELECT listingId
        FROM apartment
        WHERE
            found = TRUE AND
            price + chargesPrice <= 1000 + 200
    )
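As a sanity check, here is a sketch of the same budget filter driven from Python's sqlite3 with bound parameters, so the 1000€ base and 200€ increment aren't hard-coded in the query. The in-memory table and its rows are made-up sample data, not the real scrape.

```python
import sqlite3

# Made-up stand-in for the real apartment table
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE apartment(listingId INTEGER PRIMARY KEY, found BOOL, price INT, chargesPrice INT)")
db.executemany(
    "INSERT INTO apartment VALUES (?,?,?,?)",
    [(1, True, 1000, 150), (2, True, 1100, 200), (3, True, 950, 100), (4, False, 800, 100)],
)

base_budget, increment = 1000, 200
# Keep only listings still online whose total monthly cost fits the budget
rows = db.execute(
    """SELECT listingId, price + chargesPrice AS total
       FROM apartment
       WHERE found = TRUE AND price + chargesPrice <= ? + ?
       ORDER BY total""",
    (base_budget, increment),
).fetchall()
```

Listing 2 is over budget and listing 4 is no longer online, so only listings 3 and 1 survive the filter.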

Phewww, if you're still here reading all this, you deserve a bonus point.

Imagine I spotted a really cool apartment that looks like a really good deal, but…
