Scraping real estate for fun

Published

November 6, 2020

Here’s a fun weekend project: scrape the real estate classifieds of the website of your choice, and do some analytics on the data. I did just that last weekend, using the Scrapy Python library for web scraping, which I let loose on one of the major real estate classifieds websites in Switzerland (I can’t tell you which one; I’m not sure they would love me for it).

After about 10 minutes I had the data for 12’124 apartments or houses for sale across Switzerland, with room count, area, price, city, and canton.
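To give an idea of what turning scraped text into those five fields can look like, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the raw strings, the dictionary keys, and the `parse_number` and `parse_listing` helpers are hypothetical, and the real site’s markup and my actual spider are not shown here. In a Scrapy spider, a function like this would typically be called from the parse callback before yielding each item.

```python
import re
from typing import Optional


def parse_number(text: str) -> Optional[float]:
    """Return the first number found in `text`, or None if there is none.

    Swiss-style thousands separators (straight or curly apostrophes,
    as in "1'250'000") are stripped before matching.
    """
    cleaned = text.replace("'", "").replace("\u2019", "")
    match = re.search(r"\d+(?:\.\d+)?", cleaned)
    return float(match.group()) if match else None


def parse_listing(raw: dict) -> dict:
    """Normalise one scraped listing (hypothetical keys) into typed fields."""
    return {
        "rooms": parse_number(raw.get("rooms", "")),
        "area_m2": parse_number(raw.get("area", "")),
        "price_chf": parse_number(raw.get("price", "")),
        "city": raw.get("city", "").strip() or None,
        "canton": raw.get("canton", "").strip().upper() or None,
    }


# Made-up example listing, not taken from the scraped dataset.
example = {
    "rooms": "4.5 rooms",
    "area": "120 m²",
    "price": "CHF 1'250'000",
    "city": " Lausanne ",
    "canton": "vd",
}
print(parse_listing(example))
```

Fields that cannot be parsed (a price shown as “on request”, say) come back as `None` rather than raising, so a single odd listing does not stop the whole crawl.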

I’ve imported the data into R, and log-transformed the room count, area, and price because of extreme skewness. Here’s the resulting scatterplot matrix, obtained with ggpairs():
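As an aside on why the log transform helps with skewed variables like price, here is a small Python sketch (the analysis itself was done in R). It uses synthetic, made-up prices drawn from a lognormal distribution, not the scraped data, and the distribution parameters and the `skewness` helper are assumptions chosen only for illustration. The natural log is used here; the choice of base does not affect the skewness.

```python
import math
import random
import statistics


def skewness(values):
    """Skewness as the mean of cubed z-scores (population form)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


random.seed(0)  # fixed seed so the sketch is reproducible

# Synthetic right-skewed "prices": most values are moderate, a few are huge.
prices = [random.lognormvariate(13.5, 0.6) for _ in range(5000)]

# Taking logs pulls in the long right tail.
log_prices = [math.log(p) for p in prices]

print(f"skewness of raw prices: {skewness(prices):.2f}")
print(f"skewness of log prices: {skewness(log_prices):.2f}")
```

The raw values should show a clearly positive skewness, while the logged values should come out close to zero, which makes scatterplots much easier to read.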

There are a number of interesting features, even from this raw, unclean dataset:

Before we go any further, we’ll obviously have to clean up these faulty data points. There don’t seem to be many of them, so I’ll do that manually, and write a follow-up post if I find anything interesting.