All Machine Learning models require a large amount of data, both for training and for testing.
Getting the data, even before dealing with the ML stuff, can be hard and tedious, depending on the source you have and the accessibility of the data itself.
Crawling a website or a blog is a convenient way to get the data you need. With relatively little effort, you can generate a CSV file containing all the data and analyze it more easily with state-of-the-art tools.
Analyzing your source
In order to crawl a website, you need to be familiar with its structure; that is, you need to know how the information is laid out inside its HTML elements.
In my example, I will crawl the well-known real estate website https://www.zillow.com, which contains tons of housing information from across the United States.
My goal is to scrape Florida housing prices and other useful information.
Finding your Data Items inside the DOM
After playing with the website for a bit, I found the following URL to be the base URL I need for the crawler:
https://www.zillow.com/homes/for_sale/FL/14_rid/globalrelevanceex_sort/32.971803,-76.311036,22.268764,-91.31836_rect/5_zm/1_p/
At the time of writing this post, there are 271,821 houses for sale in Florida. Since not all of the search results fit on a single page, we need to iterate over the result pages from our Scala code using the “1_p” parameter, as sketched below.
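As a rough illustration of that iteration (the page range here is arbitrary, and the URL is simply the base URL above with the page number substituted for the 1 in “1_p”):

```scala
// Base search URL with a placeholder where the "1_p" page segment goes.
val pageUrlTemplate =
  "https://www.zillow.com/homes/for_sale/FL/14_rid/globalrelevanceex_sort/32.971803,-76.311036,22.268764,-91.31836_rect/5_zm/%d_p/"

// Build the URL for a given results page.
def pageUrl(page: Int): String = pageUrlTemplate.format(page)

// Iterate over as many pages as needed (20 here is just an example).
val pageUrls: Seq[String] = (1 to 20).map(pageUrl)
```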
Each data item I need sits in a different DOM element. Looking at the HTML code with the Chrome console reveals the classes/ids/attributes I need in order to uniquely identify each item.
For example, in order to get the latitude value, I need to take the value at the following path:
article.photo-card >> [itemprop=geo] >> [itemprop=latitude]
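As an illustration, a jsoup-based lookup of that path might look like the sketch below; the repository may use a different HTML library, and whether the value sits in a content attribute or in the element text depends on the actual markup:

```scala
import org.jsoup.Jsoup

// Fetch one results page; Zillow may refuse requests that lack a
// browser-like User-Agent header.
val doc = Jsoup.connect(
  "https://www.zillow.com/homes/for_sale/FL/14_rid/globalrelevanceex_sort/32.971803,-76.311036,22.268764,-91.31836_rect/5_zm/1_p/"
).userAgent("Mozilla/5.0").get()

// jsoup uses a space as the descendant combinator, so this CSS selector is
// the article.photo-card >> [itemprop=geo] >> [itemprop=latitude] path above.
val latitude = doc
  .select("article.photo-card [itemprop=geo] [itemprop=latitude]")
  .attr("content")
```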
Coding the Crawler
In this example, I’m only retrieving a few parameters for each house. This is why my House.scala case class is thin and simple. Of course, you can add as many parameters as you like.
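As a sketch of how thin it can be (the field names here are my own choice, not necessarily the ones used in the repository):

```scala
// A minimal record for one listing; extend it with any extra fields you need.
case class House(
  street: String,
  price: String,
  postalCode: String,
  latitude: Double,
  longitude: Double
)
```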
The crawler uses the DOM analysis we did earlier. For each data field, we need to know exactly how to retrieve it from the website’s DOM tree (using XPath-like queries).
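Here is a hedged sketch of that step, again assuming jsoup and the House case class sketched above; the .photo-card-price class and the itemprop attribute names are my guesses based on the DOM analysis and may differ from what the repository actually uses:

```scala
import org.jsoup.nodes.Document
import scala.jdk.CollectionConverters._

// Turn every listing card on a results page into a House instance.
def parseHouses(doc: Document): Seq[House] =
  doc.select("article.photo-card").asScala.toSeq.map { card =>
    House(
      street     = card.select("[itemprop=streetAddress]").text(),
      price      = card.select(".photo-card-price").text(),
      postalCode = card.select("[itemprop=postalCode]").text(),
      // These will throw on a card without geo data; a real crawler
      // should guard against missing values.
      latitude   = card.select("[itemprop=geo] [itemprop=latitude]").attr("content").toDouble,
      longitude  = card.select("[itemprop=geo] [itemprop=longitude]").attr("content").toDouble
    )
  }
```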
Finally, we write all the data into a CSV file for further analysis.
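A minimal sketch of that last step using only the standard library (the repository may well use a dedicated CSV writer instead):

```scala
import java.io.PrintWriter

// Write the collected houses into a simple comma-separated file.
def writeCsv(houses: Seq[House], path: String): Unit = {
  val writer = new PrintWriter(path)
  try {
    writer.println("street,price,postalCode,latitude,longitude")
    houses.foreach { h =>
      // Quote the street so commas inside addresses do not break the row.
      writer.println(s""""${h.street}",${h.price},${h.postalCode},${h.latitude},${h.longitude}""")
    }
  } finally writer.close()
}
```

Quoting the street field keeps commas in addresses from splitting the row; a proper CSV library would also escape embedded quotes.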
And this is the final output (top few rows):
Street | Price | Postal | Latitude | Longitude |
1224 S Peninsula Dr APT 617 | $99000 | 32118 | 29.211072 | -81.006751 |
5205 NW 27th Dr | $189000 | 32605 | 29.702196 | -82.363273 |
2700 Bayshore Blvd APT 506 | $219900 | 34698 | 28.055253 | -82.77919 |
620 Porta Rosa Cir | $425000 | 32092 | 29.960255 | -81.481996 |
1074 Celebration Dr | $29000 | 33872 | 27.51979 | -81.50768 |
101 Maria Ct | $409000 | 33950 | 26.924016 | -82.074747 |
1149 Mill Creek Dr | $399900 | 32259 | 30.093501 | -81.631608 |
5000 Culbreath Key Way | $359900 | 33611 | 27.894401 | -82.529911 |
22711 SW 9th St | $349900 | 33433 | 26.33901 | -80.182642 |
The entire source code can be found on my GitHub at the following link:
https://github.com/yaronv/scala-crawler
Cheers