Most machine learning models require a large amount of data, both for training and for testing.
Getting the data, even before dealing with the ML stuff, can be hard and tedious, depending on the source you have and the accessibility of the data itself.
Crawling a website or a blog is a convenient way to get the data you need. With relatively little effort, you can generate a CSV file containing all the data and analyze it more easily using state-of-the-art tools.
Analyzing your source
In order to crawl a website, you need to be familiar with its structure, meaning you need to know how the information is organized inside its HTML elements.
In my example, I will crawl the well-known real estate website https://www.zillow.com, which contains tons of housing information from across the United States.
My goal is to scrape Florida housing prices and other useful information.
Finding your Data Items inside the DOM
After playing a bit with the website, I found the following URL to be the base URL I need for the crawler
At the time of writing this post, the number of houses for sale in Florida is 271,821. Since not all the search results are visible on a single page, we need to iterate over the result pages from our Scala code using the “1_p” parameter.
Each data item I need sits in a different DOM element. Looking at the HTML code in the Chrome developer console reveals the classes, ids, and attributes I need in order to uniquely identify each item.
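To make the paging idea concrete, here is a minimal sketch of how the page URLs could be generated. It assumes the page number is the digit in front of "_p", so page 2 would be "2_p", and so on. The `Pagination` object, its function names, and the example.com URL are all hypothetical placeholders, not the real base URL.

```scala
object Pagination {
  // Hypothetical placeholder; substitute the real base URL of the search results.
  val exampleBaseUrl = "https://example.com/search/1_p/"

  /** Builds the URL of a given result page by swapping the "1_p" segment for "<page>_p". */
  def pageUrl(baseUrl: String, page: Int): String = {
    require(page >= 1, "page numbers start at 1")
    require(baseUrl.contains("1_p"), "base URL must contain the 1_p page segment")
    baseUrl.replace("1_p", s"${page}_p")
  }

  /** All page URLs from page 1 up to and including lastPage. */
  def pageUrls(baseUrl: String, lastPage: Int): Seq[String] =
    (1 to lastPage).map(page => pageUrl(baseUrl, page))
}
```

Calling `Pagination.pageUrls(Pagination.exampleBaseUrl, 5)` gives the first five page URLs, which the crawler can then fetch one by one.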
For example, in order to get the latitude value, I need to take the value at the following path:
article.photo-card >> [itemprop=geo] >> [itemprop=latitude]
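As a rough illustration, the path above can be written as a single CSS selector if we read ">>" as "descendant of". The sketch below uses the jsoup library (an assumption, it needs the org.jsoup dependency on the classpath) and is not necessarily how the original crawler does it. It also assumes the value is stored in a `content` attribute, as is common for `itemprop` meta tags, and falls back to the element text otherwise. `LatitudeExtractor` is a hypothetical name.

```scala
import org.jsoup.Jsoup
import scala.jdk.CollectionConverters._

object LatitudeExtractor {
  // CSS form of: article.photo-card >> [itemprop=geo] >> [itemprop=latitude]
  val latitudeSelector = "article.photo-card [itemprop=geo] [itemprop=latitude]"

  /** Returns the latitude value of every matching element in the page, in document order. */
  def latitudes(html: String): Seq[String] = {
    val doc = Jsoup.parse(html)
    doc.select(latitudeSelector).asScala.toSeq.map { el =>
      // Assumption: the value lives in a `content` attribute; otherwise use the visible text.
      if (el.hasAttr("content")) el.attr("content") else el.text()
    }
  }
}
```

The same pattern applies to every other field: find the selector in the browser console, then read either an attribute or the element text.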
Coding the Crawler
In this example, I’m only retrieving a few parameters for each house. This is why my House.scala case class is thin and simple. Of course, you can add as many parameters as you like.
The crawler uses the DOM analysis we did earlier. For each data field, we need to know exactly how to retrieve it from the website’s DOM tree (using XPath-like queries).
Finally, we write all the data into a CSV file for further analysis.
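A case class along these lines would be enough to hold one search result. This is a sketch reconstructed from the output columns shown at the end of the post, so the field names and types are my assumptions.

```scala
// House.scala (hypothetical reconstruction; field names and types are assumptions)
final case class House(
  address: String,
  price: String,     // kept as text, e.g. "$99000", exactly as scraped
  zipCode: String,   // kept as text so leading zeros are not lost
  latitude: Double,
  longitude: Double
)
```

Keeping the price and ZIP code as strings avoids parsing errors during the crawl; they can be converted to numbers later, during the analysis.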
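Writing the CSV needs nothing beyond the standard library. Here is a minimal sketch; the `CsvWriter` object, the header names, and the `House` fields are assumptions based on the output columns, and `House` is repeated so the block stands on its own.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Hypothetical House definition, repeated here so this block is self-contained.
final case class House(address: String, price: String, zipCode: String,
                       latitude: Double, longitude: Double)

object CsvWriter {
  val header = "address,price,zip,latitude,longitude"

  /** Quotes a field if it contains a comma, a quote, or a line break. */
  def escape(field: String): String =
    if (field.exists(c => c == ',' || c == '"' || c == '\n' || c == '\r'))
      "\"" + field.replace("\"", "\"\"") + "\""
    else field

  /** Renders one house as a single CSV line. */
  def toLine(h: House): String =
    Seq(h.address, h.price, h.zipCode, h.latitude.toString, h.longitude.toString)
      .map(escape)
      .mkString(",")

  /** Writes the header followed by one line per house, overwriting the file if it exists. */
  def write(path: String, houses: Seq[House]): Unit = {
    val content = (header +: houses.map(toLine)).mkString("\n") + "\n"
    Files.write(Paths.get(path), content.getBytes(StandardCharsets.UTF_8))
  }
}
```

The escaping matters for addresses, which can easily contain a comma and would otherwise shift the columns.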
And this is the final output (top few rows):
| Address | Price | ZIP code | Latitude | Longitude |
|---|---|---|---|---|
| 1224 S Peninsula Dr APT 617 | $99000 | 32118 | 29.211072 | -81.006751 |
| 5205 NW 27th Dr | $189000 | 32605 | 29.702196 | -82.363273 |
| 2700 Bayshore Blvd APT 506 | $219900 | 34698 | 28.055253 | -82.77919 |
| 620 Porta Rosa Cir | $425000 | 32092 | 29.960255 | -81.481996 |
| 1074 Celebration Dr | $29000 | 33872 | 27.51979 | -81.50768 |
| 101 Maria Ct | $409000 | 33950 | 26.924016 | -82.074747 |
| 1149 Mill Creek Dr | $399900 | 32259 | 30.093501 | -81.631608 |
| 5000 Culbreath Key Way | 359900 | 33611 | 27.894401 | -82.529911 |
| 22711 SW 9th St | $349900 | 33433 | 26.33901 | -80.182642 |
The entire source code can be found on my GitHub at the following link