Yaron Vazana

NLP, Algorithms, Machine Learning, Data Science, tutorials, tips and more


Scala Website Crawler

April 13, 2017 by Yaron Leave a Comment

All machine learning models require a large amount of data, both for training and for testing.

Acquiring that data, even before dealing with the ML itself, can be hard and tedious, depending on the source you have and how accessible the data is.

Crawling a website or a blog is a convenient way to collect the data you need. With relatively little effort, you can generate a CSV file containing all the data and then analyze it more easily with state-of-the-art tools.

Scala Crawler

Analyzing your source

In order to crawl a website, you need to be familiar with its structure; that is, you need to know how the information is organized inside its HTML elements.

In my example, I will crawl the well-known real estate website https://www.zillow.com, which contains tons of housing information from all over the United States.

My goal is to scrape Florida housing prices along with other useful information.

Finding your Data Items inside the DOM

After playing with the website for a bit, I found the following URL to be the base URL the crawler needs:

https://www.zillow.com/homes/for_sale/FL/14_rid/globalrelevanceex_sort/32.971803,-76.311036,22.268764,-91.31836_rect/5_zm/1_p/

At the time of writing this post, the number of houses for sale in Florida is 271,821. Since not all the search results fit on a single page, we iterate over the result pages from our Scala code by incrementing the "1_p" (page number) parameter.
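The pagination step can be sketched like this. The object and method names are my own for illustration, and the results-per-page figure is an assumption, not something stated in the post:

```scala
// Hypothetical sketch: generate the paged search URLs by substituting
// the page number into the "N_p" segment of the base URL.
object PageUrls {
  // Base URL with a %d placeholder where the page number ("1_p") goes.
  val baseUrl: String =
    "https://www.zillow.com/homes/for_sale/FL/14_rid/globalrelevanceex_sort/" +
      "32.971803,-76.311036,22.268764,-91.31836_rect/5_zm/%d_p/"

  // URL for a single result page.
  def urlFor(page: Int): String = baseUrl.format(page)

  // All page URLs, given how many result pages the search reports.
  def allUrls(totalPages: Int): Seq[String] =
    (1 to totalPages).map(urlFor)
}
```

Iterating `allUrls(...)` and fetching each page in turn gives the crawler its full result set.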

Each data item I need sits in a different DOM element. Inspecting the HTML with the Chrome developer console reveals the classes/ids/attributes needed to uniquely identify each item.

For example, in order to get the latitude value, I take the value at the following path:

article.photo-card >> [itemprop=geo] >> [itemprop=latitude]
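In code, that selector path can be applied with an HTML parser. The post does not name a parsing library, so the sketch below assumes jsoup, and the inline HTML snippet is a hypothetical fragment mimicking the listing-card markup:

```scala
import org.jsoup.Jsoup

// Sketch: extract the latitude from a listing card using the selector
// path found in the Chrome console (article.photo-card >> [itemprop=geo]
// >> [itemprop=latitude]). Assumes jsoup as the HTML parser.
object LatitudeExtraction {
  // Hypothetical fragment shaped like a Zillow photo card.
  val sampleHtml: String =
    """<article class="photo-card">
      |  <span itemprop="geo">
      |    <meta itemprop="latitude" content="29.211072">
      |    <meta itemprop="longitude" content="-81.006751">
      |  </span>
      |</article>""".stripMargin

  // jsoup uses CSS-style selectors; descendant combinators replace the
  // ">>" notation used informally above.
  def latitude(html: String): String =
    Jsoup.parse(html)
      .select("article.photo-card [itemprop=geo] [itemprop=latitude]")
      .attr("content")
}
```

Each of the other fields (price, street, postal code) would get its own selector in the same style.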

Coding the Crawler

In this example, I’m only retrieving a few parameters for each house. This is why my House.scala case class is thin and simple. Of course, you can add as many parameters as you like.
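A thin case class along these lines would hold one crawled listing; the exact field names and types in the repository’s House.scala may differ:

```scala
// Sketch of a minimal House record matching the fields crawled in this
// post: street address, asking price, postal code, and coordinates.
case class House(
  street: String,
  price: Int,
  postalCode: String,
  latitude: Double,
  longitude: Double
)
```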

The crawler uses the DOM analysis we did earlier. For each data field, we need to know exactly how to retrieve it from the website’s DOM tree (using XPath-like queries).

Finally, we write all the data into a CSV file for further analysis.
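The CSV step can be sketched as follows. The header names and the naive comma-joining (no quoting of embedded commas) are simplifying assumptions, not taken from the repository:

```scala
import java.io.PrintWriter

// Sketch: serialize crawled rows to CSV. Each row is a sequence of
// already-stringified field values in header order.
object CsvWriter {
  val header: Seq[String] = Seq("street", "price", "postal", "latitude", "longitude")

  // Build the full CSV text: header line followed by one line per row.
  def toCsv(rows: Seq[Seq[String]]): String =
    (header +: rows).map(_.mkString(",")).mkString("\n")

  // Write the CSV to disk, closing the writer even on failure.
  def write(rows: Seq[Seq[String]], path: String): Unit = {
    val pw = new PrintWriter(path)
    try pw.write(toCsv(rows)) finally pw.close()
  }
}
```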

And this is the final output: (top few rows)

Street                        Price     Postal  Latitude   Longitude
1224 S Peninsula Dr APT 617   $99000    32118   29.211072  -81.006751
5205 NW 27th Dr               $189000   32605   29.702196  -82.363273
2700 Bayshore Blvd APT 506    $219900   34698   28.055253  -82.77919
620 Porta Rosa Cir            $425000   32092   29.960255  -81.481996
1074 Celebration Dr           $29000    33872   27.51979   -81.50768
101 Maria Ct                  $409000   33950   26.924016  -82.074747
1149 Mill Creek Dr            $399900   32259   30.093501  -81.631608
5000 Culbreath Key Way        $359900   33611   27.894401  -82.529911
22711 SW 9th St               $349900   33433   26.33901   -80.182642

The entire source code can be found on my GitHub at the following link:

https://github.com/yaronv/scala-crawler


Cheers

Filed Under: Algorithms, Scala Tagged With: Crawler, Data Science, Scala

I am a data science team lead at Darrow and an NLP enthusiast. My interests range from machine learning modeling to solving challenging data-related problems. I believe sharing ideas is how we all become better at what we do. If you’d like to get in touch, feel free to say hello through any of the social platforms. More About Yaron…
