Building a food recommendation engine with Spark / MLlib and Play


Recommendation engines have become very popular in the last decade with the explosion of e-commerce, on-demand music and movie services, dating sites, local reviews, news aggregation and advertising (behavioral targeting, intent targeting, …). Depending on your past actions (e.g., purchases, reviews left, pages visited, …) or your interests (e.g., Facebook likes, Twitter follows), the recommendation engine presents other products that might interest you, drawing on other users' actions and behaviors (page clicks, page views, time spent on page, clicks on images/reviews, …).

In this post, we’re going to implement a recommender for food using Apache Spark and MLlib. For instance, if someone is interested in some coffee products, then we might recommend other coffee brands, coffee filters or related products that other users like too.



MongoDB lessons learned

Continuous Updating

MongoDB is currently the primary database for VersionEye. In the last couple of weeks we had some performance and scaling issues. Unfortunately, that caused some downtime. Here are the lessons learned from the last 3 weeks.


The Ruby code at VersionEye uses the MongoID driver to access MongoDB. All in all, MongoID is a great piece of open source software. There is a very active community that offers great support.

In our case MongoID somehow didn’t close the opened connections. With each HTTP request a new connection to MongoDB is created. Once the HTTP response is generated, the connection can be closed. Unfortunately this didn’t happen automatically. So the open connections piled up on the MongoDB Replica Set and the application became slower and slower over time. After a restart of the Replica Set the count started at 0 again and the application was fast again. At least…


Music recommendations with 300M data points and one SQL query

Alexandre Passant

While toying with the public BigQuery datasets, impatiently waiting for Google Cloud Dataflow to be released, I noticed the Wikipedia Revision History one, which contains a list of 314M Wikipedia edits, up to 2010. In the spirit of Amazon’s “people who bought this”, I decided to run a small experiment about music recommendations based on Wikipedia edits. The results are not perfect, but provide some insights that could be used to bootstrap a recommendation platform.
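To make the “people who bought this” idea concrete, here is a minimal Python sketch of co-occurrence counting over (editor, page) pairs. It is not the SQL query from the original post, which is not reproduced in this excerpt; the function names `co_edit_counts` and `recommend` and the toy edit log are made up for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def co_edit_counts(edits):
    """Count, for every pair of pages, how many editors touched both.

    `edits` is an iterable of (editor, page) pairs (hypothetical input shape).
    """
    pages_by_editor = defaultdict(set)
    for editor, page in edits:
        pages_by_editor[editor].add(page)

    counts = defaultdict(Counter)
    for pages in pages_by_editor.values():
        # Every unordered pair of pages edited by the same person co-occurs once.
        for a, b in combinations(sorted(pages), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def recommend(counts, page, k=3):
    """Return up to k pages most often co-edited with `page`."""
    return [other for other, _ in counts.get(page, Counter()).most_common(k)]


# Toy data, invented for this example (not taken from the Wikipedia dataset).
toy_edits = [
    ("u1", "Radiohead"), ("u1", "Portishead"),
    ("u2", "Radiohead"), ("u2", "Portishead"), ("u2", "Massive Attack"),
    ("u3", "Radiohead"), ("u3", "Muse"),
]
toy_counts = co_edit_counts(toy_edits)
top_match = recommend(toy_counts, "Radiohead", k=1)
```

In the toy log, two editors touched both "Radiohead" and "Portishead", so "Portishead" ranks first for "Radiohead". A real run over hundreds of millions of edits would also need to filter bots and very prolific editors, which this sketch ignores.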

Wikipedia edits as a data source

Wikipedia pages are often an invaluable source of knowledge. Yet, the type and frequency of their edits also provide great data to mine knowledge from. See for instance the Wikipedia Live Monitor by Thomas Steiner, detecting breaking news through Wikipedia, “You are what you edit”, an ICWSM09 study of Wikipedia edits to identify contributors’ location, or some of my joint work on data provenance with Fabrizio Orlandi.

Here, my assumption to build a…


Scaling Concurrent Writes in Neo4j

Max De Marzi


A while ago, I showed you a way to scale Neo4j writes using RabbitMQ. Which was kinda cool, but some of you asked me for a different solution that didn’t involve adding yet another software component to the stack.

Turns out we can do this in just Neo4j using a little help from the Guava library. The solution involves a background service that holds the writes in a queue and, every once in a while (say, every second), commits those writes in one transaction.
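The queue-and-flush pattern described above can be sketched in a few lines. This is a Python stand-in using only the standard library, not the Neo4j or Guava code from the original post; the class name `BatchingWriter` and the `commit` callback (which would wrap a single database transaction) are assumptions made for illustration.

```python
import queue
import threading


class BatchingWriter:
    """Holds writes in a queue and commits them in batches from one background thread."""

    def __init__(self, commit, interval=1.0):
        self._commit = commit        # hypothetical callback: commits a list of writes in one transaction
        self._interval = interval    # seconds between flushes
        self._queue = queue.Queue()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def submit(self, write):
        """Called by request threads; returns immediately without touching the database."""
        self._queue.put(write)

    def _flush(self):
        # Drain everything currently queued and commit it as a single batch.
        batch = []
        while True:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        if batch:
            self._commit(batch)

    def _run(self):
        # Event.wait returns False on timeout (time to flush) and True once stop() is called.
        while not self._stop.wait(self._interval):
            self._flush()
        self._flush()  # final flush so no queued write is lost on shutdown

    def stop(self):
        self._stop.set()
        self._thread.join()


# Usage: collect committed batches in a list instead of a real database.
committed_batches = []
writer = BatchingWriter(committed_batches.append, interval=0.05)
writer.start()
for i in range(5):
    writer.submit(i)
writer.stop()
all_writes = [w for batch in committed_batches for w in batch]
```

Because only one thread ever commits, writers never contend for the same locks; the trade-off is that a write is not durable until the next flush, so a caller cannot assume its write has been committed when `submit` returns.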
