Chapter 1 Introduction

This report is a story of failing human intuition and data science success. In brief, it demonstrates that statistical learning brings insights that would otherwise be unavailable, and eventually achieves an RMSE of *0.

This project is the first of two final projects of the HarvardX - PH125.9x Data Science course.

Its purpose is the development of a recommender system for movie ratings using the MovieLens dataset.¹ Recommender systems are a class of statistical learning systems that analyse an individual's past choices and/or preferences in order to propose information relevant to future choices. Typical systems propose additional items to purchase based on past shopping activity and searches (e.g. Amazon), choice of books (e.g. Goodreads) or movies (e.g. Netflix).

Broadly, recommender systems fall into two categories:

  • collaborative filtering (user-based), which attempts to pool similar users together and guide a user’s recommendation given the pool’s preferences.

  • content-based filtering, which attempts to pool similar contents (e.g. shopping carts, movie ratings) together and guide a user’s recommendation within a pool of similar content.

In practice, the two approaches are often combined. A general overview is available on Wikipedia² and in the course materials.³

Given a training and a validation dataset, we will attempt to bring the Root Mean Squared Error (RMSE) of predicted ratings for user/movie pairs below 0.8649.
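As a reminder of what this target measures, the RMSE is the square root of the average squared difference between actual and predicted ratings. The following is a minimal sketch in Python, used here purely for illustration; the function name `rmse` and the four ratings are hypothetical and are not taken from the MovieLens data or from the code behind this report.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences of ratings."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if len(actual) == 0:
        raise ValueError("at least one rating is required")
    # Squared difference for each user/movie pair.
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    # Average the squared errors, then take the square root.
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical ratings for four user/movie pairs (illustration only).
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 2.5]
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a single prediction that is off by one full star weighs as much as four predictions that are each off by half a star.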

We note that Netflix organised a competition, spanning several years, to improve a recommender system, and that this competition shares many similarities with this project (Bennett, Lanning, and others 2007). Papers published by teams who participated in that competition have guided parts of this report (Bell, Koren, and Volinsky 2007; Bell, Koren, and Volinsky 2008; Koren 2009; Töscher, Jahrer, and Bell 2009; Piotte and Chabbert 2009; Gower 2014).

This report is organised as follows. In Section 2, we describe the dataset and add a number of possibly relevant predictors. Section 3 provides a number of visualisations. Section 4 proposes three models that turn out to perform poorly. Section 5 is dedicated to a low-rank matrix factorisation estimated by stochastic gradient descent.
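To give a flavour of the last of these, a low-rank matrix factorisation approximates each rating by a global mean plus the dot product of a small user vector and a small movie vector, and stochastic gradient descent adjusts those vectors one observed rating at a time. The following is a generic sketch under stated assumptions, not the model developed in Section 5: the function names, the rank, the learning rate, the regularisation weight and the toy ratings are all hypothetical, and Python is used only for illustration.

```python
import math
import random


def factorise(ratings, n_users, n_items, rank=2, lr=0.05, reg=0.02,
              epochs=500, seed=0):
    """Fit r_ui ~ mu + p_u . q_i by SGD; all hyperparameters are illustrative."""
    rng = random.Random(seed)
    # Small random starting values for the user (P) and movie (Q) factors.
    P = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_items)]
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    samples = list(ratings)
    for _ in range(epochs):
        rng.shuffle(samples)  # visit the ratings in random order
        for u, i, r in samples:
            err = r - predict(mu, P, Q, u, i)
            for k in range(rank):
                pu, qi = P[u][k], Q[i][k]
                # Gradient step on squared error plus an L2 penalty.
                P[u][k] += lr * (err * qi - reg * pu)
                Q[i][k] += lr * (err * pu - reg * qi)
    return mu, P, Q


def predict(mu, P, Q, u, i):
    """Predicted rating of movie i by user u."""
    return mu + sum(pu * qi for pu, qi in zip(P[u], Q[i]))


def train_rmse(ratings, predictor):
    """RMSE of predictor(u, i) over the given (user, movie, rating) triples."""
    sq = [(r - predictor(u, i)) ** 2 for u, i, r in ratings]
    return math.sqrt(sum(sq) / len(sq))


# Made-up ratings as (user, movie, rating): users 0-1 and 2-3 have opposite tastes.
toy = [(0, 0, 5), (0, 1, 4), (0, 2, 1),
       (1, 0, 4), (1, 1, 5), (1, 3, 2),
       (2, 0, 1), (2, 2, 5), (2, 3, 4),
       (3, 1, 2), (3, 2, 4), (3, 3, 5)]

mu, P, Q = factorise(toy, n_users=4, n_items=4)
baseline = train_rmse(toy, lambda u, i: mu)                   # always predict the mean
fitted = train_rmse(toy, lambda u, i: predict(mu, P, Q, u, i))
print(baseline, fitted)
```

On this toy data the comparison is made on the training ratings only, so it says nothing about performance on unseen user/movie pairs; it merely shows the factors capturing structure that a single mean cannot.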

References

Bell, Robert M, Yehuda Koren, and Chris Volinsky. 2007. “The BellKor Solution to the Netflix Prize.” KorBell Team’s Report to Netflix. Citeseer.

Bell, Robert M, Yehuda Koren, and Chris Volinsky. 2008. “The BellKor 2008 Solution to the Netflix Prize.” Statistics Research Department at AT&T Research 1.

Bennett, James, Stan Lanning, and others. 2007. “The Netflix Prize.” In Proceedings of KDD Cup and Workshop, 2007:35. New York, NY, USA.

Gower, Stephen. 2014. “Netflix Prize and SVD.” Working Paper.

Koren, Yehuda. 2009. “The BellKor Solution to the Netflix Grand Prize.” Netflix Prize Documentation 81 (2009): 1–10.

Piotte, Martin, and Martin Chabbert. 2009. “The Pragmatic Theory Solution to the Netflix Grand Prize.” Netflix Prize Documentation.

Töscher, Andreas, Michael Jahrer, and Robert M Bell. 2009. “The BigChaos Solution to the Netflix Grand Prize.” Netflix Prize Documentation, 1–52.