9b Recommender Systems



Problem Definitions

Recommender systems come into play when we want to do things like recommend movies to customers.

  • $n_u = $ number of users
  • $n_m = $ number of movies
  • $r(i,j) = 1$ if user $j$ has rated movie $i$ (and $0$ otherwise)
  • $y^{(i,j)} = $ rating given by user $j$ to movie $i$ (defined only if $r(i,j) = 1$)
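
To make the notation concrete, here is a minimal sketch (the ratings are invented) storing the data as an $n_m \times n_u$ matrix $Y$ plus a binary indicator matrix $R$:

```python
import numpy as np

# Toy data: n_m = 4 movies (rows) rated by n_u = 3 users (columns).
# NaN marks entries where user j has not rated movie i.
Y = np.array([
    [5.0,    4.0,    np.nan],
    [np.nan, 3.0,    1.0],
    [2.0,    np.nan, 5.0],
    [4.0,    4.0,    np.nan],
])

# R[i, j] = 1 if user j has rated movie i, 0 otherwise.
R = (~np.isnan(Y)).astype(int)

n_m, n_u = Y.shape  # 4 movies, 3 users
```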

Ways of Approaching It

Content-Based Recommendations

We can introduce features that represent, for example, how much romance or action is in a film. Then we could use linear regression to learn $\theta^{(j)}$ for each user, using the squared error function with regularisation.
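
As a rough sketch of this (the feature values and ratings below are invented, and I solve the regularised normal equations in closed form rather than running gradient descent), learning $\theta^{(j)}$ for a single user might look like:

```python
import numpy as np

# Hypothetical content features per movie: [romance, action].
X = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.7, 0.3],
])
y = np.array([5.0, 1.0, 4.0])  # one user's ratings of those movies
lam = 0.1                      # regularisation strength

# Regularised least squares: theta = (X^T X + lam * I)^-1 X^T y
n = X.shape[1]
theta = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# Predict this user's rating for a new half-romance, half-action film.
print(theta @ np.array([0.5, 0.5]))
```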

Collaborative Filtering

The problem with the above approach is that it's difficult to find features like "amount of romance" or "amount of action" in a movie. Rather than doing this for every movie, we can get users to tell us how much they like different genres, and infer each movie's features from those preferences.

To do this, we can use the squared error function with regularisation again, this time over all users.
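
For instance (a hedged sketch with invented numbers), given two users' stated genre preferences $\theta^{(j)}$ and their ratings of one movie, the movie's features are recovered by the same regularised least-squares problem with the roles of $x$ and $\theta$ swapped:

```python
import numpy as np

# Two users' stated preferences for [romance, action] (values invented).
Theta = np.array([
    [5.0, 0.0],  # loves romance, indifferent to action
    [0.0, 5.0],  # the reverse
])
y = np.array([4.5, 0.5])  # their ratings of one particular movie
lam = 0.1

# Same normal equations as before, with Theta playing the role of the data.
n = Theta.shape[1]
x = np.linalg.solve(Theta.T @ Theta + lam * np.eye(n), Theta.T @ y)
print(x)  # roughly [0.90, 0.10]: mostly a romance film
```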

Collaborative Filtering Algorithm

To speed things up, rather than alternately optimising the features and the parameters, we can minimise over both simultaneously.

If we combine the squared error cost functions for $\theta$ and for $x$, we get:

\[J(x,\theta) = \dfrac{1}{2} \displaystyle \sum_{(i,j):r(i,j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})^2 + \dfrac{\lambda}{2}\sum_{i=1}^{n_m} \sum_{k=1}^{n} (x_k^{(i)})^2 + \dfrac{\lambda}{2}\sum_{j=1}^{n_u} \sum_{k=1}^{n} (\theta_k^{(j)})^2\]

This looks complicated, but it is just the two squared error cost functions (over users and over movie features) combined: the only part unique to each is its regularisation term.
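
Concretely, here is a vectorised sketch of $J$, assuming $Y$ stores $0$ (rather than NaN) for unrated entries so the mask $R$ zeroes them out:

```python
import numpy as np

def cost(X, Theta, Y, R, lam):
    # X: (n_m, n) movie features; Theta: (n_u, n) user parameters.
    # Y, R: (n_m, n_u) ratings and rated-indicator; Y is 0 where R is 0.
    err = (X @ Theta.T - Y) * R  # errors only over rated (i, j) pairs
    return (0.5 * np.sum(err ** 2)
            + 0.5 * lam * np.sum(X ** 2)
            + 0.5 * lam * np.sum(Theta ** 2))
```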

The steps in the algorithm are:

  1. Initialise $x^{(1)}, ..., x^{(n_m)}, \theta^{(1)}, ..., \theta^{(n_u)}$ to small random values. This breaks symmetry, and ensures the algorithm learns features $x^{(1)}, ..., x^{(n_m)}$ that are different from each other.
  2. Minimise $J(x^{(1)}, ..., x^{(n_m)}, \theta^{(1)}, ..., \theta^{(n_u)})$ using gradient descent (or an advanced optimisation algorithm); a full sketch appears after this list. For every $j = 1,...,n_u$ and $i=1,...,n_m$:

\[x_k^{(i)} := x_k^{(i)} - \alpha\left(\displaystyle \sum_{j:r(i,j)=1}{((\theta^{(j)})^T x^{(i)} - y^{(i,j)}) \theta_k^{(j)}} + \lambda x_k^{(i)} \right)\]

\[\theta_k^{(j)} := \theta_k^{(j)} - \alpha\left(\displaystyle \sum_{i:r(i,j)=1}{((\theta^{(j)})^T x^{(i)} - y^{(i,j)}) x_k^{(i)}} + \lambda \theta_k^{(j)} \right)\]

  3. For a user with parameters $\theta$ and a movie with (learned) features $x$, predict a star rating of $\theta^Tx$.
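
Putting the steps together, here is a minimal sketch of the simultaneous updates (toy data invented; $Y$ again stores $0$ for unrated entries):

```python
import numpy as np

def gradient_step(X, Theta, Y, R, lam, alpha):
    err = (X @ Theta.T - Y) * R            # (n_m, n_u) masked prediction errors
    X_grad = err @ Theta + lam * X         # partial derivatives w.r.t. X
    Theta_grad = err.T @ X + lam * Theta   # partial derivatives w.r.t. Theta
    return X - alpha * X_grad, Theta - alpha * Theta_grad

rng = np.random.default_rng(0)
Y = np.array([[5, 4, 0], [0, 3, 1], [2, 0, 5], [4, 4, 0]], dtype=float)
R = (Y > 0).astype(float)                  # in this toy data, 0 means "unrated"
X = 0.1 * rng.standard_normal((4, 2))      # step 1: small random initialisation
Theta = 0.1 * rng.standard_normal((3, 2))

for _ in range(5000):                      # step 2: gradient descent
    X, Theta = gradient_step(X, Theta, Y, R, lam=0.1, alpha=0.01)

print(X @ Theta.T)                         # step 3: predicted star ratings
```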

Vectorisation

Given matrices $X$ (each row containing the features of a particular movie) and $\Theta$ (each row containing one user's weights for those features), the full matrix $Y$ of all predicted ratings of all movies by all users is given by $Y = X \Theta^T$.
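
A quick sketch with invented learned values:

```python
import numpy as np

X = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.1, 0.9]])  # (n_m, n)
Theta = np.array([[5.0, 0.0], [4.0, 1.0], [0.0, 5.0]])          # (n_u, n)

Y_pred = X @ Theta.T  # Y_pred[i, j] = (theta^(j))^T x^(i)
print(Y_pred.shape)   # (n_m, n_u) = (4, 3)
```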

We can measure how similar two movies $i$ and $j$ are by looking at the distance between their respective feature vectors. Specifically, similar movies should have a small $||x^{(i)} - x^{(j)}||$.
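
For example (learned features invented), finding the movies nearest to movie $i = 0$:

```python
import numpy as np

X = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.1, 0.9]])

i = 0
dists = np.linalg.norm(X - X[i], axis=1)  # ||x^(i) - x^(j)|| for every j
print(np.argsort(dists)[1:3])             # nearest neighbours, skipping i itself
```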