CBF Programming Assignment Solution

Starting from:

~~$34.99~~

$28.99

Home

In this assignment, you will implement a content-based recommender as a LensKit recommender

algorithm. LensKit provides tools to produce recommendations from a user; your task is to implement

the logic of the recommender itself.

There are 2 parts to this assignment, implementing two variants of a TF-IDF recommender.

Like you did in the nonpersonalized recommenders, the CBF recommenders are broken into three

components:

-   A model class, `TFIDFModel`, that you do not need to modify.

-   A model provider, `TFIDFModelProvider`, that you will need to complete with the logic to

    compute TF-IDF vectors for items.

-   A scorer/recommender class that uses the precomputed model to score items. In this assignment,

    you will implement an `ItemScorer`; its only job is to compute user-personalized scores for

    items. You do not have to rank the items: LensKit will take care of that for you.

In addition, the item scorer will use another component, the `UserProfileBuilder`, to compute user

tag interest profiles based on a user's ratings and the tags applied to the movies they like. You

will implement two versions of this profile builder.

## Downloads and Resources

- Project template (from Coursera)

- [LensKit for Teaching website](http://mooc.lenskit.org) (links to relevant documentation)

- [JavaDoc for included code](http://mooc.lenskit.org/assignments/cbf/javadoc/)

Additionally, you will need:

- [Java](http://java.oracle.com) — download the Java 8 JDK. On Linux, install the OpenJDK 'devel' package (you will need the devel package to have the compiler).

- An IDE; we recommend [IntelliJ IDEA](http://jetbrains.com/idea/) Community Edition.

## Notation

Here's the mathematical notation we use in this assignment description:

$\mathbf{p}_u$

:   The user's profile vector, indicating how much the user likes each tag.

$p_{ut}$

:   User $u$'s preference for a particular tag $t$ (this is an element of $\mathbf{p}_u$).[^1]

$\mathbf{q}_i$

:   The tag vector for an item $i$. In the final computation, this will be a unit-normalized TF-IDF vector.

$q_{it}$

:   The value for a particular tag $t$ for an item $i$ (an element of $\mathbf{q}_i$).

$I_u$

:   The set of items rated by user u.

$r_{ui}$

:   User u's rating for item i.

$\bar{r_u}$

:   The average of user u's ratings.

$||\mathbf{v}||_2$

:   The Euclidean ($L_2$) norm, or length, of a vector $\mathbf{v}$. It is the square root of the

    sum of the squares of the elements of $\mathbf{v}$.

[^1]: If you have reviewed the Standardized Notation resource, you may notice that we generally use

    $\mathbf{p}$ and $\mathbf{q}$ vectors for matrix factorization, a topic in Course 4. However,

    in that setting, they serve as user and item description vectors, respectively; we therefore

    use them here in that sense, describing users and items by their scores for tags.

## Part 1: TF-IDF Recommender with Unweighted Profiles

Start by downloading the project template. This is a Gradle project; you can import it into your IDE directly (IntelliJ users can open the `build.gradle` file as a project). The code should compile as-is; you can test this by running the `build` Gradle target from your IDE, or running `./gradlew build` at the command line.

There are 3 things you need to implement to complete the first part of the assignment:

Compute item-tag vectors (the model)

:   For this task, you need to modify the model builder (`TFIDFModelProvider`, your modifications go in the get() method) to compute the unit-normalized TF-IDF vector for each movie in the data set. We provide the skeleton of this; TODO comments indicate where you need to implement missing pieces. When this piece is done, the model should contain a mapping of item IDs to TF-IDF vectors, normalized to unit vectors, for each item.

    There are two main steps to this task:

    1. Iterate through items, building the term vector $\hat{\mathbf{q}_i}$ for each item and a global document frequency vector $\mathbf{d}$. At this stage, these are unnormalized term and document frequency vectors, storing the number of times the term appears on each document or the number of documents in which it appears.

    2. Iterate through each item again, performing the following:

        1. Divide each term value $\hat q_{it}$ by the log of the document frequency ($\mathrm{ln} d_t$). The resulting vector $\tilde{\mathbf{q}_i}$ is the TF-IDF vector.

        2. After dividing each term value by the log of the DF, compute the length (Euclidean norm) of the TF-IDF vector $\tilde{\mathbf{q}_i}$, and divide each element of it by the length to yield a unit vector $\mathbf{q}_i$.

    The `getItemTags` method of the `ItemTagDAO` class, an instance of which is available as a field in your model builder, will give you the list of all tags applied to the item.

Build user profile for each query user

:   The `UserProfileBuilder` interface defines classes that take a user's history – a list of ratings — and produce a vector $\mathbf{p}_u$ representing that user's profile. Recall that the user profile vectors will also be term vectors, describing how much the user likes each tag. For part 1, the profile should be the sum of the item-tag vectors of all items the user has rated positively (= 3.5 stars); this implementation goes in `ThresholdUserProfileBuilder`.

    Mathematically, this is:

    $$p_{ut} = \sum_{i \in I_u: r_{ui} \ge 3.5} q_{it}$$

Generate item scores for each user

:   The heart of the recommendation process in many LensKit recommenders is the score method of the item scorer, in this case `TFIDFItemScorer`. Modify this method to score each item by using cosine similarity: the score for an item is the cosine between that item's tag vector and the user's profile vector. Cosine similarity is defined as follows:

    $$\mathrm{cos}(\mathbf{p_u},\mathbf{q_i}) = \frac{\mathbf{p_u} \cdot \mathbf{q_i}}{\|\mathbf{p_u}\|_2 \|\mathbf{q_i}\|_2} = \frac{\sum_t q_{ut} p_{it}}{\sqrt{\sum_t q_{ut}^2} \sqrt{\sum_t p_{it}^2}}$$

When you are done, you can run your program from the command line using Gradle:

    ```

    ./gradlew recommendBasic -PuserId=320

    ```

Try different user IDs.

We have also provide some unit tests to check basic functionality of your classes, particularly the

TF-IDF model builder. Run these tests in your IDE to pre-check your code. They are not extensive,

and many incorrect solutions will pass them, but they will help you flush out general data handling

bugs and make sure the model builder at least runs before you move on to the scorer and user profile

builders.

### Example Item Term Vector

Item 2231 has the following final term vector:

    {cards=0.1076736727697595,

     John Malkovich=0.23970853283741034,

     Matt Damon=0.3159509700099918

     poker=0.5969723646885582

     gambling=0.24572008949856772

     library vhs=0.06053343286414036

     card games=0.12526237038499663

     2.5=0.06789050985047351

     John Turturro=0.2606167586076272

     Edward Norton=0.5529141975174856

     John Dahl=0.1141651377329722

     watched 2006=0.061105276598972115}

### Example Output for Unweighted User Profile

The following example gives actual outputs for user 320 in the data set. It was executed using `./gradlew recommendBasic -PuserId=320` in a Unix-like console.

```

recommendations for user 320:

32 (Twelve Monkeys (a.k.a. 12 Monkeys) (1995)): 0.349

1748 (Dark City (1998)): 0.307

1206 (Clockwork Orange, A (1971)): 0.298

48394 (Pan's Labyrinth (Laberinto del fauno, El) (2006)): 0.288

1199 (Brazil (1985)): 0.288

32587 (Sin City (2005)): 0.287

1270 (Back to the Future (1985)): 0.283

1089 (Reservoir Dogs (1992)): 0.278

1210 (Star Wars: Episode VI - Return of the Jedi (1983)): 0.277

7147 (Big Fish (2003)): 0.264

```

## Part 2: Weighted User Profile

For this part, adapt your solution from Part 1 to compute weighted user profiles. Put your weighted user profile code in `WeightedUserProfileBuilder`.

In this variant, rather than just summing the vectors for all positively-rated items, compute a weighted sum of the item vectors for all rated items, with weights being based on the user's rating. Your solution should implement the following formula:

$$\mathbf{p}_u = \sum_{i \in I(u)} (r_{ui} - \mu_u) \mathbf{q}_i$$

Using non-vector notation, each user tag value $x_{ut}$ is computed as follows:

$$p_{ut} = \sum_{i \in I(u)} (r_{ui} - \mu_u) q_{it}$$

If an item $i$ has no data for a tag $t$, then $q_{it} = 0$.

### Example Output for Weighted User Profile

The following example gives actual outputs for user 320 in the data set. It was executed using `./gradlew recommendWeighted -PuserId=320` in a Unix-like console.

```

recommendations for user 320:

48394 (Pan's Labyrinth (Laberinto del fauno, El) (2006)): 0.204

7153 (Lord of the Rings: The Return of the King, The (2003)): 0.184

1210 (Star Wars: Episode VI - Return of the Jedi (1983)): 0.178

628 (Primal Fear (1996)): 0.177

2997 (Being John Malkovich (1999)): 0.176

1089 (Reservoir Dogs (1992)): 0.165

7147 (Big Fish (2003)): 0.161

32587 (Sin City (2005)): 0.160

48780 (Prestige, The (2006)): 0.160

1199 (Brazil (1985)): 0.155

```

## Running the Program

We have provided two Gradle targets to help you run the program. `recommendBasic` computes recommendations using the threshold user profile builder, and `recommendWeighted` uses the weighted user profile builder. You can users to recommend for with the `userId` property, e.g. `-PuserId=91`.

For example:

```

$ ./gradlew recommendBasic -PuserId=91

```

### Debugging

If you run the Gradle tasks using IntelliJ's Gradle runner, you can run them under the debugger to debug your code.

The Gradle file also configures LensKit to write log output to `build/recommend-basic.log` and `build/recommend-weighted.log`. If you use the SLF4J logger (the `logger` field on the classes we provide) to emit debug messages, you can find them there.

## Submitting

You will submit a compiled `jar` file containing your solution. To prepare your project for

submission, run the Gradle `prepareSubmission` task:

    ./gradlew prepareSubmission

This will create file `cbf-submission.jar` under `build/distributions` that contains your final

solution code in a format the grader will understand. Upload this `jar` file to the Coursera

assignment grader.

## Grading

Your grade for each part will be based on two components:

- Outputting items in the correct order: 75%

- Computing correct scores for items (within an error tolerance): 25%

The parts themselves are weighted as follows:

- Basic CBF: 70%

- CBF with Weighted User Profiles: 30%