Wednesday, December 26, 2007

Netflix, While I Am Waiting, .... Part 2

I was hoping that my Netflix run would have made better progress by the time I returned from my 10-day vacation overseas. To my disappointment, it did not achieve much. I am only getting 3 predictions based on my 'buddy' information for every 10 runs. Although the run has completed almost 70% of the movie files, that covers less than 9% of all the predictions. If I do not change my approach, I will have to wait a whole year before I can make my first submission (if the contest is still open by then).

As I mentioned in my previous blog, there are some strange customers who rated far more movies than anyone else; it is almost impossible for a human being to rate (or watch) that many movies over the last 6 years. So the questions are: can I simply ignore these customers? Do they still provide some genuine data? If so, what is considered genuine and what is not?

If I could summarise the whole training set using the customer id and date as the index, I should be able to see how often someone rates movies on a per-day basis. However, we are talking about 100 million records, and 'awk' is unlikely to handle that in a single run. If I split the work into manageable sets, I should be able to overcome this hurdle. Here is my script, which does it in 10 separate awk runs by grouping the movie files on the last digit of their file names:

#! /bin/sh

for i in 0 1 2 3 4 5 6 7 8 9
do
        for j in mv_*$i.txt
        do
                cat $j
        done | nawk '
BEGIN {
        FS=","
}
# data lines look like "customer,rating,date"; the per-file "movieID:"
# header lines have no commas, so the NF test skips them
NF == 3 {
        ind=sprintf("%s:%s",$1,$3)      # index on customer:date
        ++s[ind]                        # count ratings per customer per day
}
END {
        for(i in s) {
                print i,s[i]
        }
}' > perday.txt.$i
done

To merge the 10 files back together, you can use a similar for loop that cats them into one more awk pass, this time summing the counts for each customer:date key.
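
Something along these lines (same style as above, just adding up the partial counts into a combined perday.txt):

#! /bin/sh
# The same customer:date key can appear in several perday.txt.* files
# (the split was by movie file), so the counts have to be summed again.
for i in 0 1 2 3 4 5 6 7 8 9
do
        cat perday.txt.$i
done | nawk '
{
        s[$1] += $2
}
END {
        for(i in s) {
                print i,s[i]
        }
}' > perday.txt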

To my surprise, customer #1664010 rated 5446 movies on 2005-10-12!! I think that answers the question of how to spot the 'non-genuine' data. But what should the cut-off point be for the number of movies rated per day? After plotting the cumulative graph, I realised that if I keep only the days on which a customer rated fewer than 50 movies, I can reduce the data set by 37%. Now I have to wait for the SQL DELETE to complete before I can prove the concept. In the meantime, the calculation continues.
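
The DELETE itself would look something like this; a sketch only, since 'ratings', 'cust' and 'date' here are placeholder names rather than the real schema:

#! /bin/sh
# Pick out the customer:date pairs with 50 or more ratings in a single day ...
nawk -F'[: ]' '$3 >= 50 { printf("%s|%s\n", $1, $2) }' perday.txt > overactive.txt

# ... then delete the matching rows from the (assumed) ratings table.
sqlite3 netflix.db <<'EOF'
CREATE TEMP TABLE overactive (cust INTEGER, date TEXT);
.separator "|"
.import overactive.txt overactive
DELETE FROM ratings
 WHERE EXISTS (SELECT 1 FROM overactive o
                WHERE o.cust = ratings.cust AND o.date = ratings.date);
EOF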



Thursday, December 13, 2007

Netflix, While I Am Waiting, ...

While I am waiting for the full set of qualifying.txt predictions, I spent some time analysing the raw data. The 90-20 rule applies to the training_set as well as the qualifying data. That's why I started my buddy building from the less-rated movies.

Although my run has completed 50% of the movies, it has only finished 4% of the total predictions.

I also wrote an AWK program to summarise the number of ratings (1-5) per customer. Did you know that 1269 customers rated only 1 movie and 16419 customers rated fewer than 10, even though the average number of ratings per customer is 209? As for the top customer, #305344, he/she rated 17653 movies over a 6-year period (320 weeks). That works out to 55 movies per week; is that even possible? Obviously he/she cannot possibly have watched that many movies. BTW, we are talking about 'rated', not 'watched'.
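
The program itself is nothing fancy; a sketch of that kind of per-customer count (not the exact code I ran) looks like this:

#! /bin/sh
# Count the number of ratings per customer across all the movie files.
# Data lines are "customer,rating,date"; the "movieID:" header lines
# have no commas and are skipped by the NF test.
for f in mv_*.txt
do
        cat $f
done | nawk '
BEGIN {
        FS=","
}
NF == 3 {
        ++count[$1]
}
END {
        for(c in count) {
                print c, count[c]
        }
}' > percust.txt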


Tuesday, December 11, 2007

Netflix, My Performance Journey

My ex-colleague introduced me to the Netflix Prize 3 months ago. As quoted from the website: "The Netflix Prize seeks to substantially improve the accuracy of predictions about how much someone is going to love a movie based on their movie preferences. Improve it enough and you win one (or more) Prizes. Winning the Netflix Prize improves our ability to connect people to the movies they love." The prize is US$1,000,000 and a job at Netflix.

I downloaded the dataset, a tar.gz file of 697,552,015 bytes. The training set comprises 17,770 movies rated by 480,189 unique customers. All in all, it consists of 100,480,507 ratings. These movies were rated between 11 Nov 1999 and 31 Dec 2005.

Your job is to work out 2,817,131 predictions covering 17,470 movies and 478,615 unique customers. That means 300 movies and 1,574 customers are not included in the predictions at all. Why? I have no idea at this moment.

As you already know, I am going to tackle this problem with my favourite programming language, Tcl. With such a large dataset, a database is definitely a must. There are a couple of databases that I have worked with from Tcl before: Metakit, MySQL and Oracle. However, none of them appeals to me as the "ideal" candidate. Metakit may not scale to 100 million records. Oracle requires a license, and its installation and operation may require expertise that I cannot acquire within a short period. MySQL may not give me the performance that I need.

I heard about this file-based SQL database called SQLite some time ago but never had a chance to work with it. It is written by D. Richard Hipp, a long-time Tcl developer. Its website documents benchmarks showing it to be faster than PostgreSQL and MySQL. So this is a golden opportunity to 'test drive' this "in-process library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine". FYI, SQLite comes with C/C++, Java, Perl, Python and Tcl bindings/APIs. If you want to find out more, watch this video for an introduction.

I am no database person, so the simplest way to populate the database is with a single table storing all the records, each with 4 fields: movie id, customer id, rating, and date. It took more than a week to populate the entire database before I could start doing any real work, and waiting for it to finish was absolutely frustrating. The initial runtime for just one prediction was almost 19 minutes, which is definitely unacceptable. Due to work commitments, I was starting to lose interest in the exercise.
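
In SQLite terms, that is just one flat table along these lines (the table and column names here are only for illustration):

#! /bin/sh
# A minimal sketch of the single-table layout described above.
sqlite3 netflix.db <<'EOF'
CREATE TABLE ratings (
        movie   INTEGER,        -- movie id, 1 to 17770
        cust    INTEGER,        -- customer id
        rating  INTEGER,        -- 1 to 5 stars
        date    TEXT            -- YYYY-MM-DD
);
EOF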

I talked to my ex-colleague recently and he got me interested again. This time I focused more on the approach and design of the program. I spent quite a bit of time reading up on SQLite and realised why it took 7+ days to populate the database and why the run times were so long. By default, SQLite wraps every INSERT in its own transaction and syncs it to disk. By committing a transaction only once every 1000 inserts, I was able to reduce the population time to just a couple of hours. I also introduced a couple of table indexes. Although this creates a larger database (6.5GB), it speeds up the queries tremendously: instead of 19 minutes per prediction, I am down to about a minute.
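
The loader itself is written in Tcl, but the batching idea can be sketched with awk feeding the sqlite3 shell; the combined input file and the table layout below are assumptions carried over from the earlier sketch:

#! /bin/sh
# Sketch only: assumes a combined file with "movie,cust,rating,date" lines
# and the 'ratings' table from the earlier sketch. The real loader is in Tcl.
nawk '
BEGIN {
        FS=","
        print "PRAGMA synchronous = OFF;"       # optional: fewer disk syncs during the bulk load
        print "BEGIN TRANSACTION;"
}
NF == 4 {
        # \047 is a single quote, so the date goes in as a text literal
        printf("INSERT INTO ratings VALUES(%s,%s,%s,\047%s\047);\n", $1, $2, $3, $4)
        if (NR % 1000 == 0) {                   # commit every 1000 inserts
                print "COMMIT;"
                print "BEGIN TRANSACTION;"
        }
}
END {
        print "COMMIT;"
        # the indexes that make per-movie and per-customer queries fast
        print "CREATE INDEX ratings_movie ON ratings(movie);"
        print "CREATE INDEX ratings_cust ON ratings(cust);"
}' all_ratings.csv | sqlite3 netflix.db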

Another performance trick I used is to compile Tcl and SQLite with SunStudio 12, with the optimisation flag (-O) turned on and configured to build binaries for the target platform (-xarch=amd64). SunStudio, a production-grade (and now free of charge) compiler, is definitely better than the gcc compiler.

My first version of the program was single-threaded, and I had to run a few copies of it to take full advantage of all the CPUs. Once the logic was correct, I converted it to a multi-threaded version. To speed up the entire calculation, I need to reuse whatever relationships I have already established in later calculations. I call this a "buddy": if customer-1 and customer-2 like the same set of movies, they are considered buddies. If these two customers are involved in a later prediction, I can make use of this buddy information. A weight factor is also assigned to differentiate the "closeness" of these buddies.

The multi-threaded version speeds up the calculation but gives rise to another problem: the bottleneck becomes I/O, and I can see all the threads waiting for the SQLite database to respond. Moving the calculation to a server with a better disk subsystem, a SunFire X4500 (with only 4 CPU cores), gives better performance than a SunFire X4600 with 16 CPU cores and SAS disks.

In each thread's calculation, I stored all the "buddy" information in a shared thread variable. To prevent threads from overwriting each other's updates, the shared variable is locked during each update. However, this slowed things down because every thread has to wait to acquire the lock before it can proceed. On Solaris, you can use "prstat -m" to see the microstate process accounting and identify the percentage of time a process/thread spends waiting for locks.
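
Something along these lines gives a per-thread view:

# Microstate accounting, one line per LWP (thread) of the given process,
# refreshed every 5 seconds; a high value in the LCK column means time
# spent waiting on user-level locks.
prstat -mL -p <pid> 5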

To avoid this lock contention, I now store the "buddy" information in a variable whose name includes the thread id. Each thread no longer has to lock anything, because the variable name is unique to that thread. With this, I am able to bring the time per prediction down to 8 seconds using 4 CPU cores. Hey, that is about a 140x improvement overall.

According to some of my initial benchmarks, movies rated by more customers take longer to process. Another trick that I hope will give me an edge is to arrange the order of processing: less-rated movies are processed first so that I can accumulate "buddy" data that shortens the time for the heavily rated movies later on. You may argue that the accumulated information may not be accurate, but this is a trade-off between accuracy and performance.
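
A quick way to derive such an ordering from the raw movie files (a sketch, not the actual scheduling code):

#! /bin/sh
# List the movie files from least to most rated. Line 1 of each mv_*.txt is
# the "movieID:" header, so the rating count is the line count minus one.
for f in mv_*.txt
do
        n=`wc -l < $f`
        echo "`expr $n - 1` $f"
done | sort -n | nawk '{print $2}' > process_order.txt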

The whole exercise is not really about winning the prize; the journey of continuously pushing the performance envelope is what is really satisfying. I started the full run on 6 Dec, and I can already see that 20% of the predictions are based on previously accumulated "buddy" information. Stay tuned for more blog posts on this Netflix journey. Cheers.


Monday, December 03, 2007

Templating Your Shell Script

My colleague wanted to run a set of predefined commands on a number of remote servers over SSH. However, the arguments used in the commands are very much server-specific. In this kind of situation, you may want to create a shell template so that your main script can turn the template into server-specific commands.

My preference is to mark those variables with the "@" sign; you can use any character that has no special meaning in shell programming. Suppose you want to run this command on the host11 server:
cd some-dir; tar cvf /backup/host11-20071203.tar ./https-host11; gzip /backup/host11-20071203.tar

What you can do is write it in the template like this:
cd some-dir; tar cvf /backup/@HOST@-@TAG@.tar ./https-@HOST@; gzip /backup/@HOST@-@TAG@.tar

By running "sed" (Stream Editor), you can do global substitution on the fly and even pipe it over to the ssh connection to get the things done automatically, just like this:

TAG="20071203"
for HOST in host2 host3 host5 host7 host11 host13 host17 host19
do
  sed -e "s/@HOST@/$HOST/g" -e "s/@TAG@/$TAG/g" command-template | ssh user@$HOST
done

You can even define the date dynamically if your @TAG@ is supposed to represent the current date (see the snippet below). This approach lets you run templated shell commands over a whole bundle of hosts. Make sure you include the "#!" line at the top of the template to avoid mis-interpretation of the shell syntax, e.g. sh syntax being interpreted by csh.
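
For the dynamic date, plain command substitution does the job:

# derive the tag from today's date instead of hard-coding it
TAG=`date +%Y%m%d`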

If you have ever compiled open source tools before, you will normally run:
./configure; make; make install
What ./configure does is work out all the details and create the 'Makefile' by substituting all the @VARIABLES@ in the 'Makefile.in' template. Below are the first few lines of 'Makefile.in' from Tcl 8.4:

VERSION                 = @TCL_VERSION@
MAJOR_VERSION           = @TCL_MAJOR_VERSION@
MINOR_VERSION           = @TCL_MINOR_VERSION@
PATCH_LEVEL             = @TCL_PATCH_LEVEL@

prefix                  = @prefix@
exec_prefix             = @exec_prefix@
bindir                  = @bindir@
libdir                  = @libdir@
includedir              = @includedir@
mandir                  = @mandir@

Next time you compile an open source tool, take a look under the hood to understand how it all works.
