Monday, January 28, 2008

What is the Maximum Length for Host Name in Solaris

My colleague was asking me what is the maximum length of hostname in Solaris. Sorry, I don't know, but I can find out.

The command '/bin/hostname' is used to print or set the hostname. However, /bin/hostname is actually a shell script that wrap around the 'uname' command.

With Solaris being open-sourced, we can just browse the source code of 'uname' from OpenSolaris to find out if there is a limit imposed. Below shows the code snippet:

    219  if (Sflg) {
    220   int len = strlen(nodename);
    221 
    222   if (len > SYS_NMLN - 1) {
    223    (void) fprintf(stderr, gettext(
    224     "uname: name must be <= %d letters\n"),
    225     SYS_NMLN-1);
    226    exit(1);
    227   }
    228   if (sysinfo(SI_SET_HOSTNAME, nodename, len) < 0) {
    229    int err = errno;
    230    (void) fprintf(stderr, gettext(
    231     "uname: error in setting name: %s\n"),
    232     strerror(err));
    233    exit(1);
    234   }
    235   return (0);
    236  }
There you go, the limit is SYS_NMLN. SYS_NMLN is defined in <sys/utsname.h> which is set as 257 (256 + string terminate character 'null'). BTW, the 'man uname' documented this too.

Another way to find out is to do a brute-force approach. Below script will try to set the hostname starting with 1 character at a time until it throws exception. A trap is setup to reset the hostname and print out the maximum length of the hostname before it exit gracefully.

#! /bin/sh

orig=`hostname`
name=""
count=0


trap 'hostname $orig; echo Maximum Length for host name is $count; exit' 0 1 2 9 15


while :
do
        name="a$name"
        hostname $name > /dev/null 2>&1
        if [ $? -ne 0 ]; then
                exit
        fi
        count=`expr $count + 1`
done

Labels: ,

Wednesday, January 23, 2008

Netflix, My First Submission

After 1.5 months of calculation, I managed to complete all the movie predictions (2,817,131 in total) this morning. Below shows the run profile based on 'buddy' information. As you can see, I am benefitting from the 'buddy' towards the end of the run.

Guess what's my score? It is 1.0523. Not too bad for the first submission, but not good enough to be in the Netflix Leaderboard. You need to be better than 0.9514 (that is what Cinematch score on quiz subset) to be listed.

Labels:

Sunday, January 20, 2008

Tcl Dynamic String

In my last blog, I introduced the Tcl hash table library that can be easily called from your own C program. Today I am going to show you the power of Tcl dynamic string. No more 'realloc' to re-allocate memory. As mentioned in the documentation,
... Dynamic strings provide a mechanism for building up arbitrarily long strings by gradually appending information. If the dynamic string is short then there will be no memory allocation overhead; as the string gets larger, additional space will be allocated as needed. ...

Source code: dstring.c

#include <stdio.h>
#include "tcl.h"


double elapsedTime(Tcl_Time start, Tcl_Time stop)
{
 double microsecond;

 /* copy from Tcl_TimeObjCmd (tclCmdMZ.c) */
 microsecond = ( ( (double) ( stop.sec - start.sec ) ) * 1.0e6
                      + ( stop.usec - start.usec ) );
 return(microsecond);
}


int main(int argc, char *argv[])
{
 Tcl_DString ds;
 Tcl_Time start, stop;
 char *pattern;
 int i, patternN, repeat;

 if ( argc != 3 ) {
  fprintf(stderr, "Usage: %s <pattern> <#repeat)\n", argv[0]); 
  exit(1);
 }

 patternN=strlen(argv[1]);
 pattern=Tcl_Alloc(sizeof(char)*(patternN+1));
 pattern=strdup(argv[1]);
 repeat=atoi(argv[2]);

 Tcl_GetTime(&start);
 Tcl_DStringInit(&ds);
 for ( i=1 ; i<=repeat ; ++i ) {
  Tcl_DStringAppend(&ds, pattern, patternN);
 }
 Tcl_GetTime(&stop);

 printf("%s x %d = %d took %lf microseconds\n", 
  pattern, repeat, Tcl_DStringLength(&ds),
  elapsedTime(start,stop)
 );

 printf("Value: %s\n", Tcl_DStringValue(&ds));

 Tcl_DStringFree(&ds);


}

Environment, Compilation, Demonstation:

$ /usr/local/bin/tclsh
% parray tcl_platform
tcl_platform(byteOrder) = littleEndian
tcl_platform(machine)   = i686
tcl_platform(os)        = CYGWIN_NT-5.1
tcl_platform(osVersion) = 1.5.25(0.156/4/2)
tcl_platform(platform)  = unix
tcl_platform(user)      = chchan
tcl_platform(wordSize)  = 4
% exit

$ gcc --version
gcc (GCC) 3.4.4 (cygming special, gdc 0.12, using dmd 0.125)
Copyright (C) 2004 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ gcc -o dstring.exe -I /usr/local/include -L /usr/local/lib -O3 dstring.c -ltcl

$ ./dstring.exe hello-world, 10
hello-world, x 10 = 120 took 7.000000 microseconds
Value: hello-world,hello-world,hello-world,hello-world,hello-world,hello-world,h
ello-world,hello-world,hello-world,hello-world,

$ ./dstring.exe hello-world, 100
hello-world, x 100 = 1200 took 59.000000 microseconds
Value: hello-world,hello-world,hello-world,hello-world,hello-world,hello-world,h
ello-world,hello-world,hello-world,hello-world,hello-world,hello-world,hello-wor
ld,hello-world,hello-world,hello-world,hello-world,hello-world,hello-world,hello
-world,hello-world,hello-world,hello-world,hello-world,hello-world,hello-world,h
ello-world,hello-world,hello-world,hello-world,hello-world,hello-world,hello-wor
ld,hello-world,hello-world,hello-world,hello-world,hello-world,hello-world,hello
-world,hello-world,hello-world,hello-world,hello-world,hello-world,hello-world,h
ello-world,hello-world,hello-world,hello-world,hello-world,hello-world,hello-wor
ld,hello-world,hello-world,hello-world,hello-world,hello-world,hello-world,hello
-world,hello-world,hello-world,hello-world,hello-world,hello-world,hello-world,h
ello-world,hello-world,hello-world,hello-world,hello-world,hello-world,hello-wor
ld,hello-world,hello-world,hello-world,hello-world,hello-world,hello-world,hello
-world,hello-world,hello-world,hello-world,hello-world,hello-world,hello-world,h
ello-world,hello-world,hello-world,hello-world,hello-world,hello-world,hello-wor
ld,hello-world,hello-world,hello-world,hello-world,hello-world,hello-world,hello
-world,

Labels: ,

Thursday, January 17, 2008

Tcl Hash Table

Tcl has been designed as a general purpose cross-platform scripting language. Also it can be easily extendable and embeddable. However, not many people know that you can use part of its C library in your current program. Library functions include hash table, dynamic string allocation, a platform and compiler independent interface for memory allocation, regular expression, ... See Tcl library documentation for more details.

You don't have to be a Tcl programmer to use Tcl stuff.

Couple of years ago, one of my project partners used the Tcl hash table implementation in their AutoCAD extension running in Windows environment. Below shows a sample implementation of hash table. As you can see it took only 1.3 milliseconds to locate the value from a key out of a hash table with half a million key-value pairs. BTW, my notebook is running Cygwin in Windows XP (SP2) with Intel Celeron 1.4GHz CPU.

Source code: hash.c

#include <stdio.h>
#include <math.h>
#include "tcl.h"


double elapsedTime(Tcl_Time start, Tcl_Time stop)
{
 double microsecond;

 /* copy from Tcl_TimeObjCmd (tclCmdMZ.c) */
 microsecond = ( ( (double) ( stop.sec - start.sec ) ) * 1.0e6
                      + ( stop.usec - start.usec ) );
 return(microsecond);
}


int main(int argc, char *argv[])
{
 Tcl_HashTable t;
 Tcl_HashEntry *entry, *hasEntry;
 Tcl_Time start, stop;
 int i, max, isNew, lookup;
 char *key, *mesg;
 char keyFormat[]="key-%08d";


 max=500000;
 if ( argc != 2 ) {
  fprintf(stderr, "Usage: %s <1..%d>\n", argv[0], max); 
  exit(1);
 }
 lookup=atoi(argv[1]);

 Tcl_GetTime(&start);
 Tcl_InitHashTable(&t, TCL_STRING_KEYS);
 key=Tcl_Alloc(sizeof(char)*13);
 for ( i=0 ; i<=max ; ++i ) {
  sprintf(key, keyFormat, i);
  entry=Tcl_CreateHashEntry(&t, key, &isNew);
  if ( isNew ) {
   Tcl_SetHashValue(entry, i);
  }
 }
 Tcl_GetTime(&stop);
 printf("Took %lf microseconds to populate hash\n", elapsedTime(start,stop));

 sprintf(key, keyFormat, lookup);
 Tcl_GetTime(&start);
 hasEntry=Tcl_FindHashEntry(&t, key);
 printf("Key: %s, ", key);
 if ( hasEntry ) {
  printf("Value: %d\n", (int)Tcl_GetHashValue(hasEntry));
 } else {
  printf("Value: NOT FOUND\n");
 }
 Tcl_GetTime(&stop);
 printf("Took %lf microseconds to locate \"%s\"\n",elapsedTime(start,stop),key);

 mesg=Tcl_HashStats(&t);
 printf("\nHash Statistics:\n%s\n", mesg);
 Tcl_Free(mesg);
}
Environment, Compilation, Demonstation:
$ /usr/local/bin/tclsh
% parray tcl_platform
tcl_platform(byteOrder) = littleEndian
tcl_platform(machine)   = i686
tcl_platform(os)        = CYGWIN_NT-5.1
tcl_platform(osVersion) = 1.5.25(0.156/4/2)
tcl_platform(platform)  = unix
tcl_platform(user)      = chchan
tcl_platform(wordSize)  = 4
% exit

$ gcc --version
gcc (GCC) 3.4.4 (cygming special, gdc 0.12, using dmd 0.125)
Copyright (C) 2004 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ gcc -o hash.exe -I /usr/local/include -L /usr/local/lib -O3 hash.c -

$ ./hash.exe 2345
Took 967343.000000 microseconds to populate hash
Key: key-00002345, Value: 2345
Took 1333.000000 microseconds to locate "key-00002345"

Hash Statistics:
500001 entries in table, 262144 buckets
number of buckets with 0 entries: 0
number of buckets with 1 entries: 122400
number of buckets with 2 entries: 85450
number of buckets with 3 entries: 28016
number of buckets with 4 entries: 17240
number of buckets with 5 entries: 3784
number of buckets with 6 entries: 3622
number of buckets with 7 entries: 607
number of buckets with 8 entries: 661
number of buckets with 9 entries: 192
number of buckets with 10 or more entries: 172
average search distance for entry: 1.8

$ ./hash.exe 234592302
Took 890433.000000 microseconds to populate hash
Key: key-234592302, Value: NOT FOUND
Took 1342.000000 microseconds to locate "key-234592302"

Hash Statistics:
500001 entries in table, 262144 buckets
number of buckets with 0 entries: 0
number of buckets with 1 entries: 122400
number of buckets with 2 entries: 85450
number of buckets with 3 entries: 28016
number of buckets with 4 entries: 17240
number of buckets with 5 entries: 3784
number of buckets with 6 entries: 3622
number of buckets with 7 entries: 607
number of buckets with 8 entries: 661
number of buckets with 9 entries: 192
number of buckets with 10 or more entries: 172
average search distance for entry: 1.8

Labels: ,

Thursday, January 10, 2008

Netflix, Benefitting from Buddies

Two days ago, I mentioned that my Netflix run started to reap the benefits from 'buddy' information. Below graph visualises how many percent of the predictions are based on the buddy data. As you can see, the latest 45-80% of the predictions are based the buddies. I am looking forward to more benefits for the last 15% of the movies (80% of the predictions). Cheers.

Labels:

Tuesday, January 08, 2008

Calculate Min/Max/Sum/Avg/RMS/Std in a Script

I have been working on the Netflix predictions lately and have to find out the min, max, avg, root mean square, standard deviation every now and then. I have been typing a lot of these one-liners to work out some of these values and it is about time to put that all in one script.

Some of the sample one-liners to work out the sum and average:

awk '{s+=$1} END {print s}' data
awk '{s+=$1} END {print s/NR}' data

Below is the script that has been generalised to handle most of the situations such as applying statistical method on specific column, ability to handle various type of field separator.

$ cat ~/bin/calc.sh
#! /bin/sh


usage()
{
        echo "Usage: $0 [-h] [-c column] [-m sum|avg|max|min|rms|std] [-f field_sep] [file ...]"
        echo "\\tDefault: column 1, sum, white space, standard input"
}



#
# get arguments and flags
#
set -- `getopt c:m:f: $* 2>/dev/null`
if [ $? != 0 ]; then
        usage
        exit 1
fi
column=1
method=sum
fs='[ \t]+'
for i in $*; do
        case $i in
                -c)
                        column=$2
                        shift 2
                        ;;
                -f)
                        fs=$2
                        shift 2
                        ;;
                -m)
                        method=$2
                        shift 2
                        ;;
                -h)
                        usage
                        exit 1
                        ;;
                --)
                        shift
                        break
                        ;;
        esac
done



#
# default, standard input channel
#
args=${*:--}



#
# check arguments
#
if [ "$method" != "sum" -a \
     "$method" != "max" -a \
     "$method" != "min" -a \
     "$method" != "avg" -a \
     "$method" != "rms" -a \
     "$method" != "std" ]; then
        echo "Error. Method $method unsupported"
        exit 2
fi
echo $column | egrep '^[1-9][0-9]*$' > /dev/null 2>&1
if [ $? -ne 0 ]; then
        echo "Error. Column number has to be integer"
        exit 2
fi



nawk -v col="$column" -v met="$method" -v fs="$fs" '
BEGIN {
        FS=fs
        max=-99999999999
        min=99999999999
}
col<=NF{
        if ( met == "sum" || met == "avg" ) {
                sum+=$col
                ++count
        } else if ( met == "std" ) {
                sum+=$col
                ++count
                term[count]=$col
        } else if ( met == "std" ) {
                diff=avg-$col
                sum+=diff*diff
                ++count
        } else if ( met == "rms" ) {
                sum+=$col*$col
                ++count
        } else if ( met == "max" ) {
                if ( $col > max ) {
                        max=$col
                }
        } else if ( met == "min" ) {
                if ( $col < min ) {
                        min=$col
                }
        }
}
END {
        if ( met == "sum" ) {
                print sum
        } else if ( met == "avg" ) {
                print sum/count
        } else if ( met == "std" ) {
                avg=sum/count
                for(i in term) {
                        diff=term[i]-avg
                        std+=diff*diff
                }
                print sqrt(std/count)
        } else if ( met == "rms" ) {
                print sqrt(sum/count)
        } else if ( met == "max" ) {
                print max
        } else if ( met == "min" ) {
                print min
        }
}' $args

See my script in action

$ ~/bin/calc.sh -h
Usage: /export/home/chihung/bin/calc.sh [-h] [-c column] [-m sum|avg|max|min|rms|std] [-f field_sep] [file ...]
        Default: column 1, sum, white space, standard input

$ ~/bin/calc.sh -c 1 -m avg /tmp/x
3.51955

$ ~/bin/calc.sh -c 2 -m min /tmp/x
1.0

$ ~/bin/calc.sh -c 2 -m max /tmp/x
4.4

$ awk 'BEGIN{OFS=":"}{print $1,$2}' /tmp/x | ~/bin/calc.sh -c 1 -m avg -f :
3.51955

Labels: ,

Netflix, While I Am Waiting, .... Part 3

Two weeks ago, I thought I found the way to 'reduce' the training set for my Netflix run. I deleted all the sqlite database records for those customers who voted more than 50 times a day. The database has been reduced from 100,480,507 records to 64,353,618 records, that is about 36% reduction. However, the predictions have not been good compared with the one generated from the original database, with a standard deviation of 0.6195. Wrong move.

I started my run on 6 Dec 2007 and it has been over a month now. My run is currently at over 82% completion (based on movie id), but just under 17% completion based on predictions. It took 1 month to complete 17% of the predictions, however, my 'buddy' database has grown to 1.1GB in size. Also, over 70% of the predictions are calculated based on the 'buddy' information. This is a very promising sign. I hope to submit my first run by early next month. Keep me fingers crossed.

Labels: ,