Wednesday, March 26, 2008

Chinese History and Performance

How can we apply Chinese History knowledge in the area of performance.

Last week I flipped through my son's chinese history book and found this 4000+ years ago story very interesting. A fellow called "Yu" who successfully get the flood under control. I am not going to tell you the story here, you can read it up here (for English) or here (for Chinese).

Basically he didn't follow what his predecessors do by building a dam to control the flood. Instead, he spent a lot of time trying to understand the problems and found out the root cause. At the end, he built a canal to channel the water to the sea.

A lot of time when we encounter a performance problem or poor response time, we think that an upgrade of memory, subscribe a bigger Internet bandwidth, or throw in a bigger server with lots of CPUs will be able to resolve the issue. Sounds very much like building a dam, a bigger dam just like Yu's predecessors did 4000+ years ago. Often time we simply do not understand how various components work together to serve that mouse click.

Labels:

Tcl Is As Fast As Perl

Netflixprize came with a utility check_format.pl to check the submission format before you upload your predictions.

The Perl utility took 11.437 seconds to check my submit file with a total of 2,834,601 lines. So how is Tcl compare with Perl in this "race". It took me about a couple of minutues to crank out a similar utility with almost similar functionalities. Guess what, Tcl is slightly faster than Perl with a run time of 11.128 seconds, that is 0.309 second faster.

My version of check_format:

#! /usr/local/bin/tclsh
if { $argc != 1 } {
    puts stderr "Usage: $argv0 <file>"
    exit 1
}
array set m {
1 53
10 10
...
...
9998 5
9999 3
}
set f [lindex $argv 0]
if { ![file exists $f] } {
    puts stdeer "Error. $f does not exist"
    exit 2
}
set fp [open $argv r]
set lineno 1
while { [gets $fp line] >= 0 } {
    if { [string match {*:} $line] } {
        set movie_id [string range $line 0 end-1]
        if { ![string is integer $movie_id] } {
            puts "Error. Movie_id=$movie_id not integer"
            exit 3
        }
        if { [info exists count] && $count != $m($mid) } {
            puts "Error. Movie_id=$mid, count=$count, should be $m($mid)"
            exit 4
        }

        set mid $movie_id
        set count 0
        set required $m($mid)
    } else {
        incr count
        if { $line < 1.0 || $line > 5.0 } {
            puts "Error. LINE#=$lineno, Movie id=$mid, count=$count"
            exit 5
        }
    }
    incr lineno
}
close $fp
puts "OK!"

See Tcl and Perl in action:

$ perl -v

This is perl, v5.8.4 built for i86pc-solaris-64int
(with 28 registered patches, see perl -V for more detail)

Copyright 1987-2004, Larry Wall

Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on
this system using `man perl' or `perldoc perl'.  If you have access to the
Internet, point your browser at http://www.perl.com/, the Perl Home Page.

$ tclsh
% set tcl_patchLevel
8.4.16
% parray tcl_platform
tcl_platform(byteOrder) = littleEndian
tcl_platform(machine)   = i86pc
tcl_platform(os)        = SunOS
tcl_platform(osVersion) = 5.10
tcl_platform(platform)  = unix
tcl_platform(threaded)  = 1
tcl_platform(user)      = chihung
tcl_platform(wordSize)  = 8
% set tcl_patchLevel
8.4.16
% exit

$ wc -l netflix-submission.txt
  2834601 2834601 11379859 netflix-submission.txt

$ time ../../check_format.pl netflix-submission.txt
OK!

real    0m11.437s
user    0m11.417s
sys     0m0.017s

$ time ./check_format.tcl netflix-submission.txt
OK!

real    0m11.128s
user    0m11.093s
sys     0m0.029s

Tcl has changed a lot and the statement of "Everything is string" is not really true. Internally it has more than one representation. As Donal Fellows once mentioned "Tcl_Obj's are like storks. They have two legs, the internal representation and the string representation. They can stand on either leg, or on both." See the typedef struct Tcl_Obj in the tcl.h (8.4):

typedef struct Tcl_Obj {
    int refCount;               /* When 0 the object will be freed. */
    char *bytes;                /* This points to the first byte of the
                                 * object's string representation. The array
                                 * must be followed by a null byte (i.e., at
                                 * offset length) but may also contain
                                 * embedded null characters. The array's
                                 * storage is allocated by ckalloc. NULL
                                 * means the string rep is invalid and must
                                 * be regenerated from the internal rep.
                                 * Clients should use Tcl_GetStringFromObj
                                 * or Tcl_GetString to get a pointer to the
                                 * byte array as a readonly value. */
    int length;                 /* The number of bytes at *bytes, not
                                 * including the terminating null. */
    Tcl_ObjType *typePtr;       /* Denotes the object's type. Always
                                 * corresponds to the type of the object's
                                 * internal rep. NULL indicates the object
                                 * has no internal rep (has no type). */
    union {                     /* The internal representation: */
        long longValue;         /*   - an long integer value */
        double doubleValue;     /*   - a double-precision floating value */
        VOID *otherValuePtr;    /*   - another, type-specific value */
        Tcl_WideInt wideValue;  /*   - a long long value */
        struct {                /*   - internal rep as two pointers */
            VOID *ptr1;
            VOID *ptr2;
        } twoPtrValue;
    } internalRep;
} Tcl_Obj;

Labels: , ,

Sunday, March 23, 2008

HTTP Status Codes Summary, The Google Chart Way

Do you know that Google Code provides a lot of APIs to do very cool stuff. In this blog, I am going to show you how you can dynamically embed chart in your webpage using Google Chart.

In my previous blog I wrote an AWK program to summarise the HTTP status codes and tabluated the result. Now I am modifying my program to output the source http in the <img> tag to dynamically generate the HTTP status chart. First colour used is green (00ff00) represents HTTP code 200, blue represents 302 and red represent 304. Look under the query string name value pair, "chco".

This is the modified AWK program

#! /bin/bash


if [ $# -ne 1 ]; then
 echo "Usage: $0 <access_log>"
 exit 1
fi
log=$1


if [ ! -f "$log" ]; then
 echo "Error. $log does not exist"
 exit 2
fi


file $log | grep "gzip compressed data" > /dev/null 2>&1
if [ $? -eq 0 ]; then
 cmd="zcat"
else
 cmd="cat"
fi





echo -n "<img src=\"http://chart.apis.google.com/chart?"

# chart type: bvs - bar vertical
echo -n "cht=bvs&"

# chart colours
echo -n "chco=00ff00,0000ff,ff0000,00ffff,ff00ff,ffff00,008800,000088,880000&"

# chart title
echo -n "chtt=HTTP+Status+Code&"

# chart size
echo -n "chs=600x200&"

# bar width
echo -n "chbh=12&"



$cmd $log | gawk '
function chd()
{
 printf("chd=t:")
 for ( c=1 ; c<=nc ; ++c ) {
  for ( d=1 ; d<=nd ; ++d ) {
   printf("%d", a_cd[a_date[d],a_code[c]])
   if ( d<nd ) {
    printf(",")
   } else {
    if ( c!=nc ) {
     printf("|")
    }
   }
  }
 }
 printf("&")
}
function chxl() {
 # find max total per date
 max=0
 for ( d=1 ; d<=nd ; ++d ) {
  total=0
  for ( c=1 ; c<=nc ; ++c ) {
   total+=a_cd[a_date[d],a_code[c]]
  }
  if ( total>max ) {
   max=total
  }
 }

 # per 10,000
 m10k=int(max/10000)+1
 

 printf("chxl=0:|")
 for ( d=1 ; d<=nd ; ++d ) {
  split(a_date[d],x,"/")
  printf("%s|", x[1])
 }
 printf("1:|")
 for ( m=0 ; m<=m10k ; ++m ) {
  printf("%d|", m*10000)
 }
 printf("&")

 # chart axis 0=x, 1=y
 printf("chxt=x,y&")

 # data scaling
 printf("chds=0,%d&",m10k*10000)
}
$(NF-1)>=100 && $(NF-1)<=505 {
 date=substr($4,2,11)
 code=$(NF-1)
 a_code[code]=code
 a_date[date]=date
 a_cd[date,code]++
}
END {
 nc=asort(a_code)
 nd=asort(a_date)

 chd()
 chxl()
}'


echo -n "\">"

And the output is:
(I introduced line-break for ease of reading)

<img src="http://chart.apis.google.com/chart?cht=bvs&chco=00ff00,0000ff,ff0000,00ffff,ff00ff,ffff00,008800,000088,880000&
chtt=HTTP+Status+Code&chs=600x200&chbh=12&chd=t:22038,30732,31988,23525,21865,29891,30866,32001,33043,34076,23703,21811,
30458,32659,34758,32477,33215,23717,21947,33129,32149,34153,32234,34076,23917,22046,32851,36528,36627,33731,20741|8,499,
529,81,1,184,370,608,586,438,63,17,341,348,539,457,406,90,0,378,493,477,312,533,98,0,329,447,664,522,213|290,11427,11718,
2199,246,7874,10107,12380,14069,12374,2604,393,7867,10302,13515,13728,10919,1275,42,11618,14163,13045,10560,12402,1724,6,
12652,14036,14179,10825,6473|0,0,0,0,1,2,4,1,0,0,0,0,1,0,2,0,0,0,0,0,0,1,0,10,0,0,0,0,0,0,0|2,14,14,3,0,0,11,24,42,28,0,1,
18,9,15,22,15,0,0,15,8,9,5,12,4,0,17,6,9,8,12|0,10,6,2,1,5,8,22,11,14,1,1,12,11,13,11,8,3,0,24,18,9,6,4,2,1,8,8,18,10,6|
0,100,203,1,0,60,23,67,151,128,5,3,89,65,79,924,75,80,54,102,78,82,77,70,106,46,58,83,97,87,55&chxl=0:|01|02|03|04|05|06|
07|08|09|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31|1:|0|10000|20000|30000|40000|50000|60000|&
chxt=x,y&chds=0,60000&">

And the dynamic chart is shown below (actually generated by Google, not a static image):

Labels: ,

Wednesday, March 19, 2008

HTTP State Codes Summary, The AWK Way

A colleague of mine is writing an awk program to get the monthly summary of HTTP status codes. This enable him to use that as a reference to cross-check with another commercial web log tool.

His code is something like this

$ awk '{++s[$(NF-1)]}END{for(i in s){print i,s[i]}}' access_log | sort
200 916952
302 10031
304 265012
400 22
401 323
404 253
500 3048

This may serve his purpose for checking. However, I think it is possible to write an entire HTTP status code summary using gawk to present the summary in a per-day basis. "asort" (Sorting Array Values and Indices) function in gawk is very handy in sorting array so that rows and columns can be displayed in order. Below is my implementation:

$ cat ncode.sh
#! /bin/bash


if [ $# -ne 1 ]; then
 echo "Usage: $0 <access_log>"
 echo "       <access_log> can be either plain text or gzip compressed"
 exit 1
fi
log=$1


if [ ! -f "$log" ]; then
 echo "Error. $log does not exist"
 exit 2
fi


file $log | grep "gzip compressed data" > /dev/null 2>&1
if [ $? -eq 0 ]; then
 cmd="zcat"
else
 cmd="cat"
fi


$cmd $log | gawk '
function separator(n)
{
 for ( i=1 ; i<=n ; ++i ) {
  printf("-")
 }
 printf("\n")
}
$(NF-1)>=100 && $(NF-1)<=505 {
 date=substr($4,2,11)
 code=$(NF-1)
 a_code[code]=code
 a_date[date]=date
 a_cd[date,code]++
}
END {
 nc=asort(a_code)
 nd=asort(a_date)

 separator(80)

 # header for http code
 printf("HTTP Codes:")
 for ( c=1 ; c<=nc ; ++c ) {
  printf("%8d", a_code[c])
 }
 printf("   Total\n")

 separator(80)

 # result per date
 for ( d=1 ; d<=nd ; ++d ) {
  printf("%s", a_date[d])
  total=0
  for ( c=1 ; c<=nc ; ++c ) {
   value=a_cd[a_date[d],a_code[c]]
   printf("%8d", value)
   total+=value
  }
  printf("%8d\n", total)
 }

 separator(80)

 # total by code
 printf("Total:     ")
 all=0
 for ( c=1 ; c<=nc ; ++c ) {
  total=0
  for ( d=1 ; d<=nd ; ++d ) {
   value=a_cd[a_date[d],a_code[c]]
   total+=value
  }
  all+=total
  printf("%8d", total)
 }
 printf("%8d\n", all)

 separator(80)
}'

It took just under 15 seconds on my notebook (Intel Celeron 1.4GHz, 512 MB memory) to summarise 1,192,178 lines of web access log with gawk 3.1.6 in Cygwin.

$ ./ncode.sh
Usage: ./ncode.sh <access_log>
       <access_log> can be either plain text or gzip compressed

$ ./ncode.sh access_log.gz
--------------------------------------------------------------------------------
HTTP Codes:     200     302     304     400     401     404     500   Total
--------------------------------------------------------------------------------
01/Jan/2008   22038       8     290       0       2       0       0   22338
02/Jan/2008   30732     499   11427       0      14      10     100   42782
03/Jan/2008   31988     529   11718       0      14       6     203   44458
04/Jan/2008   23525      81    2199       0       3       2       1   25811
05/Jan/2008   21865       1     246       1       0       1       0   22114
06/Jan/2008   29891     184    7874       2       0       5      60   38016
07/Jan/2008   30866     370   10107       4      11       8      23   41389
08/Jan/2008   32001     608   12380       1      24      22      67   45103
09/Jan/2008   33043     586   14069       0      42      11     151   47902
10/Jan/2008   34076     438   12374       0      28      14     128   47058
11/Jan/2008   23703      63    2604       0       0       1       5   26376
12/Jan/2008   21811      17     393       0       1       1       3   22226
13/Jan/2008   30458     341    7867       1      18      12      89   38786
14/Jan/2008   32659     348   10302       0       9      11      65   43394
15/Jan/2008   34758     539   13515       2      15      13      79   48921
16/Jan/2008   32477     457   13728       0      22      11     924   47619
17/Jan/2008   33215     406   10919       0      15       8      75   44638
18/Jan/2008   23717      90    1275       0       0       3      80   25165
19/Jan/2008   21947       0      42       0       0       0      54   22043
20/Jan/2008   33129     378   11618       0      15      24     102   45266
21/Jan/2008   32149     493   14163       0       8      18      78   46909
22/Jan/2008   34153     477   13045       1       9       9      82   47776
23/Jan/2008   32234     312   10560       0       5       6      77   43194
24/Jan/2008   34076     533   12402      10      12       4      70   47107
25/Jan/2008   23917      98    1724       0       4       2     106   25851
26/Jan/2008   22046       0       6       0       0       1      46   22099
27/Jan/2008   32851     329   12652       0      17       8      58   45915
28/Jan/2008   36528     447   14036       0       6       8      83   51108
29/Jan/2008   36627     664   14179       0       9      18      97   51594
30/Jan/2008   33731     522   10825       0       8      10      87   45183
31/Jan/2008   20741     213    6473       0      12       6      55   27500
--------------------------------------------------------------------------------
Total:       916952   10031  265012      22     323     253    3048 1195641
--------------------------------------------------------------------------------

You must be wondering why I need to 'reinvent the wheel' when there are free open source tools (eg. Analog, AWStats, Webalizer, ... ) that can do a much better job because I believe

"I hear and I forget; I see and I remember; I do and I understand"
- Chinese Proverb
"Willing is not enough; we must do. Knowing is not enough; we must apply."
- Bruce Lee
The equivalent in the IT world will be
"I install and I am just an Installer; I use and I am just a User; I write and I am proud to call myself a Software Engineer"
- Chihung's proverb, hopefully someone will quote it in the future :-)

Labels: , ,

Wednesday, March 12, 2008

Netflixprize, A Different Approach

A friend of mine forwarded this link to me, This Psychologist Might Outsmart the Math Brains Competing for the Netflix Prize. This fellow is a 48-year-old Englishman (his team name is known as justaguyinagarage) , a retired management consultant with an undergraduate degree in psychology and a master's in operations research. If you read the article, you will realise that he approached the problem different from other mathematicans and computer scientists. Since his approach is not based on any complex mathematical functions or models, his movie prediction run-time will definitely be shorten by signifiance amount and this is a plus point for anyone who wants to adopt his method.

According to today's Netflix Prize Leaderboard, he ranked 9th with a score of 0.8740 (8.14% improvement over Cinematch).

Recently I have been adjusting my approach and tried to adopt some of his psychology ideas. Run time downs from 10+ days to just about 4 hours, but not much improvement in terms of score. Also I am reading this book - Simple Heuristics That Make Us Smart that he mentioned in his blog. I hope to have some breakthrough in my movie prediction.

Labels:

Wednesday, March 05, 2008

Web Page Loading Time in Internet Explorer

Do you know that you can communication with a Windows application via the COM without knowing much details of how COM works. A Tcl extension, tcom, is all you need if you are a Tcl enthusiast like myself. tcom is a Windows-specific Tcl extension that provides commands to access and implement COM objects. It enables client-side and server-side scripting of COM objects through IDispatch and IUnknown derived interfaces. Download ActiveTcl from ActiveState to try out the below example.

Below is a sample script to find out how long it takes for the web page to be completely displayed in Internet Explorer.

package require tcom

set url http://www.your-web-site.com/
set ie [tcom::ref createobject "InternetExplorer.Application"]
$ie Visible 1
$ie Navigate $url
set t1 [clock seconds]
while { 1 } {
 if { [$ie Busy] == 0 } {
  set t2 [clock seconds]
  break
 }
 after 1000
}
$ie Quit
puts "[expr $t2 - $t1] to display $url in Internet Explorer"

You can also find out the complete list of method descriptions for methods defined in the COM interface. Each method description is a list. The first element is the member ID. The second element is the return type. The third element is the method name. The fourth element is a list of parameter descriptions. (extracted from tcom man page).

$ cat tcom.tcl
package require tcom

set ie [tcom::ref createobject "InternetExplorer.Application"]
set interface [tcom::info interface $ie]
foreach i $interface { 
 puts $i
}
$ie Quit


$ ./tcom.tcl
100 VOID GoBack {}
101 VOID GoForward {}
102 VOID GoHome {}
103 VOID GoSearch {}
104 VOID Navigate {{in BSTR URL} {in {VARIANT *} Flags} {in {VARIANT *} TargetFrameName} {in {VARIANT *} PostData} {in {VARIANT *} Headers}}
-550 VOID Refresh {}
105 VOID Refresh2 {{in {VARIANT *} Level}}
106 VOID Stop {}
200 {DISPATCH *} Application {}
201 {DISPATCH *} Parent {}
202 {DISPATCH *} Container {}
203 {DISPATCH *} Document {}
204 {BOOL *} TopLevelContainer {}
205 {BSTR *} Type {}
206 {I4 *} Left {}
206 VOID Left {{in I4 pl} {in {I4 *} propertyValue} {in {I4 *} propertyValue} {in {I4 *} propertyValue}}
207 {I4 *} Top {}
207 VOID Top {{in I4 pl} {in {I4 *} propertyValue} {in {I4 *} propertyValue} {in {I4 *} propertyValue}}
208 {I4 *} Width {}
208 VOID Width {{in I4 pl} {in {I4 *} propertyValue} {in {I4 *} propertyValue} {in {I4 *} propertyValue}}
209 {I4 *} Height {}
209 VOID Height {{in I4 pl} {in {I4 *} propertyValue} {in {I4 *} propertyValue} {in {I4 *} propertyValue}}
210 {BSTR *} LocationName {}
211 {BSTR *} LocationURL {}
212 {BOOL *} Busy {}
300 VOID Quit {}
301 VOID ClientToWindow {{{in out} {INT *} pcx} {{in out} {INT *} pcy}}
302 VOID PutProperty {{in BSTR Property} {in VARIANT vtValue}}
303 {VARIANT *} GetProperty {{in BSTR Property}}
0 {BSTR *} Name {}
-515 {I4 *} HWND {}
400 {BSTR *} FullName {}
401 {BSTR *} Path {}
402 {BOOL *} Visible {}
402 VOID Visible {{in BOOL pBool} {in {BOOL *} propertyValue} {in {BOOL *} propertyValue}}
403 {BOOL *} StatusBar {}
403 VOID StatusBar {{in BOOL pBool} {in {BOOL *} propertyValue} {in {BOOL *} propertyValue}}
404 {BSTR *} StatusText {}
404 VOID StatusText {{in BSTR StatusText} {in {BSTR *} propertyValue} {in {BSTR *} propertyValue}}
405 {INT *} ToolBar {}
405 VOID ToolBar {{in INT Value} {in {INT *} propertyValue} {in {INT *} propertyValue}}
406 {BOOL *} MenuBar {}
406 VOID MenuBar {{in BOOL Value} {in {BOOL *} propertyValue} {in {BOOL *} propertyValue}}
407 {BOOL *} FullScreen {}
407 VOID FullScreen {{in BOOL pbFullScreen} {in {BOOL *} propertyValue} {in {BOOL *} propertyValue}}
500 VOID Navigate2 {{in {VARIANT *} URL} {in {VARIANT *} Flags} {in {VARIANT *} TargetFrameName} {in {VARIANT *} PostData} {in {VARIANT *} Headers}}
501 {I4 *} QueryStatusWB {{in I4 cmdID}}
502 VOID ExecWB {{in I4 cmdID} {in I4 cmdexecopt} {in {VARIANT *} pvaIn} {{in out} {VARIANT *} pvaOut}}
503 VOID ShowBrowserBar {{in {VARIANT *} pvaClsid} {in {VARIANT *} pvarShow} {in {VARIANT *} pvarSize}}
-525 {I4 *} ReadyState {}
550 {BOOL *} Offline {}
550 VOID Offline {{in BOOL pbOffline} {in {BOOL *} propertyValue}}
551 {BOOL *} Silent {}
551 VOID Silent {{in BOOL pbSilent} {in {BOOL *} propertyValue}}
552 {BOOL *} RegisterAsBrowser {}
552 VOID RegisterAsBrowser {{in BOOL pbRegister} {in {BOOL *} propertyValue}}
553 {BOOL *} RegisterAsDropTarget {}
553 VOID RegisterAsDropTarget {{in BOOL pbRegister} {in {BOOL *} propertyValue}}
554 {BOOL *} TheaterMode {}
554 VOID TheaterMode {{in BOOL pbRegister} {in {BOOL *} propertyValue}}
555 {BOOL *} AddressBar {}
555 VOID AddressBar {{in BOOL Value} {in {BOOL *} propertyValue}}
556 {BOOL *} Resizable {}
556 VOID Resizable {{in BOOL Value} {in {BOOL *} propertyValue}}

With such a rich interface, it is possible to traverse a set of URLs or simulate the user's click stream.

Labels: ,