Wednesday, January 21, 2009

Shell Script Performance, Take 2

In this 2nd part of shell script performance, I am going to show you with another example and walk you through my performance journey.

Suppose you need to change 1000 files' extension to another extension (eg, from .txt to .log). What you need to do is to write a script to do that. Let's start with a straightforward approach like this:

$ ls -1 *.txt | wc -l
1000

$ cat ext1.sh
#! /bin/bash

if [ $# -ne 2 ]; then
        echo "Usage: $0 <original-extension> <new-extension>"
        exit 1
fi
ext_o=$1
ext_n=$2


for i in *.$ext_o
do
        mv $i `basename $i $ext_o`$ext_n

done

$ time ./ext1.sh txt log

real    4m45.135s
user    0m10.040s
sys     0m58.621s

$ ls -1 *.log | wc -l
1000

As you can see, you need to run "basename" command 1000 times to take away the extension before you can rename all these files and that's the overhead of forking out 1000 processes. As I mentioned in my previous blog, it needs to make 160+ system calls for an empty shell with 0.13 seconds run. All these will add up to slow down your script when you are working with large data set. In order to increase your script's performance, you need to basically follow these principles:

  1. Use the most appropriate tools for the job (more efficient)
  2. Try to reduce the amount of commands to run (less forking)
  3. Explore all the options and capability of every command (may reduce no. of commands)
  4. Explore the functionality of built-in shell construct (so no forking)

In my second attempt, I will explore a built-in pattern matching in Bash shell.
${variable%pattern}, - If the pattern matches the end of the variable's value, delete the shortest part that matches and return the rest.
Example:

$ f=abc.txt

$ echo ${f%.txt}.log
abc.log

Here is the run time:

$ ls -1 *log | wc -l
1000

$ cat ext2.sh
#! /bin/bash

if [ $# -ne 2 ]; then
        echo "Usage: $0 <original-extension> <new-extension>"
        exit 1
fi
ext_o=$1
ext_n=$2


for i in *.$ext_o
do
        mv $i ${i%.$ext_o}.$ext_n

done

$ time ./ext2.sh log txt

real    2m6.684s
user    0m4.765s
sys     0m28.804s

$ ls -1 *.txt | wc -l
1000

That's 2.2 times faster than the first version. Is this good enough ? Let's explore other scripting languages. The advantage of modern scripting languages (like Perl, Python, Tcl, ... just to name a few) has a lot of capabilities built-in without having the script to fork processes. Below shows my Tcl and Python implementation.

In Tcl:

$ ls -1 *.txt | wc -l
1000

$ cat ext.tcl
#! /cygdrive/c/Tcl8.4.19/bin/tclsh


if { $argc != 2 } {
        puts stderr "Usage: $argv0 <original-ext> <new-ext>"
        exit 1
}
set ext_o [lindex $argv 0]
set ext_n [lindex $argv 1]


foreach ofile [glob -nocomplain *.$ext_o] {
        file rename $ofile "[file rootname $ofile].$ext_n"
}

$ time ./ext.tcl txt log

real    0m12.149s
user    0m0.000s
sys     0m0.015s

$ ls -1 *log | wc -l
1000

In Python:

$ ls -1 *log | wc -l
1000

$ cat ext.py
#! /usr/bin/python

import os, sys

if len(sys.argv) != 3:
        print "Usage:", sys.argv[0], "<original-ext> <new-ext>"
        sys.exit(1)

ext_o = os.path.extsep + sys.argv[1]
ext_n = os.path.extsep + sys.argv[2]

for f in os.listdir('.'):
        [fname,ext] = os.path.splitext(f)
        if ext == ext_o:
                f_new = fname + ext_n
                os.rename(f,f_new)

$ time ./ext.py log txt

real    0m4.307s
user    0m0.062s
sys     0m1.653s

$ ls -1 *.txt | wc -l
1000

Wow, we are able to reduce the runtime from less than 5min to just over 4sec. That's over 66 times faster than the first version.

  1. 4m45.135s (1st version of shell script)
  2. 2m6.684s (2nd version, using built-in Bash function)
  3. 0m12.149s (Tcl version)
  4. 0m4.307s (Python version)

Next time when you write scripts, make sure your script is tested with a large data set for performance reason. BTW, can you come up with the fastest script to create 1000 files (eg, file-0001.txt ... file-10000.txt) in some running sequence ? Stay tune for "Shell Script Performance, Take 3" 'cos I will show you how to take advantage of built-in functions in shell so that your script can run as fast as Python.

Labels: , , ,

6 Comments:

Blogger Mathieu said...

How does this compare the rename command?

rename .oldext .newext *

5:37 AM  
Blogger pjz said...

touch `seq -f file-%g.txt`

6:08 AM  
Blogger pjz said...

bah. of course I meant:

touch `seq -f file-%05g.txt 1 10000`

6:10 AM  
Blogger chihungchan said...

Thanks for all the comments.

Mathieu:
rename command is only available in Linux, not Solaris. Apparently rename is more efficient than mv. In my CentOS box, rename took 26 system calls to rename a single file and mv took 106. mv is definitely more powerful than rename and therefore they have to link with more sharable objects. This resulted in more open/close/... in system calls. (pre tag is not available in comment, therefore cannot show you the details)


Hi Pjz,
Thanks for highlighting the "-f" formatting flag. It is indeed very useful. Sad to say that seq is introduced in the OpenSolaris, not yet in the mainstream of Solaris. However, I think we can write a shell function based on AWK to achieve the same effect.

8:47 AM  
Blogger Marco Fabbri said...

Very nice read, two notes.

First, if you are iterating over a huge number of files you can also use the following "idiom":

ls -1 *pattern | while read f; do ...; done

this way you have the file listing happening in parallel to the rest of the script, and you also prevent the number of members in the for enumeration from overflowing. For small number of files this is actually slower as you pay the cost of spawning an additional process (ls) and setting up the pipeline.

Second, the seq command can be implemented (input error checking aside) also relying on bash arithmetics:

myseq () { c=$1; while [ $c -lt $2 ]; do echo $c; c=$(( c + 1 )); done; };

10:11 PM  
Blogger chihungchan said...

Thanks for the notes, Marco. I was preparing for my "Take 3" when your comment came. I implemented my own "seq" using awk, which I only have to execute 1 command to create N number of file names.

10:31 PM  

Post a Comment

<< Home