Sunday, January 23, 2011

Building a High Performance Cluster with Amazon Web Services

Watch this video to see how easy it is to build an HPC cluster with Amazon EC2.

Saturday, January 15, 2011

xmllint for XML Namespace

When an XML document declares a namespace, xmllint cannot traverse it using the bare tag names. There are not many examples on the Internet showing how xmllint handles XML with namespaces, and this post aims to fill that gap. Let's try xmllint with this sample XML file:
$ cat ns.xml
<?xml version="1.0"?>
<Tests xmlns="http://www.adatum.com">
  <Test TestId="0001" TestType="CMD">
    <Name>Convert number to string</Name>
    <CommandLine>Examp1.EXE</CommandLine>
    <Input>1</Input>
    <Output>One</Output>
  </Test>
  <Test TestId="0002" TestType="CMD">
    <Name>Find succeeding characters</Name>
    <CommandLine>Examp2.EXE</CommandLine>
    <Input>abc</Input>
    <Output>def</Output>
  </Test>
  <Test TestId="0003" TestType="GUI">
    <Name>Convert multiple numbers to strings</Name>
    <CommandLine>Examp2.EXE /Verbose</CommandLine>
    <Input>123</Input>
    <Output>One Two Three</Output>
  </Test>
  <Test TestId="0004" TestType="GUI">
    <Name>Find correlated key</Name>
    <CommandLine>Examp3.EXE</CommandLine>
    <Input>a1</Input>
    <Output>b1</Output>
  </Test>
  <Test TestId="0005" TestType="GUI">
    <Name>Count characters</Name>
    <CommandLine>FinalExamp.EXE</CommandLine>
    <Input>This is a test</Input>
    <Output>14</Output>
  </Test>
  <Test TestId="0006" TestType="GUI">
    <Name>Another Test</Name>
    <CommandLine>Examp2.EXE</CommandLine>
    <Input>Test Input</Input>
    <Output>10</Output>
  </Test>
</Tests>

$ xmllint --shell ns.xml
/ > cd Tests
Tests is a 0 Node Set
/ >

To traverse an XML file that declares a namespace, you first need to bind the namespace URI to a prefix with setns.

$ head -2 ns.xml
<?xml version="1.0"?>
<Tests xmlns="http://www.adatum.com">

$ xmllint --shell ns.xml
/ > setns a=http://www.adatum.com
/ > cd a:Tests
Tests > cd a:Test
a:Test is a 6 Node Set
Tests > cd a:Test[3]
Test > dir
ELEMENT Test
  ATTRIBUTE TestId
    TEXT
      content=0003
  ATTRIBUTE TestType
    TEXT
      content=GUI
Test > cat
<Test TestId="0003" TestType="GUI">
    <Name>Convert multiple numbers to strings</Name>
    <CommandLine>Examp2.EXE /Verbose</CommandLine>
    <Input>123</Input>
    <Output>One Two Three</Output>
  </Test>
Test >
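
The same rule (an element in a default namespace can only be reached through a prefix bound to that namespace URI) applies outside xmllint as well. Below is a minimal sketch using Python's standard xml.etree.ElementTree rather than xmllint, on a trimmed copy of ns.xml. The prefix "a" and the variable names are arbitrary choices for this illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of ns.xml: same default namespace, first three tests only.
NS_XML = """<?xml version="1.0"?>
<Tests xmlns="http://www.adatum.com">
  <Test TestId="0001" TestType="CMD"><Name>Convert number to string</Name></Test>
  <Test TestId="0002" TestType="CMD"><Name>Find succeeding characters</Name></Test>
  <Test TestId="0003" TestType="GUI"><Name>Convert multiple numbers to strings</Name></Test>
</Tests>"""

root = ET.fromstring(NS_XML)

# Bind the namespace URI to a prefix of our own choosing,
# much like "setns a=http://www.adatum.com" in the xmllint shell.
ns = {"a": "http://www.adatum.com"}

plain = root.findall("Test")            # bare tag name: matches nothing
prefixed = root.findall("a:Test", ns)   # namespace-qualified: matches every Test
third = root.find("a:Test[3]", ns)      # positional predicate, as in "cd a:Test[3]"

print(len(plain), len(prefixed), third.get("TestId"))
```

The bare lookup comes back empty for the same reason "cd Tests" reported a 0 Node Set: the elements live in the http://www.adatum.com namespace, so their unqualified names do not match.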

If you have more than one namespace to work with, bind each one to its own prefix. The prefixes you choose do not have to match the ones declared in the document; only the namespace URIs have to match.

$ cat ns2.xml
<h:html xmlns:xdc="http://www.xml.com/books"
        xmlns:h="http://www.w3.org/HTML/1998/html4">
 <h:head><h:title>Book Review</h:title></h:head>
 <h:body>
  <xdc:bookreview>
   <xdc:title>XML: A Primer</xdc:title>
   <h:table>
    <h:tr align="center">
     <h:td>Author</h:td><h:td>Price</h:td>
     <h:td>Pages</h:td><h:td>Date</h:td></h:tr>
    <h:tr align="left">
     <h:td><xdc:author>Simon St. Laurent</xdc:author></h:td>
     <h:td><xdc:price>31.98</xdc:price></h:td>
     <h:td><xdc:pages>352</xdc:pages></h:td>
     <h:td><xdc:date>1998/01</xdc:date></h:td>
    </h:tr>
   </h:table>
  </xdc:bookreview>
 </h:body>
</h:html>


$ xmllint --shell ns2.xml
/ > cd h:html
h:html is a 0 Node Set
/ > setns h=http://www.w3.org/HTML/1998/html4
/ > setns xdc=http://www.xml.com/books
/ > cd h:html/h:body/xdc:bookreview/xdc:title
title > cat
<xdc:title>XML: A Primer</xdc:title>
title > 
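
To illustrate that the prefixes are only local nicknames for the namespace URIs, here is a small sketch, again using Python's standard xml.etree.ElementTree as a stand-in for xmllint. It queries a trimmed copy of ns2.xml with the prefixes "page" and "book", which are invented for this example and do not appear in the document.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of ns2.xml; the document itself uses the prefixes h and xdc.
NS2_XML = """<h:html xmlns:xdc="http://www.xml.com/books"
        xmlns:h="http://www.w3.org/HTML/1998/html4">
 <h:body>
  <xdc:bookreview>
   <xdc:title>XML: A Primer</xdc:title>
  </xdc:bookreview>
 </h:body>
</h:html>"""

root = ET.fromstring(NS2_XML)  # root is the h:html element

# Hypothetical prefixes: only the URIs have to match the document.
ns = {
    "page": "http://www.w3.org/HTML/1998/html4",
    "book": "http://www.xml.com/books",
}

title = root.find("page:body/book:bookreview/book:title", ns)
print(title.text)
```

The lookup succeeds even though neither "page" nor "book" occurs in the file, because matching is done on the namespace URI, not on the prefix text.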

Friday, January 07, 2011

xmllint - Answer to an XML Question

Today I was asked how I would validate or check the well-formedness of an XML file. My immediate answer was to view the XML in a browser, and my second was to parse it with tdom. xmllint did not come to mind at the time because I seldom use it.

After some thought, I decided I should validate my answer. I downloaded a fairly sizeable XML file from the Mondial project for my test and deliberately removed one closing tag to make it not well-formed. Neither tdom nor Firefox was able to identify the location of the missing closing tag; both only report line 39564, where the mismatch is detected. Only xmllint pinpoints the problem, because it also reports line 16795, where the unclosed country element was opened:

$ diff mondial.xml mondial-malformed.xml 
16819d16818
<    </country>


$ firefox mondial-malformed.xml

Firefox
XML Parsing Error: mismatched tag. Expected: </country>.
Location: file:///home/chihung/Projects/xmllint/mondial-malformed.xml
Line Number 39564, Column 3:</mondial>
--^


$ tclsh
% package require tdom
0.8.3
% set doc [dom parse [tDOM::xmlReadFile mondial-malformed.xml]]
error "mismatched tag" at line 39564 character 2
"ude>
   </desert>
</m <--Error-- ondial>
"


$ xmllint --shell mondial-malformed.xml 
mondial-malformed.xml:39564: parser error : Opening and ending tag mismatch: country line 16795 and mondial
</mondial>
          ^
mondial-malformed.xml:39565: parser error : Premature end of data in tag mondial line 3

^
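
What makes the xmllint message more useful is that it names the line where the unclosed element was opened, not just the line where the mismatch was detected. As a rough illustration of that idea (not of how xmllint is actually implemented), the Python sketch below keeps a stack of open tags using the standard-library expat bindings. The function name find_unclosed and the tiny sample document are made up for this example.

```python
import xml.parsers.expat

def find_unclosed(xml_text):
    """Return (tag, line_opened, line_of_error) for the first parse error,
    or None if the text is well-formed."""
    stack = []  # (tag name, line on which the tag was opened)
    parser = xml.parsers.expat.ParserCreate()

    def on_start(name, attrs):
        # Remember where each element was opened.
        stack.append((name, parser.CurrentLineNumber))

    def on_end(name):
        # A properly closed element leaves the stack.
        if stack:
            stack.pop()

    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    try:
        parser.Parse(xml_text, True)
    except xml.parsers.expat.ExpatError as err:
        # The innermost element still open is the likeliest culprit.
        tag, opened = stack[-1] if stack else (None, None)
        return tag, opened, err.lineno
    return None

# Hypothetical miniature of the malformed file: </country> is missing.
BROKEN = "<mondial>\n<country>\n<name>A</name>\n</mondial>\n"

print(find_unclosed(BROKEN))
```

On the sample, the sketch reports the country element opened on line 2 and the error detected on line 4, which mirrors the "country line 16795 and mondial" wording in the xmllint output above.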

OK, xmllint is clearly the winner in this exercise. Here is xmllint in action:

$ xmllint --shell mondial.xml 
/ > help
 base         display XML base of the node
 setbase URI  change the XML base of the node
 bye          leave shell
 cat [node]   display node or current node
 cd [path]    change directory to path or to root
 dir [path]   dumps informations about the node (namespace, attributes, content)
 du [path]    show the structure of the subtree under path or the current node
 exit         leave shell
 help         display this help
 free         display memory usage
 load [name]  load a new document with name
 ls [path]    list contents of path or the current directory
 set xml_fragment replace the current node content with the fragment parsed in context
 xpath expr   evaluate the XPath expression in that context and print the result
 setns nsreg  register a namespace to a prefix in the XPath evaluation context
              format for nsreg is: prefix=[nsuri] (i.e. prefix= unsets a prefix)
 setrootns    register all namespace found on the root element
              the default namespace if any uses 'defaultns' prefix
 pwd          display current working directory
 quit         leave shell
 save [name]  save this document to name or the original name
 write [name] write the current node to the filename
 validate     check the document for errors
 relaxng rng  validate the document agaisnt the Relax-NG schemas
 grep string  search for a string in the subtree

/ > validate
mondial.xml:35144: element island: validity error : Syntax of value for attribute sea of island is not valid
validity error : attribute sea line 35144 references an unknown ID ""

/ > base
mondial.xml

/ > dir
DOCUMENT
version=1.0
encoding=UTF-8
URL=mondial.xml
standalone=true

/ > grep Singapore
/mondial/country[105]/name : t--        9 Singapore
/mondial/country[105]/city/name : t--        9 Singapore
/mondial/island[163]/name : t--        9 Singapore

/ > cd /mondial/country[105]

country > cat
<country car_code="SGP" area="632.6" capital="cty-Singapore-Singapore" memberships="org-AsDB org-ASEAN org-Mekong-Group org-CP org-C org-CCC org-ESCAP org-G-77 org-IAEA org-IBRD org-ICC org-ICAO org-ICFTU org-Interpol org-IFRCS org-IFC org-ILO org-IMO org-Inmarsat org-IMF org-IOC org-ISO org-ICRM org-ITU org-Intelsat org-NAM org-PCA org-UN org-UNIKOM org-UPU org-WHO org-WIPO org-WMO org-WTrO">
      <name>Singapore</name>
      <population>3396924</population>
      <population_growth>1.9</population_growth>
      <infant_mortality>4.7</infant_mortality>
      <gdp_total>66100</gdp_total>
      <gdp_ind>28</gdp_ind>
      <gdp_serv>72</gdp_serv>
      <inflation>1.7</inflation>
      <indep_date>1965-08-09</indep_date>
      <government>republic within Commonwealth</government>
      <encompassed continent="asia" percentage="100"/>
      <ethnicgroups percentage="6.4">Indian</ethnicgroups>
      <ethnicgroups percentage="76.4">Chinese</ethnicgroups>
      <ethnicgroups percentage="14.9">Malay</ethnicgroups>
      <city id="cty-Singapore-Singapore" is_country_cap="yes" country="SGP">
         <name>Singapore</name>
         <longitude>103.833</longitude>
         <latitude>1.3</latitude>
         <population year="87">2558000</population>
         <located_at watertype="sea" sea="sea-SouthChinaSea"/>
         <located_on island="island-Singapore"/>
      </city>
   </country>

Finding countries with infant mortality lower than Singapore's (4.7):

country > xpath //country[infant_mortality<4.7]/name/text()
Object is a Node Set :
Set contains 9 nodes:
1  TEXT
    content=Andorra
2  TEXT
    content=Sweden
3  TEXT
    content=Iceland
4  TEXT
    content=Jersey
5  TEXT
    content=Man
6  TEXT
    content=Hong Kong
7  TEXT
    content=Japan
8  TEXT
    content=Anguilla
9  TEXT
    content=Bermuda

country > quit

The same query can be run directly from the command line:

$ time xmllint --xpath '//country[infant_mortality<4.7]/name' --format mondial.xml 
<name>Andorra</name><name>Sweden</name><name>Iceland</name><name>Jersey</name><name>Man</name><name>Hong Kong</name><name>Japan</name><name>Anguilla</name><name>Bermuda</name>

real 0m0.219s
user 0m0.192s
sys 0m0.020s

Alternatively, you can let XPath look up Singapore's value instead of hard-coding 4.7:

$ xmllint --shell mondial.xml
/ > xpath //country[infant_mortality<//country[name="Singapore"]/infant_mortality]/name/text()
Object is a Node Set :
Set contains 9 nodes:
1  TEXT
    content=Andorra
2  TEXT
    content=Sweden
3  TEXT
    content=Iceland
4  TEXT
    content=Jersey
5  TEXT
    content=Man
6  TEXT
    content=Hong Kong
7  TEXT
    content=Japan
8  TEXT
    content=Anguilla
9  TEXT
    content=Bermuda

$ time xmllint --xpath '//country[infant_mortality<//country[name="Singapore"]/infant_mortality]/name' --format mondial.xml 
<name>Andorra</name><name>Sweden</name><name>Iceland</name><name>Jersey</name><name>Man</name><name>Hong Kong</name><name>Japan</name><name>Anguilla</name><name>Bermuda</name>
real 0m2.074s
user 0m2.052s
sys 0m0.016s
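
The dynamic query took about 2.07 s against 0.22 s for the hard-coded one. One plausible explanation, which is an assumption and was not measured here, is that the nested //country[name="Singapore"] lookup is evaluated again for every candidate country. The Python sketch below shows the alternative of looking up the reference value once and then filtering. It uses the standard xml.etree.ElementTree, and apart from Singapore's 4.7 the country names and figures are invented placeholders.

```python
import xml.etree.ElementTree as ET

# Placeholder data: only Singapore's 4.7 comes from mondial.xml;
# "Lowland" and "Highland" are invented for this illustration.
DATA = """<mondial>
  <country><name>Singapore</name><infant_mortality>4.7</infant_mortality></country>
  <country><name>Lowland</name><infant_mortality>3.9</infant_mortality></country>
  <country><name>Highland</name><infant_mortality>12.5</infant_mortality></country>
</mondial>"""

def lower_than(root, reference_name):
    """Names of countries whose infant_mortality is below the reference country's."""
    # Step 1: look up the reference value exactly once.
    threshold = None
    for country in root.iter("country"):
        if country.findtext("name") == reference_name:
            threshold = float(country.findtext("infant_mortality"))
            break
    if threshold is None:
        raise ValueError("no country named %r" % reference_name)

    # Step 2: a single pass over all countries with a plain numeric comparison.
    result = []
    for country in root.iter("country"):
        value = country.findtext("infant_mortality")
        if value is not None and float(value) < threshold:
            result.append(country.findtext("name"))
    return result

root = ET.fromstring(DATA)
print(lower_than(root, "Singapore"))
```

With xmllint the equivalent would be two invocations: one to fetch Singapore's value, and one that uses the value as a literal, as in the faster command shown earlier.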

xmllint is definitely my preferred XML companion. It is fast (the hard-coded query above ran in about 0.2 s on the Mondial file), and its error reporting was more precise than that of Firefox or tdom.
