PIG XML Parsing

JP
Aug 2, 2015

Feeding the Pig with XML

A lot of business data is still delivered as XML, and XML is tough to parse, especially in Pig. There are two approaches to parsing an XML file in Pig:

1. Using Regular Expressions
2. Using XPath

For simplicity, let's work with the XML shown below (save it as sample.xml). The file is placed in HDFS for processing; the path used here is /tmp/sample.xml.

<row _id="1" _uuid="F9B3B00E-49D0-47AC-9E7B-10984D6BD5C8" _position="1" _address="http://data.illinois.gov/resource/_2hzd-qc46/1">
  <idno>196</idno>
  <course_provider>1st All Around (English)</course_provider>
  <contact_person>Marcin Swierzowski</contact_person>
  <address>370 55TH STREET</address>
  <city>Clarendon Hills</city>
  <state>IL</state>
  <zip_code>60514</zip_code>
</row>
<row _id="2" _uuid="62CFE15E-3DEF-416F-B338-7F7B69DF6C3A" _position="2" _address="http://data.illinois.gov/resource/_2hzd-qc46/2">
  <idno>195</idno>
  <course_provider>1st All Around (Polish)</course_provider>
  <contact_person>Marcin Swierzowski</contact_person>
  <address>370 55TH STREET</address>
  <city>Clarendon Hills</city>
  <state>IL</state>
  <zip_code>60514</zip_code>
</row>

Using Regular Expressions

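Use XMLLoader() from the piggybank UDF library to load the XML, so make sure piggybank is registered (recent versions of Pig should have it available by default). XMLLoader('row') returns each <row>...</row> element as a single chararray, which is then parsed with a regular expression.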

REGISTER piggybank.jar; -- needed only on older Pig versions

-- Each <row>...</row> element is loaded as one chararray.
A = LOAD '/tmp/sample.xml' USING org.apache.pig.piggybank.storage.XMLLoader('row') AS (x:chararray);

-- One capture group per child element; FLATTEN turns the resulting tuple into fields.
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(x,'<row.*>\\s*<idno>(.*)</idno>\\s*<course_provider>(.*)</course_provider>\\s*<contact_person>(.*)</contact_person>\\s*<address>(.*)</address>\\s*<city>(.*)</city>\\s*<state>(.*)</state>\\s*<zip_code>(.*)</zip_code>\\s*</row>'));

DUMP B;

Result is below:

(196,1st All Around (English),Marcin Swierzowski,370 55TH STREET,Clarendon Hills,IL,60514)
(195,1st All Around (Polish),Marcin Swierzowski,370 55TH STREET,Clarendon Hills,IL,60514)
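
The tuples above carry no field names, so later statements would have to refer to them by position ($0, $1, and so on). The following is a minimal sketch, assuming the same sample.xml and the same regular expression, that attaches a schema to the flattened result. The field names in the AS clause and the int cast on idno are illustrative choices, not part of the original script.

```pig
-- Sketch only: the field names in the AS clause are illustrative.
-- REGISTER piggybank.jar;  -- uncomment on older Pig versions

A = LOAD '/tmp/sample.xml'
    USING org.apache.pig.piggybank.storage.XMLLoader('row') AS (x:chararray);

-- Same regular expression as above; AS names each captured group.
B = FOREACH A GENERATE FLATTEN(
        REGEX_EXTRACT_ALL(x, '<row.*>\\s*<idno>(.*)</idno>\\s*<course_provider>(.*)</course_provider>\\s*<contact_person>(.*)</contact_person>\\s*<address>(.*)</address>\\s*<city>(.*)</city>\\s*<state>(.*)</state>\\s*<zip_code>(.*)</zip_code>\\s*</row>')
    ) AS (idno:chararray, course_provider:chararray, contact_person:chararray,
          address:chararray, city:chararray, state:chararray, zip_code:chararray);

-- REGEX_EXTRACT_ALL yields chararray fields; cast where a number is needed.
C = FOREACH B GENERATE (int)idno AS idno, course_provider, city;

DUMP C;
```

With named fields, downstream FILTER, GROUP, or JOIN statements can use idno or city directly instead of positional references.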

Using XPath

XPath is a function that extracts text from XML. Starting with Pig 0.13, the piggybank UDF library comes with XPath support, which eases XML parsing in Pig scripts. It's always better to use XPath if you want to extract only a few items from the XML. A sample script using XPath is shown below.

REGISTER piggybank.jar; -- needed only on older Pig versions
DEFINE XPath org.apache.pig.piggybank.evaluation.xml.XPath();

A = LOAD '/tmp/sample.xml' USING org.apache.pig.piggybank.storage.XMLLoader('row') AS (x:chararray);

-- Pull out only the elements of interest from each <row> record.
B = FOREACH A GENERATE XPath(x, 'row/course_provider'), XPath(x, 'row/city');

DUMP B;

Result below:

(1st All Around (English),Clarendon Hills)

(1st All Around (Polish),Clarendon Hills)
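
Because XPath returns one value per call, it is easy to name each extracted field and carry on with ordinary Pig operations. The following is a minimal sketch under stated assumptions: the aliases, the filter on state, and the output path /tmp/sample_out are hypothetical and chosen only for illustration.

```pig
-- Sketch only: aliases, the filter condition, and the output path are illustrative.
-- REGISTER piggybank.jar;  -- uncomment on older Pig versions
DEFINE XPath org.apache.pig.piggybank.evaluation.xml.XPath();

A = LOAD '/tmp/sample.xml'
    USING org.apache.pig.piggybank.storage.XMLLoader('row') AS (x:chararray);

-- Name each extracted value so later statements can refer to it.
B = FOREACH A GENERATE
        XPath(x, 'row/course_provider') AS course_provider,
        XPath(x, 'row/city')            AS city,
        XPath(x, 'row/state')           AS state;

-- Keep only the Illinois rows and write them out as tab-separated text.
C = FILTER B BY state == 'IL';
STORE C INTO '/tmp/sample_out' USING PigStorage();
```

Adding another field is a matter of one more XPath call, whereas the regular-expression approach requires the pattern to describe the whole record.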

Thanks for reading.
