Querying ZIP, JAR, and MS Office Files


DataDirect XQuery supports the use of fn:collection to directly query XML files archived in a ZIP or JAR file, without first unpacking the archive file. This feature is useful for querying many types of business documents (word-processing documents, spreadsheets, charts, and graphical images such as drawings and presentations) stored in an XML format such as MS Office Open XML and OpenDocument Format.

In the following example, suppose you have multiple XML files archived in the ZIP file xml.zip. Each XML file contains information about one book, and you want to create a single XML document that contains a list of all your books.

<books>{
  for $book in collection("zip:file:///c:/xml.zip")//books
  return
    <myBook>{$book/book/title}</myBook>
}</books>

The result would look something like this:

<books>
  <myBook>
    <title>Emma</title>
  </myBook>
  <myBook>
    <title>Pride and Prejudice</title>
  </myBook>
  . . .
</books>  

The syntax of the collection function for this feature is:

collection("zip_or_jar_url(?option(;option)*)?") 

where:

zip_or_jar_url is a URL referencing a ZIP or JAR file. The URL must use the zip: or jar: scheme.

option is {(select="REGEX") | (recurse=(yes|no)) | (sort=[a,t,r]+) | (xquery-regex=(yes|no))}

where:

select contains a regular expression (REGEX) that determines which files in the archive are selected. If select is not specified, all files are selected.

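sort determines how the retrieved files are sorted. Its value is one or more of the letters a, t, and r.

For example, the following query combines the select and recurse options: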

<books>{
  for $book in
    collection("zip:file:///c:/xml.zip?select=*.xml;recurse=yes")//books
  return
    <myBook>{$book/book/title}</myBook>
}</books>
 

xquery-regex determines what type of regular expression syntax is specified in select.

  • If set to no (the default), the select pattern uses conventional wildcard syntax. For example, *.xml selects all files with an xml extension. More precisely, the select pattern is converted to a regular expression by prepending "^", appending "$", replacing "." with "\.", and replacing "*" with ".*". The resulting expression is then matched against the file names in the archive using the XQuery regular expression rules. So, for example, you can specify *.(xml|xhtml) to match files with either of these two file extensions.
    Note, however, that special characters used in the URL may need to be escaped using the %HH convention, which can be achieved using the iri-to-uri function.

  • If set to yes, the select pattern is interpreted using the regular expression syntax supported by XQuery. In this case, some characters in the pattern, such as the backslash (\), may need to be escaped in the URL. For example:
    select=.*\.xml$ must be written as select=.*%5C.xml$
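
As a rough illustration of the two xquery-regex settings, the following sketch reproduces the documented pattern conversion with standard XQuery functions and shows how iri-to-uri escapes a collection URL. The function local:select-to-regex and the file names emma.xml and emma.xml.bak are hypothetical and exist only for this example. The helper simply mirrors the conversion steps described above, and the product's internal conversion may handle additional cases.

```xquery
(: Hypothetical helper, for illustration only: converts a conventional
   select pattern to a regular expression using the documented steps. :)
declare function local:select-to-regex($pattern as xs:string) as xs:string {
  (: Escape "." before expanding "*", so the "." in ".*" is left alone. :)
  concat("^", replace(replace($pattern, "\.", "\\."), "\*", ".*"), "$")
};

(
  local:select-to-regex("*.xml"),                             (: ^.*\.xml$ :)
  local:select-to-regex("*.(xml|xhtml)"),                     (: ^.*\.(xml|xhtml)$ :)
  matches("emma.xml", local:select-to-regex("*.xml")),        (: true :)
  matches("emma.xml.bak", local:select-to-regex("*.xml")),    (: false :)
  (: With xquery-regex=yes, iri-to-uri escapes the backslash as %5C. :)
  iri-to-uri("zip:file:///c:/xml.zip?select=.*\.xml$;xquery-regex=yes")
)
```

The last expression returns zip:file:///c:/xml.zip?select=.*%5C.xml$;xquery-regex=yes, which can then be passed to collection.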

See also Collection URI Resolvers.

Creating and Updating ZIP Files

You can use the ddtek:serialize-to-url function to create new ZIP files and add files to an existing ZIP file. See ddtek:serialize-to-url for more information.