Querying ZIP, JAR, and MS Office Files
DataDirect XQuery supports the use of fn:collection to directly query XML files archived in a ZIP or JAR file, without first unpacking the archive file. This feature is useful for querying many types of business documents (word-processing documents, spreadsheets, charts, and graphical images such as drawings and presentations) stored in an XML format such as MS Office Open XML and OpenDocument Format.
In the following example, suppose you have multiple XML files archived in the ZIP file xml.zip. Each XML file contains information about one book, and you want to create a single XML document that lists all of your books.
    <books>{
      for $book in collection("zip:file:///c:/xml.zip")//books
      return <myBook>{$book/book/title}</myBook>
    }</books>
The result would look something like this:
    <books>
      <myBook>
        <title>Emma</title>
      </myBook>
      <myBook>
        <title>Pride and Prejudice</title>
      </myBook>
      . . .
    </books>
The function’s declaration for this feature is:
where:
zip_or_jar_url
is a URL referencing a ZIP or JAR file. The URL must use the zip: or jar: scheme.
option
is {(select="REGEX") | recurse={yes | no} | (sort=[a,t,r]+) | (xquery-regex=(yes|no))}
where:
select
contains a regular expression (REGEX), which determines which files in the directory are selected. If select is not specified, any file is assumed.
sort
determines how the retrieved files are sorted, as follows:
- a sorts alphabetically (ascending).
- t sorts by modification time (beginning with most recent).
- r, combined with a or t, reverses the sort order.
recurse
determines whether subdirectories archived in the ZIP or JAR file are searched. The default is no. To search subdirectories, set recurse to yes, for example:
    <books>{
      for $book in collection("zip:file:///c:/xml.zip?select=*.xml;recurse=yes")//books
      return <myBook>{$book/book/title}</myBook>
    }</books>
xquery-regex
determines what type of regular expression syntax is specified in select.
- If set to no (the default), the select pattern syntax takes the conventional form. For example, *.xml selects all files with an xml extension. More generally, the select pattern is converted to a regular expression by prepending "^", appending "$", replacing "." with "\.", and replacing "*" with ".*". Then, the select pattern is used to match the file names appearing in the directory using the XQuery regular expression rules. So, for example, you can specify *.(xml|xhtml) to match files with either of these two file extensions.
Note, however, that special characters used in the URL may need to be escaped using the %HH convention, which can be done with the iri-to-uri function.
- If set to yes, the select pattern syntax as supported by XQuery is assumed. In this case, some characters may need to be escaped, such as the backslash character (\) in a file name. For example:
    select=.*\.xml$
must be written as:
    select=.*%5C.xml$
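The two select syntaxes can be compared with a short query that uses only the standard fn:replace, fn:matches, and fn:iri-to-uri functions. This is a minimal sketch, not the product's internal implementation: the helper local:select-to-regex is a hypothetical name that mirrors the conversion described for xquery-regex=no, and combining select with xquery-regex=yes in one URL assumes that options are separated by semicolons, as in the recurse example.

```xquery
(: Hypothetical helper: mimics the documented conversion of a conventional
   select pattern (xquery-regex=no) into an XQuery regular expression. :)
declare function local:select-to-regex($pattern as xs:string) as xs:string
{
  (: Escape literal dots first, then expand each asterisk, then anchor.
     The order matters: expanding asterisks first would introduce new dots
     that the dot-escaping step would then wrongly escape. :)
  concat("^",
         replace(replace($pattern, "\.", "\\."), "\*", ".*"),
         "$")
};

<patterns>
  <regex>{ local:select-to-regex("*.xml") }</regex>
  <regex>{ local:select-to-regex("*.(xml|xhtml)") }</regex>
  <match>{ matches("emma.xhtml", local:select-to-regex("*.(xml|xhtml)")) }</match>
  <!-- With xquery-regex=yes the pattern is written as a regular expression,
       so the backslash must be %HH-escaped before the URL is used. -->
  <escaped>{ iri-to-uri("zip:file:///c:/xml.zip?select=.*\.xml$;xquery-regex=yes") }</escaped>
</patterns>
```

The two regex elements should contain ^.*\.xml$ and ^.*\.(xml|xhtml)$, and the escaped element should show the backslash replaced by %5C, matching the select=.*%5C.xml$ form above. The resulting string can then be passed to collection.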
See also Collection URI Resolvers.
Creating and Updating ZIP Files
You can use the ddtek:serialize-to-url function to create new ZIP files and add files to an existing ZIP file. See ddtek:serialize-to-url for more information.