Starting in 1996, Alexa Internet has been donating their crawl data to the Internet Archive. Flowing in every day, these data are added to the Wayback Machine after an embargo period.
You need to convert some XML into data
suitable for importing into another application such as a
spreadsheet.
5.2.2. Solution
Many applications import delimited data. The most common format is
called Comma Separated Values (CSV). Many spreadsheets and
databases can handle CSV and other forms of delimited data. Mapping
XML to delimited data can be simple or complex, depending on the
difficulty of the mapping. This section starts with simple cases and
progresses toward more complicated scenarios.
5.2.2.1Create a CSV file from flat attribute-encoded elements
In this scenario, you have a flat XML
file with elements mapping to rows and attributes mapping to columns.
This problem is trivial for any given XML file of the appropriate
format. For example, the following stylesheet shown in Example 5-6 through Example 5-8 outputs a
CSV based on the input people.xml.
Al Zehtooney,33,m,no
Brad York,38,m,yes
Charles Xavier,32,m,no
David Willimas,33,m,no
Edward Ulster,33,m,yes
Frank Townsend,35,m,no
Greg Sutter,40,m,no
...
Although the solution is simple, it would be nice to create a generic
stylesheet that can be customized easily for this class of
conversion. Example 5-9 and Example 5-10 show a generic solution and how it might be
used in the case of people.xml.
This solution is table driven. The
generic-attr-to-csv.xslt stylesheet uses a
variable containing csv:column elements that are
defined in the importing spreadsheet. The importing spreadsheet needs
only to arrange the csv:column elements in the
order in which the resulting columns should appear in the output. The
csv:column elements define the mapping between a
named column and an attribute name in the input XML. Optionally, the
importing stylesheet can translate the values of certain attributes
by providing a template that matches the specified attribute using
the mode csv:map-value. Here you use such a
template to translate the abbreviated @sex values
in people.xml. Any common sets of mapping in use
can be placed in a third stylesheet and imported as well. The nice
thing about this solution is that it is easy for someone with only
very limited XSLT knowledge to define a new CSV mapping. As an added
benefit, the generic stylesheet defines a top-level parameter that
can change the default delimiting character from a comma to something
else.
5.2.2.2Create a CSV file from flat element-encoded data
In this scenario, you have a flat XML
file with elements mapping to rows and children mapping to columns.
This problem is similar to the previous one, except you have XML that
uses elements rather than attributes to encode the columns. You can
also provide a generic solution here, as shown in Example 5-11toExample 5-14.
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:csv="http://www.ora.com/XSLTCookbook/namespaces/csv">
<xsl:import href="generic-elem-to-csv.xslt"/>
<!--Defines the mapping from attributes to columns -->
<xsl:variable name="columns" select="document('')/*/csv:column"/>
<csv:column name="Name" elem="name"/>
<csv:column name="Age" elem="age"/>
<csv:column name="Gender" elem="sex"/>
<csv:column name="Smoker" elem="smoker"/>
</xsl:stylesheet>
Example 14. Output
Name,Age,Gender,Smoker
Al Zehtooney,33,m,no
Brad York,38,m,yes
Charles Xavier,32,m,no
David Willimas,33,m,no
...
5.2.2.3Handle more complex mappings
In this scenario, you must
deal with an arbitrary mapping of both attributes and elements to
rows and columns. Here the document order does not map as nicely onto
row or column order. In addition, the mapping may be sparse, in the
sense that many empty values must be generated in the CSV data.
Consider, for example, the following XML representing an expense
report of a soon-to-be-fired employee:
Now imagine that you need to import this XML into a spreadsheet so
that when appropriate spreadsheet styles are applied, the result
looks like Figure 5-1.
Figure 1. Expense report spreadsheet
To place the data correctly in all cells so that styling is the only
further processing necessary, the following comma-delimited file must
be produced:
,,,,,,,,,,,,Statement No.,123,
,,,,,,,,,,,Expense Statement,
,,,Employee,,,,,,,,,Pay Period,
,,,Name,Salvatore Mangano,,Emp #,1,,,,,From,1/1/02,
,,,SSN,999-99-9999,,Position,Cook,
,,,Department,XSLT Hacking,,,,,,,,To,1/31/02,
,,,Date,Account,Description,Lodging,Transport,Fuel,Meals,Phone,Entertainment,Other,
Total,
,,,12/20/01,12345,Goofing off instead of going to confrence.,500.00,50.00,0,300.
00,100,1000.00,300.00,
,,,12/20/01,12345,On the beach,500.00,50.00,0,200.00,20,300.00,100.00,Sub Total,
,,,Approved,,Notes,,,,,,,Advances,
,,,,,,,,,,,,Total,
As you can see, mapping from XML to delimited data lacks the
uniformity that made the previous examples simple to implement. This
is not to say that a stylesheet cannot be created to do the required
mapping. However, if you attack the problem directly, we will
probably end up with an ad-hoc and complex stylesheet.
When confronted with complex transformations, see if the problem
could be simplified by first transforming the source document to an
intermediate form, and then transform the intermediate form to the
desired result. In other words, try to break complex transformation
problems into two or more less-complicated problems.
Thinking along these lines, you'll see that the
problem of mapping the XML to the spreadsheet is really a problem of
assigning XML content to cells in the spreadsheet. You can therefore
invent an intermediate form consisting of cell
elements. For example, a cell element that places the
value "foo" in cell A1 would be
<cell col="A"row="1"
value="foo"/>. Your goal is to create a stylesheet that
maps each significant element in the source onto a cell element.
Because you no longer have to worry about ordering, mapping is
simple:
One major advantage of using an attribute to encode a
cell's value is that it lets you use attribute-value
templates, thus creating a very concise translation scheme. Two types
of mappings occur in this stylesheet. The first type is absolute. For
example, you want the employee name to map to cell E12. The second
type is relative; you want each expense item to map relative to row
16, based on its position in the source document.
When you apply this stylesheet to the source document, you get the
following output:
Of course, this is not the final result you are after. However, it is
not to0 difficult to see that by sorting these cells first by
@row and then by @col makes
mapping the cells into a comma-delimited form simple. In fact, if you
are willing to use the EXSLT node-set extension, you can obtain your
result with a single pass. Also notice that the cell-to-comma
delimited mapping is completely generic, so you can reuse it in the
future for other complex XML-to-comma-delimited mappings. See Example 5-15 and Example 5-16.
Example 15. Generic cells-to-comma-delimited.xslt
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:c="http://www.ora.com/XSLTCookbook/namespaces/cells"
xmlns:exsl="http://exslt.org/common" extension-element-prefixes="exsl">
<xsl:output method="text"/>
<!-- Used to map column letters to numbers -->
<xsl:variable name="columns" select=" '_ABCDEFGHIJKLMNOPQRSTUVWXYZ' "/>
<xsl:template match="/">
<!-- Capture cells in a variable -->
<xsl:variable name="cells">
<xsl:apply-templates/>
</xsl:variable>
<!-- Sort into row-column order -->
<xsl:variable name="cells-sorted">
<xsl:for-each select="exsl:node-set($cells)/c:cell">
<xsl:sort select="@row" data-type="number"/>
<xsl:sort select="@col" data-type="text"/>
<xsl:copy-of select="."/>
</xsl:for-each>
</xsl:variable>
<xsl:apply-templates select="exsl:node-set($cells-sorted)/c:cell"/>
</xsl:template>
<xsl:template match="c:cell">
<xsl:choose>
<!-- Detect a row change -->
<xsl:when test="preceding-sibling::c:cell[1]/@row != @row">
<!-- Compute how many rows to skip, if any -->
<xsl:variable name="skip-rows">
<xsl:choose>
<xsl:when test="preceding-sibling::c:cell[1]/@row">
<xsl:value-of
select="@row - preceding-sibling::c:cell[1]/@row"/>
</xsl:when>
<xsl:otherwise>
<xsl:value-of select="@row - 1"/>
</xsl:otherwise>
</xsl:choose>
</xsl:variable>
<xsl:call-template name="skip-rows">
<xsl:with-param name="skip" select="$skip-rows"/>
</xsl:call-template>
<xsl:variable name="current-col"
select="string-length(substring-before($columns,@col))"/>
<xsl:call-template name="skip-cols">
<xsl:with-param name="skip" select="$current-col - 1"/>
</xsl:call-template>
<xsl:value-of select="@value"/>,<xsl:text/>
</xsl:when>
<xsl:otherwise>
<!-- Compute how many cols to skip, if any -->
<xsl:variable name="skip-cols">
<xsl:variable name="current-col"
select="string-length(substring-before($columns,@col))"/>
<xsl:choose>
<xsl:when test="preceding-sibling::c:cell[1]/@col">
<xsl:variable name="prev-col"
select="string-length(substring-before($columns,
preceding-sibling::c:cell[1]/@col))"/>
<xsl:value-of select="$current-col - $prev-col - 1"/>
</xsl:when>
<xsl:otherwise>
<xsl:value-of select="$current-col - 1"/>
</xsl:otherwise>
</xsl:choose>
</xsl:variable>
<xsl:call-template name="skip-cols">
<xsl:with-param name="skip" select="$skip-cols"/>
</xsl:call-template>
<!--Output the value of the cell and a comma -->
<xsl:value-of select="@value"/>,<xsl:text/>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
<!-- Used to insert empty lines for non contiguous rows -->
<xsl:template name="skip-rows">
<xsl:param name="skip"/>
<xsl:choose>
<xsl:when test="$skip > 0">
<xsl:text>
</xsl:text>
<xsl:call-template name="skip-rows">
<xsl:with-param name="skip" select="$skip - 1"/>
</xsl:call-template>
</xsl:when>
<xsl:otherwise/>
</xsl:choose>
</xsl:template>
<!-- Used to insert extra commas for non contiguous cols -->
<xsl:template name="skip-cols">
<xsl:param name="skip"/>
<xsl:choose>
<xsl:when test="$skip > 0">
<xsl:text>,</xsl:text>
<xsl:call-template name="skip-cols">
<xsl:with-param name="skip" select="$skip - 1"/>
</xsl:call-template>
</xsl:when>
<xsl:otherwise/>
</xsl:choose>
</xsl:template>
</xsl:stylesheet>
Example 16. Applications-specific expense-to-delimited.xslt
The reusable cells-to-comma-delimited.xslt
captures the cells produced by the application-specific stylesheet
into a variable and sorts. It then transforms those cells into
comma-delimited output. This is done by considering each cell
relative to its predecessor in sorted order. If the predecessor is on
a different row, then one or more newlines must be output. On the
other hand, if the predecessor is on a nonadjacent column, then one
or more extra commas must be output. You must also handle the case
when the first row or column within a row is not the first row or
column in the spreadsheet. Once these details are handled, you only
need to output the value of the cell followed by a comma.
5.2.3. Discussion
Most XML-to-delimited transformations you are likely to encounter are
fairly simple for someone well-versed in XSLT. The value of the
previous examples is that they demonstrate that problems can be
separated into two parts: a reusable part that requires XSLT
expertise and an application-specific part that does not require much
XSLT knowledge once its conventions are understood.
The true value of this technique is that it allows individuals who
are less skilled in XSLT to do useful work. For example, suppose you
had to convert a large base of XML to comma-delimited data and it
needed to be done yesterday. Showing someone how to reuse these
generic solutions would be much easier than teaching them enough XSLT
to come up with custom scripts.