From time to time I come across XML data – most likely exports from content management systems – with fairly large portions of HTML embedded in CDATA sections. This causes a problem: by the nature of CDATA, the markup can’t be processed properly because it is treated as literal text. On top of that, the embedded HTML is usually not well-formed.

SDL has published a little HowTo in the Translator’s Workbench User Guide, but its effects are somewhat limited. The main drawbacks are:

  • Externals will not be recognized at all
  • Translatable attributes like alt or title will be locked

The following will show you a more general (but still simple) approach to dealing with this nasty situation. It has proved to work well with Trados 7.5+ and can most likely be adapted to the CAT tool of your choice.

A typical example of HTML in a CDATA section might look like this:

<page>
<ptitle>A sample page</ptitle>
<ptext><![CDATA[<ul><li>This is the first item</li><li>This is the second item <img src="my.png" alt="Some text"></li></ul>]]></ptext>
</page>

Everything inside the CDATA section of the above sample will be treated as plain text. The fairly complex HTML part will most likely get messed up during translation (don’t blame it on the translators).

The most important part of what I call the “CDATA hack” is converting CDATA to PCDATA by introducing a new (fake) element. Simply replace the CDATA markup with something like <cdata></cdata> (you can choose whatever name you want). Now the sample will look like this:

<page>
<ptitle>A sample page</ptitle>
<ptext>
<cdata>
<ul>
<li>
This is the first item
<li>This is the second item <img src="my.png" alt="Some text">
</ul>
</cdata>
</ptext>

</page>
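
For more than a handful of files, the replacement described above can be scripted. The following is only a minimal Python sketch, not part of the original workflow: the names `CDATA_RE` and `cdata_to_element` are made up for illustration, and it assumes the file content is already available as a string. Unlike the sample above, it does not reformat the HTML; it only swaps the CDATA delimiters for the fake element.

```python
import re

# Matches one CDATA section, including content that spans several lines.
CDATA_RE = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)


def cdata_to_element(xml_text, tag="cdata"):
    """Replace each CDATA section with a fake <tag>...</tag> element.

    The element name is an arbitrary choice (hypothetical default: "cdata").
    """
    return CDATA_RE.sub(lambda m: f"<{tag}>{m.group(1)}</{tag}>", xml_text)


sample = (
    "<page>\n"
    "<ptitle>A sample page</ptitle>\n"
    "<ptext><![CDATA[<ul><li>This is the first item</li>"
    '<li>This is the second item <img src="my.png" alt="Some text"></li></ul>]]></ptext>\n'
    "</page>"
)

converted = cdata_to_element(sample)
print(converted)
```

Pick an element name that does not already occur in your data, otherwise you will not be able to tell the fake elements from real ones when converting back.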

Now comes the second part of the hack. It is plain to see that this chunk is not valid XML, because it does not meet the basic requirements for well-formedness. As such, it cannot be processed with the XML parser of your favorite CAT tool.
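
If you want to verify this before feeding a file to your CAT tool, a strict XML parser will tell you. A small sketch using Python's standard `xml.etree.ElementTree` module; the helper name `is_well_formed` is hypothetical, and the string is the converted sample from above written on fewer lines:

```python
import xml.etree.ElementTree as ET

# The converted sample: <li> and <img> are never closed.
fake_xml = (
    "<page><ptitle>A sample page</ptitle><ptext><cdata>"
    "<ul><li>This is the first item<li>This is the second item "
    '<img src="my.png" alt="Some text"></ul>'
    "</cdata></ptext></page>"
)


def is_well_formed(text):
    """Return True if a strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(fake_xml))  # False: </ul> does not match the open <img>
```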

However, once again the solution is quite simple. All you have to do is slip the data to the SGML processor as (fake) HTML, because it is not so picky about a well-formed structure. To do so:

  • rename the file(s) to *.html
  • make a copy of the parser instructions for HTML (INI, ITS or whatever) and save it under a new name
  • add the new root element to the copied parser instructions
  • add the new elements (in our sample page, ptitle, ptext, cdata) to the parser instructions as externals or – alternatively – make sure that all unknown elements will be treated as externals.
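
To get a feeling for why this works, here is a rough analogy in Python. It does not use Trados or its parser settings at all; it only shows that a tolerant HTML parser (Python's standard `html.parser`) accepts the same chunk a strict XML parser rejects, passes over the unknown elements, and still exposes the text and the `alt`/`title` attributes. The class name `TranslatableCollector` is made up for this sketch.

```python
from html.parser import HTMLParser

# The converted sample, treated as (fake) HTML.
fake_html = (
    "<page><ptitle>A sample page</ptitle><ptext><cdata>"
    "<ul><li>This is the first item<li>This is the second item "
    '<img src="my.png" alt="Some text"></ul>'
    "</cdata></ptext></page>"
)


class TranslatableCollector(HTMLParser):
    """Collect text nodes and the values of alt/title attributes."""

    def __init__(self):
        super().__init__()
        self.texts = []
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        # Unknown elements such as <page> or <cdata> simply pass through here.
        for name, value in attrs:
            if name in ("alt", "title") and value:
                self.attrs.append((tag, name, value))

    def handle_data(self, data):
        if data.strip():
            self.texts.append(data.strip())


collector = TranslatableCollector()
collector.feed(fake_html)
collector.close()

print(collector.texts)
print(collector.attrs)
```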

Well, that’s all. Now you can process the nasty XML as fake HTML, and the translators will benefit from markup protection and better segmentation.

Of course you should not forget to replace the fake <cdata>…</cdata> markup with real <![CDATA[…]]> sections. Otherwise they might wonder what’s gone wrong 😉
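
The way back can be scripted as well. Again just a sketch under assumptions: the fake element is called `cdata`, it is never nested, and the names `FAKE_RE` and `element_to_cdata` are invented for this example. The input string stands in for a file that has come back from translation.

```python
import re

# Matches the fake element; assumes it is named <cdata> and is not nested.
FAKE_RE = re.compile(r"<cdata>(.*?)</cdata>", re.DOTALL)


def element_to_cdata(text):
    """Turn each fake <cdata>...</cdata> element back into a CDATA section."""
    return FAKE_RE.sub(lambda m: "<![CDATA[" + m.group(1) + "]]>", text)


processed = (
    "<ptext><cdata><ul><li>This is the first item<li>This is the second item "
    '<img src="my.png" alt="Some text"></ul></cdata></ptext>'
)

restored = element_to_cdata(processed)
print(restored)
```

Remember to give the files their original extension back, too.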
