I don’t have the faintest idea how these files have been created: spreadsheets with translatable data that contain tons of HTML markup. Sometimes there are fairly complex HTML structures (like tables), lots of incomplete fragments, and in many cases even some translated items. Possibly somebody has to do a lot of manual copy&paste, but I never could find out …

When you try to process files like this as normal spreadsheets you’ll run into problems: markup can neither be evaluated nor protected, segmentation (and leverage) will be poor, and the result will most likely look like a mess.

The following recipe shows a way to prepare these files for translation.

The core of this solution is similar to the CDATA hack published earlier. When there is HTML data we have to enable the CAT tool to recognize it as such. However, when processing XML we already have a basic markup structure. When processing spreadsheets formats we have to create it before we can go on. To do so

  • We save the file as tab-delimited Unicode.
  • We create a ‘row’ element for each record, and a ‘cell’ element for each item.
  • We adapt the HTML parser of our CAT tool to the slightly “refined” HTML.

Step 1 – save as unquoted tab-delimited Unicode

Let’s assume we have three columns: a (numeric) ID, a source column, and a target column – the latter only partially filled. After exporting this to unquoted tab-delimited Unicode we will end up with something like this (tabs represented by {tab})

001{tab}source{tab}Quelle
002{tab}something to <b>translate</b>{tab}
003{tab}something else to <b>translate</>{tab}
004{tab}already translated{tab}bereits übersetzt

Please note: This example assumes that there are no embedded line breaks within the cells. If your data happens to have such line breaks you should remove them or replace them with sme kind of placeholder markup line <mybr />. If not, you have to deal with the extra complexity of escaped quotation marks!

Step 2 – create fake HTML from the text export

Now we need a sequence of four simple regex bases search&replace oprations.

The first will create the basic <row> markup:

search for ^(.+?)$
replace with <row>\n$1\n</row>

The second will add the markup for the individual cells (<id>, <src>, <tgt>)

search for ^(\d+)\t(.*?)\t(.*?)$
replace with $1$2<tgt>$3

Note: For both source and target we use the * quantifier because either value possibly can be empty!

Meanwhile our sample data will look like this:

<row>
<id>001</id><src>source</src><tgt>Quelle</tgt>
</row>
<row>
<id>002</id><src>something to <b>translate</b></src><tgt></tgt>
</row>
<row>
<id>003</id><src>something else to <b>translate</b></src><tgt></tgt>
</row>
<row>
<id>004</id><src>already translated</src><tgt>bereits übersetzt</tgt>
</row>

The third s&r operation will mark the existing translations with an attribute we can evaluate later

search for <tgt>(.+?)</tgt>
replace with <tgt translated=”yes”>$1</tgt>

Note: This time we use the + quantifier to skip all empty elements

The intermediate result looks like this:

<row>
<id>001</id><src>source</src><tgt translated=”yes”>Quelle</tgt>
</row>
<row>
<id>002</id><src>something to <b>translate</b></src><tgt></tgt>
</row>
<row>
<id>003</id><src>something else to <b>translate</b></src><tgt></tgt>
</row>
<row>
<id>004</id><src>already translated</src><tgt translated=”yes”>bereits übersetzt</tgt>
</row>

In the fourth and last s&r operation we copy the content of all <src> elements that need translation to the corresponding <tgt> element.

search for <src>(.+?)</src><tgt></tgt>
replace with <src>$1</src><tgt>$1</tgt>

Now we have something we can work with:

<row>
<id>001</id><src>source</src><tgt translated=”yes”>Quelle</tgt>
</row>
<row>
<id>002</id><src>something to <b>translate</b></src><tgt>something to <b>translate</b></tgt>
</row>
<row>
<id>003</id><src>something else to <b>translate</b></src><tgt>something else to <b>translate</b></tgt>
</row>
<row>
<id>004</id><src>already translated</src><tgt translated=”yes”>bereits übersetzt</tgt>
</row>

If your CAT tool needs that you can additionally wrap up everything in an <html></html> element.

Step 3 – adapt the HTML parser instructions

Now we create a copy of the default HTML settings for the parser of the CAT tool (make sure to change the name as well in order not to confuse the altered settings with the original later on!), and add the following rules:

<row> external, not translatable
<src> external, not translatable
<tgt> external, translatable
<tgt translated=”yes”> external, not translatable

Voilà. The data will be parsed as HTML, markup will be protected, segmentation will be good, but more tricky things like the values of “alt” or “title” attributes will be translatable as well.

And after translation the way back to the original spreadsheet format will be fairly easy:

  • delete the lines with <row> markup
  • replace the remaining markup with tabs
  • copy/paste the result to the spreadsheet.
  • done 🙂
Advertisements