HTML authors often have the bad habit of “formatting” text with <br /> elements. A chunk of text from such a file may look like this:

<p> This is the first sentence.<br />This is the second sentence.</p>

As a rule of thumb, text is segmented wherever a punctuation mark (.!?:) is followed by at least one space. The example above therefore cannot be segmented, because there is no space between the full stop and the <br /> element.
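
To make the rule of thumb concrete, here is a minimal Python sketch. It is only an approximation of what a real segmenter does: the pattern and the function name segment are assumptions made for illustration, not the rules of any particular tool.

```python
import re

# Simplified stand-in for a segmentation rule (an assumption, not a real
# tool's rule set): break after . ! ? : when whitespace follows.
SEGMENT_BREAK = re.compile(r"(?<=[.!?:])\s+")

def segment(text):
    """Split text at punctuation marks that are followed by whitespace."""
    return [part for part in SEGMENT_BREAK.split(text) if part]

without_space = "This is the first sentence.<br />This is the second sentence."
with_space = "This is the first sentence. <br />This is the second sentence."

print(len(segment(without_space)))  # one segment: no space after the full stop
print(len(segment(with_space)))     # two segments: the space allows a break
```

With the space missing, the whole chunk stays in one segment; once a space follows the full stop, the sketch splits it in two.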

However, the solution is quite simple. All you have to do is to run a regex-based search & replace operation on your HTML data:

  • search for ([.?!:])< (This finds punctuation marks followed immediately by an opening angle bracket.)
  • replace with $1 < (Note the space to the left of the opening angle bracket!)

The replacement expression re-inserts the punctuation mark found in the search step and adds the missing space. Of course, this may leave a few unnecessary spaces behind (e.g. to the left of </p> markup), but usually this is not a problem. Once you have tweaked the HTML data this way, it will be segmented much better, which in turn might give you better leverage and/or consistency.
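
If you prefer to script the operation rather than run it in an editor, the same search & replace can be sketched in Python. This is a minimal illustration, not a finished tool: the function name add_space_before_tags is made up for this example, and Python's re module writes the backreference as \1 where many editors use $1.

```python
import re

# Punctuation mark immediately followed by an opening angle bracket.
PUNCT_BEFORE_TAG = re.compile(r"([.?!:])<")

def add_space_before_tags(html):
    """Insert a space between a punctuation mark and the tag that follows it."""
    # \1 re-inserts the punctuation mark captured by the group.
    return PUNCT_BEFORE_TAG.sub(r"\1 <", html)

sample = "<p> This is the first sentence.<br />This is the second sentence.</p>"
print(add_space_before_tags(sample))
# <p> This is the first sentence. <br />This is the second sentence. </p>
```

Note the extra space in front of </p> in the output: that is the harmless side effect mentioned above.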

Important note: regex-based s&r operations are very powerful and can easily mess up tons of data. Always test them thoroughly on sample files before you run them on your production data!
