Mike B.

Regex in ABAP, HTML processing in ABAP with regular expressions

Recently I had to process HTML code and replace some CSS properties with their HTML tag equivalents. For instance, a font-weight: bold; property in the style attribute of a <span> tag must be replaced with the <strong> HTML tag. One way to solve this problem is to use regular expressions in ABAP. Below I explain my solution, with the ABAP regex code in detail.

First of all, we need to detect a <span style="…"> block that contains a font-weight property, and then surround the content of this block with the HTML <strong> tag.

  REPLACE ALL OCCURRENCES OF REGEX '(font-weight:[^>]*>)([^♦]*)(♦)(</span>)'
    IN html_string WITH '$1<strong>$2</strong>$3$4' IGNORING CASE.

You may wonder about the „♦“ symbol; I'll come back to it at the end of this post.
 

Some comments:

  • Parentheses „(…)“ define a group (block) that can be repositioned or dropped in the replacement text.
  • The expression „[^>]*“ matches everything up to the next „>“ character; „[^♦]*“ works the same way for „♦“.
  • A „$“ character followed by a number inserts the corresponding group at that position in the replacement text.
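
To make the group mechanics concrete, here is a minimal sketch on a toy string rather than real HTML. The report name and the variable lv_css are made up for illustration, and the inline declaration assumes ABAP 7.40 or later.

```abap
REPORT z_regex_groups_demo. " hypothetical report name

" Toy input (an assumption): a single CSS declaration.
DATA(lv_css) = `font-weight: bold`.

" Group 1 captures the property name, group 2 the value.
" $2 and $1 in the replacement put the groups back in swapped order.
REPLACE ALL OCCURRENCES OF REGEX '([a-z-]+):\s*([a-z]+)'
  IN lv_css WITH '$2=$1' IGNORING CASE.

WRITE / lv_css. " should show: bold=font-weight
```

Anything matched outside a group (here the colon and the blank) is simply not carried over unless the replacement text spells it out again.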

Now that we have found the relevant <span> block and surrounded its content with the desired tag, we can remove the font-weight property from the <span style="…"> block.

  REPLACE ALL OCCURRENCES OF REGEX '(font-weight:[^;]*;)'
    IN html_string WITH '' IGNORING CASE.
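
As a rough illustration of this removal step on its own, the sketch below runs the same pattern against a made-up span with two style properties. The report name, the variable lv_html and the sample markup are assumptions, not part of the original solution.

```abap
REPORT z_regex_remove_prop_demo. " hypothetical report name

" Sample input (an assumption): a span with two style properties.
DATA(lv_html) = `<span style="font-weight: bold; color: red;">x</span>`.

" [^;]* runs up to the next semicolon, so the whole declaration is matched.
" Replacing with the empty string literal `` deletes the match.
REPLACE ALL OCCURRENCES OF REGEX '(font-weight:[^;]*;)'
  IN lv_html WITH `` IGNORING CASE.

WRITE / lv_html. " should show: <span style=" color: red;">x</span>
```

The pattern relies on the declaration ending with a semicolon; a font-weight property without one would not be matched the same way.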

That's all. We have replaced the font-weight property in the <span> block with the <strong> HTML tag.

Now it's time to explain the meaning of the „♦“ symbol. It is a workaround for nested HTML tags inside the span block, e.g. <span style="…">…<em>…</em>…</span>.

To detect the end of the span block's content rather than the end of a nested tag, I add an anchor, the „♦“ symbol, before </span> and use this anchor in my regex.

At the end I have to remove this anchor with the following regex:

  REPLACE ALL OCCURRENCES OF REGEX '♦'
    IN html_string WITH '' IGNORING CASE.

 

Final code:

  " Set up the workaround for nested tags:
  " the special character '♦' marks the real end of the content
  " that we want to surround with a basic HTML tag, so that
  " nested HTML tags do not end the match early.
  REPLACE ALL OCCURRENCES OF REGEX '</span>'
    IN html_string WITH '♦</span>' IGNORING CASE.

  " Surround bold (font-weight: bold) text with the HTML <strong> tag.
  REPLACE ALL OCCURRENCES OF REGEX '(font-weight:[^>]*>)([^♦]*)(♦)(</span>)'
    IN html_string WITH '$1<strong>$2</strong>$3$4' IGNORING CASE.

  " Remove the no longer needed CSS font-weight property.
  REPLACE ALL OCCURRENCES OF REGEX '(font-weight:[^;]*;)'
    IN html_string WITH '' IGNORING CASE.

  " Delete the workaround anchor again.
  REPLACE ALL OCCURRENCES OF REGEX '♦'
    IN html_string WITH '' IGNORING CASE.
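
For a feel of how the four statements interact, here is a small walk-through on one sample string, with the intermediate value noted after the first two steps. The report name and the sample markup are assumptions chosen for illustration; the statements themselves are the ones from the listing above, and the inline declaration assumes ABAP 7.40 or later.

```abap
REPORT z_span_to_strong_demo. " hypothetical report name

" Sample input (an assumption): a bold span with a nested <em> tag.
DATA(html_string) =
  `<p><span style="font-weight: bold;">Hello <em>world</em></span></p>`.

" Step 1: mark every closing span tag with the anchor.
REPLACE ALL OCCURRENCES OF REGEX '</span>'
  IN html_string WITH '♦</span>' IGNORING CASE.
" -> <p><span style="font-weight: bold;">Hello <em>world</em>♦</span></p>

" Step 2: wrap everything between the style attribute and the anchor.
REPLACE ALL OCCURRENCES OF REGEX '(font-weight:[^>]*>)([^♦]*)(♦)(</span>)'
  IN html_string WITH '$1<strong>$2</strong>$3$4' IGNORING CASE.
" -> ...bold;"><strong>Hello <em>world</em></strong>♦</span></p>

" Step 3: drop the font-weight declaration from the style attribute.
REPLACE ALL OCCURRENCES OF REGEX '(font-weight:[^;]*;)'
  IN html_string WITH `` IGNORING CASE.

" Step 4: remove the anchor again.
REPLACE ALL OCCURRENCES OF REGEX '♦'
  IN html_string WITH `` IGNORING CASE.

WRITE / html_string.
" should show: <p><span style=""><strong>Hello <em>world</em></strong></span></p>
```

Note that the nested <em> tag ends up inside <strong>, which is exactly what the anchor is for, and that an empty style attribute is left behind on the span.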

 


P.S. If you know a better way to solve this problem, feel free to share your experience!


      4 Comments
      Sascha Wenninger

      Hi Mike,

      wow, those are some great regular expressions and a really good use case to show the power of that technology! Thank you for sharing and bringing this technology to everyone's attention. I'm always surprised at how few ABAPers know about the regex functionality, so hopefully your blog will help in increasing awareness. Thank you also for explaining the anchor concept - very clever! 🙂

      Sascha

      Tom Van Doorslaer

      cool that you found a working solution, but I immediately had to think of a discussion someone directed me to.

      http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

      It's the top rated answer on stack overflow (and completely over the top as well)

      but basically, in the discussion, they try to explain that you can parse HTML with regex to some extent, but as HTML is a complex language, at some point regex will no longer suffice.

      Instead, they recommend using XML parsers, because you have the hierarchy in there as well.

      ex: suppose you have a text like [strong]This[/strong] is a text with [strong]some[/strong] bold characters.

      by using regex, you risk replacing the entire "This is a text with some" because it takes everything from the first [strong] to the last [/strong]
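
      The effect described here can be reproduced with a short sketch. It assumes the square brackets in the example stand for angle brackets; the report and variable names are made up for illustration.

```abap
REPORT z_regex_greedy_demo. " hypothetical report name

" The example text from above, written with angle brackets (an assumption).
DATA(lv_text) =
  `<strong>This</strong> is a text with <strong>some</strong> bold characters.`.

" .* takes the longest possible match: first <strong> to last </strong>.
FIND ALL OCCURRENCES OF REGEX '<strong>.*</strong>'
  IN lv_text MATCH COUNT DATA(lv_greedy).

" [^<]* cannot cross a tag boundary, so each pair is matched on its own.
FIND ALL OCCURRENCES OF REGEX '<strong>[^<]*</strong>'
  IN lv_text MATCH COUNT DATA(lv_bounded).

WRITE: / lv_greedy, lv_bounded. " should show 1 and 2
```

      The bounded pattern only helps while the content contains no nested tags, which is the point where a parser becomes the more robust choice.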

      I find the whole discussion intriguing, though very theoretical. So it's nice to see you came up with a clever regex solution as well (for this particular problem), although I would recommend going through what they all say on Stack Overflow. You might find some interesting tips there.

      Mike B.
      Blog Post Author

      In my case I had to find a special workaround to deal with some CSS/HTML related issue.

      Next time I have to deal with HTML, I'll look at an XML parser too.

      Thanks!

      Former Member

      Only a small subset of ABAPers have read about regex, and an even smaller subset apply it in their day-to-day jobs.

      I use regex heavily, but only inside Notepad++.

      It would be nice to see more such articles that show real world application of regex in ABAP programs.

      Keep it up.

      Using regex was probably the fastest way to meet your requirement.

      A more error-proof way of doing it would be to use an XSLT transformation. But it has its own learning curve.

      I can think of 2 scenarios where the present code may give improper output.

      1. Nested span tags. You place the marker before every closing span tag, but not every span tag has a font-weight property.
      2. In a given span tag, attributes followed by font-weight would get deleted.
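
      The first scenario can be reproduced with a short sketch that runs the four statements from the post against a made-up nested span. The report name and the sample markup are assumptions for illustration.

```abap
REPORT z_nested_span_demo. " hypothetical report name

" Sample input (an assumption): a bold span that contains a plain span.
DATA(html_string) =
  `<span style="font-weight: bold;">a <span>b</span> c</span>`.

" The four statements from the post, unchanged in logic.
REPLACE ALL OCCURRENCES OF REGEX '</span>'
  IN html_string WITH '♦</span>' IGNORING CASE.
REPLACE ALL OCCURRENCES OF REGEX '(font-weight:[^>]*>)([^♦]*)(♦)(</span>)'
  IN html_string WITH '$1<strong>$2</strong>$3$4' IGNORING CASE.
REPLACE ALL OCCURRENCES OF REGEX '(font-weight:[^;]*;)'
  IN html_string WITH `` IGNORING CASE.
REPLACE ALL OCCURRENCES OF REGEX '♦'
  IN html_string WITH `` IGNORING CASE.

WRITE / html_string.
" should show: <span style=""><strong>a <span>b</strong></span> c</span>
```

      The match stops at the first anchor, which belongs to the inner span, so <strong> is closed inside the inner span and the trailing " c" is left outside it.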