
Hyphenation Algorithm for German Words

In October 2019, the German-language IT portal heise.de published a report on the now 70-year history of word processing. Very interesting, in my opinion. It is fascinating how it all started and what possibilities we have today. In everyday work I often use Microsoft Word; besides that, I'm a fan of plain ASCII text and Markdown.

The article reminded me of some of my spare-time work on text mining. Back then, I was working on an ABAP implementation of the Porter stemming algorithm for German. In addition, I had implemented a very basic hyphenation algorithm for automatically separating German words.

Such hyphenation algorithms were widely used in the past, when little disk space was available. Since they do not give good results in certain cases, comprehensive dictionaries with all hyphenation variants are, as far as I know, used today.

Because the algorithm is not very complex, here's an excerpt. The complete source code can be found on GitHub (check abapGit). A hint: some constructs, such as an inline DATA(current_character) declaration inside a DO loop, are not recommended by the Clean ABAP style guide.

[..]

DATA(word_length) = strlen( word_to_separate ).
DATA(current_position) = word_length - 1.

" Walk through the word from right to left.
DO.
  IF current_position <= 0.
    EXIT.
  ENDIF.

  DATA(current_character) = word_to_separate+current_position(1).
  current_position = current_position - 1.

  " After passing a vowel, look for the consonant that closes the
  " syllable and record a split in front of it.
  IF current_character CA 'aeiouyäöü'.
    DO.
      IF current_position <= 0.
        EXIT.
      ENDIF.

      current_character = word_to_separate+current_position(1).

      " Umlauts count as vowels here as well.
      IF current_character CN 'aeiouyäöü'.
        result-left_word = word_to_separate+0(current_position).
        DATA(length_right_word) = word_length - current_position.
        result-right_word = word_to_separate+current_position(length_right_word).
        APPEND result TO results.
        EXIT.
      ENDIF.

      current_position = current_position - 1.
    ENDDO.
  ENDIF.

[..]
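For readers without an ABAP system at hand, the same right-to-left scan can be sketched in Python. The function and variable names here are my own, not from the ABAP source:

```python
# Sketch of the vowel/consonant rule from the excerpt above: scan the
# word from right to left; after passing a vowel, split in front of the
# next consonant, so that the consonant opens the right-hand part.

VOWELS = set("aeiouyäöü")

def split_points(word: str) -> list[tuple[str, str]]:
    """Collect (left, right) split candidates for a lowercase word."""
    results = []
    pos = len(word) - 1
    while pos > 0:
        ch = word[pos]
        pos -= 1
        if ch in VOWELS:
            # Look further left for the consonant that closes the syllable.
            while pos > 0:
                if word[pos] not in VOWELS:
                    results.append((word[:pos], word[pos:]))
                    break
                pos -= 1
    return results

print(split_points("computer"))  # [('compu', 'ter'), ('com', 'puter')]
```

For "Computer" this yields the candidates Com-pu-ter, which matches the German hyphenation.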

Some tests quickly reveal weaknesses with the letter combinations “sch”, “ss” (“ß”), or “ck”. So there is still a lot of potential for improvement 😉
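One possible fix for such cases, sketched in Python: treat combinations like “sch” and “ck” as inseparable and shift a proposed split position in front of them. The cluster list and names below are my own assumptions, not part of the original code:

```python
# Letter clusters that must not be cut apart, checked longest-first.
# A deliberately minimal list for illustration.
CLUSTERS = ("sch", "ck", "ch")

def adjust_split(word: str, pos: int) -> int:
    """Move a split position left if it would cut through a cluster."""
    for cluster in CLUSTERS:
        # Does the cluster start just before the split and reach past it?
        for start in range(max(0, pos - len(cluster) + 1), pos):
            if word.startswith(cluster, start):
                return start  # split in front of the whole cluster
    return pos

# The naive rule splits "waschen" as "wasc-hen" (position 4);
# shifting in front of "sch" gives the correct "wa-schen".
print(adjust_split("waschen", 4))  # 2
```

A real implementation would need a longer cluster list and extra rules for “ß”/“ss”; this is only meant to show the direction.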

What is the benefit of the algorithm? On the one hand, it is a nice exercise for students and trainees, especially if you verify the accuracy of your results with unit tests and try to resolve the weaknesses shown above. You then have a small, self-contained topic that can be worked out within a few hours.

On the other hand, it reminds us all of the technical possibilities we have today. In SAPUI5, for example, there is a standard hyphenation solution for the text control. I haven't worked with it yet, but it sounds interesting.

 

In this sense, happy hacking and thanks for reading

Michael

 

P. S.: If the algorithm is not of interest, you may also like to take a look at this ASCII art generator.

2 Comments
  • Hi Michael,

    nice article! It seems most programmers had a time when they spent some of their spare time playing around with text mining and related algorithms. (At least I did, too.) Speaking of ABAP and phonetic algorithms: have you seen that there is a “native” soundex function for HANA SQL? Maybe a nice entry point for an upcoming article in the field of text processing. 😉

    • Nice note. I didn't know the function yet. Your note alone made it worth writing the blog 🙂 At the moment I'm working with the ALV with IDA, which has an integrated text search. I would like to examine the text search even more closely, because you can work with a configurable level of match accuracy. That could be helpful when analyzing vendor and customer master data.
