
Hyphenation Algorithm for German Words

In October 2019, the German-language IT portal heise.de published a report on the now 70-year history of word processing. Very interesting, in my opinion. It is fascinating how it all started and what possibilities we have today. In everyday work I often use Microsoft Word; besides that, I'm a fan of plain ASCII text and Markdown.

The article reminded me of some of my spare-time work on text mining. Back then, I was working on an ABAP implementation of the Porter stemming algorithm for German. In addition, I had implemented a very basic hyphenation algorithm for automatically separating German words.

Such hyphenation algorithms were widely used in the past, when little disk space was available. Since they do not give good results in certain cases, comprehensive dictionaries with all hyphenation variants are, as far as I know, used today.

Because the algorithm is not very complex, here's an excerpt. The complete source code can be found on GitHub (check abapGit). A hint: some constructs, such as an inline DATA(current_character) declaration inside a DO loop, are not recommended by the Clean ABAP style guide.

[..]

DATA(word_length) = strlen( word_to_separate ).
DATA(current_position) = word_length - 1.

" Walk through the word from right to left.
DO.
  IF current_position <= 0.
    EXIT.
  ENDIF.

  DATA(current_character) = word_to_separate+current_position(1).
  current_position = current_position - 1.

  " After passing a vowel, look for the consonant that closes the
  " syllable and record a split in front of it.
  IF current_character CA 'aeiouyäöü'.
    DO.
      IF current_position <= 0.
        EXIT.
      ENDIF.

      current_character = word_to_separate+current_position(1).

      " Umlauts count as vowels here as well.
      IF current_character CN 'aeiouyäöü'.
        result-left_word = word_to_separate+0(current_position).
        DATA(length_right_word) = word_length - current_position.
        result-right_word = word_to_separate+current_position(length_right_word).
        APPEND result TO results.
        EXIT.
      ENDIF.

      current_position = current_position - 1.
    ENDDO.
  ENDIF.

[..]
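For readers without an ABAP system at hand, the same right-to-left scan can be sketched in Python. The function and variable names here are my own, not from the ABAP source:

```python
# Sketch of the vowel/consonant rule from the excerpt above: scan the
# word from right to left; after passing a vowel, split in front of the
# next consonant, so that the consonant opens the right-hand part.

VOWELS = set("aeiouyäöü")

def split_points(word: str) -> list[tuple[str, str]]:
    """Collect (left, right) split candidates for a lowercase word."""
    results = []
    pos = len(word) - 1
    while pos > 0:
        ch = word[pos]
        pos -= 1
        if ch in VOWELS:
            # Look further left for the consonant that closes the syllable.
            while pos > 0:
                if word[pos] not in VOWELS:
                    results.append((word[:pos], word[pos:]))
                    break
                pos -= 1
    return results

print(split_points("computer"))  # [('compu', 'ter'), ('com', 'puter')]
```

For "Computer" this yields the candidates Com-pu-ter, which matches the German hyphenation.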

Some tests quickly reveal weaknesses with the letter combinations “sch”, “ss” (“ß”), or “ck”. So there is still a lot of potential for improvement 😉
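One possible fix for such cases, sketched in Python: treat combinations like “sch” and “ck” as inseparable and shift a proposed split position in front of them. The cluster list and names below are my own assumptions, not part of the original code:

```python
# Letter clusters that must not be cut apart, checked longest-first.
# A deliberately minimal list for illustration.
CLUSTERS = ("sch", "ck", "ch")

def adjust_split(word: str, pos: int) -> int:
    """Move a split position left if it would cut through a cluster."""
    for cluster in CLUSTERS:
        # Does the cluster start just before the split and reach past it?
        for start in range(max(0, pos - len(cluster) + 1), pos):
            if word.startswith(cluster, start):
                return start  # split in front of the whole cluster
    return pos

# The naive rule splits "waschen" as "wasc-hen" (position 4);
# shifting in front of "sch" gives the correct "wa-schen".
print(adjust_split("waschen", 4))  # 2
```

A real implementation would need a longer cluster list and extra rules for “ß”/“ss”; this is only meant to show the direction.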

What is the benefit of the algorithm? On the one hand, it is a nice exercise for students and trainees, especially if you verify the accuracy of your results with unit tests and try to resolve the weaknesses shown above. You then have a small, self-contained topic that can be worked out within a few hours.

On the other hand, it reminds us all of the technical possibilities we have today. In SAPUI5, for example, there is a standard hyphenation solution for the text control. I haven't worked with it yet, but it sounds interesting.

 

In this sense, happy hacking and thanks for reading

Michael

 

P. S.: If the algorithm is not of interest, you may also like to take a look at this ASCII art generator.

2 Comments
  • Hi Michael,

    nice article! It seems most programmers had a time when they spent some of their spare time playing around with text mining and related algorithms. (At least I did, too.) Speaking of ABAP and phonetic algorithms: have you seen that there is a “native” soundex function for HANA SQL? Maybe a nice entry point for an upcoming article in the field of text processing. 😉

    • Nice note. I didn't know the function yet. Your note alone made it worth writing the blog 🙂 At the moment I'm working with the ALV with IDA, which has an integrated text search. I would like to examine the text search even more closely, because you can work with a configurable level of match accuracy. That could be helpful when analyzing vendor and customer master data.
