Mining Biomedical and Clinical Data – a CBmed & SAP showcase at the Global Biobank Week in Stockholm
The Global Biobank Week conference in Stockholm (September 13-15) is rapidly approaching. I spoke with two colleagues who work closely together in a cooperative project between SAP SE and CBmed GmbH (the Austrian K1 Competence Center for Biomarker Research in Medicine). Markus Kreuzthaler, PhD, is a Research Associate at CBmed specializing in clinical natural language processing (NLP), and Peter Kaiser, PhD, is a Development Project Manager at SAP Health. Both will be onsite at the conference.
Markus, you are one of the speakers at the event in Stockholm. What will you present?
Markus: I will talk about the CBmed project Innovative Use of Information for Clinical Care and Biomarker Research (IICCAB), aimed at mining large-scale clinical data sets for primary and secondary use. Specifically, I will highlight challenges and solutions in cohort building, where we work together with SAP and the BioBank Graz. For this project, the retrieval of clinical information is fundamental for a precise selection of suitable biospecimens by querying clinical routine data linked to biobank sample data. The challenge is that electronic health records (EHRs) contain most of the relevant clinical information as free text only. This is sufficient for the communication and documentation needs of clinicians, but is challenging when it comes to machine-based information extraction for a defined task. A robust NLP engine, powered by dictionaries that reflect the local language, is indispensable for comprehensive data integration. In the end, semantically normalized patient profiles using international terminology standards can be stored and queried via the SAP Connected Health platform to support biomarker research.
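To make the cohort-building idea concrete, here is a minimal sketch in Python. It is not the IICCAB or SAP Connected Health implementation; the record types, field names, and `DEMO-` codes are hypothetical placeholders. It only illustrates the principle Markus describes: selecting biospecimens by querying coded patient profiles that are linked to biobank sample data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatientProfile:
    """Semantically normalized profile (hypothetical structure)."""
    pseudonym: str
    codes: frozenset  # normalized clinical codes; placeholders in this sketch


@dataclass(frozen=True)
class Sample:
    """Biobank sample linked to a patient via the pseudonym (hypothetical structure)."""
    sample_id: str
    pseudonym: str
    material: str


def build_cohort(profiles, samples, required_codes, material):
    """Return samples of the given material from patients who carry all required codes."""
    required = set(required_codes)
    # Step 1: find patients whose coded profile contains every required code.
    eligible = {p.pseudonym for p in profiles if required <= p.codes}
    # Step 2: keep only the linked samples of the requested material.
    return [s for s in samples if s.pseudonym in eligible and s.material == material]


profiles = [
    PatientProfile("PSN-a", frozenset({"DEMO-001", "DEMO-003"})),
    PatientProfile("PSN-b", frozenset({"DEMO-001"})),
]
samples = [
    Sample("S1", "PSN-a", "serum"),
    Sample("S2", "PSN-a", "tissue"),
    Sample("S3", "PSN-b", "serum"),
]
cohort = build_cohort(profiles, samples, {"DEMO-001", "DEMO-003"}, "serum")
```

In this toy data set only sample `S1` qualifies, because only patient `PSN-a` carries both codes and only `S1` is serum. A production system would run such a query inside the database rather than in application code.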
Peter, you have been working closely with CBmed in this project. What have been the outcomes?
Peter: To bring all types of biomedical data together, to ensure appropriately standardized data for biomarker research, and ultimately to improve clinical decisions is both ambitious and exciting. We had to address the right processing and federation of the data, and we had to design the repositories. As all clinical documents are in German, a semantic layer for that language had to be put into place. Pseudonymization [1] and de-identification [1] of patient data (a mandatory requirement to ensure data privacy) are also addressed in this project. Finally, the analytics must be in place to be able to mine the data. These are just some functionalities of this system, the basis of which is the SAP Connected Health platform, which uses the real-time analytics capabilities of SAP HANA.
Markus: Considering the amount of data that must be analyzed, this becomes a Big Data challenge. SAP HANA is well suited as the basis for a highly responsive system for managing these data loads. CBmed’s strategy is to drive innovative topics such as biomarker-based precision medicine and optimized clinical trial execution and recruitment. For this, real-time analytics is important, because the system must meet strict response-time requirements; for instance, when a clinician collects and inspects patient data and wants to know whether certain patients are good candidates for a clinical trial. Therefore, fast access and response times must be guaranteed, and well-structured data (extracted from unstructured sources as previously mentioned) must be available and easily accessible at all times.
Why is the Global Biobank Week conference of interest to you?
Markus: First of all, CBmed and SAP will both be exhibiting at this event, in two adjacent booths (#34 and #35), so I am looking forward to conversations with the other delegates, and to hearing about their experiences, expectations and the challenges that they have today or expect for the future. Biobanking is a Big Data topic, and the biobank specimens hold a wealth of information about known, but also yet to be discovered, biomarkers. This treasure can only be unveiled with the right tools, applied in the right order. Only then can the data be made accessible, semantically interpretable and interoperable, ultimately leading to an ideal connection of the clinical and biospecimen information. The attendees at this event may be curious to hear how CBmed and SAP are solving this challenge, how we go about mining biomedical information, and how they themselves can benefit from our joint effort.
Peter: Just like Markus, I look forward to speaking to as many people as possible on site. I am curious to learn what the data needs and challenges are for “biobankers” and other researchers in this area. One additional aspect is cohort analysis, for which SAP has developed a dedicated application (SAP Medical Research Insights), which also uses the real-time analytics capabilities of SAP HANA. In research scenarios this has proven very useful, for instance for the analysis of melanoma patient cohorts, across hundreds of parameters per patient. At last year’s event, several presentations addressed national and international cohorts, and I would like to explore with the attendees how SAP’s technology can support these activities.
What are the greatest challenges that hamper successful mining of Big Data?
Markus: Unleashing the data from the clinical information systems is a challenge for all of us. The transfer of structured data, such as lab results, poses few problems, but the analysis of text documents requires a robust interface. Clinical text is difficult to analyze due to its compactness and idiosyncratic terminology. Privacy is another issue for projects like these. Right from the start of the project we have addressed this by storing all data in the SAP Connected Health platform in pseudonymized [1] form. Our activities are constantly monitored by a data protection expert and take into account national and international regulations like the EU GDPR or the U.S. HIPAA “safe harbor” criteria. We are also evaluating de-identification systems, which identify and eliminate sensitive passages like patient names in clinical texts, so that access to clinical documents can be granted to a broader group of researchers. This would mean that de-identification is performed on the fly with the help of a trained system. A lot of functionality has to be in place: Extract-Transform-Load (ETL) workflows (specifying which data items are embedded where, and where they have to be transferred to) and NLP as a service (extracting information through machine learning and rules, as well as ontological and terminology services adaptable to the language and clinical domain); these are just some of the aspects that have to be addressed. I will reveal results in my presentation “Secondary Use of Clinical Routine Data for Enhanced Phenotyping of Biobank Sample Data” – Conference Session 6B, “Biobanks and electronic health records”, on Thursday 14th September, 15h45.
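As an illustration of pseudonymized storage for structured records, the following Python sketch replaces a patient identifier with a keyed hash and drops direct identifiers such as the name. This is an assumption-laden example, not the mechanism used in the project: the field names, the `PSN-` prefix, and the demo key are hypothetical, and de-identifying free text (which Markus says requires a trained system) is out of scope here.

```python
import hashlib
import hmac

# Hypothetical demo key; a real deployment would keep the key in a protected key store.
SECRET_KEY = b"demo-key-not-for-production"

# Hypothetical field names: which field is replaced and which fields are dropped.
ID_FIELD = "patient_id"
DROPPED_FIELDS = ("name", "address")


def pseudonym(value: str, key: bytes = SECRET_KEY) -> str:
    """Derive a stable artificial identifier from a real one using HMAC-SHA256."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "PSN-" + digest[:16]


def pseudonymize_record(record: dict) -> dict:
    """Return a copy with the ID replaced by a pseudonym and direct identifiers removed."""
    result = {}
    for field, value in record.items():
        if field in DROPPED_FIELDS:
            continue  # direct identifiers are not stored at all
        if field == ID_FIELD:
            result[field] = pseudonym(str(value))
        else:
            result[field] = value
    return result


record = {"patient_id": "12345", "name": "Erika Mustermann", "creatinine_mg_dl": 1.1}
stored = pseudonymize_record(record)
```

Because the hash is keyed and deterministic, the same patient always maps to the same pseudonym, so records from different sources can still be linked, while only holders of the key can reproduce the mapping.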
Peter: To add to that, we soon realized that end-users of this system (researchers and physicians) have specific expectations on how the data is presented. A special interface, the “Patient Quick View,” is being developed with and for these users. Physicians simply do not have the time to browse through hundreds of pages to find the information necessary to treat patients with chronic illnesses, and therefore smarter solutions must be provided.
Markus: Within this “Patient Quick View,” the “Timeline View” visualizes the frequency and characteristics of a patient’s past encounters, or how a clinical biomarker (e.g. creatinine or HbA1c) used for monitoring chronic disorders evolved over time. Another feature we plan to implement is personalization: for a given user profile, data is shown in a special, prioritized way. Consider a surgeon: this person is more interested in seeing past operations, whereas a cardiologist is more interested in lab values and past medications. A core asset of the “Patient Quick View” will thus be a focused display of the most relevant patient parameters for a specific user.
Peter: The language issues are challenging too. Whereas many systems focus exclusively on English, in Europe systems must be adapted to many more languages. SAP is very familiar with this issue, and knows how to tackle it. NLP systems need to process documents in the local language (German, in the case of this project) and store the biomedical information in a format that enables understanding.
Markus: Language-specific resources must be built and adapted, including vocabularies or text collections (so-called corpora) that represent the language of a specific type of document (like radiology reports or dermatology discharge summaries). Mappings of German-language clinical terms to international semantic standards like ICD, LOINC, or SNOMED CT must be in place. We are pioneering the automated mapping of German-language clinical terms – as used in clinical texts – to codes of the international terminology SNOMED CT. Finally, we need corpora with human mark-up, which are necessary to train certain NLP components and, most importantly, we must assess the quality of components so that we can predict what is found and missed. For example, in the German-speaking community there are no openly available de-identified (gold-standard) clinical corpora that could foster NLP in the clinical domain. Recently, the advantages of data-driven approaches and deep learning have been demonstrated for NLP, but their use requires a certain amount of training data, ideally made available to the research community.
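The two ideas in this answer, dictionary-based mapping of local-language terms to codes and measuring what is found and missed against human mark-up, can be sketched in a few lines of Python. This is a deliberately naive illustration, not CBmed's NLP engine: the dictionary is tiny, the `DEMO-` codes are placeholders rather than real SNOMED CT identifiers, and the matching is plain string lookup.

```python
import re

# Hypothetical mini-dictionary; the codes are placeholders, NOT real SNOMED CT codes.
TERM_TO_CODE = {
    "diabetes mellitus": "DEMO-001",
    "niereninsuffizienz": "DEMO-002",
    "herzinfarkt": "DEMO-003",
}


def normalize(text: str) -> str:
    """Lowercase the text and collapse punctuation and whitespace to single spaces."""
    return re.sub(r"\W+", " ", text.lower()).strip()


def find_codes(text: str, dictionary=TERM_TO_CODE) -> set:
    """Return the codes of all dictionary terms that occur as whole words in the text."""
    padded = f" {normalize(text)} "
    return {code for term, code in dictionary.items() if f" {term} " in padded}


def precision_recall(found: set, gold: set) -> tuple:
    """Compare extracted codes with human-annotated (gold-standard) codes."""
    true_positives = len(found & gold)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# The annotator also coded "chron. Nierenversagen", a synonym the dictionary lacks.
report = "Pat. mit Diabetes mellitus, Z.n. Herzinfarkt, chron. Nierenversagen."
gold = {"DEMO-001", "DEMO-002", "DEMO-003"}
found = find_codes(report)
precision, recall = precision_recall(found, gold)
```

Here everything the sketch finds is correct, but it misses one of the three annotated concepts because the synonym is not in the dictionary. That gap is exactly why language-specific vocabularies and annotated corpora matter.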
Peter: The advantage is that many of these CBmed-specific needs may well be included in future releases of SAP products, developed with input and feedback from CBmed.
Additional information
- Markus, Peter and I will be on site in Stockholm. Visit CBmed and SAP in booths 34 and 35 at the Global Biobank Week. Pre-arrange a meeting by leaving a comment below this post, or by contacting me through @clesucr.
- Mark your calendar to attend Markus’ presentation: “Secondary Use of Clinical Routine Data for Enhanced Phenotyping of Biobank Sample Data” – Session: 6B “Biobanks and electronic health records”, Thursday 14th September, 15h45.
- Follow us on Twitter: @SAPHealth, @Clesucr and @CBmed_News. #GBWstockholm
[1] De-identification is the process used to prevent a person’s identity from being connected with information. Pseudonymization is a procedure by which the most identifying fields within a data record are replaced by one or more artificial identifiers, or pseudonyms. Anonymization is the process of either encrypting or removing personally identifiable information from data sets, so that the individuals described by the data remain anonymous (source: wikipedia.org).