Enhanced RoBERTa Model for OCR-Based EHR Parsing and Information Extraction
- 1 School of Computing, SRM Institute of Science and Technology, Tiruchirappalli, India
- 2 Chidambaram Sawri Rajan, CEO, EMEDLOGIX SOLUTIONS, Chennai, India
Abstract
Electronic Health Records (EHRs) are a valuable source of data for medical research and health analytics. However, because of their complex layouts and largely unstructured content, extracting important information from these digital documents is difficult. This paper proposes a new approach to interpreting EHRs obtained through Optical Character Recognition (OCR), built on a refined RoBERTa foundation architecture. The method efficiently extracts key elements such as section headings and bold words, which often carry high clinical significance. RoBERTa is used for semantic understanding, going beyond straightforward text recognition. In our tests, the approach achieved an accuracy of 89.2%. The paper also presents an exhaustive benchmark of the strengths and weaknesses of the deep learning techniques currently used for parsing EHRs. Our model addresses the specific problem of accurately extracting bold section headers from unstructured EHR data. The system uses a two-phase approach that combines natural language processing and image processing techniques. Thinning and normalization operations are performed first, so that bold text can be separated based on pixel intensity above a preset threshold. By removing unneeded text from the paragraphs, the method significantly improves the accuracy of bold word extraction, reaching 98%.
DOI: https://doi.org/10.3844/jcssp.2026.1434.1447
Copyright: © 2026 Balaji Ganesh Rajagopal, Amarnath C, Chidambaram Sawri Rajan and Deebalakshmi Ramalingam. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Keywords
- SOTA DL Models
- Bold Text Extraction
- Section Header
- EHR Parsing
- Clinical NLP