
Siberian Journal of Clinical and Experimental Medicine


Development of a service for automatically extraction of medical concepts from Russian unstructured texts

https://doi.org/10.29001/2073-8552-2025-40-2-201-210

Abstract

Introduction. A significant share of medical data is currently generated and stored in unstructured (textual) form. One way to process unstructured information is named entity recognition (NER). In the classical view, solving the NER problem for medical texts involves identifying objects or concepts that have a specific context related to the actions or events mentioned in the text. The National Unified Terminological System (NUTS), under development since 2022, is based on international and federal medical thesauri and other sources, and it can serve as the term set for problems of this type. At the time of the study, the scientific literature contained no information about tools that solve the NER problem for unstructured Russian-language medical texts.

Aim: To develop a tool for extracting named entities from Russian-language medical texts.

Material and Methods. Named entity recognition is performed using the NUTS as the terminological framework. The preprocessing pipeline includes full-text segmentation, sentence tokenization and dependency parsing, and word lemmatization and morphological analysis. The Annotation tool was evaluated on clinical guidelines. The primary evaluation metric is the ratio of correctly identified terms to the total number of terms extracted by experts.
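The primary metric described above can be sketched as follows. This is a minimal illustration, assuming that terms are compared as case-insensitive strings; the function name `term_recall` and the sample term lists are hypothetical and are not taken from the study.

```python
def term_recall(extracted_terms, expert_terms):
    """Share of expert-annotated terms that the tool also found."""
    # Assumption: terms match if they are equal after trimming and lowercasing.
    expert = {t.strip().lower() for t in expert_terms}
    if not expert:
        raise ValueError("expert_terms must not be empty")
    found = {t.strip().lower() for t in extracted_terms}
    return len(found & expert) / len(expert)


# Hypothetical example data, not real evaluation results.
expert = ["arterial hypertension", "myocardial infarction", "aspirin", "heart failure"]
tool = ["Arterial hypertension", "aspirin", "heart failure", "blood"]
print(f"{term_recall(tool, expert):.2f}")
```

Terms that the tool extracts but the experts did not mark (here, "blood") do not lower this ratio, so it behaves like recall rather than precision.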

Results. As part of this study, the Annotation tool for medical texts was developed. It is an automated tool for extracting and categorizing NUTS terms. The service is based on the combined use of large language models and rules. The Annotation tool can analyze texts in any language of the Indo-European family using any terminological system.
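The abstract does not specify how the rules match terms, so the following is only a sketch of one common rule-based approach: greedy longest-match lookup of lemmatized token sequences in a terminology. The lemma table, the term dictionary, and the names `lemmatize` and `annotate` are all hypothetical; the language-model component of the hybrid service is not shown.

```python
import re

# Toy lemma table standing in for a real morphological analyzer (hypothetical).
LEMMAS = {"infarctions": "infarction", "arteries": "artery"}

# Toy terminology: tuple of lemmas -> category (hypothetical, not real NUTS content).
TERMS = {
    ("myocardial", "infarction"): "Disease",
    ("coronary", "artery"): "Anatomy",
    ("aspirin",): "Drug",
}
MAX_LEN = max(len(key) for key in TERMS)


def lemmatize(text):
    """Lowercase, split into word tokens, and map each token to its lemma."""
    tokens = re.findall(r"\w+", text.lower())
    return [LEMMAS.get(token, token) for token in tokens]


def annotate(text):
    """Return (term, category) pairs, preferring the longest match at each position."""
    lemmas = lemmatize(text)
    found, i = [], 0
    while i < len(lemmas):
        for n in range(min(MAX_LEN, len(lemmas) - i), 0, -1):
            key = tuple(lemmas[i:i + n])
            if key in TERMS:
                found.append((" ".join(key), TERMS[key]))
                i += n  # skip past the matched term
                break
        else:
            i += 1  # no term starts at this position
    return found


print(annotate("Aspirin after myocardial infarctions affects coronary arteries."))
```

Matching on lemmas rather than surface forms is what lets a single dictionary entry cover inflected variants, which matters for a highly inflected language such as Russian.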

The hybrid Annotation tool automatically extracts up to 93% of terms from current unstructured clinical guideline texts. The quality of the service is comparable to that of international NER tools for English-language texts: cTAKES with 91% accuracy and MetaMap with an F1-score of 88%.

Conclusion. The article presents the Annotation tool, a hybrid service for named entity recognition in unstructured medical texts. The service was validated by extracting NUTS terms from current clinical guidelines, with subsequent verification by medical experts. The results demonstrate the promise of both this tool and the National Unified Terminological System (NUTS).

About the Authors

L. V. Ronzhin
Pirogov Russian National Research Medical University
Russian Federation

Lev V. Ronzhin, Analyst, Laboratory of Semantic Analysis of Medical Information

1 Ostrovityanova str., Moscow, 117513



P. A. Astanin
Pirogov Russian National Research Medical University
Russian Federation

Pavel A. Astanin, Analyst, Laboratory of Semantic Analysis of Medical Information; Assistant, S. A. Gasparyan Department of Medical Cybernetics and Computer Science

1 Ostrovityanova str., Moscow, 117513



S. E. Rauzina
Pirogov Russian National Research Medical University
Russian Federation

Svetlana E. Rauzina, PhD, Associate Professor, Head of the Laboratory of Semantic Analysis of Medical Information; Associate Professor, S. A. Gasparyan Department of Medical Cybernetics and Computer Science

1 Ostrovityanova str., Moscow, 117513



P. A. Yadgarova
Pirogov Russian National Research Medical University
Russian Federation

Polina A. Yadgarova, Analyst, Laboratory of Digital Development of Medical Education

1 Ostrovityanova str., Moscow, 117513



T. V. Zarubina
Pirogov Russian National Research Medical University
Russian Federation

Tatyana V. Zarubina, MD, Professor, Corresponding Member of the Russian Academy of Sciences; Director of the Institute of Digital Transformation of Medicine; Head of the S. A. Gasparyan Department of Medical Cybernetics and Computer Science

1 Ostrovityanova str., Moscow, 117513





For citations:


Ronzhin L.V., Astanin P.A., Rauzina S.E., Yadgarova P.A., Zarubina T.V. Development of a service for automatically extraction of medical concepts from Russian unstructured texts. Siberian Journal of Clinical and Experimental Medicine. 2025;40(2):201-210. (In Russ.) https://doi.org/10.29001/2073-8552-2025-40-2-201-210



This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2713-2927 (Print)
ISSN 2713-265X (Online)