Ignacio Martinez Soriano, Juan Luis Castro Peña



Automatic Medical Concept Extraction from Free Text Clinical Reports, a New Named Entity Recognition Approach



Hospital Information Systems currently hold a wide range of clinical information representations in Electronic Health Records (EHR), and most of the information contained in clinical reports is written as natural-language free text. In this context, we study the problem of automatic clinical named entity recognition from free-text clinical reports. We use Snomed-CT (Systematized Nomenclature of Medicine – Clinical Terms) as a dictionary to identify all kinds of clinical concepts, so the problem we consider is mapping each clinical entity named in a free-text report to its unique Snomed-CT ID. More generally, we have developed a new approach to the named entity recognition (NER) problem in specific domains, and we have applied it to recognize clinical concepts in free-text clinical reports. Our approach combines two types of NER: dictionary-based and machine learning-based. We use a domain-specific dictionary-based gazetteer (using Snomed-CT to obtain the standard clinical code for each clinical concept), and the main technique we introduce is to use an unsupervised shallow neural network, word2vec from Mikolov et al., to represent words as vectors, and then to perform recognition based on the distance between candidates and dictionary terms. We have applied our approach to a dataset of 318,585 clinical reports in Spanish from the emergency service of the Hospital “Rafael Méndez” in Lorca (Murcia), Spain, and preliminary results are encouraging.
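The distance-based matching described above can be illustrated with a minimal sketch. It is not the authors' implementation: the word vectors are hand-made toy values standing in for a trained word2vec model, the concept IDs are placeholders rather than real Snomed-CT codes, and the phrase averaging, cosine similarity, and 0.9 threshold are assumptions chosen only for illustration.

```python
import math

# Hypothetical toy word vectors; in practice these would come from a
# word2vec model trained on the clinical reports.
TOY_VECTORS = {
    "dolor": [0.9, 0.1, 0.0],
    "abdominal": [0.1, 0.9, 0.0],
    "barriga": [0.2, 0.8, 0.1],
    "cefalea": [0.8, 0.0, 0.6],
    "fiebre": [0.0, 0.2, 0.9],
}

# Hypothetical gazetteer: dictionary term -> placeholder concept ID
# (these are NOT real Snomed-CT identifiers).
GAZETTEER = {
    "dolor abdominal": "CONCEPT-001",
    "cefalea": "CONCEPT-002",
    "fiebre": "CONCEPT-003",
}


def phrase_vector(phrase):
    """Average the vectors of the known words; None if no word is known."""
    vecs = [TOY_VECTORS[w] for w in phrase.lower().split() if w in TOY_VECTORS]
    if not vecs:
        return None
    return [sum(column) / len(vecs) for column in zip(*vecs)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_concept(candidate, threshold=0.9):
    """Return (concept_id, term, similarity) for the closest dictionary
    term, or None if nothing reaches the (assumed) threshold."""
    cand_vec = phrase_vector(candidate)
    if cand_vec is None:
        return None
    best_term, best_sim = None, -1.0
    for term in GAZETTEER:
        term_vec = phrase_vector(term)
        if term_vec is None:
            continue
        sim = cosine(cand_vec, term_vec)
        if sim > best_sim:
            best_term, best_sim = term, sim
    if best_term is not None and best_sim >= threshold:
        return GAZETTEER[best_term], best_term, best_sim
    return None


# A colloquial candidate is mapped to the nearest dictionary term.
print(match_concept("dolor barriga"))
```

With these toy values, the colloquial candidate "dolor barriga" lands closest to the dictionary term "dolor abdominal", while a candidate made only of unknown words returns `None`.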


Snomed-CT, word2vec, doc2vec, clinical information extraction, skipgram, medical terminologies, semantic search, named entity recognition, NER, medical entity recognition


[1] A. Gangemi. A Comparison of Knowledge Extraction Tools for the Semantic Web. In P. Cimiano, O. Corcho, V. Presutti, L. Hollink, and S. Rudolph, editors, The Semantic Web: Semantics and Big Data, number 7882 in Lecture Notes in Computer Science, pages 351–366. Springer Berlin Heidelberg, Jan. 2013.

[2] S. v. Hooland, M. D. Wilde, R. Verborgh, T. Steiner, and R. V. d. Walle. Exploring entity recognition and disambiguation for cultural heritage collections. Literary and Linguistic Computing, page fqt067, Nov. 2013.

[3] Timm Heuss, Bernhard Humm, Christian Henninger, and Thomas Rippl. A comparison of NER tools w.r.t. a domain-specific vocabulary. In Proceedings of the 10th International Conference on Semantic Systems (SEM '14), Harald Sack, Agata Filipowska, Jens Lehmann, and Sebastian Hellmann (Eds.). ACM, New York, NY, USA, 100-107. 2014.

[4] Quoc V Le and Tomas Mikolov. Distributed representations of sentences and documents. arXiv preprint arXiv:1405.4053, 2014.

[5] Rumelhart, David E, Hinton, Geoffrey E, and Williams, Ronald J. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.

[6] L. Ratinov and D. Roth. Design challenges and misconceptions in named entity recognition. In CoNLL, 6. 2009.

[7] Radim Rehurek. Software Framework for Topic Modelling with Large Corpora. Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, 2010.

[8] gensim, web link: https://radimrehurek.com/gensim/models/word2vec.html

[9] gensim, web link: https://radimrehurek.com/gensim/models/doc2vec.html

[10] Jin D. Kim, Tomoko Ohta, Yoshimasa Tsuruoka, Yuka Tateisi, and Nigel Collier. Introduction to the bio-entity recognition task at JNLPBA. In Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications, JNLPBA ’04, pages 70–75, 2004.

[11] Shaodian Zhang, Noémie Elhadad. Unsupervised Biomedical Named Entity Recognition: Experiments with Clinical and Biological Texts. J Biomed Inform. 2013.

[12] Chen Y, Lasko TA, Mei Q, Denny JC, Xu H. A Study of Active Learning Methods for Named Entity Recognition in Clinical Text. Journal of Biomedical Informatics. 58:11-18. 2015.

[13] K. Gojenola, M. Oronoz, A. Pérez, A. Casillas. IxaMed: Applying Freeling and a Perceptron Sequential Tagger at the Shared Task on Analyzing Clinical Texts. Proceedings of the 8th International Workshop on Semantic Evaluation, pages 361–365, Dublin, Ireland, August 23-24, 2014.

[14] Fernando Aparicio et al. TMT: A tool to guide users in finding information on clinical texts. Procesamiento del Lenguaje Natural, [S.l.], v. 46, p. 27-34, Feb. 2010.

[15] Katona, Melinda and Richárd Farkas. “SZTENLP: Clinical Text Analysis with Named Entity Recognition.” SemEval@COLING (2014).

[16] Tseytlin E, Mitchell K, Legowski E, Corrigan J, Chavan G, Jacobson RS. NOBLE - Flexible concept recognition for large-scale biomedical natural language processing. BMC Bioinformatics. 2016.


[18] https://confluence.ihtsdotools.org/display/DOCSTART/SNOMED+CT+Starter+Guide

[19] Mikolov, Tomas, Chen, Kai, Corrado, Greg, and Dean, Jeffrey. Efficient estimation of word representations in vector space. arXiv:1301.3781, 2013.

[20] Y. Bengio, R. Ducharme, P. Vincent. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137-1155, 2003.

[21] Pastor, Mª Dolores, Navalon, Rafael. Manual de Codificacion CIE-10-Diagnosticos. Ministerio de Sanidad.

Cite this paper

Ignacio Martinez Soriano, Juan Luis Castro Peña. (2017) Automatic Medical Concept Extraction from Free Text Clinical Reports, a New Named Entity Recognition Approach. International Journal of Computers, 2, 38-46


Copyright © 2017 Author(s) retain the copyright of this article.
This article is published under the terms of the Creative Commons Attribution License 4.0