Medical Data Mining

Data mining and knowledge discovery is the process of finding patterns, trends and regularities by sifting through large amounts of data [1]. Working with data stored in databases, data mining involves building prediction (or classification) models, segmenting (or clustering) records by the similarity of their attributes, and discovering association rules (or patterns). Medical (or clinical) databases have by now accumulated large amounts of data on patients and their medical conditions. A patient's record, stored along with those of other patients, makes such a database an ideal place to look for new patterns or to validate proposed hypotheses. To exploit these large volumes of medical data, numerous inductive data analysis techniques derived from Machine Learning (ML) research have been applied successfully to discover new and useful knowledge [2, 3, 4]. However, many in the ML community consider medical data mining the most complex and problematic domain, one whose difficulties have yet to be overcome [5, 6].

Medical data mining has a very wide range of applications, of which the two most popular are diagnosis and prognosis. Diagnosis is the process of selectively gathering information concerning a patient and interpreting it, in the light of previous knowledge, as evidence for or against the presence of disorders [7]. In a prognostic process, a patient’s information is also gathered and interpreted, but the objective is to predict the future development of the patient’s condition. Because of this predictive nature, prognostic systems are frequently used as tools to plan medical treatments [8].

In data mining terms, the task in both diagnosis and prognosis is to discover the knowledge needed to interpret the gathered information. In some cases this knowledge is expressed as probabilistic relationships between clinical features and the proposed diagnosis or prognosis. In other cases, the system is designed as a black-box decision maker that offers no interpretation of its decisions. Finally, in yet other cases, a rule-based representation is chosen so as to provide the physician with an explanation of the decision. The last of these is the most convenient way for physicians to express their knowledge in medical diagnosis. In particular, if learned diagnostic rules can be presented in such a form, physicians are much more likely to trust the resulting diagnoses. Thus, the major challenge presented by medicine is to develop technology that provides trusted hypotheses, based on measures which can be relied upon in medical research and clinical hypothesis formulation [9].
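
To make the rule-based option above more concrete, the following is a minimal Python sketch of a system that returns its conclusion together with the rule that produced it, so the physician can read the reasoning. Everything in it is an assumption made for illustration: the feature names (`temperature_c`, `cough`), the thresholds, the conclusions and the `diagnose` helper are hypothetical, hand-written rather than learned from data, and have no clinical validity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# A patient is represented as a mapping from feature name to numeric value.
Patient = Dict[str, float]


@dataclass
class Rule:
    description: str                      # human-readable condition shown to the physician
    condition: Callable[[Patient], bool]  # executable form of the same condition
    conclusion: str                       # hypothesis proposed when the condition holds


# Hypothetical toy rules with made-up thresholds (not clinical guidance).
# Rules are ordered from most specific to most general.
RULES: List[Rule] = [
    Rule(
        "temperature_c >= 38.0 AND cough == 1",
        lambda p: p["temperature_c"] >= 38.0 and p["cough"] == 1,
        "suspect respiratory infection",
    ),
    Rule(
        "temperature_c >= 38.0",
        lambda p: p["temperature_c"] >= 38.0,
        "fever of unknown origin",
    ),
]


def diagnose(patient: Patient, rules: List[Rule] = RULES) -> Tuple[str, str]:
    """Return (conclusion, explanation) for the first rule whose condition holds."""
    for rule in rules:
        if rule.condition(patient):
            explanation = f"IF {rule.description} THEN {rule.conclusion}"
            return rule.conclusion, explanation
    return "no rule fired", "no rule matched the recorded features"


if __name__ == "__main__":
    conclusion, explanation = diagnose({"temperature_c": 38.5, "cough": 1})
    print(conclusion)
    print(explanation)
```

A black-box model would return only the conclusion. Here the second element of the result is the IF-THEN rule itself, which is the part a physician can inspect, question or reject.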
