Background
Contemporary pulmonary embolism (PE) research, in many cases, relies on data from electronic health records (EHRs) and administrative databases that use International Classification of Diseases (ICD) codes. Natural language processing (NLP) tools can be used for automated chart review and patient identification. However, uncertainty remains about the validity of ICD-10 codes and NLP algorithms for patient identification.

Methods
The PE-EHR+ study has been designed to validate ICD-10 codes as principal or secondary discharge diagnoses, as well as NLP tools set out in prior studies, for identifying patients with PE within EHRs. Manual chart review by two independent abstractors using predefined criteria will be the reference standard. Sensitivity, specificity, and positive and negative predictive values will be determined. We will assess the discriminatory function of code subgroups for intermediate- and high-risk PE. In addition, the accuracy of NLP algorithms in identifying PE from radiology reports will be assessed.

Results
A total of 1,734 patients from the Mass General Brigham health system have been identified: 578 with ICD-10 principal discharge diagnosis codes for PE, 578 with codes in the secondary position, and 578 without PE codes during the index hospitalization. Patients within each group were selected randomly from the entire pool of patients at the Mass General Brigham health system. A smaller subset of patients will also be identified from the Yale-New Haven Health System.
Data validation and analyses will be forthcoming.

Conclusions
The PE-EHR+ study will help validate efficient tools for identification of patients with PE in EHRs, improving the reliability of observational studies and randomized trials of patients with PE that use electronic databases.
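The four validation metrics the protocol names all come from the same 2x2 table comparing the ICD-10 code (or NLP flag) against the chart-review reference standard. A minimal sketch of that computation (the counts in the usage example are illustrative, not study data):

```python
def diagnostic_accuracy(tp, fp, fn, tn):
    """Compute the four metrics the PE-EHR+ protocol specifies.

    tp/fp/fn/tn are cell counts from a 2x2 table: the ICD-10 code or
    NLP flag (test) versus manual chart review (reference standard).
    """
    return {
        "sensitivity": tp / (tp + fn),  # flagged among true PE cases
        "specificity": tn / (tn + fp),  # unflagged among true non-PE cases
        "ppv": tp / (tp + fp),          # true PE among flagged patients
        "npv": tn / (tn + fn),          # true non-PE among unflagged patients
    }

# Hypothetical counts, for illustration only:
metrics = diagnostic_accuracy(tp=90, fp=5, fn=10, tn=95)
```

With these counts, sensitivity is 90/100 = 0.90 and PPV is 90/95, which is why PPV depends on how the validation cohort is sampled while sensitivity does not.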
Clinical notes in electronic health records offer many possibilities for predictive text classification tasks. The interpretability of these classification models is critical for decision making in the clinical domain. Using topic models for text classification of electronic health records allows topics to serve as features, making the classification more interpretable. However, selecting the most effective topic model is not trivial. In this work, we propose considerations for selecting a suitable topic model based on predictive performance and an interpretability measure for text classification. We compare 17 different topic models in terms of both interpretability and predictive performance on an inpatient violence prediction task using clinical notes. We find no correlation between interpretability and predictive performance. In addition, our results show that although no model outperforms the others on both variables, our proposed fuzzy topic modeling algorithm (FLSA-W) performs best in most settings for interpretability, whereas two state-of-the-art methods (ProdLDA and LSI) achieve the best predictive performance.
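The "topics as features" idea can be sketched with scikit-learn: a topic model turns each note into a low-dimensional topic mixture, and a classifier is trained on those mixtures instead of raw word counts. This is an assumption-laden illustration, not the paper's pipeline: LDA stands in for any of the 17 compared models, and the four notes and labels are toy stand-ins for real clinical data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for de-identified clinical notes and violence labels.
notes = [
    "patient agitated and shouting at staff",
    "patient threatened nurse during evening round",
    "routine checkup vitals stable no complaints",
    "patient calm and cooperative at discharge",
]
labels = [1, 1, 0, 0]  # 1 = violence incident documented, 0 = none

counts = CountVectorizer().fit_transform(notes)
# Each note becomes a distribution over 2 topics; these interpretable
# topic proportions are the feature matrix for the classifier.
topics = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(counts)
clf = LogisticRegression().fit(topics, labels)
```

The interpretability payoff is that a coefficient now weights a whole topic (inspectable via its top words) rather than thousands of individual tokens.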
The Cross-Industry Standard Process for Data Mining (CRISP-DM), despite being the most popular data mining process for more than two decades, is known to leave organizations lacking operational data mining experience puzzled and unable to start their data mining projects. This is especially apparent in the first phase, Business Understanding, at the conclusion of which the data mining goals of the project at hand should be specified, a step that arguably requires at least a conceptual understanding of the knowledge discovery process. We propose to bridge this knowledge gap from a data science perspective by applying natural language processing (NLP) techniques to an organization's e-mail exchange repositories to extract explicitly stated business goals from the conversations, thus bootstrapping the Business Understanding phase of CRISP-DM. Our NLP-Automated Method for Business Understanding (NAMBU) generates a list of business goals that can subsequently be used for further specification of data mining goals. Validation of the results against manual business goal extraction from the Enron corpus demonstrates the usefulness of the NAMBU method when applied to large datasets.
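To make "extracting explicitly stated business goals from e-mail" concrete, here is a deliberately crude cue-phrase extractor. It is not the NAMBU pipeline; the cue patterns and the example sentence are invented for illustration, and a real system would need far richer linguistic analysis.

```python
import re

# Hypothetical cue phrases that often introduce an explicit goal statement.
GOAL_CUES = re.compile(
    r"\b(?:we (?:aim|want|need) to|our (?:goal|objective) is to)\s+([^.]+)\.",
    re.IGNORECASE,
)

def extract_goal_phrases(email_text):
    """Return the clause following each explicit goal cue in an e-mail body."""
    return [match.strip() for match in GOAL_CUES.findall(email_text)]

# Illustrative input, not Enron data:
goals = extract_goal_phrases(
    "Thanks for the call. Our goal is to reduce customer churn by 10%. Best, A."
)
```

Candidate phrases from such a pass would still need deduplication and ranking before they could seed the data mining goals of a CRISP-DM project.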
The Summary of Product Characteristics from the European Medicines Agency is a reference document on medicines in the EU. It contains textual information for clinical experts on how to safely use medicines, including adverse drug reactions. Using natural language processing (NLP) techniques to automatically extract adverse drug reactions from such unstructured text helps clinical experts use this information effectively and efficiently in daily practice. Such techniques have been developed for Structured Product Labels from the Food and Drug Administration (FDA), but no research has focused on extraction from the Summary of Product Characteristics. In this work, we built an NLP pipeline that automatically scrapes Summaries of Product Characteristics online and then extracts adverse drug reactions from them. In addition, we have made the method and its output publicly available so that they can be reused and further evaluated in clinical practice. In total, we extracted 32,797 common adverse drug reactions for 647 common medicines scraped from the Electronic Medicines Compendium. A manual review of 37 commonly used medicines indicated good performance, with a recall and precision of 0.99 and 0.934, respectively.
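A Summary of Product Characteristics follows a fixed template in which section 4.8 ("Undesirable effects") lists adverse reactions, often grouped by frequency. The sketch below shows a first-pass extractor in that spirit; it is not the paper's published pipeline, and the regex patterns and the sample document are simplifying assumptions (real SmPC text is far messier, with tables and MedDRA terms).

```python
import re

def extract_adrs(smpc_text):
    """Crude first pass: isolate section 4.8 and pull comma-separated
    reaction terms listed after frequency headings like 'Very common:'."""
    # Section 4.8 runs until section 4.9 ('Overdose') or end of text.
    section = re.search(r"4\.8[^\n]*\n(.*?)(?=\n4\.9|\Z)", smpc_text, re.DOTALL)
    if not section:
        return []
    # Capture the remainder of any line whose heading ends in 'common'/'rare'.
    lines = re.findall(
        r"(?:common|rare)[^:\n]*:\s*([^\n]+)", section.group(1), re.IGNORECASE
    )
    return [term.strip() for line in lines for term in line.split(",")]

# Minimal fabricated SmPC fragment for illustration:
sample = (
    "4.8 Undesirable effects\n"
    "Very common: nausea, headache\n"
    "Rare: rash\n"
    "4.9 Overdose\nSupportive care."
)
adrs = extract_adrs(sample)
```

The manual review against 37 medicines reported in the abstract is exactly the kind of check such a heuristic extractor needs before clinical use.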
OBJECTIVE
Incidental durotomy is a common complication of elective lumbar spine surgery, seen in up to 11% of cases. Prior studies have suggested that patient age and body habitus, along with a history of prior surgery, are associated with an increased risk of dural tear. To date, no calculator has been developed for quantifying risk. Here, the authors' aim was to identify independent predictors of incidental durotomy, present a novel predictive calculator, and externally validate a novel method to identify incidental durotomies using natural language processing (NLP).

METHODS
The authors retrospectively reviewed all patients who underwent elective lumbar spine procedures for degenerative pathologies at a tertiary academic hospital between July 2016 and November 2018. Data were collected regarding surgical details, patient demographic information, and patient medical comorbidities. The primary outcome was incidental durotomy, which was identified both through manual extraction and with the NLP algorithm. Multivariable logistic regression was used to identify independent predictors of incidental durotomy. Bootstrapping was then employed to estimate and correct for optimism in the model; the corrected model was converted to a calculator and deployed online.

RESULTS
Of the 1279 elective lumbar surgery patients included in this study, incidental durotomy occurred in 108 (8.4%). Risk factors for incidental durotomy on multivariable logistic regression were increased surgical duration, older age, revision versus index surgery, and case starts after 4 PM. This model had an area under the curve (AUC) of 0.73 in predicting incidental durotomies.
The previously established NLP method was used to identify cases of incidental durotomy and demonstrated excellent discrimination (AUC 0.97).

CONCLUSIONS
Using multivariable analysis, the authors found that increased surgical duration, older patient age, case starts after 4 PM, and a history of prior spine surgery are all independent positive predictors of incidental durotomy in patients undergoing elective lumbar surgery. Additionally, the authors put forth the first version of a clinical calculator for durotomy risk that spine surgeons could use prospectively when counseling patients about their surgical risk. Lastly, the authors presented an external validation of an NLP algorithm used to identify incidental durotomies from free-text operative notes. The authors believe that these tools can aid clinicians and researchers in their efforts to prevent this costly complication of spine surgery.
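The optimism correction mentioned in the Methods follows the standard bootstrap recipe: refit the model on each bootstrap sample, measure how much better it scores on the sample it was fit to than on the original data, and subtract the average of that gap from the apparent AUC. A minimal sketch, assuming the caller supplies `fit` and `auc` callables (the abstract's model was multivariable logistic regression; the constant stand-ins below are purely illustrative):

```python
import random

def optimism_corrected_auc(X, y, fit, auc, n_boot=200, seed=0):
    """Bootstrap optimism correction of a model's apparent AUC.

    fit(X, y) -> fitted model; auc(model, X, y) -> discrimination score.
    """
    rng = random.Random(seed)
    apparent = auc(fit(X, y), X, y)
    n = len(y)
    optimism = 0.0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]       # resample with replacement
        Xb = [X[i] for i in idx]
        yb = [y[i] for i in idx]
        mb = fit(Xb, yb)
        # Optimism = score on the sample the model was fit to,
        # minus its score back on the original data.
        optimism += auc(mb, Xb, yb) - auc(mb, X, y)
    return apparent - optimism / n_boot
```

When the bootstrap models score no better on their own samples than on the original data, the average optimism is zero and the corrected AUC equals the apparent AUC; with a flexible model it is typically lower, which is the overfitting penalty the authors corrected for.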