Misdiagnosis of acute and chronic otitis media in children can result in significant consequences from either undertreatment or overtreatment. Our objective was to develop and train an artificial intelligence algorithm to accurately predict the presence of middle ear effusion in pediatric patients presenting to the operating room for myringotomy and tube placement.
We trained a neural network to classify images as “normal” (no effusion) or “abnormal” (effusion present) using tympanic membrane images from children taken to the operating room with the intent of performing myringotomy and possible tube placement for recurrent acute otitis media or otitis media with effusion. Model performance was tested using held-out cases and fivefold cross-validation.
The mean training time for the neural network model was 76.0 (SD ± 0.01) seconds. Our model approach achieved a mean image classification accuracy of 83.8% (95% confidence interval [CI]: 82.7–84.8). In support of this classification accuracy, the model produced an area under the receiver operating characteristic curve performance of 0.93 (95% CI: 0.91–0.94) and F1-score of 0.80 (95% CI: 0.77–0.82).
Artificial intelligence–assisted diagnosis of acute or chronic otitis media in children may generate value for patients, families, and the health care system by improving point-of-care diagnostic accuracy. With a small training data set composed of intraoperative images obtained at time of tympanostomy tube insertion, our neural network was accurate in predicting the presence of a middle ear effusion in pediatric ear cases. This diagnostic accuracy performance is considerably higher than human-expert otoscopy-based diagnostic performance reported in previous studies.
Despite best efforts, diagnostic accuracy for acute and chronic otitis media has yet to consistently surpass 70% for primary care providers, pediatricians, and otolaryngologists. There are significant patient and public health implications for misdiagnoses that produce undertreatment or overtreatment.
With a small training data set of ground-truth data determined by myringotomy, our neural network accurately predicted the presence of a middle ear effusion in 83.8% of novel cases, considerably higher than reported human-expert otoscopy-based diagnostic performance.
Otitis media and its acute and chronic variants are among the most-common childhood infections, with at least 80% of all children experiencing at least one episode before the age of 3, contributing to >6 to 10 million clinic visits and millions of antibiotic prescriptions each year in the United States.1,2 Given the prevalence of otitis media, tympanostomy tube placement is the most-common pediatric surgical procedure performed annually in the United States, accounting for 667 000 procedures in 2006 alone.3,4 Although advances in antibiotic strategies and the widespread adoption of the pneumococcal conjugate vaccine5 have lessened morbidity and mortality from otitis media, the disease process still confers a massive burden on global public health. Consequences of undiagnosed and untreated otitis media still include hearing loss, delayed language development,6 and morbidity from extra- and intracranial complications.7,8 From a health care systems perspective, the cost of care for otitis media ranges from 3 to 5 billion dollars annually in the United States.9,10
For decades, a technique for the consistent and accurate diagnosis of otitis media, both acute and chronic, has eluded us. This diagnostic challenge has produced an array of responses ranging from targeted educational programs for medical trainees to novel otoscopic approaches and techniques, absorbance and acoustic admittance measurements, integration of audiometric adjuncts such as tympanometry, and clinical trials comparing the effectiveness of one or more of these approaches.11–17 Despite these efforts, diagnostic accuracy has yet to consistently surpass 70% for all providers across the full spectrum of primary care, pediatrics, and otolaryngology.14,18–21 There are significant implications for such misdiagnoses. Undertreatment may lead to morbid consequences such as mastoiditis, meningitis, and sensorineural hearing loss.7,8,22,23 Overtreatment may lead to excessive antibiotic usage and increased development of resistance, unnecessary tympanostomy tube procedures, excess days off from school for children and work for parents, and significant deficits in quality of life for both children and parents.24–29 The persistent gap in diagnostic accuracy underscores a pressing need for innovation using an alternative approach.
Dovetailing on significant interest in machine-learning applications to challenges in hearing loss,30,31 alternative approaches for otitis media diagnosis have emerged in the form of artificial intelligence computer vision algorithms in the past 5 years. Artificial intelligence (ie, machine learning) algorithms are typically deployed for predictive modeling as opposed to explanatory modeling that uses traditional hypothesis testing frameworks to compare sample means, variance, correlation, or other causal constructs. Deep-learning algorithms, a subset of machine learning, learn patterns in data through iterative training steps. The fully trained deep-learning algorithm is then used to make predictions on previously unseen data. In the case of applications to medical image data, deep-learning computer vision algorithms work by forming a prediction on a novel image case (eg, radiographic data from computed tomography or MRI sources, pathology slides) after having been trained on data labeled by human experts. A full review of deep-learning applications in medicine is beyond the scope of this introduction, but its promises and limitations have been reviewed in other published works.32,33 Deep learning and other machine-learning algorithms have been applied to improve on otitis media diagnosis by several research teams.34–39 In modeling to detect middle ear effusions, a common technique underlying most of this work has been to train algorithms to classify middle ear or tympanic membrane pathology using otoscopic images taken either with a video otoscope or endoscope without confirmation that an effusion was present. A limitation of this input data is that some effusions (eg, serous) may be missed with external otoscopy, particularly given diagnostic accuracy limitations.
In this study, we developed a neural network model exclusively with intraoperative data. Because an effusion may escape detection on simple video otoscopy, ground-truth training data labeling with a myringotomy is prudent. The objective of our approach was to develop and train a neural network to accurately predict the presence of middle ear effusion in pediatric patients presenting to the operating room for myringotomy and tube placement.
This study protocol was reviewed and approved by the Mass General Brigham institutional review board (protocol no. 2019P003086).
Image Data Acquisition
Pediatric patients were brought to the operating room by 1 of 5 board-certified pediatric otolaryngologists (M.G.C., C.J.H., G.R.D., T.Q.G., J.S.) for an examination under anesthesia with the intent of performing myringotomy and possible tube placement for either recurrent acute otitis media or otitis media with effusion. Image data were collected from cases taking place from November 2019 to September 2020. Intraoperative pictures of the medial external auditory canal and tympanic membrane were taken by using a 0-degree rigid fiber-optic endoscope coupled to a high-definition camera (“IMAGE1 HD Camera Head;” Karl Storz SE & Co KG, Tuttlingen, Germany). The camera captured raw images at 1920- × 1080-pixel resolution (ie, 2.1 megapixels). After image acquisition, a myringotomy blade was used to make a stab incision in the anterior-inferior quadrant of the tympanic membrane.
Images were included in the modeling data set if they met our quality criteria: at least 75% of the surface area of the tympanic membrane was visible, and resolution was sufficient to assess major landmarks (tympanic annulus, malleus umbo). Each ear was classified by the attending surgeon at the time of myringotomy on the basis of the type of effusion and presence or absence of tympanic membrane pathology. If an effusion was present, the effusion was categorized as either “serous or mucoid,” “mucopurulent,” or “purulent.” Tympanic membrane pathology was categorized as “none,” “minor” (presence of tympanosclerosis, retraction, atrophy), or “major” (tympanic membrane perforation). For the purposes of our deep-learning model target binary output, cases that had no effusion were defined as “normal.” Cases that had an effusion, with any degree of tympanic membrane pathology, were considered “abnormal.”
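The labeling rule described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `binary_label` and the convention of passing `None` for "no effusion" are assumptions made for the example, not details reported in the study.

```python
from typing import Optional

# Effusion categories recorded by the attending surgeon at myringotomy.
EFFUSION_TYPES = {"serous or mucoid", "mucopurulent", "purulent"}


def binary_label(effusion: Optional[str]) -> str:
    """Map the myringotomy finding to the binary model target.

    `effusion` is None when no effusion was found (hypothetical convention).
    Tympanic membrane pathology does not enter the rule: any effusion
    makes the case "abnormal".
    """
    if effusion is None:
        return "normal"
    if effusion not in EFFUSION_TYPES:
        raise ValueError(f"unknown effusion category: {effusion!r}")
    return "abnormal"


print(binary_label(None))        # no effusion found
print(binary_label("purulent"))  # effusion present
```

Keeping the rule in one function makes it explicit that the target depends only on the myringotomy finding and not on the appearance of the tympanic membrane.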
The goal of our model was to accurately classify high-resolution endoscopic images of the tympanic membrane as either normal or abnormal, indicative of the absence or presence of a middle ear effusion, respectively (Fig 1).
We performed transfer learning using the ResNet-34 neural network architecture. The ResNet neural nets were previously trained on the ImageNet database composed of millions of high-resolution images within 22 000 categories.40 Our database was split into 80% of images used for training and 20% for validation to compute out-of-training-set performance. Images were resized to 361 by 361 pixels and normalized for ImageNet. Training data augmentation included random crops of original images to a minimum scale of 0.15 to enable our model to train on fine details that could differentiate a normal ear from abnormal (eg, a subtle fluid meniscus, fluid bubbles, localized hypervascularity). Our model was trained over 15 epochs while unfreezing the final neural network layer weights.
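The 80%/20% split can be illustrated with a short standard-library sketch. It is not the study's code: the helper `split_train_valid`, the seed, and the use of integer image identifiers are assumptions for illustration. The total of 338 images is taken from the Results (126 normal plus 212 abnormal).

```python
import random


def split_train_valid(items, valid_fraction=0.2, seed=0):
    """Shuffle once and hold out `valid_fraction` of the items for validation."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_valid = round(len(shuffled) * valid_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]


image_ids = list(range(338))       # stand-ins for the 338 image files
train_ids, valid_ids = split_train_valid(image_ids)
print(len(train_ids), len(valid_ids))
```

With 338 images, a 20% hold-out rounds to 68 validation images and leaves 270 for training, which matches the per-repeat counts reported in the Results.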
We used Monte Carlo cross-validation resampling with 5 repetitions. The final trained model performance for each repeat was assessed using that repeat’s held-out validation data. Specific metrics of accuracy (in percentage correct predictions on the validation set), a confusion matrix, area under the receiver operating characteristic curve, and F1-score were computed. The F1-score is the harmonic mean of precision and recall. A 95% confidence interval (CI) for the metrics was generated by using the results of the Monte Carlo resampling.
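As a rough sketch of how such summary metrics can be computed, the snippet below derives an F1-score from confusion-matrix counts and a 95% CI from per-repeat results. The text does not state how the interval was constructed; the normal approximation (mean ± 1.96 standard errors) used here is an assumption, and with only 5 repeats a t-based interval would be wider. The per-repeat accuracies are hypothetical placeholder values, not study data.

```python
import math
import statistics


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall (assumes tp > 0)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mean_ci95(values):
    """Mean and normal-approximation 95% CI across resampling repeats."""
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    half_width = 1.96 * sem
    return mean, mean - half_width, mean + half_width


# Hypothetical accuracies from 5 Monte Carlo repeats (illustration only).
repeat_accuracies = [0.82, 0.84, 0.85, 0.83, 0.84]
mean, lower, upper = mean_ci95(repeat_accuracies)
print(f"accuracy {mean:.3f} (95% CI: {lower:.3f}-{upper:.3f})")
```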
To produce insight into why the ResNet-34 network made the predictions it did, we generated layer activation heatmaps using a gradient class activation map technique, Grad-CAM.41 The Grad-CAM technique allowed us to visualize which parts of a test image were important in the model in making its prediction. To illustrate how the neural network learned important features from an endoscopic image of the tympanic membrane, we generated layer activation maps for one of the abnormal images spanning the final 4 neural network layers. We also applied the Grad-CAM technique to 2 novel intraoperative cases to demonstrate the model’s classification prediction, prediction probability, and the influential image features activated in the final layer of the model.
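For readers unfamiliar with the technique, the standard Grad-CAM definition can be written compactly. The notation below follows the original Grad-CAM formulation rather than anything specific to this study: y^c is the model's score for class c, A^k is the k-th feature map of the chosen convolutional layer, and Z is the number of spatial positions in that map.

```latex
\alpha_k^c = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^c}{\partial A_{ij}^k},
\qquad
L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\!\left( \sum_{k} \alpha_k^c \, A^k \right)
```

Each feature map is weighted by the average gradient of the class score with respect to it, and the ReLU keeps only regions that push the prediction toward class c. The resulting map is then upsampled and overlaid on the input image as a heatmap.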
Software and Hardware Environment
The neural network modeling was completed using the fast.ai deep-learning library in Python. Model training and performance assessment were completed on a custom-built Linux desktop computer with an NVIDIA GeForce RTX 2070 8GB graphics processing unit.
Our image database comprised 126 normal and 212 abnormal ear images meeting our quality criteria. In the abnormal image subset, myringotomy determined that 95 (45.5%) were a serous or mucoid effusion, 62 (29.7%) were mucopurulent, and 55 (26.3%) were purulent. The majority of abnormal images had no significant tympanic membrane pathology (n = 177, 84.7%), followed by minor (n = 34; 16.3%) and major (n = 1, 0.5%). For each cross-validation repeat, the model was trained on 270 images and used to predict 68 held-out image cases. The mean training time for the model was 76.0 (SD ± 0.01) seconds. Our deep-learning model approach achieved a mean image classification accuracy of 83.8% (95% CI: 82.7–84.8) across all 5 Monte Carlo repeats (Table 1). In support of this classification accuracy, the model produced an area under the receiver operating characteristic curve performance of 0.93 (95% CI: 0.91–0.94).
|Classification Accuracy (%)|F1-Score|Area Under the ROC|
|---|---|---|
|83.8 (82.7–84.1)|0.80 (0.77–0.82)|0.93 (0.91–0.94)|
Mean (95% CI). ROC, receiver operating characteristic curve.
To further interrogate the performance of the model on the held-out data, a confusion matrix was generated for each Monte Carlo repeat (Fig 2 A–E). The confusion matrix demonstrates individual-level predictions in terms of the correct classifications for normal (ie, true-negative) and abnormal (ie, true-positive) as well as the inverse erroneous predictions (ie, false-negative, false-positive). There was no consistent trend in the proportion of accurate or erroneous predictions of normal or abnormal across each fold. For instance, the model did not appear to consistently overcall normal images as abnormal, or vice versa.
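A confusion matrix of this kind can be tallied directly from paired true and predicted labels. The sketch below is illustrative only: the function name `confusion_counts` and the toy label lists are assumptions, with “abnormal” treated as the positive class as in the text.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive="abnormal"):
    """Tally true/false positives and negatives for a binary classifier."""
    counts = Counter(tp=0, fp=0, tn=0, fn=0)
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            counts["tp" if pred == positive else "fn"] += 1
        else:
            counts["fp" if pred == positive else "tn"] += 1
    return counts


# Toy example (not study data): 6 held-out cases.
y_true = ["abnormal", "abnormal", "abnormal", "normal", "normal", "normal"]
y_pred = ["abnormal", "abnormal", "normal", "normal", "normal", "abnormal"]
counts = confusion_counts(y_true, y_pred)
accuracy = (counts["tp"] + counts["tn"]) / len(y_true)
print(dict(counts), round(accuracy, 3))
```

Comparing the false-positive and false-negative counts across repeats is one simple way to check whether a model systematically overcalls one class.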
Grad-CAM was used to illustrate the neural network’s ability to “learn” patterns and structural relationships in endoscopic images of the medial external auditory canal and tympanic membrane. Qualitatively, in one image predicted to be an abnormal case (prediction probability of 0.87), the model seemed to rely on the retraction at the malleus umbo bordered by bubbles of the effusion in the middle ear in making its abnormal prediction (Fig 3; “Last layer”). In earlier layers, the model appeared to learn to distinguish the tympanic membrane from the bordering medial external auditory canal epithelium (Fig 3; “−3 Layer”). In later layers, the network activated along ridges and folds of the uneven surface of the tympanic membrane caused by the retraction and effusion (Fig 3; “−1 Layer”).
Using the fully validated model, we tested 2 novel case images from 2 children presenting to the operating room for planned myringotomy and tubes for presumed otitis media. In 1 known normal case, the model predicted a normal ear with high probability (P = 1.000; Fig 4A). The last layer of the neural network activated at the region of the umbo. In a known abnormal case, the model predicted an “abnormal ear” with high probability (P = .985; Fig 4B). In this instance, the last layer of the neural network activated along the angulated umbo with a prominent blood vessel, as well as the anterior-superior ridge of an atelectatic central portion of the tympanic membrane.
High-quality data with ground-truth labels are requisite for training accurate predictive models. With a relatively small training data set composed of 267 intraoperative ear images, our neural network approach was successful in predicting the presence of a middle ear effusion in 83.8% (95% CI: 82.7–84.8) of novel pediatric ear cases. This diagnostic accuracy performance is considerably higher than human-expert otoscopy-based diagnostic performance.14,17,18,20,21
Our work builds on previously published deep learning and machine-learning modeling approaches used to diagnose otitis media effusion in adults and children. Our model’s accuracy metric was consistent with other works that report classification accuracy in the range of 75% to 91%.34,36–39 The main difference in our approach was the use of ground-truth data determined by myringotomy. A serous effusion can easily be missed by otoscopy if the fluid is transparent, if a tympanic membrane is thickened or otherwise opaque, and if an adequate view is not achieved. A model trained on imperfect input data will likely not perform as well as a model trained with inputs closer to the ground truth. Our approach also illustrates the utility of using transfer learning to achieve suitable model performance using a small data set. This is contrary to a commonly held belief that machine-learning algorithms require massive data sets to train accurate models.42 Deep learning as a field is experiencing rapid advancement such that best-in-class model performance is quickly becoming possible without the need and expense of procuring massive data sets.
Another purported constraint of deep learning is the perception that the predictions come from a “black box” whereby the means and mechanism for how the model arrives at a prediction are indecipherable or unavailable. This limitation has been propagated as a source of mistrust inhibiting deployment of deep-learning models.43 Over the past several years, a number of solutions to address deep-learning model interpretability have been developed.44 In our study, we used a neural network layer activation technique to produce intuitive visual explanations of our model’s learning and prediction process.41 This allowed us to explore the morphologic characteristics of normal and abnormal tympanic membrane images that human experts might also use in making the diagnosis. Through the incorporation of interpretability steps in the modeling process, we become closer to developing trust in a model to make predictions at the point of care.
Enhanced diagnostic accuracy of otitis media in children may produce the most value in cases in which otitis media is truly not present. For example, ruling out middle ear effusion or infection may curtail antibiotic use, limit unnecessary referrals and time spent waiting for specialist consultations, decrease the number of operating room cases (eg, examination under anesthesia), reduce parental leave from work and life responsibilities, or some combination thereof. We must temper the value of our model in its current state because it is not ready for deployment. We captured our data in an ideal and controlled intraoperative setting with an endoscope and high-resolution camera. A more-feasible and scalable solution would be a model that is trained with a mid- or low-resolution image capture system found in affordable consumer mobile devices coupled with a training program to teach parents or caregivers how to accurately take pictures of their children’s tympanic membranes at home. Such a system will need to be easy for front-line health care workers and parents to use and must capture images with a sufficient view of the tympanic membrane.
There are limitations of our approach and model that require mentioning. Achieving an adequate database size for a desired model accuracy is challenging in deep learning because multiple components contribute, including complexity of the task, parameter tuning, and quality of the data; there is no “sample size” or power calculation analogous to probabilistic statistical techniques (ie, $n = (z s / e)^2$; where n is sample size; z is constant for desired level of confidence; s is estimated SD; e is desired margin of error). The performance of our deep-learning model is ultimately associated with the chosen modeling approach and amount of training data available. We suspect that our model would benefit from additional data, but logistic constraints exist that limit the ability to obtain unlimited amounts of intraoperative data from “real” pediatric patients. With respect to our model architecture, transfer learning has been demonstrated to be an effective deep-learning approach with limited data.45 Recent attempts to use convolutional neural nets (eg, ResNet-34 and its deeper variants) in biomedical imaging tasks have produced state-of-the-art accuracies approaching 100%.46,47
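For contrast with deep learning, where no such closed form exists, the classical sample-size calculation for estimating a mean can be worked through in a line. The formula is the standard one built from the quantities named above (z, s, e); the numeric values are hypothetical and chosen only to make the arithmetic easy to follow.

```latex
n = \left( \frac{z \, s}{e} \right)^{2}
  = \left( \frac{1.96 \times 10}{2} \right)^{2}
  = 9.8^{2}
  = 96.04
  \;\Rightarrow\; n = 97 \text{ (rounded up)}
```

Here z = 1.96 corresponds to 95% confidence, s = 10 is the assumed SD, and e = 2 is the desired margin of error.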
Another limitation is that our model is likely only to be generalizable to novel cases that are similar in image resolution and anatomic perspective. Image cases obtained with angled endoscopes or alternative image capture systems (eg, mobile phone camera) may not achieve the same predictive accuracy. This is an important consideration given the goal of deploying the approach to the hands of parents or front-line health care workers who are not necessarily experts in otoscopy or otologic anatomy recognition. Moreover, intraoperative assessment allows for the complete removal of cerumen. The removal of obstructive cerumen is a prerequisite for obtaining a satisfactory view of the tympanic membrane. Outside of the operating room, cerumen removal may be challenging in certain cases, especially in younger children.
Our approach is subject to a selection bias because clinical suspicion based on history, physical examination, and other diagnostic information (eg, tympanometry, audiometry) helped determine individual patients’ candidacy for an operative examination under anesthesia with planned myringotomy. This bias in the data, and possibly other confounders as mentioned in the previous section, limits the generalizability of our model’s output. Generalizability is a key consideration in the development of predictive models, and advancing the usefulness of this model will depend on the further development of an image database robust to context variability in settings beyond the data sources used in this study. This issue cannot be overstated because selection bias and confounders in training deep-learning models can produce unintended consequences. A poignant example of confounding occurred when a deep-learning model was developed to detect pneumothoraces on chest radiograph imaging.48,49 The model achieved high accuracy, but post hoc analysis attributed the performance, in part, to the presence of a chest tube in the training radiograph images. A similar deep-learning approach was developed to screen chest radiographs for pneumonia, and the resulting model’s performance was confounded by the hospital site and department from which the radiograph originated.50
Artificial intelligence–assisted diagnosis of acute or chronic otitis media in children may generate value for patients, families, and the health care system by improving point-of-care diagnostic accuracy. Misdiagnoses are not benign, because there are significant implications from both undertreatment and overtreatment that can result. Several published reports involving automatic classification of tympanic membrane and middle ear pathology images from outpatient clinics or scraped from the web have laid conceptual foundations for the utility of this approach. Our work adds validation of the concept using ground-truth intraoperative data. Further work on scaling the technology and adapting for clinic, home, and/or parent-use will be important milestones for realizing the benefits of this approach. Deep learning is an iterative process, and we are continuing work on further validating our high-resolution model prospectively with additional “real” patient cases. We are also in the process of developing a deep-learning model trained with low-resolution imagery captured with a consumer grade smart device platform.
Drs Crowson, Cohen, and Hartnick conceptualized and designed the study, collected data, performed the analyses, drafted the initial manuscript, and reviewed and revised the manuscript; Drs Fracchia, Diercks, Setlur, and Gallagher coordinated data collection, collected data, and critically reviewed and revised the manuscript for important intellectual content; and all authors approved the final manuscript as submitted and agree to be accountable for all aspects of the work.
FUNDING: No external funding.
POTENTIAL CONFLICT OF INTEREST: The authors have indicated they have no potential conflicts of interest to disclose.
FINANCIAL DISCLOSURE: The authors have indicated they have no financial relationships relevant to this article to disclose.