This was a very good question on Quora. As someone with a background in machine learning who works at a medical start-up, I have a personal interest in machine learning for diagnosis. How much can we use AI in medicine? What obstacles stand in the way? Why, despite so much interest, have attempts to automate diagnosis failed?
We can’t talk about machine learning in medicine without looking at the history of expert systems. In the 1980’s, it looked like expert systems were going to revolutionize medicine. They didn’t. Why?
An expert system is an “IF…THEN” collection of rules, based on propositional logic. The first expert system was DENDRAL, begun at Stanford in 1965 by Edward Feigenbaum, Joshua Lederberg, and Carl Djerassi (inventor of the birth control pill), among others; it was a program to identify organic molecules from their mass spectra. The key, as Feigenbaum recalled, was that they worked closely with Djerassi, pressing him to incorporate all his knowledge. The project would have failed if it had been executed by computer scientists alone; they needed a chemist’s domain expertise.
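To make the “IF…THEN collection of rules” concrete, here is a minimal forward-chaining rule-engine sketch in Python. The facts and rules are invented for illustration; real systems like MYCIN had hundreds of rules with certainty factors attached.

```python
# Minimal forward-chaining rule engine: repeatedly fire any rule whose
# IF-conditions are all present in the current set of known facts.
# Facts and rules here are made-up examples, not real medical knowledge.

rules = [
    ({"fever", "cough"}, "respiratory_infection"),
    ({"respiratory_infection", "chest_pain"}, "suspect_pneumonia"),
]

def infer(facts, rules):
    """Expand a set of facts by firing rules until nothing new follows."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            # A rule fires when all its IF-conditions are known facts.
            if conditions <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

print(infer({"fever", "cough", "chest_pain"}, rules))
# Note that the second rule can fire only after the first one adds
# "respiratory_infection" — chaining is what makes this more than a lookup table.
```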
The first medical expert systems were created in the 1970’s, and some (such as CADUCEUS) were later commercialized. But today, expert systems are rarely used in everyday diagnosis, reserved mostly for educational settings and clinical laboratories. Despite the fact that even simple statistical prediction rules outperform doctors’ clinical judgment, the high hopes for expert systems weren’t borne out. Part of the problem seems to have been inconvenience: expert systems required a lot of extra work from doctors, they had unfriendly user interfaces, and they required electronic medical data that wasn’t available. The story of MYCIN is an illustrative example. In research conducted at Stanford Medical School, MYCIN outperformed infectious disease experts at correct diagnosis. But it was never used in practice, for two reasons: first, there were legal and ethical issues in case it gave an incorrect diagnosis (who’s liable?); and second, it required an unreasonable amount of data entry from a busy clinician.
The problem with machine learning in medicine is not the machine learning. Machine learning and AI have come a long way since the 80’s, and even then automated systems outperformed doctors in experimental settings. Tempting as it may be for theorists like me to work on developing ever better algorithms, the hard work is in improving everything else: the user interface, the data collection, and the relationship between the computer science community and the medical community.
The top-voted answer in the Quora discussion, from a medical student named Jae Won Joh, describes the complexity of a diagnostic decision and claims that “the human body is incapable of being defined by any algorithm, no matter how bloody brilliant it is.” He’s diagnosed patients and most computer scientists haven’t, so his analysis is worth considering. But the tree of diagnostic questions he describes is an algorithm: tree methods are a cornerstone of machine learning, and they work with far more variables than even the dizzying array of options that doctors contend with. True, there are things humans are much better at than machines, like reading social cues (patients lie). But machines are much, much better than people at thinking statistically. Doctors, being human, think in narrative terms, which allows them to narrow down the search space but introduces all kinds of biases. There’s plenty of evidence of statistical models outperforming human judgment, including in areas like hiring that are even fuzzier and more subjective than medicine.
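The point that a tree of diagnostic questions *is* an algorithm can be shown literally: here is a toy yes/no question tree as a nested Python dict, with a traversal function. The questions and diagnoses are invented; a learned tree would look the same, just with splits chosen from data over many more variables.

```python
# A diagnostic decision tree as a nested dict: each internal node asks a
# yes/no question; each leaf carries a (made-up) diagnosis.
tree = {
    "question": "fever?",
    "yes": {
        "question": "chest_pain?",
        "yes": "suspect pneumonia",
        "no": "suspect flu",
    },
    "no": {
        "question": "rash?",
        "yes": "suspect allergy",
        "no": "no diagnosis",
    },
}

def classify(node, answers):
    """Walk the tree, following the patient's yes/no answers to a leaf."""
    while isinstance(node, dict):
        node = node["yes"] if answers[node["question"]] else node["no"]
    return node

print(classify(tree, {"fever?": True, "chest_pain?": False}))
# prints "suspect flu"
```

A learned tree (e.g. scikit-learn’s `DecisionTreeClassifier`) is exactly this structure, except the questions and thresholds are chosen automatically to best split the training data.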
There are fairly straightforward responses to many of Jae Won Joh’s criticisms. How do you account for genetics? Well, the cost of sequencing is dropping like a rock, and many genome-wide association studies are actually beginning to build risk models (most use logistic regression, but a few use support vector machines, which seem to perform better). How do you account for diseases whose symptoms are the same? Um, priors? How do you account for the costs of running different tests? Um, decision theory is based on minimizing a loss function. How do you account for false positives/negatives? Jesus H. Christ, how is this even a question? Medicine isn’t some mystical thing beyond the reach of formulae. The language of statistics and machine learning is perfectly adequate to the task of medical diagnosis.
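The “priors” and loss-function points take only a few lines of arithmetic: Bayes’ rule separates two diseases with identical symptom profiles via their base rates, and an expected-loss comparison decides whether a test is worth running. Every number below is invented for illustration.

```python
# Two hypothetical diseases with the *same* symptom profile:
# P(symptoms | disease) is identical, so only the priors (base rates)
# separate them. All probabilities are made up.
p_symptoms_given = {"disease_A": 0.9, "disease_B": 0.9}
prior = {"disease_A": 0.01, "disease_B": 0.001}  # A is 10x more common

# Posterior via Bayes' rule, renormalized over the two hypotheses.
unnorm = {d: p_symptoms_given[d] * prior[d] for d in prior}
total = sum(unnorm.values())
posterior = {d: v / total for d, v in unnorm.items()}
print(posterior)  # A comes out 10x more probable than B, driven purely by priors

# Decision theory: run a test only if its expected loss beats acting
# without it. Made-up costs: the test costs 50; missing disease B costs 10,000.
cost_test = 50
cost_missed_B = 10_000
expected_loss_no_test = posterior["disease_B"] * cost_missed_B
expected_loss_with_test = cost_test  # assume a perfect test, for simplicity
print(expected_loss_no_test > expected_loss_with_test)  # True: run the test
```

With these (made-up) numbers, disease B is only ~9% probable, but the asymmetric cost of missing it still makes the test worthwhile; that asymmetry is also exactly how false positives and false negatives enter the loss function.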
What’s hugely inadequate: software engineering and data collection.
All medical software is worse quality than it should be. I have never been to a doctor without seeing her curse the slowness of the medical record database. The reason is somewhat mysterious, but probably has something to do with the fact that we don’t have a real market in health care. Until diagnostic software is convenient and usable, doctors will be perfectly right to shun it. Seconds count when lives are at stake. What works in a university setting won’t fly in a real hospital (and software written by scientists is famously inferior to commercial or open-source software).
The other problem is the colossal inadequacy of medical data for training machine learning algorithms. Perhaps for privacy reasons, perhaps due to the expense of data collection, it’s incredibly hard to find large biomedical data sets. Even “open” or “public” datasets are usually only available by application, and often require you to be a principal investigator in a biology department. If you’re J. Random Hacker, you’re used to norms of data sharing and easy access; when you try to find a good dataset to train your cool new algorithm on, you will be totally stymied.
In many computational research communities, such as image processing, there are standard public datasets to test models on. (Here’s a sampling. Some test images are so universal that we know them by name: Lena, Barbara, Peppers, Cameraman, etc.) Want to know which algorithm performs better? Test them both on the same dataset and you can get a definitive answer. There is no equivalent, as far as I know, in biology and medicine. It’s less democratic. Data is in messier formats and much less centralized and accessible — partly because biology is complicated, but perhaps also because there’s been no major push to get vast quantities of medical data in J. Random Hacker’s hands.
This data problem is not something that computer scientists and statisticians can do alone. We need people like Djerassi — domain experts in medicine and biology who are genuinely committed to making medicine computational. It’s easy to want to swoop in like Delbruck and Gamow and do biology better than the biologists, and I think there’s a lot of potential in that approach, but we have to get the data from somewhere. It’s enough of a challenge just getting scientists to publish their data at all; centralizing it and standardizing formats to make it usable for testing models on — the hardest part of a data scientist’s job — is a whole new ballgame.
Not being a biologist, I’m probably ignorant of existing attempts to solve this problem. But from my attempts to find medical and genetic data, my impression is that there really is a data problem. And there’s clearly a software problem. It’s not machine learning that’s inadequate for diagnosis; it’s the input (data) and the output (software interfaces) that are holding us back.
