Computers learn from data in a process called machine learning. Computer scientists use machine learning to automatically generate captions for images. This sounds simple, but it actually requires really sophisticated techniques. Scientists at the cutting edge of a new field called quantitative biology are using similar techniques to transform medical research.
Computers get calculus, but they’re confused by bears
You know those little “I’m not a robot” tests online, where you have to identify all the images with trucks in them? Those security walls work because real humans are naturally good at identifying objects in photos. Computers are not.
Consider the picture below.
To you, this probably looks like a picture of bears.
Circa 2005, a computer might have correctly guessed that there was a bear in this picture a little more than half the time. Fast forward to 2012, and computers could tell that there was an animal in the picture, but they might have thought it was a wolf.
Enter machine learning
Traditionally, people have had to try to explain to computers how to caption images. But while we’re good at identifying bears intuitively, we’re not very good at describing that process to a computer. We don’t break the image down into pixels and quantify the colors; we just see a bear and know it’s a bear.
Machine learning bypasses the human-to-computer communication step. In machine learning, the computer looks for patterns linking images to their captions and learns from them. That means the more images we can give a computer to learn from, the better it will get at generating captions.
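To make the idea concrete, here is a deliberately tiny sketch of learning from labeled examples. It is not Google’s captioning model: the “features” and labels are made up, and the learner is just a nearest-centroid classifier, one of the simplest pattern-learning methods there is. The point is only that the computer is never told what a bear looks like; it averages the examples it has seen and labels new images by similarity.

```python
# A toy "learner": average the feature vectors seen for each label,
# then label a new example by its closest average (nearest centroid).
# The 2-D features (e.g., brown-ness, fur-texture score) are invented
# purely for illustration.

def train(examples):
    """examples: list of (feature_vector, label). Returns per-label centroids."""
    sums, counts = {}, {}
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
            counts[label] = 0
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}

def predict(centroids, features):
    """Return the label whose centroid is closest (squared distance)."""
    def dist(center):
        return sum((c - f) ** 2 for c, f in zip(center, features))
    return min(centroids, key=lambda label: dist(centroids[label]))

# Made-up training data: a few "bear" and "wolf" images.
training = [([0.9, 0.8], "bear"), ([0.85, 0.75], "bear"),
            ([0.2, 0.3], "wolf"), ([0.25, 0.2], "wolf")]
model = train(training)
print(predict(model, [0.8, 0.7]))  # falls nearest the "bear" centroid
```

The more labeled examples `train` sees, the better each centroid reflects its category; real captioning models replace the centroids with deep neural networks, but the more-data-better-model logic is the same.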
In 2015, researchers on the Google Brain Team used 10 million captioned images to train an image captioning model. According to their model, the image above contains “two brown bears sitting on top of rocks.” That’s pretty good.
Others have developed machine learning models so sophisticated that they can tell the difference between images of animals that humans struggle to distinguish, like related breeds of dog.
From identifying bears to classifying cancer
The real strength of machine learning is that the computer isn’t told what to look for by humans. That means computers can find patterns we wouldn’t think to examine.
In 2011, Daphne Koller’s lab at Stanford University used machine learning to find unknown patterns in breast cancer tissue.
When cancerous tissue is biopsied, it’s sliced thin, stained, placed on a glass slide, and imaged. Those images are analyzed by an experienced pathologist. The pathologist looks for patterns that are consistent with images they’ve encountered before to suggest a treatment and predict how quickly the disease is likely to progress.
Koller’s team set out to see if computers could make the same predictions. They used machine learning to create an “automated pathologist” that could distinguish the biopsies of patients with more severe prognoses from those with less severe ones.
Remarkably, the model’s predictions were not based on known prognostic indicators. It was patterns in the cells adjacent to the tumor cells that tipped the model off. This turned out to be among the earliest evidence that the microenvironment in which a tumor finds itself helps shape how the cancer will progress.
From images to sequencing data
Tumor biopsies are an important diagnostic tool, but they’re not the only one. As costs continue to decrease, there is mounting interest in using sequencing to diagnose cancer and rare diseases. In much the same way that computers can learn to identify what patterns link images with their captions, they could also learn to link genotypes with phenotypes.
In the last ten years, genome-wide association studies have identified thousands of mutations that correlate with particular diseases. But not all of those variants are important. “There is signal in the genetics, it’s just that we don’t know how to read it,” said Koller in a recent talk at SynBioBeta. She argued that finding that signal is a job for machine learning.
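A drastically simplified sketch of what an association study measures: for each variant, compare how often it appears in people with the disease versus without it. The genotypes below are invented 0/1 vectors (variant absent/present), and real studies use far more individuals plus statistical corrections, but the core comparison looks like this:

```python
# Toy association scan: per-variant difference in carrier frequency
# between cases (disease) and controls (healthy). All data invented.

cases    = [[1, 0, 1], [1, 1, 1], [1, 0, 0]]   # genotypes of affected individuals
controls = [[0, 0, 1], [0, 1, 0], [1, 0, 1]]   # genotypes of healthy individuals

def variant_scores(cases, controls):
    """For each variant position, return case frequency minus control frequency."""
    scores = []
    for j in range(len(cases[0])):
        case_freq = sum(genotype[j] for genotype in cases) / len(cases)
        ctrl_freq = sum(genotype[j] for genotype in controls) / len(controls)
        scores.append(case_freq - ctrl_freq)
    return scores

scores = variant_scores(cases, controls)
print(scores)  # the first variant stands out; the others carry no signal
```

This frequency gap is the raw “signal in the genetics” Koller refers to; the hard part, which machine learning targets, is separating the few variants that actually drive disease from the thousands that merely correlate.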
If we continue at the current, faster-than-exponential rate, we will have sequenced between one and two billion genomes by 2025. The amount of phenotypic data is also increasing. In a project called the UK Biobank, data on everything from urinary biomarkers to lifestyle choices is being collected from 500,000 people over 30 years. The US is currently embarking on a similar project called “All of Us.” That’s a lot of training data for machine learning models.
Koller is confident that machine learning can help us solve all sorts of predictive problems in biology. For instance, we can create genetic diversity in a test tube, probe that diversity with potential drug candidates, and use machine learning to find patterns. Those patterns could help narrow down candidate drugs and curb the exponentially escalating cost of drug discovery.
Koller predicts that the combination of machine learning and biology will define the next scientific epoch. But it isn’t without its challenges. Machine learning algorithms are really good at homing in on subtle signals, she said, which means they could get thrown off course by subtle artifacts. Good algorithms start with good data.
A big thank you to Daphne Koller, Founder and CEO of Insitro, for giving an amazing talk at SynBioBeta and sharing a recording of another amazing talk at the MARS conference.