By 2025, the pharmaceutical industry is expected to spend more than $3 billion on AI, up from $463 million in 2019. Artificial intelligence clearly adds value, but even its supporters say it is not yet reaching its full potential.
There are many reasons why reality has yet to live up to the hype, but limited data is a big one.
Given the vast amount of data collected every day, from step counts to electronic medical records, data scarcity is one of the least anticipated obstacles.
The traditional big data/AI approach uses hundreds or even thousands of data points to characterise a person’s face. For the training to be reliable, the AI needs thousands of such examples so it can recognise faces regardless of gender, age, ethnicity, or disease.
Examples of face recognition are readily available. Drug development is a different story.
“When you imagine all the different ways you could tailor a drug… that dense data set that covers all the possibilities is less rich,” said Verseon co-founder and CEO Adityo Prakash.
Small changes to a molecule alter what a drug does in our bodies, so detailed information about every possible variation is needed.
This can require millions of samples of data, which Prakash says even the biggest drug companies don’t have.
Limited predictive capabilities
AI can be quite useful when the “rules of the game” are known, he continued, citing protein folding as an example. Protein folding works the same way across many species, so the likely structure of a functional protein can be inferred because biology follows certain rules. Drug design, by contrast, involves entirely new combinations and is less amenable to AI “because you don’t have enough data to cover all the possibilities,” Prakash said.
Even when data sets are used to predict similar things, such as small-molecule interactions, the predictions are limited. “That’s because no negative information has been released,” he said. Negative data, the records of compounds that failed, is important for AI predictions.
He added: “Many times much of what is published is not reproducible.” Small data, questionable data, and a lack of negative information all limit the predictive power of AI.
Too much noise
Another challenge is the noise in the large amount of available data. PubChem, one of the largest public databases, contains more than 300 million bioactivity data points from high-throughput screens, according to Jason Rolfe, co-founder and CEO of Variational AI.
“But these data are both unbalanced and noisy,” he said. “Overall, more than 99% of the tested compounds are inactive.”
Of the less than 1% of compounds that appear active in high-throughput screens, most are false positives, Rolfe said, caused by aggregation, assay interference, reactivity, or contamination.
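The imbalance Rolfe describes can be made concrete with a toy calculation (the numbers and the simple classifiers below are hypothetical illustrations, not drawn from PubChem): on a screen where 99% of compounds are inactive, a model that labels everything “inactive” scores 99% accuracy, and even a model that recovers every true active can have half its hits be false positives.

```python
# Toy sketch of why a 99%-inactive screening dataset misleads naive
# metrics. All counts are assumed for illustration.

def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    """Fraction of predicted actives that are truly active."""
    return tp / (tp + fp) if (tp + fp) else 0.0

# Hypothetical screen: 1,000,000 compounds, 1% truly active.
actives, inactives = 10_000, 990_000

# A trivial model that calls everything "inactive" still looks great:
tp, fn, tn, fp = 0, actives, inactives, 0
print(f"always-inactive accuracy: {accuracy(tp, tn, fp, fn):.1%}")  # 99.0%

# A model that flags 20,000 compounds, half of them false positives
# (mirroring aggregation/interference artifacts), has perfect recall
# but only 50% precision:
tp, fp = 10_000, 10_000
print(f"precision of the hit list: {precision(tp, fp):.1%}")  # 50.0%
```

The design point is that raw accuracy hides the problem entirely; only precision on the hit list exposes the false positives.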
X-ray crystallography, which identifies the precise spatial arrangement of a ligand and its protein target, can be used to train artificial intelligence for drug development. However, despite great progress in crystal structure prediction, the protein conformations induced by drug binding are still not predicted well.
Similarly, molecular docking, which simulates the binding of drugs to target proteins, is notoriously imprecise, Rolfe said.
“The correct spatial arrangement of a drug and its protein target is accurately predicted only about 30% of the time, and predictions of pharmacological activity are even less reliable.”
With the astronomical number of drug-like molecules, even AI algorithms that can accurately predict the binding between ligands and proteins face a daunting challenge.
“This requires acting against the primary target without affecting the tens of thousands of other proteins found in the human body, so they don’t cause side effects or toxicity,” Rolfe said. Current artificial intelligence algorithms are not up to this task.
He suggested the use of physics-based drug-protein interaction models to improve accuracy but noted that they are computationally intensive and require approximately 100 hours of CPU time per drug, which may limit their usefulness for studying large numbers of molecules.
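Rolfe’s figure of roughly 100 CPU-hours per molecule makes the scaling problem easy to quantify. A back-of-the-envelope sketch (the cluster size and library size below are assumed for illustration; only the CPU-hour figure comes from his estimate):

```python
# Back-of-the-envelope cost of physics-based screening.
# Only CPU_HOURS_PER_MOLECULE reflects the figure quoted in the text;
# the cluster and library sizes are hypothetical.

CPU_HOURS_PER_MOLECULE = 100   # Rolfe's approximate estimate
CORES = 10_000                 # assumed cluster size

def wall_clock_days(n_molecules, cores=CORES):
    """Days to simulate n_molecules, assuming perfect parallel scaling."""
    return n_molecules * CPU_HOURS_PER_MOLECULE / cores / 24

# Screening a one-million-compound library would occupy the whole
# cluster for over a year:
print(f"{wall_clock_days(1_000_000):.0f} days")  # 417 days
```

Even under the generous assumption of perfect scaling, this is why such simulations are reserved for short-listed candidates rather than whole libraries.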
However, computer physics simulations are a step towards overcoming the current limitations of artificial intelligence, as noted by Prakash.
“They can give you artificially generated virtual information about how two things interact,” he said. Physics-based simulations, however, do not provide insight into how a compound breaks down inside the body.
Another challenge involves siloed data systems and disconnected data sets.
“Many facilities still use paper batch records, so useful information… is not readily available electronically,” said Moira Lynch, senior director of innovation in Thermo Fisher Scientific’s bioprocessing group. She added: “Data available electronically comes from a variety of sources, is available in a variety of formats, and is stored in a variety of locations.”
According to Jaya Subramaniam, head of life sciences product and strategy at Definitive Healthcare, these datasets are also limited in scope and reach.
Subramaniam cited two main reasons: decentralised data and de-identified data. “No single entity has a complete set of data of any type, whether it’s claims, EMR/EHR, or lab diagnoses.”
In addition, patient privacy laws require that personal information be removed, making it difficult to track an individual’s journey from diagnosis to outcome. This slows the pace at which pharmaceutical companies can generate insights.
Despite the unprecedented amount of information being generated, there is still not enough relevant and usable data. Only when these obstacles are overcome will the power of artificial intelligence be truly unleashed.