Newswise — In late December 2019, U.S. analysts monitoring global biothreats used technology developed at the U.S. Department of Energy’s Pacific Northwest National Laboratory (PNNL) to begin tracking an unidentified viral pneumonia spreading in China. About a month later, the rest of the world would know that disease as COVID-19.

Some of the earliest insights came courtesy of a team of U.S. analysts who constantly monitor open-source text for information about active and potential biological, chemical and radiation threats to humans, animals and the environment. This information helps them track all aspects of an ongoing event like COVID-19 from inception to its impact on the world.

Data-mining software developed at PNNL called BioFeeds plays a key role, helping analysts by automating the process of combing through tens of thousands of articles each day. Dozens of government agencies and international partners rely on the reports from BioFeeds – developed with support from the U.S. Department of Homeland Security National Biosurveillance Integration Center – to quickly get relevant information about active, future, and emerging biothreats, including COVID-19.

“COVID-19 is an example of why BioFeeds exists,” said Lauren Charles, a senior data scientist at PNNL, who is leading the development of advanced analytic algorithms for this software. “Now it’s also important because this software can look below the continuous talk about COVID-19 and monitor other potential biothreats happening in the world right now.”

Currently, those threats include an outbreak of bubonic plague in the Democratic Republic of the Congo, as well as the largest outbreak of dengue fever ever recorded in Argentina.

A daily harvest of information worldwide

So far, BioFeeds has harvested information from more than 800,000 reports, news articles, blogs, scientific research, web search alerts, and other publicly available information in 90 different languages.

The software “reads” the articles using natural language processing algorithms to extract information regarding an event and its impacts. Then, it automatically labels relevant information from a taxonomy of about 1,500 tags, including the type of threat (disease or chemical agent, for example), specific event details, impacts on humans and critical infrastructure, and control measures being applied for mitigation. The software also flags special cases, such as new events, novel or unusual pathogens, and abnormal characteristics of ongoing events.
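The labeling step described above can be sketched in a few lines. This is only an illustration, not the BioFeeds implementation: it swaps the natural language processing algorithms for plain term matching, and the three-entry `TAXONOMY`, its tag names, and the `tag_article` function are all hypothetical stand-ins for the roughly 1,500-tag taxonomy.

```python
import re

# Hypothetical three-entry taxonomy; the real one has about 1,500 tags.
TAXONOMY = {
    "threat:disease": ["pneumonia", "plague", "dengue"],
    "impact:human": ["hospitalized", "deaths", "cases"],
    "control:measure": ["quarantine", "vaccination", "travel ban"],
}


def tag_article(text):
    """Return the sorted list of tags whose terms appear in the text."""
    lowered = text.lower()
    tags = set()
    for tag, terms in TAXONOMY.items():
        for term in terms:
            # Word boundaries keep "cases" from matching inside "suitcases".
            if re.search(r"\b" + re.escape(term) + r"\b", lowered):
                tags.add(tag)
                break
    return sorted(tags)


article = "Officials ordered a quarantine after 40 cases of viral pneumonia."
print(tag_article(article))
# ['control:measure', 'impact:human', 'threat:disease']
```

Term matching of this kind is easy to audit but blind to context, which is the limitation the team's machine learning work is meant to address.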

Typically, for an ongoing event like COVID-19, the software harvests tens of thousands of articles a day, then applies filters to reduce the data to the most important items for review by analysts. This makes it possible for people confronting a rapidly evolving situation to navigate a flood of information and react appropriately.

Analysts can query the tagged data, find similar articles, add tags, and generate reports for immediate, daily or weekly notifications. Any user can also subscribe to customized web feeds based on user-defined queries or specific alerts.

Improving reports with more context and user-tagged data

For the past few years, the PNNL BioFeeds team has been working on automating the analysts’ workflow using state-of-the-art analytics, such as artificial intelligence, machine learning and deep learning. The main challenge is training an algorithm to approach the expert judgment of an analyst in identifying reliable sources and key information. Because some words share a spelling but have different meanings, an algorithm that just searches for keywords might flag information as important when, given more context, it is actually irrelevant.

For example, cryptosporidium is a parasite that often causes diarrhea. It is commonly called “crypto” for short. But an algorithm searching articles for “crypto” could deliver results about virtual cryptocurrencies or cryptozoology, which is the study of mythical creatures such as Bigfoot.
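A small sketch makes the difference concrete. It compares a bare keyword search with a check that also looks at the surrounding words. The two context vocabularies and both function names are assumptions made up for this example; the actual BioFeeds models are not described at this level of detail.

```python
import re

# Hypothetical context vocabularies chosen for illustration only.
HEALTH_CONTEXT = {"parasite", "diarrhea", "outbreak", "water", "infection", "cases"}
OTHER_CONTEXT = {"bitcoin", "currency", "blockchain", "token", "bigfoot", "exchange"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def keyword_match(text):
    """Naive search: flags any article that mentions 'crypto'."""
    return "crypto" in tokenize(text)


def context_match(text):
    """Flags 'crypto' only when the other words lean toward the health sense."""
    words = set(tokenize(text))
    if "crypto" not in words:
        return False
    health = len(words & HEALTH_CONTEXT)
    other = len(words & OTHER_CONTEXT)
    return health > other


health_story = "Crypto outbreak linked to pool water sickens dozens with diarrhea."
finance_story = "Crypto prices fall as bitcoin exchange halts trading."

print(keyword_match(health_story), keyword_match(finance_story))  # True True
print(context_match(health_story), context_match(finance_story))  # True False
```

The keyword search passes both stories to the analyst, while the context check keeps only the one about the parasite.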

Identifying the location of an event is another challenge for natural language processing algorithms for a similar reason. For example, there’s a city named The, which is also one of the most common words in the English language. Also, the names of many diseases, organizations and people include a location.
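Both pitfalls show up even in a toy place-name lookup. In the sketch below, the `GAZETTEER`, `DISEASE_NAMES` and `STOPWORDS` lists and the two function names are hypothetical and exist only to illustrate the problem; this is not how BioFeeds resolves locations.

```python
import re

# Hypothetical gazetteer and disease list, for illustration only.
GAZETTEER = {"Argentina", "China", "Nile", "The"}
DISEASE_NAMES = ["West Nile virus", "Rift Valley fever"]
STOPWORDS = {"the", "a", "of", "in"}


def naive_locations(text):
    """Every word in the text that appears in the gazetteer."""
    return [w for w in re.findall(r"[A-Za-z]+", text) if w in GAZETTEER]


def filtered_locations(text):
    """Gazetteer lookup that masks disease names and skips stopword matches."""
    for name in DISEASE_NAMES:
        # Blank out the disease name so its embedded place name is not counted.
        text = re.sub(re.escape(name), " ", text, flags=re.IGNORECASE)
    return [
        w
        for w in re.findall(r"[A-Za-z]+", text)
        if w in GAZETTEER and w.lower() not in STOPWORDS
    ]


sentence = "The ministry confirmed West Nile virus cases in Argentina."
print(naive_locations(sentence))     # ['The', 'Nile', 'Argentina']
print(filtered_locations(sentence))  # ['Argentina']
```

The filter has a cost: an article that really is about the city named The would now be missed, which is one reason surrounding context matters more than fixed rules.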

To provide more context, PNNL researchers have the algorithms pull information from several sentences. They also have an additional tool: information from analysts as they use BioFeeds data. Articles tagged by analysts provide additional data that can be used to improve the performance of the machine learning algorithms.

“Monthly automatic retraining, combined with user-in-the-loop processes, help us develop algorithms that only send data relevant to a mission,” added Charles, a doctor of veterinary medicine whose other credentials include degrees in mathematics, bioinformatics, and plant pathology, and a doctorate in fisheries, wildlife, and conservation biology.
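The retraining idea can be illustrated with a deliberately tiny text classifier. This sketch assumes a naive Bayes model with add-one smoothing, which the source does not name as the BioFeeds method, and the articles, labels and function names are invented for the example. It shows how a prediction can change once analyst-tagged articles are folded into the training data.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def train(labeled):
    """Fit a tiny naive Bayes model from (text, label) pairs."""
    word_counts = {}
    doc_counts = Counter()
    for text, label in labeled:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab


def predict(model, text):
    """Return the label with the highest add-one-smoothed log probability."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(doc_counts):
        counts = word_counts[label]
        denom = sum(counts.values()) + len(vocab)
        score = math.log(doc_counts[label] / total_docs)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                score += math.log((counts[word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training data: a seed set, then articles tagged by analysts.
seed = [
    ("dengue outbreak spreads in the city", "relevant"),
    ("bitcoin crypto prices rally", "irrelevant"),
]
analyst_tagged = [
    ("crypto parasite found in pool water", "relevant"),
    ("crypto cases rise after water contamination", "relevant"),
]
query = "crypto detected in town water supply"

print(predict(train(seed), query))                   # irrelevant
print(predict(train(seed + analyst_tagged), query))  # relevant
```

Rerunning `train` on a schedule with the accumulated analyst tags is the simplest form of the periodic retraining Charles describes.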

The foundation of this automated analysis is a tool called Automated Analytics and Integration of Data. It can be applied to other areas of interest, like the maritime environment, because it learns from users interacting with it, Charles added.

The key project team includes Charles, who is in charge of the project’s analytics capabilities; Scott Dowson, who leads the software engineering; and Michelle Hart, the project manager.

# # #