In response to the COVID-19 pandemic, the White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19). CORD-19 is a resource of over 51,000 scholarly articles, including over 40,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. This freely available dataset is provided to the global research community so that it can apply recent advances in natural language processing and other machine learning (ML) techniques to generate new insights in support of the ongoing fight against this infectious disease. There is a growing urgency for these approaches because the rapid acceleration in new coronavirus literature makes it difficult for the medical research community to keep up. Together, they have issued a call to action!
That is where Definitive Logic’s Machine Learning specialists stepped up to Kaggle’s “COVID-19 Dataset Challenge.” This challenge asks participants to develop text and data mining tools that can help the medical community answer high-priority scientific questions. The CORD-19 dataset represents the most extensive machine-readable coronavirus literature collection available for data mining to date. This gives the worldwide AI research community the opportunity to apply text and data mining approaches to find answers to questions within, and connect insights across, this content in support of the ongoing COVID-19 response efforts worldwide. Many of these questions are suitable for text mining, and researchers are encouraged to develop text mining tools that provide insights into them.
DL put together a team of our brightest to use their ML and automation skills to digest scientific articles and help the medical community keep up to date with the latest publications on COVID-19. Over the past 30 days, the DL team accomplished the following:
- Built an AWS Cloud environment to host our data architecture
- Built a graph database using OrientDB, which helped the team explore and understand the relationships in the data (Figure 2)
- Built a SQL Server database to prepare the data for visualization
- Developed Python code to extract, transform, and load (ETL) the source data into SQL Server
- Built ranking views after researching, via web scraping, the many ways the medical field scores its information:
  - Everyone on the team collected ranking data for authors, journals, institutions, impact factors, keyword counts over very large text data, etc.
  - With the ranking data collected, the team designed a ranking algorithm that accounts for seven variables, which led to analysis and labeling with K-means clustering, an ML method for finding clusters and cluster centers within a set of unlabeled data (Figure 1)
- Built a single SQL view bringing all the data together
This ultimately led to the creation of an interactive Tableau dashboard to aid discovery within the body of knowledge surrounding COVID-19.
Figure 1. K-means clustering. Four unique clusters of papers were identified.
Figure 2. Graph database diagram showing relationships among papers, journals, authors, and citations.
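For readers unfamiliar with K-means, the clustering step mentioned above can be illustrated with a minimal sketch. This is not the team's code: it is a plain-Python version of the standard K-means loop (assign each point to its nearest center, then move each center to the mean of its points), and everything in it is an assumption made for illustration, including the function names, the synthetic `papers` data, and the idea that each paper is described by seven ranking scores already scaled to the range 0 to 1.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Component-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(column) / n for column in zip(*points))


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [
        min(range(len(centers)), key=lambda c: squared_distance(p, centers[c]))
        for p in points
    ]


def k_means(points, k, max_iter=100, seed=0):
    """Basic K-means; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Keep the old center if a cluster ends up empty.
            new_centers.append(mean_point(members) if members else centers[c])
        if new_centers == centers:  # no center moved: converged
            break
        centers = new_centers
    return assign(points, centers), centers


def make_demo_papers(n_per_group=10, seed=42):
    """Hypothetical data: 7 scaled ranking scores per paper, 4 loose groups."""
    rng = random.Random(seed)
    papers = []
    for anchor in (0.1, 0.4, 0.6, 0.9):
        for _ in range(n_per_group):
            papers.append(tuple(
                min(1.0, max(0.0, anchor + rng.uniform(-0.05, 0.05)))
                for _ in range(7)
            ))
    return papers


papers = make_demo_papers()
labels, centers = k_means(papers, k=4)
for cluster_id in range(4):
    print(f"cluster {cluster_id}: {labels.count(cluster_id)} papers")
```

In practice, a library implementation such as scikit-learn's `KMeans` would be the more typical choice, since it adds better initialization and multiple restarts. The value of `k=4` here simply mirrors the four clusters reported in Figure 1.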
We are extremely proud of the team that volunteered to help create this visual:
- James Eselgroth
- Mark DeRosa
- John Bonfardeci
- Matt Sorando
- Jon Owens
- Catie Reed
- Jean Nehring
- Segundo Espinoza
- Allison Kiteley
- Tony Depew
- Susan Love
The team has completed the first round and will move on to the second round, which ends June 16, 2020.
Here is the link to the challenge, and under each task, there is a submission from John Bonfardeci.
You can also see our interactive dashboard, hosted by Tableau Public.