COVID-19 Dataset Builder

Real-time Datasets for Hidden Relationship Detection

Drug Compounds Associated to Viruses, Genes or Proteins

               SARS-CoV-2   ACE2    Interferon β
Dexamethasone  0.000        0.000   0.000
Vitamin D      0.000        0.000   0.000
Ivermectin     0.000        0.000   0.000
...            ...          ...     ...

Context-controlled correlation matrix datasets for unsupervised learning, clustering & relationship network visualization

Create unique datasets for drugs, genes, proteins & infectious diseases
Choose custom feature vectors or column labels
Generate clusters or visualize hidden relationship networks

Every 24 hours we analyze thousands of bioscientific research papers published around the world through the National Library of Medicine (NLM) and other sources. These include the COVID-19 Open Research Dataset (CORD-19), a collection of scientific literature directly related to COVID-19, SARS-CoV-2 and the coronavirus group, and LitCovid, a curated literature hub for tracking up-to-date scientific information about the 2019 novel coronavirus.

Option 1.  Request a Dataset with Custom Features

Drug Compounds
Infectious Diseases
Microorganisms (Microbiome)
Botanicals, Phytochemicals, Micronutrients
Pharmaceutical Companies
Data Sources:

Option 2.  Dataset Builder API v1.2

Build a dataset and discover correlations between drug compounds and genes, proteins, infectious diseases or other biological processes related to COVID-19 research papers via context-controlled NLP/NLU

JSON API endpoint:

GET /recommend/app/ai_connect-covid-19-datasetbuilder_api


ccl1=[ user defined feature vector or column label ]
ccl2=[ user defined feature vector or column label ]
ccl3=[ user defined feature vector or column label ]
ccl4=[ user defined feature vector or column label ]
ccl5=[ user defined feature vector or column label ]

curl: Returns:
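
A minimal sketch of how a client might call this endpoint from Python, assuming the ccl1 to ccl5 values are passed as ordinary query-string parameters. The host name is a placeholder (the page gives only the path), and whether fewer than five labels are accepted, whether authentication is required, and what the JSON response looks like are not stated here, so the code leaves those open.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical host: the source gives only the endpoint path.
BASE_URL = "https://api.example.com"
ENDPOINT = "/recommend/app/ai_connect-covid-19-datasetbuilder_api"


def build_request_url(labels, base_url=BASE_URL):
    """Map up to five column labels onto ccl1..ccl5 and return the full URL."""
    if not 1 <= len(labels) <= 5:
        raise ValueError("expected between one and five labels")
    params = {f"ccl{i}": label for i, label in enumerate(labels, start=1)}
    return f"{base_url}{ENDPOINT}?{urlencode(params)}"


def fetch_dataset(labels, base_url=BASE_URL, timeout=30):
    """Send the GET request and decode the JSON body.
    The response structure is not documented here, so it is returned as-is."""
    with urlopen(build_request_url(labels, base_url), timeout=timeout) as response:
        return json.load(response)


# Building the URL does not touch the network.
url = build_request_url(["ACE2", "SARS-CoV-2"])
```

Keeping URL construction separate from the network call makes it easy to inspect the request before sending it.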

Option 4.  Download a Pre-Built Dataset

4,212 approved drug compounds correlated via NLP/NLU to 4,212 approved drug compounds in the context of the latest COVID-19 research:

Rows contain approved drug compounds. Columns contain approved drug compounds.

4,000+ approved drug compounds associated via NLP/NLU to 800+ publicly traded pharmaceutical companies in the context of the latest COVID-19 research:

Rows contain stock symbols. Columns contain approved drug compounds.

Dataset Provenance Pipeline (DPP) Hash: 225d6bacb7c2c2b980c9b4f8daaa4204a74f7251
*Dataset generation is based on public & private data and triangulation, along with human curation by knowledge domain experts.

What kind of things can be done with custom concept columns & features?

Create unique clusters based on concepts and hidden relationships
Determine if gene expression correlations have similar concept or keyword correlations
Dataset Augmentation: Detach custom columns and append them to other proprietary in-house datasets
Select a Data Context (e.g. Biological, Chemical, Geophysical) to derive different signals
Use drug, gene, protein or infectious disease names as custom concept column labels
Create features using global events e.g. trending terms anywhere on the internet
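
The dataset augmentation item in the list above can be sketched as a keyed join: selected concept columns are copied onto the rows of an existing table that share the same row label. The function name, the `assay_id` field and all values below are hypothetical and serve only to show the shape of the operation.

```python
def append_columns(base_rows, concept_rows, columns):
    """Return copies of base_rows with the chosen concept columns appended.
    Rows are matched by key; keys missing from concept_rows get None."""
    merged = {}
    for key, row in base_rows.items():
        extra = concept_rows.get(key, {})
        merged[key] = {**row, **{col: extra.get(col) for col in columns}}
    return merged


# Placeholder in-house data and concept scores, for illustration only.
in_house = {
    "Dexamethasone": {"assay_id": "A-1"},
    "Vitamin D": {"assay_id": "A-2"},
}
concepts = {"Dexamethasone": {"ACE2": 0.0, "SARS-CoV-2": 0.0}}

augmented = append_columns(in_house, concepts, ["ACE2"])
```

Rows without a matching concept entry keep their original fields and receive `None` for the appended column, which keeps missing scores visible instead of silently treating them as zero.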


Scores range from 0 to 1 and represent the strength of known and hidden relationships between a concept and a stock, option or ETF. Each score is calculated by a series of algorithms that monitor data surrounding the company associated with the underlying security, and is then combined with scores from human curation teams. These concepts can then be factored or parameterized for exploring new signals or building new models.

Real-time insights

Using our API, customers have access to near real-time (NRT) datasets that update as frequently as once per minute (1,440 API calls per day). Correlation scores and insights can be generated in isolation or as an augmentation to any external or internal dataset.
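
The once-per-minute figure works out to 24 × 60 = 1,440 calls per day. A client that wants to stay at that rate could space its requests as in the sketch below. The function names are hypothetical, `fetch` stands in for whatever call retrieves the dataset, and nothing here reflects how the service itself enforces limits.

```python
import time

MIN_INTERVAL_SECONDS = 60  # once per minute, per the figure quoted above
CALLS_PER_DAY = 24 * 60 * 60 // MIN_INTERVAL_SECONDS  # 1440


def seconds_until_next_call(last_call, now, interval=MIN_INTERVAL_SECONDS):
    """How long to wait so that calls are at least `interval` seconds apart."""
    return max(0.0, interval - (now - last_call))


def poll(fetch, handle, iterations, clock=time.monotonic, sleep=time.sleep):
    """Call fetch() `iterations` times, never more often than once per interval,
    passing each result to handle()."""
    last_call = None
    for _ in range(iterations):
        if last_call is not None:
            sleep(seconds_until_next_call(last_call, clock()))
        last_call = clock()
        handle(fetch())
```

The clock and sleep functions are injectable so the pacing logic can be checked without actually waiting a minute between calls.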

Dataset augmentation

We provide data augmentation services in the form of static and real-time, context-controlled, correlation matrix datasets based on Natural Language Processing (NLP) and Natural Language Understanding (NLU). Our datasets can be applied in all industries to generate new interpretations, hypotheses and discoveries.

Built-in data provenance

For advanced users, we offer optional data provenance solutions via our Data Provenance Pipeline (DPP). The DPP rigorously controls data lineage, ensuring that you always know exactly where your data originated and how it was processed. This is a must-have for bioscience and financial institutions that rely on our datasets to make billion-dollar decisions every day.

State-of-the-art data pipeline

The Vectorspace data engineering pipeline takes unstructured text from any data source and applies state-of-the-art machine learning techniques based on unsupervised learning and NLP/NLU to find hidden relationships between entities (e.g. genes, proteins, diseases & drug compounds) that can accelerate the process of discovery.

Discover correlations to COVID-19 research papers via context-controlled NLP/NLU datasets
Pass a biomedical term and get associated COVID-19 research abstracts and papers based on an algorithm used for uncovering hidden relationships in data.
These papers can be queried using continuously updated correlation matrix datasets, which are freely available to the public or through the API.

While our NLP/NLU datasets are endlessly customizable and can be leveraged by any industry, we provide access to many high-quality data sources out of the box.

Sign Up to learn more about our highly customizable product offerings.

Additional Real-time Data Sources

Data object types processed:

Data acquisition & model training sources: