Create unique datasets for drugs, genes, proteins & infectious diseases
Choose custom feature vectors or column labels
Generate clusters or visualize hidden relationship networks
Every 24 hours we analyze thousands of bioscientific research papers published around the world through the National Library of Medicine (NLM) and other sources. These sources include the COVID-19 Open Research Dataset (CORD-19), a collection of scientific literature directly related to COVID-19, SARS-CoV-2, and the coronavirus group, and LitCovid, a curated literature hub for tracking up-to-date scientific information about the 2019 novel coronavirus.
Option 1. Request a Dataset with Custom Features
Option 2. Dataset Builder API v1.2
Build a dataset and discover correlations between drug compounds and genes, proteins, infectious diseases or other biological processes in COVID-19 research papers using context-controlled NLP/NLU.
JSON API endpoint:
ccl1=[ user defined feature vector or column label ]
ccl2=[ user defined feature vector or column label ]
ccl3=[ user defined feature vector or column label ]
ccl4=[ user defined feature vector or column label ]
ccl5=[ user defined feature vector or column label ]
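As a rough illustration of how the five `ccl` parameters might be assembled into a request, here is a minimal Python sketch. The endpoint URL is not given above, so `BASE_URL` is a hypothetical placeholder, as are the example labels. Whether the API accepts fewer than five labels is also an assumption.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: the real endpoint URL is not stated in this page.
BASE_URL = "https://api.example.invalid/dataset-builder"


def build_query(labels):
    """Map one to five column labels onto the ccl1..ccl5 query parameters."""
    if not 1 <= len(labels) <= 5:
        raise ValueError("expected between 1 and 5 labels")
    params = {f"ccl{i}": label for i, label in enumerate(labels, start=1)}
    # urlencode escapes spaces and other reserved characters in the labels.
    return f"{BASE_URL}?{urlencode(params)}"


# Illustrative labels only; any drug, gene, protein or disease name could be used.
url = build_query(["remdesivir", "ACE2", "spike protein"])
print(url)
```

Building the query with `urlencode` rather than string concatenation keeps labels that contain spaces or punctuation safe to send.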
Option 4. Download a Pre-Built Dataset
4,212 approved drug compounds correlated via NLP/NLU to 4,212 approved drug compounds in the context of the latest COVID-19 research:
Rows contain approved drug compounds. Columns contain approved drug compounds.
Dataset Provenance Pipeline (DPP) Hash: 225d6bacb7c2c2b980c9b4f8daaa4204a74f7251
*Dataset generation is based on public and private data and on triangulation, combined with human curation by domain experts.
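A pre-built dataset of this shape, with compounds as both rows and columns, can be read as a lookup table of pairwise scores. The sketch below assumes a CSV layout with a header row and a leading name column, which is not confirmed by this page. The compound names and scores in `SAMPLE` are made up for illustration.

```python
import csv
import io

# Made-up stand-in for a downloaded file; names and scores are illustrative only.
SAMPLE = """compound,drug_a,drug_b,drug_c
drug_a,1.0,0.42,0.07
drug_b,0.42,1.0,0.31
drug_c,0.07,0.31,1.0
"""


def load_matrix(handle):
    """Read a square score matrix into {row_name: {column_name: score}}."""
    reader = csv.reader(handle)
    columns = next(reader)[1:]  # skip the label of the name column
    return {row[0]: dict(zip(columns, map(float, row[1:]))) for row in reader}


def top_related(matrix, name, k=2):
    """Return the k highest-scoring other compounds for one row."""
    others = [(other, score) for other, score in matrix[name].items() if other != name]
    return sorted(others, key=lambda pair: pair[1], reverse=True)[:k]


matrix = load_matrix(io.StringIO(SAMPLE))
print(top_related(matrix, "drug_b"))
```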
What kind of things can be done with custom concept columns & features?
Create unique clusters based on concepts and hidden relationships
Determine if gene expression correlations have similar concept or keyword correlations
Dataset augmentation: detach custom columns and append them to other proprietary in-house datasets
Select a Data Context (e.g. Biological, Chemical, Geophysical and others) to derive different signals
Use drug, gene, protein or infectious disease names as custom concept column labels
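The dataset augmentation idea above can be sketched as follows: pull one custom column out of a score matrix and attach it to the rows of a separate in-house table by matching on name. This is a minimal sketch under assumed data shapes. The function names, the `name` key and all sample values are hypothetical.

```python
def detach_column(matrix, column):
    """Extract one column from {row: {column: score}} as {row: score}."""
    return {row: scores[column] for row, scores in matrix.items() if column in scores}


def append_column(records, column_name, values, default=None):
    """Return copies of the records with the detached column added, matched on 'name'."""
    return [{**record, column_name: values.get(record["name"], default)} for record in records]


# Illustrative score matrix: rows are compounds, columns are custom concept labels.
scores = {
    "drug_a": {"concept_x": 0.61, "concept_y": 0.12},
    "drug_b": {"concept_x": 0.08, "concept_y": 0.77},
}

# Illustrative in-house table that knows nothing about the concept scores.
in_house = [
    {"name": "drug_a", "assay_result": 3.2},
    {"name": "drug_c", "assay_result": 1.9},
]

augmented = append_column(in_house, "concept_x", detach_column(scores, "concept_x"))
print(augmented)
```

Rows with no match in the detached column receive the default value instead of being dropped, so the in-house table keeps its original length.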
Scores range from 0 to 1 and represent the strength of known and hidden relationships between a concept and a stock, option or ETF. The score is calculated by a series of algorithms that monitor data surrounding each company associated with the underlying security, and each score is combined with scores from human curation teams. These concepts can then be factored or parameterized for exploring new signals or building new models.
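One simple way to parameterize scores in the 0 to 1 range is to pick a threshold and group together any entities linked by a score at or above it. The sketch below uses plain connected components for this. It is a generic stand-in, not the clustering method used to produce these datasets, and the pair names and scores are invented.

```python
def threshold_clusters(pair_scores, threshold):
    """Group names into connected components using only pairs scoring >= threshold."""
    adjacency = {}
    for (left, right), score in pair_scores.items():
        adjacency.setdefault(left, set())
        adjacency.setdefault(right, set())
        if score >= threshold:
            adjacency[left].add(right)
            adjacency[right].add(left)

    seen, groups = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Depth-first walk over everything reachable from this starting name.
        stack, group = [start], set()
        while stack:
            current = stack.pop()
            if current in group:
                continue
            group.add(current)
            stack.extend(adjacency[current] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


# Illustrative pairwise scores between five placeholder entities.
pairs = {("a", "b"): 0.8, ("b", "c"): 0.65, ("c", "d"): 0.1, ("d", "e"): 0.9}
print(threshold_clusters(pairs, 0.6))
```

Raising the threshold splits the groups into smaller ones, and lowering it merges them, which makes the threshold a convenient single knob for exploring the data.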
Through our API, customers have access to near real-time (NRT) datasets that update as frequently as once per minute (1,440 API calls per day). Correlation scores and insights can be generated in isolation or as an augmentation to any external or internal dataset.
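The once-per-minute figure works out to 24 × 60 = 1,440 calls per day. A client that wants to stay at that rate could poll on a fixed interval, as in the sketch below. The `fetch` callable is a hypothetical stand-in for whatever performs the actual API request, and the `sleep` argument is injectable so the loop can be exercised without waiting.

```python
import time

CALLS_PER_DAY = 24 * 60  # one call per minute


def poll(fetch, interval_seconds=60.0, max_calls=3, sleep=time.sleep):
    """Call fetch() max_calls times, pausing interval_seconds between calls."""
    results = []
    for attempt in range(max_calls):
        results.append(fetch())
        if attempt < max_calls - 1:  # no pause needed after the final call
            sleep(interval_seconds)
    return results


# Demonstration with a stand-in fetch and a no-op sleep, so nothing actually waits.
counter = iter(range(1, 100))
snapshots = poll(lambda: {"snapshot": next(counter)}, sleep=lambda seconds: None)
print(CALLS_PER_DAY, snapshots)
```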
We provide data augmentation services in the form of static and real-time, context-controlled, correlation matrix datasets based on Natural Language Processing (NLP) and Natural Language Understanding (NLU). Our datasets can be applied in all industries to generate new interpretations, hypotheses and discoveries.
Built-in data provenance
For advanced users, we offer optional data provenance solutions via our Data Provenance Pipeline (DPP). The DPP rigorously controls data lineage, so that you always know where your data originated and how it was processed. This is a must-have for bioscience and financial institutions that rely on our datasets to make billion-dollar decisions every day.
State-of-the-art data pipeline
The Vectorspace data engineering pipeline takes unstructured text from any data source and applies state-of-the-art machine learning techniques based on unsupervised learning and NLP/NLU to find hidden relationships between entities (e.g. genes, proteins, diseases & drug compounds) that can accelerate the process of discovery.