Dataset Builder: NLP/NLU context-controlled correlation matrices for unsupervised learning
- Create unique equity, cryptocurrency, ETF & OTC datasets
- Engineer custom feature vectors
- Explore trends & generate alpha
Option 1: Dataset production platforms
Rows contain stock symbols.
Columns contain features you choose, which can be concepts, trends, entities or labels of any kind. Dataset generation draws on public & private databases, human curation and market research.
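To make the row and column layout concrete, here is a minimal sketch that parses a small matrix into a nested dictionary. It assumes a CSV export with a `symbol` column; the file layout, the symbols, the feature names and the scores are all invented for illustration and are not taken from a real dataset.

```python
import csv
import io

# Hypothetical sample: symbols, feature names and scores are invented.
SAMPLE_CSV = """symbol,lithium,graphene,drought
AAA,0.82,0.10,0.05
BBB,0.15,0.67,0.00
CCC,0.40,0.05,0.91
"""


def load_matrix(text):
    """Parse CSV text into {symbol: {feature: score}}."""
    reader = csv.DictReader(io.StringIO(text))
    matrix = {}
    for row in reader:
        symbol = row.pop("symbol")  # the row label is the stock symbol
        matrix[symbol] = {feature: float(value) for feature, value in row.items()}
    return matrix


matrix = load_matrix(SAMPLE_CSV)
features = list(next(iter(matrix.values())))  # column labels, in file order

print(features)
print(matrix["CCC"]["drought"])
```

Each row is one stock symbol and each column is one chosen feature, so a single score is looked up as `matrix[symbol][feature]`.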
Option 2: Dataset build request
Rows contain stock symbols.
Columns contain scores that represent known and hidden relationships between stocks and the data streams below.
Option 4: Pre-built sample datasets
1,002 curated cryptocurrencies correlated via NLP to 10,286 NYSE, Nasdaq & OTC Stocks:
Rows contain stock symbols.
Columns contain cryptocurrencies. Dataset generation is based on public & private data, triangulation & human curation by market researchers.
110 elements & minerals correlated via NLP to 10,286 NYSE, Nasdaq & OTC Stocks:
Rows contain stock symbols.
Columns contain elements & minerals. Dataset generation is based on public & private data, triangulation & human curation by market researchers.
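One simple way to read a matrix like the ones above is to rank the stock rows by their score in a single column. The sketch below assumes the matrix is already in memory as a nested dictionary; the symbols and scores are invented, and `top_stocks` is a hypothetical helper, not part of any published API.

```python
# Hypothetical excerpt: symbols and scores are invented for illustration.
scores = {
    "AAA": {"bitcoin": 0.91, "ethereum": 0.35},
    "BBB": {"bitcoin": 0.12, "ethereum": 0.88},
    "CCC": {"bitcoin": 0.47, "ethereum": 0.52},
    "DDD": {"bitcoin": 0.05, "ethereum": 0.02},
}


def top_stocks(matrix, column, n=3):
    """Return the n (symbol, score) pairs with the highest score in one column."""
    ranked = sorted(matrix.items(), key=lambda item: item[1][column], reverse=True)
    return [(symbol, row[column]) for symbol, row in ranked[:n]]


for symbol, score in top_stocks(scores, "ethereum", n=2):
    print(f"{symbol}\t{score:.2f}")
```

The same helper applies unchanged to the elements & minerals matrix, since only the column labels differ.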
Use Cases:
What can be done with an NLP-based correlation matrix?
- Create unique sectors or clusters based on concepts and hidden relationships and compare their gains to the S&P (see below)
- Determine if price correlations have similar concept or keyword correlations
- Examine symbiotic, parasitic and sympathetic relationships between equities
- Automatically create baskets of stocks based on concepts and/or keywords
- Detach the custom columns and append them to other proprietary in-house datasets
- Select a Data Context (e.g. Biological, Chemical, Geophysical and others) to derive different signals
- Use stock symbols as custom concept column labels and model cross-correlations between equities
- Create features using trending terms anywhere on the internet
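Two of the use cases above, building a basket from a concept and checking whether concept similarity lines up with price correlation, can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the concept scores and daily returns are invented, the 0.5 threshold is arbitrary, and cosine similarity and Pearson correlation are stand-ins chosen here rather than the method used to build the datasets.

```python
import math

# Hypothetical concept scores (rows: symbols, columns: concepts).
concept_scores = {
    "AAA": {"lithium": 0.9, "solar": 0.7, "cobalt": 0.1},
    "BBB": {"lithium": 0.8, "solar": 0.6, "cobalt": 0.2},
    "CCC": {"lithium": 0.1, "solar": 0.0, "cobalt": 0.9},
}

# Hypothetical daily returns for the same symbols.
returns = {
    "AAA": [0.01, -0.02, 0.03, 0.00, 0.01],
    "BBB": [0.02, -0.01, 0.02, 0.01, 0.00],
    "CCC": [-0.01, 0.02, -0.02, 0.00, 0.01],
}


def basket(scores, concept, threshold):
    """Symbols whose score for one concept meets the threshold."""
    return sorted(s for s, row in scores.items() if row[concept] >= threshold)


def concept_vector(scores, symbol):
    """Scores for one symbol, ordered by concept name so vectors align."""
    row = scores[symbol]
    return [row[name] for name in sorted(row)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pearson(x, y):
    """Pearson correlation between two equal-length series."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


print("lithium basket:", basket(concept_scores, "lithium", 0.5))

# Compare concept similarity with price correlation for each pair.
for left, right in [("AAA", "BBB"), ("AAA", "CCC")]:
    concept_sim = cosine(concept_vector(concept_scores, left),
                         concept_vector(concept_scores, right))
    price_corr = pearson(returns[left], returns[right])
    print(f"{left}-{right}: concept={concept_sim:.2f} price={price_corr:.2f}")
```

In this toy data the pair with similar concept vectors also has positively correlated returns, while the dissimilar pair does not. With real data the two measures may well disagree, which is the comparison the second use case is about.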
Data Engineering Pipeline Overview:
Selected references and acknowledgements:
- R&D100 Conference: Search, the human way
https://www.rd100conference.com/awards/winners-finalists/677/search-the-human-way/
- System and method for generating a relationship network
K Franks, CA Myers, RM Podowski, Lawrence Berkeley National Laboratory - US Patent 7,987,191, 2011
http://www.google.com/patents/US7987191
- Statistical modeling of biomedical corpora: mining the Caenorhabditis Genetic Center Bibliography for genes related to life span
Blei DM, Franks K, Jordan MI, Mian IS.
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1533868
- Contagious Speculation and a Cure for Cancer: A Non-Event that Made Stock Prices Soar
Gur Huberman, Tomer Regev, Journal of Finance
http://www8.gsb.columbia.edu/researcharchive/getpub/1555/p
- Vectorspace AI whitepaper v0.3
https://vectorspace.ai/assets/Vectorspace_Whitepaper.pdf
Pushshift.io: