Dataset Builder: NLP/NLU context-controlled correlation matrices for unsupervised learning

  • Create unique equity, cryptocurrency, ETF & OTC datasets
  • Engineer custom feature vectors
  • Explore trends & generate alpha




Option 1: Dataset production platforms


Rows contain stock symbols. Columns contain the features you choose, which can be concepts, trends, entities or labels of any kind. Datasets are generated from public & private databases, supplemented by human curation and market research.

  • Nasdaq, NYSE & OTCBB + General Data
  • Nasdaq, NYSE & OTCBB + Genomic & Molecular Biology Data
  • Nasdaq, NYSE & OTCBB + Chemical Data
  • Nasdaq, NYSE & OTCBB + Geophysical Data
  • S&P 500 + General Data
  • S&P 500 + Genomic & Molecular Biology Data
  • S&P 500 + Chemical Data
  • S&P 500 + Geophysical Data
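To make the row and column layout concrete, here is a minimal sketch in plain Python. The symbols (AAA, BBB, ...), the scores and the helper names are hypothetical placeholders, not output from the platform, and the actual file format of a delivered dataset is not specified here.

```python
# Hypothetical layout: one row per stock symbol, one column per chosen feature.
# Symbols and scores are made up for illustration only.
FEATURES = ["Batteries", "Graphene", "Blockchain"]

MATRIX = {
    "AAA": [0.82, 0.10, 0.05],
    "BBB": [0.15, 0.77, 0.02],
    "CCC": [0.40, 0.35, 0.61],
    "DDD": [0.03, 0.08, 0.90],
}


def column(matrix, features, name):
    """Return {symbol: score} for a single feature column."""
    idx = features.index(name)
    return {symbol: row[idx] for symbol, row in matrix.items()}


def top_symbols(matrix, features, name, n=2):
    """Return the n symbols with the highest score for a feature."""
    scores = column(matrix, features, name)
    return sorted(scores, key=scores.get, reverse=True)[:n]


top_blockchain = top_symbols(MATRIX, FEATURES, "Blockchain")
```

Reading a column gives the ranking of every symbol against one concept, while reading a row gives one symbol's profile across all chosen concepts.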


    Option 2: Dataset build request


    Rows contain stock symbols. Columns contain scores that represent known and hidden relationships between stocks & data streams below.

    Step 1.


    Step 2.



    Option 3: Create a dataset


    Rows contain stock symbols. Columns contain scores that represent known and hidden relationships between stocks & data streams below.

    Step 1.


    Step 2.

    Enter 1 to 5 features, concepts or keywords:
    example: Batteries, Bioengineering, Graphene, Blockchain, Machine Learning
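If feature lists are prepared programmatically before being entered, a small helper can enforce the 1 to 5 limit. This is only a sketch: the function name `parse_features`, the comma-separated input and the case-insensitive de-duplication are assumptions, not the platform's own validation rules.

```python
MAX_FEATURES = 5  # the form asks for 1 to 5 entries


def parse_features(text):
    """Split a comma-separated string into a clean, de-duplicated feature list."""
    seen = set()
    features = []
    for raw in text.split(","):
        name = " ".join(raw.split())  # trim and collapse stray whitespace
        key = name.lower()            # assumption: duplicates are case-insensitive
        if name and key not in seen:
            seen.add(key)
            features.append(name)
    if not 1 <= len(features) <= MAX_FEATURES:
        raise ValueError(
            f"expected 1 to {MAX_FEATURES} features, got {len(features)}"
        )
    return features


features = parse_features(
    "Batteries, Bioengineering, Graphene, Blockchain, Machine Learning"
)
```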

    Step 3.



    Option 4: Pre-built sample datasets

    1,002 curated cryptocurrencies correlated via NLP to 10,286 NYSE, Nasdaq & OTC Stocks:


    Rows contain stock symbols. Columns contain cryptocurrencies. Dataset generation is based on public & private data, triangulation & human curation by market researchers.


    110 elements & minerals correlated via NLP to 10,286 NYSE, Nasdaq & OTC Stocks:


    Rows contain stock symbols. Columns contain elements & minerals. Dataset generation is based on public & private data, triangulation & human curation by market researchers.



    Use Cases:

    What can be done with an NLP-based correlation matrix?


  • Create unique sectors or clusters based on concepts and hidden relationships and compare their gains to the S&P (see below)
  • Determine whether stocks with correlated prices also have correlated concepts or keywords
  • Examine symbiotic, parasitic and sympathetic relationships between equities
  • Automatically create baskets of stocks based on concepts and/or keywords
  • Detach the custom columns and append them to other proprietary in-house datasets
  • Select a Data Context (e.g. Biological, Chemical, Geophysical and others) to derive different signals
  • Use stock symbols as custom concept column labels and model cross-correlations between equities
  • Create features using trending terms anywhere on the internet
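Two of the use cases above, building a concept basket and checking whether concept similarity lines up with price correlation, can be sketched as follows. All symbols, scores and returns are invented for illustration, and the threshold rule, cosine similarity and Pearson correlation are choices made for this sketch rather than the platform's documented method.

```python
import math

# Hypothetical concept scores (rows: symbols, columns: FEATURES) and
# hypothetical daily returns for the same symbols.
FEATURES = ["Batteries", "Graphene", "Blockchain"]
SCORES = {
    "AAA": [0.9, 0.1, 0.0],
    "BBB": [0.8, 0.2, 0.1],
    "CCC": [0.0, 0.1, 0.9],
}
RETURNS = {
    "AAA": [0.01, -0.02, 0.03, 0.00],
    "BBB": [0.02, -0.01, 0.02, 0.01],
    "CCC": [-0.01, 0.03, -0.02, 0.00],
}


def build_basket(scores, features, feature, threshold):
    """Return the symbols whose score for `feature` meets the threshold."""
    idx = features.index(feature)
    return sorted(s for s, row in scores.items() if row[idx] >= threshold)


def cosine_similarity(a, b):
    """Cosine similarity between two concept-score rows."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pearson(x, y):
    """Pearson correlation between two equal-length return series."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def compare_pairs(scores, returns):
    """Map each symbol pair to (concept similarity, price correlation)."""
    symbols = sorted(scores)
    result = {}
    for i, a in enumerate(symbols):
        for b in symbols[i + 1:]:
            result[(a, b)] = (
                cosine_similarity(scores[a], scores[b]),
                pearson(returns[a], returns[b]),
            )
    return result


basket = build_basket(SCORES, FEATURES, "Batteries", threshold=0.5)
pairs = compare_pairs(SCORES, RETURNS)
```

Pairs where both numbers are high are candidates for a shared cluster, while pairs with high concept similarity but low price correlation may be worth a closer look.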

    Data Engineering Pipeline Overview:










    Selected references and acknowledgements:


    Pushshift.io: