Dataset Builder: NLP-based context-controlled correlation matrices for unsupervised learning

  • Create unique NYSE, Nasdaq, cryptocurrency, ETF & OTC datasets
  • Engineer custom feature vectors
  • Explore trends & generate alpha




Option 1: Dataset production platforms


Rows contain stock symbols. Columns contain the features you choose, which can be concepts, trends, entities or labels of any kind. Dataset generation is based on public & private databases, in addition to human curation and market research.

  • Nasdaq, NYSE & OTCBB + General Data
  • Nasdaq, NYSE & OTCBB + Genomic & Molecular Biology Data
  • Nasdaq, NYSE & OTCBB + Chemical Data
  • Nasdaq, NYSE & OTCBB + Geophysical Data
  • S&P 500 + General Data
  • S&P 500 + Genomic & Molecular Biology Data
  • S&P 500 + Chemical Data
  • S&P 500 + Geophysical Data


    Option 2: Dataset build request


    Rows contain stock symbols. Columns contain scores that represent known and hidden relationships between stocks & data streams below.

    Step 1.


    Step 2.



    Option 3: Create a dataset


    Rows contain stock symbols. Columns contain scores that represent known and hidden relationships between stocks & data streams below.

    Step 1.


    Step 2.

    Enter 1 to 5 features, concepts or keywords:
    example: Batteries, Bioengineering, Graphene, Blockchain, Machine Learning

    Step 3.



    Option 4: Pre-built datasets

    1,002 curated cryptocurrencies correlated via NLP to 10,286 NYSE, Nasdaq & OTC Stocks:


    Rows contain stock symbols. Columns contain cryptocurrencies. Dataset generation is based on public & private data, triangulation & human curation by market researchers.


    110 elements & minerals correlated via NLP to 10,286 NYSE, Nasdaq & OTC Stocks:


    Rows contain stock symbols. Columns contain elements & minerals. Dataset generation is based on public & private data, triangulation & human curation by market researchers.
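As a rough illustration of how a matrix with this layout could be queried, the sketch below ranks stocks by their score in one column. The CSV layout, the `symbol` header, the ticker symbols, and all values are assumptions made up for the example; the delivery format of the actual datasets is not stated here.

```python
import csv
import io

# Hypothetical excerpt: one row per stock symbol, one column per element.
# Symbols, column names, and scores are invented for illustration only.
SAMPLE = """symbol,Lithium,Copper,Helium
AAA,0.91,0.12,0.00
BBB,0.40,0.77,0.05
CCC,0.08,0.83,0.31
"""

def top_symbols(csv_text, column, n=2):
    """Return the n (symbol, score) pairs with the highest score in `column`."""
    rows = csv.DictReader(io.StringIO(csv_text))
    ranked = sorted(rows, key=lambda r: float(r[column]), reverse=True)
    return [(r["symbol"], float(r[column])) for r in ranked[:n]]

print(top_symbols(SAMPLE, "Copper"))
```

The same function would apply to the cryptocurrency matrix by passing a cryptocurrency column name instead of an element.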



    Use Cases:

    What kinds of things can be done with an NLP-based correlation matrix like this?


  • Create unique sectors or clusters based on concepts and hidden relationships and compare their gains to the S&P (see below)
  • Determine whether stocks with correlated prices also share concept or keyword correlations
  • Examine symbiotic, parasitic and sympathetic relationships between equities
  • Automatically create baskets of stocks based on concepts and/or keywords
  • Detach the custom columns and append them to other proprietary in-house datasets
  • Select a Data Context (e.g. Biological, Chemical, Geophysical and others) to derive different signals
  • Use stock symbols as custom concept column labels and model cross-correlations between equities
  • Create features using trending terms anywhere on the internet
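One of the use cases listed above, building baskets of stocks from concepts, can be sketched in a few lines. This is a minimal illustration and not the platform's own method: the in-memory matrix, the symbols, the scores, and the 0.5 cut-off are all assumptions chosen for the example.

```python
# Hypothetical score matrix: rows are stock symbols, columns are concepts.
# All symbols, concepts, and values are invented for illustration.
scores = {
    "AAA": {"Batteries": 0.82, "Graphene": 0.10, "Blockchain": 0.05},
    "BBB": {"Batteries": 0.64, "Graphene": 0.71, "Blockchain": 0.00},
    "CCC": {"Batteries": 0.03, "Graphene": 0.08, "Blockchain": 0.91},
}

def build_baskets(matrix, threshold=0.5):
    """Group symbols into one basket per concept where score >= threshold."""
    baskets = {}
    for symbol, row in matrix.items():
        for concept, score in row.items():
            if score >= threshold:
                baskets.setdefault(concept, []).append(symbol)
    return baskets

baskets = build_baskets(scores)
print(baskets)
```

A stock can land in more than one basket (here `BBB` clears the threshold for two concepts), which is usually what you want when concepts overlap; raising the threshold tightens each basket.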

    How do the concepts & trends correlate to crypto, stocks or ETFs?


    Scores range from 0 to 1 and represent the strength of known and hidden relationships between a concept and a stock, option or ETF. Each score is calculated by a series of algorithms that monitor data surrounding the company associated with the underlying security, and is then combined with scores from human curation teams. These concepts can then be factored or parameterized to explore new signals or build new models. [Ref: Equity Correlations - J.P. Morgan]
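The exact way algorithmic and human-curated scores are combined is not described here. Purely as an illustration of one plausible approach, the sketch below averages the algorithmic scores and blends the result with a curated score using a weight; the function name `combine_scores`, the averaging step, and the `curation_weight` parameter are assumptions, not the actual scoring method.

```python
def combine_scores(algorithm_scores, curated_score, curation_weight=0.5):
    """Blend algorithmic scores with a human-curated score (assumed scheme).

    algorithm_scores: non-empty list of scores in [0, 1] from separate algorithms.
    curated_score: a single score in [0, 1] from human curation.
    curation_weight: share of the final score given to curation (hypothetical).
    """
    if not algorithm_scores:
        raise ValueError("at least one algorithmic score is required")
    if not 0.0 <= curation_weight <= 1.0:
        raise ValueError("curation_weight must be between 0 and 1")
    algorithmic = sum(algorithm_scores) / len(algorithm_scores)
    blended = (1.0 - curation_weight) * algorithmic + curation_weight * curated_score
    # Clamp so the result always stays in the documented 0-to-1 range.
    return min(1.0, max(0.0, blended))

print(combine_scores([0.2, 0.4], 0.9))
```

Because every input lies between 0 and 1 and the weights sum to 1, the blended value also stays in that range; the clamp only guards against out-of-range inputs.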

    Data Engineering Pipeline Overview:







    Partners & Collaborators:





    Selected references and acknowledgements:


    Pushshift.io: