Natural Language Processing (NLP)-based mapping of semi-structured data


Our Clients

Client Background

The Client is a global company that helps its customers run effective cross-channel, dynamic content marketing campaigns. It runs campaigns worldwide and has offices and partner agencies around the globe.

Business Challenge

The Client serves large customers, each with its own marketing department, established business processes, and history. The challenge is to align the meta descriptions of marketing content produced by the different agencies involved in the campaigns. The process should be automated so that minimal human involvement is required.

Key challenges:

  • Standardize meta descriptions for headquarters so that cross-agency analysis can be performed
  • Each agency has its own format for describing creatives, ads, campaigns, etc.
  • Thousands of creatives, hundreds of ads, and tens of campaigns, multiplied by different customers – the analysis requires automation
  • Accumulate knowledge over time so that tedious mapping operations are not repeated

Value Delivered

Project status

  • Successfully delivered the Proof-of-Concept (PoC) phase, demonstrating the potential of the AI/ML-based approach
  • Minimum Viable Product (MVP) phase – in progress

The overall project timeline

  • PoC phase – 5 weeks
  • MVP phase – in progress

Value delivered

  • Research and experiments with different Natural Language Processing (NLP) models to identify the best-fitting model
  • High-Level Architecture (HLA) design, including:
    • UI for user interactions
    • API as a façade for back-end services
    • ML model for NLP
  • MLOps pipeline for continuous refinement of ML model accuracy
  • CI/CD pipelines to reduce time-to-market for new product versions
  • A prototype has been successfully developed
  • MVP development has started to expand the solution's functionality



The ability to run performance analysis of marketing campaigns is essential for successful customer engagement. The problem arises when multiple agencies are involved, each with its own standards for describing creatives, ads, campaigns, etc. Such a heterogeneous environment makes analysis almost impossible, or at least extremely difficult and costly. But if we could find a "common denominator" for such descriptions, everything would become much easier. We call such a common denominator a master taxonomy.

The primary objective is to find a way to automatically map meta descriptions coming from agency, customer, or client data to that master taxonomy.
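As a rough illustration of the mapping idea, each incoming description can be scored against every master-taxonomy entry and assigned to the closest one. The taxonomy terms below are invented examples, and the bag-of-words "embedding" is a toy stand-in for the real NLP model:

```python
from collections import Counter
from math import sqrt

# Hypothetical master-taxonomy entries (illustrative only)
MASTER_TAXONOMY = ["video ad", "display banner", "social media campaign"]

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; the real system uses an NLP model
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def map_to_taxonomy(description: str) -> tuple[str, float]:
    # Assign the description to the most similar taxonomy entry
    scores = [(term, cosine(embed(description), embed(term)))
              for term in MASTER_TAXONOMY]
    return max(scores, key=lambda s: s[1])

term, score = map_to_taxonomy("summer social campaign")
```

Low-confidence matches (a small `score`) would be routed to a human reviewer, which feeds the second objective below.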

Since such mapping will inevitably produce errors that must be corrected through human intervention, the second objective is to reduce those interventions over time.


It was clear from the very beginning that conventional approaches based on a structured, pattern-based data schema would not work, because they would require changing the business processes of an arbitrary number of agencies. This led to the idea of applying Natural Language Processing (NLP) techniques.

Our AI/ML team conducted intensive research into existing NLP models and ran experiments comparing their performance against a test data set to identify the best-fitting models.

Based on this analysis, a candidate network was selected and placed at the heart of the mapping process.
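The model-selection step can be sketched as a simple evaluation harness: score each candidate mapper on a labelled test set and keep the best. The candidate mappers and test set below are invented examples, not the actual models the team compared:

```python
def evaluate(mapper, test_set):
    # Fraction of test descriptions mapped to the expected taxonomy entry
    hits = sum(1 for desc, expected in test_set if mapper(desc) == expected)
    return hits / len(test_set)

def select_best(candidates, test_set):
    # candidates: {model_name: mapper_function}; returns the top scorer's name
    return max(candidates, key=lambda name: evaluate(candidates[name], test_set))

# Invented example data: two trivial "models" and a labelled test set
test_set = [("VIDEO AD", "video ad"), ("Display Banner", "display banner")]
candidates = {
    "identity": lambda d: d,
    "normalized": lambda d: d.lower(),
}
best = select_best(candidates, test_set)
```

In practice the candidates would be pre-trained NLP models rather than string transforms, but the harness shape stays the same.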

The "prediction improvement" objective has been addressed with a Siamese network approach. The task is complex and requires a suitable data set, generated automatically from the choices humans make during day-to-day operations.
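One way such a data set can be generated, sketched here under the assumption that reviewers either confirm or correct the model's suggested taxonomy entry, is to turn each review decision into labelled similarity pairs for training the Siamese model (the function and log format are hypothetical):

```python
def build_training_pairs(corrections):
    """Turn human review decisions into labelled pairs for a similarity
    (e.g. Siamese) model: label 1 = same concept, 0 = different concept.

    corrections: iterable of (description, model_suggestion, human_choice).
    """
    pairs = []
    for desc, suggested, accepted in corrections:
        # The human-approved mapping is always a positive example
        pairs.append((desc, accepted, 1))
        # A rejected suggestion becomes a hard negative example
        if suggested != accepted:
            pairs.append((desc, suggested, 0))
    return pairs

# Invented example: the reviewer overrode "display banner" with "video ad"
corrections = [("summer promo clip", "display banner", "video ad")]
pairs = build_training_pairs(corrections)
```

Accumulating these pairs over time lets the model learn from exactly the cases it previously got wrong, reducing future human interventions.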

Lessons learned

NLP is a domain with a number of well-established, publicly available models that are pre-trained on huge volumes of data and free for commercial use.

Overfitting can become a serious challenge when trying to improve accuracy by stacking extra models on top of out-of-the-box NLP models; it is better to invest effort in understanding the base model's behavior and exploiting its nuances.

MLOps pipelines are difficult to develop, but they are definitely worth the effort: they reduce development and operational costs through standardization, predictability, and the elimination of routine operations.