Given a description of a superhero, return two guessed superhero names.
Input description: Knight of Dark, Gotham protector, Smart, Intelligent, martial artist, master of dark, educated
Output: Batman
Dataset link: https://www.kaggle.com/datasets/jonathanbesomi/superheroes-nlp-dataset
Total rows: 1450
Total columns: 81
The task is to map a given description to an entity name.
Challenges:
- Limited number of records for supervised machine learning approaches.
- There is no target class: each record describes a unique superhero, so this is not a regular classification or regression task.
- Although multiple superheroes may have similar characteristics, each superhero is different, so this is not exactly clustering either.
Given the input to the system and the first two challenges above, I need to construct a superhero description from the provided dataset. Transforming structured information into unstructured text is unconventional, but it is efficient here.
The assumption is that this constructed superhero description carries rich information about the superhero, which helps match the input description to a superhero name.
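As a rough illustration of turning one structured record into a free-text description, here is a minimal sketch. The column names (`real_name`, `occupation`, `powers_text`, the `has_` flag prefix) and the helper `build_description` are assumptions made for this example; the actual schema of the 81-column dataset should be checked before adapting it.

```python
from typing import Any, Mapping

# Hypothetical column names, chosen for illustration only.
TEXT_COLUMNS = ("real_name", "creator", "alignment", "occupation", "powers_text")
FLAG_PREFIX = "has_"  # assumed prefix for 0/1 ability columns


def _is_missing(value: Any) -> bool:
    """Treat None, NaN and blank strings as missing."""
    if value is None:
        return True
    if isinstance(value, float) and value != value:  # NaN is not equal to itself
        return True
    return str(value).strip() == ""


def build_description(row: Mapping[str, Any]) -> str:
    """Join the non-missing text fields and the enabled flags into one string."""
    parts = []
    for column in TEXT_COLUMNS:
        value = row.get(column)
        if not _is_missing(value):
            parts.append(str(value).strip())
    for column, value in row.items():
        # A flag set to 1 becomes a phrase, e.g. has_martial_arts -> "martial arts".
        if column.startswith(FLAG_PREFIX) and value == 1:
            parts.append(column[len(FLAG_PREFIX):].replace("_", " "))
    return ", ".join(parts)


example_row = {
    "name": "Batman",
    "real_name": "Bruce Wayne",
    "occupation": "Businessman",
    "powers_text": None,
    "has_martial_arts": 1.0,
    "has_flight": 0.0,
}
description = build_description(example_row)
print(description)
```

The name column is left out of the description on purpose in this sketch, so that matching relies on the characteristics rather than on the name appearing in the query.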
Solution: Use Semantic Search to match the input description query against the existing superhero descriptions and fetch the top-k records. Then use Keyword Search to find the records that share the largest number of words with the input description.
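The two stages can be sketched as follows. This is not the project's implementation: it uses exact brute-force cosine similarity in place of an approximate nearest-neighbour index such as hnswlib, it assumes the embedding vectors have already been produced by a sentence-embedding model, and it assumes the keyword stage re-ranks only the top-k semantic candidates. The function names and the tiny hand-made vectors are hypothetical.

```python
import math
import re
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_top_k(query_vec: Sequence[float],
                   doc_vecs: Dict[str, Sequence[float]],
                   k: int) -> List[str]:
    """Stage 1: the k names whose vectors are most similar to the query."""
    ranked = sorted(doc_vecs,
                    key=lambda name: cosine(query_vec, doc_vecs[name]),
                    reverse=True)
    return ranked[:k]


def tokenize(text: str) -> set:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_rerank(query: str,
                   candidates: List[str],
                   descriptions: Dict[str, str],
                   n: int = 2) -> List[str]:
    """Stage 2: reorder candidates by the number of words shared with the query."""
    query_words = tokenize(query)
    # sorted() is stable, so ties keep their semantic-search order.
    ranked = sorted(candidates,
                    key=lambda name: len(query_words & tokenize(descriptions[name])),
                    reverse=True)
    return ranked[:n]


# Toy data: the vectors stand in for real sentence embeddings.
descriptions = {
    "Batman": "gotham protector, martial artist, detective, dark knight",
    "Superman": "krypton, flight, super strength, metropolis",
    "Captain America": "avengers leader, super soldier, brooklyn, shield",
}
vectors = {
    "Batman": [0.9, 0.1, 0.0],
    "Superman": [0.1, 0.9, 0.2],
    "Captain America": [0.2, 0.6, 0.7],
}
query = "leader of the avengers, super soldier from brooklyn"
query_vec = [0.1, 0.9, 0.3]

candidates = semantic_top_k(query_vec, vectors, k=2)
guesses = keyword_rerank(query, candidates, descriptions, n=2)
print(candidates, guesses)
```

In this toy data the semantic stage ranks Superman first, and the keyword stage then moves Captain America to the top because its description shares more words with the query.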
- Python - 3.8
- Spacy - 3.4
- Spacy Transformers - 1.1.8
- KeyBERT - 0.6.0
- hnswlib - 0.6.2 - Hierarchical Navigable Small World for Approximate Nearest Neighbour Search
- Sentence Transformers - 2.2.2
- Text Distance - 4.5.0
- AWS EC2
- Docker & Docker Compose
- GitHub Actions CI/CD - Deploy on AWS on Git Push
- Caddy - Reverse Proxy & Automatic SSL Certificate Generation and Verification
{"description": "marvel comics, super strength, leader, avengers, super solider, strong, honest, brooklyn"}
{"Superhero_Guess": ["Knockout", "Captain America"]}