[Analyzer] Topic Modeling #131

Open
shahrukhx01 opened this issue Jun 7, 2021 · 18 comments
Assignees
Labels
analyzer enhancement New feature or request

Comments

@shahrukhx01
Collaborator

https://github.com/MaartenGr/BERTopic

@shahrukhx01 shahrukhx01 added the enhancement New feature or request label Jun 7, 2021
@shahrukhx01
Collaborator Author

@lalitpagaria you can assign this and #130 to me.

@shahrukhx01 shahrukhx01 changed the title Topic Modeling [Analyzer] Topic Modeling Jul 5, 2021
@akar5h
Contributor

akar5h commented Jul 26, 2021

@shahrukhx01 Shahrukh, please confirm whether you've started anything on this yet. If not, I would like to collaborate.

@shahrukhx01
Collaborator Author

shahrukhx01 commented Jul 26, 2021

Hi Akarsh, I’ve not started anything yet. However, before starting, I’d suggest going through this:
https://www.sbert.net/examples/applications/clustering/README.html
This would also be helpful for #130.

@akar5h
Contributor

akar5h commented Jul 27, 2021

Thanks @shahrukhx01, I went through the sentence_transformers library and its models for clustering, and I have some ideas of my own.
@lalitpagaria Hi, can you please elaborate on the topic modeling requirements? I assume our focus is on short texts, like reviews. Is that right?

@lalitpagaria
Collaborator

@akar5h Actually, the main idea is to do it on both short texts (reviews) and long texts (news articles, emails, etc.). But yes, we could start with short texts first. This is how I see it from the user's point of view:

  1. The user loads data from a DB/CSV/Excel/Airtable, etc.
  2. The user performs topic modeling on the whole set using a selected technique (clustering algorithm, LDA, PCA).

This would help the user with pre-processing and give them a sense of what the texts are talking about.
Feel free to correct me if I wrote anything incorrectly from a data science perspective.
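
For illustration, the two-step flow above (load data, then run a selected technique over the whole set) could be sketched roughly as follows. Everything here is an assumption made for the example: the function names, the `TECHNIQUES` registry, the `text` column name, and the placeholder "top term" grouping, which only stands in for a real clustering algorithm or LDA. It is not Obsei code.

```python
import csv
import io
from collections import Counter, defaultdict
from typing import Callable, Dict, List

# Tiny illustrative stopword list (assumption, not a real resource).
STOPWORDS = {"the", "a", "an", "is", "was", "and", "to", "of"}


def load_texts(csv_file, column: str = "text") -> List[str]:
    """Step 1: read one text column from a CSV file object."""
    return [row[column] for row in csv.DictReader(csv_file)]


def tokenize(text: str) -> List[str]:
    """Lowercase, split on whitespace, drop stopwords and non-alphabetic tokens."""
    return [w for w in text.lower().split() if w.isalpha() and w not in STOPWORDS]


def group_by_top_term(texts: List[str]) -> Dict[str, List[str]]:
    """Placeholder technique: bucket each text under its most frequent
    corpus-wide term (ties broken alphabetically, last wins)."""
    corpus_counts = Counter(w for t in texts for w in tokenize(t))
    groups = defaultdict(list)
    for t in texts:
        words = tokenize(t)
        key = max(words, key=lambda w: (corpus_counts[w], w)) if words else "<empty>"
        groups[key].append(t)
    return dict(groups)


# Hypothetical registry: real entries might wrap a clustering algorithm or LDA.
TECHNIQUES: Dict[str, Callable[[List[str]], Dict[str, List[str]]]] = {
    "top_term": group_by_top_term,
}


def run_topic_modeling(csv_file, technique: str = "top_term") -> Dict[str, List[str]]:
    """Step 2: run the selected technique on the whole data set."""
    return TECHNIQUES[technique](load_texts(csv_file))


sample = io.StringIO(
    "text\n"
    "delivery was fast\n"
    "fast delivery and friendly agent\n"
    "refund is slow\n"
    "slow refund process\n"
)
topics = run_topic_modeling(sample)
```

Keeping the technique behind a registry key means the loading step stays the same when a different algorithm is plugged in.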

@shahrukhx01
Collaborator Author

Just adding to Lalit's comment: when I opened this issue, my intention was to have a pipeline comparable to the zero-shot classifier. The end goal is to sort data into user-defined categories/clusters without any fine-tuning or training.
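
As a rough sketch of that goal (assigning texts to user-defined categories with no training step), the snippet below scores each text against a short description of every category. It uses plain bag-of-words cosine similarity purely as a stand-in for sentence embeddings or a zero-shot model; the function names, category descriptions, and the `"uncategorized"` fallback are all assumptions for illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def bag_of_words(text: str) -> Counter:
    """Very rough text representation: lowercase token counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def categorize(texts: List[str], categories: Dict[str, str]) -> Dict[str, str]:
    """Assign each text to the best-matching user-defined category."""
    category_vectors = {name: bag_of_words(desc) for name, desc in categories.items()}
    result = {}
    for text in texts:
        vector = bag_of_words(text)
        scores = {name: cosine(vector, cv) for name, cv in category_vectors.items()}
        best = max(scores, key=scores.get)
        # No overlap with any category description: leave the text unassigned.
        result[text] = best if scores[best] > 0 else "uncategorized"
    return result


# Hypothetical user-defined categories, each described by a few seed words.
categories = {
    "delivery": "delivery shipping courier late fast",
    "support": "agent support call rude helpful",
}
assigned = categorize(
    ["fast delivery", "the agent was rude", "nice colour"], categories
)
```

With real embeddings in place of `bag_of_words`, texts that share no exact words with a category description could still be matched.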

@lalitpagaria
Collaborator

Yes @shahrukhx01, this is important and very helpful for users who don't have the resources to run Obsei on GPUs. A few Obsei users have been asking for it as well. So this will be another classification analyzer.

@akar5h
Contributor

akar5h commented Jul 29, 2021

@shahrukhx01 @lalitpagaria
Thanks for the input. Based on the discussion, I will start building a (semi-supervised) clustering pipeline to be used as another classification analyzer.
https://github.com/MaartenGr/BERTopic seems like a good starting point for this.

> Yes @shahrukhx01, this is important and very helpful for users who don't have the resources to run Obsei on GPUs. A few Obsei users have been asking for it as well. So this will be another classification analyzer.

> Just adding to Lalit's comment: when I opened this issue, my intention was to have a pipeline comparable to the zero-shot classifier. The end goal is to sort data into user-defined categories/clusters without any fine-tuning or training.

Also, I feel topic modeling could eventually have a separate module; it is more practical to use it as a visualizer/preprocessor built on some clustering algorithms.

@lalitpagaria lalitpagaria assigned akar5h and unassigned shahrukhx01 Jul 30, 2021
@lalitpagaria
Collaborator

Great @akar5h. Looking forward to it :)

@lalitpagaria
Collaborator

Adding one more: https://github.com/ddangelov/Top2Vec

@akar5h
Contributor

akar5h commented Aug 12, 2021

Hi Lalit @lalitpagaria
I have worked out a first iteration of the topic analyzer, and I want to clarify the output format of TopicAnalyzer:

This topic analyzer takes unlabelled texts as input and clusters them. It inherits from BaseAnalyzer, so analyze_input is its base function.
After clustering the texts, I calculate the most frequent terms in each cluster and use them as the representation of the cluster. The output of analyze_input is a List[TextPayload], where each TextPayload has this "representation of the cluster" as its processed text, and meta contains a list of all TextPayloads that belong to this cluster.
Example: Cluster 1 most frequent topics = ["Happy", "fast", "delivery"], Cluster 2 = ["worse", "bad", "agent"] as the representation of each cluster.
Then the output of analyze_input will be [TextPayload(processed_text = "Happy_fast_delivery", meta: [list of TextPayloads in this cluster]), TextPayload(processed_text = "worse_bad_agent", meta: [list of TextPayloads in this cluster])]

Is this output format, with clusters and their representations, fine, or do you expect the calculated labels in some other format?

You can see this implemented here:
https://github.com/akar5h/obsei/blob/topic-analyzer/obsei/analyzer/topic_analyzer.py

I will create a PR by the end of the week, once I have integrated a non-deep-learning approach (LDA) into this analyzer.
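
To make the output format proposed in this comment concrete, here is a minimal, self-contained sketch. It is not the code from the linked branch: `TextPayload` below is a simplified stand-in for Obsei's class, `summarize_clusters` and the `"cluster_members"` meta key are hypothetical names, and the cluster labels are assumed to come from whichever clustering step runs earlier.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class TextPayload:
    """Simplified stand-in for Obsei's payload class (assumption, not the real one)."""
    processed_text: str
    meta: Dict[str, Any] = field(default_factory=dict)


def summarize_clusters(
    payloads: List[TextPayload], labels: List[int], top_n: int = 3
) -> List[TextPayload]:
    """Return one payload per cluster: the top terms joined with '_' as
    processed_text, and the member payloads stored in meta."""
    clusters = defaultdict(list)
    for payload, label in zip(payloads, labels):
        clusters[label].append(payload)

    output = []
    for label in sorted(clusters):
        members = clusters[label]
        counts = Counter(
            word for p in members for word in p.processed_text.lower().split()
        )
        top_terms = [word for word, _ in counts.most_common(top_n)]
        output.append(
            TextPayload(
                processed_text="_".join(top_terms),
                meta={"cluster_members": members},  # hypothetical key name
            )
        )
    return output


payloads = [
    TextPayload("happy fast delivery"),
    TextPayload("happy with fast delivery"),
    TextPayload("worse bad agent"),
    TextPayload("worse bad agent again"),
]
labels = [0, 0, 1, 1]  # assumed output of an earlier clustering step
summary = summarize_clusters(payloads, labels)
```

Storing the members under a named key in `meta` is only one option; a plain list, as described above, would carry the same information.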

@lalitpagaria
Collaborator

Thanks a lot @akar5h.
Let me go through it and give you an early review.

@lalitpagaria
Collaborator

@akar5h I just had a cursory look and it looks fine to me. There are a few code-structure-related things which we can take forward in the PR review.

@shahrukhx01 can you please have a look, as you have more context in this field?

@shahrukhx01
Collaborator Author

@lalitpagaria sure I will take a look at it over the weekend.

@lalitpagaria
Collaborator

@akar5h could you please create a PR so we can discuss it there?

@akar5h
Contributor

akar5h commented Aug 19, 2021

@lalitpagaria, will do for sure. I'm running on low bandwidth because of office work this week.

@shahrukhx01
Collaborator Author

Hi @lalitpagaria @akar5h, I'm really sorry. I have been tied up with a couple of other things lately. I will try to give my input on this within this week.

@lalitpagaria
Collaborator

@akar5h @shahrukhx01 No problem, please take your time. There's no urgency :)


3 participants