[Analyzer] Topic Modeling #131

shahrukhx01 · 2021-06-07T17:01:56Z

shahrukhx01 · 2021-06-07T19:57:46Z

@lalitpagaria you can assign me this and #130

akar5h · 2021-07-26T05:04:21Z

@shahrukhx01 Shahrukh, please confirm whether you've started anything on this yet. If not, I would like to collaborate.

shahrukhx01 · 2021-07-26T05:09:30Z

Hi Akarsh, I haven't started anything yet. Before starting, though, I'd suggest going through this:
https://www.sbert.net/examples/applications/clustering/README.html
It would also be helpful for #130.

akar5h · 2021-07-27T17:34:16Z

Thanks @shahrukhx01, I went through the sentence_transformers library and its models for clustering and have some ideas of my own.
@lalitpagaria Hi, can you please elaborate on the Topic Modeling requirements? I assume our focus is on short texts, like reviews. Is that right?

lalitpagaria · 2021-07-29T04:47:45Z

@akar5h Actually, the main idea is to do it on both short texts (reviews) and long texts (news articles, emails, etc.). But yes, we could start with short texts first. This is how I see it from the user's point of view:

The user loads data from DB/CSV/Excel/Airtable etc.
The user performs topic modeling on the whole set using the selected technique (clustering algorithm, LDA, PCA).

This would help the user with pre-processing and give them a sense of what the texts are about.
Feel free to correct me if I got anything wrong from a data science perspective.

shahrukhx01 · 2021-07-29T06:18:03Z

Just adding to Lalit's comment: when I opened this issue, my intention was to have a pipeline comparable to the zero-shot classifier. The end goal is to sort data into user-defined categories/clusters without any fine-tuning or training.
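
As a rough sketch of that idea, each text can be assigned to the user-defined category it is most similar to, with no training step. The `embed` function below is a toy bag-of-words stand-in for a real sentence encoder, and the function names and category descriptions are hypothetical; none of this is Obsei code.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a stand-in for a real sentence encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def categorize(texts, categories):
    """Assign each text to the user-defined category with the highest similarity."""
    cat_vecs = {name: embed(desc) for name, desc in categories.items()}
    result = {}
    for text in texts:
        vec = embed(text)
        result[text] = max(cat_vecs, key=lambda name: cosine(vec, cat_vecs[name]))
    return result


# Hypothetical user-defined categories, each described by a few keywords.
categories = {
    "delivery": "delivery shipping package courier",
    "support": "support agent call help",
}
texts = ["the package delivery was fast", "the support agent was rude"]
assignments = categorize(texts, categories)
print(assignments)
```

Swapping `embed` for a sentence-embedding model would keep the same structure while matching on meaning rather than on shared words.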

lalitpagaria · 2021-07-29T06:26:07Z

Yes @shahrukhx01, this is important and very helpful for users who don't have the resources to run Obsei on GPUs. A few Obsei users have been asking for it as well. So this will be another classification analyzer.

akar5h · 2021-07-29T19:27:15Z

@shahrukhx01 @lalitpagaria
Thanks for the input. Based on the discussion, I will start building a (semi-supervised) clustering pipeline to be used as another classification analyzer.
https://github.com/MaartenGr/BERTopic seems like a good start for this.

Also, I feel Topic Modeling could eventually have a separate module; it is more practical to use it as a visualizer/preprocessor built on some clustering algorithms.

lalitpagaria · 2021-07-30T06:21:11Z

Great @akar5h. Looking forward to it :)

lalitpagaria · 2021-08-05T11:41:36Z

Adding one more https://github.com/ddangelov/Top2Vec

akar5h · 2021-08-12T17:00:21Z

Hi Lalit @lalitpagaria,
I have worked out a first iteration of the topic analyzer and want to clarify the output format of TopicAnalyzer:

The topic analyzer takes unlabelled texts as input and clusters them. It inherits from BaseAnalyzer, so analyze_input is its base function.
After clustering the texts, I calculate the most frequent terms in each cluster and use them as the representation of the cluster. The output of analyze_input is a List[TextPayload], where each TextPayload has this "representation of the cluster" as its processed text, and meta contains a list of all TextPayloads that belong to the cluster.
Example: Cluster 1's most frequent topics = ["Happy", "fast", "delivery"] and Cluster 2's = ["worse", "bad", "agent"] are the representations of the clusters.
Then the output of analyze_input will be [TextPayload(processed_text = "Happy_fast_delivery", meta: [list of TextPayloads in this cluster]), TextPayload(processed_text = "worse_bad_agent", meta: [list of TextPayloads in this cluster])]
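
To make the shape of that output concrete, here is a minimal sketch. The `TextPayload` class below is a simplified stand-in written for this example, not Obsei's actual class; `summarize_clusters`, the `"cluster_members"` key, and the hard-coded cluster labels are likewise assumptions for illustration only.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class TextPayload:
    """Simplified stand-in for illustration; not Obsei's actual class."""
    processed_text: str
    meta: Dict[str, Any] = field(default_factory=dict)


def summarize_clusters(texts: List[str], labels: List[int], top_n: int = 3) -> List[TextPayload]:
    """Build one payload per cluster, named after its most frequent words."""
    clusters: Dict[int, List[str]] = {}
    for text, label in zip(texts, labels):
        clusters.setdefault(label, []).append(text)

    output = []
    for label in sorted(clusters):
        members = clusters[label]
        word_counts = Counter(w for t in members for w in t.lower().split())
        top_words = [w for w, _ in word_counts.most_common(top_n)]
        output.append(
            TextPayload(
                processed_text="_".join(top_words),
                # "cluster_members" is a hypothetical key chosen for this sketch.
                meta={"cluster_members": [TextPayload(processed_text=t) for t in members]},
            )
        )
    return output


# Cluster labels are hard-coded here; a real analyzer would get them
# from a clustering step.
texts = ["happy fast delivery", "fast delivery happy", "bad agent", "worse bad agent"]
labels = [0, 0, 1, 1]
payloads = summarize_clusters(texts, labels)
for payload in payloads:
    print(payload.processed_text, len(payload.meta["cluster_members"]))
```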

Is this output format, with the clusters and their representations, fine, or do you expect the calculated labels in some other format?

You can see this implemented here:
https://github.com/akar5h/obsei/blob/topic-analyzer/obsei/analyzer/topic_analyzer.py

I will create a PR by the end of the week, once I have integrated a non-deep-learning approach (LDA) into this analyzer.

lalitpagaria · 2021-08-12T18:15:34Z

Thanks a lot @akar5h
Let me review it and give you some early feedback.

lalitpagaria · 2021-08-13T15:01:33Z

@akar5h I just had a cursory look and it looks fine to me. There are a few code-structure things which we can take forward in the PR review.

@shahrukhx01 can you please have a look, as you have more context in this field?

shahrukhx01 · 2021-08-13T15:15:24Z

@lalitpagaria sure I will take a look at it over the weekend.

lalitpagaria · 2021-08-19T04:56:23Z

@akar5h could you please create a PR so we can discuss it there?

akar5h · 2021-08-19T05:27:28Z

@lalitpagaria Will do, surely. I'm running on low bandwidth with office work this week.

shahrukhx01 · 2021-08-19T09:14:37Z

Hi @lalitpagaria @akar5h, I'm really sorry. I have been super consumed by a couple of other things lately. I will try to give my input on this within this week.

lalitpagaria · 2021-08-20T14:59:51Z

@akar5h @shahrukhx01 No problem, please take your time. No urgency :)

shahrukhx01 added the enhancement label Jun 7, 2021

lalitpagaria assigned shahrukhx01 Jun 7, 2021

shahrukhx01 changed the title ~~Topic Modeling~~ [Analyzer] Topic Modeling Jul 5, 2021

shahrukhx01 mentioned this issue Jul 5, 2021

[Analyzer] Unsupervised Clustering #130

Open

lalitpagaria added the analyzer label Jul 9, 2021

lalitpagaria assigned akar5h and unassigned shahrukhx01 Jul 30, 2021
