include algo of tidy autostemmer #225
Comments
Thank you for sharing your project @edvardoss! 🙌 In #17 we discussed how to support or include stemming within tidytext and decided against it, since these approaches are quite diverse and already work with a tidy data principles approach. I see that is already true of your project:

library(tidyverse)
library(tidytext)
library(abbrevTexts)

tidy_p_and_p <-
  tibble(txt = janeaustenr::prideprejudice) %>%
  unnest_tokens(word, txt)

p_and_p_dict <-
  makeAbbrStemDict(
    term.vec = tidy_p_and_p$word,
    min.len = 3,
    min.share = .6
  )

tidy_p_and_p %>%
  left_join(p_and_p_dict, by = c("word" = "parent")) %>%
  mutate(word = coalesce(terminal.child, word)) %>%
  anti_join(get_stopwords()) %>%
  count(word, sort = TRUE)
#> Joining, by = "word"
#> # A tibble: 4,940 × 2
#> word n
#> <chr> <int>
#> 1 mr 785
#> 2 elizabeth 635
#> 3 darcy 417
#> 4 said 401
#> 5 though 344
#> 6 mrs 343
#> 7 ever 334
#> 8 much 327
#> 9 bennet 323
#> 10 bingley 306
#> # … with 4,930 more rows

## to compare
tidy_p_and_p %>%
  anti_join(get_stopwords()) %>%
  count(word, sort = TRUE)
#> Joining, by = "word"
#> # A tibble: 6,404 × 2
#> word n
#> <chr> <int>
#> 1 mr 785
#> 2 elizabeth 597
#> 3 said 401
#> 4 darcy 373
#> 5 mrs 343
#> 6 much 326
#> 7 must 305
#> 8 bennet 294
#> 9 miss 283
#> 10 jane 264
#> # … with 6,394 more rows

Created on 2022-12-09 with reprex v2.0.2

So we are really glad to see your approach available 🎉 but it wouldn't be something we would include in tidytext itself.
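For anyone following along, the join-and-coalesce step in the reprex can be seen in isolation on a toy dictionary. This is only a minimal sketch: the words and the `toy_dict` table are made up for illustration and are not output from abbrevTexts; only the column names `parent` and `terminal.child` are taken from the reprex above.

```r
library(dplyr)

# Hypothetical toy data (an assumption, not abbrevTexts output).
# The column names mirror the dictionary in the reprex:
# parent = original word, terminal.child = the stem it maps to.
words <- tibble(word = c("walking", "walked", "walk", "darcy"))
toy_dict <- tibble(
  parent = c("walking", "walked"),
  terminal.child = c("walk", "walk")
)

stemmed <- words %>%
  left_join(toy_dict, by = c("word" = "parent")) %>%
  # Words absent from the dictionary get NA in terminal.child,
  # so coalesce() falls back to the original word.
  mutate(word = coalesce(terminal.child, word)) %>%
  select(word)

counts <- count(stemmed, word, sort = TRUE)
counts
```

Because the stemming lives in an ordinary data frame, the same pattern should work with any dictionary that maps a word to its stem, which is why it can stay outside tidytext.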
Hi Julia! I'd be happy if my autostemming algorithm became part of the tidytext package!
https://github.com/edvardoss/abbrevTexts