Add lowercase script? #82

Open
mayhewsw opened this issue Jan 14, 2020 · 20 comments

Comments

@mayhewsw

Moses scripts included a useful lowercasing script. Are there any plans to add this?

@alvations
Contributor

This is actually quite trivial in Python and on the command line, so I'm not sure whether adding a lowercase script would be beneficial.

In Python:

s = "abc" 
s.lower()

On command line:

tr '[:upper:]' '[:lower:]' < in.txt > out.txt

But if more people vote +1 on the idea, it's not hard to implement and add it =)

@mayhewsw
Author

It's definitely easy to do (although I always have to google the command-line version), but I often find myself looking for a script to do it, and the original Moses had one.

@noe

noe commented Jan 20, 2020

Caution with tr: most versions are not Unicode-compliant: https://stackoverflow.com/a/13383175/674487

@mayhewsw
Author

For what it's worth, the default tr on macOS 10.15.2 seems to work fine.

$ echo "He lived in Moscow." | tr [:upper:] [:lower:]
he lived in moscow.
$ echo "Он жил в Москве." | tr [:upper:] [:lower:]
он жил в москве.
$ echo "Έζησε στη Μόσχα." | tr [:upper:] [:lower:]
έζησε στη μόσχα

@noe

noe commented Jan 20, 2020

Not in GNU coreutils 8.28 (Ubuntu 18.04.3):

$ echo "He lived in Moscow." | tr [:upper:] [:lower:]
he lived in moscow.
$ echo "Он жил в Москве." | tr [:upper:] [:lower:]
Он жил в Москве.
$ echo "Έζησε στη Μόσχα." | tr [:upper:] [:lower:]
Έζησε στη Μόσχα.

@alvations
Contributor

alvations commented Jan 23, 2020

Interesting. Hmmm, so is this feature worth implementing in the sacremoses CLI?

@noe's pointer to https://stackoverflow.com/questions/13381746/tr-upper-lower-with-cyrillic-text/13383175#13383175 is right; on Ubuntu:

$ echo "Έζησε στη Μόσχα." | tr [:upper:] [:lower:]
Έζησε στη Μόσχα.

$ echo "Έζησε στη Μόσχα." | sed 's/[[:upper:]]*/\L&/'
έζησε στη Μόσχα.
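Side note: the sed one-liner above only lowercases the first run of uppercase letters on each line (which is why Μόσχα stays capitalized), since the substitution isn't global. With GNU sed in a UTF-8 locale, adding the g flag should lowercase everything:

$ echo "Έζησε στη Μόσχα." | sed 's/[[:upper:]]*/\L&/g'
έζησε στη μόσχα.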

@noe

noe commented Jan 23, 2020

To me, having lowercasing in the sacremoses CLI would be useful because:

  • It would relieve me from having to google the correct perl/awk one-liner every time I need to do it.
  • It would provide a place to fix context-dependent problems that probably shouldn't be handled in awk/perl, like those described here (e.g. a word-final Σ should be converted to ς instead of σ); see the sketch below.
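For example, here's a minimal Python sketch of that kind of context-dependent fix (just an illustration of the idea, not sacremoses code; it assumes a word-final σ is one not followed by another letter):

import re

def lowercase(text):
    # str.lower() applies the simple case mapping, so Σ always becomes σ,
    # even at the end of a word (e.g. "ΟΔΟΣ".lower() == "οδοσ").
    lowered = text.lower()
    # Map σ to the final form ς when it is not followed by another letter.
    return re.sub(r"σ(?![^\W\d_])", "ς", lowered)

print(lowercase("ΟΔΟΣ ΜΟΣΧΑ"))  # οδος μοσχα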

@mjpost

mjpost commented Mar 31, 2020

+1.

It would also be nice to chain operations, e.g.,

echo This is a test | sacremoses normalize [options] lowercase [options]...

@alvations
Contributor

@mayhewsw @noe @mjpost No promises, but lowercase is low-hanging fruit. Let's see how far I get by the end of the week in this sprint =)


@mjpost good idea on pipelining. Any other interface to follow? Can anyone point to a similar pipelining interface in a CLI? Maybe we should start with how we want to do it within Python first, then move to the CLI?


@alvations
Contributor

@mjpost Good news on chaining the commands for pipelining: https://click.palletsprojects.com/en/7.x/commands/#multi-command-pipelines =)

Gonna be a fun Tuesday tomorrow, implementing this!!
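For anyone curious, here's a minimal sketch of the chained-commands pattern from those click docs (the command names and file name are just placeholders, not the eventual sacremoses interface; note that click 7.x spells the decorator resultcallback, renamed result_callback in click 8):

import sys
import click

@click.group(chain=True)
def cli():
    """Chain subcommands, e.g. `python pipe.py lowercase uppercase`."""

@cli.resultcallback()
def run_pipeline(processors):
    # Each subcommand returns a line-processing function; apply them in order
    # to stdin, streaming line by line instead of loading everything at once.
    for line in sys.stdin:
        for processor in processors:
            line = processor(line)
        sys.stdout.write(line)

@cli.command("lowercase")
def lowercase():
    return lambda line: line.lower()

@cli.command("uppercase")
def uppercase():
    return lambda line: line.upper()

if __name__ == "__main__":
    cli()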

@alvations
Contributor

Here's an update on a POC of the pipeline: it seems that doing even simplistic stdin pipelining with click requires storing all of the data in memory first. https://github.com/alvations/warppipe

I'm not sure how UNIX does it, but keeping stdin/stdout in memory might be painful when the corpus is rather huge. Currently, if we do the processing stepwise, streaming in and out, theoretically nothing would be kept in memory, but I/O time is costly since we have to save the stdout somewhere.

Does anyone know how UNIX does streams and pipes? Any pointers?

@mjpost

mjpost commented Apr 14, 2020 via email

@alvations
Contributor

Maybe there's some usefulness in loading the whole dataset into memory instead of processing one sentence at a time. Empirically it seems to be a few seconds faster on a dataset that takes 20-30 seconds to tokenize. Maybe this should be an option too, --load-in-ram or something. Given that processing usually gets done on servers with much more RAM than plain-text data (nowadays), this isn't a problem.

From some playing around on the UNIX CLI, it looks like it processes the full pipeline in chunks instead of running the processes sequentially. Got to look at this a little more carefully. https://linux.die.net/man/7/pipe
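Roughly, the shell starts every process in a pipeline concurrently and connects them with fixed-size kernel pipe buffers (65536 bytes by default on modern Linux, per pipe(7)), so a writer simply blocks when the reader falls behind and the whole corpus is never held in memory. A minimal Python equivalent with subprocess (big.txt is just an example file name):

import subprocess

# Equivalent of `cat big.txt | tr '[:upper:]' '[:lower:]'`: both processes run
# concurrently and data flows through the kernel pipe buffer in chunks, so the
# full file is never materialised in this script's memory.
p1 = subprocess.Popen(["cat", "big.txt"], stdout=subprocess.PIPE)
p2 = subprocess.Popen(["tr", "[:upper:]", "[:lower:]"], stdin=p1.stdout)
p1.stdout.close()  # let p1 receive SIGPIPE if p2 exits early
p2.wait()
p1.wait()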

P/S: Fixing the last issues with the kwargs handling in click, and then the pipeline feature should be good to go for a PR.

@mjpost

mjpost commented Apr 14, 2020 via email

@alvations
Contributor

With the pipeline feature out of the way, coming back to lowercase: any ideas/suggestions for what options one would need for sacremoses lowercase?

I guess with the global pipeline options, the lowercase command would look something like this:

cat big.txt | sacremoses -j 4 -l en lowercase [OPTIONS] 

@mjpost

mjpost commented May 4, 2020

I can't think of any options for lowercase. That looks good above.

@mayhewsw
Author

mayhewsw commented May 4, 2020

I wonder if a "reverse lowercase" option would be useful. Sometimes you want everything in upper case.

@bricksdont

bricksdont commented May 6, 2020

@mayhewsw I can't think of a frequent NLP use case where everything needs to be uppercase. What did you have in mind?

@mayhewsw
Author

mayhewsw commented May 6, 2020

I agree that it's not frequent, but sometimes it's useful, and if the pipeline is already there, it shouldn't be hard to add w.upper(). One example: in this paper the authors wanted to create all-uppercase training data for robustness in NER.

@alvations
Contributor

There's something better coming up: upper, lower, and a surprise. But it'll take a couple of days to free myself up for some more coding and to finish up the feature =)
