Improve quotation detection by parsing quotation mark types #379

Open
afriedman412 opened this issue May 1, 2023 · 0 comments

Problem:

The current quote detection code is pretty unsophisticated. It sequentially pairs every token that token.is_quote flags as a quotation mark and assumes each pair's indexes are the quote boundaries. If there is an odd number of quotation marks, it raises an error.
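To make the failure mode concrete, here is a minimal sketch of sequential pairing. It is a simplified illustration and not textacy's actual source: it works on characters rather than spaCy tokens, and `QUOTE_CHARS` and `naive_quote_spans` are hypothetical names standing in for whatever token.is_quote flags.

```python
# Hypothetical stand-in for the characters that token.is_quote would flag.
QUOTE_CHARS = set('"‘’“”«»')


def naive_quote_spans(text):
    """Pair quote-like characters sequentially: 1st with 2nd, 3rd with 4th, ..."""
    positions = [i for i, ch in enumerate(text) if ch in QUOTE_CHARS]
    if len(positions) % 2 != 0:
        # An odd count cannot be split into pairs, so this approach gives up.
        raise ValueError(f"odd number of quotation marks: {len(positions)}")
    # Each consecutive pair is assumed to be (opening index, closing index).
    return [(positions[i], positions[i + 1]) for i in range(0, len(positions), 2)]
```

Nothing here checks whether the two marks in a pair actually belong together, so a stray "»" or "‘" either shifts every later pair or triggers the error.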

Solution:

I've been doing quote detection on some unreliably formatted text lately, which has things like "»" used as bullet points and lots of unpredictable stray characters, so I came up with a workaround. I updated the quote detection functionality to only return quotes whose starting and ending code points match one of a set of predetermined pairs.

For example:
Bill told me I "shouldn‘t wear those pants" but I will.

In the current version, running quote detection here would raise an error because there are three quotation mark-like tokens in the sentence. Even if it didn't, it would return "shouldn" as a quote because textacy assumes sequential quotation marks are quote boundaries.

My version takes the first quotation mark (q) and iterates through all the later quotation marks until it finds one (q_) where (ord(q.text), ord(q_.text)) is in the list of acceptable pairs.
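A rough character-level sketch of that matching logic follows. It is not the actual patch: `QUOTE_PAIRS` and `matched_quote_spans` are hypothetical names, the pair list is only an illustrative guess, and the choices to resume scanning after a matched closing mark and to skip an opening mark with no acceptable partner are assumptions on my part.

```python
# Assumed (opening, closing) code point pairs; illustrative, not the real list.
QUOTE_PAIRS = {
    (ord('"'), ord('"')),  # straight double quotes
    (ord("“"), ord("”")),  # curly double quotes
    (ord("‘"), ord("’")),  # curly single quotes
    (ord("«"), ord("»")),  # guillemets
}
QUOTE_CHARS = {chr(cp) for pair in QUOTE_PAIRS for cp in pair}


def matched_quote_spans(text):
    """Return (start, end) index pairs whose quote characters form an accepted pair."""
    marks = [(i, ch) for i, ch in enumerate(text) if ch in QUOTE_CHARS]
    spans = []
    n = 0
    while n < len(marks):
        start, q = marks[n]
        # Scan the later marks for the first one that closes q acceptably.
        for m in range(n + 1, len(marks)):
            end, q_ = marks[m]
            if (ord(q), ord(q_)) in QUOTE_PAIRS:
                spans.append((start, end))
                n = m  # assumption: continue after the closing mark
                break
        # If no partner was found, q is treated as a stray mark and skipped.
        n += 1
    return spans
```

On the example sentence above, this returns a single span covering "shouldn‘t wear those pants": the stray "‘" is passed over because ('"', '‘') is not an accepted pair, and no error is raised for the odd number of marks.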
