Improve quotation detection by parsing quotation mark types #379

afriedman412 · 2023-05-01T19:01:10Z

Problem:

The current quote detection code is fairly unsophisticated. It sequentially pairs every token that token.is_quote flags as a quotation mark and treats each pair's indexes as the quote boundaries. If the number of quotation marks is odd, it raises an error.

Solution:

I've been doing quote detection on some unreliably formatted text lately, which has things like "»" used as bullet points and lots of unpredictable stray characters, so I came up with a workaround. I updated the quote detection functionality to return only quotes whose starting and ending code points match one of a set of predetermined pairs.

For example:
Bill told me I "shouldn‘t wear those pants" but I will.

In the current version, running quote detection here would raise an error because there are three quotation mark-like tokens in the sentence. Even if it didn't, it would return "shouldn" as a quote because textacy assumes sequential quotation marks are quote boundaries.

My version takes the first quotation mark (q) and iterates through all the later quotation marks until it finds one (q_) where (ord(q.text), ord(q_.text)) is in the list of acceptable pairs.
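
Here is a rough sketch of that matching loop, to make the idea concrete. It is not the actual patch: it works on plain characters instead of spaCy tokens, and the names QUOTE_PAIRS and find_quotes, as well as the specific pairs listed, are assumptions for illustration only.

```python
# Hypothetical set of acceptable (opening, closing) code point pairs.
# The real list in the proposed change may differ.
QUOTE_PAIRS = {
    (0x22, 0x22),      # straight double quotes
    (0x201C, 0x201D),  # curly double quotes
    (0x2018, 0x2019),  # curly single quotes
    (0xAB, 0xBB),      # guillemets
}
QUOTE_CHARS = {chr(cp) for pair in QUOTE_PAIRS for cp in pair}


def find_quotes(text):
    """Return quoted spans whose opening and closing marks form a known pair."""
    # Positions of every quotation-mark-like character, in order.
    marks = [(i, ch) for i, ch in enumerate(text) if ch in QUOTE_CHARS]
    quotes = []
    pos = 0
    while pos < len(marks):
        start, q = marks[pos]
        # Scan the later marks for the first one that pairs with q.
        for nxt in range(pos + 1, len(marks)):
            end, q_ = marks[nxt]
            if (ord(q), ord(q_)) in QUOTE_PAIRS:
                quotes.append(text[start:end + 1])
                pos = nxt  # resume after the closing mark
                break
        pos += 1
    return quotes


sentence = 'Bill told me I "shouldn\u2018t wear those pants" but I will.'
print(find_quotes(sentence))
# ['"shouldn‘t wear those pants"']
```

The stray "‘" is skipped because (0x22, 0x2018) is not an acceptable pair, and an unmatched mark such as a "»" bullet simply yields no quote instead of raising an error.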


afriedman412 added the enhancement label May 1, 2023
