Double hyphens in compounds words in Czech, Portuguese, etc. #1963

Closed
Omikhleia opened this issue Jan 21, 2024 · 17 comments · Fixed by #1995

@Omikhleia
Member

Omikhleia commented Jan 21, 2024

As noted in #1960, it seems Czech also repeats the hyphen when breaking a compound word. Some other languages might do the same, see below.

The same solution would likely apply.
But I only found single references to this feature in TeX StackExchange discussions -- We may need some more normative documents and references before generalizing such a feature...

  • Czech (example with modro-zelený)
  • Portuguese, also here (examples with anti--inflamatório, entendendo-se)
  • Which other languages... mentions Spanish, Basque, Czech, Polish, Portuguese (also states that in Czech, this is not widely applied...)
  • Czech again: last paragraph "Pokud se spojovník objeví na konci řádku a nenaznačuje neúplné slovo, opakuje se na začátku řádku dalšího, např. česko‑ | ‑polské, není‑ | ‑li. Spojovník na začátku dalšího řádku neopakujeme, naznačuje‑li rozdělení slova (např. žong‑ | lér). Ve webových a e‑mailových adresách se spojovník k naznačení rozdělení adresy do více řádků nepoužívá." -- according to DeepL, "If the hyphen appears at the end of a line and does not indicate an incomplete word, it is repeated at the beginning of the next line, e.g. česko- | -polské, není- | -li. Do not repeat the hyphen at the beginning of the next line if it indicates a split word (e.g. žong- | lér)."
  • Spanish on Wikipedia: "Words written with hyphen are hyphenated by repeating the hyphen on the following line: teórico-/-práctico. Repeating the hyphen is not necessary if the hyphenated word is a proper name where a hyphen is followed by a capital letter" (Note the sudden exception... but again no reference...)
  • Babel's v3.58 Release Notes mention activating such repetition for Czech, Polish, Portuguese, Slovak, Spanish.
@Omikhleia Omikhleia added bug Software bug issue question Ask for advice or investigate solutions labels Jan 21, 2024
@Omikhleia
Member Author

Omikhleia commented Jan 21, 2024

According to your profile, @jodros you are from Brasil, perhaps you can comment on these rules for Portuguese?

If this is not widespread, we may need settings...
If it is locale/country/idiom dependent, we may want to proceed with #1641 so as to be able to use BCP47-qualified languages to select the proper rules by default...

Typography is hard ;)

@jodros
Contributor

jodros commented Jan 22, 2024

> perhaps you can comment on these rules for Portuguese?

Right, but I still don't know how the hyphenation algorithm works, where could I start?

@Omikhleia
Member Author

@jodros

> Right, but I still don't know how the hyphenation algorithm works, where could I start?

As far as you know, if a line is broken at the hyphen in Portuguese "anti-inflamatório", should it yield:

case 1 (nothing fancy)

...... anti-
inflamatório

Or case 2 (repeated hyphens):

....... anti-
-inflamatório

In the second case, we'd need to generalize the solution adopted for Polish. It seems we have to do it for other languages too, but I'm trying to understand which languages are concerned and whether it's a widespread rule in these languages, so as to propose the correct generalization.
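
To make the two cases concrete, here is a minimal Python sketch (not SILE code; the function name `break_at` and its `repeat_explicit_hyphen` flag are hypothetical). It assumes the rule quoted from the Czech source above: an explicit hyphen at the break may be repeated on the next line, while a break inside a plain word only gets the usual end-of-line hyphen.

```python
def break_at(word, pos, repeat_explicit_hyphen):
    """Split `word` before index `pos` and return (line_end, next_line_start).

    Hypothetical helper for illustration only.
    - If the character just before the break is an explicit hyphen
      (compound word), keep it at the line end and, when the flag is set,
      repeat it at the start of the next line (case 2).
    - Otherwise the break is an ordinary hyphenation point: add a hyphen
      at the line end and never repeat it.
    """
    left, right = word[:pos], word[pos:]
    if left.endswith("-"):
        if repeat_explicit_hyphen:
            return left, "-" + right   # case 2: anti- | -inflamatório
        return left, right             # case 1: anti- | inflamatório
    return left + "-", right           # soft break: žong- | lér


print(break_at("anti-inflamatório", 5, repeat_explicit_hyphen=False))
print(break_at("anti-inflamatório", 5, repeat_explicit_hyphen=True))
print(break_at("žonglér", 4, repeat_explicit_hyphen=True))
```

Whether the flag should default to true for a given language is exactly the open question here.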


@Omikhleia
Member Author

Omikhleia commented Jan 22, 2024

As for the general logic (simplified, but it took me a while to get a grasp of it -- a bit off-topic here, but worth trying to explain anyway):

  1. Input text is set in "unshaped" text nodes initially
  2. Unshaped nodes are later shaped into "nnodes" (with dimensions)
  3. As part of the process, the nodes are also segmented into elementary "words"
    • In most cases via SILE.nodeMakers.unicode which uses ICU to identify word boundaries
    • Sometimes via a language-specific subclass of the latter, to implement some additional fancy rules (French does it for its special interpretation of punctuation spaces; Polish now does it too for repeated hyphens).
  4. Each word is hyphenated (so a parent word node has children segments, for each potential hyphenation point)
    • (Most of the time), using the Liang algorithm with language-specific patterns (introducing "-" discretionary nodes where hyphenation may occur, between segments) = that's the most TeX-like part here....
    • Some languages (Turkish, and now Catalan too) require a specific post-hyphenation logic (for context-dependent discretionary nodes)
  5. Later at paragraphing time, this will be fed to the line-breaking engine, again a TeX-like part, but that's another story

To recap, then:

  • We have specific segmenting rules for French (punctuation handling) and Polish (repeated hyphens) = This is where we may want to generalize the solution for more languages (Slovak? Czech? Spanish? Portuguese? Basque?) -- Hence my question.
  • We have specific post-hyphenation/segmentation rules for Turkish and Catalan (also to be refactored, but that's a whole other story)
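
The "language-specific subclass of the node maker" idea from step 3 can be sketched roughly as follows. This is a standalone Python illustration, not SILE's actual Lua implementation: the class names, the `repeat-hyphen` node kind, and the whitespace-based splitting (a stand-in for ICU word boundaries) are all assumptions made for the example.

```python
import re
from dataclasses import dataclass


@dataclass
class Node:
    kind: str  # "word", "space", or "repeat-hyphen" (hypothetical kinds)
    text: str


class UnicodeSegmenter:
    """Default segmenter: whitespace splitting stands in for ICU here."""

    def segment(self, text):
        nodes = []
        for token in re.findall(r"\s+|\S+", text):
            if token.isspace():
                nodes.append(Node("space", token))
            else:
                nodes.extend(self.make_word(token))
        return nodes

    def make_word(self, token):
        return [Node("word", token)]


class RepeatedHyphenSegmenter(UnicodeSegmenter):
    """Language-specific subclass: marks explicit hyphens in compounds."""

    def make_word(self, token):
        parts = token.split("-")
        # Leave plain words and leading/trailing/double hyphens untouched.
        if len(parts) == 1 or "" in parts:
            return super().make_word(token)
        nodes = []
        for index, part in enumerate(parts):
            if index > 0:
                # A later stage would typeset this as "-" within a line,
                # or as "-" at the line end plus "-" on the next line.
                nodes.append(Node("repeat-hyphen", "-"))
            nodes.append(Node("word", part))
        return nodes


DEFAULT = UnicodeSegmenter()
SEGMENTERS = {"pl": RepeatedHyphenSegmenter()}  # per-language registry


def segmenter_for(language):
    return SEGMENTERS.get(language, DEFAULT)


print([n.kind for n in segmenter_for("pl").segment("biało-czerwona flaga")])
```

Generalizing to more languages would then amount to registering the same subclass under more language codes, once it is clear which ones actually need it.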


@alerque
Member

alerque commented Jan 24, 2024

Any thoughts on whether we'll know enough to add fixes for other languages soon or should I go ahead with v0.14.15 with the Catalan/Polish/Turkish/French features we have queued up already?

@Omikhleia
Member Author

Omikhleia commented Jan 25, 2024

> Any thoughts on whether we'll know enough to add fixes for other languages soon or should I go ahead with v0.14.15 with the Catalan/Polish/Turkish/French features we have queued up already?

It depends when you want to ship 0.14.15 -- I'm willing to work on the topic, but I don't think it needs to be rushed -- after all, SILE's presence on GitHub just passed 10+ years, and no one came asking for these... So I bet we can take some time to think on how to do it properly. In the same vein, #1242 (deriving from a fix where I needed to deactivate the French unicode segmenter) could possibly be addressed too in a nicer way.

In the meantime, we have a "quick workaround" if anyone urgently needs the repeating dashes in any language. Just insert the ugly hack after your first target language change, and voilà!

\begin[papersize=a6]{document}
\set[parameter=document.parindent, value=1.25em]
\nofolios
\language[main=pt]
% BEGIN DOUBLE HYPHEN WORKAROUND
\lua{
-- Switch to Polish temporarily
-- and steal its node maker to current main language
local current = SILE.settings:get("document.language")
SILE.settings:temporarily(function ()
  SILE.call("language", { main = "pl" })
  SILE.nodeMakers[current] = SILE.nodeMakers.pl
end)
}% END DOUBLE HYPHEN WORKAROUND

\font[size=16]
\kern[width=210pt] anti-inflamatório

\end{document}

(Checked with Portuguese, Czech and Spanish)

@alerque
Member

alerque commented Jan 27, 2024

> It depends when you want to ship 0.14.15 -- I'm willing to work on the topic, but I don't think it needs to be rushed -- after all, SILE's presence on GitHub just passed 10+ years, and no one came asking for these.

I have an upcoming publication project that wants to use the alternate Turkish apostrophe handling and it is always much nicer to do production work in a shipped stable version of SILE. At this point the release machinery is working pretty well and it isn't too much hassle to make small patch releases with incremental improvements.

@Omikhleia
Member Author

Interesting feedback here typst/typst#3235 (comment) adding (lower) Sorbian and Croatian to the list, and confirming Czech and Slovak.

Sorbian is a minority language (< 50000 people) and doesn't have a 2-letter language code. Unless I'm mistaken, the 3-letter codes are hsb (Upper Sorbian), dsb (Lower Sorbian) and wen (Sorbian or "Wendish" collectively).

@Omikhleia
Member Author

Food for thought: I am not sure we should use settings to enable/disable such features, at the cost of checking them many times, when they wouldn't change much normally.

A possible alternative would be to encode this in the BCP47 language name, as an extension.

So far, unless I'm mistaken, BCP47 only has two official extensions, -t- (RFC 6497) and -u- (RFC 6067), but the "private use" -x- extension could be used here for our own purposes.

For instance:

  • pl would be using the double hyphens by default, pl-x-nohyphens (or any similar pattern as long as respecting BCP47) would disable them (i.e. use the default unicode node maker rather than its derived version).
  • fr would use the French punctuation rules, fr-x-nospacing would disable them (rather than my low-level hack-like pseudo _fr_noSpacingRules from feat(packages): URLs are allowed many more breakpoints #1233)
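
As a rough illustration of how such tags could be split into a base language and private-use flags, here is a small Python sketch. The function name and the example flags (`nohyph`, `nospace`) are hypothetical; the only thing taken from BCP47 (RFC 5646) is that everything after the `x` singleton is private use and that each subtag there must be 1 to 8 alphanumeric characters.

```python
import re

_PRIVATE_SUBTAG = re.compile(r"[A-Za-z0-9]{1,8}")


def split_private_use(tag):
    """Return (base_tag, private_subtags) for a BCP47-style tag.

    Raises ValueError when the private-use part is empty or contains a
    subtag that is not 1-8 alphanumeric characters.
    """
    subtags = tag.split("-")
    lowered = [s.lower() for s in subtags]
    if "x" not in lowered:
        return tag, []
    start = lowered.index("x")
    base, private = subtags[:start], subtags[start + 1:]
    if not private:
        raise ValueError(f"empty private-use sequence in {tag!r}")
    for subtag in private:
        if not _PRIVATE_SUBTAG.fullmatch(subtag):
            raise ValueError(f"invalid private-use subtag {subtag!r} in {tag!r}")
    return "-".join(base), private


print(split_private_use("pl"))            # no private-use part
print(split_private_use("pl-x-nohyph"))   # hypothetical flag, 6 characters
print(split_private_use("fr-x-nospace"))  # hypothetical flag, 7 characters
```

With this kind of split, the base tag would select the language as usual and the private subtags would select which node maker variant to use.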

@DavidLRowe
Contributor

A minor point: I believe the string of characters after the -x- of a private-use tag is limited to 8 characters. So, for example, -x-nohyphens would be too long.

@alerque
Member

alerque commented Feb 3, 2024

This is not a comment on using BCP-47 private extensions because I haven't looked into that...

But yes @Omikhleia, sometimes the place where we want to use a setting is too hot a loop to actually check it every time, given that settings can change at almost any time. Something we haven't really utilized yet, but could if we need to, is callbacks: there is no reason we can't rig up settings:set() with a callback function that invalidates a cache or private variable used somewhere for efficiency purposes. The end user would not need to be any the wiser. All we would need is a registry to store the callbacks, and they could be registered from almost anywhere. Since they would be Lua functions that act as closures, they could reach into whatever private implementation was used to speed up the hot loops with a cached value, while still allowing it to be changed with a setting.
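
A minimal sketch of that callback idea, written in Python rather than Lua so it can stand alone: the `Settings` class, `register_callback`, and the setting name `languages.repeathyphens` are all hypothetical and do not reflect SILE's actual settings API.

```python
class Settings:
    """Toy settings store with per-setting change callbacks."""

    def __init__(self):
        self._values = {}
        self._callbacks = {}  # registry: setting name -> list of functions

    def register_callback(self, name, callback):
        self._callbacks.setdefault(name, []).append(callback)

    def get(self, name, default=None):
        return self._values.get(name, default)

    def set(self, name, value):
        self._values[name] = value
        # Notify listeners in registration order so they can refresh caches.
        for callback in self._callbacks.get(name, []):
            callback(value)


def make_repeat_checker(settings, name="languages.repeathyphens"):
    """Return a cheap function for hot loops; the cache lives in a closure."""
    cached = settings.get(name, False)

    def on_change(value):
        nonlocal cached
        cached = value

    settings.register_callback(name, on_change)

    def should_repeat():
        return cached  # no settings lookup on the hot path

    return should_repeat


settings = Settings()
should_repeat = make_repeat_checker(settings)
print(should_repeat())
settings.set("languages.repeathyphens", True)
print(should_repeat())
```

The hot loop only ever reads the closed-over variable; the cost of keeping it in sync is paid once per `set()` call instead of once per node.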

@Omikhleia
Member Author

Omikhleia commented Feb 3, 2024

@alerque Yes, active hooks on settings are also a possibility I had in mind. I'm always reluctant about such hooks / callbacks (because ordering is unclear and side-effects are not always intended), but it may have to be considered.

As additional food for thought: I suspect those languages would not repeat the hyphens when breaking URLs (and thus would have to bypass it, as does the current _fr_noSpacingRules hack).

@Omikhleia Omikhleia changed the title Hyphenation in compounds words in Czech, Portuguese, etc. Double hyphens in compounds words in Czech, Portuguese, etc. Feb 3, 2024
@jodros
Contributor

jodros commented Feb 7, 2024

> case 1 (nothing fancy)
>
> ...... anti-
> inflamatório
>
> Or case 2 (repeated hyphens):
>
> ....... anti-
> -inflamatório

@Omikhleia I've just checked for examples in a reference grammar [1], in the part about hyphens, and indeed all the examples testify in favor of case 2.

Footnotes

  1. Gramática da língua portuguesa padrão by Amini Hauy (Grammar of standard Portuguese)

@Omikhleia
Member Author

Omikhleia commented Feb 7, 2024

For Basque (which we support, code eu), see this orthotypography manual (p. 53) and this other, more general document (p. 47).

= Both seem to contradict the repetition of hyphens (marratxoa) mentioned in some discussions (EDIT: LaTeX Babel does not mention it, my bad; other sources cited above did).

"Lerro-bukaerako marratxoa hitz-elkarketarena izanez gero, ez dago marratxo hori errepikatu beharrik hurrengo lerroaren hasieran." --> Google translated: "If the hyphen at the end of the line belongs to the combination of words, there is no need to repeat that hyphen at the beginning of the next line."

And the second document even illustrates the wrong usage (marked with an asterisk) and the correct one.

[Image: excerpt from the second document showing the incorrect usage (asterisked) and the correct one]

--> So no for Basque, in the general case. (I did see various posts on the web from people asking how to do it, but official recommendations seem to disfavor it)
