Double hyphens in compounds words in Czech, Portuguese, etc. #1963

Omikhleia · 2024-01-21T02:27:22Z

As noted in #1960, it seems Czech also repeats hyphens when breaking a compound word. Some other languages might do the same, see below.

The same solution would likely apply.
But I only found isolated references to this feature in TeX StackExchange discussions. We may need more normative documents and references before generalizing such a feature...

- Czech (example with modro-zelený)
- Portuguese, also here (examples with anti--inflamatório, entendendo-se)
- Which other languages... mentions Spanish, Basque, Czech, Polish, Portuguese (also states that in Czech, this is not widely applied...)
- Czech again: last paragraph "Pokud se spojovník objeví na konci řádku a nenaznačuje neúplné slovo, opakuje se na začátku řádku dalšího, např. česko‑ | ‑polské, není‑ | ‑li. Spojovník na začátku dalšího řádku neopakujeme, naznačuje‑li rozdělení slova (např. žong‑ | lér). Ve webových a e‑mailových adresách se spojovník k naznačení rozdělení adresy do více řádků nepoužívá." -- according to DeepL, "If the hyphen appears at the end of a line and does not indicate an incomplete word, it is repeated at the beginning of the next line, e.g. česko- | -polské, není- | -li. Do not repeat the hyphen at the beginning of the next line if it indicates a split word (e.g. žong- | lér)." (The final Czech sentence, missing from that translation, says that in web and e-mail addresses no hyphen is used to mark the splitting of an address across lines.)
- Spanish on Wikipedia: "Words written with hyphen are hyphenated by repeating the hyphen on the following line: teórico-/-práctico. Repeating the hyphen is not necessary if the hyphenated word is a proper name where a hyphen is followed by a capital letter" (Note the sudden exception... but again no reference...)
- Babel's v3.58 Release Notes mention activating such repetition for Czech, Polish, Portuguese, Slovak, Spanish.

Omikhleia · 2024-01-21T03:17:54Z

According to your profile, @jodros, you are from Brazil; perhaps you can comment on these rules for Portuguese?

If this is not widespread, we may need settings...
If it is locale/country/idiom dependent, we may want to proceed with #1641 so as to be able to use BCP47-qualified languages to select the proper rules by default...

Typography is hard ;)

jodros · 2024-01-22T19:54:44Z

> perhaps you can comment on these rules for Portuguese?

Right, but I still don't know how the hyphenation algorithm works, where could I start?

Omikhleia · 2024-01-22T22:32:55Z

@jodros

> Right, but I still don't know how the hyphenation algorithm works, where could I start?

As far as you know, if the line is broken at the hyphen in the Portuguese word "anti-inflamatório", should it yield:

case 1 (nothing fancy)

...... anti-
inflamatório

Or case 2 (repeated hyphens):

....... anti-
-inflamatório

In the second case, we'd need to generalize the solution adopted for Polish. It seems we have to do it for other languages too, but I am trying to understand which languages are concerned and whether it's a widespread rule in these languages, so as to propose the correct generalization.

Omikhleia · 2024-01-22T23:02:37Z

As for the general logic (simplified, but it took me a while to get a grasp of it -- a bit off-topic here, but worth trying to explain anyway):

- Input text is initially stored in "unshaped" text nodes.
- Unshaped nodes are later shaped into "nnodes" (with dimensions).
- As part of the process, the nodes are also segmented into elementary "words":
  - In most cases via SILE.nodeMakers.unicode, which uses ICU to identify word boundaries.
  - Sometimes via a language-specific subclass of the latter, to implement some additional fancy rules (French does it for its special interpretation of punctuation spaces; Polish now does it too for repeated hyphens).
- Each word is hyphenated (so a parent word node has child segments, split at each potential hyphenation point):
  - Most of the time, using the Liang algorithm with language-specific patterns (introducing "-" discretionary nodes between segments, where hyphenation may occur). That's the most TeX-like part here.
  - Some languages (Turkish, and now Catalan too) require specific post-hyphenation logic (for context-dependent discretionary nodes).
- Later, at paragraphing time, this is fed to the line-breaking engine, again a TeX-like part, but that's another story.

To recap, then:

- We have specific segmenting rules for French (punctuation handling) and Polish (repeated hyphens). This is where we may want to generalize the solution for more languages (Slovak? Czech? Spanish? Portuguese? Basque?), hence my question.
- We have specific post-hyphenation/segmentation rules for Turkish and Catalan (also to be refactored, but that's a whole other story).
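
To make the repeated-hyphen case more concrete, here is a rough standalone Lua sketch of how a break at an explicit hyphen can be modelled as a TeX-style discretionary (text before the break, text after the break, text when no break is taken). The function names and the table layout are made up for illustration; this is not SILE's actual node maker code, and it ignores edge cases such as a word that starts or ends with a hyphen.

```lua
-- Hypothetical model of a discretionary break (NOT SILE's real node classes):
--   prebreak    = text ending the line when the break is taken
--   postbreak   = text starting the next line when the break is taken
--   replacement = text used when the break is not taken
local function compoundDiscretionary (repeatHyphen)
  return {
    prebreak = "-",
    postbreak = repeatHyphen and "-" or "",
    replacement = "-",
  }
end

-- Split a compound at its explicit hyphens, interleaving discretionaries.
local function segmentCompound (word, repeatHyphen)
  local nodes = {}
  for segment in word:gmatch("[^%-]+") do
    if #nodes > 0 then
      nodes[#nodes + 1] = compoundDiscretionary(repeatHyphen)
    end
    nodes[#nodes + 1] = segment
  end
  return nodes
end

-- Render the nodes as lines, taking the break at discretionary number
-- `breakAt` (or taking no break at all when breakAt is nil).
local function render (nodes, breakAt)
  local lines, current, seen = {}, "", 0
  for _, node in ipairs(nodes) do
    if type(node) == "string" then
      current = current .. node
    else
      seen = seen + 1
      if seen == breakAt then
        lines[#lines + 1] = current .. node.prebreak
        current = node.postbreak
      else
        current = current .. node.replacement
      end
    end
  end
  lines[#lines + 1] = current
  return lines
end

-- Case 2 above: the hyphen is repeated at the start of the next line.
local case2 = render(segmentCompound("anti-inflamatório", true), 1)
print(case2[1], case2[2])
```

With `repeatHyphen` set to false, the same call gives case 1 (nothing at the start of the second line). The point of the sketch is only that the two conventions differ in the post-break text of the discretionary, which is why a language-specific node maker is a natural place for the rule.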

alerque · 2024-01-24T12:20:44Z

Any thoughts on whether we'll know enough to add fixes for other languages soon or should I go ahead with v0.14.15 with the Catalan/Polish/Turkish/French features we have queued up already?

Omikhleia · 2024-01-25T19:36:20Z

> Any thoughts on whether we'll know enough to add fixes for other languages soon or should I go ahead with v0.14.15 with the Catalan/Polish/Turkish/French features we have queued up already?

It depends on when you want to ship 0.14.15. I'm willing to work on the topic, but I don't think it needs to be rushed: after all, SILE has now been on GitHub for more than 10 years, and no one has come asking for these... So I bet we can take some time to think about how to do it properly. In the same vein, #1242 (deriving from a fix where I needed to deactivate the French unicode segmenter) could possibly be addressed in a nicer way too.

In the meantime, we have a "quick workaround" if anyone urgently needs the repeating dashes in any language. Just insert the ugly hack after your first target language change, and voilà!

\begin[papersize=a6]{document}
\set[parameter=document.parindent, value=1.25em]
\nofolios
\language[main=pt]
% BEGIN DOUBLE HYPHEN WORKAROUND
\lua{
-- Switch to Polish temporarily
-- and assign its node maker to the current main language
local current = SILE.settings:get("document.language")
SILE.settings:temporarily(function ()
  SILE.call("language", { main = "pl" })
  SILE.nodeMakers[current] = SILE.nodeMakers.pl
end)
}% END DOUBLE HYPHEN WORKAROUND

\font[size=16]
\kern[width=210pt] anti-inflamatório

\end{document}

(Checked with Portuguese, Czech and Spanish)

alerque · 2024-01-27T10:20:33Z

> It depends on when you want to ship 0.14.15. I'm willing to work on the topic, but I don't think it needs to be rushed: after all, SILE has now been on GitHub for more than 10 years, and no one has come asking for these.

I have an upcoming publication project that wants to use the alternate Turkish apostrophe handling and it is always much nicer to do production work in a shipped stable version of SILE. At this point the release machinery is working pretty well and it isn't too much hassle to make small patch releases with incremental improvements.

Omikhleia · 2024-02-03T01:31:11Z

Interesting feedback in typst/typst#3235 (comment), adding (Lower) Sorbian and Croatian to the list, and confirming Czech and Slovak.

Sorbian is a minority language (< 50000 people), and it doesn't have a 2-letter language code. Unless I'm mistaken, the 3-letter codes are hsb (Upper Sorbian), dsb (Lower Sorbian) and wen (Sorbian or "Wendish" collectively).

Omikhleia · 2024-02-03T02:13:59Z

Food for thought: I am not sure we should use settings to enable/disable such features, at the cost of checking them many times, when they wouldn't change much normally.

A possible alternative would be to encode this in the BCP47 language name, as an extension.

So far, unless I'm mistaken, BCP47 only has two official extensions, -t- (RFC6497) and -u- (RFC6067), but the "private" -x- extension could be used here for our own purpose.

For instance:

- pl would use the double hyphens by default; pl-x-nohyphens (or any similar pattern, as long as it respects BCP47) would disable them (i.e. use the default unicode node maker rather than its derived version).
- fr would use the French punctuation rules; fr-x-nospacing would disable them (rather than my low-level hack-like pseudo _fr_noSpacingRules from #1233, "feat(packages): URLs are allowed many more breakpoints").

DavidLRowe · 2024-02-03T03:11:58Z

A minor point: I believe that each subtag after the -x- of a private-use tag is limited to 8 characters. So, for example, -x-nohyphens would be too long.
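
For anyone who wants to check candidate names quickly, here is a small Lua sketch based on the RFC 5646 grammar, where the private-use part is `"x" 1*("-" (1*8alphanum))`. The function name `checkPrivateUse` is made up for this example, and it only looks at the private-use part, not at the rest of the language tag.

```lua
-- Check the private-use part of a language tag against RFC 5646:
--   privateuse = "x" 1*("-" (1*8alphanum))
-- Returns true, or false plus the offending subtag.
-- (Hypothetical helper; it does not validate the rest of the tag.)
local function checkPrivateUse (tag)
  local lower = tag:lower()
  -- Everything after the singleton "x" belongs to the private-use part.
  local private = lower:match("^x%-(.*)$") or lower:match("%-x%-(.*)$")
  if not private then
    return true -- no private-use part at all
  end
  -- Walk the hyphen-separated subtags.
  for subtag in (private .. "-"):gmatch("(.-)%-") do
    if not subtag:match("^%w+$") or #subtag > 8 then
      return false, subtag
    end
  end
  return true
end

print(checkPrivateUse("pl-x-nohyphens")) -- "nohyphens" has 9 characters
print(checkPrivateUse("pl-x-nohyphen"))  -- 8 characters
```

By the same count, "nospacing" in fr-x-nospacing is also 9 characters, so both proposed names would need shortening or splitting into several subtags (e.g. something like pl-x-no-hyphens, which is only a suggestion).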

alerque · 2024-02-03T09:00:31Z

This is not a comment on using BCP-47 private extensions because I haven't looked into that...

But yes, @Omikhleia, sometimes the place where we want to use a setting is too hot a loop to actually check it every time, given that settings can change at almost any time. Something we haven't really utilized yet, but could if we need to, is callbacks: there is no reason we can't rig up settings:set() with a callback function that invalidates a cache or private variable used somewhere for efficiency purposes. The end user would not need to be any the wiser. All we would need is a registry to store the callbacks, and they could be registered from almost anywhere. Since they would be Lua functions that act as closures, they could reach into whatever private implementation was used to speed up the hot loops with a cached value, while still allowing it to be changed with a setting.
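
Roughly, the idea could look like the following standalone Lua sketch. Everything here is hypothetical: the `Settings` table, `registerHook`, and the setting name `languages.repeathyphens` are invented for illustration and are not SILE's actual settings API.

```lua
-- Hypothetical stand-in for a settings store (not SILE's real API).
local Settings = {}
Settings.__index = Settings

function Settings.new ()
  return setmetatable({ values = {}, hooks = {} }, Settings)
end

function Settings:get (name)
  return self.values[name]
end

-- Registry of callbacks, run whenever the named setting is set.
function Settings:registerHook (name, callback)
  self.hooks[name] = self.hooks[name] or {}
  table.insert(self.hooks[name], callback)
end

function Settings:set (name, value)
  self.values[name] = value
  for _, callback in ipairs(self.hooks[name] or {}) do
    callback(value)
  end
end

-- A consumer that keeps a private cached copy for its hot loop.
local function makeBreaker (settings)
  local repeatHyphens = settings:get("languages.repeathyphens") -- cached once
  settings:registerHook("languages.repeathyphens", function (value)
    repeatHyphens = value -- the closure refreshes the private cache
  end)
  return function (first, second)
    -- Hot path: reads the upvalue, never queries the settings store.
    return first .. "-", (repeatHyphens and "-" or "") .. second
  end
end

local settings = Settings.new()
settings:set("languages.repeathyphens", false)
local breakCompound = makeBreaker(settings)
print(breakCompound("anti", "inflamatório"))
settings:set("languages.repeathyphens", true)
print(breakCompound("anti", "inflamatório"))
```

The hot function never calls `get()`; it only sees a new value because `set()` ran the registered closure. In this sketch the hooks run in registration order.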

Omikhleia · 2024-02-03T09:35:50Z

@alerque Yes, active hooks on settings are also a possibility I had in mind. I'm always wary of such hooks / callbacks (because ordering is unclear and side-effects are not always intended), but they may have to be considered.

As additional food for thought: I suspect those languages would not repeat the hyphens when breaking URLs (and thus would have to bypass the rule, as the current _fr_noSpacingRules hack does).

jodros · 2024-02-07T05:03:51Z

> case 1 (nothing fancy)
> ...... anti-
> inflamatório
> Or case 2 (repeated hyphens):
> ....... anti-
> -inflamatório

@Omikhleia I've just checked for examples in a reference grammar¹ in the part about hyphens, and indeed all the examples testify in favor of case 2.

¹ Gramática da língua portuguesa padrão by Amini Hauy (Grammar of standard Portuguese)

Omikhleia · 2024-02-07T10:04:11Z

For Basque (which we support, code eu), see this orthotypography manual (p. 53) and this other, more general document (p. 47).

Both seem to contradict the repetition of hyphens (marratxoa) mentioned in ~~LaTeX Babel~~ some discussions (EDIT: Babel does not mention it, my bad; other sources cited above did).

"Lerro-bukaerako marratxoa hitz-elkarketarena izanez gero, ez dago marratxo hori errepikatu beharrik hurrengo lerroaren hasieran." --> Google translated: "If the hyphen at the end of the line belongs to the combination of words, there is no need to repeat that hyphen at the beginning of the next line."

And the second document even illustrates the wrong usage (marked with an asterisk) and the correct one.

--> So no for Basque, in the general case. (I did see various posts on the web from people asking how to do it, but official recommendations seem to disfavor it)

Omikhleia added the bug (Software bug issue) and question (Ask for advice or investigate solutions) labels Jan 21, 2024

Omikhleia mentioned this issue Jan 31, 2024

Incorrect hyphenation in situations including dashes for some languages (e.g. polish) typst/typst#3235

Closed

1 task

Omikhleia changed the title ~~Hyphenation in compounds words in Czech, Portuguese, etc.~~ Double hyphens in compounds words in Czech, Portuguese, etc. Feb 3, 2024

alerque mentioned this issue Feb 3, 2024

Add active hook system to settings #1983

Merged

alerque added this to the v0.14.17 milestone Feb 7, 2024

alerque self-assigned this Feb 7, 2024

This was referenced Feb 7, 2024

Add support for Upper and Lower Sorbian, aka Wendish #1994

Open

Apply double hyphenation rules to more languages #1995

Merged

alerque closed this as completed in #1995 Feb 7, 2024
