
Update Character.UnicodeBlock and add better tests #1887

Draft · wants to merge 1 commit into main from the unicodeblock branch
Conversation


@er1c er1c commented Sep 1, 2020

TODO:

Findings

All of the JDK 8 constants are also defined among the JDK 14 constants (i.e., JDK 14 is a superset).

@er1c er1c force-pushed the unicodeblock branch 2 times, most recently from d2a4eb7 to 6672521 on September 1, 2020 19:40
Member

@ekrich ekrich left a comment

We are currently tracking Unicode for Java 8; we used Unicode 6.3.0. This version was chosen because it was a minor update from JDK 8's 6.2.0, whereas 7.0.0 added many new codes.

If you would like to avoid formatting with scripts/scalafmt, you could probably use its comment directive for ignoring a section.
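For reference, a minimal sketch of the scalafmt ignore directive around the generated section (the surrounding declaration is copied from the diff; exact placement is up to you):

// format: off
val LATIN_EXTENDED_B = new UnicodeBlock("LATIN_EXTENDED_B", 0x180, 0x24f)
// ... generated UnicodeBlock declarations ...
// format: on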

@er1c
Author

er1c commented Sep 1, 2020

@ekrich thanks for the initial feedback. I'm updating the PR and will see what breaks, but I'm happy to hopefully get it resolved. I'm duplicating the work in Scala.js, so I figured I'd throw it back here.

I'll see if I can get a deprecation list from the Java 8 Unicode version and make sure those entries are included.

@ekrich
Member

ekrich commented Sep 1, 2020

Which version of the JVM is Scala.js tracking? I created a project here, https://github.com/ekrich/scala-unicode, but I like your script approach. Initially I was thinking that we could put all the scripts in one place, and maybe put it under the scala-native organization rather than my personal GitHub.

@LeeTibbert
Contributor

I seem to remember that, to be bug-for-bug compatible with Java 8, there were a few codes in Java 8 which differed from its reference Unicode version. I may have put some tests under the re2 directory to explicitly test for those, or I may have left comments in the re2 UnicodeMumble.scala files.

@er1c Now, what was it that was written over the gate leading into the Inferno?

@er1c er1c marked this pull request as draft September 2, 2020 15:17
@er1c
Author

er1c commented Sep 2, 2020

@LeeTibbert I didn't remove any tests, so if there were any edge cases previously documented, they should be found.

@LeeTibbert
Contributor

@er1c Thank you for starting this conversation.

I mean to be supportive of your energy & efforts. I hope my investment of time & plain speaking ends up that way; if not, please disregard.

Let me start at the highest level, the level of strategy. While it should not get in the way of the current PR, the bird in the hand, I think that SN should be actively considering a long-term strategy of little/big or fast/comprehensive, where the SN core supplies only a minimal implementation of certain components and relies upon currently hypothetical third-party libraries, written by domain experts, to supply complete implementations. java.time is one of those components; I think Unicode and Locale are two more.

scala-java-time is a good candidate for java.time. It compiles & passes its Demo project on SN. I am currently diving into which SN changes are needed to support the extensive scala-java-time test ensemble.

I believe that Eric P is a contributor to scala-java-locale (sp?).

Third-party libraries are likely to have more domain expertise and a quicker turnaround. Unicode is now at something like version 13, and here we are discussing Unicode 6.3. A third-party library could be more agile.

As we probably all agree, the Unicode and Locale issues are Large Problems, and it will take a Beast to address them comprehensively and with rapid turnaround. I believe that SN will ultimately need third-party libraries which call into libicu (International Components for Unicode), which is such a Beast. The strength of SN is interoperability, and libicu simply has more people working on it than the SN or even the entire Scala community can muster.

@LeeTibbert
Contributor

So, let me now focus on the PR at hand. I have a few high level concerns. If those bear out, I will do another pass at line-by-line.

Again, if the comment is useful, great. If it is wrong or just not useful, no harm done.

@@ -0,0 +1,415 @@
// Run with $ amm scripts/GenerateUnicodeBlock.sc
Contributor

It is good to get the generating script into Git. Please consider giving a few more clues to the poor suffering soul who comes after you and tries to run this.

For example, what is amm? Yes, I know it is Ammonite, but clues such as a URL to Ammonite and Ammonite scripts would, IMO, improve the software engineering.

Contributor

I believe the software engineering would also be improved if there were a short, one-paragraph description of how this script fits into the general build process: that it is not automagically executed by build.sbt, but rather is run manually, with the output manually edited into j.l.Character.scala.

That is, could a bright middle-school pupil, or the less capable engineer-in-a-hurry of English common law, follow in your footsteps?
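Something along these lines at the top of the script would cover both points (the wording is only a suggestion; the paths and markers are taken from the diff):

// Generates the java.lang.Character.UnicodeBlock declarations.
// Requires Ammonite (https://ammonite.io). Run manually with:
//   $ amm scripts/GenerateUnicodeBlock.sc
// The output is not wired into build.sbt; paste it by hand into
// javalanglib/src/main/scala/java/lang/Character.scala between the
// "Begin Generated" and "End Generated" markers.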

@@ -0,0 +1,415 @@
// Run with $ amm scripts/GenerateUnicodeBlock.sc
// Replace code in ./javalanglib/src/main/scala/java/lang/Character.scala
Contributor

Please see my comments in Character.scala, which I will review next. IMO, the software engineering would be improved if this script printed out some (actually, quite a bit of) audit-trail information: who ran it, when, and which version of Unicode was used as the input files.
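A minimal sketch of that audit trail, assuming the script emits Scala source via println (the Unicode version value and where it comes from are assumptions, not something the PR currently records):

// Emit an audit trail at the top of the generated output.
val unicodeVersion = "6.3.0" // assumed: whichever Blocks.txt revision was downloaded
val generatedBy = System.getProperty("user.name")
println("// Generated by scripts/GenerateUnicodeBlock.sc")
println(s"// Unicode version: $unicodeVersion")
println(s"// Generated by $generatedBy at ${java.time.Instant.now()}")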

val LATIN_EXTENDED_B = new UnicodeBlock("LATIN_EXTENDED_B", 0x180, 0x24f)
val IPA_EXTENSIONS = new UnicodeBlock("IPA_EXTENSIONS", 0x250, 0x2af)
//////////////////
// Begin Generated
Contributor

I believe the software engineering would be improved by at least giving the reference Unicode version, the name and Git repository location of the script used to generate the output, and the fact that it was manually folded into the current file (not via build.sbt).

I do not see the usual "Do not edit" marker.

I also like to see who generated the output and the date. Some would say that this data can be more reliably figured out from git blame, but that is a recursive distraction from trying to read the code as-is, on the fly.

//////////////////

// scalastyle:off line.size.limit
val SURROGATES_AREA = new UnicodeBlock("SURROGATES_AREA", 0x0, 0x0)
Contributor

I think this needs a close look from an SN compiler expert.

I believe that the next whack of UnicodeBlock declarations should be "lazy val". Of course, that means that any tables using them should be carefully constructed to be lazy on every path except a small set of exceedingly common UnicodeBlocks (such as English only). That is a fast-and-small chained to slow-and-complete design.

My concern is the latency/startup time when the binary begins to execute. Since these vals are in an object within an object, I believe they will currently all be allocated and initialized at binary startup. That is quite a few allocations for data that will almost certainly never be used.

I am pretty certain that changing to "lazy" will reduce startup time & heap requirements. I am not certain whether it will slightly increase code size.

I am aware that the code being replaced used "vals" rather than "lazy vals", but I am reluctant to see hundreds more items slowing down binary startup.
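A minimal sketch of the split, reusing the constructor shape from the diff (which blocks count as "common" is a judgment call, and the cost of lazy vals on SN is exactly what needs expert review):

// Common blocks stay eager; rarely used blocks become lazy so they are
// allocated on first access rather than at binary startup.
val BASIC_LATIN = new UnicodeBlock("BASIC_LATIN", 0x0, 0x7f)
lazy val SUPPLEMENTARY_PRIVATE_USE_AREA_B =
  new UnicodeBlock("SUPPLEMENTARY_PRIVATE_USE_AREA_B", 0x100000, 0x10ffff)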

Contributor

I seem to remember being advised when I was doing Unicode work for the re2 regex code to use one dimensional arrays (of structures). And to be careful to make
maps lazy.

I think it prudent to ask what the recommendation from current SN leadership is.

Contributor

And yes, the re2s and Character.scala Unicode data should not be duplicated. I think there is at least one pending PR to reduce, not eliminate, some of that duplication. IIRC, the idea was to let that PR merge & settle, then do additional passes. I seem to remember that total elimination was not feasible, but that a lot of error-prone duplication could be eliminated.

I believe that known flaw should not keep progress from happening here. I am also open to & welcome better suggestions. Sometimes one simply runs out of time before the next distraction/crisis hits, and the technical debt remains in the code base for far too long.


private val BLOCKS = Array(
private val allBlocks: Array[UnicodeBlock] = Array(
Contributor

Now, this is the type of thing that needs to be lazy, if possible. I did not trace how it gets used, but this table will, I believe, cause all of the UnicodeBlocks it references to be instantiated.

Contributor

This may need to be a method which returns the "known & fast" UnicodeBlocks first, then iterates over a lazy array of the "complete & slow" UnicodeBlocks; a sketch follows below.
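A hedged sketch of that shape: allBlocks is the PR's field, while commonBlocks, blockOf, and the contains(codePoint) range check are hypothetical names used only for illustration:

// Check a small eager array of common blocks first; only force the
// lazily built full table when the fast path misses.
private val commonBlocks: Array[UnicodeBlock] =
  Array(BASIC_LATIN, LATIN_1_SUPPLEMENT)
private lazy val allBlocks: Array[UnicodeBlock] =
  Array(BASIC_LATIN, LATIN_1_SUPPLEMENT /* ... full generated list ... */)

private def blockOf(codePoint: Int): UnicodeBlock =
  commonBlocks
    .find(_.contains(codePoint)) // contains: hypothetical range-check helper
    .orElse(allBlocks.find(_.contains(codePoint)))
    .orNull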

SUPPLEMENTARY_PRIVATE_USE_AREA_B)
import scala.Predef.ArrowAssoc
private val blocksByNormalizedName
: scala.collection.Map[String, UnicodeBlock] = scala.collection.Map(
Contributor

Again, the baseline code uses a Scala map. There was an effort in Scala.js to reduce/eliminate the use of scala.collection, at least in the javalib, in favor of Java collections (which are a Royal Pain!).

I think there has been no organized effort to do that yet in SN. Perhaps changing this can be delayed, given the prior use in the baseline code. I am not the one to make that call, but I thought I would point the effort out.
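For illustration only, the same table backed by a java.util collection (the field and helper names mirror the PR; the assumption that a block's toString returns its name follows the JDK, not anything in this diff):

// Same lookup table, but backed by java.util.HashMap instead of
// scala.collection.Map.
private lazy val blocksByNormalizedName: java.util.HashMap[String, UnicodeBlock] = {
  val m = new java.util.HashMap[String, UnicodeBlock]()
  allBlocks.foreach(b => m.put(toNormalizedName(b.toString), b))
  m
}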

val SUPPLEMENTARY_PRIVATE_USE_AREA_B =
new UnicodeBlock("SUPPLEMENTARY_PRIVATE_USE_AREA_B", 0x100000, 0x10ffff)
new UnicodeBlock("SUPPLEMENTARY_PRIVATE_USE_AREA_B", 0x100000, 0x10FFFF)
Contributor

This seems to be the last generated UnicodeBlock. With generated code, my very early computer-science training says to print out the number of generated entries. That makes it easy to compare the count in the generated output against the count in the Unicode spec reference files. I seem to remember that either they give the expected count, or it was very easy to get (unix wc -l minus a few).
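For example, one extra line in the generator would do (blocks here stands for whatever collection the script builds, an assumption on my part):

// Sanity check: compare against the entry count in the Unicode
// reference file (roughly `wc -l Blocks.txt` minus header lines).
println(s"// Generated ${blocks.size} UnicodeBlock entries")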

// For example, "Latin Extended-A" and "latin extended a" are equivalent.
// For more information on the comparison of property values,
// see UAX #44: http://www.unicode.org/reports/tr44/
private def toNormalizedName(name: String): String = {
Contributor

Thank you for the comment: just-in-time delivery of hard-to-find, pertinent wizardry. It allows me to ask my next question, just below.

// For more information on the comparison of property values,
// see UAX #44: http://www.unicode.org/reports/tr44/
private def toNormalizedName(name: String): String = {
name.toLowerCase.replaceAll(raw"[\s\-_]", "")
Contributor

I believe that the "\s" is incorrect/incomplete here. That is, I am pretty sure that the text in the comment is referring to Unicode whitespace (category Zs), while "\s" refers to the much smaller set of Java whitespace.

As a thought experiment, imagine some person set out to break/test the code by giving an input string which contains an Ogham (Old Irish) space mark (one of my favorite test cases). I submit that, according to the comment, the Ogham character should be removed, and that the presented code will not do that.

We are in the land of regex here. The known & tedious way is to loop through each of the input characters and retain a character only if it is not in the Unicode Zs table (in scalanative.re2s). Since replaceAll brings us well into the land of re2s, I could look to see whether re2s has a way to specify and replace Unicode whitespace.
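A hedged sketch of one tightening, assuming java.util.regex-style \p{Zs} syntax is honored by the underlying engine (whether SN's re2s accepts it is exactly the open question; this also does replaceAll before toLowerCase, which touches on the next comment):

// Strip Unicode space separators (Zs) as well as ASCII whitespace,
// hyphens, and underscores, then lowercase.
private def toNormalizedName(name: String): String =
  name.replaceAll("[\\p{Zs}\\s\\-_]+", "").toLowerCase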

Contributor

At the nano level, I think doing the replaceAll first, then the toLowerCase, has lower amortized cost over the expected set of inputs. The cost for text not containing Unicode whitespace, etc. is probably the same (but one always has to look at implementations). Most names will contain whitespace, etc., so replaceAll first will be a few cycles faster. Too many hours and exams in Algorithms classes.

}

////////////////
// End Generated
Contributor

Having an end-of-generated marker is good, but generated by which script, and on what date? Where there is one generator, there may well be more than one.

// Begin Generated
//////////////////

// scalastyle:off line.size.limit
Contributor

Does the scalafmt used in Scala Native respect scalastyle:off (and the other directives)? I suspect these are holdovers from the Scala.js port. Perhaps they were left in the baseline code to reduce churn, but they probably should not be in new SN code (especially generated code).

What does scalafmt do to these large generated tables? I would expect it to mangle them, but the code in this PR looks OK.

@LeeTibbert
Contributor

@er1c re: "I didn't remove any tests, so if there were any edge cases previously documented, they should be found."

Now that I have reviewed the code on a full screen and not on my phone, I see that you are mainly declaring UnicodeBlocks. I believe that the Java 8 quirks/idiosyncratic implementations I was concerned about should not be touched/triggered by the code of this PR. Any tests I wrote would be under re2s. Pending the re2s tests blowing up, I think you can check this box off as completed/reviewed/signed off. Thank you for considering it; this stuff is a well-tended, biologically diverse swamp.

.fold {
throw new IllegalArgumentException()
} { value => value }
blocksByNormalizedName
Contributor

Why remove a perfectly good & useful null check? There is the off chance that such was recommended by someone in the know at Scala.js.

If not, my long training in expecting explicit argument checks for anything and everything passed in from application code leads me to miss the null check here. True, some code some-J-Random-where, perhaps five layers down, will throw the exception at some hard-to-find and hard-to-debug place. With line numbers in the NIR for SN in the offing, there is a chance to get an explicit line number at an understandable place early in execution, when there is less shrapnel.

Do I sound opinionated here? I was merely trying for experienced. I am also aware that I am a contributor, and that more senior reviewers set the standard & practice.
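For what it's worth, a sketch of keeping the explicit check in a forName-style lookup (the method name mirrors the JDK API; the body uses getOrElse rather than the fold in the diff, purely for illustration):

// Fail fast, at an understandable place, when application code passes null.
def forName(blockName: String): UnicodeBlock = {
  if (blockName == null)
    throw new NullPointerException("blockName")
  blocksByNormalizedName
    .get(toNormalizedName(blockName))
    .getOrElse(throw new IllegalArgumentException(blockName))
}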
