
Update Character.UnicodeBlock and add better tests #1887

Draft · wants to merge 1 commit into main from the unicodeblock branch
Conversation


@er1c er1c commented Sep 1, 2020

TODO:

Findings

All of the JDK 8 constants are also defined among the JDK 14 constants (i.e., JDK 14 is a superset).

@er1c er1c force-pushed the unicodeblock branch 2 times, most recently from d2a4eb7 to 6672521 on September 1, 2020 19:40
Member

@ekrich ekrich left a comment

We are currently tracking Unicode for Java 8; we used Unicode 6.3.0. This version was chosen because it was a minor update from JDK 8's 6.2.0, whereas 7.0.0 added many new codes.

If you would like to avoid formatting with scripts/scalafmt, you could probably use its comment directive for ignoring a section.
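For reference, a minimal sketch of the scalafmt ignore directive around the generated section (the surrounding declaration is copied from the diff; exact placement is up to you):

// format: off
val LATIN_EXTENDED_B = new UnicodeBlock("LATIN_EXTENDED_B", 0x180, 0x24f)
// ... generated UnicodeBlock declarations ...
// format: on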

@er1c
Author

er1c commented Sep 1, 2020

@ekrich thanks for the initial feedback. I'm updating the PR and will see what breaks, but I'm happy to hopefully get it resolved. I'm duplicating the work in Scala.js, so I figured I'd throw it back here.

I'll see if I can get a deprecation list from the Java 8 Unicode version and make sure those entries are included.

@ekrich
Member

ekrich commented Sep 1, 2020

Which version of the JVM is Scala.js tracking? I created a project here, https://github.com/ekrich/scala-unicode, but I like your script approach. Initially I was thinking that we could put all the scripts in one place, and maybe put it under the scala-native organization rather than my personal GitHub.

@LeeTibbert
Contributor

I seem to remember that, to be bug-for-bug compatible with Java 8, there were a few codes in Java 8 which differed from its reference Unicode version. I may have put some tests under the re2 directory to explicitly test for those, or I may have left comments in the re2 UnicodeMumble.scala files.

@er1c Now, what was it that was written over the gate leading into the Inferno?

@er1c er1c marked this pull request as draft September 2, 2020 15:17
@er1c
Author

er1c commented Sep 2, 2020

@LeeTibbert I didn't remove any tests, so if there were any edge cases previously documented, they should be found.

@LeeTibbert
Contributor

@er1c Thank you for starting this conversation.

I mean to be supportive of your energy & efforts. I hope my investment of time & plain speaking ends up that way; if not, please disregard.

Let me start at the highest level, the level of strategy. While it should not get in the way of the current PR, the bird in the hand, I think that SN should be actively considering a long-term strategy of little/big or fast/comprehensive, where the SN core supplies only a minimal implementation of certain components and relies upon currently hypothetical third-party libraries, written by domain experts, to supply complete implementations. java.time is one of those components; I think Unicode and Locale are two more.

scala-java-time is a good candidate for java.time. It compiles & passes its Demo project on SN. I am currently diving into which SN changes are needed to support the extensive scala-java-time test ensemble.

I believe that Eric P is a contributor to scala-java-locale (sp?).

Third-party libraries are likely to have more domain expertise and a quicker turnaround. Unicode is now at something like version 13, and here we are discussing Unicode 6.3. A third-party library could be more agile.

As we probably all agree, the Unicode and Locale issues are Large Problems, and it will take a Beast to address them comprehensively and with rapid turnaround. I believe that SN will ultimately need third-party libraries which call into libicu (International Components for Unicode), which is such a Beast. The strength of SN is interoperability, and libicu simply has more people working on it than the SN or even the entire Scala community can muster.

@LeeTibbert
Contributor

So, let me now focus on the PR at hand. I have a few high level concerns. If those bear out, I will do another pass at line-by-line.

Again, if the comment is useful, great. If it is wrong or just not useful, no harm done.

@@ -0,0 +1,415 @@
// Run with $ amm scripts/GenerateUnicodeBlock.sc
Contributor

It is good to get the generating script into Git. Please consider giving a few more clues to the poor suffering soul who comes after you and tries to run this.

For example, what is amm? Yes, I know it is Ammonite, but clues such as a URL to Ammonite and Ammonite scripts would, IMO, improve the software engineering.

Contributor

I believe the software engineering would also be improved if there were a short, one-paragraph description of how this script fits into the general build process: that it is not automagically executed by build.sbt, but rather is run manually, with the output manually edited into j.l.Character.scala.

That is, could a bright middle-school pupil, or the less capable engineer-in-a-hurry of English common law, follow in your footsteps?
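Something along these lines at the top of the script would cover both points (the wording is only a suggestion; the paths and markers are taken from the diff):

// Generates the java.lang.Character.UnicodeBlock declarations.
// Requires Ammonite (https://ammonite.io). Run manually with:
//   $ amm scripts/GenerateUnicodeBlock.sc
// The output is not wired into build.sbt; paste it by hand into
// javalanglib/src/main/scala/java/lang/Character.scala between the
// "Begin Generated" and "End Generated" markers.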

@@ -0,0 +1,415 @@
// Run with $ amm scripts/GenerateUnicodeBlock.sc
// Replace code in ./javalanglib/src/main/scala/java/lang/Character.scala
Contributor

Please see my comments in Character.scala, which I will review next. IMO, the software engineering would be improved if this script printed out some (actually, quite a bit of) audit-trail information: who ran it, when, and which version of Unicode was used as the input files.
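A minimal sketch of that audit trail, assuming the script emits Scala source via println (the Unicode version value and where it comes from are assumptions, not something the PR currently records):

// Emit an audit trail at the top of the generated output.
val unicodeVersion = "6.3.0" // assumed: whichever Blocks.txt revision was downloaded
val generatedBy = System.getProperty("user.name")
println("// Generated by scripts/GenerateUnicodeBlock.sc")
println(s"// Unicode version: $unicodeVersion")
println(s"// Generated by $generatedBy at ${java.time.Instant.now()}")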

val LATIN_EXTENDED_B = new UnicodeBlock("LATIN_EXTENDED_B", 0x180, 0x24f)
val IPA_EXTENSIONS = new UnicodeBlock("IPA_EXTENSIONS", 0x250, 0x2af)
//////////////////
// Begin Generated
Contributor

I believe the software engineering would be improved by at least giving the reference Unicode version, the name and Git repository location of the script used to generate the output, and the fact that it was manually folded into the current file (not via build.sbt).

I do not see the usual "Do not edit" marker.

I also like to see who generated the output and the date. Some would say that this data can be more reliably figured out from git blame, but that is a recursive distraction from trying to read the code as-is, on the fly.

//////////////////

// scalastyle:off line.size.limit
val SURROGATES_AREA = new UnicodeBlock("SURROGATES_AREA", 0x0, 0x0)
Contributor

I think this needs a close look from an SN compiler expert.

I believe that the next whack of UnicodeBlock declarations should be "lazy val". Of course, that means that any tables using them should be carefully constructed to be lazy on every path except a small set of exceedingly common UnicodeBlocks (such as English only). That is a fast-and-small chained to slow-and-complete design.

My concern is the latency/startup time when the binary begins to execute. Since these vals are in an object within an object, I believe they will currently all be allocated and initialized at binary startup. That is quite a few allocations for data that will almost certainly never be used.

I am pretty certain that changing to "lazy" will reduce startup time & heap requirements. I am not certain whether it will slightly increase code size.

I am aware that the code being replaced used "vals" rather than "lazy vals", but I am reluctant to see hundreds more items slowing down binary startup.
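A minimal sketch of the split, reusing the constructor shape from the diff (which blocks count as "common" is a judgment call, and the cost of lazy vals on SN is exactly what needs expert review):

// Common blocks stay eager; rarely used blocks become lazy so they are
// allocated on first access rather than at binary startup.
val BASIC_LATIN = new UnicodeBlock("BASIC_LATIN", 0x0, 0x7f)
lazy val SUPPLEMENTARY_PRIVATE_USE_AREA_B =
  new UnicodeBlock("SUPPLEMENTARY_PRIVATE_USE_AREA_B", 0x100000, 0x10ffff)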

Contributor

I seem to remember being advised when I was doing Unicode work for the re2 regex code to use one dimensional arrays (of structures). And to be careful to make
maps lazy.

I think it prudent to ask what the recommendation from current SN leadership is.

Contributor

And yes, the re2s and Character.scala Unicode data should not be duplicated. I think there is at least one pending PR to reduce, not eliminate, some of that duplication. IIRC, the idea was to let that PR merge & settle, then do additional passes. I seem to remember that total elimination was not feasible, but that a lot of error-prone duplication could be eliminated.

I believe that known flaw should not keep progress from happening here. I am also open to & welcome better suggestions. Sometimes one simply runs out of time before the next distraction/crisis hits, and the technical debt remains in the code base for far too long.


private val BLOCKS = Array(
private val allBlocks: Array[UnicodeBlock] = Array(
Contributor

Now, this is the type of thing that needs to be lazy, if possible. I did not trace how it gets used, but this table will, I believe, cause all of the UnicodeBlocks it references to be instantiated.

Contributor

This may need to be a method which returns the "known & fast" UnicodeBlocks first, then iterates over a lazy array of the "complete & slow" UnicodeBlocks; a sketch follows below.
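A hedged sketch of that shape: allBlocks is the PR's field, while commonBlocks, blockOf, and the contains(codePoint) range check are hypothetical names used only for illustration:

// Check a small eager array of common blocks first; only force the
// lazily built full table when the fast path misses.
private val commonBlocks: Array[UnicodeBlock] =
  Array(BASIC_LATIN, LATIN_1_SUPPLEMENT)
private lazy val allBlocks: Array[UnicodeBlock] =
  Array(BASIC_LATIN, LATIN_1_SUPPLEMENT /* ... full generated list ... */)

private def blockOf(codePoint: Int): UnicodeBlock =
  commonBlocks
    .find(_.contains(codePoint)) // contains: hypothetical range-check helper
    .orElse(allBlocks.find(_.contains(codePoint)))
    .orNull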

SUPPLEMENTARY_PRIVATE_USE_AREA_B)
import scala.Predef.ArrowAssoc
private val blocksByNormalizedName
: scala.collection.Map[String, UnicodeBlock] = scala.collection.Map(
Contributor

Again, the baseline code uses a Scala map. There was an effort in Scala.js to reduce/eliminate the use of scala.collection, at least in the javalib, in favor of Java collections (which are a Royal Pain!).

I think there has been no organized effort to do that yet in SN. Perhaps changing this can be delayed, given the prior use in the baseline code. I am not the one to make that call, but I thought I would point the effort out.
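For illustration only, the same table backed by a java.util collection (the field and helper names mirror the PR; the assumption that a block's toString returns its name follows the JDK, not anything in this diff):

// Same lookup table, but backed by java.util.HashMap instead of
// scala.collection.Map.
private lazy val blocksByNormalizedName: java.util.HashMap[String, UnicodeBlock] = {
  val m = new java.util.HashMap[String, UnicodeBlock]()
  allBlocks.foreach(b => m.put(toNormalizedName(b.toString), b))
  m
}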

val SUPPLEMENTARY_PRIVATE_USE_AREA_B =
new UnicodeBlock("SUPPLEMENTARY_PRIVATE_USE_AREA_B", 0x100000, 0x10ffff)
new UnicodeBlock("SUPPLEMENTARY_PRIVATE_USE_AREA_B", 0x100000, 0x10FFFF)
Contributor

This seems to be the last generated UnicodeBlock. With generated code, my very early computer-science training says to print out the number of generated entries. That makes it easy to compare the count in the generated output against the count in the Unicode spec reference files. I seem to remember that either they give the expected count, or it was very easy to get (unix wc -l minus a few).
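For example, one extra line in the generator would do (blocks here stands for whatever collection the script builds, an assumption on my part):

// Sanity check: compare against the entry count in the Unicode
// reference file (roughly `wc -l Blocks.txt` minus header lines).
println(s"// Generated ${blocks.size} UnicodeBlock entries")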

// For example, "Latin Extended-A" and "latin extended a" are equivalent.
// For more information on the comparison of property values,
// see UAX #44: http://www.unicode.org/reports/tr44/
private def toNormalizedName(name: String): String = {
Contributor

Thank you for the comment: just-in-time delivery of hard-to-find, pertinent wizardry. It allows me to ask my next question, just below.

// For more information on the comparison of property values,
// see UAX #44: http://www.unicode.org/reports/tr44/
private def toNormalizedName(name: String): String = {
name.toLowerCase.replaceAll(raw"[\s\-_]", "")
Contributor

I believe that the "\s" is incorrect/incomplete here. That is, I am pretty sure that the text in the comment is referring to Unicode whitespace (category Zs), while "\s" refers to the much smaller set of Java whitespace.

As a thought experiment, imagine some person set out to break/test the code by giving an input string which contains an Ogham (Old Irish) space mark (one of my favorite test cases). I submit that, according to the comment, the Ogham character should be removed, and that the presented code will not do that.

We are in the land of regex here. The known & tedious way is to loop through each of the input characters and retain a character only if it is not in the Unicode Zs table (in scalanative.re2s). Since replaceAll brings us well into the land of re2s, I could look to see whether re2s has a way to specify and replace Unicode whitespace.
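A hedged sketch of one tightening, assuming java.util.regex-style \p{Zs} syntax is honored by the underlying engine (whether SN's re2s accepts it is exactly the open question; this also does replaceAll before toLowerCase, which touches on the next comment):

// Strip Unicode space separators (Zs) as well as ASCII whitespace,
// hyphens, and underscores, then lowercase.
private def toNormalizedName(name: String): String =
  name.replaceAll("[\\p{Zs}\\s\\-_]+", "").toLowerCase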

Contributor

At the nano level, I think doing the replaceAll first, then the toLowerCase, has lower amortized cost over the expected set of inputs. The cost for text not containing Unicode whitespace, etc. is probably the same (but one always has to look at implementations). Most names will contain whitespace, etc., so replaceAll first will be a few cycles faster. Too many hours and exams in Algorithms classes.

}

////////////////
// End Generated
Contributor

Having an end-of-generated marker is good, but generated by which script, and on what date? Where there is one generator, there may well be more than one.

// Begin Generated
//////////////////

// scalastyle:off line.size.limit
Contributor

Does the scalafmt used in Scala Native respect scalastyle:off (and the other directives)? I suspect these are holdovers from the Scala.js port. Perhaps they were left in the baseline code to reduce churn, but they probably should not be in new SN code (especially generated code).

What does scalafmt do to these large generated tables? I would expect it to mangle them, but the code in this PR looks OK.

@LeeTibbert
Contributor

@er1c re: "I didn't remove any tests, so if there were any edge cases previously documented, they should be found."

Now that I have reviewed the code on a full screen and not on my phone, I see that you are mainly declaring UnicodeBlocks. I believe that the Java 8 quirks/idiosyncratic implementations I was concerned about should not be touched/triggered by the code of this PR. Any tests I wrote would be under re2s. Pending the re2s tests blowing up, I think you can check this box off as completed/reviewed/signed off. Thank you for considering it; this stuff is a well-tended, biologically diverse swamp.

.fold {
throw new IllegalArgumentException()
} { value => value }
blocksByNormalizedName
Contributor

Why remove a perfectly good & useful null check? There is the off chance that such was recommended by someone in the know at Scala.js.

If not, my long training in expecting explicit argument checks for anything and everything passed in from application code leads me to miss the null check here. True, some code some-J-Random-where, perhaps five layers down, will throw the exception at some hard-to-find and hard-to-debug place. With line numbers in the NIR for SN in the offing, there is a chance to get an explicit line number at an understandable place early in execution, when there is less shrapnel.

Do I sound opinionated here? I was merely trying for experienced. I am also aware that I am a contributor, and that more senior reviewers set the standard & practice.
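For what it's worth, a sketch of keeping the explicit check in a forName-style lookup (the method name mirrors the JDK API; the body uses getOrElse rather than the fold in the diff, purely for illustration):

// Fail fast, at an understandable place, when application code passes null.
def forName(blockName: String): UnicodeBlock = {
  if (blockName == null)
    throw new NullPointerException("blockName")
  blocksByNormalizedName
    .get(toNormalizedName(blockName))
    .getOrElse(throw new IllegalArgumentException(blockName))
}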
