Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

what's equals regex for control range in chinese #19

Open
bung87 opened this issue Aug 23, 2018 · 13 comments
Open

what's equals regex for control range in chinese #19

bung87 opened this issue Aug 23, 2018 · 13 comments

Comments

@bung87
Copy link
Contributor

bung87 commented Aug 23, 2018

let re_han = re(r"(*U)([\p{Han}]+)")

I tried re and nre module, this works but very slow,and re is much slower than nre, so I'm looking for this module, cuz Araq told me nre also deprecated.

@nitely
Copy link
Owner

nitely commented Aug 23, 2018

Good question, the nim-regex regex should be r"\p{Han}+", but I think that's not yet available. I mean the \p{...} does not support everything it should support, yet. That said you don't need regex for this, you need https://github.com/nitely/nim-unicodedb, but sadly unicodedb does not support the unicode script data (what tells you if a character is han, greek, etc).

I've open an issue for you nitely/nim-unicodedb#4 , I can probably get it done today or tomorrow 😄

@bung87
Copy link
Contributor Author

bung87 commented Aug 23, 2018

glad to hear that,once it's ready I'll check out the performance,since re split takes 60-70 percent costs time in test case of my package.

@nitely
Copy link
Owner

nitely commented Aug 24, 2018

Well, split may create many small substrings/allocations. Maybe you can get ride of it somehow, by using findBoundaries + toOpenArray for example. Something like:

import re

let reHan = re(r"(*UTF8)\p{Han}+")

proc processSubString(s: openArray[char]) =
  # This allocates, do something that does not allocate :P
  var myss = newString(s.len)
  for i in 0 ..< s.len:
    myss[i] = s[i]
  echo myss

proc main(s: string) =
  var start = 0
  var bounds = (first: 0, last: 0)
  while start < s.len:
    bounds = findBounds(s, reHan, start)
    if bounds.first == -1:
      break
    if bounds.first > 0:
      processSubString(toOpenArray(s, start, bounds.first-1))
    processSubString(toOpenArray(s, bounds.first, bounds.last))
    start = bounds.last + 1
  if start < s.len:
    processSubString(toOpenArray(s, start, s.len-1))

echo "start1"
main("諸夏foo諸夏bar")
echo "end1"
echo "start2"
main("諸夏foo諸夏")
echo "end2"
echo "start3"
main("foo諸夏bar")
echo "end3"
echo "start4"
main("foo諸夏")
echo "end4"
echo "start5"
main("諸夏bar")
echo "end5"

#[
start1
諸夏
foo
諸夏
bar
end1
start2
諸夏
foo
諸夏
end2
start3
foo
諸夏
bar
end3
start4
foo
諸夏
end4
start5
諸夏
bar
end5
]#

However, last time I checked Nim's string modules (strutils, unicode, etc) didn't support openArray[char] and I don't know if it's possible to convert to cstring in case it's needed (probably not, due to the missing null terminator). So, you may not be able to do this.

Next step would be getting ride of regex.

@nitely
Copy link
Owner

nitely commented Aug 24, 2018

I just updated unicodedb. The followind code does the same as the above code:

import unicode
import unicodedb/scripts

proc isHanCheck(r: Rune): bool =
  # fast ascii check followed by unicode check
  result = r.int > 127 and r.unicodeScript() == sptHan

iterator splitHan(s: string): string =
  var
    i = 0
    j = 0
    k = 0
    r: Rune
    isHan = false
    isHanCurr = false
  fastRuneAt(s, i, r, false)
  isHanCurr = r.isHanCheck()
  isHan = isHanCurr
  while i < s.len:
    while isHan == isHanCurr:
      k = i
      if i == s.len:
        break
      fastRuneAt(s, i, r, true)
      isHanCurr = r.isHanCheck()
    yield s[j ..< k]
    j = k
    isHan = isHanCurr

proc main(s: string) =
  for ss in s.splitHan():
    echo ss

main("諸夏foo諸夏bar")
#[
諸夏
foo
諸夏
bar
]#

@nitely
Copy link
Owner

nitely commented Aug 24, 2018

If you just care about the han text, just check isHan in the above code, like this:

# ...

if isHan:
  yield s[j ..< k]

# ...

#[
諸夏
諸夏
]#

Oh, don't forget to compile the code in release mode -d:release to make it fast.

@bung87
Copy link
Contributor Author

bung87 commented Aug 24, 2018

@nitely thank you ! help me a lot ! these codes works fine,#19 (comment) will reduce 1 second, and second version will reduce to 1.5 seconds which faster than python's version (about 2 seconds) in my test case.

@nitely
Copy link
Owner

nitely commented Aug 24, 2018

@bung87 Awesome! I'm glad to hear that! 😸

I'll leave this open so I don't forget to implement the \p{Han} support.

@bung87
Copy link
Contributor Author

bung87 commented Aug 24, 2018

that's even more wonderful, since it's implemented in perl python php ... ,since you described about string copy that results so much performance affected, that I didn't aware of before, my package has a big table

let PROB_EMIT_DATA* = {
  'B': {
    "": -3.6544978750449433,
    "": -8.125041941842026,
   ....

could you give me more advice to improve performance?

@nitely
Copy link
Owner

nitely commented Aug 24, 2018

Sure, I can try 😄 . Nim has proper constants, you would use const instead of let for those. There is no need for all caps variable names. That may improve performance (but maybe not in this case) and gives you deep immutability. let has shallow immutability, since when using refs it allows to mutate whatever the ref is pointing to.

Using strings as Table keys is slow since it needs hashing the string + string comparison. Using characters like 'B' is good since that's just a int8. So, you can improve that by using an enum instead of a string. But if the key is provided by the user, then you won't be able to use enum effectively, in which case is better to use a Rune (as long as keys are single unicode characters) since that's a int32. Like this:

# untested code

import unicode

# Use enums instead of this when keys are not provided by user
proc toRune(s: string): Rune =
  var n = 0
  fastRuneAt(s, n, result, true)
  if n < s.len:
    raise newException(ValueError, "not a single unicode char")

const ProbEmitData* = {
  'B': {
    "".toRune: -3.6544978750449433,
    "".toRune: -8.125041941842026,
   ....

Next step is getting ride of the hash table. You can try a switch statement instead.

proc ProbEmitDataMap(keyA: char, keyB: Rune): float =
  case keyA
  of 'B':
    case keyB
    of "".toRune:  # enum would be better if possible
      -3.6544978750449433  # maybe result = ... is needed here
    of "".toRune:
      -8.125041941842026
    else:
      raise newException(ValueError, "invalid key")
  # ...
  else:
      raise newException(ValueError, "invalid key")

This may get translated into a jump table (in assembly), a bunch of if statements or a perfect hash table (but this is gcc specific I think), depending on the C compiler.

As a more general advice, it's better to think C rather than python performance wise. If you don't know C then learn it. It's a very small language and there are good books showing what good code looks like and explaining why it's good. It's probably not a good idea to write Nim as if it were C, but at least you'll get an idea of what code Nim may be generating.

About string handling, I'm not sure using toOpenArray is such a great idea. At some point Nim will be smart enough to translate mystr[a .. b] into an openArray (i.e: without copying) in most places and to make openArray[char] work seamlessly with string (i.e passing an openarray to a proc expecting a string type). It'll also be able to avoid copying when assigning a string variable to another variable in some cases, thanks to destructors (It may already be doing this in latest Nim devel).

You can ask this kind of thing in IRC and the forum as well, as there are more knowledgeable people than me in there.

@nitely
Copy link
Owner

nitely commented Aug 24, 2018

Another thing that I just thought is using a flat hash table, like this:

const ProbEmitData* = {
  [Rune('B'.ord), "".toRune]: -3.6544978750449433,
  [Rune('B'.ord), "".toRune]: -8.125041941842026,
  ....

Depending on the table length, this may take a lot more memory, though. Also, there may be better data structures than this, like an "static trie" (not the one with nodes) or another kind of state machine, but it depends on the data. I can't tell for sure without the whole ProbEmitData thing.

@bung87
Copy link
Contributor Author

bung87 commented Aug 24, 2018

the table generated by gen_prob_json.py the original data came from a python package ,you can see the results data here:
https://github.com/bung87/finalseg/blob/master/src/finalseg/prob_emit.nim
and there's also a cpp version, the author combinated most state each line,
https://github.com/bung87/cppjieba/blob/7b2fdc41a235f332977ee2ca8c43715e7dc145e0/dict/pos_dict/prob_emit.utf8
I think at first focus on string things, since most improvement done related to it, I also test checked the loop without perform state probililty calculating, it seems wrap more loop just a little affected.

and now I reduced time costs to 1.2 think may make it down to 1 second.(but will throw index error on debug build)

you mentioned about const, that I tried but did not figure out , since I trans it to table with newTable or initTable that will build fails. I checked the tables module it will using hash module's index convert the string key to int, yeah ,make it as const that can reduce runtime index calculating.

@nitely
Copy link
Owner

nitely commented Aug 24, 2018

is prob_emit.nim complete? why have a single 'B' key? you can just remove that and do a check it before accessing the table.

@bung87
Copy link
Contributor Author

bung87 commented Aug 24, 2018

    "": -10.61937952828986
  }.newTable
}.newTable

it should ends like this, 35235 lines. hmm I checked generated json file, it only has 'B' key indeed. that I did not noticed. seems other state just depends on 'B' state, turns out there's only .toRunes and .runes,when using const and toTable program will get stuck, that's could be a bug, leave this at this stage since I got confidence about current performance,takes your much time . thanks again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants