Faster base conversion #580

Open: wants to merge 1 commit into develop
Conversation

@czurnieden

This PR includes both variations of the Schönhage trick: the standard way, and the method proposed by Lars Helmström to compute the reciprocals.

The default is the normal way. Switch to the second method, which uses a round of Newton-Raphson (the N-R method), with

make "CFLAGS=-DMP_TO_RADIX_USE_NEWTON_RAPHSON " test

I found no significant difference in speed, but YMMV, as always.
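
For orientation, here is a minimal sketch of the divide-and-conquer ("Schönhage") idea. The helper names and the string-length handling are hypothetical; the PR's actual implementation lives in s_mp_faster_to_radix.c and handles far more edge cases. The N-R variant refines an approximate reciprocal x of the split divisor d via x -> x*(2 - d*x) and divides Barrett-style instead of calling the full division each time.

#include "tommath.h"

/* Sketch only: powers[i] is assumed to hold base^(2^i). */
static mp_err s_to_radix_sketch(const mp_int *a, const mp_int *powers, int k,
                                char *out, size_t len, int radix)
{
   mp_int q, r;
   mp_err err;

   if (k == 0) {
      /* Small enough: fall back to the classic O(n^2) conversion. */
      return mp_to_radix(a, out, len, NULL, radix);
   }
   if ((err = mp_init_multi(&q, &r, NULL)) != MP_OKAY) {
      return err;
   }
   /* a = q * base^(2^(k-1)) + r: one big division replaces 2^(k-1)
    * single-digit divisions. */
   if ((err = mp_div(a, &powers[k - 1], &q, &r)) == MP_OKAY) {
      size_t half = len / 2;
      /* High digits first; real code must left-pad the low half with
       * zero digits so the two halves line up. */
      if ((err = s_to_radix_sketch(&q, powers, k - 1, out, half, radix)) == MP_OKAY) {
         err = s_to_radix_sketch(&r, powers, k - 1, out + half, len - half, radix);
      }
   }
   mp_clear_multi(&q, &r, NULL);
   return err;
}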

The cutoffs are all regulated by MP_RADIX_BARRETT_START_MULTIPLICATOR; there is no finer resolution for now. The default is MP_RADIX_BARRETT_START_MULTIPLICATOR=10. There is not much difference, but there is some:

MP_RADIX_BARRETT_START_MULTIPLICATOR = 1
Bits 1079 Base 10 SLOW: 0.0099020000, FAST: 0.0105440000
Bits 1079 Base 64 SLOW: 0.0011470000, FAST: 0.0046120000
Bits 19078 Base 10 SLOW: 2.2566370000, FAST: 0.187417000
Bits 19078 Base 64 SLOW: 0.1112380000, FAST: 0.0827020000
SUM: SLOW: 662.0398000000, FAST: 112.4025840000
Sums for Base 10 SLOW: 15.7169280000, FAST: 1.8504550000

MP_RADIX_BARRETT_START_MULTIPLICATOR = 10
Bits 1078 Base 10 SLOW: 0.0098600000, FAST: 0.0069170000
Bits 1078 Base 64 SLOW: 0.0011670000, FAST: 0.0019670000
Bits 19080 Base 10 SLOW: 2.2540880000, FAST: 0.1689260000
Bits 19080 Base 64 SLOW: 0.1115890000, FAST: 0.0357570000
SUM: SLOW: 660.6001590000, FAST: 85.153820000
Sums for Base 10 SLOW: 15.6973850000, FAST: 1.6237270000

MP_RADIX_BARRETT_START_MULTIPLICATOR = 50
Bits 1080 Base 10 SLOW: 0.0103930000, FAST: 0.0108190000
Bits 1080 Base 64 SLOW: 0.0011380000, FAST: 0.0015330000
Bits 19080 Base 10 SLOW: 2.2607540000, FAST: 0.2614210000
Bits 19080 Base 64 SLOW: 0.1112290000, FAST: 0.0335130000
SUM: SLOW: 662.9941510000, FAST: 113.7476640000
Sums for Base 10 SLOW: 15.7478230000, FAST: 2.4405090000

(Timings are read/write combined)

The timing used in demo/test.c to check that the faster method is actually faster is switched off when running under Valgrind.

There are a lot of edge cases, so there is a lot to test. On the upper end with MP_64BIT and base 10, the N-R method works well up to 2^(2^27), the normal way up to 2^(2^30) (up to MP_MAX_DIGIT_COUNT - 4 in reading; writing is still running).

More information in the comments in the code.

@czurnieden

/home/runner/work/libtommath/libtommath/demo/test.c:9:10: fatal error: valgrind/valgrind.h: No such file or directory
    9 | #include <valgrind/valgrind.h>
      |          ^~~~~~~~~~~~~~~~~~~~~

Yep, feared that ;-)
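
One way to keep the build working on systems without the Valgrind development headers would be to guard the include and stub the macro (the guard macro name here is hypothetical; RUNNING_ON_VALGRIND is the real client-request macro and expands to 0 outside of Valgrind):

/* demo/test.c, sketch */
#if defined(LTM_TEST_HAVE_VALGRIND)
#include <valgrind/valgrind.h>
#else
#define RUNNING_ON_VALGRIND 0
#endif

/* ... later, skip the speed comparison under Valgrind: */
if (!RUNNING_ON_VALGRIND) {
   /* run the fast-vs-slow timing check */
}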

@czurnieden

Took me a bit, but the test for the largest possible input for MP_64BIT ran successfully.

$ wc -c ../exact_MP_64_BIT_MAX_minus_1
646456956 ../exact_MP_64_BIT_MAX_minus_1
$ time  ./test_fast_radix_conversion ../exact_MP_64_BIT_MAX_minus_1
Reading
read: 1945.3631711750
Writing
NOR: t = 1, S = 80
NOR: t = 2, S = 160
NOR: t = 3, S = 320
NOR: t = 4, S = 640
NOR: t = 5, S = 1280
NOR: t = 6, S = 2560
NOR: t = 7, S = 5120
NOR: t = 8, S = 10240
NOR: t = 9, S = 20480
NOR: t = 10, S = 40960
NOR: t = 11, S = 81920
NOR: t = 12, S = 163840
NOR: t = 13, S = 327680
NOR: t = 14, S = 655360
NOR: t = 15, S = 1310720
S_MP_DIV_NEWTON b.used = 10886
NOR: t = 16, S = 2621440
S_MP_DIV_NEWTON b.used = 21771
NOR: t = 17, S = 5242880
S_MP_DIV_NEWTON b.used = 43542
NOR: t = 18, S = 10485760
S_MP_DIV_NEWTON b.used = 87083
NOR: t = 19, S = 20971520
S_MP_DIV_NEWTON b.used = 174165
NOR: t = 20, S = 41943040
S_MP_DIV_NEWTON b.used = 348330
NOR: t = 21, S = 83886080
S_MP_DIV_NEWTON b.used = 696659
NOR: t = 22, S = 167772160
S_MP_DIV_NEWTON b.used = 1393318
NOR: t = 23, S = 335544320
S_MP_DIV_NEWTON b.used = 2786636
NOR: t = 24, S = 671088640
S_MP_DIV_NEWTON b.used = 5573271
NOR: t = 25, S = 1342177280
S_MP_DIV_NEWTON b.used = 11146542
S_MP_DIV_NEWTON b.used = 22293083
write: 9063.8351113740

real    183m37,776s
user    430m45,561s
sys     159m45,221s

Input number generated with Pari/GP:

? allmem()
  ***   Warning: new stack size = 8589934592 (8192.000 Mbytes).
? 35791390 * 60
%1 = 2147483400
? a = 2^(35791390 * 60)-1;
? write("/home/czurnieden/exact_35791390_minus_1",a);
? ##
  ***   last result: cpu time 4min, 59,145 ms, real time 5min, 2,763 ms.

The largest possible number for 60-bit limbs would be 2^(35791394*60)-1. One limb is removed because addition unconditionally adds one limb to the result memory, and three more are an angst-allowance, which gives the 35791390 limbs used above. But 2^(35791393*60)-1 is currently running.

This result has not been made with the develop branch (that would have taken weeks) but with some additional fast algorithms: TC 2-5, FFT (FHT), Newton division (no NTT yet). TC only goes up to 5 because the lower cutoff of FFT was so close to TC-5 that I called it good enough for this test. Multi-threading comes from a handful of simple OpenMP directives, so there is some room for improvement.

Test run on an older i7-2600 with 4 cores/8 threads.

@czurnieden

2^(35791393*60)-1 failed at writing:

$ time  ./test_fast_radix_conversion exact_35791393_minus_1
Reading
read: 1947.6428634690
   ...
 S_MP_DIV_NEWTON b.used = 22293083
In mp_grow.c at 22 in mp_grow: Integer overflow
In mp_mul_2d.c at 17 in mp_mul_2d: Integer overflow
In s_mp_faster_to_radix.c at 95 in s_mp_to_radix_recursive: Integer overflow
In s_mp_faster_to_radix.c at 355 in s_mp_faster_to_radix: Integer overflow
error while writing to string: Integer overflow

2^(35791392*60)-1 was successful.

Memory used was not measured exactly but was about 6.5 Gibibytes (Pari/GP uses less but it uses fully filled limbs).

@czurnieden

For MP_28BIT the upper limit is 2^(76695844*28)-1 (two limbs less than MP_MAX_DIGIT_COUNT).

$ time  ./test_fast_radix_conversion /home/czurnieden/exact_76695844_minus_1 
Reading
read: 2934.6979325720
Writing
NOR: t = 1, S = 80
NOR: t = 2, S = 160
NOR: t = 3, S = 320
NOR: t = 4, S = 640
NOR: t = 5, S = 1280
NOR: t = 6, S = 2560
NOR: t = 7, S = 5120
NOR: t = 8, S = 10240
NOR: t = 9, S = 20480
NOR: t = 10, S = 40960
NOR: t = 11, S = 81920
NOR: t = 12, S = 163840
NOR: t = 13, S = 327680
NOR: t = 14, S = 655360
S_MP_DIV_NEWTON b.used = 11663
NOR: t = 15, S = 1310720
S_MP_DIV_NEWTON b.used = 23326
NOR: t = 16, S = 2621440
S_MP_DIV_NEWTON b.used = 46652
NOR: t = 17, S = 5242880
S_MP_DIV_NEWTON b.used = 93303
NOR: t = 18, S = 10485760
S_MP_DIV_NEWTON b.used = 186606
NOR: t = 19, S = 20971520
S_MP_DIV_NEWTON b.used = 373211
NOR: t = 20, S = 41943040
S_MP_DIV_NEWTON b.used = 746421
NOR: t = 21, S = 83886080
S_MP_DIV_NEWTON b.used = 1492841
NOR: t = 22, S = 167772160
S_MP_DIV_NEWTON b.used = 2985681
NOR: t = 23, S = 335544320
S_MP_DIV_NEWTON b.used = 5971362
NOR: t = 24, S = 671088640
S_MP_DIV_NEWTON b.used = 11942723
NOR: t = 25, S = 1342177280
S_MP_DIV_NEWTON b.used = 23885446
S_MP_DIV_NEWTON b.used = 47770891
write: 12804.5266057830

real	262m22,303s
user	697m43,803s
sys	200m22,995s

MP_16BIT needs a bit more work: a simple change of INT_MAX is most likely insufficient, and setting up a virtual 16-bit machine needs a second or two.

@czurnieden

For MP_16BIT with INT_MAX = 32767:

$ ./test_fast_radix_conversion  /home/czurnieden/exact_2183_minus_1 > /dev/null
MP_DIGIT_BIT: 15, INT_MAX: 32767, MP_MAX_DIGIT_COUNT: 2184
Reading
read: 0.0178557870
Writing
NOR: t = 1, S = 80
NOR: t = 2, S = 160
NOR: t = 3, S = 320
NOR: t = 4, S = 640
NOR: t = 5, S = 1280
NOR: t = 6, S = 2560
NOR: t = 7, S = 5120
NOR: t = 8, S = 10240
NOR: t = 9, S = 20480
write: 0.0216505660

MP_OVF as the error?

@czurnieden

Ah, I see.

@czurnieden

The speed test gets reinstated when cutoff tuning is implemented.

@czurnieden

Some pretty pictures of benchmarks with MP_64BIT (note that the y-axes are different!):
[graphs: b10wnc3000_podo, b16wnc3000_podo, b64wnc3000_podo]

A bit further up to visualize the different complexities:
[graphs: b10wnc25000_podo, b16wnc25000_podo, b64wnc25000_podo]

Most of the bases that are not a power of two are more or less in the same ballpark as base 10 (bigger bases are slightly faster than smaller ones), so tuning can be restricted to the most-used bases 10, 16 and 64 without much loss. It is already a lot of code; I don't want to add more branches than absolutely necessary.

@czurnieden

Other limb-sizes:

15 bit:
[graphs: 16bitb64wnc3000_podo, 16bitb16wnc3000_podo, 16bitb10wnc3000_podo]

28 bit:
[graphs: 32bitb64wnc3000_podo, 32bitb16wnc3000_podo, 32bitb10wnc3000_podo]

31 bit:
[graphs: 31bitb64wnc3000_podo, 31bitb10wnc3000_podo, 31bitb16wnc3000_podo]

There is also a spot in the code in s_mp_faster_to_radix.c that can make use of the high short product (s_mp_mul_high.c). It is not part of the faster multiplication algorithms except Comba, so a branch would be needed. (Data for s_mp_faster_read_radix.c was added to show fluctuations while running the benchmark.)

[graphs: 64bitshortpb64wnc3000_podo, 64bitshortpb16wnc3000_podo, 64bitshortpb10wnc3000_podo]

There are not a lot of differences here; a fixed cutoff at 500 bits (rounded to the next limb size) would make the most sense for both purposes: as the general cutoff and as the cutoff between the built-in high short product and full multiplication. As said before: it is already a lot of code for a "simple" radix conversion!
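
The MP_RADIX_CUTOFF mentioned below presumably encodes exactly that; a guess at how such a constant could be derived (my assumption, not the PR's actual definition):

/* 500 bits rounded up to whole limbs, so the same bit cutoff works for
 * MP_DIGIT_BIT = 60, 31, 28 and 15 alike */
#define MP_RADIX_CUTOFF_BITS 500
#define MP_RADIX_CUTOFF ((MP_RADIX_CUTOFF_BITS + MP_DIGIT_BIT - 1) / MP_DIGIT_BIT)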

@czurnieden

What? Ah, MP_RADIX_CUTOFF is not used in tune and gets lost after that run.

@czurnieden

Line 29 in mp_div.c:

       && (b->used > (2 * MP_MUL_KARATSUBA_CUTOFF))

Great! *sigh*

@czurnieden

It is already a lot of code: not so much for reading, but quite a lot for writing. So I'd like to stop here. No more cutoff branches (only bases 16 and 64 would be of interest anyway), and no more internal branches except for radices of the form 2^n. I tried to use s_mp_mul_high but it is too specialized and would need a bit of extra massage (see the code for some more details).

So, without any complaints, I'd like to give it a wet wipe and call it done. OK?

@sjaeckel

I'd like to give it a wet wipe and call it done. OK?

Absolutely OK :)

Those graphs look good. Do I understand correctly that the "slow path" is the old version?

@czurnieden

Do I understand correctly that the "slow path" is the old version?

They are the old ones, yes, but with the added string positioning, which shouldn't add much runtime because it is outside of the loops.

Yes, my ability to give the children good names is still highly underdeveloped, sorry ;-)

@czurnieden

As always: when you thought you had it all ... ;-)
There was still an unnecessary branch in mp_fread left.

@czurnieden

Hold your horses, please; I just saw that I forgot to add the printing part in tune.c.

@czurnieden

No, I don't know how that happened (the rotation of > in s_mp_faster_read_radix.c), but I guess it was a case of a git restore.

You might try make graphs in etc. It is not in the parent makefiles yet, save CMake, because it should go to the logs directory, which is quite…um…desolate now. It also needs some documentation.

@czurnieden

czurnieden commented Jun 13, 2024

Tuning does not work in makefile.shared because of the hidden symbols. I implemented a workaround: generate a static version, do the tuning/graphing, clean up, and build the dynamic version again. Might be a separate PR?

@sjaeckel

Tuning does not work in makefile.shared because of the hidden symbols. I implemented a workaround: generate a static version, do the tuning/graphing, clean up, and build the dynamic version again. Might be a separate PR?

Sure, make a separate PR. Did you check whether the amalgamated version of the library results in different tuning parameters?
Would it maybe make sense to have tune run against the profiled/profiled_single library? Or the other way around? Or first tune then profiled then tune again? :D So many options ;)

@czurnieden

Did you check whether the amalgamated version of the library results in different tuning parameters?

No, I was just happy that it works in the first place! ;-)

But I'll take a look, of course.

Would it maybe make sense to have tune run against the profiled/profiled_single library?

Good question. I'll take a look. (I think we would need the full triplet tune->profile->tune)

So many options ;)

Nearly as many as in the cereals aisle at Walmart ;-)

But I found a problem with tune.c itself. I'm working on incorporating fast multiplication (again ;-), but properly this time) and ran tune.c with the following result:

#define MP_DEFAULT_MUL_KARATSUBA_CUTOFF 115
#define MP_DEFAULT_SQR_KARATSUBA_CUTOFF 152
#define MP_DEFAULT_MUL_TOOM_CUTOFF      139
#define MP_DEFAULT_SQR_TOOM_CUTOFF      212
#define MP_DEFAULT_MUL_TOOM_4_CUTOFF    218
#define MP_DEFAULT_SQR_TOOM_4_CUTOFF    256
#define MP_DEFAULT_MUL_TOOM_5_CUTOFF    254
#define MP_DEFAULT_SQR_TOOM_5_CUTOFF    256
#define MP_DEFAULT_MUL_TOOM_6_CUTOFF    256
#define MP_DEFAULT_SQR_TOOM_6_CUTOFF    256
#define MP_DEFAULT_MUL_FFT_CUTOFF       608
#define MP_DEFAULT_SQR_FFT_CUTOFF       708

Which is clearly nonsense. (256 limbs is the COMBA size on this machine, btw.) I mean: my implementation of FFT is fast, but I'm pretty sure not that fast ;-)
The problem is resetting the cutoffs of the algorithms back to INT_MAX after testing. If I comment out the line *test[n].cutoff = INT_MAX; I get:

#define MP_DEFAULT_MUL_KARATSUBA_CUTOFF 109
#define MP_DEFAULT_SQR_KARATSUBA_CUTOFF 138
#define MP_DEFAULT_MUL_TOOM_CUTOFF      149
#define MP_DEFAULT_SQR_TOOM_CUTOFF      266
#define MP_DEFAULT_MUL_TOOM_4_CUTOFF    880
#define MP_DEFAULT_SQR_TOOM_4_CUTOFF    2952
#define MP_DEFAULT_MUL_TOOM_5_CUTOFF    982
#define MP_DEFAULT_SQR_TOOM_5_CUTOFF    927
#define MP_DEFAULT_MUL_TOOM_6_CUTOFF    1251
#define MP_DEFAULT_SQR_TOOM_6_CUTOFF    1083
/* Results with steps of 100 and (1000) resp. */
#define MP_DEFAULT_MUL_FFT_CUTOFF       24208 (47008)
#define MP_DEFAULT_SQR_FFT_CUTOFF       12308 (43008)

That looks a bit more reasonable. There is not much difference up to TOOM-3 because their cutoffs are all below or at least very close to the COMBA size.
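
A simplified, self-contained model of the suspected mechanism (structure and numbers invented; the real loop is in etc/tune.c): once an algorithm's cutoff is reset to INT_MAX it is disabled again, so every later algorithm is timed against the untuned baseline instead of the already-enabled faster ones, which drags its apparent cutoff down.

#include <limits.h>
#include <stdio.h>

static int karatsuba_cutoff = INT_MAX;
static int toom_cutoff      = INT_MAX;

struct tuned {
   const char *name;
   int *cutoff;
};

int main(void)
{
   struct tuned test[] = {
      { "MP_DEFAULT_MUL_KARATSUBA_CUTOFF", &karatsuba_cutoff },
      { "MP_DEFAULT_MUL_TOOM_CUTOFF",      &toom_cutoff      },
   };
   size_t n;

   for (n = 0; n < sizeof(test) / sizeof(test[0]); n++) {
      /* stand-in for the timing loop that finds the crossover point */
      *test[n].cutoff = 100 + 50 * (int)n;
      printf("#define %s %d\n", test[n].name, *test[n].cutoff);
      /* the reset under discussion: the cutoff goes back to INT_MAX, so
       * this algorithm is switched off while the next one is timed */
      *test[n].cutoff = INT_MAX;
   }
   return 0;
}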

Why is tune.c so complex? Who wrote that?

 (git blame   --line-porcelain    etc/tune_it.sh; git blame   --line-porcelain    etc/tune.c ) \
   |  sed -n 's/^author //p' | sort | uniq -c | sort -rn

Ah, I forgot, never mind ;-)

The outlier SQR_TOOM_4 is not an outlier; it always behaves that way. Probably something is broken there; I have not checked yet, and it is not my algorithm (that would be TOOM_6). FFT is raw, no slicing implemented yet. NTT will probably never come (a separate modulus for every MP_xxBIT alone means a lot of extra work, and slicing with CRT needs even more moduli).

And because I know you like a pretty picture or two:

[graphs: multiplying_60_all_four, squaring_60_all_four]

Ok, the FFT cutoffs from the beginning were not a lie ;-)

TC-7 is in the works (some coefficients are too large even for 32 bit) and TC-8 has a bug in the implementation (the PARI/GP script works, though). With TC-9 and above the coefficients get larger than 64 bit.

@czurnieden

I'll take a look.

Me and my big mouth! ;-)

I see that timing.c still uses RDTSC, which was state of the art at the time of writing but shouldn't be used anymore in times of multi-core/multi-threaded CPUs: not all hardware supports it, and it can even be switched off (e.g. by setting PR_SET_TSC with prctl(2) on Linux). So it might be a bit more work than I expected (which seems to always be the case ;-) ).
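
A portable alternative would be something along these lines (a sketch assuming POSIX clock_gettime, not what timing.c currently does): CLOCK_MONOTONIC does not depend on which core the thread runs on and cannot be switched off the way the TSC can.

#include <stdint.h>
#include <time.h>

static uint64_t ns_now(void)
{
   struct timespec ts;
   (void)clock_gettime(CLOCK_MONOTONIC, &ts);
   return (uint64_t)ts.tv_sec * UINT64_C(1000000000) + (uint64_t)ts.tv_nsec;
}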

The rest of the TODOs in this PR are mostly optimizations that need external work (e.g. short products) or small things that are not function-related and hence not urgent. I will again shove my mop through it and wrap it up for good now. I know my tendency for "featuritis" too well ;-)
