
RFC: Switch to cython_<lapack,blas> based wrappers and deprecate custom scipy.linalg.<lapack,blas> #20682

Open
ilayn opened this issue May 9, 2024 · 12 comments
Labels
Cython Issues with the internal Cython code base deprecated Items related to behavior that has been deprecated Fortran Items related to the internal Fortran code base RFC Request for Comments; typically used to gather feedback for a substantial change proposal scipy.linalg

Comments

@ilayn
Member

ilayn commented May 9, 2024

Historically, we have gone through many stages of BLAS/LAPACK wrappers, and consequently we have a lot of home-baked wrappers around the formal BLAS/LAPACK routines.

This has actually allowed users to conveniently call native routines through the f2py glue. While that is very valuable, the amount of historical inconsistency we carry along has become quite high. Just to name a few (a short illustration follows the list):

  • Character switches are sometimes kept as-is, à la uplo='L', and sometimes converted into 0/1 integers, à la lower=1 (since bool is not really available at the Fortran boundary).
  • Pivot arrays are sometimes converted at the wrapper via injected C code (shifted to 0-based indexing), and sometimes kept in Fortran (1-based) order.
  • lwork queries are sometimes exposed as a separate public function <routine>_lwork, and sometimes left to the user, who has to call the same routine twice with the exposed lwork keyword set to -1.
  • Switching would also let us handle the OpenBLAS linkage (or other libraries) a bit more ergonomically.
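
To make these concrete, here is how a few of today's wrappers behave (a small illustration against the current scipy.linalg.lapack API; the comments paraphrase the existing wrapper conventions):

    import numpy as np
    from scipy.linalg import lapack

    a = np.array([[4.0, 2.0],
                  [2.0, 3.0]])

    # Cholesky: the triangle is selected with an integer flag, not a character
    c, info = lapack.dpotrf(a, lower=1)

    # LU: the returned pivot indices are already shifted to 0-based indexing
    # by C code injected into the f2py wrapper
    lu, piv, info = lapack.dgetrf(a)

    # SVD: the workspace query lives in a separate public helper instead of
    # calling dgesdd itself with lwork=-1
    lwork, info = lapack.dgesdd_lwork(a.shape[0], a.shape[1])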

I think we can switch to Cython-based linkage to get more automated wrapper generation and less strain on our testing framework. It would also make things more straightforward to compile.

On the downside, we currently let f2py handle the normalization of arguments, and this has to be equally robust in the Cythonized wrappers. Moreover, we will have to make sure that F-contiguity and the <s,d,c,z> flavorization are handled properly.
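
As a rough Python-level sketch of what that normalization could look like (the helper name and the upcasting policy below are made up for illustration; the real thing would live in the generated Cython layer):

    import numpy as np

    # Map supported NumPy dtypes to BLAS/LAPACK type prefixes.
    _FLAVOR = {np.dtype(np.float32): 's', np.dtype(np.float64): 'd',
               np.dtype(np.complex64): 'c', np.dtype(np.complex128): 'z'}

    def _normalize(a):
        """Return an F-contiguous array of a supported dtype plus its <s,d,c,z> prefix."""
        a = np.asarray(a)
        if a.dtype not in _FLAVOR:
            # e.g. integer or half-precision input: upcast to double precision
            a = a.astype(np.complex128 if np.iscomplexobj(a) else np.float64)
        return np.asfortranarray(a), _FLAVOR[a.dtype]

    # A wrapper would then dispatch on the prefix, e.g. prefix + 'potrf'.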

We have touched upon this many times over the years, but let's consolidate the discussion here (the last time we mentioned it was, I think, in #13663 (comment)).

One particular difficulty is that if we want to solve the historical issues, we unfortunately have to get rid of all existing wrappers and offer new ones under different names. I'm not necessarily a big fan of this, but to do it without breaking anybody's current code, this seems like the only option. It would probably also need more than two releases' worth of deprecation cycles.

Hence scipy.linalg.lapack and scipy.linalg.blas would get deprecated and eventually go away, and new auto-generated submodules, whatever their names might be, would take their place.

@ilayn ilayn added scipy.linalg Fortran Items related to the internal Fortran code base Cython Issues with the internal Cython code base RFC Request for Comments; typically used to gather feedback for a substantial change proposal labels May 9, 2024
@ev-br
Member

ev-br commented May 9, 2024

cross-ref #4516

@ilayn
Member Author

ilayn commented May 9, 2024

It just occurred to me that we can at least do this for our own internal use and leave the deprecation for later.

@lucascolley lucascolley changed the title RfC:Switch to cython_<lapack,blas> based wrappers and deprecate custom scipy.linalg.<lapack,blas> RFC: Switch to cython_<lapack,blas> based wrappers and deprecate custom scipy.linalg.<lapack,blas> May 9, 2024
@lucascolley lucascolley added the deprecated Items related to behavior that has been deprecated label May 9, 2024
@rgommers
Member

This proposal isn't very clear to me, I'm afraid. We should consider the public API we expose separately from what we do internally. For the public API:

  • scipy.linalg.blas and scipy.linalg.lapack are quite low-level indeed. If someone wants to write nicer wrappers, then sure why not.
    • Once those exist, we can consider if the old ones can be deprecated.
  • scipy.linalg.cython_blas/cython_lapack are LP64 only, and that cannot be changed without breaking backwards compatibility. We could add a second set for ILP64 builds perhaps, but that isn't done today and I haven't seen anyone planning to work on this.

For SciPy's internal usage:

  • We're using the f2py-based wrappers internally. Since we also want ILP64 support for SciPy itself, the f2py-based wrappers must be kept.
  • The machinery to generate the wrappers was improved a fair bit recently.

It just occurred to me that we can at least do this for our own internal use

I don't think so, because of what I wrote above.

@ilayn
Member Author

ilayn commented May 22, 2024

scipy.linalg.cython_blas/cython_lapack are LP64 only, and that cannot be changed without breaking backwards compatibility. We could add a second set for ILP64 builds perhaps, but that isn't done today and I haven't seen anyone planning to work on this.

I can do that at some point after the Fortran rewrites; that is not the problem. The actual problem is probably having two copies of OpenBLAS, or picking up whatever we have at compile time. How does the Cython set pick up ILP64 if the linked OpenBLAS is 32-bit?

@rgommers
Member

How does the Cython set pick up ILP64 if the linked OpenBLAS is 32-bit?

It doesn't, that was my point. Internally we can use ILP64 everywhere when the user builds for that, but the requirements for the Cython API are different.

having two copies of OpenBLAS

Yeah, we probably don't want to put that back. That was a really weird construction. We should have a unified build that provides both LP64 and ILP64 symbols. Currently MKL and Accelerate provide that. OpenBLAS could in the future, but not today.

@ilayn
Member Author

ilayn commented May 22, 2024

Ah OK, the Intel lingo for 32-bit indexing is LP64; I didn't pay attention. We need to do something about this at some point anyway. I'll try something when I can spare the time.

@chillenb

If you deprecate the F2PY wrappers, would it be possible to expose the Cython ones as Python functions? Something along these lines... The ability to call BLAS from Python is really nice, especially ?trsm, which is faster than scipy.linalg.solve_triangular (?trtrs) when B is row-major and nrhs is large. Also ?axpy, which is much faster than single-threaded addition of ndarrays. If these wrappers are Cython-only, then they cannot be used interactively, and packages that use them will need to add a Cython dependency.
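
For reference, this is roughly what such Python-level calls look like with the current f2py wrappers (a small sketch with arbitrary sizes; the keyword names are those of today's scipy.linalg.blas, not of any future API):

    import numpy as np
    from scipy.linalg import solve_triangular
    from scipy.linalg.blas import daxpy, dtrsm

    rng = np.random.default_rng(0)
    n, nrhs = 1000, 2000
    a = np.tril(rng.standard_normal((n, n))) + n * np.eye(n)  # well-conditioned lower-triangular A
    b = rng.standard_normal((n, nrhs))                        # row-major B with large nrhs

    x1 = solve_triangular(a, b, lower=True)  # routes through ?trtrs
    x2 = dtrsm(1.0, a, b, lower=1)           # BLAS-3 ?trsm, typically multithreaded

    # BLAS-level axpy computes a*x + y; here it just checks the two results agree
    diff = daxpy(x1.ravel(), x2.ravel(), a=-1.0)
    assert np.allclose(diff, 0.0)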

@rgommers
Member

If you deprecate the F2PY wrappers,

I don't think we will. At least I haven't seen a reason for that yet. Either way, we will only deprecate the current wrappers if a new and improved API for low-level access to BLAS/LAPACK functions is available to switch to.

@ilayn I'm all for improving our BLAS/LAPACK wrapping; however, it requires some prototyping and a very crisp and concise proposal, after laying out all the constraints that the new situation should meet. This RFC as written should probably be closed, because it is fairly incomplete and cannot be executed as is, due to the points I raised above.

Maybe let's use this one as a discussion thread, and at some point open a new one which is complete?

@ilayn
Member Author

ilayn commented May 26, 2024

If you deprecate the F2PY wrappers,

If we are going to deprecate the wrappers, we are not going to remove them without alternatives. The problem with the existing ones is just the historical inconsistencies, and there are quite a lot of them. There is hardly any consistency across the wrappers, and every one of them feels unique: some export the m, n array sizes, some hide them, and so on, with many other oddities as I mentioned above. So we will deprecate only when we have a proper alternative, which is what this issue is about.

Also ?axpy, which is much faster than single-threaded addition of ndarrays.

That's very unlikely to beat NumPy if the same optimizations are given to both at build time.

@ilayn
Member Author

ilayn commented May 26, 2024

This RFC as written should probably be closed, because it is fairly incomplete and cannot be executed as is, due to the points I raised above.

Not sure what you mean by this. This is an RFC, not a PR. And as I am trying to make the case, there are ways to do this, including 64-bit Cython wrappers. There are many similar open discussions, as linked above. So what is missing from the RFC? Maybe you didn't like it?

We just have not chosen which way to go yet. But I don't agree that backwards compatibility is an issue here, because a break will be needed at some point anyway due to the const-addition saga, and many other things are waiting for similar handling. The NumPy 2.0 changes were much more drastic than this; this is not even close to that scale. And we can easily make the big changes in the downstream packages.

@rgommers
Member

An RFC is usually a detailed proposal about what to do. This is more of a "hey, let's get rid of the f2py wrappers" without (so far) a concrete way of achieving that, or a laying-out of all the constraints/requirements. So I really don't know what to do with this at the moment beyond the input I have already given in my first comment - it's too vague. All I can say is "sure, I'd love better/simpler wrappers and less duplication", but that's about it. To me, that is not a very useful open issue.

Let's turn this around: what do you actually want as input from other maintainers here? Or what are the next steps?

@chillenb

chillenb commented May 26, 2024

Thanks everybody for your clarifications!! I'm glad to hear that BLAS wrappers are going to be kept in the public API, at least in some form.

That said, I'd like to push back against the following statement a little:

Also ?axpy, which is much faster than single-threaded addition of ndarrays.

That's very unlikely to beat NumPy if the same optimizations are given to both at build time.

The multithreading does make a difference! These days, single-core memory bandwidth is often only a fraction of the machine's total, and it gets worse when NUMA is involved.

Here is a benchmark on a machine with 2x Intel Xeon 6252N @2.6 GHz. NumPy was compiled with -march=native -mtune=native and BLAS is OpenBLAS from conda-forge:

In [2]: X=np.random.random(1000**3)

In [3]: Y=np.random.random(1000**3)

In [4]: %timeit Z=X+Y
3.15 s ± 3.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: from scipy.linalg.blas import daxpy

In [7]: %timeit daxpy(X,Y)
656 ms ± 27.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [8]: %timeit np.add(X,Y,out=Y)
1.25 s ± 358 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

With Intel Distribution for Python, even np.add is multithreaded. It's a pretty significant boost:

In [5]: %timeit Z=X+Y
339 ms ± 8.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %timeit daxpy(X,Y)
413 ms ± 53.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [7]: %timeit np.add(X,Y,out=Y)
209 ms ± 49.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
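
A quick way to check how much of the daxpy advantage comes from threading is to cap the BLAS thread count, for example with the threadpoolctl package (a minimal sketch, not part of the measurements above):

    import time
    import numpy as np
    from scipy.linalg.blas import daxpy
    from threadpoolctl import threadpool_limits

    X = np.random.random(10**8)  # smaller than the arrays above, still memory-bound
    Y = np.random.random(10**8)

    # With the BLAS pinned to one thread, daxpy should land close to the
    # single-threaded np.add timings rather than the multithreaded ones.
    with threadpool_limits(limits=1, user_api='blas'):
        t0 = time.perf_counter()
        daxpy(X, Y)
        print('1-thread daxpy:', time.perf_counter() - t0, 's')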
