Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: stats.zipf: incorrect pmf values #20692

Closed
tk-yoshimura opened this issue May 10, 2024 · 4 comments · Fixed by #20702
Closed

BUG: stats.zipf: incorrect pmf values #20692

tk-yoshimura opened this issue May 10, 2024 · 4 comments · Fixed by #20702
Labels
defect A clear bug or issue that prevents SciPy from being installed or used as expected scipy.stats
Milestone

Comments

@tk-yoshimura
Copy link

tk-yoshimura commented May 10, 2024

Describe your issue.

The pmf value of zipf may return an incorrect value.

If the shape parameter 'a' is set to 9, pmf returns an incorrect value.

Scipy returns the following for inputs greater than or equal to 0:
0.000000000000000000e+00
9.979956327307618613e-01
1.949210220177269260e-03
5.070343101817618650e-05
3.807051211283729024e-06
5.109737639581500849e-07
9.903013870737536426e-08
2.473126213304208103e-08
7.435646897038533250e-09
2.576001169444504624e-09
9.979956327307618846e-10
0.000000000000000000e+00
1.154001579655571043e-09
4.953901915407116765e-10
0.000000000000000000e+00
0.000000000000000000e+00

In wolfram, the values will be as follows:
PDF[zipfdistribution[8], 11]
1/(2357947691 ζ(9))≈4.23248×10^-10

(Note: that in wolfram's definition, the shape parameter has a difference of 1.)

In my experiment, I obtained the following.
0
0.9979956327307621568646761321051
0.001949210220177269837626320570518
5.070343101817620062311010171748e-5
3.807051211283730151613907364292e-6
5.109737639581502243147141796378e-7
9.903013870737539184201191741694e-8
2.473126213304208857623746328563e-8
7.435646897038535452370912820884e-9
2.576001169444505442417827654193e-9
9.979956327307621568646761321051e-10
4.232475709872573915655519654635e-10
1.93418239662842562191429526205e-10
9.411058434986077082628878685352e-11

The cause of this bug is that there is no overflow avoidance when the shape paramter and input values to pmf are both integer types.

Reproducing Code Example

import numpy as np
import scipy.stats as stats

x = np.arange(0, 16).astype(np.int32)
dist = stats.zipf(9)
pmf = dist.pmf(x)

Error message

(N/A)

SciPy/NumPy/Python version and system information

1.13.0 1.26.4 sys.version_info(major=3, minor=9, micro=5, releaselevel='final', serial=0)
Build Dependencies:
  blas:
    detection method: pkgconfig
    found: true
    include directory: /c/opt/64/include
    lib directory: /c/opt/64/lib
    name: openblas
    openblas configuration: USE_64BITINT= DYNAMIC_ARCH=1 DYNAMIC_OLDER= NO_CBLAS=
      NO_LAPACK= NO_LAPACKE= NO_AFFINITY=1 USE_OPENMP= ZEN MAX_THREADS=24
    pc file directory: c:/opt/64/lib/pkgconfig
    version: 0.3.26.dev
  lapack:
    detection method: pkgconfig
    found: true
    include directory: /c/opt/64/include
    lib directory: /c/opt/64/lib
    name: openblas
    openblas configuration: USE_64BITINT= DYNAMIC_ARCH=1 DYNAMIC_OLDER= NO_CBLAS=
      NO_LAPACK= NO_LAPACKE= NO_AFFINITY=1 USE_OPENMP= ZEN MAX_THREADS=24
    pc file directory: c:/opt/64/lib/pkgconfig
    version: 0.3.26.dev
  pybind11:
    detection method: config-tool
    include directory: unknown
    name: pybind11
    version: 2.12.0
Compilers:
  c:
    commands: cc
    linker: ld.bfd
    name: gcc
    version: 10.3.0
  c++:
    commands: c++
    linker: ld.bfd
    name: gcc
    version: 10.3.0
  cython:
    commands: cython
    linker: cython
    name: cython
    version: 3.0.10
  fortran:
    commands: gfortran
    linker: ld.bfd
    name: gcc
    version: 10.3.0
  pythran:
    include directory: C:\Users\runneradmin\AppData\Local\Temp\pip-build-env-z8a5b3_f\overlay\Lib\site-packages/pythran
    version: 0.15.0
Machine Information:
  build:
    cpu: x86_64
    endian: little
    family: x86_64
    system: windows
  cross-compiled: false
  host:
    cpu: x86_64
    endian: little
    family: x86_64
    system: windows
Python Information:
  path: C:\Users\runneradmin\AppData\Local\Temp\cibw-run-0mr_3znw\cp39-win_amd64\build\venv\Scripts\python.exe
  version: '3.9'
@tk-yoshimura tk-yoshimura added the defect A clear bug or issue that prevents SciPy from being installed or used as expected label May 10, 2024
@dschmitz89
Copy link
Contributor

dschmitz89 commented May 11, 2024

Thanks for the report. Could you elaborate with a minimal working example for which parameter values the error occurs? I have a hard time interpreting the Wolfram Alpha computarion. The same holds for the other issue you opened.

@dschmitz89
Copy link
Contributor

Thanks for adding the MWE, that helps a lot.

@tk-yoshimura
Copy link
Author

For example, if the following code is executed:

import numpy as np
import scipy.stats as stats

x = np.arange(0, 16).astype(np.int32)
dist = stats.zipf(9.0)
pmf = dist.pmf(x)

print(pmf)

The values are as expected.

[0.00000000e+00 9.97995633e-01 1.94921022e-03 5.07034310e-05
3.80705121e-06 5.10973764e-07 9.90301387e-08 2.47312621e-08
7.43564690e-09 2.57600117e-09 9.97995633e-10 4.23247571e-10
1.93418240e-10 9.41105843e-11 4.83032464e-11 2.59601567e-11]

However, if the shape parameter is an integer, an incorrect value is obtained for x>=11, as shown below.

import numpy as np
import scipy.stats as stats

x = np.arange(0, 16).astype(np.int32)
dist = stats.zipf(9)
pmf = dist.pmf(x)

print(pmf)

[0.00000000e+00 9.97995633e-01 1.94921022e-03 5.07034310e-05
3.80705121e-06 5.10973764e-07 9.90301387e-08 2.47312621e-08
7.43564690e-09 2.57600117e-09 9.97995633e-10 0.00000000e+00
1.15400158e-09 4.95390192e-10 0.00000000e+00 0.00000000e+00]

When executing the following lines
_discrete_distns.py L1309

def _pmf(self, k, a):
    # zipf.pmf(k, a) = 1/(zeta(a) * k**a)
    Pk = 1.0 / special.zeta(a, 1) / k**a
    return Pk

The input value k and the shape parameter a are both integer types, which may be causing the overflow.

@dschmitz89
Copy link
Contributor

dschmitz89 commented May 11, 2024

That is indeed the issue. The maximum value for int32 is $2147483647$ but $11^9=2357947691$. We could cast k to float, then do *k**-a instead. Would you be interested in drafting a PR?

@tylerjereddy tylerjereddy added backport-candidate This fix should be ported by a maintainer to previous SciPy versions. and removed backport-candidate This fix should be ported by a maintainer to previous SciPy versions. labels May 13, 2024
@tylerjereddy tylerjereddy added this to the 1.13.1 milestone May 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
defect A clear bug or issue that prevents SciPy from being installed or used as expected scipy.stats
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants