Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pandas DataFrame is reported as array, but it should be a map as it has keys method! #353

Open
JaroslavTulach opened this issue Sep 1, 2023 · 12 comments
Assignees

Comments

@JaroslavTulach
Copy link
Contributor

Steps to reproduce:

Create Pandas.java File

import java.io.File;
import org.graalvm.polyglot.Context;

class Pandas {
  public static void main(String... args) throws Exception {
    var pwd = System.getProperty("user.dir");
    var exe = new File(pwd + "/pandaspy/bin/graalpy");
    if (!exe.exists()) {
      throw new Exception("Cannot find " + exe);
    }
    var ctx = Context.newBuilder()
      .option("python.Executable", exe.getAbsolutePath())
      .allowAllAccess(true)
      .build();

    var df = ctx.eval("python", """
    import site
    import pandas as pd

    df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
    print(df)
    df
    """);

    System.out.println("Size: " + df.getArraySize());
    System.out.println("At 0: " + df.getArrayElement(0));
  }
}
$ /graalvm-community-openjdk-17.0.7+7.1/bin/graalpy -m venv pandaspy
$ ./pandaspy/bin/graalpy -m pip install pandas
Successfully installed numpy-1.23.5 pandas-1.5.2 python-dateutil-2.8.2 pytz-2022.2.1 six-1.16.0
$ /graalvm-community-openjdk-17.0.7+7.1/bin/java Pandas.java
   col1  col2
0     1     3
1     2     4
Size: 2
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Invalid array index 0 for array '   col1  col2
0     1     3
1     2     4'(language: Python, type: DataFrame).
	at org.graalvm.truffle/com.oracle.truffle.polyglot.PolyglotEngineException.arrayIndexOutOfBounds(PolyglotEngineException.java:159)
	at org.graalvm.truffle/com.oracle.truffle.polyglot.PolyglotValueDispatch.invalidArrayIndex(PolyglotValueDispatch.java:1390)
	at org.graalvm.truffle/com.oracle.truffle.polyglot.PolyglotValueDispatch$InteropValue$GetArrayElementNode.doCached(PolyglotValueDispatch.java:3396)
	at org.graalvm.truffle/com.oracle.truffle.polyglot.PolyglotValueDispatchFactory$InteropValueFactory$GetArrayElementNodeGen.executeAndSpecialize(PolyglotValueDispatchFactory.java:2922)
	at org.graalvm.truffle/com.oracle.truffle.polyglot.PolyglotValueDispatchFactory$InteropValueFactory$GetArrayElementNodeGen.executeImpl(PolyglotValueDispatchFactory.java:2870)
	at org.graalvm.truffle/com.oracle.truffle.polyglot.HostToGuestRootNode.execute(HostToGuestRootNode.java:124)
	at jdk.internal.vm.compiler/org.graalvm.compiler.truffle.runtime.OptimizedCallTarget.executeRootNode(OptimizedCallTarget.java:718)
	at jdk.internal.vm.compiler/org.graalvm.compiler.truffle.runtime.OptimizedCallTarget.profiledPERoot(OptimizedCallTarget.java:641)
	at jdk.internal.vm.compiler/org.graalvm.compiler.truffle.runtime.OptimizedCallTarget.callBoundary(OptimizedCallTarget.java:574)
	at jdk.internal.vm.compiler/org.graalvm.compiler.truffle.runtime.OptimizedCallTarget.doInvoke(OptimizedCallTarget.java:558)
	at jdk.internal.vm.compiler/org.graalvm.compiler.truffle.runtime.GraalRuntimeSupport.callProfiled(GraalRuntimeSupport.java:255)
	at org.graalvm.truffle/com.oracle.truffle.polyglot.PolyglotValueDispatch$InteropValue.getArrayElement(PolyglotValueDispatch.java:2200)
	at org.graalvm.sdk/org.graalvm.polyglot.Value.getArrayElement(Value.java:335)
	at Pandas.main(Pandas.java:28)
	Suppressed: Attached Guest Language Frames (1)

the Panda array can be printed in Python, it is recognized as an array, but accessing its elements doesn't work yielding unexpected exceptions.

@timfel
Copy link
Member

timfel commented Sep 1, 2023

In Python you also cannot access it via indices. I guess we should just not report it as an array, but it's not possible for us. Python doesn't really have a way to tell you what kind of indices any one object supports until you try.

@timfel
Copy link
Member

timfel commented Sep 1, 2023

The current logic follows https://docs.python.org/3/library/collections.abc.html#collections-abstract-base-classes. We say anything that is a Sequence reports as Array to interop. So anything that has a length and can be subscripted. What kind of indices are actually supported is not in general knowable for us.

@radeusgd
Copy link

radeusgd commented Sep 1, 2023

The Truffle readArrayElement only accepts a long index.

So the Pandas DataFrame reports that it hasArrayElements but there is no way to access any of its contents through the InteropLibrary.readArrayElement - because that only accepts integer indices and the DataFrame does not accept integer indices.

Do you think it would be possible to somehow make the logic for discovering if something is an array not make the DataFrame an array? It seems that from polyglot perspective that is only causing trouble, like in this bug report.


Interestingly, looking at Python docs, the sequence type expects integer indices:

An iterable which supports efficient element access using integer indices via the __getitem__() special method and defines a __len__() method that returns the length of the sequence.

How do you check that the Pandas DataFrame is a sequence? I think checking just for the existence of __getitem__() is not enough. After all, a Mapping also has these two methods (__getitem__(), __len__()) and it is not a sequence.

@timfel
Copy link
Member

timfel commented Sep 1, 2023

How do you check that the Pandas DataFrame is a sequence?

We can only check for the existence of slots though. hasArrayElements is not expected to cause side-effects.

For native types like dataframes, we can implement a different check where we go to the native side, because on the native side there are two different function pointers for mapping subscript and sequence subscript. But on the Python side, both are simply called __getitem__ and we cannot distinguish them

@timfel
Copy link
Member

timfel commented Sep 1, 2023

However, this is a bit more work and will not happen in the next weeks or two. For now, you will have to do a workaround and use invokeMember

@timfel
Copy link
Member

timfel commented Sep 1, 2023

After all, a Mapping also has these two methods (getitem(), len()) and it is not a sequence.

But Mapping defines more methods that sequences do not have (keys, items, values) so we check for those.

@msimacek
Copy link
Contributor

msimacek commented Sep 1, 2023

It doesn't matter what slots or methods we check. Whether a dataframe can or cannot be indexed by integers cannot be determined by the type. Observe:

>>> pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})[0]
Traceback (most recent call last):
...
KeyError: 0
>>> pd.DataFrame([1,2,3])[0]
0    1
1    2
2    3
Name: 0, dtype: int64

There's no way to tell those two cases apart until you try to access the item. And we cannot do that in hasArrayElements because it would have side-effects.

@JaroslavTulach
Copy link
Contributor Author

Whether a dataframe can or cannot be indexed by integers

That fact that something can accidentally be indexed by integers does not make it Interop.hasArrayElements ...

There's no way to tell those two cases apart until you try to access the item.

... no need for that. Just claim DataFrame possitivellly replies to hasHashEntries - then the keys can be strings as well as numbers.

Looks like the DataFrame is more a hash structure than array anyway.

@timfel
Copy link
Member

timfel commented Sep 1, 2023

That fact that something can accidentally be indexed by integers does not make it Interop.hasArrayElements

By that logic, very few things can then be defined has having array elements in Python ;) Also, hasHashEntries means we can iterate the keys and values. What would you like the keys to be? The columns or the rows?

@timfel
Copy link
Member

timfel commented Sep 1, 2023

A DataFrame is not a hash, neither it is an array. The correct fix in this case would be to support neither. The correct user code in this case will be to use the pandas specific APIs directly. We don't have a DataFrame abstraction in interop so far :/

@msimacek
Copy link
Contributor

msimacek commented Sep 1, 2023

That fact that something can accidentally be indexed by integers does not make it Interop.hasArrayElements ...

Python is duck-typed. We don't have anything better to rely on than presence of methods.

... no need for that. Just claim DataFrame possitivellly replies to hasHashEntries - then the keys can be strings as well as numbers.

I think we can do this. @timfel DataFrame has keys method, so we can fulfill the interop contracts for iterating keys and values there. We could report hasHashEntries for objects that have __getitem__, __len__ and keys.

@JaroslavTulach JaroslavTulach changed the title Pandas array has length 2 but yields an interop error when accessing index 0 Pandas DataFrame is reported as array, but it should be a map as it has keys method! Sep 7, 2023
@msimacek
Copy link
Contributor

I think this would better be resolved using the mechanism outlined in #354. We would by default claim that a dataframe is neither an array, nor a hash. And you can then implement your own interop behavior for dataframes, i.e. define that it should behave as a hash.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants