PyHive, Presto connector returning wrong resultset #460

Open
alextk87 opened this issue Aug 1, 2023 · 3 comments

alextk87 commented Aug 1, 2023

I'm using a Presto cluster to process large amounts of data.

To visualize the data I use the connector suggested by the official Superset documentation, which is PyHive (it provides the Presto dialect for SQLAlchemy), and I'm using the default settings for the connection.
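
For reference, a minimal sketch of what that connection looks like when built by hand through SQLAlchemy - the host, port, user, catalog, and schema are the placeholder values from my test scripts below, and it assumes the installed PyHive registers the presto:// dialect:

# Minimal sketch: the SQLAlchemy URI form for PyHive's Presto dialect.
# Connection values are placeholders from the repro scripts below.
from sqlalchemy import create_engine, text

engine = create_engine("presto://testUser@localhost:30000/pinot/default")
with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).fetchone())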

When I execute a very simple query - "SELECT * FROM test_table" - through the provided PyHive Presto connector, the number of rows in the returned result set is wrong compared with the same query executed in the presto-cli app, the official client provided by the Presto documentation.

I created two simple Python scripts to test the Presto connection, one using PyHive and one using the official JDBC driver (presto-jdbc jar).

The PyHive connector returned the wrong number of rows in the result set - about 817,000, exactly the same number of rows that the Superset chart returned. The script using the official JDBC driver returned the correct amount of data - 875,000 rows.
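
A quick way to separate a fetch-path problem from a data problem is to compare that client-side tally against a server-side count over the same connection. A minimal sketch, reusing the placeholder connection values from the scripts below:

# Sketch: ask Presto itself for the authoritative row count over the
# same PyHive connection; placeholders match the repro scripts below.
from pyhive import presto

conn = presto.connect(host="localhost", port=30000, username="testUser",
                      catalog="pinot", schema="default")
cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM test_table")
print("server-side count:", cur.fetchone()[0])
cur.close()
conn.close()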

It looks like the issue is caused by the PyHive connector. Is it possible to change the connection method from PyHive to the official JDBC driver?

I'm attaching the two Python scripts that I used to reproduce the issue.

#This Python script is using PyHive
from pyhive import presto

def execute_presto_query(host, port, user, catalog, schema, table, max_rows):
    connection = presto.connect(
        host=host,
        port=port,
        username=user,
        catalog=catalog,
        schema=schema,
        session_props={'query_max_output_size': '1TB'},
    )
    # Create the cursor and counter before the try block so the finally
    # clause cannot hit a NameError if execute() fails early.
    cursor = connection.cursor()
    total_rows = 0

    try:
        cursor.execute(f"SELECT * FROM {table}")

        # Drain the result set in batches of up to max_rows, tallying each row.
        while True:
            rows = cursor.fetchmany(max_rows)

            if not rows:
                break

            for row in rows:
                total_rows += 1
                print(row)

    except Exception as e:
        print("Error executing the query:", e)

    finally:
        print(total_rows)
        cursor.close()
        connection.close()

if __name__ == "__main__":
    host = "localhost"
    port = 30000
    user = "testUser"
    catalog = "pinot"
    schema = "default"
    table = "test_table"
    max_rows = 1000000

    execute_presto_query(host, port, user, catalog, schema, table, max_rows)

#This Python script is using the official JDBC driver
import jaydebeapi
import jpype

def execute_presto_query(host, port, user, catalog, schema, table, max_rows):
    jar_file = '/home/admin1/Downloads/presto-jdbc-0.282.jar'
    # Start the JVM with the Presto JDBC driver on the classpath.
    jpype.startJVM(jpype.getDefaultJVMPath(), "-Djava.class.path=" + jar_file)
    connection_url = f'jdbc:presto://{host}:{port}/{catalog}/{schema}'
    conn = jaydebeapi.connect(
        'com.facebook.presto.jdbc.PrestoDriver',
        connection_url,
        {'user': user},
        jar_file
    )
    # Create the cursor before the try block so the finally clause
    # cannot hit a NameError if execute() fails early.
    cursor = conn.cursor()

    try:
        cursor.execute(f"SELECT * FROM {table}")
        # fetchall() pulls the entire result set into memory in one go,
        # so max_rows is not used on this path.
        rows = cursor.fetchall()
        for row in rows:
            print(row)

        print(f"Total rows returned: {len(rows)}")

    except Exception as e:
        print("Error executing the query:", e)

    finally:
        cursor.close()
        conn.close()

    jpype.shutdownJVM()

if __name__ == "__main__":
    host = "localhost"
    port = 30000
    user = "testUsername"
    catalog = "pinot"
    schema = "default"
    table = "test_table"
    max_rows = 1000000

    execute_presto_query(host, port, user, catalog, schema, table, max_rows)
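
Note that the two scripts drain the result set differently: the PyHive script loops over fetchmany() batches, while the JDBC script pulls everything with a single fetchall(). Per DB-API semantics both should agree on the total row count, so the mismatch points at the connector rather than at the fetch strategy.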

mdeshmu commented Aug 6, 2023

Which version of PyHive and SQLAlchemy are you using?
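
If helpful, a quick way to read the installed versions from the environment's package metadata (assumes Python 3.8+ for importlib.metadata):

# Print the installed PyHive and SQLAlchemy versions.
from importlib.metadata import version

print("PyHive:", version("PyHive"))
print("SQLAlchemy:", version("SQLAlchemy"))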

alextk87 commented

Sorry for my late reply, I was on vacation.
The version of PyHive is 0.6.2 and the version of SQLAlchemy is 2.0.19.
The issue is reproducible with result sets larger than 150,000 records.
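
For anyone reproducing this, one way to locate the threshold is to step LIMIT upward and compare PyHive's fetched tally against the requested limit. A minimal sketch, using the same placeholder connection values as my scripts above:

# Sketch: step LIMIT upward to see where PyHive's tally starts diverging.
from pyhive import presto

conn = presto.connect(host="localhost", port=30000, username="testUser",
                      catalog="pinot", schema="default")
cur = conn.cursor()
for limit in (100000, 150000, 200000, 400000):
    cur.execute(f"SELECT * FROM test_table LIMIT {limit}")
    fetched = 0
    while True:
        batch = cur.fetchmany(10000)
        if not batch:
            break
        fetched += len(batch)
    print(f"LIMIT {limit}: fetched {fetched}")
cur.close()
conn.close()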


mdeshmu commented Aug 21, 2023

Can you try the latest version of PyHive, i.e. 0.7.0, to check whether the issue still exists?
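
For example, in a pip-managed environment: pip install --upgrade "PyHive[presto]==0.7.0" (I believe the presto extra pulls in the requests dependency that the Presto DBAPI uses), then re-run your PyHive repro script.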
