Presto Workload Analyzer


The Workload Analyzer collects and analyzes Presto® and Trino workload statistics. The analysis provides improved visibility into your analytical workloads and enables query optimization to enhance cluster performance.

The Presto® Workload Analyzer collects and stores QueryInfo JSONs for queries executed while it is running, as well as for any historical queries still held in the Presto® Coordinator's memory.

The collection process has negligible compute cost and does not impact cluster query execution in any way. Ensure that sufficient disk space is available in your working directory; a compressed QueryInfo JSON file is typically 50-200 KB.
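As a rough sizing check while collection runs, you can watch the output directory grow. This is an optional sketch that assumes the JSONs are written to ./JSONs/, as in the usage examples below:

# Rough sizing check (illustrative only): e.g. 100,000 collected queries at ~100 KB each is roughly 10 GB
du -sh ./JSONs/          # total size of collected QueryInfo JSONs
ls ./JSONs/ | wc -l      # number of collected files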


Features

  • Continuously collects and stores QueryInfo JSONs in the background, without impacting query performance.
  • Summarizes key query metrics to a summary.jsonl file (see the inspection sketch after this list).
  • Generates an analysis report:
    • Query detail - query peak memory, input data read by query, and joins distribution.
    • Table activity - wall time utilization, and input bytes read, by table scans.
    • Presto® Operators - wall time usage, and input bytes read, by operator.
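The summary is a gzip-compressed JSON Lines file, one record per query. A minimal sketch for peeking at the first record, assuming the file sits at JSONs/summary.jsonl.gz after running ./extract.py (field names may vary between versions):

# Pretty-print the first summarized query record (field names are not guaranteed here)
zcat JSONs/summary.jsonl.gz | head -n 1 | python3 -m json.tool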

Supported Versions of Presto

The Workload Analyzer supports the following versions:

  1. Trino (formerly known as PrestoSQL) - 402 and older.
  2. PrestoDB - 0.245.1 and older.
  3. Starburst Enterprise - 402e and older.
  4. Dataproc - 1.5.x and older.

Although the Workload Analyzer may run with newer versions of Presto®, these scenarios have not been tested.

Installation

For installation, see here.

Usage

Local machine / Remote machine

First, go to the analyzer directory, where the Workload Analyzer Python code can be found.

cd analyzer/

To collect statistics from your cluster, run the following script for a period that will provide a representative sample of your workload.

./collect.py -c http://<presto-coordinator>:8080 --username-request-header "X-Trino-User" -o ./JSONs/ --loop

Notes:

  1. In most cases, this period will be between 5 and 15 days, with longer durations providing more significant analysis.
  2. The above command will continue running until stopped by the user (Ctrl+C).
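Because collection typically runs for days, you may want it to keep running after your terminal session ends. A minimal sketch using nohup, with the same placeholders as the command above:

# Run the collector in the background so it survives a closed terminal (optional)
nohup ./collect.py -c http://<presto-coordinator>:8080 --username-request-header "X-Trino-User" -o ./JSONs/ --loop > collect.log 2>&1 &
tail -f collect.log      # follow progress; stop the collector later with kill <pid>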

To analyze the downloaded JSONs directory (e.g. ./JSONs/) and generate a zipped HTML report, execute the following command:

./extract.py -i ./JSONs/ && ./analyze.py -i ./JSONs/summary.jsonl.gz -o ./output.zip
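The result is a zipped HTML report. One way to view it locally, assuming output.zip contains the report's HTML pages as generated above:

# Unpack the zipped report and open it in a browser (exact file names inside the archive may vary)
unzip -o ./output.zip -d ./report/
ls ./report/             # locate the main HTML page and open it in your browser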

Docker

To collect statistics from your cluster, run the following script for a period that will provide a representative sample of your workload.

$ mkdir JSONs/
$ docker run -v $PWD/JSONs/:/app/JSONs analyzer ./analyzer/collect.py -c http://$PRESTO_COORDINATOR:8080 --username-request-header "X-Trino-User" -o JSONs/ --loop

To analyze the downloaded JSONs directory (e.g. ./JSONs/) and generate a zipped HTML report, execute the following commands:

$ docker run -v $PWD/JSONs/:/app/JSONs analyzer ./analyzer/extract.py -i JSONs/
$ docker run -v $PWD/JSONs/:/app/JSONs analyzer ./analyzer/analyze.py -i JSONs/summary.jsonl.gz -o JSONs/output.zip

Notes:

  1. In most cases, this period will be between 5 and 15 days, with longer durations providing more significant analysis.
  2. The collection command above will continue running until stopped by the user (Ctrl+C).
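The commands above assume a locally available Docker image named analyzer. If you have not built it yet, a build step along these lines should work from the repository root (assuming the Dockerfile sits at the root, as is common):

# Build the Workload Analyzer image locally before running the collection/analysis containers
$ docker build -t analyzer .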

Screencasts

See the following screencasts for usage examples:

Collection

[asciicast: collection]

Analysis

[asciicast: analysis]

Advanced Features

  • In exceptional circumstances, it may be desirable to do one or more of the following:
    1. Obfuscate the schema names.
    2. Remove the SQL queries from the summary file.
    3. Analyze queries for a specific schema only (joins with other schemas are included).

To apply any of these options, run the ./jsonl_process.py script after ./extract.py and before ./analyze.py.

In the example below, only queries from the transactions schema are kept, and the SQL queries are removed from the new summary file:

./jsonl_process.py -i ./JSONs/summary.jsonl.gz -o ./processed_summary.jsonl.gz --filter-schema transactions --remove-query 

In the following example, all the schema names are obfuscated:

./jsonl_process.py -i ./JSONs/summary.jsonl.gz -o ./processed_summary.jsonl.gz --rename-schemas 

In the following example, all the partition and user names are obfuscated:

./jsonl_process.py -i ./JSONs/summary.jsonl.gz -o ./processed_summary.jsonl.gz --rename-partitions --rename-user 

After ./jsonl_process.py has been executed, generate a report based on the new summary file by running:

./analyze.py -i ./processed_summary.jsonl.gz -o ./output.zip

  • To create a high-contrast report, use the --high-contrast-mode parameter, for example:

./analyze.py --high-contrast-mode -i ./JSONs/summary.jsonl.gz -o ./output.zip
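The processing options can also be combined. A sketch that applies the obfuscation flags shown above in a single pass, assuming jsonl_process.py accepts them together, and then analyzes the result:

# Combine the obfuscation options from the examples above into one processing pass (sketch)
./jsonl_process.py -i ./JSONs/summary.jsonl.gz -o ./processed_summary.jsonl.gz --rename-schemas --rename-partitions --rename-user --remove-query
./analyze.py -i ./processed_summary.jsonl.gz -o ./output.zip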

Notes

Presto® is a trademark of The Linux Foundation.