
Spark-submit


TL;DR: Python manager for spark-submit jobs

Description

This package allows for submission and management of Spark jobs in Python scripts via Apache Spark's spark-submit functionality.

Installation

The easiest way to install is using pip:

pip install spark-submit

To install from source:

git clone https://github.com/PApostol/spark-submit.git
cd spark-submit
python setup.py install

For usage details, check help(spark_submit).

Usage Examples

Spark arguments can either be provided as keyword arguments or as an unpacked dictionary.

Simple example:
from spark_submit import SparkJob

app = SparkJob('/path/some_file.py', master='local', name='simple-test')
app.submit()

print(app.get_state())
Another example:
from spark_submit import SparkJob

spark_args = {
    'master': 'spark://some.spark.master:6066',
    'deploy_mode': 'cluster',
    'name': 'spark-submit-app',
    'class': 'main.Class',
    'executor_memory': '2G',
    'executor_cores': '1',
    'total_executor_cores': '2',
    'verbose': True,
    'conf': ["spark.foo.bar='baz'", "spark.x.y='z'"],
    'main_file_args': '--foo arg1 --bar arg2'
    }

app = SparkJob('s3a://bucket/path/some_file.jar', **spark_args)
print(app.get_submit_cmd(multiline=True))

# poll state in the background every x seconds with `poll_time=x`
app.submit(use_env_vars=True,
           extra_env_vars={'PYTHONPATH': '/some/path/'},
           poll_time=10
           )

print(app.get_state()) # 'SUBMITTED'

while not app.concluded:
    # do other stuff...
    print(app.get_state()) # 'RUNNING'

print(app.get_state()) # 'FINISHED'
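
Once the job has concluded, its outcome can be inspected with the helper methods described under "Additional methods" below; a minimal sketch:

# inspect the outcome once the job has concluded
if app.get_code() == 0: # spark-submit return code
    print('Job finished successfully')
else:
    print('Job failed; spark-submit output follows:')
    print(app.get_output()) # full spark-submit stdout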

Examples of converting a spark-submit command to a spark_args dictionary:

A client example:
~/spark_home/bin/spark-submit \
--master spark://some.spark.master:7077 \
--name spark_job_client \
--total-executor-cores 8 \
--executor-cores 4 \
--executor-memory 4G \
--driver-memory 2G \
--py-files /some/utils.zip \
--files /some/file.json \
/path/to/pyspark/file.py --data /path/to/data.csv
becomes
spark_args = {
    'master': 'spark://some.spark.master:7077',
    'name': 'spark_job_client',
    'total_executor_cores': '8',
    'executor_cores': '4',
    'executor_memory': '4G',
    'driver_memory': '2G',
    'py_files': '/some/utils.zip',
    'files': '/some/file.json',
    'main_file_args': '--data /path/to/data.csv'
    }
main_file = '/path/to/pyspark/file.py'
app = SparkJob(main_file, **spark_args)
A cluster example:
~/spark_home/bin/spark-submit \
--master spark://some.spark.master:6066 \
--deploy-mode cluster \
--name spark_job_cluster \
--jars "s3a://mybucket/some/file.jar" \
--conf "spark.some.conf=foo" \
--conf "spark.some.other.conf=bar" \
--total-executor-cores 16 \
--executor-cores 4 \
--executor-memory 4G \
--driver-memory 2G \
--class my.main.Class \
--verbose \
s3a://mybucket/file.jar "positional_arg1" "positional_arg2"
becomes
spark_args = {
    'master': 'spark://some.spark.master:6066',
    'deploy_mode': 'cluster',
    'name': 'spark_job_cluster',
    'jars': 's3a://mybucket/some/file.jar',
    'conf': ["spark.some.conf='foo'", "spark.some.other.conf='bar'"], # note the use of quotes
    'total_executor_cores': '16',
    'executor_cores': '4',
    'executor_memory': '4G',
    'driver_memory': '2G',
    'class': 'my.main.Class',
    'verbose': True,
    'main_file_args': '"positional_arg1" "positional_arg2"'
    }
main_file = 's3a://mybucket/file.jar'
app = SparkJob(main_file, **spark_args)

Testing

You can do some simple testing with Spark in local mode after cloning the repo.

Note that the tests have additional requirements, which can be installed with: pip install -r tests/requirements.txt

pytest tests/

python tests/run_integration_test.py

Additional methods

spark_submit.system_info(): Collects Spark-related system information, such as the versions of spark-submit, Scala, Java, PySpark, Python and the OS

spark_submit.SparkJob.kill(): Kills the running Spark job (cluster mode only)

spark_submit.SparkJob.get_code(): Gets the spark-submit return code

spark_submit.SparkJob.get_output(): Gets the spark-submit stdout

spark_submit.SparkJob.get_id(): Gets the spark-submit submission ID
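
For illustration, here is a hedged sketch combining these methods; the master URL and application file are placeholders, and kill() only applies to cluster deploy mode:

from spark_submit import SparkJob, system_info

print(system_info()) # versions of spark-submit, Scala, Java, PySpark, Python and OS

# placeholder master URL and application file
app = SparkJob('s3a://bucket/path/some_file.jar',
               master='spark://some.spark.master:6066',
               deploy_mode='cluster',
               name='methods-demo')
app.submit()

print(app.get_id())     # submission ID reported by spark-submit
app.kill()              # cancel the running job (cluster mode only)
print(app.get_code())   # spark-submit return code
print(app.get_output()) # spark-submit stdout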

License

Released under MIT by @PApostol.

  • You can freely modify and reuse.
  • The original license must be included with copies of this software.
  • Please link back to this repo if you use a significant portion of the source code.