Skip to content

Latest commit

 

History

History
73 lines (46 loc) · 5.98 KB

S3.md

File metadata and controls

73 lines (46 loc) · 5.98 KB

Using an External S3-compatible Object Storage Service for Storing Job Output

By default, JARVICE stores all system and job-related objects in its local database (or external database if so configured). Database-stored objects are scalable and have no limit in theory, other than underlying storage space. However, JARVICE limits them to the last 10,000 lines of output for a job (or approximately 800kb based on full 80-column lines). The internal object store is indexed and provides adequate performance for storing and retrieving job output.

Alternatively, JARVICE can be configured to store job output objects in an S3-compatible object storage service.

NOTE: JARVICE does not provide an S3 service as part of its deployment.

Use Cases

In some cases, such as the following, it may be advantageous to move job output to an external object storage:

Capturing very large job output in full

When using this mechanism there is no limit to the size of the objects, and the entire "standard output" of each job is captured without truncation. This may be beneficial for batch solves of very large job runs.

Storage retention and lifecycle management

Some S3-compatible services offer advanced policies for deleting old objects or moving them to lower tiered storage - note that missing job output is considered an error in JARVICE, but only applies if a user or administrator examines the specific job the output pertained to. Pruning older objects may be an acceptable way to reduce storage costs if users are expected to retrieve output within a certain period of time. Retention and lifecycle management policies are external to JARVICE and service provider-specific.

Controlling storage costs

Some service providers may charge less for object storage (especially lower tiers) than block storage which usually underpins the internal JARVICE database. Keeping very large blobs out of the database may result in significant cost savings as more jobs complete and produce output over time. Further, the loss of one or more job output objects affects only user(s) inspecting the respective job(s) after they execute; the loss of database storage results in complete system failure (until it's restored from backup). Because of this, it may be beneficial to use different storage tiers for database storage and job output object storage, respectively.

Optimizing database backups

Storing fewer "binary blobs" in the internal database should result in smaller and more efficient database backups.

Configuration

Enabling this capability requires setting all of the following values correctly:

  • jarvice.JARVICE_S3_BUCKET - set to the S3 bucket to store job output objects in
  • jarvice.JARVICE_S3_ACCESSKEY - set to the S3 access key for the bucket
  • jarvice.JARVICE_S3_SECRETKEY - set to the S3 secret access key for the bucket

Additionally, the following optional values may be set:

  • jarvice.JARVICE_S3_ENDPOINTUL - set to the S3 endpoint URL, if not using AWS (default is AWS)
  • jarvice.JARVICE_S3_REGION - set to the S3 region to use, if applicable
  • jarvice.JARVICE_S3_PREFIX - set to the object name prefix - the default is output/, so that all objects in the bucket will appear to be in an output folder; some S3 service providers will provide folder views to the object store in their respective console(s), and using this prefix can be an effective organizational/"namespace" method if storing other types of objects in the same bucket

Multi-cluster Configuration Notes

The best practice is to configure both the upstream and all downstream cluster(s) to point to the same object store. However, it is possible to only configure S3 output storage for select downstream cluster(s) in a topology. If a downstream cluster is not configured to talk to S3, it will return truncated job output (to last 10,000 lines) as part of job completion, and JARVICE will store it in the database instead. JARVICE will automatically find objects either in the upstream cluster's configured S3 endpoint, or its database (see below). It may be advantageous to not configure certain downstream clusters with an S3 endpoint in order to address data transfer fees. Note that a configuration that only uses S3 for the control plane but runs all compute endpoints without it does not make logical sense.

Valid Multi-cluster Configurations

Upstream Downstream(s)
Database Database (all)
S3 S3 (same as upstream)
S3 Database (one or more)

Invalid Multi-cluster Configurations

Upstream Downstream(s)
Database S3 (any)
S3 S3 (different from upstream)

Migration and Operational Notes

  • If jobs have already run and completed before the system is configured to use an external S3 service, JARVICE will automatically fetch their output from the database. This is true for any job's output that is not found in the S3 storage.
  • If the external S3 service is disabled after jobs have run and stored their output there, their output will no longer be available (a copy of it is not stored in the database). Subsequent jobs however will store their output in the database.
  • If an external S3 service is used with retention/lifecycle policies configured to prune objects, and a user tries to examine output pertaining to an object that was pruned, they will receive an error in the portal or via the API. This also applies to system administrators using the job detail function in Administration->Jobs

Service Compatibility

JARVICE can support any service that implements the S3 API, provided it offers an HTTPS-based endpoint URL and can respond similarly to Amazon's S3. For the most up to date list of S3 service providers JARVICE has been tested against, please see the External S3-compatible Object Storage Service Compatibility section in the Release Notes.