Archival, search, and replay of Amazon Kinesis records

aheiss1/flux-capacitor

flux-capacitor

Without using any costly database, this solution complements Amazon Kinesis with the following capabilities:

  • Long-term archival of records.
  • Making both current and archived records accessible to SQL-based exploration and analysis.
  • Replay of archived records, with support for:
    • Key-value compaction, so that only the last record for each key is replayed.
    • Bounded replay (one needn’t replay the full archive).
    • Filtered replay (only records matching some criteria are replayed).
    • Annotating records as they are replayed in order to alter consumer behavior, such as to force an overwrite.
    • With consumer cooperation, some definition of eventual consistency with respect to records that arrive on a stream concurrently with a replay operation, without requiring this solution to mediate the flow of the stream.
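
Since replay is not implemented yet, the following is only a rough Python sketch of how the bounded, filtered, compacted, and annotated replay options could compose. The record shape (`seq`, `key`, `data`), the `replay` function, and the `replay` annotation field are all hypothetical and do not come from this project.

```python
from typing import Callable, Iterable, Iterator, Optional


def replay(
    archive: Iterable[dict],
    start_seq: Optional[int] = None,
    end_seq: Optional[int] = None,
    predicate: Optional[Callable[[dict], bool]] = None,
    compact: bool = False,
    annotations: Optional[dict] = None,
) -> Iterator[dict]:
    """Yield archived records for replay (hypothetical record shape)."""
    # Bounded replay: keep only records inside the inclusive [start_seq, end_seq] window.
    selected = [
        r for r in archive
        if (start_seq is None or r["seq"] >= start_seq)
        and (end_seq is None or r["seq"] <= end_seq)
    ]

    # Filtered replay: keep only records matching the caller's criteria.
    if predicate is not None:
        selected = [r for r in selected if predicate(r)]

    # Key-value compaction: later records overwrite earlier ones with the same key,
    # then the survivors are put back into archive order.
    if compact:
        last_by_key = {}
        for r in selected:
            last_by_key[r["key"]] = r
        selected = sorted(last_by_key.values(), key=lambda r: r["seq"])

    # Annotation: attach hints for the consumer without mutating the archived record.
    for r in selected:
        out = dict(r)
        if annotations:
            out["replay"] = dict(annotations)
        yield out


if __name__ == "__main__":
    archive = [
        {"seq": 1, "key": "a", "data": "a1"},
        {"seq": 2, "key": "b", "data": "b1"},
        {"seq": 3, "key": "a", "data": "a2"},
    ]
    for record in replay(archive, compact=True, annotations={"force_overwrite": True}):
        print(record)
```

In this sketch the filter runs before compaction, so a compacted replay yields the last matching record per key. Applying compaction first would instead yield the last record per key only if it matches. Which of the two is wanted is a design decision the sketch does not settle.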

Project Status

  • In active development for use at CommerceHub.
  • Capable of using SQL to search a stream archive and a live stream.
  • Stream archival capability to come next.
  • Message replay capability to follow.

Assumptions and Applicability Constraints

  • This is mostly an integration project, light on actual software. The AWS CLI will be used, and is assumed to be installed and configured.
  • This will probably be more of an ephemeral tool than a service, but the archival portion will have to run at least once every 24 hours (the default Kinesis record retention period) so that no records are missed.
  • The initial implementation might only support JSON records, but further contributions should be able to remove that as a requirement.
  • The initial implementation might only support a single Kinesis stream, but further contributions should be able to remove that as a requirement.
  • Data and cluster security are currently left to the user.

Technical Goals

  • Configure and launch a process (TBD, there are many options) to archive blocks of Amazon Kinesis records to Amazon S3 before they expire, possibly via Amazon EMRFS.
  • Launch an Amazon EMR cluster including the Hive application.
  • Deploy Apache Drill to the cluster.
  • Configure Apache Drill to read archived records from Amazon S3, possibly via EMRFS.
  • Configure Amazon EMR Hive to expose an Amazon Kinesis stream as an externally-stored table.
  • Configure the Amazon EMR Hive Metastore for consumption by Apache Drill.
  • Configure Apache Drill to read from Amazon Kinesis via Amazon EMR Hive.
  • To the greatest extent possible without storing another copy of the data, provide a unified and de-duplicated view spanning current and archived Amazon Kinesis records.
  • (TBD) Provide a basic UI or API to initiate search and replay operations, and monitor progress.
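
As an illustration of the "unified and de-duplicated view" goal, here is a minimal Python sketch that merges archived and live records and drops duplicates. It assumes each record carries its shard ID and its Kinesis sequence number, which identifies a record within its shard. The field names `shard_id` and `sequence_number` and the function `unified_view` are hypothetical, and the project itself intends to provide this view through SQL rather than through code like this.

```python
from itertools import chain
from typing import Iterable, Iterator


def unified_view(archived: Iterable[dict], live: Iterable[dict]) -> Iterator[dict]:
    """Yield archived records, then live ones, skipping records already seen."""
    seen = set()
    for record in chain(archived, live):
        # Assumed identity: a sequence number is only meaningful within its shard.
        identity = (record["shard_id"], record["sequence_number"])
        if identity in seen:
            continue
        seen.add(identity)
        yield record


if __name__ == "__main__":
    archived = [
        {"shard_id": "shard-0", "sequence_number": "1", "data": "x"},
        {"shard_id": "shard-0", "sequence_number": "2", "data": "y"},
    ]
    live = [
        {"shard_id": "shard-0", "sequence_number": "2", "data": "y"},
        {"shard_id": "shard-0", "sequence_number": "3", "data": "z"},
    ]
    for record in unified_view(archived, live):
        print(record)
```

The sketch keeps every identity in memory, which would not scale to a large archive; it is meant only to show what "de-duplicated" means for records present in both the archive and the live stream.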

Prerequisites

  • Bash shell installed at /bin/bash
  • AWS CLI installed and configured with your credentials and default region (you can run aws configure to do so interactively)

Getting Started

  • Create a config file. Either:
    • Make a copy of conf/defaults.conf and edit the copy, or
    • Create a new file that will contain only overrides, and import the defaults by following the directions at the top of conf/defaults.conf
  • Run ./upload-resources <config file>
  • Run ./launch-cluster <config file> and note the cluster-id that is printed to stdout; future commands will require it.
  • Run ./wait-until-ready <cluster-id>
  • Run ./forward-local-ports <cluster-id> <private-key-file>
    • As with any new SSH host, you will have to accept an authenticity warning the first time you connect to a cluster.
    • Once it's forwarding, this process will not exit, nor print any output.
  • Run ./terminate-clusters <cluster-id> when done to avoid recurring charges.
  • For additional advanced operations, explore the emr subcommand of the AWS CLI.
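
The upload, launch, and wait steps above can also be chained from a small driver. The following Python sketch is one possible way to do that and is not part of this repository. It assumes that the last non-empty line printed by ./launch-cluster is the cluster-id, which may not match the script's actual output. The function names `launch_and_wait` and `terminate` are made up for the example.

```python
import subprocess
from typing import Callable


def launch_and_wait(config_file: str, run: Callable = subprocess.run) -> str:
    """Upload resources, launch a cluster, wait for it, and return its cluster-id."""
    run(["./upload-resources", config_file], check=True)
    launched = run(
        ["./launch-cluster", config_file],
        check=True,
        capture_output=True,
        text=True,
    )
    lines = [line.strip() for line in launched.stdout.splitlines() if line.strip()]
    if not lines:
        raise RuntimeError("launch-cluster printed no cluster-id")
    # ASSUMPTION: the cluster-id is the last non-empty line on stdout.
    cluster_id = lines[-1]
    run(["./wait-until-ready", cluster_id], check=True)
    return cluster_id


def terminate(cluster_id: str, run: Callable = subprocess.run) -> None:
    """Terminate the cluster to avoid recurring charges."""
    run(["./terminate-clusters", cluster_id], check=True)


if __name__ == "__main__":
    import sys

    cluster = launch_and_wait(sys.argv[1])
    print(cluster)
```

Port forwarding is left out on purpose: ./forward-local-ports does not exit once it is forwarding, so it is easier to run by hand in a separate terminal. The `run` parameter exists only so the sequence can be exercised without launching a real cluster.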
