RFC: Establishing and refining the vision #255

Open
13 tasks
kevinbader opened this issue Sep 3, 2019 · 0 comments
kevinbader commented Sep 3, 2019

A Refined Vision

We need to know what we're aiming at, and users need to know what they can expect from using RIG in their infrastructure. With a clear vision, it also becomes easier to decide whether a new requirement or change request matches the goals of the project.

The purpose of this issue is to formulate that vision and to track the progress of the related tasks (spoiler: we're not quite there yet).

The tasks are defined inline in the following vision statement, which will become a document on the website eventually. Feedback welcome.

  • Update README with the vision statement
  • Update repository tagline with the vision statement
  • Update the website with the text below

Reactive Interaction Gateway a.k.a. RIG

RIG enables event-driven frontends in a secure, reliable and scalable way.

What does that mean, exactly? Well, let's break it down.

Event-driven frontends

Frontends react to what happens on the backend. Upon receiving business events, they can change their state, adapt their UI or use HTTP requests to fetch new data.

This is not just about Server-Sent Events versus polling for data: it also decouples the frontends from the backends, as frontends describe which kinds of events they're interested in without needing to know where, or when, to fetch them.
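As a rough illustration, here is a minimal browser-side sketch of such a frontend. The endpoint URL, the event type, and the payload shape are placeholders for this example, not RIG's documented API:

```typescript
// Minimal sketch of an event-driven frontend (browser-side TypeScript).
// The URL, event type, and payload shape are placeholders -- see the RIG
// documentation for the actual SSE endpoint and subscription API.

const RIG_SSE_URL = "https://rig.example.com/events"; // hypothetical endpoint

// Open a long-lived Server-Sent Events connection.
const source = new EventSource(RIG_SSE_URL);

// React to a business event by updating local state; the frontend never
// needs to know which backend service produced the event, or when.
source.addEventListener("com.example.account.updated", (e: Event) => {
  const event = JSON.parse((e as MessageEvent).data); // assume a JSON payload
  renderAccount(event.data); // hypothetical UI update
});

source.onerror = () => {
  // EventSource reconnects automatically; a real app might surface a
  // "connection lost" indicator here.
  console.warn("SSE connection interrupted, retrying...");
};

function renderAccount(account: unknown): void {
  console.log("account changed:", account);
}
```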

Secure

RIG employs process supervisors, which makes it very unlikely that a node fails as a whole. For example, a fault in the Kafka connection might cause the Kafka subsystem to fail, but the rest of the system is unaffected. The respective supervisor restarts the Kafka process automatically; as soon as the connection is re-established, the node is fully operational again.

Because RIG runs on the BEAM (the Erlang virtual machine), tooling is available for introspection, code updates and instrumentation (e.g., tracing) at runtime, in production.

  • Describe, and link to, the tooling mentioned here.

Verbose logging and a Prometheus metrics endpoint also help with monitoring a node in production.

  • Describe how to change log levels (using env vars and at runtime).

To limit a cluster's resource consumption, an administrator can configure a number of soft limits:

Plugins are sandboxed, as they are either

Reliable

Multiple RIG nodes form a self-healing cluster, as described in the following scenarios.

Scenario 1: A RIG node goes offline and leaves the cluster

After a node has failed, the frontends that were connected to that node reconnect to another node. Upon reconnecting, they are told whether their subscriptions are still in place and whether events have been lost (which depends on whether the frontend's session was hosted on the failed node). If subscriptions need to be set up again or missed data has to be requested, the frontend can act accordingly.
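To make the scenario concrete, here is a hedged browser-side sketch of how a frontend could react to a reconnect. The routes, the re-subscription request body, and the `lastEventId`-based catch-up are illustrative assumptions, not prescribed by RIG:

```typescript
// Illustrative reconnect handling (browser-side TypeScript). The routes and
// request shapes are assumptions made for this sketch.

let lastSeenEventId: string | null = null;

const source = new EventSource("https://rig.example.com/events"); // hypothetical

source.onmessage = (e: MessageEvent) => {
  lastSeenEventId = e.lastEventId || lastSeenEventId;
  handleEvent(JSON.parse(e.data));
};

source.onopen = async () => {
  // After a (re-)connect, make sure our subscriptions exist on the node we
  // ended up on...
  await fetch("https://rig.example.com/subscriptions", { // hypothetical route
    method: "PUT",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ eventTypes: ["com.example.account.updated"] }),
  });
  // ...and ask the backend for anything we may have missed in the meantime.
  if (lastSeenEventId !== null) {
    const missed = await fetch(
      `https://api.example.com/changes?since=${lastSeenEventId}` // hypothetical
    );
    (await missed.json()).forEach(handleEvent);
  }
};

function handleEvent(event: { id: string }): void {
  console.log("event received:", event.id);
}
```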

Any Kafka or Kinesis partitions the node had handled before its failure are reassigned to other nodes in the cluster automatically. As a consequence of how Kafka/Kinesis consumer groups work, messages may be delivered more than once.
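Because of this at-least-once behavior, event handling should be idempotent. A minimal sketch of one way to achieve that on the receiving side, assuming each event carries a unique id (e.g., a CloudEvents id):

```typescript
// Since consumer-group rebalancing can cause redelivery, treat event handling
// as idempotent. One simple approach: remember the ids of recently processed
// events and skip duplicates. The `id` field is an assumed unique identifier.

const seenIds = new Set<string>();
const MAX_REMEMBERED = 10_000;

function handleOnce(event: { id: string }, apply: (e: { id: string }) => void): void {
  if (seenIds.has(event.id)) {
    return; // duplicate delivery -- already applied
  }
  apply(event);
  seenIds.add(event.id);
  // Bound memory use by forgetting the oldest id (Sets preserve insertion order).
  if (seenIds.size > MAX_REMEMBERED) {
    const oldest = seenIds.values().next().value as string;
    seenIds.delete(oldest);
  }
}
```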

Scenario 2: A RIG node comes back online and (re-)joins the cluster

When a node (re-)joins the cluster, it replicates the subscriptions table as well as the JWT blacklist from other nodes. It may be assigned Kafka or Kinesis partitions and starts accepting frontend connections.

Scalable

RIG is designed to be scalable with respect to two key metrics: the number of connected frontends and the number of backend events it can process without introducing a noticeable* delay in transmission.

*For our tests we assume 1 second as the highest tolerable delay between RIG's consumption of a message from Kafka and the reception of that message on the frontend via an established SSE connection.
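One way to sanity-check that budget from the frontend is sketched below. It assumes events carry a producer-side timestamp (e.g., a CloudEvents-style time attribute) and reasonably synchronized clocks, so it measures producer-to-frontend delay rather than exactly the Kafka-to-SSE interval described above:

```typescript
// Rough end-to-end latency check on the frontend: compare an event's
// producer timestamp with the time it arrives over SSE. Assumes the payload
// carries a CloudEvents-style `time` attribute and synchronized clocks.

const MAX_TOLERABLE_DELAY_MS = 1_000; // the 1-second budget mentioned above

const source = new EventSource("https://rig.example.com/events"); // hypothetical

source.onmessage = (e: MessageEvent) => {
  const event = JSON.parse(e.data) as { time?: string };
  if (!event.time) return;
  const delayMs = Date.now() - Date.parse(event.time);
  if (delayMs > MAX_TOLERABLE_DELAY_MS) {
    console.warn(`event arrived ${delayMs} ms after it was produced`);
  }
};
```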

Scalability with respect to the number of connected frontends

To handle large numbers of frontend connections per node, we rely on the lightweight processes ("actors") that BEAM, the Erlang runtime, provides. Each of those processes has its own lifecycle (and even its own garbage collector), and BEAM schedules them across all available CPU cores. They are also memory-efficient and designed to be created and destroyed very quickly. These properties make them ideal for running TCP connections, providing the isolation and runtime efficiency needed for handling the long-lived connections in RIG.

  • Benchmarks

Scalability with respect to the number of processed backend events

Designed as the event-handling backbone for "everything UI" in a microservice architecture, RIG must be able to process large amounts of data. The trick: RIG immediately discards events that no users are subscribed to.

For example, in a banking system, the backend might generate a million events per second that need to be consumed by RIG. At the same time, only a few thousand users are online at any point in time, which allows RIG to drop most of the messages right at the consuming node.
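The following sketch only illustrates this filtering idea; RIG's actual subscription model is richer (per-connection constraints, extractors, etc.), and the data structures here are not its internals:

```typescript
// Simplified illustration of dropping events nobody listens to.

type BusinessEvent = { eventType: string; data: unknown };

// eventType -> ids of the connections that subscribed to it
const subscriptions = new Map<string, Set<string>>();

function onEventConsumed(event: BusinessEvent): void {
  const targets = subscriptions.get(event.eventType);
  if (!targets || targets.size === 0) {
    return; // nobody is interested -- drop right at the consuming node
  }
  for (const connectionId of targets) {
    forwardToConnection(connectionId, event); // hypothetical delivery function
  }
}

function forwardToConnection(connectionId: string, event: BusinessEvent): void {
  console.log(`would forward ${event.eventType} to connection ${connectionId}`);
}
```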

  • Benchmarks