
Proposal: Webhooks #50

Open
awm33 opened this issue Nov 21, 2018 · 13 comments

@awm33

awm33 commented Nov 21, 2018

I'm proposing an additional use case for Singer taps: webhooks. Many APIs / SaaS products offer webhooks as a way to reduce load on their servers and overcome rate limits.

I'm proposing that a new Singer message type be added with the type "WEBHOOK". A WEBHOOK message is consumed by a tap via stdin. A tap would go into webhook mode using the --webhooks option, consuming WEBHOOK messages from stdin and producing SCHEMA, RECORD, and STATE messages on stdout.

Keeping the webhook logic within the same tap as the RESTful API logic allows code to be shared and keeps a highly related set of code (webhook and REST/SOAP/etc.) in the same project.

Having the WEBHOOK message decouples processing webhook messages from receiving HTTP requests. A webserver implementation is responsible for handling requests, with or without authentication, and for optional queuing or dead-lettering. A message via stdin also allows for tool chaining, such as placing a queue or recorder in between.

The WEBHOOK message would contain the following fields:

  • type - string - Always "WEBHOOK".
  • received_at - datetime string, ISO format - When the webhook request was received. Can be used for the SDC or queuing sequence value later.
  • url - string - The URL of the incoming webhook request.
  • method - string enum - Any valid HTTP 1.1 verb.
  • headers - key/value object containing the headers from the request.
  • body - string - Body of the webhook. It's kept as a string to support JSON, XML, and other types. Binary types can be base64 encoded.
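To make the shape concrete, a WEBHOOK message with the fields above could look like this (the URL, header names, and payload are illustrative only, not part of the proposal):

```json
{
  "type": "WEBHOOK",
  "received_at": "2018-11-21T18:30:00.000000Z",
  "url": "https://hooks.example.com/tap-shopify?token=abc123",
  "method": "POST",
  "headers": {
    "Content-Type": "application/json",
    "X-Shopify-Topic": "orders/create"
  },
  "body": "{\"id\": 820982911946154508, \"email\": \"jon@example.com\"}"
}
```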

The webserver implementation can vary and can easily be customized to be incorporated into other systems. A basic reference webserver can be created to give the community some base code and functionality, while individuals can implement webservers of varying complexity. Advanced webservers may be backed by Kafka, dynamically load and cache taps, etc.

The webserver should handle basic authentication, usually checking some sort of token added to the URL or a header. The webserver can optionally implement a dead-letter queue, which can pass failed messages along later if the authentication issue is transient. Passing the full headers, URL, and body allows for tap-specific passive (always respond 2xx) authentication, such as GitHub's webhook security method.
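For example, once the full headers and raw body are available to the tap, a GitHub-style check is only a few lines (a sketch only; GitHub documents X-Hub-Signature-256 as an HMAC-SHA256 of the raw body keyed with the webhook secret):

```python
import hashlib
import hmac


def is_valid_github_signature(webhook_message, secret):
    """Check a GitHub-style X-Hub-Signature-256 header against the raw body."""
    signature = webhook_message["headers"].get("X-Hub-Signature-256", "")
    expected = "sha256=" + hmac.new(
        secret.encode("utf-8"),
        webhook_message["body"].encode("utf-8"),
        hashlib.sha256,
    ).hexdigest()
    # Constant-time comparison so the check doesn't leak timing information.
    return hmac.compare_digest(signature, expected)
```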

The most basic implementation may be:

singer-webhook-server | tap-shopify -c config.json --webhooks

More advanced multi-tap webservers are also possible.
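For illustration, the tap side of webhook mode could be as small as a loop over stdin. This is a rough sketch using the singer-python helpers; the stream name, schema, and field mapping are hypothetical:

```python
import json
import sys

import singer  # singer-python helpers for writing SCHEMA/RECORD messages

# Hypothetical, minimal stream schema for illustration only.
ORDERS_SCHEMA = {
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "email": {"type": ["null", "string"]},
    },
}


def run_webhook_mode():
    """Consume WEBHOOK messages from stdin and emit Singer messages on stdout."""
    singer.write_schema("orders", ORDERS_SCHEMA, key_properties=["id"])

    for line in sys.stdin:
        message = json.loads(line)
        if message.get("type") != "WEBHOOK":
            continue

        # body is always a string; this example assumes a JSON payload.
        payload = json.loads(message["body"])
        singer.write_record("orders", {"id": payload["id"], "email": payload.get("email")})
```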

@dmosorast
Contributor

To kick off the discussion, I have some quick first thoughts off the top of my head. I like the idea, since webhooks aren't quite solved by Singer as it stands, but I'm having trouble with where it fits into the Singer paradigm (Unix-style, one-responsibility, process pipes).

I'm not sure I understand the need to add a mode to the tap in order to support this. A webhook should be a relatively simple thing, with perhaps some data typing, config, etc., much like a tap would have. However, given the short-lived-process nature of the tap -> target relationship vs. the long-running-process nature of a web server, this sounds like it would be something other than a tap. I worry about muddying the concepts too much by mixing the two.

In order to keep it simple, I could see something like the structure being [tap|webhook-server] -> target, where maybe the webhook server's implementation has a kind of supervisor that makes sure the target's pipe remains open, or restarts if it dies, etc.

@awm33
Author

awm33 commented Nov 21, 2018

@dmosorast Great!

The key thing in my proposal is separating the concerns of handling requests and processing the webhooks by using a core concept in Singer: passing messages via stdin/stdout.

I'm not sure I understand the need to add a mode to the tap in order to support this.

Somehow the tap needs to know to open up stdin rather than begin a sync cycle hitting APIs. I think of --discover and --catalog as trigger modes / commands. Really, I wish Singer used git-style subcommands, such as discover and sync, where webhooks or process-webhooks would be another command / mode.

However, given the short-lived-process nature of the tap -> target relationship vs. a long-running process nature of a web server, this sounds like it would be something other than a tap.

Again, I don't think the webserver should be part of the tap, and webserver specifics should be left up to webserver implementations. The tap is only expected to take in WEBHOOK messages from stdin and produce the common Singer messages on stdout. So there is something other than a tap: a third thing, reusable across taps, the webhook webserver.

Using the message also makes testing and replaying webhooks easier, since it provides a message format for recording requests, allowing tooling to be built around that.

A production-ready webserver may need to Popen the tap and targets and potentially kill and recycle them in case of memory leaks. It's also not that uncommon to kill and restart long-running Python processes. Or, if someone places a message queue (SQS, Kafka, Google Pub/Sub, etc.) in between, they may just run the tap every 10 minutes. I like that the decoupling allows for many options and different infrastructures, while everyone can easily share tap logic. A development webserver could be created that is as simple as possible, and not realistic for production use, sort of like using Flask's built-in webserver for dev and Gunicorn for prod.
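To make that concrete, the Popen wiring inside such a webserver might look roughly like this (a sketch only; the tap/target command lines, the proposed --webhooks flag, and the recycling policy are all implementation choices, not part of the spec):

```python
import json
import subprocess


def start_pipeline():
    """Spawn `tap-shopify --webhooks | target-stitch` and hand back both processes."""
    tap = subprocess.Popen(
        ["tap-shopify", "-c", "tap-config.json", "--webhooks"],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )
    target = subprocess.Popen(
        ["target-stitch", "-c", "target-config.json"],
        stdin=tap.stdout,
    )
    return tap, target


def forward(tap, target, webhook_message):
    """Write one WEBHOOK message to the tap, recycling the pair if either has exited."""
    if tap.poll() is not None or target.poll() is not None:
        tap, target = start_pipeline()
    tap.stdin.write((json.dumps(webhook_message) + "\n").encode("utf-8"))
    tap.stdin.flush()
    return tap, target
```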

@awm33
Author

awm33 commented Nov 21, 2018

Another reason for decoupling: imagine using a Lambda function. Someone could write an adaptor that turns Lambda API Gateway events into WEBHOOK messages and runs the tap and target. Lambda would keep it warm for a while and kill it if there's no activity. API targets like Stitch in particular would work well here, since the destination has an underlying queue (Kafka) that would prevent thrashing the destination (e.g. Redshift) with many connections.
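A sketch of that adaptor idea, assuming the standard (v1.0) API Gateway proxy event shape; the part that feeds the message into a warm tap | target pipeline is left out:

```python
import datetime
import json


def api_gateway_event_to_webhook(event):
    """Translate an API Gateway (v1.0) proxy event into a WEBHOOK message."""
    headers = event.get("headers") or {}
    return {
        "type": "WEBHOOK",
        "received_at": datetime.datetime.utcnow().isoformat() + "Z",
        "url": "https://" + headers.get("Host", "") + event.get("path", "/"),
        "method": event.get("httpMethod", "POST"),
        "headers": headers,
        "body": event.get("body") or "",
    }


def handler(event, context):
    webhook_message = api_gateway_event_to_webhook(event)
    # Feed the message into a warm `tap | target` pipeline here (not shown);
    # printing it just demonstrates the shape of the translated message.
    print(json.dumps(webhook_message))
    return {"statusCode": 200}
```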

Point being, I think there are so many possibilities for handling the webhook requests themselves, but it all boils down to getting an HTTP request to a tap.

@ghost

ghost commented Nov 22, 2018

I think this violates the batch/ephemeral nature of the Singer tap->target workflow. If the desire is to handle an HTTP request containing a payload and ultimately route it to a target, then my suggestion would be to do something like Snowplow's batch collector (a tracking pixel stored in an S3 bucket, served by a CloudFront distribution which writes its logs back to S3), then write a tap to parse the CloudFront log and feed Singer messages to stdout.

@awm33
Author

awm33 commented Nov 22, 2018

@rsmichaeldunn

I think this violates the batch/ephemeral nature of the singer tap->target workflow

Singer is a stream-oriented protocol. It calls a series of records a "stream". The name tap even comes from thinking of data as streams. A tap - "a device consisting of a spout and valve attached to the end of a pipe to control the flow of a fluid". So I wouldn't say Singer has a batch nature; it just tends to be used in batches since it's most often used to pull from APIs.

As far as being ephemeral, I think that's just because of the limited use so far. I believe the protocol itself extends beyond small batches. I'm also saying that HTTP request handling and processing the request should be separated, which makes the webhook request just a data structure passed via stdin to the tap.

Your example is interesting, though I don't think it would work for webhooks since, AFAIK, CloudFront doesn't log POST bodies. But, say it did: the WEBHOOK message would allow you to use that infrastructure for any webhook-supporting tap, rather than requiring everyone using that tap to use AWS CloudFront.

@mbilling

The paradigm shift from batch to stream to realtime stream is interesting.

Assuming the webserver endpoint relies on the sender of the webhook retrying if the tap -> target process fails, then it would make sense to do what you describe.

But assuming realtime delivery (where we almost always return status 200) and taking full responsibility for a webhook request, then we need to add a buffer, hence the Snowplow/S3 proposal.
I would suggest a set of tap protocol readers/writers for whatever intermediate target you prefer (disk/S3/etc.):

singer-io-(destination)-writer
singer-io-(source)-reader
we need these for Node.js, Python, Ruby, etc.

S3 could be the first set of source/destination

@awm33
Author

awm33 commented Nov 24, 2018

@mbilling Without the sender retrying, I would not consider the webhook reliable. Network partitions in particular are an issue there, but there are also AWS systems being temporarily unavailable, errors in code, etc. It's pretty common for webhook implementations to retry, even with exponential backoff. I think they usually just use SQS. On the receiving side, high availability would probably require a queue in between.

This issue covers only protocol / spec changes, the open standard, not specific implementations. Most Singer taps (all, at this point?) are in Python, but they don't have to be. The spec is basic enough that you could create, say, a Java tap pretty easily; it just needs to emit a few different JSON structures to stdout. Python can easily run on lots of different infrastructure, from cron on a VM to Docker. Webservers with high availability usually rely on cloud-specific services like ELB and SQS. I'd like this new message type, delivered through stdin, to avoid locking users of an open standard into a specific cloud technology, or into only a very specific open source stack without the ability to utilize easy-to-use managed services.

@mbilling

I agree that it has to be cloud-provider and language agnostic. Protocol and format only.
Collecting realtime data from IoT devices, browsers, etc. could be supported too.

@awm33
Author

awm33 commented Nov 25, 2018

@mbilling Oh definitely, you could apply the Singer protocol to streaming. You could even do it now, say by creating a tap-salesforce-streaming which runs continuously connected to the Salesforce Streaming API, since in that scenario, similar to hitting the REST API, the tap is initiating communications with the source of data. A webhook is the other way around.

Webhooks require the source of data (usually a SaaS provider) to initiate communications on every new data event/message. I think this is unique enough technologically (inbound data events/messages via HTTP requests), but common enough in practice (many SaaS providers offer webhooks), that creating a new message type specifically for webhooks makes sense.

@awm33
Author

awm33 commented Nov 25, 2018

Another thing to consider: some SaaS providers require webhooks to be registered programmatically. How would a tap handle that? Use another option like --register-webhooks? How would it know what the URL(s) would be? Pass them or a base URL as an option?

@chaholl

chaholl commented Oct 9, 2019

Interesting idea. Dealing with sender retries would require some internal state to prevent duplicate output. I suspect that'll run into threading problems pretty quickly if the server is under any load. I guess you can deal with that internally within the web server but it adds a whole load of complexity.

What about a message bus tap as a simple resilient solution? It can read from something like RabbitMQ and anything can push the messages in there in the first place. Maybe that's a whole other proposal :)

@awm33
Author

awm33 commented Oct 11, 2019

@chaholl I think those are implementation details outside of the spec. With the spec, I'm looking to separate the web server and the processing of messages. There may be some simplistic reference server created, but a prod implementation would likely use a webserver and a queue (SQS, Kafka, RabbitMQ, etc.). We could add it to the message spec, but yes, an ID could be used for deduping. I've typically either 1) used a per-webhook event ID if one is available from the webhook message, or 2) used a UUID generated when the message is received, so idempotency can be maintained downstream.
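As a tiny sketch of that convention (the "X-Event-Id" header is a placeholder; real providers expose their own event ID header or body field):

```python
import uuid


def dedup_key(webhook_message):
    """Prefer a provider-supplied event ID; otherwise mint a UUID at receive time."""
    # "X-Event-Id" is hypothetical; substitute the provider's actual event ID field.
    event_id = webhook_message["headers"].get("X-Event-Id")
    return event_id or str(uuid.uuid4())
```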

This is all implementation-specific though; there could be many different prod implementations, and the spec should be about high-level workflow and message standards.

@rafalkrupinski

rafalkrupinski commented Jan 9, 2022

Why does it need a new message type? What's wrong with a webhook listener emitting a schema configured by the user, and the incoming messages, after validation, as records?
