
Log datasette access IPs #3669

Open · jdangerx wants to merge 4 commits into main
Conversation

@jdangerx (Member) commented Jun 12, 2024

Overview

Relates to #3664

Right now, our datasette instance only logs the IP of the load balancer it sits behind. Useless!

If we shove datasette behind nginx, we can then use nginx's "real IP" module to grab the IPs of the clients that the load balancer is forwarding.
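Roughly, the mechanism looks like this — a minimal sketch, not this PR's exact config; the header name, trusted address range, and upstream port are all assumptions:

```nginx
# Minimal sketch, assuming the Fly.io proxy passes the original client
# address in a header. Header name, trusted range, and upstream port are
# placeholders, not this PR's actual values.
http {
    server {
        listen 8080;
        set_real_ip_from 0.0.0.0/0;     # trust the load balancer's addresses
        real_ip_header Fly-Client-IP;   # header carrying the real client IP
        location / {
            proxy_pass http://127.0.0.1:8081;   # hypothetical datasette port
        }
    }
}
```

With that in place, nginx's access log records the rewritten client address instead of the load balancer's.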

This PR does that, and:

  • reorganizes the deploy logic slightly so there is less duplication between the various deploy targets: "metadata" is an early exit, but everything else should share all the logic except for "how do we take this Dockerfile and deploy it?" (see the sketch after this list)
  • makes the --local target actually use Docker - this makes the deploy slightly slower, but also much closer to what actually gets deployed on fly.io
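A minimal sketch of that shape — every function name here is an assumption, not the actual publish.py code:

```python
import logging

# Hypothetical sketch of the deploy dispatch described above; all helper
# names are assumptions, not the real publish.py code.
def deploy_datasette(deploy: str) -> None:
    write_metadata()                        # shared by every target
    if deploy == "metadata":
        logging.info("Only writing metadata. Aborting now.")
        return                              # early exit: metadata only
    dockerfile = build_dockerfile()         # also shared by every target
    deployers = {
        "local": run_container_locally,     # now goes through Docker too
        "staging": deploy_to_fly,
        "production": deploy_to_fly,
    }
    deployers[deploy](dockerfile)           # the only step that differs
```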

Testing

How did you make sure this worked?

I did a fast deploy to our staging server:

$ python publish.py --staging --only ferc2_xbrl.sqlite --only ferc6_xbrl.sqlite

Then I went to the staging url: https://catalyst-coop-pudl-staging.fly.dev/

And I saw my IP show up in the fly.io live logs:

https://fly.io/apps/catalyst-coop-pudl-staging/monitoring

I verified the IPs with a quick `curl icanhazip.com`, and by asking other people to do the same.

To-do list

  1. Review the PR yourself and call out any questions or issues you have

@@ -0,0 +1 @@
load_module modules/ngx_http_realip_module.so;

@jdangerx (Member Author) commented:

This could technically also live in nginx.conf but this is more idiomatic.

@@ -8,4 +8,5 @@ ls
 mv all_dbs.tar.zst /data
 zstd -f -d /data/all_dbs.tar.zst -o /data/all_dbs.tar
 tar -xf /data/all_dbs.tar --directory /data
-datasette serve --host 0.0.0.0 ${DATABASES} --cors --inspect-file inspect-data.json --metadata metadata.yml --setting sql_time_limit_ms 5000 --port $PORT
+datasette serve --host 0.0.0.0 ${DATABASES} --cors --inspect-file inspect-data.json --metadata metadata.yml --setting sql_time_limit_ms 5000 --port $DATASETTE_PORT > /dev/null &

@jdangerx (Member Author) commented:

We want to suppress the datasette serve logs since we get the same information in the nginx access logs.

&& ln -sf /dev/stdout /var/log/nginx/access.log \
&& ln -sf /dev/stderr /var/log/nginx/error.log

COPY . /app

@jdangerx (Member Author) commented:

Moved this down to the bottom since all the other setup changes much less frequently than the contents of the Docker context; this helps with build caching (see the sketch below).
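As an illustration of the caching idea — the base image and package names here are placeholders, not this PR's actual Dockerfile:

```dockerfile
# Illustrative sketch only. Docker reuses cached layers up to the first one
# that changes, so the rarely-changing setup goes first and the frequently-
# changing build context goes last.
FROM python:3.12-slim
RUN apt-get update && apt-get install -y nginx    # system setup: rarely changes
COPY requirements.txt /app/requirements.txt
RUN pip install -r /app/requirements.txt          # dependencies: change occasionally
COPY . /app                                       # whole context: changes every build
```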

@jdangerx jdangerx requested review from a team and zaneselvans and removed request for a team June 12, 2024 20:46
Base automatically changed from 3664-fly-staging-environment to main June 12, 2024 22:14
@@ -2216,7 +2216,7 @@ def from_data_source_ids(
         xbrl_resources=xbrl_resources,
     )

-    def to_yaml(self) -> None:
+    def to_yaml(self) -> str:

@jdangerx (Member Author) commented:

mypy was complaining about this: the method actually returns the rendered YAML string, so the annotation needed to be `str` rather than `None`.

}
http {
    server {
        listen 8080;

@jdangerx (Member Author) commented:

In theory we could turn this (and the Dockerfile) into a jinja template, which might be cleaner, but it felt like gilding the lily to me.

@jdangerx jdangerx self-assigned this Jun 13, 2024
@zaneselvans zaneselvans requested review from bendnorman and removed request for zaneselvans June 13, 2024 16:37
@zaneselvans added the cloud (Stuff that has to do with adapting PUDL to work in a cloud computing context.) and datasette (Issues related to accessing PUDL data via Datasette.) labels Jun 13, 2024

@bendnorman (Member) left a comment:

This all mostly makes sense to me! On the staging server I get a 502 but when I reload the page it works.

Also, how do we transfer the logs into a bucket?

    check_tables_have_metadata(metadata_yml, databases)

    if deploy == "metadata":
        logging.info("Only writing metadata. Aborting now.")

@bendnorman (Member) commented:

You maybe forgot to abort the script here?


@jdangerx (Member Author) commented Jun 17, 2024:

Sure did, fixing! Good catch :) Fortunately we only actually deploy if we actively choose a valid deployment target 😅
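Presumably the fix looks something like this — hedged, since the actual commit isn't shown in this thread:

```diff
 if deploy == "metadata":
     logging.info("Only writing metadata. Aborting now.")
+    return  # hypothetical fix: stop before falling through to the deploy steps
```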

* use nginx real IP module
* get nginx logs to show in stdout/err and suppress datasette/gunicorn logs

@jdangerx (Member Author) commented:


Yeah, the 502 happens because nginx starts up and becomes available before datasette has finished initializing. Since the staging server shuts down when inactive, you'll see a 502 on the first request whenever the VM comes out of inactivity.

In terms of how to transfer logs to a bucket: the official solution is fly-log-shipper: https://github.com/superfly/fly-log-shipper.

@jdangerx jdangerx requested a review from bendnorman June 17, 2024 21:54

@bendnorman (Member) commented:


We can't declare victory on tracking logs until we set up the sink, right? Any idea how fly-log-shipper actually runs? Also, it looks like there isn't a GCS sink :(


@jdangerx (Member Author) commented:


We'd have to set up the sink to declare victory, yes. The quick-start in their docs seems reasonable.

The confusing thing about the GCS sink is that it's clearly supported by Vector, the library that fly-log-shipper uses, but fly-log-shipper's docs omit it. So we could try configuring Vector for GCS ourselves, or we could shove the logs into S3 and then pull them into GCS separately.
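If we went the Vector route, a sketch might look like this — hypothetical: the sink options are from Vector's gcp_cloud_storage sink as I understand it, and the source name, bucket, and credentials path are all made up:

```toml
# Hypothetical vector.toml fragment -- not from fly-log-shipper's docs.
[sinks.gcs_logs]
type = "gcp_cloud_storage"
inputs = ["nats"]                            # source name is a placeholder; Fly ships logs via NATS
bucket = "catalyst-coop-datasette-logs"      # hypothetical GCS bucket
credentials_path = "/secrets/gcp-sa.json"    # hypothetical service-account key
encoding.codec = "json"
```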


@bendnorman (Member) left a comment:

Sounds good! Looks like those next steps are tracked in the issue 👍
