Implementation metadata #6

lassik opened this issue Nov 11, 2020 · 30 comments

@lassik

lassik commented Nov 11, 2020

The schemedoc/implementation-metadata repo collects info about as many Scheme implementations as we can into S-expression files. For example, here's the current data for Guile. The format is subject to change; it hasn't yet settled into a good form.

Anyway, if we support (scheme-implementation ...) top-level forms in the metadata file in addition to (package ...) top-level forms, then we can use the same filename for info about releases of implementations. IMHO, as with (package ...), the form should be optional, and all fields in it should be optional as well. This makes it easy to adopt: add a couple of fields at first and gradually add more if desired.
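For illustration, such a file might look something like the sketch below (the field names and values are only examples, not a settled format; every form and every field would be optional):

(scheme-implementation
 (name "Guile")
 (version "3.0.4")
 (homepage "https://www.gnu.org/software/guile/"))

(package
 (name "example-package")
 (license "MIT"))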

Thoughts?

@diamond-lizard

I like this format, because it focuses on just the metadata that we need. There's nothing in it about how to build a package, for instance, and it's not a free-for-all.

I'm not a fan of leaving all the fields optional, however. The main value a SRFI like this will add (if we adopt the direction of small, narrow focus) is that it will tell people which Scheme implementations support which SRFIs, and if they don't give us that information, what's the point of complying with our SRFI in the first place?

@lassik

lassik commented Nov 11, 2020

HTTP is a huge success even though it tells people nothing about what's at the other end of an HTTP GET. There can be a lot of value in things that are just a conduit or a known meeting point for exchanging other things.

I'll repeat my point in the other thread(s): all of the information we've compiled so far has to go somewhere. If there are separate implementation-srfi-list.scm, implementation-contact-info.scm and implementation-license-info.scm, there soon comes a point where the total is no simpler than having them all in one file to begin with.

@diamond-lizard

HTTP is a huge success even though it tells people nothing about what's at the other end of an HTTP GET.

But I don't think there were any standards, and certainly not many, that did what HTTP did when it came around.

How many packaging standards are there? At least one for every Scheme. Plus some more generic ones.

We'd be competing with all those.

There can be a lot of value in things that are just a conduit or a known meeting point for exchanging other things.

Yes, if we were providing a service in gathering and storing metadata, I think many people would find it valuable, especially as there really isn't anything like that for all Schemes right now (apart from the SRFI Table itself).

I'll repeat my point in the other thread(s): all of the information we've compiled so far has to go somewhere. If there are separate implementation-srfi-list.scm, implementation-contact-info.scm and implementation-license-info.scm, there soon comes a point where the total is no simpler than having them all in one file to begin with.

If we do proceed in the direction of collecting and providing more information than just which Scheme supports which SRFI, then I am in favor of including all of that information in one file.

But I am against including things that are implementation details such as compiler options and instructions, scripts, and so forth that at least Chicken packages sometimes come with. And yet Chicken sometimes needs that information to deploy its packages. For this reason, I don't think this would make the best packaging format.

That said, I agree that if we were to go in the direction of standardizing a packaging format that would be useful to every Scheme we'd have to be very generic and allow optional addition of custom fields.

@lassik

lassik commented Nov 11, 2020

But I don't think there were any standards, and certainly not many, that did what HTTP did when it came around.

There have been many competing ways to transfer files and other data for as long as there have been computer networks. Part of HTTP's success is that the idea was one of the simplest: just GET /whatever, read the response, and close the connection. It also worked well together with other useful things, notably using URLs for addresses. Later, content negotiation was added so you could ask for only the kind of thing you understand (the Accept: header field). That's similar to having a cond-expand or skipping unknown S-expressions. Hyperlinks in HTML are like the include we are considering.

How many packaging standards are there? At least one for every Scheme. Plus some more generic ones.

Indeed, if we invent a new one, it will just be added to the pile.

The trick is to not invent anything new unless we get approval from at least some of the people who made the existing ones. If we make something that's just a merging of the existing formats, without requiring anything extra, and is simple to adopt, we have a chance of unifying them eventually.

Yes, if we were providing a service in gathering and storing metadata, I think many people would find it valuable, especially as there really isn't anything like that for all Schemes right now (apart from the SRFI Table itself).

And once we consider that the package's own metadata, the implementation's package index, and any aggregated indexes, need to contain (some subset of) similar data for each package, it makes sense to strive to use the same field names and value formats in all of them. If we don't, we need to write filter code to munge the same information into slightly different variations, which can be done but is not ideal.

If we add up all of these observations, the simplest solution is to gather all the fields in the existing formats, remove duplication (e.g. using a symbol vs a string for the same thing, or a slightly different field name, etc.), and make everything optional so that if someone has trouble with some part of it, they can always leave it out.

The key is to think more like a museum curator or taxonomist, instead of a designer who invents entirely new things.

But I am against including things that are implementation details such as compiler options and instructions, scripts, and so forth that at least Chicken packages sometimes come with. And yet Chicken sometimes needs that information to deploy its packages. For this reason, I don't think this would make the best packaging format.

If you have:

$ cat package.scm
(package
  (name "termcap"))
$ cat package.chicken.scm
(package
  (cflags "-ltermcap"))

It's not much different from:

$ cat package.scm
(package
  (name "termcap")
  (cond-expand (chicken (cflags "-ltermcap"))))

All of that information has to go somewhere in any case.

For example, check the compatibility files in chez-srfi. chez-srfi is for R6RS which doesn't have cond-expand, so implementation-specific code is placed in one file per implementation and there's some import-time magic to find those files. It works, but you end up with many little files that sometimes have like 5 lines each. It's less flexible than cond-expand and probably not that much cleaner.

That said, I agree that if we were to go in the direction of standardizing a packaging format that would be useful to every Scheme we'd have to be very generic and allow optional addition of custom fields.

Great!

@lassik

lassik commented Nov 11, 2020

You know the filter function from functional programming, right? If all the information is in a big S-expression (which can be assembled from parts using (include "...") or written all in one file), any consumer of that file can filter out all the fields they don't care about, and produce a new S-expression containing only the relevant stuff.
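As a rough sketch of that idea, assuming the file has been read into a list of top-level forms (filter and filter-map are from SRFI 1):

(import (scheme base) (srfi 1))

;; Keep only the fields the consumer cares about in each top-level form,
;; dropping any form that ends up with no interesting fields.
(define (select-fields wanted? forms)
  (filter-map
   (lambda (form)
     (let ((kept (filter wanted? (cdr form))))
       (and (pair? kept)
            (cons (car form) kept))))
   forms))

;; Example: keep only the srfi fields.
;; (select-fields (lambda (field) (eq? 'srfi (car field)))
;;                '((package (name srfi-2) (srfi 2 external part))))
;; => ((package (srfi 2 external part)))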

@diamond-lizard

The trick is to not invent anything new unless we get approval from at least some of the people who made the existing ones. If we make something that's just a merging of the existing formats, without requiring anything extra, and is simple to adopt, we have a chance of unifying them eventually.

If we are going to go the unified package format route, it makes sense to have as much in common with existing formats as possible, so I think you're on the right track.

However, I'm still not convinced we should be going that route. I'm not solidly opposed to it, but I still think it'll be a harder sell and take much longer than to create a narrowly focused SRFI-index metafile format.

But we should really get more input, I think, before we proceed with this idea, even as a pre-SRFI. I'm really not the best qualified to opine on this, as my experience in the Scheme and Chicken world is very brief and limited.

I think I've offered as much insight as I'm capable of at this point. There are much more experienced and knowledgeable people than myself who can give you much deeper insight and advice.

I'm not bowing out of the project, but I'd really like to hear what some experienced representatives of various Schemes have to say on the grand unified package format idea.

@lassik

lassik commented Nov 11, 2020

Sure, let's do that!

The thing is, our design is loosely coupled: the container format (unified S-expression file vs separate files) is almost completely orthogonal to the field names and values we are inventing/curating. So even if the container doesn't fly, the curation work is still useful. We don't end up wasting much effort no matter what the outcome is.

@diamond-lizard

If we do float the idea, I think there should either be two separate proposals or one proposal which offers two alternatives: a lean one focusing just on providing/consuming SRFI-compatibility metadata, and the other being the unified package format.

@lassik

lassik commented Nov 11, 2020

Would the SRFI compatibility data be complex enough to warrant its own SRFI? What information would it have other than a list of the SRFI numbers?

@diamond-lizard

Would the SRFI compatibility data be complex enough to warrant its own SRFI? What information would it have other than a list of the SRFI numbers?

There are some short SRFIs out there. In some ways they're the best kind when they're short, useful, and easy to implement.

Things that would be useful to know:

  • The name of the Scheme implementation
  • The version of the Scheme implementation
  • SRFI number
  • Is that SRFI: part of core or in a library
  • Is that SRFI: fully or partially implemented

For SRFIs, that's all I can think of that we'd need to know.

But, again, maybe others have some other ideas or needs.

If we had such metadata deliberately exposed, your scripts wouldn't need to scrape anything. They could just consume the data directly and populate the table from there. No hunting around, no uncertainty, no mess.

And it should be super easy to implement on both the producer and consumer side.
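For example, the consumer side could be little more than a read loop over the downloaded file (a minimal R7RS sketch; the filename is just an example):

(import (scheme base) (scheme read) (scheme file))

;; Read every top-level S-expression from a metadata file into a list.
(define (read-metadata filename)
  (call-with-input-file filename
    (lambda (port)
      (let loop ((forms '()))
        (let ((form (read port)))
          (if (eof-object? form)
              (reverse forms)
              (loop (cons form forms))))))))

;; (read-metadata "srfi-support.scm")  => a list of the file's top-level forms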

@lassik

lassik commented Nov 11, 2020

Where would this metadata be stored? That question is almost more difficult than the format. It raises the question of who is responsible for keeping it up to date.

Partially implemented SRFIs are possible in principle, but by convention, we try to steer clear of them; if a SRFI is difficult to implement, we'd rather split it into two different SRFIs.

@diamond-lizard

Personally, I think the most natural thing these days is to serve it up at and consume it from some URL somewhere.

It's true that it might not be quite as definitive or great for archival purposes as putting it in a tar file, but the archive of this information could be the metadata server which consumes the metadata at these URLs.

What do you think?

@lassik

lassik commented Nov 11, 2020

Do you mean something like the following:

  • We have one file per Scheme implementation.
  • Each implementation puts the file at a known URL, e.g. the file for Chicken 5 at https://call-cc.org/srfi-support-5.scm.
  • The contents of this file are something like the following:
(scheme-implementation-name "Chicken")
(scheme-implementation-version 5)
(srfi 0 core full)
(srfi 1 external full)
(srfi 2 external part)

@diamond-lizard

By the way, something else I've just thought of that I think we have to take into consideration: what do we do if someone makes a mistake or needs to revise some of this information?

It's easy enough to change the data on the producer side, but if the consumer has already consumed it then the next time they look there'll be a discrepancy.

How do we deal with that? This may not need to be part of the SRFI, but we should think about it.

As far as SRFI Table generation goes, it could just serve up the latest information, even if it differs from the old information.

From an archival standpoint, I suppose the consumed information could be put into version control and have frequent backups, so that you'd be able to roll back to previous versions.

Yet another concern is authentication and authorization. Should the consumer always trust the producer? Or does the producer need to authenticate itself to prove that it's an authoritative source?

If the consumer just trusts that whatever data's at a particular URL is true, then it's easy to implement on both sides. If we want more assurances, it gets trickier.

@diamond-lizard

Do you mean something like the following:

* We have one file per Scheme implementation.

* Each implementation puts the file at a known URL, e.g. the file for Chicken 5 at `https://call-cc.org/srfi-support-5.scm`.

* The contents of this file are something like the following:
(scheme-implementation-name "Chicken")
(scheme-implementation-version 5)
(srfi 0 core full)
(srfi 1 external full)
(srfi 2 external part)

That sounds reasonable to me. What do you think?

@lassik

lassik commented Nov 11, 2020

I'd expect most consumers to simply overwrite their old data with the new data every time they update.

If they download data from multiple URLs, they should be prepared for the possibility that some of those are reachable during the update but others are not. For this, they'd need to keep track of which part of their output data came from which input URL, and not touch that part of the output if the corresponding input could not be downloaded.
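A sketch of that bookkeeping, where try-download is a hypothetical procedure returning the parsed forms from a URL or #f on failure, and old-results is an association list of (url . forms) pairs from the previous run:

(import (scheme base))

;; Merge a fresh round of downloads into the previous results, keeping
;; the old entry for any URL whose download failed this time.
(define (update-sources urls old-results)
  (map (lambda (url)
         (let ((fresh (try-download url)))   ; hypothetical downloader
           (cons url
                 (or fresh
                     (cond ((assoc url old-results) => cdr)
                           (else '()))))))
       urls))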

To avoid tampering in case old data has changed and the consumer doesn't want it to change, either an equality check (like Scheme's equal?) or a cryptographic hash like SHA-1 or SHA-256 is fine.

Authentication = encrypted connection + permission check. HTTPS solves both problems: encryption is built in, and since HTTP GET is read-only and the server admin has hopefully prevented unauthorized people from writing files on the server, permission checks are solved too. Alternatively, git over a secure protocol (HTTPS or SSH) also takes care of encryption and permission checks. I wouldn't worry about these concerns on the level of abstraction we are working on.

The ability to use the git history to roll back after mistakes is a good point. It's generally a good idea to track almost any kind of text file in Git. (That doesn't mean Git should be the official interface used by the consumer.)

@diamond-lizard

diamond-lizard commented Nov 11, 2020

I'd expect most consumers to simply overwrite their old data with the new data every time they update.

Yes, I think that makes sense because most people will probably be interested in what the current version supports, not what it used to support.

If they download data from multiple URLs, they should be prepared for the possibility that some of those are reachable during the update but others are not. For this, they'd need to keep track of which part of their output data came from which input URL, and not touch that part of the output if the corresponding input could not be downloaded.

To avoid tampering in case old data has changed and the consumer doesn't want it to change, either an equality check (like Scheme's equal?) or a cryptographic hash like SHA-1 or SHA-256 is fine.

What is our threat model in this case? What are we protecting against? Deliberate modification? Accidental modification? An attack? Could this information be exploited in an attack somehow? Is that something we want to worry about or try to prevent? Or do we just care about accidental modification (i.e. some data file corruption)?

SHA-1 has ostensibly been broken, and GitHub, for example, is moving away from using it. So if we were to use a hash, we might want to consider using something else.

Authentication = encrypted connection + permission check. HTTPS solves both problems: encryption is built in, and since HTTP GET is read-only and the server admin has hopefully prevented unauthorized people from writing files on the server, permission checks are solved too. Alternatively, git over a secure protocol (HTTPS or SSH) also takes care of encryption and permission checks.

HTTPS or SSH could, at best, only ensure you're getting the data from the server you intended to get it from, not that it was actually generated by someone who was authorized to do so. If the server has been compromised, then so, potentially, is your data. That's not to mention ways of getting around HTTPS or SSH.

But should we care about this? What's the worst thing that could happen if the data is completely wrong? Is that actually exploitable or could it cause some kind of critical failure that we want to prevent?

I'm struggling to think of a way that this could have horrible consequences, and coming up blank.

I think that speaks in favor of this kind of simple format. Because, unlike data which contains names and URLs to get software from, or computer instructions, it's pretty hard to exploit.

I wouldn't worry about these concerns on the level of abstraction we are working on.

My instinct is that you're probably right.

The ability to use the git history to roll back after mistakes is a good point. It's generally a good idea to track almost any kind of text file in Git. (That doesn't mean Git should be the official interface used by the consumer.)

Agreed. But that probably doesn't need to be part of the SRFI. Just something to think about.

@lassik

lassik commented Nov 11, 2020

From an archival standpoint, I suppose the consumed information could be put into version control and have frequent backups, so that you'd be able to roll back to previous versions.

If we store the authoritative source copy of the metadata file in a Scheme implementation's git repo, and the release tarballs are made from that repo, we get the best possible archival and backup in the easiest possible way.

By the way, something else I've just thought of that I think we have to take into consideration: what do we do if someone makes a mistake or needs to revise some of this information?

Indeed, it's very easy to make mistakes when editing these kinds of files by hand. Most programmers don't enjoy routine manual work, and are not good at concentrating when doing it, so it's easy to copy/paste the wrong thing or overlook some number. That's why it's ideal to have a "single point of truth" updated by the people who are directly responsible for that component (e.g. the author of a library or Scheme implementation that provides a SRFI), and everyone else who needs that information uses an automated program that downloads the authoritative information (or some derivative of it) and transforms it into the format that is needed.

Basically, all problems of the general form that we're dealing with here are funnel problems. There is some original source information hand-written in some authoritative place. Then a bunch of programs (or people) read that info, and output some transformation of that info or combine it with some other info. Other programs (or people) can then take that output, transform and combine it, and output some more info, etc.

Some important observations arise:

  • The "programs (or people)" parts are probably going to be more reliable if they're programs instead of people. Programmers are generally not good at manual detail work, but they are good at writing programs and fixing them.

  • By extension, the entire network is more reliable if it has more programs and fewer people :)

  • If the network is made of reliable programs (or people), we can add any number of new programs (or people) and it will still stay reliable. But if there is one unreliable part, every other part depending on it will be working with unreliable data.

This shows that the main problem in the whole network is ensuring that the original source data is reliable. Everything else is just a matter of writing simple programs. And since any program in the network can store its output in a file, which can be uploaded to an HTTP URL, there isn't that much need to worry about the intermediate formats. If a particular kind of intermediate format is useful, someone can just write a program to generate it.

@lassik

lassik commented Nov 11, 2020

there isn't that much need to worry about the intermediate formats. If a particular kind of intermediate format is useful, someone can just write a program to generate it.

However, neither is there any particular benefit to having different formats for the same information. If for example the information about SRFI support can be encoded the same way in all of the following places:

  • A Scheme implementation's git repo
  • An external package's git repo
  • A package collection's index
  • Any aggregations made from the above

Then the whole community will save effort. This, IMHO, is the problem we should address, in addition to the (more important) problem of making a convention for storing the authoritative data as close to the source as possible, so that it has the best chance of staying up to date.

@lassik

lassik commented Nov 11, 2020

What is our threat model in this case? What are we protecting against? Deliberate modification? Accidental modification? An attack?

There is no threat; there's no money or fame in publishing misleading information about SRFI support. If you list a SRFI as supported when it isn't, the worst that will happen to the user is they type chicken-install srfi-999 and it doesn't work. If a SRFI is actually supported but is not listed, some people will be missing out on it, but that can happen by accidental omission anyway. We have no way of telling the difference between accidental and intentional modification without doing something extremely complex. Complex solutions are error-prone, and if complex manual effort is required, people usually don't like to do it, which would ironically cause the data to be incorrect simply by being out of date :)

@lassik

lassik commented Nov 11, 2020

If we take the suggested format of https://call-cc.org/srfi-support-5.scm:

(scheme-implementation-name "Chicken")
(scheme-implementation-version 5)
(srfi 0 core full)
(srfi 1 external full)
(srfi 2 external part)

and modify it a bit:

(scheme-implementation-name "Chicken")
(scheme-implementation-version 5)
(package
 (srfi 0 core full))
(package
 (srfi 1 external full))
(package
 (srfi 2 external part))

And the metadata file for one egg looks like this:

(package
 (name srfi-2)
 (srfi 2 external part)
 ...)

And Chicken's egg index looks like this:

(package
 (name srfi-1)
 (srfi 1 external full)
 ...)
(package
 (name srfi-2)
 (srfi 2 external part)
 ...)
...

It's now clear that:

  • The egg index is a simple concatenation of the individual egg metadata files.
  • Our srfi-support-5.scm file is a simple filtering of the egg index with a static header added to the top.

That's what I've been getting at. The big problem with the individual files and URLs is: is the information correct? The formats of all those files are just random filterings and concatenations of each other.

@lassik

lassik commented Nov 11, 2020

srfi-support-5.scm could be generated by a Scheme procedure almost literally like this:

(define (get-chicken-srfi-support major-version)
  `((scheme-implementation-name "Chicken")
    (scheme-implementation-version ,major-version)
    ,@(map (lambda (package)
             (filter-properties
              (lambda (property) (eq? 'srfi (car property)))
              package))
           (get-chicken-egg-index major-version))))
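Here get-chicken-egg-index and filter-properties are assumed helpers. For concreteness, filter-properties might look something like this (a hypothetical sketch; filter is from SRFI 1):

(import (scheme base) (srfi 1))

;; Keep only the properties of a package form that satisfy pred,
;; preserving the leading 'package symbol.
(define (filter-properties pred package)
  (cons (car package)
        (filter pred (cdr package))))

;; (filter-properties (lambda (property) (eq? 'srfi (car property)))
;;                    '(package (name srfi-2) (srfi 2 external part)))
;; => (package (srfi 2 external part))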

@diamond-lizard

There is no threat

Now that I've had a bit of time to think about it, I think there are a couple of minor threats.

The first is, as you say, denial of service. This could be aimed at either the metadata consumer (like the SRFI Table scraper) or the users that get data from it. Maybe they can't build, or maybe they're stuck reading endless amounts of information served up in a tampered metafile. We may want to consider length limits because of this, or at least fields in the consumed data that indicate how long the data proper is expected to be. So, for example, a string value of "foo" could be prefixed with "3" to indicate it's only 3 bytes long. (This gets trickier with Unicode, of course.) And/or we could just have limits on the size of the entire file. We don't want to try downloading a 1 TB file, for instance.

The second threat is offensive, illegal, or otherwise undesirable information put into a metadata file. The Scheme community is not one to do that deliberately, from what I can tell, but if a server is hacked some kids might think it funny to do that, and do we really want to carelessly republish it?

The Scheme community is relatively small, so I think the chance of targeted attacks that would exploit these metadata files is low, but it is still a possibility we may want to consider, even if we decide in the end not to do anything about it.

@lassik

lassik commented Nov 11, 2020

Denial of service (which in most cases would be accidental - e.g. a server goes down due to a power outage or failed upgrade) is most effectively handled if each scraper keeps a local cache that it uses as a fallback if the update fails.
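A minimal sketch of that fallback, where download-string stands in for whatever HTTP client the scraper uses (it is a hypothetical procedure, not a real library call):

(import (scheme base) (scheme file))

;; Keep the last successful download in a cache file and fall back to it
;; when an update fails.
(define (fetch-or-cache url cache-file)
  (guard (exn (#t #f))                    ; swallow download errors
    (let ((text (download-string url)))   ; hypothetical HTTP GET
      (when (file-exists? cache-file)
        (delete-file cache-file))
      (call-with-output-file cache-file
        (lambda (port) (write-string text port)))))
  ;; Either way, the cache file now holds the freshest data we have.
  cache-file)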

Malicious information is usually blocked via a web-of-trust model: I trust the admin of call-cc.org and the people who admin the Chicken git repos; they trust the programmers to whom they gave access to those repos; those programmers trust themselves. The communication links between these things are encrypted and I trust the people who built the crypto not to have made mistakes or put in backdoors. Basically the entire world runs on a web-of-trust model and has always done so; there is no alternative.

Blocking malicious information is mainly relevant on sites where a large number of unknown people are free to post anything they like, e.g. social media sites. Since you don't know those people, and they don't lose any reputation by posting bad stuff, it eventually pays to have some heuristics to block that stuff, and there's a lot of money in developing machine learning tools to implement good heuristics. For less popular stuff, it's much easier not to give write access to unreliable people in the first place (or revoke access from people who were previously reliable but are not anymore). This is the model by which all open source projects (and most commercial projects) have always been run. In finance, you'd be surprised how much money changes hands over informal phone calls while having a hangover after partying all weekend...

@diamond-lizard

That's what I've been getting at. The big problem with the individual files and URLs is: is the information correct? The formats of all those files are just random filterings and concatenations of each other.

I'm not advocating for a single file per SRFI. One file that puts them all together is fine.

And, yes, you could put this information into a package index, along with all the other information you might want to have about packages, and create a package file format while we're at it. But do we want to? Do users and maintainers want it, when there are so many other competing ways of doing that? Like I said before, that may be a much harder sell (though we can and probably should just ask).

The advantage of just limiting the standardized information is that it makes for a small, humble SRFI that doesn't require major buy-in from all the big stakeholders, and probably lots of bickering, to come to some sort of agreement (see how far just the two of us lasted... now imagine the entire Scheme community going at it on an issue of such central importance).

It's easily implemented, easily specified, easily ingested, pretty damn safe, and how much objection are we really going to get to a minimalistic SRFI metafile standard that has just the bare minimum information you'd want to know about a SRFI?

We wouldn't be trying to fit every kind of implementation's packaging needs into one format, and it would be clear exactly what we need, so we wouldn't need to keep it wide open and let anybody put anything into it.

When going in the minimalistic direction, the most we realistically have to worry about is ensuring a reasonable file size, and that should be pretty simple.

With the unified package format suddenly you have to worry about all sorts of other things, and we can probably count on all those other things being argued to death, possibly without any consensus being reached. But I'm willing to give it a try, as long as we also offer the minimalistic SRFI-metadata-only option as an alternative (or maybe a separate SRFI).

@diamond-lizard

diamond-lizard commented Nov 12, 2020

Denial of service (which in most cases would be accidental - e.g. a server goes down due to a power outage or failed upgrade) is most effectively handled if each scraper keeps a local cache that it uses as a fallback if the update fails.

No, I mean the kind of denial of service you get when your scraper connects and tries to download a 1 TB file. Of course, this is easily dealt with by having timeouts and/or size limits. That's the kind of scenario and solution that I meant to draw our attention to. The possibility and likely consequences are pretty minor, but they're there. Almost not worth mentioning, but I thought I'd mention it in passing just so we have these possibilities in the back of our mind and aren't broadsided if we choose not to do anything about them and they happen.

Malicious information is usually blocked via a web-of-trust model: I trust the admin of call-cc.org and the people who admin the Chicken git repos; they trust the programmers to whom they gave access to those repos; those programmers trust themselves. The communication links between these things are encrypted and I trust the people who built the crypto not to have made mistakes or put in backdoors. Basically the entire world runs on a web-of-trust model and has always done so; there is no alternative.

There are alternatives, like requiring signatures and key signing using a GPG web of trust rather than the informal web of trust. Another alternative is PKI (Public Key Infrastructure). If we get security experts involved, I'm sure we can get lots of other suggestions. But do we want to? For the minimalistic solution of just collecting and serving up which Scheme has which SRFI, I don't think we really care that much about whether the data is authentic (a global package management format, on the other hand, is another story).

However, we could build in some simple sanity checks, such as: if the number of SRFIs that a Scheme supported the last 3 times was around 100 and today it's 1 or 10,000, then there's something wrong, so send an alert and don't publish the anomalous data unless an admin issues an override.
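Such a check could be a few lines (a rough sketch; the factor-of-ten threshold is only an example):

(import (scheme base))

;; Is the new SRFI count within a factor of ten of the average
;; of the last few published counts?
(define (srfi-count-plausible? recent-counts new-count)
  (let ((average (/ (apply + recent-counts) (length recent-counts))))
    (and (> new-count (/ average 10))
         (< new-count (* average 10)))))

;; (srfi-count-plausible? '(98 100 102) 101)    => #t
;; (srfi-count-plausible? '(98 100 102) 10000)  => #f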

I didn't mean to get us sidetracked on security issues. These are just passing thoughts.

@diamond-lizard

By the way, I think I'm going to have to take a break on this. I'm usually able to port a bunch of SRFIs in a day, and today I've done nothing but discuss this one project. It's an important project, to be sure, but I can't commit this much time to it. I'd only meant to put in a word of advice here and there and have now gone way overboard. Not your fault at all. It's my big mouth... but for now I'm going to withdraw and maybe check in every now and then for some relatively focused input.

@lassik

lassik commented Nov 12, 2020

I'm not advocating for a single file per SRFI. One file that puts them all together is fine.

My point is that it doesn't matter all that much which kinds of files we have. We can just as well have one kind as another kind.

The major problem is whether or not the information is correct. This is currently a problem, and if we make a new file format that stores hand-written information far from the authoritative source of that information, it will continue to be a problem no matter what the file format is. It's possible to specify a file format that is more convenient to write, without improving the correctness of the information written into it.

The main question our proposal should answer is, how is this information going to be more up-to-date and comprehensive (with an easy-to-follow audit trail leading to authoritative source information from which it was derived) than earlier formats?

If our answer is, "Scheme implementors and general Scheme fans are going to be diligent about tracking third-party packages they do not personally maintain", that answer is the same as the current one and it hasn't worked so far.

  • If we make a table that is scraped programmatically from existing information, why are we standardizing the format of that table specifically? It's not that hard to write programs that download files and do simple transformations to them to generate tables.

  • If we take another approach and make a table that is written by hand (by the people maintaining a Scheme implementation, say), why are we writing it by hand instead of auto-generating it from something closer to the source? Wouldn't it be a better idea to gather Chicken's SRFI list by grepping for srfi in egg names (see the sketch below), or by parsing .egg files or (module ...) forms, than writing that list by hand? And if we do have a program to grep that stuff and output the list, why standardize that program's output in a SRFI instead of something more general? After all, since the info is programmatically derived, anyone who wants that info can now run that program, not just the Chicken maintainers.
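A rough sketch of that grep, assuming we already have a list of egg-name strings (filter-map is from SRFI 1, string-prefix? from SRFI 13):

(import (scheme base) (srfi 1) (srfi 13))

;; Derive SRFI numbers from egg names like "srfi-1" or "srfi-13".
(define (srfi-numbers-from-egg-names egg-names)
  (filter-map
   (lambda (name)
     (and (string-prefix? "srfi-" name)
          (string->number (substring name 5 (string-length name)))))
   egg-names))

;; (srfi-numbers-from-egg-names '("srfi-1" "srfi-13" "http-client"))
;; => (1 13)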

And, yes, you could put this information into a package index, along with all the other information you might want to have about packages, and create a package file format while we're at it. But do we want to? Do users and maintainers want it, when there are so many other competing ways of doing that? Like I said before, that may be a much harder sell (though we can and probably should just ask).

The advantage of just limiting the standardized information is that it makes for a small, humble SRFI that doesn't require major buy-in from all the big stakeholders, and probably lots of bickering, to come to some sort of agreement (see how far just the two of us lasted... now imagine the entire Scheme community going at it on an issue of such central importance).

The bickering is ultimately necessary if we want Scheme to have an impact beyond its current reach :) @johnwcowan is a pioneer at it, going out of his way to find compatible solutions to problems. It's people like him that make Scheme more frictionless long term and make it easier for Scheme to build bridges to outside communities. The work is often thankless, but it pays off.

It's easily implemented, easily specified, easily ingested, pretty damn safe, and how much objection are we really going to get to a minimalistic SRFI metafile standard that has just the bare minimum information you'd want to know about a SRFI?

If we make a new technical invention that doesn't address the underlying social problem, we may not get much objection (since programmers are used to thinking in technical terms) but the social problem would persist.

We wouldn't be trying to fit every kind of implementation's packaging needs into one format, and it would be clear exactly what we need, so we wouldn't need to keep it wide open and let anybody put anything into it.

If we have a format that can fit many kinds of data, but all of them are optional, people are free to use a very simple subset of that format.

When going in the minimalistic direction, the most we realistically have to worry about is ensuring a reasonable file size, and that should be pretty simple.

time curl --silent https://archive.akkuscm.org/archive/Akku-index.scm | wc -c shows that it takes only 0.4 seconds to download the entire Akku package index containing a full description of 300 packages. The file size is 360 KiB. Web servers can transparently gzip the files they serve. The Akku index gzipped with the default compression level is 50 KiB.

With the unified package format suddenly you have to worry about all sorts of other things, and we can probably count on all those other things being argued to death, possibly without any consensus being reached. But I'm willing to give it a try, as long as we also offer the minimalistic SRFI-metadata-only option as an alternative (or maybe a separate SRFI).

Thanks for considering it. cond-expand and include would be the main complexity. That's ~100 lines of code. include is potentially complex if it has to be able to download files over the internet.

We can factor the overall problem into many files, each using its own idiosyncratic format; or we can factor it into several SRFIs. If we do those things, each of those files/SRFIs is going to reinvent the wheel, which is what the community has been doing thus far around these problems.

@lassik

lassik commented Nov 12, 2020

By the way, I think I'm going to have to take a break on this. I'm usually able to port a bunch of SRFIs in a day, and today I've done nothing but discuss this one project. It's an important project, to be sure, but I can't commit this much time to it. I'd only meant to put in a word of advice here and there and have now gone way overboard. Not your fault at all. It's my big mouth... but for now I'm going to withdraw and maybe check in every now and then for some relatively focused input.

No problem :) That's fine with me. Thanks for sticking with it! This is the way we usually solve problems in Scheme: argue for two weeks and then spend half a day writing code :p

@lassik

lassik commented Nov 12, 2020

No, I mean the kind of denial of service you get when your scraper connects and tries to download a 1 TB file. Of course, this is easily dealt with by having timeouts and/or size limits. That's the kind of scenario and solution that I meant to draw our attention to. The possibility and likely consequences are pretty minor, but they're there. Almost not worth mentioning, but I thought I'd mention it in passing just so we have these possibilities in the back of our mind and aren't broadsided if we choose not to do anything about them and they happen.

This is best solved by putting the scraper into a general scraping framework which does some sanity checks on file sizes.

There are alternatives, like requiring signatures and key signing using a GPG web of trust rather than the informal web of trust. Another alternative is PKI (Public Key Infrastructure). If we get security experts involved, I'm sure we can get lots of other suggestions. But do we want to? For the minimalistic solution of just collecting and serving up which Scheme has which SRFI, I don't think we really care that much about whether the data is authentic (a global package management format, on the other hand, is another story).

If the release tarballs of Scheme implementations and/or Scheme packages are signed, that doubles as a signature of the SRFI metadata contained in the archive. Also the hash tree of Git's commit history. Otherwise I wouldn't bother with signing.

However, we could build in some simple sanity checks, such as: if the number of SRFIs that a Scheme supported the last 3 times was around 100 and today it's 1 or 10,000, then there's something wrong, so send an alert and don't publish the anomalous data unless an admin issues an override.

Once we start considering problems at this level of detail, there's no principled point at which to stop. We'd have to consider an endless series of ever-smaller problems that might go wrong. And once we do that, we have surely overlooked some bigger problems.

It's ok to do good-enough things for jobs that aren't that serious, and only put in checks after the fact if something has been a problem in the past.

I didn't mean to get us sidetracked on security issues. These are just passing thoughts.

No problem.
