Implementation metadata #6
I like this format, because it focuses on just the metadata that we need. There's nothing in it about how to build a package, for instance, and it's not a free-for-all. I'm not a fan of leaving all the fields optional, however. The main value a SRFI like this will add (if we adopt the direction of small, narrow focus) is that it will tell people which Scheme implementations support which SRFIs, and if they don't give us that information, what's the point of complying with our SRFI in the first place?
HTTP is a huge success even though it tells people nothing about what's at the other end of a HTTP GET. There can be a lot of value in things that are just a conduit or a known meeting point for exchanging other things. I'll repeat my point in the other thread(s): all of the information we've compiled so far has to go somewhere. If there are separate …
But I don't think there were any standards, and certainly not many, that did what HTTP did when it came around. How many packaging standards are there? At least one for every Scheme. Plus some more generic ones. We'd be competing with all those.
Yes, if we were providing a service in gathering and storing metadata, I think many people would find it valuable, especially as there really isn't something like that for all Schemes right now (apart from the SRFI Table itself).
If we do proceed in the direction of collecting and providing more information than just which Scheme supports which SRFI, then I am in favor of including all of that information in one file. But I am against including things that are implementation details such as compiler options and instructions, scripts, and so forth that at least Chicken packages sometimes come with. And yet Chicken sometimes needs that information to deploy its packages. For this reason, I don't think this would make the best packaging format. That said, I agree that if we were to go in the direction of standardizing a packaging format that would be useful to every Scheme we'd have to be very generic and allow optional addition of custom fields.
There have been many competing ways to transfer files and other data for as long as there have been computer networks. Part of HTTP's success is that the idea was one of the simplest: just …
Indeed, if we invent a new one, it will just be added to the pile. The trick is to not invent anything new unless we get approval from at least some of the people who made the existing ones. If we make something that's just a merging of the existing formats, without requiring anything extra, and is simple to adopt, we have a chance of unifying them eventually.
And once we consider that the package's own metadata, the implementation's package index, and any aggregated indexes, need to contain (some subset of) similar data for each package, it makes sense to strive to use the same field names and value formats in all of them. If we don't, we need to write filter code to munge the same information into slightly different variations, which can be done but is not ideal. If we add up all of these observations, the simplest solution is to gather all the fields in the existing formats, remove duplication (e.g. using a symbol vs a string for the same thing, or a slightly different field name, etc.), and make everything optional so that if someone has trouble with some part of it, they can always leave it out. The key is to think more like a museum curator or taxonomist, instead of a designer who invents entirely new things.
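To make the curation idea concrete, here is a sketch of what a merged package form might look like once duplicate fields from the existing formats are unified and everything is made optional. The field names and values below are illustrative only, not a settled proposal:

```scheme
;; Hypothetical merged package form; every field is optional, so an
;; implementation that only wants to advertise SRFI support can emit
;; just (package (name ...) (srfi ...)).
(package
  (name "example-lib")            ; one agreed spelling instead of name/title/id
  (version "1.2.0")               ; a string, rather than string-vs-list variants
  (synopsis "An example library")
  (license "MIT")
  (authors "A. Schemer")
  (srfi 1 64 132))                ; SRFIs this package provides
```

The point of the exercise is taxonomy, not invention: each field above should be traceable to a field that already exists in at least one Scheme's packaging format.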
If you have:
It's not much different from:
All of that information has to go somewhere in any case. For example, check the compatibility files in chez-srfi. chez-srfi is for R6RS which doesn't have …
Great!
You know the …
If we are going to go the unified package format route, it makes sense to have as much in common with existing formats as possible, so I think you're on the right track. However, I'm still not convinced we should be going that route. I'm not solidly opposed to it, but I still think it'll be a harder sell and take much longer than creating a narrowly focused SRFI-index metafile format. But we should really get more input, I think, before we proceed with this idea, even as a pre-SRFI. I'm really not the best qualified to opine on this, as my experience in the Scheme and Chicken world is very brief and limited. I think I've offered as much insight as I'm capable of at this point. There are much more experienced and knowledgeable people than myself who can give you much deeper insight and advice. I'm not bowing out of the project, but I'd really like to hear what some experienced representatives of various Schemes have to say on the grand unified package format idea.
Sure, let's do that! The thing is, our design is loosely coupled: the container format (unified S-expression file vs separate files) is almost completely orthogonal to the field names and values we are inventing/curating. So even if the container doesn't fly, the curation work is still useful. We don't end up wasting much effort no matter what the outcome is.
If we do float the idea, I think there should either be two separate proposals or one proposal which offers two alternatives: a lean one focusing just on providing/consuming SRFI-compatibility metadata, and the other being the unified package format.
Would the SRFI compatibility data be complex enough to warrant its own SRFI? What information would it have other than a list of the SRFI numbers?
There are some short SRFIs out there. In some ways they're the best kind when they're short, useful, and easy to implement. Things that would be useful to know:
I think for SRFIs that's all that I can think of that we'd need to know. But, again, maybe others have some other ideas or needs. If we had such metadata deliberately exposed, your scripts wouldn't need to scrape anything. They could just consume the data directly and populate the table from there. No hunting around, no uncertainty, no mess. And it should be super easy to implement on both the producer and consumer side.
Where would this metadata be stored? That question is almost more difficult than the format. It raises the question of who is responsible for keeping it up to date. Partially implemented SRFIs are possible in principle, but by convention, we try to steer clear of them; if a SRFI is difficult to implement, we'd rather split it into two different SRFIs.
Personally, I think the most natural thing these days is to serve it up at and consume it from some URL somewhere. It's true that it might not be quite as definitive or great for archival purposes as putting it in a tar file, but the archive of this information could be the metadata server which consumes the metadata at these URLs. What do you think? |
Do you mean something like the following:
(scheme-implementation-name "Chicken")
(scheme-implementation-version 5)
(srfi 0 core full)
(srfi 1 external full)
(srfi 2 external part)
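A consumer of a format like the one sketched above could stay entirely within standard Scheme. The sketch below, using only `read`, collects the `(srfi ...)` entries from a metadata file; the procedure name and the assumption that the file is a flat sequence of top-level S-expressions are mine, not part of any proposal:

```scheme
;; Sketch: read a metadata file and return the tail of each (srfi ...)
;; form, e.g. ((0 core full) (1 external full) (2 external part)).
(define (read-srfi-support port)
  (let loop ((srfis '()))
    (let ((form (read port)))
      (cond ((eof-object? form)
             (reverse srfis))
            ((and (pair? form) (eq? 'srfi (car form)))
             (loop (cons (cdr form) srfis)))   ; keep (number source coverage)
            (else
             (loop srfis))))))                 ; ignore other top-level forms
```

This is the whole producer/consumer contract: `write` on one side, `read` on the other, which is part of why the format should be easy to adopt.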
By the way, something else I've just thought of that I think we have to take into consideration is what to do if someone makes a mistake or needs to revise some of this information. It's easy enough to change the data on the producer side, but if the consumer has already consumed it then the next time they look there'll be a discrepancy. How do we deal with that? This may not need to be part of the SRFI, but we should think about it. As far as SRFI Table generation goes, it could just serve up the latest information, even if it differs from the old information. From an archival standpoint, I suppose the consumed information could be put into version control and have frequent backups, so that you'd be able to roll back to previous versions. Yet another concern is authentication and authorization. Should the consumer always trust the producer? Or does the producer need to authenticate itself to prove that it's an authoritative source? If the consumer just trusts that whatever data's at a particular URL is true, then it's easy to implement on both sides. If we want more assurances, it gets trickier.
That sounds reasonable to me. What do you think?
I'd expect most consumers to simply overwrite their old data with the new data every time they update. If they download data from multiple URLs, they should be prepared for the possibility where some of those are reachable during the update but others not. For this, they'd need to keep track of which part of their output data came from which input URL, and not touch that part of the output if the corresponding input could not be downloaded. To avoid tampering in case old data has changed and the consumer doesn't want it to change, either an equality check (like Scheme's …) could be used.

Authentication = encrypted connection + permission check. HTTPS solves both problems: encryption is built in, and since HTTP GET is read-only and the server admin has hopefully prevented unauthorized people from writing files on the server, permission checks are solved too. Alternatively, git over a secure protocol (HTTPS or SSH) also takes care of encryption and permission checks. I wouldn't worry about these concerns on the level of abstraction we are working on.

The ability to use the git history to roll back after mistakes is a good point. It's generally a good idea to track almost any kind of text file in Git. (That doesn't mean Git should be the official interface used by the consumer.)
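The "keep track of which output came from which input URL" idea can be sketched in a few lines. Here the per-URL bookkeeping is an association list, and `fetch-metadata` is a hypothetical downloader that returns `#f` on failure; none of these names come from any actual proposal:

```scheme
;; Sketch: refresh a per-URL cache of metadata. For each URL, use the
;; freshly downloaded data if the download succeeded; otherwise keep
;; whatever entry we had for that URL before, untouched.
(define (update-entries cached urls)
  (map (lambda (url)
         (let ((fresh (fetch-metadata url)))   ; hypothetical; #f on failure
           (if fresh
               (cons url fresh)
               (or (assoc url cached)          ; fall back to the old entry
                   (cons url '())))))          ; nothing cached yet either
       urls))
```

The aggregate output is then regenerated from the full alist, so an unreachable producer never silently erases its own section of the table.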
Yes, I think that makes sense because most people will probably be interested in what the current version supports, not what it used to support.
What is our threat model in this case? What are we protecting against? Deliberate modification? Accidental modification? An attack? Could this information be exploited in an attack somehow? Is that something we want to worry about or try to prevent? Or do we just care about accidental modification (i.e., say, some datafile corruption)? SHA-1 has ostensibly been broken, and GitHub, for example, is moving away from using it. So if we were to use a hash, we might want to consider using something else.
HTTPS or SSH could, at best, only ensure you're getting the data from the server you intended to get it from, not that it was actually generated by someone who was authorized to do so. If the server had been compromised, then so potentially is your data. That's not to mention ways of getting around HTTPS or SSH. But should we care about this? What's the worst thing that could happen if the data is completely wrong? Is that actually exploitable or could it cause some kind of critical failure that we want to prevent? I'm struggling to think of a way that this could have horrible consequences, and coming up blank. I think that speaks in favor of this kind of simple format. Because, unlike data which contains names and URLs to get software from, or computer instructions, it's pretty hard to exploit.
My instinct is that you're probably right.
Agreed. But that probably doesn't need to be part of the SRFI. Just something to think about.
If we store the authoritative source copy of the metadata file in a Scheme implementation's git repo, and the release tarballs are made from that repo, we get the best possible archival and backup in the easiest possible way.
Indeed, it's very easy to make mistakes when editing these kinds of files by hand. Most programmers don't enjoy routine manual work, and are not good at concentrating when doing it, so it's easy to copy/paste the wrong thing or overlook some number. That's why it's ideal to have a "single point of truth" updated by the people who are directly responsible for that component (e.g. the author of a library or Scheme implementation that provides a SRFI), and everyone else who needs that information uses an automated program that downloads the authoritative information (or some derivative of it) and transforms it into the format that is needed. Basically all problems of the general form that we're dealing with here are funnel problems. There is some original source information hand-written in some authoritative place. Then a bunch of programs (or people) read that info, and output some transformation of that info or combine it with some other info. Other programs (or people) can then take that output, transform and combine it, and output some more info, etc. Some important observations arise:
This shows that the main problem in the whole network is ensuring that the original source data is reliable. Everything else is just a matter of writing simple programs. And since any program in the network can store its output in a file, which can be uploaded to a HTTP URL, there isn't that much need to worry about the intermediate formats. If a particular kind of intermediate format is useful, someone can just write a program to generate it.
However, neither is there any particular benefit to having different formats for the same information. If for example the information about SRFI support can be encoded the same way in all of the following places:
Then the whole community will save effort. This IMHO is the problem we should address, in addition to the (more important) problem of making a convention for how to store the authoritative data as close as possible to the source to have the best chance that it stays up to date. |
There is no threat; there's no money or fame in publishing misleading information about SRFI support. If you list a SRFI as supported when it isn't, the worst that will happen to the user is they type …
If we take the suggested format of https://call-cc.org/srfi-support-5.scm:
and modify it a bit:
And the metadata file for one egg looks like this:
And Chicken's egg index looks like this:
It's now clear that:
That's what I've been getting at. The big problem with the individual files and URLs is, is the information correct? The formats of all those files are just random filterings and concatenations of each other. |
(define (get-chicken-srfi-support major-version)
`((scheme-implementation-name "Chicken")
(scheme-implementation-version ,major-version)
,@(map (lambda (package)
(filter-properties
(lambda (property) (eq? 'srfi (car property)))
package))
(get-chicken-egg-index major-version)))) |
Now that I've had a bit of time to think about it, I think there are a couple of minor threats. The first is, as you say, denial of service. This could target either the metadata consumer (like the SRFI Table scraper) or users that get data from it. Maybe they can't build, but maybe they're also stuck reading endless amounts of information that was served up in a tampered metafile. We may want to consider length limits because of this, or at least maybe fields in the consumed data that indicate how long the data proper is expected to be. So, for example, a string value of "foo" could be prefixed with "3" to indicate it's only 3 bytes long. (This gets trickier with unicode, of course.) And/or we could just have limits on the size of the entire file. We don't want to try downloading a 1 TB file, for instance. The second threat is offensive, illegal, or otherwise undesirable information put into a metadata file. The Scheme community is not one to do that deliberately, from what I can tell, but if a server is hacked some kids might think it funny to do that, and do we really want to carelessly republish it? The Scheme community is relatively small, so I think the chance of targeted attacks that would exploit these metadata files is low, but it is still a possibility we may want to consider, even if we decide in the end not to do anything about it.
Denial of service (which in most cases would be accidental - e.g. a server goes down due to a power outage or failed upgrade) is most effectively handled if each scraper keeps a local cache that it uses as a fallback if the update fails. Malicious information is usually blocked via a web-of-trust model: I trust the admin of …

Blocking against malicious information is mainly relevant on sites where a large number of unknown people are free to post anything they like, e.g. social media sites. Since you don't know those people, and they don't lose any reputation by posting bad stuff, it eventually pays to have some heuristics to block that stuff, and there's a lot of money in developing machine learning tools to implement good heuristics. For less popular stuff, it's much easier not to give write access to unreliable people in the first place (or to revoke access from people who were previously reliable but are not anymore). This is the model by which all open source projects (and most commercial projects) have always been run. In finance, you'd be surprised how much money changes hands over informal phone calls while having a hangover after partying all weekend...
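The cache-as-fallback behavior described here is simple to sketch. `http-get`, `read-cache`, and `write-cache!` below are hypothetical procedures (returning `#f` on a failed download), standing in for whatever the scraper actually uses:

```scheme
;; Sketch: fetch metadata from a URL, refreshing a local cache on
;; success and falling back to the last good copy on failure.
;; http-get, read-cache and write-cache! are assumed, not standard.
(define (fetch-or-cached url cache-file)
  (let ((data (http-get url)))          ; #f if the server is unreachable
    (cond (data
           (write-cache! cache-file data)
           data)
          (else
           (read-cache cache-file)))))  ; last good copy, or #f if none
```

With this shape, a producer outage degrades the table to slightly stale data rather than a missing section.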
I'm not advocating for a single file per SRFI. One file that puts them all together is fine. And, yes, you could put this information into a package index, along with all the other information you might want to have about packages, and create a package file format while we're at it. But do we want to? Do users and maintainers want it, when there are so many other competing ways of doing that? Like I said before, that may be a much harder sell (though we can and probably should just ask). The advantage of limiting the standardized information is that it makes for a small, humble SRFI that doesn't require major buy-in, and probably lots of bickering among all the big stakeholders, to come to some sort of agreement (see how far just the two of us lasted... now imagine the entire Scheme community going at it on an issue of such central importance). It's easily implemented, easily specified, easily ingested, pretty damn safe, and how much objection are we really going to get to a minimalistic SRFI metafile standard that has just the bare minimum information you'd want to know about a SRFI? We wouldn't be trying to fit every kind of implementation's packaging needs into one format, and it would be clear exactly what we need so we wouldn't need to keep it wide open and let anybody put anything into it. When going the minimalistic direction the most we realistically have to worry about is ensuring a reasonable file size, and that should be pretty simple. With the unified package format suddenly you have to worry about all sorts of other things, and we can probably count on all those other things being argued to death, possibly without any consensus being reached. But I'm willing to give it a try, as long as we also offer the minimalistic SRFI-metadata-only option as an alternative (or maybe a separate SRFI).
No, I mean the kind of denial of service you get when your scraper connects and tries to download a 1 TB file. Of course, this is easily dealt with by having timeouts and/or size limits. That's the kind of scenario and solution that I meant to draw our attention to. The possibility and likely consequences are pretty minor, but they're there. Almost not worth mentioning, but I thought I'd mention it in passing just so we have these possibilities in the back of our mind and aren't broadsided if we choose not to do anything about them and they happen.
There are alternatives like requiring signatures and key signing, using a gpg web of trust, not the informal web of trust. Another alternative is PKI (Public Key Infrastructure). If we get security experts involved I'm sure we can get lots of other suggestions. But do we want to? I don't think for the minimalistic solution of just collecting and serving up which Scheme has which SRFI we really care that much about whether the data is authentic (a global package management format, on the other hand, is another story). However, we could build in some simple sanity checks, such as: if the number of SRFIs that a Scheme supported the last 3 times was around 100 and today it's 1 or 10,000, then there's something wrong, so send an alert and don't publish the anomalous data unless an admin issues an override. I didn't mean to get us sidetracked on security issues. These are just passing thoughts.
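The sanity check described here, flagging a SRFI count that differs wildly from recent history, fits in a few lines. The factor-of-4 thresholds below are arbitrary placeholders:

```scheme
;; Sketch of the anomaly check: compare the newly scraped SRFI count
;; against the average of the last few runs. Thresholds are made up;
;; a real deployment would tune them and route alerts to an admin.
(define (anomalous? new-count recent-counts)
  (let ((avg (/ (apply + recent-counts)
                (max 1 (length recent-counts)))))
    (or (< new-count (/ avg 4))     ; suspiciously few, e.g. 100 -> 1
        (> new-count (* avg 4)))))  ; suspiciously many, e.g. 100 -> 10000
```

On an anomaly the publisher would hold the new data and keep serving the previous snapshot until a human confirms the change is real.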
By the way, I think I'm going to have to take a break on this. I'm usually able to port a bunch of SRFIs in a day, and today I've done nothing but discuss this one project. It's an important project, to be sure, but I can't commit this much time to it. I'd only meant to put in a word of advice here and there and have now gone way overboard. Not your fault at all. It's my big mouth... but for now I'm going to withdraw and maybe check in every now and then for some relatively focused input.
My point is that it doesn't matter all that much which kinds of files we have. We can just as well have one kind as another kind. The major problem is whether or not the information is correct. This is currently a problem, and if we make a new file format that stores hand-written information far from the authoritative source of that information, it will continue to be a problem no matter what the file format is. It's possible to specify a file format that is more convenient to write, without improving the correctness of the information written into it. The main question our proposal should answer is, how is this information going to be more up-to-date and comprehensive (with an easy-to-follow audit trail leading to authoritative source information from which it was derived) than earlier formats? If our answer is, "Scheme implementors and general Scheme fans are going to be diligent about tracking third-party packages they do not personally maintain", that answer is the same as the current one and it hasn't worked so far.
The bickering is ultimately necessary if we want Scheme to have an impact beyond its current reach :) @johnwcowan is a pioneer at it, going out of his way to find compatible solutions to problems. It's people like him that make Scheme more frictionless long term and make it easier for Scheme to build bridges to outside communities. The work is often thankless, but it pays off.
If we make a new technical invention that doesn't address the underlying social problem, we may not get much objection (since programmers are used to thinking in technical terms) but the social problem would persist.
If we have a format that can fit many kinds of data, but all of them are optional, people are free to use a very simple subset of that format.
Thanks for considering it. We can factor the overall problem into many files, each using its own idiosyncratic format, or into several SRFIs. If we do those things, each of those files/SRFIs is going to reinvent the wheel, which is what the community has been doing thus far around these problems.
No problem :) That's fine with me. Thanks for sticking with it! This is the way we usually solve problems in Scheme: argue for two weeks and then spend half a day writing code :p
This is best solved by putting the scraper into a general scraping framework which does some sanity checks on file sizes.
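Such a framework-level check can be as small as a byte-count guard applied before parsing. The 1 MiB limit below is an arbitrary example value:

```scheme
;; Sketch: a scraping framework's size sanity check. The limit is an
;; arbitrary example; metadata files this small make a 1 TB download
;; attack (or accident) a non-issue.
(define max-metadata-bytes (* 1024 1024))   ; 1 MiB, generous for S-expressions

(define (reasonable-size? n-bytes)
  (<= n-bytes max-metadata-bytes))
```

The framework would check `reasonable-size?` against the Content-Length header (or bytes received so far) and abort the fetch when it fails, never handing oversized input to the parser.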
If the release tarballs of Scheme implementations and/or Scheme packages are signed, that doubles as a signature of the SRFI metadata contained in the archive. Also the hash tree of Git's commit history. Otherwise I wouldn't bother with signing.
Once we start considering problems at this level of detail, there's no principled point at which to stop. We'd have to consider infinitely small problems that might go wrong. Once we do that, we have surely overlooked some bigger problems. It's ok to do good-enough things for jobs that aren't that serious, and only put in checks after the fact if something has been a problem in the past.
No problem.
The schemedoc/implementation-metadata repo collects info about as many Scheme implementations as we can into S-expression files. For example, here's the current data for Guile. The format is subject to change; it hasn't yet settled into a good form.
Anyway, if we support (scheme-implementation ...) top-level forms in the metadata file in addition to (package ...) top-level forms, then we can use the same filename for info about releases of implementations. IMHO, as with (package ...), the form should be optional, and all fields in it should be optional as well. This makes it easy to adopt by adding a couple of fields at first and gradually adding more if desired.

Thoughts?
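A single file mixing both kinds of top-level form might look like the following sketch; the field names are illustrative and, per the proposal's spirit, every field shown is optional:

```scheme
;; Hypothetical combined metadata file: one (scheme-implementation ...)
;; form describing a release, alongside ordinary (package ...) forms.
(scheme-implementation
  (name "Chicken")
  (version 5)
  (srfi 0 1 2))        ; SRFIs built into the core release

(package
  (name "srfi-1")
  (version "0.5")
  (srfi 1))            ; SRFI provided by this external package
```

A consumer that only understands one of the two forms can simply skip the other, which is what keeps gradual adoption cheap.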