What is a package? #257

Open
big-guy opened this issue Oct 25, 2023 · 12 comments


@big-guy

big-guy commented Oct 25, 2023

I'm sorry if this has been discussed somewhere else. I looked through the PURL spec and the answer isn't explicitly stated. #242 has similar vibes.

What is a package?

hierarchy

The spec describes the components of a PURL as a hierarchy.

If you had these two PURLs, would it make sense to call them the same package, but one of them is more specific?
pkg:example/org/mypkg
pkg:example/org/[email protected]

IOW, does pkg:example/org/mypkg generically refer to all versions of the package?

I think this falls into something per-ecosystem because only the name is required universally. So for some ecosystems, you may never have a version or you must always have a version. If an ecosystem mixed these two notations, it would be confusing.
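A minimal sketch of the "same package, but more specific" reading of the hierarchy. The parser is deliberately simplified (it drops qualifiers and subpath and applies no per-type rules), the names `Purl`, `parse_purl` and `covers` are hypothetical, and the version `1.0.0` is a placeholder. Whether a versionless PURL really covers every version is exactly the open question here, so treat `covers` as one possible interpretation rather than spec behavior.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import unquote


@dataclass(frozen=True)
class Purl:
    type: str
    namespace: Optional[str]
    name: str
    version: Optional[str] = None


def parse_purl(purl: str) -> Purl:
    """Simplified parser: ignores qualifiers and subpath, no per-type rules."""
    scheme, _, rest = purl.partition(":")
    if scheme.lower() != "pkg":
        raise ValueError(f"not a purl: {purl!r}")
    # Drop the subpath (#...) and qualifiers (?...) for this sketch.
    rest = rest.split("#", 1)[0].split("?", 1)[0].strip("/")
    path, sep, version = rest.rpartition("@")
    if not sep:  # no '@' means no version component
        path, version = rest, None
    ptype, _, remainder = path.partition("/")
    namespace, _, name = remainder.rpartition("/")
    if not ptype or not name:
        raise ValueError(f"type and name are required: {purl!r}")
    return Purl(
        type=ptype.lower(),
        namespace=unquote(namespace) or None,
        name=unquote(name),
        version=unquote(version) if version else None,
    )


def covers(general: Purl, specific: Purl) -> bool:
    """Assumed rule: a versionless PURL covers every version of the same package."""
    same_package = (general.type, general.namespace, general.name) == (
        specific.type,
        specific.namespace,
        specific.name,
    )
    return same_package and general.version in (None, specific.version)
```

Under this reading, `covers(parse_purl("pkg:example/org/mypkg"), parse_purl("pkg:example/org/mypkg@1.0.0"))` holds, while the reverse does not.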

qualifiers

Let's say this is a package: pkg:example/org/[email protected]

But does this represent a different package? pkg:example/org/[email protected]?key=value

I can see arguments for both.

Yes - the qualifiers identify a specific set of files and different meta information could be appropriate (different CVEs, different licenses, different dependencies, etc).

No - the qualifiers are optional and a package is more than just a single set of files. It's the collection of all things from all qualifiers.

If we look at regular URLs, URLs with different query parameters are treated as separate URLs, but the query parameter might not affect the semantic content of what's available at the URL. e.g., a page with ?sort=ascending could be identical to one without it (if ascending is the default).

I think this might be a per-ecosystem decision, but it makes talking about what a PURL points to a little harder.

well-known qualifiers

The spec describes a few well-known qualifiers for all package types. There's a warning to keep the use of qualifiers to a bare minimum for "package identification".

Like above, if this is a package: pkg:example/org/[email protected]

Is this a different package? pkg:example/org/[email protected]?repository_url=https://example.com

If not, why would this information be useful?

My thinking is that these could be different packages, but tooling should treat them as potentially the same package in one direction. That means if there's a known CVE against pkg:example/org/[email protected], it should also be assumed to be against pkg:example/org/[email protected]?repository_url=https://example.com, but not the other way around.
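The one-directional rule described above could be sketched as a subset check on qualifiers. This assumes the type, namespace, name and version have already been compared and found equal; `advisory_applies` is a hypothetical helper, not something the spec defines.

```python
from typing import Mapping


def advisory_applies(
    advisory_qualifiers: Mapping[str, str],
    component_qualifiers: Mapping[str, str],
) -> bool:
    """True if every qualifier on the advisory is present, with the same value,
    on the component. An advisory with no qualifiers therefore applies to any
    qualified form of the package, but not the other way around."""
    return all(
        component_qualifiers.get(key) == value
        for key, value in advisory_qualifiers.items()
    )
```

For example, an advisory with no qualifiers applies to a component carrying `repository_url=https://example.com`, while an advisory that names that `repository_url` does not apply to a component without it.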

Related to this, what if the repository_url happens to point to the default package registry? I think this could be considered an error.

Is this a different package? pkg:example/org/[email protected]?download_url=https://example.com/mypkg.zip

I think no, but this means there would always be exceptions to qualifiers being meaningful to package identification.

vcs_url and file_name feel similar to download_url. These shouldn't be considered part of the identification.

checksum seems similar too, except if you did try to consider it part of the package identification, would the lack of certain checksum values also be considered "the same"?
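If one takes the position that these qualifiers locate or verify a package rather than identify it, a tool could strip them before comparing. The set below reflects only the opinion stated above and is an assumption, not a list taken from the spec; `identity_qualifiers` is a hypothetical name.

```python
from typing import Dict, Mapping

# Assumption: qualifiers treated here as locating/verifying, not identifying.
LOCATOR_QUALIFIERS = frozenset({"download_url", "vcs_url", "file_name", "checksum"})


def identity_qualifiers(qualifiers: Mapping[str, str]) -> Dict[str, str]:
    """Return only the qualifiers this sketch treats as part of package identity."""
    return {
        key: value
        for key, value in sorted(qualifiers.items())
        if key not in LOCATOR_QUALIFIERS
    }
```

Note that `repository_url` is kept, matching the earlier reasoning that it may change which package is meant.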

multiple files

Let's say that when resolving the files for package pkg:example/org/[email protected], we actually do multiple things:

  • Locate pkg:example/org/[email protected] in the registry
  • Read its meta information (e.g., list of files, other dependencies)
  • Download the files listed above

This means that pkg:example/org/[email protected] potentially refers to several files.
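The three steps above can be sketched against a made-up in-memory registry. Everything here is hypothetical: the `REGISTRY` contents, the file names, the dependency and the version `1.0.0` are placeholders, and a real client would fetch metadata and files over the network.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Metadata:
    files: List[str]
    dependencies: List[str]


# Hypothetical registry: one PURL maps to metadata listing several files.
REGISTRY: Dict[str, Metadata] = {
    "pkg:example/org/mypkg@1.0.0": Metadata(
        files=["mypkg-1.0.0.jar", "mypkg-1.0.0-sources.jar"],
        dependencies=["pkg:example/org/otherpkg@2.3.0"],
    ),
}


def resolve_files(purl: str) -> List[str]:
    """Locate the package, read its metadata, and return the files to download."""
    metadata = REGISTRY[purl]      # steps 1 and 2: locate and read metadata
    return list(metadata.files)    # step 3: a real client would download these
```

Here a single PURL resolves to two files, which is the situation the question is about.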

It seems like the PURL spec is written with the idea that the PURL refers to a single file (see the well known qualifiers above), but this isn't universally the case for all ecosystems.

other

I think the questions for qualifiers may apply to #subpath too.

why?

This all came up because I've been looking at how Maven and Gradle would use PURLs to describe things internally or in reports.

If this is a package in Maven pkg:maven/org.apache.xmlgraphics/[email protected] and pkg:maven/org.apache.xmlgraphics/[email protected]?classifier=sources refers to the sources of that package, I could see theoretically that different CVEs could be reported against each of these, but I'm not sure it ever makes sense to do that.

In Gradle, you could follow a similar convention where the group-name-version coordinates map to the same location in a Maven repository, but Gradle could publish many different files there that are selected by something other than group-name-version.

If PURLs need to designate a particular file that was used, this means Gradle would need to encode more information:
pkg:gradle/org.apache.xmlgraphics/[email protected]?org.gradle.libraryelements=jar,... where ... is a list of a dozen or more key-values that Gradle considered when selecting the file.

The key-values are used by Gradle to select between different variants of the same thing. Compare this to other package managers that have different artifacts for separate architectures or OSes. If Gradle doesn't include that information, the PURL is what Maven has pkg:maven/org.apache.xmlgraphics/[email protected]. If PURLs are used by other tooling to associate CVEs against packages, then all CVEs apply to all variants.
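To make the trade-off concrete, here is a sketch of encoding variant attributes as qualifiers. `build_purl` is a hypothetical helper, `org.example:mylib:1.0.0` is a placeholder coordinate, `pkg:gradle` is used only as in the question above, and the two attributes shown are an illustrative subset of what Gradle might consider. Keys are lowercased and sorted, and values are percent-encoded.

```python
from typing import Mapping, Optional
from urllib.parse import quote


def build_purl(
    ptype: str,
    namespace: str,
    name: str,
    version: str,
    qualifiers: Optional[Mapping[str, str]] = None,
) -> str:
    """Build a PURL string; qualifiers are lowercased by key, sorted, and encoded."""
    purl = f"pkg:{ptype}/{namespace}/{name}@{version}"
    if qualifiers:
        pairs = sorted((key.lower(), value) for key, value in qualifiers.items())
        purl += "?" + "&".join(f"{key}={quote(value, safe='')}" for key, value in pairs)
    return purl


# With variant attributes: identifies the selected variant.
with_variant = build_purl(
    "gradle", "org.example", "mylib", "1.0.0",
    {"org.gradle.usage": "java-runtime", "org.gradle.libraryelements": "jar"},
)

# Without them: the same coordinates Maven would use, covering all variants.
without_variant = build_purl("maven", "org.example", "mylib", "1.0.0")
```

With a dozen or more attributes the first form grows quickly, while the second form cannot distinguish variants at all.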

There are other complications here (e.g., some key-values can be considered equivalent even if they have different values), but I thought I'd start with the simpler question:

what is a package?

  • is it always a single file?
  • is it always one or more files?
  • is it one or more files where only a subset may be used at one time?
@bureado

bureado commented Oct 25, 2023

Related: #161

@bureado

bureado commented Oct 25, 2023

Anything below that feels deterministic is not authoritative, just my opinion.

IOW, does pkg:example/org/mypkg generically refer to all versions of the package?

Very likely. The question is, is that a contract with purl, or is that just a reasonable assumption the consumer must make?

Let's say this is a package: pkg:example/org/[email protected]
But does this represent a different package? pkg:example/org/[email protected]?key=value

Possibly. It depends on key, and on the ecosystem as you mention. (I think your point on single vs. multiple files is orthogonal to this, though.)

Like above, if this is a package: pkg:example/org/[email protected]
Is this a different package? pkg:example/org/[email protected]?repository_url=https://example.com

Possibly. Depends on the ecosystem. As you say, it's up to the end user to know what to do with that information. For example, are you stripping the qualifiers from the purl string and trying to use them to compare hashes of files on disks? That would not be right, but purl doesn't know that's what you wanted to do with the string. If you're "just" trying to render a bunch of logos for software that you use in your application, the qualifier might not be as important. And purl wouldn't know that's what you wanted to do with the string.

Related to this, what if the repository_url happens to point to the default package registry? I think this could be considered an error.

Possibly, depending on ecosystem conventions.

vcs_url and file_name feel similar to download_url. These shouldn't be considered part of the identification.

You're making two points here. I disagree that download_url and vcs_url are similar or equivalent. But I also agree that in some cases they might be redundant for identification purposes. For example, if your vcs_url does not point to a git ref, then it might not be helpful for detailed purl strings with specific version numbers.

checksum seems similar too, except if you did try to consider it part of the package identification, would the lack of certain checksum values also be considered "the same"?

Possibly. The more I read what you're writing I think it makes sense to clarify how a qualifier might be redundant for identification purposes. Are some qualifiers helpful for disambiguation? Are there other qualifiers helpful for verification? (I generally agree, although I haven't thought of it as a big problem, that you shouldn't use the checksum qualifier to actually verify package integrity)

It seems like the PURL spec is written with the idea that the PURL refers to a single file (see the well known qualifiers above), but this isn't universally the case for all ecosystems.

I don't see it this way. I actually don't think it refers to any files, but to the concept of a package. There are remarkably few mentions of "files" in the spec or the types; some, like in generic, even say file or directory.

I don't think we've done a great job at explaining purl's "concept of a package" with specifics. In a sense, I think it's the implementers that will tell that story, and it'll largely be in the eye of the beholder, and in general (maybe paradoxically) that ambiguity doesn't seem to detract much from purl's value.

If this is a package in Maven pkg:maven/org.apache.xmlgraphics/[email protected] and pkg:maven/org.apache.xmlgraphics/[email protected]?classifier=sources refers to the sources of that package, I could see theoretically that different CVEs could be reported against each of these, but I'm not sure it ever makes sense to do that.

Realistically it'll be reported against the first one; someone will need to read the CVE description to understand if this applies to sources or to binaries.

I don't understand Maven or Gradle well enough to weigh in on the Gradle key-value modifiers, but you might be onto something. Would love to hear more thoughts.

And sorry if it wasn't clear from the rest of my comment, yes, I don't think we define package well. Some of that might be by design, but I think there's more we can do.

@prabhu

prabhu commented Oct 31, 2023

@big-guy Security tools should avoid assumptions.

If you had these two PURLs, would it make sense to call them the same package, but one of them is more specific?
pkg:example/org/mypkg
pkg:example/org/[email protected]

These must be considered different.

Let's say this is a package: pkg:example/org/[email protected]

But does this represent a different package? pkg:example/org/[email protected]?key=value

They are different. An example of a qualifier is arch, which could be amd64 or arm64 representing different packages.

pkg:maven/org.apache.xmlgraphics/[email protected] and pkg:maven/org.apache.xmlgraphics/[email protected]?classifier=sources

They are different packages, but the same vulnerability must be applicable for both the source and compiled form unless the vulnerability is only due to the compiler settings or runtime used. In such cases, the vulnerability must be assigned to the compiler or the runtime.

@pombredanne
Member

@big-guy Thank you ++ for this detailed set of questions and comments!

FWIW, I enabled "discussions" at #260 as we may want to use this in the future!

Here are some answers:

IOW, does pkg:example/org/mypkg generically refer to all versions of the package?

yes.

[...] I think this falls into something per-ecosystem because only the name is required universally. So for some ecosystems, you may never have a version or you must always have a version. If an ecosystem mixed these two notations, it would be confusing.

Whether you include a version or not in a PURL is a choice in a given context, and has nothing to do with the package type IMHO. Where you could have no version and be able to locate a single package? Say I want to talk about a package in general, I could use pkg:maven/com.drewnoakes/metadata-extractor whereas pkg:maven/com.drewnoakes/[email protected] would be a single version....

And for version ranges, this is going to be separate. See #139

@pombredanne
Member

@big-guy re: Qualifiers

Let's say this is a package: pkg:example/org/[email protected]
But does this represent a different package? pkg:example/org/[email protected]?key=value

This is a different package in most cases (but say if you add a checksum this may not change much). But where it matters depends on the context. A vulnerability may affect all the architectures of a Debian package (e.g., all qualifiers) or just when on arm arch. When it comes to a vulnerability database, it may prefer to enumerate all the affected arches at all times or only track them when needed. And I see your point as this could be clarified alright.

@pombredanne
Member

pombredanne commented Nov 6, 2023

@bureado all your comments in #257 (comment) are right to the point! Thank you ++

@prabhu same for #257 (comment) 👍 Thanks!

@big-guy re:

If this is a package in Maven pkg:maven/org.apache.xmlgraphics/[email protected] and pkg:maven/org.apache.xmlgraphics/[email protected]?classifier=sources refers to the sources of that package, I could see theoretically that different CVEs could be reported against each of these, but I'm not sure it ever makes sense to do that.

There are cases where you may want to make this distinction. Say we have a case where the sources and binaries differ, for instance the binary is really an uberjar with "shaded" contents of a vulnerable log4j.

Here the source may not be vulnerable, but the binary may be?

@pombredanne
Member

@big-guy re: well-known qualifiers

Related to this, what if the repository_url happens to point to the default package registry? I think this could be considered an error.

Good point, yet I do not see this as an error per se. Maybe instead this is something we should recommend that tools normalize and simplify? This is redundant for sure!
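A sketch of what such a normalization could look like: drop `repository_url` when it equals the type's default registry. The `DEFAULT_REPOSITORIES` table and the function name are assumptions for illustration; a real tool would need an agreed default per package type.

```python
from typing import Dict, Mapping

# Assumption: a per-type table of default registries, with one illustrative entry.
DEFAULT_REPOSITORIES: Dict[str, str] = {
    "maven": "https://repo.maven.apache.org/maven2",
}


def drop_default_repository(ptype: str, qualifiers: Mapping[str, str]) -> Dict[str, str]:
    """Remove repository_url if it only restates the default registry for ptype."""
    default = DEFAULT_REPOSITORIES.get(ptype)
    repo = qualifiers.get("repository_url")
    if default is not None and repo is not None and repo.rstrip("/") == default.rstrip("/"):
        return {key: value for key, value in qualifiers.items() if key != "repository_url"}
    return dict(qualifiers)
```

A non-default `repository_url` is left untouched, so the normalization never discards information that could change which package is meant.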

@pombredanne
Member

@big-guy re: files

If PURLs need to designate a particular file that was used, this means Gradle would need to encode more information:
pkg:gradle/org.apache.xmlgraphics/[email protected]?org.gradle.libraryelements=jar,... where ... is a list of a dozen or more key-values that Gradle considered when selecting the file.

This is already specified for this case, as Maven calls this "type" and "classifier": https://github.com/package-url/purl-spec/blob/master/PURL-TYPES.rst#maven ... I personally think that the way Maven handles this is a little contrived, but it is likely this way because of "frozen accidents" in its history that were kept for backward compatibility? There is a complex matrix to consider.
See https://github.com/nexB/purldb/blob/255a692c2d1e043d4f945a573c008a5fb1d52119/minecode/visitors/maven.py#L1065 for an example of code and https://maven.apache.org/ref/4-LATEST/maven-core/artifact-handlers.html for Maven's own spec details... (This has made my head spin and if you are well plugged in this ecosystem, this could get some love!) ....

PURL is not trying to change the world, just modestly to make it easy enough to handle the common case easily and obviously and to accommodate the more complex cases (but maybe not as obviously).

Here is a possible analogy that may not be too shabby! Say the PURL spec is like the spec for an address book of people and places. 🧑‍🤝‍🧑 🏙️

Each package type is like a country or state and defines how you can identify and locate a place reasonably uniquely. Uniquely enough that the post can deliver the mail. In a city with well defined streets and street numbers, you get a precise location with the street name and number and maybe an apartment number. In some cases you may want the address for a single person with their name, or the whole household. If someone is off the grid in the bayou or some isolated mountain, crafting a proper address may be more hairy and fuzzy. Worst case I may need GPS coordinates for these edge cases. I may also have many different ways to write an address or a name. Heck, some folks also live in orbit on the ISS and GPS will not work there!

I think the same applies to software. I wish everything was well organized and tidy, but we have to deal with a lot of warts and weirdness!

@pombredanne
Member

@big-guy oops! just reading what I wrote and I missed a point :]

if PURLs need to designate a particular file that was used, this means Gradle would need to encode more information:
pkg:gradle/org.apache.xmlgraphics/[email protected]?org.gradle.libraryelements=jar,... where ... is a list of a dozen or more key-values that Gradle considered when selecting the file.

How would pkg:gradle be different from pkg:maven ? I thought Gradle (as a build tool) was using Maven for its packages? Care to elaborate?

@pombredanne
Member

@big-guy so finally about your questions:

what is a package?

  • is it always a single file?
  • is it always one or more files?
  • is it one or more files where only a subset may be used at one time?

It depends on the context and what's been baked into the PURL, the same way a physical address may point to a country (type), city (namespace), street (name), building number (~ version :]) or a room or a person (qualifiers, subpaths or name) and everything in between. (I guess the address analogy breaks down and wears out quickly.)

@oej

oej commented Nov 8, 2023

I think the spec has to clarify

  • how to compare two PURLs (we need to check if there are generic rules for URI/URL comparison to base it on)

If there's a PURL in a vulnerability report and I have a PURL in my SBOM, when should I react, when do they match?

Like above, if there's no arch in the vulnerability report - does it match ALL architectures then?
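As a starting point for comparison, two PURL strings can at least be reduced to a canonical form before testing equality. This sketch only lowercases the scheme and type, lowercases qualifier keys, drops empty-valued qualifiers and sorts qualifiers by key; it applies no per-type rules (such as name case folding), and `canonicalize` and `same_purl` are hypothetical names. It covers strict equality only and does not answer whether a missing `arch` should match every architecture.

```python
def canonicalize(purl: str) -> str:
    """Return a canonical string form suitable for strict equality checks."""
    scheme, _, rest = purl.partition(":")
    rest, _, subpath = rest.partition("#")
    rest, _, query = rest.partition("?")
    ptype, _, remainder = rest.lstrip("/").partition("/")
    # Lowercase qualifier keys, drop empty values, sort by key.
    pairs = sorted(
        (key.lower(), value)
        for key, _, value in (item.partition("=") for item in query.split("&") if item)
        if value
    )
    out = f"{scheme.lower()}:{ptype.lower()}/{remainder}"
    if pairs:
        out += "?" + "&".join(f"{key}={value}" for key, value in pairs)
    if subpath:
        out += "#" + subpath
    return out


def same_purl(left: str, right: str) -> bool:
    """Strict comparison: equal only if the canonical forms are identical."""
    return canonicalize(left) == canonicalize(right)
```

A looser "does this advisory match my SBOM entry" check would have to be layered on top of this, with its matching policy stated explicitly.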

@bureado

bureado commented Nov 9, 2023

@oej,

Like above, if there's no arch in the vulnerability report - does it match ALL architectures then?

If I were to implement a VA tool, then yes, I would take no explicit arch as a need to match all arches.

5 participants