Add scheme-dependent and file-system dependent URI normalization to URIResolverRegistry #1944

Open
jurgenvinju opened this issue May 13, 2024 · 1 comment

Comments

@jurgenvinju
Member

jurgenvinju commented May 13, 2024

Is your feature request related to a problem? Please describe.

There are many reasons for aliases in source locations:

  • soft and hard links (symlinks)
  • repeated mounts of the same filesystem
  • case insensitivity
  • etc.

These are semantic properties of file systems, not syntactic ones. You need an actively running
file system, with the file actually on it, to know what the aliases are and how
they might be normalized.
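
To illustrate why this is a semantic rather than a syntactic property, here is a minimal sketch using only the JDK's `java.nio.file` API. The class name `AliasDemo` and the helper `isAlias` are made up for this example and are not part of Rascal. Two paths that differ as strings can still denote the same file, and the question can only be answered when the files exist on a live file system.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only; not part of Rascal.
final class AliasDemo {
    /**
     * Asks the file system whether two paths denote the same file.
     * For distinct paths this throws an IOException (for example
     * NoSuchFileException) when a file is missing.
     */
    static boolean isAlias(Path a, Path b) throws IOException {
        return Files.isSameFile(a, b);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("alias-demo");
        Path target = Files.createFile(dir.resolve("target.txt"));
        // Creating symbolic links may fail on platforms or accounts
        // that do not permit it.
        Path link = Files.createSymbolicLink(dir.resolve("link.txt"), target);

        System.out.println("syntactically equal: " + link.equals(target));
        System.out.println("same file on disk:   " + isAlias(link, target));
    }
}
```

Comparing the `Path` values alone cannot detect the alias; only the call that consults the file system can.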

Loc aliases are detrimental to downstream analysis in Rascal, as locs are pretty much always used
as identities.

Describe the solution you'd like

I'd like an additional method to URIResolverRegistry: normalize(ISourceLocation x),
which would be implemented by dispatching to ISourceLocationInput::normalize(ISourceLocation x) via the scheme,
and then making this available to Rascal users via loc Location::normalize(loc l).
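
A rough sketch of the dispatch described above, under loud assumptions: `LocationInput` and `RegistrySketch` are simplified stand-ins invented for this example, and `java.net.URI` replaces `ISourceLocation` so the code is self-contained. The real `URIResolverRegistry` and `ISourceLocationInput` interfaces will differ.

```java
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for ISourceLocationInput, reduced to the proposed method.
interface LocationInput {
    String scheme();

    // Default behavior: no normalization, the location is returned unchanged.
    default URI normalize(URI loc) {
        return loc;
    }
}

// Hypothetical stand-in for URIResolverRegistry.
final class RegistrySketch {
    private final Map<String, LocationInput> inputs = new ConcurrentHashMap<>();

    void register(LocationInput input) {
        inputs.put(input.scheme(), input);
    }

    // Dispatches on the scheme; unknown or missing schemes fall back to identity.
    URI normalize(URI loc) {
        String scheme = loc.getScheme();
        LocationInput input = scheme == null ? null : inputs.get(scheme);
        return input == null ? loc : input.normalize(loc);
    }
}
```

With this shape, a resolver that knows nothing about aliasing needs no changes, and callers that never invoke `normalize` see no difference in behavior.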

This way the user is able to fix possible issues with aliasing easily, without having to consider every
different way files could be aliases. Also they are not forced to use it.

Maybe normalize should also replace logical schemes with physical schemes (since those are also a source of aliases). But the jury is still out on this.

Describe alternatives you've considered

There is something to be said for normalizing at location creation time. However, there is not always
a file system available to normalize against, so this is not possible in general. It's better to let source locations remain
purely syntactical, and leave it to a normalize function to deal with the semantics of aliases.

Additional context

Typically people run into these things with case-insensitive file systems, but there are many ways to alias files. The more
we use Rascal for IO, and on different systems with different OSes and file systems, the more often we run into these issues.

Bad news

Implementing normalization is a lot of detailed research work for each scheme.

Good news

We might implement a default that does nothing, and start incrementally adding normalization. If we start with the file scheme, then we quickly saturate at 80% of all the schemes.
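
As a sketch of what a first step for the file scheme could look like, the function below leaves every other scheme untouched and resolves `file` locations through the JDK's `Path.toRealPath()`. The class and method names are hypothetical, `java.net.URI` stands in for `ISourceLocation`, and returning the input unchanged on failure is an assumption of this example, not a decision made in this issue.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.FileSystemNotFoundException;
import java.nio.file.Paths;

// Hypothetical sketch; not the actual Rascal implementation.
final class FileNormalizeSketch {
    static URI normalizeFile(URI loc) {
        // Default for every other scheme: do nothing.
        if (!"file".equalsIgnoreCase(loc.getScheme())) {
            return loc;
        }
        try {
            // toRealPath consults the file system and resolves symbolic links.
            // It fails when the file does not exist.
            return Paths.get(loc).toRealPath().toUri();
        } catch (IOException | IllegalArgumentException | FileSystemNotFoundException e) {
            // Assumption: with nothing to normalize against, keep the location as-is.
            return loc;
        }
    }
}
```

Note that `toRealPath()` does not unify hard links, since each hard link is a real path of its own, so this covers only part of the aliasing described above.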

@jurgenvinju
Member Author

Note that file name normalization is an "IO feature". We need to know all about the implementation parameters of the file system under the hood and not just the opaque location.

There will be some normalization steps that can be done without knowledge of the filesystem implementation, but predicting which is which should not be up to the user. A general normalize function that works specifically for each scheme will be easier to implement than some parallel hierarchy.
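
For contrast, here is a small example of a normalization step that needs no file system at all. It uses the JDK's `java.net.URI.normalize()`, which only removes `.` and `..` path segments; the wrapper class `SyntacticDemo` is made up for this illustration.

```java
import java.net.URI;

// Hypothetical wrapper for illustration only.
final class SyntacticDemo {
    // Purely syntactic: removes "." and ".." segments and never touches the disk.
    static URI syntactic(URI loc) {
        return loc.normalize();
    }

    public static void main(String[] args) {
        URI messy = URI.create("file:///a/b/../c/./d.txt");
        System.out.println(syntactic(messy).getPath());

        // Case differences survive syntactic normalization. Whether these two
        // denote the same file depends on the file system underneath.
        URI upper = syntactic(URI.create("file:///A/x"));
        URI lower = syntactic(URI.create("file:///a/x"));
        System.out.println(upper.equals(lower));
    }
}
```

A user cannot tell from the location alone whether this step is enough, which is the argument for a single scheme-specific normalize function.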

If a normalization scheme cannot be implemented, we are probably missing a specific scheme identifier. For example, unc:/// is very important to have next to file:/// for this purpose.
