Raw identifier syntax (#3797)
We want to support legacy identifiers that overlap with new keywords
(for example, `base`). This is being called "raw identifier syntax"
using `r#<identifier>`, and is based on
[Rust](https://doc.rust-lang.org/reference/identifiers.html).

Note this proposal is derived from [Proposal #17: Lexical
conventions](#17).

Co-authored-by: zygoloid <[email protected]>

---------

Co-authored-by: Carbon Infra Bot <[email protected]>
jonmeow and CarbonInfraBot committed May 1, 2024
1 parent ecdecf4 commit 6907870
Showing 1 changed file with 274 additions and 0 deletions.
274 changes: 274 additions & 0 deletions proposals/p3797.md
@@ -0,0 +1,274 @@
# Raw identifier syntax

<!--
Part of the Carbon Language project, under the Apache License v2.0 with LLVM
Exceptions. See /LICENSE for license information.
SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-->

[Pull request](https://github.com/carbon-language/carbon-lang/pull/3797)

<!-- toc -->

## Table of contents

- [Abstract](#abstract)
- [Problem](#problem)
- [Background](#background)
  - [Prior discussion](#prior-discussion)
  - [Other languages](#other-languages)
- [Proposal](#proposal)
  - [Diagnostics](#diagnostics)
- [Rationale](#rationale)
- [Alternatives considered](#alternatives-considered)
  - [Other raw identifier syntaxes](#other-raw-identifier-syntaxes)
  - [Restrict raw identifier syntax to current and future keywords](#restrict-raw-identifier-syntax-to-current-and-future-keywords)
  - [Don't require syntax for references to raw identifiers](#dont-require-syntax-for-references-to-raw-identifiers)
  - [Don't provide raw identifier syntax](#dont-provide-raw-identifier-syntax)

<!-- tocstop -->

## Abstract

We want to support legacy identifiers that overlap with new keywords (for
example, `base`). This proposal adds "raw identifier syntax", written
`r#<identifier>`, based on
[Rust](https://doc.rust-lang.org/reference/identifiers.html).

Note this proposal is derived from
[Proposal #17: Lexical conventions](https://github.com/carbon-language/carbon-lang/pull/17).

## Problem

One of Carbon's most important goals is to support program and language
evolution. We know that the set of keywords in Carbon will grow over time, and
the easiest kind of language change from an evolutionary perspective is one that
is known to break no programs, that lets programs migrate incrementally to the
new language rule, and that either has no migration cost or only imposes
automatable migration cost on the code that intends to use the new feature.

## Background

### Prior discussion

We have proposals that discussed using `r#` but did not make a decision in favor
of it:

- [Proposal #17: Lexical conventions](https://github.com/carbon-language/carbon-lang/pull/17)
  originally proposed it, but when it was split into multiple proposals, raw
  identifiers were not retained.
  - This proposal copies substantial parts of its text from that proposal.
- [Proposal #2107: Clarify rules around `Self` and `.Self`](https://github.com/carbon-language/carbon-lang/pull/2107)
  mentions `r#` syntax as proposed but not in use.

### Other languages

[Rust](https://doc.rust-lang.org/reference/identifiers.html) provides this as
"Raw identifiers", using `r#` as a prefix (`r#match`). The documented syntax is:

```
RAW_IDENTIFIER : r# IDENTIFIER_OR_KEYWORD Except crate, self, super, Self
```

[C#](https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/tokens/verbatim)
provides this as "verbatim identifiers", using `@` as a prefix (`@self`). The
[documented syntax](https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/language-specification/lexical-structure#643-identifiers)
is:

```
fragment Escaped_Identifier
    // Includes keywords and contextual keywords prefixed by '@'.
    // See note below.
    : '@' Basic_Identifier
    ;
```

[Swift](https://docs.swift.org/swift-book/documentation/the-swift-programming-language/lexicalstructure/#Identifiers)
provides this as part of the identifier grammar, using backticks (\`self\`). The
documented syntax is:

```
identifier → `identifier-head identifier-characters?`
```

## Proposal

A _raw identifier_ can be specified by prefixing a word with `r#`, such as
`r#requires`. Raw identifiers can be used to introduce and use names that are
lexically identical to keywords. Declaring a raw identifier does not prevent
the unprefixed word from being interpreted as a keyword. In all other respects,
a raw identifier behaves identically to the word formed by removing the `r#`
prefix.
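The rule above can be sketched as a small word lexer. This is an illustrative Python model, not the Carbon toolchain's lexer: the `KEYWORDS` set is a hypothetical subset, `lex_word` is an invented helper, and the handling of `r#` that is not followed by a word is an assumption.

```python
import re

# Hypothetical subset of Carbon keywords, for illustration only.
KEYWORDS = {"base", "class", "extend", "fn", "requires", "var"}

# A word: a letter or underscore followed by letters, digits, or underscores.
WORD = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def lex_word(text, pos=0):
    """Lex one word at `pos` and return (kind, name, end_pos).

    kind is "keyword" or "identifier". A raw identifier is reported as an
    "identifier" whose name has the `r#` prefix removed.
    """
    if text.startswith("r#", pos):
        match = WORD.match(text, pos + 2)
        if match:
            # `r#word` always names an identifier, even if `word` is a keyword.
            return ("identifier", match.group(), match.end())
        # Assumption: `r#` without a following word falls through, so `r`
        # is lexed as an ordinary word.
    match = WORD.match(text, pos)
    if match is None:
        raise ValueError(f"no word at position {pos}")
    name = match.group()
    kind = "keyword" if name in KEYWORDS else "identifier"
    return (kind, name, match.end())
```

In this model, `lex_word("requires")` yields a keyword while `lex_word("r#requires")` yields an identifier named `requires`, and `r#foo` and `foo` produce the same identifier.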

### Diagnostics

In diagnostics, if `<identifier>` is a keyword, then the raw identifier
`r#<identifier>` should be expected to print with the `r#` prefix. Otherwise,
raw identifiers will typically print as the non-prefixed identifier name for
consistency.
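A minimal sketch of this printing rule, assuming a hypothetical keyword set and an invented `diagnostic_name` helper (neither is taken from the toolchain):

```python
# Hypothetical subset of Carbon keywords, for illustration only.
KEYWORDS = {"base", "class", "extend", "fn", "requires", "var"}


def diagnostic_name(name):
    """Return the spelling of identifier `name` for use in a diagnostic.

    Names that collide with a keyword keep the `r#` prefix; all other names
    print without it, even if the source spelled them with `r#`.
    """
    return f"r#{name}" if name in KEYWORDS else name
```

Under these assumptions, an identifier declared as `r#base` prints as `r#base`, while one declared as `r#foo` prints as `foo`.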

## Rationale

- [Software and language evolution](/docs/project/goals.md#software-and-language-evolution)
  - Raw identifier syntax provides a way to add keywords to the language
    while still offering code a reasonable upgrade path, which can also be
    automated.
- [Code that is easy to read, understand, and write](/docs/project/goals.md#code-that-is-easy-to-read-understand-and-write)
  - The `r#` syntax is consistent with raw string literals, and should
    signal to readers that something unusual is being done.
- [Interoperability with and migration from existing C++ code](/docs/project/goals.md#interoperability-with-and-migration-from-existing-c-code)
  - C++ code using identifiers that are keywords in Carbon can use raw
    identifier syntax.
  - The converse does not work: if Carbon code has an identifier that is a
    C++ keyword, it needs to be renamed for use from C++ code.

## Alternatives considered

### Other raw identifier syntaxes

When considering other syntaxes, a couple of initial considerations for
`r#identifier` prefixing are:

- We use `#` prefixes for
  [string literals](/docs/design/lexical_conventions/string_literals.md), and
  it's likely we'll support syntax similar to `f#"..."` for interpolated
  string literals. The `r#` syntax offers consistency with this, and will
  hopefully be recognizable to users.
- Consistency with Rust.
  - Rust uses `r#"..."` for raw string literals, whereas Carbon uses
    `#"..."`.
- Introduces another code execution path in lexing identifiers. This likely
  causes a slowdown;
  [PR #3044](https://github.com/carbon-language/carbon-lang/pull/3344)
  indicates roughly 2%, although that was run on a system with noisy
  benchmarks -- details would require a better system for benchmarking. Note
  2% could represent that `r` is 1-in-55 identifiers with a 100% slowdown with
  linear cost scaling for other similar code, or it could indicate that the
  additional code path causes incremental slowdown but if other code (such as
  `f#"..."`) used the same code path it may instead have constant cost scaling
  (negligible incremental cost). This may also be either reduced or become
  more significant if we enable tail calls and other optimizations. As a
  consequence, the precise overhead is difficult to quantify at this time.
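As a rough check on the first reading of the 2% figure, the arithmetic can be worked through directly. This sketch assumes lexing cost is uniform per identifier and that only the affected identifiers slow down; `overall_slowdown` is an invented helper, and the numbers are the ones quoted above rather than new measurements.

```python
def overall_slowdown(affected_fraction, per_identifier_slowdown):
    """Aggregate slowdown when only a fraction of identifiers get slower.

    Assumes every identifier otherwise costs the same to lex, which is a
    simplification for illustration.
    """
    return affected_fraction * per_identifier_slowdown


# 1 in 55 identifiers affected, each taking 100% longer to lex.
print(f"{overall_slowdown(1 / 55, 1.0):.1%}")  # prints "1.8%"
```

That result is close to the measured 2%, which is why the benchmark alone cannot distinguish this reading from the constant-cost one.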

Various other prefixes have been discussed, mostly using a special character
prefix in order to restrict the lexing impact. In particular:

- `\` prefix, as in `\identifier`.
  - Similar to `\` escaping in strings.
  - More intuitive "escaping" semantic for some developers versus `r#`.
  - Creates a different meaning for `\n` as an identifier versus `\n` as a
    character escape.
    - Some of this could be addressed by restricting `\` raw identifiers
      to only keywords in the language, meaning `\n` would only be a
      character escape. The alternative
      [Restrict raw identifier syntax to current and future keywords](#restrict-raw-identifier-syntax-to-current-and-future-keywords)
      applies to this solution.
- `#` prefix without `r`, as in `#identifier`.
  - Would be more consistent with string literals, and avoid the lexing
    overhead.
  - We are considering using a `#` prefix for metaprogramming, so the `r`
    offers a way to keep the `#` prefix available for other purposes.
  - `#if` may look to C++ developers like a compiler directive, rather than
    a raw identifier for `if`.
- `@` prefix, as in `@identifier`.
  - Consistent with C#.
  - We've also discussed using a `@` prefix for attributes, similar to
    Python. Similar to `#`, this would be conflicting.
- `` ` `` wrapping, as in `` `identifier` ``.
  - Consistent with Swift.
  - We prefer not to use backticks for Carbon syntax so that it is easy to
    write in Markdown, which uses backticks for inline code. For example, to
    render a backtick there are a couple options:
    - Use more backticks: ``` `` ` `` ```
    - Use inline HTML: ``<code>\`</code>``
- Other currently unused characters as prefix, such as `~identifier`,
  `$identifier`, or `%identifier`.
  - We expect raw identifiers to be relatively rare. There may be future
    uses for these characters that allow us to serve a broader use-case.
  - While we could change raw string literal syntax to use the same
    character, it would be helpful if raw string literal syntax had some
    degree of cross-language syntactic consistency in order to reduce
    learning curves.

Raw identifier syntax is expected to be an edge case of the language. As a
consequence, it should probably be expected that developers reading it will be
more likely to rely on their understanding of the syntax either from other parts
of Carbon, or from other languages. This means it's helpful if the syntax can be
understood on its own, but if it's confusable with C++ syntax, the relative
rarity could exacerbate understandability issues.

If performance of the `r#` prefix is prohibitive, that would be a justification
for changing approaches.

### Restrict raw identifier syntax to current and future keywords

We had discussed maintaining a list of current and future keywords, and only
allowing raw identifier syntax in those cases. If this were done as part of the
toolchain, releases would need to push versions that "declare" future keywords
without turning them into actual keywords. For a library that used those
identifiers, it would initially be compatible with compiler versions up to and
including the "future" keyword version; upon using raw identifier syntax, that
would become the minimum compiler version. This creates a compiler versioning
dependency that it might be helpful to avoid.
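The versioning dependency described above can be sketched with a toy model. The toolchain versions, keyword sets, and the `accepts` helper are all invented for illustration; version 2 is the release that "declares" `base` as a future keyword, and version 3 makes it an actual keyword.

```python
# Hypothetical toolchain releases and their keyword lists.
TOOLCHAINS = {
    1: {"keywords": {"class", "fn", "var"}, "future": set()},
    2: {"keywords": {"class", "fn", "var"}, "future": {"base"}},
    3: {"keywords": {"base", "class", "fn", "var"}, "future": set()},
}


def accepts(version, spelling):
    """Return True if toolchain `version` accepts `spelling` as an identifier."""
    toolchain = TOOLCHAINS[version]
    if spelling.startswith("r#"):
        # Restricted rule: `r#` is only allowed on current or future keywords.
        name = spelling[2:]
        return name in toolchain["keywords"] or name in toolchain["future"]
    # A plain word is an identifier only while it is not yet a keyword.
    return spelling not in toolchain["keywords"]


plain = sorted(v for v in TOOLCHAINS if accepts(v, "base"))
raw = sorted(v for v in TOOLCHAINS if accepts(v, "r#base"))
print(plain, raw)  # prints "[1, 2] [2, 3]"
```

In this model, a library spelling the name as `base` builds with versions 1 and 2, and after switching to `r#base` it builds with versions 2 and 3, so the only overlap is the release that declares the future keyword.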

As an alternative approach, Carbon could provide a command line option which
libraries could use to specify future keywords that are used in the program.
While some systems such as `bazel` allow libraries to indicate options they need
for compilation, other build systems such as `cmake` might require library users
to update their dependencies as well. The consequence would be that library
users might need to more carefully monitor options when updating compilers.

### Don't require syntax for references to raw identifiers

We could say that, in a scope where a raw identifier has been declared, the
token without `r#` now refers to the identifier instead of the keyword. If the
user actually needs the keyword within that scope, they could instead use `k#`
or something similar.

A particular example of this can be seen with the `base` keyword:

```
class C {
  // `base` now means this name in the scope of `C`.
  var r#base: i32;
  // To extend, `k#base` is now required.
  extend k#base: T;
}
fn MakeC() -> C {
  // The struct literal's `base` is outside the scope of `C`, so must use
  // `r#base`.
  var c: C = {.r#base = 0, .base = { ... }};
  // A member reference could use the identifier-default for `base` in `C`.
  c.base = 1;
  c.k#base = {...};
  return c;
}
```

The equivalent under proposed syntax (uniformly using `r#base`) is:

```
class C {
  var r#base: i32;
  extend base: T;
}
fn MakeC() -> C {
  var c: C = {.r#base = 0, .base = { ... }};
  c.r#base = 1;
  c.base = {...};
  return c;
}
```

At present, we consider this unnecessary complexity, and think it's better to
require `r#` in all references to the identifier.
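To illustrate where the complexity comes from, the two rules can be compared as classification functions. This is a hypothetical Python model, not toolchain code: `KEYWORDS` is an invented subset, and `raw_names_in_scope` stands in for whatever name lookup the alternative would need.

```python
# Hypothetical subset of Carbon keywords, for illustration only.
KEYWORDS = {"base", "class", "extend", "fn", "var"}


def classify_proposed(spelling):
    """Proposed rule: classification depends only on the spelling."""
    if spelling.startswith("r#"):
        return ("identifier", spelling[2:])
    kind = "keyword" if spelling in KEYWORDS else "identifier"
    return (kind, spelling)


def classify_alternative(spelling, raw_names_in_scope):
    """Rejected alternative: classification also depends on the scope."""
    if spelling.startswith("k#"):
        return ("keyword", spelling[2:])
    if spelling.startswith("r#"):
        return ("identifier", spelling[2:])
    if spelling in raw_names_in_scope:
        # A raw declaration in scope turns the plain word into an identifier.
        return ("identifier", spelling)
    kind = "keyword" if spelling in KEYWORDS else "identifier"
    return (kind, spelling)
```

Under the alternative, the same token `base` is an identifier inside `C` and a keyword outside it, so the lexer or parser would need scope information that the proposed rule never consults.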

### Don't provide raw identifier syntax

We could omit raw identifier syntax. It introduces a novel risk of underhanded
code that appears to mean one thing but means a different thing, by shadowing a
keyword with an identifier. This risk is discussed in
[Initial Analysis of Underhanded Source Code (Wheeler 2020)](https://www.ida.org/-/media/feature/publications/i/in/initial-analysis-of-underhanded-source-code/d-13166.ashx)
(page 4-2).

This concern is considered non-blocking.
