Enable wide Unicode support for names #24

Open. 9 commits to merge into base: main
45 changes: 34 additions & 11 deletions dev/lib/factory-name.js
@@ -1,12 +1,14 @@
 /**
+ * @typedef {import('micromark-util-types').Code} Code
  * @typedef {import('micromark-util-types').Effects} Effects
  * @typedef {import('micromark-util-types').State} State
  * @typedef {import('micromark-util-types').TokenizeContext} TokenizeContext
  * @typedef {import('micromark-util-types').TokenType} TokenType
  */

-import {asciiAlpha, asciiAlphanumeric} from 'micromark-util-character'
-import {codes} from 'micromark-util-symbol'
+import {asciiAlphanumeric} from 'micromark-util-character'
+import {classifyCharacter} from 'micromark-util-classify-character'
+import {codes, constants} from 'micromark-util-symbol'

 /**
  * @this {TokenizeContext}
@@ -22,7 +24,7 @@ export function factoryName(effects, ok, nok, type) {

   /** @type {State} */
   function start(code) {
-    if (asciiAlpha(code)) {
+    if (allowedEdgeCharacter(code)) {
       effects.enter(type)
       effects.consume(code)
       return name
@@ -33,18 +35,39 @@ export function factoryName(effects, ok, nok, type) {

   /** @type {State} */
   function name(code) {
-    if (
-      code === codes.dash ||
-      code === codes.underscore ||
-      asciiAlphanumeric(code)
-    ) {
+    if (allowedCharacter(code)) {
       effects.consume(code)
       return name
     }

     effects.exit(type)
-    return self.previous === codes.dash || self.previous === codes.underscore
-      ? nok(code)
-      : ok(code)
+    return allowedEdgeCharacter(self.previous) ? ok(code) : nok(code)
Member

Shouldn’t dashes also be edge characters?

Author

If you mean allowed edge characters: that was forbidden by the spec previously. I kept it that way, but I don't mind changing it.

Currently, a name can neither start nor end with any punctuation or an underscore.

Is this something you suggest changing?
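
For illustration, the rule described above could be sketched as a standalone check on whole strings. This is only a sketch under assumptions: the helper names (isInnerCharacter, isEdgeCharacter, isValidName) are hypothetical and not part of this PR, punctuation is approximated with the Unicode property \p{P} and whitespace with \s (which may not match what classifyCharacter reports), and the string is iterated by code point rather than by the codes the tokenizer sees.

```javascript
// Hypothetical standalone sketch; NOT the code in dev/lib/factory-name.js.
// Assumption: \p{P} and \s approximate the punctuation and whitespace groups.
const punctuation = /\p{P}/u
const whitespace = /\s/u

// Inside a name: ASCII is limited to dash, dot, underscore, and alphanumerics;
// beyond ASCII everything except whitespace is accepted.
function isInnerCharacter(char) {
  if (char.codePointAt(0) <= 0x7f) {
    return /[-._0-9A-Za-z]/.test(char)
  }
  return !whitespace.test(char)
}

// At the start or end of a name: additionally no punctuation, no underscore.
function isEdgeCharacter(char) {
  return isInnerCharacter(char) && !punctuation.test(char) && char !== '_'
}

function isValidName(name) {
  const chars = Array.from(name) // split by code point, not by UTF-16 unit
  if (chars.length === 0) return false
  return (
    chars.every(isInnerCharacter) &&
    isEdgeCharacter(chars[0]) &&
    isEdgeCharacter(chars[chars.length - 1])
  )
}

console.log(isValidName('a-b'), isValidName('a-'), isValidName('a\u2014b'))
```

Under these assumptions a dash, dot, underscore, or em dash (U+2014) is fine in the middle of a name but not at either edge.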

Member

To refresh my memory: in name, this last part decides which character a name is allowed to exit after.
That check at the end is a different question from whether the first character is allowed to start a name.
Before, the end check was very different from the check in start: - and _ were allowed in names but not at the end.
Now the two checks are the same. I’m not sure that’s useful. Perhaps the last line should just be return ok(code)?
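
As a point of comparison, the checks that this PR removes can be written as one regular expression plus an end check. This is a sketch read off the removed lines of factory-name.js; the helper name oldValidName is hypothetical, and it only covers the name itself, not the rest of the directive.

```javascript
// Hypothetical helper, reconstructed from the removed lines of the diff:
// - start:  asciiAlpha
// - inside: dash, underscore, or asciiAlphanumeric
// - exit:   rejected when the previous character was a dash or an underscore
function oldValidName(name) {
  return /^[A-Za-z][-_0-9A-Za-z]*$/.test(name) && !/[-_]$/.test(name)
}

console.log(oldValidName('a9'), oldValidName('9a'), oldValidName('a-'))
```

Under the old rule a digit could end a name but not start one, and - and _ were valid inside a name but not at its end, which is the asymmetry between the start check and the exit check.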

Author

I tried return ok(code) at the end, but allowing the name to end with an underscore interferes with emphasis notation, so I reverted the commit.

I think there is no point in allowing punctuation at the end but not at the start. If we are going to allow punctuation, the rules for both edges should be (almost) the same.

Possible options:

  1. Leave as is
  2. Allow any punctuation at the start but forbid markdown special characters at the end. Those should include =, ~ (special in some flavours), _, *, parentheses, and perhaps a few more; basically any ASCII punctuation (in contrast to option 3, this is a blacklist below 128).
  3. Allow any punctuation at the start and end if it's beyond ASCII (in contrast to option 2, this is a whitelist above 127)
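
To make the difference between options 2 and 3 concrete, here is a rough sketch of the two edge checks. Everything in it is an assumption for discussion: the function names are hypothetical, the blacklist contains only the characters named above, and neither function is part of this PR.

```javascript
// Option 2 (sketch): blacklist. ASCII characters that are special in
// markdown may not end a name; any other name character may.
// Assumption: the set holds only the characters the comment names explicitly.
const markdownSpecial = new Set(['=', '~', '_', '*', '(', ')'])

function canEndNameOption2(char) {
  return !markdownSpecial.has(char)
}

// Option 3 (sketch): whitelist. Within ASCII only alphanumerics may start or
// end a name; any code point above 127 is accepted at the edges.
function canEdgeNameOption3(char) {
  return char.codePointAt(0) > 127 || /^[0-9A-Za-z]$/.test(char)
}

console.log(canEndNameOption2('-'), canEdgeNameOption3('-'))
```

With these definitions a name such as a- would be accepted under option 2 but rejected under option 3, while a trailing em dash (U+2014) would be accepted under both.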

   }
 }
+
+/**
+ * Checks if the character code is valid for a directive name
+ *
+ * @param {Code} code
+ **/
+function allowedCharacter(code) {
+  return code !== null && code <= codes.del
+    ? code === codes.dash ||
+        code === codes.dot ||
+        code === codes.underscore ||
+        asciiAlphanumeric(code)
+    : classifyCharacter(code) !== constants.characterGroupWhitespace
+}
+
+/**
+ * Checks if the character code is valid as a directive name start (or end)
+ *
+ * @param {Code} code
+ **/
+function allowedEdgeCharacter(code) {
+  return (
+    allowedCharacter(code) &&
+    classifyCharacter(code) !== constants.characterGroupPunctuation &&
+    code !== codes.underscore
+  )
+}
1 change: 1 addition & 0 deletions package.json
@@ -40,6 +40,7 @@
     "micromark-factory-space": "^2.0.0",
     "micromark-factory-whitespace": "^2.0.0",
     "micromark-util-character": "^2.0.0",
+    "micromark-util-classify-character": "^2.0.0",
     "micromark-util-symbol": "^2.0.0",
     "micromark-util-types": "^2.0.0",
     "parse-entities": "^4.0.0"
149 changes: 117 additions & 32 deletions test/index.js
@@ -43,12 +43,9 @@ test('micromark-extension-directive (syntax, text)', async function (t) {
     }
   )

-  await t.test(
-    'should not support a colon not followed by an alpha',
-    async function () {
-      assert.equal(micromark(':', options()), '<p>:</p>')
-    }
-  )
+  await t.test('should not support a lonely colon', async function () {
+    assert.equal(micromark(':', options()), '<p>:</p>')
+  })

   await t.test(
     'should support a colon followed by an alpha',
@@ -57,24 +54,17 @@ test('micromark-extension-directive (syntax, text)', async function (t) {
     }
   )

-  await t.test(
-    'should not support a colon followed by a digit',
-    async function () {
-      assert.equal(micromark(':9', options()), '<p>:9</p>')
-    }
-  )
+  await t.test('should support a colon followed by a digit', async function () {
+    assert.equal(micromark(':9', options()), '<p></p>')
+  })

   await t.test(
-    'should not support a colon followed by a dash',
+    'should not support a colon followed by a punctuation',
     async function () {
       assert.equal(micromark(':-', options()), '<p>:-</p>')
-    }
-  )
-
-  await t.test(
-    'should not support a colon followed by an underscore',
-    async function () {
       assert.equal(micromark(':_', options()), '<p>:_</p>')
+      assert.equal(micromark(':.', options()), '<p>:.</p>')
+      assert.equal(micromark(':—', options()), '<p>:—</p>') // Em dash
     }
   )

@@ -86,21 +76,43 @@ test('micromark-extension-directive (syntax, text)', async function (t) {
     assert.equal(micromark(':a-b', options()), '<p></p>')
   })

+  await t.test('should support unicode alphabets in name', async function () {
+    // Latin, Greek, Cyrillic respectively
+    assert.equal(micromark(':xγз', options()), '<p></p>')
+  })
+
+  await t.test('should support unicode accents inner name', async function () {
+    // (Decomposed) Combining Acute Accent in Cyrillic
+    assert.equal(micromark(':за́мок-чи-замо́к', options()), '<p></p>')
+  })
+
   await t.test(
-    'should *not* support a dash at the end of a name',
+    'should support unicode accents at the name end',
     async function () {
-      assert.equal(micromark(':a-', options()), '<p>:a-</p>')
+      // (Decomposed) Combining Circumflex Accent in Latin
+      assert.equal(micromark(':â', options()), '<p></p>')
     }
   )

-  await t.test('should support an underscore in a name', async function () {
-    assert.equal(micromark(':a_b', options()), '<p></p>')
+  await t.test('should support emojis in name', async function () {
+    assert.equal(micromark(':🌍', options()), '<p></p>')
+    assert.equal(micromark(':w🌍rld', options()), '<p></p>')
   })

+  await t.test('should support math symbols in name', async function () {
+    assert.equal(micromark(':𝜋∈ℝ', options()), '<p></p>') // Italic
+    assert.equal(micromark(':𝛑≈3.14', options()), '<p></p>') // Bold
+    assert.equal(micromark(':𝝅∉ℚ', options()), '<p></p>') // Bold italic
+    assert.equal(micromark(':𝞹≠3.14', options()), '<p></p>') // Sans bold italic
+  })
+
   await t.test(
-    'should *not* support an underscore at the end of a name',
+    'should *not* support punctuation at the end of a name',
     async function () {
+      assert.equal(micromark(':a-', options()), '<p>:a-</p>')
       assert.equal(micromark(':a_', options()), '<p>:a_</p>')
+      assert.equal(micromark(':a.', options()), '<p>:a.</p>')
+      assert.equal(micromark(':a—', options()), '<p>:a—</p>') // Em dash
     }
   )

@@ -411,25 +423,62 @@ test('micromark-extension-directive (syntax, leaf)', async function (t) {
   )

   await t.test(
-    'should not support two colons followed by a digit',
+    'should support two colons followed by a digit',
     async function () {
-      assert.equal(micromark('::9', options()), '<p>::9</p>')
+      assert.equal(micromark('::9', options()), '')
     }
   )

   await t.test(
-    'should not support two colons followed by a dash',
+    'should not support two colons followed by punctuation',
     async function () {
       assert.equal(micromark('::-', options()), '<p>::-</p>')
+      assert.equal(micromark('::_', options()), '<p>::_</p>')
+      assert.equal(micromark('::.', options()), '<p>::.</p>')
+      assert.equal(micromark('::—', options()), '<p>::—</p>') // Em dash
     }
   )

   await t.test('should support a digit in a name', async function () {
     assert.equal(micromark('::a9', options()), '')
   })

-  await t.test('should support a dash in a name', async function () {
+  await t.test('should support punctuation in a name', async function () {
     assert.equal(micromark('::a-b', options()), '')
+    assert.equal(micromark('::a-b', options()), '')
+    assert.equal(micromark('::a_b', options()), '')
+    assert.equal(micromark('::a.b', options()), '')
+    assert.equal(micromark('::a—b', options()), '')
   })

+  await t.test('should support unicode alphabets in name', async function () {
+    // Latin, Greek, Cyrillic respectively
+    assert.equal(micromark('::xγз', options()), '')
+  })
+
+  await t.test('should support unicode accents inner name', async function () {
+    // (Decomposed) Combining Acute Accent in Cyrillic
+    assert.equal(micromark('::за́мок-чи-замо́к', options()), '')
+  })
+
+  await t.test(
+    'should support unicode accents at the name end',
+    async function () {
+      // (Decomposed) Combining Circumflex Accent in Latin
+      assert.equal(micromark('::â', options()), '')
+    }
+  )
+
+  await t.test('should support emojis in name', async function () {
+    assert.equal(micromark('::🌍', options()), '')
+    assert.equal(micromark('::w🌍rld', options()), '')
+  })
+
+  await t.test('should support math symbols in name', async function () {
+    assert.equal(micromark('::𝜋∈ℝ', options()), '') // Italic
+    assert.equal(micromark('::𝛑≈3.14', options()), '') // Bold
+    assert.equal(micromark('::𝝅∉ℚ', options()), '') // Bold italic
+    assert.equal(micromark('::𝞹≠3.14', options()), '') // Sans bold italic
Comment on lines +478 to +481
Author

@wooorm the letter Pi you suggested is part of the Greek alphabet, which is already covered by the test:

should support unicode alphabets in name

In this test (and other similar tests), I added characters that have the word "mathematical" in their Unicode names.
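
The distinction can be checked directly in JavaScript. The sketch below compares the plain Greek letter with the four characters used in the math symbols test; the names greekPi, mathPis, and describeChar are illustrative only.

```javascript
const greekPi = 'π'
// Italic, bold, bold italic, sans bold italic (as labelled in the test)
const mathPis = ['𝜋', '𝛑', '𝝅', '𝞹']

function describeChar(char) {
  return {
    codePoint: 'U+' + char.codePointAt(0).toString(16).toUpperCase(),
    utf16Length: char.length, // number of UTF-16 code units
    isLetter: /^\p{L}$/u.test(char)
  }
}

console.log(describeChar(greekPi))
for (const char of mathPis) console.log(describeChar(char))
```

The plain π sits in the Greek and Coptic block and fits in one UTF-16 code unit, whereas the mathematical variants sit in the Mathematical Alphanumeric Symbols block and need a surrogate pair, so they may exercise a different path than the alphabets test.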

+  })
+
   await t.test(
@@ -773,25 +822,61 @@ test('micromark-extension-directive (syntax, container)', async function (t) {
   )

   await t.test(
-    'should not support three colons followed by a digit',
+    'should support three colons followed by a digit',
     async function () {
-      assert.equal(micromark(':::9', options()), '<p>:::9</p>')
+      assert.equal(micromark(':::9', options()), '')
     }
   )

   await t.test(
-    'should not support three colons followed by a dash',
+    'should not support three colons followed by punctuation',
     async function () {
       assert.equal(micromark(':::-', options()), '<p>:::-</p>')
+      assert.equal(micromark(':::_', options()), '<p>:::_</p>')
+      assert.equal(micromark(':::.', options()), '<p>:::.</p>')
+      assert.equal(micromark(':::—', options()), '<p>:::—</p>') // Em dash
     }
   )

   await t.test('should support a digit in a name', async function () {
     assert.equal(micromark(':::a9', options()), '')
   })

-  await t.test('should support a dash in a name', async function () {
+  await t.test('should support punctuation in a name', async function () {
     assert.equal(micromark(':::a-b', options()), '')
+    assert.equal(micromark(':::a_b', options()), '')
+    assert.equal(micromark(':::a.b', options()), '')
+    assert.equal(micromark(':::a—b', options()), '') // Em dash
   })

+  await t.test('should support unicode alphabets in name', async function () {
+    // Latin, Greek, Cyrillic respectively
+    assert.equal(micromark(':::xγз', options()), '')
+  })
+
+  await t.test('should support unicode accents inner name', async function () {
+    // (Decomposed) Combining Acute Accent in Cyrillic
+    assert.equal(micromark(':::за́мок-чи-замо́к', options()), '')
+  })
+
+  await t.test(
+    'should support unicode accents at the name end',
+    async function () {
+      // (Decomposed) Combining Circumflex Accent in Latin
+      assert.equal(micromark(':::â', options()), '')
+    }
+  )
+
+  await t.test('should support emojis in name', async function () {
+    assert.equal(micromark(':::🌍', options()), '')
+    assert.equal(micromark(':::w🌍rld', options()), '')
+  })
+
+  await t.test('should support math symbols in name', async function () {
+    assert.equal(micromark(':::𝜋∈ℝ', options()), '') // Italic
+    assert.equal(micromark(':::𝛑≈3.14', options()), '') // Bold
+    assert.equal(micromark(':::𝝅∉ℚ', options()), '') // Bold italic
+    assert.equal(micromark(':::𝞹≠3.14', options()), '') // Sans bold italic
+  })
+
   await t.test(