Detection Evasion with Unicode #212
Comments
Interesting post! I've seen this solved a few ways, one of them being what you suggest. The preprocessing/replacement part can be tricky, as it could break functionality if you incorrectly replace a piece of Unicode.
Problem
Hi! I just read an interesting article on how bad actors can evade text-based static analysis tools using Unicode. Since PEP 3131, Python has allowed programmers to use non-ASCII characters "to define classes and functions with names in their native languages". As a consequence, there are now many ways an identifier like eval can be written. (See: https://lingojam.com/BoldTextGenerator)
Proposal
Guarddog could preprocess all source files by converting any Unicode to ASCII. According to the PEP, "All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers is based on NFKC."
Alternatively, Guarddog could define a new heuristic that warns if non-ASCII characters are found.
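Both options can be sketched as below. The helper names are hypothetical, not part of Guarddog's API, and the first one carries the breakage risk noted in the comment above: folding the whole file also rewrites string literals, so a real implementation would likely normalize identifiers only (e.g. via the tokenize module).

```python
import unicodedata

def nfkc_preprocess(source: str) -> str:
    # Hypothetical preprocessing pass: fold the whole file to NFKC
    # before the sourcecode rules run. Caveat: this also rewrites
    # string literals, which may change program behavior.
    return unicodedata.normalize("NFKC", source)

def non_ascii_positions(source: str) -> list:
    # Hypothetical heuristic: report every non-ASCII character with
    # its offset, which a new rule could surface as a warning.
    return [(i, ch) for i, ch in enumerate(source) if ord(ch) > 0x7F]
```

For example, nfkc_preprocess('𝐞𝐯𝐚𝐥("1 + 1")') yields a string containing plain "eval", and non_ascii_positions reports the four bold letters.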
Test
Use the generator above to convert e and obtain 𝐞, substitute it into tests/analyzer/sourcecode/code-execution.py, and then run the rules with the following command:
semgrep --metrics off --test --config guarddog/analyzer/sourcecode tests/analyzer/sourcecode
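The expected failure mode can also be checked directly, without Semgrep: a naive ASCII pattern (a stand-in for a text-based rule, not Guarddog's actual rule) misses the obfuscated call until the source is normalized.

```python
import re
import unicodedata

# Obfuscated call using MATHEMATICAL BOLD letters in place of "eval".
sample = '𝐞𝐯𝐚𝐥("1 + 1")'

# The raw source contains no ASCII "eval", so the pattern misses it...
assert re.search(r"\beval\(", sample) is None

# ...but after NFKC preprocessing the same pattern matches.
normalized = unicodedata.normalize("NFKC", sample)
assert re.search(r"\beval\(", normalized) is not None
```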