
X509Name.get_components() doesn't decode Unicode Subject values the way X509Name.__getattr__() does #1305

Open
zeriny opened this issue May 6, 2024 · 1 comment


zeriny commented May 6, 2024

Hello,

I recently encountered a problem when parsing X.509 certificates with Unicode in the Subject DN fields.

An example PEM certificate that shows the problem (sha1=a324f399248e42e218ec40ae771bf27c4f5aea1d):
-----BEGIN CERTIFICATE-----
MIIFKjCCBBKgAwIBAgIBczANBgkqhkiG9w0BAQsFADBHMQswCQYDVQQGEwJVUzEW
MBQGA1UEChMNR2VvVHJ1c3QgSW5jLjEgMB4GA1UEAxMXR2VvVHJ1c3QgRVYgU1NM
IENBIC0gRzUwHhcNMTQxMTE2MTI0MzI1WhcNMTUwNzIxMDE1MzAyWjCB/TEdMBsG
A1UEDxMUUHJpdmF0ZSBPcmdhbml6YXRpb24xEzARBgsrBgEEAYI3PAIBAxMCREUx
GzAZBgsrBgEEAYI3PAIBAR4K/v8ASwD2AGwAbjESMBAGA1UEBRMJSFJCIDIxNjIw
MQswCQYDVQQGEwJERTEcMBoGA1UECBMTTm9yZHJoZWluLVdlc3RmYWxlbjEOMAwG
A1UEBxMFS29lbG4xODA2BgNVBAoTL1lhemFraSBFdXJvcGUgTGltaXRlZCwgWndl
aWduaWVkZXJsYXNzdW5nIEtvZWxuMSEwHwYDVQQDExhtYXRyaXgueWF6YWtpLWV1
cm9wZS5jb20wggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQDAqFmrBTfc
W2rj8JKBjp48snoSsWCUE/Sbt53eotH/LwngaLMoZpx4s2KD4G6SfD1NlooJaUAF
yDwT/2g4EaRCUN8RiRoPlilXGJosKi2evS3+rjCvd05Zy+v24hQR9MvH6ZRL2ArC
xG7yLl1WM2hNCBtytbuMyZoT4IToEqZl+mO1ev5eT2oiPRYnUT5r3Ok6LqiW12lp
b9L7rPBJJdrsdnA2FLJeC20/+rSWOryeDOYhdTPdNTReK1b4aNAGGYhkKdUQlba9
8dz0DJBi5MOkehkfYZTZqsNtFDE+rWDtuT5Q/5rGbAoUVhD1qdMiv5Tr6KcB2oe8
9WwhWdfQR4RLAgMBAAGjggFoMIIBZDAfBgNVHSMEGDAWgBQIJQchR2wx/ghYDqbq
L+fbG2JoyjBXBggrBgEFBQcBAQRLMEkwHwYIKwYBBQUHMAGGE2h0dHA6Ly9neC5z
eW1jZC5jb20wJgYIKwYBBQUHMAKGGmh0dHA6Ly9neC5zeW1jYi5jb20vZ3guY3J0
MA4GA1UdDwEB/wQEAwIFoDAdBgNVHSUEFjAUBggrBgEFBQcDAQYIKwYBBQUHAwIw
IwYDVR0RBBwwGoIYbWF0cml4LnlhemFraS1ldXJvcGUuY29tMCsGA1UdHwQkMCIw
IKAeoByGGmh0dHA6Ly9neC5zeW1jYi5jb20vZ3guY3JsMAwGA1UdEwEB/wQCMAAw
WQYDVR0gBFIwUDBOBgkrBgEEAfAiAQYwQTA/BggrBgEFBQcCARYzaHR0cHM6Ly93
d3cuZ2VvdHJ1c3QuY29tL3Jlc291cmNlcy9yZXBvc2l0b3J5L2xlZ2FsMA0GCSqG
SIb3DQEBCwUAA4IBAQCGqxvB42yVVQlneK7RNXM1pkFYYmwAnFbbLEPhOLoQOo/K
mk8k4X8pDEA6I6x73k7ejTDYdZUsEjEM3r1BJF2/XjPTB9rbfKqC518dyYVrtcdN
rUrb07ruRxS+scLFaYLztI42HQEeCVx+AaGWVrkZsz9oWY8k3WzCW8SQRQImLzVD
8z9rWEcCgDtGqjlrtmhlMFfVcP5bgBi5b8AbCDvhXJ3BThPGM7Ct/QCRzYXwr8WT
Tu9+isD+7UT+j9UzAhQKOw8jsaDblBG+ABNGJq1Egv19HxUpb+Toj5amY0NbZjbg
PRC+vKC1qyo5gXWj8ODHRvSLZ8aRueg5X4VdrvGN
-----END CERTIFICATE-----

In this certificate, the value of the Subject.jurisdictionLocalityName field is '\ufeffKöln'.

I first tried to read this field using certobj.get_subject().jurisdictionL (which internally calls the __getattr__() function) and retrieved the correct value ('\ufeffKöln'.encode('utf-8') is b'\xef\xbb\xbfK\xc3\xb6ln').

However, when I read the same field through certobj.get_subject().get_components(), which returns a list of (name, value) pairs, the value of the jurisdictionL field is b'\xfe\xff\x00K\x00\xf6\x00l\x00n', which cannot be decoded as UTF-8.

I traced this inconsistency through the source code and found the following:
X509Name.__getattr__() converts the string with _lib.ASN1_STRING_to_UTF8. In contrast, X509Name.get_components() directly calls ASN1_STRING_get0_data and ASN1_STRING_length, which return the raw bytes in the string's original ASN.1 encoding, so they are not necessarily valid UTF-8.
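To illustrate the difference, here is a minimal sketch in plain Python (no pyOpenSSL calls). It assumes the field is stored as an ASN.1 BMPString (universal tag 30), which is what the leading b'\xfe\xff' byte order mark suggests, and that converting to UTF-8 amounts to decoding the raw bytes with the codec that matches the ASN.1 string type. The ASN1_STRING_CODECS table and the asn1_string_to_text helper are hypothetical names made up for this example, not part of any library, and the table covers only a few common string types.

```python
# Hypothetical mapping from ASN.1 universal tag numbers to Python codecs.
# Only a few common DN string types are listed; this is not exhaustive.
ASN1_STRING_CODECS = {
    12: "utf-8",      # UTF8String
    19: "ascii",      # PrintableString
    22: "ascii",      # IA5String
    28: "utf-32-be",  # UniversalString
    30: "utf-16-be",  # BMPString (UCS-2, which utf-16-be can decode)
}


def asn1_string_to_text(tag: int, data: bytes) -> str:
    """Decode raw ASN.1 string bytes using the codec for the given tag."""
    return data.decode(ASN1_STRING_CODECS[tag])


# Raw value returned by get_components() for jurisdictionL in this issue.
raw = b"\xfe\xff\x00K\x00\xf6\x00l\x00n"

# Assumption: the value is a BMPString (tag 30).
text = asn1_string_to_text(30, raw)
print(repr(text))            # '\ufeffKöln'
print(text.encode("utf-8"))  # b'\xef\xbb\xbfK\xc3\xb6ln'
```

Under these assumptions the decoded text and its UTF-8 encoding match what the attribute access returns, which suggests both methods read the same data and differ only in whether the conversion step is applied.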

I'm not familiar with Unicode, so I'd like to know whether this is a bug and which method is the correct way to parse an X.509 subject.

zeriny commented May 6, 2024

Code:

from OpenSSL import crypto

# pem holds the PEM certificate shown above
certobj = crypto.load_certificate(crypto.FILETYPE_PEM, pem)
subject_obj = certobj.get_subject()

# Attribute access goes through __getattr__() and returns a str
subject_jurisdictionLocalityName1 = subject_obj.jurisdictionL
print(subject_jurisdictionLocalityName1)

# get_components() returns a list of (name, value) byte-string pairs
subjects = subject_obj.get_components()
for subject in subjects:
    try:
        key = subject[0].decode()
        if key == 'jurisdictionL':
            print(subject[1])
            subject_jurisdictionLocalityName2 = subject[1].decode("utf-8")
            print(subject_jurisdictionLocalityName2)
    except Exception as e:
        print(e)

Output:
'\ufeffKöln'
b'\xfe\xff\x00K\x00\xf6\x00l\x00n'
UnicodeDecodeError('utf-8', b'\xfe\xff\x00K\x00\xf6\x00l\x00n', 0, 1, 'invalid start byte')
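
As a possible stopgap on the caller's side, the loop above could fall back to UTF-16BE when UTF-8 decoding fails and the value starts with a big-endian byte order mark. This is only a heuristic sketch: decode_component_value is a hypothetical helper, the sample pairs are hard-coded to mimic the shape of the get_components() result, and a UTF-16 value without a BOM that happens to be valid UTF-8 would still be decoded incorrectly.

```python
def decode_component_value(value: bytes) -> str:
    """Heuristically decode one value taken from a (name, value) pair."""
    try:
        return value.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: a leading FE FF marks UTF-16BE (BMPString) data.
        if value.startswith(b"\xfe\xff"):
            return value.decode("utf-16-be")
        raise


# Sample data shaped like the get_components() result for this certificate.
components = [
    (b"C", b"DE"),
    (b"jurisdictionL", b"\xfe\xff\x00K\x00\xf6\x00l\x00n"),
]

decoded = {name.decode(): decode_component_value(value)
           for name, value in components}
print(decoded)
```

With this fallback the jurisdictionL entry decodes to the same string as the attribute access, BOM included.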
