Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

set_len on a Vec<u8> of uninit is UB #79

Open
nico-abram opened this issue Jan 2, 2022 · 8 comments
Open

set_len on a Vec<u8> of uninit is UB #79

nico-abram opened this issue Jan 2, 2022 · 8 comments

Comments

@nico-abram
Copy link

encoding_rs currently has UB in the form of creating uninitialized u8's via set_len
Here are 2 examples where the UB is crystal clear:

encoding_rs/src/mem.rs

Lines 2007 to 2010 in dd9d99b

let mut vec = Vec::with_capacity(capacity);
unsafe {
vec.set_len(capacity);
}

encoding_rs/src/mem.rs

Lines 2044 to 2047 in dd9d99b

let mut vec = Vec::with_capacity(capacity);
unsafe {
vec.set_len(capacity);
}

set_len is also used in 7 functions in lib.rs, but I haven't looked at them very closely.

The docs for set_len explicitly say https://doc.rust-lang.org/std/vec/struct.Vec.html#method.set_len :

The elements at old_len..new_len must be initialized.

Some relevant discussion can be found here rust-lang/unsafe-code-guidelines#71

rustc itself has a lint specifically for this kind of thing: rust-lang/rust#75968

Using MaybeUninit::uninit().assume_init() is instant UB unless the target type is itself composed entirely of MaybeUninit

My understanding is this is currently considered UB, but this rule may be relaxed in the future to allow types where all bit patterns are valid to store uninitalized if they are not read from.

@5225225
Copy link

5225225 commented Jan 3, 2022

Hmm.. Having a look at this, it looks like functions like convert_latin1_to_utf8_partial will need to have a maybe_uninit variant added

/// # Safety
/// You may not deinitialise any T here.
unsafe fn mucast<T>(x: &mut [T]) -> &mut [MaybeUninit<T>] {
    &mut *(x as *mut [T] as *mut [MaybeUninit<T>])
}

pub fn convert_latin1_to_utf8_partial(src: &[u8], dst: &mut [u8]) -> (usize, usize) {
    convert_latin1_to_utf8_partial_uninit(src, mucast(dst))
}
pub fn convert_latin1_to_utf8_partial_uninit(src: &[u8], dst: &mut [MaybeUninit<u8>]) -> (usize, usize);

It would need to promise that no previously init bytes are made uninit (in order to make the cast from &mut [u8] to &mut [MaybeUninit<u8>] sound, but that seems reasonable enough as a promise to make

This would be a useful change to make to consumers, as it would let them use the mem functions to stack allocated buffers soundly, without needing to zero init them.

edit: These functions wouldn't have to be made public, of course, but they would need to exist in some form, and I think it might be useful for consumers.

@hsivonen
Copy link
Owner

Per @Gankra's tweet, &mut [u8] to unintialized integers isn't UB if only writing, citing @RalfJung. My reading is that it applies here. Do I understand correctly?

If at all possible, I'd like to avoid complicating the outward API (including the FFI part in encoding_c) with MaybeUninit.

@5225225
Copy link

5225225 commented Aug 11, 2022

I think there's 2 questions here.

  1. Is it UB to have a &mut [T] to uninitialized T
  2. Is it UB to call set_len on a Vec to cause uninitalised data to be counted as being initialised

The answer to 1 seems to be "probably not, but it's not definitely not UB?" (see: rust-lang/unsafe-code-guidelines#77)

2 seems to be yes, going off the safety requirement on Vec::set_len.

Thought that requirement could be closely tied to the answer to 1, so Vec could weaken that safety requirement. It would still be UB to observe an uninit u8 by value, but we don't seem to be doing that here (because miri passes).


If 1. is not UB, then I don't think the public API would need to change at all, you'd just take a &mut [u8] and then convert it into a &mut [MaybeUninit<u8>] internally if it's an API that allows uninit bytes to be passed. (And doing that conversion isn't mandatory, but it means you work with a type that's a lie only as much as you have to.

@RalfJung
Copy link
Contributor

RalfJung commented Aug 11, 2022

Is it UB to have a &mut [T] to uninitialized T

This is currently deep in no-mans-land. The reference clearly makes it UB, as does the documentation of MaybeUninit::assume_init_mut. However I think it should be allowed if T is inhabited, and in particular should be allowed for u8, and MiniRust accordingly allows it. But that is not necessarily the consensus in UCG or T-Lang, so the final rules might end up making this UB.

So what Gankra writes is correct, in the sense that I think we should change the rules to allow this. But the current rules clearly don't allow it and I don't decide these rules by myself.

@hsivonen
Copy link
Owner

I think there's 2 questions here.

  1. Is it UB to have a &mut [T] to uninitialized T
  2. Is it UB to call set_len on a Vec to cause uninitalised data to be counted as being initialised

...

2 seems to be yes, going off the safety requirement on Vec::set_len.

Thought that requirement could be closely tied to the answer to 1, so Vec could weaken that safety requirement.

AFAICT, the UB concern related to 2 is precisely about 1, and, therefore, if 1 isn't UB when T is a primitive integer, it seems to me that encoding_rs's use of set_len isn't UB, either. (Why would set_len even be exposed as a public API if it didn't have precisely the use case that encoding_rs uses it for as non-UB according to the 2015 understanding of UB?)

@hsivonen
Copy link
Owner

(Why would set_len even be exposed as a public API if it didn't have precisely the use case that encoding_rs uses it for as non-UB according to the 2015 understanding of UB?)

Well, I guess the example given is distinct, since it has C code doing the writing to the uninitialized u8s.

@RalfJung
Copy link
Contributor

The safety comment of set_len currently reads

The elements at old_len..new_len must be initialized.

That is a library API comment. I do not know when it was added (it has not always been there), but what it means seems fairly clear.

In terms of language UB, it would be possible to allow set_len to be used before elements are initialized, but then the question is how does one go about initializing them? vec[idx] = val creates an &mut T to the uninit element so that would be potentially problematic. vec.as_mut_ptr().add(idx).write(val) would work though. But it is up to t-libs-api to decide if they want to bless this pattern or not.

@Stargateur
Copy link

how does one go about initializing them?

spare_capacity_mut look perfect for this use case. While I agree nothing really bad should happen I don't see why user shouldn't assume it's UB, cause if code change or whatever, one day you do it for a Dropable item and you forget to do the set_len properly. I don't see any situation where you can't do the init before the set_len.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants