Non-deterministic (and sometimes incorrect) finding of BaseModel classes #117

slafs · 2023-08-12T21:39:23Z

When trying to run BP001 (add default None) on our project, I've noticed something interesting.
The part when the tool is "Looking for Pydantic Models" with ClassDefVisitor yields different results when running more than once. I think I've managed to boil it down to a minimal example.

Setup

Lets consider 4 simple files: a.py, common.py, utils.py and z.py like so:
==> common.py <==

from pydantic import BaseModel

class NewBaseModel(BaseModel):
    foo: str | None

==> utils.py <==

from common import NewBaseModel

class CustomBaseModel(NewBaseModel):
    bar: str | None

==> a.py <==

from utils import CustomBaseModel

class AModel(CustomBaseModel):
    a: str | None

==> z.py <==

from utils import CustomBaseModel

class ZModel(CustomBaseModel):
    z: str | None

(apologies for terrible and potentially unintuitive naming 😅)

Requirements

There are 2 important things to note here:

There are at least 2 levels of inheritance from pydantic's BaseModel. That is: z.ZModel and a.AModel -> utils.CustomBaseModel -> common.NewBaseModel -> pydantic.BaseModel. Apparently this is important as the problem doesn't seem to occur when we get rid of one level (common.NewBaseModel for example).
Two "normal" modules, a and z, are listed respectively before and after the utility modules (common and utils) when processing.

Expected outcome

Running bump-pydantic . on such "project" should add None as a default value for fields in all 4 modules.

Observed outcome

But it turns out it does that in (more or less) half of the runs. I've run the tool 100 times and only 47 of them ended up in modifying all four files, 15 times it modified 3 of them (common.py, utils.py and z.py) and 38 times only 2 of them (common.py and utils.py).

Problem research

I see two issues here:

non-deterministic behaviour and
incorrect finding of BaseModel classes.

Looks like the problem occurs because of the way the queue is being constructed and the way files are being added to it later in

bump-pydantic/bump_pydantic/main.py

Lines 116 to 118 in 393dc89

 missing_files = set(files) - visited 

 if not queue and missing_files: 

 queue.append(next(iter(missing_files)))

since set by nature is unsorted, files will be added to the queue in a "random" order.
Although a.py will be processed first every time.
Fixing that non-determinism should be relatively simple. For example, we could make the queue initially the same as files, appendleft files coming from visitor.next_file and get rid of the missing_files bit.

Now, once we fix that, AModel will never get marked as a pydantic model, at least with the way I've fixed the first issue. Which I haven't figured out why, yet. I mean, it does get marked correctly ~50% of the times currently, so... 🤷. I haven't fully grokked the ClassDefVisitor code yet. Seems like there is some consideration for those cases there already.

For solutions I was considering either going through all of the files twice or allowing callers to specify fqns of custom base models.

~~I can open a PR to better demonstrate what I mean here.~~
This is what I mean about that non-determinism fix: 13bfe61 (from #118).

The text was updated successfully, but these errors were encountered:

slafs · 2023-08-14T14:18:24Z

I haven't fully grokked the ClassDefVisitor code yet. Seems like there is some consideration for those cases there already.

OK. I think I figured it out. It looks like the code responsible for disambiguation in

bump-pydantic/bump_pydantic/codemods/class_def_visitor.py

Lines 61 to 85 in dda0b55

 # In case we have the following scenario: 

 # class A(B): ... 

 # class B(BaseModel): ... 

 # class D(C): ... 

 # class C: ... 

 # We want to disambiguate `A` as soon as we see `B` is a `BaseModel`. 

 if ( 

 fqn.name in self.context.scratch[self.BASE_MODEL_CONTEXT_KEY] 

 and fqn.name in self.context.scratch[self.CLS_CONTEXT_KEY] 

 ): 

 for parent_class in self.context.scratch[self.CLS_CONTEXT_KEY].pop(fqn.name): 

 self.context.scratch[self.BASE_MODEL_CONTEXT_KEY].add(parent_class) 

 # In case we have the following scenario: 

 # class A(B): ... 

 # class B(BaseModel): ... 

 # class D(C): ... 

 # class C: ... 

 # We want to disambiguate `D` as soon as we see `C` is NOT a `BaseModel`. 

 if ( 

 fqn.name in self.context.scratch[self.NO_BASE_MODEL_CONTEXT_KEY] 

 and fqn.name in self.context.scratch[self.CLS_CONTEXT_KEY] 

 ): 

 for parent_class in self.context.scratch[self.CLS_CONTEXT_KEY].pop(fqn.name): 

 self.context.scratch[self.NO_BASE_MODEL_CONTEXT_KEY].add(parent_class)

only works one level "up". So it manages to resolve something like:

# -- bar.py
from package.foo import Foo

class Bar(Foo):
    b: str

# -- foo.py
from pydantic import BaseModel

class Foo(BaseModel):
    a: str

but once you add one (or more!) levels of inheritance e.g.:

# -- baz.py
from package.bar import Bar

class Baz(Bar):
    c: str

then at the end CLS_CONTEXT_KEY entry will have one lingering entry with:
{'package.bar.Bar': {'package.baz.Baz'}}.

It seems like some recursive disambiguation is needed at the end of the whole process (or at the end each visit).

Kludex · 2023-08-14T16:19:42Z

Closed by #118 ?

slafs · 2023-08-14T17:40:15Z

Closed by #118 ?

Nope. #118 is only for 1. from:

I see two issues here:

non-deterministic behaviour and

incorrect finding of BaseModel classes.

Will try to propose something for number 2. later.

It might get tricky with testing as I see there's no unit tests for ClassDefVisitor :/.

slafs mentioned this issue Aug 13, 2023

🐛 Make file processing with ClassDefVisitor deterministic #118

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Non-deterministic (and sometimes incorrect) finding of BaseModel classes #117

Non-deterministic (and sometimes incorrect) finding of BaseModel classes #117

slafs commented Aug 12, 2023 •

edited

slafs commented Aug 14, 2023

Kludex commented Aug 14, 2023

slafs commented Aug 14, 2023

Non-deterministic (and sometimes incorrect) finding of BaseModel classes #117

Non-deterministic (and sometimes incorrect) finding of BaseModel classes #117

Comments

slafs commented Aug 12, 2023 • edited

Setup

Requirements

Expected outcome

Observed outcome

Problem research

slafs commented Aug 14, 2023

Kludex commented Aug 14, 2023

slafs commented Aug 14, 2023

slafs commented Aug 12, 2023 •

edited