Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Label studio integration #139

Merged
merged 81 commits into from
Jun 8, 2024
Merged

Conversation

aaron-siegel
Copy link
Contributor

@aaron-siegel aaron-siegel commented Apr 24, 2024

Initial implementation of Label Studio integration.

@aaron-siegel
Copy link
Contributor Author

This change is Reviewable

Copy link
Collaborator

@mkornacker mkornacker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's discuss tomorrow how to get it over the finish line.

Reviewable status: 0 of 42 files reviewed, 10 unresolved discussions (waiting on @aaron-siegel and @orm011)


pixeltable/catalog/table.py line 721 at r6 (raw file):

        print(f'Linked remote {remote} to table `{self.get_name()}`.')

    def unlink_remote(

Let's go with add and drop, since we already have that for columns, and that's generally the terminology used by dbms's.


pixeltable/catalog/table.py line 757 at r6 (raw file):

                raise excs.Error(
                    f'Column `{t_col}` does not exist in Table `{self.get_name()}`. '
                    '(Specify `col_mapping` explicitly?)'

Confusing, because you passed in the col_mapping.


pixeltable/datatransfer/label_studio.py line 211 at r6 (raw file):

        localpath_col_opt = [t[media_col_name].localpath] if is_stored else []
        # Select the media column, rectanglelabels columns, and localpath (if appropriate)
        rows = t.select(t[media_col_name], *[t[col] for col in t_rl_cols], *localpath_col_opt)

This is not a good idea, this should be limited to only retrieving new rows. Running this query isn't guaranteed to be free, you might end up downloading a large number of media files


pixeltable/datatransfer/remote.py line 10 at r6 (raw file):

class Remote(abc.ABC):

Not a good name. It's already used by git to mean something specific and different. And it's also not part of standard dbms terminology.


pixeltable/datatransfer/remote.py line 13 at r6 (raw file):

    """
    Abstract base class that represents a remote data store. Subclasses of `Remote` provide
    functionality for synchronizing between Pixeltable tables and stateful remote stores.

It's unclear that this can be used for all external data stores. I generally stay away from introducing abstractions when the actual requirements aren't known, and this is definitely the case here. This will work for labeling projects, and that should be the scope of this class. We can always expand the abstraction or create more general ones later, when we're integrating other data stores such as dbms's, etc., and understand the requirements


pixeltable/exprs/data_row.py line 205 at r6 (raw file):

    @property
    def vmin(self) -> int:

Let's go with v_min to stay with current conventions


pixeltable/functions/huggingface.py line 149 at r6 (raw file):

@pxt.udf
def detr_to_coco(image: PIL.Image.Image, detr_info: dict[str, Any]) -> dict[str, Any]:

This needs to live somewhere else, it's not HF-specific.


pixeltable/functions/openai.py line 19 at r6 (raw file):

@env.Env.get().register_client('openai')

What is the purpose of this? I'm finding this harder to follow than what it replaces, and it doesn't appear to save any code.

Copy link
Contributor Author

@aaron-siegel aaron-siegel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 42 files reviewed, 10 unresolved discussions (waiting on @mkornacker and @orm011)


pixeltable/catalog/table.py line 721 at r6 (raw file):

Previously, mkornacker (Marcel Kornacker) wrote…

Let's go with add and drop, since we already have that for columns, and that's generally the terminology used by dbms's.

Oh, ok. I changed it to link and unlink to avoid mentioning remotes, but add_X / drop_X is fine too, we just have to figure out what X is.


pixeltable/datatransfer/label_studio.py line 211 at r6 (raw file):

Previously, mkornacker (Marcel Kornacker) wrote…

This is not a good idea, this should be limited to only retrieving new rows. Running this query isn't guaranteed to be free, you might end up downloading a large number of media files

Ok. Let's talk about how to address it


pixeltable/datatransfer/remote.py line 10 at r6 (raw file):

Previously, mkornacker (Marcel Kornacker) wrote…

Not a good name. It's already used by git to mean something specific and different. And it's also not part of standard dbms terminology.

I can't change the name on this PR for schema reasons, but I can hide it from the user and then change the name in the next PR.


pixeltable/datatransfer/remote.py line 13 at r6 (raw file):

Previously, mkornacker (Marcel Kornacker) wrote…

It's unclear that this can be used for all external data stores. I generally stay away from introducing abstractions when the actual requirements aren't known, and this is definitely the case here. This will work for labeling projects, and that should be the scope of this class. We can always expand the abstraction or create more general ones later, when we're integrating other data stores such as dbms's, etc., and understand the requirements

Ok. Happy to discuss what might be a better name and/or change the docs.


pixeltable/exprs/data_row.py line 205 at r6 (raw file):

Previously, mkornacker (Marcel Kornacker) wrote…

Let's go with v_min to stay with current conventions

If it's v_min, shouldn't it also be row_id?


pixeltable/functions/huggingface.py line 149 at r6 (raw file):

Previously, mkornacker (Marcel Kornacker) wrote…

This needs to live somewhere else, it's not HF-specific.

Actually, I think it is! It's specific to the DETR huggingface models, as far as I know. If you're aware of some standard that it aligns with, let me know and I can move it (but then the name shouldn't be detr_to_coco, it should be X_to_coco where X is whatever standard is being implemented)


pixeltable/functions/openai.py line 19 at r6 (raw file):

Previously, mkornacker (Marcel Kornacker) wrote…

What is the purpose of this? I'm finding this harder to follow than what it replaces, and it doesn't appear to save any code.

That's true in the case of OpenAI, but the old pattern only worked for SDK clients with exactly one parameter and with the parameter name "api_key". That won't work for Label Studio, which has two parameters (url and api_key). And I think we should standardize on a uniform pattern for SDK client registration. This is a nice clean pattern that works in the general case.

I did find a way to clean up the decorator syntax a bit, let me know what you think.

Copy link
Collaborator

@mkornacker mkornacker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 42 files reviewed, 19 unresolved discussions (waiting on @aaron-siegel and @orm011)


pixeltable/env.py line 66 at r11 (raw file):

        self._http_address: Optional[str] = None

        self._registered_clients: dict[str, Any] = {}

incorrect type decl


pixeltable/env.py line 365 at r11 (raw file):

            - name: The name of the client
        """
        cl: ApiClient = self._registered_clients[name]

we don't use type decls here


pixeltable/env.py line 372 at r11 (raw file):

        # if not, look in Pixeltable config from `config.yaml`.

        init_kwargs = {}

missing type decl


pixeltable/env.py line 381 at r11 (raw file):

            if param not in init_kwargs or init_kwargs[param] == '':
                raise excs.Error(
                    f'`{name}` client not initialized: parameter `{param}` is not configured.\n'

I'm finding this harder to read with all the backticks. Do we need those?


pixeltable/env.py line 542 at r11 (raw file):

@dataclass
class ApiClient:

move to top


pixeltable/catalog/table.py line 726 at r11 (raw file):

            self,
            remote: 'pixeltable.datatransfer.Remote',
            col_mapping: Optional[dict[str, str]] = None

As discussed, move the data store-specific parameters and logic into the data store-specific class and call this _link(). It should only have the remote parameter.


pixeltable/catalog/table.py line 738 at r11 (raw file):

        """
        self._check_is_dropped()
        push_cols = remote.get_push_columns()

Given that we already have import and export elsewhere, why not reuse those concepts here and call them export_/import_cols.


pixeltable/catalog/table.py line 751 at r11 (raw file):

            self,
            *,
            remotes: Optional['pixeltable.datatransfer.Remote' | list['pixeltable.datatransfer.Remote']] = None,

Remove this parameter and assert that there's only one remote. In the next round we can add the optional name/names parameter


pixeltable/catalog/table.py line 760 at r11 (raw file):

            remotes: If specified, will unlink only the specified `Remote` or list of `Remote`s. If not specified,
                will unlink all of this table's `Remote`s.
            ignore_errors (bool): If `True`, no exception will be thrown if the specified `Remote` is not linked

What's the point of that?


pixeltable/catalog/table.py line 780 at r11 (raw file):

            print(f'Unlinked remote {remote} from table `{self.get_name()}`.')

    def _validate_remote(

Move all data store-specific logic to data store class.


pixeltable/catalog/table.py line 838 at r11 (raw file):

            self,
            *,
            remotes: Optional['pixeltable.datatransfer.Remote' | list['pixeltable.datatransfer.Remote']] = None,

Remove parameter and assert there's only a single remote. This will be replaced with names.


pixeltable/catalog/table.py line 839 at r11 (raw file):

            *,
            remotes: Optional['pixeltable.datatransfer.Remote' | list['pixeltable.datatransfer.Remote']] = None,
            push: bool = True,

export/import


pixeltable/catalog/table.py line 849 at r11 (raw file):

                will sync all linked `Remote`s.
            push: If `True`, data from this table will be pushed to the remotes during synchronization.
            pull: If `True`, data from this table will be pulled from the remotes during synchronization.

Ambiguous phrasing. "data from the remote will be imported into this table ..."


pixeltable/functions/huggingface.py line 149 at r6 (raw file):

Previously, aaron-siegel (Aaron Siegel) wrote…

Actually, I think it is! It's specific to the DETR huggingface models, as far as I know. If you're aware of some standard that it aligns with, let me know and I can move it (but then the name shouldn't be detr_to_coco, it should be X_to_coco where X is whatever standard is being implemented)

Are you saying that running a non-HF DETR model gives you different output?

https://mmdetection.readthedocs.io/en/v2.16.0/_modules/mmdet/models/detectors/detr.html

Copy link
Contributor Author

@aaron-siegel aaron-siegel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 8 files at r7.
Reviewable status: 0 of 42 files reviewed, 19 unresolved discussions (waiting on @mkornacker and @orm011)


pixeltable/env.py line 66 at r11 (raw file):

Previously, mkornacker (Marcel Kornacker) wrote…

incorrect type decl

Fixed, thanks


pixeltable/env.py line 365 at r11 (raw file):

Previously, mkornacker (Marcel Kornacker) wrote…

we don't use type decls here

fixed


pixeltable/env.py line 381 at r11 (raw file):

Previously, mkornacker (Marcel Kornacker) wrote…

I'm finding this harder to read with all the backticks. Do we need those?

I've always used them to denote identifiers in user-facing messages. Most systems quote identifiers in error messages in some way. e.g. Python uses single quotes:

TypeError: 'int' object is not callable

I don't have a strong preference on backticks in particular but I do think we should quote identifiers in some way, when they appear inline in an error message like this.


pixeltable/env.py line 542 at r11 (raw file):

Previously, mkornacker (Marcel Kornacker) wrote…

move to top

Why? Is this following some particular convention? I think it's standard for the principal class (in this case Env) to be declared before its auxiliary classes, although to my knowledge PEP8 is agnostic about this.


pixeltable/catalog/table.py line 757 at r6 (raw file):

Previously, mkornacker (Marcel Kornacker) wrote…

Confusing, because you passed in the col_mapping.

Fixed


pixeltable/catalog/table.py line 726 at r11 (raw file):

Previously, mkornacker (Marcel Kornacker) wrote…

As discussed, move the data store-specific parameters and logic into the data store-specific class and call this _link(). It should only have the remote parameter.

I'll do this in the next PR since it doesn't affect the API.


pixeltable/catalog/table.py line 738 at r11 (raw file):

Previously, mkornacker (Marcel Kornacker) wrote…

Given that we already have import and export elsewhere, why not reuse those concepts here and call them export_/import_cols.

Done.


pixeltable/catalog/table.py line 751 at r11 (raw file):

Previously, mkornacker (Marcel Kornacker) wrote…

Remove this parameter and assert that there's only one remote. In the next round we can add the optional name/names parameter

Done.


pixeltable/catalog/table.py line 760 at r11 (raw file):

Previously, mkornacker (Marcel Kornacker) wrote…

What's the point of that?

Just to be consistent with drop_table etc. If we're going to be using names, I definitely think we should have it as a parameter.


pixeltable/catalog/table.py line 780 at r11 (raw file):

Previously, mkornacker (Marcel Kornacker) wrote…

Move all data store-specific logic to data store class.

I'll do this in the next PR (it doesn't impact the public-facing API since this is a protected method)


pixeltable/catalog/table.py line 838 at r11 (raw file):

Previously, mkornacker (Marcel Kornacker) wrote…

Remove parameter and assert there's only a single remote. This will be replaced with names.

Done.


pixeltable/catalog/table.py line 839 at r11 (raw file):

Previously, mkornacker (Marcel Kornacker) wrote…

export/import

Done.


pixeltable/functions/huggingface.py line 149 at r6 (raw file):

Previously, mkornacker (Marcel Kornacker) wrote…

Are you saying that running a non-HF DETR model gives you different output?

https://mmdetection.readthedocs.io/en/v2.16.0/_modules/mmdet/models/detectors/detr.html

I haven't looked at mmdetect specifically, but yeah, the DETR model doesn't create the JSON struct. As I understand it, that's done by the huggingface library. So unless they're both conforming to some standard, there's no reason to expect the output format to be the same.

Copy link
Collaborator

@mkornacker mkornacker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 42 files reviewed, 9 unresolved discussions (waiting on @aaron-siegel and @orm011)


pixeltable/env.py line 542 at r11 (raw file):

Previously, aaron-siegel (Aaron Siegel) wrote…

Why? Is this following some particular convention? I think it's standard for the principal class (in this case Env) to be declared before its auxiliary classes, although to my knowledge PEP8 is agnostic about this.

I guess I personally find it easier to see this before it's being used and would always put it at the top.


pixeltable/catalog/table.py line 726 at r11 (raw file):

Previously, aaron-siegel (Aaron Siegel) wrote…

I'll do this in the next PR since it doesn't affect the API.

It would affect the API, because the col_mapping parameter will go away.

This doesn't look like much work to me.


pixeltable/catalog/table.py line 849 at r11 (raw file):

Previously, mkornacker (Marcel Kornacker) wrote…

Ambiguous phrasing. "data from the remote will be imported into this table ..."

Isn't that exactly not the case? import == True means data will be imported into this table.

Copy link
Collaborator

@mkornacker mkornacker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 42 files reviewed, 16 unresolved discussions (waiting on @aaron-siegel and @orm011)


tests/datatransfer/test_label_studio.py line 54 at r12 (raw file):

        skip_test_if_not_installed('label_studio_sdk')
        from pixeltable.datatransfer.label_studio import LabelStudioProject
        remote = LabelStudioProject.create(title='test_remote_project', label_config=self.test_config_2)

Given that we decided to move to the create_ls_project() function, this interface isn't user-facing anymore.


tests/datatransfer/test_label_studio.py line 64 at r12 (raw file):

        with pytest.raises(excs.Error) as exc_info:
            _ = LabelStudioProject.create(

This should be tested via the user-facing function create_label_studio_project()


tests/datatransfer/test_label_studio.py line 96 at r12 (raw file):

        skip_test_if_not_installed('label_studio_sdk')
        t = ls_image_table
        from pixeltable.datatransfer.label_studio import LabelStudioProject

This shouldn't be needed.


tests/datatransfer/test_label_studio.py line 125 at r12 (raw file):

        t = pxt.get_table('test_ls_sync')
        t.sync()
        annotations = t.collect()['annotations_col']

You're expecting a particular order, which means you need an order_by (otherwise Postgres might decide to return data in a different order and the test breaks for no obvious reason)


tests/datatransfer/test_label_studio.py line 180 at r12 (raw file):

        from pixeltable.datatransfer.label_studio import LabelStudioProject

        remote = LabelStudioProject.create('test_sync_errors_project', self.test_config)

Use create_label_studio_project() to ensure that you're testing the same code paths that a user would exercise.


tests/datatransfer/test_label_studio.py line 241 at r12 (raw file):

    _logger.info('Setting up a venv the Label Studio pytext fixture.')
    subprocess.run('python -m venv target/ls-env'.split(' '), check=True)
    if platform.system() == 'Windows':

Why this? I thought this doesn't run on Windows anyway.


tests/datatransfer/test_remote.py line 18 at r12 (raw file):

        t = pxt.create_table('test_remote', schema)

        remote1 = MockRemote(

Why not use the public API for this?

Copy link
Contributor Author

@aaron-siegel aaron-siegel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 42 files reviewed, 16 unresolved discussions (waiting on @mkornacker and @orm011)


pixeltable/catalog/table.py line 726 at r11 (raw file):

Previously, mkornacker (Marcel Kornacker) wrote…

It would affect the API, because the col_mapping parameter will go away.

This doesn't look like much work to me.

see my comment on slack. It's considerably more work than you realize, and it doesn't affect the API since _link is now protected.


pixeltable/catalog/table.py line 849 at r11 (raw file):

Previously, mkornacker (Marcel Kornacker) wrote…

Isn't that exactly not the case? import == True means data will be imported into this table.

Oops typo. I'll fix this. Good catch


tests/datatransfer/test_label_studio.py line 125 at r12 (raw file):

Previously, mkornacker (Marcel Kornacker) wrote…

You're expecting a particular order, which means you need an order_by (otherwise Postgres might decide to return data in a different order and the test breaks for no obvious reason)

Ah good catch, I'll fix this

Copy link
Contributor Author

@aaron-siegel aaron-siegel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 42 files reviewed, 16 unresolved discussions (waiting on @mkornacker and @orm011)


tests/datatransfer/test_label_studio.py line 96 at r12 (raw file):

Previously, mkornacker (Marcel Kornacker) wrote…

This shouldn't be needed.

Fixed


tests/datatransfer/test_label_studio.py line 241 at r12 (raw file):

Previously, mkornacker (Marcel Kornacker) wrote…

Why this? I thought this doesn't run on Windows anyway.

It's part of an earlier attempt to get it working on Windows. I have visions of that effort eventually succeeding, so I'd like to leave this in


tests/datatransfer/test_remote.py line 18 at r12 (raw file):

Previously, mkornacker (Marcel Kornacker) wrote…

Why not use the public API for this?

Not sure what this comment is asking, the goal here is to isolate the remote validation behavior (which i guess will become the project validation behavior) without depending on label studio or another specific implementation

Copy link
Collaborator

@mkornacker mkornacker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 42 files reviewed, 16 unresolved discussions (waiting on @aaron-siegel and @orm011)


tests/datatransfer/test_label_studio.py line 241 at r12 (raw file):

Previously, aaron-siegel (Aaron Siegel) wrote…

It's part of an earlier attempt to get it working on Windows. I have visions of that effort eventually succeeding, so I'd like to leave this in

Leave explanatory comment


tests/datatransfer/test_remote.py line 18 at r12 (raw file):

Previously, aaron-siegel (Aaron Siegel) wrote…

Not sure what this comment is asking, the goal here is to isolate the remote validation behavior (which i guess will become the project validation behavior) without depending on label studio or another specific implementation

In general, it's better to use the public-facing API, to make sure that the error condition bubbles up in the right way. Leave todo.

@aaron-siegel aaron-siegel merged commit f99b389 into pixeltable:master Jun 8, 2024
14 checks passed
@aaron-siegel aaron-siegel deleted the label-studio branch June 8, 2024 04:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants