Change the HFOnnx pipeline to use Hugging Face Optimum rather than onnxruntime directly #371

Open · nickchomey opened this issue Oct 17, 2022 · 25 comments

Comments

@nickchomey

nickchomey commented Oct 17, 2022

The HF documentation says that you can now export seq2seq to ONNX with the OnnxSeq2SeqConfigWithPast class.
https://huggingface.co/docs/transformers/v4.23.1/en/main_classes/onnx#onnx-configurations

This was added with this PR in March huggingface/transformers#14700

Perhaps it is now mature enough to be incorporated into txtai? It would be great to be able to use ONNX versions of the various HF models for their increased performance.

Additionally, it seems to support ViT models, along with other enhancements made since then. Here's the history for that class: https://github.com/huggingface/transformers/commits/main/src/transformers/onnx/config.py

@davidmezzetti
Member

I'll make ONNX a focus of the 5.2 release. About to release 5.1.

@nickchomey
Author

I made this comment in #369, but I'll put it here as well since it seems to be more focused on ONNX improvements:

The HFOnnx.__call__() method uses opset=12 as its default value, but ONNX v1.12 added opset 17. I wonder whether it might be possible/prudent to add some sort of version check to this method, along the lines of:

import onnx

# Map onnx package versions to the highest opset they support
ONNX_OPSET = {
    "1.12": 17,
    "1.11": 16,
    "1.10": 15,
    "1.9": 14,
    "1.8": 13
}

# onnx.__version__ is e.g. "1.12.0", so key on major.minor
ONNX_VERSION = ".".join(onnx.__version__.split(".")[:2])
OPSET = ONNX_OPSET.get(ONNX_VERSION, 12)

class HFOnnx(Tensors):
    """
    Exports a Hugging Face Transformer model to ONNX.
    """

    def __call__(self, path, task="default", output=None, quantize=False, opset=OPSET):
        ...

Here are the ONNX versions and their respective opsets: https://onnxruntime.ai/docs/reference/compatibility.html
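Alternatively, assuming the onnx package is importable in the target environment, it can report its own maximum supported opset directly, which would avoid maintaining a version map by hand (just a sketch, not existing txtai code):

import onnx.defs

# The installed onnx package reports the highest opset it supports for the default operator set
DEFAULT_OPSET = onnx.defs.onnx_opset_version()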

@davidmezzetti
Member

Sounds good. I've taken a preliminary look and I think the transformers.onnx package can replace some but not all of the code in HFOnnx.

For whatever reason, the default opset in transformers is 11. I wonder if the thinking behind this is picking the lowest opset necessary to maximize compatibility (i.e. support more versions).

@nickchomey
Author

Probably. But if there's a check for the installed version, then that should solve any compatibility issues - the highest supported version will always be used as the default.

I look forward to whatever you're able to put together to make onnx model usage more accessible! From everything I've read about it, it provides a massive performance improvement.

nickchomey changed the title from "Add ONNX support for HF Seq2Seq models" to "Change the HFOnnx pipeline to use Hugging Face Optimum rather than onnxruntime directly" on Feb 9, 2023
@nickchomey
Author

nickchomey commented Feb 9, 2023

Rather than simply implementing ONNX for Seq2Seq models, as discussed in this Slack thread, it would be beneficial and prudent to outsource HFOnnx()'s custom onnxruntime implementation to Hugging Face Optimum. HF Optimum has made it very easy to use the full extent of ONNX's capabilities (including model optimization, which is currently missing) with most/all of their model types, including Seq2Seq.

Some relevant links:

Please don't feel any pressure to implement this immediately on account of me! But I do think it would be quite helpful for txtai users to be able to make full use of ONNX, and it would lighten the load on you of monitoring and implementing changes in onnxruntime.
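For example, graph optimization (the piece currently missing from HFOnnx) only takes a few lines with Optimum. A rough sketch, with placeholder model/directory names, using the from_transformers flag that Optimum accepts for exporting a vanilla Transformers checkpoint:

from optimum.onnxruntime import ORTModelForSequenceClassification, ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig

# Export a Transformers checkpoint to ONNX, then apply graph optimizations and save
model = ORTModelForSequenceClassification.from_pretrained("model-id", from_transformers=True)

optimizer = ORTOptimizer.from_pretrained(model)
optimizer.optimize(save_dir="onnx/model-O2", optimization_config=OptimizationConfig(optimization_level=2))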

@nickchomey
Author

nickchomey commented Feb 21, 2023

@davidmezzetti

I've made some good progress on building a mechanism to seamlessly auto-convert any HF transformers model to ONNX with HF Optimum, which also allows the user to configure the optimization level and other important parameters. Each combination of optimization level (0-4) and quantization (True/False) generates its own model, which can be saved to a models directory of choice for quicker reloading.

I'd like to make it available to txtai via PR such that anyone could start reaping the massive performance and resource (RAM usage) improvements from a fully optimized ONNX model in any of their existing txtai pipelines/workflows/applications, with as little friction as possible.

However, I am not sure what the preferred approach would be.

The easiest option, it seems to me, would be to create .../txtai/pipeline/optimum.py, which could be run before any other pipeline and would return a (model, tokenizer) tuple that could then be passed into any other pipeline as its path or model parameter. It would probably require some slight tweaking of .../txtai/pipeline/hfpipeline.py, but not much else. I'm not sure if this would enable it to be used in a workflow or not...
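A rough sketch of that first option (the Optimum class name, its location and its parameters are assumptions for illustration, not existing txtai code):

from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

class Optimum:
    """
    Hypothetical txtai pipeline: exports a model with HF Optimum and returns a
    (model, tokenizer) tuple that other pipelines already accept as their model parameter.
    """

    def __call__(self, path):
        # Export to ONNX at call time; optimization/quantization options would hook in here
        model = ORTModelForSequenceClassification.from_pretrained(path, from_transformers=True)
        tokenizer = AutoTokenizer.from_pretrained(path)
        return (model, tokenizer)

Something like Labels(Optimum()("model-id")) would then work with the existing (model, tokenizer) handling in the pipelines.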

Perhaps a better approach would be creating .../txtai/models/optimum.py (or just overwriting the .../txtai/models/onnx.py module), and then modifying .../txtai/pipeline/hfpipeline.py and .../txtai/models/models.py to generate/return an Optimum model when requested via the parameters. The biggest problem I foresee with this is that it would probably also require modifying all pipelines such that the required args could be passed through to the Optimum module. (I think a separate case could be made that **kwargs could/should be added to all pipelines and related modules such that things could be extended more easily...)

Any thoughts? I'll probably proceed with the first option today as it should be relatively seamless. But I'm happy to discuss and modify things to accommodate whatever approach you think would be best.

@nickchomey
Author

nickchomey commented Feb 21, 2023

On second thought, I'm not sure that the first pipeline approach would be all that useful - you'd have to run a separate one for each txtai pipeline that you want to use, and probably also have to know which task (e.g. zero-shot-classification) is being requested.

The only reasonable way to do it would be to allow for setting arguments in any pipeline (e.g. labels = Labels(model_id, onnx=True, optimization_level=3, quantize=True, architecture="avx512_vnni")) and have it handle all the model selection, conversion, optimization and quantization behind the scenes... But that would require modifying all existing pipelines to have the required parameters. I don't mind doing it if that's the case, but perhaps, as mentioned, adding **kwargs to each pipeline would allow for more flexibility going forward?

Any thoughts on any of this? I'm happy to do all of the work if you can point me in the right direction, such that all you'd need to do is review the code and modify things to accommodate your style, and be more robust, efficient etc...

@davidmezzetti
Member

For this change to be most effective, the best path is to replace this class - https://github.com/neuml/txtai/blob/master/src/python/txtai/models/onnx.py - with an Optimum version, and to modify this method - https://github.com/neuml/txtai/blob/master/src/python/txtai/models/models.py#L118 - to detect loading an ONNX model.

Last I checked, Optimum doesn't support loading streaming models; it expects everything to work with files. That will cause issues with some existing functionality.

The other piece would be changing HFOnnx to use Optimum to convert models. From there you shouldn't need to add these extra arguments as that would be done at model conversion time.

@nickchomey
Author

nickchomey commented Feb 21, 2023

Thanks! I'll explore those files and submit something for your review when ready.

Yeah, Optimum works off of files. The workflow I've set up is:

  1. download the transformers model
  2. convert to ONNX and save file
  3. optionally optimize the ONNX model and save
  4. optionally quantize the raw ONNX model or the optimized ONNX model.

I've built it such that it checks for existing files as early as possible to minimize processing, but it can also store many versions depending on what combination of optimization and quantization is desired. Once the initial download and processing is done, it should all load quickly from disk.
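A condensed sketch of that flow (the directory name, the existence check and the chosen configs are illustrative assumptions):

import os

from optimum.onnxruntime import ORTModelForSequenceClassification, ORTOptimizer, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig, OptimizationConfig

output = "models/model-id-O2-avx512vnni"
if not os.path.isdir(output):
    # 1-2. Download the Transformers checkpoint and export it to ONNX
    model = ORTModelForSequenceClassification.from_pretrained("model-id", from_transformers=True)

    # 3. Optionally optimize the exported graph and save it
    optimizer = ORTOptimizer.from_pretrained(model)
    optimizer.optimize(save_dir=output, optimization_config=OptimizationConfig(optimization_level=2))

    # 4. Optionally quantize the optimized model, saving it alongside
    quantizer = ORTQuantizer.from_pretrained(output)
    quantizer.quantize(save_dir=output, quantization_config=AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False))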

I don't know anything about streaming - could you please point me to the relevant txtai code where streaming is used, so that I can dig in and see what it does and what options might exist for this implementation?

At the very least, it seems like this could be implemented as an optional alternative. It seems far easier and more maintainable going forward to offload all that work to Hugging Face than to build custom mechanisms directly on top of onnxruntime to do optimization and quantization on all the different transformer model types (Seq2Seq, FeatureExtraction, etc...).

@davidmezzetti
Member

Did you try using txtai pipelines with ORTModelForXYZ.from_pretrained?

from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

from txtai.pipeline import Labels

model = ORTModelForSequenceClassification.from_pretrained("path", from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained("path")

labels = Labels((model, tokenizer), gpu=False)
labels("Text to label")

And for Seq2Seq

from optimum.onnxruntime import ORTModelForSeq2SeqLM

from txtai.pipeline import Sequences

model = ORTModelForSeq2SeqLM.from_pretrained("t5-small", from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained("t5-small")

sequences = Sequences((model, tokenizer), gpu=False)
sequences("translate English to French: Hello")

This should all work with quantized models as well.

from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

quantizer = ORTQuantizer.from_pretrained(model)
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

quantizer.quantize(save_dir="quant", quantization_config=qconfig)
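The quantized model can then be reloaded from that directory and passed in the same way (the quantized file is typically saved as model_quantized.onnx, so the file_name argument may be needed depending on the Optimum version):

model = ORTModelForSequenceClassification.from_pretrained("quant", file_name="model_quantized.onnx")
labels = Labels((model, tokenizer), gpu=False)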

So while I like the idea of making this more integrated, it seems like it's pretty straightforward to run Optimum models with txtai as it stands right now.

@nickchomey
Author

nickchomey commented Feb 22, 2023

Yes, I believe I mentioned somewhere above (or in Slack) that it works quite well already by passing in a (model, tokenizer) tuple - precisely as you did.

But it resulted in a lot of redundant code for me when I started testing it with different pipelines. So I started refactoring and eventually arrived at something generalized and flexible - along with something that makes use of saving the various converted/optimized/quantized models. So I figured it would be worth going the extra mile to make something nice and integrated - both for my own sake and for everyone else who doesn't have the time/interest to figure all of this out.

If accompanied by a good Colab example, I think it would be well received by txtai users - it should make it simple for anyone to experiment with any of this by adding/tweaking some arguments in their existing code. It will also make it easy for people to make use of any hardware-specific acceleration available to them (via the ONNX execution providers, such as the OpenVINO EP, which seems to be highly compatible with most systems and even more powerful than raw onnxruntime).

I'm going to build it anyway, so would you be willing to review a PR for this? To start, I'll leave onnx.py alone and create a new module and slightly modify the models.py logic to handle it appropriately - that way nothing at all will be affected. But it would be easy enough to replace onnx.py later if desired.

@davidmezzetti
Member

If it can be done by changing HFOnnx, models/onnx.py and models/models.py, then yes. I don't want to go down the path of adding a bunch of Optimum options to the pipelines. That should be a one-time conversion via a process similar to HFOnnx.

Sounds like you're going to work on a lot of this for your own purposes regardless. But in terms of a PR, the type of code above is what would make sense for core txtai.

@nickchomey
Author

Thanks! I'll change those files directly then.

Though, in order to allow people to tweak the onnxruntime and ONNX execution provider parameters (optimization level, model save path, and plenty more), it would definitely be necessary to add parameters to the pipelines.

Again, I suspect that the cleanest approach would be to add a single **kwargs parameter to each pipeline, which can then be processed in models/onnx.py, so I'll give that a shot first to keep things as clean and unchanged as possible. But it's ultimately a small detail to iron out later upon review.

@davidmezzetti
Member

I don't think we're on the same page with the idea.

I envision the HFOnnx pipeline importing the Optimum ORTModelForXYZ classes and loading the Transformers models there. Then any logic to optimize, quantize or whatever would happen, and the model would be saved to a directory. That is what the pipelines would load. I don't see a valid use case for anything to happen at runtime in the pipelines; they should just load the model created by Optimum.

I would leave models/onnx.py for backwards compatibility and make a very small change to Models.load that checks whether the model path is a directory containing *.onnx files. If that is the case, it would load it with the appropriate Optimum ORTModelForXYZ class.
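A minimal sketch of that detection (the helper and the chosen ORTModel class here are hypothetical, not the final implementation):

import os
from glob import glob

from optimum.onnxruntime import ORTModelForSequenceClassification

def load(path):
    # Hypothetical: a local directory containing *.onnx files is treated as an Optimum model
    if os.path.isdir(path) and glob(os.path.join(path, "*.onnx")):
        return ORTModelForSequenceClassification.from_pretrained(path)

    # Otherwise fall back to the existing Transformers/ONNX loading logic
    return None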

@nickchomey
Author

Ok thanks. I'll do my best to meet what you're looking for. Any required changes should be relatively simple to implement when I submit the PR.

@davidmezzetti
Member

Sounds good, I appreciate the efforts in giving it a try and you sharing your plan.

@nickchomey
Author

My pleasure - it's the least I can do to give back something to this fantastic tool!

Just to clarify - after thinking some more, I think it is now fully clear to me what you envision.

  • HFOnnx pipeline contains all the new HF Optimum logic/mechanism that I've put together. Add any parameters (optimization level, execution provider, model save path, etc...) that I need there
  • Instantiate/call the HFOnnx pipeline with whatever arguments are needed for optimization, execution provider etc... which will then process and save the model
  • Modify Models.load() to allow for also receiving something like path=(model_id, filepath) given that HFPipeline checks for a tuple in order to call Models.load
  • Check if that filepath contains a matching model_id.onnx file; if so, load and return the model through the appropriate ORTModelForXYZ wrapper.

Is that correct? If so, that's completely fine with me.

But my only outstanding question is with regard to the last step - normally Models.load loads onnx.py, but you'd like to leave onnx.py alone for backwards compatibility. I agree.

So, where does the ORTforXYZ get loaded? Another method within onnx.py? A separate models/optimum.py->OptimumModel class?

@nickchomey
Author

nickchomey commented Feb 24, 2023

I'm just coming back to this now and find it confusing why you don't want any of this implemented at runtime. I really do think that implementing this through a standalone HFOnnx pipeline that pre-generates ONNX models is the wrong approach.

  1. Surely it can't be for performance reasons. After all, you're already implementing some aspects of this stuff at runtime. For example, each pipeline has a quantize parameter that will quantize the model at runtime.
  2. Likewise, if a model is not currently cached, it needs to be downloaded, and surely other things happen to it.
  3. By the same token, if we implement Optimum/ONNX at runtime, if the desired model is already generated/saved then there's no overhead at all beyond parsing paths. If it isn't, then it downloads and converts it - just as would happen with a non-onnx model.
  4. In fact, if I'm not mistaken, the current ONNX implementation doesn't save the .onnx models - it just converts them for use in the current runtime and then discards them, which is less efficient.
  5. It would be FAR more confusing for users to specify the specific model that they want via path=(model_id, filepath) than it would be to just use parameters, because there needs to be a way to specify which specific version of the model is desired - which optimization level, which architecture, etc... Labels(model_id, "/path/to/model/model_optimizeO3_quantizeavx512.onnx") is far more difficult than just specifying the parameters as Labels(model_id, models_path="/path/to/models", opt="O3", quantize="avx512") and having txtai sort out what the specific file path/name needs to be and generate/save it if needed.
  6. Perhaps more than anything, it's HF Optimum's implementation of/API for the onnx RUNTIME, not just an onnx model CONVERTER. There are specific parameters, such as provider (for the execution provider - CPU, CUDA, OpenVINO, TensorRT, etc.), which surely must be provided at runtime, e.g. Labels(model_id, path=path, opt="O3", quantization="avx2", provider="OpenVINOExecutionProvider"). As it stands, there's no mechanism to allow for execution providers other than CPU and CUDA (see the sketch after this list).
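A minimal sketch of that last point using Optimum directly (the txtai-side parameters above are hypothetical; ORTModel.from_pretrained does take a provider argument, and the OpenVINO provider requires the onnxruntime-openvino package):

from optimum.onnxruntime import ORTModelForSequenceClassification

# Load an exported ONNX model on a specific execution provider at runtime
model = ORTModelForSequenceClassification.from_pretrained(
    "models/model-O3-avx2", provider="OpenVINOExecutionProvider"
)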

I feel pretty strongly that this is the implementation I would personally want to use, and surely others would too. As such, I think I'm going to go ahead and build it this way. I'll submit it as a PR, which you'll be perfectly welcome to reject if you don't like it. But any collaboration would be much appreciated, so I hope you'll consider the above points and be willing to at least review the PR with an open mind! At that point, if you can show me why it really is a dealbreaker, then I'll be happy to modify it to suit your needs.

(As mentioned above, I'd do it in as non-obtrusive a way as possible - I think **kwargs with some documentation on what args are permitted/expected is the right approach, so as to not have to add a dozen parameters to each pipeline.)

@davidmezzetti
Member

I wasn't proposing path=(model_id, filepath). I would have HFOnnx convert the model and save it to a directory. The pipeline would then take that directory as the path parameter, and Models.load would detect that the directory is an ONNX model directory and handle it appropriately. It would be seamless to the user that they were loading an Optimum model.

The current implementation does support saving ONNX files and reloading from storage.

Perhaps you should consider making this functionality a separate package to suit your specific needs. We might need to agree to disagree on this one.

@nickchomey
Author

nickchomey commented Feb 24, 2023

Right, but if someone wants to test/have/use different versions of the model - combinations of optimization levels, quantization methods, etc... - then the path has to be different for each. That seems very cumbersome for someone to track and implement compared to just putting args in the pipeline call and having the code find the right folder and file. Likewise, how does one use the other execution providers (along with the other dozen onnxruntime parameters), should they so choose? As it stands or as proposed, they can't.

Anyway, I'll disappointedly respect your decision - I'll close this issue and carry on with my own implementation for this. I'll share the code somewhere - be it in an immediately-closed PR or another repo - for you or someone else to consider incorporating into txtai. No hard feelings though - thanks again for everything.

@davidmezzetti
Member

Going to re-open this to keep a placeholder to implement the way I've proposed. I think you'll see it does most of what you want.

While people may test a bunch of different options, there will be a final model in most cases. The framework will be able to detect whether the model is ONNX or OpenVINO, which are different formats, not different execution providers within ONNX.

I consider model optimization in the same family of operations as compiling a static program or training a model. While you toggle the options at development time, once finished, it's a single model or set per platform architecture.

davidmezzetti reopened this on Feb 25, 2023
@nickchomey
Author

nickchomey commented Feb 25, 2023

Fair enough.

Actually, OpenVINO is its own execution provider (of which there appear to be a couple dozen). To use it, you need to install a separate onnxruntime package, onnxruntime-openvino. Perhaps I'm misunderstanding things, but given that there's a parameter for specifying the provider, surely it needs to be used. But txtai doesn't have a mechanism to specify the provider - it only supports CPU and CUDA. There are another dozen parameters as well that people might want to tweak, but they're not currently available in txtai, and there are also parameters for optimization and quantization. Perhaps you want to keep txtai as simple/streamlined as possible, but what I'm building shouldn't add any confusion/hindrance while still allowing people to experiment as much as they want.

So, I'll still build what I'm envisioning, because I'm quite sure that it'll be far easier for prototyping/testing (in fact, I intend to add some mechanism to leverage HF Evaluate as well). I'll submit what I end up building to use as a starting point for you/me/others to build what you are looking for.

Interestingly, I'm having strange results so far (I've only used the Labels pipeline though). Basic ONNX is faster than Transformers, but quantized Transformers (quantize=True) is faster than quantized ONNX through Optimum. However, both quantized versions produce nonsensical results - you can confirm this in Example 7. I've tried a few different zero-shot models and the same thing happens - it could just be inherent to those models... I'll test other pipelines/models later when I get a chance.

@davidmezzetti
Member

Keeping this issue open, still a good issue to consider.

@davidmezzetti
Member

davidmezzetti commented Oct 9, 2023

This issue is still on the radar. A new pipeline called HFOptimum will be added along with logic to detect these models in the Models.load function.

@nickchomey
Author

I've been away from development for 6+ months due to life, so I didn't get far past the initial work mentioned above. I'm hoping to get back to it in the coming weeks and will be happy to provide feedback on this if needed.
