
splitting talking head into separate files #32

Open
wants to merge 6 commits into
base: main

Conversation


@anselm anselm commented Apr 23, 2024

At the moment the TalkingHead constructor initializes three.js and also talks to third-party services. This reduces the reusability and modularity of the code. A more modular pattern would, for example, make it easier to have multiple avatars in one scene, or to swap out different LLM or TTS capabilities. If these parts can be separated to some degree, it may become possible to use TalkingHead in third-party experiences such as 3D games, or in situations where you want multiple AI-driven puppets. This PR is a first step in that direction to test the waters and see what Mika thinks of the idea of splitting things up.

…e service inside of other threejs experiences such as games - a later commit would move the initialization of threejs itself to a separate file
@met4citizen
Owner

Thanks, Anselm. I'm not against splitting things up in principle. However, for reusability and modularity, wouldn't it be better to try to split the class into separate reusable classes (using OOP principles)?

In this PR, individual properties (templates) have been separated from their related functionality (factories), but the templates themselves are not really reusable as they depend on factories that are able to encode them and generate the actual animation sequence or pose. Of course, besides reusability, there might be some other reasons to separate the templates, for example, to replace them with one's own, but in the most likely scenario the app wants to keep the existing templates and just add its own custom poses and moods.

That said, I'm not sure how many reusable classes it would be possible to extract from the TalkingHead class. When the system to be modeled is complex, the model needs to reflect that complexity. Often, this means that you need to add more interdependencies and connections, which makes the splitting and reuse more difficult. There is already a lot of overlap between moods, poses and animations, and for the avatar to act more realistically, this overlap should be increased, not decreased.

Before this, I didn't have any plans to divide the class, so I don't have a specific class/component diagram in mind. I need to think about this some more. If you have any ideas about how to split the class into several classes, I would very much like to hear your thoughts.

I should also point out that the TalkingHead class doesn't use any LLMs. And if you would like to use a different TTS, the class has an interface for that integration. There are already projects/apps using the class with Google, Microsoft, OpenAI, and ElevenLabs TTS. Sure, Google TTS support is there as a default for simple web projects, but it doesn't have to be used, so no need to swap anything out. Replacing or abstracting away Three.js would be difficult because function calls to it cover such a big part of the code.
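To illustrate the integration point being described, here is a minimal, hypothetical sketch of the adapter idea: the avatar consumes a uniform result from a `synthesize()` call, so any vendor (Google, Microsoft, OpenAI, ElevenLabs) can be plugged in behind the same shape. All names here (`MockTTS`, `speakWith`) are invented for illustration and are not the actual TalkingHead API.

```javascript
// Hypothetical TTS adapter sketch. A real adapter would call a vendor
// API here; the avatar only sees the normalized { audio, words } result.
class MockTTS {
  async synthesize(text) {
    return {
      audio: new ArrayBuffer(8),      // placeholder for synthesized audio
      words: text.split(" "),         // placeholder word timing data
    };
  }
}

// Stand-in for the avatar driving lip-sync from the adapter's output.
async function speakWith(tts, text) {
  const res = await tts.synthesize(text);
  return res.words.length;
}

const demo = speakWith(new MockTTS(), "hello world");
```

Because the adapter is just an object with one async method, swapping vendors is a matter of constructing a different adapter, not modifying the avatar class.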

The 'TalkingHead' class should represent one talking avatar. For multiple avatars to share a scene, there should be several instances of the TalkingHead class sharing a scene. In some sense, if you ignore the naming, this PR moves in that direction by separating the camera and lights from the actor, but I don't think inheritance (extension) is the right way to relate the scene and the actors as the relationship is one-to-many. I'm not even convinced that the scene should be in the scope of the project (except in the same sense as Google TTS).
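The one-to-many relationship described above can be sketched as composition rather than inheritance. This is a hypothetical illustration (the class names `Stage` and `Actor` are invented, not part of the project): the scene *has* many actors, instead of an actor *being* a scene.

```javascript
// Hypothetical composition sketch: one stage, many actors.
class Actor {
  constructor(name) {
    this.name = name;
  }
  update(dt) {
    // a per-actor animation update would go here
  }
}

class Stage {
  constructor() {
    this.actors = []; // has-a: the stage owns a list of actors
  }
  add(actor) {
    this.actors.push(actor);
    return actor;
  }
  update(dt) {
    for (const a of this.actors) a.update(dt); // fan out the frame update
  }
}

const stage = new Stage();
stage.add(new Actor("alice"));
stage.add(new Actor("bob"));
```

With this shape, adding a second avatar is just another `add()` call; with inheritance, each actor would drag its own copy of the scene along.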

I really appreciate the time you have spent on this, and I hope that my comments don't discourage you. This is just a hobby project for me, but I used to work as a software architect for many years, and old habits die hard.

@anselm
Author

anselm commented Apr 24, 2024 via email

@met4citizen
Owner

Very interesting. I have no real experience in multiplayer game development, but it seems to me that there are many different ways in which one might implement a game engine and many ways to split the responsibilities between the web clients and the server. It would be a mistake, I think, to make any changes to the TalkingHead class without first having the game engine and knowing its architecture.

I know that Ready Player Me supports game engines such as Unity and UE, but do you know of any existing multiplayer game engines supporting Three.js?

On their site, Ready Player Me seems to have several multiplayer games that run on a browser. They are using a game engine/SDK called Hiber3D. Any thoughts about it?

P.S. The index.html is just a test app for the class. The TalkingHead module/class is what the project is all about.

anselm added 2 commits May 1, 2024 16:44
…e animation capability, a tts capability and then a top level wrapper (untested) that allows outsiders to drive the talking head in the same way as before hopefully - should have the same interface as before as far as outsiders are concerned. also removed the thrown error when no google tts audio supplied. also added an ability to pass an audio blob that is not an array buffer - since that is not transportable over the net easily - but rather you can pass a b64 encoded audio string which is then decoded insitu
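The base64-audio idea mentioned in the commit note can be sketched as follows. This is a hedged illustration, not the PR's actual code: a base64 string travels easily over the network, and the receiver decodes it to an `ArrayBuffer` in situ (`Buffer` in Node; a browser would use `atob()` instead).

```javascript
// Decode a base64-encoded audio payload into an ArrayBuffer.
// (Node version; in a browser you would decode with atob() and
// copy the bytes into a Uint8Array.)
function b64ToArrayBuffer(b64) {
  const buf = Buffer.from(b64, "base64");
  // Slice out exactly this Buffer's bytes from the backing ArrayBuffer.
  return buf.buffer.slice(buf.byteOffset, buf.byteOffset + buf.byteLength);
}

const ab = b64ToArrayBuffer("AAECAw=="); // encodes the bytes 0,1,2,3
```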
@anselm
Author

anselm commented May 12, 2024 via email

@met4citizen
Owner

I browsed through all your changes, and I'm glad you have found some parts useful.

The big issue here, I think, is that talking heads and multiplayer games are different use cases and have different functional and non-functional requirements. For example, they involve one versus multiple avatars, one versus multiple personalities, different camera angles, framing, and viewing distances, interaction with the user versus avatar-to-avatar, different user dynamics, etc. Furthermore, game engines impose various limitations on implementation, timing, concurrency, use of computational resources, etc.

Unlike Ready Player Me, which focuses on games, the TalkingHead project, as the name suggests, focuses on the "talking head" niche — a form of presentation where an individual's head and shoulders are displayed. The only reason I ended up using a full-body avatar was to simulate lower body movements (e.g., shifting weight from one leg to another) to make upper body movement seem more natural.

That said, I have no intention of extending the scope of the TalkingHead project to include the use cases of multiplayer games, as that would introduce more functional and non-functional requirements, add complexity, interdependencies, and lead to compromises. This means the TalkingHead project will not address all the functional and non-functional requirements of multiplayer games that you are probably aiming at. From your perspective, if multiplayer games are your focus, that's a problem.

Now, you can continue to develop your own branch, but I suspect it will move further and further away from the main branch. Since you clearly have a strong vision and the needed skills, my suggestion is that you take what is useful and start a new project that aligns better with multiplayer game use cases.

I think the above is also related to how and why you have split the class. It may serve your purposes, and that's great, but as it is now, I don't think it adds value to the TalkingHead project. I also think the way you have done the split mixes inheritance (is-a relationship) and composition (has-a relationship) in a mistaken manner, but, as you said, there are different design philosophies and preferences.

I also briefly looked at the Ethereal Engine (thanks for the link). I should, again, point out that I'm not a game developer, and didn't look at the source code, so I might be mistaken, but based on the examples, the engine already has most of the things in place: animations, audio, etc. What is missing - related to TalkingHead class functionality - is basically just the body language in interactions (including lip-sync and facial expressions).

Body language is just another "language" like English and French. In the long run, GPT-like models will output animations as well as any other language. I have already tried such products. In the short term, however, OpenAI's new GPT-4o probably has the best voice quality, but no timestamps, body language, blendshapes, or visemes, so you'll probably need Whisper, a personal body-language script for each game character (similar to poseTemplates and animMoods), and "factories" that can turn templates/text/audio into the animation format that the game engine supports (similar to the lip-sync modules, poseFactory, and animFactory).
