
Speaker Recognition #367

Draft: wants to merge 6 commits into base branch naomi-dev

Conversation

aaronchantrill (Contributor) commented Nov 6, 2022

Description

This introduces a new "sr" plugin type, which allows Naomi to recognize users by voice.

Speaker recognition only happens during active speech recognition, because passive speech recognition needs to be fast.

The default "sr" plugin simply returns the name stored in the profile variable "first_name" without attempting to recognize the speaker from the voice, which is essentially how Naomi originally worked. The name of the speaker is embedded in the intent passed to the speechhandler as 'user', so it can be accessed as intent.get('user', ''). The only plugin currently set up to use this is the shutdown plugin, which may respond using the name of the user. The name of the user appears in parentheses after the utterance if "print_transcript" is on.
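As a rough illustration of the pattern described above, a speechhandler plugin could personalize its response based on the 'user' key the sr plugin attaches to the intent. This is a minimal sketch; the `farewell` helper and its wording are hypothetical, only `intent.get('user', '')` comes from the PR description.

```python
def farewell(intent):
    """Build a farewell line, personalized when the sr plugin identified a speaker.

    The sr plugin embeds the recognized speaker's name in the intent under
    'user'; an empty string means no speaker was identified.
    """
    user = intent.get('user', '')
    return "Goodbye, {}!".format(user) if user else "Goodbye!"
```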

The setup still assumes en-US when downloading the VOSK models, which needs to be fixed to respect the "language" setting in the profile.

The VOSK speaker recognition is not terribly accurate. It also seems like you need to retrain your speaker recognition database from new recordings when you switch to different recording hardware.

Naomi does not record the speaker it thinks is speaking in the audiolog. You currently have to tag user utterances manually using the NaomiSTTTrainer.py program, although I would like to see the ability to learn voices while running by asking when unsure. With VOSK, if the cosine angle is less than 30, it is probably the correct speaker. If no voice matches with a cosine angle of less than 60, it is most likely a new voice. Any time the best match has a cosine angle greater than 30, Naomi should ask to verify who is talking.
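The verification policy above can be sketched as a small decision function. The thresholds 30 and 60 come from the paragraph; the candidate list format, the function name, and the decision labels are assumptions for illustration, not the PR's actual implementation.

```python
MATCH_THRESHOLD = 30      # below this, probably the correct speaker
NEW_VOICE_THRESHOLD = 60  # no candidate below this -> most likely a new voice

def classify_speaker(candidates):
    """Classify speaker-recognition output.

    candidates: list of (name, distance) pairs, where distance is the
    cosine angle reported by VOSK speaker recognition (smaller is better).
    Returns ('match', name), ('verify', name), or ('new_voice', None).
    """
    if not candidates:
        return ('new_voice', None)
    name, distance = min(candidates, key=lambda c: c[1])
    if distance < MATCH_THRESHOLD:
        return ('match', name)
    if distance < NEW_VOICE_THRESHOLD:
        # Best match is above 30: Naomi should ask to verify who is talking
        return ('verify', name)
    return ('new_voice', None)
```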

Related Issue

Ability to recognize users by voice #267
VOSK STT Engine #280
Simplify the mic initialization #326

Motivation and Context

My goal is to eventually start building vocabulary profiles for different users, including acoustic models and pronunciation dictionaries. This could also be used to build unique profiles for users.

How Has This Been Tested?

$ python -m unittest discover
...s.....ss.....sssssssss

Ran 25 tests in 4.354s

OK (skipped=12)

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

Working on introducing a new speaker recognition plugin type. This
plugin is meant both to recognize an individual and tailor responses
to them, and to improve speech recognition by training STT models for
individual speakers.
Added a default speaker recognizer, default_sr, which does not
attempt to identify the speaker but just responds with the first
name as stored in the profile.

I am also passing the result from the sr_plugin to the intent
parser, so the identity of the speaker can be attached to the
intent object being passed to the speechhandler.

I also simplified the list of parameters being passed to the mic
object when it is created, storing them instead in the profile.
There were numerous problems with the speech recognizer class that
made it not work when using the VOSK_sr plugin.

I am no longer trying to recognize the speaker when listening
passively for the wake word, only when doing the active listening.
This is because the passive listening needs to be very fast.

I am now putting the name of the identified user in parentheses
after the active listening transcript. Plugins can access the
identity of the speaker as `intent.get('user', '')`. The only
plugin currently set up to use this is the shutdown plugin. I
also have an update to the Greetings plugin which greets you by
name when you greet Naomi.

The setup still assumes en-US when downloading the VOSK models,
which needs to be fixed to respect the "language" setting in the
profile.

The VOSK speaker recognition is not terribly accurate. It also
seems like you need to retrain your speaker recognition database
from new recordings when you switch to different recording
hardware.
lgtm-com bot commented Nov 6, 2022

This pull request introduces 5 alerts when merging dab0b66 into d0418fd - view on LGTM.com

new alerts:

  • 2 for Signature mismatch in overriding method
  • 2 for Wrong number of arguments in a class instantiation
  • 1 for Variable defined multiple times

Fixed some method signature mismatches from when I removed the
parameters from the mic methods. Fixed an issue preventing the
input device verification from working during initial setup or
repopulate. Changed the name of the "confidence" result from
speaker recognition to "distance" (since smaller numbers are
better). Clarified how to enter multiple email addresses in the
notification client configuration, although I think that still
needs to be looked at: the safe email list does not appear to be
stored as a list, and I think Naomi should only respond to email
addresses in that list, not to any email address when the list
is empty.
Added filenames for the default French and German models.
Constructed URLs and paths from these models to automatically
download and extract the model that matches the language choice
in profile.

The only thing it does not currently do is check whether or not
VOSK is also used as the STT engine. If not, then the audio
file pointer should be passed to the actual STT engine.
If the mic gets cut off when Naomi is listening, then the
mic.listen() method will return False rather than an sr_output
dictionary.
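The model-selection step this commit describes could look roughly like the sketch below: map the profile "language" setting to a model filename, then construct the download URL. The base URL and model names follow the public VOSK model naming scheme, but treat the exact names, the dictionary keys, and the fallback behavior as assumptions for illustration.

```python
# Hypothetical mapping from the profile "language" setting to a VOSK model.
VOSK_MODEL_BASE = "https://alphacephei.com/vosk/models"

MODEL_BY_LANGUAGE = {
    'en-US': 'vosk-model-small-en-us-0.15',
    'fr-FR': 'vosk-model-small-fr-0.22',
    'de-DE': 'vosk-model-small-de-0.15',
}

def model_url(language):
    """Build the download URL for the model matching the profile language,
    falling back to the en-US model when the language is not listed."""
    model = MODEL_BY_LANGUAGE.get(language, MODEL_BY_LANGUAGE['en-US'])
    return "{}/{}.zip".format(VOSK_MODEL_BASE, model)
```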
aaronchantrill (Contributor, Author) commented:
I just realized that this change has changed the behavior of the mic.active_listen() method, which now returns a dictionary including the name of the speaker, a numerical confidence indicator in the identity of the speaker (distance), and the transcription of the utterance. This doesn't matter most of the time, since the utterance is already passed to the speechhandler plugin in an intent object, but it does matter for plugins that call active_listen directly (like the frotz plugin). This does not affect plugins that use the expect or confirm methods. Frotz may be the only plugin affected right now.
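A plugin that calls active_listen() directly could guard against both new behaviors at once: the dict return value and the False returned when the mic is cut off. This is a sketch only; the key names 'utterance', 'speaker', and 'distance' are assumptions based on the comment above, not confirmed API.

```python
def get_utterance(sr_output):
    """Extract the transcription from mic.active_listen() output.

    Handles the new dict form ({'utterance': ..., 'speaker': ...,
    'distance': ...}), the False returned when the mic is cut off,
    and, defensively, an old-style plain transcription.
    """
    if not sr_output:
        return None  # mic was cut off mid-listen
    if isinstance(sr_output, dict):
        return sr_output.get('utterance')
    return sr_output  # old-style plain transcription
```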

aaronchantrill mentioned this pull request Apr 15, 2023
Successfully merging this pull request may close these issues:

  • Simplify the mic initialization #326
  • Ability to recognize users by voice #267
  • VOSK STT Engine #280