Getting text back with no audio #157

serosenstein · 2024-08-15T22:57:15Z

Using an ESP32-S3-BOX-3 with Willow installed.

"Hi ESP, lock Front door"

Text displayed on ESP32: "Front door has been locked" Audio: none

Expected audio: Front door has been locked

[2024-08-15 22:53:06 +0000] [93] [DEBUG] FASTAPI: Got WILLOW request for model medium beam size 1 language detection False
[2024-08-15 22:53:06 +0000] [93] [DEBUG] WILLOW: Audio information: sample rate: 16000, bits: 16, channel(s): 1, codec: pcm
[2024-08-15 22:53:07 +0000] [93] [DEBUG] WILLOW: Source audio is raw PCM, creating WAV container
[2024-08-15 22:53:07 +0000] [93] [DEBUG] WHISPER: Loading audio took 1.5610000000000002 ms
[2024-08-15 22:53:07 +0000] [93] [DEBUG] WHISPER: Feature extraction took 34.336 ms
[2024-08-15 22:53:07 +0000] [93] [DEBUG] WHISPER: Using system default language en
[2024-08-15 22:53:07 +0000] [93] [DEBUG] WHISPER: Using model medium with beam size 1
[2024-08-15 22:53:07 +0000] [93] [DEBUG] Processing GPU batch 1 of expected 1
[2024-08-15 22:53:07 +0000] [93] [DEBUG] WHISPER: Model took 322.387 ms
[2024-08-15 22:53:07 +0000] [93] [DEBUG] WHISPER: Decode took 0.339 ms
[2024-08-15 22:53:07 +0000] [93] [DEBUG] WHISPER: ASR transcript: Lock front door.
[2024-08-15 22:53:07 +0000] [93] [DEBUG] WHISPER: Inference took 359.313 ms
[2024-08-15 22:53:07 +0000] [93] [DEBUG] WHISPER: Inference speedup: 3x
[2024-08-15 22:53:09 +0000] [93] [DEBUG] FASTAPI: Got TTS request for speaker CLB with format FLAC and text: Front door has been locked.
[2024-08-15 22:53:09 +0000] [93] [DEBUG] TTS: Got request for speaker CLB with text: Front door has been locked.
[2024-08-15 22:53:09 +0000] [93] [DEBUG] TTS: Loaded included speaker CLB
[2024-08-15 22:53:09 +0000] [93] [DEBUG] TTS: Loading speaker embedding took 1.484 ms
[2024-08-15 22:53:09 +0000] [93] [DEBUG] TTS: Getting inputs took 1.0970000000000002 ms
[2024-08-15 22:53:10 +0000] [93] [DEBUG] TTS: Generating audio took 493.322 ms
[2024-08-15 22:53:10 +0000] [93] [DEBUG] TTS: Generating file took 3.4099999999999997 ms
[2024-08-15 22:53:10 +0000] [93] [DEBUG] TTS: Total time took 499.855 ms

Using WIS in Docker on Ubuntu 24.04 (running as a Proxmox VM with GPU passthrough for a Tesla P40).

Side note: WebRTC recording also doesn't work, but I can generate TTS speech through the API docs.
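
Since the log shows the TTS request being served as FLAC, one way to narrow this down is to save a TTS response generated through the API docs and check that it is a non-empty FLAC stream. This is only a minimal sketch: the helper name `looks_like_flac` and the file name in the usage note are hypothetical, and it only checks the standard FLAC stream marker and minimum header size, not whether the audio decodes or is audible.

```python
# Hypothetical helper: sanity-check bytes saved from a TTS response.
# A native FLAC stream starts with the 4-byte marker "fLaC", followed by a
# 4-byte metadata block header and a 34-byte STREAMINFO block (42 bytes total),
# so anything that carries audio frames must be longer than that.

FLAC_MARKER = b"fLaC"
FLAC_MIN_HEADER_BYTES = 42  # marker + block header + STREAMINFO


def looks_like_flac(data: bytes) -> bool:
    """Return True if data starts like a FLAC stream and has room for audio."""
    return data[:4] == FLAC_MARKER and len(data) > FLAC_MIN_HEADER_BYTES
```

For example, `looks_like_flac(open("tts_response.flac", "rb").read())` returning `True` would suggest the server side produced a plausible file, which points toward playback on the device rather than TTS generation.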
