
Giving Your Smart AI Companion a Voice!

sahitid

Part 2/4 | Prerequisite: a little web dev knowledge (HTML, JS, CSS) | 45-60 minutes

Yay! I'm so glad you're continuing this adventure to build your own AI Companion!

Although this Jam covers the fundamentals of using AI and ML, give your creativity free rein to build a fully functioning smart assistant. I'm making an Orpheus that embodies Hack Club's mascot, but you certainly don't have to.

In fact, I encourage you to go out of your way to think of the most bizarre uses for your program. Take a look at what an AI came up with as project ideas:

chatgpt response with ideas for projects

In this part, we'll continue coding our web app with HTML and JavaScript to set up the text-to-speech feature and update our listen function so Orpheus knows when to speak and when to listen!

Outline:

  1. Enter the Code Editor (start coding in your browser)
  2. Set Up the HTML Some More (visualize the conversation with the AI)
  3. Let's Get Orpheus to Speak! (text-to-speech!)

Enter the Code Editor

First, let's get set up in the code editor! Head over to the Replit website and open the project you were working on in the last part of this Jam.

You should see your three files: index.html, script.js, and style.css.

Set Up the HTML Some More

Now let's get Orpheus to speak instead of a boring console.log.

There are two parts to this: 1) we want to hear it, and 2) we also want to see it, so we need to edit the HTML now.

That's easy! Go to index.html and add this inside the <body>:

<body>
    <div id="startup">&middot; &middot; &middot;</div>
    <div id="messages"></div>
    <div id="loading">&middot; &middot; &middot;</div>
    <div id="result">&middot; &middot; &middot;</div>
</body>

Don't be scared of HTML entities! They are just named codes for symbols; here we are using the middle dot (&middot; or ·) to represent a loading bubble.

The messages div will store all the messages between us and Orpheus, startup is the loading indicator shown while the Teachable Machine model first loads, loading is for Orpheus' responses, and result is for what we say.

Some of these will be invisible at first, so let's add some CSS:

#result,
#loading {
  display: none;
}
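
These hidden elements are placeholders for now; later we'll show and hide them from JavaScript as the conversation progresses. As a rough sketch of the idea (the show and hide helpers here are hypothetical, not part of the tutorial code yet):

// Hypothetical helpers: flip an element's display style to show or hide it
function show(id) {
  document.getElementById(id).style.display = 'block'
}

function hide(id) {
  document.getElementById(id).style.display = 'none'
}

// For example, show the loading bubbles while Orpheus thinks of a reply...
show('loading')
// ...and hide them again once the reply has been spoken
hide('loading')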

Let's Get Orpheus to Speak!

orpheus plushie gif

Okay, let's get Orpheus to speak now! We will create two functions in the JavaScript file: one to output Orpheus' response as text on the screen, and one to actually speak it out loud.

Here's the first one, which adds a message to the screen:

function addMessage({ role, content }) {
  let message = document.createElement('div')
  message.innerText = content
  document.getElementById('messages').appendChild(message) // Add it to the messages div so it shows up
  message.scrollIntoView(false) // Scroll to the message
}

Here we are destructuring the argument because OpenAI returns responses in the shape { role, content }.
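
For example, calling it with an object shaped like an OpenAI chat message (just a made-up sample to show the destructuring in action):

addMessage({ role: 'assistant', content: "Hi, I'm Orpheus!" })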

And the other one to actually speak:

It will be an async function, which means it runs asynchronously: talking takes time, and we want to make sure Orpheus finishes talking before we move on to the next thing. To do this, we use promises, which tell us when a function is done running and whether or not it was successful.

async function speak(message) {
  return new Promise((resolve, reject) => {
    let synth = window.speechSynthesis
    if (synth) {
      let utterance = new SpeechSynthesisUtterance(message)
      
      synth.cancel()
      synth.speak(utterance)
      utterance.onend = resolve
    } else {
      reject('speechSynthesis not supported')
    }
  })
}

The speak function takes a message as input and returns a promise. Inside the function, it checks if the browser supports speech synthesis. If it does, it proceeds with converting the message into speech.

It then creates a speech utterance (an object representing the text to be spoken), cancels any speech that's already in progress, speaks the new utterance, and resolves the promise once the utterance finishes. In a moment, we'll also set options on the utterance, like the voice and speaking rate.

Also, if the browser doesn't support speech synthesis, it rejects the promise with a message saying that speech synthesis is not supported.
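
If you want to try out speak on its own before wiring it up, here's a quick, hypothetical test snippet. (Note that some browsers only allow speech synthesis after you've interacted with the page, so you may need to click somewhere first.)

// A quick test (not part of the final code): say two lines back to back
// Because speak returns a promise, await makes the second line wait for the first
async function testSpeak() {
  await speak('Hello, I am Orpheus')
  await speak('Nice to meet you')
}

testSpeak()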

Change Up Your AI Companion's Voice

Additionally, if you add a few lines of code under let utterance = new SpeechSynthesisUtterance(message), you can change up your AI Companion's voice. There are a lot of voices to choose from:

const voices = synth.getVoices()
utterance.voice = voices[voices.findIndex(voice => voice.name === 'Good News')]
utterance.rate = 1

Play around with the voices! You can find a list of some voices here. (Try out Good News, it's hilarious!)
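
If you're curious which voices your own browser has, you can print them to the console. The list varies by browser and operating system, and in some browsers it loads asynchronously, so it might be empty the very first time you ask:

// Log every voice the browser knows about, with its name and language
speechSynthesis.getVoices().forEach(voice => {
  console.log(voice.name, voice.lang)
})

// Some browsers load the voice list asynchronously; this event fires once it's ready
speechSynthesis.onvoiceschanged = () => {
  console.log('Voices loaded:', speechSynthesis.getVoices().length)
}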

Now, let's update our existing listen function:

recognizer.listen(
  result => {
    const orpheusNoise = result.scores[1]
    if (orpheusNoise > THRESHOLD) {
      speak('Hey').then(() => {
        console.log("Done speaking")
      })
    }
  },
  {
    includeSpectrogram: true, // include the audio spectrogram in each result
    probabilityThreshold: 0.75, // only fire the callback for reasonably confident predictions
    invokeCallbackOnNoiseAndUnknown: true, // fire even for background noise / unknown sounds
    overlapFactor: 0.5 // how much successive listening windows overlap (between 0 and 1)
  }
)

In case it wasn't obvious enough, here is what we changed:

before and after for code

Notice that we replaced console.log with speak('Hey').then(() => ...). Because speak returns a promise, we can say: once speak is done running, then move on to the next thing, which is console.logging "Done speaking".
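
By the way, result.scores holds one probability per class, in the same order as the classes in your Teachable Machine model, which is why we read index 1 for the Orpheus noise. If you're ever unsure which index is which, you can log the labels after the model loads (the speech-commands recognizer exposes a wordLabels() method):

// Somewhere after `recognizer = await createModel()`:
// prints something like ['Background Noise', 'Orpheus Noise'] (your labels will differ)
console.log(recognizer.wordLabels())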

We also need to add a listening variable so that Orpheus isn't speaking and listening at the same time. It tells Orpheus whether she should currently be speaking or listening, so the two don't overlap (which could lead to problems).

Initialize the following variable at the beginning of your program:

let listening = false

Now incorporate it in the listen function:

recognizer.listen(
  result => {
    const orpheusNoise = result.scores[1]
    if (orpheusNoise > THRESHOLD && !listening) {
      listening = true
      speak('Hey').then(() => {
        console.log("Done speaking")
        listening = false
      })
    }
  },
  {
    includeSpectrogram: true,
    probabilityThreshold: 0.75,
    invokeCallbackOnNoiseAndUnknown: true,
    overlapFactor: 0.5
  }
)

And finally, we need to call the init function at the end!

init()

Full Code and Demo

const URL = 'https://teachablemachine.withgoogle.com/models/hiSl8IOc-/'
const THRESHOLD = 0.9

let listening = false

async function speak(message) {
  return new Promise((resolve, reject) => {
    let synth = window.speechSynthesis
    if (synth) {
      let utterance = new SpeechSynthesisUtterance(message)
      const voices = synth.getVoices()
      utterance.voice = voices[voices.findIndex(voice => voice.name === 'Good News')]
      utterance.rate = 1
      synth.cancel()
      synth.speak(utterance)
      utterance.onend = resolve
    } else {
      reject('speechSynthesis not supported')
    }
  })
}

function addMessage({ role, content }) {
  let message = document.createElement('div')
  message.innerText = content
  document.getElementById('messages').appendChild(message)
  message.scrollIntoView(false)
}

window.onload = () => {
  let recognizer;

  async function createModel() {
    const checkpointURL = URL + 'model.json'
    const metadataURL = URL + 'metadata.json'

    const recognizer = speechCommands.create(
      'BROWSER_FFT',
      undefined,
      checkpointURL,
      metadataURL
    )

    await recognizer.ensureModelLoaded()

    return recognizer
  }

  async function init() {
    recognizer = await createModel()

    recognizer.listen(
      result => {
        const orpheusNoise = result.scores[1]
        if (orpheusNoise > THRESHOLD && !listening) {
          listening = true
          speak('Hey').then(() => {
            console.log("Done speaking")
            listening = false
          })
        }
      },
      {
        includeSpectrogram: true,
        probabilityThreshold: 0.75,
        invokeCallbackOnNoiseAndUnknown: true,
        overlapFactor: 0.5
      }
    )
  }

  init()
}
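
One last reminder: script.js relies on the TensorFlow.js and speech-commands libraries that index.html should already be loading from Part 1. If you see an error like speechCommands is not defined, double-check that those script tags appear before script.js. A typical Teachable Machine audio export uses tags along these lines (the exact versions are just an example and may differ from yours):

<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs@1.3.1/dist/tf.min.js"></script>
<script src="https://cdn.jsdelivr.net/npm/@tensorflow-models/speech-commands@0.4.0/dist/speech-commands.min.js"></script>
<script src="script.js"></script>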

Additional Challenge Hacking! (recommended)

It's that time you've all been waiting for! Let's Hack our Jam! Try doing one of the following:

  • Try adding a setting that lets people choose their assistant's voice when they open your web app
  • Have you ever looked into Eleven Labs AI voices? How might you go about incorporating their API into this project for more realistic voices?

Stay Tuned To Build Your Very Own Smart Voice Assistant

We just gave our model a voice and the necessary tools to know when to listen. But we still need to give it the functionality to understand language and speech.

In the next part, we'll integrate the speech recognition API to convert our speech to text for the AI Companion to understand.
