Sanjeet
Speakr - How It Works

Speakr is a web app which allows you to write in the air, using your mobile phone as a pen. It utilises the onboard IMU to record movements before rendering these paths as letters in an image. Handwriting recognition is then used to determine the written text, which is then played out loud via text-to-speech.

I created Speakr as an experiment in an alternative mode of interaction - the project also ties in nicely with the DigitalOcean App Platform Hackathon.

So what is Speakr for? Speakr demonstrates a novel form of communication which can be applied to many different scenarios. Perhaps it could be used in third-party mobile games for greater interactivity. With this setup, gestures such as arrows and shapes can be recognised with custom ML models, enabling more intuitive controls for smart home devices. The possibilities presented by gesture-to-text or gesture-to-speech translation are truly endless.

Here's the code for the Speakr web app, shown in the video. This web app was built with React, Node.js and the Google Cloud APIs, with the server hosted on the DigitalOcean App Platform. Below, I will walk through key parts of the gesture-to-image functionality.

How It Works

Below is an initial test of the drawing functionality, rendered from movements recorded by my mobile phone's IMU. Pretty good!

Test Draw



I chose to use the Generic Sensor API for reading sensor data, shown below as a simple callback.

this.sensor = new AbsoluteOrientationSensor({
      frequency: 60,
});

this.sensor.addEventListener("reading", (e) => this.readSensor(e));
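The sensor also has to be started before any readings arrive, and it can fail if the hardware or permissions are missing. A minimal sketch of that start-up - the error handling here is illustrative rather than lifted from the app:

// Report sensor problems (missing hardware, blocked permissions, etc.)
this.sensor.addEventListener("error", (e) => {
  console.log("Sensor error: " + e.error.name);
});

// Begin delivering "reading" events at the requested frequency
this.sensor.start();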



I initially started with the accelerometer; however, it was too noisy and unreliable, so I quickly switched to the orientation fusion sensor.

By combining data from multiple real sensors, new virtual sensors can be implemented that combine and filter the data so that it’s easier to use — these are known as fusion sensors. In this case, data from the onboard magnetometer, accelerometer, and gyroscope are used for the AbsoluteOrientationSensor’s implementation.

After calculating the distance moved along each axis, the resulting position is pushed onto the arrays storing the path of the current letter.

readSensor(e) {
    let q = e.target.quaternion;
    let angles = this.toEuler(q);

    if (!this.draw) {
      this.initAngle = angles;
      this.draw = true;
    }

    let pos = angles.map((angle, i) => this.calcDist(angle, i));
    this.text[this.numChar][0].push(pos[0]);
    this.text[this.numChar][1].push(pos[1]);
  }
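For context, this.text holds one entry per letter, each containing an x-path and a y-path, and this.numChar points at the letter currently being drawn. The exact bookkeeping in the app may differ - this is just a sketch of the structure the snippet above assumes:

// One [xPoints, yPoints] pair per letter
this.text = [[[], []]];
this.numChar = 0;

// Hypothetical helper: called when the user lifts off to start the next letter
nextChar() {
  this.draw = false;         // the next reading will reset initAngle
  this.numChar += 1;
  this.text.push([[], []]);  // fresh x/y path arrays for the new letter
}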



The Sensor API returned quaternions, so I had to convert them to Euler angles for easier manipulation. Here's a basic implementation in JavaScript - pitch has been omitted since I am only working with two dimensions.

// Wikipedia Implementation
  toEuler(q) {
    let sinr_cosp = 2 * (q[3] * q[0] + q[1] * q[2]);
    let cosr_cosp = 1 - 2 * (q[0] * q[0] + q[1] * q[1]);
    let roll = Math.atan2(sinr_cosp, cosr_cosp);

    let siny_cosp = 2 * (q[3] * q[2] + q[0] * q[1]);
    let cosy_cosp = 1 - 2 * (q[1] * q[1] + q[2] * q[2]);
    let yaw = Math.atan2(siny_cosp, cosy_cosp);
    return [yaw, roll];
  }
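As a quick sanity check, the Generic Sensor API exposes the quaternion as [x, y, z, w], so a 90° rotation about the vertical (z) axis should come out as a yaw of π/2 with zero roll:

// Quaternion for a 90° rotation about z: [0, 0, sin(45°), cos(45°)]
let q = [0, 0, Math.SQRT1_2, Math.SQRT1_2];
let [yaw, roll] = this.toEuler(q);
console.log(yaw);  // ~1.5708 (π/2)
console.log(roll); // 0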



The distance between the projected initial and current angles is calculated using simple trigonometry and angle differences, with the initial angles set to the orientation of the mobile phone at the moment the 'Draw' button is pressed down.

calcDist(angle, i) {
    angle = (angle - this.initAngle[i]) * (180 / Math.PI);
    angle = angle < 0 ? angle + 360 : angle;
    angle = angle > 180 ? angle - 360 : angle;
    let dist = -1 * Math.tan(angle * (Math.PI / 180));
    return dist;
  }
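To make the projection concrete: the angle difference is effectively projected onto a drawing plane one unit away from the phone, which is what the tan term does. Assuming a starting angle of zero:

// With this.initAngle[0] = 0:
this.calcDist(0, 0);            // 0  - no tilt, the pen stays put
this.calcDist(Math.PI / 4, 0);  // -1 - a 45° tilt moves the pen one unit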



Finally, the stored movements have to be rendered onto a canvas in order to generate an image. With a combination of scaling and offsets, each 'letter' is resized into a bounding box and combined with all other letters to form an image of the word drawn in the air.

renderText(ctx) {
    ctx.beginPath();

    this.text.forEach((char, i) => {
      let xpos = char[0];
      let ypos = char[1];

      // Bounding box of this letter's path
      let xmin = Math.min(...xpos);
      let ymin = Math.min(...ypos);

      if (xmin > 0) xmin = 0;
      if (ymin > 0) ymin = 0;

      let xrange = Math.max(...xpos) - xmin;
      let yrange = Math.max(...ypos) - ymin;

      // Scale the letter to fit its box, preserving aspect ratio
      let xmulti = (this.letterSize * this.letterWidth) / xrange;
      let ymulti = (this.letterSize * this.letterHeight) / yrange;
      let multi = Math.min(xmulti, ymulti);

      // Centre the letter within its box, inside the image border
      let xoffset = this.border + (this.letterWidth - xrange * multi) / 2;
      let yoffset = this.border + (this.letterHeight - yrange * multi) / 2;

      // Shift each successive letter along the x-axis
      let letterOffset = i * this.letterWidth;

      for (let j = 0; j < xpos.length; j++) {
        let x = xoffset + (Math.abs(xmin) + xpos[j]) * multi + letterOffset;
        let y = yoffset + (Math.abs(ymin) + ypos[j]) * multi;

        if (j === 0) {
          ctx.moveTo(x, y);
        } else {
          ctx.lineTo(x, y);
        }
      }
    });

    ctx.stroke();
  }
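To hand the drawing over for recognition, the canvas is read back out as an image. A minimal sketch of that export step, assuming the canvas element is held in this.canvas (the property name and upload handling are illustrative):

// Export the drawing as a base64-encoded PNG
let dataUrl = this.canvas.toDataURL("image/png");

// Strip the "data:image/png;base64," prefix before sending it to the server
let imageBase64 = dataUrl.split(",")[1];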



An image is then generated from this canvas element. Here are some iterations from tweaking various attributes of the render, such as letter size, margins and line width:

Render Iterations



This was the bulk of the problem: the render needed to be legible enough for the Google Vision API to perform handwriting recognition and determine the written text. Getting there was mainly trial and error, and it took quite some time to reach an acceptable level of accuracy.

Finally, this text is converted to speech with the Google TTS API and spoken by the mobile phone.
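On the server, both steps map onto the Google Cloud Node.js client libraries. A rough sketch of how the recognition and speech calls could look - the exact request options and wiring in Speakr may differ:

const vision = require("@google-cloud/vision");
const tts = require("@google-cloud/text-to-speech");

const visionClient = new vision.ImageAnnotatorClient();
const ttsClient = new tts.TextToSpeechClient();

async function imageToSpeech(imageBuffer) {
  // Handwriting recognition on the rendered canvas image
  const [result] = await visionClient.documentTextDetection({
    image: { content: imageBuffer },
  });
  const text = result.fullTextAnnotation ? result.fullTextAnnotation.text : "";

  // Convert the recognised text to speech
  const [response] = await ttsClient.synthesizeSpeech({
    input: { text },
    voice: { languageCode: "en-GB", ssmlGender: "NEUTRAL" },
    audioConfig: { audioEncoding: "MP3" },
  });

  return response.audioContent; // MP3 audio to send back to the browser
}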

And that's it - feel free to play around with Speakr!
