Azure Cognitive Services - Speech To Text

The Speech service is one of the Azure Cognitive Services. It provides speech to text, text to speech, speech translation, and more. Today's topic is Speech To Text (STT).

STT supports two access modes: 1. SDK and 2. REST API.

Among them:

The SDK can recognize both audio streams from a microphone and audio files;

The REST API supports audio files only.

Preparation: create a Speech service resource in Cognitive Services:

After creation, two important parameters can be viewed on the page:

1. Convert voice files to text with the REST API:

For the Speech API endpoint of Azure global, please refer to:

Speech API endpoint of Azure China:

As of February 2020, the Speech service is available only in the China East 2 region, and the service endpoint is:

For Speech To Text, there are two authentication methods:

The Authorization Token is valid for 10 minutes.

For simplicity, this article uses the Ocp-Apim-Subscription-Key method.

Note: if you want to convert text to speech, you must authenticate with an Authorization Token, according to the above table.
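The two authentication methods can be sketched as a small helper that returns the header for exactly one of them. This is only an illustration: the function name build_auth_headers is hypothetical, and the header names (Ocp-Apim-Subscription-Key, and Authorization with a Bearer prefix) are assumed from the usual Azure conventions rather than taken from the table above.

```python
from typing import Dict, Optional


def build_auth_headers(subscription_key: Optional[str] = None,
                       token: Optional[str] = None) -> Dict[str, str]:
    """Return the HTTP header for exactly one authentication method."""
    # Exactly one of the two credentials must be supplied.
    if (subscription_key is None) == (token is None):
        raise ValueError("Pass either subscription_key or token, not both")
    if subscription_key is not None:
        # Method 1 (assumed header name): send the subscription key directly.
        return {"Ocp-Apim-Subscription-Key": subscription_key}
    # Method 2 (assumed header format): send a previously obtained token.
    return {"Authorization": "Bearer " + token}
```

Calling build_auth_headers(subscription_key="YourSubscriptionKey") gives the header used in the rest of this article; passing token=... instead gives the Authorization variant.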

Other considerations for building requests:

  1. File format:

  2. Request header:

    Note that the subscription key and the Authorization header are alternatives: send one or the other, not both.

  3. Request parameters:
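For the file format item in the list above, it can help to check a WAV file before uploading it. The sketch below reads the WAV header with Python's standard wave module. The helper names are hypothetical, the mono / 16 kHz expectation comes from the comment in the SDK sample later in this article, and the 16-bit sample width is an assumption.

```python
import wave


def describe_wav(path):
    """Read the WAV header and return (channels, sample_rate_hz, bits_per_sample)."""
    with wave.open(path, "rb") as wav_file:
        return (wav_file.getnchannels(),
                wav_file.getframerate(),
                wav_file.getsampwidth() * 8)


def is_mono_16khz_pcm(path, expected_bits=16):
    """True if the file is mono, 16 kHz, with the expected sample width.

    expected_bits=16 is an assumption, not a value stated in this article.
    """
    channels, rate, bits = describe_wav(path)
    return channels == 1 and rate == 16000 and bits == expected_bits
```

If is_mono_16khz_pcm returns False, the file probably needs to be converted before it is sent to the service.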

The example in Postman is as follows:
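The same request can be sketched in Python with the standard library. This is a minimal sketch, not the article's original example: build_stt_request is a hypothetical helper, the endpoint is passed in by the caller (a placeholder host is used below), and the header values and the language parameter are assumptions based on common usage of the service.

```python
import urllib.parse
import urllib.request


def build_stt_request(endpoint, subscription_key, audio_bytes, language="zh-CN"):
    """Build (but do not send) a POST request that carries WAV audio."""
    # The recognition language is passed as a query parameter (assumed name).
    url = endpoint + "?" + urllib.parse.urlencode({"language": language})
    headers = {
        # Key-based authentication; use an Authorization header instead,
        # never both at the same time.
        "Ocp-Apim-Subscription-Key": subscription_key,
        # Assumed content type for 16 kHz PCM WAV audio.
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json",
    }
    return urllib.request.Request(url, data=audio_bytes,
                                  headers=headers, method="POST")


# Placeholder endpoint; replace it with the Speech endpoint of your region.
request = build_stt_request("https://example.invalid/stt", "YourSubscriptionKey", b"")
print(request.get_method(), request.full_url)
```

To actually send the request, pass it to urllib.request.urlopen() and read the JSON body of the response.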

To use an Authorization Token with the REST API, you must first obtain the token:

Token endpoint for Azure global:

Token endpoint for Azure China:

As of February 2020, only China East 2 offers the Speech service, and its token endpoint is:

Obtaining the token in Postman looks like this:
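Because the token is valid for only 10 minutes, a client that makes many requests usually caches it and refreshes it shortly before it expires. The following sketch shows one way to do that. TokenCache and build_token_request are hypothetical names, the token endpoint is supplied by the caller, and sending the subscription key in an Ocp-Apim-Subscription-Key header with an empty POST body is an assumption about how the token endpoint is called.

```python
import time
import urllib.request


def build_token_request(token_endpoint, subscription_key):
    """Build (but do not send) the POST request that exchanges a key for a token."""
    headers = {"Ocp-Apim-Subscription-Key": subscription_key}
    # The body is empty; only the header carries the credential.
    return urllib.request.Request(token_endpoint, data=b"",
                                  headers=headers, method="POST")


class TokenCache:
    """Reuse a token until shortly before its 10-minute validity ends."""

    def __init__(self, fetch_token, lifetime_s=600, margin_s=60,
                 clock=time.monotonic):
        self._fetch_token = fetch_token  # callable returning a fresh token string
        self._lifetime_s = lifetime_s    # 10 minutes, as stated above
        self._margin_s = margin_s        # refresh this many seconds early
        self._clock = clock
        self._token = None
        self._fetched_at = None

    def get(self):
        """Return a cached token, fetching a new one when it is about to expire."""
        now = self._clock()
        expired = (self._fetched_at is None
                   or now - self._fetched_at >= self._lifetime_s - self._margin_s)
        if expired:
            self._token = self._fetch_token()
            self._fetched_at = now
        return self._token
```

In real use, fetch_token would send the request built by build_token_request with urllib.request.urlopen() and return the decoded response body.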

2. Convert voice files to text with the SDK (Python example):

Similar code can be found on the official website, but note that it only works with the Speech service in Azure global; specific changes are needed for Azure China (see below).

import azure.cognitiveservices.speech as speechsdk

# Creates an instance of a speech config with specified subscription key and service region.
# Replace with your own subscription key and service region (e.g., "chinaeast2").
speech_key, service_region = "YourSubscriptionKey", "YourServiceRegion"
speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)

# Creates an audio configuration that points to an audio file.
# Replace with your own audio filename.
audio_filename = "whatstheweatherlike.wav"
audio_input = speechsdk.AudioConfig(filename=audio_filename)

# Creates a recognizer with the given settings
speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_input)

print("Recognizing first result...")

# Starts speech recognition, and returns after a single utterance is recognized. The end of a
# single utterance is determined by listening for silence at the end or until a maximum of 15
# seconds of audio is processed.  The task returns the recognition text as result.
# Note: Since recognize_once() returns only a single utterance, it is suitable only for single
# shot recognition like command or query.
# For long-running multi-utterance recognition, use start_continuous_recognition() instead.
result = speech_recognizer.recognize_once()

# Checks result.
if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print("Recognized: {}".format(result.text))
elif result.reason == speechsdk.ResultReason.NoMatch:
    print("No speech could be recognized: {}".format(result.no_match_details))
elif result.reason == speechsdk.ResultReason.Canceled:
    cancellation_details = result.cancellation_details
    print("Speech Recognition canceled: {}".format(cancellation_details.reason))
    if cancellation_details.reason == speechsdk.CancellationReason.Error:
        print("Error details: {}".format(cancellation_details.error_details))

Source of the code:

For Azure China, you need to specify a custom endpoint for the SDK to work:

speech_key, service_region = "Your Key", "chinaeast2"
template = "wss://{}"
speech_config = speechsdk.SpeechConfig(subscription=speech_key,
        endpoint=template.format(service_region, int(initial_silence_timeout_ms)))
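The custom endpoint is an ordinary URL whose query string carries the extra recognition parameters. As a rough sketch of that idea, the hypothetical helper below appends URL-encoded parameters to a base URL. The host is a placeholder, and the parameter name initialSilenceTimeoutMs is an assumption that does not come from the snippet above; language=zh-CN matches the complete code in this article.

```python
from urllib.parse import urlencode


def with_query(base_url, params):
    """Append URL-encoded query parameters to a base endpoint URL."""
    # Use "&" if the base URL already has a query string, otherwise "?".
    separator = "&" if "?" in base_url else "?"
    return base_url + separator + urlencode(params)


# Placeholder base URL; replace it with the real endpoint of your region.
endpoint = with_query("wss://example.invalid/recognize",
                      {"initialSilenceTimeoutMs": 15000, "language": "zh-CN"})
print(endpoint)
```

The resulting string can then be passed as the endpoint argument of speechsdk.SpeechConfig, as in the snippet above.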

The complete code for Azure China is:

#!/usr/bin/env python
# coding: utf-8

# Copyright (c) Microsoft. All rights reserved.
# Licensed under the MIT license. See the license file in the project root for full license information.
"""
Speech recognition samples for the Microsoft Cognitive Services Speech SDK
"""

import time
import wave

try:
    import azure.cognitiveservices.speech as speechsdk
except ImportError:
    print("""
    Importing the Speech SDK for Python failed.
    Refer to the Speech SDK documentation for
    installation instructions.
    """)
    import sys
    sys.exit(1)

# Set up the subscription info for the Speech Service:
# Replace with your own subscription key and service region (e.g., "westus").
speech_key, service_region = "your key", "chinaeast2"

# Specify the path to an audio file containing speech (mono WAV / PCM with a sampling rate of 16
# kHz).
filename = r"D:\FFOutput\speechtotext.wav"

def speech_recognize_once_from_file_with_custom_endpoint_parameters():
    """performs one-shot speech recognition with input from an audio file, specifying an
    endpoint with custom parameters"""
    initial_silence_timeout_ms = 15 * 1e3
    template = "wss://{}{:d}&language=zh-CN"
    speech_config = speechsdk.SpeechConfig(subscription=speech_key,
            endpoint=template.format(service_region, int(initial_silence_timeout_ms)))
    print("Using endpoint", speech_config.get_property(speechsdk.PropertyId.SpeechServiceConnection_Endpoint))
    audio_config = speechsdk.AudioConfig(filename=filename)
    # Creates a speech recognizer using a file as audio input.
    # The default language is "en-us".
    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)
    result = speech_recognizer.recognize_once()

    # Check the result
    if result.reason == speechsdk.ResultReason.RecognizedSpeech:
        print("Recognized: {}".format(result.text))
    elif result.reason == speechsdk.ResultReason.NoMatch:
        print("No speech could be recognized: {}".format(result.no_match_details))
    elif result.reason == speechsdk.ResultReason.Canceled:
        cancellation_details = result.cancellation_details
        print("Speech Recognition canceled: {}".format(cancellation_details.reason))
        if cancellation_details.reason == speechsdk.CancellationReason.Error:
            print("Error details: {}".format(cancellation_details.error_details))


speech_recognize_once_from_file_with_custom_endpoint_parameters()

Note that to recognize speech from the microphone with the SDK, change this line:

speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

to the following (remove the audio_config parameter):

speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)


Tags: SDK REST Python Linux

Posted on Mon, 03 Feb 2020 03:18:55 -0500 by tequilacat