  • Introduction
  • Authentication
  • Workflow
  • Audio Format
    Introduction

    print("Welcome to Notula Speech API :D")
    

    Welcome to the Notula Speech API, also called nplatform, which integrates the Notula Speech Recognition Engine with developer applications. The API recognizes Bahasa Indonesia. You can transcribe speech from either a live microphone or recorded audio using state-of-the-art technology, and use the results to build voice-driven features such as command-and-control.

    Authentication

    OAuth2

    Once you have registered your account, you need to request an access_token from our OAuth2 authentication server before accessing the Speech API server. To obtain the access_token, send a POST request to the authentication server as follows:

    auth request sample

    request_header = {
        'content-type' : 'application/x-www-form-urlencoded',
        'Authorization' : 'Basic ' + user_credentials
    }
    
      curl -X POST 'https://oauth.bahasakita.co.id/api/token' \
        -H "Content-type: application/x-www-form-urlencoded" \
        -H "Authorization: Basic user_credentials" \
        -d 'grant_type=client_credentials' \
        -d 'scope=SpeechTest' \
        --cacert "oauth.bahasakita.co.id.pem"
    
    Key Value
    content-type application/x-www-form-urlencoded
    Authorization Basic user_credentials

    PAYLOAD

    request_body = {
        'grant_type' : 'client_credentials',
        'scope': 'SpeechTest'
    }
    
    Key Value
    grant_type 'client_credentials'
    scope 'SpeechTest'

    RESPONSE SAMPLE

    auth response sample

    {
        "scope": "SpeechTest",
        "token_type": "bearer",
        "expires_in": 3600,
        "access_token": "eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJjbGllbnRfaWQiOiJia19zcGVlY2giLCJleHBpcmVzX2luIjozNjAwLCJpYXQiOjE1MjQ3MDU4ODcsInNhbHQiOiJaUDhRLVYoWzdHITkkOU01Iiwic2NvcGUiOiJTcGVlY2hUZXN0IiwidHlwZSI6ImNsaWVudF90b2tlbiJ9.8GBPREDEzaaaIkkm3nP-PTZHB3s2bjzR9pMdI1sjw4g"
    }
    
    
    Key Value
    scope SpeechTest
    token_type bearer
    expires_in 3600
    access_token access_token
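
    The token request above can be sketched with the Python standard library alone. This is a minimal sketch, not the official client: build_token_request is a hypothetical helper, and it assumes that user_credentials is the base64 encoding of "client_id:client_secret", as is conventional for HTTP Basic authentication. The actual network call is left commented out so the sketch runs offline.

```python
import base64
import urllib.parse
import urllib.request

TOKEN_URL = "https://oauth.bahasakita.co.id/api/token"


def build_token_request(client_id, client_secret, scope="SpeechTest"):
    """Build (but do not send) the POST request for an access_token."""
    # Assumption: user_credentials = base64("client_id:client_secret").
    raw = f"{client_id}:{client_secret}".encode("utf-8")
    user_credentials = base64.b64encode(raw).decode("ascii")
    # The body is form-encoded, matching the content-type header.
    body = urllib.parse.urlencode(
        {"grant_type": "client_credentials", "scope": scope}
    ).encode("ascii")
    return urllib.request.Request(
        TOKEN_URL,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Authorization": "Basic " + user_credentials,
        },
    )


request = build_token_request("my_client_id", "my_client_secret")
print(request.get_method(), request.full_url)

# To actually send it (requires network access and valid credentials):
# import json
# with urllib.request.urlopen(request) as response:
#     access_token = json.load(response)["access_token"]
```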

    Workflow

    Notula Speech API uses a synchronous communication scheme. Before audio content can be processed, you have to reserve a connection to the server by initiating a stream (init). Once the transcription server has allocated the resource, notify it that you are ready to stream audio (bos). Audio payloads are then uploaded one after another (audio). Each payload is answered with chunks of text, marked as either partial or final. To end a streaming session, tell the server to conclude the stream (eos). All of these requests are sent to the Speech API server. The figure below depicts the sequence of the workflow.

    [Figure: Workflow]
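
    The order of requests in one session can be summarised in a few lines. This is only an illustrative sketch: command_sequence is a hypothetical helper that lists the cmd value of each request, and it assumes every request waits for the previous response before being sent.

```python
def command_sequence(chunk_count):
    """Yield the cmd value of each request in one streaming session."""
    yield "init"  # reserve a session on the server
    yield "bos"   # announce that audio is about to be streamed
    for _ in range(chunk_count):
        yield "audio"  # one request per audio chunk
    yield "eos"   # conclude the stream


print(list(command_sequence(3)))
```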

    Initiate Stream

    Initiating a stream registers a session with the Notula Speech API service. To initiate a stream, send a POST request with the following header and payload:

    HEADER

    init request sample

    init_headers = {
        'content-type': 'application/json',
        'Accept-Charset': 'UTF-8',
        'Authorization' : 'Bearer ' + access_token
    }
    
    curl -X POST 'https://api.bahasakita.co.id/speech' \
    -H "Content-type: application/json" \
    -H "Authorization: Bearer [access_token]" \
    -d '{"bk":{"cmd":"init","entity":"my_entity","stan":"stan_uuid","time":current_time,"protocol":"stream","version":"1.0","type":0,"data":{"user_id":"info@bahasakita.co.id","session_id":"session_id"}}}' \
    --cacert "api.bahasakita.co.id.pem"
    
    Key Value
    content-type application/json
    Accept-Charset UTF-8
    Authorization Bearer access_token

    PAYLOAD

    import time, uuid
    
    my_entity    = "Mitra Bahasa Kita"
    stan_uuid    = str(uuid.uuid4())
    session_uuid = str(uuid.uuid4())
    
    payload_dict = {}
    payload_dict['bk'] = {}
    payload_dict['bk']['cmd'] = '%s' %('init')
    payload_dict['bk']['entity'] = '%s' %(my_entity)
    payload_dict['bk']['stan'] = '%s' %(stan_uuid)
    payload_dict['bk']['type'] = 0
    payload_dict['bk']['time'] = int(time.time())
    payload_dict['bk']['protocol'] = '%s' %('stream')
    payload_dict['bk']['version'] = '%s' %('1.0')
    payload_dict['bk']['data'] = {}
    payload_dict['bk']['data']['session_id'] = '%s' %(session_uuid)
    
    Key Child Key Grand Child Key Value
    bk cmd init
    bk entity my_entity
    bk stan stan_uuid
    bk time current_time
    bk protocol stream
    bk version 1.0
    bk type type_message
    bk data session_id session_uuid

    RESPONSE SAMPLE

    init response sample

    {
        "bk": {
            "protocol": "stream",
            "type": 1,
            "entity": "Mitra Bahasa Kita",
            "version": "1.0",
            "stan": "e43cdcf4-a194-45b2-9d14-cf8ab43a0992",
            "time": 1522126286126,
            "cmd": "init",
            "code": 0,
            "data": {
                "session_id": "20b8f625-1e78-4ebb-9d97-0a15e1290d23"
            }
        }
    }
    
    Key Child Key Grand Child Key Value
    bk protocol stream
    bk type 1
    bk entity my_entity
    bk version 1.0
    bk stan stan_uuid (as sent in the init request)
    bk time response_time
    bk cmd init
    bk code 0
    bk data session_id session_uuid (as sent in the init request)
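
    Since the response echoes the stan and cmd of the request, a client can verify each reply before moving on. The following is a minimal sketch under that assumption; check_response is a hypothetical helper, not part of the API, and the sample values are taken from the init response sample.

```python
def check_response(request_payload, response_payload):
    """Return the response's data dict, or raise if the reply looks wrong."""
    sent = request_payload["bk"]
    received = response_payload["bk"]
    # code 0 means the request was accepted.
    if received["code"] != 0:
        raise RuntimeError("server returned code %s" % received["code"])
    # The response is expected to echo the cmd and stan of the request.
    if received["cmd"] != sent["cmd"] or received["stan"] != sent["stan"]:
        raise RuntimeError("response does not match the request")
    return received["data"]


init_request = {"bk": {
    "cmd": "init",
    "stan": "e43cdcf4-a194-45b2-9d14-cf8ab43a0992",
    "data": {"session_id": "20b8f625-1e78-4ebb-9d97-0a15e1290d23"},
}}
init_response = {"bk": {
    "cmd": "init",
    "code": 0,
    "stan": "e43cdcf4-a194-45b2-9d14-cf8ab43a0992",
    "data": {"session_id": "20b8f625-1e78-4ebb-9d97-0a15e1290d23"},
}}

session_id = check_response(init_request, init_response)["session_id"]
print(session_id)
```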

    Begin Stream

    After the init request is answered with code : 0, send a begin of stream (bos) request to signal the engine to get ready to process the audio stream.

    HEADER

    bos request sample

    init_headers = {
        'content-type': 'application/json',
        'Accept-Charset': 'UTF-8',
        'Authorization' : 'Bearer ' + access_token
    }
    
    curl -X POST 'https://api.bahasakita.co.id/speech' \
      -H "Content-type: application/json" \
      -H "Authorization: Bearer [access_token]" \
      -d '{"bk":{"cmd":"bos","entity":"my_entity","version":"1.0","time":current_time,"protocol":"stream","stan":"stan_uuid","type":0,"data":{"session_id":"session_id"}}}' \
      --cacert "api.bahasakita.co.id.pem"
    
    Key Value
    content-type application/json
    Accept-Charset UTF-8
    Authorization Bearer access_token

    PAYLOAD

    stan_uuid    = str(uuid.uuid4())
    
    payload_dict = {}
    payload_dict['bk'] = {}
    payload_dict['bk']['cmd'] = '%s' %('bos')
    payload_dict['bk']['entity'] = '%s' %(my_entity)
    payload_dict['bk']['stan'] = '%s' %(stan_uuid)
    payload_dict['bk']['type'] = 0
    payload_dict['bk']['time'] = int(time.time())
    payload_dict['bk']['protocol'] = '%s' %('stream')
    payload_dict['bk']['version'] = '%s' %('1.0')
    payload_dict['bk']['data'] = {}
    payload_dict['bk']['data']['session_id'] = '%s' %(session_id)
    
    Key Child Key Grand Child Key Value
    bk cmd bos
    bk entity my_entity
    bk stan stan_uuid
    bk time current_time
    bk protocol stream
    bk version 1.0
    bk data session_id session_id

    RESPONSE SAMPLE

    bos response sample

    {
        "bk": {
            "data": {
                "session_id": "9c5a4a8f-1d30-4208-954d-063648fc733a",
                "utterance_id": 1522739379417
            },
            "type": 1,
            "entity": "Mitra Bahasa Kita",
            "cmd": "bos",
            "protocol": "stream",
            "stan": "6a037cd2-6e24-4d30-90fc-e3ff47e5c048",
            "version": "1.0",
            "time": 1522739379406,
            "code": 0
        }
    }
    
    Key Child Key Grand Child Key Value
    bk protocol stream
    bk type 1
    bk entity my_entity
    bk version 1.0
    bk stan stan_uuid
    bk time response_time
    bk cmd bos
    bk code 0
    bk data session_id session_id
    bk data utterance_id utterance_id

    Streams

    Everything is now set to stream the audio chunks. Each audio chunk must follow the format specified in the Audio Format section. Note that transferring audio chunks is a synchronous process: before sending the next chunk, make sure the previous chunk has been processed and answered with code : 0. If you receive any code other than 0, resend bos and resume the audio chunk stream from the point where it failed.
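
    The synchronous rule can be sketched as a loop that never sends a chunk before the previous one is acknowledged. This is a sketch under assumptions: stream_chunks, send_audio and send_bos are hypothetical names, the transport is faked so the example runs offline, and "resume from the point where it failed" is interpreted as resending the rejected chunk after a new bos.

```python
def stream_chunks(chunks, send_audio, send_bos, max_retries=3):
    """Send chunks strictly one at a time, re-sending bos after a failure.

    send_audio(chunk) and send_bos() are hypothetical callables that perform
    the HTTP requests and return the decoded JSON response.
    """
    responses = []
    for chunk in chunks:
        for attempt in range(max_retries + 1):
            if attempt > 0:
                send_bos()  # non-zero code last time: announce the stream again
            response = send_audio(chunk)
            if response["bk"]["code"] == 0:
                responses.append(response)
                break  # acknowledged, so the next chunk may be sent
        else:
            raise RuntimeError("chunk rejected after %d attempts" % (max_retries + 1))
    return responses


# Offline demonstration with a fake transport that rejects chunk "b" once.
log = []
pending_failures = {"b": 1}


def fake_send_audio(chunk):
    log.append("audio:" + chunk)
    if pending_failures.get(chunk, 0) > 0:
        pending_failures[chunk] -= 1
        return {"bk": {"code": 1}}
    return {"bk": {"code": 0}}


def fake_send_bos():
    log.append("bos")
    return {"bk": {"code": 0}}


stream_chunks(["a", "b", "c"], fake_send_audio, fake_send_bos)
print(log)
```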

    Text is delivered word by word, not as whole sentences. Each piece of text comes with a type, either partial or final. Type partial means the words come directly from matching acoustic features to the most similar-sounding words, without yet considering sentence-level context. Type final, on the other hand, means the words come from language model decoding that takes the context into account. Partial words should then be replaced by the final words. The two types alternate at no fixed interval; the engine emits a final result whenever it has enough input to apply the language model.
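
    One way to apply the "final replaces partial" rule on the client side is sketched below. Transcript is a hypothetical class, and it assumes that a final item supersedes every partial word received since the previous final item; the exact correspondence is not specified here. The sample items are taken from the audio response samples.

```python
class Transcript:
    """Keep confirmed text plus the provisional words shown to the user."""

    def __init__(self):
        self.final_parts = []    # text confirmed by the language model
        self.partial_words = []  # provisional words since the last final

    def update(self, text_items):
        """Consume the bk.data.text list of one audio response."""
        for item in text_items:
            if item["type"] == "final":
                self.final_parts.append(item["value"])
                self.partial_words = []  # the final text replaces the partials
            else:
                self.partial_words.append(item["value"])

    def display(self):
        return " ".join(self.final_parts + self.partial_words)


transcript = Transcript()
transcript.update([{"value": "di", "type": "partial"}])
after_first = transcript.display()
transcript.update([
    {"type": "partial", "value": "jakarta"},
    {"type": "final", "value": "saya pertama kali kerja di Jakarta"},
])
print(transcript.display())
```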

    HEADER

    audio request sample

    init_headers = {
        'content-type': 'application/json',
        'Accept-Charset': 'UTF-8',
        'Authorization' : 'Bearer ' + access_token
    }
    
    curl -X POST 'https://api.bahasakita.co.id/speech' \
      -H "Content-type: application/json" \
      -H "Authorization: Bearer [access_token]" \
      -d '{"bk": {"cmd": "audio", "entity": "my_entity", "stan": "stan_uuid", "type": 0, "time": current_time, "protocol": "stream", "version": "1.0", "data": {"session_id": "session_id", "utterance_id": utterance_id, "offset": offset, "len": length, "audio": "audio"}}}' \
      --cacert "api.bahasakita.co.id.pem"
    
    Key Value
    content-type application/json
    Accept-Charset UTF-8
    Authorization Bearer access_token

    PAYLOAD

    Very first payload

    stan_uuid    = str(uuid.uuid4())
    
    payload_dict = {}
    payload_dict['bk'] = {}
    payload_dict['bk']['cmd'] = '%s' %('audio')
    payload_dict['bk']['entity'] = '%s' %(my_entity)
    payload_dict['bk']['stan'] = '%s' %(stan_uuid)
    payload_dict['bk']['type'] = 0
    payload_dict['bk']['time'] = int(time.time())
    payload_dict['bk']['protocol'] = '%s' %('stream')
    payload_dict['bk']['version'] = '%s' %('1.0')
    payload_dict['bk']['data'] = {}
    payload_dict['bk']['data']['session_id'] = '%s' %(session_id)
    payload_dict['bk']['data']['utterance_id'] = '%s' %(utterance_id)
    payload_dict['bk']['data']['offset'] = offset
    payload_dict['bk']['data']['len'] = length
    payload_dict['bk']['data']['audio'] = audio
    

    Second payload and after

    stan_uuid    = str(uuid.uuid4())
    
    payload_dict = {}
    payload_dict['bk'] = {}
    payload_dict['bk']['cmd'] = '%s' %('audio')
    payload_dict['bk']['entity'] = '%s' %(my_entity)
    payload_dict['bk']['stan'] = '%s' %(stan_uuid)
    payload_dict['bk']['type'] = 0
    payload_dict['bk']['time'] = int(time.time())
    payload_dict['bk']['protocol'] = '%s' %('stream')
    payload_dict['bk']['version'] = '%s' %('1.0')
    payload_dict['bk']['data'] = {}
    payload_dict['bk']['data']['session_id'] = '%s' %(session_id)
    # When the previous response reports eos, continue with next_utterance_id.
    if previous_response['bk']['data']['eos']:
        payload_dict['bk']['data']['utterance_id'] = '%s' %(previous_response['bk']['data']['next_utterance_id'])
    else:
        payload_dict['bk']['data']['utterance_id'] = '%s' %(previous_response['bk']['data']['utterance_id'])
    payload_dict['bk']['data']['offset'] = offset
    payload_dict['bk']['data']['len'] = length
    payload_dict['bk']['data']['audio'] = audio
    
    
    curl -X POST 'https://api.bahasakita.co.id/speech' \
      -H "Content-type: application/json" \
      -H "Authorization: Bearer [access_token]" \
      -d '{"bk": {"cmd": "audio", "entity": "my_entity", "stan": "stan_uuid", "type": 0, "time": current_time, "protocol": "stream", "version": "1.0", "data": {"session_id": "session_id", "utterance_id": next_utterance_id, "offset": offset, "len": length, "audio": "audio"}}}' \
      --cacert "api.bahasakita.co.id.pem"
    
    Key Child Key Grand Child Key Value
    bk cmd audio
    bk entity my_entity
    bk stan stan_uuid
    bk time current_time
    bk protocol stream
    bk version 1.0
    bk data session_id session_id
    bk data offset offset
    bk data len length
    bk data audio audio
    bk data utterance_id utterance_id

    RESPONSE SAMPLE

    audio response sample

    {
        "bk": {
            "type": 1,
            "code": 0,
            "time": 1522742147997,
            "version": "1.0",
            "stan": "6a7b8a16-3714-11e8-b41e-40b0344888da",
            "entity": "Mitra Bahasa Kita",
            "protocol": "stream",
            "cmd": "audio",
            "data": {
                "session_id": "53b31a9d-9a28-40e9-9115-5b9dd96c1dec",
                "offset": 236800,
                "eos": false,
                "text": [{
                    "value": "di",
                    "type": "partial"
                }],
                "len": 3200,
                "caller": "StreamAudio_dprocess",
                "utterance_id": 1522742142116
            }
        }
    }
    

    audio response sample with next_utterance_id

    {
        "bk": {
            "type": 1,
            "code": 0,
            "time": 1522742147997,
            "version": "1.0",
            "stan": "6a7b8a16-3714-11e8-b41e-40b0344888da",
            "entity": "Mitra Bahasa Kita",
            "protocol": "stream",
            "cmd": "audio",
            "data": {
                "session_id": "53b31a9d-9a28-40e9-9115-5b9dd96c1dec",
                "offset": 240000,
                "eos": true,
                "text": [{
                  "type":"partial",
                  "value":"jakarta"
                },{
                  "type":"final",
                  "value":"saya pertama kali kerja di Jakarta"
                }],
                "len": 3200,
                "caller": "StreamAudio_dprocess",
                "utterance_id": 1522742142116,
                "next_utterance_id": 1522742142759
            }
        }
    }
    
    Key Child Key Grand Child Key Sub Grand Child Key Value
    bk protocol stream
    bk type 1
    bk entity my_entity
    bk version 1.0
    bk stan stan_uuid
    bk time response_time
    bk cmd audio
    bk code 0
    bk data session_id session_uuid
    bk data offset offset
    bk data eos Boolean
    bk data len length
    bk data caller caller
    bk data utterance_id utterance_id
    bk data text value value
    bk data text type type
    bk data next_utterance_id next_utterance_id

    End Stream

    When the audio has been completely transferred, you have to end the stream so that the engine no longer maintains your state. Whether or not you signal eos, once your token expires the state is wiped and you need to start over from requesting a token.
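
    Because an expired token wipes the session state, a client may want to track the expiry time itself. The following is a minimal sketch: TokenState is a hypothetical helper, it assumes expires_in is a number of seconds counted from the moment the token response is received, and the 30-second safety margin is an arbitrary choice.

```python
import time


class TokenState:
    """Remember when an access_token stops being usable."""

    def __init__(self, token_response, now=None):
        issued_at = time.time() if now is None else now
        self.access_token = token_response["access_token"]
        # Assumption: expires_in is in seconds from the time of issue.
        self.expires_at = issued_at + token_response["expires_in"]

    def is_expired(self, now=None, margin=30):
        """True once the token is within `margin` seconds of expiry."""
        current = time.time() if now is None else now
        return current >= self.expires_at - margin


state = TokenState({"access_token": "abc", "expires_in": 3600}, now=1000)
print(state.is_expired(now=1000), state.is_expired(now=4600))
```

    When is_expired returns True, request a new token and run init again before streaming further audio.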

    HEADER

    eos request

    init_headers = {
        'content-type': 'application/json',
        'Accept-Charset': 'UTF-8',
        'Authorization' : 'Bearer ' + access_token
    }
    
    curl -X POST 'https://api.bahasakita.co.id/speech' \
      -H "Content-type: application/json" \
      -H "Authorization: Bearer [access_token]" \
      -d '{"bk": {"type": 0, "data": {"session_id": "session_id", "utterance_id": utterance_id}, "stan": "stan_uuid", "time": current_time, "cmd": "eos", "protocol": "stream", "entity": "my_entity", "version": "1.0"}}' \
      --cacert "api.bahasakita.co.id.pem"
    
    Key Value
    content-type application/json
    Accept-Charset UTF-8
    Authorization Bearer access_token

    PAYLOAD

    stan_uuid    = str(uuid.uuid4())
    
    payload_dict = {}
    payload_dict['bk'] = {}
    payload_dict['bk']['cmd'] = '%s' %('eos')
    payload_dict['bk']['entity'] = '%s' %(my_entity)
    payload_dict['bk']['stan'] = '%s' %(stan_uuid)
    payload_dict['bk']['type'] = 0
    payload_dict['bk']['time'] = int(time.time())
    payload_dict['bk']['protocol'] = '%s' %('stream')
    payload_dict['bk']['version'] = '%s' %('1.0')
    payload_dict['bk']['data'] = {}
    payload_dict['bk']['data']['session_id'] = '%s' %(session_id)
    
    Key Child Key Grand Child Key Value
    bk cmd eos
    bk entity my_entity
    bk stan stan_uuid
    bk time current_time
    bk protocol stream
    bk version 1.0
    bk type 0
    bk data session_id session_id
    bk data utterance_id utterance_id

    RESPONSE SAMPLE

    eos response sample

    {
        "bk": {
            "data": {
                "user_id":"bk_speech",
                "session_id":"12e334dc-1995-4f92-893f-8a33c5323e51",
                "socket_id":"8922e118-4840-11e8-9524-0cc47ab02a34",
                "utterance_id":1524630253370,
                "text":[
                    {
                        "type":"final",
                        "value":"Effendy"
                    },
                    {
                        "type":"final",
                        "value":"."
                    }
                ]
            },
            "type": 1,
            "entity": "Mitra Bahasa Kita",
            "cmd": "eos",
            "protocol": "stream",
            "stan": "6a037cd2-6e24-4d30-90fc-e3ff47e5c048",
            "version": "1.0",
            "time": 1522739379406,
            "code": 0
        }
    }
    
    Key Child Key Grand Child Key Sub Grand Child Key Value
    bk protocol stream
    bk type 1
    bk entity my_entity
    bk version 1.0
    bk stan stan_uuid
    bk time response_time
    bk cmd eos
    bk code 0
    bk data session_id session_uuid
    bk data socket_id socket_id
    bk data user_id user_id
    bk data utterance_id utterance_id
    bk data text value value
    bk data text type type

    Audio Format

    Accepted Format

    We require audio streamed through the API to be WAV formatted with a 16 kHz sampling rate, 1 channel, and 16-bit precision (signed integer PCM). We do not provide a resampler, so streamed audio is assumed to be compliant and is processed as is. If the audio does not meet this format, the resulting transcription will be meaningless. To convert your audio, use tools such as sox, ffmpeg, or a Python library. However, we do not recommend upsampling from a lower sampling rate, interpolating from a lower bit precision, or converting from a lossy format (e.g. mp3, ogg) to WAV, since data has already been lost in these cases.
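
    Before streaming, the format can be checked locally with Python's standard wave module. This is a minimal sketch; check_wav_format and make_wav are hypothetical helper names, and the in-memory files exist only to make the example self-contained. Note that the wave module only opens PCM-encoded WAV files.

```python
import io
import wave


def check_wav_format(wav_file):
    """Return a list of problems; an empty list means the format is accepted."""
    problems = []
    with wave.open(wav_file, "rb") as reader:
        if reader.getframerate() != 16000:
            problems.append("sampling rate is %d Hz, expected 16000" % reader.getframerate())
        if reader.getnchannels() != 1:
            problems.append("%d channels, expected 1" % reader.getnchannels())
        if reader.getsampwidth() != 2:  # 2 bytes per sample = 16 bit
            problems.append("%d-bit samples, expected 16" % (8 * reader.getsampwidth()))
    return problems


def make_wav(rate, channels, sample_width):
    """Build an empty in-memory WAV file, for demonstration only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as writer:
        writer.setnchannels(channels)
        writer.setsampwidth(sample_width)
        writer.setframerate(rate)
        writer.writeframes(b"")
    buffer.seek(0)
    return buffer


print(check_wav_format(make_wav(16000, 1, 2)))
print(check_wav_format(make_wav(8000, 2, 2)))
```

    check_wav_format also accepts a file path, for example check_wav_format("recording.wav").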

    Chunking and Conversion

    A WAV file consists of two main parts: the header and the data. The header spans 44 bytes and the rest is the audio data itself. The header is not needed, given the format assumed above, so skip bytes 0 to 43 and process only the data.
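
    Skipping the header amounts to a single slice. The sketch below builds a tiny WAV file in memory to show it; audio_data is a hypothetical helper name. Be aware that files carrying extra metadata chunks can have a header longer than 44 bytes, in which case reading the frames with the wave module, as in the last lines, is the more robust option.

```python
import io
import wave

WAV_HEADER_SIZE = 44  # header length stated above


def audio_data(wav_bytes):
    """Drop bytes 0 to 43 and return only the audio data."""
    return wav_bytes[WAV_HEADER_SIZE:]


# Build a four-sample, 16 kHz, mono, 16-bit WAV file in memory.
frames = b"\x01\x00\x02\x00\x03\x00\x04\x00"
buffer = io.BytesIO()
with wave.open(buffer, "wb") as writer:
    writer.setnchannels(1)
    writer.setsampwidth(2)
    writer.setframerate(16000)
    writer.writeframes(frames)
wav_bytes = buffer.getvalue()

print(audio_data(wav_bytes) == frames)

# Alternative that does not depend on the header length:
with wave.open(io.BytesIO(wav_bytes), "rb") as reader:
    frames_via_wave = reader.readframes(reader.getnframes())
```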

    Chunking

    When streaming audio data, do not send it all at once; break it into smaller chunks. Each chunk is at most W bytes. W is either 3200 or 6400 and must stay the same for the whole process. Based on best practice, we recommend setting W to 6400. Only the last chunk may end up smaller than W bytes; you can either leave it as it is or append zero padding to fill it to W bytes. Each chunk is then converted to a base64 string before being inserted into the request body.
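
    The chunking step can be sketched as follows. iter_chunks is a hypothetical helper, and the meaning of offset and len is an assumption inferred from the response samples (offset 236800 with len 3200 is followed by offset 240000): offset is taken to be the byte position of the chunk within the audio data and len its size in bytes.

```python
import base64

CHUNK_SIZE = 6400  # W; 3200 is the other accepted value


def iter_chunks(data, chunk_size=CHUNK_SIZE, pad_last=False):
    """Yield the offset, len and base64 audio fields for each chunk."""
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        if pad_last and len(chunk) < chunk_size:
            # Optional zero padding so the last chunk is also W bytes long.
            chunk = chunk + b"\x00" * (chunk_size - len(chunk))
        yield {
            "offset": offset,
            "len": len(chunk),
            "audio": base64.b64encode(chunk).decode("ascii"),
        }


data = bytes(15000)  # stand-in for header-less audio data
chunks = list(iter_chunks(data))
print([(c["offset"], c["len"]) for c in chunks])
```

    Each yielded dict can be merged into payload_dict['bk']['data'] of an audio request.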