
Industrial Practice Drives Scientific and Technological Innovation: Three Papers from JD Technology Group Accepted at ICASSP 2021

2021-08-31 16:05:02 Heart of machine

ICASSP 2021 will be held June 6-11, 2021 in Toronto, Canada. On the strength of its solid accumulation and frontier innovation in speech technology, JD Technology Group has had three papers accepted at ICASSP 2021.

ICASSP, the International Conference on Acoustics, Speech and Signal Processing, is organized by the IEEE and is the world's largest and most comprehensive top academic conference on signal processing and its applications. The accepted papers demonstrate JD Technology Group's strength in speech enhancement, speech synthesis, and multi-turn dialogue on the international stage.

01. Neural Kalman Filtering for Speech Enhancement

A speech enhancement algorithm based on a neural Kalman filter

* Paper link:

Because of complex environmental noise, speech enhancement plays an important role in human-computer speech interaction systems. Speech enhancement algorithms based on statistical machine learning usually build the enhancement system from common machine-learning modules (such as fully connected networks, recurrent neural networks, and convolutional neural networks). However, how to effectively bring the expert-knowledge-based optimal filter design theory of traditional speech signal processing into machine-learning-based speech enhancement remains an open problem.

JD Technology Group's accepted paper "Neural Kalman Filtering for Speech Enhancement" proposes a speech enhancement framework based on a neural Kalman filter, combining neural networks with optimal filtering theory and obtaining the optimal Kalman filter weights through supervised learning.

The researchers first constructed a model of the temporal evolution of speech based on a recurrent neural network. Compared with the traditional Kalman filter, this model removes the unrealistic assumption that speech evolution follows a linear prediction model, so it can capture the nonlinear dynamics of real speech. On one hand, from the temporal model and the Kalman hidden state vector, the algorithm obtains a long-term envelope prediction of the speech. On the other hand, by fusing in the observation at the current moment, the system computes a spectral estimate of the speech based on the Wiener filter of traditional signal processing. The final output of the system is a linear combination of the long-term envelope prediction and the Wiener-filter estimate. Following classical Kalman filter theory, the system directly obtains the optimal weights of this linear combination; because the system is designed end to end, the speech temporal network and the noise-related Wiener weight-estimation network can be updated jointly. Experiments on the Librispeech speech corpus with the PNL-100Nonspeech-Sounds and MUSAN noise sets show that the proposed algorithm outperforms traditional UNET- and CRNN-based speech enhancement algorithms on SNR gain, perceptual speech quality (PESQ), and speech intelligibility (STOI).
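The fusion step described above can be illustrated with a minimal sketch. This is not the paper's code: the function name, shapes, and the stand-in for the RNN's envelope prediction are all assumptions, but the Kalman-style blending of a model prior with a Wiener-filtered observation follows the classical theory the paper builds on.

```python
import numpy as np

def neural_kalman_step(prior_mag, prior_var, noisy_mag, noise_var):
    """One per-frequency-bin fusion step (illustrative, not the paper's code).

    prior_mag / prior_var: predicted clean-speech magnitude and its
        uncertainty; in the paper this role is played by the RNN's
        long-term envelope prediction.
    noisy_mag / noise_var: observed noisy magnitude and estimated
        noise power at the current frame.
    """
    # Wiener-style estimate of the clean magnitude from the observation
    speech_var = np.maximum(noisy_mag ** 2 - noise_var, 1e-8)
    wiener_gain = speech_var / (speech_var + noise_var)
    wiener_est = wiener_gain * noisy_mag
    # Kalman-style optimal linear-combination weight (the "gain")
    gain = prior_var / (prior_var + noise_var)
    # Posterior: blend the model prior with the observation-based estimate
    posterior = (1.0 - gain) * prior_mag + gain * wiener_est
    posterior_var = (1.0 - gain) * prior_var
    return posterior, posterior_var
```

In the paper the combination weight is produced by trained networks rather than this closed-form variance ratio; the sketch only shows the structure of the linear combination.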

02. Improving Prosody Modelling with Cross-Utterance BERT Embeddings for End-to-End Speech Synthesis

Prosody modeling for end-to-end speech synthesis based on cross-utterance information

* Paper link:

Although end-to-end speech synthesis can already produce fairly natural speech with relatively rich prosody, it does not use textual structure information: it relies only on the linguistic features of the current sentence. Prosody, however, is strongly correlated with the textual structure of the context, and the same sentence in different contexts can have completely different prosodic realizations. An end-to-end system that uses only the current sentence's text features therefore struggles to turn a passage of text into speech whose prosody is natural and expressive in context.

JD Technology Group's accepted paper "Improving Prosody Modelling with Cross-Utterance BERT Embeddings for End-to-End Speech Synthesis" uses the now-mainstream BERT model to extract cross-utterance feature vectors for the text to be synthesized, and then uses these context vectors to improve the prosody of the end-to-end speech synthesis model.

Figure 2: Schematic diagram of the model structure

The researchers did not use any explicit prosody control information. Instead, a BERT language model extracts cross-utterance feature representations of the context sentences around the text to be synthesized, and these representations serve as additional input to a mainstream end-to-end speech synthesis system. The paper explores two ways of using the cross-utterance features. The first concatenates the cross-utterance features of all context sentences into a single input to the end-to-end synthesis system. The second treats the cross-utterance features of the context sentences as a sequence: each speech unit of the text to be synthesized performs attention over this sequence, and the computed attention weights produce a weighted combination of the context features, giving each speech unit its own cross-utterance feature. With the second approach, every pronunciation unit obtains a fine-grained cross-utterance feature that is helpful for the pronunciation of that unit.
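The second, attention-based variant described above can be sketched as follows. This is an illustration under assumed names and shapes, not the paper's implementation: each synthesis unit's query attends over one BERT embedding per context sentence and receives its own weighted cross-utterance feature.

```python
import numpy as np

def cross_utterance_features(unit_queries, context_embs):
    """Attention over per-sentence context embeddings (illustrative sketch).

    unit_queries: (U, d) one query vector per speech/pronunciation unit
    context_embs: (S, d) one BERT embedding per context sentence
    returns:      (U, d) a cross-utterance feature for each unit
    """
    d = unit_queries.shape[-1]
    scores = unit_queries @ context_embs.T / np.sqrt(d)   # (U, S)
    scores -= scores.max(axis=-1, keepdims=True)          # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)              # softmax over sentences
    return attn @ context_embs                            # weighted combination
```

Because the output is a convex combination of context-sentence embeddings, a unit whose query aligns with one sentence effectively pulls its prosodic conditioning from that sentence.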

Experimental results show that combining cross-utterance features in an end-to-end speech synthesis system effectively improves the naturalness and expressiveness of synthesized paragraph-level text. The results were verified on both Chinese and English audiobook datasets. Moreover, in comparison tests, most listeners preferred the audio synthesized by the algorithm with cross-utterance representations over the end-to-end baseline model.

03. Conversational Query Rewriting with Self-supervised Learning

Conversational query rewriting based on self-supervised learning

* Paper link:

In a multi-turn dialogue system, users tend to use short, colloquial expressions, with frequent ellipsis and coreference. These phenomena make it difficult for a dialogue agent to understand the user's real intent and greatly increase the difficulty of generating a response. To address this, query rewriting completes the user's utterance based on the dialogue history, recovering all omitted and referred-to information. However, existing query rewriting techniques all rely on supervised learning, so model quality is severely limited by the scale of annotated data, a major obstacle to deploying the technology in real business scenarios. In addition, whether the user's intent changes after rewriting has not been addressed by existing work, and ensuring intent consistency after rewriting remains an open problem. JD Technology Group's accepted paper "Conversational Query Rewriting with Self-supervised Learning" proposes a self-supervised query rewriting method: when words co-occur in the user query and the dialogue history, the co-occurring words are deleted with a certain probability or replaced by pronouns, and the query rewriting model then restores the user's original utterance from the history. Compared with supervised learning, self-supervised learning obtains large amounts of training data at low cost and fully exploits the model's representation-learning ability.
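The self-supervised corruption step described above can be sketched as follows. The function name, probabilities, and pronoun set are assumptions for illustration, not the paper's settings: words that co-occur with the dialogue history are randomly deleted or replaced by a pronoun, yielding (corrupted query, original query) training pairs without manual annotation.

```python
import random

PRONOUNS = ("it", "that", "this")  # illustrative placeholder set

def make_self_supervised_pair(context_tokens, query_tokens,
                              p_drop=0.3, p_pronoun=0.3, seed=0):
    """Build one self-supervised training pair (illustrative sketch).

    Returns (corrupted query, original query): the model learns to
    restore the original query from the corrupted one plus the history.
    """
    rng = random.Random(seed)
    shared = set(context_tokens) & set(query_tokens)  # co-occurring words
    corrupted = []
    for tok in query_tokens:
        if tok in shared:
            r = rng.random()
            if r < p_drop:
                continue                                 # delete the word
            if r < p_drop + p_pronoun:
                corrupted.append(rng.choice(PRONOUNS))   # replace by pronoun
                continue
        corrupted.append(tok)
    return corrupted, list(query_tokens)
```

For example, with history "book a flight to paris" and query "cancel the flight to paris", the corrupted query might become "cancel it", mimicking the ellipsis and coreference seen in real user utterances.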

The JD researchers further proposed an improved model, Teresa, which improves rewriting quality and accuracy in two ways. The first is a keyword detection module introduced at the Transformer encoding layer, which extracts keywords to guide sentence generation. An attention graph is first constructed from the encoded output of the dialogue history (context), capturing the relevance between words in the history; the TextRank algorithm then computes importance scores over this graph; finally, the word importance scores are fed into the decoder as prior information, guiding the model to generate questions that retain the key information. The second is an intent consistency module: a special [CLS] token is added to the Transformer encoder's input text to obtain an intent distribution over the text content, and intent consistency is enforced by constraining this distribution. The original dialogue (context, query) and the generated sentence (target) share the Transformer encoder, yielding intent distributions before and after rewriting; keeping the two distributions close ensures that the generated sentence preserves the user's intent.

JD Technology Group is JD's core division for providing technical services externally. It has long been committed to cutting-edge research and exploration, leading with technology to help cities and industries achieve digital upgrades. To date, JD Technology Group has published nearly 350 papers at top AI venues such as AAAI, IJCAI, CVPR, KDD, NeurIPS, ICML, ACL, and ICASSP, and has won 19 first places in international academic competitions. Going forward, JD Technology Group will continue to focus on speech and semantics, computer vision, machine learning, and other fields, using technology to support the real economy and change everyone's life.
