ICASSP 2021 will be held from June 6 to 11, 2021 in Toronto, Canada. With solid accumulation and cutting-edge innovation in the field of speech technology, Jingdong Technology Group has had three papers accepted to ICASSP 2021.
ICASSP (International Conference on Acoustics, Speech and Signal Processing) is hosted by the IEEE and is the world's largest and most comprehensive top academic conference on signal processing and its applications. The accepted papers from Jingdong Technology Group showcase, on the international stage, its strength in speech enhancement, speech synthesis, and multi-turn dialogue.
01. Neural Kalman Filtering for Speech Enhancement
Research on a speech enhancement algorithm based on a neural Kalman filter

* Paper link: https://arxiv.org/abs/2007.13962
Because of complex environmental noise, speech enhancement plays an important role in human-computer speech interaction systems. Speech enhancement algorithms based on statistical machine learning usually build the enhancement system out of common, general-purpose machine learning modules (such as fully connected networks, recurrent neural networks, and convolutional neural networks). However, how to effectively bring the expert-knowledge-based optimal filter design theory of traditional speech signal processing into machine-learning-based speech enhancement systems remains an open problem.
Jingdong Technology Group's accepted paper "Neural Kalman Filtering for Speech Enhancement" proposes a speech enhancement framework based on a neural Kalman filter, combining neural networks with optimal filtering theory and learning the optimal Kalman filter weights through supervised learning.

The researchers first constructed a model of the temporal evolution of speech based on a recurrent neural network. Compared with the traditional Kalman filter, this model removes the unrealistic assumption that speech evolution follows a linear prediction model, so it can capture the nonlinear dynamics of real speech. On the one hand, using this temporal model and the Kalman hidden state vector, the algorithm first produces a long-term envelope prediction of the speech. On the other hand, by fusing the observation at the current moment, the system further computes a speech spectrum estimate with the Wiener filter of traditional signal processing. The final output of the system is a linear combination of the long-term envelope prediction and the Wiener filter estimate. Following classical Kalman filter theory, the system directly derives the optimal solution for the combination weights; because the system is designed end to end, the speech temporal-dynamics network, the noise estimation network associated with the Wiener filter, and the weighting network can all be updated jointly. Experiments on the Librispeech speech corpus together with the PNL-100Nonspeech-Sounds and MUSAN noise sets show that the proposed algorithm outperforms speech enhancement baselines built on the conventional UNET and CRNN frameworks in terms of SNR gain, perceptual speech quality (PESQ), and speech intelligibility (STOI).
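To make the structure of this pipeline concrete, here is a minimal PyTorch sketch of the idea as described above: a recurrent model of speech dynamics produces an envelope prediction, a Wiener-style update incorporates the current noisy observation, and a learned gain combines the two. All module and variable names (`NeuralKalmanFilter`, `gain_head`, the loss, dimensions) are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn

class NeuralKalmanFilter(nn.Module):
    def __init__(self, n_freq=257, hidden=256):
        super().__init__()
        # Recurrent model of speech temporal dynamics (replaces the linear
        # prediction assumption of the classical Kalman filter).
        self.dynamics = nn.GRU(n_freq, hidden, batch_first=True)
        self.predict_head = nn.Linear(hidden, n_freq)  # long-term envelope prediction
        self.noise_head = nn.Linear(hidden, n_freq)    # noise estimate for the Wiener update
        self.gain_head = nn.Linear(hidden, n_freq)     # combination weight (Kalman-gain analogue)

    def forward(self, noisy_mag):
        # noisy_mag: (batch, time, n_freq) magnitude spectrogram of noisy speech
        state, _ = self.dynamics(noisy_mag)
        prior = torch.relu(self.predict_head(state))    # prediction from the temporal model
        noise = torch.relu(self.noise_head(state))
        # Wiener-style update using the current observation
        wiener_gain = prior ** 2 / (prior ** 2 + noise ** 2 + 1e-8)
        posterior_obs = wiener_gain * noisy_mag
        # Learned linear combination of prediction and observation update
        k = torch.sigmoid(self.gain_head(state))
        return (1 - k) * prior + k * posterior_obs

# Trained end to end with a spectral reconstruction loss against clean speech.
model = NeuralKalmanFilter()
enhanced = model(torch.rand(2, 100, 257))
```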
02. Improving Prosody Modelling with Cross-Utterance BERT Embeddings for End-to-End Speech Synthesis
Prosody modeling for end-to-end speech synthesis based on cross-sentence information

* Paper link: https://www.zhuanzhi.ai/paper/92135c7f518e7cda63f7fcb4b940a4c1
Although end-to-end speech synthesis can already produce relatively natural speech with fairly rich prosody, it does not make use of the textual structure of the surrounding context; it relies only on the linguistic features of the current sentence. Prosody is usually strongly correlated with the textual structure of the context, and the same sentence will have completely different prosodic realizations in different contexts. An end-to-end system that synthesizes speech from the current sentence's text features alone therefore finds it hard to turn a passage of text into natural speech whose prosody is rich and appropriate to the context.
Jingdong Technology Group's accepted paper "Improving Prosody Modelling with Cross-Utterance BERT Embeddings for End-to-End Speech Synthesis" uses the currently mainstream BERT model to extract cross-sentence feature vectors for the text to be synthesized, and then uses these context vectors to improve the prosody of the end-to-end speech synthesis model.

The researchers did not use any explicit prosody control information. Instead, the BERT language model extracts cross-sentence feature representations of the context sentences around the sentence to be synthesized, and these representations serve as an additional input to a mainstream end-to-end speech synthesis model. The paper explores two different ways of using the cross-sentence features. The first concatenates the cross-sentence features of all context sentences into a single vector that is fed to the end-to-end speech synthesis system as a whole. The second treats the cross-sentence features of the context sentences as a sequence: each speech unit of the text to be synthesized attends over this sequence, and the resulting attention weights produce a weighted combination of the context features, giving each speech unit its own cross-sentence feature. With this second approach, each pronunciation unit obtains a fine-grained cross-sentence feature that is helpful for that unit's pronunciation, as sketched below.
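The following is a minimal sketch of that second, attention-based use of cross-utterance features: each phone-level unit of the current sentence attends over one BERT embedding per context sentence. Dimensions, class names, and the residual combination are assumptions for illustration, not details from the paper.

```python
import torch
import torch.nn as nn

class CrossUtteranceAttention(nn.Module):
    def __init__(self, unit_dim=512, bert_dim=768, attn_dim=128):
        super().__init__()
        self.query = nn.Linear(unit_dim, attn_dim)  # from phone-level TTS encoder outputs
        self.key = nn.Linear(bert_dim, attn_dim)    # from cross-utterance BERT embeddings
        self.value = nn.Linear(bert_dim, unit_dim)

    def forward(self, units, context_emb):
        # units:       (batch, n_units, unit_dim)   encoder outputs for the sentence to synthesize
        # context_emb: (batch, n_context, bert_dim) one BERT embedding per context sentence
        q = self.query(units)
        k = self.key(context_emb)
        scores = q @ k.transpose(1, 2) / q.size(-1) ** 0.5
        weights = torch.softmax(scores, dim=-1)          # each unit attends over context sentences
        ctx = weights @ self.value(context_emb)          # per-unit cross-sentence feature
        return units + ctx                               # fed onward to the TTS decoder

attn = CrossUtteranceAttention()
out = attn(torch.rand(2, 40, 512), torch.rand(2, 5, 768))
```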
Experimental results show that incorporating cross-sentence features into the end-to-end speech synthesis system effectively improves the naturalness and expressiveness of synthesized paragraph-level text. The results were verified on both Chinese and English audiobook datasets. Moreover, in listening comparison tests, most listeners preferred the audio synthesized by the proposed algorithm with cross-sentence vector representations over the end-to-end baseline model.
03. Conversational Query Rewriting with Self-supervised Learning
Conversational query rewriting based on self-supervised learning

* Paper link: https://github.com/note-lh/paper/blob/main/Conversational_Query_Rewriting_with_Self-supervised_Learning.pdf
In multi-turn dialogue systems, users tend to use short, colloquial expressions, with frequent ellipsis and coreference. These phenomena make it difficult for a dialogue agent to understand the user's real intent and greatly increase the difficulty of generating a response. To improve the dialogue system, query rewriting completes the user's utterance based on the dialogue history, recovering all omitted and referred-to information. However, existing query rewriting techniques all adopt supervised learning, so model performance is severely limited by the scale of annotated data, which is a major obstacle to deploying the technique in real business scenarios. In addition, whether the user's intent changes after rewriting has not been addressed by existing work, and how to guarantee intent consistency after rewriting remains an open problem.

Jingdong Technology Group's accepted paper "Conversational Query Rewriting with Self-supervised Learning" proposes a self-supervised query rewriting method. When words co-occur between the user's question and the dialogue history, the method deletes the co-occurring words with a certain probability or replaces them with pronouns; the query rewriting model then restores the user's original question from the dialogue history, as in the sketch below. Compared with supervised learning, self-supervised learning can obtain large amounts of training data at low cost and fully exploits the representation learning ability of the model.
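Here is a small illustrative sketch of that self-supervised pair construction: words that co-occur with the dialogue history are dropped or replaced by pronouns, and the corrupted query becomes the model input while the original query is the target. The pronoun list and probabilities are assumptions, not values from the paper.

```python
import random

PRONOUNS = ["it", "that", "this", "they"]

def corrupt_query(context_tokens, query_tokens, p_drop=0.3, p_pron=0.3):
    """Build a self-supervised (source, target) pair for query rewriting."""
    context_vocab = set(context_tokens)
    corrupted = []
    for tok in query_tokens:
        if tok in context_vocab:
            r = random.random()
            if r < p_drop:
                continue                                    # omit the co-occurring word
            if r < p_drop + p_pron:
                corrupted.append(random.choice(PRONOUNS))   # replace it with a pronoun
                continue
        corrupted.append(tok)
    return corrupted

context = "how long is the battery life of the x20 phone".split()
query = "is the x20 phone waterproof".split()
source = corrupt_query(context, query)   # e.g. ["is", "it", "waterproof"]
target = query                           # the rewriting model learns source -> target
print(source, "->", target)
```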

Jingdong researchers further proposed an improved model, Teresa, which raises the quality and accuracy of the rewriting model in two ways. First, a keyword detection module is introduced at the Transformer encoding layer to extract keywords that guide sentence generation. A graph is first built from the self-attention over the encoded dialogue history (context), capturing the relevance between words in the history; the TextRank algorithm then computes an importance score for each word; finally, the word importance scores are fed into the decoder as prior information, guiding the model to generate questions that contain more of the key information. Second, an intent consistency module is proposed. A special [CLS] token is added to the Transformer encoder's input text to obtain an intent distribution for the text, and intent consistency is maintained by constraining this distribution. The original dialogue (context, query) and the generated sentence (target) share the Transformer encoder, yielding intent distributions before and after rewriting, and the two distributions are kept close so that the generated sentence preserves the user's intent. A rough sketch of both components follows.
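The sketch below illustrates, under stated assumptions, the two Teresa components described above: PageRank-style iteration over an attention-derived word graph for keyword importance, and a KL-based constraint between the [CLS] intent distributions before and after rewriting. The damping factor, the choice of KL divergence, and all names are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def textrank_scores(attn, damping=0.85, n_iter=30):
    # attn: (n_words, n_words) self-attention weights over the dialogue history,
    # used as edge weights of a word graph.
    adj = attn / (attn.sum(dim=0, keepdim=True) + 1e-8)   # column-normalize outgoing weights
    n = adj.size(0)
    scores = torch.full((n,), 1.0 / n)
    for _ in range(n_iter):
        scores = (1 - damping) / n + damping * adj @ scores
    return scores  # word importance, passed to the decoder as prior information

def intent_consistency_loss(cls_context, cls_generated):
    # cls_*: (batch, n_intents) logits taken from the shared encoder's [CLS] position
    p = F.log_softmax(cls_generated, dim=-1)
    q = F.softmax(cls_context, dim=-1)
    return F.kl_div(p, q, reduction="batchmean")  # keep the rewritten intent close to the original

scores = textrank_scores(torch.rand(6, 6))
loss = intent_consistency_loss(torch.rand(2, 10), torch.rand(2, 10))
```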
Jingdong Technology Group is Jingdong's core division for providing technical services externally. It has long been committed to cutting-edge research and exploration, leading with technology to help cities and industries achieve digital upgrading. To date, Jingdong Technology Group has published nearly 350 papers at top AI conferences such as AAAI, IJCAI, CVPR, KDD, NeurIPS, ICML, ACL, and ICASSP, and has won 19 first places in international academic competitions. Going forward, Jingdong Technology Group will continue to focus on speech and semantics, computer vision, machine learning, and other fields, using technology to support the real economy and change everyone's life.