Introduction
Xiaomingbot is an automated system that generates and reads news articles. It employs text generation algorithms to produce news from structured data and machine translation algorithms to render the text in multiple languages. The system drives a visual avatar that reads the news aloud, with facial expressions and lip motion synchronized with the automatically synthesized voice.
News Generation
Xiaomingbot takes a table as input, for example, a table recording the score of each player in a soccer game. To generate news, Xiaomingbot uses template-based table-to-text methods: it maintains many different sentence templates for the news. Each time Xiaomingbot generates a sentence, it randomly picks a template and fills the placeholders with the corresponding content from the input table. To produce the news summary, Xiaomingbot uses a BERT-based model to score the sentences in the generated news and selects the highest-scoring ones to form the final summary.
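A minimal sketch of the template-based generation step is shown below. The templates, field names, and sample table are illustrative assumptions, not Xiaomingbot's actual template bank:

import random

# Hypothetical sentence templates for a soccer report; the wording and
# field names are illustrative, not Xiaomingbot's actual template bank.
TEMPLATES = [
    "{player} scored {goals} goal(s) for {team}.",
    "With {goals} goal(s), {player} led the attack for {team}.",
    "{team}'s {player} found the net {goals} time(s).",
]

def generate_sentence(row):
    """Randomly pick a template and fill its placeholders from one row."""
    return random.choice(TEMPLATES).format(**row)

# A toy input table: one row per player.
table = [
    {"player": "Li Wei", "team": "Team A", "goals": 2},
    {"player": "Zhang Rui", "team": "Team B", "goals": 1},
]

news = " ".join(generate_sentence(row) for row in table)
print(news)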
News Translation
Xiaomingbot uses a Transformer-based machine translation model to translate the generated Chinese news into different languages, so that people around the world can read the news conveniently. Although the model already produces fluent translations, certain proper nouns remain hard to translate because they can confuse the translation model. Therefore, Xiaomingbot also uses a named-entity replacement mechanism: each named entity is directly replaced with its corresponding translation. To accelerate decoding, we implemented a faster CUDA-based NMT system, whose inference speed is about ten times that of TensorFlow; it can be found here.
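The named-entity replacement step can be sketched as follows. The entity dictionary and the translate stub below are assumptions for illustration; in the real system the translation comes from the Transformer model:

# Hypothetical entity dictionary mapping source entities to curated
# translations; the entries are examples, not Xiaomingbot's lexicon.
NE_DICT = {
    "武磊": "Wu Lei",            # player name
    "上海上港": "Shanghai SIPG",  # club name
}

def translate(text):
    # Stand-in for the Transformer NMT model; placeholder tokens such
    # as <NE0> are expected to pass through the model unchanged.
    return text

def translate_with_ne_replacement(src):
    """Swap known entities for placeholders, translate, then restore."""
    restore = {}
    for i, (entity, target) in enumerate(NE_DICT.items()):
        token = "<NE{}>".format(i)
        if entity in src:
            src = src.replace(entity, token)
            restore[token] = target
    out = translate(src)
    for token, target in restore.items():
        out = out.replace(token, target)
    return out

print(translate_with_ne_replacement("武磊在比赛中进球"))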
News Reading
Given a small amount of recorded speech from a speaker in one language as training data, we can train a TTS model for Xiaomingbot. This TTS model includes a cross-lingual voice cloning mechanism: after training, it can read news in different languages with the same voice as the recorded speaker.
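One common way to realize such cross-lingual voice cloning is to factor the speaker identity into a separate embedding, so the same speaker vector can be paired with phonemes from any language. The PyTorch sketch below illustrates this idea only; the layer types and sizes are assumptions, not the paper's architecture:

import torch
import torch.nn as nn

class MiniTTS(nn.Module):
    """Toy speaker-conditioned acoustic model (illustrative sizes)."""
    def __init__(self, n_phonemes=100, n_speakers=4, dim=128, n_mels=80):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, dim)
        self.speaker_emb = nn.Embedding(n_speakers, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.Linear(2 * dim, n_mels)  # predict mel frames

    def forward(self, phonemes, speaker_id):
        # phonemes: (batch, seq) phoneme IDs; speaker_id: (batch,)
        h, _ = self.encoder(self.phoneme_emb(phonemes))
        # Broadcast one speaker vector over all phoneme positions.
        spk = self.speaker_emb(speaker_id).unsqueeze(1).expand_as(h)
        return self.decoder(torch.cat([h, spk], dim=-1))

model = MiniTTS()
mel = model(torch.randint(0, 100, (1, 12)), torch.tensor([0]))
print(mel.shape)  # torch.Size([1, 12, 80])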
Avatar Animation
Xiaomingbot generates lip motion synchronized with the audio synthesized by the TTS model, and renders hair, clothing, and other details. For lip motion, we use a Seq2Seq model: the input is the sequence of phonemes and their durations drawn from the TTS model, and the output is a sequence of mouth blendshape weights. With different mouth blendshape weights, Xiaomingbot can produce a wide range of facial expressions. For the remaining rendering, we use Unity together with techniques such as normal mapping.
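The sketch below illustrates the Seq2Seq idea: phonemes are first repeated to match their durations so the input aligns with animation frames, and a recurrent network predicts one blendshape-weight vector per frame. The layer sizes and the number of blendshapes are illustrative assumptions, not the paper's model:

import torch
import torch.nn as nn

def expand_by_duration(phonemes, durations):
    """Repeat each phoneme ID for its duration (in frames)."""
    return torch.repeat_interleave(phonemes, durations)

class LipSyncNet(nn.Module):
    """Toy per-frame phoneme-to-blendshape model."""
    def __init__(self, n_phonemes=100, dim=64, n_blendshapes=20):
        super().__init__()
        self.emb = nn.Embedding(n_phonemes, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * dim, n_blendshapes)

    def forward(self, frames):
        h, _ = self.rnn(self.emb(frames))
        # Sigmoid keeps each blendshape weight in [0, 1] for the renderer.
        return torch.sigmoid(self.out(h))

phonemes = torch.tensor([3, 17, 42])   # three phoneme IDs
durations = torch.tensor([4, 6, 5])    # frame counts from the TTS model
frames = expand_by_duration(phonemes, durations).unsqueeze(0)
weights = LipSyncNet()(frames)
print(weights.shape)  # torch.Size([1, 15, 20])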
Our Paper
You can find our paper here. If you would like to cite it, please use:
@inproceedings{xu-etal-2020-xiaomingbot,
title = "{X}iaomingbot: {A} {M}ultilingual {R}obot {N}ews {R}eporter",
author = "Xu, Runxin and
Cao, Jun and
Wang, Mingxuan and
Chen, Jiaze and
Zhou, Hao and
Zeng, Ying and
Wang, Yuping and
Chen, Li and
Yin, Xiang and
Zhang, Xijin and
Jiang, Songcheng and
Wang, Yuxuan and
Li, Lei",
booktitle = "Proceedings of the 58th Annual Meeting of the Association
for Computational Linguistics: System Demonstrations",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
pages = "1--8"
}
Links
Xiaomingbot News Reporter: Link