# An In-Depth Look at Baidu's AI Aspirations

**JULIA LI** 



**NEWSHA ARDALANI** 














### Baidu Brain

# AI&HPC

### Make communication easier

- Speech Recognition
- Text-to-Speech Synthesis
- Simultaneous Translation
- Language Model



### Make AI faster

High Performance Computing







# Speech Recognition Model — SMLTA

### Features

- Streaming
- Multi-layer Neural Network
- Large Data Code Switching





Tech blog: research.baidu.com/Blog/index-view?id=109

# Text-to-Speech Synthesis (TTS)








# TTS Model — Clarinet

### A Fully End-To-End Neural Network Model

Text Phonemes











# Tradeoff between Latency and Quality



One of AI's Holy Grails Needs Fundamentally New Ideas!







# Simultaneous Translation Model — STACL

### A prefix-to-prefix framework

Controllable latency
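One way to picture a prefix-to-prefix schedule with controllable latency is the wait-k policy described in the STACL paper, where the decoder stays k source tokens behind the speaker. The sketch below only computes which source prefix is visible at each target step; the function names and the example lengths are illustrative assumptions, not Baidu's implementation.

```python
def wait_k_schedule(source_len: int, target_len: int, k: int) -> list[int]:
    """Number of source tokens visible when emitting each target token.

    Assumed wait-k rule: target token t (1-based) is produced after reading
    min(k + t - 1, source_len) source tokens.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return [min(k + t - 1, source_len) for t in range(1, target_len + 1)]


def visible_prefixes(source_tokens: list[str], target_len: int, k: int) -> list[list[str]]:
    """Source prefix the decoder is allowed to see at each target step."""
    schedule = wait_k_schedule(len(source_tokens), target_len, k)
    return [source_tokens[:n] for n in schedule]


# Hypothetical 5-token source and 5-token target; the decoder lags by k = 2.
schedule = wait_k_schedule(source_len=5, target_len=5, k=2)
print(schedule)  # [2, 3, 4, 5, 5]
```

A smaller k lowers latency but gives the decoder less context per step, which is the latency/quality tradeoff named earlier in the deck.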







Figure: two examples, each showing a source stream and its target stream.

# Natural Language Processing

### Challenge

- NLP is a diversified field with many distinct tasks
- Shortage of training data

### New Trend

- Pre-training + Fine-tuning framework
  - Pre-training (using the enormous amount of unannotated text data)
  - Fine-tuning (using small-data NLP tasks, resulting in substantial accuracy improvements)





### Machine

Question Answering



Information Retrieval



# Language Model — ERNIE 2.0

- Inspired by BERT
- Incorporate more information
  - Named entities
  - Semantic closeness
  - Sentence order or discourse relations
- Design a continual pretraining framework for language understanding
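The continual pretraining idea can be sketched as a schedule: each stage introduces one new pretraining task and keeps training on the tasks introduced before it, so earlier knowledge is not forgotten. The task names below are hypothetical labels derived from the bullets above, and the code models only the schedule, not the training itself.

```python
from typing import Iterator


def continual_schedule(tasks: list[str]) -> Iterator[tuple[int, list[str]]]:
    """Yield (stage, active_tasks); every stage adds one task to the active set."""
    active: list[str] = []
    for stage, task in enumerate(tasks, start=1):
        active.append(task)
        # Copy so callers cannot mutate the internal list.
        yield stage, list(active)


# Hypothetical task names, one per kind of information listed on the slide.
pretraining_tasks = ["named_entity_masking", "semantic_closeness", "sentence_reordering"]
stages = list(continual_schedule(pretraining_tasks))
for stage, active in stages:
    print(stage, active)
```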





# Bigger Model is Better?

| Model      | Hidden size | Layers  | Parameters |
|------------|-------------|---------|------------|
| BERT-base  | 768         | 12      | 110M       |
| BERT-large | 1024        | 24      | 340M       |
| GPT2-large | 1600        | 48      | 1.5B       |
| Megatron   | 3072        | 72      | 8.3B       |
| T5         | E1024 D1024 | E24 D24 | 11B        |
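A rough rule of thumb links the hidden size and layer count in the table to the parameter count: each Transformer layer holds about 12·h² weights (4·h² for attention, 8·h² for a feed-forward block of width 4h), plus a token embedding matrix. The sketch below applies that approximation to the two BERT rows; the vocabulary size of 30,522 is BERT's published WordPiece vocabulary, and biases, layer norms and position embeddings are ignored, so the result is an estimate only.

```python
def approx_transformer_params(hidden: int, layers: int, vocab: int = 0) -> int:
    """Approximate weight count of a Transformer encoder stack."""
    per_layer = 12 * hidden * hidden  # 4h^2 attention + 8h^2 feed-forward
    embeddings = vocab * hidden       # token embedding matrix
    return layers * per_layer + embeddings


bert_base = approx_transformer_params(hidden=768, layers=12, vocab=30522)
bert_large = approx_transformer_params(hidden=1024, layers=24, vocab=30522)
print(f"BERT-base  ~ {bert_base / 1e6:.0f}M parameters")
print(f"BERT-large ~ {bert_large / 1e6:.0f}M parameters")
```

Both estimates land within a few percent of the 110M and 340M figures in the table, which shows how quickly size grows: parameters scale linearly with depth but quadratically with hidden size.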

"[Training of BERT-large] was performed on 16 Cloud TPUs (64 TPU chips total). Each pretraining took 4 days to complete." Paper: https://arxiv.org/pdf/1810.04805.pdf

"... interconnect with supporting CPU host machines." Paper: https://arxiv.org/pdf/1910.10683.pdf





## Why Bigger is Better?

### What Are the Implications for System Community?





# Moore's Law

### Moore's Law: The number of transistors on integrated circuit chips (1971-2018)

Moore's law describes the empirical regularity that the number of transistors on integrated circuits doubles approximately every two years. This advancement is important as other aspects of technological progress, such as processing speed or the price of electronic products, are linked to Moore's law.

Figure: transistor count per chip on a logarithmic axis, 1971-2018.



Data source: Wikipedia (https://en.wikipedia.org/wiki/Transistor_count)
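The "doubles approximately every two years" rule can be turned into a quick back-of-the-envelope projection. The sketch below assumes an exact two-year doubling period and, as an illustrative starting point, roughly 2,300 transistors for a 1971 chip (the commonly cited figure for the Intel 4004); both are assumptions for the arithmetic, not values read off the chart.

```python
def projected_transistors(start_count: float, start_year: int, year: int,
                          doubling_years: float = 2.0) -> float:
    """Transistor count implied by a fixed doubling period."""
    return start_count * 2 ** ((year - start_year) / doubling_years)


# Growth factor over the chart's 1971-2018 span: 2 ** (47 / 2).
growth = projected_transistors(1, 1971, 2018)
# Assumed 1971 starting point of about 2,300 transistors.
projected_2018 = projected_transistors(2300, 1971, 2018)
print(f"growth factor: {growth:.3g}")
print(f"projected 2018 count: {projected_2018:.3g}")
```

Under these assumptions the projection is a few tens of billions of transistors, the same order of magnitude as the top of the chart's axis.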

# What does it mean to be "Better"?

- Better Accuracy
- Faster Training





# Better as Better Accuracy





Figure x-axis: # training samples, log(d); annotation: Bigger.

Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., ... & Zhou, Y. (2017). Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409.

Figure annotations on the x-axis (# training samples, log(d)): Today's Data, Required Data.
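The gap between "Today's Data" and "Required Data" can be estimated if error follows a power law in the number of training samples, which is a straight line on log-log axes. The sketch below fits such a line and inverts it for a target error. The data points, the target error and the function names are synthetic assumptions for illustration, and the irreducible-error floor discussed in the cited paper is ignored.

```python
import math


def fit_power_law(samples: list[float], errors: list[float]) -> tuple[float, float]:
    """Least-squares fit of log(error) = log(alpha) + beta * log(samples)."""
    xs = [math.log(m) for m in samples]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    beta = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
            / sum((x - mean_x) ** 2 for x in xs))
    alpha = math.exp(mean_y - beta * mean_x)
    return alpha, beta


def required_samples(alpha: float, beta: float, target_error: float) -> float:
    """Invert error = alpha * m ** beta for the dataset size m."""
    return (target_error / alpha) ** (1.0 / beta)


# Synthetic measurements generated from error = 2.0 * m ** -0.3.
samples = [1e4, 1e5, 1e6, 1e7]
errors = [2.0 * m ** -0.3 for m in samples]
alpha, beta = fit_power_law(samples, errors)
needed = required_samples(alpha, beta, target_error=0.005)
print(f"alpha={alpha:.3f}, beta={beta:.3f}, required samples ~ {needed:.2e}")
```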


# Better as Faster Training







Figure x-axis: # parameters, log(m); annotation: Bigger.

Ardalani, N., Hestness, J., & Diamos, G. (2018). Have a larger cake and eat it faster too: A guideline to train larger models faster. SysML 2018.





### Memory capacity/chip can grow only so much...
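A quick estimate shows why per-chip memory becomes the limit. The sketch below assumes 16 bytes of training state per parameter (fp32 weights, gradients and two Adam moment buffers) and a hypothetical accelerator with 32 GiB of memory; activations are ignored, so real requirements are higher. The parameter counts are the BERT-large and T5 figures quoted earlier in the deck.

```python
import math

BYTES_PER_PARAM = 16  # assumed: fp32 weight + gradient + two Adam moments


def training_state_gib(num_params: int, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Memory for weights, gradients and optimizer state, in GiB."""
    return num_params * bytes_per_param / 2 ** 30


def min_model_shards(num_params: int, device_gib: float) -> int:
    """Smallest number of devices whose combined memory holds the training state."""
    return math.ceil(training_state_gib(num_params) / device_gib)


DEVICE_GIB = 32  # hypothetical accelerator memory
for name, params in [("BERT-large", 340_000_000), ("T5", 11_000_000_000)]:
    print(name, f"{training_state_gib(params):.1f} GiB,",
          min_model_shards(params, DEVICE_GIB), "device(s)")
```

Under these assumptions BERT-large fits on one device, while the 11B-parameter model needs its state split across at least six.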





### Break the model and data into smaller chunks





### We need to exploit all forms of parallelism
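Breaking the data and the model into chunks corresponds to two basic forms of parallelism. The toy sketch below applies both to one dense layer, `y = x @ w`: data parallelism splits the batch rows, model parallelism splits the weight columns, and either way the pieces recombine into the same result. It runs sequentially in plain Python with illustrative function names; a real system would place each chunk on a separate device.

```python
Matrix = list[list[float]]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Plain dense matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def data_parallel(x: Matrix, w: Matrix, workers: int) -> Matrix:
    """Split the batch (rows of x); every worker holds a full copy of w."""
    size = (len(x) + workers - 1) // workers
    out: Matrix = []
    for start in range(0, len(x), size):
        out.extend(matmul(x[start:start + size], w))
    return out


def model_parallel(x: Matrix, w: Matrix, workers: int) -> Matrix:
    """Split the weights (columns of w); every worker sees the full batch."""
    cols = len(w[0])
    size = (cols + workers - 1) // workers
    shards = [[row[j:j + size] for row in w] for j in range(0, cols, size)]
    partials = [matmul(x, shard) for shard in shards]
    # Concatenate each worker's output columns back into full rows.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 0, 1]]
w = [[1, 0, 2, 1], [0, 1, 1, 0], [3, 1, 0, 2]]
reference = matmul(x, w)
print(data_parallel(x, w, 2) == reference, model_parallel(x, w, 2) == reference)
```

Data parallelism replicates the weights, so it does not reduce per-device model memory; model parallelism does, at the cost of exchanging partial outputs.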










# How to find a good parallelism strategy?

- Current Practice: Hire Expert Programmers
- Cutting edge: Reinforcement Learning, Dynamic Programming
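To make the dynamic-programming option concrete, here is a minimal sketch that picks one parallelism strategy per layer so that per-layer cost plus a penalty for switching strategies between adjacent layers is minimized. The cost numbers, the two strategy names and the flat switching penalty are all hypothetical; this illustrates the search idea and is not the method used in any specific system.

```python
def best_strategy(layer_costs: list[dict[str, float]],
                  switch_cost: float) -> tuple[float, list[str]]:
    """Return (total cost, per-layer plan) minimizing layer and switching costs."""
    strategies = list(layer_costs[0])
    # best[s] = cheapest (cost, plan) for the layers so far, ending in strategy s.
    best = {s: (layer_costs[0][s], [s]) for s in strategies}
    for costs in layer_costs[1:]:
        updated = {}
        for s in strategies:
            prev_cost, prev_plan = min(
                ((cost + (0.0 if p == s else switch_cost), plan)
                 for p, (cost, plan) in best.items()),
                key=lambda item: item[0],
            )
            updated[s] = (prev_cost + costs[s], prev_plan + [s])
        best = updated
    return min(best.values(), key=lambda item: item[0])


# Hypothetical per-layer step times for each strategy.
layer_costs = [
    {"data": 1.0, "model": 4.0},
    {"data": 1.0, "model": 5.0},
    {"data": 6.0, "model": 2.0},
    {"data": 7.0, "model": 2.0},
]
total, plan = best_strategy(layer_costs, switch_cost=1.5)
print(total, plan)  # 7.5 ['data', 'data', 'model', 'model']
```

The search visits each layer once and compares every pair of strategies, so its cost grows linearly with depth rather than exponentially, which is what makes it practical compared with enumerating every plan by hand.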









# GOOD NEWS: Best mapping/Best timing



## BAD NEWS: System Under-utilization



# Solution?

# Co-design Parallelism Strategy & Hardware Accelerator





# Conclusion

### GOOD NEWS: Bigger is Better

# Memory is Bottleneck

# Systems Underutilization

### WHAT CAN WE DO?



# **Cloud AI Computing Platform KongMing Architecture**

- GPU
- Chip
- CPU
- Speech/Images/NLP/Recommendation
- PaddlePaddle
- Smart Scheduling
- IO optimization
- Elastic provision
- Communication optimization
- High performance storage pool
- High speed interconnect
- ASIC
- Baidu Kunlun







