Understanding Real-World Concurrency Bugs in Go

最近看了一篇关于golang 并发bug研究的paper，有挺多参考价值的，顺便做个笔记~

Introduction

Go的一个主要设计目标是改善传统的多线程编程语言，使得并发编程更加容易且更少出错，其主要通过两方面：

让线程（goroutine）更加轻量化和容易创建
在线程间使用特殊的消息通讯（channel）

论文中主要研究了六个开源软件，分别是：Docker，Kubernetes，etcd，gRPC，CockroachDB，BoltDB，总共包含171个并发Bug。通过分析bug的根源，重现，以及修复来研究这些bug。

研究问题主要关注：消息传递和共享内存哪种进程间通讯机制更不易于出错。相比共享内存，Go则更推崇消息传递的方式。

对于Bug，主要按照两个维度进行分类：

按bug的原因分为：共享内存bug和消息传递bug；
按bug的影响分为：阻塞bug和非阻塞bug；

Bug示例：

bug1

fix

// message passing, blocking bug
func finishReq(timeout time.Duration) ob {
   ch := make(chan ob)
   go func() {
      result := fn()
      ch <- result // block
   }()
   select {
   case result = <-ch:
      return result
   case <-time.After(timeout):
      return nil
   }
}

// message passing, blocking bug
func finishReq(timeout time.Duration) ob {
   ch := make(chan ob, 1)
   go func() {
      result := fn()
      ch <- result // block
   }()
   select {
   case result = <-ch:
      return result
   case <-time.After(timeout):
      return nil
   }
}

说明：如果timeout了，则goroutine无法正常结束

Background and Applications

Goroutine

以goroutine作为并发单元，按 M-N 的方式绑定到内核线程。Go支持使用匿名函数来创建线程，在匿名函数前定义的所有本地变量都可以被子goroutine访问到，然而这很容易造成竞态。

Shared Memory

传统的方式 Mutex(lock/unlock), RWMutex(read/write lock), Cond(条件变量), atomic(read/write)；新的方式 Once, WaitGroup。不过误用 waitgroup 很容易引发bug。

Message Passing

新的方式 channel，包含 buffered 和 unbuffered。使用 channel 具有一些潜规则（必须初始化、从nil channel读写导致阻塞、往closed channel发送数据或关闭一个closed channel将导致panic）。

select 带有随机性，而这种随机性也有可能引入并发bug。

Go Concurrency Usage Patterns

Goroutine Usages

在实际开发环境中：

写的代码是否包含很多goroutine（静态）；
Go程序在运行时是否创建很多goroutine（动态）；

论文中统计了6个研究对象每千行代码创建的goroutine数量在0.18~0.83之间，同时除了Kubernetes和BoltDB它们也大量使用匿名函数。

对比 gRPC-Go 和 gRPC-C，在处理相同的请求数量的情况下（因为Go与C性能不同，不能以时间为参考），可以看到：

gRPC-Go goroutine的数量是 gRPC-C thread的数倍之多；
gRPC-Go goroutine的运行时间是整个程序运行时间的60%~90%, gRPC-C thread则基本是100%；

观察结果1：相比C，Goroutines生命周期更短但创建得更加频繁。

Concurrency Primitive Usages

统计结果显示，共享内存用得比消息传递更多。

观察结果2：虽然传统的共享内存线程间通讯还保留着重度使用，Go程序员也使用了大量的消息传递方式。

Bug Study Methodology

收集Bug的方式：从github上搜索带有 race、deadlock、 concurrency、mutex、atomic等关键字的提交，然后进行随机抽样和过滤，再人工处理，最后总共包含171个bug。

Bug分类：

按bug的行为：
- 如果一个或多个goroutine被卡住，无法继续往前执行，则称之为阻塞bug；
- 如果goroutine可以正常结束，但是他们的目的没有达到，则称之为非阻塞bug；
按bug的原因：shared memory or passing messages；

Blocking Bugs

Root Causes of Blocking Bugs

观察结果3：与普遍认知（消息传递不易出错）相反，更多的阻塞性bug是因为消息传递引发的。

(mis)Protoction of Shared Memory

Mutex：重复加锁，获取锁的顺序冲突，忘记解锁；

RWMutex：进程A重复获取读锁，进程B获取写锁，导致死锁；

WaitGroup：

bug2

fix

var group sync.WaitGroup
group.Add(len(pm.plugins))
for _, p := range pm.plugins {
   go func(p *plugin) {
      defer group.Done()
   }(p)
   group.Wait()
}

var group sync.WaitGroup
group.Add(len(pm.plugins))
for _, p := range pm.plugins {
   go func(p *plugin) {
      defer group.Done()
   }(p)
}
group.Wait()

Misuse of Message Passing

常规的一些bug 如前面提到的（bug 1）；另外，当使用 Go 一些特定的库时，也需要额外注意，比如：

bug3

fix

ctx := context.Background()
hctx, hcancel := context.WithCancel(ctx)
if timeout > 0 {
   hctx, hcancel = context.WithTimeout(ctx, timeout)
}

ctx := context.Background()
var hctx context.Context
var hcancel context.CancelFunc
if timeout > 0 {
   hctx, hcancel = context.WithTimeout(ctx, timeout)
} else {
   hctx, hcancel = context.WithCancel(ctx)
}

说明：WithCancel 创建了一个goroutine，而后如果timeout > 0，hcancel 指向新对象，导致goroutine无法正常结束。

观察结果5：所有因消息传递导致的阻塞bug都跟Go新的消息传递语义如channel有关，它们都难以检测特别是跟其它同步机制混用的时候。

Detection of Blocking Bugs

Go提供了内置的死锁监测，然而对于前面的bug，只有少部分能被检测出来。原因是：

只要还有goroutine在执行，则不会认为是死锁；
只检测了Go并发基元阻塞的goroutine，因等待系统资源而阻塞的goroutine并不考虑；

Non-Blocking Bugs

Root Causes of Non-blocking Bugs

虽然与传统语言相似，有些bug却是因为对Go的新特性理解不透彻导致：

bug4	fix
for i := 0; i <= 10; i++ { go func() { fmt.Println(i) }() }	for i := 0; i <= 10; i++ { go func(i int) { fmt.Println(i) }(i) }
说明：匿名函数导致的竞态

另外也有 WaitGroup 的误用导致的bug：

bug5

fix

func (p *peer) send(){
   p.mu.Lock()   
   defer p.mu.Unlock()
   switch p.status {
   case idle:
      go func(){
         p.wg.Add(1)   
         // ..
         p.wg.Done()
      }()
   case stopped: 
   }
}

func (p *peer) stop(){
   p.mu.Lock()
   p.status = stopped
   p.mu.Unlock()
   p.wg.Wait()
}

func (p *peer) send(){
   p.mu.Lock()   
   defer p.mu.Unlock()
   switch p.status {
   case idle:
      p.wg.Add(1)
      go func(){
         // ..
         p.wg.Done()
      }()
   case stopped: 
   }
}

func (p *peer) stop(){
   p.mu.Lock()
   p.status = stopped
   p.mu.Unlock()
   p.wg.Wait()
}

说明：无法保证goroutine 比 p.wg.Wait() 更早执行

channel重复关闭导致的Bug：

bug6	fix
select { case <- c.closed: default: close(c.closed) }	Once.Do(func(){ close(c.closed) })
说明：当多个goroutine同时执行该代码时，有可能使channel被关闭多次，导致panic

select 随机性导致的Bug：

bug7

fix

ticker := time.NewTicker()
for {
   f()
   select {
   case <- stopChn:
      return
   case <- ticker:
   }
}

ticker := time.NewTicker()
for {
   select {
   case <- stopChn:
      return
   default:  
   }
   f()
   select {
   case <-stopChn:
      return
   case <-ticker:
   }
}

说明：由于select的随机性，可能导致 stopChn 后 f() 被执行多一次

Timer误用导致的Bug：

bug8

fix

timer := time.NewTimer(0)
if dur > 0 {
   timer := time.NewTimer(dur)
}
select {
case <-timer.C:
case <-ctx.Done():
   return nil
}

var timeout <-chan time.Time
if dur > 0 {
   timeout = time.NewTimer(dur).C
}
select {
case <-timeout:
case <-ctx.Done():
   return nil
}

说明：当dur=0时，则select总是直接返回，ctx.Done() 不起作用

Detection of Non-Blocking Bugs

Go 提供了数据竞态检测器，可以在 build 的时候加入 -race 标志。

参考：https://songlh.github.io/paper/go-study.pdf

码字很辛苦，转载请注明来自ChenJiehua的《Understanding Real-World Concurrency Bugs in Go》

2019-03-23 2019-03-23 concurrency, golang

一	二	三	四	五	六	日
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30	31

Understanding Real-World Concurrency Bugs in Go

Introduction

Background and Applications

Goroutine

Shared Memory

Message Passing

Go Concurrency Usage Patterns

Goroutine Usages

Concurrency Primitive Usages

Bug Study Methodology

Blocking Bugs

Root Causes of Blocking Bugs

(mis)Protoction of Shared Memory

Misuse of Message Passing

Detection of Blocking Bugs

Non-Blocking Bugs

Root Causes of Non-blocking Bugs

Detection of Non-Blocking Bugs

评论

近期文章

归档

Understanding Real-World Concurrency Bugs in Go

Introduction

Background and Applications

Goroutine

Shared Memory

Message Passing

Go Concurrency Usage Patterns

Goroutine Usages

Concurrency Primitive Usages

Bug Study Methodology

Blocking Bugs

Root Causes of Blocking Bugs

(mis)Protoction of Shared Memory

Misuse of Message Passing

Detection of Blocking Bugs

Non-Blocking Bugs

Root Causes of Non-blocking Bugs

Detection of Non-Blocking Bugs

评论

近期文章

归档

标签