A Deep Look Into Golang Profile-Guided Optimization (PGO)

Intro

Profile-Guided Optimization (PGO) is an optimization method which improves final compiled binary by using profile data hints and compiling the code based on those profiles.

There are several optimization mechanisms that compilers usually get into account when compiling your code to binary. Like dead code elimination, Register allocation, Constant folding or function inlining; you may split your code into smaller and smaller functions and different level of abstractions to ease up future changes and modifications, but from compiler point of view multiple calls to different functions may not be very optimize and sometimes compiler decides to inline your functions!

Although there’s a limit to these optimizations. Compiler can’t inline everything! this will result in bloating binary files, l-cache overhead and performance degradation. But with providing information about how the program is being run and which functions are being constantly called, the compiler can inline those hot functions and reduce function call and return overhead. It’s the same with register allocation and other methods.

How does it work?

From go version v1.21, you only need to provide:

  1. A pprof CPU profile, which can be collect with runtime/pprof or net/http/pprof
  2. Renaming it to default.pgo and move it to same directory as main.go or pass it to pgo flag in go build command

Let’s take a look at the following code and see how it works:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
package main

import (
        "fmt"
        "os"
        "runtime/pprof"
)

func main() {
        // These 4 lines simply collect profile data into default.pgo file
        f, _ := os.Create("default.pgo")
        defer f.Close()
        pprof.StartCPUProfile(f)
        defer pprof.StopCPUProfile()

        for i := 0; i <= 1_000_000; i++ {
                Func1()
        }
}

func Func1() {
        fmt.Println("print in func #1")
        Func2()
}

func Func2() {
        fmt.Println("print in func #2")
}

By running go build -o before_pgo ./main.go we compile the code and create the before_pgo file in the same directory and by running that binary with ./before_pgo we can create the default.pgo that we need to optimize our binary.

Before going any further, let’s see how the compiler uses our fancy default.pgo file to improve things. By running go build -pgo=off -gcflags -m main.go we get the following results: -gcflags -m flag is to see compiler optimizations and since we already have default.pgo we need to use -pgo=off to not use it for now.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
# command-line-arguments
./main.go:26:6: can inline Func2
./main.go:27:13: inlining call to fmt.Println
./main.go:22:13: inlining call to fmt.Println
./main.go:23:7: inlining call to Func2
./main.go:23:7: inlining call to fmt.Println
./main.go:11:19: inlining call to os.Create
./main.go:22:13: ... argument does not escape
./main.go:22:14: "print in func #1" escapes to heap
./main.go:23:7: ... argument does not escape
./main.go:23:7: "print in func #2" escapes to heap
./main.go:27:13: ... argument does not escape
./main.go:27:14: "print in func #2" escapes to heap

As you can see, the compiler is inlining a lot of things, such as our Func2, Very good without any PGO, right?

Now Let’s try it out with fancy -pgo option, Running go build -pgo=auto -gcflags -m main.go will give us:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
# command-line-arguments
./main.go:26:6: can inline Func2
./main.go:27:13: inlining call to fmt.Println
./main.go:21:6: can inline Func1
./main.go:22:13: inlining call to fmt.Println
./main.go:23:7: inlining call to Func2
./main.go:11:19: inlining call to os.Create
./main.go:17:8: inlining call to Func1
./main.go:27:13: ... argument does not escape
./main.go:27:14: "print in func #2" escapes to heap
./main.go:17:8: ... argument does not escape
./main.go:17:8: "print in func #1" escapes to heap
./main.go:22:13: ... argument does not escape
./main.go:22:14: "print in func #1" escapes to heap
./main.go:23:7: ... argument does not escape
./main.go:23:7: "print in func #2" escapes to heap

Now we can see that our Func1 is also inlined, And why is that? It’s because by default, go compiler doesn’t inline any non-leaf function. But by providing some profile data and telling it how our program is actually working! It understands that we’re calling that function 1,000,000 times and it’s better to be inlined.

Dig deeper inside the binaries!

Now let’s compile our code, this time with profile data that we have and compare the binaries. Now that we have default.pgo in the same directory as main.go, let’s run go build -o after_pgo ./main.go and first of all, let’s compare the size of our binaries by running ls -atrl we get:

1
2
3
4
5
6
7
8
 total 4424
drwxr-xr-x 4 yahya yahya    4096 Aug 18 18:25 ..
-rw-r--r-- 1 yahya yahya      65 Aug 18 18:25 go.mod
-rw-r--r-- 1 yahya yahya     407 Aug 18 18:26 main.go
-rwxr-xr-x 1 yahya yahya 2250107 Aug 18 18:41 before_pgo
-rw-r--r-- 1 yahya yahya    2639 Aug 18 18:44 default.pgo
-rwxr-xr-x 1 yahya yahya 2252917 Aug 18 19:02 after_pgo
drwxr-xr-x 2 yahya yahya    4096 Aug 18 19:06 .

after_pgo file is slightly bigger, and we know that’s because of some inlining. Let’s checkout inside these binaries and see what we can find. I’m going to use objdump that exists in the go tool chain, so we can run go tool objdump before_pgo and go tool objdump after_pgo and compare the result. The following is only the part of inlining the Func1 version.

The following is before and after pgo objdump binaries:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
// before_pgo
TEXT main.main(SB) /home/yahya/go/src/github.com/theyahya/go-pgo-playground/example1/main.go
  main.go:9             0x4a32e0                493b6610                CMPQ SP, 0x10(R14)
  main.go:9             0x4a32e4                0f86c1000000            JBE 0x4a33ab
  main.go:9             0x4a32ea                55                      PUSHQ BP
  main.go:9             0x4a32eb                4889e5                  MOVQ SP, BP
  main.go:9             0x4a32ee                4883ec50                SUBQ $0x50, SP
  main.go:9             0x4a32f2                440f117c2440            MOVUPS X15, 0x40(SP)
  main.go:9             0x4a32f8                c644242700              MOVB $0x0, 0x27(SP)
  main.go:11            0x4a32fd                90                      NOPL
  file.go:323           0x4a32fe                488d05590b0200          LEAQ 0x20b59(IP), AX
  file.go:323           0x4a3305                bb0b000000              MOVL $0xb, BX
  file.go:323           0x4a330a                b942020000              MOVL $0x242, CX
  file.go:323           0x4a330f                bfb6010000              MOVL $0x1b6, DI
  file.go:323           0x4a3314                e88709feff              CALL os.OpenFile(SB)
  main.go:12            0x4a3319                440f117c2430            MOVUPS X15, 0x30(SP)
  main.go:12            0x4a331f                488d159a000000          LEAQ main.main.func1(SB), DX
  main.go:12            0x4a3326                4889542430              MOVQ DX, 0x30(SP)
  main.go:12            0x4a332b                4889442438              MOVQ AX, 0x38(SP)
  main.go:12            0x4a3330                488d542430              LEAQ 0x30(SP), DX
  main.go:12            0x4a3335                4889542448              MOVQ DX, 0x48(SP)
  main.go:12            0x4a333a                c644242701              MOVB $0x1, 0x27(SP)
  main.go:13            0x4a333f                4889c3                  MOVQ AX, BX
  main.go:13            0x4a3342                488d05bf2d0400          LEAQ go:itab.*os.File,io.Writer(SB), AX
  main.go:13            0x4a3349                e89259ffff              CALL runtime/pprof.StartCPUProfile(SB)
  main.go:14            0x4a334e                488d1503880200          LEAQ 0x28803(IP), DX
  main.go:14            0x4a3355                4889542440              MOVQ DX, 0x40(SP)
  main.go:14            0x4a335a                c644242703              MOVB $0x3, 0x27(SP)
  main.go:14            0x4a335f                31c0                    XORL AX, AX
  main.go:16            0x4a3361                eb12                    JMP 0x4a3375
  main.go:16            0x4a3363                4889442428              MOVQ AX, 0x28(SP)
  main.go:17            0x4a3368                e8b3000000              CALL main.Func1(SB)
  main.go:16            0x4a336d                488b442428              MOVQ 0x28(SP), AX
  main.go:16            0x4a3372                48ffc0                  INCQ AX
  main.go:16            0x4a3375                483d40420f00            CMPQ AX, $0xf4240
  main.go:16            0x4a337b                7ee6                    JLE 0x4a3363
  main.go:19            0x4a337d                c644242701              MOVB $0x1, 0x27(SP)
  main.go:19            0x4a3382                e8395dffff              CALL runtime/pprof.StopCPUProfile(SB)
  main.go:19            0x4a3387                c644242700              MOVB $0x0, 0x27(SP)
  main.go:19            0x4a338c                488b542448              MOVQ 0x48(SP), DX
  main.go:19            0x4a3391                488b02                  MOVQ 0(DX), AX
  main.go:19            0x4a3394                ffd0                    CALL AX
  main.go:19            0x4a3396                4883c450                ADDQ $0x50, SP
  main.go:19            0x4a339a                5d                      POPQ BP
  main.go:19            0x4a339b                c3                      RET
  main.go:19            0x4a339c                0f1f4000                NOPL 0(AX)
  main.go:19            0x4a33a0                e8fbd5f8ff              CALL runtime.deferreturn(SB)
  main.go:19            0x4a33a5                4883c450                ADDQ $0x50, SP
  main.go:19            0x4a33a9                5d                      POPQ BP
  main.go:19            0x4a33aa                c3                      RET
  main.go:9             0x4a33ab                e830a1fbff              CALL runtime.morestack_noctxt.abi0(SB)
  main.go:9             0x4a33b0                e92bffffff              JMP main.main(SB)

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
// after_pgo
TEXT main.main(SB) /home/yahya/go/src/github.com/theyahya/go-pgo-playground/example1/main.go
  main.go:9             0x4a3b80                493b6610                CMPQ SP, 0x10(R14)
  main.go:9             0x4a3b84                0f86f2000000            JBE 0x4a3c7c
  main.go:9             0x4a3b8a                55                      PUSHQ BP
  main.go:9             0x4a3b8b                4889e5                  MOVQ SP, BP
  main.go:9             0x4a3b8e                4883ec60                SUBQ $0x60, SP
  main.go:9             0x4a3b92                440f117c2450            MOVUPS X15, 0x50(SP)
  main.go:9             0x4a3b98                c644242700              MOVB $0x0, 0x27(SP)
  main.go:11            0x4a3b9d                90                      NOPL
  file.go:323           0x4a3b9e                488d05b9020200          LEAQ 0x202b9(IP), AX
  file.go:323           0x4a3ba5                bb0b000000              MOVL $0xb, BX
  file.go:323           0x4a3baa                b942020000              MOVL $0x242, CX
  file.go:323           0x4a3baf                bfb6010000              MOVL $0x1b6, DI
  file.go:323           0x4a3bb4                e84704feff              CALL os.OpenFile(SB)
  main.go:12            0x4a3bb9                440f117c2430            MOVUPS X15, 0x30(SP)
  main.go:12            0x4a3bbf                488d15da000000          LEAQ main.main.func1(SB), DX
  main.go:12            0x4a3bc6                4889542430              MOVQ DX, 0x30(SP)
  main.go:12            0x4a3bcb                4889442438              MOVQ AX, 0x38(SP)
  main.go:12            0x4a3bd0                488d542430              LEAQ 0x30(SP), DX
  main.go:12            0x4a3bd5                4889542458              MOVQ DX, 0x58(SP)
  main.go:12            0x4a3bda                c644242701              MOVB $0x1, 0x27(SP)
  main.go:13            0x4a3bdf                4889c3                  MOVQ AX, BX
  main.go:13            0x4a3be2                488d052f260400          LEAQ go:itab.*os.File,io.Writer(SB), AX
  main.go:13            0x4a3be9                e89259ffff              CALL runtime/pprof.StartCPUProfile(SB)
  main.go:14            0x4a3bee                488d15637f0200          LEAQ 0x27f63(IP), DX
  main.go:14            0x4a3bf5                4889542450              MOVQ DX, 0x50(SP)
  main.go:14            0x4a3bfa                c644242703              MOVB $0x3, 0x27(SP)
  main.go:14            0x4a3bff                31c0                    XORL AX, AX
  main.go:16            0x4a3c01                eb47                    JMP 0x4a3c4a
  main.go:16            0x4a3c03                488d5001                LEAQ 0x1(AX), DX
  main.go:16            0x4a3c07                4889542428              MOVQ DX, 0x28(SP)
  main.go:17            0x4a3c0c                90                      NOPL
  main.go:22            0x4a3c0d                440f117c2440            MOVUPS X15, 0x40(SP)
  main.go:22            0x4a3c13                488d35868a0000          LEAQ 0x8a86(IP), SI
  main.go:22            0x4a3c1a                4889742440              MOVQ SI, 0x40(SP)
  main.go:22            0x4a3c1f                488d3dea200400          LEAQ 0x420ea(IP), DI
  main.go:22            0x4a3c26                48897c2448              MOVQ DI, 0x48(SP)
  main.go:22            0x4a3c2b                488d442440              LEAQ 0x40(SP), AX
  main.go:22            0x4a3c30                bb01000000              MOVL $0x1, BX
  main.go:22            0x4a3c35                4889d9                  MOVQ BX, CX
  main.go:22            0x4a3c38                e8234dfeff              CALL fmt.Println(SB)
  main.go:22            0x4a3c3d                0f1f00                  NOPL 0(AX)
  main.go:23            0x4a3c40                e8bb000000              CALL main.Func2(SB)
  main.go:16            0x4a3c45                488b442428              MOVQ 0x28(SP), AX
  main.go:16            0x4a3c4a                483d40420f00            CMPQ AX, $0xf4240
  main.go:16            0x4a3c50                7eb1                    JLE 0x4a3c03
  main.go:19            0x4a3c52                c644242701              MOVB $0x1, 0x27(SP)
  main.go:19            0x4a3c57                e8045dffff              CALL runtime/pprof.StopCPUProfile(SB)
  main.go:19            0x4a3c5c                c644242700              MOVB $0x0, 0x27(SP)
  main.go:19            0x4a3c61                488b542458              MOVQ 0x58(SP), DX
  main.go:19            0x4a3c66                488b02                  MOVQ 0(DX), AX
  main.go:19            0x4a3c69                ffd0                    CALL AX
  main.go:19            0x4a3c6b                4883c460                ADDQ $0x60, SP
  main.go:19            0x4a3c6f                5d                      POPQ BP
  main.go:19            0x4a3c70                c3                      RET
  main.go:19            0x4a3c71                e80acdf8ff              CALL runtime.deferreturn(SB)
  main.go:19            0x4a3c76                4883c460                ADDQ $0x60, SP
  main.go:19            0x4a3c7a                5d                      POPQ BP
  main.go:19            0x4a3c7b                c3                      RET
  main.go:9             0x4a3c7c                0f1f4000                NOPL 0(AX)
  main.go:9             0x4a3c80                e83b9afbff              CALL runtime.morestack_noctxt.abi0(SB)
  main.go:9             0x4a3c85                e9f6feffff              JMP main.main(SB)

In line #32 in before_pgo it’s calling Func1 but in after_pgo it’s being inlined.

A more realistic example (API)

Now that we’ve got a grasp of what’s going on, let’s do a more realistic example; an http endpoint which returns a simple json. I’m going to use github.com/json-iterator/go which I found out about a couple days ago in one of my endless github explorations!

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof"
	"os"

	jsoniter "github.com/json-iterator/go"
)

type Data struct {
	Quiz struct {
		Sport struct {
			Q1 struct {
				Question string   `json:"question"`
				Options  []string `json:"options"`
				Answer   string   `json:"answer"`
			} `json:"q1"`
		} `json:"sport"`
		Maths struct {
			Q1 struct {
				Question string   `json:"question"`
				Options  []string `json:"options"`
				Answer   string   `json:"answer"`
			} `json:"q1"`
			Q2 struct {
				Question string   `json:"question"`
				Options  []string `json:"options"`
				Answer   string   `json:"answer"`
			} `json:"q2"`
		} `json:"maths"`
	} `json:"quiz"`
}

func main() {
	http.HandleFunc("/", handler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}

func handler(w http.ResponseWriter, r *http.Request) {
	var json = jsoniter.ConfigCompatibleWithStandardLibrary
	file, _ := os.ReadFile("./data.json")
	data := Data{}
	json.Unmarshal(file, &data)
	d, _ := json.Marshal(data)
	w.Write(d)
}

It’s a simple endpoint that reads a .json file and unmarshal and marshal (we do this to create some fake overhead to this endpoint) again and then write it to response. I also Added net/http/pprof to be able to collect profile data.

Now we can go build -o before_pgo ./main.go and then run it by ./before_pgo. While it’s up we create traffic with go-wrk -d 20 http://localhost:8080 (Check out github.com/tsliwowicz/go-wrk). And while the traffic is going through, we can collect the profile data for 10 seconds with curl -o default.pgo "http://localhost:8080/debug/pprof/profile?seconds=10"

Result of go-wrk command should be something like this:

1
2
3
4
5
6
7
8
9
Running 20s test @ http://localhost:8080
  10 goroutine(s) running concurrently
747453 requests in 19.582312899s, 322.20MB read
Requests/sec:           38169.80
Transfer/sec:           16.45MB
Avg Req Time:           261.987µs
Fastest Request:        33.156µs
Slowest Request:        15.862181ms
Number of Errors:       0

Now let’s recompile our code with the profile data that we have and then run the benchmark again and compare the results. With go build -o after_pgo ./main.go and checking the files that we have, we can see that the size of the binary is already 200KB bigger and this means that probably lots of inlining should’ve happened there!

1
2
3
4
5
-rwxr-xr-x 1 yahya yahya 8.3M Aug 22 17:35 before_pgo
-rw-r--r-- 1 yahya yahya  60K Aug 22 17:38 default.pgo
-rw-r--r-- 1 yahya yahya 1017 Aug 22 17:47 main.go
drwxr-xr-x 2 yahya yahya 4.0K Aug 22 18:21 .
-rwxr-xr-x 1 yahya yahya 8.5M Aug 22 18:21 after_pgo

And this is the go-wrk result.

1
2
3
4
5
6
7
8
9
Running 20s test @ http://localhost:8080
  10 goroutine(s) running concurrently
821349 requests in 19.574756327s, 354.05MB read
Requests/sec:           41959.60
Transfer/sec:           18.09MB
Avg Req Time:           238.324µs
Fastest Request:        28.559µs
Slowest Request:        12.996215ms
Number of Errors:       0

Not bad! Increased number of requests that we can handle by almost 4,000.

If you’re interested about how compiler does these magics, I recommend you to check out these first commits in go project:

https://github.com/golang/go/commit/99862cd57dd9406180f238e8fa2fcc1d435fb163.

It’s the first commit of PGO implementation in the Go compiler and only by scrolling you can get a grasp of how it’s working and how it’s checking for hot nodes in AST.

Automation & Workflow

For profile data to be completely accurate and useful, it needs to come from production with real traffic. But it’s hard to keep an updated profile and constantly replace it. It also could easily become stale or even if you have a big refactor with big changes, your previous profile data from production isn’t useful to compile your code with.

With less than 5% overhead, It’s safe to use profiler in your production. Even major cloud providers like GCP offer solutions to collect your profile data from their native cloud products with a nice dashboard to see the results. It’s also very helpful in cases that you’ve goroutine/memory leak or trying to debug any performance related issue that is only happening in production and hard to reproduce locally!

With that out of the way (having profiler as a default in your code), Automating your builds to use profile data and, as the design proposal offers: having it in your git history as a file (default.pgo) could be an overkill. But if you really care about performance or/and cost, and you’ve exhausted all the other improvements, this could be your chance to make it perfect!

Code

You can find all the codes that we wrote in this post and the related files and benchmarks in this github repo: github.com/theyahya/go-pgo-playground

If you like these kind of topics, join my 📮 Newsletter to receive similar articles every once in a while, there won’t be any spams! promise!

Still interested and curious about the PGO and how it works and you want to follow the rabbit hole? I recommend the following links:

  1. PGO related commits in go compiler: https://github.com/golang/go/commits/master/src/cmd/compile/internal/pgo
  2. AutoFDO: Automatic Feedback-Directed Optimization for Warehouse-Scale Applications
  3. Proposal: profile-guided optimization
  4. Profile-guided optimizations