Strings, bytes, runes and characters in Go

1. The `string` concept

The Go standard library package builtin documents all of the predeclared types. Its source lives in src/builtin/builtin.go, which describes string as follows:

// string is the set of all strings of 8-bit bytes, conventionally but not
// necessarily representing UTF-8-encoded text. A string may be empty, but
// not nil. Values of string type are immutable.
type string string

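From this we can learn the following key points:

A string is a sequence of 8-bit bytes that usually, but not necessarily, holds UTF-8 encoded text. In other words, a string may contain arbitrary bytes, including sequences that are not valid UTF-8.
The zero value of a string variable is "", not nil.
Values of type string are immutable.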

In Go, a string is in effect a read-only slice of bytes.
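
As a small sketch of the three points above, the program below checks the zero value, compares a valid UTF-8 string with one built from arbitrary bytes, and notes the immutability rule in a comment. The helper name isValidText is made up for this illustration; utf8.ValidString is the standard library call doing the work.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isValidText reports whether s holds well-formed UTF-8.
// (Hypothetical helper name, used only in this example.)
func isValidText(s string) bool {
	return utf8.ValidString(s)
}

func main() {
	var zero string                   // the zero value is "", never nil
	text := "世界"                      // valid UTF-8: two runes, six bytes
	raw := string([]byte{0xff, 0xfe}) // arbitrary bytes, not valid UTF-8

	fmt.Println(zero == "", len(zero))        // true 0
	fmt.Println(len(text), isValidText(text)) // 6 true
	fmt.Println(len(raw), isValidText(raw))   // 2 false

	// Strings are immutable: uncommenting the next line is a compile error.
	// text[0] = 'x'
}
```

Note that len always counts bytes, regardless of whether those bytes form valid text.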

We can traverse a string either by index or with for range. The two behave differently. Consider a simple example first:

package main

import "fmt"

func main() {
	s := "hello,世界"
	//fmt.Printf("len:%d\n",len(s))
	for i := 0; i < len(s); i++ { // index loop: one byte per iteration
		fmt.Printf("index=[%d], x=[%x], c=[%c]\n", i, s[i], s[i])
	}
	for i, v := range s { // for range loop: one rune per iteration
		fmt.Printf("index=[%d], Unicode=[%U], c=[%c]\n", i, v, v)
	}
}

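The output is as follows (index loop on the left, for range loop on the right, placed side by side for comparison):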

index=[0], x=[68], c=[h]   |  index=[0], Unicode=[U+0068], c=[h]
index=[1], x=[65], c=[e]   |  index=[1], Unicode=[U+0065], c=[e]
index=[2], x=[6c], c=[l]   |  index=[2], Unicode=[U+006C], c=[l]
index=[3], x=[6c], c=[l]   |  index=[3], Unicode=[U+006C], c=[l]
index=[4], x=[6f], c=[o]   |  index=[4], Unicode=[U+006F], c=[o]
index=[5], x=[2c], c=[,]   |  index=[5], Unicode=[U+002C], c=[,]
index=[6], x=[e4], c=[ä]   |  index=[6], Unicode=[U+4E16], c=[世]
index=[7], x=[b8], c=[¸]   |  
index=[8], x=[96], c=[]   | 
index=[9], x=[e7], c=[ç]   |  index=[9], Unicode=[U+754C], c=[界]
index=[10], x=[95], c=[]  |
index=[11], x=[8c], c=[]  | 

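As we can see, for plain English characters the index loop and the range loop give the same result. This is because an English letter occupies a single byte, and its Unicode code point equals its ASCII code. A Chinese character, by contrast, occupies 3 bytes in UTF-8.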

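We need to be clear that an index loop yields individual bytes, not characters. For English letters one character happens to occupy exactly one byte, so the result is meaningful. For non-English characters such as Chinese, a single byte cannot represent the whole character, so printing one byte at a time produces garbage.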

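The range loop, by contrast, prints exactly what we think of as characters, whether English or Chinese. In Go a character is called a rune. Why it is called rune rather than character is explained in the official blog post Strings, bytes, runes and characters in Go: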

We’ve been very careful so far in how we use the words “byte” and “character”. That’s partly because strings hold bytes, and partly because the idea of “character” is a little hard to define. The Unicode standard uses the term “code point” to refer to the item represented by a single value.

“Code point” is a bit of a mouthful, so Go introduces a shorter term for the concept: rune. The term appears in the libraries and source code, and means exactly the same as “code point”, with one interesting addition.

The Go language defines the word rune as an alias for the type int32, so programs can be clear when an integer value represents a code point. Moreover, what you might think of as a character constant is called a rune constant in Go.

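In summary, a rune is essentially a Unicode code point. For more on Unicode, UTF-8 and related topics, see the post 对字符编码的简单理解 (A Simple Understanding of Character Encoding).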
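
To make the byte/rune distinction concrete, here is a minimal sketch using the standard unicode/utf8 package. The helper byteAndRuneLen is a name invented for this example; it simply pairs len with utf8.RuneCountInString.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAndRuneLen returns the length of s in bytes and in runes (code points).
// (Hypothetical helper name, used only in this example.)
func byteAndRuneLen(s string) (int, int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	s := "hello,世界"
	nb, nr := byteAndRuneLen(s)
	fmt.Println(nb, nr) // 12 8

	// Decode the rune that starts at byte offset 6.
	r, size := utf8.DecodeRuneInString(s[6:])
	fmt.Printf("%c %U %d\n", r, r, size) // 世 U+4E16 3

	// rune is an alias for int32, so no conversion is needed here.
	var code int32 = r
	fmt.Println(code) // 19990
}
```

Converting with []rune(s) gives the same code points as a slice, at the cost of an allocation.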

2. The `string` data structure

The string data structure is defined in src/runtime/string.go, as shown below:

type stringStruct struct {
	str unsafe.Pointer  // address of the first byte of the string
	len int             // length of the string in bytes
}

It looks very similar to the struct that backs a slice, which is defined as follows:

type slice struct {
	array unsafe.Pointer // pointer to the underlying array
	len   int            // number of elements
	cap   int            // capacity of the slice
}
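
One way to observe the difference between the two headers from user code is to compare their sizes in machine words. This is only a sketch: the function name headerWords is invented here, and the 2 and 3 word results assume the standard gc toolchain, whose layout matches the runtime structs quoted above.

```go
package main

import (
	"fmt"
	"unsafe"
)

// headerWords returns how many machine words the string header and the
// []byte header occupy. (Hypothetical helper name for this example.)
func headerWords() (strWords, sliceWords uintptr) {
	var s string
	var b []byte
	word := unsafe.Sizeof(uintptr(0))
	return unsafe.Sizeof(s) / word, unsafe.Sizeof(b) / word
}

func main() {
	sw, bw := headerWords()
	fmt.Println(sw, bw) // 2 3 (pointer+len versus pointer+len+cap)
}
```

The missing cap field is consistent with strings being immutable: there is never spare room to grow into.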

3. `string` operations and their internal implementation

3.1. Declaration

As the code below shows, we can declare a string variable and assign it an initial value:

var s string
s = "你好, world"
fmt.Printf("s=%v, type(s)=%T\n", s, s) // s=你好, world, type(s)=string

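A string is built by first constructing a stringStruct from the address of its bytes and then converting that struct to a string. The source for the conversion is: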

func gostringnocopy(str *byte) string { // build a string from the address of its bytes
	ss := stringStruct{str: unsafe.Pointer(str), len: findnull(str)} // first construct a stringStruct
	s := *(*string)(unsafe.Pointer(&ss))  // then reinterpret the stringStruct as a string
	return s
}

Inside the runtime package a string is a stringStruct. To the outside world it is presented as string.

3.2. Conversion

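We can easily convert a []byte or []rune to a string, and a string to a []byte or []rune. Such conversions are very common, but we need to know that, in general, every one of them copies the data into newly obtained memory (as the runtime source below shows, a small result that does not escape can use a temporary buffer instead of a heap allocation). When a program performs a large number of these conversions, keep this in mind, because it can drag down performance.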

Conversion example:

package main

import (
	"fmt"
	"reflect"
	"unsafe"
)

// helpPrintf prints the first word of the header that ptr points to.
// For both strings and slices that word is the address of the underlying data.
func helpPrintf(format string, ptr interface{}) {
	p := reflect.ValueOf(ptr).Pointer()
	h := (*uintptr)(unsafe.Pointer(p))
	fmt.Printf(format, *h)
}

func main() {
	s := "hello, world"
	helpPrintf("s: %x\n", &s)
	//fmt.Printf("s: %x\n", &s)

	bytes := []byte(s)
	s2 := string(bytes)
	helpPrintf("string to []byte, bytes: %x\n", &bytes)
	//fmt.Printf("string to []byte, bytes: %x\n", &bytes)
	helpPrintf("[]byte to string, s2: %x\n", &s2)
	//fmt.Printf("[]byte to string, s2: %x\n", &s2)

	runes := []rune(s)
	s3 := string(runes)
	helpPrintf("string to []rune, runes: %x\n", &runes)
	helpPrintf("[]rune to string, s3: %x\n", &s3)
}

// Output (addresses vary between runs):
// s: 10cf820
// string to []byte, bytes: c0000b4030
// []byte to string, s2: c0000b4040
// string to []rune, runes: c0000bc000
// []rune to string, s3: c0000b4060

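As shown, the addresses before and after each conversion differ. In other words, the conversion obtained new memory. The internal implementation is as follows (only the conversions between []byte and string are shown here; the []rune conversions follow the same allocate-and-copy pattern, with UTF-8 encoding or decoding done during the copy):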

// Buf is a fixed-size buffer for the result,
// it is not nil if the result does not escape.
func slicebytetostring(buf *tmpBuf, b []byte) (str string) {
	...

	var p unsafe.Pointer
	if buf != nil && len(b) <= len(buf) {
		p = unsafe.Pointer(buf)
	} else {
		p = mallocgc(uintptr(len(b)), nil, false)
	}
	stringStructOf(&str).str = p
	stringStructOf(&str).len = len(b)
	memmove(p, (*(*slice)(unsafe.Pointer(&b))).array, uintptr(len(b)))
	return
}

func stringtoslicebyte(buf *tmpBuf, s string) []byte {
	var b []byte
	if buf != nil && len(s) <= len(buf) {
		*buf = tmpBuf{}
		b = buf[:len(s)]
	} else {
		b = rawbyteslice(len(s))
	}
	copy(b, s)
	return b
}
// rawbyteslice allocates a new byte slice. The byte slice is not zeroed.
func rawbyteslice(size int) (b []byte) {
	cap := roundupsize(uintptr(size))
	p := mallocgc(cap, nil, false)
	if cap != uintptr(size) {
		memclrNoHeapPointers(add(p, uintptr(size)), cap-uintptr(size))
	}

	*(*slice)(unsafe.Pointer(&b)) = slice{p, size, int(cap)}
	return
}

Put simply, a conversion takes two steps:

Obtain memory for the result
Copy the data
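
A quick way to see the copy from ordinary Go code is to mutate the converted slice and check that the original string is unaffected. The function name copyOnConvert is made up for this sketch.

```go
package main

import "fmt"

// copyOnConvert converts s to []byte, changes the first byte of the copy,
// and returns the untouched original next to the modified version.
// (Hypothetical helper name, used only in this example.)
func copyOnConvert(s string) (original, modified string) {
	b := []byte(s) // new memory, data copied from s
	if len(b) > 0 {
		b[0] = 'H' // writes to the copy only
	}
	return s, string(b) // converting back copies once more
}

func main() {
	orig, changed := copyOnConvert("hello")
	fmt.Println(orig, changed) // hello Hello
}
```

If the conversion shared memory with the string, the first result would have changed too.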

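PS. The following part is taken from 《Go语言学习笔记》 (Go Language Study Notes), p. 90. I do not fully understand it yet, so I am recording it here for now.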

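Sometimes conversions hurt performance, and we can improve matters with a custom approach. That approach is "unsafe", however, and calls for caution. It exploits the fact that the []byte and string headers are "partly identical" and uses an unsafe pointer cast to "change the type", which avoids copying the underlying array. Because the resulting string shares memory with the slice, the slice must not be modified afterwards. Similar tricks can be seen in many web frameworks.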

func toString(bytes []byte) string {
	return *(*string)(unsafe.Pointer(&bytes))
}

func main() {
	bytes := []byte("hello, world")
	s := toString(bytes)
	helpPrintf("bytes: %x\n", &bytes)
	helpPrintf("s2: %x\n", &s)
}

// Output (addresses vary between runs):
// bytes: c0000b4004
// s2: c0000b4004
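
For comparison, the same zero-copy idea can be sketched with the helpers unsafe.String and unsafe.SliceData, which exist in the unsafe package from Go 1.20 onward. This is not from the book cited above. The function name bytesToString is invented here, and the sketch assumes a Go 1.20 or newer toolchain. It also demonstrates why the trick is dangerous.

```go
package main

import (
	"fmt"
	"unsafe"
)

// bytesToString returns a string that shares memory with b (no copy).
// Assumptions: Go 1.20+, and the caller never modifies b afterwards.
// (Hypothetical helper name, used only in this example.)
func bytesToString(b []byte) string {
	if len(b) == 0 {
		return ""
	}
	return unsafe.String(unsafe.SliceData(b), len(b))
}

func main() {
	b := []byte("hello, world")
	s := bytesToString(b)
	fmt.Println(s) // hello, world

	// Danger: the supposedly immutable string changes with the slice.
	b[0] = 'H'
	fmt.Println(s) // Hello, world
}
```

The second print shows the aliasing: any later write to the slice silently changes the string, which breaks the immutability that other code relies on.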

3.3. Concatenation

In Go we can conveniently concatenate strings with "+", as follows:

var s string
s = "hello" + "world" + "hi" + "zju"

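Constant operands like the ones above are folded by the compiler. For operands known only at run time, the internal implementation is as follows (note: to keep the explanation simple, some code is omitted and only the core logic is kept):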

// concatstrings implements a Go string concatenation x+y+z+...
// The operands are passed in the slice a.
// If buf != nil, the compiler has determined that the result does not
// escape the calling function, so the string data can be stored in buf
// if small enough.
func concatstrings(buf *tmpBuf, a []string) string {
	l := 0 // l accumulates the total length of the result
	...
	for i, x := range a {
		n := len(x)
		l += n
	}

	...
	s, b := rawstringtmp(buf, l) // get a string of the given size, plus a slice that shares its memory
	for _, x := range a {
		copy(b, x) // a string cannot be modified, so write through the slice
		b = b[len(x):]
	}
	return s
}

func rawstringtmp(buf *tmpBuf, l int) (s string, b []byte) {
	if buf != nil && l <= len(buf) {
		b = buf[:l]
		s = slicebytetostringtmp(b)
	} else {
		s, b = rawstring(l)
	}
	return
}

// rawstring allocates storage for a new string. The returned
// string and byte slice both refer to the same storage.
// The storage is not zeroed. Callers should use
// b to set the string contents and then drop b.
func rawstring(size int) (s string, b []byte) {
	p := mallocgc(uintptr(size), nil, false)

	stringStructOf(&s).str = p
	stringStructOf(&s).len = size

	*(*slice)(unsafe.Pointer(&b)) = slice{p, size, size}

	return
}

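Because a string cannot be modified directly, rawstring() is used here to initialize a string of the given size and to return a slice along with it. The two share the same block of memory, so copying data into the slice indirectly modifies the string.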

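From the source above we can see that the operands of a single concatenation expression are gathered into a slice by the compiler. Concatenation then traverses that slice twice: the first pass computes the total length so that memory can be allocated, and the second pass copies the strings over one by one. The memory for the new string is allocated in one go, so the cost lies mainly in copying the data. When a great many strings need to be concatenated, avoid appending one string at a time and concatenate them together instead.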

// good style
var s string
s = "str1" + "str2" + "str3" + ...

// bad style
var s string
for _, str := range strs {
  s += str // appending one string at a time means allocating and copying over and over
}
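
The two-pass strategy described above can be imitated in ordinary Go. The sketch below is not the runtime code: the function name concat is invented, and because safe Go cannot write into a string's memory, it pays one extra copy in the final conversion that the runtime avoids.

```go
package main

import "fmt"

// concat measures first, then allocates once and copies each operand.
// (Hypothetical helper name, used only in this example.)
func concat(parts ...string) string {
	total := 0
	for _, p := range parts { // pass 1: compute the total length
		total += len(p)
	}
	buf := make([]byte, total) // single allocation for the payload
	rest := buf
	for _, p := range parts { // pass 2: copy each operand into place
		copy(rest, p)
		rest = rest[len(p):]
	}
	return string(buf) // extra copy here; the runtime shares the memory instead
}

func main() {
	fmt.Println(concat("hello", "world", "hi", "zju")) // helloworldhizju
}
```

The important property is that the number of allocations does not grow with the number of operands.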

Of course, it is not always possible to concatenate in the plain "str1" + "str2" + "str3" + ... form. In that case we can use strings.Join or bytes.Buffer/strings.Builder, which have a similar effect.

The examples below verify how the different concatenation approaches affect performance.

func test() string {
	var s string
	for i := 0; i < 1000; i++ {
		s += "aaa"
	}
	return s
}

func BenchmarkTest(b *testing.B) {
	for i := 0; i < b.N; i++ {
		test()
	}
}

Run the following command and observe the result:

$ go test -bench . -benchmem -gcflags "-N -l" # disable inlining and optimizations so the result is easier to observe

BenchmarkTest-12            5474            221588 ns/op         1602964 B/op        999 allocs/op
PASS
ok      test/tmp        3.494s

The report shows that each call to test() takes 221588 ns, allocates 1602964 bytes, and performs 999 memory allocations. (For how to run benchmarks, see here.)

Now we change test() to use strings.Join and run the benchmark again:

func test() string {
	s := make([]string, 1000)
	for i := 0; i < 1000; i++ {
		s[i] = "aaa"
	}
	return strings.Join(s,"")
}

The result is as follows:

BenchmarkTest-12           95946             11591 ns/op            3072 B/op          1 allocs/op

Clearly, both the execution time and the memory allocation figures improve enormously.

Using bytes.Buffer or strings.Builder has a similar effect, as shown below:

func test() string {
	sb := strings.Builder{}
	sb.Grow(3 * 1000) // reserve room for all 1000 three-byte pieces up front
	for i := 0; i < 1000; i++ {
		sb.WriteString("aaa")
	}
	return sb.String()
}

func test2() string {
	var buf bytes.Buffer
	buf.Grow(3 * 1000) // reserve room for all 1000 three-byte pieces up front
	for i := 0; i < 1000; i++ {
		buf.WriteString("aaa")
	}
	return buf.String()
}
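
As a rough cross-check that needs no benchmark file, allocation counts can be compared with testing.AllocsPerRun from a normal program. This is a sketch under assumptions: the names pieces, plusEquals and withBuilder are invented here, and the exact counts depend on the Go version, so only the comparison is printed as a fixed result.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

const pieces = 1000 // number of "aaa" fragments to join (illustrative value)

// plusEquals appends one piece per iteration, building a new string each time.
func plusEquals() string {
	var s string
	for i := 0; i < pieces; i++ {
		s += "aaa"
	}
	return s
}

// withBuilder reserves the full size up front and appends in place.
func withBuilder() string {
	var sb strings.Builder
	sb.Grow(3 * pieces)
	for i := 0; i < pieces; i++ {
		sb.WriteString("aaa")
	}
	return sb.String()
}

func main() {
	slow := testing.AllocsPerRun(10, func() { _ = plusEquals() })
	fast := testing.AllocsPerRun(10, func() { _ = withBuilder() })
	fmt.Printf("+=: %.0f allocs/run, Builder: %.0f allocs/run\n", slow, fast)
	fmt.Println(slow > fast) // true
}
```

Both functions return the same 3000-byte string; only the number of intermediate allocations differs.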
