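Strings are straightforward in many languages, but Rust handles them differently from most.

Introduction

Consider the following code, for example.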

pub fn greet(name: String) {
    println!("Hello {}!", name);
}

fn main() {
    greet("World");
}

Compiling it produces the following error.

error[E0308]: mismatched types
 --> src/main.rs:6:11
  |
6 |     greet("World");
  |     ----- ^^^^^^^- help: try using a conversion method: `.to_string()`
  |     |     |
  |     |     expected `String`, found `&str`
  |     arguments to this function are incorrect
  |
note: function defined here
 --> src/main.rs:1:8
  |
1 | pub fn greet(name: String) {
  |        ^^^^^ ------------

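Put simply, the argument has the wrong type: greet expects a String, but the literal "World" is a &str. The figure below illustrates the different string representations in Rust.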

[Figure: Rust strings]
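Two common ways to resolve the mismatch are sketched here. The names greet_owned and greet_borrowed are hypothetical variants of greet, and they return the formatted text instead of printing it only so that the result is easy to check.

```rust
// Option 1: keep the String parameter and convert at the call site.
pub fn greet_owned(name: String) -> String {
    format!("Hello {}!", name)
}

// Option 2: borrow a &str, which accepts literals as well as &String.
pub fn greet_borrowed(name: &str) -> String {
    format!("Hello {}!", name)
}

fn main() {
    // The conversion the compiler suggested.
    println!("{}", greet_owned("World".to_string()));

    // A literal is already a &str, so no conversion is needed.
    println!("{}", greet_borrowed("World"));

    // A &String is coerced to &str automatically (deref coercion).
    let owned = String::from("World");
    println!("{}", greet_borrowed(&owned));
}
```

Taking &str is usually the more flexible choice when the function only needs to read the text, since callers are not forced to allocate.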

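Strings in Detail

A Rust char is a Unicode scalar value and always occupies 4 bytes in memory, whereas strings are UTF-8 encoded, so each character in a string takes 1 to 4 bytes.

For more on encodings, see the article 字符编码详解 (Character Encoding in Detail).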

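Characters

A char takes 1 to 4 bytes when encoded as UTF-8, but in memory it is stored as a 4-byte value with the same layout as a u32, which makes conversion between char and u32 straightforward. Casting a char to u32 always succeeds. The reverse direction has to check that the value is a valid Unicode scalar value, which is why char::from_u32 returns an Option. Note also that the single-byte UTF-8 form covers only the 7-bit ASCII range.

Besides typing a character directly, you can use backslash escapes: \x52 is the ASCII escape for R, and \u{211D} is an example of a Unicode escape.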

fn main() {
    let c = '🌐'; // \u{1f310}
    println!("{} {} {}", 'a'.len_utf8(), '你'.len_utf8(), '🌐'.len_utf8());
    println!("{} {} {:x}", c, char::from_u32(0x1f310).unwrap(), c as u32);
}
// Output:
// 1 3 4
// 🌐 🌐 1f310
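The points about size and validity can be checked with a short sketch like the following. It uses only standard library calls; the specific code points are picked purely for illustration.

```rust
use std::mem::size_of;

fn main() {
    // A char always occupies 4 bytes in memory, whatever its UTF-8 length.
    println!("{}", size_of::<char>());

    // char -> u32 is a plain cast and cannot fail.
    println!("{:#x}", 'R' as u32);

    // u32 -> char is checked: surrogates and values above U+10FFFF are rejected.
    println!("{:?}", char::from_u32(0x211D));
    println!("{:?}", char::from_u32(0xD800));
    println!("{:?}", char::from_u32(0x110000));

    // The ASCII escape \x52 and the literal 'R' are the same char.
    println!("{}", '\x52' == 'R');
}
```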

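Encoding

The bytes held by a String must always be valid UTF-8. They live on the heap and are not null-terminated. If the text contains many characters that would otherwise need escaping, a raw string is more convenient.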

fn main() {
    // In a raw string, backslashes are kept literally.
    let r0 = r"Escapes don't work: \x52\x75\x73\x74, \u{211D}";
    println!("{}", r0);
    // Add # delimiters when the raw string itself contains double quotes.
    let r1 = r#"Escapes don't work: \x52\x75\x73\x74, "\u{211D}""#;
    println!("{}", r1);
}
// Output:
// Escapes don't work: \x52\x75\x73\x74, \u{211D}
// Escapes don't work: \x52\x75\x73\x74, "\u{211D}"

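If you need data that is not UTF-8, use a byte string. Byte strings support byte escapes such as \x52 but not \u{...} Unicode escapes, and converting one to UTF-8 may fail.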

fn main() {
    let b0: &[u8; 18] = b"I'm writing \"\x52\x75\x73\x74\"";
    println!("{:?}", b0); // Byte arrays do not implement Display, so use {:?}.
    let b1 = br"Escapes don't work: \u{211D}";
    if let Ok(s) = std::str::from_utf8(b1) {
        println!("{}", s);
    }
}
// Output:
// [73, 39, 109, 32, 119, 114, 105, 116, 105, 110, 103, 32, 34, 82, 117, 115, 116, 34]
// Escapes don't work: \u{211D}
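To see the failure case mentioned above, here is a minimal sketch with a byte string that is deliberately not valid UTF-8. The input bytes are made up for illustration; the calls used are std::str::from_utf8 and String::from_utf8_lossy from the standard library.

```rust
fn main() {
    // 0xff can never appear in well-formed UTF-8.
    let bad: &[u8] = b"caf\xff";

    // Strict conversion reports how many leading bytes were valid.
    match std::str::from_utf8(bad) {
        Ok(s) => println!("valid: {}", s),
        Err(e) => println!("invalid after {} valid bytes", e.valid_up_to()),
    }

    // Lossy conversion replaces each invalid sequence with U+FFFD.
    let lossy = String::from_utf8_lossy(bad);
    println!("{}", lossy);
}
```

Whether to fail or to substitute the replacement character depends on the application; the strict form is the safer default when the bytes come from an untrusted source.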

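Chinese characters and emoji can be inspected as follows. A common Chinese character takes 3 bytes in UTF-8, while an emoji usually takes 4. Note that a Unicode code point has to be written as a \u{...} escape before it can be used in a string literal.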

fn main() {
    let s1 = "你好🌐";
    println!("{:?}", s1.as_bytes());
    let s2 = "\u{4f60}\u{597d}\u{1f310}";
    println!("{:?}", s2.as_bytes());
    println!("{}", s2);
}
// Output:
// [228, 189, 160, 229, 165, 189, 240, 159, 140, 144]
// [228, 189, 160, 229, 165, 189, 240, 159, 140, 144]
// 你好🌐

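Both forms produce the same bytes.

Character Streams vs. Byte Streams

A string can be viewed as bytes with as_bytes(), and its characters can be iterated over. The commonly used methods are shown below.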

fn main() {
    let strs = "你好🌐";

    let bytes: &[u8] = strs.as_bytes();
    println!("{:?}", bytes);
    println!("{}", String::from_utf8(bytes.to_vec()).unwrap());
    println!("{}", String::from_utf8(Vec::from(bytes)).unwrap());

    for (idx, c) in strs.char_indices() {
        println!("idx={} char={}", idx, c);
    }

    let mut iter = strs.chars();
    while let Some(c) = iter.next() {
        println!("char={}", c);
    }

    for c in strs.chars() {
        println!("char={}", c);
    }
}
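Because bytes and characters are counted differently, indexing deserves some care. The sketch below reuses the same sample string and relies only on standard str methods; it shows that len() counts bytes, that char_indices() yields byte offsets, and that get() refuses a range that would split a character.

```rust
fn main() {
    let strs = "你好🌐";

    // len() is the byte length, not the number of characters.
    println!("{}", strs.len());
    println!("{}", strs.chars().count());

    // The indices from char_indices() are byte offsets of each character.
    let offsets: Vec<usize> = strs.char_indices().map(|(i, _)| i).collect();
    println!("{:?}", offsets);

    // get() returns None when the range does not fall on character boundaries.
    println!("{:?}", strs.get(0..3));
    println!("{:?}", strs.get(0..1));
}
```

Slicing with &strs[0..1] would panic in the same situation, so get() is the gentler option when the offsets are not known to be valid.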

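String vs. str

At the language level, Rust has only one string type, str, which is normally used in its borrowed form &str, a string slice. A string literal is a &'static str whose bytes are embedded in the binary at compile time and cannot be modified. A &str can also borrow from a String or from any other valid UTF-8 buffer.

For text that needs to be modified, the standard library provides the String type, which is also the one used most often. A String can be created from a &str as follows.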

String::from("Hello World!");
"Hello World!".to_string();

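Going from String to &str is just as simple: take a reference. Any of &s, &s[..], or s.as_str() works. Strictly speaking, &s has type &String, which the compiler coerces to &str wherever a &str is expected.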
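As a small illustration of the two directions, the sketch below builds a String, mutates it, and then borrows it as &str in the three ways listed. The helper shout is a hypothetical function added only to show a &str parameter accepting a &String.

```rust
// Hypothetical helper: takes a borrowed string slice and returns a new String.
fn shout(text: &str) -> String {
    text.to_uppercase()
}

fn main() {
    // A String owns its buffer and can grow.
    let mut s = String::from("Hello");
    s.push_str(" World");
    s.push('!');

    // Three equivalent ways to borrow the String as a &str.
    let a: &str = &s;
    let b: &str = &s[..];
    let c: &str = s.as_str();
    println!("{} | {} | {}", a, b, c);

    // &String is coerced to &str at the call site.
    println!("{}", shout(&s));
}
```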

Vec

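Vec is Rust's growable array. It can be created with Vec::new(), Vec::with_capacity(8), vec![1, 2], or vec![0; 5]. Vec::new() does not allocate; memory is requested only when the first element is written, and the vector grows automatically when it runs out of capacity.

The form most often used together with strings is Vec<u8>.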
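A brief sketch of these constructors, and of Vec<u8> moving to and from String, might look like this. The values are arbitrary; only standard library methods are used.

```rust
fn main() {
    // Vec::new() starts with zero capacity and allocates on first push.
    let mut lazy: Vec<u8> = Vec::new();
    println!("{}", lazy.capacity());
    lazy.push(b'R');
    println!("{}", lazy.capacity() >= 1);

    // with_capacity reserves space up front; the length is still zero.
    let reserved: Vec<u8> = Vec::with_capacity(8);
    println!("{} {}", reserved.len(), reserved.capacity() >= 8);

    // vec![value; n] repeats a value n times.
    let zeros = vec![0u8; 5];
    println!("{:?}", zeros);

    // Vec<u8> -> String is checked for UTF-8; String -> Vec<u8> is free.
    let text = String::from_utf8(vec![82, 117, 115, 116]).unwrap();
    println!("{}", text);
    let bytes: Vec<u8> = text.into_bytes();
    println!("{:?}", bytes);
}
```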

Miscellaneous

clone vs. to_owned

When a copy is needed, you call clone() or to_owned(). The two methods are related but not identical.

Source type	clone()	to_owned()
`T`	`T` => `T`	`T` => `T`
`&T`	`&T` => `&T`	`&T` => `T`

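The example below applies both methods to strings. In most cases to_owned() is implemented by calling clone() internally.

Every shared (immutable) reference implements both Copy and Clone. Copying or cloning a reference gives you another reference, not ownership. to_owned(), by contrast, goes through the &T to call clone() on T itself and returns a copy of the underlying data. One caveat: method resolution prefers T's own clone(), so calling clone() on a &T where T implements Clone (for example &String) still returns a T. The reference is what gets copied only when T has no Clone implementation of its own, as is the case for str.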

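fn main() {
    // Owned String: both methods produce a new String.
    let s: String = String::from("Hello");
    let s_clone: String = s.clone();
    let s_owned: String = s.to_owned();

    // Borrowed &str: clone() copies the reference, to_owned() allocates a String.
    let s: &str = "Hello";
    let s_clone: &str = s.clone();
    let s_owned: String = s.to_owned();
}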

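In short, calling clone() on a reference such as &str copies only the reference, while to_owned() copies the underlying data and gives you ownership of the copy.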
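The difference can be observed by comparing buffer addresses, as in the following sketch. The variable names are illustrative, and the allow attribute is there only because recent compilers may warn that clone() on a &str does nothing useful, which is exactly the behavior being demonstrated.

```rust
#[allow(noop_method_call)]
fn main() {
    let literal: &str = "Hello";

    // clone() on &str copies the reference: same underlying bytes.
    let copied_ref: &str = literal.clone();
    println!("{}", literal.as_ptr() == copied_ref.as_ptr());

    // to_owned() allocates a new String with its own buffer.
    let owned: String = literal.to_owned();
    println!("{}", literal.as_ptr() == owned.as_ptr());

    // For &String, clone() resolves to String::clone and also allocates.
    let string = String::from("Hello");
    let borrowed: &String = &string;
    let cloned: String = borrowed.clone();
    println!("{}", string.as_ptr() == cloned.as_ptr());
}
```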

Summary

&str    -> String   String::from(s)/s.to_string()/s.to_owned()
&str    -> &[u8]    s.as_bytes()
&str    -> Vec<u8>  s.as_bytes().to_vec()/s.as_bytes().to_owned()
String  -> &str     &s if possible* else s.as_str()
String  -> &[u8]    s.as_bytes()
String  -> Vec<u8>  s.into_bytes()
&[u8]   -> &str     std::str::from_utf8(s).unwrap()
&[u8]   -> String   String::from_utf8(s.to_vec()).unwrap()
&[u8]   -> Vec<u8>  s.to_vec()
Vec<u8> -> &str     std::str::from_utf8(&s).unwrap()
Vec<u8> -> String   String::from_utf8(s).unwrap()
Vec<u8> -> &[u8]    &s if possible* else s.as_slice()

* &s works wherever the expected type is already known to be &str (or &[u8]), so that deref coercion can apply.
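The conversions in the table can be exercised together in one short program. This is a sketch with an arbitrary sample value; the unwrap() calls are acceptable here only because the bytes are known to be valid UTF-8.

```rust
fn main() {
    let s: &str = "Rust";

    // &str -> String / &[u8] / Vec<u8>
    let owned: String = s.to_owned();
    let bytes: &[u8] = s.as_bytes();
    let vec: Vec<u8> = s.as_bytes().to_vec();

    // &[u8] -> &str / String (String::from_utf8 needs an owned Vec<u8>)
    let from_slice: &str = std::str::from_utf8(bytes).unwrap();
    let from_slice_owned: String = String::from_utf8(bytes.to_vec()).unwrap();

    // Vec<u8> -> &str / &[u8] by borrowing, -> String by consuming a copy
    let from_vec_ref: &str = std::str::from_utf8(&vec).unwrap();
    let as_slice: &[u8] = &vec;
    let from_vec: String = String::from_utf8(vec.clone()).unwrap();

    // String -> &str / Vec<u8>
    let borrowed: &str = &owned;
    let back: Vec<u8> = owned.clone().into_bytes();

    println!("{} {} {} {}", from_slice, from_slice_owned, from_vec_ref, from_vec);
    println!("{} {:?} {:?}", borrowed, as_slice, back);
}
```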
