Detecting Character Encodings with Python chardet

Posted on 2020-12-08 in Python

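The chardet library detects character encodings. I sometimes open a file and get garbled text; without knowing the encoding, the only option is to try the most common encodings one at a time.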

References

chardet official documentation

Installation

Install with pipenv:

pipenv install chardet

Install with pip3 (here through the Tsinghua PyPI mirror):

python3 -m pip install chardet -i https://pypi.tuna.tsinghua.edu.cn/simple

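Examples

detect

The detect function takes a single argument: a byte string (bytes or bytearray), not a Unicode str. It returns a dictionary containing the detected character encoding and a confidence level from 0 to 1.

To detect the encoding of abc.txt: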

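from chardet import detect

# File path, relative to the current working directory
path = './abc.txt'
with open(path, 'rb') as f:
    # For a very large file, read only a sample, e.g. f.read(100);
    # the number is how many bytes to read (a small sample may be less accurate)
    result = detect(f.read())  # do not rebind the name `detect` itself
    print(result)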

{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

encoding : the detected character encoding
confidence : confidence level (0.99 means 99%)
language : the detected language
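Once the encoding is known, the natural next step is to decode the file with it. The following is a minimal sketch, not part of chardet: pick_encoding and read_text are hypothetical helper names, and the 0.5 confidence threshold and the utf-8 fallback are assumptions chosen for illustration. It relies only on the result dictionary shape shown above.

```python
from typing import Any, Mapping


def pick_encoding(result: Mapping[str, Any],
                  min_confidence: float = 0.5,
                  fallback: str = "utf-8") -> str:
    """Choose an encoding from a detect()-style result dictionary.

    Falls back when no encoding was found or the confidence is too low.
    The threshold and fallback are illustrative assumptions.
    """
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    if encoding is None or confidence < min_confidence:
        return fallback
    return encoding


def read_text(path: str) -> str:
    """Read a file as text using the encoding chardet reports for it."""
    import chardet  # imported lazily so pick_encoding works on its own

    with open(path, "rb") as f:
        raw = f.read()
    encoding = pick_encoding(chardet.detect(raw))
    # errors="replace" keeps going even if the guess is slightly off
    return raw.decode(encoding, errors="replace")
```

For example, read_text('./abc.txt') would return the file contents as a str. A wrong guess still produces wrong characters, so a low confidence value is worth checking by hand.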

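UniversalDetector

If you are processing a large amount of text, you can feed it to the detector incrementally, and it will stop as soon as it is confident enough to report a result.

Once the detector reaches its minimum confidence threshold, detector.done is set to True.

The result has the same format as the one returned by detect.

To detect the encoding of abc.txt: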

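from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()

# File path, relative to the current working directory
path = './abc.txt'
with open(path, 'rb') as f:
    # Iterate lazily instead of loading every line with readlines()
    for line in f:
        detector.feed(line)
        if detector.done:
            break

# Finalize detection before reading the result
detector.close()
print(detector.result)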

{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
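Feeding line by line works for ordinary text, but a file with very long lines or no newlines at all would still arrive in one large piece. A possible variation is to feed fixed-size blocks instead. This is only a sketch: iter_chunks and detect_stream are hypothetical helpers, the 4096-byte chunk size is an arbitrary assumption, and the detector argument is expected to be a UniversalDetector (or any object with the same reset, feed, done, close and result members).

```python
import io
from typing import Any, Dict, Iterator


def iter_chunks(stream: io.BufferedIOBase, chunk_size: int = 4096) -> Iterator[bytes]:
    """Yield successive blocks of at most chunk_size bytes until end of stream."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


def detect_stream(stream: io.BufferedIOBase, detector: Any,
                  chunk_size: int = 4096) -> Dict[str, Any]:
    """Feed a binary stream to the detector block by block.

    Stops early once detector.done is True, then finalizes with close().
    """
    detector.reset()
    for chunk in iter_chunks(stream, chunk_size):
        detector.feed(chunk)
        if detector.done:
            break
    detector.close()
    return detector.result


# Usage, assuming chardet is installed:
#   from chardet.universaldetector import UniversalDetector
#   with open('./abc.txt', 'rb') as f:
#       print(detect_stream(f, UniversalDetector()))
```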

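To detect the encoding of several files, you can reuse a single UniversalDetector object: just call detector.reset() at the start of each file.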

import glob
from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()

for filename in glob.glob('./*.txt'):   # every .txt file in the current directory
    print(filename.ljust(60), end='')
    detector.reset()
    with open(filename, 'rb') as f:
        for line in f:
            detector.feed(line)
            if detector.done:
                break
    # Finalize detection for this file before reading the result
    detector.close()
    print(detector.result)
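With many files, a per-file printout gets hard to scan, and it can help to see which files share an encoding. The sketch below assumes the results were collected into a dictionary keyed by filename (for example results[filename] = dict(detector.result) inside the loop). group_by_encoding is a hypothetical helper, and the "unknown" label and lowercase normalization are illustrative choices, not chardet behavior.

```python
from collections import defaultdict
from typing import Any, Dict, List, Mapping


def group_by_encoding(results: Mapping[str, Mapping[str, Any]]) -> Dict[str, List[str]]:
    """Group filenames by the encoding reported in their result dictionaries."""
    grouped = defaultdict(list)
    for filename, result in results.items():
        # Files with no detected encoding are collected under "unknown"
        encoding = result.get("encoding") or "unknown"
        grouped[encoding.lower()].append(filename)
    # Sort names so the output is stable regardless of input order
    return {encoding: sorted(names) for encoding, names in grouped.items()}
```

Any file that lands outside the expected group (for instance a lone GB2312 file among UTF-8 ones) is then easy to spot and check by hand.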