GBK Error

2024-08-17 23:25:18

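When testing code in CMD (on a Chinese-locale Windows system), print() output is encoded as GBK, because that is the console default. If the text contains a Unicode character that GBK cannot represent, the following error is raised: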

UnicodeEncodeError: 'gbk' codec can't encode character u'\u200e' in position 43: illegal multibyte sequence

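The error can be suppressed by passing errors='ignore' when encoding the text to GBK (and when decoding it back), so that characters GBK cannot represent are silently dropped. The errors parameter works the same way for encode() and decode(); the Python documentation describes it for decode() as follows: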
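As a minimal sketch of how the error handlers differ, the snippet below encodes a string containing U+200E (the character named in the traceback above) to GBK. The sample string is made up for illustration.

```python
# Hypothetical sample text; U+200E (LEFT-TO-RIGHT MARK) has no GBK mapping.
text = "news\u200etitle"

try:
    text.encode("gbk")  # errors="strict" is the default
    strict_failed = False
except UnicodeEncodeError as exc:
    strict_failed = True
    print("strict:", exc)

ignored = text.encode("gbk", errors="ignore")    # drops the character
replaced = text.encode("gbk", errors="replace")  # substitutes "?"

print(ignored)
print(replaced)
```

With 'ignore' the character disappears without a trace, while 'replace' leaves a visible "?" marker, which can make lost characters easier to spot.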

bytes.decode(encoding="utf-8", errors="strict")
bytearray.decode(encoding="utf-8", errors="strict")

Return a string decoded from the given bytes. Default encoding is 'utf-8'. errors may be given to set a different error handling scheme. The default for errors is 'strict', meaning that encoding errors raise a UnicodeError. Other possible values are 'ignore', 'replace' and any other name registered via codecs.register_error(), see section Error Handlers. For a list of possible encodings, see section Standard Encodings.
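The decoding side can be sketched the same way. The byte string here is a hypothetical input, chosen because the byte 0xFF can never appear in valid UTF-8.

```python
# Hypothetical input; 0xFF is never valid in UTF-8.
raw = b"abc\xffdef"

try:
    raw.decode("utf-8")  # errors="strict" is the default
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

dropped = raw.decode("utf-8", errors="ignore")   # invalid byte removed
marked = raw.decode("utf-8", errors="replace")   # invalid byte becomes U+FFFD

print(repr(dropped))
print(repr(marked))
```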

import urllib.request
from bs4 import BeautifulSoup

def trade_spider(max_pages):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
        'Accept': 'text/html;q=0.9,*/*;q=0.8',
    }

    opener = urllib.request.build_opener()
    # addheaders expects a list of (name, value) tuples, not a dict
    opener.addheaders = list(headers.items())
    page = 1
    while page <= max_pages:
        url = r'http://news.zjicm.edu.cn/web_2/pages/type.php?Page_Id=8C2D81A53403FED2AEE8F706017F8C3E&PageNo=' + str(page)
        source_code = opener.open(url).read()
        soup = BeautifulSoup(source_code, "html.parser")
        for link in soup.find(class_='list-nopic').find_all('a'):
            # link.string is None when the tag contains nested tags
            if link.string is None:
                continue
            # Drop characters GBK cannot represent, then decode back for printing
            str1 = link.string.encode('gbk', errors='ignore')
            print(str1.decode('gbk', errors='ignore'))
        page += 1

trade_spider(10)
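The encode-then-decode round trip used in the loop can be factored into a small helper. This is only a sketch: console_safe is a hypothetical name, and falling back to UTF-8 when the stream reports no encoding is an assumption, not something the original code does.

```python
import sys

def console_safe(text, encoding=None, errors="replace"):
    """Return text with characters the target encoding cannot represent
    substituted (errors="replace") or dropped (errors="ignore")."""
    # Assumption: fall back to UTF-8 if stdout reports no encoding.
    encoding = encoding or sys.stdout.encoding or "utf-8"
    return text.encode(encoding, errors=errors).decode(encoding)

# Force GBK here so the result does not depend on the current console.
print(console_safe("title\u200e", encoding="gbk"))
```

On Python 3.7 and later, calling sys.stdout.reconfigure(errors="replace") once at startup is another option when stdout is an ordinary text stream, since print() then substitutes unencodable characters by itself.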
