Python correct encoding of Website (Beautiful Soup)
I am trying to load an HTML page and output its text. Although I am getting the webpage correctly, BeautifulSoup somehow destroys the encoding.
Source:
    # -*- coding: utf-8 -*-
    import requests
    from BeautifulSoup import BeautifulSoup

    url = "http://www.columbia.edu/~fdc/utf8/"
    r = requests.get(url)

    encodedtext = r.text.encode("utf-8")
    soup = BeautifulSoup(encodedtext)
    text = str(soup.findAll(text=True))
    print text.decode("utf-8")

Excerpt of the output:

    ...odenw\xc3\xa4lderisch...

This should be odenwälderisch.
You are making two mistakes: you are mis-handling the encoding, and you are treating a result list as something that can safely be converted to a string without loss of information.
First of all, don't use response.text! It is not BeautifulSoup that is at fault here; you are re-encoding a Mojibake. The requests library defaults to Latin-1 encoding for text/* content types when the server doesn't explicitly specify an encoding, because the HTTP standard states that this is the default.
See the Encoding section of the Advanced documentation:

The only time Requests will not do this is if no explicit charset is present in the HTTP headers and the Content-Type header contains text. In this situation, RFC 2616 specifies that the default charset must be ISO-8859-1. Requests follows the specification in this case. If you require a different encoding, you can manually set the Response.encoding property, or use the raw Response.content.

Bold emphasis mine.
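To make the failure concrete, here is a minimal offline sketch of how the Mojibake arises. It assumes Python 3 (the question's code is Python 2) and uses a hard-coded sample word instead of a real HTTP response, so the variable names are illustrative only.

```python
# Offline sketch of the Mojibake; no network access or third-party packages.
# Assumption: Python 3, with a sample word standing in for the page content.

original = "Odenw\u00e4lderisch"  # "Odenwälderisch"

# What the server sends: the page text encoded as UTF-8 bytes.
raw_bytes = original.encode("utf-8")

# What response.text would hold if those bytes were decoded with the
# ISO-8859-1 (Latin-1) default instead of UTF-8.
wrongly_decoded = raw_bytes.decode("iso-8859-1")

# Re-encoding the wrongly decoded text as UTF-8 bakes the damage in:
# the result is no longer the byte sequence the server sent.
reencoded = wrongly_decoded.encode("utf-8")

# Decoding the raw bytes with the right codec recovers the word.
correctly_decoded = raw_bytes.decode("utf-8")

print(repr(wrongly_decoded))
print(reencoded == raw_bytes)
print(correctly_decoded)
```

The raw bytes are the only lossless starting point, which is why the parser should be given them rather than the decoded text.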
Pass in the raw response.content data instead:
    soup = BeautifulSoup(r.content)

I see that you are using BeautifulSoup 3. You really want to upgrade to BeautifulSoup 4 instead; version 3 was discontinued in 2012 and contains several bugs. Install the beautifulsoup4 project, and use from bs4 import BeautifulSoup.
BeautifulSoup 4 usually does a great job of figuring out the right encoding to use when parsing, either from an HTML <meta> tag or from statistical analysis of the bytes provided. If the server does provide a character set, you can still pass it to BeautifulSoup from the response, but do test first whether requests used a default:
    encoding = r.encoding if 'charset' in r.headers.get('content-type', '').lower() else None
    soup = BeautifulSoup(r.content, from_encoding=encoding)

Last but not least, with BeautifulSoup 4 you can extract all the text from a page using soup.get_text():
    text = soup.get_text()
    print text

You are instead converting a result list (the return value of soup.findAll()) to a string. This can never work, because containers in Python use repr() on each element in the list to produce a debugging string, and for strings that means you get escape sequences for anything that is not a printable ASCII character.
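The difference between a debugging string and the actual text can be shown without any parsing at all. The following is a small sketch assuming Python 3, with a made-up list standing in for the findAll() result. Python 3's repr() leaves printable non-ASCII characters alone, so ascii() is used here to approximate the escaping that Python 2's repr() applied in the question.

```python
# Sketch: str() on a list calls repr() on each element, so the result is a
# debugging string, not the text itself.
# Assumption: Python 3; the list below is invented sample data.
pieces = ["Odenw\u00e4lderisch", "line one\nline two"]

# Built from repr() of each element: quotes, brackets and escapes included.
as_debug_string = str(pieces)

# The actual text, obtained by joining the elements directly.
as_text = "".join(pieces)

# ascii() escapes every non-ASCII character, similar to Python 2's repr().
escaped = ascii(pieces[0])

print(as_debug_string)
print(as_text)
print(escaped)
```

Joining the strings (or calling soup.get_text() in BeautifulSoup 4) keeps the characters intact, whereas the list-to-string conversion turns the newline, and in Python 2 the ä as well, into escape sequences.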