Python correct encoding of Website (Beautiful Soup)
I am trying to load an HTML page and output its text. Even though I am getting the webpage correctly, BeautifulSoup somehow destroys the encoding.
Source:
    # -*- coding: utf-8 -*-
    import requests
    from BeautifulSoup import BeautifulSoup

    url = "http://www.columbia.edu/~fdc/utf8/"
    r = requests.get(url)

    encodedText = r.text.encode("utf-8")
    soup = BeautifulSoup(encodedText)
    text = str(soup.findAll(text=True))
    print text.decode("utf-8")
Excerpt of the output:
...odenw\xc3\xa4lderisch...
This should be Odenwälderisch.
You are making two mistakes: you are mishandling the encoding, and you are treating a result list as something that can safely be converted to a string without loss of information.
First of all, don't use response.text! It is not BeautifulSoup that is at fault here; you are re-encoding mojibake. The requests library defaults to Latin-1 encoding for text/* content types when the server doesn't explicitly specify an encoding, because the HTTP standard states that this is the default.
See the Encoding section of the Advanced documentation:
"The only time Requests will not do this [guess the encoding] is if no explicit charset is present in the HTTP headers and the Content-Type header contains text. In this situation, RFC 2616 specifies that the default charset must be ISO-8859-1. Requests follows the specification in this case. If you require a different encoding, you can manually set the Response.encoding property, or use the raw Response.content."
Bold emphasis mine.
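The re-encoding problem described above can be reproduced without any network access. The following is a minimal Python 3 sketch, not the code from the question: the sample word is taken from the question's output, and the variable names are made up for illustration. It decodes UTF-8 bytes with the Latin-1 fallback, re-encodes the result, and compares that with decoding the untouched bytes correctly.

```python
# Python 3 sketch: reproduce the mojibake locally (all names are illustrative).
original = "Odenwälderisch"

# What a UTF-8 server actually sends over the wire.
raw_bytes = original.encode("utf-8")

# Decoding those bytes with the Latin-1 fallback yields the wrong text.
misdecoded = raw_bytes.decode("iso-8859-1")

# Re-encoding the wrong text as UTF-8 bakes the damage in.
reencoded = misdecoded.encode("utf-8")

# Decoding the untouched bytes with the right codec recovers the text.
recovered = raw_bytes.decode("utf-8")

print(misdecoded)        # OdenwÃ¤lderisch
print(repr(reencoded))   # b'Odenw\xc3\x83\xc2\xa4lderisch'
print(recovered)         # Odenwälderisch
```

This is why handing the raw bytes to the parser is safer than handing it text that was already decoded with a guessed codec.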
Pass in the raw response.content data instead:
    soup = BeautifulSoup(r.content)
I see that you are using BeautifulSoup 3. You really want to upgrade to BeautifulSoup 4 instead; version 3 was discontinued in 2012 and contains several bugs. Install the beautifulsoup4 project, and use from bs4 import BeautifulSoup.
BeautifulSoup 4 usually does a great job of figuring out the right encoding to use when parsing, either from an HTML <meta> tag or from statistical analysis of the bytes provided. If the server does provide a character set, you can still pass it into BeautifulSoup from the response, but do test first whether requests used a default:
    encoding = r.encoding if 'charset' in r.headers.get('content-type', '').lower() else None
    soup = BeautifulSoup(r.content, from_encoding=encoding)
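The charset check above can be tried in isolation on sample header values. This is a small Python 3 sketch; encoding_from_header and its parameter names are hypothetical helpers invented for illustration, not part of requests or BeautifulSoup.

```python
def encoding_from_header(content_type, reported_encoding):
    """Return reported_encoding only when the header names a charset.

    Hypothetical helper: content_type stands in for the Content-Type header
    value and reported_encoding for what the HTTP library reports.
    """
    if "charset" in content_type.lower():
        return reported_encoding
    # No explicit charset: return None so the parser detects the encoding.
    return None


# Explicit charset in the header: trust the reported encoding.
print(encoding_from_header("text/html; charset=UTF-8", "UTF-8"))  # UTF-8

# No charset: the reported value is only the Latin-1 default, so drop it.
print(encoding_from_header("text/html", "ISO-8859-1"))  # None
```

Returning None in the second case matters because a default of ISO-8859-1 would otherwise override the parser's own detection.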
Last but not least, with BeautifulSoup 4 you can extract all text from a page using soup.get_text():
    text = soup.get_text()
    print text
You are instead converting a result list (the return value of soup.findAll()) to a string. This can never work, because containers in Python use repr() on each element in the list to produce a debugging string, and for strings that means you get escape sequences for anything that is not a printable ASCII character.
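The effect of repr() on list elements can be seen in a short, self-contained sketch. It is written for Python 3 and uses bytes objects to stand in for Python 2 byte strings, which is an assumption made for illustration; the sample values are made up.

```python
# Python 3 sketch: bytes here play the role of Python 2 byte strings.
pieces = ["Odenwälderisch".encode("utf-8"), b" dialect"]

# str() on a list calls repr() on every element, producing escape sequences.
as_debug_string = str(pieces)
print(as_debug_string)  # [b'Odenw\xc3\xa4lderisch', b' dialect']

# Joining the elements first and decoding once keeps the text intact.
joined = b"".join(pieces).decode("utf-8")
print(joined)  # Odenwälderisch dialect
```

The debugging string contains literal backslashes and quote characters, so no later decode call can turn it back into the original text.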