Python correct encoding of Website (Beautiful Soup)


I am trying to load an HTML page and output its text. Although I am getting the webpage correctly, BeautifulSoup somehow destroys the encoding.

Source:

# -*- coding: utf-8 -*-
import requests
from BeautifulSoup import BeautifulSoup

url = "http://www.columbia.edu/~fdc/utf8/"
r = requests.get(url)

encodedText = r.text.encode("utf-8")
soup = BeautifulSoup(encodedText)
text = str(soup.findAll(text=True))
print text.decode("utf-8")

Excerpt of the output:

...odenw\xc3\xa4lderisch... 

This should be odenwälderisch.

You are making two mistakes: you are mishandling the encoding, and you are treating a result list as something that can safely be converted to a string without loss of information.

First of all, don't use response.text! BeautifulSoup is not at fault here; you are re-encoding Mojibake. The requests library defaults to Latin-1 (ISO-8859-1) for text/* content types when the server doesn't explicitly specify an encoding, because the HTTP standard states that this is the default.

See the Encoding section of the Advanced documentation:

The only time Requests will not do this is if no explicit charset is present in the HTTP headers and the Content-Type header contains text. In this situation, RFC 2616 specifies that the default charset must be ISO-8859-1. Requests follows the specification in this case. If you require a different encoding, you can manually set the Response.encoding property, or use the raw Response.content.

Bold emphasis mine.

Pass in the raw response.content data instead:

soup = BeautifulSoup(r.content)
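
To see why re-encoding the decoded text goes wrong, here is a minimal sketch that reproduces the Mojibake without any network access. It is written for Python 3, and the variable names are illustrative rather than taken from the question: `raw` stands in for what `r.content` would hold, and `wrong_text` for what `r.text` yields when Latin-1 is assumed.

```python
# The word "odenwälderisch" as the server would send it: UTF-8 bytes.
raw = "odenw\u00e4lderisch".encode("utf-8")

# Decoding those bytes with the Latin-1 default turns the two bytes
# of the umlaut into two separate characters.
wrong_text = raw.decode("iso-8859-1")

# Re-encoding the wrongly decoded text as UTF-8 bakes the damage in.
reencoded = wrong_text.encode("utf-8")

# Decoding the original bytes with the right codec gives the expected word.
right_text = raw.decode("utf-8")

print(ascii(wrong_text))
print(reencoded)
print(ascii(right_text))
```

The re-encoded bytes are no longer the bytes the server sent, which is why handing the untouched bytes to the parser is the safer route.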

I see that you are using BeautifulSoup 3. You really want to upgrade to BeautifulSoup 4 instead; version 3 was discontinued in 2012 and contains several bugs. Install the beautifulsoup4 project, and use from bs4 import BeautifulSoup.

BeautifulSoup 4 usually does a great job of figuring out the right encoding to use when parsing, either from an HTML <meta> tag or from statistical analysis of the bytes provided. If the server does provide a character set, you can still pass it into BeautifulSoup from the response, but do test first whether requests used a default:

encoding = r.encoding if 'charset' in r.headers.get('content-type', '').lower() else None
soup = BeautifulSoup(r.content, from_encoding=encoding)
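
The same check can be wrapped in a small helper so that it can be tried without a live request. This is only a sketch for Python 3: `declared_encoding` is a hypothetical name, not part of requests or BeautifulSoup, and it takes the Content-Type header value and the encoding reported by requests as plain arguments.

```python
def declared_encoding(content_type, response_encoding):
    """Return response_encoding only if the header names a charset.

    Hypothetical helper: content_type is the Content-Type header value
    (or None), response_encoding is what r.encoding would hold.
    """
    if content_type and "charset" in content_type.lower():
        return response_encoding
    # No explicit charset: return None so the parser detects the encoding.
    return None


print(declared_encoding("text/html; charset=UTF-8", "UTF-8"))
print(declared_encoding("text/html", "ISO-8859-1"))
print(declared_encoding(None, None))
```

The returned value would then be passed as from_encoding, in the same way as the encoding variable in the two lines above.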

Last but not least, with BeautifulSoup 4, you can extract all text from a page using soup.get_text():

text = soup.get_text()
print(text)

You are instead converting a result list (the return value of soup.findAll()) to a string. This can never work, because containers in Python use repr() on each element in the list to produce a debugging string, and for strings that means you get escape sequences for anything that is not a printable ASCII character.
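
A small sketch of that effect, written for Python 3. Note the assumption: in Python 3, repr() of a str keeps printable non-ASCII characters, so the sketch uses UTF-8 byte strings, which behave here like the Python 2 str objects in the question. The list `pieces` is a made-up stand-in for the result of soup.findAll(text=True).

```python
# Stand-in for a list of text nodes, held as UTF-8 byte strings.
pieces = ["Odenw\u00e4lderisch".encode("utf-8"), b" text"]

# str() on a list calls repr() on each element, so escapes appear.
as_list_string = str(pieces)
print(as_list_string)

# Joining the elements and decoding once keeps the actual characters.
joined = b"".join(pieces).decode("utf-8")
print(ascii(joined))
```

The first print shows the same kind of \xc3\xa4 escapes as the excerpt in the question, while the joined and decoded value holds the real character.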

