Python 2.7 - BeautifulSoup tag.getText() gives blank value
I am extracting the span with class st using the following code:
    address = "http://www.google.com/search?q=%s&num=50&hl=en&start=0" % (urllib.quote_plus(query))
    request = urllib2.Request(address, None, {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11'})
    urlfile = urllib2.urlopen(request)
    page = urlfile.read()
    soup = BeautifulSoup(page)
    divg = soup.findAll('div', attrs={'class': 'g'})
    for li in divg:
        try:
            print "\n\n"
            print "link :"
            print li.find('h3').find('a')['href']
            print "title "
            title = (li.find('h3', attrs={'class': 'r'}))
            print title.getText()
            print "body"
            body = (li.find('span', attrs={'class': 'st'}))
            print body.getText()
        except:
            continue
    print len(divg)
The respective div is as follows:
<div class="g"> <!--m--> <div class="rc" data-hveid="53"> <h3 class="r"> <a href="/url?sa=t&rct=j&q=&esrc=s&source=web&cd=45&cad=rja&uact=8&ved=0ahukewim0sfq6knmahvtcy4khteud7w4kbawcdywba&url=http%3a%2f%2fwww.ncbi.nlm.nih.gov%2fpmc%2farticles%2fpmc2658273%2f&usg=afqjcngpmcr8qk2zu2w0yx4tgzv2vcltsq&sig2=bf_cyrqy1qa5g3c-zy8cyg&bvm=bv.119745492,d.c2e" onmousedown="return rwt(this,'','','','45','afqjcngpmcr8qk2zu2w0yx4tgzv2vcltsq','bf_cyrqy1qa5g3c-zy8cyg','0ahukewim0sfq6knmahvtcy4khteud7w4kbawcdywba','','',event)" data-href="http://www.ncbi.nlm.nih.gov/pmc/articles/pmc2658273/">the “4‐hour target”: emergency nurses' views - ncbi</a> </h3> <div class="s"> <div> <div class="f kv _swb" style="white-space: nowrap"> <cite class="_rm bc">www.ncbi.nlm.nih.gov › ncbi › literature › pubmed central (pmc)</cite> <div class="action-menu ab_ctl"> <a class="_fmb ab_button" href="#" id="am-b44" aria-label="result details" aria-expanded="false" aria-haspopup="true" role="button" jsaction="m.tdd;keydown:m.hbke;keypress:m.mskpe" data-ved="0ahukewim0sfq6knmahvtcy4khteud7w4kbdshqg4maq"><span class="mn-dwn-arw"></span></a> <div class="action-menu-panel ab_dropdown" role="menu" tabindex="-1" jsaction="keydown:m.hdke;mouseover:m.hdhne;mouseout:m.hdhue" data-ved="0ahukewim0sfq6knmahvtcy4khteud7w4kbcphwg5maq"> <ol> <li class="action-menu-item ab_dropdownitem" role="menuitem"><a class="fl" href="/search?biw=1024&bih=738&q=related:www.ncbi.nlm.nih.gov/pmc/articles/pmc2658273/+target+breach+2005&tbo=1&sa=x&ved=0ahukewim0sfq6knmahvtcy4khteud7w4kbafcdowba">similar</a></li> </ol> </div> </div> </div> <div class="f slp"> mortimore - ‎2007 - ‎<a class="fl" href="https://scholar.google.co.in/scholar?biw=1024&bih=738&bav=on.2,or.r_cp.&bvm=bv.119745492,d.c2e&um=1&ie=utf-8&lr&cites=3213296797661648681" onmousedown="return rwt(this,'','','','45','afqjcnhe8yfvgtyrdbvn4tu3jtu4kus-nq','6jilcul7509jtylcnakmca','0ahukewim0sfq6knmahvtcy4khteud7w4kbdoagg8maq','','',event)">cited 49</a> - ‎<a class="fl" 
href="https://scholar.google.co.in/scholar?biw=1024&bih=738&bav=on.2,or.r_cp.&bvm=bv.119745492,d.c2e&um=1&ie=utf-8&lr&q=related:kdefpmnslyxtsm:scholar.google.com/" onmousedown="return rwt(this,'','','','45','afqjcnhzjgbmvuicvi92tjni69s5xgogwq','zd8zj2obi7nf6vwhtng2jg','0ahukewim0sfq6knmahvtcy4khteud7w4kbdpagg9maq','','',event)">related articles</a> </div> <span class="st">prior <em>target</em>, emergency department (ed) included in study had ..... <em>breached</em> (letter) bmj <em>2005</em>, http://www.bmj.com/cgi/eletters/330/7501/1188# ... </span> <div class="_tib">you visited page on 25/4/16.</div> </div> </div> </div> <!--n-->
But I get a blank result. In some cases the code runs fine, but in other cases it gives blank output.
"In some cases the code runs fine, but in other cases it gives blank output."
This is because, aside from the regular search results, there are elements with the g class that can represent image thumbnails. To limit the search to regular search results only, look for them inside the div element with class="srg":
    divg = soup.select('div.srg div.g')
    for li in divg:
        # ...
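To make the effect of the narrower selector concrete, here is a minimal sketch. The markup in SAMPLE_HTML is a hypothetical, trimmed-down stand-in for a results page (not the real Google output), and extract_results is an illustrative helper name, not part of BeautifulSoup. It also checks for None explicitly instead of relying on a bare except, so a missing h3 or span is skipped visibly rather than silently.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one thumbnail-style "g" block outside "srg",
# and one regular result inside it.
SAMPLE_HTML = """
<div id="ires">
  <div class="g"><a href="/images?q=x"><img src="thumb.png"></a></div>
  <div class="srg">
    <div class="g">
      <h3 class="r"><a href="/url?q=http://example.com/">Example title</a></h3>
      <span class="st">Example <em>snippet</em> text</span>
    </div>
  </div>
</div>
"""


def extract_results(html):
    """Collect link, title and body for regular results only."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    # Only "g" blocks nested inside the "srg" container are considered.
    for block in soup.select("div.srg div.g"):
        heading = block.find("h3", attrs={"class": "r"})
        link = heading.find("a") if heading is not None else None
        snippet = block.find("span", attrs={"class": "st"})
        if link is None or snippet is None:
            # Skip blocks that lack the expected structure.
            continue
        results.append({
            "href": link.get("href"),
            "title": link.get_text(),
            "body": snippet.get_text(),
        })
    return results


results = extract_results(SAMPLE_HTML)
all_g_blocks = BeautifulSoup(SAMPLE_HTML, "html.parser").find_all("div", attrs={"class": "g"})
```

In this sample, all_g_blocks contains both "g" divs, while results holds only the regular result; the thumbnail block never reaches the h3 and span lookups.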
Note that I'm assuming you are using BeautifulSoup version 4, where the import is:
    from bs4 import BeautifulSoup
You may also need to encode the printed texts as UTF-8:
print body.text.encode("utf-8")
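The reason encoding can matter: in Python 2, printing a unicode string that contains non-ASCII characters (such as the curly quotes in the title above) can raise UnicodeEncodeError when the output stream uses an ASCII codec, and the bare except in the question would swallow that error. The following is a small sketch of the encoding step on its own; to_utf8 is a hypothetical helper name, and the sample string is modeled on the title in the question.

```python
# -*- coding: utf-8 -*-


def to_utf8(text):
    """Return UTF-8 bytes for a text string, or empty bytes for None."""
    if text is None:
        return b""
    return text.encode("utf-8")


# Title text with curly quotes (U+201C, U+201D) and a Unicode hyphen (U+2010).
title = u"the \u201c4\u2010hour target\u201d"
encoded = to_utf8(title)
```

The encoded value is a byte string that is longer than the original text, because each of the three non-ASCII characters takes three bytes in UTF-8.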