python 2.7 - BeautifulSoup tag.getText() gives blank value


I am extracting the span with class st using the following code:

    import urllib
    import urllib2

    # `query` and the BeautifulSoup import are defined elsewhere in the script.
    address = "http://www.google.com/search?q=%s&num=50&hl=en&start=0" % (urllib.quote_plus(query))
    request = urllib2.Request(address, None, {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11'})
    urlfile = urllib2.urlopen(request)
    page = urlfile.read()

    soup = BeautifulSoup(page)

    # Every result block on the page carries class "g".
    divg = soup.findAll('div', attrs={'class': 'g'})

    for li in divg:
        try:
            print "\n\n"

            print "link :"
            print li.find('h3').find('a')['href']

            print "title "
            title = (li.find('h3', attrs={'class': 'r'}))
            print title.getText()

            print "body"
            body = (li.find('span', attrs={'class': 'st'}))
            print body.getText()

        except:
            continue

    print len(divg)

The respective div is as follows:

<div class="g"> <!--m--> <div class="rc" data-hveid="53">     <h3 class="r">         <a             href="/url?sa=t&amp;rct=j&amp;q=&amp;esrc=s&amp;source=web&amp;cd=45&amp;cad=rja&amp;uact=8&amp;ved=0ahukewim0sfq6knmahvtcy4khteud7w4kbawcdywba&amp;url=http%3a%2f%2fwww.ncbi.nlm.nih.gov%2fpmc%2farticles%2fpmc2658273%2f&amp;usg=afqjcngpmcr8qk2zu2w0yx4tgzv2vcltsq&amp;sig2=bf_cyrqy1qa5g3c-zy8cyg&amp;bvm=bv.119745492,d.c2e"             onmousedown="return rwt(this,'','','','45','afqjcngpmcr8qk2zu2w0yx4tgzv2vcltsq','bf_cyrqy1qa5g3c-zy8cyg','0ahukewim0sfq6knmahvtcy4khteud7w4kbawcdywba','','',event)"             data-href="http://www.ncbi.nlm.nih.gov/pmc/articles/pmc2658273/">the             “4‐hour target”: emergency nurses' views - ncbi</a>     </h3>     <div class="s">         <div>             <div class="f kv _swb" style="white-space: nowrap">                 <cite class="_rm bc">www.ncbi.nlm.nih.gov › ncbi ›                     literature › pubmed central (pmc)</cite>                 <div class="action-menu ab_ctl">                     <a class="_fmb ab_button" href="#" id="am-b44"                         aria-label="result details" aria-expanded="false"                         aria-haspopup="true" role="button"                         jsaction="m.tdd;keydown:m.hbke;keypress:m.mskpe"                         data-ved="0ahukewim0sfq6knmahvtcy4khteud7w4kbdshqg4maq"><span                         class="mn-dwn-arw"></span></a>                     <div class="action-menu-panel ab_dropdown" role="menu"                         tabindex="-1"                         jsaction="keydown:m.hdke;mouseover:m.hdhne;mouseout:m.hdhue"                         data-ved="0ahukewim0sfq6knmahvtcy4khteud7w4kbcphwg5maq">                         <ol>                             <li class="action-menu-item ab_dropdownitem" role="menuitem"><a                                 class="fl"                                 
href="/search?biw=1024&amp;bih=738&amp;q=related:www.ncbi.nlm.nih.gov/pmc/articles/pmc2658273/+target+breach+2005&amp;tbo=1&amp;sa=x&amp;ved=0ahukewim0sfq6knmahvtcy4khteud7w4kbafcdowba">similar</a></li>                         </ol>                     </div>                 </div>             </div>             <div class="f slp">                 mortimore - &lrm;2007 - &lrm;<a class="fl"                     href="https://scholar.google.co.in/scholar?biw=1024&amp;bih=738&amp;bav=on.2,or.r_cp.&amp;bvm=bv.119745492,d.c2e&amp;um=1&amp;ie=utf-8&amp;lr&amp;cites=3213296797661648681"                     onmousedown="return rwt(this,'','','','45','afqjcnhe8yfvgtyrdbvn4tu3jtu4kus-nq','6jilcul7509jtylcnakmca','0ahukewim0sfq6knmahvtcy4khteud7w4kbdoagg8maq','','',event)">cited                     49</a> - &lrm;<a class="fl"                     href="https://scholar.google.co.in/scholar?biw=1024&amp;bih=738&amp;bav=on.2,or.r_cp.&amp;bvm=bv.119745492,d.c2e&amp;um=1&amp;ie=utf-8&amp;lr&amp;q=related:kdefpmnslyxtsm:scholar.google.com/"                     onmousedown="return rwt(this,'','','','45','afqjcnhzjgbmvuicvi92tjni69s5xgogwq','zd8zj2obi7nf6vwhtng2jg','0ahukewim0sfq6knmahvtcy4khteud7w4kbdpagg9maq','','',event)">related                     articles</a>             </div>             <span class="st">prior <em>target</em>, emergency                 department (ed) included in study had ..... <em>breached</em>                 (letter) bmj <em>2005</em>,                 http://www.bmj.com/cgi/eletters/330/7501/1188#&nbsp;...             </span>             <div class="_tib">you visited page on 25/4/16.</div>         </div>     </div> </div> <!--n--> 

But I get a blank result. For some of the cases the code runs fine, and for some cases it gives blank output.

"For some of the cases the code runs fine, and for some cases it gives blank output."

This is because, aside from regular search results, elements with the g class can also represent image thumbnails. To limit the search to regular search results only, you need to find them inside div elements with class="srg":

    divg = soup.select('div.srg div.g')
    for li in divg:
        # ...
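To make the difference concrete, here is a minimal sketch comparing the two lookups on a small, made-up page. The markup in `SAMPLE_HTML` and the helper `extract` are hypothetical and only loosely mimic the structure shown in the question; real Google markup is larger and changes over time. The sketch uses the `bs4` method names `find_all` and `get_text`, and checks for `None` explicitly rather than relying on a bare `except`.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified markup: one image block and one regular result.
SAMPLE_HTML = """
<div id="ires">
  <div class="g"><a href="/images?q=x"><img src="thumb.png"></a></div>
  <div class="srg">
    <div class="g">
      <h3 class="r"><a href="/url?q=http://example.com/">Example title</a></h3>
      <span class="st">Example <em>snippet</em> text</span>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Every div with class "g", including the image block.
all_g = soup.find_all("div", attrs={"class": "g"})

# Only div.g elements nested inside div.srg, i.e. the regular results.
regular = soup.select("div.srg div.g")


def extract(result):
    """Return (link, title, body) for one block; missing parts become None."""
    heading = result.find("h3", attrs={"class": "r"})
    anchor = heading.find("a") if heading is not None else None
    snippet = result.find("span", attrs={"class": "st"})
    return (
        anchor.get("href") if anchor is not None else None,
        heading.get_text() if heading is not None else None,
        snippet.get_text() if snippet is not None else None,
    )


rows = [extract(r) for r in regular]
thumbnail_row = extract(all_g[0])  # the image block has no h3 or span.st
```

In this sketch the image block yields `None` for every field. With the original loop, `li.find('h3')` returning `None` would raise an `AttributeError` that the bare `except: continue` silently skips, which is one plausible way to end up with blank output for some blocks.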

Note that I'm assuming you are using BeautifulSoup version 4, so the import is:

    from bs4 import BeautifulSoup

You may also need to encode the printed texts to UTF-8:

    print body.text.encode("utf-8")
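As a rough illustration of why the encoding step can matter, the sketch below builds a one-line document containing the curly quotes and hyphen from the title in the question and encodes the extracted text. The inline markup is made up for the example. In Python 2, printing a unicode string with non-ASCII characters to a stream whose encoding is ASCII raises `UnicodeEncodeError`, and inside the original `try`/`except: continue` that error would be swallowed silently.

```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

# Made-up snippet containing non-ASCII punctuation (curly quotes, U+2010 hyphen).
html = u'<span class="st">the \u201c4\u2010hour target\u201d</span>'

soup = BeautifulSoup(html, "html.parser")
body = soup.find("span", attrs={"class": "st"})

text = body.get_text()          # unicode text as extracted by BeautifulSoup
encoded = text.encode("utf-8")  # explicit UTF-8 byte string, safe to write out
```

Here `encoded` is longer than `text` because each of the three non-ASCII characters takes three bytes in UTF-8.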
