Value Errors When Retrieving Images With Scrapy


I'm having trouble using Scrapy's image pipeline to retrieve images. From the error reports, I think I am feeding Scrapy the right image_urls. However, instead of downloading images from them, Scrapy returns the error: ValueError: Missing scheme in request url: h.

This is my first time using the image pipeline feature, so I suspect I'm making a simple mistake. All the same, I'd appreciate help solving it.

Below you'll find my spider, settings, items, and error output. They're not quite MWEs, but I think they're pretty simple and easy to understand all the same.

Spider:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from ngamedallions.items import ngamedallionsitem
from scrapy.loader.processors import TakeFirst
from scrapy.loader import ItemLoader
from scrapy.loader.processors import Join
from scrapy.http import Request
import re


class ngaspider(CrawlSpider):
    name = 'ngamedallions'
    allowed_domains = ['nga.gov']
    start_urls = [
        'http://www.nga.gov/content/ngaweb/collection/art-object-page.1312.html'
    ]

    rules = (
        Rule(
            LinkExtractor(allow=('art-object-page.*', 'objects/*')),
            callback='parse_catalogrecord',
            follow=True,
        ),
    )

    def parse_catalogrecord(self, response):
        catalogrecord = ItemLoader(item=ngamedallionsitem(), response=response)
        catalogrecord.default_output_processor = TakeFirst()
        keywords = "medal|medallion"
        r = re.compile('.*(%s).*' % keywords, re.IGNORECASE|re.MULTILINE|re.UNICODE)
        if r.search(response.body_as_unicode()):
            catalogrecord.add_xpath('title', './/dl[@class="artwork-details"]/dt[@class="title"]/text()')
            catalogrecord.add_xpath('accession', './/dd[@class="accession"]/text()')
            catalogrecord.add_xpath('inscription', './/div[@id="inscription"]/p/text()')
            catalogrecord.add_xpath('image_urls', './/img[@class="mainimg"]/@src')

            return catalogrecord.load_item()

Settings:

BOT_NAME = 'ngamedallions'

SPIDER_MODULES = ['ngamedallions.spiders']
NEWSPIDER_MODULE = 'ngamedallions.spiders'

DOWNLOAD_DELAY = 3

ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}

IMAGES_STORE = '/home/tricia/documents/programing/scrapy/ngamedallions/medallionimages'

Items:

import scrapy


class ngamedallionsitem(scrapy.Item):
    title = scrapy.Field()
    accession = scrapy.Field()
    inscription = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
    pass

Error log:

2016-04-24 19:00:40 [scrapy] INFO: Scrapy 1.0.5.post2+ga046ce8 started (bot: ngamedallions)
2016-04-24 19:00:40 [scrapy] INFO: Optional features available: ssl, http11
2016-04-24 19:00:40 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'ngamedallions.spiders', 'FEED_URI': 'items.json', 'SPIDER_MODULES': ['ngamedallions.spiders'], 'BOT_NAME': 'ngamedallions', 'FEED_FORMAT': 'json', 'DOWNLOAD_DELAY': 3}
2016-04-24 19:00:40 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2016-04-24 19:00:40 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-04-24 19:00:40 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-04-24 19:00:40 [scrapy] INFO: Enabled item pipelines: ImagesPipeline
2016-04-24 19:00:40 [scrapy] INFO: Spider opened
2016-04-24 19:00:40 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-04-24 19:00:40 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-04-24 19:00:40 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/collection/art-object-page.1312.html> (referer: None)
2016-04-24 19:00:44 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/collection/art-object-page.1.html> (referer: None)
2016-04-24 19:00:48 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/collection/art-object-page.1312.html> (referer: http://www.nga.gov/content/ngaweb/collection/art-object-page.1312.html)
2016-04-24 19:00:48 [scrapy] ERROR: Error processing {'accession': u'1942.9.163.a',
 'image_urls': u'http://media.nga.gov/public/objects/1/3/1/2/1312-primary-0-440x400.jpg',
 'inscription': u'around circumference: iohannes franciscvs gon ma; around bottom circumference: mantva',
 'title': u'gianfrancesco gonzaga di rodigo, 1445-1496, lord of bozzolo, sabbioneta, , viadana 1478 [obverse]'}
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/lib/pymodules/python2.7/scrapy/pipelines/media.py", line 44, in process_item
    requests = arg_to_iter(self.get_media_requests(item, info))
  File "/usr/lib/pymodules/python2.7/scrapy/pipelines/images.py", line 109, in get_media_requests
    return [Request(x) for x in item.get(self.IMAGES_URLS_FIELD, [])]
  File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 24, in __init__
    self._set_url(url)
  File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 55, in _set_url
    self._set_url(url.encode(self.encoding))
  File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 59, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: h
2016-04-24 19:00:48 [scrapy] DEBUG: Filtered duplicate request: <GET http://www.nga.gov/content/ngaweb/collection/art-object-page.1312.html> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2016-04-24 19:00:51 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/collection/art-object-page.1313.html> (referer: http://www.nga.gov/content/ngaweb/collection/art-object-page.1312.html)
2016-04-24 19:00:52 [scrapy] ERROR: Error processing {'accession': u'1942.9.163.b',
 'image_urls': u'http://media.nga.gov/public/objects/1/3/1/3/1313-primary-0-440x400.jpg',
 'inscription': u'around top circumference: trinacria iani; upper center: pelorvs ; across center: pa li; across bottom: belavra',
 'title': u'house between 2 hills [reverse]'}
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/lib/pymodules/python2.7/scrapy/pipelines/media.py", line 44, in process_item
    requests = arg_to_iter(self.get_media_requests(item, info))
  File "/usr/lib/pymodules/python2.7/scrapy/pipelines/images.py", line 109, in get_media_requests
    return [Request(x) for x in item.get(self.IMAGES_URLS_FIELD, [])]
  File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 24, in __init__
    self._set_url(url)
  File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 55, in _set_url
    self._set_url(url.encode(self.encoding))
  File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 59, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: h
2016-04-24 19:00:55 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/collection/art-object-page.1.html> (referer: http://www.nga.gov/content/ngaweb/collection/art-object-page.1.html)
2016-04-24 19:01:02 [scrapy] INFO: Closing spider (finished)
2016-04-24 19:01:02 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1609,
 'downloader/request_count': 5,
 'downloader/request_method_count/GET': 5,
 'downloader/response_bytes': 125593,
 'downloader/response_count': 5,
 'downloader/response_status_count/200': 5,
 'dupefilter/filtered': 5,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 4, 24, 23, 1, 2, 938181),
 'log_count/DEBUG': 7,
 'log_count/ERROR': 2,
 'log_count/INFO': 7,
 'request_depth_max': 2,
 'response_received_count': 5,
 'scheduler/dequeued': 5,
 'scheduler/dequeued/memory': 5,
 'scheduler/enqueued': 5,
 'scheduler/enqueued/memory': 5,
 'start_time': datetime.datetime(2016, 4, 24, 23, 0, 40, 851598)}
2016-04-24 19:01:02 [scrapy] INFO: Spider closed (finished)

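The TakeFirst processor is making image_urls a string when it should be a list. The images pipeline iterates over that field to build one request per element, so with a string it iterates over the characters of the URL and tries to request "h", which has no scheme.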
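A minimal sketch of that failure mode in plain Python, with no Scrapy involved. The helper names `media_urls` and `has_scheme` are hypothetical; `media_urls` only imitates the list comprehension visible in the traceback (`item.get(..., [])` walked element by element), and the scheme check uses the standard library rather than Scrapy's own validation.

```python
from urllib.parse import urlparse


def media_urls(item):
    """Imitate the pipeline: produce one entry per element of image_urls."""
    return [x for x in item.get("image_urls", [])]


def has_scheme(url):
    """Return True when the URL carries a scheme such as http."""
    return bool(urlparse(url).scheme)


url = "http://media.nga.gov/public/objects/1/3/1/2/1312-primary-0-440x400.jpg"

as_string = {"image_urls": url}    # the shape TakeFirst produced
as_list = {"image_urls": [url]}    # the shape the pipeline expects

# Iterating a string yields its characters, so the first "URL" is just "h".
first_from_string = media_urls(as_string)[0]
first_from_list = media_urls(as_list)[0]

print(repr(first_from_string), has_scheme(first_from_string))
print(repr(first_from_list), has_scheme(first_from_list))
```

With the string-valued item the first element is a single character without a scheme, which matches the "h" at the end of the error message.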

Add:

catalogrecord.image_urls_out = lambda v: v

Edit:

This could also be:

catalogrecord.image_urls_out = scrapy.loader.processors.Identity()
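To see why a per-field output processor fixes this, here is a rough sketch that uses simplified stand-ins rather than Scrapy itself. `TinyLoader` is a hypothetical, stripped-down loader, and the `TakeFirst` and `Identity` classes below are reduced re-implementations written for this example, not Scrapy's code. The only behaviour it assumes from Scrapy is that a field-specific `<field>_out` attribute takes precedence over `default_output_processor`.

```python
class TakeFirst:
    """Stand-in: return the first non-empty collected value."""

    def __call__(self, values):
        for value in values:
            if value is not None and value != "":
                return value


class Identity:
    """Stand-in: return the collected values unchanged, as a list."""

    def __call__(self, values):
        return values


class TinyLoader:
    """Hypothetical loader: '<field>_out' overrides the default processor."""

    default_output_processor = TakeFirst()

    def __init__(self):
        self._values = {}

    def add_value(self, field, value):
        # Every field collects its values in a list, as an item loader does.
        self._values.setdefault(field, []).append(value)

    def load_item(self):
        item = {}
        for field, values in self._values.items():
            processor = getattr(self, field + "_out", self.default_output_processor)
            item[field] = processor(values)
        return item


loader = TinyLoader()
loader.image_urls_out = Identity()  # same idea as the one-line fix above
loader.add_value("title", "house between 2 hills [reverse]")
loader.add_value(
    "image_urls",
    "http://media.nga.gov/public/objects/1/3/1/3/1313-primary-0-440x400.jpg",
)

item = loader.load_item()
print(item)
```

The title still collapses to a single string through the default processor, while image_urls stays a list, which is the shape the images pipeline iterates over.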
