Value Errors When Retrieving Images With Scrapy
I'm having trouble using Scrapy's image pipeline to retrieve images. From the error reports, I think I'm feeding Scrapy the right image_urls. However, instead of downloading images from them, Scrapy returns the error: ValueError: Missing scheme in request url: h.
This is my first time using the image pipeline feature, so I suspect I'm making a simple mistake. All the same, I'd appreciate help solving it.
Below you'll find my spider, settings, items, and error output. They're not quite MWEs, but I think they're pretty simple and easy to understand all the same.
Spider:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from ngamedallions.items import NgamedallionsItem
from scrapy.loader.processors import TakeFirst
from scrapy.loader import ItemLoader
from scrapy.loader.processors import Join
from scrapy.http import Request
import re

class NGASpider(CrawlSpider):
    name = 'ngamedallions'
    allowed_domains = ['nga.gov']
    start_urls = [
        'http://www.nga.gov/content/ngaweb/collection/art-object-page.1312.html'
    ]
    rules = (
        Rule(LinkExtractor(allow=('art-object-page.*', 'objects/*')),
             callback='parse_CatalogRecord', follow=True),
    )

    def parse_CatalogRecord(self, response):
        CatalogRecord = ItemLoader(item=NgamedallionsItem(), response=response)
        CatalogRecord.default_output_processor = TakeFirst()
        keywords = 'medal|medallion'
        r = re.compile('.*(%s).*' % keywords, re.IGNORECASE | re.MULTILINE | re.UNICODE)
        if r.search(response.body_as_unicode()):
            CatalogRecord.add_xpath('title', './/dl[@class="artwork-details"]/dt[@class="title"]/text()')
            CatalogRecord.add_xpath('accession', './/dd[@class="accession"]/text()')
            CatalogRecord.add_xpath('inscription', './/div[@id="inscription"]/p/text()')
            CatalogRecord.add_xpath('image_urls', './/img[@class="mainimg"]/@src')
            return CatalogRecord.load_item()
Settings:

BOT_NAME = 'ngamedallions'
SPIDER_MODULES = ['ngamedallions.spiders']
NEWSPIDER_MODULE = 'ngamedallions.spiders'
DOWNLOAD_DELAY = 3
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = '/home/tricia/documents/programing/scrapy/ngamedallions/medallionimages'
Items:

import scrapy

class NgamedallionsItem(scrapy.Item):
    title = scrapy.Field()
    accession = scrapy.Field()
    inscription = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
Error log:
2016-04-24 19:00:40 [scrapy] INFO: Scrapy 1.0.5.post2+ga046ce8 started (bot: ngamedallions)
2016-04-24 19:00:40 [scrapy] INFO: Optional features available: ssl, http11
2016-04-24 19:00:40 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'ngamedallions.spiders', 'FEED_URI': 'items.json', 'SPIDER_MODULES': ['ngamedallions.spiders'], 'BOT_NAME': 'ngamedallions', 'FEED_FORMAT': 'json', 'DOWNLOAD_DELAY': 3}
2016-04-24 19:00:40 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2016-04-24 19:00:40 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-04-24 19:00:40 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-04-24 19:00:40 [scrapy] INFO: Enabled item pipelines: ImagesPipeline
2016-04-24 19:00:40 [scrapy] INFO: Spider opened
2016-04-24 19:00:40 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-04-24 19:00:40 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-04-24 19:00:40 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/collection/art-object-page.1312.html> (referer: None)
2016-04-24 19:00:44 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/collection/art-object-page.1.html> (referer: None)
2016-04-24 19:00:48 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/collection/art-object-page.1312.html> (referer: http://www.nga.gov/content/ngaweb/collection/art-object-page.1312.html)
2016-04-24 19:00:48 [scrapy] ERROR: Error processing {'accession': u'1942.9.163.a',
 'image_urls': u'http://media.nga.gov/public/objects/1/3/1/2/1312-primary-0-440x400.jpg',
 'inscription': u'around circumference: IOHANNES FRANCISCVS GON MA; around bottom circumference: MANTVA',
 'title': u'Gianfrancesco Gonzaga di Rodigo, 1445-1496, Lord of Bozzolo, Sabbioneta, and Viadana 1478 [obverse]'}
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/lib/pymodules/python2.7/scrapy/pipelines/media.py", line 44, in process_item
    requests = arg_to_iter(self.get_media_requests(item, info))
  File "/usr/lib/pymodules/python2.7/scrapy/pipelines/images.py", line 109, in get_media_requests
    return [Request(x) for x in item.get(self.images_urls_field, [])]
  File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 24, in __init__
    self._set_url(url)
  File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 55, in _set_url
    self._set_url(url.encode(self.encoding))
  File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 59, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: h
2016-04-24 19:00:48 [scrapy] DEBUG: Filtered duplicate request: <GET http://www.nga.gov/content/ngaweb/collection/art-object-page.1312.html> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2016-04-24 19:00:51 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/collection/art-object-page.1313.html> (referer: http://www.nga.gov/content/ngaweb/collection/art-object-page.1312.html)
2016-04-24 19:00:52 [scrapy] ERROR: Error processing {'accession': u'1942.9.163.b',
 'image_urls': u'http://media.nga.gov/public/objects/1/3/1/3/1313-primary-0-440x400.jpg',
 'inscription': u'around top circumference: TRINACRIA IANI; upper center: PELORVS; across center: PA LI; across bottom: BELAVRA',
 'title': u'House between 2 Hills [reverse]'}
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/lib/pymodules/python2.7/scrapy/pipelines/media.py", line 44, in process_item
    requests = arg_to_iter(self.get_media_requests(item, info))
  File "/usr/lib/pymodules/python2.7/scrapy/pipelines/images.py", line 109, in get_media_requests
    return [Request(x) for x in item.get(self.images_urls_field, [])]
  File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 24, in __init__
    self._set_url(url)
  File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 55, in _set_url
    self._set_url(url.encode(self.encoding))
  File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 59, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: h
2016-04-24 19:00:55 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/collection/art-object-page.1.html> (referer: http://www.nga.gov/content/ngaweb/collection/art-object-page.1.html)
2016-04-24 19:01:02 [scrapy] INFO: Closing spider (finished)
2016-04-24 19:01:02 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1609,
 'downloader/request_count': 5,
 'downloader/request_method_count/GET': 5,
 'downloader/response_bytes': 125593,
 'downloader/response_count': 5,
 'downloader/response_status_count/200': 5,
 'dupefilter/filtered': 5,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 4, 24, 23, 1, 2, 938181),
 'log_count/DEBUG': 7,
 'log_count/ERROR': 2,
 'log_count/INFO': 7,
 'request_depth_max': 2,
 'response_received_count': 5,
 'scheduler/dequeued': 5,
 'scheduler/dequeued/memory': 5,
 'scheduler/enqueued': 5,
 'scheduler/enqueued/memory': 5,
 'start_time': datetime.datetime(2016, 4, 24, 23, 0, 40, 851598)}
2016-04-24 19:01:02 [scrapy] INFO: Spider closed (finished)
The TakeFirst processor is making image_urls a string, when it should be a list.
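The stray h in the error message is the giveaway: the images pipeline builds a Request for each element of image_urls, and iterating over a bare string yields single characters. A quick sketch (plain Python, no Scrapy needed), using the image URL from the log above:

```python
# The images pipeline does roughly: [Request(x) for x in item['image_urls']].
# If image_urls is a bare string, iteration yields characters, so the first
# "URL" the pipeline tries to request is just "h".
url = 'http://media.nga.gov/public/objects/1/3/1/2/1312-primary-0-440x400.jpg'

as_string = [x for x in url]    # what the pipeline sees with TakeFirst
as_list = [x for x in [url]]    # what it should see

print(as_string[0])  # h  -> "Missing scheme in request url: h"
print(as_list[0])    # the full, usable image URL
```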
Add:

CatalogRecord.image_urls_out = lambda v: v
EDIT:

This should be:

CatalogRecord.image_urls_out = scrapy.loader.processors.Identity()
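For context, here is a minimal sketch of what the two output processors do, using hypothetical plain-Python stand-ins for Scrapy's TakeFirst and Identity so the difference is visible without running a crawl:

```python
def take_first(values):
    # Stand-in for Scrapy's TakeFirst: returns the first non-empty
    # collected value, i.e. a bare string.
    for v in values:
        if v is not None and v != '':
            return v

def identity(values):
    # Stand-in for Scrapy's Identity: returns the collected values
    # unchanged, i.e. still a list.
    return values

collected = ['http://media.nga.gov/public/objects/1/3/1/2/1312-primary-0-440x400.jpg']
print(take_first(collected))  # bare string: the pipeline will iterate its characters
print(identity(collected))    # one-element list: the pipeline gets whole URLs
```

This is why leaving TakeFirst as the default output processor is fine for scalar fields like title and accession, but image_urls needs a list-preserving processor.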