
Scrapy Exporting Weird Symbols Into Csv File

Ok, so here's the issue. I'm a beginner who has just started to delve into Scrapy/Python. I use the code below to scrape a website and save the results into a CSV file. When I look in the CSV file, though, the accented characters show up as weird symbols.

Solution 1:

Officiële will be represented as u'Offici\xeble' in Python 2, as shown in the Python shell session below (no need to worry about the \xXX characters; that's just how Python represents non-ASCII Unicode characters).

$ python
Python 2.7.9 (default, Apr  2 2015, 15:33:21)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> u'Officiële'
u'Offici\xeble'
>>> u'Offici\u00EBle'
u'Offici\xeble'
>>>

I think this is because it's saving in unicode instead of UTF-8

UTF-8 is an encoding, Unicode is not.
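To make that distinction concrete, here is a minimal Python 2 sketch of my own (not part of the original answer): a unicode object holds code points, and encoding it with UTF-8 produces a plain byte string (str).

>>> s = u'Offici\xeble'           # a unicode string (code points)
>>> b = s.encode('utf-8')         # a byte string (str), UTF-8 encoded
>>> type(s), type(b)
(<type 'unicode'>, <type 'str'>)
>>> s == b.decode('utf-8')        # decoding the bytes gives the code points back
True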

ë, a.k.a. U+00EB, a.k.a. LATIN SMALL LETTER E WITH DIAERESIS, will be UTF-8-encoded as 2 bytes, \xc3 and \xab:

>>> u'Officiële'.encode('UTF-8')
'Offici\xc3\xable'
>>>

In the csv file, it changes it to officiÃ«le.

If you see this, it probably means you need to set the input encoding to UTF-8 in the program you use to open the CSV file.
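As an aside (my own illustration, in the same Python 2 session style as above), that is exactly what happens when the UTF-8 bytes are decoded as Latin-1: the two bytes \xc3 \xab come out as the two characters Ã and «, which is why you see officiÃ«le:

>>> data = u'Officiële'.encode('utf-8')   # the UTF-8 bytes
>>> print data.decode('utf-8')            # read back as UTF-8: correct
Officiële
>>> print data.decode('latin-1')          # read back as Latin-1: mojibake
OfficiÃ«le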

Scrapy CSV exporter will write Python Unicode strings as UTF-8 encoded strings in the output file.
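In other words (a minimal sketch of my own, not Scrapy's actual exporter code, and manual.csv is just a throwaway file name), what ends up on disk are the UTF-8 bytes of each Unicode value:

>>> value = u'Offici\xeble bekendmakingen vandaag'
>>> with open('manual.csv', 'wb') as f:   # throwaway file, just to show the bytes
...     f.write('link\r\n')
...     f.write(value.encode('utf-8') + '\r\n')
...
>>> open('manual.csv', 'rb').read()
'link\r\nOffici\xc3\xable bekendmakingen vandaag\r\n'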

Scrapy selectors will output Unicode strings:

$ scrapy shell "https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4"
2016-03-15 10:44:51 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
(...)
2016-03-15 10:44:52 [scrapy] DEBUG: Crawled (200) <GET https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4> (referer: None)
(...)
In [1]: response.css('div.menu-bmslink > ul > li > a::text').extract()
Out[1]: 
[u'Offici\xeble bekendmakingen vandaag',
 u'Uitleg nieuwe nummering Handelingen vanaf 1 januari 2011',
 u'Uitleg nieuwe\r\n            nummering Staatscourant vanaf 1 juli 2009']

In [2]: for t in response.css('div.menu-bmslink > ul > li > a::text').extract():
   ...:     print t
   ...:
Officiële bekendmakingen vandaag
Uitleg nieuwe nummering Handelingen vanaf 1 januari 2011
Uitleg nieuwe
            nummering Staatscourant vanaf 1 juli 2009

Let's see what CSV output you get from a spider that extracts these strings into items:

$ cat testspider.py
import scrapy


class TestSpider(scrapy.Spider):
    name = 'testspider'
    start_urls = ['https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4']

    def parse(self, response):
        for t in response.css('div.menu-bmslink > ul > li > a::text').extract():
            yield {"link": t}

Run the spider and ask for CSV output:

$ scrapy runspider testspider.py -o test.csv
2016-03-15 11:00:13 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
2016-03-15 11:00:13 [scrapy] INFO: Optional features available: ssl, http11
2016-03-15 11:00:13 [scrapy] INFO: Overridden settings: {'FEED_FORMAT': 'csv', 'FEED_URI': 'test.csv'}
2016-03-15 11:00:14 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2016-03-15 11:00:14 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-03-15 11:00:14 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-03-15 11:00:14 [scrapy] INFO: Enabled item pipelines:
2016-03-15 11:00:14 [scrapy] INFO: Spider opened
2016-03-15 11:00:14 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-03-15 11:00:14 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-03-15 11:00:14 [scrapy] DEBUG: Crawled (200) <GET https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4> (referer: None)
2016-03-15 11:00:14 [scrapy] DEBUG: Scraped from <200 https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4>
{'link': u'Offici\xeble bekendmakingen vandaag'}
2016-03-15 11:00:14 [scrapy] DEBUG: Scraped from <200 https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4>
{'link': u'Uitleg nieuwe nummering Handelingen vanaf 1 januari 2011'}
2016-03-15 11:00:14 [scrapy] DEBUG: Scraped from <200 https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4>
{'link': u'Uitleg nieuwe\r\n            nummering Staatscourant vanaf 1 juli 2009'}
2016-03-15 11:00:14 [scrapy] INFO: Closing spider (finished)
2016-03-15 11:00:14 [scrapy] INFO: Stored csv feed (3 items) in: test.csv
2016-03-15 11:00:14 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 488,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 12018,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 3, 15, 10, 0, 14, 991735),
 'item_scraped_count': 3,
 'log_count/DEBUG': 5,
 'log_count/INFO': 8,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2016, 3, 15, 10, 0, 14, 59471)}
2016-03-15 11:00:14 [scrapy] INFO: Spider closed (finished)

Check the content of the CSV file:

$ cat test.csv
link
Officiële bekendmakingen vandaag
Uitleg nieuwe nummering Handelingen vanaf 1 januari 2011
"Uitleg nieuwe
            nummering Staatscourant vanaf 1 juli 2009"

$ hexdump -C test.csv
00000000  6c 69 6e 6b 0d 0a 4f 66  66 69 63 69 c3 ab 6c 65  |link..Offici..le|
00000010  20 62 65 6b 65 6e 64 6d  61 6b 69 6e 67 65 6e 20  | bekendmakingen |
00000020  76 61 6e 64 61 61 67 0d  0a 55 69 74 6c 65 67 20  |vandaag..Uitleg |
00000030  6e 69 65 75 77 65 20 6e  75 6d 6d 65 72 69 6e 67  |nieuwe nummering|
00000040  20 48 61 6e 64 65 6c 69  6e 67 65 6e 20 76 61 6e  | Handelingen van|
00000050  61 66 20 31 20 6a 61 6e  75 61 72 69 20 32 30 31  |af 1 januari 201|
00000060  31 0d 0a 22 55 69 74 6c  65 67 20 6e 69 65 75 77  |1.."Uitleg nieuw|
00000070  65 0d 0a 20 20 20 20 20  20 20 20 20 20 20 20 6e  |e..            n|
00000080  75 6d 6d 65 72 69 6e 67  20 53 74 61 61 74 73 63  |ummering Staatsc|
00000090  6f 75 72 61 6e 74 20 76  61 6e 61 66 20 31 20 6a  |ourant vanaf 1 j|
000000a0  75 6c 69 20 32 30 30 39  22 0d 0a                 |uli 2009"..|
000000ab

You can verify that ë is correctly encoded as the bytes c3 ab.
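You can also check this from Python instead of hexdump (a small sketch of mine, assuming test.csv is the file produced by the run above):

>>> data = open('test.csv', 'rb').read()
>>> '\xc3\xab' in data                   # the two UTF-8 bytes for ë are in the file
True
>>> u'\xeb' in data.decode('utf-8')      # and decoding as UTF-8 gives ë back
True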

I can see the file data correctly when using LibreOffice for example (notice "Character set: Unicode UTF-8"):

[Screenshot: opening test.csv in LibreOffice]

You are probably using Latin-1. Here's what you get when using Latin-1 instead of UTF-8 as the input encoding (again in LibreOffice):

[Screenshot: opening test.csv in LibreOffice with Latin-1 as the character set]
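The same applies when reading the file from Python rather than LibreOffice: Python 2's csv module hands back byte strings, and you have to decode them as UTF-8 (not Latin-1) yourself. A minimal sketch, assuming the test.csv produced above:

import csv

with open('test.csv', 'rb') as f:
    for row in csv.reader(f):
        # each cell is a UTF-8 byte string; decode it explicitly
        # (decoding with 'latin-1' here would reproduce the Ã« mojibake)
        print [cell.decode('utf-8') for cell in row]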

Solution 2:

To encode a string you can call encode("utf-8") on it directly. Something like this:

item['publicatiedatum'] = ''.join(sel.xpath('//span[contains(@property, "http://purl.org/dc/terms/available")]/text()').extract()).encode("utf-8")
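For context, here is a sketch of how that line might sit inside a spider callback. The field name and XPath come from the line above; the spider name is made up, and response.xpath stands in for the question's sel.xpath:

import scrapy


class PublicatieSpider(scrapy.Spider):  # hypothetical spider, for illustration only
    name = 'publicaties'
    start_urls = ['https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4']

    def parse(self, response):
        item = {}
        # join the extracted text nodes and encode the result as UTF-8 bytes
        item['publicatiedatum'] = ''.join(
            response.xpath('//span[contains(@property, "http://purl.org/dc/terms/available")]/text()').extract()
        ).encode("utf-8")
        yield item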
