Installing the Python Crawler Framework Scrapy


Abstract: Notes on the Scrapy installation process; the installation on Windows in particular runs into a number of problems.

Installing Scrapy on Linux

Installation

Running pip install scrapy directly against the default index often times out when reaching the overseas source, so it is recommended to use a domestic mirror for a fast installation.
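To avoid passing -i on every invocation, the mirror can also be persisted in pip's per-user configuration file; a minimal sketch (~/.pip/pip.conf is the legacy location used by older pip versions like the one in this session, suggested by the .pip working directory in the prompts below):

[global]
index-url = https://pypi.tuna.tsinghua.edu.cn/simple

With the mirror given explicitly on the command line, the first attempt looks like this: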

[root@mysql .pip]# pip install scrapy -i https://pypi.tuna.tsinghua.edu.cn/simple
Collecting scrapy
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/a8/96/3affe11cf53a5d2105536919113d5b453479038bb486f7387f4ce4a3b83f/Scrapy-1.4.0-py2.py3-none-any.whl (248kB)
100% |████████████████████████████████| 256kB 2.0MB/s
Collecting queuelib (from scrapy)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/16/4f/b307fc978a21bfbb138e8e01a9f4953191d439e30578f5e4fd5befa77de1/queuelib-1.4.2-py2.py3-none-any.whl
Collecting w3lib>=1.17.0 (from scrapy)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/20/3e/ba9865b88c39edd09100a8c8df11722c8881bbf76aef0c0ae5b970eb42b7/w3lib-1.17.0-py2.py3-none-any.whl
Collecting cssselect>=0.9 (from scrapy)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/1d/e5/f1d410192e34b1034dba7804de5dbcdece20a883c445ad661e5ea8226b42/cssselect-1.0.1-py2.py3-none-any.whl
Collecting lxml (from scrapy)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/62/b7/aafdcf0c0ad0cf36a0835adde50f4a7e18241440b9897a88c80f520d0c76/lxml-3.8.0-cp27-cp27m-manylinux1_x86_64.whl (6.8MB)
100% |████████████████████████████████| 6.8MB 193kB/s
Collecting parsel>=1.1 (from scrapy)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/d0/bd/c5c3cf9c490d328a1d1e5e942c3a2b84d6934d5666e9d4bcfc2f83e7dbdd/parsel-1.2.0-py2.py3-none-any.whl
Collecting service-identity (from scrapy)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/29/fa/995e364220979e577e7ca232440961db0bf996b6edaf586a7d1bd14d81f1/service_identity-17.0.0-py2.py3-none-any.whl
Collecting six>=1.5.2 (from scrapy)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/c8/0a/b6723e1bc4c516cb687841499455a8505b44607ab535be01091c0f24f079/six-1.10.0-py2.py3-none-any.whl
Collecting Twisted>=13.1.0 (from scrapy)
Could not find a version that satisfies the requirement Twisted>=13.1.0 (from scrapy) (from versions: )
No matching distribution found for Twisted>=13.1.0 (from scrapy)

The error says Twisted cannot be found, so try installing it with pip on its own:

[root@mysql .pip]# pip install Twisted
Collecting Twisted
Could not find a version that satisfies the requirement Twisted (from versions: )
No matching distribution found for Twisted

pip cannot find it either; the only option is to download Twisted separately and install it from source:

[root@mysql .pip]# wget https://twistedmatrix.com/Releases/Twisted/17.1/Twisted-17.1.0.tar.bz2
[root@mysql .pip]# tar -jxvf Twisted-17.1.0.tar.bz2
[root@mysql .pip]# cd Twisted-17.1.0
[root@mysql Twisted-17.1.0]# python setup.py install
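Before retrying Scrapy, it is worth a quick check that the source build is importable; a minimal sketch (twisted.version is the package's version object):

[root@mysql Twisted-17.1.0]# python -c "import twisted; print(twisted.version)"

If this prints the installed version (17.1.0 here) without an ImportError, the dependency is in place.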

Then installing Scrapy again succeeds:

[root@mysql Twisted-17.1.0]# pip install scrapy -i https://pypi.tuna.tsinghua.edu.cn/simple
Collecting scrapy
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/a8/96/3affe11cf53a5d2105536919113d5b453479038bb486f7387f4ce4a3b83f/Scrapy-1.4.0-py2.py3-none-any.whl
Collecting queuelib (from scrapy)
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/16/4f/b307fc978a21bfbb138e8e01a9f4953191d439e30578f5e4fd5befa77de1/queuelib-1.4.2-py2.py3-none-any.whl
Collecting w3lib>=1.17.0 (from scrapy)
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/20/3e/ba9865b88c39edd09100a8c8df11722c8881bbf76aef0c0ae5b970eb42b7/w3lib-1.17.0-py2.py3-none-any.whl
Collecting cssselect>=0.9 (from scrapy)
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/1d/e5/f1d410192e34b1034dba7804de5dbcdece20a883c445ad661e5ea8226b42/cssselect-1.0.1-py2.py3-none-any.whl
Collecting lxml (from scrapy)
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/62/b7/aafdcf0c0ad0cf36a0835adde50f4a7e18241440b9897a88c80f520d0c76/lxml-3.8.0-cp27-cp27m-manylinux1_x86_64.whl
Collecting parsel>=1.1 (from scrapy)
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/d0/bd/c5c3cf9c490d328a1d1e5e942c3a2b84d6934d5666e9d4bcfc2f83e7dbdd/parsel-1.2.0-py2.py3-none-any.whl
Collecting service-identity (from scrapy)
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/29/fa/995e364220979e577e7ca232440961db0bf996b6edaf586a7d1bd14d81f1/service_identity-17.0.0-py2.py3-none-any.whl
Requirement already satisfied: six>=1.5.2 in /usr/local/lib/python2.7/site-packages/six-1.10.0-py2.7.egg (from scrapy)
Requirement already satisfied: Twisted>=13.1.0 in /usr/local/lib/python2.7/site-packages/Twisted-17.1.0-py2.7-linux-x86_64.egg (from scrapy)
Collecting PyDispatcher>=2.0.5 (from scrapy)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/cd/37/39aca520918ce1935bea9c356bcbb7ed7e52ad4e31bff9b943dfc8e7115b/PyDispatcher-2.0.5.tar.gz
Collecting pyOpenSSL (from scrapy)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/d0/39/7730559b75b565fd6983d857776fcb4982afb0e8faddb06170e59b62b41c/pyOpenSSL-17.1.0-py2.py3-none-any.whl (53kB)
100% |████████████████████████████████| 61kB 1.5MB/s
Collecting pyasn1 (from service-identity->scrapy)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/a5/ae/6b4c4cb9420edddd7401782f55504130d1269f2e5ae3ba3c986da167dd6c/pyasn1-0.2.3-py2.py3-none-any.whl (53kB)
100% |████████████████████████████████| 61kB 15.1MB/s
Requirement already satisfied: attrs in /usr/local/lib/python2.7/site-packages/attrs-17.2.0-py2.7.egg (from service-identity->scrapy)
Collecting pyasn1-modules (from service-identity->scrapy)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/5b/a4/d4934b1b9d28541e37fa354a7dd3c3d45d19d92196df127e1342420a0ae6/pyasn1_modules-0.0.9-py2.py3-none-any.whl (60kB)
100% |████████████████████████████████| 61kB 7.0MB/s
Requirement already satisfied: zope.interface>=3.6.0 in /usr/local/lib/python2.7/site-packages/zope.interface-4.4.2-py2.7-linux-x86_64.egg (from Twisted>=13.1.0->scrapy)
Requirement already satisfied: constantly>=15.1 in /usr/local/lib/python2.7/site-packages/constantly-15.1.0-py2.7.egg (from Twisted>=13.1.0->scrapy)
Requirement already satisfied: incremental>=16.10.1 in ./.eggs/incremental-17.5.0-py2.7.egg (from Twisted>=13.1.0->scrapy)
Requirement already satisfied: Automat>=0.3.0 in /usr/local/lib/python2.7/site-packages/Automat-0.6.0-py2.7.egg (from Twisted>=13.1.0->scrapy)
Collecting cryptography>=1.9 (from pyOpenSSL->scrapy)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/2a/0c/31bd69469e90035381f0197b48bf71032991d9f07a7e444c311b4a23a3df/cryptography-1.9.tar.gz (409kB)
100% |████████████████████████████████| 419kB 3.9MB/s
Requirement already satisfied: setuptools in /usr/local/lib/python2.7/site-packages/setuptools-19.4-py2.7.egg (from zope.interface>=3.6.0->Twisted>=13.1.0->scrapy)
Collecting idna>=2.1 (from cryptography>=1.9->pyOpenSSL->scrapy)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/11/7d/9bbbd7bb35f34b0169542487d2a8859e44306bb2e6a4455d491800a5621f/idna-2.5-py2.py3-none-any.whl (55kB)
100% |████████████████████████████████| 61kB 10.2MB/s
Collecting asn1crypto>=0.21.0 (from cryptography>=1.9->pyOpenSSL->scrapy)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/97/ba/7e8117d8efcee589f4d96dd2b2eb1d997f96d27d214cf2b7134ad8acf6ab/asn1crypto-0.22.0-py2.py3-none-any.whl (97kB)
100% |████████████████████████████████| 102kB 13.8MB/s
Collecting enum34 (from cryptography>=1.9->pyOpenSSL->scrapy)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/c5/db/e56e6b4bbac7c4a06de1c50de6fe1ef3810018ae11732a50f15f62c7d050/enum34-1.1.6-py2-none-any.whl
Collecting ipaddress (from cryptography>=1.9->pyOpenSSL->scrapy)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/17/93/28f4dd560780dd70fe75ce7e2662869770dfac181f6bbb472179ea8da516/ipaddress-1.0.18-py2-none-any.whl
Collecting cffi>=1.7 (from cryptography>=1.9->pyOpenSSL->scrapy)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/90/aa/bae1c4627e3e3f631fb8e946da040f36931af86917f54e279ad6f4b29641/cffi-1.10.0-cp27-cp27m-manylinux1_x86_64.whl (394kB)
100% |████████████████████████████████| 399kB 3.0MB/s
Collecting pycparser (from cffi>=1.7->cryptography>=1.9->pyOpenSSL->scrapy)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/8c/2d/aad7f16146f4197a11f8e91fb81df177adcc2073d36a17b1491fd09df6ed/pycparser-2.18.tar.gz (245kB)
100% |████████████████████████████████| 256kB 6.4MB/s
Installing collected packages: queuelib, w3lib, cssselect, lxml, parsel, idna, asn1crypto, enum34, ipaddress, pycparser, cffi, cryptography, pyOpenSSL, pyasn1, pyasn1-modules, service-identity, PyDispatcher, scrapy
Running setup.py install for pycparser ... done
Running setup.py install for cryptography ... done
Running setup.py install for PyDispatcher ... done
Successfully installed PyDispatcher-2.0.5 asn1crypto-0.22.0 cffi-1.10.0 cryptography-1.9 cssselect-1.0.1 enum34-1.1.6 idna-2.5 ipaddress-1.0.18 lxml-3.8.0 parsel-1.2.0 pyOpenSSL-17.1.0 pyasn1-0.2.3 pyasn1-modules-0.0.9 pycparser-2.18 queuelib-1.4.2 scrapy-1.4.0 service-identity-17.0.0 w3lib-1.17.0

Verification

[root@mysql Twisted-17.1.0]# scrapy version
Scrapy 1.4.0
[root@mysql Twisted-17.1.0]# cd
[root@mysql ~]# python
Python 2.7.10 (default, Jan 18 2016, 17:00:09)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-11)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import scrapy
>>> exit()
[root@mysql ~]#
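Beyond a bare import, a tiny spider exercises the download and parsing stack end to end. The following is a minimal sketch (the file name and target URL are illustrative, not part of the original walkthrough):

# minimal_spider.py -- smoke test for a fresh Scrapy install
import scrapy

class DemoSpider(scrapy.Spider):
    name = "demo"
    start_urls = ["http://example.com"]

    def parse(self, response):
        # extract_first() is the selector API of the Scrapy 1.4 era
        yield {"title": response.css("title::text").extract_first()}

Running scrapy runspider minimal_spider.py -o items.json should produce one item containing the page title, confirming that downloading, parsing, and item export all work.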

Installing Scrapy on Windows 10

Installation Steps

On Windows you first need to download OpenSSL and copy its include directory into the VC directory. The exact VC path differs from environment to environment, but it can be read out of the installation error log. After working through all kinds of errors, the correct installation steps are:
1) Download OpenSSL from https://ci.cryptography.io/job/cryptography-support-jobs/job/openssl-release-1.1/
If your Python 2.7 is 32-bit, download openssl-1.1.0f-2010-x86.zip; if it is 64-bit, download openssl-1.1.0f-2010-x86_64.zip. Make absolutely sure that the Python and OpenSSL bitness match (a quick way to check is shown after this list).
2) Copy the folders under openssl-win32-2010\include to C:\Users\Administrator\AppData\Local\Programs\Common\Microsoft\Visual C++ for Python\9.0\VC\include.
3) Copy the files under openssl-win32-2010\lib to C:\Python27\libs.
4) pip install scrapy -i https://pypi.tuna.tsinghua.edu.cn/simple
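To check the interpreter's bitness before picking an OpenSSL build, a one-liner such as the following can be used (a sketch; it prints the pointer width of the running Python, 32 or 64):

C:\>python -c "import struct; print(struct.calcsize('P') * 8)"
32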

Errors Encountered and Solutions

Error 1:

C:\Users\Administrator\AppData\Local\Programs\Common\Microsoft\Visual C++ for Py
thon\9.0\VC\Bin\cl.exe /c /nologo /Ox /MD /W3 /GS- /DNDEBUG -IC:\Python27\includ
e -IC:\Python27\PC /Tcbuild\temp.win32-2.7\Release\_openssl.c /Fobuild\temp.win3
2-2.7\Release\build\temp.win32-2.7\Release\_openssl.obj
_openssl.c
build\temp.win32-2.7\Release\_openssl.c(434) : fatal error C1083: Cannot open in
clude file: 'openssl/opensslv.h': No such file or directory
error: command 'C:\\Users\\Administrator\\AppData\\Local\\Programs\\Common\\Micr
osoft\\Visual C++ for Python\\9.0\\VC\\Bin\\cl.exe' failed with exit status 2
----------------------------------------
Cleaning up...
Command C:\Python27\python.exe -c "import setuptools, tokenize;__file__='c:\\use
rs\\admini~1\\appdata\\local\\temp\\pip_build_Administrator\\cryptography\\setup
.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n
', '\n'), __file__, 'exec'))" install --record c:\users\admini~1\appdata\local\t
emp\pip-7as3vx-record\install-record.txt --single-version-externally-managed --c
ompile failed with error code 1 in c:\users\admini~1\appdata\local\temp\pip_buil
d_Administrator\cryptography
Storing debug log for failure in C:\Users\Administrator\pip\pip.log

Solution: Download OpenSSL (openssl-1.1.0f-2010-x86.zip if your Python 2.7 is 32-bit, openssl-1.1.0f-2010-x86_64.zip if it is 64-bit).
Download URL: https://ci.cryptography.io/job/cryptography-support-jobs/job/openssl-release-1.1/
After downloading, unpack it and copy the folders under openssl-win32-2010\include to C:\Users\Administrator\AppData\Local\Programs\Common\Microsoft\Visual C++ for Python\9.0\VC\include.
Install again; this leads to error 2.
Error 2:

C:\Users\Administrator\AppData\Local\Programs\Common\Microsoft\Visual C++ for Python\9.0\VC\Bin\link.exe /DLL /nologo /INCREMENTAL:NO /LIBPATH:C:\Python27\libs /LIBPATH:C:\Python27\PCbuild libssl.lib libcrypto.lib advapi32.lib crypt32.lib gdi32.lib user32.lib ws2_32.lib /EXPORT:init_openssl build\temp.win32-2.7\Release\build\temp.win32-2.7\Release\_openssl.obj /OUT:build\lib.win32-2.7\cryptography\hazmat\bindings\_openssl.pyd /IMPLIB:build\temp.win32-2.7\Release\build\temp.win32-2.7\Release\_openssl.lib /MANIFESTFILE:build\temp.win32-2.7\Release\build\temp.win32-2.7\Release\_openssl.pyd.manifest /NXCOMPAT /DYNAMICBASE
LINK : fatal error LNK1181: cannot open input file 'libssl.lib'
error: command 'C:\\Users\\Administrator\\AppData\\Local\\Programs\\Common\\Microsoft\\Visual C++ for Python\\9.0\\VC\\Bin\\link.exe' failed with exit status 1181

Solution: Copy the files under openssl-win32-2010\lib to C:\Python27\libs, then install again and it succeeds.
If you mistakenly copy openssl-win64-2010\lib into C:\Python27\libs instead, you will also hit the following error:

build\lib.win32-2.7\cryptography\hazmat\bindings\_openssl.pyd : fatal error LNK1120: 1037 unresolved externals
error: command 'C:\\Users\\Administrator\\AppData\\Local\\Programs\\Common\\Microsoft\\Visual C++ for Python\\9.0\\VC\\Bin\\link.exe' failed with exit status 1120

So, once again, make absolutely sure that the Python and OpenSSL bitness match.
Error 3:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xb0 in position 1: ordinal not in range(128)

Solution: Open the mimetypes.py file under C:\Python27\Lib and find the line 'default_encoding = sys.getdefaultencoding()' at roughly line 256 (Notepad++'s search works well for this).
Add three lines in front of it:

# reload(sys) restores sys.setdefaultencoding, which site.py removes at startup
if sys.getdefaultencoding() != 'gbk':
    reload(sys)
    sys.setdefaultencoding('gbk')
default_encoding = sys.getdefaultencoding()
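Patching a standard-library file is lost on every reinstall; a less invasive variant of the same hack, offered here as a sketch, is to put the encoding change into a sitecustomize.py module, which Python 2 imports automatically at startup (the path below is the conventional location, assumed rather than taken from the original):

# C:\Python27\Lib\site-packages\sitecustomize.py
import sys
reload(sys)  # re-exposes sys.setdefaultencoding, which site.py removes
sys.setdefaultencoding('gbk')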

Verification

C:\>python
Python 2.7.9 (default, Dec 10 2014, 12:24:55) [MSC v.1500 32 bit (Intel)] on win
32
Type "help", "copyright", "credits" or "license" for more information.
>>> import scrapy
>>> exit()
C:\>scrapy version
Scrapy 1.4.0
C:\>scrapy bench
2017-07-11 09:51:30 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybo
t)
2017-07-11 09:51:30 [scrapy.utils.log] INFO: Overridden settings: {'CLOSESPIDER_
TIMEOUT': 10, 'LOG_LEVEL': 'INFO', 'LOGSTATS_INTERVAL': 1}
2017-07-11 09:51:32 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.closespider.CloseSpider',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2017-07-11 09:51:33 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-07-11 09:51:33 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-07-11 09:51:33 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-07-11 09:51:33 [scrapy.core.engine] INFO: Spider opened
2017-07-11 09:51:33 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pag
es/min), scraped 0 items (at 0 items/min)
2017-07-11 09:51:34 [scrapy.extensions.logstats] INFO: Crawled 61 pages (at 3660
pages/min), scraped 0 items (at 0 items/min)
2017-07-11 09:51:35 [scrapy.extensions.logstats] INFO: Crawled 133 pages (at 432
0 pages/min), scraped 0 items (at 0 items/min)
2017-07-11 09:51:36 [scrapy.extensions.logstats] INFO: Crawled 197 pages (at 384
0 pages/min), scraped 0 items (at 0 items/min)
2017-07-11 09:51:37 [scrapy.extensions.logstats] INFO: Crawled 262 pages (at 390
0 pages/min), scraped 0 items (at 0 items/min)
2017-07-11 09:51:38 [scrapy.extensions.logstats] INFO: Crawled 334 pages (at 432
0 pages/min), scraped 0 items (at 0 items/min)
2017-07-11 09:51:39 [scrapy.extensions.logstats] INFO: Crawled 405 pages (at 426
0 pages/min), scraped 0 items (at 0 items/min)
2017-07-11 09:51:40 [scrapy.extensions.logstats] INFO: Crawled 470 pages (at 390
0 pages/min), scraped 0 items (at 0 items/min)
2017-07-11 09:51:41 [scrapy.extensions.logstats] INFO: Crawled 533 pages (at 378
0 pages/min), scraped 0 items (at 0 items/min)
2017-07-11 09:51:42 [scrapy.extensions.logstats] INFO: Crawled 605 pages (at 432
0 pages/min), scraped 0 items (at 0 items/min)
2017-07-11 09:51:43 [scrapy.core.engine] INFO: Closing spider (closespider_timeo
ut)
2017-07-11 09:51:43 [scrapy.extensions.logstats] INFO: Crawled 670 pages (at 390
0 pages/min), scraped 0 items (at 0 items/min)
2017-07-11 09:51:44 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 212136,
'downloader/request_count': 686,
'downloader/request_method_count/GET': 686,
'downloader/response_bytes': 1017138,
'downloader/response_count': 686,
'downloader/response_status_count/200': 686,
'dupefilter/filtered': 879,
'finish_reason': 'closespider_timeout',
'finish_time': datetime.datetime(2017, 7, 11, 1, 51, 44, 289000),
'log_count/INFO': 17,
'request_depth_max': 24,
'response_received_count': 686,
'scheduler/dequeued': 686,
'scheduler/dequeued/memory': 686,
'scheduler/enqueued': 12841,
'scheduler/enqueued/memory': 12841,
'start_time': datetime.datetime(2017, 7, 11, 1, 51, 33, 886000)}
2017-07-11 09:51:44 [scrapy.core.engine] INFO: Spider closed (closespider_timeou
t)

References

https://pypi.python.org/pypi/Scrapy
https://cryptography.io/en/latest/installation/#on-windows
http://blog.csdn.net/zzk1995/article/details/51924510
http://bbs.chinaunix.net/thread-4251968-1-1.html
