Python Study Notes

Preface

My graduation project needs a backend crawler, so I started learning Python. Since I'm not studying it systematically, these notes may be somewhat scattered; when I have time I'll organize them and write down the important points, and also fill in the basics.

Installing Python

Upgrading pip

pip3 install --upgrade pip

Installing a Python virtual environment

(sudo) pip3 install virtualenv virtualenvwrapper

Edit ~/.bash_profile and add the following lines:

export WORKON_HOME=$HOME/.virtualenvs
export PROJECT_HOME=$HOME/workspace
source /usr/local/bin/virtualenvwrapper.sh

Make the change take effect immediately (or restart the terminal):

source ~/.bash_profile

Basic usage

1. Create a virtual development environment

mkvirtualenv zqxt: create the environment zqxt

workon zqxt: work in the zqxt environment, or switch to it from another environment

deactivate: leave the current virtual environment

Others:
rmvirtualenv ENV: delete the environment ENV

mkproject mic: create the project mic together with the environment mic

mktmpenv: create a temporary environment

lsvirtualenv: list the available environments

lssitepackages: list the packages installed in the current environment

Environments are isolated from one another, and packages can be managed with pip inside them without sudo.

Once an environment named test_env has been created, a test_env folder appears (under $WORKON_HOME when using virtualenvwrapper, or in the current directory with plain virtualenv); inside it you will find the following tree — pretty neat, right?

├── bin
├── include
│   └── python2.7
├── lib
│   └── python2.7       // all newly installed packages end up here
│       ├── distutils
│       ├── encodings
│       ├── lib-dynload
│       └── site-packages
├── local
│   ├── bin
│   ├── include
│   └── lib
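
To confirm which environment is currently active, a quick check from inside the interpreter (a minimal sketch):

import sys
print(sys.prefix)       # points inside the active virtualenv's directory
print(sys.executable)   # the interpreter under the env's bin/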

Installing Django

pip3 install Django, or pin a specific version: pip3 install Django==1.10.6

Type python3 in the terminal and press Enter to start the interactive interpreter:

>>> import django
>>> django.VERSION
(1, 8, 16, 'final', 0)
>>> 
>>> django.get_version()
'1.8.16'

This shows the installed Django version number.

Installing django-celery to set up scheduled tasks

pip3 install django-celery
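
A minimal sketch of the scheduling setup with django-celery: djcelery.setup_loader() hooks Celery into Django, and CELERYBEAT_SCHEDULE defines the periodic tasks. The broker URL and the myapp.tasks.crawl task name here are assumptions, not part of the original notes:

# settings.py (sketch; also add 'djcelery' to INSTALLED_APPS)
import djcelery
from celery.schedules import crontab

djcelery.setup_loader()

BROKER_URL = 'redis://localhost:6379/0'    # assumed local Redis broker

CELERYBEAT_SCHEDULE = {
    'crawl-every-morning': {
        'task': 'myapp.tasks.crawl',       # hypothetical crawl task
        'schedule': crontab(hour=6, minute=0),
    },
}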

Installing PIL (Python Imaging Library)

An extension package for image processing:
brew install jpeg  # install the libjpeg dependency
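
brew install jpeg only provides the JPEG library that PIL links against; the Python package itself is normally installed as Pillow, the maintained PIL fork (an assumption about the intended setup here):

pip3 install Pillow

A quick sanity check (the filename is a placeholder):

from PIL import Image

img = Image.open('example.jpg')         # hypothetical input file
print(img.size)                         # (width, height)
img.thumbnail((128, 128))               # shrink in place, keeping aspect ratio
img.save('example_thumb.jpg', 'JPEG')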

Installing Django-Dynamic-Scraper (DDS)

pip3 install django-dynamic-scraper

pip3 install scrapy-splash
pip3 install scrapy-djangoitem
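
scrapy-djangoitem lets a Scrapy item write directly into a Django model, which is what DDS builds on. A minimal sketch (the news app and Article model are hypothetical):

# items.py (sketch)
from scrapy_djangoitem import DjangoItem

from news.models import Article    # hypothetical Django model

class ArticleItem(DjangoItem):
    django_model = Article         # item fields are derived from the model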

Installing Scrapy

pip3 install Scrapy, or pin a specific version: pip3 install scrapy==1.3.3
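
As with Django, the install can be checked from the interpreter; scrapy.__version__ should print whatever version was installed, e.g. the pinned one:

>>> import scrapy
>>> scrapy.__version__
'1.3.3'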

Installing chardet to detect page encodings

pip3 install chardet  # install chardet

import sys
import urllib.request

import chardet

def get_html(url):
    page = urllib.request.urlopen(url)
    content = page.read()
    return content

url = 'http://www.xujc.com.cn/'    # example URL from the crawler below
print(sys.getfilesystemencoding())                           # local filesystem encoding
print('HTML encoding: %s' % chardet.detect(get_html(url)))   # detected page encoding
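
chardet.detect returns a dict with the guessed encoding and a confidence score, which can then be used to decode the raw bytes (a sketch building on get_html above):

raw = get_html(url)
info = chardet.detect(raw)                        # e.g. {'encoding': 'GB2312', 'confidence': 0.99, ...}
text = raw.decode(info['encoding'] or 'utf-8')    # fall back to utf-8 if detection fails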

Python-Crawler

The full spider file

import scrapy
from scrapy.selector import Selector

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://www.xujc.com.cn/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//table')
        # for site in sites:
        #     title = site.xpath('tr/td').extract()
        #     print(site)
        #     print(title)

        # filename = 'school-%s.html' % 1
        # with open(filename, 'wb') as f:
        #     f.write(response.body)
        # self.log('Saved file %s' % filename)
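
Assuming the spider lives in a standard Scrapy project, it is run from the project root by the name defined in its name attribute:

scrapy crawl quotes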

Focusing on links and titles

import scrapy
from scrapy.selector import Selector

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://www.xujc.com.cn/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//table')

        title = sites[16].xpath('tr/td/a/text() | tr/td/a/@href | tr/td/text()').extract()
        print(sites[16])
        print(title)
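
The union XPath above returns link URLs, link text, and plain cell text interleaved in a single flat list. A clearer variant (a sketch, assuming each link has text) queries them separately and pairs them up:

links = sites[16].xpath('tr/td/a/@href').extract()
titles = sites[16].xpath('tr/td/a/text()').extract()
for href, text in zip(links, titles):
    print(text, '->', href)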

Date and time

A fragment for inside parse, replacing the extraction above:

title = sites[25].xpath('tr/td/table/tr/td[@id="zb"]/table/tr/td/span/text()').extract()
print(sites[25])
print(title)

Notices and announcements

import scrapy
from scrapy.selector import Selector

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://www.xujc.com.cn/index.php?c=Article&a=idxnews&lx=notice',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('/html/body/table/tr')
        for site in sites:
            title = site.xpath('td/a/@href | td/a/text()').extract()
            print(site)
            print(title)

News center

import scrapy
from scrapy.selector import Selector

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://www.xujc.com.cn/index.php?c=Article&a=idxnews&lx=news',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('/html/body/ul/li')
        for site in sites:
            title = site.xpath('a/@href | a/text()').extract()
            print(site)
            print(title)
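
Instead of printing, parse can yield structured items for Scrapy to collect; a sketch using the same XPaths (extract_first returns the first match or None):

    def parse(self, response):
        for site in response.xpath('/html/body/ul/li'):
            yield {
                'title': site.xpath('a/text()').extract_first(),
                'url': site.xpath('a/@href').extract_first(),
            }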