Python網絡爬蟲(十二)——MySQL

簡介

MySQL 是一種關係型數據庫管理系統,關係數據庫將數據保存在不同的表中,而不是將所有數據放在一個大倉庫內,這樣就增加了速度並提高了靈活性。

GUI

有時候使用 cmd 進行數據庫操作不太方便,因此會藉助於 GUI 來進行數據庫的編輯和修改,常見用於 MySQL 的 GUI 工具有 workbench 和 navicat,前者是官方出的 GUI。

python 庫

如果我們想要將爬取到的數據以數據庫的形式存儲時,就需要 python 和 MySQL 的連接。常見的驅動 MySQL 的 python 庫有 mysqldb,mysqlclient,pymysql 等。按照自己 python 的安裝版本自行選擇合適的即可。

數據庫連接

import pymysql

db = pymysql.connect(
    host='localhost',
    user='root',
    password='password',
    database='pymysql',
    port=3306
)

cursor = db.cursor()
for i in range(cursor.execute('show databases')):
    print(cursor.fetchone())

db.close()

結果爲:

('information_schema',)
('mysql',)
('performance_schema',)
('pymysql',)
('sys',)

數據操作

數據插入

import pymysql

db = pymysql.connect(
    host='localhost',
    user='root',
    password='password',
    database='pymysql',
    port=3306
)
command = """
insert into sheet1(id,name) value(105,'qianduo')
"""

cursor = db.cursor()

for i in range(cursor.execute('select * from sheet1')):
    print(cursor.fetchone())

cursor.execute(command)
db.commit()

for i in range(cursor.execute('select * from sheet1')):
    print(cursor.fetchone())

db.close()

結果爲:

(101, 'zhangsan')
(102, 'lisi')
(103, 'wangwu')
(104, 'zhaoliu')
(101, 'zhangsan')
(102, 'lisi')
(103, 'wangwu')
(104, 'zhaoliu')
(105, 'qianduo')

除了上面提到的執行命令的方式,也可以將參數放到命令外部執行:

import pymysql

db = pymysql.connect(
    host='localhost',
    user='root',
    password='password',
    database='pymysql',
    port=3306
)

# 此時 value 中的參數類型應該都是 %s 而不管是否真的是 str,內部會進行處理
command = """
insert into sheet1(id,name) value(%s,%s)
"""

cursor = db.cursor()

cursor.execute(command,(106,'sunjie'))
db.commit()

for i in range(cursor.execute('select * from sheet1')):
    print(cursor.fetchone())

db.close()

結果爲:

(101, 'zhangsan')
(102, 'lisi')
(103, 'wangwu')
(104, 'zhaoliu')
(105, 'qianduo')
(106, 'sunjie')

數據獲取

  • fetchone:每次獲取一條數據
  • fetchall:接受全部的返回結果
  • fetchmany:獲取指定條數的數據
import pymysql

db = pymysql.connect(
    host='localhost',
    user='root',
    password='password',
    database='pymysql',
    port=3306
)
command = """
select * from sheet1
"""

cursor = db.cursor()

for i in range(cursor.execute(command)):
    print(cursor.fetchone())
print("***************")
cursor.execute(command)
print(cursor.fetchall())
print("***************")
cursor.execute(command)
print(cursor.fetchmany(3))

db.close()

結果爲:

(101, 'zhangsan')
(102, 'lisi')
(103, 'wangwu')
(104, 'zhaoliu')
(105, 'qianduo')
(106, 'sunjie')
***************
((101, 'zhangsan'), (102, 'lisi'), (103, 'wangwu'), (104, 'zhaoliu'), (105, 'qianduo'), (106, 'sunjie'))
***************
((101, 'zhangsan'), (102, 'lisi'), (103, 'wangwu'))

數據刪除

import pymysql

db = pymysql.connect(
    host='localhost',
    user='root',
    password='2602388671',
    database='pymysql',
    port=3306
)
delete_command = """
delete from sheet1 where id=101
"""
select_command = """
select * from sheet1
"""

cursor = db.cursor()
for i in range(cursor.execute(select_command)):
    print(cursor.fetchone())

print('*********************')
cursor.execute(delete_command)

for i in range(cursor.execute(select_command)):
    print(cursor.fetchone())

db.commit()
db.close()

結果爲:

(101, 'zhangsan')
(102, 'lisi')
(103, 'wangwu')
(104, 'zhaoliu')
(105, 'qianduo')
(106, 'sunjie')
*********************
(102, 'lisi')
(103, 'wangwu')
(104, 'zhaoliu')
(105, 'qianduo')
(106, 'sunjie')

數據更新

import pymysql

db = pymysql.connect(
    host='localhost',
    user='root',
    password='2602388671',
    database='pymysql',
    port=3306
)
command = """
update sheet1 set name='gaojie' where id=102
"""
select_command = """
select * from sheet1
"""

cursor = db.cursor()
cursor.execute(command)

for i in range(cursor.execute(select_command)):
    print(cursor.fetchone())
db.commit()
db.close()

操作步驟

  • 首先利用 connect 創建一個 Connection 類對象,進行數據庫連接:
class Connection(object):
    """
    Representation of a socket with a mysql server.

    The proper way to get an instance of this class is to call
    connect().

    Establish a connection to the MySQL database. Accepts several
    arguments:

    :param host: Host where the database server is located
    :param user: Username to log in as
    :param password: Password to use.
    :param database: Database to use, None to not use a particular one.
    :param port: MySQL port to use, default is usually OK. (default: 3306)
    :param bind_address: When the client has multiple network interfaces, specify
        the interface from which to connect to the host. Argument can be
        a hostname or an IP address.
    :param unix_socket: Optionally, you can use a unix socket rather than TCP/IP.
    :param read_timeout: The timeout for reading from the connection in seconds (default: None - no timeout)
    :param write_timeout: The timeout for writing to the connection in seconds (default: None - no timeout)
    :param charset: Charset you want to use.
    :param sql_mode: Default SQL_MODE to use.
    :param read_default_file:
        Specifies  my.cnf file to read these parameters from under the [client] section.
    :param conv:
        Conversion dictionary to use instead of the default one.
        This is used to provide custom marshalling and unmarshaling of types.
        See converters.
    :param use_unicode:
        Whether or not to default to unicode strings.
        This option defaults to true for Py3k.
    :param client_flag: Custom flags to send to MySQL. Find potential values in constants.CLIENT.
    :param cursorclass: Custom cursor class to use.
    :param init_command: Initial SQL statement to run when connection is established.
    :param connect_timeout: Timeout before throwing an exception when connecting.
        (default: 10, min: 1, max: 31536000)
    :param ssl:
        A dict of arguments similar to mysql_ssl_set()'s parameters.
    :param read_default_group: Group to read from in the configuration file.
    :param compress: Not supported
    :param named_pipe: Not supported
    :param autocommit: Autocommit mode. None means use server default. (default: False)
    :param local_infile: Boolean to enable the use of LOAD DATA LOCAL command. (default: False)
    :param max_allowed_packet: Max size of packet sent to server in bytes. (default: 16MB)
        Only used to limit size of "LOAD LOCAL INFILE" data packet smaller than default (16KB).
    :param defer_connect: Don't explicitly connect on contruction - wait for connect call.
        (default: False)
    :param auth_plugin_map: A dict of plugin names to a class that processes that plugin.
        The class will take the Connection object as the argument to the constructor.
        The class needs an authenticate method taking an authentication packet as
        an argument.  For the dialog plugin, a prompt(echo, prompt) method can be used
        (if no authenticate method) for returning a string from the user. (experimental)
    :param server_public_key: SHA256 authenticaiton plugin public key value. (default: None)
    :param db: Alias for database. (for compatibility to MySQLdb)
    :param passwd: Alias for password. (for compatibility to MySQLdb)
    :param binary_prefix: Add _binary prefix on bytes and bytearray. (default: False)

    See `Connection <https://www.python.org/dev/peps/pep-0249/#connection-objects>`_ in the
    specification.
    """

    _sock = None
    _auth_plugin_name = ''
    _closed = False
    _secure = False

    def __init__(self, host=None, user=None, password="",
                 database=None, port=0, unix_socket=None,
                 charset='', sql_mode=None,
                 read_default_file=None, conv=None, use_unicode=None,
                 client_flag=0, cursorclass=Cursor, init_command=None,
                 connect_timeout=10, ssl=None, read_default_group=None,
                 compress=None, named_pipe=None,
                 autocommit=False, db=None, passwd=None, local_infile=False,
                 max_allowed_packet=16*1024*1024, defer_connect=False,
                 auth_plugin_map=None, read_timeout=None, write_timeout=None,
                 bind_address=None, binary_prefix=False, program_name=None,
                 server_public_key=None):
        if use_unicode is None and sys.version_info[0] > 2:
            use_unicode = True

        if db is not None and database is None:
            database = db
        if passwd is not None and not password:
            password = passwd

        if compress or named_pipe:
            raise NotImplementedError("compress and named_pipe arguments are not supported")

        self._local_infile = bool(local_infile)
        if self._local_infile:
            client_flag |= CLIENT.LOCAL_FILES

        if read_default_group and not read_default_file:
            if sys.platform.startswith("win"):
                read_default_file = "c:\\my.ini"
            else:
                read_default_file = "/etc/my.cnf"

        if read_default_file:
            if not read_default_group:
                read_default_group = "client"

            cfg = Parser()
            cfg.read(os.path.expanduser(read_default_file))

            def _config(key, arg):
                if arg:
                    return arg
                try:
                    return cfg.get(read_default_group, key)
                except Exception:
                    return arg

            user = _config("user", user)
            password = _config("password", password)
            host = _config("host", host)
            database = _config("database", database)
            unix_socket = _config("socket", unix_socket)
            port = int(_config("port", port))
            bind_address = _config("bind-address", bind_address)
            charset = _config("default-character-set", charset)
            if not ssl:
                ssl = {}
            if isinstance(ssl, dict):
                for key in ["ca", "capath", "cert", "key", "cipher"]:
                    value = _config("ssl-" + key, ssl.get(key))
                    if value:
                        ssl[key] = value

        self.ssl = False
        if ssl:
            if not SSL_ENABLED:
                raise NotImplementedError("ssl module not found")
            self.ssl = True
            client_flag |= CLIENT.SSL
            self.ctx = self._create_ssl_ctx(ssl)

        self.host = host or "localhost"
        self.port = port or 3306
        self.user = user or DEFAULT_USER
        self.password = password or b""
        if isinstance(self.password, text_type):
            self.password = self.password.encode('latin1')
        self.db = database
        self.unix_socket = unix_socket
        self.bind_address = bind_address
        if not (0 < connect_timeout <= 31536000):
            raise ValueError("connect_timeout should be >0 and <=31536000")
        self.connect_timeout = connect_timeout or None
        if read_timeout is not None and read_timeout <= 0:
            raise ValueError("read_timeout should be >= 0")
        self._read_timeout = read_timeout
        if write_timeout is not None and write_timeout <= 0:
            raise ValueError("write_timeout should be >= 0")
        self._write_timeout = write_timeout
        if charset:
            self.charset = charset
            self.use_unicode = True
        else:
            self.charset = DEFAULT_CHARSET
            self.use_unicode = False

        if use_unicode is not None:
            self.use_unicode = use_unicode

        self.encoding = charset_by_name(self.charset).encoding

        client_flag |= CLIENT.CAPABILITIES
        if self.db:
            client_flag |= CLIENT.CONNECT_WITH_DB

        self.client_flag = client_flag

        self.cursorclass = cursorclass

        self._result = None
        self._affected_rows = 0
        self.host_info = "Not connected"

        # specified autocommit mode. None means use server default.
        self.autocommit_mode = autocommit

        if conv is None:
            conv = converters.conversions

        # Need for MySQLdb compatibility.
        self.encoders = {k: v for (k, v) in conv.items() if type(k) is not int}
        self.decoders = {k: v for (k, v) in conv.items() if type(k) is int}
        self.sql_mode = sql_mode
        self.init_command = init_command
        self.max_allowed_packet = max_allowed_packet
        self._auth_plugin_map = auth_plugin_map or {}
        self._binary_prefix = binary_prefix
        self.server_public_key = server_public_key

        self._connect_attrs = {
            '_client_name': 'pymysql',
            '_pid': str(os.getpid()),
            '_client_version': VERSION_STRING,
        }

        if program_name:
            self._connect_attrs["program_name"] = program_name

        if defer_connect:
            self._sock = None
        else:
            self.connect()
  • 構建一個 Cursor 類對象,以與數據庫對象進行交互:
class Cursor(object):
    """
    This is the object you use to interact with the database.

    Do not create an instance of a Cursor yourself. Call
    connections.Connection.cursor().

    See `Cursor <https://www.python.org/dev/peps/pep-0249/#cursor-objects>`_ in
    the specification.
    """

    #: Max statement size which :meth:`executemany` generates.
    #:
    #: Max size of allowed statement is max_allowed_packet - packet_header_size.
    #: Default value of max_allowed_packet is 1048576.
    max_stmt_length = 1024000

    _defer_warnings = False

    def __init__(self, connection):
        self.connection = connection
        self.description = None
        self.rownumber = 0
        self.rowcount = -1
        self.arraysize = 1
        self._executed = None
        self._result = None
        self._rows = None
        self._warnings_handled = False
  • 利用 execute 函數執行 MySQL 命令
  • 利用 fetch 獲取數據
  • 使用 commit 提交更改
  • 關閉數據庫

需要注意的是:

  • 與數據庫交互是通過 cursor 實現的
  • 數據庫打開、關閉和更新則是通過 db 實現的
  • 因爲執行一些操作後,cursor 所指向的當前位置會發生變化,因此要注意 cursor 的當前位置
  • 所有修改數據庫的行爲,如果要使修改生效,需要使用 commit 提交更改
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章