Parsing JSON data with Spark

1. The JSON data format: definition
JSON (JavaScript Object Notation) is a lightweight data-interchange format that is easy for humans to read and write.

2. Encoding and decoding JSON (two approaches: 2.1 and 2.2)
2.1 Encoding and decoding with the json module: json.dumps and json.loads

Function      Description
json.dumps    Encodes a Python object as a JSON string
json.loads    Decodes a JSON string into a Python object
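
The two functions in the table are inverses of each other for JSON-compatible data. A minimal round-trip sketch follows; the record contents are made up purely for illustration.

```python
import json

# A hypothetical record used only for illustration.
record = {"name": "spark", "tags": ["json", "python"], "count": 3, "ok": True, "extra": None}

encoded = json.dumps(record)   # Python object -> JSON string
decoded = json.loads(encoded)  # JSON string -> Python object

print(type(encoded).__name__)  # the encoded value is a str
print(decoded == record)       # the round trip preserves the data
```

Note that the round trip is only lossless for types JSON can represent; a tuple, for example, would come back as a list.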

2.1.1 Examples of using the json functions in Python

  • (1) json.dumps: encodes a Python object as a JSON string
Signature and docstring of json.dumps:
def dumps(obj, skipkeys=False, ensure_ascii=True, check_circular=True,
        allow_nan=True, cls=None, indent=None, separators=None,
        default=None, sort_keys=False, **kw):
    """Serialize ``obj`` to a JSON formatted ``str``.

    If ``skipkeys`` is true then ``dict`` keys that are not basic types
    (``str``, ``int``, ``float``, ``bool``, ``None``) will be skipped
    instead of raising a ``TypeError``.

    If ``ensure_ascii`` is false, then the return value can contain non-ASCII
    characters if they appear in strings contained in ``obj``. Otherwise, all
    such characters are escaped in JSON strings.

    If ``check_circular`` is false, then the circular reference check
    for container types will be skipped and a circular reference will
    result in an ``OverflowError`` (or worse).

    If ``allow_nan`` is false, then it will be a ``ValueError`` to
    serialize out of range ``float`` values (``nan``, ``inf``, ``-inf``) in
    strict compliance of the JSON specification, instead of using the
    JavaScript equivalents (``NaN``, ``Infinity``, ``-Infinity``).

    If ``indent`` is a non-negative integer, then JSON array elements and
    object members will be pretty-printed with that indent level. An indent
    level of 0 will only insert newlines. ``None`` is the most compact
    representation.

    If specified, ``separators`` should be an ``(item_separator, key_separator)``
    tuple.  The default is ``(', ', ': ')`` if *indent* is ``None`` and
    ``(',', ': ')`` otherwise.  To get the most compact JSON representation,
    you should specify ``(',', ':')`` to eliminate whitespace.

    ``default(obj)`` is a function that should return a serializable version
    of obj or raise TypeError. The default simply raises TypeError.

    If *sort_keys* is ``True`` (default: ``False``), then the output of
    dictionaries will be sorted by key.

    To use a custom ``JSONEncoder`` subclass (e.g. one that overrides the
    ``.default()`` method to serialize additional types), specify it with
    the ``cls`` kwarg; otherwise ``JSONEncoder`` is used.
    """

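Encoding a Python object as a JSON string (demo 1):
#!/usr/bin/env python3
import json

data = [{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}]
# Use a name other than "json" so the module itself is not shadowed.
json_str = json.dumps(data)
print(json_str)
After json_str = json.dumps(data), json_str holds the string:
'[{"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}]'
print(json_str) outputs the same text without the outer quotes:
[{"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}]
On Python 3.7 and later, keys appear in insertion order; older interpreters may emit them in a different order.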
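
The docstring above describes several keyword arguments that the demo does not exercise. The following sketch tries ensure_ascii, allow_nan, and default; the helper name encode_extra and the sample values are assumptions made for illustration.

```python
import json
from datetime import date

# ensure_ascii=True (the default) escapes non-ASCII characters as \uXXXX.
escaped = json.dumps({"city": "台北"})
# ensure_ascii=False keeps the characters as they are.
readable = json.dumps({"city": "台北"}, ensure_ascii=False)

# allow_nan=False turns NaN/Infinity into an error instead of emitting
# the JavaScript-style tokens, which are not valid JSON.
try:
    json.dumps(float("nan"), allow_nan=False)
    nan_rejected = False
except ValueError:
    nan_rejected = True

# default= is called for objects the encoder cannot serialize by itself.
def encode_extra(obj):  # hypothetical helper name
    if isinstance(obj, date):
        return obj.isoformat()
    raise TypeError(f"Cannot serialize {type(obj).__name__}")

with_date = json.dumps({"day": date(2020, 1, 2)}, default=encode_extra)

print(escaped)
print(readable)
print(nan_rejected)
print(with_date)
```

Raising TypeError for unknown types inside the default function mirrors what the docstring says the built-in default does.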

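Encoding a Python object as a JSON string with formatted output, for example a 4-space indent (demo 2):
print(json.dumps({'a': 'Runoob', 'b': 7}, sort_keys=True, indent=4, separators=(',', ': ')))
Output:
{
    "a": "Runoob",
    "b": 7
}
In this call, sort_keys=True sorts the dictionary keys in ascending order; with sort_keys=False (the default) the keys are emitted in the dictionary's own iteration order, which is insertion order on Python 3.7 and later. indent=4 indents each nesting level by 4 spaces. separators=(',', ': ') writes the item separator without a trailing space, so the indented lines do not end in whitespace; on Python 3.4 and later this is already the default whenever indent is set. For the most compact output, pass separators=(',', ':') and leave indent unset.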
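
To make the effect of each formatting argument visible in isolation, here is a small sketch that encodes the same dictionary four ways. The variable names are illustrative only.

```python
import json

data = {"b": 7, "a": "Runoob"}

default_form = json.dumps(data)                          # default separators (', ', ': ')
compact_form = json.dumps(data, separators=(",", ":"))   # no whitespace at all
sorted_form = json.dumps(data, sort_keys=True)           # keys in ascending order
pretty_form = json.dumps(data, sort_keys=True, indent=4) # multi-line, 4-space indent

for label, value in [("default", default_form), ("compact", compact_form),
                     ("sorted", sorted_form), ("pretty", pretty_form)]:
    print(label, value, sep=": ")
```

The compact form is usually the better choice for data sent over the network, while the indented form is easier to read in logs and configuration files.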

  • (2) json.loads: decodes a JSON string into a Python object

JSON type        Python type
object           dict
array            list
string           str (unicode in Python 2)
number (int)     int (int or long in Python 2)
number (real)    float
true             True
false            False
null             None

json.loads demo:
import json

jsonData = '{"a":1,"b":2,"c":3,"d":4,"e":5}'
text = json.loads(jsonData)
print(text)
The value of text is the dict: {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
print(text) outputs: {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
Note: printing a Python object that is not a string shows it as written above,
while printing a string hides the outermost quotes:
print("{'d': 4, 'c': 3, 'a': 1, 'e': 5, 'b': 2}")
Output: {'d': 4, 'c': 3, 'a': 1, 'e': 5, 'b': 2}
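
The type mapping for json.loads can be checked directly. The sketch below decodes one value of each JSON type and records the resulting Python type names on Python 3; the sample document is an assumption chosen to cover every row of the mapping.

```python
import json

# One value per JSON type; the content is made up for illustration.
text = '{"obj": {"k": "v"}, "arr": [1, 2], "s": "hi", "i": 10, "f": 1.5, "t": true, "n": null}'

decoded = json.loads(text)
type_names = {key: type(value).__name__ for key, value in decoded.items()}

for key, name in type_names.items():
    print(key, name)
```

JSON true and false arrive as Python bool, and null arrives as None, whose type name is NoneType.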

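2.2 Encoding and decoding JSON with a third-party library: Demjson
Demjson is a third-party Python module for encoding and decoding JSON data. It also includes the formatting and validation features of JSONLint.
GitHub: https://github.com/dmeranda/demjson
Official site: http://deron.meranda.us/python/demjson/
Installation: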

$ tar -xvzf demjson-2.2.3.tar.gz
$ cd demjson-2.2.3
$ python setup.py install

Function    Description
encode      Encodes a Python object as a JSON string
decode      Decodes a JSON string into a Python object

Docstring of the encode function:
    r"""Encodes a Python object into a JSON-encoded string.

    * 'strict'    (Boolean, default False)

        If 'strict' is set to True, then only strictly-conforming JSON
        output will be produced.  Note that this means that some types
        of values may not be convertable and will result in a
        JSONEncodeError exception.

    * 'compactly'    (Boolean, default True)

        If 'compactly' is set to True, then the resulting string will
        have all extraneous white space removed; if False then the
        string will be "pretty printed" with whitespace and
        indentation added to make it more readable.

    * 'encode_namedtuple_as_object'  (Boolean or callable, default True)

        If True, then objects of type namedtuple, or subclasses of
        'tuple' that have an _asdict() method, will be encoded as an
        object rather than an array.
        If can also be a predicate function that takes a namedtuple
        object as an argument and returns True or False.

    * 'indent_amount'   (Integer, default 2)

        The number of spaces to output for each indentation level.
        If 'compactly' is True then indentation is ignored.

    * 'indent_limit'    (Integer or None, default None)

        If not None, then this is the maximum limit of indentation
        levels, after which further indentation spaces are not
        inserted.  If None, then there is no limit.

    CONCERNING CHARACTER ENCODING:

    The 'encoding' argument should be one of:

        * None - The return will be a Unicode string.
        * encoding_name - A string which is the name of a known
              encoding, such as 'UTF-8' or 'ascii'.
        * codec - A CodecInfo object, such as as found by codecs.lookup().
              This allows you to use a custom codec as well as those
              built into Python.

    If an encoding is given (either by name or by codec), then the
    returned value will be a byte array (Python 3), or a 'str' string
    (Python 2); which represents the raw set of bytes.  Otherwise,
    if encoding is None, then the returned value will be a Unicode
    string.

    The 'escape_unicode' argument is used to determine which characters
    in string literals must be \u escaped.  Should be one of:

        * True  -- All non-ASCII characters are always \u escaped.
        * False -- Try to insert actual Unicode characters if possible.
        * function -- A user-supplied function that accepts a single
             unicode character and returns True or False; where True
             means to \u escape that character.

    Regardless of escape_unicode, certain characters will always be
    \u escaped. Additionaly any characters not in the output encoding
    repertoire for the encoding codec will be \u escaped as well.

    """
encode demo:
import demjson

data = [{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}]
json_str = demjson.encode(data)
print(json_str)
json_str is the string: '[{"a":1,"b":2,"c":3,"d":4,"e":5}]'
print(json_str) outputs: [{"a":1,"b":2,"c":3,"d":4,"e":5}]

Docstring of the decode function:
"""Decodes a JSON-encoded string into a Python object.

    == Optional arguments ==

    * 'encoding'  (string, default None)

       This argument provides a hint regarding the character encoding
       that the input text is assumed to be in (if it is not already a
       unicode string type).

       If set to None then autodetection of the encoding is attempted
       (see discussion above). Otherwise this argument should be the
       name of a registered codec (see the standard 'codecs' module).

    * 'strict'    (Boolean, default False)

        If 'strict' is set to True, then those strings that are not
        entirely strictly conforming to JSON will result in a
        JSONDecodeError exception.

    * 'return_errors'    (Boolean, default False)

        Controls the return value from this function. If False, then
        only the Python equivalent object is returned on success, or
        an error will be raised as an exception.

        If True then a 2-tuple is returned: (object, error_list). The
        error_list will be an empty list [] if the decoding was
        successful, otherwise it will be a list of all the errors
        encountered.  Note that it is possible for an object to be
        returned even if errors were encountered.

    * 'return_stats'    (Boolean, default False)

        Controls whether statistics about the decoded JSON document
        are returns (and instance of decode_statistics).

        If True, then the stats object will be added to the end of the
        tuple returned.  If return_errors is also set then a 3-tuple
        is returned, otherwise a 2-tuple is returned.

    * 'write_errors'    (Boolean OR File-like object, default False)

        Controls what to do with errors.

        - If False, then the first decoding error is raised as an exception.
        - If True, then errors will be printed out to sys.stderr.
        - If a File-like object, then errors will be printed to that file.

        The write_errors and return_errors arguments can be set
        independently.

    * 'filename_for_errors'   (string or None)

        Provides a filename to be used when writting error messages.

    * 'allow_xxx', 'warn_xxx', and 'forbid_xxx'    (Booleans)

        These arguments allow for fine-adjustments to be made to the
        'strict' argument, by allowing or forbidding specific
        syntaxes.

        There are many of these arguments, named by replacing the
        "xxx" with any number of possible behavior names (See the JSON
        class for more details).

        Each of these will allow (or forbid) the specific behavior,
        after the evaluation of the 'strict' argument.  For example,
        if strict=True then by also passing 'allow_comments=True' then
        comments will be allowed.  If strict=False then
        forbid_comments=True will allow everything except comments.

    Unicode decoding:
    -----------------
    The input string can be either a python string or a python unicode
    string (or a byte array in Python 3).  If it is already a unicode
    string, then it is assumed that no character set decoding is
    required.

    However, if you pass in a non-Unicode text string (a Python 2
    'str' type or a Python 3 'bytes' or 'bytearray') then an attempt
    will be made to auto-detect and decode the character encoding.
    This will be successful if the input was encoded in any of UTF-8,
    UTF-16 (BE or LE), or UTF-32 (BE or LE), and of course plain ASCII
    works too.
    
    Note though that if you know the character encoding, then you
    should convert to a unicode string yourself, or pass it the name
    of the 'encoding' to avoid the guessing made by the auto
    detection, as with

        python_object = demjson.decode( input_bytes, encoding='utf8' )
    
    Callback hooks:
    ---------------
    You may supply callback hooks by using the hook name as the
    named argument, such as:
        decode_float=decimal.Decimal

    See the hooks documentation on the JSON.set_hook() method.

    """
decode demo:
import demjson

json_str = '{"a":1,"b":2,"c":3,"d":4,"e":5}'
text = demjson.decode(json_str)
print(text)
text is a dict object; on Python 3 it is {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
print(text) outputs: {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
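
The decode docstring says that strict=True raises an error for text that is not strictly conforming JSON, which implies the default mode is more tolerant. For contrast, the standard library's json.loads is always strict about syntax. The sketch below shows it rejecting three common deviations; whether demjson's non-strict mode accepts each of these particular inputs is an assumption that this example does not verify, and is_strict_json is a hypothetical helper name.

```python
import json

# Inputs that look like JSON but violate the specification.
lenient_inputs = [
    "{'a': 1}",             # single-quoted strings
    '{"a": 1,}',            # trailing comma
    '{"a": 1} // comment',  # comment after the value
]

def is_strict_json(text):  # hypothetical helper
    """Return True if the standard json module accepts the text."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

results = [is_strict_json(text) for text in lenient_inputs]
print(results)
print(is_strict_json('{"a": 1}'))
```

If input of this kind has to be parsed, a tolerant decoder is one option; otherwise rejecting it early, as json.loads does, keeps malformed data out of later processing steps.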