Data Analysis Notes 001

- Notes from Chapter 1 of Udacity's data analysis course

The MongoDB course is about data preprocessing.

A horse-and-carriage example

  • We should not trust data blindly; think about where it came from: a person or a machine

Examples of data errors

  • Google Street View misjudging a house
  • Editing errors in online documents
  • Wrong data types in tables

Roughly, we need to assess our data to:

  • Test assumptions about
    • Values
    • Data types
    • Shape
  • Identify Errors or outliers
  • Find missing values
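
The checklist above can be sketched in plain Python. This is a minimal illustration, not the course's code: the sample rows, the column names `name` and `age`, and the helper `assess` are all hypothetical, and the 0 to 120 age range is an assumed plausibility rule.

```python
# Hypothetical sample rows; "name" and "age" are made-up column names.
rows = [
    {"name": "Ann", "age": "34"},
    {"name": "Bob", "age": ""},        # missing value
    {"name": "Cy", "age": "340"},      # suspicious outlier
    {"name": "Dee"},                   # wrong shape: no "age" field
]


def assess(rows, expected_fields=("name", "age")):
    """Return a list of (row_index, problem) tuples found in the rows."""
    problems = []
    for i, row in enumerate(rows):
        # Shape: every row should carry exactly the expected fields.
        if set(row) != set(expected_fields):
            problems.append((i, "unexpected shape"))
            continue
        # Missing values: empty strings count as missing here.
        if row["age"].strip() == "":
            problems.append((i, "missing age"))
            continue
        # Data type: age should parse as an integer.
        try:
            age = int(row["age"])
        except ValueError:
            problems.append((i, "age is not an integer"))
            continue
        # Values and outliers: assumed plausible range of 0 to 120.
        if not 0 <= age <= 120:
            problems.append((i, "age out of range"))
    return problems


problems = assess(rows)
print(problems)
```

Each check stops at the first problem found for a row, so one bad row produces one report rather than a cascade of follow-on errors.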

Tabular Data

Tabular data: spreadsheets such as Microsoft Office Excel or OpenOffice

What we care about:

  • Fields / labels
  • Contents
  • Rows / Columns / Values

CSV is Lightweight

Link: the Python csv module

  • Each line of text is a single row
  • Fields are separated by a delimiter
  • Just the data itself
  • Doesn't need special software

Parsing a CSV File in Python (in this case, CSV to dict)

Parsing a CSV file

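def parse_file(datafile):
    """Parse the first 10 data rows into a list of dicts by splitting on commas."""
    data = []
    # Open in text mode so that each line is a str.
    with open(datafile, 'r') as f:
        header = f.readline().split(',')
        counter = 0
        for line in f:
            if counter == 10:
                break
            fields = line.split(',')
            entry = {}

            for i, value in enumerate(fields):
                entry[header[i].strip()] = value.strip()
            # Append once per row, after all of its fields are filled in.
            data.append(entry)
            counter += 1
    return data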

Splitting this way breaks as soon as a field itself contains a comma.
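
A small illustration of that failure, using a made-up piece of CSV text; the sample string and variable names are hypothetical. Splitting on commas cuts the quoted field in two, while `csv.reader` from the standard library respects the quotes.

```python
import csv
import io

# Hypothetical CSV text: the first field of the data row contains a comma.
text = 'name,city\n"Smith, John",Boston\n'

# Naive approach: split the data line on every comma.
naive_fields = text.splitlines()[1].split(',')

# csv.reader understands quoting and keeps the field whole.
rows = list(csv.reader(io.StringIO(text)))
csv_fields = rows[1]

print(naive_fields)  # three pieces, with stray quote characters
print(csv_fields)    # two fields, as intended
```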

import csv 

This module takes care of many CSV problems.

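import os
import pprint


def parse_csv(datafile):
    """Read every row of a CSV file into a list of dicts keyed by the header."""
    data = []
    # newline='' lets the csv module handle line endings itself.
    with open(datafile, 'r', newline='') as sd:
        r = csv.DictReader(sd)
        for line in r:
            data.append(line)
    return data


if __name__ == '__main__':
    # DATADIR and DATAFILE are not defined in these notes; set them to the
    # directory and file name of the CSV before running.
    datafile = os.path.join(DATADIR, DATAFILE)
    d = parse_csv(datafile)
    pprint.pprint(d)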

Introduction to xlrd

import xlrd

datafile = "2013_ERCOT_Hourly_Load_Data.xls"


def parse_file(datafile):
    workbook = xlrd.open_workbook(datafile)
    sheet = workbook.sheet_by_index(0)

    data = [[sheet.cell_value(r, col) 
                for col in range(sheet.ncols)] 
                    for r in range(sheet.nrows)]

    print("\nList Comprehension")
    print("data[3][2]:", data[3][2])

    print("\nCells in a nested loop:")
    for row in range(sheet.nrows):
        for col in range(sheet.ncols):
            if row == 50:
                # Print the cells of row 50 on one line.
                print(sheet.cell_value(row, col), end=" ")


    ### other useful methods:
    print("\nROWS, COLUMNS, and CELLS:")
    print("Number of rows in the sheet:", sheet.nrows)
    print("Type of data in cell (row 3, col 2):", sheet.cell_type(3, 2))
    print("Value in cell (row 3, col 2):", sheet.cell_value(3, 2))
    print("Get a slice of values in column 3, from rows 1-3:")
    print(sheet.col_values(3, start_rowx=1, end_rowx=4))

    print("\nDATES:")
    print("Type of data in cell (row 1, col 0):", sheet.cell_type(1, 0))
    exceltime = sheet.cell_value(1, 0)
    print("Time in Excel format:", exceltime)
    print("Convert time to a Python datetime tuple, from the Excel float:",
          xlrd.xldate_as_tuple(exceltime, 0))

    return data

data = parse_file(datafile)

The code above is a worked example of another approach: reading data in the .xls format.
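
The date conversion at the end of that example can be approximated with the standard library alone. This is a sketch, not xlrd's implementation: the helper name `excel_serial_to_datetime` is hypothetical, and it assumes the 1900 date system (datemode 0) and serial numbers from March 1900 onward, where Excel's day count lines up with an epoch of 1899-12-30.

```python
from datetime import datetime, timedelta

# Assumed epoch for the 1900 date system; valid for serials of 61 and above.
EXCEL_EPOCH = datetime(1899, 12, 30)


def excel_serial_to_datetime(serial):
    """Convert an Excel float (days since the epoch) to a datetime."""
    return EXCEL_EPOCH + timedelta(days=serial)


# Serial 41275 is 2013-01-01; the .5 fraction is half a day, i.e. noon.
converted = excel_serial_to_datetime(41275.5)
print(converted)
```

The whole part of the float counts days and the fractional part is the time of day, which is why a date cell comes back from the sheet as a plain number.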

Introduction to JSON

(JavaScript Object Notation.)

JSON is a syntax for storing and exchanging data.

JSON is text, written with JavaScript object notation.

Some structures that CSV or XLS cannot store, such as several rows of data inside a single cell, can be represented in JSON (which maps to a dict in Python).

Data modeling in JSON

  • items may have different fields
  • may have nested objects
  • may have nested arrays
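
The three points above can be seen in a short example with the standard `json` module. The JSON text and its field names (`name`, `area`, `tags`) are hypothetical, chosen only to show the shapes involved.

```python
import json

# Hypothetical JSON text: two items with different fields,
# a nested object ("area") and a nested array ("tags").
text = """
[
  {"name": "Item A", "area": {"name": "Somewhere"}, "tags": ["x", "y"]},
  {"name": "Item B"}
]
"""

items = json.loads(text)

# JSON objects become dicts and JSON arrays become lists.
first, second = items
area_name = first["area"]["name"]
tag_count = len(first["tags"])

# Items may lack a field, so use .get() with a default.
second_tags = second.get("tags", [])

print(area_name, tag_count, second_tags)
```

Because fields can be absent, `.get()` with a default is usually safer than indexing directly when walking data of this kind.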

JSON exercise

The requests module

Quiz: Exploring JSON Data (a JSON exercise: find these answers in the data provided)

  • How many bands are named 'First Aid Kit'?

  • Begin-area name for Queen?

  • Spanish alias for the Beatles?

  • Nirvana disambiguation?

  • When was One Direction formed?
