Machine Learning: Practical Case Studies (1), Using R

Introduction

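Statistics has long studied how to extract something interpretable from data, whereas machine learning focuses on turning data into something useful. The following contrast helps clarify the term "machine learning": machine learning is about teaching a computer some knowledge and then having the computer use that knowledge to carry out other tasks. Statistics, by contrast, leans toward developing tools that help humans understand the world, so that they can think more clearly and make better decisions.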

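In machine learning, learning refers to the process of using algorithms to analyze the underlying structure of data and to separate signal from noise, so as to extract as much information (or as much reasonable information) as possible. Once an algorithm has found the signal, or pattern, everything else is simply treated as noise. For this reason, machine learning techniques are also called pattern recognition algorithms.

Observing data, learning from it, and automating the recognition process: these three concepts are the core of machine learning.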

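"The best thing about R is that it was developed by statisticians. The worst thing about R is that... it was developed by statisticians." (Bo Cowgill, Google)

All too true, haha. For example, R really is less convenient than MATLAB for matrix operations.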

R's basic data type is the vector. In essence, all data in R are vectors, although they can be aggregated and organized in different ways.
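As a small illustration of the point about vectors, the sketch below (the variable names are made up for this example) shows that a single number is a vector of length one, that arithmetic applies element-wise, and that a data frame column is itself a vector.

```r
# A "scalar" in R is simply a vector of length one.
x <- 42
is.vector(x)
length(x)

# Arithmetic is applied element-wise across the whole vector.
v <- c(1, 2, 3)
doubled <- v * 2

# A data frame is a list of equal-length vectors, one per column.
df <- data.frame(id = 1:3, value = v)
col.is.vector <- is.vector(df$value)
```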

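Warning: for all its strengths, R has a weak spot. It does not handle large data well. Many people are working on this, but it remains a serious problem. For the case studies explored here, though, it is not an issue: the data sets we use are relatively small, and the systems we build are only prototypes or proof-of-concept models. The distinction matters, because if you are building an enterprise-scale machine learning system like those at Google or Facebook, R is not the right choice. In fact, companies such as Google and Facebook typically use R as a "data sandbox" for working with data and experimenting with new machine learning methods. If an experiment pays off, engineers then reimplement the relevant functionality from R in a more suitable language, such as C.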

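Two functions load packages: library and require. There are subtle differences between them; for the purposes of this book, the main one is that the latter returns a Boolean (TRUE or FALSE) indicating whether the package was loaded successfully.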

library(package) and require(package) both load the namespace of the package with name package and attach it on the search list. require is designed for use inside other functions; it returns FALSE and gives a warning (rather than an error as library() does by default) if the package does not exist. Both functions check and update the list of currently attached packages and do not reload a namespace which is already loaded. 

> a<-require(tm)
Loading required package: tm
Loading required package: NLP
Warning messages:
1: package ‘tm’ was built under R version 3.5.3
2: package ‘NLP’ was built under R version 3.5.2
> a
[1] TRUE
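To make the difference concrete, here is a minimal sketch that asks for a package assumed not to be installed (the name notARealPackage12345 is hypothetical). require hands back FALSE, while library raises an error that has to be caught.

```r
# Hypothetical package name, assumed not to be installed on this machine.
missing.pkg <- "notARealPackage12345"

# require() returns FALSE when the package is unavailable
# (normally with a warning; it is silenced here).
loaded <- suppressWarnings(
  require(missing.pkg, character.only = TRUE, quietly = TRUE)
)

# library() signals an error instead, so it must be caught explicitly.
result <- tryCatch({
  library(missing.pkg, character.only = TRUE)
  "loaded"
}, error = function(e) "error")
```

The return value is what makes require convenient inside functions, for example to fall back to another code path when an optional package is absent.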

UFO Case Study

Loading the Data

# Load libraries and data
library(ggplot2)    # We'll use ggplot2 for all of our visualizations
library(plyr)       # For data manipulation
library(scales)     # We'll need to fix date formats in plots

ufo <- read.delim("ufo_awesome.tsv",
                  sep = "\t",
                  stringsAsFactors = FALSE,
                  header = FALSE, 
                  na.strings = "")
  • read.delim: Reads a file in table format and creates a data frame from it, with cases corresponding to lines and variables to fields in the file.

    Similarly, read.delim is for reading delimited files, defaulting to the TAB character for the delimiter.

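In this example every field is a string, but in R versions before 4.0.0 all the read.* functions convert strings to factors by default, so we set stringsAsFactors = FALSE to prevent the conversion. In addition, the first row of this file is not a header, so the header argument must be set to FALSE. Finally, the data contains many empty fields that we want to represent with R's special value NA; to do so, we explicitly set na.strings to the empty string.

Inspecting the data: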
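The effect of these three arguments can be checked on a throwaway file. The sketch below writes a tiny two-row TSV (invented rows, not taken from ufo_awesome.tsv) with one empty field and reads it back with the same settings.

```r
# Write a tiny tab-separated file with no header and one empty field.
tmp <- tempfile(fileext = ".tsv")
writeLines(c("Iowa City, IA\t2 min\tround object",
             "Milwaukee, WI\t\tbright lights"),
           tmp)

toy <- read.delim(tmp,
                  sep = "\t",
                  stringsAsFactors = FALSE,  # keep strings as character
                  header = FALSE,            # first line is data, not names
                  na.strings = "")           # empty fields become NA

str(toy)     # columns get the default names V1, V2, V3
unlink(tmp)  # remove the temporary file
```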

# Inspect the data frame
head(ufo)

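We can give each column a more meaningful label. Giving every column of a data frame a meaningful name matters: it makes both the code and its output more readable, for yourself and for others.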

  • names: Functions to get or set the names of an object.
names(ufo) <- c("DateOccurred", "DateReported",
                "Location", "ShortDescription",
                "Duration", "LongDescription")

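Whenever you work with a data frame, especially one read in from an external source, we recommend inspecting the data by hand. Two handy functions for this are head and tail, which print the first six and the last six records of the data frame, respectively. Alternatively, use View to look at the whole thing.

Data Cleaning

Cleaning the Dates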
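A quick sketch of these inspection helpers, using the built-in mtcars data frame as a stand-in for the UFO data:

```r
# mtcars ships with R; it stands in for the UFO data frame here.
first.rows <- head(mtcars)         # first six rows by default
last.rows  <- tail(mtcars, n = 3)  # n changes how many rows are shown
dim(mtcars)                        # number of rows and columns

# View(mtcars) opens a spreadsheet-style viewer in an interactive session.
```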

good.rows <- ifelse(nchar(ufo$DateOccurred) != 8 |
                    nchar(ufo$DateReported) != 8,
                    FALSE,
                    TRUE)
length(which(!good.rows))
## [1] 688
ufo <- ufo[good.rows, ]

# Now we can convert the strings to Date objects and work with them properly
ufo$DateOccurred <- as.Date(ufo$DateOccurred, format = "%Y%m%d")
ufo$DateReported <- as.Date(ufo$DateReported, format = "%Y%m%d")
  • ifelse(test, yes, no)

    ifelse returns a value with the same shape as test which is filled with elements selected from either yes or no depending on whether the element of test is TRUE or FALSE.

  • nchar: nchar takes a character vector as an argument and returns a vector whose elements contain the sizes of the corresponding elements of x.
  • nzchar is a fast way to find out if elements of a character vector are non-empty strings.
  • as.Date: Functions to convert between character representations and objects of class "Date" representing calendar dates.
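The same length check can be tried on a handful of made-up date strings. This is only a sketch of the idea; the values are invented and are not rows from the UFO file.

```r
# Two well-formed YYYYMMDD strings and two malformed ones.
dates <- c("19951009", "0000", "20100817", "1995")

# Keep only the entries that are exactly eight characters long.
good <- nchar(dates) == 8
cleaned <- as.Date(dates[good], format = "%Y%m%d")

# Length alone is not a full validity check: an eight-character string
# with an impossible month still converts to NA.
bad.month <- as.Date("19951399", format = "%Y%m%d")
```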

Cleaning the Locations

get.location <- function(l)
{
  split.location <- tryCatch(strsplit(l, ",")[[1]],
                             error = function(e) return(c(NA, NA)))
  clean.location <- gsub("^ ","",split.location)
  if (length(clean.location) > 2)
  {
    return(c(NA,NA))
  }
  else
  {
    return(clean.location)
  }
}

# We use 'lapply' to return a list with [City, State] vector as each element
city.state <- lapply(ufo$Location, get.location)

# We use 'do.call' to collapse the list to an N-by-2 matrix
location.matrix <- do.call(rbind, city.state)
  • lapply: lapply returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.

Separating the city from the state, for example:

l <- "Iowa City, IA"
strsplit(l, ",")
## [[1]]
## [1] "Iowa City" " IA"
  • strsplit: Split the elements of a character vector x into substrings according to the matches to substring split within them.
split.location <- tryCatch(strsplit(l, ",")[[1]], error = function(e) return(c(NA, NA)))
## [1] "Iowa City" " IA"
  • tryCatch: These functions provide a mechanism for handling unusual conditions, including errors and warnings.
# Regular-expression replacement: strip the leading space
clean.location <- gsub("^ ","",split.location)
clean.location
## [1] "Iowa City" "IA"
  • do.call: constructs and executes a function call from a name or a function and a list of arguments to be passed to it.

We will often combine lapply and do.call when processing data.
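The pattern in miniature, on three invented location strings: lapply produces a list with one character vector per input, and do.call(rbind, ...) stacks those vectors into a matrix. trimws is used here instead of gsub purely for brevity.

```r
locations <- c("Iowa City, IA", "Seattle, WA", "Columbia, MO")

# One list element per location, each a c(city, state) vector.
parts <- lapply(locations,
                function(s) trimws(strsplit(s, ",")[[1]]))

# Equivalent to rbind(parts[[1]], parts[[2]], parts[[3]]).
m <- do.call(rbind, parts)
dim(m)
```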

> head(location.matrix)
     [,1]               [,2]
[1,] "Iowa City"        "IA"
[2,] "Milwaukee"        "WI"
[3,] "Shelton"          "WA"
[4,] "Columbia"         "MO"
[5,] "Seattle"          "WA"
[6,] "Brunswick County" "ND"

Removing non-US data:

ufo <- transform(ufo,
                 USCity = location.matrix[, 1],
                 USState = location.matrix[, 2],
                 stringsAsFactors = FALSE)

ufo$USState <- state.abb[match(ufo$USState, state.abb)]

ufo.us <- subset(ufo, !is.na(USState))
  • transform: transform is a generic function, which—at least currently—only does anything useful with data frames. transform.default converts its first argument to a data frame if possible and calls transform.data.frame.
  • state.abb state.area state.center state.division state.name state.region state.x77: Data sets related to the 50 states of the United States of America.
  • subset: Return subsets of vectors, matrices or data frames which meet conditions.
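The match step is what turns anything that is not a US state abbreviation into NA. A small sketch with invented candidate values:

```r
# Candidate "state" strings, only some of which are US abbreviations.
candidates <- c("IA", "ON", "WA", NA, "England")

# match() returns the position in state.abb, or NA when there is none.
idx <- match(candidates, state.abb)

# Indexing with NA yields NA, so non-US entries are blanked out.
us.state <- state.abb[idx]
keep <- !is.na(us.state)
```

Note that the comparison is case-sensitive, so "ia" would not match "IA".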
ufo.us <- subset(ufo.us, DateOccurred >= as.Date("1990-01-01"))
new.hist <- ggplot(ufo.us, aes(x = DateOccurred)) +
  geom_histogram(aes(fill='white', color='red')) +
  scale_fill_manual(values=c('white'='white'), guide="none") +
  scale_color_manual(values=c('red'='red'), guide="none") +
  scale_x_date(breaks = "50 years")

ggsave(plot = new.hist,
       filename = "new_hist.bmp",
       height = 6,
       width = 8)

Plotting

Counting sightings by state and month

ufo.us$YearMonth <- strftime(ufo.us$DateOccurred, format = "%Y-%m")

sightings.counts <- ddply(ufo.us, .(USState,YearMonth), nrow)
  • strftime: Functions to convert between character representations and objects of classes "POSIXlt" and "POSIXct" representing calendar dates and times.
  • ddply: For each subset of a data frame, apply function then combine results into a data frame. To apply a function for each row, use adply with .margins set to 1.
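To see what the grouped count produces without needing plyr installed, the sketch below uses base R's aggregate as a stand-in for ddply(..., nrow) on a toy data frame. The rows and the helper column n are invented for the illustration, and the result column is called n rather than the V1 that ddply generates.

```r
toy <- data.frame(
  USState = c("IA", "IA", "IA", "WA"),
  DateOccurred = as.Date(c("1995-10-09", "1995-10-21",
                           "1996-01-02", "1995-10-09")),
  stringsAsFactors = FALSE
)

# Reduce each date to a year-month label, as in the article.
toy$YearMonth <- strftime(toy$DateOccurred, format = "%Y-%m")

# Base-R stand-in for ddply(toy, .(USState, YearMonth), nrow):
# give every row a weight of 1 and sum the weights within each group.
toy$n <- 1
counts <- aggregate(n ~ USState + YearMonth, data = toy, FUN = sum)
counts
```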

Tidying the Data

# Fill in the missing months
date.range <- seq.Date(from = min(ufo.us$DateOccurred),
                       to = max(ufo.us$DateOccurred),
                       by = "month")
date.strings <- strftime(date.range, "%Y-%m")

# Pair every state with every month
states.dates <- lapply(state.abb, function(s) cbind(s, date.strings))
states.dates <- data.frame(do.call(rbind, states.dates),
                           stringsAsFactors = FALSE)

# Merge on the two key columns; combinations with no record become NA
all.sightings <- merge(states.dates,
                       sightings.counts,
                       by.x = c("s", "date.strings"),
                       by.y = c("USState", "YearMonth"),
                       all = TRUE)

# Set column names, turn NA into 0, convert the dates, make State a factor
names(all.sightings) <- c("State", "YearMonth", "Sightings")
all.sightings$Sightings[is.na(all.sightings$Sightings)] <- 0
all.sightings$YearMonth <- as.Date(rep(date.range, length(state.abb)))
all.sightings$State <- as.factor(all.sightings$State)
  • merge: Merge two data frames by common columns or row names, or do other versions of database join operations.
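The role of all = TRUE is easiest to see on a toy grid. In this sketch (two states, two months, invented counts) the full grid has four combinations but only two have sightings; the outer merge keeps all four and leaves NA where nothing was recorded, which is then replaced by 0.

```r
# Every state-month combination we want in the final table.
grid <- data.frame(s = rep(c("IA", "WA"), each = 2),
                   date.strings = rep(c("1995-10", "1995-11"), times = 2),
                   stringsAsFactors = FALSE)

# Observed counts cover only some of those combinations.
counts <- data.frame(USState = c("IA", "WA"),
                     YearMonth = c("1995-10", "1995-11"),
                     V1 = c(2L, 1L),
                     stringsAsFactors = FALSE)

merged <- merge(grid, counts,
                by.x = c("s", "date.strings"),
                by.y = c("USState", "YearMonth"),
                all = TRUE)

# Combinations without a record come back as NA; treat them as zero.
merged$V1[is.na(merged$V1)] <- 0
merged
```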

Plotting

state.plot <- ggplot(all.sightings, aes(x = YearMonth,y = Sightings)) +
  geom_line(aes(color = "darkblue")) +
  facet_wrap(~State, nrow = 10, ncol = 5) + 
  theme_bw() + 
  scale_color_manual(values = c("darkblue" = "darkblue"), guide = "none") +
  scale_x_date(breaks = "5 years", labels = date_format('%Y')) +
  xlab("Years") +
  ylab("Number of Sightings") +
  ggtitle("Number of UFO sightings by Month-Year and U.S. State (1990-2010)")

# Save the plot to a file (a BMP here, as the filename extension indicates)
ggsave(plot = state.plot,
       filename = "ufo_sightings.bmp",
       width = 14,
       height = 8.5)

We can also create a new graph in which the number of sightings is normalized by each state's population.
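One possible way to do the normalization is sketched below. It rests on several assumptions that are not from the original analysis: the toy data frame stands in for all.sightings, the population figures come from the state.x77 data set bundled with R (a 1975 estimate whose values appear to be in thousands, so it predates the 1990-2010 sightings), and the per-100,000 scale is an arbitrary choice.

```r
# Toy counts standing in for all.sightings (invented numbers).
toy <- data.frame(State = c("CA", "CA", "WY"),
                  Sightings = c(100, 50, 5),
                  stringsAsFactors = FALSE)

# Assumption: state.x77 rows follow the same order as state.abb, and
# its Population column is a 1975 estimate expressed in thousands.
population <- data.frame(State = state.abb,
                         Population = state.x77[, "Population"] * 1000,
                         stringsAsFactors = FALSE)

# Attach each state's population, then scale the counts.
normalized <- merge(toy, population, by = "State")
normalized$Per100k <- normalized$Sightings / normalized$Population * 1e5
normalized
```

Plotting Per100k instead of Sightings in the faceted line chart would then compare states on a per-capita basis; a population series matching the years of the data would be a better denominator if one is available.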
