Week 1

For more detail, you can download the html file here free

Week 1

This course is following that

Raw data → Processing script → Tidy data → Data analysis → Data communication

Raw Data VS Processed Data:

Raw Data

The original source of the data
Often hard to use for data analysis. Because they are complicated or they are hard to parse or analysis.
Data analysis includes processing or the cleaning of the data.
Raw data may only need to be proceed once

Processed Data:

Data that is ready for analysis.
Processing can include merging, subsetting, transforming, etc.
There may be standards for processing.
All steps should be recorded.(very important)

Data processing actually is part of the data analysis. In fact a huge component of data scientist’s job is performing those sorts of processing operations. The raw data may only need to be processed once, but regardless of how often you processed it, you need to keep a record of all the different things you did. Because it can have a major impact on the data stream analysis.

Raw Data VS Tidy Data

Raw Data:

Ran no software on the data.
Did not manipulate any of the numbers in the data.
You did not remove any data from data set.
You did not summarize the data in any way.

Tidy Data:

Each variable you measure should be in one column.
Each different observation of that variable should be in a different row.
There should be one table for each “kind” of variable.
If you have multiple tables, they should include a column in the table that allow them to be linked.

Other important tips about tidy data:

Include a row at the top of each file with variable names.
Make variable names human readable.
In general data should be saved in one file per table.

Four things you should have from raw data to tidy data:

The raw data.
A tidy data set.
A code book describing each variable and its values in the tidy data.
An explicit and exact recipe you used to go from 1 -> 2,3

Code book:

Information about the variables (including units) in the data set.
Information about the sumary choices you make.
Information about the experimental study design you used.

other important tips about code book:

A common format for this document is a Word/text file.
There should be a section called “Study design” that has a throughout description of how you collected the data.
There must be a section called “Code book” that decribles each variable and its units.

Download files

Get/set your working directory:

A basic component of working with data is knowning your working directory.
The two main commands are getwd() and setwd().
Be aware of relative versus absolute paths:
- Relative: setwd("./data") or setwd("../")
- Absolute: setwd("/Users/jtleek/data/")
Important difference in Windows setwd("C:/Users/Andraw/Downloads") or setwd("C:\\Users\\Andra\\Downloads")

Checking for and creating directories:

file.exists("directoryName") will check to see if the directory exists.
dir.create("directoryName") will create a directory if if doesnot exist.

Example:

if(!file.exists("data")){
dir.create("data")
}

Getting data from the internet–downloand.file()

Downloads a file from the internet.
Even if you could do this by hand, helps with reproducibility.
Important parameters are url, destfile and method.
Useful for downloading tab-delimited, csv. and other files.

Example:

# data from https://data.baltimorecity.gov/Transportation/Baltimore-Fixed-Speed-Cameras/dz54-2aru
fileUrl <- "https://data.baltimorecity.gov/api/views/dz54-2aru/rows.csv?accessType=DOWNLOAD"
download.file(fileUrl, destfile = "./data/camera.csv")
list.files("./data")

Linux should be a little difference with the second line:
download.file(fileUrl, destfile = "./data/camera.csv" , method = "curl" )

An important component of downloading files from the internet is that those files might change. So for example they change the cameras, there might be a new set of cameras and the data we are analysis might be different.

dateDownloaded <- date()
dateDownloaded

Some notes about download.file():

If the url starts with http you can use download.file().
If the url starts with https on Windows you may be OK.
If the url starts with https on Mac you may be need to set method = "curl".
If the file is big, this might take a while.
Be sure to record when you downloaded.

Reading flat local file

Loading flat files–read.table():

This is the main function for reading data into R
Flexible and robust by requires more parameters
Reads the data inrto RAM - big data can cause problem.
Important parameters file, header, sep, row.names and nrow.
Related: read.csv() and read.csv2()

Example:
If we use
cameraData <- read.table("./data/camera.csv"),
we will get an error with
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : line 1 did not have 13 elements.
The reason why is because that there’s commas separating camera.csv. But the default for read.table() is to look for a tab delimited file. There are two ways you can use to read this data:

cameraData <- read.table("./data/camera.csv", sep = ",", header = TRUE)
head(cameraData,3)

or (read.csv() automatically set sep = "," and header = TRUE)

cameraData <- read.csv("./data/camera.csv")
head(cameraData, 3)

Some more important parameters:

quote: you can tell R whether there are any quoted values quote = "" means no quotes.
na.strings: set the character that represents a missing value.
nrows: how many rows to read of the file(e.g. nrows = 10 reads 10 lines).
skip: number of lines to skip befor starting to read.

In my experience, the biggest trouble with reading flat files are quotation marks ’ or ” placed in data values, setting quote = "" often resolves these.

Reading Excel files

Excel files are still probablu the most widely used format for sharing data.

Example:

#  Download the file to load
fileUrl <- "https://data.baltimorecity.gov/api/views/dz54-2aru/rows.xlsx?accessType=DOWNLOAD"
download.file(fileUrl, destfile = "./data/camera.xlsx", mode="wb")
dateDownloaded <- date()

# install xlsx package first
## install.packages("xlsx")

# load package and rad excel data
library(xlsx)
cameraData <- read.xlsx("./data/camera.xlsx", sheetIndex = 1, header = TRUE)
head(cameraData)

Linux should add method = "curl" in the second line and delete mode = "wb" in the third line.

mode = "wb" is very important in Windows because without it there will be an error: Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, : java.util.zip.ZipException: invalid distance too far back.

Reading specific rows and columns:

colIndex <- 2:3
rowIndex <- 1:4
cameraDataSubset <- read.xlsx("./data/camera.xlsx", sheetIndex = 1, colIndex = colIndex, rowIndex = rowIndex, mode = "wb")
cameraDataSubset

Further notes:

The write.xlsx() will write out an excel file with similar arguments.
read.xlsx2() is muc faster than read.xlsx() but for reading subsets of rows may be unstable.
The XLConnet package has more options for writting and manipulation excel files.
The XLConnect vignette is a good place to start for that package.
In general it is advised to store your data in either a database or in comma separated data(.csv) or tab separated data(.tab or .txt) as they are easier to distribute.

Reading XML file

XML:

Extensible markup language.
Frequently used to store structured data.
Particularly widely used in internet applications.
Extracting XML is the basis for most web scraping.
Components
- Markup: labels that gives the text structure.
- Content: the actual text of the document.

Tags, elements and attributes:

Tages corresponding to general labels:
- Start tags <section>
- End tags </section>
- Empty tags <line-break/>
ELements are specific examples of tags
- <Greeing> Hello world M</Greeting>
Attributes are components of the labels.
- <img src=".." alt="..."/>
- <step number="3">Conect A to B.</step>

Read XML file into R

# install XML package first
## install.packages("XML")
library(XML)
fileUrl <- "http://www.w3schools.com/xml/simple.xml"
doc <- xmlTreeParse(fileUrl, useInternal = TRUE)
rootNode <- xmlRoot(doc)
xmlName(rootNode)
names(rootNode)

xmlTreeParse() parses out the xml file： It loads the document into a R memory in a way. and then parse it and get access to different parts of it. Within R, it’s still a structurd object, so we have to be able to use different functions to access different parts of that object.
xmlRoot() is excuted, you will have access to that particular element to that xml file.
xmlName() gives the name of xml file.
names() gives what all the nested elements with that root node are.
So the root node wraps the whole document. And the whole ducument here is a breakfast menu and then there are five different breakfast items on this menu and each one is wrapped within a food elememt. So you have five food element, if you look at the names of the root node.

The next thing that you could use is to directly access parts of the XML document. You can do it in a little bit in the same way you access a list in R

# first element
rootNode[[1]]
# first element of the first element
rootNode[[1]][[1]]
# extract different parts of the file programmatically
xmlSApply(rootNode, xmlValue)

xmlSApply() what you do is you pass that a parsed XML object and then you tell it what function you’d like to apply. So what that’s is going to do is going to loop through all of the elements of the XML root node and get the XML value.
xmlValue Some types of XML nodes have no children nodes, but are leaf nodes and simply contain text

Xpath:new language

/node Top level node
//node Node at any level
node[@attr-name] Node with an attribute name
node[@attr-name="bob"] Node with attribute name attr-name==”bob”

Get the items on the menu and prices

# extract content by elements
xpathSApply(rootNode, "//name", xmlValue)
xpathSApply(rootNode, "//price", xmlValue)

Extract content by attributes

# Extract content by attributes
fileUrl <- "http://espn.go.com/nfl/team/_/name/bal/baltimore-ravens"
doc <- htmlTreeParse(fileUrl, useInternal = TRUE)
# find li elements with `class = "team-name"` and return their value
teams <- xpathSApply(doc, "//li[@class='team-name']", xmlValue)
teams

Reading JSON

JSON:

JavaScript Object Notation
Ligntweight data storage
Common format for data from application proframming interfaces(APIs)
Similar structure to XML but different syntax/format
Data stored as
- Numbers(double)
- Strings(double quoted)
- Boolean(true or false)
- Array(ordered, comma separated enclosed in square brackets[])
- Object(unorderd, comma separated collection of key:value pair in curley brackets{})

Reading data from JSOn {jsonlite package}

# install package first
## install.packages("jsonlite")
library(jsonlite)
# what you get from fromJSON function is a structured data frame
jsonData <- fromJSON("https://api.github.com/users/jtleek/repos")
# all names of this data frame
names(jsonData)
# look at the names of that particular variable
names(jsonData$owner)
jsonData$owner$login

How convert data frame to JSON

# writing data frame to JSON
myjson <- toJSON(iris, pretty = TRUE)
# print it out: too long, you can view it yourself
#cat(myjson, nrow = 2)
# fromJSON return data frame again
iris2 <- fromJSON(myjson)
head(iris2, 3)

Using data.table

data.table:

Inherets from data frame
- All functions that accept data.frame work on data.table
Written in C so it is much faster
Much, much faster at subsetting, group, and updating.

Create data tables just like data frames

# install package first
## install.packages("data.table")
library(data.table)
# data frame
DF <- data.frame(x = rnorm(9), y = rep(letters[1:3], each = 3), z = rnorm(9))
head(DF, 3)
# data table
DT <- data.table(x = rnorm(9), y = rep(letters[1:3], each = 3), z = rnorm(9))
head(DT, 3)

See all the data tables in memry

tables()

Subsetting rows

DT[2,]
DT[DT$y=="a",]
DT[c(2,3)]
# or 
DT[c(2,3),]

Subsettng columns:

What happens when you try to subset columns, if you just try to subset columns the way you used to in data frame, this is where they really diverge data table and data frame. It’s not actually trying to subset the columns using the same subsetting functions that happens with data frame. it does something a little bit different. And so what it’s using is expressions to be able to summarize the data in variour, different ways.

The subsetting function is modified for data.table
The argument you pass after the comma is called an “expression”
In R an expression is a collection of statements enclosed in curley brackets.

# expression in R
k <- {print(10); 5}
print(k)

Calculating values for variables with expressions

DT[, list(mean(x), mean(z))]
DT[,table(y)]

Another thing that it does very fast and memory efficiently is to add a new column. The nice thing is usually when you are adding a new variable to a data frame, R will copy over the entire data frame and add a new variable to it. So you get two copies of the data frame in the memory. So when dealing with big data sets, this is obviously going to cause lots of memory problems which you don’t have with data table because a new copy isn’t being created. So you have to be able to, if you’re trying to create a copy you have to explicitly do that with the copy function.

DT[, w:=z^2]
DT2 <- DT
DT[, y:=2]
head(DT, 2)
head(DT2, 2)

Multiple operations

DT[, m:={tmp <- {x+z}; log2(tmp+5)}]

plyr like operations

DT[,a:=x>0]
DT[,b:=mean(x+w),by=a]

Special variable
.N An integer, length 1, containing the number r

set.seed(123)
DT <- data.table(x = sample(letters[1:3], 1E5, TRUE))
# number of a, b, c apperance
DT[, .N,by=x]

Keys:

A unique aspect of data tables is that that have keys, and so if you set the key, it’s possible to subset and sort a data tbale much more rapidly than you would be able to do with a data frame.

DT <- data.table(x = rep(letters[1:3], each = 100), y = rnorm(100))
# set keys
setkey(DT, x)
# subsetting rows with x == "a"
head(DT["a"])

Joins or merge data table using keys

DT1 <- data.table(x = c("a", "a", "b", "dt1"), y = 1:4)
DT2 <- data.table(x = c("a", "b", "dt2"), z = 5:7)
# set keys
setkey(DT1, x)
setkey(DT2, x)
# joint
merge(DT1, DT2)

Fast reading

big_df <- data.frame(x = rnorm(1E6), y = rnorm(1E6))
file <- tempfile()
write.table(big_df, file = file, row.names = FALSE, col.names = TRUE, sep = "\t", quote = FALSE)
# fread command chould be applied to read data tables
system.time(fread(file))
system.time(read.table(file, head = TRUE, sep = "\t")) # about 10 times slower

[Getting and Cleaning data] Week 1

Week 1

Raw Data VS Processed Data:

Raw Data VS Tidy Data

Download files

Reading flat local file

Reading Excel files

Reading XML file

Reading JSON

Using data.table

容器中nginx無法使用同一個網絡下的容器域名

NETCore中實現一個輕量無負擔的極簡任務調度ScheduleTask

docker使用特定的網絡

使用c#強大的表達式樹實現對象的深克隆之解決循環引用的問題

「Pygors跨平臺GUI」1：Pygors跨平臺GUI應用研究

nodejs學習07——API

避免DbContext同時在多個線程調用

Python: SunMoonTimeCalculator

GPT-4o 引領人機交互新風向，向量數據庫賽道沸騰了

free AI online tools All In One

Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA

[Getting and Cleaning data] Week 1

[Help] Proximal mapping

[Latex] natbib package to cite references

[R] How to install RMySQL package on Window

https://yachay.unat.edu.pe/blog/index.php?comment_area=format_blog&comment_component=blog&comment_co

linux以太網驅動總結