A Free Learning Path for Becoming and Advancing as a Data Analyst


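I have collected some reliable free big data courses from outside China and put together a self-study guide for learning online.
This is currently an alpha version; I will translate, organize, and expand it over time.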

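Notes:
  • This is a very technical guide covering big data processing, programming, and statistics. It is not about Excel sheets, PowerPoint, or business-consulting-style market analysis. If your goal is to work as a regular Business Analyst or in BI consulting, you do not need this guide.
  • It targets the processing and analysis of big data (1 TB+). If your data is just a few Excel sheets, skip this.
  • All course content is in English, and you may need to get around the firewall (at your own risk).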


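Highlights:
  • Everything is free!
  • Helps complete beginners get started quickly (teaches basic statistics and programming; no background is required, but common sense is).
  • Covers the key ideas, methods, and tools of the whole big data analysis process, from data collection and analysis to the final visualization.
  • Time required: 310+ hours.
    • Newbie: That long? Isn't that too slow?
    • Answer: What? You have no background at all, so how fast did you expect it to be? You studied English for 9 years and still needed 3 months at New Oriental to prepare for the GRE.
    • Newbie: I have already studied some of this.
    • Answer: Then skip those parts, newbie.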

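Disclaimer: I studied and built my professional skills in an English-speaking environment, so I do not know the Chinese names of many terms. Corrections are welcome.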

This guide covers the following areas:

Core courses:
  • exploratory and predictive statistics
  • basic Python
  • advanced computer program design
  • an introduction to algorithms
  • R for statistical analysis
  • practical machine learning techniques
  • Unix
  • data visualization best practices

Advanced tracks:
Track A - Presentation: Visualizing Data
Track B - Algorithms: Analyzing Social Networks
Track C - Technology: Big Data: Hadoop and MapReduce

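A guide like this takes time to compile, and I am not sure whether the content above interests you. If this answer gets more than 50 upvotes, I will write the guide out.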


-------------------------------divider-------------------------------------------

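Section: Statistics

Exploratory and Predictive Statistics - Introductory Statistics

Statistics Crash Course

1. Statistics - Udemy (12 hours)
This course covers the basics of first-year statistics. It is quick and blunt, and it gives you a basic idea of what statistics is. It will not make you proficient, but it will at least show you what statistics looks like in practice.

Optional: complete introductory course sequence (strongly recommended if you have the time)
2.1 Introduction to Statistics: Descriptive Statistics (50 hours)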
The focus of Stat2.1x is on descriptive statistics. The goal of descriptive statistics is to summarize and present numerical information in a manner that is illuminating and useful. The course will cover graphical as well as numerical summaries of data, starting with a single variable and progressing to the relation between two variables. Methods will be illustrated with data from a variety of areas in the sciences and humanities.
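
As a small illustration of the kind of numerical summaries described here, the sketch below uses Python's standard `statistics` module. The study-hours and exam-score values are invented for illustration and do not come from the course. It first summarizes a single variable, then measures the relation between two variables with a hand-written Pearson correlation.

```python
import math
import statistics

# Hypothetical data: hours studied and exam scores for six students.
hours = [2, 4, 5, 7, 8, 10]
scores = [55, 60, 68, 74, 79, 90]

# Summaries of a single variable.
mean_score = statistics.mean(scores)
median_score = statistics.median(scores)
stdev_score = statistics.stdev(scores)  # sample standard deviation


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Relation between two variables.
r = pearson_r(hours, scores)
print(mean_score, median_score, round(stdev_score, 2), round(r, 3))
```

A value of `r` close to 1 indicates that, in this made-up data, scores rise almost linearly with hours studied.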

2.2 Introduction to Statistics: Probability (50 hours)
The focus of Stat2.2x is on probability theory: exactly what is a random sample, and how does randomness work? If you buy 10 lottery tickets instead of 1, does your chance of winning go up by a factor of 10? What is the law of averages? How can polls make accurate predictions based on data from small fractions of the population? What should you expect to happen "just by chance"? These are some of the questions we will address in the course.
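
The lottery question can be worked through numerically. The sketch below assumes each ticket wins independently with the same probability (as with scratch cards); the 1-in-100 chance is an illustrative assumption, not a figure from the course. Under a different assumption, such as ten distinct numbers in a single draw with one winning number, the chance would rise by exactly a factor of 10.

```python
from fractions import Fraction


def chance_at_least_one_win(p, tickets):
    """Probability of at least one win when each ticket wins independently with probability p."""
    return 1 - (1 - p) ** tickets


p = Fraction(1, 100)  # assumed per-ticket chance, chosen only for illustration
one = chance_at_least_one_win(p, 1)
ten = chance_at_least_one_win(p, 10)

# How many times larger is the chance with 10 tickets than with 1?
ratio = float(ten / one)
print(float(one), float(ten), ratio)
```

With independent tickets the ratio comes out slightly below 10, because some outcomes contain more than one winning ticket and are only counted once.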

2.3 Introduction to Statistics: Inference (50 hours)
The focus of Stat2.3x is on statistical inference: how to make valid conclusions based on data from random samples. At the heart of the main problem addressed by the course will be a population (which you can imagine for now as a set of people) connected with which there is a numerical quantity of interest (which you can imagine for now as the average number of MOOCs the people have taken).
We will discuss good ways to select a subset of the population (yes, at random); how to estimate the numerical quantity of interest, based on what you see in your sample; and ways to test hypotheses about numerical or probabilistic aspects of the problem.
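
A minimal sketch of this idea, using only Python's standard library: draw a random sample from a population, estimate the population average from it, and attach an approximate 95% confidence interval. The population here is simulated and entirely hypothetical, and the interval uses the normal approximation, which is one common choice rather than the course's prescribed method.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: number of MOOCs taken by each of 10,000 people.
population = [random.randint(0, 12) for _ in range(10_000)]

# Select the subset at random, without replacement.
sample = random.sample(population, 400)

# Estimate the quantity of interest from the sample alone.
estimate = statistics.mean(sample)
std_error = statistics.stdev(sample) / math.sqrt(len(sample))

# Approximate 95% confidence interval (normal approximation).
low = estimate - 1.96 * std_error
high = estimate + 1.96 * std_error
print(round(estimate, 2), round(low, 2), round(high, 2))
```

In a real study the population average is unknown; here it could be checked with `statistics.mean(population)` to see how close the estimate lands.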


Section: Programming


Basic Python

1. Intro to Python (3 - 5 hours): crash course

This is a great place to start if you have no programming background at all or want to brush up. If you have programming experience but have never seen Python, you may still want to skim through these lessons. You’ll learn basic programming techniques, such as loops, lists and dictionaries, functions, classes, and file input/ output.
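
To give a feel for the techniques listed above, here is a small sketch that touches each of them: a class, a function, a loop, a list, a dictionary, and file input/output. The `WordCounter` class and the sample text are made up for illustration and are not taken from the course.

```python
import os
import tempfile


class WordCounter:
    """Counts how often each word appears in the lines it is fed."""

    def __init__(self):
        self.counts = {}  # dictionary: word -> number of occurrences

    def add_line(self, line):
        # split() returns a list of words; the loop visits each one.
        for word in line.lower().split():
            self.counts[word] = self.counts.get(word, 0) + 1


def count_words_in_file(path):
    """Read a text file line by line and return its word counts."""
    counter = WordCounter()
    with open(path, encoding="utf-8") as handle:  # file input
        for line in handle:
            counter.add_line(line)
    return counter.counts


# File output: write a small sample file, then read it back.
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write("big data big ideas\nsmall data\n")
    word_counts = count_words_in_file(sample_path)

print(word_counts)
```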


1.1 Bonus: Complete the Python Statistics Problem Set (0.5 hours)

2. Videos and Problem Sets of Design of Computer Programs (20 - 30 hours)
This class will teach you to write elegant and efficient code. This will be essential in order to manipulate data effectively and write code that is reusable and easy for others to understand. You will also learn about some of the more sophisticated Python techniques, such as generator functions and list comprehensions.
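
For readers who have not seen the two techniques named above, here is a brief sketch. The `running_mean` function and the sample readings are invented for illustration and are not course material.

```python
def running_mean(values):
    """Generator function: lazily yields the mean of the values seen so far."""
    total = 0.0
    for count, value in enumerate(values, start=1):
        total += value
        yield total / count


readings = [4, 8, 6, 2]

# List comprehension: build a new list from an existing one in a single expression.
squares_of_large = [x * x for x in readings if x > 4]

# The generator produces values one at a time; list() collects them all.
means = list(running_mean(readings))
print(squares_of_large, means)
```

A generator computes each value only when it is requested, which matters when the input is too large to hold in memory at once.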

Optional: Computer programming and Python, complete introductory course

2. Introduction to Computer Science and Programming Using Python (135 hours)
This course focuses on breadth rather than depth. The goal is to provide students with a brief introduction to many topics so they will have an idea of what is possible when they need to think about how to use computation to accomplish some goal later in their career.
  • A Notion of computation
  • The Python programming language
  • Some simple algorithms
  • Testing and debugging
  • An informal introduction to algorithmic complexity
  • Data structures


SQL and JSON

1. Introduction to Database (10 hours; only the introductory sections are needed)
Watch the videos on Relational Databases, JSON Data, Relational Algebra, and SQL, and complete the exercises for those sections.
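
As a rough preview of how JSON data and SQL fit together, the sketch below parses a JSON string, loads it into an in-memory SQLite table, and runs an aggregate query, all with Python's standard library. The `orders` table, its columns, and the records are hypothetical examples, not material from the course.

```python
import json
import sqlite3

# Hypothetical JSON payload, e.g. what a web API might return.
raw = (
    '[{"customer": "ann", "amount": 30},'
    ' {"customer": "bob", "amount": 12},'
    ' {"customer": "ann", "amount": 5}]'
)
orders = json.loads(raw)  # list of dictionaries

# Load the records into a relational table (in memory, nothing is written to disk).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (:customer, :amount)",
    orders,
)

# SQL query: total amount per customer.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
conn.close()
print(totals)
```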


Getting Started with Algorithms

1. Introduction to Algorithms (SMA 5503) (15 hours; only the introductory sections are needed)
This course teaches techniques for the design and analysis of efficient algorithms, emphasizing methods useful in practice. Topics covered include: sorting; search trees, heaps, and hashing; divide-and-conquer; dynamic programming; amortized analysis; graph algorithms; shortest paths; network flow; computational geometry; number-theoretic algorithms; polynomial and matrix calculations; caching; and parallel computing.
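
Two of the listed topics, sorting and divide-and-conquer, meet in merge sort. The sketch below is a plain textbook-style version written for this guide, not code from the course: split the list in half, sort each half recursively, then merge the two sorted halves.

```python
def merge_sort(items):
    """Return a new sorted list using divide-and-conquer."""
    if len(items) <= 1:
        return list(items)  # a list of 0 or 1 items is already sorted

    # Divide: split the list in half and sort each half recursively.
    mid = len(items) // 2
    left = merge_sort(items[:mid])
    right = merge_sort(items[mid:])

    # Conquer: merge the two sorted halves into one sorted list.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


sorted_values = merge_sort([5, 2, 9, 1, 5, 6])
print(sorted_values)
```

Merge sort runs in O(n log n) time, compared with O(n^2) for simple approaches such as insertion sort in the worst case.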

Section: Tools


1. Unix Basics [4:20] (1 hour)
Most big data development and analysis happens on Unix systems. If you use a Mac or Unix, you need to learn how to talk to your computer using the command line.
Watch
  • [Lecture 3: Linux and Server-Side Javascript]
  • [Lecture 4a: The Linux Command Line ]

2. Try Git (1 hour)
Git is a version control system. It enables programmers to work together on large projects without overwriting each other’s work. Furthermore, it saves old versions of code in case you make a mistake and need to revert back. It can also be a useful portfolio of your programming and analysis projects to show potential employers.


Section: Analysis

Data Visualization Best Practices


1. Introduction to Infographics and Data Visualization (5 hours)
These videos are enjoyable and they make a nice break from the more technically challenging courses in this path. However, while the material in the course may be easy to understand, data visualization is a deeper topic than it seems. These examples should help illuminate what makes a good visualization and give ideas for some more creative ways to display information. You will also learn general principles of graphic design and visual perception.

Optional: Information Dashboard Design: The Effective Visual Communication of Data by Stephen Few, a classic book on dashboard design


Data Analysis with Python

Python has many libraries for statistics and data analysis. Commonly used ones include Pandas, SciPy, NumPy, and scikit-learn.
1. Introduction to Pandas (1 hour)
2. Explore the SciPy and NumPy libraries (5 hours)


Practical Machine Learning

Machine learning aims to extract knowledge from data, relying on fundamental concepts in computer science, statistics, probability and optimization. Learning algorithms enable a wide range of applications, from everyday tasks such as product recommendations and spam filtering to bleeding edge applications like self-driving cars and personalized medicine. In the age of ‘Big Data,’ with datasets rapidly growing in size and complexity and cloud computing becoming more pervasive, machine learning techniques are fast becoming a core component of large-scale data processing pipelines.

1. Introduction to Big Data with Apache Spark (30 hours, with Python)
This course teaches students how to use PySpark (part of Apache Spark) to analyze data for decision support and to build data-intensive products and services, such as recommendation, prediction, and diagnostic systems.
  • Learn how to use Apache Spark to perform data analysis
  • How to use parallel programming to explore data sets
  • Apply Log Mining, Textual Entity Recognition and Collaborative Filtering to real world data questions
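
The log mining mentioned above typically follows a map-then-reduce pattern, which frameworks such as Spark distribute across a cluster. The sketch below shows only the pattern, on a single machine in plain Python; it does not use Spark, and the log lines and their format are invented for illustration.

```python
from collections import Counter
from functools import reduce

# Hypothetical web server log lines: client, method, path, status code.
log_lines = [
    "10.0.0.1 GET /index.html 200",
    "10.0.0.2 GET /missing 404",
    "10.0.0.1 GET /about.html 200",
    "10.0.0.3 GET /missing 404",
]


def map_line(line):
    """Map step: turn one log line into a one-entry count keyed by status code."""
    status = line.split()[-1]
    return Counter({status: 1})


def combine(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


# Each line is mapped independently, then the partial counts are merged pairwise.
status_counts = reduce(combine, map(map_line, log_lines), Counter())
print(dict(status_counts))
```

Because each line is mapped independently and the merge is associative, the work could be split across many machines, which is the property cluster frameworks rely on.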

2. Scalable Machine Learning (35 hours, with Python and Spark)
This course introduces the underlying statistical and algorithmic principles required to develop scalable real-world machine learning pipelines. We present an integrated view of data processing by highlighting the various components of these pipelines, including exploratory data analysis, feature extraction, supervised learning, and model evaluation. You will gain hands-on experience applying these principles using Apache Spark, a cluster computing system well-suited for large-scale machine learning tasks. You will implement scalable algorithms for fundamental statistical models (linear regression, logistic regression, matrix factorization, principal component analysis) while tackling key problems from various domains: online advertising, personalized recommendation, and cognitive neuroscience.
  • The underlying statistical and algorithmic principles required to develop scalable real-world machine learning pipelines
  • Exploratory data analysis, feature extraction, supervised learning, and model evaluation
  • Application of these principles using Apache Spark
  • How to implement scalable algorithms for fundamental statistical models
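
Linear regression is the first of the statistical models named above. As a hedged, single-machine sketch in plain Python (not Spark code and not taken from the course), the example below fits a straight line by ordinary least squares and evaluates it with mean squared error. The data points are made up and lie exactly on the line y = 2x + 1.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(xs, ys, slope, intercept):
    """Model evaluation: average squared gap between predictions and targets."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical training data generated from y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
mse = mean_squared_error(xs, ys, slope, intercept)
print(slope, intercept, mse)
```

The sums in `fit_line` can each be computed over separate chunks of data and then added together, which hints at why this model scales well to distributed settings.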

Optional: Statistical Learning (30 hours, with R)
This is an introductory-level course in supervised learning, with a focus on regression and classification methods. The syllabus includes: linear and polynomial regression, logistic regression and linear discriminant analysis; cross-validation and the bootstrap, model selection and regularization methods (ridge and lasso); nonlinear models, splines and generalized additive models; tree-based methods, random forests and boosting; support-vector machines. Some unsupervised learning methods are discussed: principal components and clustering (k-means and hierarchical).
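
Cross-validation, one of the syllabus items above, can be sketched without any library. The course itself uses R; the example below is written in Python for consistency with the rest of this guide, and `k_fold_splits` is a hypothetical helper, not a function from the course. It splits item indices into k folds so that each item is used for testing exactly once.

```python
def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_items))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


splits = list(k_fold_splits(10, 3))
for train, test in splits:
    print("train:", train, "test:", test)
```

In practice the indices are usually shuffled before splitting, so that any ordering in the data does not bias the folds.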

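Hands-On Big Data Analysis with R

Note: R is not well suited to truly large-scale data applications. These courses are supplementary and can be skipped.
1. Try R (5 hours)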
R is a tool for statistics and data modeling. The R programming language is elegant, versatile, and has a highly expressive syntax designed around working with data. R is more than that, though — it also includes extremely powerful graphics capabilities. If you want to easily manipulate your data and present it in compelling ways, R is the tool for you.
This course will teach you the basics of R: data types, summary statistics, functions, and control structures.

2. The Analytics Edge (100 hours)
  • An applied understanding of many different analytics methods, including linear regression, logistic regression, CART, clustering, and data visualization
  • How to implement all of these methods in R
  • An applied understanding of mathematical optimization and how to solve optimization models in spreadsheet software
Author: 合歡樹
Link: http://www.zhihu.com/question/29265587/answer/46676970

I hope this helps those who want to become data analysts. You are welcome to join the QQ group to chat and discuss: 570180534.

