A Knowledge Map for Big Data Engineers

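What does someone doing big data work in a company actually need to know? I think the question has to be viewed from two angles: technology and business. The technical side mainly covers probability and mathematical statistics, computer systems, algorithms, and programming; the business side varies with what each company does. Big data engineers need to learn to apply data mining methods, with the help of computer systems and programming tools, to solve real problems. Only then can they dig drivers of business growth out of massive data sets and create more value for the company in a fiercely competitive market.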

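Business needs differ from company to company, but the technical fundamentals are shared. Here I briefly summarize the technical knowledge that big data engineers need. It mainly covers the basics of databases, data warehouses, programming, distributed systems, the Hadoop ecosystem, data mining, and machine learning. Of course, what I list here is what a whole team should have collectively; each person's emphasis will depend on their role on the team. This is only meant to get the discussion started, and comments are welcome.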

| Topic | Content | Key points | Reference |
| --- | --- | --- | --- |
| DB/OLTP & DW/OLAP | Database/OLTP basics | The relational model, SQL, index/secondary index, inner join/left join/right join/full join, transaction/ACID | Ramakrishnan, Raghu, and Johannes Gehrke. Database Management Systems. |
| | Database internals & implementation | Architecture, memory management, storage/B+ tree, query parsing/optimization/execution, hash join/sort-merge join | |
| | Distributed and parallel databases | Sharding, database proxy | |
| | Data warehouse/OLAP | Materialized views, ETL, column-oriented storage, reporting, BI tools | |
| Basic programming | Programming languages | Java, Python (NumPy/scikit-learn), SQL | |
| | OS | Linux | |
| | DB & DW systems | MySQL/Hive/Impala | |
| | Text formats and processing | JSON/XML, regex | |
| | Tools | Git/SVN, Maven | |
| Distributed system & Hadoop ecosystem & NoSQL | Distributed system principles and theory | CAP theorem, RPC (Protocol Buffers/Thrift/Avro), ZooKeeper, metadata management (HCatalog) | |
| | Distributed storage & computing framework & resource management | Hadoop/HDFS/MapReduce/YARN | Tom White. Hadoop: The Definitive Guide.; Donald Miner, Adam Shook. MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems. |
| | SQL on Hadoop | Data (log) acquisition/integration/fusion, normalization, feature extraction: Sqoop, Flume/Scribe/Chukwa, SerDe | Edward Capriolo, Dean Wampler, Jason Rutherglen. Programming Hive. |
| | Query & in-database analytics | Hive, Impala, UDF/UDAF | |
| | Large-scale data mining & machine learning frameworks | Spark/MLbase, Mahout | |
| | Stream processing | Storm | |
| | NoSQL | HBase/Cassandra (column-oriented database) | Lars George. HBase: The Definitive Guide. |
| | | MongoDB (document database) | |
| | | Neo4j (graph database) | |
| | | Redis (cache) | |
| Data mining & machine learning | DM & ML basics | Numerical/categorical variables, training/test data, overfitting, bias/variance, precision/recall, tagging | |
| | Statistics | Data exploration (mean, median/range/standard deviation/variance/histogram), probability distributions (Normal/Gaussian, Poisson), covariance, correlation coefficient, distance and similarity computation, Bayes' theorem, Monte Carlo method, hypothesis testing | |
| | Supervised learning | Classifiers, boosting, prediction, regression analysis | Han, Jiawei, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques. |
| | Unsupervised learning | Clustering | |
| | Collaborative filtering | Item-based CF, user-based CF | |
| | Algorithm: classifier | Decision trees, KNN (k-nearest neighbors), SVM (support vector machines), SVD (singular value decomposition), naive Bayes classifiers, neural networks | |
| | Algorithm: regression | Linear regression, logistic regression, ranking, perceptron | |
| | Algorithm: clustering | Hierarchical clustering, k-means clustering, spectral clustering | |
| | Algorithm: dimensionality reduction | PCA (principal component analysis), LDA (linear discriminant analysis), MDS (multidimensional scaling) | |
| | Text mining | Corpus, term-document matrix, term frequency & weight, association rules, market basket analysis, vocabulary mapping, sentiment analysis, tagging | Jimmy Lin and Chris Dyer. Data-Intensive Text Processing with MapReduce. |
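
To make one of the DM & ML basics listed above a little more concrete, here is a minimal sketch of precision and recall for a binary classifier, written in plain Python. The function name `precision_recall` and the sample labels are made up for illustration and do not come from any particular library; in practice a library such as scikit-learn would normally be used instead.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for one positive class.

    y_true and y_pred are equal-length sequences of labels.
    This is an illustrative helper, not a library API.
    """
    pairs = list(zip(y_true, y_pred))
    # True positives: predicted positive and actually positive.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positives: predicted positive but actually negative.
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives: predicted negative but actually positive.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when nothing was predicted
    # positive, or when there are no positive examples at all.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels: 1 = positive, 0 = negative.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 1, 1, 0, 0, 1]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision answers "of everything flagged as positive, how much really was?", while recall answers "of everything that really was positive, how much did we find?". The two usually trade off against each other, which is why they are reported together.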

