Decision Tree Binary Classification with Spark MLlib: Website Classification (1)

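The previous article covered an implementation of collaborative filtering on Spark. This one looks at binary classification with decision trees.
Classification is commonly applied to products. Suppose you want to recommend a book to someone: you first need to know what type each book is, then what types that person likes, and only when the two match does the recommendation work. That is the kind of scenario in which products need to be classified. A person can easily judge which type a book belongs to, but for a machine it is much harder. With a small amount of data you might still label the products by hand, but at today's data volumes manual labeling is not feasible. Algorithms modeled on how people reason were developed for this purpose, and the decision tree discussed here is one of the many classification algorithms.
Let's see how to implement a decision tree on Spark: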

1. Import the required packages

import pyspark
import numpy as np
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree

2. Initialize a SparkContext

sc = pyspark.SparkContext(master="local[*]",appName="StumbleuponAnalysis")

3. Load the data

3.1 Open the data file

The dataset used here is the StumbleUpon Evergreen Classification Challenge dataset from Kaggle.
If you are not sure where to get it, it can be downloaded from Kaggle.

raw_data_and_header = sc.textFile("file:/home/zh123/.jupyter/workspace/stumbleupon/train.tsv")

3.2 Inspect the data format

raw_data_and_header.take(5)
['"url"\t"urlid"\t"boilerplate"\t"alchemy_category"\t"alchemy_category_score"\t"avglinksize"\t"commonlinkratio_1"\t"commonlinkratio_2"\t"commonlinkratio_3"\t"commonlinkratio_4"\t"compression_ratio"\t"embed_ratio"\t"framebased"\t"frameTagRatio"\t"hasDomainLink"\t"html_ratio"\t"image_ratio"\t"is_news"\t"lengthyLinkDomain"\t"linkwordscore"\t"news_front_page"\t"non_markup_alphanum_characters"\t"numberOfLinks"\t"numwords_in_url"\t"parametrizedLinkRatio"\t"spelling_errors_ratio"\t"label"',
 '"http://www.bloomberg.com/news/2010-12-23/ibm-predicts-holographic-calls-air-breathing-batteries-by-2015.html"\t"4042"\t"{""title"":""IBM Sees Holographic Calls Air Breathing Batteries ibm sees holographic calls, air-breathing batteries"",""body"":""A sign stands outside the International Business Machines Corp IBM Almaden Research Center campus in San Jose California Photographer Tony Avelar Bloomberg Buildings stand at the International Business Machines Corp IBM Almaden Research Center campus in the Santa Teresa Hills of San Jose California Photographer Tony Avelar Bloomberg By 2015 your mobile phone will project a 3 D image of anyone who calls and your laptop will be powered by kinetic energy At least that s what International Business Machines Corp sees in its crystal ball The predictions are part of an annual tradition for the Armonk New York based company which surveys its 3 000 researchers to find five ideas expected to take root in the next five years IBM the world s largest provider of computer services looks to Silicon Valley for input gleaning many ideas from its Almaden research center in San Jose California Holographic conversations projected from mobile phones lead this year s list The predictions also include air breathing batteries computer programs that can tell when and where traffic jams will take place environmental information generated by sensors in cars and phones and cities powered by the heat thrown off by computer servers These are all stretch goals and that s good said Paul Saffo managing director of foresight at the investment advisory firm Discern in San Francisco In an era when pessimism is the new black a little dose of technological optimism is not a bad thing For IBM it s not just idle speculation The company is one of the few big corporations investing in long range research projects and it counts on innovation to fuel growth Saffo said Not all of its predictions pan out though IBM was overly optimistic about the spread of 
speech technology for instance When the ideas do lead to products they can have broad implications for society as well as IBM s bottom line he said Research Spending They have continued to do research when all the other grand research organizations are gone said Saffo who is also a consulting associate professor at Stanford University IBM invested 5 8 billion in research and development last year 6 1 percent of revenue While that s down from about 10 percent in the early 1990s the company spends a bigger share on research than its computing rivals Hewlett Packard Co the top maker of personal computers spent 2 4 percent last year At Almaden scientists work on projects that don t always fit in with IBM s computer business The lab s research includes efforts to develop an electric car battery that runs 500 miles on one charge a filtration system for desalination and a program that shows changes in geographic data IBM rose 9 cents to 146 04 at 11 02 a m in New York Stock Exchange composite trading The stock had gained 11 percent this year before today Citizen Science The list is meant to give a window into the company s innovation engine said Josephine Cheng a vice president at IBM s Almaden lab All this demonstrates a real culture of innovation at IBM and willingness to devote itself to solving some of the world s biggest problems she said Many of the predictions are based on projects that IBM has in the works One of this year s ideas that sensors in cars wallets and personal devices will give scientists better data about the environment is an expansion of the company s citizen science initiative Earlier this year IBM teamed up with the California State Water Resources Control Board and the City of San Jose Environmental Services to help gather information about waterways Researchers from Almaden created an application that lets smartphone users snap photos of streams and creeks and report back on conditions The hope is that these casual observations will help local 
and state officials who don t have the resources to do the work themselves Traffic Predictors IBM also sees data helping shorten commutes in the next five years Computer programs will use algorithms and real time traffic information to predict which roads will have backups and how to avoid getting stuck Batteries may last 10 times longer in 2015 than today IBM says Rather than using the current lithium ion technology new models could rely on energy dense metals that only need to interact with the air to recharge Some electronic devices might ditch batteries altogether and use something similar to kinetic wristwatches which only need to be shaken to generate a charge The final prediction involves recycling the heat generated by computers and data centers Almost half of the power used by data centers is currently spent keeping the computers cool IBM scientists say it would be better to harness that heat to warm houses and offices In IBM s first list of predictions compiled at the end of 2006 researchers said instantaneous speech translation would become the norm That hasn t happened yet While some programs can quickly translate electronic documents and instant messages and other apps can perform limited speech translation there s nothing widely available that acts like the universal translator in Star Trek Second Life The company also predicted that online immersive environments such as Second Life would become more widespread While immersive video games are as popular as ever Second Life s growth has slowed Internet users are flocking instead to the more 2 D environments of Facebook Inc and Twitter Inc Meanwhile a 2007 prediction that mobile phones will act as a wallet ticket broker concierge bank and shopping assistant is coming true thanks to the explosion of smartphone applications Consumers can pay bills through their banking apps buy movie tickets and get instant feedback on potential purchases all with a few taps on their phones The nice thing about the list 
is that it provokes thought Saffo said If everything came true they wouldn t be doing their job To contact the reporter on this story Ryan Flinn in San Francisco at rflinn bloomberg net To contact the editor responsible for this story Tom Giles at tgiles5 bloomberg net by 2015, your mobile phone will project a 3-d image of anyone who calls and your laptop will be powered by kinetic energy. at least that\\u2019s what international business machines corp. sees in its crystal ball."",""url"":""bloomberg news 2010 12 23 ibm predicts holographic calls air breathing batteries by 2015 html""}"\t"business"\t"0.789131"\t"2.055555556"\t"0.676470588"\t"0.205882353"\t"0.047058824"\t"0.023529412"\t"0.443783175"\t"0"\t"0"\t"0.09077381"\t"0"\t"0.245831182"\t"0.003883495"\t"1"\t"1"\t"24"\t"0"\t"5424"\t"170"\t"8"\t"0.152941176"\t"0.079129575"\t"0"',
 '"http://www.popsci.com/technology/article/2012-07/electronic-futuristic-starting-gun-eliminates-advantages-races"\t"8471"\t"{""title"":""The Fully Electronic Futuristic Starting Gun That Eliminates Advantages in Races the fully electronic, futuristic starting gun that eliminates advantages in races the fully electronic, futuristic starting gun that eliminates advantages in races"",""body"":""And that can be carried on a plane without the hassle too The Omega E Gun Starting Pistol Omega It s easy to take for granted just how insanely close some Olympic races are and how much the minutiae of it all can matter The perfect example is the traditional starting gun Seems easy You pull a trigger and the race starts Boom What people don t consider When a conventional gun goes off the sound travels to the ears of the closest runner a fraction of a second sooner than the others That s just enough to matter and why the latest starting pistol has traded in the mechanical boom for orchestrated electronic noise Omega has been the watch company tasked as the official timekeeper of the Olympic Games since 1932 At the 2010 Vancouver games they debuted their new starting gun which is a far cry from the iconic revolvers associated with early games it s clearly electronic but still more than a button that s pressed to get the show rolling About as far away as you can get probably while still clearly being a starting gun Pull the trigger once and off the Olympians go If it s pressed twice consecutively it signals a false start Working through a speaker system is what eliminates any kind of advantage for athletes It s not a big advantage being close to a gun but the sound of the bullet traveling one meter every three milliseconds could contribute to a win Powder pistols have been connected to a speaker system before but even then runners could react to the sound of the real pistol firing rather than wait for the speaker sounds to reach them This year s setup will have speakers placed 
equidistant from runners forcing the sound to reach each competitor at exactly the same time It wouldn t be an enormous difference Omega Timing board member Peter H\\u00fcrzeler said in an email but when you think about reaction times being measured in tiny fractions of a second placing a speaker behind each lane has eliminated any sort of advantage for any athlete They all hear the start commands and signal at exactly the same moment There s also an ulterior reason for its look In a post September 11th world a gun on its way to a major event is going to raise more than a few TSA eyebrows even if it s a realistic looking fake Rather than deal with that the e gun can be transported while still maintaining the general look of a starting gun But there s still nothing like hearing a starting gun go off at the start of a race more than signaling the runners there s probably some Pavlovian response after more than a century of Olympic games that make people want to hear the real thing not a whiny electronic noise Everyone in the stands at home thankfully will still be getting that The sound is programmable and can be synthesized to sound like almost anything H\\u00fcrzeler says but we program it to sound like a pistol it s a way to use the best possible starting technology but to keep a rich tradition alive and that can be carried on a plane without the hassle, too technology,gadgets,london 2012,london olympics,olympics,omega,starting guns,summer olympics,timing,popular science,popsci"",""url"":""popsci technology article 2012 07 electronic futuristic starting gun eliminates advantages races""}"\t"recreation"\t"0.574147"\t"3.677966102"\t"0.50802139"\t"0.288770053"\t"0.213903743"\t"0.144385027"\t"0.468648998"\t"0"\t"0"\t"0.098707403"\t"0"\t"0.203489628"\t"0.088652482"\t"1"\t"1"\t"40"\t"0"\t"4973"\t"187"\t"9"\t"0.181818182"\t"0.125448029"\t"1"',
 '"http://www.menshealth.com/health/flu-fighting-fruits?cm_mmc=Facebook-_-MensHealth-_-Content-Health-_-FightFluWithFruit"\t"1164"\t"{""title"":""Fruits that Fight the Flu fruits that fight the flu | cold & flu | men\'s health"",""body"":""Apples The most popular source of antioxidants in our diet one apple has an antioxidant effect equivalent to 1 500 mg of vitamin C Apples are loaded with protective flavonoids which may prevent heart disease and cancer Next Papayas With 250 percent of the RDA of vitamin C a papaya can help kick a cold right out of your system The beta carotene and vitamins C and E in papayas reduce inflammation throughout the body lessening the effects of asthma Next Cranberries Cranberries have more antioxidants than other common fruits and veggies One serving has five times the amount in broccoli Cranberries are a natural probiotic enhancing good bacteria levels in the gut and protecting it from foodborne illnesses Next Grapefruit Loaded with vitamin C grapefruit also contains natural compounds called limonoids which can lower cholesterol The red varieties are a potent source of the cancer fighting substance lycopene Next Bananas One of the top food sources of vitamin B6 bananas help reduce fatigue depression stress and insomnia Bananas are high in magnesium which keeps bones strong and potassium which helps prevent heart disease and high blood pressure Next everything you need to know about cold and flu so you don\\u2019t get sick this season, at men\\u2019s health.com cold, flu, infection, sore throat, sneeze, immunity, germs, allergies, stay healthy, sick, contagious, medicines, cold medicine"",""url"":""menshealth health flu fighting fruits cm mmc Facebook Mens Health Content Health Fight Flu With 
Fruit""}"\t"health"\t"0.996526"\t"2.382882883"\t"0.562015504"\t"0.321705426"\t"0.120155039"\t"0.042635659"\t"0.525448029"\t"0"\t"0"\t"0.072447859"\t"0"\t"0.22640177"\t"0.120535714"\t"1"\t"1"\t"55"\t"0"\t"2240"\t"258"\t"11"\t"0.166666667"\t"0.057613169"\t"1"',
 '"http://www.dumblittleman.com/2007/12/10-foolproof-tips-for-better-sleep.html"\t"6684"\t"{""title"":""10 Foolproof Tips for Better Sleep "",""body"":""There was a period in my life when I had a lot of problems with sleep It took me very long to fall asleep I was easily awaken and I simply wasn t getting enough of rest at night I didn t want to take medication and this led me to learn several tips and tricks that really helped me to overcome my insomnia Some of these tips I try to follow regularly Don t worry about not getting enough sleep Try not to worry about how much you sleep Such worrying can start a cycle of negative thoughts that contribute to a condition known as learned insomnia Learned insomnia occurs when you worry so much about whether or not you will be able to get adequate sleep that the bedtime rituals and behavior actually trigger insomnia Don t force yourself to sleep The very attempt of trying to do so actually awakes you making it more difficult to sleep Go to bed only when you are feeling really tired and sleepy Don t look at the alarm clock at night Looking at the clock promotes increased anxiety and obsession about time Body heating procedures Some studies suggest that soaking in hot water before going to bed can ease the transition into a deeper sleep Avoid oversleep Don t oversleep to make up for a poor night s sleep Doing so for even a couple of days can reset your body clock and make it harder for you to sleep at night Sex Sex is a well known nighttime stress reliever Healthy sex life enhances your relationship relaxes your body releases happy chemicals and even promotes wellness And it welcomes sleep Avoid alcohol as a sleeping aid Avoid the use of alcohol in the late evening The most common myth found among people is that they believe alcohol helps in the sleep But the fact is alcohol may initially act as sedative but it produces a number of sleep impairing effects in the long run Associate your bed and bedroom with sleep and sex only 
Don t watch TV eat or read in bed Although these things help some people sleep they can also give your brain the idea that bed isn t just for sleeping and this can keep you awake Naps If you suffer from insomnia try not taking a nap If the goal is to sleep more during the night napping may steal hours desired later on If you re a regular napper and experiencing difficulty falling or staying asleep at night give up the nap and see what happens Written by C Simmons of HealthAssist net dumb little man shares ideas to make the everyday person more productive in life. expect to read tips on finance, saving money, business, and some diy for the house. tips,diy,money,finance,advice,productivity,efficient,technology,saving,software,business,tools"",""url"":""dumblittleman 2007 12 10 foolproof tips for better sleep html""}"\t"health"\t"0.801248"\t"1.543103448"\t"0.4"\t"0.1"\t"0.016666667"\t"0"\t"0.480724749"\t"0"\t"0"\t"0.095860566"\t"0"\t"0.265655744"\t"0.035343035"\t"1"\t"0"\t"24"\t"0"\t"2737"\t"120"\t"5"\t"0.041666667"\t"0.100858369"\t"1"']

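Looking at the data, we can see that the first line is a header with the field names, which we obviously do not want. In addition, every field in each row is wrapped in double quotes, and some values are missing and represented by ?, so the data needs to be cleaned.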

3.3 Data ETL

# Grab the header line
header_data = raw_data_and_header.first()
# Drop the header line
raw_non_header_data = raw_data_and_header.filter(lambda l: l != header_data)
# Remove all double quotes from each line
raw_non_quot_data = raw_non_header_data.map(lambda s: s.replace("\"", ""))
# Split each line on "\t"
data = raw_non_quot_data.map(lambda l: l.split("\t"))
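
To see what these steps do to a single record, the same cleaning can be tried in plain Python without Spark. This is only a sketch: `sample_line` is a made-up, shortened record (not a real row from train.tsv) and `clean_line` is a hypothetical helper name.

```python
# Hypothetical, shortened record in the same quoted, tab-separated style as train.tsv
sample_line = '"http://example.com/a"\t"42"\t"{}"\t"business"\t"0.78"\t"?"\t"1"'


def clean_line(line):
    """Strip every double quote, then split the line on tabs."""
    return line.replace('"', "").split("\t")


fields = clean_line(sample_line)
print(fields)
```

After cleaning, the category sits at index 3 and a missing value still shows up as the string "?", which is why the feature extraction step has to handle it explicitly.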

3.4 Count the records

data.count()
7395

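The first three fields (url, urlid and boilerplate) are of no use for building the model. Of the remaining fields, alchemy_category is the page category, the fields from alchemy_category_score through spelling_errors_ratio are numeric features, and label is the target.

[Figure: descriptions of the remaining fields]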

4. Define the feature extraction function

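def extract_features(fields,categories_dict):
    """
        fields: all fields of one row
        categories_dict: dictionary mapping category name -> category ID
    """
    # Look up the ID of this row's category
    category_id = categories_dict[fields[3]]
    # Create the category encoding vector (all zeros)
    category_features = np.zeros(len(categories_dict))
    # Set the position for this category ID to 1
    category_features[category_id] = 1
    # Build the list of numeric features, replacing missing values ("?") with 0.0
    numerical_features = [0.0 if field == "?" else float(field) for field in fields[4:-1]]
    # Concatenate the two and return the result
    return np.concatenate((category_features,numerical_features))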

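The category feature is one-hot encoded here (the OneHotEncoder scheme), which is actually very simple.
For example, given three food categories, fruit, vegetable and meat, they can be encoded as
fruit: [1,0,0]
vegetable: [0,1,0]
meat: [0,0,1]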
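
The encoding of the three food categories can be reproduced in a few lines of plain Python. This is a minimal sketch: `one_hot` and `food_ids` are hypothetical names chosen for illustration, and the category-to-index assignment is an assumption.

```python
def one_hot(category, categories_dict):
    """Return a list of zeros with a single 1.0 at the category's index."""
    vector = [0.0] * len(categories_dict)
    vector[categories_dict[category]] = 1.0
    return vector


# Hypothetical mapping for the food example (indices assumed)
food_ids = {"fruit": 0, "vegetable": 1, "meat": 2}

for name in food_ids:
    print(name, one_hot(name, food_ids))
```

Each category gets a vector as long as the number of categories, with exactly one position set. extract_features does the same thing with np.zeros and the alchemy_category field.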

5. Build the category name -> category ID dictionary

# Pipeline: take the 4th field of each row -> deduplicate -> zip into (value, index) pairs -> collect as a dictionary
categories_dict = data.map(lambda fields:fields[3]).distinct().zipWithIndex().collectAsMap()
# Check the size of the dictionary
len(categories_dict)
14

6. Create an RDD of LabeledPoint

# A LabeledPoint is simply a container for a (label, features) pair
label_point_rdd = data.map(lambda fields:LabeledPoint(
                        float(fields[-1]),
                        extract_features(fields,categories_dict)
))

7. Randomly split the data into three parts

train_data,validation_data,test_data = label_point_rdd.randomSplit([8,1,1])
# Number of records in each dataset
print(train_data.count())
print(validation_data.count())
print(test_data.count())
5926
759
710
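
The weights [8,1,1] are relative: randomSplit normalizes them so that they sum to 1, giving roughly 80% / 10% / 10%. Because each record is assigned at random, the counts are only approximately proportional, which is why the output above is not an exact 8:1:1 split. The sketch below imitates that idea in plain Python; it is not Spark's implementation, and `normalize` and `random_split` are hypothetical helper names.

```python
import random


def normalize(weights):
    """Scale the weights so that they sum to 1."""
    total = float(sum(weights))
    return [w / total for w in weights]


def random_split(items, weights, seed=None):
    """Assign each item to one split by drawing a uniform random number."""
    rng = random.Random(seed)
    # Cumulative upper bounds of the normalized weights, roughly [0.8, 0.9, 1.0]
    bounds = []
    running = 0.0
    for fraction in normalize(weights):
        running += fraction
        bounds.append(running)
    splits = [[] for _ in weights]
    for item in items:
        r = rng.random()
        for i, upper in enumerate(bounds):
            # The last split catches any rounding leftover
            if r < upper or i == len(bounds) - 1:
                splits[i].append(item)
                break
    return splits


train_part, validation_part, test_part = random_split(range(7395), [8, 1, 1], seed=42)
print(normalize([8, 1, 1]))
print(len(train_part), len(validation_part), len(test_part))
```

Passing a fixed seed makes the split repeatable, which is useful when comparing models trained on the same partition.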

8. Persist the data in memory to speed up computation

train_data.persist()
validation_data.persist()
test_data.persist()
PythonRDD[787] at RDD at PythonRDD.scala:52

9. Train the model

model = DecisionTree.trainClassifier(train_data,numClasses=2,categoricalFeaturesInfo={},impurity="entropy",maxDepth=15,maxBins=10)
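Parameter descriptions:
data: the input training set (the Scala API names this parameter input)
numClasses: the number of classes
categoricalFeaturesInfo: information about categorical features, given as a map from feature index to number of categories; an empty map, as used here, treats every feature as continuous
impurity: the criterion used to evaluate splits. There are two options:
● "gini" (Gini impurity): named after the Italian statistician Corrado Gini and used as a measure of statistical dispersion. The algorithm scores every candidate split point of every feature and picks the split with the lowest Gini impurity after splitting.
● "entropy": the familiar measure of disorder. The algorithm scores every candidate split point of every feature and picks the split with the lowest entropy after splitting.
maxDepth: the maximum depth of the decision tree
maxBins: the maximum number of bins used to discretize continuous features when searching for split points (not the number of branches per node; MLlib decision trees split each node in two)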
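
The two impurity measures can be computed by hand on class counts. The following is a minimal sketch, not MLlib's code: the function names and the example splits are assumptions made for illustration. A candidate split is scored by the impurity of its child nodes weighted by their sizes, and the split with the lowest score wins.

```python
import math


def gini(counts):
    """Gini impurity of a node given its per-class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Entropy (in bits) of a node given its per-class counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)


def weighted_impurity(children, measure):
    """Impurity after a split: child impurities weighted by child size."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * measure(child) for child in children)


parent = [5, 5]              # 5 records of class 0, 5 of class 1
split_a = [[4, 1], [1, 4]]   # hypothetical split that separates the classes partly
split_b = [[5, 0], [0, 5]]   # hypothetical split that separates them perfectly

for name, measure in (("gini", gini), ("entropy", entropy)):
    print(name, measure(parent),
          weighted_impurity(split_a, measure),
          weighted_impurity(split_b, measure))
```

Both measures are highest for the evenly mixed parent node and drop to zero for the perfect split, so they usually rank candidate splits in a similar way.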

10. Test the model

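# A first rough test
# Number of correct predictions
true_cnt = 0
# Iterate over the validation set
for i in validation_data.collect():
    # Predict from the features and compare with the true label
    if model.predict(i.features) == i.label:
        true_cnt += 1
# Compute the accuracy
print(true_cnt / validation_data.count())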
0.6245059288537549

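This way of evaluating is not rigorous: it only gives a rough accuracy figure, and that figure does not help us tune the model.
A proper evaluation metric is needed, namely AUC, which will be covered in the next article.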
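
As a preview of what AUC measures, here is a small plain-Python sketch. It uses the pairwise definition: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting as one half. The function name and the toy labels and scores are assumptions for illustration; the follow-up article may compute AUC differently, for example with MLlib's own evaluation utilities.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Toy data: one positive (score 0.2) is ranked below one negative (score 0.8)
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.2, 0.8, 0.1]
print(auc(toy_labels, toy_scores))
```

A value of 1.0 means every positive is ranked above every negative, while 0.5 is what random scoring would give. The pairwise loop is quadratic, so it is only suitable for small examples like this one.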
