Converting click-log data to FFM format: CSV2FFM

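When working with FFM data, a few questions come up right away. 1- What does FFM data mean, and what do the entries inside a feature mean? For example, given 1:2:0,3:5:1, what was the real data behind it? 2- How do you convert real data into this format? The conversion is bound to raise two issues: (1) how should single-valued and multi-valued features be treated differently? (2) after training a model on FFM data, do you still need the real values the features originally mapped to? (Running a model once will answer this, so it may not be a real issue.) 3- After the model is trained, how do you do recall? Can faiss be used?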
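As a point of reference for question 1-, libffm's text format puts one sample per line as `label field:feature:value ...`, separated by spaces. Here is a minimal sketch of reading such a line. `parse_ffm_line` is a hypothetical helper (not part of libffm), and the sample line is made up for illustration.

```python
def parse_ffm_line(line):
    """Split one libffm-style text line into (label, [(field, feature, value), ...])."""
    parts = line.split()
    label = float(parts[0])
    triples = []
    for token in parts[1:]:
        # Each token is field_index:feature_index:value
        field, feature, value = token.split(":")
        triples.append((int(field), int(feature), float(value)))
    return label, triples


# Hypothetical sample: label 1, then two field:feature:value entries.
label, triples = parse_ffm_line("1 1:2:0 3:5:1")
print(label, triples)
```

Read this way, `3:5:1` says that in field 3 the feature with index 5 is present with value 1. The original raw value is only recoverable if the index-to-value mapping was kept during conversion.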

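Let's start with 2-, converting production data into the format.

2-1. Converting single-valued data to FFM format. I found a non-parallel version on Kaggle.

It relies on one function you need to know, make_classification: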

from sklearn.datasets import make_classification

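The original version uses 100 samples. Note that int columns are numeric data, so every value in such a column shares the same feature idx, while str columns are categorical data, so their value is always 1.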
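To make the int/str distinction concrete, here is a small sketch of building such a synthetic table with make_classification. The column names, the scaling by 10, and the random seed are my own assumptions for illustration, not the Kaggle kernel's actual code.

```python
import pandas as pd
from sklearn.datasets import make_classification

# 100 samples, 5 raw features, binary label (parameters are illustrative).
X, y = make_classification(n_samples=100, n_features=5, n_informative=2,
                           n_redundant=2, random_state=0)

# Hypothetical column names: three numeric (int) columns, two categorical (str) columns.
df = pd.DataFrame(X, columns=["int1", "int2", "int3", "s1", "s2"])
for col in ["int1", "int2", "int3"]:
    # Numeric column: one feature index per column, the number itself is the value.
    df[col] = (df[col] * 10).round().astype(int)
for col in ["s1", "s2"]:
    # Categorical column: each distinct string gets its own feature index, value 1.
    df[col] = df[col].round().astype(int).astype(str)
df["clicked"] = y

print(df.dtypes)
print(df.head())
```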

My own question here had already been raised by someone else: why isn't the encoding done from 0 to 1? That part is a bit hard to understand.

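I had only just praised Kaggle for being classier than Zhihu when it blocked my account.

All I did was copy something, paste it, and ask a question, and that alone got me blocked? Ridiculous. [I registered another account and repeated the same steps, which confirmed that this was the cause. When I tried to unblock the account, it asked me the following question:

If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6 and 9. The sum of these multiples is 23. Find the sum of all the multiples of 3 or 5 below 2248.

... Kaggle really does whatever it wants.]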

Never mind, that was a detour. Back to the topic: let's look at another version.

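Below is a comparison of the Kaggle version and the Microsoft version. On 100,000 rows, the Kaggle version takes about 1 minute to convert while the Microsoft version takes about 1 second, and the run time of both grows roughly linearly with the amount of data. On time alone, the Kaggle version is already ruled out. 500,000 rows take 305 s and 5.13 s respectively, so 50 million rows would need roughly 30,000 s and 500 s; the latter is the obvious choice. [Actual run on a server with 10 million rows: Time 1:3585.653191, 2:65.854875]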
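The 50-million-row figures are a straight-line extrapolation from the 500,000-row timings. A quick sketch of that arithmetic, assuming run time really is linear in row count; `extrapolate` is a hypothetical helper name.

```python
# Measured on 500,000 rows (seconds), taken from the text above.
measured_rows = 500_000
measured_seconds = {"kaggle": 305.0, "microsoft": 5.13}


def extrapolate(seconds, rows_from, rows_to):
    """Scale a measured run time linearly to a different row count."""
    return seconds * rows_to / rows_from


target_rows = 50_000_000
estimates = {name: extrapolate(secs, measured_rows, target_rows)
             for name, secs in measured_seconds.items()}
for name, secs in estimates.items():
    print(f"{name}: about {secs:.0f} s for {target_rows:,} rows")
```

The same rule gives about 6,100 s and 103 s for 10 million rows. The server run quoted above came in faster than that (3,585 s and 66 s), so the estimate is only a rough guide across machines.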

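The Microsoft version still encodes data the same way as the synthetic data; see the post on XDeepFM pitfalls (XDeepFM踩坑之路). [That is, it finishes encoding the features of one field before encoding those of the next field. This differs from the description in libffm, which finishes encoding the features of one sample before assigning indices to the new features that appear in the next sample.]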

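When the data is continuous (numeric float), the Kaggle version gives wrong output (the float columns are missing from its result below), so it is not usable.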

Origin Data:
    float1  float2  float3 string1 string2  clicked
0  -0.462  -0.587  -1.221     1.0    -0.0        0
1   1.547   1.900  -0.301     0.0     1.0        1
2   0.884   1.068  -1.328    -0.0    -4.0        1
3   1.440   1.777  -0.116     1.0     1.0        1
4   1.425   1.727   0.197    -1.0    -2.0        1
5  -0.925  -1.140   0.209     1.0    -2.0        0
6  -1.597  -1.963   0.738     1.0    -5.0        0
7  -0.763  -0.938  -1.960     1.0    -1.0        1
8  -0.581  -0.721   0.823     1.0    -3.0        0
9  -2.389  -2.895   0.171     1.0     1.0        0
kaggle version:
 0    0 3:33:1 4:38:1
1    1 3:34:1 4:39:1
2    1 3:35:1 4:40:1
3    1 3:33:1 4:39:1
4    1 3:36:1 4:41:1
5    0 3:33:1 4:41:1
6    0 3:33:1 4:42:1
7    1 3:33:1 4:43:1
8    0 3:33:1 4:44:1
9    0 3:33:1 4:39:1
dtype: object 
microsoft version:
    clicked      float1      float2      float3 string1 string2
0        0  1:1:-0.462  2:2:-0.587  3:3:-1.221   4:4:1   5:8:1
1        1   1:1:1.547     2:2:1.9  3:3:-0.301   4:5:1   5:9:1
2        1   1:1:0.884   2:2:1.068  3:3:-1.328   4:6:1  5:10:1
3        1    1:1:1.44   2:2:1.777  3:3:-0.116   4:4:1   5:9:1
4        1   1:1:1.425   2:2:1.727   3:3:0.197   4:7:1  5:11:1
5        0  1:1:-0.925   2:2:-1.14   3:3:0.209   4:4:1  5:11:1
6        0  1:1:-1.597  2:2:-1.963   3:3:0.738   4:4:1  5:12:1
7        1  1:1:-0.763  2:2:-0.938   3:3:-1.96   4:4:1  5:13:1
8        0  1:1:-0.581  2:2:-0.721   3:3:0.823   4:4:1  5:14:1
9        0  1:1:-2.389  2:2:-2.895   3:3:0.171   4:4:1   5:9:1
Time 1:0.010993, 2:0.006997
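The field-by-field encoding visible in the Microsoft output can be sketched as follows. This is my own reconstruction from the printed result, not the Microsoft code itself; `to_ffm_fieldwise` is a hypothetical name, and the input is the first five rows of the table above.

```python
import pandas as pd


def to_ffm_fieldwise(df, label_col):
    """Encode one column (field) at a time, with a feature counter shared across fields."""
    out = pd.DataFrame({label_col: df[label_col]})
    next_feature = 1
    feature_cols = [c for c in df.columns if c != label_col]
    for field, col in enumerate(feature_cols, start=1):
        if pd.api.types.is_numeric_dtype(df[col]):
            # Numeric field: a single feature index, the number itself is the value.
            out[col] = [f"{field}:{next_feature}:{float(v):g}" for v in df[col]]
            next_feature += 1
        else:
            # Categorical field: one feature index per distinct value, value is 1.
            codes, uniques = pd.factorize(df[col])  # codes follow order of first appearance
            out[col] = [f"{field}:{next_feature + c}:1" for c in codes]
            next_feature += len(uniques)
    return out


raw = pd.DataFrame({
    "float1": [-0.462, 1.547, 0.884, 1.440, 1.425],
    "float2": [-0.587, 1.900, 1.068, 1.777, 1.727],
    "float3": [-1.221, -0.301, -1.328, -0.116, 0.197],
    "string1": ["1.0", "0.0", "-0.0", "1.0", "-1.0"],
    "string2": ["-0.0", "1.0", "-4.0", "1.0", "-2.0"],
    "clicked": [0, 1, 1, 1, 1],
})
ffm = to_ffm_fieldwise(raw, "clicked")
print(ffm)

# One libffm text line per row: label first, then the encoded columns.
lines = [" ".join(str(v) for v in row) for row in ffm.itertuples(index=False)]
```

Because the feature counter is shared across fields, string2 starts at index 8 only after string1 has used 4 to 7. A row-by-row encoder, as described for libffm, would hand out those indices in a different order.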

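How can we show that this encoding is valid and correct? Test it with an FFM model on movielens-1M.

[I treated the rating as the label to predict. I later found that this does not work with libffm: the official model predicts click or no click and can only output a classification result. I'll see whether ratings can be predicted.]

The command line is below. I first split the data into train.ffm and test.ffm and normalized the rating by dividing it by 5, but it still did not work. Here are the process and results: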

$ ./ffm-train -p test.ffm train.ffm mymodel
First check if the text file has already been converted to binary format (0.0 seconds)
Binary file NOT found. Convert text file to binary file (1.2 seconds)
First check if the text file has already been converted to binary format (0.0 seconds)
Binary file NOT found. Convert text file to binary file (0.3 seconds)
iter   tr_logloss   va_logloss      tr_time
   1      0.43943      0.42538          0.7
   2      0.41742      0.42085          1.4
   3      0.40935      0.41919          2.1
   4      0.40203      0.41849          2.8
   5      0.39509      0.41843          3.5
   6      0.38823      0.41871          4.1
   7      0.38129      0.41931          4.8
   8      0.37428      0.42023          5.5
   9      0.36742      0.42130          6.2
  10      0.36089      0.42281          6.8
  11      0.35491      0.42466          7.5
  12      0.34955      0.42663          8.2
  13      0.34474      0.42882          8.9
  14      0.34051      0.43103          9.5
  15      0.33675      0.43335         10.2

./ffm-predict test.ffm mymodel 'testPred.txt'
logloss = 0.43335

The results are bad enough! So I tried another method, the one used in the reference: classify the ratings into like and not like.
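A minimal sketch of that like / not-like labelling, assuming the `UserID::MovieID::Rating::Timestamp` layout of MovieLens-1M's ratings.dat. The threshold of 4, the two-field layout (user, movie), and the sample rows are all my own assumptions for illustration; the post does not state which threshold was actually used.

```python
def ratings_to_ffm(lines, like_threshold=4):
    """Turn 'user::movie::rating::timestamp' rows into binary-labelled FFM lines."""
    user_ids, movie_ids, parsed = {}, {}, []
    for line in lines:
        user, movie, rating, _timestamp = line.strip().split("::")
        parsed.append((user, movie, int(rating)))
        # Assign indices in order of first appearance, one dictionary per field.
        user_ids.setdefault(user, len(user_ids))
        movie_ids.setdefault(movie, len(movie_ids))

    n_users = len(user_ids)
    out = []
    for user, movie, rating in parsed:
        label = 1 if rating >= like_threshold else 0  # like vs. not like (assumed cut-off)
        user_feat = user_ids[user] + 1                # field 1: users take indices 1..n_users
        movie_feat = n_users + movie_ids[movie] + 1   # field 2: movies continue after users
        out.append(f"{label} 1:{user_feat}:1 2:{movie_feat}:1")
    return out


# Made-up rows in the ratings.dat layout.
sample = ["1::10::5::0", "1::20::3::0", "2::10::4::0"]
ffm_lines = ratings_to_ffm(sample)
print("\n".join(ffm_lines))
```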

The process and results:

iter   tr_logloss   va_logloss      tr_time
   1      0.56010      0.53802          0.8
   2      0.53190      0.53201          1.4
   3      0.52133      0.52879          2.1
   4      0.51214      0.52717          2.8
   5      0.50332      0.52592          3.5
   6      0.49438      0.52502          4.2
   7      0.48536      0.52439          4.9
   8      0.47661      0.52426          5.6
   9      0.46855      0.52485          6.3
  10      0.46134      0.52603          6.9
  11      0.45504      0.52750          7.6
  12      0.44953      0.52938          8.3
  13      0.44467      0.53141          9.0
  14      0.44038      0.53362          9.7
  15      0.43658      0.53591         10.3

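In the losses above, the validation loss is already at its minimum at iteration 8, and everything after that is overfitting. This calls for the early-stopping option, which I changed accordingly, but it brought no real improvement.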
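The "minimum at iteration 8" reading can be checked mechanically. The sketch below applies a simple stop-at-first-increase rule to the va_logloss column of the log above. The rule is an assumption about how such a check could work, not libffm's implementation, and `first_increase_stop` is a hypothetical name.

```python
# Validation log loss per iteration, copied from the like / not-like run above.
va_logloss = [0.53802, 0.53201, 0.52879, 0.52717, 0.52592,
              0.52502, 0.52439, 0.52426, 0.52485, 0.52603,
              0.52750, 0.52938, 0.53141, 0.53362, 0.53591]


def first_increase_stop(losses):
    """Return the 1-based iteration to keep: the one just before the loss first rises."""
    for i in range(1, len(losses)):
        if losses[i] > losses[i - 1]:
            return i  # losses[i - 1] belongs to iteration i in 1-based numbering
    return len(losses)


best_iter = first_increase_stop(va_logloss)
print(best_iter, va_logloss[best_iter - 1])
```

Stopping there only keeps the model from getting worse on the validation set. It does not lower the best value itself, which fits the observation that early stopping brought little improvement.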

I tuned once more, and the result is still not ideal:

$ ./ffm-train -p test.ffm -l 0.003 -t 50 --auto-stop train.ffm 
First check if the text file has already been converted to binary format (0.1 seconds)
Binary file found. Skip converting text to binary
First check if the text file has already been converted to binary format (0.0 seconds)
Binary file found. Skip converting text to binary
iter   tr_logloss   va_logloss      tr_time
   1      0.59387      0.55961          0.7
   2      0.55231      0.54819          1.4
   3      0.54301      0.54335          2.1
   4      0.53763      0.54001          2.9
   5      0.53345      0.53745          3.6
   6      0.52979      0.53541          4.3
   7      0.52643      0.53308          5.6
   8      0.52320      0.53075          7.0
   9      0.51999      0.52904          8.4
  10      0.51685      0.52689          9.7
  11      0.51368      0.52503         11.1
  12      0.51064      0.52342         12.5
  13      0.50781      0.52167         13.9
  14      0.50517      0.52049         15.2
  15      0.50270      0.51954         16.6
  16      0.50053      0.51835         18.0
  17      0.49843      0.51757         19.3
  18      0.49654      0.51691         20.7
  19      0.49480      0.51619         22.1
  20      0.49313      0.51561         23.4
  21      0.49159      0.51491         24.8
  22      0.49010      0.51447         26.2
  23      0.48861      0.51429         27.5
  24      0.48730      0.51392         28.9
  25      0.48601      0.51347         30.3
  26      0.48475      0.51321         31.6
  27      0.48354      0.51294         33.0
  28      0.48239      0.51269         34.4
  29      0.48120      0.51249         35.9
  30      0.48011      0.51230         37.6
  31      0.47904      0.51204         39.2
  32      0.47797      0.51186         40.9
  33      0.47698      0.51170         42.5
  34      0.47593      0.51160         44.2
  35      0.47498      0.51153         45.8
  36      0.47400      0.51152         47.4
  37      0.47310      0.51137         48.9
  38      0.47225      0.51115         50.4
  39      0.47131      0.51121
Auto-stop. Use model at 38th iteration.

$ ./ffm-predict test.ffm train.ffm.model 'testPred.txt'
logloss = 0.51115

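This does not work. Whether I treat the task as classification or as regression, this code gives wrong results on this data. So I still don't know how to feed the movielens-1M data into the FFM code, and that is a problem. There is no expert around to give me pointers, either. Life is hard!

To be continued.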

