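Why Can't Two Who Love Each Other Be Together?

1. How the heartbreak began

A bug that cost me two full days is finally fixed, and it turned out to be a very basic one. I can't help sighing: why can't two who love each other be together?

Here is what happened:

I was recently testing the animal model in the R package breedR, using its bundled example data: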

> library(breedR)
> data("globulus")
> ped <- globulus[,1:3]
> str(ped)
'data.frame':	1021 obs. of  3 variables:
 $ self: int  69 70 71 72 73 74 75 76 77 78 ...
 $ dad : int  0 0 0 0 0 0 0 0 0 4 ...
 $ mum : int  64 41 56 55 22 50 67 59 49 8 ...
> res <- remlf90(  fixed = phe_X ~ gg,genetic = list(model = 'add_animal',pedig = ped,id = 'self'),data = globulus)
Using default initial variances given by default_initial_variance()
See ?breedR.getOption.

> summary(res)
Formula: phe_X ~ 0 + gg + pedigree 
   Data: globulus 
  AIC  BIC logLik
 5799 5809  -2898

Parameters of special components:


Variance components:
         Estimated variances  S.E.
genetic                3.397 1.595
Residual              14.453 1.529

             Estimate    S.E.
Heritability   0.1887 0.08705

Here Va is 3.40 and Ve is 14.45. I then fitted the same model with asreml-r for comparison:

> library(asreml)
> head(dd)
  self dad mum gen gg bl  phe_X  x y
1   69   0  64   1 14 13 15.756  0 0
2   70   0  41   1  4 13 11.141  3 0
3   71   0  56   1 14 13 19.258  6 0
4   72   0  55   1 14 13  4.775  9 0
5   73   0  22   1  8 13 19.099 12 0
6   74   0  50   1 14 13 19.258 15 0
> dd$self = as.factor(dd$self)
> ainv = asreml.Ainverse(ped)$ginv
> mod1.as = asreml(phe_X ~ gg , random = ~ ped(self),ginverse = list(self = ainv), data=dd) 
LogLikelihood Converged 
> summary(mod1.as)$varcomp
                  gamma component std.error  z.ratio constraint
ped(self)!ped 0.2349996  3.396488  1.595445 2.128865   Positive
R!variance    1.0000000 14.453164  1.529262 9.451070   Positive

The results match.

2. Just when you think it's smooth sailing, life shows up

So I ran the same test on another dataset, taken from learnasreml, an R package I wrote:

> library(learnasreml)
> dat = animalmodel.dat
> ped = animalmodel.ped
> # asreml
> ainv = asreml.Ainverse(ped)$ginv
> mod2.as = asreml(BWT ~ SEX, random = ~ ped(ANIMAL), ginverse = list(ANIMAL = ainv), data=dat)
LogLikelihood Converged 
> summary(mod2.as)$varcomp
                    gamma component std.error   z.ratio constraint
ped(ANIMAL)!ped 0.2160062  2.494254 0.9180669  2.716855   Positive
R!variance      1.0000000 11.547140 0.9386043 12.302458   Positive

The variance components are Va = 2.49 and Ve = 11.55.

Now the same model with breedR:

> dd2 = dat
> mod2.br = remlf90(BWT ~ SEX, genetic =  list(model = "add_animal",pedigree = ped, id="ANIMAL"),data=dd2)
> summary(mod2.br)
Formula: BWT ~ 0 + SEX + pedigree 
   Data: dd2 
  AIC  BIC logLik
 5941 5951  -2968
Parameters of special components:
Variance components:
         Estimated variances   S.E.
genetic               0.6651 0.6728
Residual             13.3380 0.8602

What? Va is 0.67 and Ve is 13.34. What on earth is this? It does not match asreml, and given how well I know asreml, there was only one possibility: breedR must be wrong.
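To make "the same" and "not the same" a bit more concrete, here is a minimal sketch that compares the variance components copied from the summaries above. The helper name same_fit and the tolerance of 1e-3 are my own choices for illustration, not part of breedR or asreml.

```r
# Variance components (Va, Ve) copied from the summaries above.
globulus_breedr <- c(Va = 3.397, Ve = 14.453)
globulus_asreml <- c(Va = 3.396488, Ve = 14.453164)
bwt_breedr <- c(Va = 0.6651, Ve = 13.3380)
bwt_asreml <- c(Va = 2.494254, Ve = 11.547140)

# all.equal() compares with a relative tolerance; isTRUE() turns its
# result (TRUE or a character description) into a plain logical.
same_fit <- function(a, b, tol = 1e-3) {
  isTRUE(all.equal(a, b, tolerance = tol))
}

same_fit(globulus_breedr, globulus_asreml)  # TRUE: the two packages agree
same_fit(bwt_breedr, bwt_asreml)            # FALSE: something is off
```

A relative tolerance is used because the two packages print different numbers of digits, so an exact comparison would fail even when the fits agree.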

3. The old master is still the old master

So I went through the issues on breedR's GitHub and found one that describes the problem:

Hi there,
I recently tried to fit an animal model using the remlf90() function. My model was simple and contained 4 fixed effects, 1 random non-genetic effect and the genetic additive effect (pedigree). I compared the results (h2 + se) to those of BLUPF90 (airemlf90) and they were the same as they should be. Then, I changed the class of the ‘id’ variable in the genetic part of the model from integer to factor and I re-ran the model. The h2 was considerabley different from that I got when the ‘id’ was of class integer.

dat399$animal <- as.integer(dat399$animal) # h2 = 0.44, se = 0.012 (correct)
dat399$animal <- factor(dat399$animal) # h2 = 0.08, se = 0.006 (wrong)
dat399$animal <- as.integer(as.character(dat399$animal)) # h2 = 0.44, se = 0.012 (correct again)

So, is this normal, should the ‘id’ part of the genetic effect be always coded as integer or there is a bug that needs to be corrected?

The author replied that breedR requires individual IDs to be numeric, and that passing a factor should have raised an error...

Hi Nabeel.
Yes. The variables encoding individuals (i.e., id and progenitors) should be integers.
However, the pedigree-building function should have raisen an error whenever the user tries with a factor.
How did you specify the pedigree in the model?
Thanks for your report and help.

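Then the author asked the question developers always ask: how did you manage to produce this bug...

In practice, though, when the ID is a factor there is sometimes really no error. I thought about filing an issue myself, but decided to just sort it out on my own.

So I converted the factor to numeric and ran breedR again: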

> dd2$ANIMAL = as.numeric(dd2$ANIMAL)
> mod2.br = remlf90(BWT ~ SEX, genetic =  list(model = "add_animal",pedigree = ped, id="ANIMAL"),data=dd2)
> summary(mod2.br)
Formula: BWT ~ 0 + SEX + pedigree 
   Data: dd2 
  AIC  BIC logLik
 5941 5951  -2968

Parameters of special components:

Variance components:
         Estimated variances   S.E.
genetic               0.6651 0.6728
Residual             13.3380 0.8602

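Well... that is a little too real. The result did not change and is still wrong.

4. The Palm of Sorrow

So I sank into deep professional self-doubt: I do love it, so why can't two who love each other be together?

I looked left and right, up and down, and still could not find the problem. I went through every breedR issue and found this passage: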

Regarding the issue of the variable type of the animal id, note that in the genetic component, the pedigree is taken from ped399, where the variable animal is presumably integer or numeric. However, if you change the corresponding variable in dat399 as you have been doing, this breaks the correspondence between the animal codes in the pedigree and the dataset.

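The gist: in breedR the pedigree and the individual IDs need to be numeric, because breedR recodes the pedigree internally. If you change the coding of the IDs in the data, the matrix built from the pedigree no longer lines up with the IDs in the data, and the results can be wrong.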
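As a rough picture of what "correspondence" means here, the sketch below links toy records to toy pedigree rows by ID value. The data frames, column names, and values are all made up for illustration, and this is only my mental model, not how breedR is implemented internally.

```r
# Toy pedigree and records; column names and values are hypothetical.
ped_toy <- data.frame(id  = c(1, 2, 5, 9),
                      dad = c(0, 0, 1, 1),
                      mum = c(0, 0, 2, 2))
dat_toy <- data.frame(id = c(5, 9), y = c(10.2, 11.7))

# Each record is linked to a pedigree row through its id value.
ped_row <- match(dat_toy$id, ped_toy$id)
ped_row  # 3 4

# Basic sanity checks before fitting: ids are numeric and all known.
stopifnot(is.numeric(dat_toy$id), !anyNA(ped_row))
```

If the ID values in the data are altered while the pedigree stays the same, the lookup still returns rows as long as the new values happen to exist in the pedigree, so nothing fails loudly.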

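This perfectly correct but useless statement sparked nothing in me. I stayed sunk in self-doubt: I must not be good enough, that is why it wants to run away...

5. Your call came to me in a dream

Inspiration always wakes up in a dream. In the middle of the night a thought struck me: did the values change when I converted them to numbers?
But the column was already a factor whose labels are numbers. Would converting it to a numeric type really change anything?
I had long known about this trap in R: when converting a factor to a number you must go through character, because as.numeric() applied directly to a factor returns the internal level codes, not the labels.
Could it be
Could it really be
that I had fallen into exactly this trap???

The next day at work, I could not wait to test it: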

> tt = dat
> head(tt$ANIMAL)
[1] 1029 1299 643  1183 1238 891 
1084 Levels: 1 2 3 5 6 7 8 9 10 11 12 14 15 16 17 20 21 22 24 25 26 27 28 29 30 32 33 34 35 36 37 38 40 41 42 43 44 47 48 49 50 51 52 ... 1309
> head(as.numeric(tt$ANIMAL))
[1]  864 1076  549  989 1030  751

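As you can see, the IDs are changed beyond recognition: 1029 became 864 and 1299 became 1076. Each ID has been replaced by its position among the 1084 levels.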
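The same effect can be reproduced on a tiny factor. This is a minimal sketch using three of the IDs printed above; the variable name f is just for illustration.

```r
# A factor stores integer codes (positions in the sorted levels)
# plus the level labels themselves.
f <- factor(c(1029, 1299, 643))

levels(f)                    # "643" "1029" "1299"
as.numeric(f)                # 2 3 1  (the codes, not the IDs)
as.numeric(as.character(f))  # 1029 1299 643  (the labels, as numbers)
```

With only three levels the codes are obviously wrong. With 1084 levels the codes look like plausible animal IDs, which is why the mistake is so easy to miss.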

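6. The sudden, dizzy realization

After seeing this my heart could not calm down, and the tune of "Why can't two who love each other be together" started playing in my head. Life is hard: so this was the cause all along...

The refrain of Xianglin's Wife came to mind:
I knew all along that converting a factor to a number in R could go wrong...

Then I used character as the intermediate step and tested again: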

> tt = dat
> head(tt$ANIMAL)
[1] 1029 1299 643  1183 1238 891 
1084 Levels: 1 2 3 5 6 7 8 9 10 11 12 14 15 16 17 20 21 22 24 25 26 27 28 29 30 32 33 34 35 36 37 38 40 41 42 43 44 47 48 49 50 51 52 ... 1309
> head(as.numeric(as.character(tt$ANIMAL)))
[1] 1029 1299  643 1183 1238  891

That is correct!

Finally, I tested the breedR animal model with the properly converted IDs:
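For completeness, R's help page for factor also mentions converting the levels once and then indexing by the factor, which gives the same answer without building a long character vector. A minimal sketch on a toy factor (the variable names are mine):

```r
f <- factor(c(1029, 1299, 643, 1029))

# Route 1: go through character, as in the session above.
via_character <- as.numeric(as.character(f))

# Route 2: convert the (few) levels once, then index by the factor.
via_levels <- as.numeric(levels(f))[f]

identical(via_character, via_levels)  # TRUE
```

Either route works. I used the character route in this post because it is the easier one to remember.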

> dd2 = dat
> dd2$ANIMAL = as.numeric(as.character(dd2$ANIMAL))
> mod2.br = remlf90(BWT ~ SEX, genetic =  list(model = "add_animal",pedigree = ped, id="ANIMAL"),data=dd2)
> summary(mod2.br)
Formula: BWT ~ 0 + SEX + pedigree 
   Data: dd2 
  AIC  BIC logLik
 5931 5941  -2964

Parameters of special components:


Variance components:
         Estimated variances   S.E.
genetic                2.494 0.9181
Residual              11.547 0.9386

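At last, the correct result: Va is 2.49 and Ve is 11.55, matching asreml.

7. How many loved your moments of glad grace

What a painful lesson!

When converting a factor to a number in R, always go through character. This is not a used-car marketplace: here the middleman must take his cut!!!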
