Using IPython notebooks under version control

This article is translated from: Using IPython notebooks under version control

What is a good strategy for keeping IPython notebooks under version control?

The notebook format is quite amenable to version control: if one wants to version control the notebook and the outputs, then this works quite well. The annoyance comes when one wants to version control only the input, excluding the cell outputs (a.k.a. "build products"), which can be large binary blobs, especially for movies and plots. In particular, I am trying to find a good workflow that:

  • allows me to choose between including or excluding output,
  • prevents me from accidentally committing output if I do not want it,
  • allows me to keep output in my local version,
  • allows me to see when I have changes in the inputs using my version control system (i.e. if I only version control the inputs but my local file has outputs, then I would like to be able to see if the inputs have changed (requiring a commit). Using the version control status command will always register a difference since the local file has outputs.)
  • allows me to update my working notebook (which contains the output) from an updated clean notebook. (update)
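The fourth requirement (detecting input changes while the local file still carries outputs) can be illustrated with a small sketch. This is only one possible reading of it, not part of any tool mentioned here: the helper names `iter_cells`, `input_fingerprint` and `inputs_changed` are hypothetical, and the code assumes the notebook has already been loaded from JSON into a dict using either the older "worksheets" layout or the newer top-level "cells" layout.

```python
import hashlib
import json


def iter_cells(notebook):
    """Yield cells from either layout: top-level "cells" or "worksheets"."""
    if "cells" in notebook:
        yield from notebook["cells"]
    for worksheet in notebook.get("worksheets", []):
        yield from worksheet.get("cells", [])


def input_fingerprint(notebook):
    """Hash only cell types and sources, ignoring outputs and prompt numbers."""
    inputs = []
    for cell in iter_cells(notebook):
        # Assumption: older notebooks keep code under "input", newer under "source".
        source = cell.get("source", cell.get("input", ""))
        if isinstance(source, list):
            source = "".join(source)
        inputs.append([cell.get("cell_type"), source])
    payload = json.dumps(inputs, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def inputs_changed(working, committed):
    """True when the two notebooks differ in their inputs, whatever their outputs."""
    return input_fingerprint(working) != input_fingerprint(committed)
```

Comparing the fingerprint of the working notebook with that of the committed clean copy would then report a change only when a cell's type or source differs.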

As mentioned, if I choose to include the outputs (which is desirable when using nbviewer, for example), then everything is fine. The problem is when I do not want to version control the output. There are some tools and scripts for stripping the output of the notebook, but frequently I encounter the following issues:

  1. I accidentally commit a version with the output, thereby polluting my repository.
  2. I clear output to use version control, but would really rather keep the output in my local copy (sometimes it takes a while to reproduce, for example).
  3. Some of the scripts that strip output change the format slightly compared to the Cell/All Output/Clear menu option, thereby creating unwanted noise in the diffs. This is resolved by some of the answers.
  4. When pulling changes to a clean version of the file, I need to find some way of incorporating those changes in my working notebook without having to rerun everything. (update)

I have considered several options that I shall discuss below, but have yet to find a good comprehensive solution. A full solution might require some changes to IPython, or may rely on some simple external scripts. I currently use mercurial, but would like a solution that also works with git: an ideal solution would be version-control agnostic.

This issue has been discussed many times, but there is no definitive or clear solution from the user's perspective. The answer to this question should provide the definitive strategy. It is fine if it requires a recent (even development) version of IPython or an easily installed extension.

Update: I have been playing with my modified notebook version, which optionally saves a .clean version with every save, using Gregory Crosswhite's suggestions. This satisfies most of my constraints but leaves the following unresolved:

  1. This is not yet a standard solution (it requires a modification of the ipython source). Is there a way of achieving this behaviour with a simple extension? It needs some sort of on-save hook.
  2. A problem I have with the current workflow is pulling changes. These will come in to the .clean file, and then need to be integrated somehow into my working version. (Of course, I can always re-execute the notebook, but this can be a pain, especially if some of the results depend on long calculations, parallel computations, etc.) I do not have a good idea about how to resolve this yet. Perhaps a workflow involving an extension like ipycache might work, but that seems a little too complicated.

Notes

Removing (stripping) Output

  • When the notebook is running, one can use the Cell/All Output/Clear menu option for removing the output.
  • There are some scripts for removing output, such as the script nbstripout.py, which removes the output but does not produce the same result as using the notebook interface. This was eventually included in the ipython/nbconvert repo, but that has been closed stating that the changes are now included in ipython/ipython, yet the corresponding functionality seems not to have been included yet. (update) That being said, Gregory Crosswhite's solution shows that this is pretty easy to do, even without invoking ipython/nbconvert, so this approach is probably workable if it can be properly hooked in. (Attaching it to each version control system, however, does not seem like a good idea; this should somehow hook in to the notebook mechanism.)

Newsgroups

Issues

Pull Requests


Post #1

Reference: https://stackoom.com/question/1GblD/在版本控制下使用IPython筆記本


Post #2

Unfortunately, I do not know much about Mercurial, but I can give you a possible solution that works with Git, in the hope that you might be able to translate my Git commands into their Mercurial equivalents.

For background, in Git the add command stores the changes that have been made to a file into a staging area. Once you have done this, any subsequent changes to the file are ignored by Git unless you tell it to stage them as well. Hence the following script: for each of the given files, it strips out all of the outputs and prompt_number sections, stages the stripped file, and then restores the original.

NOTE: If running this gets you an error message like ImportError: No module named IPython.nbformat, then use ipython to run the script instead of python.

from IPython.nbformat import current
import io
from os import remove, rename
from shutil import copyfile
from subprocess import Popen
from sys import argv

for filename in argv[1:]:
    # Backup the current file
    backup_filename = filename + ".backup"
    copyfile(filename,backup_filename)

    try:
        # Read in the notebook
        with io.open(filename,'r',encoding='utf-8') as f:
            notebook = current.reads(f.read(),format="ipynb")

        # Strip out all of the output and prompt_number sections
        for worksheet in notebook["worksheets"]:
            for cell in worksheet["cells"]:
                # Only code cells carry outputs; leave other cell types untouched
                if "outputs" in cell:
                    cell["outputs"] = []
                if "prompt_number" in cell:
                    del cell["prompt_number"]

        # Write the stripped file
        with io.open(filename, 'w', encoding='utf-8') as f:
            current.write(notebook,f,format='ipynb')

        # Run git add to stage the non-output changes
        print("git add",filename)
        Popen(["git","add",filename]).wait()

    finally:
        # Restore the original file;  remove is needed in case
        # we are running in windows.
        remove(filename)
        rename(backup_filename,filename)

Once the script has been run on the files whose changes you wanted to commit, just run git commit.
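The backup, modify, stage, restore sequence in the script above can also be expressed as a reusable context manager. The following is a minimal sketch of that pattern only, not a replacement for the script: `temporarily_replaced` is a hypothetical helper name, and the git call is left out so that the example runs anywhere.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def temporarily_replaced(path, new_text):
    """Swap new_text into path, then always restore the original file."""
    backup = path + ".backup"
    shutil.copyfile(path, backup)
    try:
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(new_text)
        yield path
    finally:
        # os.replace overwrites the destination on both POSIX and Windows.
        os.replace(backup, path)


with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "demo.ipynb")
    with open(target, "w", encoding="utf-8") as handle:
        handle.write("notebook with outputs")
    with temporarily_replaced(target, "stripped notebook") as staged:
        # This is the point where the script above would run `git add`.
        with open(staged, encoding="utf-8") as handle:
            seen_inside = handle.read()
    with open(target, encoding="utf-8") as handle:
        seen_after = handle.read()
```

Because the restore happens in the `finally` clause, the working copy keeps its outputs even if staging fails partway through.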


Post #3

Here is my solution with git. It allows you to just add and commit (and diff) as usual: those operations will not alter your working tree, and at the same time (re)running a notebook will not alter your git history.

Although this can probably be adapted to other VCSs, I know it doesn't satisfy your requirements (at least the VCS agnosticism). Still, it is perfect for me, and although it's nothing particularly brilliant, and many people probably already use it, I didn't find clear instructions about how to implement it by googling around. So it may be useful to other people.

  1. Save a file with this content somewhere (for the following, let us assume ~/bin/ipynb_output_filter.py)
  2. Make it executable (chmod +x ~/bin/ipynb_output_filter.py)
  3. Create the file ~/.gitattributes, with the following content:

     *.ipynb filter=dropoutput_ipynb 
  4. Run the following commands:

     git config --global core.attributesfile ~/.gitattributes
     git config --global filter.dropoutput_ipynb.clean ~/bin/ipynb_output_filter.py
     git config --global filter.dropoutput_ipynb.smudge cat

Done!
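The filter script referred to in step 1 is not reproduced in this text. As a rough sketch of what a clean filter of this kind might look like (this is not the author's script), the example below reads notebook JSON from one text stream and writes a stripped copy to another, which is how git invokes a clean filter through standard input and output. The function names are hypothetical, and the code assumes the newer notebook layout with a top-level "cells" list, "outputs" and "execution_count".

```python
import io
import json


def clean_notebook_text(text):
    """Return notebook JSON text with outputs and execution counters removed."""
    notebook = json.loads(text)
    # Assumption: newer notebook layout with a top-level "cells" list.
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    # Fixed indentation and key order keep repeated runs byte-identical.
    return json.dumps(notebook, indent=1, sort_keys=True, ensure_ascii=False) + "\n"


def run_filter(source, destination):
    """Copy a notebook between two text streams, cleaning it on the way."""
    destination.write(clean_notebook_text(source.read()))


# As a git clean filter the script would end with
#     run_filter(sys.stdin, sys.stdout)
# after `import sys`; in-memory streams stand in for both here.
raw = json.dumps({
    "cells": [{"cell_type": "code", "source": "1 + 1", "metadata": {},
               "outputs": [{"data": {"text/plain": "2"}}], "execution_count": 7}],
    "metadata": {}, "nbformat": 4, "nbformat_minor": 5,
})
out = io.StringIO()
run_filter(io.StringIO(raw), out)
cleaned = out.getvalue()
```

Serializing with a fixed indent and sorted keys is a design choice aimed at the diff-noise problem raised in the question: running the filter on already clean text returns the same text.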

Limitations:

  • it works only with git
  • in git, if you are in branch somebranch and you do git checkout otherbranch; git checkout somebranch, you usually expect the working tree to be unchanged. Here instead you will have lost the output and cell numbering of notebooks whose source differs between the two branches.
  • more in general, the output is not versioned at all, as with Gregory's solution. In order to not just throw it away every time you do anything involving a checkout, the approach could be changed by storing it in separate files (but notice that at the time the above code is run, the commit id is not known!), and possibly versioning them (but notice this would require something more than a git commit notebook_file.ipynb, although it would at least keep git diff notebook_file.ipynb free from base64 garbage).
  • that said, incidentally, if you do pull code (i.e. committed by someone else not using this approach) which contains some output, the output is checked out normally. Only the locally produced output is lost.

My solution reflects the fact that I personally don't like to keep generated stuff versioned; notice that doing merges involving the output is almost guaranteed to invalidate the output or your productivity or both.

EDIT:

  • if you do adopt the solution as I suggested it (that is, globally), you will have trouble if there is some git repo in which you do want to version the output. So if you want to disable the output filtering for a specific git repository, simply create inside it a file .git/info/attributes, with

    **.ipynb filter=

as content. Clearly, in the same way it is possible to do the opposite: enable the filtering only for a specific repository.

  • the code is now maintained in its own git repo

  • if the instructions above result in ImportErrors, try adding "ipython" before the path of the script:

     git config --global filter.dropoutput_ipynb.clean ipython ~/bin/ipynb_output_filter.py 

EDIT: May 2016 (updated February 2017): there are several alternatives to my script; for completeness, here is a list of those I know: nbstripout (other variants), nbstrip, jq.


Post #4

I use a very pragmatic approach, which works well for several notebooks, at several sites. And it even enables me to 'transfer' notebooks around. It works both for Windows and for Unix/MacOS.
Although it is simple, it solves the problems above...

Concept

Basically, do not track the .ipynb files, only the corresponding .py files.
By starting the notebook server with the --script option, that file is automatically created/saved when the notebook is saved.

Those .py files do contain all input; non-code is saved into comments, as are the cell borders. Those files can be read/imported (and dragged) into the notebook server to (re)create a notebook. Only the output is gone, until it is re-run.
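To make the idea of "input plus commented cell borders" concrete, here is a small sketch that turns a notebook dict into such a script. It is an illustration only: the exact text that IPython writes with --script is not shown in this answer, so the marker comments used below (`# <codecell>`, `# <markdowncell>`, the `# <nbformat>` header) and the function name `notebook_to_script` should be read as assumptions modeled on that style, and the notebook is assumed to use the older "worksheets" layout.

```python
def as_text(source):
    """Notebook sources may be a string or a list of lines; return a string."""
    return "".join(source) if isinstance(source, list) else source


def notebook_to_script(notebook):
    """Render inputs as code and everything else as comments, dropping outputs."""
    lines = ["# -*- coding: utf-8 -*-", "# <nbformat>3.0</nbformat>", ""]
    for worksheet in notebook.get("worksheets", []):
        for cell in worksheet.get("cells", []):
            kind = cell.get("cell_type", "raw")
            # The cell border becomes a marker comment (assumed marker syntax).
            lines.extend(["# <%scell>" % kind, ""])
            if kind == "code":
                lines.append(as_text(cell.get("input", "")))
            else:
                text = as_text(cell.get("source", ""))
                lines.extend("# " + line for line in text.splitlines())
            lines.append("")
    return "\n".join(lines)


example = {"worksheets": [{"cells": [
    {"cell_type": "markdown", "source": "Title\nmore text"},
    {"cell_type": "code", "input": "x = 1", "prompt_number": 2,
     "outputs": [{"output_type": "stream", "text": "1"}]},
]}]}
script_text = notebook_to_script(example)
```

The resulting text contains only the markers, the commented markdown and the code, which is why such files stay small and diff cleanly.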

Personally I use mercurial to version-track the .py files, and use the normal (command-line) commands to add, check in (etc.) for that. Most other (D)VCSs will allow this too.

It's simple to track the history now; the .py files are small, textual and simple to diff. Once in a while, we need a clone (just branch; start a 2nd notebook server there), or an older version (check it out and import it into a notebook server), etc.

Tips & tricks

  • Add *.ipynb to '.hgignore', so Mercurial knows it can ignore those files
  • Create a (bash) script to start the server (with the --script option) and do version-track it
  • Saving a notebook does save the .py file, but does not check it in.
    • This is a drawback: one can forget that
    • It's also a feature: it is possible to save a notebook (and continue later) without cluttering the repository history.

Wishes

  • It would be nice to have buttons for check-in/add/etc. in the notebook Dashboard
  • A checkout to (for example) file@date+rev.py would be helpful. It would be too much work to add that, though maybe I will do so at some point. Until now, I just do that by hand.

Post #5

We have a collaborative project where the product is Jupyter Notebooks, and we've used an approach for the last six months that is working great: we activate saving the .py files automatically and track both the .ipynb files and the .py files.

That way if someone wants to view/download the latest notebook they can do that via github or nbviewer, and if someone wants to see how the notebook code has changed, they can just look at the changes to the .py files.

For Jupyter notebook servers, this can be accomplished by adding the lines

import os
from subprocess import check_call

def post_save(model, os_path, contents_manager):
    """post-save hook for converting notebooks to .py scripts"""
    if model['type'] != 'notebook':
        return # only do this for notebooks
    d, fname = os.path.split(os_path)
    check_call(['jupyter', 'nbconvert', '--to', 'script', fname], cwd=d)

c.FileContentsManager.post_save_hook = post_save

to the jupyter_notebook_config.py file and restarting the notebook server.

If you aren't sure in which directory to find your jupyter_notebook_config.py file, you can type jupyter --config-dir, and if you don't find the file there, you can create it by typing jupyter notebook --generate-config.

For IPython 3 notebook servers, this can be accomplished by adding the lines

import os
from subprocess import check_call

def post_save(model, os_path, contents_manager):
    """post-save hook for converting notebooks to .py scripts"""
    if model['type'] != 'notebook':
        return # only do this for notebooks
    d, fname = os.path.split(os_path)
    check_call(['ipython', 'nbconvert', '--to', 'script', fname], cwd=d)

c.FileContentsManager.post_save_hook = post_save

to the ipython_notebook_config.py file and restarting the notebook server. These lines are from a github issues answer @minrk provided, and @dror includes them in his SO answer as well.

For IPython 2 notebook servers, this can be accomplished by starting the server using:

ipython notebook --script

or by adding the line

c.FileNotebookManager.save_script = True

to the ipython_notebook_config.py file and restarting the notebook server.

If you aren't sure in which directory to find your ipython_notebook_config.py file, you can type ipython locate profile default, and if you don't find the file there, you can create it by typing ipython profile create.

Here's our project on github that is using this approach, and here's a github example of exploring recent changes to a notebook.

We've been very happy with this.


Post #6

I did what Albert & Rich did: don't version .ipynb files (as these can contain images, which gets messy). Instead, either always run ipython notebook --script or put c.FileNotebookManager.save_script = True in your config file, so that a (versionable) .py file is always created when you save your notebook.

To regenerate notebooks (after checking out a repo or switching a branch) I put the script py_file_to_notebooks.py in the directory where I store my notebooks.

Now, after checking out a repo, just run python py_file_to_notebooks.py to generate the ipynb files. After switching branch, you may have to run python py_file_to_notebooks.py -ov to overwrite the existing ipynb files.
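The script py_file_to_notebooks.py itself is not reproduced in this text. Purely as an illustration of the regeneration step, and not as the author's implementation, the sketch below rebuilds a notebook dict from a .py file whose cell borders are marked with comments. The marker syntax (`# <codecell>`, `# <markdowncell>`), the older "worksheets" layout and the names `MARKER`, `make_cell` and `script_to_notebook` are all assumptions.

```python
import re

# Assumed marker syntax: a comment line such as "# <codecell>" starts a new cell.
MARKER = re.compile(r"^# <(\w+)cell>\s*$")


def make_cell(kind, body_lines):
    """Build one cell dict from the lines collected after a marker."""
    text = "\n".join(body_lines).strip("\n")
    if kind == "code":
        return {"cell_type": "code", "input": text, "outputs": [],
                "language": "python", "metadata": {}}
    # Non-code cells were stored as comments, so drop the leading "# ".
    uncommented = [re.sub(r"^# ?", "", line) for line in text.splitlines()]
    return {"cell_type": kind, "source": "\n".join(uncommented), "metadata": {}}


def script_to_notebook(script):
    """Split a marked-up script into cells; outputs start out empty."""
    cells, kind, body = [], None, []
    for line in script.splitlines():
        match = MARKER.match(line)
        if match:
            if kind is not None:
                cells.append(make_cell(kind, body))
            kind, body = match.group(1), []
        elif kind is not None:
            body.append(line)
    if kind is not None:
        cells.append(make_cell(kind, body))
    return {"nbformat": 3, "nbformat_minor": 0, "metadata": {},
            "worksheets": [{"cells": cells, "metadata": {}}]}
```

Writing the returned dict with `json.dump` to a .ipynb file would complete the round trip; every code cell then has to be re-run to get its output back, which is the cost mentioned in the edit below.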

Just to be on the safe side, it's good to also add *.ipynb to your .gitignore file.

Edit: I no longer do this because (A) you have to regenerate your notebooks from py files every time you checkout a branch and (B) there's other stuff like markdown in notebooks that you lose. I instead strip output from notebooks using a git filter. Discussion on how to do this is here.
