An Analysis of Application Instance Directories Across the Application Lifecycle in Cloud Foundry

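In Cloud Foundry, applications run on DEAs (Droplet Execution Agents), and the directory that holds an application instance on a DEA changes as the application moves through its lifecycle.

This article analyzes how the instance directory changes in the following situations: starting an app, stopping an app, deleting an app, restarting an app, an app crashing, stopping the DEA, starting the DEA, and restarting the DEA after it has exited abnormally.

The discussion covers Cloud Foundry v1 only; v2 will be covered in a follow-up.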

start an app

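"Start an app" means that a user asks Cloud Foundry to create or start an application. Note that before the request, no DEA in Cloud Foundry holds any files for that app. When a DEA accepts the start request, it must download the droplet from wherever droplets are stored, unpack it under a path on the DEA node, and finally run the startup script of the unpacked droplet. From that point on, the DEA's file system contains a directory for the application instance.

This is implemented in the process_dea_start method in /dea/lib/dea/agent.rb: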

    tgz_file = File.join(@staged_dir, "#{sha1}.tgz")
    instance_dir = File.join(@apps_dir, "#{name}-#{instance_index}-#{instance_id}")

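These lines compute two paths on the DEA: the path of the droplet tarball and the path of the directory the instance runs from. The subsequent call success = stage_app_dir(bits_file, bits_uri, sha1, tgz_file, instance_dir, runtime) downloads the application bits into instance_dir. Once startup completes, instance_dir is the application instance's directory.

Summary: starting an app creates the application's directory on some DEA and starts the application there.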
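As a minimal sketch of the path construction above, the following helper reproduces the two File.join calls outside the DEA. The helper name instance_paths and the sample directories are hypothetical and do not come from the DEA source.

```ruby
# Hypothetical helper mirroring the two File.join calls quoted above.
# staged_dir and apps_dir stand in for @staged_dir and @apps_dir.
def instance_paths(staged_dir, apps_dir, sha1, name, instance_index, instance_id)
  {
    # One tarball per droplet, keyed by the droplet's SHA1.
    tgz_file: File.join(staged_dir, "#{sha1}.tgz"),
    # One directory per instance: name, index and id make it unique.
    instance_dir: File.join(apps_dir, "#{name}-#{instance_index}-#{instance_id}")
  }
end

# Sample values are made up for illustration.
paths = instance_paths('/tmp/dea/staged', '/tmp/dea/apps', 'abc123', 'demo', 0, 'f00d')
puts paths[:tgz_file]
puts paths[:instance_dir]
```

Because the instance id is part of the directory name, two instances of the same app on one DEA never share a directory.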

stop an app

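"Stop an app" means that a user asks Cloud Foundry to stop a running application. Before the request, the application must already be running, so its directory and bits already exist in the file system of some DEA. When the Cloud Controller receives the stop request, it first locates the DEA node running the application and sends that DEA a request to stop it. The DEA subscribes to this message and handles it in process_dea_stop: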

    NATS.subscribe('dea.stop') { |msg| process_dea_stop(msg) }

process_dea_stop stops the application, including every instance of it that matches the request:

    return unless instances = @droplets[droplet_id]
    instances.each_value do |instance|
      version_matched  = version.nil? || instance[:version] == version
      instance_matched = instance_ids.nil? || instance_ids.include?(instance[:instance_id])
      index_matched    = indices.nil? || indices.include?(instance[:instance_index])
      state_matched    = states.nil? || states.include?(instance[:state].to_s)
      if (version_matched && instance_matched && index_matched && state_matched)
        instance[:exit_reason] = :STOPPED if [:STARTING, :RUNNING].include?(instance[:state])
        if instance[:state] == :CRASHED
          instance[:state] = :DELETED
          instance[:stop_processed] = false
        end
        stop_droplet(instance)
      end
    end
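The four *_matched flags above share one pattern: a filter that is nil matches every instance. A minimal, self-contained sketch of that pattern follows. The helper names instance_matches? and select_instances and the sample data are hypothetical.

```ruby
# Hypothetical helper: each filter is optional, and a missing (nil) filter
# matches every instance, as in the *_matched flags above.
def instance_matches?(instance, filters = {})
  version      = filters[:version]
  instance_ids = filters[:instance_ids]
  indices      = filters[:indices]
  states       = filters[:states]

  (version.nil? || instance[:version] == version) &&
    (instance_ids.nil? || instance_ids.include?(instance[:instance_id])) &&
    (indices.nil? || indices.include?(instance[:instance_index])) &&
    (states.nil? || states.include?(instance[:state].to_s))
end

# Returns the instances of one droplet that a stop request would touch.
def select_instances(instances, filters = {})
  instances.values.select { |instance| instance_matches?(instance, filters) }
end

# Made-up sample data: two instances of one app.
instances = {
  'a' => { instance_id: 'a', instance_index: 0, version: 'v1', state: :RUNNING },
  'b' => { instance_id: 'b', instance_index: 1, version: 'v1', state: :CRASHED }
}
puts select_instances(instances, indices: [1]).map { |i| i[:instance_id] }.inspect
puts select_instances(instances).size
```

A request with no filters at all therefore stops every instance of the droplet.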

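The method first looks up the application to be stopped by its droplet id in the @droplets hash and then iterates over that application's instances. After adjusting the state of each matching instance, it calls stop_droplet, which is where the instance is actually stopped: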

    def stop_droplet(instance)
      return if (instance[:stop_processed])
      send_exited_message(instance)
      username = instance[:secure_user]
      # if system thinks this process is running, make sure to execute stop script
      if instance[:pid] || [:STARTING, :RUNNING].include?(instance[:state])
        instance[:state] = :STOPPED unless instance[:state] == :CRASHED
        instance[:state_timestamp] = Time.now.to_i
        stop_script = File.join(instance[:dir], 'stop')
        insecure_stop_cmd = "#{stop_script} #{instance[:pid]} 2> /dev/null"
        stop_cmd =
          if @secure
            "su -c \"#{insecure_stop_cmd}\" #{username}"
          else
            insecure_stop_cmd
          end
        unless (RUBY_PLATFORM =~ /darwin/ and @secure)
          Bundler.with_clean_env { system(stop_cmd) }
        end
      end
      ………………
      cleanup_droplet(instance)
    end

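As shown, the method fulfils the stop request mainly by running the application's stop script. stop_script = File.join(instance[:dir], 'stop') locates the script; insecure_stop_cmd = "#{stop_script} #{instance[:pid]} 2> /dev/null" builds the command line; stop_cmd then wraps that command in su -c when @secure is set; and finally Bundler.with_clean_env { system(stop_cmd) } has the operating system run stop_cmd in a clean environment, free of the DEA's own Bundler settings.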
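To make the command construction concrete, here is a small sketch that only builds the command string and does not execute anything. The helper name build_stop_cmd and the sample instance are hypothetical.

```ruby
# Hypothetical helper: builds the stop command the same way the quoted
# stop_droplet does, but only returns the string.
def build_stop_cmd(instance, secure)
  stop_script = File.join(instance[:dir], 'stop')
  # stderr of the stop script is discarded.
  insecure_stop_cmd = "#{stop_script} #{instance[:pid]} 2> /dev/null"
  return insecure_stop_cmd unless secure

  # In secure mode the script runs as the instance's dedicated user.
  "su -c \"#{insecure_stop_cmd}\" #{instance[:secure_user]}"
end

# Made-up sample instance.
instance = { dir: '/tmp/dea/apps/demo-0-f00d', pid: 4242, secure_user: 'vcap-user-1' }
puts build_stop_cmd(instance, false)
puts build_stop_cmd(instance, true)
```

Note that the stop script lives inside the instance directory, so it has to run before the directory is removed.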

What this article cares about most is the DEA's next step, cleanup_droplet, because that is the part that actually touches the application's directory in the DEA file system:

    def cleanup_droplet(instance)
      remove_instance_resources(instance)
      @usage.delete(instance[:pid]) if instance[:pid]
      if instance[:state] != :CRASHED || instance[:flapping]
        if droplet = @droplets[instance[:droplet_id].to_s]
          droplet.delete(instance[:instance_id])
          @droplets.delete(instance[:droplet_id].to_s) if droplet.empty?
          schedule_snapshot
        end
        unless @disable_dir_cleanup
          @logger.debug("#{instance[:name]}: Cleaning up dir #{instance[:dir]}#{instance[:flapping]?' (flapping)':''}")
          EM.system("rm -rf #{instance[:dir]}")
        end
      else
        @logger.debug("#{instance[:name]}: Chowning crashed dir #{instance[:dir]}")
        EM.system("chown -R #{Process.euid}:#{Process.egid} #{instance[:dir]}")
      end
    end

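The method checks the instance's state. If the state is not :CRASHED, or if instance[:flapping] is true, it deletes the instance's entry from the @droplets hash (and the droplet's own entry once it has no instances left) and then calls schedule_snapshot, whose implementation and purpose are analyzed shortly. In the same branch, the following code deletes the instance's directory: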

    unless @disable_dir_cleanup
      @logger.debug("#{instance[:name]}: Cleaning up dir #{instance[:dir]}#{instance[:flapping]?' (flapping)':''}")
      EM.system("rm -rf #{instance[:dir]}")
    end

In other words, if @disable_dir_cleanup is true, the shell command rm -rf #{instance[:dir]} is not run; if it is false, the command runs and the instance's entire directory is deleted. @disable_dir_cleanup is initialized in the agent class's initialize() method from config['disable_dir_cleanup']. That setting is empty by default, which counts as false.
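The branching in cleanup_droplet can be summarized as a small decision function. This is only a sketch of the logic described above; the function name cleanup_action and the three result symbols are hypothetical labels, not DEA identifiers.

```ruby
# Hypothetical summary of the cleanup_droplet branches described above.
# Returns a label for what happens to the instance and its directory.
def cleanup_action(instance, disable_dir_cleanup)
  if instance[:state] != :CRASHED || instance[:flapping]
    # The instance is dropped from @droplets; the directory is removed
    # unless directory cleanup has been disabled in the configuration.
    disable_dir_cleanup ? :untrack_only : :untrack_and_remove_dir
  else
    # Crashed (and not flapping): keep the directory, only chown it.
    :chown_and_keep_dir
  end
end

puts cleanup_action({ state: :STOPPED }, false)
puts cleanup_action({ state: :CRASHED }, false)
puts cleanup_action({ state: :CRASHED, flapping: true }, false)
```

The third case shows why the || matters: a crashed instance that is flapping is cleaned up immediately rather than kept around.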

Now for schedule_snapshot, mentioned above. Right after cleanup_droplet removes the instance's information from @droplets, it calls schedule_snapshot, which is implemented as follows:

    def schedule_snapshot
      return if @snapshot_scheduled
      @snapshot_scheduled = true
      EM.next_tick { snapshot_app_state }
    end

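As shown, schedule_snapshot does little more than arrange for snapshot_app_state to run on the next EventMachine tick, using the @snapshot_scheduled flag so that it is not queued more than once. Here is snapshot_app_state: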

    def snapshot_app_state
      start = Time.now
      tmp = File.new("#{@db_dir}/snap_#{Time.now.to_i}", 'w')
      tmp.puts(JSON.pretty_generate(@droplets))
      tmp.close
      FileUtils.mv(tmp.path, @app_state_file)
      @logger.debug("Took #{Time.now - start} to snapshot application state.")
      @snapshot_scheduled = false
    end

The method first records the current time and creates a temporary file with tmp = File.new("#{@db_dir}/snap_#{Time.now.to_i}", 'w'). It serializes @droplets to JSON, writes the JSON into tmp, and closes the file. It then renames tmp to @app_state_file with FileUtils.mv(tmp.path, @app_state_file), where @app_state_file = File.join(@db_dir, APP_STATE_FILE) and APP_STATE_FILE = 'applications.json'.
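The write-to-a-temporary-file-then-rename pattern can be tried in isolation. The sketch below assumes a throwaway directory in place of @db_dir, and the method name snapshot_state is hypothetical; only the file naming follows the code quoted above.

```ruby
require 'json'
require 'fileutils'
require 'tmpdir'

APP_STATE_FILE = 'applications.json'

# Hypothetical stand-alone version of the snapshot step: write the state to
# a temporary file first, then move it over the real state file.
def snapshot_state(droplets, db_dir)
  app_state_file = File.join(db_dir, APP_STATE_FILE)
  tmp_path = File.join(db_dir, "snap_#{Time.now.to_i}")
  File.open(tmp_path, 'w') { |f| f.puts(JSON.pretty_generate(droplets)) }
  # Both paths are in the same directory, so this is a plain rename.
  FileUtils.mv(tmp_path, app_state_file)
  app_state_file
end

# Made-up state, written into a throwaway directory.
Dir.mktmpdir do |db_dir|
  path = snapshot_state({ '42' => { 'abc' => { state: :RUNNING } } }, db_dir)
  puts File.read(path)
end
```

Writing to a separate file first means a reader of applications.json should not observe a half-written snapshot, which matters because the DEA reads this file back at startup.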

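To summarize, when an app is stopped the DEA does the following:

Runs the stop script for every instance of the app;
Removes the information about all of the app's instances from @droplets;
Writes the records that remain in @droplets to @app_state_file;
Deletes the directories of all of the app's instances (unless @disable_dir_cleanup is set).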

delete an app

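"Delete an app" means that a user issues a request to delete an application. The Cloud Controller receives the request, first stops all instances of the application, and then deletes the application's droplet. Handling this request therefore removes everything related to the application, including its instance directories on the DEAs.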

restart an app

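"Restart an app" means that a user issues a request to restart an application. In vmc the request is split into two requests: a stop followed by a start. The stop request halts the application on a DEA and deletes its directory; the start request makes a DEA download the application's bits again, which creates a new directory, and then start the application. Note that the DEA that handles the stop and the DEA that handles the start are not necessarily the same one. The stop goes to the DEA currently running the application, whereas the DEA for the start is chosen by the Cloud Controller.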

app crashes

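"App crashes" refers to an application crashing while it is running. The DEA has no advance knowledge of a crash, which is the big difference from stopping an app. On a real cluster, a crash can be simulated by forcibly killing the application's process.

Because the crash does not go through the DEA, the DEA does not immediately run stop_droplet or cleanup_droplet. The application's directory therefore remains in the DEA's file system and continues to occupy disk space. If crashed directories were left there indefinitely, the wasted disk space would add up. To deal with this, the DEA periodically cleans up crashed applications and deletes their directories.

Specifically, once the application has crashed, its old pid no longer exists (in theory). The DEA periodically runs monitor_apps, which collects information about all processes on the node and then calls monitor_apps_helper. For every instance of every application in @droplets, the helper compares the recorded pid with the pids of the processes actually present on the DEA node. If there is no match, the instance recorded in @droplets is no longer running and can be treated as having exited abnormally. The code is as follows: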

    def monitor_apps_helper(startup_check, ma_start, du_start, du_all_out, pid_info, user_info)
      …………
      @droplets.each_value do |instances|
        instances.each_value do |instance|
          if instance[:pid] && pid_info[instance[:pid]]
            …………
          else
            # App *should* no longer be running if we are here
            instance.delete(:pid)
            # Check to see if this is an orphan that is no longer running, clean up here if needed
            # since there will not be a cleanup proc or stop call associated with the instance..
            stop_droplet(instance) if (instance[:orphaned] && !instance[:stop_processed])
          end
        end
      end
      …………
    end
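The pid comparison in the helper above can be sketched on its own. The sketch assumes that pid_info is a hash keyed by pid, as the lookup pid_info[instance[:pid]] suggests; the function name instances_not_running and the sample data are hypothetical.

```ruby
# Hypothetical helper: returns the instances whose recorded pid is missing
# from the process table (or that have no pid at all).
# Assumption: pid_info is a hash keyed by pid.
def instances_not_running(droplets, pid_info)
  droplets.values.flat_map(&:values).reject do |instance|
    instance[:pid] && pid_info[instance[:pid]]
  end
end

# Made-up sample: pid 200 has disappeared from the node.
droplets = {
  '1' => { 'a' => { instance_id: 'a', pid: 100 } },
  '2' => { 'b' => { instance_id: 'b', pid: 200 } }
}
pid_info = { 100 => { rss: 1024 } }
puts instances_not_running(droplets, pid_info).map { |i| i[:instance_id] }.inspect
```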

When the DEA finds that an instance is in fact no longer running, it executes instance.delete(:pid) and then stop_droplet(instance) if (instance[:orphaned] && !instance[:stop_processed]). In other words, stop_droplet runs only when (instance[:orphaned] && !instance[:stop_processed]) is true. The first thing stop_droplet does is call send_exited_message:

    def stop_droplet(instance)
      # On stop from cloud controller, this can get called twice. Just make sure we are re-entrant..
      return if (instance[:stop_processed])
      # Unplug us from the system immediately, both the routers and health managers.
      send_exited_message(instance)
      ……
      cleanup_droplet(instance)
    end

send_exited_message is implemented as follows:

    def send_exited_message(instance)
      return if instance[:notified]
      unregister_instance_from_router(instance)
      unless instance[:exit_reason]
        instance[:exit_reason] = :CRASHED
        instance[:state] = :CRASHED
        instance[:state_timestamp] = Time.now.to_i
        instance.delete(:pid) unless instance_running? instance
      end
      send_exited_notification(instance)
      instance[:notified] = true
    end

The method first unregisters the instance's URL from the router. An instance that terminated abnormally has no instance[:exit_reason], so, as one would expect, both :exit_reason and :state are set to :CRASHED.
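The state transition can be isolated from the router and notification calls. The following sketch keeps only the bookkeeping on the instance hash; the name mark_exited! is hypothetical, and the router and health manager messages are deliberately left out.

```ruby
# Hypothetical reduction of send_exited_message to its state bookkeeping.
# An instance with no exit reason is treated as crashed.
def mark_exited!(instance, now = Time.now.to_i)
  return instance if instance[:notified]

  unless instance[:exit_reason]
    instance[:exit_reason] = :CRASHED
    instance[:state] = :CRASHED
    instance[:state_timestamp] = now
  end
  # Remember that the exit has been reported so it is not reported twice.
  instance[:notified] = true
  instance
end

crashed = mark_exited!({ state: :RUNNING }, 1_000)
stopped = mark_exited!({ state: :STOPPED, exit_reason: :STOPPED }, 1_000)
puts crashed[:state].inspect
puts stopped[:state].inspect
```

An instance stopped on request already carries :exit_reason (set in process_dea_stop), which is what keeps it from being marked as crashed.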

After send_exited_message returns, stop_droplet finally calls cleanup_droplet. Because the instance's :state has been set to :CRASHED, cleanup_droplet does not take the branch that deletes the directory; it runs the chown command instead:

    def cleanup_droplet(instance)
      ……
      if instance[:state] != :CRASHED || instance[:flapping]
        ……
      else
        @logger.debug("#{instance[:name]}: Chowning crashed dir #{instance[:dir]}")
        EM.system("chown -R #{Process.euid}:#{Process.egid} #{instance[:dir]}")
      end
    end

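At this point the crashed instance has only been marked :CRASHED. Its directory still exists in the DEA's file system and has not been deleted.

Never deleting the directory of a crashed instance would clearly be unreasonable, and the designers of Cloud Foundry took this into account. When the DEA runs, it registers a periodic task, crashes_reaper: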

    EM.add_periodic_timer(CRASHES_REAPER_INTERVAL) { crashes_reaper }

CRASHES_REAPER_INTERVAL is set to 3600, so crashes_reaper runs once an hour. Its implementation is:

    def crashes_reaper
      @droplets.each_value do |instances|
        # delete all crashed instances that are older than an hour
        instances.delete_if do |_, instance|
          delete_instance = instance[:state] == :CRASHED && Time.now.to_i - instance[:state_timestamp] > CRASHES_REAPER_TIMEOUT
          if delete_instance
            @logger.debug("Crashes reaper deleted: #{instance[:instance_id]}")
            EM.system("rm -rf #{instance[:dir]}") unless @disable_dir_cleanup
          end
          delete_instance
        end
      end
      @droplets.delete_if do |_, droplet|
        droplet.empty?
      end
    end

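The logic is simple: if an instance is in the :CRASHED state and has been in that state for longer than CRASHES_REAPER_TIMEOUT, its directory is deleted (unless @disable_dir_cleanup is set) and its entry is removed from @droplets.

To summarize: when an application instance crashes, it can no longer be accessed, but its directory remains in the file system of the DEA node. The DEA marks the instance as :CRASHED, and the hourly crashes_reaper task later deletes its directory.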
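The reaper's age check can be exercised against real directories in a throwaway location. In the sketch below the function name reap_crashed! is hypothetical, FileUtils.rm_rf stands in for the EM.system("rm -rf ...") call, and the value 3600 for CRASHES_REAPER_TIMEOUT is an assumption based on the "older than an hour" comment in the quoted code.

```ruby
require 'fileutils'
require 'tmpdir'

# Assumed value; the quoted source only says "older than an hour".
CRASHES_REAPER_TIMEOUT = 3600

# Hypothetical stand-alone reaper: removes crashed instances whose crash
# timestamp is older than the timeout, then drops empty droplets.
def reap_crashed!(droplets, now = Time.now.to_i, timeout = CRASHES_REAPER_TIMEOUT)
  reaped = []
  droplets.each_value do |instances|
    instances.delete_if do |_, instance|
      expired = instance[:state] == :CRASHED &&
                now - instance[:state_timestamp] > timeout
      if expired
        FileUtils.rm_rf(instance[:dir])
        reaped << instance[:instance_id]
      end
      expired
    end
  end
  droplets.delete_if { |_, instances| instances.empty? }
  reaped
end

Dir.mktmpdir do |apps_dir|
  old_dir = File.join(apps_dir, 'demo-0-old')
  FileUtils.mkdir_p(old_dir)
  droplets = {
    '1' => { 'old' => { instance_id: 'old', state: :CRASHED, state_timestamp: 1_000, dir: old_dir } }
  }
  puts reap_crashed!(droplets, 10_000).inspect
  puts Dir.exist?(old_dir)
end
```

Because the check compares against :state_timestamp, an instance that crashed just before a reaper run survives until a later run.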

stop DEA

"Stop DEA" means that a Cloud Foundry developer stops the DEA component using the script that Cloud Foundry provides for this. The DEA traps the resulting signal:

    ['TERM', 'INT', 'QUIT'].each { |s| trap(s) { shutdown() } }

On receiving one of these signals, the DEA runs shutdown:

    def shutdown()
      @shutting_down = true
      @logger.info('Shutting down..')
      @droplets.each_pair do |id, instances|
        @logger.debug("Stopping app #{id}")
        instances.each_value do |instance|
          # skip any crashed instances
          instance[:exit_reason] = :DEA_SHUTDOWN unless instance[:state] == :CRASHED
          stop_droplet(instance)
        end
      end
      # Allows messages to get out.
      EM.add_timer(0.25) do
        snapshot_app_state
        @file_viewer_server.stop!
        NATS.stop { EM.stop }
        @logger.info('Bye..')
        @pid_file.unlink()
      end
    end

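As the code shows, shutdown walks every instance of every application in @droplets, sets :exit_reason to :DEA_SHUTDOWN on each instance that is not in the :CRASHED state, and then calls stop_droplet, which in turn calls cleanup_droplet. The directories of these non-crashed instances are therefore all deleted, after which the DEA process exits. The entries for these normally running instances are also removed from applications.json, the file that records instance information.

Summary: stopping a DEA first stops all normally running application instances and then deletes their directories.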
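As a rough sketch of the outcome described above, the function below walks a @droplets-shaped hash and reports which directories a shutdown would remove and which it would leave for the crashes reaper. The function name plan_shutdown and the :removed/:kept labels are hypothetical, and the sketch ignores the flapping case and the @disable_dir_cleanup setting.

```ruby
# Hypothetical dry run of the shutdown walk: no process is stopped and no
# directory is touched; the function only classifies the instances.
def plan_shutdown(droplets)
  plan = { removed: [], kept: [] }
  droplets.each_value do |instances|
    instances.each_value do |instance|
      if instance[:state] == :CRASHED
        # Crashed instances keep their state, so cleanup only chowns them.
        plan[:kept] << instance[:dir]
      else
        instance[:exit_reason] = :DEA_SHUTDOWN
        plan[:removed] << instance[:dir]
      end
    end
  end
  plan
end

# Made-up sample state.
droplets = {
  '1' => { 'a' => { state: :RUNNING, dir: '/tmp/dea/apps/demo-0-a' } },
  '2' => { 'b' => { state: :CRASHED, dir: '/tmp/dea/apps/other-0-b' } }
}
puts plan_shutdown(droplets).inspect
```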

start DEA

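"Start DEA" means that a Cloud Foundry developer starts the DEA component using the script that Cloud Foundry provides for this. When the DEA starts, the important part is the creation and running of the agent object. In the agent's run code, the part that concerns instance directories is: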

    # Recover existing application state.
    recover_existing_droplets
    delete_untracked_instance_dirs

recover_existing_droplets runs first. Its implementation is:

    def recover_existing_droplets
      …………
      File.open(@app_state_file, 'r') { |f| recovered = Yajl::Parser.parse(f) }
      # Whip through and reconstruct droplet_ids and instance symbols correctly for droplets, state, etc..
      recovered.each_pair do |app_id, instances|
        @droplets[app_id.to_s] = instances
        instances.each_pair do |instance_id, instance|
          …………
        end
      end
      @recovered_droplets = true
      # Go ahead and do a monitoring pass here to detect app state
      monitor_apps(true)
      send_heartbeat
      schedule_snapshot
    end

This method rebuilds @droplets from the information in @app_state_file and then calls monitor_apps, send_heartbeat, and schedule_snapshot.
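A rough sketch of the recovery step follows. It uses the standard library's JSON parser in place of Yajl, and the way it turns keys and the :state value back into symbols is an assumption about what the elided part of the method does, based only on the comment in the quoted code. The name recover_droplets is hypothetical.

```ruby
require 'json'

# Hypothetical recovery of a @droplets-shaped hash from snapshot text.
# Assumption: instance keys and the state value become symbols again,
# since JSON stores them as plain strings.
def recover_droplets(json_text)
  recovered = JSON.parse(json_text)
  droplets = {}
  recovered.each_pair do |app_id, instances|
    droplets[app_id.to_s] = instances.each_with_object({}) do |(instance_id, instance), acc|
      restored = instance.each_with_object({}) { |(key, value), h| h[key.to_sym] = value }
      restored[:state] = restored[:state].to_sym if restored[:state]
      acc[instance_id] = restored
    end
  end
  droplets
end

# Made-up snapshot content.
snapshot = '{"42":{"abc":{"state":"CRASHED","dir":"/tmp/dea/apps/demo-0-abc"}}}'
puts recover_droplets(snapshot).inspect
```

Some conversion of this kind is needed because code elsewhere compares instance[:state] against symbols such as :CRASHED.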

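Next, delete_untracked_instance_dirs runs. It deletes any instance directory that has no matching entry in @droplets.

To summarize: if the DEA previously exited normally and all crashed instances had been cleaned up before it exited, applications.json contains no entries and the path that holds instance directories is empty, so the method deletes nothing. If crashed instances had not yet been removed when the DEA exited normally, they are still present at startup and wait for crashes_reaper to delete them. If the DEA crashed, and the instance directories on disk or the contents of applications.json no longer match the actual instances, the directories of the unmatched instances are deleted.

delete_untracked_instance_dirs is implemented as follows: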

    # Removes any instance dirs without a corresponding instance entry in @droplets
    # NB: This is run once at startup, so not using EM.system to perform the rm is fine.
    def delete_untracked_instance_dirs
      tracked_instance_dirs = Set.new
      for droplet_id, instances in @droplets
        for instance_id, instance in instances
          tracked_instance_dirs << instance[:dir]
        end
      end
      all_instance_dirs = Set.new(Dir.glob(File.join(@apps_dir, '*')))
      to_remove = all_instance_dirs - tracked_instance_dirs
      for dir in to_remove
        @logger.warn("Removing instance dir '#{dir}', doesn't correspond to any instance entry.")
        FileUtils.rm_rf(dir)
      end
    end
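The set difference at the heart of this method can be run against a throwaway directory. In the sketch below, droplets and apps_dir are passed in as arguments instead of being read from @droplets and @apps_dir, and the function names are hypothetical.

```ruby
require 'set'
require 'fileutils'
require 'tmpdir'

# Hypothetical stand-alone version: directories on disk minus directories
# referenced by a tracked instance.
def untracked_instance_dirs(droplets, apps_dir)
  tracked = Set.new
  droplets.each_value do |instances|
    instances.each_value { |instance| tracked << instance[:dir] }
  end
  Set.new(Dir.glob(File.join(apps_dir, '*'))) - tracked
end

def delete_untracked!(droplets, apps_dir)
  untracked_instance_dirs(droplets, apps_dir).each { |dir| FileUtils.rm_rf(dir) }
end

Dir.mktmpdir do |apps_dir|
  tracked_dir = File.join(apps_dir, 'demo-0-abc')
  stray_dir = File.join(apps_dir, 'demo-1-def')
  [tracked_dir, stray_dir].each { |d| FileUtils.mkdir_p(d) }
  droplets = { '42' => { 'abc' => { dir: tracked_dir } } }
  delete_untracked!(droplets, apps_dir)
  puts Dir.children(apps_dir).inspect
end
```

The comparison is on path strings, so it only works if instance[:dir] is stored in the same form that Dir.glob produces for the apps directory.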

DEA crashes

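"DEA crashes" means that the DEA crashes while running and terminates abnormally. This can be simulated by forcibly killing the DEA process.

The exit of the DEA process does not directly affect the running application instances, so their directories still exist and the applications remain accessible. Restarting the DEA process afterwards is exactly the same as the start DEA case. Note that if all of the previously running applications are still running at restart, recover_existing_droplets brings every instance back under monitoring through monitor_apps; send_heartbeat then restores communication with the other components, and schedule_snapshot persists the recovered state. If some of the previously running instances have crashed by the time the DEA restarts, their directories are deleted during the subsequent runs of monitor_apps.

That concludes my analysis of how application instance directories change over the lifecycle in Cloud Foundry.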

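About the author:

Sun Hongliang (孫宏亮) is a software engineer at DAOCLOUD. For the past two years he has focused on PaaS within cloud computing. He firmly believes that lightweight virtualization container technology will have a deep impact on the PaaS field and may even determine where PaaS technology goes next.

Please credit the source when reposting.

This article mostly reflects my own understanding and certainly has shortcomings and errors in places. I hope it helps anyone working with application instance directories in Cloud Foundry. If you are interested in this topic and have better ideas or suggestions, please get in touch.

Email: [email protected]
Sina Weibo: @蓮子弗如清

Sun Hongliang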
