How Ceph reads multipart data: a summary


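By computing the various offsets held in the obj_iterator of RGWObjManifest, RGW obtains the location information of the next multipart piece. The goal is to use the object field of that location to read the required object from Ceph with the right length.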

Below is the structure of one location:

    location = {

      orig_obj = "pydev.tar.gz.2~8fanG4JO3SshIxkLVlGfVZZG0IdxGLV.1",

      loc = "",

      object = "_multipart_pydev.tar.gz.2~8fanG4JO3SshIxkLVlGfVZZG0IdxGLV.1",    <-- this field is used to assemble the multipart object name

      instance = "",

      bucket = {

        tenant = "",

        name = "zhou",

        data_pool = "default.rgw.buckets.data",

        data_extra_pool = "default.rgw.buckets.non-ec",

        index_pool = "default.rgw.buckets.index",

        marker = "64fc737b-5e37-4154-8c29-da273b587feb.244109.1",

        bucket_id = "64fc737b-5e37-4154-8c29-da273b587feb.244109.1",

        oid = ""

      },

      ns = "multipart",

      in_extra_data = false,

      index_hash_source = ""

    }

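Two of the fields above (highlighted in red in the original post; judging by the result they are the bucket marker, here identical to bucket_id, and object) combine into the multipart's object ID: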

64fc737b-5e37-4154-8c29-da273b587feb.244109.1_multipart_pydev.tar.gz.2~8fanG4JO3SshIxkLVlGfVZZG0IdxGLV.1

This value has the same format as the names listed by the command ./rados -p default.rgw.buckets.data ls.
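As a rough illustration of that composition, here is a minimal sketch. `make_rados_oid` is a hypothetical helper, not an RGW function, and it assumes the ID is simply the bucket marker followed by the object field, which is what the example value above suggests; the real RGW naming code may add separators or escaping that this sketch ignores.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of RGW): joins the bucket marker and the
// location's object field the way the example ID above appears to be built.
std::string make_rados_oid(const std::string& bucket_marker,
                           const std::string& object) {
  assert(!bucket_marker.empty());  // a bucket always has a marker
  return bucket_marker + object;
}
```

Calling it with the marker and object values from the location dump reproduces the ID shown above.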

 

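The size of a multipart is taken from the upload request. If a part is larger than 4M, radosgw splits it using Ceph's own 4M unit.

For example, if the upload uses 5M parts, radosgw first cuts off a 4M multipart object and stores the remaining 1M as a shadow object.

The first 4M of each part becomes the multipart object, with index 0; every further piece of that part is a shadow object, and the index keeps counting up from there.

Each part is read in order (the multipart object plus any number of shadow objects): the multipart object first, then the shadow objects if the part has any.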
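The splitting rule can be sketched as follows. This is only an illustration under the assumptions stated in the text (a 4M stripe unit, index 0 for the multipart object, 1, 2, ... for shadows); `Piece` and `split_part` are hypothetical names, not RGW types.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

struct Piece {
  std::string kind;  // "multipart" for the first piece, "shadow" afterwards
  int index;         // 0 for the multipart object, 1, 2, ... for shadows
  uint64_t length;   // bytes stored in this piece
};

// Hypothetical helper: splits one uploaded part of part_size bytes into
// pieces of at most stripe_max bytes.
std::vector<Piece> split_part(uint64_t part_size, uint64_t stripe_max) {
  assert(stripe_max > 0);
  std::vector<Piece> pieces;
  int index = 0;
  for (uint64_t ofs = 0; ofs < part_size; ofs += stripe_max, ++index) {
    // The last piece may be shorter than a full stripe.
    uint64_t len = std::min(stripe_max, part_size - ofs);
    pieces.push_back({index == 0 ? "multipart" : "shadow", index, len});
  }
  return pieces;
}
```

For a 5M part and a 4M stripe unit this yields a 4M multipart piece followed by a single 1M shadow piece, matching the example above.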

 

So how does RGW decide whether a multipart has shadow objects?

Look at the obj_iterator member of the RGWObjManifest class and the fields it contains:

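  class obj_iterator {
    // the manifest this iterator belongs to
    RGWObjManifest *manifest;
    // offset at which the current part starts; accumulates the size of each
    // part and is used to work out the piece that is shorter than 4M
    uint64_t part_ofs; /* where current part starts */
    // offset at which the current stripe starts; tracks the total amount
    // read so far and grows by however much has been read
    uint64_t stripe_ofs; /* where current stripe starts */
    // current offset within the object; a copy of stripe_ofs;
    // reading stops when ofs == the object size
    uint64_t ofs;       /* current position within the object */
    // used to compute the read length:
    // stripe_size = MIN(rule->part_size - (stripe_ofs - part_ofs), rule->stripe_max_size)
    uint64_t stripe_size;      /* current part size */
    // current part ID; only concerns the multipart index, starts at 1
    int cur_part_id;
    // index of the current stripe inside the part, starts at 0;
    // reset to 0 when moving on to the next part
    int cur_stripe;
    // prefix used to build the object field of location
    string cur_override_prefix;
    // the current object; it carries the object name, so the object can be
    // read directly; obtained through get_location()
    rgw_obj location;
    // the rule currently in effect (initially the begin of the manifest's rules)
    map<uint64_t, RGWObjManifestRule>::iterator rule_iter;
    // the rule after the current one (the end of the manifest's rules once
    // the current rule is the last)
    map<uint64_t, RGWObjManifestRule>::iterator next_rule_iter;
    // iterator over explicit objects; not involved in the flow described here
    map<uint64_t, RGWObjManifestPart>::iterator explicit_iter;
......
}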

 

Right after RGWObjManifest::obj_iterator iter = astate->manifest.obj_find(ofs), the fields above already hold the following values:

(gdb) p stripe_size

$12 = 4194304

(gdb) p stripe_ofs

$13 = 0

(gdb) p part_ofs

$14 = 0

(gdb) p ofs

$15 = 0

(gdb) p cur_part_id

$16 = 1

(gdb) p stripe_size

$17 = 4194304


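By default stripe_ofs advances by 4M (rule->stripe_max_size) on every step. When the advanced stripe_ofs is >= part_ofs + rule->part_size, the current part is exhausted, including its trailing piece of 4M or less (the __shadow__ piece), so the iterator resets cur_stripe = 0, sets part_ofs += rule->part_size (the part unit) and stripe_ofs = part_ofs, and moves on to the next part. As long as that condition is not met, the iterator stays inside the same part with cur_stripe incremented, which means the next object to read is a __shadow__ piece.

In both cases, once get_location() is called the matching location is returned, and the name of the object to read next (multipart or __shadow__) is carried in that location.

stripe_size = MIN(rule->part_size - (stripe_ofs - part_ofs), rule->stripe_max_size) then gives the length to read.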
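A small worked example of that formula may help. The function below is a sketch: `stripe_size_for` is a hypothetical name, its parameters are plain integers standing in for the rule and iterator fields named in the text, and the 5M/4M numbers are the ones used earlier in this post.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Sketch of the stripe_size formula quoted above. part_size and
// stripe_max_size stand in for rule->part_size and rule->stripe_max_size.
uint64_t stripe_size_for(uint64_t part_size, uint64_t stripe_max_size,
                         uint64_t part_ofs, uint64_t stripe_ofs) {
  assert(stripe_ofs >= part_ofs);              // a stripe lies inside its part
  assert(stripe_ofs - part_ofs < part_size);   // and before the part's end
  // Bytes left in the part, capped at one full stripe.
  return std::min(part_size - (stripe_ofs - part_ofs), stripe_max_size);
}
```

With 5M parts and a 4M stripe limit, the first stripe of a part (stripe_ofs == part_ofs) is 4M long, and the second one (stripe_ofs == part_ofs + 4M) is the remaining 1M.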


When the last multipart is reached, stripe_ofs >= next_rule_iter->second.start_ofs holds, so rule_iter is moved to next_rule_iter (the last rule) and cur_part_id is set to rule_iter->second.start_part_num:

      bool last_rule = (next_rule_iter == manifest->rules.end());

      /* move to the next rule? */

      if (!last_rule && stripe_ofs >= next_rule_iter->second.start_ofs) {

        rule_iter = next_rule_iter;

        last_rule = (next_rule_iter == manifest->rules.end());

        if (!last_rule) {

          ++next_rule_iter;

        }

        cur_part_id = rule_iter->second.start_part_num;

      } else {

        cur_part_id++;

      }
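To see how these fields evolve together, the stepping logic can be replayed outside of RGW. The sketch below is a simplified model, not the RGW implementation: `Rule`, `Step` and `walk` are hypothetical stand-ins, the head object and explicit objects are ignored, and the rule map is assumed to be keyed by start offset as in the manifest.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>
#include <vector>

struct Rule {  // simplified stand-in for RGWObjManifestRule
  int start_part_num;
  uint64_t start_ofs;
  uint64_t part_size;
  uint64_t stripe_max_size;
};

struct Step { int part_id; int stripe; uint64_t ofs; uint64_t size; };

// Walks an object of obj_size bytes and records every stripe visited.
std::vector<Step> walk(const std::map<uint64_t, Rule>& rules, uint64_t obj_size) {
  assert(!rules.empty());
  std::vector<Step> steps;
  auto rule_iter = rules.begin();
  auto next_rule_iter = std::next(rule_iter);
  uint64_t part_ofs = 0, stripe_ofs = 0;
  int cur_part_id = rule_iter->second.start_part_num;
  int cur_stripe = 0;
  while (stripe_ofs < obj_size) {
    const Rule& rule = rule_iter->second;
    uint64_t stripe_size = std::min(rule.part_size - (stripe_ofs - part_ofs),
                                    rule.stripe_max_size);
    steps.push_back({cur_part_id, cur_stripe, stripe_ofs, stripe_size});
    // Advance to the next stripe.
    stripe_ofs += rule.stripe_max_size;
    cur_stripe++;
    if (stripe_ofs >= part_ofs + rule.part_size) {
      // The current part is exhausted: start the next part at stripe 0.
      cur_stripe = 0;
      part_ofs += rule.part_size;
      stripe_ofs = part_ofs;
      bool last_rule = (next_rule_iter == rules.end());
      if (!last_rule && stripe_ofs >= next_rule_iter->second.start_ofs) {
        rule_iter = next_rule_iter;  // switch to the next rule
        ++next_rule_iter;
        cur_part_id = rule_iter->second.start_part_num;
      } else {
        cur_part_id++;
      }
    }
  }
  return steps;
}
```

For two 5M parts followed by one 3M part (two rules, the second starting at offset 10M with start_part_num 3), the walk visits part 1 stripes 0 and 1, part 2 stripes 0 and 1, then part 3 stripe 0.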

 

 

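In this example cur_part_id runs from 1 to 9. cur_stripe is the shadow index under each cur_part_id: the first shadow has index 1, and the value is reset to 0 whenever cur_part_id changes.

update_location() in the RGWObjManifest class ends up executing the following code: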

 

    if (cur_stripe == 0) {
      // cur_stripe == 0: the object field of the manifest's location
      // ends with .cur_part_id
      snprintf(buf, sizeof(buf), ".%d", (int)cur_part_id);
      oid += buf;
      ns = RGW_OBJ_NS_MULTIPART;
    } else {
      // cur_stripe != 0: the object field of the manifest's location
      // ends with .cur_part_id_cur_stripe
      snprintf(buf, sizeof(buf), ".%d_%d", (int)cur_part_id, (int)cur_stripe);
      oid += buf;
      ns = shadow_ns;
    }

This matches the output of rados -p default.rgw.buckets.data ls mentioned above.
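The suffix rule from the snippet can be tried in isolation. The helper below is a sketch modelled on that snippet; `part_suffix` is a hypothetical name, and it only produces the trailing part of the name, not the prefix or the namespace.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: returns the suffix appended to the object name for a
// given part ID and stripe index, following the two snprintf formats above.
std::string part_suffix(int cur_part_id, int cur_stripe) {
  assert(cur_part_id >= 1 && cur_stripe >= 0);
  char buf[32];
  if (cur_stripe == 0) {
    // First stripe of a part: the multipart object, ".<part>"
    std::snprintf(buf, sizeof(buf), ".%d", cur_part_id);
  } else {
    // Later stripes: shadow objects, ".<part>_<stripe>"
    std::snprintf(buf, sizeof(buf), ".%d_%d", cur_part_id, cur_stripe);
  }
  return std::string(buf);
}
```

Part 1 therefore ends in ".1" for its multipart object and ".1_1" for its first shadow object.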

 

The read logic lives in the following function, analysed in detail below:

int RGWRados::iterate_obj(RGWObjectCtx& obj_ctx, rgw_obj& obj,
                          off_t ofs, off_t end,
                          uint64_t max_chunk_size,
                          int (*iterate_obj_cb)(rgw_obj&, off_t, off_t, off_t, bool, RGWObjState *, void *),
                          void *arg)
{
  ......

  if (astate->has_manifest) {
    /* now get the relevant object stripe */
    RGWObjManifest::obj_iterator iter = astate->manifest.obj_find(ofs);

    RGWObjManifest::obj_iterator obj_end = astate->manifest.obj_end();

    for (; iter != obj_end && ofs <= end; ++iter) {
      // offset of the current stripe, initially 0
      off_t stripe_ofs = iter.get_stripe_ofs();
      // offset of the next stripe; used as the exit condition of the while loop
      off_t next_stripe_ofs = stripe_ofs + iter.get_stripe_size();

      while (ofs < next_stripe_ofs && ofs <= end) {
        // returns the location, which contains the name of the object to read
        read_obj = iter.get_location();
        // length to read
        uint64_t read_len = min(len, iter.get_stripe_size() - (ofs - stripe_ofs));
        // offset inside that object at which to start reading
        read_ofs = iter.location_ofs() + (ofs - stripe_ofs);

        if (read_len > max_chunk_size) {
          read_len = max_chunk_size;
        }

        reading_from_head = (read_obj == obj);
        // start reading the data
        r = iterate_obj_cb(read_obj, ofs, read_ofs, read_len, reading_from_head, astate, arg);
        if (r < 0) {
          return r;
        }

        len -= read_len;
        // accumulate the length that has been read
        ofs += read_len;
      }
    }
  }

  ......
}
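The arithmetic of that loop can be checked with a small stand-alone model. This is a sketch under several assumptions that are not in the excerpt above: `Stripe`, `Read` and `plan_reads` are hypothetical names, the stripes are contiguous and start at offset 0, location_ofs() is taken to be 0 for every piece, and len is initialised to end - ofs + 1 (the part of the function that sets len is elided in the listing).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Stripe { uint64_t stripe_ofs; uint64_t stripe_size; };  // iterator state stand-in
struct Read { std::size_t stripe; uint64_t read_ofs; uint64_t read_len; };

// Plans the reads for the inclusive byte range [ofs, end] of the object.
std::vector<Read> plan_reads(const std::vector<Stripe>& stripes, uint64_t ofs,
                             uint64_t end, uint64_t max_chunk_size) {
  assert(max_chunk_size > 0 && ofs <= end);
  std::vector<Read> reads;
  uint64_t len = end - ofs + 1;  // assumption: remaining bytes to read
  for (std::size_t i = 0; i < stripes.size() && ofs <= end; ++i) {
    uint64_t stripe_ofs = stripes[i].stripe_ofs;
    uint64_t next_stripe_ofs = stripe_ofs + stripes[i].stripe_size;
    while (ofs < next_stripe_ofs && ofs <= end) {
      // Never read past the end of the stripe or of the requested range.
      uint64_t read_len = std::min(len, stripes[i].stripe_size - (ofs - stripe_ofs));
      uint64_t read_ofs = ofs - stripe_ofs;  // location_ofs() assumed to be 0
      read_len = std::min(read_len, max_chunk_size);
      reads.push_back({i, read_ofs, read_len});
      len -= read_len;
      ofs += read_len;
    }
  }
  return reads;
}
```

For a 5M part stored as a 4M stripe plus a 1M stripe, reading bytes 3M to 5M-1 with a 4M chunk limit produces two reads: 1M at offset 3M of the first piece and 1M at offset 0 of the second.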

 
