Notes on OpenStack Nova Scheduling Strategies

Overview

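When a new virtual machine instance is created, the Nova Scheduler uses the configured Filter Scheduler to filter and then weigh all compute nodes. It returns a list of usable hosts, ordered by weight and sized to the number of instances the user requested. If this process fails, no host is available for the request.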
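
The filter-then-weigh flow described above can be sketched in a few lines. This is a minimal illustration, not Nova's implementation: the host records, the `schedule` function, and the lambda filters are all hypothetical names chosen for the example.

```python
def schedule(hosts, filters, weigher, num_instances=1):
    """Filter hosts, then rank the survivors by weight (highest first)."""
    candidates = [h for h in hosts if all(f(h) for f in filters)]
    if not candidates:
        # Mirrors the "no host available" outcome described in the text.
        raise RuntimeError("No valid host was found")
    ranked = sorted(candidates, key=weigher, reverse=True)
    return ranked[:num_instances]


# Hypothetical host records; real host state carries many more fields.
hosts = [
    {"name": "node-a", "free_ram_mb": 2048, "enabled": True},
    {"name": "node-b", "free_ram_mb": 8192, "enabled": True},
    {"name": "node-c", "free_ram_mb": 16384, "enabled": False},
]
filters = [
    lambda h: h["enabled"],               # stand-in for a service-state filter
    lambda h: h["free_ram_mb"] >= 1024,   # stand-in for a RAM filter
]
selected = schedule(hosts, filters, weigher=lambda h: h["free_ram_mb"])
print([h["name"] for h in selected])
```

Here node-c has the most free memory but is dropped by the first filter, so the ranking only ever sees node-a and node-b.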

###Standard Filters

filtering-workflow-1.png

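  • AllHostsFilter - performs no filtering; every visible host passes.

  • ImagePropertiesFilter - filters on image metadata.

  • AvailabilityZoneFilter - filters on availability zone (the Availability Zone metadata).

  • ComputeCapabilitiesFilter - filters on compute capabilities by matching the parameters given in the instance creation request against each host's attributes and state. The available operators are: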

    * = (equal to or greater than as a number; same as vcpus case)
    * == (equal to as a number)
    * != (not equal to as a number)
    * >= (greater than or equal to as a number)
    * <= (less than or equal to as a number)
    * s== (equal to as a string)
    * s!= (not equal to as a string)
    * s>= (greater than or equal to as a string)
    * s> (greater than as a string)
    * s<= (less than or equal to as a string)
    * s< (less than as a string)
    * <in> (substring)
    * <all-in> (all elements contained in collection)
    * <or> (find one of these)
    
    Examples are: ">= 5", "s== 2.1.0", "<in> gcc", "<all-in> aes mmx", and "<or> fpu <or> gpu"
    

    Some of the available attributes:

    * free_ram_mb (compared with a number, values like ">= 4096")
    * free_disk_mb (compared with a number, values like ">= 10240")
    * host (compared with a string, values like: "<in> compute","s== compute_01")
    * hypervisor_type (compared with a string, values like: "s== QEMU", "s== powervm")
    * hypervisor_version (compared with a number, values like : ">= 1005003", "== 2000000")
    * num_instances (compared with a number, values like: "<= 10")
    * num_io_ops (compared with a number, values like: "<= 5")
    * vcpus_total (compared with a number, values like: "= 48", ">=24")
    * vcpus_used (compared with a number, values like: "= 0", "<= 10")
    
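  • AggregateInstanceExtraSpecsFilter - filters on extra host attributes (Host Aggregate metadata); similar to ComputeCapabilitiesFilter.

  • ComputeFilter - filters on host state and service availability.

  • CoreFilter, AggregateCoreFilter - filter on the number of CPU cores still available.

  • IsolatedHostsFilter - filters on the isolated_images, isolated_hosts, and restrict_isolated_hosts_to_isolated_images options in nova.conf; used for node isolation.

  • JsonFilter - filters using a JSON expression.

  • RamFilter, AggregateRamFilter - filter on memory.

  • DiskFilter, AggregateDiskFilter - filter on disk space.

  • NumInstancesFilter, AggregateNumInstancesFilter - filter on the number of instances on the node.

  • IoOpsFilter, AggregateIoOpsFilter - filter on I/O load.

  • PciPassthroughFilter - filters on the requested PCI devices.

  • SimpleCIDRAffinityFilter - creates virtual machines on the same IP subnet.

  • SameHostFilter - starts the instance on the same host as another given instance.

  • RetryFilter - filters out hosts that have already been tried.

  • AggregateTypeAffinityFilter - restricts which instance types (flavors) can be created in an aggregate.

  • ServerGroupAntiAffinityFilter - places the instances of a server group on different hosts.

  • ServerGroupAffinityFilter - places the instances of a server group on the same host.

  • AggregateMultiTenancyIsolation - isolates tenants to specified aggregates.

  • AggregateImagePropertiesIsolation - isolates hosts based on image properties and aggregate properties.

  • MetricsFilter - filters hosts on weight_setting; only hosts that report the required metrics pass.

  • NUMATopologyFilter - filters hosts on the instance's NUMA requirements.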

###Weight Calculation

filtering-workflow-2

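If more than one host remains after filtering, a weight is computed for each one and the host with the highest weight is selected. The formula is: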

weight = w1_multiplier * norm(w1) + w2_multiplier * norm(w2) + ...

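Each term is a weight multiplier (wN_multiplier) times a weighed value (norm(wN)). The multipliers are read from the configuration file, while the weighed values are produced dynamically by weigher objects (Weight Objects). The main weighers currently available are RAMWeigher, DiskWeigher, MetricsWeigher, IoOpsWeigher, PCIWeigher, ServerGroupSoftAffinityWeigher, and ServerGroupSoftAntiAffinityWeigher.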
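
To make the formula concrete, here is a minimal sketch that evaluates it for three hosts and two weighers. It assumes norm() is a min-max scaling of each weigher's raw values into [0, 1]; the function names, the sample numbers, and the multiplier values are illustrative and are not taken from Nova's source.

```python
def normalize(values):
    """Min-max scale raw weigher outputs into [0, 1]; equal inputs all map to 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def total_weights(raw_by_weigher, multipliers):
    """Compute sum(multiplier * norm(raw)) per host across all weighers."""
    n_hosts = len(next(iter(raw_by_weigher.values())))
    totals = [0.0] * n_hosts
    for name, raw in raw_by_weigher.items():
        for i, norm in enumerate(normalize(raw)):
            totals[i] += multipliers[name] * norm
    return totals


# Hypothetical raw values, one entry per host.
raw = {
    "ram": [2048, 8192, 4096],   # free RAM in MB
    "disk": [100, 60, 20],       # free disk in GB
}
multipliers = {"ram": 1.0, "disk": 1.0}
weights = total_weights(raw, multipliers)
best = weights.index(max(weights))
print(weights, best)
```

Normalizing before multiplying is what lets quantities in different units (MB, GB, counts) be summed; the multipliers then express only the relative importance of each weigher.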

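###Common Strategies

Different requirements call for different scheduling strategies, which are built by combining scheduler plugins. Some common strategies are:

  • Packing: place virtual machines on the host that already holds the most virtual machines.

  • Striping: place virtual machines on the host that holds the fewest virtual machines.

  • CPU load balance: place virtual machines on the host with the most available cores.

  • Memory load balance: place virtual machines on the host with the most available memory.

  • Affinity: multiple virtual machines must be placed on the same host.

  • AntiAffinity: multiple virtual machines must be placed on different hosts.

  • CPU Utilization load balance: place virtual machines on the host with the lowest CPU utilization.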
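
Packing and striping differ only in the direction of the preference, which a weigher can express through the sign of its multiplier. The sketch below illustrates that idea with a hypothetical `pick_host` helper and made-up instance counts; it is not a Nova API.

```python
def pick_host(instances_per_host, multiplier):
    """Return the host with the highest multiplier * instance_count.

    Hosts are visited in name order, so ties resolve deterministically.
    """
    return max(sorted(instances_per_host),
               key=lambda host: multiplier * instances_per_host[host])


# Hypothetical number of running instances on each host.
counts = {"node-a": 5, "node-b": 1, "node-c": 3}
packing_choice = pick_host(counts, multiplier=1.0)    # favors the fullest host
striping_choice = pick_host(counts, multiplier=-1.0)  # favors the emptiest host
print(packing_choice, striping_choice)
```

A packing policy still needs filters such as RamFilter in front of it, otherwise the fullest host would keep winning after it has run out of capacity.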

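##Metadata

###Scheduling Test

The metadata-based filters all work in much the same way, and flavor metadata has few predefined values and can be used fairly freely, so this test uses the flavor metadata filter.

####Filter Configuration

Add the AggregateInstanceExtraSpecsFilter filter: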

$ vi /etc/kolla/nova-scheduler/nova.conf
[DEFAULT]
...
scheduler_default_filters = AggregateInstanceExtraSpecsFilter, RetryFilter, RamFilter, DiskFilter, ComputeFilter, ComputeCapabilitiesFilter, ImagePropertiesFilter, ServerGroupAntiAffinityFilter, ServerGroupAffinityFilter
...

$ docker restart nova_scheduler

####Aggregate Configuration

  • Create the io-fast aggregate:
$ nova aggregate-create io-fast
+----+---------+-------------------+-------+----------+--------------------------------------+
| Id | Name    | Availability Zone | Hosts | Metadata | UUID                                 |
+----+---------+-------------------+-------+----------+--------------------------------------+
| 8  | io-fast | -                 |       |          | 2523c96a-46ee-4fac-ba8a-5b50a4d1ebbd |
+----+---------+-------------------+-------+----------+--------------------------------------+
$ nova aggregate-set-metadata io-fast io=fast
Metadata has been successfully updated for aggregate 8.
+----+---------+-------------------+-------+-----------+--------------------------------------+
| Id | Name    | Availability Zone | Hosts | Metadata  | UUID                                 |
+----+---------+-------------------+-------+-----------+--------------------------------------+
| 8  | io-fast | -                 |       | 'io=fast' | 2523c96a-46ee-4fac-ba8a-5b50a4d1ebbd |
+----+---------+-------------------+-------+-----------+--------------------------------------+

$ nova aggregate-add-host io-fast osdev-01
Host osdev-01 has been successfully added for aggregate 8 
+----+---------+-------------------+------------+-----------+--------------------------------------+
| Id | Name    | Availability Zone | Hosts      | Metadata  | UUID                                 |
+----+---------+-------------------+------------+-----------+--------------------------------------+
| 8  | io-fast | -                 | 'osdev-01' | 'io=fast' | 2523c96a-46ee-4fac-ba8a-5b50a4d1ebbd |
+----+---------+-------------------+------------+-----------+--------------------------------------+

$ nova aggregate-add-host io-fast osdev-02
Host osdev-02 has been successfully added for aggregate 8 
+----+---------+-------------------+------------------------+-----------+--------------------------------------+
| Id | Name    | Availability Zone | Hosts                  | Metadata  | UUID                                 |
+----+---------+-------------------+------------------------+-----------+--------------------------------------+
| 8  | io-fast | -                 | 'osdev-01', 'osdev-02' | 'io=fast' | 2523c96a-46ee-4fac-ba8a-5b50a4d1ebbd |
+----+---------+-------------------+------------------------+-----------+--------------------------------------+

$ nova aggregate-add-host io-fast osdev-03
Host osdev-03 has been successfully added for aggregate 8 
+----+---------+-------------------+------------------------------------+-----------+--------------------------------------+
| Id | Name    | Availability Zone | Hosts                              | Metadata  | UUID                                 |
+----+---------+-------------------+------------------------------------+-----------+--------------------------------------+
| 8  | io-fast | -                 | 'osdev-01', 'osdev-02', 'osdev-03' | 'io=fast' | 2523c96a-46ee-4fac-ba8a-5b50a4d1ebbd |
+----+---------+-------------------+------------------------------------+-----------+--------------------------------------+
  • Create the io-slow aggregate:
$ nova aggregate-create io-slow
+----+---------+-------------------+-------+----------+--------------------------------------+
| Id | Name    | Availability Zone | Hosts | Metadata | UUID                                 |
+----+---------+-------------------+-------+----------+--------------------------------------+
| 9  | io-slow | -                 |       |          | d10d2eaf-43d7-464e-bc12-10f18897b476 |
+----+---------+-------------------+-------+----------+--------------------------------------+

$ nova aggregate-set-metadata io-slow io=slow
Metadata has been successfully updated for aggregate 9.
+----+---------+-------------------+-------+-----------+--------------------------------------+
| Id | Name    | Availability Zone | Hosts | Metadata  | UUID                                 |
+----+---------+-------------------+-------+-----------+--------------------------------------+
| 9  | io-slow | -                 |       | 'io=slow' | d10d2eaf-43d7-464e-bc12-10f18897b476 |
+----+---------+-------------------+-------+-----------+--------------------------------------+

$ nova aggregate-add-host io-slow osdev-gpu
Host osdev-gpu has been successfully added for aggregate 9 
+----+---------+-------------------+-------------+-----------+--------------------------------------+
| Id | Name    | Availability Zone | Hosts       | Metadata  | UUID                                 |
+----+---------+-------------------+-------------+-----------+--------------------------------------+
| 9  | io-slow | -                 | 'osdev-gpu' | 'io=slow' | d10d2eaf-43d7-464e-bc12-10f18897b476 |
+----+---------+-------------------+-------------+-----------+--------------------------------------+

$ nova aggregate-add-host io-slow osdev-ceph
Host osdev-ceph has been successfully added for aggregate 9 
+----+---------+-------------------+---------------------------+-----------+--------------------------------------+
| Id | Name    | Availability Zone | Hosts                     | Metadata  | UUID                                 |
+----+---------+-------------------+---------------------------+-----------+--------------------------------------+
| 9  | io-slow | -                 | 'osdev-gpu', 'osdev-ceph' | 'io=slow' | d10d2eaf-43d7-464e-bc12-10f18897b476 |
+----+---------+-------------------+---------------------------+-----------+--------------------------------------+

####Flavor Configuration

  • Create the io-fast flavor:
$ openstack flavor create --vcpus 1 --ram 64 --disk 1 machine.fast
$ nova flavor-key machine.fast set io=fast

$ openstack flavor show machine.fast
+----------------------------+--------------------------------------+
| Field                      | Value                                |
+----------------------------+--------------------------------------+
| OS-FLV-DISABLED:disabled   | False                                |
| OS-FLV-EXT-DATA:ephemeral  | 0                                    |
| access_project_ids         | None                                 |
| disk                       | 1                                    |
| id                         | 4c8a6d15-270d-464b-bd3b-303d167af4cb |
| name                       | machine.fast                         |
| os-flavor-access:is_public | True                                 |
| properties                 | io='fast'                            |
| ram                        | 64                                   |
| rxtx_factor                | 1.0                                  |
| swap                       |                                      |
| vcpus                      | 1                                    |
+----------------------------+--------------------------------------+
  • Create the io-slow flavor:
$ openstack flavor create --vcpus 1 --ram 64 --disk 1 machine.slow
$ nova flavor-key machine.slow set io=slow

$ openstack flavor show machine.slow
+----------------------------+--------------------------------------+
| Field                      | Value                                |
+----------------------------+--------------------------------------+
| OS-FLV-DISABLED:disabled   | False                                |
| OS-FLV-EXT-DATA:ephemeral  | 0                                    |
| access_project_ids         | None                                 |
| disk                       | 1                                    |
| id                         | f6a0fdad-3f20-40ed-a4fc-0ba49ff4ff02 |
| name                       | machine.slow                         |
| os-flavor-access:is_public | True                                 |
| properties                 | io='slow'                            |
| ram                        | 64                                   |
| rxtx_factor                | 1.0                                  |
| swap                       |                                      |
| vcpus                      | 1                                    |
+----------------------------+--------------------------------------+

####Creating the Virtual Machines

  • Create the io-fast virtual machines:
$ openstack server create --image cirros --flavor machine.fast --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 server.fast1

$ openstack server create --image cirros --flavor machine.fast --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 server.fast2

$ openstack server create --image cirros --flavor machine.fast --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 server.fast3
  • Create the io-slow virtual machines:
$ openstack server create --image cirros --flavor machine.slow --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 server.slow1

$ openstack server create --image cirros --flavor machine.slow --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 server.slow2

$ openstack server create --image cirros --flavor machine.slow --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 server.slow3

  • Check which node each virtual machine was scheduled to:
$ openstack server list --long --column "Name" --column "Status" --column "Networks" --column "Availability Zone" --column "Host"
+--------------------+--------+--------------------+-------------------+------------+
| Name               | Status | Networks           | Availability Zone | Host       |
+--------------------+--------+--------------------+-------------------+------------+
| server.slow3       | ACTIVE | demo-net=10.0.0.20 | az02              | osdev-gpu  |
| server.slow2       | ACTIVE | demo-net=10.0.0.17 | az02              | osdev-ceph |
| server.slow1       | ACTIVE | demo-net=10.0.0.14 | az02              | osdev-gpu  |
| server.fast3       | ACTIVE | demo-net=10.0.0.13 | az01              | osdev-01   |
| server.fast2       | ACTIVE | demo-net=10.0.0.16 | az01              | osdev-02   |
| server.fast1       | ACTIVE | demo-net=10.0.0.15 | az02              | osdev-03   |
+--------------------+--------+--------------------+-------------------+------------+
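
The placement seen above can be reproduced offline with a small simulation. The sketch below hard-codes the aggregates and flavors from this test and applies a simplified matching rule: plain equality, plus the `aggregate_instance_extra_specs:` scope prefix. The helper names are hypothetical and the operator syntax is deliberately left out.

```python
# Aggregates and flavors exactly as configured in the test above.
AGGREGATES = {
    "io-fast": {"hosts": {"osdev-01", "osdev-02", "osdev-03"},
                "metadata": {"io": "fast"}},
    "io-slow": {"hosts": {"osdev-gpu", "osdev-ceph"},
                "metadata": {"io": "slow"}},
}
FLAVORS = {"machine.fast": {"io": "fast"}, "machine.slow": {"io": "slow"}}
SCOPE = "aggregate_instance_extra_specs"


def host_metadata(host):
    """Collect metadata values from every aggregate that contains the host."""
    merged = {}
    for agg in AGGREGATES.values():
        if host in agg["hosts"]:
            for key, value in agg["metadata"].items():
                merged.setdefault(key, set()).add(value)
    return merged


def host_passes(host, extra_specs):
    """True when every unscoped or SCOPE-prefixed spec matches the host."""
    metadata = host_metadata(host)
    for key, wanted in extra_specs.items():
        scope, sep, rest = key.partition(":")
        if sep:
            if scope != SCOPE:
                continue  # spec is addressed to some other consumer
            key = rest
        if wanted not in metadata.get(key, set()):
            return False
    return True


all_hosts = sorted(h for agg in AGGREGATES.values() for h in agg["hosts"])
placement = {flavor: [h for h in all_hosts if host_passes(h, specs)]
             for flavor, specs in FLAVORS.items()}
print(placement)
```

The candidate sets agree with the server list: every machine.fast instance landed on osdev-01 through osdev-03, and every machine.slow instance on osdev-gpu or osdev-ceph.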

###Related Source Code

####Request Parameters

Command-line parameters

  • View the command help:
$ openstack server create
usage: openstack server create [-h] [-f {json,shell,table,value,yaml}]
                               [-c COLUMN] [--max-width <integer>]
                               [--fit-width] [--print-empty] [--noindent]
                               [--prefix PREFIX]
                               (--image <image> | --volume <volume>) --flavor
                               <flavor>
                               [--security-group <security-group-name>]
                               [--key-name <key-name>]
                               [--property <key=value>]
                               [--file <dest-filename=source-filename>]
                               [--user-data <user-data>]
                               [--availability-zone <zone-name>]
                               [--block-device-mapping <dev-name=mapping>]
                               [--nic <net-id=net-uuid,v4-fixed-ip=ip-addr,v6-fixed-ip=ip-addr,port-id=port-uuid>]
                               [--hint <key=value>]
                               [--config-drive <config-drive-volume>|True]
                               [--min <count>] [--max <count>] [--wait]
                               <server-name>
openstack server create: error: too few arguments

The input parameters that directly affect scheduling are [--availability-zone <zone-name>] and [--hint <key=value>].

  • Request parameters generated by the Nova API (nova/objects/request_spec.py):
...
@base.NovaObjectRegistry.register
class RequestSpec(base.NovaObject):
    # Version 1.0: Initial version
    # Version 1.1: ImageMeta version 1.6
    # Version 1.2: SchedulerRetries version 1.1
    # Version 1.3: InstanceGroup version 1.10
    # Version 1.4: ImageMeta version 1.7
    # Version 1.5: Added get_by_instance_uuid(), create(), save()
    # Version 1.6: Added requested_destination
    # Version 1.7: Added destroy()
    # Version 1.8: Added security_groups
    VERSION = '1.8'

    fields = {
        'id': fields.IntegerField(),
        'image': fields.ObjectField('ImageMeta', nullable=True),
        'numa_topology': fields.ObjectField('InstanceNUMATopology',
                                            nullable=True),
        'pci_requests': fields.ObjectField('InstancePCIRequests',
                                           nullable=True),
        'project_id': fields.StringField(nullable=True),
        'availability_zone': fields.StringField(nullable=True),
        'flavor': fields.ObjectField('Flavor', nullable=False),
        'num_instances': fields.IntegerField(default=1),
        'ignore_hosts': fields.ListOfStringsField(nullable=True),
        'force_hosts': fields.ListOfStringsField(nullable=True),
        'force_nodes': fields.ListOfStringsField(nullable=True),
        'requested_destination': fields.ObjectField('Destination',
                                                    nullable=True,
                                                    default=None),
        'retry': fields.ObjectField('SchedulerRetries', nullable=True),
        'limits': fields.ObjectField('SchedulerLimits', nullable=True),
        'instance_group': fields.ObjectField('InstanceGroup', nullable=True),
        # NOTE(sbauza): Since hints are depending on running filters, we prefer
        # to leave the API correctly validating the hints per the filters and
        # just provide to the RequestSpec object a free-form dictionary
        'scheduler_hints': fields.DictOfListOfStringsField(nullable=True),
        'instance_uuid': fields.UUIDField(),
        'security_groups': fields.ObjectField('SecurityGroupList'),
    }

...

####Host State

The main information held in the host state (nova/scheduler/host_manager.py):

class HostState(object):
    """Mutable and immutable information tracked for a host.
    This is an attempt to remove the ad-hoc data structures
    previously used and lock down access.
    """

    def __init__(self, host, node):
        self.host = host
        self.nodename = node
        self._lock_name = (host, node)

        # Mutable available resources.
        # These will change as resources are virtually "consumed".
        self.total_usable_ram_mb = 0
        self.total_usable_disk_gb = 0
        self.disk_mb_used = 0
        self.free_ram_mb = 0
        self.free_disk_mb = 0
        self.vcpus_total = 0
        self.vcpus_used = 0
        self.pci_stats = None
        self.numa_topology = None

        # Additional host information from the compute node stats:
        self.num_instances = 0
        self.num_io_ops = 0

        # Other information
        self.host_ip = None
        self.hypervisor_type = None
        self.hypervisor_version = None
        self.hypervisor_hostname = None
        self.cpu_info = None
        self.supported_instances = None

        # Resource oversubscription values for the compute host:
        self.limits = {}

        # Generic metrics from compute nodes
        self.metrics = None

        # List of aggregates the host belongs to
        self.aggregates = []

        # Instances on this host
        self.instances = {}

        # Allocation ratios for this host
        self.ram_allocation_ratio = None
        self.cpu_allocation_ratio = None
        self.disk_allocation_ratio = None

        self.updated = None
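
Filters read these fields to decide whether a request fits. As an illustration, the sketch below shows how a RAM check could combine total_usable_ram_mb, free_ram_mb, and ram_allocation_ratio. It is written under the assumption that the oversubscribed limit is total * ratio and that the request must fit into limit minus used; the `HostRam` class and `ram_passes` function are hypothetical and are not quoted from Nova.

```python
from dataclasses import dataclass


@dataclass
class HostRam:
    """Hypothetical subset of HostState used for the RAM check."""
    total_usable_ram_mb: int
    free_ram_mb: int
    ram_allocation_ratio: float


def ram_passes(host, requested_mb):
    """Check whether requested_mb fits under the oversubscribed RAM limit."""
    limit_mb = host.total_usable_ram_mb * host.ram_allocation_ratio
    used_mb = host.total_usable_ram_mb - host.free_ram_mb
    return limit_mb - used_mb >= requested_mb


host = HostRam(total_usable_ram_mb=8192, free_ram_mb=1024,
               ram_allocation_ratio=1.5)
# limit = 12288 MB, used = 7168 MB, so 5120 MB remain schedulable.
fits_4g = ram_passes(host, 4096)
fits_6g = ram_passes(host, 6144)
print(fits_4g, fits_6g)
```

With a ratio above 1.0 the host accepts a 4096 MB request even though only 1024 MB is physically free, which is the point of oversubscription.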

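####Flavor Metadata and Filters

Flavor metadata lives mainly in the extra_specs field; most keys have no fixed definition and can be used freely.

  • The AggregateInstanceExtraSpecsFilter filter evaluates metadata keys that carry the aggregate_instance_extra_specs scope as well as keys with no scope (nova/scheduler/filters/aggregate_instance_extra_specs.py):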
from oslo_log import log as logging


from nova.scheduler import filters
from nova.scheduler.filters import extra_specs_ops
from nova.scheduler.filters import utils


LOG = logging.getLogger(__name__)

_SCOPE = 'aggregate_instance_extra_specs'


class AggregateInstanceExtraSpecsFilter(filters.BaseHostFilter):
    """AggregateInstanceExtraSpecsFilter works with InstanceType records."""

    # Aggregate data and instance type does not change within a request
    run_filter_once_per_request = True

    RUN_ON_REBUILD = False

    def host_passes(self, host_state, spec_obj):
        """Return a list of hosts that can create instance_type

        Check that the extra specs associated with the instance type match
        the metadata provided by aggregates.  If not present return False.
        """
        instance_type = spec_obj.flavor
        # If 'extra_specs' is not present or extra_specs are empty then we
        # need not proceed further
        if (not instance_type.obj_attr_is_set('extra_specs')
                or not instance_type.extra_specs):
            return True

        metadata = utils.aggregate_metadata_get_by_host(host_state)

        for key, req in instance_type.extra_specs.items():
            # Either not scope format, or aggregate_instance_extra_specs scope
            scope = key.split(':', 1)
            if len(scope) > 1:
                if scope[0] != _SCOPE:
                    continue
                else:
                    del scope[0]
            key = scope[0]
            aggregate_vals = metadata.get(key, None)
            if not aggregate_vals:
                LOG.debug("%(host_state)s fails instance_type extra_specs "
                    "requirements. Extra_spec %(key)s is not in aggregate.",
                    {'host_state': host_state, 'key': key})
                return False
            for aggregate_val in aggregate_vals:
                if extra_specs_ops.match(aggregate_val, req):
                    break
            else:
                LOG.debug("%(host_state)s fails instance_type extra_specs "
                            "requirements. '%(aggregate_vals)s' do not "
                            "match '%(req)s'",
                          {'host_state': host_state, 'req': req,
                           'aggregate_vals': aggregate_vals})
                return False
        return True
  • The operators used for metadata matching (nova/scheduler/filters/extra_specs_ops.py):
import operator

# 1. The following operations are supported:
#   =, s==, s!=, s>=, s>, s<=, s<, <in>, <all-in>, <or>, ==, !=, >=, <=
# 2. Note that <or> is handled in a different way below.
# 3. If the first word in the extra_specs is not one of the operators,
#   it is ignored.
op_methods = {'=': lambda x, y: float(x) >= float(y),
               '<in>': lambda x, y: y in x,
               '<all-in>': lambda x, y: all(val in x for val in y),
               '==': lambda x, y: float(x) == float(y),
               '!=': lambda x, y: float(x) != float(y),
               '>=': lambda x, y: float(x) >= float(y),
               '<=': lambda x, y: float(x) <= float(y),
               's==': operator.eq,
               's!=': operator.ne,
               's<': operator.lt,
               's<=': operator.le,
               's>': operator.gt,
               's>=': operator.ge}


def match(value, req):
    words = req.split()

    op = method = None
    if words:
        op = words.pop(0)
        method = op_methods.get(op)

    if op != '<or>' and not method:
        return value == req

    if value is None:
        return False

    if op == '<or>':  # Ex: <or> v1 <or> v2 <or> v3
        while True:
            if words.pop(0) == value:
                return True
            if not words:
                break
            words.pop(0)  # remove a keyword <or>
            if not words:
                break
        return False

    if words:
        if op == '<all-in>':  # requires a list not a string
            return method(value, words)
        return method(value, words[0])
    return False
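
A few worked checks make the operator semantics easier to see. The sketch below uses a trimmed copy of the table above (four operators only, without `<or>` and `<all-in>`) and a simplified `simple_match` helper, so it runs on its own; it is an illustration rather than a replacement for match().

```python
import operator

# Trimmed copy of the operator table shown above (subset only).
OPS = {
    "=": lambda x, y: float(x) >= float(y),
    ">=": lambda x, y: float(x) >= float(y),
    "s==": operator.eq,
    "<in>": lambda x, y: y in x,
}


def simple_match(value, req):
    """Evaluate 'OP operand' against value; fall back to string equality."""
    words = req.split()
    method = OPS.get(words[0]) if words else None
    if method is None:
        return value == req      # first word is not an operator
    if len(words) < 2:
        return False             # operator given without an operand
    return method(value, words[1])


checks = [
    ("4096", ">= 2048"),              # numeric comparison
    ("48", "= 24"),                   # '=' means greater than or equal
    ("compute_01", "<in> compute"),   # substring test
    ("QEMU", "s== qemu"),             # string equality is case-sensitive
    ("fast", "fast"),                 # no operator: plain equality
]
results = [simple_match(value, req) for value, req in checks]
print(results)
```

The second check is the one that tends to surprise: `=` is not an equality test, so a host value of 48 satisfies "= 24". The last check is the path taken by the io=fast and io=slow flavors in the scheduling test.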

####Image Metadata and Filters

  • Basic image attributes (nova/objects/image_meta.py):
@base.NovaObjectRegistry.register
class ImageMeta(base.NovaObject):

    fields = {
        'id': fields.UUIDField(),
        'name': fields.StringField(),
        'status': fields.StringField(),
        'visibility': fields.StringField(),
        'protected': fields.FlexibleBooleanField(),
        'checksum': fields.StringField(),
        'owner': fields.StringField(),
        'size': fields.IntegerField(),
        'virtual_size': fields.IntegerField(),
        'container_format': fields.StringField(),
        'disk_format': fields.StringField(),
        'created_at': fields.DateTimeField(nullable=True),
        'updated_at': fields.DateTimeField(nullable=True),
        'tags': fields.ListOfStringsField(),
        'direct_url': fields.StringField(),
        'min_ram': fields.IntegerField(),
        'min_disk': fields.IntegerField(),
        'properties': fields.ObjectField('ImageMetaProps'),
    }
  • Values available in image metadata (nova/objects/image_meta.py):
@base.NovaObjectRegistry.register
class ImageMetaProps(base.NovaObject):
    # Version 1.0: Initial version
    # Version 1.1: added os_require_quiesce field
    # Version 1.2: added img_hv_type and img_hv_requested_version fields
    # Version 1.3: HVSpec version 1.1
    # Version 1.4: added hw_vif_multiqueue_enabled field
    # Version 1.5: added os_admin_user field
    # Version 1.6: Added 'lxc' and 'uml' enum types to DiskBusField
    # Version 1.7: added img_config_drive field
    # Version 1.8: Added 'lxd' to hypervisor types
    # Version 1.9: added hw_cpu_thread_policy field
    # Version 1.10: added hw_cpu_realtime_mask field
    # Version 1.11: Added hw_firmware_type field
    # Version 1.12: Added properties for image signature verification
    # Version 1.13: added os_secure_boot field
    # Version 1.14: Added 'hw_pointer_model' field
    # Version 1.15: Added hw_rescue_bus and hw_rescue_device.
    # Version 1.16: WatchdogActionField supports 'disabled' enum.
    VERSION = '1.16'

    def obj_make_compatible(self, primitive, target_version):
        super(ImageMetaProps, self).obj_make_compatible(primitive,
                                                        target_version)
        target_version = versionutils.convert_version_to_tuple(target_version)
        if target_version < (1, 16) and 'hw_watchdog_action' in primitive:
            # Check to see if hw_watchdog_action was set to 'disabled' and if
            # so, remove it since not specifying it is the same behavior.
            if primitive['hw_watchdog_action'] == \
                    fields.WatchdogAction.DISABLED:
                primitive.pop('hw_watchdog_action')
        if target_version < (1, 15):
            primitive.pop('hw_rescue_bus', None)
            primitive.pop('hw_rescue_device', None)
        if target_version < (1, 14):
            primitive.pop('hw_pointer_model', None)
        if target_version < (1, 13):
            primitive.pop('os_secure_boot', None)
        if target_version < (1, 11):
            primitive.pop('hw_firmware_type', None)
        if target_version < (1, 10):
            primitive.pop('hw_cpu_realtime_mask', None)
        if target_version < (1, 9):
            primitive.pop('hw_cpu_thread_policy', None)
        if target_version < (1, 7):
            primitive.pop('img_config_drive', None)
        if target_version < (1, 5):
            primitive.pop('os_admin_user', None)
        if target_version < (1, 4):
            primitive.pop('hw_vif_multiqueue_enabled', None)
        if target_version < (1, 2):
            primitive.pop('img_hv_type', None)
            primitive.pop('img_hv_requested_version', None)
        if target_version < (1, 1):
            primitive.pop('os_require_quiesce', None)

        if target_version < (1, 6):
            bus = primitive.get('hw_disk_bus', None)
            if bus in ('lxc', 'uml'):
                raise exception.ObjectActionError(
                    action='obj_make_compatible',
                    reason='hw_disk_bus=%s not supported in version %s' % (
                        bus, target_version))

    # Maximum number of NUMA nodes permitted for the guest topology
    NUMA_NODES_MAX = 128

    # 'hw_' - settings affecting the guest virtual machine hardware
    # 'img_' - settings affecting the use of images by the compute node
    # 'os_' - settings affecting the guest operating system setup

    fields = {
        # name of guest hardware architecture eg i686, x86_64, ppc64
        'hw_architecture': fields.ArchitectureField(),

        # used to decide to expand root disk partition and fs to full size of
        # root disk
        'hw_auto_disk_config': fields.StringField(),

        # whether to display BIOS boot device menu
        'hw_boot_menu': fields.FlexibleBooleanField(),

        # name of the CDROM bus to use eg virtio, scsi, ide
        'hw_cdrom_bus': fields.DiskBusField(),

        # preferred number of CPU cores per socket
        'hw_cpu_cores': fields.IntegerField(),

        # preferred number of CPU sockets
        'hw_cpu_sockets': fields.IntegerField(),

        # maximum number of CPU cores per socket
        'hw_cpu_max_cores': fields.IntegerField(),

        # maximum number of CPU sockets
        'hw_cpu_max_sockets': fields.IntegerField(),

        # maximum number of CPU threads per core
        'hw_cpu_max_threads': fields.IntegerField(),

        # CPU allocation policy
        'hw_cpu_policy': fields.CPUAllocationPolicyField(),

        # CPU thread allocation policy
        'hw_cpu_thread_policy': fields.CPUThreadAllocationPolicyField(),

        # CPU mask indicates which vCPUs will have realtime enable,
        # example ^0-1 means that all vCPUs except 0 and 1 will have a
        # realtime policy.
        'hw_cpu_realtime_mask': fields.StringField(),

        # preferred number of CPU threads per core
        'hw_cpu_threads': fields.IntegerField(),

        # guest ABI version for guest xentools either 1 or 2 (or 3 - depends on
        # Citrix PV tools version installed in image)
        'hw_device_id': fields.IntegerField(),

        # name of the hard disk bus to use eg virtio, scsi, ide
        'hw_disk_bus': fields.DiskBusField(),

        # allocation mode eg 'preallocated'
        'hw_disk_type': fields.StringField(),

        # name of the floppy disk bus to use eg fd, scsi, ide
        'hw_floppy_bus': fields.DiskBusField(),

        # This indicates the guest needs UEFI firmware
        'hw_firmware_type': fields.FirmwareTypeField(),

        # boolean - used to trigger code to inject networking when booting a CD
        # image with a network boot image
        'hw_ipxe_boot': fields.FlexibleBooleanField(),

        # There are sooooooooooo many possible machine types in
        # QEMU - several new ones with each new release - that it
        # is not practical to enumerate them all. So we use a free
        # form string
        'hw_machine_type': fields.StringField(),

        # One of the magic strings 'small', 'any', 'large'
        # or an explicit page size in KB (eg 4, 2048, ...)
        'hw_mem_page_size': fields.StringField(),

        # Number of guest NUMA nodes
        'hw_numa_nodes': fields.IntegerField(),

        # Each list entry corresponds to a guest NUMA node and the
        # set members indicate CPUs for that node
        'hw_numa_cpus': fields.ListOfSetsOfIntegersField(),

        # Each list entry corresponds to a guest NUMA node and the
        # list value indicates the memory size of that node.
        'hw_numa_mem': fields.ListOfIntegersField(),

        # Generic property to specify the pointer model type.
        'hw_pointer_model': fields.PointerModelField(),

        # boolean 'yes' or 'no' to enable QEMU guest agent
        'hw_qemu_guest_agent': fields.FlexibleBooleanField(),

        # name of the rescue bus to use with the associated rescue device.
        'hw_rescue_bus': fields.DiskBusField(),

        # name of rescue device to use.
        'hw_rescue_device': fields.BlockDeviceTypeField(),

        # name of the RNG device type eg virtio
        'hw_rng_model': fields.RNGModelField(),

        # number of serial ports to create
        'hw_serial_port_count': fields.IntegerField(),

        # name of the SCSI bus controller eg 'virtio-scsi', 'lsilogic', etc
        'hw_scsi_model': fields.SCSIModelField(),

        # name of the video adapter model to use, eg cirrus, vga, xen, qxl
        'hw_video_model': fields.VideoModelField(),

        # MB of video RAM to provide eg 64
        'hw_video_ram': fields.IntegerField(),

        # name of a NIC device model eg virtio, e1000, rtl8139
        'hw_vif_model': fields.VIFModelField(),

        # "xen" vs "hvm"
        'hw_vm_mode': fields.VMModeField(),

        # action to take when watchdog device fires eg reset, poweroff, pause,
        # none
        'hw_watchdog_action': fields.WatchdogActionField(),

        # boolean - If true, this will enable the virtio-multiqueue feature
        'hw_vif_multiqueue_enabled': fields.FlexibleBooleanField(),

        # if true download using bittorrent
        'img_bittorrent': fields.FlexibleBooleanField(),

        # Which data format the 'img_block_device_mapping' field is
        # using to represent the block device mapping
        'img_bdm_v2': fields.FlexibleBooleanField(),

        # Block device mapping - the data can be in one of two completely
        # different formats. The 'img_bdm_v2' field determines whether
        # it is in legacy format, or the new current format. Ideally
        # we would have a formal data type for this field instead of a
        # dict, but with 2 different formats to represent this is hard.
        # See nova/block_device.py from_legacy_mapping() for the complex
        # conversion code. So for now leave it as a dict and continue
        # to use existing code that is able to convert dict into the
        # desired internal BDM formats
        'img_block_device_mapping':
            fields.ListOfDictOfNullableStringsField(),

        # boolean - if True, and image cache set to "some" decides if image
        # should be cached on host when server is booted on that host
        'img_cache_in_nova': fields.FlexibleBooleanField(),

        # Compression level for images. (1-9)
        'img_compression_level': fields.IntegerField(),

        # hypervisor supported version, eg. '>=2.6'
        'img_hv_requested_version': fields.VersionPredicateField(),

        # type of the hypervisor, eg kvm, ironic, xen
        'img_hv_type': fields.HVTypeField(),

        # Whether the image needs/expected config drive
        'img_config_drive': fields.ConfigDrivePolicyField(),

        # boolean flag to set space-saving or performance behavior on the
        # Datastore
        'img_linked_clone': fields.FlexibleBooleanField(),

        # Image mappings - related to Block device mapping data - mapping
        # of virtual image names to device names. This could be represented
        # as a formal data type, but is left as dict for same reason as
        # img_block_device_mapping field. It would arguably make sense for
        # the two to be combined into a single field and data type in the
        # future.
        'img_mappings': fields.ListOfDictOfNullableStringsField(),

        # image project id (set on upload)
        'img_owner_id': fields.StringField(),

        # root device name, used in snapshotting eg /dev/<blah>
        'img_root_device_name': fields.StringField(),

        # boolean - if false don't talk to nova agent
        'img_use_agent': fields.FlexibleBooleanField(),

        # integer value 1
        'img_version': fields.IntegerField(),

        # base64 of encoding of image signature
        'img_signature': fields.StringField(),

        # string indicating hash method used to compute image signature
        'img_signature_hash_method': fields.ImageSignatureHashTypeField(),

        # string indicating Castellan uuid of certificate
        # used to compute the image's signature
        'img_signature_certificate_uuid': fields.UUIDField(),

        # string indicating type of key used to compute image signature
        'img_signature_key_type': fields.ImageSignatureKeyTypeField(),

        # string of username with admin privileges
        'os_admin_user': fields.StringField(),

        # string of boot time command line arguments for the guest kernel
        'os_command_line': fields.StringField(),

        # the name of the specific guest operating system distro. This
        # is not done as an Enum since the list of operating systems is
        # growing incredibly fast, and valid values can be arbitrarily
        # user defined. Nova has no real need for strict validation so
        # leave it freeform
        'os_distro': fields.StringField(),

        # boolean - if true, then guest must support disk quiesce
        # or snapshot operation will be denied
        'os_require_quiesce': fields.FlexibleBooleanField(),

        # Secure Boot feature will be enabled by setting the "os_secure_boot"
        # image property to "required". Other options can be: "disabled" or
        # "optional".
        # "os:secure_boot" flavor extra spec value overrides the image property
        # value.
        'os_secure_boot': fields.SecureBootField(),

        # boolean - if using agent don't inject files, assume someone else is
        # doing that (cloud-init)
        'os_skip_agent_inject_files_at_boot': fields.FlexibleBooleanField(),

        # boolean - if using agent don't try inject ssh key, assume someone
        # else is doing that (cloud-init)
        'os_skip_agent_inject_ssh': fields.FlexibleBooleanField(),

        # The guest operating system family such as 'linux', 'windows' - this
        # is a fairly generic type. For a detailed type consider os_distro
        # instead
        'os_type': fields.OSTypeField(),
    }

    # The keys are the legacy property names and
    # the values are the current preferred names
    _legacy_property_map = {
        'architecture': 'hw_architecture',
        'owner_id': 'img_owner_id',
        'vmware_disktype': 'hw_disk_type',
        'vmware_image_version': 'img_version',
        'vmware_ostype': 'os_distro',
        'auto_disk_config': 'hw_auto_disk_config',
        'ipxe_boot': 'hw_ipxe_boot',
        'xenapi_device_id': 'hw_device_id',
        'xenapi_image_compression_level': 'img_compression_level',
        'vmware_linked_clone': 'img_linked_clone',
        'xenapi_use_agent': 'img_use_agent',
        'xenapi_skip_agent_inject_ssh': 'os_skip_agent_inject_ssh',
        'xenapi_skip_agent_inject_files_at_boot':
            'os_skip_agent_inject_files_at_boot',
        'cache_in_nova': 'img_cache_in_nova',
        'vm_mode': 'hw_vm_mode',
        'bittorrent': 'img_bittorrent',
        'mappings': 'img_mappings',
        'block_device_mapping': 'img_block_device_mapping',
        'bdm_v2': 'img_bdm_v2',
        'root_device_name': 'img_root_device_name',
        'hypervisor_version_requires': 'img_hv_requested_version',
        'hypervisor_type': 'img_hv_type',
    }
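The legacy map above can be read as a rename table: an image tagged with an old key such as `architecture` is treated as if it carried `hw_architecture`, which is one of the keys the ImagePropertiesFilter reads. The following is a minimal sketch of that idea, not Nova's actual conversion code. The function name `normalize_image_properties` is hypothetical, and the rule that a current name wins over its legacy twin is an assumption of this sketch; the real conversion may also involve special cases that are not shown here.

```python
# Subset of the legacy-to-current name map shown above.
LEGACY_PROPERTY_MAP = {
    "architecture": "hw_architecture",
    "hypervisor_type": "img_hv_type",
    "vm_mode": "hw_vm_mode",
}


def normalize_image_properties(props):
    """Return a copy of `props` keyed by current property names.

    Assumption of this sketch: when both a legacy and a current name are
    present, the value given under the current name wins.
    """
    normalized = {}
    # First pass: legacy names, translated to their current equivalents.
    for legacy, current in LEGACY_PROPERTY_MAP.items():
        if legacy in props:
            normalized[current] = props[legacy]
    # Second pass: keys that are not legacy names override the first pass.
    for key, value in props.items():
        if key not in LEGACY_PROPERTY_MAP:
            normalized[key] = value
    return normalized


print(normalize_image_properties({"architecture": "x86_64", "os_distro": "ubuntu"}))
```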
  • This filter mainly compares image properties such as architecture against what the host supports (nova/scheduler/filters/image_props_filter.py):
class ImagePropertiesFilter(filters.BaseHostFilter):
    """Filter compute nodes that satisfy instance image properties.

    The ImagePropertiesFilter filters compute nodes that satisfy
    any architecture, hypervisor type, or virtual machine mode properties
    specified on the instance's image properties.  Image properties are
    contained in the image dictionary in the request_spec.
    """

    RUN_ON_REBUILD = True

    # Image Properties and Compute Capabilities do not change within
    # a request
    run_filter_once_per_request = True

    def _instance_supported(self, host_state, image_props,
                            hypervisor_version):
        img_arch = image_props.get('hw_architecture')
        img_h_type = image_props.get('img_hv_type')
        img_vm_mode = image_props.get('hw_vm_mode')
        checked_img_props = (
            fields.Architecture.canonicalize(img_arch),
            fields.HVType.canonicalize(img_h_type),
            fields.VMMode.canonicalize(img_vm_mode)
        )

        # Supported if no compute-related instance properties are specified
        if not any(checked_img_props):
            return True

        supp_instances = host_state.supported_instances
        # Not supported if an instance property is requested but nothing
        # advertised by the host.
        if not supp_instances:
            LOG.debug("Instance contains properties %(image_props)s, "
                        "but no corresponding supported_instances are "
                        "advertised by the compute node",
                      {'image_props': image_props})
            return False

        def _compare_props(props, other_props):
            for i in props:
                if i and i not in other_props:
                    return False
            return True

        def _compare_product_version(hyper_version, image_props):
            version_required = image_props.get('img_hv_requested_version')
            if not(hypervisor_version and version_required):
                return True
            img_prop_predicate = versionpredicate.VersionPredicate(
                'image_prop (%s)' % version_required)
            hyper_ver_str = versionutils.convert_version_to_str(hyper_version)
            return img_prop_predicate.satisfied_by(hyper_ver_str)

        for supp_inst in supp_instances:
            if _compare_props(checked_img_props, supp_inst):
                if _compare_product_version(hypervisor_version, image_props):
                    return True

        LOG.debug("Instance contains properties %(image_props)s "
                    "that are not provided by the compute node "
                    "supported_instances %(supp_instances)s or "
                    "hypervisor version %(hypervisor_version)s do not match",
                  {'image_props': image_props,
                   'supp_instances': supp_instances,
                   'hypervisor_version': hypervisor_version})
        return False

    def host_passes(self, host_state, spec_obj):
        """Check if host passes specified image properties.

        Returns True for compute nodes that satisfy image properties
        contained in the request_spec.
        """
        image_props = spec_obj.image.properties if spec_obj.image else {}

        if not self._instance_supported(host_state, image_props,
                                        host_state.hypervisor_version):
            LOG.debug("%(host_state)s does not support requested "
                        "instance_properties", {'host_state': host_state})
            return False
        return True
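The matching rule in `_instance_supported` is easier to see without the logging and version handling. The sketch below is a simplified stand-in, not the filter itself: it skips the `img_hv_requested_version` check and canonicalization, and the host data is made up for illustration.

```python
def compare_props(requested, supported):
    """True if every non-empty requested value appears in `supported`."""
    return all(value in supported for value in requested if value)


def instance_supported(requested, supported_instances):
    """Simplified version of the decision made by _instance_supported.

    `requested` is an (architecture, hypervisor type, vm mode) tuple in which
    None means the image does not ask for anything. Version checks are omitted.
    """
    if not any(requested):
        return True   # the image requests nothing specific
    if not supported_instances:
        return False  # the image requests something, the host advertises nothing
    return any(compare_props(requested, inst) for inst in supported_instances)


# Hypothetical host advertising two (arch, hv_type, vm_mode) combinations.
host_supported = [("x86_64", "kvm", "hvm"), ("i686", "kvm", "hvm")]

print(instance_supported(("x86_64", None, None), host_supported))
print(instance_supported(("aarch64", "kvm", None), host_supported))
```

Note that a property left unset on the image never rules a host out; only the values the image actually specifies have to appear in one of the host's advertised combinations.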

####CoreFilter filter

  • Source of the CoreFilter and AggregateCoreFilter filters (nova/scheduler/filters/core_filter.py):
class BaseCoreFilter(filters.BaseHostFilter):

    RUN_ON_REBUILD = False

    def _get_cpu_allocation_ratio(self, host_state, spec_obj):
        raise NotImplementedError

    def host_passes(self, host_state, spec_obj):
        """Return True if host has sufficient CPU cores.

        :param host_state: nova.scheduler.host_manager.HostState
        :param spec_obj: filter options
        :return: boolean
        """
        if not host_state.vcpus_total:
            # Fail safe
            LOG.warning(_LW("VCPUs not set; assuming CPU collection broken"))
            return True

        instance_vcpus = spec_obj.vcpus
        cpu_allocation_ratio = self._get_cpu_allocation_ratio(host_state,
                                                              spec_obj)
        vcpus_total = host_state.vcpus_total * cpu_allocation_ratio

        # Only provide a VCPU limit to compute if the virt driver is reporting
        # an accurate count of installed VCPUs. (XenServer driver does not)
        if vcpus_total > 0:
            host_state.limits['vcpu'] = vcpus_total

            # Do not allow an instance to overcommit against itself, only
            # against other instances.
            if instance_vcpus > host_state.vcpus_total:
                LOG.debug("%(host_state)s does not have %(instance_vcpus)d "
                          "total cpus before overcommit, it only has %(cpus)d",
                          {'host_state': host_state,
                           'instance_vcpus': instance_vcpus,
                           'cpus': host_state.vcpus_total})
                return False

        free_vcpus = vcpus_total - host_state.vcpus_used
        if free_vcpus < instance_vcpus:
            LOG.debug("%(host_state)s does not have %(instance_vcpus)d "
                      "usable vcpus, it only has %(free_vcpus)d usable "
                      "vcpus",
                      {'host_state': host_state,
                       'instance_vcpus': instance_vcpus,
                       'free_vcpus': free_vcpus})
            return False

        return True


class CoreFilter(BaseCoreFilter):
    """CoreFilter filters based on CPU core utilization."""

    def _get_cpu_allocation_ratio(self, host_state, spec_obj):
        return host_state.cpu_allocation_ratio


class AggregateCoreFilter(BaseCoreFilter):
    """AggregateCoreFilter with per-aggregate CPU subscription flag.

    Fall back to global cpu_allocation_ratio if no per-aggregate setting found.
    """

    def _get_cpu_allocation_ratio(self, host_state, spec_obj):
        aggregate_vals = utils.aggregate_values_from_key(
            host_state,
            'cpu_allocation_ratio')
        try:
            ratio = utils.validate_num_values(
                aggregate_vals, host_state.cpu_allocation_ratio, cast_to=float)
        except ValueError as e:
            LOG.warning(_LW("Could not decode cpu_allocation_ratio: '%s'"), e)
            ratio = host_state.cpu_allocation_ratio

        return ratio
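The arithmetic in `host_passes` above can be followed with concrete numbers. This is a minimal sketch that keeps only the calculation; the function name `core_filter_passes` and the sample figures (8 physical cores, ratio 16.0) are illustrative assumptions, not values from the original notes.

```python
def core_filter_passes(vcpus_total, vcpus_used, ratio, instance_vcpus):
    """Reproduce the checks of BaseCoreFilter.host_passes on plain numbers."""
    if not vcpus_total:
        return True  # fail safe: the host reports no VCPU count
    limit = vcpus_total * ratio  # overcommitted VCPU capacity
    # An instance may not overcommit against itself: it must fit in the
    # physical core count before the ratio is applied.
    if limit > 0 and instance_vcpus > vcpus_total:
        return False
    free_vcpus = limit - vcpus_used
    return free_vcpus >= instance_vcpus


# 8 physical cores with ratio 16.0 give a limit of 128 VCPUs.
print(core_filter_passes(8, 120, 16.0, 4))   # 8 free VCPUs, 4 requested
print(core_filter_passes(8, 0, 16.0, 16))    # 16 VCPUs exceed 8 physical cores
print(core_filter_passes(8, 126, 16.0, 4))   # only 2 free VCPUs left
```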
  • Parameter validation (nova/scheduler/filters/utils.py):
def validate_num_values(vals, default=None, cast_to=int, based_on=min):
    """Returns a correctly casted value based on a set of values.

    This method is useful to work with per-aggregate filters, It takes
    a set of values then return the 'based_on'{min/max} converted to
    'cast_to' of the set or the default value.

    Note: The cast implies a possible ValueError
    """
    num_values = len(vals)
    if num_values == 0:
        return default

    if num_values > 1:
        if based_on == min:
            LOG.info(_LI("%(num_values)d values found, "
                         "of which the minimum value will be used."),
                     {'num_values': num_values})
        else:
            LOG.info(_LI("%(num_values)d values found, "
                         "of which the maximum value will be used."),
                     {'num_values': num_values})
    return based_on([cast_to(val) for val in vals])

When several of the Host Aggregates a node belongs to set the cpu_allocation_ratio parameter, the smallest value is used.
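The rule for picking the ratio can be tried out in isolation. The sketch below restates `validate_num_values` without logging and adds the fallback that AggregateCoreFilter applies on a `ValueError`. The helper name `effective_cpu_allocation_ratio` and the sample values are assumptions made for illustration.

```python
def validate_num_values(vals, default=None, cast_to=int, based_on=min):
    """Same decision as the function above, without the log messages."""
    if len(vals) == 0:
        return default
    return based_on([cast_to(val) for val in vals])


def effective_cpu_allocation_ratio(aggregate_vals, host_ratio):
    """Pick the ratio the way AggregateCoreFilter does (hypothetical helper)."""
    try:
        return validate_num_values(aggregate_vals, host_ratio, cast_to=float)
    except ValueError:
        # Undecodable aggregate metadata falls back to the host's own ratio.
        return host_ratio


# A host that belongs to two aggregates with different ratios.
print(effective_cpu_allocation_ratio({"4.0", "16.0"}, 16.0))
```

The values are cast to `float` before `min` is applied. This matters because aggregate metadata is stored as strings, and comparing `"16.0"` with `"4.0"` as strings would select `"16.0"`.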

##Disaster recovery and backup

Node partitioning

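  • Region: mainly used to partition the cluster by physical location. Each Region has its own independent endpoints, and Regions are fully isolated from one another, but multiple Regions share the same Keystone and Dashboard.

  • Availability Zone: can be thought of simply as a set of nodes with an independent power supply, for example an independently powered machine room or an independently powered rack.

  • Host Aggregate: mainly used by administrators to partition hardware by node attributes; visible to administrators only.

  • Cell: mainly used to address OpenStack's scalability and scale bottlenecks by splitting components such as the database and AMQP, which enables hierarchical scheduling.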

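Summary:

  1. Region and Availability Zone can be used to choose where an instance is deployed, and that capability can be exposed to users. Alternatively, the operator can manage placement and offer users a range of disaster recovery options.
  2. The Cell feature can be used to partition the cluster and improve its horizontal scalability.
  3. Host Aggregates can be used to classify hosts by their attributes. Combined with the AggregateInstanceExtraSpecsFilter filter, hosts can be grouped under different scheduling policies; adding flavor metadata then lets a single cluster support several scheduling policies at once.
  4. The ServerGroupAntiAffinityFilter and ServerGroupAffinityFilter plugins can be used, together with the --hint parameter, to deploy virtual machines in groups.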

####Multiple regions

  • With multiple Regions, select one with the --os-region-name parameter:
$ nova --help
...

--os-region-name <region-name>
                                Defaults to env[OS_REGION_NAME].

...

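####Aggregates

Currently, availability zones and aggregates are managed on the command line with the same family of commands.

  • Multiple machine rooms map to Availability Zones (AvailabilityZoneFilter) and multiple racks map to Host Aggregates (AggregateInstanceExtraSpecsFilter). Together with metadata, they are managed with the following commands: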
$ nova --help
...

    aggregate-add-host          Add the host to the specified aggregate.
    aggregate-create            Create a new aggregate with the specified
                                details.
    aggregate-delete            Delete the aggregate.
    aggregate-list              Print a list of all aggregates.
    aggregate-remove-host       Remove the specified host from the specified
                                aggregate.
    aggregate-set-metadata      Update the metadata associated with the
                                aggregate.
    aggregate-show              Show details of the specified aggregate.
    aggregate-update            Update the aggregate's name and optionally
                                availability zone.
    availability-zone-list      List all the availability zones.

...

###Filter configuration

Add the AvailabilityZoneFilter filter:

$ vi /etc/kolla/nova-scheduler/nova.conf
[DEFAULT]
...
scheduler_default_filters = AggregateInstanceExtraSpecsFilter, RetryFilter, AvailabilityZoneFilter, RamFilter, DiskFilter, ComputeFilter, ComputeCapabilitiesFilter, ImagePropertiesFilter, ServerGroupAntiAffinityFilter, ServerGroupAffinityFilter
...

$ docker restart nova_scheduler
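Before restarting the scheduler it can be useful to confirm that the filter really is in the list. The sketch below uses Python's standard `configparser` as a stand-in for reading the option; Nova itself loads its configuration through oslo.config, so this only illustrates the check. The `SAMPLE_CONF` text and the function name `enabled_filters` are assumptions for the example.

```python
import configparser

# Sample text mirroring the scheduler_default_filters line shown above.
SAMPLE_CONF = """
[DEFAULT]
scheduler_default_filters = AggregateInstanceExtraSpecsFilter, RetryFilter, AvailabilityZoneFilter, RamFilter, DiskFilter, ComputeFilter, ComputeCapabilitiesFilter, ImagePropertiesFilter, ServerGroupAntiAffinityFilter, ServerGroupAffinityFilter
"""


def enabled_filters(conf_text):
    """Return the filter names listed in scheduler_default_filters."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    raw = parser.get("DEFAULT", "scheduler_default_filters", fallback="")
    return [name.strip() for name in raw.split(",") if name.strip()]


filters = enabled_filters(SAMPLE_CONF)
print("AvailabilityZoneFilter" in filters, len(filters))
```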

Scheduling test

####Availability zone configuration

  • Assume a Region with 2 machine rooms, each with 2 racks:
Host         Room   Rack         Rack attribute
osdev-01     az01   az01-ha01    addr=11
osdev-02     az01   az01-ha02    addr=12
osdev-03     az02   az02-ha01    addr=21
osdev-ceph   az02   az02-ha02    addr=22
osdev-gpu    az02   az02-ha02    addr=22
  • View the current Availability Zones:
$ nova availability-zone-list
+-----------------------+----------------------------------------+
| Name                  | Status                                 |
+-----------------------+----------------------------------------+
| internal              | available                              |
| |- osdev-01           |                                        |
| | |- nova-conductor   | enabled :-) 2018-03-15T09:51:30.000000 |
| | |- nova-scheduler   | enabled :-) 2018-03-15T09:51:30.000000 |
| | |- nova-consoleauth | enabled :-) 2018-03-15T09:51:31.000000 |
| nova                  | available                              |
| |- osdev-01           |                                        |
| | |- nova-compute     | enabled :-) 2018-03-15T09:51:30.000000 |
| |- osdev-02           |                                        |
| | |- nova-compute     | enabled :-) 2018-03-15T09:51:25.000000 |
| |- osdev-03           |                                        |
| | |- nova-compute     | enabled :-) 2018-03-15T09:51:31.000000 |
| |- osdev-ceph         |                                        |
| | |- nova-compute     | enabled :-) 2018-03-15T09:51:28.000000 |
| |- osdev-gpu          |                                        |
| | |- nova-compute     | enabled :-) 2018-03-15T09:51:26.000000 |
+-----------------------+----------------------------------------+
  • Create 2 Availability Zones and 4 Host Aggregates:
$ nova aggregate-create az01-ha01 az01
+----+-----------+-------------------+-------+--------------------------+--------------------------------------+
| Id | Name      | Availability Zone | Hosts | Metadata                 | UUID                                 |
+----+-----------+-------------------+-------+--------------------------+--------------------------------------+
| 3  | az01-ha01 | az01              |       | 'availability_zone=az01' | 3baac65d-2907-412a-98f5-60e582612548 |
+----+-----------+-------------------+-------+--------------------------+--------------------------------------+

$ nova aggregate-create az01-ha02 az01
+----+-----------+-------------------+-------+--------------------------+--------------------------------------+
| Id | Name      | Availability Zone | Hosts | Metadata                 | UUID                                 |
+----+-----------+-------------------+-------+--------------------------+--------------------------------------+
| 4  | az01-ha02 | az01              |       | 'availability_zone=az01' | 5ea0f221-024d-43f5-b1f1-6e0cc364ad39 |
+----+-----------+-------------------+-------+--------------------------+--------------------------------------+

$ nova aggregate-create az02-ha01 az02
+----+-----------+-------------------+-------+--------------------------+--------------------------------------+
| Id | Name      | Availability Zone | Hosts | Metadata                 | UUID                                 |
+----+-----------+-------------------+-------+--------------------------+--------------------------------------+
| 5  | az02-ha01 | az02              |       | 'availability_zone=az02' | 00cdcefa-bcd0-490c-bdee-9cc970268a03 |
+----+-----------+-------------------+-------+--------------------------+--------------------------------------+

$ nova aggregate-create az02-ha02 az02
+----+-----------+-------------------+-------+--------------------------+--------------------------------------+
| Id | Name      | Availability Zone | Hosts | Metadata                 | UUID                                 |
+----+-----------+-------------------+-------+--------------------------+--------------------------------------+
| 6  | az02-ha02 | az02              |       | 'availability_zone=az02' | 81de444f-4730-471a-bff5-ff11372c3096 |
+----+-----------+-------------------+-------+--------------------------+--------------------------------------+
  • Add the 5 nodes to the 4 Host Aggregates:
$ nova aggregate-add-host az01-ha01 osdev-01
Host osdev-01 has been successfully added for aggregate 3 
+----+-----------+-------------------+------------+--------------------------+--------------------------------------+
| Id | Name      | Availability Zone | Hosts      | Metadata                 | UUID                                 |
+----+-----------+-------------------+------------+--------------------------+--------------------------------------+
| 3  | az01-ha01 | az01              | 'osdev-01' | 'availability_zone=az01' | 3baac65d-2907-412a-98f5-60e582612548 |
+----+-----------+-------------------+------------+--------------------------+--------------------------------------+

$ nova aggregate-add-host az01-ha02 osdev-02
Host osdev-02 has been successfully added for aggregate 4 
+----+-----------+-------------------+------------+--------------------------+--------------------------------------+
| Id | Name      | Availability Zone | Hosts      | Metadata                 | UUID                                 |
+----+-----------+-------------------+------------+--------------------------+--------------------------------------+
| 4  | az01-ha02 | az01              | 'osdev-02' | 'availability_zone=az01' | 5ea0f221-024d-43f5-b1f1-6e0cc364ad39 |
+----+-----------+-------------------+------------+--------------------------+--------------------------------------+

$ nova aggregate-add-host az02-ha01 osdev-03
Host osdev-03 has been successfully added for aggregate 5 
+----+-----------+-------------------+------------+--------------------------+--------------------------------------+
| Id | Name      | Availability Zone | Hosts      | Metadata                 | UUID                                 |
+----+-----------+-------------------+------------+--------------------------+--------------------------------------+
| 5  | az02-ha01 | az02              | 'osdev-03' | 'availability_zone=az02' | 00cdcefa-bcd0-490c-bdee-9cc970268a03 |
+----+-----------+-------------------+------------+--------------------------+--------------------------------------+

$ nova aggregate-add-host az02-ha02 osdev-ceph
Host osdev-ceph has been successfully added for aggregate 6 
+----+-----------+-------------------+--------------+--------------------------+--------------------------------------+
| Id | Name      | Availability Zone | Hosts        | Metadata                 | UUID                                 |
+----+-----------+-------------------+--------------+--------------------------+--------------------------------------+
| 6  | az02-ha02 | az02              | 'osdev-ceph' | 'availability_zone=az02' | 81de444f-4730-471a-bff5-ff11372c3096 |
+----+-----------+-------------------+--------------+--------------------------+--------------------------------------+

$ nova aggregate-add-host az02-ha02 osdev-gpu
Host osdev-gpu has been successfully added for aggregate 6 
+----+-----------+-------------------+---------------------------+--------------------------+--------------------------------------+
| Id | Name      | Availability Zone | Hosts                     | Metadata                 | UUID                                 |
+----+-----------+-------------------+---------------------------+--------------------------+--------------------------------------+
| 6  | az02-ha02 | az02              | 'osdev-ceph', 'osdev-gpu' | 'availability_zone=az02' | 81de444f-4730-471a-bff5-ff11372c3096 |
+----+-----------+-------------------+---------------------------+--------------------------+--------------------------------------+
  • View the current Availability Zones:
$ nova availability-zone-list
+-----------------------+----------------------------------------+
| Name                  | Status                                 |
+-----------------------+----------------------------------------+
| internal              | available                              |
| |- osdev-01           |                                        |
| | |- nova-conductor   | enabled :-) 2018-03-15T10:09:10.000000 |
| | |- nova-scheduler   | enabled :-) 2018-03-15T10:09:10.000000 |
| | |- nova-consoleauth | enabled :-) 2018-03-15T10:09:01.000000 |
| az02                  | available                              |
| |- osdev-03           |                                        |
| | |- nova-compute     | enabled :-) 2018-03-15T10:09:01.000000 |
| |- osdev-ceph         |                                        |
| | |- nova-compute     | enabled :-) 2018-03-15T10:09:08.000000 |
| |- osdev-gpu          |                                        |
| | |- nova-compute     | enabled :-) 2018-03-15T10:09:05.000000 |
| az01                  | available                              |
| |- osdev-01           |                                        |
| | |- nova-compute     | enabled :-) 2018-03-15T10:09:10.000000 |
| |- osdev-02           |                                        |
| | |- nova-compute     | enabled :-) 2018-03-15T10:09:05.000000 |
+-----------------------+----------------------------------------+

####Specifying an availability zone

  • Create an instance in az02 (it was scheduled to osdev-gpu):
$ openstack network list
+--------------------------------------+----------+--------------------------------------+
| ID                                   | Name     | Subnets                              |
+--------------------------------------+----------+--------------------------------------+
| 8ab35b74-d680-4cfc-8c61-810965e3992e | public1  | 2e6f24b8-3482-4e68-9d61-6306ff1da8a2 |
| 8d01509e-4a3a-497a-9118-3827c1e37672 | demo-net | 3b817b11-8fda-485f-bad9-0b7e30534d66 |
+--------------------------------------+----------+--------------------------------------+

$ openstack server create --image cirros --flavor m1.tiny --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 --availability-zone az02 demo1


$ openstack server list --long --column "Name" --column "Status" --column "Flavor Name" --column "Networks" --column "Availability Zone" --column "Host"
+-------+--------+--------------------+-------------------+-----------+
| Name  | Status | Networks           | Availability Zone | Host      |
+-------+--------+--------------------+-------------------+-----------+
| demo1 | ACTIVE | demo-net=10.0.0.11 | az02              | osdev-gpu |
+-------+--------+--------------------+-------------------+-----------+
  • Create an instance in az01 (it was scheduled to osdev-02):
$ openstack server create --image cirros --flavor m1.tiny --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 --availability-zone az01 demo2

$ openstack server list --long --column "Name" --column "Status" --column "Flavor Name" --column "Networks" --column "Availability Zone" --column "Host"
+-------+--------+--------------------+-------------------+-----------+
| Name  | Status | Networks           | Availability Zone | Host      |
+-------+--------+--------------------+-------------------+-----------+
| demo2 | ACTIVE | demo-net=10.0.0.3  | az01              | osdev-02  |
| demo1 | ACTIVE | demo-net=10.0.0.11 | az02              | osdev-gpu |
+-------+--------+--------------------+-------------------+-----------+
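The two placements above are consistent with a zone check of roughly the following shape. This is a simplified sketch of the idea behind AvailabilityZoneFilter, not its source: the function name `az_filter_passes` is hypothetical, and the default zone name `nova` is taken from the first availability-zone listing, where hosts outside any aggregate appear under `nova`.

```python
DEFAULT_AZ = "nova"  # assumption: the zone for hosts that belong to no aggregate

# Zone metadata per host, as set up by the aggregate commands above.
HOST_AZS = {
    "osdev-01": {"az01"},
    "osdev-02": {"az01"},
    "osdev-03": {"az02"},
    "osdev-ceph": {"az02"},
    "osdev-gpu": {"az02"},
}


def az_filter_passes(requested_az, host_azs, default_az=DEFAULT_AZ):
    """Decide whether a host is acceptable for the requested zone."""
    if requested_az is None:
        return True  # no zone requested, nothing to filter on
    if host_azs:
        return requested_az in host_azs
    # Host carries no zone metadata: it only matches the default zone.
    return requested_az == default_az


candidates = sorted(h for h, azs in HOST_AZS.items() if az_filter_passes("az02", azs))
print(candidates)
```

With `--availability-zone az02` the candidate set is osdev-03, osdev-ceph and osdev-gpu; which of them receives the instance is then decided by the remaining filters and the weighing step.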

####Aggregate configuration

  • Set the rack attributes:
$ nova aggregate-set-metadata az01-ha01 addr=11
Metadata has been successfully updated for aggregate 3.
+----+-----------+-------------------+------------+-------------------------------------+--------------------------------------+
| Id | Name      | Availability Zone | Hosts      | Metadata                            | UUID                                 |
+----+-----------+-------------------+------------+-------------------------------------+--------------------------------------+
| 3  | az01-ha01 | az01              | 'osdev-01' | 'addr=11', 'availability_zone=az01' | 3baac65d-2907-412a-98f5-60e582612548 |
+----+-----------+-------------------+------------+-------------------------------------+--------------------------------------+

$ nova aggregate-set-metadata az01-ha02 addr=12
Metadata has been successfully updated for aggregate 4.
+----+-----------+-------------------+------------+-------------------------------------+--------------------------------------+
| Id | Name      | Availability Zone | Hosts      | Metadata                            | UUID                                 |
+----+-----------+-------------------+------------+-------------------------------------+--------------------------------------+
| 4  | az01-ha02 | az01              | 'osdev-02' | 'addr=12', 'availability_zone=az01' | 5ea0f221-024d-43f5-b1f1-6e0cc364ad39 |
+----+-----------+-------------------+------------+-------------------------------------+--------------------------------------+

$ nova aggregate-set-metadata az02-ha01 addr=21
Metadata has been successfully updated for aggregate 5.
+----+-----------+-------------------+------------+-------------------------------------+--------------------------------------+
| Id | Name      | Availability Zone | Hosts      | Metadata                            | UUID                                 |
+----+-----------+-------------------+------------+-------------------------------------+--------------------------------------+
| 5  | az02-ha01 | az02              | 'osdev-03' | 'addr=21', 'availability_zone=az02' | 00cdcefa-bcd0-490c-bdee-9cc970268a03 |
+----+-----------+-------------------+------------+-------------------------------------+--------------------------------------+

$ nova aggregate-set-metadata az02-ha02 addr=22
Metadata has been successfully updated for aggregate 6.
+----+-----------+-------------------+---------------------------+-------------------------------------+--------------------------------------+
| Id | Name      | Availability Zone | Hosts                     | Metadata                            | UUID                                 |
+----+-----------+-------------------+---------------------------+-------------------------------------+--------------------------------------+
| 6  | az02-ha02 | az02              | 'osdev-ceph', 'osdev-gpu' | 'addr=22', 'availability_zone=az02' | 81de444f-4730-471a-bff5-ff11372c3096 |
+----+-----------+-------------------+---------------------------+-------------------------------------+--------------------------------------+
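Once each aggregate carries an `addr` value, AggregateInstanceExtraSpecsFilter can match it against flavor extra specs. The sketch below shows the matching idea under stated assumptions: the flavor key `aggregate_instance_extra_specs:addr` is an assumed example rather than a key taken from these notes, only keys with that scope prefix are examined, and plain string equality stands in for the operator syntax that the real filter supports.

```python
SCOPE = "aggregate_instance_extra_specs"


def aggregate_extra_specs_pass(flavor_extra_specs, aggregate_metadata):
    """Check scoped flavor extra specs against a host's aggregate metadata.

    `aggregate_metadata` maps a key to the set of values found across all
    aggregates the host belongs to.
    """
    for key, wanted in flavor_extra_specs.items():
        scope, sep, name = key.partition(":")
        if not sep or scope != SCOPE:
            continue  # simplification: other keys are not examined here
        if wanted not in aggregate_metadata.get(name, set()):
            return False
    return True


# Metadata of az02-ha02 as shown above, and a hypothetical flavor extra spec.
az02_ha02 = {"addr": {"22"}, "availability_zone": {"az02"}}
flavor_specs = {"aggregate_instance_extra_specs:addr": "22"}

print(aggregate_extra_specs_pass(flavor_specs, az02_ha02))
```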

####Flavor configuration

  • Create new flavors:
$ openstack flavor create --vcpus 1 --ram 64 --disk 1 machine.az01-ha01
+----------------------------+--------------------------------------+
| Field                      | Value                                |
+----------------------------+--------------------------------------+
| OS-FLV-DISABLED:disabled   | False                                |
| OS-FLV-EXT-DATA:ephemeral  | 0                                    |
| disk                       | 1                                    |
| id                         | 0c3bb453-146f-4093-b161-39c10978f0eb |
| name                       | machine.az01-ha01                    |
| os-flavor-access:is_public | True                                 |
| properties                 |                                      |
| ram                        | 64                                   |
| rxtx_factor                | 1.0                                  |
| swap                       |                                      |
| vcpus                      | 1                                    |
+----------------------------+--------------------------------------+

$ openstack flavor create --vcpus 1 --ram 64 --disk 1 machine.az01-ha02
+----------------------------+--------------------------------------+
| Field                      | Value                                |
+----------------------------+--------------------------------------+
| OS-FLV-DISABLED:disabled   | False                                |
| OS-FLV-EXT-DATA:ephemeral  | 0                                    |
| disk                       | 1                                    |
| id                         | ba9a94aa-f841-4529-8d34-e3e9e8484f90 |
| name                       | machine.az01-ha02                    |
| os-flavor-access:is_public | True                                 |
| properties                 |                                      |
| ram                        | 64                                   |
| rxtx_factor                | 1.0                                  |
| swap                       |                                      |
| vcpus                      | 1                                    |
+----------------------------+--------------------------------------+

$ openstack flavor create --vcpus 1 --ram 64 --disk 1 machine.az02-ha01
+----------------------------+--------------------------------------+
| Field                      | Value                                |
+----------------------------+--------------------------------------+
| OS-FLV-DISABLED:disabled   | False                                |
| OS-FLV-EXT-DATA:ephemeral  | 0                                    |
| disk                       | 1                                    |
| id                         | 326d7246-9d6a-4b73-8e89-83565887ada7 |
| name                       | machine.az02-ha01                    |
| os-flavor-access:is_public | True                                 |
| properties                 |                                      |
| ram                        | 64                                   |
| rxtx_factor                | 1.0                                  |
| swap                       |                                      |
| vcpus                      | 1                                    |
+----------------------------+--------------------------------------+

$ openstack flavor create --vcpus 1 --ram 64 --disk 1 machine.az02-ha02
+----------------------------+--------------------------------------+
| Field                      | Value                                |
+----------------------------+--------------------------------------+
| OS-FLV-DISABLED:disabled   | False                                |
| OS-FLV-EXT-DATA:ephemeral  | 0                                    |
| disk                       | 1                                    |
| id                         | f3fb1e5f-adcd-402c-9125-56dda997b52a |
| name                       | machine.az02-ha02                    |
| os-flavor-access:is_public | True                                 |
| properties                 |                                      |
| ram                        | 64                                   |
| rxtx_factor                | 1.0                                  |
| swap                       |                                      |
| vcpus                      | 1                                    |
+----------------------------+--------------------------------------+
  • Add the addr metadata to each flavor:
$ nova flavor-key machine.az01-ha01 set addr=11
$ nova flavor-key machine.az01-ha02 set addr=12
$ nova flavor-key machine.az02-ha01 set addr=21
$ nova flavor-key machine.az02-ha02 set addr=22

$ openstack flavor list --long --column "Name" --column "Properties"
+-------------------+------------+
| Name              | Properties |
+-------------------+------------+
| machine.az01-ha01 | addr='11'  |
| m1.tiny           |            |
| m1.small          |            |
| m1.medium         |            |
| machine.az02-ha01 | addr='21'  |
| m1.large          |            |
| m1.xlarge         |            |
| machine.az01-ha02 | addr='12'  |
| machine.az02-ha02 | addr='22'  |
+-------------------+------------+

####Targeting a host aggregate

Create virtual machines using the flavors that carry the addr metadata:

$ openstack server create --image cirros --flavor machine.az01-ha01 --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 server.az01-ha01

$ openstack server create --image cirros --flavor machine.az01-ha02 --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 server.az01-ha02

$ openstack server create --image cirros --flavor machine.az02-ha01 --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 server.az02-ha01

$ openstack server create --image cirros --flavor machine.az02-ha02 --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 server.az02-ha02


$ openstack server list --long --column "Name" --column "Status" --column "Networks" --column "Availability Zone" --column "Host"
+------------------+--------+--------------------+-------------------+-----------+
| Name             | Status | Networks           | Availability Zone | Host      |
+------------------+--------+--------------------+-------------------+-----------+
| server.az02-ha02 | ACTIVE | demo-net=10.0.0.15 | az02              | osdev-gpu |
| server.az02-ha01 | ACTIVE | demo-net=10.0.0.12 | az02              | osdev-03  |
| server.az01-ha02 | ACTIVE | demo-net=10.0.0.6  | az01              | osdev-02  |
| server.az01-ha01 | ACTIVE | demo-net=10.0.0.4  | az01              | osdev-01  |
+------------------+--------+--------------------+-------------------+-----------+
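The placements above follow from matching the flavor's extra specs against aggregate metadata. The sketch below is a simplified model, not Nova's implementation: it assumes plain string equality (the real AggregateInstanceExtraSpecsFilter also supports comparison operators and scoped keys), and `AGGREGATES`, `host_passes`, and `candidate_hosts` are hypothetical names that only mirror the two az02 aggregates listed earlier.

```python
# Simplified model of aggregate extra-spec matching (illustrative only).
# The data mirrors the two az02 aggregates shown in the listings above.
AGGREGATES = [
    {"name": "az02-ha01", "hosts": {"osdev-03"},
     "metadata": {"addr": "21", "availability_zone": "az02"}},
    {"name": "az02-ha02", "hosts": {"osdev-ceph", "osdev-gpu"},
     "metadata": {"addr": "22", "availability_zone": "az02"}},
]


def host_passes(host, extra_specs):
    """Return True if every extra spec is matched by an aggregate of host."""
    host_aggregates = [a for a in AGGREGATES if host in a["hosts"]]
    for key, wanted in extra_specs.items():
        # Collect the values this key takes across the host's aggregates.
        values = {a["metadata"][key] for a in host_aggregates
                  if key in a["metadata"]}
        if wanted not in values:
            return False
    return True


def candidate_hosts(extra_specs):
    """List the hosts that survive the (simplified) filter, sorted by name."""
    all_hosts = sorted(set().union(*(a["hosts"] for a in AGGREGATES)))
    return [h for h in all_hosts if host_passes(h, extra_specs)]


if __name__ == "__main__":
    print(candidate_hosts({"addr": "21"}))
    print(candidate_hosts({"addr": "22"}))
```

With addr=22 both osdev-ceph and osdev-gpu remain as candidates; which one is finally chosen depends on weighing, which this sketch does not model.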

###Related source code

  • Availability zone filter source (nova/scheduler/filters/availability_zone_filter.py):
class AvailabilityZoneFilter(filters.BaseHostFilter):
    """Filters Hosts by availability zone.

    Works with aggregate metadata availability zones, using the key
    'availability_zone'
    Note: in theory a compute node can be part of multiple availability_zones
    """

    # Availability zones do not change within a request
    run_filter_once_per_request = True

    RUN_ON_REBUILD = False

    def host_passes(self, host_state, spec_obj):
        availability_zone = spec_obj.availability_zone

        if not availability_zone:
            return True

        metadata = utils.aggregate_metadata_get_by_host(
                host_state, key='availability_zone')

        if 'availability_zone' in metadata:
            hosts_passes = availability_zone in metadata['availability_zone']
            host_az = metadata['availability_zone']
        else:
            hosts_passes = availability_zone == CONF.default_availability_zone
            host_az = CONF.default_availability_zone

        if not hosts_passes:
            LOG.debug("Availability Zone '%(az)s' requested. "
                      "%(host_state)s has AZs: %(host_az)s",
                      {'host_state': host_state,
                       'az': availability_zone,
                       'host_az': host_az})

        return hosts_passes

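As the code shows, the availability zone check first compares the requested zone with the availability_zone values in the host's aggregate metadata. If the host has no such metadata, the requested zone is compared with the configured default (CONF.default_availability_zone) instead.

  • Retrieving a host's aggregate metadata (nova/scheduler/filters/utils.py):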
def aggregate_metadata_get_by_host(host_state, key=None):
    """Returns a dict of all metadata based on a metadata key for a specific
    host. If the key is not provided, returns a dict of all metadata.
    """
    aggrlist = host_state.aggregates
    metadata = collections.defaultdict(set)
    for aggr in aggrlist:
        if key is None or key in aggr.metadata:
            for k, v in aggr.metadata.items():
                metadata[k].update(x.strip() for x in v.split(','))
    return metadata
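
To see how the two functions above interact, the following sketch runs the same logic against stand-in objects. It is a condensed illustration rather than Nova code: the `SimpleNamespace` objects are hypothetical replacements for HostState and aggregate objects, and `DEFAULT_AZ = "nova"` is an assumed value standing in for CONF.default_availability_zone.

```python
import collections
from types import SimpleNamespace

# Assumed stand-in for CONF.default_availability_zone.
DEFAULT_AZ = "nova"


def aggregate_metadata_get_by_host(host_state, key=None):
    """Same merging logic as the Nova helper quoted above."""
    metadata = collections.defaultdict(set)
    for aggr in host_state.aggregates:
        if key is None or key in aggr.metadata:
            for k, v in aggr.metadata.items():
                # Comma-separated values are split and merged into one set.
                metadata[k].update(x.strip() for x in v.split(","))
    return metadata


def az_passes(host_state, requested_az):
    """Condensed version of AvailabilityZoneFilter.host_passes."""
    if not requested_az:
        return True
    metadata = aggregate_metadata_get_by_host(
        host_state, key="availability_zone")
    if "availability_zone" in metadata:
        return requested_az in metadata["availability_zone"]
    return requested_az == DEFAULT_AZ


# Hypothetical stand-ins for HostState objects.
osdev_gpu = SimpleNamespace(aggregates=[
    SimpleNamespace(metadata={"addr": "22", "availability_zone": "az02"}),
])
no_aggregate_host = SimpleNamespace(aggregates=[])

if __name__ == "__main__":
    print(dict(aggregate_metadata_get_by_host(
        osdev_gpu, key="availability_zone")))
    print(az_passes(osdev_gpu, "az02"), az_passes(osdev_gpu, "az01"))
    print(az_passes(no_aggregate_host, "nova"))
```

Note that passing a key only selects which aggregates contribute; every metadata entry of a selected aggregate (here addr as well) ends up in the returned dict.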

##Affinity

###Filter configuration

  • Add the SameHostFilter, DifferentHostFilter, ServerGroupAntiAffinityFilter, and ServerGroupAffinityFilter filters:
$ vi /etc/kolla/nova-scheduler/nova.conf
[DEFAULT]
...
scheduler_default_filters = ServerGroupAntiAffinityFilter, ServerGroupAffinityFilter, SameHostFilter, DifferentHostFilter, AggregateInstanceExtraSpecsFilter, RetryFilter, AvailabilityZoneFilter, RamFilter, DiskFilter, ComputeFilter, ComputeCapabilitiesFilter, ImagePropertiesFilter
...

$ docker restart nova_scheduler

###Affinity tests

####Same host

  • Create a new instance on the node that hosts server.az02-ha02:
$ openstack server create --image cirros --flavor m1.tiny --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 --hint same_host=6a928cc0-1509-4e00-91c8-6b43ceb05373 server.same


$ openstack server list --long --column "ID" --column "Name" --column "Status" --column "Networks" --column "Availability Zone" --column "Host"
+--------------------------------------+------------------+--------+--------------------+-------------------+-----------+
| ID                                   | Name             | Status | Networks           | Availability Zone | Host      |
+--------------------------------------+------------------+--------+--------------------+-------------------+-----------+
| aadc3d24-965f-437d-af40-70adee984cad | server.same      | ACTIVE | demo-net=10.0.0.10 | az02              | osdev-gpu |
| 6a928cc0-1509-4e00-91c8-6b43ceb05373 | server.az02-ha02 | ACTIVE | demo-net=10.0.0.15 | az02              | osdev-gpu |
| 3605afc0-7a8e-4ae9-a0cf-e0df64f6bfd6 | server.az02-ha01 | ACTIVE | demo-net=10.0.0.12 | az02              | osdev-03  |
| cdec034a-bdca-4651-88d5-c34d17ea12f1 | server.az01-ha02 | ACTIVE | demo-net=10.0.0.6  | az01              | osdev-02  |
| 276a4af1-762b-43c9-a064-7f1b27d46356 | server.az01-ha01 | ACTIVE | demo-net=10.0.0.4  | az01              | osdev-01  |
+--------------------------------------+------------------+--------+--------------------+-------------------+-----------+

####Different host

Create a new instance on any node other than the one that hosts server.az02-ha02:

$ openstack server create --image cirros --flavor m1.tiny --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 --hint different_host=6a928cc0-1509-4e00-91c8-6b43ceb05373 server.different

$ openstack server list --long --column "ID" --column "Name" --column "Status" --column "Networks" --column "Availability Zone" --column "Host"
+--------------------------------------+------------------+--------+--------------------+-------------------+-----------+
| ID                                   | Name             | Status | Networks           | Availability Zone | Host      |
+--------------------------------------+------------------+--------+--------------------+-------------------+-----------+
| d652614a-ce1b-4314-8c13-33c77527950d | server.different | ACTIVE | demo-net=10.0.0.8  | az02              | osdev-03  |
| aadc3d24-965f-437d-af40-70adee984cad | server.same      | ACTIVE | demo-net=10.0.0.10 | az02              | osdev-gpu |
| 6a928cc0-1509-4e00-91c8-6b43ceb05373 | server.az02-ha02 | ACTIVE | demo-net=10.0.0.15 | az02              | osdev-gpu |
| 3605afc0-7a8e-4ae9-a0cf-e0df64f6bfd6 | server.az02-ha01 | ACTIVE | demo-net=10.0.0.12 | az02              | osdev-03  |
| cdec034a-bdca-4651-88d5-c34d17ea12f1 | server.az01-ha02 | ACTIVE | demo-net=10.0.0.6  | az01              | osdev-02  |
| 276a4af1-762b-43c9-a064-7f1b27d46356 | server.az01-ha01 | ACTIVE | demo-net=10.0.0.4  | az01              | osdev-01  |
+--------------------------------------+------------------+--------+--------------------+-------------------+-----------+
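
Both hints reduce to a set-intersection test between the hinted instance UUIDs and the instances on each host. The sketch below replays that test with the UUIDs from the listings. It is a simplified model: `HOST_INSTANCES` and `filter_hosts` are hypothetical names, and osdev-ceph is assumed to be empty when the requests are made.

```python
def instance_uuids_overlap(host_instances, uuids):
    """True if any hinted UUID belongs to an instance on the host."""
    if isinstance(uuids, str):
        uuids = [uuids]
    return bool(set(host_instances) & set(uuids))


# Host -> instance UUIDs, taken from the listing before server.same existed.
# osdev-ceph is assumed to run no instances at that point.
HOST_INSTANCES = {
    "osdev-01": {"276a4af1-762b-43c9-a064-7f1b27d46356"},
    "osdev-02": {"cdec034a-bdca-4651-88d5-c34d17ea12f1"},
    "osdev-03": {"3605afc0-7a8e-4ae9-a0cf-e0df64f6bfd6"},
    "osdev-ceph": set(),
    "osdev-gpu": {"6a928cc0-1509-4e00-91c8-6b43ceb05373"},
}
TARGET = "6a928cc0-1509-4e00-91c8-6b43ceb05373"  # server.az02-ha02


def filter_hosts(hint_name, uuids):
    """Return the hosts that pass a same_host or different_host hint."""
    passed = []
    for host, instances in sorted(HOST_INSTANCES.items()):
        overlap = instance_uuids_overlap(instances, uuids)
        if hint_name == "same_host" and overlap:
            passed.append(host)
        elif hint_name == "different_host" and not overlap:
            passed.append(host)
    return passed


if __name__ == "__main__":
    print(filter_hosts("same_host", TARGET))
    print(filter_hosts("different_host", TARGET))
```

The same_host hint leaves only osdev-gpu, while different_host leaves every other host, which is consistent with server.different landing on osdev-03.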

####Anti-affinity scheduling

  • Create a server group:
$ openstack server group create --policy anti-affinity group-anti-affinity
+----------+--------------------------------------+
| Field    | Value                                |
+----------+--------------------------------------+
| id       | 855ea22c-d369-4e90-a7b1-318064c72b16 |
| members  |                                      |
| name     | group-anti-affinity                  |
| policies | anti-affinity                        |
+----------+--------------------------------------+
  • Create the virtual machines:
$ openstack server create --image cirros --flavor m1.tiny --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 --hint group=855ea22c-d369-4e90-a7b1-318064c72b16 server.aa1

$ openstack server create --image cirros --flavor m1.tiny --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 --hint group=855ea22c-d369-4e90-a7b1-318064c72b16 server.aa2

$ openstack server create --image cirros --flavor m1.tiny --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 --hint group=855ea22c-d369-4e90-a7b1-318064c72b16 server.aa3

$ openstack server create --image cirros --flavor m1.tiny --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 --hint group=855ea22c-d369-4e90-a7b1-318064c72b16 server.aa4

$ openstack server create --image cirros --flavor m1.tiny --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 --hint group=855ea22c-d369-4e90-a7b1-318064c72b16 server.aa5

$ openstack server create --image cirros --flavor m1.tiny --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 --hint group=855ea22c-d369-4e90-a7b1-318064c72b16 server.aa6

$ openstack server create --image cirros --flavor m1.tiny --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 --hint group=855ea22c-d369-4e90-a7b1-318064c72b16 server.aa7

$ openstack server create --image cirros --flavor m1.tiny --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 --hint group=855ea22c-d369-4e90-a7b1-318064c72b16 server.aa8
  • Check the scheduling result:
$ openstack server list --long --column "Name" --column "Status" --column "Networks" --column "Availability Zone" --column "Host"
+--------------------+--------+--------------------+-------------------+------------+
| Name               | Status | Networks           | Availability Zone | Host       |
+--------------------+--------+--------------------+-------------------+------------+
| server.aa8         | ERROR  |                    |                   | None       |
| server.aa7         | ERROR  |                    |                   | None       |
| server.aa6         | ERROR  |                    |                   | None       |
| server.aa5         | ACTIVE | demo-net=10.0.0.10 | az02              | osdev-ceph |
| server.aa4         | ACTIVE | demo-net=10.0.0.7  | az01              | osdev-01   |
| server.aa3         | ACTIVE | demo-net=10.0.0.18 | az01              | osdev-02   |
| server.aa2         | ACTIVE | demo-net=10.0.0.4  | az02              | osdev-03   |
| server.aa1         | ACTIVE | demo-net=10.0.0.11 | az02              | osdev-gpu  |
+--------------------+--------+--------------------+-------------------+------------+
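
The three ERROR entries are what an anti-affinity policy produces once every host already runs a group member. The simulation below is a rough sketch under two assumptions: the five host names that appear in the listings are the only compute hosts, and weighing is ignored (the first passing host is picked), so the host order differs from the real result. `schedule` and `anti_affinity_passes` are hypothetical helper names.

```python
# Assumed to be the complete set of compute hosts in this environment.
HOSTS = ["osdev-01", "osdev-02", "osdev-03", "osdev-ceph", "osdev-gpu"]


def anti_affinity_passes(host, group_hosts):
    """A host passes unless it already runs a member of the group."""
    if group_hosts:
        return host not in group_hosts
    return True


def schedule(names):
    """Place instances one by one; None means no valid host was found."""
    group_hosts = set()
    placement = {}
    for name in names:
        candidates = [h for h in HOSTS
                      if anti_affinity_passes(h, group_hosts)]
        if not candidates:
            placement[name] = None  # the instance would end up in ERROR
            continue
        chosen = candidates[0]  # weighing is not modeled here
        group_hosts.add(chosen)
        placement[name] = chosen
    return placement


if __name__ == "__main__":
    names = [f"server.aa{i}" for i in range(1, 9)]
    for name, host in schedule(names).items():
        print(name, host)
```

Five instances are placed on five distinct hosts, and the remaining three find no valid host.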

####Affinity scheduling

  • Create a server group:
$ openstack server group create --policy affinity group-affinity
+----------+--------------------------------------+
| Field    | Value                                |
+----------+--------------------------------------+
| id       | d44f51d3-676e-4bca-ac56-76e998da9467 |
| members  |                                      |
| name     | group-affinity                       |
| policies | affinity                             |
+----------+--------------------------------------+
  • Create the virtual machines:
$ openstack server create --image cirros --flavor m1.tiny --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 --hint group=d44f51d3-676e-4bca-ac56-76e998da9467 server.a1

$ openstack server create --image cirros --flavor m1.tiny --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 --hint group=d44f51d3-676e-4bca-ac56-76e998da9467 server.a2

$ openstack server create --image cirros --flavor m1.tiny --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 --hint group=d44f51d3-676e-4bca-ac56-76e998da9467 server.a3
  • Check the scheduling result:
$ openstack server list --long --column "Name" --column "Status" --column "Networks" --column "Availability Zone" --column "Host"
+------------------+--------+--------------------+-------------------+------------+
| Name             | Status | Networks           | Availability Zone | Host       |
+------------------+--------+--------------------+-------------------+------------+
| server.a3        | ACTIVE | demo-net=10.0.0.24 | az02              | osdev-gpu  |
| server.a2        | ACTIVE | demo-net=10.0.0.20 | az02              | osdev-gpu  |
| server.a1        | ACTIVE | demo-net=10.0.0.19 | az02              | osdev-gpu  |
+------------------+--------+--------------------+-------------------+------------+
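
The affinity result can be understood with a small model of the decision that the two server group filters share: a filter is skipped unless its policy is set on the group, the first member may land anywhere, and later members are constrained by the hosts the group already occupies. This is a hedged sketch, not Nova code; `group_filter_passes` and the dict-based group are hypothetical, and the special case for move operations is left out.

```python
def group_filter_passes(policy_name, host, group):
    """Model of a server group filter; group is None or a dict."""
    policies = group["policies"] if group else []
    if policy_name not in policies:
        return True  # the filter does not apply to this request
    group_hosts = group["hosts"] if group else []
    if not group_hosts:
        return True  # first member of the group: any host is acceptable
    if policy_name == "affinity":
        return host in group_hosts
    return host not in group_hosts  # anti-affinity


if __name__ == "__main__":
    hosts = ["osdev-01", "osdev-02", "osdev-03", "osdev-ceph", "osdev-gpu"]
    group = {"policies": ["affinity"], "hosts": []}

    # server.a1: the group is empty, so every host passes.
    print([h for h in hosts if group_filter_passes("affinity", h, group)])

    # Assume server.a1 landed on osdev-gpu, as in the listing above.
    group["hosts"].append("osdev-gpu")
    print([h for h in hosts if group_filter_passes("affinity", h, group)])
```

Once the first member is on osdev-gpu, that host is the only one left for the rest of the group, which matches the listing.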

###Related source code

  • Checking whether an instance runs on a given host (nova/scheduler/filters/utils.py):
def instance_uuids_overlap(host_state, uuids):
    """Tests for overlap between a host_state and a list of uuids.

    Returns True if any of the supplied uuids match any of the instance.uuid
    values in the host_state.
    """
    if isinstance(uuids, six.string_types):
        uuids = [uuids]
    set_uuids = set(uuids)
    # host_state.instances is a dict whose keys are the instance uuids
    host_uuids = set(host_state.instances.keys())
    return bool(host_uuids.intersection(set_uuids))
  • Same-host filter (nova/scheduler/filters/affinity_filter.py):
class SameHostFilter(filters.BaseHostFilter):
    """Schedule the instance on the same host as another instance in a set of
    instances.
    """
    # The hosts the instances are running on doesn't change within a request
    run_filter_once_per_request = True

    RUN_ON_REBUILD = False

    def host_passes(self, host_state, spec_obj):
        affinity_uuids = spec_obj.get_scheduler_hint('same_host')
        if affinity_uuids:
            overlap = utils.instance_uuids_overlap(host_state, affinity_uuids)
            return overlap
        # With no same_host key
        return True
  • Different-host filter (nova/scheduler/filters/affinity_filter.py):
class DifferentHostFilter(filters.BaseHostFilter):
    """Schedule the instance on a different host from a set of instances."""
    # The hosts the instances are running on doesn't change within a request
    run_filter_once_per_request = True

    RUN_ON_REBUILD = False

    def host_passes(self, host_state, spec_obj):
        affinity_uuids = spec_obj.get_scheduler_hint('different_host')
        if affinity_uuids:
            overlap = utils.instance_uuids_overlap(host_state, affinity_uuids)
            return not overlap
        # With no different_host key
        return True
  • Anti-affinity filter source (nova/scheduler/filters/affinity_filter.py):
class _GroupAntiAffinityFilter(filters.BaseHostFilter):
    """Schedule the instance on a different host from a set of group
    hosts.
    """

    RUN_ON_REBUILD = False

    def host_passes(self, host_state, spec_obj):
        # Only invoke the filter if 'anti-affinity' is configured
        policies = (spec_obj.instance_group.policies
                    if spec_obj.instance_group else [])
        if self.policy_name not in policies:
            return True
        # NOTE(hanrong): Move operations like resize can check the same source
        # compute node where the instance is. That case, AntiAffinityFilter
        # must not return the source as a non-possible destination.
        if spec_obj.instance_uuid in host_state.instances.keys():
            return True

        group_hosts = (spec_obj.instance_group.hosts
                       if spec_obj.instance_group else [])
        LOG.debug("Group anti affinity: check if %(host)s not "
                  "in %(configured)s", {'host': host_state.host,
                                        'configured': group_hosts})
        if group_hosts:
            return host_state.host not in group_hosts

        # No groups configured
        return True


class ServerGroupAntiAffinityFilter(_GroupAntiAffinityFilter):
    def __init__(self):
        self.policy_name = 'anti-affinity'
        super(ServerGroupAntiAffinityFilter, self).__init__()

A host that already runs a member of the group cannot receive another instance from the same group.

  • Affinity filter source (nova/scheduler/filters/affinity_filter.py):
class _GroupAffinityFilter(filters.BaseHostFilter):
    """Schedule the instance on to host from a set of group hosts.
    """

    RUN_ON_REBUILD = False

    def host_passes(self, host_state, spec_obj):
        # Only invoke the filter if 'affinity' is configured
        policies = (spec_obj.instance_group.policies
                    if spec_obj.instance_group else [])
        if self.policy_name not in policies:
            return True

        group_hosts = (spec_obj.instance_group.hosts
                       if spec_obj.instance_group else [])
        LOG.debug("Group affinity: check if %(host)s in "
                  "%(configured)s", {'host': host_state.host,
                                     'configured': group_hosts})
        if group_hosts:
            return host_state.host in group_hosts

        # No groups configured
        return True


class ServerGroupAffinityFilter(_GroupAffinityFilter):
    def __init__(self):
        self.policy_name = 'affinity'
        super(ServerGroupAffinityFilter, self).__init__()

##NUMA binding

###Parameter configuration

  • Add metadata (extra specs) to the flavor by setting the following keys:
hw:numa_nodes=N                         - number of NUMA nodes in the VM

hw:numa_mempolicy=preferred|strict      - NUMA memory allocation policy of the VM

hw:numa_cpus.0=<cpu-list>               - vCPUs placed on guest NUMA node 0

hw:numa_cpus.1=<cpu-list>               - vCPUs placed on guest NUMA node 1

hw:numa_mem.0=<ram-size>                - RAM (MB) assigned to guest NUMA node 0

hw:numa_mem.1=<ram-size>                - RAM (MB) assigned to guest NUMA node 1
  • Add metadata to the image (image properties) by setting the following keys:
hw_numa_nodes=N                        - number of NUMA nodes to expose to the guest.

hw_numa_mempolicy=preferred|strict     - memory allocation policy

hw_numa_cpus.0=<cpu-list>              - mapping of vCPUS N-M to NUMA node 0

hw_numa_cpus.1=<cpu-list>              - mapping of vCPUS N-M to NUMA node 1

hw_numa_mem.0=<ram-size>               - mapping N MB of RAM to NUMA node 0

hw_numa_mem.1=<ram-size>               - mapping N MB of RAM to NUMA node 1
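  • The fields have the following meanings:
numa_nodes: the number of NUMA nodes in the virtual machine;

numa_cpus.0: the IDs of the vCPUs placed on guest NUMA node 0, in a format such as "1-4,6". If you specify the layout yourself, you must give the CPU assignment for every guest NUMA node, and the total number of CPUs across all NUMA nodes must equal the vcpus value of the flavor;

numa_mem.0: the amount of memory on guest NUMA node 0, in MB. If you specify the layout yourself, you must give the memory size for every guest NUMA node, and the total memory across all NUMA nodes must equal the memory_mb value of the flavor.

Note:

N is the index of a guest NUMA node and does not necessarily correspond to a host NUMA node. For example, on a platform with two NUMA nodes, the scheduler may place guest NUMA node 0, as referenced by hw:numa_mem.0, on host NUMA node 1, and vice versa. Similarly, the numbers in FLAVOR-CORES are guest vCPU numbers and do not correspond to host CPUs. This feature therefore cannot be used to constrain which host CPUs or host NUMA nodes an instance is placed on.

Warning:

If the value of hw:numa_cpus.N or hw:numa_mem.N exceeds the available CPUs or memory, an error is raised.
  • Constraints for automatic NUMA allocation:
1. numa_cpus and numa_mem must not be set;

2. Resources are distributed evenly across the nodes, starting from node 0.
  • Constraints for a manually specified NUMA layout:
1. The total number of CPUs specified must equal the number of vCPUs in the flavor;

2. The total memory specified must equal the memory in the flavor;

3. numa_cpus and numa_mem must both be set;

4. The resources of each NUMA node must be specified, with nodes numbered from 0.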
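
The manual-layout rules can be checked mechanically before a flavor is created. The helper below is a minimal sketch of such a check, not part of Nova: `parse_cpu_list` and `validate_manual_numa` are hypothetical names, and the rejection of a vCPU listed on two nodes is an extra assumption that the rules above do not state explicitly.

```python
def parse_cpu_list(spec):
    """Expand a CPU list such as "1-4,6" into a set of vCPU IDs."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = part.split("-")
            cpus.update(range(int(start), int(end) + 1))
        else:
            cpus.add(int(part))
    return cpus


def validate_manual_numa(extra_specs, vcpus, memory_mb):
    """Check flavor extra specs against the manual NUMA rules listed above.

    Returns a list of (sorted vCPU IDs, memory in MB), one entry per node.
    """
    nodes = int(extra_specs["hw:numa_nodes"])
    seen_cpus = set()
    total_mem = 0
    layout = []
    for n in range(nodes):  # rule 4: nodes are numbered from 0
        cpu_key = f"hw:numa_cpus.{n}"
        mem_key = f"hw:numa_mem.{n}"
        if cpu_key not in extra_specs or mem_key not in extra_specs:
            raise ValueError(f"{cpu_key} and {mem_key} must both be set")
        cpus = parse_cpu_list(extra_specs[cpu_key])
        if cpus & seen_cpus:  # assumption: a vCPU belongs to one node only
            raise ValueError(f"vCPUs listed twice: {sorted(cpus & seen_cpus)}")
        seen_cpus |= cpus
        mem = int(extra_specs[mem_key])
        total_mem += mem
        layout.append((sorted(cpus), mem))
    if len(seen_cpus) != vcpus:  # rule 1
        raise ValueError(f"{len(seen_cpus)} vCPUs assigned, flavor has {vcpus}")
    if total_mem != memory_mb:  # rule 2
        raise ValueError(f"{total_mem} MB assigned, flavor has {memory_mb} MB")
    return layout


if __name__ == "__main__":
    specs = {
        "hw:numa_nodes": "2",
        "hw:numa_cpus.0": "0-3", "hw:numa_mem.0": "4096",
        "hw:numa_cpus.1": "4,5", "hw:numa_mem.1": "2048",
    }
    print(validate_manual_numa(specs, vcpus=6, memory_mb=6144))
```

The example flavor values (6 vCPUs, 6144 MB) are made up for illustration and do not come from the environment described here.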

###Filter configuration

Add the NUMATopologyFilter filter:

$ vi /etc/kolla/nova-scheduler/nova.conf
[DEFAULT]
...
scheduler_default_filters = NUMATopologyFilter, ServerGroupAntiAffinityFilter, ServerGroupAffinityFilter, SameHostFilter, DifferentHostFilter, AggregateInstanceExtraSpecsFilter, RetryFilter, AvailabilityZoneFilter, RamFilter, DiskFilter, ComputeFilter, ComputeCapabilitiesFilter, ImagePropertiesFilter
...

$ docker restart nova_scheduler

###Inspecting the CPUs

####Global information

  • View with libvirt:
$ virsh nodeinfo
CPU model:           x86_64
CPU(s):              72
CPU frequency:       1963 MHz
CPU socket(s):       1
Core(s) per socket:  18
Thread(s) per core:  2
NUMA cell(s):        2
Memory size:         401248288 KiB
  • View with numactl (requires the numactl package):
$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
node 0 size: 195236 MB
node 0 free: 153414 MB
node 1 cpus: 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 1 size: 196608 MB
node 1 free: 167539 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10

$ numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 
cpubind: 0 1 
nodebind: 0 1 
membind: 0 1
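
When the host layout has to be consumed by a script, for instance to look up which NUMA node a CPU belongs to, the `numactl --hardware` output can be parsed directly. The parser below is an illustrative sketch; `parse_numactl_hardware` and `node_of_cpu` are hypothetical helpers, and `SAMPLE` is copied from the output above.

```python
import re

SAMPLE = """\
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
node 0 size: 195236 MB
node 0 free: 153414 MB
node 1 cpus: 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 1 size: 196608 MB
node 1 free: 167539 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10
"""


def parse_numactl_hardware(text):
    """Return {node: {"cpus": [...], "size_mb": int, "free_mb": int}}."""
    nodes = {}
    for line in text.splitlines():
        # Only "node <n> cpus|size|free: ..." lines carry per-node data.
        match = re.match(r"node (\d+) (cpus|size|free): (.*)", line)
        if not match:
            continue
        node, field, value = int(match.group(1)), match.group(2), match.group(3)
        entry = nodes.setdefault(node, {})
        if field == "cpus":
            entry["cpus"] = [int(c) for c in value.split()]
        else:
            entry[field + "_mb"] = int(value.split()[0])
    return nodes


def node_of_cpu(nodes, cpu):
    """Return the NUMA node that owns the given host CPU, or None."""
    for node, info in nodes.items():
        if cpu in info.get("cpus", []):
            return node
    return None


if __name__ == "__main__":
    topology = parse_numactl_hardware(SAMPLE)
    for node, info in sorted(topology.items()):
        print(node, len(info["cpus"]), info["size_mb"], info["free_mb"])
    print(node_of_cpu(topology, 40))
```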
  • View NUMA memory allocation statistics (the per-node counters are also available from cat /sys/devices/system/node/node0/numastat):
$ numastat
                           node0           node1
numa_hit             25692061493     30097928824
numa_miss                      0               0
numa_foreign                   0               0
interleave_hit            110618          109691
local_node           25689725277     30096685088
other_node               2336216         1243736

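numa_hit: the number of allocations intended for this node that were satisfied from this node;

numa_miss: the number of allocations intended for another node that were satisfied from this node instead (a high value suggests that the allocation policy needs tuning);

numa_foreign: the number of allocations intended for this node that were satisfied from another node instead;

interleave_hit: the number of interleave-policy allocations intended for this node that were satisfied from this node;

local_node: the number of allocations satisfied from this node for processes running on this node;

other_node: the number of allocations satisfied from this node for processes running on another node.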
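
To judge whether the counters point to a problem, it helps to turn them into ratios. The snippet below is a small sketch using the numbers from the numastat output above; `miss_ratio` and `remote_ratio` are hypothetical helper names, and no particular threshold for "too high" is implied.

```python
# Counters copied from the numastat output above.
STATS = {
    "node0": {"numa_hit": 25692061493, "numa_miss": 0, "numa_foreign": 0,
              "interleave_hit": 110618, "local_node": 25689725277,
              "other_node": 2336216},
    "node1": {"numa_hit": 30097928824, "numa_miss": 0, "numa_foreign": 0,
              "interleave_hit": 109691, "local_node": 30096685088,
              "other_node": 1243736},
}


def miss_ratio(counters):
    """Share of this node's allocations that were meant for another node."""
    total = counters["numa_hit"] + counters["numa_miss"]
    return counters["numa_miss"] / total if total else 0.0


def remote_ratio(counters):
    """Share of this node's allocations made for processes on other nodes."""
    total = counters["local_node"] + counters["other_node"]
    return counters["other_node"] / total if total else 0.0


if __name__ == "__main__":
    for node, counters in STATS.items():
        print(f"{node}: miss ratio {miss_ratio(counters):.6f}, "
              f"remote ratio {remote_ratio(counters):.6f}")
```

In this sample numa_miss is 0 on both nodes, and local_node plus other_node adds up exactly to numa_hit, so only a very small share of allocations served processes running on the other node.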

  • View the load of each CPU (requires the sysstat package):
$ mpstat -P ALL
Linux 3.10.0-693.17.1.el7.x86_64 (osdev-01) 	2018年03月19日 	_x86_64_	(72 CPU)

16時28分12秒  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
16時28分12秒  all    2.38    8.79    0.70    0.00    0.00    0.01    0.00    0.00    0.00   88.11
16時28分12秒    0    8.83    3.29    1.50    0.00    0.00    0.30    0.00    0.00    0.00   86.09
16時28分12秒    1    7.54    2.88    1.29    0.00    0.00    0.05    0.00    0.00    0.00   88.24
16時28分12秒    2    7.60    2.89    1.30    0.00    0.00    0.03    0.00    0.00    0.00   88.18
16時28分12秒    3    6.97    2.93    1.20    0.00    0.00    0.02    0.00    0.00    0.00   88.87
16時28分12秒    4    3.91    5.84    0.87    0.00    0.00    0.02    0.00    0.00    0.00   89.36
16時28分12秒    5    3.95    5.64    0.88    0.00    0.00    0.01    0.00    0.00    0.00   89.52
16時28分12秒    6    3.14    7.24    0.80    0.00    0.00    0.01    0.00    0.00    0.00   88.80
16時28分12秒    7    2.38    8.41    0.74    0.01    0.00    0.01    0.00    0.00    0.00   88.46
16時28分12秒    8    2.34    9.37    0.76    0.01    0.00    0.01    0.00    0.00    0.00   87.53
16時28分12秒    9    2.19    9.73    0.75    0.01    0.00    0.01    0.00    0.00    0.00   87.32
16時28分12秒   10    2.12   10.09    0.75    0.01    0.00    0.01    0.00    0.00    0.00   87.03
16時28分12秒   11    2.10   10.44    0.74    0.01    0.00    0.01    0.00    0.00    0.00   86.71
16時28分12秒   12    2.04   10.72    0.75    0.01    0.00    0.01    0.00    0.00    0.00   86.48
16時28分12秒   13    2.00   11.14    0.76    0.01    0.00    0.01    0.00    0.00    0.00   86.10
16時28分12秒   14    2.00   11.52    0.76    0.01    0.00    0.01    0.00    0.00    0.00   85.70
16時28分12秒   15    1.97   11.80    0.76    0.01    0.00    0.01    0.00    0.00    0.00   85.45
16時28分12秒   16    1.97   12.03    0.76    0.01    0.00    0.01    0.00    0.00    0.00   85.22
16時28分12秒   17    1.96   12.20    0.76    0.01    0.00    0.01    0.00    0.00    0.00   85.06
16時28分12秒   18    2.17   17.85    0.97    0.01    0.00    0.00    0.00    0.00    0.00   79.00
16時28分12秒   19    2.55   17.04    0.99    0.01    0.00    0.00    0.00    0.00    0.00   79.41
16時28分12秒   20    2.72   17.36    1.07    0.01    0.00    0.00    0.00    0.00    0.00   78.84
16時28分12秒   21    2.61   17.27    1.03    0.01    0.00    0.00    0.00    0.00    0.00   79.07
16時28分12秒   22    2.37   17.39    1.00    0.01    0.00    0.00    0.00    0.00    0.00   79.23
16時28分12秒   23    2.16   17.50    0.99    0.01    0.00    0.00    0.00    0.00    0.00   79.35
16時28分12秒   24    2.05   17.52    0.98    0.01    0.00    0.00    0.00    0.00    0.00   79.44
16時28分12秒   25    1.98   17.52    0.97    0.01    0.00    0.00    0.00    0.00    0.00   79.53
16時28分12秒   26    1.93   17.47    0.97    0.01    0.00    0.00    0.00    0.00    0.00   79.62
16時28分12秒   27    1.89   17.50    0.97    0.01    0.00    0.00    0.00    0.00    0.00   79.63
16時28分12秒   28    1.85   17.51    0.96    0.01    0.00    0.00    0.00    0.00    0.00   79.66
16時28分12秒   29    1.83   17.50    0.96    0.01    0.00    0.00    0.00    0.00    0.00   79.70
16時28分12秒   30    1.82   17.46    0.96    0.01    0.00    0.00    0.00    0.00    0.00   79.76
16時28分12秒   31    1.78   17.42    0.95    0.01    0.00    0.00    0.00    0.00    0.00   79.83
16時28分12秒   32    1.79   17.36    0.94    0.01    0.00    0.00    0.00    0.00    0.00   79.90
16時28分12秒   33    1.77   17.34    0.94    0.01    0.00    0.00    0.00    0.00    0.00   79.94
16時28分12秒   34    1.75   17.33    0.94    0.01    0.00    0.00    0.00    0.00    0.00   79.97
16時28分12秒   35    1.73   17.30    0.94    0.01    0.00    0.00    0.00    0.00    0.00   80.02
16時28分12秒   36    4.90    3.96    0.61    0.00    0.00    0.00    0.00    0.00    0.00   90.53
16時28分12秒   37    9.35    6.24    1.46    0.00    0.00    0.00    0.00    0.00    0.00   82.95
16時28分12秒   38    5.90    4.43    0.87    0.00    0.00    0.00    0.00    0.00    0.00   88.79
16時28分12秒   39    6.10    3.83    0.81    0.00    0.00    0.00    0.00    0.00    0.00   89.25
16時28分12秒   40    2.95    3.55    0.53    0.00    0.00    0.00    0.00    0.00    0.00   92.95
16時28分12秒   41    2.28    3.48    0.44    0.00    0.00    0.00    0.00    0.00    0.00   93.79
16時28分12秒   42    1.44    3.34    0.35    0.00    0.00    0.00    0.00    0.00    0.00   94.87
16時28分12秒   43    0.94    3.24    0.30    0.00    0.00    0.00    0.00    0.00    0.00   95.52
16時28分12秒   44    0.89    3.41    0.29    0.00    0.00    0.00    0.00    0.00    0.00   95.41
16時28分12秒   45    0.86    3.43    0.28    0.00    0.00    0.00    0.00    0.00    0.00   95.42
16時28分12秒   46    0.83    3.44    0.28    0.00    0.00    0.00    0.00    0.00    0.00   95.45
16時28分12秒   47    0.84    3.43    0.27    0.00    0.00    0.00    0.00    0.00    0.00   95.46
16時28分12秒   48    0.83    3.50    0.27    0.00    0.00    0.00    0.00    0.00    0.00   95.40
16時28分12秒   49    0.88    3.31    0.27    0.00    0.00    0.00    0.00    0.00    0.00   95.53
16時28分12秒   50    0.91    3.16    0.27    0.00    0.00    0.00    0.00    0.00    0.00   95.65
16時28分12秒   51    0.90    3.15    0.27    0.00    0.00    0.00    0.00    0.00    0.00   95.68
16時28分12秒   52    0.89    3.19    0.27    0.00    0.00    0.00    0.00    0.00    0.00   95.65
16時28分12秒   53    0.91    3.40    0.28    0.00    0.00    0.00    0.00    0.00    0.00   95.40
16時28分12秒   54    1.00    5.91    0.37    0.00    0.00    0.00    0.00    0.00    0.00   92.71
16時28分12秒   55    5.70    7.12    1.16    0.00    0.00    0.00    0.00    0.00    0.00   86.01
16時28分12秒   56    2.49    6.33    0.69    0.00    0.00    0.00    0.00    0.00    0.00   90.49
16時28分12秒   57    2.93    6.12    0.66    0.00    0.00    0.00    0.00    0.00    0.00   90.29
16時28分12秒   58    2.15    6.06    0.60    0.00    0.00    0.00    0.00    0.00    0.00   91.19
16時28分12秒   59    1.52    6.15    0.56    0.00    0.00    0.00    0.00    0.00    0.00   91.77
16時28分12秒   60    1.23    6.25    0.48    0.00    0.00    0.00    0.00    0.00    0.00   92.03
16時28分12秒   61    1.11    5.68    0.42    0.00    0.00    0.00    0.00    0.00    0.00   92.79
16時28分12秒   62    1.03    5.67    0.41    0.00    0.00    0.00    0.00    0.00    0.00   92.89
16時28分12秒   63    1.01    5.72    0.39    0.00    0.00    0.00    0.00    0.00    0.00   92.88
16時28分12秒   64    0.96    5.62    0.38    0.00    0.00    0.00    0.00    0.00    0.00   93.02
16時28分12秒   65    0.94    5.50    0.37    0.00    0.00    0.00    0.00    0.00    0.00   93.18
16時28分12秒   66    0.93    5.50    0.37    0.00    0.00    0.00    0.00    0.00    0.00   93.20
16時28分12秒   67    0.92    5.53    0.37    0.00    0.00    0.00    0.00    0.00    0.00   93.17
16時28分12秒   68    0.91    5.52    0.36    0.00    0.00    0.00    0.00    0.00    0.00   93.21
16時28分12秒   69    0.91    5.54    0.36    0.00    0.00    0.00    0.00    0.00    0.00   93.18
16時28分12秒   70    0.89    5.61    0.36    0.00    0.00    0.00    0.00    0.00    0.00   93.13
16時28分12秒   71    0.89    5.94    0.37    0.00    0.00    0.00    0.00    0.00    0.00   92.79

####Individual items

  • Count the sockets:
$ grep 'physical id' /proc/cpuinfo | awk -F: '{print $2 | "sort -un"}' | wc -l
2

There are 2 sockets in total.

  • Count the logical processors in each socket:
$ grep 'physical id' /proc/cpuinfo | awk -F: '{print $2}' | sort | uniq -c
     36  0
     36  1

Each socket has 36 logical processors.

  • Count the cores in each socket (the distinct core IDs plus the reported "cpu cores" value):
$ grep 'core' /proc/cpuinfo | sort -u
core id		: 0
core id		: 1
core id		: 10
core id		: 11
core id		: 16
core id		: 17
core id		: 18
core id		: 19
core id		: 2
core id		: 20
core id		: 24
core id		: 25
core id		: 26
core id		: 27
core id		: 3
core id		: 4
core id		: 8
core id		: 9
cpu cores	: 18

Each socket contains 18 cores, and each core carries 2 logical processors.
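
The three checks above can also be done in one pass. The following is a minimal Python sketch, assuming the usual /proc/cpuinfo layout (one blank-line-separated block per logical processor with `physical id` and `core id` fields). The function name `summarize_cpuinfo` and the small `SAMPLE` text are made up for illustration; on a real host the text would come from reading /proc/cpuinfo.

```python
def summarize_cpuinfo(text):
    """Count logical processors and distinct cores for every socket."""
    processors = {}  # socket id -> number of logical processors
    cores = {}       # socket id -> set of core ids seen on that socket
    # /proc/cpuinfo holds one block per logical processor, separated by blank lines.
    for block in text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "physical id" not in fields:
            continue  # some platforms do not report sockets; skip such blocks
        socket_id = fields["physical id"]
        processors[socket_id] = processors.get(socket_id, 0) + 1
        cores.setdefault(socket_id, set()).add(fields.get("core id"))
    return {s: {"processors": processors[s], "cores": len(cores[s])}
            for s in processors}


# Hypothetical sample: 2 sockets, each with 2 cores and 2 threads per core.
SAMPLE = "\n\n".join(
    f"processor\t: {p}\nphysical id\t: {s}\ncore id\t\t: {c}"
    for p, (s, c) in enumerate([(0, 0), (0, 1), (1, 0), (1, 1),
                                (0, 0), (0, 1), (1, 0), (1, 1)])
)
summary = summarize_cpuinfo(SAMPLE)
print(summary)
```

On the host examined here, the same function should report 36 processors and 18 cores for each of the two sockets, matching the shell commands above.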

####Topology analysis

  • CPU topology analysis script:
#!/bin/bash

# Simple print cpu topology
# Author: kodango

function get_nr_processor()
{
    grep '^processor' /proc/cpuinfo | wc -l
}

function get_nr_socket()
{
    grep 'physical id' /proc/cpuinfo | awk -F: '{
            print $2 | "sort -un"}' | wc -l
}

function get_nr_siblings()
{
    grep 'siblings' /proc/cpuinfo | awk -F: '{
            print $2 | "sort -un"}'
}

function get_nr_cores_of_socket()
{
    grep 'cpu cores' /proc/cpuinfo | awk -F: '{
            print $2 | "sort -un"}'
}

echo '===== CPU Topology Table ====='
echo

echo '+--------------+---------+-----------+'
echo '| Processor ID | Core ID | Socket ID |'
echo '+--------------+---------+-----------+'

while read line; do
    if [ -z "$line" ]; then
        printf '| %-12s | %-7s | %-9s |\n' "$p_id" "$c_id" "$s_id"
        echo '+--------------+---------+-----------+'
        continue
    fi

    if echo "$line" | grep -q "^processor"; then
        p_id=`echo "$line" | awk -F: '{print $2}' | tr -d ' '` 
    fi

    if echo "$line" | grep -q "^core id"; then
        c_id=`echo "$line" | awk -F: '{print $2}' | tr -d ' '` 
    fi

    if echo "$line" | grep -q "^physical id"; then
        s_id=`echo "$line" | awk -F: '{print $2}' | tr -d ' '` 
    fi
done < /proc/cpuinfo

echo

awk -F: '{ 
    if ($1 ~ /processor/) {
        gsub(/ /,"",$2);
        p_id=$2;
    } else if ($1 ~ /physical id/){
        gsub(/ /,"",$2);
        s_id=$2;
        arr[s_id]=arr[s_id] " " p_id
    }
} 

END{
    for (i in arr) 
        printf "Socket %s:%s\n", i, arr[i];
}' /proc/cpuinfo

echo
echo '===== CPU Info Summary ====='
echo

nr_processor=`get_nr_processor`
echo "Logical processors: $nr_processor"

nr_socket=`get_nr_socket`
echo "Physical socket: $nr_socket"

nr_siblings=`get_nr_siblings`
echo "Siblings in one socket: $nr_siblings"

nr_cores=`get_nr_cores_of_socket`
echo "Cores in one socket: $nr_cores"

let nr_cores*=nr_socket
echo "Cores in total: $nr_cores"

if [ "$nr_cores" = "$nr_processor" ]; then
    echo "Hyper-Threading: off"
else
    echo "Hyper-Threading: on"
fi

echo
echo '===== END ====='
  • Run the CPU topology analysis script:
$ ./cpu_view.sh
===== CPU Topology Table =====

+--------------+---------+-----------+
| Processor ID | Core ID | Socket ID |
+--------------+---------+-----------+
| 0            | 0       | 0         |
+--------------+---------+-----------+
| 1            | 1       | 0         |
+--------------+---------+-----------+
| 2            | 2       | 0         |
+--------------+---------+-----------+
| 3            | 3       | 0         |
+--------------+---------+-----------+
| 4            | 4       | 0         |
+--------------+---------+-----------+
| 5            | 8       | 0         |
+--------------+---------+-----------+
| 6            | 9       | 0         |
+--------------+---------+-----------+
| 7            | 10      | 0         |
+--------------+---------+-----------+
| 8            | 11      | 0         |
+--------------+---------+-----------+
| 9            | 16      | 0         |
+--------------+---------+-----------+
| 10           | 17      | 0         |
+--------------+---------+-----------+
| 11           | 18      | 0         |
+--------------+---------+-----------+
| 12           | 19      | 0         |
+--------------+---------+-----------+
| 13           | 20      | 0         |
+--------------+---------+-----------+
| 14           | 24      | 0         |
+--------------+---------+-----------+
| 15           | 25      | 0         |
+--------------+---------+-----------+
| 16           | 26      | 0         |
+--------------+---------+-----------+
| 17           | 27      | 0         |
+--------------+---------+-----------+
| 18           | 0       | 1         |
+--------------+---------+-----------+
| 19           | 1       | 1         |
+--------------+---------+-----------+
| 20           | 2       | 1         |
+--------------+---------+-----------+
| 21           | 3       | 1         |
+--------------+---------+-----------+
| 22           | 4       | 1         |
+--------------+---------+-----------+
| 23           | 8       | 1         |
+--------------+---------+-----------+
| 24           | 9       | 1         |
+--------------+---------+-----------+
| 25           | 10      | 1         |
+--------------+---------+-----------+
| 26           | 11      | 1         |
+--------------+---------+-----------+
| 27           | 16      | 1         |
+--------------+---------+-----------+
| 28           | 17      | 1         |
+--------------+---------+-----------+
| 29           | 18      | 1         |
+--------------+---------+-----------+
| 30           | 19      | 1         |
+--------------+---------+-----------+
| 31           | 20      | 1         |
+--------------+---------+-----------+
| 32           | 24      | 1         |
+--------------+---------+-----------+
| 33           | 25      | 1         |
+--------------+---------+-----------+
| 34           | 26      | 1         |
+--------------+---------+-----------+
| 35           | 27      | 1         |
+--------------+---------+-----------+
| 36           | 0       | 0         |
+--------------+---------+-----------+
| 37           | 1       | 0         |
+--------------+---------+-----------+
| 38           | 2       | 0         |
+--------------+---------+-----------+
| 39           | 3       | 0         |
+--------------+---------+-----------+
| 40           | 4       | 0         |
+--------------+---------+-----------+
| 41           | 8       | 0         |
+--------------+---------+-----------+
| 42           | 9       | 0         |
+--------------+---------+-----------+
| 43           | 10      | 0         |
+--------------+---------+-----------+
| 44           | 11      | 0         |
+--------------+---------+-----------+
| 45           | 16      | 0         |
+--------------+---------+-----------+
| 46           | 17      | 0         |
+--------------+---------+-----------+
| 47           | 18      | 0         |
+--------------+---------+-----------+
| 48           | 19      | 0         |
+--------------+---------+-----------+
| 49           | 20      | 0         |
+--------------+---------+-----------+
| 50           | 24      | 0         |
+--------------+---------+-----------+
| 51           | 25      | 0         |
+--------------+---------+-----------+
| 52           | 26      | 0         |
+--------------+---------+-----------+
| 53           | 27      | 0         |
+--------------+---------+-----------+
| 54           | 0       | 1         |
+--------------+---------+-----------+
| 55           | 1       | 1         |
+--------------+---------+-----------+
| 56           | 2       | 1         |
+--------------+---------+-----------+
| 57           | 3       | 1         |
+--------------+---------+-----------+
| 58           | 4       | 1         |
+--------------+---------+-----------+
| 59           | 8       | 1         |
+--------------+---------+-----------+
| 60           | 9       | 1         |
+--------------+---------+-----------+
| 61           | 10      | 1         |
+--------------+---------+-----------+
| 62           | 11      | 1         |
+--------------+---------+-----------+
| 63           | 16      | 1         |
+--------------+---------+-----------+
| 64           | 17      | 1         |
+--------------+---------+-----------+
| 65           | 18      | 1         |
+--------------+---------+-----------+
| 66           | 19      | 1         |
+--------------+---------+-----------+
| 67           | 20      | 1         |
+--------------+---------+-----------+
| 68           | 24      | 1         |
+--------------+---------+-----------+
| 69           | 25      | 1         |
+--------------+---------+-----------+
| 70           | 26      | 1         |
+--------------+---------+-----------+
| 71           | 27      | 1         |
+--------------+---------+-----------+

Socket 0: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
Socket 1: 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71

===== CPU Info Summary =====

Logical processors: 72
Physical socket: 2
Siblings in one socket:  36
Cores in one socket:  18
Cores in total: 36
Hyper-Threading: on

===== END =====
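
The topology table also shows which logical processors are hyper-thread siblings, which matters when deciding what to pin together. The sketch below is an illustration only: the helper name `thread_siblings` is made up, and the 72 rows are rebuilt from the numbering pattern visible in the table (18 core IDs per socket, sockets alternating every 18 processors) rather than read from the host.

```python
from collections import defaultdict


def thread_siblings(rows):
    """Group logical processors that share the same (socket, core) pair."""
    groups = defaultdict(list)
    for processor, core, socket in rows:
        groups[(socket, core)].append(processor)
    return dict(groups)


# Core IDs as they appear in the table (the numbering has gaps).
CORE_IDS = [0, 1, 2, 3, 4, 8, 9, 10, 11, 16, 17, 18, 19, 20, 24, 25, 26, 27]
# Rows of (processor id, core id, socket id), following the table's pattern.
rows = [(p, CORE_IDS[p % 18], (p // 18) % 2) for p in range(72)]

siblings = thread_siblings(rows)
print(siblings[(0, 0)])   # processors on socket 0, core 0
print(siblings[(1, 27)])  # processors on socket 1, core 27
```

Every (socket, core) pair ends up with exactly two processors whose IDs differ by 36, for example 0 and 36, which is consistent with the "Hyper-Threading: on" line in the summary.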

Pinning tests

####Creating an ordinary VM

  • Create a flavor for NUMA testing:
$ openstack flavor create --vcpus 2 --ram 64 --disk 1 machine.numa
  • Create an ordinary VM:
$ openstack server create --image cirros --flavor machine.numa --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 --availability-zone az01:osdev-01 server.numa1
  • Inspect the libvirt configuration:
$ openstack server show server.numa1 | grep instance_name | awk '{print $4}'
$ virsh edit instance-0000001f
...
  <memory unit='KiB'>65536</memory>
  <currentMemory unit='KiB'>65536</currentMemory>
  <vcpu placement='static'>2</vcpu>
  <cputune>
    <shares>2048</shares>
  </cputune>
...
  <cpu mode='host-model' check='partial'>
    <model fallback='allow'/>
    <topology sockets='2' cores='1' threads='1'/>
  </cpu>
...
  • Check CPU affinity and where the threads are running:
$ ps -aux | grep `openstack server show server.numa1 | grep instance_name | awk '{print $4}'` | awk 'NR==1 {print $2}' | xargs taskset -c -p
pid 180564's current affinity list: 0-71

$ ps -aux | grep `openstack server show server.numa1 | grep instance_name | awk '{print $4}'` | awk 'NR==1 {print $2}' | xargs ps -m -o pid,psr,comm -p
   PID PSR COMMAND
180564   - qemu-kvm
     -  61 -
     -  22 -
     -  13 -
     -   1 -
     -  58 -
  • Set the NUMA properties on the flavor:
$ nova flavor-key machine.numa set hw:numa_nodes=1 hw:numa_cpus.0=0,1 hw:numa_mem.0=64
# nova flavor-key machine.numa unset hw:numa_nodes hw:numa_cpus.0 hw:numa_mem.0

$ openstack flavor show machine.numa
+----------------------------+-------------------------------------------------------------+
| Field                      | Value                                                       |
+----------------------------+-------------------------------------------------------------+
| OS-FLV-DISABLED:disabled   | False                                                       |
| OS-FLV-EXT-DATA:ephemeral  | 0                                                           |
| access_project_ids         | None                                                        |
| disk                       | 1                                                           |
| id                         | fc37ea6f-3e69-422f-a05e-0ee56837a84d                        |
| name                       | machine.numa                                                |
| os-flavor-access:is_public | True                                                        |
| properties                 | hw:numa_cpus.0='0,1', hw:numa_mem.0='64', hw:numa_nodes='1' |
| ram                        | 64                                                          |
| rxtx_factor                | 1.0                                                         |
| swap                       |                                                             |
| vcpus                      | 2                                                           |
+----------------------------+-------------------------------------------------------------+
  • Restart the VM created earlier:
$ openstack server stop server.numa1
$ openstack server start server.numa1
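  • The NUMA configuration is unchanged; the stop/start cycle did not apply the updated flavor properties to the existing VM: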
$ openstack server show server.numa1 | grep instance_name | awk '{print $4}'
$ virsh edit instance-00000022
...
  <memory unit='KiB'>65536</memory>
  <currentMemory unit='KiB'>65536</currentMemory>
  <vcpu placement='static'>2</vcpu>
  <cputune>
    <shares>2048</shares>
  </cputune>
...
  <cpu mode='host-model' check='partial'>
    <model fallback='allow'/>
    <topology sockets='2' cores='1' threads='1'/>
  </cpu>
...

  • Check CPU affinity and where the threads are running:
$ ps -aux | grep `openstack server show server.numa1 | grep instance_name | awk '{print $4}'` | awk 'NR==1 {print $2}' | xargs taskset -c -p
pid 219152's current affinity list: 0-71

$ ps -aux | grep `openstack server show server.numa1 | grep instance_name | awk '{print $4}'` | awk '{{if($11=="/usr/libexec/qemu-kvm") {print $2}}}' | xargs -I {} find /proc/{}/task/ -name "status" | xargs grep Cpus_allowed_list
/proc/219152/task/219152/status:Cpus_allowed_list:	0-71
/proc/219152/task/219220/status:Cpus_allowed_list:	0-71
/proc/219152/task/219225/status:Cpus_allowed_list:	0-71
/proc/219152/task/219227/status:Cpus_allowed_list:	0-71
/proc/219152/task/219250/status:Cpus_allowed_list:	0-71

$ ps -aux | grep `openstack server show server.numa1 | grep instance_name | awk '{print $4}'` | awk 'NR==1 {print $2}' | xargs ps -m -o pid,psr,comm -p
   PID PSR COMMAND
219152   - qemu-kvm
     -  31 -
     -  64 -
     -   1 -
     -  12 -
     -  55 -
  • Check memory allocation:
$ ps -aux | grep `openstack server show server.numa1 | grep instance_name | awk '{print $4}'` | awk 'NR==1 {print $2}' | xargs -I {} cat /proc/{}/numa_maps
...

55cdcffcd000 default file=/usr/libexec/qemu-kvm mapped=1328 mapmax=2 N0=1287 N1=41 kernelpagesize_kB=4
55cdd0944000 default file=/usr/libexec/qemu-kvm anon=248 dirty=248 N1=248 kernelpagesize_kB=4
55cdd0afc000 default file=/usr/libexec/qemu-kvm anon=95 dirty=95 N0=5 N1=90 kernelpagesize_kB=4
55cdd0b5c000 default anon=19 dirty=19 N0=4 N1=15 kernelpagesize_kB=4
55cdd1d83000 default heap anon=10419 dirty=10419 N0=1158 N1=9261 kernelpagesize_kB=4
7f35cbbba000 default
7f35cbbbb000 default anon=1 dirty=1 N0=1 kernelpagesize_kB=4
7f35cbcbb000 default
7f35cbcbc000 default anon=1 dirty=1 N0=1 kernelpagesize_kB=4
7f35cedc2000 default
7f35cedc3000 default anon=4 dirty=4 N0=4 kernelpagesize_kB=4
7f35d05c5000 default
7f35d05c6000 default anon=1 dirty=1 N1=1 kernelpagesize_kB=4
7f35d06c6000 default
...
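
The flavor properties used in this test (hw:numa_nodes=1, hw:numa_cpus.0=0,1, hw:numa_mem.0=64) have to be consistent with the flavor's 2 vCPUs and 64 MB of RAM. The docstring of nova's `numa_get_constraints`, listed under the related source code, names the corresponding failures (CPU out of range, duplicate CPU, unassigned CPU, memory not adding up). The sketch below is a simplified stand-in for those checks, not nova's code: `check_numa_extra_specs` is a hypothetical helper, and it assumes CPUs are written as a plain comma-separated list.

```python
def check_numa_extra_specs(vcpus, ram_mb, extra_specs):
    """Return a list of problems found in a manual hw:numa_* mapping."""
    nodes = int(extra_specs.get("hw:numa_nodes", 0))
    if nodes < 1:
        return ["hw:numa_nodes must be an integer >= 1"]

    problems = []
    assigned = set()  # vCPU indexes already mapped to some node
    total_mem = 0     # sum of per-node memory in MB
    for node in range(nodes):
        cpus_spec = extra_specs.get(f"hw:numa_cpus.{node}")
        mem_spec = extra_specs.get(f"hw:numa_mem.{node}")
        if cpus_spec is None or mem_spec is None:
            problems.append(f"node {node}: CPU or memory mapping is missing")
            continue
        # Assumption: CPUs are given as a comma-separated list such as "0,1".
        cpus = {int(c) for c in cpus_spec.split(",")}
        if any(c < 0 or c >= vcpus for c in cpus):
            problems.append(f"node {node}: vCPU index out of range")
        if cpus & assigned:
            problems.append(f"node {node}: vCPU assigned to more than one node")
        assigned |= cpus
        total_mem += int(mem_spec)

    if set(range(vcpus)) - assigned:
        problems.append("some vCPUs are not assigned to any node")
    if total_mem != ram_mb:
        problems.append("per-node memory does not add up to the flavor RAM")
    return problems


# Properties of the machine.numa flavor as shown by `openstack flavor show`.
flavor_props = {"hw:numa_nodes": "1", "hw:numa_cpus.0": "0,1", "hw:numa_mem.0": "64"}
problems = check_numa_extra_specs(vcpus=2, ram_mb=64, extra_specs=flavor_props)
print(problems or "mapping is consistent")
```

For the machine.numa flavor the list of problems is empty, so the mapping used in this test is internally consistent.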

####Creating a NUMA-pinned VM

  • Create a VM from the flavor that now carries the NUMA parameters; all of its vCPUs are placed on Node 0:
$ openstack server create --image cirros --flavor machine.numa --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 --availability-zone az01:osdev-01 server.numa2
  • Inspect the libvirt configuration:
$ openstack server show server.numa2 | grep instance_name | awk '{print $4}'
instance-00000024

$ virsh edit instance-00000024
...
  <memory unit='KiB'>65536</memory>
  <currentMemory unit='KiB'>65536</currentMemory>
  <vcpu placement='static'>2</vcpu>
  <cputune>
    <shares>2048</shares>
    <vcpupin vcpu='0' cpuset='0-17,36-53'/>
    <vcpupin vcpu='1' cpuset='0-17,36-53'/>
    <emulatorpin cpuset='0-17,36-53'/>
  </cputune>
  <numatune>
    <memory mode='strict' nodeset='0'/>
    <memnode cellid='0' mode='strict' nodeset='0'/>
  </numatune>
...

  <cpu mode='host-model' check='partial'>
    <model fallback='allow'/>
    <topology sockets='2' cores='1' threads='1'/>
    <numa>
      <cell id='0' cpus='0-1' memory='65536' unit='KiB'/>
    </numa>
...
  • Check CPU affinity and pinning:
$ ps -aux | grep `openstack server show server.numa2 | grep instance_name | awk '{print $4}'` | awk '{{if($11=="/usr/libexec/qemu-kvm") {print $2}}}' | xargs taskset -c -p
pid 1139's current affinity list: 0-17,36-53

$ ps -aux | grep `openstack server show server.numa2 | grep instance_name | awk '{print $4}'` | awk '{{if($11=="/usr/libexec/qemu-kvm") {print $2}}}' | xargs -I {} find /proc/{}/task/ -name "status" | xargs grep Cpus_allowed_list
/proc/1139/task/1139/status:Cpus_allowed_list:	0-17,36-53
/proc/1139/task/1143/status:Cpus_allowed_list:	0-17,36-53
/proc/1139/task/1148/status:Cpus_allowed_list:	0-17,36-53
/proc/1139/task/1149/status:Cpus_allowed_list:	0-17,36-53
/proc/1139/task/1151/status:Cpus_allowed_list:	0-17,36-53

$ ps -aux | grep `openstack server show server.numa2 | grep instance_name | awk '{print $4}'` | awk '{{if($11=="/usr/libexec/qemu-kvm") {print $2}}}' | xargs ps -m -o pid,psr,comm -p
   PID PSR COMMAND
  1139   - qemu-kvm
     -   3 -
     -   8 -
     -   2 -
     -   6 -
     -  51 -

  • Check memory allocation:
$ ps -aux | grep `openstack server show server.numa2 | grep instance_name | awk '{print $4}'` | awk '{{if($11=="/usr/libexec/qemu-kvm") {print $2}}}' | xargs -I {} cat /proc/{}/numa_maps
...

56133b0b3000 default file=/usr/libexec/qemu-kvm mapped=1326 mapmax=2 N0=1292 N1=34 kernelpagesize_kB=4
56133ba2a000 default file=/usr/libexec/qemu-kvm anon=248 dirty=248 N0=248 kernelpagesize_kB=4
56133bbe2000 default file=/usr/libexec/qemu-kvm anon=95 dirty=95 N0=95 kernelpagesize_kB=4
56133bc42000 default anon=19 dirty=19 N0=19 kernelpagesize_kB=4
56133db66000 default heap anon=3415 dirty=3415 N0=3415 kernelpagesize_kB=4
...
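
To confirm that the pinning really corresponds to one NUMA node, the cpuset string from `<vcpupin>` can be expanded and compared with the "Socket 0" processor list printed by the topology script and with the PSR column of `ps`. This is a small sketch; `parse_cpu_list` is a made-up helper that handles only the `a-b,c-d` form seen in this output.

```python
def parse_cpu_list(spec):
    """Expand a CPU list such as "0-17,36-53" into a set of integers."""
    cpus = set()
    for part in spec.split(","):
        first, _, last = part.partition("-")
        # A single number such as "5" has no "-", so it is its own upper bound.
        cpus.update(range(int(first), int(last or first) + 1))
    return cpus


allowed = parse_cpu_list("0-17,36-53")            # cpuset from <vcpupin>/<emulatorpin>
socket0 = set(range(0, 18)) | set(range(36, 54))  # "Socket 0" line of cpu_view.sh
pinned_psr = [3, 8, 2, 6, 51]                     # PSR column for server.numa2
ordinary_psr = [31, 64, 1, 12, 55]                # PSR column for server.numa1

print(allowed == socket0)
print(all(cpu in allowed for cpu in pinned_psr))
print(all(cpu in allowed for cpu in ordinary_psr))
```

The cpuset matches Socket 0 exactly and every thread of server.numa2 runs inside it, whereas the threads of the ordinary VM (for example on processors 31, 64 and 55) also run on Socket 1.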

####Memory allocation comparison

  • Script used to compare memory allocation:
#!/usr/bin/perl

# Copyright (c) 2010, Jeremy Cole <[email protected]>

# This program is free software; you can redistribute it and/or modify it
# under the terms of either: the GNU General Public License as published
# by the Free Software Foundation; or the Artistic License.
# 
# See http://dev.perl.org/licenses/ for more information.

#
# This script expects a numa_maps file as input.  It is normally run in
# the following way:
#
#     # perl numa-maps-summary.pl < /proc/pid/numa_maps
#
# Additionally, it can be used (of course) with saved numa_maps, and it
# will also accept numa_maps output with ">" prefixes from an email quote.
# It doesn't care what's in the output, it merely summarizes whatever it
# finds.
#
# The output should look something like the following:
#
#     N0        :      7983584 ( 30.45 GB)
#     N1        :      5440464 ( 20.75 GB)
#     active    :     13406601 ( 51.14 GB)
#     anon      :     13422697 ( 51.20 GB)
#     dirty     :     13407242 ( 51.14 GB)
#     mapmax    :          977 (  0.00 GB)
#     mapped    :         1377 (  0.01 GB)
#     swapcache :      3619780 ( 13.81 GB)
#

use Data::Dumper;

sub parse_numa_maps_line($$)
{
  my ($line, $map) = @_;

  if($line =~ /^[> ]*([0-9a-fA-F]+) (\S+)(.*)/)
  {
    my ($address, $policy, $flags) = ($1, $2, $3);

    $map->{$address}->{'policy'} = $policy;

    $flags =~ s/^\s+//g;
    $flags =~ s/\s+$//g;
    foreach my $flag (split / /, $flags)
    {
      my ($key, $value) = split /=/, $flag;
      $map->{$address}->{'flags'}->{$key} = $value;
    }
  }

}

sub parse_numa_maps()
{
  my ($fd) = @_;
  my $map = {};

  while(my $line = <$fd>)
  {
    &parse_numa_maps_line($line, $map);

  }
  return $map;
}

my $map = &parse_numa_maps(\*STDIN);

my $sums = {};

foreach my $address (keys %{$map})
{
  if(exists($map->{$address}->{'flags'}))
  {
    my $flags = $map->{$address}->{'flags'};
    foreach my $flag (keys %{$flags})
    {
      next if $flag eq 'file';
      $sums->{$flag} += $flags->{$flag} if defined $flags->{$flag};
    }
  }
}

foreach my $key (sort keys %{$sums})
{
  printf "%-10s: %12i (%6.2f GB)\n", $key, $sums->{$key}, $sums->{$key}/262144;
}

  • The ordinary VM has a substantial amount of memory allocated on both nodes:
$ ps -aux | grep `openstack server show server.numa1 | grep instance_name | awk '{print $4}'` | awk '{{if($11=="/usr/libexec/qemu-kvm") {print $2}}}' | xargs -I {} cat /proc/{}/numa_maps | perl numa-maps-summary.pl 
N0        :         8721 (  0.03 GB)
N1        :        21186 (  0.08 GB)
active    :            0 (  0.00 GB)
anon      :        26787 (  0.10 GB)
dirty     :        26797 (  0.10 GB)
kernelpagesize_kB:         1660 (  0.01 GB)
mapmax    :         4116 (  0.02 GB)
mapped    :         3110 (  0.01 GB)
  • The memory of the NUMA-pinned VM is almost entirely on Node 0:
$ ps -aux | grep `openstack server show server.numa2 | grep instance_name | awk '{print $4}'` | awk '{{if($11=="/usr/libexec/qemu-kvm") {print $2}}}' | xargs -I {} cat /proc/{}/numa_maps | perl numa-maps-summary.pl
N0        :        21731 (  0.08 GB)
N1        :           34 (  0.00 GB)
active    :            0 (  0.00 GB)
anon      :        18647 (  0.07 GB)
dirty     :        18657 (  0.07 GB)
kernelpagesize_kB:         1708 (  0.01 GB)
mapmax    :         4116 (  0.02 GB)
mapped    :         3108 (  0.01 GB)
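
The difference is easier to see as a share of pages per node. The sketch below only redoes the arithmetic on the N0/N1 page counts printed above; `node_shares` is a hypothetical helper name.

```python
def node_shares(pages_by_node):
    """Return each node's fraction of the total page count."""
    total = sum(pages_by_node.values())
    return {node: pages / total for node, pages in pages_by_node.items()}


# N0/N1 page counts taken from the two numa-maps-summary.pl runs.
ordinary = node_shares({"N0": 8721, "N1": 21186})  # server.numa1
pinned = node_shares({"N0": 21731, "N1": 34})      # server.numa2

for name, shares in (("server.numa1", ordinary), ("server.numa2", pinned)):
    print(name, {node: f"{share:.1%}" for node, share in shares.items()})
```

For server.numa1 the split is roughly 29% on Node 0 and 71% on Node 1, while for server.numa2 about 99.8% of the pages sit on Node 0.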

###Related source code

VM creation

  • When a VM is created, the request parameters are built and validated (nova/compute/api.py):
    @hooks.add_hook("create_instance")
    def create(self, context, instance_type,
               image_href, kernel_id=None, ramdisk_id=None,
               min_count=None, max_count=None,
               display_name=None, display_description=None,
               key_name=None, key_data=None, security_groups=None,
               availability_zone=None, forced_host=None, forced_node=None,
               user_data=None, metadata=None, injected_files=None,
               admin_password=None, block_device_mapping=None,
               access_ip_v4=None, access_ip_v6=None, requested_networks=None,
               config_drive=None, auto_disk_config=None, scheduler_hints=None,
               legacy_bdm=True, shutdown_terminate=False,
               check_server_group_quota=False):
        """Provision instances, sending instance information to the
        scheduler.  The scheduler will determine where the instance(s)
        go and will handle creating the DB entries.

        Returns a tuple of (instances, reservation_id)
        """
        if requested_networks and max_count is not None and max_count > 1:
            self._check_multiple_instances_with_specified_ip(
                requested_networks)
            if utils.is_neutron():
                self._check_multiple_instances_with_neutron_ports(
                    requested_networks)

        if availability_zone:
            available_zones = availability_zones.\
                get_availability_zones(context.elevated(), True)
            if forced_host is None and availability_zone not in \
                    available_zones:
                msg = _('The requested availability zone is not available')
                raise exception.InvalidRequest(msg)

        filter_properties = scheduler_utils.build_filter_properties(
                scheduler_hints, forced_host, forced_node, instance_type)

        return self._create_instance(
                       context, instance_type,
                       image_href, kernel_id, ramdisk_id,
                       min_count, max_count,
                       display_name, display_description,
                       key_name, key_data, security_groups,
                       availability_zone, user_data, metadata,
                       injected_files, admin_password,
                       access_ip_v4, access_ip_v6,
                       requested_networks, config_drive,
                       block_device_mapping, auto_disk_config,
                       filter_properties=filter_properties,
                       legacy_bdm=legacy_bdm,
                       shutdown_terminate=shutdown_terminate,
                       check_server_group_quota=check_server_group_quota)

    def _create_instance(self, context, instance_type,
               image_href, kernel_id, ramdisk_id,
               min_count, max_count,
               display_name, display_description,
               key_name, key_data, security_groups,
               availability_zone, user_data, metadata, injected_files,
               admin_password, access_ip_v4, access_ip_v6,
               requested_networks, config_drive,
               block_device_mapping, auto_disk_config, filter_properties,
               reservation_id=None, legacy_bdm=True, shutdown_terminate=False,
               check_server_group_quota=False):
        """Verify all the input parameters regardless of the provisioning
        strategy being performed and schedule the instance(s) for
        creation.
        """

# ...

        base_options, max_net_count, key_pair, security_groups = \
                self._validate_and_build_base_options(
                    context, instance_type, boot_meta, image_href, image_id,
                    kernel_id, ramdisk_id, display_name, display_description,
                    key_name, key_data, security_groups, availability_zone,
                    user_data, metadata, access_ip_v4, access_ip_v6,
                    requested_networks, config_drive, auto_disk_config,
                    reservation_id, max_count)

#...

    def _validate_and_build_base_options(self, context, instance_type,
                                         boot_meta, image_href, image_id,
                                         kernel_id, ramdisk_id, display_name,
                                         display_description, key_name,
                                         key_data, security_groups,
                                         availability_zone, user_data,
                                         metadata, access_ip_v4, access_ip_v6,
                                         requested_networks, config_drive,
                                         auto_disk_config, reservation_id,
                                         max_count):
        """Verify all the input parameters regardless of the provisioning
        strategy being performed.
        """
#...

        numa_topology = hardware.numa_get_constraints(
                instance_type, image_meta)

#...

  • Building the NUMA parameters (nova/virt/hardware.py):
# TODO(sahid): Move numa related to hardware/numa.py
def numa_get_constraints(flavor, image_meta):
    """Return topology related to input request.

    :param flavor: a flavor object to read extra specs from
    :param image_meta: nova.objects.ImageMeta object instance

    :raises: exception.InvalidNUMANodesNumber if the number of NUMA
             nodes is less than 1 or not an integer
    :raises: exception.ImageNUMATopologyForbidden if an attempt is made
             to override flavor settings with image properties
    :raises: exception.MemoryPageSizeInvalid if flavor extra spec or
             image metadata provides an invalid hugepage value
    :raises: exception.MemoryPageSizeForbidden if flavor extra spec
             request conflicts with image metadata request
    :raises: exception.ImageNUMATopologyIncomplete if the image
             properties are not correctly specified
    :raises: exception.ImageNUMATopologyAsymmetric if the number of
             NUMA nodes is not a factor of the requested total CPUs or
             memory
    :raises: exception.ImageNUMATopologyCPUOutOfRange if an instance
             CPU given in a NUMA mapping is not valid
    :raises: exception.ImageNUMATopologyCPUDuplicates if an instance
             CPU is specified in CPU mappings for two NUMA nodes
    :raises: exception.ImageNUMATopologyCPUsUnassigned if an instance
             CPU given in a NUMA mapping is not assigned to any NUMA node
    :raises: exception.ImageNUMATopologyMemoryOutOfRange if sum of memory from
             each NUMA node is not equal with total requested memory
    :raises: exception.ImageCPUPinningForbidden if a CPU policy
             specified in a flavor conflicts with one defined in image
             metadata
    :raises: exception.RealtimeConfigurationInvalid if realtime is
             requested but dedicated CPU policy is not also requested
    :raises: exception.RealtimeMaskNotFoundOrInvalid if realtime is
             requested but no mask provided
    :raises: exception.CPUThreadPolicyConfigurationInvalid if a CPU thread
             policy conflicts with CPU allocation policy
    :raises: exception.ImageCPUThreadPolicyForbidden if a CPU thread policy
             specified in a flavor conflicts with one defined in image metadata
    :returns: objects.InstanceNUMATopology, or None
    """
    flavor_nodes, image_nodes = _get_flavor_image_meta(
        'numa_nodes', flavor, image_meta)
    if flavor_nodes and image_nodes:
        raise exception.ImageNUMATopologyForbidden(
            name='hw_numa_nodes')

    nodes = None
    if flavor_nodes:
        _validate_numa_nodes(flavor_nodes)
        nodes = int(flavor_nodes)
    else:
        _validate_numa_nodes(image_nodes)
        nodes = image_nodes

    pagesize = _numa_get_pagesize_constraints(
        flavor, image_meta)

    numa_topology = None
    if nodes or pagesize:
        nodes = nodes or 1

        cpu_list = _numa_get_cpu_map_list(flavor, image_meta)
        mem_list = _numa_get_mem_map_list(flavor, image_meta)

        # If one property list is specified both must be
        if ((cpu_list is None and mem_list is not None) or
            (cpu_list is not None and mem_list is None)):
            raise exception.ImageNUMATopologyIncomplete()

        # If any node has data set, all nodes must have data set
        if ((cpu_list is not None and len(cpu_list) != nodes) or
            (mem_list is not None and len(mem_list) != nodes)):
            raise exception.ImageNUMATopologyIncomplete()

        if cpu_list is None:
            numa_topology = _numa_get_constraints_auto(
                nodes, flavor)
        else:
            numa_topology = _numa_get_constraints_manual(
                nodes, flavor, cpu_list, mem_list)

        # We currently support same pagesize for all cells.
        [setattr(c, 'pagesize', pagesize) for c in numa_topology.cells]

    cpu_policy = _get_cpu_policy_constraints(flavor, image_meta)
    cpu_thread_policy = _get_cpu_thread_policy_constraints(flavor, image_meta)
    rt_mask = _get_realtime_mask(flavor, image_meta)

    # sanity checks

    rt = is_realtime_enabled(flavor)

    if rt and cpu_policy != fields.CPUAllocationPolicy.DEDICATED:
        raise exception.RealtimeConfigurationInvalid()

    if rt and not rt_mask:
        raise exception.RealtimeMaskNotFoundOrInvalid()

    if cpu_policy == fields.CPUAllocationPolicy.SHARED:
        if cpu_thread_policy:
            raise exception.CPUThreadPolicyConfigurationInvalid()
        return numa_topology

    if numa_topology:
        for cell in numa_topology.cells:
            cell.cpu_policy = cpu_policy
            cell.cpu_thread_policy = cpu_thread_policy
    else:
        single_cell = objects.InstanceNUMACell(
                id=0,
                cpuset=set(range(flavor.vcpus)),
                memory=flavor.memory_mb,
                cpu_policy=cpu_policy,
                cpu_thread_policy=cpu_thread_policy)
        numa_topology = objects.InstanceNUMATopology(cells=[single_cell])

    return numa_topology


def _get_cpu_policy_constraints(flavor, image_meta):
    """Validate and return the requested CPU policy."""
    flavor_policy, image_policy = _get_flavor_image_meta(
        'cpu_policy', flavor, image_meta)

    if flavor_policy == fields.CPUAllocationPolicy.DEDICATED:
        cpu_policy = flavor_policy
    elif flavor_policy == fields.CPUAllocationPolicy.SHARED:
        if image_policy == fields.CPUAllocationPolicy.DEDICATED:
            raise exception.ImageCPUPinningForbidden()
        cpu_policy = flavor_policy
    elif image_policy == fields.CPUAllocationPolicy.DEDICATED:
        cpu_policy = image_policy
    else:
        cpu_policy = fields.CPUAllocationPolicy.SHARED

    return cpu_policy


def _get_cpu_thread_policy_constraints(flavor, image_meta):
    """Validate and return the requested CPU thread policy."""
    flavor_policy, image_policy = _get_flavor_image_meta(
        'cpu_thread_policy', flavor, image_meta)

    if flavor_policy in [None, fields.CPUThreadAllocationPolicy.PREFER]:
        policy = flavor_policy or image_policy
    elif image_policy and image_policy != flavor_policy:
        raise exception.ImageCPUThreadPolicyForbidden()
    else:
        policy = flavor_policy

    return policy

As the code shows, flavor settings take precedence over image settings.
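
The precedence can be tried out in isolation. The following is a simplified restatement of `_get_cpu_policy_constraints` from the listing above, not the nova function itself. It assumes the policy values are the plain strings "dedicated" and "shared", and it raises a generic `ValueError` where nova raises `ImageCPUPinningForbidden`. The name `resolve_cpu_policy` is made up.

```python
def resolve_cpu_policy(flavor_policy, image_policy):
    """Pick the effective CPU policy from the flavor and image settings."""
    if flavor_policy == "dedicated":
        return "dedicated"  # the flavor wins regardless of the image
    if flavor_policy == "shared":
        if image_policy == "dedicated":
            # The image may not ask for pinning that the flavor rules out.
            raise ValueError("image requests 'dedicated' but flavor says 'shared'")
        return "shared"
    if image_policy == "dedicated":
        return "dedicated"  # the flavor is silent, so the image decides
    return "shared"         # default when neither side requests pinning


print(resolve_cpu_policy("dedicated", "shared"))
print(resolve_cpu_policy(None, "dedicated"))
print(resolve_cpu_policy(None, None))
```

The image setting only takes effect when the flavor does not specify a policy, and an image that asks for pinning against a "shared" flavor is rejected rather than silently ignored.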

####NUMA filter

  • Source of the NUMA CPU pinning filter (nova/scheduler/filters/numa_topology_filter.py):
from oslo_log import log as logging

from nova import objects
from nova.objects import fields
from nova.scheduler import filters
from nova.virt import hardware

LOG = logging.getLogger(__name__)


class NUMATopologyFilter(filters.BaseHostFilter):
    """Filter on requested NUMA topology."""

    RUN_ON_REBUILD = True

    def _satisfies_cpu_policy(self, host_state, extra_specs, image_props):
        """Check that the host_state provided satisfies any available
        CPU policy requirements.
        """
        host_topology, _ = hardware.host_topology_and_format_from_host(
            host_state)
        # NOTE(stephenfin): There can be conflicts between the policy
        # specified by the image and that specified by the instance, but this
        # is not the place to resolve these. We do this during scheduling.
        cpu_policy = [extra_specs.get('hw:cpu_policy'),
                      image_props.get('hw_cpu_policy')]
        cpu_thread_policy = [extra_specs.get('hw:cpu_thread_policy'),
                             image_props.get('hw_cpu_thread_policy')]

        if not host_topology:
            return True

        if fields.CPUAllocationPolicy.DEDICATED not in cpu_policy:
            return True

        if fields.CPUThreadAllocationPolicy.REQUIRE not in cpu_thread_policy:
            return True

        # the presence of siblings in at least one cell indicates
        # hyperthreading (HT)
        has_hyperthreading = any(cell.siblings for cell in host_topology.cells)

        if not has_hyperthreading:
            LOG.debug("%(host_state)s fails CPU policy requirements. "
                      "Host does not have hyperthreading or "
                      "hyperthreading is disabled, but 'require' threads "
                      "policy was requested.", {'host_state': host_state})
            return False

        return True

    def host_passes(self, host_state, spec_obj):
        # TODO(stephenfin): The 'numa_fit_instance_to_host' function has the
        # unfortunate side effect of modifying 'spec_obj.numa_topology' - an
        # InstanceNUMATopology object - by populating the 'cpu_pinning' field.
        # This is rather rude and said function should be reworked to avoid
        # doing this. That's a large, non-backportable cleanup however, so for
        # now we just duplicate spec_obj to prevent changes propagating to
        # future filter calls.
        spec_obj = spec_obj.obj_clone()

        ram_ratio = host_state.ram_allocation_ratio
        cpu_ratio = host_state.cpu_allocation_ratio
        extra_specs = spec_obj.flavor.extra_specs
        image_props = spec_obj.image.properties
        requested_topology = spec_obj.numa_topology
        host_topology, _fmt = hardware.host_topology_and_format_from_host(
                host_state)
        pci_requests = spec_obj.pci_requests

        if pci_requests:
            pci_requests = pci_requests.requests

        if not self._satisfies_cpu_policy(host_state, extra_specs,
                                          image_props):
            return False

        if requested_topology and host_topology:
            limits = objects.NUMATopologyLimits(
                cpu_allocation_ratio=cpu_ratio,
                ram_allocation_ratio=ram_ratio)
            instance_topology = (hardware.numa_fit_instance_to_host(
                        host_topology, requested_topology,
                        limits=limits,
                        pci_requests=pci_requests,
                        pci_stats=host_state.pci_stats))
            if not instance_topology:
                LOG.debug("%(host)s, %(node)s fails NUMA topology "
                          "requirements. The instance does not fit on this "
                          "host.", {'host': host_state.host,
                                    'node': host_state.nodename},
                          instance_uuid=spec_obj.instance_uuid)
                return False
            host_state.limits['numa_topology'] = limits
            return True
        elif requested_topology:
            LOG.debug("%(host)s, %(node)s fails NUMA topology requirements. "
                      "No host NUMA topology while the instance specified "
                      "one.",
                      {'host': host_state.host, 'node': host_state.nodename},
                      instance_uuid=spec_obj.instance_uuid)
            return False
        else:
            return True

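This filter checks CPU, memory, and NUMA requirements together: it first verifies the requested CPU thread policy against the host's hyperthreading support, then tries to fit the requested NUMA topology onto the host within the CPU and RAM allocation ratios. Both steps amount to simple resource-count comparisons and feature matching.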
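
The early-exit structure of `_satisfies_cpu_policy` can be illustrated with a small standalone sketch. The function name and the data shape (one list of sibling sets per host NUMA cell, or `None` when the host reports no topology) are assumptions made for this example; Nova works with `objects.NUMATopology` instead.

```python
from typing import List, Optional, Sequence, Set


def satisfies_cpu_policy(
        host_cell_siblings: Optional[List[List[Set[int]]]],
        cpu_policies: Sequence[Optional[str]],
        thread_policies: Sequence[Optional[str]]) -> bool:
    """Return False only when 'require' threads meet a host without HT.

    cpu_policies and thread_policies each hold the flavor value and the
    image value, in the same spirit as the two-element lists in the filter.
    """
    if host_cell_siblings is None:
        # No host topology reported: nothing to check here.
        return True
    if "dedicated" not in cpu_policies:
        # Thread policies only matter for pinned (dedicated) CPUs.
        return True
    if "require" not in thread_policies:
        # Only the 'require' policy demands hyperthreading.
        return True
    # Siblings in at least one cell indicate hyperthreading.
    return any(siblings for siblings in host_cell_siblings)
```

With `[[{0, 4}, {1, 5}]]` as the host description a dedicated/require request passes; with `[[], []]` it is rejected.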

  • Retrieving the CPU (NUMA topology) attributes from the host state and comparing them with the requirements in the VM request (nova/virt/hardware.py):
# TODO(ndipanov): Remove when all code paths are using objects
def host_topology_and_format_from_host(host):
    """Extract numa topology from myriad host representations.

    Until the RPC version is bumped to 5.x, a host may be represented
    as a dict, a db object, an actual ComputeNode object, or an
    instance of HostState class. Identify the type received and return
    either an instance of objects.NUMATopology if host's NUMA topology
    is available, else None.

    :returns: A two-tuple. The first element is either an instance of
              objects.NUMATopology or None. The second element is a
              boolean set to True if topology was in JSON format.
    """
    was_json = False
    try:
        host_numa_topology = host.get('numa_topology')
    except AttributeError:
        host_numa_topology = host.numa_topology

    if host_numa_topology is not None and isinstance(
            host_numa_topology, six.string_types):
        was_json = True

        host_numa_topology = (objects.NUMATopology.obj_from_db_obj(
            host_numa_topology))

    return host_numa_topology, was_json

def numa_fit_instance_to_host(
        host_topology, instance_topology, limits=None,
        pci_requests=None, pci_stats=None):
    """Fit the instance topology onto the host topology.

    Given a host, instance topology, and (optional) limits, attempt to
    fit instance cells onto all permutations of host cells by calling
    the _fit_instance_cell method, and return a new InstanceNUMATopology
    with its cell ids set to host cell ids of the first successful
    permutation, or None.

    :param host_topology: objects.NUMATopology object to fit an
                          instance on
    :param instance_topology: objects.InstanceNUMATopology to be fitted
    :param limits: objects.NUMATopologyLimits that defines limits
    :param pci_requests: instance pci_requests
    :param pci_stats: pci_stats for the host

    :returns: objects.InstanceNUMATopology with its cell IDs set to host
              cell ids of the first successful permutation, or None
    """
    if not (host_topology and instance_topology):
        LOG.debug("Require both a host and instance NUMA topology to "
                  "fit instance on host.")
        return
    elif len(host_topology) < len(instance_topology):
        LOG.debug("There are not enough NUMA nodes on the system to schedule "
                  "the instance correctly. Required: %(required)s, actual: "
                  "%(actual)s",
                  {'required': len(instance_topology),
                   'actual': len(host_topology)})
        return

    # TODO(ndipanov): We may want to sort permutations differently
    # depending on whether we want packing/spreading over NUMA nodes
    for host_cell_perm in itertools.permutations(
            host_topology.cells, len(instance_topology)):
        cells = []
        for host_cell, instance_cell in zip(
                host_cell_perm, instance_topology.cells):
            try:
                got_cell = _numa_fit_instance_cell(
                    host_cell, instance_cell, limits)
            except exception.MemoryPageSizeNotSupported:
                # This exception will be raised if instance cell's
                # custom pagesize is not supported with host cell in
                # _numa_cell_supports_pagesize_request function.
                break
            if got_cell is None:
                break
            cells.append(got_cell)

        if len(cells) != len(host_cell_perm):
            continue

        if not pci_requests or ((pci_stats is not None) and
                pci_stats.support_requests(pci_requests, cells)):
            return objects.InstanceNUMATopology(cells=cells)
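
The permutation search in `numa_fit_instance_to_host` can be sketched in isolation. The sketch below is a simplification under stated assumptions: `HostCell`, `GuestCell`, and `fit_guest_to_host` are hypothetical names, the per-cell check only compares free CPUs and memory, and page sizes, allocation ratios, pinning, and PCI requests are ignored.

```python
import itertools
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class HostCell:
    id: int
    free_cpus: int
    free_memory_mb: int


@dataclass(frozen=True)
class GuestCell:
    cpus: int
    memory_mb: int


def fit_guest_to_host(host_cells: Sequence[HostCell],
                      guest_cells: Sequence[GuestCell]
                      ) -> Optional[List[int]]:
    """Return host cell ids for the first permutation that fits, or None."""
    if not host_cells or not guest_cells:
        return None
    if len(host_cells) < len(guest_cells):
        # Not enough host NUMA nodes for the requested guest cells.
        return None
    # Try every ordered choice of len(guest_cells) host cells.
    for perm in itertools.permutations(host_cells, len(guest_cells)):
        if all(host.free_cpus >= guest.cpus and
               host.free_memory_mb >= guest.memory_mb
               for host, guest in zip(perm, guest_cells)):
            return [host.id for host in perm]
    return None
```

Because permutations are tried in order, the first fit wins, which matches the TODO in the original about possibly sorting permutations for packing or spreading.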

####Virtualization Driver

  • The virtualization driver base class and its instance-creation method spawn (nova/virt/driver.py):
class ComputeDriver(object):
    """Base class for compute drivers.

    The interface to this class talks in terms of 'instances' (Amazon EC2 and
    internal Nova terminology), by which we mean 'running virtual machine'
    (XenAPI terminology) or domain (Xen or libvirt terminology).

    An instance has an ID, which is the identifier chosen by Nova to represent
    the instance further up the stack.  This is unfortunately also called a
    'name' elsewhere.  As far as this layer is concerned, 'instance ID' and
    'instance name' are synonyms.

    Note that the instance ID or name is not human-readable or
    customer-controlled -- it's an internal ID chosen by Nova.  At the
    nova.virt layer, instances do not have human-readable names at all -- such
    things are only known higher up the stack.

    Most virtualization platforms will also have their own identity schemes,
    to uniquely identify a VM or domain.  These IDs must stay internal to the
    platform-specific layer, and never escape the connection interface.  The
    platform-specific layer is responsible for keeping track of which instance
    ID maps to which platform-specific ID, and vice versa.

    Some methods here take an instance of nova.compute.service.Instance.  This
    is the data structure used by nova.compute to store details regarding an
    instance, and pass them into this layer.  This layer is responsible for
    translating that generic data structure into terms that are specific to the
    virtualization platform.

    """

    def spawn(self, context, instance, image_meta, injected_files,
              admin_password, network_info=None, block_device_info=None):
        """Create a new instance/VM/domain on the virtualization platform.

        Once this successfully completes, the instance should be
        running (power_state.RUNNING).

        If this fails, any partial instance should be completely
        cleaned up, and the virtualization platform should be in the state
        that it was before this call began.

        :param context: security context
        :param instance: nova.objects.instance.Instance
                         This function should use the data there to guide
                         the creation of the new instance.
        :param nova.objects.ImageMeta image_meta:
            The metadata of the image of the instance.
        :param injected_files: User files to inject into instance.
        :param admin_password: Administrator password to set in instance.
        :param network_info: instance network information
        :param block_device_info: Information about block devices to be
                                  attached to the instance.
        """
        raise NotImplementedError()
  • Main attributes of a VM instance (nova/objects/instance.py):
# TODO(berrange): Remove NovaObjectDictCompat
@base.NovaObjectRegistry.register
class Instance(base.NovaPersistentObject, base.NovaObject,
               base.NovaObjectDictCompat):
    # Version 2.0: Initial version
    # Version 2.1: Added services
    # Version 2.2: Added keypairs
    # Version 2.3: Added device_metadata
    VERSION = '2.3'

    fields = {
        'id': fields.IntegerField(),

        'user_id': fields.StringField(nullable=True),
        'project_id': fields.StringField(nullable=True),

        'image_ref': fields.StringField(nullable=True),
        'kernel_id': fields.StringField(nullable=True),
        'ramdisk_id': fields.StringField(nullable=True),
        'hostname': fields.StringField(nullable=True),

        'launch_index': fields.IntegerField(nullable=True),
        'key_name': fields.StringField(nullable=True),
        'key_data': fields.StringField(nullable=True),

        'power_state': fields.IntegerField(nullable=True),
        'vm_state': fields.StringField(nullable=True),
        'task_state': fields.StringField(nullable=True),

        'services': fields.ObjectField('ServiceList'),

        'memory_mb': fields.IntegerField(nullable=True),
        'vcpus': fields.IntegerField(nullable=True),
        'root_gb': fields.IntegerField(nullable=True),
        'ephemeral_gb': fields.IntegerField(nullable=True),
        'ephemeral_key_uuid': fields.UUIDField(nullable=True),

        'host': fields.StringField(nullable=True),
        'node': fields.StringField(nullable=True),

        'instance_type_id': fields.IntegerField(nullable=True),

        'user_data': fields.StringField(nullable=True),

        'reservation_id': fields.StringField(nullable=True),

        'launched_at': fields.DateTimeField(nullable=True),
        'terminated_at': fields.DateTimeField(nullable=True),

        'availability_zone': fields.StringField(nullable=True),

        'display_name': fields.StringField(nullable=True),
        'display_description': fields.StringField(nullable=True),

        'launched_on': fields.StringField(nullable=True),

        # NOTE(jdillaman): locked deprecated in favor of locked_by,
        # to be removed in Icehouse
        'locked': fields.BooleanField(default=False),
        'locked_by': fields.StringField(nullable=True),

        'os_type': fields.StringField(nullable=True),
        'architecture': fields.StringField(nullable=True),
        'vm_mode': fields.StringField(nullable=True),
        'uuid': fields.UUIDField(),

        'root_device_name': fields.StringField(nullable=True),
        'default_ephemeral_device': fields.StringField(nullable=True),
        'default_swap_device': fields.StringField(nullable=True),
        'config_drive': fields.StringField(nullable=True),

        'access_ip_v4': fields.IPV4AddressField(nullable=True),
        'access_ip_v6': fields.IPV6AddressField(nullable=True),

        'auto_disk_config': fields.BooleanField(default=False),
        'progress': fields.IntegerField(nullable=True),

        'shutdown_terminate': fields.BooleanField(default=False),
        'disable_terminate': fields.BooleanField(default=False),

        'cell_name': fields.StringField(nullable=True),

        'metadata': fields.DictOfStringsField(),
        'system_metadata': fields.DictOfNullableStringsField(),

        'info_cache': fields.ObjectField('InstanceInfoCache',
                                         nullable=True),

        'security_groups': fields.ObjectField('SecurityGroupList'),

        'fault': fields.ObjectField('InstanceFault', nullable=True),

        'cleaned': fields.BooleanField(default=False),

        'pci_devices': fields.ObjectField('PciDeviceList', nullable=True),
        'numa_topology': fields.ObjectField('InstanceNUMATopology',
                                            nullable=True),
        'pci_requests': fields.ObjectField('InstancePCIRequests',
                                           nullable=True),
        'device_metadata': fields.ObjectField('InstanceDeviceMetadata',
                                              nullable=True),
        'tags': fields.ObjectField('TagList'),
        'flavor': fields.ObjectField('Flavor'),
        'old_flavor': fields.ObjectField('Flavor', nullable=True),
        'new_flavor': fields.ObjectField('Flavor', nullable=True),
        'vcpu_model': fields.ObjectField('VirtCPUModel', nullable=True),
        'ec2_ids': fields.ObjectField('EC2Ids'),
        'migration_context': fields.ObjectField('MigrationContext',
                                                nullable=True),
        'keypairs': fields.ObjectField('KeyPairList'),
        }

    obj_extra_fields = ['name']
  • The libvirt virtualization driver's implementation of spawn (nova/virt/libvirt/driver.py):
class LibvirtDriver(driver.ComputeDriver):

    # NOTE(ilyaalekseyev): Implementation like in multinics
    # for xenapi(tr3buchet)
    def spawn(self, context, instance, image_meta, injected_files,
              admin_password, network_info=None, block_device_info=None):
        disk_info = blockinfo.get_disk_info(CONF.libvirt.virt_type,
                                            instance,
                                            image_meta,
                                            block_device_info)
        injection_info = InjectionInfo(network_info=network_info,
                                       files=injected_files,
                                       admin_pass=admin_password)
        gen_confdrive = functools.partial(self._create_configdrive,
                                          context, instance,
                                          injection_info)
        self._create_image(context, instance, disk_info['mapping'],
                           injection_info=injection_info,
                           block_device_info=block_device_info)

        # Required by Quobyte CI
        self._ensure_console_log_for_instance(instance)

        xml = self._get_guest_xml(context, instance, network_info,
                                  disk_info, image_meta,
                                  block_device_info=block_device_info)
        self._create_domain_and_network(
            context, xml, instance, network_info, disk_info,
            block_device_info=block_device_info,
            post_xml_callback=gen_confdrive,
            destroy_disks_on_failure=True)
        LOG.debug("Instance is running", instance=instance)

        def _wait_for_boot():
            """Called at an interval until the VM is running."""
            state = self.get_info(instance).state

            if state == power_state.RUNNING:
                LOG.info(_LI("Instance spawned successfully."),
                         instance=instance)
                raise loopingcall.LoopingCallDone()

        timer = loopingcall.FixedIntervalLoopingCall(_wait_for_boot)
        timer.start(interval=0.5).wait()
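
The wait-for-boot step at the end of spawn is a fixed-interval polling loop. A rough standalone equivalent, assuming a plain `time.sleep` loop in place of oslo's `FixedIntervalLoopingCall`, might look like the following. The `wait_for_boot` signature, the string `RUNNING` constant, and the optional `timeout` are additions for this sketch and are not present in the Nova code above.

```python
import time
from typing import Callable, Optional

# Hypothetical stand-in for power_state.RUNNING.
RUNNING = "running"


def wait_for_boot(get_state: Callable[[], str],
                  interval: float = 0.5,
                  timeout: Optional[float] = None) -> int:
    """Poll get_state() until it reports RUNNING; return the poll count."""
    deadline = None if timeout is None else time.monotonic() + timeout
    polls = 0
    while True:
        polls += 1
        if get_state() == RUNNING:
            # Equivalent to raising LoopingCallDone in the original.
            return polls
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError("instance did not reach RUNNING in time")
        time.sleep(interval)
```

Without a timeout the loop waits indefinitely, as the original does; the timeout parameter is only there to make the sketch safer to experiment with.
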
  • Generating the guest XML configuration (nova/virt/libvirt/driver.py):
    def _get_guest_xml(self, context, instance, network_info, disk_info,
                       image_meta, rescue=None,
                       block_device_info=None):
        # NOTE(danms): Stringifying a NetworkInfo will take a lock. Do
        # this ahead of time so that we don't acquire it while also
        # holding the logging lock.
        network_info_str = str(network_info)
        msg = ('Start _get_guest_xml '
               'network_info=%(network_info)s '
               'disk_info=%(disk_info)s '
               'image_meta=%(image_meta)s rescue=%(rescue)s '
               'block_device_info=%(block_device_info)s' %
               {'network_info': network_info_str, 'disk_info': disk_info,
                'image_meta': image_meta, 'rescue': rescue,
                'block_device_info': block_device_info})
        # NOTE(mriedem): block_device_info can contain auth_password so we
        # need to sanitize the password in the message.
        LOG.debug(strutils.mask_password(msg), instance=instance)
        conf = self._get_guest_config(instance, network_info, image_meta,
                                      disk_info, rescue, block_device_info,
                                      context)
        xml = conf.to_xml()

        LOG.debug('End _get_guest_xml xml=%(xml)s',
                  {'xml': xml}, instance=instance)
        return xml
  • Generating the basic guest configuration (nova/virt/libvirt/driver.py):
    def _get_guest_config(self, instance, network_info, image_meta,
                          disk_info, rescue=None, block_device_info=None,
                          context=None):
        """Get config data for parameters.

        :param rescue: optional dictionary that should contain the key
            'ramdisk_id' if a ramdisk is needed for the rescue image and
            'kernel_id' if a kernel is needed for the rescue image.
        """
        flavor = instance.flavor
        inst_path = libvirt_utils.get_instance_path(instance)
        disk_mapping = disk_info['mapping']

        virt_type = CONF.libvirt.virt_type
        guest = vconfig.LibvirtConfigGuest()
        guest.virt_type = virt_type
        guest.name = instance.name
        guest.uuid = instance.uuid
        # We are using default unit for memory: KiB
        guest.memory = flavor.memory_mb * units.Ki
        guest.vcpus = flavor.vcpus
        allowed_cpus = hardware.get_vcpu_pin_set()

        guest_numa_config = self._get_guest_numa_config(
            instance.numa_topology, flavor, allowed_cpus, image_meta)

        guest.cpuset = guest_numa_config.cpuset
        guest.cputune = guest_numa_config.cputune
        guest.numatune = guest_numa_config.numatune

        guest.membacking = self._get_guest_memory_backing_config(
            instance.numa_topology,
            guest_numa_config.numatune,
            flavor)

        guest.metadata.append(self._get_guest_config_meta(instance))
        guest.idmaps = self._get_guest_idmaps()

        for event in self._supported_perf_events:
            guest.add_perf_event(event)

        self._update_guest_cputune(guest, flavor, virt_type)

        guest.cpu = self._get_guest_cpu_config(
            flavor, image_meta, guest_numa_config.numaconfig,
            instance.numa_topology)

        # Notes(yjiang5): we always sync the instance's vcpu model with
        # the corresponding config file.
        instance.vcpu_model = self._cpu_config_to_vcpu_model(
            guest.cpu, instance.vcpu_model)

        if 'root' in disk_mapping:
            root_device_name = block_device.prepend_dev(
                disk_mapping['root']['dev'])
        else:
            root_device_name = None

        if root_device_name:
            # NOTE(yamahata):
            # for nova.api.ec2.cloud.CloudController.get_metadata()
            instance.root_device_name = root_device_name

        guest.os_type = (fields.VMMode.get_from_instance(instance) or
                self._get_guest_os_type(virt_type))
        caps = self._host.get_capabilities()

        self._configure_guest_by_virt_type(guest, virt_type, caps, instance,
                                           image_meta, flavor,
                                           root_device_name)
        if virt_type not in ('lxc', 'uml'):
            self._conf_non_lxc_uml(virt_type, guest, root_device_name, rescue,
                    instance, inst_path, image_meta, disk_info)

        self._set_features(guest, instance.os_type, caps, virt_type)
        self._set_clock(guest, instance.os_type, image_meta, virt_type)

        storage_configs = self._get_guest_storage_config(
                instance, image_meta, disk_info, rescue, block_device_info,
                flavor, guest.os_type)
        for config in storage_configs:
            guest.add_device(config)

        for vif in network_info:
            config = self.vif_driver.get_config(
                instance, vif, image_meta,
                flavor, virt_type, self._host)
            guest.add_device(config)

        self._create_consoles(virt_type, guest, instance, flavor, image_meta)

        pointer = self._get_guest_pointer_model(guest.os_type, image_meta)
        if pointer:
            guest.add_device(pointer)

        if (CONF.spice.enabled and CONF.spice.agent_enabled and
                virt_type not in ('lxc', 'uml', 'xen')):
            channel = vconfig.LibvirtConfigGuestChannel()
            channel.type = 'spicevmc'
            channel.target_name = "com.redhat.spice.0"
            guest.add_device(channel)

        # NB some versions of libvirt support both SPICE and VNC
        # at the same time. We're not trying to second guess which
        # those versions are. We'll just let libvirt report the
        # errors appropriately if the user enables both.
        add_video_driver = False
        if ((CONF.vnc.enabled and
             virt_type not in ('lxc', 'uml'))):
            graphics = vconfig.LibvirtConfigGuestGraphics()
            graphics.type = "vnc"
            graphics.keymap = CONF.vnc.keymap
            graphics.listen = CONF.vnc.vncserver_listen
            guest.add_device(graphics)
            add_video_driver = True

        if (CONF.spice.enabled and
                virt_type not in ('lxc', 'uml', 'xen')):
            graphics = vconfig.LibvirtConfigGuestGraphics()
            graphics.type = "spice"
            graphics.keymap = CONF.spice.keymap
            graphics.listen = CONF.spice.server_listen
            guest.add_device(graphics)
            add_video_driver = True

        if add_video_driver:
            self._add_video_driver(guest, image_meta, flavor)

        # Qemu guest agent only support 'qemu' and 'kvm' hypervisor
        if virt_type in ('qemu', 'kvm'):
            self._set_qemu_guest_agent(guest, flavor, instance, image_meta)

        if virt_type in ('xen', 'qemu', 'kvm'):
            # Get all generic PCI devices (non-SR-IOV).
            for pci_dev in pci_manager.get_instance_pci_devs(instance):
                guest.add_device(self._get_guest_pci_device(pci_dev))
        else:
            # PCI devices is only supported for hypervisor 'xen', 'qemu' and
            # 'kvm'.
            pci_devs = pci_manager.get_instance_pci_devs(instance, 'all')
            if len(pci_devs) > 0:
                raise exception.PciDeviceUnsupportedHypervisor(
                    type=virt_type)

        # image meta takes precedence over flavor extra specs; disable the
        # watchdog action by default
        watchdog_action = (flavor.extra_specs.get('hw:watchdog_action')
                           or 'disabled')
        watchdog_action = image_meta.properties.get('hw_watchdog_action',
                                                    watchdog_action)

        # NB(sross): currently only actually supported by KVM/QEmu
        if watchdog_action != 'disabled':
            if watchdog_action in fields.WatchdogAction.ALL:
                bark = vconfig.LibvirtConfigGuestWatchdog()
                bark.action = watchdog_action
                guest.add_device(bark)
            else:
                raise exception.InvalidWatchdogAction(action=watchdog_action)

        # Memory balloon device only support 'qemu/kvm' and 'xen' hypervisor
        if (virt_type in ('xen', 'qemu', 'kvm') and
                CONF.libvirt.mem_stats_period_seconds > 0):
            balloon = vconfig.LibvirtConfigMemoryBalloon()
            if virt_type in ('qemu', 'kvm'):
                balloon.model = 'virtio'
            else:
                balloon.model = 'xen'
            balloon.period = CONF.libvirt.mem_stats_period_seconds
            guest.add_device(balloon)

        return guest
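
Note that for the watchdog action the precedence is the reverse of the CPU policy case: the image property overrides the flavor extra spec. A minimal sketch of just that lookup, with plain dicts standing in for `flavor.extra_specs` and `image_meta.properties` (the function name is hypothetical and validation against `fields.WatchdogAction.ALL` is left out):

```python
from typing import Dict


def resolve_watchdog_action(extra_specs: Dict[str, str],
                            image_props: Dict[str, str]) -> str:
    """Image property wins; flavor extra spec is the fallback."""
    # Flavor value, or 'disabled' when unset or empty.
    action = extra_specs.get('hw:watchdog_action') or 'disabled'
    # The image property, if present, replaces the flavor value.
    return image_props.get('hw_watchdog_action', action)
```

For example, a flavor asking for `reset` combined with an image asking for `pause` resolves to `pause`, and with neither set the result is `disabled`.
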
  • Obtaining the guest NUMA configuration (nova/virt/libvirt/driver.py):
    def _get_guest_numa_config(self, instance_numa_topology, flavor,
                               allowed_cpus=None, image_meta=None):
        """Returns the config objects for the guest NUMA specs.

        Determines the CPUs that the guest can be pinned to if the guest
        specifies a cell topology and the host supports it. Constructs the
        libvirt XML config object representing the NUMA topology selected
        for the guest. Returns a tuple of:

            (cpu_set, guest_cpu_tune, guest_cpu_numa, guest_numa_tune)

        With the following caveats:

            a) If there is no specified guest NUMA topology, then
               all tuple elements except cpu_set shall be None. cpu_set
               will be populated with the chosen CPUs that the guest
               allowed CPUs fit within, which could be the supplied
               allowed_cpus value if the host doesn't support NUMA
               topologies.

            b) If there is a specified guest NUMA topology, then
               cpu_set will be None and guest_cpu_numa will be the
               LibvirtConfigGuestCPUNUMA object representing the guest's
               NUMA topology. If the host supports NUMA, then guest_cpu_tune
               will contain a LibvirtConfigGuestCPUTune object representing
               the optimized chosen cells that match the host capabilities
               with the instance's requested topology. If the host does
               not support NUMA, then guest_cpu_tune and guest_numa_tune
               will be None.
        """

        if (not self._has_numa_support() and
                instance_numa_topology is not None):
            # We should not get here, since we should have avoided
            # reporting NUMA topology from _get_host_numa_topology
            # in the first place. Just in case of a scheduler
            # mess up though, raise an exception
            raise exception.NUMATopologyUnsupported()

        topology = self._get_host_numa_topology()

        # We have instance NUMA so translate it to the config class
        guest_cpu_numa_config = self._get_cpu_numa_config_from_instance(
                instance_numa_topology,
                self._wants_hugepages(topology, instance_numa_topology))

        if not guest_cpu_numa_config:
            # No NUMA topology defined for instance - let the host kernel deal
            # with the NUMA effects.
            # TODO(ndipanov): Attempt to spread the instance
            # across NUMA nodes and expose the topology to the
            # instance as an optimisation
            return GuestNumaConfig(allowed_cpus, None, None, None)
        else:
            if topology:
                # Now get the CpuTune configuration from the numa_topology
                guest_cpu_tune = vconfig.LibvirtConfigGuestCPUTune()
                guest_numa_tune = vconfig.LibvirtConfigGuestNUMATune()
                emupcpus = []

                numa_mem = vconfig.LibvirtConfigGuestNUMATuneMemory()
                numa_memnodes = [vconfig.LibvirtConfigGuestNUMATuneMemNode()
                                 for _ in guest_cpu_numa_config.cells]

                vcpus_rt = set([])
                wants_realtime = hardware.is_realtime_enabled(flavor)
                if wants_realtime:
                    if not self._host.has_min_version(
                            MIN_LIBVIRT_REALTIME_VERSION):
                        raise exception.RealtimePolicyNotSupported()
                    # Prepare realtime config for libvirt
                    vcpus_rt = hardware.vcpus_realtime_topology(
                        flavor, image_meta)
                    vcpusched = vconfig.LibvirtConfigGuestCPUTuneVCPUSched()
                    vcpusched.vcpus = vcpus_rt
                    vcpusched.scheduler = "fifo"
                    vcpusched.priority = (
                        CONF.libvirt.realtime_scheduler_priority)
                    guest_cpu_tune.vcpusched.append(vcpusched)

                for host_cell in topology.cells:
                    for guest_node_id, guest_config_cell in enumerate(
                            guest_cpu_numa_config.cells):
                        if guest_config_cell.id == host_cell.id:
                            node = numa_memnodes[guest_node_id]
                            node.cellid = guest_node_id
                            node.nodeset = [host_cell.id]
                            node.mode = "strict"

                            numa_mem.nodeset.append(host_cell.id)

                            object_numa_cell = (
                                    instance_numa_topology.cells[guest_node_id]
                                )
                            for cpu in guest_config_cell.cpus:
                                pin_cpuset = (
                                    vconfig.LibvirtConfigGuestCPUTuneVCPUPin())
                                pin_cpuset.id = cpu
                                # If there is pinning information in the cell
                                # we pin to individual CPUs, otherwise we float
                                # over the whole host NUMA node

                                if (object_numa_cell.cpu_pinning and
                                        self._has_cpu_policy_support()):
                                    pcpu = object_numa_cell.cpu_pinning[cpu]
                                    pin_cpuset.cpuset = set([pcpu])
                                else:
                                    pin_cpuset.cpuset = host_cell.cpuset
                                if not wants_realtime or cpu not in vcpus_rt:
                                    # - If realtime IS NOT enabled, the
                                    #   emulator threads are allowed to float
                                    #   across all the pCPUs associated with
                                    #   the guest vCPUs ("not wants_realtime"
                                    #   is true, so we add all pcpus)
                                    # - If realtime IS enabled, then at least
                                    #   1 vCPU is required to be set aside for
                                    #   non-realtime usage. The emulator
                                    #   threads are allowed to float across the
                                    #   pCPUs that are associated with the
                                    #   non-realtime VCPUs (the "cpu not in
                                    #   vcpu_rt" check deals with this
                                    #   filtering)
                                    emupcpus.extend(pin_cpuset.cpuset)
                                guest_cpu_tune.vcpupin.append(pin_cpuset)

                # TODO(berrange) When the guest has >1 NUMA node, it will
                # span multiple host NUMA nodes. By pinning emulator threads
                # to the union of all nodes, we guarantee there will be
                # cross-node memory access by the emulator threads when
                # responding to guest I/O operations. The only way to avoid
                # this would be to pin emulator threads to a single node and
                # tell the guest OS to only do I/O from one of its virtual
                # NUMA nodes. This is not even remotely practical.
                #
                # The long term solution is to make use of a new QEMU feature
                # called "I/O Threads" which will let us configure an explicit
                # I/O thread for each guest vCPU or guest NUMA node. It is
                # still TBD how to make use of this feature though, especially
                # how to associate IO threads with guest devices to eliminate
                # cross NUMA node traffic. This is an area of investigation
                # for QEMU community devs.
                emulatorpin = vconfig.LibvirtConfigGuestCPUTuneEmulatorPin()
                emulatorpin.cpuset = set(emupcpus)
                guest_cpu_tune.emulatorpin = emulatorpin
                # Sort the vcpupin list per vCPU id for human-friendlier XML
                guest_cpu_tune.vcpupin.sort(key=operator.attrgetter("id"))

                guest_numa_tune.memory = numa_mem
                guest_numa_tune.memnodes = numa_memnodes

                # normalize cell.id
                for i, (cell, memnode) in enumerate(
                                            zip(guest_cpu_numa_config.cells,
                                                guest_numa_tune.memnodes)):
                    cell.id = i
                    memnode.cellid = i

                return GuestNumaConfig(None, guest_cpu_tune,
                                       guest_cpu_numa_config,
                                       guest_numa_tune)
            else:
                return GuestNumaConfig(allowed_cpus, None,
                                       guest_cpu_numa_config, None)
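
The emulator-thread rule used in the loop above can be isolated into a small sketch. This is a simplified model rather than Nova code: the function name `emulator_pcpus` and its arguments are hypothetical, and the vCPU-to-pCPU mapping is passed in as a plain dict instead of being read from the NUMA topology objects.

```python
def emulator_pcpus(vcpu_pins, realtime_vcpus=frozenset(), wants_realtime=False):
    """Return the host CPUs the emulator threads may float across.

    vcpu_pins maps a guest vCPU id to the set of host pCPUs it may run on.
    The condition mirrors `not wants_realtime or cpu not in vcpus_rt`.
    """
    pcpus = set()
    for vcpu, cpuset in vcpu_pins.items():
        # Without realtime every vCPU contributes its pCPUs; with realtime
        # only the vCPUs left outside the realtime mask contribute.
        if not wants_realtime or vcpu not in realtime_vcpus:
            pcpus |= set(cpuset)
    return pcpus


# Hypothetical pinning: four vCPUs, each pinned to one pCPU.
pins = {0: {4}, 1: {5}, 2: {6}, 3: {7}}
print(sorted(emulator_pcpus(pins)))
print(sorted(emulator_pcpus(pins, realtime_vcpus={1, 2, 3}, wants_realtime=True)))
```

With realtime enabled for vCPUs 1-3, only the pCPU of vCPU 0 remains available to the emulator threads, which is why at least one vCPU has to stay non-realtime.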
  • Get the guest memory backing configuration (nova/virt/libvirt/driver.py):
    def _get_guest_memory_backing_config(
            self, inst_topology, numatune, flavor):
        wantsmempages = False
        if inst_topology:
            for cell in inst_topology.cells:
                if cell.pagesize:
                    wantsmempages = True
                    break

        wantsrealtime = hardware.is_realtime_enabled(flavor)

        membacking = None
        if wantsmempages:
            pages = self._get_memory_backing_hugepages_support(
                inst_topology, numatune)
            if pages:
                membacking = vconfig.LibvirtConfigGuestMemoryBacking()
                membacking.hugepages = pages
        if wantsrealtime:
            if not membacking:
                membacking = vconfig.LibvirtConfigGuestMemoryBacking()
            membacking.locked = True
            membacking.sharedpages = False

        return membacking
  • Get the guest configuration metadata (nova/virt/libvirt/driver.py):
    def _get_guest_config_meta(self, instance):
        """Get metadata config for guest."""

        meta = vconfig.LibvirtConfigGuestMetaNovaInstance()
        meta.package = version.version_string_with_package()
        meta.name = instance.display_name
        meta.creationTime = time.time()

        if instance.image_ref not in ("", None):
            meta.roottype = "image"
            meta.rootid = instance.image_ref

        system_meta = instance.system_metadata
        ometa = vconfig.LibvirtConfigGuestMetaNovaOwner()
        ometa.userid = instance.user_id
        ometa.username = system_meta.get('owner_user_name', 'N/A')
        ometa.projectid = instance.project_id
        ometa.projectname = system_meta.get('owner_project_name', 'N/A')
        meta.owner = ometa

        fmeta = vconfig.LibvirtConfigGuestMetaNovaFlavor()
        flavor = instance.flavor
        fmeta.name = flavor.name
        fmeta.memory = flavor.memory_mb
        fmeta.vcpus = flavor.vcpus
        fmeta.ephemeral = flavor.ephemeral_gb
        fmeta.disk = flavor.root_gb
        fmeta.swap = flavor.swap

        meta.flavor = fmeta

        return meta
  • Update the guest CPU tuning parameters (nova/virt/libvirt/driver.py):
    def _update_guest_cputune(self, guest, flavor, virt_type):
        is_able = self._host.is_cpu_control_policy_capable()

        cputuning = ['shares', 'period', 'quota']
        wants_cputune = any([k for k in cputuning
            if "quota:cpu_" + k in flavor.extra_specs.keys()])

        if wants_cputune and not is_able:
            raise exception.UnsupportedHostCPUControlPolicy()

        if not is_able or virt_type not in ('lxc', 'kvm', 'qemu'):
            return

        if guest.cputune is None:
            guest.cputune = vconfig.LibvirtConfigGuestCPUTune()
            # Setting the default cpu.shares value to be a value
            # dependent on the number of vcpus
        guest.cputune.shares = 1024 * guest.vcpus

        for name in cputuning:
            key = "quota:cpu_" + name
            if key in flavor.extra_specs:
                setattr(guest.cputune, name,
                        int(flavor.extra_specs[key]))
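
As a rough illustration of what the function above computes, the sketch below resolves the effective cputune values from a flavor's extra specs. It is a standalone approximation: `resolve_cputune` is a hypothetical helper, the guest is reduced to a vCPU count, and the capability checks on the host are left out.

```python
CPU_TUNING_KEYS = ("shares", "period", "quota")


def resolve_cputune(vcpus, extra_specs):
    """Return the effective cputune values for a guest.

    shares defaults to 1024 per vCPU; any quota:cpu_* extra spec
    overrides the matching field.
    """
    tune = {"shares": 1024 * vcpus, "period": None, "quota": None}
    for name in CPU_TUNING_KEYS:
        key = "quota:cpu_" + name
        if key in extra_specs:
            # Extra specs are stored as strings, so convert explicitly.
            tune[name] = int(extra_specs[key])
    return tune


print(resolve_cputune(4, {}))
print(resolve_cputune(4, {"quota:cpu_period": "100000",
                          "quota:cpu_quota": "50000"}))
```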


  • Get the guest CPU configuration (nova/virt/libvirt/driver.py):
    def _get_guest_cpu_config(self, flavor, image_meta,
                              guest_cpu_numa_config, instance_numa_topology):
        cpu = self._get_guest_cpu_model_config()

        if cpu is None:
            return None

        topology = hardware.get_best_cpu_topology(
                flavor, image_meta, numa_topology=instance_numa_topology)

        cpu.sockets = topology.sockets
        cpu.cores = topology.cores
        cpu.threads = topology.threads
        cpu.numa = guest_cpu_numa_config

        return cpu
  • Build the internal vCPU model from the libvirt CPU configuration (nova/virt/libvirt/driver.py):
    def _cpu_config_to_vcpu_model(self, cpu_config, vcpu_model):
        """Update VirtCPUModel object according to libvirt CPU config.

        :param:cpu_config: vconfig.LibvirtConfigGuestCPU presenting the
                           instance's virtual cpu configuration.
        :param:vcpu_model: VirtCPUModel object. A new object will be created
                           if None.

        :return: Updated VirtCPUModel object, or None if cpu_config is None

        """

        if not cpu_config:
            return
        if not vcpu_model:
            vcpu_model = objects.VirtCPUModel()

        vcpu_model.arch = cpu_config.arch
        vcpu_model.vendor = cpu_config.vendor
        vcpu_model.model = cpu_config.model
        vcpu_model.mode = cpu_config.mode
        vcpu_model.match = cpu_config.match

        if cpu_config.sockets:
            vcpu_model.topology = objects.VirtCPUTopology(
                sockets=cpu_config.sockets,
                cores=cpu_config.cores,
                threads=cpu_config.threads)
        else:
            vcpu_model.topology = None

        features = [objects.VirtCPUFeature(
            name=f.name,
            policy=f.policy) for f in cpu_config.features]
        vcpu_model.features = features

        return vcpu_model

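##CPU pinning

To reduce CPU contention and improve the CPU cache hit rate, guest vCPUs can be pinned to host pCPUs.

Parameter configuration

####Deprecated parameters

Note: the parameters below are obsolete!

Flavor extra specs for CPU pinning: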

hw:cpu_policy=shared|dedicated
hw:cpu_thread_policy=avoid|separate|isolate|prefer

Image metadata properties for CPU pinning:

hw_cpu_policy=shared|dedicated
hw_cpu_thread_policy=avoid|separate|isolate|prefer

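With hw:cpu_policy=shared, the behaviour is the same as the existing default CPU configuration: no CPU pinning is performed.

With hw:cpu_policy=dedicated, the guest vCPUs are pinned.

The thread policies are:

  • avoid - the guest is not scheduled onto a host that has hyperthreading.

  • separate - each vCPU is placed on a different core.

  • isolate - each vCPU is placed on a different core and has exclusive use of it; no other vCPU can be placed on that core.

  • prefer - the guest vCPUs are placed on the same core, so that they become sibling threads.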

####Current parameters

Testing the parameters above on the Ocata release ran into problems.

  • Description in the Ocata source code (nova/objects/fields.py):
class CPUThreadAllocationPolicy(BaseNovaEnum):

    # prefer (default): The host may or may not have hyperthreads. This
    #  retains the legacy behavior, whereby siblings are preferred when
    #  available. This is the default if no policy is specified.
    PREFER = "prefer"
    # isolate: The host may or may not have hyperthreads. If hyperthreads are
    #  present, each vCPU will be placed on a different core and no vCPUs from
    #  other guests will be able to be placed on the same core, i.e. one
    #  thread sibling is guaranteed to always be unused. If hyperthreads are
    #  not present, each vCPU will still be placed on a different core and
    #  there are no thread siblings to be concerned with.
    ISOLATE = "isolate"
    # require: The host must have hyperthreads. Each vCPU will be allocated on
    #   thread siblings.
    REQUIRE = "require"

    ALL = (PREFER, ISOLATE, REQUIRE)
Proposed change
The flavor extra specs will be enhanced to support one new parameter:

hw:cpu_thread_policy=prefer|isolate|require
This policy is an extension to the already implemented CPU policy parameter:

hw:cpu_policy=shared|dedicated

The threads policy will control how the scheduler / virt driver places guests with respect to CPU threads. It will only apply if the CPU policy is ‘dedicated’, i.e. guest vCPUs are being pinned to host pCPUs.

prefer: The host may or may not have an SMT architecture. This retains the legacy behavior, whereby siblings are preferred when available. This is the default if no policy is specified.

isolate: The host must not have an SMT architecture, or must emulate a non-SMT architecture. If the host does not have an SMT architecture, each vCPU will simply be placed on a different core as expected. If the host does have an SMT architecture (i.e. one or more cores have “thread siblings”) then each vCPU will be placed on a different physical core and no vCPUs from other guests will be placed on the same core. As such, one thread sibling is always guaranteed to always be unused.

require: The host must have an SMT architecture. Each vCPU will be allocated on thread siblings. If the host does not have an SMT architecture then it will not be used. If the host has an SMT architecture, but not enough cores with free thread siblings are available, then scheduling will fail.


The image metadata properties will also allow specification of the threads policy:

hw_cpu_thread_policy=prefer|isolate|require

This will only be honored if the flavor specifies the ‘prefer’ policy, either explicitly or implicitly as the default option. This ensures that the cloud administrator can have absolute control over threads policy if desired.
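  • Summary of the new parameters:
prefer - the default policy; if the host has SMT, vCPUs are preferentially placed on thread siblings of the same core;
isolate - each vCPU is placed on a different core, and no vCPU from another guest can use those cores;
require - only usable on hosts with SMT; each vCPU is allocated on thread siblings.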
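
To make the difference between the three policies concrete, here is a toy placement model. It is deliberately simplified and is not the algorithm Nova uses: `pick_pcpus`, the list-of-cores host description and the `used` set are all assumptions made for illustration.

```python
def pick_pcpus(cores, used, n_vcpus, policy="prefer"):
    """Pick host CPUs for n_vcpus under a simplified thread policy model.

    cores: list of lists, each inner list holds the pCPU ids (hardware
           threads) of one physical core.
    used:  set of pCPU ids already taken by other guests.
    Returns a sorted list of pCPU ids, or None if the request does not fit.
    """
    free = [[t for t in core if t not in used] for core in cores]
    has_smt = any(len(core) > 1 for core in cores)

    if policy == "isolate":
        # One vCPU per core, and only cores with no thread in use, so the
        # remaining sibling thread stays idle.
        whole = [c for c, f in zip(cores, free) if len(f) == len(c)]
        chosen = [core[0] for core in whole[:n_vcpus]]
    elif policy == "require":
        if not has_smt:
            return None
        # Only cores whose siblings are all free; use every thread of each.
        whole = [c for c, f in zip(cores, free)
                 if len(c) > 1 and len(f) == len(c)]
        chosen = [t for core in whole for t in core][:n_vcpus]
    elif policy == "prefer":
        # Fill the cores with the most free threads first, then fall back
        # to whatever single threads remain.
        ordered = sorted(free, key=len, reverse=True)
        chosen = [t for f in ordered for t in f][:n_vcpus]
    else:
        raise ValueError("unknown policy: %s" % policy)

    return sorted(chosen) if len(chosen) == n_vcpus else None


# Hypothetical host: 4 cores, 2 threads each (pCPU n and n + 4 are siblings).
host = [[0, 4], [1, 5], [2, 6], [3, 7]]
for policy in ("prefer", "isolate", "require"):
    print(policy, pick_pcpus(host, set(), 4, policy))
```

On this hypothetical host, prefer and require both pack four vCPUs onto two cores, while isolate spreads them over four cores and leaves every sibling thread unused.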

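####Other parameters

  • Kernel parameter zone_reclaim_mode

This controls what happens when a NUMA node runs short of free memory. With 0, the kernel tends to allocate memory from a remote node; with 1, it tends to reclaim cache memory on the local node first. Cache usually matters a lot for performance, so 0 is generally the better choice.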

$ echo "vm.zone_reclaim_mode = 0" >> /etc/sysctl.conf
$ sysctl -p
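  • Kernel parameter overcommit_memory
Possible values: 0, 1, 2.

0: the kernel checks heuristically whether enough memory is available for the requesting process; if so, the allocation is allowed, otherwise it fails and an error is returned to the process.

1: the kernel always allows the allocation, regardless of the current memory state.

2: the kernel refuses any allocation that would push the total committed memory above swap plus a configurable percentage (vm.overcommit_ratio) of physical RAM.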
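
For mode 2 the limit can be worked out by hand. The sketch below follows the CommitLimit formula from the kernel's /proc documentation; the function name `commit_limit_kb` and the host sizes in the example are made up for illustration.

```python
def commit_limit_kb(ram_kb, swap_kb, overcommit_ratio=50, hugetlb_kb=0):
    """Approximate CommitLimit (in kB) when vm.overcommit_memory = 2.

    CommitLimit = swap + (RAM - huge pages) * overcommit_ratio / 100
    """
    return swap_kb + (ram_kb - hugetlb_kb) * overcommit_ratio // 100


GIB_KB = 1024 * 1024
# Hypothetical host: 64 GiB RAM, 8 GiB swap, default ratio of 50.
limit = commit_limit_kb(64 * GIB_KB, 8 * GIB_KB)
print(limit // GIB_KB, "GiB")
```

On a real host the value actually in force can be compared against the CommitLimit line in /proc/meminfo.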

  • Nova parameter vcpu_pin_set
$ vi /etc/nova/nova.conf

[DEFAULT]
vcpu_pin_set = 4-12,^8,15     

#Presumably, this would ensure that all instances only run on CPUs 4,5,6,7,9,10,11,12,15

This restricts which host CPU cores Nova may use for instances.
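
The range syntax used by vcpu_pin_set (ranges, single ids and `^` exclusions) can be expanded with a few lines of Python. This is an independent sketch, not Nova's own parser; `parse_cpu_spec` is a hypothetical name.

```python
def parse_cpu_spec(spec):
    """Expand a spec such as "4-12,^8,15" into a set of CPU ids."""
    include, exclude = set(), set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        target = include
        if part.startswith("^"):
            # A leading caret removes the id (or range) from the result.
            target, part = exclude, part[1:]
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            if start > end:
                raise ValueError("invalid range: %s" % part)
            target.update(range(start, end + 1))
        else:
            target.add(int(part))
    return include - exclude


print(sorted(parse_cpu_spec("4-12,^8,15")))
# [4, 5, 6, 7, 9, 10, 11, 12, 15]
```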

Filter configuration

Enable the NUMATopologyFilter scheduler filter.

###Pinning tests

  • Create a flavor with 16 vCPUs:
$ openstack flavor create --vcpus 16 --ram 64 --disk 1 machine.cpu
+----------------------------+--------------------------------------+
| Field                      | Value                                |
+----------------------------+--------------------------------------+
| OS-FLV-DISABLED:disabled   | False                                |
| OS-FLV-EXT-DATA:ephemeral  | 0                                    |
| disk                       | 1                                    |
| id                         | 82b10589-5a06-4e48-a770-e8c0e275ba4d |
| name                       | machine.cpu                          |
| os-flavor-access:is_public | True                                 |
| properties                 |                                      |
| ram                        | 64                                   |
| rxtx_factor                | 1.0                                  |
| swap                       |                                      |
| vcpus                      | 16                                   |
+----------------------------+--------------------------------------+
  • Place all vCPUs on a single NUMA node:
$ nova flavor-key machine.cpu set hw:numa_nodes=1 hw:numa_cpus.0=0-15 hw:numa_mem.0=64
$ openstack flavor show machine.cpu
+----------------------------+--------------------------------------------------------------+
| Field                      | Value                                                        |
+----------------------------+--------------------------------------------------------------+
| OS-FLV-DISABLED:disabled   | False                                                        |
| OS-FLV-EXT-DATA:ephemeral  | 0                                                            |
| access_project_ids         | None                                                         |
| disk                       | 1                                                            |
| id                         | 7dce75ff-ee3a-41c4-a4c8-69416c92f5c1                         |
| name                       | machine.cpu                                                  |
| os-flavor-access:is_public | True                                                         |
| properties                 | hw:numa_cpus.0='0-15', hw:numa_mem.0='64', hw:numa_nodes='1' |
| ram                        | 64                                                           |
| rxtx_factor                | 1.0                                                          |
| swap                       |                                                              |
| vcpus                      | 16                                                           |
+----------------------------+--------------------------------------------------------------+
  • Create a script that looks up CPU IDs (invoked as ./cpu_id.sh in the commands further down):
#!/bin/bash

# get processor $1 's core and socket id.


while read line; do
    if [[ -z "$line" && $1 = $p_id ]]; then
        printf '%-3s %-3s %-3s\n' $p_id $c_id $s_id
        break
    fi

    if echo "$line" | grep -q "^processor"; then
        p_id=`echo "$line" | awk -F: '{print $2}' | tr -d ' '` 
    fi

    if echo "$line" | grep -q "^core id"; then
        c_id=`echo "$line" | awk -F: '{print $2}' | tr -d ' '` 
    fi

    if echo "$line" | grep -q "^physical id"; then
        s_id=`echo "$line" | awk -F: '{print $2}' | tr -d ' '` 
    fi
done < /proc/cpuinfo
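
The same lookup can be done for all processors at once in Python. The sketch below assumes the x86 layout of /proc/cpuinfo (one blank-line-separated block per processor with `processor`, `physical id` and `core id` fields); the function names and the sample text are illustrative.

```python
def cpu_topology(cpuinfo_text):
    """Map processor id -> (core id, physical id) from /proc/cpuinfo text."""
    topo, current = {}, {}
    # The extra empty string flushes the last block if the text does not
    # end with a blank line.
    for line in cpuinfo_text.splitlines() + [""]:
        if not line.strip():
            if "processor" in current:
                topo[current["processor"]] = (
                    current.get("core id"), current.get("physical id"))
            current = {}
            continue
        key, _, value = line.partition(":")
        key = key.strip()
        if key in ("processor", "core id", "physical id"):
            current[key] = int(value.strip())
    return topo


def thread_siblings(topo):
    """Group processor ids that share the same (physical id, core id)."""
    groups = {}
    for proc, (core, socket) in topo.items():
        groups.setdefault((socket, core), []).append(proc)
    return {key: sorted(procs) for key, procs in groups.items()}


# Minimal made-up sample; on a real host use open("/proc/cpuinfo").read().
SAMPLE = """processor\t: 0
physical id\t: 0
core id\t: 0

processor\t: 36
physical id\t: 0
core id\t: 0
"""
print(thread_siblings(cpu_topology(SAMPLE)))
```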

####Ordinary VM

  • Create a VM with the default policy:
$ openstack server create --image cirros --flavor machine.cpu --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 --availability-zone az01:osdev-01 server.cpu.default
  • Check the VM's CPU affinity:
$ ps -aux | grep `openstack server show server.cpu.default | grep instance_name | awk '{print $4}'` | awk '{{if($11=="/usr/libexec/qemu-kvm") {print $2}}}' | xargs taskset -c -p
pid 40296's current affinity list: 0-17,36-53

$ ps -aux | grep `openstack server show server.cpu.default | grep instance_name | awk '{print $4}'` | awk '{{if($11=="/usr/libexec/qemu-kvm") {print $2}}}' | xargs -I {} find /proc/{}/task/ -name "status" | xargs grep Cpus_allowed_list
/proc/40296/task/40296/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40310/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40313/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40314/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40315/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40316/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40317/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40318/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40319/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40320/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40321/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40322/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40323/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40324/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40325/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40326/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40327/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40328/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40329/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40337/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40580/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40587/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40590/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40591/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40902/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40903/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/41007/status:Cpus_allowed_list:	0-17,36-53
  • Check which CPUs the VM's threads are running on:
$ ps -aux | grep `openstack server show server.cpu.default | grep instance_name | awk '{print $4}'` | awk '{{if($11=="/usr/libexec/qemu-kvm") {print $2}}}' | xargs ps -m -o pid,psr,comm -p
   PID PSR COMMAND
 40296   - qemu-kvm
     -  15 -
     -  42 -
     -   7 -
     -   5 -
     -   2 -
     -   0 -
     -  14 -
     -  41 -
     -   0 -
     -   7 -
     -   0 -
     -  11 -
     -   5 -
     -  43 -
     -   9 -
     -   0 -
     -   5 -
     -   2 -
     -  53 -

$ ps -aux | grep `openstack server show server.cpu.default | grep instance_name | awk '{print $4}'` | awk '{{if($11=="/usr/libexec/qemu-kvm") {print $2}}}' | xargs ps -m -o pid,psr,comm -p
   PID PSR COMMAND
 40296   - qemu-kvm
     -  48 -
     -  42 -
     -   1 -
     -  16 -
     -   3 -
     -   8 -
     -   5 -
     -   8 -
     -  11 -
     -   0 -
     -   4 -
     -   5 -
     -   7 -
     -   5 -
     -   8 -
     -   0 -
     -  13 -
     -  10 -
     -  53 -

The CPUs keep changing between the two samples; some threads run on the same core and others do not.

####avoid VM

  • Set the avoid thread policy:
$ nova flavor-key machine.cpu set hw:cpu_policy=dedicated hw:cpu_thread_policy=avoid
  • Create the VM:
$ openstack server create --image cirros --flavor machine.cpu --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 --availability-zone az01:osdev-01 server.cpu.avoid
Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<type 'exceptions.ValueError'> (HTTP 500) (Request-ID: req-fbb4aef8-a2bb-47af-bea0-e776c83ae5e9)

The VM cannot be created.

  • Check the error log:
$ tailf /var/lib/docker/volumes/kolla_logs/_data/nova/nova-api.log
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions [req-fbb4aef8-a2bb-47af-bea0-e776c83ae5e9 03e0cf5adea04b73a13bc45a0306171b 1b50364d35624d0e8affe0721866fda1 - default default] Unexpected exception in API method
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions Traceback (most recent call last):
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions   File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/api/openstack/extensions.py", line 338, in wrapped
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions     return f(*args, **kwargs)
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions   File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/api/validation/__init__.py", line 108, in wrapper
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions     return func(*args, **kwargs)
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions   File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/api/validation/__init__.py", line 108, in wrapper
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions     return func(*args, **kwargs)
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions   File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/api/validation/__init__.py", line 108, in wrapper
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions     return func(*args, **kwargs)
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions   File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/api/validation/__init__.py", line 108, in wrapper
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions     return func(*args, **kwargs)
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions   File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/api/validation/__init__.py", line 108, in wrapper
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions     return func(*args, **kwargs)
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions   File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/api/validation/__init__.py", line 108, in wrapper
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions     return func(*args, **kwargs)
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions   File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/api/openstack/compute/servers.py", line 642, in create
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions     **create_kwargs)
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions   File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/hooks.py", line 154, in inner
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions     rv = f(*args, **kwargs)
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions   File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/compute/api.py", line 1620, in create
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions     check_server_group_quota=check_server_group_quota)
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions   File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/compute/api.py", line 1186, in _create_instance
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions     reservation_id, max_count)
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions   File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/compute/api.py", line 889, in _validate_and_build_base_options
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions     instance_type, image_meta)
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions   File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/virt/hardware.py", line 1293, in numa_get_constraints
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions     cell.cpu_thread_policy = cpu_thread_policy
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions   File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_versionedobjects/base.py", line 72, in setter
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions     field_value = field.coerce(self, name, value)
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions   File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_versionedobjects/fields.py", line 195, in coerce
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions     return self._type.coerce(obj, attr, value)
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions   File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_versionedobjects/fields.py", line 317, in coerce
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions     raise ValueError(msg)
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions ValueError: Field value avoid is invalid
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions 
2018-03-21 19:34:03.102 27 INFO nova.api.openstack.wsgi [req-fbb4aef8-a2bb-47af-bea0-e776c83ae5e9 03e0cf5adea04b73a13bc45a0306171b 1b50364d35624d0e8affe0721866fda1 - default default] HTTP exception thrown: Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<type 'exceptions.ValueError'>

The avoid value is not valid.
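
The traceback ends in an enum field rejecting the value. A minimal imitation of that check is shown below; it is not the oslo.versionedobjects implementation, and `coerce_thread_policy` is a hypothetical name. The allowed values are the ones listed in CPUThreadAllocationPolicy for Ocata.

```python
VALID_THREAD_POLICIES = ("prefer", "isolate", "require")


def coerce_thread_policy(value):
    """Reject any value outside the Ocata thread policy enum."""
    if value not in VALID_THREAD_POLICIES:
        raise ValueError("Field value %s is invalid" % value)
    return value


for candidate in ("prefer", "avoid"):
    try:
        print(candidate, "->", coerce_thread_policy(candidate))
    except ValueError as exc:
        print(candidate, "->", exc)
```

Because the error surfaces as an unhandled ValueError in the API layer, the client sees an HTTP 500 rather than a validation message.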

####prefer VM

  • Set the prefer thread policy:
$ nova flavor-key machine.cpu set hw:cpu_policy=dedicated hw:cpu_thread_policy=prefer
  • Create the VM:
$ openstack server create --image cirros --flavor machine.cpu --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 --availability-zone az01:osdev-01 server.cpu.prefer
  • Check the VM's CPU affinity:
$ ps -aux | grep `openstack server show server.cpu.prefer | grep instance_name | awk '{print $4}'` | awk '{{if($11=="/usr/libexec/qemu-kvm") {print $2}}}' | xargs taskset -c -p
pid 187669's current affinity list: 0-2,7,8,14-16,36-38,43,44,50-52

$ ps -aux | grep `openstack server show server.cpu.prefer | grep instance_name | awk '{print $4}'` | awk '{{if($11=="/usr/libexec/qemu-kvm") {print $2}}}' | xargs -I {} find /proc/{}/task/ -name "status" | xargs grep Cpus_allowed_list
/proc/187669/task/187669/status:Cpus_allowed_list:	0-2,7-8,14-16,36-38,43-44,50-52
/proc/187669/task/187671/status:Cpus_allowed_list:	0-2,7-8,14-16,36-38,43-44,50-52
/proc/187669/task/187675/status:Cpus_allowed_list:	43
/proc/187669/task/187676/status:Cpus_allowed_list:	7
/proc/187669/task/187677/status:Cpus_allowed_list:	16
/proc/187669/task/187678/status:Cpus_allowed_list:	52
/proc/187669/task/187679/status:Cpus_allowed_list:	2
/proc/187669/task/187680/status:Cpus_allowed_list:	38
/proc/187669/task/187681/status:Cpus_allowed_list:	8
/proc/187669/task/187682/status:Cpus_allowed_list:	44
/proc/187669/task/187683/status:Cpus_allowed_list:	50
/proc/187669/task/187684/status:Cpus_allowed_list:	14
/proc/187669/task/187685/status:Cpus_allowed_list:	0
/proc/187669/task/187686/status:Cpus_allowed_list:	36
/proc/187669/task/187687/status:Cpus_allowed_list:	51
/proc/187669/task/187688/status:Cpus_allowed_list:	15
/proc/187669/task/187689/status:Cpus_allowed_list:	1
/proc/187669/task/187690/status:Cpus_allowed_list:	37
/proc/187669/task/187692/status:Cpus_allowed_list:	0-2,7-8,14-16,36-38,43-44,50-52
  • Check which CPUs the VM's threads are running on:
$ ps -aux | grep `openstack server show server.cpu.prefer | grep instance_name | awk '{print $4}'` | awk '{{if($11=="/usr/libexec/qemu-kvm") {print $2}}}' | xargs ps -m -o pid,psr,comm -p | awk 'NR>2{print $2}' | sort -n | uniq | xargs -n1 ./cpu_id.sh
0   0   0  
1   1   0  
2   2   0  
7   10  0  
8   11  0  
14  24  0  
15  25  0  
16  26  0  
36  0   0  
37  1   0  
38  2   0  
43  10  0  
44  11  0  
50  24  0  
51  25  0  
52  26  0

The VM's vCPUs are pinned, and every allocated core carries two vCPUs (this host uses SMT, so sibling threads are preferred).
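
The observation can be checked mechanically by counting how many of the listed processors fall on each core. The rows below are copied from the cpu_id.sh output above (processor, core id, socket id); the helper name `threads_per_core` is made up.

```python
from collections import Counter

# (processor, core id, socket id) rows from the prefer instance.
PREFER_ROWS = [
    (0, 0, 0), (1, 1, 0), (2, 2, 0), (7, 10, 0), (8, 11, 0),
    (14, 24, 0), (15, 25, 0), (16, 26, 0), (36, 0, 0), (37, 1, 0),
    (38, 2, 0), (43, 10, 0), (44, 11, 0), (50, 24, 0), (51, 25, 0),
    (52, 26, 0),
]


def threads_per_core(rows):
    """Count how many listed processors fall on each (socket, core)."""
    return Counter((socket, core) for _proc, core, socket in rows)


counts = threads_per_core(PREFER_ROWS)
print(len(counts), "cores, threads per core:", set(counts.values()))
# 8 cores, threads per core: {2}
```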

  • Check the memory allocation:
$ ps -aux | grep `openstack server show server.cpu.prefer | grep instance_name | awk '{print $4}'` | awk '{{if($11=="/usr/libexec/qemu-kvm") {print $2}}}' | xargs -I {} cat /proc/{}/numa_maps | perl numa-maps-summary.pl
N0        :        22942 (  0.09 GB)
N1        :           34 (  0.00 GB)
active    :          168 (  0.00 GB)
anon      :        19827 (  0.08 GB)
dirty     :        19852 (  0.08 GB)
kernelpagesize_kB:         1848 (  0.01 GB)
mapmax    :         4217 (  0.02 GB)
mapped    :         3127 (  0.01 GB)

####isolate VM

  • Set the isolate thread policy:
$ nova flavor-key machine.cpu set hw:cpu_policy=dedicated hw:cpu_thread_policy=isolate
  • Create the VM:
$ openstack server create --image cirros --flavor machine.cpu --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 --availability-zone az01:osdev-01 server.cpu.isolate
  • Check the VM's CPU affinity:
$ ps -aux | grep `openstack server show server.cpu.isolate | grep instance_name | awk '{print $4}'` | awk '{{if($11=="/usr/libexec/qemu-kvm") {print $2}}}' | xargs taskset -c -p
pid 51203's current affinity list: 18,19,24-26,32-35,57-59,64-67

$ ps -aux | grep `openstack server show server.cpu.isolate | grep instance_name | awk '{print $4}'` | awk '{{if($11=="/usr/libexec/qemu-kvm") {print $2}}}' | xargs -I {} find /proc/{}/task/ -name "status" | xargs grep Cpus_allowed_list
/proc/51203/task/51203/status:Cpus_allowed_list:	18-19,24-26,32-35,57-59,64-67
/proc/51203/task/51206/status:Cpus_allowed_list:	18-19,24-26,32-35,57-59,64-67
/proc/51203/task/51210/status:Cpus_allowed_list:	59
/proc/51203/task/51211/status:Cpus_allowed_list:	65
/proc/51203/task/51212/status:Cpus_allowed_list:	18
/proc/51203/task/51213/status:Cpus_allowed_list:	34
/proc/51203/task/51214/status:Cpus_allowed_list:	24
/proc/51203/task/51215/status:Cpus_allowed_list:	33
/proc/51203/task/51216/status:Cpus_allowed_list:	58
/proc/51203/task/51217/status:Cpus_allowed_list:	67
/proc/51203/task/51218/status:Cpus_allowed_list:	66
/proc/51203/task/51219/status:Cpus_allowed_list:	26
/proc/51203/task/51220/status:Cpus_allowed_list:	35
/proc/51203/task/51221/status:Cpus_allowed_list:	57
/proc/51203/task/51222/status:Cpus_allowed_list:	25
/proc/51203/task/51223/status:Cpus_allowed_list:	19
/proc/51203/task/51224/status:Cpus_allowed_list:	64
/proc/51203/task/51225/status:Cpus_allowed_list:	32
/proc/51203/task/51227/status:Cpus_allowed_list:	18-19,24-26,32-35,57-59,64-67
  • Check which CPUs the VM's threads are running on:
$ ps -aux | grep `openstack server show server.cpu.isolate | grep instance_name | awk '{print $4}'` | awk '{{if($11=="/usr/libexec/qemu-kvm") {print $2}}}' | xargs ps -m -o pid,psr,comm -p | awk 'NR>2{print $2}' | sort -n | uniq | xargs -n1 ./cpu_id.sh
18  0   1  
19  1   1  
24  9   1  
25  10  1  
26  11  1  
32  24  1  
33  25  1  
34  26  1  
35  27  1  
57  3   1  
58  4   1  
59  8   1  
64  17  1  
65  18  1  
66  19  1  
67  20  1

The VM's vCPUs are pinned, and each vCPU is placed on a different core.

  • Check the memory allocation:
$ ps -aux | grep `openstack server show server.cpu.isolate | grep instance_name | awk '{print $4}'` | awk '{{if($11=="/usr/libexec/qemu-kvm") {print $2}}}' | xargs -I {} cat /proc/{}/numa_maps | perl numa-maps-summary.pl
N0        :         3077 (  0.01 GB)
N1        :        22653 (  0.09 GB)
active    :          168 (  0.00 GB)
anon      :        22581 (  0.09 GB)
dirty     :        22600 (  0.09 GB)
kernelpagesize_kB:         1844 (  0.01 GB)
mapmax    :         4217 (  0.02 GB)
mapped    :         3131 (  0.01 GB)

####require VM

  • Set the require thread policy:
$ nova flavor-key machine.cpu set hw:cpu_policy=dedicated hw:cpu_thread_policy=require
  • Create the VM:
$ openstack quota set --cores 100 admin

$ openstack server create --image cirros --flavor machine.cpu --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 --availability-zone az01:osdev-01 server.cpu.require
  • Check the VM's CPU affinity:
$ ps -aux | grep `openstack server show server.cpu.require | grep instance_name | awk '{print $4}'` | awk '{{if($11=="/usr/libexec/qemu-kvm") {print $2}}}' | xargs taskset -c -p
pid 194063's current affinity list: 3,5,6,9-13,39,41,42,45-49

$ ps -aux | grep `openstack server show server.cpu.require | grep instance_name | awk '{print $4}'` | awk '{{if($11=="/usr/libexec/qemu-kvm") {print $2}}}' | xargs -I {} find /proc/{}/task/ -name "status" | xargs grep Cpus_allowed_list
/proc/194063/task/194063/status:Cpus_allowed_list:	3,5-6,9-13,39,41-42,45-49
/proc/194063/task/194065/status:Cpus_allowed_list:	3,5-6,9-13,39,41-42,45-49
/proc/194063/task/194069/status:Cpus_allowed_list:	10
/proc/194063/task/194070/status:Cpus_allowed_list:	46
/proc/194063/task/194071/status:Cpus_allowed_list:	11
/proc/194063/task/194072/status:Cpus_allowed_list:	47
/proc/194063/task/194073/status:Cpus_allowed_list:	42
/proc/194063/task/194074/status:Cpus_allowed_list:	6
/proc/194063/task/194075/status:Cpus_allowed_list:	41
/proc/194063/task/194076/status:Cpus_allowed_list:	5
/proc/194063/task/194077/status:Cpus_allowed_list:	9
/proc/194063/task/194078/status:Cpus_allowed_list:	45
/proc/194063/task/194079/status:Cpus_allowed_list:	3
/proc/194063/task/194080/status:Cpus_allowed_list:	39
/proc/194063/task/194081/status:Cpus_allowed_list:	48
/proc/194063/task/194082/status:Cpus_allowed_list:	12
/proc/194063/task/194083/status:Cpus_allowed_list:	49
/proc/194063/task/194084/status:Cpus_allowed_list:	13
/proc/194063/task/194088/status:Cpus_allowed_list:	3,5-6,9-13,39,41-42,45-49
  • Check which CPUs the VM's threads are running on:
$ ps -aux | grep `openstack server show server.cpu.require | grep instance_name | awk '{print $4}'` | awk '{{if($11=="/usr/libexec/qemu-kvm") {print $2}}}' | xargs ps -m -o pid,psr,comm -p | awk 'NR>2{print $2}' | sort -n | uniq | xargs -n1 ./cpu_id.sh
3   3   0  
5   8   0  
6   9   0  
9   16  0  
10  17  0  
11  18  0  
12  19  0  
13  20  0  
39  3   0  
41  8   0  
42  9   0  
45  16  0  
46  17  0  
47  18  0  
48  19  0  
49  20  0

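The VM's vCPUs are pinned, and every allocated core hosts two vCPUs (one on each hardware thread), similar to the prefer policy. None of them share a core with the VM created under the isolate policy.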
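
As a rough cross-check of this observation, the sketch below groups the rows printed above by physical core and lists the logical CPUs that share each one. It assumes the columns are logical CPU, core id, and socket id (inferred from the output, not confirmed), and the helper name `siblings_by_core` is hypothetical. The isolate VM's cores are copied from its earlier output so that the two placements can be compared.

```python
from collections import defaultdict


def siblings_by_core(rows):
    """Group logical CPUs by (socket, core).

    rows holds (cpu, core, socket) tuples; this column order is an assumption.
    """
    groups = defaultdict(list)
    for cpu, core, socket in rows:
        groups[(socket, core)].append(cpu)
    return dict(groups)


# Values copied from the require VM output above.
REQUIRE_ROWS = [
    (3, 3, 0), (5, 8, 0), (6, 9, 0), (9, 16, 0),
    (10, 17, 0), (11, 18, 0), (12, 19, 0), (13, 20, 0),
    (39, 3, 0), (41, 8, 0), (42, 9, 0), (45, 16, 0),
    (46, 17, 0), (47, 18, 0), (48, 19, 0), (49, 20, 0),
]

# (socket, core) pairs used by the isolate VM, copied from its output.
ISOLATE_CORES = {
    (1, core)
    for core in (0, 1, 9, 10, 11, 24, 25, 26, 27, 3, 4, 8, 17, 18, 19, 20)
}

groups = siblings_by_core(REQUIRE_ROWS)
for (socket, core), cpus in sorted(groups.items()):
    print(f"socket {socket} core {core}: CPUs {cpus}")

# require: every used core should carry exactly two vCPUs.
all_paired = all(len(cpus) == 2 for cpus in groups.values())
# Cores are compared as (socket, core) because core ids repeat across sockets.
shared_with_isolate = set(groups) & ISOLATE_CORES

print("cores used:", len(groups))
print("every core has two vCPUs:", all_paired)
print("cores shared with isolate VM:", sorted(shared_with_isolate))
```

Several core ids (3, 8, 9, 17 and others) appear in both VMs' output, but on different sockets, which is why the comparison keys on the (socket, core) pair rather than the core id alone.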

  • Check memory allocation:
$ ps -aux | grep `openstack server show server.cpu.require | grep instance_name | awk '{print $4}'` | awk '{{if($11=="/usr/libexec/qemu-kvm") {print $2}}}' | xargs -I {} cat /proc/{}/numa_maps | perl numa-maps-summary.pl
N0        :        22939 (  0.09 GB)
N1        :           34 (  0.00 GB)
active    :          168 (  0.00 GB)
anon      :        19824 (  0.08 GB)
dirty     :        19850 (  0.08 GB)
kernelpagesize_kB:         1848 (  0.01 GB)
mapmax    :         4217 (  0.02 GB)
mapped    :         3127 (  0.01 GB)
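
To make the per-node figures easier to compare, the sketch below parses the `N<id>` lines of the `numa-maps-summary.pl` output and computes the share of pages on each NUMA node. It is only an illustration: the helper names `node_pages` and `node_share` are hypothetical, and reading the result as "memory follows the pinned CPUs" assumes that the socket id shown in the CPU output corresponds to the NUMA node id, which the original does not state.

```python
import re


def node_pages(summary):
    """Return {node_id: pages} from lines shaped like 'N0 : 3077 ( 0.01 GB)'."""
    pages = {}
    for line in summary.splitlines():
        match = re.match(r"\s*N(\d+)\s*:\s*(\d+)", line)
        if match:
            pages[int(match.group(1))] = int(match.group(2))
    return pages


def node_share(pages):
    """Return each node's fraction of the total page count."""
    total = sum(pages.values())
    return {node: count / total for node, count in pages.items()}


# Per-node lines copied from the two outputs in this section.
ISOLATE_SUMMARY = """\
N0        :         3077 (  0.01 GB)
N1        :        22653 (  0.09 GB)
"""
REQUIRE_SUMMARY = """\
N0        :        22939 (  0.09 GB)
N1        :           34 (  0.00 GB)
"""

for name, summary in (("isolate", ISOLATE_SUMMARY), ("require", REQUIRE_SUMMARY)):
    shares = node_share(node_pages(summary))
    text = ", ".join(f"N{node}: {share:.1%}" for node, share in sorted(shares.items()))
    print(f"{name}: {text}")
```

For the data above, the isolate VM has most of its pages on node 1 and the require VM has nearly all of its pages on node 0, which lines up with the socket each VM's vCPUs were pinned to under the stated assumption.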

Related source code

See the NUMA pinning source code for details.
