Building a Java Virtual Machine (Part 1): Parsing the class File

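I. Understanding the class file structure

A Java source file with the .java suffix is compiled by javac into a bytecode (.class) file with the following structure (taken from the Java Virtual Machine Specification, Java SE 8 Edition):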

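ClassFile {
    u4             magic; // magic number, always 0xCAFEBABE; identifies the file as a Java class file
    u2             minor_version; // minor version number
    u2             major_version; // major version number
    u2             constant_pool_count;  // number of entries in the constant_pool table plus 1
    cp_info        constant_pool[constant_pool_count-1]; // constant pool table, indexed from 1
    u2             access_flags; // access flags of this class or interface
    u2             this_class; // the class or interface defined by this file; a constant pool index pointing to a CONSTANT_Class_info structure
    u2             super_class; // the direct superclass; a constant pool index pointing to a CONSTANT_Class_info structure. The value is 0 only for java.lang.Object, which has no superclass
    u2             interfaces_count; // number of direct superinterfaces of this class or interface
    u2             interfaces[interfaces_count]; // each value is a constant pool index pointing to a CONSTANT_Class_info structure
    u2             fields_count; // number of field_info entries in the fields table (fields declared by this class or interface), both class variables and instance variables
    field_info     fields[fields_count]; // each entry is a field_info structure with the complete description of one field; fields inherited from superclasses or superinterfaces are not included
    u2             methods_count; // number of method_info entries in the methods table
    method_info    methods[methods_count]; // each entry is a method_info structure with the complete description of one method; unless the method is ACC_NATIVE or ACC_ABSTRACT, it also contains the JVM instructions (the method's code)
    u2             attributes_count; // number of attribute_info entries in the attributes table of this class or interface
    attribute_info attributes[attributes_count];
}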

Here u1, u2 and u4 denote 1, 2 and 4 bytes respectively. Given the structure above, it is natural to represent a class file with a C struct: u1 maps to unsigned char, u2 to unsigned short and u4 to unsigned int.
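The mapping from u1/u2/u4 to C types depends on the platform's type widths, which the C standard does not fix. The following is a minimal sketch, assuming a C11 compiler, that declares the typedef names used throughout this article (uchar, ushort, uint) and makes the build fail if a width is not what the class file format needs:

```c
#include <limits.h>

/* Typedef names as used later in the article. */
typedef unsigned char  uchar;   /* u1 */
typedef unsigned short ushort;  /* u2 */
typedef unsigned int   uint;    /* u4 */

/* C11 compile-time checks: compilation stops if a type has the wrong width. */
_Static_assert(CHAR_BIT == 8 && sizeof(uchar) == 1, "u1 must be 8 bits");
_Static_assert(sizeof(ushort) == 2, "u2 must be 16 bits");
_Static_assert(sizeof(uint) == 4, "u4 must be 32 bits");
```

An alternative would be the fixed-width types uint8_t, uint16_t and uint32_t from <stdint.h>, which make the checks unnecessary.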

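However, cp_info, field_info, method_info and attribute_info are composite types, so the information above is not yet enough to decide how to represent a class file with C structs. We need to look further:

1. The main structures in the constant pool

The constant pool is the key: many Java instructions refer to symbolic information in the constant pool by index.

Every constant pool entry has the following general layout:

cp_info {
  u1 tag; // one byte that identifies the type of the constant
  u1 info[]; // the content depends on the tag
}

Table 1: constant pool types and their tags (from the JVM specification, Java SE 8)

Constant type	tag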
CONSTANT_Class	7
CONSTANT_Fieldref	9
CONSTANT_Methodref	10
CONSTANT_InterfaceMethodref	11
CONSTANT_String	8
CONSTANT_Integer	3
CONSTANT_Float	4
CONSTANT_Long	5
CONSTANT_Double	6
CONSTANT_NameAndType	12
CONSTANT_Utf8	1
CONSTANT_MethodHandle	15
CONSTANT_MethodType	16
CONSTANT_InvokeDynamic	18

Naturally, we can use #define for these constants:

#define CONSTANT_Class 7
#define CONSTANT_Fieldref 9
...
#define CONSTANT_InvokeDynamic 18
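When dumping the constant pool it helps to print a tag by name rather than by number. The sketch below lists all tag values from Table 1 and adds a lookup function; cp_tag_name is a hypothetical helper, not a function from the original code.

```c
#include <stddef.h>

/* Tag values from Table 1 (JVM specification, Java SE 8). */
#define CONSTANT_Utf8               1
#define CONSTANT_Integer            3
#define CONSTANT_Float              4
#define CONSTANT_Long               5
#define CONSTANT_Double             6
#define CONSTANT_Class              7
#define CONSTANT_String             8
#define CONSTANT_Fieldref           9
#define CONSTANT_Methodref          10
#define CONSTANT_InterfaceMethodref 11
#define CONSTANT_NameAndType        12
#define CONSTANT_MethodHandle       15
#define CONSTANT_MethodType         16
#define CONSTANT_InvokeDynamic      18

/* Hypothetical helper: returns the name for a tag, or NULL if the tag is unknown. */
const char *cp_tag_name(int tag)
{
    switch (tag) {
    case CONSTANT_Utf8:               return "CONSTANT_Utf8";
    case CONSTANT_Integer:            return "CONSTANT_Integer";
    case CONSTANT_Float:              return "CONSTANT_Float";
    case CONSTANT_Long:               return "CONSTANT_Long";
    case CONSTANT_Double:             return "CONSTANT_Double";
    case CONSTANT_Class:              return "CONSTANT_Class";
    case CONSTANT_String:             return "CONSTANT_String";
    case CONSTANT_Fieldref:           return "CONSTANT_Fieldref";
    case CONSTANT_Methodref:          return "CONSTANT_Methodref";
    case CONSTANT_InterfaceMethodref: return "CONSTANT_InterfaceMethodref";
    case CONSTANT_NameAndType:        return "CONSTANT_NameAndType";
    case CONSTANT_MethodHandle:       return "CONSTANT_MethodHandle";
    case CONSTANT_MethodType:         return "CONSTANT_MethodType";
    case CONSTANT_InvokeDynamic:      return "CONSTANT_InvokeDynamic";
    default:                          return NULL; /* 2, 13, 14, 17 are unused in Java SE 8 */
    }
}
```

A NULL result is a convenient signal that the file is corrupt or was produced for a newer class file version.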

1.1 The CONSTANT_Class_info type

CONSTANT_Class_info represents a class or an interface:

CONSTANT_Class_info {
  u1 tag; // always 7, i.e. CONSTANT_Class
  u2 name_index; // a constant pool index pointing to a CONSTANT_Utf8_info structure
}

Naturally, we can define a C struct to represent it:

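typedef struct _CONSTANT_Class_info {
  uchar tag; // for convenience: typedef unsigned char uchar;
  ushort name_index; // for convenience: typedef unsigned short ushort;
  struct _ClassFile *pclass; // back-pointer to the loaded class, set at the end of loadClass
} CONSTANT_Class_info;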

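1.2 The CONSTANT_Fieldref_info, CONSTANT_Methodref_info and CONSTANT_InterfaceMethodref_info types

These three types share the same layout:

.... {
  u1 tag; // 9 => CONSTANT_Fieldref, 10 => CONSTANT_Methodref, 11 => CONSTANT_InterfaceMethodref
  u2 class_index; // a constant pool index pointing to a CONSTANT_Class_info structure
  u2 name_and_type_index; // a constant pool index pointing to a CONSTANT_NameAndType_info structure
}

So we can define the following C structs: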

typedef struct _CONSTANT_Fieldref_info {
    uchar tag;
    ushort class_index;
    ushort name_and_type_index;
    ushort findex; // index of this field within the object, reserved for later use
    uchar ftype; // type of this field, reserved for later use
} CONSTANT_Fieldref_info;

typedef struct _CONSTANT_Methodref_info {
    uchar tag;
    ushort class_index;
    ushort name_and_type_index;
    void* ref_addr; // address of the method, reserved for later use
    ushort args_len; // number of arguments of the method, reserved for later use
} CONSTANT_Methodref_info;

typedef CONSTANT_Methodref_info CONSTANT_InterfaceMethodref_info; // CONSTANT_InterfaceMethodref_info is not used yet, so it is the same as CONSTANT_Methodref_info

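1.3 CONSTANT_String_info

This type represents a constant object of type java.lang.String:

CONSTANT_String_info {
  u1 tag; // always 8, i.e. CONSTANT_String
  u2 string_index; // a constant pool index pointing to a CONSTANT_Utf8_info structure that holds the sequence of Unicode code points used to initialize the String object
}

The corresponding C struct: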

typedef struct _CONSTANT_String_info {
    uchar tag;
    ushort string_index;
} CONSTANT_String_info;

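1.4 CONSTANT_Integer_info and CONSTANT_Float_info

These two types represent 4-byte numeric constants: CONSTANT_Integer_info holds an int and CONSTANT_Float_info holds a float. The layout is:

... {
  u1 tag; // type tag: 3 => CONSTANT_Integer, 4 => CONSTANT_Float
  u4 bytes; // the 4 bytes of the int or float, stored in big-endian order
}

Accordingly, we can define the following C structs: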

typedef struct _CONSTANT_Integer_info {
    uchar tag;
    int value; // value of the int constant
} CONSTANT_Integer_info;
typedef struct _CONSTANT_Float_info {
    uchar tag;
    float value; // value of the float constant
} CONSTANT_Float_info;

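1.5 CONSTANT_Long_info and CONSTANT_Double_info

These two types represent 8-byte numeric constants: CONSTANT_Long_info holds a long and CONSTANT_Double_info holds a double. The layout is:

... {
  u1 tag; // type tag: 5 => CONSTANT_Long, 6 => CONSTANT_Double
  u4 high_bytes; // high four bytes
  u4 low_bytes;  // low four bytes
}

Note: if the entry at constant pool index n is a CONSTANT_Long_info or CONSTANT_Double_info structure, the next usable index is n+2. Index n+1 must be valid but is considered unusable. (This is a little odd; the JVM specification itself remarks that, in retrospect, making 8-byte constants take two constant pool entries was a poor choice.)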
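The two-slot rule affects every loop that walks the constant pool. Here is a minimal sketch of the index stepping, assuming the tags have already been collected into an array indexed like the pool itself; cp_slots_for_tag and cp_count_entries are hypothetical names used only for illustration.

```c
/* Number of constant pool slots taken by an entry with the given tag:
 * CONSTANT_Long (5) and CONSTANT_Double (6) take two, everything else one. */
int cp_slots_for_tag(int tag)
{
    return (tag == 5 || tag == 6) ? 2 : 1;
}

/* Counts the real entries in a pool whose tags are stored in tags[1..count-1].
 * tags[0] and the slot after a Long/Double are never inspected. */
int cp_count_entries(const unsigned char *tags, int constant_pool_count)
{
    int entries = 0;
    for (int i = 1; i < constant_pool_count; i += cp_slots_for_tag(tags[i])) {
        entries++;
    }
    return entries;
}
```

A parser follows the same pattern: after reading a Long or Double at index i it advances to i+2 and leaves slot i+1 empty.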

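As with CONSTANT_Integer_info and CONSTANT_Float_info, we define the following C structs to store these two types:

typedef struct _CONSTANT_Long_info {
    uchar tag;
    long long value; // value of the long constant; long long is at least 64 bits, plain long may be only 32
} CONSTANT_Long_info;
typedef struct _CONSTANT_Double_info {
    uchar tag;
    double value; // value of the double constant
} CONSTANT_Double_info;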
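The class file stores these constants as two 4-byte halves, so the parser has to combine them into one 8-byte value. A minimal sketch, assuming both halves have already been converted to host byte order and that the host uses IEEE 754 doubles (as Java does); make_long and make_double are hypothetical helper names.

```c
#include <stdint.h>
#include <string.h>

/* Combines high_bytes and low_bytes into the 64-bit pattern they encode. */
static uint64_t join_halves(uint32_t high_bytes, uint32_t low_bytes)
{
    return ((uint64_t)high_bytes << 32) | low_bytes;
}

/* Interprets the bit pattern as a signed 64-bit integer (Java long). */
int64_t make_long(uint32_t high_bytes, uint32_t low_bytes)
{
    uint64_t bits = join_halves(high_bytes, low_bytes);
    int64_t value;
    memcpy(&value, &bits, sizeof value); /* copy the bits, no numeric conversion */
    return value;
}

/* Interprets the bit pattern as an IEEE 754 binary64 value (Java double). */
double make_double(uint32_t high_bytes, uint32_t low_bytes)
{
    uint64_t bits = join_halves(high_bytes, low_bytes);
    double value;
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

memcpy is used instead of a pointer cast so that the reinterpretation stays well defined in standard C.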

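1.6 CONSTANT_NameAndType_info

This structure describes the name and type of a field or method:

CONSTANT_NameAndType_info {
  u1 tag; // type tag, always 12, i.e. CONSTANT_NameAndType
  u2 name_index; // a constant pool index pointing to a CONSTANT_Utf8_info structure that holds the name of the field or method
  u2 descriptor_index; // a constant pool index pointing to a CONSTANT_Utf8_info structure that holds the descriptor (type) of the field or method
}

The corresponding C struct: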

typedef struct _CONSTANT_NameAndType_info {
    uchar tag;
    ushort name_index;
    ushort descriptor_index;
} CONSTANT_NameAndType_info;

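1.7 CONSTANT_Utf8_info

This is probably the most frequently referenced structure: many constant pool structures have a *_index field that points to one. It represents a constant string value:

CONSTANT_Utf8_info {
  u1 tag; // type tag, always 1, i.e. CONSTANT_Utf8
  u2 length; // number of bytes in the string
  u1 bytes[length]; // the actual string bytes in modified UTF-8 (not covered in depth here; see the JVM specification for details)
}

We can define the following C struct to store it: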

typedef struct _CONSTANT_Utf8_info {
    uchar tag;
    ushort length;
    char *bytes; // C-style string terminated by a 0 byte; remember to allocate length+1 bytes
} CONSTANT_Utf8_info;
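Reading such an entry means reading the 2-byte length, allocating length+1 bytes and appending the terminator. The sketch below assumes the tag byte has already been consumed; Utf8Entry and read_utf8_entry are hypothetical stand-ins for the article's struct and parsing code. Modified UTF-8 never contains a raw 0 byte, so treating the result as a C string is safe.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for CONSTANT_Utf8_info. */
typedef struct {
    unsigned char tag;
    unsigned short length;
    char *bytes;
} Utf8Entry;

/* Reads length and bytes of a CONSTANT_Utf8 entry from fp.
 * Returns a heap-allocated entry, or NULL on a short read or allocation failure. */
Utf8Entry *read_utf8_entry(FILE *fp)
{
    unsigned char len[2];
    if (fread(len, 1, 2, fp) != 2) {
        return NULL;
    }
    Utf8Entry *entry = malloc(sizeof *entry);
    if (!entry) {
        return NULL;
    }
    entry->tag = 1; /* CONSTANT_Utf8 */
    entry->length = (unsigned short)((len[0] << 8) | len[1]); /* big-endian u2 */
    entry->bytes = malloc((size_t)entry->length + 1);         /* +1 for the terminator */
    if (!entry->bytes) {
        free(entry);
        return NULL;
    }
    if (fread(entry->bytes, 1, entry->length, fp) != entry->length) {
        free(entry->bytes);
        free(entry);
        return NULL;
    }
    entry->bytes[entry->length] = '\0';
    return entry;
}
```

The caller owns the result and releases it with free(entry->bytes) followed by free(entry).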

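1.8 Other structures

The remaining constant pool structures are CONSTANT_MethodHandle_info, CONSTANT_MethodType_info and CONSTANT_InvokeDynamic_info. We will not study them for now and simply store their contents; they will be discussed when we dig deeper later.

Their C structs are: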

typedef struct _CONSTANT_MethodHandle_info {
    uchar tag;
    uchar reference_kind;
    ushort reference_index;
} CONSTANT_MethodHandle_info;
typedef struct _CONSTANT_MethodType_info {
    uchar tag;
    ushort descriptor_index;
} CONSTANT_MethodType_info;
typedef struct _CONSTANT_InvokeDynamic_info {
    uchar tag;
    ushort bootstrap_method_attr_index;
    ushort name_and_type_index;
} CONSTANT_InvokeDynamic_info;

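At this point every constant pool structure has a C struct to represent and store it. Because the structs are similar but not identical, none of them is a suitable type for the constant_pool member of ClassFile, so cp_info is defined as an array of generic pointers, void**.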

typedef void** cp_info;
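With a void** pool, each access first checks the tag and then casts to the matching struct. The sketch below resolves a class name this way. The struct names are reduced, hypothetical versions of the ones above, and class_name_at only illustrates the idea; it is not the get_class_name function used later in the article.

```c
#include <stddef.h>

typedef unsigned char uchar;
typedef unsigned short ushort;
typedef void **cp_info;

/* Reduced, hypothetical versions of the article's structs. */
typedef struct { uchar tag; ushort name_index; } ClassInfo;
typedef struct { uchar tag; ushort length; char *bytes; } Utf8Info;

/* Every entry struct starts with the tag byte, so the tag can be read
 * through a uchar pointer without knowing the entry's type. */
uchar cp_tag(cp_info pool, ushort index)
{
    return *(uchar *)pool[index];
}

/* Returns the name of the class at the given index, or NULL if the index
 * does not point to a CONSTANT_Class -> CONSTANT_Utf8 chain. */
const char *class_name_at(cp_info pool, ushort index)
{
    if (cp_tag(pool, index) != 7) {        /* 7 = CONSTANT_Class */
        return NULL;
    }
    ClassInfo *cls = pool[index];
    if (cp_tag(pool, cls->name_index) != 1) { /* 1 = CONSTANT_Utf8 */
        return NULL;
    }
    Utf8Info *name = pool[cls->name_index];
    return name->bytes;
}
```

A real implementation would also check that the index is within 1..constant_pool_count-1 and that the slot is not the unused one after a Long or Double.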

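2. The field_info structure

Each field is described by a field_info structure:

field_info {
  u2 access_flags; // access flags: ACC_PUBLIC (0x0001), ACC_PRIVATE (0x0002), ...
  u2 name_index; // a constant pool index pointing to a CONSTANT_Utf8_info structure that holds the field's name
  u2 descriptor_index; // a constant pool index pointing to a CONSTANT_Utf8_info structure that holds the field's descriptor
  u2 attributes_count; // number of additional attributes of the field
  attribute_info attributes[attributes_count]; // attribute table; each entry is an attribute_info structure
}

Yet another structure appears: attribute_info. The JVM specification defines more than 20 attribute types, and they all share a similar layout: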

attribute_info {
  u2 attribute_name_index;
  u4 attribute_length;
  u1 info[attribute_length];
}

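For now we will not analyze every attribute. But since the length of an attribute is known, we can read each one in a single pass and then analyze only the attributes we care about.

We define the following struct to represent and store attribute_info: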

typedef struct _attribute_info{
    ushort attribute_name_index;
    uint attribute_length;
    uchar *info;
} attribute_info;
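Reading an attribute without interpreting it can be sketched as follows: read the 6-byte header, then copy attribute_length raw bytes. AttrInfo and read_attribute are hypothetical names that mirror the struct above; this is an illustration under those assumptions, not the original parsing code.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for attribute_info. */
typedef struct {
    unsigned short attribute_name_index;
    unsigned int attribute_length;
    unsigned char *info;
} AttrInfo;

/* Reads one attribute from fp without interpreting its contents.
 * Returns NULL on a short read or allocation failure. */
AttrInfo *read_attribute(FILE *fp)
{
    unsigned char h[6]; /* u2 name index + u4 length, both big-endian */
    if (fread(h, 1, sizeof h, fp) != sizeof h) {
        return NULL;
    }
    AttrInfo *attr = malloc(sizeof *attr);
    if (!attr) {
        return NULL;
    }
    attr->attribute_name_index = (unsigned short)((h[0] << 8) | h[1]);
    attr->attribute_length = ((unsigned int)h[2] << 24) | ((unsigned int)h[3] << 16)
                           | ((unsigned int)h[4] << 8)  |  (unsigned int)h[5];
    /* Allocate at least one byte so a zero-length attribute still gets a valid pointer. */
    attr->info = malloc(attr->attribute_length ? attr->attribute_length : 1);
    if (!attr->info ||
        fread(attr->info, 1, attr->attribute_length, fp) != attr->attribute_length) {
        free(attr->info);
        free(attr);
        return NULL;
    }
    return attr;
}
```

Whether the attribute is worth decoding can then be decided by looking up attribute_name_index in the constant pool and comparing the name with, for example, "Code".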

With that, field_info can be defined as follows:

typedef struct _field_info{
    ushort access_flags;
    ushort name_index;
    ushort descriptor_index;
    ushort attributes_count;
    attribute_info **attributes;
    ushort findex; // field index (reserved for later use)
    uchar ftype; // field type (reserved for later use)
} field_info;
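The access_flags value is a bit mask, so printing it means testing each flag bit in turn. The sketch below does this for the field flags defined in the JVM specification for Java SE 8; format_field_flags is a hypothetical helper and not the formatAccessFlag function that the parser calls later.

```c
#include <stdio.h>
#include <stddef.h>

/* Field access flags from the JVM specification (Java SE 8). */
static const struct { unsigned short mask; const char *name; } FIELD_FLAGS[] = {
    {0x0001, "ACC_PUBLIC"},    {0x0002, "ACC_PRIVATE"},  {0x0004, "ACC_PROTECTED"},
    {0x0008, "ACC_STATIC"},    {0x0010, "ACC_FINAL"},    {0x0040, "ACC_VOLATILE"},
    {0x0080, "ACC_TRANSIENT"}, {0x1000, "ACC_SYNTHETIC"}, {0x4000, "ACC_ENUM"},
};

/* Writes the names of all set flags into out, separated by single spaces.
 * Output is truncated if out_size is too small. Returns out. */
char *format_field_flags(unsigned short flags, char *out, size_t out_size)
{
    size_t used = 0;
    if (out_size == 0) {
        return out;
    }
    out[0] = '\0';
    for (size_t i = 0; i < sizeof FIELD_FLAGS / sizeof FIELD_FLAGS[0]; i++) {
        if (!(flags & FIELD_FLAGS[i].mask)) {
            continue;
        }
        int n = snprintf(out + used, out_size - used, "%s%s",
                         used ? " " : "", FIELD_FLAGS[i].name);
        if (n < 0 || (size_t)n >= out_size - used) {
            break; /* buffer full: stop, the text written so far stays terminated */
        }
        used += (size_t)n;
    }
    return out;
}
```

For the field `static double C_DOUBLE` with no other modifiers, the flags value is 0x0008 and the result is "ACC_STATIC". Class and method flags use different tables because some bit values have other meanings there.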

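3. The method_info structure

This structure describes one method of a class or interface:

method_info {
  u2 access_flags;
  u2 name_index;
  u2 descriptor_index;
  u2 attributes_count;
  attribute_info attributes[attributes_count];
}

The layout is the same as field_info, so it needs little explanation. The C struct is: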

typedef struct _method_info{
    ushort access_flags;
    ushort name_index;
    ushort descriptor_index;
    ushort attributes_count;
    attribute_info **attributes;
    void* code_attribute_addr; // address of the Code attribute, stored as a generic pointer
    ushort args_len; // number of formal parameters of the method
} method_info;
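The args_len member can be derived from the method descriptor, the string that descriptor_index points to. The sketch below counts the parameters in a descriptor such as "(II)I". It assumes that args_len counts parameters rather than local variable slots (a long or double parameter takes two slots); count_descriptor_args is a hypothetical helper.

```c
#include <string.h>

/* Returns the number of parameters in a method descriptor,
 * or -1 if the descriptor is malformed. */
int count_descriptor_args(const char *desc)
{
    if (!desc || *desc != '(') {
        return -1;
    }
    int count = 0;
    const char *p = desc + 1;
    while (*p && *p != ')') {
        while (*p == '[') {
            p++;                 /* array dimensions belong to one parameter */
        }
        if (*p == 'L') {         /* object type: L<class name>; */
            while (*p && *p != ';') {
                p++;
            }
            if (*p != ';') {
                return -1;
            }
        } else if (*p == '\0' || !strchr("BCDFIJSZ", *p)) {
            return -1;           /* not a primitive type code */
        }
        p++;
        count++;
    }
    return *p == ')' ? count : -1;
}
```

For the sum and sub methods of the test class at the end of this article, both with descriptor "(II)I", the function returns 2.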

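For method_info, we currently care about the following attributes:

Code_attribute: contains the method's code plus some auxiliary information (such as the number of local variables and the maximum operand stack depth)
LineNumberTable_attribute: a table that maps VM instruction indexes to source line numbers. It is optional, used for debugging, and nested inside Code_attribute

The related structures can be defined as follows: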

typedef struct _exception_table {
    ushort start_pc;
    ushort end_pc;
    ushort handler_pc;
    ushort catch_type;
} exception_table;
typedef struct _Code_attribute {
    ushort attribute_type;
    ushort max_stack;
    ushort max_locals;
    uint code_length;
    uchar *code;
    ushort exception_table_length;
    exception_table *exceptions;
    ushort attributes_count;
    attribute_info **attributes;
} Code_attribute;

typedef struct _line_number_table {
    ushort start_pc;
    ushort line_number;
} line_number_table;

typedef struct _LineNumberTable_attribute {
    ushort attribute_type;
    uint attribute_length;
    ushort table_length;
    line_number_table *tables;
} LineNumberTable_attribute;
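Because attributes are read as raw bytes first, the Code attribute is decoded afterwards from the info buffer. The following sketch decodes only the fixed-size header and locates the bytecode, following the Code_attribute layout in the JVM specification (u2 max_stack, u2 max_locals, u4 code_length, code, u2 exception_table_length, ...). CodeHeader and parse_code_header are hypothetical names.

```c
#include <stddef.h>

typedef struct {
    unsigned short max_stack;
    unsigned short max_locals;
    unsigned int code_length;
    const unsigned char *code;          /* points into the info buffer, not copied */
    unsigned short exception_table_length;
} CodeHeader;

static unsigned short be16(const unsigned char *p)
{
    return (unsigned short)((p[0] << 8) | p[1]);
}

static unsigned int be32(const unsigned char *p)
{
    return ((unsigned int)p[0] << 24) | ((unsigned int)p[1] << 16)
         | ((unsigned int)p[2] << 8)  |  (unsigned int)p[3];
}

/* Decodes the start of a Code attribute from its info bytes.
 * Returns 0 on success, -1 if the buffer is too short. */
int parse_code_header(const unsigned char *info, unsigned int len, CodeHeader *out)
{
    if (len < 8) {
        return -1;
    }
    out->max_stack = be16(info);
    out->max_locals = be16(info + 2);
    out->code_length = be32(info + 4);
    /* The code and the following u2 must both fit into the buffer. */
    if (out->code_length > len - 8 || len - 8 - out->code_length < 2) {
        return -1;
    }
    out->code = info + 8;
    out->exception_table_length = be16(info + 8 + out->code_length);
    return 0;
}
```

The exception table entries and the nested attributes (including LineNumberTable) follow after exception_table_length and can be decoded the same way.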

The main structures of the class file have now been described, so we can define the representation of the class file itself:

typedef struct _ClassFile{
    uint magic;
    ushort minor_version;
    ushort major_version;
    ushort constant_pool_count;
    cp_info constant_pool;
    ushort access_flags;
    ushort this_class;
    ushort super_class;
    ushort interface_count;
    ushort *interfaces;
    ushort fields_count;
    field_info **fields;
    ushort methods_count;
    method_info **methods;
    ushort attributes_count;
    attribute_info **attributes;
    struct _ClassFile *parent_class; // ClassFile structure of the superclass
    int parent_fields_size; 
    int fields_size;
    ushort static_field_size;
    char *static_fields;
} ClassFile;

typedef ClassFile Class;

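II. Helper code

As the structure descriptions above show, we constantly need to read 1, 2, 4 or 8 bytes, as well as byte runs of a given length. For numbers of type ushort, int, float, long and double we also need to convert from big-endian to little-endian byte order, because class files are big-endian and the author's CPU is little-endian. The conversion is simple: the bytes are still read sequentially, but they are stored in reverse order.

Macros that read big-endian data and store it in little-endian order (with a hint of recursion in them):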

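// dest must be a pointer to a byte buffer (uchar* or char*) large enough for the value.
// The multi-statement macros are wrapped in do { } while (0) so they behave as one statement.
#define READ_U1(fp, dest) fread((dest), 1, 1, (fp))
#define READ_U2(fp, dest) do { \
    READ_U1(fp, (dest) + 1); \
    READ_U1(fp, (dest)); \
} while (0)

#define READ_U4(fp, dest) do { \
    READ_U2(fp, (dest) + 2); \
    READ_U2(fp, (dest)); \
} while (0)

#define READ_U8(fp, dest) do { \
    READ_U4(fp, (dest) + 4); \
    READ_U4(fp, (dest)); \
} while (0)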

The functions that read ushort, int, float and so on can then be defined as follows:

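// Requires <stdio.h> and <string.h>. memcpy is used instead of pointer casts
// to avoid alignment and strict-aliasing problems.
ushort readUShort(FILE *fp)
{
    uchar uc2[2];
    ushort v;
    READ_U2(fp, uc2);
    memcpy(&v, uc2, sizeof v);
    return v;
}

float readFloat(FILE *fp)
{
    uchar uc4[4];
    float v;
    READ_U4(fp, uc4);
    memcpy(&v, uc4, sizeof v);
    return v;
}

uint readUInt(FILE *fp)
{
    uchar uc4[4];
    uint v;
    READ_U4(fp, uc4);
    memcpy(&v, uc4, sizeof v);
    return v;
}

int readInt(FILE *fp)
{
    uchar uc4[4];
    int v;
    READ_U4(fp, uc4);
    memcpy(&v, uc4, sizeof v);
    return v;
}
... // long and double omitted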
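The macros above assume a little-endian host. As an alternative sketch, the value can be assembled with shifts, which gives the same result on any host byte order and also reports short reads. The names read_u2_be, read_u4_be and read_u8_be are hypothetical and not part of the original code.

```c
#include <stdio.h>
#include <stdint.h>

/* Each reader returns 0 on success and -1 on a short read. */
int read_u2_be(FILE *fp, uint16_t *out)
{
    unsigned char b[2];
    if (fread(b, 1, sizeof b, fp) != sizeof b) {
        return -1;
    }
    *out = (uint16_t)((b[0] << 8) | b[1]);
    return 0;
}

int read_u4_be(FILE *fp, uint32_t *out)
{
    unsigned char b[4];
    if (fread(b, 1, sizeof b, fp) != sizeof b) {
        return -1;
    }
    *out = ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16)
         | ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
    return 0;
}

/* Reads the high four bytes first, then the low four bytes. */
int read_u8_be(FILE *fp, uint64_t *out)
{
    uint32_t high, low;
    if (read_u4_be(fp, &high) != 0 || read_u4_be(fp, &low) != 0) {
        return -1;
    }
    *out = ((uint64_t)high << 32) | low;
    return 0;
}
```

A long constant is the result of read_u8_be reinterpreted as a signed 64-bit integer; a double is the same 8 bytes copied into a double with memcpy.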

III. Parsing the class file

The preparation is done, so we can start parsing the class file.

The code looks roughly like this:

Class* loadClass(const char *filename)
{
    FILE *fp = fopen(filename, "rb");
    if (!fp) {
        printf("Cannot open: %s\n", filename);
        return NULL;
    }

    Class *pclass = (Class*)malloc(sizeof(Class));
    if (!pclass) {
        fclose(fp);
        return NULL;
    }
    pclass->parent_class = NULL;

    // step 1: read magic number
    pclass->magic = readUInt(fp);
    printf("Magic: 0x%X\n", pclass->magic);
    if (pclass->magic != 0xCAFEBABE) {
        printf("Invalid class file!\n");
        free(pclass);
        fclose(fp);
        return NULL;
    }

    // step2: read version
    pclass->minor_version = readUShort(fp);
    pclass->major_version = readUShort(fp);
    printf("minor_version: %d\n", pclass->minor_version);
    printf("major_version: %d\n", pclass->major_version);

    printf("--------------------------------------------\n");
    // step3: read constant pool
    parseConstantPool(fp, pclass);
    printf("constant_pool_count: %d\n", pclass->constant_pool_count);
    showConstantPool(pclass);

    printf("--------------------------------------------\n");
    // step4: read access_flags
    pclass->access_flags = readUShort(fp);
    printf("access_flag: %04X\t%s\n", pclass->access_flags, formatAccessFlag(pclass->access_flags));

    // step5: this class
    pclass->this_class = readUShort(fp);
    printf("this_class: #%d\t%s\n", pclass->this_class, get_class_name(pclass->constant_pool, pclass->this_class));

    // step6: super class
    pclass->super_class = readUShort(fp);
    if (pclass->super_class > 0) {
        printf("super_class: #%d\t%s\n", pclass->super_class, get_class_name(pclass->constant_pool, pclass->super_class));
    }

    printf("--------------------------------------------\n");
    // step7: read interfaces
    parseInterface(fp, pclass);
    printf("interface_count: %d\n", pclass->interface_count);
    showInterface(pclass);

    printf("--------------------------------------------\n");
    // step8: read fields
    parseFields(fp, pclass);
    printf("fields_count: %d\n", pclass->fields_count);
    showFields(pclass);

    printf("--------------------------------------------\n");
    // step9: read methods
    parseMethods(fp, pclass);
    printf("methods_count: %d\n", pclass->methods_count);
    showMethods(pclass);

    printf("--------------------------------------------\n");
    // step10: read attributes
    parseAttributes(fp, pclass);
    printf("attributes_count: %d\n", pclass->attributes_count);
    showAttributes(pclass, pclass->attributes, pclass->attributes_count);

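    //setThisClassFieldIndex(pclass);

    // link the CONSTANT_Class_info entry of this class back to the loaded class
    // (relies on a pclass member in CONSTANT_Class_info)
    ((CONSTANT_Class_info*)(pclass->constant_pool[pclass->this_class]))->pclass = pclass;

    fclose(fp); // the whole file has been read
    return pclass;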
}

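There is too much code to list every function involved. The approach is simply to handle each structure as it comes, following the descriptions in the JVM specification.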
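As an illustration of what one of the unlisted functions might look like, here is a sketch of the interface table step: read a u2 count, then that many u2 constant pool indexes. InterfaceTable and parse_interfaces are hypothetical; the original parseInterface works on the Class struct and is not shown in this article.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical struct that mirrors two members of ClassFile. */
typedef struct {
    unsigned short interface_count;
    unsigned short *interfaces; /* constant pool indexes of CONSTANT_Class_info entries */
} InterfaceTable;

/* Reads one big-endian u2; returns 0 on success, -1 on a short read. */
static int read_u2(FILE *fp, unsigned short *out)
{
    unsigned char b[2];
    if (fread(b, 1, sizeof b, fp) != sizeof b) {
        return -1;
    }
    *out = (unsigned short)((b[0] << 8) | b[1]);
    return 0;
}

/* Reads interfaces_count and the interfaces array.
 * Returns 0 on success, -1 on a short read or allocation failure. */
int parse_interfaces(FILE *fp, InterfaceTable *table)
{
    table->interfaces = NULL;
    if (read_u2(fp, &table->interface_count) != 0) {
        return -1;
    }
    if (table->interface_count == 0) {
        return 0; /* nothing to allocate */
    }
    table->interfaces = malloc(table->interface_count * sizeof *table->interfaces);
    if (!table->interfaces) {
        return -1;
    }
    for (unsigned short i = 0; i < table->interface_count; i++) {
        if (read_u2(fp, &table->interfaces[i]) != 0) {
            free(table->interfaces);
            table->interfaces = NULL;
            return -1;
        }
    }
    return 0;
}
```

The fields, methods and attributes tables follow the same count-then-entries pattern, with larger entries.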

IV. Testing

Hello.java

package test;

public class Hello implements IMath{
    static double C_DOUBLE = 12.45;
    public int xi = 789;
    protected long xl = 35;
    public float xf = -235.125f;
    public double db = 32.5;
    private int priv_i = 2;

    public int sum(int x, int y)
    {
        int s = 0;
        for(int i=x;i<=y;i++) {
            s+=i;
        }

        s += sub(x,y);
        return s+xi + this.priv_i;
    }

    private int sub(int x, int y) {
        return x-y;
    }
}

interface IMath {
    public int sum(int x, int y);
}

Compile it to bytecode with javac:

javac Hello.java

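Then parse the Hello.class file. The output, compared with the output of javap, is as follows:

[Figure: comparison of the constant pool output]
[Figure: comparison of the Code attribute output]

The comparison shows that the class file is parsed correctly.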
