Crawler4j Study Notes: util

The util package has two classes, IO.java and Util.java.

IO.java handles file operations.

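deleteFolder deletes a folder (directory). It calls deleteFolderContents to delete the files inside the folder, recursing through deleteFolder for each sub-folder, and then deletes the parent folder itself.
Its purpose here is to remove persisted url data.
When using crawler4j you configure a folder for storing url data (so that a previous crawl can be resumed), and a folder named frontier is created under crawlStorageFolder to hold it. When Crawler4j is run again from scratch, the data left over from the last run, that is, frontier, has to be deleted, and this is where deleteFolder is needed.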
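
The recursive deletion described above can be sketched as follows. This is a minimal illustration and not the crawler4j source: the class name FolderCleaner, the boolean return values, and the temporary folder layout built in main are assumptions made for the example.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

// Hypothetical helper class; only the two method names mirror the ones discussed above.
final class FolderCleaner {

    // Deletes everything inside `folder`, then the folder itself.
    static boolean deleteFolder(File folder) {
        return deleteFolderContents(folder) && folder.delete();
    }

    // Deletes plain files directly and recurses into sub-folders.
    static boolean deleteFolderContents(File folder) {
        File[] entries = folder.listFiles();
        if (entries == null) {
            return false; // not a directory, or an I/O error occurred
        }
        for (File entry : entries) {
            boolean deleted = entry.isDirectory() ? deleteFolder(entry) : entry.delete();
            if (!deleted) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway tree: <tmp>/frontier/sub/data.bin (layout is illustrative only)
        File root = Files.createTempDirectory("crawl-storage").toFile();
        File frontier = new File(root, "frontier");
        File sub = new File(frontier, "sub");
        if (!sub.mkdirs()) {
            throw new IOException("could not create " + sub);
        }
        Files.write(new File(sub, "data.bin").toPath(), new byte[] {1, 2, 3});

        System.out.println("deleted: " + deleteFolder(frontier));
        System.out.println("still exists: " + frontier.exists());
        root.delete(); // clean up the now-empty temporary root
    }
}
```

Returning a boolean rather than throwing keeps the sketch close to the java.io.File API, whose delete() also reports failure through its return value.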

Configuring the folder that stores url data

CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder(crawlStorageFolder);

Structure of the frontier folder

Likewise, writeBytesToFile covers the saving side: it writes a byte array to a file, and it does so with FileChannel and ByteBuffer from nio.

FileChannel fc = new FileOutputStream(destination).getChannel();
fc.write(ByteBuffer.wrap(bytes));
fc.close();

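A FileChannel cannot be instantiated through a constructor. You obtain one by calling getChannel() on a FileInputStream, FileOutputStream, or RandomAccessFile (or, since Java 7, through FileChannel.open), and then read and write the file through a ByteBuffer.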
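
As a hedged sketch of the same idea, the snippet below writes and reads a file through a FileChannel using try-with-resources, so the channel and stream are closed even if the write fails. The class name ChannelIo and the readBytesFromFile helper are assumptions for illustration; the read helper also assumes the file fits in a single int-sized buffer.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.util.Arrays;

// Hypothetical class name; not part of crawler4j.
final class ChannelIo {

    // Writes all bytes to `destination`, looping because write() may be partial.
    static void writeBytesToFile(byte[] bytes, File destination) throws IOException {
        try (FileOutputStream out = new FileOutputStream(destination);
             FileChannel channel = out.getChannel()) {
            ByteBuffer buffer = ByteBuffer.wrap(bytes);
            while (buffer.hasRemaining()) {
                channel.write(buffer);
            }
        }
    }

    // Reads the whole file into a byte array (assumes the file is smaller than 2 GB).
    static byte[] readBytesFromFile(File source) throws IOException {
        try (FileInputStream in = new FileInputStream(source);
             FileChannel channel = in.getChannel()) {
            ByteBuffer buffer = ByteBuffer.allocate((int) channel.size());
            while (buffer.hasRemaining() && channel.read(buffer) != -1) {
                // keep reading until the buffer is full or end of file is reached
            }
            return buffer.array();
        }
    }

    public static void main(String[] args) throws IOException {
        File file = Files.createTempFile("channel-io", ".bin").toFile();
        byte[] payload = {10, 20, 30, 40};
        writeBytesToFile(payload, file);
        System.out.println(Arrays.equals(payload, readBytesFromFile(file)));
        file.delete();
    }
}
```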

Util.java mainly covers conversions between long, int, and byte, plus helpers for determining the type of resource a url points to.

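The conversions between long, int, and byte are implemented with bit shifts and masks. An int is 4 bytes wide, so the conversion stores the result in a new byte[4]; likewise a long needs a new byte[8].
The implementation is as follows: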

    public static byte[] int2ByteArray(int value) {
        byte[] b = new byte[4];
        for (int i = 0; i < 4; i++) {
            int offset = (3 - i) * 8;
            b[i] = (byte) ((value >>> offset) & 0xFF);
        }
        return b;
    }

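Note that, taking value as 0x0A0B0C0D, the mapping is as follows (the rightmost column is the least significant byte):

byte no. in value	[3]	[2]	[1]	[0]
value	0A	0B	0C	0D
array element	b[3]	b[2]	b[1]	b[0]
stored byte	0D	0C	0B	0A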

In other words, the most significant byte is stored at the lowest index of b (big-endian order); long2ByteArray works the same way.
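
To make the byte order concrete, here is a small sketch. int2ByteArray is copied from the snippet above; long2ByteArray is written by analogy, and byteArray2Int is a hypothetical inverse added only for the round trip. The class name ByteOrderDemo is likewise an assumption. The comparison with ByteBuffer relies on the documented fact that a new ByteBuffer defaults to big-endian order.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical demo class; not part of crawler4j.
final class ByteOrderDemo {

    static byte[] int2ByteArray(int value) {
        byte[] b = new byte[4];
        for (int i = 0; i < 4; i++) {
            int offset = (3 - i) * 8; // 24, 16, 8, 0: most significant byte first
            b[i] = (byte) ((value >>> offset) & 0xFF);
        }
        return b;
    }

    // Same scheme for a long: 8 bytes, most significant byte at index 0.
    static byte[] long2ByteArray(long value) {
        byte[] b = new byte[8];
        for (int i = 0; i < 8; i++) {
            int offset = (7 - i) * 8;
            b[i] = (byte) ((value >>> offset) & 0xFF);
        }
        return b;
    }

    // Hypothetical inverse: rebuilds the int from 4 big-endian bytes.
    static int byteArray2Int(byte[] b) {
        int value = 0;
        for (int i = 0; i < 4; i++) {
            value = (value << 8) | (b[i] & 0xFF); // mask to undo sign extension
        }
        return value;
    }

    public static void main(String[] args) {
        byte[] b = int2ByteArray(0x0A0B0C0D);
        System.out.println(Arrays.toString(b)); // prints [10, 11, 12, 13]

        // ByteBuffer writes big-endian by default, so the two layouts agree.
        byte[] viaBuffer = ByteBuffer.allocate(4).putInt(0x0A0B0C0D).array();
        System.out.println(Arrays.equals(b, viaBuffer)); // prints true

        System.out.println(byteArray2Int(b) == 0x0A0B0C0D); // prints true
    }
}
```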

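Note that putIntInByteArray does not check that offset + 4 <= buf.length. If the four bytes do not fit, the loop throws ArrayIndexOutOfBoundsException partway through, possibly after some bytes have already been written.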

    public static void putIntInByteArray(int value, byte[] buf, int offset) {
        for (int i = 0; i < 4; i++) {
            int valueOffset = (3 - i) * 8;
            buf[offset + i] = (byte) ((value >>> valueOffset) & 0xFF);
        }
    }
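
One way to close that gap is to validate the range before writing anything. The variant below is only a sketch of that idea, not a change that exists in crawler4j; the class name SafeBytes and the choice of IllegalArgumentException are assumptions.

```java
import java.util.Arrays;

// Hypothetical class; shows a bounds-checked variant of the method above.
final class SafeBytes {

    static void putIntInByteArray(int value, byte[] buf, int offset) {
        // Reject the call up front so that buf is never left half-written.
        if (offset < 0 || offset > buf.length - 4) {
            throw new IllegalArgumentException(
                    "need 4 bytes at offset " + offset + ", but buf.length is " + buf.length);
        }
        for (int i = 0; i < 4; i++) {
            int valueOffset = (3 - i) * 8;
            buf[offset + i] = (byte) ((value >>> valueOffset) & 0xFF);
        }
    }

    public static void main(String[] args) {
        byte[] buf = new byte[6];
        putIntInByteArray(0x0A0B0C0D, buf, 2);
        System.out.println(Arrays.toString(buf)); // prints [0, 0, 10, 11, 12, 13]

        try {
            putIntInByteArray(1, buf, 3); // only 3 bytes left from offset 3
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```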

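Determining the type of resource a url points to is done by passing in the content-type from the response (a String). hasBinaryContent treats image, audio, video, and application as binary data, while hasPlainTextContent checks for text/plain.
For example: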

    public static boolean hasBinaryContent(String contentType) {
        if (contentType != null) {
            String typeStr = contentType.toLowerCase();
            if (typeStr.contains("image") || typeStr.contains("audio") || typeStr.contains("video") || typeStr.contains("application")) {
                return true;
            }
        }
        return false;
    }
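
A short sketch of how the two checks behave on typical header values. hasBinaryContent follows the snippet above; hasPlainTextContent is written from the description (a text/plain test) and may differ from the actual crawler4j code. The class name ContentTypes and the use of Locale.ROOT for lowercasing are assumptions.

```java
import java.util.Locale;

// Hypothetical class; illustrates the substring-based content-type checks.
final class ContentTypes {

    static boolean hasBinaryContent(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Locale.ROOT keeps lowercasing independent of the default locale.
        String typeStr = contentType.toLowerCase(Locale.ROOT);
        return typeStr.contains("image") || typeStr.contains("audio")
                || typeStr.contains("video") || typeStr.contains("application");
    }

    // Assumed shape, based only on the description "checks for text/plain".
    static boolean hasPlainTextContent(String contentType) {
        return contentType != null
                && contentType.toLowerCase(Locale.ROOT).contains("text/plain");
    }

    public static void main(String[] args) {
        System.out.println(hasBinaryContent("image/png"));                    // prints true
        System.out.println(hasBinaryContent("text/html; charset=UTF-8"));     // prints false
        System.out.println(hasPlainTextContent("text/plain; charset=UTF-8")); // prints true
        // A substring test on "application" also matches textual formats:
        System.out.println(hasBinaryContent("application/json"));             // prints true
    }
}
```

Because the check is a plain substring match, any application/* type counts as binary, including textual ones such as application/json, so callers that need those may want an extra check of their own.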

Cceking
