Related posts (you may want to read these earlier entries in the series first 👇)
No warm-up, no real high concurrency: "Rate-limiting weapon #3: the token bucket algorithm" - Part 302
When the water is full, it overflows: "Rate-limiting weapon #4: the leaky bucket algorithm" - Part 303
How the Bloom Filter helped me solve a big-company problem - Part 305
Master: Disciple, are you awake yet? Get up and study.
Wuxian: Master, it's not even light out yet, is it?
Master: Study early. Haven't you heard that the early bird gets the worm?
Wuxian: The late bird gets worms too; it just eats the worms that also slept in.
Master: Fine, fine, you're always right. But if you don't get up now, you'll miss lunch.
Wuxian: Uh-oh. Master, don't tell me it's almost afternoon already?
Master: It is. You only just noticed? The sun has been shining on your backside for a while.
Wuxian: (#^.^#) ….
Master: Hurry up, eat, and let's get to studying…
Contents
1. Reading large files: the file-splitting approach
2. Reading large files: the multithreaded approach
3. Wuxian's summary
1. Reading large files: the file-splitting approach
The core idea of this approach: the file is too big, right? So why not split it into several smaller files and read them with multiple threads? The concrete steps:
(1) First split the file into several smaller files.
(2) Use multiple threads, one per file, so that no two threads work on the same file.
(3) Read each file line by line.
1.1 Splitting the file
Mac and Linux both ship a file-splitting command:
split -b 1024m test2.txt /data/tmp/my/test.txt.
Notes:
(1) split: the splitting command;
(2) -b 1024m: how many bytes to put in each piece (units m and k are supported). Here the 6.5G file is cut into 1G pieces, giving roughly 7 files.
(3) test2.txt: the file to split;
(4) test.txt. : the prefix for the output files; split automatically appends a suffix after this prefix.
Other options:
(1) -l <lines>: cut a new piece after every given number of lines.
(2) -C <bytes>: like -b, but tries to keep each line intact when cutting.
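If the shell command is not available, the same byte-based split can be sketched in plain Java. This is my own stand-in for split -b, not the article's approach, and it loads each part fully into memory, so it is only a sketch, not something to run on a 6.5G file as-is:

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class FileSplitter {
    // Hedged Java stand-in for `split -b`: writes consecutive partSize-byte
    // chunks of src to prefix00, prefix01, ... and returns the part count.
    // Assumption: partSize fits in an int, since each part is buffered in memory.
    public static int split(String src, String prefix, long partSize) throws IOException {
        int parts = 0;
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(src)))) {
            long remaining = new File(src).length();
            while (remaining > 0) {
                long thisPart = Math.min(partSize, remaining);
                byte[] chunk = new byte[(int) thisPart];
                in.readFully(chunk);
                try (FileOutputStream out = new FileOutputStream(
                        prefix + String.format("%02d", parts))) {
                    out.write(chunk);
                }
                remaining -= thisPart;
                parts++;
            }
        }
        return parts;
    }
}
```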
After splitting, the files look like this:
1.2 Reading the split files with multiple threads
We use multiple threads to read the split files, starting a thread to process each file:
public void readFileBySplitFile(String pathname) {
    // pathname is a directory here, not a concrete file, e.g. /data/tmp/my
    File file = new File(pathname);
    File[] files = file.listFiles();
    List<MyThread> threads = new ArrayList<>();
    for (File f : files) {
        MyThread thread = new MyThread(f.getPath());
        threads.add(thread);
        thread.start();
    }
    for (MyThread t : threads) {
        try {
            t.join();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}
private class MyThread extends Thread {
    private String pathname;

    public MyThread(String pathname) {
        this.pathname = pathname;
    }

    @Override
    public void run() {
        readFileFileChannel(pathname);
    }
}
Notes:
(1) Get all the split files under the given directory;
(2) Iterate over the file paths and hand each path to a thread; each thread's run uses readFileFileChannel to read its file.
(3) The join method makes the main thread wait for every worker thread to finish before continuing. If that is unfamiliar, see the earlier post: "Wuxian and Master visit the Kingdom of Daughters: parallel threads made serial, Thread you're impressive".
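The run method delegates to readFileFileChannel, which is not shown in this excerpt. A minimal sketch of what such a FileChannel-based reader might look like (my own guess at its shape; the 4MB chunk size and the byte-counting body are assumptions standing in for real processing):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ChannelReader {
    // Hedged guess at the readFileFileChannel helper the thread calls
    // (the article does not show it); counts bytes as stand-in work.
    public static long readFileFileChannel(String pathname) throws IOException {
        long total = 0;
        try (FileChannel channel = FileChannel.open(Paths.get(pathname), StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(4 * 1024 * 1024); // 4MB chunks (assumed size)
            while (channel.read(buf) != -1) {
                buf.flip();
                total += buf.remaining(); // real code would process the chunk here
                buf.clear();
            }
        }
        return total;
    }
}
```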
Test: 6.5G took 4 seconds.
With this multithreaded approach, in theory the bigger the file, the more pronounced the advantage. As for the number of threads: here it equals the number of files. Can you do that in real projects? Definitely not. You can probably see how to improve it, so I will not expand on it here.
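The usual improvement is to cap concurrency with a fixed-size pool instead of starting one thread per file. A hedged sketch (the pool size and the counter standing in for per-file work are my choices, not the article's):

```java
import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PooledReader {
    // Hedged improvement of the thread-per-file version above: size the pool
    // to the machine, not to the number of files; returns files processed.
    public static int readFilesWithPool(String dirname, int poolSize) throws InterruptedException {
        File[] files = new File(dirname).listFiles();
        if (files == null) return 0;
        AtomicInteger processed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        for (File f : files) {
            pool.submit(() -> {
                // stand-in for the per-file read (readFileFileChannel in the article)
                processed.incrementAndGet();
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return processed.get();
    }
}
```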
2. Reading large files: multiple threads over the same file
2.1 Multithreading, version 1.0
Let's look at another approach: multiple threads reading the same file. The idea is to divide the file into ranges and read from different starting positions. RandomAccessFile meets this requirement, because its seek method lets you specify where reading begins.
public void readFileByMutiThread(String pathname, int threadCount) {
    BufferedRandomAccessFile randomAccessFile = null;
    try {
        randomAccessFile = new BufferedRandomAccessFile(pathname, "r");
        // Get the file length so we can partition it.
        long fileTotalLength = randomAccessFile.length();
        // Size of each partition.
        long gap = fileTotalLength / threadCount;
        // Start and end positions of each partition.
        long[] beginIndexs = new long[threadCount];
        long[] endIndexs = new long[threadCount];
        // Start position of the next partition.
        long nextStartIndex = 0;
        // Find the start and end of each segment.
        for (int n = 0; n < threadCount; n++) {
            beginIndexs[n] = nextStartIndex;
            // The last thread takes everything that is left.
            if (n + 1 == threadCount) {
                endIndexs[n] = fileTotalLength;
                break;
            }
            /*
             * Otherwise, we need to find this segment's end position.
             */
            // (1) The previous nextStartIndex plus gap is the tentative next position.
            nextStartIndex += gap;
            // (2) nextStartIndex may fall in the middle of a line, so adjust it:
            // seek to nextStartIndex, then scan forward to the end of that line.
            randomAccessFile.seek(nextStartIndex);
            // Count the bytes up to the line break.
            long gapToEof = 0;
            boolean eol = false;
            while (!eol) {
                switch (randomAccessFile.read()) {
                    case -1:
                        eol = true;
                        break;
                    case '\n':
                        eol = true;
                        break;
                    case '\r':
                        eol = true;
                        break;
                    default:
                        gapToEof++;
                        break;
                }
            }
            // After the loop we sit on the last character of the line;
            // ++ moves past the line-break byte.
            gapToEof++;
            nextStartIndex += gapToEof;
            endIndexs[n] = nextStartIndex;
        }
        // Start the threads.
        List<MyThread2> threads = new ArrayList<>();
        for (int i = 0; i < threadCount; i++) {
            MyThread2 thread = new MyThread2(pathname, beginIndexs[i], endIndexs[i]);
            threads.add(thread);
            thread.start();
        }
        // Wait for all threads to finish.
        for (MyThread2 t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
What this method does: it partitions the file by position according to the thread count, so that each thread handles its own slice of the data.
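Before starting the threads, it can be worth sanity-checking the boundaries computed above. A small helper along these lines (my own addition, not from the article):

```java
public class SegmentCheck {
    // Hedged helper: verify that the begin/end arrays computed above
    // partition the whole file with no gaps and no overlaps.
    public static boolean isContiguousPartition(long[] begins, long[] ends, long totalLength) {
        if (begins.length != ends.length || begins.length == 0) return false;
        if (begins[0] != 0 || ends[ends.length - 1] != totalLength) return false;
        for (int i = 0; i < begins.length; i++) {
            if (begins[i] > ends[i]) return false;
            // each segment must start exactly where the previous one ended
            if (i > 0 && begins[i] != ends[i - 1]) return false;
        }
        return true;
    }
}
```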
Here is the thread itself:
private class MyThread2 extends Thread {
    private long begin;
    private long end;
    private String pathname;

    public MyThread2(String pathname, long begin, long end) {
        this.pathname = pathname;
        this.begin = begin;
        this.end = end;
    }

    @Override
    public void run() {
        //System.out.println("TestReadFile.MyThread2.run()-"+begin+"--"+end);
        RandomAccessFile randomAccessFile = null;
        try {
            randomAccessFile = new RandomAccessFile(pathname, "r");
            // Jump to this thread's starting position.
            randomAccessFile.seek(begin);
            StringBuffer buffer = new StringBuffer();
            String str;
            while ((str = randomAccessFile.readLine()) != null) {
                //System.out.println(str+"--"+Thread.currentThread().getName());
                // Process the line; we do not actually keep the whole string in memory.
                // A simple stand-in for real processing:
                buffer.append(str.substring(0, 1));
                // +1 accounts for the line-break byte.
                begin += (str.length() + 1);
                if (begin >= end) {
                    break;
                }
            }
            System.out.println("buffer.length:" + buffer.length() + "--" + Thread.currentThread().getName());
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            //TODO close randomAccessFile.
        }
    }
}
What this thread does: read the data in its assigned region of the file, from begin to end.
Run it and see. For the 6.5G file it just kept running with no end in sight; I could not wait any longer and killed it.
Why is it so slow? Didn't this approach seem so promising? Why must it wound my fragile little heart?
Let's analyze. The earlier method readFileByRandomAccessFile was also very slow when we tested it, so the slowness is not caused by the threads. Then what is causing it?
Let's look at RandomAccessFile's readLine() method:
public final String readLine() throws IOException {
    StringBuffer input = new StringBuffer();
    int c = -1;
    boolean eol = false;

    while (!eol) {
        switch (c = read()) {
            case -1:
            case '\n':
                eol = true;
                break;
            case '\r':
                eol = true;
                long cur = getFilePointer();
                if ((read()) != '\n') {
                    seek(cur);
                }
                break;
            default:
                input.append((char) c);
                break;
        }
    }

    if ((c == -1) && (input.length() == 0)) {
        return null;
    }
    return input.toString();
}
The principle: a while loop keeps reading characters one at a time; when it meets \n or \r, readLine stops and returns that line. So the core method is read():
public int read() throws IOException {
    return read0();
}

private native int read0() throws IOException;
It calls straight into a native method. What does that method do? The Javadoc comment explains:
* Reads a byte of data from this file. The byte is returned as an
* integer in the range 0 to 255 ({@code 0x00-0x0ff}). This
* method blocks if no input is yet available.
From this we know: read() reads one byte of data from the file, returned as an integer in the range 0 to 255 (0x00-0x0ff), and it blocks if no input is yet available.
By now you may see why it is so slow: reading a single byte at a time, with a call into native code for every byte, is bound to be slow.
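To make the cost concrete, here is a rough illustration (my own sketch, not from the article, and not a rigorous benchmark): the same file read byte-at-a-time through RandomAccessFile versus through a BufferedInputStream:

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;

public class ReadCostDemo {
    // Returns {unbufferedNanos, bufferedNanos} for a full byte-by-byte pass.
    public static long[] timeBoth(String pathname) throws IOException {
        long t0 = System.nanoTime();
        try (RandomAccessFile raf = new RandomAccessFile(pathname, "r")) {
            while (raf.read() != -1) { /* one native call per byte */ }
        }
        long unbuffered = System.nanoTime() - t0;

        long t1 = System.nanoTime();
        try (InputStream in = new BufferedInputStream(new FileInputStream(pathname))) {
            while (in.read() != -1) { /* mostly served from an in-memory buffer */ }
        }
        long buffered = System.nanoTime() - t1;
        return new long[] { unbuffered, buffered };
    }
}
```

On a large file, the buffered pass should come out far ahead; on tiny files the difference is lost in noise.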
2.2 Multithreading, version 2.0
So what now? There is a class called BufferedRandomAccessFile. It is not part of the JDK, so you need to find the source yourself; a copy (carrying the Apache Software Foundation license header) is below:
package com.kfit.bloomfilter;
/**
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Arrays;
/**
* A <code>BufferedRandomAccessFile</code> is like a
* <code>RandomAccessFile</code>, but it uses a private buffer so that most
* operations do not require a disk access.
* <P>
*
* Note: The operations on this class are unmonitored. Also, the correct
* functioning of the <code>RandomAccessFile</code> methods that are not
* overridden here relies on the implementation of those methods in the
* superclass.
*/
public final class BufferedRandomAccessFile extends RandomAccessFile
{
static final int LogBuffSz_ = 16; // 64K buffer
public static final int BuffSz_ = (1 << LogBuffSz_);
static final long BuffMask_ = ~(((long) BuffSz_) - 1L);
private String path_;
/*
* This implementation is based on the buffer implementation in Modula-3's
* "Rd", "Wr", "RdClass", and "WrClass" interfaces.
*/
private boolean dirty_; // true iff unflushed bytes exist
private boolean syncNeeded_; // dirty_ can be cleared by e.g. seek, so track sync separately
private long curr_; // current position in file
private long lo_, hi_; // bounds on characters in "buff"
private byte[] buff_; // local buffer
private long maxHi_; // this.lo + this.buff.length
private boolean hitEOF_; // buffer contains last file block?
private long diskPos_; // disk position
/*
* To describe the above fields, we introduce the following abstractions for
* the file "f":
*
* len(f) the length of the file curr(f) the current position in the file
* c(f) the abstract contents of the file disk(f) the contents of f's
* backing disk file closed(f) true iff the file is closed
*
* "curr(f)" is an index in the closed interval [0, len(f)]. "c(f)" is a
* character sequence of length "len(f)". "c(f)" and "disk(f)" may differ if
* "c(f)" contains unflushed writes not reflected in "disk(f)". The flush
* operation has the effect of making "disk(f)" identical to "c(f)".
*
* A file is said to be *valid* if the following conditions hold:
*
* V1. The "closed" and "curr" fields are correct:
*
* f.closed == closed(f) f.curr == curr(f)
*
* V2. The current position is either contained in the buffer, or just past
* the buffer:
*
* f.lo <= f.curr <= f.hi
*
* V3. Any (possibly) unflushed characters are stored in "f.buff":
*
* (forall i in [f.lo, f.curr): c(f)[i] == f.buff[i - f.lo])
*
* V4. For all characters not covered by V3, c(f) and disk(f) agree:
*
* (forall i in [f.lo, len(f)): i not in [f.lo, f.curr) => c(f)[i] ==
* disk(f)[i])
*
* V5. "f.dirty" is true iff the buffer contains bytes that should be
* flushed to the file; by V3 and V4, only part of the buffer can be dirty.
*
* f.dirty == (exists i in [f.lo, f.curr): c(f)[i] != f.buff[i - f.lo])
*
* V6. this.maxHi == this.lo + this.buff.length
*
* Note that "f.buff" can be "null" in a valid file, since the range of
* characters in V3 is empty when "f.lo == f.curr".
*
* A file is said to be *ready* if the buffer contains the current position,
* i.e., when:
*
* R1. !f.closed && f.buff != null && f.lo <= f.curr && f.curr < f.hi
*
* When a file is ready, reading or writing a single byte can be performed
* by reading or writing the in-memory buffer without performing a disk
* operation.
*/
/**
* Open a new <code>BufferedRandomAccessFile</code> on <code>file</code>
* in mode <code>mode</code>, which should be "r" for reading only, or
* "rw" for reading and writing.
*/
public BufferedRandomAccessFile(File file, String mode) throws IOException
{
this(file, mode, 0);
}
public BufferedRandomAccessFile(File file, String mode, int size) throws IOException
{
super(file, mode);
path_ = file.getAbsolutePath();
this.init(size);
}
/**
* Open a new <code>BufferedRandomAccessFile</code> on the file named
* <code>name</code> in mode <code>mode</code>, which should be "r" for
* reading only, or "rw" for reading and writing.
*/
public BufferedRandomAccessFile(String name, String mode) throws IOException
{
this(name, mode, 0);
}
public BufferedRandomAccessFile(String name, String mode, int size) throws FileNotFoundException
{
super(name, mode);
path_ = name;
this.init(size);
}
private void init(int size)
{
this.dirty_ = false;
this.lo_ = this.curr_ = this.hi_ = 0;
this.buff_ = (size > BuffSz_) ? new byte[size] : new byte[BuffSz_];
this.maxHi_ = (long) BuffSz_;
this.hitEOF_ = false;
this.diskPos_ = 0L;
}
public String getPath()
{
return path_;
}
public void sync() throws IOException
{
if (syncNeeded_)
{
flush();
getChannel().force(true);
syncNeeded_ = false;
}
}
// public boolean isEOF() throws IOException
// {
// assert getFilePointer() <= length();
// return getFilePointer() == length();
// }
public void close() throws IOException
{
this.flush();
this.buff_ = null;
super.close();
}
/**
* Flush any bytes in the file's buffer that have not yet been written to
* disk. If the file was created read-only, this method is a no-op.
*/
public void flush() throws IOException
{
this.flushBuffer();
}
/* Flush any dirty bytes in the buffer to disk. */
private void flushBuffer() throws IOException
{
if (this.dirty_)
{
if (this.diskPos_ != this.lo_)
super.seek(this.lo_);
int len = (int) (this.curr_ - this.lo_);
super.write(this.buff_, 0, len);
this.diskPos_ = this.curr_;
this.dirty_ = false;
}
}
/*
* Read at most "this.buff.length" bytes into "this.buff", returning the
* number of bytes read. If the return result is less than
* "this.buff.length", then EOF was read.
*/
private int fillBuffer() throws IOException
{
int cnt = 0;
int rem = this.buff_.length;
while (rem > 0)
{
int n = super.read(this.buff_, cnt, rem);
if (n < 0)
break;
cnt += n;
rem -= n;
}
if ( (cnt < 0) && (this.hitEOF_ = (cnt < this.buff_.length)) )
{
// make sure buffer that wasn't read is initialized with -1
Arrays.fill(this.buff_, cnt, this.buff_.length, (byte) 0xff);
}
this.diskPos_ += cnt;
return cnt;
}
/*
* This method positions <code>this.curr</code> at position <code>pos</code>.
* If <code>pos</code> does not fall in the current buffer, it flushes the
* current buffer and loads the correct one.<p>
*
* On exit from this routine <code>this.curr == this.hi</code> iff <code>pos</code>
* is at or past the end-of-file, which can only happen if the file was
* opened in read-only mode.
*/
public void seek(long pos) throws IOException
{
if (pos >= this.hi_ || pos < this.lo_)
{
// seeking outside of current buffer -- flush and read
this.flushBuffer();
this.lo_ = pos & BuffMask_; // start at BuffSz boundary
this.maxHi_ = this.lo_ + (long) this.buff_.length;
if (this.diskPos_ != this.lo_)
{
super.seek(this.lo_);
this.diskPos_ = this.lo_;
}
int n = this.fillBuffer();
this.hi_ = this.lo_ + (long) n;
}
else
{
// seeking inside current buffer -- no read required
if (pos < this.curr_)
{
// if seeking backwards, we must flush to maintain V4
this.flushBuffer();
}
}
this.curr_ = pos;
}
public long getFilePointer()
{
return this.curr_;
}
public long length() throws IOException
{
// max accounts for the case where we have written past the old file length, but not yet flushed our buffer
return Math.max(this.curr_, super.length());
}
public int read() throws IOException
{
if (this.curr_ >= this.hi_)
{
// test for EOF
// if (this.hi < this.maxHi) return -1;
if (this.hitEOF_)
return -1;
// slow path -- read another buffer
this.seek(this.curr_);
if (this.curr_ == this.hi_)
return -1;
}
byte res = this.buff_[(int) (this.curr_ - this.lo_)];
this.curr_++;
return ((int) res) & 0xFF; // convert byte -> int
}
public int read(byte[] b) throws IOException
{
return this.read(b, 0, b.length);
}
public int read(byte[] b, int off, int len) throws IOException
{
if (this.curr_ >= this.hi_)
{
// test for EOF
// if (this.hi < this.maxHi) return -1;
if (this.hitEOF_)
return -1;
// slow path -- read another buffer
this.seek(this.curr_);
if (this.curr_ == this.hi_)
return -1;
}
len = Math.min(len, (int) (this.hi_ - this.curr_));
int buffOff = (int) (this.curr_ - this.lo_);
System.arraycopy(this.buff_, buffOff, b, off, len);
this.curr_ += len;
return len;
}
public void write(int b) throws IOException
{
if (this.curr_ >= this.hi_)
{
if (this.hitEOF_ && this.hi_ < this.maxHi_)
{
// at EOF -- bump "hi"
this.hi_++;
}
else
{
// slow path -- write current buffer; read next one
this.seek(this.curr_);
if (this.curr_ == this.hi_)
{
// appending to EOF -- bump "hi"
this.hi_++;
}
}
}
this.buff_[(int) (this.curr_ - this.lo_)] = (byte) b;
this.curr_++;
this.dirty_ = true;
syncNeeded_ = true;
}
public void write(byte[] b) throws IOException
{
this.write(b, 0, b.length);
}
public void write(byte[] b, int off, int len) throws IOException
{
while (len > 0)
{
int n = this.writeAtMost(b, off, len);
off += n;
len -= n;
this.dirty_ = true;
syncNeeded_ = true;
}
}
/*
* Write at most "len" bytes to "b" starting at position "off", and return
* the number of bytes written.
*/
private int writeAtMost(byte[] b, int off, int len) throws IOException
{
if (this.curr_ >= this.hi_)
{
if (this.hitEOF_ && this.hi_ < this.maxHi_)
{
// at EOF -- bump "hi"
this.hi_ = this.maxHi_;
}
else
{
// slow path -- write current buffer; read next one
this.seek(this.curr_);
if (this.curr_ == this.hi_)
{
// appending to EOF -- bump "hi"
this.hi_ = this.maxHi_;
}
}
}
len = Math.min(len, (int) (this.hi_ - this.curr_));
int buffOff = (int) (this.curr_ - this.lo_);
System.arraycopy(b, off, this.buff_, buffOff, len);
this.curr_ += len;
return len;
}
}
Then simply replace the RandomAccessFile used above with BufferedRandomAccessFile.
Let's test it:
First, the earlier single-threaded method:
TestReadFile.readFileByBufferedRandomAccessFile(pathname2);
6.5G took 32 seconds.
Compared with the version that never finished, that is a big improvement, but it is still slower than NIO.
Now the multithreaded version:
6.5G: 2 threads 20s; 3 threads 16s; 4 threads 14s; 5 threads 11s; 6 threads 8s; 7 threads 8s; 8 threads 9s.
My Mac has a 6-core processor, so performance peaks at 6 threads; beyond that, thread context switching wastes time and the numbers creep back up. Even so, it still cannot quite match the file-splitting version above.
2.3 Multithreading, version 3.0
Since JDK 1.4, most of what RandomAccessFile does has been superseded by NIO's "memory-mapped files" via MappedByteBuffer. Try it yourself; this post will not go into it.
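As a starting point, a minimal MappedByteBuffer sketch (my own illustration, not code from the article; counting newline bytes stands in for real processing):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedReader {
    // Memory-map the file in windows (a single MappedByteBuffer is limited
    // to ~2GB) and count newline bytes as stand-in per-byte work.
    public static long countNewlines(String pathname) throws IOException {
        try (FileChannel channel = FileChannel.open(Paths.get(pathname), StandardOpenOption.READ)) {
            long size = channel.size();
            long count = 0;
            long pos = 0;
            while (pos < size) {
                int window = (int) Math.min(Integer.MAX_VALUE, size - pos);
                MappedByteBuffer map = channel.map(FileChannel.MapMode.READ_ONLY, pos, window);
                for (int i = 0; i < window; i++) {
                    if (map.get(i) == '\n') count++;
                }
                pos += window;
            }
            return count;
        }
    }
}
```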
3. Wuxian's summary
Master: This post was a bit hard, a bit eye-stinging and brain-scorching, so today I will summarize it for you.
Disciple: Master, it is too much. I nearly fell asleep listening.
Master: File I/O is inherently on the complex side, and on a given project not everyone ends up writing I/O code anyway.
A quick recap of the two main points:
(1) First: read a large file by splitting it, combined with NIO, and speed improves. The core idea: use the Mac/Linux split command to cut the large file into several small ones, then read each small file in its own thread. 13.56G split into 6 files took 8 seconds; 26G took 16 seconds. At that rate, reading 100G is roughly a one-minute affair. Of course the real time depends heavily on what your per-line processing does; if you print every line with System.out, it will take far longer.
(2) Second: read the large file with multiple threads over the same file. The core idea: split the file into n segments by length, then have each thread use RandomAccessFile's seek to jump straight to its segment and read from there. 13.56G with 6 threads took 23 seconds.
Also, NIO's FileChannel is quite fast even single-threaded: 13.56G took 15 seconds. As mentioned before, Java supports large files well out of the box. That's Java: not only Write once, but Write happy.
One final caveat: a ByteBuffer read hands you a chunk containing many lines, not one line at a time.
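One way to recover whole lines from such chunks is to carry the incomplete tail of each chunk over to the next one. A hedged sketch (my own helper, not from the article; it also assumes a chunk boundary never splits a multi-byte UTF-8 character):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class LineSplitter {
    // A ByteBuffer chunk can end mid-line, so keep the unfinished tail in
    // `carry` and prepend it to the next chunk instead of losing it.
    public static List<String> splitLines(ByteBuffer buf, StringBuilder carry) {
        List<String> lines = new ArrayList<>();
        String text = carry.toString() + StandardCharsets.UTF_8.decode(buf).toString();
        carry.setLength(0);
        int start = 0;
        int nl;
        while ((nl = text.indexOf('\n', start)) >= 0) {
            lines.add(text.substring(start, nl));
            start = nl + 1;
        }
        carry.append(text.substring(start)); // the incomplete last line, if any
        return lines;
    }
}
```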
I am what I am, a firework of a different color.
I am what I am, a little apple unlike the rest.