Crawling the MIT Place Pulse Dataset and Google Street View Images

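1. Project Background

1.1 Why Google Street View images are needed

The MIT Place Pulse dataset can be downloaded directly, but it does not include the street-view images themselves. It only provides the coordinates of each view, so the corresponding images have to be fetched through the Google Street View Static API.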
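Most of the locations in the dataset are outside China, so Google Street View is the practical source for them, and reaching Google from mainland China generally requires a proxy.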

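1.2 Goal of using Google Street View images

None of the papers on "linking street-view images to people's subjective perception" provide open-source code, so the model has to be implemented and trained by hand, which requires the MIT Place Pulse dataset as a basis.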

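1.3 Basic workflow for "linking street-view images to people's subjective perception":

Train a model on the MIT Place Pulse dataset and its street-view images.
Fetch Baidu Maps street-view images as model input and run them through the trained model to obtain results (for example, a safety score for each street view).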

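1.4 Reference links

The following article describes in detail how to crawl this training set with Python. It gives the location of the training set and also provides several usable Google Street View Static API keys: https://zhuanlan.zhihu.com/p/34967038
After downloading the training-set files mentioned in that article, read votes.csv and readme.txt carefully. They make it clear that a street-view image download url has to be built for every record in votes.csv. Each record contains two coordinates, and on closer inspection some coordinates (street-view IDs) appear in more than one record, so a deduplication pass is needed before building urls or downloading images.

Before any of this, a Street View Static API key has to be requested from the Google Cloud console. The keys in the api.txt file from the article above can be tried, but most of them no longer work: they are public, everyone uses them, and they hit their limits quickly. It is better to have yourself and your teammates apply for several keys; applying requires a VISA credit card. Application link: https://developers.google.com/maps/documentation/streetview/get-api-key
With a usable key, a street-view image can be fetched with a GET request to a url of the following form: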

https://maps.googleapis.com/maps/api/streetview?size=400x300&location=39.737314,-104.87407400000001&key=YOUR_API_KEY
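
As a rough illustration of the request above, here is a minimal sketch that builds the url and sends the GET request using only the JDK. The class name StreetViewRequestSketch, the method names, and the 10-second timeouts are assumptions made for this example; the project itself uses okhttp, as shown later.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

final class StreetViewRequestSketch {
    private static final String ENDPOINT = "https://maps.googleapis.com/maps/api/streetview";

    /** Builds the request url for one coordinate, with the same size=400x300 as the url above. */
    static String buildUrl(double latitude, double longitude, String apiKey) {
        return ENDPOINT + "?size=400x300&location=" + latitude + "," + longitude + "&key=" + apiKey;
    }

    /** Sends the GET request; returns the body bytes, or null if the status is not 200. */
    static byte[] fetch(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(10_000); // assumed timeout, in milliseconds
        conn.setReadTimeout(10_000);    // assumed timeout, in milliseconds
        try {
            if (conn.getResponseCode() != HttpURLConnection.HTTP_OK) {
                return null;
            }
            try (InputStream in = conn.getInputStream();
                 ByteArrayOutputStream out = new ByteArrayOutputStream()) {
                byte[] buffer = new byte[8192];
                int len;
                while ((len = in.read(buffer)) != -1) {
                    out.write(buffer, 0, len);
                }
                return out.toByteArray();
            }
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) {
        // Only prints the url; call fetch(...) with a real key to download the image
        System.out.println(buildUrl(39.5, -104.25, "YOUR_API_KEY"));
    }
}
```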

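2. Task Breakdown

The business logic is roughly as follows:

Parse votes.csv and iterate over every record;
For each coordinate parsed from a record, check whether its image has already been downloaded;
If it has, skip it;
If it has not, build the url;
Send a GET request to download the image; since this is an IO-bound task, use a thread pool to download concurrently;
Store the image (the project only requires saving to a local folder).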

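2.1 Parsing the csv file

Many tools can parse csv files, for example javacsv or plain InputStream-based reading. Using a mature existing library is strongly recommended over writing your own parsing logic, and reading the whole file into memory before parsing is discouraged even more. This project uses univocity-parsers, which claims to be the fastest parser available to date and can read records one at a time (row scanning). See: https://github.com/uniVocity/univocity-parsers
For usage of univocity-parsers, see https://blog.csdn.net/qq_21101587/article/details/79803582; it is not repeated here.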
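
To illustrate what row scanning means, the sketch below reads one line at a time with a JDK BufferedReader instead of loading the whole file. It is only a stand-in so that the example runs without extra dependencies: the naive split(",") does not handle quoted fields, which is exactly why a real parser library is preferred. The class name RowScanSketch and the header names in main are hypothetical; the column positions (IDs in columns 0 and 1, category in column 7) follow the core code later in this post.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class RowScanSketch {
    /** Scans the input row by row and collects both pano IDs of every row in the given category. */
    static List<String> panoIdsFor(Reader source, String category) {
        List<String> ids = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line = reader.readLine(); // first line is the header row, skip it
            while ((line = reader.readLine()) != null) {
                String[] row = line.split(",", -1); // naive split: no quoted-field support
                if (row.length > 7 && category.equals(row[7])) {
                    ids.add(row[0]);
                    ids.add(row[1]);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return ids;
    }

    public static void main(String[] args) {
        // Hypothetical header names; only the column positions matter here
        String csv = "left_id,right_id,winner,left_lat,left_long,right_lat,right_long,category\n"
                + "a1,b1,left,1.0,2.0,3.0,4.0,safety\n"
                + "a2,b2,right,1.0,2.0,3.0,4.0,wealthy\n";
        System.out.println(panoIdsFor(new StringReader(csv), "safety"));
    }
}
```

Only one row is held in memory at a time, so memory use stays flat regardless of file size.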

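2.2 Deduplicating street-view IDs (coordinates)

Note that votes.csv has nearly 1.23 million records, i.e. nearly 2.46 million coordinates (duplicates included). Reading the whole file at once and putting everything into a HashSet may cause an OOM error, and if the file had hundreds of millions of records that approach would be even less viable. The author deduplicates with redis, whose near-O(1) lookups can handle deduplication over large volumes of data. This training set is nowhere near that scale, though, so the exists() method of the File class works for deduplication as well.

2.2.1 Deduplicating with redis:

The coordinate deduplication logic mainly uses the following commands: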

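// Sets the key only if it does not already exist in redis; otherwise does nothing
jedis.setnx(key,value);

// Checks whether the key exists
jedis.exists(key);

// Returns the set of all keys matching the glob pattern spider-hgg-googlemap:*
jedis.keys("spider-hgg-googlemap:*");

// Deletes the key
jedis.del(key);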
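
The "set only if absent" semantics of setnx can be sketched without a redis server. The example below uses an in-memory concurrent set as a stand-in (an assumption made so the code runs on its own; the class name SeenIdsSketch is hypothetical). The point it illustrates is that the return value of a single claim call already says whether the ID is new, much as setnx returns 1 only when it actually stored the key.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class SeenIdsSketch {
    // In-memory stand-in for the redis key space
    private final Set<String> seen = ConcurrentHashMap.newKeySet();
    private final String prefix;

    SeenIdsSketch(String prefix) {
        this.prefix = prefix;
    }

    /** Returns true only for the first caller that claims this ID (like setnx returning 1). */
    boolean claim(String panoId) {
        return seen.add(prefix + ":" + panoId);
    }

    /** Read-only check, comparable to exists(key). */
    boolean exists(String panoId) {
        return seen.contains(prefix + ":" + panoId);
    }

    public static void main(String[] args) {
        SeenIdsSketch ids = new SeenIdsSketch("spider-hgg-googlemap");
        System.out.println(ids.claim("pano-1")); // first claim of this ID
        System.out.println(ids.claim("pano-1")); // duplicate claim of the same ID
    }
}
```

Unlike redis, this set lives in the JVM heap and is lost on restart, so it only demonstrates the semantics, not the persistence.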

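2.2.2 Deduplicating with exists() of the File class

The author initially expected new File("path").exists() to slow down as more and more images accumulated locally. In practice, with more than 60,000 local images the call still took only milliseconds or microseconds, so it deduplicates efficiently too. This is probably because the file system also indexes directory entries with a B-tree or hash index (the author's own guess, not investigated further).
The deduplication code is then very simple. Pass in the ID and build the image path: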

private boolean isPicExists(String panoId){
    String path = "E:\\temp\\hgg-googlemap\\safety\\"+panoId+".jpg";
    File file = new File(path);
    return file.exists();
}
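
A variant of the same check that takes the image directory as a parameter instead of hard-coding a Windows path might look like the sketch below. It uses java.nio.file, and the class name LocalDedupSketch and the temporary directory in main are assumptions for illustration only.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class LocalDedupSketch {
    /** Resolves <imageDir>/<panoId>.jpg and reports whether that file is already on disk. */
    static boolean isPicExists(Path imageDir, String panoId) {
        return Files.exists(imageDir.resolve(panoId + ".jpg"));
    }

    public static void main(String[] args) {
        try {
            // A temporary directory stands in for the real download folder
            Path dir = Files.createTempDirectory("pano-demo");
            Files.createFile(dir.resolve("abc123.jpg"));
            System.out.println(isPicExists(dir, "abc123"));  // file was just created
            System.out.println(isPicExists(dir, "missing")); // no such file
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```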

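2.3 Building the url

One thing to note is that each key is limited to 20,000 requests per day (in the author's tests the actual limit was below 20,000 per day and unstable). Once the limit is exceeded, access is restricted, so obtain as many keys as possible. When building a url, pick a key at random from the set of valid keys (and remove invalid keys promptly, to keep downloads fast and reliable) so that no single key is hit too often. Also add logic that checks whether the download succeeded: store the image on success, and on failure rebuild the url and download again until it succeeds.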
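
One possible way to organize the key handling described above is sketched here: a small pool that hands out a random valid key and lets callers drop keys that stop working. The class ApiKeyPoolSketch, its methods, and the bounded maxAttempts parameter are all assumptions for illustration. The post's own code retries without a limit; a cap is used here only so that a permanently missing image cannot loop forever.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Predicate;

final class ApiKeyPoolSketch {
    private final List<String> keys;

    ApiKeyPoolSketch(List<String> initialKeys) {
        this.keys = new CopyOnWriteArrayList<>(initialKeys);
    }

    /** Picks a random key from those still considered valid; returns null when none are left. */
    String pick() {
        Object[] snapshot = keys.toArray(); // snapshot avoids a race between size() and get()
        if (snapshot.length == 0) {
            return null;
        }
        return (String) snapshot[ThreadLocalRandom.current().nextInt(snapshot.length)];
    }

    /** Removes a key that has been found to be invalid or over its quota. */
    void invalidate(String key) {
        keys.remove(key);
    }

    /** Tries up to maxAttempts times, choosing a fresh random key for every attempt. */
    static boolean downloadWithRetry(ApiKeyPoolSketch pool, Predicate<String> attempt, int maxAttempts) {
        for (int i = 0; i < maxAttempts; i++) {
            String key = pool.pick();
            if (key == null) {
                return false; // no usable keys left
            }
            if (attempt.test(key)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        ApiKeyPoolSketch pool = new ApiKeyPoolSketch(Arrays.asList("KEY_A", "KEY_B"));
        pool.invalidate("KEY_A");
        // The lambda stands in for "build the url with this key and download"
        System.out.println(downloadWithRetry(pool, key -> key.equals("KEY_B"), 3));
    }
}
```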

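2.4 Using the thread pool

This part involves using a thread pool and choosing a sensible thread count. Readers unfamiliar with the topic can refer to: https://www.cnblogs.com/dolphin0520/p/3932921.html

The pool size is generally chosen according to the type of task:

For CPU-bound tasks, the goal is to keep the CPU fully busy; a reference value is the number of CPU cores + 1.
For IO-bound tasks, a reference value is 2 * the number of CPU cores.
These are only reference values. The actual setting should be tuned to the situation: for example, start with the reference value, then observe how the tasks run along with system load and resource utilization, and adjust accordingly.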
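
The two reference values can be computed from the core count reported by the JVM. The sketch below is a minimal example of that, with hypothetical class and method names; it also shows shutting the pool down and waiting for the submitted tasks, with an assumed 30-second limit.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class PoolSizingSketch {
    /** Reference size for CPU-bound work: cores + 1. */
    static int cpuBoundSize(int cores) {
        return cores + 1;
    }

    /** Reference size for IO-bound work: 2 * cores. */
    static int ioBoundSize(int cores) {
        return 2 * cores;
    }

    /** Runs taskCount trivial tasks on an IO-sized pool and returns how many completed. */
    static int runAll(int taskCount) {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(ioBoundSize(cores));
        AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < taskCount; i++) {
            pool.execute(done::incrementAndGet);
        }
        pool.shutdown(); // stop accepting new tasks, let queued ones finish
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                pool.shutdownNow();
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
        }
        return done.get();
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("cores=" + cores + ", io-bound reference size=" + ioBoundSize(cores));
        System.out.println("completed=" + runAll(50));
    }
}
```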

3. Implementation

3.1 Adding dependencies

dependencies {
    compile 'com.squareup.okhttp3:okhttp:3.11.0'
    compile 'com.demo.ddc:ddc-core:0.1.11-alpha6'
    compile 'redis.clients:jedis:2.9.0'
    compile 'org.apache.logging.log4j:log4j-core:2.8.2'
    compile 'org.apache.commons:commons-pool2:2.4.2'
    compile 'com.univocity:univocity-parsers:2.8.2'
}

3.2 Core workflow code

Rename votes.csv to googlemapvotes.csv and place it in the resources directory.
First define a java bean class for one row of csv data:

public class CsvPanoBean {

    private String panoId;

    private double lati;

    private double lonti;

    public CsvPanoBean(String panoId,double lati, double lonti){
        this.panoId = panoId;
        this.lati = lati;
        this.lonti = lonti;
    }

    public String getPanoId() {
        return panoId;
    }

    public void setPanoId(String panoId) {
        this.panoId = panoId;
    }

    public double getLati() {
        return lati;
    }

    public void setLati(double lati) {
        this.lati = lati;
    }

    public double getLonti() {
        return lonti;
    }

    public void setLonti(double lonti) {
        this.lonti = lonti;
    }
}

Then write the core code; see the comments for details:

protected boolean process() {
    String filePath = "/googlemapvotes.csv";
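    // Create the settings object for the csv parser
    CsvParserSettings settings = new CsvParserSettings();
    // The file uses '\n' as its line separator; setting it explicitly
    // ensures the file is also handled correctly on systems such as
    // MacOS and Windows (classic Mac OS used '\r'; Windows uses '\r\n')
    settings.getFormat().setLineSeparator("\n");
    // Treat the first line of the file as column headers and skip it
    settings.setHeaderExtractionEnabled(true);
    // Create the CSV parser, passing in the settings object
    CsvParser parser = new CsvParser(settings);
    // Call beginParsing to read records one at a time, iterator-style
    parser.beginParsing(getReader(filePath));
    String[] row;
    // Image download utility
    PicLoadUtils picLoadUtils = new PicLoadUtils();
    // Create the thread pool. The local machine has an 8-core CPU, so use 10 core threads
    // and a maximum of 16, with a custom thread factory and rejection policy
    ThreadPoolExecutor executor = new ThreadPoolExecutor(10, 16, 100, TimeUnit.MILLISECONDS,
            new LinkedBlockingQueue<>(1024), new MyTreadFactory(),  new MyIgnorePolicy());
    // Pre-start all core threads
    executor.prestartAllCoreThreads();
    // Parse the csv file and iterate over every record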
    while ((row = parser.parseNext()) != null) {
        String category = row[7];
            // Per the requirements, download the safety-category street-view images first
        if ("safety".equals(category)){
            String leftPanoId = row[0];
            String rightPanoId = row[1];
            double leftLati = Double.parseDouble(row[3]);
            double leftLonti = Double.parseDouble(row[4]);
            double rightLati = Double.parseDouble(row[5]);
            double rightLonti = Double.parseDouble(row[6]);
            CsvPanoBean leftPanoBean = new CsvPanoBean(leftPanoId,leftLati,leftLonti);
            CsvPanoBean rightPanoBean = new CsvPanoBean(rightPanoId,rightLati,rightLonti);
            CsvPanoBean[] csvPanoBeans = {leftPanoBean,rightPanoBean};
            for (CsvPanoBean element:csvPanoBeans){
                // Check whether this street-view ID is already in redis or on local disk
                String panoId = element.getPanoId();
                //boolean isExists = isPicExists(panoId);
                boolean isExists = redisUtils.isPanoIDExists(panoId);
                if (!isExists){
                    redisUtils.panoIdPush(panoId);
                    DownloadPicTask task = new DownloadPicTask(picLoadUtils,element);
                    executor.execute(task);
                }else{
                    logger.info(panoId + " is exist");
                }
            }
            try {
                // The main thread needs to sleep briefly here; otherwise the concurrent downloads tend to hit read timeouts
                Thread.sleep(400L);
                logger.info("The queue size of Thread Pool is "+ executor.getQueue().size());
            }catch (InterruptedException e){
                e.printStackTrace();
            }
        }
    }
    logger.info("--------------------------crawl finished!--------------------------");
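    // All resources are closed automatically when reading finishes or when an error occurs;
    // stopParsing() can be called at any time. It is only needed when the input is not read to the end, and omitting it does no serious harm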
    parser.stopParsing();
    isComplete = true;
    return true;
}

// Specify the character encoding when reading the file
private Reader getReader(String relativePath) {
    try {
        return new InputStreamReader(this.getClass().getResourceAsStream(relativePath), "UTF-8");
    } catch (UnsupportedEncodingException e) {
        throw new IllegalStateException("Unable to read input", e);
    }
}

// Check whether the image already exists locally
private boolean isPicExists(String panoId){
    String path = "E:\\temp\\hgg-googlemap\\safety\\"+panoId+".jpg";
    File file = new File(path);
    return file.exists();
}

3.3 Image download utility class

Purpose: mainly sets the download path and checks the response when downloading an image.

/**
 * @author Huigen Zhang
 * @since 2018-10-19 18:53
 **/
public class PicLoadUtils {
    private final static String WINDOWS_DISK_SYMBOL = ":";
    private final static String WINDOWS_PATH_SYMBOL = "\\";
    private final static int STATUS_CODE = 200;
    private String localLocation;

    {
        // Local directory to download into
        localLocation = this.getFileLocation("googlepano");
    }

    private String getFileLocation(String storeDirName){
        String separator = "/";
        ConfigParser parser = ConfigParser.getInstance();
        String spiderId = "spider-googlemap";
        SpiderConfig spiderConfig = new SpiderConfig(spiderId);
        Map<String,Object> storageConfig = (Map<String, Object>) parser.assertKey(spiderConfig.getSpiderConfig(),"storage", spiderConfig.getConfigPath());
        String fileLocation = (String) parser.getValue(storageConfig,"piclocation",null,spiderConfig.getConfigPath()+".storage");
        String pathSeparator = getSeparator();
        String location;
        if(fileLocation!=null){
            // First detect the operating system, then check whether the path is absolute
            if (separator.equals(pathSeparator)){
                //linux
                if(fileLocation.startsWith(separator)){
                    location = fileLocation + pathSeparator + "data";
                }else {
                    location = System.getProperty("user.dir") + pathSeparator + fileLocation;
                }
                location = location.replace("//", pathSeparator);
                return location;
            }else {
                //windows
                if (fileLocation.contains(WINDOWS_DISK_SYMBOL)){
                    // absolute path
                    location = fileLocation + pathSeparator + "data";
                }else {
                    // relative path
                    location = System.getProperty("user.dir") + pathSeparator + fileLocation;
                }
                location = location.replace("\\\\",pathSeparator);
            }
        }else{
            // default location
            location = System.getProperty("user.dir") + pathSeparator + storeDirName;
        }
        return location;
    }

    private String getSeparator(){
        String pathSeparator = File.separator;
        if(!WINDOWS_PATH_SYMBOL.equals(File.separator)){
            pathSeparator = "/";
        }
        return pathSeparator;
    }

    private void mkDir(File file){
        String directory = file.getParent();
        File myDirectory = new File(directory);
        if (!myDirectory.exists()) {
            myDirectory.mkdirs();
        }
    }

    public boolean downloadPic(String url, String panoId){
        okhttp3.Request request = new okhttp3.Request.Builder()
                .url(url)
                .build();
        Response response = null;
        InputStream inputStream = null;
        FileOutputStream out = null;
        String relativePath;
        try {
            response = OkHttpUtils.getInstance().newCall(request).execute();
            if (response.code()!=STATUS_CODE){
                return false;
            }
            // Read the response body as an input stream
            inputStream = response.body().byteStream();
            byte[] buffer = new byte[2048];
            relativePath = panoId + ".jpg";
            File myPath = new File(localLocation + File.separator + relativePath);
            this.mkDir(myPath);
            out = new FileOutputStream(myPath);
            int len;
            while ((len = inputStream.read(buffer)) != -1){
                out.write(buffer,0,len);
            }
            // Flush the file stream
            out.flush();
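        } catch (IOException e) {
            e.printStackTrace();
            // Report failure so the caller retries instead of treating a failed download as a success
            return false;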
        }finally {
            if (inputStream!=null){
                try {
                    inputStream.close();
                }catch (IOException e){
                    e.printStackTrace();
                }
            }
            if (null!=out){
                try {
                    out.close();
                }catch (IOException e){
                    e.printStackTrace();
                }
            }
            if (null!=response){
                response.body().close();
            }
        }
        return true;
    }
}

3.4 redis utility class

This mainly wraps the redis commands shown above in one more layer:

/**
 * @author zhanghuigen
 * @since 0.1.0
 **/
public class RedisUtils {
    private JedisPool pool;
    private String spiderUUID;
    private static Logger logger = Logger.getLogger(RedisUtils.class);

    public RedisUtils(String host, int port, String password, String spiderUUID) {
        this(new JedisPool(new JedisPoolConfig(), host, port, 2000, password));
        this.spiderUUID = spiderUUID;
    }

    public RedisUtils(JedisPool pool) {
        this.pool = pool;
    }

    public synchronized Boolean isPanoIDExists(String panoId) {
        Jedis jedis = null;
        Boolean exists;
        try {
            jedis = this.pool.getResource();
            exists = jedis.exists(this.spiderUUID + ":" + panoId);
            return exists;
        }finally {
            if (jedis!=null){
                jedis.close();
            }
        }
    }

    public synchronized boolean removeKeys(){
        Jedis jedis = this.pool.getResource();
        try {
            Set<String> keys = jedis.keys(this.spiderUUID + ":*" );
            if(keys != null && !keys.isEmpty()) {
                logger.info("redis has stored " + keys.size() + " keys, now ready to remove them all!");
                String[] array = new String[keys.size()];
                jedis.del(keys.toArray(array));
            }
            return true;
        }catch (Exception e){
            e.printStackTrace();
        }finally {
            if (jedis!=null){
                jedis.close();
            }
        }
        return true;
    }

    public synchronized boolean panoIdPush(String panoId) {
        Jedis jedis = this.pool.getResource();
        try {
            long num = jedis.setnx(this.spiderUUID + ":" + panoId, String.valueOf(1));
            return num==1;
        } finally {
            if (jedis!=null){
                jedis.close();
            }
        }
    }
}

3.5 Thread pool task class and rejection policy

The download task could also be defined with the Callable + Future pattern; see https://www.cnblogs.com/hapjin/p/7599189.html or https://www.cnblogs.com/myxcf/p/9959870.html
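
For readers who want to see the Callable + Future alternative, a minimal sketch follows. It is not the approach used in this post. The class name, the pool size of 4, and the Predicate standing in for the real download call are all assumptions. Each task returns a Boolean, and the submitting thread collects the IDs that failed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

final class CallableDownloadSketch {
    /** Submits one Callable per ID and returns the IDs whose task failed or threw. */
    static List<String> failedIds(List<String> panoIds, Predicate<String> downloader) {
        ExecutorService pool = Executors.newFixedThreadPool(4); // assumed pool size
        try {
            List<Future<Boolean>> futures = new ArrayList<>();
            for (String id : panoIds) {
                Callable<Boolean> task = () -> downloader.test(id);
                futures.add(pool.submit(task));
            }
            List<String> failed = new ArrayList<>();
            for (int i = 0; i < futures.size(); i++) {
                try {
                    // get() blocks until the task for this ID has finished
                    if (!futures.get(i).get()) {
                        failed.add(panoIds.get(i));
                    }
                } catch (ExecutionException e) {
                    failed.add(panoIds.get(i)); // the task itself threw an exception
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    failed.add(panoIds.get(i));
                }
            }
            return failed;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // The lambda stands in for the real download call and fails for one ID
        System.out.println(failedIds(Arrays.asList("a", "bad", "c"), id -> !id.equals("bad")));
    }
}
```

The Runnable-based task actually used in this project follows.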

class DownloadPicTask implements Runnable {
    private CsvPanoBean taskBean;
    private PicLoadUtils picLoadUtils;
    private String panoId;

    private DownloadPicTask(PicLoadUtils picLoadUtils,CsvPanoBean bean) {
        this.picLoadUtils = picLoadUtils;
        this.taskBean = bean;
        this.panoId = taskBean.getPanoId();
    }

    @Override
    public void run() {
        logger.info("Running task " + panoId);
        String url;
        String key;
        boolean successDownload;
        do {
            // Build the street-view image url
            String[] urlWithKey = getUrlWithKey();
            url = urlWithKey[0];
            key = urlWithKey[1];
            // Send the request and download the image, retrying until this image succeeds
            successDownload = picLoadUtils.downloadPic(url,panoId);
        }while (!successDownload);
        logger.info(panoId + " downloaded succeed with " + key);
    }

    @Override
    public String toString(){
        return panoId;
    }

    private String[] getUrlWithKey(){
        String requestPrefix = "https://maps.googleapis.com/maps/api/streetview?size=400x300&location=";
        String url = requestPrefix + taskBean.getLati() + "," + taskBean.getLonti() + "&key=";
        Random random = new Random();
        // Assumes the usable keys are set in the config file and have been loaded into a List named googleKeys
        int index = random.nextInt(googleKeys.size());
        String key = googleKeys.get(index);
        return new String[]{url+key,key};
    }
}


class MyTreadFactory implements ThreadFactory {
    private final AtomicInteger mThreadNum = new AtomicInteger(1);
    @Override
    public Thread newThread(Runnable r) {
        Thread t = new Thread(r, "my-thread-" + mThreadNum.getAndIncrement());
        logger.info(t.getName() + " has been created");
        return t;
    }
}

class MyIgnorePolicy implements RejectedExecutionHandler {

    @Override
    public void rejectedExecution(Runnable r, ThreadPoolExecutor e) {
        doLog(r, e);
    }

    private void doLog(Runnable r, ThreadPoolExecutor e) {
        // Log the street-view ID of the rejected task
        logger.warn( r.toString() + " rejected");
    }
}
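
With the log-only policy above, a rejected task is never downloaded, even though its ID has already been marked in redis by the time it is submitted. One alternative worth considering is the JDK's built-in ThreadPoolExecutor.CallerRunsPolicy: when the queue is full, the submitting thread runs the task itself, which slows submission down and drops nothing. The sketch below uses deliberately small, assumed pool and queue sizes to trigger that behaviour. It is not what this project does, and it does not by itself address the bandwidth-related read timeouts described later.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class BackpressureSketch {
    /** Submits taskCount short tasks to a small pool and returns how many actually ran. */
    static int runWithBackpressure(int taskCount) {
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                2, 4, 100, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>(8),                // small queue so it fills up quickly
                new ThreadPoolExecutor.CallerRunsPolicy());  // caller runs the task when saturated
        AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < taskCount; i++) {
            executor.execute(() -> {
                try {
                    Thread.sleep(5); // stands in for a download
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                completed.incrementAndGet();
            });
        }
        executor.shutdown();
        try {
            if (!executor.awaitTermination(30, TimeUnit.SECONDS)) {
                executor.shutdownNow();
            }
        } catch (InterruptedException e) {
            executor.shutdownNow();
            Thread.currentThread().interrupt();
        }
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println("completed=" + runWithBackpressure(100));
    }
}
```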

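4. Closing Notes

Single-threaded vs. multi-threaded download efficiency

With a single thread, downloading takes about 1 second per image, which is relatively inefficient.

After switching to a thread pool, the thread count was initially set fairly high and the main thread did not sleep, so read timeouts occurred easily. The cause was that Google was being accessed through a company proxy, and multi-threaded downloading ran into the bandwidth limit; threads waited too long for data and then threw exceptions.

After adding a sleep to the main thread, the read timeouts disappeared and downloads proceeded smoothly.

With sufficient bandwidth, the download speed is about 5 images per second.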

[Figure: local results during a normal run]
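Image quality check
In fact, some images in this training set cannot be downloaded because Google no longer has imagery for them.

A workaround is to detect these during the download. Such images are usually small in size, so the image download utility class can check the returned response to decide whether to save it, and record where the anomaly occurred.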
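
The size check could be sketched roughly as follows. The 10 KB threshold is purely a hypothetical value chosen for illustration and would need tuning against a few known-bad downloads; the class and method names are also assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ImageSizeCheckSketch {
    /** Hypothetical threshold; tune it by inspecting a few known-bad downloads. */
    static final int MIN_VALID_BYTES = 10 * 1024;

    /** Treats a missing or very small body as a probable "no imagery" response. */
    static boolean looksValid(byte[] body) {
        return body != null && body.length >= MIN_VALID_BYTES;
    }

    /** Returns the IDs whose downloaded body is too small, so they can be logged for review. */
    static List<String> suspiciousIds(Map<String, byte[]> downloads) {
        List<String> suspicious = new ArrayList<>();
        for (Map.Entry<String, byte[]> entry : downloads.entrySet()) {
            if (!looksValid(entry.getValue())) {
                suspicious.add(entry.getKey());
            }
        }
        return suspicious;
    }

    public static void main(String[] args) {
        Map<String, byte[]> downloads = new LinkedHashMap<>();
        downloads.put("pano-ok", new byte[20 * 1024]); // size of a plausible image
        downloads.put("pano-missing", new byte[512]);  // suspiciously small body
        System.out.println(suspiciousIds(downloads));
    }
}
```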