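1. Text is usually crawled as a string in one of the following three ways
(1) Crawling text with HttpURLConnection
① Open the URL to obtain an HttpURLConnection object, httpURLConnection.
② Check the response code to determine whether the request succeeded.
③ Wrap the byte stream returned by httpURLConnection.getInputStream() in a character stream, an InputStreamReader object named is.
④ Read the text through the read() method of is.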
/**
 * HttpURLConnection
 */
new Thread(new Runnable() {
    @Override
    public void run() {
        try {
            // Creating the URL inside the same try block avoids using a null url
            URL url = new URL("http://lol.qq.com/web201310/info-heros.shtml");
            HttpURLConnection httpURLConnection = (HttpURLConnection) url.openConnection();
            if (httpURLConnection.getResponseCode() == HttpURLConnection.HTTP_OK) {
                // Decode the byte stream as UTF-8 (assumed here; use the page's actual charset)
                InputStreamReader is = new InputStreamReader(httpURLConnection.getInputStream(), "utf-8");
                int i;
                StringBuilder sb = new StringBuilder();
                while ((i = is.read()) != -1) {
                    sb.append((char) i);
                }
                is.close();
                // Log.d("TAG", sb.toString());
                // Hand the text to the UI thread through the Handler
                Message msg = new Message();
                Bundle bundle = new Bundle();
                bundle.putString("stringUrl", sb.toString());
                msg.setData(bundle);
                msg.what = 0x123;
                myHandler.sendMessage(msg);
            } else {
                Log.d("TAG", "httpUrlConnection : " + httpURLConnection.getResponseCode());
            }
        } catch (IOException e) {
            // MalformedURLException is a subclass of IOException, so it is caught here too
            e.printStackTrace();
        }
    }
}).start();
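Steps ③ and ④ can also be pulled out into a small helper that does not depend on Android. The following is only a sketch: the class name StreamText and the method readAll are hypothetical names introduced for illustration, and the charset has to be supplied by the caller because it depends on the page being fetched. It reads in chunks rather than one character at a time.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

// Hypothetical helper, not part of the original post.
final class StreamText {
    private StreamText() {}

    /** Reads the whole stream into a String, decoding it with the given charset. */
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[4096];
        // try-with-resources closes the reader (and the underlying stream) on exit
        try (Reader reader = new InputStreamReader(in, charset)) {
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        }
        return sb.toString();
    }
}
```

Inside run(), a call such as StreamText.readAll(httpURLConnection.getInputStream(), StandardCharsets.UTF_8) would then take the place of the read loop, assuming the page is served as UTF-8.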
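(2) Crawling data with HttpClient
① Create an HttpClient object, client.
② Create an HttpGet request object, get, from the URL.
③ Obtain a ResponseHandler (response handler) object that returns a String.
④ Call client.execute(get, responseHandler) to get the text as a string.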
/**
 * HttpClient
 */
new Thread(new Runnable() {
    @Override
    public void run() {
        try {
            HttpClient client = new DefaultHttpClient();
            HttpGet get = new HttpGet("http://lol.qq.com/web201310/info-heros.shtml");
            // BasicResponseHandler returns the response body as a String
            ResponseHandler<String> responseHandler = new BasicResponseHandler();
            String content = client.execute(get, responseHandler);
            if (content == null || content.equals("")) {
                // A Toast cannot be shown from a background thread, so log instead
                Log.d("TAG", "empty response");
            }
            // Hand the text to the UI thread through the Handler
            Message msg = new Message();
            Bundle bundle = new Bundle();
            bundle.putString("stringUrl", content);
            msg.setData(bundle);
            msg.what = 0x123;
            myHandler.sendMessage(msg);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}).start();
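(3) Crawling data with jsoup. The data is loaded asynchronously here. Besides fetching raw text, jsoup is often used to extract specific pieces of data. In the example below, the strings to extract are: All Heroes, Fighter, Mage, Assassin, Tank, Marksman, Support.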
public class LoadHtml extends AsyncTask<String, String, String> {
    Document doc;       // the parsed page
    String url;
    CallBack callBack;  // callback interface
    private List<String> mListTitle = new ArrayList<>();

    public LoadHtml(CallBack callBack, String url) {
        this.url = url;
        this.callBack = callBack;
    }

    @Override
    protected String doInBackground(String... params) {
        try {
            // Fetch the page with GET; doc.toString() is the HTML text of the url
            doc = Jsoup.connect(url).timeout(5000).get();
            // The titles sit in the <ul> whose id is seleteChecklist, so filter by that id
            // and take its <li> children directly instead of re-parsing the HTML
            Elements elements = doc.select("#seleteChecklist li");
            if (elements.isEmpty()) {
                Log.d("TAG", "elements is empty");
            }
            for (Element link : elements) {
                String title = link.getElementsByTag("label").text();
                mListTitle.add(title); // builds the list of strings (All Heroes, Fighter, ...)
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }

    @Override
    protected void onPostExecute(String s) {
        super.onPostExecute(s);
        Log.d("TAG", "onPostExecute");
        Log.d("TAG", "listSize : " + mListTitle.size());
        for (int i = 0; i < mListTitle.size(); i++) {
            String title = mListTitle.get(i);
            Log.d("TAG", "title : " + title);
        }
        // Deliver the list to the caller through the callback
        callBack.solve(mListTitle);
    }
}
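solve() is a method of the custom CallBack interface. A class that needs the data (class A) only has to implement this interface and override the method. The callBack argument passed to the constructor of LoadHtml (class B) is A itself, which implements CallBack. Once the data has been fetched, LoadHtml calls solve() on the callback, which delivers the data to the class that needs it.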
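The callback relationship between the two classes can be sketched without Android or jsoup. The post does not show the CallBack interface itself, so the signature solve(List<String>) below is an assumption inferred from the call callBack.solve(mListTitle). TitleLoader and TitleConsumer are hypothetical stand-ins for class B and class A, and the hard-coded list replaces the real parsing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Assumed shape of the custom interface; the original post does not show it.
interface CallBack {
    void solve(List<String> titles);
}

// Stand-in for LoadHtml (class B): produces data and hands it to the callback.
class TitleLoader {
    private final CallBack callBack;

    TitleLoader(CallBack callBack) {
        this.callBack = callBack;
    }

    void load() {
        // Hypothetical fixed result in place of the jsoup parsing step
        List<String> titles = Arrays.asList("All Heroes", "Fighter", "Mage");
        callBack.solve(titles);
    }
}

// Stand-in for the class that needs the data (class A).
class TitleConsumer implements CallBack {
    final List<String> received = new ArrayList<>();

    @Override
    public void solve(List<String> titles) {
        // The data arrives here once the loader has finished
        received.addAll(titles);
    }
}
```

Class A creates the loader with itself as the callback, for example new TitleLoader(consumer).load(), in the same way that the Activity would pass this to new LoadHtml(this, url).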