Learning Java Web Crawling: Building a Simple Java Crawler with HttpClient and Jsoup

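Introduction to HttpClient

HttpClient started as a subproject of Apache Jakarta Commons and is now part of the Apache HttpComponents project. It is an efficient, feature-rich client-side toolkit for the HTTP protocol; the 4.5.x line used in this article speaks HTTP/1.1. Its main features are:

  • (1) Implements all HTTP methods (GET, POST, PUT, HEAD, etc.)
  • (2) Supports automatic redirects
  • (3) Supports HTTPS
  • (4) Supports proxy servers, among other features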
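
As a rough illustration of point (1), HttpClient 4.x has one request class per HTTP method. The sketch below only constructs request objects and collects their method names; nothing is sent over the network. The class name HttpMethodsSketch and the URL http://example.com/ are placeholders, not taken from the article.

```java
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpHead;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.client.methods.HttpPut;
import org.apache.http.client.methods.HttpUriRequest;

import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: shows the request classes behind GET, POST, PUT and HEAD
public class HttpMethodsSketch {
    // Builds one request object per HTTP method and returns the method names
    public static List<String> methodNames(String url) {
        HttpUriRequest[] requests = {
                new HttpGet(url), new HttpPost(url), new HttpPut(url), new HttpHead(url)
        };
        List<String> names = new ArrayList<>();
        for (HttpUriRequest request : requests) {
            names.add(request.getMethod());
        }
        return names;
    }

    public static void main(String[] args) {
        // Placeholder URL; no request is actually executed
        System.out.println(methodNames("http://example.com/"));
    }
}
```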

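Introduction to Jsoup

jsoup is a Java HTML parser that can parse a URL or a string of HTML directly. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods. Its main features are:

  • (1) Parses HTML from a URL, a file, or a string;
  • (2) Finds and extracts data using DOM traversal or CSS selectors;
  • (3) Manipulates HTML elements, attributes, and text;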
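
The three points above can be sketched on an in-memory HTML string, so no network access is needed. The class name JsoupBasicsSketch and the sample markup are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class: parse, select, then modify
public class JsoupBasicsSketch {
    // Parses the HTML, finds the first link with a CSS selector, and rewrites it.
    // Returns "<href> <text>" of the rewritten link, or null if there is no link.
    public static String rewriteFirstLink(String html, String newHref, String newText) {
        Document doc = Jsoup.parse(html);             // (1) parse HTML from a string
        Element link = doc.select("a[href]").first(); // (2) find data with a CSS selector
        if (link == null) {
            return null;
        }
        link.attr("href", newHref);                   // (3) change an attribute
        link.text(newText);                           // (3) change the text
        return link.attr("href") + " " + link.text();
    }

    public static void main(String[] args) {
        System.out.println(rewriteFirstLink("<p><a href=\"/old\">old</a></p>", "/new", "updated"));
    }
}
```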

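Usage Steps

Adding Dependencies to the Maven Project

The dependencies in pom.xml are as follows (the test class further down also uses JUnit 4, which must be on the test classpath):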

<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.2</version>
</dependency>

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.8.3</version>
</dependency>

Writing the JUnit Test Code

Code

import org.apache.http.HttpEntity;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.junit.Test;

import java.util.List;

/**
 * HttpClient & Jsoup library test class
 *
 * Created by xuyh at 2017/11/6 15:28.
 */
public class HttpClientJsoupTest {
    @Test
    public void test() {
        // Fetch the page with HttpClient and read the response body as plain text
        HttpGet httpGet = new HttpGet("http://sports.sina.com.cn/");
        httpGet.setConfig(RequestConfig.custom().setSocketTimeout(30000).setConnectTimeout(30000).build());

        String responseStr = "";
        HttpClientContext context = HttpClientContext.create();
        // try-with-resources closes the response first, then the client
        try (CloseableHttpClient httpClient = HttpClientBuilder.create().build();
             CloseableHttpResponse response = httpClient.execute(httpGet, context)) {
            int state = response.getStatusLine().getStatusCode();
            HttpEntity entity = response.getEntity();
            // Keep the body only for 200 OK; otherwise responseStr stays empty
            if (state == 200 && entity != null)
                responseStr = EntityUtils.toString(entity, "utf-8");
        } catch (Exception e) {
            e.printStackTrace();
        }

        if (responseStr.isEmpty())
            return;

        // Parse the text into a Jsoup Document and work with its elements
        Document document = Jsoup.parse(responseStr);
        Element container = document.getElementsByAttributeValue("class", "phdnews_txt fr").first();
        if (container == null)
            return; // the expected block is not present in the page
        List<Element> elements = container.getElementsByAttributeValue("class", "phdnews_hdline");
        elements.forEach(element -> {
            for (Element e : element.getElementsByTag("a")) {
                System.out.println(e.attr("href"));
                System.out.println(e.text());
            }
        });
    }
}

Detailed Explanation

Create an HttpGet object that will send a GET request to http://sports.sina.com.cn/, and set both the socket timeout and the connect timeout to 30000 ms.
HttpGet httpGet = new HttpGet("http://sports.sina.com.cn/");
httpGet.setConfig(RequestConfig.custom().setSocketTimeout(30000).setConnectTimeout(30000).build());
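
For readers who want to see what else RequestConfig can carry, here is a small sketch that also sets the connection-request timeout and the redirect behaviour. The class name RequestConfigSketch, the helper buildGet, and the redirect limit of 5 are illustrative choices, not part of the original test.

```java
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.HttpGet;

// Hypothetical helper class: builds a GET request with explicit timeouts
public class RequestConfigSketch {
    public static HttpGet buildGet(String url, int timeoutMillis) {
        RequestConfig config = RequestConfig.custom()
                .setSocketTimeout(timeoutMillis)            // max wait for data between packets
                .setConnectTimeout(timeoutMillis)           // max wait to establish the connection
                .setConnectionRequestTimeout(timeoutMillis) // max wait for a connection from the manager
                .setRedirectsEnabled(true)                  // follow redirects automatically
                .setMaxRedirects(5)                         // assumed limit, chosen for the example
                .build();
        HttpGet httpGet = new HttpGet(url);
        httpGet.setConfig(config);
        return httpGet;
    }

    public static void main(String[] args) {
        HttpGet httpGet = buildGet("http://sports.sina.com.cn/", 30000);
        System.out.println(httpGet.getConfig());
    }
}
```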
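Use HttpClientBuilder to create a CloseableHttpClient and execute the request described by the HttpGet above. The call to execute() returns the response; the HttpClientContext passed alongside it only carries per-request execution state such as cookies. The response body is then read as text from the response's HttpEntity, and the response and the client are closed afterwards.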
String responseStr = "";
HttpClientContext context = HttpClientContext.create();
// try-with-resources closes the response first, then the client
try (CloseableHttpClient httpClient = HttpClientBuilder.create().build();
     CloseableHttpResponse response = httpClient.execute(httpGet, context)) {
    int state = response.getStatusLine().getStatusCode();
    HttpEntity entity = response.getEntity();
    // Keep the body only for 200 OK; otherwise responseStr stays empty
    if (state == 200 && entity != null)
        responseStr = EntityUtils.toString(entity, "utf-8");
} catch (Exception e) {
    e.printStackTrace();
}
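
The second argument of EntityUtils.toString is only a fallback: it is used when the response does not declare a charset in its Content-Type, and a declared charset takes precedence. The sketch below shows this on locally built entities, without any network call. The class name EntityCharsetSketch and both helper methods are made up for the example.

```java
import org.apache.http.HttpEntity;
import org.apache.http.entity.ByteArrayEntity;
import org.apache.http.entity.ContentType;
import org.apache.http.util.EntityUtils;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: how the fallback charset of EntityUtils.toString is applied
public class EntityCharsetSketch {
    // Entity without a declared charset: the fallback decides how the bytes are decoded
    public static String decodeUndeclared(byte[] body, String fallbackCharset) {
        return decode(new ByteArrayEntity(body), fallbackCharset);
    }

    // Entity with a declared charset: the declaration wins over the fallback
    public static String decodeDeclared(byte[] body, ContentType declared, String fallbackCharset) {
        return decode(new ByteArrayEntity(body, declared), fallbackCharset);
    }

    private static String decode(HttpEntity entity, String fallbackCharset) {
        try {
            return EntityUtils.toString(entity, fallbackCharset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "caf\u00e9";
        System.out.println(decodeUndeclared(text.getBytes(StandardCharsets.UTF_8), "utf-8"));
        ContentType latin1 = ContentType.create("text/html", StandardCharsets.ISO_8859_1);
        System.out.println(decodeDeclared(text.getBytes(StandardCharsets.ISO_8859_1), latin1, "utf-8"));
    }
}
```

If a page comes back garbled with "utf-8" as the fallback, a likely cause is that the server declares no charset and the page uses a different encoding.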
Parse the response text with the Jsoup library and extract the elements of interest:
Document document = Jsoup.parse(responseStr);

// first() returns null when nothing matches, so check before going further
Element container = document.getElementsByAttributeValue("class", "phdnews_txt fr").first();
if (container == null)
    return;
List<Element> elements = container.getElementsByAttributeValue("class", "phdnews_hdline");

elements.forEach(element -> {
    for (Element e : element.getElementsByTag("a")) {
        System.out.println(e.attr("href"));
        System.out.println(e.text());
    }
});
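
One detail worth knowing: getElementsByAttributeValue("class", "phdnews_txt fr") compares the whole attribute string, so it stops matching if the page lists the same classes in a different order. A CSS selector avoids that. The sketch below is an alternative formulation, not the article's code; unlike the original it collects links from every matching container rather than only the first, and the class name HeadlineLinksSketch and the sample markup are invented.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: CSS selector versus exact attribute-value lookup
public class HeadlineLinksSketch {
    // Collects "<href> <text>" for links inside headline blocks, using a CSS selector
    public static List<String> headlineLinks(String html) {
        Document document = Jsoup.parse(html);
        List<String> links = new ArrayList<>();
        for (Element a : document.select(".phdnews_txt.fr .phdnews_hdline a")) {
            links.add(a.attr("href") + " " + a.text());
        }
        return links;
    }

    // Counts elements whose class attribute is exactly "phdnews_txt fr"
    public static int exactClassMatches(String html) {
        return Jsoup.parse(html).getElementsByAttributeValue("class", "phdnews_txt fr").size();
    }

    public static void main(String[] args) {
        // Same classes as on the real page, but written in the opposite order
        String html = "<div class=\"fr phdnews_txt\"><h3 class=\"phdnews_hdline\">"
                + "<a href=\"/a\">A</a></h3></div>";
        System.out.println(headlineLinks(html));     // the selector still finds the link
        System.out.println(exactClassMatches(html)); // the exact lookup finds nothing
    }
}
```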
Jsoup's Document class extends org.jsoup.nodes.Element, so Document and Element share the following methods, among others:
public Element getElementById(String id); // find an element by its id
public Elements getElementsByClass(String className); // find elements by class name
public Elements getElementsByAttributeValue(String key, String value); // find elements by attribute value
public Elements getElementsByTag(String tagName); // find elements by tag name
public String attr(String attributeKey); // get the value of one of this element's attributes
public String text(); // get the text content of this element
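
A short sketch of these lookups on a hand-written snippet may help. The class name ElementLookupSketch, the describe helper, and the sample markup are assumptions made for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Hypothetical demo class: the lookup methods shared by Document and Element
public class ElementLookupSketch {
    // Returns "<text of #top>|<number of .news elements>|<href of first link>"
    public static String describe(String html) {
        Document doc = Jsoup.parse(html);
        Element root = doc; // compiles because Document extends Element
        Element byId = root.getElementById("top");
        Elements byClass = root.getElementsByClass("news");
        Elements byTag = root.getElementsByTag("a");
        return byId.text() + "|" + byClass.size() + "|" + byTag.first().attr("href");
    }

    public static void main(String[] args) {
        String html = "<h1 id=\"top\">Sports</h1>"
                + "<p class=\"news\"><a href=\"/nba\">NBA</a></p>"
                + "<p class=\"news\">CBA</p>";
        System.out.println(describe(html));
    }
}
```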
An HTML element has the following structure:
<div class="code">  <!-- div is the element's tag --> <!-- class="code" is an attribute and its value -->
    <div>
        <br>
        This is the first paragraph. <!-- the element's content -->
        <br>
    </div>
</div>
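
To connect this structure back to the Jsoup API, the following sketch reads the tag, the attributes, and the text of such an element. The class name ElementAnatomySketch and the anatomy helper are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Element;

// Hypothetical demo class: tag, attributes and content of one element
public class ElementAnatomySketch {
    // Describes the first <div class="code"> as "tag=... <key>=<value>... text=..."
    public static String anatomy(String html) {
        Element div = Jsoup.parse(html).select("div.code").first();
        if (div == null) {
            return "";
        }
        StringBuilder sb = new StringBuilder();
        sb.append("tag=").append(div.tagName());
        for (Attribute attribute : div.attributes()) {
            sb.append(" ").append(attribute.getKey()).append("=").append(attribute.getValue());
        }
        sb.append(" text=").append(div.text()); // text() includes the text of child elements
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<div class=\"code\"><div><br>This is the first paragraph.<br></div></div>";
        System.out.println(anatomy(html));
    }
}
```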

Output

Running the test prints the following:

http://sports.sina.com.cn/sportsevents/3v3/2017-11-05/doc-ifynmzrs7218551.shtml
3X3黄金联赛冠军赛山西队夺冠!独享48万
http://video.sina.com.cn/sports/k/cba/1105final3x3/
视频
http://video.sina.com.cn/p/sports/k/v/doc/2017-11-05/181467390769.html
黄金mvp集锦
http://video.sina.com.cn/p/sports/k/v/doc/2017-11-05/170167390621.html
直捣黄龙1v2
http://video.sina.com.cn/p/sports/k/v/doc/2017-11-05/183267390917.html
5佳球:库里式虚晃
http://video.sina.com.cn/p/sports/k/v/doc/2017-11-05/150067390331.html
大嫂徐冬冬亮相
http://video.sina.com.cn/p/sports/k/v/doc/2017-11-05/145367390313.html
现场众多美女云集
http://video.sina.com.cn/p/sports/c/zj/v/doc/2017-11-05/150867390337.html
啦啦队热舞表演
http://sports.sina.com.cn/nba/
哈登56分周琦暴扣火箭胜
http://sports.sina.com.cn/basketball/nba/2017-11-06/doc-ifynmzrs7300047.shtml
詹皇26分骑士负

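The region of the page that the crawler extracts is shown in the figure below:

[Figure: screenshot of the crawled region of the page]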

Writing a Utility Class

HttpClient and Jsoup can be wrapped into a utility class, as follows:

import org.apache.http.HttpEntity;
import org.apache.http.NameValuePair;
import org.apache.http.client.CookieStore;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.cookie.Cookie;
import org.apache.http.entity.ContentType;
import org.apache.http.entity.StringEntity;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.ssl.SSLContextBuilder;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import javax.net.ssl.*;
import java.io.IOException;
import java.security.GeneralSecurityException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * <pre>
 * HTTP utility, containing:
 * a general HTTP request tool (sends http and https requests with HttpClient)
 * </pre>
 * Created by xuyh at 2017/7/17 19:08.
 */
public class HttpUtils {
    /**
     * Request timeout in milliseconds, 20000 by default
     */
    private int timeout = 20000;
    /**
     * Cookie table
     */
    private Map<String, String> cookieMap = new HashMap<>();

    /**
     * Charset used to decode response bodies, UTF-8 by default
     */
    private String charset = "UTF-8";

    private static HttpUtils httpUtils;

    private HttpUtils() {
    }

    /**
     * Returns the shared instance.
     *
     * @return the singleton HttpUtils
     */
    public static synchronized HttpUtils getInstance() {
        // synchronized so that concurrent first calls cannot create two instances
        if (httpUtils == null)
            httpUtils = new HttpUtils();
        return httpUtils;
    }

    /**
     * Clears the cookie map
     */
    public void invalidCookieMap() {
        cookieMap.clear();
    }

    public int getTimeout() {
        return timeout;
    }

    /**
     * Sets the request timeout
     *
     * @param timeout timeout in milliseconds
     */
    public void setTimeout(int timeout) {
        this.timeout = timeout;
    }

    public String getCharset() {
        return charset;
    }

    /**
     * Sets the charset used to decode response bodies
     *
     * @param charset charset name, for example "UTF-8"
     */
    public void setCharset(String charset) {
        this.charset = charset;
    }

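    /**
     * Parses an HTML string into a Document
     *
     * @param html the HTML text
     * @return the parsed document, with non-breaking spaces removed
     * @throws Exception if parsing fails
     */
    public static Document parseHtmlToDoc(String html) throws Exception {
        return removeHtmlSpace(html);
    }

    // Serializes the parsed page, strips every &nbsp; entity, and parses the result again
    private static Document removeHtmlSpace(String str) {
        Document doc = Jsoup.parse(str);
        String result = doc.html().replace("&nbsp;", "");
        return Jsoup.parse(result);
    }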

    /**
     * Executes a GET request and returns the result as a Document
     *
     * @param url the URL to request
     * @return the parsed response body
     * @throws Exception if the request fails
     */
    public Document executeGetAsDocument(String url) throws Exception {
        return parseHtmlToDoc(executeGet(url));
    }

    /**
     * Executes a GET request
     *
     * @param url the URL to request
     * @return the response body, or an empty string for a 404 or an empty response
     * @throws Exception if the request fails
     */
    public String executeGet(String url) throws Exception {
        HttpGet httpGet = new HttpGet(url);
        httpGet.setHeader("Cookie", convertCookieMapToString(cookieMap));
        httpGet.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
        String str = "";
        HttpClientContext context = HttpClientContext.create();
        // try-with-resources closes the response and then the client, even on failure
        try (CloseableHttpClient httpClient = HttpClientBuilder.create().build();
             CloseableHttpResponse response = httpClient.execute(httpGet, context)) {
            getCookiesFromCookieStore(context.getCookieStore(), cookieMap);
            // A 404 yields an empty string; any other body is returned as-is
            if (response.getStatusLine().getStatusCode() != 404) {
                HttpEntity entity = response.getEntity();
                if (entity != null) {
                    str = EntityUtils.toString(entity, charset);
                }
            }
        }
        return str;
    }

    /**
     * Executes a GET request over HTTPS and returns the result as a Document
     *
     * @param url the URL to request
     * @return the parsed response body
     * @throws Exception if the request fails
     */
    public Document executeGetWithSSLAsDocument(String url) throws Exception {
        return parseHtmlToDoc(executeGetWithSSL(url));
    }

    /**
     * Executes a GET request over HTTPS
     *
     * @param url the URL to request
     * @return the response body, or an empty string for a 404 or an empty response
     * @throws Exception if the request fails
     */
    public String executeGetWithSSL(String url) throws Exception {
        HttpGet httpGet = new HttpGet(url);
        httpGet.setHeader("Cookie", convertCookieMapToString(cookieMap));
        httpGet.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
        String str = "";
        HttpClientContext context = HttpClientContext.create();
        // try-with-resources closes the response and then the client, even on failure
        try (CloseableHttpClient httpClient = createSSLInsecureClient();
             CloseableHttpResponse response = httpClient.execute(httpGet, context)) {
            getCookiesFromCookieStore(context.getCookieStore(), cookieMap);
            // A 404 yields an empty string; any other body is returned as-is
            if (response.getStatusLine().getStatusCode() != 404) {
                HttpEntity entity = response.getEntity();
                if (entity != null) {
                    str = EntityUtils.toString(entity, charset);
                }
            }
        }
        return str;
    }

    /**
     * Executes a POST request and returns the result as a Document
     *
     * @param url    the URL to request
     * @param params form parameters
     * @return the parsed response body
     * @throws Exception if the request fails
     */
    public Document executePostAsDocument(String url, Map<String, String> params) throws Exception {
        return parseHtmlToDoc(executePost(url, params));
    }

    /**
     * Executes a POST request with form parameters
     *
     * @param url    the URL to request
     * @param params form parameters
     * @return the response body, or an empty string for an empty response
     * @throws Exception if the request fails
     */
    public String executePost(String url, Map<String, String> params) throws Exception {
        String reStr = "";
        HttpPost httpPost = new HttpPost(url);
        httpPost.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
        httpPost.setHeader("Cookie", convertCookieMapToString(cookieMap));
        List<NameValuePair> paramsRe = new ArrayList<>();
        for (Map.Entry<String, String> entry : params.entrySet()) {
            paramsRe.add(new BasicNameValuePair(entry.getKey(), entry.getValue()));
        }
        httpPost.setEntity(new UrlEncodedFormEntity(paramsRe));
        HttpClientContext context = HttpClientContext.create();
        // The original never closed the client or the response; try-with-resources does both
        try (CloseableHttpClient httpclient = HttpClientBuilder.create().build();
             CloseableHttpResponse response = httpclient.execute(httpPost, context)) {
            getCookiesFromCookieStore(context.getCookieStore(), cookieMap);
            HttpEntity entity = response.getEntity();
            if (entity != null) {
                reStr = EntityUtils.toString(entity, charset);
            }
        }
        return reStr;
    }

    /**
     * Executes a POST request over HTTPS and returns the result as a Document
     *
     * @param url    the URL to request
     * @param params form parameters
     * @return the parsed response body
     * @throws Exception if the request fails
     */
    public Document executePostWithSSLAsDocument(String url, Map<String, String> params) throws Exception {
        return parseHtmlToDoc(executePostWithSSL(url, params));
    }

    /**
     * Executes a POST request with form parameters over HTTPS
     *
     * @param url    the URL to request
     * @param params form parameters
     * @return the response body, or an empty string for an empty response
     * @throws Exception if the request fails
     */
    public String executePostWithSSL(String url, Map<String, String> params) throws Exception {
        String re = "";
        HttpPost post = new HttpPost(url);
        List<NameValuePair> paramsRe = new ArrayList<>();
        for (Map.Entry<String, String> entry : params.entrySet()) {
            paramsRe.add(new BasicNameValuePair(entry.getKey(), entry.getValue()));
        }
        post.setHeader("Cookie", convertCookieMapToString(cookieMap));
        post.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
        post.setEntity(new UrlEncodedFormEntity(paramsRe));
        HttpClientContext contextRe = HttpClientContext.create();
        // The original never closed the client or the response; try-with-resources does both
        try (CloseableHttpClient httpClientRe = createSSLInsecureClient();
             CloseableHttpResponse response = httpClientRe.execute(post, contextRe)) {
            HttpEntity entity = response.getEntity();
            if (entity != null) {
                re = EntityUtils.toString(entity, charset);
            }
            getCookiesFromCookieStore(contextRe.getCookieStore(), cookieMap);
        }
        return re;
    }

    /**
     * Sends a POST request with a JSON body
     *
     * @param url      the URL to request
     * @param jsonBody the JSON body
     * @return the response body, or an empty string for an empty response
     * @throws Exception if the request fails
     */
    public String executePostWithJson(String url, String jsonBody) throws Exception {
        String reStr = "";
        HttpPost httpPost = new HttpPost(url);
        httpPost.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
        httpPost.setHeader("Cookie", convertCookieMapToString(cookieMap));
        httpPost.setEntity(new StringEntity(jsonBody, ContentType.APPLICATION_JSON));
        HttpClientContext context = HttpClientContext.create();
        // The original never closed the client or the response; try-with-resources does both
        try (CloseableHttpClient httpclient = HttpClientBuilder.create().build();
             CloseableHttpResponse response = httpclient.execute(httpPost, context)) {
            getCookiesFromCookieStore(context.getCookieStore(), cookieMap);
            HttpEntity entity = response.getEntity();
            if (entity != null) {
                reStr = EntityUtils.toString(entity, charset);
            }
        }
        return reStr;
    }

    /**
     * Sends a POST request with a JSON body over HTTPS
     *
     * @param url      the URL to request
     * @param jsonBody the JSON body
     * @return the response body, or an empty string for an empty response
     * @throws Exception if the request fails
     */
    public String executePostWithJsonAndSSL(String url, String jsonBody) throws Exception {
        String re = "";
        HttpPost post = new HttpPost(url);
        post.setHeader("Cookie", convertCookieMapToString(cookieMap));
        post.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
        post.setEntity(new StringEntity(jsonBody, ContentType.APPLICATION_JSON));
        HttpClientContext contextRe = HttpClientContext.create();
        // The original never closed the client or the response; try-with-resources does both
        try (CloseableHttpClient httpClientRe = createSSLInsecureClient();
             CloseableHttpResponse response = httpClientRe.execute(post, contextRe)) {
            HttpEntity entity = response.getEntity();
            if (entity != null) {
                re = EntityUtils.toString(entity, charset);
            }
            getCookiesFromCookieStore(contextRe.getCookieStore(), cookieMap);
        }
        return re;
    }

    // Copies every cookie the client received into the cookie map
    private void getCookiesFromCookieStore(CookieStore cookieStore, Map<String, String> cookieMap) {
        List<Cookie> cookies = cookieStore.getCookies();
        for (Cookie cookie : cookies) {
            cookieMap.put(cookie.getName(), cookie.getValue());
        }
    }

    // Formats the cookie map as a Cookie header value: "name1=value1; name2=value2"
    private String convertCookieMapToString(Map<String, String> map) {
        StringBuilder cookie = new StringBuilder();
        for (Map.Entry<String, String> entry : map.entrySet()) {
            if (cookie.length() > 0) {
                cookie.append("; ");
            }
            cookie.append(entry.getKey()).append("=").append(entry.getValue());
        }
        return cookie.toString();
    }

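    /**
     * Creates an HttpClient for HTTPS requests.
     * Warning: this client trusts every certificate and skips hostname verification,
     * so it gives no protection against man-in-the-middle attacks.
     *
     * @return a client that accepts any server certificate
     * @throws GeneralSecurityException if the SSL context cannot be built
     */
    private static CloseableHttpClient createSSLInsecureClient() throws GeneralSecurityException {
        // Trust strategy that accepts every certificate chain
        SSLContext sslContext = new SSLContextBuilder().loadTrustMaterial(null, (chain, authType) -> true).build();
        // Hostname verifier that accepts every host
        SSLConnectionSocketFactory sslConnectionSocketFactory = new SSLConnectionSocketFactory(sslContext,
                (hostname, session) -> true);
        return HttpClients.custom().setSSLSocketFactory(sslConnectionSocketFactory).build();
    }
}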
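
The cookie handling in the class boils down to joining name=value pairs with "; ". As a standalone sketch using only the JDK, the same formatting can be written with streams. The class name CookieHeaderSketch and the sample cookie names are invented; a LinkedHashMap is used here so the order is predictable, whereas the HashMap in the utility class gives no ordering guarantee.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical demo class: formats a cookie map as a Cookie header value
public class CookieHeaderSketch {
    // Joins the entries as "name1=value1; name2=value2"; an empty map gives ""
    public static String toCookieHeader(Map<String, String> cookies) {
        return cookies.entrySet().stream()
                .map(entry -> entry.getKey() + "=" + entry.getValue())
                .collect(Collectors.joining("; "));
    }

    public static void main(String[] args) {
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("sessionId", "abc123"); // made-up cookie names and values
        cookies.put("lang", "zh");
        System.out.println(toCookieHeader(cookies));
    }
}
```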

The utility class above can be used not only to fetch web page content but also to send general HTTP requests.
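
To see what the two kinds of POST request carry, the sketch below builds the same entity types the utility class uses and inspects them locally, without sending anything. The class name RequestBodySketch, its helper methods, and the sample parameters are assumptions. Note that it passes UTF-8 explicitly to UrlEncodedFormEntity, which differs from the utility class, where the one-argument constructor is used.

```java
import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.entity.ContentType;
import org.apache.http.entity.StringEntity;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class: inspects the bodies of form and JSON POST requests
public class RequestBodySketch {
    // Body that a form POST would carry for the given parameters
    public static String formBody(Map<String, String> params) {
        List<NameValuePair> pairs = new ArrayList<>();
        for (Map.Entry<String, String> entry : params.entrySet()) {
            pairs.add(new BasicNameValuePair(entry.getKey(), entry.getValue()));
        }
        try {
            return EntityUtils.toString(new UrlEncodedFormEntity(pairs, StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Content-Type header value that a JSON POST would carry
    public static String jsonContentType(String jsonBody) {
        StringEntity entity = new StringEntity(jsonBody, ContentType.APPLICATION_JSON);
        return entity.getContentType().getValue();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("name", "xu yh"); // made-up form fields
        params.put("lang", "java");
        System.out.println(formBody(params));
        System.out.println(jsonContentType("{\"lang\":\"java\"}"));
    }
}
```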

Source Code

https://github.com/johnsonmoon/HttpUtils.git