Scraping CNKI (China National Knowledge Infrastructure) with Java and Jsoup: approach and practice (data retrieved successfully)

Shaka ⋅ 9 months ago ⋅ 1096 views

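A while ago I tried scraping data from CNKI. I struggled with it for a long time, kept failing, and eventually lost the will to continue.

I am writing the tests down now. Whether they succeeded or not, they can serve as a reference.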

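Test 1

Cookies

The first time you visit the site, CNKI returns a set of cookies containing four parameters:
ASP.NET_SessionId
Ecp_ClientId
Ecp_IpLoginFail
SID_kns
These cookies must be sent with subsequent requests; otherwise the server reports that the user cannot be found.
The cookies submitted differ slightly from browser to browser. Chrome's cookies: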

con.cookie("_pk_ses", "*");
con.cookie("ASP.NET_SessionId", "gtiddqzwyj5gpg0qgxipyqo3");
con.cookie("ASPSESSIONIDQCSTBCRB", "HKNBKCGDNLDLNHJECPKOJLED");
con.cookie("CNZZDATA3258975", "cnzz_eid%3D696775271-1525654797-http%253A%252F%252Fkns.cnki.net%252F%26ntime%3D1538273733");
con.cookie("Ecp_ClientId", "5180531163302738890");
con.cookie("Ecp_IpLoginFail", "180930115.171.133.231");
con.cookie("KNS_SortType", "custommode@SCDB"); // summary mode (required for summary mode)
// con.cookie("KNS_SortType", "");             // alternative: an empty value selects list mode; setting it after the line above would overwrite it
con.cookie("RsPerPage", "50");
con.cookie("SID_kcms", "124111");   // required for summary mode
con.cookie("SID_klogin", "125144"); // can be omitted
con.cookie("SID_kns", "123106");
con.cookie("SID_krsnew", "125132"); // can be omitted
con.cookie("UM_distinctid", "163386ea9a727b-04e123947bf39b-3961430f-1fa400-163386ea9a8e52");
con.cookie("_pk_id", "82b531d4-7052-45bb-8132-b45f0932ceec.1525660161.11.1528872420.1528868668.");
con.cookie("_pk_ref", "%5B%22%22%2C%22%22%2C1538116979%2C%22http%3A%2F%2Fwww.cnki.net%2F%22%5D");
con.cookie("amid", "696cc3d1-fa9e-4b87-9bdf-009344a96698");
con.cookie("cnkiUserKey", "9aae4225-eafe-ddc1-25cc-3c392546db3a");
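Typing each cookie by hand is error-prone. As a rough sketch, the value of the Cookie request header can be copied from the browser's developer tools as one string and split into name/value pairs. The class name `CookieHeaderParser` and its `parse` method are hypothetical helpers, not part of the original test; the sample values are shortened from the list above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not from the original post): turns the value of a raw
// Cookie request header into an ordered name -> value map.
class CookieHeaderParser {

    static Map<String, String> parse(String header) {
        Map<String, String> cookies = new LinkedHashMap<>();
        if (header == null) {
            return cookies;
        }
        for (String pair : header.split(";")) {
            String trimmed = pair.trim();
            int eq = trimmed.indexOf('=');
            if (eq <= 0) {
                continue; // skip empty fragments and pairs without a name
            }
            // Split on the first '=' only, since a cookie value may itself contain '='.
            // A name that appears twice keeps its last value.
            cookies.put(trimmed.substring(0, eq), trimmed.substring(eq + 1));
        }
        return cookies;
    }

    public static void main(String[] args) {
        String header = "ASP.NET_SessionId=gtiddqzwyj5gpg0qgxipyqo3; SID_kns=123106; RsPerPage=50";
        parse(header).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

The resulting map can then be handed to Jsoup in one call with `con.cookies(map)` instead of one `con.cookie(...)` call per entry.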

 

Firefox sends fewer cookies.

Use F12 (the browser developer tools) to find the URL of the data request:

http://kns.cnki.net/kns/brief/brief.aspx?pagename=ASP.brief_result_aspx&isinEn=1&dbPrefix=SCDB&dbCatalog=%e4%b8%ad%e5%9b%bd%e5%ad%a6%e6%9c%af%e6%96%87%e7%8c%ae%e7%bd%91%e7%bb%9c%e5%87%ba%e7%89%88%e6%80%bb%e5%ba%93&ConfigFile=SCDB.xml&research=off&t=1538278623116&keyValue=%E5%A5%A5%E6%B2%99%E5%88%A9%E9%93%82&S=1&sorttype=&DisplayMode=custommode
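The `keyValue` parameter in this URL is the search keyword, percent-encoded as UTF-8. A minimal sketch of rebuilding the same URL for an arbitrary keyword follows. The class `CnkiSearchUrl` and its methods are hypothetical; treating `t` as a millisecond timestamp is an assumption based only on how the value looks; all other parameters are copied unchanged from the captured request.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper (not from the original post): rebuilds the captured
// request URL with a different keyword.
class CnkiSearchUrl {

    static final String BASE = "http://kns.cnki.net/kns/brief/brief.aspx";

    // dbCatalog exactly as it appears in the captured request (already percent-encoded).
    static final String DB_CATALOG =
            "%e4%b8%ad%e5%9b%bd%e5%ad%a6%e6%9c%af%e6%96%87%e7%8c%ae%e7%bd%91%e7%bb%9c%e5%87%ba%e7%89%88%e6%80%bb%e5%ba%93";

    static String encodeKeyword(String keyword) {
        try {
            return URLEncoder.encode(keyword, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // timestampMillis fills the "t" parameter; that it is a timestamp is an assumption.
    static String build(String keyword, long timestampMillis) {
        return BASE
                + "?pagename=ASP.brief_result_aspx&isinEn=1&dbPrefix=SCDB"
                + "&dbCatalog=" + DB_CATALOG
                + "&ConfigFile=SCDB.xml&research=off"
                + "&t=" + timestampMillis
                + "&keyValue=" + encodeKeyword(keyword)
                + "&S=1&sorttype=&DisplayMode=custommode";
    }

    public static void main(String[] args) {
        // "\u5965\u6c99\u5229\u94c2" is the keyword from the captured request, written with Unicode escapes.
        System.out.println(build("\u5965\u6c99\u5229\u94c2", System.currentTimeMillis()));
    }
}
```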

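//First run the search in a browser, then copy the cookies into the program and request the URL above. The result page is returned successfully.

//Next, trim down the cookies to find the ones that are actually required. (To be continued)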

------------------------------------------------------------------------

Test 2

After trimming the cookies, I found that only two of them are indispensable:

ASP.NET_SessionId and SID_kns.
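The reduction to two cookies can be sketched as a small filter that keeps only the required names and fails early if one is absent. This is useful because `Map.get` returns `null` for a cookie the server never set. The class `RequiredCookies` and its `select` method are hypothetical; the two cookie names are the ones identified above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not from the original post): keeps only the cookies
// that the test above found to be indispensable.
class RequiredCookies {

    static final String[] REQUIRED = {"ASP.NET_SessionId", "SID_kns"};

    static Map<String, String> select(Map<String, String> all) {
        Map<String, String> kept = new LinkedHashMap<>();
        List<String> missing = new ArrayList<>();
        for (String name : REQUIRED) {
            String value = all.get(name);
            if (value == null || value.isEmpty()) {
                missing.add(name); // remember every absent cookie for the error message
            } else {
                kept.put(name, value);
            }
        }
        if (!missing.isEmpty()) {
            throw new IllegalArgumentException("Missing required cookies: " + missing);
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> all = new LinkedHashMap<>();
        all.put("ASP.NET_SessionId", "gtiddqzwyj5gpg0qgxipyqo3");
        all.put("Ecp_ClientId", "5180531163302738890");
        all.put("SID_kns", "123106");
        System.out.println(select(all)); // only the two required cookies remain
    }
}
```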

So the testing continued...

Failed attempt 1

1. Add request headers (I later found that these headers can be omitted)

2. Get the cookies

String url = "http://kns.cnki.net/kns/brief/result.aspx?dbprefix=SCDB";
Connection con = Jsoup.connect(url);

// Execute the request and get the response
Connection.Response response = con.execute();
// Read the cookies returned by the server
Map<String,String> map = response.cookies();

String sessionId = map.get("ASP.NET_SessionId");
String SID_kns = map.get("SID_kns");

3. Set the cookies
url="http://kns.cnki.net/kns/brief/brief.aspx?pagename=ASP.brief_result_aspx&isinEn=1&dbPrefix=SCDB&dbCatalog=%e4%b8%ad%e5%9b%bd%e5%ad%a6%e6%9c%af%e6%96%87%e7%8c%ae%e7%bd%91%e7%bb%9c%e5%87%ba%e7%89%88%e6%80%bb%e5%ba%93&ConfigFile=SCDB.xml&research=off&t=1538278623116&keyValue=%E5%A5%A5%E6%B2%99%E5%88%A9%E9%93%82&S=1&sorttype=&DisplayMode=custommode";
con.url(url);
con.cookie("ASP.NET_SessionId",sessionId);
// con.cookie("Ecp_ClientId", "5180531163302738890");
// con.cookie("Ecp_IpLoginFail","180930115.171.133.231");
con.cookie("SID_kns",SID_kns);

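Response returned: "对不起,服务器上不存在此用户!可能已经被剔除或参数错误" ("Sorry, this user does not exist on the server! The user may have been removed, or the parameters are wrong.")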

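Successful attempt

First run a search on CNKI in a browser, then use F12 to read ASP.NET_SessionId and SID_kns from the request's Cookie header.

Then put those two values into the program. The request succeeds.

For now this counts as semi-automatic.

Next step: use this semi-automatic approach to scrape the abstracts on every page. (I don't know how long the session ID takes to expire. I'm curious and will test it...) (To be continued...)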

 

-------------------------------------------------------------------------------------------------------

With ASP.NET_SessionId and SID_kns taken from the browser search as described above, fetch all of the list data:

import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// Replace the cookie values below with ones copied from your own live browser session.
public void test() throws IOException {
    // Open the first list page
    String url = "http://kns.cnki.net/kns/brief/brief.aspx?curpage=1&RecordsPerPage=50&QueryID=5&ID=&turnpage=1&tpagemode=L&dbPrefix=SCDB&Fields=&DisplayMode=custommode&PageName=ASP.brief_result_aspx&isinEn=1";
    Connection con = Jsoup.connect(url);
    con.cookie("ASP.NET_SessionId", "eurbqm5il14thnm2jjg3c541");
    con.cookie("SID_kns", "123122");
    con.cookie("RsPerPage", "50");
    Document doc = con.get();

    // Read the number of list pages; the countPageMark text is split on "/" and the second part is the total
    int pageCount = Integer.parseInt(doc.getElementsByClass("countPageMark").get(0).text().split("/")[1]);
    // Loop over the pages
    for (int i = 1; i <= pageCount; i++) {
        String pageUrl = "http://kns.cnki.net/kns/brief/brief.aspx?curpage=" + i + "&RecordsPerPage=50&QueryID=0&ID=&turnpage=1&tpagemode=L&dbPrefix=SCDB&Fields=&DisplayMode=custommode&PageName=ASP.brief_result_aspx&isinEn=1#J_ORDER&";
        Connection pageConn = Jsoup.connect(pageUrl);
        // As posted, these values differ from the ones used for the first request
        pageConn.cookie("ASP.NET_SessionId", "dlrvqzrlnpyb3fdjzlsyvgds");
        pageConn.cookie("SID_kns", "123109");
        pageConn.cookie("RsPerPage", "50");
        Document docList = pageConn.get();

        // Select each column once per page instead of once per row
        Elements titles = docList.getElementsByClass("title_c");
        Elements authors = docList.getElementsByClass("author");
        Elements journals = docList.getElementsByClass("journal");
        Elements abstracts = docList.getElementsByClass("abstract_c");
        // Loop over the records on this page
        for (int j = 0; j < titles.size(); j++) {
            String title = titles.get(j).text();
            String author = authors.get(j).text();
            String journal = journals.get(j).text();
            String abstractText = abstracts.get(j).text();
            System.out.println(title + "----" + author + "------" + journal + "------" + abstractText);
        }
    }
}
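The page-count parsing and the per-page URL in the method above can be pulled out into two small functions that are easy to check without touching the network. This is only a sketch: the class `CnkiPaging` is hypothetical, and it assumes the `countPageMark` text has the form "current/total" (for example "1/120"), which is what the `split("/")[1]` call above relies on. The `#J_ORDER&` fragment is left out because URL fragments are not sent to the server.

```java
// Hypothetical helper (not from the original post): paging utilities for the list pages.
class CnkiPaging {

    // Parses the total page count from text assumed to look like "current/total".
    static int parsePageCount(String mark) {
        String[] parts = mark.trim().split("/");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Unexpected page mark: " + mark);
        }
        return Integer.parseInt(parts[1].trim());
    }

    // Builds the list-page URL for one page, using the query parameters from the loop above.
    static String pageUrl(int page, int recordsPerPage) {
        return "http://kns.cnki.net/kns/brief/brief.aspx?curpage=" + page
                + "&RecordsPerPage=" + recordsPerPage
                + "&QueryID=0&ID=&turnpage=1&tpagemode=L&dbPrefix=SCDB&Fields="
                + "&DisplayMode=custommode&PageName=ASP.brief_result_aspx&isinEn=1";
    }

    public static void main(String[] args) {
        int pageCount = parsePageCount("1/3"); // sample text, not real CNKI output
        for (int page = 1; page <= pageCount; page++) {
            System.out.println(pageUrl(page, 50));
        }
    }
}
```

Failing with an exception on an unexpected page mark makes an expired session easier to spot, since the marker element would then be missing or have a different shape.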

Note: this article belongs to its author and may not be reposted without the author's permission.
