最新想用爬虫实现抓取五大门户网站(搜狐、新浪、网易、腾讯、凤凰网)和电商数据(天猫,京东,聚美等), 今天第一天先搭建下环境和测试。
采用maven+xpath+ HttpClient+正则表达式。
maven pom.xml配置文件信息
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.6.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>1.6.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.10</artifactId>
<version>1.6.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.10</artifactId>
<version>1.6.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.6.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka_2.10</artifactId>
<version>1.6.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-graphx_2.10</artifactId>
<version>1.6.0</version>
</dependency>
<!-- httpclient4.4 -->
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<version>4.4</version>
</dependency>
<!-- htmlcleaner -->
<dependency>
<groupId>net.sourceforge.htmlcleaner</groupId>
<artifactId>htmlcleaner</artifactId>
<version>2.10</version>
</dependency>
<!-- json -->
<dependency>
<groupId>org.json</groupId>
<artifactId>json</artifactId>
<version>20140107</version>
</dependency>
<!-- hbase -->
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>0.96.1.1-hadoop2</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-server</artifactId>
<version>0.96.1.1-hadoop2</version>
</dependency>
<!-- redis 2.7.0-->
<dependency>
<groupId>redis.clients</groupId>
<artifactId>jedis</artifactId>
<version>2.7.0</version>
</dependency>
<!-- slf4j -->
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.7.10</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.7.10</version>
</dependency>
<!-- quartz1.8.4 -->
<dependency>
<groupId>org.quartz-scheduler</groupId>
<artifactId>quartz</artifactId>
<version>1.8.4</version>
</dependency>
<!-- curator -->
<dependency>
<groupId>org.apache.curator</groupId>
<artifactId>curator-framework</artifactId>
<version>2.7.1</version>
</dependency>
新建一个测试类:SpiderTest
/** * url 入口,下载页面 * @param url */ public static String downLoadCrawlurl(String url){ String context = null; Logger logger = LoggerFactory.getLogger(SpiderTest.class); HttpClientBuilder create = HttpClientBuilder.create(); HttpGet httpGet = new HttpGet(url); CloseableHttpClient build = create.build(); try { CloseableHttpResponse response = build.execute( httpGet); HttpEntity entity = response.getEntity(); context = EntityUtils.toString( entity ); System.out.println("context:" + context); } catch ( ClientProtocolException e ) { e.printStackTrace(); } catch ( IOException e ) { logger.info("download...." ); } return context; }
public static void main( String[] args ) {
String url = "http://money.163.com/";
downLoadCrawlurl(url);
}
原文:http://www.cnblogs.com/zhanggl/p/5216346.html