Running Nutch in Eclipse

1. Prerequisites
openSUSE 11
Eclipse 3.4 (Ganymede), Linux build
JDK 1.6, Linux build
Nutch 1.0

2. Import into Eclipse
New Java Project --> enter a project name; under Contents select "Create project from existing source" and choose the unpacked Nutch directory --> Finish

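3. Fix the errors
After the import, parse-mp3 and parse-rtf report errors because their jar files are missing. Download them from here:
 http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-mp3/lib/
 http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/
Put the downloaded jars into src/plugin/parse-mp3/lib and src/plugin/parse-rtf/lib/ respectively,
then add them to the project with Configure Build Path --> Add JARs.
After rebuilding, the code still has errors and needs a few changes; this may be related to the versions I am using.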
RTFParseFactory.java:
Add the import
import org.apache.nutch.parse.ParseResult;
Change
public Parse getParse(Content content) {
to
public ParseResult getParse(Content content) {
Change
return new ParseStatus(ParseStatus.FAILED,ParseStatus.FAILED_EXCEPTION,e.toString()).getEmptyParse(conf);
to
return new ParseStatus(ParseStatus.FAILED,ParseStatus.FAILED_EXCEPTION,e.toString()).getEmptyParseResult(content.getUrl(), getConf());
Change
return new ParseImpl(text,new ParseData(ParseStatus.STATUS_SUCCESS,title,OutlinkExtractor.getOutlinks(text,this.conf),content.getMetadata(),metadata));
to
return ParseResult.createParseResult(content.getUrl(),new ParseImpl(text,new ParseData(ParseStatus.STATUS_SUCCESS,title,OutlinkExtractor.getOutlinks(text, this.conf),content.getMetadata(),metadata)));

TestRTFParser.java:
Change
parse = new ParseUtil(conf).parseByExtensionId("parse-rtf", content);
to
parse = new ParseUtil(conf).parseByExtensionId("parse-rtf", content).get(urlString);

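4. Configure the runtime environment
Add the conf folder to the project's source folders; its configuration files are needed at run time.
Right-click the project --> Properties --> Java Build Path --> Source --> Add Folder --> select the conf folder --> OK
Edit the configuration files
Under the src directory, create a folder named urls to hold all the URLs that will be crawled
mkdir urls
echo "http://lucene.apache.org/">>urls/myurl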

Edit conf/crawl-urlfilter.txt
Change +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
so that it matches your own domain, for example
+^http://([a-z0-9]*\.)*apache.org/
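
To see which URLs a rule like this lets through, the pattern can be tried on its own with java.util.regex. This is only a minimal sketch: the class name UrlFilterSketch and the sample URLs are made up for illustration, and it assumes the filter accepts a URL when the pattern is found in it starting from the beginning (the leading + is the filter file's "accept" marker, not part of the regex).

```java
import java.util.regex.Pattern;

public class UrlFilterSketch {
    // Pattern copied from the crawl-urlfilter.txt line, without the leading '+'.
    static final Pattern ACCEPT =
            Pattern.compile("^http://([a-z0-9]*\\.)*apache.org/");

    // Assumption: a URL is accepted when the pattern is found in it
    // (find semantics), so anything after the matched prefix is allowed.
    static boolean accepts(String url) {
        return ACCEPT.matcher(url).find();
    }

    public static void main(String[] args) {
        // Hypothetical sample URLs, not taken from the original post.
        String[] samples = {
            "http://lucene.apache.org/",
            "http://apache.org/nutch/",
            "https://lucene.apache.org/",
            "http://example.com/",
            "http://apacheXorg/"
        };
        for (String url : samples) {
            System.out.println(accepts(url) + "  " + url);
        }
    }
}
```

The last sample is accepted because the dot in apache.org is unescaped and therefore matches any character; writing apache\.org in the filter file should make the rule stricter.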
Edit nutch-site.xml

<configuration>
 <property>
  <name>http.agent.name</name>
  <value>flyox</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
   please set this to a single word uniquely related to your organization.
   NOTE: You should also check other related properties:
   http.robots.agents
   http.agent.description
   http.agent.url
   http.agent.email
   http.agent.version
   and set their values appropriately.
   </description>
 </property>
 <property>
  <name>http.agent.description</name>
  <value>flyox</value>
  <description>Further description of our bot- this text is used in
   the User-Agent header. It appears in parenthesis after the agent name.
   </description>
 </property>
 <property>
  <name>http.agent.url</name>
  <value>http://www.flyox.com/crawl</value>
  <description>A URL to advertise in the User-Agent header. This will
   appear in parenthesis after the agent name. Custom dictates that this
   should be a URL of a page explaining the purpose and behavior of this
   crawler.
   </description>
 </property>
 <property>
  <name>http.agent.email</name>
  <value>sunwei250@hotmail.co</value>
  <description>An email address to advertise in the HTTP 'From' request
   header and User-Agent header. A good practice is to mangle this
   address (e.g. 'info at example dot com') to avoid spamming.
   </description>
 </property>
 <property>
  <name>http.agent.version</name>
  <value>1.0</value>
 </property>
</configuration>
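
Since http.agent.name must not be empty, it can be worth checking the file before starting a crawl. The following is a rough sketch using only the JDK's XML parser; the class AgentNameCheck and its property method are hypothetical helpers, not part of Nutch, and they read the XML directly rather than going through Nutch's own configuration loading.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class AgentNameCheck {
    // Returns the <value> of the first <property> whose <name> equals the
    // given name, or null if there is no such property.
    static String property(String xml, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element p = (Element) props.item(i);
                NodeList names = p.getElementsByTagName("name");
                NodeList values = p.getElementsByTagName("value");
                if (names.getLength() > 0 && values.getLength() > 0
                        && name.equals(names.item(0).getTextContent().trim())) {
                    return values.item(0).getTextContent().trim();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse configuration", e);
        }
    }

    public static void main(String[] args) {
        // Inline stand-in for the contents of conf/nutch-site.xml.
        String xml = "<configuration><property>"
                + "<name>http.agent.name</name><value>flyox</value>"
                + "</property></configuration>";
        String agent = property(xml, "http.agent.name");
        if (agent == null || agent.isEmpty()) {
            System.out.println("http.agent.name is missing or empty");
        } else {
            System.out.println("http.agent.name = " + agent);
        }
    }
}
```

To check the real file, the inline string would be replaced by the contents of conf/nutch-site.xml read from disk.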

5. Run
Run > Run Configurations... --> New Java Application
Main class --> org.apache.nutch.crawl.Crawl
Program arguments --> urls -dir myPages -depth 2 -topN 50
VM arguments --> -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
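
The program arguments break down as follows: urls is the folder holding the seed URL files, -dir names the directory the crawl is written to, -depth is the link depth to follow from the seed pages, and -topN caps the number of pages fetched at each level. The sketch below shows one way such a command line can be split up. It is a standalone illustration with a hypothetical class CrawlArgsSketch, not Nutch's actual parser, and it uses -1 to mean "not given" instead of guessing Nutch's defaults.

```java
public class CrawlArgsSketch {
    String seedDir;    // positional argument: folder with the seed URL files
    String outputDir;  // value of -dir
    int depth = -1;    // value of -depth, -1 if not given
    long topN = -1;    // value of -topN, -1 if not given

    // Walks the arguments once; each flag consumes the token that follows it.
    static CrawlArgsSketch parse(String[] args) {
        CrawlArgsSketch c = new CrawlArgsSketch();
        for (int i = 0; i < args.length; i++) {
            if ("-dir".equals(args[i]) && i + 1 < args.length) {
                c.outputDir = args[++i];
            } else if ("-depth".equals(args[i]) && i + 1 < args.length) {
                c.depth = Integer.parseInt(args[++i]);
            } else if ("-topN".equals(args[i]) && i + 1 < args.length) {
                c.topN = Long.parseLong(args[++i]);
            } else {
                c.seedDir = args[i];
            }
        }
        return c;
    }

    public static void main(String[] args) {
        CrawlArgsSketch c = parse("urls -dir myPages -depth 2 -topN 50".split(" "));
        System.out.println("seeds=" + c.seedDir + " out=" + c.outputDir
                + " depth=" + c.depth + " topN=" + c.topN);
    }
}
```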

6. Possible errors
You may run into an out-of-memory error.
In that case, add -Xms5m -Xmx150m to the VM arguments.
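
To confirm that the heap setting actually reaches the JVM, a tiny class can be run with the same VM arguments. This is a minimal sketch; HeapCheck is a hypothetical helper name, and the reported figure may come out slightly below the -Xmx value depending on the JVM.

```java
public class HeapCheck {
    // Maximum heap the JVM will try to use, in megabytes.
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    public static void main(String[] args) {
        System.out.println("max heap (MB): " + maxHeapMb());
    }
}
```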
