Running Nutch in Eclipse
1. Prerequisites
openSUSE 11
Eclipse 3.4 (Ganymede), Linux build
JDK 1.6, Linux build
Nutch 1.0
2. Import into Eclipse
New Java Project --> enter a project name; under Contents, select "Create project from existing source" and choose the extracted nutch directory --> Finish
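3. Fix the errors
After the import, parse-mp3 and parse-rtf report errors because their jar files are missing. Download them from here:
http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-mp3/lib/
http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/
Put the downloaded jars into src/plugin/parse-mp3/lib and src/plugin/parse-rtf/lib/ respectively,
then add them to the project via Configure Build Path --> Add JARs.
After rebuilding, the code still has errors and needs a few changes; this may be specific to my version.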
RTFParseFactory.java:
Add the import
import org.apache.nutch.parse.ParseResult;
Change
public Parse getParse(Content content) {
to
public ParseResult getParse(Content content) {
Change
return new ParseStatus(ParseStatus.FAILED,ParseStatus.FAILED_EXCEPTION,e.toString()).getEmptyParse(conf);
to
return new ParseStatus(ParseStatus.FAILED,ParseStatus.FAILED_EXCEPTION,e.toString()).getEmptyParseResult(content.getUrl(), getConf());
Change
return new ParseImpl(text,new ParseData(ParseStatus.STATUS_SUCCESS,title,OutlinkExtractor.getOutlinks(text,this.conf),content.getMetadata(),metadata));
to
return ParseResult.createParseResult(content.getUrl(),new ParseImpl(text,new ParseData(ParseStatus.STATUS_SUCCESS,title,OutlinkExtractor.getOutlinks(text, this.conf),content.getMetadata(),metadata)));
TestRTFParser.java:
Change
parse = new ParseUtil(conf).parseByExtensionId("parse-rtf", content);
to
parse = new ParseUtil(conf).parseByExtensionId("parse-rtf", content).get(urlString);
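4. Configure the runtime environment
Add the conf folder as a source folder, because the configuration files in conf are needed at run time:
right-click the project --> Properties --> Java Build Path --> Source --> Add Folder --> select the conf folder --> OK
Edit the configuration files
Under the src directory, create a folder named urls to hold all the URLs that will be crawled:
mkdir urls
echo "http://lucene.apache.org/" >> urls/myurl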
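The two shell commands above can also be done from Java, which may be handy when working entirely inside Eclipse. This is only a minimal sketch: the class name SeedFileWriter and the method appendSeed are made up for illustration, and the path is resolved relative to the current working directory, just like the shell commands.

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;

// Hypothetical helper (not part of Nutch): mirrors "mkdir urls" followed by "echo ... >> urls/myurl".
public class SeedFileWriter {

    /** Appends one URL as a line to dir/fileName, creating the directory if needed. */
    public static File appendSeed(File dir, String fileName, String url) throws IOException {
        if (!dir.isDirectory() && !dir.mkdirs()) {
            throw new IOException("Could not create directory " + dir);
        }
        File seedFile = new File(dir, fileName);
        Writer out = new FileWriter(seedFile, true); // true = append, like ">>"
        try {
            out.write(url + "\n");
        } finally {
            out.close();
        }
        return seedFile;
    }

    public static void main(String[] args) throws IOException {
        File seedFile = appendSeed(new File("urls"), "myurl", "http://lucene.apache.org/");
        System.out.println("Seed list written to " + seedFile.getAbsolutePath());
    }
}
```

Running it twice appends the URL twice, which is the same behavior as repeating the echo command.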
Edit conf/crawl-urlfilter.txt
Change +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
to match your own domain, for example:
+^http://([a-z0-9]*\.)*apache.org/
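To see which URLs a rule like this lets through, the regular expression can be tried out on its own. The sketch below is not Nutch code: the class UrlFilterCheck is hypothetical, the leading + (which marks the line as an accept rule) is dropped, and using find() to apply the pattern is an assumption about how the filter behaves.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone check of the regex from crawl-urlfilter.txt (not Nutch's own filter class).
public class UrlFilterCheck {

    // The rule without its leading '+'; backslashes are doubled for the Java string literal.
    private static final Pattern ACCEPT =
            Pattern.compile("^http://([a-z0-9]*\\.)*apache.org/");

    /** Returns true if the URL starts with something the pattern matches. */
    public static boolean accepts(String url) {
        // Assumption: the rule only needs to match a prefix of the URL, so find() is used, not matches().
        return ACCEPT.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://lucene.apache.org/",
            "http://apache.org/",
            "http://www.example.com/",
            "https://lucene.apache.org/"
        };
        for (String url : samples) {
            System.out.println(url + " -> " + accepts(url));
        }
    }
}
```

Note that the dot in apache.org is not escaped, so it matches any single character; writing apache\.org in the filter file would be stricter.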
Edit nutch-site.xml
<configuration>
<property>
<name>http.agent.name</name>
<value>flyox</value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.
NOTE: You should also check other related properties:
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
and set their values appropriately.
</description>
</property>
<property>
<name>http.agent.description</name>
<value>flyox</value>
<description>Further description of our bot- this text is used in
the User-Agent header. It appears in parenthesis after the agent name.
</description>
</property>
<property>
<name>http.agent.url</name>
<value>http://www.flyox.com/crawl</value>
<description>A URL to advertise in the User-Agent header. This will
appear in parenthesis after the agent name. Custom dictates that this
should be a URL of a page explaining the purpose and behavior of this
crawler.
</description>
</property>
<property>
<name>http.agent.email</name>
<value>sunwei250@hotmail.co</value>
<description>An email address to advertise in the HTTP 'From' request
header and User-Agent header. A good practice is to mangle this
address (e.g. 'info at example dot com') to avoid spamming.
</description>
</property>
<property>
<name>http.agent.version</name>
<value>1.0</value>
</property>
</configuration>
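Since the description above says http.agent.name must not be empty, it can be worth checking the file before starting a crawl. The following is a rough sketch using only the JDK's XML parser; the class AgentConfigCheck and its methods are hypothetical, and the default path conf/nutch-site.xml is an assumption about where the file lives in the project.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical checker (not part of Nutch): reads <property> name/value pairs from a config file.
public class AgentConfigCheck {

    /** Parses the XML and returns property names mapped to their values, in file order. */
    public static Map<String, String> readProperties(InputStream in) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        Map<String, String> props = new LinkedHashMap<String, String>();
        NodeList nodes = doc.getElementsByTagName("property");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element property = (Element) nodes.item(i);
            String name = childText(property, "name");
            if (name != null) {
                props.put(name, childText(property, "value"));
            }
        }
        return props;
    }

    /** Returns the trimmed text of the first child element with the given tag, or null if absent. */
    private static String childText(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() == 0 ? null : list.item(0).getTextContent().trim();
    }

    /** True if http.agent.name is present and not empty. */
    public static boolean hasAgentName(Map<String, String> props) {
        String value = props.get("http.agent.name");
        return value != null && value.length() > 0;
    }

    public static void main(String[] args) throws Exception {
        String path = args.length > 0 ? args[0] : "conf/nutch-site.xml"; // assumed location
        InputStream in = new FileInputStream(path);
        try {
            Map<String, String> props = readProperties(in);
            System.out.println("http.agent.name set: " + hasAgentName(props));
        } finally {
            in.close();
        }
    }
}
```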
5. Run
Run > Run Configurations... --> New Java Application
Main class --> org.apache.nutch.crawl.Crawl
Program arguments --> urls -dir myPages -depth 2 -topN 50
VM arguments --> -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
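For reference, the same run configuration can be written as a small launcher class. This is a sketch under assumptions, not a replacement for the Eclipse setup: CrawlLauncher and crawlArgs are hypothetical names, the Crawl class is looked up by reflection so the sketch compiles without the Nutch jars (they still have to be on the classpath at run time), and setting the two properties with System.setProperty before any Nutch class is loaded is assumed to have the same effect as the -D options.

```java
import java.lang.reflect.Method;

// Hypothetical launcher: reproduces the run configuration above in code.
public class CrawlLauncher {

    /** Builds the program arguments: <seedDir> -dir <outDir> -depth <depth> -topN <topN>. */
    public static String[] crawlArgs(String seedDir, String outDir, int depth, int topN) {
        return new String[] {
            seedDir, "-dir", outDir, "-depth", String.valueOf(depth), "-topN", String.valueOf(topN)
        };
    }

    public static void main(String[] args) throws Exception {
        // Stand-ins for the VM arguments -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
        System.setProperty("hadoop.log.dir", "logs");
        System.setProperty("hadoop.log.file", "hadoop.log");

        // Load the main class named in the run configuration and call its main method.
        Class<?> crawl = Class.forName("org.apache.nutch.crawl.Crawl");
        Method main = crawl.getMethod("main", String[].class);
        main.invoke(null, (Object) crawlArgs("urls", "myPages", 2, 50));
    }
}
```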
6. Possible errors
You may run into an out-of-memory error (java.lang.OutOfMemoryError).
If so, add -Xms5m -Xmx150m to the VM arguments.
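To confirm that the heap settings were actually picked up by the run configuration, a tiny class can print what the JVM reports. This is just a sketch; HeapInfo is a hypothetical name, and the exact numbers reported vary by JVM, so they may not equal the -Xmx value exactly.

```java
// Hypothetical helper: prints the heap limits the running JVM reports.
public class HeapInfo {

    /** Converts a byte count to whole megabytes (1 MB = 1024 * 1024 bytes). */
    public static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() is the upper bound the JVM will try to use (influenced by -Xmx).
        System.out.println("Max heap (MB):     " + toMegabytes(rt.maxMemory()));
        // totalMemory() is the heap currently reserved (starts near -Xms).
        System.out.println("Current heap (MB): " + toMegabytes(rt.totalMemory()));
    }
}
```

Run it with the same VM arguments as the crawl; if the maximum is far from 150 MB, the arguments were probably entered under Program arguments instead.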



