<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=windows-1250">
<META content="MSHTML 5.50.4522.1800" name=GENERATOR>
<STYLE></STYLE>
</HEAD>
<BODY bgColor=#ffffff>
<DIV><FONT face=Arial>Hi,</FONT></DIV>
<DIV><FONT face=Arial></FONT> </DIV>
<DIV><FONT face=Arial>I am sending this as suggested by someone (I forgot who) on
#twisted. </FONT></DIV>
<DIV><FONT face=Arial></FONT> </DIV>
<DIV><FONT face=Arial>I extended t.w.spider. It isn't finished, but it seems
to work. </FONT></DIV>
<DIV><FONT face=Arial>You can download it, run it, and watch it crawl over a given
domain. It prints out</FONT></DIV>
<DIV><FONT face=Arial>quite a lot of information while working, so you know what is going
on (mainly for testing purposes).<BR></FONT></DIV>
<DIV><FONT face=Arial>I am just hoping for a quick scan through the code, and a test run, by
someone more experienced in this, so I will know whether I did it the Python/Twisted/sane
way before I go further and connect it to various other things. (Also read the NOTE at the
end of this email.)</FONT></DIV><FONT face=Verdana size=2>
<DIV><BR><FONT face=Arial size=3>You can get this at </FONT><A
href="http://www.mind-nest.com/downloads/walker.tgz"><FONT face=Arial
size=3>http://www.mind-nest.com/downloads/walker.tgz</FONT></A><FONT face=Arial
size=3> or </FONT></DIV>
<DIV><A href="http://www.mind-nest.com/downloads/walker.zip"><FONT face=Arial
size=3>http://www.mind-nest.com/downloads/walker.zip</FONT></A><BR><BR><FONT
face="Times New Roman"><FONT size=3><FONT face=Arial>When it is finished, I plan to give
it back to Twisted, if they will<BR>accept
it.</FONT><BR></DIV></FONT></FONT>
<DIV><FONT face=Arial>Note for the mailing-list moderator: I sent this mail about a week
ago, before I was a member, and got a response saying that you must approve
it first. As that didn't happen within a week and I am a member now, I am sending it again as a
member.</FONT></DIV>
<DIV><FONT face="Times New Roman"><FONT size=3><FONT face=Verdana
size=2></FONT> </DIV></FONT></FONT>
<DIV><FONT face=Arial size=3>Regards,</FONT></DIV>
<DIV><FONT face=Arial size=3>:janko</FONT></DIV>
<DIV><FONT face=Arial size=3><A
href="mailto:janko@mind-nest.com">janko@mind-nest.com</A></FONT></DIV>
<DIV><FONT face=Arial size=3><A
href="http://www.mind-nest.com">www.mind-nest.com</A></FONT></DIV>
<DIV><FONT face="Times New Roman"><FONT size=3><FONT face=Verdana
size=2></FONT> </DIV>
<DIV><BR>***<BR><BR>This module extends the twisted.web.spider.SpiderSender
class into WalkerSender<BR>(it also extends htmllib.HTMLParser and
t.w.c.HTTPClientFactory so that<BR>WalkerSender can use
them):<BR><BR>      LinkParser<BR>          -collects
links on a page<BR>          -also collects frame srcs, to crawl over
frames<BR>          -can also collect images and links (css, js...) for
link/img validation purposes<BR><BR>      
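For illustration only, here is a minimal sketch of the kind of link collection described for LinkParser above. It is not the code from walker.tgz: it assumes Python 3 and uses the standard html.parser module in place of htmllib, and the names LinkCollector, collect_links and collect_resources are hypothetical.<BR><BR>

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical stand-in for LinkParser: gathers URLs found in a page."""

    def __init__(self, base_url, collect_resources=False):
        super().__init__()
        self.base_url = base_url
        self.collect_resources = collect_resources
        self.links = []      # <a href> and <frame>/<iframe> src: pages to crawl
        self.resources = []  # <img>/<script> src and <link href>: validation only

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling this.
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag in ("frame", "iframe") and attrs.get("src"):
            self.links.append(urljoin(self.base_url, attrs["src"]))
        elif self.collect_resources:
            if tag in ("img", "script") and attrs.get("src"):
                self.resources.append(urljoin(self.base_url, attrs["src"]))
            elif tag == "link" and attrs.get("href"):
                self.resources.append(urljoin(self.base_url, attrs["href"]))


def collect_links(html, base_url, collect_resources=False):
    """Return (links, resources) as absolute URLs, in document order."""
    parser = LinkCollector(base_url, collect_resources)
    parser.feed(html)
    parser.close()
    return parser.links, parser.resources
```

Keeping crawlable links and validation-only resources in separate lists is one possible design; it lets a crawler follow only the first list.<BR><BR>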
HTTPCollector<BR>          -Doesn't store the content of a page to a file but to a
variable<BR>          -Can easily be set to use different link-parsers or
page-downloaders (*)<BR>          -Returns *self* to the callback, so
that links, content, or anything else it collects can be
retrieved<BR>            by the callback method
(**)<BR><BR>      WalkerSender<BR>          -Can be easily
set to different http-collectors/downloaders (***)<BR>          -Uses a
dictionary instead of a list for the queue now (explained below)<BR>          
-Has 4 more plugins/events; some existing ones were changed to be more
powerful<BR>          -Plugin to filter links however you
want (extensions, domains...)<BR>          -Plugin to fill in with some
algorithm to prevent looping<BR>          -Plugin to notify that a
download failed<BR>          -Plugin to tell that all links found while
crawling were crawled and there is nowhere else to go<BR>           
(likely/hopefully to occur when doing a one-site/domain crawler as I was, and
when a timeout happens or something else goes wrong)<BR>          -Plugin
notifyDownloadEnd has an additional argument, downloader, which holds anything you
prepare in the downloader class (****)<BR><BR>      Some smaller things
were made to get it working:<BR>          -prevents starting a
downloader on a page/url that is already downloading<BR>          
-queue is now a dictionary, so it can't hold the same page more than once<BR>(the
depth of the first occurrence is stored)<BR>          -removes fragments
from urls (</FONT></FONT><A href="http://www.a.org/index.html#fragment"><FONT
face="Times New Roman" size=3>www.a.org/index.html#fragment</FONT></A><FONT
face="Times New Roman" size=3>), so urls that differ only in the fragment become the
same page and the duplicates are then filtered<BR>          -it doesn't remove ?queries, as they often mean new
content<BR><BR>      I made OneSiteWalkerSender as an example. I
intend to make a one-site search engine (probably with pyndex) and
an anchor/link/img validating script.
OneSiteWalkerSender has just now crawled over 525 pages of </FONT><A
href="http://www.google.com"><FONT face="Times New Roman"
size=3>www.google.com</FONT></A><FONT face="Times New Roman"
size=3>.<BR> I also tested it with other
sites.<BR><BR>      NOTE: Please don't shoot me if I did
something very stupid. I am very new to Python and
Twisted and don't yet understand many important issues in either
of them. Where I marked the description above with *, I am a
little unsure about my way of doing
it.<BR><BR>***</FONT><BR></DIV></FONT></BODY></HTML>