<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

<HTML><HEAD>

<META http-equiv=Content-Type content="text/html; charset=windows-1250">

<META content="MSHTML 5.50.4522.1800" name=GENERATOR>

<STYLE></STYLE>

</HEAD>

<BODY bgColor=#ffffff>

<DIV><FONT face=Arial>Hi,</FONT></DIV>

<DIV><FONT face=Arial></FONT>&nbsp;</DIV>

<DIV><FONT face=Arial>I am sending this as suggested by someone on 
#twisted (I forgot who). </FONT></DIV>

<DIV><FONT face=Arial></FONT>&nbsp;</DIV>

<DIV><FONT face=Arial>I extended t.w.spider. It isn't finished, but it seems 
to work. </FONT></DIV>

<DIV><FONT face=Arial>You can download it, run it, and watch it crawl over a given 
domain. It prints out</FONT></DIV>
<DIV><FONT face=Arial>quite a lot of information while working, so you know what is going 
on (mainly for testing purposes).<BR></FONT></DIV>

<DIV><FONT face=Arial>I am just hoping for a quick scan through the code, and a test run, by 
someone more experienced in this, so that I will know whether I did it the Python/Twisted/sane 
way before I go further and connect it to various other things (also read the NOTE at the 
end of this email).</FONT></DIV><FONT face=Verdana size=2>

<DIV><BR><FONT face=Arial size=3>You can get this at </FONT><A 

href="http://www.mind-nest.com/downloads/walker.tgz"><FONT face=Arial 

size=3>http://www.mind-nest.com/downloads/walker.tgz</FONT></A><FONT face=Arial 

size=3> or </FONT></DIV>

<DIV><A href="http://www.mind-nest.com/downloads/walker.zip"><FONT face=Arial 

size=3>http://www.mind-nest.com/downloads/walker.zip</FONT></A><BR><BR><FONT 

face="Times New Roman"><FONT size=3><FONT face=Arial>When it is finished I plan to give 
it back to Twisted, if they will accept 
it, of course.</FONT><BR></DIV></FONT></FONT>

<DIV><FONT face=Arial>Note for the mailing list moderator: I sent this mail about a week ago, when I 
was not yet a member, and got a response saying that you must first approve 
it. As that didn't happen within a week and I am a member now, I am sending it again as a 
member.</FONT></DIV>

<DIV><FONT face="Times New Roman"><FONT size=3><FONT face=Verdana 

size=2></FONT>&nbsp;</DIV></FONT></FONT>

<DIV><FONT face=Arial size=3>Regards,</FONT></DIV>

<DIV><FONT face=Arial size=3>:janko</FONT></DIV>

<DIV><FONT face=Arial size=3><A 

href="mailto:janko@mind-nest.com">janko@mind-nest.com</A></FONT></DIV>

<DIV><FONT face=Arial size=3><A 

href="http://www.mind-nest.com">www.mind-nest.com</A></FONT></DIV>

<DIV><FONT face="Times New Roman"><FONT size=3><FONT face=Verdana 

size=2></FONT>&nbsp;</DIV>

<DIV><BR>***<BR><BR>This module extends twisted.web.spider.SpiderSender 

class as WalkerSender (and also extends htmllib.HTMLParser and 
t.w.c.HTTPClientFactory so that WalkerSender can use 
them):<BR><BR>&nbsp;&nbsp;&nbsp; LinkParser<BR>&nbsp;&nbsp;&nbsp; -collects 
links on a page<BR>&nbsp;&nbsp;&nbsp; -also collects frame srcs, so it can crawl over 
frames<BR>&nbsp;&nbsp;&nbsp; -can also collect images and links (css, js..) for 
link/img validation purposes<BR><BR>&nbsp;&nbsp;&nbsp; 
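A rough sketch of what such a link parser can look like. This is not the LinkParser from the archive: it assumes Python 3's html.parser.HTMLParser in place of htmllib.HTMLParser, and the names LinkCollector, collect_assets, links and assets are made up for the example.<BR>
<PRE>
```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical stand-in for LinkParser (names are assumptions)."""

    def __init__(self, base_url, collect_assets=False):
        super().__init__()
        self.base_url = base_url              # used to resolve relative urls
        self.collect_assets = collect_assets  # also gather img/css/js urls
        self.links = []                       # anchor and frame targets to crawl
        self.assets = []                      # resources kept only for validation

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names lowercased and attrs as (name, value) pairs.
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag in ("frame", "iframe") and attrs.get("src"):
            # Frame sources are followed like ordinary links.
            self.links.append(urljoin(self.base_url, attrs["src"]))
        elif self.collect_assets:
            if tag in ("img", "script") and attrs.get("src"):
                self.assets.append(urljoin(self.base_url, attrs["src"]))
            elif tag == "link" and attrs.get("href"):
                self.assets.append(urljoin(self.base_url, attrs["href"]))
```
</PRE>
After feed() is called with a page's markup, links holds the absolute urls to crawl next, and assets holds the image/css/js urls if collect_assets was set.<BR><BR>&nbsp;&nbsp;&nbsp; 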

HTTPCollector<BR>&nbsp;&nbsp;&nbsp; -Doesn't store the content of a page in a file but in a 
variable<BR>&nbsp;&nbsp;&nbsp; -Can easily be set to use different link-parsers or 
page-downloaders(*)..<BR>&nbsp;&nbsp;&nbsp; -Returns *self* to the callback, so 
that links, content, or anything else it collects can be 
retrieved by the callback method. 
(**)<BR><BR>&nbsp;&nbsp;&nbsp; WalkerSender<BR>&nbsp;&nbsp;&nbsp; -Can be easily 

set to use different http-collectors/downloaders(***)..<BR>&nbsp;&nbsp;&nbsp; -Uses a 
dictionary instead of a list for the queue now.. explained below<BR>&nbsp;&nbsp;&nbsp; 
-Has 4 more plugins/events; some existing ones were changed to be more 
powerful<BR>&nbsp;&nbsp;&nbsp; -Plugin to filter links by whatever you 
want (extensions, domains...)<BR>&nbsp;&nbsp;&nbsp; -Plugin to fill in with some 
algorithm that prevents looping<BR>&nbsp;&nbsp;&nbsp; -Plugin to notify that a 
download failed<BR>&nbsp;&nbsp;&nbsp; -Plugin to tell that all links found while 
crawling have been crawled and there is nowhere else to go<BR>&nbsp;&nbsp;&nbsp; 
(Likely, hopefully, to occur if you are doing a one-site/domain crawler as I was, and 
when a timeout or some other failure happens)<BR>&nbsp;&nbsp;&nbsp; -Plugin 
notifyDownloadEnd has an additional argument, downloader, which holds anything you 

prepare in the downloader class(****)<BR><BR>&nbsp;&nbsp;&nbsp; Some smaller things 
were done to get it working:<BR>&nbsp;&nbsp;&nbsp; -it avoids starting a 
downloader on a page/url that is already downloading<BR>&nbsp;&nbsp;&nbsp; 
-the queue is now a dictionary, so it can't hold the same page more than once (the 
depth of the first occurrence is stored)<BR>&nbsp;&nbsp;&nbsp; -it removes fragments 
from urls (</FONT></FONT><A href="http://www.a.org/index.html#fragment"><FONT 
face="Times New Roman" size=3>www.a.org/index.html#fragment</FONT></A><FONT 
face="Times New Roman" size=3>), so urls that differ only in the fragment become the same page and the duplicates are then 
filtered out<BR>&nbsp;&nbsp;&nbsp; -it doesn't remove ?queries, as they often mean new 
content..<BR><BR>&nbsp;&nbsp;&nbsp; I made OneSiteWalkerSender as an example.. I 

intend to make a one-site search engine (with pyndex probably) and 
an anchor/link/img... validating script. 
OneSiteWalkerSender has just now crawled over 525 pages of </FONT><A 

href="http://www.google.com"><FONT face="Times New Roman" 

size=3>www.google.com</FONT></A><FONT face="Times New Roman" 

size=3>.<BR>&nbsp;&nbsp;&nbsp; I also tested it with other 
sites.<BR><BR>&nbsp;&nbsp;&nbsp; NOTE: Please don't shoot me if I did 
something very stupid. I am very new to Python and 
Twisted and don't understand many important aspects of either 
of them. Where I marked something with * in the description above, I am a 
little suspicious of my way of doing 
it.<BR><BR>***</FONT><BR></DIV></FONT></BODY></HTML>