[Twisted-Python] Application Design help - Concurrent but not Protocols based.

Wed Jun 3 14:55:47 EDT 2009

On Wed, 3 Jun 2009, Senthil Kumaran wrote:

> Hello Twisted Developers/Users,
> 
> This is my first concurrent application design and my first trial with
> twisted. I have read the documentation and understand where twisted
> plays its part. Unfortunately, I could not directly relate it to my
> requirements and hence, could not go forward with designing and
> building on top using the examples as a reference.
> 
> I need your guidance in helping me design an application.
> 
> My Application Details:
> 
> 1) I need to constantly monitor a particular directory for new files.
> 2) Whenever a new file is dropped, I read that file and get
> information on where to collect data from: a) another machine, b) a
> second machine using a different method, c) a database.
> 3) I collect data from those machines and store it.
> 
> The data is huge and I need the three processes a, b, c to be
> non-blocking, and I can just do a function call like do_a(), do_b(),
> do_c() to perform them.
> 
> For 1) to constantly monitor a particular directory for new files, I
> am doing something like this:
> 
> while True:
>         check_for_new_files()
> 

This is not an issue specifically related to Python or Twisted, but
there is a very serious synchronization issue that needs to be
addressed with this application design.  (Trust me, I've seen this
issue come up dozens of times in over 30 years of experience...)

Creating a file and loading it with data is not an atomic operation.
It takes a significant amount of time, and if the process attempting
to read the file is faster than the process writing it, it won't see
all (or any of) the data.

It can work great while testing and then fall over the first time it
is used in production, or it can work fine for years before mysteriously
breaking.
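
As a rough illustration of the race (a sketch only: the file name,
the payload size, and the helper name read_while_writing are made up
for the demonstration), a reader that opens the file while the writer
is still halfway through sees only part of the data:

```python
import os
import tempfile


def read_while_writing(total=1000000):
    """Return (bytes seen by an early reader, final file size)."""
    # Hypothetical file name, chosen only for this demonstration.
    path = os.path.join(tempfile.mkdtemp(), "foo-20090603-001.dat")
    payload = b"x" * total
    half = total // 2

    writer = open(path, "wb")
    try:
        writer.write(payload[:half])
        writer.flush()                  # first half handed to the OS
        # An eager reader notices the new file and reads it right now.
        with open(path, "rb") as reader:
            seen = len(reader.read())
        writer.write(payload[half:])    # the writer finishes afterwards
    finally:
        writer.close()
    return seen, os.path.getsize(path)


if __name__ == "__main__":
    seen, final = read_while_writing()
    print("reader saw %d of %d bytes" % (seen, final))
```

In real use the timing is not under your control, which is why the
failure can stay hidden for a long time.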

There are several ways to cope with this situation:

1) If the system allows you to create temporary invisible files and
then only makes the file visible when it is cleanly closed, you can
use this method.  However, this is often not portable.  Not all operating
systems, languages, FTP or SFTP servers, etc. support such a facility.

2) Create the file using a method that disallows reading of the file
while it is still open by the creator.  Make the reader process wait
until it can get read access to the file before processing it.  (Sometimes
this can be done by making the reader process request exclusive write
access to the file, even though it doesn't intend to write to it.)
This is also not particularly portable, and may require the reading
process to spin or wait-loop, either wasting resources or delaying
processing by half the wait time on average.

3) Create the file in another directory and then move it to the target
directory when it is complete.  The reading process will only see it
after the move is complete.  However, such an operation isn't always
atomic, or even possible.  I think "mv" on most Unix systems is atomic
if both directories are on the same file system, but if the directories
are on different file systems (different disks or partitions, for
example), it copies the file and then deletes the original file.  This
could work fine for years and then break when someone decides to move
directories around for some reason.

4) Create the file with a temporary file name, for example
"foo-YYYYMMDD-SEQ.tmp" and then after it is created and fully
populated, rename it to "foo-YYYYMMDD-SEQ.dat".  Make the reading
process only look for files named "*.dat", ignoring the "*.tmp"
files.  I don't know of any operating system where renaming a
file is not an atomic operation, but I suppose such might exist.
There could conceivably be a small window when the file system
could have created a directory entry for the .dat filename, but
hasn't yet linked the filename to the file.  Though if this is
possible, one could argue that this is an O/S bug and demand the
O/S vendor fix it.  (Or fix it yourself if it's a self-maintained
O/S or file system...)

5) After creating the file, create a flag file (empty or with
minimal, unimportant contents).  For example, if the data file is
named "foo-YYYYMMDD-SEQ#.dat", after creating it, create a flag
file named "foo-YYYYMMDD-SEQ#.flag".  Have the reading process
look only for flag files (they could even be in a separate directory
to avoid clutter).  When a flag file appears, process the
corresponding data file.  This method is very portable and is
*almost* bulletproof.  The exceptions I have seen have almost
all been when someone didn't understand the importance of the
flag file and created it first.  Aside from just doing it in the
wrong order, I've seen two other failures.  In one, they triggered
two parallel processes to create the files, and the flag file,
being much smaller, got created first.  In the other, they created
all the files in a local directory (on another system) and then
FTP'ed them to the target system/directory using a wild-card file
name, which unfortunately caused the flag file to get sent first.
(It may have had an alphabetically earlier name than the data
file, or the FTP client may have transferred files in a random
order or one based on the inode or file ID or other non-obvious
file attribute.)  In these cases, the cure was to explicitly
transfer the data file and then the flag file in the correct
order.

(We once encountered an issue where an FTP client may have been
"optimizing" transfers either by doing them in parallel, or by
sending small files first, and broke this scheme, but that was
only a theory we had while trying to diagnose the problem, and
may not have been what was actually happening.)
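
Going back to scheme 4 (temporary name, then rename), the writer and
reader sides might look roughly like the sketch below.  The function
names publish and ready_files and the file names are hypothetical,
and os.replace assumes Python 3.3 or later.  Because both names live
in the same directory, the rename never turns into a copy-and-delete.

```python
import glob
import os
import tempfile


def publish(directory, basename, data):
    """Write data under a .tmp name, then rename it to .dat."""
    tmp_path = os.path.join(directory, basename + ".tmp")
    dat_path = os.path.join(directory, basename + ".dat")
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())      # push the data to disk before renaming
    # Same directory, hence same file system: a plain rename.
    os.replace(tmp_path, dat_path)
    return dat_path


def ready_files(directory):
    """Return only completed files; *.tmp files are ignored."""
    return sorted(glob.glob(os.path.join(directory, "*.dat")))


if __name__ == "__main__":
    drop_dir = tempfile.mkdtemp()
    # A half-written file that the reader must not pick up.
    with open(os.path.join(drop_dir, "foo-20090603-002.tmp"), "wb") as f:
        f.write(b"partial")
    publish(drop_dir, "foo-20090603-001", b"complete data\n")
    print(ready_files(drop_dir))
```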

I don't know of any scheme that is absolutely foolproof unless
you control both the file creation and file reading sides of
things, but scheme 5 (flag files) seems to work best in practice.
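
A minimal sketch of scheme 5, assuming the data and flag files share
a directory and a base name.  The function names write_with_flag and
collect_ready are hypothetical, and deleting the flag once the data
has been read is just one possible way to avoid processing a file
twice.

```python
import glob
import os
import tempfile


def write_with_flag(directory, basename, data):
    """Writer side: data file first, flag file strictly afterwards."""
    dat_path = os.path.join(directory, basename + ".dat")
    flag_path = os.path.join(directory, basename + ".flag")
    with open(dat_path, "wb") as f:
        f.write(data)
    # Only now, with the data file closed, announce it.
    with open(flag_path, "wb"):
        pass
    return dat_path


def collect_ready(directory):
    """Reader side: look only for flags, then read the matching data."""
    results = {}
    pattern = os.path.join(directory, "*.flag")
    for flag_path in sorted(glob.glob(pattern)):
        dat_path = flag_path[:-len(".flag")] + ".dat"
        with open(dat_path, "rb") as f:
            results[os.path.basename(dat_path)] = f.read()
        os.remove(flag_path)      # consume the flag (an assumption)
    return results


if __name__ == "__main__":
    drop_dir = tempfile.mkdtemp()
    write_with_flag(drop_dir, "foo-20090603-001", b"complete data\n")
    print(collect_ready(drop_dir))
```

A data file without a flag is simply never looked at, so a writer
that is still busy cannot be caught halfway.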

Sorry I can't help with the Python/Twisted specifics, but I'm
too much of a newbie to be very useful with that.

> http://paste.pocoo.org/show/120824/
> 
> My Question: Can this be designed in a way that looking for new files is
> also an asynchronous activity?
> 
> What will be the deferred in this case?
> 
> # my ideas:
> 
> - I might define a deferred as, whenever the contents of the directory
>   is not matching the previous contents, return the new file which was
>   added.
> - I can then add a callback to read the newfile.
> 
> 
> Now, after reading the contents, I will have to do a non-blocking call
> to fetch data, either using fun_a, fun_b or fun_c. How should I
> associate this requirement to deferred/callback pattern?
> 
> Any guidance would be helpful.
> 
> Thanks,
> Senthil
> 
> 

-- 
John Santos
Evans Griffiths & Hart, Inc.
781-861-0670 ext 539