Opened 3 years ago

Last modified 16 months ago

#5562 defect new

iocp reactor fails to support PB & AMP

Reported by: dynamicgl Owned by:
Priority: normal Milestone:
Component: core Keywords: iocp pb amp
Cc: ivank-twisted-bugs@… Branch: branches/iocp-5562-2
(diff, github, buildbot, log)
Author: itamarst, tomprince Launchpad Bug:

Description

When running the attached files, I noticed weird exceptions like Connection Lost raised.

If I used a bigger value for reactor.callLater on server side, It ran fine without any problem. I ran a similar test with AMP and got the same problem with different errors.

Because these errors didnt' exist when I used select reactor so I guess the current implementation of iocp reactor might have some problems about sending chunks data continuously.

OS: Win 7 64 bit
Python version: 2.6.6 (32 bit)
Twisted version: 12.0 & 11.1

Attachments (10)

pb_bugs_server.py (480 bytes) - added by dynamicgl 3 years ago.
pb_bugs_client.py (860 bytes) - added by dynamicgl 3 years ago.
ticket5562_abstract.py.patch (1.1 KB) - added by dynamicgl 2 years ago.
5562_test2.py (3.4 KB) - added by dynamicgl 2 years ago.
bugs_5562_server.py (531 bytes) - added by dynamicgl 2 years ago.
#5562_abstract_2.py.patch (1.3 KB) - added by dynamicgl 2 years ago.
5562_test3.py (3.4 KB) - added by dynamicgl 2 years ago.
test_tcp.py (65.8 KB) - added by BrianMatthews 22 months ago.
abstract.diff (4.2 KB) - added by BrianMatthews 22 months ago.
abstract.diff2 (4.2 KB) - added by BrianMatthews 22 months ago.

Download all attachments as: .zip

Change History (53)

Changed 3 years ago by dynamicgl

Changed 3 years ago by dynamicgl

comment:1 Changed 3 years ago by thijs

  • Milestone regular-releases deleted

comment:2 Changed 3 years ago by ivank

  • Cc ivank-twisted-bugs@… added
  • Priority changed from high to highest

I have observed the IOCP reactor send data completely out of order. I was using twisted.web, and saw things like:

  1. HTTP responses never closing
  1. Truncated HTTP responses
  1. The most horrifying: all HTTP chunk lengths arriving first, then the chunk contents

I was using Twisted trunk (2012-03-03), Python 2.7.1 32-bit, Server 2008 R2 SP1 64-bit.

I tried the IOCP reactor before iocp-64bit-5373-2 and before iocp-win64-4669, but it exhibited the same problems. I have no problems with the select reactor.

Sorry for the lack of test case. I am fairly sure this problem could be reproduced by doing request.write(small_piece) 100,000 times.

comment:3 Changed 3 years ago by itamar

  • Owner set to PenguinOfDoom

Looking at the IOCP code, it looks like writes can be initiated because a previous write finished, or by startWriting being called after a short amount of time since the last startWriting. So that may be the issue - two or more sequences of writes happening, instead of just one.

Pavel, could you take a look?

comment:4 Changed 3 years ago by dynamicgl

I have read some parts of codes for iocp support and ran some trials. It seems once the buffer is full (SEND_LIMIT is 128 K bytes), an ERROR_IO_PENDING will be returned and then

exceptions raise on client side.

two possible ways to work around:

1 change SEND_LIMIT to other values like 256K.
2 don't send too much data in a very short period.

comment:5 Changed 2 years ago by itamarst

  • Author set to itamarst
  • Branch set to branches/iocp-5562

(In [34396]) Branching to 'iocp-5562'

comment:6 Changed 2 years ago by itamar

Poking around I discovered that the problem is indeed reproduceable with the example script; something todo with abstract.py or tcp.py, but I couldn't get any further. Handing back to Pavel.

Changed 2 years ago by dynamicgl

comment:7 Changed 2 years ago by dynamicgl

  • Keywords review added

Hi All

I attached a patch for this issue. I think the first two files can be used for testing purpose. Please feel free to have a look.

Thanks.

comment:8 follow-up: Changed 2 years ago by itamar

  • Keywords review removed
  • Owner changed from PenguinOfDoom to dynamicgl

First, thanks for working on this; I tried to, and failed, and this is a serious bug, so it'd be really great to get the fix in.

However, the demonstration files are far too complex to serve as unit tests. Consider what happens if someone changes the implementation of PB for some reason, and coincidentally it no longer hits the edge case that causes the problem. Suddenly the test is not testing what you think it's testing. Ideally you'd just have a simple Protocol that did a few writes to demonstrate the problem; try writing a prose explanation of the problem and then translating it into the least code possible. twisted.internet.test.test_tcp has a bunch of tests that might serve as a template, and demonstrates the ReactorBuilder infrastructure you'd want to use.

comment:9 Changed 2 years ago by dynamicgl

  • Owner changed from dynamicgl to itamar

comment:10 in reply to: ↑ 8 Changed 2 years ago by dynamicgl

Replying to itamar:

First, thanks for working on this; I tried to, and failed, and this is a serious bug, so it'd be really great to get the fix in.

However, the demonstration files are far too complex to serve as unit tests. Consider what happens if someone changes the implementation of PB for some reason, and coincidentally it no longer hits the edge case that causes the problem. Suddenly the test is not testing what you think it's testing. Ideally you'd just have a simple Protocol that did a few writes to demonstrate the problem; try writing a prose explanation of the problem and then translating it into the least code possible. twisted.internet.test.test_tcp has a bunch of tests that might serve as a template, and demonstrates the ReactorBuilder infrastructure you'd want to use.

Sorry about that, I didn't see your action before I typed. Ok, I will try to write a formal test case for that.

I think this issue can be reproduced by sending some small chunks data continuously just like ivank said, however I didn't make such that case before. Anyway I try to make a simple test case.

Changed 2 years ago by dynamicgl

Changed 2 years ago by dynamicgl

Changed 2 years ago by dynamicgl

comment:11 Changed 2 years ago by dynamicgl

  • Keywords review added

Hi All

I attach a test case for this bug. A server script is also attached. In fact this bug can't be reproduced (at least on my machine) when both of server & client running in the same process.

please feel free to check it.

Thanks.

comment:12 Changed 2 years ago by dynamicgl

I also attach a patch. The previous patch contains a bug which may lead to infinitely schedule the doWrite method.

please ignore the first patch.

Thanks.

comment:13 Changed 2 years ago by itamar

  • Keywords review removed
  • Owner changed from itamar to dynamicgl

Thanks for putting all this work into fixing this, and for writing the unit test.

However, if we can't run the test within the same process, we can't automate it very easily, and the idea is that unit tests run automatically (http://buildbot.twistedmatrix.com/boxes-supported). Also, your test seems a bit overcomplicated. In your email posting you (I assume it was you?) described the bug as follows:

The problem is doWrite method might be triggered twice instead of once; in particular, when trying to send a large chunks data whose size is bigger then SEND_LIMIT, it will always happen. When doWrite being invoked twice, it means the same buffer data will be sent twice.

Doing a write bigger than SEND_LIMIT seems like it would be simpler to do in a test than what you're doing now. Could you see if you can reproduce the bug that way in a unit test?

Changed 2 years ago by dynamicgl

comment:14 Changed 2 years ago by dynamicgl

Hi Itammar

I attached a modified test case for that. Actually this bug can be reproduced by using a normal test case as long as you run that several times. Failure is not deterministic so it is why I said it couldn't be reproduced at first.

In this test case, the client initiates sending a large chunks data and then collect all data from server. The server simply send back all data got from client. Eventually the client should get exactly what it sent, the same data and the same size.


Many parts of this test case are copied from test_tcp.py, so it looks like messy, sorry about that.


comment:15 Changed 2 years ago by exarkun

  • Keywords changed from iocp, pb, amp to iocp pb amp

I applied a modified version of the test to the branch. Unfortunately I wasn't able to get it to fail (partly due to a problem with one of our development tools). If you can see what's wrong with my version of the test, that'd be quite helpful - http://twistedmatrix.com/trac/browser/branches/iocp-5562/twisted/internet/test/test_tcp.py#L364

comment:16 Changed 2 years ago by dynamicgl

Hi exarkun

You didn't do anything wrong. This bug is not always reproduced when both server and client running at the same process. You can reproduced it by running the test case several times? How many times? I don't know. Sometimes It failed on the second trial but sometimes after 50 trials.

I modified the original scripts and let server & client run in the same process. Guess what? the bug disappeared !!! I ran the script ten times and got a flawless result. If I ran in two separate process, the bug raised at the first round.

I am not an expert on network programming especially on how windows playing socket under the hood.


Do you have any idea what is different between socket communications in process and socket communications interprocess on windows? If I do know that detail, maybe I can try to simulate it.

Thanks

comment:17 Changed 22 months ago by BrianMatthews

I tried the fix attached by dynamicgl. It fixed the problem, but it has the side effect of creating 10's, and possible 100's of unnecessary callLater's.
The problem is the databuffer and offset are being modified by doWrite when a write is in progress. The resultant modification by the callback method _doWrite() is using the wrong offset.

comment:18 Changed 22 months ago by BrianMatthews

Sorry, I meant the callback _cbWrite(), not _doWrite()

comment:19 Changed 22 months ago by exarkun

Thanks for looking into this Brian. fwiw, if someone does the hard part of creating a reliably failing unit test, I'd be happy to look into how to implement a solution that doesn't involve the extra delayed calls (or investigate sufficiently to demonstrate that the extra calls aren't worth worrying about).

comment:20 Changed 22 months ago by BrianMatthews

Ok I will try to do that some time today. I have another ticket open which is connected to this, #6217. The fix I did for that seems to make this problem worse.

comment:21 follow-up: Changed 22 months ago by BrianMatthews

Well that was easy! If you apply my fix from #6217, then your unit test 5562_test3.py fails immediately. Your fix does fix it, but with lots of extra calls. As I suspected, the #6217 bug was helping to mask this bug.

comment:22 Changed 22 months ago by BrianMatthews

BTW the number of extra callLaters in 5562_test3.py after applying your fix was 4005.

comment:23 Changed 22 months ago by exarkun

Well that was easy! If you apply my fix from #6217, then your unit test 5562_test3.py fails immediately. Your fix does fix it, but with lots of extra calls. As I suspected, the #6217 bug was helping to mask this bug.

Hmm. My problem with 5562_test3.py is that I could not get it to fail deterministically when run against current trunk. I checked my attempt at this into the branch identified in the ticket description (iocp-5562). Did you see the test fail deterministically against trunk?

comment:24 Changed 22 months ago by BrianMatthews

OK, well the way to do it is overwhelm it with data.
I have attached a version of your test_tcp.py that for me fails every time on the trunk code.

Changed 22 months ago by BrianMatthews

comment:25 Changed 22 months ago by exarkun

Thanks! I'll give that a shot.

For future reference, a patch file is generally preferred. These are easily generated with the "svn diff" command once you have edited a file in an svn checkout (if you're using one of the other vcs mirrors, they all have similar commands too).

comment:26 in reply to: ↑ 21 Changed 22 months ago by dynamicgl

Replying to BrianMatthews:

Well that was easy! If you apply my fix from #6217, then your unit test 5562_test3.py fails immediately. Your fix does fix it, but with lots of extra calls. As I suspected, the #6217 bug was helping to mask this bug.

Hi Brian

I am delighted you have some thoughts about that. I am not an expert on windows network programming and IOCP as well. The patch I provided can be considered as a work around and far away from an optimistic solution.

Yes, it has a side effect about unnecessary scheduling. As long as there is a way that doWrite & cbWrite are invoked in order, we can remove these scheduling safely.

Any idea?

comment:27 Changed 22 months ago by BrianMatthews

exarkun is looking at it now, but one thing I did try was replacing

if self._doWriteCalling:

self.reactor.callLater(0, self._resumeWriting)

with

if self._doWriteCalling:

return

which seemed to work OK. _cbWrite is going to call doWrite anyway...

comment:28 Changed 22 months ago by BrianMatthews

I also notice that the WSASend is returning the number of bytes written if it returns successfull immediately. It is my understanding that the I/O event will still occur even it it completes immediately. A 0 error return should be processed exactly the same as a WSA_IO_PENDING return, and the number of bytes returned when the error is 0 should be ignored.

comment:29 Changed 22 months ago by BrianMatthews

Ignore previous comment. Looking closer, it does indeed ignore the immediate return.

comment:30 Changed 22 months ago by BrianMatthews

My suggested fix, unified diff attached as abstract.diff

Changed 22 months ago by BrianMatthews

comment:31 Changed 22 months ago by BrianMatthews

oops, typo in diff. Use abstract.diff2

Changed 22 months ago by BrianMatthews

comment:32 follow-up: Changed 22 months ago by BrianMatthews

readBufferSize is set to 4096. This makes the test very slow (40+ seconds)
Setting it to 65K (2222) as in other reactors makes it much faster, maybe 5-10 times.
Is there any reason why this cant be used in iocpreactor?

comment:33 in reply to: ↑ 32 Changed 21 months ago by dynamicgl

Replying to BrianMatthews:

readBufferSize is set to 4096. This makes the test very slow (40+ seconds)
Setting it to 65K (2222) as in other reactors makes it much faster, maybe 5-10 times.
Is there any reason why this cant be used in iocpreactor?

Hi Brian

I also noticed this problem a few months ago. It looks like there is no harm to set a bigger value. Maybe you can fire a ticket for this issue.

comment:34 Changed 18 months ago by itamar

  • Keywords review added

comment:35 Changed 17 months ago by therve

  • Owner dynamicgl deleted
  • Priority changed from highest to normal

comment:36 Changed 16 months ago by tomprince

  • Author changed from itamarst to itamarst, tomprince
  • Branch changed from branches/iocp-5562 to branches/iocp-5562-2

(In [38924]) Branching to iocp-5562-2.

comment:37 Changed 16 months ago by tom.prince

  • Keywords review removed
  • Owner set to BrianMatthews
  1. Even with the changes in the branch, the test isn't failing deterministicly for me (see here. So I can't tell if this patch address the issue.
  2. In general this patch looks somewhat sane, but I am not familiar with iocp reactor.
  3. The one introduced docstring in abstract.diff2 doesn't follow the coding-style guide. In fact, it would be nice to see docstrings on all the touched methods; the code here is somewhat involved, and it would be easier for a reviewer, if there was some explanation of whatt all the involved methods were intended to do.

I haven't applied abstract.diff2, since I'd like to see the test failing consitently before applying the fix. (One minor comment about the format, it would be nice if patches were suitable to pass to patch -p0, i.e. the paths relative to the root of the twisted repo).

comment:38 follow-up: Changed 16 months ago by BrianMatthews

  • Owner changed from BrianMatthews to tom.prince

No, it doesn't fail deterministically. Neither does an application that uses it. I produced a test that does fail on my machine deterministically, but I cannot guarantee it for other machines. I am not a regular Python/Twisted contributor, I just use it in our application. I had serious issues with the select reactor (and its loopback socket) so moved to iocp. Which failed frequently. With this fix, our application does not fail. So I would suggest you apply the fix anyway, it is significantly better than what you have right now, in that it actually works. I cannot justify any more time for this, as we do not currently have a problem.
My apologies if the various syntax and formats do not fit with your culture.

comment:39 Changed 16 months ago by exarkun

Thank you very much for your work on this issue so far BrianMatthews. The patches are appreciated!

comment:40 in reply to: ↑ 38 Changed 16 months ago by dynamicgl

Replying to BrianMatthews:

No, it doesn't fail deterministically. Neither does an application that uses it. I produced a test that does fail on my machine deterministically, but I cannot guarantee it for other machines. I am not a regular Python/Twisted contributor, I just use it in our application. I had serious issues with the select reactor (and its loopback socket) so moved to iocp. Which failed frequently. With this fix, our application does not fail. So I would suggest you apply the fix anyway, it is significantly better than what you have right now, in that it actually works. I cannot justify any more time for this, as we do not currently have a problem.
My apologies if the various syntax and formats do not fit with your culture.

These test cases fail deterministically only with inter-process. so if build bot is unable to simulate such environment, it won't be easy to be reproduced.

comment:41 follow-up: Changed 16 months ago by exarkun

These test cases fail deterministically only with inter-process. so if build bot is unable to simulate such environment, it won't be easy to be reproduced.

Nothing stops a unit test from launching a child process except for a desire to keep tests as simple as possible. If it is possible to reproduce the problem with a single process, that is preferable. So far it sounds like no one is confident enough that they understand the issue to declare either "Multiple processes are required to reproduce this problem!" or "Here is a single-process test which reproduces the problem."

comment:42 in reply to: ↑ 41 Changed 16 months ago by dynamicgl

Replying to exarkun:

These test cases fail deterministically only with inter-process. so if build bot is unable to simulate such environment, it won't be easy to be reproduced.

Nothing stops a unit test from launching a child process except for a desire to keep tests as simple as possible. If it is possible to reproduce the problem with a single process, that is preferable. So far it sounds like no one is confident enough that they understand the issue to declare either "Multiple processes are required to reproduce this problem!" or "Here is a single-process test which reproduces the problem."

Hi.

My opinion is multiple process environment is required. let's reproduce it first.

comment:43 Changed 16 months ago by tom.prince

  • Owner tom.prince deleted
Note: See TracTickets for help on using tickets.