| 1 | | Hello, |
| 2 | | |
| 3 | | Sorry for my bad English. I hope someone understands what I mean. |
| 4 | | |
| 5 | | Short description: there seems to be a bug in twisted.web.client. |
| 6 | | If I have a URL list with many bad URLs, I get a lot (50 - 100%) of strange |
| 7 | | user timeout exceptions, but only if I have too many parallel requests |
| 8 | | (more than 10). |
| 9 | | |
| 10 | | For example: |
| 11 | | |
| 12 | | I have a list of 20,000 URLs (with many bad URLs in it). |
| 13 | | I have a script that always keeps 10 requests running in parallel -> 2.65 % of the |
| 14 | | requests get the timeout exception. |
| 15 | | I start the same script with the same URL list and 50 parallel requests |
| 16 | | -> 5.88 % of the requests get the timeout exception. |
| 17 | | I start the same script with the same URL list and 100 parallel requests |
| 18 | | -> 64.1 % of the requests get the timeout exception. |
| 19 | | 250 parallel requests -> 99.38 % of the requests get the timeout exception. |
| 20 | | |
| 21 | | But if I use the script with a list of 20,000 good URLs, I get this result: |
| 22 | | |
| 23 | | 100 parallel requests -> 0.34 % of the requests get the timeout exception. |
| 24 | | 250 parallel requests -> 0.48 % of the requests get the timeout exception. |
| 25 | | |
| 26 | | |
| 27 | | Therefore, I think there must be a bug. |
| 28 | | |
| 29 | | Why are there almost 100 % timeouts with 250 parallel requests, but only |
| 30 | | 2.65 % with 10 parallel requests on the bad URL list? |
| 31 | | |
| 32 | | |
| 33 | | The timeout exception is: |
| 34 | | |
| 35 | | Traceback (most recent call last): |
| 36 | | Failure: twisted.internet.error.TimeoutError: User timeout caused connection failure. |
| | 1 | `reactor.connectTCP` imposes a timeout - 30 seconds by default. `twisted.web.client.getPage` offers no way for application code to specify this value. |
| 38 | | --------------------- |
| 39 | | |
| 40 | | |
| 41 | | I will attach my test script, a spreadsheet with a lot of statistical data, and |
| 42 | | other things (so you can take a better look at the bug) to this bug report: |
| 43 | | |
| 44 | | Content of the zip file: |
| 45 | | |
| 46 | | ------------------ |
| 47 | | The URL lists: |
| 48 | | ------------------ |
| 49 | | |
| 50 | | * the good URLs: miscurls.py |
| 51 | | * the bad URLs: webproxys.py |
| 52 | | |
| 53 | | |
| 54 | | -------------------------------- |
| 55 | | The script (WebProxyChecker.py): |
| 56 | | -------------------------------- |
| 57 | | |
| 58 | | Input (via the command line): |
| 59 | | |
| 60 | | * first parameter: number of parallel requests |
| 61 | | * second parameter: the URL list "mode": |
| 62 | | * "miscurls" for the good URL list (20,000 URLs) |
| 63 | | * "webproxy" for the bad URL list (20,000 URLs) |
| 64 | | * third parameter: number of requests |
| 65 | | * (the script selects the URLs from the URL list with random.sample()) |
| 66 | | |
| 67 | | example: python WebProxyChecker.py 20000 webproxy 50 |
| 68 | | |
| 69 | | The script prints statistics and all exceptions to stdout. |
| 70 | | |
| 71 | | --------------- |
| 72 | | bash_script.sh |
| 73 | | --------------- |
| 74 | | Calls the test script with many different parameters. |
| 75 | | (Takes about 30 hours on my machine, with 526,100 requests.) |
| 76 | | |
| 77 | | usage example: bash bash_script.sh | tee WebProxyCheckStatRaw.nfo |
| 78 | | |
| 79 | | ---------- |
| 80 | | parser.py |
| 81 | | ---------- |
| 82 | | |
| 83 | | Parser for the output of the bash script, which generates |
| 84 | | a) a shelve db for easy use of the data |
| 85 | | (incl. all received exceptions) and |
| 86 | | b) a spreadsheet for a better overview. |
| 87 | | |
| 88 | | ---------------- |
| 89 | | The spreadsheet |
| 90 | | ---------------- |
| 91 | | |
| 92 | | 1. column: number of parallel requests (first command line parameter of WebProxyChecker.py) |
| 93 | | 2. column: URL list (second command line parameter of WebProxyChecker.py) |
| 94 | | 3. column: number of requests (third command line parameter of WebProxyChecker.py) |
| 95 | | 4. column: number of successful requests |
| 96 | | 5. column: errors (number of received exceptions) |
| 97 | | 6. column: user timeouts (number of the strange user timeouts) |
| 98 | | 7. column: % requests with user timeouts -> column_6 / (column_3 / 100) |
| 99 | | 8. column: duration of all requests (in seconds) |
| 100 | | 9. column: duration per requests (in seconds) |
| 101 | | 10. column: the key in the shelve db for this record (for example, to look at the exceptions) |
| 102 | | |
| 103 | | |
| 104 | | --------------- |
| 105 | | The shelve db |
| 106 | | --------------- |
| 107 | | |
| 108 | | db key = the command on the command line (10. column in the spreadsheet) |
| 109 | | for example: python WebProxyChecker.py 20000 webproxy 50 |
| 110 | | |
| 111 | | {{{ |
| 112 | | >>> import shelve |
| 113 | | >>> db = shelve.open('result_db.slv') |
| 114 | | >>> record = [r for r in db.values()][0] |
| 115 | | >>> record['parallel_requests'] # 1. spreadsheet column |
| 116 | | '250' |
| 117 | | >>> record['URL_mode'] # 2. spreadsheet column |
| 118 | | 'miscurls' |
| 119 | | >>> record['checked_URLs'] # 3. spreadsheet column |
| 120 | | '100' |
| 121 | | >>> record['successful_requests'] # 4. spreadsheet column |
| 122 | | '99' |
| 123 | | >>> record['errors'] # 5. spreadsheet column |
| 124 | | '1' |
| 125 | | >>> record['seconds_altogether'] # 8. spreadsheet column |
| 126 | | '10.0048542023' |
| 127 | | >>> record['seconds_per_url'] # 9. spreadsheet column |
| 128 | | '0.100048542023' |
| 129 | | >>> exceptions = record['exceptions'] # all exceptions |
| 130 | | >>> for exception, data in exceptions.iteritems(): |
| 131 | | ... print exception # the original exception string |
| 132 | | ... data['cnt'] # occurrence of this exception |
| 133 | | ... data['urls'] # a set of the urls that caused this exception |
| 134 | | ... data['vars'] # a dict of lists, with removed data (all strings between '', "" and <>) |
| 135 | | ... break |
| 136 | | ... |
| 137 | | Traceback (most recent call last): |
| 138 | | Failure: twisted.web.error.Error: 400 Bad Request |
| 139 | | 1 |
| 140 | | set(['http://www.cruise.ch']) |
| 141 | | {} |
| 142 | | >>> |
| 143 | | }}} |
| 144 | | |
| 145 | | ---------- |
| 146 | | The rest: |
| 147 | | ---------- |
| 148 | | |
| 149 | | * WebProxyCheckStatRaw.nfo: output from the bash script. |
| 150 | | Not essential; all of the data is in the shelve db or in the |
| 151 | | spreadsheet for simple usage. |
| 152 | | * this text |
| 153 | | |
| 154 | | |
| 155 | | ------------------------------------------------------- |
| 156 | | |
| 157 | | My System: |
| 158 | | |
| 159 | | * Ubuntu 8.04 |
| 160 | | * Python 2.5.2 |
| 161 | | * Twisted 8.1.0 |
| 162 | | |
| 163 | | Do you need additional information? |
| 164 | | |
| 165 | | I have been searching for this bug for over a week, but I don't know |
| 166 | | Twisted's internals and can't find it. |
| 167 | | |
| 168 | | Sorry for my terrible, poor English. |
| 169 | | |
| | 3 | There should be a way to control the value of the connection timeout for HTTP clients. |