Changes between Version 2 and Version 6 of Ticket #3450

Timestamp:
12/19/2008 09:17:36 AM (5 years ago)
Author:
exarkun
Comment:

Adjusting summary and description to reflect the feature request.

The old description was:

Hello,

Sorry for my bad English. I hope someone understands what I mean.

The short description: there must be a bug in twisted.web.client. If I have a URL list with many bad URLs, I get a lot (50 - 100 %) of strange user timeout exceptions, but only if I have too many parallel requests (more than 10).

For example:

I have a list of 20,000 URLs (with many bad URLs in it) and a script that always keeps 10 parallel requests running -> 2.65 % of the requests get the timeout exception. I run the same script with the same URL list and 50 parallel requests -> 5.88 % of the requests get the timeout exception. With 100 parallel requests -> 64.1 % of the requests get the timeout exception. With 250 parallel requests -> 99.38 % of the requests get the timeout exception.

But if I use the script with a list of 20,000 good URLs, I get this result:

100 parallel requests -> 0.34 % of the requests get the timeout exception. 250 parallel requests -> 0.48 % of the requests get the timeout exception.

Therefore, I think there must be a bug.

Why are there almost 100 % timeouts with 250 parallel requests, but only 2.65 % with 10 parallel requests, on the bad URL list?

The timeout exception is:

    Traceback (most recent call last):
    Failure: twisted.internet.error.TimeoutError: User timeout caused connection failure.

---------------------

I will attach my test script, a spreadsheet with a lot of statistical data, and other things (so you can look at the bug more easily) to this bug report:

Content of the zip file:

------------------ The URL lists: ------------------

  • the good URLs: miscurls.py
  • the bad URLs: webproxys.py

-------------------------------- The script (WebProxyChecker.py): --------------------------------

Input (via command line):

  • first parameter: quantity of parallel requests
  • second parameter: the URL list "mode":
    • "miscurls" for the good URL list (20,000 URLs)
    • "webproxy" for the bad URL list (20,000 URLs)
  • third parameter: quantity of requests
    • (the script selects the URLs from the URL list via random.sample())

example: python WebProxyChecker.py 20000 webproxy 50

The script prints statistics and all exceptions to stdout.
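The attachment itself is not reproduced here, but a minimal sketch of a checker along these lines (my reconstruction, not the attached script; `load_urls` is a hypothetical helper) could look like this:

    import random
    import sys

    from twisted.internet import defer, reactor
    from twisted.web.client import getPage

    def check(urls, parallel):
        # A DeferredSemaphore caps how many getPage calls run at once.
        sem = defer.DeferredSemaphore(parallel)
        requests = [sem.run(getPage, url) for url in urls]
        return defer.DeferredList(requests, consumeErrors=True)

    def report(results):
        ok = len([r for success, r in results if success])
        print '%d of %d requests succeeded' % (ok, len(results))
        reactor.stop()

    def main():
        parallel, mode, count = sys.argv[1:4]
        # load_urls() is hypothetical: it would return the URL list from
        # miscurls.py or webproxys.py, depending on the mode argument.
        urls = random.sample(load_urls(mode), int(count))
        check(urls, int(parallel)).addCallback(report)
        reactor.run()

    if __name__ == '__main__':
        main()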

--------------- bash_script.sh --------------- calls the test script with many different parameter combinations (runs for about 30 hours on my machine, with 526,100 requests in total).

usage example: bash bash_script.sh | tee WebProxyCheckStatRaw.nfo

---------- parser.py ----------

A parser for the output of the bash script, which generates a) a shelve db for easy use of the data (incl. all received exceptions) and b) a spreadsheet for a better overview.
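For illustration, the shelve side of this might look like the following sketch (my assumption: one plain dict per test run, keyed by its command line as in column 10 of the spreadsheet):

    import shelve

    def store_records(records):
        # records maps each command line to the dict parsed for that run.
        db = shelve.open('result_db.slv')
        try:
            for command, record in records.iteritems():
                db[command] = record
        finally:
            db.close()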

---------------- The spreadsheet ----------------

  1. column: number of parallel requests (first command line parameter of WebProxyChecker.py)
  2. column: URL list (second command line parameter of WebProxyChecker.py)
  3. column: number of requests (third command line parameter of WebProxyChecker.py)
  4. column: number of successful requests
  5. column: errors (number of received exceptions)
  6. column: user timeouts (number of the strange user timeouts)
  7. column: % of requests with user timeouts -> column_6 / (column_3 / 100) (see the worked example after this list)
  8. column: duration of all requests (in seconds)
  9. column: duration per request (in seconds)
  10. column: the key in the shelve db for this record (for example, to look at the exceptions)
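As a quick check of the column-7 formula, with hypothetical counts chosen to match the 2.65 % figure above:

    timeouts = 530    # column 6 (hypothetical count)
    requests = 20000  # column 3
    print '%.2f %%' % (timeouts / (requests / 100.0))  # -> 2.65 %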

--------------- The shelve db ---------------

db key = the command on the command line (column 10 in the spreadsheet)

for example: python WebProxyChecker.py 20000 webproxy 50

    >>> import shelve
    >>> db = shelve.open('result_db.slv')
    >>> record = [r for r in db.values()][0]
    >>> record['parallel_requests'] # 1. spreadsheet column
    '250'
    >>> record['URL_mode'] # 2. spreadsheet column
    'miscurls'
    >>> record['checked_URLs'] # 3. spreadsheet column
    '100'
    >>> record['successful_requests'] # 4. spreadsheet column
    '99'
    >>> record['errors'] # 5. spreadsheet column
    '1'
    >>> record['seconds_altogether'] # 8. spreadsheet column
    '10.0048542023'
    >>> record['seconds_per_url'] # 9. spreadsheet column
    '0.100048542023'
    >>> exceptions = record['exceptions'] # all exceptions
    >>> for exception, data in exceptions.iteritems():
    ...     print exception # the original exception string
    ...     data['cnt'] # occurrence of this exception
    ...     data['urls'] # a set of the urls that caused this exception
    ...     data['vars'] # a dict of lists, with removed data (all strings between '', "" and <>)
    ...     break
    ... 
    Traceback (most recent call last):
    Failure: twisted.web.error.Error: 400 Bad Request
    1
    set(['http://www.cruise.ch'])
    {}
    >>> 

---------- The rest: ----------

  • WebProxyCheckStatRaw.nfo: output from the bash script. Unessential; all data are in the shelve db or in the spreadsheet for simple usage.
  • this text

-------------------------------------------------------

My System:

  • Ubuntu 8.04
  • Python 2.5.2
  • Twisted 8.1.0

Do you need additional information?

I have been searching for this bug for over a week, but I don't know the Twisted internals and can't find it.

Sorry for my terrible English.

  • Ticket #3450

    • Property status changed from new to reopened
    • Property cc exarkun added
    • Property type changed from defect to enhancement
    • Property summary changed from user timeout bug to HTTP client API doesn't allow the connection setup timeout to be specified
    • Property priority changed from high to normal
  • Ticket #3450 – description

v2 → v6

The old description quoted in the comment above was removed and replaced with:
`reactor.connectTCP` imposes a timeout - 30 seconds by default. `twisted.web.client.getPage` offers no way for application code to specify this value.

There should be a way to control the value of the connection timeout for HTTP clients.
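A sketch of the kind of workaround this request implies (my illustration, not from the ticket): replicate what `getPage` does internally, but forward a timeout to `reactor.connectTCP`. It relies on the private Twisted 8.x helper `twisted.web.client._parse` and handles plain HTTP only:

    from twisted.internet import reactor
    from twisted.web import client

    def getPageWithConnectTimeout(url, connectTimeout=30, **kwargs):
        # Roughly what getPage does internally, with the connection setup
        # timeout exposed. _parse is a private helper returning
        # (scheme, host, port, path); the HTTPS branch is omitted here.
        scheme, host, port, path = client._parse(url)
        factory = client.HTTPClientFactory(url, **kwargs)
        reactor.connectTCP(host, port, factory, timeout=connectTimeout)
        return factory.deferred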