Ticket #3450: Changes between Version 2 and Version 6


Timestamp:
12/19/2008 09:17:36 AM (8 years ago)
Author:
exarkun
Comment:

Adjusting summary and description to reflect the feature request.

The old description was:

Hello,

sorry for my bad English. I hope someone understands what I mean.

The short description: there must be a bug in twisted.web.client. If I have a URL list with many bad URLs, I get a lot (50 - 100%) of strange user timeout exceptions, but only if I have too many parallel requests (more than 10).

For example:

I have a list of 20,000 URLs (with many bad URLs in it). I have a script that always keeps 10 parallel requests running -> 2.65% of the requests get the timeout exception. I start the same script with the same URL list and 50 parallel requests -> 5.88% of the requests get the timeout exception. With 100 parallel requests -> 64.1% of the requests get the timeout exception. With 250 parallel requests -> 99.38% of the requests get the timeout exception.

But if I use the script with a list of 20,000 good URLs I get this result:

100 parallel requests -> 0.34% of the requests get the timeout exception. 250 parallel requests -> 0.48% of the requests get the timeout exception.

Therefore, I think there must be a bug.

Why are there 100% timeouts with 250 parallel requests, but only 2.65% with 10 parallel requests on the bad URL list?

The timeout exception is:

Traceback (most recent call last): Failure: twisted.internet.error.TimeoutError: User timeout caused connection failure.
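For context only, here is a minimal sketch (not the attached WebProxyChecker.py) of how such a measurement might look with Twisted 8.1: many getPage calls with a DeferredSemaphore capping concurrency, tallying TimeoutError failures separately from other errors. The function name check_urls and the stats layout are made up for illustration.

    # Minimal sketch, not the attached WebProxyChecker.py (assumes Twisted 8.1).
    # Issue many getPage requests with a concurrency cap and count how many
    # fail with the "user timeout" error described above.
    from twisted.internet import defer, error
    from twisted.web.client import getPage

    def check_urls(urls, parallel):
        stats = {'ok': 0, 'timeouts': 0, 'other_errors': 0}
        # At most `parallel` requests are in flight at any moment.
        sem = defer.DeferredSemaphore(parallel)

        def succeeded(_page):
            stats['ok'] += 1

        def failed(failure):
            if failure.check(error.TimeoutError):
                stats['timeouts'] += 1
            else:
                stats['other_errors'] += 1

        ds = [sem.run(getPage, url).addCallbacks(succeeded, failed)
              for url in urls]
        # All failures are absorbed above, so this fires once everything is done.
        d = defer.gatherResults(ds)
        d.addCallback(lambda _: stats)
        return d

Varying `parallel` in a harness like this is what produces the percentages quoted above.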

---------------------

I will attach my test script, a spreadsheet with a lot of statistical data, and other things (so you can look at the bug more easily) to this bug report:

Content of the zip file:

------------------ The URL lists: ------------------

  • the good URLs: miscurls.py
  • the bad URLs: webproxys.py

-------------------------------- The script (WebProxyChecker.py): --------------------------------

Input (via command line):

  • first parameter: quantity of parallel requests
  • second parameter: the URL list "mode":
    • "miscurls" for the good URL list (20,000 URLs)
    • "webproxy" for the bad URL list (20,000 URLs)
  • third parameter: quantity of requests
    • (the script selects the URLs from the URL list with random.sample())

example: python WebProxyChecker.py 20000 webproxy 50

The script prints statistical data and all exceptions to stdout. (A rough sketch of this command-line interface follows below.)
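For illustration, a rough sketch of that command-line handling (again, not the attached script); the tiny inline URL lists merely stand in for miscurls.py and webproxys.py, whose actual contents are not shown in this report.

    # Sketch of the command-line interface described above; the inline URL
    # lists are placeholders for the attached miscurls.py / webproxys.py.
    import random
    import sys

    URL_LISTS = {
        'miscurls': ['http://example.com/', 'http://example.org/'],   # "good" URLs
        'webproxy': ['http://192.0.2.1/', 'http://198.51.100.2/'],    # "bad" URLs
    }

    def parse_args(argv):
        parallel = int(argv[1])   # first parameter: quantity of parallel requests
        mode = argv[2]            # second parameter: "miscurls" or "webproxy"
        count = int(argv[3])      # third parameter: quantity of requests
        urls = random.sample(URL_LISTS[mode], min(count, len(URL_LISTS[mode])))
        return parallel, urls

    if __name__ == '__main__':
        print parse_args(sys.argv)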

--------------- bash_script.sh ---------------

Calls the test script with many different parameters (runs ca. 30 hours on my machine, with 526,100 requests in total).

usage example: bash bash_script.sh | tee WebProxyCheckStatRaw.nfo

---------- parser.py ----------

A parser for the output of the bash script, which generates

a) a shelve db for easy access to the data (incl. all received exceptions), and

b) a spreadsheet for a better overview.

---------------- The spreadsheet ----------------

  1. column: number of parallel requests (first command-line parameter of WebProxyChecker.py)
  2. column: URL list (second command-line parameter of WebProxyChecker.py)
  3. column: number of requests (third command-line parameter of WebProxyChecker.py)
  4. column: number of successful requests
  5. column: errors (number of received exceptions)
  6. column: user timeouts (number of the strange user timeouts)
  7. column: % of requests with user timeouts -> column_6 / (column_3 / 100) (see the short calculation after this list)
  8. column: duration of all requests (in seconds)
  9. column: duration per request (in seconds)
  10. column: the key in the shelve db for this record (for example, to look at the exceptions)
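As a quick illustration of the column-7 formula (not taken from the spreadsheet itself):

    # Column 7: user timeouts as a percentage of all requests,
    # i.e. column_6 / (column_3 / 100).
    def timeout_percentage(user_timeouts, total_requests):
        return user_timeouts / (total_requests / 100.0)

    # For example, 641 timeouts out of 1000 requests -> 64.1
    print timeout_percentage(641, 1000)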

--------------- The shelve db ---------------

db key = the command from the command line (column 10 in the spreadsheet)

for example: python WebProxyChecker.py 20000 webproxy 50

    >>> import shelve
    >>> db = shelve.open('result_db.slv')
    >>> record = [r for r in db.values()][0]
    >>> record['parallel_requests'] # 1. spreadsheet column
    '250'
    >>> record['URL_mode'] # 2. spreadsheet column
    'miscurls'
    >>> record['checked_URLs'] # 3. spreadsheet column
    '100'
    >>> record['successful_requests'] # 4. spreadsheet column
    '99'
    >>> record['errors'] # 5. spreadsheet column
    '1'
    >>> record['seconds_altogether'] # 8. spreadsheet column
    '10.0048542023'
    >>> record['seconds_per_url'] # 9. spreadsheet column
    '0.100048542023'
    >>> exceptions = record['exceptions'] # all exceptions
    >>> for exception, data in exceptions.iteritems():
    ...     print exception # the original exception string
    ...     data['cnt'] # number of occurrences of this exception
    ...     data['urls'] # a set of the URLs that caused this exception
    ...     data['vars'] # a dict of lists with the variable parts stripped from the exception string (all strings between '', "" and <>)
    ...     break
    ... 
    Traceback (most recent call last):
    Failure: twisted.web.error.Error: 400 Bad Request
    1
    set(['http://www.cruise.ch'])
    {}
    >>> 

---------- The rest: ----------

  • WebProxyCheckStatRaw.nfo: output from the bash script. Not essential; all the data is also in the shelve db and the spreadsheet for easier use.
  • this text

-------------------------------------------------------

My System:

  • Ubuntu 8.04
  • Python 2.5.2
  • Twisted 8.1.0

Do you need additional information?

I have been searching for this bug for over a week, but I don't know the Twisted internals and can't find it.

Sorry for my terrible English.

  • Ticket #3450

    • Property Status changed from new to reopened
    • Property Cc exarkun added
    • Property Type changed from defect to enhancement
    • Property Summary changed from user timeout bug to HTTP client API doesn't allow the connection setup timeout to be specified
    • Property Priority changed from high to normal
  • Ticket #3450 – Description

    Between v2 and v6, the old description (quoted in full in the comment above) was removed and replaced with:

    `reactor.connectTCP` imposes a timeout - 30 seconds by default. `twisted.web.client.getPage` offers no way for application code to specify this value.

    There should be a way to control the value of the connection timeout for HTTP clients.
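For illustration only, here is a rough sketch of the workaround this request implies, assuming Twisted 8.1's APIs: drive `HTTPClientFactory` by hand so that the connection setup timeout can be passed to `reactor.connectTCP` directly. The helper name `get_page_with_connect_timeout` is made up for this sketch.

    # Minimal sketch, assuming Twisted 8.1: build the HTTPClientFactory that
    # getPage would use, but call reactor.connectTCP ourselves so the
    # connection setup timeout can be chosen.
    from urlparse import urlparse

    from twisted.internet import reactor
    from twisted.web.client import HTTPClientFactory

    def get_page_with_connect_timeout(url, connect_timeout=10):
        parsed = urlparse(url)
        factory = HTTPClientFactory(url)
        # This timeout bounds connection *setup*; getPage offers no way to
        # pass it through, which is what this ticket asks for.
        reactor.connectTCP(parsed.hostname, parsed.port or 80, factory,
                           timeout=connect_timeout)
        return factory.deferred

    def report(page):
        print 'got %d bytes' % (len(page),)

    def report_error(failure):
        print failure.getErrorMessage()

    if __name__ == '__main__':
        d = get_page_with_connect_timeout('http://example.com/', connect_timeout=5)
        d.addCallbacks(report, report_error)
        d.addBoth(lambda ignored: reactor.stop())
        reactor.run()

Ideally `getPage` itself would accept such a parameter, which is what this ticket now tracks.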