Continuing – not for too long, I hope – the series on the dangers of making assumptions about Python libraries while programming.
Suppose, hypothetically, you want to fetch a feed from a web-server using some Python code.
You would probably go and use urllib2, which makes it straight-forward, if you want it to be, or sophisticated if you need it to be.
Now, network programming is notoriously unreliable, so you should give some quick thought to that. You should have some exception handlers in place to handle timeouts and DNS problems and the like. Note that urllib2.URLError
includes some “non-error” exceptions, like authentication requests, but you’re not expecting those, so they count as errors here.
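For illustration, a minimal sketch of that kind of defensive fetch might look like the following (the feed URL is hypothetical, and the handling is deliberately simplistic):

    import urllib2

    FEED_URL = "http://example.com/feed.xml"  # hypothetical feed address

    try:
        data = urllib2.urlopen(FEED_URL).read()
    except urllib2.HTTPError as e:
        # A subclass of URLError: the server answered, but with an error
        # status (this is where authentication challenges end up).
        print("Server returned status %d" % e.code)
    except urllib2.URLError as e:
        # DNS failures, refused connections and the like.
        print("Failed to reach the server: %s" % e.reason)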
On the subject of timeouts, you can now (since Python 2.6) override the default with a parameter.
The optional timeout parameter specifies a timeout in seconds for blocking operations like the connection attempt (if not specified, the global default timeout setting will be used). This actually only works for HTTP, HTTPS, FTP and FTPS connections. [Ref: urllib2 manual, Python 2.6.4]
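For the record, passing the timeout explicitly looks something like this (the URL and the ten-second value are hypothetical choices, not recommendations):

    import urllib2

    # Available since Python 2.6: an explicit per-call timeout, in seconds.
    response = urllib2.urlopen("http://example.com/feed.xml", timeout=10)
    data = response.read()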
So you care how long it is? No, you are happy to use whatever the typical default is for fetching a web-page – what would that be? 2 seconds? 10 seconds? 20 seconds? As long as it isn’t silly, like 5 minutes, you’ll be fine, so, hypothetically, leave that parameter empty.
Whoops. You shouldn’t have done that! Not even hypothetically.
The urllib2 package doesn’t specify the default timeout in this situation. It sits on the httplib library, and it punts that issue down. It assumes that users of the URL library have the same needs as the users of the HTTP library.
But httplib doesn’t specify the default timeout either. It sits on the socket library, and it punts that issue down. It assumes that users of the HTTP library have the same needs as the users of the Socket library.
The socket library is a fairly generic library used by many users. It doesn’t know whether it is being used to fetch web-pages across the web, or for interprocess communication by parts of the same application. So, it sets the default timeout as… forever.
Somehow, that default doesn’t seem quite so reasonable for the library responsible for fetching web-pages.
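You can see the “forever” default for yourself, and, if you are willing to affect every socket in the process, override it globally. A rough sketch, with a hypothetical 30-second ceiling:

    import socket

    # The socket module's global default is no timeout at all.
    print(socket.getdefaulttimeout())   # prints None, i.e. block forever

    # A blunt, process-wide workaround: every new socket gets a ceiling.
    socket.setdefaulttimeout(30)        # hypothetical value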
It is a nasty little problem, as an unresponsive remote web-server will freeze your application’s thread… but your application thread won’t realise it is having problems, so it will never log an error to ensure you get an email. Nor will it crash, which would at least let your watchdog process restart it. No, it just quietly hangs your server application at, say, 29 minutes past midnight on a Saturday morning. A purely hypothetical example, of course.
Now, this is a low-level API-usage error that will hopefully be noticed in code-review. Err.. Whoops, no, it wasn’t… err… hypothetically, I mean.
That’s a problem, because it isn’t a problem you are likely to find in unit-testing. You may be diligent and mock out urllib2 to make sure your code handles timeout errors correctly, but are you going to set up a mock web-server that times out, to see whether timeout errors are actually raised?
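For example, a mocking-only test along these lines exercises your error handler, but it never proves that a real stalled server would produce the error in the first place (fetch_feed, the test class and the monkey-patching are hypothetical sketches, not the actual code):

    import socket
    import unittest
    import urllib2

    def fetch_feed(url):
        # Hypothetical stand-in for the code under test.
        return urllib2.urlopen(url, timeout=10).read()

    class FetchFeedTest(unittest.TestCase):
        def test_timeout_is_reported(self):
            # Monkey-patch urlopen; pretend the server timed out
            # (how urllib2 reports this is an assumption of this sketch).
            original_urlopen = urllib2.urlopen
            def fake_urlopen(url, timeout=None):
                raise urllib2.URLError(socket.timeout())
            urllib2.urlopen = fake_urlopen
            try:
                self.assertRaises(urllib2.URLError, fetch_feed,
                                  "http://example.com/feed.xml")
            finally:
                urllib2.urlopen = original_urlopen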
Nor is it likely to come up during system-testing. How often do web-servers fail that way?
Actually, that is a question I can answer. If you were to have such a bug, it wouldn’t appear until the 2,879,702nd call to the relevant function in urllib2, so when it did happen, you would be quite surprised that a piece of code you thought was very stable could bring down your production server for many hours until it was noticed.
Hypothetically, that is.
Comment by Richard Atkins on May 1, 2010
Your watchdog needs some work, such as in-process checking that each thread is alive and doing work appropriately. You should probably change each thread to tell the in-process watchdog that it’s alive, make sure the watchdog knows how many threads should be working, and have it either restart the app itself or trigger the out-of-process watchdog to restart it.
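Something along these lines, perhaps (a rough sketch; all names and the time limit are made up):

    import threading
    import time

    class Heartbeat(object):
        """Minimal in-process watchdog: workers report in, a monitor
        flags any worker that has gone quiet for too long."""

        def __init__(self, limit_seconds):
            self.limit = limit_seconds
            self.last_beat = {}
            self.lock = threading.Lock()

        def beat(self, worker_name):
            with self.lock:
                self.last_beat[worker_name] = time.time()

        def stalled_workers(self):
            now = time.time()
            with self.lock:
                return [name for name, last in self.last_beat.items()
                        if now - last > self.limit]

    # Each worker thread calls heartbeat.beat("fetcher") on every loop;
    # a monitor thread periodically checks stalled_workers() and either
    # restarts the app or pokes the out-of-process watchdog.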
Comment by Julian on May 1, 2010
Richard,
You are probably right. The watchdog will need the ability to forcefully kill a jammed-up process. I have avoided giving the watchdog teeth that sharp, and complexity that high, because of the almost inevitable issues that would arise from bugs in the watchdog itself.
I was hoping by now the mean-time-between-failures for my server would be high enough to leave little justification for such measures. Alas.
Comment by Aristotle Pagaltzis on May 2, 2010
Time for a bug report methinks.
Comment by Alastair on May 2, 2010
It is for these types of reasons that I generally prefer to use the Reactor pattern when dealing with any nontrivial I/O, particularly over a network. Twisted seems like a particularly mature implementation of this, although I haven’t used it myself.
Comment by Julian on May 2, 2010
I am considering Aristotle’s suggestion to raise a bug-report.
The downside is that I don’t know what to suggest – now that the API contract is firmly in place, changing the timeout behaviour could be construed as a bad thing. You might need to offer a new interface.
Then I remembered good, old, Python 3.0. That was a release that re-did a lot of the libraries (at a cost of breaking existing code, and the consequent slow adoption). Maybe it was fixed then?
I inspected the Python 3.0.1 documentation, but I didn’t inspect the source code or install it.
If you substitute urllib.request whenever I say urllib2, and http.client whenever I say httplib, the documentation would suggest the behaviour is identical. 🙁
Maybe I should submit a bug-report just to put a warning in the documentation.
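For what it is worth, the Python 3 form of the explicit-timeout workaround is the same idea under the new names (the URL and the value are, again, hypothetical):

    import urllib.request

    # Same remedy as in Python 2.6: pass the timeout yourself.
    response = urllib.request.urlopen("http://example.com/feed.xml", timeout=10)
    data = response.read()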
Comment by Julian on May 2, 2010
This post has now been re-released in Python bug-report form.
Comment by Julian on May 2, 2010
Alastair,
I am pondering this. I would certainly take another look at Twisted on my next major project. It is an interesting variant of Inversion of Control.
In the mean-time, I considered this I/O to be pretty close to trivial. Yes, it is over the network, and can fail, but:
* the deadlines are long and loose (although, it turns out, not infinite!)
* missed deadlines are simply dropped; the cost is merely a tiny missed window to make money, not a mission-critical failure.
* no elaborate failover is required (no need to try another server, re-authenticate, or refresh caches and the like – just fetch the URL again until it works).
So, if I was basing my entire architecture on Twisted, I might use it here for consistency, but for this one piece alone, it seems overkill.
After all, it works 2,879,701 times out of every 2,879,702! 🙂
Comment by Sunny Kalsi on May 2, 2010
This might be getting too general, but computers do have a halting problem. I’m not smart enough to figure out if this situation applies to the idea that waiting for a reply forever is undecidable, but in general a watchdog process (especially a complex one) is a waste of time, because it’s non-trivial to tell whether / when a sub-program will end. Basically, the watchdog makes your program less robust in the opposite direction to a program which halts — i.e. a program which simply takes a long time will still trip the watchdog.
The exception is in real-time programs.
Comment by Julian on May 3, 2010
Sunny, perhaps this falls under your “real-time programs” exception, but the Halting Problem isn’t applicable here.
The Halting Problem refers to analyzing source code and input to determine ahead of time whether the program will ever halt. I am just checking that the program (or at least, a single function of the overall program) halts within a specific amount of time, which is much more tractable – even more so because my solution would be to simply run the source code, and kill it if it fails to halt in time.
It renders my system no longer Turing-complete, from a Theory of Computation perspective, but in this application (like most real-time systems) the amount of processing to be performed when data is received is both bounded and small.
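For concreteness, the “run it, and kill it if it fails to halt in time” approach can be sketched with a separate process (the time limit and the stand-in workload are hypothetical):

    import multiprocessing

    def possibly_never_halting_work():
        # Hypothetical stand-in for the function being supervised.
        while True:
            pass

    if __name__ == "__main__":
        worker = multiprocessing.Process(target=possibly_never_halting_work)
        worker.start()
        worker.join(30)          # wait at most 30 seconds (made-up limit)
        if worker.is_alive():
            worker.terminate()   # the watchdog's sharp teeth
            worker.join()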
Comment by Chaim on November 17, 2011
This article beautifully captured a similar experience that I had with urllib2. Thanks for the post.