A/B tests are only as good as your A idea or your B idea. It's tempting to think we can "randomly" vary the user interface and have something "emerge" from the process, and the thought is especially tempting for non-designers.
However, mathematically (sorry, I am a statistician), this is a pretty flawed idea. Steven pointed out that you'll only climb to the peak of the hill you are on (and there could be a Mt. Everest a short ride *down* the hill from you), but it's even worse than that: any time you engage in multiple testing, you are practically guaranteed false positives. So at least some of your tests - even if you perform them correctly - are going to reach the opposite conclusion from the one they should. The more tests you run, the more often this happens.
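To put a rough number on that multiple-testing point, here's a back-of-the-envelope sketch. It assumes independent tests, each run correctly at a 5% significance level - the numbers are for illustration only:

```python
# Chance of at least one false positive across several A/B tests,
# each run correctly at a 5% significance level and assumed independent.
alpha = 0.05

for n_tests in (1, 5, 10, 20, 50):
    p_any_false_positive = 1 - (1 - alpha) ** n_tests
    print(f"{n_tests:3d} tests -> {p_any_false_positive:5.0%} chance of at least one false positive")

# Roughly: 10 tests -> ~40%, 20 tests -> ~64%, 50 tests -> ~92%.
```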
Other issues include bugs in the way you implement the test. I have noticed that, culturally, data is sacrosanct... people don't question results as often as they should, and bugs occur more often than you might expect, as engineers have a hard time getting their heads around testing code that implements randomness. Dilbert loses faith in humanity before he loses faith in the code that brought him this silly answer. The cult of A/B testing is a tough one to fight against, and I sympathize with every designer who kind of loses the ability to calmly communicate in the face of it.
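One cheap sanity check I'd suggest for that class of bug (my suggestion, not something from the discussion above, and the counts are made up): compare how many users actually landed in each bucket against the split you intended. A badly skewed split is often the first sign the randomization code is broken.

```python
# Sanity check for a bugged randomizer: do the observed bucket counts
# look like the 50/50 split we intended?  Counts here are invented.
from scipy.stats import chisquare

observed = [50_312, 48_911]            # users actually bucketed into A and B
expected = [sum(observed) / 2] * 2     # what a true 50/50 split would give

stat, p_value = chisquare(observed, f_exp=expected)
if p_value < 0.001:
    print(f"Split looks off (p = {p_value:.1g}) - go read the bucketing code.")
else:
    print(f"Split is consistent with 50/50 (p = {p_value:.1g}).")
```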
Finally, A/B testing assumes you can draw a random sample from all your users - both current and future. For a large company like AOL I buy into this... they have massive traffic, and it's likely very similar day-to-day, so today's users are tomorrow's users. For a startup, your user base is in flux. Even if interface "A" is preferable to "B" today, I'm skeptical you can know for sure it will be in the future. A/B tests aren't enough to make final decisions on their own. Call it "gut" if you want... I call it experience-based knowledge.
Ok, now having said all that, I'm not some kind of wacky philistine. I studied stats for years and love data. What I'm trying to say is I absolutely use A/B tests, but I use them sparingly and carefully. Here's how: rather than hoping they'll cause user interfaces to "emerge" from chaos, I use them to keep the designers from getting too full of themselves, as a way to guide research, and as a way to remind designers of our actual business objectives.
Instead of making the null hypothesis that A is no different from B (the common approach), and celebrating any "improvement," I insist that anyone proposing a new interface make a claim about how much better it should be than the current one. So, if I am told that B should improve our click-through by 5%, then the null hypothesis is that B is at least 5% better than A. (If they tell me it will be 1% better, I tell them to work on something else this week - that's usually well within the noise of my testing precision anyway.)
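Here's a minimal sketch of what that test looks like in practice. It relies on assumptions I haven't spelled out - clicks as independent yes/no events, a normal approximation, and the claimed lift read as a *relative* 5% - and the counts are invented:

```python
# Sketch of testing the null "B is at least 5% better than A" on click-through.
from scipy.stats import norm

clicks_a, n_a = 4_820, 100_000   # current interface A (made-up numbers)
clicks_b, n_b = 4_860, 100_000   # proposed interface B (made-up numbers)

p_a, p_b = clicks_a / n_a, clicks_b / n_b
lift = 1.05                      # the designer's claim: B is at least 5% better

# Null: p_b >= lift * p_a.  A very negative z means the claim isn't holding up.
diff = p_b - lift * p_a
se = (p_b * (1 - p_b) / n_b + lift**2 * p_a * (1 - p_a) / n_a) ** 0.5
z = diff / se
p_value = norm.cdf(z)            # one-sided

print(f"A: {p_a:.3%}  B: {p_b:.3%}  z = {z:.2f}  p = {p_value:.3f}")
if p_value < 0.05:
    print("Reject the claim that B is at least 5% better - time to ask what went wrong.")
else:
    print("Can't rule out the claimed improvement yet - keep collecting data.")
```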
If I can run data for a while and disprove the null that B is at least 5% better than A (note that this is still A/B testing - it's just not the way people talk about it colloquially, at least in my experience), then I go back to the designer and ask what went wrong. If he or she is stumped, I get ideas from other people and we start researching them.
If this happens consistently and the designer just yells at me that I don't know what I'm doing, using A/B tests to question their genius, I get a new designer. However, I've found good designers are eager to engage in this way: they still control the UI and UX; I'm just asking them to be open to the idea that they might be wrong and that there is something to be learned from mistakes. I am not using A/B tests to replace designers or to create a culture where anyone's crackpot idea can be validated by the users somehow.
I do not apply A/B testing (of any kind) to overhauls of the whole funnel; not that it can't be done, but implementing the test is pretty hard. Usually you can tell whether things are better from other metrics, rather than by randomly assigning your users to one experience or the other.