IF YOU POST ABOUT SEARCH BEING BROKE, I'll ban you...
LS1Tech and I have personally split the cost of a new development server. The components arrived yesterday, and I spent yesterday evening assembling the new machine and installing the operating system. This system will be used to develop and stress test any and all new code destined for use on this (and affiliate) site(s).
Search is priority one. As is, adding in more than a couple of months' worth of posts really kills site performance -- more than it is already hurting. Some of you have caught wind of my efforts to port vBulletin to a more high-performance, enterprise-scale database server (PostgreSQL). This work is more or less finished, but it still needs final bug testing and load testing, and then it has to be transferred to the production servers. LOTS of work, but it IS progressing.
First off, let me state that I've been incrementally and slowly adding posts into the existing search index while monitoring server performance. I don't want to rob Peter to pay Paul, so to speak.
Second, here's where we stand overall. Based on the post-***** test session we had when the server blew up a while back, I'm confident the PostgreSQL port of vBulletin is solid. The only variable beyond that has been search, which I turned off for the duration of that test because there were no posts indexed. Here's a breakdown of what has transpired since then:
1. Loaded 3.5M posts into the "vBulletin" style post indexing scheme, under Postgres. This turned out to work reasonably well, but was somewhat hit or miss. It didn't tear down the server (since PG has MUCH more intelligent query and resource scheduling) but the search results wouldn't return in a consistent timeframe. Similar searches sometimes took 2 seconds and sometimes took 2+ minutes.
2. Wiped out the vB style index, and indexed the tables using the tSearch2 "FullText" index plugin for PG. This required some hacking of the vBulletin search subsystem to get working, since PG does fulltext a little differently than MySQL. The end result was that it worked a little faster than #1 but was still inconsistent in performance.
3. Went looking at the actual data and query structure of the search, and the PHP code that controlled everything. Found some HUGE problems with the way Jelsoft wrote the search system, mostly from trying to offer too many options that truly don't make a lot of sense for us. The most glaring is the idea of caching the search results, so if someone else searches the same thing 20 minutes later, we can re-use the search results from the first guy. In theory, this sounds like a good idea. In practice, it's pretty half-baked. First, what percentage of people will search for the exact same thing? Not very high, I looked. Less than 1%. So even if it DID gain a ton of performance (it really doesn't, I'll explain why in a bit) there's very little chance it would actually be used. That brings us to the next stage.
4. The vB PHP code does some very inefficient things, due to the framework being written to support saved searches. The most glaring one is basically running every search TWICE (or more). Sure, some of that is handled by the DB engine query cache, but that also introduces its own issues. First off, if you expand the number of search results, you increase the amount of cache needed. This is why the current limit is 250 results. Any more than that, and it won't fit in the query cache at all, and it has to be paged to disk. That basically halts everything, since you're still fighting normal posting and reading queries. Once things start to back up, the site grinds to a halt. Now do that multiple times. Very bad. So, removing saved searches means the search only runs once, which alleviates quite a bit of that right off the bat. The next problem is that the search engine actually runs the exact same search query every time you either return to the results page or click between pages. So again, if the query doesn't fit in the query cache it has to be paged to disk, or run from scratch. Not very efficient.
5. This is the stage I'm in the middle of right now, which is a 100% total rewrite of the search engine. Without laying out all the details, it basically works by running the initial query, and storing the results list in a search results table. Since this table is nice and small (n results * x searches per minute * y minutes), requerying it based on page movements or returning to the results list is incredibly fast. With realistic limits (1000 results, 30 searches per minute, 5 minute window) the table would grow to a fuzzy max of 150,000 rows. That's nothing for a database to query on, and even cache. There's still no cheap way of getting away from the initial impact of running the search, which can take up to 2 minutes in some cases. The only realistic way to speed that up is with a custom-built search server/cluster, which spreads the database over as many disks as possible. The more disk spindles there are, and the faster they spin, the faster it can scan all that data. It's not cheap to build and host a box like that obviously, so we'll do the best we can with the resources that make the most sense. If some 6 word, obscure search back to the beginning of time takes a couple minutes, so be it. Most typical 1 to 3 word searches complete in mere seconds, so most searches would never take that hit anyway.
I have a smaller site, but I also noticed the search saving. I thought it was kinda dumb, but I don't really have a need to rewrite it...