Sphinx Search: MySQL Fulltext Search Engine Replacement
July 21, 2008 – 12:07 am by Lord RavenclawFor PatronusCharm 1.0.0 and 1.0.1, we’ve used the MySQL fulltext search engine. That’s a mouthful, right? MySQL is the database software PatronusCharm utilizes to store all of the data used in PatronusCharm. It stores users, stories, news articles, FAQ items, everything. As you well know, a search engine is needed to pile through this data. As PatronusCharm finds more and more stories to host on its pages, a good search engine is important to continue to allow users to search for the stories they’d like to read.
MySQL has functions to do this. Using what’s known as a fulltext index, it searches through content such as titles, summaries, and tags to return matches to the query. Unfortunately, MySQL fulltext searches also suck. Limited by features, and woefully undeveloped facilities, MySQL fulltext searches are not the best. I’ve attempted to make them better, developing my own search query off examples and tips from other sites, but I find it inadequate. As do you all, I’m sure. In order to search for words below 4 letters long, one must change it for ALL databases on the server, which leads to a ballooning of search index sizes and performance reduction. My query is complicated, and taxes the server. For those interested, it looks like this (including PHP):
// Our query is massive enough to be multiline.
$list_query = $this->database->sql_query("
SELECT SQL_CALC_FOUND_ROWS s.*, u.name,
( (1.3 * (MATCH(summary) AGAINST ('" . $search . "' " . $boolean . ")))
+ (1.1 * (MATCH(tags) AGAINST ('" . $search . "' " . $boolean . ")))
+ (0.6 * (MATCH(title) AGAINST ('" . $search . "' " . $boolean . ")))
) AS relevance
FROM stories s
LEFT JOIN users u USING(userid)
WHERE ( MATCH(summary, title, tags) AGAINST ('" . $search . "' " . $boolean . ") )
" . $cwhere . "
HAVING relevance > 0
ORDER BY relevance DESC, updated DESC
LIMIT " . (($curpage - 1) * $perpage) . "," . ($curpage * $perpage)
);
I found a different solution. After searching, I found the Sphinx Fulltext Search Engine. It works as a replacement to the MySQL Fulltext Search Engine. Complete with a PHP API class, I designed a new search engine based around Sphinx, which gave me the features and flexibility to do whatever necessary to create a good search engine. What it does is crawl through each record in the database to be indexed and creates its own indexer files, eliminating the need for MySQL fulltext indexes (and thus increased performance). Then a daemon is run which the PHP API connects to and searches with, receiving results back from Sphinx.
After weeks of work a few months ago, I’m quite proud of it. It’s faster, lighter, does more, and finds more. One can search for “HP” or “SGA” and return results. “HP/RR” returns my story as the one with the greatest relevance. In all, I’m quite pleased with how it’s turned out.
I’ve enabled it on the devsite, if you’re curious to give it a go. I find it a much more elegant search, and far more reliable. Please, try it and comment back here with feedback. Be sure to tell me if you find any bugs…you all are far better at breaking it than I.
6 Responses to “Sphinx Search: MySQL Fulltext Search Engine Replacement”
Gave it a try. Works fast and precise even with one letter to search by. Still, I’ve spent couple of minutes doing unnatural things to it, and got some results.
Naturally, first word I tried to search for was “ass”. And what do you think happened? I lol’ed.
It gives you something like this:
“ass=”nlink” href=”/s/120/1/” name=”storytitle”>Harry Potter and the Acts of Betrayal by ass=”nlink”
I had already tried three other obscenities, when it occurred to me that “ass” was just a part of “class”. Pity.
Still, it only affects results. Now, searching for “” – that’s where real fun begins. Oh, the things it was showing to me… Try “< as”, nearly killed my computer with this amount of fucked-up java-script.
Anyway, you really need to close this one. Not sure if it is possible to execute scripts or php there, but just to be safe.
Next thing is about non-ASCII symbols. The likes of “ö” and such. Now, I know that probability is low, but still — if someone names their story using one of those, you will have some difficulties finding it.
And finally, strange highlighting – with blank spaces around search string. Doesn’t really make for a clean page.
That’s all I’ve got for now. Enjoy your debugging, I’ll be in touch.
By Alssn on Jul 21, 2008
Huh. It seems that I can’t use certain symbols in those commentaries. Anyway, in fifth paragraph I’ve meant usual angle brackets used in the most perverted combinations possible.
By Alssn on Jul 21, 2008
Good to have some feedback.
Search highlighting is actually done with JS, as at the time I couldn’t really use any other method. I will however look into making it more HTML friendly…I thought I had closed that issue, but apparently not. And no, not really. Even if you could, it’s only JS and not run server-side. It does need fixing.
The highlighting was intentional, to make it readily apparent what was being matched. I suppose highlighting could use an overhaul. I’ve put it on my to-do list.
Sphinx handles mapping of ö -> o, so it should be just fine. I’ll look into making sure that it properly handles these.
By Lord Ravenclaw on Jul 21, 2008
The highlighter as since undergone a revision, and now does tag highlighting server-side. It should be HTML-friendly now and not break HTML tags. So far I haven’t been able to get it to break.
By Lord Ravenclaw on Jul 22, 2008