While I was in the cafeteria with Mike Vernal this afternoon I bumped into some members of the Windows Desktop Search team. They mentioned that they'd heard that I'd decided to go with Lucene.NET for the search feature of RSS Bandit instead of utilizing WDS. Much to my surprise they were quite supportive of my decision and agreed that Lucene.NET is a better solution for my particular problem than relying on WDS. In addition, they brought an experienced perspective to a question that Torsten and I had begun to ask ourselves. The question was how to deal with languages other than English.
When building a search index, the indexer has to know what the stop words it shouldn't index are (e.g. a, an, the) as well as have some knowledge about word boundaries. Where things get tricky is that a user can receive content in multiple languages, you may receive email in Japanese from some friends and English from others. Similarly you could subscribe to some feeds in French and others in Chinese. Our original thinking was that we would have to figure out the language of each feed and build a separate search index for each language. This approach seemed error prone for a number of reasons
- Many feeds don't provide information about what language they are in
- People tend to mix different languages in their speech and writing. Spanglish anyone?
The Windows Desktop Search folks advised that instead of building a complicated solution that wasn't likely to work correctly in the general case, we should consider simply choosing the indexer based on the locale/language of the Operating System. This is already what we do today to determine what language to display in the UI and we have considered allowing users to change the UI language in future which would also affect the search indexer [if we chose this approach]. This assumes that people read feeds primarily in the same language that they chose for their operating system. This seems like a valid assumption but I'd like to hear from RSS Bandit users if this is indeed the case.
If you use the search features of RSS Bandit, I'd appreciate getting your feedback on this issue.