Patrick Tisseghem's Blog [MVP SharePoint]

The Happy SharePoint Traveller

Indexing and Searching Documents in Multiple Languages (Part I)

The good thing about dedicated workshops such as the Search Workshops we ran last week in Brussels and Copenhagen is that you come away with a lot of questions that were answered during the course and that deserve to end up in blog postings or articles. Finding the time to write them up is of course always a problem, but I'll do my best, certainly when it adds to the material covered in our latest book:

Inside the Index and Search Engines: Microsoft® Office SharePoint® Server 2007
by Patrick Tisseghem, Lars Fastrup


One of the interesting questions concerned indexing and searching documents written in a specific language. I'll cover part of the answer in this first posting and continue later this week.

How does the crawler detect the language of the content of the document?

First of all, language detection depends on the IFilter that is used to index the content of the document. Chapter 9 of the book gives a full explanation of the internals of IFilters and a guide to building your own.

The built-in IFilter that is part of the MOSS indexing architecture can inspect an Office document and collect plenty of information. Gathering this information is actually the task of one of the internal plug-ins, the Metadata Extraction plug-in. It relies on an internal language detection algorithm (developed by Microsoft Research) to determine the language of the content. Once it has determined the language (represented by a number), it stores this value in a hidden managed property called DetectedLanguage.

How do I search for a document in a specific language?

Let's first have a look at the out-of-the-box experience. As an example, I have a document library that stores several documents, each authored in a different language. I configured a content source that indexes all of this data.

image

The advanced search page lets us filter on the language very easily using the language picker. By default it offers only a couple of options, but if you open the tool pane and edit the XML that is set as the value of the Properties property of the AdvancedSearchBox Web Part, you can offer more choices.

In the XML you find a list of LangDef elements, each one pairing a language with its number. Note that it is not clear how Microsoft arrived at these numbers (they do not match the LCID numbers, for example).

    <LangDefs>
        <LangDef DisplayName="Arabic" LangID="1"/>
        <LangDef DisplayName="Bengali" LangID="69"/>
        <LangDef DisplayName="Bulgarian" LangID="2"/>
        <LangDef DisplayName="Catalan" LangID="3"/>
        <LangDef DisplayName="Chinese" LangID="4"/>
        <LangDef DisplayName="Croatian/Serbian" LangID="26"/>
        <LangDef DisplayName="Czech" LangID="5"/>
        <LangDef DisplayName="Danish" LangID="6"/>
        <LangDef DisplayName="Dutch" LangID="19"/>
        <LangDef DisplayName="Finnish" LangID="11"/>
        <LangDef DisplayName="French" LangID="12"/>
        <LangDef DisplayName="German" LangID="7"/>
        <LangDef DisplayName="Greek" LangID="8"/>

The language picker will show all the languages that are defined within the Languages element:

    <Languages>
        <Language LangRef="12"/>
        <Language LangRef="7"/>
        <Language LangRef="17"/>
        <Language LangRef="10"/>
        <Language LangRef="19"/>
        <Language LangRef="25"/>
        <Language LangRef="22"/>
    </Languages>
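
To check which display names a given Languages list will produce, the two elements can be joined on their numbers. Here is a minimal Python sketch of that lookup. It is not part of SharePoint: the `<root>` wrapper, the `PROPERTIES_XML` sample and the `picker_languages` helper are assumptions made for illustration, and only LangDef values from the excerpt above are used.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for the Properties XML of the AdvancedSearchBox Web Part.
# The <root> wrapper element is an assumption; the LangDef values come from the
# excerpt shown earlier in this post.
PROPERTIES_XML = """
<root>
  <LangDefs>
    <LangDef DisplayName="Dutch" LangID="19"/>
    <LangDef DisplayName="French" LangID="12"/>
    <LangDef DisplayName="German" LangID="7"/>
  </LangDefs>
  <Languages>
    <Language LangRef="12"/>
    <Language LangRef="7"/>
    <Language LangRef="19"/>
  </Languages>
</root>
"""


def picker_languages(properties_xml):
    """Return (LangRef, DisplayName) pairs in the order of the Languages element."""
    root = ET.fromstring(properties_xml)
    # Map each language number to its display name.
    names = {d.get("LangID"): d.get("DisplayName") for d in root.iter("LangDef")}
    result = []
    for lang in root.iter("Language"):
        ref = lang.get("LangRef")
        # A LangRef without a matching LangDef is reported rather than dropped.
        result.append((ref, names.get(ref, "(no matching LangDef)")))
    return result


for lang_id, name in picker_languages(PROPERTIES_XML):
    print(lang_id, name)
```

Running the same helper over the full XML copied from the tool pane is a quick way to spot a LangRef that points at a number with no LangDef.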

A query using the language picker will include a match on the DetectedLanguage managed property, as shown here:

image

image

The Advanced Search page is not the only place where you can use this managed property. You can also type it directly into the search box as part of a keyword syntax query. You just have to look up the number of the language (see the XML above).
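
As a rough illustration, the following Python sketch builds such a query string. The `language_query` helper is hypothetical, and the `property:value` form of the filter is an assumption, since the exact text of the query is only visible in the screenshot. The number 19 is the LangID listed for Dutch in the XML above.

```python
def language_query(keywords, lang_id):
    """Append a DetectedLanguage property filter to a keyword query.

    Assumes the property:value form of the keyword syntax; adjust the
    operator if your search box expects a different one.
    """
    # int() rejects anything that is not a plain language number.
    return f"{keywords} DetectedLanguage:{int(lang_id)}"


# 19 is the LangID listed for Dutch in the LangDefs excerpt.
print(language_query("sharepoint", 19))
```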

image

In the next posting I'll show you how to customize the search experience using the language information.

Comments

Guided search in different languages

I have just posted an article about Search engines and desktop search at http://hermansberghem.blogspot.com/2008/06/nice-thing-about-vivisimo-search-when.html.

Because I know that you are into the Search experience a lot, I decided to verify something I wrote in the post. What I want to know is: does SharePoint Search use the languages for stemming, truncation, "did you mean", etc.? For instance, when I type in the Dutch word 'dag', will SharePoint recognize it is Dutch? You don't want something like: did you mean 'dog'? You know what I mean?
at 27/06/2008 14:20

Faceted Search

When you're using the CodePlex project "Faceted Search", you can configure the managed property "DetectedLanguage" as a facet.

I have just posted an article about this at
http://blog.sharepoint.ch/2008/07/language-filter-with-faceted-search.html
at 7/07/2008 14:08

Language Filter for English

Hi Patrick, nice post!!!

But I still can't find out how to set up MOSS to filter on the English language. I tried to insert a new LangDef for code 9, but without success. Do you have any tips for me?

Thank you!
at 6/08/2008 16:13

Thank you

I love your blog, keep going
at 23/01/2009 15:38

Chinese Simplified and Chinese Taiwan

I noticed that we have Chinese as one of the options in the Language attributes but it is not broken down by the sub language. How is it possible to separate the content between Chinese sub cultures, like Chinese Simplified and Chinese Taiwan? Is there a property other than DetectedLanguage that holds the sub culture value that we can filter upon?
at 3/02/2009 21:55