By default, the Acquia Solr Search schema is optimized for English searches. Other Latin-based languages (e.g. Italian) will work because of similarities in grammar and vocabulary. Chinese, Japanese and Korean (CJK) languages, however, use different stemming and spacing rules than Latin-based languages, and the Solr/Lucene search engine handles them differently.
The following references provide helpful background on searching CJK languages and are intended as a starting point as you consider adding CJK support to the Acquia Solr Search index for your site(s):
- https://wiki.apache.org/solr/LanguageAnalysis#Chinese.2C_Japanese.2C_Korean
- https://lucene.apache.org/core/3_5_0/api/all/org/apache/lucene/analysis/cjk/CJKTokenizer.html
- http://www.slideshare.net/lucenerevolution/japanese-linguistics-in-lucene-and-solr
You can add support for CJK languages through a custom Acquia Search Solr configuration. We accept customer-modified versions of schema.xml, elevate.xml, synonyms.txt, stopwords.txt, and protwords.txt.
Please note that SmartChineseSentenceTokenizerFactory and SmartChineseWordTokenFilterFactory are available only on Acquia Search cores running Solr 3.5. If your Acquia Search cores are running Solr 4.5.1, you can use CJKTokenizerFactory instead.
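As a rough illustration of how these factories might be wired into a custom schema.xml, the sketch below defines one field type per approach. The field type names text_cjk and text_zh and the positionIncrementGap value are assumptions made for this example, not names taken from the Acquia default schema. Treat it as a starting point to adapt and test locally, not as the configuration Acquia deploys.

```xml
<!-- Hypothetical field type for cores running Solr 4.5.1:
     CJKTokenizerFactory indexes CJK text as overlapping character pairs. -->
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field type for cores running Solr 3.5:
     the Smart Chinese tokenizer splits text into sentences,
     and the filter then segments each sentence into words. -->
<fieldType name="text_zh" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.SmartChineseSentenceTokenizerFactory"/>
    <filter class="solr.SmartChineseWordTokenFilterFactory"/>
  </analyzer>
</fieldType>
```

Only one of the two field types would normally be included, depending on the Solr version of your core. The fields that hold CJK content would then need to reference the chosen type, and existing content would need to be reindexed for the change to take effect.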
We recommend that you set up a local Solr installation for testing as you develop the new configuration files to support those languages. After you have developed and tested your configuration changes, create a support ticket, and we will review the changed files and deploy them to your Acquia Search index.