php - Solr searching with /terms -

- July 15, 2014

I have a PHP application that uses a Solr database. A problem appeared when doing a /terms request (terms doc).

The parts of the document that are of interest are:

poi: "bistriÅ£a", ... text: [ "ddt", "numeric", "/14/gagaga 2/11/economics/17/datenow", "/20/daniel_same/11/economics/17/datenow", "0/gagaga 2", "1/gagaga 2/economics", "2/gagaga 2/economics/datenow", "0/daniel_same", "1/daniel_same/economics", "2/daniel_same/economics/datenow", "ppla", "seat of first-order administrative division", "/19/daniel_same/1071/plurinational state of bolivia/2269/cuba/2272/bistriÅ£a", "0/daniel_same", "1/daniel_same/plurinational state of bolivia", "2/daniel_same/plurinational state of bolivia/cuba", "3/daniel_same/plurinational state of bolivia/cuba/bistriÅ£a", "0/undefined_activity", "year", "0/1999", "0/1999", "measured", "", "utf8" ],

The request is:

    http://localhost:8080/solr/terms
        ?wt=json
        &indent=true
        &terms.sort=count
        &terms.mincount=1
        &terms.limit=10
        &terms.regex.flag=case_insensitive
        &terms.regex=.*bi.*
        &terms.fl=text
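For reference, the same request can be assembled programmatically so the regex is URL-encoded correctly. This is only a sketch in Python: the helper name `build_terms_url` is hypothetical, and the host, port and path are copied from the question and will differ per installation.

```python
from urllib.parse import urlencode

# Base URL copied from the question; adjust host, port and core to your setup.
BASE_URL = "http://localhost:8080/solr/terms"


def build_terms_url(field, pattern, limit=10):
    """Hypothetical helper: build a /terms URL for `field`, filtered by `pattern`."""
    params = {
        "wt": "json",
        "indent": "true",
        "terms.sort": "count",
        "terms.mincount": 1,
        "terms.limit": limit,
        "terms.regex.flag": "case_insensitive",
        "terms.regex": pattern,
        "terms.fl": field,
    }
    # urlencode percent-escapes characters such as '*' in the regex.
    return BASE_URL + "?" + urlencode(params)


url = build_terms_url("text", ".*bi.*")
print(url)
```

Switching the first argument to "poi" gives the second request discussed in the question.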

The response is:

    {
        responseHeader: {
            status: 0,
            QTime: 4
        },
        terms: {
            text: [
                "bistriå",
                16
            ]
        }
    }

The problem is that the resulting text is truncated. I am expecting "bistriÅ£a", which is the UTF-8 encoding of the city Bistrița. The result seems to be truncated at the special character.

The weird thing is that if I request the field name "poi" instead of "text", I get the correct response:

    http://localhost:8080/solr/terms
        ?wt=json
        &indent=true
        &terms.sort=count
        &terms.mincount=1
        &terms.limit=10
        &terms.regex.flag=case_insensitive
        &terms.regex=.*bi.*
        &terms.fl=poi

    {
        responseHeader: {
            status: 0,
            QTime: 4
        },
        terms: {
            text: [
                "bistriÅ£a",
                16
            ]
        }
    }

So here the word is not truncated.

The big difference between the two fields is their type: poi has the string type, while text has the text_general type. The text_general type is defined in the schema like this:

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

I can provide more details if asked. I am not sure what else I can add without bloating the question too much.

You may want to consider using ASCIIFoldingFilterFactory in the text_general field type to handle the special character appropriately. Additionally, please refer to the language analysis support provided by Solr, which may be of use to you.
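One possible reading of the truncation, sketched in Python: the lowercase "å" in the /terms output suggests the index holds the literal characters "Å£" (UTF-8 bytes read as Latin-1) rather than a single "ţ". A tokenizer that splits on non-letter symbols would then break the word at "£", and the lowercase filter would turn "Å" into "å". The regex `\w+` below is only a rough stand-in for Solr's StandardTokenizer, not its actual implementation, and the double-encoding itself is an assumption about the indexed data.

```python
import re

# Assumed original spelling, with U+0163 (t with cedilla).
original = "bistri\u0163a"

# Assumption: the UTF-8 bytes were decoded as Latin-1 somewhere before indexing.
mojibake = original.encode("utf-8").decode("latin-1")
print(mojibake)

# Rough approximation of the analysis chain: split on non-word
# characters (the pound sign is a symbol, not a letter), then lowercase.
tokens = re.findall(r"\w+", mojibake)
lowered = [token.lower() for token in tokens]
print(lowered)

# Reversing the wrong decoding step recovers the intended word.
repaired = mojibake.encode("latin-1").decode("utf-8")
print(repaired == original)
```

Under these assumptions the first token matches the term returned for the text field, while a string field such as poi is not tokenized and so keeps the whole value. If this reading is right, the data would also need to be indexed with the correct encoding, since analysis filters only run after the tokenizer has already split the word.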
