php - Solr searching with /terms
I have a PHP application that uses a Solr database. The problem appeared when doing a /terms request (terms doc).
So the parts of the document that are of interest are:
poi: "bistriţa", ... text: [ "ddt", "numeric", "/14/gagaga 2/11/economics/17/datenow", "/20/daniel_same/11/economics/17/datenow", "0/gagaga 2", "1/gagaga 2/economics", "2/gagaga 2/economics/datenow", "0/daniel_same", "1/daniel_same/economics", "2/daniel_same/economics/datenow", "ppla", "seat of first-order administrative division", "/19/daniel_same/1071/plurinational state of bolivia/2269/cuba/2272/bistriţa", "0/daniel_same", "1/daniel_same/plurinational state of bolivia", "2/daniel_same/plurinational state of bolivia/cuba", "3/daniel_same/plurinational state of bolivia/cuba/bistriţa", "0/undefined_activity", "year", "0/1999", "0/1999", "measured", "", "utf8" ],
and the request is:
http://localhost:8080/solr/terms?wt=json&indent=true&terms.sort=count&terms.mincount=1&terms.limit=10&terms.regex.flag=case_insensitive&terms.regex=.*bi.*&terms.fl=text
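For anyone reproducing the request outside the PHP application, the following is a minimal sketch in Python of how the same query string could be assembled with proper URL encoding. The helper name `build_terms_url` is hypothetical; the endpoint and the parameters are the ones from the question.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint taken from the question; adjust host, port and core as needed.
BASE_URL = "http://localhost:8080/solr/terms"


def build_terms_url(field, pattern, limit=10):
    """Hypothetical helper: build a /terms request URL for one field."""
    params = {
        "wt": "json",
        "indent": "true",
        "terms.sort": "count",
        "terms.mincount": 1,
        "terms.limit": limit,
        "terms.regex.flag": "case_insensitive",
        "terms.regex": pattern,
        "terms.fl": field,
    }
    # urlencode percent-encodes reserved characters such as '*' in the regex.
    return BASE_URL + "?" + urlencode(params)


url = build_terms_url("text", ".*bi.*")
print(url)
```

Swapping `"text"` for `"poi"` gives the second request discussed in the question.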
The response is:
{ responseHeader: { status: 0, QTime: 4 }, terms: { text: [ "bistriÅ", 16 ] } }
The problem is that the resulting text is truncated. I was expecting "bistriÅ£a", which is the UTF-8 encoding of the city name Bistrița. The result seems to be truncated at the special character.
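To make the byte-level picture concrete, here is a small Python sketch (not part of the original application). It assumes the stored character is "ţ" (t with cedilla, U+0163), as in the poi field, and that the garbled form comes from the UTF-8 bytes being displayed as Latin-1. The slice at the end is only an illustration of a term cut off inside the two-byte sequence; it does not claim to reproduce what Solr does internally.

```python
# "bistriţa" with t-cedilla (U+0163), as stored in the poi field.
city = "bistri\u0163a"

# The character U+0163 takes two bytes in UTF-8: 0xC5 0xA3.
utf8_bytes = city.encode("utf-8")
print(utf8_bytes)

# Reading those bytes as Latin-1 yields the form expected in the question.
mojibake = utf8_bytes.decode("latin-1")
print(mojibake)

# Keeping only the first byte of the two-byte sequence gives a string
# that stops right at the special character.
truncated = utf8_bytes[:7].decode("latin-1")
print(truncated)
```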
The weird thing is that if I request the field named "poi" instead of "text", I get the correct response:
http://localhost:8080/solr/terms?wt=json&indent=true&terms.sort=count&terms.mincount=1&terms.limit=10&terms.regex.flag=case_insensitive&terms.regex=.*bi.*&terms.fl=poi
{ responseHeader: { status: 0, QTime: 4 }, terms: { text: [ "bistriţa", 16 ] } }
So here the word is not truncated.
The big difference between the two fields is their type: poi has type string, while text has type text_general. The text_general type is defined in the schema like this:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
I can provide more details if asked. I am not sure what else to add, and I do not want to bloat the question too much.
You may want to consider using ASCIIFoldingFilterFactory in the text_general field type to handle the special character appropriately. Additionally, please refer to the language analysis support provided by Solr, which may be of use to you.
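As a rough illustration of what ASCII folding does to a term, the Python sketch below strips diacritics via Unicode decomposition. This is only an approximation written for this answer: Solr's filter uses its own mapping table and also covers characters that do not decompose, and `fold_to_ascii` is a hypothetical name, not a Solr API.

```python
import unicodedata


def fold_to_ascii(text):
    """Approximate ASCII folding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Both spellings of the city name fold to the same plain-ASCII term.
print(fold_to_ascii("bistri\u0163a"))  # t with cedilla (U+0163)
print(fold_to_ascii("bistri\u021ba"))  # t with comma below (U+021B)
```

Note that with folding applied at index time, /terms would return the folded form of the term rather than the original spelling, and existing documents would need to be reindexed after the schema change.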