Solr DataImportHandler - indexing multiple, related XML documents
Let's say I have two XML document types, a and b, like this:
a:
<xml>
  <a>
    <name>first number</name>
    <num>1</num>
  </a>
  <a>
    <name>second number</name>
    <num>2</num>
  </a>
</xml>
b:
<xml>
  <b>
    <akey>1</akey>
    <value>one</value>
  </b>
  <b>
    <akey>2</akey>
    <value>two</value>
  </b>
</xml>
I'd like to index them like this:
<doc>
  <str name="name">first name</str>
  <int name="num">1</int>
  <str name="spoken">one</str>
</doc>
<doc>
  <str name="name">second name</str>
  <int name="num">2</int>
  <str name="spoken">two</str>
</doc>
So, in effect, I'm trying to use a value from a as a key into b. Using DataImportHandler, I've used the following data config definition:
<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8" />
  <document>
    <entity name="document" transformer="LogTransformer" logLevel="trace"
            processor="FileListEntityProcessor" baseDir="/tmp/somedir"
            fileName="a.*.xml$" recursive="false" rootEntity="false"
            dataSource="null">
      <entity name="a"
              transformer="RegexTransformer,TemplateTransformer,LogTransformer"
              logLevel="trace" processor="XPathEntityProcessor"
              url="${document.fileAbsolutePath}" stream="true"
              rootEntity="true" forEach="/xml/a">
        <field column="name" xpath="/xml/a/name" />
        <field column="num" xpath="/xml/a/num" />
        <entity name="b" transformer="LogTransformer"
                processor="XPathEntityProcessor" url="/tmp/somedir/b.xml"
                stream="false" forEach="/xml/b" logLevel="trace">
          <field column="spoken" xpath="/xml/b/value[../akey=${a.num}]" />
        </entity>
      </entity>
    </entity>
  </document>
</dataConfig>
However, I encounter two problems:
- I can't get the XPath expression with the predicate to match any rows, regardless of whether I use an alternative such as /xml/b[akey=${a.num}]/value, or even a hardcoded value for akey.
- Even when I remove the predicate, the parser goes through the b file once for every row in a, which is inefficient.
My question is: how, in light of the problems listed above, do I index the data correctly and efficiently with DataImportHandler?
I'm using Solr 3.6.2.
Note: this is a bit similar to this question, but it deals with two XML document types instead of an RDBMS and an XML document.
I have had bad experiences using DataImportHandler for this kind of data. A simple Python script to merge your data would probably be smaller than your current configuration and much more readable. Depending on your requirements and data size, you could create a temporary XML file or pipe the results directly to Solr. If you really have to use DataImportHandler, you could use a URLDataSource and set up a minimal server that generates the XML. Obviously I'm a Python fan, but it's quite likely that this is also an easy job in Ruby, Perl, ...
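To give an idea of what such a merge script might look like, here is a minimal sketch that uses only the standard library. It assumes both files fit in memory and that num in a matches akey in b exactly as text; the function names (build_lookup, merge, to_solr_add_xml) and the sample constants are made up for illustration. The b document is read once into a dictionary, so each a record costs one lookup instead of a rescan of b.

```python
import xml.etree.ElementTree as ET

# Sample data copied from the question (hypothetical constants for the demo).
A_SAMPLE = """<xml>
  <a><name>first number</name><num>1</num></a>
  <a><name>second number</name><num>2</num></a>
</xml>"""

B_SAMPLE = """<xml>
  <b><akey>1</akey><value>one</value></b>
  <b><akey>2</akey><value>two</value></b>
</xml>"""


def build_lookup(b_xml):
    """Map each <akey> in the b document to its <value>, reading b only once."""
    root = ET.fromstring(b_xml)
    return {b.findtext("akey"): b.findtext("value") for b in root.iter("b")}


def merge(a_xml, b_xml):
    """Return one dict per <a> record, joined with b on num == akey."""
    spoken = build_lookup(b_xml)
    docs = []
    for a in ET.fromstring(a_xml).iter("a"):
        num = a.findtext("num")
        docs.append({
            "name": a.findtext("name"),
            "num": num,
            "spoken": spoken.get(num),  # None when b has no matching akey
        })
    return docs


def to_solr_add_xml(docs):
    """Serialize the merged records as a Solr <add> update message."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for field, value in doc.items():
            if value is None:  # skip fields that had no match in b
                continue
            ET.SubElement(doc_el, "field", name=field).text = value
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    print(to_solr_add_xml(merge(A_SAMPLE, B_SAMPLE)))
```

The resulting string can be written to a temporary file or posted to Solr's XML update handler. The exact URL depends on your setup, so treat anything like /solr/update as an assumption, and remember to send a commit afterwards.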