
TikaEntityProcessor: assorted examples

1.

<dataConfig>
  <dataSource type="BinFileDataSource" />
  <script><![CDATA[
    function setIdType(row) {
      row.put('id', 'file::' + row.get('fileAbsolutePath'));
      row.put('type', 'file');
      return row;
    }
  ]]></script>
  <document>
    <entity name="tika-test" processor="TikaEntityProcessor"
            url="C:\Users\Administrator\Desktop\测试素材\URL URI.pdf"
            format="text" transformer="script:setIdType">
      <field name="file_author" column="Author" meta="true" />
      <field name="file_title" column="title" meta="true" />
      <field name="file_text" column="text" />
    </entity>
  </document>
</dataConfig>

 

2.

<dataConfig>
  <script><![CDATA[
    id = 1;
    function GenerateId(row) {
      row.put('id', (id ++).toFixed());
      return row;
    }
  ]]></script>
  <dataSource type="BinURLDataSource" name="data"/>
  <dataSource type="URLDataSource" baseUrl="http://localhost/tmp/bin/" name="main"/>
  <document>
    <entity name="rec" processor="XPathEntityProcessor" url="data.xml" forEach="/albums/album" dataSource="main" transformer="script:GenerateId">
      <field column="title" xpath="//title" />
      <field column="description" xpath="//description" />
      <entity processor="TikaEntityProcessor" url="http://localhost/tmp/bin/${rec.description}" dataSource="data">
        <field column="text" name="content" />
        <field column="Author" name="author" meta="true" />
        <field column="title" name="title" meta="true" />
      </entity>
    </entity>
  </document>
</dataConfig>

 3.

Configuring a Clob field in Solr

<document name="bulletin">
  <entity name="item" pk="uuid" transformer="ClobTransformer" query="select * from no_bulletin">
    <field column="UUID" name="id"/>
    <field column="CONTENT" name="content" clob="true"/>
  </entity>
</document>

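Note: transformer="ClobTransformer" on the entity and clob="true" on the field are required to configure a Clob field. The column name CONTENT must be uppercase; otherwise ClobTransformer is not applied to it. (Replace the SQL statement in query with your own.)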

 

Configuring a Blob field in Solr

<dataSource name="f1" type="FieldStreamDataSource"/>
<dataSource name="orcle" driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@192.168.196.253:1521:orcl" user="sample_bus" password="sample_bus"/>
<document>
  <entity dataSource="orcle" name="attach" query="select att_id,content from no_bul_attcontent where att_id='645cf16b40d4472ca649084c6aa099fe'">
    <field column="ATT_ID" name="id"/>
    <entity dataSource="f1" processor="TikaEntityProcessor" url="content" dataField="attach.CONTENT">
      <field column="text" name="docContent"/>
    </entity>
  </entity>
</document>

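Note: the url attribute has no effect here and can be removed. (If the data source is a local file rather than a database, url is the file path, e.g. url="d:/path ${f.fileAbsolutePath}", where f is the name of the parent entity.)

If the url is wrong, an "invalid SQL statement" error is reported.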
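For the local-file case mentioned in the note above, the following is a minimal sketch rather than a tested configuration. It assumes the files are listed by a parent FileListEntityProcessor entity named f. The baseDir and fileName values, the data source name bin, and the target field names are placeholders chosen for illustration.

```xml
<dataConfig>
  <!-- Binary data source used by the inner entity to read each file -->
  <dataSource type="BinFileDataSource" name="bin" />
  <document>
    <!-- Parent entity "f" lists the files; baseDir and fileName are placeholder values -->
    <entity name="f" processor="FileListEntityProcessor"
            baseDir="d:/path" fileName=".*\.pdf" recursive="true"
            rootEntity="false" dataSource="null">
      <field column="fileAbsolutePath" name="id" />
      <!-- Inner entity parses each file; "f" in the url is the parent entity's name -->
      <entity name="tika" processor="TikaEntityProcessor"
              url="${f.fileAbsolutePath}" dataSource="bin" format="text">
        <field column="text" name="docContent" />
      </entity>
    </entity>
  </document>
</dataConfig>
```

Here url does matter: it is resolved for each row of the parent entity to the path of the file that Tika should parse.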

In dataField, attach is the name of the parent entity. attach.CONTENT must be uppercase; otherwise the import fails with: No field available for name : attach.content Processing Document # 1.

 

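Important: the Blob column name in the database must not be the same as the name of the corresponding field in schema.xml. Otherwise, the imported value of the Blob field is <str name="abc">oracle.sql.BLOB@1042c25</str>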
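One possible way to avoid such a name clash, shown here only as a sketch, is to alias the Blob column in the SQL query. The alias blob_content is a made-up name and is assumed not to match any field in schema.xml. The where clause of the earlier query is left out for brevity. Because Oracle returns unquoted aliases in uppercase, dataField refers to attach.BLOB_CONTENT.

```xml
<dataSource name="f1" type="FieldStreamDataSource"/>
<dataSource name="orcle" driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@192.168.196.253:1521:orcl" user="sample_bus" password="sample_bus"/>
<document>
  <!-- The Blob column is aliased so that its name differs from every schema field (assumption) -->
  <entity dataSource="orcle" name="attach"
          query="select att_id, content as blob_content from no_bul_attcontent">
    <field column="ATT_ID" name="id"/>
    <!-- dataField uses the parent entity name plus the uppercase alias -->
    <entity dataSource="f1" processor="TikaEntityProcessor"
            url="blob_content" dataField="attach.BLOB_CONTENT">
      <field column="text" name="docContent"/>
    </entity>
  </entity>
</document>
```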
