Integrating Apache Solr with Alfresco WCM for Faceted Search and Navigation of Next-Generation Web Sites Vagif Jalilov Rivet Logic.

Integrating Apache Solr with
Alfresco WCM for Faceted Search
and Navigation of Next-Generation
Web Sites
Vagif Jalilov
Rivet Logic
About Rivet Logic
• Award-winning professional services focused on:
– Enterprise Content Management
– Web Content Management
– Collaboration and Social Communities
• Using Leading Open Source Software
Business Case for Alfresco & Solr
• Large scale sites
• Need for real-time updates
• Full-text search
• Faceted search
Technical Challenges for Search
• Accurately index each page
– Solution: Assembly of relevant content to index
• Targeted, real-time indexing
– Solution: Trigger indexing from publishing mechanism
Possible Index Solutions
• Spidering/Crawling
– Follow navigational & cross-links
– Parse HTML and fetch relevant content
– Spider full (or partial) site each time
• Real-time Indexing
– Triggered by FSR deployment
– Process only change-set (incremental updates)
– Assemble relevant page content
Typical Web Application
Source Control
• Source code & libs
• View templates
• Site navigation
• Web content
CMS (Alfresco)
• Binary Content
“Managed” (Riveted) Web Application
Source Control
• Source code & libs
• (View templates)
CMS (Alfresco)
• Binary Content
• Web Content
• Site Navigation
• (View templates)
Page Composition
Metacontent.xml
Pagemetadata.xml
dynamic
Sectionhtml.xml
dynamic
Relatedlinks.xml
Supportingitems.xml
Content Delivery
(http://crafterrivet.org)
Alfresco WCM Lifecycle
Indexing Architecture
Solr Customizations
• Custom Solr
– schema.xml
• Fields (type, indexed/stored)
• Unique key
– solrconfig.xml
• “dismax” type request handler to define queried fields
• ExtractingRequestHandler (indexing rich-text documents)
Custom Solr Schema
<fields>
  <field name="page_url" type="string" indexed="true" stored="true" required="true"/>
  <field name="page_title" type="text" indexed="true" stored="true"/>
  <field name="page_category" type="string" indexed="true" stored="true"/>
  <field name="page_type" type="string" indexed="true" stored="true"/>
  <field name="page_last_modified" type="date" indexed="true" stored="true"/>
  <field name="page_text" type="text" indexed="true" stored="true"/>
  <field name="page_file_size" type="int" indexed="false" stored="true"/>
</fields>
<uniqueKey>page_url</uniqueKey>
ExtractingRequestHandler
<!-- Solr Cell: http://wiki.apache.org/solr/ExtractingRequestHandler -->
<requestHandler name="/update/extract"
class="org.apache.solr.handler.extraction.ExtractingRequestHandler"
startup="lazy">
<lst name="defaults">
<str name="fmap.content">page_text</str>
<str name="fmap.title">page_title</str>
<str name="uprefix">ignored_</str>
</lst>
</requestHandler>
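<!-- schema.xml: extracted fields with no mapping get the ignored_ prefix (uprefix) and are dropped here -->
<dynamicField name="ignored_*" type="ignored"/>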
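import java.io.File;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
// Stream a rich-text file (e.g. a PDF) to Solr Cell for text extraction.
ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
up.addFile(new File(filePath));
// page_url is the required unique key, so pass it as a literal field value.
up.setParam("literal.page_url", pageUrl);
SolrServer solrServer = new CommonsHttpSolrServer(solrServerUrl);
solrServer.request(up);
solrServer.commit();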
Custom RequestHandler
<!-- DisMaxRequestHandler allows easy searching across multiple fields
for simple user-entered phrases. Its implementation is now
just the standard SearchHandler with a default query type
of "dismax".
see http://wiki.apache.org/solr/DisMaxRequestHandler
-->
<requestHandler name="solrDemoDismax" class="solr.SearchHandler">
<lst name="defaults">
<str name="defType">dismax</str>
<str name="qf">
page_title^5.0 page_text^1.0
</str>
</lst>
</requestHandler>
Compilation
• Compiler Engine processes all instructions
• Dispatches to appropriate Page Type Compiler
Content Deployment & Solr Update
Compiler Instructions
<updates deploy-root="/path/to/content/root">
...
<update>/solutions/security/article.xml</update>
<delete>/products/widget/top-section.xml</delete>
...
</updates>
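A minimal sketch of how a change-set like the one above might be read before compilation, using only the JDK's DOM parser. The class name ChangeSet and its fields are hypothetical and do not come from the original system; only the element and attribute names (updates, update, delete, deploy-root) are taken from the slide.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical holder for one parsed set of compiler instructions. */
class ChangeSet {
    final String deployRoot;
    final List<String> updates = new ArrayList<>();
    final List<String> deletes = new ArrayList<>();

    ChangeSet(String deployRoot) {
        this.deployRoot = deployRoot;
    }

    /** Parses an <updates> document into lists of updated and deleted paths. */
    static ChangeSet parse(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            ChangeSet cs = new ChangeSet(root.getAttribute("deploy-root"));
            NodeList children = root.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node n = children.item(i);
                if (n.getNodeType() != Node.ELEMENT_NODE) {
                    continue; // skip whitespace and other text between elements
                }
                String path = n.getTextContent().trim();
                if ("update".equals(n.getNodeName())) {
                    cs.updates.add(path);
                } else if ("delete".equals(n.getNodeName())) {
                    cs.deletes.add(path);
                }
            }
            return cs;
        } catch (Exception e) {
            throw new IllegalArgumentException("Invalid compiler instructions", e);
        }
    }

    public static void main(String[] args) {
        ChangeSet cs = parse("<updates deploy-root=\"/path/to/content/root\">"
                + "<update>/solutions/security/article.xml</update>"
                + "<delete>/products/widget/top-section.xml</delete>"
                + "</updates>");
        System.out.println("update: " + cs.updates + ", delete: " + cs.deletes);
    }
}
```

Keeping updates and deletes in separate lists lets an indexer re-compile only the changed pages and issue Solr deletes for the removed ones, which matches the incremental approach described earlier.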
Compilation Types
1. Web Pages (HTML)
2. Rich Text (PDF)
Web Page Compilation & Indexing
Indexer
Instructions
HTML Indexer Instruction
<?xml version="1.0" encoding="ISO-8859-1"?>
<add>
<doc>
<field name="page_url">/solutions/content-mgmt/overview.html</field>
<field name="page_title">Increase productivity and streamline workflow
throughout the enterprise</field>
<field name="page_description">Commercial enterprises and government agencies
face significant challenges as they strive to meet a rapidly growing need to
manage thousands ...</field>
<field name="page_category">Solutions</field>
<field name="page_type">Web Page</field>
<field name="page_last_modified">2009-12-18T15:03:57Z</field>
<field name="page_text">Rivet Logic addresses many of today's workplace
challenges with Enterprise Content Management (ECM) solutions that enable
organizations to transform traditional content repositories and static
intranets into dynamic, collaborative work environments through open source
functionality. Through ...</field>
</doc>
</add>
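As an illustration only, an instruction document of this shape could be produced from a map of field values with proper XML escaping. The class IndexerInstructionWriter and its methods are hypothetical names, not the compiler's actual API; the output follows the add/doc/field structure shown above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical writer that renders field values as a Solr <add> document. */
class IndexerInstructionWriter {

    /** Escapes characters that are special in XML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Renders the fields, in map iteration order, as a single <doc> inside <add>. */
    static String toAddXml(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("<add>\n<doc>\n");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append("<field name=\"").append(escape(e.getKey())).append("\">")
              .append(escape(e.getValue()))
              .append("</field>\n");
        }
        return sb.append("</doc>\n</add>\n").toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("page_url", "/solutions/content-mgmt/overview.html");
        fields.put("page_category", "Solutions");
        fields.put("page_type", "Web Page");
        System.out.print(toAddXml(fields));
    }
}
```

Escaping matters here because assembled page text frequently contains ampersands and angle brackets that would otherwise make the update request malformed.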
Rich Text Compilation & Indexing
Rich Text Indexer Instruction
<?xml version="1.0" encoding="ISO-8859-1"?>
<add>
<doc>
<field name="page_file">/docroot/static/about-us/pressreleases/2010/rl_crafter_studio.pdf</field>
<field name="page_url">/about-us/pressreleases/2010/rl_crafter_studio.pdf</field>
<field name="page_title">Rivet Logic launches Crafter Studio for
user friendly Web content authoring and publishing.</field>
<field name="page_category">News</field>
<field name="page_type">Press Release</field>
<field name="page_last_modified">2007-12-19T08:00:00Z</field>
<field name="page_file_size">135</field>
</doc>
</add>
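One plausible way to hand these values to the ExtractingRequestHandler is as literal.<fieldname> request parameters, which Solr Cell adds to the document alongside the extracted text. The sketch below assumes that page_file is only the location of the binary to stream and is therefore not sent as a field; that reading, and the class name ExtractParams, are assumptions rather than something the slides state.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that maps instruction fields to literal.* parameters. */
class ExtractParams {

    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    /** Builds a query string of literal.<field>=<value> pairs in map order. */
    static String toQueryString(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if ("page_file".equals(e.getKey())) {
                continue; // assumed to be the file to upload, not an indexed field
            }
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append("literal.").append(e.getKey()).append('=').append(encode(e.getValue()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("page_file", "/docroot/static/about-us/pressreleases/2010/rl_crafter_studio.pdf");
        fields.put("page_url", "/about-us/pressreleases/2010/rl_crafter_studio.pdf");
        fields.put("page_type", "Press Release");
        System.out.println("/update/extract?" + toQueryString(fields));
    }
}
```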
Compiler Configuration
<compiler-config>
  <page-types>
    <page-type
        name="Solution Page"
        compiler="com.rivetlogic.index.compile.ArticleCompiler">
      <uri-pattern pattern=".*/page-content/solutions/.*(article|page-metadata|metacontent).xml$" />
      <properties>
        <property field="page_type" value="Web Page"/>
        <property field="page_category" value="Solutions"/>
      </properties>
    </page-type>
    <page-type
        name="Press Release Page"
        compiler="com.paetec.index.model.compile.PressReleaseCompiler">
      <uri-pattern pattern=".*/press-releases/.*/(press-release|meta-content).xml$" />
      <properties>
        <property field="page_type" value="Press Release"/>
        <property field="page_category" value="News"/>
      </properties>
    </page-type>
  </page-types>
</compiler-config>
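The dispatch step implied by this configuration can be sketched as a first-match lookup over the uri-pattern regular expressions. PageTypeResolver is a hypothetical class, the sample paths in main are invented for illustration, and whole-path matching is an assumption about how the real Compiler Engine applies the patterns.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical resolver that picks a page type by matching a changed file's path. */
class PageTypeResolver {
    // Insertion order is preserved so that earlier page types take precedence.
    private final Map<String, Pattern> patterns = new LinkedHashMap<>();

    void register(String pageTypeName, String uriPattern) {
        patterns.put(pageTypeName, Pattern.compile(uriPattern));
    }

    /** Returns the first page type whose pattern matches the whole path, or null. */
    String resolve(String path) {
        for (Map.Entry<String, Pattern> e : patterns.entrySet()) {
            if (e.getValue().matcher(path).matches()) {
                return e.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        PageTypeResolver r = new PageTypeResolver();
        r.register("Solution Page",
                ".*/page-content/solutions/.*(article|page-metadata|metacontent).xml$");
        r.register("Press Release Page",
                ".*/press-releases/.*/(press-release|meta-content).xml$");
        System.out.println(r.resolve("/site/page-content/solutions/security/article.xml"));
        System.out.println(r.resolve("/site/press-releases/2010/press-release.xml"));
    }
}
```

Because several source files (article, page metadata, metacontent) map to the same page type, a change to any one of them can trigger recompilation of the whole page.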
Search UI
• Full text search
• Faceted search on category & type
• Pagination or search result clustering
• Keyword highlighting in search results
• Track user queries
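To make the UI features above concrete, here is a sketch of the kind of Solr select URL a search page might issue, combining the dismax handler, facets on category and type, highlighting, and pagination. The class FacetedSearchUrl, the zero-based page argument, and the choice to highlight page_text are assumptions; the query parameters themselves (qt, q, fq, facet, facet.field, facet.mincount, hl, hl.fl, start, rows) are standard Solr parameters.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical builder for a faceted, highlighted, paginated search request. */
class FacetedSearchUrl {

    static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    /**
     * @param category selected page_category facet value, or null for no filter
     * @param page     zero-based page index
     * @param rows     results per page
     */
    static String build(String solrBaseUrl, String userQuery, String category,
                        int page, int rows) {
        StringBuilder url = new StringBuilder(solrBaseUrl);
        url.append("/select?qt=solrDemoDismax");
        url.append("&q=").append(enc(userQuery));
        // Ask for counts per category and type, hiding empty buckets.
        url.append("&facet=true&facet.mincount=1");
        url.append("&facet.field=page_category&facet.field=page_type");
        if (category != null) {
            // Narrow results to the facet value the user clicked.
            url.append("&fq=").append(enc("page_category:\"" + category + "\""));
        }
        url.append("&hl=true&hl.fl=page_text");
        url.append("&start=").append(page * rows).append("&rows=").append(rows);
        return url.toString();
    }

    public static void main(String[] args) {
        System.out.println(build("http://localhost:8983/solr", "content management", "News", 0, 10));
    }
}
```

Using fq for the selected facet keeps the filter out of the relevance score, so clicking a category narrows the result set without reordering it.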
Search Results Page
Clustered Results
Summary
• Requirements:
– Real time updates
– Full editorial control
– Faceted search
• Solution:
– Alfresco CMS
– Alfresco plugin for Solr indexing
– Compile updates & index
– Serve in UI (full-text search + facets)
Q&A
• Thank you for attending :-)
• Questions, comments…
Appendix
Search Model/API