IBM Content Analytics with Enterprise Search


IBM Confidential
IBM Content Analytics with Enterprise Search
BigInsights Integration
Mar 28th, 2012
IBM Software Group | YSL
IBM Software Group | Yamato Software Development Laboratory
Challenge and Approach
▪ Challenge
– Achieve massive scale-out
– Utilize cloud environment as resource pool
▪ Approach
– Keep compatibility with current version to respect existing customers
• No end user impact
• Seamless administration
– Utilize current assets
• UIMA Infrastructure
• UIMA Annotators (LW, System-T, Takmi,…)
• Various data source crawlers
• …
– Utilize BigInsights as scale-out infrastructure
Seamless Scale-out Scenario
▪ ICA V3.0 offers 3 types of system configuration according to the volume of data
– POC with small data can be done on a single workstation
– Production system will be deployed to 1 to N servers
– Production system analyzing big data will utilize BigInsights
* BigInsights is supported only on Linux
About InfoSphere BigInsights
▪ IBM InfoSphere BigInsights brings the power of Hadoop to the enterprise. … BigInsights enhances this technology to withstand the demands of your enterprise, adding administrative, workflow, provisioning, and security features, along with best-in-class analytical capabilities from IBM Research….
▪ InfoSphere BigInsights is a prerequisite
– Version 1.3 is officially supported
Feature Overview: Collection on BigInsights
▪ Search & Text Analytics Capability
– UIMA
– System-T
– Gumshoe
▪ Scale Out
– IBM Hadoop
– ILEL BigIndex
▪ Flexible Job Flow
– Orchestrator (a.k.a. MetaTracker)
▪ Easy Data Manipulation
– JAQL
▪ Robust File System
– GPFS (Shared Nothing Cluster version, not yet released)
ICA V3.0 Analytics Flow on BigInsights
[Diagram: ICA document processing flow on IBM InfoSphere BigInsights. Labels recoverable from the diagram (grouping is approximate):]
– Crawler: various data sources → RDS
– Regular collection (Indexing Service Process): Importer, Local Analysis (UIMA base), Global Analysis, Indexing, Exporter; OS cache
– BigInsights collection: job request; job flow controlled by Orchestrator (MetaTracker); operation by JAQL; storage on HDFS/GPFS
– Pre-Processing: Dup Doc Elimination
– UIMA Analysis: UIMA Annotators (LanguageWare, TAKMI, User Custom)
– System-T Analysis
– ICA GA: Gumshoe LA, Gumshoe GA, Link Analysis, Facet Grouping, Custom GA; Custom Data
– Indexing: BigIndex, Slave Index
– Text Analytics / Search Runtime (Gumshoe Relevancy), serving the UI and other applications
How Jaql and Hadoop Map/Reduce work with ICA
Example: Eliminate duplicate documents in RDS with Jaql/Hadoop
parseRds = fn(rdsFiles:[{path:string,offset:long}], options, output:schematype=null) (
  mapReduce({
    input: rdsFileDescriptor(rdsFiles[*].path, keepRemoved=false),
    output: (if(isnull(output)) (HadoopTemp()) else (output)),
    // Map: key each record by its URI
    map: fn(v) (v -> transform [$.uri, $]),
    // Reduce: take the newest record per URI; emit it only if its code is positive and not 4050
    reduce: fn(k,v) (
      d = v -> sort by [$.sequenceNumber desc],
      if (d[0].code > 0 and d[0].code != 4050) ([d[0]]) else ([])
    ),
    …
Data flow for the example:
Input Format (RDS), first file:
  {uri=“A”,seqno=0,…},
  {uri=“B”,seqno=1,…},
  …
Input Format (RDS), second file:
  {uri=“A”,seqno=100,…},
  {uri=“C”,seqno=101,…},
  …
Map output, first file:
  Key=“A”, value={uri=“A”, seqno=0,…}
  Key=“B”, value={uri=“B”, seqno=1,…}
  …
Map output, second file:
  Key=“A”, value={uri=“A”, seqno=100,…}
  Key=“C”, value={uri=“C”, seqno=101,…}
  …
Reduce input:
  Key=“A”, Values=[{uri=“A”, seqno=100,…}, {uri=“A”, seqno=0,…}]
  Key=“B”, Values=[{uri=“B”, seqno=1,…}]
  Key=“C”, Values=[{uri=“C”, seqno=101,…}]
Output Format (Json), first output:
  {uri=“A”,seqno=100,…}
  …
Output Format (Json), second output:
  {uri=“B”,seqno=1,…},
  {uri=“C”,seqno=101,…},
  …
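The map, group, and reduce steps of the example can be modeled outside Hadoop to make the logic easier to follow. The following is a minimal Python sketch, not the ICA implementation: it mirrors the Jaql fragment's field names (uri, sequenceNumber, code), while the function names, the in-memory grouping, and the sample code value 200 are assumptions made for illustration. The slides do not say what code 4050 means, so it is treated only as "the excluded value".

```python
from collections import defaultdict

EXCLUDED_CODE = 4050  # excluded by the slide's reduce step; its meaning is not given there


def map_phase(records):
    """Emit (uri, record) pairs, like `transform [$.uri, $]` in the Jaql fragment."""
    for record in records:
        yield record["uri"], record


def reduce_phase(uri, records):
    """Keep the record with the highest sequenceNumber, if its code qualifies."""
    newest = max(records, key=lambda r: r["sequenceNumber"])
    if newest["code"] > 0 and newest["code"] != EXCLUDED_CODE:
        return [newest]
    return []


def dedup(*rds_files):
    """Group mapped records by key (the shuffle step), then reduce each group."""
    groups = defaultdict(list)
    for rds_file in rds_files:
        for uri, record in map_phase(rds_file):
            groups[uri].append(record)
    result = []
    for uri in sorted(groups):
        result.extend(reduce_phase(uri, groups[uri]))
    return result


# Sample data shaped like the slide's example; code=200 is a made-up "valid" value.
rds_1 = [
    {"uri": "A", "sequenceNumber": 0, "code": 200},
    {"uri": "B", "sequenceNumber": 1, "code": 200},
]
rds_2 = [
    {"uri": "A", "sequenceNumber": 100, "code": 200},
    {"uri": "C", "sequenceNumber": 101, "code": 200},
]
deduplicated = dedup(rds_1, rds_2)
print([(r["uri"], r["sequenceNumber"]) for r in deduplicated])
```

As in the slide, only the newer record for URI "A" survives, and "B" and "C" pass through unchanged. In the real system the grouping is done by the Hadoop shuffle across servers rather than by an in-memory dictionary.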
Differences: In general
Item | Regular collection | BigInsights collection
Time to refresh index | Quick | Lazy
Scalability | Under 10 servers | Over 100 servers
Flexibility | System must have peak capacity | System resources can be allocated as required
Best for the use case | Documents are continuously added/removed/updated; can have a powerful server | A large number of documents is processed at once; BigInsights is already in place; flexibility is needed
Easy Configuration
▪ Specify BigInsights server information
– The admin user can confirm the setting in the Topology View
▪ Specify “Use IBM BigInsights” while creating a collection
– Configuration files, ICA libraries, UIMA PEARs (including custom PEARs), and other required modules are then distributed to the BigInsights servers automatically
Storage requirement with BigInsights
▪ ICA
– ES_NODE_ROOT should be shared on all nodes to share configuration and other resources
▪ BigInsights
– Jaql and Map/Reduce use local storage as temporary storage
– HDFS also uses local storage as a part of the file system
– BigIndex also consumes local storage to merge indexes
▪ For a small cluster, it is strongly suggested to use GPFS with fibre as storage, for performance and reliability reasons
Storage requirement with BigInsights: HDFS
▪ HDFS
– Storage on each data node will be used as a part of the file system
– Capacity can be increased by adding storage to each data node or by adding new data nodes with storage
– Each block is replicated (default: 3)
– Each searcher process downloads the index from HDFS to its local file system
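Because every block is stored several times, the usable HDFS capacity is smaller than the raw disk total. The following minimal Python sketch shows the arithmetic. The node counts and disk sizes are made-up figures, the function name is hypothetical, and the estimate ignores the local space that Jaql, Map/Reduce, BigIndex merges, and the searchers' downloaded index copies also consume.

```python
def usable_hdfs_capacity_tb(data_nodes, storage_per_node_tb, replication=3):
    """Rough usable capacity: raw capacity divided by the replication factor."""
    if data_nodes < 1 or storage_per_node_tb < 0 or replication < 1:
        raise ValueError("invalid cluster description")
    raw_capacity_tb = data_nodes * storage_per_node_tb
    return raw_capacity_tb / replication


# Hypothetical cluster: 10 data nodes with 12 TB each, default replication of 3.
before = usable_hdfs_capacity_tb(10, 12)
# Adding two more data nodes with the same storage increases the usable capacity.
after = usable_hdfs_capacity_tb(12, 12)
print(before, after)
```

With these figures, 120 TB of raw disk yields roughly 40 TB of usable space, and two extra nodes raise that to roughly 48 TB.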
Storage requirement with BigInsights: Shared storage
▪ Shared storage
– High-performance storage (e.g. GPFS with fibre) will be required
– Each searcher must share the storage
– ICA servers should use the same storage as ES_NODE_ROOT
– Using NFS has some requirements; please check the release notes
Custom Global Analysis
Custom Global Analysis by JAQL
▪ Global Analysis
– Obtain new information by examining all documents in a collection
• Link Counting
• Duplicated Document Detection
• etc.
▪ Custom Global Analysis by JAQL
– Users can integrate their own Global Analysis logic using JAQL
• Input is the result of ICA document processing (field, facet, content)
• Output can be stored as a document field or facet
▪ User Benefits
– New data manipulation point across documents
• Crawler plug-ins and UIMA Annotators can manipulate data only within each document
– Manipulate data using Map/Reduce from an SQL-like script
• JAQL frees developers from writing Map/Reduce programs in Java
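To illustrate what "across documents" means, here is a minimal Python sketch of link counting, one of the global analyses named above. It is not ICA or JAQL code: the document shape and the field names links and inlink_count are hypothetical. The point is that the count for one document depends on every other document in the collection, which a per-document annotator cannot see.

```python
from collections import Counter


def count_inbound_links(documents):
    """Return copies of the documents with an added inbound-link count field."""
    counts = Counter()
    for doc in documents:
        # Count each linking document once, even if it links to a target twice.
        for target in set(doc.get("links", [])):
            counts[target] += 1
    # Store the result as a new document field, as the slide describes for output.
    return [{**doc, "inlink_count": counts[doc["uri"]]} for doc in documents]


collection = [
    {"uri": "A", "links": ["B", "C"]},
    {"uri": "B", "links": ["C", "C"]},
    {"uri": "C", "links": []},
]
analyzed = count_inbound_links(collection)
print([(d["uri"], d["inlink_count"]) for d in analyzed])
```

In a custom global analysis the same grouping and counting would be expressed as a JAQL script and run as Map/Reduce jobs, so it scales with the cluster rather than with one process's memory.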