Metadata extractors, content transformers & renditions

Metadata Extractors, Content Transformers & Renditions
Neil Mc Erlean
Who am I?
Lead Engineer in the Services Team
4 years at Alfresco (since 3.2)
Previously worked on
•Hybrid Sync
•Alfresco in the Cloud
•Various services/components
•Transformers & Extractors
•REST APIs
•Actions & Behaviours and more…
Ex-astrophysicist (of which more later)
Talk content
What data is in your content?
How does Alfresco get at it?
What does Alfresco do with it?
How can you use these features?
Introductory material
•no prior knowledge assumed
Talk content - Breaking it down
Your content & its metadata
Alternative renditions of your content
Overviews of the 3 services
Java Foundation APIs. JavaScript.
Configuring & extending Alfresco.
All code samples available as runnable tests - download
from the website.
#1 Metadata Extraction
#2 Content Transformation
Alfresco uses them to produce
•images (thumbnails)
•plain text (indexing)
•inter-Office transforms
Also generally useful
#3 Rendition Service
• Very similar to transformations
• More general service
• More than just content to content
How do these components work?
Mostly by leveraging existing OSS Java libs
•Notably Apache Tika
Some external OS processes too
•OpenOffice.org (OOo), LibreOffice
•ImageMagick
•pdf2swf (swftools)
Some bespoke impls, e.g. zip to txt
‘Embedded’ thumbnails/previews from iWorks, Office
General Considerations
CPU, memory
In process vs. out of process vs. Remote CPU
Selection of ‘best’ extractor/transformer
Stay for Andy Hunt’s talk for Support’s
troubleshooting tips
Metadata Extraction
#1 Metadata Extraction
• Triggered on content creation or update.
• or on demand
• ‘Best’ available extractor obtained from
MetadataExtracterRegistry.
• This Extractor pulls out the metadata.
• Format depends on the extractor lib/impl.
• key/value pairs
• These data are mapped onto the Alfresco
content model
• configurable mapping. <ExtractorClass>.properties
Metadata extraction - Java
MetadataExtracterRegistry registry =
    appContext.getBean("metadataExtracterRegistry",
                       MetadataExtracterRegistry.class);
ContentReader reader =
    contentService.getReader(nodeRef,
                             ContentModel.PROP_CONTENT);
// Ask the registry for the 'best' extractor for this MIME type
MetadataExtracter extractor =
    registry.getExtracter(reader.getMimetype());
Map<QName, Serializable> props =
    new HashMap<QName, Serializable>();
extractor.extract(reader,
                  OverwritePolicy.EAGER, props);
Overwrite Policy – when re-extracting
• EAGER
• overwrite whenever the extracted value is not null
• PRUDENT
• as EAGER, but only if the db property doesn’t exist or is null or “”
• CAUTIOUS
• write only if the existing property is undefined
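The three policies above can be sketched as a small decision function. This is a minimal stand-alone model, not Alfresco's own OverwritePolicy enum: the name OverwritePolicySketch, the shouldOverwrite method and the String-keyed map are all hypothetical, and the sketch assumes a null extracted value is never written under any policy (the slide states this only for EAGER and PRUDENT).

```java
import java.io.Serializable;
import java.util.Map;

// Hypothetical stand-in for the overwrite policies; not the Alfresco enum.
enum OverwritePolicySketch {
    EAGER, PRUDENT, CAUTIOUS;

    /** Decides whether an extracted value should replace what is already stored. */
    boolean shouldOverwrite(Map<String, Serializable> stored,
                            String key, Serializable extracted) {
        if (extracted == null) {
            return false; // assumption: nothing extracted means nothing to write
        }
        switch (this) {
            case EAGER:
                // Any non-null extracted value wins.
                return true;
            case PRUDENT: {
                // Only fill in properties that are missing, null or empty.
                Serializable current = stored.get(key);
                return current == null || "".equals(current);
            }
            case CAUTIOUS:
                // Only write when the property is not defined at all.
                return !stored.containsKey(key);
            default:
                return false;
        }
    }
}
```

With a stored map holding author = "Neil" and title = "", EAGER would overwrite both, PRUDENT only the empty title, and CAUTIOUS neither, since both keys are already defined.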
<ExtractorClass>.properties mapping
namespace.prefix.cm=http://www.alfresco.org/model/content/1.0
author=cm:author
title=cm:title
# Note: need to escape ':' in key names
geo\:lat=cm:latitude
geo\:long=cm:longitude
Mapping properties
• Can map extracted key-value onto multiple
content properties
• Can ignore extracted key-values i.e. not map.
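Why the ‘:’ needs escaping can be seen with plain java.util.Properties, which treats an unescaped ‘:’ in a key as a key/value separator. The sketch below is a stand-alone illustration, not Alfresco's own loader: MappingSketch and parse are hypothetical names, the my:writer target in the usage note is made up, and the comma-separated syntax for mapping one key onto several properties is an assumption about the file format.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical helper: reads mapping text into extracted-key -> target properties.
class MappingSketch {
    static Map<String, List<String>> parse(String text) {
        Properties props = new Properties();
        try {
            // Properties.load turns the escaped "\:" in a key back into ":"
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, List<String>> mapping = new TreeMap<>();
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith("namespace.prefix.")) {
                continue; // namespace declarations are not mappings
            }
            List<String> targets = new ArrayList<>();
            // Assumption: several targets are written as a comma-separated list
            for (String target : props.getProperty(key).split(",")) {
                if (!target.trim().isEmpty()) {
                    targets.add(target.trim());
                }
            }
            mapping.put(key, targets);
        }
        return mapping;
    }
}
```

Parsing the line author=cm:author, my:writer gives the key author two targets, and geo\:lat=cm:latitude gives the key geo:lat. Without the backslash, the same line is read as key geo with value lat=cm:latitude, so the mapping is silently wrong. An extracted key with no line in the file is simply not mapped.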
Metadata extraction - JavaScript
var action =
actions.create('extract-metadata');
action.execute(nodeRef);
Ways to customise & extend
• Customisation of existing extractors
• Define new mappings – to an existing or a new
content model.
• Adding new extractors
• Identify 3rd party lib that can read the binary file
• Or write your own code to do this
• Extend AbstractMappingMetadataExtracter
• Or write a Tika plugin
• Define metadata mappings
• org.alfresco.repo.content.metadata
Recap
• Metadata extraction harvests ‘hidden’ data and
maps it into Alfresco content model.
• Support for many MIME types
• Metadata insertion coming
• it’s on HEAD but currently disabled
• also maps metadata tags to cm:taggable
• “Best” extractor selection covered below
Content Transformers
Out of the box transformers
• text, html, xml
• Microsoft Office (doc & docx formats)
• OpenDocument Format
• iWorks (Keynote, Pages, Numbers)
• Images
• Shockwave Flash (SWF)
• RFC822 email, Outlook .msg email
• Adobe PDF, Illustrator, PSD
• Electronic publication (epub)
• Rich Text (RTF)
• MP3
• Archives (ZIP, tar)
• Many more
Available transformers
• No ‘graph’ of transform paths/mime types
• Spring beans extend “baseContentTransformer”
• They implement isTransformable(from, to)
• They can be
• simple (A to B)
• ‘complex’ (A to C, via B)
• failover (A to B, A to B…)
• overlapping (multiple beans for same path)
• dynamically un/available (e.g. OOo)
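The ‘complex’ and failover kinds can be pictured as wrappers around simple transformers. The sketch below is a toy model only: it treats a transformer as a function on a String rather than using Alfresco's ContentTransformer interface, and TransformerKindsSketch, complex and failover are hypothetical names.

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical model: a "transformer" is just a function on a String payload.
class TransformerKindsSketch {

    /** 'Complex': runs each step in order, feeding its output to the next (A to C, via B). */
    static UnaryOperator<String> complex(List<UnaryOperator<String>> steps) {
        return input -> {
            String current = input;
            for (UnaryOperator<String> step : steps) {
                current = step.apply(current);
            }
            return current;
        };
    }

    /** Failover: tries each alternative for the same path until one succeeds. */
    static UnaryOperator<String> failover(List<UnaryOperator<String>> alternatives) {
        return input -> {
            RuntimeException last = new IllegalStateException("no transformers configured");
            for (UnaryOperator<String> alternative : alternatives) {
                try {
                    return alternative.apply(input);
                } catch (RuntimeException e) {
                    last = e; // remember the failure and try the next alternative
                }
            }
            throw last; // every alternative failed
        };
    }
}
```

A complex transformer built from two steps behaves like a single A to C transformer to its caller, and a failover transformer only surfaces an error once every alternative has failed.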
/api/service/mimetypes webscript
http://localhost:8080/alfresco/service/mimetypes
• MIME types
• Metadata Extractors
• Content Transformers
• As services come and go (OOo), entries may
disappear
/api/service/mimetypes webscript
application/vnd.openxmlformats-officedocument.presentationml.presentation - pptx
Extractors: org.alfresco.repo.content.metadata.PoiMetadataExtracter
Transformable To:
application/pdf = Using a Direct Open Office Connection
application/vnd.ms-powerpoint = Using a Direct Open Office Connection
application/vnd.oasis.opendocument.presentation = Using a Direct Open Office Connection
application/x-shockwave-flash = Complex via: application/pdf
image/jpeg = Complex via: application/pdf
image/png = Complex via: application/pdf
text/html = org.alfresco.repo.content.transform.TikaAutoContentTransformer
text/plain = org.alfresco.repo.content.transform.TikaAutoContentTransformer
text/xml = org.alfresco.repo.content.transform.TikaAutoContentTransformer
Transformable From:
application/vnd.ms-powerpoint = Using a Direct Open Office Connection
application/vnd.oasis.opendocument.presentation = Using a Direct Open Office Connection
“Best” transformer selection
• Alfresco prefers
• available transformers (obviously)
• ‘explicit’ transformers
• previously fast transformers*
• Alfresco doesn’t understand the output quality
• pass/fail
• fast/slow
* past performance is not a guide to future performance.
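The preferences above can be read as a filter followed by an ordering. The sketch below is a minimal model under stated assumptions: SelectionSketch, Candidate and its fields are hypothetical, and ranking ‘explicit’ ahead of ‘previously fast’ is a guess at the precedence, which the slide does not spell out.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical model of "best" transformer selection.
class SelectionSketch {

    /** Illustrative candidate record; the field names are not Alfresco's. */
    static final class Candidate {
        final String name;
        final boolean available;
        final boolean explicit;
        final long averageMillis; // observed average time of past transformations

        Candidate(String name, boolean available, boolean explicit, long averageMillis) {
            this.name = name;
            this.available = available;
            this.explicit = explicit;
            this.averageMillis = averageMillis;
        }
    }

    static Optional<Candidate> pickBest(List<Candidate> candidates) {
        return candidates.stream()
                .filter(c -> c.available)                        // unavailable ones never win
                .min(Comparator
                        .comparing((Candidate c) -> !c.explicit) // false sorts first: explicit preferred
                        .thenComparingLong(c -> c.averageMillis)); // then the historically fastest
    }
}
```

Nothing in this ordering looks at output quality, which matches the slide: the only signals are pass/fail and fast/slow.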
Content Transformation - Java
ContentTransformerRegistry registry =
    appContext.getBean("contentTransformerRegistry",
                       ContentTransformerRegistry.class);
ContentReader reader = contentService.getReader
    (nodeRef, ContentModel.PROP_CONTENT);
ContentWriter writer = contentService.getWriter
    (targetNode, ContentModel.PROP_CONTENT, true);
writer.setEncoding("UTF-8");
writer.setMimetype(MimetypeMap.MIMETYPE_TEXT_PLAIN);
// Now have a reader & writer ready to go
Content Transformation – Java ctd.
ContentTransformer transformer =
registry.getTransformer
(MimetypeMap.MIMETYPE_ZIP,
reader.getSize(),
MimetypeMap.MIMETYPE_TEXT_PLAIN, null);
transformer.transform(reader, writer);
Content Transformation - JavaScript
var action = actions.create('transform');
action.parameters["destination-folder"] = node.parent;
action.parameters["assoc-type"] =
    "{http://www.alfresco.org/model/content/1.0}contains";
action.parameters["assoc-name"] =
    node.name + "transformed";
action.parameters["mime-type"] = "text/plain";
action.execute(testNode);
Config: Transformer Filtering/Debugging
• org.alfresco.service.cmr.repository.TransformationOptionLimits
• timeouts, size limits, page limits
• content.transformer.OpenOffice.mimeTypeLimits.txt.pdf.maxSourceSizeKBytes=5120
• org.alfresco.repo.content.TransformerDebug
• contextual logging
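The effect of a size limit such as the maxSourceSizeKBytes property above can be sketched as a simple check before a transformation is attempted. This is a stand-alone illustration, not the TransformationOptionLimits implementation: SizeLimitSketch and allowed are hypothetical names, and treating a missing or negative value as ‘no limit’ is an assumption.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical check of a source size against a configured KByte limit.
class SizeLimitSketch {
    static final String KEY =
        "content.transformer.OpenOffice.mimeTypeLimits.txt.pdf.maxSourceSizeKBytes";

    static boolean allowed(String config, long sourceSizeBytes) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(config));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Assumption: a missing or negative value means "no limit"
        long maxKBytes = Long.parseLong(props.getProperty(KEY, "-1").trim());
        if (maxKBytes < 0) {
            return true;
        }
        return sourceSizeBytes <= maxKBytes * 1024L;
    }
}
```

With the value 5120 from the slide, a text file of exactly 5 MB would still be handed to this transformer for the txt to pdf path, while anything larger would be refused.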
Extending
• Follow the Alfresco patterns
• org.alfresco.repo.content.transform
• Remember the chains
• Remember the subsystems
• ImageMagick
• OpenOffice
• Remember the Enterprise variants
• JodConverter
Recap
• Many transformations & paths possible
• No graph
• Can be expensive in CPU/memory
• Transformation to text = free indexing
• No link between source & transformed content
• Thumbnails are children of their source nodes
• Bespoke behaviours ensure thumbnails are
updated
Renditions
Renditions
• A more general feature than transformers
• Although with a strong overlap
• Thumbnails are renditions
• Previews are renditions
• Not all renditions are thumbnails/previews
Renditions
• Flexible location
• Always associated to their source node.
• Child nodes of their source node.
• Child nodes of another folder node.
• Updated when their source updates.
• Can be disabled with marker aspect
• rn:preventRenditions
• See ‘preventRenditions’ spring bean to register
other ‘unrenditionable’ content classes
• Can reflect the content and/or metadata of their
source node.
Standard rendition engines
• reformat
redirects to vanilla transforms
• image
image manipulation parameters
• freemarker
run some FTL against source content
• xslt
run XSLT on (XML) source node
• composite
rendition series [reformat, crop]
Persistence of Rendition Definitions
1. Create Rendition Definition
2. Set parameter values on it
3. Execute it against a source node
• Definitions can be persisted
• Useful for complex or commonly used
• RenditionService.save(), .load()
• Saved into Alfresco’s Data Dictionary
Renditions - Java
NodeRef jpgNodeRef;
QName renditionName =
QName.createQName(NamespaceService.CONTENT_MODEL_1_0_URI,
"myRendDefn");
RenditionDefinition renditionDef =
renditionService.createRenditionDefinition
(renditionName, "imageRenderingEngine");
renditionDef.setParameterValue(
ImageRenderingEngine.PARAM_RESIZE_WIDTH, 128);
renditionDef.setParameterValue(
ImageRenderingEngine.PARAM_RESIZE_HEIGHT, 512);
renditionDef.setParameterValue(
ImageRenderingEngine.PARAM_MAINTAIN_ASPECT_RATIO, false);
ChildAssociationRef chAssRef =
renditionService.render(jpgNodeRef, renditionDef);
Renditions - JavaScript
var renditionDef = renditionService
    .createRenditionDefinition("cm:cropResize",
                               "imageRenderingEngine");
renditionDef.parameters["destination-path-template"]
    = "/Company Home/Cropped Images/${name}.jpg";
renditionDef.parameters["isAbsolute"] = true;
renditionDef.parameters["xSize"] = 50;
renditionDef.parameters["ySize"] = 50;
renditionService.render(testNode, renditionDef);
var renditions = renditionService.getRenditions(testNode);
Recap
• Renditions == Transformations++
• More complex, more powerful
End