Transcript PW 101 V8i - Chapter9 - Document Indexing
ProjectWise 101 – Chapter 9 Document Indexing
Gary Cochrane – Technical Director Geospatial Sales – North America
Introduction
• ProjectWise Document Indexing
  – Really means three things
    • Full Text Indexing, in support of full text searching
    • Thumbnail Extraction
    • Document Property Extraction
      – We won’t cover this one in PW101
      – See Bentley Institute PW Admin course guide for this
Full Text Indexing
• We did not write the engine for this
  – But elected to use the one Microsoft provides
    • Included with every copy of Windows
  – That engine is called the MS Indexing Service
    • And it was installed in the VM as an optional Windows component
  – Microsoft indexes the following file formats
    • MS Word, Excel, PPT, HTML, XML, TXT
Pre-installed in VM
• ProjectWise Integration Server
• ProjectWise Orchestration Framework
• MicroStation V8i-SS1
• Supported Database Engine
• Microsoft Message Queuing Service
• Microsoft Indexing Service
• Microsoft .NET Framework 2.0
• Windows Server 2003 with SP2
Extending the MS Indexing Service
• Microsoft provides an SDK for third parties to extend the Indexing Service
  – So the Indexing Service will know how to “filter” files from that vendor
• For instance, Adobe provides an “iFilter” that teaches the MS Indexing Service how to extract text from a PDF file
• The Adobe PDF iFilter is installed with Acrobat Reader V9x
Indexing Overview
• Within PW, Indexing consists of:
  – Scheduling
    • A process that wakes up, checks for new (or modified) files, adds them to the Copy-out queue, and goes back to sleep
  – Copy-out
    • Copies the file from the Storage Area to the machine running the Indexing Service, then adds the file to the Extraction queue
    • Remember, files may be stored on multiple servers
    • Also, in large installations, a machine may be dedicated to indexing
Indexing Overview – Part II
• Overview – continued
  – Extraction
    • This process gets the text from the file and adds it to the MS Index catalog, then adds the file to the Update queue
  – Update
    • This process sets the flag on the file (in the PW database) that says it is “done”
    • New files are added with the flag set to “undone”
    • Check-out/in causes the flag to be set to “undone”
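The four stages above can be pictured as queues handing files to one another. This is a minimal Python sketch, not ProjectWise code: the class, the queue names, and the dictionary standing in for the MS Index catalog are all hypothetical and only illustrate the hand-offs between Scheduling, Copy-out, Extraction, and Update.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Document:
    name: str
    text: str            # stand-in for the file contents
    done: bool = False   # new files start as "undone"


@dataclass
class IndexingPipeline:
    """Illustrative model only; every name here is hypothetical."""
    copy_out_queue: deque = field(default_factory=deque)
    extraction_queue: deque = field(default_factory=deque)
    update_queue: deque = field(default_factory=deque)
    catalog: dict = field(default_factory=dict)  # stand-in for the MS Index catalog

    def schedule(self, documents):
        # Scheduling: wake up, queue every "undone" file, go back to sleep.
        for doc in documents:
            if not doc.done:
                self.copy_out_queue.append(doc)

    def copy_out(self):
        # Copy-out: bring each file to the indexing machine, then queue it for extraction.
        while self.copy_out_queue:
            self.extraction_queue.append(self.copy_out_queue.popleft())

    def extract(self):
        # Extraction: add the text to the catalog, then queue the file for the update step.
        while self.extraction_queue:
            doc = self.extraction_queue.popleft()
            self.catalog[doc.name] = doc.text
            self.update_queue.append(doc)

    def update(self):
        # Update: set the "done" flag on the file.
        while self.update_queue:
            self.update_queue.popleft().done = True

    def run_pass(self, documents):
        self.schedule(documents)
        self.copy_out()
        self.extract()
        self.update()


docs = [Document("a.doc", "detail sheet"), Document("b.txt", "notes")]
pipeline = IndexingPipeline()
pipeline.run_pass(docs)

docs[0].done = False      # a check-out/in resets the flag
pipeline.schedule(docs)   # only the "undone" file is queued again
```

After the second `schedule` call, only `a.doc` is waiting in the Copy-out queue, which is the incremental behavior the slides describe.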
A note on “done”
• Done does not necessarily mean it was successful
  – It means the file has been processed
    • In other words, what happens if an unknown file (e.g., an AutoCAD file) is sent to the Indexing Service?
  – The file is attempted…
    • And the Indexing Service says, “I don’t know how to extract text from this file”
  – There would be no point in trying the file again
    • So it is marked as “done”, even when unsuccessful
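The point that “done” means “attempted”, not “succeeded”, can be shown in a few lines. This is a sketch under assumptions: the `FILTERS` table and the `process` function are hypothetical and do not reflect how the MS Indexing Service is actually called.

```python
import os

# Hypothetical filter table: extension -> function that returns the file's text.
FILTERS = {
    ".txt": lambda data: data,
    ".html": lambda data: data,
}


def process(name, data):
    """Attempt a file once and return (done, indexed_text)."""
    extension = os.path.splitext(name)[1].lower()
    text_filter = FILTERS.get(extension)
    if text_filter is None:
        # No filter knows this format. Retrying would fail the same way,
        # so the file is still flagged as done.
        return True, None
    return True, text_filter(data)


known = process("notes.txt", "pump detail")
unknown = process("plan.dwg", "binary drawing data")
```

Both calls report `done` as true; only the first one yields any text for the index.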
MicroStation and AutoCAD
• ProjectWise provides a mechanism to index the text from these file types
  – Instead of writing an iFilter, Bentley elected to:
    • Copy-out the file
    • Run MicroStation in the background, extract all the text, and write it to an XML file
    • Send the XML file to the Indexing Engine
  – Since MicroStation can parse DWG as well…
    • Then this method saved us from having to write two iFilters
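To make the “write it to an XML file” step concrete, here is a small Python sketch. The XML layout ProjectWise really produces is not given in this chapter, so the element and attribute names below are invented for illustration, and the list of strings stands in for whatever MicroStation extracts.

```python
import xml.etree.ElementTree as ET


def text_to_xml(source_name, strings):
    """Wrap text strings pulled from a drawing in a simple XML document.

    The element and attribute names are invented for this example.
    """
    root = ET.Element("drawing", {"source": source_name})
    for value in strings:
        ET.SubElement(root, "text").text = value
    return ET.tostring(root, encoding="unicode")


xml_doc = text_to_xml("site-plan.dgn", ["SECTION A-A", "Detail 4 & 5"])
```

The benefit of this route is that XML is one of the formats the Indexing Service already understands, so no new filter is needed for the drawing formats themselves.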
Summary
• So within ProjectWise, we index:
  – Word, PPT, Excel, XML, HTML, TXT
  – Adobe PDF
  – DGN and DWG
• More good news
  – iFilters can be found for many file formats
    • Some free, and some for purchase
PW Orchestration Framework
• Remember when we installed this?
  – PWOF is responsible for managing batch processes for ProjectWise
    • This includes all those processes discussed on the previous slides
  – For Full Text Indexing, that means
    • Scheduler process, Copy-out process, Extraction process, Updater process, and the MicroStation instance running in the background
Lab 1a
• PW Orchestration Framework
  – Start the Windows Task Manager
    • Hint: Right-click on empty part of Taskbar
  – Examine memory usage
    • On the Performance tab
  – Switch to Processes tab
    • Sort by Mem Usage column (descending)
    • Look for ustation.exe
    • Look for DmsAfpEngine(s)
  – Lots of memory consumed here…
Lab 1b
• Now open Services dialog
  – Remember “gears” icon on Quick-Launch
    • Locate PW Orchestration Framework service
  – Select the PWOF service, and choose> Stop
    • Watch memory usage in Task Manager
  – For remainder of exercise, we need PWOF running
    • So start it back up now
    • Note PWOF is configured for automatic startup
      – It will run each time machine is booted
  – Close Services and Task Manager
Lab 2a
• Open PW Administrator
  – Log in as> adminpw
  – Drill down to:
    • Document Processors> Full Text Indexing
  – Right-click, choose> Properties
Lab 2b - Full Text Indexing
• Accept default, unless Indexing is to be run on another machine
• Turn on
• adminpw
• adminpw
• Set to 60
Lab 2c - Full Text Indexing
• Enable all times in the schedule
• Set to 2
Lab 2d
• Switch to File Type Associations tab
  – Press> Add
    • In the Extension field, enter> DWG
    • In the bottom field, enter> DGN
      – So that DWG files are processed as if they were DGN
  – Press> OK
Lab 2e
Lab 2f
• Still on the File Type Associations tab
  – Again, press> Add
    • In the Extension field, enter> itiff
    • In the bottom, enable> Do not process these documents
      – You can’t extract text from a raster, so this prevents wasted file transfers
  – Press> OK
• Press OK again
  – To close the Full Text Indexing Properties
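The two associations set up in Labs 2d and 2f amount to a lookup that runs before a file is copied out. The sketch below is hypothetical (how ProjectWise stores and applies associations is not described here); it only shows the two outcomes: “process as another type” and “do not process”.

```python
# Marker for "Do not process these documents".
SKIP = object()

# The associations from the lab: DWG is handled as DGN, itiff is skipped.
associations = {"dwg": "dgn", "itiff": SKIP}


def resolve(filename):
    """Return the extension a file should be processed as, or None when skipped."""
    extension = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    target = associations.get(extension, extension)
    return None if target is SKIP else target
```

With this table, `resolve("plan.DWG")` gives `"dgn"`, `resolve("scan.itiff")` gives `None`, and an unmapped file such as `memo.doc` keeps its own extension.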
Lab 2g
• Open Task Manager again
  – Switch to Performance tab
    • Within 2 minutes, you should see heavy CPU usage
    • Memory usage will also go up
  – Up to 60 documents will be indexed in the first pass
    • If there are more than 60 documents to be done, then they will be queued for the next pass
      – 2 minutes from now
Analysis
• All documents will eventually be processed
  – When done, the index will be ready for fast full text searches
• Once the indexer has caught up, future load will be lighter because only new or modified documents are processed
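The lab settings (up to 60 documents per pass, one pass every 2 minutes) make it easy to estimate how long the first full index takes. A quick Python calculation, assuming the first pass starts within one interval and no pass overruns its interval; the backlog of 250 documents is a made-up figure.

```python
import math


def passes_needed(pending, per_pass=60):
    """Number of indexing passes needed to clear the backlog."""
    return math.ceil(pending / per_pass)


def minutes_until_last_pass(pending, per_pass=60, interval_minutes=2):
    """Upper bound on the wait before the final pass starts, assuming the
    first pass starts within one interval and no pass overruns."""
    return passes_needed(pending, per_pass) * interval_minutes


backlog = 250  # hypothetical number of documents
print(passes_needed(backlog), minutes_until_last_pass(backlog))  # prints "5 10"
```

So a 250-document backlog needs five passes, and the last one begins no later than about 10 minutes after the schedule is enabled.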
Lab 3a
• When done, close Task Manager, open PW Explorer
  – Log in as user1
• From the main tool box, select> Find Documents
  – Binocular icon
• Change to Full Text tab
  – Enter Look For> detail
• Press OK to start search
  – Then Close the Search dialog
• Your results should include: DGNs, DWGs, and PDFs
Lab 3b
• Browse to:
  – User1/Document Indexing/MS-SHT
• These files were not successful because they have an unknown extension
• But they were attempted, and flagged as done
• Return to PW Administrator
  – Select datasource name (pwdemo)
    • Right-click, choose> Properties
    • Change to Statistics tab
    • Choose Refresh
    • Review Full Text Statistics
  – Close dialog
Lab 3c
• While still in PW Administrator
  – Open Full Text Indexing Properties again
    • Switch to the File Type Associations tab
  – Press Add
    • In the Extension field, enter> SHT
    • In the bottom Extension field, enter> DGN
      – So that SHT files will be processed as if they were DGN files
    • Press OK to complete the Extension mapping
  – Press OK again to close the Properties dialog
Lab 3d
• Once new file type has been added…
  – Now a small problem
    • These files were flagged as done, and the Indexer won’t try them again unless they are checked out/in
    • And even that won’t work unless you actually make changes…
    • PW compares files to version on server, and doesn’t transfer back if there are no changes
Lab 3e
• Rather than check them all out, and back in
  – From PW Administrator
    • Right-click Full Text Indexing
  – Choose>
    • Mark folder Documents for Reprocessing
  – Browse “…” to
    • User1/Document Indexing/MS-SHT
  – Press OK
    • Press OK again
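In terms of the “done” flag, marking a folder for reprocessing is just a bulk reset. The sketch below is a hypothetical stand-in for what the Administrator command does: the dictionary of flags, the function, and the sample file names are invented, and whether the real command also covers subfolders is an assumption.

```python
def mark_folder_for_reprocessing(documents, folder):
    """Reset the "done" flag for every document under `folder`.

    `documents` maps a document path to its flag. Both the structure and
    this function are hypothetical.
    """
    prefix = folder.rstrip("/") + "/"
    reset = [path for path in documents if path.startswith(prefix)]
    for path in reset:
        documents[path] = False
    return reset


flags = {
    "User1/Document Indexing/MS-SHT/plan.sht": True,
    "User1/Document Indexing/MS-V8/plan.dgn": True,
}
changed = mark_folder_for_reprocessing(flags, "User1/Document Indexing/MS-SHT")
```

Only the file under MS-SHT goes back to “undone”, so the next scheduled pass picks it up without anyone checking it out.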
Analysis
• Within 2 minutes, these documents will be reprocessed
  – If you run the search again (in a few minutes), you should also get SHT files in your results
  – Re-visit Datasource statistics to see if the Full Text categories have changed
Summary
• Once the index is created,
  – You can stop the PW Orchestration Framework service
    • It is used to create the index, but not to search the index
  – This will save memory, and CPU cycles
    • So in a demo, your machine will run faster
    • BUT, new (or modified) files will not be re-indexed
  – Up until now, the PWOF was not being used at all
    • Full Text Indexing is the first time we’ve needed PWOF, even though it has been running since installation
PW Thumbnails
• PW Thumbnails is not “indexing” in the proper sense, but it is similar in nature to Full Text
  – PW Thumbnails extracts a thumbnail from the document, and stores a copy in the PW database
    • This allows one to browse PW Explorer, and see thumbnails in the Preview Pane
  – Not all file types support thumbnails
    • Among those that do, some don’t do it per the industry standard
Thumbnails – Part II
• Important to remember
  – ProjectWise does not create thumbnails
    • It only extracts what might be in the file
  – A good test is to check to see if Windows Explorer displays a thumbnail for the file
    • If it does, then PW should as well
Lab 4a
• Open Windows Explorer
  – Browse to:
    • C:\PW-101 Class Files\Document Indexing\MS-V8
  – Change to Thumbnail display
    • MicroStation V8 files have thumbnails
Lab 4b
• Browse through remaining Document Indexing folders
  – Note which include thumbnails
  – Additional notes
    • PDF files take a long time because you are really looking at a small view of the whole file, not a thumbnail
    • AutoCAD doesn’t adhere to the industry standard
      – These files only display correctly because MicroStation is installed, and is responsible for displaying a thumbnail
      – Autodesk may have fixed this in later versions?
Lab 5a
• Open PW Administrator
  – Log in as> adminpw
  – Drill down to:
    • Document Processors> Thumbnail Extraction
  – Right-click, choose> Properties
• Similar to Full Text Indexing
  – But actually less involved
Lab 5b
• Turn on
• adminpw
• adminpw
• Set to 60
Lab 5c
• Enable all times in the schedule
• Set to 2
Lab 5d
• No changes required on the File Type Associations tab
  – Press OK to complete the configuration and close the dialog
• Within a few minutes, thumbnails should show up in the preview pane
Analysis
• Thumbnails are extracted and stored in the PW database
  – Because document storage may not be local
    • Thus “touching” the document to see thumbnail in real-time is not practical
  – Thumbnail notes
    • Requires less processing than full text
      – MicroStation not running in this process
      – Requires PWOF to extract, but not to display
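The split between extracting and displaying can be sketched as a small store. This is an illustration only: the class, its methods, and the sample byte strings are hypothetical, with a dictionary standing in for the thumbnail data kept in the PW database.

```python
class ThumbnailStore:
    """Hypothetical stand-in for thumbnails kept in the PW database."""

    def __init__(self):
        self._by_document = {}

    def extract(self, name, embedded_thumbnail):
        # Extraction (the batch step): store only what the file already
        # contains. Nothing is generated when the file has no thumbnail.
        if embedded_thumbnail is not None:
            self._by_document[name] = embedded_thumbnail

    def display(self, name):
        # Display: a lookup in the store. The document itself, which may
        # sit on a remote storage server, is never touched.
        return self._by_document.get(name)


store = ThumbnailStore()
store.extract("bridge.dgn", b"thumbnail-bytes")
store.extract("notes.txt", None)   # no embedded thumbnail, nothing stored
preview = store.display("bridge.dgn")
missing = store.display("notes.txt")
```

Once `extract` has run, `display` keeps working on its own, which mirrors the note that PWOF is needed to extract thumbnails but not to show them.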
Review
• Topics covered in this Chapter
  – Full Text Indexing – Configuration
  – Full Text Searches
  – ProjectWise Orchestration Framework
  – Thumbnail Extraction
  – Microsoft Indexing Service
    • And iFilters to extend default supported file types
    • (I have a free Visio, and MSG iFilter from Microsoft)