Integrated Full-Text Search (iFTS) in Microsoft SQL Server ® 2008 Fernando Azpeitia Lopez SQL Server Engine - Program Manager Microsoft ® Corporation.

Transcript Integrated Full-Text Search (iFTS) in Microsoft SQL Server ® 2008 Fernando Azpeitia Lopez SQL Server Engine - Program Manager Microsoft ® Corporation.

Integrated Full-Text Search (iFTS)
in Microsoft SQL Server ® 2008
Fernando Azpeitia Lopez
SQL Server Engine - Program Manager
Microsoft ® Corporation
Session Objectives And Takeaways
“What is exactly full-text search in a DB? Which are its main
scenarios and how can it provide value to my solution?”
• Cover the main concepts and capabilities of our full-text search
solution inside SQL Server.
• Realize how iFTS (SQL Server 2008 integrated FTS) can extract
significant value out of unstructured and semi-structured data stored
inside the DB.
• Demonstrate the usage of iFTS and how it solves your daily
DB Search needs.
• Cover iFTS’ best practices and workarounds.
• Evangelize SQL Server FTS and prove it ready for high level production
environments
• Understanding the future of Search in DBs. Where are we going and
why?
Session Details
• Searching at Microsoft: Main Players
• SQL Server Full-Text Search (FTS) : Overview of Basic Scenarios,
Concepts, Features & Architecture (Demo)
-
•
SQL Server 2008 iFTS (Integrated FTS)
-
•
The need
The Scenario
The Solution
How do I deploy it?
How do I use it?
Why did we need a new Search infrastructure?
SQL FTS Roadmap
The new Architecture
Main Improvements (Demo)
Upgrading to iFTS
Best Practices and Workarounds
What is not in iFTS?
Upcoming Future
Q&A
3
Searching at Microsoft: Main Players
 Bing
•
•
•
Search WWW
Highly scalable
Does not use FTS as provided by SQL Server
 MS Search
•
•
•
Search information on disk
Integrated with OS (e.g: Vista Desktop Search)
Used by former SharePoint, SQL Server 2000 and 2005 and Desktop
Search …
 FAST
•
•
Search information on your organization (Enterprise Search)
SharePoint Server search solution
 iFTS in SQL Server 2008
•
•
Provides Full Text Search over data in RDBMS
Completely redesigned search solution to leverage DB robustness
and scalability.
Full-Text Search (FTS) in SQL Server:
Overview
“The Need”
“How can I extract value out of vast amounts of non-relational
data stored in the DB, by leveraging fast filtering mechanisms to
get to the information I need?”
Full-Text Search (FTS) in SQL Server:
Overview
“The Scenario”
1.
Heterogeneous data (different types and languages) stored and managed in
the DB. Mainly unstructured and semi-structured data (e.g: binary documents,
emails, XML, HTML, etc..)
2.
Besides leverage DB capabilities, the ability to Search efficiently over
this data arises.
1.
The creation of special (non-relational) indexes is needed in order to
enable Search over these corpuses.
2.
Users need to manage the data and its indexes together and consistently.
3.
Main Markets: Compliance/E-Discovery, Government/Law Enforcement,
eCommerce, Customer Support, Vertical Search (Medical, Media,..Etc….)
Full-Text Search (FTS) in SQL Server:
Overview
“The Solution”

SQL Server Full-Text Search (FTS)
• FTS is a feature integrated into SQL Server that allows fast and flexible
querying of significant words and phrases in:
• Character-based database columns
• Documents stored in an binary typed column; varbinary() and binary(), as well
as XML data types
• FTS searches for language aware tokens, keywords or phrases inside the
database providing scoring and relevance of the result set.
Full-Text Search (FTS) in SQL Server:
Overview
“How does it work?”
Tokenization

•
The data is parsed by language aware processes (iFilters, stemmer, wordbreaker,
etc..)
Indexing

•
•
FTS uses special indexing mechanism: custom FTIndexes stored in FTCatalogs.
Each token (word) is indexed keeping metadata related to it.
Querying

•
•
Predicates: CONTAINS, FREETEXT
Table-valued functions returning relevance base score:
CONTAINSTABLE, FREETEXTTABLE
•
•
•
Matching semantics: exact or fuzzy
Various query options available: i.e : weight, NEAR, Thesaurus, Inflectional,
Top_N_by_Rank, prefix, etc..…
Wide language support at indexing and querying time.
Full-Text Search (FTS) in SQL Server:
Overview
Demo: “How do I deploy it?”
1.
Locate table with textual data (character based or binary raw data).
2.
Full-text catalog created in that database.
CREATE FULLTEXT CATALOG ftCatalog AS DEFAULT;
3.
Full-text index (associated with the full-text catalog) on the column/s of the
table.
CREATE FULLTEXT INDEX ON HumanResources.JobCandidate(Resume) KEY
INDEX PK_JobCandidate_JobCandidateID;
4.
Population of the index that results in scanning and processing of the textual
data.
5.
T-SQL Queries that use full-text predicates or table valued constructs.
6.
A full set of T-SQL DDL available to perform FTS manageability operations.
Full-Text Search (FTS) in SQL Server:
Overview
Demo: “How do I use it?”
1.
Populating the FT Index. Keeping up with changes.
2.
Querying my data: Understanding CONTAINS and FREETEXT
3.
Extending the query to leverage useful features:
1.
Thesaurus
2.
Prefix search
3.
Inflectional Forms
4.
Weighted terms
5.
TOP_N
Etc…
Why a new Search Infrastructure?
 Possible challenges you might have encountered when
using SQL Server 2005 Full-Text Search (FTS)
 Indexes stored outside SQL Server might lead to manageability
challenges
 E.g: Backup/Restoring your DB with FTCatalogs might require specific actions
 The Mixed query performance suffers from having to pull over the
complete full-text result set
 E.g: Poor performance when relational side very selective
 Possible Scaling issues on big boxes
 Many sustained engineering challenges due lack of technology
ownership.
 E.g: Inability to efficiently support our customers
Inability to deliver top features highly requested.
 Inability to shape our own roadmap targeting our direct customer needs.
Why a new Search Infrastructure?
 Provide foundation to address future needs
 Extend the FTS feature set
 customizable proximity operator
 property level search
 snippets with hit-highlighting
 field weighted relevance
 customizable tokenizing
 etc…..
 Compelling search platform for others to build on
 Eventually, offer Search within all data inside the database.
Search:
SQL Server Full-Text Search Roadmap
 FTIndexes into the database
Seamless administration, manageability and high availability experience
for search indexes
 Integration with Query Processor
Efficient and scalable query processing over search to query continuum
 Enabling search in semi-structured and unstructured DB storage solutions
(e.g:Filestream, Sparse Columns, XML, etc..)
 Feature set comparable to what they see in web and desktop search space
Snippets, field scoped search, weighted relevance, customizable NEAR, etc…
 Massive Scale-up and performance enhancements. Becomes the fast DB search engine
in the market.
 Facilitate Scale-out and manageability enhancements. e.g: Remote indexing, etc..
 Enabling finding of ad-hoc information inside the DB without prior knowledge of its
schema.
FTS 2008 New Architecture
SQL Server
process
Doc_table
id
doc_type
document
SQL Query Procesor
SQL Query
Execution
SQL Query
compilation
MSFTESQL
FTS Engine
(FT Search Engine)
FTS Query
Execution
FTS Query
compilation
Crawl
Gatherer
Data to be
Indexed
STOPLIST
Full-Text
Index
Indexer
Keyword and
occurrence
information
Shared
Memory
Filter Daemon Host
(FDHost)
Protocol
Hander
DB Data
iFilters
Filtered Text
Wordbreakers
Shared
Memory
FTS 2008 New Architecture
Query
SQL Server
process
SQL Query Procesor
SQL Query
compilation
SQL Query
Execution
MSFTESQL
FTS Engine
(FT Search Engine)
FTS Query
compilation
FTS Query
Execution
Thesaurus
STOPLIST
Full-Text
Index
Filter Daemon Host
(FDHost)
Wordbreakers
Main Improvements

Full-Text Indexes are stored and maintained inside SQL Server.
Full integration with other unstructured and semi-structured solutions.

Manageability features like Backup/Restore, Attach/Detach,
Mirroring and Log shipping work for FT indexes just as any other DB objects or
regular indexes.

Indexing and specific cases of query performance have been improved
dramatically.

(Demo) Access to Full-Text Indexes raw Data. Provides access to search corpus
and statistical information. Useful for troubleshooting scenarios.
sys.dm_fts_index_keywords()
sys.dm_fts_index_keywords_by_document()
Keyword
Display_term
Col_id
Doc_id
Occ_count
Main Improvements
 (Demo) Query Input Parser : sys.dm_fts_parser
 Better supportability: understanding a given WB behavior.
SELECT * FROM sys.dm_fts_parser('"This is test“ AND “This also"',1033,0,0)
Group_id
keyword
occurrence
Special_term
Display_
term
Expansion_ty
pe
Source_term
1
0x007400680069
0073
1
Noise Word
This
0
This is test
1
0x00690073
2
Noise Word
is
0
This is test
1
0x007400650073
0074
3
Exact Match
test
0
This is test
2
0x007400680069
0073
1
Noise Word
this
0
This also
2
0x0061006C0073
006F
2
Noise Word
also
0
This also
Main Improvements
 Full-Text Query is integrated with SQL Query Processor
 Mixed query performance has been improved for significantly
different cardinality scenarios
 Overall Query performance scales better in most scenarios
 Resource utilization is managed (mostly) as part of SQL Server
Main Improvements
 Thesaurus improvements
 Stored in internal tables (in tempdb) in XML form instead of
being parsed from external files
 Instance level thesaurus
sys.sp_fulltext_load_thesaurus_file (lcid)
Loads all the data specified in the Thesaurus XML corresponding
to the language with specified lcid.
Main Improvements

(Demo) New STOPLIST support: Simplified noise words utilization
and manageability. DB object associated with the FT index.
CREATE FULLTEXT STOPLIST stoplist_name
[ FROM {[database_name.] source_stoplist_name}
| SYSTEM STOPLIST]
[AUTHORIZATION owner_name]
ALTER FULLTEXT STOPLIST stoplist_name
{
| ADD <keyword> LANGUAGE language_term
| DROP
{
| <keyword> LANGUAGE language_term
| ALL LANGUAGE language_term
| ALL
}
Main Improvements
 New family of Word-Breakers (WB):
 WBs are components responsible of parse the textual
data in a given language and pass the tokenized result
to the Full-Text Index.

51 languages/WBs out of the box

Improved quality and supportability in most
wordbreakers
Main Improvements
 WBs available in SQL Server 2008:
Arabic
Bengali
Brazilian
Bulgarian
Canadian
Catalan
Chinese (Simplified)
Chinese (Traditional)
Chinese (Hong Kong)
Chinese (Macau)
Chinese (Singapore)
Croatian
Cyrillic
Danish
Dutch
English
English UK
French
German
Gujarati
Hebrew
Hindi
Icelandic
Indonesian
Italian
Japanese
Korean
Latvian
Lithuanian
Malay
Malayalam
Marathi
Neutral
Norwegian
Polish
Portuguese
Punjabi
Romanian
Russian
Serbian
Latin
Slovak
Slovenian
Spanish
Swedish
Tamil
Telugu
Thai
Turkish
Ukrainian
Urdu
Vietnamese
Languages present but
disabled by default
New languages supported
in SQL Server 2008
Existing in SQL Server
2005, and being replaced by
new WBs in SQL Server
2008
Unchanged language/WB
from SQL Server 2005
Main Improvements
 The indexing performance has improved in most scenarios
2005 Crawl
2005 Total
IFTS Crawl
IFTS Total
20M rows 1k
text data
02:06
02:25
01:22
01:28
5M rows 8k
text data
02:10
02:41
02:22
02:32
20M rows 1k
nvarchar
data
01:37
01:55
01:20
01:26
Measured on 4 processor AMD64 2793 MHz, 8G RAM. Numbers are in HH:MM format.
Total time is combining time to crawl and time of merge into index
For some HW configuration and data types, specific best practices are
recommended to improve indexing performance (i.e: capping SQL
Server’s memory, etc…)
Upgrading to iFTS
 Due a new Full-Text Index architecture, former Full-Text Indexes are not
compatible in SQL Server 2008 anymore.
Solution…: The Full-Text Catalog Upgrade Option
 Import: (default) Faster method although performance and semantic
implications are possible.
 Rebuild: Slower method although ideal final state of new FTCatalogs
guaranteed.
 Reset : Faster Upgrade method although your Search app will not
have the FTCatalogs available afterwards. You need to rebuild them
when possible.
 Possible Upgrade methods:

In place Upgrade: User will be prompted for what Upgrade Option to
choose for existing FTCatalogs.

Restore/Attach : Instance level setting will be applied to former
Full-Text Catalogs brought up with the former DB.
Best Practices and Workarounds
 Full-Text key type: Use Integer..
When so, no internal mapping table required  we avoid an extra
internal JOIN at query time.
 To limit results and increase performance, use the top_n_by_rank option
with FREETEXTTABLE and CONTAINSTABLE.
 Use CONTAINSTABLE or FREETEXTTABLE when you only require
Full-Text Key or rank information.
 Keep your FTIndexes de-fragmentated when possible  Reorganize the
Full-Text catalog by using ALTER FULLTEXT CATALOG REORGANIZE.
Best Practices and Workarounds
 Prefix queries Issues?
The issue
Possible Query perf degradation with non-deterministic enough
prefix queries (i.e: ‘a*’).
Recommended best practice/s
-Try to narrow down your search when possible.
- If the query plan has a nested loop with the fulltext STVF, you
should try hinting a merge join.
Best Practices and Workarounds
Issues with Complex queries?
A.
Several ANDs + ORs within a single CONTAINS()
The issue
 QO plan and compilation complexity might cause blocking
 Multiple FT logical operators adds extra cost when relational predicate is
pushed into the FTIndex
 When kept as relational, the QO chosen plan might be not ideal, causing
costly executions.
Recommended best practice
 Implement app. level max # of terms.
 Leverage Thesaurus capabilities when possible.
B.
Several CONTAINS linked by ORs conditions
The issue
 QO plan and compilation complexity might cause blocking
Recommended best practice
 Combine multiple CONTAINS predicates into one CONTAINS predicate
Best Practices and Workarounds
 Blocking issues due high DML + FT query workloads
The issue
 …High DML workload (+100 DMLs/sec)
+
 …Plus AUTO being specified as the change tracking mechanism for
the Full-text index
…. might cause the query load to suffer blocking.
Recommended best practice/s
 A Trace Flag exist to solve this issue (7646)
 There are documented best practices in the SQL Server 2008
White Paper to mitigate this issue. (e.g: manual change tracking,
REORGANIZE, etc..)
 Install latest PCU for latest fixes in this area.
Best Practices and Workarounds
 Best resources for additional information:
•
SQL Server 2008 BOL
•
SQL Server 2008 iFTS Internals and Enhcancements (White
Paper): http://msdn.microsoft.com/en-us/library/cc721269.aspx
•
Fernando Azpeitia Lopez : [email protected]
Program Manager .SQL Server Full-Text Search.
What is not yet in IFTS?
 No document property level search (i.e: search ‘foo’ on doc.’title’)
 Some of the customer wish list items: Snippets, column weights, language
detection, customizable wordbreakers and proximity operators, etc..
 Non support for remote FTIndexing. Currently, iFTS can only index data
stored directly in the DB or in the file system using Filestream integration.
 No partitioned full-text indexes. No support for SWITCH partition
on tables that are FT indexed.
 Non support for remote FTIndexing. Currently, iFTS can only index data
stored directly in the DB or in the file system using Filestream integration.
Upcoming Future
 What are we working on for our next major release/s?
 Improve dramatically our overall query performance by
reengineering key components in our architecture.
 i.e: Early estimations point we will become the fastest DB Engine
in the market.
 Address as many developer features as time permits (e.g:
customizable NEAR, property search, etc..)
 Improve our overall scale story to support large corpuses
under an impressive query performance.
Summary
 iFTS
adds significant value
Implementation -> straightforward
Management -> straightforward
Improved overall Performance, Integration and
Robustness

iFTS 2008 is the beginning of an ambitious upcoming plan
to become the fastest and easier to use DB Search solution
in the world.
Thank you

Integrated Full-Text Search (iFTS) in Microsoft SQL Server ® 2008 Fernando Azpeitia Lopez SQL Server Engine - Program Manager Microsoft ® Corporation.

Transcript Integrated Full-Text Search (iFTS) in Microsoft SQL Server ® 2008 Fernando Azpeitia Lopez SQL Server Engine - Program Manager Microsoft ® Corporation.

Directory