The Networked Economy (10):
Information Management, Strategy, and Innovation
网络经济:信息管理,战略,和创新
Search
搜索
© people & data | www.weigend.com
Andreas S. Weigend, Ph.D. 韦思岸教授
| +1 650 906-5906 | +49 174 906-5906 | +86 138 1818 3800
Search: Key Points
搜索:议程

Technology
技术
Index
检索
Crawl (or spider)
网页爬行程序

Speed
Store everything → Need search → To be fast, need to build index
存储所有东西 → 需要搜索 → 索引能加快速度
Trade-off: Results very fast, but pre-computing and storage needed
权衡:结果迅速但需要预先计算和存储空间
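
The speed trade-off above can be illustrated with a minimal inverted-index sketch in Python. The names `build_index` and `search` and the three sample documents are assumptions made for illustration, not part of the lecture. The scanning work is done once, up front, and the result is stored.

```python
from collections import defaultdict

def build_index(docs):
    """Pre-compute a map from each word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query word (set intersection)."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical sample documents.
docs = {
    1: "search needs an index",
    2: "store everything on disk",
    3: "an index makes search fast",
}
index = build_index(docs)
print(sorted(search(index, "search index")))  # [1, 3]
```

The index costs memory and build time, but a query then touches only the entries for its own words rather than every stored document.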

Relevance (algorithmic)
关联度(算法)
Information production → Information finding → Information filtering / ranking
信息产出 → 信息搜索 → 信息过滤/排序
Desktop search
桌面搜索

Money or Attention?
Pay with money: Buy software (e.g., X1 in 2005)
用户付费购买软件: 软件开发(例如,X1 in 2005)
Pay with attention: Yahoo, Google, MSN search toolbar
用户付出注意力: 搜索工具栏

Sites understand user behavior and situation better, can target ads better
获取用户行为和上网情境等信息, 改善定向广告效果
Search
搜索
Desktop

Build index
桌面:建立检索

Intranet is similar
局内网很类似
Web

How to find information, products, etc. on the web?
如何在网络上寻找信息,产品等?
Find without search?
无需搜索就能找到?
1. GUESS

If you know the location on the web (URL)
如果知道确定的网络地址(主页、网址)
2. BROWSE

Use directories
目录指南
Organize manually using expert “surfers”
由“网络冲浪”专家手工编制

Does not scale.

Manual directories difficult to maintain!
但是如规模太大则无法人工维护
URL (Uniform Resource Locator, e.g., http://www.ceibs.edu) is the address that identifies the web page
URL(统一资源定位符,如 http://www.ceibs.edu)是用于定位网页的地址
Organize using community of web users
由网民社区编制
Crawl
爬行

Early search engines
早期搜索引擎

Basic idea
原理
Crawl through web, following hyperlinks*
通过超级链接在网页间中抓取
Extract words from the page
从网站中提取关键字
Then build index of web
然后建立网页索引
Match user input (search terms) to the index
将关键字与用户输入信息(搜索项)比对

Hyperlink: Takes you to another page when you click on it
超级链接:在用户点击后将用户带到另一个网页
*Example: <a href="http://www.weigend.com/"> Home of my professor.</a>
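
The crawl steps above (follow hyperlinks, extract words, build an index) can be sketched in Python as follows. This is only an illustration under stated assumptions: the "web" is a hypothetical in-memory dictionary `PAGES`, the page contents are invented, and no network access takes place. A real crawler would also need politeness rules, error handling, and URL normalization.

```python
from collections import deque
from html.parser import HTMLParser

class LinkAndWordParser(HTMLParser):
    """Collect href targets of <a> tags and the words in the page text."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        for raw in data.lower().split():
            word = raw.strip(".,;:!?")  # drop surrounding punctuation
            if word:
                self.words.append(word)

def crawl(pages, start):
    """Breadth-first crawl over `pages` (URL -> HTML); returns word -> set of URLs."""
    index = {}
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        html = pages.get(url)
        if html is None:  # link points outside the toy web
            continue
        parser = LinkAndWordParser()
        parser.feed(html)
        for word in parser.words:
            index.setdefault(word, set()).add(url)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index

# Hypothetical two-page web with invented contents.
PAGES = {
    "http://example.org/": '<a href="http://www.weigend.com/">Home of my professor.</a>',
    "http://www.weigend.com/": "<p>People and data</p>",
}
index = crawl(PAGES, "http://example.org/")
print(index["data"])  # {'http://www.weigend.com/'}
```

The second page is reached only because the first page links to it, which is the basic idea of crawling.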
Relevance (in organic search results)
关联度(自然搜索结果)

Example: google search for “weigend”
例如: 在google上搜索“weigend”

309,000 results returned for weigend, in 0.2 seconds
显示约309,000条关于weigend的结果(仅用0.2秒)
New problem: Relevance
新问题:关联度
How to rank the pages? What to show on top?
搜索结果如何排序?哪些显示在最前面?

What information can be used to help with this decision?
哪些信息可以用来做这项决策?

A) Within page
同一网页上
Location of search term on page
搜索项在网页上的位置
Number of occurrences of the search term on page
网页上搜索项出现的频率
Metatags
元标签

B) Static: Link structure
静态的: 链接结构
E.g., Number of hyperlinks going into page
指向某网页的外部超级链接数量
Leverages other websites
利用其它网站的访问情况

C) Dynamic: Click behavior
动态的:点击行为
Choice within set of links
用户如何在一系列链接中进行选择
Action: Move results up or down
搜索结果上下移动
Understand overall trajectory (e.g., for typos)
趋势分析(如错别字)

Q: What information does the user see?
问题:用户看到的是哪些信息?
Leverages users
利用使用者的点击情况
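
A minimal Python sketch of how the three families of signals above (within page, link structure, click behavior) might be combined into one ranking. The weighted sum, the weights, and the sample pages are assumptions for illustration only; this is not any search engine's actual formula.

```python
def score(page, term, weights=(1.0, 2.0, 0.5)):
    """Combine the three signal families into one number (weights are assumptions)."""
    w_page, w_links, w_clicks = weights
    occurrences = page["text"].lower().split().count(term.lower())  # A) within page
    return (w_page * occurrences
            + w_links * page["inlinks"]    # B) static: hyperlinks going into page
            + w_clicks * page["clicks"])   # C) dynamic: click behavior

def rank(pages, term):
    """Return names of pages that mention the term, best score first."""
    matching = [p for p in pages if term.lower() in p["text"].lower().split()]
    ordered = sorted(matching, key=lambda p: score(p, term), reverse=True)
    return [p["name"] for p in ordered]

# Hypothetical pages.
pages = [
    {"name": "a", "text": "weigend weigend weigend", "inlinks": 0, "clicks": 0},
    {"name": "b", "text": "weigend home page", "inlinks": 5, "clicks": 4},
    {"name": "c", "text": "unrelated page", "inlinks": 9, "clicks": 9},
]
print(rank(pages, "weigend"))  # ['b', 'a']
```

Page "a" repeats the term most often, but page "b" wins because other sites link to it and users click on it. This is why signals that leverage other websites and users are harder to manipulate than on-page counts.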
Business models of search (sponsored search etc.)
搜索商业模式(付费搜索等)
Search is a necessary competence…
搜索是必须的功能
Has become entry point to everything (or at least key necessity)
已成为重要入口(至少是必备)
Customer has become empowered
消费者能力增强

Customers get smarter. Can’t fool them any more – Transparency empowers them, too
消费者变得聪明,不能随便愚弄 - 透明度使他们更加聪明
Power of community
社区力量
Other examples of search
其他搜索举例

Product search
产品

Books
书
Search Inside the Book (Amazon.com, 2003)
亚马逊图书内容全文搜索(2003)
Search statistics
搜索统计

0.3 billion searches per day (2003.1)
每天3亿次搜索(2003年1月)
1 billion searches per day (2005.1 estimate)
每天10亿次搜索(2005年1月估计)

Unique users per month (google, 2003.06): 81.9 million
每月用户实际人数(google, 2003.06):8,190万
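
The two daily-volume figures (0.3 billion in 2003.1, 1 billion estimated for 2005.1) imply an annual growth rate. The short Python calculation below assumes smooth compound growth over the two years, which is a simplification; the function name `annual_growth` is illustrative.

```python
def annual_growth(start, end, years):
    """Compound annual growth rate implied by two observations `years` apart."""
    return (end / start) ** (1 / years) - 1

rate = annual_growth(0.3e9, 1e9, 2)
print(f"{rate:.0%}")  # 83%
```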

Search statistics (January 2003) (searchenginewatch.com)
搜索统计(2003年1月)

Search Engine   Search Hours Per Month   Search Minutes Per Day   Searches Per Day
搜索引擎         每月搜索小时               每天搜索分钟               每天搜索
                (in millions 百万)       (in millions 百万)       (in millions 百万)
Google          18.7                     37                       112
AOL Search      15.5                     31                       93
Yahoo           7.1                      14                       42
MSN Search      5.4                      11                       32
Ask Jeeves      2.3                      5                        14
InfoSpace       1.1                      2                        7
AltaVista       0.8                      2                        5
Overture        0.8                      2                        5
Netscape        0.7                      1                        4
Earthlink       0.4                      1                        3
Looksmart       0.2                      0                        1
Lycos           0.2                      0                        1

(Nielsen/NetRatings)
Vertical search
垂直搜索
Many internet businesses are essentially vertical search
许多网络公司本质上是垂直/纵向搜索

Limitations of horizontal search?
横向搜索的局限性?
Complexity of products and services
产品与服务的复杂性
Domain knowledge
专业领域知识

Information often in deep web, not in surface web
信息常处于深层网页,而非表层网页
Travel
旅游

Aggregation: Intermediation and disintermediation
信息聚合 / 中介与非中介
Vertical search
垂直搜索
Shopping comparison
比价购物

Initially: Spider sites, e.g., Amazon.com
最初:网络蜘蛛,如亚马逊
What should be Amazon’s response?
亚马逊如何反应?
Should they make it hard or easy?
制造阻碍还是积极配合?
Now feeds and web services
达到双赢

Business models for shopping comparison engines
比价购物搜索引擎的盈利模式
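
The move from spidering to feeds can be sketched as follows: instead of scraping merchant pages, the comparison engine reads a structured feed that each merchant publishes. The feed format (CSV with `sku` and `price` columns), the merchant names, and the function `cheapest_offers` are all assumptions for illustration.

```python
import csv
import io

def cheapest_offers(feeds):
    """Given merchant -> CSV feed text (columns: sku, price), return sku -> (merchant, price)."""
    best = {}
    for merchant, text in feeds.items():
        for row in csv.DictReader(io.StringIO(text)):
            price = float(row["price"])
            if row["sku"] not in best or price < best[row["sku"]][1]:
                best[row["sku"]] = (merchant, price)
    return best

# Hypothetical feeds from two merchants.
FEEDS = {
    "shop_a": "sku,price\nbook-1,12.50\nbook-2,30.00\n",
    "shop_b": "sku,price\nbook-1,11.99\n",
}
offers = cheapest_offers(FEEDS)
print(offers["book-1"])  # ('shop_b', 11.99)
```

A feed gives the merchant control over what is shared and gives the engine clean data, which is one reason cooperation can suit both sides.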
Vertical search
垂直搜索
Insurance comparison
保险比价

Market structure: Often sold through agents; health insurance often through employment
市场结构:经常通过代理,健康险经常通过公司
Essentially an information good
本质上是信息产品
Still a long way to go
还有很长路要走
Vertical search
垂直搜索
Cars
汽车
70% of customers do research on the web before going to a dealer
在进店买之前,70%的人在网上搜索过
Challenge: Dealers don’t think of their business as e-business
挑战:经销商不认为他们从事的是电子商务
Huge advertising budget, need to move to mixed channel marketing
巨额的广告预算,需要利用多种渠道做整体市场营销

Basically, car market can also be seen as vertical search
基本上,汽车市场也可以被看作是垂直搜索。
Vertical search
垂直搜索
Real estate
房地产
Large part of the economy – most expensive purchase for most people
经济的重大组成部分-对于大多数人来说是最贵的商品
Market structure: 6% commission
市场结构:6%佣金
But: Real estate is also essentially an information search problem
但本质上是信息搜索问题
Vertical search
垂直搜索
Music
音乐

Information sources
信息来源
Human ratings
歌迷打分
Meta data (Composer etc.)
Meta数据(作曲家等)
Machine analysis
机器分析

Payment
支付
From buy (Possession) to rent (Subscription)
购买(拥有)还是租借(订阅)
China piracy rate: 92% of consumers are using pirated materials
中国盗版率:92%消费者用盗版
Local search
本地搜索
Total market size: $90 billion (CitySearch)
总体市场规模:$900亿(CitySearch)
Technology
技术

Know location of user via IP address or registration
通过IP地址知道网民位置或注册

Mobile: LBS (location based services)
移动: LBS(基于位置的服务)
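
A minimal Python sketch of the two location sources named above: registration data and IP address. The lookup table is hypothetical and uses reserved documentation address ranges; real services rely on large, regularly updated databases. Preferring the registered city over the IP lookup is also an assumption.

```python
import ipaddress

# Hypothetical network-to-city table (documentation address ranges).
NETWORK_TO_CITY = {
    ipaddress.ip_network("192.0.2.0/24"): "San Francisco",
    ipaddress.ip_network("198.51.100.0/24"): "Shanghai",
}

def locate(ip, registered_city=None):
    """Prefer the city the user registered; otherwise fall back to an IP lookup."""
    if registered_city:
        return registered_city
    address = ipaddress.ip_address(ip)
    for network, city in NETWORK_TO_CITY.items():
        if address in network:
            return city
    return None  # unknown location

print(locate("192.0.2.17"))              # San Francisco
print(locate("198.51.100.1", "Berlin"))  # Berlin
```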
People search
人物搜索
Dating and social networking sites
网上交友和社会网络公司
Note: Social networking companies are purely an information play
注:社会网络公司是纯粹的信息服务
Network effects key
网络效应是关键


The product is the customer
产品就是客户

The buyers are the inventory
买家成为存货
Online dating platforms
网络约会平台
Other examples
其他例子
Craigslist
No real-time chat
非实时聊天

Local markets (San Francisco, New York etc)
本地市场(旧金山、纽约等)

Monetization
货币化
Only people who post jobs pay
贴招聘广告者付钱
Genealogy
家谱

Amazing stories of people finding relatives
花费大量精力寻根溯源
Personalized search
个性化搜索

Explicit
显性的
“Customization”:
用户定制
User states interests explicitly
用户告诉对何感兴趣

Implicit
隐性的
Based on user’s past behavior
基于你过去的行为
Needs persistent history
需长时间的历史信息
Problem: Multiple personalities
问题:多重个性

A9 and Google give users access to their entire search history on their platforms
a9、google让你访问所有的搜索历史
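
Implicit personalization can be sketched in Python as a re-sort of the results using the user's stored click history. The topic labels, the sample data, and the function `personalize` are assumptions for illustration; a real system would need a persistent per-user history.

```python
from collections import Counter

def personalize(results, history):
    """Stable re-sort: results whose topic the user clicked more often move up."""
    topic_counts = Counter(history)
    return sorted(results, key=lambda r: -topic_counts[r["topic"]])

# Hypothetical results for an ambiguous query, in the default order.
results = [
    {"title": "Jaguar cars", "topic": "autos"},
    {"title": "Jaguar habitat", "topic": "animals"},
]
history = ["animals", "animals", "autos"]  # topics of earlier clicks (assumed to be logged)

print([r["title"] for r in personalize(results, history)])
# ['Jaguar habitat', 'Jaguar cars']
```

The "multiple personalities" problem shows up directly here: if the same history mixes work and private searches, the topic counts describe neither role well.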
Relevance is everything
关联度最重要

The Search Paradigm
搜索范例
2.4 words, a few clicks, and done
2.4个字,几次点击,就找到了
Only possible if results are relevant
搜索结果关联度很高时才可能

Relevance is ‘speed’
关联度就是“速度”
Time from task initiation to resolution
从任务开始执行到完成的时间
Important factors:
重要因素:

Location of useful result
有用搜索结果的位置

UI Clutter
接口的速度

Latency
反应时间

Relevance is relative
关联度是相对的
Context dependent
内容依赖

E.g. ‘football’ in the UK vs the US
例如,“football”在英国与在美国的含义就不同
Task dependent
任务依赖

E.g. ‘mafia’ when shopping vs researching
例如,“mafia”在购物与在研究中的含义也不同
Relevance is hard to measure
关联度很难测量

Poorly defined, subjective notion
定义不清晰,主观想法

Depends on task, user context, etc.
取决于任务,用户情境等

Analysts have focused on surrogates that are easier to measure
分析时关注更易测度的替代指标
Index size, traffic, speed
索引规模,流量,速度
Anecdotal relevance tests
有趣的关联度测试

e.g. Vanity queries
例如,空内容检索
Methodology important
需要用调查的方法
Averaged over queries
检索要求平均
Averaged over users
用户平均

Development Cycle
发展周期
Tune Ranking
可调节的排序
Evaluate Metrics
评定标准
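
One way to make "averaged over queries" concrete is a metric computed per query and then averaged. The Python sketch below uses mean reciprocal rank, chosen here only as an example of such a metric; the lecture does not name a specific one. The document ids and relevance judgments are hypothetical.

```python
def reciprocal_rank(ranked, relevant):
    """1/position of the first relevant result, or 0.0 if none is returned."""
    for position, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(runs):
    """Average the per-query score over all (ranked, relevant) pairs."""
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)

# Hypothetical judged queries: (ranked result list, set of relevant documents).
runs = [
    (["d1", "d2", "d3"], {"d1"}),  # relevant result on top
    (["d4", "d5", "d6"], {"d5"}),  # relevant result in position 2
    (["d7", "d8"], set()),         # nothing relevant returned
]
print(mean_reciprocal_rank(runs))  # 0.5
```

In the development cycle, a ranking change would be kept only if a metric like this improves across many queries and users, not on a single vanity query.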
User interface
用户接口

Relevance-ranked result lists
排序搜索结果

Document summaries are critical
文件摘要很重要


Hit highlighting
加亮提示

Dynamic abstracts
动态摘要
Multiple sources
多种来源
Predefined segmentation
预先提炼信息池

Assisted search
辅助搜索
Specialized indices
特定索引
via Tabs
通过标签
E.g. Paid listing
如付费列示
Intermixed with results from other sources
将结果与其他信息来源混杂
Spell correction
拼写校正

Blended results
结果混杂


E.g. News
例如新闻
Localization
本地化
Country language experience
语言组合与识别
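
Two of the interface features listed above, hit highlighting and spell correction, can be sketched with the Python standard library. The `**` markers, the tiny vocabulary, and the function names are assumptions for illustration; production spell correction typically draws on query logs rather than a fixed word list.

```python
import difflib
import re

def highlight(text, term):
    """Wrap each case-insensitive occurrence of `term` in ** markers."""
    return re.sub(re.escape(term), lambda m: f"**{m.group(0)}**", text, flags=re.IGNORECASE)

def suggest(word, vocabulary):
    """Return the closest known word, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(word, vocabulary, n=1)
    return matches[0] if matches else None

print(highlight("Search engines rank search results", "search"))
# **Search** engines rank **search** results
print(suggest("serach", ["search", "index", "crawl"]))  # search
```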
Future Trends
未来趋势

Question answering
问题解答

New contexts
新的领域
Ubiquitous searching
无处不在的搜索

Toolbars, desktop, phone
工具栏、桌面、电话等
Implicit searching
模糊搜索


Computed links
计算链接
New tasks
新的任务
E.g. Local search
如本地搜索
Search: Summary
搜索:总结

No longer about filing and organizing, but about searching
不再去归档或组织,而是去搜索
Whether it’s about your email or the knowledge in your company
可能是你的电子邮件,可能是你公司的内部信息和知识
And then about ranking / sorting / relevance
排序/索引分类/关联度

Why does search replace directories / categories?
为什么搜索会替代分类目录
Can be done automatically, in contrast to manual categories
自动执行,不同于手工目录