NUWeb System
WWW Architecture
• Web Server (e.g., Apache, IIS)
• Browser (e.g., IE, Firefox)
• Addressing and Information Channel (DNS,
URL, SearchEngine)
• Abstract Model:
– Provider (server), Consumer (client), Channel
– Client-Server architecture, Centralized Service
Problems of the WWW due to the fundamental design
• Naming/Addressing problem:
– Physical naming/addressing
– Static binding through DNS
– URLs may not be a good design (hard to remember)
– DNS can be slow
• Information flow organization was not designed in from the start
– Hotspot bottleneck problem, bandwidth waste problem
– Cache and proxy technologies were added separately afterwards
• Linkrot problem
– Dead links, wrong links, faked links
– Affects up to approximately 15% of links
• Need a static IP, need to apply for a URL, and need knowledge of building and
managing websites
– Creating and maintaining a website is costly
– Webpage creation is not easy
• Divides the computer world into two hierarchies
– Server: website owners, service providers
– Client: ordinary users
Weaving the Web
(quoted from wikipedia)
• In Berners-Lee's book, Weaving the Web, several
recurring themes are apparent:
– It is just as important to be able to edit the Web as browse it.
Wikis are a step in this direction, although Berners-Lee
considers them merely a shadow of the WYSIWYG functionality
of his first browser.
– Computers can be used for background tasks that enable
humans to work better in groups.
– Every aspect of the Internet should function as a Web, rather
than a hierarchy. Notable current exceptions are the Domain
Name System and the domain naming rules managed by ICANN.
– Computer scientists have a moral responsibility as well as a
technical responsibility.
What Is NUWeb?
• Marriage of WWW with P2P
• Technologically:
– NUWeb = WebServer + Browser + WNS + SearchEngine +
Proxy/Cache + WebBuilder + Blog + CommunityEngine + KIM +
P2P - URL - DNS - Cost :-)
• Logically:
– A New Web System for any net user to build his/her own web in
an extremely easy-to-use way.
– A platform for web-building, information sharing, information
management, community, and service management
• A platform for Webilization
• A project to pursue Wemocracy
NUWeb Functions
• A platform for Public Sharing and Publishing
– Personal website/blog
– Public community
– Search engine
• A platform for Private Sharing and Community
– Personal community builder
– Sharing management
• A platform for personal information/knowledge
management and a content engine
NUWeb Software Architecture
• The NUWeb system is composed of three subsystems
– NUWeb.CC (CyberCenter)
• WNS (Web Name Service)
• Search engine, cache
• Community services (Photo, Blog, Video…)
– NUWeb CP (Community Portal)
• Community services (Blog, Photo, Video…)
• Search engine service
• Proxy and cache
– NUWeb PP (Personal Portal)
• NUWeb browser, KIM
• NUWeb server
• NUWeb personal portal/blog builder
How it works
• Personal Web server on the Windows platform
– Auto indexing, thumbnails
– Auto page generation and run-time rendering
– Auto caching
– Bundled with a PHP/Perl platform
• Registration with WNS during setup
– Site name, user account, SiteKey, …
• UPnP to handle firewall/NAT
• Packet-forwarding proxy to handle the cases
where UPnP does not work correctly
How it works (2)
• Each time a client comes online, it sends its current
IP and name/key info to the WNS center.
• A connection request to a personal site first sends
the site name to the WNS to obtain the current IP of
the target site (dynamic binding).
• If the requested site is not online, the center
redirects the request to the cache server.
• If the site is reachable only through a proxy, the
connection goes through the relay proxy.
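The lookup flow above can be sketched as a small in-memory name service. This is only an illustration under stated assumptions: the class and method names (`WNS`, `go_online`, `resolve`), the host names, and the returned route labels are hypothetical and do not come from the NUWeb implementation.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class SiteEntry:
    site_key: str              # secret agreed at registration (the slides' SiteKey)
    ip: Optional[str] = None   # None while the site is offline
    via_relay: bool = False    # True when the site is reachable only through the relay proxy


class WNS:
    """Toy name service: site name -> current address (dynamic binding)."""

    def __init__(self, cache_host: str, relay_host: str) -> None:
        self.cache_host = cache_host   # hypothetical cache server address
        self.relay_host = relay_host   # hypothetical relay proxy address
        self.sites: Dict[str, SiteEntry] = {}

    def register(self, name: str, site_key: str) -> None:
        self.sites[name] = SiteEntry(site_key=site_key)

    def go_online(self, name: str, site_key: str, ip: str, via_relay: bool = False) -> bool:
        """Record the site's current IP; reject the update if the key does not match."""
        entry = self.sites.get(name)
        if entry is None or entry.site_key != site_key:
            return False
        entry.ip, entry.via_relay = ip, via_relay
        return True

    def go_offline(self, name: str) -> None:
        if name in self.sites:
            self.sites[name].ip = None

    def resolve(self, name: str) -> Tuple[str, str]:
        """Return (route, host) where route is direct, relay, cache, or unknown."""
        entry = self.sites.get(name)
        if entry is None:
            return ("unknown", "")
        if entry.ip is None:
            return ("cache", self.cache_host)   # site offline: serve from the cache server
        if entry.via_relay:
            return ("relay", self.relay_host)   # site behind NAT: go through the relay
        return ("direct", entry.ip)
```

Because the binding is refreshed every time a site comes online, the site name stays stable even when the IP changes, which is why a fixed IP is not required.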
Naming and Dynamic Addressing
– A page is a textual web document. It contains UltraLinks or tags,
and displaying a page may trigger the display of other objects,
such as included images.
– An object is either a rich-text document (PDF, MS-DOC,
MS-PPT, etc.), a multimedia file, or any single file that can be
accessed in the web space.
– A resource is either a page or an object.
– GRN (Global Resource Naming)
• SiteUniqName#objectname[#class#type#location]
– A fixed IP is not necessary
– ABN (AddressByName), ABI (AddressById),
ABC (AddressByContent)
– USI (UniversalSiteId)
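A minimal sketch of parsing the GRN syntax shown above. It assumes that the bracketed suffix is all-or-nothing (either none or all three of class, type, and location are present) and that "#" never appears inside a component; the slides do not state either rule, and the names `GRN` and `parse_grn` are hypothetical.

```python
from typing import NamedTuple, Optional


class GRN(NamedTuple):
    site: str                       # SiteUniqName
    obj: str                        # objectname
    cls: Optional[str] = None       # optional class
    kind: Optional[str] = None      # optional type
    location: Optional[str] = None  # optional location


def parse_grn(text: str) -> GRN:
    """Split SiteUniqName#objectname[#class#type#location] into its parts."""
    parts = text.split("#")
    # Assumption: exactly 2 parts (short form) or 5 parts (full form).
    if len(parts) not in (2, 5) or not all(parts[:2]):
        raise ValueError(f"not a valid GRN: {text!r}")
    return GRN(*parts)
```

For example, `parse_grn("alice#photos/cat.jpg")` yields a GRN whose `site` is `alice` and whose optional fields are `None`.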
NUWeb CyberCenter
• GRI: Global Resource Index
– A distributed index structure for objects/pages in the NUWeb
space
– Uses a hash data structure
• Search engine, Community Service, Portal for NUWeb
• Proxy & Caching
– Auto backup and versioning
– Info filtering, content switching
– Packet forwarding, center relay
– Relay casting, media streaming
– Hierarchical search
– Collaborative cache (super cache)
Site Initialization
• When a new site is installed:
– Register the following info
• SiteUniqName, settled interactively with the center
• Titles of the site (at most T bytes)
• Abstract of the site (at most P bytes)
• Tags (inappropriate tags, such as those infringing others' rights,
will be removed by the center)
• Country/city/county: real-world geographic info
• Profile of personal info
• Residents: SUN.resident identifies a user
– Decide which directories to open to the public
– Decide which directories to open to private
connections
– Decide whether to allow caching of the public
directory
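The registration record above might be modelled as follows. This is a sketch only: the field names are hypothetical, the slides leave T and P unspecified (the constants below are placeholders), and the check that a directory is not both public and private is an added assumption.

```python
from dataclasses import dataclass, field
from typing import List

MAX_TITLE_BYTES = 64      # stands in for "T"; the slides give no value
MAX_ABSTRACT_BYTES = 512  # stands in for "P"; likewise a placeholder


@dataclass
class SiteRegistration:
    site_uniq_name: str
    title: str
    abstract: str
    tags: List[str] = field(default_factory=list)
    location: str = ""                                      # country/city/county
    public_dirs: List[str] = field(default_factory=list)    # open to the public
    private_dirs: List[str] = field(default_factory=list)   # open to private connections
    cache_public: bool = True                               # allow caching of public dirs

    def problems(self) -> List[str]:
        """Return a list of validation problems; empty means the record is acceptable."""
        found = []
        if not self.site_uniq_name:
            found.append("SiteUniqName is required")
        # Limits are in bytes, so measure the encoded length, not the character count.
        if len(self.title.encode("utf-8")) > MAX_TITLE_BYTES:
            found.append("title exceeds T bytes")
        if len(self.abstract.encode("utf-8")) > MAX_ABSTRACT_BYTES:
            found.append("abstract exceeds P bytes")
        overlap = set(self.public_dirs) & set(self.private_dirs)
        if overlap:  # assumption: a directory is either public or private, not both
            found.append(f"directories both public and private: {sorted(overlap)}")
        return found
```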
Site Initialization (2)
• The server builds an index of the pages/objects
covered by the site. The indexes for public and
private areas are kept separate to protect privacy.
• The index covers names and signatures, plus the
content of pages. Content indexing for objects
such as MS-DOC and PDF files is optional.
• After the site is set up, the user is asked to
provide a list of friends to whom the system will
send invitations.
NUWeb Services
• NUSite, NUBlog
• NUSearch, NUSM
• NUCommunity, NUBBS
• NUBot, NUWatch, NUPush
• NUCache, NUProxy
• NUPedia, knowledge authoring/manager
• NUMail, P2P secure mail system
• NUJournal
Searching
• Search in the NUWeb center includes:
– Search for pages/objects by name (WNS)
– Page content search
– * Attribute search, for example, searching for pages authored by
Hamming
• The indexer in each NUSite sends a raw index to the center, and
the center builds an index from it. The raw index is a record containing
the indexable text of each page or object. A text extractor is used
to extract text from rich-text documents such as MS-DOC/PPT
documents. Uploading the raw index requires the user's approval
first.
• Before rendering a search result to the user, the searcher needs to
check whether the result page/object exists at that moment.
• It uses the SSN to check the SiteDB and see whether the site is
available. It also uses the GRN to check whether the resource is
available in the cache.
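The availability check before rendering results can be sketched like this. The function name and the shape of a hit, an `(ssn, grn)` pair, are assumptions; the SSN is treated as an opaque site identifier because the slides do not define it further.

```python
from typing import Dict, Iterable, List, Set, Tuple


def filter_hits(
    hits: Iterable[Tuple[str, str]],
    online_sites: Set[str],
    cached_grns: Set[str],
) -> List[Dict[str, str]]:
    """Keep only hits that can be served right now, noting where from."""
    servable = []
    for ssn, grn in hits:
        if ssn in online_sites:        # SiteDB says the home site is alive
            servable.append({"grn": grn, "source": "site"})
        elif grn in cached_grns:       # otherwise fall back to a cached copy
            servable.append({"grn": grn, "source": "cache"})
        # Neither the site nor a cached copy is reachable: drop the hit.
    return servable
```

Preferring the live site over the cache is a design choice of this sketch; the slides only say that both are checked.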
Caching
• Caching
– Every site page is cached automatically unless caching is
explicitly disabled
– In the first phase, caching is done in the center and in
the NUWeb CP cache spaces. Objects are cached when
accessed
• The client caches the object in its cache spool and sends
an index entry to the center to announce that it holds
the object in cache.
– In the second phase, caching is also done collaboratively
in the P2P space, assuming that some of the personal
sites are willing to participate.
– Cached objects are indexed by GRN and MD5
– When an object is modified, an update is sent to
the global cache space to remove the original cached copy
indexed by GRN
– Each cached object records a timestamp of the content
(the time the content was created).
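A minimal sketch of a cache indexed by both GRN and MD5, with the invalidation rule described above. The class and method names are hypothetical, and storing content once per MD5 digest (so that identical objects under different GRNs share one copy) is an assumption of this sketch, not something the slides state.

```python
import hashlib
from typing import Dict, Optional, Tuple


class ObjectCache:
    """Content is stored once per MD5 digest; GRNs point at digests."""

    def __init__(self) -> None:
        self.by_md5: Dict[str, Tuple[bytes, float]] = {}  # digest -> (content, created_at)
        self.grn_to_md5: Dict[str, str] = {}              # GRN -> digest

    def put(self, grn: str, content: bytes, created_at: float) -> str:
        digest = hashlib.md5(content).hexdigest()
        old = self.grn_to_md5.get(grn)
        if old is not None and old != digest:
            self.invalidate(grn)  # the object was modified: drop the stale copy
        # Keep the earliest timestamp if the same content is already cached.
        self.by_md5.setdefault(digest, (content, created_at))
        self.grn_to_md5[grn] = digest
        return digest

    def invalidate(self, grn: str) -> None:
        digest = self.grn_to_md5.pop(grn, None)
        # Remove the content only when no other GRN still refers to it.
        if digest is not None and digest not in self.grn_to_md5.values():
            self.by_md5.pop(digest, None)

    def get_by_md5(self, digest: str) -> Optional[bytes]:
        entry = self.by_md5.get(digest)
        return entry[0] if entry else None

    def get_by_grn(self, grn: str) -> Optional[bytes]:
        digest = self.grn_to_md5.get(grn)
        return self.get_by_md5(digest) if digest else None

    def created_at(self, digest: str) -> Optional[float]:
        entry = self.by_md5.get(digest)
        return entry[1] if entry else None
```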
GRI & Collaborative Proxy
• GRI:
– Object indexed by MD5-signature & GRN
– Home page indexed by GRN
– Instance indexed by MD5
• Syntax:
– GRN: SUN#OBN
• Distributed/Collaborative GRI
• Multi-tier Collaborative Proxy
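The slides do not say how the GRI is distributed. One common way to spread a hash-based index over several nodes is to partition keys by hash, sketched below; the partitioning scheme, the class name `DistributedGRI`, and the node names are all assumptions. A key may be either a GRN or an MD5 signature, and each key maps to the set of sites that hold the resource.

```python
import hashlib
from typing import Dict, List, Set


class DistributedGRI:
    """Toy index partitioned over nodes by hashing the key (an assumed scheme)."""

    def __init__(self, node_names: List[str]) -> None:
        if not node_names:
            raise ValueError("at least one index node is required")
        self.node_names = list(node_names)
        # One shard per node: key -> set of sites holding that resource.
        self.shards: Dict[str, Dict[str, Set[str]]] = {n: {} for n in self.node_names}

    def node_for(self, key: str) -> str:
        """Pick the node responsible for a key; stable across processes."""
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return self.node_names[int.from_bytes(digest[:8], "big") % len(self.node_names)]

    def add(self, key: str, holder_site: str) -> None:
        self.shards[self.node_for(key)].setdefault(key, set()).add(holder_site)

    def lookup(self, key: str) -> Set[str]:
        return set(self.shards[self.node_for(key)].get(key, ()))
```

Simple modulo partitioning reshuffles most keys when a node joins or leaves; a real deployment would likely need something more stable, but that is beyond what the slides describe.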
Indices (1)
• In the NUWeb center, there are several indices:
– SiteDB: indexed by SSN
• Last live time, access count, data size
• While alive, each site periodically sends alive info to
the center (every K minutes)
– NameDB: indexed using gaisindex
• Each name is associated with an SSN, by which we can
check whether the page/object exists.
• Each name has a record holding an SSN
value and a GRN cache flag
• In NameDB search results, a record with no
online instance (neither the original site nor a cached copy)
carries a flag indicating “not available”
Indices (2)
– MD5 index: objects/pages indexed by MD5
signature. Each site produces an MD5 signature
for each object, and the (GRN, MD5) info is sent
to the center to be indexed. An MD5 lookup
returns the source SSN/IP or the IP(s) of the cache site(s)
– Page/document content index
• Indexed through the GAIS search engine
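The SiteDB liveness rule (periodic alive messages every K minutes) and the NameDB "not available" flag can be sketched as follows. The names `SiteDB`, `heartbeat`, and `name_flag`, and the grace factor that tolerates one missed message, are assumptions; time is passed in explicitly so the logic is easy to test.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class SiteRecord:
    last_live: float = 0.0   # time of the most recent alive message
    access_count: int = 0
    data_size: int = 0


class SiteDB:
    def __init__(self, interval_s: float, grace: float = 2.0) -> None:
        self.interval_s = interval_s  # the "K minutes" period, in seconds
        self.grace = grace            # assumed tolerance: allow one missed message
        self.records: Dict[str, SiteRecord] = {}

    def heartbeat(self, ssn: str, now: float) -> None:
        self.records.setdefault(ssn, SiteRecord()).last_live = now

    def is_alive(self, ssn: str, now: float) -> bool:
        rec = self.records.get(ssn)
        return rec is not None and now - rec.last_live <= self.interval_s * self.grace


def name_flag(db: SiteDB, ssn: str, has_cache_copy: bool, now: float) -> str:
    """Availability label for one NameDB record."""
    if db.is_alive(ssn, now):
        return "online"
    return "cached" if has_cache_copy else "not available"
```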
NUWeb Portal Service
• Search engine for the NUWeb cyberspace
– Websites, pages, pictures, videos, documents,
articles, etc., …
• Browsing and Viewing
– What’s hot, what’s new, what’s cool,
– Automatically generated by a page
rendering tool based on a CountDB and a list
manager.
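A "what's hot" list driven by access counts could look like the sketch below. The `CountDB` name comes from the slide, but its interface here (`hit`, `hottest`) is hypothetical.

```python
from collections import Counter
from typing import List, Tuple


class CountDB:
    """Counts accesses per resource, keyed by GRN."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def hit(self, grn: str) -> None:
        self.counts[grn] += 1

    def hottest(self, n: int) -> List[Tuple[str, int]]:
        """Return the n most accessed resources, most popular first."""
        return self.counts.most_common(n)
```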
NUWeb DB
• The NUWeb cache is implemented on top of the NUWeb DB
system.
• NUWeb DB stores Web objects and their relationships and
provides a search function.
– Web DB:
• ODB (Object DB)
• NDB (Name DB)
• IDB (Index DB)
• TDB (Term DB)
• UDB (User DB)
• SDB (Site DB)
• Page Engine
• Access Log DB (PV DB)
• Access Control
• Query Interface (including SQL) *
Web DB implementation
• ODB and NDB are the kernel storage DBs
• The key technique used in ODB and NDB is the
Hash DB, which needs to minimize disk
seeks and make maximal use of memory.
• PV DB (Access log DB) is implemented on top of
ODB and NDB.
• Term DB is implemented on top of ODB too.
Term DB records the term frequency, term
score … information.
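A minimal sketch of the term-frequency part of Term DB. It keeps everything in memory rather than on top of ODB, tokenizes with a simple regular expression, and omits term scores because the slides do not define them; all names are hypothetical.

```python
import re
from collections import Counter, defaultdict
from typing import Dict


class TermDB:
    """term -> per-resource occurrence counts, keyed by GRN."""

    def __init__(self) -> None:
        self.tf: Dict[str, Counter] = defaultdict(Counter)

    def add_document(self, grn: str, text: str) -> None:
        # Assumed tokenizer: lowercase runs of word characters.
        for term in re.findall(r"\w+", text.lower()):
            self.tf[term][grn] += 1

    def frequency(self, term: str, grn: str) -> int:
        """How many times the term occurs in the given resource."""
        return self.tf.get(term.lower(), Counter())[grn]

    def document_frequency(self, term: str) -> int:
        """How many resources contain the term at least once."""
        return len(self.tf.get(term.lower(), ()))
```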
Web DB implementation (2)
• Site DB records the site info such as
access frequency, size, dynamics, etc.
• IDB is a real time index engine for all the
objects stored in Web DB.
• Access Control:
– Authorization: permission list based
– Authentication: through an authentication
center in WNS server.
• SQL is not supported yet; it is on the to-do list.
NUDB
• Net User’s DataBase
• Easy to use
– No database background is needed
– No programming required
– Define the spec and start using it
• Spec can be adjusted flexibly
– Scalable
• Combine the advantages of Table processing
software such as Excel and Database systems
• Portable, computable, mergeable
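To illustrate "define the spec and start to use" with a spec that can be adjusted later, here is a sketch of a spreadsheet-like table whose spec maps field names to default values. The `Table` class and its methods are hypothetical and say nothing about how NUDB actually stores data.

```python
from typing import Any, Dict, List


class Table:
    """Rows are plain dicts; the spec maps each field name to its default value."""

    def __init__(self, spec: Dict[str, Any]) -> None:
        self.spec = dict(spec)
        self.rows: List[Dict[str, Any]] = []

    def insert(self, **values: Any) -> None:
        unknown = set(values) - set(self.spec)
        if unknown:
            raise KeyError(f"fields not in spec: {sorted(unknown)}")
        self.rows.append({**self.spec, **values})  # fill the gaps with defaults

    def add_field(self, name: str, default: Any = None) -> None:
        """Adjust the spec after the fact; existing rows pick up the default."""
        self.spec[name] = default
        for row in self.rows:
            row.setdefault(name, default)

    def merge(self, other: "Table") -> None:
        """Absorb another table, widening this spec with any new fields."""
        for name, default in other.spec.items():
            if name not in self.spec:
                self.add_field(name, default)
        for row in other.rows:
            self.rows.append({**self.spec, **row})
```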
NUDB implementation
• Physical DB Kernel
– Hash DB
– Inverted Index
– Pattern Matching
• Schema Layer, and Query Processing
• User Interface Layer
– Data Presentation Management
– DUA (Database User Agent, similar to an MUA)
NUBlog
• AJAX-based blog system
• Personal blog home base
– Can have multiple copies on the web
– Creation, management, posting
• Import, export:
– XML-RPC
– Robot simulating browser behaviour
NUWatch
• Personal Web Agent
• Event Watch, News Watch
• Service Watch
• Site Watch
• Commerce Watch
NUWatch Implementation
• Personal Profile Manager
• Matching Platform
– On the fly matching
– Batch mode matching through searching
• Data Source Agents
– Per user agent
– Centralized agent (can reduce overhead)
• Notification Agent
– Relay casting to speed up
– Gateway to message system
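On-the-fly matching in the matching platform can be sketched as a keyword test of each incoming item against every user's profile. Keyword profiles and the function name `match_item` are assumptions; the slides do not describe what a personal profile contains.

```python
import re
from typing import Dict, List, Set


def match_item(item_text: str, profiles: Dict[str, Set[str]]) -> List[str]:
    """Return the users whose profile keywords appear in the incoming item."""
    words = set(re.findall(r"\w+", item_text.lower()))
    return sorted(
        user
        for user, keywords in profiles.items()
        if {k.lower() for k in keywords} & words  # any shared keyword is a match
    )
```

The matched users would then be handed to the notification agent; batch-mode matching would instead run each profile as a query against the search index.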
NUCommunity
• Personal and Regional Community Engine
– Forum, Vote
– Calendar, File Sharing
– Address Book, DB, ..
– Interaction mechanism (auto notification, ..)
• A community is conceptually given a NUWeb
site
• A community is treated like a user by the NUWeb
space's authentication and authorization
Access Control
• Supports both password-based and
membership-based protection.
• Each directory is associated with a
protection data structure
• Authentication is done in the WNS server
• Membership-based protection uses the
Permission List technique
• Protection is per directory; no
inheritance is assumed.
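A sketch of per-directory protection with no inheritance. It covers authorization only and assumes the user name passed in has already been authenticated (by the WNS server, per the slides). The class names, the deny-by-default rule for directories without a protection record, and the use of PBKDF2 for stored passwords are assumptions of this sketch.

```python
import hashlib
import hmac
import os
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional, Set


def _hash(password: str, salt: bytes) -> bytes:
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


@dataclass
class Protection:
    members: Set[str] = field(default_factory=set)  # the permission list
    salt: bytes = b""
    password_hash: Optional[bytes] = None


class AccessControl:
    def __init__(self) -> None:
        self.rules: Dict[str, Protection] = {}  # directory -> protection record

    def protect(self, directory: str, members: Iterable[str] = (),
                password: Optional[str] = None) -> None:
        rule = Protection(members=set(members))
        if password is not None:
            rule.salt = os.urandom(16)
            rule.password_hash = _hash(password, rule.salt)
        self.rules[directory] = rule

    def allowed(self, directory: str, user: Optional[str] = None,
                password: Optional[str] = None) -> bool:
        # Exact-match lookup: a rule on /photos says nothing about /photos/2007.
        rule = self.rules.get(directory)
        if rule is None:
            return False  # assumption: directories without a record are closed
        if user is not None and user in rule.members:
            return True
        if password is not None and rule.password_hash is not None:
            return hmac.compare_digest(_hash(password, rule.salt), rule.password_hash)
        return False
```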
NUJournal
• Why is publication done through paper?!
– In the past, publications HAD TO BE published on paper
– A journal is both a channel and a barrier :-)
– Most papers enter a dead state once published :-(
• A new model of publication
– Separate the concepts of publication and evaluation
– Publishing is the author's own decision and can be done through the
author's own website, where the work is reviewed and commented on by
readers or reviewers.
– A journal is a marketplace that gathers and guides access to publications
and comments on and evaluates them
– A publication can be a long-lived object
– Other authors can join a published work over time if they make
substantial contributions to it.
– A publication is evaluated by its contribution and impact.
Thanks!