Data Communications and Computer Networks: A Business User’s Approach
Chapter 11
The Internet
1
This time
Move up the OSI hierarchy
• Internet
• Apps
• Protocols – XXXP
2
The Internet Model
3
Introduction
Today’s Internet is a vast collection of thousands of networks and their attached devices.
The Internet began as the Arpanet during the 1960s.
One high-speed backbone connected several university, government, and research sites.
The backbone was capable of supporting 56 Kbps transmission speeds and was eventually financed by the National Science Foundation (NSF).
4
Old NSFnet backbone & connecting midlevel and campus networks
5
Brief History of the Internet (1)
• 1964 - Packet switching network paper by Rand Corporation
• 1969 - The DOD Advanced Research Projects Agency creates an experimental network called ARPANET
• 1972 - Email programs sent
• 1980s - ARPANET splits into two networks: ARPANET and MILNET
• 1984 - Arpanet shut down and Internet resulted
• 1987 - NSFnet Network Service Center (NNSC)
6
Brief History of the Internet (2)
• 1993 - InterNIC formed, replacing NNSC
• 1993 - CERN releases the World Wide Web (WWW), developed by Tim Berners-Lee
• 1993-1994 - The graphical web browsers Mosaic and Netscape Navigator are introduced
• 1995 - NSF ends all support for the backbone, and the Internet becomes commercially supported
• 1996 - present - Internet access increases rapidly among home, education and business users
7
Brief History of the Internet (3)
• Internet Growth in Nodes
– 1969 - only 4
– 1983 - approximately 500
– 1989 - approximately 80,000
– 1997 - over 16 million
– Now - over 370 million
8
Internet Growth
• http://www.netsizer.com/
• Hosts vs nodes
• Hosts – users connected to the Internet: 130 M (2001)
• Nodes are all connected devices
9
Internet Services
The Internet provides many types of services, including several very common ones:
• File transfer protocol (FTP)
• Remote login (Telnet)
• Internet telephony
• Electronic mail
• World Wide Web
• Streaming Video and Audio
10
File Transfer Protocol (FTP)
Used to transfer files across the Internet.
User can upload or download a file.
The URL for an FTP site begins with ftp://…
The three most common ways to access an FTP site are:
1. Through a browser
2. Using a canned FTP program
3. Issuing FTP commands at a text-based command prompt.
11
Remote Login (Telnet)
Allows a user to remotely login to a distant computer site.
User usually needs a login and password for the remote computer site.
User saves money on long distance telephone charges.
12
Internet Telephony
The transfer of voice signals using a packet switched network and the IP protocol.
Also known as packet voice, voice over packet, voice over the Internet, and voice over Internet Protocol (VoIP).
VoIP can be internal to a company or can be external using the Internet.
VoIP consumes many resources and may not always work well, but can be cost effective in certain situations.
13
Internet Telephony (VoIP)
Three basic ways to make a telephone call using VoIP:
1. PC to PC using sound cards and headsets (or speakers and microphone)
2. PC to telephone (need a gateway to convert IP addresses to telephone numbers)
3. Telephone to telephone (need gateways)
14
Internet Telephony (VoIP)
Three functions necessary to support voice over IP:
1. Voice must be digitized (PCM, 64 Kbps, fairly standard)
2. 64 Kbps voice must be compressed (many standards here: ITU-T G.729A, used by AT&T, Lucent, others; G.723.1, used by Microsoft and Intel)
3. Once the voice is compressed, the data must be transmitted. Many different ways to do this.
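As a rough check on the numbers above: the 64 Kbps PCM figure follows from telephone-quality sampling at 8,000 samples per second with 8 bits per sample. The sketch below computes that rate and the compression ratio against G.729A, which runs at 8 Kbps. The function and constant names are illustrative only.

```python
# Back-of-the-envelope VoIP bit rates (illustrative sketch).

def pcm_bit_rate(samples_per_second: int, bits_per_sample: int) -> int:
    """Return the uncompressed PCM bit rate in bits per second."""
    return samples_per_second * bits_per_sample


def compression_ratio(raw_bps: int, compressed_bps: int) -> float:
    """Return how many times smaller the compressed stream is."""
    return raw_bps / compressed_bps


PCM_BPS = pcm_bit_rate(8000, 8)   # telephone-quality PCM
G729A_BPS = 8000                  # G.729A codec bit rate

print(PCM_BPS, compression_ratio(PCM_BPS, G729A_BPS))
```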
15
Internet Telephony (VoIP)
How can we transport compressed voice?
• Streaming audio, such as Real Time Streaming Protocol (RTSP) and Microsoft’s Active Streaming Format (ASF)
• Resource Reservation Protocol (RSVP) - carries a specific QoS through the network, reserving bandwidth at every node. Operates at the transport layer.
• Internet Stream Protocol version 2 (ST2) - an experimental resource reservation protocol that operates at the same layer as IP
16
Electronic Mail
E-mail programs can create, send, receive, and store e-mails, as well as reply to, forward, and attach non-text files.
Multipurpose Internet Mail Extension (MIME) is used to send e-mail attachments.
Simple Mail Transfer Protocol (SMTP) is used to transmit e-mail messages (uses TCP port 25).
An e-mail daemon is always waiting to perform its function.
Post Office Protocol version 3 (POP3) and Internet Message Access Protocol (IMAP) are used to hold and later retrieve e-mail messages.
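A minimal sketch of how MIME and SMTP fit together, using Python's standard email package. The addresses, file name, and mail server below are hypothetical. The message is only built here, not sent.

```python
from email.message import EmailMessage


def build_message(sender, recipient, subject, body, attachment, filename):
    """Build a MIME e-mail with a text body and one binary attachment."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)  # plain-text part
    # Adding a non-text part turns the message into multipart/mixed (MIME).
    msg.add_attachment(attachment, maintype="application",
                       subtype="octet-stream", filename=filename)
    return msg


msg = build_message("alice@example.com", "bob@example.com", "Report",
                    "See attached.", b"\x00\x01\x02", "data.bin")
print(msg.get_content_type())

# To transmit over SMTP (TCP port 25), one could use, for example:
#   smtplib.SMTP("mail.example.com", 25).send_message(msg)
# The server name is an assumption, so the call is left as a comment.
```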
17
Consists of 2 parts:
• User Agent: Allows users to create, edit, store and forward messages
• Message Transfer Agent: Prepares and transfers e-mail messages
18
Electronic Mail Holders
Post Office Protocol version 3 (POP3) and Internet Message Access Protocol (IMAP) are used to hold and later retrieve e-mail messages.
POP allows you to save messages in your e-mail box.
IMAP allows you to view only the message headings without downloading everything. It also permits mailboxes, search, etc.
19
Listservs
A popular software program used to create and manage Internet mailing lists.
When an individual sends an e-mail to a listserv, the listserv sends a copy of the message to all listserv members.
Listservs can be useful business tools for individuals trying to follow a particular area of study.
20
Usenet
A voluntary set of rules for passing messages and maintaining newsgroups.
A newsgroup is the Internet equivalent of an electronic bulletin board system.
Thousands of Usenet groups exist on virtually any topic.
21
Streaming Audio and Video
The continuous download of a compressed audio or video file, which can be heard or viewed on the user’s workstation.
Real-time Protocol (RTP) and Real Time Streaming Protocol (RTSP) support streaming audio and video.
Streaming audio and video consume a large amount of network resources.
22
World Wide Web
The World Wide Web (WWW) is an immense collection of web pages and other resources that can be downloaded across the Internet and displayed on a workstation via a web browser.
Browser is the user agent.
The most popular service on the Internet.
Basic web pages are created with the HyperText Markup Language (HTML).
23
World Wide Web
While HTML is the language to display a web page, HyperText Transfer Protocol (HTTP) is the protocol to transfer a web page.
Many extensions to HTML have been created. Dynamic HTML is a very popular extension to HTML.
Common examples of dynamic HTML include mouse-over techniques, live positioning of elements (layers), data binding, and cascading style sheets.
24
World Wide Web – XML
Extensible Markup Language (XML) is a description for how to create a document - both the definition of the document and the contents of the document.
The syntax of XML is fairly similar to HTML.
You can define your own tags, such as
25
e-Commerce and e-government
The buying and selling of goods and services via the internet.
Government transactions via the Internet.
e-commerce major areas:
1. e-retailing
2. Electronic Data Interchange (EDI)
3. Micro-marketing
4. Electronic security
5. Web services
26
• Security of Data
• Privacy of Data
• Transaction Processing Integrity
• Business Policies
27
Security of Data
• How secure is the data maintained by the business?
– Personal/business entity data
– Data stored by a web site that is used by a trading partner to make transaction decisions
• How secure is the data as it is transmitted to and from this business?
28
Business Policies
• What are the business policies and practices of this business?
– billing and payment policies
– shipping policy
– return policy
– tax collection
– additional policy information
29
Transaction Processing Integrity
• What procedures are in place to ensure that the transactions are handled as disclosed?
– How does the company ensure that it does not lose orders placed?
– How does the company ensure that it accurately processes bills and account information?
– What controls exist to ensure that the company accurately posts payment in a timely fashion?
– Does the company have controls in place to ensure that it ships the right inventory items and quantities?
30
Privacy of Data
• What is the privacy policy of the business?
• What information does it keep?
• How will the information collected be used by the business?
• Will this business share or sell customer data without the customer’s permission or knowledge?
• What ensures that the company’s privacy policies are observed and practiced on a continuous basis?
31
Security Assurance Systems ensure that...
• The transacting parties are authenticated - who they claim to be - a security issue
• Electronic data are protected from unauthorized disclosure - a security issue
32
Electronic Data Interchange...
• is the electronic exchange of business documents between trading partners using a standardized format.
• Traditional EDI
– High start-up costs
– Used primarily by large firms
– Generally, even large firms could only connect with 20% of their trading partners
33
Cookies and State Information
A cookie is data created by a web server that is stored on the hard drive of a user’s workstation.
This state information is used to track a user’s activity and to predict future needs.
Information on previous viewing habits stored in a cookie can also be used by other web sites to provide customized content.
Many consider cookies to be an invasion of privacy.
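A small sketch of the cookie round trip described above, using Python's standard http.cookies module. The cookie name and value are made up for illustration. The first half plays the web server creating the cookie; the second half parses what a browser would send back on a later visit.

```python
from http.cookies import SimpleCookie

# Server side: create state information to be stored on the user's workstation.
server_cookie = SimpleCookie()
server_cookie["last_viewed"] = "networking-books"   # hypothetical name/value
server_cookie["last_viewed"]["path"] = "/"
server_cookie["last_viewed"]["max-age"] = 3600       # keep for one hour
set_cookie_value = server_cookie["last_viewed"].OutputString()
print("Set-Cookie:", set_cookie_value)

# Later request: the browser returns the stored name=value pair,
# and the server reads it to recover the user's previous activity.
returned = SimpleCookie()
returned.load("last_viewed=networking-books")
print(returned["last_viewed"].value)
```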
www.cookiecentral.com
34
Cookie Control
• Delete cookies after they are inserted
• Accept no cookies or only restricted cookies
• Change permissions
www.cookiecentral.com
35
Intranets and Extranets
An intranet is a TCP/IP network inside a company that allows employees to access the company’s information resources through an Internet-like interface.
When an intranet is extended outside the corporate walls to include suppliers, customers, or other external agents, the intranet becomes an extranet.
36
Internet Protocols
To support the Internet and all its services, many protocols are necessary.
Some of the protocols that we will look at:
• Internet Protocol (IP)
• Transmission Control Protocol (TCP)
• Address Resolution Protocol (ARP)
• Domain Name System (DNS)
37
Internet Protocols
Recall that the Internet with all its protocols follows the Internet model.
An application, such as e-mail, resides at the highest layer.
A transport protocol, such as TCP, resides at the transport layer.
The Internet Protocol (IP) resides at the Internet or network layer.
A particular medium and its framing reside at the interface layer.
38
The Internet Model
39
Network Layer
Responsible for creating, maintaining, and ending network connections.
Transfers a data packet from node to node within the network.
• Message routing
• Billing
• Accounting
40
Transport Layer
Provides an end-to-end, error-free network connection.
Makes sure the data arrives at the destination exactly as it left the source.
Makes sure all information is accounted for:
– Missing information
– Duplicated information
41
The Internet Protocol (IP)
IP prepares a packet called a datagram for transmission across the Internet.
The IP header is encapsulated onto a transport data packet.
The IP packet is then passed to the next layer where further network information is encapsulated onto it.
42
Progression of a datagram packet from one network to another
43
The Internet Protocol (IP)
Using IP, a subnet router:
• Makes routing decisions based on the destination address.
• May have to fragment the datagram into smaller datagrams (very rare) using the Fragment Offset field.
• May determine that the current datagram has been hopping around the network too long and delete it, using the Time to Live (TTL) field.
44
Format of the IP Datagram
45
The Transmission Control Protocol (TCP)
The TCP layer creates a connection between sender and receiver using port numbers.
The port number identifies a particular application on a particular device (IP address).
• ftp: 20 (data) and 21 (control)
• smtp: 25
• http: 80
TCP can multiplex multiple connections (using port numbers) over a single IP line.
46
The Transmission Control Protocol (TCP)
The TCP layer can ensure that the receiver is not overrun with data (end-to-end flow control) using the Window field.
TCP can perform end-to-end error detection (Checksum); segments that arrive damaged are recovered by retransmission.
TCP allows for the sending of high priority data (Urgent Pointer).
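To make the Checksum field concrete, here is a minimal sketch of the 16-bit one's-complement Internet checksum used by TCP, UDP, and the IP header. It is simplified: a real TCP checksum also covers a pseudo-header with the IP addresses, which is omitted here. The function names are illustrative.

```python
def internet_checksum(data: bytes) -> int:
    """Return the 16-bit one's-complement checksum of data."""
    if len(data) % 2:
        data += b"\x00"                      # pad to a whole number of 16-bit words
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                       # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def verify(data: bytes, checksum: int) -> bool:
    """Receiver-side check: data plus its checksum must sum to all ones."""
    if len(data) % 2:
        data += b"\x00"
    return internet_checksum(data + checksum.to_bytes(2, "big")) == 0


segment = bytes.fromhex("0001f203f4f5f6f7")
checksum = internet_checksum(segment)
print(hex(checksum), verify(segment, checksum))
```

A receiver that computes a mismatch simply discards the segment; the sender's retransmission does the actual recovery.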
47
Fields of the TCP Header
48
Internet Control Message Protocol (ICMP)
ICMP, which is used by routers and nodes, performs the error reporting for the Internet Protocol.
ICMP reports errors such as invalid IP address, invalid port address, and the packet has hopped too many times.
49
Ping (Packet Internet Groper)
ping command
50
Ping – TCP/IP Troubleshooting
• Ping is the primary tool for troubleshooting IP-level connectivity. Type ping -? at a command prompt to see a complete list of available command-line options. Ping allows you to specify the size of packets to use (the default is 32 bytes), how many to send, whether to record the route used, what Time To Live (TTL) value to use, and whether to set the "don't fragment" flag.
• When a ping command is issued, the utility sends an ICMP Echo Request to a destination IP address. Try pinging the IP address of the target host to see if it responds. If that succeeds, try pinging the target host using a host name. Ping first attempts to resolve the name to an address through a DNS server, then a WINS server (if one is configured), then attempts a local broadcast. When using DNS for name resolution, if the name entered is not a fully qualified domain name, the DNS name resolver appends the computer's domain name or names to generate a fully qualified domain name.
• If pinging by address succeeds but pinging by name fails, the problem usually lies in name resolution, not network connectivity. Note that name resolution might fail if you do not use a fully qualified domain name for a remote name. These requests fail because the DNS name resolver is appending the local domain suffixes to a name that resides elsewhere in the domain hierarchy.
51
tracert command
tracert – trace route
52
How the TRACERT command works
• The TRACERT diagnostic utility determines the route taken to a destination by sending Internet Control Message Protocol (ICMP) echo packets with varying IP Time-To-Live (TTL) values to the destination. Each router along the path is required to decrement the TTL on a packet by at least 1 before forwarding it, so the TTL is effectively a hop count. When the TTL on a packet reaches 0, the router should send an ICMP Time Exceeded message back to the source computer.
• TRACERT determines the route by sending the first echo packet with a TTL of 1 and incrementing the TTL by 1 on each subsequent transmission until the target responds or the maximum TTL is reached. The route is determined by examining the ICMP Time Exceeded messages sent back by intermediate routers. Note that some routers silently drop packets with expired TTLs and are invisible to TRACERT.
• TRACERT prints out an ordered list of the routers in the path that returned the ICMP Time Exceeded message. If the -d switch is used (telling TRACERT not to perform a DNS lookup on each IP address), the IP address of the near-side interface of the routers is reported.
53
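The TTL mechanism can be illustrated without touching a real network. The sketch below simulates the probing loop over a made-up path: each router decrements the TTL, the router where it reaches 0 answers with "Time Exceeded", and the target answers with "Echo Reply". The router names are hypothetical and no packets are sent.

```python
def send_probe(path, ttl):
    """Simulate one echo packet; return (responder, message)."""
    for router in path[:-1]:          # every hop except the final target
        ttl -= 1                      # each router decrements the TTL
        if ttl == 0:
            return router, "Time Exceeded"
    return path[-1], "Echo Reply"     # packet survived all routers


def trace_route(path, max_ttl=30):
    """Return the ordered list of responders, as TRACERT would print it."""
    hops = []
    for ttl in range(1, max_ttl + 1): # TTL = 1, 2, 3, ...
        responder, message = send_probe(path, ttl)
        hops.append(responder)
        if message == "Echo Reply":   # target reached, stop probing
            break
    return hops


route = ["router-a", "router-b", "router-c", "target-host"]
print(trace_route(route))
```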
User Datagram Protocol (UDP)
A transport layer protocol used in place of TCP.
Where TCP supports a connection-oriented application, UDP is used with connectionless applications.
UDP also encapsulates a header onto an application packet but the header is much simpler than TCP.
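To show how simple the UDP header is, the sketch below packs it with Python's struct module: four 16-bit fields (source port, destination port, length, checksum), 8 bytes in total, compared with a minimum of 20 bytes for TCP. The port numbers and payload are arbitrary examples, and the checksum is left at 0 rather than computed.

```python
import struct

# Four unsigned 16-bit fields in network byte order.
UDP_HEADER = struct.Struct("!HHHH")


def build_udp_header(src_port: int, dst_port: int, payload: bytes,
                     checksum: int = 0) -> bytes:
    """Return the 8-byte UDP header for the given payload."""
    length = UDP_HEADER.size + len(payload)   # header plus data, in bytes
    return UDP_HEADER.pack(src_port, dst_port, length, checksum)


header = build_udp_header(5000, 53, b"hello")
print(len(header), UDP_HEADER.unpack(header))
```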
54
Address Resolution Protocol (ARP)
When an IP packet has traversed the Internet and encounters the destination LAN, how does the packet find the destination workstation?
Even though the destination workstation may have an IP address, a LAN does not use IP addresses to deliver frames. A LAN uses the MAC layer address.
ARP translates an IP address into a MAC layer address so a frame can be delivered to the proper workstation.
55
Tunneling Protocols
The Internet is not normally a secure system.
If a person wants to use the Internet to access a corporate computer system, how can a secure connection be created?
One possible technique is by creating a virtual private network (VPN).
A VPN creates a secure connection through the Internet by using a tunneling protocol.
56
Every workstation attached to the Internet needs:
• Its IP address
• Its subnet mask (more on this later)
• The IP address of a router
• The IP address of a name server
57
BOOTP (you don’t have an IP address?)
Thin client workstations do not have a disk drive, and their ROM does not contain the previous four pieces of information.
How do we tell the machine this information? BOOTP (Bootstrap protocol).
There are two types of BOOTP operations:
• REQUEST – A workstation asks a server for the information (source IP address = all 0s, destination IP address = all 1s).
• REPLY – The server returns the information to the workstation.
58
59
Dynamic Host Configuration Protocol (DHCP)
BOOTP is not dynamic (when a client requests its IP address, it is retrieved from a static table).
DHCP is a dynamic extension of BOOTP.
When a DHCP client issues an IP request, the DHCP server looks in its static table. If no entry exists, the server selects an IP address from an available pool.
60
Dynamic Host Configuration Protocol (DHCP)
The address assigned by the DHCP server is temporary.
Part of the agreement includes a specific period of time.
If no time period is specified, the default is one hour.
DHCP clients may negotiate for a renewal before the time period expires.
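The lease behavior on the last two slides can be sketched as a toy model: check the static table first, otherwise hand out an address from the pool, attach an expiry time (one hour by default, as stated above), and let a client renew before expiry. This is not the DHCP wire protocol. The class name, MAC strings, and addresses are made up for illustration.

```python
class DhcpServer:
    """Toy DHCP server: static table first, then a dynamic pool with leases."""

    def __init__(self, static_table, pool, default_lease=3600):
        self.static_table = dict(static_table)  # MAC -> fixed IP
        self.pool = list(pool)                  # free dynamic addresses
        self.default_lease = default_lease      # seconds (one hour)
        self.leases = {}                        # MAC -> (IP, expiry time)

    def _reclaim(self, now):
        """Return expired dynamic addresses to the pool."""
        for mac, (ip, expiry) in list(self.leases.items()):
            if expiry <= now and mac not in self.static_table:
                self.pool.append(ip)
                del self.leases[mac]

    def request(self, mac, now, lease_seconds=None):
        """Return (ip, expiry) for the client, or None if no address is free."""
        duration = lease_seconds or self.default_lease
        if mac in self.static_table:
            ip = self.static_table[mac]
        elif mac in self.leases and self.leases[mac][1] > now:
            ip = self.leases[mac][0]            # renewal keeps the same address
        else:
            self._reclaim(now)
            if not self.pool:
                return None
            ip = self.pool.pop(0)
        self.leases[mac] = (ip, now + duration)
        return ip, now + duration


server = DhcpServer({"aa:aa": "192.168.1.10"}, ["192.168.1.100"])
print(server.request("aa:aa", now=0), server.request("bb:bb", now=0))
```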
61
Network Address Translation (NAT)
NAT protocol lets a router represent an entire local area network to the Internet as a single IP address.
Thus all traffic leaving this LAN appears to originate from a single global IP address.
All traffic coming into this LAN uses this global IP address.
This security feature allows a LAN to hide all the workstation IP addresses from the Internet.
62
NAT
Since the outside world cannot see into the LAN, you do not need to use registered IP addresses on the inside LAN.
We can use the following blocks of addresses for private use:
• 10.0.0.0 – 10.255.255.255
• 172.16.0.0 – 172.31.255.255
• 192.168.0.0 – 192.168.255.255
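The three private blocks above correspond to the prefixes 10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16. A short sketch with Python's standard ipaddress module checks whether an address falls in one of them; the helper name is illustrative.

```python
import ipaddress

# The three private-use blocks listed above, written as prefixes.
PRIVATE_BLOCKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_private(address: str) -> bool:
    """Return True if the address lies in one of the private blocks."""
    ip = ipaddress.ip_address(address)
    return any(ip in block for block in PRIVATE_BLOCKS)


for candidate in ("10.1.2.3", "172.31.255.255", "128.118.141.56"):
    print(candidate, is_private(candidate))
```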
63
NAT
When a user on the inside sends a packet to the outside, the NAT interface changes the user’s inside address to the global IP address. This change is stored in a cache.
When the response comes back, the NAT looks in the cache and switches the addresses back.
No cache entry? The packet is dropped, unless NAT has a service table of fixed IP address mappings. This service table allows packets to originate from the outside.
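A toy model of the cache and service table described above. One assumption goes beyond the slide: the sketch also rewrites port numbers so that several inside hosts can share the one global address, which is how many NAT devices tell replies apart. All addresses, ports, and names are illustrative.

```python
class Nat:
    """Toy NAT: outbound packets create cache entries, inbound packets use them."""

    def __init__(self, global_ip, service_table=None):
        self.global_ip = global_ip
        self.cache = {}                                  # global port -> (inside IP, inside port)
        self.service_table = dict(service_table or {})   # fixed mappings for inbound traffic
        self.next_port = 40000                           # arbitrary starting port (assumption)

    def outbound(self, src_ip, src_port, dst_ip, dst_port):
        """Rewrite the inside source address to the global address."""
        inside = (src_ip, src_port)
        for port, entry in self.cache.items():
            if entry == inside:                          # reuse an existing mapping
                return (self.global_ip, port, dst_ip, dst_port)
        port = self.next_port
        self.next_port += 1
        self.cache[port] = inside                        # remember the change
        return (self.global_ip, port, dst_ip, dst_port)

    def inbound(self, src_ip, src_port, dst_port):
        """Switch the address back, or return None if the packet is dropped."""
        inside = self.cache.get(dst_port) or self.service_table.get(dst_port)
        if inside is None:
            return None                                  # no cache entry, no fixed mapping
        return (src_ip, src_port, inside[0], inside[1])


nat = Nat("203.0.113.5", {80: ("192.168.0.10", 8080)})
print(nat.outbound("192.168.0.2", 5555, "128.118.141.56", 80))
print(nat.inbound("128.118.141.56", 80, 40000))
```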
64
Locating a Document on the Internet
Every document on the Internet has a uniform resource locator (URL) (not necessarily unique) and an IP address (not necessarily unique).
All URLs consist of four parts:
1. Service type
2. Host or domain name
3. Directory or subdirectory information
4. Filename
65
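The four parts can be pulled out of a URL with Python's standard urllib.parse. The example URL below is hypothetical (the path and file name are invented); the dictionary keys simply mirror the four parts listed above.

```python
import posixpath
from urllib.parse import urlsplit


def url_parts(url: str) -> dict:
    """Split a URL into service type, host, directory, and filename."""
    parts = urlsplit(url)
    directory, filename = posixpath.split(parts.path)
    return {
        "service": parts.scheme,     # 1. service type
        "host": parts.hostname,      # 2. host or domain name
        "directory": directory,      # 3. directory or subdirectory information
        "filename": filename,        # 4. filename
    }


print(url_parts("http://www.psu.edu/stuff/index.html"))
```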
The Parts of a Uniform Resource Locator (URL)
http://psu.edu/stuff
• http - service type
• edu - top level domain: type of organization, often followed by a country code, e.g. .uk
• psu - mid level domain: name of organization
• stuff, www.psu.edu - domains generated by organization
• Top and mid levels are determined by assignment boards
66
The Parts of a Uniform Resource Locator (URL)
New domains: .biz
.zzz
.xxx
.dog
Who controls this?
http://www.icann.org/
67
68
Locating a Document on the Internet
When a user, running a web browser, enters a URL, how is the URL translated into an IP address?
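The Domain Name System (DNS) is a large, distributed database of domain names (the host part of a URL) and IP addresses.
The tracert command performs this lookup for you: it resolves the host name before tracing the route.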
The first operation performed by DNS is to query a local database for URL/IP address information.
If the local server does not recognize the address, the server at the next level will be queried.
69
Locating a Document on the Internet
Eventually the root server for URL/IP addresses will be queried.
If the root server has the answer, the results are returned.
If the root server recognizes the domain name but not the extension in front of the domain name, the root server will query the server at the domain name’s location.
When the domain’s server returns the results, they are passed back through the chain of servers (and their caches).
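The lookup chain on these two slides can be sketched as a simplified model: each server answers from its own records or cache, otherwise asks the next level up; the root hands the query to the domain's own server; and the answer is cached on the way back. Real DNS resolution has more moving parts, so treat this only as an illustration. The server names are invented, and the psu.edu address is the one shown by ping elsewhere in these slides.

```python
class NameServer:
    """Toy name server with local records, a cache, and optional delegations."""

    def __init__(self, name, records=None, parent=None, delegations=None):
        self.name = name
        self.records = dict(records or {})          # names this server knows itself
        self.parent = parent                        # next level up, if any
        self.delegations = dict(delegations or {})  # domain -> that domain's server
        self.cache = {}

    def resolve(self, hostname):
        if hostname in self.records:
            return self.records[hostname]
        if hostname in self.cache:
            return self.cache[hostname]
        answer = None
        for domain, server in self.delegations.items():
            if hostname == domain or hostname.endswith("." + domain):
                answer = server.resolve(hostname)   # ask the domain's own server
                break
        else:
            if self.parent is not None:
                answer = self.parent.resolve(hostname)
        if answer is not None:
            self.cache[hostname] = answer           # cache on the way back
        return answer


psu = NameServer("psu.edu server", records={"psu.edu": "128.118.141.56"})
root = NameServer("root", delegations={"psu.edu": psu})
isp = NameServer("isp", parent=root)
local = NameServer("local", parent=isp)
print(local.resolve("psu.edu"), sorted(local.cache))
```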
70
IP Addresses
Every device connected to the Internet has a 32-bit IP (IPv4) address associated with it. 2^32 = total addresses?
Think of the IP address as a logical address (possibly temporary), while the 48-bit address on every NIC is the physical, or permanent address.
Computers, networks and routers use the 32-bit binary address, but a more readable form is the dotted decimal notation.
71
IP Addresses
For example, the 32-bit binary address 10000000 10011100 00001110 00000111 (4 octets) translates to 128.156.14.7 (called dotted decimal notation).
Each octet ranges from 0 to 255 (2^8 = 256 values).
There are basically four types of IP addresses: Classes A, B, C and D.
A particular class address has a unique network address size and a unique host address size.
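The binary-to-dotted-decimal conversion above is easy to check in a few lines. The function name is illustrative; the sketch also evaluates 2^32 to answer the question about the total number of IPv4 addresses.

```python
def binary_to_dotted(bits: str) -> str:
    """Convert four space-separated binary octets to dotted decimal notation."""
    octets = bits.split()
    if len(octets) != 4 or any(len(o) != 8 for o in octets):
        raise ValueError("expected four 8-bit octets")
    return ".".join(str(int(octet, 2)) for octet in octets)


TOTAL_IPV4_ADDRESSES = 2 ** 32

print(binary_to_dotted("10000000 10011100 00001110 00000111"))
print(TOTAL_IPV4_ADDRESSES)
```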
72
Four Basic Forms of an IP 32-bit Address
What is psu’s IP address?
Ping: psu.edu 128.118.141.56
Ping ist.psu.edu?
73
IP Addresses
When you examine the first decimal value in the dotted decimal notation:
• All Class A addresses are in the range 0 - 127
• All Class B addresses are in the range 128 - 191
• All Class C addresses are in the range 192 - 223
74
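The class ranges above translate directly into a lookup on the first octet. The Class D (224 - 239) and Class E (240 - 255) ranges in the sketch are not on the slide; they are added from the standard classful scheme so that every address gets an answer.

```python
def address_class(address: str) -> str:
    """Return the classful address class based on the first octet."""
    first = int(address.split(".")[0])
    if not 0 <= first <= 255:
        raise ValueError("first octet out of range")
    if first <= 127:
        return "A"
    if first <= 191:
        return "B"
    if first <= 223:
        return "C"
    if first <= 239:
        return "D"   # multicast; range not listed on the slide
    return "E"       # reserved; range not listed on the slide


print(address_class("128.118.141.56"))
```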
IP Subnet Masking
Sometimes you have a large number of IP addresses to manage.
By using subnet masking, you can break the host ID portion of the address into a subnet ID and host ID.
Each subnet supports a number of other hosts.
For example, the subnet mask 255.255.255.0 applied to a class B address will break the host ID (normally 16 bits) into an 8-bit subnet ID and an 8-bit host ID.
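A worked version of that example, reusing the Class B address 128.156.14.7 from the earlier slide with the mask 255.255.255.0. The sketch assumes a contiguous mask and a 16-bit Class B network portion; the function name and return format are illustrative.

```python
import ipaddress


def split_address(address: str, mask: str, network_bits: int = 16) -> dict:
    """Split an address into network, subnet ID, and host ID using a subnet mask."""
    addr = int(ipaddress.IPv4Address(address))
    mask_int = int(ipaddress.IPv4Address(mask))
    host_bits = 32 - bin(mask_int).count("1")            # zero bits in the mask
    class_mask = (0xFFFFFFFF << (32 - network_bits)) & 0xFFFFFFFF
    original_host_part = addr & ~class_mask & 0xFFFFFFFF # the class's host ID field
    return {
        "network": str(ipaddress.IPv4Address(addr & class_mask)),
        "subnet_id": (original_host_part & mask_int) >> host_bits,
        "host_id": addr & ~mask_int & 0xFFFFFFFF,
    }


print(split_address("128.156.14.7", "255.255.255.0"))
```

With this mask, the 16-bit host field becomes an 8-bit subnet ID (14) and an 8-bit host ID (7).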
75
76
The Future of the Internet
Various Internet committees are constantly working on new and improved protocols.
Examples include: • Internet Printing Protocol • Internet fax • Extensions to FTP • Common Name Resolution Protocol • WWW Distributed Authoring and Versioning • Web Services 77
IPv6
http://www.ipv6.org/
The next version of the Internet Protocol.
Main features include:
• Simpler header
• 128-bit IP addresses: 2^128 = (2^10)^12 × 2^8 ≈ (10^3)^12 × 2^8 ≈ 2 × 10^38 (the exact value is about 3.4 × 10^38)
• Priority levels and quality of service parameters
• No fragmentation (datagram is big!)
78
Fields in the IPv6 Header
79
Internet2
http://www.internet2.edu/
A new form of the Internet is being developed by a number of businesses and universities.
Internet2 will support very high speed data streams (gigabits per second).
Applications might include:
• Digital library services
• Tele-immersion
• Virtual laboratories
80
The Internet In Action: A Company Creates a VPN
A fictitious company wants to allow 3500 of its workers to work from home.
If all 3500 users used a dial-in service, the telephone costs would be very high.
81
82
The Internet In Action: A Company Creates a VPN
Instead, the company will require each user to access the Internet via their local Internet service provider.
This local access will help keep telephone costs low.
Then, once on the Internet, the company will provide software to support virtual private networks.
The virtual private networks will create secure connections from the users’ homes into the corporate computer system.
83
84
Your old web pages!!!
Internet Archive www.archive.org
• Founded in 1996 by Brewster Kahle.
• Maintains many, many TBs of Internet data, including snapshots of
– World Wide Web
– Usenet
– Gopher
– FTP archives
• Goals:
– Accumulate and preserve digital information for the long term that would otherwise be lost.
– Provide access to researchers, journalists, historians and others.
85
Bow-tie Theory of the Web
200 million URLs (billion links) explored - Broder et al., WWW9 ’00
86
How Big is the Publicly Indexable Web?
• Feb ’99: estimate of 16 million total web servers reduces to about 2.8 million servers for the publicly indexable web
• Average number of pages per site was 289
• Estimated total number of pages on the web about 800 million
• Current estimate – 3 to 5 billion pages
From a random sample of IP addresses (address space 256^4 or about 4.3 billion)
87
Volume of Information on Web - Feb, ‘99
• Mean page size was 19k (median 4k)
• Total amount of data: about 15 terabytes of pages
• About 6 terabytes after removing comments, extra whitespace, and HTML tags
• About 63 images per server, mean image size 15k (median 6k)
• About 180 million images on the publicly indexable web, about 3 terabytes of image data
88
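The Feb ’99 estimates on the last two slides hang together arithmetically, as the sketch below shows. It assumes "19k" and "15k" mean 19,000 and 15,000 bytes and that a terabyte is 10^12 bytes; the variable names are illustrative.

```python
SERVERS = 2_800_000          # publicly indexable web servers
PAGES_PER_SERVER = 289       # average pages per site
MEAN_PAGE_BYTES = 19_000     # "19k", read as 19,000 bytes (assumption)
IMAGES_PER_SERVER = 63
MEAN_IMAGE_BYTES = 15_000    # "15k", read as 15,000 bytes (assumption)
TERABYTE = 10 ** 12

total_pages = SERVERS * PAGES_PER_SERVER           # about 800 million
page_terabytes = total_pages * MEAN_PAGE_BYTES / TERABYTE
total_images = SERVERS * IMAGES_PER_SERVER         # about 180 million
image_terabytes = total_images * MEAN_IMAGE_BYTES / TERABYTE

print(total_pages, round(page_terabytes, 1))
print(total_images, round(image_terabytes, 1))
```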
What’s on the web?
89
Distribution of the content of WWW Information
• Percentages from manual classification of the home pages of the first 2,500 randomly found web servers
• 83% of sites commercial (off scale for this chart)
• Percentage of sites in areas like science, health, and government relatively small
– Would be feasible and very valuable to create specialized search services that are very comprehensive and up to date
• 65% of sites have a majority of pages in English
90
Web Search Techniques
- 85% of users use search engines to locate information (GVU survey)
- Several search engines consistently rank in the top 10 sites accessed on the web
• Full-text indexes
• Hierarchical directories
• Specialized or niche search services
• What’s related (Alexa/Netscape)
• Collaborative filtering
• Notification systems
• Softbots
91
Search Engines
• Lots: over 3000? - on the web, 20 make up 98% of all searches done
• Business models are often not just search!
• AltaVista (summer, 1998):
– Indexes about 0.8 TB (index about 30% of the size of the grabbed data)
– Every word indexed
– About 37 million queries on weekdays
– Mean response time of 0.6 seconds
– About 20 64-bit machines
• 10 CPU, 625 MHz, 12 GB RAM, 300 GB RAID (each)
• Google (spring, 2000):
– 2500 PCs, buy 30 a day, discard them when they break
92
93
Search Engine Architecture
• Web crawler that crawls the web and harvests data – html, text, etc.
• Indexer that indexes some of the crawled pages
• Query engine that queries the index and presents results
• Query interface
94
[Figure: A Typical Web Search Engine. Components: Users, Interface, Query Engine, Index, Indexer, Crawler, Web]
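The indexer and query engine can be illustrated with a tiny inverted index. This is a toy sketch, not how any particular engine works: the "crawled" pages are a hard-coded dictionary with made-up names and text, and a query returns pages containing every query word.

```python
from collections import defaultdict


def build_index(pages: dict) -> dict:
    """Indexer: map each word to the set of pages that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index: dict, query: str) -> list:
    """Query engine: return pages that contain all words in the query."""
    terms = query.lower().split()
    if not terms:
        return []
    matches = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*matches))


# Stand-in for crawler output (hypothetical pages).
crawled = {
    "a.html": "Internet protocols and routing",
    "b.html": "Internet search engines",
    "c.html": "cooking recipes",
}
index = build_index(crawled)
print(search(index, "internet"), search(index, "internet search"))
```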
95
Ways to compare search engines
• Relevance ranking
• Coverage (comments once seen in the press)
– “If you can’t find it using XXX search, it’s probably not out there”
– “HotBot is the first search robot capable of indexing and searching the entire web”
• Recency (comment once seen in the press)
– “[With XXX] you can find new information just about as quickly as it's available on the Web”
• Functionality (e.g. query syntax)
• Speed
• Availability
• Usability
• Time/ability to satisfy user requests
96
Ranking Options
Special factors
• Conventional methods (e.g., tf.idf) were developed for homogeneous collections, e.g., items of similar length
• Some items are deliberately constructed to distort indexing
Options
• Vector space ranking with corrections for document length
• Extra weighting for specific fields, e.g., title, anchors, etc.
• Link structure, e.g., Google's PageRank, Kleinberg's Hubs and Authorities
97
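A minimal sketch of tf.idf scoring with one simple correction for document length: term frequency is divided by the number of words in the document. Other normalizations exist, and this choice is an assumption rather than anything prescribed above. The documents and query are invented.

```python
import math
from collections import Counter


def tf_idf_scores(query: str, documents: dict) -> dict:
    """Score each document for the query using length-normalized tf times idf."""
    tokenized = {name: text.lower().split() for name, text in documents.items()}
    n_docs = len(tokenized)
    scores = {}
    for name, words in tokenized.items():
        counts = Counter(words)
        score = 0.0
        for term in query.lower().split():
            doc_freq = sum(1 for w in tokenized.values() if term in w)
            if doc_freq == 0 or not words:
                continue                          # term unseen, or empty document
            tf = counts[term] / len(words)        # correction for document length
            idf = math.log(n_docs / doc_freq)     # rarer terms weigh more
            score += tf * idf
        scores[name] = score
    return scores


docs = {
    "a": "ip routing ip",
    "b": "web search engine",
    "c": "ip address lookup with dns",
}
scores = tf_idf_scores("ip", docs)
print(sorted(scores, key=scores.get, reverse=True))
```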
(Page, Brin)
• 2nd Generation Search Engine!
• Makes greater use of HTML structure and the graph formed by hyperlinks between pages
• PageRank
– Iteratively uses information about the number of pages pointing to a page in order to estimate the popularity of a page
– Links from more popular pages count more
• Uses the text in links to a page
– Link descriptions may describe a page better than the page itself
• Yahoo’s search engine
www.google.com
99
PageRank and Google
• Prestige of a page is proportional to sum of prestige of citing pages
• Standard bibliometric measure of influence
• Simulate a random walk on the Web to precompute prestige of all pages
• Sort keyword-matched responses by decreasing prestige
[Figure: four linked pages p1, p2, p3, p4, annotated "Follow random outlink from page"]
p4 = p1 + p2 + p3, i.e., p = Ep
100
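The random-walk idea can be sketched as a short power iteration. Several details are assumptions not stated on the slide: the damping factor of 0.85 (the walker occasionally jumps to a random page), the treatment of pages with no outlinks, and the example graph, in which p1, p2 and p3 link to p4 and p4 links back to p1.

```python
def pagerank(links: dict, damping: float = 0.85, iterations: int = 50) -> dict:
    """Estimate page prestige by simulating a random walk over the link graph."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1 - damping) / n for page in pages}  # random jump
        for page in pages:
            outlinks = links.get(page, [])
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:        # follow a random outlink
                    new_rank[target] += share
            else:
                share = damping * rank[page] / n
                for target in pages:           # dead end: jump anywhere
                    new_rank[target] += share
        rank = new_rank
    return rank


graph = {"p1": ["p4"], "p2": ["p4"], "p3": ["p4"], "p4": ["p1"]}
ranks = pagerank(graph)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Sorting keyword-matched results by these scores, highest first, gives the "decreasing prestige" ordering mentioned above.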
Google Architecture
• Perl with C/C++
• Linux
• Module-based architecture
• Multi-machine
• Multi-thread
[Figure: Google architecture. Components: URL Server, Crawler, Store Server, Anchors, URL Resolver, Indexer, Repository, Lexicon, Links, Doc Index, Barrels, Sorter, Pagerank, Searchers]
101
Metasearch Engines or Tools
[Figure: Information Need, Query, Search Engine #1, Search Engine #2, Search Engine #3, etc., Fusion Policy, Result Set]
• Single search engine coverage is low, maximum of 16%
– Querying multiple can significantly improve coverage
• Query is sent to several search engines simultaneously
– Policies?
• Results are fused by a fusion policy
– Similar, but slightly different from an ordering policy
• Fusion at many levels
102
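One possible fusion policy, shown only as an illustration: each engine's ranked list contributes 1/position to a result's score, and the fused list is sorted by total score. The slides do not say which policy any metasearch tool uses, so the scoring rule, function name, and sample results are all assumptions.

```python
from collections import defaultdict


def fuse(result_lists: list) -> list:
    """Merge ranked result lists; higher positions and repeated hits score more."""
    scores = defaultdict(float)
    for results in result_lists:
        for position, url in enumerate(results, start=1):
            scores[url] += 1.0 / position
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda url: (-scores[url], url))


engine_1 = ["a", "b", "c"]
engine_2 = ["b", "a"]
engine_3 = ["c", "b"]
print(fuse([engine_1, engine_2, engine_3]))
```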
http://www.beaucoup.com/
103
Search Engine Coverage - 11 engines Feb ‘99
• Combined coverage with respect to each other
• With respect to each other compared to total web size
• Combined coverage - 42%
104
Search Engines Sizes
searchenginewatch.com
105
searchenginewatch.com
106
Covered
• Protocols – XXXP
• URL and DNS
• IP addresses
• Search engines
107