Understanding the Internet - Technion – Israel Institute

Understanding the Internet
The Basics
1
Topics
• DNS, TCP/IP
• URLs
• Dynamic Pages
• Caching and Proxies
• Cookies
The Goal: By the end of this lesson, you should be able
to explain how the internet works, including details about
each of the basic technologies used. You should be able
to draw basic conclusions about how things can or
cannot work on the web.
2
How Does It all Work?
What happens
when a link is
pressed?
3
HTML (HyperText Markup Language)
• Web pages are written in HTML, which defines the
style in which the page should be displayed.
<a href="http://www.undergraduate.technion.ac.il/catalog/facs009.html">
course syllabus</a>
– URL (the destination): the value of href
– Text written on the link: course syllabus
4
Name → Address → Request → Response
5
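[Diagram: the browser takes the name from the URL and sends it to the
DNS server, which returns the IP address. The browser then sends an
HTTP request to the web server at that address, and the web server
returns the resource (an HTML page) in an HTTP response.]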
6
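[Diagram: the same flow with concrete values. DNS maps
www.undergraduate.technion.ac.il to 132.68.238.21. The browser sends an
HTTP request for
http://www.undergraduate.technion.ac.il/catalog/facs009.html
to the web server at 132.68.238.21, which reads catalog/facs009.html
from its file system and returns it in the HTTP response.]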
7
The Infrastructure (very brief)
IP Address, Domain Name Server,
TCP/IP
8
IP Address and Port
• The browser will ask for the page from the
appropriate web server using the HTTP protocol
– A web server is a program that can return resources
• However, in order to request anything from the web
server, the browser must know its IP address and
the port on which the web server is listening
– Intuitively, the domain name is like a person's name and
the IP address is like his address
– The port is like the “apartment number” (note: Web
servers generally use the standard port number 80)
9
DNS (Domain Name System)
• Computers are configured to know the IP address
of a DNS server (domain name server)
• A DNS server is a program that, when sent a domain
name, returns the IP address of the domain.
– the DNS server either looks up the domain name in a list
– or it asks a different DNS server for the IP address
• In order to use the DNS, the web browser creates
a packet of information that contains
– the address of the DNS,
– its own IP address
– the content of the request
Why?
10
How Does the Packet Get to Its
Destination?
• The computer sends it down the phone line (or
Ethernet connection or it transmits it by radio to a
base station which sends it down some wire, etc.).
• The Internet is a net of computers all connected
together by various cables
• A computer, when it gets a packet, sees what
computer number it is being sent to and passes it
on in the general direction toward its destination
• This way of passing on the packet is called the
Internet Protocol (IP)
11
Getting the Actual Resource
• After the IP address of the domain of the resource
is received (from the DNS), the resource is
requested from the web server in that domain
• The request is written using a protocol called HTTP
(Hyper Text Transfer Protocol)
• The web server sees the URL requested and
sends the resource back to the requester
• The resource is split up into packets (the size varies; about
1,500 bytes is typical on Ethernet) and sent back to the requester
– the protocol that makes sure all the packets arrive, and can be
put back in order, is called Transmission Control Protocol (TCP)
12
Displaying the Web Page
• The web browser puts the packets back
together, in order
• The page is displayed (by interpreting the
style commands if it is an HTML page)
13
Resources and URLs
14
Resources
• A resource is a chunk of information that can be
identified by a URL (Uniform Resource Locator)
• A resource can be
– A file, e.g., html, text, image
– A dynamically created page (more about this later on)
• What we see on the browser can be a combination
of some resources
– When an html page is displayed with images we are
actually seeing several resources at once
How do we get
them all?
How Many?
15
Basic Syntax
Basic Format of a URL
protocol://domain/path
http://iew3.technion.ac.il/~moshet/index.html
http://iew3.technion.ac.il/~sarac
ftp://ctan.unsw.edu.au/tex-archive/misc.zip
16
Anchors and Parameters
• URLs can also have
– an anchor: This is used in order to define a link that
takes the user to the middle of a page (instead of to the
top). In order for this to work, the anchor must also be
defined within the destination page
– parameters: These are extra values that are passed
along to the web server along with the path. (More about
this when we discuss dynamic pages)
Anchor Example
17
Anchors and Parameters:
Syntax
protocol://domain/path?parameters#anchor
http://www.cs.huji.ac.il/~dbi/index.html#info
http://www.google.com/search?hl=en&q=blabla
• A URL can also have both an anchor and
parameters
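The parts of a URL can be picked apart programmatically. The following is a minimal sketch using java.net.URI, which expects the parameters (query) before the anchor (fragment). The class name UrlParts and the third URL, which combines both, are made up for illustration.

```java
import java.net.URI;

// Sketch: split a URL into the parts named on the slides.
class UrlParts {
    // Returns {protocol, domain, path, parameters, anchor}; absent parts are null.
    static String[] split(String url) {
        URI u = URI.create(url);
        return new String[] {
            u.getScheme(), u.getHost(), u.getPath(), u.getQuery(), u.getFragment()
        };
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.cs.huji.ac.il/~dbi/index.html#info",
            "http://www.google.com/search?hl=en&q=blabla",
            // Hypothetical URL with both parameters and an anchor
            "http://www.google.com/search?hl=en&q=blabla#info"
        };
        for (String url : urls) {
            System.out.println(String.join(" | ", split(url)));
        }
    }
}
```

Missing parts print as "null", which makes it easy to see which URLs carry an anchor, parameters, or both.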
18
Syntax of Parameters
• Spaces are represented by “+”
• Characters such as &,+,% are encoded in
the form “%xx” where xx is the ascii value in
hexadecimal; For example, “%” = “%25”
• The inputs to the parameters are given as a
list of pairs of a parameter and a value:
var1=value1&var2=value2&var3=value3
19
apples & bananas
20
http://www.google.com/search?hl=en&ie=UTF8&q=apples+%26+bananas
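The encoding rules for parameter values can be tried out with the standard java.net.URLEncoder and URLDecoder classes. This is only a sketch: the class name ParamEncoding is made up, and the search URL is assembled by hand for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Sketch: encode and decode a parameter value.
class ParamEncoding {
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    static String decode(String value) {
        try {
            return URLDecoder.decode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    public static void main(String[] args) {
        // Spaces become "+", and "&" becomes "%26"
        String url = "http://www.google.com/search?hl=en&q=" + encode("apples & bananas");
        System.out.println(url);
        System.out.println(decode("apples+%26+bananas"));
    }
}
```

Only the value is encoded, not the whole URL. Otherwise the "&" and "=" that separate the pairs would be escaped too.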
21
Anchors and Parameters: Notes
• Questions to think about:
– When a URL with an anchor is requested is
something different sent than when the URL is
requested without an anchor?
– When a URL with parameters is requested is
something different sent than when the URL is
requested without parameters?
22
Relative Links
• A URL in a web page can be written with only a
path (no protocol or domain)
protocol://domain/path
• The browser then figures out the complete location
by considering the current location
– Replace the last file name in the current path (if there is
one) with the relative path.
23
Relative Links: Examples
• Page at http://www.abc.com/shop/index.html with
relative link to robes.html
– Where will this take us?
• Page at http://www.abc.com/shop/ with relative link
to robes.html
– Where will this take us?
• Page at http://www.abc.com/shop with a relative link
to robes.html
– Where will this take us?
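One way to check your answers to the three questions is to let java.net.URI resolve the relative link, as in the sketch below. The class name RelativeLinks is made up. URI.resolve follows the standard resolution rules, which are assumed here to match what browsers do for these simple cases.

```java
import java.net.URI;

// Sketch: resolve a relative link against the URL of the current page.
class RelativeLinks {
    static String resolve(String pageUrl, String relativeLink) {
        return URI.create(pageUrl).resolve(relativeLink).toString();
    }

    public static void main(String[] args) {
        String[] pages = {
            "http://www.abc.com/shop/index.html",
            "http://www.abc.com/shop/",
            "http://www.abc.com/shop"
        };
        for (String page : pages) {
            System.out.println(page + " + robes.html -> " + resolve(page, "robes.html"));
        }
    }
}
```

Try to predict the three results before running it, and pay attention to the effect of the trailing slash.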
24
Notes
• A URL uniquely identifies a resource
– Given a URL there is exactly one resource that
corresponds to the URL (who determines which
resource it is?)
• A resource may not be uniquely identified by
a single URL
– several URLs can correspond to the same
resource (example?)
25
Questions About Resources and
URLs?
26
Dynamic Pages
27
Two Types of Dynamic Pages
• Web-Server-Side: The Web Server
dynamically creates the page as a response
to the user’s request
• Client-Side: The browser dynamically
changes the resource that is returned
28
Web-Server-Side
• Up until now, we have assumed that the web
server returns a file that exists in its file system
• Clearly, not every page on the internet can be
implemented this way (example?)
• A web server can actually run a program (with the
parameters in the URL) and then return the result
of the program.
– such a page is called a "dynamic page"
29
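[Diagram: DNS maps www.google.com to 216.239.59.104. The browser sends
an HTTP request for
http://www.google.com/search?hl=en&ie=UTF8&q=Managing+Web+Data
to the web server at 216.239.59.104, which executes a program with the
query "Managing Web Data" and returns the result in the HTTP response.]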
30
Web-Server-Side Technologies
• Common tools for creating dynamic server-side pages
– CGI (Common Gateway Interface)
– Java Servlets, JSP – Java Server Pages
– Microsoft ASP – Active Server Pages
– PHP
31
Client-Side Dynamic Pages
• Certain parts of a Web application can be executed
locally, in the web browser
• For example, some validity checks can be applied
to the user’s input locally
– The user request is sent to the server only if the input is
valid
• JavaScript and VBScript are HTML-embedded
scripting languages for client-side programming (AJAX is a
technique built on JavaScript, not a language of its own)
• It is also possible to combine both server-side and
client-side dynamic technologies
Example
32
Server-Side versus Client-Side
• When must a dynamic page be server-side?
• When is it better to use a client-side dynamic
page?
• What advantages does each kind of dynamic
page have? What disadvantages?
33
Questions About Dynamic Pages?
34
HTTP
35
Common Protocols
• In order for two remote machines to “understand”
each other they should
– ‘‘speak the same language’’ and coordinate their
‘‘conversation’’
• The solution is to use protocols, e.g.,
– FTP: File Transfer Protocol
– SMTP: Simple-Mail Transfer Protocol
– NNTP: Network-News Transfer Protocol
– HTTP: HyperText Transfer Protocol
36
The HTTP "Conversation"
• A Web Browser knows how to send an HTTP
request for a resource
• A Web Server is a program that listens for HTTP
requests and knows how to send appropriate
HTTP responses
• These slides use two standard versions of HTTP: HTTP 1.0
and HTTP 1.1 (newer versions, HTTP/2 and HTTP/3, also exist)
37
A Basic HTTP Session
• A basic HTTP session has four phases:
1. Client opens the connection (a TCP connection)
2. Client makes a request
3. Server sends a response
4. Server closes the connection
• Who is the client?
• Who is the server?
38
Stateless Protocol
• HTTP is a stateless protocol
– Once a server has delivered the requested data
to a client, the server retains no memory of what
has just taken place (even if the connection is
persistent)
• Server-side programming tools must provide
a mechanism for maintaining states (e.g.,
cookies)
39
The Format of HTTP
Requests and Responses
• An initial line
– In a request, the first line holds the method, the URL and
the HTTP version
– In a response, the first line holds the HTTP version, a status
code and a short explanation
• Zero or more header lines
• A blank line, and
• An optional message body (e.g., a file, query
data, or query output)
40
Request (General Form)
method sp URL sp version cr lf
header : value cr lf
… (0 or more header lines)
header : value cr lf
cr lf
Optional Entity Body
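The general form of a request (method, space, URL, space, version, then header lines, each ended by cr lf, then an empty line) can be assembled as plain text. The sketch below does this for a request without a body. The class name RequestBuilder and its build method are made up for illustration.

```java
// Sketch: assemble the text of an HTTP request without a body.
class RequestBuilder {
    static final String CRLF = "\r\n";

    // headers holds alternating names and values: name1, value1, name2, value2, ...
    static String build(String method, String url, String version, String... headers) {
        StringBuilder sb = new StringBuilder();
        // Request line: method sp URL sp version cr lf
        sb.append(method).append(' ').append(url).append(' ').append(version).append(CRLF);
        // Zero or more header lines: name ":" value cr lf
        for (int i = 0; i + 1 < headers.length; i += 2) {
            sb.append(headers[i]).append(": ").append(headers[i + 1]).append(CRLF);
        }
        // The empty line marks the end of the headers
        sb.append(CRLF);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(build("GET", "http://iew3.technion.ac.il", "HTTP/1.0",
                "User-Agent", "Mozilla/4.0",
                "If-Modified-Since", "Fri, 31 Dec 1999 23:59:59 GMT"));
    }
}
```

The final empty line is easy to forget when typing requests by hand. Without it the server keeps waiting for more headers.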
41
Example HTTP Request
GET http://iew3.technion.ac.il HTTP/1.0
User-Agent: Mozilla/4.0
If-Modified-Since: Fri, 31 Dec 1999 23:59:59 GMT
• GET: the method
• http://iew3.technion.ac.il: the resource
• HTTP/1.0: the HTTP version
• User-Agent, If-Modified-Since: headers
– User-Agent: type of browser making the request
– If-Modified-Since: return the resource only if it was
modified after the given date (Why is this useful?)
47
Common Request Methods
• GET returns the content of the indicated URL
• POST treats the URL as an application and sends
some data to it
– Could be used to process a form
– GET and POST differ in their treatment of parameters
• HEAD returns the header information for the
indicated URL
– Useful for finding out info about a URL without actually
retrieving it
48
More Request Methods
• PUT replaces the content of the URL with
some data or generates a new document
with that URL if none exists
• DELETE deletes the indicated document
• Usually these methods are not allowed
49
HTTP Headers
• Examples of HTTP headers:
– Accept-Encoding
– Cookie
– If-Modified-Since
– User-Agent
– Content-Length
– Referer
Headers may be spoofed!
Which ones might you want to spoof?
50
General Format of HTTP Response
• The first line is the status of the result
• After the first line, there are 0 or more headers,
e.g.,
– Last-Modified
– Refresh
– Set-Cookie
• Then there is a blank line
• Then, there is an optional message body
51
Response (General Form)
version sp status code sp phrase cr lf (the status line)
header : value cr lf
… (0 or more header lines)
header : value cr lf
cr lf
Optional Entity Body
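Reading a response means taking this form apart again: the status line first, then header lines up to the empty line. The sketch below assumes the whole response is already available as one string with cr lf line endings. The class name ResponseParser and its methods are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: pick the status code and the headers out of an HTTP response.
class ResponseParser {
    // Status line: version sp status-code sp phrase
    static int statusCode(String response) {
        String statusLine = response.split("\r\n", 2)[0];
        return Integer.parseInt(statusLine.split(" ", 3)[1]);
    }

    // Header lines follow the status line and end at the first empty line.
    static Map<String, String> headers(String response) {
        Map<String, String> result = new LinkedHashMap<>();
        String[] lines = response.split("\r\n");
        for (int i = 1; i < lines.length && !lines[i].isEmpty(); i++) {
            int colon = lines[i].indexOf(':');
            if (colon > 0) {
                result.put(lines[i].substring(0, colon).trim(),
                           lines[i].substring(colon + 1).trim());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String response = "HTTP/1.0 200 OK\r\n"
                + "Date: Fri, 30 Jul 2004 07:20:37 GMT\r\n"
                + "Content-Length: 21175\r\n"
                + "Content-Type: text/html\r\n"
                + "\r\n"
                + "<html>...</html>"; // placeholder body
        System.out.println(statusCode(response));
        System.out.println(headers(response));
    }
}
```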
52
Example HTTP Response
HTTP/1.0 200 OK
Date: Fri, 30 Jul 2004 07:20:37 GMT
Last-Modified: Tue, 27 Jul 2004 18:37:57 GMT
Content-Length: 21175
Content-Type: text/html

<META NAME="description" CONTENT="The
William Davidson Faculty of Industrial
Engineering and Management, Technion">
…
• HTTP/1.0: the HTTP version
• 200 OK: the status code and its explanation
• Date, Last-Modified, Content-Length, Content-Type: headers
– Date: date and time at which the response was generated
– Last-Modified: date the file was last modified
– Content-Length: length of the content of the page, in bytes
(Why is this needed?)
– Content-Type: type of resource returned (Why is this needed?)
• Everything after the blank line: the actual resource returned
60
Common Status Codes
• 200 OK
– Return the page requested
• 301 Moved Permanently
– Also returns the new URL
– Usually, a web browser will automatically
request the new page
61
Common Status Codes
• 404 Not Found: This explains all the 404
pages that you get when surfing the web
62
Common Status Codes
• 502 Bad Gateway: A proxy or gateway could not get a valid
response from the server behind it (e.g., the proxy's DNS lookup
failed)
63
Experimenting Manually with HTTP (1)
GET xxxx.html HTTP/1.0
HTTP/1.1 400 Bad Request
Headers
HTML Contents
64
Experimenting Manually with HTTP (1)
GET www.google.com/xxx.html HTTP/1.0
HTTP/1.1 404 Not Found
Headers
HTML Contents
65
Experimenting Manually with HTTP (2)
GET http://iew3.technion.ac.il/~sarac HTTP/1.0
HTTP/1.1 301 Moved Permanently
Headers
HTML Contents
66
Why?
• What happens, when you write the url
http://iew3.technion.ac.il/~sarac in the
browser?
• Technically, why does this happen?
• Why is the redirection really needed?
Try it and Explain it!
67
Questions About HTTP?
68
Caching and Proxy Servers
69
Complicating the Picture
• The way that the process of getting a Web page
has been described up until now is that:
– for each resource, the web browser makes a request
– the request is sent directly to the web server that has the
resource
• Think about it: This would mean that every time a
user searches using Google, an HTTP request is
sent directly to Google for its logo. Isn't this a
waste?!!
70
Caching
• In order to save on traffic and improve on
speed, resources are cached (saved
temporarily) at two points:
– locally, on the client (web browsers allow for a
configuration of the size of the cache)
– on the way, using a proxy server
71
Proxy Servers
• Proxy = "Go-Between" (intermediary)
• Usually a browser does not directly contact the
web server whose resource it needs
• Instead, the browser contacts a program (called a
proxy server) whose job is to contact the web
server
• Since the proxy server is used by many users,
caching can be very helpful at this level
72
Proxy Caches
[Diagram: several clients send GET /fruit/apple.gif through a shared
proxy, which sits between the clients and the origin servers.]
73
Proxy Caches
• Proxy caches reduce latency for a given user agent if they
can serve the request from their cache.
• As a result, they also save bandwidth and reduce the load
on the origin server.
• Therefore, they also reduce latency for the requests that
must be sent to the target server.
[Diagram: a chain of proxies between the browser and the web server
www.w3.org:80: Department Proxy Server, University Proxy Server,
Israel Proxy Server, Web Server.]
74
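[Diagram: a request for
http://www.google.com/search?hl=en&ie=UTF8&q=Managing+Web+Data
goes to the proxy server (which has a cache). DNS maps www.google.com
to 216.239.59.104. The proxy forwards the HTTP request to the web
server at 216.239.59.104, which executes a program; the HTTP response
travels back through the proxy.]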
75
[Diagram: a request for http://www.google.com/images/logo_sm.gif
reaches the proxy server, which returns the HTTP response from its
cache; the web server at 216.239.59.104 is also shown.]
76
Risks in Caching
• The benefits of caching are clear
• What are the risks of caching?
• How can such risks be minimized?
Hint: Remember the Header
If-Modified-Since
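The hint can be made concrete. A cache that holds a copy can ask the server whether the resource has changed, and the server answers 304 Not Modified (no body) or 200 OK with the new content. The sketch below shows only the server's date comparison. The class name ConditionalGet is made up, and the dates are taken from the request and response examples on earlier slides.

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

// Sketch: the server-side decision behind If-Modified-Since.
class ConditionalGet {
    // HTTP dates such as "Fri, 31 Dec 1999 23:59:59 GMT"
    static final DateTimeFormatter HTTP_DATE = DateTimeFormatter.RFC_1123_DATE_TIME;

    // Returns 200 if the resource changed after the given date, otherwise 304.
    static int status(String lastModified, String ifModifiedSince) {
        ZonedDateTime modified = ZonedDateTime.parse(lastModified, HTTP_DATE);
        ZonedDateTime since = ZonedDateTime.parse(ifModifiedSince, HTTP_DATE);
        return modified.isAfter(since) ? 200 : 304;
    }

    public static void main(String[] args) {
        String lastModified = "Tue, 27 Jul 2004 18:37:57 GMT";
        // The cached copy is older than the last change: send the new content
        System.out.println(status(lastModified, "Fri, 31 Dec 1999 23:59:59 GMT"));
        // The cached copy is newer than the last change: it is still good
        System.out.println(status(lastModified, "Fri, 30 Jul 2004 07:20:37 GMT"));
    }
}
```

A 304 response costs a round trip but no transfer of the resource itself, which is how a cache can stay fresh without giving up most of its savings.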
77
Other Uses for a Proxy Server
• Restricting access of users
• Tracking access of users
• Virus protection
• Note: A proxy is a program, so it could conceivably
be written to do anything. Normally, it simply
requests and caches resources
78
Basic Format of a Proxy Program
• Listens to a port
• Infinite loop:
– Establish connection
– Read Request
– Process Request (e.g., check in cache…)
– Send Response
79
Example Proxy
• To demonstrate how a proxy could work, try
the following 2 programs:
– SillyProxy
– EchoProxy
80
SillyProxy.java
import java.io.*;
import java.net.*;

public class SillyProxy {
    public static void main(String[] args) {
        handleRequests(Integer.parseInt(args[0]));
    }

    public static void handleRequests(int port) {
        // Establish connection: listen on the given port
        try (ServerSocket listen = new ServerSocket(port)) {
            while (true) {
                Socket client = listen.accept();
                BufferedReader in = new BufferedReader(
                    new InputStreamReader(client.getInputStream()));

                // Read request: skip the request line and the
                // headers, up to the blank line that ends them
                String line = in.readLine();
                while (line != null && line.trim().length() > 0) {
                    line = in.readLine();
                }

                // Send response: the same page for every request
                // (HTTP lines end with \r\n)
                String answer =
                    "HTTP/1.1 200 OK\r\n" +
                    "Content-Type: text/html\r\n\r\n" +
                    "<html><body>" +
                    "<font color=green size=+4>" +
                    "<i>Happy Purim</i></font></body></html>";
                PrintWriter out =
                    new PrintWriter(client.getOutputStream(), true);
                out.println(answer);
                out.flush();
                client.close();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
83
EchoProxy
• Can be defined similarly to SillyProxy, but:
– sends as a response the HTTP request, wrapped in
<HTML><BODY><PRE>
• Try them out:
– run the programs from the command line with some
number as the port value
– set your browser to run with the program as its proxy
– try to open up a page
– What will we see?
84
Questions About Proxies?
85
Cookies
86
Cookies
• Cookies are a general mechanism that server-side
applications can use to both store and retrieve
long-term information on the client side
• Servers send cookies in the HTTP response and
browsers are expected to save and to send the
same cookies back to the Server, whenever they
make additional requests from the Server
• The content of the cookies is stored as a text
document in the file system
87
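Cookie Transportation
[Diagram: the browser requests a page from the Amazon web server,
whose response includes Set-cookie: pref=eng; id=123 ...
It also requests a page from the IE Technion web server, whose
response includes Set-cookie: id=153 ...
The browser stores the cookies per domain:
amazon.com: pref=eng; id=123 ...
technion.ac.il: tz=153 ...]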
88
Cookie Transportation
[Diagram: on the next page request to the Amazon web server, the
browser sends back the stored cookies pref=eng; id=123 ... and the
server sends its response. The cookies stored for technion.ac.il
(tz=153 ...) are not sent to Amazon.]
An Example
89
Cookie Format
• A cookie in a response header:
Set-Cookie: NAME=VALUE; expires=DATE; path=PATH;
domain=DOMAIN_NAME; secure
– Only the NAME=VALUE field is required
• A cookie in a request header:
Cookie: NAME1=VALUE1; NAME2=VALUE2;
NAME3=VALUE3...
– This header contains all matching stored
cookies
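On the server side, the Cookie request header has to be split back into its NAME=VALUE pairs. The sketch below does this by hand for the simple format shown above. The class name CookieHeader is made up, and real servers normally rely on a library for this.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: read and write the value of a Cookie request header.
class CookieHeader {
    // "NAME1=VALUE1; NAME2=VALUE2" -> {NAME1=VALUE1, NAME2=VALUE2}
    static Map<String, String> parse(String headerValue) {
        Map<String, String> cookies = new LinkedHashMap<>();
        for (String pair : headerValue.split(";")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                cookies.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
            }
        }
        return cookies;
    }

    // The reverse direction: what the browser sends back
    static String format(Map<String, String> cookies) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> c : cookies.entrySet()) {
            if (sb.length() > 0) {
                sb.append("; ");
            }
            sb.append(c.getKey()).append('=').append(c.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> cookies = parse("pref=eng; id=123");
        System.out.println(cookies.get("pref"));
        System.out.println("Cookie: " + format(cookies));
    }
}
```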
90
Cookie Properties
• NAME=VALUE: the content of the cookie
– should not contain semicolons, commas or whitespace
• expires=DATE: expiration date
– default is the session life time (until browser is closed)
• path=PATH: the paths for which the cookie is valid
– matches every path that begins with PATH
• domain=DOMAIN_NAME: the cookie’s domain
– matches every domain that ends with DOMAIN_NAME
• secure: send only through secure channels (i.e.,
https)
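The matching rules above decide whether the browser attaches a stored cookie to a request. The sketch below applies them literally (domain suffix, path prefix, secure channel). The class name CookieMatch is made up, and real browsers apply stricter checks than a plain suffix comparison, so treat this as an illustration of the idea only.

```java
// Sketch: should a stored cookie be sent with a given request?
class CookieMatch {
    static boolean matches(String cookieDomain, String cookiePath, boolean secure,
                           String protocol, String host, String path) {
        // secure cookies travel only over https
        if (secure && !protocol.equals("https")) {
            return false;
        }
        // domain: the host must end with DOMAIN_NAME
        // path: the requested path must begin with PATH
        return host.endsWith(cookieDomain) && path.startsWith(cookiePath);
    }

    public static void main(String[] args) {
        // Hypothetical cookie set for technion.ac.il with path /
        System.out.println(matches("technion.ac.il", "/", false,
                "http", "iew3.technion.ac.il", "/~sarac/"));
        // The same cookie is not sent to another domain
        System.out.println(matches("technion.ac.il", "/", false,
                "http", "www.google.com", "/search"));
    }
}
```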
91
Notes about Cookies
• A response may contain multiple cookies
• A Cookie overrides previous cookies with the same
path and name (in the same domain)
• If no path and domain are given, then they are
assumed to be those of the requested URL
– What happens if the domain does not match the current
domain?
92
Notes about Cookies
• The Cookie header of a request contains all
mappings that match the requested URL
• A server can delete a cookie by sending a new one
with the same path and name, but with expiry date
in the past
93
Using Cookies for Session
Management
94
HTTP is Stateless
• HTTP is a stateless protocol
– Individual requests are treated independently
– Without external support, one cannot tell
whether an HTTP request is part of a continuing
interaction between the client and the server
• BUT some Web applications have states!
– Online stores that maintain a shopping cart
– Portals that remember your name and
preferences
95
HTTP Sessions
• The solution: Client and Server transfer some
unique data in the course of a session
• A session captures the notion of a continuous
interaction between a server and a client
• End users should be oblivious to session
management
• Session management should be efficient
– Is it reasonable to send the whole shopping cart on
every request to Amazon.com?
96
Session Supporting Servers
• A server that supports sessions holds
session-specific data in an internal data structure or
database
– For example, the content of the shopping cart
• On the first request, the server initializes the
session data and sends to the client a unique
identifier for this data
• During the session, the client attaches this
identifier to every request to the server
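The three steps above can be sketched as a small in-memory table on the server. The shopping cart lives on the server, and only the identifier travels with each request. The class name SessionStore, its methods and the use of random UUIDs as identifiers are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Sketch: a server-side table from session identifier to session data.
class SessionStore {
    private final Map<String, List<String>> carts = new HashMap<>();

    // First request: initialize the session data, return its identifier
    String startSession() {
        String id = UUID.randomUUID().toString();
        carts.put(id, new ArrayList<>());
        return id;
    }

    // Later requests carry the identifier; unknown identifiers are rejected
    boolean addToCart(String sessionId, String item) {
        List<String> cart = carts.get(sessionId);
        if (cart == null) {
            return false;
        }
        cart.add(item);
        return true;
    }

    // Returns a copy of the cart, or null if the session does not exist
    List<String> cart(String sessionId) {
        List<String> cart = carts.get(sessionId);
        return cart == null ? null : new ArrayList<>(cart);
    }

    // The server ends the session, e.g., on logout or after a timeout
    void invalidate(String sessionId) {
        carts.remove(sessionId);
    }

    public static void main(String[] args) {
        SessionStore store = new SessionStore();
        String id = store.startSession();
        store.addToCart(id, "b12");
        store.addToCart(id, "b90");
        System.out.println(store.cart(id));
    }
}
```

Because the identifier is the only proof of who the client is, it should be hard to guess. That is why the sketch uses a random value instead of a counter.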
97
Session Management Methods
• How is the session key shared between the
client and the server?
• We will discuss sharing a key using
cookies.
• Another way to send the key (without using
cookies) is by URL rewriting. Not discussed
today.
98
Session Cookies
• In the response to the first request of a session, the server
puts a cookie, which contains a session identifier
• When the client sends subsequent requests, it also sends
the cookie
• The client sends the cookie as long as the requests are
within its session bound (e.g., the same browser process)
• The server treats the cookie as a valid identifier as long as
the requests are within its session bound (e.g., a short time
period passed since the last request)
99
Session Cookies
• Session cookies are simply a special kind of
cookies
• The time boundary of session cookies is based on
the session and not on an explicit date
– This is the default expiration time
• Actual session data is kept on the server (while the
session cookie holds only an identifier of the
session)
100
Example
[Diagram: the browser requests the page buybook.html from the Amazon
web server, which answers with Set-cookie: cartid=3 ...
The server keeps a table that maps each sessionId to a list:
id=1, details=b56
id=2, details=b12, b90
id=3]
102
Session Duration
A session ends in either one of the following cases:
• The server invalidates the session
– Requested explicitly, e.g., a user logs out, or
– The session was inactive for a long time
• The client stops cooperating
– Session cookies have expired, e.g., the browser runs in
a new process
103
Questions
• How come when you use Moodle, you
sometimes have to repeatedly enter your
password, and sometimes it remembers
you?
• How come Amazon always remembers you?
104
Are cookies safe?
• Can they contain a virus?
• Can they fill up your disk?
• Can they contain secret information?
• Problems:
– Cookie theft, cookie poisoning, third-party
cookies
105
Cookie Theft
Browser
Server
Attacker
• Attacker listens on the line with a packet
sniffer.
• Solution?
106
Cookie Poisoning
Server
Attacker
• Attacker sends the wrong value back to the
server (instead of the actual cookie value)
• Solution?
107
Third Party Cookies
• Can site A know that you have visited both
site B and site C? (Would you mind if it did?)
• Often happens if advertising banners are
stored at a third-party site
– Diagram on the blackboard
108
Questions About Cookies?
109
Summary
• What have we seen?
– DNS, TCP/IP
– URLs
– Dynamic Pages
– Caching and Proxies
– Cookies
110