Chapter 5
Bytes and Octets, ASCII and Unicode




Early on, a byte could be anywhere from 5 to 9 bits, so “octet”
came into use to say exactly what we mean: 8 bits.
Today bytes are universally 8 bits, so we have two names
for the same thing.
Unicode is a superset of ASCII: ASCII is a 7-bit code (128 characters),
while Unicode defines far more code points than fit in 16 bits.
The authors recommend always using Unicode for strings (but
don't follow their own advice).
elvish = u'Namárië!'
Unicode 2 Network

Unicode characters that need to be transmitted across a
network are sent as octets.

We need a Unicode2Network conversion scheme.

Enter 'utf-8'
>>> elvish = u'Namárië!'
>>> elvish.encode('utf-8')
'Nam\xc3\xa1ri\xc3\xab!'


For example, the utf-8 encoding of the character ë is the two
octets C3 AB.
In the printed form of the encoded string, printable octets appear
as themselves and all others appear as \xnn, where nn
is a hexadecimal value.
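The same conversion can be sketched in Python 3, where str is always Unicode text and encode() returns a bytes object. This is an illustration only; the slides themselves use Python 2, and the variable names below are chosen for the example.

```python
# Python 3 sketch (assumption: the slides use Python 2, where u'...' is needed).
elvish = 'Nam\u00e1ri\u00eb!'          # the text 'Namárië!' written with escapes
encoded = elvish.encode('utf-8')       # text -> octets for the network

print(encoded)                         # non-ASCII octets are shown as \xNN
print(len(elvish), len(encoded))       # characters versus octets
print('\u00eb'.encode('utf-8').hex())  # the two octets that encode ë
```

The character count and the octet count differ because á and ë each take two octets in utf-8.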
Other Encodings

There are many choices for encoding schemes.
>>> elvish.encode('utf-16')
'\xff\xfeN\x00a\x00m\x00\xe1\x00r\x00i\x00\xeb\x00!\x00'
>>> elvish.encode('cp1252')
'Nam\xe1ri\xeb!'
>>> elvish.encode('idna')
'xn--namri!-rta6f'
>>> elvish.encode('cp500')
'\xd5\x81\x94E\x99\x89SO'

utf-16: '\xff\xfe' is the byte order mark (here little-endian). Each
character that follows takes 2 octets, typically <p>\x00 where <p>
means “printable”.
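A small Python 3 sketch of the byte order mark, assuming the codec names 'utf-16-le' and 'utf-16-be' (which fix the byte order and emit no mark) and the constants in the standard codecs module. It is meant only to show how the two-octet prefix tells the decoder which order to use.

```python
import codecs

elvish = 'Nam\u00e1ri\u00eb!'

le = elvish.encode('utf-16-le')   # little-endian, no byte order mark
be = elvish.encode('utf-16-be')   # big-endian, no byte order mark

print(le[:2], be[:2])             # 'N' as N\x00 versus \x00N

# The generic 'utf-16' decoder reads the mark and picks the byte order.
from_le = (codecs.BOM_UTF16_LE + le).decode('utf-16')
from_be = (codecs.BOM_UTF16_BE + be).decode('utf-16')
print(from_le == elvish, from_be == elvish)
```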
Decodings:

Upon receipt, byte streams need to be decoded. Once you know which
encoding the sender used, decoding is easy.
>>> print 'Nam\xe1ri\xeb!'.decode('cp1252')
Namárië!
Decodings:

Note that if you are not “printing”, the interpreter shows the repr of
the unicode object that decode() returns: a universal representation
of the original string.
>>> 'Nam\xe1ri\xeb!'.decode('cp1252')
u'Nam\xe1ri\xeb!'
>>> print 'Nam\xe1ri\xeb!'.decode('cp1252')
Namárië!
>>> '\xd5\x81\x94E\x99\x89SO'.decode('cp500')
u'Nam\xe1ri\xeb!'
>>> 'xn--namri!-rta6f'.decode('idna')
u'nam\xe1ri\xeb!'
>>> '\xff\xfeN\x00a\x00m\x00\xe1\x00r\x00i\x00\xeb\x00!\x00'.decode('utf-16')
u'Nam\xe1ri\xeb!'
>>> 'Nam\xc3\xa1ri\xc3\xab!'.decod('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decod'
>>> 'Nam\xc3\xa1ri\xc3\xab!'.decode('utf-8')
u'Nam\xe1ri\xeb!'
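Why the receiver must know the encoding can be sketched in Python 3 as follows. Decoding utf-8 octets with the wrong codec raises no error here; it silently produces different text. The variable names are illustrative.

```python
# Python 3 sketch: the same octets decoded with the right and the wrong codec.
wire = b'Nam\xc3\xa1ri\xc3\xab!'       # utf-8 octets from the earlier example

right = wire.decode('utf-8')           # the text the sender meant
wrong = wire.decode('cp1252')          # each octet is read as its own character

print(right)
print(wrong)
print(len(right), len(wrong))
```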
Do it yourself; or not!


If you use high-level protocols (and their libraries) like HTTP,
encoding is done for you.
If not, you'll need to do it yourself.
Not supported:

ASCII is a 7-bit code, so it can't encode characters outside its 128-character range.
>>> elvish.encode('ascii')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 3:
ordinal not in range(128)
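In Python 3 the same failure can be caught and inspected. The sketch below assumes only the standard attributes of UnicodeEncodeError (object, start, end); the variable names are made up for the example.

```python
# Python 3 sketch: catch the encoding failure and report where it happened.
elvish = 'Nam\u00e1ri\u00eb!'

try:
    elvish.encode('ascii')
except UnicodeEncodeError as exc:
    position = exc.start                      # index of the first bad character
    bad_char = exc.object[exc.start:exc.end]  # the character(s) that failed
    print('cannot encode %r at position %d' % (bad_char, position))
```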
Variable length encodings:


Some codecs encode different characters in different numbers of
octets.
For example, utf-16 uses either 16 or 32 bits to encode a
character.

utf-16 also adds prefix bytes (the byte order mark): \xff\xfe.

All of these things make it hard to pick out individual characters.
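The varying widths can be seen with a few sample characters. This is a Python 3 sketch; the sample characters are chosen for illustration, and 'utf-16-le' is used so that no byte order mark is counted.

```python
# Python 3 sketch: octets per character under utf-8 and utf-16.
samples = ['a', '\u00eb', '\u20ac', '\U0001F600']   # a, ë, €, and an emoji

widths = {}
for ch in samples:
    utf8_len = len(ch.encode('utf-8'))
    utf16_len = len(ch.encode('utf-16-le'))   # -le variant: no byte order mark
    widths[ch] = (utf8_len, utf16_len)
    print('U+%04X  utf-8: %d octets  utf-16: %d octets'
          % (ord(ch), utf8_len, utf16_len))
```

Because the width depends on the character, octet offsets into an encoded message do not correspond to character positions.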
Network Byte Order


Either big-endian or little-endian.
Typically needed for binary data. Text is handled by encoding
(and knowing where your message ends (framing)).

Problem: Send 4253 across a network connection

Solution 1: Send '4253'



Problem: Need to convert string <--> number. Lots of
arithmetic.
Still, lots of situations do exactly this (HTTP, for example,
since it is a text protocol).
Dense binary protocols used to be common but are used less and less.
How does Python see 4253?

Python stores a number as binary, we can look at its hex
representation as follows:
>>> hex(4253)
'0x109d'


Each hex digit is 4 bits, so 0x109d occupies two bytes: 0x10 and 0x9d.
Computers store multi-byte values in memory in either big-endian (most
significant byte first) or little-endian (least significant byte first)
format.
Python's perspective on a religious war.

Python is agnostic.
>>> import struct
>>> struct.pack('<i',4253)
'\x9d\x10\x00\x00'
>>> struct.pack('>i',4253)
'\x00\x00\x10\x9d'
>>> struct.pack('!i',4253)
'\x00\x00\x10\x9d'
>>> struct.unpack('!i','\x00\x00\x10\x9d')
(4253,)

'<': little-endian

'>': big-endian

'i': integer

'!': network perspective (big-endian)
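The two ways of sending 4253, as text and as packed binary, can be compared side by side. This is a Python 3 sketch (struct returns bytes there); int.to_bytes is shown only as a cross-check of the big-endian layout.

```python
import struct

n = 4253

# Solution 1: send the decimal digits as text.
as_text = str(n).encode('ascii')
from_text = int(as_text.decode('ascii'))

# Binary alternative: 4 octets in network (big-endian) order.
as_binary = struct.pack('!i', n)
(from_binary,) = struct.unpack('!i', as_binary)

print(as_text, as_binary)
print(from_text == n, from_binary == n)
print(n.to_bytes(4, 'big') == as_binary)   # same layout, built another way
```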
Older Approaches

htons(), htonl(), ntohs() and ntohl().

Authors say, “Don't do it”.
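For comparison only, here is a Python 3 sketch of what the older host-to-network functions in the socket module do, next to the struct equivalent. It is not a recommendation; packing with '!' states the byte order explicitly and needs no separate swap step.

```python
import socket
import struct

n = 4253

# struct with '!' writes network (big-endian) order directly.
via_struct = struct.pack('!I', n)

# Older style: swap to network order, then write the value in native order.
via_htonl = struct.pack('=I', socket.htonl(n))

print(via_struct == via_htonl)
print(socket.ntohl(socket.htonl(n)) == n)        # 32-bit round trip
print(socket.ntohs(socket.htons(1060)) == 1060)  # 16-bit round trip
```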
Framing



UDP does framing for you. Data is transmitted in the same
chunks in which it is received from the application.
In TCP you have to frame your own transmitted data.
Framing answers the question, “When is it safe to stop calling
recv()?”
Simple Example: Single Stream

Send data with no reply
import socket, sys
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
HOST = sys.argv.pop() if len(sys.argv) == 3 else '127.0.0.1'
PORT = 1060
if sys.argv[1:] == ['server']:
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind((HOST, PORT))
    s.listen(1)
    print 'Listening at', s.getsockname()
    sc, sockname = s.accept()
    print 'Accepted connection from', sockname
    sc.shutdown(socket.SHUT_WR)
    message = ''
    while True:
        more = sc.recv(8192)  # arbitrary value of 8k
        if not more:  # socket has closed when recv() returns ''
            break
        message += more
    print 'Done receiving the message; it says:'
    print message
    sc.close()
    s.close()
elif sys.argv[1:] == ['client']:
    s.connect((HOST, PORT))
    s.shutdown(socket.SHUT_RD)
    s.sendall('Beautiful is better than ugly.\n')
    s.sendall('Explicit is better than implicit.\n')
    s.sendall('Simple is better than complex.\n')
    s.close()
else:
    print >>sys.stderr, 'usage: streamer.py server|client [host]'
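The read-until-close pattern can be tried in a single process. The sketch below is Python 3 and, as an assumption for the demo, uses socket.socketpair() in place of a separate client and server; the messages are small enough to fit in the socket buffers, so no threads are needed.

```python
import socket

# socketpair() returns two already-connected stream sockets.
sender, receiver = socket.socketpair()

for line in (b'Beautiful is better than ugly.\n',
             b'Explicit is better than implicit.\n',
             b'Simple is better than complex.\n'):
    sender.sendall(line)
sender.close()                    # closing is the end-of-message signal

message = b''
while True:
    more = receiver.recv(8192)    # arbitrary buffer size
    if not more:                  # b'' means the peer has closed
        break
    message += more
receiver.close()

print(message.decode('ascii'))
```

The receiver never learns where one sendall() ended and the next began; it only learns that the whole stream is finished.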
Simple Example: Streaming in both directions; one request, one
reply

Important caveat: Always complete streaming in one direction
before beginning in the opposite direction. If not, deadlock can
happen.
Simple Example: Fixed Length Messages

In this case use TCP's sendall() and write your own recvall().
def recvall(sock, length):
    data = ''
    while len(data) < length:
        more = sock.recv(length - len(data))
        if not more:
            raise EOFError('socket closed %d bytes into a %d-byte message'
                           % (len(data), length))
        data += more
    return data

Fixed-length messages are rare in practice.
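A Python 3 variant of recvall() with a small self-test, offered as a sketch. The 16-byte message size and the use of socket.socketpair() are assumptions made for the demonstration.

```python
import socket

def recvall(sock, length):
    """Read exactly `length` bytes from sock, or raise EOFError."""
    chunks = []
    remaining = length
    while remaining:
        more = sock.recv(remaining)
        if not more:
            raise EOFError('socket closed %d bytes into a %d-byte message'
                           % (length - remaining, length))
        chunks.append(more)
        remaining -= len(more)
    return b''.join(chunks)

MESSAGE_SIZE = 16                       # hypothetical fixed message length
a, b = socket.socketpair()
a.sendall(b'0123456789ABCDEF' + b'fedcba9876543210')   # two messages back to back
a.close()

first = recvall(b, MESSAGE_SIZE)
second = recvall(b, MESSAGE_SIZE)
try:
    recvall(b, MESSAGE_SIZE)            # nothing left: the peer has closed
    closed_cleanly = False
except EOFError:
    closed_cleanly = True
b.close()
print(first, second, closed_cleanly)
```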
Simple Example: Delimit Message with Special Characters.




Use a delimiter character that cannot occur in the message; this is
not possible when the message is arbitrary binary data.
The authors recommend this approach only if you know the
message “alphabet” is limited.
If the delimiter can occur in the message, then “escape” it
inside the message.
Escaping has issues of its own: recognizing an escaped
character, removing the escaping upon arrival, and the change in
message length.
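One possible escaping scheme, sketched in Python 3. The choice of newline as delimiter and backslash as escape byte is an assumption for illustration; the helper names escape() and unescape() are made up for this example.

```python
DELIMITER = b'\n'   # assumed message delimiter
ESCAPE = b'\\'      # assumed escape byte

def escape(message):
    """Make a message safe to delimit: no raw newline remains in the result."""
    return message.replace(ESCAPE, ESCAPE + ESCAPE).replace(DELIMITER, ESCAPE + b'n')

def unescape(data):
    """Reverse escape(): backslash-n becomes newline, backslash-x becomes x."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i:i + 1] == ESCAPE:
            nxt = data[i + 1:i + 2]
            out += DELIMITER if nxt == b'n' else nxt
            i += 2
        else:
            out.append(data[i])
            i += 1
    return bytes(out)

messages = [b'plain', b'two\nlines', b'back\\slash', b'\\n literal', b'']
stream = b''.join(escape(m) + DELIMITER for m in messages)

# Every raw newline in the stream now marks the end of a message.
received = [unescape(part) for part in stream.split(DELIMITER)[:-1]]
print(received == messages)
```

The escaped form is longer than the original, which is the message-length issue mentioned above.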
Simple Example: Prefix message with its length

Popular with binary data.

Don't forget to “frame” the length itself.

What if this is your choice but you don't know in advance the
length of the message? Divide your message up into known
length segments and send them separately. Now all you need
is a signal for the final segment.
Listing 5-2.
#!/usr/bin/env python
# Foundations of Python Network Programming - Chapter 5 - blocks.py
# Sending data one block at a time.
import socket, struct, sys
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
HOST = sys.argv.pop() if len(sys.argv) == 3 else '127.0.0.1'
PORT = 1060
format = struct.Struct('!I')  # for messages up to 2**32 - 1 in length
def recvall(sock, length):
    data = ''
    while len(data) < length:
        more = sock.recv(length - len(data))
        if not more:
            raise EOFError('socket closed %d bytes into a %d-byte message'
                           % (len(data), length))
        data += more
    return data
def get(sock):
    # Read the fixed-size length prefix, then exactly that many bytes.
    lendata = recvall(sock, format.size)
    (length,) = format.unpack(lendata)
    return recvall(sock, length)
def put(sock, message):
    # sendall() rather than send(): send() may transmit only part of the data.
    sock.sendall(format.pack(len(message)) + message)
if sys.argv[1:] == ['server']:
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind((HOST, PORT))
    s.listen(1)
    print 'Listening at', s.getsockname()
    sc, sockname = s.accept()
    print 'Accepted connection from', sockname
    sc.shutdown(socket.SHUT_WR)
    while True:
        message = get(sc)
        if not message:  # a zero-length block marks the end
            break
        print 'Message says:', repr(message)
    sc.close()
    s.close()
elif sys.argv[1:] == ['client']:
    s.connect((HOST, PORT))
    s.shutdown(socket.SHUT_RD)
    put(s, 'Beautiful is better than ugly.')
    put(s, 'Explicit is better than implicit.')
    put(s, 'Simple is better than complex.')
    put(s, '')
    s.close()
else:
    print >>sys.stderr, 'usage: blocks.py server|client [host]'
HTTP Example:
• Uses a delimiter, '\r\n\r\n', to end the header, and the Content-Length field in the header to frame a body that may be purely binary data.
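The two framing steps can be sketched on a byte string in Python 3. The response below is a made-up example, and the parsing is deliberately minimal (no header folding, no chunked transfer encoding); a real program would use an HTTP library.

```python
# Hypothetical raw bytes read from a socket: one response plus extra data.
raw = (b'HTTP/1.1 200 OK\r\n'
       b'Content-Type: application/octet-stream\r\n'
       b'Content-Length: 4\r\n'
       b'\r\n'
       b'\x00\x01\r\n'
       b'bytes that belong to the next message')

# Step 1: the delimiter marks the end of the header.
head, _, rest = raw.partition(b'\r\n\r\n')

headers = {}
for line in head.split(b'\r\n')[1:]:          # skip the status line
    name, _, value = line.partition(b':')
    headers[name.strip().lower()] = value.strip()

# Step 2: Content-Length frames the body, whatever octets it contains.
length = int(headers[b'content-length'])
body, leftover = rest[:length], rest[length:]
print(repr(body), repr(leftover))
```

The body here contains '\r\n', which is harmless because the body is framed by its length rather than by a delimiter.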
Pickles:
• Pickle is the native serialization format built into Python.
• Serialization is used to send objects that include pointers
across the network, where the pointers will have to be rebuilt.
• A pickle is a mix of text and data:
>>> import pickle
>>> pickle.dumps([5,6,7])
'(lp0\nI5\naI6\naI7\na.'
>>>
• At the other end:
>>> pickle.dumps([5,6,7])
'(lp0\nI5\naI6\naI7\na.'
>>> pickle.loads(('(lp0\nI5\naI6\naI7\na.An apple day') )
[5, 6, 7]
Pickles:
• The problem in the network case is that we can't tell how many bytes
of pickle data were consumed before we get to what follows
('An apple day').
• If we use the load() function on a file instead, then the file pointer
is maintained and we can ask for its location.
>>> from StringIO import StringIO
>>> f = StringIO('(lp0\nI5\naI6\naI7\na.An apple day')
>>> pickle.load(f)
[5, 6, 7]
>>> f.pos
18
>>> f.read()
'An apple day'
>>>
• Remember that Python lets you turn a socket into a file
object with makefile().
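The same experiment in Python 3, as a sketch: io.BytesIO stands in for the file object, and the pickle bytes differ from the Python 2 output above because the default protocol is different. Only the behaviour of the file position matters here.

```python
import io
import pickle

pickled = pickle.dumps([5, 6, 7])
payload = pickled + b'An apple day'    # pickle followed by unrelated data

f = io.BytesIO(payload)
obj = pickle.load(f)                   # stops at the end of the pickle
consumed = f.tell()                    # how many bytes the pickle used
rest = f.read()                        # everything after the pickle

print(obj, consumed, rest)
```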
JSON
• Popular; easily allows data exchange between software
written in different languages.
• Does not support framing.
• JSON supports Unicode but not binary data (see BSON).
• See Chapter 18.
>>> import json
>>> json.dumps([51,u'Namárië!'])
'[51, "Nam\\u00e1ri\\u00eb!"]'
>>> json.loads('{"name": "lancelot", "quest" : "Grail"}')
{u'quest': u'Grail', u'name': u'lancelot'}
>>>
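JSON itself has no framing, but once a complete value is in a buffer the decoder can report where it ended. The Python 3 sketch below uses json.JSONDecoder.raw_decode() for that; it does not help with a value that has only partly arrived, so a length prefix or delimiter is still needed on a socket.

```python
import json

# A JSON value followed by unrelated text, as might sit in a receive buffer.
text = json.dumps([51, 'Nam\u00e1ri\u00eb!']) + 'An apple day'

obj, end = json.JSONDecoder().raw_decode(text)   # value and index just past it
rest = text[end:]

print(obj, end, repr(rest))

# On the wire the JSON text still has to be encoded to octets.
wire = json.dumps(obj).encode('utf-8')
```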
XML
• Popular; easily allows data exchange between software
written in different languages.
• Does not support framing.
• Best for text documents.
• See Chapter 10.
Compression
• Time spent transmitting is usually much longer than time spent pre- and post-processing the exchanged data, so compression often pays off.
• HTTP lets client and server decide whether to compress or
not.
• zlib is self-framing. Start feeding it a compressed data stream
and it will know when the stream has come to an end.
>>> import zlib
>>> data = zlib.compress('sparse')+'.'+zlib.compress('flat')+'.'
>>> data
'x\x9c+.H,*N\x05\x00\t\r\x02\x8f.x\x9cK\xcbI,\x01\x00\x04\x16\x01\xa8.'
>>> len(data)
28
Note: the two '.' separators are plain bytes; they were not compressed.
Compression
• Suppose the previous data arrives in 8-byte chunks.
>>> dobj = zlib.decompressobj()
>>> dobj.decompress(data[0:8]), dobj.unused_data
('spars', '')
unused_data is still empty, so we have not reached the end of the compressed stream.
• We are still expecting more data.
>>> dobj.decompress(data[8:16]), dobj.unused_data
('e', '.x')
Non-empty unused_data says the first compressed stream has been fully
consumed; '.x' arrived after its end and was not used.
Compression
• Skip over the '.' and start to decompress the rest of the
compressed data.
>>> dobj2 = zlib.decompressobj()
>>> dobj2.decompress('x'), dobj2.unused_data
('', '')
>>> dobj2.decompress(data[16:24]), dobj2.unused_data
('flat', '')
>>> dobj2.decompress(data[24:]), dobj2.unused_data
('', '.')
The unused data is the final '.'. The point is that what we have gathered so far,
'' + 'flat' + '',
is all of the data compressed by the second use
of zlib.compress().
NOTE: Using zlib regularly provides its own framing.
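The chunked session can be condensed into a Python 3 sketch. It relies on the eof and unused_data attributes of zlib decompression objects; the 8-byte chunk size follows the slides, and the loop assumes the complete first stream is present in data.

```python
import zlib

data = zlib.compress(b'sparse') + b'.' + zlib.compress(b'flat') + b'.'

CHUNK = 8                            # pretend the data arrives 8 bytes at a time
d = zlib.decompressobj()
first = b''
pos = 0
while not d.eof:                     # eof turns True at the end of the stream
    first += d.decompress(data[pos:pos + CHUNK])
    pos += CHUNK

# Bytes that arrived after the end of the first stream, plus what is unread.
rest = d.unused_data + data[pos:]

d2 = zlib.decompressobj()
second = d2.decompress(rest[1:])     # skip the '.' separator

print(first, second, d2.unused_data)
```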
Network Exceptions:
• Many possibilities, some specific (socket.timeout) and some
generic (socket.error).
• Homework: Write two short Python scripts. The first opens a
UDP socket connected to a remote socket. The second
tries to send data to the first socket but will fail,
since its socket is not the one the first was “connected” to.
Find out the exact error that Python returns, along with the
value of errno.
• Familiar exceptions: socket.gaierror, socket.error,
socket.timeout.
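A Python 3 sketch of catching a specific exception before the generic one. It provokes socket.timeout with a connected pair of sockets on which nothing is ever sent; the 0.1 second timeout is an arbitrary choice. In Python 3, socket.error is an alias of OSError, and the specific exceptions are subclasses of it.

```python
import socket

a, b = socket.socketpair()
b.settimeout(0.1)                 # recv() will give up after 0.1 seconds

try:
    b.recv(1024)                  # nothing was sent, so this must time out
    outcome = 'data'
except socket.timeout:            # specific exception first
    outcome = 'timeout'
except OSError as exc:            # generic socket error as a fallback
    outcome = 'error %s' % exc.errno
finally:
    a.close()
    b.close()

print(outcome)
```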