COSC 1306 COMPUTER SCIENCE AND PROGRAMMING
Jehan François Pâris
CHAPTER VI FILES
Chapter Overview We will learn how to read, create and modify files Essential if we want to store our program inputs and results.
Pay special attention to
pickled files
They are very easy to use!
The file system Provides
long term storage
of information. Will store data in
stable storage
(disk) Cannot be RAM because:
Dynamic RAM loses its contents when powered off
Static RAM is too expensive
System crashes can corrupt contents of the main memory
Overall organization Data managed by the file system are grouped in
user-defined
data sets called
files
The file system must provide a mechanism for
naming
these data Each file system has its own set of conventions All modern operating systems use a
hierarchical directory structure
Windows solution
Each device and each disk partition is identified by a letter
A: and B: were used by the floppy drives
C: is the first disk partition of the hard drive
If the hard drive has no other disk partition, D: denotes the DVD drive
Each device and each disk partition has its
own hierarchy of folders
Windows solution
[Figure: separate folder hierarchies for C:, a second disk D: and a flash drive F:, with folders such as Users, Windows and Program Files]
UNIX/LINUX organization
Each device and disk partition has its own directory tree
Disk partitions are glued together through the mount operation to form a single tree
Typical user does not know where her files are stored
UNIX/LINUX organization
[Figure: two directory trees, the root partition / and another partition, with the directories bin and usr]
The magic mount
Second partition can be accessed as /usr
Mac OS organization Similar to Windows Disk partitions are not merged Represented by separate icons on the desktop
Accessing a file (I) Your Python programs are stored in a folder AKA directory On my home PC it is
C:\Users\Jehan-Francois Paris\Documents\Courses\1306\Python
All files in that folder can be directly accessed through their names
"myfile.txt"
Accessing a file (II) Files in folders inside that folder —
subfolders
—can be accessed by specifying first the subfolder
Windows style:
"test\\sample.txt"
Note the double backslash
Linux/Unix/Mac OS X style:
"test/sample.txt"
Generally works for Windows
Why the double backslash?
The backslash is an
escape character
in Python Combines with its successor to represent
non-printable characters
'\n' represents a newline
'\t' represents a tab
Must use '\\' to represent a plain backslash
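A small sketch of the escape rule, using the subfolder path from the earlier slide. The helper count_backslashes is hypothetical and only there for illustration.

```python
# Each "\\" typed in the program is stored as ONE backslash character.
windows_path = "test\\sample.txt"
print(windows_path)        # shows the string as Python stores it
print(len(windows_path))   # the backslash is counted once

# "\n" and "\t" are single characters too (newline and tab).
print(len("\n"), len("\t"))


def count_backslashes(text):
    """Hypothetical helper: count the plain backslashes in a string."""
    return text.count("\\")


print(count_backslashes(windows_path))
```

The forward-slash form "test/sample.txt" needs no escaping at all, which is one reason to prefer it.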
Accessing a file (III) For other files, must use full pathname
Windows Style:
"C:\\Users\\Jehan-Francois Paris\\Documents\\Courses\\1306\\Python\\myfile.txt"
Accessing file contents Two step process: First we
open
the file Then we access its contents
read
write
When we are done, we
close
the file
What happens at open() time?
The system verifies
That you are an authorized user
That you have the right permission
Read permission
Write permission
Execute permission exists but doesn't apply
and returns a file handle / file descriptor
The file handle Gives the user Fast direct access to the file No folder lookups Authority to execute the file operations whose permissions have been requested
Python open()
open(name, mode = 'r', buffering = -1)
where
name
is name of file
mode
is
permission requested
Default is
‘r’
for read only
buffering
specifies the
buffer size
Use system default value
(code -1)
The modes Can request
'r' for read-only
'w' for write-only Always overwrites the file
'a' for append Writes at the end
'r+' or 'a+' for updating (read + write/append)
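The difference between 'w' and 'a' is easiest to see on a scratch file. The sketch below is only an illustration: the function name demo_modes and the file name modes_demo.txt are made up, and the file is created in a temporary folder so nothing of yours gets overwritten.

```python
import os
import tempfile


def demo_modes():
    """Write, overwrite, then append to a scratch file; return its final contents."""
    folder = tempfile.mkdtemp()                    # throwaway folder
    path = os.path.join(folder, "modes_demo.txt")  # hypothetical file name

    fh = open(path, "w")      # 'w' creates the file
    fh.write("first\n")
    fh.close()

    fh = open(path, "w")      # 'w' again: the old contents are discarded
    fh.write("second\n")
    fh.close()

    fh = open(path, "a")      # 'a' keeps the contents and writes at the end
    fh.write("third\n")
    fh.close()

    fh = open(path, "r")      # 'r' only reads
    contents = fh.read()
    fh.close()
    return contents


print(repr(demo_modes()))
```

The line "first" is gone because the second open in 'w' mode started the file over.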
Examples
f1 = open("myfile.txt")
same as
f1 = open("myfile.txt", "r")
f2 = open("test\\sample.txt", "r")
f3 = open("test/sample.txt", "r")
f4 = open("C:\\Users\\Jehan-Francois Paris\\Documents\\Courses\\1306\\Python\\myfile.txt")
Reading a file Three ways: Global reads Line by line Pickled files
Global reads
fh.read()
Returns
whole contents
of file specified by file handle
fh
File contents are stored in a
single string
that might be very large
Example
f2 = open("test\\sample.txt", "r")
bigstring = f2.read()
print(bigstring)
f2.close() # not required
Output of example
To be or not to be that is the question Now is the winter of our discontent
Exact contents of file
‘test\sample.txt’ followed by an extra return
Line-by-line reads
for line in fh : # do not forget the colon
    # anything you want
fh.close() # not required: Python does it
Example
f3 = open("test/sample.txt", "r")
for line in f3 : # do not forget the colon
    print(line)
f3.close() # not required
Output To be or not to be that is the question Now is the winter of our discontent With one or more
extra blank lines
Why?
Each line ends with an end-of-line marker
print(…) adds
an extra end-of-line
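To see the end-of-line markers, we can look at the lines with repr(). This sketch uses io.StringIO as a stand-in for a real file (an assumption made so the example runs without test/sample.txt), and the sample text is only a guess at what the file might contain.

```python
import io

# A stand-in for an open text file; the last line has no end-of-line marker.
fh = io.StringIO("To be or not to be\nthat is the question")

lines = [line for line in fh]   # same iteration as "for line in fh"

print(repr(lines[0]))    # the '\n' is part of the string
print(repr(lines[-1]))   # the last line may not end with '\n'
```

Since each line already carries its own '\n', the one that print() adds produces the blank lines.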
Trying to remove blank lines
print('----------------------------------------------------')
f5 = open("test/sample.txt", "r")
for line in f5 : # do not forget the colon
    print(line[:-1]) # remove last char
f5.close() # not required
print('-----------------------------------------------------')
The output
--------------------------------------------------- To be or not to be that is the question Now is the winter of our disconten -----------------------------------------------------
The last line did not end with an EOL!
A smarter solution (I)
Only remove the last character if it is an EOL
if line[-1] == '\n' :
    print(line[:-1])
else :
    print(line)
A smarter solution (II)
print('----------------------------------------------------')
fh = open("test/sample.txt", "r")
for line in fh : # do not forget the colon
    if line[-1] == '\n' :
        print(line[:-1]) # remove last char
    else :
        print(line)
print('-----------------------------------------------------')
fh.close() # not required
It works!
--------------------------------------------------- To be or not to be that is the question Now is the winter of our discontent -----------------------------------------------------
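The same test can be packaged as a small function. The name strip_eol is hypothetical; the extra check for an empty string is an addition, since line[-1] fails on ''.

```python
def strip_eol(line):
    """Return line without its final end-of-line marker, if it has one."""
    if line != '' and line[-1] == '\n':
        return line[:-1]      # drop only the last character
    return line               # nothing to remove


print(strip_eol("Now is the winter of our discontent\n"))
print(strip_eol("Now is the winter of our discontent"))
```

For lines read from a file, the string method line.rstrip('\n') gives the same result.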
Making sense of file contents Most files contain more than one data item per line COSC 713-743-3350 UHPD 713-743-3333 Must split lines
mystring.split(sepchar)
where
sepchar
is a separation character returns a list of items
Splitting strings
>>> text = "Four score and seven years ago"
>>> text.split()
['Four', 'score', 'and', 'seven', 'years', 'ago']
>>> record = "1,'Baker, Andy', 83, 89, 85"
>>> record.split(',')
['1', "'Baker", " Andy'", ' 83', ' 89', ' 85']
Not what we wanted!
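One possible way out, not covered in these slides, is the csv module of the Python standard library, which knows about quoted fields. The sketch below assumes the single quote is the quoting character of this particular record.

```python
import csv

record = "1,'Baker, Andy', 83, 89, 85"

# csv.reader accepts any iterable of lines, so a one-element list will do.
# quotechar="'"         : fields may be wrapped in single quotes
# skipinitialspace=True : ignore the blank that follows each comma
reader = csv.reader([record], quotechar="'", skipinitialspace=True)
fields = next(reader)

print(fields)
```

The name with its comma now comes back as one item.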
Example
# how2split.py
print('----------------------------------------------------')
f5 = open("test/sample.txt", "r")
for line in f5 :
    words = line.split()
    for xxx in words :
        print(xxx)
f5.close() # not required
print('-----------------------------------------------------')
Output
--------------------------------------------------- To be … of our discontent -----------------------------------------------------
Picking the right separator (I)
Commas
CSV Excel format
Values are separated by commas
Strings are stored without quotes
Unless they contain a comma
"Doe, Jane", freshman, 90, 90
Quotes within strings are doubled
Picking the right separator (II)
Tabs( ‘\t’)
Advantages:
Your fields will appear nicely aligned Spaces, commas, … are not an issue
Disadvantage:
You do not see them They look like spaces
Why it is important When you must pick your file format, you should decide how the data inside the file will be used: People will read them Other programs will use them Will be used by people and machines
An exercise Converting tab-separated data to CSV format Replacing tabs by commas Easy Will use string replace function
First attempt
fh_in = open('grades.txt', 'r') # the 'r' is optional
buffer = fh_in.read()
newbuffer = buffer.replace('\t', ',')
fh_out = open('grades0.csv', 'w')
fh_out.write(newbuffer)
fh_in.close()
fh_out.close()
print('Done!')
The output
Alice 90 90 90 90 90
Bob 85 85 85 85 85
Carol 75 75 75 75 75
becomes
Alice,90,90,90,90,90
Bob,85,85,85,85,85
Carol,75,75,75,75,75
Dealing with commas (I)
Work line by line
For each line
split input into fields using TAB as separator
store fields into a list
Alice 90 90 90 90 90
becomes
['Alice', '90', '90', '90', '90', '90']
Dealing with commas (II) Put within double quotes any entry containing one or more commas Output list entries separated by commas
['"Baker, Alice"', 90, 90, 90, 90, 90]
becomes
"Baker, Alice",90,90,90,90,90
Dealing with commas (III) Our troubles are not over: Must store somewhere all lines until we are done Store them in a list
Dealing with double quotes Before wrapping items with commas with double quotes replace All double quotes by pairs of double quotes
'Aguirre, "Lalo" Eduardo'
becomes
'Aguirre, ""Lalo"" Eduardo'
then
'"Aguirre, ""Lalo"" Eduardo"'
Order matters (I) We must double the inside double quotes before wrapping the string into double quotes; From
'Aguirre, "Lalo" Eduardo'
go to
'Aguirre, ""Lalo"" Eduardo'
then to
'"Aguirre, ""Lalo"" Eduardo"'
Order matters (II) Otherwise; We go from
'Aguirre, "Lalo" Eduardo'
to '
"Aguirre, "Lalo" Eduardo"'
then to
'""Aguirre, ""Lalo"" Eduardo""'
with
all
double quotes doubled
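The two orders can be compared directly. Both function names below are hypothetical; the first follows the order recommended above, the second wraps first and shows the damage.

```python
def to_csv_field(item):
    """Right order: double the inner quotes, THEN wrap if there is a comma."""
    item = item.replace('"', '""')
    if ',' in item:
        item = '"' + item + '"'
    return item


def wrong_order(item):
    """Wrong order: wrap first, so the wrapping quotes get doubled too."""
    if ',' in item:
        item = '"' + item + '"'
    return item.replace('"', '""')


name = 'Aguirre, "Lalo" Eduardo'
print(to_csv_field(name))
print(wrong_order(name))
```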
General organization (I)
linelist = [ ] # the same file in CSV format
for line in file
    itemlist = line.split(…)
    linestring = '' # always start with an empty line
    for item in itemlist :
        remove any trailing newline
        double all double quotes
        if item contains comma, wrap
        add to linestring
General organization (II)
for line in file
    …
    for each item in itemlist
        double all double quotes
        if item contains comma, wrap
        add to linestring
    append linestring to linelist
General organization (III)
for line in file
    …
    remove last comma of linestring
    add newline at end of linestring
    append linestring to linelist
for linestring in linelist
    write linestring into output file
The program (I)
# betterconvert2csv.py
""" Convert tab-separated file to csv """
fh = open('grades.txt','r') # input file
linelist = [ ] # global data structure
for line in fh : # we process an input line
    itemlist = line.split('\t')
    # print(str(itemlist)) # just for debugging
    linestring = '' # start afresh
The program (II)
    for item in itemlist : # we process an item
        item = item.replace('"','""') # for quotes
        if item != '' and item[-1] == '\n' : # remove it
            item = item[:-1]
        if ',' in item : # wrap item
            linestring += '"' + item + '"' + ','
        else : # just append
            linestring += item + ','
    # end of item loop
The program (III)
    # must replace last comma by newline
    linestring = linestring[:-1] + '\n'
    linelist.append(linestring)
# end of line loop
fh.close()
fhh = open('great.csv', 'w')
for line in linelist :
    fhh.write(line)
fhh.close()
Notes Most print statements used for debugging were removed Space considerations Observe that the inner loop adds a comma after each item Wanted to remove the last one Must also add a newline at end of each line
The input file
Alice 90 Bob Carol 85 75 Doe, Jane 90 85 75 90 90 85 75 90 90 85 75 90 Fulano, Eduardo "Lalo" 90 90 85 75 80 90 70 90 90
The output file
Alice,90,90,90,90,90
Bob,85,85,85,85,85
Carol ,75,75,75,75,75
"Doe, Jane",90,90,90 ,80 ,75
"Fulano, Eduardo ""Lalo""",90,90,90,90
Mistakes being made (I)
Mixing lists and strings:
Earlier draft of program declared
linestring = [ ]
and did
linestring.append(item)
Outcome was
['Alice,', '90,', … ]
instead of
'Alice,90, …'
Mistakes being made (II)
Forgetting to add a newline
Output was a single line
Doing the append inside the inner loop:
Output was
Alice,90 Alice,90,90 Alice,90,90,90 …
Mistakes being made (III)
Forgetting that strings are immutable:
Trying to do
linestring[-1] = '\n'
instead of
linestring = linestring[:-1] + '\n'
Bigger issue:
Do we have to remove the last comma?
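One possible answer, sketched under the assumption that we may collect the items in a list first: the string method join puts the comma only between items, so there is no last comma to remove. The function name do_line_with_join is hypothetical.

```python
def do_line_with_join(line):
    """Convert one tab-separated line into one CSV line."""
    if line != '' and line[-1] == '\n':
        line = line[:-1]                   # drop the end-of-line marker
    fields = []                            # a LIST of strings this time
    for item in line.split('\t'):
        item = item.replace('"', '""')     # double the inner quotes first
        if ',' in item:
            item = '"' + item + '"'        # then wrap
        fields.append(item)
    # join inserts a comma BETWEEN the items, never after the last one
    return ','.join(fields) + '\n'


print(do_line_with_join('Doe, Jane\t90\t85\n'), end='')
print(do_line_with_join('Fulano, Eduardo "Lalo"\t90\n'), end='')
```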
Could we have done better? (I) Make the program
more readable by decomposing it into functions
A function to process each line of input
do_line(line)
Input is a string ending with newline Output is a string in CSV format Should call a function processing individual items
Could we have done better? (II) A function to process individual items
do_item(item)
Input is a string Returns a string With double quotes "doubled" Without a newline Within quotes if it contains a comma
The new program (I)
def do_item(item) :
    item = item.replace('"','""') # double the inner quotes
    if item != '' and item[-1] == '\n' : # remove trailing newline
        item = item[:-1]
    if ',' in item : # wrap item
        item = '"' + item + '"'
    return item
The new program (II)
def do_line(line) :
    itemlist = line.split('\t')
    linestring = '' # start afresh
    for item in itemlist :
        linestring += do_item(item) + ','
    if linestring != '' and linestring[-1] == ',' :
        linestring = linestring[:-1] # remove last comma
    linestring += '\n'
    return linestring
The new program (III)
fh = open('grades.txt','r')
linelist = [ ]
for line in fh :
    linelist.append(do_line(line))
fh.close()
The new program (IV)
fhh = open('great.csv', 'w')
for line in linelist :
    fhh.write(line)
fhh.close()
Why it is better Program is decomposed into small modules that are much easier to understand Each fits on a PowerPoint slide
The break statement Makes the program exit the loop it is in In next example, we are looking for
first instance
of a string in a file Can exit as soon it is found
Example (I)
searchstring = input('Enter search string:')
found = False
fh = open('grades.txt')
for line in fh :
    if searchstring in line :
        print(line)
        found = True
        break
Example (II)
if found == True :
    print("String %s was found" % searchstring)
else :
    print("String %s NOT found " % searchstring)
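To check that break really stops the reading, here is a sketch that also counts the lines read. The function find_first is hypothetical, and io.StringIO stands in for grades.txt so the example runs without the file.

```python
import io


def find_first(fh, searchstring):
    """Return (first matching line or None, number of lines read)."""
    found_line = None
    lines_read = 0
    for line in fh:
        lines_read += 1
        if searchstring in line:
            found_line = line
            break                 # leave the loop: no need to read further
    return found_line, lines_read


sample = io.StringIO("Alice\t90\nBob\t85\nCarol\t75\n")
print(find_first(sample, "Bob"))
```

Only two of the three lines are read when the string is found on the second one.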
Flags A variable like
found
That can either be
True
or
False
That is used in a condition for an
if
or a
while is often referred to as a flag
A dumb mistake Unlike C and its family of languages, Python does not let you write
if found = True
for
if found == True
There are still cases where we can make mistakes!
Example
>>> b = 5
>>> c = 8
>>> a = b = c
>>> a
8
>>> a = b == c
>>> a
True
HANDLING EXCEPTIONS
When a wrong value is entered When user is prompted for
number = int(input("Enter a number: "))
and enters a non-numerical string a
ValueError
exception is raised and the program terminates
Python lets programs catch errors
The try… except pair (I)
try:
Observe the colons and the indentation
The try… except pair (II)
try:
If an exception occurs while the program executes the statements between the
try
and the
except,
control is
immediately transferred
to the
statements after the except
A better example
done = False
while not done :
    filename = input("Enter a file name: ")
    try :
        fh = open(filename)
        done = True
    except Exception as ex:
        print('File %s does not exist' % filename)
print(fh.read())
An Example (I)
done = False
while not done :
    try :
        number = int(input('Enter a number:'))
        done = True
    except Exception as ex:
        print('You did not enter a number')
print("You entered %.2f." % number)
input("Hit enter when done with program.")
A simpler solution
done = False
while not done :
    myinput = input('Enter a number:')
    if myinput.isdigit() :
        number = int(myinput)
        done = True
    else :
        print('You did not enter a number')
print("You entered %.2f." % number)
input("Hit enter when done with program.")
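The two solutions do not accept exactly the same inputs: isdigit() only says yes to strings made of digits, so it rejects "-5", which int() converts without trouble. The sketch below shows the difference; the helper to_int_or_none is hypothetical, and it catches ValueError specifically rather than every Exception.

```python
def to_int_or_none(text):
    """Return int(text), or None when text is not a valid integer."""
    try:
        return int(text)
    except ValueError:        # raised by int() for strings like 'abc'
        return None


for text in ["42", "-5", "abc"]:
    print(repr(text), text.isdigit(), to_int_or_none(text))
```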
PICKLED FILES
Pickled files
import pickle
Provides a way to save complex data structures in a file Sometimes said to provide a
serialized representation
of Python objects
Basic primitives (I)
dump(object,fh)
appends a sequential representation of
object
into file with file handle
fh
object
is virtually any Python object
fh
is the handle of a file that must have been opened in
'wb'
mode b is a special option allowing to
write or read binary data
Basic primitives (II)
target = load( filehandle)
assigns to
target
next pickled object stored in file
filehandle
target
is virtually any Python object
filehandle
is the handle of a file that was opened in
'rb'
mode
Example (I)
>>> mylist = [ 2, 'Apples', 5, 'Oranges']
>>> mylist [2, 'Apples', 5, 'Oranges']
>>> fh = open('testfile', 'wb') # b for BINARY
>>> import pickle
>>> pickle.dump(mylist, fh)
>>> fh.close()
Example (II)
>>> fhh = open('testfile', 'rb') # b for BINARY
>>> theirlist = pickle.load(fhh)
>>> theirlist [2, 'Apples', 5, 'Oranges']
>>> theirlist == mylist True
What was stored in testfile?
Some binary data containing the strings 'Apples' and 'Oranges'
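The two interactive sessions can be combined into one small script. This is only a sketch: the function pickle_round_trip is hypothetical, and the file is created in a temporary folder instead of the current one.

```python
import os
import pickle
import tempfile


def pickle_round_trip(obj):
    """Dump obj into a scratch file, load it back, and return the copy."""
    path = os.path.join(tempfile.mkdtemp(), "testfile")

    fh = open(path, "wb")     # b for BINARY
    pickle.dump(obj, fh)
    fh.close()

    fhh = open(path, "rb")    # b for BINARY
    copy = pickle.load(fhh)
    fhh.close()
    return copy


mylist = [2, 'Apples', 5, 'Oranges']
theirlist = pickle_round_trip(mylist)
print(theirlist == mylist)    # same contents
print(theirlist is mylist)    # but a separate object
```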
Using ASCII format Can require a pickled representation of objects that only contains printable characters Must specify
protocol = 0
Advantage:
Easier to debug
Disadvantage:
Takes more space
Example
import pickle
mydict = {'Alice': 22, 'Bob' : 27}
fh = open('asciifile.txt', 'wb') # MUST be 'wb'
pickle.dump(mydict, fh, protocol = 0)
fh.close()
fhh = open('asciifile.txt', 'rb')
theirdict = pickle.load(fhh)
print(mydict)
print(theirdict)
The output
{'Bob': 27, 'Alice': 22}
{'Bob': 27, 'Alice': 22}
What is inside asciifile.txt?
(dp0VBobp1L27Ls
V
Alicep2L22Ls.
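We can check the "printable characters" claim without a file by using pickle.dumps, which returns the pickled bytes directly instead of writing them. This is a sketch for this small dictionary only; the variable names are made up.

```python
import pickle

mydict = {'Alice': 22, 'Bob': 27}

ascii_data = pickle.dumps(mydict, protocol=0)   # printable representation
binary_data = pickle.dumps(mydict)              # default binary protocol

print(ascii_data.decode('ascii'))   # works: every byte is an ASCII character
print(binary_data[:1])              # the binary format starts with a non-ASCII byte

# Both representations load back to the same dictionary.
print(pickle.loads(ascii_data) == pickle.loads(binary_data))
```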
Dumping multiple objects (I)
import pickle
fh = open('asciifile.txt', 'wb')
for k in range(3, 6) :
    mylist = [i for i in range(1,k)]
    print(mylist)
    pickle.dump(mylist, fh, protocol = 0)
fh.close()
Dumping multiple objects (II)
fhh = open('asciifile.txt', 'rb')
lists = [ ] # initializing list of lists
while 1 : # means forever
    try:
        lists.append(pickle.load(fhh))
    except EOFError :
        break
fhh.close()
print(lists)
Dumping multiple objects (III) Note the way we test for end-of-file (
EOF
)
while 1 : # means forever
    try:
        lists.append(pickle.load(fhh))
    except EOFError :
        break
The output
[1, 2]
[1, 2, 3]
[1, 2, 3, 4]
[[1, 2], [1, 2, 3], [1, 2, 3, 4]]
What is inside asciifile.txt?
(lp0L1LaL2La.(lp0L1LaL2LaL3La.(lp0L1LaL2 LaL3LaL4La.
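The read-until-EOFError loop can be wrapped in a reusable function. In this sketch the name load_all is hypothetical, and io.BytesIO plays the role of a file opened in binary mode so the example runs without asciifile.txt.

```python
import io
import pickle


def load_all(fh):
    """Return the list of all pickled objects stored in fh."""
    objects = []
    while True:                       # means forever
        try:
            objects.append(pickle.load(fh))
        except EOFError:              # nothing left to read
            break
    return objects


buffer = io.BytesIO()                 # in-memory stand-in for a 'wb' file
for k in range(3, 6):
    pickle.dump(list(range(1, k)), buffer, protocol=0)

buffer.seek(0)                        # go back to the start before reading
print(load_all(buffer))
```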
Practical considerations You rarely pick the format of your input files
May have to do format conversion
You often have to use specific formats for your output files
Often dictated by program that will use them
Otherwise
stick with pickled files
!