PowerPoint PPT

Download Report

Transcript PowerPoint PPT

Do not put content on the brand

PROC IMPORT and more Or When PROC IMPORT just doesn't do the job

David B. Horvath, CCP, MS PhilaSUG Spring 2014 Meeting

PROC IMPORT and more

Copyright

©

2014, David B. Horvath, CCP — All Rights Reserved The Author can be contacted at: 504 Longbotham Drive, Aston PA 19014-2502, USA Phone: 1-610-859-8826 Email: [email protected]

Web: http://www.cobs.com/ All trademarks and servicemarks are the property of their respective owners.

2

Introductions

• My Background • PROC IMPORT Basics • CSV Specification and Issues • Fixing CSV Data Outside of SAS • Tips and Tricks 3

Abstract

• PROC IMPORT comes in handy when quickly trying to load a CSV or similar file. But it does have limitations. Unfortunately, I've run into those limitations and had to work around them. • This session will discuss the original CSV specification (early 1980's), how Microsoft Excel violates that specification, how SAS PROC IMPORT does not follow that specification, and the issues that can result. • Simple UNIX tools will be described that can be used to ensure that data hilarities do not occur due to CSV issues. • Recommendations will be made to get around some of PROC IMPORT limitations (like field naming, data type determination, limitation in number of fields, separator in data). • CSV, TAB, and DLM types will be discussed.

4

My Background

• Base SAS on Mainframe, UNIX, and PC Platforms • SAS is primarily an ETL tool or Programming Language for me • My background is IT – I am not a modeler • Not my first User Group presentation – presented sessions and seminars in Australia, France, the US, and Canada. • Undergraduate: Computer and Information Sciences, Temple Univ.

• Graduate: Organizational Dynamics, UPenn • Most of my career was in consulting • Have written several books (none SAS-related, yet) • Adjunct Instructor covering IT topics.

5

PROC IMPORT Basics

• Will focus on code rather than GUI Import Wizard • Handy tool to load data from other sources • You provide the input (flat) file, SAS creates the dataset • The primary formats: • • • CSV – Comma Separated Variables TAB – Tab (ASCII ‘09x’ EBCDIC ’05’x) DLM – You pick it • Quoted strings get special treatment any time you use delimited files • • PROC IMPORT INFILE with DLM and DSD (“Delimiter-Sensitive Data”) 6

PROC IMPORT Basics

• Syntax: PROC IMPORT DATAFILE="filename” OUT=SAS data-set <(SAS data-set-options)> ; • DBMS identifies data source type • • CSV, TAB, and DLM will be discussed here DBMS options could be an entire talk in themselves and largely depends on operating system and installed components.

• REPLACE allows output dataset to be overwritten.

7

PROC IMPORT Basics

• Data-source-statements: • DELIMITER only specified with DBMS=DLM • Will override DBMS=CSV • Can contain multiple values DELIMTER=‘!|’ • GETNAMES gets variable names from file • GETNAMES=YES is the default • • Will “sasify” the names (underlines replace spaces, length, etc.) Defaults DATAROW to 2 • DATAROW=n • • Is similar in function to FIRSTOBS Defaults to 1 unless GETNAMES=YES • GUESSINGROWS=20 is the default • • Used by wizard to build format and input statements Tradeoff between time and accuracy – higher count gives greater accuracy 8

Example from SAS Documentation

• The input CSV: "Africa","Boot","Addis Ababa","12","$29,761","$191,821","$769" "Asia","Boot","Bangkok","1","$1,996","$9,576","$80" "Canada","Boot","Calgary","8","$17,720","$63,280","$472" "Central America/Caribbean","Boot","Kingston","33","$102,372","$393,376","$4,454" "Eastern Europe","Boot","Budapest","22","$74,102","$317,515","$3,341" "Middle East","Boot","Al-Khobar","10","$15,062","$44,658","$765" "Pacific","Boot","Auckland","12","$20,141","$97,919","$962" "South America","Boot","Bogota","19","$15,312","$35,805","$1,229" "United States","Boot","Chicago","16","$82,483","$305,061","$3,735" "Western Europe","Boot","Copenhagen","2","$1,663","$4,657","$129“ • The Code: proc import datafile=“csv1.csv" out=shoes dbms=csv replace; getnames=no; run; 9

Example from SAS Documentation

• Generated Code: 7 /********************************************************************** 8 * PRODUCT: SAS 9 * VERSION: 9.2

10 * CREATOR: External File Interface 11 * DATE: 10JUN14 12 * DESC: Generated SAS Datastep Code 13 * TEMPLATE SOURCE: (None Specified.) 14 ***********************************************************************/ 15 data WORK.SHOES ; 16 %let _EFIERR_ = 0; /* set the ERROR detection macro variable */ 17 infile 'csv1.csv' delimiter = ',' MISSOVER DSD lrecl=32767 ; 18 informat VAR1 $34. ; ...

25 format VAR1 $34. ; ...

32 input 33 VAR1 $ ...

40 ; 41 if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */ 42 run; 10

Example from SAS Documentation

• Output: Obs VAR1 VAR2 VAR3 VAR4 VAR5 VAR6 VAR7 1 Africa Boot Addis Ababa 12 $29,761 $191,821 $769 2 Asia Boot Bangkok 1 $1,996 $9,576 $80 3 Canada Boot Calgary 8 $17,720 $63,280 $472 4 Central America/Caribbean Boot Kingston 33 $102,372 $393,376 $4,454 5 Eastern Europe Boot Budapest 22 $74,102 $317,515 $3,341 ...

10 Western Europe Boot Copenhagen 2 $1,663 $4,657 $129 11

CSV Specification and Issues

• When CSV first came out, there was no real specification • History traced to list directed/free form input specification in FORTRAN 77 • First mention I could find (in SAS’ TS-673) was version 6 • • Did not handle empty fields (two delimiters in a row) Did not handle quoted strings containing delimiter • Each vendor had their own “standard” which may or may not support quoted strings • For a long time, Excel did not quote strings •

The creator of your file may not quote strings

• RFC 4180, IETF 2005, is the definitive reference now!

12

CSV Specification and Issues

• The format is basically: • Optional Header (usually column names): field_name,field_name,field_name • Followed by Datalines: aaa,bbb,ccc zzz,yyy,xxx • Optionally, strings are contained within single or double quotes "aaa","bbb","ccc" • If a quoted string contains the quote character, it must be escaped: • • "Af"r,ica" Becomes “Af”r and ica” fields "Af”"r,ica" Becomes “Af”r,ica” single field • Each line should have a line terminator • This only really matters if you have issues importing 13

CSV Specification and Issues

• What happens if delimiter-containing fields are not quoted?

• • Depends on consistency of data OK: Applicant, Instructor Resources, Human Management, Fiscal • Not OK: Applicant, Instructor Resources, Human Training • Inconsistent number of fields • Column names exist in some files • Page titles 14

Fixing CSV Data Outside of SAS

• I strongly advise against CSV in favor of TAB delimited • Second choice is delimited by other character (like ‘|’ pipe) • Ask for a new version properly formatted • Manually edit the file • Use some UNIX and SAS tricks 15

Tips and Tricks – Modifying Generated Code

• My favorite trick is to let PROC IMPORT write the initial code for a file and then copy/paste/edit the resulting DATA step into the program.

• Can alter variable names, formats, and data types • Can use more INFILE options like DLMSTR, DLMSOPT, END, others • Can perform data validation/correction • Excel will let you enter ‘14-JUL’ as a date, treat it as July 14 2014, but write it out to the file as ‘14-JUL’ • • Humans are lazy.

Your program could input the field and correct in on flow through the data.

• Programmers are lazy. I don’t want to have to type all the variables in a large delimited file.

• UNIX has a tool to cut columns out of a file: • cut -c 10- csv3.log > data.txt

• Will eliminate row numbers and formatting in log columns 1-9 16

Tips and Tricks – Hanging PROC

• If you see this error when using PROC IMPORT or EXPORT while running from the command line: ERROR: Cannot open X display. Check the name/server access authorization.

• Then the wizard is trying to run interactively. Your command line should include: sas –noterminal • Supposed to be fixed in v9.3

• Another symptom is that the PROC “hangs” 17

Tips and Tricks – Delimited File Not Quote Escaped

• Problem reported recently by Don Stanley [email protected]

• This is the case where quotes are just data: AAAA.BBBB.'I.CCCC'.0.DL

• PROC IMPORT/INFILE DLM DSD wants to treat 'I.CCCC‘ as one field, developer wants it to be two.

• With the possibility of empty fields, partial DSD functionality is needed. But DSD also invokes the quoted field handling.

18

Tips and Tricks – Delimited File Not Quote Escaped

• SAS Solution (I lost the contributor name) infile custdata dlm=‘,' ; input @ ; do while(index(_infile_,‘,,')>0) ; _infile_ = tranwrd(_infile_,‘,,',‘, ,') ; end ; /* comma-comma to comma-space-comma */ input << delimited fields >> ; • _infile_ is limited to 32,767 – could be problem for large records • UNIX Shell Solution nawk '{gsub(“,,", “, ,");print $0;}' INPUTFILE > OUTPUTFILE • Handles (essentially) unlimited record lengths 19

Tips and Tricks – File Contents

• UNIX head will show the first few rows • Does the file have extra lines at the top (like a title line)?

• UNIX tail will show the last few rows • • Do I have all the data? Are the records consistent beginning to end?

• UNIX wc -l (word count, line) will show how many lines in total • Is this file the right size?

• You can always edit the file (unless too large) • UNIX od -c -x (object dump, character and hex) is useful to see format and contents of file (including unprintable characters). 20

Tips and Tricks – File Contents

• Determine maximum record length and field count (based on delimiters) with a script • This example is written in awk (new awk): BEGIN{ FS=","; mNF=0; mRL=0; } { if (NF > mNF) mNF=NF; RL=length($0); if (RL > mRL) mRL=RL; } END{ print "max NF " mNF " max RECL " mRL;} • Max NF is highest number of fields on any record (does not handle quote escaping) • Max RECL is the longest record length in the file • Can be saved in a file and executed on your data: nawk csv_check.awk INPUTFILE.csv

21

Tips and Tricks – Dropping Titles

• The input CSV: This is a title line and should be here country,shoe,city,count,average salary,max salary,average "Africa","Boot","Addis Ababa","12","$29,761","$191,821","$769" "Asia","Boot","Bangkok","1","$1,996","$9,576","$80" "Canada","Boot","Calgary","8","$17,720","$63,280","$472“ • PROC IMPORT results: Number of names found is less than number of variables found.

Name This is a title line and should be here truncated to This_is_a_title_line_and_should.

Problems were detected with provided names. See LOG.

• How to get rid of the first line since DATAROW= won’t help: • • wc -l INPUT tail -4 INPUT > OUTPUT – returns value 5 (5 lines) – saves last 4 lines 22

Tips and Tricks – Dropping Junk Records at End

• The input CSV: country,shoe,city,count,average salary,max salary,average "Africa","Boot","Addis Ababa","12","$29,761","$191,821","$769" "Asia","Boot","Bangkok","1","$1,996","$9,576","$80" "Canada","Boot","Calgary","8","$17,720","$63,280","$472" ,,,,,, • Results in: average_ Obs country shoe city count salary max_salary average 1 Africa Boot Addis Ababa 12 $29,761 $191,821 $769 2 Asia Boot Bangkok 1 $1,996 $9,576 $80 3 Canada Boot Calgary 8 $17,720 $63,280 $472 4 • How to get rid of the last line since DATAROW= won’t help: • wc -l INPUT – returns value 5 (5 lines) • head -4 INPUT > OUTPUT – saves first 4 lines 23

Personal Note

• On a Personal Note • • I seem to learn a bunch when working on presentations, new classes, and writings I was still treating PROC IMPORT like I was still using V6, I didn’t realize it would handle quoted strings properly now.

• I have to learn some new habits and trust the tool more.

• In any commands, the single and double quotation marks should be simple, not the “smart quotes” forced my Microsoft. The same applies to dashes or minus signs – they should not be “em dashes” (- versus –) 24

Wrap Up

?!

?

?!

?

!

!

Questions

?!

?

and

!

Answers

!

?

?!

25

Proc Import References

• NOTES • Proc Import with a twist: http://www2.sas.com/proceedings/sugi30/038 30.pdf

• Data Example: http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/vie wer.htm#a000314361.htm

• • RFC 4180, IETF 2005, http://tools.ietf.org/html/rfc4180 Wiki has good CSV background: http://en.wikipedia.org/wiki/Comma separated_values • • Hanging PROC IMPORT: http://support.sas.com/kb/3/610.html

CSV 1203 “Standard” – identifies some Excel “oddities”: http://mastpoint.curzonnassau.com/csv-1203 26

Do not put content on the brand

A Few More Words About Mainframe Data in Non-EBCDIC Environment

David B. Horvath, CCP, MS PhilaSUG Spring 2014 Meeting

Mainframe Data in Non-EBCDIC Environment

• SAS TS-642 is a great place to start http://support.sas.com/techsup/technote/ts642.html

• UNIX character set conversion: • dd if=INPUT of=OUTPUT conv=ascii • • Does not handle binary or packed decimal correctly Is readable so you can look at the file • Input formats: • • • • $ebcdicNN. – converts EBCDIC character data to ASCII (COBOL pic X and 9 usage display – the default) s370fpdN.M. – converts packed data (2 binary coded decimal digits per byte, COBOL pic 9 usage comp-3) s370fzdN. – converts zoned decimal (COBOL S9 usage display where sign is not separate leading/trailing character) s370fibN. – converts binary data (COBOL pic 9 usage comp) 28

Mainframe Data in Non-EBCDIC Environment

• Looking at the data: • On the Mainframe itself (easiest if you have access and speak ISPF) • Converted to ASCII (via dd command) • In native format in dump format (character and hex): • od -c -x EBCDIC_FILE • Transferring the Data – Fixed Format • If no binary/packed data, you can use FTP ASCII mode • Otherwise you need BINARY mode 29

Mainframe Data in Non-EBCDIC Environment

• Transferring the Data – Variable Format • Conversion on mainframe to fixed layout • • • Wasteful of space Allows FTP in ASCII or BINARY depending on contents May still have varying records that require processing • Convert on mainframe to RECFM=U via IEBGENER (copy) • • • Hides Block and Record descriptor data from the operating system and FTP FTP in BINARY Use recfm=s370vb on infile statement to handle layout oddities • Access data directly via filename/ftp: • filename test1 ftp "'SASXXX.VB.TEST1'" HOST='MVS' USER='SASXXX' PROMPT

s370v RCMD='site rdw';

• Highlighted portion tells SAS to handle layout oddities 30

Mainframe Data in Non-EBCDIC Environment

• Variable Format Internal Layout (RECFM=VB) • The files are formatted like: • • • BB00RR00yourdataRR00secondrecordRR00thirdrecord BB is the blksize (block) and counts everything (47 bytes) RR is the physical record length – includes the space for the data and RR/00 (12 for your data, 16 for second, and 15 for third) • • • • • A record cannot exceed the defined LRECL, a block cannot exceed the defined BLKSIZE. These are defined when the dataset is created.

Each block has the same general layout In the record, these fields are binary data, do not try to use BB or RR calculations if you have converted the file from EBCDIC to ASCII.

in Native FTP will strip out the layout data and may insert CR/LF Your data may look like CR/LF • May you never have to deal with mainframe tapes (See me if you do) 31

Mainframe Data in Non-EBCDIC Environment

• Meaning Changes in Variable Record Format Files • Typically a record type in first byte • Remainder of the record is of that type • For instance, Employee File: • • • • A – basic employee information: name, address, SSN B – department and report-to information C – education, 1 to many records D – hire/termination information, 1 to many records • Individual byte position in the record can have different meaning (COBOL redefines, C union) input @1 rectype $1. @; if rectype = “A” then do; input @2 nextfield … 32

Mainframe Data in Non-EBCDIC Environment – Wrap Up

?!

?

?!

?

!

!

Questions and

!

Answers

!

33 ?!

?

?

?!