Spreadsheet strucuture inspection using low level access

Download Report

Transcript Spreadsheet strucuture inspection using low level access

Spreadsheet structure inspection using low level access and visualisation

Daniel Ballinger, Robert Biddle and James Noble School of Mathematical and Computing Sciences Email: {db, robert, kjx}@mcs.vuw.ac.nz

http://www.mcs.vuw.ac.nz/~db/honours.html

Motivation

   Spreadsheets are a common form of end-user programming.

Unfamiliar spreadsheets can contain daunting amounts of information in the layout and inter-cell dependencies.

Methods for studying these structures are usually limited to what the application provides.

It is difficult to get a global sense of the structure of an individual formula that may have dependencies spread out all over the spreadsheet table. Users have to track down individual cell dependencies one by one, tacking back and fourth all over the spreadsheet.

Bonnie Nardi, A Small Matter of Programming (1993)

Our Proposed Solution

     Working outside the spreadsheet application allows for greater flexibility in addressing issues.

The flexibility allows for visualisations to aid in end user understanding of spreadsheets beyond what the application is capable of.

A set of visualisations as interface enhancements that allow the user to progress from an abstract levels towards actual details present.

The greater flexibility is a trade-off with direct interaction.

We focused on Microsoft Excel due to its large market share.

Spreadsheet elements requiring extraction

     Artefacts of interest are derived from low-level structures.

The basic unit of interest is any occupied cell.

Each occupied cell will have a value, and optionally a formula.

Formula should be in the same format shown to the user (not RPN).

Building our own BIFF reader would be a sizeable project in itself. Hence we use third party software.

IBM alphaWorks ExcelAccessor

  Requires using Windows native code and Excel to be installed. Limiting portability.

 A Java Bean to access and modify to contents of spreadsheets using a Windows DLL.

Excellent ability to extract all details correctly.

 Prone to irregular crashes on larger corpus tasks.

Andy Khan’s JExcelAPI

  Pure Java integrates better with toolkit.

Open source allows for easier expansion and bug fixes.

 Some element types produce problems with extraction. E.g. Array functions, intersections, absolute references becoming relative, and earlier BIFF formats.

Spreadsheet Application Toolkit

Query  Find and store spreadsheets from the Internet.

 Extract low level structures. E.g. Cell values and formulas.

 Analyse spreadsheet structures. Either individual or corpus.

 Conveying the findings through visualisation.

URLs

XLS files Toolkit Files

Processed Data Gobbler Fetcher

Extractor

Analyser Visualisation Tools Image Google Web Servers

BIFF Reader

Algorithms Metrics

Aspects of spreadsheet structure and use

 The spreadsheet paradigm has two main characteristics: – The spatial relationships between cells – The logical relationships created by formula  These characteristics are not always disjoint.

 User problems are mapped onto a 2D table that shields them from low-level details of programming and allows for more natural expression of many problems.

Spreadsheet layout – Real-estate Utilisation 2D

 Understanding layout is an important first step in learning about a new spreadsheet.

 Actual values and formulas are only shown as occupied cells.

 The visualisation layout mimics that of Excel, with columns along the top of the x-axis and rows running down the y-axis.

 Cells with a higher occupancy level are coloured towards the red end of the colour spectrum.

Spreadsheet layout – Real-estate Utilisation 3D

 Occupancy data is projected into 3D to create a surface map.

 Discrete to continuous data transformation helps smooth the effects of spikes.

 Coloured to give a Topographical terrain effect.

 Full benefit is seen with user interaction.

Formula Inspection – Data Flow

 Visualising formula components that are extracted.

 Fully trace worksheets in one view.

 Single Cell, Range, Union, and Intersection.

Basic Referencing Components

Formula in a summation example

 Visualising formula in a fairly common summation example  The relative complexity of the bottom-right formula is clear from the larger circle.

Igarashi’s Static Global View

Formula Inspection - Dependency Types

 Excel allows for combinations of relative and absolute positioning.

 Understanding the referencing type is important when replicating formula and identifying regular patterns.

Row Absolute Fully Absolute Column Absolute Relative (default)

Related Work

    Takeo Igarashi – Spreadsheets augment “a visible tabular layout with invisible formulas”.

– Created visualisations to help reveal the hidden dataflow graphs and superficial tabular layouts of spreadsheets.

Markus Clermont – Most end-users are not trained programmers.

– Many spreadsheets exist beyond being simple scratch pads.

Raymond Panko – Studies of empirical data into spreadsheet errors.

– Found error rates can be disturbingly high.

– Errors attributed to over confidence and lack of formal checking.

Margaret Burnett – The importance of scalability in visualisations.

– Forms/3 and an embedded testing methodology.

Summary and Future Work

 We created a Java toolkit to extract artefacts from spreadsheets and then convert the basic information into visualisations.

  Future work will include detailed user studies and corpus analysis to find larger patterns.

 These visualisations are used to augment the information provided by Excel in helping users understand spreadsheets.

We must also address visualisation scalability for larger, more complex, spreadsheets.

http://www.mcs.vuw.ac.nz/~db/honours.html