Static Analysis

Static Code Analysis
梁广泰
2011-05-25
Outline
Motivation
Static program analysis (concepts + examples)
Program defect analysis (research work)
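Motivation
Characteristics of cloud platforms
- Applications are deployed directly on cloud servers, which creates security risks
• An application can directly manipulate and damage the server's file system
• A security vulnerability in an application can give attackers an entry point
- Resources are shared and allocated dynamically
• A single poorly performing application can take resources away from other applications
One solution:
- Run static code analysis on an application before deploying it:
• Does it make forbidden calls? (illegal file access)
• Does it contain inefficient code? (heavy String concatenation without StringBuilder)
• Does it contain security vulnerabilities? (SQL injection, cross-site scripting, denial of service)
• Does it contain malware?
• ……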
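Static Code Analysis
Definition:
- Static program analysis is the technique of analyzing a program without executing it; it is usually shortened to static analysis.
Compared with:
- Dynamic program analysis: requires actually executing the program
- Program comprehension: the term "static analysis" generally describes analysis performed by automated tools, whereas manual analysis is usually called program comprehension
Uses:
- Program translation/compilation (compilers), program optimization and refactoring, software defect detection, etc.
Process:
- In most cases the input to static analysis is source code or intermediate code (such as Java bytecode); object code is used only rarely. The analysis results are output in a specific form.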
Static Code Analysis
Basic Blocks
Control Flow Graph
Dataflow Analysis
- Live Variable Analysis
- Reaching Definition Analysis
Lattice Theory
Basic Blocks
A basic block is a maximal sequence of consecutive three-address instructions with the following properties:
- The flow of control can only enter the basic block through its first instruction.
- Control leaves the block without halting or branching, except possibly at the last instruction.
Basic blocks become the nodes of a flow graph, whose edges indicate which blocks can follow which.
Basic Block Example
 1. i = 1
 2. j = 1
 3. t1 = 10 * i
 4. t2 = t1 + j
 5. t3 = 8 * t2
 6. t4 = t3 - 88
 7. a[t4] = 0.0
 8. j = j + 1
 9. if j <= 10 goto (3)
10. i = i + 1
11. if i <= 10 goto (2)
12. i = 1
13. t5 = i - 1
14. t6 = 88 * t5
15. a[t6] = 1.0
16. i = i + 1
17. if i <= 10 goto (13)
Leaders: 1, 2, 3, 10, 12, 13
Basic Blocks: A = (1), B = (2), C = (3)-(9), D = (10)-(11), E = (12), F = (13)-(17)
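The partition in the example can be reproduced mechanically. The sketch below assumes the usual textbook leader rules, which the slide does not spell out: the first instruction is a leader, every jump target is a leader, and every instruction that follows a jump is a leader. The names `CODE`, `find_leaders` and `basic_blocks` are illustrative only.

```python
import re

# The 17 three-address instructions of the example, numbered from 1.
CODE = [
    "i = 1",
    "j = 1",
    "t1 = 10 * i",
    "t2 = t1 + j",
    "t3 = 8 * t2",
    "t4 = t3 - 88",
    "a[t4] = 0.0",
    "j = j + 1",
    "if j <= 10 goto (3)",
    "i = i + 1",
    "if i <= 10 goto (2)",
    "i = 1",
    "t5 = i - 1",
    "t6 = 88 * t5",
    "a[t6] = 1.0",
    "i = i + 1",
    "if i <= 10 goto (13)",
]

GOTO = re.compile(r"goto \((\d+)\)")


def find_leaders(code):
    """Return the sorted 1-based numbers of all leader instructions."""
    leaders = {1}  # rule 1: the first instruction is a leader
    for number, instr in enumerate(code, start=1):
        match = GOTO.search(instr)
        if match:
            leaders.add(int(match.group(1)))  # rule 2: the jump target
            if number < len(code):
                leaders.add(number + 1)  # rule 3: instruction after a jump
    return sorted(leaders)


def basic_blocks(code):
    """Split the code into (first, last) instruction ranges, one per block."""
    leaders = find_leaders(code)
    # Each block runs from its leader up to just before the next leader.
    ends = [nxt - 1 for nxt in leaders[1:]] + [len(code)]
    return list(zip(leaders, ends))


print(find_leaders(CODE))  # [1, 2, 3, 10, 12, 13]
print(basic_blocks(CODE))  # [(1, 1), (2, 2), (3, 9), (10, 11), (12, 12), (13, 17)]
```

The six ranges correspond to the blocks labelled A to F in the example.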
Control-Flow Graphs
Control-flow graph:
- Node: an instruction or sequence of instructions (a basic block)
• Two instructions i, j are in the same basic block iff execution of i guarantees execution of j
- Directed edge: potential flow of control
- Distinguished Entry and Exit nodes
• They mark the first and last instruction in the program
Control-Flow Edges
Basic blocks = nodes
Edges:
- Add a directed edge from B1 to B2 if:
• there is a branch from the last statement of B1 to the first statement of B2 (B2 is a leader), or
• B2 immediately follows B1 in program order and B1 does not end with an unconditional branch (goto)
- Definition of predecessor and successor: if there is an edge from B1 to B2, then
• B1 is a predecessor of B2
• B2 is a successor of B1
CFG Example
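As a rough sketch of the two edge rules, the code below builds the successor and predecessor maps for the basic block example. The block names A to F, the `LAST` table describing each block's final instruction, and the explicit `Entry`/`Exit` nodes are assumptions made for this illustration.

```python
# Basic blocks of the running example as (first, last) instruction numbers.
BLOCKS = {"A": (1, 1), "B": (2, 2), "C": (3, 9),
          "D": (10, 11), "E": (12, 12), "F": (13, 17)}

# Last instruction of each block as (kind, jump target or None).
# kind is "fall" (no branch), "cond" (conditional branch) or "goto".
LAST = {"A": ("fall", None), "B": ("fall", None), "C": ("cond", 3),
        "D": ("cond", 2), "E": ("fall", None), "F": ("cond", 13)}


def build_cfg(blocks, last):
    """Return the successor map of the control-flow graph."""
    names = list(blocks)  # dict order is program order here
    by_leader = {first: name for name, (first, _) in blocks.items()}
    succ = {name: [] for name in names}
    succ["Entry"] = [names[0]]
    for idx, name in enumerate(names):
        kind, target = last[name]
        if target is not None:
            # Rule 1: branch from the last statement to a leader.
            succ[name].append(by_leader[target])
        if kind != "goto":
            # Rule 2: fall through to the next block in program order.
            nxt = names[idx + 1] if idx + 1 < len(names) else "Exit"
            succ[name].append(nxt)
    return succ


def predecessors(succ):
    """Invert the successor map."""
    pred = {}
    for src, targets in succ.items():
        for dst in targets:
            pred.setdefault(dst, []).append(src)
    return pred


succ = build_cfg(BLOCKS, LAST)
pred = predecessors(succ)
print(succ["C"])  # ['C', 'D']
print(succ["D"])  # ['B', 'E']
print(pred["B"])  # ['A', 'D']
```

Block C loops back to itself (the inner loop), D jumps back to B (the outer loop), and F either repeats or leaves through Exit.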
Dataflow Analysis
Compile-time reasoning about
- run-time values of variables or expressions
at different program points
- Which assignment statements produced the value of a variable at this point?
- Which variables contain values that are no longer used after this program point?
- What is the range of possible values of a variable at this program point?
- ……
Program Points
- One program point before each node
- One program point after each node
- Join point: a point with multiple predecessors
- Split point: a point with multiple successors
Live Variable Analysis
A variable v is live at point p if
- v is used along some path starting at p, and
- there is no definition of v along that path before the use.
When is a variable v dead at point p?
- There is no use of v on any path from p to the exit node, or
- every path from p redefines v before using v.
What Use is Liveness Information?
Register allocation.
- If a variable is dead, its register can be reassigned
Dead code elimination.
- Eliminate assignments to variables that are not read later.
- But must not eliminate the last assignment to a variable (such as an instance variable) that is visible outside the CFG.
- Can eliminate other dead assignments.
- Handle this by making all externally visible variables live on exit from the CFG
Conceptual Idea of Analysis
Start from the exit and go backwards in the CFG
Compute liveness information from the end to the beginning of basic blocks
Liveness Example
- Assume a, b, c are visible outside the method, so they are live on exit
- Assume x, y, z, t are not visible
- Represent liveness using a bit vector; the order is abcxyzt
Block                                 Live before   Live after
a = x+y; t = a; c = a+x; x == 0       0101110       1100111
b = t+z;                              1000111       1100100
c = y+1;                              1100100       1110000
The test x == 0 either enters b = t+z; or skips it; both paths reach c = y+1;, which leads to the exit.
Formalizing Analysis
Each basic block has
- IN: set of variables live at start of block
- OUT: set of variables live at end of block
- USE: set of variables with upwards exposed uses in block (use prior to definition)
- DEF: set of variables defined in block prior to use
USE[x = z; x = x+1;] = { z } (x not in USE)
DEF[x = z; x = x+1; y = 1;] = {x, y}
Compiler scans each basic block to derive USE and DEF sets
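A single left-to-right scan is enough to derive both sets. The sketch below is a simplification that only handles plain assignments of the form `x = expr` with scalar variables (no array targets, no branch conditions); the function name `use_def` is made up for this illustration.

```python
import re

IDENT = re.compile(r"[A-Za-z_]\w*")


def use_def(block):
    """Compute (USE, DEF) for a list of assignments written as 'x = expr'."""
    use, define = set(), set()
    for stmt in block:
        target, expr = (part.strip() for part in stmt.split("=", 1))
        for var in IDENT.findall(expr):
            if var not in define:
                use.add(var)  # read before any definition in this block
        if target not in use:
            define.add(target)  # defined before any use in this block
    return use, define


print(use_def(["x = z", "x = x+1"]))  # USE = {'z'}, DEF = {'x'}
use, define = use_def(["x = z", "x = x+1", "y = 1"])
print(sorted(use), sorted(define))  # ['z'] ['x', 'y']
```

Both calls agree with the USE and DEF examples given above.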
Algorithm
for all nodes n in N - { Exit }
    IN[n] = emptyset;
OUT[Exit] = emptyset;
IN[Exit] = USE[Exit];
Changed = N - { Exit };
while (Changed != emptyset)
    choose a node n in Changed;
    Changed = Changed - { n };
    OUT[n] = emptyset;
    for all nodes s in successors(n)
        OUT[n] = OUT[n] U IN[s];
    IN[n] = USE[n] U (OUT[n] - DEF[n]);
    if (IN[n] changed)
        for all nodes p in predecessors(n)
            Changed = Changed U { p };
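The worklist algorithm can be tried on the liveness example. The sketch below assumes a three-block CFG (named `B1`, `B2`, `B3` here for illustration) in which the test `x == 0` either enters `b = t+z;` or skips it, and it models the externally visible variables a, b, c as `USE[Exit]`. The USE and DEF sets are written out by hand.

```python
ORDER = "abcxyzt"  # bit-vector order used in the example

# Assumed CFG: B1 = {a = x+y; t = a; c = a+x; x == 0}, B2 = {b = t+z;},
# B3 = {c = y+1;}.
SUCC = {"B1": ["B2", "B3"], "B2": ["B3"], "B3": ["Exit"], "Exit": []}
USE = {"B1": {"x", "y"}, "B2": {"t", "z"}, "B3": {"y"},
       "Exit": {"a", "b", "c"}}
DEF = {"B1": {"a", "t", "c"}, "B2": {"b"}, "B3": {"c"}, "Exit": set()}


def liveness(succ, use, define):
    """Backward may-analysis: returns (IN, OUT) as dicts of variable sets."""
    pred = {n: [] for n in succ}
    for n, targets in succ.items():
        for s in targets:
            pred[s].append(n)
    live_in = {n: set() for n in succ}
    live_out = {n: set() for n in succ}
    live_in["Exit"] = set(use["Exit"])
    changed = [n for n in succ if n != "Exit"]
    while changed:
        n = changed.pop()
        # OUT[n] is the union of IN[s] over all successors s.
        live_out[n] = set().union(*(live_in[s] for s in succ[n]))
        new_in = use[n] | (live_out[n] - define[n])
        if new_in != live_in[n]:
            live_in[n] = new_in
            for p in pred[n]:
                if p not in changed:
                    changed.append(p)
    return live_in, live_out


def bits(variables):
    """Render a variable set as a bit vector in ORDER."""
    return "".join("1" if v in variables else "0" for v in ORDER)


live_in, live_out = liveness(SUCC, USE, DEF)
for block in ("B1", "B2", "B3"):
    print(block, bits(live_in[block]), bits(live_out[block]))
# B1 0101110 1100111
# B2 1000111 1100100
# B3 1100100 1110000
```

Under these assumptions the computed vectors match the ones shown in the liveness example.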
Reaching Definitions
Concept of definition and use
- a = x+y
• is a definition of a
• is a use of x and y
A definition reaches a use if the value written by the definition may be read by the use
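Reaching Definitions
Example CFG, one block per line:
s = 0; a = 4; i = 0; k == 0      (entry block; branches on k == 0)
b = 1;                           (one branch)
b = 2;                           (other branch)
i<n                              (loop test, reached from both branches and from the loop body)
s = s + a*b; i = i + 1;          (loop body, returns to the loop test)
return s                         (loop exit)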
Reaching Definitions and Constant Propagation
Is a use of a variable a constant?
- Check all reaching definitions
- If all assign the variable the same constant
- Then the use is in fact a constant
Can replace the variable with the constant
Is a Constant in s = s+a*b?
Yes! On all reaching definitions, a = 4
s = 0; a = 4; i = 0; k == 0
b = 1;
b = 2;
i<n
s = s + a*b; i = i + 1;
return s
Constant Propagation Transform
Yes! On all reaching definitions, a = 4, so the use of a is replaced by 4:
s = 0; a = 4; i = 0; k == 0
b = 1;
b = 2;
i<n
s = s + 4*b; i = i + 1;
return s
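The constant check can be sketched as a small function. It assumes the reaching definitions of the statement `s = s + a*b` are already known and given as `(variable, right-hand side)` pairs; the names `REACHING` and `constant_value` are hypothetical, and only integer literals count as constants here.

```python
# All seven definitions of the example reach the statement s = s + a*b.
REACHING = [
    ("s", "0"), ("a", "4"), ("i", "0"),
    ("b", "1"), ("b", "2"),
    ("s", "s + a*b"), ("i", "i + 1"),
]


def constant_value(var, reaching_defs):
    """Return the constant that every reaching definition of var assigns,
    or None if the definitions disagree or are not integer literals."""
    values = set()
    for target, rhs in reaching_defs:
        if target != var:
            continue
        try:
            values.add(int(rhs))
        except ValueError:
            return None  # right-hand side is not a literal
    return values.pop() if len(values) == 1 else None


print(constant_value("a", REACHING))  # 4
print(constant_value("b", REACHING))  # None (b = 1 and b = 2 both reach)
print(constant_value("s", REACHING))  # None (s = s + a*b is not a literal)
```

Only `a` qualifies, which is why the transform rewrites `a*b` to `4*b` and leaves `b` alone.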
Computing Reaching Definitions
Compute with sets of definitions
- represent sets using bit vectors
- each definition has a position in the bit vector
At each basic block, compute
- definitions that reach the start of the block
- definitions that reach the end of the block
Do the computation by simulating execution of the program until a fixed point is reached
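Definitions are numbered 1 to 7; the bit order is 1234567
Block                                   Reaches start              Reaches end
1: s = 0; 2: a = 4; 3: i = 0; k == 0    0000000                    1110000
4: b = 1;                               1110000                    1111000
5: b = 2;                               1110000                    1110100
i<n                                     1111100, then 1111111      1111100, then 1111111
6: s = s + a*b; 7: i = i + 1;           1111100, then 1111111      0101111
return s                                1111100, then 1111111      1111100, then 1111111
Where two values are given, the first is the result of the first pass and the second is the fixed point.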
Formalizing Reaching Definitions
Each basic block has
- IN: set of definitions that reach beginning of block
- OUT: set of definitions that reach end of block
- GEN: set of definitions generated in block
- KILL: set of definitions killed in block
GEN[s = s + a*b; i = i + 1;] = 0000011
KILL[s = s + a*b; i = i + 1;] = 1010000
Compiler scans each basic block to derive GEN and KILL sets
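A minimal sketch of the whole computation for the example follows. It uses the standard forward equations IN[b] = union of OUT[p] over predecessors p and OUT[b] = GEN[b] U (IN[b] - KILL[b]), which the slides imply but do not write out. The block names (`init`, `then`, `else`, `test`, `body`, `ret`) and the predecessor map are assumptions about the example CFG.

```python
# Definition number -> variable it assigns (1: s = 0 ... 7: i = i + 1).
DEFS = {1: "s", 2: "a", 3: "i", 4: "b", 5: "b", 6: "s", 7: "i"}

# Definitions contained in each block, and each block's predecessors.
BLOCK_DEFS = {"init": [1, 2, 3], "then": [4], "else": [5],
              "test": [], "body": [6, 7], "ret": []}
PRED = {"init": [], "then": ["init"], "else": ["init"],
        "test": ["then", "else", "body"], "body": ["test"], "ret": ["test"]}


def gen_kill(defs_in_block):
    """Derive (GEN, KILL) by scanning the definitions of one block."""
    gen, kill = set(), set()
    for d in defs_in_block:
        same_var = {o for o, v in DEFS.items() if v == DEFS[d] and o != d}
        gen -= same_var  # an earlier definition of the same variable dies
        gen.add(d)
        kill |= same_var
    return gen, kill


def reaching_definitions():
    """Iterate the forward equations until nothing changes."""
    gk = {b: gen_kill(ds) for b, ds in BLOCK_DEFS.items()}
    reach_in = {b: set() for b in PRED}
    reach_out = {b: set() for b in PRED}
    changed = True
    while changed:
        changed = False
        for b in PRED:
            reach_in[b] = set().union(*(reach_out[p] for p in PRED[b]))
            gen, kill = gk[b]
            new_out = gen | (reach_in[b] - kill)
            if new_out != reach_out[b]:
                reach_out[b] = new_out
                changed = True
    return reach_in, reach_out


def bits(defs):
    """Render a set of definition numbers as a bit vector (order 1234567)."""
    return "".join("1" if d in defs else "0" for d in range(1, 8))


gen, kill = gen_kill(BLOCK_DEFS["body"])
print(bits(gen), bits(kill))  # 0000011 1010000
reach_in, reach_out = reaching_definitions()
print(bits(reach_in["test"]), bits(reach_out["body"]))  # 1111111 0101111
```

The GEN and KILL vectors for the loop body agree with the values given above, and the loop test ends up with all seven definitions reaching it.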
Forwards vs. backwards
A forwards analysis is one that, for each program point, computes information about past behavior.
- Examples are available expressions and reaching definitions.
- Calculation: information flows in from the predecessors of each CFG node.
A backwards analysis is one that, for each program point, computes information about future behavior.
- Examples are liveness and very busy expressions.
- Calculation: information flows in from the successors of each CFG node.
May vs. Must
A may analysis is one that describes information that may possibly be true and, thus, computes an upper approximation.
- Examples are liveness and reaching definitions.
- Calculation: union operator.
A must analysis is one that describes information that must definitely be true and, thus, computes a lower approximation.
- Examples are available expressions and very busy expressions.
- Calculation: intersection operator.
Basic Idea
Information about the program is represented using values from an algebraic structure called a lattice
Analysis produces a lattice value for each program point
Two flavors of analysis
- Forward dataflow analysis
- Backward dataflow analysis
Partial Orders
Set P
Partial order ≤ such that ∀x,y,z∈P
- x ≤ x (reflexive)
- x ≤ y and y ≤ x implies x = y (antisymmetric)
- x ≤ y and y ≤ z implies x ≤ z (transitive)
Can use partial order to define
- Upper and lower bounds
- Least upper bound
- Greatest lower bound
Upper Bounds
If S ⊆ P then
- x∈P is an upper bound of S if ∀y∈S. y ≤ x
- x∈P is the least upper bound of S if
• x is an upper bound of S, and
• x ≤ y for all upper bounds y of S
- ∨: join, least upper bound (lub), supremum, sup
• ∨S is the least upper bound of S
• x ∨ y is the least upper bound of {x,y}
Lower Bounds
If S ⊆ P then
- x∈P is a lower bound of S if ∀y∈S. x ≤ y
- x∈P is the greatest lower bound of S if
• x is a lower bound of S, and
• y ≤ x for all lower bounds y of S
- ∧: meet, greatest lower bound (glb), infimum, inf
• ∧S is the greatest lower bound of S
• x ∧ y is the greatest lower bound of {x,y}
Covering
x < y if x ≤ y and x ≠ y
x is covered by y (y covers x) if
- x < y, and
- x ≤ z < y implies x = z
Conceptually, y covers x if there are no elements between x and y
Example
P = { 000, 001, 010, 011, 100, 101, 110, 111} (standard Boolean lattice, also called hypercube)
x ≤ y if (x bitwise and y) = x
Hasse Diagram (levels from top to bottom):
111
011 101 110
001 010 100
000
• If y covers x
• Line from y to x
• y above x in diagram
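The order, join, meet and covering relation of this Boolean lattice are small enough to compute directly. In the sketch below, elements are 3-character bit strings; the helper names (`leq`, `join`, `meet`, `covers`) are chosen for illustration.

```python
from itertools import product

# The eight elements 000 .. 111 as strings.
P = ["".join(bits) for bits in product("01", repeat=3)]


def leq(x, y):
    """x <= y iff (x bitwise-and y) == x."""
    return (int(x, 2) & int(y, 2)) == int(x, 2)


def join(x, y):
    """Least upper bound: bitwise or."""
    return format(int(x, 2) | int(y, 2), "03b")


def meet(x, y):
    """Greatest lower bound: bitwise and."""
    return format(int(x, 2) & int(y, 2), "03b")


def covers(y, x):
    """y covers x: x < y and no element lies strictly between them."""
    if x == y or not leq(x, y):
        return False
    return not any(z not in (x, y) and leq(x, z) and leq(z, y) for z in P)


# One Hasse-diagram line per covering pair (x below y).
hasse_edges = sorted((x, y) for x in P for y in P if covers(y, x))

print(join("001", "010"))  # 011
print(meet("011", "110"))  # 010
print(len(hasse_edges))    # 12, the edges of a cube
```

Each covering pair differs in exactly one bit, which is why the diagram is a hypercube.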
Lattices
If x ∧ y and x ∨ y exist for all x,y∈P, then P is a lattice.
If ∧S and ∨S exist for all S ⊆ P, then P is a complete lattice.
All finite lattices are complete
Example of a lattice that is not complete
- Integers I
- For any x, y∈I, x ∨ y = max(x,y), x ∧ y = min(x,y)
- But ∨I and ∧I do not exist
- I ∪ {+∞, −∞} is a complete lattice
Lattice Examples (figure: example lattices and non-lattices)
Semi-Lattice
Only one of the two binary operations (meet or join) exists
- Meet-semilattice: x ∧ y exists for all x,y∈P
- Join-semilattice: x ∨ y exists for all x,y∈P
Monotonic Function & Fixed point
Let L be a lattice. A function f : L → L is monotonic if
∀x, y ∈ L : x ≤ y ⇒ f(x) ≤ f(y)
Let A be a set, f : A → A a function, a ∈ A.
If f(a) = a, then a is called a fixed point of f on A
Existence of Fixed Points
• The height of a lattice is defined to be the length of the longest path from ⊥ to ⊤
• In a complete lattice L with finite height, every monotonic function f : L → L has a unique least fixed-point:
fix(f) = ∨ { f^i(⊥) | i ≥ 0 }
Knaster-Tarski Fixed Point Theorem
Suppose (L, ≤) is a complete lattice and f: L→L is a monotonic function.
Then the least fixed point m of f can be defined as
m = ∧ { x ∈ L | f(x) ≤ x }
Calculating Fixed Point
The time complexity of computing a fixed-point depends on three factors:
- The height of the lattice, since this provides a bound for the number of iterations i;
- The cost of computing f;
- The cost of testing equality.
The computation of a fixed-point can be illustrated as a walk up the lattice starting at ⊥:
⊥ ≤ f(⊥) ≤ f(f(⊥)) ≤ …
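The walk up the lattice can be written as a short loop. The sketch below uses a powerset lattice ordered by subset inclusion, with the empty set as ⊥, and a made-up reachability problem on a hypothetical four-node graph (`EDGES`); `step` is monotonic because it only ever adds elements.

```python
def least_fixed_point(f, bottom):
    """Iterate bottom, f(bottom), f(f(bottom)), ... until the value repeats.

    Returns the fixed point and the number of upward steps taken."""
    x, steps = bottom, 0
    while True:
        nxt = f(x)
        if nxt == x:  # the equality test is one of the three cost factors
            return x, steps
        x, steps = nxt, steps + 1


# Hypothetical graph: which nodes are reachable from 'a'?
EDGES = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}


def step(reached):
    """Monotonic function on subsets of the node set."""
    return frozenset({"a"}) | reached | {t for n in reached for t in EDGES[n]}


result, steps = least_fixed_point(step, frozenset())
print(sorted(result), steps)  # ['a', 'b', 'c'] 3
```

The lattice of subsets of four nodes has height 4, so the loop can take at most four upward steps; here it stops after three.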
Application to Dataflow Analysis
Dataflow information will be lattice values
- Transfer functions operate on lattice values
- Solution algorithm will generate an increasing sequence of values at each program point
- Ascending chain condition will ensure termination
Will use ∨ to combine values at control-flow join points
Transfer Functions
Transfer function f: P→P for each node in the control flow graph
f models the effect of the node on the program information
Transfer Functions
Each dataflow analysis problem has a set F of transfer functions f: P→P
- Identity function i∈F
- F must be closed under composition: ∀f,g∈F. the function h = λx.f(g(x)) ∈ F
- Each f∈F must be monotone: x ≤ y implies f(x) ≤ f(y)
- Sometimes all f∈F are distributive: f(x ∨ y) = f(x) ∨ f(y)
- Distributivity implies monotonicity
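These properties can be checked exhaustively on a small lattice. The sketch below takes 3-bit vectors ordered by bit inclusion with bitwise or as the join, and tests a gen/kill style transfer function f(x) = gen ∨ (x − kill). The function `saturate` is a made-up counterexample showing that a monotone function need not be distributive.

```python
from itertools import product

WIDTH = 3
MASK = (1 << WIDTH) - 1
VALUES = range(1 << WIDTH)  # bit vectors 000 .. 111 as integers


def make_transfer(gen, kill):
    """Gen/kill transfer function: f(x) = gen | (x with kill bits cleared)."""
    return lambda x: gen | (x & ~kill & MASK)


def leq(x, y):
    return (x & y) == x


def is_monotone(f):
    return all(leq(f(x), f(y))
               for x, y in product(VALUES, repeat=2) if leq(x, y))


def is_distributive(f):
    return all(f(x | y) == (f(x) | f(y))
               for x, y in product(VALUES, repeat=2))


def saturate(x):
    """Jump to 111 once two or more bits are set; otherwise leave x alone."""
    return MASK if bin(x).count("1") >= 2 else x


f = make_transfer(gen=0b011, kill=0b101)
print(is_monotone(f), is_distributive(f))                # True True
print(is_monotone(saturate), is_distributive(saturate))  # True False
```

For `saturate`, f(001 ∨ 010) = 111 while f(001) ∨ f(010) = 011, so it is monotone but not distributive.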
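Course Assessment
Assignment (submitted to the course platform http://sase.seforge.org/ and demonstrated) + course report
Assignment topics:
- Extracting code comments and generating documentation
- Code statistics: total lines, lines of code, number of classes, number of methods, method length, etc.
- Automatic conversion of LaTeX documents to PDF
- Online code diff
- Converting an executable Jar into an exe program with a specific icon
- Detecting various code defects: memory leaks, null pointer exceptions
- Automatic test case generation
- Defect analysis for scripts: JavaScript, Python, Ruby, PHP ……
- C# code defect analysis
- Online compression, decompression, encryption, decryption
- ……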
Questions?
Thank you!