Transcript: Static Analysis
Static Code Analysis
梁广泰
2011-05-25
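Outline
Motivation
Static program analysis (concepts + examples)
Program defect analysis (research work)
Motivation
Characteristics of cloud platforms
Applications are deployed directly on cloud servers, which creates security risks
• They can directly manipulate and damage the server file system
• Security vulnerabilities in an application can give attackers a way in
Resources are shared and dynamically allocated
• A single poorly performing application takes resources away from other applications
One solution:
Run static code analysis on an application before deploying it:
• Does it make prohibited calls? (illegal file access)
• Does it contain inefficient code? (heavy String concatenation without
StringBuilder)
• Does it contain security vulnerabilities? (SQL injection, cross-site attacks, denial of service)
• Does it contain malicious viruses?
• ……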
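Static Code Analysis
Definition:
Static program analysis is a technique for analyzing a program without
executing it; it is called static analysis for short.
Comparison:
Dynamic program analysis: requires actually executing the program
Program comprehension: the term "static analysis" usually describes analysis done by
automated tools, whereas manual analysis is usually called program comprehension
Uses:
Program translation/compilation (compilers), program optimization and refactoring, software defect detection, etc.
Process:
In most cases the input to static analysis is source code or intermediate code (such as
Java bytecode); object code is used only in rare cases. The analysis results are
output in a specific form.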
Static Code Analysis
Basic Blocks
Control Flow Graph
Dataflow Analysis
Live Variable Analysis
Reaching Definition Analysis
Lattice Theory
Basic Blocks
A basic block is a maximal sequence of
consecutive three-address instructions with the
following properties:
The flow of control can only enter the basic block through the first
instruction.
Control will leave the block without halting or branching,
except possibly at the last instruction.
Basic blocks become the nodes of a flow graph,
with edges indicating the order.
Basic Block Example
1. i = 1
2. j = 1
3. t1 = 10 * i
4. t2 = t1 + j
5. t3 = 8 * t2
6. t4 = t3 - 88
7. a[t4] = 0.0
8. j = j + 1
9. if j <= 10 goto (3)
10. i = i + 1
11. if i <= 10 goto (2)
12. i = 1
13. t5 = i - 1
14. t6 = 88 * t5
15. a[t6] = 1.0
16. i = i + 1
17. if i <= 10 goto (13)
Leaders: instructions 1, 2, 3, 10, 12, and 13
Basic Blocks: A = 1, B = 2, C = 3-9, D = 10-11, E = 12, F = 13-17
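The slide marks the leaders without spelling out how they are found. The sketch below assumes the usual textbook rule: the first instruction is a leader, every jump target is a leader, and every instruction that immediately follows a jump is a leader. The helper names `find_leaders` and `basic_blocks` are illustrative only, and jumps are recognized purely by the `goto (N)` notation used in this example.

```python
import re

def find_leaders(instrs):
    """Return the sorted 1-based numbers of all leader instructions."""
    leaders = {1}                                # rule 1: first instruction
    for number, text in enumerate(instrs, start=1):
        jump = re.search(r"goto \((\d+)\)", text)
        if jump:
            leaders.add(int(jump.group(1)))      # rule 2: target of a jump
            if number < len(instrs):
                leaders.add(number + 1)          # rule 3: instruction after a jump
    return sorted(leaders)

def basic_blocks(instrs):
    """Return (first, last) instruction numbers of each basic block."""
    leaders = find_leaders(instrs)
    lasts = [nxt - 1 for nxt in leaders[1:]] + [len(instrs)]
    return list(zip(leaders, lasts))

# The 17 three-address instructions of the example, in order.
PROGRAM = [
    "i = 1", "j = 1", "t1 = 10 * i", "t2 = t1 + j", "t3 = 8 * t2",
    "t4 = t3 - 88", "a[t4] = 0.0", "j = j + 1", "if j <= 10 goto (3)",
    "i = i + 1", "if i <= 10 goto (2)", "i = 1", "t5 = i - 1",
    "t6 = 88 * t5", "a[t6] = 1.0", "i = i + 1", "if i <= 10 goto (13)",
]

LEADERS = find_leaders(PROGRAM)
BLOCKS = basic_blocks(PROGRAM)
print(LEADERS)
print(BLOCKS)
```

Each block runs from one leader up to the instruction just before the next leader, which is why the block list follows directly from the leader list.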
Control-Flow Graphs
Control-flow graph:
Node: an instruction or sequence of instructions (a basic block)
• Two instructions i, j in same basic block
iff execution of i guarantees execution of j
Directed edge: potential flow of control
Distinguished Entry and Exit nodes
• First and last instruction in program
Control-Flow Edges
Basic blocks = nodes
Edges:
Add directed edge between B1 and B2 if:
• Branch from last statement of B1 to first statement of B2 (B2
is a leader), or
• B2 immediately follows B1 in program order and B1 does not
end with unconditional branch (goto)
Definition of predecessor and successor
• B1 is a predecessor of B2
• B2 is a successor of B1
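The two edge rules can be sketched in a few lines. This is only an illustration: it assumes the six blocks of the 17-instruction example are named A to F in program order, that only the last instruction of each block is needed, and it leaves out the Entry and Exit nodes. `cfg_edges` is a hypothetical helper name.

```python
import re

# (first, last) instruction numbers of each block, in program order.
BLOCKS = {"A": (1, 1), "B": (2, 2), "C": (3, 9),
          "D": (10, 11), "E": (12, 12), "F": (13, 17)}
# Text of the last instruction of each block.
LAST_INSTR = {"A": "i = 1", "B": "j = 1", "C": "if j <= 10 goto (3)",
              "D": "if i <= 10 goto (2)", "E": "i = 1",
              "F": "if i <= 10 goto (13)"}

def cfg_edges(blocks, last_instr):
    """Return the set of (B1, B2) control-flow edges between basic blocks."""
    names = list(blocks)
    block_starting_at = {first: name for name, (first, _) in blocks.items()}
    edges = set()
    for position, name in enumerate(names):
        text = last_instr[name]
        jump = re.search(r"goto \((\d+)\)", text)
        if jump:
            # Rule 1: branch from the last statement of B1 to the leader of B2.
            edges.add((name, block_starting_at[int(jump.group(1))]))
        ends_with_goto = text.startswith("goto")   # unconditional branch
        if not ends_with_goto and position + 1 < len(names):
            # Rule 2: B2 immediately follows B1 and B1 can fall through.
            edges.add((name, names[position + 1]))
    return edges

EDGES = cfg_edges(BLOCKS, LAST_INSTR)
SUCCESSORS = {n: sorted(b for a, b in EDGES if a == n) for n in BLOCKS}
PREDECESSORS = {n: sorted(a for a, b in EDGES if b == n) for n in BLOCKS}
print(sorted(EDGES))
```

Blocks C and F end up with self-loops because their conditional jumps target their own leaders.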
CFG Example
Dataflow Analysis
Compile-Time Reasoning About
Run-Time Values of Variables or Expressions
At Different Program Points
Which assignment statements produced value of variable at
this point?
Which variables contain values that are no longer used after
this program point?
What is the range of possible values of variable at this
program point?
……
Program Points
One program point before each node
One program point after each node
Join point – point with multiple predecessors
Split point – point with multiple successors
Live Variable Analysis
A variable v is live at point p if
v is used along some path starting at p, and
there is no definition of v along that path before the use.
When is a variable v dead at point p?
There is no use of v on any path from p to the exit node, or
all paths from p redefine v before using v.
What Use is Liveness Information?
Register allocation.
If a variable is dead, can reassign its register
Dead code elimination.
Eliminate assignments to variables not read later.
But must not eliminate last assignment to variable
(such as instance variable) visible outside CFG.
Can eliminate other dead assignments.
Handle by making all externally visible variables
live on exit from CFG
Conceptual Idea of Analysis
start from exit and go backwards in CFG
Compute liveness information from end to
beginning of basic blocks
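Liveness Example
Assume a, b, c are visible outside the method, so they are live on exit.
Assume x, y, z, t are not visible.
Represent liveness using a bit vector; bit order is abcxyzt.
Block 1: a = x+y; t = a; c = a+x; branch on x == 0
  live before: 0101110, live after: 1100111
Block 2 (on one branch only): b = t+z;
  live before: 1000111, live after: 1100100
Block 3 (join of both branches): c = y+1;
  live before: 1100100, live after: 1110000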
Formalizing Analysis
Each basic block has
IN - set of variables live at start of block
OUT - set of variables live at end of block
USE - set of variables with upwards exposed
uses in block (use prior to definition)
DEF - set of variables defined in block prior
to use
USE[x = z; x = x+1;] = { z } (x not in USE)
DEF[x = z; x = x+1; y = 1;] = {x, y}
Compiler scans each basic block to derive USE
and DEF sets
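A minimal sketch of that scan follows. It assumes every statement is a simple assignment written as `lhs = rhs` and that every identifier on the right-hand side is a variable; `use_def` is an illustrative name, not part of any tool mentioned here.

```python
import re

def use_def(statements):
    """Scan one basic block and return its (USE, DEF) sets.

    USE holds variables read before any definition in the block;
    DEF holds variables defined before any use in the block.
    """
    use, defs = set(), set()
    for stmt in statements:
        lhs, rhs = [part.strip() for part in stmt.split("=", 1)]
        for var in re.findall(r"[A-Za-z_]\w*", rhs):
            if var not in defs:
                use.add(var)        # upwards exposed use
        if lhs not in use:
            defs.add(lhs)           # defined prior to any use
    return use, defs

print(use_def(["x = z", "x = x+1"]))
print(use_def(["x = z", "x = x+1", "y = 1"]))
```

In the first block, the read of x in `x = x+1` is not upwards exposed because `x = z` defines x earlier in the same block, so only z lands in USE.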
Algorithm
for all nodes n in N - { Exit }
    IN[n] = emptyset;
OUT[Exit] = emptyset;
IN[Exit] = USE[Exit];
Changed = N - { Exit };
while (Changed != emptyset)
    choose a node n in Changed;
    Changed = Changed - { n };
    OUT[n] = emptyset;
    for all nodes s in successors(n)
        OUT[n] = OUT[n] U IN[s];
    IN[n] = USE[n] U (OUT[n] - DEF[n]);
    if (IN[n] changed)
        for all nodes p in predecessors(n)
            Changed = Changed U { p };
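The worklist algorithm can be tried on the liveness example. The sketch below makes some assumptions: the three blocks are named B1 to B3 (hypothetical names), the USE and DEF sets are written out by hand, and the externally visible variables a, b, c are modeled as USE[Exit], so they are live on exit.

```python
def liveness(succ, use, defs, exit_node):
    """Backward worklist liveness; returns (IN, OUT) as dicts of sets."""
    pred = {n: [] for n in succ}
    for n in succ:
        for s in succ[n]:
            pred[s].append(n)
    live_in = {n: set() for n in succ}
    live_out = {n: set() for n in succ}
    live_in[exit_node] = set(use[exit_node])
    changed = [n for n in succ if n != exit_node]
    while changed:
        n = changed.pop()
        live_out[n] = set()
        for s in succ[n]:
            live_out[n] |= live_in[s]           # union over successors
        new_in = use[n] | (live_out[n] - defs[n])
        if new_in != live_in[n]:
            live_in[n] = new_in
            for p in pred[n]:
                if p not in changed:
                    changed.append(p)           # predecessors must be revisited
    return live_in, live_out

SUCC = {"B1": ["B2", "B3"], "B2": ["B3"], "B3": ["Exit"], "Exit": []}
USE = {"B1": {"x", "y"}, "B2": {"t", "z"}, "B3": {"y"}, "Exit": {"a", "b", "c"}}
DEF = {"B1": {"a", "t", "c"}, "B2": {"b"}, "B3": {"c"}, "Exit": set()}

def bits(variables):
    """Render a set of variables as a bit vector in the order abcxyzt."""
    return "".join("1" if v in variables else "0" for v in "abcxyzt")

LIVE_IN, LIVE_OUT = liveness(SUCC, USE, DEF, "Exit")
for block in ("B1", "B2", "B3"):
    print(block, bits(LIVE_IN[block]), bits(LIVE_OUT[block]))
```

The printed vectors can be compared with the bit vectors annotated on the example.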
Reaching Definitions
Concept of definition and use
a = x+y
is a definition of a
is a use of x and y
A definition reaches a use if
value written by definition may be read by use
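Reaching Definitions
Example CFG:
Entry block: s = 0; a = 4; i = 0; branch on k == 0
One branch: b = 1;
Other branch: b = 2;
Loop test (join of both branches and of the loop back edge): i<n
Loop body: s = s + a*b; i = i + 1; then back to the loop test
After the loop: return s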
Reaching Definitions and Constant
Propagation
Is a use of a variable a constant?
Check all reaching definitions
If all assign variable to same constant
Then use is in fact a constant
Can replace variable with constant
Is the variable a a constant in s = s + a*b?
Yes! On all reaching definitions, a = 4.
Constant Propagation Transform
Since a = 4 on all reaching definitions, replace a with 4 in the loop body:
s = s + a*b; becomes s = s + 4*b;
The rest of the CFG is unchanged.
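The constant check can be sketched as follows. The sketch assumes the seven definitions of the example are numbered 1 to 7 in program order and that all seven reach the loop body, and it records `None` for a definition whose right-hand side is not a literal constant. `constant_value` and `propagate` are illustrative names.

```python
import re

# definition number -> (variable, constant assigned, or None if not a constant)
DEFS = {1: ("s", 0), 2: ("a", 4), 3: ("i", 0), 4: ("b", 1),
        5: ("b", 2), 6: ("s", None), 7: ("i", None)}
REACHING_LOOP_BODY = {1, 2, 3, 4, 5, 6, 7}   # assumed reaching set

def constant_value(var, reaching, defs):
    """Return the constant if every reaching definition of var assigns it."""
    values = {defs[d][1] for d in reaching if defs[d][0] == var}
    if len(values) == 1:
        (value,) = values
        return value            # still None if the single value is non-constant
    return None

def propagate(stmt, variables, reaching, defs):
    """Replace each variable that is provably constant by its value."""
    for var in variables:
        value = constant_value(var, reaching, defs)
        if value is not None:
            stmt = re.sub(r"\b%s\b" % re.escape(var), str(value), stmt)
    return stmt

print(propagate("s = s + a*b", ["a", "b"], REACHING_LOOP_BODY, DEFS))
```

Only a is replaced: b has two reaching definitions with different constants (1 and 2), so it is left alone.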
Computing Reaching Definitions
Compute with sets of definitions
represent sets using bit vectors
each definition has a position in bit vector
At each basic block, compute
definitions that reach start of block
definitions that reach end of block
Do computation by simulating execution of
program until reach fixed point
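Bit order is 1234567 (one bit per definition).
Block (1: s = 0; 2: a = 4; 3: i = 0; k == 0): IN = 0000000, OUT = 1110000
Block (4: b = 1;): IN = 1110000, OUT = 1111000
Block (5: b = 2;): IN = 1110000, OUT = 1110100
Block (i<n): IN = OUT = 1111100 on the first pass, 1111111 at the fixed point
Block (6: s = s + a*b; 7: i = i + 1;): IN = 1111100 on the first pass, 1111111 at the fixed point; OUT = 0101111
Block (return s): IN = OUT = 1111100 on the first pass, 1111111 at the fixed point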
Formalizing Reaching Definitions
Each basic block has
IN - set of definitions that reach beginning of block
OUT - set of definitions that reach end of block
GEN - set of definitions generated in block
KILL - set of definitions killed in block
GEN[s = s + a*b; i = i + 1;] = 0000011
KILL[s = s + a*b; i = i + 1;] = 1010000
Compiler scans each basic block to derive GEN
and KILL sets
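A minimal sketch of the whole computation on the example follows, using Python sets of definition numbers in place of bit vectors. The block names B1 to B6 and the helper names are assumptions made for illustration; KILL is derived here as "every other definition of a variable that the block defines".

```python
VAR_OF = {1: "s", 2: "a", 3: "i", 4: "b", 5: "b", 6: "s", 7: "i"}
GEN = {"B1": {1, 2, 3}, "B2": {4}, "B3": {5},
       "B4": set(), "B5": {6, 7}, "B6": set()}
# B4 is the loop test i<n, B5 the loop body, B6 the return.
PRED = {"B1": [], "B2": ["B1"], "B3": ["B1"],
        "B4": ["B2", "B3", "B5"], "B5": ["B4"], "B6": ["B4"]}

def kill_set(gen):
    """Definitions of the same variables that are not generated by the block."""
    variables = {VAR_OF[d] for d in gen}
    return {d for d in VAR_OF if d not in gen and VAR_OF[d] in variables}

KILL = {block: kill_set(gen) for block, gen in GEN.items()}

def reaching_definitions(pred, gen, kill):
    """Forward iteration to a fixed point; returns (IN, OUT)."""
    rd_in = {b: set() for b in pred}
    rd_out = {b: set() for b in pred}
    changed = True
    while changed:
        changed = False
        for b in pred:
            rd_in[b] = set().union(*(rd_out[p] for p in pred[b]))
            new_out = gen[b] | (rd_in[b] - kill[b])
            if new_out != rd_out[b]:
                rd_out[b] = new_out
                changed = True
    return rd_in, rd_out

def bits(definitions):
    """Render a set of definition numbers as a bit vector in order 1234567."""
    return "".join("1" if d in definitions else "0" for d in range(1, 8))

RD_IN, RD_OUT = reaching_definitions(PRED, GEN, KILL)
for block in PRED:
    print(block, bits(RD_IN[block]), bits(RD_OUT[block]))
```

For the loop body, the derived sets agree with GEN = 0000011 and KILL = 1010000 given above.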
Forwards vs. backwards
A forwards analysis is one that for each program
point computes information about the past
behavior.
Examples of this are available expressions and
reaching definitions.
Calculation: predecessors of CFG nodes.
A backwards analysis is one that for each
program point computes information about the
future behavior.
Examples of this are liveness and very busy
expressions.
Calculation: successors of CFG nodes.
May vs. Must
A may analysis is one that describes information
that may possibly be true and, thus, computes
an upper approximation.
Examples of this are liveness and reaching
definitions.
Calculation: union operator.
A must analysis is one that describes
information that must definitely be true and,
thus, computes a lower approximation.
Examples of this are available expressions and very
busy expressions.
Calculation: intersection operator.
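The two distinctions can be combined in one small solver. The sketch below is an illustration under simplifying assumptions: every transfer function has the gen/kill form, facts are plain Python sets, and boundary nodes (no incoming flow) start with the empty set. The function `solve`, its parameters, and the two tiny graphs are all hypothetical.

```python
def solve(nodes, edges, gen, kill, forward, may, universe):
    """Generic gen/kill dataflow solver.

    Returns (incoming, outgoing): the facts on the side where flow arrives
    at a node, and the facts after applying the node's transfer function.
    For a backward analysis, 'incoming' is the end of the node.
    """
    if not forward:
        edges = [(b, a) for a, b in edges]        # follow successors instead
    sources = {n: [a for a, b in edges if b == n] for n in nodes}
    start = set() if may else set(universe)       # must analyses start at top
    outgoing = {n: set(start) for n in nodes}
    incoming = {n: set() for n in nodes}
    changed = True
    while changed:
        changed = False
        for n in nodes:
            ins = [outgoing[s] for s in sources[n]]
            if not ins:
                incoming[n] = set()               # boundary node
            elif may:
                incoming[n] = set().union(*ins)   # may: union
            else:
                incoming[n] = set.intersection(*ins)  # must: intersection
            new_out = gen[n] | (incoming[n] - kill[n])
            if new_out != outgoing[n]:
                outgoing[n] = new_out
                changed = True
    return incoming, outgoing

# Diamond CFG: n1 computes x+y, n3 overwrites x (killing x+y), n4 is the join.
NODES = ["n1", "n2", "n3", "n4"]
EDGES = [("n1", "n2"), ("n1", "n3"), ("n2", "n4"), ("n3", "n4")]
GEN = {"n1": {"x+y"}, "n2": set(), "n3": set(), "n4": set()}
KILL = {"n1": set(), "n2": set(), "n3": {"x+y"}, "n4": set()}

MUST_IN, _ = solve(NODES, EDGES, GEN, KILL, True, False, {"x+y"})
MAY_IN, _ = solve(NODES, EDGES, GEN, KILL, True, True, {"x+y"})

# Backward may analysis (liveness) on n1: x = 1; n2: y = x, with gen = USE, kill = DEF.
USE = {"n1": set(), "n2": {"x"}}
DEF = {"n1": {"x"}, "n2": {"y"}}
LIVE_AT_END, LIVE_AT_START = solve(["n1", "n2"], [("n1", "n2")],
                                   USE, DEF, False, True, {"x", "y"})
print(MUST_IN["n4"], MAY_IN["n4"], LIVE_AT_END["n1"])
```

At the join n4, the must analysis keeps nothing because x+y survives on only one path, while the may analysis keeps it because it survives on at least one path.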
Basic Idea
Information about program represented using
values from algebraic structure called lattice
Analysis produces lattice value for each program
point
Two flavors of analysis
Forward dataflow analysis
Backward dataflow analysis
Partial Orders
Set P
Partial order ⊑ such that ∀x,y,z ∈ P
x ⊑ x (reflexive)
x ⊑ y and y ⊑ x implies x = y (antisymmetric)
x ⊑ y and y ⊑ z implies x ⊑ z (transitive)
Can use partial order to define
Upper and lower bounds
Least upper bound
Greatest lower bound
Upper Bounds
If S ⊆ P then
x ∈ P is an upper bound of S if ∀y ∈ S. y ⊑ x
x ∈ P is the least upper bound of S if
• x is an upper bound of S, and
• x ⊑ y for all upper bounds y of S
⊔ - join, least upper bound (lub), supremum, sup
• ⊔S is the least upper bound of S
• x ⊔ y is the least upper bound of {x,y}
Lower Bounds
If S ⊆ P then
x ∈ P is a lower bound of S if ∀y ∈ S. x ⊑ y
x ∈ P is the greatest lower bound of S if
• x is a lower bound of S, and
• y ⊑ x for all lower bounds y of S
⊓ - meet, greatest lower bound (glb), infimum, inf
• ⊓S is the greatest lower bound of S
• x ⊓ y is the greatest lower bound of {x,y}
Covering
x ⊏ y if x ⊑ y and x ≠ y
x is covered by y (y covers x) if
x ⊏ y, and
x ⊑ z ⊏ y implies x = z
Conceptually, y covers x if there are no elements
between x and y
Example
P = { 000, 001, 010, 011, 100, 101, 110, 111}
(standard Boolean lattice, also called hypercube)
x ⊑ y if (x bitwise and y) = x
Hasse Diagram
• If y covers x
• Line from y to x
• y above x in diagram
Levels of the diagram, top to bottom:
111
011 101 110
001 010 100
000
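The Boolean lattice example is small enough to check by brute force. The sketch below assumes the three-bit elements are stored as the integers 0 to 7, and that join and meet are bitwise or and bitwise and (a standard fact for this lattice, not stated on the slide). All function names are illustrative.

```python
P = list(range(8))                       # 000 .. 111 as integers

def leq(x, y):
    """x is below y iff (x bitwise and y) = x."""
    return (x & y) == x

def join(x, y):
    return x | y                         # least upper bound

def meet(x, y):
    return x & y                         # greatest lower bound

def covers(y, x):
    """True if y covers x: x strictly below y with no element in between."""
    if x == y or not leq(x, y):
        return False
    return not any(z not in (x, y) and leq(x, z) and leq(z, y) for z in P)

def is_least_upper_bound(candidate, x, y):
    """Check the definition directly against every element of P."""
    upper_bounds = [u for u in P if leq(x, u) and leq(y, u)]
    return candidate in upper_bounds and all(leq(candidate, u) for u in upper_bounds)

# Lines of the Hasse diagram, written as (upper, lower) pairs of bit strings.
HASSE_EDGES = sorted((format(y, "03b"), format(x, "03b"))
                     for y in P for x in P if covers(y, x))
print(HASSE_EDGES)
```

Each line of the diagram flips exactly one bit, which is why the picture is a cube.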
Lattices
If x ⊔ y and x ⊓ y exist for all x,y ∈ P,
then P is a lattice.
If ⊔S and ⊓S exist for all S ⊆ P,
then P is a complete lattice.
All finite lattices are complete
Example of a lattice that is not complete
Integers I
For any x, y ∈ I, x ⊔ y = max(x,y), x ⊓ y = min(x,y)
But ⊔I and ⊓I do not exist
I ∪ {+∞, −∞} is a complete lattice
Lattice Examples (figure not recovered: example diagrams of lattices and non-lattices)
Semi-Lattice
Only one of the two binary operations (meet or
join) exists
Meet-semilattice
If x ⊓ y exists for all x,y ∈ P
Join-semilattice
If x ⊔ y exists for all x,y ∈ P
Monotonic Function & Fixed point
Let L be a lattice. A function f : L → L is
monotonic if
∀x, y ∈ L : x ⊑ y ⇒ f(x) ⊑ f(y)
Let A be a set, f : A → A a function, a ∈A .
If f (a) = a, then a is called a fixed point of f
on A
Existence of Fixed Points
• The height of a lattice is defined to be the length
of the longest path from ⊥ to ⊤
• In a complete lattice L with finite height, every
monotonic function f : L → L has a unique least
fixed-point:
$$\mathrm{fix}(f) = \bigsqcup_{i \ge 0} f^{i}(\bot)$$
Knaster-Tarski
Fixed Point Theorem
Suppose (L, ⊑) is a complete lattice and f: L → L is a
monotonic function.
Then the least fixed point m of f can be defined as
$$m = \bigsqcap \{\, x \in L \mid f(x) \sqsubseteq x \,\}$$
Calculating Fixed Point
The time complexity of computing a fixed-point
depends on three factors:
The height of the lattice, since this provides a bound for i;
The cost of computing f;
The cost of testing equality.
The computation of a fixed-point
can be illustrated as a walk up
the lattice starting at ⊥:
⊥ ⊑ f(⊥) ⊑ f(f(⊥)) ⊑ … until two consecutive values are equal
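The walk up the lattice is easy to sketch. The example below assumes the lattice of subsets of {0, 1, 2, 3, 4} ordered by inclusion (height 5, bottom is the empty set) and a made-up monotonic function `f`; neither comes from the slides.

```python
def least_fixed_point(f, bottom):
    """Iterate bottom, f(bottom), f(f(bottom)), ... until nothing changes.

    Returns the fixed point and the number of upward steps taken.
    Termination relies on f being monotonic and the lattice having finite height.
    """
    current = bottom
    steps = 0
    while True:
        following = f(current)           # cost of computing f
        if following == current:         # cost of testing equality
            return current, steps
        current = following
        steps += 1

def f(s):
    """Hypothetical monotonic function: add 0 and the successor of each element."""
    return frozenset({0}) | frozenset(n + 1 for n in s if n + 1 <= 4)

FIXED_POINT, STEPS = least_fixed_point(f, frozenset())
print(sorted(FIXED_POINT), STEPS)
```

The iteration takes five upward steps here, which matches the height of the lattice and shows why the height bounds the number of iterations.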
Application to Dataflow Analysis
Dataflow information will be lattice values
Transfer functions operate on lattice values
Solution algorithm will generate increasing sequence of values at
each program point
Ascending chain condition will ensure termination
Will use ⊔ to combine values at control-flow join
points
Transfer Functions
Transfer function f: P → P for each node in
control flow graph
f models effect of the node on the program
information
Transfer Functions
Each dataflow analysis problem has a set F of
transfer functions f: P → P
Identity function i ∈ F
F must be closed under composition:
∀f,g ∈ F. the function h = λx.f(g(x)) ∈ F
Each f ∈ F must be monotone:
x ⊑ y implies f(x) ⊑ f(y)
Sometimes all f ∈ F are distributive:
f(x ⊔ y) = f(x) ⊔ f(y)
Distributivity implies monotonicity
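These properties can be checked by brute force on a small lattice. The sketch below assumes the lattice of subsets of {1, 2, 3} with ⊑ as set inclusion and ⊔ as union. `gen_kill` is a typical gen/kill transfer function, and `needs_two` is a made-up function included only to show something that is monotone but not distributive.

```python
from itertools import combinations

def powerset(universe):
    """All subsets of universe as frozensets."""
    items = sorted(universe)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

ELEMS = powerset({1, 2, 3})

def is_monotone(f, elems):
    """x below y implies f(x) below f(y), with 'below' as subset."""
    return all(f(x) <= f(y) for x in elems for y in elems if x <= y)

def is_distributive(f, elems):
    """f(x join y) = f(x) join f(y), with join as union."""
    return all(f(x | y) == f(x) | f(y) for x in elems for y in elems)

def gen_kill(x):
    """Transfer function with GEN = {3} and KILL = {1}."""
    return frozenset({3}) | (x - frozenset({1}))

def needs_two(x):
    """Hypothetical function: keeps x only if it has at least two elements."""
    return x if len(x) >= 2 else frozenset()

def identity(x):
    return x

def compose(f, g):
    return lambda x: f(g(x))

print(is_monotone(gen_kill, ELEMS), is_distributive(gen_kill, ELEMS))
print(is_monotone(needs_two, ELEMS), is_distributive(needs_two, ELEMS))
```

`needs_two` fails distributivity on x = {1}, y = {2}: applying it to the union gives {1, 2}, while applying it to each side first gives the empty set.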
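Course Assessment
Assignment (submitted to the course platform http://sase.seforge.org/
and demonstrated) + course report
Assignment topics:
Code comment extraction and documentation generation
Code statistics: total lines, lines of code, number of classes, number of methods, method length, etc.
Automatic conversion of LaTeX documents to PDF
Online code diff
Converting an executable Jar into an exe program with a specific icon
Detection of various code defects: memory leaks, null pointer exceptions
Automatic test case generation
Script defect analysis: JavaScript, Python, Ruby, PHP ……
C# code defect analysis
Online compression, decompression, encryption, and decryption
……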
Questions?
Thank you!