Taint-Enhanced Policy Enforcement

RAMSES (Regeneration And iMmunity SErviceS):
A Cognitive Immune System
Self Regenerative Systems
18 December 2007
Mark Cornwell
James Just
Nathan Li
Robert Schrag
Global InfoTek, Inc
12/18/06
R. Sekar
Stony Brook University
Outline

Overview
Efficient content-based taint identification
Syntax and taint-aware policies
Memory attack detection and response
Testing
Red Team suggestions
Questions

Demo
RAMSES Attack Context
• Attack target: “program” mediating access to protected resources/services
• Attack approach: use maliciously crafted input to exert unintended control over protected resource operations
• Resource or service uses:
  - Well-defined APIs to access
    - OS resources
    - Command interpreters
    - Database servers
    - Transaction servers, ……
  - Internal interfaces
    - Data structures and functions within program
    - Used by program components to talk to each other
[Diagram: Incoming requests (Untrusted input) → Program → Outgoing requests (Security-sensitive operations)]
Example 1: SquirrelMail Command Injection
Input Interface:
    sendto="nobody; rm -rf *"
Program:
    $send_to_list = $_GET['sendto'];
    $command = "gpg -r $send_to_list 2>&1";
    popen($command);
“Output” Interface:
    $command = "gpg -r nobody; rm -rf * 2>&1"
    popen($command)
Attack: Removes all removable files in web server document tree
Example 2: phpBB SQL Injection
Input Interface:
    topic="-1 UNION SELECT ord(substring(user_password,1,1)) FROM phpbb_users WHERE user_id = 3"
Program:
    $topic_id = $_GET['topic'];
    $sql = "SELECT p.post_id FROM POSTS_TABLE WHERE p.topic_id = $topic_id";
    sql_query($sql);
“Output” Interface:
    $sql = "SELECT p.post_id FROM POSTS_TABLE WHERE p.topic_id = -1 UNION SELECT ord(substring(user_password,1,1)) FROM phpbb_users WHERE user_id = 3"
    sql_query($sql)
Attack: Steal another user’s password
Attack Space of Interest (CVE 2006)
[Pie chart: share of CVE 2006 vulnerabilities by class; one group of slices is labeled “Generalized Injection Attacks”]
Cross-site scripting      19%
Command injection         18%
SQL injection             14%
Memory errors             10%
Input validation/DoS       9%
Directory traversal        4%
Format string              1%
Config/Race errors         1%
Others                    24%
Detection Approach
• Attack: use maliciously crafted input to exert unintended control over output operations
• Detect “exertion of control”
  - Based on “taint”: degree to which output depends on input
• Detect if control is intended:
  - Requires policies (or training)
  - Application-independent policies are preferable
[Diagram: Input Interface (Untrusted input) → Program → “Output” Interface (Security-sensitive operations)]
RAMSES Goals and Approach

• Taint analysis: develop efficient and non-invasive alternatives
  - Analyze observed inputs and outputs
    - Needs no modifications to program
    - Language-neutral
  - Leverage learning to speed up analysis
• Attack detection: develop framework to detect a wide range of attacks, while minimizing policy development effort and FP/FNs
  - “Structure-aware policies”: leverage interplay between taint and structural changes to output requests
  - Use Address-Space Randomization (ASR) for memory corruption
    - ASR: efficient, in-band, “positive” tainting for pointer-valued data
• Immunization: filter out future attack instances
  - Output filters: drop output requests that violate taint-based policies
  - Input filters: “Project” policies on outputs to those on inputs
    - Relies on learning relationships between input and output fields
    - Network-deployable
[Diagram: Input Interface (Untrusted input) → Program → “Output” Interface]
Efficient Content-Based Taint Identification
Steps
• Develop efficient algorithms for inferring flow of input data into outputs
  - Compare input and output values
  - Allow for parts of input to flow into parts of output
  - Tolerate some changes to input
    - Changes such as space removal, quoting, escaping, case-folding are common in string-based interfaces
  - Based on approximate substring matching
• Leverage learning to speed up taint inference
  - Even the “efficient” content-matching algorithms are too expensive to run on every input/output
  - Same learning techniques can be used for detecting attacks using anomaly detection
Weighted Substring Edit Distance Algorithm






• Maintain a matrix D[i][j] of minimum edit distance between p[1..i] and s[1..j]
• D[i][j] = min{ D[i-1][j-1] + SubstCost(p[i], s[j]),
                 D[i-1][j] + DeleteCost(p[i]),
                 D[i][j-1] + InsertCost(s[j]) }
• D[0][j] = 0 (No cost for omitting any prefix of s)
• D[i][0] = DeleteCost(p[1]) + … + DeleteCost(p[i])
• Matches can be reconstructed from the D matrix
• Quadratic time and space complexity: uses O(|p|*|s|) memory and time
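The recurrence on this slide can be sketched in Python as follows. This is a minimal illustration, not the RAMSES implementation: the function name, the unit default costs, and the backtracking order are assumptions, and the cost functions are expected to return integers so that the equality checks during backtracking are exact.

```python
def substring_edit_distance(p, s, subst_cost=None, delete_cost=None, insert_cost=None):
    """Return (cost, start, end) such that s[start:end] is a closest match for p.

    Default costs are unit costs (an assumption); pass integer-valued
    functions to weight particular substitutions, deletions or insertions.
    """
    subst_cost = subst_cost or (lambda a, b: 0 if a == b else 1)
    delete_cost = delete_cost or (lambda a: 1)
    insert_cost = insert_cost or (lambda b: 1)

    m, n = len(p), len(s)
    # D[0][j] = 0: skipping any prefix of s is free.
    D = [[0] * (n + 1) for _ in range(m + 1)]
    # D[i][0]: all of p[:i] has to be deleted.
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + delete_cost(p[i - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + subst_cost(p[i - 1], s[j - 1]),
                D[i - 1][j] + delete_cost(p[i - 1]),
                D[i][j - 1] + insert_cost(s[j - 1]),
            )

    # The best match may end anywhere in s, so take the minimum of the last row.
    end = min(range(n + 1), key=lambda j: D[m][j])

    # Reconstruct the start of the match by walking the matrix backwards.
    i, j = m, end
    while i > 0:
        if j > 0 and D[i][j] == D[i - 1][j - 1] + subst_cost(p[i - 1], s[j - 1]):
            i, j = i - 1, j - 1
        elif D[i][j] == D[i - 1][j] + delete_cost(p[i - 1]):
            i -= 1
        else:
            j -= 1
    return D[m][end], j, end
```

Applied to the SquirrelMail example, `substring_edit_distance("nobody; rm -rf *", "gpg -r nobody; rm -rf * 2>&1")` locates the input inside the command with cost 0. A case-insensitive `subst_cost` would be one way to tolerate case-folding.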
Improving performance

• Quadratic complexity algorithms can be too expensive for large s, e.g., HTML outputs
  - Storage requirements are even more problematic
• Solution: Use linear-time coarse filtering algorithm
  - Approximate D by FD, defined on substrings of s of length |p|
    - Let P (and S) denote a multiset of characters in p (resp., s)
    - FD(p, s) = min(|P-S|, |S-P|)
  - Slide a window of size |p| over s, compute FD incrementally
  - Prove: D(p, r) < t ⇒ FD(p, r) < t for all substrings r of s
  - Result: O(|p|^2) space and time complexity in practice
• Implementation results
  - Typically 30x improvement in speed
  - 200x to 1000x reduction in space
  - Preliminary performance measurements: ~40MB/sec
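A sliding-window version of the coarse filter might look like the sketch below. It is an illustration under stated assumptions rather than the RAMSES code: the function name and return convention are made up here, and because every window has length |p|, |P-R| and |R-P| are equal, so a single counter is maintained.

```python
from collections import Counter

def coarse_filter(p, s, t):
    """Return start offsets k where the window r = s[k:k+len(p)] has FD(p, r) < t.

    Only these candidate windows need to be handed to the quadratic
    edit-distance algorithm.
    """
    m = len(p)
    if m == 0 or len(s) < m:
        return []
    need = Counter(p)         # multiset P
    window = Counter(s[:m])   # multiset R of the current window
    # missing = |P - R|: characters of p that the window does not cover.
    missing = sum((need - window).values())
    hits = [0] if missing < t else []
    for k in range(1, len(s) - m + 1):
        out_ch, in_ch = s[k - 1], s[k + m - 1]
        # Character leaving the window: coverage drops only if it was needed.
        if window[out_ch] <= need[out_ch]:
            missing += 1
        window[out_ch] -= 1
        # Character entering the window: coverage improves only if still short.
        if window[in_ch] < need[in_ch]:
            missing -= 1
        window[in_ch] += 1
        if missing < t:
            hits.append(k)
    return hits
```

Each step updates two counters, so the pass over s is linear in |s| regardless of |p|.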
Efficient online operation
• Weighted edit-distance algorithms are still too expensive if applied to every input/output
  - Need to run for every input parameter and output
• Key idea:
  - Use learning to construct a classifier for outputs
    - Each class consists of similarly tainted outputs
    - Taint identified quickly, once the class is known
• Classifying strings is difficult
  - Our technique operates on parse trees of output
    - For ease of development, generality, and tolerance to syntax errors, we use a “rough” parser
  - Classifier is a decision tree that inspects parse tree nodes in an order that leads to good decisions
Decision Tree Construction
• Examines the nodes of syntax tree in some order
  - The order of examination is a function of the set of syntax trees
  - Chooses nodes that are present in all candidate syntax trees
  - Avoids tests on tainted data, as they can vary
  - Avoids tests that don’t provide significant degree of discrimination
    - “similar-valued” fields will be collected together and generalized, instead of storing individual values
• Incorporates a notion of “suitability” for each field or subtree in the syntax tree
  - Takes into account approximations made in parsing
Example of a Decision Tree
1. SELECT * FROM phpbb_config
2. SELECT u.*,s.* FROM phpbb_sessions s,phpbb_users u WHERE
s.session_id='[a3523d78160efdafe63d8db1ce5cb0ba]' AND u.user_id=s.session_user_id
3. SELECT * FROM phpbb_themes WHERE themes_id=1
4. SELECT c.cat_id,c.cat_title,c.cat_order FROM phpbb_categories c,phpbb_forums f WHERE
f.cat_id=c.cat_id GROUP BY c.cat_id,c.cat_title,c.cat_order ORDER BY c.cat_order
5. SELECT * FROM phpbb_forums ORDER BY cat_id,forum_order
switch (1) {
case ROOT : switch (1.1) {
case CMD : switch (1.1.2) {
case c FINAL {@1.1.1:SELECT
@1.1.3:. cat_id,c.cat_title,c.cat_order FROM phpbb_categories
c,phpbb_forums f WHERE f.cat_id=c.cat_id GROUP BY
c.cat_id,c.cat_title,c.cat_order ORDER BY c.cat_order }
case u FINAL {@1.1.1:SELECT
@1.1.3:. *,s.* FROM phpbb_sessions s,phpbb_users u WHERE
s.session_id='[a3523d78160efdafe63d8db1ce5cb0ba]' AND
u.user_id=s.session_user_id }
case * FINAL {@1.1.1:SELECT
@1.1.3:FROM phpbb_?????? }
}
}
}
Implementation Status and Next Steps

• “Rough” parsers implemented for
  - HTML/XML
  - Shell-like languages (including Perl/PHP)
  - SQL
• Preliminary performance measurements
  - Construction of decision trees: ~3MB/sec
  - Classification only: ~15MB/sec
  - Significant improvements expected with some performance tuning
• Next steps
  - Develop better clustering/classification algorithms based on tree edit-distance
    - Current algorithm is based entirely on a top-down traversal, and fails to exploit similarities among subtrees
Syntax and taint-aware policies
Overview of Policies

• Leverage structure+taint to simplify/generalize policy
  - Policy structure mirrors that of parse trees
    - And-Or “trees” with cycles
    - Can specify constraints on values (using regular expressions) and taint associated with a parse tree node
[Diagram: example policy tree with nodes ELEMENT, NAME = “script”, OR, PARAM, PARAM_NAME = “src”, PARAM_VALUE, ELEM_BODY]
• Most attacks detected using one basic policy
  - Controlling “commands” vs command parameters
  - Controlling pointers vs data
Controlling “commands” Vs “parameters”
• Observation: parameters don’t alter syntactic structure of victim’s requests
• Policy: Structure of parse tree for victim’s request should not be controlled by untrusted input (“tainted data”)
• Alternate formulation: tainted data shouldn’t span multiple “fields” or “tokens” in victim’s request
[Diagram: two parse trees]
Benign request:
  root
    cmd
      name: gpg
      param: -r
      param: [email protected]
Attack request:
  root
    cmd
      name: gpg
      param: -r
      param: nobody
    separator: ;
    cmd
      name: rm
      param: -rf
      param: *
Policy prohibiting structure changes

• Define “structure change” without using a reference
  - Avoids need for training and associated FP issues
• Policy 1
  - Tainted data cannot span multiple nodes
  - For binary data, it should not span multiple fields
• Policy 2
  - Tainted data cannot straddle multiple subtrees
    - Tainted data spans two adjacent subtrees, and at least one of them is not fully tainted
    - Tainted data “overflowed” beyond the end of one subtree and resulted in a second subtree
• Both policies can be further refined to constrain the node types and children subtrees of the nodes
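Policy 1 can be illustrated on a flat token stream with the sketch below. It is a simplification under stated assumptions: the regular expression is an ad hoc rough shell tokenizer invented for this example (not the Flex/Bison parser), tokens stand in for parse tree nodes, and `build_command` is a hypothetical helper that mimics the SquirrelMail example and reports where the tainted input landed.

```python
import re

# Rough shell tokenizer (assumed for this sketch): quoted strings,
# command separators, and runs of other non-space characters.
TOKEN_RE = re.compile(r"""'[^']*'|"[^"]*"|\|\||&&|[;|&]|[^\s;|&"']+""")

def tokens(cmd):
    """Return (start, end, text) for each rough token in cmd."""
    return [(m.start(), m.end(), m.group()) for m in TOKEN_RE.finditer(cmd)]

def spans_multiple_tokens(cmd, taint_start, taint_end):
    """Policy 1 check: True if the tainted range overlaps more than one token."""
    touched = [t for t in tokens(cmd) if t[0] < taint_end and taint_start < t[1]]
    return len(touched) > 1

def build_command(recipient):
    """Hypothetical stand-in for the vulnerable code; returns the command
    together with the character range occupied by the untrusted input."""
    prefix = "gpg -r "
    cmd = prefix + recipient + " 2>&1"
    return cmd, len(prefix), len(prefix) + len(recipient)
```

With a benign recipient the tainted range stays inside one parameter token, while `nobody; rm -rf *` covers a parameter, a separator, a command name and two more parameters, so the check reports a violation without any reference request to compare against.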
Commands Vs parameters: Example 2
• Memory corruption attack overflowing stack buffer
  - For binary data, we talk about message fields rather than parse trees
[Diagram: stack layout showing Stack frame 1, Return Address, Stack frame 2, Return Address, Stack frame 2, …..]
  - Violation: tainted data spans multiple stack “fields”
• Heap overflows involve tainted data spanning across multiple heap blocks
Attacks Detected by “No structure change” Policy
• Various forms of script or command injection
• SQL injection
• XPath injection
• Format string attacks
• HTTP response splitting
• Log injection
• Stack overflow and heap overflow
Application-specific policies
• Not all attacks have the flavor of “command injection”
• Develop application-specific policies to detect such attacks
  - Policy 3: Cross-site scripting: no tainted scripts in HTML data
  - Policy 4: Path traversal: tainted file names cannot access data outside of a certain document tree
  - …
• Other examples
  - Policy 5: No tainted CMD_NAME or CMD_SEPARATOR nodes in shell or SQL commands
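As an illustration of Policy 4, the check below resolves a tainted file name against a document root and tests whether the result stays inside it. This is a sketch assuming a POSIX-style file system; the function name and the example root `/var/www/html` are made up here and are not part of RAMSES.

```python
import os

def within_document_tree(doc_root, tainted_name):
    """Return True if tainted_name, resolved relative to doc_root, stays inside it."""
    root = os.path.realpath(doc_root)
    # realpath collapses ".." components and follows symbolic links; an
    # absolute tainted_name replaces root entirely when joined.
    target = os.path.realpath(os.path.join(root, tainted_name))
    return os.path.commonpath([root, target]) == root
```

For example, `within_document_tree("/var/www/html", "img/logo.png")` holds, while `"../../etc/passwd"` and `"/etc/passwd"` both resolve outside the tree and would violate the policy.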
Implementation status

• Four test applications
  - phpBB
  - SquirrelMail
  - PHP/XMLRPC
  - WebGoat (J2EE)
• Detects following attacks without FPs
  - Command injection (Policies 1, 2, 5)
  - SQL injection (1, 2, 5)
  - XSS (3)
  - HTTP Response splitting (2)
  - Path traversal (4)
  - Memory corruption detected using ASR
• Should be able to detect many other attacks easily
  - XPATH injection (1, 2), Format-string (1, 2), Log injection (1, 2)
Memory Attack Discussion
Memory Error Based Remote Attack
• Attacker’s goal:
  - Overwrite target of interest to take over instruction execution
• Attacker’s approach:
  - Propagate attacker controlled input to target of interest
  - Violate certain structural constraints in the propagation process
Stack Frame Structural Violation
[Diagram: call stack from high addresses (A) to low addresses (C), with register markers EBP, FS:0 and ESP pointing into the stack]
A’s stack frame (High)
  - Function arguments
  - Return address
  - Previous stack frame
  - Exception Registration Record
  - Local variables
B’s stack frame
  - Function arguments
  - Return address (to A)
  - Previous stack frame
  - Local variables
C’s stack frame (Low)
  - Function arguments
  - Return address (to B)
  - Previous stack frame
  - Exception Registration Record
  - Local variables
Heap Block Structural Violation
[Diagram: Windows Free Heap Block Header Structure, with fields Size, Previous Size, Segment Index, Flags, Unused, Tag Index, FLink, BLink]
• Happens when removing free block from double-linked list:
  - Ability to write 4 bytes into any address, usually well known address, like function pointer, return address, SEH etc.
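The 4-byte write can be illustrated with a toy memory model. The sketch below is not Windows heap code: memory is a Python dictionary from addresses to 32-bit words, the FLink/BLink offsets assume a 32-bit list entry, and all addresses and values are invented for the example.

```python
FLINK_OFF, BLINK_OFF = 0, 4  # assumed 32-bit layout: FLink, then BLink

def unlink(mem, entry):
    """Remove `entry` from a doubly linked free list held in `mem`."""
    flink = mem[entry + FLINK_OFF]
    blink = mem[entry + BLINK_OFF]
    mem[blink + FLINK_OFF] = flink   # entry->BLink->FLink = entry->FLink
    mem[flink + BLINK_OFF] = blink   # entry->FLink->BLink = entry->BLink

def demo():
    """Unlink a free block whose links were overwritten by an overflow."""
    entry = 0x00150000             # hypothetical free block list entry
    address_to_write = 0x00402000  # hypothetical function pointer location
    value_to_write = 0x41414141    # attacker-chosen value
    # The overflow replaced FLink with the value and BLink with the address.
    mem = {entry + FLINK_OFF: value_to_write,
           entry + BLINK_OFF: address_to_write}
    unlink(mem, entry)
    return mem, address_to_write, value_to_write
```

After `demo()`, the word at `address_to_write` holds `value_to_write`, which is the (Address To Write, Value to Write) pair an attacker controls. The second assignment in `unlink` also writes near the attacker's value, a side effect this toy model does not restrict.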
ASLR and Crash Analysis
• ASLR randomizes the addresses of targets of interest
• Memory attack using the original address will miss and cause crash (exception).
• Crash analysis tracks back to vulnerability, which enables accurate signature generation
• Structural information usually retrievable at runtime, thanks to enhanced debugging technology
• Crash analysis aided with JIT (Just-In-Time Tracing)
  - JIT triggered at certain events:
    - “Suspicious” address
    - network inputs, e.g. sensitive JMP
  - Attach/detach JIT monitor at event of interest
  - Memory dump can be taken at the right granularity, log info from a few KB to 2GB
Crash Root Cause Analysis
[Diagram: Root Cause Analysis, driven by Exception Record/Context, Faulting thread/Instructions/Registers, and Stack trace/Heap/Module/Symbols, classifies the crash as follows]
• Stack Corruption
  - Read Access Violation: Bad EIP (Corrupted Return Address or SEH)
  - Read Access Violation: Bad Dereference (Corrupted Local Variables/passing parameters)
• Heap Corruption
  - Write Access Violation (Address to write, Value to write)
Stack-based Overflow Analysis

• “Target” driven analysis
  - The goal of attack string is to overwrite target of interest on stack, e.g., return address, SEH handler.
  - Start matching target values from crash dump to input, like EIP, EBP and SEH handler
    - More efficient than pattern match in the whole address space
  - If any targets are matched in input, expand in both directions to find LCS
  - A match usually indicates the input size needed to overflow certain targets
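The target-driven matching can be sketched as below. This is an illustration with assumed names and conventions: targets are taken to be 32-bit little-endian values (x86), the crash dump and input history are plain byte strings, and the helper names are invented for this example.

```python
import struct

def bytes_to_overwrite(input_bytes, target_value):
    """Return how many input bytes are needed to reach and overwrite the
    target (offset of the match plus 4), or None if the value is absent."""
    needle = struct.pack("<I", target_value)  # little-endian 32-bit word
    pos = input_bytes.find(needle)
    return None if pos < 0 else pos + len(needle)

def expand_match(inp, mem, i, j, n):
    """Grow the known match inp[i:i+n] == mem[j:j+n] in both directions and
    return the (start, end) range of the common substring within inp."""
    lo = 0
    while i - lo > 0 and j - lo > 0 and inp[i - lo - 1] == mem[j - lo - 1]:
        lo += 1
    hi = n
    while i + hi < len(inp) and j + hi < len(mem) and inp[i + hi] == mem[j + hi]:
        hi += 1
    return i - lo, i + hi
```

Starting from a crash register value such as the faulting EIP, `bytes_to_overwrite` gives the input length that reaches the target, and `expand_match` recovers how much of the surrounding stack memory was copied from the input.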
SEH Overflow and Analysis

• A unique approach for Windows exploits
  - SEH stands for Structured Exception Handler
  - Windows puts the EXCEPTION_REGISTRATION_RECORD chain on the stack, with the SEH in the record.
• More reliable and powerful than overwriting the return address
  - More JMP addresses to use (pop/pop/ret)
  - An exception (accidental/intentional) is desired
  - Can bypass /GS buffer check
• SEH crash analysis:
  - Catch the first exception as well as the second one (caused by ASR)
  - Locate the SEH chain head from first dump, usually overwritten by input
  - Usually first exception is enough, second exception can be used for confirmation
Heap Overflow Analysis

• How to analyze heap overflow attack?
  - Exploit happens in free blocks unlink
    - Multiple ways to trigger
  - Write Access Violation with ASR
    - with overwriting in invalid address
  - Overwrite 4 bytes value in arbitrary address
    - Interested targets include return address, SEH, PEB and UEF
  - Exploit contains the pair: (Address To Write, Value to Write)
    - Appeared in the overflowed heap blocks
    - Usually contained in registers
    - Should be provided from input by attacker
    - Match found in synthetic heap exploits
  - The value pairs need to be in fixed offset
    - For a given heap overflow vulnerability
    - To enable overwrite the right address with the right value desired
Case Studies
Vulnerability | Exploit
IIS ISAPI Extension synthetic stack buffer overflow | Overwrite return address
IIS ISAPI Extension synthetic stack buffer overflow | Overwrite Structured Exception Handler
IIS w3who.dll stack buffer overflow (CVE-2004-1134) | Overwrite Structured Exception Handler
Microsoft RPC DCOM Interface stack buffer overflow (CVE-2003-0352) | Overwrite return address and Structured Exception Handler
Synthetic Heap Overflow | Overwrite function pointer inside PEB structure
Case Study: RPC DCOM
• Step 1: Exception Analysis
FAULTING_IP:
+18759f
ExceptionCode: c0000005 (Access violation)
Attempt to read from address 0018759f
PROCESS_NAME: svchost.exe
FAULTING_THREAD: 00000290
PRIMARY_PROBLEM_CLASS: STACK_CORRUPTION
• Step 2: Target – Input correlation:
StackBase: 0x6c0000, StackLimit: 0x6bc000,Size =0x4000
Begin analyze on Target Overwrite and Input Correlation:
Analyze crash EIP:
Find EIP pattern at socket input:
Bytes size to overwrite EIP= 128
Analyze crash EIP done!
Analyze SEH:
Find SEH byte at socket input:
Bytes size to overwrite SEH handler= 1588
Analyze SEH done!
Signature Generation

• Signature generation:
  - Signature captures the vulnerability characteristics
    - Minimum size to overwrite certain target(s)
  - Use contexts to reduce false positive:
    - Using incoming input calling stack
      - Stack offset can uniquely identify the context
    - Using incoming input semantic context:
      - Message format like HTTP url/parameter
      - Binary message field
Components & Implementation
[Diagram: numbered flow (steps 1 to 5) between the Protected Application, the RAMSES Crash Monitor and the RAMSES Crash Analyzer, with edge labels Crash (Exception), Generate Crash Dump*, Provide Input History, Analyze, and Signature]
• RAMSES Crash Monitor:
  - Catch interested exception only
  - Snapshots for a given period
  - Self healer
• RAMSES Crash Analyzer
  - Fault type detection
  - Security oriented analysis
  - Feedback
• Infrastructure (uses Windows Debug Engine):
  - Save Crash Dump
  - Extract Relevant Info
  - Search/Match
  - Disassemble
* Crash Dump provides the same interface as LIVE process, so Crash Analyzer actually does NOT have to work on saved crash dump file.
Testing
Test Attacks & Applications
Attack | Vulnerability | Target App | App Lang | Exploited Lang
phpBB SQL Injection | CAN-2003-0486 | phpBB | PHP | SQL
SquirrelMail Command Injection | CAN-2003-0990 | SquirrelMail | PHP | cmd/shell
SquirrelMail XSS Attack | CAN-2002-1341 | phpBB | PHP | JavaScript
PHP XML-RPC | CAN-2005-1921 | PHP Library | PHP | XML
HTTP Splitting | CR LF escapes | WebGoat | Java | HTTP Request
HTTP Splitting Cache Poisoning | tainted expiration field | WebGoat | Java | HTTP Request
Path Based Access Control | tainted file open | WebGoat | Java | file path
Xpath injection | tainted xpath string | WebGoat | Java | Xpath Library
JSON injection | flawed architecture | WebGoat | Java | JSON
XML inject | flawed architecture | WebGoat | Java | XML
Targets (in slide order): Database; Server; 3rd party clients; Server; Server page cache; Server; Server; Server Application; Server Application
Baseline Applications
• phpBB (php)
• squirrelMail (php)
• WebGoat (java)
• hMailServer (C++)
Many “sub languages”: SQL, XML, JavaScript, HTML, HTTP, JSON, shell, cmd, path
Possible Testbed Configurations
[Diagrams: testbed configurations, each showing an Attacker outside a Protected System drawn from a Web Server (IIS/Apache), files, Web Apps, a SQL Database (MySQL), and a Mail Server]
• Baseline testbed setup
• Can extend protected system to include Mail Server
• Protect Mail server exposed as a service.
• Protect just mail server in context of Web service.
Traffic Generation

• Purpose
  - Coverage of legitimate structural variation in monitored structures
    - SQL, command strings, call parameters
  - Stress of log complexity for practicality
    - Multiple users, multiple sessions
  - Performance measurements
    - Program performance metrics
    - Quantify performance impact
Traffic Generation to Web Sites

• Approaches
  - Simple Record/Playback (basic)
    - with minor substitutions (cookies, ips)
    - shell scripts, netcat, MaxQ (jython based)
  - Custom DOM/Ajax scripting (learning)
    - Can access dynamically generated browser content after (during) client side script eval
    - Automated site crawls of URLs
    - Automated form contents (site specific metadata)
  - COTS tools
    - Load testing and metrics
Red Team Suggestions
Suggested Red Team ROEs
• Initial telecons held in Fall
• Claim: RAMSES will defeat most generalized injection attacks on protected applications
• Red Team should target our current and planned applications rather than new ones (unless new application, sample attacks and complete traffic generator can be provided to RAMSES far enough in advance for learning and testing)
  - Remote network access to the targeted application
  - Attack designated application suite
• Required instrumentation yet to be determined
• Red Team exercise start 15 April or later
• ……
RAMSES Project Schedule
[Gantt chart: quarterly timeline from CY06 Q3 through CY09 Q3; task bars not recoverable]
Baseline Tasks
1. Refine RAMSES Requirements
2. Design RAMSES
3. Develop Components
4. Integrate System
5. Analyze & Test RAMSES
6. Coordinate & Rept
Prototypes: 1, 2, 3
Optional Tasks
• Red Team Exercise
• O.3 Cross-Area Exper
Today: 11 September 2007
Next Steps
Plans
• Develop input filters from output policies
• Extend memory error analyzer
• Demonstrate RAMSES on more applications and attack types
  - Native C/C++ app (most likely app is hMail server)
  - Java
• Integrate components
• Performance and false positive testing
• Red Team exercise
Questions?
Backup
Tokenizing and Parsing

• Focus on “rough” parsing that reveals approximate structure, but not necessarily all the details
  - Accurate parsers are time-consuming to write
  - More important: may not gracefully handle errors (common in HTML) or language extensions and variations (different shells, different flavors of SQL)
• Implemented using Flex/Bison
• Currently done for SQL and shell command languages
  - Parse into a sequence of statements, each statement consisting of a “command name” and “parameters”
  - Incorporates a notion of confidence to deal with complex language features, e.g., variable substitutions in shell
  - Modest effort for adding additional languages, but substantially simplifies subsequent learning tasks
    - Don’t anticipate significant additions to this language list (other than HTML/XML)
Taint inference Vs Taint-tracking

• Disadvantages of learning
  - False negatives if inputs transformed before use
    - Low likelihood for most web apps
  - False positives due to coincidence
    - Mitigated using statistical information
  - Plan to evaluate these experimentally
• Benefits of learning
  - Low performance overhead
  - Some significant implicit flows handled without incurring high false positives
  - Can address multi-step attacks where tainted data is first stored in a file/database before use
    - More generally, in dealing with information flow that crosses module boundaries
Attack Coverage 2004
[Pie chart: CVE Vulnerabilities (Ver. 20040901) by class; one group of slices is labeled “Generalized Injection Attacks”]
Memory errors             27%  (Stack-smashing, heap overflow, integer overflow, data attacks)
Other logic errors        22%
Command injection         15%
Directory traversal       10%
Input validation/DoS       9%
Tempfile                   4%
Format string              4%
Cross-site scripting       4%
Config errors              3%
SQL injection              2%
RAMSES System Concept
[Diagram: Internet → Network/App Firewall (e.g. mod_security) → Protected System made up of Web Server (IIS/Apache), Web App (PHP/ASP) and SQL Database (MySQL); RAMSES Interceptors attach at Network DLLs, OS DLLs and Application DLLs]
RAMSES Components
• Event Collector
  - parse/decode/normalize HTTP requests, parameters, cookies, …
• Attack Detector
  - Address-space randomization
  - Taint-based policies, anomalies
• Filter Generator
  - Output filter
  - Input filter
Key research problems
• Learn taint propagation
  - Identify tainted components in output, generate filtering criteria
• Learn input/output transformation
  - Use transformation to project output filters to input
Advantages of RAMSES Filters
• Filters easily sharable
  - Complements Application Community focus on end user applications
• Filters are human readable
  - Filter generation algorithms can be enhanced to address privacy concerns wrt sharing
Filter types
Filter Criteria
• Correlative filters
  - Equality-based filter
  - Structure-based filter
  - Statistical filter
• Causal filters
  - Filtering criteria derived from attack detection criteria (policy or anomaly)
Filter Location
• Input filter
  - Easier to deploy but harder to synthesize
• Output filter (precedes sensitive operation)
  - Easier to synthesize than input filter, but deployment needs deeper instrumentation
  - May be too late for some attacks (memory corruption)
Note: All filters evaluated using large number of benign samples and 1 attack sample