Taint-Enhanced Policy Enforcement
RAMSES (Regeneration And iMmunity SErviceS):
A Cognitive Immune System
Self Regenerative Systems
18 December 2007
Mark Cornwell
James Just
Nathan Li
Robert Schrag
Global InfoTek, Inc
12/18/06
R. Sekar
Stony Brook University
Outline
Overview
Efficient content-based taint identification
Syntax and taint-aware policies
Memory attack detection and response
Testing
Red Team suggestions
Questions
Demo
RAMSES Attack Context
Attack target: “program” mediating access to protected resources/services
Attack approach: use maliciously crafted input to exert unintended control over protected resource operations
Resource or service uses:
  Well-defined APIs to access
    OS resources
    Command interpreters
    Database servers
    Transaction servers, ……
  Internal interfaces
    Data structures and functions within program
    Used by program components to talk to each other
Diagram: Incoming requests (Untrusted input) → Program → Outgoing requests (Security-sensitive operations)
Example 1: SquirrelMail Command Injection
Program code:
  $send_to_list = $_GET['sendto']
  $command = "gpg -r $send_to_list 2>&1"
  popen($command)
Input Interface: sendto="nobody; rm -rf *"
Program: $command="gpg -r nobody; rm -rf * 2>&1"
“Output” Interface: popen($command)
Attack: Removes all removable files in web server document tree
Example 2: phpBB SQL Injection
Program code:
  $topic_id = $_GET['topic']
  $sql = "SELECT p.post_id FROM POSTS_TABLE WHERE p.topic_id = $topic_id"
  sql_query($sql)
Input Interface: topic="-1 UNION SELECT ord(substring(user_password,1,1)) FROM phpbb_users WHERE user_id = 3"
Program: $sql = "SELECT p.post_id FROM POSTS_TABLE WHERE p.topic_id = -1 UNION SELECT ord(substring(user_password,1,1)) FROM phpbb_users WHERE user_id = 3"
“Output” Interface: sql_query($sql)
Attack: Steal another user’s password
Attack Space of Interest (CVE 2006)
Pie chart of CVE 2006 vulnerabilities by category; a “Generalized Injection Attacks” label marks a group of slices.
Cross-site scripting: 19%
Command injection: 18%
SQL injection: 14%
Memory errors: 10%
Input validation/DoS: 9%
Directory traversal: 4%
Format string: 1%
Config/Race errors: 1%
Others: 24%
Detection Approach
Attack: use maliciously crafted input to exert unintended control over output operations
Detect “exertion of control”
  Based on “taint”: degree to which output depends on input
Detect if control is intended:
  Requires policies (or training)
  Application-independent policies are preferable
Diagram: Input Interface (Untrusted input) → Program → “Output” Interface (Security-sensitive operations)
RAMSES Goals and Approach
Taint analysis: develop efficient and non-invasive alternatives
  Analyze observed inputs and outputs
    Needs no modifications to program
    Language-neutral
  Leverage learning to speed up analysis
Attack detection: develop framework to detect a wide range of attacks, while minimizing policy development effort and FP/FNs
  “Structure-aware policies”: leverage interplay between taint and structural changes to output requests
  Use Address-Space Randomization (ASR) for memory corruption
    ASR: efficient, in-band, “positive” tainting for pointer-valued data
Immunization: filter out future attack instances
  Output filters: drop output requests that violate taint-based policies
  Input filters: “Project” policies on outputs to those on inputs
    Relies on learning relationships between input and output fields
    Network-deployable
Diagram: Input Interface (Untrusted input) → Program → “Output” Interface
Efficient Content-Based Taint Identification
Steps
Develop efficient algorithms for inferring flow of input data into outputs
  Compare input and output values
  Allow for parts of input to flow into parts of output
  Tolerate some changes to input
    Changes such as space removal, quoting, escaping, case-folding are common in string-based interfaces
  Based on approximate substring matching
Leverage learning to speed up taint inference
  Even the “efficient” content-matching algorithms are too expensive to run on every input/output
  Same learning techniques can be used for detecting attacks using anomaly detection
Weighted Substring Edit Distance Algorithm
Maintain a matrix D[i][j] of minimum edit
distance between p[1..i] and s[1..j]
D[i][j] = min{D[i-1][j-1]+ SubstCost(p[i],s[j]),
D[i-1][j] + DeleteCost(p[i]),
D[i][j-1] + InsertCost(s[j])}
D[0][j] = 0 (No cost for omitting any prefix of s)
D[i][0] = DeleteCost(p[1])+…+DeleteCost(p[i])
Matches can be reconstructed from the D matrix
Quadratic time and space complexity: uses O(|p|*|s|) memory and time
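The recurrence above can be sketched in Python as follows. This is a minimal illustration, not the RAMSES implementation: the function names and the example cost functions (`case_insensitive_subst`, `unit`) are assumptions introduced here. The sketch fills the full D matrix and returns the cheapest match of p against any substring of s, together with the position in s where that match ends.

```python
def substring_edit_distance(p, s, subst_cost, delete_cost, insert_cost):
    """Return (cost, end) for the cheapest match of p inside s.

    D[i][j] is the minimum weighted edit distance between p[:i] and
    some substring of s that ends at position j.
    """
    rows, cols = len(p) + 1, len(s) + 1
    # D[0][j] = 0: omitting any prefix of s costs nothing.
    D = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        # D[i][0]: every character of p[:i] has to be deleted.
        D[i][0] = D[i - 1][0] + delete_cost(p[i - 1])
        for j in range(1, cols):
            D[i][j] = min(
                D[i - 1][j - 1] + subst_cost(p[i - 1], s[j - 1]),
                D[i - 1][j] + delete_cost(p[i - 1]),
                D[i][j - 1] + insert_cost(s[j - 1]),
            )
    # The last row covers all of p; choose the cheapest end position in s.
    end = min(range(cols), key=lambda j: D[-1][j])
    return D[-1][end], end


def case_insensitive_subst(a, b):
    # Hypothetical weighting: case-folding is free, other substitutions cost 1.
    return 0.0 if a.lower() == b.lower() else 1.0


def unit(_ch):
    # Hypothetical weighting: every insertion or deletion costs 1.
    return 1.0
```

With these example weights, an input that reaches the output verbatim or with only case changes matches at cost 0, while each dropped or inserted character adds 1 to the cost.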
Improving performance
Quadratic complexity algorithms can be too expensive for large s, e.g., HTML outputs
  Storage requirements are even more problematic
Solution: Use linear-time coarse filtering algorithm
  Approximate D by FD, defined on substrings of s of length |p|
    Let P (and S) denote a multiset of characters in p (resp., s)
    FD(p, s) = min(|P-S|, |S-P|)
  Slide a window of size |p| over s, compute FD incrementally
  Prove: D(p, r) < t ⇒ FD(p, r) < t for all substrings r of s
  Result: O(|p|^2) space and time complexity in practice
Implementation results
  Typically 30x improvement in speed
  200x to 1000x reduction in space
Preliminary performance measurements: ~40MB/sec
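A minimal Python sketch of the coarse filter described above, assuming the multiset difference is tracked per character as a window of |p| characters slides over s. The function name `coarse_filter` and its return convention (a list of window start offsets) are illustrative assumptions. Windows that pass the filter would then be handed to the quadratic edit-distance algorithm.

```python
from collections import Counter


def coarse_filter(p, s, t):
    """Return start offsets of length-|p| windows r of s with FD(p, r) < t."""
    m = len(p)
    if m == 0 or len(s) < m:
        return []
    diff = Counter(p)   # diff[c] = (count of c in p) - (count of c in window)
    missing = m         # |P - R|: characters of p not covered by the window
    surplus = 0         # |R - P|: window characters with no counterpart in p
    hits = []
    for j, ch in enumerate(s):
        # Right edge of the window advances: ch enters.
        if diff[ch] > 0:
            missing -= 1
        else:
            surplus += 1
        diff[ch] -= 1
        # Left edge advances once the window holds more than |p| characters.
        if j >= m:
            old = s[j - m]
            if diff[old] < 0:
                surplus -= 1
            else:
                missing += 1
            diff[old] += 1
        # Only full windows are scored.
        if j >= m - 1 and min(missing, surplus) < t:
            hits.append(j - m + 1)
    return hits
```

Each step updates two counters in constant time, so the pass over s is linear and needs storage proportional to the alphabet actually seen, not to |p|*|s|.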
Efficient online operation
Weighted edit-distance algorithms are still too expensive if applied to every input/output
  Need to run for every input parameter and output
Key idea:
  Use learning to construct a classifier for outputs
  Each class consists of similarly tainted outputs
    taint identified quickly, once the class is known
Classifying strings is difficult
  Our technique operates on parse trees of output
    For ease of development, generality, and tolerance to syntax errors, we use a “rough” parser
  Classifier is a decision tree that inspects parse tree nodes in an order that leads to good decisions
Decision Tree Construction
Examines the nodes of syntax tree in some order
  The order of examination is a function of the set of syntax trees
Chooses nodes that are present in all candidate syntax trees
  Avoids tests on tainted data, as they can vary
  Avoids tests that don’t provide significant degree of discrimination
    “similar-valued” fields will be collected together and generalized, instead of storing individual values
Incorporates a notion of “suitability” for each field or subtree in the syntax tree
  Takes into account approximations made in parsing
Example of a Decision Tree
1. SELECT * FROM phpbb_config
2. SELECT u.*,s.* FROM phpbb_sessions s,phpbb_users u WHERE
s.session_id='[a3523d78160efdafe63d8db1ce5cb0ba]' AND u.user_id=s.session_user_id
3. SELECT * FROM phpbb_themes WHERE themes_id=1
4. SELECT c.cat_id,c.cat_title,c.cat_order FROM phpbb_categories c,phpbb_forums f WHERE
f.cat_id=c.cat_id GROUP BY c.cat_id,c.cat_title,c.cat_order ORDER BY c.cat_order
5. SELECT * FROM phpbb_forums ORDER BY cat_id,forum_order
switch (1) {
case ROOT : switch (1.1) {
case CMD : switch (1.1.2) {
case c FINAL {@1.1.1:SELECT
@1.1.3:. cat_id,c.cat_title,c.cat_order FROM phpbb_categories
c,phpbb_forums f WHERE f.cat_id=c.cat_id GROUP BY
c.cat_id,c.cat_title,c.cat_order ORDER BY c.cat_order }
case u FINAL {@1.1.1:SELECT
@1.1.3:. *,s.* FROM phpbb_sessions s,phpbb_users u WHERE
s.session_id='[a3523d78160efdafe63d8db1ce5cb0ba]' AND
u.user_id=s.session_user_id }
case * FINAL {@1.1.1:SELECT
@1.1.3:FROM phpbb_?????? }
}
}
}
Implementation Status and Next Steps
“Rough” parsers implemented for
  HTML/XML
  Shell-like languages (including Perl/PHP)
  SQL
Preliminary performance measurements
  Construction of decision trees: ~3MB/sec
  Classification only: ~15MB/sec
  Significant improvements expected with some performance tuning
Next steps
  Develop better clustering/classification algorithms based on tree edit-distance
    Current algorithm is based entirely on a top-down traversal, and fails to exploit similarities among subtrees
Syntax and taint-aware policies
Overview of Policies
Leverage structure+taint to simplify/generalize policy
  Policy structure mirrors that of parse trees
    And-Or “trees” with cycles
  Can specify constraints on values (using regular expressions) and taint associated with a parse tree node
Example policy tree (diagram), node labels: ELEMENT, NAME = “script”, OR, PARAM, PARAM_NAME=“src”, PARAM_VALUE, ELEM_BODY
Most attacks detected using one basic policy
  Controlling “commands” vs command parameters
  Controlling pointers vs data
Controlling “commands” Vs “parameters”
Observation: parameters don’t alter syntactic structure of
victim’s requests
Policy: Structure of parse tree for victim’s request should
not be controlled by untrusted input (“tainted data”)
Alternate formulation: tainted data shouldn’t span multiple
“fields” or “tokens” in victim’s request
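Parse tree of a benign request (diagram): root → cmd [name (gpg), param (-r), param ([email protected])]
Parse tree of an attack request (diagram): root → cmd [name (gpg), param (-r), param (nobody)], separator (;), cmd [name (rm), param (-rf), param (*)]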
Policy prohibiting structure changes
Define “structure change” without using a reference
  Avoids need for training and associated FP issues
Policy 1
  Tainted data cannot span multiple nodes
  For binary data, it should not span multiple fields
Policy 2
  Tainted data cannot straddle multiple subtrees
    Tainted data spans two adjacent subtrees, and at least one of them is not fully tainted
    Tainted data “overflowed” beyond the end of one subtree and resulted in a second subtree
Both policies can be further refined to constrain the node types and children subtrees of the nodes
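As a rough illustration of Policy 1 for shell-like output, the Python sketch below counts how many tokens a tainted span overlaps. Everything here is an assumption made for the example: the regular-expression tokenizer stands in for the “rough” parser, taint is located by exact substring search instead of approximate matching, and the address alice@example.org is hypothetical.

```python
import re

# Rough shell tokenizer (an assumption for illustration): runs of ';', '|'
# and '&' are separator tokens; everything else is split on whitespace.
TOKEN = re.compile(r"[;|&]+|[^\s;|&]+")


def tokens_touched(request, taint_start, taint_end):
    """Return the tokens of request that overlap the span [taint_start, taint_end)."""
    return [
        m.group()
        for m in TOKEN.finditer(request)
        if m.start() < taint_end and m.end() > taint_start
    ]


def violates_policy_1(request, tainted):
    """Policy 1: tainted data must not span more than one token."""
    start = request.find(tainted)
    if start < 0:
        return False  # the input does not appear in the request: nothing is tainted
    return len(tokens_touched(request, start, start + len(tainted))) > 1


benign = violates_policy_1("gpg -r alice@example.org 2>&1", "alice@example.org")
attack = violates_policy_1("gpg -r nobody; rm -rf * 2>&1", "nobody; rm -rf *")
```

In the benign request the tainted address stays inside one parameter token, so `benign` is False. In the attack request the tainted text covers a parameter, a separator and a second command, so `attack` is True, and no reference request or training was needed to reach that verdict.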
Commands Vs parameters: Example 2
Memory corruption attack overflowing stack buffer
  For binary data, we talk about message fields rather than parse trees
Stack layout (diagram): Stack frame 1, Return Address, Stack frame 2, Return Address, Stack frame 2, …..
Violation: tainted data spans multiple stack “fields”
Heap overflows involve tainted data spanning across multiple heap blocks
Attacks Detected by “No structure change” Policy
Various forms of script or command injection
SQL injection
XPath injection
Format string attacks
HTTP response splitting
Log injection
Stack overflow and heap overflow
Application-specific policies
Not all attacks have the flavor of “command injection”
Develop application-specific policies to detect such attacks
  Policy 3: Cross-site scripting: no tainted scripts in HTML data
  Policy 4: Path traversal: tainted file names cannot access data outside of a certain document tree
  …
Other examples
  Policy 5: No tainted CMD_NAME or CMD_SEPARATOR nodes in shell or SQL commands
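Policy 4 can be illustrated with a short Python sketch. It is only an approximation under stated assumptions: the function name is hypothetical, the document root is assumed to be an absolute POSIX path, and paths are resolved lexically, so symbolic links are not followed.

```python
import posixpath


def violates_policy_4(doc_root, tainted_name):
    """Policy 4: a tainted file name must resolve inside the document tree."""
    root = posixpath.normpath(doc_root)
    # Lexical resolution only: "..", "." and duplicate slashes are collapsed.
    target = posixpath.normpath(posixpath.join(root, tainted_name))
    # The target is inside the tree only if the tree root is a common ancestor.
    return posixpath.commonpath([root, target]) != root
```

For a document root of /var/www/docs, the tainted name ../../etc/passwd resolves to /var/etc/passwd and is flagged, while reports/q1.html stays inside the tree and is allowed.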
Implementation status
Four test applications
  phpBB
  SquirrelMail
  PHP/XMLRPC
  WebGoat (J2EE)
Detects following attacks without FPs
  Command injection (Policies 1, 2, 5)
  SQL injection (1, 2, 5)
  XSS (3)
  HTTP Response splitting (2)
  Path traversal (4)
Memory corruption detected using ASR
Should be able to detect many other attacks easily
  XPATH injection (1, 2), Format-string (1, 2), Log injection (1, 2)
Memory Attack Discussion
Memory Error Based Remote Attack
Attacker’s goal:
  Overwrite target of interest to take over instruction execution
Attacker’s approach:
  Propagate attacker controlled input to target of interest
  Violate certain structural constraints in the propagation process
Stack Frame Structural Violation
Stack layout (diagram), from high to low addresses:
  A’s stack frame: Function arguments, Return address, Previous stack frame, Exception Registration Record, Local variables
  B’s stack frame: Function arguments, Return address (to A), Previous stack frame, Local variables
  C’s stack frame: Function arguments, Return address (to B), Previous stack frame, Exception Registration Record, Local variables
Pointers shown: EBP, FS:0, ESP
Heap Block Structural Violation
Windows Free Heap Block Header Structure (diagram): Size, Previous Size, Segment Index, Flags, Unused, Tag Index, FLink, BLink
Violation happens when removing a free block from the doubly linked list:
  Gives the ability to write 4 bytes to any address, usually a well-known address such as a function pointer, return address, SEH, etc.
ASLR and Crash Analysis
ASLR randomizes the addresses of targets of interest
Memory attack using the original address will miss
and cause crash (exception).
Crash analysis tracks back to vulnerability, which
enables accurate signature generation
Structural information usually retrievable at
runtime, thanks to enhanced debugging technology
Crash analysis aided with JIT (Just-In-Time Tracing)
  JIT triggered at certain events:
    “Suspicious” network inputs, e.g. sensitive JMP address
  Attach/detach JIT monitor at event of interest
  Memory can be dumped at the right granularity, with log info from a few KB to 2GB
Crash Root Cause Analysis
Root Cause Analysis inputs: Exception Record/Context, Faulting thread/Instructions/Registers, Stack trace/Heap/Module/Symbols
Stack Corruption
  Read Access Violation: Bad EIP (Corrupted Return Address or SEH)
  Read Access Violation: Bad Dereference (Corrupted Local Variables/passing parameters)
Heap Corruption
  Write Access Violation (Address to write, Value to write)
Stack-based Overflow Analysis
“Target” driven analysis
  The goal of attack string is to overwrite target of interest on stack, e.g., return address, SEH handler.
  Start matching target values from crash dump to input, like EIP, EBP and SEH handler
    More efficient than pattern match in the whole address space
  If any targets are matched in input, expand in both directions to find LCS
  A match usually indicates the input size needed to overflow certain targets
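The target-driven matching described above can be sketched in Python. This is an assumption-laden illustration, not the RAMSES crash analyzer: the function names are hypothetical, targets are taken to be 32-bit little-endian values as on x86, and the memory region is passed in as a plain bytes object.

```python
import struct


def overwrite_offset(network_input, target_value):
    """Return the input offset whose bytes ended up in a 32-bit target, or -1."""
    pattern = struct.pack("<I", target_value)  # x86 stores values little-endian
    return network_input.find(pattern)


def expand_match(network_input, memory, in_pos, mem_pos, length):
    """Grow a known match in both directions; return (input_start, new_length)."""
    # Extend to the left while the preceding bytes agree.
    while in_pos > 0 and mem_pos > 0 and network_input[in_pos - 1] == memory[mem_pos - 1]:
        in_pos, mem_pos, length = in_pos - 1, mem_pos - 1, length + 1
    # Extend to the right while the following bytes agree.
    while (in_pos + length < len(network_input)
           and mem_pos + length < len(memory)
           and network_input[in_pos + length] == memory[mem_pos + length]):
        length += 1
    return in_pos, length
```

The offset returned by `overwrite_offset` is the number of input bytes that precede the overwritten target, which is the size information a signature needs. Searching only for the handful of target values read from the crash dump avoids comparing the input against the whole address space.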
SEH Overflow and Analysis
A unique approach for Windows exploits
  SEH stands for Structured Exception Handler
  Windows puts the EXCEPTION_REGISTRATION_RECORD chain on the stack, with the SEH in each record.
  More reliable and powerful than overwriting the return address
    More JMP addresses to use (pop/pop/ret)
    An exception (accidental/intentional) is desired
    Can bypass /GS buffer check
SEH crash analysis:
  Catch the first exception as well as the second one (caused by ASR)
  Locate the SEH chain head from the first dump, usually overwritten by input
  Usually the first exception is enough; the second exception can be used for confirmation
Heap Overflow Analysis
How to analyze heap overflow attack?
Exploit happens when free blocks are unlinked
  Multiple ways to trigger
Write Access Violation with ASR
  Caused by overwriting at an invalid address
Overwrite 4-byte value at arbitrary address
  Targets of interest include return address, SEH, PEB and UEF
Exploit contains the pair: (Address To Write, Value to Write)
  Appears in the overflowed heap blocks
  Usually contained in registers
  Should be provided from input by attacker
  Match found in synthetic heap exploits
The value pair needs to be at a fixed offset
  For a given heap overflow vulnerability
  To overwrite the right address with the desired value
Case Studies
Vulnerability | Exploit
IIS ISAPI Extension synthetic stack buffer overflow | Overwrite return address
IIS ISAPI Extension synthetic stack buffer overflow | Overwrite Structured Exception Handler
IIS w3who.dll stack buffer overflow (CVE-2004-1134) | Overwrite Structured Exception Handler
Microsoft RPC DCOM Interface stack buffer overflow (CVE-2003-0352) | Overwrite return address and Structured Exception Handler
Synthetic Heap Overflow | Overwrite function pointer inside PEB structure
Case Study: RPC DCOM
Step 1: Exception Analysis
FAULTING_IP:
+18759f
ExceptionCode: c0000005 (Access violation)
Attempt to read from address 0018759f
PROCESS_NAME: svchost.exe
FAULTING_THREAD: 00000290
PRIMARY_PROBLEM_CLASS: STACK_CORRUPTION
Step 2: Target – Input correlation:
StackBase: 0x6c0000, StackLimit: 0x6bc000,Size =0x4000
Begin analyze on Target Overwrite and Input Correlation:
Analyze crash EIP:
Find EIP pattern at socket input:
Bytes size to overwrite EIP= 128
Analyze crash EIP done!
Analyze SEH:
Find SEH byte at socket input:
Bytes size to overwrite SEH handler= 1588
Analyze SEH done!
Signature Generation
Signature generation:
  Signature captures the vulnerability characteristics
    Minimum size to overwrite certain target(s)
  Use contexts to reduce false positives:
    Using incoming input calling stack
      Stack offset can uniquely identify the context
    Using incoming input semantic context:
      Message format like HTTP url/parameter
      Binary message field
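One way to picture such a signature is the Python sketch below. The class, its field names and the dictionary-of-fields request format are all hypothetical; the slides do not specify how RAMSES represents signatures. The sketch pairs a semantic context (here, a named message field) with the smallest field length that reaches the target.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OverflowSignature:
    context: str        # hypothetical label for the message field, e.g. an HTTP parameter
    min_overflow: int   # smallest field length observed to reach the target

    def matches(self, fields):
        """True if the field named by context is long enough to reach the target."""
        value = fields.get(self.context, b"")
        return len(value) >= self.min_overflow
```

Binding the length threshold to one context keeps long but harmless values in other fields from triggering the filter, which is the false-positive reduction the slide refers to.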
Components & Implementation
Architecture diagram:
Protected Application
RAMSES Crash Monitor
  Catches only exceptions of interest
  Snapshots for a given period
  Self healer
RAMSES Crash Analyzer
  Fault type detection
  Security-oriented analysis
  Feedback
Windows Debug Engine (target of two “Uses” arrows)
  Infrastructure: Save Crash Dump, Extract Relevant Info, Search/Match, Disassemble
Numbered flow labels (1 to 5): Crash (Exception), Generate Crash Dump*, Provide Input History, Analyze, Signature
* Crash Dump provides the same interface as LIVE process, so Crash Analyzer actually does NOT have to work on saved crash dump file.
Testing
Test Attacks & Applications
Attack | Vulnerability | Target App | App Lang | Exploited Lang | Targets
phpBB SQL Injection | CAN-2003-0486 | phpBB | PHP | SQL | Database
SquirrelMail Command Injection | CAN-2003-0990 | SquirrelMail | PHP | cmd/shell | Server
SquirrelMail XSS Attack | CAN-2002-1341 | phpBB | PHP | JavaScript | 3rd party clients
PHP XML-RPC | CAN-2005-1921 | PHP Library | PHP | XML | Server
HTTP Splitting | CR LF escapes | WebGoat | Java | HTTP Request |
HTTP Splitting Cache Poisoning | tainted expiration field | WebGoat | Java | HTTP Request | Server page cache
Path Based Access Control | tainted file open | WebGoat | Java | file path | Server
Xpath injection | tainted xpath string | WebGoat | Java | Xpath Library | Server
JSON injection | flawed architecture | WebGoat | Java | JSON | Server Application
XML inject | flawed architecture | WebGoat | Java | XML | Server Application
Baseline Applications
  phpBB (php)
  squirrelMail (php)
  WebGoat (java)
  hMailServer (C++)
Many “sub languages”
  SQL, XML, JavaScript, HTML, HTTP, JSON, shell, cmd, path
Possible Testbed Configurations
Three testbed diagrams, each showing an Attacker, a Web Server (IIS/Apache) with files, Web Apps, a SQL Database (MySQL) and a Mail Server, with a different Protected System boundary. Captions:
Baseline testbed setup
Can extend protected system to include Mail Server
Protect Mail server exposed as a service.
Protect just mail server in context of Web service.
Traffic Generation
Purpose
  Coverage of legitimate structural variation in monitored structures
    SQL, command strings, call parameters
  Stress of log complexity for practicality
    Multiple users, multiple sessions
  Performance measurements
    Program performance metrics
    Quantify performance impact
Traffic Generation to Web Sites
Approaches
  Simple Record/Playback (basic)
    with minor substitutions (cookies, ips)
    shell scripts, netcat, MaxQ (jython based)
  Custom DOM/Ajax scripting (learning)
    Can access dynamically generated browser content after (during) client side script eval
    Automated site crawls of URLs
    Automated form contents (site specific metadata)
  COTS tools
    Load testing and metrics
Red Team Suggestions
Suggested Red Team ROEs
Initial telecons held in Fall
Claim: RAMSES will defeat most generalized injection attacks on protected applications
  Red Team should target our current and planned applications rather than new ones (unless new application, sample attacks and complete traffic generator can be provided to RAMSES far enough in advance for learning and testing)
Remote network access to the targeted application
  Attack designated application suite
Required instrumentation yet to be determined
Red Team exercise start 15 April or later
……
RAMSES Project Schedule
Gantt chart spanning CY06 Q3 through CY09 Q3
Baseline Tasks
  1. Refine RAMSES Requirements
  2. Design RAMSES
  3. Develop Components
  4. Integrate System
  5. Analyze & Test RAMSES
  6. Coordinate & Rept
Prototypes: 1, 2, 3
Optional Tasks
  Red Team Exercise
  O.3 Cross-Area Exper
Today: 11 September 2007
Next Steps
Plans
Develop input filters from output policies
Extend memory error analyzer
Demonstrate RAMSES on more applications and attack types
  Native C/C++ app (most likely app is hMail server)
  Java
Integrate components
Performance and false positive testing
Red Team exercise
Questions?
Backup
Tokenizing and Parsing
Focus on “rough” parsing that reveals approximate
structure, but not necessarily all the details
Accurate parsers are time-consuming to write
More important: may not gracefully handle errors (common in
HTML) or language extensions and variations (different shells,
different flavors of SQL)
Implemented using Flex/Bison
Currently done for SQL and shell command languages
Parse into a sequence of statements, each statement consisting of a “command name” and “parameters”
Incorporates a notion of confidence to deal with complex
language features, e.g., variable substitutions in shell
Modest effort for adding additional languages, but substantially simplifies subsequent learning tasks
Don’t anticipate significant additions to this language list
(other than HTML/XML)
Taint inference Vs Taint-tracking
Disadvantages of learning
False negatives if inputs transformed before use
Low likelihood for most web apps
False positives due to coincidence
Mitigated using statistical information
Plan to evaluate these experimentally
Benefits of learning
Low performance overhead
Some significant implicit flows handled without incurring high
false positives
Can address multi-step attacks where tainted data is first stored in a file/database before use
More generally, in dealing with information flow that crosses
module boundaries
Attack Coverage 2004
Pie chart of CVE vulnerabilities (Ver. 20040901) by category; a “Generalized Injection Attacks” label marks a group of slices.
Memory errors (Stack-smashing, heap overflow, integer overflow, data attacks): 27%
Other logic errors: 22%
Command injection: 15%
Directory traversal: 10%
Input validation/DoS: 9%
Tempfile: 4%
Format string: 4%
Cross-site scripting: 4%
Config errors: 3%
SQL injection: 2%
RAMSES System Concept
Architecture diagram labels: Internet; Network/App Firewall (e.g. mod_security); Protected System with Web Server (IIS/Apache), Web App (PHP/ASP) and SQL Database (MySQL)
RAMSES Interceptors: Network DLLs, OS DLLs, Application DLLs
RAMSES Components
  Event Collector
    parse/decode/normalize HTTP requests, parameters, cookies, …
  Attack Detector
    Address-space randomization
    Taint-based policies, anomalies
  Filter Generator
    Output filter
    Input filter
Key research problems
  Learn taint propagation
    Identify tainted components in output, generate filtering criteria
  Learn input/output transformation
    Use transformation to project output filters to input
Advantages of RAMSES Filters
Filters easily sharable
  Complements Application Community focus on end user applications
Filters are human readable
Filter generation algorithms can be enhanced to address privacy concerns wrt sharing
Filter types
Filter Location
  Input filter
    Easier to deploy but harder to synthesize
  Output filter (precedes sensitive operation)
    Easier to synthesize than input filter, but deployment needs deeper instrumentation
    May be too late for some attacks (memory corruption)
Filter Criteria
  Correlative filters
    Equality-based filter
    Structure-based filter
    Statistical filter
  Causal filters
    Filtering criteria derived from attack detection criteria (policy or anomaly)
Note: All filters evaluated using large number of benign samples and 1 attack sample