Transcript Slide 1


Define big data and understand how it is differentiated
from “regular old” data.

Recognize examples and applications of big data.

Understand the key problems we are trying to solve
when coping with big data.

Become aware of the “solutions” that people are using
to cope with big data.
1

Definitions differ depending on perspective.

Data that is difficult to process using traditional
database and software techniques (abbreviated from
Wikipedia/Webopedia).

“Big” is relative to the organization.

“Big” is relative in time.
2

Volume

Variety

Variability

Velocity

Veracity
3

Dealing with different types of data.
◦ Data that doesn’t have a clear data type.
◦ Data that changes data type.
◦ Unstructured data: does not have a pre-defined data model;
usually text.

Storing and accessing incredibly large quantities of data.

Transforming and loading data immediately.

Performing analytics immediately.

Using “big data” to create “real information”.
4

Rows and columns don’t work.

Need a “file” or “document” type of management system.

Examples:
◦ MongoDB
◦ VelocityDB
◦ Apache Hadoop (HDFS)
◦ Oracle NoSQL
◦ CouchDB
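The slides list document stores such as MongoDB and CouchDB. As a minimal sketch of the idea (not any of those products' actual APIs), a hypothetical in-memory "document store" can hold free-form records with no fixed rows and columns:

```python
# Minimal in-memory "document store" sketch (hypothetical; not MongoDB's
# or CouchDB's real API). Documents are free-form dicts -- no fixed schema.
class DocumentStore:
    def __init__(self):
        self._docs = []

    def insert(self, doc):
        self._docs.append(doc)

    def find(self, predicate):
        """Return all documents for which predicate(doc) is True."""
        return [d for d in self._docs if predicate(d)]

store = DocumentStore()
# Two documents with different "shapes" -- a single table schema fits neither.
store.insert({"name": "Ada", "skills": ["python", "sql"]})
store.insert({"name": "log entry", "raw_text": "ERROR disk full", "host": "n1"})

errors = store.find(lambda d: "ERROR" in d.get("raw_text", ""))
```

Real document databases add indexing, persistence, and replication on top of this basic "store and query arbitrary documents" model.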
5

Distribute processing of very large multi-structured
data files across a large cluster of ordinary
machines/processors
◦ MapReduce
◦ Sharding/Horizontal partitioning

Break the data into parts, which are then loaded into a
file system on multiple nodes.

Each part may be replicated multiple times.

The results are collected and aggregated using a
MapReduce algorithm, or other type of partitioning
algorithm.
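The steps above can be sketched for a word-count job: hash-partition (shard) the input across nodes, map each shard to intermediate pairs, then reduce the pairs into aggregated results. The function names are illustrative, not Hadoop's actual API, and replication of shards is omitted for brevity:

```python
from collections import defaultdict

def shard(records, n_nodes):
    """Hash-partition records across n_nodes (horizontal partitioning)."""
    nodes = defaultdict(list)
    for rec in records:
        nodes[hash(rec) % n_nodes].append(rec)
    return nodes

def map_phase(shard_records):
    """Map step: emit (word, 1) pairs from each record; runs per node."""
    return [(word, 1) for line in shard_records for word in line.split()]

def reduce_phase(all_pairs):
    """Reduce step: collect and aggregate the counts per word."""
    totals = defaultdict(int)
    for word, count in all_pairs:
        totals[word] += count
    return dict(totals)

lines = ["big data big cluster", "data node cluster", "big node"]
shards = shard(lines, n_nodes=2)
pairs = [p for recs in shards.values() for p in map_phase(recs)]
counts = reduce_phase(pairs)
# counts["big"] == 3, counts["data"] == 2
```

Because each map call only sees its own shard, the map phase can run on many ordinary machines in parallel; only the reduce phase needs the collected intermediate results.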
6

Lots of memory; really fast disk

In-memory computing
◦ HANA (SAP)
◦ DB2 BLU (IBM)
◦ Informix (IBM)
◦ ActiveSpaces (TIBCO Software)
◦ Oracle
Database appliance: marketing term for an integrated
set of servers, storage, operating system, and DBMS
specifically pre-installed and pre-optimized for data
warehousing (Wikipedia rules!!)
7

Transform after loading data. Perform data loading and
transformation continuously.
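The "load first, transform continuously" idea above can be sketched as a tiny ELT loop: raw records land untransformed, and a recurring pass cleans whatever arrived since the last pass. All names here are illustrative, not any real ETL tool's API:

```python
raw_store = []    # landing zone: raw records loaded as-is
clean_store = []  # transformed records
_cursor = 0       # how far transformation has progressed

def load(record):
    """Step 1: load raw data immediately, with no transformation."""
    raw_store.append(record)

def transform_pending():
    """Step 2: transform only the records loaded since the last pass."""
    global _cursor
    while _cursor < len(raw_store):
        rec = raw_store[_cursor]
        clean_store.append({"text": rec.strip().lower()})  # toy cleanup
        _cursor += 1

load("  Sensor OK ")
load("SENSOR FAIL")
transform_pending()  # in a real pipeline this would run continuously
load("sensor ok")
transform_pending()
```

A real continuous pipeline would run the transform pass on a schedule or trigger, but the cursor-over-a-landing-zone shape is the same.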

Problems:
◦ Most data transformation tools are not designed to work well
with unstructured data.
◦ Few frameworks are currently focusing on ETL, because the data
is not “mission critical.”

Opportunities!!!!
8

Define the need for immediacy: real-time, or close to it?

Streaming analytics: process data as it arrives; usually
compares only against a pre-defined “window” of recent
time/data rather than against all existing data. May or
may not store the results of the analytical processes.

Perpetual analytics: process data as it arrives
comparing it against existing data and then storing the
results of the analytics.
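The contrast above can be sketched with two toy analytics: a streaming average that consults only a fixed window of recent values, and a perpetual check that compares each arrival against all data seen so far and stores every result. Both classes are hypothetical illustrations:

```python
from collections import deque

class StreamingAverage:
    """Streaming analytics: only a bounded window of data is kept."""
    def __init__(self, window_size):
        self.window = deque(maxlen=window_size)  # old values fall off

    def arrive(self, value):
        self.window.append(value)
        return sum(self.window) / len(self.window)

class PerpetualMax:
    """Perpetual analytics: compare against all history, store results."""
    def __init__(self):
        self.history = []   # all data retained
        self.results = []   # analytics results stored

    def arrive(self, value):
        self.history.append(value)
        is_record = value >= max(self.history)  # compare vs. everything
        self.results.append((value, is_record))
        return is_record

s = StreamingAverage(window_size=3)
averages = [s.arrive(v) for v in [10, 20, 30, 40]]  # last avg uses 20, 30, 40

p = PerpetualMax()
flags = [p.arrive(v) for v in [10, 20, 15]]
```

The key difference is memory: the streaming version's state is bounded by the window size no matter how much data arrives, while the perpetual version's state grows with the full history.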
9

Culture of data-driven decision making.

Data scientist.

Information visualization techniques.
10
DATA SCIENTIST skill areas (diagram):
◦ Domain Expertise, Problem Definition and Decision Modeling
◦ Data Access and Management (both traditional and new data systems)
◦ Communication and Interpersonal
◦ Curiosity and Creativity
◦ Programming, Scripting and Hacking
◦ Internet and Social Media/Social Networking Technologies
11
Typical Job Post for Data Scientist (figure)
12