Eastern Connecticut State University Journey Through Big Data


04/17/2015
Eastern Connecticut State University
Roland DePratti
Dr. Garrett Dancik
Dr. Sarah Tasneem

Initiated in September 2013 to align Data Management and Bioinformatics topics

Hadoop programming arose as the natural synergy topic
◦ It was seen as the natural consolidation of a number of areas in CS
◦ A growing discipline with a concrete theoretical and practical foundation
◦ Great job opportunities for our students
◦ Could result in valuable assets that could be leveraged across university departments
Initial research completed last summer
◦ Development of the Big Data Team
◦ Completed summary research on the topic
◦ Identified Cloudera as our academic partner
◦ Reviewed Cloudera support materials
◦ Identified grants to support the work
Presentation url: http://www1.easternct.edu/deprattir/ccscne-2015-content/






Solve the challenges!
◦ Complete team training
◦ Develop course materials
◦ Complete a test run with 2 independent study students (Fall 2015)
◦ Kick off as a CS Topics class (Spring 2016)
◦ Develop future goals and roadmap

We are halfway through this process
◦ A lot still to learn

We want to share the decisions we face around four of six identified
challenges

We are looking for input from others (both during the conference and afterward) who are ahead of or behind us

We hope the input and collaboration will result in better knowledge delivery to our students

We will document our experiences and results for future presentations

Selection of course topics (Roland)

Keeping up with the speed of change (Roland)

Ensuring proper prerequisite knowledge (Garrett)

Managing the lab environment (Sarah)

Software platform stability

Developing meaningful lab exercises

Teach the concepts; the technology will change

Teach the future, not the past
◦ Spark vs. MapReduce (see the sketch after this list)

Show how the platform works together
◦ Relational -> Sqoop -> HDFS -> MapReduce/Spark

Build on what they already know
◦ Relational DBMS, Java, SQL

Use lab exercises that tie in other CS topics
◦ Data Mining
◦ Bioinformatics
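To make the Spark vs. MapReduce contrast concrete, here is a minimal word count in the Spark Java API, the same exercise that takes a mapper, a reducer, and a driver class in classic MapReduce. This is a sketch only: the Spark 2.x lambda style, the HDFS paths, and the class name are our assumptions, not material from the course.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class WordCount {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("WordCount"));

        // Read a text file from HDFS (placeholder path)
        JavaRDD<String> lines = sc.textFile("hdfs:///user/student/input.txt");

        // Tokenize, pair each word with a count of 1, then sum the counts per word
        JavaPairRDD<String, Integer> counts = lines
            .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
            .mapToPair(word -> new Tuple2<>(word, 1))
            .reduceByKey((a, b) -> a + b);

        counts.saveAsTextFile("hdfs:///user/student/output");
        sc.close();
    }
}

In the pipeline bullet above, a job like this would sit at the end: Sqoop moves relational tables into HDFS, and Spark or MapReduce reads them from there.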
Topic: Linux operating system
◦ Selected required coverage: directory structure, file management, text editors, core commands
◦ Current coverage: none

Topic: Java
◦ Selected required coverage: basic Java programming, abstract classes and interfaces, serialization, JUnit testing, Log4j framework
◦ Current coverage: object-oriented Java programming course

Topic: Eclipse IDE
◦ Selected required coverage: Java programming, generating JAR files, using JUnit, Log4j
◦ Current coverage: object-oriented Java programming course


Challenge: Students need additional Java / Eclipse experience, may be "rusty", and do not have Linux experience
Possible solutions:
◦ Offer a 1-credit laboratory course as a co-requisite to Big Data programming
◦ Offer a 1-credit "Programming in a Linux environment" course that would be a pre/co-requisite to Big Data programming and could also be taken by others
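As a taste of the Java gap identified above, a warm-up lab might ask students to write something like the following JUnit 4 test with Log4j 1.x logging; the class and method names here are hypothetical, not from our materials.

import org.apache.log4j.Logger;
import org.junit.Test;
import static org.junit.Assert.assertEquals;

public class TokenizerTest {
    // Log4j logger, configured via a log4j.properties file on the classpath
    private static final Logger LOG = Logger.getLogger(TokenizerTest.class);

    @Test
    public void splitsOnWhitespace() {
        LOG.info("Running splitsOnWhitespace");
        String[] words = "big data programming".split("\\s+");
        assertEquals(3, words.length);
    }
}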
In-House Cluster
◦ Create clusters of computers on campus (limited size)
◦ Establishment and maintenance costs fall on university IT

Infrastructure as a Service (IaaS): a scalable replacement for local IT
◦ Access infrastructure resources in the cloud in the form of virtual machines
◦ No maintenance burden
◦ Students can use the same tools professionals use
◦ AWS offers virtualized platforms on a pay-as-you-use basis, so be careful not to waste computing resources

• Cloud computing is a useful modern problem-solving tool
• Many universities are incorporating cloud computing into the curriculum
• Related knowledge and skills are becoming fundamental for computing professionals
• The course will provide students with hands-on cloud computing experience
• Students will experience cutting-edge tools that will help them grow professionally

Selection of course topics

Keeping up with the speed of change

Ensuring proper prerequisite knowledge

Managing the lab environment

Software platform stability

Developing meaningful lab exercises
Additional References and Content
1. Albrecht, J., 2009. Bringing big systems to small schools: distributed systems for undergraduates. SIGCSE '09: Proceedings of the 40th ACM Technical Symposium on Computer Science Education.
2. Garrity, et al., 2011. WebMapReduce: an accessible and adaptable tool for teaching map-reduce computing. SIGCSE '11: Proceedings of the 42nd ACM Technical Symposium on Computer Science Education.
3. Lin, J., 2008. Exploring large-data issues in the curriculum: a case study with MapReduce. TeachCL '08: Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics.
4. Mahadev, A. & Wurst, K., 2015. Developing Concentrations in Big Data Analytics and Software Development at a Small Liberal Arts University. Journal of Computing Sciences in Colleges, Volume 30, Issue 3.
5. Brandon, D., 2015. Teaching Data Analytics Across the Computing Curricula. Journal of Computing Sciences in Colleges, Volume 30, Issue 5.
6. Wolffe, G., 2009. Teaching Parallel Computing: New Possibilities. Journal of Computing Sciences in Colleges, Volume 25, Issue 1.
7. Brown, R., et al., 2010. Strategies for Preparing CS Students for the Multicore World. Proceedings of the 2010 ITiCSE Working Group Reports.
8. ACM/IEEE Computer Science Curricula 2013, www.acm.org/education/CS2013-final-report.pdf (accessed 3/16/2015).
HDFS: Hadoop Distributed File System, a user-defined file system that manages larger blocks and provides file management across a distributed system.

Avro: A remote procedure call and data serialization framework developed within Apache's Hadoop project.

LZO: Lempel-Ziv-Oberhumer, a lossless algorithm that compresses data to ensure high decompression speed.

MapReduce: A programming model for processing and generating large data sets with a parallel, distributed algorithm on a cluster (sketched below).

Spark: An open-source cluster computing framework, originally developed in the AMPLab at UC Berkeley, that uses in-memory primitives to speed up performance.

Tez: The Apache Tez project is aimed at building an application framework that allows a complex directed acyclic graph of tasks to process data.

Cascading: A software abstraction layer for Apache Hadoop, used to create and execute complex data processing workflows on a Hadoop cluster in any JVM-based language while hiding the underlying complexity of MapReduce jobs.

Scalding: A Scala library that makes it easy to write MapReduce jobs in Hadoop. It is similar to other MapReduce platforms like Pig and Hive, but offers a higher level of abstraction by leveraging the full power of Scala and the JVM. Scalding is built on top of Cascading.
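To ground the MapReduce entry, here is a minimal sketch of the programming model using the Hadoop Java API: the map and reduce halves of word count. The class names are ours, and we assume Hadoop 2.x with the org.apache.hadoop.mapreduce API.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit (word, 1) for every token in the input split.
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            word.set(token);
            context.write(word, ONE);
        }
    }
}

// Reduce phase: sum the 1s for each word; the framework performs the
// shuffle that groups values by key between the two phases.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}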
MongoDB: MongoDB (from "humongous") is one of many cross-platform document-oriented databases.

Cassandra: Apache Cassandra is an open-source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.

HBase: An open-source, non-relational, distributed database modeled after Google's BigTable and written in Java (sketched below).

Redis: A data structure server. It is open-source, networked, in-memory, and stores keys with optional durability.
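Of the stores above, HBase is the one our students would most likely touch from Java. A hedged sketch against the HBase 1.x client API follows; the table name, column family, and values are invented for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("students"))) {

            // Write one cell: row key, column family, qualifier, value
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
            table.put(put);

            // Read the same cell back
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}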
Sqoop: A command-line interface application for transferring data between relational databases and Hadoop.

Pig: A high-level platform for creating MapReduce programs used with Hadoop.

Flume: A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
Spark SQL: A component on top of Spark Core that introduces a data abstraction called SchemaRDD, which provides support for structured and semi-structured data.

Hive: Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. It was initially developed by Facebook.

Impala: Cloudera Impala is Cloudera's open-source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop.

Drill: Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Drill is the open-source version of Google's Dremel system, which is available as an infrastructure service called Google BigQuery.
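A short sketch of the SQL-on-Hadoop idea using the Spark SQL Java API. Note that the SchemaRDD abstraction named above was renamed DataFrame/Dataset in later Spark releases; the file path and query below are illustrative assumptions only.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("SparkSqlDemo").getOrCreate();

        // Load semi-structured JSON and expose it as a SQL view
        Dataset<Row> people = spark.read().json("hdfs:///user/student/people.json");
        people.createOrReplaceTempView("people");

        // Query it with ordinary SQL
        spark.sql("SELECT name FROM people WHERE age >= 18").show();

        spark.stop();
    }
}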
Oozie: A workflow scheduler system to manage Hadoop jobs.
Spark Streaming: Leverages Spark Core's fast scheduling capability to perform streaming analytics.

Storm: Apache Storm is a distributed computation framework that allows batched, distributed processing of streaming data.

Kafka: Apache Kafka is an open-source message broker project which aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.

Samza: Apache Samza is an open-source project developed by the Apache Software Foundation, written in Scala, that aims to provide a near-real-time, asynchronous computational framework for stream processing.
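To illustrate the streaming entries, here is a minimal Spark Streaming word count in Java, processing lines that arrive on a network socket in five-second micro-batches. The host and port are placeholders, and the Spark 2.x lambda style is our assumption.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class StreamingWordCount {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("StreamingWordCount");
        // Process the stream in five-second micro-batches
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // Lines arriving on a TCP socket (placeholder host/port)
        JavaReceiverInputDStream<String> lines = ssc.socketTextStream("localhost", 9999);

        // Same word count logic as the batch examples, applied per batch
        JavaPairDStream<String, Integer> counts = lines
            .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
            .mapToPair(word -> new Tuple2<>(word, 1))
            .reduceByKey((a, b) -> a + b);
        counts.print();

        ssc.start();
        ssc.awaitTermination();
    }
}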
MLlib: A distributed machine learning framework on top of Spark.

All definitions above were sourced from Wikipedia or the Apache project websites.
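Finally, a sketch for the MLlib entry: k-means clustering over a few in-memory points via the Spark Java API. The data points are fabricated purely for illustration.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

public class KMeansDemo {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("KMeansDemo"));

        // Four 2-D points forming two obvious clusters
        JavaRDD<Vector> points = sc.parallelize(Arrays.asList(
            Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0),
            Vectors.dense(8.0, 9.0), Vectors.dense(9.0, 8.0)));

        // k = 2 clusters, at most 20 iterations
        KMeansModel model = KMeans.train(points.rdd(), 2, 20);
        for (Vector center : model.clusterCenters()) {
            System.out.println(center);
        }
        sc.close();
    }
}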