Transcript Slide 1
Define big data and understand how it is differentiated from “regular old” data.
Recognize examples and applications of big data.
Understand the key problems we are trying to solve when coping with big data.
Become aware of the “solutions” that people are using to cope with big data.

Slide 2
Definitions differ depending on perspective.
Data that is difficult to process using traditional database and software techniques (abbreviated from Wikipedia/Webopedia).
“Big” is relative to the organization.
“Big” is relative in time.

Slide 3
Volume
Variety
Variability
Velocity
Veracity

Slide 4
Dealing with different types of data.
◦ Data that doesn’t have a clear data type.
◦ Data that changes data type.
◦ Unstructured data: does not have a pre-defined data model; usually text.
Storing and accessing incredibly large quantities of data.
Transforming and loading data immediately.
Performing analytics immediately.
Using “big data” to create “real information”.

Slide 5
Rows and columns don’t work.
Need a “file” or “document” type of management system.
Examples:
◦ MongoDB
◦ VelocityDB
◦ Apache Hadoop (HDFS)
◦ Oracle NoSQL
◦ CouchDB

Slide 6
Distribute processing of very large multi-structured data files across a large cluster of ordinary machines/processors.
◦ MapReduce
◦ Sharding/Horizontal partitioning
Break the data into parts, which are then loaded into a file system on multiple nodes.
Each part may be replicated multiple times.
The results are collected and aggregated using a MapReduce algorithm, or another type of partitioning algorithm.

Slide 7
Lots of memory; really fast disk.
In-memory computing
◦ HANA (SAP)
◦ DB2 BLU (IBM)
◦ Informix (IBM)
◦ ActiveSpaces (TIBCO Software)
◦ Oracle
Database appliance: marketing term for an integrated set of servers, storage, operating system, and DBMS specifically pre-installed and pre-optimized for data warehousing (Wikipedia rules!!)

Slide 8
Transform after loading data.
Perform data loading and transformation continuously.
Problems:
◦ Most data transformation tools are not designed to work well with unstructured data.
◦ Few frameworks are currently focusing on ETL, because the data is not “mission critical.”
Opportunities!!!!

Slide 9
Define need for immediacy. Real-time or close??
Streaming analytics: process data as it arrives; usually does not compare against all existing data; usually has a pre-defined “window” of time/data used for analytical processing. May or may not store the results of the analytical processes.
Perpetual analytics: process data as it arrives, comparing it against existing data and then storing the results of the analytics.

Slide 10
Culture of data-driven decision making.
Data scientist.
Information visualization techniques.

Slide 11
DATA SCIENTIST
◦ Domain Expertise, Problem Definition and Decision Modeling
◦ Data Access and Management (both traditional and new data systems)
◦ Communication and Interpersonal
◦ Curiosity and Creativity
◦ Programming, Scripting and Hacking
◦ Internet and Social Media/Social Networking Technologies

Slide 12
Typical Job Post for Data Scientist
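
As an illustration of the distributed-processing slide (break the data into parts, spread them over nodes, then collect and aggregate the results), here is a minimal single-machine sketch in Python. Everything in it is an assumption made for illustration: the function names (`shard`, `map_phase`, `reduce_phase`, `word_count`) are hypothetical, the "nodes" are just in-memory lists, word counting stands in for whatever analysis is actually needed, and replication of parts is not modeled. It is not how Hadoop or any specific product implements MapReduce.

```python
from collections import Counter
from functools import reduce
import zlib


def shard(records, num_nodes):
    """Horizontal partitioning: assign each record to a node by a stable hash."""
    shards = [[] for _ in range(num_nodes)]
    for record in records:
        node = zlib.crc32(record.encode("utf-8")) % num_nodes
        shards[node].append(record)
    return shards


def map_phase(part):
    """Map: each node counts the words in its own part, independently."""
    counts = Counter()
    for record in part:
        counts.update(record.lower().split())
    return counts


def reduce_phase(partials):
    """Reduce: merge the per-node partial counts into one result."""
    return reduce(lambda a, b: a + b, partials, Counter())


def word_count(records, num_nodes=3):
    """Shard the records, map each shard, then reduce the partial results."""
    parts = shard(records, num_nodes)
    # On a real cluster each part would be processed in parallel on its own node.
    partials = [map_phase(part) for part in parts]
    return reduce_phase(partials)
```

Calling `word_count(["big data", "big cluster", "data data"])` returns the same totals no matter how many nodes are used, which is the point of the design: the map step needs no knowledge of other parts, so it can be spread across ordinary machines.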
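
The difference between streaming analytics and perpetual analytics described on the immediacy slide can also be sketched in a few lines of Python. This is only a toy under stated assumptions: the class names `StreamingAverage` and `PerpetualAverage` are hypothetical, a running average stands in for the "analytical process", and the window is a fixed count of values rather than a span of time.

```python
from collections import deque


class StreamingAverage:
    """Streaming analytics: only the last `window` values are considered."""

    def __init__(self, window):
        # deque with maxlen silently discards the oldest value when full.
        self.values = deque(maxlen=window)

    def process(self, value):
        self.values.append(value)
        # The result is returned but not stored.
        return sum(self.values) / len(self.values)


class PerpetualAverage:
    """Perpetual analytics: each arrival is evaluated against all data so far."""

    def __init__(self):
        self.history = []   # all existing data
        self.results = []   # stored results of the analytics

    def process(self, value):
        self.history.append(value)
        result = sum(self.history) / len(self.history)
        self.results.append(result)
        return result
```

Feeding the values 10, 20, 30 into a `StreamingAverage(window=2)` gives 10.0, 15.0, 25.0, because the first value has left the window by the third arrival. The same values fed into `PerpetualAverage` give 10.0, 15.0, 20.0, and those results are kept in `results`.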