Transcript Slide 1
Introduction to Business Intelligence (CIT625)
Data Mining Techniques and Applications

Data Mining Techniques
• In data mining techniques we focus on understanding the methods used to analyse a set (or subset) of data.
• To do this, you need to understand the four classes of task involved in data mining:
– Classification: Arranges the data into predefined groups. For example, an email program might attempt to classify an email as legitimate or spam. Common algorithms include nearest neighbor, the naive Bayes classifier and neural networks.

Data Mining Techniques
– Clustering: Is like classification, but the groups are not predefined, so the algorithm tries to group similar items together.
– Regression: Attempts to find a function that models the data with the least error. Common methods include least-squares fitting; genetic programming is another option.

Data Mining Techniques
• Association rule learning (mining): Searches for relationships between variables. For example, a supermarket might gather data on what each customer buys. Using association rule learning, the supermarket can work out which products are frequently bought together, which is useful for marketing purposes. This is sometimes referred to as "market basket analysis".

Applications of Techniques
• Data mining is used for a variety of purposes in both the private and public sectors.
• These techniques can be applied in companies with a strong consumer focus (retail, financial, communication, and marketing organisations), as will be shown below.

Data Mining Techniques
• The ultimate goal of data mining is prediction.
• Predictive data mining is the most common type of data mining and the one that has the most direct business applications.
• The process of data mining consists of three stages:
– The initial exploration.
– Model building or pattern identification, with validation/verification.
– Deployment (i.e., the application of the model to new data in order to generate predictions).
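
As a rough illustration of the three stages above, the following Python sketch runs them on the spam example using a nearest-neighbor classifier. The records, the two features (number of links and number of exclamation marks in a message) and the function names are all assumptions made up for this illustration; they are not taken from the course material.

```python
import math

# Hypothetical labelled records: ((links, exclamation marks), class label).
# The feature values are invented purely for illustration.
records = [
    ((0, 0), "legitimate"),
    ((1, 0), "legitimate"),
    ((0, 1), "legitimate"),
    ((1, 1), "legitimate"),
    ((6, 5), "spam"),
    ((7, 4), "spam"),
    ((5, 6), "spam"),
    ((8, 7), "spam"),
]


def class_counts(data):
    """Stage 1 (initial exploration): count the records in each class."""
    counts = {}
    for _, label in data:
        counts[label] = counts.get(label, 0) + 1
    return counts


def predict(training, x):
    """1-nearest-neighbor: return the label of the closest training record."""
    nearest = min(training, key=lambda rec: math.dist(rec[0], x))
    return nearest[1]


def accuracy(training, held_out):
    """Stage 2 (validation): fraction of held-out records predicted correctly."""
    correct = sum(1 for x, y in held_out if predict(training, x) == y)
    return correct / len(held_out)


# Stage 2 (model building): keep some records aside for validation.
training = records[0:3] + records[4:7]
held_out = [records[3], records[7]]

# Stage 3 (deployment): apply the model to a new, unlabelled message.
new_message = (6, 6)

print(class_counts(records))
print(accuracy(training, held_out))
print(predict(training, new_message))
```

A nearest-neighbor model needs no separate fitting step, so "model building" here amounts to choosing which records to keep as the training set; other algorithms would fit parameters at this stage.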
More on Classification
• Classification can provide valuable support for informed decision making in an organisation.
• For example, it may classify each person as a potential buyer or non-buyer based on personal information such as income, occupation, lifestyle, and credit rating.

Classification
Table 1. Vertebrate Data Set

Classification
• In the slide above, Table 1 shows a sample data set used for classifying vertebrates into one of the following categories: mammal, bird, fish, reptile, or amphibian.
• The attribute set includes properties of a vertebrate such as its body temperature, skin cover, method of reproduction, ability to fly and ability to live in water.

What is Classification
• Classification can be described as the task of assigning objects to one of several predefined categories.
[Diagram: input attribute set (x) -> classification model -> output class label (y)]
The diagram shows classification as the task of mapping an input attribute set x into its class label y.

Simple Definition
• Classification is the task of learning a target function f that maps each attribute set x into one of the predefined class labels y.
• The target function is also known informally as a classification model.

Usefulness of Classification Model
• A classification model is useful for the following purposes:
– It may serve as an explanatory tool to distinguish between objects of different classes (Descriptive Modeling).
– It may also be used to predict the class label of unknown records (Predictive Modeling).
Consider the table below:

Usefulness of Classification Model
• A classification model can be treated as a black box that automatically assigns a class label when presented with the attribute set of an unknown record.
• For example, you may be given the characteristics of a creature known as a gila monster.

Usefulness of Classification Model
• By building a classification model from the data set shown in Table 1, you can use the model to determine the class to which the creature belongs.
• Classification models are best suited to predicting or describing data sets with binary or nominal target attributes.

Classification Technique
• A classification technique is a systematic approach to building classification models from an input data set.
• Examples of classification techniques include:
– Decision Tree Classifiers
– Rule-Based Classifiers
– Neural Networks
– Support Vector Machines
– Naive Bayes Classifiers
– Nearest-Neighbor Classifiers

Classification Technique
• Each technique employs a learning algorithm to identify a model that best fits the relationship between the attribute set and the class label of the input data (that is, a model that produces outputs consistent with the class labels of the input data).

Classification Technique
• A good classification model must also correctly predict the class labels of records it has never seen before.
• Building models with good generalization capability, i.e., models that accurately predict the class labels of previously unseen records, is therefore a key objective of the learning algorithm.

General Approach to Solve a Classification Problem
• A general strategy for solving a classification problem is as follows:
– First, the input data is divided into two disjoint sets, known as the training set and the test set.
• The training set is used for building a classification model.
• The induced model is later applied to the test set to predict the class label of each test record.

Classification Applications
• Classification is often used to detect fraud and to assess risk in finance and banking.
• Homeland security departments use classification to identify activities linked to terrorism, such as money transfers and communications, and to identify and track individual terrorists, for example through travel and immigration records.

Association
• Association analysis can be used to improve marketing strategy by analysing frequent itemsets.
• As the marketing manager of Company X, for instance, you might want to determine which items are frequently purchased together within the same transactions.

Application of Association
• An example of such a rule, mined from the Company X transactional database, is
buys(X; “computer”) => buys(X; “software”) [support = 1%; confidence = 50%]
where the X inside the rule is a variable representing a customer.
• A confidence, or certainty, of 50% means that if a customer buys a computer, there is a 50% chance that she will buy software as well.

Application of Association
• A support of 1% means that 1% of all the transactions under analysis showed that a computer and software were purchased together.
• This association rule involves a single attribute or predicate (i.e., buys) that repeats. Association rules that contain a single predicate are referred to as single-dimensional association rules.

Application of Association
• In addition to the marketing application, the same sort of question has the following uses:
• Baskets = documents; items = words. Words appearing frequently together in documents may represent phrases or linked concepts. This can be used for intelligence gathering.

Application of Association
• Baskets = sentences; items = documents. Two documents with many of the same sentences could represent plagiarism or mirror sites on the Web.

Clustering
• Data clustering is a method in which we form clusters of objects that are similar in some of their characteristics.
• The criterion for checking similarity is implementation dependent.

Clustering
• Clustering is often confused with classification, but the two differ.
• In classification the objects are assigned to predefined classes, whereas in clustering the classes themselves must also be determined.

Clustering
• More precisely, data clustering is a technique in which information that is logically similar is physically stored together.
• To increase efficiency in database systems, the number of disk accesses must be minimized.
• In clustering, objects with similar properties are placed in one class of objects, and a single access to the disk makes the entire class available.

Clustering
• By definition, a cluster is an ordered list of objects that have some common characteristics. The objects belong to an interval [a, b], in our case [0, 1] [1].

What can be Clustered?
• Images (astronomical data)
• Patterns (e.g. robot vision data)
• Shopping items
• Feet (i.e. anatomical data)
• Words
• Documents, etc.

Application of Clustering
Similarity searching in Medical Image Database
• This is a major application of the clustering technique. To detect diseases such as tumors, scanned pictures or X-rays are compared with existing ones and the dissimilarities are identified.

Application of Clustering
• This technique supports the development of population segmentation models, such as demographic-based customer segmentation.
• For example, the buying habits of multiple population segments might be compared to determine which segments to target for a new sales campaign.

Application of Clustering
• For example, a company that sells a variety of products may need to know the sales of all its products in order to see which products are selling well and which are lagging.
• This is done with data mining techniques. If the system clusters the products that are selling poorly, then only that cluster needs to be checked, rather than comparing the sales values of all the products. This makes the mining process easier.

Applications Miscellaneous
• With data mining, a retailer could use point-of-sale (PoS) records of customer purchases to send targeted promotions based on an individual's purchase history.
By mining demographic data from comment or warranty cards, the retailer could develop products and promotions to appeal to specific customer segments.

Applications Miscellaneous
• The National Basketball Association (NBA) is exploring a data mining application that can be used in conjunction with image recordings of basketball games.
– The “Advanced Scout” software analyzes the movements of players to help coaches orchestrate plays and strategies.

Applications Miscellaneous
• For example, an analysis of the play-by-play sheet of the game played between the New York Knicks and the Cleveland Cavaliers on January 6, 1995 reveals that when Mark Price played the Guard position, John Williams attempted four jump shots and made each one! “Advanced Scout” not only finds this pattern, but explains that it is interesting because it differs considerably from the average shooting percentage of 49.30% for the Cavaliers during that game.

How does Data Mining Work?
• Data mining software analyzes relationships and patterns in stored transaction data based on open-ended user queries.
• Four types of relationships are sought, using several types of available analytical software:

How does Data Mining Work?
– Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by offering daily specials.
– Clusters: Data items are grouped according to logical relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities.

How does Data Mining Work?
– Associations: Data can be mined to identify associations. The bread-cheese example (customers who buy bread tending to buy cheese as well) is an example of associative mining.
– Sequential patterns: Data is mined to anticipate behavior patterns and trends.
For example, an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes.

How does Data Mining Work?
• Data mining consists of five major elements:
– Extract, transform, and load transaction data onto the data warehouse system.
– Store and manage the data in a multidimensional database system.
– Provide data access to business analysts and information technology professionals.
– Analyze the data using application software.
– Present the data in a useful format, such as a graph or table.
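
The associations described above, such as the bread-cheese example, are usually quantified with the support and confidence measures used earlier for the buys(X; “computer”) rule. The Python sketch below shows one way those two numbers could be computed. The five baskets and the function names are hypothetical, chosen only to make the arithmetic easy to follow.

```python
# Hypothetical market baskets: each set is one customer transaction.
transactions = [
    {"bread", "cheese", "milk"},
    {"bread", "cheese"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "cheese", "butter"},
]


def count(itemset, baskets):
    """Number of baskets that contain every item in itemset."""
    wanted = set(itemset)
    return sum(1 for basket in baskets if wanted <= basket)


def support(itemset, baskets):
    """Fraction of all baskets that contain the whole itemset."""
    return count(itemset, baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the antecedent, the fraction that also
    contain the consequent. Returns 0.0 if the antecedent never occurs."""
    with_antecedent = count(antecedent, baskets)
    if with_antecedent == 0:
        return 0.0
    both = count(set(antecedent) | set(consequent), baskets)
    return both / with_antecedent


# Rule: buys bread => buys cheese
print(f"support = {support({'bread', 'cheese'}, transactions):.0%}")        # support = 60%
print(f"confidence = {confidence({'bread'}, {'cheese'}, transactions):.0%}")  # confidence = 75%
```

Here bread and cheese appear together in 3 of the 5 baskets (support 60%), and 3 of the 4 baskets containing bread also contain cheese (confidence 75%).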