NC STATE UNIVERSITY
Program for North American Mobility in Higher Education
Introducing Process Integration for Environmental Control in Engineering Curricula
MODULE 17: “Introduction to Multivariate Analysis”
Created at: Ecole Polytechnique de Montreal & North Carolina State University, 2003.
NAMP Module 17: “Introduction to Multivariate Analysis” Tier 3, Rev.: 4

2.4: Example (3) Shorter Timescales

Shorter timescales

The previous two examples used daily averages for the 130 process variables. However, we could just as easily have chosen weekly averages, monthly averages, or several other options. We could also have chosen shorter timescales, such as 8-hour averages or 30-minute averages. Obviously, at some point the number of observations will become unmanageable. For instance, a spreadsheet with 3 years’ worth of 1-minute averages would have over a million lines. Simply by choosing the timescale, you are already influencing your MVA results.

Choosing a timescale

The first thing we need to understand is what timescales are available. For the TMP process we have been studying, the shortest possible time period between two logged values is 10 seconds (note that not all tags are updated this frequently).

[Figure: timescale axis running from 1 year through 1 month, 1 week, 24 h, 8 h, 1 h, 10 min and 1 min down to 10 s, marking that chips are sampled every 8 hours and pulp every 2 hours.]

Several key values, such as wood and pulp characteristics, are only measured every few hours as shown above. These tags will be of little or no use at a very short timescale.

IMPORTANT CONCEPT: Some variables can only be studied at longer timescales, others at shorter timescales, depending on their sampling/logging frequency.
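As a rough illustration of how quickly the number of observations grows as the timescale shrinks, the Python sketch below counts rows for a few averaging periods. The function names, and the rule of thumb in `tag_is_useful` (at least one real reading per averaging period), are our own assumptions for illustration; they are not part of the module or of any MVA software.

```python
SECONDS_PER_DAY = 24 * 3600


def n_observations(days, period_seconds):
    """Number of rows produced by averaging over `period_seconds` for `days` days."""
    return days * SECONDS_PER_DAY // period_seconds


def tag_is_useful(sampling_interval_seconds, timescale_seconds):
    """Assumed rule of thumb: a tag is only informative at a given timescale
    if it is sampled at least once per averaging period."""
    return sampling_interval_seconds <= timescale_seconds


# Three years (taken here as 3 x 365 days) at several averaging periods.
periods = {"10 s": 10, "1 min": 60, "30 min": 1800, "8 h": 8 * 3600, "24 h": 24 * 3600}
for label, seconds in periods.items():
    print(f"{label:>6}: {n_observations(3 * 365, seconds):>10,} rows")

# Chips are sampled every 8 hours, pulp every 2 hours.
print("chips at 10 s timescale:", tag_is_useful(8 * 3600, 10))
print("pulp at 24 h timescale:", tag_is_useful(2 * 3600, 24 * 3600))
```

Under these assumptions, 1-minute averages over three years give about 1.6 million rows, which matches the “over a million lines” figure quoted above.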
Shortest possible timescale

For the purposes of illustration, we will use the shortest possible timescale in this example, namely 10 seconds. Because some tags are updated less frequently, we will use interpolated values for all variables, which may or may not represent reality.

To keep the size of the dataset manageable, we have taken these data over a 24-hour period, which corresponds to around 9,000 observations. Because we have over 100 tags, the resulting dataset has about one million values. That is a million values per day, for only one section of the papermaking process. If we were to include the entire industrial plant over several years, we would have to analyse billions of datapoints.

PCA of entire 24-hour period

[Figure: Simca model bar chart, “Jun 2002(1). 10 seconds COMPLETE WITH 45 min LAG. M1 (PCA-X), Untitled”, showing R2X(cum) and Q2(cum) for Comp[1] to Comp[3]. Simca found numerous components; 3 were retained.]

The PCA for the entire 24-hour period shows quite a strong model, with a cumulative R2 over 60%. This is misleading, however. As shown on the score plot, there is a major process excursion which has totally skewed the MVA results.

Major process excursion

[Figure: score plot, with the major process excursion from 8h15 to 8h45 marked.]

A review of the original data indicates that production dropped below 10 t/d during a ten-minute period (8:15 to 8:25). The cause was a major refiner blockage known as a “feedguard event”, which makes the refiner motor shut down.

Exclude process excursion

The process excursion sticks out like a sore thumb on the score plot.
This means that the process temporarily went to a radically different “place” or operating regime, where relationships between the variables are different.

[Figure caption: Sticking out like a sore thumb… or a solar flare]

Trying to do PCA on several different operating regimes all at once is a waste of time. The software will try to establish the correlations between the different variables, and if these correlations change abruptly the results will be useless. The way to get around this problem is to divide the observations into different operating regimes, and study each regime separately. In this case we will remove the low production period to prevent it from skewing the rest of the results.

PCA with process excursion removed

We removed the entire period when the process was perturbed (8:10 to 8:45) and did a PCA on the rest of the observations.

[Figure: Simca model bar chart, “Jun 2002(1). 10 seconds COMPLETE WITH 45 min LAG. M2 (PCA-X), Extreme outliers removed”, showing R2X(cum) and Q2(cum) for Comp[1] to Comp[3].]

Interestingly, the R2 values went down slightly. This is because many of the variables changed abruptly all together when the process was shut down, making it look like they were “correlated” with each other. Remember, MVA knows nothing about the process, and just uses the data as it is.

Score plot of normal operation

Now that we have removed the process upset, the score plot takes on an entirely different character. There is now an obvious time trend. During our 24-hour period, the process “snakes” around in multi-dimensional space. It is a moving target. Whereas score plots for longer, averaged periods generally resemble clouds, score plots for short timescales resemble snakes. Almost all process data show this characteristic, because a real process is never really in steady state.
The process control systems are constantly responding to outside perturbations, like changes in feed material quality. Operator intervention is another source of perturbation. There are many others. One operating goal is to maintain the “snake” within a certain desirable zone.

Score plot showing time trend

[Figure: score plot with the start (01:00) and end (00:59) of the 24-hour period marked. Obvious time trend…]

What is the significance?

This “snaking” of the process at short timescales is highly significant. This was not seen when using the daily averages. By looking at which variables are changing with time, we can get tremendous insight into the process dynamics. One way to do this is to compare the contribution plots (like we saw in Example 2) at different times. Contribution plots for the start and end points of our 24-hour period are shown on the next page. Obviously it is impossible to read the names of all the variables, but that is not the point. Just look at the bar graphs. They are very different, indicating a continuous change in operating regime from start to finish.

Time trend within the process

[Figure: two score contribution bar charts from model M3 (PCA-X), “Score Contrib(Obs 457 - Average), Weight=p1p2” and “Score Contrib(Obs 7910 - Average), Weight=p1p2”, labelled 01:00 (start) and 00:59 (end). The x-axis is Var ID (Primary), with one bar per process tag. Contribution plots…]

Studying the “snake”

To gain further insight, we can colour-code the observations on the score plot. We did something similar in Example 1, when we colour-coded the days to show the seasons. This is very easy to do with modern MVA software. Here, we have modified the score plot to show which range each observation falls in for one of the variables. We have chosen “freeness”, an important pulp quality parameter which the process control systems try to maintain at a constant value.
We could have chosen any variable. Note that during the course of our 24-hour period, the freeness starts high, then gets lower, then goes back up again. Someone with an intimate knowledge of the process could gain insight from this result.

Score plot coloured for “freeness”

[Figure: exactly the same score plot, coloured for pulp “freeness”.]

Score plot in 3-D

[Figure: same plot, showing the 3rd component; the axes are Component 1, Component 2 and Component 3.]

MVA “foresight”

Another powerful use of MVA over short timescales is to predict problems before they become more widely visible. The residuals plot on the next page tells the whole story. Remember we said that the refiner shut down at 8:15 due to a blockage? It is obvious that the process started to move away from normal operation well before then. The operators tend to look at a handful of key variables when monitoring the process, but MVA looks at all the variables at the same time and is therefore much more sensitive. An analogy would be a seismometer being used to predict volcanic eruptions.

[Figure caption: A seismometer is extremely sensitive to the slightest vibrations.]

Residuals plot showing MVA “foresight”

[Figure: residuals plot. Build-up to 8h15: something is happening to the process!]

Using shorter timescales

By now it should be clear that doing MVA at shorter timescales is totally different from studying averages taken over longer timespans. Once again, we conclude that the best solution is to try many different approaches. No single MVA approach will provide all the answers we are seeking.
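The contrast between short timescales and long averages can be sketched in a few lines of Python. This is only an illustration on a synthetic signal (one slow sine cycle over 24 hours, sampled every 10 seconds); the signal and the helper `block_average` are our own assumptions and have nothing to do with the real TMP data.

```python
import math
import statistics


def block_average(values, block_size):
    """Average consecutive, non-overlapping blocks of `block_size` values.
    A trailing partial block, if any, is dropped."""
    n_blocks = len(values) // block_size
    return [
        statistics.fmean(values[i * block_size:(i + 1) * block_size])
        for i in range(n_blocks)
    ]


# Synthetic 24-hour signal at 10-second resolution: 8640 points, one slow cycle.
signal = [math.sin(2 * math.pi * i / 8640) for i in range(8640)]

ten_minute = block_average(signal, 60)    # 144 values: the slow trend is still visible
daily = block_average(signal, 8640)       # 1 value: the trend has been averaged away

print(len(ten_minute), "ten-minute averages, range",
      round(min(ten_minute), 3), "to", round(max(ten_minute), 3))
print(len(daily), "daily average:", round(daily[0], 6))
```

The same raw data therefore tell a different story depending on the averaging period, which is why the daily averages of the earlier examples could not show the within-day trend.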
Part of the power of this technique is the way completely different results can be obtained from exactly the same database, simply by “slicing and dicing” the data in various ways:
• Longer vs. shorter timescales
• More vs. fewer variables
• PCA vs. PLS

MVA is just a “black box”. Its use MUST be driven by an understanding of the process being studied, otherwise it is just meaningless number-crunching.

[Figure caption: “Number Cruncher”]

End of Example 3: One step at a time…

End of Tier 2

Congratulations! This is the end of Tier 2. Obviously the details of these examples are hard to grasp for a first-timer, but hopefully some of the overall patterns are starting to emerge. A true understanding of MVA can only come by actually doing it on your own, which is the purpose of Tier 3. All that is left is to complete the short quiz that follows…

Tier 2 Quiz

Question 1: What is the difference between a tag and a variable?
a) The words “tag” and “variable” are synonyms.
b) A tag is an identity label or address, while a variable is an attribute of the process.
c) Tags change with time, but variables are fixed.
d) Variables measure similar attributes, while tags measure dissimilar attributes.
e) Answers (b) and (c).

Question 2: Does averaging reduce or increase noise?
a) Averaging increases noise significantly.
b) Averaging increases noise, but only slightly.
c) Averaging does not affect noise.
d) Averaging reduces noise.
e) Averaging reduces noise, but increases the likelihood of outliers.

Question 3: What is the danger of interpolating between readings that are far apart in time?
a) The interpolation will give far more weight to these individual readings than they deserve.
b) The interpolated values will indicate slow upward and downward trends where there are none.
c) The effect of outliers will be enhanced many-fold.
d) The engineer will have the false sense of comparing variables that are similar, when in fact they are very different.
e) All of the above.

Question 4: If interpolation is such a problem, then why can’t we just use the discrete values instead?
a) This would give far too much weight to periods with a large number of discrete values.
b) Discrete values must be averaged to have meaning.
c) No tag is ever truly discrete.
d) Discrete values have no time signature.
e) Answers (b) and (c).

Question 5: What is the difference between a process lag and a delayed reading?
a) One is caused by the process itself, the other by the measurement instruments.
b) They are the same thing.
c) A process lag is due to residence time, while a delayed reading is due to the time required for sampling, measurement and recording.
d) One is much longer than the other.
e) Answers (a) and (c).

Question 6: Why does the MVA software reject variables that do not change enough with time?
a) Only variables which are part of the “experiment” are permitted.
b) Tags change with time, but these variables are fixed.
c) There are insufficient data points.
d) If a variable does not change with time, then it cannot be correlated to any other variables.
e) None of the above.

Question 7: What should you do if your initial PCA gives a score plot with two distinct and separate data clouds?
a) Study each data cloud separately.
b) Try to determine what these two clouds represent.
c) Ignore the first component, which is probably being artificially induced by the two clouds.
d) Do an MVA on the entire dataset.
e) Answers (a), (b) and (c).

Question 8: Your residual (“DModX”) plot shows several moderate outliers. What should you do?
a) Remove them and continue.
b) Leave them in and continue.
c) Study their contribution plots.
d) Look at the original data to try to determine the cause.
e) Answers (c) and (d).

Question 9: Two variables are located in opposite corners of your PCA loadings plot (components 1 and 2). What do you conclude?
a) These variables are uncorrelated with each other.
b) These variables are negatively correlated with each other.
c) These variables contribute to both the first and second components.
d) These variables contribute to neither the first nor the second component.
e) Answers (b) and (c).

Question 10: Theoretically, on average what proportion of residuals should be above the 95% confidence line (the red line on the “DModX” plot)?
a) Exactly 0.05%.
b) Exactly 5%.
c) More than 5%.
d) Less than 5%.
e) Depends on the dataset.

TIER 3: Open-Ended Problem

Tier 3: Statement of Intent

The goal of Tier 3 is to finally allow the student to do MVA independently, though in a controlled context.
At the end of Tier 3, the student should know how to do the following:
• Prepare a spreadsheet for use in MVA
• Import spreadsheet into MVA software
• Set up dataset within MVA software
• Create simple PCA plots
• Identify and investigate major and moderate outliers
• Create and interpret more elaborate PCA plots

In order to avoid losing the student along the way, each of these steps is broken down into a series of sub-steps with clear instructions.

Tier 3: Contents

Tier 3 is broken down into four sections:
3.1 Problem Statement and Dataset
3.2 Preparing and Importing the Spreadsheet
3.3 Initial MVA Results
3.4 Outliers and More Elaborate MVA plots

Unlike the previous two sections, Tier 3 has no quiz. The student must submit the results of the above work in a succinct project report (10-15 pages).

3.1: Problem Statement and Dataset

Problem Statement

You are the process engineer at the TMP mill from the Tier 2 examples. Your boss, the plant manager, wants to know why the pulp has different properties in the summer than in the winter. You decide to start by generating PCA results for two different datasets, one taken during the summer, the other during the winter, and then comparing them to each other.

Summer/Winter datasets

After talking to the operators, you decide to take two full weeks of data for 15 key tags, using 1-hour averages. Your data have already been imported by an IT technician into a standard spreadsheet program. The two files are:
• Summerdata.xls
• Winterdata.xls

These are the actual data files you are going to use! Open these files, and have a look at the data.
Can you tell anything about the summer/winter question just by looking? Of course not!

3.2: Preparing and Importing the Spreadsheet

Preparing the spreadsheet

As you can see, the spreadsheet has two names for each variable:
• long descriptive name, and
• short “tag” for easy identification on the MVA graphs.

We want to do something similar with the individual observations. The full time signature is too long, and will make the score plots impossible to read. Besides, we already know which year and month it is. This is not useful information. We therefore want to insert a column to the right of the time signature, which gives the number of hours from the start of the two-week period. Do this now, for both spreadsheets. When you are done, save them under a new name.

Importing the spreadsheet

Now we are ready to open the MVA software. Do it now. The first thing we need to do is import the data. Go to “File: import data”, and select your newly renamed file for summer. The software will ask you a series of questions. Answer them according to the instructions on Page 2 of the spreadsheet file. One of these steps involves saving the new dataset as an MVA file. Repeat this operation for the winter spreadsheet.

3.3: Initial MVA Results

Initial MVA results

Re-open the summer file, and create the following plot:
• Model bar chart

Copy it by right-clicking, and paste it into your word processor file. All these plots must appear in your report. How many components does the software suggest?
Usually for this kind of initial exercise, keeping 3 components is normal. Eliminate the components you do not intend to use. Now create the following basic PCA plots:
• Score plots: t(1) vs. t(2)

What do you notice about the results? Right! There are major outliers. Now do the same for the winter dataset.

3.4: Outliers and More Elaborate MVA Plots

Investigating Outliers

The summer data contains a major process excursion that is clearly visible on the score plot. Looking at the original data, try to determine the cause. Once you are satisfied, remove the outliers and save the new model.

The winter data looks OK on the score plot, but that is not the entire story. Generate the following residuals plot:
• DModX

What do you notice? Right! There is one major outlier. Create a contribution plot to investigate:
• Contribution plot

What do you conclude? Remove this point and continue.

Comparing Summer and Winter

Now we are ready to compare the summer and winter results. Create the following basic PCA plots:
• Score plots: t(1) vs. t(2); t(1) vs. t(3); 3-D plot
• Loadings plot: p(1) vs. p(2); p(1) vs. p(3); 3-D plot

Do you notice any major differences between summer and winter? Of course you do! What are they? And what does this imply about the cause of the summer/winter process differences?
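For readers who want to see what lies behind the score and loadings plots, here is a minimal PCA sketch in Python using NumPy. It is not how the MVA software works internally (that is not described in this module); it simply autoscales the data and applies a singular value decomposition. The data are synthetic, not the contents of Summerdata.xls or Winterdata.xls, and the choice of which columns move together in each “season” is invented for illustration.

```python
import numpy as np


def pca(X, n_components=3):
    """PCA of an observations-by-variables array via SVD, after autoscaling
    (mean-centring each column and scaling it to unit variance)."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]   # columns are t(1), t(2), t(3)
    loadings = Vt[:n_components].T                    # columns are p(1), p(2), p(3)
    r2x = s[:n_components] ** 2 / np.sum(s ** 2)      # fraction of variance per component
    return scores, loadings, r2x


def top_variables(loadings, k=3):
    """Column indices of the k variables with the largest |p(1)|, sorted."""
    idx = np.argsort(-np.abs(loadings[:, 0]))[:k]
    return sorted(int(i) for i in idx)


rng = np.random.default_rng(0)
N_OBS, N_TAGS = 336, 15   # two weeks of 1-hour averages, 15 tags


def fake_season(coupled_columns):
    """Synthetic dataset in which the given columns share a common driver."""
    driver = rng.normal(size=N_OBS)
    X = rng.normal(size=(N_OBS, N_TAGS))
    X[:, coupled_columns] += 3.0 * driver[:, None]
    return X


summer = fake_season([0, 1, 2])   # assumption: tags 0-2 move together in summer
winter = fake_season([3, 4, 5])   # assumption: tags 3-5 move together in winter

for name, X in (("summer", summer), ("winter", winter)):
    scores, loadings, r2x = pca(X)
    print(name, "R2X per component:", np.round(r2x, 2),
          "tags dominating p(1):", top_variables(loadings))
```

In this toy case the first component is dominated by different tags in the two datasets. That is the kind of difference to look for when comparing the real summer and winter loadings plots.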
Drawing your conclusions

Now you have something to report to your boss…

More Elaborate MVA Plots

To get familiar with some of the other MVA outputs, create the following for the final summer and winter datasets:
• DModX
• X/Y Contribution Plot
• Residuals distribution

What do these plots indicate to you? Don’t worry about finding the “right” answer, just try to figure out what these plots are trying to tell us. However, you must justify your answers. Don’t just guess.

End of Tier 3

Congratulations! This is the end of Module 17. Please submit your report to your professor for grading.

We are always interested in suggestions on how to improve the course. You may contact us at www.namppimodule.org