Transcript Slide 1
©2014 Experian Information Solutions, Inc. All rights reserved. Experian Confidential.

Distributed Representation for Unstructured Data and Applications
Kevin Chen
Chief Scientist | North America Data Lab

Experian and the marks used herein are service marks or registered trademarks of Experian Information Solutions, Inc. Other product and company names mentioned herein are the trademarks of their respective owners. No part of this copyrighted work may be reproduced, modified, or distributed in any form or manner without the prior written permission of Experian.

People go to But NOT LIKE? __________

Representation of unstructured data
Gartner(*) predicted enterprise data volume to grow by 800% in the next five years
Unstructured data is growing 62% faster
80% of data will be unstructured data
Structured data:
► Well-studied
► Interval / categorical / ordinal
(*) Forbes, Big Data -- Big Money Says It Is A Paradigm Buster, June 2012

Representation of unstructured data
Unstructured data:
► Diverse types of data (text, audio, image, video)
► Need to be able to search, compare, understand, and predict
Key question:
► How do we represent words, sentences, phrases, concepts, and objects, and use them in predictive modeling?
Applications in Transactional Behavior Modeling:
► Merchant grouping
► Merchant characteristics insight
► Behavior shift detection

Language model and challenges
Language Model:
$$P(w_1^T) = P(w_1)\,P(w_2 \mid w_1^1)\,P(w_3 \mid w_1^2)\cdots P(w_T \mid w_1^{T-1}) = \prod_{t=1}^{T} P(w_t \mid w_1^{t-1})$$
► P("He likes to run") = P(He) × P(likes | He) × P(to | He likes) × P(run | He likes to)
► P(red | The color of rose is) = ?
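The chain-rule factorization of the language model can be sketched in a few lines of Python. This is only an illustration: the conditional probabilities in COND_PROB are made-up numbers, and the function names are hypothetical, not taken from the slides.

```python
import math

# Hypothetical values of P(word | history), invented purely for illustration.
COND_PROB = {
    ((), "He"): 0.05,
    (("He",), "likes"): 0.10,
    (("He", "likes"), "to"): 0.40,
    (("He", "likes", "to"), "run"): 0.02,
}


def sentence_log_prob(words, cond_prob):
    """Sum log P(w_t | w_1 .. w_{t-1}) over the sentence (chain rule)."""
    total = 0.0
    for t, word in enumerate(words):
        history = tuple(words[:t])  # all words before position t
        total += math.log(cond_prob[(history, word)])
    return total


def sentence_prob(words, cond_prob):
    """Probability of the whole sentence, recovered from the log-probability."""
    return math.exp(sentence_log_prob(words, cond_prob))


if __name__ == "__main__":
    print(sentence_prob(["He", "likes", "to", "run"], COND_PROB))
```

Working in log space avoids underflow for long sentences. Note that the table needs one entry per full history, which is exactly what becomes infeasible as sentences grow.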
Discrete Representation (n-gram Model):
$$P(w_t \mid w_1^{t-1}) \approx P(w_t \mid w_{t-n+1}^{t-1})$$
► Curse of Dimensionality: e.g. 4-grams have 1.6×10^17 combinations assuming |V| = 20,000
► Unable to detect 'similarity' easily
● "The cat is walking in the bedroom" vs. "A dog was running in a room"
► Requires smoothing
Analogy: Categorical Variables

Neural distributed representation
Hinton (1986), Bengio (2003)
Each word is associated with a point in a lower-dimensional space (e.g. 200 dimensions)
[Figure: words plotted as points; nearby groups include {Prior, Before, After}, {Saying, Called, Told}, and {About, Around}; example neighboring vectors (0.31, 0.12, …, 0.20) and (0.29, 0.11, …, 0.21), (0.15, 0.82, …, 0.57) and (0.16, 0.81, …, 0.55)]
Benefits:
► Close vectors correspond to similar words
► Reduced dimensions enable near-real-time look-up of similar words and distances
● e.g. 8TB (1,000,000 x 1,000,000 x 8) vs. 1.6GB (200 x 1,000,000 x 8)
► Language models can be derived with much smaller training data
► Compositionality: ability to express negativity using dissimilarity

Recurrent Neural Network Language Model
Tomas Mikolov (2010) (@ Google, Facebook)
A recurrent neural network takes the previous state s(t-1) as part of its input
w(t): current word at t, U: distributed representation, y(t): next word
The current state s(t) takes into account the current word w(t) and the previous state s(t-1):
$$s(t) = f\big(U w(t) + W s(t-1)\big), \qquad y(t) = g\big(V s(t)\big)$$
$$f(z) = \frac{1}{1 + e^{-z}}, \qquad g(z_k) = \frac{e^{z_k}}{\sum_i e^{z_i}}$$
Back-propagation is used to update V and U:
$$V(t+1) = V(t) + \alpha\, s(t)\, e_O(t)^{\top}, \qquad U(t+1) = U(t) + \alpha\, w(t)\, e_h(t)^{\top}$$
The recurrent weights W are updated by unfolding the network in time and training it as a deep neural network
[Figure: bi-gram neural network LM, and the recurrent network unfolded in time with inputs w(t), w(t-1), w(t-2), states s(t), s(t-1), s(t-2), s(t-3), output y(t), and weights U, V, W]
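A minimal sketch of one forward step of the recurrent network, written in plain Python so it has no dependencies. The matrix shapes (U is hidden × vocabulary, W is hidden × hidden, V is vocabulary × hidden), the one-hot encoding of w(t), and all function names are assumptions made for illustration; training is not shown.

```python
import math


def sigmoid(z):
    """f(z) = 1 / (1 + e^{-z}), applied to a scalar."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(z):
    """g(z_k) = e^{z_k} / sum_i e^{z_i}, shifted by max(z) for stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def rnn_step(U, W, V, word_index, s_prev):
    """Compute s(t) = f(U w(t) + W s(t-1)) and y(t) = g(V s(t)).

    w(t) is assumed to be one-hot, so U w(t) is simply column
    `word_index` of U, i.e. the word's distributed representation.
    """
    u_col = [row[word_index] for row in U]
    recurrent = matvec(W, s_prev)
    s_t = [sigmoid(a + b) for a, b in zip(u_col, recurrent)]
    y_t = softmax(matvec(V, s_t))
    return s_t, y_t


if __name__ == "__main__":
    # Toy sizes (assumed): vocabulary of 3 words, hidden state of size 2.
    U = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1]]
    W = [[0.5, -0.3], [0.2, 0.1]]
    V = [[0.3, -0.1], [0.0, 0.2], [-0.4, 0.5]]
    state = [0.0, 0.0]
    for idx in [0, 2, 1]:  # a short word-index sequence
        state, next_word_dist = rnn_step(U, W, V, idx, state)
    print(state, next_word_dist)
```

Because w(t) is one-hot, the product U w(t) reduces to a column look-up, which is why the columns of U can be read off as word vectors.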
word2vec
Focused on vector generation while simplifying the LM
Two models: Skip-gram, Continuous Bag-of-Words (CBOW)
Skip-gram maximizes:
$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\ j \ne 0} \log p(w_{t+j} \mid w_t)$$
where
$$p(w_{t+j} \mid w_t) = \frac{\exp\big({v'_{w_{t+j}}}^{\top} v_{w_t}\big)}{\sum_{w=1}^{W} \exp\big({v'_w}^{\top} v_{w_t}\big)}$$
The cost of calculating $\nabla \log p(w_{t+j} \mid w_t)$ is huge
► Hierarchical softmax using a Huffman tree
► Negative sampling
[Figure: Skip-gram and CBOW architectures (INPUT, PROJECTION, OUTPUT) over the center word w(t) and context words w(t-2), w(t-1), w(t+1), w(t+2); weight matrices labeled W and Syn1]

Application - merchant similarity and grouping
The DataLab has been working on plastic card transaction data with merchant information
Merchant Information:
► MCC - not sufficient to categorize merchants or to identify consumers' behavior
► Merchant names - noisy, and not informative enough about their business
Neural Distributed Representation:
► Word: Merchant ID
► Sentence: Close sequence of merchants in transactions
► Model: skip-gram model
► 1.3M unique merchants, 835M transactions
► Trained in 280 minutes using 30 threads

Merchant group - international travel (selected merchants)
MCC | MCC Description | Merchant Name
3005 | BRITISH AIRWAYS | BRITISH A
3007 | AIR FRANCE | AIR FRANCE 0571963678061
3010 | KLM (ROYAL DUTCH AIRLINES) | KLM BELGIUM 0742469054336
3012 | QUANTAS | QANTAS AIR 08173363730
3056 | QUEBECAIRE | JET AIR 5894149559583
3077 | THAI AIRWAYS | WWW.THAIAIRW1234567890
3078 | CHINA AIRLINES | CHINA AIR2970836417640
3079 | Airlines | JETSTAR AIR B7JLYP
3161 | ALL NIPPON AIRWAYS | ANAAIR
3389 | AVIS RENT-A-CAR | AVIS RENT A CAR
3503 | SHERATON HOTELS | SHERATON GRANDE SUKHUMVIT
3545 | SHANGRI-LA INTERNATIONAL | MAKATI SHANGRI LA HOTE
3572 | MIYAKO HOTELS | SHERATON MIYAKO TOKYO H
3577 | MANDARIN ORIENTAL HOTEL | MANDARIN ORIENTAL,BANGKOK
3710 | THE RITZ CARLTON HOTELS | THE RITZ-CARLTON, HK16501
4011 | Railroads | JR EAST
4111 | Local/Suburban Commuter Passenger Transportation. Railroads, Ferries, Local Water Transportation | MASS TRANSIT RAILWAY
4111 | Local/Suburban Commuter Passenger Transportation. Railroads, Ferries, Local Water Transportation | XISHIJI CRUISES
4112 | Passenger Railways | Taiwan High Speed Rail
4121 | Taxicabs and Limousines | AIZUNORIAIJIDOSHIYA KA
4131 | Bus Lines, Including Charters, Tour Buses | CRUZ DEL SUR
4215 | Courier Services Air or Ground, Freight forwarders | MYUS.COM
4511 | Airlines, Air Carriers (not listed elsewhere) | CAMBODIA ANGKOR AIR-TH
4511 | Airlines, Air Carriers (not listed elsewhere) | JETSTAR PAC
4722 | Travel Agencies and Tour Operations | CHU KONG PASSENGER 28902
4722 | Travel Agencies and Tour Operations | HOSTEL WORLD
4814 | Fax services, Telecommunication Services | ONESIMCARD.COM
4814 | Fax services, Telecommunication Services | PREPAID ONLINE TOP UP
5192 | Books, Periodicals, and Newspapers | RELAY
5200 | Home Supply Warehouse Stores | FUJI DOLL CHUOU
5251 | Hardware Stores | TRUE VALUE AYALA CEBU
5300 | Wholesale Clubs | COSTCO GUADALAJARA
5309 | Duty Free Store | BEIRUT DUTY FREE ARRIVAL
5309 | Duty Free Store | DFS INDIA PRIVATE LIMI
5411 | Grocery Stores, Supermarkets | RUSTAN S SUPERMARKET
5411 | Grocery Stores, Supermarkets | VILLA MARKET-NICHADA
5499 | Misc. Food Stores Convenience Stores and Specialty Markets | HAKATAFUBIAN
5499 | Misc. Food Stores Convenience Stores and Specialty Markets | I-MEI FOODS CO.,LTD
5719 | Miscellaneous Home Furnishing Specialty Stores | KULTURA FILIPINO SM C
5812 | Eating places and Restaurants | RI YI CAN YIN
5813 | Drinking Places (Alcoholic Beverages), Bars, Taverns, Cocktail lounges, Night | SHANGHAI PRESIDENT COFFEE
5814 | Fast Food Restaurants | AJISEN RAMEN 02800
5814 | Fast Food Restaurants | MCDONALD'S AIRPORT(290
5947 | Card Shops, Gift, Novelty, and Souvenir Shops | SENTOSA LEISURE MANAGE
5949 | Sewing, Needle, Fabric, and Piece Goods Stores | SHANG HAI YUGUI INDUS
5962 | Direct Marketing Travel Related Arrangements Services | CTRIP SH HUACHENG TRAVEL
5964 | Direct Marketing Catalog Merchant | AMAZON.CO.JP
5994 | News Dealers and Newsstands | LS TRAVEL RETAIL DEUTSCHL
5999 | Miscellaneous and Specialty Retail Stores | MAXVALUKURASHIKAN ICHIHAM
6010 | Financial Institutions Manual Cash Disbursements | 012BANCO DE CHILE VISA
6300 | Insurance Sales, Underwriting, and Premiums | WORLD NOMADS
6513 | Real Estate Agents and Managers - Rentals | PAY*KOKO RESORTS INC
7299 | Miscellaneous Personal Services (not elsewhere classified) | CHINA VISA SERVICE CENTER / CE
7299 | Miscellaneous Personal Services (not elsewhere classified) | PASSPORTS & VISA.COM
7311 | Advertising Services | AMOMA
7399 | Business Services, Not Elsewhere Classified | MAILBOX FORWARDING, IN
7941 | Commercial Sports, Athletic Fields, Professional Sport Clubs, and Sport Prom | SINGAPORE FLYER PL
7991 | Tourist Attractions and Exhibits | AT THE TOP LLC
7996 | Amusement Parks, Carnivals, Circuses, Fortune Tellers | Hong Kong Disneyland Lant
8062 | Hospitals | BUMRUNGRAD HOSPITAL
8641 | Civic, Fraternal, and Social Associations | INTERNATIONS GMBH
8699 | Membership Organizations (Not Elsewhere Classified) | PRIORITY PASS, INC
8999 | Professional Services (Not Elsewhere Defined) | U.S.VISAAPPLICATIONFEE
9399 | Government Services (Not Elsewhere Classified) | US CONSULAT SHA CNE

3572 | MIYAKO HOTELS | SHERATON MIYAKO TOKYO
6300 | Insurance | WORLD NOMADS
4814 | Telecomm | ONESIMCARD.COM
9399 | Government Services | CNE US CONSULAT

Who's Like Me -- Additive Compositionality
MICHAELS STORES, BARNES & NOBLE, OLD NAVY, MARSHALLS, DSW, USPS, PARTY CITY, KOHL'S, SCHOLASTIC BOOK, GAP, PAPYRUS, CRATE & BARREL, ANTHROPOLOGIE, AVEDA, LOCCITANE, POTBELLY, NORDSTROM, NESPRESSO USA, LAKESHORE LEARNING
0.57 0.54 0.52 0.50 0.47 0.47 0.47 0.46 0.46
CRATE&BARREL, NORDSTROM, HNS*HughesNet.com, SMARTSTYLE, SEARS HOMETOWN, DOLLAR GENERAL, GOLDEN CORRAL, DISH NETWORK
0.76 0.76 0.76 0.74 0.74 0.74 0.74 0.73 0.73 0.72 0.42 0.40 0.39 0.38 0.38 0.36
SEARS HOMETOWN, DOLLAR GENERAL, THE OLIVE GARDEN, DISH NETWORK, KMART, JCPENNEY, MCDONALD'S, WALMART.COM, DOLRTREE, PIZZA HUT, AUTOPAY/DISH NTWK, BURGER KING, GAP, POTTERY BARN KIDS, ANN TAYLOR LOFT, CRATE & BARREL, JANIE AND JACK, BABIES R US, LAKESHORE LEARNING, PARTY CITY, SCHOLASTIC BOOK, ANTHROPOLOGIE
0.76 0.76 0.75 0.75 0.75 0.73 0.73 0.73 0.73 0.73 0.70 0.69 0.68 0.67 0.66 0.66 0.65 0.65 0.65 0.65
POTTERY BARN KIDS, GYMBOREE.COM, WALMART.COM, VF OUTLET 71, DISH NETWORK, BATH & BODY WORKS, CRAZY 8, THE OLIVE GARDEN, SMARTSTYLE, THE CHILDRENS PLACE, JCPENNEY
0.56 0.56 0.55 0.54 0.53 0.53 0.52 0.52 0.52 0.52
CHILDRENS PLACE

Who's Like Me -- More Examples
GOLF GALAXY, NY TIMES NATL SALES, DESIGN WITHIN REACH, THE NEW YORKER, VANITY FAIR MAG, HUMAN RIGHTS CAMPAIGN, AIRBNB INC, ROOM & BOARD, BON APPETIT
0.60 0.55 0.53 0.53 0.52 0.52 0.52 0.51
PAPYRUS, LOCCITANE, APPLE STORE, CRATE & BARREL, ANTHROPOLOGIE, NESPRESSO USA, NY TIMES NATL SALES, BANANA REPUBLIC, SEPHORA
0.70 0.70 0.68 0.68 0.67 0.65 0.65 0.64 0.64
APPLE STORE, SPORTS STATION, CALIFORNIA PIZZA, NORDSTROM RACK, BANANA REPUBLIC, STARBUCKS, THE MENS WEARHOUSE, CHIPOTLE
0.59 0.58 0.56 0.56 0.55 0.55 0.55 0.53
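The "who's like me" look-up with additive compositionality can be sketched as a nearest-neighbor search by cosine similarity over the summed query vectors. Everything below is assumed for illustration: the merchant names are placeholders, the 3-dimensional vectors are invented rather than trained, and the slides do not state which similarity measure was used.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query_names, vectors, top_n=3):
    """Rank merchants by similarity to the SUM of the query merchants' vectors."""
    dim = len(next(iter(vectors.values())))
    query = [sum(vectors[name][i] for name in query_names) for i in range(dim)]
    scored = [
        (name, cosine(query, vec))
        for name, vec in vectors.items()
        if name not in query_names  # do not return the query itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Placeholder merchants with made-up vectors (not real embeddings).
TOY_VECTORS = {
    "KIDS_CLOTHING": [0.9, 0.1, 0.0],
    "BOOK_STORE": [0.1, 0.9, 0.0],
    "KIDS_BOOKS": [0.7, 0.7, 0.1],
    "SATELLITE_TV": [0.0, 0.1, 0.9],
}

if __name__ == "__main__":
    for name, score in most_similar(["KIDS_CLOTHING", "BOOK_STORE"], TOY_VECTORS):
        print(f"{name}: {score:.2f}")
```

With 1.3M merchants a brute-force scan like this costs one dot product per merchant per query; at 200 dimensions that remains small enough to support the near-real-time look-up described earlier, though an approximate index would be a natural next step.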
Application - Behavior Shift Detection
People are creatures of habit, so there should be a 'language model' that describes a consumer's shopping patterns
Help financial institutions focus on the consumers whose behavior is out of the ordinary: (1) potential fraud compromise, (2) life-style change
[Figure: histogram of count vs. similarity to the past 20 transactions, comparing actual transactions with randomly generated transactions that have the same ZIP distribution; outliers marked]

Summary
We have demonstrated that a neural distributed representation can be used to capture relationships among merchants in the transaction data
Compositionality allows higher-order understanding of merchant relationships
Reduced dimensions in the representation enable near-real-time look-up of similar merchants
Future directions:
► Reduce the effect of localization by linking local merchants into a higher level of aggregation
► Further develop the behavior shift detection framework
► Deep learning of higher-order structures: Recurrent Neural Net, Convolutional Net, etc.

#FOIC2014

Kevin Chen
Chief Scientist, North America Data Lab
Experian
e: [email protected]