The Million Book Digital Library Project Raj Reddy, Jaime Carbonell, Michael Shamos, Gloriana StClair Carnegie Mellon University Pittsburgh, Pa.
USA
November 5, 2003

The Grand Challenge
Create Access to
• All published works online
• Instantly available
• In any language
• Anywhere in the world
• Searchable, browsable, navigable
• By humans and machines

The Challenge: One Step at a Time…
• Million Book DL
– Only about 1% of all the world’s books
• Harvard University 12M
• Library of Congress 30M
• OCLC catalog 42M
• All Multilingual Books ~100M
• At the rate of digitization of the last decade it would take 100 years!

Million Book Project: Issues
• Time
– At one page per second (20,000 pages per day shift), it will take 100 years (200 working days per year) to scan a million books of 400 pages each
• Cost
– 100M books at US$100 per book would cost $10B
– Even in India and China the cost will be $1B
– The annual cost is currently expected to be close to $10M per year with support from US, India and China
• Selection
– Selection of appropriate books for scanning is time consuming and expensive

Million Book Project: Issues (cont)
• Logistics
– Each container holds 10,000 to 20,000 books. Shipping and handling costs about $10,000
• Meta Data
– Accessing and/or creating Meta data requires professionals trained in Library science
• Optical Character Recognition Technology
– Essential for searching, translation and summarization
– Many languages don’t have OCR

21st Century Computing
• Exponential advances in Information and Communication Technologies will result in
– innovations that will transform the way we live, learn and work.
– In retrospect, these transformations will be seen as revolutionary by future generations

[Figure: Exponential Growth Trends in Computer Performance. MIPS plotted against year, 1994 to 2020, with curves labeled "Doubling every 15 months" and "Doubling every 2 years" and levels marked Giga PC, 10G PC, 100G PC and Tera PC.]

Technology Trends
• A Giga-PC in 2002
– Billion operations per second
– Billion bits of memory
– Billion bits per second network bandwidth
– Less than $2K
• A Tera-PC by the year 2015
• A Peta-PC by the year 2030

What do we do with all this power?
• Social systems not affected:
– Food we eat
– Clothes we wear
– Mating rituals
• The processing will transform the way we
– Live
– Learn
– Work, and
– Communicate

Trends in Magnetic Disk Memory
• Densities doubling every 12 months
• Thousand-fold improvement every 10 years
• 100 GB disk memory costs ~$100 (2003)
• 100 GB can be used to store
– 20 movies, 500 paintings, 5000 songs of MP3 music and 25000 books
– larger than most of our personal collections at home
• By 2013, 100 Tera Bytes cost ~$100
– A personal Library of several million books, a lifetime collection of music and videos, all on our home PC
• By 2025, 100 Peta Bytes ~$100
– Infinite amount of memory for all practical purposes

What do we do with a Peta Byte?
• Capture everything you ever said
– From the moment of birth
– To the moment you die
– Takes less than 1% of a Peta Byte!!
• Everything you did or experienced
– can be captured in living color
– with only a few Peta bytes

Advances in Fiber Optic Technology
• 1.6 Tera bits per second on a single fiber
– 160 wavelengths each at 10 Gbps
– Dense Wavelength Division Multiplexing (DWDM)
• What can you do with 1.6 Tera bits per second?
– 20 HDTV movies
– 200 regular full-length movies
– 30000 hours of MP3 music
– In one second on a single fiber!
– 20 minutes to transmit ALL books in the Library of Congress!
– ALL phone calls on a single fiber with room to spare!

Bottlenecks in Infinite Data Transmission
• Main bottleneck is not fiber bandwidth
• It is:
– Bus bandwidth
– Router capacity and speed
– Speed of light!
– Round-trip delay times in TCP/IP
– At Tera bit rates with RT times of about 30 ms across the US, 30 billion bits would have been transmitted before an acknowledgement is received

Technology Trends
• Exponential doubling of memory and bandwidth will continue for 10 to 20 years
– Leading to the availability of
    • Peta-byte disks
    • Peta-bytes per second bandwidth
    • At a cost of pennies per day
• Leading to Changes in Computer Science and Theory of Algorithms, and
• Leading to new innovative uses of Information Technology for the benefit of Society, such as
– eLearning: Universities without walls
– Ubiquitous Access to Knowledge: Digital Libraries
– Telemedicine

Access to Information in the 21st Century
• Maxim: Access to all human knowledge anytime anywhere
• Access, query, and print any book, magazine, newspaper, video, data item, or reference document
– regardless of language
• Challenges in data access
– High bandwidth networking for multimedia access
– Intellectual property protection while facilitating access
– Intelligent information retrieval
– Delivery and protection of critical information

Million Book Project: Status
• 15 Centers in India
• 14 Centers in China
• 1 Center in Egypt
• Planned: Australia and Europe
• About 78,000 books scanned
– About 50,000+ accessible on the web
– Uses 4 TB of storage
– 10 TB server at CMU Library planned for July 2004
– 100,000 books by the end of 2004
– Capacity to scan a million pages a day expected to be operational by the end of 2004

Title: Rig Veda
Author: Pandit Sriram Sharma Acharya
Language: Sanskrit
Subject: Philosophy
Publisher: Sanskriti Sansthan Bareli
Year:
Abstract: Rig Veda is the oldest of the Vedas.
The Rig Veda is the oldest book in Sanskrit or any Indo-European language. Many great Yogis and scholars who have understood the astronomical references in the hymns date the Rig Veda as before 4000 B.C., perhaps as early as 12,000 B.C. Modern western scholars date it around 1500 B.C., though recent archaeological finds in India (like Dwaraka) now appear to require a much earlier date.

Title: Elementary Treatise on the Wave-Theory of Light
Author: Humphrey Lloyd, D.D., D.C.L.
Language: English
Subject: Physics
Publisher: Longmans, Green & Co
Year: 1873
Abstract: This book deals with the various aspects of the wave theory of light. It is a critical work which contains an analytical discussion of the most recent researches in Optics. It presents a clear and connected view of the subject.

Title: Beauties from Kalidas
Author: Keshav Appa Padhye
Language: Sanskrit
Subject: Poetry
Publisher:
Year: 1927
Abstract: A collection of some of the best works of Kalidas, ancient India’s most famous Sanskrit poet. Abhignyana Sakuntalam, Kumara Sambhavam and Ritu Samhara are some of the renowned works of Kalidas.

Title: Gems, Jewels, Coins and Medals Ancient & Modern
Author: Archibald Billing
Language: English
Subject: Fine Arts
Publisher: Daldy, Isbister & Co
Year: 1875
Abstract: This volume gives a detailed description of the varied types of fine arts dealing with precious stones, jewelry and sculpture.

Title: Mudalayiram Mulamum
Author: Periya Jeeyar
Language: Tamil
Subject: Religion
Publisher: Sri Vaishnava Sampirathaya Sanjeevikiri Sabayai
Year: 1909
Abstract: This volume is written in Tamil. It provides a detailed account of the origin of Vaishnava and is written by Periya Jeeyar.

Title: Gulzar-A-Badesha
Author: Khader Badesha
Language: Urdu
Subject: Literature
Publisher: Namipress, Chennai
Year: 1919
Abstract: Literature

Title: Jawahar Ali Joyviyah
Author: Dr. Ilyas lomas
Language: Arabic
Subject: Metrology
Publisher: Bakri and Issa
Year: 1876
Abstract: It is a book on Metrology, the study of measurements.

Title: Panchatantramu
Author: Narayana Kavi
Language: Telugu
Subject: Moral Stories
Publisher: Vavilla Ramaswamy and Sons
Year: 1912
Abstract: It is a compilation of stories told by a guru to his royal students, each story teaching a moral. Most of the characters in the stories are animals. The book served as an excellent guide to prospective kings in their everyday life, including their behaviour and their choice of friends. It is also a great asset to parents teaching ethics to their children.

Title: Bharateeya Smritigalu
Author: Vidwan Ragu Sutta
Language: Kannada
Subject: Biographical Notes
Publisher: Hemantha Sahitya
Year:
Abstract: Compilation of Ancient Memories

Title: The Fauna of British India including Ceylon and Burma
Author: Lt. Col. J. Stephenson
Language: English
Subject: Biology
Publisher: Taylor and Francis
Year: 1929
Abstract: Biological notes on fauna and insects compiled during British India

Title: Harijan: A Journal of Applied Gandhism, 1933-1955
Author: Joan Bondurant (introduction)
Language: English
Subject: Philosophy
Publisher: Garland Publishing Inc.
Year: 1973
Abstract: A journal on the practical implementation of Gandhism in everyday life

Title: Structure Des Molecules
Author: Victor Henri
Language: French
Subject: Chemistry
Publisher: Taylor and Francis
Year: 1925
Abstract: This is a unique book that explicates, in detail, the structure of molecules and touches upon certain specific characteristics of molecules, with particular reference to Benzene.

Million Book Project: Research Challenges
• Providing Access to Billions every day
– Distributed Cached Servers in every country and region
• Easy to use interfaces for Billions
• Multilingual Information Retrieval
• Translation
• Summarization

Million Book Project: Policy Challenges
• Compensating for Creative Works
– 5% out of copyright
– 92% out-of-print and in-copyright
– 3% in-print and in-copyright
• Options
– Tax Credit
– Usage based Government funded compensation
    • Analogous to Public Lending Right in UK and Australia
– Usage charges to the user
• Compulsory Licensing
• Digital Submissions to National Archives of all books that are “born-digital”

Can we do it?

The Grand Challenge: Create Access to
• All published works online
• Instantly available
• In any language
• Anywhere in the world
• Searchable, browsable, navigable
• By humans and machines

URLs:
• http://www.ulib.org
• http://www.dli.ernet.in
• www.archive.org/texts/collection.php?collection=millionbooks
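
Appendix: checking the arithmetic
Several slides quote back-of-envelope figures: 100 years to scan a million books, $10B for 100M books, a thousand-fold disk improvement per decade, 1.6 Tera bits per second per fiber, and 30 billion bits in flight during one round trip. The following is a minimal Python sketch that reproduces them. The function and constant names are illustrative assumptions and not part of the project; the input values are the ones stated on the slides.

```python
# Back-of-envelope checks for figures quoted on the slides.
# Names are hypothetical; only the numeric inputs come from the slides.

BOOKS = 1_000_000
PAGES_PER_BOOK = 400
PAGES_PER_DAY = 20_000          # one day shift at about one page per second
WORKING_DAYS_PER_YEAR = 200


def scan_years(books, pages_per_book, pages_per_day, days_per_year):
    """Years for a single scanning line to digitize the whole collection."""
    total_pages = books * pages_per_book
    return total_pages / (pages_per_day * days_per_year)


def total_cost(books, cost_per_book):
    """Total digitization cost in the same currency as cost_per_book."""
    return books * cost_per_book


def growth_factor(years, doubling_period_years=1):
    """Improvement factor after `years` of steady doubling."""
    return 2 ** (years // doubling_period_years)


def fiber_capacity_bps(wavelengths, bps_per_wavelength):
    """Aggregate DWDM capacity: wavelengths times per-wavelength rate."""
    return wavelengths * bps_per_wavelength


def bits_in_flight(rate_bps, round_trip_ms):
    """Bandwidth-delay product: bits sent before the first acknowledgement."""
    return rate_bps * round_trip_ms // 1000


years = scan_years(BOOKS, PAGES_PER_BOOK, PAGES_PER_DAY, WORKING_DAYS_PER_YEAR)
cost = total_cost(100_000_000, 100)           # 100M books at US$100 each
disk_gain = growth_factor(10)                 # doubling every 12 months
capacity = fiber_capacity_bps(160, 10 * 10**9)
in_flight = bits_in_flight(10**12, 30)        # 1 Tbps, 30 ms round trip

print(f"Scanning time: {years:.0f} years")
print(f"Cost for 100M books: ${cost:,}")
print(f"Disk density gain over 10 years: {disk_gain}x")
print(f"Fiber capacity: {capacity / 10**12} Tbps")
print(f"Bits in flight per round trip: {in_flight:,}")
```

The scanning figure assumes a single scanning line; the planned capacity of a million pages a day would cut the same 400 million pages to 400 working days.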