6.1 opening vignette: CERN
How is DAS data agnostic?
- Given the number of different data sources, types, and providers that DAS connects to, it is imperative that the system itself is data agnostic and allows users to query and aggregate the metadata information in a customizable way.
- It provides a simple plug-and-play mechanism that makes it easy to add new data services as they are implemented and to configure DAS to connect to specific domains.
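A minimal sketch of what a plug-and-play mechanism could look like, assuming each data service is just a function that takes a query and returns records. The registry, the service names, and the dataset path are all hypothetical and are not taken from DAS itself.

```python
from typing import Callable, Dict, List

# Hypothetical plug-in registry; these names are illustrative, not DAS's API.
Service = Callable[[dict], List[dict]]
SERVICES: Dict[str, Service] = {}


def register(name: str) -> Callable[[Service], Service]:
    """Decorator that plugs a data service into the registry under `name`."""
    def decorator(fn: Service) -> Service:
        SERVICES[name] = fn
        return fn
    return decorator


@register("dataset_catalog")
def dataset_catalog(query: dict) -> List[dict]:
    # Stand-in for a relational source: rows come back as plain dicts.
    rows = [{"dataset": "/A/B/RAW", "events": 1200}]  # made-up sample data
    return [r for r in rows if r["dataset"] == query.get("dataset")]


@register("site_wiki")
def site_wiki(query: dict) -> List[dict]:
    # Stand-in for a nonrelational source with a different record shape.
    pages = [{"dataset": "/A/B/RAW", "notes": ["validated"]}]  # made-up sample data
    return [p for p in pages if p["dataset"] == query.get("dataset")]


def query_all(query: dict) -> List[dict]:
    """Send one query to every registered service and tag results with their source."""
    results = []
    for name, service in SERVICES.items():
        for record in service(query):
            results.append({"source": name, **record})
    return results
```

Adding a new service means writing one more decorated function; `query_all` does not change, which is the sense in which the core stays data agnostic.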
What type of database was implemented?
- The team decided that a document database would best suit its needs.
- After evaluating several, it chose MongoDB for its support of dynamic queries and full indexes (including on inner objects and embedded arrays), as well as auto-sharding.
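To make "inner objects and embedded arrays" concrete, here is a pure-Python sketch of a dotted-path lookup over a nested document. It only imitates the idea that a query can reach into nested fields and match any element of an array; it is not MongoDB's API, and the function names and sample document are assumptions made for illustration.

```python
def values_at(doc, path):
    """Collect every value reachable at a dotted path, fanning out over lists."""
    current = [doc]
    for key in path.split("."):
        next_level = []
        for node in current:
            # A list of embedded documents is searched element by element.
            candidates = node if isinstance(node, list) else [node]
            for candidate in candidates:
                if isinstance(candidate, dict) and key in candidate:
                    next_level.append(candidate[key])
        current = next_level
    # Flatten a trailing array so its elements can be matched individually.
    flat = []
    for value in current:
        if isinstance(value, list):
            flat.extend(value)
        else:
            flat.append(value)
    return flat


def matches(doc, path, expected):
    """True if any value found at `path` equals `expected`."""
    return expected in values_at(doc, path)


# Made-up metadata record with an inner object and two embedded arrays.
record = {
    "dataset": {
        "name": "sample",
        "tags": ["raw", "2012"],
        "files": [{"site": "T1", "size": 10}, {"site": "T2", "size": 20}],
    }
}
```

With this record, `matches(record, "dataset.files.site", "T2")` and `matches(record, "dataset.tags", "raw")` are both true, even though one path goes through an array of embedded documents and the other ends in an array of plain values.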
Problem, first part?
1. CERN does not have the capacity to process all of the data that it generates and therefore must rely on numerous other research centers all around the world to access and process the data
What was the solution? How were the Big Data challenges addressed with this solution?
1. The CMS (Compact Muon Solenoid) data management and workflow management (DMWM) team created the Data Aggregation System (DAS), built on MongoDB (a Big Data management infrastructure), to provide the ability to search and aggregate information across this complex data landscape.
2. MongoDB's schema-less structure allowed flexible data structures to be stored and indexed.
What were the results? Do you think the current solution is sufficient?
1. DAS is used 24 hours a day, seven days a week, by CMS physicists, data operators, and data managers at research facilities around the world.
2. The performance of MongoDB has been outstanding, with an ability to offer a free-text query system that is fast and scalable.
3. Without help from DAS, information lookup would have taken orders of magnitude longer.
4. The current solution is outstanding, but more improvements can be made, and CERN is looking to apply Big Data approaches beyond CMS.
Who is the opening vignette about?
1. The European Organization for Nuclear Research, known as CERN, plays a leading role in fundamental studies of physics.
2. It has been instrumental in many key global innovations and breakthrough discoveries in theoretical physics, and today operates the world's largest particle physics laboratory, home to the Large Hadron Collider (LHC).
Problem, second part?
2. The Compact Muon Solenoid (CMS) is one of the two general-purpose particle physics detectors operated at the LHC.
- More than 3000 physicists from 183 institutions in 38 countries are involved in the design, construction, and maintenance of the experiments.
- The experiments require an enormously complex distributed computing and data management system.
- Information is stored in and retrieved from relational and nonrelational data sources, such as relational databases, document databases, blogs, wikis, file systems, and customized applications.
What is the essence of the data challenge at CERN? How significant is it?
A. Collision events in LHC occur 40 million times per second, resulting in 15 petabytes of data produced annually at the CERN Data Centre.
B. CERN does not have the capacity to process all of the data that it generates, and therefore relies on numerous other research centers all around the world to access and process the data.
C. Processing all the data of their experiments requires an enormously complex distributed and heterogeneous computing and data management system.
D. With such vast quantities of data, both structured and unstructured, information discovery is a big challenge for CERN.
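To get a feel for the scale in point A, the annual figure can be turned into an average sustained rate. This is back-of-the-envelope arithmetic only, assuming a decimal petabyte (10^15 bytes) and a 365-day year; the function name is made up for illustration.

```python
BYTES_PER_PETABYTE = 10**15              # decimal petabyte (assumption)
SECONDS_PER_YEAR = 365 * 24 * 60 * 60    # 365-day year (assumption)


def average_rate_mb_per_second(petabytes_per_year: float) -> float:
    """Average data rate in megabytes per second for a given annual volume."""
    bytes_per_second = petabytes_per_year * BYTES_PER_PETABYTE / SECONDS_PER_YEAR
    return bytes_per_second / 10**6


rate = average_rate_mb_per_second(15)
print(f"{rate:.0f} MB/s")  # prints "476 MB/s"
```

So 15 petabytes a year works out to roughly 476 MB written every second, around the clock, which helps explain why the processing is spread across many research centers.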
What was the solution?
A. CMS created the Data Aggregation System (DAS), built on MongoDB (a Big Data management infrastructure), to provide the ability to search and aggregate information across this complex data landscape.
B. Data and metadata for CMS come from many different sources and are distributed in a variety of digital formats.
C. DAS draws on both relational and nonrelational data sources.
What is CERN, and why is it important to the world of science?
CERN is the European Organization for Nuclear Research. It plays a leading role in fundamental studies of physics. It has been instrumental in many key global innovations and breakthrough discoveries in theoretical physics and today operates the world's largest particle physics laboratory, home to the Large Hadron Collider (LHC).
What do they want as a solution?
a. At this scale, information discovery within a heterogeneous, distributed environment is difficult, because the data and associated metadata are produced in a variety of forms and digital formats.
b. Users (within CERN and scientists all around the world) want to be able to query different services and combine data and information from these varied sources.
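The "query different services and combine" requirement in point b can be sketched as a merge of records on a shared key. This is an illustrative sketch only: the service names, field names, and sample records are hypothetical, and the real DAS aggregation logic is not described in these notes.

```python
from collections import defaultdict
from typing import Dict, List


def aggregate(records_by_service: Dict[str, List[dict]], key: str) -> Dict[str, dict]:
    """Group records from several services under the value of a shared key."""
    merged: Dict[str, dict] = defaultdict(dict)
    for service, records in records_by_service.items():
        for record in records:
            if key not in record:
                continue  # this record cannot be joined on the shared key
            entry = merged[record[key]]
            entry[key] = record[key]
            # Keep each service's fields separate so differing shapes never collide.
            fields = {k: v for k, v in record.items() if k != key}
            entry.setdefault(service, []).append(fields)
    return dict(merged)


# Made-up answers from two services with different record shapes.
answers = {
    "catalog": [{"dataset": "d1", "events": 100}],
    "sites": [{"dataset": "d1", "site": "T1"}, {"dataset": "d2", "site": "T2"}],
}
combined = aggregate(answers, "dataset")
```

Here `combined["d1"]` holds both the catalog fields and the site fields for the same dataset, while `combined["d2"]` holds only what the one service that knew about it returned.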
Results
a. Performance has been outstanding.
b. The ability to offer a free-text query system that is fast and scalable, with a highly dynamic and scalable cache that is data agnostic, provides an invaluable two-way translation mechanism.
What can we learn from this vignette?
a. Technological advances make it easier to create, capture, store, and analyze very large quantities of data.
b. The LHC at CERN creates very large volumes of data very fast.
c. Big Data comes in varied formats and is stored in distributed server systems.
d. Analysis of such a data landscape requires new analytical tools and techniques.
e. Regardless of their size, complexity, and velocity, data need to be made easy to access, query, and analyze if the promised value is to be derived from them.
Was a relational database used?
The choice of an existing relational database was ruled out for several reasons:
- The team did not require any transactions or data persistency in DAS, and as such could not have a predefined schema.
- Dynamic typing of stored metadata objects was one of the requirements.
How was DAS pivotal to the obtained results?
Without help from DAS, information lookup would have taken orders of magnitude longer.