DISSCO ARCHITECTURE

A "POP CULTURE" APPROACH TO DIGITAL OBJECTS, PIDS, FAIR PRINCIPLES, DATA STANDARDS AND MORE...

DISSCO TECHNICAL ARCHITECTURE

Everything you want to know but never dared to ask, explained in simple terms.

Reading time: 18 minutes

Contents by: DiSSCo CSO (August 2022)

This is what you know

□ You know that the global challenges we face today are multifaceted, and therefore they call for multifaceted solutions.

□ You also know that the data contained in the billions of specimens hosted in European Natural Science Collections (NSCs) are a fundamental basis of knowledge to tackle these challenges.

□ And you also know that by putting all that data together and making it accessible, DiSSCo will help the scientific community work together and do their job a zillion times more efficiently.

‖ Now, DiSSCo is more than just an effort to digitally bring together the data from a couple hundred NSCs, of course… And this is where we get to what you might or might not know:

‖ Our future RI aims not only at bringing data together but also at transforming that data and the ways of working with it. It is a fundamental aim that we could summarize as follows:

"DiSSCo aims at turning static records about specimens into dynamic, actionable objects that will evolve with science itself."

‖ Turning static into dynamic... You'll hear this mantra more as you scroll down...

Let's get started: DSArch

‖ In order to achieve its goals, DiSSCo will work on the basis of an innovative data management architecture, what we call DiSSCo Digital Specimen Architecture (DSArch).

‖ Explaining DSArch entails some difficulty but hopefully, this binnacle will help you understand DiSSCo’s data architecture better and will shed some light on a few of the most important concepts that we use in our discussions about technical matters, the ones you hear often but have trouble understanding.

Let’s get to it!

THE THREE PILLARS OF DSArch

‖ There are many approaches to data architecture design. While you don’t need to know them in detail for our purposes, at least keep in mind that DiSSCo data architecture does not follow a single approach, but rather a combination of them. Specifically, DiSSCo will rely on three different approaches (listed here not necessarily in order of relevance):

① Evolutionary Architecture with Protected Characteristics
② The FAIR guiding principles
③ The Digital Object Architecture

①→EVOLUTIONARY ARCHITECTURE WITH PROTECTED CHARACTERISTICS

‖ You may not like Bruce Hornsby’s haircut but he’s right about something: some things do not change. At least, that is what happens in data architecture if you take the Evolutionary Architecture with Protected Characteristics approach!

Evolutionary Architecture with Protected Characteristics in short

‖ A Research Infrastructure is as good as the reliability and solidity of the data it provides.

‖ That poses a dilemma: If DiSSCo intends to build trust in the reliability and solidity of its data, some components of its data architecture must remain the same over time, even beyond the research infrastructure’s lifecycle. On the other hand, it is only natural that future technologies or user needs will affect how DiSSCo’s infrastructure evolves. In other words: changes will be needed at some point. The question therefore is how to find a balance between what stays and what goes.

‖ The evolutionary architecture approach gives us a way of eating the cake and having it too, so to speak. It acknowledges the inevitable evolution of things but at the same time shields some essential components of the architecture by granting them protected status. Those protected components, being “futureproof”, will stay the same in the long term.

We will deal with some of those components, such as the FAIRness of data or the centrality of the Digital Specimen, later. If you want to take a look at the whole list, go here.
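
‖ If it helps to see the "what stays and what goes" balance in action, here is a purely illustrative sketch (not DiSSCo code). It assumes a made-up `review_change` function and uses only the two protected characteristics named in this binnacle. A proposed change goes through unless it touches something protected.

```python
# Protected characteristics named in this binnacle; the whole DiSSCo list is longer.
PROTECTED = frozenset({
    "FAIRness of data",
    "centrality of the Digital Specimen",
})


def review_change(description: str, affects: set[str]) -> str:
    """Accept a change unless it alters a protected characteristic.

    `review_change` and the `affects` labels are hypothetical, for illustration only.
    """
    touched = sorted(set(affects) & PROTECTED)
    if touched:
        return f"rejected: '{description}' would alter {', '.join(touched)}"
    return f"accepted: '{description}'"


# Technology can evolve freely...
print(review_change("swap the storage backend", {"storage technology"}))
# ...but protected characteristics stay put.
print(review_change("drop persistent identifiers", {"FAIRness of data"}))
```

Everything outside the protected set is free to evolve; only the short protected list is "futureproof".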

②→FAIR PRINCIPLES

‖ Judging by how interested Sia is in playing a fair game, it is obvious that she must be aware of the data revolution that we have been experiencing for some decades now. The amount of data generated and made accessible on the Internet is indeed so massive and varied that it is not feasible for us humans to make sense of it anymore, at least not using current processes and standards.
What to do then? Well, here is an option: We can keep on creating and publishing information, but make sure that the data is FAIR...

The FAIR principles in short

‖ You must have heard/seen/read the term FAIR more than a thousand times by now, so let’s make it quick...

‖ FAIR is about making data increasingly findable, accessible, interoperable and reusable both for humans and machines. The “and machines” bit is crucial because, given today’s information tsunami, we will have no option but to rely on them to be able to manage, exchange and extract meaning out of huge amounts of combined data from different sources.

‖ FAIR principles are the best way of ensuring proper stewardship of data but let’s admit it: they are a bit abstract, so you might wonder: How is data made FAIR? Is it about writing code or what? The answer to that is not difficult but it is long to explain, so let’s see a couple of examples that will probably help you grasp all this.

□ Example 1: If you want your data to be Findable, the FAIR guiding principles recommend that, among other things, you give your data a globally unique and persistent identifier, that is, a sort of ID code that will belong to your data -and your data only- forever. We’ll get to that later, don’t worry.

□ Example 2: If you want your data to be Reusable, the FAIR guiding principles recommend that, among other things, your data and metadata be released with a clear and accessible data usage license.
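
‖ So, is it about writing code? Not really, but a little code can make the two examples concrete. The sketch below is an illustration only: the `DatasetRecord` class, its field names and the sample values are made up, and it checks just the two recommendations mentioned above, not the full set of FAIR criteria.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetRecord:
    """Hypothetical, minimal metadata record; field names are illustrative only."""
    title: str
    identifier: Optional[str] = None  # globally unique, persistent identifier (Example 1)
    license: Optional[str] = None     # clear, accessible data usage license (Example 2)


def fair_gaps(record: DatasetRecord) -> list[str]:
    """List which of the two example recommendations the record does not meet."""
    gaps = []
    if not record.identifier:
        gaps.append("Findable: no persistent identifier")
    if not record.license:
        gaps.append("Reusable: no data usage license")
    return gaps


# Made-up records: one with neither element, one with both.
draft = DatasetRecord(title="Butterfly drawer 7")
published = DatasetRecord(
    title="Butterfly drawer 7",
    identifier="10.1234/example",  # invented DOI-style identifier
    license="CC-BY-4.0",           # invented licence choice
)

print(fair_gaps(draft))      # two gaps reported
print(fair_gaps(published))  # empty list
```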

‖ There are a bunch of other criteria that you can apply to make your data FAIR. Find them and much more about the FAIR Guiding Principles here and here.

③→THE DIGITAL OBJECT ARCHITECTURE

‖ Yeah, when it comes to Digital Object Architecture, we at DiSSCo go full Freddie Mercury and want it all. Translated into DiSSCo language, "wanting it all" means that our data architecture aims to be able to:

· identify Digital Objects,
· describe their anatomy,
· and, of course, use them.

‖ Before we continue, though, a bit of disambiguation is needed. Chances are that you might have trouble telling apart concepts such as “Digital Specimen”, “Digital Collection”, “Digital Object” or “FAIR Digital Object”. If that is the case, worry no more because it is a piece of cake, really. Just keep scrolling...!

Understanding the Digital Object happy family

Digital Specimens (DS) and Digital Collections (DC) are both specific types of Digital Objects (DO). Simple as that.

(warning: long paragraph!)

‖ Ok, in the case of DiSSCo, we should rather say that they are specific types of FAIR Digital Objects, given DiSSCo’s alignment to the FAIR principles. DS and DC are not the only types of digital objects, of course, just the two that are most closely related to NSCs.

‖ “FAIR Digital Object” and “Digital Specimen” are the concepts that you will probably hear more often, so let’s give each of them a paragraph.

‖ In essence, a FAIR Digital Object (FDO) is a digital object that follows the FAIR principles. If a Digital Object is a sequence (or sequences) of bits “structured in a way that is interpretable by one or more of the computational facilities and having as an essential element an associated unique persistent identifier” (DONA Foundation), then an FDO is the same, only FAIR.

‖ And why do FDOs make sense? At the end of the day, FAIR was not part of the original DO idea, right? Well, it turns out that there are fundamental elements in the DO nature that make it compatible with the FAIR principles. In fact, DOs make FAIR implementation with other systems possible in a much more granular and interoperable fashion. So they do make sense.

‖ As stated above, a Digital Specimen is a specific type of digital object. You will usually find it described in DiSSCo documents as a “surrogate” or a “digital twin” of a physical specimen. You can have a specimen of a butterfly in your hand and its digital twin on the screen of your computer.

‖ But the butterfly on the screen is not just a visual representation of the one in your hand. That digital image is just the “cover photo” of an online package that brings together FAIR data from different sources (taxonomic, genomic, biochemical, you name it), all referring to the same physical specimen.

‖ Besides -and this is the best bit- this online package that contains all the data related to the butterfly is not static in the same way as the information written on the tag of a physical specimen is. Instead, the data anchored by the digital specimen is dynamic, actionable. In other words: you can work on it and transform it (e.g. by annotating it or applying DiSSCo services to it). Remember the mantra: “DiSSCo aims at turning static records about specimens into dynamic, actionable objects that will evolve with science itself”? Here it is.
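
‖ For the code-minded, the happy family can be sketched as a small class hierarchy. This is a toy model under loud assumptions: every class, field and identifier below is invented for illustration and is not the openDS specification or any DiSSCo API. It only shows the three ideas from the text: every digital object carries a PID, a Digital Specimen bundles links to data from different sources, and the package can be annotated over time.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalObject:
    pid: str  # the persistent identifier is an essential element of a DO


@dataclass
class DigitalSpecimen(DigitalObject):
    """Toy 'digital twin' of one physical specimen (hypothetical fields)."""
    physical_specimen_id: str = ""
    # Links to data from different sources, e.g. "genomic" -> PID of that data.
    linked_data: dict[str, str] = field(default_factory=dict)
    annotations: list[str] = field(default_factory=list)

    def annotate(self, note: str) -> None:
        """Add an annotation: the record is dynamic, not a fixed label."""
        self.annotations.append(note)


@dataclass
class DigitalCollection(DigitalObject):
    """Toy collection: a digital object that groups specimen PIDs."""
    members: list[str] = field(default_factory=list)


# All identifiers below are made up.
butterfly = DigitalSpecimen(pid="pid:example/butterfly-1", physical_specimen_id="drawer-7")
butterfly.linked_data["genomic"] = "pid:example/sequence-9"
butterfly.annotate("Wing pattern re-examined")

drawer = DigitalCollection(pid="pid:example/collection-1", members=[butterfly.pid])

print(isinstance(butterfly, DigitalObject))  # True
print(butterfly.annotations)
```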

‖ Ok, time to move on. Let’s see how DiSSCo plans to identify, describe and use Digital Objects.

IDENTIFYING DIGITAL OBJECTS (Or "Be yourself" for NSC specimens)

‖ In order to talk about identification or referencing of DOs, we need to revisit a concept that has already been mentioned a couple of times in this binnacle, but only in passing: Persistent Identifiers.

‖ A bit of History: As NSCs started implementing mass digitisation programmes and mobilising their data for others to use, some changes became more and more pressing. The way of referencing specimens was one of them. Sure, each specimen in a collection normally has its own catalogue ID that is unique within that collection, but the moment collections start working with other collections, there are potential problems. For example, if a specimen in your botanical garden happens to share the same reference number with a totally unrelated specimen in a museum of geology, that might lead to confusion, so no bueno.
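
‖ Here is that confusion in miniature. The sketch is illustrative only: the institutions, catalogue numbers and identifiers are all invented. Two unrelated specimens share a local catalogue number, so an index keyed on that number loses one of them, while an index keyed on a globally unique identifier keeps both.

```python
# Two unrelated specimens that happen to share a local catalogue number.
# Every value here is made up for illustration.
specimens = [
    {"institution": "Botanical Garden", "catalogue_id": "12345",
     "what": "pressed fern", "pid": "pid:example/aaa"},
    {"institution": "Museum of Geology", "catalogue_id": "12345",
     "what": "basalt sample", "pid": "pid:example/bbb"},
]

# Index by local catalogue number: the second record silently replaces the first.
by_local_id = {}
for specimen in specimens:
    by_local_id[specimen["catalogue_id"]] = specimen

# Index by a globally unique identifier: both records survive.
by_pid = {specimen["pid"]: specimen for specimen in specimens}

print(len(by_local_id))  # 1
print(len(by_pid))       # 2
```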

‖ Recent years have seen a number of initiatives set up global persistent identifiers (PIDs) to guarantee the “uniqueness” of both physical and digital objects over time. You must have heard acronyms such as ISBN, ORCID or ROR before, right? Those are persistent identifiers for books, individual scholars and research institutions, respectively. For digital research content the most widely used is the Digital Object Identifier (DOI) proposed by the International DOI Foundation.

‖ A DOI is an alphanumeric code that looks like this:

10.prefix/suffix

‖ For example, if you type this in your browser...

https://doi.org/10.15468/w6ubjx

... You will visit the Royal Belgian Institute of Natural Sciences Mollusc collection dataset, accessed through GBIF, and this specific PID will never point at any other object, only this particular mollusc dataset. It will never change even though the content or the metadata related to the object might be altered in the future.
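
‖ The `10.prefix/suffix` pattern is easy to take apart in code. The helper names below (`split_doi`, `doi_url`) are our own invention for illustration, and this is a minimal sketch rather than a full DOI validator: it only checks that the prefix starts with `10.` and that a suffix follows the first slash.

```python
def split_doi(doi: str) -> tuple[str, str]:
    """Split a DOI into (prefix, suffix) at the first slash.

    Accepts either a bare DOI or a https://doi.org/ link.
    """
    for resolver in ("https://doi.org/", "http://doi.org/"):
        if doi.startswith(resolver):
            doi = doi[len(resolver):]
    # Everything after the first slash is the suffix (it may itself contain slashes).
    prefix, slash, suffix = doi.partition("/")
    if not prefix.startswith("10.") or not slash or not suffix:
        raise ValueError(f"not a DOI: {doi!r}")
    return prefix, suffix


def doi_url(doi: str) -> str:
    """Build the resolver link for a DOI (no percent-encoding in this sketch)."""
    prefix, suffix = split_doi(doi)
    return f"https://doi.org/{prefix}/{suffix}"


print(split_doi("https://doi.org/10.15468/w6ubjx"))  # ('10.15468', 'w6ubjx')
print(doi_url("10.15468/w6ubjx"))                    # https://doi.org/10.15468/w6ubjx
```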

‖ Digital Specimen PIDs (and the best practices, governance model and services associated with them) contribute not only to keeping the uniqueness of the specimen data over time but also to ensuring long-term trust in the accuracy and authenticity of the scientific data, given that PIDs remain unaltered even if there are changes in the supporting technology that we use to implement them.

‖ Not only that: Just think about it and you will realise that they also contribute to making the data more FAIR, or at least more “FA”, because they make Findability and Accessibility of data easier (see the FAIR section above).

‖ Now that you know what a PID is and why PIDs matter, you should know that DiSSCo and other international scientific infrastructures are working to create a DOI specifically for the concept of Digital Specimen. The same way that a DS brings together all relevant data about a specimen, this "expanded" DOI is meant to bring together the relevant PIDs of all that data related to a digital specimen. A PID for connecting PIDs, so to speak.

Ꙭ Wanna know more? Then go here or here. Our colleagues from DiSSCo's technical team will be glad to give you more details.

DESCRIBING DIGITAL OBJECTS (Or "What's in the basket?")

‖ Describing what should be inside a Digital Specimen is where data semantics, context and interpretation become crucial concepts.

‖ We have just seen that NSCs realised the need for finding a way to reference specimens so that each of them had a unique, persistent identifier accepted by the scientific community. Well, something similar happens with the way NSCs record, describe and exchange their data. NSCs need their own sort of "Esperanto" (only this time working!) that all can understand.

‖ Rather than keeping their own, local way of recording, describing and exchanging specimen data, NSCs saw at some point that they would need a bunch of “common guidelines” understood and followed by all, to ensure that data was interoperable across institutions and projects.

‖ The answer to that need was the development of data exchange standards, basically a set of rules agreed upon by the scientific community so that everyone records, describes and exchanges data the same way. The main data standards for collections are Darwin Core and ABCD (Access to Biological Collection Data, which turns into ABCDEFG when Extended For Geosciences). For digitisation status, we use MIDS (Minimum Information about a Digital Specimen).
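
‖ To see what "the same way" means in practice, picture a collection translating its home-grown column names into shared Darwin Core terms. In the sketch below the local names (`cat_no`, `species` and so on), the sample record and the `to_darwin_core` helper are all hypothetical; the right-hand side uses Darwin Core term names (`catalogNumber`, `scientificName`, `institutionCode`, `eventDate`, `recordedBy`).

```python
# Hypothetical local column names mapped to Darwin Core terms.
LOCAL_TO_DWC = {
    "cat_no": "catalogNumber",
    "species": "scientificName",
    "museum": "institutionCode",
    "collected_on": "eventDate",
    "collector": "recordedBy",
}


def to_darwin_core(local_record: dict[str, str]) -> dict[str, str]:
    """Rename known local fields to Darwin Core terms; unmapped fields are dropped."""
    return {
        LOCAL_TO_DWC[name]: value
        for name, value in local_record.items()
        if name in LOCAL_TO_DWC
    }


# Made-up local record; "shelf" has no mapping and is left out of the result.
local = {"cat_no": "12345", "species": "Papilio machaon", "museum": "XYZ", "shelf": "B3"}
print(to_darwin_core(local))
```

Once two institutions both speak in these shared terms, their records can be combined without anyone having to guess what `cat_no` meant.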

‖ Standards do a great job of supporting exchange and integration across data structures, but DiSSCo wants to take one further step…

‖ Hey, it is not about substituting well-established standards, ok? It is rather about building on them and adding improved data management capabilities. Don't forget that DiSSCo is a new infrastructure aiming to build one single European virtual natural collection, nothing less! Add to that the FAIR principles plus the list of new generation services that we are currently developing and you will understand that the whole thing demands some specifications of its own to ensure that Digital Specimens, Digital Collections and other types of digital objects are FAIR across a wide range of different software applications, services and systems.

‖ This further step that is meant to harmonise DiSSCo’s universe goes by the name of open digital specimen (openDS) data specification. Put really simply: openDS explains what the structure and content of a digital specimen should be, the operations that can act upon it and, generally, how to handle and transfer it. The ultimate goal: Making the most of the digital transformation of NSCs that DiSSCo will bring about.

Ꙭ Wanna know more? Let our own Alex Hardisty give you details here.

USING DIGITAL OBJECTS (Or "Stronger together...!")

‖ Connecting biodiversity data with other types of information to get a wider picture of a specimen is nothing new. Researchers and collection managers have been doing this for years. What is new is trying to automate this process so that a global network that integrates different data types (and is FAIR) can be built. DiSSCo leads the way for this in Europe via the Digital Specimen concept, but ours is not the only effort...

THE DIGITAL EXTENDED SPECIMEN

‖ It so happens that, on the other side of the Atlantic, the Biodiversity Collections Network (BCoN) in the US has been walking a similar path. Building on the concept of “Extended Specimen” (Webster, 2017), the BCoN envisages a future landscape where collections offer the scientific community actionable data that link to other specimen records and related data.

‖ Is there much difference between the European and the American approaches? Generally speaking, you might say that the European approach takes the perspective of the NSCs (at the end of the day, DiSSCo is all about collections) and the American rather sides with the view of the researcher, but all in all, the Digital Specimen and the Extended Specimen have the potential to converge. It is not for nothing that, following TDWG 2020, more than 35 organisations worldwide and many individuals decided to work collaboratively towards a global specification and interoperability for the Digital Specimen and Extended Specimen concepts. Guess what new term they came up with to unite both:

Yeah...! The Digital Extended Specimen (DES).

‖ The best source of information about Digital Extended Specimens is this splendid article by Alex Hardisty et al., recently published in BioScience. It describes what a DES is in similar terms to the ones that we have used to describe a Digital Specimen. More importantly, the article envisages DES enabling new research on many fronts. Take a look:

□ The information in a DES is richer and denser than what a physical specimen alone can offer, and that will ultimately result in more reliable science.

□ As DOs, DESes will make possible a wide array of practices that were little less than unthinkable with physical specimens (think simulation and prediction capabilities, for example).

□ Co-analysis across scientific disciplines will be made possible in an unprecedented way.

‖ This binnacle was meant to be a general introduction to DiSSCo Digital Specimen Architecture.

‖ As you know, we continue to work on developing many of the areas explained above, from that new expanded DOI to identify Digital Specimens, to the openDS specification or that brand-new concept of Digital Extended Specimen.

‖ We will continue uploading content about DiSSCo as we reach new milestones. In the meantime, please do not hesitate to contact us if you need further information.

We need your feedback

The binnacle aims to have us all up to speed and engaged in the latest developments of DiSSCo Prepare. This makes your contribution particularly important.

Please follow this link and give us some feedback. It will take 2 minutes of your time. Thanks!