ToscanaOpenResearch is based on a platform that integrates and provides access to heterogeneous data. This system is built on semantic technologies and standard interoperability formats, such as Linked Open Data (LOD).
The portal was developed using the ONTOP system (ontop.inf.unibz.it/), an open-source tool for Ontology-Based Data Access and Integration (OBDA/I). Working on top of relational databases, it exposes a SPARQL endpoint; SPARQL is the language used to query datasets in RDF format.
Integrating data through a domain ontology enables users to query the data in the terms of the domain, rather than having to navigate the technical terminology associated with the physical organisation of databases and their complex internal structures.
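The idea of querying by domain term rather than by physical table structure can be illustrated with a toy sketch. This is not how ONTOP itself is configured or invoked: the table names, column names, the `MAPPINGS` dictionary and the `instances_of` helper are all hypothetical, and plain SQL over an in-memory SQLite database stands in for the SPARQL-to-SQL translation that an OBDA system performs.

```python
import sqlite3

# Hypothetical physical schema with cryptic names, standing in for the
# "complex internal structures" of an operational database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t_pers_01 (p_id INTEGER, nm TEXT, role_cd TEXT, org_id INTEGER);
    CREATE TABLE t_org_07 (o_id INTEGER, o_nm TEXT);
    INSERT INTO t_org_07 VALUES (1, 'Example University');
    INSERT INTO t_pers_01 VALUES (10, 'Researcher One', 'R', 1);
    INSERT INTO t_pers_01 VALUES (11, 'Administrator One', 'A', 1);
""")

# Hypothetical mapping layer: each domain term is defined once, by the
# SQL that retrieves its instances from the physical tables.
MAPPINGS = {
    "Researcher": (
        "SELECT nm AS name, o_nm AS organisation "
        "FROM t_pers_01 JOIN t_org_07 ON org_id = o_id "
        "WHERE role_cd = 'R'"
    ),
}


def instances_of(term):
    """Return all rows for a domain term, hiding the physical schema."""
    return conn.execute(MAPPINGS[term]).fetchall()


# The user asks for "Researcher" and never sees t_pers_01 or role_cd.
researchers = instances_of("Researcher")
```

In a real OBDA deployment the mappings would be written in the system's own mapping language and the user would send SPARQL to the endpoint; the sketch only shows why the user-facing query can stay free of table and column names.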
A unique feature of the system is its domain ontology, which complies with international (VIVO) and European (CERIF) standards, is fully open, and has been adapted to the Italian context. The functional components of the system follow best practice, offering constant monitoring and continuously updated analyses (the system updates automatically each time the original open data is updated).
The information system also enables a series of real-time benchmarks that help map the ecosystem of regional competences and specialisations in higher education, research and innovation.
To facilitate the use and interoperability of data, and to make it easier to extract and analyse information coming from different classification systems, ToscanaOpenResearch uses a classification based on three information levels:
- Combining different national classifications (e.g. National University Council – CUN, Scientific-Disciplinary Sector – SSD) and European classifications (European Research Council – ERC, bibliometric areas);
- Classifying information (text mining) from project abstracts and publications;
- Performing vertical analyses using semantic “vocabularies”.
More specifically, combining national and European classifications enables information such as research staff (associated with the CUN classification) to be related to the number of publications (classified by bibliometric areas) and the number of European projects (associated with the ERC classification).
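A minimal sketch of this kind of combination is shown below. It assumes a crosswalk that maps each source code to a shared field label; the crosswalk entries, the counts, and the names `CROSSWALK`, `RECORDS` and `combine` are illustrative assumptions and do not reproduce the portal's actual mapping or data.

```python
from collections import defaultdict

# Illustrative crosswalk (an assumption, not the portal's mapping):
# (classification scheme, code) -> shared field label.
CROSSWALK = {
    ("CUN", "Area 01"): "Mathematics and computer science",
    ("ERC", "PE1"): "Mathematics and computer science",
    ("ERC", "PE6"): "Mathematics and computer science",
    ("BIBLIO", "Mathematics"): "Mathematics and computer science",
    ("CUN", "Area 02"): "Physics",
    ("ERC", "PE2"): "Physics",
    ("BIBLIO", "Physics"): "Physics",
}

# Invented toy counts: (indicator, scheme, code, count).
RECORDS = [
    ("research_staff", "CUN", "Area 01", 120),
    ("publications", "BIBLIO", "Mathematics", 340),
    ("eu_projects", "ERC", "PE1", 5),
    ("eu_projects", "ERC", "PE6", 7),
    ("research_staff", "CUN", "Area 02", 80),
    ("publications", "BIBLIO", "Physics", 410),
    ("eu_projects", "ERC", "PE2", 9),
]


def combine(records, crosswalk):
    """Aggregate indicators from different schemes under shared fields."""
    table = defaultdict(lambda: defaultdict(int))
    for indicator, scheme, code, count in records:
        field = crosswalk[(scheme, code)]
        table[field][indicator] += count
    return {field: dict(indicators) for field, indicators in table.items()}


combined = combine(RECORDS, CROSSWALK)
```

Once every record carries a shared field label, staff, publications and projects for the same field sit side by side in `combined`, even though each indicator was originally classified under a different scheme.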
Additional information on this topic is available in the user manual, which is accessible at this link. For comments or feedback, please send an email to: staff@toscanaopenresearch.it.
The development of ToscanaOpenResearch is the outcome of a process led by the Tuscany Region, working alongside IRPET, FST and a technical partner, Siris Academic.
To date, the system has primarily integrated open data originating from:
- national, European and global open databases;
- additional datasets integrated through collaboration agreements (for example, with the Italian Ministry of Education, MIUR, for CTN/PRIN 2012 data within the Tuscany perimeter; AlmaLaurea data within the Tuscany perimeter; and data provided by certain research bodies located in Tuscany, such as CNR, INFN, INGV and INAF).
For the publications section, the system uses bibliometric databases that are not open, although it is already prepared for easy integration with CINECA-IRIS data.
Insight – The semantic analysis of the “research portfolio”
Abstracts of publications, patent descriptions and the objectives of R&I projects contain a wealth of textual information describing current challenges, proposed or demonstrated advances, and the expected impact of the innovation process in detail.
New Natural Language Processing (NLP) methods can now be used to exploit this semantic richness and characterise research portfolios to support strategic decision-making. Semantic approaches are powerful tools for mapping scientific and technological fields, as they enable users to:
- Analyse each document individually to avoid potential confusion linked to taxonomy;
- Build customised semantic perimeters for specific fields of interest by combining taxonomies, enabling the simultaneous cross-analysis of multiple data sources;
- Systematically analyse documents within customised geographical perimeters to enable benchmarking and specialisation analytics.
These analyses can be “horizontal”, with no predefined thematic focus, or “vertical”, aimed at a specific area of interest. Topic modelling is used to extract research themes and characterise research portfolios, while the development and application of controlled vocabularies enable the analysis of research in specific areas, such as the Sustainable Development Goals (SDGs) or cultural heritage. Within ToscanaOpenResearch, both techniques are employed, and a methodology has been developed to swiftly and efficiently construct controlled vocabularies from an initial set of relevant terms.
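As a rough illustration of a “vertical” analysis, the sketch below tags a document with every theme whose controlled vocabulary contains a term found in the text. It is a simplified assumption of how such matching could work, not the portal's actual methodology: the themes, the terms, and the helpers `compile_vocabulary` and `tag_document` are hypothetical.

```python
import re

# Hypothetical controlled vocabularies: theme -> list of relevant terms.
VOCABULARY = {
    "cultural heritage": [
        "cultural heritage", "archaeological site", "museum collection",
    ],
    "clean energy": [
        "renewable energy", "solar cell", "energy efficiency",
    ],
}


def compile_vocabulary(vocabulary):
    """Build one case-insensitive, whole-word pattern per theme."""
    compiled = {}
    for theme, terms in vocabulary.items():
        # Longest terms first, so longer phrases win over shorter ones.
        ordered = sorted(terms, key=len, reverse=True)
        alternatives = "|".join(re.escape(term) for term in ordered)
        compiled[theme] = re.compile(
            r"\b(?:" + alternatives + r")\b", re.IGNORECASE
        )
    return compiled


def tag_document(text, compiled):
    """Return the sorted list of themes whose vocabulary matches the text."""
    return sorted(
        theme for theme, pattern in compiled.items() if pattern.search(text)
    )


COMPILED = compile_vocabulary(VOCABULARY)
abstract = (
    "We study the energy efficiency of climate control "
    "in a museum collection."
)
tags = tag_document(abstract, COMPILED)
```

Exact term matching is the simplest possible choice here; a production pipeline would likely add lemmatisation, handling of spelling variants, and a review step for ambiguous terms.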
