Project ID 604674
FITMAN – Future Internet Technologies for MANufacturing
SE GeToVa
SE Generation and Transformation of Virtualized Assets
User Guide
Author(s): Benjamin Hiltpolt (UIBK), Ioan Toma (UIBK)
Copyright: © 2015 UIBK
Release Date: 01-11-2015
Revision: 001
Table of Contents
Introduction
1. Overview
2. Architecture and Specifications
3. Getting Started
4. User Interface
5. Application Programming Interface
Introduction
This document describes how to use the GeToVa SE. The administration guide explains how to set up and install the GeToVa SE.
1. Overview
The FITMAN Specific Enabler for Generation and Transformation of Virtualized Assets (GeToVA) supports Virtual Factories (VF) in the semi-automatic generation and clustering of Virtualized intangible Assets (VAaaS) from real-world semi-structured enterprise and network resources. GeToVA also enables multi-format ontology transformation between various representations of Virtualized in-/tangible Assets.

Today's manufacturing ecosystems deal with increasing quantities of unstructured and semi-structured information in web pages, e-mails, text documents, spreadsheets, news articles, collaborative posts and patents, to name but a few. There is therefore a real need to extract this information, to represent it in a meaningful, structured way, and to cluster and transform it into multiple formats in order to support interoperability. GeToVA aims to provide a state-of-the-art, Information Extraction-driven semantic tool for the (semi-)automatic generation of Virtualized intangible Assets, in order to greatly reduce manual data entry for the population of the FITMAN-CAM Specific Enabler.

The GeToVA Specific Enabler provides the following core functionalities:
1. Extraction of Virtualized Assets information from real-world semi-structured enterprise and network resources;
2. Generation of semantic representations of Virtualized intangible Assets according to ontological models;
3. Clustering of Virtualized intangible Assets, enabling better search of such assets;
4. Multi-format ontology transformation between various formats, mapping and exchanging Future Internet (FI) data, e.g. USDL.

The GeToVA Specific Enabler is provided as a set of RESTful services implemented on top of the FITMAN baseline VF Platform. The GeToVA service APIs have been designed to be fully compatible with the FITMAN Platform components, namely Data.SemanticsSupport for the GeToVA multi-format ontology transformation and Apps.Repository for the registration of the assets generated by GeToVA.
2. Architecture and Specifications
The FITMAN-GeToVA Specific Enabler architecture supports the extraction, generation, searching and clustering of real-world semi-structured enterprise and network resources in general, with a specific focus on curricula vitae and company profiles. Furthermore, multi-format ontology transformation between various curriculum vitae and company profile formats is supported. The high-level architecture of FITMAN-GeToVA is depicted in the architecture diagram.
The FITMAN-GeToVA Specific Enabler includes eight components, which are described in detail below.
1. Europass Format Handler
This component is used to manage structured data, i.e. XML according to the Europass Format. The Europass Format [1] is a widely used format for curricula vitae. It provides the format and terminology to specify skills and qualifications effectively when looking for a job or training, helps employers understand the skills and qualifications of the workforce, and helps education and training authorities define and communicate the content of curricula. The Europass Format consists of: (1) Curriculum vitae, (2) Language Passport, (3) Europass Mobility, (4) Certificate supplement and (5) Diploma supplement. Europass, the entity that defined the Europass Format, provides a RESTful API [2] to convert between formats (XML, ODT, PDF, DOC and JSON) and languages. The Europass Format Handler component is responsible for transforming between the different Europass formats using the Europass web services. It can transform the Europass JSON representation to JSON-LD in order to convert it directly to RDF (called Base RDF), and it can check whether JSON files are valid Europass JSON files.

2. Converter
The Converter component is responsible for transforming the Base RDF generated by the Format Handler into various formats, according to various ontologies. As the RDF created by the Format Handler is only loosely structured, more structure is introduced by organizing the data according to curriculum vitae and company profile ontologies. For example, the Converter can transform Base RDF data into Resume Ontology and Linked-USDL compliant data. The transformation is done using a set of SPARQL Constructs. SPARQL [3] is a set of specifications that provide languages and protocols to query and manipulate RDF graph content on the Web or in an RDF store. We use SPARQL Constructs to transform between different RDF representations, i.e. from Base RDF to other RDF formats (e.g. Resume Ontology, Linked-USDL).
PREFIX resb: <http://kaste.lv/~captsolo/semweb/resume/base.rdfs#>
PREFIX res: <http://kaste.lv/~captsolo/semweb/resume/cv.rdfs#>
PREFIX fit: <http://fitman.sti2.at/base/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
CONSTRUCT {
  ?cv a res:CV .
  ?cv res:aboutPerson ?personname .
  ?personname a res:Person .
  ?personname foaf:firstName ?firstname .
  ?personname foaf:lastName ?surname .
}
WHERE {
  ?s fit:SkillsPassport ?cv .
  ?cv fit:LearnerInfo ?learner .
  ?learner fit:Identification ?identification .
  ?identification fit:ContactInfo ?contactinfo .
  ?identification fit:PersonName ?personname .
  ?personname fit:FirstName ?firstname .
  ?personname fit:Surname ?surname .
}
Listing 1: Fragment of SPARQL Construct used to transform Base RDF to Resume Ontology

Listing 1 contains a fragment of the SPARQL Construct used to transform Base RDF to Resume Ontology.

[1] http://europass.cedefop.europa.eu/en/home
[2] http://interop.europass.cedefop.europa.eu/web-services/rest-api-reference/
[3] http://www.w3.org/TR/rdf-sparql-query/
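To make the effect of the CONSTRUCT in Listing 1 more concrete, the following Ruby sketch performs a comparable reshaping by hand on a Europass-style JSON document: it walks SkillsPassport, LearnerInfo, Identification and PersonName and emits Resume Ontology style triples. It is an illustration only. The helper name person_triples, the blank-node labels and the sample input are assumptions, and the actual Converter runs SPARQL Constructs on Base RDF rather than Ruby code on JSON.

```ruby
require 'json'

RES      = 'http://kaste.lv/~captsolo/semweb/resume/cv.rdfs#'
FOAF     = 'http://xmlns.com/foaf/0.1/'
RDF_TYPE = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'

# Hypothetical helper: mirrors the graph pattern of Listing 1 on plain JSON
# and returns the resulting statements as N-Triples-like strings.
def person_triples(europass_json)
  doc  = JSON.parse(europass_json)
  name = doc.dig('SkillsPassport', 'LearnerInfo', 'Identification', 'PersonName')
  return [] if name.nil?

  cv     = '_:cv'     # assumed blank-node label for the CV
  person = '_:person' # assumed blank-node label for the person
  [
    [cv, "<#{RDF_TYPE}>", "<#{RES}CV>"],
    [cv, "<#{RES}aboutPerson>", person],
    [person, "<#{RDF_TYPE}>", "<#{RES}Person>"],
    [person, "<#{FOAF}firstName>", name['FirstName'].to_json],
    [person, "<#{FOAF}lastName>", name['Surname'].to_json]
  ].map { |s, p, o| "#{s} #{p} #{o} ." }
end

sample = '{"SkillsPassport":{"LearnerInfo":{"Identification":' \
         '{"PersonName":{"FirstName":"Quincy","Surname":"Labadie"}}}}}'
puts person_triples(sample)
```

The sketch returns an empty list when the expected path is missing, which roughly corresponds to a CONSTRUCT whose WHERE pattern finds no match.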
3. Clustering
The FITMAN-GeToVA Specific Enabler also supports the clustering of Virtualized intangible Assets. GeToVA uses text/document clustering techniques [4], for which Apache Mahout [5] is used. Apache Mahout is a project of the Apache Software Foundation that aims to build a distributed, scalable machine-learning framework focused primarily on collaborative filtering, clustering and classification. GeToVA uses the K-Means clustering technique [6] from Apache Mahout. Figure 1 shows the resulting clusters after applying this technique to LinkedIn companies.
[4] http://en.wikipedia.org/wiki/Document_clustering
[5] https://mahout.apache.org/
[6] https://mahout.apache.org/users/clustering/k-means-clustering.html
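As an illustration of the K-Means technique named in the Clustering component, here is a minimal pure-Ruby sketch on two-dimensional points. It is not Mahout code: GeToVA delegates clustering to Apache Mahout, and the function names sq_dist and kmeans, the toy data and the choice of initial centroids are assumptions made for this example.

```ruby
# Squared Euclidean distance between two points of equal dimension.
def sq_dist(a, b)
  a.zip(b).sum { |x, y| (x - y)**2 }
end

# Minimal K-Means: assign each point to its nearest centroid, then move
# every centroid to the mean of its points, until assignments stop changing.
def kmeans(points, centroids, max_iter = 100)
  assignment = nil
  max_iter.times do
    new_assignment = points.map do |point|
      centroids.each_index.min_by { |i| sq_dist(point, centroids[i]) }
    end
    break if new_assignment == assignment

    assignment = new_assignment
    centroids = centroids.each_index.map do |i|
      members = points.each_index.select { |j| assignment[j] == i }.map { |j| points[j] }
      next centroids[i] if members.empty? # keep the centroid of an empty cluster

      members.transpose.map { |coords| coords.sum / coords.size.to_f }
    end
  end
  [assignment, centroids]
end

points = [[1.0, 1.0], [1.5, 2.0], [1.0, 0.5], [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]]
labels, centers = kmeans(points, [points[0], points[3]])
puts labels.inspect
puts centers.inspect
```

For text clustering, each document would first have to be turned into a numeric vector (for example term weights); that step is left out of this sketch.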
Figure 1: RDF clustering based on Apache Mahout. The top picture shows an overview of all clusters. The bottom picture shows a close-up view of one of the clusters. The clustering is visualized within our Web Front-End. The detailed clustering data is accessible as JSON via our REST API.
4. Knowledge Extractor
The Knowledge Extractor component is responsible for extracting information from real-world semi-structured enterprise and network resources. We focus in particular on curricula vitae and company profiles, which are available as textual resources in .html, .xml, .txt and .doc files. The Knowledge Extractor component uses a state-of-the-art, IE-driven semantic tool for (semi-)automatic generation of VAaaS in order to greatly reduce manual data entry for the population of the FITMAN-CAM SE. The Knowledge Extractor is supported by domain ontologies, i.e. curriculum vitae and company profile ontologies (see the Europass Format Handler component), in order to identify the ontological concepts and relations that semantically describe the text content.

The component uses GATE (http://gate.ac.uk/) for Information Extraction. GATE is open source software for solving text processing tasks. It is a family of tools that includes an integrated development environment (GATE Developer) for language processing components, bundled with a very widely used Information Extraction system and a comprehensive set of other plugins, a cloud computing solution (GATE Cloud) for hosted large-scale text processing, a collaborative annotation environment (GATE Teamware), a multi-paradigm search repository (GATE Mimir) and an object library (GATE Embedded). Out of this tool set, the Knowledge Extractor uses GATE Developer for the semi-automatic extraction of relevant information from resources. The information is annotated using domain-specific schemas/ontologies (in our case curriculum vitae and company profile ontologies).

The overall process is semi-automatic. On the one hand, GATE automatically spots entities such as places, dates, durations and names in the processed text. On the other hand, other entities such as specific skills need to be annotated manually by the user. The user can validate all created annotations using an interface such as the one depicted in Figure 9.
Extraction of CVs
To semi-automatically extract Resume RDF from CVs, first start GATE and load the ANNIE system with defaults.
Figure 2 Load the Annie system by clicking the A button
Then right-click Language Resources, select New/GATE Document and load the CV into the GATE framework.
Figure 3 Load the CV document
Right-click the created document and select New Corpus with this Document to create a corpus.
Figure 4 Select the ANNIE application and run it
GATE will automatically tag parts of the document (names, organizations). To allow further extraction of information, the user has to annotate skills, education and work experience manually using the tags skill, work and edu. To add descriptions to work experience or education entries, the user can add the tag desc.
Figure 5 To add tags it is sufficient to select a text paragraph. GATE will then prompt for a tag
Once finished, the user has to export the document's annotations by right-clicking the document and selecting Save as. GATE will then export an XML file containing the annotations. To obtain the Resume RDF, the user then runs the knowledge_extractor script with the following command:
ruby knowledge_extractor.rb
The script returns Resume RDF, which can then be added to the database.
Extraction of company data
Approach
The extraction is started by running the script over a set of documents:
ruby company_extractor.rb html/*
The data is then automatically extracted and stored in the database and the search engine. The results of the extraction can then be accessed using the REST API or the Web Front-End. The script also generates a set of results locally in the folder results. The results can be validated using GATE. To do so, start GATE and load the ANNIE system as in Figure 2.
Figure 6 Adding an ANNIE Gazetteer
Figure 7 Adding the Tag word lists generated by the Knowledge Extraction component
Next, add an ANNIE Gazetteer to the GATE Application (Figure 6) and load the word list generated by the Knowledge Extractor (Figure 7), which can be found in the folder where the script was run. Load the documents on which the extraction should be performed as in Figure 3. After adding the Gazetteer to the ANNIE system, run the GATE Application as in Figure 4.
Figure 8 Use GATE to adapt the annotations created by the Knowledge Extractor component
Use GATE to verify and adapt the Tags generated by the Knowledge Extractor (Figure 8). Once finished, store the document as XML and integrate the results into the system by running:
ruby gate_xml_to_rdf.rb
Testing
The functionality of all major components can be verified using unit tests based on the rspec library [7]. The tests can be run from the command line with:
rspec
[7] http://rspec.info/
Figure 9: GATE tool used by the Knowledge Extractor to highlight the name of the person and the organisations the person studied in.

Entities are spotted in the raw text by GATE using a combination of linguistic rules and dictionaries. The rules contain the logic needed to identify the specific types of entities and are triggered as the text is processed. The dictionaries contain lists of entity instances, e.g. lists of places, names, etc., and are used to identify entity instances in the text. At the end of the extraction process, the Knowledge Extractor component takes the annotations generated by GATE and creates Base RDF. The resulting RDF can then be interlinked with other linked RDF data sets.

Some of the extractions are available inside the Web Front-End. For example, downloaded HTML pages from led-info.de can be uploaded to the platform. The platform then extracts information from the HTML and integrates the data into the system (see Figure 10).

Extracting LED Companies within the Front-End:
Figure 10 Extraction of LED-Companies
Extracting public LinkedIn profiles:
To extract a LinkedIn people profile, it is sufficient to provide the URL of the desired profile. This is done by pasting the LinkedIn profile URL of a person into the Web Front-End, as shown in Figure 11.
Figure 11 Scraping LinkedIn profiles by passing a LinkedIn profile URL to the Web Front-End.
Currently the Knowledge Extractor is used for the following extractions:
- CVs
- LinkedIn Company and People pages
- Tenders from sell2wales.gov.uk
- Companies from led-info.de
- Companies from aerospacewalesforum.com/
- Companies from XML database dumps
5. Ontology Manager
The Ontology Manager is used to create RDF data that is valid with respect to the ontologies used within our platform. Currently we support our own internal ontology to represent companies and CVs. We also use the Ontology Manager component to generate Linked-USDL. In addition, we provide the data as JSON-LD.
6. Tagging System
We use a Tagging System that allows the user to define annotations that can be reused to automatically spot properties within semi-structured data. The Knowledge Extractor component uses several scripts to generate the Tags, which are then used as additional dictionaries by GATE. GATE can then be used to check the validity of the Tags and adapt them if needed. Additional Tags can also be added within GATE. Those Tags are then added to the database for further use. The Tagging System allows users to collaboratively create a Knowledge Base of Tags that can be reused by several users, which speeds up the extraction process significantly.
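The idea of reusing Tags as dictionaries can be sketched as a simple word-list lookup. The snippet below is an illustration in plain Ruby and not the GATE gazetteer itself; the TAGS structure, the function name spot_tags and the sample entries are hypothetical.

```ruby
# Hypothetical tag dictionary: tag name => list of known surface forms.
TAGS = {
  'skill' => ['Ruby on Rails', 'SPARQL', 'welding'],
  'edu'   => ['University of Innsbruck']
}.freeze

# Returns [tag, matched text, character offset] for every dictionary hit,
# ordered by position in the text. Matching is case-insensitive.
def spot_tags(text, tags = TAGS)
  hits = []
  tags.each do |tag, entries|
    entries.each do |entry|
      pattern = /\b#{Regexp.escape(entry)}\b/i
      text.scan(pattern) do
        match = Regexp.last_match
        hits << [tag, match[0], match.begin(0)]
      end
    end
  end
  hits.sort_by { |hit| hit[2] }
end

cv = 'Studied at the University of Innsbruck; experienced in SPARQL and Ruby on Rails.'
spot_tags(cv).each { |tag, match, offset| puts "#{offset}\t#{tag}\t#{match}" }
```

Because the dictionary is plain data, several users could extend the same word lists, which is the collaborative aspect described above.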
7. Search
Based on the Search Engine, we provide full-text search over our data. We also provide autocompletion of search terms and Tags (Figure 12). The search supports fuzzy matching as well.
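Fuzzy matching in ElasticSearch is based on the Levenshtein edit distance between the search term and the indexed terms. The following sketch computes that distance in plain Ruby to illustrate why a misspelled query can still find a company. The function names levenshtein and fuzzy_match, the edit limit and the sample terms are assumptions; the real matching is performed by the search engine, not by this code.

```ruby
# Levenshtein distance: minimum number of single-character insertions,
# deletions and substitutions needed to turn a into b.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Hypothetical helper: keep the terms within max_edits of the query,
# comparing case-insensitively.
def fuzzy_match(query, terms, max_edits = 2)
  terms.select { |term| levenshtein(query.downcase, term.downcase) <= max_edits }
end

puts fuzzy_match('Asmetek', %w[Asmetec METaL Plastics]).inspect
```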
Figure 12 Autocomplete feature of the Frontend
8. Database and Search Engine
The data extracted, generated and transformed by the FITMAN-GeToVA Specific Enabler is stored in a database. We use SQLite to store the data. To handle the sophisticated search mechanisms that are provided on the Front-End as well as via the REST API, we use ElasticSearch [8] as our search engine.
3. Getting Started
Run docker-compose to start the application (see the administration guide for instructions). Alternatively, visit http://fitman.sti2.at for a live demo.
4. User Interface
To showcase the functionality of our platform, we developed a web-based Front-End that visualizes all the features we provide.
[8] http://www.elasticsearch.org/
Figure 13: User Interface, consisting of several tabs that display the Virtualized Intangible Assets. The interface also provides all functionalities of the REST API.
The Web Interface is developed using Ruby on Rails. It provides a way for human users to access the functionality of the FITMAN-GeToVA Specific Enabler. It can be seen as a layer on top of the RESTful API, accessing and using the underlying services, i.e. the Europass Format Handler, Converter, Clustering and Knowledge Extractor. The FITMAN-GeToVA Specific Enabler is developed following service orientation principles: all of its components, i.e. the Europass Format Handler, Converter, Clustering and Knowledge Extractor, are provided as RESTful services.
5. Application Programming Interface
The overall functionality of the FITMAN-GeToVA Specific Enabler is exposed in a unified RESTful API. The RESTful API includes methods to manipulate resources such as curricula vitae and company profiles and their representations in JSON, Base RDF, XML, Linked-USDL, etc., as well as the ConverterFormat resources. ConverterFormats are SPARQL Constructs that are used to transform between the Base RDF format and the other representation formats for curricula vitae and company profiles. A part of the API:
GET POST PATCH PUT DELETE /tanet_linkedins
Manages extracted LinkedIn profile resources. For example:
{
  id: 1,
  name: "Ioan Toma",
  data: "{"name":"Ioan Toma","first_name":"Ioan","last_name":"Toma","title":null,"location":"Austria area","country":"Austria area","industry":"Research","summary":null,"picture":"https://static.licdn.com/scds/common/u/images/themes/katy/ghosts/person/ghost_person_150x150_v1.png","linkedin_url":"https://www.linkedin.com/in/ioantoma","education":[],"groups":[],"websites":[],"languages":[],"skills":[""],"certifications":[],"organizations":[],"past_companies":[],"current_companies":[],"
}
GET /scrape_linkedin
Scrapes a LinkedIn profile. The request must include JSON with the following structure:
{'url' => "https://at.linkedin.com/in/ioantoma"}
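A request to this endpoint could be assembled as follows with Ruby's standard library. This is a sketch under assumptions: the host is taken from the live demo address in this guide, the content type header and the helper name build_scrape_request are not specified in this guide, and the request is only built here, not sent.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Assumption: the API is served from the live demo host.
API_BASE = 'http://fitman.sti2.at'

# Builds (but does not send) a GET request whose JSON body carries the
# profile URL under the key "url", as described above.
def build_scrape_request(profile_url)
  uri = URI.join(API_BASE, '/scrape_linkedin')
  request = Net::HTTP::Get.new(uri)
  request['Content-Type'] = 'application/json' # assumed header
  request.body = JSON.generate({ 'url' => profile_url })
  request
end

request = build_scrape_request('https://at.linkedin.com/in/ioantoma')
puts "#{request.method} #{request.path}"
puts request.body

# Sending it would require network access, for example:
# Net::HTTP.start(request.uri.host, request.uri.port) { |http| http.request(request) }
```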
GET POST PATCH PUT DELETE /tenders
Manages extracted Tenders from sell2wales resources. Resources are of the format:
{
  id: 1,
  data: "test",
  created_at: "2015-11-02T09:53:18.315Z",
  updated_at: "2015-11-02T09:53:18.315Z"
}

GET POST PATCH PUT DELETE /complus
Manages extracted companies from led-info resources. For example http://fitman.sti2.at/complus/1.json will return:
{
  id: 1,
  data: "{ "@id": "_:g70352464308540", "@type": "c:Company", "c:country": " Germany\n", "c:distribution_type": " Einzelhandel (B2C) ", "c:extra": "complus", "c:hasLocality": "Bolanden", "c:hasMail": "[email protected]", "c:hasStreetAddress": " Klosterwiesen 10\n", "c:hasWebsite": "http://www.asmetec.de", "c:locatedInRegion": " Germany\n", "c:name": "Asmetec", "c:postalCode": "67295", "c:sells": [ " LED-Strahler ", " Raumbeleuchtung ", " sonstige LED Leuchten " ] }",
  created_at: "2015-09-18T15:19:33.395Z",
  updated_at: "2015-09-18T15:19:33.395Z"
}
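Note that the data field is itself a string containing JSON-LD, so a client has to decode it in a second step. A minimal Ruby sketch follows, assuming the response is delivered as standard JSON with the inner document escaped. The helper names tidy and company_from_record, the whitespace trimming and the sample record are illustrative assumptions.

```ruby
require 'json'

# Strips surrounding whitespace from strings, recursing into arrays.
def tidy(value)
  case value
  when String then value.strip
  when Array  then value.map { |v| tidy(v) }
  else value
  end
end

# Decodes the outer record, then the JSON-LD document stored in "data".
def company_from_record(record_json)
  record = JSON.parse(record_json)
  JSON.parse(record['data']).transform_values { |v| tidy(v) }
end

# Sample record built in code so the inner document is escaped correctly.
inner  = { '@type' => 'c:Company', 'c:name' => 'Asmetec',
           'c:country' => " Germany\n", 'c:sells' => [' LED-Strahler '] }
record = JSON.generate({ 'id' => 1, 'data' => JSON.generate(inner) })

company = company_from_record(record)
puts company['c:name']
puts company['c:country']
puts company['c:sells'].inspect
```

Trimming is optional, but the extracted values shown above carry leading spaces and trailing newlines that are usually unwanted on the client side.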
POST /fetch_led_company
Fetches data from downloaded HTML from led-info. Not intended for use by third parties. See http://fitman.sti2.at/complus

GET POST PATCH PUT DELETE /tanets
Manages data required for the Tanet use case. For example http://fitman.sti2.at/tanets/1.json:
{
  id: 1,
  data: "{ "@id": "_:g70120072481280", "@type": "c:Company", "c:category": "Skill, Education and Training", "c:country": "Wales", "c:extra": "tanet", "c:hasLocality": " SA12 7AX", "c:hasMail": "[email protected]", "c:hasStreetAddress": "Baglan Bay Innovation Centre Baglan", "c:hasWebsite": " http://www.projectmetal.co.uk ", "c:locatedInRegion": " Port Talbot", "c:name": "METaL", "c:postalCode": " SA12 7AX" }",
  created_at: "2015-09-14T10:02:47.859Z",
  updated_at: "2015-09-14T10:02:47.859Z"
}
GET /companies/run_clustering
Starts the clustering and reports whether the clustering has finished. Once the clustering is finished, detailed results of the clustering are returned.

GET /companies/clustering_visual
Returns a compact version of the cluster results. This service is used by the visualization as well. The format is based on the following structure:
{name: "Clusters", children: [
  { name: "Cluster 1", children: [
    {name: "A2BPlasticsLtd", size: 830},
    {name: "AcceleroDigitalSolutionsLtd", size: 1415},
    {name: "AngleseyCircuitTracMn", size: 1145}
    ………
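A client can turn this nested structure into a flat mapping from cluster name to company names. The sketch below assumes that the service returns standard JSON with quoted keys following the structure shown above; the helper name clusters_to_hash and the shortened sample payload are assumptions.

```ruby
require 'json'

# Maps each cluster name to the names of the companies it contains.
def clusters_to_hash(payload_json)
  root = JSON.parse(payload_json)
  root.fetch('children', []).map do |cluster|
    [cluster['name'], cluster.fetch('children', []).map { |company| company['name'] }]
  end.to_h
end

payload = <<~PAYLOAD
  {"name": "Clusters", "children": [
    {"name": "Cluster 1", "children": [
      {"name": "A2BPlasticsLtd", "size": 830},
      {"name": "AcceleroDigitalSolutionsLtd", "size": 1415}
    ]}
  ]}
PAYLOAD

clusters_to_hash(payload).each do |cluster, companies|
  puts "#{cluster}: #{companies.join(', ')}"
end
```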
GET /companies/search/:search
Runs a search for the keyword specified in :search. It returns a list of companies matching the criteria.
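Because the keyword is part of the URL path, it should be percent-encoded before the request is made. A minimal sketch in Ruby follows. The host is the live demo address from this guide, while the helper name search_url and the example keyword are assumptions; the guide does not state the exact encoding the service expects.

```ruby
require 'erb'
require 'uri'

API_BASE = 'http://fitman.sti2.at' # assumption: live demo host

# Builds the search URL, percent-encoding the keyword so that spaces and
# slashes do not break the path segment.
def search_url(keyword)
  URI.join(API_BASE, "/companies/search/#{ERB::Util.url_encode(keyword)}")
end

puts search_url('LED Strahler')
```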
GET POST PATCH PUT DELETE /companies
Manages the companies resources. For example http://fitman.sti2.at/companies/1.json returns:
{
  id: 1,
  name: "A2B Plastics Ltd ",
  jsonld: ""{\"@context\":{\"gr\":\"http://purl.org/goodrelations/v1#\",\"s\":\"http://schema.org/address\",\"v\":\"http://www.w3.org/2006/vcard/ns#\",\"l\":\"http://www.linked-usdl.org/ns/usdl-core#\",\"foaf\":\"http://xmlns.com/foaf/0.1/\",\"c\":\"http://fitman.sti2.at/company/\"},\"@id\":\"_:g62809420\",\"foaf:page\":\"www.a2bplastics.co.uk\\n\",\"gr:legalName\":\"A2B Plastics Ltd\\n\"}"",
  created_at: "2015-09-14T09:34:28.574Z",
  updated_at: "2015-09-14T09:34:28.574Z"
}

GET POST PATCH PUT DELETE /individuals
Manages all the CV resources. For example http://fitman.sti2.at/individuals/1.json returns:
{
  id: 1,
  name: "betty",
  created_at: "2015-09-14T09:34:25.252Z",
  updated_at: "2015-09-14T09:34:25.252Z"
}
GET POST PATCH PUT DELETE /representations
Manages the CV representation resources, i.e. concrete formats of a certain CV, such as the CV represented as JSON. For example http://fitman.sti2.at/representations/2.json returns:
{
  id: 2,
  individual_id: 2,
  content: "{"SkillsPassport":{"LearnerInfo":{"Identification":{"PersonName":{"FirstName":"Quincy","Surname":"Labadie"},"ContactInfo":{"Address":{"Contact":{"AddressLine":"514 Gibson Junction","PostalCode":"317852977","Municipality":"Manuelachester","Country":{"Code":"EG","Label":"Maldives"}}}},"Demographics":{"Birthdate":{"Year":1986,"Month":2,"Day":5},"Gender":{"Label":"male"}}},"WorkExperience":[{"Period":{"From":{"Year":1988,"Month":6},"To":{"Year":1980,"Month":6},"Current":"false"},"Position":{"Label":"Human Assurance Technician"},"Activities":"empower value-added convergence","Sector":"Garden","Employer":{"Name":"Schroeder, Rogahn and Macejkovic","ContactInfo":{"Address":{"Contact":{"AddressLine":"9668 Schaefer Turnpike","PostalCode":"47774","Municipality":"Mabelleshire","Country":{"Label":"Northern Mariana Islands"}}}}}},{"Period":{"From":{"Year":1971,"Month":9},"To":{"Year":1982,"Month":4},"Current":"false"},"Position":{"Label":"Dynamic Brand Representative"},"Activities":"target back-end initiatives","Sector":"Electronics","Employer":{"Name":"Bauch, Stokes and Lockman","ContactInfo":{"Address":{"Contact":{"AddressLine":"701 Eula Courts","PostalCode":"26101","Municipality":"Thielshire","Country":{"Label":"Republic of Korea"}}}}}}]}}}",
  format_id: 5,
  created_at: "2015-09-14T09:34:25.958Z",
  updated_at: "2015-09-14T09:34:25.958Z"
}
GET POST PATCH PUT DELETE /individual_formats
Manages all the formats supported by the platform for the transformation of people profiles. For example http://fitman.sti2.at/individual_formats/1.json returns JSON of the format:
{
  id: 1,
  name: "resume",
  baseToFormat: SPARQL_CONSTRUCT,
  formatToBase: SPARQL_CONSTRUCT,
  created_at: "2015-09-14T09:34:25.167Z",
  updated_at: "2015-09-14T09:34:25.167Z"
}

For example, the method "GET /curriculum_vitaes/:id/:format" can be used to get a given format of a curriculum vitae, e.g. Resume RDF or JSON. As we use an ElasticSearch server as our search engine, an ElasticSearch server endpoint is available as well, providing the whole ElasticSearch API functionality.

The RESTful API uses the different components to create curriculum vitae and company profile representations for every existing format. To create each representation, it is only necessary to upload one format. The RESTful API can then use a transformation chain to create the other formats.
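The transformation chain can be pictured as a conversion through Base RDF: a representation is first mapped to Base RDF with its formatToBase step and then to the target with the target's baseToFormat step. The Ruby sketch below models this with lambdas standing in for the SPARQL Constructs. The FORMATS registry, the helpers convert and all_representations, and the toy string transformations are purely hypothetical and only illustrate the chaining.

```ruby
# Hypothetical registry: each format knows how to go to and from Base RDF.
# The lambdas are placeholders for the SPARQL Constructs stored per format.
FORMATS = {
  'resume' => {
    format_to_base: ->(doc) { doc.sub('resume:', 'base:') },
    base_to_format: ->(doc) { doc.sub('base:', 'resume:') }
  },
  'usdl' => {
    format_to_base: ->(doc) { doc.sub('usdl:', 'base:') },
    base_to_format: ->(doc) { doc.sub('base:', 'usdl:') }
  }
}.freeze

# Converts a document between two formats by chaining through Base RDF.
def convert(doc, from:, to:, formats: FORMATS)
  base = formats.fetch(from)[:format_to_base].call(doc)
  formats.fetch(to)[:base_to_format].call(base)
end

# Derives every registered representation from one uploaded format.
def all_representations(doc, from:, formats: FORMATS)
  formats.keys.map do |name|
    [name, name == from ? doc : convert(doc, from: from, to: name, formats: formats)]
  end.to_h
end

puts all_representations('resume:Person', from: 'resume').inspect
```

Routing every conversion through one pivot format means that each new format only needs two transformations (to and from Base RDF) instead of one per existing format.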