News

03 MAR 2022

Coreon Wins €16m EU Semantic Web Consulting Contract

The framework contract Semantic Web Consultancy and Support (procurement OP/LUX/2021/OP/0006) for the Publications Office of the European Union, valued at €16 million, was won by a consortium of Infeurope S.A., INTRASOFT International S.A., Cognizone BV, and Coreon GmbH. The consortium combines vast knowledge and experience in providing software and services in AI, NLP, data, and semantic technologies. The consortium members have already worked together on several projects.

The main tasks concern the elaboration of studies, technical specifications, and prototypes for improving the current implementation and configuration of Ceres, CELLAR, and other systems using semantic technology. In addition, the consortium will provide technical assistance for preparing and executing tests that demonstrate the developed systems conform to their technical specifications, including the production of test reports and data curation.

Semantic data assets are deployed to improve operations, especially when it comes to supporting users with information needs across languages and domains. The rich multilingual knowledge resources of the Publications Office and the European Commission, namely EuroVoc combined with IATE, will nicely underpin such efforts.
03 JUL 2020

Coreon MKS as LLOD is European Language Grid top funded project

Coreon’s proposal for using the European Language Grid (ELG) as a platform for making multilingual interoperability assets discoverable and retrievable has been awarded funding. This will be achieved by complementing Multilingual Knowledge Systems with a SPARQL interface. The ELG Open Call 1 received 121 proposals, of which 110 were eligible and 10 were selected. Coreon’s proposal “MKS as Linguistic Linked Open Data” was amongst the three winning proposals from industry and received the highest funding.

The goals of the project are a) to enable Semantic Web systems to query Coreon’s richly elaborated multilingual terminologies stored in concept systems and knowledge graphs and b) to show how to overcome the limits of RDF/knowledge graph editors, which are usually fine for modelling concept relations but weak at capturing linguistic information. When deployed on the ELG in March 2021, the innovation will enable the Semantic Web community to query rich multilingual data with a familiar, industry-standard syntax.
07 NOV 2019

CEFAT4Cities Action Gets Funding

The CEFAT4Cities Action, to be executed by a multinational consortium of five partners led by CrossLang, has received funding. The action starts in April 2020 and runs until March 2022.
The main objective of the CEFAT4Cities Action is to develop a “Smart cities natural language context”, providing multilingual interoperability of the Context Broker DSI and making public “smart city” services multilingual, with pilots in Vienna and Brussels.
The language resources that will be created will be committed to the ELRC repository and the following languages will be developed: Dutch, English, French, German, Italian, Slovenian, Croatian and Norwegian.

Coreon's role in the consortium is to provide the appropriate technology to turn vocabularies into multilingual knowledge graphs, and to curate and extend them to model the domain of smart cities.
19 APR 2022

So, You Think You Want A Chatbot?

Challenges Of Chatbot Development

Coreon is a part of CEFAT4Cities Action, a project co-financed by the Connecting Europe Facility of the European Union that targets the interaction of EU residents and businesses with smart city services. One of its outcomes is an open multilingual linked data repository and a pilot chatbot project for the Vienna Business Agency, which leverages the created resource.

In a small series of blog posts, we will share our experiences building a multilingual chatbot. We will demonstrate how to overcome the language gap of local public services on a European scale, thus reducing red tape for citizens and businesses. In this opening article we steer clear of concrete frameworks and instead focus on the challenges of chatbot development.

Before You Summon The Engineers...

Dialogue is a natural way for humans to interact -- we express ideas and share information, convey mood and engage in debates via conversational exchange. 

Chatbots, also known as ‘conversational agents’ or ‘conversational assistants’, are designed to mimic this behavior, letting us interact with digital services as if we were talking to a real person. With the latest tech advances, we regularly come across conversational agents -- booking tickets, ordering food, obtaining directions, managing bank accounts, receiving assistance from Siri and Alexa -- the list goes on!

Here are some key (although they may appear trivial at first) questions you should ask your project manager before starting out on the development of a new chatbot:

  • What do you want to achieve with the bot? What are the limits of its capabilities? What tasks is it supposed to be tackling? Will it be of a transactional nature or should it rather provide advice, assist with search and the retrieval of information, or combine all of these capabilities in one?
  • What audience is this bot targeting? Are you interested in deploying it internally for a limited audience, or will it be used broadly, e.g., as part of customer care? Which languages does the audience speak, or which are they comfortable working with?
  • How do you capture and maintain the knowledge the bot is supposed to provide to the user?
  • What features does your Minimum Viable Product incorporate, regardless of the framework? Is the bot’s appearance important? Do you need it to have a specific personality, or would you rather keep it neutral?

You get the idea. Once the conceptual foundation is there, it’s time to dive deeper into the sea of frameworks and features.

What Are Chatbots Made Of?

Depending on the technology used, chatbots can vary from simple interactive FAQ-like programs to vastly adaptive, sophisticated digital assistants that can handle complex scenarios and offer a high degree of personalization due to their ability to learn and evolve.

There are a few key concepts that engineers rely on when they develop a chatbot. As humans, we barely consciously pay attention to or think too much about sub-processes firing in our brain when engaging in a conversation. As NLP engineers of a solution that aims to imitate human behavior, however, we should have a clear division of these sub-processes. A chatbot should not only be able to ‘digest’ text or voice input from a user, but also find a way to ‘understand’ it by inferring the semantics of the user's intent and generating an appropriate answer.

To recognise the needs and goals of the user, modern chatbot frameworks rely on Natural Language Processing (NLP), Natural Language Understanding (NLU) and Natural Language Generation (NLG).

NLP is responsible for parsing the user's input, or utterances; NLU identifies and extracts its intents (goals) and entities (specific concepts like locations, product names, companies -- anything, really), while an NLG component generates relevant responses to the user.
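
To make this division of labor concrete, here is a minimal, framework-free sketch in Python. The keyword rules, the regular expressions, and the names detect_intent, extract_entities, and generate_reply are invented for this illustration; they are not the API of any particular chatbot framework.

# Minimal sketch of the NLP -> NLU -> NLG split described above.
# All names and the toy keyword rules are hypothetical, not a real framework API.
import re
from dataclasses import dataclass, field

@dataclass
class ParsedUtterance:
    intent: str
    entities: dict = field(default_factory=dict)

def detect_intent(utterance: str) -> str:
    """NLU, step 1: map the user's goal to a known intent (toy keyword rules)."""
    text = utterance.lower()
    if any(word in text for word in ("book", "ticket", "reserve")):
        return "book_ticket"
    if any(word in text for word in ("open", "hours", "when")):
        return "opening_hours"
    return "fallback"

def extract_entities(utterance: str) -> dict:
    """NLU, step 2: pull out specific concepts -- here just dates and cities (toy regexes)."""
    entities = {}
    date = re.search(r"\b\d{1,2}\.\d{1,2}\.\d{4}\b", utterance)
    if date:
        entities["date"] = date.group()
    city = re.search(r"\b(Vienna|Brussels)\b", utterance, re.IGNORECASE)
    if city:
        entities["city"] = city.group()
    return entities

def generate_reply(parsed: ParsedUtterance) -> str:
    """NLG: turn intent plus entities into a response (template-based)."""
    if parsed.intent == "book_ticket":
        return f"Sure, booking a ticket for {parsed.entities.get('city', 'your destination')}."
    if parsed.intent == "opening_hours":
        return "We are open Monday to Friday, 9:00-17:00."
    return "Sorry, I did not understand that. Could you rephrase?"

utterance = "Can I book a ticket to Vienna on 24.12.2024?"
parsed = ParsedUtterance(detect_intent(utterance), extract_entities(utterance))
print(generate_reply(parsed))   # -> Sure, booking a ticket for Vienna.

A production NLU component would replace the keyword rules with trained intent classifiers and entity extractors, but the division between parsing, understanding, and generation stays the same.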

So You Think You Want A Chatbot

When it comes to bot development activities, there is a variety of closed- and open-source tools and frameworks to choose from. This choice is already a challenge of its own.

Before settling for a specific solution that can accomplish your primary goal, it's helpful to analyze both the short- and long-term project objectives:

  • Have you chosen your features wisely? Are the essential ones mature enough for the project?
  • Are there framework-bound scalability limitations in case of project growth?
  • Can you easily integrate additional building blocks offered by third-party providers?
  • Does the framework satisfy your deployment and security requirements? Does it comply with your client's demands? Have these demands been clearly defined (e.g. on-premise vs. cloud-based solution, specific feature requirements, containerization)?
  • What kind of staff and how much of their capacity is needed to deploy and maintain the deployed solution?

Another chunk of pre-implementation work is associated with domain knowledge -- domain experts and NLP engineers are rarely the same people, so coming up with a process that ensures smooth cooperation in knowledge transfer can save a lot of time and hassle. It is also a good chance to clarify the conversational limitations of the future chatbot: does it need to cover chit-chat or support human handover, make API calls and pull up information on request? Does it need to be context-aware, or is it enough to support one conversational turn? You don't want to fully draft your bot's behavior at this stage, but rather establish its foundation and mark the boundaries.

Once these questions are clarified, you can roll up your sleeves and start coding. The process tends to be highly iterative -- despite all the buzz around AI, you need conversational data from real users from the early stages onwards if you want to design a human-like conversation flow and build a robust virtual assistant.

The rule of thumb is to share your prototype with test users early and to continuously collect their conversational data, so you can adapt the bot's logic, annotate utterances, and retrain NLU models with the updated real data. And once your assistant is mature enough, you are ready to deploy! 🚀

Stay tuned for implementation details on the multilingual abilities of SmartBot, our chatbot for the Vienna Business Agency, as well as some nifty tricks to ensure consistency in a bot’s language abilities, no matter your language of choice.



2 MAR 2022

Maintaining Concept Maps: A Time-Saver For Terminologists

Maintaining concept maps makes the handling of terminology data more efficient.

Maintaining concept maps involves curating terminology data in a knowledge graph that visually displays the relations between concepts. The benefits of this include:

All desirable features for terminologists and their organizations, yet we often hear the following reaction when discussing the use of the Coreon Multilingual Knowledge System (MKS):

I am already so busy with other daily duties. I have no time to maintain a concept map…

This is a common misconception, leading many to stick to the old method of storing an endless number of concepts in all required languages, while also writing lengthy definitions to explain and illustrate what each is about.

For a terminologist in particular, working with a 'concept map' (such as the one visualised in the Coreon MKS) makes your work significantly easier and more efficient. So, let’s address a few concerns around the perceived burden of maintaining concept maps and look at exactly why they are in fact worth your attention!


Concept maps sound great but, as a terminologist, this just means additional work

Not if you make concept maps an integral part of your terminology management.

Consider your steps when adding a new term to your data set – let’s say HDMI input port. Firstly, you check if it already exists in your repository – you may just search for it (although read below why I think this is a dangerous method). However, if you have an already-developed concept system you can simply navigate to where you would expect concepts linked to 'screens' and other output devices to be stored. You may identify the concept 'frame', which is stored underneath the broader concept 'screen'. You also see semantically similar concepts with terms such as on/off button, power cable, VGA port, or USB port.

You also need to decide – is the new HDMI input port simply the English term for an already-existing concept? Perhaps a colleague has already added it using the German term HDMI-Eingang, or using another synonym? In that case, you’d simply add HDMI input port as a term to the existing concept.

If the concept is indeed missing, however, you will want to add it. Now comes the key point – you are already at the location in your concept system where HDMI input port needs to be inserted. With the Coreon MKS, you would simply click ‘Insert new concept here’ and the relation between HDMI input port and its broader concept 'frame' is created in the background.

Relations between terms are created automatically when maintaining concept maps.
Simply insert a new concept in the right place and the relation is created automatically.

No additional work, then, but rather a welcome side-effect of adding new concepts in a systematic fashion. It’s comparable to the basic decision of where to save a new Word document in a file system, nothing more. And we do that every day, don’t we?


I see little value besides the nice visualization

Well, the value is in fact inherent in the visualization!

Let’s say you are faced with the task of illustrating and documenting your concepts by writing a definition.

How do you usually craft a definition? An established way is to explicitly differentiate the characteristics of a concept from its general, broader concepts as well as semantically closer concepts. An HDMI input port is part of the frame, and is also somewhat complementary to the VGA input.

In the concept map you see all related concepts at hand in one view, so you can write a clear text definition much faster. This is not a hypothesis – users of the Coreon MKS have confirmed that having the concept map at hand enables faster writing of the definition.

You also benefit when crafting and rating terms. Say you’d like to add a concept with a term such as LCD screen. Is this phrasing correct, or do we prefer LCD monitor? Luckily, through the concept map we have the broader concept with its term screen in view as well as two variants, TFT screen and LED screen. In all these cases the component screen was favored over monitor, so it’s a quick and easy decision for LCD screen over LCD monitor.

All in all you benefit from the linguistic work put into related concepts, enabling consistency across concepts!


I won’t click through maps, I prefer searching

At times we all use search, but it basically means taking a guess with a keyword and hoping that this string is present in the repository.

Search only displays concepts where the search term, or slight linguistic variations of it, occurs. So, if you search for 'screen' you would find TFT screen, LCD screen etc., but what if you queried for monitor or display? Or in another language, say the French écran or the German Bildschirm? Your search would miss the concept screen!

Consequently you decide to create a new concept triggered by your new term monitor – and you have unfortunately just created a duplicate – one concept for screen and one for monitor – even though these are synonyms. This is also a reason why I am not a big fan of duplicate recognition – a cool-sounding feature, but one that only checks for homonyms and not redundant concepts…but that’s a topic for another blog post.

Navigating and interacting with a concept map is therefore the key best practice when it comes to updating or maintaining a repository. It keeps the data clean, allowing you to identify gaps and avoid redundancies, so you achieve high-quality data that your users and audiences can rely on as a trustworthy resource.


Your arguments are convincing – but I have no time to post-edit my existing large terminology collections

Doing this manually would indeed be time-consuming, but there is a solution.

We’ve developed a way to automate the process by using advanced AI and NLP methods to ‘draft’ a knowledge graph and speed up its creation dramatically.

If you own a ‘flat’ terminology collection of several thousand concepts – available in formats such as ISO TBX, SDL MultiTerm, or MS Excel – this auto-taxonomization method can now elevate the data into a knowledge graph faster, getting the most tedious and time-consuming part done before you apply your own expert knowledge manually.


A boost, not a burden.

So, there it is. Maintaining concept maps doesn't just add ‘additional’ illustrative flavors to your terminology data. In fact it should be an integral part of your process as you look to maximize efficiency.

Want to read more on this? Have a look at Malcolm Chisholm's post where he discusses Types of Concept System: "Concepts do not exist as isolated units of knowledge but always in relation to each other." (ISO 704)

Get in touch with us here to learn more about maintaining concept maps and how they can revolutionize your workflow.


29 SEP 2021

Winning with LangOps

Winning with Language Operations (LangOps)

In a recent Forbes Technology article, council member Joao Graca states that Language Operations should be the new paradigm in globalization. He hits the nail on the head by saying that serving global markets is no longer about broadcasting translated content, but rather enabling businesses to communicate with stakeholders no matter what language they speak. LangOps is an enterprise function formed of cross-functional and multidisciplinary teams which efficiently operationalize the management of textual data. Neural machine translation (NMT) and multilingual knowledge management are indispensable tools to win, understand, and support global customers.

Release the Machine Translation Handbrake

NMT is approaching human parity for many domains and language pairs thanks to algorithmic progress, computing power, and the availability of data. Yet executives are still asking themselves why these breakthroughs have so far had only marginal effects on translation costs, lag, and quality.

The main reasons for this are a price model still based on translation memory (TM) match categories and the use of the timeworn formula IF Fuzzy < x% THEN MT. In addition, terminology – which is crucial for quality, process, and analytics – often leads a pitiful existence in Excel columns or sidelined term bases. While most focus on how to squeeze the last, rather meaningless drop of BLEU score out of the NMT black box, the real benefits will only be delivered by a LangOps strategy carried out by an automated workflow and reliable resource management.

Language Operations

LangOps is built on software that automates translation and language management. AI and Machine Learning have revolutionized the process, but for many tasks a rule-based approach is still superior. As always in engineering, it’s a question of piecing it together smartly and pragmatically. For example, while NMT is replacing segment-based translation memories, the cheapest and best method will always be the recycling of previously translated content. Terminology is baked into both NMT and TM, and thus is easily overlooked. LangOps, on the other hand, elevates terminology to multilingual knowledge. It is not only used for quality estimation and assurance, but also as the key metadata to drive processes. LangOps builds a multilingual data factory optimized for costs, time, and quality needs.

AI with Experts-in-the-Loop

LangOps will enable the building of scalable language factories...and will power a move towards cloud-based service levels.

The efficiency of LangOps needs to be complemented by the part of the process which involves humans. LangOps classifies linguistic assets, human resources, workflow rules, and projects in a unified system which is expandable, dynamic, and provides fallback paths. For example, the workflow knows who has carried out a similar project before, who has expertise in a particular domain, or how many hours an expert will typically need for a specific task. LangOps will enable the building of scalable language factories that leave the outdated price-per-word business model in the dust of transactional translations, and will power a move towards cloud-based service levels.

Cut Costs, then Drive the Top-Line

LangOps typically starts with translation because that’s where enterprises have created their linguistic assets. While cutting globalization costs is important, executives are more interested in how LangOps can drive growth.

Machine translation allows enterprises to communicate instantly with their customers. Terminology databases can be upgraded to multilingual knowledge systems (MKS), which allow companies to not only broadcast localized content to global customers, but also actually understand them when they talk back. An MKS not only enables e-Commerce players to deploy language-neutral product search, but is also a proven solution to make data repositories, systems, organizations, and even countries interoperable. It also crucially provides the unified semantics for the Internet of Things. All of these benefits boost LangOps, which owns the normalized enterprise knowledge and is the basis for many critical customer-facing activities such as customer support, chatbots, text analytics, spare part ordering, compliance, and sales.

Get in touch with us here to learn more about how LangOps can also grow your top-line.


21 SEP 2021

Building a Chatbot with Coreon

A chatbot informs or guides human users on a specific topic, but a machine can only ‘know’ what it is taught by humans. This means the chatbot must ‘know’ about the topic and – above all – how to relate this information to the request of the user. Ontologies are a very helpful data source for this because their purpose is to represent knowledge in context.

What is an Ontology?

An ontology is defined as the ‘shared and formal modelling of knowledge about a domain’ (IEC 62656-5:2017-06). It consists of classes (or concepts), relations, instances, and axioms (http://www.cs.man.ac.uk/~stevensr/onto/node3.html). Classes refer to a ‘set […] of entities or 'things' within a domain’. Relations represent the ‘interactions between concepts or a concept's properties’ and instances are ‘the 'things' represented by a concept’. Moreover, axioms are ‘used to constrain values for classes or instances’. This means that axioms define what values a class or instance can or cannot have. Axioms are used to represent additional knowledge that cannot be derived from the classes or instances themselves (e.g., ‘there can be no train connection between Europe and America’).

An ontology can be summarized as a knowledge base consisting of concepts, as well as relations between the concepts and additional information. Ontologies are made machine-readable through standardized ontology languages such as OWL (Web Ontology Language) or RDF. They make it possible for the knowledge represented in an ontology to be understood by machines and programs, such as chatbots.
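
As a small, hedged illustration of these building blocks, the following Python sketch uses the rdflib library to assemble a toy ontology with one class, one relation, and one instance; a comment stands in for the axiom, which OWL would express with restrictions. The example.org namespace and all names are invented for this illustration.

# Toy ontology with the building blocks named above: class, relation, instance.
# Namespace and names are hypothetical; axioms are only hinted at in a comment.
from rdflib import Graph, Literal, Namespace, OWL, RDF, RDFS

EX = Namespace("http://example.org/transport#")
g = Graph()
g.bind("ex", EX)

# Class: a set of 'things' within the domain
g.add((EX.TrainConnection, RDF.type, OWL.Class))

# Relation: an interaction between concepts or a concept's properties
g.add((EX.connects, RDF.type, OWL.ObjectProperty))
g.add((EX.connects, RDFS.domain, EX.TrainConnection))

# Instance: a concrete 'thing' represented by the class
g.add((EX.ViennaBrussels, RDF.type, EX.TrainConnection))
g.add((EX.ViennaBrussels, EX.connects, Literal("Vienna - Brussels")))

# An axiom would constrain allowed values, e.g. 'there can be no train
# connection between Europe and America'; in OWL this is modelled with
# restrictions, which are omitted in this sketch.
print(g.serialize(format="turtle"))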

The Role of Concept Maps

Users can access terminology more easily when they see a concept in context.

In the Coreon Multilingual Knowledge System, concept maps are built as part of the terminology work. This is a very helpful addition to terminology management because the relations between concepts are captured and displayed next to the concept information. Thanks to this feature, terminologists and experts can define concepts more precisely, as the relationships with and differences to neighbor concepts are crucial factors when settling on a definition. Furthermore, users can access the terminology more easily when they see a concept in context.

Concept maps in Coreon are the perfect base for an ontology, and this is an important advantage when re-using terminology for machine applications. The information stored in concept maps can be exported, analyzed, and used by various machine applications, including chatbots. For exports from Coreon, the proprietary coreon.xml format or the standardized ontology language RDF can be used.

Use Case: A Chatbot for Company Services

In our use case we created a prototype for a chatbot to represent the services of our company, berns language consulting (blc). Its purpose was to lead users on the company website from ‘utterances’ (i.e., questions or messages typed by users) to solutions. If, for example, a customer asks: ‘How do I get a fast translation into 10 languages?’, they are led to the company service ‘Machine Translation’. Not only do customers immediately learn the name of the service, but also additional information about the advantages and different aspects of machine translation. A call to action is also displayed, e.g., an offer to speak directly to a human expert.

We created the chatbot with the programming language Python. As a database we used the export of a concept map we had created in Coreon beforehand. By doing this, we were able to use the concept map as an ontology. In the concept map, we displayed the following concept classes:

  • company service, e.g., machine translation
  • solutions (part of the service, but also concrete solutions for customers’ problems), e.g., preprocessing
  • customer experience, e.g., translation too expensive
  • other concepts, e.g., MT engine

A concept map in Coreon, focusing on blc services and solutions

The aim of the chatbot is to lead customers from their ‘utterance’ to possible solutions and company services. For this, the chatbot extracts key words from the customer’s typed enquiry and maps them to the concepts in the concept map. It then follows the paths in the concept map until it reaches solutions and/or specific company services. The extracted solutions and services determine the answer of the chatbot. To enable the chatbot to understand as many utterances as possible, there should be a large number of concepts that are related to the range of customer services.
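
The following Python sketch shows this traversal in miniature: keywords from the utterance are mapped to concepts, and relations are followed until a company service is reached. The toy concept map, the service set, and the find_services function are hypothetical stand-ins for the exported Coreon data and the actual chatbot code.

# Toy illustration of utterance -> concepts -> services; all data is invented.
concept_map = {
    # concept                    -> related (broader / associated) concepts
    "translation too expensive": ["machine translation"],
    "many languages":            ["machine translation"],
    "preprocessing":             ["machine translation"],
    "machine translation":       [],   # a company service: traversal stops here
}
services = {"machine translation"}

def find_services(utterance: str) -> set:
    """Map utterance keywords to concepts and walk the graph up to services."""
    hits = {concept for concept in concept_map if concept in utterance.lower()}
    found, queue = set(), list(hits)
    while queue:
        concept = queue.pop()
        if concept in services:
            found.add(concept)
        else:
            queue.extend(concept_map.get(concept, []))
    return found

print(find_services("How do I get a fast translation into many languages?"))
# -> {'machine translation'}

In the real chatbot the graph comes from the coreon.xml or RDF export described above, and keyword extraction is handled by the NLU component rather than by simple substring matching.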

A Smarter, More Advanced Chatbot

The major advantage of using an ontology as a database for a chatbot is that it helps the machine to understand the relationships between concepts. Users’ utterances are easily analyzed and mapped to concepts in the ontology; once an entry concept is found, related concepts are retrieved and proposed to the user. Another crucial benefit is that a concept map is a controlled database. The administrator can decide which utterances lead to which solutions. Of course, building a concept map as a base for a chatbot entails some manual effort. However, automatic procedures can be included to speed up the terminological work.

A third big advantage is that the ontology is not limited to this particular use case. In theory it can be re-used in practically every use case where a machine is trying to ‘understand’ a human. Such scenarios include language assistance systems, text generators, classifiers, or intelligent search engines.

Do you have a good use case for starting an ontology, or would you like to start one but don’t know how? Do you need help building an ontology? Contact us, we are happy to help!

https://berns-language-consulting.de/language/en/terminology-ontology/

Jenny Seidel is responsible for terminology management and language quality at berns language consulting (blc). She helps customers set up terminology processes and implement terminology tools for specific use cases. Her recent focus has been the potential of ontologies as a base for Machine Learning.


18 MAY 2021

The Journey to a Multilingual SPARQL Endpoint

A SPARQL endpoint makes terminology data accessible

The idea of accessing a Multilingual Knowledge System through the means and methods of the Semantic Web brings two keywords immediately to mind: SPARQL and LOD (Linked Open Data). We’ve already talked about the benefits of a SPARQL endpoint and how it enables your enterprise to handle all its data via one, centralized hub in a previous post, but how did we actually achieve it?

The Coreon MKS is powered by a RESTful Web API, sending its data in JSON data structures over the wire. Everyone can develop extensions and custom solutions based on this. We ourselves did so recently, for instance, creating a plug-in for SDL Trados Studio so that linguists can directly access information stored in Coreon.

However, this required the developer of the plug-in to get familiar with the API and its data structures.

In the world of the Semantic Web (aka ‘the web of data’), we no longer see proprietary APIs. Developers and integrators instead access all the resources through the same method – SPARQL. Wouldn't it be great to also access the Coreon repositories via a SPARQL endpoint?

We will outline how we did it with Coreon, but the process is not only relevant for our own MK system – it could easily act as a blueprint or guideline for those working with similar tools.

JSON Structures to RDF Graph

The first step was to analyze how Coreon's data model could be mirrored in an RDF graph. What were the information objects? What were the predicates between them? What showed up as a literal?

In RDF, all elements or pieces of information you want to "talk about" are good candidates for becoming objects or, technically speaking, OWL classes. There were obvious candidates for classes, namely Concept or Term, but how about the concept relations such as “broader” or custom associative ones like “is-complementary-to”? How about descriptive information such as a Definition or Term Status value? Concretely, we had to go from the JSON data structure to an RDF graph model.

Before we dive in deeper, here’s a sample concept (with ID: 607ed17b318e0c181786b545) in Coreon that has two terms, English screen and German Bildschirm. Notice also the individual IDs of each of the terms – they will become important later on.

SPARQL Endpoint: Coreon concept example with concept and term IDs shown

In the original JSON data structure, this concept is represented as follows (only relevant code lines shown):

{
    "created_at": "2021-04-20T13:04:59.816Z",
    "updated_at": "2021-04-20T13:05:25.856Z",
    "terms": [
        {
            "lang": "en",
            "value": "screen",
            "created_at": "2021-04-20T13:04:59.816Z",
            "updated_at": "2021-04-20T13:04:59.816Z",
            "id": "607ed17b318e0c181786b549",
            "concept_id": "607ed17b318e0c181786b545",
            "properties": [],
            "created_by": {
                "email": "michael.wetzel@coreon.com",
                "name": "Michael Wetzel"
            },
            "updated_by": {
                "email": "michael.wetzel@coreon.com",
                "name": "Michael Wetzel"
            }
        },
        {
            "lang": "de",
            "value": "Bildschirm",
            "created_at": "2021-04-20T13:05:25.856Z",
            "updated_at": "2021-04-20T13:05:25.856Z",
            "id": "607ed195318e0c181786b55e",
            "concept_id": "607ed17b318e0c181786b545",
            "properties": [],
            "created_by": {
                "email": "michael.wetzel@coreon.com",
                "name": "Michael Wetzel"
            },
            "updated_by": {
                "email": "michael.wetzel@coreon.com",
                "name": "Michael Wetzel"
            }
        }
    ],
    "id": "607ed17b318e0c181786b545",
    "coreon_type": null,
    "alias": false,
    "referent_id": null,
    "branch_root": false,
    "properties": [],
    "parent_relations": [
        {
            "concept_id": "606336dab4dbcf018ed99308",
            "type": "SUPERCONCEPT_OF"
        }
    ],
    "child_relations": []
}

When we transform this into an RDF graph, the concept and its two terms are bound together in statements (so-called triples), each consisting of a subject, a predicate and an object. The concept will act as the subject, the term(s) act as the object(s), and the required predicate could be named in this case: hasTerm. This gives us the following triple:

coreon:607ed17b318e0c181786b545 coreon:hasTerm coreon:607ed17b318e0c181786b549 .

The triple shows that the resource with the ID 607ed17b318e0c181786b545 contains a term, and the term’s ID is 607ed17b318e0c181786b549. It doesn't yet say anything about the value or the language of the term. It simply states that the term with the given ID is a member of that concept.

The next triple shows that the resource with the ID 607ed17b318e0c181786b549 has a literal value in English, namely the string screen:

coreon:607ed17b318e0c181786b549 coreon:value “screen”@en .

Such a set of triples, i.e. many atomic statements bound together via predicates, makes up the RDF graph. If we visualize some of these triples, the resulting RDF graph looks like this:

Representing concepts and terms as an RDF graph

Concepts and terms are classes (in green and blue), predicates are graph edges (above the lines).

The complete set of triples would be serialized as follows in RDF / Turtle:

@prefix coreon: <http://www.coreon.com/coreon-rdf#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<http://www.coreon.com/coreon-instance> a owl:Ontology;
  owl:imports <http://www.coreon.com/coreon-rdf>;
  owl:versionInfo "Created through Coreon export" .
coreon:607ed17b318e0c181786b547 a coreon:Edge;
  coreon:edgeSource coreon:606336dab4dbcf018ed99308;
  coreon:edgeTarget coreon:607ed17b318e0c181786b545;
  coreon:type "SUPERCONCEPT_OF" .
coreon:606336dab4dbcf018ed99307 a coreon:Term;
  coreon:value "peripheral device"@en .
coreon:606336dab4dbcf018ed99308 a coreon:Concept;
  coreon:hasTerm coreon:606336dab4dbcf018ed99307 .
coreon:607ed17b318e0c181786b545 a coreon:Concept;
  coreon:hasTerm coreon:607ed195318e0c181786b55e,
    coreon:607ed17b318e0c181786b549 .
coreon:607ed17b318e0c181786b549 a coreon:Term;
  coreon:value "screen"@en .
coreon:607ed195318e0c181786b55e a coreon:Term;
  coreon:value "Bildschirm"@de .

You may also have noticed in the above syntax that two or more statements are serialized together – separated by semicolons and ending with a full stop. The statements for coreon:606336dab4dbcf018ed99308, for example, indicate that this resource is of OWL class coreon:Concept and that it contains a term which has the ID 606336dab4dbcf018ed99307.
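
Once the data is serialized like this, any RDF tooling can consume it. As a minimal sketch, the following Python snippet uses the rdflib library to parse such a Turtle export and list every concept together with its term values; the filename export.ttl is a hypothetical placeholder for wherever the export was saved.

# Parse the Turtle export and walk coreon:hasTerm to list concept/term pairs.
# The filename is a placeholder; adjust it to your own export.
from rdflib import Graph, Namespace

COREON = Namespace("http://www.coreon.com/coreon-rdf#")

g = Graph()
g.parse("export.ttl", format="turtle")

for concept, term in g.subject_objects(COREON.hasTerm):
    for value in g.objects(term, COREON.value):
        print(f"{concept}  ->  {value} ({value.language})")
# e.g. ...#607ed17b318e0c181786b545  ->  screen (en)
#      ...#607ed17b318e0c181786b545  ->  Bildschirm (de)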

No RDF without URIs

Now all the pieces of information are bound together via RDF statements: the triples. They have a pretty atomic, isolated nature. This is quite different to how XML and other standard formats organize information. In RDF and LOD all data is stored in this atomic manner, uniquely identifiable through the URI.

Via the URIs and predicates such as hasTerm, the resources are bound together. Only then does it become meaningful for an application or a human, as the URIs are an indispensable prerequisite. All information elements that are represented as classes have unique identifiers. The namespace coreon:, together with the unique IDs, unambiguously identifies a given resource. This is regardless of whether it is a concept, term, property, or even a concept relation. Fortunately we stored all data with URIs when we created the fundamental design of Coreon. Phew.

Build the Coreon RDF Vocabulary

After researching the basic approach described above, we analyzed all elements of the Coreon data structure and rethought them as a member of our RDF vocabulary. The following table lists the most important ones:

           | OWL Type               | Coreon RDF Vocabulary
Classes    | owl:Class              | coreon:Admin, coreon:Edge, coreon:Concept, coreon:Flagset, coreon:Property, coreon:Term
Predicates | owl:ObjectProperty     | coreon:hasAdmin, coreon:hasFlagset, coreon:hasProperty, coreon:hasTerm
Values     | owl:AnnotationProperty | coreon:edgeSource, coreon:edgeTarget, coreon:id, coreon:name, coreon:type, coreon:value

For the predicates we also specified what kind of information can be used, defining rdfs:range and rdfs:domain. For instance, the predicate hasTerm only accepts resources of type coreon:Concept as a subject (rdfs:domain). The full specification of the predicate hasTerm looks as follows:

coreon:hasTerm
  rdf:type owl:ObjectProperty ;
  rdfs:comment "makes a term member of a concept" ;
  rdfs:domain coreon:Concept ;
  rdfs:label "has term" ;
  rdfs:range coreon:Term .

Publish as an Offline Resource

Once our RDF vocabulary was ready, the first step to implement it into Coreon was to add an RDF publication mechanism to the export engine. Equipped with this, Coreon can now export its repositories in RDF, including various syntaxes (Turtle, N3, JSON-LD and more).

Real-Time Access via a SPARQL Endpoint

The final yet most complicated step was to equip the Coreon cloud service with a real-time accessible SPARQL endpoint. We chose Apache Fuseki. It runs as a secondary index in parallel to a repository's data store, updated in real-time. Thus any change a data maintainer makes is immediately accessible via the SPARQL endpoint!

Let me illustrate the ease and power of SPARQL with some examples...

Example 1: Query All the Definitions via SPARQL:

SELECT *
   WHERE {
       ?p rdf:type coreon:Property .
       ?p coreon:key "Definition" .
       ?p coreon:value ?v.
    }

We are querying all the objects that are of type coreon:Property (line 3) that also have a key with the name Definition (line 4). This is bound against the result variable p, and then for all these we retrieve the values, which are bound against the variable v.

A typical result table (here from a repository dealing with wine grape varieties) looks as follows:

p                                                         | v
http://www.coreon.com/coreon-rdf#5f9ee3609323c01c4728b8a1 | Chardonnay is the most famous and most elevated grape in the region of Northern Burgundy in ...
http://www.coreon.com/coreon-rdf#5f9ee3609323c01c4728b8ab | A white grape variety which originated in the Rhine region. Riesling is an aromatic grape variety ...
http://www.coreon.com/coreon-rdf#5f9ee3609323c01c4728b8cd | Pinot noir (French: [pino nwaʁ]) is a red wine grape variety of the species Vitis vinifera. The ...
...                                                       | ...

The first column, representing the variable p, holds the URI of the property; the second column holds the literal value.
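
To show what querying the endpoint looks like from client code, here is a minimal Python sketch using the SPARQLWrapper library to run the definitions query above. The endpoint URL is a hypothetical placeholder for your repository's SPARQL endpoint, and the prefixes are declared explicitly in case the endpoint does not pre-configure them.

# Run the 'all definitions' query against a SPARQL endpoint from Python.
# The endpoint URL below is a placeholder, not a real Coreon address.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://example.coreon.com/sparql"   # hypothetical endpoint URL

sparql = SPARQLWrapper(ENDPOINT)
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX coreon: <http://www.coreon.com/coreon-rdf#>
    SELECT ?p ?v
    WHERE {
        ?p rdf:type coreon:Property .
        ?p coreon:key "Definition" .
        ?p coreon:value ?v .
    }
""")

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["p"]["value"], "->", row["v"]["value"][:60], "...")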

Example 2: Query all terms and - if present - also print their usage value:

A more realistic query compared to Example 1: get me all the terms and if they have a usage flag, such as preferred, print it, too.

SELECT ?t ?termvalue ?usagevalue
    WHERE {
        ?t rdf:type coreon:Term .
        ?t coreon:value ?termvalue .
        OPTIONAL {
            ?t coreon:hasProperty ?p .
            ?p coreon:key "Usage" .
            ?p coreon:value ?usagevalue .
        }
    }

A typical result might look as follows:

t                                                         | termvalue          | usagevalue
http://www.coreon.com/coreon-rdf#5f9ee3609323c01c4728b8aa | Riesling           |
http://www.coreon.com/coreon-rdf#5f9ee3609323c01c4728b8bb | Cabernet Sauvignon | Preferred
http://www.coreon.com/coreon-rdf#5f9ee3609323c01c4728b8be | CS                 | Alternative
http://www.coreon.com/coreon-rdf#5f9ee3609323c01c4728b8c2 | Merlot             |
...

This output table shows the term's URI, then its value, and - if available - the usage recommendation.

Example 3: How many Definitions are in your repository?

One last example to share...do you know how many Definitions or Comments are in your repository, or which are the most used properties? Well, how about this...

SELECT ?k (COUNT(?k) AS ?count)
{
	?uri coreon:key ?k.
}
GROUP BY ?k
ORDER BY DESC(?count)

...which delivers a table looking like this...nice!

k              | count
concept status | 13806
usage status   | 10532
part of speech | 10408
term type      | 10353
definition     | 5996

European Language Grid and Outlook

We thank the European Language Grid (ELG) for funding substantial parts of this development. It is a significant step and showcases how to complement software for multilingual knowledge with an open SPARQL / LOD access mechanism. The SPARQL endpoint is available to all Coreon customers. A selected set of demo repositories will also be accessible with the SPARQL endpoint through the ELG hub by summer 2021.

We are sharing our experiences with the ISO/TC 37/SC 3 working groups as input for a draft technical recommendation on how to represent TBX (TermBase eXchange) as RDF. Many of our findings on this journey towards a SPARQL endpoint can serve as a base for an international standard.


25 MAR 2021

Multilingual Knowledge for the Data-Centric Enterprise

A SPARQL Endpoint allows the potential of enterprise terminology data to be unlocked.

Knowledge graphs are becoming a key resource for global enterprises. The textual labels of a graph’s nodes form a standardized vocabulary. Unfortunately, knowledge solutions are often wastefully developed in parallel within the same organization, be it in different departments or national branches. Starting from zero, domain experts build application-specific vocabularies in a hard-to-use taxonomy or thesaurus editor, mostly only in one language. Yet enterprise terminology databases in many languages are almost always already available. Enterprises and organizations simply have to realize that they can use this treasure chest of data for more than just documentation and translation processes.

Mutual understanding: RDF and SPARQL

The problem is that legacy terminology solutions have proprietary APIs or special XML export formats. They also do not structure their concepts in a knowledge graph, which makes it hard to use enterprise terminology data for anything more than translation. Taxonomies, thesauri, or ontology products, on the other hand, don’t cater for cross-language use and thus remain local. Multilingual Knowledge Systems such as Coreon bridge this gap, but until now even this also required integration through proprietary interfaces, or the exporting of data.

Multilingual knowledge unlocks true intelligence for the international enterprise
(App-Centric vs Data-Centric by cinchy).

SPARQL (the recursive acronym for SPARQL Protocol and RDF Query Language) makes it possible to query knowledge systems without having to study their APIs or export formats. Coreon was therefore recently equipped with a SPARQL endpoint. Its knowledge repositories can now be queried in real time using the SPARQL syntax, i.e. a universal language for developers of data-centric applications.

Enterprises and organizations simply have to realize that they can use this treasure chest of data for more than just documentation and translation processes.

Semantic success

A central Multilingual Knowledge System, exposing its data via a SPARQL endpoint, thus becomes a common knowledge infrastructure for any textual enterprise application. This is regardless of department, country, or use case. For example: content management, chatbots, customer support, spare part ordering, or compliance can all be built on the same, normalized enterprise knowledge. In taking proprietary APIs out of the equation and with no need to export, mirror, and deploy into separate triple stores, real time access of live data is guaranteed.

Your organization already possesses this data. It’s just a case of maximizing its potential, introducing a cleaner and more accessible way of handling it. Contact us at info@coreon.com if you’d like to know more about how a common knowledge infrastructure can help your enterprise.

Coreon would like to extend special thanks to the European Language Grid, which funded significant parts of this R & D effort. The SPARQL endpoint will also be deployed into the ELG hub, so it will be reachable and discoverable from there.


11 JAN 2021

Keeping Your Sanity with Machine Taxonomization

Machine taxonomization

Taxonomies are crucial for businesses and institutions that need to handle large amounts of data. Manual taxonomization of thousands of concepts into a knowledge tree has so far been the only way to do this. Aside from the fact that this task can be quite tedious, it requires in-demand subject matter experts to complete. Thus, it is often considered too expensive or too much effort. A shame, given that companies then miss out on all the benefits of using taxonomies.

With a little help from your (AI) friend

Imagine a chaotic pile of books (of course, the less-organized among us may not have to imagine this) being automatically sorted into shelves, branches, and sub-branches, together with an index to help quickly find a desired book. This describes what our semi‑automatic taxonomization method can do. An initial knowledge tree is produced by Machine Learning (ML), using language models stored in huge neural networks. Clustering algorithms on top of word embeddings automatically convert a haystack of concepts into a structured tree. The final curation of the taxonomy is still carried out by a human, but the most time-consuming and tedious aspects of the task have already been dealt with, and in a consistent way.
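
As a rough, hedged sketch of the clustering idea in Python with scikit-learn: character n-gram TF-IDF vectors stand in here for the neural embeddings of the real pipeline, and the concept list is a tiny invented example rather than the actual COVID-19 data set.

# Toy auto-taxonomization: vectorize concept labels and cluster them into
# draft branches. TF-IDF stands in for neural embeddings; data is invented.
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

concepts = [
    "mRNA vaccine", "vector vaccine", "booster shot",
    "social distancing", "face mask", "quarantine",
    "antiviral drug", "monoclonal antibody",
]

vectors = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)).fit_transform(concepts)
labels = AgglomerativeClustering(n_clusters=3).fit_predict(vectors.toarray())

clusters = {}
for concept, label in zip(concepts, labels):
    clusters.setdefault(label, []).append(concept)

# Each cluster is a draft branch of the taxonomy, ready for human curation.
for label, members in sorted(clusters.items()):
    print(label, members)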

‘Cobot’ versus manual

In a study, we benchmarked this collaborative robot approach (ML auto‑taxonomization and human curation) against the manual job done by an expert linguist. Below are the data and task flows of the two approaches:

We aimed to taxonomize 424 concepts related to COVID-19. The traditional manual method was tedious and tiring for the human expert, who took a flat list of concepts and turned them into a systematic knowledge graph by working concept by concept to get everything in its right place. Wading through the list from scratch (including constantly switching contexts – from drugs, to vaccines, to social distancing, for example) made progress on the task difficult to measure. Having no perception of how many clusters of concepts still needed to be created was demotivating.

In contrast, our semi-automatic method started off with a tree of 55 suggested clusters of leaf concepts, each representing a specific context. Of course, ML doesn’t always produce the exact results a human expert would (we hear you, AI skeptics!), so some algorithm-suggested clusters were a bit off. However, the majority of the 55 were pretty accurate. They were ready to be worked on in Coreon’s visual UI, making the human curation task much faster and easier. This also enabled progress to be measured, as the job was done cluster by cluster.

By dramatically lowering the effort, time, and money needed to create taxonomies, managing textual data will become much easier and AI applications will see a tremendous boost.

Advantage, automation!

From a business perspective the most important result was that the semi‑automatic method was five(!) times faster. The structured head-start enabled the human curator to work methodically through the concepts. The clustered nature of the ML‑suggested taxonomy would also allow the workload to be distributed – e.g., one expert could focus on one medicine, another on public health measures.

More difficult to measure (but nicely visible below) was the quality of the two resulting taxonomies. While our linguist did a sterling job working manually, the automatic approach produced a tidier taxonomy which is easier for humans to explore and can be effectively consumed by machines for classification, search, or text analytics. Significantly, as the original data was multilingual, the taxonomy can also be leveraged in all languages.

A barrier removed

So, can we auto-taxonomize a list of semantic concepts? The answer is yes, with some human help. The hybrid approach frees knowledge workers from the tedious work in the taxonomization process and offers immediate benefits – being able to navigate swiftly through data, and efficient conceptualization.

Most importantly, though, linking concepts in a knowledge graph enables machines to consume enterprise data. By dramatically lowering the effort, time, and money needed to create taxonomies, managing textual data will become much easier and AI applications will see a tremendous boost.

If you’d like to discover more about our technology and services on auto-taxonomization, feel free to get in touch with us here.


9 DEC 2020

Making Translation GDPR-Compliant

GDPR compliant with Anonymization

Current processes violate GDPR

Out of the six data protection principles, translation regularly violates at least four: purpose limitation, data minimization, storage limitation, and confidentiality. The last of these is mentioned in most purchase orders, yet it is hard to live up to in an industry which squeezes every last cent out of a long supply chain.

Spicier still is the fact that translators don’t need to know any personal data – like who made the payment and how much money was transferred, as in the sample below – in order to translate a text. Anonymized source texts would address purpose limitation and data minimization. The biggest offenders, however, are the industry’s workhorses: neural machine translation (NMT) and translation memory (TM). NMT is trained on, and TM stores, texts full of personal data without any means of deleting it, even though it was unnecessary to store the protected data in the first place.

A GDPR-compliant translation workflow 

Some might argue that this difficult problem cannot be fixed. Well, it can. Better still, our anonymization workflow saves money and increases quality and process safety, too.

On a secure server, ‘named entities’ (i.e. likely protected data) are recognized. This step, Named Entity Recognition (NER), is a standard discipline of Natural Language Processing. There are several anonymizers on the market, mainly supporting structured data and English, but they only offer a one-way process.

In our solution, the data is actually “pseudonymized” in both the source and target languages. This keeps the anonymized data readable for linguists by replacing protected data with another string of the same type. Once translated, the text is de-anonymized by replacing the pseudonyms with the original data. This step is tricky since the data also needs to be localized, as in our example with the title and the decimal and thousands separators. The TMs used along the supply chain will only store the anonymized data. Likewise, NMT is not trained with any personal data. 
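As a rough sketch of that round trip, the snippet below uses spaCy’s off-the-shelf English NER as a stand-in for the secure-server NER step; the pseudonym pools, entity types, and sample sentence are illustrative assumptions, not the production workflow.

```python
# Illustrative pseudonymization round trip; not a production implementation.
# Requires spaCy and the en_core_web_sm model (python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")

# Replacement strings of the same type keep the text readable for linguists.
PSEUDONYMS = {
    "PERSON": ["Jane Doe", "John Miller"],
    "MONEY": ["1,234.56 euros", "9,876.00 euros"],
    "ORG": ["Acme Bank", "Example Corp"],
}

def pseudonymize(text):
    doc = nlp(text)
    mapping, counters, parts, last = {}, {}, [], 0
    for ent in doc.ents:
        if ent.label_ not in PSEUDONYMS:
            continue
        pool = PSEUDONYMS[ent.label_]
        i = counters.get(ent.label_, 0)
        counters[ent.label_] = i + 1
        pseudo = pool[i % len(pool)]   # a real system must guarantee unique pseudonyms
        mapping[pseudo] = ent.text     # kept on the secure server for de-anonymization
        parts.append(text[last:ent.start_char])
        parts.append(pseudo)
        last = ent.end_char
    parts.append(text[last:])
    return "".join(parts), mapping

def deanonymize(translated, mapping):
    # Localizing the restored values (titles, decimal/thousands separators)
    # would also happen here; omitted in this sketch.
    for pseudo, original in mapping.items():
        translated = translated.replace(pseudo, original)
    return translated

masked, mapping = pseudonymize("Mr. Smith transferred 2,500.00 dollars to Contoso Ltd.")
print(masked)                          # what translators, TMs, and NMT actually see
print(deanonymize(masked, mapping))    # restored after translation
```

The TMs and NMT engines along the supply chain only ever see the masked text, while the mapping never leaves the secure server.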

Know-how

We recently ran a feasibility study to test this approach. Academia considers NER a solved problem, but in reality it is only reasonably solved for English. Luckily, language models can now be trained to work across languages. Rule-based approaches, such as regular expressions, add deterministic process safety. For our study we extended the current standard translation formats, TMX and XLIFF, to support pseudonymization. De-anonymization is hard, but I had already developed its basics for the first versions of TRADOS.
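For the rule-based side, two deliberately simplified regular expressions (illustrative only, not the study’s actual rule set) show how well-structured protected data such as IBANs and monetary amounts can be caught deterministically, independent of any statistical model.

```python
# Illustrative deterministic rules complementing statistical NER.
import re

RULES = {
    # Simplified patterns; real rules would be stricter and country-aware.
    "IBAN": re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{2,4}){3,8}\b"),
    "AMOUNT": re.compile(r"\b\d{1,3}(?:[.,]\d{3})*[.,]\d{2}\b"),
}

text = "Please transfer 12,345.67 to DE44 5001 0517 5407 3249 31 by Friday."
for entity_type, pattern in RULES.items():
    for match in pattern.finditer(text):
        print(entity_type, "->", match.group())
```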

What remains is the trade-off between data protection and translatability. The more text is anonymized, the better the leverage – but the harder the text becomes for humans to understand, too. Getting that balance right will still require some testing, best practices, and good UI design. For example, project managers will want a finer granularity on named entities than NER tools normally provide. Using a multilingual knowledge system like Coreon, they could specify that all entities of type Committee are to be pseudonymized, but not entities of type Treaty.
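As a toy illustration of that type-level granularity (the type names and the helper function are hypothetical, not Coreon’s API), a project manager’s policy could be as simple as a lookup that defaults to protecting anything unknown:

```python
# Hypothetical type-level pseudonymization policy.
POLICY = {
    "Committee": True,   # committee names get pseudonymized
    "Treaty": False,     # treaty names stay readable for the translator
    "Person": True,
}

def is_protected(entity_type: str) -> bool:
    # Err on the side of protection for types the policy does not list.
    return POLICY.get(entity_type, True)

for t in ("Committee", "Treaty", "Person", "Regulation"):
    print(t, "->", "pseudonymize" if is_protected(t) else "keep")
```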

Anonymization is mandatory 

As shown above, a GDPR-compliant translation workflow is possible, and it is thus legally mandatory. This is, in fact, good news. Regulations are often perceived as making life harder for businesses, but GDPR has actually created a sector in which the EU is a world leader. Our workflow enables highly regulated industries, such as Life Sciences or Finance, to safely outsource translation. Service providers won’t have to sweat over confidentiality breaches. The workflow will increase quality, as named entities are processed by machines in a secure and consistent way and machine translation has fewer opportunities to make stupid mistakes. It will also save a lot of money, since translation memories will deliver much higher leverage.

If you want to know more, please contact us.

12 DEC 2018

Sunsetting CAT

Neural Machine Translation is making CAT Tools obsolete.

For decades, Computer Assisted Translation (CAT), based on sentence translation memories, has been the standard tool for going global. Although CAT tools were originally designed with a mid-90s PC in mind, and despite proposals to change the underlying data model, the basic architecture of CAT has remained unchanged. The dramatic advances in Neural Machine Translation (NMT) have now made the whole product category obsolete.

NMT Crossing the Rubicon

Neural networks, stacked deeply enough, do understand us sufficiently to create a well-formed translation.

When selling translation memory, I always said that machines would only be able to translate once they understand text; and if one day they could, MT would be a mere footnote of a totally different revolution. Now it turns out that neural networks, stacked deeply enough, do understand us sufficiently to create a well-formed translation. Over the last two years NMT has progressed dramatically. It has now achieved “human parity” for important language pairs and domains. That changes everything.

Industry Getting it Wrong

Most players in the $50b translation industry – service providers, but also their customers – think that NMT is just another source for a translation proposal. In order to preserve their established way of delivery, they pitch the concept of “augmented translation”. However, if the machine translation is as good (or bad!) as human translation, who would you have revise it, another translator or a subject matter expert?

Yes, the expert who knows what the text is about. The workflow is thus changing to automatic translation and expert revision. Translation becomes faster, cheaper, and better!

Different Actors, Different Tools

A revision UI will have to look very different from CAT tools. The most dramatic change is that it has to be extremely simple. To support the current model of augmented translation, CAT tools have become very powerful. However, their complexity can only be handled by a highly sought-after group of perhaps a few thousand professional translators worldwide.

For the new workflow, a product design is required that can support millions of (mostly occasional) expert revisers. Also, the revisers need to be pointed to the sentences which need revision. This requires multilingual knowledge.

Disruption Powered by Coreon

Coreon can answer the two key questions for using NMT in a professional translation workflow: 1) which parts of the translated text are not fit for purpose, and 2) why aren’t they? To do so, the multilingual knowledge system classifies linguistic assets, human resources, QA, and projects in a unified system which is expandable, dynamic, and provides fallback paths. Coreon is a key component for LangOps. In the future, linguists will engineer localization workflows such as Semiox and create multilingual knowledge in Coreon. “Doing words” is left to NMT.

12 DEC 2018

Why Machine Learning Still Needs Humans for Language

Despite recent advances, Machine Learning needs humans!

Outperforming Humans

Machine Learning (ML) has begun to outperform humans in many tasks which seemingly require intelligence. The hype about ML regularly makes it into the mass media: it can now read lips, recognize faces, and transform speech to text. Yet when it comes to dealing with the ambiguity, variety, and richness of language, or to understanding text and extracting knowledge, ML continues to need human experts.

Knowledge is Stored as Text

The web is certainly our greatest knowledge source. However, it has been designed for consumption by humans, not machines. The web’s knowledge is mostly stored in text and spoken language, enriched with images and video. It is not a structured relational database storing numeric data in machine processable form.

Text is Multilingual

The web is also very multilingual. Recent statistics surprisingly show that only 27% of the web’s content is in English, and only 21% in the next 5 most used languages. That means more than half of its knowledge is expressed in a long tail of other languages.

Constraints of Machine Learning

ML faces some serious challenges. Even with today’s availability of hardware, the demand for computing power can become astronomical when input and desired output are rather fuzzy (see the great NYT article, "The Great A.I. Awakening").

ML is great for 80/20 problems, but it is dangerous in contexts with high accuracy needs: “Digital assistants on personal smartphones can get away with mistakes, but for some business applications the tolerance for error is close to zero,” emphasizes Nikita Ivanov from Datalingvo, a Silicon Valley startup.

ML performs well on n-to-1 questions. In facial recognition, for instance, there is only one correct answer to the question “which person do all these pixels show?” However, ML struggles with n-to-many or gradual problems: there are many ways to translate a text correctly or to express a certain piece of knowledge.

ML is only as good as its available relevant training material. For many tasks, mountains of data are needed, and that data should be of supreme quality. For language-related tasks these mountains of data are often required per language and per domain. Furthermore, it is hard to decide when the machine has learned enough.

Monolingual ML Good Enough?

Some would suggest we should just process everything in English. ML also does an 'OK' job at machine translation (Google Translate, for example). So why not translate everything into English and then simply run our ML algorithms? This is a very dangerous approach, since errors compound. If the output of an 80% accurate machine translation becomes the input to an 80% accurate sentiment analysis, the overall accuracy drops to 64%. At that hit rate you are getting close to flipping a coin.
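The arithmetic behind that claim is simple, assuming the errors of the chained components are independent; a few lines make it explicit:

```python
# Accuracies of chained components multiply (assuming independent errors).
def pipeline_accuracy(*stage_accuracies):
    result = 1.0
    for accuracy in stage_accuracies:
        result *= accuracy
    return result

print(pipeline_accuracy(0.80, 0.80))        # 0.64: MT feeding sentiment analysis
print(pipeline_accuracy(0.80, 0.80, 0.80))  # ~0.51: a third stage is near coin-flip
```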

Human Knowledge to Help

The world is innovating constantly. Every day new products and services are created. To talk about them, we continuously craft new words: the bumpon, the ribbon, a plug-in hybrid, TTIP – only with the innovative force of language can we communicate new things.

A Struggle with Rare Words

By definition, new words are rare. They first appear in one language and then may slowly propagate into other domains or languages. There is no knowledge without these rare words, the terms. Look at a typical product catalog description with the terms highlighted. Now imagine this description without the terms – it would be nothing but a meaningless scaffold of fill-words.

Knowledge Training Required

At university we acquire the specific language and terminology of the field we are studying. We become experts in that domain. Even so, when we later change jobs during our professional career, we still have to acquire the lingo of the new company: names of products, modules, and services, but also job roles and their titles, names for departments, processes, etc. We get familiar with a specific corporate language by attending training and by reading policies, specifications, and functional descriptions. Machines need to be trained in the very same way, with that explicit knowledge and language.

Multilingual Knowledge Systems Boost ML with Knowledge

There is a remedy: terminology databases, enterprise vocabularies, word lists, glossaries – organizations usually already own an inventory of “their” words. This invaluable data can be leveraged to boost ML with human knowledge by transforming these inventories into a Multilingual Knowledge System (MKS). An MKS captures not only all words in all registers in all languages, but also structures them into a knowledge graph (a 'convertible' IS-A 'car' IS-A 'vehicle'; a 'front fork' IS-PART-OF a 'frame' IS-PART-OF a 'bicycle').
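As a minimal sketch of what such a graph looks like as linked data (assuming SKOS for the IS-A hierarchy and an invented example namespace; this is illustrative, not Coreon’s internal data model):

```python
# Illustrative MKS fragment as SKOS triples; requires rdflib.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

EX = Namespace("http://example.org/mks/")
g = Graph()
g.bind("skos", SKOS)
g.bind("ex", EX)

def concept(local_name, labels):
    """Create a SKOS concept with one preferred label per language."""
    c = EX[local_name]
    g.add((c, RDF.type, SKOS.Concept))
    for lang, label in labels.items():
        g.add((c, SKOS.prefLabel, Literal(label, lang=lang)))
    return c

vehicle = concept("vehicle", {"en": "vehicle", "de": "Fahrzeug"})
car = concept("car", {"en": "car", "de": "Auto"})
convertible = concept("convertible", {"en": "convertible", "de": "Cabrio"})

# IS-A links: the narrower concept points to its broader parent.
g.add((car, SKOS.broader, vehicle))
g.add((convertible, SKOS.broader, car))

print(g.serialize(format="turtle"))
```

A part-whole hierarchy (the front fork and frame example) would need an additional property, since SKOS itself only models broader/narrower and related links.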

It is the human-curated Multilingual Knowledge System that enables Machine Learning and Artificial Intelligence solutions to work for specific domains with only small amounts of textual data, including for less-resourced languages.
