News

03 JUL 2020

Coreon MKS as LLOD Is European Language Grid's Top-Funded Project

Coreon's proposal for using the European Language Grid (ELG) as a platform for making multilingual interoperability assets discoverable and retrievable has been awarded funding. This will be achieved by complementing Multilingual Knowledge Systems with a SPARQL interface. The ELG Open Call 1 received 121 proposals, of which 110 were eligible and 10 were selected. Coreon's proposal “MKS as Linguistic Linked Open Data” was amongst the three winning proposals from industry and received the highest funding.

The goals of the project are a) to enable Semantic Web systems to query Coreon's richly elaborated multilingual terminologies stored in concept systems and knowledge graphs and b) to prove how to overcome the limits of RDF/knowledge graph editors, which are usually fine for modelling concept relations but weak at capturing linguistic information. When deployed on the ELG in March 2021, the innovation will enable the Semantic Web community to query rich multilingual data with a familiar, industry-standard syntax.
07 NOV 2019

CEFAT4Cities Action Gets Funding

The CEFAT4Cities Action, to be executed by a multinational consortium of five partners led by CrossLang, has received funding. The action starts in April 2020 and runs until March 2022.
The main objective of the CEFAT4Cities Action is to develop a “Smart cities natural language context”, providing multilingual interoperability of the Context Broker DSI and making public “smart city” services multilingual, with pilots in Vienna and Brussels.
The language resources that will be created will be committed to the ELRC repository and the following languages will be developed: Dutch, English, French, German, Italian, Slovenian, Croatian and Norwegian.

Coreon's role in the consortium is to provide the appropriate technology: to turn vocabularies into multilingual knowledge graphs, and to curate and extend them to model the domain of smart cities.
25 MAR 2021

Multilingual Knowledge for the Data-Centric Enterprise

Knowledge graphs are becoming a key resource for global enterprises. The textual labels of a graph’s nodes form a standardized vocabulary. Unfortunately, knowledge solutions are often wastefully developed in parallel within the same organization, be it in different departments or national branches. Starting from zero, domain experts build application-specific vocabularies in a hard-to-use taxonomy or thesaurus editor, mostly only in one language. Yet enterprise terminology databases in many languages are almost always already available. Enterprises and organizations simply have to realize that they can use this treasure chest of data for more than just documentation and translation processes.

Mutual understanding: RDF and SPARQL

The problem is that legacy terminology solutions have proprietary APIs or special XML export formats. They also do not structure their concepts in a knowledge graph, which makes it hard to use them for more than translation. Taxonomies, thesauri, or ontology products, on the other hand, don't cater for cross-language use and thus remain local. Multilingual Knowledge Systems such as Coreon bridge this gap, but until now even this required integration through proprietary interfaces or the exporting of data.

Multilingual knowledge unlocks true intelligence for the international enterprise
(App-Centric vs Data-Centric by cinchy).

SPARQL (the recursive acronym for SPARQL Protocol and RDF Query Language) makes it possible to query knowledge systems without having to study their APIs or export formats. Coreon was therefore recently equipped with a SPARQL endpoint. Its knowledge repositories can now be queried in real time using the SPARQL syntax, i.e. a universal language for developers of data-centric applications.
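
To make this concrete, here is a minimal sketch of how such an endpoint could be queried from Python with the SPARQLWrapper library. The endpoint URL and the SKOS-style modelling of concepts and labels are illustrative assumptions, not Coreon's documented schema:

    # Minimal sketch: ask a SPARQL endpoint for German labels of concepts.
    # The URL and the SKOS vocabulary below are illustrative assumptions.
    from SPARQLWrapper import SPARQLWrapper, JSON

    endpoint = SPARQLWrapper("https://example.org/coreon/sparql")  # hypothetical URL
    endpoint.setQuery("""
        PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
        SELECT ?concept ?label WHERE {
            ?concept skos:prefLabel ?label .
            FILTER (lang(?label) = "de")
        } LIMIT 10
    """)
    endpoint.setReturnFormat(JSON)
    results = endpoint.query().convert()

    for row in results["results"]["bindings"]:
        print(row["concept"]["value"], row["label"]["value"])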

Semantic success

A central Multilingual Knowledge System, exposing its data via a SPARQL endpoint, thus becomes a common knowledge infrastructure for any textual enterprise application, regardless of department, country, or use case. For example: content management, chatbots, customer support, spare part ordering, or compliance can all be built on the same, normalized enterprise knowledge. By taking proprietary APIs out of the equation, and with no need to export, mirror, and deploy data into separate triple stores, real-time access to live data is guaranteed.

Your organization already possesses this data. It’s just a case of maximizing its potential, introducing a cleaner and more accessible way of handling it. Contact us at info@coreon.com if you’d like to know more about how a common knowledge infrastructure can help your enterprise.

Coreon would like to extend special thanks to the European Language Grid, which funded significant parts of this R&D effort. The SPARQL endpoint will also be deployed into the ELG hub, so it will be reachable and discoverable from there.

11 JAN 2021

Keeping Your Sanity with Machine Taxonomization

Taxonomies are crucial for businesses and institutions to handle bigger amounts of data. Manually organizing thousands of concepts into a knowledge tree has so far been the only way to do this. Aside from the fact that this task can be quite tedious, it requires in-demand subject matter experts to complete. Thus, it is often considered too expensive or too much effort. A shame, given that companies then miss out on all the benefits of using taxonomies.

With a little help from your (AI) friend

Imagine a chaotic pile of books (of course, the less-organized among us may not have to imagine this) being automatically sorted into shelves, branches, and sub-branches, together with an index to help quickly find a desired book. This describes what our semi‑automatic taxonomization method can do. An initial knowledge tree is produced by Machine Learning (ML), using language models stored in huge neural networks. Clustering algorithms on top of word embeddings automatically converts a haystack of concepts into a structured tree. The final curation of the taxonomy is still carried out by a human, but the most time-consuming and tedious aspects of the task have already been dealt with, and in a consistent way.
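
As a rough illustration of the idea (clustering on top of word embeddings, not Coreon's actual pipeline), a first-cut grouping of concept labels could look like the sketch below, assuming the sentence-transformers and scikit-learn packages; the model name and distance threshold are arbitrary choices:

    # Sketch: group a flat list of concept labels into candidate clusters
    # for human curation. Model name and threshold are illustrative choices.
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import AgglomerativeClustering

    concepts = ["face mask", "FFP2 respirator", "mRNA vaccine",
                "vector vaccine", "social distancing", "contact tracing"]

    embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(concepts)

    labels = AgglomerativeClustering(
        n_clusters=None, distance_threshold=1.0,
        metric="cosine", linkage="average",   # 'metric' requires scikit-learn >= 1.2
    ).fit_predict(embeddings)

    clusters = {}
    for concept, cluster_id in zip(concepts, labels):
        clusters.setdefault(cluster_id, []).append(concept)
    for cluster_id, members in sorted(clusters.items()):
        print(cluster_id, members)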

‘Cobot’ versus manual

In a study, we benchmarked this collaborative robot approach (ML auto‑taxonomization and human curation) against the manual job done by an expert linguist. Below are the data and task flows of the two approaches:

We aimed to taxonomize 424 concepts related to COVID-19. The traditional manual method was tedious and tiring for the human expert, who took a flat list of concepts and turned them into a systematic knowledge graph by working concept by concept to get everything in its right place. Wading through the list from scratch (including constantly switching contexts – from drugs, to vaccines, to social distancing, for example) made progress on the task difficult to measure. Having no perception of how many clusters of concepts still needed to be created was demotivating.

In contrast, our semi-automatic method started off with a tree of 55 suggested clusters of leaf concepts, each representing a specific context. Of course, ML doesn’t always produce the exact results a human expert would (we hear you, AI skeptics!), so some algorithm-suggested clusters were a bit off. However, the majority of the 55 were pretty accurate. They were ready to be worked on in Coreon’s visual UI, making the human curation task much faster and easier. This also enabled progress to be measured, as the job was done cluster by cluster.

Advantage, automation!

From a business perspective the most important result was that the semi‑automatic method was five(!) times faster. The structured head-start enabled the human curator to work methodically through the concepts. The clustered nature of the ML‑suggested taxonomy would also allow the workload to be distributed – e.g., one expert could focus on one medicine, another on public health measures.

More difficult to measure (but nicely visible below) was the quality of the two resulting taxonomies. While our linguist did a sterling job working manually, the automatic approach produced a tidier taxonomy which is easier for humans to explore and can be effectively consumed by machines for classification, search, or text analytics. Significantly, as the original data was multilingual, the taxonomy can also be leveraged in all languages.

A barrier removed

So, can we auto-taxonomize a list of semantic concepts? The answer is yes, with some human help. The hybrid approach frees knowledge workers from the tedious work in the taxonomization process and offers immediate benefits – being able to navigate swiftly through data, and efficient conceptualization.

Most importantly, though, linking concepts in a knowledge graph enables machines to consume enterprise data. By dramatically lowering the effort, time, and money needed to create taxonomies, managing textual data will become much easier and AI applications will see a tremendous boost.

If you'd like to discover more about our technology and services for auto-taxonomization, feel free to get in touch with us here.

9 DEC 2020

Making Translation GDPR-Compliant

Current processes violate GDPR

Out of the six data protection principles, translation regularly violates at least four: purpose limitation, data minimization, storage limitation, and confidentiality. This last one is mentioned in most purchase orders, but it is hard to live up to in an industry which squeezes out every last cent along a long supply chain.

Spicier is the fact that translators don't need to know any personal data, such as who made a payment and how much money was transferred (as in the sample below), in order to translate a text. Anonymized source texts would address purpose limitation and data minimization. The biggest offenders, however, are the industry's workhorses: neural machine translation (NMT) and translation memory (TM). NMT is trained on, and TM stores, texts full of personal data with no means of deleting it, even though storing the protected data was unnecessary in the first place.

A GDPR-compliant translation workflow 

Some might argue that this difficult problem cannot be fixed. Well, it can. And not only this, our anonymization workflow saves money and increases quality and process safety, too. 

On a secure server, ‘named entities’ (i.e. likely protected data) are recognized. This step is called named entity recognition (NER), a standard discipline of Natural Language Processing. There are several anonymizers on the market, mainly supporting structured data and English, but they only support a one-way process.

In our solution, the data is actually “pseudonymized” in both the source and target languages. This keeps the anonymized data readable for linguists by replacing protected data with another string of the same type. Once translated, the text is de-anonymized by replacing the pseudonyms with the original data. This step is tricky since the data also needs to be localized, as in our example with the title and the decimal and thousands separators. The TMs used along the supply chain will only store the anonymized data. Likewise, NMT is not trained with any personal data. 
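
The round trip described above can be sketched in a few lines. This is an illustration only: the regex-based entity spotting and the replacement values below stand in for the trained NER models, rules, and locale-aware reformatting of a real workflow:

    # Sketch of the pseudonymize / de-anonymize round trip (illustrative only).
    import re

    PATTERNS = {
        "AMOUNT": re.compile(r"\b\d[\d.,]*\s?EUR\b"),
        "IBAN":   re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{10,30}\b"),
    }
    PSEUDONYMS = {"AMOUNT": "1,000.00 EUR", "IBAN": "DE00123456780000000000"}

    def pseudonymize(text):
        """Replace protected data with same-type placeholders, remember the originals."""
        mapping = {}
        for entity_type, pattern in PATTERNS.items():
            for original in pattern.findall(text):
                pseudo = PSEUDONYMS[entity_type]
                mapping[pseudo] = original
                text = text.replace(original, pseudo)
        return text, mapping

    def deanonymize(translated_text, mapping):
        """Restore the original data (locale-aware reformatting omitted here)."""
        for pseudo, original in mapping.items():
            translated_text = translated_text.replace(pseudo, original)
        return translated_text

    masked, mapping = pseudonymize("2,345.67 EUR were sent from DE89370400440532013000.")
    # ... the masked text travels through TM, NMT, and human revision ...
    print(deanonymize(masked, mapping))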

Know-how

We recently did a feasibility study to test this approach. Academia considers NER a solved problem, but in reality it's only somewhat done for English. Luckily, language models can now be trained to work cross-language. Rule-based approaches, like regular expressions, add deterministic process safety. For our study we extended the current standard formats for translation, TMX and XLIFF, to support pseudonymization. De-anonymization is hard, but I had already developed its basics for the first versions of TRADOS.

What remains is the trade-off between data protection and translatability. The more text is anonymized, the better the leverage – but the harder the text is to understand for humans, too. Getting that balance right will still require some testing, best practices, and good UI design. For example, project managers will want a finer granularity on named entities than normally provided by NER tools. Using a multilingual knowledge system like Coreon, they could specify that all entities of type Committee are to be pseudonymized, but not entities of type Treaty.

Anonymization is mandatory 

As shown above, a GDPR-compliant translation workflow is possible, and is thus legally mandatory. This is, in fact, good news. Regulations are often perceived as making life harder for businesses, but GDPR has actually created a sector in which the EU is a world leader. Our workflow enables highly-regulated industries, such as Life Sciences or Finance, to safely outsource translation. Service providers won’t have to sweat over confidentiality breaches. The workflow will increase quality as named entities are processed by machines in a secure and consistent way and machine translation has fewer possibilities to make stupid mistakes. It will also save a lot of money, since translation memories will deliver a much higher leverage.

If you want to know more, please contact us.

12 DEC 2018

Sunsetting CAT

For decades Computer Assisted Translation (CAT) based on sentence translation memories has been the standard tool for going global. Although CAT was originally designed with a mid-90s PC in mind, and despite proposals for changing the underlying data model, its basic architecture has remained unchanged. The dramatic advances in Neural Machine Translation (NMT) have now made the whole product category obsolete.

NMT Crossing the Rubicon

While selling translation memory I always said that machines will only be able to translate once they understand text, and that if one day they did, MT would be a mere footnote to a totally different revolution. Now it turns out that neural networks, stacked deeply enough, do understand us sufficiently to create a well-formed translation. Over the last two years NMT has progressed dramatically. It has now achieved “human parity” for important language pairs and domains. That changes everything.

Industry Getting it Wrong

Most players in the $50b translation industry, service providers but also their customers, think that NMT is just another source for a translation proposal. In order to preserve their established way of delivery they pitch the concept of “augmented translation”. However, if the machine translation is as good (or bad) as human translation, who would you have revise it, another translator or a subject matter expert? 
Yes, the expert who knows what the text is about. The workflow is thus changing to automatic translation and expert revision. Translation becomes faster, cheaper, and better!

Different Actors, Different Tools

A revision UI will have to look very different from a CAT tool. The most dramatic change is that a revision UI has to be extremely simple. To support the current model of augmented translation, CAT tools have become very powerful. However, their complexity can only be handled by a highly sought-after group of perhaps a few tens of thousands of professional translators worldwide.

For the new workflow a product design is required that can support tens of millions of mostly occasional expert revisers. Also, the revisers need to be pointed to the sentences which need revision. This requires multilingual knowledge.

Disruption Powered by Coreon

Coreon can answer the two key questions for using NMT in a professional translation workflow: a) which parts of the translated text are not fit-for-purpose and b) why not? To do so, the multilingual knowledge system classifies linguistic assets, human resources, QA, and projects in a unified system which is expandable, dynamic, and provides fallback paths. In the future, linguists will engineer localization workflows such as Semiox and create multilingual knowledge in Coreon. “Doing words” is left to NMT.

4 APR 2018

Concept Maps Everywhere

On March 22-24 the DTT Symposion (DTT for short) took place again in Mannheim. It is the biennial meeting of the German Terminology Association (Deutscher Terminologietag). We were exhibiting, and I enjoyed talking to many Coreon customers there. It was a truly exciting event this year and, according to the organizers, the busiest ever. 200+ participants meant a full house!

"Ausgebucht - no further seats left!"

After a half day of pre-event workshops, the event kicked off Friday morning with a presentation from Martin Volk (University Zürich) on parallel corpora, terminology extraction, and MT. Martin challenged the hype around Neural Machine Translation and pinpointed some weaknesses: “NMT operates with a fixed vocabulary. But real world translation has to deal with new words constantly … how can we ensure terminology-consistent translation?”. His research confirms what we've outlined in an earlier blog post: Why Machine Learning still Needs Humans for Language.

“Concept Maps Everywhere”

Back to the event ... as one participant tweeted, concept maps were the dominating topic throughout the event. First a workshop by Annette Weilandt (eccenca) on taxonomies, thesauri, and ontologies, followed by a presentation by Petra Drewer (University Karlsruhe). Petra unveiled a plethora of benefits:

  • insight into the domain
  • systematic presentation
  • clear distinction between concepts
  • identification of gaps
  • equivalence checks across languages
  • new opportunities in AI contexts

No surprise, my event highlight was the Coreon customer presentation from Liebherr on the benefits of multilingual knowledge systems. In this very entertaining presentation Lukas Auer (Liebherr MCCtec) and Johannes Widmann (Liebherr Holding) outlined how pragmatic and effective working with concept systems turns out to be. They concluded: “If we all think in networks, why should our termbase then be designed as an alphabetic list of terms?” Instead, the concept-system-driven approach has many advantages, such as training of new staff, context knowledge for technical authors and translators, terminological elaboration of specific domains, insight into how far a domain is already covered, avoiding duplicates, etc. Download a case study from the Coreon web site.

DTT 2018 Award for a Master Thesis on Coreon

And then the “i-Tüpfelchen” (cherry on the cake) on Friday afternoon: David Reininghaus received this year's DTT award for his master's thesis “Applying concept maps onto terminology collections: implementation of WIPO terminology with Coreon”. In his work, David analyzed how a real graph-driven technology outperforms simple hyperlink-based approaches: no redundancies, more efficient, less error-prone. He further developed an XSL-based method for transforming the MultiTerm / TBX hyperlink-based workarounds into a real graph, visualized in Coreon.

Deutsche Bahn: Terminology-Driven AI Applications

Tom Winter (Deutsche Bahn and President of the DTT) illustrated in his session how terminology boosts AI applications. Even simple synonym expansion has made the intranet search engines more useful (a search for the unofficial Schaffner now also finds documents where only the approved Zugbegleiter was used). Other applications are automatic pre-processing of incoming requests in a customer query-answering system, or even improving Alexa-driven speech interaction at ticket vending machines … who says terminology is still a niche application?
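
A terminology-driven synonym expansion of this kind can be sketched in a few lines; the synonym table below is purely illustrative and has nothing to do with Deutsche Bahn's actual search setup:

    # Sketch: expand a search query with approved/unofficial synonyms from a termbase.
    SYNONYMS = {
        "schaffner": ["zugbegleiter"],   # unofficial term -> approved term
        "zugbegleiter": ["schaffner"],
    }

    def expand_query(query):
        terms = query.lower().split()
        expanded = set(terms)
        for term in terms:
            expanded.update(SYNONYMS.get(term, []))
        return " OR ".join(sorted(expanded))

    print(expand_query("Schaffner Pausenregelung"))
    # -> pausenregelung OR schaffner OR zugbegleiter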

From Language to Knowledge

I am excited about the evolution of the DTT in recent years. How many more participants will we see in spring 2020? I am convinced that the more the DTT community moves out of the pure documentation niche and towards the areas that our customer Liebherr or Tom Winter have illustrated, the more its relevance and visibility will grow. Then the organisers can again proudly announce: Ausgebucht - no more seats left!

12 FEB 2018

Internet of Things Banks on Semantic Interoperability

The biggest challenge for widespread adoption of the Internet of Things is interoperability. A much-noticed McKinsey report states that achieving interoperability in IoT would unlock an additional 40% of value. This is not surprising since the IoT is in essence about connecting machines, devices, and sensors – ideally cross organization, cross industries, and even cross borders. But while technical and syntactic interoperability are pretty much solved, little has been available so far to make sure devices actually understand each other.

Focus Semantic Interoperability

Embedded Computing Design superbly describes the situation in a recent series of articles. Technical interoperability, the fundamental ability to exchange raw data (bits, frames, packets, messages), is well understood and standardized. Syntactic interoperability, the ability to exchange structured data, is supported by standard data formats such as XML and JSON. Core connectivity standards such as DDS or OPC-UA provide syntactic interoperability across industries by communicating through a proposed set of standardized gateways.

Semantic interoperability, though, requires that the meaning (context) of exchanged data is automatically and accurately interpreted. Several industry bodies have tried to implement semantic data models. However, these semantic data schemes have either been way too narrow for cross-industry use cases or had to stay too high-level. Without such schemes, data from IoT devices lacks the information to describe its own meaning. Therefore a laborious and, worse, inflexible normalization effort is required before that data can really be used.

Luckily there is a solution: abstract metadata from devices by creating an IoT knowledge system.

Controlled Vocabulary and Ontologies

A controlled vocabulary is a collection of identifiers which ensure consistency of metadata terminology. These terms are used to label concepts (nodes) in a graph which provides a standardized classification for a particular domain. Such an ontology, incorporating characteristics of a taxonomy and a thesaurus, links concepts with their terms and attributes in semantic relationships. This way it provides metadata abstraction. It represents knowledge in machine-readable form and thus functions as a knowledge system for specific domains and their IoT applications.
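
As a toy illustration of what such a concept entry might hold (an invented data shape, not Coreon's internal model), each node carries its labels per language plus broader/narrower links:

    # Illustrative controlled-vocabulary entries for an IoT metadata ontology.
    CONCEPTS = {
        "temperature sensor": {
            "labels": {"en": "temperature sensor", "de": "Temperatursensor"},
            "broader": ["sensor"],
            "narrower": ["infrared thermometer"],
        },
        "infrared thermometer": {
            "labels": {"en": "infrared thermometer", "de": "Infrarotthermometer"},
            "broader": ["temperature sensor"],
            "narrower": [],
        },
    }

    def label(concept_id, lang="en"):
        """Return the label of a concept in the requested language."""
        return CONCEPTS[concept_id]["labels"].get(lang)

    def generalize(concept_id):
        """Fall back to the broader concept when a device cannot handle the specific one."""
        return CONCEPTS[concept_id]["broader"]

    print(label("infrared thermometer", "de"))   # Infrarotthermometer
    print(generalize("infrared thermometer"))    # ['temperature sensor']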

IoT Knowledge Systems made Easy

A domain ontology can be maintained in a repository completely abstracted from any programming environment. It needs to be created and maintained by domain experts. With the explosive growth of the IoT, new devices, applications, organizations, industries, and even countries are constantly being added. Metadata abstraction parallels object-oriented programming and, unfortunately, so do the tools used so far to maintain and extend ontologies.

But now our SaaS solution Coreon makes sure that IoT devices understand each other. Not only does Coreon function with its API as a semantic gateway in the IoT connectivity architecture, it also provides a modern, very easy-to-use application to maintain ontologies, featuring a user interface domain experts can actually work with. With Coreon they can deliver the knowledge necessary for semantic interoperability so that IoT applications can unlock their full value.

Coreon will be presented at the Bosch ConnectedWorld Internet of Things conference in February 2018 in Berlin. If you cannot come by our stand (S20), just flip through our presentation or drop us a mail with questions.

29 JAN 2018

Language Service Providers Need to Look Ahead to Compete with Machines

By Rachel Wheeler, Morningside Translations

Language localization services have been big business, and estimates indicate that the market will grow at an annual rate of about 7%. Companies that focus solely on translation services will continue to find demand for several years to come. The global marketplace, however, also presents new opportunities for language service providers (LSPs) to elevate their services and expand their businesses beyond translation alone.

Other LSPs Are Not The Only Competition

Some of the key benefits that professional translation agencies provide are quality translation and local expertise. To date, machine language translation software has had its limitations: poor quality, faulty grammar and syntax, and lack of contextual understanding. LSPs have benefited from these flaws by being able to provide a superior alternative.

However, in 2017, Google introduced Google Neural Machine Translation (GNMT). What GNMT promises to provide is a new machine approach that will directly compete with human translators. Machine learning translation software has relied on an algorithmic approach to translation that was almost a word-for-word dictionary approach. Therein lies its major flaw: it can only learn through predictive behavior analysis.

Neural networks like GNMT, however, incorporate a more complex structure that mimics the way the human brain processes information. This approach replicates the idea of intuition in many ways, not simply hard definitions. In its first published iteration, Google is already claiming a 60% reduction in errors.

For LSPs, these neural networks mean more–and cheaper–competition in the future. The nature of work for translation agencies will need to change in order to remain relevant.

Marketing Remains the Realm of People

By far, the main edge LSPs will have over machine translation is experience and an understanding of local culture. For global businesses, marketing their goods and services is not just a matter of translating words. Successful marketing understands the emotional impact of how information is presented.

Subtle differences in words–“discover” versus “find”, for example–have a different impact in sales and marketing than they do in more formal written content. Factor in the additional layer of translation word choices, and the tone or intent of words can shift dramatically away from the original purpose.

Marketing content does not automatically translate from one language to another. Even visual imagery can fall in the purview of the cross-cultural marketer. Lingerie, for instance, is promoted differently in conservative countries than in the West. LSPs are in the perfect position to expand their services into marketing, either as outside consultants or even agency-level providers.

Essentially, their ability to localize is a human translator’s greatest differentiator. Whether that’s leveraged for eLearning localization or creating images for a website specifically geared towards a regional audience, this is where an LSP can still shine.

Data Mining Works In Any Language

With today's enormous output of information, data mining has become a big business of its own. Data miners often refer to their work as “discovering insights.” As they review the clicks of a website, the comments on social media, and the results of customer surveys, they inherently build a consumer profile with cultural bias built in.

LSPs with experts in particular languages and cultures can sift through these insights in the original language, catching what a non-native speaker would miss in translation.

Plan Ahead for Competitive Advantage

The technology world makes no secret of its innovations. LSPs should keep an eye on the changes and trends and plan for the future. By anticipating the coming shift in global demand for translation services, language service providers can be ahead of their competitors instead of playing catch-up.

This guest post is written by Rachel Wheeler from Morningside Translations.

6 JUN 2017

The IoT will Thrive on Semantics

In the Internet of Things (IoT) all devices are supposed to communicate among themselves, worldwide. Only, what are they saying to each other? Recently, former Siemens CTO Siegfried Russwurm got to the core of the issue: “Industry 4.0 needs first of all semantics. We can only get through interfaces and breaking points using unified semantics." Apparently not only civil servants in cross-border projects or industry supply chain managers need semantic interoperability. The billions and billions of IoT devices need semantic interoperability as well.

The Must of Semantic Technologies

Sebastian Tramp, coordinator of the Linked Enterprise Data Services (LEDS) project, nicely explains why the vision of the IoT and Industry 4.0 cannot be realized without semantics. If the meaning of IoT data is not clear, it's hard for devices to interact or even communicate. For this, the devices and their relevant metadata must be clearly defined. If, for example, some value is supposed to be measured, the data stream needs to contain information about which sensor took the value, when, and where, but also about what this value actually is. The power of the IoT is based on combining data from different sources. To link this data in a meaningful way you need interfaces in the form of shared knowledge, i.e. ontologies. That's what semantic technologies deliver.

Textual Metadata

Human language plays a surprisingly big role in the IoT. For example, a visual sensor's image Exif information records under [Flash mode] the value “flash, red eye, no strobe return”. Another device processing this textual metadata needs to understand what “… red eye, no strobe …” actually means. And, very importantly, if it cannot provide specific processing for the strobe usage, it should conclude the more generic fact that a flash was active. To make things even more complex, depending on where the device was built it might say this in Chinese or German.

Leverage Terminologies, Taxonomies, and Ontologies

Luckily Multilingual Knowledge Systems (MKS) like Coreon deliver the required semantic and linguistic intelligence for the communication of IoT devices. Companies can leverage existing resources such as word lists, multilingual termbases, and taxonomies to build their metadata concepts with corresponding labels in one or more languages. The metadata concepts need to be semantically structured, at least in broader-narrower relations. Through auto-taxonomisation a provisional graph is suggested, which is reviewed and finalised by subject matter experts. Knowledge resources often require coverage of several languages. Mono- and bilingual term extraction, text and translation memory harvesting algorithms reduce this effort significantly.

This way a knowledge graph is created, with each node representing a metadata meaning expressed by one or more labels. When shared, this graph becomes the interface for IoT devices.

Semantics for the IoT

Without semantic interoperability IoT devices fail to communicate with each other. If human intervention is necessary the Internet of Things with billions of devices remains a buzzword for a great vision. Multilingual Knowledge Systems are a proven solution to make data repositories, systems, organizations, and even countries interoperable. They will provide the unified semantics for the Internet of Things, globally.

Learn more about Coreon or jump right in for a look and feel.

5 APR 2017

Why Machine Learning still Needs Humans for Language

Outperforming Humans

Machine Learning (ML) begins to outperform humans in many tasks which seemingly require intelligence. The hype about ML even makes it into the mass media. ML can read lips, recognize faces, or transform speech to text. But when ML has to deal with the ambiguity, variety and richness of language, when it has to understand text or extract knowledge, ML continues to need human experts.

Knowledge is Stored as Text

The Web is certainly our greatest knowledge source. However, the Web has been designed for being consumed by humans, not by machines. The Web’s knowledge is mostly stored in text and spoken language, enriched with images and video. It is not a structured relational database storing numeric data in machine processable form.

Text is Multilingual

The Web is also very multilingual. Recent statistics show that, surprisingly, only 27% of the Web's content is in English and only 21% is in the next 5 most-used languages. That means more than half of its knowledge is expressed in a long tail of other languages.

Constraints of Machine Learning

ML faces some serious challenges. Even with today’s availability of hardware, the demand for computing power can become astronomical when input and desired output are rather fuzzy (see the great NYT article "The Great A.I. Awakening").

ML is great for 80/20 problems, but it is dangerous in contexts with high accuracy needs: “Digital assistants on personal smartphones can get away with mistakes, but for some business applications the tolerance for error is close to zero", emphasizes Nikita Ivanov, from Datalingvo, a Silicon Valley startup.

ML performs well on n-to-1 questions. For instance, in face recognition “all these pixels show which person?” has only one correct answer. However, ML struggles with n-to-many or gradual problems … there are many ways to translate a text correctly or to express a certain piece of knowledge.

ML is only as good as its available relevant training material. For many tasks mountains of data are needed. And the data had better be of supreme quality. For language-related tasks these mountains of data are often required per language and per domain. Further, it is also hard to decide when the machine has learned enough.

Monolingual ML Good enough?

Some suggest simply processing everything in English. ML also does an OK job at Machine Translation, as Google Translate shows. So why not translate everything into English and then run our ML algorithms? This is a very dangerous approach, since errors multiply. If the output of an 80% accurate Machine Translation becomes the input to an 80% accurate Sentiment Analysis, overall accuracy drops to 64%. At that hit rate you are getting close to flipping a coin.

Human Knowledge to Help

The world is innovating constantly. Every day new products and services are created. To talk about them we continuously craft new words: the bumpon, the ribbon, a plug-in hybrid, TTIP ‒ only with the innovative force of language can we communicate new things.

Struggle with Rare Words

By definition new words are rare. They first appear in one language and then may slowly propagate into other domains or languages. There is no knowledge without these rare words, the terms. Look at a typical product catalog description with the terms highlighted. Now imagine this description without the terms – it would be nothing but a meaningless scaffold of fill-words.

Knowledge Training Required

At university we acquire the specific language, the terminology, of the field we are studying. We become experts in that domain. But even so, later in our professional career when we change jobs we still have to acquire the lingo of the new company: names of products, modules, services, but also job roles and their titles, names for departments, processes, etc. We get familiar with a specific corporate language by attending training, by reading policies, specifications, and functional descriptions. Machines need to be trained in the very same way with that explicit knowledge and language.

Multilingual Knowledge Systems Boost ML with Knowledge

There is a remedy: Terminology databases, enterprise vocabularies, word lists, glossaries – organizations usually already own an inventory of “their” words. This invaluable data can be leveraged to boost ML with human knowledge: by transforming these inventories into a Multilingual Knowledge System (MKS). An MKS captures not only all words in all registers in all languages, but structures them into a knowledge graph (a 'convertible' IS-A 'car' IS-A 'vehicle'…, 'front fork' IS-PART of 'frame' IS-PART of 'bicycle').
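
A minimal sketch of such a graph and of how a machine can walk it (the labels and relations are taken from the example above; the data shape itself is purely illustrative):

    # Tiny illustrative knowledge graph with typed relations.
    GRAPH = {
        "convertible": {"IS-A": ["car"]},
        "car":         {"IS-A": ["vehicle"]},
        "vehicle":     {},
        "front fork":  {"IS-PART-OF": ["frame"]},
        "frame":       {"IS-PART-OF": ["bicycle"]},
        "bicycle":     {},
    }

    def ancestors(concept, relation="IS-A"):
        """Walk a relation transitively, e.g. convertible -> car -> vehicle."""
        found, todo = [], list(GRAPH.get(concept, {}).get(relation, []))
        while todo:
            parent = todo.pop()
            if parent not in found:
                found.append(parent)
                todo.extend(GRAPH.get(parent, {}).get(relation, []))
        return found

    print(ancestors("convertible"))               # ['car', 'vehicle']
    print(ancestors("front fork", "IS-PART-OF"))  # ['frame', 'bicycle']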

It is the humanly curated Multilingual Knowledge System that enables ML and Artificial Intelligence solutions to work for specific domains with only small amounts of textual data and also for less resourced languages.

23 MAR 2017

Excel with Enterprise Taxonomy

In multiple blog posts we have mentioned Multilingual Knowledge Systems (MKS) and how they are a core component of several applications, both monolingual and multilingual. An MKS is in fact a multilingual Enterprise Taxonomy.

We have explained what an MKS is and now we want to advise you how to build one.

People often fear the task of creating the basic infrastructure (an Enterprise Taxonomy) for their operations in different countries. They think that it is too costly, needs special expertise, and is difficult to maintain, often because of expensive, homegrown software that is cumbersome to use. What many do not realize is that they already have this data and have been paying for it for years in their translation contracts.

What you need to do is the following:

  • Collect your terminology data in all the languages you need from your translation provider and send it to us at coreon.com
  • Assign a responsible knowledge carrier with a good overview of your operations. 

At Coreon we will manage your terminology data and, in collaboration with you and your experts, our team will structure, verify, and QA the result.

A RESTful API makes connectivity straightforward. Your company can easily add a new product/service/operation on top of your Enterprise Taxonomy.
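
For illustration, fetching a concept over such an API could look like the sketch below; the base URL, path, parameters, and response shape are hypothetical placeholders, not a documented Coreon endpoint:

    # Hypothetical sketch of a REST lookup; URL and JSON shape are placeholders.
    import requests

    API_BASE = "https://api.example.com/v1"            # not a real endpoint
    HEADERS = {"Authorization": "Bearer <your-token>"}

    response = requests.get(
        f"{API_BASE}/concepts",
        params={"term": "convertible", "lang": "en"},
        headers=HEADERS,
        timeout=10,
    )
    response.raise_for_status()
    for concept in response.json():
        print(concept)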

Deploy the power of your MKS in your applications. Contact us – we will get back to you with a proposal that will do more than make you happy: it will boost your career!
