Keeping Your Sanity with Machine Taxonomization

An automated head-start can make handling your terminology data faster and more efficient.
Machine taxonomization

Taxonomies are crucial for businesses and institutions to handle bigger amounts of data. Manual taxonomization of thousands of concepts into a knowledge tree has so far been the only way to do this. Aside from the fact that this task can be quite tedious, it requires in-demand subject matter experts to complete. Thus, it is often considered too expensive or too much effort. A shame, given that companies then miss out on all the benefits of using taxonomies.

With a little help from your (AI) friend

Imagine a chaotic pile of books (of course, the less-organized among us may not have to imagine this) being automatically sorted into shelves, branches, and sub-branches, together with an index to help quickly find a desired book. This describes what our semi‑automatic taxonomization method can do. An initial knowledge tree is produced by Machine Learning (ML), using language models stored in huge neural networks. Clustering algorithms on top of word embeddings automatically converts a haystack of concepts into a structured tree. The final curation of the taxonomy is still carried out by a human, but the most time-consuming and tedious aspects of the task have already been dealt with, and in a consistent way.

‘Cobot’ versus manual

In a study, we benchmarked this collaborative robot approach (ML auto‑taxonomization and human curation) against the manual job done by an expert linguist. Below are the data and task flows of the two approaches:

Automatic versus Manual Flow 1

We aimed to taxonomize 424 concepts related to COVID-19. The traditional manual method was tedious and tiring for the human expert, who took a flat list of concepts and turned them into a systematic knowledge graph by working concept by concept to get everything in its right place. Wading through the list from scratch (including constantly switching contexts – from drugs, to vaccines, to social distancing, for example) made progress on the task difficult to measure. Having no perception of how many clusters of concepts still needed to be created was demotivating.

In contrast, our semi-automatic method started off with a tree of 55 suggested clusters of leaf concepts, each representing a specific context. Of course, ML doesn’t always produce the exact results a human expert would (we hear you, AI skeptics!), so some algorithm-suggested clusters were a bit off. However, the majority of the 55 were pretty accurate. They were ready to be worked on in Coreon’s visual UI, making the human curation task much faster and easier. This also enabled progress to be measured, as the job was done cluster by cluster.

By dramatically lowering the effort, time, and money needed to create taxonomies, managing textual data will become much easier and AI applications will see a tremendous boost.

Advantage, automation!

From a business perspective the most important result was that the semi‑automatic method was five(!) times faster. The structured head-start enabled the human curator to work methodically through the concepts. The clustered nature of the ML‑suggested taxonomy would also allow the workload to be distributed – e.g., one expert could focus on one medicine, another on public health measures.

More difficult to measure (but nicely visible below) was the quality of the two resulting taxonomies. While our linguist did a sterling job working manually, the automatic approach produced a tidier taxonomy which is easier for humans to explore and can be effectively consumed by machines for classification, search, or text analytics. Significantly, as the original data was multilingual, the taxonomy can also be leveraged in all languages.

Automatic versus Manual 3

A barrier removed

So, can we auto-taxonomize a list of semantic concepts? The answer is yes, with some human help. The hybrid approach frees knowledge workers from the tedious work in the taxonomization process and offers immediate benefits – being able to navigate swiftly through data, and efficient conceptualization.

Most importantly, though, linking concepts in a knowledge graph enables machines to consume enterprise data. By dramatically lowering the effort, time, and money needed to create taxonomies, managing textual data will become much easier and AI applications will see a tremendous boost.

If you’d like to discover more about our technology and services on auto-taxonomization, feel free to get in touch with us here.

Michael Wetzel
Michael Wetzel

Michael has a deep knowledge of multilingual problem solving and long term experience in product management. Michael was for years product manager of TRADOS MultiTerm. He is an active contributor to the ISO TC37/SC3 and DIN NA 105 standards.