What is Stemming for Search Engines?

In linguistics and computational linguistics, stemming (reduction to the stem form) is a procedure that traces the different morphological variants of a word back to their common stem. Stemming can make search results more relevant and, in any case, it expands the result set, because a query also matches documents that contain other variants of the query terms. Stemming is also related to TF-IDF (Term Frequency – Inverse Document Frequency), because search engines usually rely on the co-occurrence of words to understand the topic and purpose of a piece of content, and stemming determines which word forms are counted as the same term. To learn more, you can read our “What is TF-IDF” article.
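As a minimal sketch of the idea, the following Python snippet strips a few English suffixes so that variants of a word collapse onto one stem before term frequencies are counted. The suffix list and the `toy_stem` and `term_frequencies` functions are illustrative assumptions, not the algorithm of any real search engine.

```python
from collections import Counter

# Hypothetical, deliberately tiny suffix list; real stemmers use many more rules.
SUFFIXES = ("ization", "izing", "ized", "izes", "ize", "ing", "ed", "es", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three letters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def term_frequencies(text: str) -> Counter:
    """Count how often each stem occurs in a whitespace-separated text."""
    return Counter(toy_stem(token) for token in text.split())


counts = term_frequencies("optimize optimizing optimized optimization")
print(counts)  # all four variants are counted under the single stem "optim"
```

Because the four variants share one stem, a frequency-based measure such as TF-IDF sees one term occurring four times instead of four terms occurring once each.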

How does Stemming Help Search Engines Reduce Their Costs While Organizing the Information on the Web?

Various stemming procedures have been tested in computer science since 1968, and so far the technique has been most widespread in the USA. Search engines use the keywords contained in a text to return relevant search results. To this end, Google has used various stemming processes to improve its search engine since 2004. These changes to the search algorithm allow search queries to be differentiated and entire sentences to be analyzed much more precisely.

Stemming identifies words by their grammatical stem. The search engine thereby recognizes the connection between the variants of a word and assigns relevance to a page accordingly. For example, if you enter the search term “search engine optimization”, you will get results containing both “optimization” and “optimize”. In addition, stemming reduces the storage space required, because variants share a single index entry, and thus speeds up the search. Stemming not only reduces content to its basic forms but also helps with similar terms that belong to the subject area of the main keyword. For example, a text about potatoes could cover other uses and types of potatoes in addition to describing the vegetable itself. In this way, the text would increase its relevance for this term.
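The storage and recall effect can be sketched with a small inverted index. The example below is an assumption-laden illustration: the documents, the `toy_stem` rules and the `build_index` helper are all hypothetical, and real search indexes are far more elaborate.

```python
from collections import defaultdict

# Hypothetical mini rule set, only large enough for this example.
SUFFIXES = ("ization", "izing", "ize", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs, normalize):
    """Map each normalized term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[normalize(token)].add(doc_id)
    return index


docs = {
    1: "search engine optimization",
    2: "optimize your search engine",
    3: "optimizing pages",
}

raw_index = build_index(docs, lambda word: word)  # no stemming
stemmed_index = build_index(docs, toy_stem)       # stemmed terms as keys

# The stemmed index needs fewer entries than the raw one.
print(len(raw_index), len(stemmed_index))

# Without stemming, "optimize" only matches the document with that exact form.
print(raw_index.get("optimize"))

# With stemming, the same query reaches every document that uses a variant.
print(stemmed_index.get(toy_stem("optimize")))
```

The query is stemmed with the same function as the documents, which is what lets “optimize” reach pages that only contain “optimization” or “optimizing”.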

Stemming shares a purpose with today's BERT algorithm: understanding queries according to the context in which they are used. The nature of search engines stays the same while search habits constantly change, which leads to increasingly complex measurements and algorithms over time.


Stemming Algorithms and Their Improvement Process

Julie Beth Lovins published the first stemming algorithm in 1968. The second major stemming algorithm was published in 1980 by Martin Porter. Most publications focused on English texts, so the algorithms were applied to other languages only later. That is also why the Florida Update affected non-English-speaking countries later. Around 2000, Martin Porter published Snowball, a framework for writing stemming algorithms, to improve on the existing stemmers. Later, he also helped others develop algorithms with the same purpose.

Notes from Martin Porter Regarding Stemming

Stemming cannot be applied to all languages. Chinese, for example, cannot be processed with a stemmer, but the Indo-European languages are more or less suitable for it. Assuming the words are written left to right, the stem is on the left and zero or more suffixes can be appended to the right. Prefixes can also be placed to the left of the stem. (Example: ‘unhappiness’: prefix ‘un’, stem ‘happy’; the ‘y’ became ‘i’ when the suffix ‘ness’ was added.) Prefixes often change the meaning of the word substantially (exception: ‘ge’ in German). The goal of stemming is to remove suffixes under certain circumstances. For example, ‘happy’ and ‘happiness’ have a related meaning, and it is desirable to reduce both words to the basic form ‘happi’. Infixes are rarely used (e.g. in German and Dutch).

Martin Porter
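The ‘happy’ / ‘happiness’ example from the note above can be reproduced with two hand-written rules. This is only a sketch under stated assumptions: `toy_stem` and its two rules are hypothetical and cover just this example, whereas Porter's actual algorithm applies many ordered rules with additional conditions.

```python
def toy_stem(word: str) -> str:
    """Reduce a word with two illustrative suffix rules; prefixes are left alone."""
    word = word.lower()
    # Rule 1 (assumed): drop the suffix "ness" if at least three letters remain.
    if word.endswith("ness") and len(word) - 4 >= 3:
        word = word[:-4]
    # Rule 2 (assumed): turn a final "y" into "i" so "happy" meets "happi".
    if word.endswith("y") and len(word) > 2:
        word = word[:-1] + "i"
    return word


for word in ("happy", "happiness", "unhappiness"):
    print(word, "->", toy_stem(word))
```

‘happy’ and ‘happiness’ end up on the same form, while ‘unhappiness’ keeps its prefix and therefore stays distinct, in line with the remark that prefixes often change the meaning of a word.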

The Stemming process for Search Engines

Because languages differ in their grammar, the complex stemming algorithms are optimized for each language. For example, the algorithm recognizes the common word stem “house” in the different words “house building”, “behausen” (German for “to house”) and “a residential house” and delivers the corresponding search results. For stemming, it no longer matters whether the text contains words in the singular or plural, or whether they are joined by a hyphen. The procedure allows both the reduction of words to their basic form and the removal of suffixes or prefixes. The most common variant is suffix removal.
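How plural forms and hyphenated spellings can be made irrelevant is sketched below for English only. The `normalize` function and its single plural rule are assumptions made for illustration; other languages, such as the German example above, would need their own rules.

```python
import re


def normalize(text: str) -> list:
    """Lowercase a text, split it on hyphens and whitespace, and strip a plural "s"."""
    tokens = [t for t in re.split(r"[-\s]+", text.lower()) if t]
    result = []
    for token in tokens:
        # Assumed plural rule: drop one trailing "s", but not from "ss" or short words.
        if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
            token = token[:-1]
        result.append(token)
    return result


# Hyphenated plural and unhyphenated singular normalize to the same tokens.
print(normalize("Search-Engines"))
print(normalize("search engine"))
```

Both calls produce the same token list, so a query written in either form would be matched against the same index entries.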

Stemming and word variations are also related to entity-based search engine features. Despite different pronunciations or languages, people search for the same entity, and the search engine can handle those searches with the same neural network. That is why the search engine unifies information from all over the world under an “Entity Knowledge Base”. This results in a huge unification of data, and more data means better insights and algorithms.

To learn more about the evolution of search engines through history, you can read our Holistic SEO articles.

Koray Tuğberk GÜBÜR
