Text Analysis with Python

How to Perform Text Analysis via Python and Advertools

Text, or content, is a written source that conveys information with a certain structure, emotion, and voice. Two texts on the same subject can differ considerably because of spelling choices, sentence structures, or vocabulary, and each can be perceived as better in a different respect. It is useful to analyze content and text to see which sentence structures and calls to action are more useful to users and search engines. Python and Data Science are important parts of Holistic SEO. Thanks to data processing with Python, SEOs can analyze this kind of information in less time and with less effort.

Natural Language Processing, Natural Language Understanding, and Natural Language Generation will be some of the most popular terms and skills in the future of Holistic SEO. I believe that when we start to generate text with AI and trained data sets, we will be able to tell machines to write content according to our wishes in terms of sentiment, sentence structure, or terminology. But for now, in this article, we will use “phrase-based” methodologies to analyze sentences, texts, and contents. The phrase-based methodology looks for a specific word or phrase in a piece of content to understand the content’s structure, purpose, writing techniques, and sentiment, instead of using AI-based methods. In the future, we will also write guidelines for content and text analysis with AI-based methodologies through Python libraries.

In this article, we will use Advertools, a Python library for SEO and SEM created by Elias Dabbas, whose skills I admire.

What is the Difference Between Absolute and Weighted Frequency?

Word frequency is how often a word is used in a text. It is closely related to metrics such as Keyword Density, to the way search engines like Google, Bing, or DuckDuckGo evaluate content, and to spam methods such as keyword stuffing.

The main difference between Absolute and Weighted Frequency is how the frequency is interpreted. Absolute Word Frequency is the number of times a word is used in the text; dividing it by the total number of words gives the word’s share of the text. Weighted Word Frequency measures how often a word is used, weighted by another metric attached to that word in the data set. For example, in an e-commerce database, the word that generates the most profit will have a higher weighted frequency, even if it is used less often. Similarly, when you combine two social media posts, the weighted frequency of the words in the posts will differ according to metrics such as likes or comments.

With this method, an SEO can calculate the “Weighted Frequency” of the words that convert or bring traffic.

The necessary function for this task from Advertools is “word_frequency()”.
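Before using the library, the difference between the two frequencies can be sketched in plain Python. This is only an illustration of the idea, not how Advertools implements it; the posts and their “likes” values below are made up for the example.

```python
from collections import Counter

# Hypothetical posts with a hypothetical "likes" metric (not from this article's data)
posts = [
    ("red shoes sale", 10),
    ("red hat", 200),
    ("blue shoes", 5),
]

abs_freq = Counter()  # how many times each word occurs
wtd_freq = Counter()  # occurrences weighted by the metric of the post they appear in

for text, likes in posts:
    for word in text.lower().split():
        abs_freq[word] += 1
        wtd_freq[word] += likes

print(abs_freq)
print(wtd_freq)
```

Here “shoes” and “red” are used equally often, but “hat” far outweighs “shoes” once the likes are taken into account.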

To learn more about Python SEO, you may read the related guidelines:

  1. How to resize images in bulk with Python
  2. How to perform TF-IDF Analysis with Python
  3. How to crawl and analyze a Website via Python
  4. How to test a robots.txt file via Python
  5. How to Compare and Analyse Robots.txt File via Python
  6. How to Categorize URL Parameters and Queries via Python?
  7. How to Perform a Content Structure Analysis via Python and Sitemaps

An Example of Text Analysis with Absolute Keyword Frequency via Python

First, we will perform an “absolute keyword frequency” calculation with a random paragraph.

import advertools as adv
import pandas as pd
# Show up to 85 rows when a data frame is displayed
pd.set_option('display.max_rows', 85)
# Placeholder: the real paragraph is not reproduced here
text = "A long paragraph from the life of Turkish Historian, İlber Ortaylı (we didn't put the paragraph here)."
adv.word_frequency(text).head(20)

We have imported the necessary libraries, set the number of rows to display, assigned our text to the “text” variable, and calculated the word frequency via the “word_frequency()” function. You may see the result below:

Word Frequency
We have absolute frequencies for our sentence.

Since İlber Ortaylı is a university professor, we see lots of education-related words with high frequency. This is similar to the Term Frequency from the TF-IDF Analysis via Python Guideline. We can also use some useful parameters of Advertools for the same purpose, such as removing the stop words.

Stop words are words that contribute little meaning to a sentence, such as “of”, “a”, “is”, “are”, “you”, “he”, and more. Advertools provides ready-to-use stop word lists for different languages; you may see the available languages via “adv.stopwords.keys()”.

adv.stopwords.keys()
OUTPUT>>>

dict_keys(['arabic', 'azerbaijani', 'bengali', 'catalan', 'chinese', 'croatian', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hebrew', 'hindi', 'hungarian', 'indonesian', 'irish', 'italian', 'japanese', 'kazakh', 'nepali', 'norwegian', 'persian', 'polish', 'portuguese', 'romanian', 'russian', 'sinhala', 'spanish', 'swedish', 'tagalog', 'tamil', 'tatar', 'telugu', 'thai', 'turkish', 'ukrainian', 'urdu', 'vietnamese'])
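What stop word removal does can be shown with a few lines of plain Python. This is a minimal sketch with a tiny, made-up stop word list and sentence; the lists shipped with Advertools are much longer.

```python
# A tiny, hypothetical stop word list; real lists are much longer
stop_words = {"of", "a", "is", "are", "the", "in"}

sentence = "The history of the empire is a subject of debate"
tokens = sentence.lower().split()

# Keep only the words that are not in the stop word list
content_words = [word for word in tokens if word not in stop_words]
print(content_words)  # ['history', 'empire', 'subject', 'debate']
```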

Another parameter of “adv.word_frequency” is “phrase_len”, which sets how many words each counted phrase contains. It is 1 by default. Before showing the removal of the stop words, we will try it with a value of 3.

adv.word_frequency(text, phrase_len=3).head(20)

Text Analysis via Python
Word frequencies with “phrase_len=3” attribute.
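To make the idea of a phrase length concrete, here is a minimal sketch that slides a window of three words over a sentence and counts the resulting phrases. The helper name “phrases” and the sample sentence are assumptions for illustration; the tokenizer inside Advertools may split text differently.

```python
from collections import Counter

def phrases(text, phrase_len=1):
    """Return every run of `phrase_len` consecutive words in `text`."""
    words = text.lower().split()
    return [
        " ".join(words[i:i + phrase_len])
        for i in range(len(words) - phrase_len + 1)
    ]

# Hypothetical sample sentence
sample = "the best massage chair is the best massage chair for home"
counts = Counter(phrases(sample, phrase_len=3))

# The two phrases that occur twice come first
print(counts.most_common(2))
```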

As you may see, we have row entries that consist of three words, and some of them contain stop words. I should also note that, instead of using TF-IDF Analysis, using the Advertools Word Frequency function with the “phrase_len” parameter may sometimes give more hints about the context of the article. Now we can remove the stop words via the “rm_words” parameter. Let’s compare the results with and without “rm_words”.

adv.word_frequency(text, phrase_len=1, rm_words=['the', 'and', 'his', 'to', 'in', 'a', 'he', 'at', 'of', 'from', 'was', 'as']).head(20)

Stop words removal
We have removed the stop words.

adv.word_frequency(text, phrase_len=1, rm_words=[]).head(20)

Stop words in the text
You may see how stop words change the output.

The difference is clear. If you compare single words, you shouldn’t include the stop words, but for longer row entries, keeping stop words may be better. These kinds of Absolute Keyword Frequency calculations may help, along with TF-IDF Analysis, to see the characteristics of standardized content for certain groups of queries and topics. Via Python and Advertools, you may check the SERP results quickly, and you may also create a custom script for checking the content character of the top search results.

Note: “rm_words” doesn’t help to protect the context and meaning of the phrases if “phrase_len” is greater than 1. In this case, you should clean the stop words from your data before using the “word_frequency” function.
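One possible way to clean the data first is sketched below in plain Python: the stop words are dropped from the text, and only then are the two-word phrases built. The stop word list, the sentence, and the helper name “clean_then_phrase” are hypothetical. Keep in mind that this joins words that were not adjacent in the original sentence.

```python
from collections import Counter

# Hypothetical short stop word list
stop_words = {"the", "of", "in", "a", "at", "was"}

def clean_then_phrase(text, phrase_len=2):
    """Drop stop words first, then build phrases from the remaining words."""
    words = [w for w in text.lower().split() if w not in stop_words]
    return [
        " ".join(words[i:i + phrase_len])
        for i in range(len(words) - phrase_len + 1)
    ]

sample = "the professor of history was at the university in the city"
print(Counter(clean_then_phrase(sample)))
```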

After the Absolute Frequency, we can also create an example for the Weighted Frequency Analysis.

An Example of Text Analysis with Weighted Keyword Frequency via Python

Since we need an additional metric besides the keyword’s usage count for weighted frequency analysis, we can use our GSC data for an affiliate website. You may think that you can’t perform a keyword frequency analysis on a data frame, but that is not true. As the additional metric, you may use any column that expresses the weight of that query or query group.

# Load the Google Search Console export and count the words in the Query column
queries = pd.read_csv('Queries-ex.csv', index_col=None)
adv.word_frequency(queries['Query']).head(30)

We have performed our absolute frequency analysis with our Query column.

Query Analysis
You may see the query frequency in our Google Search Console Data.

We also see another valuable thing here: 380 queries contain the word “best”, while “massager”, “massage”, and “chair” have more than 750 occurrences in the column. This simply shows the content strategy of the whole website; we may even guess the meta titles of the contents from this information. You may combine the words “electric”, “massage”, “chair”, “head”, “best”, and “machine” to target different search intents.

Lastly, you may run the same example with the “phrase_len=3” parameter.

adv.word_frequency(queries['Query'], phrase_len=3).head(30)

You may see the result of code below:

In this example, we see some seemingly pointless word patterns such as “is the best”, “what is the”, and “for home use”. These are parts of queries that are longer than three words, but they still reflect minor search intents and search demands for SEO. For instance, we see that there are 7 queries which include the “for home” phrase in the search data, and it shows a useful purpose for the user. An SEO can analyze the same group of queries further to choose the best content marketing methodology. The Advertools “word_frequency” function is not meant for query grouping, but it can help you to group queries at a glance. If you want to know more about “query classification”, you should read our “How to Classify Queries via Python” guideline, which is written with the help of JR Oakes’ codelab.

Lastly, you can see how often two-word queries appear in our data.

Now, we can perform the same analysis with a weighted frequency calculation.

adv.word_frequency(queries['Query'], queries['Impressions'], phrase_len=2).head(30)

The code above gives the weighted word frequency based on the impression amount. We also have some new columns to explain.

Word Frequency Analysis for SEO
You may see our output better now.

“wtd_freq” is the weighted frequency of the words, while “rel_value” is “wtd_freq” divided by “abs_freq”. So “rel_value” may be thought of as a kind of “frequency efficiency”. We may sort our values according to the “wtd_freq” column.

# Take the first 30 rows of the result, then sort them by weighted frequency
weighted_fr = adv.word_frequency(queries['Query'], queries['Impressions'], phrase_len=2).head(30)
weighted_fr.sort_values(by='wtd_freq', ascending=False)

You may see the result below:

Sorting Data according to Weighted Frequency
We sorted the dataframe according to the weighted frequency.
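How the three values relate to each other can be worked through by hand. The sketch below uses made-up queries and impression counts, computes “abs_freq” and “wtd_freq” per word, derives “rel_value” as their ratio, and sorts by weighted frequency. It mirrors the column names of the output above but is not the library’s own code.

```python
from collections import defaultdict

# Hypothetical (query, impressions) rows, not the article's real data
rows = [
    ("best massage chair", 1000),
    ("massage chair price", 300),
    ("best foot massager", 50),
]

stats = defaultdict(lambda: {"abs_freq": 0, "wtd_freq": 0})
for query, impressions in rows:
    for word in query.split():
        stats[word]["abs_freq"] += 1             # one more occurrence
        stats[word]["wtd_freq"] += impressions   # weighted by impressions

# rel_value: weighted frequency per occurrence
for values in stats.values():
    values["rel_value"] = values["wtd_freq"] / values["abs_freq"]

# Sort the words by weighted frequency, highest first
ranked = sorted(stats.items(), key=lambda item: item[1]["wtd_freq"], reverse=True)
for word, values in ranked:
    print(word, values)
```

In this toy data, “best” appears as often as “massage”, yet its “rel_value” is lower because one of its queries has very few impressions.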

You also may calculate the query groups’ impression amount per query. We may perform the same thing via clicks data and observe the change. You may see the result below:

Word Frequency according to the Weight of the word
Weighted Frequency according to the Advertools’ “word_frequency” function.

By combining different data sets, you may determine the worst query groups for Bounce Rate, Length of Stay, Clicks, Impressions, Entrance Amount, Exit Amount, and so on. For instance, in this comparison, we see that the queries with the “foot massager” and “best foot” phrases are not performing well relative to their impression amount. You may also concatenate different data frames with a weighted frequency data frame to see some connections in a more efficient way.

The “word_frequency()” function also has other parameters, such as “regex”, which is used for splitting words according to different characters or rules. Another parameter that makes the function more useful is “extra_info”, which determines whether additional columns should be returned. You may see an example below:

# Weight by clicks and request the extra percentage columns
weighted_fr = adv.word_frequency(queries['Query'], queries['Clicks'], extra_info=True, phrase_len=2).head(30)
weighted_fr.sort_values(by='wtd_freq', ascending=False)

Word Frequency Test via Python
You can also calculate the cumulative values and cumulative percentages thanks to the “extra_info” parameter.

The added columns are “wtd_freq_perc”, “wtd_freq_perc_cum”, and “abs_perc_cum”. “wtd_freq_perc” is the weighted frequency percentage, “wtd_freq_perc_cum” is the cumulative weighted frequency percentage, and “abs_perc_cum” is the cumulative absolute percentage. All of this information can be used to evaluate SEO data and to see the successful and unsuccessful parts of a content strategy. Weighted Word Frequency can also be used for scraped tweets or other kinds of social media interaction data. You may simply calculate the most used features of the most engaging tweets, or the common points of the most liked tweets within a limited hashtag profile. Performing sentiment analysis via this methodology is also possible.
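The percentage and cumulative percentage columns follow a simple arithmetic that can be reproduced by hand. The sketch below uses made-up weighted frequencies that are already sorted in descending order; whether the library reports these values as fractions or as percentages is not assumed here.

```python
from itertools import accumulate

# Hypothetical weighted frequencies, already sorted in descending order
wtd_freq = {"massage chair": 500, "foot massager": 300, "best foot": 200}

total = sum(wtd_freq.values())

# Share of each phrase in the total weighted frequency
perc = [value / total for value in wtd_freq.values()]

# Running total of the shares
perc_cum = list(accumulate(perc))

for phrase, share, running in zip(wtd_freq, perc, perc_cum):
    print(f"{phrase}: {share:.0%} of the total, {running:.0%} cumulative")
```

Reading the cumulative column from the top shows how few phrases are needed to cover most of the clicks or impressions.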

Last Thoughts on Word Frequency Calculation Via Python and Its Use Cases for SEO

We have walked through a simple pair of use cases for Word Frequency Calculation in this guideline. Text processing, mining, sentiment analysis, and more can be done with Python thanks to countless different methods and libraries. In this guideline, we focused solely on one Advertools function, which is built for word frequency calculation with two different methodologies and returns a data frame with multiple columns. Exploring the most important phrases for a specific metric, grouping the queries according to their frequency and word count, and relating them to different metrics are valuable skills for SEO. Exploring a content’s most used phrases with different word splitting styles, weighting phrases according to their social media engagement rate, exploring the content and character of hashtags at scale, and performing sentiment analysis for competitors and customers are also valuable.

If you want to learn more about this methodology, I recommend looking at Elias Dabbas’ interactive guideline on the use cases of word frequency.

Our text processing guidelines via Python will continue to grow. Our “word frequency analysis guideline” still has lots of missing points, and we will continue to improve it.
