Translate a Website with Python

How to Translate a Website within a Pandas Data Frame via Python?

Translating a data frame means converting the language of its values into another language without changing their context, meaning, or content. A Pandas data frame can be translated with Python through the unofficial “googletrans” library or, as in this article, the “google_trans_new” package, both of which rely on Google Translate.

In this article, I will crawl a website and load its content into a data frame so that the content can be translated into another language. Translating a website to explore new topics and content ideas is especially useful for content publishers who don’t know the language of that website. In this example, I will translate MyFitnessPal’s content into Turkish within a Pandas data frame. I know both Turkish and English, but imagine a Spanish website with more than 500000 creative food recipes. Wouldn’t you want to understand its content to find new contexts and topic ideas?

In this example, MyFitnessPal has more than 320000 URLs with food nutrition information and recipes. It has more than 4000 URLs for egg recipes alone. I will crawl only the URLs for egg recipes and translate them. You can use the same methodology to help your clients who don’t know a specific language.

Since this article is a bit long, if you only want to see how to translate a data frame into another language, you can go straight to the code blocks and sub-sections below.

Translate a Sentence to Another Language with Python

To translate a sentence into another language with Python, you can use the “google_trans_new” Python package.

from google_trans_new import google_translator
import google_trans_new

translator = google_translator()
translate_text = translator.translate('Hola mundo!', lang_src='es', lang_tgt='en')
print(translate_text)  # print the result so that the output is shown

OUTPUT>>>

Hello World

To translate a sentence with Python, you should follow the steps below:

  • Import “google_translator” from the “google_trans_new” package.
  • Create a variable and assign “google_translator()” to it.
  • Call the “translate()” method with the text to be translated.
  • Use the “lang_src” parameter to specify the language of the text.
  • Use the “lang_tgt” parameter to specify the language that the text will be translated into.

Translate a Pandas Data Frame Column to Another Language

Below, you can see the Python methodology for translating a column of a data frame into another language.

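translator = google_translator()  # create the translator once, outside the loop
for idx in dataframe.index[:row_number]:
    # write each translation back to the row it came from
    dataframe.loc[idx, "column"] = translator.translate(dataframe.loc[idx, "column"], lang_src="auto", lang_tgt="tr")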

If you want to translate a single column of a data frame, you can use the Python for loop above. The explanation of the loop is below.

  • Loop over the rows to be translated with a Python for loop.
  • Create the translator with “google_translator()”.
  • Call the “translate()” method for every row in the selected column.
  • Replace the original content of the data frame column with the translated text.

Translate a Pandas Data Frame Column without a For Loop

rows = dataframe.index[:row_number]  # labels of the rows to translate
dataframe.loc[rows, "column"] = dataframe.loc[rows, "column"].apply(lambda x: translator.translate(x, lang_tgt="tr"))

If you want to do the same thing without a for loop, you can pass a “lambda” function to the “apply” method of Pandas, as above.
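
Crawled columns often repeat the same value, and every call to the translator is a network request. One possible refinement, sketched below, translates each distinct value once and maps the results back onto the column. The names `translate_unique` and `fake_translate` are hypothetical; the stand-in function replaces `translator.translate` so that the sketch runs offline.

```python
import pandas as pd


def translate_unique(series, translate_fn):
    """Translate each distinct non-null value once, then map results back."""
    unique_values = series.dropna().unique()
    mapping = {value: translate_fn(value) for value in unique_values}
    # Values that are missing from the mapping (such as NaN) stay missing.
    return series.map(mapping)


calls = []


def fake_translate(text):
    """Offline stand-in for translator.translate(text, lang_tgt="tr")."""
    calls.append(text)
    return text.upper()


dataframe = pd.DataFrame({"column": ["egg", "boiled egg", "egg", None]})
dataframe["column"] = translate_unique(dataframe["column"], fake_translate)
print(dataframe["column"].tolist())
print(len(calls))  # 2 calls for 3 non-null rows
```

With the real package, the second argument could be something like `lambda x: translator.translate(x, lang_tgt="tr")`.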

Translate All Pandas Data Frame to Another Language with Python

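If you want to translate the entire data frame, you can use the code block below.

# applymap works cell by cell and assumes every cell holds text; in pandas 2.1+ it is named DataFrame.map
dataframe = dataframe.applymap(lambda x: translator.translate(x, lang_tgt="tr"))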

If you don’t pass “lang_src” to the “translator.translate()” method, the source language is detected automatically. To see which languages the “google_trans_new” package supports for translation, you can use the code below.

print(google_trans_new.LANGUAGES)

OUTPUT>>>

{'af': 'afrikaans', 'sq': 'albanian', 'am': 'amharic', 'ar': 'arabic', 'hy': 'armenian', 'az': 'azerbaijani', 'eu': 'basque', 'be': 'belarusian', 'bn': 'bengali', 'bs': 'bosnian', 'bg': 'bulgarian', 'ca': 'catalan', 'ceb': 'cebuano', 'ny': 'chichewa', 'zh-cn': 'chinese (simplified)', 'zh-tw': 'chinese (traditional)', 'co': 'corsican', 'hr': 'croatian', 'cs': 'czech', 'da': 'danish', 'nl': 'dutch', 'en': 'english', 'eo': 'esperanto', 'et': 'estonian', 'tl': 'filipino', 'fi': 'finnish', 'fr': 'french', 'fy': 'frisian', 'gl': 'galician', 'ka': 'georgian', 'de': 'german', 'el': 'greek', 'gu': 'gujarati', 'ht': 'haitian creole', 'ha': 'hausa', 'haw': 'hawaiian', 'iw': 'hebrew', 'he': 'hebrew', 'hi': 'hindi', 'hmn': 'hmong', 'hu': 'hungarian', 'is': 'icelandic', 'ig': 'igbo', 'id': 'indonesian', 'ga': 'irish', 'it': 'italian', 'ja': 'japanese', 'jw': 'javanese', 'kn': 'kannada', 'kk': 'kazakh', 'km': 'khmer', 'ko': 'korean', 'ku': 'kurdish (kurmanji)', 'ky': 'kyrgyz', 'lo': 'lao', 'la': 'latin', 'lv': 'latvian', 'lt': 'lithuanian', 'lb': 'luxembourgish', 'mk': 'macedonian', 'mg': 'malagasy', 'ms': 'malay', 'ml': 'malayalam', 'mt': 'maltese', 'mi': 'maori', 'mr': 'marathi', 'mn': 'mongolian', 'my': 'myanmar (burmese)', 'ne': 'nepali', 'no': 'norwegian', 'or': 'odia', 'ps': 'pashto', 'fa': 'persian', 'pl': 'polish', 'pt': 'portuguese', 'pa': 'punjabi', 'ro': 'romanian', 'ru': 'russian', 'sm': 'samoan', 'gd': 'scots gaelic', 'sr': 'serbian', 'st': 'sesotho', 'sn': 'shona', 'sd': 'sindhi', 'si': 'sinhala', 'sk': 'slovak', 'sl': 'slovenian', 'so': 'somali', 'es': 'spanish', 'su': 'sundanese', 'sw': 'swahili', 'sv': 'swedish', 'tg': 'tajik', 'ta': 'tamil', 'te': 'telugu', 'th': 'thai', 'tr': 'turkish', 'tk': 'turkmen', 'uk': 'ukrainian', 'ur': 'urdu', 'ug': 'uyghur', 'uz': 'uzbek', 'vi': 'vietnamese', 'cy': 'welsh', 'xh': 'xhosa', 'yi': 'yiddish', 'yo': 'yoruba', 'zu': 'zulu'}
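
When the target language comes from user input, it can help to validate it against this dictionary before calling the translator. The sketch below assumes a small hand-copied subset of the mapping above (`SUPPORTED`) and a hypothetical helper named `resolve_language`; neither is part of the package.

```python
# A hand-copied subset of the mapping printed above. In practice you could
# pass google_trans_new.LANGUAGES instead (same code -> name shape).
SUPPORTED = {"en": "english", "es": "spanish", "tr": "turkish", "de": "german"}


def resolve_language(value, languages=SUPPORTED):
    """Return the language code for a code or an English language name.

    resolve_language is a hypothetical helper, not part of google_trans_new.
    """
    key = value.strip().lower()
    if key in languages:                    # already a code such as "tr"
        return key
    for code, name in languages.items():    # a name such as "Turkish"
        if name == key:
            return code
    raise ValueError(f"Unsupported language: {value!r}")


print(resolve_language("Turkish"))  # tr
print(resolve_language("es"))       # es
```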

In the next sections, you will see a practical example of translating a Pandas data frame with Python. The sections cover the topics below.

  • How to crawl a website for translation purposes.
  • Why translated content shouldn’t be used for SEO purposes.
  • How to find new topic ideas from the translated content of a similar, foreign-language site.
  • How to analyze content structure from URLs in order to filter the URLs to crawl.
  • How to visualize the content structure of a website for translation purposes within a data frame.
  • How to translate a website’s content by loading it into a data frame.

How to Load a Website’s Content into a Data Frame for Translation with Python?

To load a website’s content into a data frame, you will need an SEO crawler. You can use OnCrawl, JetOctopus, Ahrefs, Screaming Frog, or SEMRush to crawl a website, but none of these crawlers will take the full content of a web page. They will take the on-page SEO tags, title tags, canonical tags, and more, but they won’t put all of the content of every page into a data frame column. To take a website’s content entirely, you can use Advertools’ “crawl()” function.

To learn how to crawl a website and perform an SEO analysis, you can read the related guide.

For translating a data frame, you will need a semantically consistent website. In other words, the website’s content should be of the same type so that the translation serves a clearer purpose. In this tutorial on Pandas data frame translation with Python, MyFitnessPal will be used. Below, you will see how to crawl MyFitnessPal.

Importing Necessary Python Modules and Libraries for Data Frame Translation via a Web Site

To crawl a website and load its content into a data frame for translation, you will need the Python libraries below.

  • Advertools, for collecting and crawling URLs
  • Requests, for checking the sitemap URLs
  • Pandas, for working with data frames
  • Plotly, for visualizing the output
  • Googletrans, for translating the data frame
  • Google_trans_new, to work around the bugs of Googletrans

import advertools as adv
import requests
import pandas as pd
import plotly.express as px
from google_trans_new import google_translator  
import google_trans_new

adv.__version__

OUTPUT>>>

0.10.7

Above, you can see the Python modules and libraries used to load a website’s content into a data frame for translation purposes.

How to Take and Crawl URLs of a Website for Extracting its Content into a Data Frame?

To crawl a website effectively, crawling only the URLs in the sitemap is more efficient than crawling the whole site, since the quality URLs will be in the sitemap. However, MyFitnessPal doesn’t have a single sitemap or a sitemap index file. In this case, we need to find all of the sitemap files and unite them. All of MyFitnessPal’s sitemap files follow the same URL pattern: domain name, “sitemap”, sitemap number, and file extension. Thus, we can use a simple Python for loop with the range function as below.

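list = []  # collects one data frame per sitemap (note: this name shadows Python's built-in "list")
for i in range(99):
    sitemap_url = f"https://www.myfitnesspal.com/sitemap-{i}.xml"
    r = requests.get(sitemap_url)
    if r.status_code == 200:  # only parse sitemaps that exist
        df = adv.sitemap_to_df(sitemap_url)
        list.append(df)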

OUTPUT >>>

2021-03-14 14:09:56,703 | INFO | sitemaps.py:361 | sitemap_to_df | Getting https://www.myfitnesspal.com/sitemap-0.xml
2021-03-14 14:10:00,173 | INFO | sitemaps.py:361 | sitemap_to_df | Getting https://www.myfitnesspal.com/sitemap-1.xml
2021-03-14 14:10:03,754 | INFO | sitemaps.py:361 | sitemap_to_df | Getting https://www.myfitnesspal.com/sitemap-2.xml
2021-03-14 14:10:07,474 | INFO | sitemaps.py:361 | sitemap_to_df | Getting https://www.myfitnesspal.com/sitemap-3.xml
2021-03-14 14:10:10,062 | INFO | sitemaps.py:361 | sitemap_to_df | Getting https://www.myfitnesspal.com/sitemap-4.xml
2021-03-14 14:10:12,615 | INFO | sitemaps.py:361 | sitemap_to_df | Getting https://www.myfitnesspal.com/sitemap-5.xml
2021-03-14 14:10:15,681 | INFO | sitemaps.py:361 | sitemap_to_df | Getting https://www.myfitnesspal.com/sitemap-6.xml

The explanation of the sitemap uniting for loop is below.

  • Create an empty list.
  • Start a for loop with the “range” function.
  • Use the “requests.get” method to request each sitemap.
  • Use an f-string to build the URL of every sitemap.
  • Use an if statement.
  • If the response code of the sitemap URL is 200, use the “adv.sitemap_to_df” method.
  • Append the resulting data frame to the list.
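
The URL-building and status-check steps can also be sketched on their own. In the sketch below, `existing_sitemaps` and `fake_status` are hypothetical names, and the status check is passed in as a function so that the example runs without network access.

```python
def existing_sitemaps(template, count, get_status):
    """Return the sitemap URLs whose status code is 200."""
    found = []
    for i in range(count):
        url = template.format(i=i)   # same URL pattern as in the loop above
        if get_status(url) == 200:
            found.append(url)
    return found


def fake_status(url):
    """Offline stand-in that pretends only sitemaps 0 to 2 exist."""
    number = int(url.rsplit("-", 1)[1].split(".")[0])
    return 200 if number < 3 else 404


urls = existing_sitemaps("https://www.myfitnesspal.com/sitemap-{i}.xml", 5, fake_status)
print(urls)
```

With the real libraries, `get_status` could be something like `lambda url: requests.get(url).status_code`, and each returned URL could then be passed to “adv.sitemap_to_df”.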

You need to unite all of the resulting data frames into one data frame so that all of the URLs to be crawled can be taken from a single data frame.

united_df = pd.concat(list)
united_df.to_csv("myfitnesspal_sitemap.csv", index=False)

Passing the “list” variable to “pd.concat()” unites all of the data frames in it. After that, you can use the “to_csv” method to write a permanent CSV output file that includes all URLs to be crawled.

united_df

You can see the result below.

Data Frame Translation
URLs to be crawled and translated to another language.

Analyzing the MyFitnessPal’s Semantic Content Network for Translation Purposes

MyFitnessPal is a website that lists the ingredients, calories, and nutrition of all types of food with a brief content structure. Each page has a short introduction, a simple HTML table for the nutrition of the food, and a few brief notes on the food’s ingredients. In other words, MyFitnessPal’s content is consistent and structured, which makes it well suited for translation. Since the content is brief and mostly an HTML table, the translation only has to deal with simple words and short sentences, which should improve the quality of the translation with Python.

For SEO, using translated content is not useful. This script and this guide to translating a Pandas data frame with Python are not meant for reusing another site’s content. They are for finding new topics, themes, and content ideas from counterpart sites in another language. Thus, analyzing MyFitnessPal’s content is important, because most of it is about “restaurants’ food recipes” or “food brands’ specific products”. Their content is also very specific to food types; they have more than 4000 recipes just for foods with eggs. This type of semantic content network can help you create a better and firmer topical map and content structure for your site.

We have 328772 URLs from MyFitnessPal’s sitemaps alone. Crawling all of these URLs would require a lot of resources, so you can analyze the content structure via the URLs and crawl only the section you need. Below, you will see another Advertools function for doing this, “url_to_df”.

paths = adv.url_to_df(united_df["loc"])
paths["dir_3"].str.split("-").explode().value_counts().to_frame().head(50)

You can see the result below.

Words in URLs
The most used words within URLs. Most of them are foods.

  • We have 33981 URLs that include the word “chicken”.
  • We have 22342 URLs just for salad recipes.
  • We have 10378 URLs just for beef recipes.
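
To make the word count concrete, here is a standard-library sketch of the same idea: take the last path segment of each URL, split it on hyphens, and count the pieces. The URLs are made-up examples rather than real MyFitnessPal pages, and `slug_word_counts` is a hypothetical helper.

```python
from collections import Counter
from urllib.parse import urlparse


def slug_word_counts(urls):
    """Count hyphen-separated words in the last path segment of each URL."""
    counts = Counter()
    for url in urls:
        segments = [s for s in urlparse(url).path.split("/") if s]
        if segments:
            counts.update(segments[-1].split("-"))
    return counts


example_urls = [  # hypothetical URLs for illustration only
    "https://example.com/food/calories/chicken-salad-123",
    "https://example.com/food/calories/grilled-chicken-456",
    "https://example.com/food/calories/egg-salad-789",
]
word_counts = slug_word_counts(example_urls)
print(word_counts.most_common(2))
```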

Now, you can understand what I mean by “new content ideas” from counterpart sites. We can visualize a website’s content structure via its URLs and Plotly as below.

How to Visualize a Website’s Content Structure with URL Breakdowns via Python (Plotly) for Translation Purposes

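# word counts from the "dir_3" URL segment, computed from the "paths" data frame created earlier
dir_3 = paths["dir_3"].str.split("-").explode().value_counts().head(50)
fig = px.bar(x=dir_3.index, y=dir_3.values, labels={"x": "word", "y": "count"}, title="Most Used Words in URLs' Last Section - MyFitnessPal", width=1500, height=800)
fig.update_xaxes(tickangle=35)
fig.show()

Most used words in URLs
Visualization of the most used words in MyFitnessPal URLs. They have more than 30000 URLs just for chicken.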

Filtering a Website’s URLs for Crawling and Translation Purposes

To filter a website’s URLs for crawling, we can use the “str.contains” method. For translating a data frame, we will use only the URLs that contain “egg”. To do this, we will filter all of the MyFitnessPal URLs that are about eggs as below.

pd.set_option("display.max_colwidth",255)
pd.set_option("display.max_rows",255)
united_df[united_df["loc"].str.contains("egg")]

Explanation of URL filtering is below.

  • We have set the maximum column width as 255.
  • We have increased the maximum row count to 255.
  • We have filtered and called all the URLs that include the word “egg” within them.

You can see the result below.

URLs with Eggs
We have more than 9000 URLs just for egg recipes.

egg_content = united_df[united_df["loc"].str.contains("egg")]

We have assigned all the URLs that contain the “egg” string to the “egg_content” variable.
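
One caveat: a plain substring match on “egg” also matches words such as “eggplant” or “veggie”. If that matters for your crawl, a stricter token match is one option. The sketch below uses only the standard library with made-up URLs; `EGG_TOKEN` and `has_egg_token` are hypothetical names.

```python
import re

# Match "egg" or "eggs" only when it is not glued to other letters.
EGG_TOKEN = re.compile(r"(?<![a-z])eggs?(?![a-z])", re.IGNORECASE)


def has_egg_token(url):
    """True when the URL contains 'egg' or 'eggs' as a standalone word."""
    return bool(EGG_TOKEN.search(url))


example_urls = [  # hypothetical URLs for illustration only
    "https://example.com/food/calories/scrambled-eggs-1",
    "https://example.com/food/calories/veggie-burger-2",
    "https://example.com/food/calories/eggplant-parmesan-3",
    "https://example.com/food/calories/egg-salad-4",
]
matches = [url for url in example_urls if has_egg_token(url)]
print(matches)
```

In Pandas, the same pattern could be passed to “str.contains” with `case=False`, since that method treats its argument as a regular expression by default.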

Crawling a Website’s URLs for Extracting its Content for Translation within a DataFrame

To crawl a website or a list of URLs, we can use Advertools’ “crawl” function as below.

adv.crawl(egg_content["loc"], "egg_content_myfitnesspal.jl", follow_links=False)

We have crawled all the URLs within the “egg_content” data frame so that we can extract their content for translating it.

egg_content_df = pd.read_json("egg_content_myfitnesspal.jl", lines=True)
egg_content_df.head(2)

We have used the “pd.read_json” function to read the output file that contains the crawl results. We have called the first two rows; you can see the result below.

Data frame translation with crawled URLs
We have extracted content, headings, canonicals, images, and much more, which created 42 different columns.

Auditing the Content of the Website for a Quality Dataframe Translation with Python

Translating a website’s content within a data frame can be useful for finding content ideas, but if the crawled URLs don’t have proper content or a valid status code, the translation will also be problematic. To prevent this, and to understand the website’s general content structure, it is useful to audit the content before translating the data frame.

egg_content_df["content_length"] = egg_content_df["body_text"].str.split(" ").str.len()
egg_content_df[["url", "content_length"]]

We have created a new column called “content_length” and used “str.split” on the “body_text” column to get the word count per URL.

Content length
Content length per URL for translation.
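
Note that splitting on a single space treats every space as a separator, so runs of spaces add empty items and line breaks are not split at all. A quick comparison in plain Python, using a made-up string:

```python
text = "Calories  in  egg\nsalad"   # made-up text with double spaces and a newline

by_single_space = text.split(" ")   # empty strings appear between double spaces
by_whitespace = text.split()        # splits on any run of whitespace

print(len(by_single_space))  # 5
print(len(by_whitespace))    # 4
```

In Pandas, calling “str.split()” without an argument gives the whitespace behaviour.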

We have 240-260 words per URL. We can check the word pattern frequencies as below.

adv.word_frequency(egg_content_df["body_text"], phrase_len=2).head(50)

We have used the “word_frequency” method with “phrase_len=2” and called the first 50 rows before translating the data frame so that we can see the translation material.

Word frequencies for translating a content
We see that most of the content is about nutrition, as we guessed. So, the translation is likely to be of higher quality.
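
To illustrate what a phrase length of 2 means, here is a rough standard-library sketch that counts two-word phrases. It is not how Advertools implements “word_frequency”, which does more than this; `bigram_counts` is a hypothetical helper and the texts are invented.

```python
from collections import Counter


def bigram_counts(texts):
    """Count adjacent word pairs (phrase length 2) across all texts."""
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        counts.update(" ".join(pair) for pair in zip(words, words[1:]))
    return counts


texts = [  # invented examples
    "Nutrition facts for boiled egg",
    "Nutrition facts for egg salad",
]
phrase_counts = bigram_counts(texts)
print(phrase_counts.most_common(3))
```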

We can also check the word frequencies of the titles to audit the translation material.

adv.word_frequency(egg_content_df["title"], phrase_len=1).head(50)

We have chosen “title” column instead of “body_text”.

Title word frequency
We see that “egg” and “food” are the most used words. Basically, the translation target is correct.

Before starting the translation, you can check all of the related columns for the translation process, because you might not want to translate content that is not about “eggs” or a URL with a non-200 status code.

egg_content_df.loc[:, ["url_redirected_to", "title", "status", "h1", "content_length", "body_text"]].head(60)

You can see the filtered data frame with URL, status code, title, h1, content length and body text information.

Translation dataframe
We will translate this data frame.

We have checked that the content of the website is suitable for translation. It is brief, and all the URLs, titles, heading tags, and paragraphs are suitable for a healthy bulk translation.
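
If an audit like this does turn up rows you don’t want to translate, you can filter them out first. A minimal sketch on an invented three-row frame is below; it assumes the same column names as the crawl output (“status”, “title”) and keeps only the rows with status 200 whose title mentions “egg”.

```python
import pandas as pd

# Invented rows that mimic a few columns of the crawl output.
crawl_df = pd.DataFrame({
    "url": ["https://example.com/a", "https://example.com/b", "https://example.com/c"],
    "status": [200, 404, 200],
    "title": ["Egg Salad Calories", "Boiled Egg Nutrition", "Chicken Soup Calories"],
})

is_ok = crawl_df["status"] == 200
about_eggs = crawl_df["title"].str.contains("egg", case=False, na=False)
to_translate = crawl_df[is_ok & about_eggs]

print(to_translate["url"].tolist())  # ['https://example.com/a']
```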

Translating a Website’s Content within a Data Frame via Python

In this section, you will learn how to translate a data frame with Python. We will use “google_translator()” as below.

translator = google_translator()  
translate_text = translator.translate('Hola mundo!', lang_src='es', lang_tgt='en')  

You should remember the “translator.translate()” method from the beginning of this data frame translation guide.

df_eg = egg_content_df.copy()

We have copied the data frame with the “copy” method because we don’t want to change the original data frame. You will see a practical translation example below.

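translator = google_translator()  # create the translator once, outside the loop
for idx in df_eg.index[:5]:
    # write each translation back to the row it came from
    df_eg.loc[idx, "body_text"] = translator.translate(df_eg.loc[idx, "body_text"], lang_src="auto", lang_tgt="tr")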

In the example above, we have translated only the first five rows of the “df_eg” data frame. You can see the translated data frame below. With the “lang_src” parameter set to “auto”, we let the translator detect the source language; with the “lang_tgt” parameter set to “tr”, we translate the content into Turkish.

In this example, I have also translated the titles together with the content into Turkish.

In the next step, we will use the “apply” method to translate the first 1000 rows of the data frame into Turkish.

rows = df_eg.index[:1000]  # labels of the first 1000 rows
df_eg.loc[rows, "body_text"] = df_eg.loc[rows, "body_text"].apply(lambda x: translator.translate(x, lang_tgt="tr"))
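
Translating a thousand rows means a thousand requests to a free web service, so occasional failures are likely. One hedge is to retry a failed call after a short pause and fall back to the original text if every attempt fails. The sketch below is an assumption about how you might do that, not part of “google_trans_new”; `translate_with_retry` and `flaky_translate` are hypothetical names, and the stand-in translator lets the example run offline.

```python
import time


def translate_with_retry(translate_fn, text, retries=3, delay=0.0):
    """Call translate_fn(text); on an exception, wait and try again.

    Returns the original text if every attempt fails.
    """
    for attempt in range(retries):
        try:
            return translate_fn(text)
        except Exception:
            if attempt < retries - 1:
                time.sleep(delay)
    return text


attempts = {"count": 0}


def flaky_translate(text):
    """Offline stand-in that fails twice, then 'translates' by upper-casing."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return text.upper()


print(translate_with_retry(flaky_translate, "boiled egg"))  # BOILED EGG
```

With the real translator, `translate_fn` could be something like `lambda t: translator.translate(t, lang_tgt="tr")`, with a `delay` of a second or more between attempts.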

After translation of the dataframe, you can output the translated dataframe as below.

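# head(1000) keeps exactly the first 1000 rows; .loc[:1000] would also include the row labeled 1000
df_eg[["title", "url", "h1", "body_text"]].head(1000).to_csv("myfitnesspal_egg_first_1000.csv")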

You can see the output file below; it is 1328 KB.

Translated dataframe output
Translated dataframe in the process’ folder as a CSV output.

Last Thoughts on Translating Dataframes with Python and Holistic SEO

Translating data frames with Python can be used to find new content ideas and to examine websites from different countries. If an SEO doesn’t know a foreign language, they won’t be able to examine the content structure of sites in that language, and most things will remain ambiguous. To prevent this, translating heading tags, titles, descriptions, content, or URLs can help an SEO understand sites from other cultures, countries, or geographies. Mainstream SEOs often assume that the whole planet lives in English, but this is not true. To understand SEO and search engines, help clients from other cultures, and examine the search engine’s SERP, we need to understand sites in other languages.

Google had its own website translation service before, but now it is not available.

You might think of using Chrome’s built-in translation, but it won’t let you analyze hundreds of thousands of URLs in bulk the way website translation with Python does. Translating a website’s content within a data frame with Python can help an SEO overcome language barriers and examine many more sites and, through them, search engine algorithms.
