NLTK Tokenize: How to Tokenize Words and Sentences with NLTK?

To tokenize words and sentences with NLTK, the “word_tokenize()” and “sent_tokenize()” functions of “nltk.tokenize” are used. NLTK tokenization parses a large amount of textual data into smaller parts so that the characteristics of the text can be analyzed. Tokenization with NLTK can be used for training machine learning models and for text cleaning in Natural Language Processing. The words and sentences tokenized with NLTK can be turned into a data frame and vectorized. Natural Language Tool Kit (NLTK) tokenization is usually combined with punctuation cleaning, text cleaning, stemming, lemmatization, and vectorization of the parsed text data before machine learning algorithm training.

The Natural Language Tool Kit Python library has a tokenization package called “tokenize”. In the “tokenize” package of NLTK, there are two main tokenization functions.

  • “word_tokenize” is to tokenize words.
  • “sent_tokenize” is to tokenize sentences.

How to Tokenize Words with Natural Language Tool Kit (NLTK)?

Tokenization of words with NLTK means parsing a text into the words via Natural Language Tool Kit. To tokenize words with NLTK, follow the steps below.

  • Import the “word_tokenize” from the “nltk.tokenize”.
  • Load the text into a variable.
  • Use the “word_tokenize” function for the variable.
  • Read the tokenization result.

Below, you can see a tokenization example with NLTK for a text.

from nltk.tokenize import word_tokenize
text = "Search engine optimization is the process of improving the quality and quantity of website traffic to a website or a web page from search engines. SEO targets unpaid traffic rather than direct traffic or paid traffic."
print(word_tokenize(text))

>>>OUTPUT

['Search', 'engine', 'optimization', 'is', 'the', 'process', 'of', 'improving', 'the', 'quality', 'and', 'quantity', 'of', 'website', 'traffic', 'to', 'a', 'website', 'or', 'a', 'web', 'page', 'from', 'search', 'engines', '.', 'SEO', 'targets', 'unpaid', 'traffic', 'rather', 'than', 'direct', 'traffic', 'or', 'paid', 'traffic', '.']

The explanation of the tokenization example above can be seen below.

  • The first line is for importing the “word_tokenize” function.
  • The second line of code is to provide text data for tokenization.
  • The third line of code prints the output of “word_tokenize”.

What are the advantages of word tokenization with NLTK?

The benefits of word tokenization with NLTK overlap with the benefits of White Space Tokenization, Dictionary Based Tokenization, Rule-Based Tokenization, Regular Expression Tokenization, Penn Treebank Tokenization, Spacy Tokenization, Moses Tokenization, and Subword Tokenization. Every type of word tokenization is a part of the text normalization process. Normalizing the text with stemming and lemmatization improves the accuracy of language understanding algorithms. The benefits and advantages of word tokenization with NLTK can be found below.

  • Removing the stop words from the tokenized corpora easily.
  • Splitting words into sub-words for understanding the text better.
  • Removing ambiguity from the text is faster and requires less coding with NLTK.
  • Besides White Space Tokenization, Dictionary Based and Rule-based Tokenization can be implemented easily.
  • Preparing text for Byte Pair Encoding, Word Piece Encoding, Unigram Language Model, and Sentence Piece Encoding is easier with NLTK.
  • NLTK has TweetTokenizer for tokenizing tweets that include emojis and other Twitter norms.
  • NLTK has PunktSentenceTokenizer, which has pre-trained models for tokenization in multiple European languages.
  • NLTK has the Multi Word Expression Tokenizer (MWETokenizer) for tokenizing multi-word expressions such as “in spite of”.
  • NLTK has RegexpTokenizer to tokenize text based on regular expressions.

How to Tokenize Sentences with Natural Language Tool Kit (NLTK)?

To tokenize the sentences with Natural Language Tool kit, the steps below should be followed.

  • Import the “sent_tokenize” from “nltk.tokenize”.
  • Load the text for sentence tokenization into a variable.
  • Use the “sent_tokenize” for the specific variable.
  • Print the output.

Below, you can see an example of NLTK Tokenization for sentences.

from nltk.tokenize import sent_tokenize
text = "God is Great! I won a lottery."
print(sent_tokenize(text))

Output: ['God is Great!', 'I won a lottery.']

In the code block above, the text is tokenized into sentences. Collecting all of the sentences in a list with NLTK sentence tokenization makes it possible to see which sentence is connected to which one, the average word count per sentence, and the unique sentence count.
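The sentence-level statistics mentioned above can be sketched as follows. This is a minimal illustration, not a definitive method: the “sentences” list is a hypothetical stand-in for a “sent_tokenize” result (hard-coded so that the sketch runs without NLTK data), and the word counts use plain whitespace splitting instead of “word_tokenize”.

```python
# Hypothetical stand-in for the output of sent_tokenize(); hard-coded on purpose.
sentences = ["God is Great!", "I won a lottery.", "God is Great!"]

# Rough word count per sentence via whitespace splitting (an assumption, not word_tokenize).
words_per_sentence = [len(sentence.split()) for sentence in sentences]

# Average word count per sentence and number of distinct sentences.
average_words = sum(words_per_sentence) / len(sentences)
unique_sentence_count = len(set(sentences))

print(words_per_sentence)
print(round(average_words, 2))
print(unique_sentence_count)
```

With a real corpus, the hard-coded list would be replaced by the list returned from “sent_tokenize”.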

What are the advantages of sentence tokenization with NLTK?

The advantages of sentence tokenization with NLTK are listed below.

  • NLTK provides a chance to perform text data mining for sentences.
  • NLTK sentence tokenization makes it possible to compare different text corpora at the sentence level.
  • Sentence tokenization with NLTK shows how many sentences are used in different sources of text such as websites, books, and papers.
  • Thanks to the NLTK “sent_tokenize” function, it is possible to see how the sentences are connected to each other, and with which bridge words.
  • Via the NLTK sentence tokenizer, performing an overall sentiment analysis for the sentences is possible.
  • Performing Semantic Role Labeling for the sentences to understand how the sentences are connected to each other is one of the benefits of NLTK sentence tokenization.

How to perform Regex Tokenization with NLTK?

Regex Tokenization with NLTK is tokenization based on regex rules. Regex Tokenization via NLTK can be used for extracting certain phrase patterns from a corpus. To perform regex tokenization with NLTK, the “RegexpTokenizer” class from “nltk.tokenize” should be used. An example of regex tokenization with NLTK is below.

from nltk.tokenize import RegexpTokenizer
# gaps=True means the pattern matches the separators, so the text is split at every question mark.
regex_tokenizer = RegexpTokenizer(r'\?', gaps=True)
text = "How to perform Regex Tokenization with NLTK? To perform regex tokenization with NLTK, the regex pattern should be chosen."
regex_tokenization = regex_tokenizer.tokenize(text)
print(regex_tokenization)

OUTPUT >>>

['How to perform Regex Tokenization with NLTK', ' To perform regex tokenization with NLTK, the regex pattern should be chosen.']

The Regex Tokenization example with NLTK demonstrates how to take a question sentence and the sentence after it. By taking the sentences that end with a question mark together with the sentences that follow them, it is possible to match questions with answers, or to extract the question formats from a corpus.
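A second way to use “RegexpTokenizer” is to describe the tokens themselves instead of the gaps between them. The sketch below is only an illustration under assumptions: the sample text and the two patterns are chosen for this example and are not part of the original tutorial.

```python
from nltk.tokenize import RegexpTokenizer

text = "How to perform Regex Tokenization with NLTK? Choose a pattern."

# gaps=False is the default: the pattern matches the tokens, here runs of word characters.
word_tokenizer = RegexpTokenizer(r"\w+")
words = word_tokenizer.tokenize(text)

# Keep only the spans that end with a question mark (illustrative pattern).
question_tokenizer = RegexpTokenizer(r"[^.?!]+\?")
questions = question_tokenizer.tokenize(text)

print(words)
print(questions)
```

The first tokenizer drops all punctuation, while the second one returns only the question, which fits the question extraction use case described above.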

How to perform Rule-based Tokenization with NLTK?

Rule-based Tokenization is tokenization based on certain rules that are generated from certain conditions. NLTK has three different rule-based tokenization algorithms: TweetTokenizer for Twitter Tweets, MWETokenizer for multi-word expressions, and TreebankWordTokenizer for English language rules. Rule-based Tokenization is helpful for performing the tokenization based on the best possible conditions for the nature of the textual data.

An example of Rule-based tokenization with MWETokenizer for multi-word tokenization can be seen below.

from nltk.tokenize import MWETokenizer, word_tokenize

sentence = "I have sent Steven Nissen to the new reserch center for the nutritional value of the coffee. This sentence will be tokenized while Mr. Steven Nissen is on the journey. The #truth will be learnt. And, it's will be well known thanks to this tokenization example."

tokenizer = MWETokenizer()

tokenizer.add_mwe(("Steven", "Nissen"))

result = tokenizer.tokenize(word_tokenize(sentence))
result 

OUTPUT >>>

['I',
 'have',
 'sent',
 'Steven_Nissen',
 'to',
 'the',
 'new',
 'reserch',
 'center',
 'for',
 'the',
 'nutritional',
 'value',
 'of',
 'the',
 'coffee',
 '.',
 'This',
 'sentence',
 'will',
 'be',
 'tokenized',
 'while',
 'Mr.',
 'Steven_Nissen',
 'is',
 'on',
 'the',
 'journey',
 '.',
 'The',
 '#',
 'truth',
 'will',
 'be',
 'learnt',
 '.',
 'And',
 ',',
 'it',
 "'s",
 'will',
 'be',
 'well',
 'known',
 'thanks',
 'to',
 'this',
 'tokenization',
 'example',
 '.']
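MWETokenizer joins the parts of a registered multi-word expression into one token with a separator, which is “_” by default. The short sketch below illustrates this on a small hypothetical sentence; whitespace splitting is used instead of “word_tokenize” only so that the sketch runs without NLTK data.

```python
from nltk.tokenize import MWETokenizer

# Register the expressions up front or later via add_mwe(); "_" is the default separator.
tokenizer = MWETokenizer([("Steven", "Nissen")], separator="_")
tokenizer.add_mwe(("in", "spite", "of"))

# Hypothetical sample sentence, pre-split on whitespace.
tokens = "Steven Nissen left in spite of the rain".split()
merged = tokenizer.tokenize(tokens)

print(merged)
```

Passing a different “separator” value, such as a space, keeps the merged expression readable as a phrase.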

An example of Rule-based tokenization with TreebankWordTokenizer for English language text can be seen below.

from nltk.tokenize import TreebankWordTokenizer

sentence = "I have sent Steven Nissen to the new reserch center for the nutritional value of the coffee. This sentence will be tokenized while Mr. Steven Nissen is on the journey. The #truth will be learnt. And, it's will be well known thanks to this tokenization example."

tokenizer = TreebankWordTokenizer()

result = tokenizer.tokenize(sentence)
result 

OUTPUT >>>

['I',
 'have',
 'sent',
 'Steven',
 'Nissen',
 'to',
 'the',
 'new',
 'reserch',
 'center',
 'for',
 'the',
 'nutritional',
 'value',
 'of',
 'the',
 'coffee.',
 'This',
 'sentence',
 'will',
 'be',
 'tokenized',
 'while',
 'Mr.',
 'Steven',
 'Nissen',
 'is',
 'on',
 'the',
 'journey.',
 'The',
 '#',
 'truth',
 'will',
 'be',
 'learnt.',
 'And',
 ',',
 'it',
 "'s",
 'will',
 'be',
 'well',
 'known',
 'thanks',
 'to',
 'this',
 'tokenization',
 'example',
 '.']

An example of Rule-based tokenization with TweetTokenizer for Twitter Tweets’ tokenization can be seen below.

from nltk.tokenize import TweetTokenizer

sentence = "I have sent Steven Nissen to the new reserch center for the nutritional value of the coffee. This sentence will be tokenized while Mr. Steven Nissen is on the journey. The #truth will be learnt. And, it's will be well known thanks to this tokenization example."

tokenizer = TweetTokenizer()

result = tokenizer.tokenize(sentence)
result 

OUTPUT>>>

['I',
 'have',
 'sent',
 'Steven',
 'Nissen',
 'to',
 'the',
 'new',
 'reserch',
 'center',
 'for',
 'the',
 'nutritional',
 'value',
 'of',
 'the',
 'coffee',
 '.',
 'This',
 'sentence',
 'will',
 'be',
 'tokenized',
 'while',
 'Mr',
 '.',
 'Steven',
 'Nissen',
 'is',
 'on',
 'the',
 'journey',
 '.',
 'The',
 '#truth',
 'will',
 'be',
 'learnt',
 '.',
 'And',
 ',',
 "it's",
 'will',
 'be',
 'well',
 'known',
 'thanks',
 'to',
 'this',
 'tokenization',
 'example',
 '.']

The most basic rule-based type of word tokenization is white space tokenization. White-space tokenization simply uses the spaces between words as token boundaries. White-space tokenization can be performed with the “split(" ")” string method as below.



sentence = "I have sent Steven Nissen to the new reserch center for the nutritional value of the coffee. This sentence will be tokenized while Mr. Steven Nissen is on the journey. The #truth will be learnt. And, it's will be well known thanks to this tokenization example."


result = sentence.split(" ")
result 

OUTPUT >>>

['I',
 'have',
 'sent',
 'Steven',
 'Nissen',
 'to',
 'the',
 'new',
 'reserch',
 'center',
 'for',
 'the',
 'nutritional',
 'value',
 'of',
 'the',
 'coffee.',
 'This',
 'sentence',
 'will',
 'be',
 'tokenized',
 'while',
 'Mr.',
 'Steven',
 'Nissen',
 'is',
 'on',
 'the',
 'journey.',
 'The',
 '#truth',
 'will',
 'be',
 'learnt.',
 'And,',
 "it's",
 'will',
 'be',
 'well',
 'known',
 'thanks',
 'to',
 'this',
 'tokenization',
 'example.']
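One detail of white space tokenization is worth a small sketch: “split(" ")” splits on every single space, while “split()” without an argument splits on any run of whitespace. The sample text below is a hypothetical string with a double space and a line break.

```python
# Hypothetical text with a double space and a newline.
text = "The  #truth will\nbe learnt."

# Splitting on a single space keeps an empty string and does not split at the newline.
single_space_tokens = text.split(" ")

# Splitting without an argument treats any run of whitespace as one separator.
whitespace_tokens = text.split()

print(single_space_tokens)
print(whitespace_tokens)
```

For crawled website content, which often contains irregular spacing, the argument-free form is usually the safer choice.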

Among the NLTK tokenization methods, there are other tokenizers such as PunktSentenceTokenizer for detecting sentence boundaries, and punctuation-based tokenization, such as WordPunctTokenizer, for separating punctuation from words.
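Punctuation-based tokenization can be illustrated with “WordPunctTokenizer” from “nltk.tokenize”, which separates runs of word characters from runs of punctuation. The sample sentence below is an assumption made for this sketch.

```python
from nltk.tokenize import WordPunctTokenizer

# Every run of punctuation becomes its own token, so "it's" is split into three parts.
tokens = WordPunctTokenizer().tokenize("And, it's the #truth.")

print(tokens)
```

Compared with the TweetTokenizer output above, the hashtag and the contraction are broken apart here, which shows why the tokenizer should be chosen according to the nature of the text.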

How to use Lemmatization with NLTK Tokenization?

To use lemmatization with NLTK Tokenization, the “nltk.stem.wordnet.WordNetLemmatizer” should be used. WordNetLemmatizer from NLTK lemmatizes the words within the text. Word lemmatization is the process of turning a word into its dictionary form (lemma). Unlike stemming, which only strips suffixes, lemmatization uses a vocabulary to resolve the morphological variations of a word. NLTK Lemmatization is useful to see a word’s context and understand which words are actually the same during the word tokenization. Below, you will see an example code block for word tokenization and lemmatization with NLTK.

from collections import Counter

import pandas as pd
from nltk.stem.wordnet import WordNetLemmatizer

# "tokens" is the word_tokenize output created in the website tokenization section below.
lemmatize = WordNetLemmatizer()

lemmatized_words = []

for w in tokens:
    rootWord = lemmatize.lemmatize(w)
    lemmatized_words.append(rootWord)

counts_lemmatized_words = Counter(lemmatized_words)
df_tokenized_lemmatized_words = pd.DataFrame.from_dict(counts_lemmatized_words, orient="index").reset_index()
df_tokenized_lemmatized_words.sort_values(by=0, ascending=False, inplace=True)
df_tokenized_lemmatized_words[:50]

The explanation of the NLTK Tokenization and Lemmatization example code block is below.

  • The “nltk.stem.wordnet” is called for importing WordNetLemmatizer.
  • It is assigned to a variable which is “lemmatize”.
  • An empty list is created for the “lemmatized_words”.
  • A for loop is created for lemmatizing every word within the tokenized words with NLTK.
  • The lemmatized and tokenized words are appended to the “lemmatized_words” list.
  • The counter object has been used for counting them.
  • The data frame has been created with the lemmatized and tokenized word counts, then sorted and called.

You can see the lemmatization and tokenization with the NLTK example result below.

NLTK Lemmatization and Stemming
NLTK Tokenization and Stemming Output for text

The NLTK Tokenization and Lemmatization stats will be different than the NLTK Tokenization and Stemming. These differences will reflect their methodological differences for the statistical analysis for tokenized textual data with NLTK.

How to use Stemming with NLTK Tokenization?

To use stemming with NLTK Tokenization, the “PorterStemmer” from “nltk.stem” should be imported. Stemming is reducing words to their stem forms. Stemming can be useful for a better NLTK Word Tokenization analysis since many words carry suffixes. Via NLTK Stemming, the words that come from the same root can be counted as the same. Seeing which words are used once their suffixes are removed creates a more comprehensive look at the statistical counts of the concepts and phrases within a text. An example of stemming with NLTK Tokenization is below.

from nltk.stem import PorterStemmer

ps = PorterStemmer()

stemmed_words = []

for w in tokens:
    rootWord = ps.stem(w)
    stemmed_words.append(rootWord)

OUTPUT>>>

['think',
 'like',
 'seo',
 ',',
 'code',
 'like',
 'develop',
 'python',
 'seo',
 'techseo',
 'theoret',
 'seo',
 'on-pag',
 'seo',
 'pagespe',
 'UX',
 'market',
 'think',
 'like',
 'seo',
 ',',
 'code',
 'like',
 'develop',
 'main',
 ...,

 'in',
 'bulk',
 'with',
 'python',
 '.',
 ...]
counts_stemmed_words = Counter(stemmed_words)
df_tokenized_stemmed_words = pd.DataFrame.from_dict(counts_stemmed_words, orient="index").reset_index()
df_tokenized_stemmed_words.sort_values(by=0, ascending=False)
df_tokenized_stemmed_words
          index      0
0         think    529
1          like   1059
2           seo   5389
3             ,  22564
4          code   1128
...         ...    ...
10342    pixel.      1
10343  success.      1
10344    pages.      1
10345     free…      1
10346   almost.      1
10347 rows × 2 columns
NLTK Stemming and Tokenization Output Table.
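How stemming lets words from the same root be counted together can be sketched on a few words. The word list below is a small hypothetical sample rather than the crawled corpus, and the expected stems follow the stems visible in the output above (“develop”, “market”).

```python
from collections import Counter

from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Hypothetical sample; PorterStemmer lowercases its input by default.
words = ["Developer", "developing", "Marketing", "market"]
stems = [ps.stem(word) for word in words]

# Words that share a root now fall into the same bucket.
stem_counts = Counter(stems)

print(stems)
print(stem_counts)
```

Four different surface forms collapse into two stems, which is the effect that makes the stemmed count table shorter than the raw token table.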

How to Tokenize Content of a Website via NLTK?

To tokenize the content of a website with NLTK at the word and sentence level, the steps below should be followed.

  • Crawling the website’s content.
  • Extracting the website’s content from the crawl output.
  • Using the “word_tokenize” of NLTK for word tokenization.
  • Using “sent_tokenize” of NLTK for sentence tokenization.

Interpreting and comparing the output of the tokenization of a website provides benefits for the overall evaluation of the content of a website. Below, you will see a website content tokenization example. To perform NLTK Tokenization with a website’s content, the Python libraries below should be used.

  • Advertools
  • Pandas
  • NLTK
  • Collections
  • String

Below, you will see the importing process of the necessary libraries and functions for NLTK tokenization from Python for SEO.

import advertools as adv
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords  # used in the stop word sections below
from collections import Counter
from nltk.tokenize import RegexpTokenizer
import string 


adv.crawl("https://www.holisticseo.digital", "output.jl", custom_settings={"LOG_FILE":"output.log", "DOWNLOAD_DELAY":0.5}, follow_links=True)
df = pd.read_json("output.jl", lines=True)

for i in df.columns:
     if "text" in i:
          print(i)

content_of_website = df["body_text"].str.split().explode().str.cat(sep=" ")

tokens = word_tokenize(content_of_website)

tokenized_counts = Counter(tokens)

df_tokenized = pd.DataFrame.from_dict(tokenized_counts, orient="index").reset_index()

df_tokenized.nunique()

df_tokenized

To crawl the website’s content to perform an NLTK word and sentence tokenization, the Advertools’ “crawl” function will be used to take all of the content of the website into a “jl” extension file. Below, you will see an example of crawling a website with Python.


adv.crawl("https://www.holisticseo.digital", "output.jl", custom_settings={"LOG_FILE":"output.log", "DOWNLOAD_DELAY":0.5}, follow_links=True)
df = pd.read_json("output.jl", lines=True)

In the first line, we have started the crawling process of the website, while in the second line we have read the “jl” file into a data frame. The crawl output of the website is the “output.jl” file from the code block.

In the third step, the website’s content should be found within the data frame. To do that, a for loop that filters the data frame columns for names such as “body_text” is necessary. To find them, we will check whether each column name contains the string “text”.

for i in df.columns:
     if "text" in i:
          print(i)

At the next step of NLTK Tokenization for website content, we will use the Pandas library’s “str.cat” method to unite all of the content pieces across different web pages.

content_of_website = df["body_text"].str.split().explode().str.cat(sep=" ")
Unification of String
Example output of a website’s content as a concatenated string.

The “content_of_website” variable holds the united content corpus of the website, joined with a space through the “sep” parameter, which decreases the computation need. Instead of performing NLTK Tokenization for every web page’s content separately and then uniting all of the tokenized outputs, uniting all of the content pieces first and then tokenizing the united content once saves time and energy. At the next step, “word_tokenize” will be performed and the output of the tokenization process will be assigned to a variable.

tokens = word_tokenize(content_of_website)

To see the tokenized words and their counts, the “Counter” from “collections” can be used as below.

tokenized_counts = Counter(tokens)
df_tokenized = pd.DataFrame.from_dict(tokenized_counts, orient="index").reset_index()

“tokenized_counts = Counter(tokens)” counts how many times each token occurs. In the second line, the “from_dict” and “reset_index” methods of Pandas have been used to create a data frame. Thanks to NLTK tokenization, the “unique word count” of a website can be found as below.

df_tokenized.nunique()

OUTPUT>>>

index    18056
0          471
dtype: int64

The “holisticseo.digital” has 18056 unique words within its content. These words can be seen below.

df_tokenized.sort_values(by=0, ascending=False, inplace=True)
df_tokenized

Below, you can see the tokenization of words as an image.

Sort NLTK Tokenized Counts

If the image of the word tokenization output is not clear for you, you can check the table of the word tokenization output below.

                index      0
23                the  31354
26                  .  23012
3                   ,  22564
36                and  12812
22                 of  12349
...               ...    ...
14747             NEL      1
14748             CSE      1
14749  recipe-related      1
17753            Plan      1
18055         almost.      1
18056 rows × 2 columns
NLTK Tokenization Table Output
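The counting and sorting steps above can also be sketched with the “Counter” object alone, without a data frame. The token list below is a small hypothetical sample, not the crawled corpus.

```python
from collections import Counter

# Hypothetical stand-in for the output of word_tokenize().
tokens = ["the", "SEO", ",", "the", "content", "SEO", "the", "."]

tokenized_counts = Counter(tokens)

# The number of distinct tokens corresponds to the "unique word count".
unique_token_count = len(tokenized_counts)

# most_common() returns (token, count) pairs sorted from high to low.
top_two = tokenized_counts.most_common(2)

print(unique_token_count)
print(top_two)
```

This is convenient for a quick look at the most frequent tokens, while the data frame remains more practical for plotting and further analysis.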

After the word tokenization with NLTK for website content, we see that the words from the header and footer appear more along with the stop words. The visualization of the counted word tokenization can be done as below.

df_tokenized.sort_values(by=0, ascending=False, inplace=True)
df_tokenized[:30].plot(kind="bar",x="index", orientation="vertical", figsize=(15,10), xlabel="Tokens", ylabel="Count", colormap="viridis", table=False, grid=True, fontsize=15, rot=35, position=1, title="Token Counts from a Website Content with Punctuation", legend=True).legend(["Tokens"], loc="lower left", prop={"size":15})

Below, you can see the output of the code block for visualizing the word tokenization output.

NLTK Tokenization Visualization.

To save the word tokenization output’s barplot as a PNG, you can use the code block below.

df_tokenized[:30].plot(kind="bar",x="index", orientation="vertical", figsize=(15,10), xlabel="Tokens", ylabel="Count", colormap="viridis", table=False, grid=True, fontsize=15, rot=35, position=1, title="Token Counts from a Website Content with Punctuation", legend=True).legend(["Tokens"], loc="lower left", prop={"size":15}).figure.savefig("word-tokenization-2.png")

The word “the” appears more than 30000 times, and some punctuation characters are also included within the word tokenization results. It appears that the words “content” and “Google” are the most frequent words besides the punctuation characters and the stop words. To have more insight when it comes to the word tokenization, the “TF-IDF Analysis with Python” can help to understand a word’s weight within a corpus. To create a better insight for SEO and content analysis via NLTK tokenization, the stop words should be removed.

How to Filter out the Stop Words for Tokenization with NLTK?

To remove the stop words from the output of the NLTK Tokenization process, the tokens should be filtered with a list comprehension or a normal for loop. An example of NLTK Tokenization by removing the stop words can be seen below.

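from nltk.corpus import stopwords
stop_words_english = set(stopwords.words("english"))
# The new Series is aligned on index labels, so its values do not line up row by row with the "index" column.
df_tokenized["filtered_tokens"] = pd.Series([w for w in df_tokenized["index"] if w.lower() not in stop_words_english])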

To filter out the stop words during the word tokenization, text cleaning methods should be used. To clean the stop words, the “stopwords.words("english")” method from NLTK can be used. In the code block above, the English stop words are assigned to the “stop_words_english” variable. Then, a new column is created within the “df_tokenized” data frame, which uses a list comprehension with the “pd.Series”. Basically, we take every tokenized word and keep it only if its lowercase form is not in the stop words list. The “filtered_tokens” column doesn’t include any of the stop words.
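Whether the comparison is made on the lowercase form of a token changes the result of the stop word filter. The sketch below illustrates the difference; the stop word set is a small hypothetical subset that stands in for the NLTK English list so that the sketch runs without NLTK data.

```python
# Hypothetical subset of English stop words, standing in for stopwords.words("english").
stop_words_sample = {"the", "is", "of", "a", "an", "and"}

tokens = ["The", "process", "of", "improving", "the", "quality", "is", "SEO"]

# Case-insensitive filter: "The" is removed together with "the".
filtered_case_insensitive = [w for w in tokens if w.lower() not in stop_words_sample]

# Case-sensitive filter: the capitalized "The" survives.
filtered_case_sensitive = [w for w in tokens if w not in stop_words_sample]

print(filtered_case_insensitive)
print(filtered_case_sensitive)
```

A case-sensitive filter explains why capitalized forms such as “The” or “An” can still appear in count tables after the stop words are removed.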

How to Count Tokenized Words with NLTK without Stop Words?

To count the tokenized words with NLTK without the stop words, a list comprehension that filters out the stop words should be used over the tokenized output.

To count tokenized words with NLTK by subtracting the stop words in English, the “Counter” object will be used over the “tokens_without_stop_words” list, which is created with the list comprehension “[word for word in tokens if not word in stopwords.words("english")]”.

Below, you can see a code block to count the tokenized words and their output.

tokens_without_stop_words = [word for word in tokens if word not in stop_words_english]
tokenized_counts_without_stop_words = Counter(tokens_without_stop_words)
tokenized_counts_without_stop_words

OUTPUT>>>

Counter({'Think': 381,
         'like': 989,
         'SEO': 5172,
         ',': 22564,
         'Code': 467,
         'Developer': 405,
         'Python': 1583,
         'TechSEO': 733,
         'Theoretical': 700,
         'On-Page': 670,
         'PageSpeed': 645,
         'UX': 699,
         'Marketing': 1086,
         'Main': 231,
         'Menu': 192,
         'X-Default': 232,
         'value': 370,
         'hreflang': 219,
         'attribute': 178,
         'link': 509,
         'tag': 314,
         '.': 23012,
         'An': 198,
         'specify': 42,
         'alternate': 109,
         ...,

         'plan': 6,
         'reward': 3,
         'high-quality': 34,
         'due': 82,
         'back': 49,
         ...})

The next step is creating a data frame from the Counter Object via the “from_dict” method of the “pd.DataFrame”.

df_tokenized_without_stopwords = pd.DataFrame.from_dict(tokenized_counts_without_stop_words, orient="index").reset_index()
df_tokenized_without_stopwords

Below, you can see the output.

Count Word Tokenization without Stop Words
Counting tokenized words without stop words with NLTK.

Below, you can see the table output of the tokenization with NLTK without stop words in English.

            index      0
0           Think    381
1            like    989
2             SEO   5172
3               ,  22564
4            Code    467
...           ...    ...
17910    success.      1
17911      pages.      1
17912  infinitely      1
17913       free…      1
17914     almost.      1
17915 rows × 2 columns
NLTK Tokenization without stop words output.

Even if the stop words are removed from the text, the punctuation still exists. To clean the textual data completely for a healthier word tokenization process with NLTK, the punctuation should be cleaned as well. Below, you will see the sorted version of the word tokenization with NLTK output without stop words.

df_tokenized_without_stopwords.sort_values(by=0, ascending=False, inplace=True)
df_tokenized_without_stopwords

You can see the output of word tokenization with NLTK as an image.

NLTK Word Tokenization Result

The table output of the word tokenization with NLTK without stop words and sorted values is below.

               index      0
21                 .  23012
3                  ,  22564
2985639
2965623
2                SEO   5172
...              ...    ...
12618           gzip      1
6968          exited      1
6969          seduce      1
6970   collaborating      1
17914        almost.      1
17915 rows × 2 columns

How to visualize the Word Tokenization with NLTK without the Stop Words?

To visualize the NLTK Word Tokenization without the stop words, the “plot” method of Pandas Python Library should be used. Below, you can see an example visualization of the word tokenization with NLTK without stop words.

df_tokenized_without_stopwords[:30].plot(kind="bar",x="index", orientation="vertical", figsize=(15,10), xlabel="Tokens", ylabel="Count", colormap="viridis", table=False, grid=True, fontsize=15, rot=35, position=1, title="Token Counts from a Website Content with Punctuation", legend=True).legend(["Tokens"], loc="lower left", prop={"size":15})

You can see the output as an image below.

word tokenization nltk without stopwords

The effect of the punctuations is more evident within the visualization of the word tokenization via NLTK without stop words.

How to Calculate the Effect of the Stop Words for the Length of the Corpora?

The corpora length represents the total word count of the textual data. To calculate the effect of the stop words for the length of the corpora, the stop words count should be subtracted from the total word counts. Below, you can see an example for the calculation of the total stop word count effect for the length of the corpora.

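stop_words_english = set(stopwords.words("english"))  # build the set once
tokens_without_stop_words = [word for word in tokens if word not in stop_words_english]
# content_of_website is a string, so len() returns its character length; len(tokens) would give the token count.
print(len(content_of_website))
len(content_of_website) - len(tokens_without_stop_words)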

OUTPUT>>>
307199
268147

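The code prints 307199 and 268147. Because “content_of_website” is a single string, “len(content_of_website)” returns its length in characters rather than a word count, and the second number is that length minus the number of tokens that remain after the stop words are removed. To measure the effect of the stop words on the word count directly, “len(tokens)” should be compared with “len(tokens_without_stop_words)”. 18056 of the tokens are unique. The unique word count can demonstrate a website’s potential query count since every different word is a representational score for relevance to a topic, or concept. Every unique word and n-gram can give a better chance to be relevant to a concept, or phrase in the search bar.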
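The effect of the stop words on the length of a corpus can be sketched on token counts as below. The helper name “stop_word_effect”, the sample tokens, and the sample stop word set are all hypothetical and exist only for this illustration.

```python
def stop_word_effect(tokens, stop_words):
    """Return (total, kept, removed, removed_share) for a list of tokens."""
    kept_tokens = [t for t in tokens if t.lower() not in stop_words]
    total = len(tokens)
    removed = total - len(kept_tokens)
    # Guard against an empty token list.
    removed_share = removed / total if total else 0.0
    return total, len(kept_tokens), removed, removed_share


# Hypothetical sample data.
sample_tokens = ["SEO", "is", "the", "process", "of", "improving", "the", "quality"]
sample_stop_words = {"is", "the", "of"}

total, kept, removed, share = stop_word_effect(sample_tokens, sample_stop_words)
print(total, kept, removed, share)
```

In this sample, half of the tokens are stop words, which mirrors how strongly stop words can inflate the raw length of a corpus.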

How to Remove the Punctuation from Word Tokenization with NLTK?

To remove the punctuation from Word Tokenization with NLTK, the “isalnum()” method should be used with a list comprehension. In the NLTK Word Tokenization tutorial, the “isalnum()” method is used to create the “content_of_website_removed_punct” variable as below. Note that “isalnum()” also drops tokens that mix letters with other characters, such as the hyphenated “On-Page”.

content_of_website_removed_punct = [word for word in tokens if word.isalnum()]
content_of_website_removed_punct

OUTPUT >>>

['Think',
 'like',
 'SEO',
 'Code',
 'like',
 'Developer',
 'Python',
 'SEO',
 'TechSEO',
 'Theoretical',
 'SEO',
 'SEO',
 'PageSpeed',
 'UX',
 'Marketing',
 'Think',
 'like',
 'SEO',
 'Code',
 'like',
 'Developer',
 'Main',
 'Menu',
 'is',
 'a',
 ...,

 'server',
 'needs',
 'to',
 'return',
 '304',
 ...]

As you see, all of the punctuation is removed from the tokenized words with NLTK. The next step is using the Counter object for creating a data frame so that the tokenized output with NLTK can be used for analysis and machine learning.

content_of_website_removed_punct_counts = Counter(content_of_website_removed_punct)
content_of_website_removed_punct_counts

OUTPUT >>>

Counter({'Think': 381,
         'like': 989,
         'SEO': 5172,
         'Code': 467,
         'Developer': 405,
         'Python': 1583,
         'TechSEO': 733,
         'Theoretical': 700,
         'PageSpeed': 645,
         'UX': 699,
         'Marketing': 1086,
         'Main': 231,
         'Menu': 192,
         'is': 8965,
         'a': 11483,
         'value': 370,
         'for': 8377,
         'hreflang': 219,
         'attribute': 178,
         'of': 12349,
         'the': 31354,
         'link': 509,
         'tag': 314,
         'An': 198,
         'can': 5065,
         ...,

         'natural': 56,
         'Due': 20,
         'inaccuracies': 1,
         'calculation': 63,
         'always': 241,
         ...})

The Counter object has been created for the NLTK Word Tokenization output without the punctuation. Below, you will see an example of creating a data frame via “from_dict” from the result of the NLTK word tokenization without punctuation.

content_of_website_removed_punct_counts_df = pd.DataFrame.from_dict(content_of_website_removed_punct_counts, orient="index").reset_index()
content_of_website_removed_punct_counts_df.sort_values(by=0, ascending=False, inplace=True)
content_of_website_removed_punct_counts_df

Below, you can see the image output of the NLTK word tokenization by removing the punctuations.

word tokenization nltk puncutation

Below, you can see the table output of the NLTK word tokenization by removing the punctuations.

            index      0
20            the  31354
32            and  12812
19             of  12349
14              a  11483
44             to  11063
...           ...    ...
11303      ground      1
6178       Zurich      1
6181         Visa      1
9184      omitted      1
14783  infinitely      1
14784 rows × 2 columns
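The “isalnum()” filter used in this section is strict: it removes every token that contains any non-alphanumeric character, not only pure punctuation tokens. The sketch below compares it with a hypothetical alternative that drops a token only when all of its characters are in “string.punctuation”; the token list is a small assumed sample.

```python
import string

# Hypothetical sample of tokens, including a hyphenated word and a word with a trailing dot.
tokens = ["On-Page", "SEO", ",", "304", "almost.", "."]

# Strict filter: keeps only tokens made entirely of letters and digits.
alnum_only = [t for t in tokens if t.isalnum()]

# Softer filter: removes a token only if every character is ASCII punctuation.
not_pure_punctuation = [t for t in tokens if not all(ch in string.punctuation for ch in t)]

print(alnum_only)
print(not_pure_punctuation)
```

Which filter fits better depends on whether hyphenated terms such as “On-Page” should stay in the analysis.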

How to visualize the NLTK Word Tokenization result without punctuation?

To visualize the NLTK Word Tokenization result within a data frame without punctuation, the “plot” method of the pandas should be used. An example visualization of the NLTK Word Tokenization without punctuation can be seen below.

content_of_website_removed_punct_counts_df[:30].plot(kind="bar",x="index", orientation="vertical", figsize=(15,10), xlabel="Tokens", ylabel="Count", colormap="viridis", table=False, grid=True, fontsize=15, rot=35, position=1, title="Token Counts from a Website Content without Punctuation", legend=True).legend(["Tokens"], loc="lower left", prop={"size":15})

The output of the visualization of the NLTK Word Tokenization result without punctuation is below.

word tokenization nltk punctuation visualization

How to Remove stop words and punctuations from text data for a better NLTK Word Tokenization?

To remove the stop words and punctuation for cleaning the text in order to have a better NLTK Word Tokenization result, a list comprehension should be used with “if” conditions. Combining both conditions in a single list comprehension cleans the punctuation and the stop words in one pass over the tokens. An example of the cleaning of punctuation and stop words can be seen below.

# "stopwords" comes from "nltk.corpus". The stop word set is built once instead of once per token.
from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))
content_of_website_filtered_stopwords_and_punctiation = [w for w in tokens if w not in stop_words and w.isalnum()]
content_of_website_filtered_stopwords_and_punctiation_counts = Counter(content_of_website_filtered_stopwords_and_punctiation)
content_of_website_filtered_stopwords_and_punctiation_counts_df = pd.DataFrame.from_dict(content_of_website_filtered_stopwords_and_punctiation_counts, orient="index").reset_index()
content_of_website_filtered_stopwords_and_punctiation_counts_df.sort_values(by=0, ascending=False, inplace=True)
content_of_website_filtered_stopwords_and_punctiation_counts_df[:30].plot(kind="bar",x="index", orientation="vertical", figsize=(15,10), xlabel="Tokens", ylabel="Count", colormap="viridis", table=False, grid=True, fontsize=15, rot=35, position=1, title="Token Counts from a Website Content without Stop Words and Punctuation", legend=True).legend(["Tokens"], loc="lower left", prop={"size":15})

The explanation of the code block that removes the punctuation and the English stop words from the words tokenized via NLTK is below.

  • Remove the stop words and the punctuation via “stopwords.words("english")” and “isalnum()”.
  • Count the rest of the words with the “Counter” class from the “collections” Python built-in module.
  • Create the data frame with the “from_dict” method of “pd.DataFrame” with the “orient” parameter.
  • Sort the values from high to low.
  • Use the plot method of the Pandas Python Library for the visualization with the “kind”, “x”, “orientation”, “figsize”, “xlabel”, “ylabel”, “colormap”, “table”, “grid”, “fontsize”, “rot”, “position”, “title”, “legend” parameters.
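
The filtering step can also be sketched as a small helper function. The example below is only an illustration: the “clean_tokens” function, its “lowercase” option, the sample tokens, and the tiny stop word list are all hypothetical. In practice, the stop word list would come from “stopwords.words("english")”.

```python
def clean_tokens(tokens, stop_words, lowercase=False):
    """Drop stop words and non-alphanumeric tokens from a list of tokens."""
    stop_words = set(stop_words)  # build the set once for fast membership checks
    cleaned = []
    for token in tokens:
        candidate = token.lower() if lowercase else token
        # isalnum() is False for punctuation tokens such as "," or ".".
        if candidate.isalnum() and candidate not in stop_words:
            cleaned.append(candidate)
    return cleaned


# Hypothetical tokens and a hypothetical two-word stop word list.
sample_tokens = ["The", "page", "and", "the", "content", ",", "SEO", "."]
sample_stop_words = ["the", "and"]

print(clean_tokens(sample_tokens, sample_stop_words))
# ['The', 'page', 'content', 'SEO']
print(clean_tokens(sample_tokens, sample_stop_words, lowercase=True))
# ['page', 'content', 'seo']
```

The first call mirrors the case-sensitive comparison of the tutorial code, so the capitalized “The” is kept. The second call lowercases the tokens before the comparison, which removes it at the cost of merging tokens such as “SEO” and “seo”.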

The output of the removal of the punctuation and the stop words from the tokenized words with NLTK for visualization can be seen below.

nltk tokenization stopwords punctuations

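The table version of the NLTK Tokenization for words without punctuation and English stop words can be seen below. Capitalized tokens such as “The” remain in the table because the NLTK stop word list is lowercase and the comparison is case-sensitive.

        index             0
2       SEO            5172
63      Google         2762
160     The            2716
280     content        2477
23      page           2230
9600    estimations       1
9595    Politic           1
9594    396               1
9593    Politics          1
14642   infinitely        1

14643 rows × 2 columns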

How to perform sentence Tokenization with NLTK?

To perform the sentence tokenization with NLTK, the “sent_tokenize” function of NLTK should be used. The steps below can be used for NLTK Sentence Tokenization.

  • Extract the text and assign it to a variable.
  • Import the “sent_tokenize” function from NLTK.
  • Use the “sent_tokenize” on the extracted text.
  • Use the Counter from Collections to count the sentences.
  • Create a Data Frame from the output of the counting process.
  • Call the dataframe.

The table output of the NLTK sentence tokenization can be seen below.

       index                                              0
104    required fields are marked * name* email* webs…  175
105    python seo techseo theoretical seo on-page seo…  175
103    post navigation ← → your email address will no…  156
3259   you may see the result below.                     43
383    you can see the result below.                     42
867    2                                                 34
879    3                                                 29
885    4                                                 24
857    1                                                 21
908    6                                                 15
891    5                                                 15
917    7                                                 13
165    reply your email address will not be published.   13
924    8                                                 12
14947  below, you can see the result.                    11
3450   you also may want to read our some of the rela…   11
942    10                                                 8
932    9                                                  8
16     in this article, we will focus on how to resiz…    7
3146   you can see an example below.                      7
15     because of those motivations, image compressio…    7
15040  below, you will see an example.                    7
3204   you may see the output below.                      6
3451   return of investment definition and importance…    6
3452   what is conversion rate optimization?              6
11046  you may see an example below.                      6
62     what is conversion funnel?                         6
3147   you can see the output below.                      6
1      an hreflang value can specify the alternate ve…    5
38     the click path plays a role above all in terms…    5
24100  read more » python seo techseo theoretical seo…    5
34     for get and head methods, the server will retu…    5
35     if the resource’s etag is not on the list, the…    5
8357   what is a news sitemap?                            5
32     trust elements are called by definition all th…    5
31     trust elements are used in this context.           5
28     translating a pandas data frame with python ca…    5
3      website is a way of world wide web presence.       5
5      announced in october 2015 as an internal proje…    5
7      user-centric performance metrics are announced…    5
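
The sentence tokenization steps of this section can be sketched as follows. The “split_sentences” and “count_sentences” helper names and the sample sentences are hypothetical. The “sent_tokenize” import is placed inside the helper because it needs the NLTK Punkt tokenizer data to be downloaded, while the counting part works on any list of sentences.

```python
from collections import Counter


def split_sentences(text):
    """Split a text into sentences with NLTK."""
    # Imported here because sent_tokenize needs the downloaded Punkt data.
    from nltk.tokenize import sent_tokenize
    return sent_tokenize(text)


def count_sentences(sentences, top_n=40):
    """Count lowercased sentences and return the most frequent ones."""
    counts = Counter(sentence.lower() for sentence in sentences)
    return counts.most_common(top_n)


# Hypothetical sentences that stand in for the output of split_sentences().
sample_sentences = [
    "You can see the result below.",
    "What is SEO?",
    "you can see the result below.",
]

print(count_sentences(sample_sentences))
# [('you can see the result below.', 2), ('what is seo?', 1)]
```

The “most_common” method returns the same ordering that “sort_values(by=0, ascending=False)” gives in a data frame, so it can be used for a quick check before creating the data frame.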

How to perform sentence tokenization with NLTK without the stop words?

To remove the stop words from the sentence tokenization output of NLTK, the stop words should be removed from the word tokens first, and the “join()” method should be used to rebuild the text that will be tokenized into sentences. The steps that will be followed for the sentence tokenization with NLTK without the stop words can be seen below.

  • Remove the stop words from the tokenized text data.
  • Join the tokens with a space via the “" ".join()” method.
  • Use the “sent_tokenize()” over the joined tokens without stop words.
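
A minimal sketch of the three steps above is given below. The helper names, the sample tokens, and the short stop word list are hypothetical, and the stop word comparison is lowercased here as an assumption. Punctuation tokens are kept on purpose, because “sent_tokenize()” needs them to find the sentence boundaries in the joined text.

```python
def join_without_stop_words(tokens, stop_words):
    """Remove stop words from word tokens and join the rest with spaces."""
    stop_words = set(stop_words)
    # Punctuation tokens such as "." and "?" are kept as sentence boundaries.
    kept_tokens = [token for token in tokens if token.lower() not in stop_words]
    return " ".join(kept_tokens)


def split_joined_text(joined_text):
    """Run the NLTK sentence tokenizer over the joined text."""
    # Imported here because sent_tokenize needs the downloaded Punkt data.
    from nltk.tokenize import sent_tokenize
    return sent_tokenize(joined_text)


# Hypothetical word tokens, as word_tokenize() could return them.
sample_tokens = ["You", "can", "see", "the", "result", "below", ".",
                 "What", "is", "SEO", "?"]
sample_stop_words = ["you", "can", "the", "is", "what"]

joined_text = join_without_stop_words(sample_tokens, sample_stop_words)
print(joined_text)
# see result below . SEO ?
```

Calling “split_joined_text(joined_text)” then returns the sentences without the stop words, which can be counted in the same way as the regular sentence tokens.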

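An example of counting the sentence tokens and creating a data frame from them can be found below. It assumes that the “sent_tokens” variable already holds the output of “sent_tokenize()”.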

sent_tokens_counts = Counter([sent.lower() for sent in sent_tokens])
sent_tokens_counts_df = pd.DataFrame.from_dict(sent_tokens_counts, orient="index").reset_index()
sent_tokens_counts_df.sort_values(by=0, ascending=False)[0:40]

The table output of the code above is the same 40-row table of lowercased sentences and their counts that is shown in the previous section.

How to Interpret the Tokenized Text with NLTK?

To interpret the tokenized text with NLTK for SEO, or NLP and text quality understanding, the metrics and dimensions below can be used.

  • The unique word count within the text data.
  • The unique word count within the headings of a website.
  • The unique word count within the anchor texts.
  • The sentence count per article of a website.
  • The unique sentence count per article of a website.
  • The unique word count per article of a website.
  • The most used words within the headings.
  • The most used words within the text.
  • The percentage of the stop words to the unique words.
  • Checking the impressions, clicks, and rankings for the unique group of words from different website sections such as footer, header, main content area, sidebar, or the headings.

In terms of Search Engine Optimization, and understanding the text’s quality, the interpretation methods above can be used.
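
Some of the metrics above can be calculated with a short helper. The sketch below is only one possible interpretation: the “text_metrics” function, its dictionary keys, and the sample data are hypothetical, and “the percentage of the stop words to the unique words” is read here as the share of unique words that are stop words.

```python
def text_metrics(word_tokens, sentence_tokens, stop_words):
    """Calculate simple count metrics from word and sentence tokens."""
    stop_words = set(stop_words)
    # Lowercase the words and skip punctuation tokens.
    words = [word.lower() for word in word_tokens if word.isalnum()]
    unique_words = set(words)
    unique_stop_words = unique_words & stop_words
    return {
        "word_count": len(words),
        "unique_word_count": len(unique_words),
        "sentence_count": len(sentence_tokens),
        "unique_sentence_count": len({s.lower() for s in sentence_tokens}),
        # Share of the unique words that are stop words (0.0 for empty text).
        "stop_word_share": (
            len(unique_stop_words) / len(unique_words) if unique_words else 0.0
        ),
    }


# Hypothetical tokens for a two-sentence article.
sample_words = ["SEO", "is", "the", "process", ".", "SEO", "is", "important", "."]
sample_sentences = ["SEO is the process.", "SEO is important."]
sample_stop_words = ["is", "the"]

metrics = text_metrics(sample_words, sample_sentences, sample_stop_words)
print(metrics)
```

The same function can be called separately for the headings, the anchor texts, and the main content of an article to compare the sections with each other.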

How can a Search Engine Use Tokenization?

A search engine can use tokenization to split the text into “tokens” so that the information retrieval system can match the queries with the documents. Tokenization is used for text normalization. A search engine uses word tokenization and sentence tokenization to perform text normalization so that it can decrease the cost of computation for its own algorithms. Pairing words from different contexts with different prefixes and suffixes, recognizing word pairs, vectorizing the N-grams within the sentences, and supporting the part of speech tags with tokenized word data from a corpus are among the purposes of word tokenization for a search engine. For tokenization purposes, a search engine can use NLTK and other libraries such as Gensim, Keras, or TensorFlow. Natural Language Tool Kit (NLTK) can be used by Google and other search engines for the same purposes. Below, you will see two patents from Google, with inventors including Max Benjamin Braun and Ying Sheng, that include the usage of NLTK.

“A transferrable neural architecture for structured data extraction from web documents” from Google aims to understand a document’s structured data by creating a pattern between similar seed websites.

Below, you will see another example that shows how a search engine can use NLTK and tokenization from Google Search Engine.

A section from the “Message suggestions” patent of Google. A search engine can use the NLTK word and sentence tokenization for autocomplete suggestion creation along with text-to-speech and message suggestions while typing, or answering a message with a quick message option such as “thank you”.

Do search engines use tokenization? Yes, search engines use tokenization, and NLTK is one of the libraries that they can use for it. Search Engines such as Microsoft Bing, Google, and DuckDuckGo can use word and sentence tokenization to create indexes of words and indexes of documents to understand the contextual connection between the queries and the documents. A word’s place and the other words that surround it can help a search engine to understand the relevance of words to each other and to a topic. Word Tokenization and sentence tokenization provide a better lemmatization, stemming, word grouping, and textual data aggregation for search engines.
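
To illustrate how tokens can connect queries with documents, the sketch below builds a very small inverted index, which is a mapping from every token to the documents that contain it. It is a simplified teaching example under stated assumptions and not a description of how any real search engine works. The function names and the sample documents are hypothetical, and a plain regular expression stands in for a real word tokenizer such as “word_tokenize”.

```python
import re
from collections import defaultdict


def simple_tokenize(text):
    """Lowercase the text and return its alphanumeric tokens."""
    # Simplified stand-in for a word tokenizer such as word_tokenize().
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map every token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in simple_tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of the documents that contain every query token."""
    query_tokens = simple_tokenize(query)
    if not query_tokens:
        return set()
    # Copy the first posting set so that the index itself is not modified.
    result = set(index.get(query_tokens[0], set()))
    for token in query_tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical documents.
documents = {
    "doc1": "NLTK can tokenize words and sentences.",
    "doc2": "Search engines tokenize queries and documents.",
    "doc3": "Stop words are removed before counting.",
}

index = build_inverted_index(documents)
print(sorted(search(index, "tokenize words")))
# ['doc1']
print(sorted(search(index, "tokenize and")))
# ['doc1', 'doc2']
```

Because the query and the documents go through the same tokenization, “Tokenize” in a query and “tokenize” in a document end up as the same token and can be matched.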

Last Thoughts on NLTK Tokenize and Holistic SEO

NLTK Word Tokenization is important to interpret a website’s content or a book’s text. Word Tokenization is an important and basic step for Natural Language Processing. It can be used for analyzing the SEO Performance of a website or cleaning a text for NLP Algorithm training. Using lemmatization, stemming, stop word cleaning, punctuation cleaning, and visualizing the NLTK Tokenization outputs are beneficial to perform statistical analysis for a text. Filtering certain documents that mention a word, or filtering certain documents based on their content, content length, and unique word count can be beneficial to perform a faster and scaled analysis.

The NLTK Tutorials and NLTK Tokenize Guideline will be updated over time.

Koray Tuğberk GÜBÜR
