An example of semantic core clustering. Free search query clustering service from SEOQUICK

Hello, dear friends! Happy New Year! I hope you have already recovered from the holidays and are back in fighting form. I have a New Year's gift for you today: a very useful practical post. The post is not mine, but it is more than worthy of appearing on the pages of this blog.

The review was compiled by a cool guy named Dmitry Miroshnichenko. Dima lives in Volgograd and works as a project manager at a local web studio that develops and promotes its own projects. Dima also holds a Candidate of Sciences degree, and that is no small thing!

Everything written below is my point of view, based on my own experience. I do not claim it is the ultimate truth. If you see some processes differently and know how to solve the problem more efficiently, please do not hold back: write about it in the comments.

So, the task: to create a semantic core for the site. What does the word "semantic" mean? Here is what Wikipedia tells us: semantics (from ancient Greek σημαντικός, "signifying") is a branch of linguistics (in particular, semiotics) that studies the meaning of units of language. In other words, we need to identify semantic directions for the site structure.

How is this problem usually solved?

  1. Parsing queries (Wordstat, various databases, search suggestions, services such as Spywords and Semrush, open statistics counters and other sources)
  2. We sift out the garbage and check the frequency
  3. We distribute requests into groups
  4. Based on groups, we create the structure of the site and distribute articles

We can successfully solve the first two points with Key Collector. There is no agonizing choice to make here: Key Collector is a really handy tool.

The third task is the most interesting, and it is the one we will focus on solving.

The fourth problem can be solved quite trivially if the third has been implemented well.

Initial data

Information site on dacha topics. Section "shrubs and trees". A total of 562 keys were collected. This is the training dataset. It was important for me to compare the results of different tools.

For this section of the information site, queries were collected, junk was removed, and only queries with an exact-match ("!") Wordstat frequency above 30 were kept. They now need to be distributed into groups.

You can distribute queries manually or automatically. Manual distribution is done by meaning, and everything is clear there. For automatic clustering there are many approaches, so let's take a closer look at each tool.

Tools to make manual query clustering easier

Excel, LibreOffice, OpenOffice

I think there is no point in describing in detail how to work with these tools.

Advantages

  • high precision processing - we still process it by hand
  • versatility - you can take into account a bunch of parameters
  • in the case of LibreOffice, OpenOffice - free

Flaws

  • in the case of Excel - paid
  • low speed - when working with large amounts of data
  • need to make backups

Google Docs

Advantages

  • similar to the previous point
  • online service - convenient access to the document
  • no need to make backups
  • free

Flaws

  • the speed is still low

kg.ppc-panel.ru

Online service. Load requests, filter, select groups.

Works fast. The functionality is sufficient (except for saving projects), the interface is good.

Advantages

  • user-friendly interface
  • works quickly
  • visibility
  • no need to register
  • free
  • online service

Flaws

  • you cannot save projects; you can only export the finished result
  • as a consequence of the previous point, if the service goes down, all your work is lost
  • frequencies cannot be loaded

Keyword Assistant - creating the structure of the future site

Another online service. Similar to the previous one. You can now save projects.

Advantages

  • projects are saved
  • good and clear interface
  • you can load frequencies into the project
  • free
  • online service

Flaws

  • the speed is higher than when working with excel, but still comparable
  • for the paranoid - it is not clear where your data is stored

Tools for automatic query clustering

Keyword grouper for PPC

A desktop tool with rather strange behavior. Details are at the link above (you can also find out where to download it there).

Brief description of how the clustering algorithm works:

We have a certain set of keywords. Before compiling the index, the script normalizes all word forms. At the next stage, the grouping script determines frequencies for the entire document index and builds a rating. Frequencies are calculated for each word (after normalization). If we have “vacation in Tunisia”, then the script counts the frequencies for “vacation” and “Tunisia”.

At this stage, the ranking of words is arranged from the most frequent to the least frequent. Why is this necessary? To create core groups. Let’s just say that if the word “Egypt” occurs more often than the word “hotel,” then a search query (for example, [cheap hotels in Egypt]) that includes the word “hotel” will be classified in the “Egypt” group and not vice versa.

So, we grouped the words, but in a rather primitive way. Next, we need a more precise grouping.

More precise grouping means that within each group the script will create subgroups and distribute words between them.
At this stage, a frequency ranking of words is built again, but only within the group; the main word of the group (which is also the group name) does not take part in it. In addition, words within a group are ranked by inverse frequency. That is, the word with the lowest frequency is the first to create "its own" subgroup.

Of course, only those words that occur at least N times can create their own subgroups (set in the script settings, but usually this is at least 4-5).

It is this approach that works very effectively when the main groups are created based on the ranking from the most frequent words to the least frequent, and subgroups - from the least frequent to the most frequent.

The output is a grouped list.
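
To make the described logic more tangible, here is a minimal Python sketch of this kind of word-frequency grouping. It is my own illustration, not the tool's actual code: the normalization step is reduced to lowercasing plus a stop-word filter, and the sample queries and the `min_subgroup_size` value are assumptions for the example.

```python
from collections import Counter

STOP_WORDS = {"in", "for", "the"}  # assumed; the real tool normalizes word forms properly

def normalize(query):
    # Stand-in for the word-form normalization described above
    return [w.lower() for w in query.split() if w.lower() not in STOP_WORDS]

def group_by_word_frequency(queries, min_subgroup_size=2):
    tokenized = {q: normalize(q) for q in queries}
    word_freq = Counter(w for words in tokenized.values() for w in words)

    # Main groups: each query goes to the group of its MOST frequent word
    groups = {}
    for q, words in tokenized.items():
        head = max(words, key=lambda w: word_freq[w])
        groups.setdefault(head, []).append(q)

    # Subgroups: inside a group, the LEAST frequent word of each query wins,
    # but only words seen at least min_subgroup_size times may form a subgroup
    result = {}
    for head, members in groups.items():
        subgroups = {}
        for q in members:
            candidates = [w for w in tokenized[q]
                          if w != head and word_freq[w] >= min_subgroup_size]
            key = min(candidates, key=lambda w: word_freq[w]) if candidates else head
            subgroups.setdefault(key, []).append(q)
        result[head] = subgroups
    return result

print(group_by_word_frequency([
    "vacation in Egypt", "cheap hotels in Egypt",
    "vacation in Tunisia", "hotels in Tunisia",
]))
```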

Advantages

  • free for now
  • works quickly

Flaws

  • desktop version
  • project saving works awkwardly
  • it is not clear how to delete anything
  • extremely strange behavior: why did words that were not in the source list appear in the system? (visible in the screenshot)
  • the algorithm does not take the meaning of words into account, only the common root - this is the most significant drawback
  • stated limit of 1000 keys
  • frequency cannot be loaded
  • need to make backups

Rush Analytics

An online service for query clustering based on search engine results (SERPs). More precisely, clustering is only one of the service's capabilities. A more detailed description is available on the website.

Briefly about the operating algorithm:

Clustering is the automatic division of keywords into groups.
How does the technology work?
You upload a list of keywords, select the type of clustering - the system analyzes search engine rankings and, using our algorithm, divides keywords into groups that will rank well in search engines. At the output you get keywords divided into groups.

You can set the grouping strength, apparently measured in arbitrary units. The output is an Excel file for the selected grouping strength. The first tab contains the clusters; the second contains everything left without a cluster.

The fee is charged only for grouped requests (maximum number).

Let me remind you that there are only 562 requests. How many requests were grouped for each option can be seen in the table below.

The maximum number of grouped queries is 359. Not bad for an automated tool. How much did it cost?

Clustering 359 queries cost 552.5 rubles, or a little more than 1.5 rubles per clustered query (the per-query figure is not that interesting to me, but let it stand for the overall picture). It should be clarified that only groups of more than two queries count as a cluster. I could not figure out how to count how many groups there were.

Now let's see what the quality is like.

Let's select a test group for cherries. Here is a list of original requests:

  • how to prune cherries correctly
  • valery chkalov cherries
  • cherries bull's heart
  • cherry tree
  • cherry orchard
  • cherries iput
  • pruning young cherries
  • pruning cherries
  • cherries varieties
  • varieties of cherries

We got two clusters for group strength 4 and 5:

For a grouping strength of 3, the pruning cluster expands slightly:

Obviously, the result is so-so.

If I did it manually, the cluster by variety would look something like this:

  • cherries bull's heart
  • cherries iput
  • valery chkalov cherries
  • cherries varieties
  • varieties of cherries

So the algorithms clearly need to be updated.

Advantages

  • online service
  • all projects are saved
  • upon registration they credit 3,000 rubles to your account (although at the time of publication this freebie, as far as I know, has already been discontinued)
  • meaning (and not just the common root) is at least partially taken into account, thanks to the use of search engine results

Flaws

  • still in the testing stage (at the time of publication it seems to be no longer available)
  • paid
  • expensive - it’s good if the core contains 500 queries, but what if there are thousands and hundreds of thousands?
  • you still need to finish it by hand, it’s not possible to do it fully automatically

SEMparser - Structuring semantics for SEO and context

Another online clusterer based on search engine results.

How it works (taken from the site):

What it looks like inside:

After automatic clustering, an editing window appears where you can correct errors.

The excel file is downloaded. On the first tab are queries and groups with details.

The second tab contains only groups.

On the third tab there is some top topic.

You can also set the strength of the group. I tested the same numbers: 3, 4 and 5.

Here it should be clarified that a cluster of one query also counts as a cluster and is included in the count. So formally it turns out that 100% of the queries were grouped. But again, I could not figure out how to count how many groups contained 2 or more queries.

You also need to take into account that the pricing is slightly different: money is charged for all queries in the document. I bought 600 queries, which cost me 288 rubles, so one query costs 0.48 rubles (48 kopecks). After grouping, I had 38 queries left in my account. As a result, grouping the test sample of keys cost approximately 270 rubles, which is about half the price of the previous service.

Let's see what's going on with the quality.

For all grouping strengths, the result was the same 4 groups:

  • valery chkalov cherries (1/170)
    • valery chkalov cherries (170)
  • cherries iput (5/472)
    • cherries iput (159)
    • varieties of cherries (134)
    • cherries varieties (92)
    • cherry tree (44)
    • cherry orchard (43)
  • ox heart cherries (1/64)
    • ox-heart cherries (64)
  • pruning cherries (3/352)
    • pruning cherries (226)
    • how to properly prune cherries (86)
    • pruning young cherries (40)

Here, too, we see that grouping by meaning is not ideal. It needs to be finished by hand.

Advantages

  • online service
  • all projects are saved
  • 50 requests for test upon registration
  • more or less takes meaning into account
  • cheaper than the previous option

Flaws

  • paid
  • the algorithm does not work perfectly, manual correction is needed

Just Magic - automatic selection of semantics for SEO and contextual advertising

An interesting service in my opinion. Website design: hello console.

What they say on their website:

— Collect semantics for existing site pages, immediately linking queries to them correctly.
— Expand the existing site structure.
— Offer thematic semantics for new pages of the site based on the current semantic core.
— Create semantics for the designed site.
- And simply cluster queries. Including thematic breakdown.

You can watch the developer’s report on TopExpert:

Here's what the developers told me about how everything works there:

We are solving a purely utilitarian problem - determining which requests can be promoted on one page.

Hence the solution method - we collect the output of the PS for each request and carry out clustering based on it.

In fact, we need to solve a fairly simple problem: distribute queries based on matching URLs in the search results, while making sure that queries of different types do not end up in the same cluster. We distinguish the following types:
— Commercial/informational.
— Promoted on the homepage / on an internal page.
— One-word/2+ words.
— With/without content type markers.

“Content type markers” are query words that the search engine uses to set requirements for content on the promoted page. For example - (“reviews”, “video”, “download”, “photo”).

Since the problem is defined and the set of input data is quite simple, the algorithm is not complicated. The main algorithm of the system does not use machine learning. We de facto use centroids in the current algorithm (one of the queries is the “center” of the cluster, and the rest must have a certain measure of similarity to it). Now a “greedy” algorithm for their (centers) selection is used. But this method has certain drawbacks, so in the next version of the algorithm, which is currently being implemented, we will, in principle, abandon the concept of a cluster center query.

We also use machine learning, but in a different place - automatic generation of marker requests based on Yandex.Metrica data.
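
Based on that description, the center-based grouping could look roughly like the sketch below. This is only my reading of the quoted explanation, not Just Magic's code; `serps` (a map from query to its top-10 URL set) and the threshold value are assumptions.

```python
def greedy_centroid_clusters(serps, threshold=4):
    """serps: {query: set of its top-10 URLs}.
    One query acts as the cluster "center"; others join if they share
    at least `threshold` URLs with it. Centers are picked greedily:
    the query that would attract the most neighbours goes first."""
    remaining = dict(serps)
    clusters = []
    while remaining:
        def members_of(center):
            return [q for q in remaining
                    if len(remaining[center] & remaining[q]) >= threshold]
        center = max(remaining, key=lambda c: len(members_of(c)))
        members = members_of(center)
        clusters.append((center, members))
        for q in members:
            del remaining[q]
    return clusters
```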

I have not been able to try it yet. I sent a request for test access; they said they were rolling out an update. The main functionality, costing from 30,000 rubles/month, will be available at the end of January; for mere mortals with smaller volumes, in February.

If everything works as they say, it will be very cool. Let's see.

Advantages

  • good prospects
  • online service

Flaws

  • There is no version for mere mortals yet, only a monthly subscription
  • failed to try
  • as it is - expensive

SEO intellect - SEO automation service

Another online clustering service. Declared functionality:

SEO automation service
● query clustering
● selection of landing pages
● search for competitors
● assistance in content optimization
● ordering optimized texts

I never managed to get it to work. There is no help, no hints...

Advantages

  • online

Flaws

  • I never figured out how to work with it
  • paid

Coolakov.ru - Breakdown of key queries

Description on the website:

The service allows you to automatically group already collected requests. Requests are divided into groups based on the similarity of the Yandex top 10.

I could not find anything about the specifics of the algorithm.

For my 562 queries, there were 305 groups. There is at least one request per group. Let's see what's going on with our cherries:

It is clear that groups 73 and 189 could have been united... Well, the rest is also clear. Clustering, to put it mildly, is not ideal.

Advantages

  • online service
  • free

Flaws

  • How to download it? There are no buttons to download. You can only copy the text.
  • Without registration, you can only work from 20:00 to 7:00 Moscow time. And there is no way to register. At all.

s:toolz is a professional query clustering tool based on search results

Another clustering service. Its peculiarity is that it does not work in automatic mode, which is also its shortcoming.

Operating procedure

Declared functionality:

The query clustering service is designed for fast automated grouping of large lists of queries (keywords for promotion) into clusters, which are formed based on search engine results and Yandex's understanding of user needs.

Queries from one cluster should be promoted on a single page.

I sent a brief. They responded two days later. It turns out that applications are currently processed in manual mode; they write that this way they get more feedback. In the future they promise to make everything automatic.

The clustering itself, they write, lasted less than a minute. Quote:

It took less than a minute to process your application. The largest amount that had to be processed so far was 55k, the calculation took about 3 hours.

What they write about the algorithm:

We have developed our own grouping algorithm. Data - Yandex top 10 for each request. We use machine learning, but for other functionality, which will be presented soon.
There are problems with relevant pages in the report. The search engine index does not always contain what you need, especially if a specialist has not yet worked on the project.
As a result, you have to additionally process the result manually; with a certain number of requests, this is already sad. The problem is in the process of being solved.

Clustering my 562 queries cost me 309 rubles, at 60 kopecks per query. No discounts were given. Then again, I did not ask.

Now let's see what's going on with the quality:

  • pruning cherries
  • how to prune cherries correctly
  • pruning young cherries
  • varieties of cherries
  • cherries varieties
  • cherry tree
  • valery chkalov cherries
  • cherries ox heart
  • cherry orchard
  • cherries iput

Again the varieties were left without clusters.

Advantages

  • responsive technical support that answered all my questions
  • online service

Flaws

  • does not work automatically, the human factor spoils the impression
  • paid
  • manual correction of clusters is required

Mc-Castle.ru – semantic core clusterer

One more service. It apparently clusters by word forms; search engine results are not used.

Result:

I could not figure out what to do next with this. How do I split the list into clusters? How do I see which queries ended up in the same cluster? And if the breakdown is based on word forms, then there is no question of any grouping by meaning.

Advantages

  • online service
  • free
  • no need to register

Flaws

  • strange interface
  • splitting algorithm based on word forms

Key Collector

The program is well known to almost everyone who has encountered the collection of keys in one way or another.

Grouping is only a small part of what it can do.

Queries can be grouped by phrase composition, by search engine results, or in a combined mode. SERP-based grouping works with the data collected for KEI. Collecting the data for the group took several minutes; the grouping itself took less than a minute.

The best grouping was achieved with the following parameters:

In the first case, 381 phrases or 68% of the total number were grouped. In the second case, 403 or 72%, which is very good.

The cherry varieties we were interested in (iput, ox heart...) again did not make it into the varieties cluster; they were split off into separate groups. Which, in general, is not surprising.

The remaining queries were grouped more or less well. As a result, we get roughly 72% time savings (the rest must be finished manually).

Advantages

  • clear interface
  • you can select grouping settings
  • a bunch of other options for working with keys
  • reasonable price
  • excellent technical support

Flaws

  • desktop version
  • You cannot edit the resulting groups in the program - only in Excel
  • to work you need an anti-captcha service, proxies and accounts; online services do not have this problem, since they take care of it themselves
  • manual correction of clusters is required

MegaLemma - automation of the compilation of the semantic core and Yandex.Direct campaigns

Desktop software for clustering.

It is difficult to just sit down and start working. Usability is poor.

I want to run normalization, but it gives me a message that I need to save the project first. Is it not possible to save the project automatically? Why should I press a button when the computer can do it itself?

It's not obvious what to click to start the grouping process. It turns out this is “frequency analysis”.

Parsing 562 requests using standard settings in 7 threads and 7 proxies took approximately 10 minutes. It took another 5 minutes for normalization.

After normalization, it is again unclear how to group the words I need. I found the information on page 27 of the manual. And thanks for that.

After all, this is the main functionality of the program: breaking words into groups. Why is the most essential information buried so deep? It would also be useful to have something like a quick-start guide, separately for contextual advertising and for sites, since as I understand it the workflows differ.

I did not feel like finishing the task of clustering my queries here. The main problem is how the priorities are placed in the program interface.

There is no point in using it purely for clustering. I think the full power of the program should reveal itself in a full-fledged workflow with keywords, starting with cleaning queries of junk and building stop-word lists.

Advantages

  • there is a full demo version

Flaws

  • desktop program
  • to work, you need an anti-captcha service and proxies; online services do not have this problem, since they take care of it themselves
  • groups on the basis of word forms, so there is no question of grouping by meaning
  • usability needs improvement

"Semyon-Yadren" - formation of the semantic core of the site based on search engines

Another online service. It has been promoted heavily lately.

Again, you need to work with the service through intermediaries. That is, through people. No automation for you.

You must first submit a brief, then wait until they contact you. You agree on the details. Then payment.

They didn’t want to do grouping for free, but they gave a 50% discount. As a result, grouping 562 requests cost me 350 rubles (without a discount they asked for 700). Grouping one request cost 60 kopecks (or 1.2 rubles without discounts)

Again, there are problems with usability on the site. The “submit brief” button is small, white and invisible on a white background. I couldn't resist, sorry.

They refused to say how and on what basis clustering is done. It is only known that it is based on search engine results.

The result was sent within a few hours. In addition to the clusters themselves, they sent a whole pile of extra parameters and files, although I did not ask for them. In principle, it is useful information for analysis. But it would be logical to split the offer: if you just want clustering, one price; if you want the extra goodies, another. Different clients need different information.

Let's see what clusters we got:

  • pruning cherries
  • how to prune cherries correctly
  • pruning young cherries
  • cherries iput
  • varieties of cherries
  • cherries varieties
  • cherries bull's heart
  • valery chkalov cherries
  • cherry tree
  • cherry orchard

This is already much better! Some varieties ended up in the cluster by variety! True, Valery Chkalov got lost.

Advantages

  • responded quickly
  • ready to make discounts
  • online service
  • a bunch of different additional information, including tasks for copywriters (although they write that tasks still need to be completed individually)

Flaws

  • there is an intermediary in the form of a person
  • algorithms are a complete trade secret
  • manual correction of clusters is required

Results

A summary table of functionality and cost can be found below.

| Tool | Price | Clustering algorithm | Format | Processing time | Cost of grouping all queries | Cost of grouping one query |
| | free | based on search engine results | online | a couple of minutes | free | free |
| | paid | based on search engine results | online | less than a minute + two days | 309 rub. | 60 kopecks |
| | paid | based on word forms | online | less than a minute | free | free |
| | 1,700 rub. | word forms + search engine results | desktop | a couple of minutes | | |
| | 3,000 rub. | word forms and lemmas | desktop | a couple of minutes | | |
| | paid | based on search engine results | online | a few minutes + a couple of hours | 350 rub. (700 rub. without discount) | 60 kopecks (1.2 rub. without discount) |

Below are services that were not included in the main review in this post, but were suggested in the comments by users or by representatives of the services.

| Topvisor.ru | paid | based on search engine results | online | ~7-8 min. for 3,000 queries | | from 30 kopecks |

As a result, we don’t yet have a tool that will fully automatically group the necessary queries without errors.

The best results (judging by the cherry varieties) were shown by Semparser.ru and Seo-case.com. In terms of cost, that is 48 kopecks versus 1.2 rubles per query, respectively: a difference of almost three times. The additional cost of Seo-case, I think, is due to the bonus information. Next comes Key Collector (since anyone who works with queries almost certainly already has it).

The most thorough approach to clustering, in my opinion, comes from the guys at Just-magic.org. As soon as the opportunity arises, I will definitely test it.

Be that as it may, working by hand is better than any service, and that is hard to argue with. It is a little more expensive, but the quality is much better.

This concludes the mega review, dear friends! I am sure you liked it, so please leave your opinion in the comments, and if you have something to add, then all the more so: write in!

See you later, friends!

Author: Alexey Chekushin, SEO expert at Kokoc.com (Kokoc Group), creator of the Just-Magic.org service

In my previous article, I called clustering one of the fundamental factors for success in promotion. In this publication, I discuss in detail what clustering is and how to apply it correctly.

What is clustering?

This is an automatic grouping of requests that solves two important problems:

  1. Combining similar queries (with the same “intent”), regardless of their semantic similarity. (“intent” = user intent). For example, the requests “rent an apartment” and “rent apartments” express the same user desire.
  2. Checking the compatibility of promoted queries: can they be promoted to the top of Yandex on one page simultaneously? That is, can the page's optimization be tuned for all of these queries, or do some of them require a separate page?

Of all the methods that exist today, these problems are most effectively solved by so-called "clustering by tops", when queries are compared by the number of identical URLs in the Yandex top 10.

Now let's talk about each point in more detail.

Combining requests with the same intent

What are queries with the same intent? These are different queries with which a person is, in fact, looking for the same thing. Obviously, two spellings of the same query, such as "samsung tvs" written in different ways, should be promoted on one page. But these are the obvious cases.

However, there are also much less obvious examples:

  • "working clothes" - "work clothes"
  • “mortgage” - “loan secured by an apartment”
  • “auto pawnshop” - “loan secured by a car”

Semantically, these pairs are not at all similar, but in fact they mean the same thing. Classic methods of searching for such queries that have a single intent (user intent) are based on synonyms. As a rule, for this purpose they use synonym dictionaries or Yandex synonyms. However, both methods have their serious disadvantages.

If we use synonym dictionaries, we will find very strange connections there. For example, according to one of the most popular dictionaries, synonyms for “mobile phone” are:

  • mobile
  • mobile phone
  • radiotelephone
  • cell phone
  • cellular telephone
  • telephone
  • ebony friend

Cell phone is definitely a useful synonym. But a "radiotelephone" (a cordless phone) is a completely different type of product. And who the "ebony friend" is, one can only guess.

The second option for searching for synonyms is to try to “catch” them from Yandex highlights. But this comes with two problems:

Firstly, not only synonyms get highlighted, but other words too. For example, the highlighting for the key phrase "cell phones" includes not only the synonym "mobile", but also "prices", "Moscow", "buy", "catalog", which are highlighted for other reasons. In general, this is a solvable problem; there are workarounds.

Secondly, synonyms in Yandex are not reciprocal. For example, "mobile phones" is treated as a synonym for the query "cell phones", but it does not work the other way around: "cell phones" will not be a synonym for "mobile phones", and this point becomes critical. How do you know that the query "cell phones" is related to the query "mobile phones" if the word "cell" is not highlighted?

Finally, how will you understand that the queries “jewelry store”, “jewelry” and “jewelry” have the same intent if, from the point of view of Yandex, they are not synonymous?

The solution to the problem comes through clustering queries by top. Finding the same URLs in the top signals the same intent. Here is an example of how the just-magic clusterer works:

It seems that the clusterer combined everything correctly: “mobile phones” was put in one group with “cell phones”, and “jewelry” was put in a group where there was a “jewelry store”. Why then did “jewelry” end up in a separate group, despite the fact that the topic here is the same (this can be seen in the “spec-grp” column)?

The answer to this question is given in the next part of the article.

Checking the compatibility of promoted queries

To promote, we not only need to collect similar queries on the page, we also need to check their compatibility.

Yandex does not have a single ranking formula for all queries. Queries are divided into a large number of types, and the formulas for different types often place mutually exclusive requirements on a page for it to rank in the top. Moreover, these queries are often visually very similar. For example, the queries "smartphone" and "smartphones": the first is non-commercial and geo-independent, the second is commercial and geo-dependent. As you can see, in this case the singular and the plural are incompatible on the same page!

If you suddenly thought that this was logical, then here is another example: the queries “laptop” and “laptops”. They are both commercial and geo-dependent and fit perfectly together on one page.

Commercial intent and geo-dependence are just two of the most obvious signals; in fact, there are many more. For example, whether the search engine wants to see a homepage or an internal page in the top for a query. Since we do not know the full variety of such indicators, the only way to determine whether queries can be promoted together on one page is to look at whether there are URLs that are shown simultaneously for both queries, and to count how many there are.

The logic here is as follows:

  • If the same URLs are in the top for queries, then they can be promoted on one page.
  • If there are no common URLs for queries, then we do not know whether it is possible to promote queries on one page. Most likely this is impossible.

And here we are faced with the question: how exactly to combine queries based on tops? I distinguish between two methods - the so-called “soft” and “hard” clustering.

The following picture clearly explains the difference between them:

Soft clustering boils down to the following: to form a group, one “central” request is taken and all the others are compared with it by the number of common URLs in the top 10 of Yandex. If the number of common URLs exceeds the threshold, the request is added to the group.

With hard clustering, queries are combined into a group only if there is a set of URLs common to all of them, shown in the top 10 for every query in the group.

Soft clustering produces larger groups, but often makes mistakes in determining whether queries can be promoted together on a page.

Classic example: imagine that a query was selected as the "central" one, and its top contains 5 homepages and 5 internal pages. Two queries may get attached to it, one of which has 10 homepages in the top, the other 10 internal pages. Obviously, of these three queries we can promote only two together (depending on which type of page we choose, home or internal). With hard clustering, such a group cannot arise.
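
For clarity, here is a small Python sketch of the two checks. The names and default thresholds are illustrative assumptions; `serps` is assumed to map each query to the set of its top-10 Yandex URLs.

```python
def soft_cluster(serps, center, threshold=4):
    """Soft: every query is compared only with the chosen "central" query
    and joins the group if they share at least `threshold` top-10 URLs."""
    return [q for q in serps if len(serps[center] & serps[q]) >= threshold]

def is_hard_cluster(serps, group, threshold=3):
    """Hard: the group is valid only if at least `threshold` URLs appear
    in the top-10 of EVERY query of the group simultaneously."""
    common = set.intersection(*(serps[q] for q in group))
    return len(common) >= threshold
```

In the homepage/internal-page example above, both attached queries would pass the soft check against the center (5 common URLs each) but the trio would fail the hard check, because no URL is shared by all three tops at once.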

But that is all talk. Let's move on to numerical estimates.

So, we have two criteria for assessing clustering:

  1. How complete is the group of queries? That is, whether all queries with the same "intent" were included in it. We take as 100% the situation when every query with the same intent ends up in the group.
  2. How compatible are the queries included in the group with each other? We take as 100% the situation when all queries included in the cluster are compatible with each other (a small illustration of both criteria follows below).
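
As a rough illustration (my own, assuming we have a manual reference grouping and a pairwise compatibility check), the two criteria could be computed like this:

```python
from itertools import combinations

def completeness(cluster, same_intent_queries):
    # Share of all queries with a given intent that actually landed in the cluster
    return len(set(cluster) & set(same_intent_queries)) / len(same_intent_queries)

def compatibility(cluster, can_promote_together):
    # Share of query pairs inside the cluster that can be promoted on one page;
    # can_promote_together(a, b) is an assumed pairwise check returning True/False
    pairs = list(combinations(cluster, 2))
    return sum(can_promote_together(a, b) for a, b in pairs) / len(pairs) if pairs else 1.0
```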

The key clustering parameter is the minimum number of common URLs needed to form a group. This number is called the "clustering threshold". The higher it is, the more accurate the resulting groups, but they naturally become smaller. It was determined experimentally that the minimum working threshold for "hard" clustering is 3 URLs, and for "soft" clustering, 4 URLs. There is no point in working with smaller values: too many irrelevant queries end up in the groups.

Here is an example of results for different thresholds for hard clustering:

Using the service just-magic.org, we compared two clustering methods on samples from different topics. Below is a summary chart:

Comparisons were made for the “soft” and “hard” methods. For the number of intersecting URLs from 3 to 6 (this is the minimum number of common URLs to form a cluster).

As can be seen from the graph, hard clustering shows very high accuracy even at a threshold of 3 URLs: 92%. To give readers a sense of how big this figure is: an experienced optimizer working without tools achieves about 70% accuracy, and an inexperienced specialist no more than 30%. The completeness, however, turns out to be quite low, only 40%. But again, it depends on what you compare it with: working by hand, optimizers achieve at most 20%.

Soft clustering demonstrates very good completeness, but its accuracy is lame on both legs. Values acceptable for promotion are obtained only at a threshold of 5, but then completeness drops to 23%.

Does this mean that this method is not applicable? No. It all depends on your task. If you are engaged in "traffic" promotion, and it is important for you to get as many queries as possible onto the page, no matter which ones, then soft clustering suits you. That is why, when hard clustering appeared in the just-magic.org service in January of this year, the "soft" mode was retained for the "markers" module.

If it is important for you to display a certain set of queries on the page, then your choice is clear - only hard clustering, only hardcore. Another advantage of hard clustering is that the resulting groups are unambiguous. That is, requests that end up in one group of 4 URLs cannot end up in different groups of 3 URLs (when using soft clustering, this can easily happen). Therefore, the Just-Magic clusterer displays groups of 3, 4, 5 and 6 URLs at once.

It is worth noting separately that if we want to carry out text analysis of the page in the future, then it is permissible to use only hard clustering. The fact is that any text analysis for a group of queries for a page very strictly correlates with the quality of this group. Only hard clustering provides groups of the required quality.

Let's sum it up

So, what are the benefits of clustering?

Firstly, it speeds up the process of sorting out large semantic cores. Previously, this took weeks or months of work; with a clusterer, an optimizer does it in a couple of hours.

Secondly, it makes it possible to distribute queries across pages in such a way that they can be promoted simultaneously. There is no "manual" alternative to clustering here: even an experienced optimizer makes up to 30% erroneous assignments when working "by eye".

Based on the second point, it becomes clear that clustering should always be used in promotion. Even if the core contains fewer than 100 queries, you will not be able to distribute them across pages correctly "by eye". The only exceptions are topics with ultra-low competition, where clustering by tops stops working because there are too few relevant answers in the tops.

If you are engaged in "traffic" promotion, you can use both the "soft" and the "hard" clustering methods. If you do "positional" promotion, when it is important to bring all queries to the top, only the "hard" method is suitable. Likewise, only the "hard" mode is compatible with subsequent text analysis.

Use clustering in your work, and you will find happiness and harmony, and the queries you promote are guaranteed to get to the top!

In today's episode of "On the Board" we talk about semantics and structuring keywords for a site.

About what semantic core clustering is, why you need to cluster, and how to do it.

Oleg Shestakov, founder of Rush Analytics, talks about it.

The video turned out to be quite long; it covers the main nuances of clustering.

Let's move on to watching the video:

Photo from the board:

Important: If you have questions, feel free to ask them in the comments. Oleg will be happy to answer them.

Video transcript

1. What is clustering?

Clustering using the top similarity method is a grouping of keywords based on an analysis of search engine results. How does this happen?

  • We take two queries, for example, “lip gloss” and “buy lip gloss.”
  • We collect search results for each request, save 10 urls from each search result and check whether there are common urls in both results.
  • If there are at least 3-5 common URLs (depending on the clustering accuracy we specify), then these queries are grouped together (a minimal sketch of this check follows below).
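
A minimal sketch of that pairwise check (illustrative only; the function name and the threshold of 3-5 come from the explanation above):

```python
def should_group(top10_a, top10_b, min_common=3):
    """Group two queries if their top-10 SERPs share at least min_common URLs."""
    return len(set(top10_a) & set(top10_b)) >= min_common

# For "lip gloss" vs "buy lip gloss", the informational and commercial SERPs
# share almost no URLs, so this check would return False.
```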

2. Why do clustering?

Why has the clustering trend been on the market for about a year and a half now? Why is this important and how will it help?

  • Save time. Clustering is a wonderful technology that reduces the routine of grouping a semantic core. If an ordinary semantics specialist spends about 2-3 weeks parsing 100,000 keywords into groups (or even longer if the semantics are complex), a clusterer can produce a first-pass split in about an hour.
  • Allows you to avoid the mistake of promoting incompatible queries on one page. Yandex has classifiers that evaluate how commercial a query is, and the results for informational and commercial queries are completely different. The queries "lip gloss" and "buy lip gloss" can never be pushed onto the same page.

1) For the first request (“lip gloss”) there are information sites (irecommend, Wikipedia). An information page is needed for this request.

2) For the second request (“buy lip gloss”) - commercial resources, well-known online stores. This request requires a commercial page.

That is, different queries need different types of pages. A common optimizer mistake is promoting everything together on one page: half of the semantic core makes it into the TOP 10, and the other half cannot get there. The clusterer helps avoid such errors.

To prevent this from happening, you must initially correctly group requests by type of page in the search results.

3. How does clustering help in promotion?

  • data processing speed,
  • classification of pages for which promotion is done.

If the site structure is grouped and internal optimization is done correctly, then this is already half the battle, if we are talking about the Russian market. Naturally, links will be required for Western markets. In our experience, somewhere around 50-60% of requests with proper clustering and proper text optimization simply reach the TOP without any external intervention. For online stores or classifieds (aggregators and portals), in principle, texts are not even needed.

Clustering is the key to correct ranking. At the moment there is no point in fighting the search engine's ranking; it is easier to adapt to it, create the necessary types of pages, and promote successfully. Changing the promotion paradigm for a given topic is more unrealistic than realistic.

4. What are the clustering methods? (Hard/Soft)

Soft is what was described earlier: a marker query for some category of an online store is taken, other queries are checked against it, and the search results are compared. "buy lip gloss", "buy lip gloss in Moscow", "buy lip gloss prices" each have 4-5 URLs in common with the main query.

These queries get linked to it. The check ends there: we obtain a cluster of keywords that can be promoted together.

But there are more competitive topics, for example, plastic windows. Here you need to check that all requests that were tied to the main can be promoted with each other.

We need to check whether the search results for these queries contain the same URLs. We compare the results not only with the main query, but also with each other, and we group only those queries that can be linked to one another.

For most cases, Soft clustering is sufficient. These are online stores (not very competitive categories), information resources.

5. Clustering in Rush Analytics

We have a clustering module and 3 types of clustering:

  • By Wordstat. The simplest and least labor-intensive method from an optimizer's point of view. Ideal for situations when we know almost nothing about the structure of the site.

1) In Excel, load keywords into one column, frequency according to Wordstat into another, and send for clustering.

2) We sort the entire list in descending order: the most frequent words (usually the shortest) are at the top.

3) The algorithm works like this: we take the first keyword, try to link all the other keywords to it, and form a group. We cut out everything that got attached, re-sort the rest, and repeat the iteration.

4) From the list of keywords we get a set of clusters (a rough sketch of this loop follows below).
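
A rough sketch of that loop (my own illustration, not Rush Analytics code; `freq` is assumed to hold the Wordstat frequency per query and `serps` the top-10 URL sets):

```python
def cluster_by_wordstat(queries, freq, serps, min_common=3):
    remaining = sorted(queries, key=lambda q: freq[q], reverse=True)
    clusters = []
    while remaining:
        leader = remaining[0]                       # most frequent query left
        group = [q for q in remaining
                 if len(serps[leader] & serps[q]) >= min_common]
        clusters.append((leader, group))            # the leader always belongs to its own group
        remaining = [q for q in remaining if q not in group]
    return clusters
```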

By markers

Suitable for sites where the structure is defined. Works very well in e-commerce (for example, online stores).

1) We know the marker request (the main request of the page or several requests under which it is promoted).

2) We take a list of keywords, in the column on the right we mark marker queries with ones, and all other queries with zeros.

3) We take a marker keyword, try to link other keywords to it, and group them into clusters. Importantly, in this algorithm the marker queries that we marked with ones will never be linked to each other; we do not even try to tie them together (see the sketch below).
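
The marker variant differs only in how the group leaders are chosen; a sketch under the same assumptions as above:

```python
def cluster_by_markers(markers, other_queries, serps, min_common=3):
    clusters = {m: [m] for m in markers}   # markers are never linked to each other
    unclustered = []
    for q in other_queries:
        for m in markers:
            if len(serps[m] & serps[q]) >= min_common:
                clusters[m].append(q)      # attach to the first matching marker
                break
        else:
            unclustered.append(q)          # no marker matched
    return clusters, unclustered
```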

Combined clustering

This algorithm combines the previous two:

1) We load the keywords, mark each one as "marker/non-marker", and add its frequency.

2) We bind all the words that we can bind to marker queries.

3) We take keywords that remain unlinked and group them together using Wordstat.

4) Everything else will be classified as “non-clustered”.

5) As a result, we get the structure we already know, plus automatic clustering of all the other keywords, which helps us expand the structure (a combined sketch follows below). All of these types of clustering are available in Rush Analytics.
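
Combining the two helper sketches above gives something like the combined mode described here (again, only an illustration of the idea, not the service's code):

```python
def cluster_combined(markers, other_queries, freq, serps, min_common=3):
    # Step 1: attach whatever we can to the marker queries
    by_marker, leftovers = cluster_by_markers(markers, other_queries, serps, min_common)
    # Step 2: group the leftovers Wordstat-style; anything still alone stays "non-clustered"
    extra_clusters = cluster_by_wordstat(leftovers, freq, serps, min_common)
    return by_marker, extra_clusters
```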

What other tools are there on the market?

Among the worthy ones, besides Rush Analytics, we can highlight the JustMagic service, where there is both Hard and Soft clustering. The service was developed by Alexey Chekushin.

That's all you need to know about clustering to get started with keyword grouping.

Use clustering and save your time. In addition, people often make mistakes; the error rate of the optimizer is about 15%. Entrust the routine to robots - no need to sort it out by hand.

Expert opinions

Topvisor is one of the most dynamic tools on the search engine promotion market. Developing steadily, the team regularly adds useful services for SEO specialists.

One of the most interesting modules is the fast clustering of search queries based on the similarity of SERPs.

Our company did not move to Topvisor based on any recommendation. We tested different position monitoring services, and we were impressed by the developer’s responsiveness.

It's nice when your suggestions are implemented and make life and work easier. And after time, this ability to listen and implement has not disappeared. This is very cool!

I had been looking for a convenient service for checking positions for a long time and tried a lot of them! First one thing was wrong, then another... In Topvisor you can customize everything for yourself, and the additional features make us even happier.

Definitely a must-have! I hope there will be further development!

We tried many competing services and chose Topvisor for its quality. And also for the accuracy and speed of checking positions. Now we test all new tools and implement them into our workflow.

I am especially pleased with the responsiveness of the service team and the prompt implementation of users’ ideas and wishes.

When once again I couldn’t open KeyCollector on a Mac, Topvisor saved me. Here I quickly obtained a series of data on semantics for one important study. Also, if necessary, I use Topvisor to check the positions of client sites, which is very convenient.

The creators of the service are familiar with the needs of the market, so they do everything possible to automate many tasks, sometimes not very popular. Pleasant and convenient service.

A must have in the arsenal of optimizers.

For a man who has built his reputation on semantics, it is extremely important to always get accurate data; This applies to clustering, position reading, and analytics. From the first days, Topvisor set a high level of work relative to the market and confirms its leadership every day.

In addition to the convenience and accuracy of the service tools, I would like to note the responsive work of the support service and management!

Topvisor impressed me with its thoughtfulness and versatility. So many little things were taken into account in advance. I often work with the interfaces of a variety of SEO services, I test a lot, but I have not yet seen this level of user friendliness anywhere else.

The detailed Help, friendliness and efficiency of the support are impressive.

It is very convenient to work in: collect and expand the semantic core, do clustering, track the positions of sites as well as of pages on social networks, videos and YouTube channels, monitor competitors, and analyze your site's optimization. The prices, as it turned out, are very affordable for the work involved. I definitely do not want to leave this service.

I have been using Topvisor for a long time, from the first weeks of its existence, since 2013. To be honest, I initially just decided to test another position-checking service, because the one I was using at the time kept going down and its technical support did not respond at all.

And Topvisor support responded within 2 minutes, even on Twitter, and what was very pleasing was that many of my suggestions for improvement were implemented almost on the same day.

At the time of writing this review, I have approximately 270 closed tickets, and many of the features appeared at my suggestion. You might ask, what does support have to do with a position-checking service? As it turns out, it is the most important part, because any glitch or oversight is corrected quickly, and if something extra is charged to the account, it is compensated. What about stability? Everything is fine there too (well, except during rollouts of new features). In more than 3 years with Topvisor, I have loaded a hundred different projects with semantic cores from 10 to 5,000 queries, and there was almost never a time when positions were not collected on time or something happened to the data.

Topvisor is a stable and fast service for working with semantics, which does, if not everything, then almost everything: Wordstat, AdWords, hints, grouping and clustering of queries, excellent and understandable analytics, integration with the webmaster, metrics, GA. In addition, there is a heap and a small cart of related services, such as monitoring changes on the site or a bid manager for context. I use all these features to the fullest in almost every new project.

If you choose a service for monitoring positions and other SEO tasks, I recommend taking a closer look at Topvisor.

This is a whole range of useful tools: from checking positions, to collecting snippets and snapshots of search results, to detailed technical analysis of the site; from selecting words and collecting search suggestions to grouping by relevance and clustering using three different methods.

Able to work and integrate with Yandex.Metrica, Y.Webmaster, Google Analytics and Google Search Console. A real search analytics service.

Topvisor is constantly and dynamically developing: new tools appear regularly and the current functionality keeps expanding. The interface is convenient, intuitive and very well thought out by the developers. Pay special attention to the detailed reference materials on Topvisor's tools and capabilities; I am sure that even beginners will have no problems or questions after reading them.

We started using Topvisor in September 2014 as a backup for our internal monitoring and analytics tools. Over time, as the project developed, some of that internal functionality was never fully built out on our side.

We use only the positions module, we get statistics using a convenient API, which Power BI/Query works well with for visualizing ready-made reports on the parameters of the number and dynamics of requests in the TOP-3..100+ for the required period of time.

It is convenient that the service handles paperless document flow through Diadoc, and ready-made invoices a week before the billing date save a lot of time. Besides the technical side, Topvisor has the most important thing in customer service: great support. Responses to requests come within 5-10 minutes, with a visible desire to help, understand the problem and improve functionality. Thanks to that, Russian cities now have different colors on the statistics graphs, and the help has a couple of extra screenshots.

Having a list of queries is not yet a semantic core: you first need to distribute the queries across pages in order to understand how to fill the site. Without good semantics, it will be very difficult to get traffic from search.

What is query clustering

Query clustering is precisely the distribution of search queries of the same topic into groups to promote a landing page.

Clustering includes the following processes:

  • grouping requests depending on user intentions (intent);
  • checking the compatibility of key queries for promotion on one page in the Yandex top.

Queries with the same intent are different queries through which a person is, in fact, looking for the same thing. An obvious example is two spellings of the query [Parker pen]. The situation is more complicated with synonyms such as [table lamp] - [night light], [birth certificate] - [metric], [monitor] - [screen]. The difficulty is that when searching for key synonyms through the Yandex dictionary, the system does not always offer an adequate selection.

In practice, similar queries can have a lot of different characteristics, due to which they cannot be placed on one page. Clustering queries by top comes to the rescue. The clusterer finds identical URLs in the top search engine results, thereby signaling the presence of the same intent. The result of the work is expressed as follows:

  • the presence of identical URLs in the top for queries means the possibility of promoting them on one page;
  • The absence of common URLs indicates, with a high probability, the impossibility of such promotion.

Why clustering is needed

With the help of automatic clusterers, you can quickly group even the largest semantic cores. Where sorting out a core used to take weeks or months, clusterers reduce the work to a couple of hours. A big advantage of clustering is distributing queries across pages in such a way that they can be promoted simultaneously. It is difficult to imagine a manual equivalent of high-precision clustering, since even an experienced optimizer makes up to 30% erroneous assignments. It follows that keyword clustering is necessary in almost every case.

When I was a newbie webmaster, I made a website with a separate article for each query. Of course, it did not receive any traffic; it was simply a failure. And this is a problem for many beginners: wrong queries or wrong clustering.

Clustering methods

When grouping queries, uncertainty arises in the methodology for combining them based on tops. In practice, there are two main methods: “soft” and “hard” clustering.

Soft clustering is based on forming a group around one "central" query. All the others are compared with it by the number of common URLs in the Yandex top 10. Soft clustering forms fairly large groups, but it often makes mistakes in determining whether the queries can be promoted together on one page.

Hard clustering is characterized by combining queries into a group when there is a common set of URLs for all queries, which is shown in the top 10 for all these queries.

There are two criteria for assessing clustering:

  1. Completeness: the share of queries with the same "intent" that end up in the group. If all queries with the same intent fall into one group, completeness is 100%.
  2. Compatibility of the queries that fall into the same group. The case when all queries included in the cluster are compatible with each other is taken as 100%.

An important role is played by the "clustering threshold": the minimum number of common URLs required to form a group. A higher threshold means more accurate groups, but they naturally become smaller. Experience with semantic clusterers shows that the minimum working threshold for "hard" clustering is 3 URLs, and for "soft" clustering, 4 URLs.

Even with a threshold of 3 URLs, hard clustering provides accuracy above 90%. For comparison: without using tools, the accuracy of an experienced optimizer’s work, at best, will be 70%, and for a beginner – no more than 30%. Despite the high accuracy, the “hard” method provides only about 40% completeness.

Soft clustering has a high completeness index, but significantly loses in accuracy. Thus, “soft” and “hard” methods are inversely proportional to each other. The use of one method or another depends on the goals of the optimization process.

For “traffic” promotion, when it is important to display as many queries as possible on the page, soft clustering is better suited. If “positional” promotion is carried out, then the hard one has the final say.

Hard clustering is also used in text analysis of a page. Any text analysis for a group of queries for a page is quite strictly correlated with the quality of this group. Only the “hard” method provides groups of the required quality.

How to group the semantic core

I usually do clustering in two stages. First, I throw the core into some automatic clustering service or program; then I finish the core manually, in Excel. Something like these guys do:

In these videos, it’s basically clear how to do manual finishing, but as for automatic clusterers, everyone chooses what they like best.

Semparser

The automatic query grouper from Topvisor is an alternative to Rush Analytics and Semparser, and its interface is similar to the latter. There is a grouping-strength setting, and the project can be saved to an Excel file.

The Topvisor clusterer has a “regrouping” operation. After its application, the number of groups increases, and the number of requests in them decreases noticeably. This function is useful for those who are not satisfied with soft clustering and would rather use the hard option.

“Regrouping” here is paid, although it costs no more than a couple of rubles.

The advantage of Topvisor is based on its high grouping speed. The clusterer will distribute the semantic core of 1000 queries in a matter of minutes. Disadvantages: high cost of grouping and, of course, the need for manual editing.

Grouping via Key Collector

Another example of an automatic clusterer is presented as an online tool on the website coolakov.ru. Requests are divided into groups based on the similarity of the Yandex top 10.

Plus: a free online service.
Cons: low grouping accuracy and no export to a file.

To summarize, you can confidently opt for the automatic clusterers offered by various online services. But, unfortunately, the output of any clusterer requires manual refinement.
