When you add content to your library, Acrolinx reviews it in two steps: a crawl and a check. You'll kick off the process in the data center. Content Cube will run an initial crawl and check of your content to give you a baseline. Then, it will automatically crawl and check the content library on a weekly basis.
To learn how to start your first crawl, read on.
To identify checkable text on your website, Acrolinx uses a crawler with the user agent `Acrolinx-bot`. All you have to do is provide Acrolinx-bot with the domains and subdomains that you want to crawl, for example, `acrolinx.com` and `docs.acrolinx.com`. Once you add a domain, Acrolinx-bot automatically crawls all of the content in that domain on a weekly basis. You can run up to 100 individual crawls at a time. Learn more about Acrolinx-bot.
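Because Acrolinx-bot identifies itself through its user agent, you can manage its access in your own site's robots.txt like any other well-behaved crawler. Here's a minimal, purely illustrative sketch, assuming your robots.txt restricts unknown bots; the `/private/` path is a placeholder:

```
# Illustrative robots.txt rules (placeholder paths)

# Let the Acrolinx crawler reach everything
User-agent: Acrolinx-bot
Allow: /

# Keep other crawlers out of a hypothetical private area
User-agent: *
Disallow: /private/
```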
To make sure that Acrolinx captures the right content, you can also fine-tune a crawl. If you work in marketing and want to review the content that you use to convert prospects, you might make sure that Acrolinx crawls URLs with paths like `/product/` or `/products/`.
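Entered one path per line in the Crawl these paths setting (covered in the setup steps below), that filter might look like this:

```
/product/
/products/
```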
Note
When you add a domain to Content Cube, you don't have to include the subdomain `www`. But the root domain (let's say `acrolinx.com`) will sometimes redirect to a URL that includes a `www`, for example, `www.acrolinx.com`. If this happens, the crawler might only identify one page for `acrolinx.com`, but many more pages for `www.acrolinx.com`.
To add a new domain to your library, do the following:
- Go to Reporting > Content Cube settings > Web crawling.
- Click the plus icon Add new domain to open the Web Crawler Setup.
- Enter the domain or subdomain that you want to crawl. For example, `docs.acrolinx.com`.
  Note
  Be sure to leave out the protocol. For example, `http://` or `https://`.
  - Do add: `docs.acrolinx.com`
  - Don't add: `http://docs.acrolinx.com`
- Optional: Fine-tune your crawl with the following settings:
Tip
Already have some web-crawling experience? Learn how to customize your crawl with our advanced crawl settings.
  - Max. pages to crawl: Defines approximately how many pages Acrolinx should crawl.
  - Max. crawl depth: Determines how deep Acrolinx-bot follows links from the start page, and so how many pages it accesses and indexes, during a single crawl.
  - Crawl these paths: Limits the crawl to certain pages within a domain. When you list one or more of the paths that follow the root domain in a page's URL, Acrolinx does the following:
    - Automatically adds each path to the virtual robots.txt file as `allow:[input]`. This tells Acrolinx to only visit URLs with the specific path directly after the domain. For example, `my.domain/blog`.
    - Uses the URL as an `alternative_start_url`.
    - Automatically adds `disallow: /` to the robots.txt file. This keeps Acrolinx from crawling anything other than the paths you list.

    If you add `/blog` under Crawl these paths, for example, the crawler will only access pages that have `my.domain/blog` in the URL. To include multiple paths, list each path on a separate line, for example, `/blog` and `/news/articles/product-updates`. For a sketch of the resulting robots.txt entries, see the example after these steps.
  - Don't crawl these paths: Ignores certain pages within a domain during a crawl. When you list one or more of the paths that follow the root domain in a page's URL, the paths are added to the virtual robots.txt as `disallow:[input]`. This tells Acrolinx not to follow URLs with those paths. If you add `/blog` under Don't crawl these paths, for example, the crawler won't access pages with `my.domain/blog` in the URL. To exclude multiple paths, list each path on a separate line.
- Click Save to start your crawl.
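To make the path settings concrete, here's roughly what the virtual robots.txt entries described above would look like, using `/blog` and `my.domain` as placeholders (a sketch based on the setting descriptions; the exact file that Acrolinx generates may differ):

```
# Scenario 1: /blog listed under "Crawl these paths"
# (my.domain/blog also becomes an alternative_start_url)
allow: /blog
disallow: /

# Scenario 2: /blog listed under "Don't crawl these paths"
disallow: /blog
```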
You can also add a new domain to Content Cube directly from the content library. This means that you don't have to switch to the data center every time you want to set up a crawl. To open the Web Crawler Setup, click the plus icon Add new domain at the top of the content library.