Warning
The support for classic Reuse will end with the 2023.10 release. We've introduced a modernized Reuse feature with an improved UI and more checking capabilities. Learn more about it in our Reuse Quick Start.
You can use the cluster settings to influence the average number of sentences and the similarity of the sentences in each cluster.
When you create a new cluster or expand an existing cluster, you can set four options.
Setting |
Description |
---|---|
Minimum Word Count |
Define the lowest number of words a sentence needs before you can add it to a cluster. For example, Acrolinx treats titles as sentences while harvesting. However, titles often have only one word. You can raise the minimum word count to prevent short titles from being added to a cluster. Words such as "and," "to," and "the" count toward the minimum word count. The lower the minimum word count, the more likely you are to get irrelevant sentences in your clusters. Consider the two titles "Configuring Browsers" and "Configuring Servers." If you set the minimum word count to two, you might see both variants in the same cluster. However, these sentences don't represent the same idea. Tip: We recommend that you set this to at least 6. Shorter sentences are usually sentence fragments or other non-sentence data. Only proper sentences tend to produce useful clusters. |
Minimum Cluster Size |
Define the lowest number of sentences that must be in a cluster before the cluster gets added to the repository. Acrolinx discards clusters smaller than the number you set. This is useful because you get less data to work with. If you have huge clusters of sentences, you might set the cluster size between 5 and 10. When the data set is small, you might not want to set a minimum. For example, you use the sentence "Open the configuration file" in all of your documentation. The only exception is "Launch the configuration file." However, you might want the option to write the sentence "End Date can’t be before the Start Date" the following ways:
|
Cluster Strictness |
Define the quality of clusters to add to the Reuse repository. There are five levels of cluster strictness ranging from lowest to highest. For large-deployment Reuse indices, set cluster strictness to either "High" or "Highest." "Highest" clusters only sentences that differ in word order or in "noncontent" words such as determiners and prepositions. For example, the following sentences are in a cluster with the setting "Highest":
"High" allows a little more flexibility with wording. The lowest level groups sentences that share only a few keywords. These clusters are usually large and can have ten or more sentences. For example, the following sentences are in a cluster with the setting "Lowest":
The right level of strictness depends on the type of data and on the intended purpose of the Reuse repository. Lower strictness can result in a repository that has a lot of variations, which might be useful for testing. To reduce the degree of variation and to eliminate clusters that are too large, you can set the cluster strictness to a higher setting. The more harvested sentences you have, the more likely you are to need a higher strictness. |
Initial Cluster Status |
When you add harvested sentences to a repository, you can select the initial status for all clusters. You can’t change the initial status after you create the repository. You can only change the status of clusters individually. To change the status for all clusters at the same time, create the repository again and select a different initial cluster status. Set the clusters to Enabled if you want to make them available for checking. Set the clusters to Proposed or Disabled if you want to edit the clusters further and don’t want them to be available for checking. |
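
To make the effect of Minimum Word Count and Minimum Cluster Size concrete, here is a minimal Python sketch. It is not how Acrolinx implements these settings. The function names, the whitespace-based word counting, and the sample sentences are assumptions made only for illustration.

```python
# Illustrative only: hypothetical helpers, not part of Acrolinx.

def word_count(sentence: str) -> int:
    """Count whitespace-separated words; words like "and" or "the" count too."""
    return len(sentence.split())


def filter_sentences(sentences, min_word_count):
    """Keep only sentences that reach the minimum word count."""
    return [s for s in sentences if word_count(s) >= min_word_count]


def filter_clusters(clusters, min_cluster_size):
    """Discard clusters that have fewer sentences than the minimum cluster size."""
    return [c for c in clusters if len(c) >= min_cluster_size]


harvested = [
    "Configuring Browsers",
    "Configuring Servers",
    "Open the configuration file before you start the server",
]

# With a minimum of 2, both short titles stay and could end up in one cluster.
loose = filter_sentences(harvested, min_word_count=2)

# With a minimum of 6, only the full sentence remains.
strict = filter_sentences(harvested, min_word_count=6)

# A cluster with a single sentence is dropped when the minimum size is 2.
clusters = filter_clusters([["s1", "s2", "s3"], ["s4"]], min_cluster_size=2)
```

In this sketch, raising the word count removes the two titles before clustering starts, and raising the cluster size removes small clusters after clustering. This mirrors the order in which the two settings are described above.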
Increase the minimum word count when you cluster. Aim for 6 or 7 words. It may also be a good idea to look at the data source. Many bad sentences can indicate that Acrolinx isn't parsing the source documents correctly. This would impact all check components, not just Reuse.
If the clusters have lots of strange characters, such as question marks, diamonds, or boxes, there might be encoding issues. You can’t deal with this in Reuse. You’ll need to go back to the source documents and find out why there are issues with reading the text.
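
One way to check the source text for this kind of damage outside of Reuse is to look for the Unicode replacement character (U+FFFD), which many tools display as a question mark in a diamond. The following Python sketch assumes that you have the extracted sentences available as plain strings. The helper names are hypothetical and the sketch isn't an Acrolinx feature.

```python
# Illustrative only: hypothetical helpers for inspecting extracted text.

REPLACEMENT_CHAR = "\ufffd"  # shown by many tools as a question mark in a diamond


def has_encoding_damage(sentence: str) -> bool:
    """Return True if the sentence contains the Unicode replacement character."""
    return REPLACEMENT_CHAR in sentence


def damaged_share(sentences) -> float:
    """Return the fraction of sentences that contain the replacement character."""
    if not sentences:
        return 0.0
    return sum(has_encoding_damage(s) for s in sentences) / len(sentences)


# Decoding Latin-1 bytes as UTF-8 is a typical way this damage appears.
broken = b"caf\xe9 menu".decode("utf-8", errors="replace")
sample = [broken, "Open the configuration file."]
share = damaged_share(sample)
```

A high share suggests that the documents were read with the wrong encoding, which you need to fix at the source.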
Reuse will recognize numerals, such as “12” and “1998,” as variables. However, it will treat spelled-out numbers such as “four” and “twelve” as ordinary words. Clusters that have numerals in different formats aren’t very useful. It's best to turn them off or add the first twelve or so spelled-out numbers to the antonyms file and recluster.
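
The difference between numerals and spelled-out numbers can be illustrated with a small Python sketch. It assumes that a numeral is any run of digits and uses a made-up `<NUM>` placeholder. The way Reuse actually detects variables isn't described here, so treat this only as an illustration of the idea.

```python
import re

# Illustrative only: a run of digits stands in for "numeral".
NUMERAL = re.compile(r"\d+")


def normalize_numerals(sentence: str) -> str:
    """Replace every run of digits with a placeholder so numerals act as variables."""
    return NUMERAL.sub("<NUM>", sentence)


a = normalize_numerals("Wait 12 seconds before you retry.")
b = normalize_numerals("Wait 30 seconds before you retry.")
c = normalize_numerals("Wait twelve seconds before you retry.")

same_with_numerals = a == b       # both become "Wait <NUM> seconds before you retry."
same_with_spelled_out = a == c    # "twelve" stays an ordinary word, so these differ
```

In this sketch, the two sentences with numerals collapse into the same pattern, while the spelled-out variant stays a separate sentence.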
Sentences usually cluster based on shared meaning. They differ by the main noun in the subject or object. This relates to the “Terms as Variables” feature. Unfortunately, Acrolinx can't handle these kinds of clusters at this time, and you should turn them off.