Advanced NLP: training data building
Avoid misleading intent names
Keeping intent names understandable is part of the success. There will always be someone who continues your work.
bad
# (intent: utterance)
advice: how much make a chatbot
advice: how to build a bot
duration: how much time it takes to build a bot
duration: how does it take to make a chatbot
long: how long the message should be
long: what is the best length of text message
Categorizing intents also helps a lot.
good
# (intent: utterance)
faq-howToMakeBot: how much make a chatbot
faq-howToMakeBot: how to build a bot
faq-buildDuration: how much time it takes to build a bot
faq-buildDuration: how does it take to make a chatbot
conversation-bestTextLength: how long the message should be
conversation-bestTextLength: what is the best length of text message
Avoid general intents
Putting all utterances that lead to the same answer under a single intent is a very common mistake.
bad
# (intent: utterance)
aboutCats: how much a cat eats
aboutCats: how much food a cat consumes
aboutCats: how much a cat weights
aboutCats: how many kilos a cat has
good
# (intent: utterance)
howMuchCatEats: how much a cat eats
howMuchCatEats: how much food a cat consumes
howMuchCatWeights: how much a cat weights
howMuchCatWeights: how many kilos a cat has
Entities: Provide as many synonyms as possible
Recognizing entities can be a bit tricky. Add as many synonyms as possible. Including common misspellings also helps a lot.
bad
# (@entity: value [synonyms, ...])
@salutation: bro [brother]
good
# (@entity: value [synonyms, ...])
@salutation: bro [brother, broo, broo, brotha]
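To illustrate why each extra synonym matters, here is a minimal sketch of a dictionary-based entity lookup. It is only an assumption about how such matching could work, not the actual NLP engine; the names `build_lookup` and `detect` are hypothetical.

```python
# Hypothetical sketch: resolve a word to its entity value via a synonym table.
# The table mirrors the @salutation example; the matching logic is an assumption.
SALUTATION = {"bro": ["brother", "broo", "brotha"]}


def build_lookup(entity):
    """Map every value and synonym (lower-cased) to its canonical value."""
    lookup = {}
    for value, synonyms in entity.items():
        for form in [value, *synonyms]:
            lookup[form.lower()] = value
    return lookup


def detect(text, lookup):
    """Return the canonical value of the first known word in text, or None."""
    for word in text.lower().split():
        if word in lookup:
            return lookup[word]
    return None


lookup = build_lookup(SALUTATION)
print(detect("hey brotha", lookup))  # a listed misspelling is recognized
print(detect("hey bruh", lookup))    # an unlisted form is not
```

In this simplified model, a form that is not listed is never recognized, which is why covering misspellings pays off.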
Try to mix long and short utterances
Some people are more expressive, some less. Make the NLP usable for both groups.
bad
# (intent: utterance)
howMuchCatEats: how much a cat eats
howMuchCatEats: how much food a cat consumes
howMuchCatEats: what quantity of food the cat eats
howMuchCatWeights: cat weight
howMuchCatWeights: cat kilos
good
# (intent: utterance)
howMuchCatEats: how much a cat eats
howMuchCatEats: how much food a cat consumes
howMuchCatEats: what quantity of food the cat eats
howMuchCatEats: cat food quantity
howMuchCatEats: cat food consumption
howMuchCatWeights: how much a cat weights
howMuchCatWeights: how many kilos a cat has
howMuchCatWeights: cat weight
howMuchCatWeights: cat kilos
Avoid misleading words in intent utterances
Misleading words, combined with a disproportion between intents, result in surprising false-positive intent recognition.
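A quick way to spot intents that lack either short or long utterances is to count words per utterance. The following is a minimal sketch; the parser, the function names and the threshold of three words for "short" are assumptions made for illustration, not part of any tool.

```python
from collections import defaultdict


def parse(training_data):
    """Parse 'intent: utterance' lines into {intent: [utterances]}; '#' lines are skipped."""
    intents = defaultdict(list)
    for line in training_data.strip().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        intent, utterance = line.split(":", 1)
        intents[intent.strip()].append(utterance.strip())
    return dict(intents)


def length_mix(intents, short_max=3):
    """Report, per intent, whether it has both short and long utterances.

    'Short' means at most short_max words (an arbitrary, assumed threshold).
    """
    report = {}
    for intent, utterances in intents.items():
        lengths = [len(u.split()) for u in utterances]
        has_short = any(n <= short_max for n in lengths)
        has_long = any(n > short_max for n in lengths)
        report[intent] = has_short and has_long
    return report


BAD = """
# (intent: utterance)
howMuchCatEats: how much a cat eats
howMuchCatEats: how much food a cat consumes
howMuchCatEats: what quantity of food the cat eats
howMuchCatWeights: cat weight
howMuchCatWeights: cat kilos
"""

# Both intents fail the check: one has only long, the other only short utterances.
print(length_mix(parse(BAD)))
```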
bad
# (intent: utterance)
whatTimeIsIt: please tell me, what time is
whatTimeIsIt: hello, can you tell me, what time do you have
whatTimeIsIt: I'm wondering, what could be hours
whatTimeIsIt: where is my watch
whatTimeIsIt: oh no, I completely lost track of time
good
# (intent: utterance)
whatTimeIsIt: what time is it
whatTimeIsIt: tell me, what time is
whatTimeIsIt: what time do you have
whatTimeIsIt: what could be hours
Avoid disproportion between intents
Disproportion makes a single intent much more likely to be matched as a false positive.
- all intents should have a similar number of utterances
- the largest intent should have at most twice as many utterances as the smallest
- good numbers of utterances are 6-12 for simple bots, 12-24 for more sophisticated bots
- distribute common phrases over all intents
bad
The intent whereIsATiger has just short utterances and the intent whereIsFoodCourt only long utterances.
# (intent: utterance)
whereIsATiger: where is a tiger
whereIsATiger: tiger where
whereIsFoodCourt: where can i find a foodcourt
whereIsFoodCourt: how to get to foodcourt
whereIsFoodCourt: where the food court is?
whereIsFoodCourt: where is a food court?
whereIsFoodCourt: foodcourt where?
good
# (intent: utterance)
whereIsATiger: where is a tiger
whereIsATiger: tiger where
whereIsATiger: where can i find a tiger
whereIsATiger: how to get to tigers cage
whereIsATiger: where the tiger is
whereIsFoodCourt: where can i find a foodcourt
whereIsFoodCourt: how to get to foodcourt
whereIsFoodCourt: where the food court is
whereIsFoodCourt: where is a food court
whereIsFoodCourt: foodcourt where
Solving a problem with a lot of synonyms for a single meaning
If a word with many synonyms is spread all over the bot's training data, it's good to identify it early and replace it with an entity.
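The proportion between intents can be checked mechanically. The sketch below compares the utterance counts of the largest and the smallest intent; the function name `proportion_report` and the dictionary layout are hypothetical, and every intent is assumed to have at least one utterance.

```python
def proportion_report(intents, max_ratio=2.0):
    """Compare the largest and smallest intent by number of utterances.

    intents: {intent name: [utterances]}, each list assumed to be non-empty.
    """
    counts = {name: len(utterances) for name, utterances in intents.items()}
    smallest = min(counts.values())
    largest = max(counts.values())
    return {
        "counts": counts,
        "ratio": largest / smallest,
        "balanced": largest <= max_ratio * smallest,
    }


BAD = {
    "whereIsATiger": ["where is a tiger", "tiger where"],
    "whereIsFoodCourt": [
        "where can i find a foodcourt",
        "how to get to foodcourt",
        "where the food court is?",
        "where is a food court?",
        "foodcourt where?",
    ],
}

report = proportion_report(BAD)
print(report["ratio"], report["balanced"])  # 5 utterances vs. 2 exceeds the 2:1 limit
```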
bad
# (intent: utterance)
whatIsCorona: what is a corona virus
whatIsCorona: what is a covid-19
whatIsCorona: more info about covid-19
treatment: how the virus is treated
treatment: how the covid19 is treated
treatment: treatment of corona
- by introducing an entity called @virus, we can simplify all the training data
- the entity does not need to be included in interaction conditions
good
# (@entity: value [synonyms, ...])
@virus: covid19 [corona virus, covid-19, corona, the virus, virus, covid virus]
# (intent: utterance)
whatIsCorona: what is a @virus
whatIsCorona: more info about @virus
treatment: how the @virus is treated
treatment: treatment of @virus
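Rewriting existing utterances to use the entity can be partly automated. The following is a minimal sketch, assuming a simple regular-expression replacement; the helper `make_annotator` is hypothetical and is not a feature of the NLP trainer. Longer synonyms are tried first so that "corona virus" is replaced as a whole rather than word by word.

```python
import re


def make_annotator(entity, value, synonyms):
    """Return a function that replaces every known form with '@entity'."""
    # Longest forms first, so multi-word synonyms win over their parts.
    forms = sorted([value, *synonyms], key=len, reverse=True)
    pattern = re.compile(
        r"(?<![\w@])(?:" + "|".join(re.escape(f) for f in forms) + r")(?!\w)",
        re.IGNORECASE,
    )
    placeholder = "@" + entity

    def annotate(utterance):
        return pattern.sub(placeholder, utterance)

    return annotate


annotate = make_annotator(
    "virus",
    "covid19",
    ["corona virus", "covid-19", "corona", "the virus", "virus", "covid virus"],
)

for utterance in [
    "what is a corona virus",
    "how the covid19 is treated",
    "treatment of corona",
]:
    print(annotate(utterance))
```

The output still deserves a manual review: after replacement, several utterances may collapse into the same text and the duplicates can be dropped.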
Solving a problem with similar intents - Merging them
Merging makes sense if there are many similar intents and
- the only difference is the meaning of a single word
- the answer differs depending on that word
bad
# (intent: utterance)
whatIsScrewdriver: what is a screwdriver
whatIsScrewdriver: tell me about screwdrivers
whatIsHammer: what is a hammer
whatIsHammer: tell me about hammers
whatIsWarranty: what is a warranty
whatIsWarranty: tell me about warranty
The solution is to introduce one or more entities.
good
# (@entity: value [synonyms, ...])
@product: screwdriver [screwdrivers, turnscrew, worm-screw]
          hammer [hammers]
@service: warranty [warranties, return policy]
# (intent: utterance)
whatIs: what is a @product
whatIs: what is the @product
whatIs: tell me about @product
Then you have to use an entity condition in each interaction.
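The effect of an entity condition can be sketched as follows: several interactions share the merged intent, and the detected entity value decides which one answers. The data layout, the interaction names and the `route` function are assumptions made for illustration; they do not describe how interactions are actually stored.

```python
# Hypothetical interactions: same intent, different entity conditions.
INTERACTIONS = [
    {"name": "screwdriver-info", "intent": "whatIs", "entity": "product", "value": "screwdriver"},
    {"name": "hammer-info", "intent": "whatIs", "entity": "product", "value": "hammer"},
]


def route(intent, entities):
    """Return the first interaction whose intent and entity condition both match.

    entities: {entity name: detected value}, e.g. {"product": "hammer"}.
    """
    for interaction in INTERACTIONS:
        if (
            interaction["intent"] == intent
            and entities.get(interaction["entity"]) == interaction["value"]
        ):
            return interaction["name"]
    return None


print(route("whatIs", {"product": "hammer"}))
print(route("whatIs", {}))  # no entity detected, so no interaction matches
```

Without the condition, every interaction bound to whatIs would match equally, so the condition is what keeps the merged intent unambiguous.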
Adding a new utterance or intent
Suppose you have a new utterance, say hello, that the bot does not understand yet.
Find the answer interaction and check whether it already has an intent.
If the right interaction does not exist, you'll have to create it.
If the intent exists, make sure your utterance is similar to its other utterances.
If the existing intent has a different meaning but the answer is the same, add a new intent; the answers can be distinguished later.
Use the NLP trainer to look for possibly conflicting intents.
The intent you're looking for may already exist. Scroll down to discover all matched intents.
If an intent is recognized, check that the connected interaction really answers your question.