By Chaitanya Chokkareddy, CIO, Ozonetel
Language is a funny thing. Sometimes a whole article cannot evoke anything, yet a single word can evoke a lot. Take the word “learning”. The moment we see it, we think back to our own learning process and assume that the learning someone else is talking about is similar.
So, what exactly is GPT-3? GPT-3 is a computer program created by the San Francisco startup OpenAI. It is a huge neural network, part of the deep learning segment of AI and ML, with great potential for automating tasks. When GPT-3 was introduced in the paper titled “Language Models are Few-Shot Learners”, the feeling was that language models were close to that holy grail: we give a few examples and the model learns a new concept.
Since GPT-3 has been trained on a lot of data, few-shot prompting is, for almost all practical cases, equivalent to learning. But semantically it is not actually learning; it is regurgitating from the huge corpus of data it has already seen.
We can compare this to the way students learn for exams (especially in India). One option is to learn every single page of the textbook by rote. The other is to actually understand the subject, so that variations do not trip up the learner. The examination system in India has shown that it takes a really good examiner, or examination paper, to find out who has actually learnt the subject and who has just mugged up the whole syllabus. Invariably, the student who has committed the whole textbook to memory gets the best marks, by pattern-matching the questions and spitting out the most probable answers. But when a tricky question is asked, that student stumbles.
GPT-3 is like that student who has learnt everything by rote. It is very hard to trip it up, because almost everything is in its memory; it is hard to find a topic it does not already know something about.
But when it does encounter a topic it has no idea about (such as COVID-19, since the data GPT-3 was trained on predates the pandemic), it fails badly.
GPT-3 is just average on regular NLP tasks such as summary generation, Winograd schemas, translation, closed-book question answering, reading comprehension, common-sense reasoning, SuperGLUE and NLI. Where GPT-3 is at its best is natural language generation (NLG).
Here we will list some experiments we tried, to see if GPT-3 really does few-shot learning. Its competitor was a six-year-old kid.
The kid was able to crack all of these few-shot examples.
Experiment 1: Copycat analogies, from Melanie Mitchell. The kid was able to get the answers after a few examples. GPT-3 mostly fails. We tried multiple other priming examples too, but could not make GPT-3 learn copycat analogies.
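To make the task concrete, here is a toy sketch of one classic Copycat analogy (“if abc changes to abd, what does ijk change to?”). The rule and prompt format below are illustrative assumptions in the style of Mitchell's Copycat domain, not the exact prompts we used:

```python
def successor(ch):
    """Next letter in the alphabet, e.g. 'c' -> 'd'."""
    return chr(ord(ch) + 1)

def apply_last_letter_rule(word):
    """Rule a human infers from the example 'abc -> abd':
    replace the last letter with its alphabetic successor."""
    return word[:-1] + successor(word[-1])

# A few-shot prompt in this style (format is a hypothetical illustration):
prompt = (
    "Q: if abc changes to abd, what does pqr change to?\n"
    "A: pqs\n"
    "Q: if abc changes to abd, what does ijk change to?\n"
    "A:"
)
print(apply_last_letter_rule("ijk"))  # the answer the kid gives: ijl
```

The kid infers the abstract rule (“bump the last letter”) from one example; GPT-3 tends to pattern-match on surface strings instead.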
Experiment 2: The P language game. P language is a language in which a “p” is appended to every word. GPT-3 works for the basic version; it has understood that you should append a “p” to every word. But if we change the game so that only some words change, animal names for example, then GPT-3 stumbles while the kid does not.
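The two variants of the game can be sketched in a few lines; the animal word list here is a hypothetical stand-in for whatever category the game picks:

```python
def to_p_language(sentence):
    """Basic version: append 'p' to every word."""
    return " ".join(w + "p" for w in sentence.split())

ANIMALS = {"cat", "dog", "lion"}  # hypothetical category list for the variant

def to_animal_p_language(sentence):
    """Variant: append 'p' only to animal names."""
    return " ".join(w + "p" if w in ANIMALS else w for w in sentence.split())

print(to_p_language("the cat sat"))         # -> thep catp satp
print(to_animal_p_language("the cat sat"))  # -> the catp sat
```

GPT-3 picks up the unconditional rule but not the conditional one; the kid handles both.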
Experiment 3: Reversing words. GPT-3 does not learn from a few shots that it has to reverse the words. The kid got it in two sentences.
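Assuming the task was to reverse the order of words in a sentence (the article does not spell out the exact variant), the target behaviour is trivial to state:

```python
def reverse_words(sentence):
    """Reverse the order of words in a sentence,
    the rule the kid inferred after two example sentences."""
    return " ".join(reversed(sentence.split()))

print(reverse_words("GPT-3 cannot do this"))  # -> this do cannot GPT-3
```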
Experiment 4: Training GPT-3 to reject words. GPT-3 works well when asked to replace specified words.
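The replacement behaviour GPT-3 handled well amounts to a word-level substitution; the mapping below is a hypothetical example, not the word list we actually used:

```python
def replace_words(sentence, mapping):
    """Replace specified words via a lookup table, leaving others untouched."""
    return " ".join(mapping.get(w, w) for w in sentence.split())

print(replace_words("I hate mondays", {"hate": "dislike"}))  # -> I dislike mondays
```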
Experiment 5: Creating opposite sentences. This mostly works, though it suddenly gives weird answers at times. We can still consider it a win.
After the regular experiments, let’s talk about “synthetic and qualitative tasks”. This is where the actual “few shot learning” capabilities are discussed.
1. Arithmetic: Wouldn’t it be awesome if a language model could learn math? But does it? 2-digit arithmetic is good; the rest is doubtful. The log probabilities show that the model predicts the whole 2-digit answer as one token rather than working digit by digit. It also does not seem to handle numbers written out as words, and misses sometimes.
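A few-shot arithmetic probe might be assembled like this; the exact priming format we used is not given in the article, so the Q/A layout below is an assumption:

```python
import random

random.seed(0)  # reproducible example questions

def make_prompt(n_examples, query):
    """Build a hypothetical few-shot prompt for 2-digit addition:
    n_examples solved Q/A pairs, then the unsolved query."""
    lines = []
    for _ in range(n_examples):
        a, b = random.randint(10, 99), random.randint(10, 99)
        lines.append(f"Q: What is {a} plus {b}? A: {a + b}")
    lines.append(f"Q: What is {query[0]} plus {query[1]}? A:")
    return "\n".join(lines)

print(make_prompt(3, (47, 36)))
```

Checking the model's per-token log probabilities on the answer is what reveals whether it predicts “83” as a single memorised token or digit by digit.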
2. Word Scrambling and Manipulation Tasks: We tried the cycle letters in word (CL), anagrams of all but the first and last character (A1), anagrams of all but the first and last 2 characters (A2), random insertion in word (RI) and reversed words (RW) tasks. The poor performance may be down to BPE encoding; but then, we can prime the model to output individual words, so we are not sure this theory holds. We tried the above tasks with different experiments and it is mostly hit or miss. These tasks are, in any case, the same kind of thing any language model does: if you ask Google for “criroptuon”, it says “did you mean corruption”.
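For reference, here is a sketch of how the five corruptions can be generated; these follow the task names above, though the exact generation details (e.g. which characters RI inserts) are assumptions:

```python
import random

random.seed(42)  # reproducible shuffles

def cycle_letters(word):
    """CL: cycle the letters of the word by one position."""
    return word[1:] + word[0]

def anagram_inner(word, keep=1):
    """A1 (keep=1) / A2 (keep=2): shuffle everything except
    the first and last `keep` characters."""
    if len(word) <= 2 * keep:
        return word
    mid = list(word[keep:-keep])
    random.shuffle(mid)
    return word[:keep] + "".join(mid) + word[-keep:]

def random_insertion(word):
    """RI: insert a random punctuation/space character between letters
    (assumed character set)."""
    return "".join(ch + random.choice(" .,!") for ch in word[:-1]) + word[-1]

def reversed_word(word):
    """RW: spell the word backwards."""
    return word[::-1]

print(cycle_letters("word"), reversed_word("word"), anagram_inner("corruption"))
```

The model's job in each case is to recover the original word from the corrupted one.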
3. SAT Analogies: GPT-3 performs well on this.
4. News Article Generation: Without a doubt, this is where GPT-3 shines.
5. Learning and Using Novel Words: GPT-3 is pretty good at this, though it mostly works for single novel words. When you try to teach it multiple words at the same time, it gets confused.
6. Correcting English Grammar and Spelling Mistakes: GPT-3 works perfectly for this scenario, and we can easily use it in production. Our aim throughout was to look at where GPT-3 can be used in production and where it fails, mainly from a real-world perspective. We have run numerous other experiments with this in mind, and some concluding observations: GPT-3 is not good at creating summaries but works great for slot filling; it cannot be used reliably to detect curse words in text; and using GPT-3 as a chatbot backend is possible but tough, especially for long conversations.
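The grammar-correction setup can be sketched as a simple few-shot prompt; the example sentence pairs below are hypothetical illustrations, not our production prompts:

```python
# Hypothetical few-shot examples for grammar correction.
EXAMPLES = [
    ("She no went to the market.", "She didn't go to the market."),
    ("I has two apple.", "I have two apples."),
]

def build_prompt(bad_sentence):
    """Assemble solved Incorrect/Correct pairs, then the sentence to fix;
    the model's completion after the final 'Correct:' is the correction."""
    lines = [f"Incorrect: {bad}\nCorrect: {good}" for bad, good in EXAMPLES]
    lines.append(f"Incorrect: {bad_sentence}\nCorrect:")
    return "\n".join(lines)

print(build_prompt("He go to school yesterday."))
```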