Generative A.I.
Presence Penalty
The 'presence penalty' also applies a penalty on repeated tokens but, unlike the frequency penalty, the penalty is the same for all repeated tokens. A token that appears twice and a token that appears 10 times are penalized the same. This setting prevents the model from repeating phrases too often in its response. If you want the model to generate diverse or creative text, you might want to use a higher presence penalty. Or, if you need the model to stay focused, try using a lower presence penalty. Similar to temperature and top_p, the general recommendation is to alter the frequency or presence penalty, not both.
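The flat penalty described above can be sketched as follows. This is a minimal illustration, assuming the penalty is subtracted from a token's logit (its raw score before sampling) and that tokens are whole words. The function name and the sample values are hypothetical, and real providers may apply the penalty differently.

```python
def apply_presence_penalty(logits, previous_tokens, presence_penalty):
    """Subtract the same flat penalty from every token that has already appeared."""
    seen = set(previous_tokens)  # only presence matters, not the count
    return {
        token: (logit - presence_penalty) if token in seen else logit
        for token, logit in logits.items()
    }

# Hypothetical scores for three candidate next tokens.
logits = {"cat": 2.0, "dog": 2.0, "bird": 2.0}
history = ["cat", "dog", "dog", "dog"]  # "cat" once, "dog" three times

penalized = apply_presence_penalty(logits, history, presence_penalty=0.5)
print(penalized)
```

In this sketch "cat" and "dog" end up with the same reduced score even though "dog" appeared three times, while "bird" is untouched.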
Cost of LLM Creation
Due to the number of parameters used in LLMs and the huge training sets, it is normally very expensive to train these models and to execute them at inference time - especially at scale, with low latency and high concurrency. This cost is ultimately passed on to the user.
Stop Sequences
A 'stop sequence' is a string that stops the model from generating tokens. Specifying stop sequences is another way to control the length and structure of the model's response. For example, you can tell the model to generate lists that have no more than 10 items by adding "11" as a stop sequence.
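The list example above can be imitated in a few lines. This is only a sketch: a real model stops generating when it produces the stop sequence, whereas the hypothetical helper `truncate_at_stop` below cuts an already finished string at the same point, which gives the same visible result.

```python
def truncate_at_stop(text, stop_sequences):
    """Return the text cut off just before the earliest stop sequence, if any."""
    cut = len(text)
    for stop in stop_sequences:
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]

# A made-up 12-item list standing in for model output.
items = "\n".join(f"{n}. item" for n in range(1, 13))

trimmed = truncate_at_stop(items, ["11"])
print(trimmed)
```

Note that the stop sequence is a plain string match, so "11" would also end the output early if it showed up inside one of the first ten items.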
Temperature
In short, the lower the temperature, the more deterministic the results, in the sense that the most probable next token is picked more and more consistently. Increasing the temperature leads to more randomness, which encourages more diverse or creative outputs. You are essentially increasing the weights of the other possible tokens. In terms of application, you might want to use a lower temperature value for tasks like fact-based QA to encourage more factual and concise responses. For poem generation or other creative tasks, it might be beneficial to increase the temperature value.
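The effect of temperature can be seen in a small sketch, assuming the common formulation in which the model's raw scores (logits) are divided by the temperature before a softmax turns them into probabilities. The logit values are made up for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature (must be > 0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens

cold = softmax_with_temperature(logits, 0.2)  # low temperature
hot = softmax_with_temperature(logits, 2.0)   # high temperature
print(cold)
print(hot)
```

At the low temperature almost all of the probability sits on the first token, while at the high temperature the other two tokens gain weight.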
Top_p
Similarly, top_p, the parameter for a sampling technique called nucleus sampling that is used alongside temperature, lets you control how deterministic the model is at generating a response. The model samples only from the smallest set of most likely tokens whose combined probability reaches top_p. If you are looking for exact and factual answers, keep this low. If you are looking for more diverse responses, increase it to a higher value. The general recommendation is to alter temperature or top_p, not both.
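A minimal sketch of nucleus sampling's filtering step is shown here, assuming the usual definition: keep the smallest set of most likely tokens whose cumulative probability reaches top_p, then renormalize. The function name and the probabilities are hypothetical.

```python
def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    # Renormalize so the surviving probabilities sum to 1.
    return {token: p / total for token, p in kept}

# Made-up next-token probabilities.
probs = {"the": 0.5, "a": 0.3, "this": 0.15, "zebra": 0.05}

narrow = top_p_filter(probs, 0.5)  # low top_p: few candidates survive
wide = top_p_filter(probs, 0.9)    # higher top_p: more candidates survive
print(narrow)
print(wide)
```

With the low value only "the" remains, so the choice is effectively deterministic. With the higher value three tokens stay in play and the unlikely "zebra" is still cut.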
Frequency Penalty
The 'frequency penalty' applies a penalty on the next token proportional to how many times that token has already appeared in the response and prompt. The higher the frequency penalty, the less likely a word will appear again. This setting reduces the repetition of words in the model's response by giving a higher penalty to tokens that appear more often.
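The proportional penalty can be sketched as follows, assuming it is subtracted from a token's logit once per earlier occurrence and that tokens are whole words. The function name and values are hypothetical, and real providers may use a different formula.

```python
from collections import Counter

def apply_frequency_penalty(logits, previous_tokens, frequency_penalty):
    """Subtract the penalty once for every earlier occurrence of each token."""
    counts = Counter(previous_tokens)  # missing tokens count as 0
    return {
        token: logit - frequency_penalty * counts[token]
        for token, logit in logits.items()
    }

# Hypothetical scores for three candidate next tokens.
logits = {"cat": 2.0, "dog": 2.0, "bird": 2.0}
history = ["cat", "dog", "dog", "dog"]  # "cat" once, "dog" three times

penalized = apply_frequency_penalty(logits, history, frequency_penalty=0.5)
print(penalized)
```

Here "dog" is pushed down three times as far as "cat" because it appeared three times as often.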
Max Length
You can manage the number of tokens the model generates by adjusting the 'max length'. Specifying a max length helps you prevent long or irrelevant responses and control costs.
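How a max length caps generation can be shown with a toy loop. This is a sketch only: `generate` and `endless_model` are hypothetical stand-ins, not a real model API, and the stand-in "model" would keep producing tokens forever if nothing stopped it.

```python
def generate(next_token, max_length):
    """Collect tokens until the model stops on its own or max_length is reached."""
    tokens = []
    while len(tokens) < max_length:
        token = next_token(tokens)
        if token is None:  # the model chose to stop early
            break
        tokens.append(token)
    return tokens

def endless_model(tokens_so_far):
    """Stand-in for a model that never stops by itself."""
    return f"tok{len(tokens_so_far)}"

output = generate(endless_model, max_length=5)
print(output)
```

Because billing is usually tied to the number of tokens generated, the same cap that bounds the response length also bounds the cost of a single request.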
Risk of Spectacular Fails
As the quote goes, "Trust Takes Years To Build, Seconds To Break And Forever To Repair." One bad failure in an application where an LLM is to blame can erase the gains from all the other times it worked properly. This is causing practitioners to ease their way into the LLM space, which is a prudent approach but one that will also reduce the pace of adoption.
Impersonal
For LLM applications that perform generative or search tasks, the results can often be too generic, based more on the training data (which is usually publicly available data from the Internet) than on an organization's specific data, which was not used during training. This can lead to impersonal user experiences or, even worse, incomplete or incorrect results.
Hallucination
For LLMs that generate text, it is common for them to include plausible-sounding falsehoods within their responses. This normally happens because the model doesn't have all the relevant information it needs, but a malevolent actor could bias a model in a specific direction (and likely already has), using false or misleading content to achieve that outcome. In some cases this is merely comical, but there is real risk if people - or other applications - assume these responses are true and act on them.
Cost of LLM Based Solutions
It's now alluringly easy to wire together a proof of concept (POC) that uses one or more LLMs to implement a given use case. But it's expensive to build and maintain a production-grade solution that is low latency, scalable, and cost-effective. This is due largely to the inference-time costs mentioned above, and to the expensive and rare machine learning talent needed to keep you at the forefront of the fast-moving LLM space. It will frequently make more economic sense to buy a solution as opposed to building one, and there will be many vendors who offer this option.
Interpretability
It's often hard, or impossible, to know why the LLM did what it did. In some cases this is benign; in other cases it engenders mistrust and will be an inhibitor to adoption; and in still other cases it will be a complete showstopper (e.g. in regulated industries).