| What's the difference between fine-tuning and RAG? |
RAG retrieves your docs at query time and injects them into the prompt, so it supplies facts. Fine-tuning updates the model's weights to teach behavior: style, tone, format. Try RAG first; fine-tune only when prompting + RAG genuinely can't get there. |
| How would you reduce hallucinations in production? |
Ground the model with RAG and require citations, force structured output via JSON schema + Pydantic, lower temperature for factual tasks, and re-run a held-out eval set on every release. If the cost of a wrong answer is high, add a verification step or a human in the loop. |
| What is a token, and why does it matter? |
A unit of text the model reads: roughly 4 characters of English, or about ¾ of a word. Both cost and latency scale with token count, so logging tokens per request is one of the best habits for improving cost, speed, and quality. |
| When would you NOT use an LLM? |
When determinism matters (a regex or parser is better), when the input is structured data (use SQL), when the latency budget is tight, or when failure is unacceptable without human review (medical, legal, financial decisions). |
| How do you evaluate an LLM application? |
Build a 20–100 example eval set with input + expected output before you ship. Score with exact match, embedding similarity, LLM-as-judge, or human review depending on the task — and re-run on every change. If you can't measure it, you can't improve it. |
| What is prompt injection and how do you defend against it? |
Instructions smuggled into the prompt to override system rules ("ignore previous instructions and..."), either typed by a user or hidden in content the model reads, such as retrieved documents or tool output. Defenses: treat user input and retrieved content as data, not instructions; use tool/function calling instead of free-text actions for sensitive operations; and require out-of-band confirmation for destructive actions. |
| Why does temperature matter, and what value should I use? |
It controls sampling randomness. Use lower values (at or near 0) for extraction, classification, and structured output; use higher values for writing or brainstorming. Valid ranges differ by provider, so always test on your own examples. |
| Hosted (GPT/Claude/Gemini) or open-source (Llama/Mistral)? |
Hosted APIs are usually easiest to start with and charge per token. Open-weight models can help with privacy, control, and scale, but you operate more infrastructure. Start hosted for learning, then compare open models when cost or compliance matters. |
| How would you design a RAG chatbot for company policies? |
Ingest docs with metadata, chunk by section, embed, store in a vector DB, retrieve top-k, re-rank, assemble a grounded prompt, and return citations. Then evaluate with known policy questions, missing-answer cases, and citation checks. |
| What makes an agent different from a normal chatbot? |
A chatbot usually answers from conversation context; an agent can choose tools, observe results, update state, and continue toward a goal. Production agents need typed tools, iteration caps, logs, and budget controls. |
| How do you keep structured output reliable? |
Use the provider's structured-output mode or function calling, validate with Pydantic/JSON Schema, and retry with the validation error when parsing fails. Keep temperature low and test with malformed, empty, and adversarial inputs. |
| How would you reduce LLM app latency and cost? |
Use a smaller model where possible, trim context, cache repeated retrievals/responses, stream output, batch offline work, and stop generating early. Log tokens, latency, and model choice so optimization is based on data. |
| What should be in a GenAI portfolio case study? |
Problem, demo link, architecture diagram, setup, example prompts, screenshots, eval results, cost/latency notes, limitations, and future improvements. It should prove you can think like an engineer, not only follow a tutorial. |
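
The "validate, then retry with the validation error" loop from the structured-output answer can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: `call_model` is a hypothetical function standing in for whatever provider client you use, `REQUIRED_FIELDS` is a made-up schema, and the hand-written type check is a standard-library stand-in for Pydantic or JSON Schema validation.

```python
import json
from typing import Callable

# Hypothetical schema for illustration: field name -> expected Python type.
REQUIRED_FIELDS = {"answer": str, "sources": list}


def validate(raw: str) -> dict:
    """Parse raw model text and check it against REQUIRED_FIELDS.

    Raises ValueError with a message short enough to send back to the model.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Output is not valid JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise ValueError("Output must be a JSON object")
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in data:
            raise ValueError(f"Missing required field '{name}'")
        if not isinstance(data[name], expected_type):
            raise ValueError(
                f"Field '{name}' must be of type {expected_type.__name__}"
            )
    return data


def generate_structured(
    call_model: Callable[[str], str],  # assumed: takes a prompt, returns text
    prompt: str,
    max_attempts: int = 3,
) -> dict:
    """Call the model, validate the reply, and retry with the error on failure."""
    current_prompt = prompt
    last_error = None
    for _ in range(max_attempts):
        raw = call_model(current_prompt)
        try:
            return validate(raw)
        except ValueError as exc:
            last_error = exc
            # Feed the validation error back so the model can correct itself.
            current_prompt = (
                f"{prompt}\n\nYour previous reply was rejected: {exc}. "
                "Reply with a single JSON object only."
            )
    raise RuntimeError(
        f"No valid output after {max_attempts} attempts: {last_error}"
    )
```

Passing the model call in as a function keeps the loop testable with a fake that returns canned replies, which is also a convenient way to exercise the malformed, empty, and adversarial cases mentioned in the table. The attempt cap matters for the same reason iteration caps matter for agents: a reply that never validates should fail loudly rather than loop and accumulate cost.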