Chapter 2: How Does Generative AI Work?

Chapter Overview

In Chapter 1, we discovered that generative AI is not just about classification or data analysis; rather, it creates original text and ideas based on patterns it has learned from existing information. We explored how these tools can streamline tasks like contract review and legal research, saving time and energy for legal professionals. We also discussed their limitations, such as the tendency to provide confident-sounding but erroneous answers, and the importance of using them responsibly.

In this chapter, we peel back the curtain on how these generative AI tools actually work. We will go step by step through the technical components, but in a way that remains accessible to a general audience (think: high school–level explanations). We will focus on the key ideas behind concepts like neural networks, large language models, the transformer architecture and its “attention” mechanism, training through gradient descent, embeddings, and reinforcement learning from human feedback.

By the end, you will be able to understand (1) how these systems process language, (2) what makes them both powerful and fallible, and (3) how to begin integrating them thoughtfully into your legal practice. We will also lay the groundwork for our next chapter, which focuses on specific AI tools, including ChatGPT and Claude, to better understand their practical use cases.


From Conceptual Understanding to Technical Foundations

We often hear people talk about AI as if it were magic: “It just knows how to write a motion or contract.” But as future legal professionals, it’s essential to develop AI literacy: the ability to look beyond the “black box” mystique and grasp the essentials of how AI systems work. This understanding, even at a high level, will help you:

Assess AI tools critically

Use AI responsibly

Communicate effectively with technical teams

Leverage AI’s strengths

Throughout this chapter, we will keep the explanations as simple as possible, sometimes using analogies and everyday language. For those wanting a deeper dive, look for the optional “Callouts and Key Terms” or “Practice Pointers” that give additional detail.


Defining Artificial Intelligence

Artificial Intelligence (AI) is a broad field focused on creating computer programs that can perform tasks that normally require some level of human intelligence. These tasks range from recognizing speech or images to writing entire legal documents. While AI can sometimes seem magical, it is ultimately about pattern recognition: software that detects structures in data and uses those structures to make predictions or decisions.

Machine Learning: A Subset of AI

Within AI, one of the most important and fast-growing areas is machine learning (ML). Rather than manually programming rules for every scenario (which is nearly impossible for complex tasks like natural language understanding), ML systems learn automatically from examples.

Types of Machine Learning

Supervised Learning: the system learns from examples that come paired with correct answers (labels), then predicts the answer for new examples.

Unsupervised Learning: the system looks for structure in unlabeled data on its own, such as grouping similar documents together.

Reinforcement Learning (RL): the system learns by trial and error, receiving rewards for good actions and penalties for bad ones (we return to this later in the chapter).

Deep Learning: Where Neural Networks Come In

Deep Learning (DL) is a special branch of machine learning that involves neural networks with multiple layers. Each layer captures increasingly complex patterns. Think of it as a multi-layered structure that can start by recognizing letters, then words, then sentences, and so forth. When you hear about breakthroughs in image recognition, speech-to-text, or language generation, there’s a good chance it’s powered by deep learning.

Reference Note

For an in-depth, technical view of deep learning and neural networks, you might look at Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville (2016). They break down how these networks are constructed and trained to handle complex tasks.


General vs. Narrow AI: What’s the Difference?

When discussing AI, it’s helpful to distinguish between two visions:

Artificial General Intelligence (AGI) or Strong AI: a hypothetical system that could learn, reason, and adapt across virtually any task a human can. No such system exists today.

Narrow AI or Weak AI: a system built for a specific range of tasks, such as generating text or recognizing images. Every AI tool available today, including large language models, falls into this category.

Practice Pointer

Don’t be fooled by how “intelligent” a large language model seems. It’s still considered narrow AI. It can do many language-related tasks, but it doesn’t “understand” in the same way a human does, nor can it pivot to solve unrelated tasks like robotics (unless it is specifically designed or fine-tuned to do so).


What Is a Neural Network?

A neural network is a computational model inspired by the structure of the human brain (though it’s much simpler, so the analogy is limited). In our brains, billions of neurons pass electrical signals to each other to interpret the world and drive our actions. An artificial neural network mimics this idea with layers of artificial “neurons.”

Perceptron: The Simplest Building Block

The most basic form of an artificial neuron is called a perceptron. Think of a perceptron as a tiny decision-maker that receives inputs (numbers), multiplies them by some “importance factors” (called weights), sums them up, and passes them through a simple rule. If the total is above a certain threshold, the perceptron outputs 1 (like “Yes”); if not, it outputs 0 (like “No”).

Analogy: Brain Neuron

A single human neuron fires an electrical signal if it receives enough of the right inputs from other neurons. Similarly, a perceptron “fires” if the weighted inputs exceed a certain threshold.
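To make this concrete, here is a minimal sketch in Python of the perceptron rule described above. Everything in it is made up for illustration: the inputs, the weights, and the threshold are toy numbers, not values from any real system.

    # A tiny perceptron: weighted sum of inputs, compared against a threshold.
    def perceptron(inputs, weights, threshold):
        total = sum(x * w for x, w in zip(inputs, weights))  # multiply and sum
        return 1 if total > threshold else 0                 # "fire" only above the threshold

    # Hypothetical decision: should this clause be flagged for review?
    # Inputs: [mentions liability?, mentions a deadline?, unusually long sentence?]
    inputs = [1, 0, 1]
    weights = [0.6, 0.2, 0.3]   # "importance factors" a real network would learn
    print(perceptron(inputs, weights, threshold=0.5))   # prints 1, i.e., "Yes, flag it"

The interesting part is not this single decision but what happens when you connect many such units and let training choose the weights, which is what the next section describes.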

How Do These Networks Learn?

When you stack thousands or even millions of perceptrons into multiple layers, you get a deep neural network. These networks learn patterns from lots of data. During training, the network repeats a simple loop:

  1. The network receives an example (like a sentence or an image).

  2. It makes a prediction (which could be the next word in a sentence).

  3. It checks how close that prediction was to the correct label or outcome.

  4. It adjusts the weights and biases so next time, the prediction gets closer to correct.

Over many iterations, the network finds patterns that allow it to make increasingly accurate predictions.
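Here is a minimal Python sketch of that loop, assuming an absurdly simplified “network” with a single weight that must learn the made-up rule y = 3 × x. Real networks repeat the same idea with billions of weights and far richer data.

    # Toy training data: (input, correct answer) pairs following y = 3 * x.
    examples = [(1, 3), (2, 6), (3, 9), (4, 12)]
    weight = 0.0            # the model starts out knowing nothing
    learning_rate = 0.01    # how big each adjustment step is

    for epoch in range(200):                       # many passes over the data
        for x, target in examples:                 # step 1: receive an example
            prediction = weight * x                # step 2: make a prediction
            error = prediction - target            # step 3: compare to the correct answer
            weight -= learning_rate * error * x    # step 4: nudge the weight toward correct

    print(round(weight, 3))   # ends up very close to 3.0

No single adjustment is dramatic; thousands of small corrections add up to an accurate model.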


Large Language Models

Now, let’s zoom in on large language models (LLMs), the AI systems that power tools like ChatGPT and Claude. As the name suggests, they are big neural networks designed specifically for language tasks. Their size is often described in terms of parameters; think of each parameter as a dial that the training process fine-tunes to recognize linguistic patterns.

Scale: Why Are They Called “Large”?

The idea is that more parameters often give a model the ability to capture more nuanced patterns. If a model is “too small,” it might not learn the richness and variety of human language; as models grow larger, they can capture far more subtle context. This is one reason advanced LLMs can produce surprisingly coherent, human-like text.

Example and Scenario

Key Term Callout: “Parameter”

Each parameter in an LLM is like a tiny dial the model adjusts during training to reduce errors. Examples of parameters include the model's weights and biases. More parameters mean more dials, and typically, more capacity to represent complex language patterns.
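A rough back-of-the-envelope calculation shows why parameter counts balloon so quickly. The numbers below are invented for illustration; they describe a single fully connected layer, not any particular model.

    # One layer connecting 1,000 inputs to 1,000 outputs.
    inputs, outputs = 1_000, 1_000
    weights = inputs * outputs   # one weight for every input-output connection
    biases = outputs             # one bias per output neuron
    print(weights + biases)      # 1,001,000 parameters in this single layer

Stack dozens of much wider layers on top of each other and the total quickly reaches the billions you hear quoted for modern LLMs.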


The Transformer Architecture: “Attention Is All You Need”

A huge breakthrough in language processing came in 2017 with a paper titled “Attention Is All You Need,” which introduced the transformer architecture. Before transformers, AI models that handled text typically processed words in order, one after another, using recurrent neural networks (RNNs). Processing words sequentially seems logical, but the meaning of a word often depends on its relationships with other words elsewhere in the sentence. Transformers changed the game by allowing the model to look at all words in a sentence at the same time and figure out which ones are most important to each other.

What Is “Attention”?

“Attention,” in this context, means the ability of the model to weigh how relevant one word (or part of a sentence) is to another word or part of a sentence. The model doesn’t just read from left to right. Instead, it learns that, for example, in the sentence “The lawyer who was very tired argued the case,” the word “lawyer” is strongly connected to “argued,” while “tired” modifies “lawyer.” This helps the model keep track of context over long sentences.

Analogy: Spotlight on Stage

Picture multiple actors on stage delivering lines. “Attention” is like a movable spotlight that highlights the most relevant actor(s) at any moment, allowing you (or the AI) to focus on the key interactions.

Detailed Example:

“When Jane realized she had forgotten her bag, she rushed back to the store, where Mark had hidden it behind the counter so no one else would take it.”

In this single sentence, the correct interpretation of words like “she,” “her,” “Mark,” and “it” depends on the relationships among them:

  1. Who is “she”? It points back to “Jane.”

  2. Whose bag is it? The word “her” (in “her bag”) also refers to Jane.

  3. What is “it”? “It” refers to the same bag mentioned earlier.

  4. What did Mark hide, and why? Mark’s action of hiding the bag behind the counter so that “no one else would take it” clarifies both his role and the function of “it.”

A language model equipped with self-attention (and cross-attention in multi-sentence contexts) can examine these words and their references simultaneously. Rather than simply reading each word in a linear fashion, the model creates a “map” of relationships among tokens. This allows it to recognize that “her bag” and “it” both point to the same object, that “she” is the same person as “Jane,” and that “Mark” is a different individual performing a distinct action. These interrelationships seem like common sense to us, but they can easily trip up less sophisticated language models. By tracking these interdependencies, the model demonstrates a form of “contextual understanding,” which is pivotal for interpreting meaning accurately in complex sentences.
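To see the arithmetic behind “paying attention,” here is a minimal Python sketch using NumPy. The word vectors are tiny, made-up numbers chosen only to illustrate the idea; a real transformer learns vectors with thousands of dimensions and computes attention in every layer.

    import numpy as np

    words = ["the", "lawyer", "argued", "the", "case"]
    vectors = np.array([
        [0.1, 0.0, 0.1],   # the
        [0.9, 0.8, 0.1],   # lawyer
        [0.8, 0.9, 0.2],   # argued
        [0.1, 0.0, 0.1],   # the
        [0.7, 0.2, 0.9],   # case
    ])

    query = vectors[2]                                 # ask: what is relevant to "argued"?
    scores = vectors @ query                           # similarity of each word to "argued"
    weights = np.exp(scores) / np.exp(scores).sum()    # softmax: weights sum to 1

    for word, w in zip(words, weights):
        print(f"{word:8s} {w:.2f}")                    # "lawyer" gets far more weight than "the"

The content words (“lawyer,” “argued,” “case”) end up with most of the weight, while the filler word “the” gets very little, which is exactly the behavior the spotlight analogy describes.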

Why Transformers Changed the Game

  1. Parallel Processing: Instead of reading text sequentially, transformers analyze entire sentences (or paragraphs) in parallel, which is faster and more efficient.

  2. Better Long-Range Context: The model can connect words or phrases that are far apart in the text. In legal documents, context from the beginning of a paragraph can be critical at the end.

  3. Scalability: Transformers scale very well to large amounts of data and large network sizes (hence “large language models”).

This transformer-based approach is why ChatGPT and similar tools can produce well-structured, contextually relevant paragraphs. They’re essentially experts at “paying attention” to the right parts of a sentence.


Interpolation vs. Extrapolation

An important limitation of LLMs (and AI in general) is the difference between interpolation and extrapolation. Interpolation means filling in an answer that lies within the range of patterns the model has already seen; extrapolation means reaching beyond that range into genuinely new territory.

LLMs are generally good at interpolation because they are experts at spotting and replicating patterns they’ve seen. But they are not so good at true extrapolation: if you ask them something far outside their training data, they may give nonsensical or made-up answers. This is one reason LLMs tend to hallucinate; rather than drawing on relevant experience, they are, in a sense, making it up. This is a critical point in legal settings, where a unique or unprecedented scenario might arise and the model could fail to respond accurately.

Practice Pointer

Always remember that an LLM’s knowledge is bounded by what it has seen. If your legal scenario is highly novel or cutting-edge, rely more on human legal expertise and research rather than a model’s guesses.


Weights, Biases, and Parameters

Let’s circle back to some foundational concepts in neural networks:

Weights: numbers that determine how strongly each input influences an artificial neuron’s output.

Biases: numbers added to a neuron’s weighted sum, shifting how easily it activates.

Parameters: the collective term for all the weights and biases a model learns during training.

When someone says a model has “billions of parameters,” they mean billions of these numerical weights and biases. Training is the process of adjusting all these parameters so the model performs better on the task at hand.

Example

In a simplified sense, if we have an input word “contract,” the network might use a certain weight to link it strongly with “legal obligations” in the next layer. If that weight is too high or too low, the model might overemphasize or underemphasize certain words in its predictions.


What Is Gradient Descent?

Gradient descent is the method most commonly used to train neural networks. Think of it as a systematic way of tuning weights and biases to reduce errors.

Analogy: Climbing Down a Hill

Imagine standing on a foggy hillside trying to reach the lowest point in the valley. You can’t see far, so you test small steps in different directions. If one step moves you downward, you keep going that way. If you go up, you backtrack. Over time, you (hopefully) reach the bottom.

In the same way, a neural network adjusts its parameters in tiny increments, guided by how much these adjustments reduce (or increase) the overall error on training examples.

The key to gradient descent is having a “loss function,” which measures how far off the model’s predictions are from the desired result. Each training step tries to minimize this loss. The ideal would be for training to result in zero loss, meaning the model’s predictions perfectly match the target outputs. While this can happen (especially for very simple datasets or overly flexible models), achieving literally zero loss is relatively rare.

Another way to think of it is to flip the numbers around: the model wants to score 100% and get everything right. It's graded at every training step so it knows how far off it is from perfection. If it scores 90%, it tries to adjust its strategy to get 10% better. Currently, getting 90%+ for LLMs is, like it is for us humans, pretty good!
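Here is a minimal Python sketch of gradient descent on a made-up, bowl-shaped loss function with a single parameter. It is the “walking downhill” analogy in code, nothing more.

    # A toy loss function whose lowest point (zero loss) sits at p = 5.
    def loss(p):
        return (p - 5) ** 2

    def slope(p):
        return 2 * (p - 5)     # the gradient: which way is uphill, and how steep

    p = 0.0                    # initial guess, far from the bottom of the valley
    learning_rate = 0.1

    for step in range(50):
        p -= learning_rate * slope(p)   # take a small step downhill

    print(round(p, 4), round(loss(p), 8))   # p is now very close to 5, and the loss is near zero

Real training does exactly this, except the “hill” has billions of dimensions (one per parameter) and the slope is estimated from batches of training examples.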

Call Out: The Problem of Overfitting

For Humans: Think of overfitting like a student who memorizes every word in a textbook but never truly learns the underlying concepts. They might ace a practice test because it uses the exact same examples, but when given new questions, they struggle to apply their knowledge.

For AI: A model that is “overfitted” has learned the training data too well, picking up not just meaningful patterns but also noise and irrelevant details. As a result, it performs impressively on the examples it was trained on, yet falls short when it encounters new, unseen data.

Why It Matters: In the context of law (and any real-world application), an overfitted AI tool can give misleading or incorrect results when faced with novel scenarios. Balancing how much a model learns from training data without memorizing every quirk is key to building reliable and trustworthy AI systems.
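The following Python sketch (using NumPy) shows overfitting in miniature. The data, the noise level, and the choice of a degree-7 polynomial are all arbitrary; the point is only the contrast between memorizing the training points and capturing the underlying trend.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy training data: a simple underlying trend (y = 2x) plus a little noise.
    x_train = np.linspace(0, 1, 8)
    y_train = 2 * x_train + rng.normal(0, 0.1, size=x_train.size)

    overfit = np.polyfit(x_train, y_train, deg=7)   # flexible enough to hit every training point
    simple = np.polyfit(x_train, y_train, deg=1)    # a plain straight line

    # On the training data, the flexible model looks perfect...
    print(np.abs(np.polyval(overfit, x_train) - y_train).max())   # essentially zero

    # ...but on a new, unseen input the picture flips.
    x_new = 1.5   # the true value of the underlying trend here is 3.0
    print(np.polyval(overfit, x_new))   # typically lands far from 3.0
    print(np.polyval(simple, x_new))    # stays close to 3.0

The “student who memorized the textbook” aces the questions it has already seen and stumbles on the new one.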


Vectors and Embeddings

A fundamental idea in language models is representing words (and sometimes sentences or entire documents) as vectors: lists of numbers, where each number (or “dimension”) captures some aspect of meaning or an attribute.

A Simple Analogy

Suppose you want to describe a friend. You might list attributes such as how tall they are, how outgoing they are, what kind of humor they have, and what topics they love to talk about.

Each of these attributes is one "dimension" in a vector. If you know enough attributes (dimensions), you can uniquely describe your friend compared to everyone else.

Word Embeddings

In language models, words are also turned into these multi-dimensional vectors called embeddings. If two words frequently appear in similar contexts (like “contract” and “agreement”), their embeddings will be similar. This helps the model “understand” relationships between words in a numeric way.

Example
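Here is a minimal Python sketch of the idea, using three-dimensional vectors invented purely for illustration; real embeddings have hundreds or thousands of dimensions and are learned from data rather than written by hand.

    import numpy as np

    embeddings = {
        "contract":  np.array([0.90, 0.80, 0.10]),
        "agreement": np.array([0.85, 0.75, 0.20]),
        "banana":    np.array([0.10, 0.20, 0.90]),
    }

    def similarity(a, b):
        # Cosine similarity: close to 1 means "pointing the same way" (similar meaning).
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    print(similarity(embeddings["contract"], embeddings["agreement"]))  # high, about 0.99
    print(similarity(embeddings["contract"], embeddings["banana"]))     # much lower, about 0.30

Because “contract” and “agreement” point in nearly the same direction, the model treats them as related, which is the numeric sense in which it “understands” similar meaning.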

Practice Pointer

Embeddings also explain why models might get confused between words that show up in similar contexts. If “defendant” and “respondent” appear in similar environments, a model might occasionally mix them up.


Reinforcement Learning

We touched on reinforcement learning (RL) earlier. Instead of just training on fixed examples, RL has the model interact with an environment. It receives rewards for good actions and penalties for bad ones.
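As a minimal illustration of the reward-and-penalty idea, here is a Python sketch of an agent choosing between two made-up actions, where action “b” secretly pays off more often. Real RL systems, and RLHF in particular, are far more elaborate, but the core loop of act, receive feedback, and adjust is the same.

    import random

    payoff = {"a": 0.3, "b": 0.7}   # hypothetical chance each action earns a reward
    value = {"a": 0.0, "b": 0.0}    # the agent's running estimate of each action's worth
    learning_rate = 0.1

    for step in range(1000):
        # Usually pick whichever action currently looks best, but explore occasionally.
        if random.random() < 0.1:
            action = random.choice(["a", "b"])
        else:
            action = max(value, key=value.get)
        reward = 1.0 if random.random() < payoff[action] else 0.0   # reward or penalty
        value[action] += learning_rate * (reward - value[action])   # nudge the estimate

    print(value)   # the estimate for "b" ends up higher, so the agent learns to prefer "b"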

RL from Human Feedback (RLHF)

ChatGPT famously uses a version of RL called Reinforcement Learning from Human Feedback (RLHF). Humans rate the model’s responses, effectively telling it which answers are better or worse. The model uses these ratings to adjust its parameters. Over multiple rounds, it gets better at producing the kind of answers humans find helpful.

Why Did RLHF Make ChatGPT So Good?

Regular language models might produce correct but confusing answers, or answers that are correct in form but irrelevant in content. RLHF aligns the model with human preferences, so it tends to produce responses that are both accurate (most of the time) and helpful in tone.

Janelle Shane’s Quote

“The danger of AI is not that it’s too smart but that it’s not smart enough.” – from You Look Like a Thing and I Love You (2019)

This speaks to the fact that AI can seem brilliant in one moment and then make a glaringly obvious mistake the next. Reinforcement learning with human feedback partially helps, but it’s not a cure-all.


The Scaling Hypothesis

Compute + Data = Intelligence

The scaling hypothesis in AI states that as we increase the size of our models (more parameters), provide more data, and use more powerful computers, we will continue to see improvements in AI capabilities. This is somewhat analogous to Moore’s Law, which for decades accurately predicted exponential increases in computing power.

However, bigger and faster does not guarantee less bias or more accuracy. If the training data is flawed or incomplete, the model’s output will reflect those flaws.

Practice Pointer: Bigger Isn't Always Better

Don’t assume that a newer, bigger model is always the best choice for every legal use case. Sometimes a smaller, more specialized model that has been carefully fine-tuned on relevant legal data can outperform a huge model that lacks domain-specific training.


Garbage In–Garbage Out: The Importance of Quality Data

You’ve probably heard the expression “garbage in, garbage out” (GIGO). It highlights that AI models are only as good as the data they’re trained on. Poor or biased data can lead to poor or biased results.

Janelle Shane’s Example: Rulers and Sheep

In You Look Like a Thing and I Love You, Janelle Shane provides vivid anecdotes about how AI isn't always as smart as we think it is:

  1. Ruler on an X-ray: A machine learning system was supposed to detect cancer in X-rays. Surprisingly, it learned to spot the ruler often placed next to suspicious areas for measurement, confusing the presence of the ruler with the presence of cancer.

  2. Sheep in a Field: Another system learned to recognize green grass as a signal for “sheep,” because in most training pictures, sheep were standing on green grass. The AI concluded that wherever there was a field of green grass, there must be sheep, even if no actual sheep were visible.

These stories underscore that AI can latch onto the wrong patterns if the data isn’t carefully curated.

Implications for Lawyers

Example and Scenario

If a contract review AI was mostly trained on consumer contracts from the 1990s, it may not handle new data privacy clauses introduced by modern regulations like the GDPR or CCPA. This might lead to incomplete or incorrect drafting suggestions. The scenario may seem obvious or unlikely, but humans still tend to treat AI as a magical oracle, assuming that if it knows one thing well, it must know everything well.


Are We Running Out of Data?

One concern in AI research is that we might be approaching a point where publicly available, high-quality text is nearly all used up. Think of data like oil: there’s a finite supply, and once we’ve extracted it, it becomes harder to find new sources.

  1. Finite Online Text: Since LLMs train on huge swaths of the internet, at some point, they’ve seen most of the high-quality text available.

  2. Data Overlap: Many data sets repeat the same texts (e.g., Wikipedia is reused often).

  3. Synthetic Data: One possible solution is to have AI generate new training data. However, if it’s based on AI’s own output, you can end up in a feedback loop.

Callout: Synthetic Data

Synthetic data is artificially generated content used to expand or diversify a training set. For legal AI, we might create hypothetical case scenarios or fake but realistic contracts. However, synthetic data can introduce new biases or inaccuracies if not carefully validated.


A Hands-On Experiment

It’s easy to talk in abstract terms about “attention” and “vectors.” Let’s do a short exercise using the Transformer Explainer tool at https://poloclub.github.io/transformer-explainer/. This interactive site visualizes how a transformer-based model (like the ones used in LLMs) predicts the next word.

Step-by-Step Guide

  1. Open the site in your browser.

  2. Type a short sentence like “The lawyer presented the argument before the judge.”

  3. Observe the Attention Weights: The tool shows which words in the sentence have the strongest influence on predicting the next word.

  4. Experiment: Try variations like “The exhausted lawyer presented the argument…” and see how “exhausted” changes the attention patterns.

Example: Context Matters

You might notice that the word “exhausted” affects how the model weighs the context around “lawyer.” This reveals why a transformer can keep track of context in a more nuanced way than older models.

By experimenting, you’re seeing a real demonstration of how the model decides which words matter most. This capacity for “attention” is at the heart of why transformers are so good at generating text.
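If you want to go a step further on your own machine, the sketch below shows the same next-word idea programmatically. It assumes you have Python with the Hugging Face transformers library and PyTorch installed, and it uses the small, freely available gpt2 model purely for illustration (it is far less capable than the commercial tools discussed in this book).

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    text = "The lawyer presented the argument before the"
    inputs = tokenizer(text, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits           # raw scores over the whole vocabulary

    probs = torch.softmax(logits[0, -1], dim=-1)  # probabilities for the next word
    top = torch.topk(probs, 5)                    # the five most likely continuations
    for p, idx in zip(top.values, top.indices):
        print(repr(tokenizer.decode(int(idx))), round(float(p), 3))

The output is a handful of plausible continuations with their probabilities; the exact list depends on the model, but it is the same prediction step the Transformer Explainer visualizes.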


Chapter Recap

We’ve covered a lot of ground in this chapter, moving from a basic notion of AI to the technical underpinnings of generative AI. Here are the key takeaways:

  1. AI systems learn patterns from data rather than following hand-coded rules, and every tool available today is narrow AI.

  2. Large language models are transformer-based neural networks whose “attention” mechanism lets them weigh context across an entire passage.

  3. Training adjusts billions of weights and biases through gradient descent, guided by a loss function, and can go wrong through overfitting.

  4. A model’s output is only as good as its training data, and interpolation comes far more naturally to it than extrapolation.

Practice Pointer

Before proceeding, reflect on the core question: How might these concepts affect the way you validate AI-generated legal documents? Keep in mind that while AI can save time, it’s crucial to know how these models reach their conclusions and where they might slip up.


Final Thoughts

Generative AI, especially large language models powered by the transformer architecture, represents a significant leap in how we create, analyze, and interpret text. For legal professionals, these tools hold the promise of faster, more efficient workflows, from document drafting to case law summarization. Yet they also come with caveats: they can generate errors or biased language, they may not handle entirely novel scenarios gracefully, and they remain reliant on the data they’re trained on.

Moving forward, keep these lessons in mind:

  1. AI is neither a magical oracle nor infallible; human oversight remains essential.

  2. Better data yields better outputs; quality and representativeness matter.

  3. Scaling AI continues to expand possibilities, but size alone doesn’t solve all problems.

  4. Transparency and ethical considerations are crucial for legal professionals who adopt these tools.


What's Next?

In Chapter 3, we’ll focus on the real-world tools that operationalize these concepts, including popular AI tools like ChatGPT and Claude, and other emerging platforms, diving into their strengths, weaknesses, and how they fit into the legal workflow. We’ll talk about what you can realistically expect these tools to do for you in a law office environment, how to integrate them responsibly, and what pitfalls to watch out for. We will also examine practical use cases, like drafting briefs, summarizing case law, and more, and explore the growing number of third-party AI tools tailored for legal tasks.


References

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.

Shane, J. (2019). You Look Like a Thing and I Love You: How Artificial Intelligence Works and Why It’s Making the World a Weirder Place. Voracious / Little, Brown and Company.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). “Attention Is All You Need.” Advances in Neural Information Processing Systems 30.

Optional Deeper Dive (For the Inquisitive)

If you’re intrigued by any particular concept, consider exploring it in greater technical depth on your own time.

Understanding these deeper topics can help demystify the “secret sauce” behind AI, but for most legal applications, a high-level grasp of the basics is sufficient.