KitKat Journal · Full Daily Edition

Daily Journal — 25/03/2026

This edition gathers the complete posts approved during the editorial window. The order follows Gmail editorial, Substack, and Skool, preserving structure, useful media, and canonical origin, without summarizing.

Use the index to jump between texts. Each post keeps its full body and ends with its origin and source link.

📰 AI NEWS AI by Hand ✍️ 2026-03-25T08:32:13-03:00

L2 Loss

By Prof. Tom Yeh

Essential AI Math Excel Blueprints · Feb 14, 2026 ∙ Paid

The L2 loss measures how far a model’s output (prediction) vector is from a target vector. It provides a single value that reflects the overall discrepancy between the model’s output and the desired outcome, encouraging predictions that stay close to the true target.

## Calculation

To compute the L2 loss between a prediction vector and a target vector, take the difference between their corresponding components, square each difference, sum all the squared values, and then multiply the result by one-half. The one-half factor does not change where the minimum occurs, but it simplifies the gradient during optimization. This produces a single smooth measure of prediction error, where larger discrepancies contribute more strongly due to the squaring.

## Excel Blueprint

This Excel Blueprint is available to AI by Hand Academy members. You can become a member [via a paid Substack subscription](https://www.byhand.ai/). This post is for paid subscribers.
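The calculation described above can be sketched in a few lines of NumPy (an illustrative implementation, not the author’s Excel blueprint; the function name `l2_loss` is my own):

```python
import numpy as np

def l2_loss(prediction, target):
    """One-half the sum of squared component-wise differences."""
    diff = np.asarray(prediction, dtype=float) - np.asarray(target, dtype=float)
    return 0.5 * float(np.sum(diff ** 2))

# Prediction [1, 2] vs. target [0, 2]: squared differences are 1 and 0,
# so the loss is 0.5 * (1 + 0) = 0.5.
loss = l2_loss([1.0, 2.0], [0.0, 2.0])
```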

📰 AI NEWS AI by Hand ✍️ 2026-03-25T08:32:12-03:00

Binary Cross Entropy Loss

By Prof. Tom Yeh

Essential AI Math Excel Blueprints · Feb 15, 2026 ∙ Paid

Binary cross entropy (BCE) loss measures how well a model’s predicted probability ŷ aligns with a target probability value y. Most often, the model outputs a probability, and the BCE loss quantifies the discrepancy between that prediction and the target. When the predicted probability closely matches the target value, the loss is small. When it deviates significantly, the loss becomes large.

## Calculation

Let’s unpack the formula. We begin with the model’s predicted probability ŷ. We take the logarithm of ŷ, then multiply it by the target value y. This measures how well the model supports the positive outcome. Because this is a binary setting, the other outcome is implied. If the model predicts ŷ for the positive outcome, then the probability of the negative outcome is (1 - ŷ). We take the logarithm of (1 - ŷ) and multiply it by (1 - y). This measures how well the model supports the negative outcome. Next, we add these two weighted log terms together. Finally, we take the negative of that sum. The result is the binary cross entropy loss.

## Excel Blueprint

This Excel Blueprint is available to AI by Hand Academy members. You can become a member [via a paid Substack subscription](https://www.byhand.ai/). This post is for paid subscribers.
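The steps above translate directly into code (a minimal sketch; the `eps` clipping is my addition to keep the logarithms finite when ŷ is exactly 0 or 1):

```python
import numpy as np

def bce_loss(y_hat, y, eps=1e-12):
    """BCE: -(y * log(ŷ) + (1 - y) * log(1 - ŷ))."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return float(-(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat)))

# A confident, correct prediction gives a near-zero loss;
# an uncertain one (ŷ = 0.5) gives log(2) ≈ 0.693 regardless of the target.
```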

📰 AI NEWS AI by Hand ✍️ 2026-03-25T08:32:12-03:00

KL Divergence

By Prof. Tom Yeh

Essential AI Math Excel Blueprints · Feb 15, 2026 ∙ Paid

Kullback–Leibler (KL) divergence measures how different one probability distribution is from another. It quantifies how much information is lost when we use a model (predicted) distribution (Q) to approximate a true (target) distribution (P).

## Calculation

The calculation begins with the predicted distribution Q(x) and the target distribution P(x). First, we take the logarithm of both Q(x) and P(x). Next, for each outcome, we compute the difference log(P(x)) minus log(Q(x)), which represents the log ratio between the target and predicted probabilities. This difference is then weighted by P(x), producing the term P(x) multiplied by log(P(x)/Q(x)). Finally, we sum these weighted terms across all outcomes to obtain the KL divergence.

## Excel Blueprint

This Excel Blueprint is available to AI by Hand Academy members. You can become a member [via a paid Substack subscription](https://www.byhand.ai/). This post is for paid subscribers.
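A direct NumPy sketch of the calculation (illustrative only; it assumes both distributions assign nonzero probability to every outcome):

```python
import numpy as np

def kl_divergence(p, q):
    """KL(P||Q) = sum over x of P(x) * (log P(x) - log Q(x))."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p) - np.log(q))))

# Identical distributions diverge by zero; note KL is not symmetric in P and Q.
```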

📰 AI NEWS AI by Hand ✍️ 2026-03-25T08:32:11-03:00

Essential AI Math #11 to #15

By Prof. Tom Yeh

AI by Hand ✍️ · Feb 18, 2026 ∙ Paid

Dear AI by Hand Academy Members,

Here is another mini-batch of new _Essential AI Math Blueprints_. I’ve taught these ideas many times over the years, scattered across different lectures, but I’m now com…

This post is for paid subscribers.

📰 AI NEWS AI by Hand ✍️ 2026-03-25T08:32:11-03:00

ELU (Exponential Linear Unit)

By Prof. Tom Yeh

Essential AI Math Excel Blueprints · Feb 18, 2026 ∙ Paid

ELU (Exponential Linear Unit) introduces a smooth exponential curve in the negative region to create a gradual transition at zero. Instead of an abrupt change in slope, the function bends smoothly into negative values, producing continuous derivatives and more stable gradient flow. This smoother behavior can improve convergence and lead to more stable learning dynamics in deep neural networks.

ELU is designed to address a limitation of LeakyReLU: there is still a sharp kink at x = 0, creating a discontinuity in the derivative. ELU solves this by replacing the linear negative slope with a smooth exponential curve.

ELU behaves like ReLU in the positive region, passing positive inputs through unchanged. But in ReLU, a neuron can become “dead” in the negative region because the gradient is zero, meaning it receives no signal to update. Like LeakyReLU, ELU keeps the negative region alive by providing small, nonzero gradients. This allows a “dead” neuron to slowly recover, rather than remaining permanently silent.

## Excel Blueprint

This Excel Blueprint is available to AI by Hand Academy members. You can become a member [via a paid Substack subscription](https://www.byhand.ai/). This post is for paid subscribers.
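The behavior described above can be written in a few lines (a standard formulation of ELU with scale parameter α, sketched in NumPy):

```python
import numpy as np

def elu(x, alpha=1.0):
    """ELU: x for x > 0; alpha * (exp(x) - 1) for x <= 0.
    Smooth at zero, saturating toward -alpha for very negative inputs."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))
```

Unlike ReLU, the negative branch has gradient alpha * exp(x) > 0, so a “dead” neuron still receives an update signal.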

📰 AI NEWS AI by Hand ✍️ 2026-03-25T08:32:10-03:00

Swish (SiLU)

By Prof. Tom Yeh

Essential AI Math Excel Blueprints · Feb 19, 2026 ∙ Paid

Swish, also known as Sigmoid Linear Unit (SiLU), is designed to introduce a smooth, self-gated activation mechanism. Instead of abruptly cutting off negative inputs like ReLU, Swish multiplies the input x by a sigmoid gate σ(x) that softly scales the signal between 0 and 1. For large positive values, the gate approaches 1 and the function behaves like a linear pass-through. For large negative values, the gate approaches 0, gradually suppressing the signal.

For comparison, you can think of ReLU as using a hard gate: the gate value is 0 when x < 0 and 1 when x > 0. This creates a sharp transition at x = 0. Swish replaces this sharp transition with a smooth “swish” transition (pun intended).

## Excel Blueprint

This Excel Blueprint is available to AI by Hand Academy members. You can become a member [via a paid Substack subscription](https://www.byhand.ai/). This post is for paid subscribers.
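The self-gating idea is a one-liner in code (a minimal sketch using the standard library):

```python
import math

def swish(x):
    """Swish/SiLU: x * sigmoid(x), i.e. the input gated by its own sigmoid."""
    return x / (1.0 + math.exp(-x))

# For large positive x the gate is ~1 (near pass-through);
# for large negative x the gate is ~0 (signal suppressed).
```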

📰 AI NEWS AI by Hand ✍️ 2026-03-25T08:32:10-03:00

GELU (Gaussian Error Linear Unit)

By Prof. Tom Yeh

Essential AI Math Excel Blueprints · Feb 20, 2026 ∙ Paid

The GELU (Gaussian Error Linear Unit) activation function is fundamentally similar to Swish (SiLU): both apply a smooth, input‑dependent gate to the linear signal x, suppressing negative values toward zero while allowing positive values to pass through in a soft, probabilistic manner. This gentle attenuation of negatives preserves useful gradient information and improves learning, unlike the hard cutoff of ReLU.

The core difference between GELU and Swish lies in how their “gates” transition from closed to open. GELU uses the Gaussian cumulative distribution function Φ(x) as its gate, which operates in a narrower band, roughly between -3 and 3. In contrast, the Swish gate uses the sigmoid function σ(x), which has a wider band, roughly from -6 to 6.

## Excel Blueprint

This Excel Blueprint is available to AI by Hand Academy members. You can become a member [via a paid Substack subscription](https://www.byhand.ai/). This post is for paid subscribers.
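The exact form x · Φ(x) can be computed with the error function (a minimal sketch using Python’s standard library; fast tanh-based approximations also exist in practice):

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF,
    computed here via the error function erf."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Positive inputs pass through almost unchanged once x > ~3;
# negative inputs are softly attenuated toward zero, not hard-clipped.
```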

📰 AI NEWS AI by Hand ✍️ 2026-03-25T08:32:10-03:00

Tanh

By Prof. Tom Yeh

Essential AI Math Excel Blueprints · Feb 20, 2026 ∙ Paid

The Tanh activation function takes any number and smoothly squeezes it into a range between –1 and 1. It keeps the signal centered around zero, which helps the network learn more efficiently, especially in deeper layers. Like a gentle S‑shaped curve, it allows strong signals to pass through while taming extreme values, making it a popular choice when you want both positive and negative activity in the model.

In comparison, the sigmoid activation function σ(x) squeezes a value into a range between 0 and 1. It also has an S-shaped curve, but it’s centered at y = 0.5, and its most active transition occurs roughly in the x-range of –6 to 6, wider than that of tanh.

## Excel Blueprint

This Excel Blueprint is available to AI by Hand Academy members. You can become a member [via a paid Substack subscription](https://www.byhand.ai/). This post is for paid subscribers.
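The two S-curves are closely related (a small sketch; the identity tanh(x) = 2σ(2x) - 1 is why tanh is the zero-centered, narrower-band sibling of the sigmoid):

```python
import math

def sigmoid(x):
    """Squashes any input into (0, 1), centered at y = 0.5."""
    return 1.0 / (1.0 + math.exp(-x))

# math.tanh squashes into (-1, 1), centered at 0. The two are tied by
# tanh(x) = 2 * sigmoid(2 * x) - 1: a rescaled, recentered sigmoid.
```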

📰 AI NEWS AI by Hand ✍️ 2026-03-25T08:32:09-03:00

GLU (Gated Linear Unit)

By Prof. Tom Yeh

Essential AI Math Excel Blueprints · Feb 21, 2026 ∙ Paid

Gated Linear Units (GLU) marked a breakthrough in activation design by introducing a truly dynamic gating mechanism, meaning the gate is predicted from the input itself rather than defined by a fixed, predefined function. GLU projects the input through two parallel linear transformations: one produces a feature value, and the other produces a gate logit. The gate logit passes through a sigmoid to produce a value between 0 and 1, which determines how “open” the gate is and what fraction of the feature value is allowed to pass through.

Compare this with the computation of SiLU. The key difference: in SiLU the sigmoid gate depends directly on the same projected feature value (z = Wx), meaning the feature and the gate come from the same linear transformation. In contrast, GLU-style gating predicts the gate using a separate linear transformation, so the gate is not tied to the feature itself.

## Excel Blueprint

This Excel Blueprint is available to AI by Hand Academy members. You can become a member [via a paid Substack subscription](https://www.byhand.ai/). This post is for paid subscribers.
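The two-projection structure can be sketched as follows (illustrative NumPy with hypothetical dimensions; `W` produces the feature and `V` the gate logit, and both matrices here are random stand-ins for learned weights):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glu(x, W, V):
    """GLU: the feature (W @ x) gated elementwise by sigmoid(V @ x).
    The gate comes from its own projection V, not from the feature itself."""
    return (W @ x) * sigmoid(V @ x)

# Hypothetical shapes: input dim 4, output dim 3.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
x = rng.standard_normal(4)
out = glu(x, W, V)
```

Because the gate lies in (0, 1), each output component is a fraction of the corresponding feature value, exactly the “how open is the gate” reading above.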

📰 AI NEWS AI by Hand ✍️ 2026-03-25T08:32:09-03:00

Essential AI Math #16 to #20

By Prof. Tom Yeh

AI by Hand ✍️ · Feb 23, 2026 ∙ Paid

Dear Academy Members,

I’m glad that I finally reached #20 for the new _Essential AI Math Blueprints_ series 🎉 After reaching this milestone, I’m confident the series is on its way to becoming one of the…

This post is for paid subscribers.

📰 AI NEWS AI by Hand ✍️ 2026-03-25T08:32:07-03:00

Entropy

By Prof. Tom Yeh

Essential AI Math Excel Blueprints · Mar 01, 2026 ∙ Paid

Entropy measures the inherent uncertainty (surprise) of a probability distribution. If one outcome is guaranteed—for example, B occurs with probability 1—there is no uncertainty at all, so entropy is zero. If B is almost certain but not guaranteed, there is still a small amount of uncertainty, so entropy is low. If there are two likely outcomes with comparable probabilities, uncertainty increases and entropy is high. Finally, when all possible outcomes are equally likely, uncertainty is maximized, and entropy reaches its highest value for that set of outcomes.

Entropy increases as uncertainty spreads across more possible outcomes. For example, if there are 5 possible outcomes (A–E) and all are equally likely, the entropy (in nats) is ln 5 ≈ 1.61. If we expand the space to 9 equally likely outcomes (A–I), there are now more possibilities to distinguish among, so entropy increases to ln 9 ≈ 2.20. In general, when all outcomes are equally likely—a uniform distribution—entropy reaches its maximum for that fixed number of outcomes.

## Excel Blueprint

This Excel Blueprint is available to AI by Hand Academy members. You can become a member [via a paid Substack subscription](https://www.byhand.ai/). This post is for paid subscribers.
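The worked numbers above can be checked directly (a minimal sketch computing Shannon entropy in nats):

```python
import math

def entropy(probs):
    """Shannon entropy in nats: -sum of p * ln(p), skipping zero-probability outcomes."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Uniform over 5 outcomes gives ln(5) ≈ 1.61; over 9 outcomes, ln(9) ≈ 2.20.
# A guaranteed outcome gives zero entropy.
```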

🔬 AI RESEARCH Agentic AI 2026-03-25T08:31:47-03:00

Beyond the “Gradient Highway”: How Attention Residuals Fix the Hidden Crisis of Deep LLMs

By Ken Huang

Some key takeaways from a recent paper from open-source model Kimi’s research · Ken Huang · Mar 18, 2026 ∙ Paid

Sometimes the most advanced research does not need a PhD. The X post about Kimi’s new AI architecture, allegedly created by a 17‑year‑old and praised by Elon Musk, is a vivid example of how exceptional talent, open research, and plentiful compute can let independent young researchers push state‑of‑the‑art AI infrastructure forward without formal credentials, delivering “drop‑in” architectures with meaningful compute gains at minimal extra latency. Breakthrough ideas increasingly come from those who move fastest, not those with the longest academic résumés. Our AI researchers at DistributedApps.ai have analyzed the underlying work and summarized the key takeaways; subscribe to our paid tier to know more.

## 1. Introduction: The Amnesia of Depth

For years, the “gradient highway” has been the structural backbone of deep learning. Standard residual connections serve as a fast lane, allowing information to bypass complex transformations via identity mappings. But this high-speed travel carries a hidden cost. Imagine a highway where every town passed adds new cargo to a truck. By the time the vehicle has traveled thousands of miles—or passed through hundreds of neural layers—the original items from the start of the trip are buried under a mountain of new weight.

In modern Large Language Models (LLMs) using PreNorm architectures, this manifests as a form of architectural “amnesia” or dilution. Because standard residuals aggregate information using fixed unit weights, each individual layer’s contribution is progressively washed out as the model grows deeper.

Attention Residuals (AttnRes), an innovation recently detailed by the Kimi team at Moonshot AI, introduces a “selective memory” upgrade. By moving away from blind accumulation, AttnRes allows each layer to perform content-aware retrieval across the model’s entire history.

## 2. The PreNorm Paradox: Why More Layers Don’t Always Mean Better Features

While depth is intended to build increasingly sophisticated features, modern PreNorm architectures suffer from a technical limitation: hidden-state magnitudes grow as O(L) with depth. As these magnitudes expand, the relative influence of any single new layer—and its ability to impact the final output—shrinks. Early-layer information is effectively “buried,” and the model loses the capacity to retrieve specific representations from its own past.

Standard residuals trap the model in a “fixed unit weight” strategy where every layer is treated with equal importance, regardless of its utility. As the research team observes: “Residuals also play a second role that has received less attention... residuals define how information aggregates across depth. Unlike sequence mixing and expert routing, which now employ learnable input-dependent weighting, this depth-wise aggregation remains governed by fixed unit weights.”

## 3. Takeaway #1: Replacing Addition with Selection (Softmax over Depth)

The central intellectual pivot of AttnRes is the Duality of Time and Depth. Just as the Transformer revolution replaced the sequential recurrence of RNNs with attention across time (the sequence), AttnRes replaces the additive recurrence of residuals with attention across depth.

Continue reading this post for free, courtesy of Ken Huang, or [purchase a paid subscription](https://kenhuangus.substack.com/).
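To make the “softmax over depth” idea concrete, here is a highly simplified sketch of content-aware aggregation across layer history. This is my own illustration based on the description above, not the Kimi team’s actual AttnRes implementation; the function names, projections, and dimensions are all hypothetical:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def depth_attention_residual(history, Wq, Wk):
    """Aggregate hidden states across depth with content-aware attention weights
    instead of the fixed unit weights of a standard residual sum.
    `history` is the list of all previous layers' outputs, each of shape (d,)."""
    h = np.stack(history)                # (L, d): one row per layer
    q = h[-1] @ Wq                       # query from the current layer's state
    k = h @ Wk                           # one key per stored layer
    scores = k @ q / np.sqrt(len(q))     # similarity of each depth to the query
    weights = softmax(scores)            # softmax over depth, not over sequence
    return weights @ h                   # weighted mixture of past layers

# Hypothetical dimensions for illustration.
rng = np.random.default_rng(0)
d = 8
history = [rng.standard_normal(d) for _ in range(5)]
Wq = rng.standard_normal((d, d))
Wk = rng.standard_normal((d, d))
mixed = depth_attention_residual(history, Wq, Wk)
```

The key contrast with a standard residual is the last line: instead of summing all layers with weight 1, the model selects a learned, input-dependent mixture, so an early layer’s representation can be retrieved undiluted when it is relevant.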

KitKat Journal System · OpenClaw · Generated on 25/03/2026 at 08:41