[{"content":"","date":"3 April 2026","externalUrl":null,"permalink":"/tags/130nm/","section":"Tags","summary":"","title":"130nm","type":"tags"},{"content":"","date":"3 April 2026","externalUrl":null,"permalink":"/tags/arithmetic/","section":"Tags","summary":"","title":"Arithmetic","type":"tags"},{"content":"","date":"3 April 2026","externalUrl":null,"permalink":"/tags/bfloat16/","section":"Tags","summary":"","title":"Bfloat16","type":"tags"},{"content":"","date":"3 April 2026","externalUrl":null,"permalink":"/tags/c/","section":"Tags","summary":"","title":"C","type":"tags"},{"content":"","date":"3 April 2026","externalUrl":null,"permalink":"/tags/dft/","section":"Tags","summary":"","title":"DFT","type":"tags"},{"content":"","date":"3 April 2026","externalUrl":null,"permalink":"/tags/float/","section":"Tags","summary":"","title":"Float","type":"tags"},{"content":"","date":"3 April 2026","externalUrl":null,"permalink":"/tags/floating-point/","section":"Tags","summary":"","title":"Floating Point","type":"tags"},{"content":" I have a confession to make: floating point scares me.\nHalf a decade ago I decided that I was going to implement some floating point arithmetic. Back then it seemed approachable enough, after all, floating points are ubiquitous. How hard can it really be ? My experience until that point had been: given enough time and effort is spent bashing my brain against a problem, I can generally figure things out.\nThis is how I faced the most complete technical defeat of my existence. 
Through this utter annihilation emerged my present fear of floating point.

After half a decade I decided it was time for a rematch, time to face my dragons! But this time I would not settle for a surface-level understanding; this time I would aim to deeply grasp the floating point representation.

When setting out on this crusade, I believed that there were only 3 types of people who truly understood floating point:

- The people writing the spec
- The math PhDs working on the floating point representation
- The people building the floating point hardware

Welcome to round 2!

Recommended soundtrack for reading: microtonal math rock.

Chapter 1: Descent into madness #

Looking back on it, one of the main reasons behind my past defeat was that I mistook my ability to use floating point for a marker of understanding, and that this freed me from the need to invest time in studying floating point, as if I was going to pick it up along the way.

So it's now time to put the computer aside, and spend 10 days in the company of paper.
(you remember, the white stuff)

How floating point works #

I am assuming that readers already have some surface-level knowledge of what floating point is, so I will spare you the basic intro.

Let me just set a few definitions. In the context of this discussion, normal floating point numbers will be defined as:

$$ (-1)^{S} \times 2^{E-b} \times (1 + T \cdot 2^{1-p}) $$

With the values of \(S\), \(E\) and \(T\) being the values stored in the floating point fields:

- \(S\): sign bit
- \(E\): biased exponent
- \(T\): trailing significand field

Floating point layout - IEEE 754-2019, section 3

The size of these fields, as well as the values of \(b\) (exponent bias) and \(p\) (precision), depend on the floating point format.

E.g., for IEEE 754 single precision (float32_t) we have:

- \(b = 127\)
- \(p = 24\)

Resulting in:

$$ (-1)^{S} \times 2^{E-127} \times (1 + T \cdot 2^{-23}) $$

In this discussion we will be calling:

- sign, the \(S\) sign bit
- exponent, the value stored in the biased exponent field \(E\)
- significand/mantissa, the value stored in the \(T\) field

Significand/Mantissa: the term mantissa isn't pedantically correct; since this isn't a logarithmic representation it should really be called a significand. But my fellow programmers in the audience will appreciate that, since the sign has already claimed the \(s\) name in our single-letter naming of structure elements, we have no choice but to yield and call this \(m\) for mantissa. I will be using the terms mantissa and significand interchangeably in this article.

What you never wanted to know #

We are not actually interested in floating point in the abstract, but rather in what we commonly refer to as "float" in our programs.

In the world of all possible floating point types, these are the vanillas, except in this world, everyone also wants vanilla all the time!

This float format is canonized by the IEEE in the IEEE 754 specification.
Inside this holy grail, the expected behavior is outlined in excruciating detail, making it possible for users to expect the same behavior from the same floating point operations on different platforms. A cornerstone of making float portable.

Also, this is where hell starts!

+0/-0 #

Let us commence our descent slowly.

As the most astute readers might have already noticed looking at the representation format (kudos), we have a real sign bit. This implies that we actually have 2 representations for zero: \(+0.0\) and \(-0.0\).

Now, where things get fun is that we have rules around which zero to use. For example, let us consider how we would determine the equality of two floating point numbers, say X == Y. To do this comparison we would generally re-use the adder, compute X - Y, then check that all the result's bits are 0; problem is, \(-0.0\) is written with a 1 in its sign bit. So we have rules around when a result should use \(+0.0\) or \(-0.0\), and the subtraction of two equal floating point numbers is one example of such a rule:

$$ X - X = +0.0 $$

NaN #

NaN, for Not A Number. For all of you who thought we were talking about numbers, this is the point at which you start understanding the difference between a number and a representation format.

So let's start with the fun bit: there are actually different types of NaNs:

- quiet NaNs (qNaNs), the ones you would typically encounter from your bad math.
- signaling NaNs (sNaNs), the ones bad math doesn't produce, and also the ones that scream at you by signaling an invalid operation exception whenever they appear as operands. Most people won't encounter these.

So, what do I mean by "qNaNs are used to indicate when the result of an arithmetic operation cannot be represented"? Here are a few examples for clarification:

- \(\sqrt{-1.0}\) results in a qNaN, as \(\sqrt{-1.0} = i\), and \(i\) is an imaginary number that cannot be represented without the use of complex notation.
- \(\frac{0.0}{0.0}\) would also result in a qNaN because: what are you doing?
- \(+\infty - \infty\) would also result in a qNaN because \(\pm\infty\) are actually limits, not numbers, and subtracting a limit from another limit just doesn't make sense.

Want to know another fun fact about qNaNs? They are contagious. Arithmetic operations with a qNaN as an operand will result in a qNaN. Think about it: what result should you give for an operation whose result can't be represented?

In memory, NaNs are represented with all the exponent bits set to \(1\) and at least one of the significand bits set. You can then differentiate NaNs based on which significand bit(s) are set, the encoding of which is left to the discretion of the implementer.

Infinities #

We have already started introducing these with the NaNs: the floating point representation has room for two infinity notations, one for \(+\infty\) and its mirror \(-\infty\). These are not numbers; infinity is a limit!

Per the IEEE spec, infinities can be used as operands in arithmetic operations, be used as inputs for boolean comparisons, and be produced as the result of a calculation.

In memory, infinities have their exponent bits set to all \(1\)s and, to differentiate them from NaNs, their significand bits are all \(0\)s.

Denormal #

Let's put infinities and NaNs to the side for a minute and get back to talking just about numbers.

In the introduction I defined a normal floating point number as:

$$ (-1)^{S} \times 2^{E-b} \times (1 + T \cdot 2^{1-p}) $$

A more common way of writing this is:

$$ (-1)^{S} \times 2^{e} \times m $$

Where \(m\) is a number represented by a digit string of the form \(d_0 . d_1 d_2 ... d_{p-1}\), which is \(p\) digits long (with \(p\) the precision, i.e. the number of bits in the significand + 1).

For example, \(1.5\) would be written as:

$$ (-1)^0 \times 2^{0} \times 1.1000 $$

and \(3\), as \(2 \times 1.5\):

$$ (-1)^0 \times 2^{1} \times 1.1000 $$

In our normal floating point representation, the \(1\) in \((1 + T \cdot 2^{1-p})\) is our \(d_0\), and it is always set to \(d_0 = 1\). Now, the funny thing is that our significand field actually only has \(p-1\) bits: \(d_0\) is an inferred bit, we call it the hidden bit.

Seems simple enough? Could something finally be simple about floating point?!

Don't worry, floating point isn't going to let you down like this, because we have another category of numbers! They have an implicit hidden bit set to \(d_0 = 0\) and are called subnormal numbers (or denormal numbers). Yay 🥳

These are used to encode the smallest representable floating point numbers, and were the most controversial part of the IEEE 754 spec during its elaboration. They are also a giant pain in the ass to implement, so much so that a lot of the early FPUs would trap on subnormal numbers and handle them in software … very slowly …

So why are we putting up with them, apart from not wanting to waste some bits? The culprit: gradual underflow.

Gradual underflow #

The idea behind gradual underflow is to slow the loss of precision instead of it being abrupt below the smallest representable normal number, \(2^{1-b}\) (e.g. \(2^{-14}\) for float16). This helps with numerical stability.

To illustrate what I mean, let's imagine we didn't have subnormals, and \(x\) and \(y\) were floating point numbers such that \(x \neq y\).
If \(x - y\) fell in the range between \(0.0\) and the smallest representable floating point number, then it would underflow to \(0.0\), since there is no representable number in that range.

E.g., using float16 without subnormals:

$$ 0.000091552734375 - 0.0000762939453125 = 0.0 $$

Credit: Handbook of Floating-Point Arithmetic, Second Edition

Now, when we add subnormals, we are effectively filling this gap. E.g., using float16 with subnormals:

$$ 0.000091552734375 - 0.0000762939453125 = 0.0000152587890625 $$

Credit: Handbook of Floating-Point Arithmetic, Second Edition

With subnormals in our system, we inherit the following interesting property: for any floating point numbers \(x\) and \(y\) such that \(x \neq y\), \(x - y\) is necessarily nonzero.

Bonus: we have now acquired extra armour against division by zero:

```cpp
if (x != y)
    z = 1.0 / (x - y);
```

Rounding modes #

The range of what can and cannot be represented is type specific: more bits equates to a larger space of representable values; conversely, smaller types mean fewer possible values.

Let us consider the IEEE 16 bit half float float16_t: it has 5 exponent bits and 10 significand bits, and guess what: it can't represent 15359!

So what would happen when we try to set it to 15359?

```cpp
float16_t e = 15359.0;
cout << e << endl;
```

The answer is 42.

More seriously, it depends: what rounding mode are we using? The IEEE spec defines 5 rounding modes that compliant hardware should support:

- RD, round downwards towards \(-\infty\): the result is the largest representable floating point value less than or equal to the exact result.
- RU, round upwards towards \(+\infty\): the result is the smallest representable floating point value greater than or equal to the exact result.
- RZ, round towards \(0.0\) in all cases.
- RN_even/RN_away, round to nearest: the result is the nearest representable value, with a tie-breaking rule if the number is exactly halfway between the two:
  - even (round ties to even) chooses the value where the least significant bit of the significand (mantissa) is 0.
  - away (round ties to away) chooses the next consecutive floating point value.

Credit: Chapter 2 of Handbook of Floating-Point Arithmetic, Second Edition

Let us get back to our example: 15359 isn't a representable number with a float16_t, and the two closest numbers are 15352 and 15360. So depending on which rounding mode we are using, we will get:

| rounding mode | result |
|---------------|--------|
| RD            | 15352  |
| RZ            | 15352  |
| RU            | 15360  |
| RN (even)     | 15360  |

RN_even is the default IEEE rounding mode behavior on modern systems and is what is specified by the C++ FE_TONEAREST rounding mode.

Rounding modes boundary behavior #

Recall how, when I was describing the behavior of the rounding modes, I didn't systematically use the term "number" for the rounding result? This is because some rounding modes cause the result to round into \(\pm\infty\).

For example, on float16_t using RU rounding:

```cpp
float16_t x, y, z;
x = 65504;
y = 1;
fesetround(FE_UPWARD);
z = x + y;
cout << x << " + " << y << " = " << z << endl;
```

Would result in:

```
65504 + 1 = inf
```

Similarly:

```cpp
float16_t x, y, z;
x = -65504;
y = 1;
fesetround(FE_DOWNWARD);
z = x - y;
cout << x << " - " << y << " = " << z << endl;
```

Would round down to \(-\infty\):

```
-65504 - 1 = -inf
```

Conjointly, these definitions imply that operations using RU will never reach \(-\infty\), and operations using RD will never reach \(+\infty\).

```cpp
float16_t x, y, zsub, zadd;
x = 65504;
y = 1;
fesetround(FE_UPWARD);
zsub = -x - y;
cout << -x << " - " << y << " = " << zsub << endl;
```
```cpp
fesetround(FE_DOWNWARD);
zadd = x + y;
cout << x << " + " << y << " = " << zadd << endl;
```

Result:

```
-65504 - 1 = -65504
65504 + 1 = 65504
```

Now, where this gets fun is with RZ: since it behaves like RD for positive numbers and RU for negative numbers, operations using this rounding mode CANNOT reach \(\pm\infty\).

At this point in the article my sudden obsession with rounding mode limit behavior might seem a little random. Please just sit tight for now and trust me, we will be exploiting this behavior later in this article.

Unordered #

It is a pretty common occurrence that we compare floating point values in our code, with the basic boolean comparison operations being specified in the IEEE 754 spec. Two numbers can be less than, equal to, or greater than one another.

But what happens if one side of your comparison is a NaN? Can we compare x > y if x is a NaN? And if so, what is the result?

> Four mutually exclusive relations are possible: less than, equal, greater than, and unordered; unordered arises when at least one operand is a NaN. Every NaN shall compare unordered with everything, including itself.
>
> IEEE 754-2019, section 5.11

So any comparison against a NaN will be unordered. This is a pretty interesting property, meaning that there is literally no ordering relationship when NaNs are involved.

Here is this relationship in effect:

```cpp
float16_t x, y;
x = NAN;
y = 1.0;
cout << "x != x: " << ((x != x) ? "true" : "false") << endl;
cout << "x > x: "  << ((x > x)  ? "true" : "false") << endl;
cout << "x <= x: " << ((x <= x) ? "true" : "false") << endl;
cout << "x > y: "  << ((x > y)  ? "true" : "false") << endl;
cout << "x <= y: " << ((x <= y) ? "true" : "false") << endl;
```

Result:

```
x != x: true
x > x: false
x <= x: false
x > y: false
x <= y: false
```

As a side effect, some of our basic assumptions about how comparison relationships are expected to work start to break down, most infamously:

$$ \neg ( x < y ) \Leftrightarrow ( x \geq y ) $$

The law of trichotomy has broken down; we can never unsee this!

Floating point K.O

Adder example #

Alright, so how does this actually work? Since an example is better than a thousand words (an idiom this article clearly doesn't follow very well), here is some code!

Here is an example in C of the steps involved in doing a floating point addition. This code is provided for illustrative purposes only; some of the corner cases are missing.

Floating point addition in C using the bfloat16 type:

```c
/* Note: despite the C style, this compiles as C++
 * (std::bfloat16_t comes from the C++23 <stdfloat> header). */
#include <stdint.h>
#include <stddef.h>
#include <stdfloat>
#include <stdbool.h>
#include <stdlib.h>
#include <stdio.h>

/*******
 * Env *
 *******/

/* Assert. */
#define assert(cdt) ({if (!(cdt)) {printf("%s:%d : assert(%s) failed.\n", __FILE__, __LINE__, #cdt); abort();}})

#ifdef DEBUG
#define check(cdt) ({if (!(cdt)) {printf("%s:%d : check(%s) failed.\n", __FILE__, __LINE__, #cdt); abort();}})
#else
#define check(cdt) ({;})
#endif

#define swap(a, b) ({auto _ = b; b = a; a = _;})

typedef bool u1;
typedef uint8_t u8;
typedef uint16_t u16;
typedef uint64_t u64;
typedef std::bfloat16_t bfloat16_t;

/*********
 * Types *
 *********/

typedef struct bf16 bf16;

/**************
 * Structures *
 **************/

/*
 * Bfloat 16.
 */
struct bf16 {
    union {
        struct {
            u16 frc:7;
            u16 exp:8;
            u16 sig:1;
        };
        u16 raw;
    };
};

/*
 * Special values.
 */

/* Special exponent. */
#define BF16_EXP_SPC 0xff

/* Special number : infinity frac. */
#define BF16_SPC_FRC_INF 0

/*************
 * Utilities *
 *************/

/*
 * If @val is special, return 1.
 * Otherwise, return 0.
 */
static inline u1 bf16_is_spc(bf16 val)
{
    return val.exp == BF16_EXP_SPC;
}

/*
 * If @val is an infinity, return 1.
 * Otherwise, return 0.
 */
static inline u1 bf16_is_inf(bf16 val)
{
    return bf16_is_spc(val) && (val.frc == BF16_SPC_FRC_INF);
}

/*
 * If @val is a nan, return 1.
 * Otherwise, return 0.
 */
static inline u1 bf16_is_nan(bf16 val)
{
    return bf16_is_spc(val) && (val.frc != BF16_SPC_FRC_INF);
}

/*
 * If @val is negative, return 1.
 * Otherwise, return 0.
 */
static inline u1 bf16_is_neg(bf16 val)
{
    return val.sig;
}

/*
 * If @val is 0 or a subnormal, return 1.
 * Otherwise, return 0.
 */
static inline u1 bf16_is_nul_or_sub(bf16 val)
{
    return val.exp == 0;
}

/*
 * If @val is a subnormal, return 1.
 * Otherwise, return 0.
 */
static inline u1 bf16_is_sub(bf16 val)
{
    return bf16_is_nul_or_sub(val) && val.frc != 0;
}

/*
 * If @val is zero (plus or minus), return 1.
 * Otherwise, return 0.
 */
static inline u1 bf16_is_nul(bf16 val)
{
    return (val.exp == 0) && (val.frc == 0);
}

/*
 * If @val is regular (not inf, not nan, not subnormal), return 1.
 * Otherwise, return 0.
 */
static inline u1 bf16_is_reg(bf16 val)
{
    return (!bf16_is_spc(val)) && (!bf16_is_sub(val));
}

/*
 * Get @val's complete mantissa with the hidden 1 bit placed at offset 31.
 */
static inline u64 bf16_frc_to_arr(bf16 val)
{
    check((1 << 7) == 0x80);
    check(bf16_is_reg(val));
    const u64 arr = (1ull << 31) | (((u64) val.frc) << 24);
    check(((arr >> 24) & 0x7f) == val.frc);
    check(((arr >> 24) & 0x80) == 0x80);
    check((arr >> 32) == 0);
    check((arr & 0xffffff) == 0);
    return arr;
}

/*
 * Return the opposite of @val.
 */
static inline bf16 bf16_opp(bf16 val)
{
    val.sig = val.sig ? 0 : 1;
    return val;
}

/*******
 * API *
 *******/

/*
 * Addition.
 */
static inline bf16 bf16_add(bf16 src0, bf16 src1)
{
    /* Check regularity. */
    assert(bf16_is_reg(src0));
    assert(bf16_is_reg(src1));

    /* If both are null, return 0. We have a lookup table for the sign.
     * If one is null, return the other. */
    {
        const u1 nul0 = bf16_is_nul(src0);
        const u1 nul1 = bf16_is_nul(src1);
        if (nul0 && nul1) {
            return (bf16) {.frc = 0, .exp = 0, .sig = (u16) (src0.sig & src1.sig)};
        } else if (nul0 || nul1) {
            return (nul0) ? src1 : src0;
        }
    }

    /* Ensure that abs(src0) >= abs(src1). */
    const u1 swp = (
        (src1.exp > src0.exp) ||
        ((src1.exp == src0.exp) && (src1.frc > src0.frc))
    );
    if (swp) {
        swap(src0, src1);
    }

    /* Ensure src0 is positive. */
    const u1 neg = bf16_is_neg(src0);
    if (neg) {
        src0 = bf16_opp(src0);
        src1 = bf16_opp(src1);
    }
    check(src0.exp >= src1.exp);
    check(src0.sig == 0);

    /* Get exponents. */
    const u16 exp0 = src0.exp;
    const u16 exp1 = src1.exp;
    check(exp0 > 0);
    check(exp1 > 0);
    check(exp0 < 255);
    check(exp1 < 255);

    /* Get the mantissa shift amount. */
    check(exp0 >= exp1);
    const u16 shf = exp0 - exp1;
    check(shf <= 253);

    /* Generate mantissas with the hidden 1 bit placed at offset 31.
     * Everything on range [32, 63] is null.
     * Everything on range [0, 23] is null. */
    const u64 mnt0 = bf16_frc_to_arr(src0);
    u64 mnt1 = bf16_frc_to_arr(src1);

    /* Shift @mnt1 to match @src0's exponent.
     * There are only 32 meaningful bits.
     * If we right shift by more (>=) than 32, @src1 is effectively null. */
    mnt1 = (shf >= 32) ? 0 : (mnt1 >> shf);

    /* After the shift, mnt0 should be greater than mnt1. */
    check(mnt0 >= mnt1);

    /* Do the required operation. */
    const u1 sub = (bf16_is_neg(src1));
    const u64 mntr = sub ? mnt0 - mnt1 : mnt0 + mnt1;

    /* Initialize the sign part of the result. */
    bf16 res;
    res.sig = neg;

    /* If there are bits in the [32, 63] range (overflow),
     * right shift and update the exponent. */
    if (mntr >> 32) {
        /* Only a single bit of overflow is meaningful. */
        check(!(mntr >> 33));
        /* Never happens after a subtraction. */
        check(!sub);

        /* The exponent of src0 is used. Increment it. */
        check(exp0 < 255);
        u16 expr = exp0 + 1;
        check(expr > exp0);

        /* If infinity, round down (clamp to the largest finite value). */
        if (expr == 255) {
            res.frc = 0x7f;
            res.exp = 254;
        }
        /* Otherwise, just use this exponent and right shift
         * the mantissa by 1. */
        else {
            res.frc = ((mntr >> 1) >> 24) & 0x7f;
            res.exp = expr;
        }
    }
    /* If there are no bits in the [32, 63] range,
     * left shift and update the exponent. */
    else {
        /* If bit 31 is not set, check that a subtraction was performed. */
        check((mntr & (1ull << 31)) || sub);

        /* Determine the index of the first set bit and the shift count.
         * We shift by at most 31. */
        u64 mnt_shf = 0;
        u8 shf_cnt;
        for (shf_cnt = 0; shf_cnt <= 31; shf_cnt++) {
            mnt_shf = mntr << shf_cnt;
            if (mnt_shf & (1ull << 31)) {
                goto found;
            }
        }

        /* If not found, default to 0. */
        goto zero;

found:;
        /* If the shift count leaves an exponent > 0, compute the result. */
        if (exp0 > shf_cnt) {
            check(mnt_shf & (1ull << 31));
            res.frc = (mnt_shf >> 24) & 0x7f;
            res.exp = exp0 - shf_cnt;
        }
        /* Zero case.
         * Hit if no set bit was found or if the shift count
         * is greater than or equal to the exponent. */
        else {
zero:;
            res.frc = 0;
            res.exp = 0;
        }
    }

    /* The swap doesn't matter as we're doing a sum.
     * neg was already handled when initializing the sign. */

    /* Complete. */
    return res;
}

int main()
{
    bf16 f0 = {.frc = 0, .exp = 1, .sig = 0};
    bf16 r = bf16_add(f0, f0);
    (void) r;
    return 0;
}
```

⚠ This code hasn't been thoroughly tested, don't use this in prod!

Chapter 2 #

I know what you are thinking: So we have a C implementation, it works, why is this article still so long? Is there a prolific comment section? Can I go home now?

You remember we said we were doing floating point from scratch, right?
RIGHT!?

Code isn't going to cut it, we need to go deeper!

Building the hard part #

So, what is more from-scratch than C code?

We could code this in assembly and try to optimize it by using all the best assembly bit-twiddling tricks in the book, all while pretending we didn't notice those beautiful fadd instructions staring back at us across the ISA. And guess what: not only can we still go deeper, but sadly, even with the most beautifully hand-crafted assembly in the world, its performance would still be orders of magnitude slower than a dedicated floating point addition instruction running on my grandmother's computer.

No, we are going to build our own FPU hardware out of transistors, optimize the hell out of it, and then we are going to put it on a chip and tape it out!

ASIC implementation rules #

This next section is a simplified overview aimed at giving readers who are unfamiliar with hardware design the groundwork for understanding the constraints that go into it. If you have already gazed into the abyss, you can skip this section.

Digital hardware is built by connecting groups of prearranged transistors called cells. These cells might correspond to basic logic operations. Here is the example of a 2-input OR whose result is then ANDed with another input.

This logic can be written as:

X = ((A1 | A2) & B1)

Or represented by the following schematic:

sky130_fd_sc_hd__o21a schematic

Since these gates are built out of transistors, they are specific to a fabrication process known as a node. Here is what this function looks like for the SkyWater 130nm node:

sky130_fd_sc_hd__o21a_1 cell floorplan, with each color corresponding to a specific layer of the chip sandwich.

These cells occupy an area of the chip proportional to the number of transistors needed to build them: more transistors, more area. When building a chip, the more area we need, the more money it costs to build.
Additionally, the more area a piece of functionality needs, the further apart the logic spreads, the longer the wires, and the longer the wire delay; this impacts timing.

Timing what?

Imagine a world where charge carriers propagate like water: it takes time for the carriers to propagate through the logic gates and the wires. The more logic gates you have on your path, and the more wire length you need to traverse, the longer the time. The density of these carriers indicates your binary state, 0 or 1, and for the predictable operation of your chip you need to leave enough time for your flows of carriers to fully propagate through your longest path. How much time you leave directly impacts how fast you can clock your hardware, and hence how much performance you can get out of your design.

Optimizing #

Alright, we are not just here to build an FPU, we are here to build an optimized version.

Now, you might be wondering why I am adding the additional constraint of optimization. That is because building an optimized version of something requires a much more nuanced understanding than just building something that works. Given that there are many dimensions we can optimize for, in this project I will define "optimized" as having the lowest possible area for the highest frequency, given the functionality. This is our best bang for buck.

To recap: we are building hardware, and in hardware all logic is expensive, both in terms of area and timing, and we want to build the most optimized version according to these very metrics.

Great … What have I gotten myself into again?

What do we need #

So, since all logic we add comes with a cost, step 1 is to take a step back and first think about what we actually need to build.

There are plenty of different floating point formats, ranging from your standard IEEE types to your more custom, application-specific weirdos like the Pixar float. It would be great if we lived in the dimension where we could drop them all in an arena and let them fight to the death. But tragically, we can't do that with abstract concepts. So now we are forced to sit down and think about it … uh

Yes, you read that right; no, this isn't some AI hallucination; and no, I am not talking about the 2019 short film. Did you know Pixar has its own 24 bit floating point type adapted to its use case? Pixar is probably one of the most underrated tech companies: most people have no idea just how custom their rendering hardware has gotten in the past. Also, have you ever heard of Pixar Image Computers? Beware, this rabbit hole goes deep.

On one hand we have the IEEE floats, your industry standard floats. Being compliant guarantees that the same floating point operation will have the same behavior regardless of the hardware it is run on, something your buyers really want. (Unless your hardware has a bug, in which case you made an attempt at being compliant. Hello Intel 👋: the internet hasn't forgotten yet.)
But compliance implies supporting subnormals, NaNs, \(\pm\infty\) and 5 different rounding modes.

Then you have the matter of size: how large is your memory footprint, and how are you allocating your bits across your exponent and significand fields? Some options are:

- float16, IEEE 754 half precision: 5 bits exponent, 10 bits significand
- float32, IEEE 754 single precision: 8 bits exponent, 23 bits significand
- float64, IEEE 754 double precision: 11 bits exponent, 52 bits significand
- PXR24, Pixar's 24 bit format: 8 bits exponent, 15 bits significand
- tf32, Nvidia's TensorFloat-32, which is a 19 bit format (I know, right: why did they let the marketing department name this?): 8 bits exponent, 10 bits significand
- bfloat16, Google's brain float format: 8 bits exponent, 7 bits significand

What size we ultimately choose depends on our workload's needs. Certain workloads need more precision, which requires more significand bits; others benefit from smaller formats that let them work around a memory bandwidth limitation.

So, once again, the answer is: it depends! Thank you for reading!

More seriously, let's clarify what we are actually building, because this floating point arithmetic is going to be part of a larger project, else this wouldn't be any fun. But to maximize the fun factor we need a project which requires a LOT of floating point math.

Luckily, it just so happens that I know just the accelerator architecture for the task: a matrix-matrix multiplication systolic array! These types of accelerators are widely found in machine learning accelerators targeting both training and inference tasks (when quantization degrades accuracy too much).
Now, I am not setting out to build an AI accelerator, it just so happens to be a convenient excuse to put too much floating point arithmetic on my silicon.\nSplendid, now that we have our excuse know what application we are targeting, let us examine the constraints of this workload.\nFirstly, we are not building an FPU unit for a CPU with external clients. We can go custom, which conveniently lets us toss IEEE 754 compatibility and its 5 rounding modes back into the pit of hell from which it crawled out of. That said, for producing test vectors for testing the floating point arithmetic implementation I would like to choose a less confidential format. Additionally, any format that is easy to convert to and from one of the widely supported IEEE types, gets extra points. This will simplify interoperability between the accelerator and the firmware driving it.\nSecondly, my chip will be IO bottlenecked (for those that have been following this blog: yes, again 🔥) so my choice will be one of the minifloats, this term refers to floats of less than 32 bits wide. See, I am not choosing this just because the name is cute, there are also technical reasons.\nBecause we are targeting a smaller format, we need to be even more deliberate about our split of exponent/significant. Sacrificing exponent bits will reduce the range of our format, but skimping on the significant will reduce our number’s representable precision. Now, we can also approach the question of this split from the hardware angle by considering how this will impact the multiplication and addition implementation.\nLet us consider the multiplication, the most expensive operation in a floating point multiplication is the multiplication of the significants. 
This involves an unsigned multiplication <significand bits> + 1 wide, and the hardware cost of a multiplier does everything but scale linearly: the hardware cost of an 8 bit bfloat16 significand multiplication is roughly half that of an 11 bit float16 significand multiplication.
Small significands are starting to sound very appealing.
Back to what our application needs: AI workloads have the interesting characteristic of being relatively insensitive to loss of precision, as illustrated by the fact that quantization is even possible. On the other hand, they benefit from having more range.
For all these reasons, I now crown bfloat16 our winner. Congratulations, you are officially my favorite minifloat! 🏆
Here is a recap of why bfloat16 is the best format in the history of the universe:

- only 16 bits wide
- small mantissa: 7 bits
- widely spread format: this isn't a custom invention and even has C++ standard library support
- easy conversion to float32: just chop off the mantissa bits (with some caveats which we will get into later)
- not an IEEE 754 type

A great thing about bfloat16 is that it has no spec, so we can implement it however we want!
A horrible thing about bfloat16 is that it has no spec, so we can implement it however we want!
This project was a great lesson in why we need the IEEE to keep up the age-old tradition of attracting engineers with the promise of free donuts and locking them in a room until a spec is produced!
Turns out, when you don't have a spec you can implement something however you want, so naturally everyone does it differently!
Now, we are building a custom accelerator, so bfloat16 operation compatibility isn't a huge issue; the problem is that we now need to choose for ourselves which ice cream toppings we want on our floating point math flavor.
The first question to settle is rounding modes.
We need to choose at least one, and for testing against known test vectors it needs to be one of the IEEE modes.
Out of the 5 modes in the spec, round towards zero is by far the most convenient and cheapest to implement in hardware. Unlike the others, you never need to round upwards to the next floating point value, allowing me to skip a 16 bit addition at the very end of the adder (perfectly placed right on the critical path for maximum timing pressure).
But RZ (round towards zero) has another massive advantage, and that is its behavior on overflow:
RZ doesn't overflow to \\(\\pm\\infty\\), it clamps!
This means that, as long as no \\(\\infty\\) is provided as an input, my addition and multiplication operations will never produce an \\(\\infty\\). So by disallowing \\(\\infty\\) as an input I can drop \\(\\infty\\) support entirely and save on hardware.
But it gets better: the only way addition and multiplication organically produce a NaN is through operations involving \\(\\infty\\). So if I also disallow NaN as an input value, I can remove the hardware cost of supporting NaNs too.
Lastly we have the question of denormal support, probably the least debated question in bfloat16 design: the additional denormal values are just not worth their hardware hassle, dropped.
To recap, our ice cream order is 🍨:

- bfloat16: 1 bit sign, 8 bits exponent, 7 bits significand
- round toward zero rounding only
- no subnormal support, all subnormals will be clamped to \\(\\pm0.0\\)
- no \\(\\pm\\infty\\) or NaN support

Architecture #
Now that we have done the first hard part of deciding what we want to build, we need to do the second hard part: architecting it.
For our matrix-matrix operations we will require both an adder and a multiplier.
Since this article is getting quite long and the multiplier is actually quite easy to build once you figure out how to design the mantissa multiplication efficiently (spoiler: unsigned Booth radix-4 multipliers), I will focus on the more complex and intricate adder design.
The naive approach to designing an
adder is to do a single path adder, where all steps are done on the same path, similarly to the C code example. Although conceptually simple and efficient in terms of area due to the absence of logic duplication, the depth of this single path makes this architecture very expensive in terms of performance.
This is by no means a bad design, and if we were focused entirely on optimizing area it might be a viable candidate, but we have the dual mandate of area and performance, so we must do better.
Looking back at the addition algorithm, we observe that the massive cancellations and the mantissa shifting are actually mutually exclusive. These cancellations, for which we need to count the mantissa difference's leading zeros and subtract more than 1 from the exponent ahead of normalization, only occur when the exponent difference of the two operands is less than 2 AND we are doing an effective subtraction.
Based on this, and at the cost of some minor functionality duplication, we can split our adder into 2 paths:

- close path: exponent difference < 2 and effective subtraction
- far path: exponent difference < 2 with effective addition, or exponent difference >= 2

This split architecture is called the dual path architecture and has been the de facto adder architecture for high performance FPUs since the 80s.
Dual path adder architecture schematic.
Credit: Handbook of Floating-Point Arithmetic, Second edition
Now, this schematic is actually for the IEEE compliant float, and we are not designing the general case. So how does this change for us?
Recall how we are doing RZ rounding only? RZ is an effective truncation of the mantissa whenever rounding is involved, which means we will never need to round upwards, which in turn means we can chop off all the logic to this effect.
I have highlighted in color all the logic needed as a result of this rounding upwards, and we are removing all of it.
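The path selection rule is cheap to state in software. Here is a small C++ sketch of the close/far decision on raw bfloat16 bit patterns (the field extraction helpers are my own illustration, not the RTL):

```cpp
#include <cstdint>

// bfloat16 bit layout: [15] sign, [14:7] biased exponent, [6:0] significand.
static unsigned bf16_sign(uint16_t bits) { return bits >> 15; }
static unsigned bf16_exp(uint16_t bits)  { return (bits >> 7) & 0xFF; }

// Close path: |exponent difference| < 2 AND effective subtraction (for an
// addition, the operand signs differ). Everything else takes the far path.
bool use_close_path(uint16_t a, uint16_t b) {
    int diff = static_cast<int>(bf16_exp(a)) - static_cast<int>(bf16_exp(b));
    if (diff < 0) diff = -diff;                        // |exponent difference|
    bool effective_sub = bf16_sign(a) != bf16_sign(b); // a + (-b) style cases
    return diff < 2 && effective_sub;
}
```

For example, 1.0 + (-1.0) (bit patterns 0x3F80 and 0xBF80) lands on the close path, while 8.0 + (-1.0) (0x4100 and 0xBF80) takes the far path since the exponent difference is 3.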
(🔥 w 🔥) Next, since we are imposing that no NaNs or \\(\\infty\\) be used as an operand, we have no way of triggering an exception, so this too is removed.
RIPing even more stuff out 🪓
Now, this schematic doesn't illustrate how subnormals are handled, but our implementation is also saving logic there. That said, we do still need some logic to detect when they occur and clamp them to \\(0\\). On the close path this functionality is rolled into a block that I will label "normalize", which sits on our critical path after the multi-bit significand shift and the exponent subtraction.
The final design looks something like this:
Schematic of the version for bfloat16 addition I am implementing.
Chapter 3: Theory meets reality #
Verification #
Theory is entertaining, but until it has stood up to reality nothing proves it true. So, there is no better way of validating one's understanding than tossing it against cold hard reality.
Also, this isn't just some thought experiment, we are actually taping this out on actual silicon, and if my past traumas with hardware have ingrained but one lesson in my mind it is that: until you have proven it works, it is broken!
Time to run some tests!
Testing floating point arithmetic hardware is actually an interesting challenge since it is full of corner cases: you can't just test 100 random values and call it a day. No, you need exhaustive coverage of all these corners, most of which you didn't know existed. This is very much a "you don't know what you don't know" problem. So, what is the plan?
This is where I committed my crime against verification: directed simulation driven testing scales with the size of the input space, which is a fancy way of saying it doesn't scale. This is why formal methods are increasingly widespread for floating point validation.
If we wanted to test this using directed testing we would need to test all \\(2^{32}\\) input combinations, which sounds like a terrible idea …
… and exactly what I am going to do, because it is the only way to exhaustively test all the corner cases without verified prior knowledge of where all the corners are.
This is a second order of ignorance problem.
This creates our next problem: testing time. Just how long is testing 4+ billion combinations going to take? Because I believe low iteration times are paramount to getting shit done, I need this testbench to run fast, so I need a fast simulator and a fast golden model.
Enter Verilator: like Synopsys VCS, it compiles your simulation into an executable that can be run natively on your machine, making it the fastest open source simulator in town.
Next comes the golden model, and it just so happens we are living in 2026 and the C++23 standard library has introduced the std::bfloat16_t type in the <stdfloat> header.
Finally, we can use the DPI-C interface (and not the slower VPI interface) to call our custom golden models, coded in C++ and compiled into the testbench.
Sounds like a perfect plan … until C++ betrayed me.
How does stdfloat’s bfloat16 actually work ?
# But before going into how my golden model turned out to be not so golden, we need to understand how the stdfloat bfloat16 type actually works under the hood.
Crucially, my computer doesn't have native bfloat16 hardware, so how is it mathing?
Let's test it using a simple addition:

```cpp
#include <stdfloat>
using std::bfloat16_t; // the extended floating point types live in namespace std

int main(){
    bfloat16_t a, b, c;
    a = 1.0;
    b = 1.0;
    c = a + b;
    return 0;
}
```

Looking at the disassembly of this test program we can see that gcc handles the bfloat16 addition by using the soft bfloat16 floating point function replacements (__extendbfsf2, __truncsfbf2 + wrapper code).
This indicates that my current hardware either doesn't have hardware support for bfloat16 or that the support isn't being advertised to the compiler.

```
4	int main(){
   0x0000000000001119 <+0>:	push   %rbp
   0x000000000000111a <+1>:	mov    %rsp,%rbp
   0x000000000000111d <+4>:	sub    $0x20,%rsp
5	bfloat16_t a, b, c;
6
7	a = 1.0;
   0x0000000000001121 <+8>:	movzwl 0xee4(%rip),%eax        # 0x200c
   0x0000000000001128 <+15>:	mov    %ax,-0x6(%rbp)
8	b = 1.0;
   0x000000000000112c <+19>:	movzwl 0xed9(%rip),%eax        # 0x200c
   0x0000000000001133 <+26>:	mov    %ax,-0x4(%rbp)
9	c = a+b;
   0x0000000000001137 <+30>:	pinsrw $0x0,-0x6(%rbp),%xmm0
   0x000000000000113d <+36>:	call   0x1180 <__extendbfsf2>
   0x0000000000001142 <+41>:	movss  %xmm0,-0x14(%rbp)
   0x0000000000001147 <+46>:	pinsrw $0x0,-0x4(%rbp),%xmm0
   0x000000000000114d <+52>:	call   0x1180 <__extendbfsf2>
   0x0000000000001152 <+57>:	movaps %xmm0,%xmm1
   0x0000000000001155 <+60>:	addss  -0x14(%rbp),%xmm1
   0x000000000000115a <+65>:	movd   %xmm1,%eax
   0x000000000000115e <+69>:	movd   %eax,%xmm0
   0x0000000000001162 <+73>:	call   0x1250 <__truncsfbf2>
   0x0000000000001167 <+78>:	movd   %xmm0,%eax
   0x000000000000116b <+82>:	mov    %ax,-0x2(%rbp)
10
11	return 0;
   0x000000000000116f <+86>:	mov    $0x0,%eax
12	}
   0x0000000000001174 <+91>:	leave
   0x0000000000001175 <+92>:	ret
```

Based on this assembly, the expected behavior of bfloat16_t would be similar to a clamped down float32_t.
Which is totally legal, given the bfloat16 behavior is not fully outlined by any spec, making it implementation defined. 🌈
Probing the standard library soft bfloat16_t’s implementation #
Since I am not perfectly fluent in x86 assembly, I decided to write a simple test program to probe the behavior of my soft bfloat16_t.
From this I learned that it:

- has subnormal support
- has NaN support
- has inf support

So naive me thought that, in order to use this as a golden model for the hardware, all I needed was to manually clamp subnormals to 0 and not drive NaNs and \\(\\infty\\)s.

```cpp
// A value is subnormal when it is none of: normal, NaN, inf or zero.
#define IS_SUBNORMAL(x) (!(isnormal(x) || isnan(x) || isinf(x) || (x == 0e0bf16)))
```

Betrayed by C++ #
So I clamped my subnormals and went on my merry way.
But while I was blissfully working my way through testing all of my possible operand combinations, disaster struck.
I quote from the C++ 2022 published proposal on "Extended floating-point types and standard names":

> 7.2. Supported formats
>
> We propose aliases for the following layouts:
>
> - [IEEE-754-2008] binary16 - IEEE 16-bit.
> - [IEEE-754-2008] binary32 - IEEE 32-bit.
> - [IEEE-754-2008] binary64 - IEEE 64-bit.
> - [IEEE-754-2008] binary128 - IEEE 128-bit.
> - bfloat16, which is binary32 with 16 bits of precision truncated; see [bfloat16].
– P1467R9 - Extended floating-point types and standard names
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p1467r9.html#alias-formats
Note: Let's make a deal and collectively pretend we all agree we didn't see that the spec is saying binary32 has 16 bits of precision instead of 24. This is just a typo, and I am unworthy of telling the guys literally designing the next C++ standard that they should fix their spec.
Essentially, this means bfloat16_t is exactly what our reading of the assembly hinted at: a truncated float32_t (referred to as binary32 in the IEEE spec).
The problem with this approach is that float32_t has a much larger internal precision \\(p\\) than bfloat16_t:

- float32_t: \\(p = 24\\)
- bfloat16_t: \\(p = 8\\)

In practice, this means that if I want my hardware to exactly match the golden model, as specified by the C++ standard library, I will need to support \\(p = 24\\), which directly translates to a much wider significand path everywhere … and that is in no universe the outcome I am interested in.
Within 1 ulp #
Given the C++ standard library's implementation of bfloat16_t using a float32_t under the hood, I cannot cleanly match the results of my golden model to the expected RTL output.
This is because float32_t has \\(p=24\\) bits of internal precision, while bfloat16_t has 8 bits. So, given the same input values, if the difference in exponent between these inputs is within the range \\(]p_{bfloat16}; p_{float32}[\\),
I will observe a difference in rounding behavior even if we are using the same rounding mode and the same operands.
Helpfully, this difference is bounded: it is at most the gap to the next consecutive floating point number.
To help simplify the following section, let me define \\(ulp(x)\\), the "unit in the last place", or more formally:
\\(ulp(x)\\) is the gap between the two floating-point numbers nearest to \\(x\\), even if \\(x\\) is one of them. ~ William Kahan, Godfather of floating point, 1960
As such, the error between my golden model's bfloat16_t and my implementation will be at most \\(1 \\times ulp(x)\\).
Relative to the magnitude of \\(x\\), \\(ulp(x)\\) is: $$ ulp(x) = 2^{-p+1} $$ For my bfloat16 implementation with \\(p=8\\) we thus have \\(ulp(x) = 2^{-7}\\), and I will be calculating the relative error as: $$ error(x) = \\frac{x_{model} - x_{hw}}{x_{model}} $$
C++ isn’t at fault #
Now that I have finished ranting and gotten my golden model to behave, I would like to point out that I recognize that emulating bfloat16 from float32 is the superior approach for the standard library. It allows offloading most of the compute to the CPU's FPU, giving orders of magnitude better performance than if it was entirely implemented in software.
Sure, it might come with different results, but this is what we signed up for when we started using bfloat16.
Implementation #
Now that we have some working bfloat16 arithmetic comes my favourite part: building it out of ~~expensive crystals~~ semiconductors.
Initial global placement of all the logic cells of the ASIC floorplan in action.
Captured using OpenROAD global placement in debug mode.
Since this article is focused on the floating point arithmetic, I will contain my desire to tell you all about the rest of the accelerator and lock it up in the ASIC's repo:
Essenceia/Systolic_Array_with_DFT_v2 IHP 130nm ASIC tapeout of a 2x2 bfloat16 matrix-matrix multiplication with DFT infrastructure. Iteration on the previous accelerator taped out on GF180.
Floorplan of the second generation Systolic Array designed for IHP's 130nm node using the sg13g2 PDK. It occupies 126,685 µm² of die area and has a target typical operating voltage of 1.2V at 25°C.
This design features two clock trees, one for the MAC and another for the JTAG TAP. The MAC clock targets a 100 MHz maximum operating frequency, but current output GPIO frequency experiments suggest a 75 MHz maximum; the JTAG clock targets 2 MHz.
Tiny Tapeout ihp26a #
This time we will be taping out the chip on IHP's fancy 130nm sg13g2 node using Tiny Tapeout's ihp26a chip as our shuttle. Also, for full transparency, I was offered a coupon by the Tiny Tapeout team that I traded for the area that I am using in this project (and a devboard). So technically speaking this tapeout is sponsored by Tiny Tapeout, which will definitely not help me rave less about them!
Thanks guys!
Tiny Tapeout shuttle chip ihp26a render.
Source: https://github.com/TinyTapeout/tinytapeout-chip-renders
The IHP 130nm cell library we are using this time is special: it is lightning fast compared to the other two open source PDKs, allowing us to reach some truly impressive fmax's. (As illustrated by our fmax competition using the sister IHP sg13cmos5l node.)
But once again, we have an IO problem: the maximum stable GPIO operating frequency is expected to be around 100MHz on the input and 75MHz on the output path, meaning this systolic array is effectively bottlenecked at 75MHz.
But, since the sg13g2 PDK is so fast, and closing timing at 75MHz was not challenging enough, I decided to challenge myself and target a nominal frequency of 100MHz. Oh, and I would also do the entire bfloat16 addition and multiplication in a single cycle.
Sure, I could pipeline these operations, but that would be wasting an opportunity to force myself to improve the implementation's performance.
Yosys you pleb #
Let me tell you about an interesting story that occurred during implementation.
But first, let us set the stage: the critical path cuts exactly where you would expect it, going through the multiplier's mantissa multiplication and continuing through the adder's close path right through the LZC (leading zero count).
At this point in our story, I had already pulled a handful of the more classic RTL timing optimization tricks and was sitting comfortably at +0.6ns of slack on the slow corner and +3.9ns on the nominal corner. (These numbers are from the timing before redoing all the systolic array's control logic and chaining together all the flops as part of the DFT scan chain.) I thought my work here was done when a friend started questioning me about my LZC design choice.
I had chosen to implement a tree based LZC, the same one you find in the literature, and although the Verilog is so unreadable it warrants its own testbench, its underlying concept was too elegant to pass up.
Since my timing was already sitting pretty, I decided to keep the option of adding a more optimized leading zero anticipator as a backup for later. And so my friend comes along and suggests we do something different.
Forget the tree based LZC, just write a priority mux and let the synthesizer deal with it.

```verilog
always @(*) begin
    casez (in)
        9'b1????????: shift_amt = 4'd0;
        9'b01???????: shift_amt = 4'd1;
        9'b001??????: shift_amt = 4'd2;
        9'b0001?????: shift_amt = 4'd3;
        9'b00001????: shift_amt = 4'd4;
        9'b000001???: shift_amt = 4'd5;
        9'b0000001??: shift_amt = 4'd6;
        9'b00000001?: shift_amt = 4'd7;
        9'b000000001: shift_amt = 4'd8;
        default:      shift_amt = 4'd0;
    endcase
end
```

In all transparency, I did NOT think this would result in better timing than the tree based LZC. Yet, between the RTL and timing there is yosys with its 124 levels of optimization, and yosys is techmap aware.
For reference, this is the module's code, isolated into its own module so that I can keep track of it when keeping the hierarchy during implementation:

```verilog
module pmux(
    input  wire [8:0] data_i,
    output reg  [3:0] zero_cnt
);

always @(*) begin
    casez (data_i)
        9'b1????????: zero_cnt = 4'd0;
        9'b01???????: zero_cnt = 4'd1;
        9'b001??????: zero_cnt = 4'd2;
        9'b0001?????: zero_cnt = 4'd3;
        9'b00001????: zero_cnt = 4'd4;
        9'b000001???: zero_cnt = 4'd5;
        9'b0000001??: zero_cnt = 4'd6;
        9'b00000001?: zero_cnt = 4'd7;
        9'b000000001: zero_cnt = 4'd8;
        default:      zero_cnt = 4'd0;
    endcase
end

endmodule
```

This module is implemented as a 19 cell result that comes out to only 3 logic levels deep on the critical path, resulting in better timing.
In the final flattened version this results in a +0.05 ns improvement on the slow path. On one hand, this isn't a huge gain.
But on the other, this experience forces me to recognise how performant the tools can be.
This casez LZC design is not only slightly faster, its biggest strength is how much simpler it is to understand, which in turn makes it easier to maintain, directly reducing the likelihood of bugs being introduced in the future.
Sometimes a good design is about more than just performance.
Combo! #
But wait! There was a second tapeout?!
When I started planning this article, my hope was to have the bfloat16 arithmetic taped out as part of my second generation systolic array on IHP 130 nm.
But in the midst of writing this article, the Tiny Tapeout community got the chance to do a second tapeout on IHP 130 nm targeting IHP's newer sg13cmos5l node.
Just like how we got the chance to do the GF180 tapeout for the first generation systolic array, this was a private experimental shuttle, making us come full circle.
Tiny Tapeout shuttle ihp0p4 chip render.
Credit: Luis Eduardo Ledoux Pardo
Now, I have a somewhat unique rule for my tapeouts: I never tape out the same design twice.
So if I wanted to submit to the ihp0p4 shuttle chip, I couldn't just re-use my existing IP. No, I needed something new.
Enter the fmax challenge!
Do you recall how I told you IHP cells were lightning fast and how I was doing the full addition and multiplication in a single cycle? Well, part of me was dying to know how high I could reach if we were to ignore the IO limitation and aim for the maximum frequency! Luckily for me, another community member was taping out a comparable design: and so we raced!
Essenceia/uselessly_fast_bfloat16_multiplier Pushing the bf16 multiplication clock frequency to the max on the nominal corner on IHP 130nm 5L node.
In order to increase the frequency, the bfloat16 multiplication was cut into 2 cycles. As expected, the main critical path went through the mantissa multiplication.
Now, in the original implementation of the multiplication, I was using the synthesizer implementation directive to infer an unsigned Booth radix-4 multiplier.
As the LZC experience has shown us, yosys is no lightweight when it comes to generating optimized logic. Unfortunately, we trade this for a loss of control over our netlist, and in this case the inability to choose exactly where we would split the multiplication.
Thus, in order to pipeline this path, I needed to implement a custom 8-bit unsigned Booth radix-4 multiplier from scratch.
Minor detail … Oh, also, I forgot to mention one small thing: this shuttle was officially announced, opened and closed within the span of 24 hours! So now picture all of this happening at 3am. 🫠
Inside this custom multiplication stage, a flop is added after the encoding stage, in the middle of the compression stage. We store the partial compression of the first two partial products and that of the last 3, then on the next cycle compress these together to get the final result of the mantissa multiplication.
Schematic of the fast multiplier I am implementing, with the separation line indicating which operations happen on \\(t_0\\) and which on \\(t_1\\).
A few additional such optimizations were performed throughout the multiplier, allowing this design to reach a maximum operating frequency of 454.545 MHz.
Uselessly fast multiplier floorplan render. It operates at up to 454.545 MHz on the nominal operating corner of 1.20 V at 25°C and occupies a single tile of 202.08 × 154.98 µm.
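For readers who want the gist of the recoding without digging through the RTL, here is a behavioral C++ sketch of an 8-bit unsigned Booth radix-4 multiply (illustration only: plain integer arithmetic, not the actual carry-save compression hardware):

```cpp
#include <cstdint>

// Booth radix-4: recode the 8-bit unsigned multiplier b into 5 signed digits
// in {-2,-1,0,1,2} taken from overlapping 3-bit windows, then sum the 5
// partial products a * digit * 4^i. In hardware each digit selects 0, ±a or
// ±2a, so only 5 partial products need compressing instead of 8.
uint32_t booth_radix4_umul8(uint8_t a, uint8_t b) {
    uint32_t m = static_cast<uint32_t>(b) << 1; // prepend the implicit b[-1] = 0
    int32_t product = 0;
    for (int i = 0; i < 5; ++i) {               // zero extension above b[7]
        uint32_t w = (m >> (2 * i)) & 0x7;      // window: b[2i+1], b[2i], b[2i-1]
        int32_t digit = static_cast<int32_t>(w & 1)
                      + static_cast<int32_t>((w >> 1) & 1)
                      - 2 * static_cast<int32_t>((w >> 2) & 1);
        product += digit * static_cast<int32_t>(a) * (1 << (2 * i));
    }
    return static_cast<uint32_t>(product);      // equals plain a * b
}
```

Pipelining the real multiplier then amounts to registering partial sums of these partial products, which is exactly where the flop described above was placed.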
Closing #
After over half a decade, I have finally slain my dragon and taken my revenge on floating point math!
After having built my own floating point arithmetic from scratch, I now believe the only people that truly understand floating point are:

- The people writing the IEEE 754 spec
- The math PhDs working on the floating point representation

After having re-implemented floating point arithmetic and taped it out twice, I can confidently assert that I do not deeply understand floating point arithmetic, but at least now I know exactly how deep the rabbit hole goes and what I must do if I want to truly master it.
But with two 130nm tapeouts containing my own floating point IP, I can confidently leave the exploration of the other minifloats and the implementation of more complex operations for some other day.
Because I now have something much more important to do!
Before we can go to sleep, before we can finish writing the doc, there is one post-tapeout tradition that must never be skipped:
Waffle House!
❤ P.S # I highly recommend the excellent book \u0026ldquo;Handbook of Floating-Point Arithmetic, Second edition\u0026rdquo;, for readers looking for the 600 page version of the floating point question.\nSpecial thanks to my best half, yg, Prawnzz and Erstfeld for helping review this article.\n","date":"3 April 2026","externalUrl":null,"permalink":"/projects/floating_dragon/","section":"Other projects","summary":"Actually building floating point from scratch!","title":"Floating point from scratch: Hard Mode","type":"projects"},{"content":"","date":"3 April 2026","externalUrl":null,"permalink":"/tags/ihp/","section":"Tags","summary":"","title":"IHP","type":"tags"},{"content":"","date":"3 April 2026","externalUrl":null,"permalink":"/tags/math/","section":"Tags","summary":"","title":"Math","type":"tags"},{"content":"This section contains a list of some of my other recent projects.\n","date":"3 April 2026","externalUrl":null,"permalink":"/projects/","section":"Other projects","summary":"","title":"Other projects","type":"projects"},{"content":"","date":"3 April 2026","externalUrl":null,"permalink":"/tags/rtl/","section":"Tags","summary":"","title":"Rtl","type":"tags"},{"content":"","date":"3 April 2026","externalUrl":null,"permalink":"/tags/systolic_array/","section":"Tags","summary":"","title":"Systolic_array","type":"tags"},{"content":"","date":"3 April 2026","externalUrl":null,"permalink":"/tags/","section":"Tags","summary":"","title":"Tags","type":"tags"},{"content":"","date":"3 April 2026","externalUrl":null,"permalink":"/","section":"Tales on the wire","summary":"","title":"Tales on the wire","type":"page"},{"content":"","date":"3 April 2026","externalUrl":null,"permalink":"/tags/verification/","section":"Tags","summary":"","title":"Verification","type":"tags"},{"content":"","date":"3 April 2026","externalUrl":null,"permalink":"/tags/verilog/","section":"Tags","summary":"","title":"Verilog","type":"tags"},{"content":"I build things.\nNot because of a job, nor 
because I have to, but because my ability to build is part of my identity: I develop myself as an individual by expanding the scope of what I can do.
Not all projects are created equal. Most are small, ranging from a few hours to a few days. I might pick up a bit of additional nuance in a domain I have already opened the door to, but these simply prolong the trajectory of my current growth.
Then you have the medium projects, taking days to months. These do not happen accidentally, they come with goals, plans and strategies. They push the boundaries of my current understanding and sometimes create step functions in the technical growth curve.
Lastly, you have the true mastodons, the tentpole projects. These are purposefully designed to be leaps into the unknown, cutting a new path to grow through. They are hard, long and ambitious projects, requiring multiple months of perseverant work. But they re-define the growth trajectory, opening up new dimensions to it and accelerating it upwards.
They are also the rarest.
I have currently done only 5 tentpole projects in my life. I can name each one and tell you exactly how each changed what I can do.
Not all are hardware or software projects, but each expanded the space of possibility.\nToday I am embarking on project number 6.\nA public lecture about a model solar system, with a lamp—in place of the sun—illuminating the faces of the audience, Joseph Wright of Derby (1766) - europeana.eu, Public Domain, https://commons.wikimedia.org/w/index.php?curid=1292995 ","date":"29 March 2026","externalUrl":null,"permalink":"/thoughts/tentpole_projects/","section":"Thoughts","summary":"Projects that expand the space of possibility.","title":"Tentpole Projects","type":"thoughts"},{"content":"Repository of personal thoughts.\n","date":"29 March 2026","externalUrl":null,"permalink":"/thoughts/","section":"Thoughts","summary":"","title":"Thoughts","type":"thoughts"},{"content":" This article was originally published on the 2nd of September 2025 and is regularly kept up to date to include the most recent developments around using the Alibaba AS02MC04 board as a dev board.\nThe Before you buy section wasn't part of the original article and features the most recent updates.\nIntroduction # I was recently in the market for a new FPGA to start building my upcoming projects on.\nDue to the scale of my upcoming projects, a Xilinx UltraScale+ FPGA of the Virtex family would be perfect, but a Kintex series FPGA will be sufficient for early prototyping. Since I did not want to part ways with the eye watering amount of money required for a Vivado Enterprise edition license, my choice was effectively narrowed to the FPGA chips available under the WebPack version of Vivado.\nXilinx supported boards per Vivado edition\nUnsurprisingly, Xilinx is well aware of how top of the range the Virtex series is, and doesn't offer any Virtex UltraScale+ chips with the WebPack license.
That said, they do offer support for two very respectable Kintex UltraScale+ FPGA models, the XCKU3P and the XCKU5P.
Xilinx product guide, overview of the Kintex UltraScale+ series
These two chips are far from being small hobbyist toys, with the smaller XCKU3P already boasting 162K+ LUTs and 16 GTY transceivers capable, depending on the physical constraints imposed by the chip packaging, of operating at up to 32.75Gb/s.
Now that the chip selection had been narrowed down, I set out to look for a dev board.
My requirements for the board were that it featured:

- at least 2 SFP+ or 1 QSFP connector
- a JTAG interface
- a PCIe interface at least x8 wide

As for where to get the board from, my options were:

- Design the board myself
- Get the AXKU5 or AXKU3 from Alinx
- See what I could unearth on the second hand market

Although option 1 could have been very interesting, designing a dev board with both a high speed PCIe and ethernet interface was not the goal of today's project.
As for option 2, Alinx is a newer vendor that is still building up its credibility in the west; their technical documentation is a bit sparse, but the feedback seems to be positive, with no major issues being reported. Most importantly, Alinx provides very fairly priced development boards in the 900 to 1050 dollar range (+150$ for the HPC FMC SFP+ extension board). Although these are not cheap by any metric, compared to the competition's price point they are the best value.
Option 2 was coming up ahead until I stumbled upon this eBay listing:
Ebay listing for a decommissioned Alibaba Cloud accelerator FPGA. Model name: AS02MC04
For 200$, this board featured a XCKU3P-FFVB676, 2 SFP+ connectors and a x8 PCIe interface. On the flip side, it came with no documentation whatsoever, no guarantee it worked, and the faint promise in the listing that there was a JTAG interface.
A sane person would likely have dismissed this as an interesting internet oddity, a remnant of what happens when a generation of accelerator cards gets phased out in favor of the next, or maybe just an expensive paperweight.\nBut I like a challenge, and the appeal of unlocking the 200$ Kintex UltraScale+ development board was too great to ignore.\nAs such, I aim for this article to become the documentation paving the way through this mirage.\nThe debugger challenge # Xilinx\u0026rsquo;s UG908 Programming and Debugging User Guide (Appendix D) specifies their blessed JTAG probe ecosystem for FPGA configuration and debug. Rather than dropping $100+ on yet another proprietary dongle that\u0026rsquo;ll collect dust after the project ends, I\u0026rsquo;m exploring alternatives. The obvious tradeoff: abandoning Xilinx\u0026rsquo;s toolchain means losing ILA integration. However, the ILA fundamentally just captures samples and streams them via JTAG USER registers; there\u0026rsquo;s nothing preventing us from building our own logic analyzer with equivalent functionality and a custom host interface.\nEnter OpenOCD. While primarily targeting ARM/RISC-V SoCs, it maintains an impressive database of supported probe hardware and provides granular control over JTAG operations. More importantly, it natively supports SVF (Serial Vector Format), a vendor-neutral test vector format that Vivado can export.\nThe documentation landscape is admittedly sparse for anything beyond 7-series FPGAs, and the most recent OpenOCD documentation I could unearth was focused on Zynq ARM core debugging rather than fabric configuration. But the fundamentals remain sound: JTAG is JTAG, SVF is standardized, and the boundary scan architecture hasn\u0026rsquo;t fundamentally changed.\nThe approach should be straightforward: generate SVF from Vivado, feed it through OpenOCD with a commodity JTAG adapter, and validate the configuration. 
Worst case, we\u0026rsquo;ll need to patch some adapter-specific quirks or boundary scan chain register addresses. Time to find out if this theory holds up in practice.\nThe plan # So, to summarize, the current plan is to buy a second hand hardware accelerator off eBay at a too-good-to-be-true price, and try to configure it with an unofficial probe using open source software without any clear official support.\nIf you, like me, have been around the block a few times, you are already asking the obvious question: what could possibly go wrong? The answer: many things.\nAs such, we need a plan for approaching this. The goal of this plan is to outline incremental steps that build on one another, with the end goal of being able to use this as a dev board.\n1 - Confirming the board works # First order of business will be to confirm the board is showing signs of working as intended.\nThere is a high probability that the flash wasn\u0026rsquo;t wiped before this board was sold off, as such the previous bitstream should still be in the flash. Given this board was used as an accelerator, we should be able to use that to confirm the board is working by either checking if the board is presenting itself as a PCIe endpoint or if the SFPs are sending the ethernet PHY idle sequence.\n2 - Connecting a debugger to it # The next step is going to be to try and connect the debugger. The eBay listing advertised a JTAG interface, but the picture is grainy enough that it is unclear where that JTAG is and which pins are available.\nAdditionally, we have no indication of what devices are daisy chained together onto the JTAG scan chain. This is an essential question for flashing over JTAG, so it will need to be figured out.\nAt this point, it would also be strategic to try and do some more probing into the FPGA via JTAG. Xilinx FPGAs expose a handful of useful system registers accessible over JTAG. 
The most well known of these interfaces is the SYSMON, which allows us, among other things, to get real time temperature and voltage readings from inside the chip. Although openOCD doesn\u0026rsquo;t have SYSMON support out of the box it would be worthwhile to build it, to :\nFamiliarise myself with openOCD scripting, which might come in handy when building my ILA replacement down the line Have an easy side channel to monitor FPGA operating parameters Make a contribution to openOCD, as it has support for interfacing with the XADC but not the SYSMON 3 - Figuring out the Pinout # The hardest part will be figuring out the FPGA\u0026rsquo;s pinout and my clock sources. The questions that need answering are :\nwhat external clock sources do I have, what are their frequencies and which pins are they connected to which transceivers are the SFPs connected to which transceivers is the PCIe connected to 4 - Writing a bitstream # For now I will be focusing on writing a temporary configuration over JTAG to the CCLs and not re-writing the flash.\nThe plan is to try either writing the bitstream directly through openOCD\u0026rsquo;s virtex2 + pld drivers, or replaying the SVF generated by Vivado.\nSince I believe a low iteration time is paramount to project velocity and getting big things done, I also want to automate the whole Vivado flow, from the RTL to the SVF generation.\nSimple enough ?\nLiveness test # A few days later my prize arrived via express mail.\nMy prized Kintex UltraScale+ FPGA board also known as the decommissioned Alibaba cloud accelerator. Jammed transceiver now safely removed. Unexpectedly it even came with a free 25G SFP28 Huawei transceiver rated for a 300m distance and a single 1m long OS2 fiber patch cable. 
This was likely not intentional as the transceiver was jammed in the SFP cage, but it was still very generous of them to include the fiber patch cable.\nFree additional SFP28-25G-1310nm-300m-SM Huawei transceiver, and 1m long OS2 patch cable The board also came with a travel case, half of a PCIe to USB adapter, and a 12V power supply that one could use to power the board as a standalone device. Although this standalone configuration will not be of any use to me, for those looking to develop just networking interfaces without any PCIe interface, this could come in handy.\nOverall the board looked a little worn, but neither the transceiver cages nor the PCIe connectors looked damaged.\nStandalone configuration # Before real testing could start I first did a small power-up test using the PCIe to USB adapter that the seller provided. Using the LEDs and the FPGA\u0026rsquo;s dissipated heat, I was able to do a quick check showing that, at a surface level (pun intended), the board seemed to be powering up.\nPCIe interface # As a reminder, this next section relies on the flash not having been wiped and still containing the previous user\u0026rsquo;s design. Since I didn\u0026rsquo;t want to directly plug mystery hardware into my prized build server, I decided to use a Raspberry Pi 5 as my sacrificial test device and got myself an external PCIe adapter.\nIt just so happened that the latest Raspberry Pi version, the Pi 5, now features an external PCIe Gen 2.0 x1 interface. 
Though our FPGA can handle up to PCIe Gen 3.0 and the board has an x8 wide interface, the PCIe standard is backwards compatible and the number of lanes on a link can be negotiated down, so plugging our FPGA into this Raspberry Pi will work.\nFPGA board connected to the Raspberry Pi 5 via the PCIe to PCIe x1 adapter After both the Raspberry Pi and the FPGA were booted, I SSHed into my rpi and started looking for the PCIe enumeration sequence logged by the Linux PCIe core subsystem.\ndmesg log :\n[ 0.388790] pci 0000:00:00.0: [14e4:2712] type 01 class 0x060400 [ 0.388817] pci 0000:00:00.0: PME# supported from D0 D3hot [ 0.389752] pci 0000:00:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring [ 0.495733] brcm-pcie 1000110000.pcie: link up, 5.0 GT/s PCIe x1 (!SSC) [ 0.495759] pci 0000:01:00.0: [dabc:1017] type 00 class 0x020000 Background information # Since most people might not be intimately familiar with PCIe terminology, allow me to quickly document what is going on here.\n0000:00:00.0: is the identifier of a specific PCIe device connected through the PCIe network to the kernel, it reads as domain:bus:device.function.\n[14e4:2712]: is the device\u0026rsquo;s [vendor id:device id], these vendor ids are assigned by the PCI standard body to hardware vendors. Vendors are then free to define their own device id\u0026rsquo;s.\nThe full list of official vendor id\u0026rsquo;s and released device id\u0026rsquo;s can be found : https://admin.pci-ids.ucw.cz/read/PC/14e4 or in the linux kernel code : https://github.com/torvalds/linux/blob/7aac71907bdea16e2754a782b9d9155449a9d49d/include/linux/pci_ids.h#L160-L3256\ntype 01: PCIe has two types of devices, bridges, which allow the connection of multiple downstream devices to an upstream device, and endpoints, which are the leaves. Bridges are of type 01 and endpoints of type 00.\nclass 0x060400: is the PCIe device class, it categorizes the kind of function the device performs. 
It uses the following format 0x[Base Class (8 bits)][Sub Class (8 bits)][Programming Interface (8 bits)], ( note : the sub class field might be unused ).\nA list of class and sub class identifiers can be found: https://admin.pci-ids.ucw.cz/read/PD or again in the linux codebase : https://github.com/torvalds/linux/blob/7aac71907bdea16e2754a782b9d9155449a9d49d/include/linux/pci_ids.h#L15-L158\nDmesg log # The two most interesting lines of the dmesg log are :\n[ 0.388790] pci 0000:00:00.0: [14e4:2712] type 01 class 0x060400 [ 0.495759] pci 0000:01:00.0: [dabc:1017] type 00 class 0x020000 Firstly the PCIe subsystem logs that at 0000:00:00.0 it has discovered a Broadcom BCM2712 PCIe Bridge ( vendor id 14e4, device id 0x2712 ). This bridge\u0026rsquo;s (type 01) class 0x0604xx tells us it is a PCI-to-PCI bridge, meaning it is essentially creating additional PCIe buses downstream for endpoint devices or additional bridges.\nThe subsystem then discovers a second device at 0000:01:00.0, this is an endpoint (type 00), and class 0x0200xx tells us it is ethernet networking equipment. Of note, dabc doesn\u0026rsquo;t correspond to a known vendor id. When designing a PCIe interface in hardware these are parameters we can configure. Additionally, among the different ways Linux uses to identify which driver to load for a PCIe device, the vendor id and device id can be used for matching. Supposing we are implementing custom logic, in order to prevent any bug where the wrong driver might be loaded, it is best to use a separate vendor id. 
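The decoding described above is mechanical enough to sketch in a few lines of plain Python (purely illustrative, with the values taken from the dmesg lines):

```python
# Decode a PCIe "vendor:device" pair and a 24-bit class code into their fields.
def decode_ids(vendor_device):
    vendor, device = vendor_device.split(":")
    return int(vendor, 16), int(device, 16)

def decode_class(class_code):
    base = (class_code >> 16) & 0xFF   # base class, e.g. 0x06 = bridge
    sub = (class_code >> 8) & 0xFF     # sub class,  e.g. 0x04 = PCI-to-PCI
    prog_if = class_code & 0xFF        # programming interface
    return base, sub, prog_if

# The two devices from the dmesg log:
print(decode_ids("14e4:2712"), decode_class(0x060400))  # Broadcom bridge
print(decode_ids("dabc:1017"), decode_class(0x020000))  # the suspected FPGA endpoint
```

Running it confirms the reading above: 0x060400 splits into base class 0x06 (bridge) with sub class 0x04 (PCI-to-PCI), and 0x020000 into base class 0x02 (network controller) with sub class 0x00 (ethernet).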
A distinct vendor id also makes your custom accelerator identifiable at a glance, and can be used to match your custom driver.\nAs such, it is not surprising to see an unknown vendor id appear for an FPGA; this, combined with the class marking it as an ethernet networking device, is a strong hint that this is our board.\nFull PCIe device status # Dmesg logs have already given us a good indication that our FPGA board and its PCIe interface were working, but to confirm with certainty that the device with vendor id dabc is our FPGA we now turn to lspci. lspci -vvv is the most verbose output and gives us a full overview of the detected PCIe devices\u0026rsquo; capabilities and current configurations.\nBroadcom bridge:\n0000:00:00.0 PCI bridge: Broadcom Inc. and subsidiaries BCM2712 PCIe Bridge (rev 21) (prog-if 00 [Normal decode]) Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast \u0026gt;TAbort- \u0026lt;TAbort- \u0026lt;MAbort- \u0026gt;SERR- \u0026lt;PERR- INTx- Latency: 0 Interrupt: pin A routed to IRQ 38 Bus: primary=00, secondary=01, subordinate=01, sec-latency=0 Memory behind bridge: [disabled] [32-bit] Prefetchable memory behind bridge: 1800000000-182fffffff [size=768M] [32-bit] Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast \u0026gt;TAbort- \u0026lt;TAbort- \u0026lt;MAbort- \u0026lt;SERR- \u0026lt;PERR- BridgeCtl: Parity- SERR- NoISA- VGA- VGA16- MAbort- \u0026gt;Reset- FastB2B- PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn- Capabilities: [48] Power Management version 3 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-) Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME- Capabilities: [ac] Express (v2) Root Port (Slot-), MSI 00 DevCap: MaxPayload 512 bytes, PhantFunc 0 ExtTag- RBE+ DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq- RlxdOrd+ ExtTag- PhantFunc- AuxPwr+ NoSnoop+ MaxPayload 512 bytes, MaxReadReq 512 bytes DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend- 
LnkCap: Port #0, Speed 5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s \u0026lt;2us, L1 \u0026lt;4us ClockPM+ Surprise- LLActRep- BwNot+ ASPMOptComp+ LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk- ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 5GT/s, Width x1 TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt+ RootCap: CRSVisible+ RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible+ RootSta: PME ReqID 0000, PMEStatus- PMEPending- DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR+ 10BitTagComp- 10BitTagReq- OBFF Via WAKE#, ExtFmt- EETLPPrefix- EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit- FRS- LN System CLS Not Supported, TPHComp- ExtTPHComp- ARIFwd+ AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS- DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled, ARIFwd- AtomicOpsCtl: ReqEn- EgressBlck- LnkCap2: Supported Link Speeds: 2.5-5GT/s, Crosslink- Retimer- 2Retimers- DRS+ LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis- Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1- EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest- Retimer- 2Retimers- CrosslinkRes: unsupported, DRS- DownstreamComp: Link Up - Present Capabilities: [100 v1] Advanced Error Reporting UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol- CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr- CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn- 
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap- HeaderLog: 00000000 00000000 00000000 00000000 RootCmd: CERptEn+ NFERptEn+ FERptEn+ RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd- FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0 ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0000 Capabilities: [160 v1] Virtual Channel Caps: LPEVC=0 RefClk=100ns PATEntryBits=1 Arb: Fixed- WRR32- WRR64- WRR128- Ctrl: ArbSelect=Fixed Status: InProgress- VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans- Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256- Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=ff Status: NegoPending- InProgress- Capabilities: [180 v1] Vendor Specific Information: ID=0000 Rev=0 Len=028 \u0026lt;?\u0026gt; Capabilities: [240 v1] L1 PM Substates L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+ PortCommonModeRestoreTime=8us PortTPowerOnTime=10us L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1- T_CommonMode=1us LTR1.2_Threshold=0ns L1SubCtl2: T_PwrOn=10us Capabilities: [300 v1] Secondary PCI Express LnkCtl3: LnkEquIntrruptEn- PerformEqu- LaneErrStat: 0 Kernel driver in use: pcieport FPGA board:\n0000:01:00.0 Ethernet controller: Device dabc:1017 Subsystem: Red Hat, Inc. 
Device a001 Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast \u0026gt;TAbort- \u0026lt;TAbort- \u0026lt;MAbort- \u0026gt;SERR- \u0026lt;PERR- INTx- Region 0: Memory at 1820000000 (64-bit, prefetchable) [disabled] [size=2K] Region 2: Memory at 1800000000 (64-bit, prefetchable) [disabled] [size=512M] Capabilities: [40] Power Management version 3 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME- Capabilities: [70] Express (v2) Endpoint, MSI 00 DevCap: MaxPayload 1024 bytes, PhantFunc 0, Latency L0s \u0026lt;64ns, L1 \u0026lt;1us ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0W DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+ RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ MaxPayload 512 bytes, MaxReadReq 512 bytes DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend- LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM not supported ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+ LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk- ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 5GT/s (downgraded), Width x1 (downgraded) TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- DevCap2: Completion Timeout: Range BC, TimeoutDis+ NROPrPrP- LTR- 10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix- EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit- FRS- TPHComp- ExtTPHComp- AtomicOpsCap: 32bit- 64bit- 128bitCAS- DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled, AtomicOpsCtl: ReqEn- LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS- LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis- Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot LnkSta2: Current De-emphasis 
Level: -6dB, EqualizationComplete- EqualizationPhase1- EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest- Retimer- 2Retimers- CrosslinkRes: unsupported Capabilities: [100 v1] Advanced Error Reporting UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol- CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr- CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn- MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap- HeaderLog: 00000000 00000000 00000000 00000000 Capabilities: [1c0 v1] Secondary PCI Express LnkCtl3: LnkEquIntrruptEn- PerformEqu- LaneErrStat: 0 For our board, the following lines are particularly interesting:\nLnkCap: Port #0, Speed 8GT/s, Width x8, ASPM not supported ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+ LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk- ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 5GT/s (downgraded), Width x1 (downgraded) The LnkCap tells us about the full capabilities of this PCIe device; here we can see that the current design supports PCIe Gen 3.0 x8. The LnkSta tells us the current configuration; here we have been downgraded to PCIe Gen 2.0 at 5GT/s with a width of only x1.\nDuring startup or when a new PCIe device is plugged in, PCIe performs a link speed and width negotiation where it tries to reach the highest supported stable configuration for the current system. 
In our current system, though our FPGA is capable of 8GT/s, as it is located downstream of the Broadcom bridge with a maximum link capacity of Gen 2.0 ( 5GT/s ), the FPGA has been downgraded to 5GT/s.\nAs for the width of x1, that is expected since the Broadcom bridge is also only x1 wide, and our board’s other 7 PCIe lanes are literally hanging over the side.\n7 PCIe lanes left unconnected and hanging over the air Thus, we can finally confirm that this is our board and that the PCIe interface is working. We can now proceed to establishing the JTAG connection.\nJTAG interface # Xilinx FPGAs can be configured by writing a bitstream to their internal CMOS Configuration Latches (CCL). The CCLs are SRAM-based and volatile, thus the configuration is re-done on every power cycle. For devices in the field this bitstream would be read from an external SPI memory during initialization, or written from an external device, such as an embedded controller. But for development purposes overwriting the contents of the CCLs over JTAG is acceptable.\nThis configuration is done by shifting the entire FPGA bitstream into the device’s configuration logic over the JTAG bus.\nFPGA board JTAG interface # As promised by the original eBay listing the board did come with an accessible JTAG interface, and gloriously enough, this time there wasn\u0026rsquo;t even the need for any additional soldering.\nView of the JTAG interface on the PCB In addition to a power reference and ground, in conformance with the Xilinx JTAG interface, it featured the four mandatory signals comprising the JTAG TAP :\nTCK Test Clock TMS Test Mode Select TDI Test Data Input TDO Test Data Output Of note, the JTAG interface can also come with an independent reset signal. 
But since Xilinx JTAG interfaces do not have this independent reset signal, we will be using the JTAG FSM reset state for our reset signal.\n6 pin board JTAG interface This interface doesn\u0026rsquo;t follow a standard layout so I cannot just plug in one of my debug probes; it requires some re-wiring.\nSegger JLINK :heart: # I do not own an AMD approved JTAG programmer.\nTraditionally speaking, the Segger JLink is used for debugging embedded CPUs, whether they be standalone or in a Zynq, and not for configuring FPGAs.\nThat said, all we need to do is use JTAG to shift in a bitstream to the CCLs, so technically speaking any programmable device with 4 sufficiently fast GPIOs can be used as a JTAG programmer. Additionally, the JLink is well supported by OpenOCD, the JLink\u0026rsquo;s libraries are open source, and I happened to own one.\nNote : I could also have used a USB Blaster, which considering it is literally an Altera tool would have made it hilarious.\n[Update] Note: Michał Hęćka has gotten the board flashing with an Altera USB Blaster.\n20 pin segger JLink pinout Wiring # Rewiring :\nWiring diagram to connect JLink JTAG probe to the board. JTAG is a synchronous protocol where TDI and TMS are captured according to TCK. Because of this, good JTAG PCB trace length matching is advised in order to minimize skew.\nTiming Waveform for JTAG Signals (From Target Device Perspective); source : https://www.intel.com/content/www/us/en/docs/programmable/683719/current/jtag-timing-constraints-and-waveforms.html Ideally, a custom connector with length matched traces would be used as an interface between the JLink\u0026rsquo;s probe and a board specific connector.\nFar from length matched JTAG connections Yet, here we are shoving breadboard wires between our debugger and the board. Since OpenOCD allows us to easily control the debugger clock speed, we can increase the skew tolerance by slowing down the TCK clock signal. 
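To put rough numbers on that tradeoff, compare the TCK period against a pessimistic skew estimate for mismatched jumper wires. The figures below are back-of-the-envelope assumptions (roughly 50 ps of propagation delay per cm of wire), not measurements:

```python
# Estimate how much of the TCK period a given trace-length mismatch eats up.
# Assumption: ~50 ps of propagation delay per cm of ordinary jumper wire.
def skew_budget(tck_hz, mismatch_cm, ps_per_cm=50):
    period_ns = 1e9 / tck_hz
    skew_ns = mismatch_cm * ps_per_cm / 1000.0
    return period_ns, skew_ns, skew_ns / period_ns

for freq in (1e6, 10e6, 100e6):
    period_ns, skew_ns, ratio = skew_budget(freq, mismatch_cm=10)
    print(f"TCK {freq/1e6:5.0f} MHz: period {period_ns:7.1f} ns, "
          f"skew {skew_ns:.1f} ns ({ratio:.2%} of period)")
```

At 1MHz the period is 1000ns, so even 10cm of mismatch (about 0.5ns) is noise; at 100MHz the same mismatch is already 5% of the period, before even accounting for setup and hold margins.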
As such there is no immediate need for a custom connector, but we will not be able to reach the maximum JTAG speeds.\nIf no clock speed is specified OpenOCD sets the clock speed at 100MHz. This is too high in our case. As such, later in the article, I will be setting the JTAG clock down to 1MHz for probing and reset; programming will be done at 10MHz.\nNo issues were encountered at these speeds. OpenOCD # OpenOCD is a free and open source on-chip debugger software that aims to be compatible with as many probes, boards and chips as possible.\nSince OpenOCD has support for the standard SVF file format, my plan for the flashing flow will be to use Vivado to generate the SVF and have OpenOCD flash it. Now, some of you might be starting to notice that I am diverging quite far from the well lit path of officially supported tools. Not only am I using a debug probe that is not officially supported, but I am also using some obscure open source software with questionable support for interfacing with Xilinx UltraScale+ FPGAs. You might be wondering, given that the officially supported tools can already prove themselves to be a headache to get working properly, why am I seemingly making my life even harder?\nThe reason is quite simple: when things inevitably start going wrong, as they will, having an entirely open toolchain allows me more visibility into what is going on and the ability to fix it. 
I cannot delve into a black box.\nBuilding OpenOCD # By default the version of OpenOCD that I got on my server via the official package manager was outdated and missing features I will need.\nAlso, since the ability to modify OpenOCD\u0026rsquo;s source code could come in handy, I decided to build it from source.\nThus, in the following logs, I will be running OpenOCD version 0.12.0+dev-02170-gfcff4b712.\nNote : I have also re-built the JLink libs from source.\nDetermining the scan chain # Since I do not have the schematics for the board I do not know how many devices are daisy-chained on the board\u0026rsquo;s JTAG bus. Also, I want to confirm that the FPGA from the eBay listing is actually the one on the board. In JTAG, each chained device exposes an accessible IDCODE register used to identify the manufacturer, device type, and revision number.\nWhen setting up the JTAG server, we typically define the scan chain by specifying the expected IDCODE for each TAP and the corresponding instruction register length, so that instructions can be correctly aligned and routed to the intended device. Given this is an undocumented board off eBay, I do not know what the chain looks like. 
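OpenOCD's actual autoprobing logic is more involved, but the classic trick this kind of blind interrogation builds on is simple: put every TAP in BYPASS, where each device contributes exactly one flip-flop between TDI and TDO, then shift a marker bit in and count clocks until it falls out. A toy Python model of that idea (purely illustrative, not OpenOCD code):

```python
# Each TAP in BYPASS is a single flip-flop between TDI and TDO, so the number
# of clocks needed for a marker bit to traverse the chain equals the number of
# devices. Model the chain as a shift register of bypass bits.
def count_devices(n_devices):
    chain = [0] * n_devices            # all bypass registers flushed to 0
    tdo_history = []
    tdi_stream = [1] + [0] * 31        # shift in a single marker bit, then 0s
    for tdi in tdi_stream:
        tdo_history.append(chain[-1])  # bit falling out of the last TAP
        chain = [tdi] + chain[:-1]     # shift one position per TCK
    return tdo_history.index(1)        # clocks until the marker reappears

print(count_devices(1))  # a single-device chain -> 1
print(count_devices(4))  # a 4-device chain -> 4
```

On a real probe the same idea is just repeated DR scans; the simulation only serves to show why the marker's delay equals the device count.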
Fortunately, OpenOCD has an autoprobing functionality that does a blind interrogation in an attempt to discover the attached devices.\nThus, my first order of business was doing this autoprobing.\nIn OpenOCD the autoprobing is done when the configuration does not specify any taps.\nsource [find interface/jlink.cfg] transport select jtag set SPEED 1 jtag_rclk $SPEED adapter speed $SPEED reset_config none The blind interrogation successfully discovered a single device on the chain with an IDCODE of 0x04a63093.\ngp@workhorse:~/tools/openocd_jlink_test/autoprob$ openocd Open On-Chip Debugger 0.12.0+dev-02170-gfcff4b712 (2025-09-04-21:02) Licensed under GNU GPL v2 For bug reports, read http://openocd.org/doc/doxygen/bugs.html none separate Info : Listening on port 6666 for tcl connections Info : Listening on port 4444 for telnet connections Info : J-Link V10 compiled Jan 30 2023 11:28:07 Info : Hardware version: 10.10 Info : VTarget = 1.812 V Info : clock speed 1 kHz Warn : There are no enabled taps. AUTO PROBING MIGHT NOT WORK!! Info : JTAG tap: auto0.tap tap/device found: 0x04a63093 (mfg: 0x049 (Xilinx), part: 0x4a63, ver: 0x0) Warn : AUTO auto0.tap - use \u0026#34;jtag newtap auto0 tap -irlen 2 -expected-id 0x04a63093\u0026#34; Error: IR capture error at bit 2, saw 0x3ffffffffffffff5 not 0x...3 Warn : Bypassing JTAG setup events due to errors Warn : gdb services need one or more targets defined Comparing against the UltraScale Architecture Configuration User Guide (UG570), we see that this IDCODE matches up precisely with the expected value for the KU3P.\nJTAG and IDCODE for UltraScale Architecture-based FPGAs By default OpenOCD assumes a JTAG IR length of 2 bits, while our FPGA has an IR length of 6 bits. This is the cause behind the IR capture error encountered during autoprobing. 
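The IDCODE fields OpenOCD printed can also be unpacked by hand; per the IEEE 1149.1 layout, an IDCODE is 4 bits of version, 16 bits of part number, 11 bits of manufacturer id, and a fixed 1 in the least significant bit. A quick Python sanity check against the value from the log:

```python
# Unpack a 32-bit JTAG IDCODE per the IEEE 1149.1 field layout.
def decode_idcode(idcode):
    assert idcode & 1 == 1, "LSB of a valid IDCODE is always 1"
    version = (idcode >> 28) & 0xF
    part = (idcode >> 12) & 0xFFFF
    manufacturer = (idcode >> 1) & 0x7FF
    return version, part, manufacturer

ver, part, mfg = decode_idcode(0x04A63093)
print(f"mfg: 0x{mfg:03x} part: 0x{part:04x} ver: 0x{ver:x}")
# matches OpenOCD's report: mfg 0x049 (Xilinx), part 0x4a63, ver 0x0
```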
By updating the script with an IR length of 6 bits we can re-detect the FPGA with no errors.\nsource [find interface/jlink.cfg] transport select jtag set SPEED 1 jtag_rclk $SPEED adapter speed $SPEED reset_config none jtag newtap auto_detect tap -irlen 6 Output :\ngp@workhorse:~/tools/openocd_jlink_test/autoprob$ openocd Open On-Chip Debugger 0.12.0+dev-02170-gfcff4b712 (2025-09-04-21:02) Licensed under GNU GPL v2 For bug reports, read http://openocd.org/doc/doxygen/bugs.html Info : Listening on port 6666 for tcl connections Info : Listening on port 4444 for telnet connections Info : J-Link V10 compiled Jan 30 2023 11:28:07 Info : Hardware version: 10.10 Info : VTarget = 1.812 V Info : clock speed 1 kHz Info : JTAG tap: auto_detect.tap tap/device found: 0x04a63093 (mfg: 0x049 (Xilinx), part: 0x4a63, ver: 0x0) Warn : gdb services need one or more targets defined Based on the probing, this is the JTAG scan chain for our board :\nJTAG scan chain for the alibaba cloud FPGA System Monitor Registers # Previous generations of Xilinx FPGA had a system called the XADC that, among other features, allowed you to acquire chip temperature and voltage readings. The newer UltraScale and UltraScale+ families have deprecated this XADC module in favor of the SYSMON (and SYSMON4), which allows you to also get these temperature readings, just better.\nUnfortunately, openOCD didn\u0026rsquo;t have support for reading the SYSMON over JTAG out of the box, so I will be adding it.\nTo be more precise, the Kintex UltraScale+ has a SYSMON4 and not a SYSMON. For full context, there are 3 flavors of SYSMON:\nSYSMON1 used in the Kintex and Virtex UltraScale series SYSMON4 used in the Kintex, Virtex and in the Zynq programmable logic for the UltraScale+ series SYSMON used in the Zynq in the processing system of the UltraScale+ series. Yes, you read that correctly: the Zynq of the UltraScale+ series features not one, but at least two unique SYSMON instances. 
For the purpose of this article, all these instances are similar enough that I will be using the terms SYSMON4 and SYSMON interchangeably.\nIn order to interact with the SYSMON over JTAG, we first need to write the SYSMON_DRP command to the JTAG Instruction Register (IR). Based on the documentation, we see that this command has a value of 0x37, which, funnily enough, is the same command code as the XADC\u0026rsquo;s, solidifying the SYSMON as the XADC\u0026rsquo;s descendant.\nThe SYSMON offers many more functionalities than just voltage and temperature readout, but for today\u0026rsquo;s use case we will not be using any of that. Rather, we will focus only on reading a subset of the SYSMON status registers.\nThese status registers are located at addresses (00h-3Fh, 80h-BFh), and contain the measurement results of the analog-to-digital conversions, the flag registers, and the calibration coefficients. We can select which address we wish to read by writing the address to the Data Register (DR) over JTAG and the data will be read out of TDO.\n# SPDX-License-Identifier: GPL-2.0-or-later # Xilinx SYSMON4 support # # Based on UG580, used for UltraScale+ Xilinx FPGA # This code implements access through the JTAG TAP. # # build a 32 bit DRP command for the SYSMON DRP proc sysmon_cmd {cmd addr data} { array set cmds { NOP 0x00 READ 0x01 WRITE 0x02 } return [expr {($cmds($cmd) \u0026lt;\u0026lt; 26) | ($addr \u0026lt;\u0026lt; 16) | ($data \u0026lt;\u0026lt; 0)}] } # Status register addresses # Some addresses (status registers 0-3) have special functions when written to. 
proc SYSMON {key} { array set addrs { TEMP 0x00 VCCINT 0x01 VCCAUX 0x02 VPVN 0x03 VREFP 0x04 VREFN 0x05 VCCBRAM 0x06 SUPAOFFS 0x08 ADCAOFFS 0x09 ADCAGAIN 0x0a VCCPINTLP 0x0d VCCPINTFP 0x0e VCCPAUX 0x0f VAUX0 0x10 VAUX1 0x11 VAUX2 0x12 VAUX3 0x13 VAUX4 0x14 VAUX5 0x15 VAUX6 0x16 VAUX7 0x17 VAUX8 0x18 VAUX9 0x19 VAUX10 0x1a VAUX11 0x1b VAUX12 0x1c VAUX13 0x1d VAUX14 0x1e VAUX15 0x1f MAXTEMP 0x20 MAXVCC 0x21 MAXVCCAUX 0x22 } return $addrs($key) } # transfer proc sysmon_xfer {tap cmd addr data} { set ret [drscan $tap 32 [sysmon_cmd $cmd $addr $data]] runtest 10 return [expr \u0026#34;0x$ret\u0026#34;] } # sysmon register write proc sysmon_write {tap addr data} { sysmon_xfer $tap WRITE $addr $data } # sysmon register read, non-pipelined proc sysmon_read {tap addr} { sysmon_xfer $tap READ $addr 0 return [sysmon_xfer $tap NOP 0 0] } # Select the sysmon DR, SYSMON_DRP has the same binary code value as the XADC proc sysmon_select {tap} { set SYSMON_IR 0x37 irscan $tap $SYSMON_IR runtest 10 } # convert 16 bit temperature measurement to Celsius proc sysmon_temp_internal {code} { return [expr {$code * 509.314/(1 \u0026lt;\u0026lt; 16) - 280.23}] } # convert 16 bit supply voltage measurements to Volt proc sysmon_sup {code} { return [expr {$code * 3./(1 \u0026lt;\u0026lt; 16)}] } # measure all internal voltages proc sysmon_report {tap} { puts \u0026#34;Sysmon status report :\u0026#34; sysmon_select $tap foreach ch [list TEMP MAXTEMP] { echo \u0026#34;$ch [format %.2f [sysmon_temp_internal [sysmon_read $tap [SYSMON $ch]]]] C\u0026#34; } foreach ch [list VCCINT MAXVCC VCCAUX MAXVCCAUX] { echo \u0026#34;$ch [format %.3f [sysmon_sup [sysmon_read $tap [SYSMON $ch]]]] V\u0026#34;\t} } I added to my flashing script output a report that reads the current chip temperature and the internal and external voltages, as well as the maximum values recorded for these since the FPGA\u0026rsquo;s last power cycle:\ngp@workhorse:~/tools/openocd_jlink_test$ openocd Open On-Chip Debugger 0.12.0+dev-02170-gfcff4b712 
(2025-09-04-20:02) Licensed under GNU GPL v2 For bug reports, read http://openocd.org/doc/doxygen/bugs.html set chipname XCKU3P Read temperature sysmon 4 Info : J-Link V10 compiled Jan 30 2023 11:28:07 Info : Hardware version: 10.10 Info : VTarget = 1.819 V Info : clock speed 1 kHz Info : JTAG tap: XCKU3P.tap tap/device found: 0x04a63093 (mfg: 0x049 (Xilinx), part: 0x4a63, ver: 0x0) Warn : gdb services need one or more targets defined -------------------- Sysmon status report : TEMP 31.12 C MAXTEMP 34.62 C VCCINT 0.852 V MAXVCC 0.855 V VCCAUX 1.805 V MAXVCCAUX 1.807 V Pinout # To my indescribable joy I happened to stumble onto this gold mine, in which we get the board pinout. This most likely fell off a truck: https://blog.csdn.net/qq_37650251/article/details/145716953\nThe below version has been corrected based on work from Alex Forencich (link to his work) and some of my own opinions.\nAlex drives the LEDs at 3.3V while I drive them at only 1.8V, so if you are seeing issues with your LED behavior consider switching them over to 3.3V.\nPin Index Name IO Standard Location Bank Notes 0 diff_100mhz_clk_p LVDS E18 BANK67 - 1 diff_100mhz_clk_n LVDS D18 BANK67 - 2 sfp_mgt_clk_p LVDS K7 BANK227 3 sfp_mgt_clk_n LVDS K6 BANK227 4 sfp_1_txn - B6 BANK227 5 sfp_1_txp - B7 BANK227 6 sfp_1_rxn - A3 BANK227 7 sfp_1_rxp - A4 BANK227 8 sfp_2_txn - D6 BANK227 9 sfp_2_txp - D7 BANK227 10 sfp_2_rxn - B1 BANK227 11 sfp_2_rxp - B2 BANK227 12 SFP_1_MOD_DEF_0 LVCMOS33 D14 BANK87 PULLUP true 13 SFP_1_TX_FAULT LVCMOS33 B14 BANK87 PULLUP true 14 SFP_1_LOS LVCMOS33 D13 BANK87 PULLUP true 15 SFP_1_LED LVCMOS18 B12 BANK87 SLEW SLOW DRIVE 12 16 SFP_2_MOD_DEF_0 LVCMOS33 E11 BANK86 PULLUP true 17 SFP_2_TX_FAULT LVCMOS33 F9 BANK86 PULLUP true 18 SFP_2_LOS LVCMOS33 E10 BANK86 PULLUP true 19 SFP_2_LED LVCMOS18 C12 BANK87 SLEW SLOW DRIVE 12 20 IIC_SDA_SFP_1 LVCMOS33 C14 BANK87 SLEW SLOW DRIVE 12 PULLUP true 21 IIC_SCL_SFP_1 LVCMOS33 C13 BANK87 SLEW SLOW DRIVE 12 PULLUP true 22 IIC_SDA_SFP_2 
LVCMOS33 D11 BANK86 SLEW SLOW DRIVE 12 PULLUP true 23 IIC_SCL_SFP_2 LVCMOS33 D10 BANK86 SLEW SLOW DRIVE 12 PULLUP true 24 IIC_SDA_EEPROM_0 LVCMOS33 G10 BANK86 SLEW SLOW DRIVE 12 PULLUP true 25 IIC_SCL_EEPROM_0 LVCMOS33 G9 BANK86 SLEW SLOW DRIVE 12 PULLUP true 26 IIC_SDA_EEPROM_1 LVCMOS33 J15 BANK87 SLEW SLOW DRIVE 12 PULLUP true 27 IIC_SCL_EEPROM_1 LVCMOS33 J14 BANK87 SLEW SLOW DRIVE 12 PULLUP true 28 GPIO_LED_R LVCMOS18 A13 BANK87 SLEW SLOW DRIVE 12 29 GPIO_LED_G LVCMOS18 A12 BANK87 SLEW SLOW DRIVE 12 30 GPIO_LED_H LVCMOS18 B9 BANK86 SLEW SLOW DRIVE 12 31 GPIO_LED_1 LVCMOS18 B11 BANK86 SLEW SLOW DRIVE 12 32 GPIO_LED_2 LVCMOS18 C11 BANK86 SLEW SLOW DRIVE 12 33 GPIO_LED_3 LVCMOS18 A10 BANK86 SLEW SLOW DRIVE 12 34 GPIO_LED_4 LVCMOS18 B10 BANK86 SLEW SLOW DRIVE 12 35 pcie_mgt_clkn - T6 BANK225 36 pcie_mgt_clkp - T7 BANK225 37 pcie_tx0_n - R4 BANK225 38 pcie_tx1_n - U4 BANK225 39 pcie_tx2_n - W4 BANK225 40 pcie_tx3_n - AA4 BANK225 41 pcie_tx4_n - AC4 BANK224 42 pcie_tx5_n - AD6 BANK224 43 pcie_tx6_n - AE8 BANK224 44 pcie_tx7_n - AF6 BANK224 45 pcie_rx0_n - P1 BANK225 46 pcie_rx1_n - T1 BANK225 47 pcie_rx2_n - V1 BANK225 48 pcie_rx3_n - Y1 BANK225 49 pcie_rx4_n - AB1 BANK224 50 pcie_rx5_n - AD1 BANK224 51 pcie_rx6_n - AE3 BANK224 52 pcie_rx7_n - AF1 BANK224 53 pcie_tx0_p - R5 BANK225 54 pcie_tx1_p - U5 BANK225 55 pcie_tx2_p - W5 BANK225 56 pcie_tx3_p - AA5 BANK225 57 pcie_tx4_p - AC5 BANK224 58 pcie_tx5_p - AD7 BANK224 59 pcie_tx6_p - AE9 BANK224 60 pcie_tx7_p - AF7 BANK224 61 pcie_rx0_p - P2 BANK225 62 pcie_rx1_p - T2 BANK225 63 pcie_rx2_p - V2 BANK225 64 pcie_rx3_p - Y2 BANK225 65 pcie_rx4_p - AB2 BANK224 66 pcie_rx5_p - AD2 BANK224 67 pcie_rx6_p - AE4 BANK224 68 pcie_rx7_p - AF2 BANK224 69 pcie_perstn_rst LVCMOS33 A9 BANK86 PULLUP true Global clock # On high-end FPGAs like the UltraScale+ family, high-speed global clocks are typically driven from external sources using differential pairs for better signal integrity.\nAccording to the pinout we have two such 
differential pairs.\nFirst I must determine the nature of these external reference clocks to see how I can use them to drive my clocks.\nThese differential pairs are provided over the following pins:\n100MHz : {E18, D18} 156.25MHz : {K7, K6} Judging by the naming and the frequencies, the 156.25MHz clock is likely my SFP reference clock, and the 100MHz can be used as my global clock.\nWe can confirm by querying the pin properties.\nK6 properties :\nVivado% report_property [get_package_pins K6] Property Type Read-only Value BANK string true 227 BUFIO_2_REGION string true TR CLASS string true package_pin DIFF_PAIR_PIN string true K7 IS_BONDED bool true 1 IS_DIFFERENTIAL bool true 1 IS_GENERAL_PURPOSE bool true 0 IS_GLOBAL_CLK bool true 0 IS_LOW_CAP bool true 0 IS_MASTER bool true 0 IS_VREF bool true 0 IS_VRN bool true 0 IS_VRP bool true 0 MAX_DELAY int true 38764 MIN_DELAY int true 38378 NAME string true K6 PIN_FUNC enum true MGTREFCLK0N_227 PIN_FUNC_COUNT int true 1 PKGPIN_BYTEGROUP_INDEX int true 0 PKGPIN_NIBBLE_INDEX int true 0 E18 properties :\nVivado% report_property [get_package_pins E18] Property Type Read-only Value BANK string true 67 BUFIO_2_REGION string true TL CLASS string true package_pin DIFF_PAIR_PIN string true D18 IS_BONDED bool true 1 IS_DIFFERENTIAL bool true 1 IS_GENERAL_PURPOSE bool true 1 IS_GLOBAL_CLK bool true 1 IS_LOW_CAP bool true 0 IS_MASTER bool true 1 IS_VREF bool true 0 IS_VRN bool true 0 IS_VRP bool true 0 MAX_DELAY int true 87126 MIN_DELAY int true 86259 NAME string true E18 PIN_FUNC enum true IO_L11P_T1U_N8_GC_67 PIN_FUNC_COUNT int true 2 PKGPIN_BYTEGROUP_INDEX int true 8 PKGPIN_NIBBLE_INDEX int true 2 This tells us:\nThe differential pairings are correct: {K6, K7}, {E18, D18} We can easily use the 100MHz as a source to drive our global clocking network The 156.25MHz clock is to be used as the reference clock for our GTY transceivers and lands on bank 227 as indicated by the PIN_FUNC property MGTREFCLK0N_227 We cannot directly use the 
156.25MHz clock to drive our global clock network. With all this we have sufficient information to write a constraint file (xdc) for this board.\nTest design # Further sections will be using the following design files.\ntop.v:\nmodule top ( input wire Clk_100mhz_p_i, input wire Clk_100mhz_n_i, output wire [3:0] Led_o ); wire clk_ibuf; wire clk; reg [28:0] ctr_q; reg unused_ctr_q; IBUFDS #( .DIFF_TERM(\u0026#34;TRUE\u0026#34;), .IOSTANDARD(\u0026#34;LVDS\u0026#34;) ) m_ibufds ( .I(Clk_100mhz_p_i), .IB(Clk_100mhz_n_i), .O(clk_ibuf) ); BUFG m_bufg ( .I(clk_ibuf), .O(clk) ); always @(posedge clk) { unused_ctr_q, ctr_q } \u0026lt;= ctr_q + 29\u0026#39;b1; assign Led_o = ctr_q[28:25]; endmodule alibaba_cloud.xdc :\n# Global clock signal set_property -dict {LOC E18 IOSTANDARD LVDS} [get_ports Clk_100mhz_p_i] set_property -dict {LOC D18 IOSTANDARD LVDS} [get_ports Clk_100mhz_n_i] create_clock -period 10 -name clk_100mhz [get_ports Clk_100mhz_p_i] # LEDS set_property -dict {LOC B11 IOSTANDARD LVCMOS18} [get_ports { Led_o[0]}] set_property -dict {LOC C11 IOSTANDARD LVCMOS18} [get_ports { Led_o[1]}] set_property -dict {LOC A10 IOSTANDARD LVCMOS18} [get_ports { Led_o[2]}] set_property -dict {LOC B10 IOSTANDARD LVCMOS18} [get_ports { Led_o[3]}] Writing the bitstream # My personal belief is that one of the most important contributors to design quality is iteration cost. The lower your iteration cost, the higher your design quality is going to be.\nAs such, I will invest a small upfront cost to make the workflow as streamlined as feasible.\nThus, my workflow evolved into doing practically everything over the command line and only interacting with the tools, Vivado in this case, through tcl scripts.\nVivado flow # The goal of this flow is to, given a few Verilog design and constraint files, produce an SVF file. 
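As a quick sanity check of the counter width chosen in top.v, bit N of a free-running counter clocked at f toggles at f / 2^(N+1). A small Python sketch (illustrative only, not part of the design files) gives the blink rates of the bits wired to Led_o:

```python
# Toggle frequency (Hz) of bit `bit` of a free-running counter clocked at f_clk Hz.
def toggle_hz(f_clk: float, bit: int) -> float:
    return f_clk / (2 ** (bit + 1))

# Led_o = ctr_q[28:25], clocked at 100 MHz:
for bit in range(25, 29):
    print(f"ctr_q[{bit}] toggles at {toggle_hz(100e6, bit):.2f} Hz")
# ctr_q[25] comes out around 1.49 Hz and ctr_q[28] around 0.19 Hz,
# so all four LEDs blink at comfortably visible rates.
```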
Our steps are :\ncreate the Vivado project setup.tcl run the implementation build.tcl generate the bitstream and the SVF gen.tcl I will be using make to kick off and manage the dependencies between the different steps, though I recognise this isn\u0026rsquo;t a widespread practice for hardware projects. make is a highly flexible, reliable and powerful tool and I believe its ability to tie together any type of workflow makes it a prime tool for this use case.\nWe will be invoking Vivado in batch mode, which allows us to provide a tcl script alongside script arguments; the format is as follows :\nvivado -mode batch \u0026lt;path to tcl script\u0026gt; -tclargs \u0026lt;script args\u0026gt; Though this allows us to easily break down our flow into incremental stages, invoking a single script in batch mode has the drawback of restarting Vivado and needing to re-load the project or the project checkpoint on each invocation.\nAs the project size grows so will the project load time, so segmenting the flow into a large number of independent scripts comes at an increasing cost.\nMakefile :\nSHELL := /bin/bash VIVADO_PRJ_DIR=prj VIVADO_PRJ_NAME=$(VIVADO_PRJ_DIR) VIVADO_PRJ_PATH=$(VIVADO_PRJ_DIR)/$(VIVADO_PRJ_NAME).xpr VIVADO_CHECKPOINT_PATH=$(VIVADO_PRJ_DIR)/$(VIVADO_PRJ_NAME)_checkpoint.dcp VIVADO_CMD=vivado -mode batch -source SRC_PATH=src OUT_DIR=out all: setup build gen $(VIVADO_PRJ_PATH): mkdir -p $(VIVADO_PRJ_DIR) $(VIVADO_CMD) setup.tcl -tclargs $(VIVADO_PRJ_DIR) $(VIVADO_PRJ_NAME) setup: $(VIVADO_PRJ_PATH) $(VIVADO_CHECKPOINT_PATH): $(VIVADO_PRJ_PATH) $(wildcard $(SRC_PATH)/*.xdc) $(wildcard $(SRC_PATH)/*.v) $(VIVADO_CMD) build.tcl -tclargs $(VIVADO_PRJ_PATH) $(SRC_PATH) $(VIVADO_CHECKPOINT_PATH) build: $(VIVADO_CHECKPOINT_PATH) $(OUT_DIR)/$(VIVADO_PRJ_NAME).svf: $(VIVADO_CHECKPOINT_PATH) mkdir -p $(OUT_DIR) $(VIVADO_CMD) gen.tcl -tclargs $(VIVADO_CHECKPOINT_PATH) $(OUT_DIR) gen: $(OUT_DIR)/$(VIVADO_PRJ_NAME).svf flash: $(OUT_DIR)/$(VIVADO_PRJ_NAME).svf 
openocd\tclean: rm -rf $(VIVADO_PRJ_DIR) rm -rf $(OUT_DIR) rm -f vivado*{log,jou} rm -f webtalk*{log,jou} rm -f usage_statistics_webtalk*{html,xml} setup.tcl :\nset project_dir [lindex $argv 0] set project_name [lindex $argv 1] puts \u0026#34;Creating project $project_name at path [pwd]/$project_dir\u0026#34; create_project -part xcku3p-ffvb676-2-e -force $project_name $project_dir close_project exit 0 build.tcl :\nset project_path [lindex $argv 0] set src_path [lindex $argv 1] set checkpoint_path [lindex $argv 2] puts \u0026#34;Implementation script called with project path $project_path and src path $src_path, generating checkpoint at $checkpoint_path\u0026#34; open_project $project_path # load src read_verilog [glob -directory $src_path *.v] read_xdc [glob -directory $src_path *.xdc] # synth synth_design -top top # implement opt_design place_design route_design phys_opt_design write_checkpoint $checkpoint_path -force close_project exit 0 Generating the SVF file # SVF, short for Serial Vector Format, is a human-readable, vendor-agnostic format used to specify JTAG bus operations.\nExample SVF file, test program:\n! Initialize UUT STATE RESET; ! End IR scans in DRPAUSE ENDIR DRPAUSE; ! End DR scans in DRPAUSE ENDDR DRPAUSE; ! 24 bit IR header HIR 24 TDI (FFFFFF); ! 3 bit DR header HDR 3 TDI (7); ! 16 bit IR trailer TIR 16 TDI (FFFF); ! 2 bit DR trailer TDR 2 TDI (3); ! 8 bit IR scan, load BIST opcode SIR 8 TDI (41) TDO (81) MASK (FF); ! 16 bit DR scan, load BIST seed SDR 16 TDI (ABCD); ! RUNBIST for 95 TCK Clocks RUNTEST 95 TCK ENDSTATE IRPAUSE; ! 16 bit DR scan, check BIST status SDR 16 TDI (0000) TDO(1234) MASK(FFFF); ! Enter Test-Logic-Reset STATE RESET; ! 
End Test Program Vivado can generate a hardware-aware SVF file containing the configuration sequence for an FPGA board, allowing us to write a bitstream.\nGiven the SVF file literally contains the bitstream written out in plain hexadecimal, our first step is to generate our design\u0026rsquo;s bitstream.\nVivado proper isn\u0026rsquo;t the software that generates the SVF file; this task is done by the hardware manager, which handles all of the configuration.\nWe can launch a new instance with open_hw_manager and connect to it with connect_hw_server. Since JTAG is a daisy chained bus, and given the SVF file is just a standardised way of specifying JTAG bus operations, in order to generate a correct JTAG configuration sequence, we must inform the hardware manager of our scan chain.\nDuring our earlier probing of the scan chain, we have established that our FPGA is the only device on the chain. We inform the hardware manager of this by creating a new device configuration (the term \u0026ldquo;device\u0026rdquo; refers to the \u0026ldquo;board\u0026rdquo; here) and add our FPGA to the chain using the create_hw_device -part \u0026lt;device name\u0026gt; command. When we have multiple devices we should register them following the order in which they appear on the chain.\nFinally, to generate the SVF file, we must select the device we wish to program with program_hw_device \u0026lt;hw_device\u0026gt;, then write out the SVF to the file using write_hw_svf \u0026lt;path to svf file\u0026gt;.\ngen.tcl:\nset checkpoint_path [lindex $argv 0] set out_dir [lindex $argv 1] puts \u0026#34;SVF generation script called with checkpoint path $checkpoint_path, generating to $out_dir\u0026#34; open_checkpoint $checkpoint_path # defines set hw_target \u0026#34;alibaba_board_svf_target\u0026#34; set fpga_device \u0026#34;xcku3p\u0026#34; set bin_path \u0026#34;$out_dir/[current_project]\u0026#34; write_bitstream \u0026#34;$bin_path.bit\u0026#34; -force open_hw_manager # connect to hw server with 
default config connect_hw_server puts \u0026#34;connected to hw server at [current_hw_server]\u0026#34; create_hw_target $hw_target puts \u0026#34;current hw target [current_hw_target]\u0026#34; open_hw_target # single device on scan chain create_hw_device -part $fpga_device puts \u0026#34;scan chain : [get_hw_devices]\u0026#34; set_property PROGRAM.FILE \u0026#34;$bin_path.bit\u0026#34; [get_hw_device] #select device to program program_hw_device [get_hw_device] # generate svf file write_hw_svf -force \u0026#34;$bin_path.svf\u0026#34; close_hw_manager exit 0 Configuring the FPGA using OpenOCD # Although not widely known, OpenOCD has a very nice SVF execution command :\n18.1 SVF: Serial Vector Format # The Serial Vector Format, better known as SVF, is a way to represent JTAG test patterns in text files. In a debug session using JTAG for its transport protocol, OpenOCD supports running such test files.\n[Command]svf filename [-tap tapname] [[-]quiet] [[-]nil] [[-]progress] [[-]ignore_error] This issues a JTAG reset (Test-Logic-Reset) and then runs the SVF script from filename. Arguments can be specified in any order; the optional dash doesn’t affect their semantics.\nCommand options:\n-tap tapname ignore IR and DR headers and footers specified by the SVF file with HIR, TIR, HDR and TDR commands; instead, calculate them automatically according to the current JTAG chain configuration, targeting tapname; [-]quiet do not log every command before execution; [-]nil “dry run”, i.e., do not perform any operations on the real interface; [-]progress enable progress indication; [-]ignore_error continue execution despite TDO check errors. 
~ OpenOCD documentation\nhttps://openocd.org/doc-release/html/Boundary-Scan-Commands.html#SVF_003a-Serial-Vector-Format We invoke it in our OpenOCD script using the -progress option for additional logging.\nopenocd :\nset svf_path \u0026#34;out/project_prj_checkpoint.svf\u0026#34; source [find interface/jlink.cfg] transport select jtag set SPEED 1 jtag_rclk $SPEED adapter speed $SPEED reset_config none # jlink config set CHIPNAME XCKU3P set CHIP $CHIPNAME puts \u0026#34;set chipname $CHIP\u0026#34; source [find ../openocd/tcl/cpld/xilinx-xcu.cfg] source [find ../openocd/tcl/fpga/xilinx-sysmon.cfg] init puts \u0026#34;--------------------\u0026#34; sysmon_report $CHIP.tap puts \u0026#34;--------------------\u0026#34; # program if {![file exists $svf_path]} { puts \u0026#34;Svf path not found : $svf_path\u0026#34; exit } svf $svf_path -progress exit Flashing sequence log :\ngp@workhorse:~/tools/openocd_jlink_test$ openocd Open On-Chip Debugger 0.12.0+dev-02170-gfcff4b712 (2025-09-04-21:02) Licensed under GNU GPL v2 For bug reports, read http://openocd.org/doc/doxygen/bugs.html set chipname XCKU3P Read temperature sysmon 4 Info : J-Link V10 compiled Jan 30 2023 11:28:07 Info : Hardware version: 10.10 Info : VTarget = 1.812 V Info : clock speed 1 kHz Info : JTAG tap: XCKU3P.tap tap/device found: 0x04a63093 (mfg: 0x049 (Xilinx), part: 0x4a63, ver: 0x0) Warn : gdb services need one or more targets defined -------------------- Sysmon status report : TEMP 50.46 C MAXTEMP 52.79 C VCCINT 0.846 V MAXVCC 0.860 V VCCAUX 1.799 V MAXVCCAUX 1.809 V -------------------- svf processing file: \u0026#34;out/project_prj_checkpoint.svf\u0026#34; 0% TRST OFF; 0% ENDIR IDLE; 0% ENDDR IDLE; 0% STATE RESET; 0% STATE IDLE; 0% FREQUENCY 1.00E+07 HZ; adapter speed: 10000 kHz 0% HIR 0 ; 0% TIR 0 ; 0% HDR 0 ; 0% TDR 0 ; 0% SIR 6 TDI (09) ; 0% SDR 32 TDI (00000000) TDO (04a63093) MASK (0fffffff) ; 0% STATE RESET; 0% STATE IDLE; 0% SIR 6 TDI (0b) ; 0% SIR 6 TDI (14) ; 0% RUNTEST 0.100000 SEC; 0% 
RUNTEST 10000 TCK; 0% SIR 6 TDI (14) TDO (11) MASK (31) ; 0% SIR 6 TDI (05) ; 95% ffffffffffff) ; 95% SIR 6 TDI (09) TDO (31) MASK (11) ; 95% STATE RESET; 95% RUNTEST 5 TCK; 95% SIR 6 TDI (05) ; 95% SDR 160 TDI (0000000400000004800700140000000466aa9955) ; 95% SIR 6 TDI (04) ; 95% SDR 32 TDI (00000000) TDO (3f5e0d40) MASK (08000000) ; 95% STATE RESET; 95% RUNTEST 5 TCK; Info : Listening on port 6666 for tcl connections Info : Listening on port 4444 for telnet connections Resulting in a successfully configured FPGA.\nConclusion # For $200 we got a fully working decommissioned Alibaba Cloud accelerator featuring a Kintex UltraScale+ FPGA with an easily accessible debugging/programming interface and enough pinout information to define our own constraint files.\nWe also have a fully automated Vivado workflow to implement our designs and the ability to write the bitstream, and interface with the FPGA\u0026rsquo;s internal JTAG-accessible registers using an open source programming tool without the need for an official Xilinx programmer.\nIn the end, this project delivered at least a 5x cost saving over commercial boards (compared to the lowest-cost $900-1050 Alinx alternatives), making this perhaps the most cost-effective entry point for a Kintex UltraScale+ board.\nExternal resources # Xilinx Vivado Supported Devices : https://docs.amd.com/r/en-US/ug973-vivado-release-notes-install-license/Supported-Devices\nOfficial Xilinx dev board : https://www.amd.com/en/products/adaptive-socs-and-fpgas/evaluation-boards/ek-u1-kcu116-g.html\nAlinx Kintex UltraScale+ dev boards : https://www.en.alinx.com/Product/FPGA-Development-Boards/Kintex-UltraScale-plus.html\nUltraScale Architecture Configuration User Guide (UG570) : https://docs.amd.com/r/en-US/ug570-ultrascale-configuration/Device-Resources-and-Configuration-Bitstream-Lengths?section=gyn1703168518425__table_vyh_4hs_szb\nUltraScale Architecture System Monitor User Guide (UG580): 
https://docs.amd.com/v/u/en-US/ug580-ultrascale-sysmon\nVivado Design Suite Tcl Command Reference Guide (UG835): https://docs.amd.com/r/en-US/ug835-vivado-tcl-commands/Tcl-Initialization-Scripts\nPCI vendor/device ID database: https://admin.pci-ids.ucw.cz/read/PC/14e4\nPCI device classes: https://admin.pci-ids.ucw.cz/read/PD\nLinux kernel PCI IDs: https://github.com/torvalds/linux/blob/7aac71907bdea16e2754a782b9d9155449a9d49d/include/linux/pci_ids.h#L160-L3256\nLinux kernel PCI classes: https://github.com/torvalds/linux/blob/7aac71907bdea16e2754a782b9d9155449a9d49d/include/linux/pci_ids.h#L15-L158\nTruck-kun pinout: https://blog.csdn.net/qq_37650251/article/details/145716953\nEbay listing: https://www.ebay.com/itm/167626831054?_trksid=p4375194.c101800.m5481\nOpenOCD documentation: https://openocd.org/doc-release/pdf/openocd.pdf\nBefore you buy # If you have gotten yourself this board and have any personal experiences you would be willing to share, please feel free to shoot me an email.\nTL;DR:\nyour SFP might be faulty there are hidden extra GPIOs most of the LVCMOS18 pins are actually LVCMOS33 this board is supported by Corundum, a high-performance FPGA-based NIC using the taxi open-source transport library Broken SFP # Credit: Michał Hęćka\nMichał received an FPGA board with a Huawei MTRA-3E11A SFP module, but before plugging it in he noticed some dirt specks inside the module. He proceeded to open the module for closer inspection, and realized the receiver and transmission lasers were unsoldered and just \u0026hellip; hanging there.\nClearly very broken SFP \u0026hellip; How does this even happen ? Outcome: The eBay seller refunded the price of the module. 
Seller maybe123.\nExtra GPIOs # Credit: Alex Forencich\nThere are some additional GPIO pins accessible via pads, and he speculates some boards might have the actual header and associated components populated.\nFor all of us other plebs, these signals are available through SMD resistor footprints.\nJ5 location on the board J5 connection to FPGA pin reverse engineering :\nset_property -dict {LOC A14 IOSTANDARD LVCMOS33} [get_ports {gpio[0]}] ;# J5.3,4 set_property -dict {LOC E12 IOSTANDARD LVCMOS33} [get_ports {gpio[1]}] ;# J5.5,6 set_property -dict {LOC E13 IOSTANDARD LVCMOS33} [get_ports {gpio[2]}] ;# J5.7,8 set_property -dict {LOC F10 IOSTANDARD LVCMOS33} [get_ports {gpio[3]}] ;# J5.9,10 set_property -dict {LOC C9 IOSTANDARD LVCMOS33} [get_ports {gpio[4]}] ;# J5.11,12 set_property -dict {LOC D9 IOSTANDARD LVCMOS33} [get_ports {gpio[5]}] ;# J5.13,14 source: https://github.com/fpganinja/taxi/blob/8567f91ef6bab46a261e98f5ab660731162605f5/src/cndm/board/AS02MC04/fpga/fpga.xdc#L49C1-L54C84\nMost pins are actually 3.3V # Credit: Alex Forencich\nIn his experience, a lot of the I/O interfaces reported to be operating at 1.8V should actually be operating at 3.3V.\nI have partially updated the board pinout table and information in the section above to reflect this.\nTo get the full up-to-date board xdc according to Alex, check out the Corundum board support.\nCorundum for the Alibaba AS02MC04 # Credit: Alex Forencich\nFor users not looking to re-implement the entire PCIe + Ethernet stack themselves and wanting something to get their projects running right off the bat, the Corundum NIC has recently added support for this Alibaba board.\nlink to the Corundum project\nCorundum NIC # Corundum is an open-source, high-performance FPGA-based NIC and platform for in-network compute. 
Features include a high performance datapath, 10G/25G/100G Ethernet, PCI express gen 3+, a custom, high performance, tightly-integrated PCIe DMA engine, many (1000+) transmit, receive, completion, and event queues, scatter/gather DMA, MSI/MSI-X interrupts, per-port transmit scheduling, flow hashing, RSS, checksum offloading, and native IEEE 1588 PTP timestamping. A Linux driver is included that integrates with the Linux networking stack. Development and debugging is facilitated by an extensive simulation framework that covers the entire system from a simulation model of the driver and PCI express interface on the host side and Ethernet interfaces on the network side.\nSeveral variants of Corundum are planned, sharing the same host interface and device driver but targeting different optimization points:\ncorundum-micro: size-optimized for applications like SoCs and low-bandwidth NICs, supporting several ports at 1 Gbps up to 10-25 Gbps corundum-lite: middle of the road design, supporting multiple ports at 10G/25G or one port at 100G, up to around 100 Gbps aggregate corundum-ng: intended for high-performance packet processing with deep pipelines and segmented internal interfaces, supporting operation at up to 400 Gbps aggregate corundum-proto: simplified design with simplified driver, intended for educational purposes only Planned features include a DPDK driver, SR-IOV, AF_XDP, white rabbit/IEEE 1588 HA, and Zircon stack integration.\nNote that Corundum is still under active development and may not be ready for production use; additional functionality and improvements to performance and flexibility will be made over time.\ntaxi project README\nhttps://github.com/fpganinja/taxi ","date":"25 February 
2026","externalUrl":null,"permalink":"/tags/as02mc04/","section":"Tags","summary":"","title":"AS02MC04","type":"tags"},{"content":"","date":"25 February 2026","externalUrl":null,"permalink":"/tags/debugging/","section":"Tags","summary":"","title":"Debugging","type":"tags"},{"content":"","date":"25 February 2026","externalUrl":null,"permalink":"/tags/ebay/","section":"Tags","summary":"","title":"Ebay","type":"tags"},{"content":"","date":"25 February 2026","externalUrl":null,"permalink":"/tags/fpga/","section":"Tags","summary":"","title":"Fpga","type":"tags"},{"content":"","date":"25 February 2026","externalUrl":null,"permalink":"/tags/hacking/","section":"Tags","summary":"","title":"Hacking","type":"tags"},{"content":"","date":"25 February 2026","externalUrl":null,"permalink":"/tags/linux/","section":"Tags","summary":"","title":"Linux","type":"tags"},{"content":"It sometimes feels like I was of the last generation: the last to grow up in a world with a familiar texture, a texture future generations will never know.\nI was born in 1996; according to society\u0026rsquo;s conflation of individuals into generations based on their order of arrival, I am the youngest of the millennials.\nPandora, by John William Waterhouse (1896) My childhood played out on the tempo of adventures outside the limits of the family yard, floors covered in dolls and legos, and birthdays playing hide and seek.\nWhen I started reaching the boundary of adolescence, Facebook had not yet taken root, and MSN was but an irregular curiosity. Although I grew alongside the popularity of social media, we walked parallel paths, rarely encountering one another. Our ways of building friendships had already reached escape velocity, and were able to persist in their older forms despite its pull.\nI met the love of my life under the shadow of a bicycle shed.\nTime marched on, but taking notes on paper was still a thing. 
As I moved towards higher education, personal computers started blooming around me, officially for note taking, confidentially for anything but. We had coding assignments crowned by frantic all-night crunches, partially unsolved math homework, and rookie broken Linux builds. What we could not do we slowly learned.\nWe graduated into adulthood, grasping at our cultivated learning ability, fighting to stay afloat in the sea of complexity. With a slow imperceptible movement the tide receded, leaving us standing firmly on the sand as our legs were caressed by the waves, admiring the wide endless sea extending before us.\nSome choose to stop and admire the sea as it disappears over the horizon, others frenetically chase it despite the incessant fear of drowning, committed to growing with each step.\nToday I wish to chase the sea.\n","date":"7 February 2026","externalUrl":null,"permalink":"/thoughts/textures_of_growing_up/","section":"Thoughts","summary":"Description of the texture of my formative years.","title":"Textures of growing up","type":"thoughts"},{"content":"","date":"7 February 2026","externalUrl":null,"permalink":"/tags/writting-experminent/","section":"Tags","summary":"","title":"Writting Experminent","type":"tags"},{"content":"How much of what you do is for yourself, and how much is for others?\nSocial pressure is a powerful force.\nA Mermaid, John William Waterhouse (1900) A small subset of minds might truly be immune to its effects, but the greatest majority delude themselves into believing in their resistance, as they get swept away by the flow.\nI, for one, am susceptible, weak to the suggestion of the crowd. And although I have dispensed great effort in building up my mental insulation, my defenses are porous.\nMaybe this entire enterprise is the errand of a fool, deluding himself about the need to belong, ingrained deep into our psyche by millions of years of evolution. 
Regardless, I am such a fool.\nBut I am a fool not blinded by my weakness: I know how perniciously desires that are not my own creep in from the outside. I feel the pull to join the flock of enthusiastic cheers marching towards what they see as the next frontier.\nHow beaming and passionate they appear. How delightful their company must be. How brightly they shine for all to see. How much I would like to join them.\nYet, before I could blend myself among their ranks, I crossed paths with a familiar presence: my own porous mental barrier. It reminded me of my resolution, and urged me to firmly grasp my resolve. But as a shadow chased away by the brightness, my resolve was nowhere to be found.\nAs I awoke from my daze I was unable to parse my own desire from the traction of the crowd. How much of this desire is my own, and how much is the sweet song of sirens?\nUnable to parse this mystery, I resolved that I would not walk alongside the crowd so long as it is cheering, and only join when it starts to cry.\n","date":"20 January 2026","externalUrl":null,"permalink":"/thoughts/ai_acceleratos/","section":"Thoughts","summary":"On coping with the difficulties of separating sincere interests from frenzies.","title":"Why wait for the AI bubble to pop to tapeout AI accelerators?","type":"thoughts"},{"content":"","date":"2 January 2026","externalUrl":null,"permalink":"/tags/180nm/","section":"Tags","summary":"","title":"180nm","type":"tags"},{"content":"","date":"2 January 2026","externalUrl":null,"permalink":"/tags/asic/","section":"Tags","summary":"","title":"Asic","type":"tags"},{"content":"","date":"2 January 2026","externalUrl":null,"permalink":"/tags/gf180/","section":"Tags","summary":"","title":"Gf180","type":"tags"},{"content":"","date":"2 January 2026","externalUrl":null,"permalink":"/tags/jtag/","section":"Tags","summary":"","title":"JTAG","type":"tags"},{"content":" Living under rocks # As anyone who hasn’t been living under a rock might have heard, AI 
accelerators are the coolest kids in town these days. And although I have never been part of the \u0026ldquo;in\u0026rdquo; crowd, this time at least, I get the appeal.\nSo when the opportunity arose to join an experimental shuttle using GlobalFoundries 180nm for FREE, I jumped at it and designed my own JTAG!\n…\nI’m sorry? Is this not what you were expecting?\nFrankly, I would love to tell you a great story about how I went into this wanting to design a free and open-source, silicon-proven AI accelerator that the community could freely extend and re-use in their own projects. But in truth this project first started out as me wanting to design some far less sexy, in-silicon debug infrastructure, and only later came to include the systolic matrix-matrix multiplication accelerator … to serve as the design under test.\nNo wonder I was never one of the cool kids.\nAlso, I’m designing everything from scratch and have only two weeks left; welcome to the crunch.\nEssenceia/Systolic_MAC_with_DFT GF180 ASIC tapeout of a 2x2 MAC with DFT infrastructure GDS rendering of this project using the GF180 PDK If you are looking for a more formal overview of this ASIC, you can find the datasheet here Experimental shuttle # Once again this tapeout was done as part of a Tiny Tapeout shuttle, but this time, it went out not as part of a public shuttle but as part of an experimental one.\nYou can learn more about the awesome Tiny Tapeout shuttle programs at the official website: https://tinytapeout.com These experimental shuttles are used as testing grounds for new nodes and flows in order to iron out issues before they are opened to the public.\nParticipation in these experimental shuttles is commonly reserved for contributors who have previously submitted to Tiny Tapeout.\nThis limitation was set in place in order to help select for veteran designers, as these experimental tapeouts typically have less stable tooling and the final chip doesn\u0026rsquo;t feature the same level of 
inter-design isolation as in a public tapeout.\nContributions to these shuttles are done with the understanding that the resulting chip might not be functional for some reason outside of the designer\u0026rsquo;s control.\nGiven these limitations, the Tiny Tapeout program is generously making submissions to these shuttles free of charge.\nIn practice, this makes area effectively free, explaining the higher occurrence of absolutely massive designs being submitted.\nFull GDS render of the GF 0.2 Tiny Tapeout chip, with a lot of large multi-tile designs. If the final chip is deemed sufficiently functional, the resulting ASICs along with the dev board will be available for purchase at the Tiny Tapeout store.\nSince eligibility requires having previously taped out a chip with Tiny Tapeout, I only became eligible 14 days before the experimental shuttle submission deadline.\nCombo # Before we start, let us acknowledge what I am attempting to do, alone.\nIf this were any normal corporate setting, it would be the deadline equivalent of a one-way ticket straight to the ever-lengthening queue outside the gates of Hell.\nDo not try this at work!\nNow that safety precautions are out of the way:\nWelcome to a tale of two designs.\nThe first is the systolic array, typically found at the heart of any AI inference accelerator; its function is to perform matrix-matrix multiplications.\nThe other is our silicon debug infrastructure, namely the ubiquitous JTAG TAP component. Its goal is to provide something to latch onto when my ASIC comes back as an expensive brick and I am frantically trying to figure out where I fucked up. Since my master plan is to slap it onto all my future tapeouts, and I need to be able to trust it, having it proven in silicon early is very important.\nProject Roadmap # Alright, I have to admit something: I embellished reality to make myself look good. I lied. 
Although I initially had 2 weeks to do this tapeout, I spent the first 4 days reeling from the aftereffects of my previous tapeout (aka: sleeping, going outside and talking to another human being) and \u0026ldquo;deciding on a technical direction\u0026rdquo;, which is corporate speak for a mixture of \u0026ldquo;procrastinating\u0026rdquo; and \u0026ldquo;figuring out what I could build without sacrificing too much of my remaining sanity\u0026rdquo;. So I now had 10 days left. You’re welcome.\nGiven my self-imposed dire straits of a timeline, if I wanted any hope whatsoever of meeting the tapeout deadline I needed a battle plan: here is the roadmap.
---
config:
  logLevel: 'debug'
  theme: 'default'
  gitGraph:
    showBranches: true
    mainBranchName: 'architecture'
    mainBranchOrder: 6
---
gitGraph TB:
  commit id: "idea "
  commit id: "figured it out "
  branch design order: 4
  checkout design
  commit id: 'basic RTL '
  branch simulation order: 5
  checkout simulation
  commit id: 'this RTL is broken '
  checkout design
  commit id: 'starts to work '
  branch implementation order: 3
  checkout implementation
  commit id: 'RIP timing '
  commit id: 'RIP area '
  checkout design
  commit id: 'looking good '
  branch emulation order: 2
  checkout emulation
  commit id: 'bitstream acquired '
  branch firmware
  commit id: 'firmware bringup '
  checkout emulation
  merge firmware
  checkout design
  merge emulation
  merge simulation
  checkout implementation
  merge design
  commit id: 'tapeout! '
architecture: what I need to build and how components would interface with one another.\ndesign: where most of the RTL design takes place.\nsimulation: I write tests for simulating and validating that my design is behaving correctly and without any identifiable bugs. I used the classic Cocotb wrapper around iverilog for this.\nFPGA emulation: once the design has mostly taken shape, emulation takes place. This is the step where I port the design to an FPGA. 
firmware: Alongside the emulation I bring up the firmware to interface with my design. This allows me to validate there are no issues when interfacing between the MCU and the ASIC.\nimplementation: this is when the ASIC flow is run, the longest-running task. It starts being run once the RTL design starts becoming functional, to identify and fix timing and area utilization issues. And once all the verification is finished, it is run one final time to generate the final ASIC GDSII (manufacturing files).\nThe most astute readers might have noticed that this grand strategy is pretty much the same roadmap I always use.\nIf it ain’t broke, don’t fix it. ~ a wise man\nFlow saves the day # So, why am I doing this to myself? Well, self-delusion is a powerful force, and it was telling me this timeline would be possible if I leveraged my previous experience with the Tiny Tapeout/Librelane/OpenROAD flows, my existing personal linting/simulation/fpga/firmware flows, my existing code bases, and my own ingrained experience with the dark arts.\nBut, let us not delude ourselves, the saving grace of this terrible idea was really just how great the Tiny Tapeout/Librelane/OpenROAD ASIC flow is.\nThe following section assumes readers are already familiar with the Open Source Silicon ecosystem.\nFor those not already familiar with Tiny Tapeout, Librelane and OpenROAD, you can find a short description of these in my previous hashing accelerator ASIC article, where I introduce some of the great tools that the open source silicon ecosystem has created. The OpenROAD project was conceived with a no-human-in-the-loop (NHIL) target, and the goal of enabling 24-hour-or-less design turnaround times.\nLibrelane, the master coordinator of the flow itself, brings together OpenROAD, Yosys, ABC, Magic, and many more amazing open source tools, building on top of this philosophy. 
It creates a process that takes you from your Verilog and a few configuration files all the way to tapeout-ready artifacts, in an extremely streamlined and fast fashion, requiring minimal human intervention.\nTiny Tapeout then completes the loop, running your testbenches on top of the entire implementation, and then allowing you to automatically upload your GDSII for integration into the shuttle chip.\nThis deeply resonates with my personal belief that faster iteration times are central to higher quality and more efficient design:\nSince I believe a low iteration time is paramount to project velocity and getting big things done, I also want to automatize all of the Vivado flow from taking the rtl to the SVF generation.\n~ Previous article: Alibaba cloud FPGA: the 200$ Kintex UltraScale+\nThus, not only was I building on top of a legacy of efficient design flows, but I had previously invested a lot of time streamlining my tasks, especially repetitive ones.\nA classic example would be my FPGA build flow used for emulation. It only requires a single command to create the Vivado project, read all the RTL and constraint files, synthesize, optionally add an ILA core and connect any wires marked for debug to it, implement, and then flash the bitstream:\nmake fpga_prog debug=1\nEssentially, the bottleneck to making this design, from scratch in 10 days, wasn’t going to be the tools, but the squishy human between the chair and the keyboard.\nhomo electro-engineerus observed in its natural habitat in a pre-tapeout hibernation period known as a \u0026lsquo;crunch\u0026rsquo;. During this period, the specimen will seldom move from its cave, surviving exclusively on a diet of \u0026lsquo;coffee\u0026rsquo; and \u0026lsquo;HEB smoked ham\u0026rsquo;. 
Design # Without further ado, let\u0026rsquo;s talk design!\nSystolic Array Design # The goal of this systolic array is to perform a 2×2 matrix-matrix multiply on 8-bit integer numbers.\nWithout going too much into detail on why systolic arrays are the recurring stars at the heart of modern AI inference accelerators, their main strength is achieving a high ratio of compute to everything else. And since memory operations are by far the most expensive family of operations, most importantly a high ratio of compute to memory operations.\nData is recirculated directly within the array and reused across multiple consecutive operations rather than being repeatedly fetched/written from/to memory. This matters because memory accesses are expensive. SRAM accesses cost time and significant power, while DRAM accesses cost eternities of time and egregious amounts of power. Compute operations, even 64-bit floating-point multiplications, are, by comparison, cheap.\nThus, the larger the systolic array, the deeper the chain of compute, and the better this compute-to-memory ratio becomes.\nEnergy ratio # The following sub-section is me geeking out on power consumption numbers. If you don\u0026rsquo;t have a deeply ingrained passion for discussing pJ (pico Joules) you can safely skip it, I won\u0026rsquo;t judge you. As an illustrative example of this evolution of compute ratios, let us compare the power cost of 64-bit floating-point multiply-add (MAC) operations in a hypothetical systolic array designed on a 45nm node running at 0.9V with the energy expenditure needed to access this data using 256-bit wide reads from 16nm DRAM. 
DRAM access costs will include 10 mm of wire, the interface, and the access itself.\nIn this example, we will very generously assume the weights are stored in place and do not need to be updated, so only the input data matrix needs to be read from DRAM.
Operation | Energy (45 nm)
16-bit integer multiply | 2 pJ
64-bit floating-point multiply-add | 50 pJ
64-bit access to 8-Kbyte SRAM | 14 pJ
256-bit access to 1-Mbyte SRAM | 566 pJ
256-bit 10 mm wire | 310 pJ
256-bit DRAM interface | 5,120 pJ
256-bit DRAM access | 2,048 pJ
source: Energy Proportional Memory Systems\nEven though I purposefully chose 64-bit floats, the most energy-intensive arithmetic operation, we still required a 64x64 systolic array before the cost of compute started exceeding the cost of the initial DRAM value reads.\naka: For those not living in 2026, we have uncovered a new clue to the mystery of where all the HBM chips have suddenly vanished to! (HBM has the lowest energy cost per accessed bit.)\nScaling # Another great feature of systolic arrays is how regular they are and how well their design scales. 
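That scaling argument can be made quantitative with the numbers from the energy table above. A back-of-the-envelope sketch (my own simplifying assumptions: each 256-bit DRAM read carries four 64-bit operands, and each operand loaded into an N×N array is reused for N MACs):

```python
import math

# Energy numbers (pJ) from the table above (45 nm logic, 16 nm DRAM).
E_MAC_FP64 = 50       # 64-bit floating-point multiply-add
E_DRAM_IFACE = 5120   # 256-bit DRAM interface
E_DRAM_ACCESS = 2048  # 256-bit DRAM access
E_WIRE_10MM = 310     # 256-bit 10 mm wire

# Total energy to move one 256-bit word from DRAM onto the chip.
e_read_256b = E_DRAM_IFACE + E_DRAM_ACCESS + E_WIRE_10MM  # 7478 pJ

# Assumption: each 256-bit read carries four 64-bit operands.
e_per_operand = e_read_256b / 4

# In an NxN array each loaded operand feeds N MACs, so compute energy
# matches the read energy once N * E_MAC_FP64 >= e_per_operand.
n_break_even = math.ceil(e_per_operand / E_MAC_FP64)

print(f"{e_read_256b} pJ per 256-bit read, break-even at N = {n_break_even}")
```

Under these deliberately crude assumptions compute overtakes the raw DRAM read cost around N ≈ 38, the same order of magnitude as the 64x64 figure quoted above; a fuller accounting that also charges the on-chip SRAM hierarchy and write-back traffic pushes the break-even size higher.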
So, although today I am designing a small 2x2 array, without much re-work this can be scaled up to a larger 256x256 array.\nSome of you might be wondering: if area is free and systolic arrays scale so well, why am I limiting myself to only making a 2x2 array?\nWell, because I am a good neighbor, and any additional tile of area I use is potentially depriving someone else of the area available to submit their project.\nSince the initial motivation for this project was to design some proven in-silicon debug infrastructure, a 2x2 array is sufficient for my needs.\nConstraints # Since this ASIC is once again taping out as part of a Tiny Tapeout shuttle, it has to grapple with similar limitations to my previous hashing accelerator, namely: the eternal I/O bandwidth limitation!\nTiny Tapeout chip (I recommend switching the page to light mode for better readability of the pin names) This limitation comes in two flavors:\nPin count: eight input pins, eight output pins, and eight configurable I/O pins, limiting my parallel buses in and out of the accelerator.\nMaximum operating frequency: unlike with the well-trodden path that was the public Skywater 130nm shuttle tapeout, for this experimental shuttle the maximum switching frequency of these pins hasn’t been characterized yet. And since I haven\u0026rsquo;t yet figured out how to do this characterization using a simulator myself, I\u0026rsquo;ve hand-wavingly assumed that both the input and output directions have a sustainable switching frequency of 50 MHz, and have sized the path timings around that assumption.\nAdditionally, I once again have the constraint of not having any SRAM. Though unlike with the Skywater 130 tapeout, the GlobalFoundries 180nm PDK does include a proven SRAM macro!\nHooray! Human progress is unstoppable!\nThis time, the limitation was actually imposed by the experimental nature of this shuttle.\nFirstly, the flow didn\u0026rsquo;t initially support these macros out of the box. 
More specifically, in the way they were laid out, which led to a slew of DRC failures that needed to be fixed.\nSecondly, this experimental shuttle didn’t provide individual project power gating. As such, integrating this SRAM macro was generally deemed not the best idea, and the community collectively agreed to wait for a future shuttle, which would include per-project macro power gating, before including it.\nDesign Details # This system is broken into two parts:\nthe compute units\nthe main array controller\nCompute units # Since this is a 2×2 systolic array, there are four compute units. Each unit takes in 8-bit signed integers and performs a multiply, an addition, and a clamping operation, producing 8-bit signed integers.\nCompute unit critical path. Multiplication # The multiply is done using a custom (from scratch) implementation of a Booth Radix-4 multiplier with Wallace trees.\nThis multiplier architecture strikes a good balance between area, power and performance, and can be regarded as the multiplier equivalent of what \u0026ldquo;Rental White™\u0026rdquo;[1] is to interior design.\nMeaning it is a solid multiplier option, well-rounded from a PPA perspective, that, without being anything novel or groundbreaking, is probably good enough for your use case (and given my +2ns of worst negative slack, and comfortable area occupancy, it was plenty good enough for mine).\nAnother advantage of Booth radix-4 is that, since we are performing a signed multiplication, we only have 4 partial products instead of the 5 needed for unsigned operations, which lets us optimize out a level of the Wallace tree used for the partial product additions.\nThe original article plan included an in-depth explanation of the Booth radix-4 multiplication implementation and optimization, resulting in a very interesting but also very dry multi-page explanation covered with boolean algebra. 
So it was canned.\nIf you are looking for a good explanation of this multiplier and its optimization, see chapter 11 of ‘CMOS VLSI Design: A Circuits and Systems Perspective’.\nLink to David Money Harris (co-author) blog.\nClamping # The clamping operation occurs after both the multiply and the addition, at which point the data is 17 bits wide; since we want to prevent our data size from exploding, the clamping operation brings the outgoing data back down to eight bits.\n$$ clamp_{i8}(x) = \begin{cases} \phantom{-}127,\,\text{if}\,x > 127, \\ \phantom{-12}x,\,\text{if}\,x \in [-128,127], \\ -128,\,\text{if}\,x < -128 \end{cases} $$ In place weight storage # Given that the weights have high temporal and spatial locality, meaning the same set of weights can be reused over multiple unique input data matrices, the choice was made, in order to save on bandwidth, to store an 8-bit weight in place inside each unit.\nA separate control sequence allows the user to load a new set of weights inside each of the units. The weight packet is sent over four consecutive cycles using the same input interface as the data matrix.\nArray controller # Given the way in which the matrix-matrix multiplication is performed using a systolic array, the input data matrix needs to be shaped and fed to the array in a staggered manner.\nUnderstandably, readers might not instinctively get what I mean by \u0026ldquo;the data needs to be shaped\u0026rdquo;.\nTo help better illustrate this point, I would like to bring your attention to a great animation showing how data flows through the array.\nAn added bonus is that this animation is also of a 2×2 systolic array, such that it shares the same data arrival constraints as my accelerator.\nCredit: Pan, William. (2025). Systolic Array Simulator. 
GitHub Repository. https://github.com/wp4032/william-pan.com This animation is not a carbon copy of my accelerator. Rather, readers should use this animation as a tool to better understand how data flows in a systolic array and how it results in the computation of a matrix-matrix multiplication. For the sake of completeness, I would like to point out the major differences with my implementation:\nthis animation uses floating-point numbers, I am using 8-bit integers\nthere is no clamping step\nthe timings are not entirely identical. In my accelerator, many more operations occur on the same cycle. Granted, this was probably done in an effort to help make the animation more legible.\nIn addition to helping shape the input data to the systolic array, this controller also helps coordinate data transfers around my I/O limitations.\nGiven the parallel data bus allows only 8 bits of input data to arrive per cycle, the controller uses the input buffers to accumulate enough data to create the next wave.\nSimilarly on the output side, where the accelerator produces two 8-bit results per cycle, the controller stores the results in an output buffer so as to stream out only 8 bits per cycle.\nArray Controller interfacing with both the input and output data buffers. Validation # Validation was an extremely important step for this systolic array, particularly for validating my custom implementation of the Booth Radix-4 multiplication.\nOnce again, I used Cocotb as the abstracting layer allowing me to interface with multiple different simulators. 
Namely, Icarus Verilog for my standard verification and CVC for the post-implementation timing-annotated netlist.\nThis validation followed my standard approach of providing a set of input data matrices and weights over the accelerator\u0026rsquo;s standard interfaces and comparing the produced results to the expected values.\nGiven the nature of the problem, I made extensive use of randomization to get better testing coverage and attempt to hit corner cases that would not have been revealed using a directed approach. This randomization covered both the input weight and data values, and also the timings with which the incoming data was fed over the parallel bus to the accelerator.\nFirmware # The firmware used to interface with this accelerator was designed to run on the RP2040 Raspberry Pi silicon and used the PIO hardware block to drive the parallel port.\nI followed a similar approach as with my hashing accelerator, co-designing the parallel bus protocol around the PIO-imposed limitations.\nAlthough this was designed for an RP2040, I later learned that the Tiny Tapeout boards would be evolving to a new version of the chip. 
These new PCBs would use a different pinout layout for communicating with the ASIC chips.\nAlthough this will necessitate re-adapting parts of the firmware for the new dev boards, it shouldn\u0026rsquo;t cause any major incompatibilities.\nJTAG TAP Design # The second major component of this design, and admittedly the actual principal piece of this ASIC, is the JTAG TAP.\nAs stated earlier, when my ASIC comes back as a glorified paperweight I will suddenly have a strong and immediate need for some kind of hardware block providing some in-silicon observability.\nThis is absolutely crucial, as it isn\u0026rsquo;t a question of if but when one of my ASICs will have hardware issues.\nAnd, when that day comes, not having a view of the internal behavior will only make it that much more painful to identify the root issue in order to fix it for the upcoming generation.\nAs such, this TAP is actually a part of my larger effort to produce a set of proven DFT (Design for Test) IPs/tools that I can integrate into all of my future ASICs. Hence, I have quite an incentive to have these designs proven early.\nJTAG was chosen as it is a very common debug protocol. 
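Part of what makes JTAG so approachable is that every TAP can identify itself with a 32-bit IDCODE: version in bits 31:28, part number in bits 27:12, the JEP106 manufacturer id in bits 11:1, and a constant 1 in bit 0. A minimal decoder sketch (a Python model for illustration, not part of the design):

```python
def decode_idcode(idcode: int) -> dict:
    """Split a 32-bit JTAG IDCODE into its standard fields."""
    assert idcode & 0x1 == 1, "bit 0 of a valid IDCODE is always 1"
    return {
        "version": (idcode >> 28) & 0xF,        # bits 31:28
        "part": (idcode >> 12) & 0xFFFF,        # bits 27:12
        "manufacturer": (idcode >> 1) & 0x7FF,  # bits 11:1, JEP106 id
    }

# Decoding the IDCODE this chip advertises (see the OpenOCD log later on):
fields = decode_idcode(0x1BEEF0D7)
# version 0x1, part 0xbeef, manufacturer 0x06b
print({k: hex(v) for k, v in fields.items()})
```

This self-identification is exactly what lets generic tooling enumerate unknown devices on a scan chain before any custom configuration is loaded.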
Not only is it well-defined, it also has good off-the-shelf support, on both the hardware (debug probes) and software sides.\nJTAG Implementation # In addition to implementing all of the basic JTAG features needed to be compliant with the protocol, I added custom instructions, extending it to fit my personal requirements.\nRequired instructions:\nEXTEST, opcode 0x0, boundary scan operation\nIDCODE, opcode 0x1, read the JTAG TAP identifiers, which allows the hardware to advertise itself on the JTAG scan chain\nSAMPLE_PRELOAD, opcode 0x2, boundary scan operation\nBYPASS, opcode 0x7, set the TAP in bypass mode\nCustom instructions:\nUSER_REG, opcode 0x3, custom instruction used to probe the internal registers\nThe systolic array is quite a deep structure in the sense that the data is directly reused as it is recirculated within the array.\nAs such, it is very easy to lose track of what the internal state is, having only the input and output observable to the user. Although this might stay manageable with a 2x2 array, as the structure grows larger this will become more and more of an issue.\nIn order to help address this future pain point, I added the custom USER_REG JTAG instruction to read the current state of a target compute unit\u0026rsquo;s internal registers.\nLink to the datasheet of the USER_REG instruction.\nDesign # The JTAG TAP design itself is quite straightforward, as JTAG was conceived as a hardware-first protocol, and this shows in its implementation. The design flows cleanly from the JTAG specification to the RTL (unlike some; BLAKE2, I am looking at you!). As such, I do not think it is worthwhile to discuss it in detail.\nWhat makes this design more interesting is how the JTAG and the systolic array live in two different clock domains. 
This gives the accelerator not one but two separate clock trees.\nOn the altar of this noble cause, I sacrificed one of my precious data input pins to serve as the JTAG clock input (TCK).\nClock tree of clk, the clock of the systolic array. Clock tree of tck, the clock for the JTAG. You can see it spreads towards the systolic array logic, because it is accessing the internal registers of the compute units. The SDC script I used to generate these two clock trees drew heavy inspiration from the official LibreLane clock generation SDC, but was re-adapted for my contorted use case:
# modified librelane base.sdc to support 2 clocks
# custom env variable
set ::env(JTAG_CLOCK_PERIOD) 500

if { [info exists ::env(CLOCK_PORT)] } {
    set port_count [llength $::env(CLOCK_PORT)]
    puts "[INFO] Found ${port_count} clocks : $::env(CLOCK_PORT)"
    if { $port_count == "0" } {
        puts "[ERROR] No CLOCK_PORT found."
        error
    }
    # set both clock ports
    set ::clock_port [lindex $::env(CLOCK_PORT) 0]
    set ::jtag_clock_port [lindex $::env(CLOCK_PORT) 1]
}

set port_args [get_ports $clock_port]
set jtag_port_args [get_ports $jtag_clock_port]
puts "[INFO] Using clock $clock_port… with args $port_args"
puts "[INFO] Using jtag clock $jtag_clock_port… with args $jtag_port_args"

create_clock {*}$port_args -name $clock_port -period $::env(CLOCK_PERIOD)
create_clock {*}$jtag_port_args -name $jtag_clock_port -period $::env(JTAG_CLOCK_PERIOD)

set input_delay_value [expr $::env(CLOCK_PERIOD) * $::env(IO_DELAY_CONSTRAINT) / 100]
set output_delay_value [expr $::env(CLOCK_PERIOD) * $::env(IO_DELAY_CONSTRAINT) / 100]
puts "[INFO] for clk $clock_port :"
puts "[INFO] Setting output delay to: $output_delay_value"
puts "[INFO] Setting input delay to: $input_delay_value"

# keep the same io delay constraints for jtag
set jtag_input_delay_value [expr $::env(JTAG_CLOCK_PERIOD) * $::env(IO_DELAY_CONSTRAINT) / 100]
set jtag_output_delay_value [expr $::env(JTAG_CLOCK_PERIOD) * $::env(IO_DELAY_CONSTRAINT) / 100]
puts "[INFO] for clk $jtag_clock_port :"
puts "[INFO] Setting output delay to: $jtag_output_delay_value"
puts "[INFO] Setting input delay to: $jtag_input_delay_value"

set_max_fanout $::env(MAX_FANOUT_CONSTRAINT) [current_design]
if { [info exists ::env(MAX_TRANSITION_CONSTRAINT)] } {
    set_max_transition $::env(MAX_TRANSITION_CONSTRAINT) [current_design]
}
if { [info exists ::env(MAX_CAPACITANCE_CONSTRAINT)] } {
    set_max_capacitance $::env(MAX_CAPACITANCE_CONSTRAINT) [current_design]
}

# clk
set clk_input [get_port $clock_port]
set clk_indx [lsearch [all_inputs] $clk_input]
set all_inputs_wo_clk [lreplace [all_inputs] $clk_indx $clk_indx ""]
# jtag clk
set jtag_clk_input [get_port $jtag_clock_port]
set jtag_clk_indx [lsearch [all_inputs] $jtag_clk_input]
set jtag_all_inputs_wo_clk [lreplace [all_inputs] $jtag_clk_indx $jtag_clk_indx ""]
# rst
set all_inputs_wo_clk_rst $all_inputs_wo_clk
# jtag has no trst so there is no need to define another rst path

# correct resetn
set clocks [get_clocks $clock_port]
set_input_delay $input_delay_value -clock $clocks $all_inputs_wo_clk_rst
set_output_delay $output_delay_value -clock $clocks [all_outputs]

if { ![info exists ::env(SYNTH_CLK_DRIVING_CELL)] } {
    set ::env(SYNTH_CLK_DRIVING_CELL) $::env(SYNTH_DRIVING_CELL)
}
set_driving_cell \
    -lib_cell [lindex [split $::env(SYNTH_DRIVING_CELL) "/"] 0] \
    -pin [lindex [split $::env(SYNTH_DRIVING_CELL) "/"] 1] \
    $all_inputs_wo_clk_rst
set_driving_cell \
    -lib_cell [lindex [split $::env(SYNTH_CLK_DRIVING_CELL) "/"] 0] \
    -pin [lindex [split $::env(SYNTH_CLK_DRIVING_CELL) "/"] 1] \
    $clk_input
set_driving_cell \
    -lib_cell [lindex [split $::env(SYNTH_CLK_DRIVING_CELL) "/"] 0] \
    -pin [lindex [split $::env(SYNTH_CLK_DRIVING_CELL) "/"] 1] \
    $jtag_clk_input

set cap_load [expr $::env(OUTPUT_CAP_LOAD) / 1000.0]
puts "[INFO] Setting load to: $cap_load"
set_load $cap_load [all_outputs]

puts "[INFO] Setting clock uncertainty to: $::env(CLOCK_UNCERTAINTY_CONSTRAINT)"
set_clock_uncertainty $::env(CLOCK_UNCERTAINTY_CONSTRAINT) $clocks
puts "[INFO] Setting clock transition to: $::env(CLOCK_TRANSITION_CONSTRAINT)"
set_clock_transition $::env(CLOCK_TRANSITION_CONSTRAINT) $clocks
puts "[INFO] Setting timing derate to: $::env(TIME_DERATING_CONSTRAINT)%"
set_timing_derate -early [expr 1-[expr $::env(TIME_DERATING_CONSTRAINT) / 100]]
set_timing_derate -late [expr 1+[expr $::env(TIME_DERATING_CONSTRAINT) / 100]]

if { [info exists ::env(OPENLANE_SDC_IDEAL_CLOCKS)] \u0026amp;\u0026amp; $::env(OPENLANE_SDC_IDEAL_CLOCKS) } {
    unset_propagated_clock [all_clocks]
} else {
    set_propagated_clock [all_clocks]
}

set_clock_groups -asynchronous -group $clock_port -group $jtag_clock_port
Because the official JTAG spec lives behind the impregnable IEEE paywall, a castle in which I am not permitted to set foot as a result of not having paid its lord my dues, the verification of the JTAG TAP was actually quite interesting.\nSince JTAG is such a common protocol, it\u0026rsquo;s easy to find free resources online to build a good mental model of the internal FSM and TAP structure. 
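That mental model is small enough to write down in full. Here is a sketch of the 16-state TAP controller's next-state function as a Python model (state names as they appear in the freely available descriptions of the protocol; this is an illustration, not the tapeout RTL):

```python
# Next-state table of the 16-state JTAG TAP controller, keyed by state;
# each tuple is (next state when TMS=0, next state when TMS=1).
NEXT = {
    "TEST_LOGIC_RESET": ("RUN_TEST_IDLE", "TEST_LOGIC_RESET"),
    "RUN_TEST_IDLE":    ("RUN_TEST_IDLE", "SELECT_DR"),
    "SELECT_DR":        ("CAPTURE_DR",    "SELECT_IR"),
    "CAPTURE_DR":       ("SHIFT_DR",      "EXIT1_DR"),
    "SHIFT_DR":         ("SHIFT_DR",      "EXIT1_DR"),
    "EXIT1_DR":         ("PAUSE_DR",      "UPDATE_DR"),
    "PAUSE_DR":         ("PAUSE_DR",      "EXIT2_DR"),
    "EXIT2_DR":         ("SHIFT_DR",      "UPDATE_DR"),
    "UPDATE_DR":        ("RUN_TEST_IDLE", "SELECT_DR"),
    "SELECT_IR":        ("CAPTURE_IR",    "TEST_LOGIC_RESET"),
    "CAPTURE_IR":       ("SHIFT_IR",      "EXIT1_IR"),
    "SHIFT_IR":         ("SHIFT_IR",      "EXIT1_IR"),
    "EXIT1_IR":         ("PAUSE_IR",      "UPDATE_IR"),
    "PAUSE_IR":         ("PAUSE_IR",      "EXIT2_IR"),
    "EXIT2_IR":         ("SHIFT_IR",      "UPDATE_IR"),
    "UPDATE_IR":        ("RUN_TEST_IDLE", "SELECT_DR"),
}

def step(state: str, tms_bits) -> str:
    """Clock the TAP through a sequence of TMS values, one per TCK edge."""
    for tms in tms_bits:
        state = NEXT[state][tms]
    return state

# Five TCK cycles with TMS high reach Test-Logic-Reset from any state.
assert step("SHIFT_DR", [1, 1, 1, 1, 1]) == "TEST_LOGIC_RESET"
# From idle, TMS = 1,0,0 selects and enters the DR shift path.
assert step("RUN_TEST_IDLE", [1, 0, 0]) == "SHIFT_DR"
```

The property that five consecutive TMS-high cycles always land in Test-Logic-Reset is what debug software leans on to synchronize with a TAP in an unknown state.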
My initial implementation and test bench were derived from this best-effort understanding.\nAs such, emulation actually became a critical step for validating this JTAG TAP.\nRecall how earlier I mentioned that JTAG was a well-supported protocol with good off-the-shelf support on the software side?\nDid I mention OpenOCD has great support for JTAG?\nIf I wasn\u0026rsquo;t going to get access to the official spec on how JTAG is expected to behave, plan B was to let my implementation be guided by how OpenOCD expected JTAG to behave.\nTo the purists clutching their pearls, flabbergasted by the idea of not implementing per the spec: I’m not sorry. But if you do have the official spec, here is my email.\nIn practice, whenever OpenOCD flagged my JTAG behavior as problematic, I assumed that my implementation was at fault. And trust me, there were issues. I called this\n$$Designed\, By\, Support\,^{TM}$$\nThen came the matter of supporting my custom TAP\u0026rsquo;s unholy instructions.\nLuckily for me, OpenOCD allows you to fine-tune its behavior through custom TCL scripts (did I mention I have PTSD and now love TCL?).\nThanks to already having gone through the pain of learning to create custom OpenOCD scripts during my Alibaba accelerator salvage project, this was a breeze.\nThese scripts allowed me to bring up my custom TAP and add support for my own godforsaken instructions.\nThe script can be found here, and below is proof of my crime: a log of me interfacing with my design during emulation:
Open On-Chip Debugger 0.12.0+dev-02171-g11dc2a288 (2025-11-23-19:25)
Licensed under GNU GPL v2
For bug reports, read http://openocd.org/doc/doxygen/bugs.html
Info : J-Link V10 compiled Jan 30 2023 11:28:07
Info : Hardware version: 10.10
Info : VTarget = 3.380 V
Info : clock speed 2000 kHz
Info : JTAG tap: tpu.tap tap/device found: 0x1beef0d7 (mfg: 0x06b (Transwitch), part: 0xbeef, ver: 0x1)
Readers having reached enlightenment in their familiarity with the JTAG 
protocol might notice that the chip actually advertises itself as an Nvidia accelerator (mfg 0x06b).\nIt\u0026rsquo;s good to have dreams.\nConclusion # And somehow, against all odds, I made it! I met this absurd self-imposed deadline and the chip is now in fabrication!\nThis project was a battle against time. But making it fit in the end is something I\u0026rsquo;m honestly very proud of.\nOne of the big difficulties was not getting side-tracked. Even if it was frustrating, I had to commit to only adding the minimal set of features needed to get the project rolling. And trust me, the temptation to add just one extra little teeny-tiny feature, either to the systolic array or the debug system, was excruciating.\nThe good news is that, with this upcoming tapeout (v2 has already started), I can finally lash out! My grand strategy is for the next project to be an improvement of both concepts.\nFirstly, I\u0026rsquo;d like to finally have a rematch with an old adversary of mine: Floating Point arithmetic! Less for the TOPS-\u0026gt;FLOPS bragging rights than to finally pierce its deep mysteries through an optimized hardware implementation of my own.\nSecondly, I will continue extending my debug infrastructure: adding scan chains and an ATPG flow, not only to help identify silicon manufacturing issues, but also as a convenient way to extract the current state of all flops for use in debugging.\nYet, this is but another step towards my greater long-term objective: to tape out my own chip, not as part of Tiny Tapeout, nor as part of a shuttle chip, but entirely on my own.\nThere are many minutiae of ASIC design that are for now (and for good reason) abstracted away by Tiny Tapeout. 
So, my objective is to use these shuttle programs as a harness while I master the missing skills.\nI would like to finish with thanks to a company which has not been sponsoring me in any way, is totally unaware of my existence, but which, nevertheless, never fails to provide the level of reward desperately needed after an all-nighter spent trying to wrap up my design:\nMy beloved Waffle House. Footnotes # [1] - All uses and attributions of the designation ‘rental white’™ are the sole property of John and are used hereby with acknowledgment of his proprietary rights.\n","date":"2 January 2026","externalUrl":null,"permalink":"/projects/two_weeks_until_tapeout/","section":"Other projects","summary":"Chronicles of a bad idea. Or how I designed a systolic array with in-silicon debug infrastructure from scratch in under two weeks, and taped it out on GlobalFoundries 180 nm through a Tiny Tapeout experimental shuttle.","title":"Two weeks until tapeout","type":"projects"},{"content":"","date":"30 December 2025","externalUrl":null,"permalink":"/tags/blake2/","section":"Tags","summary":"","title":"Blake2","type":"tags"},{"content":" My Latest Bad Idea # Some people have hobbies; I find that cryptographic hashing accelerators are fascinating engineering challenges. Please hear me out. Each algorithm offers slightly different design tradeoffs, presents unique opportunities for optimization, is easy to validate, and each serves as a perfect excuse to voluntarily subject yourself to a Friday evening of debugging hash result mismatches.\nNow that the stage is set, let me introduce today\u0026rsquo;s star: BLAKE2. BLAKE2 is a family of cryptographic hash functions that takes arbitrary amounts of data and compresses it down to a variable-sized digest (hash). 
This hash is then used as a digital fingerprint for message authentication codes and integrity protection mechanisms.
The BLAKE2 family comes in two primary variants:
BLAKE2b, designed for 64-bit platforms
BLAKE2s, the 32-bit variant
What makes BLAKE2 particularly interesting is that it was originally designed and optimized for high software performance. It's fundamentally a software-first algorithm, in contrast to AES, which maps so clearly to hardware that your architecture practically writes itself as you read the spec.
This article will mostly focus on the why behind the design choices rather than an in-depth presentation of the design itself. The full codebase can be found here:
Essenceia/blake2_asic - SKY130A implementation of the Blake2s hash algorithm
For readers more interested in using this accelerator, the datasheet can be found here
Accessible, Not Easy # The goal of this project was to independently tape out my first ASIC outside of any organized team. This meant, solo: designing, validating, and successfully taping out a fully-featured BLAKE2s hashing accelerator on the SkyWater 130nm process through the Tiny Tapeout shuttle program.
A few years ago, an independent ASIC tape-out would have been a pipe dream unless you had successfully sold your startup to Meta or had grown up in a family whose garage included a private jet.
Today, thanks to recent advances in open-source EDA tools (Librelane, OpenROAD), the emergence of open-source PDKs thanks to Google (SkyWater 130nm, Global Foundries 180nm, IHP 130nm), and public low-cost shuttle programs like Tiny Tapeout that multiplex hundreds of designs onto shared chips, there now exists a path where independent designers can tape out custom ASICs without bankruptcy.
Some would say ASIC design has never been more accessible.
This statement is technically true in the same way that saying running a marathon has never been more accessible is true: you can get a decent pair of running shoes from your local sports store, and running outside is free.
But accessible it is, and that's revolutionary.
One Shot, Don't Miss # Unlike FPGA development and software, where patches are possible, ASIC bugs are permanent, expensive, and occasionally end up as cautionary tales in textbooks.
As such, verification became my obsession. The verification strategy employed a multi-layered defensive line:
Simulation-based verification formed the foundation. The design was simulated using cocotb and checked against an instrumented golden model (hashlib's blake2s implementation). Testing included both directed test cases for the official test vectors and constrained random stimulus generation for broader coverage. The goal wasn't just functional correctness; it was finding the bugs that directed testing alone would miss.
FPGA emulation sat at the top, because hardware without software is just expensive modern art that occasionally gets warm. Emulation provided the critical bridge between simulation and silicon. The design was ported to a Basys3 FPGA (Xilinx Artix-7, a board chosen for its abundance of IO pins and next-day shipping on Amazon) and connected to an RP2040 microcontroller via GPIO, recreating the hardware/firmware interface planned for the final ASIC. This environment enabled co-design and validation of both the accelerator and its firmware, catching protocol issues and integration bugs that only manifest in a real hardware/MCU (MicroController Unit) setup.
Theory Meets Reality # On the ASIC side, design correctness and passing timing were necessary but not sufficient.
An actual tape-out introduces an additional constraint: can we actually build this? Can we get this to fit within the characterized operating parameters of the standard cells?
AKA: if we do build it, will it work?\nThis challenge involves:\nDesign Rule Checking (DRC): for the SKY130A process, this meant staying within acceptable antenna ratios and keeping all the wire slew rates and max capacitances within PDK spec.\nShuttle-imposed limitations required designing within the constraints inherent to the Tiny Tapeout shuttle chip. The shared I/O architecture imposed severe bandwidth limitations: 66 MHz input maximum, 33 MHz output due to weak buffer drivers, and only 24 total pins for all communication.\nThese weren\u0026rsquo;t just inconveniences; they fundamentally shaped the architecture, capping performance more than any internal logic constraints.\nWhat Comes Next # This article goes through the process from idea to tape-out.\nThe chip is currently in fabrication, and in nine months, we\u0026rsquo;ll know if this produced a functional cryptographic accelerator or an expensive paperweight.\nOpen Source Silicon # Before jumping deep into the weeds, I would first like to take a step back and give some context about the tools and shuttle program I will be using for this tape-out.\nThis won\u0026rsquo;t be an exhaustive list, and there are plenty of amazingly powerful tools that I won\u0026rsquo;t be mentioning. Think of it as a list of my favorites.\nIf you are already familiar with the Open Source Silicon ecosystem and workflow, you can skip this section. 
OpenROAD # OpenROAD isn't the name of a single tool so much as a family of ASIC design tools, and it sits at the heart of the ASIC flow.
In conjunction with OpenSTA, which contains the timing engine, it includes the tools needed to take a design from floorplanning through placement, clock tree synthesis, and routing, all the way to finishing.
The official OpenROAD documentation has a great breakdown of some of the main features:
Here are the main steps for a physical design implementation using OpenROAD:
Floorplanning
- Floorplan initialization - define the chip area, utilization
- IO pin placement (for designs without pads)
- Tap cell and well tie insertion
- PDN - power distribution network creation
Global Placement
- Macro placement (RAMs, embedded macros)
- Standard cell placement
- Automatic placement optimization and repair for max slew, max capacitance, and max fanout violations and long wires
Detailed Placement
- Legalize placement - align to grid, adhere to design rules
- Incremental timing analysis for early estimates
Clock Tree Synthesis
- Insert buffers and resize for high fanout nets
- Optimize setup/hold timing
Global Routing
- Antenna repair
- Create routing guides
Detailed Routing
- Legalize routes, DRC-correct routing to meet timing, power constraints
Chip Finishing
- Parasitic extraction using OpenRCX
- Final timing verification
- Final physical verification
- Dummy metal fill for manufacturability
- Use KLayout or Magic on the generated GDS for DRC signoff
It was originally funded by DARPA, and this thought warms my aching heart when I see my tax bill.
LibreLane # LibreLane is like a makefile that ties all of the ASIC tools together to build the ASIC flow.
It pulls in Verilator for lint checking, Yosys and ABC for synthesis, and OpenROAD for implementation, augmented with a few additional custom scripts to help complete the flow.
For this tape-out, we will be using the "classic" variant of the LibreLane flow.
Sky130A PDK # The PDK, or Process Design Kit, is an integral part of making a digital ASIC design, as it contains the cell libraries and their characterizations for the target operating parameters (temperature/voltage). It is literally the foundation upon which a digital design is built. By nature, PDKs are foundry/process specific: each is tailored to a particular foundry and process, and is typically designed by, or in close cooperation with, said foundry.
Typically, getting access to a PDK requires at least signing an NDA with a foundry and isn't accessible to just anyone, so when Google partnered with the US-based SkyWater fab to release an open-source PDK for their 130nm process, it was very significant. Google has additionally partnered with Global Foundries and released the gf180 PDK for their 180nm MCU process. These open-source PDKs have truly changed what is possible for open-source silicon.
For this tape-out, I will be targeting the sky130A process. This is a classic CMOS process, distinguished from the sky130B process by its lack of the latter's magnetic RAM. More specifically, I will be using the high-density cell library hd, whose typical site occupies 0.46 x 2.72 µm, equivalent to 9 met1 tracks. As for the target typical operating parameters, they are a temperature of 25C at a voltage of 3.3V.
Tiny Tapeout # Tiny Tapeout is an open shuttle program where multiple projects are pulled together and taped out as part of the same chip. Participant projects are hardened as macro blocks, the size of which scales with the number of purchased tiles. As such, projects range from the very tiny counter to the much larger SoC.
These projects are multiplexed together behind a shared mux, connected to a shared bus, and use common I/O (with a few exceptions).
Interested in learning more about Tiny Tapeout or when the next shuttle is closing? Check out the official website: https://tinytapeout.com/
Full sky25b shuttle Tiny Tapeout chip render.
On the final chip, users can configure which project they want to enable during operation, and all other designs will be power gated. The final chip is sold as part of a dev board, which contains the Tiny Tapeout chip connected to an RP2040 MCU.
Architecture # As I said in the introduction, something that I find makes hashing algorithms particularly interesting to implement is how the same functionality can be designed so differently given different PPA (power, performance, area) targets. And in this project, there were a few external constraints which, because of their importance, defined the architectural direction.
BLAKE2 algorithm # Readers already intimately familiar with the BLAKE2 hashing function internals can skip this section. Before going deeper into our discussion of the architecture, I would like to take a moment to delve into the BLAKE2 algorithm itself, to gain a better understanding of how our environment has shaped our architectural decisions.
As stated in the introduction, BLAKE2 is a hashing function. Its objective is to take a large amount of data and produce a high-entropy, reproducible, shorter digest of this data, called a hash. This hash can then be compared against an expected hash result to identify whether the data has been corrupted or tampered with.
Internally, the BLAKE2 hashing function breaks the incoming data down into blocks (b) and passes each block through a compression function, during which the data goes through a few rounds of a mixing function.
During this, a per-block internal state (v) is computed, and the block result (h) is then derived from this internal state. The final hash result is derived from h and returned on the last block.
The size of the data, as well as the number of mixing rounds, varies with the BLAKE2 variant:
A block (b) is 64 bytes for BLAKE2s and 128 bytes for BLAKE2b
The mixing function has 10 rounds for BLAKE2s and 12 rounds for BLAKE2b
The mixing function internal state (v) is 64 bytes for BLAKE2s and 128 bytes for BLAKE2b
The block result (h) is 32 bytes for BLAKE2s and 64 bytes for BLAKE2b
We can clearly see not only how BLAKE2b needs twice the storage of BLAKE2s, but also how large the memory footprint of BLAKE2 is, regardless of the variant.
Pseudo code # Pseudo-code for the BLAKE2 function as per the RFC 7693 spec; we will refer back to it later in the article.
Mixing function G. R1, R2, R3, R4 and w are the constants given per BLAKE2 version.

FUNCTION G( v[0..15], a, b, c, d, x, y )
|
|   v[a] := (v[a] + v[b] + x) mod 2**w
|   v[d] := (v[d] ^ v[a]) >>> R1
|   v[c] := (v[c] + v[d]) mod 2**w
|   v[b] := (v[b] ^ v[c]) >>> R2
|   v[a] := (v[a] + v[b] + y) mod 2**w
|   v[d] := (v[d] ^ v[a]) >>> R3
|   v[c] := (v[c] + v[d]) mod 2**w
|   v[b] := (v[b] ^ v[c]) >>> R4
|
|   RETURN v[0..15]
|
END FUNCTION.

Compression function F:

FUNCTION F( h[0..7], m[0..15], t, f )
|
|   // Initialize local work vector v[0..15]
|   v[0..7]  := h[0..7]   // First half from state.
|   v[8..15] := IV[0..7]  // Second half from IV.
|
|   v[12] := v[12] ^ (t mod 2**w)  // Low word of the offset.
|   v[13] := v[13] ^ (t >> w)      // High word.
|
|   IF f = TRUE THEN               // last block flag?
|   |   v[14] := v[14] ^ 0xFF..FF  // Invert all bits.
|   END IF.
|
|   // Cryptographic mixing
|   FOR i = 0 TO r - 1 DO          // Ten or twelve rounds.
|   |
|   |   // Message word selection permutation for this round.
|   |   s[0..15] := SIGMA[i mod 10][0..15]
|   |
|   |   v := G( v, 0, 4,  8, 12, m[s[ 0]], m[s[ 1]] )
|   |   v := G( v, 1, 5,  9, 13, m[s[ 2]], m[s[ 3]] )
|   |   v := G( v, 2, 6, 10, 14, m[s[ 4]], m[s[ 5]] )
|   |   v := G( v, 3, 7, 11, 15, m[s[ 6]], m[s[ 7]] )
|   |
|   |   v := G( v, 0, 5, 10, 15, m[s[ 8]], m[s[ 9]] )
|   |   v := G( v, 1, 6, 11, 12, m[s[10]], m[s[11]] )
|   |   v := G( v, 2, 7,  8, 13, m[s[12]], m[s[13]] )
|   |   v := G( v, 3, 4,  9, 14, m[s[14]], m[s[15]] )
|   |
|   END FOR
|
|   FOR i = 0 TO 7 DO              // XOR the two halves.
|   |   h[i] := h[i] ^ v[i] ^ v[i + 8]
|   END FOR.
|
|   RETURN h[0..7]                 // New state.
|
END FUNCTION.

Main function:

FUNCTION BLAKE2( d[0..dd-1], ll, kk, nn )
|
|   h[0..7] := IV[0..7]            // Initialization Vector.
|
|   // Parameter block p[0]
|   h[0] := h[0] ^ 0x01010000 ^ (kk << 8) ^ nn
|
|   // Process padded key and data blocks
|   IF dd > 1 THEN
|   |   FOR i = 0 TO dd - 2 DO
|   |   |   h := F( h, d[i], (i + 1) * bb, FALSE )
|   |   END FOR.
|   END IF.
|
|   // Final block.
|   IF kk = 0 THEN
|   |   h := F( h, d[dd - 1], ll, TRUE )
|   ELSE
|   |   h := F( h, d[dd - 1], ll + bb, TRUE )
|   END IF.
|
|   RETURN first "nn" bytes from little-endian word array h[].
|
END FUNCTION.

Reality's constraints # Area # A single sky130 Tiny Tapeout project tile is, as the name foreshadows, tiny… About 161 x 111.52 µm, or just about large enough to realistically fit 256 bits' worth of flip-flops on a good day.
Projects come in a set of varying sizes, from the smallest and most common single-tile project all the way to the 24-tile behemoth. But size isn't the only thing that scales: cost scales too, with a single tile costing 70 euros.
Projects thus range from 70 to 2240 euros.
So, just like in the semiconductor industry, I have one of the best motivations to optimize for area: money.
Unfortunately, given the memory footprint intrinsic to the BLAKE2 algorithm, I had not chosen the most favorable project for the circumstances.
Storage # At a minimum, I need to store:
The current block: 64–128 bytes (b)
The internal state used by the mixing function: 64–128 bytes (v)
The current block result: 32–64 bytes (h)
For context, a single tile can fit 440 bits, or 55 bytes, of D-Flip-Flops at most, and that is if you optimize solely for DFF count and push routing to the absolute limit. This is because D-Flip-Flop cells, each a combination of two latches, are among the largest standard cells in the library.
To illustrate this difference in area, here is the sky130_fd_sc_hd__and2_1 cell, a 2-input AND gate with a weak driver: sky130_fd_sc_hd__and2_1
And here is the sky130_fd_sc_hd__dfxbp_1 cell, a standard complementary-output D-Flip-Flop with the same weak output drive strength. sky130_fd_sc_hd__dfxbp_1
This flip-flop occupies five times the area of a simple AND gate. Yet here we are, looking to store, at the strict and very optimistic minimum, 3 to 6 tiles' worth of flip-flops just to get this project off the ground.
But when it comes to on-chip storage, I have two major issues:
There is currently no proven open-source SRAM for sky130. Although an experimental SRAM macro was submitted alongside this design on this shuttle, given that it is unproven, relying on it would have risked losing the chip to a bug.
D-Flip-Flop cells, now my primary source of storage, each take up a lot of area.
In a perfect universe, given the relatively massive amount of storage and the access pattern, storing data in SRAM would have been ideal. In practice, using flip-flops was my only feasible path.
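The tile arithmetic above can be checked in a few lines of C (the 55 bytes/tile figure is the optimistic DFF-density estimate from this section; the helper below is purely illustrative and not part of the project):

```c
// Ceiling division: tiles needed to hold `state_bytes` of pure DFF storage,
// given an optimistic `bytes_per_tile` flip-flop density (here, 55 B/tile).
static int min_tiles(int state_bytes, int bytes_per_tile) {
    return (state_bytes + bytes_per_tile - 1) / bytes_per_tile;
}

// Minimum persistent state per variant: block (b) + mixing state (v)
// + block result (h), in bytes.
enum {
    BLAKE2S_STATE_BYTES = 64 + 64 + 32,   // 160 bytes
    BLAKE2B_STATE_BYTES = 128 + 128 + 64, // 320 bytes
};
```

min_tiles(160, 55) gives 3 and min_tiles(320, 55) gives 6: the "3 to 6 tiles" floor quoted above, before counting a single gate of actual logic.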
Given that the BLAKE2b variant requires twice the on-chip storage compared to the BLAKE2s version, this area constraint naturally guided the choice to implement the BLAKE2s variant.
I/O Bottleneck # The objective was to tape out this design as part of the sky25b Tiny Tapeout shuttle. As such, this design will be integrated as a pre-hardened macro block into the larger chip. Like most blocks, it will communicate with the pins through a shared mux and will not own any pins of its own.
Tiny Tapeout chip (I recommend switching the page to light mode for better readability of the pin names)
In addition to a reset and clock signal, each block is given access to the following I/O:
8 input pins
8 output pins
8 configurable input/output pins
Although there is an option to purchase extra design-exclusive pins, these cost an additional 100 euros per pin and are limited in number. Since these are the shuttle's shared I/O, the design must respect whatever operating characteristics this shared I/O access imposes. This is where two additional limitations arise:
Firstly, the GPIO cell has a characterized maximum operating frequency of 66 MHz when operating above 2.2 V (we are operating at 3.3 V). In practice, this caps the maximum frequency at which we can operate the parallel bus between our MCU and our design over these pins.
Secondly, because of a weak driver on the output path (design to MCU) leading to a slow slew rate, the maximum reliable output frequency is capped at 33 MHz. This means that any transitions sent over the parallel bus must not exceed a transition frequency of 33 MHz, or signals risk being captured at the wrong level.
Putting it Together # Given that additional metadata must be transmitted alongside the input data (and setting aside a few bits for handshaking with the MCU), we are left with only 8 bits in both the input and output directions for data transfer.
Because each hashing round of the mixing function is performed on an entire block's data, we must accumulate all 64 bytes of the next block before computation can begin.
This article is focused on explaining the rationale behind the design decisions rather than being a datasheet for the design itself. So, just for context, here is the final accelerator's pinout:

| ui (Inputs) | uo (Outputs) | uio (Bidirectional) |
|---|---|---|
| ui[0] = data_i[0] | uo[0] = hash_o[0] | uio[0] = valid_i[0] |
| ui[1] = data_i[1] | uo[1] = hash_o[1] | uio[1] = cmd_i[0] |
| ui[2] = data_i[2] | uo[2] = hash_o[2] | uio[2] = cmd_i[1] |
| ui[3] = data_i[3] | uo[3] = hash_o[3] | uio[3] = ready_o |
| ui[4] = data_i[4] | uo[4] = hash_o[4] | uio[4] = output_mode_i[0] |
| ui[5] = data_i[5] | uo[5] = hash_o[5] | uio[5] = output_mode_i[1] |
| ui[6] = data_i[6] | uo[6] = hash_o[6] | uio[6] = |
| ui[7] = data_i[7] | uo[7] = hash_o[7] | uio[7] = hash_valid_o[7] |

Additionally, due to area limitations, we cannot afford the extra 64 bytes of memory required to pipeline this accumulation in parallel with the previous block's computation. As such, the computation is stalled while the next block's data is being received.
To comply with the GPIO operating frequency limit, we could have operated the accelerator at 33 MHz to align the input bus frequency with the output frequency. However, because of the bottleneck imposed by the significant number of cycles required to transfer the next block of data, maintaining the highest possible input bus frequency was paramount. Consequently, I decided to operate both the input and output buses at 66 MHz and compensate for the slow slew rate by holding data on the output interface for two clock cycles.
While it would have been technically possible to introduce a faster internal clock domain to the accelerator and add a clock domain crossing (CDC) between the internal logic and the I/O interfaces, it would have yielded little improvement.
Since a faster internal clock would only accelerate the computation phase, and the design is primarily bottlenecked by the time spent waiting for input block data, the overall performance gain would be negligible.
Design # Given these external constraints, the design direction was clear: a BLAKE2s implementation using on-chip D-Flip-Flops for storage, optimizing for area at a target operating frequency of 66 MHz.
The design described in the following sections can be found on my github repository
Hash Configuration # During each hash function run, the internal state (h) is calculated based on a set of per-run configuration parameters, either during the initialization of the first block or during the processing of the last block. These values are:
kk: Key length in bytes
nn: Resulting hash length in bytes
ll: Total length of the data to be hashed
In the software implementation, these are passed to the BLAKE2s function as arguments. In our design, since we cannot derive these from the input data, we need a way to acquire these configuration parameters before the hash computation begins.
As such, a configuration parameter transfer packet exists alongside the block data transfer packet. This packet is sent over the same 8-bit data input interface and is differentiated from data transfers by the state of the data mode pins.
The packet follows a little-endian layout: when sending multi-byte arrays, the lower indices are sent first.
Hash configuration packet layout
The byte_size_config module (located under the io_intf module) identifies these configuration packets, then parses and latches them. It outputs the most recent values of kk, nn, and ll directly to the main hashing modules. Consequently, the same configuration can be reused across multiple runs of the accelerator.
link to code: byte_size_config
Block Data Buffering # A block of data needs to be streamed over 64 cycles from the MCU to the ASIC.
As with the config parameters, there is a module in the design dedicated to tracking this data arrival. However, because of the size of the buffer needed to hold this data, which translates directly into area occupied by DFFs, the block isn't stored in this module (and its contents aren't held stable over an interface destined for the main hashing module). Rather, this module contains only the logic needed to identify when data is being received, the data offset, and a flopped copy of the most recent byte. On the main hashing module side, these signals are received and used to append the next incoming data byte to the block buffer.
This split keeps only the strictly necessary block-data reception logic in the main hashing module (blake2s), while putting the majority of the block data streaming logic in the block_data module, also under io_intf.
Now, earlier I said that a byte index was included in the signals sent from block_data to the main hashing module blake2s.
Contrary to what intuition might suggest, this byte index isn't used to address the buffer on writes. The reason is that this would require 1-to-64, 8-bit-wide demux logic, which would be quite expensive in terms of area. Rather, we use a less expensive shift buffer for writes, which also happens to fit this application perfectly, given that bytes are streamed to the ASIC in order.
If readers are interested in learning more about the area impact of different memory storage structures on the sky130 node, Toivoh did a great study that can be found here: https://github.com/toivoh/tt08-on-chip-memory-test
The byte index is actually used to keep track of the completion of the block data, so that the main hash module can start computing the hash as soon as the full block has been received.
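As a behavioral sketch of why the shift buffer wins over index-addressed writes, here is a C model (names are illustrative, not taken from the RTL): appending a byte is a whole-buffer shift, so no 1-to-64 write demux is needed, and the byte counter is consulted only to flag completion.

```c
#include <stdint.h>
#include <string.h>

#define BLOCK_BYTES 64

typedef struct {
    uint8_t block[BLOCK_BYTES]; // models the 512-bit DFF block buffer
    int     count;              // models the byte-index counter
} block_buf_t;

// One "clock cycle" with a valid input byte: shift-append, no demux.
// Every stored byte moves down one slot; the new byte enters at the top.
static void block_buf_push(block_buf_t *b, uint8_t byte) {
    memmove(&b->block[0], &b->block[1], BLOCK_BYTES - 1);
    b->block[BLOCK_BYTES - 1] = byte;
    b->count++;
}

// The counter is used only to detect completion, never to address writes.
static int block_buf_full(const block_buf_t *b) {
    return b->count == BLOCK_BYTES;
}
```

Because bytes arrive strictly in order, after 64 pushes the first byte sent has shifted all the way down to block[0], i.e. the buffer ends up in normal positional order with zero addressing logic.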
link to code: block_data
Mixing Function # The mixing function is at the heart of the hashing function and is also the critical path in the main loop of the hashing logic. It is called 8 times per round, using different data as input, for 10 rounds.

|   |   // Message word selection permutation for this round.
|   |   s[0..15] := SIGMA[i mod 10][0..15]
|   |
|   |   v := G( v, 0, 4,  8, 12, m[s[ 0]], m[s[ 1]] ) // 0
|   |   v := G( v, 1, 5,  9, 13, m[s[ 2]], m[s[ 3]] ) // 1
|   |   v := G( v, 2, 6, 10, 14, m[s[ 4]], m[s[ 5]] ) // 2
|   |   v := G( v, 3, 7, 11, 15, m[s[ 6]], m[s[ 7]] ) // 3
|   |
|   |   v := G( v, 0, 5, 10, 15, m[s[ 8]], m[s[ 9]] ) // 4
|   |   v := G( v, 1, 6, 11, 12, m[s[10]], m[s[11]] ) // 5
|   |   v := G( v, 2, 7,  8, 13, m[s[12]], m[s[13]] ) // 6
|   |   v := G( v, 3, 4,  9, 14, m[s[14]], m[s[15]] ) // 7

FUNCTION G( v[0..15], a, b, c, d, x, y )
|
|   v[a] := (v[a] + v[b] + x) mod 2**w
|   v[d] := (v[d] ^ v[a]) >>> R1
|   v[c] := (v[c] + v[d]) mod 2**w
|   v[b] := (v[b] ^ v[c]) >>> R2
|   v[a] := (v[a] + v[b] + y) mod 2**w
|   v[d] := (v[d] ^ v[a]) >>> R3
|   v[c] := (v[c] + v[d]) mod 2**w
|   v[b] := (v[b] ^ v[c]) >>> R4
|
|   RETURN v[0..15]
|
END FUNCTION.

In an ideal world, from a performance perspective, I could have parallelized all 8 of these iterations into a single cycle. But this is reality, and there are 3 limitations to this approach:
The design direction of this ASIC is to optimize for area, and the additional logic incurred by this technical decision would be very expensive in terms of area.
There are data dependencies between v array values due to reused indexes.
The full path through a single G function, including the data reads, was 69 logic levels deep at its maximum, much too long to fit in a single 66MHz cycle.
So, at a minimum, we are looking at 2 cycles.
Because of the above, this approach isn't technically feasible, and even if it were, the area expense might not have been worthwhile, as the extra area could be better spent elsewhere, for example, on adding a second block buffer to allow streaming the next block in parallel with hashing the current block.
If you have read until this point, you are either procrastinating or passionate about hardware design; in both cases, welcome.
The initial approach was to split the critical path through the G function into 2 cycles. This imposed a 100% penalty on the hashing function performance (twice the cycles per block of hashing), but only a 55% drop in overall system performance when taking into account the cycles dedicated to block streaming.
link to code: G function main code
This helps solve part of our timing issue, but the path is still quite deep due to the data read section at the start of this path.
As readers can imagine, the data read paths are bloated by a number of large muxes for obtaining v[a], v[b], v[c], v[d], x, and y. There isn't much optimization to be done for x and y, obtained from m[s[<current index>]], given that s varies per round and spans all indexes of m. But there is some optimization we can squeeze out of the v vector accesses.
Looking closer at the code, a, b, c, and d can each take only 4 possible values, depending on the current G function call. By reworking the logic, we can force the instantiation of only 4-way muxing logic, greatly reducing logic cost and depth at the cost of slightly more complex RTL.
link to code: mux rework
Now that we have reduced the logic on our read path, timing is looking good, but the G function still takes 2 cycles, and although we are not optimizing primarily for performance, leaving so much performance on the table is … unfortunate.
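The 4-possible-values observation behind the mux rework is easy to verify mechanically. This small C check (the index table is copied from the spec's eight G calls; the helper name is mine) confirms that each operand position a, b, c, d only ever selects among four v indices:

```c
// (a, b, c, d) arguments of the eight G calls in one BLAKE2 round,
// transcribed straight from the RFC 7693 pseudo-code.
static const int G_IDX[8][4] = {
    {0, 4,  8, 12}, {1, 5,  9, 13}, {2, 6, 10, 14}, {3, 7, 11, 15},
    {0, 5, 10, 15}, {1, 6, 11, 12}, {2, 7,  8, 13}, {3, 4,  9, 14},
};

// Count the distinct v indices used at operand position `pos`
// (0=a, 1=b, 2=c, 3=d) across all eight G calls.
static int distinct_at(int pos) {
    int seen[16] = {0}, n = 0;
    for (int call = 0; call < 8; call++) {
        int idx = G_IDX[call][pos];
        if (!seen[idx]) { seen[idx] = 1; n++; }
    }
    return n;
}
```

Each of the four positions yields exactly 4 distinct indices ({0..3} for a, {4..7} for b, {8..11} for c, {12..15} for d), which is what makes a 4-way mux per operand sufficient.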
Luckily, looking closer at the algorithm, we can spot a few additional critical nuances:
The m vector is not modified during a round, only read, so we can be flexible with the ordering of its reads.
The a, b, c, and d indexes used to read and write the v vector when calling Gn do not overlap with those of Gn+1 for n ∈ {0,1,2,4,5,6}. There is overlap, though, for n=3 (d=15) and n=7 (b=4).
With these observations, we can start pipelining our G function path. By this I mean we can start the next G function instance while the previous instance is still going through its second cycle. Because of the data dependencies noted for instances n=3 and n=7, this pipelining isn't perfect, and a nop cycle is imposed so that the new values of v[d] and v[b] become available for the next data read.
With this optimization, we drastically drop the penalty on the hashing from 100% to 25%, with the overall performance penalty dropping from 55% to only 11%.
Output Streaming # At the very end of the hash computation, the ASIC starts streaming out the hash result. Even though 32 final hash bytes have been computed, only the first nn bytes will be streamed out (dependent on the nn configuration parameter). Due to the slow slew rate on the output GPIO buffer (the electrical equivalent of low blood pressure), a "slow output mode" was designed to hold each data transfer out of the ASIC for 2 cycles rather than one, leaving enough time for the signal to settle on the output pins.
The output streaming logic lives in the main hash module blake2s, but result streaming doesn't block the FSM from starting to wait for the next block to stream in.
link to code: output streaming
Co-designing Firmware # When it comes to designing a master for this hashing accelerator ASIC, my two main options were an FPGA or an MCU.
Designing for an FPGA would have been simpler, given that an FPGA's flexibility, configurability, and timing accuracy make it well-adapted to high-speed custom protocols. The downside of this approach is that it would require users other than myself to have a similar FPGA setup if they wished to interact with my design.
As such, even at the cost of more effort, my objective was to have this design interface with the MCU shipped on the Tiny Tapeout dev board. Fortunately, the chosen chip was none other than the Raspberry Silicon RP2040.
The new lineup of MCU silicon developed by Raspberry Pi includes an unusual hardware block called the PIO (Programmable Input/Output), a tiny co-processor that allows pin reads and writes in a deterministic, cycle-accurate fashion. Without going too far off-topic into the details, this is perfect for designing custom high-speed parallel bus protocols and supports my upper target of 66 MHz.
A future article is planned to go over the PIO in greater detail.
The problem is that the PIO co-processor is tiny and supports only a subset of operations. As such, this custom bus protocol must be designed around the MCU hardware's capabilities.
Bubbles in Data Transfer # As established earlier, the block data is transferred from the MCU to the ASIC 8 bits at a time. Each block is 64 bytes long, so ideally, this transfer would be completed over 64 consecutive cycles. In practice, however, even though continuous transfer is the norm, a corner case exists where the MCU can introduce random gaps in the middle of block streaming.
To understand why these empty cycles occur, we need to take a step back and understand how data accesses work in the PIO. Each PIO is divided into four State Machines (SMs) that can each independently run a PIO program. Our block streaming program runs on one such SM.
Internally, each SM is equipped with very little storage; memory being expensive in terms of area is a universal struggle.
Each SM can be configured with up to 8x32 bits of FIFO storage, with up to 32 aligned bits accessible each cycle, and a 32-bit wide output buffer. To write to a pin, we must first pull a full 32-bit FIFO entry into the output buffer, and only then can we write the contents of the output buffer to the pins. The PIO hardware allows us to configure the FIFO so that when the output buffer has been read down to a trigger point, it automatically pulls the next available entry from the FIFO. Issues arise when the FIFO becomes empty and needs to be refilled.
Output Shift Register (OSR), source: RP2040 Datasheet
The SM can only write to consecutive pins. To simplify wiring during emulation, this resulted in a 32-bit wide write, meaning a full FIFO entry was consumed per write. When using the Tiny Tapeout dev board, this can be optimized down to a 16-bit wide write, consuming a FIFO entry only every two writes.
This SM program writes to the parallel bus operating at 66 MHz. The RP2040 features a dual-core Cortex-M0+, and programs running on it are, to put it delicately, "slower than hardware." I cannot rely on a software program to poll the FIFO state and refill it as soon as an entry is free, as software runs orders of magnitude too slowly.
Luckily, the RP2040 has a DMA (Direct Memory Access) system to which I can offload this exact task: given an array in memory, it starts a 32-bit wide transfer from the next unread section of memory to the SM FIFO. This DMA is designed to keep up with the PIO SM speed.
However, in very rare corner cases, since the DMA is a shared component using shared internal buses, it might be stalled for a few cycles, resulting in bubbles being introduced during block streaming.\nBecause of this, both the parallel bus protocol and the ASIC must accommodate empty cycles during data streaming, both for block data and, for the same reason, for the configuration parameter packets.\nHandshaking for Block Streaming # This ASIC can hash multiple blocks consecutively. However, since the internal block storage is reused for both streaming data in and performing the hashing operation, the design does not support streaming a new block while hashing is in progress. Having the MCU keep track of when the ASIC is ready to receive the next block would be too imprecise and would leave too much performance on the table, as we would have to overshoot the timing estimate to avoid data corruption.\nTo solve this, the parallel bus protocol includes a data_ready_o signal that indicates to the MCU that the hardware is ready to accept a new block. As soon as the SM (State Machine) detects this signal, it knows it is safe to start streaming the next block.\nWith this implementation, the SM data write program sequence becomes:\nWait for data_ready_o to be asserted.\nWait for data to be available in the FIFO.\nSynchronize with the clock.\nStream out data until the block transfer is complete.\nRepeat.\nlink to code: data write pio program\nResult Valid # The PIO co-processor instruction set is small and simple, featuring only nine different instructions. Notably, there is no single \u0026ldquo;read on signal asserted\u0026rdquo; instruction. The \u0026ldquo;wait for signal\u0026rdquo; and the \u0026ldquo;read from pin\u0026rdquo; instructions are separate, and since each instruction takes one cycle to execute, we cannot assert the result valid signal at the exact moment we start the data transfer. 
We must do so at least one cycle beforehand.\nAs such, our parallel protocol requires the ASIC to assert the hash_valid_o signal one cycle before data starts flowing. This gives the state machine program enough time to identify the event and transition to the data capture routine. In a way, this is the digital equivalent of yelling \u0026ldquo;INCOMING!\u0026rdquo; before throwing your precious hash results overboard.\nlink to code: hash read pio program\nResult to Memory # The hash results are of variable length, with a maximum length of 32 bytes. Receiving and reading the incoming hash result from the ASIC to the MCU memory is handled by a separate PIO SM (State Machine). This SM has its own 8×32 bits deep RX FIFO in which incoming data can be buffered. However, this RX FIFO is, yet again, too small to capture the full hash on its own, even if we only read 8 bits (from 8 pins) per cycle as we do on the emulator.\nInput Shift Register (ISR), source: RP2040 Datasheet Unfortunately, because of how the final Tiny Tapeout dev board MCU is wired, with ui_in[0-3] placed in the middle of the uo_out pins, we face a mapping challenge. Since the PIO hardware can only read and write consecutive pins, and we want our data to be aligned on 32-bit boundaries, the final version will require reading 16 bits each cycle to capture the necessary 8 bits. RP2040 wiring on the Tiny Tapeout dev board. Here again, the Cortex-M0+ cores are too slow for our use case, so we rely on the DMA to copy the contents of the full FIFO entries to memory. Unlike the data streaming example, where we could tolerate gaps, here we must guarantee that no bubbles are introduced in the read sequence. 
In this context, bubbles would occur if the RX FIFO and the input buffer were full, causing the pin read to stall until space became available, and dropping all incoming bytes in the meantime.\nThe RX FIFO could reach capacity if our RX to memory DMA transfer were stalled by another peripheral using the shared internal hardware resources, such as a concurrent block data DMA transfer to a different state machine on the same PIO.\nTo prevent this scenario, we configure the DMA engine to give the highest priority to our transfer.\nlink to code: set DMA high priority\nImplementation # As this article is already getting quite long (congratulations for sticking around), I won\u0026rsquo;t delve too deeply into the implementation issues encountered while iterating on this design. I will simply say that implementation runs were performed regularly in parallel with the design work to quickly identify and refine timing while keeping area utilization in check.\nASIC implementation render When it comes to violations, antenna issues were the primary challenge because this ASIC targets Sky130A and the chosen layout was a very long 4×2 tile rectangle. Some of these wires were so long they practically had multiple postal codes. This is also why, when looking at the cell count table, readers can see the scars of battle in the form of the design being covered in diodes. These were used to offer discharge paths for charges accumulated during fabrication:\nCell type Count Area\nFill cell 6073 30445.45\nTap cell 2158 2700.09\nDiode cell 4031 10087.17\nBuffer 66 251.49\nClock buffer 91 1238.69\nTiming Repair Buffer 1892 16809.87\nInverter 89 380.36\nClock inverter 28 341.58\nSequential cell 1657 33324.46\nMulti-Input combinational cell 5638 53603.91\nTotal 21723 149183.08\nThe heuristic addition of diodes by the tools helped fix many of the major issues, but there were still a few remaining. 
Since antenna violations only translate to a probabilistic increase in defect rates, leaving a few minor violations should be unnoticeable. However, for the few major violations remaining, I manually added buffer cells to these paths to break them apart:\n// manually inserting buffers for fixing implementation\ngenvar buf_idx;\ngenerate\nfor(buf_idx = 0; buf_idx \u0026lt; W; buf_idx=buf_idx+1) begin: g_buffer\n`ifdef SCL_sky130_fd_sc_hd\n/* verilator lint_off PINMISSING */\nsky130_fd_sc_hd__buf_2 m_c_buf( .A(g_c_buf[buf_idx]), .X(g_c[buf_idx]));\nsky130_fd_sc_hd__buf_2 m_y_buf( .A(g_y_buf[buf_idx]), .X(g_y[buf_idx]));\n/* verilator lint_on PINMISSING */\n`else\nassign g_c[buf_idx] = g_c_buf[buf_idx];\nassign g_y[buf_idx] = g_y_buf[buf_idx];\n`endif\nend\nendgenerate\nA few minor violations remained (P/R: 2.65, 1.26, and 1.02); these should be acceptable and are unlikely to cause any noticeable yield degradation, so I decided to let them through. Apart from these antenna violations, there are no other known issues in this implementation.\nThe ASIC has now been taped out and is in fabrication as part of the Tiny Tapeout sky25b shuttle, with an estimated delivery date of June 30, 2026.\nConclusion # This project has been a game changer for me.\nMy previous RTL engineering experiences, on CPUs at Arm and on FPGAs at Optiver, were mostly focused on design and verification, and only gave me a restricted picture of what an end-to-end chip tapeout really involves.\nWe live in an interesting period of history, where it is realistic for someone to fully design and tape out their own ASIC, as long as they have enough motivation to learn and sanity to spare. It is rather unprecedented, and I am not sure if it will last, as the actual business model of such manufacturing processes is still to be established.\nBut in the meantime, it represents a gigantic learning opportunity for designers, and I can’t stress that enough. 
For someone like me who hugely values technical skills, open-source PDKs are a gold mine. In a couple of months, armed with patience, a couple of all-nighters, and a megaton of coffee, a homemade tapeout went from unrealistic dream to reality.\nEven if the accelerator itself was “relatively” simple (a sort of blinky project on steroids, one might say), the real difficulty was in correctly understanding the entire ASIC design flow, the many tools and steps involved, and the new constraints imposed by the fact that at the end, it had to run on actual silicon, and work on the first try.\nIt did cost a lot of time and energy, but it was definitely worth it, and I would recommend that any interested designer give it a try, given how much they will learn from it.\nCelebration of the tapeout submission. ","date":"30 December 2025","externalUrl":null,"permalink":"/projects/blake2s_hashing_accelerator_a_solo_tapeout_journey/","section":"Other projects","summary":"BLAKE2s ASIC implementation targeting the SKY130A process, taped out with Tiny Tapeout.","title":"BLAKE2s Hashing Accelerator: A Solo Tapeout Journey","type":"projects"},{"content":"","date":"30 December 2025","externalUrl":null,"permalink":"/tags/cryptography/","section":"Tags","summary":"","title":"Cryptography","type":"tags"},{"content":"","date":"30 December 2025","externalUrl":null,"permalink":"/tags/sky130/","section":"Tags","summary":"","title":"Sky130","type":"tags"},{"content":"Intel is dying, and it is deeply troubling.\nI have quite literally worked toward their destruction, designing CPU cores for the competition. 
I also firmly believe the x86 memory model will not scale to massively multicore systems as efficiently as the Arm architecture.\nYet, manufacturing is what killed Intel.\nThe Fighting Temeraire, tugged to her last Berth to be broken up, JMW Turner (1838) Our society has forgotten death, for humans and for entities.\nBy virtue of not paying attention to our own history, structures around us feel immortal. We look at giants and assume they will always stand tall, forgetting how their feet are made of clay.\nWhen giants fall, they fall slowly. This is another fallacy of ours. Thinking that the giant will collapse in a clap of thunder, shaking the earth, into piles of dust for all to see. But in reality, it is a slow lowering to the earth as the structure sinks into the ground, the giant now nowhere to be seen on the horizon.\nAnd while the giant crumbles, we will go about our days. And when the giant is gone, we will have forgotten how it once stood above the horizon.\n","date":"16 December 2025","externalUrl":null,"permalink":"/thoughts/forgetting_mortality/","section":"Thoughts","summary":"How giants vanish in silence.","title":"Reflection on mortality","type":"thoughts"},{"content":"","date":"21 January 2024","externalUrl":null,"permalink":"/tags/network/","section":"Tags","summary":"","title":"Network","type":"tags"},{"content":"","date":"21 January 2024","externalUrl":null,"permalink":"/tags/switch/","section":"Tags","summary":"","title":"Switch","type":"tags"},{"content":" Introduction # This article serves as a quick follow-up to the previous post on Troubleshooting the Redstone D2020. Its goal is to provide a straightforward solution for adjusting the fan speed of the switch.\nDealing with a Noisy Switch # If your dream is to live in a server room, then the Celestica Redstone D2020, available for around $300 on eBay, might be the right choice for you. 
However, for the rest of us who prefer a quieter environment, finding a way to reduce the fan noise is crucial for making the most out of this purchase.\nCelestica Redstone D2020 Solution # The purpose of this script is to make the switch quieter by adjusting the fan speed. This fan speed can be modified by rewriting its PWM value in the associated Linux sysfs file.\nBy default, I connect to the switch via telnet and log in to a custom CLI rather than a Linux shell.\nPrivilege escalation using the CLI to get the root Linux shell. As a result, the first step is to escalate from the CLI to the Linux shell.\nTo automate the interaction with the telnet text interface, I use Expect, a Tcl-based automation tool. The Tcl portion of the script is named silence.expect and is invoked by the bash entry point called silence.sh.\nEssenceia/switch_scripts Switch utilities scripts for the D2020. Shell 0 0 silence.sh # This Bash script begins by sourcing configurations from the config.sh file, which includes the switch\u0026rsquo;s hostname, username/password, and the target PWM. The config.sh file is not included in the repository by default, but users can find an example file called config_example.sh and rename it to config.sh once modified.\nNext, it invokes the Tcl script with these configurations.\n#!/bin/bash # configuration file CONFIG=config.sh # Source username and password if [ ! 
-f $CONFIG ]; then echo \u0026#34;File $CONFIG not found.\u0026#34; exit 1 fi source $CONFIG printf \u0026#34;sourcing configuration file, set variables :\\nhostname:$HOST\\nuser:$USER\\npassword:$PW\\ntarget pwm:$PWM\\n\u0026#34; # Call expect on switch and shut it up expect silence.expect $HOST $PWM $USER $PW silence.expect # This Tcl script manages the connection to the switch via telnet, handles privilege escalation from the CLI to the Linux shell, and ultimately writes the new fan PWM values.\n#!/usr/bin/expect set timeout 20 set hostName [lindex $argv 0] set pwm [lindex $argv 1] set userName [lindex $argv 2] set password [lindex $argv 3] spawn telnet $hostName expect \u0026#34;Trying $hostName...\u0026#34; expect \u0026#34;Connected to $hostName.\u0026#34; expect \u0026#34;Escape character is \u0026#39;^]\u0026#39;.\u0026#34; expect \u0026#34;\u0026#34; expect \u0026#34;User:\u0026#34; send \u0026#34;$userName\\r\u0026#34; expect \u0026#34;Password:\u0026#34; send \u0026#34;$password\\r\u0026#34;; send \u0026#34;enable\\r\u0026#34; send \u0026#34;linuxsh\\r\u0026#34; expect \u0026#34;#\u0026#34; send \u0026#34;echo $pwm \u0026gt; /sys/class/thermal/manual_pwm\\r\u0026#34; send \u0026#34;exit\\r\u0026#34; expect \u0026#34;Connection closed by foreign host.\u0026#34; send \u0026#34;quit\\r\u0026#34; send \u0026#34;quit\\r\u0026#34; interact Additional Documentation # For further information on the Expect package, refer to Expect.\n","date":"21 January 2024","externalUrl":null,"permalink":"/projects/d2020_p2/","section":"Other projects","summary":"No documentation, no problem - part 2","title":"Troubleshooting the redstone D2020, part 2","type":"projects"},{"content":"","date":"13 November 2023","externalUrl":null,"permalink":"/tags/hft/","section":"Tags","summary":"","title":"HFT","type":"tags"},{"content":"","date":"13 November 2023","externalUrl":null,"permalink":"/tags/itch/","section":"Tags","summary":"","title":"ITCH","type":"tags"},{"content":"","date":"13 
November 2023","externalUrl":null,"permalink":"/tags/iverilog/","section":"Tags","summary":"","title":"Iverilog","type":"tags"},{"content":"","date":"13 November 2023","externalUrl":null,"permalink":"/tags/moldudp64/","section":"Tags","summary":"","title":"Moldudp64","type":"tags"},{"content":" Introduction # In this post, I will be going over the testing for the MoldUDP64 and ITCH modules. This article is best read after having read the post on designing the MoldUDP64 module and ITCH module.\nAlthough both the ITCH and MoldUDP64 modules have their own small SystemVerilog test benches, these test benches are relatively simple and do not offer extensive testing coverage. Additionally, they are limited to a single block, and experience has taught me that a lot can go wrong when connecting different modules. The goal of this top-level test bench is to provide more comprehensive testing for each block and test the system as a whole.\nThe goal when designing this top-level test bench was to simulate the driving signals our hardware would encounter if it were connected directly to the NASDAQ data feed. To closely match real-world behavior within the constraints of my nonexistent budget, our test bench will recreate data feed packets based on NASDAQ-provided logs containing ITCH messages captured at the exchange.\nNASDAQ generously provides these files free of charge, and they can be found here.\nEssenceia/Nasdaq-HFT-FPGA RTL design for a nasdaq compatible high frequency trading low level. Supports itch on moldudp64. C 86 27 Architecture # The NASDAQ data feed directly updates participants on changes in the status via ITCH messages. These messages are delivered by MoldUDP64 packets. A single MoldUDP64 packet can contain multiple ITCH messages.\nThe objective of our testbench will be to read this dump file and extract the ITCH messages. 
We will then recreate a MoldUDP64 packet containing multiple of these messages, break it down into smaller payloads sized to match the UDP -\u0026gt; MoldUDP64 data bus width, and feed them through the simulator.\nIn parallel, the testbench will decode the ITCH message and produce the next expected ITCH module decoder output. This expected outcome will be compared to the actual ITCH module output, completing the self-testing loop.\nTestbench simulation wave, displaying the decoding of an ITCH system event message. We can observe two different ITCH message interfaces: the itch_* signals are driven by the ITCH RTL module, while the tb_itch_* signals contain the expected ITCH decoded values and are driven by our testbench. The values of these two groups are compared during the checking process. NASDAQ ITCH file # During the testbench initialization, the user will provide the path to the NASDAQ ITCH dump file.\nThis NASDAQ-provided dump file contains multiple ITCH messages stored using the NASDAQ-specified BinaryFILE format.\nBinaryFILE format. Each payload contains a single ITCH message in its raw binary format.\nSince reading these files and decoding the ITCH messages are operations I perform in other contexts, the code for these operations is in my TotalView-ITCH 5.0 library.\nEssenceia/TotalView_ITCH_5.0_C_lib Small C library to handle itch messages C 2 4 Fake MoldUDP64 packet # Each of our MoldUDP64 packets contains a random number of ITCH messages extracted from the dump file. For each ITCH message that is added to the packet, we push the expected decoded ITCH module output to a FIFO to be dequeued later during checking.\nMoldUDP64 packet format, containing multiple ITCH messages. 
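For reference, the flattening step can be sketched in C, assuming the standard MoldUDP64 downstream header layout (10-byte session, 64-bit big-endian sequence number, 16-bit big-endian message count, then a 16-bit big-endian length prefix per message). `mold_flatten` and `put_be` are hypothetical helpers, not the testbench's actual code.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Write v as an nbytes-wide big-endian integer; returns bytes written. */
static size_t put_be(uint8_t *dst, uint64_t v, int nbytes)
{
    for (int i = 0; i < nbytes; i++)
        dst[i] = (uint8_t)(v >> (8 * (nbytes - 1 - i)));
    return (size_t)nbytes;
}

/* Flatten a MoldUDP64 downstream packet into its wire format.
 * Hypothetical sketch following the public MoldUDP64 layout. */
static size_t mold_flatten(uint8_t *out, const char session[10],
                           uint64_t seq, int msg_cnt,
                           const uint8_t *const *msgs, const uint16_t *lens)
{
    size_t off = 0;
    memcpy(out + off, session, 10); off += 10;       /* session id      */
    off += put_be(out + off, seq, 8);                /* sequence number */
    off += put_be(out + off, (uint64_t)msg_cnt, 2);  /* message count   */
    for (int i = 0; i < msg_cnt; i++) {              /* message blocks  */
        off += put_be(out + off, lens[i], 2);        /* 2-byte length   */
        memcpy(out + off, msgs[i], lens[i]); off += lens[i];
    }
    return off;
}
```

A packet carrying one 12-byte ITCH message thus flattens to 10 + 8 + 2 + (2 + 12) = 34 bytes, which the testbench then slices into bus-width payloads.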
Once this new MoldUDP64 packet has been created, we then \u0026lsquo;flatten\u0026rsquo; the intermediate C structure representation into its binary format.\nDuring simulation, if the MoldUDP64 signals it is ready to accept a new payload, the testbench will write at most the next 64 bits (or data bus width between the UDP -\u0026gt; MoldUPD64) onto the bus.\nDecoded output checking # Within a few cycles of simulation, the ITCH module will have received a full message and will drive it on its outbound interface.\nAt this moment, we will dequeue the expected ITCH decoded signals we stored earlier when creating the MoldUDP64 packet onto a series of testbench-driven signals that mirror the real ITCH outbound interface.\nWe will then compare the testbench-driven expected outbound signal values with the values obtained from the logic. This checking is performed in the SystemVerilog part of the testbench using a series of assertions.\n`assert_stop( tb_itch_system_event_v == itch_system_event_v_o); `assert_stop( ~tb_itch_system_event_v | tb_itch_system_event_v \u0026amp; tb_itch_system_event_stock_locate == itch_system_event_stock_locate_o); `assert_stop( ~tb_itch_system_event_v | tb_itch_system_event_v \u0026amp; tb_itch_system_event_tracking_number == itch_system_event_tracking_number_o); `assert_stop( ~tb_itch_system_event_v | tb_itch_system_event_v \u0026amp; tb_itch_system_event_timestamp == itch_system_event_timestamp_o); `assert_stop( ~tb_itch_system_event_v | tb_itch_system_event_v \u0026amp; tb_itch_system_event_event_code == itch_system_event_event_code_o); If both the message type and the decoded message fields content match, the check is successful; otherwise, the assertion will trigger.\nLiveness # In our testbench, we are practically always sending valid payloads to the MoldUDP64 module, but during testing, a few bugs caused long periods without a valid ITCH decoded message.\nTo quickly detect these cases and simplify tracking down these bugs, a liveness 
counter was added.\nIt is reset whenever a valid ITCH message is decoded and will trigger an assertion once it reaches 0.\nTest bench # Our top-level testbench is coded using a mix of C and SystemVerilog and runs using the Icarus Verilog (iverilog) simulator. The C code interfaces with the simulator using the Verilog Procedural Interface (VPI).\nUsing this approach is particularly convenient, as it allows me to build much more complex testbenches than I could with SystemVerilog alone and also enables easy reuse of my code base.\nFor illustration, this testbench uses my C TotalView-ITCH 5.0 library for parsing the NASDAQ log file.\nVerilog Procedural Interface (VPI) # In practice, the C code is compiled and linked into a shared object, which is loaded at runtime by the simulator.\nAs this is a shared object, we must compile our .o to be position independent. This is done using the -fpic compile flag:\nFLAGS = -fpic moldudp64.o: moldudp64.c moldudp64.h $(CC) -c moldudp64.c $(FLAGS) To link to the target vpi format I am using the -shared and -lvpi flags :\ntb.vpi: tb.o tb_utils.o tb_itch.o tv.o moldudp64.o axis.o tb_rand.h tb_config.h libitch.a $(LD) -shared -o tb.vpi tb.o tv.o axis.o moldudp64.o tb_utils.o tb_itch.o $(LIB) -lvpi When launching the iverilog simulation I specify the vpi’s name and directory using the -m\u0026lt;vpi_file\u0026gt; and -M \u0026lt;vpi_dir\u0026gt; arguments :\nrun: test vpi vvp -M $(VPI_DIR) -mtb $(BUILD)/hft_tb Please note, the VPI interface is simulator-specific, so the earlier snippets are only applicable with iverilog.\nAs of writing, I have not added support for verilator to this testbench. 
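As a compilable sketch of the registration boilerplate a VPI shared object needs (the testbench exposes its $tb_init, $tb, $tb_itch, and $tb_end entry points this way): in the real tb.vpi this file would #include <vpi_user.h> from iverilog; here, minimal stand-in definitions mirroring the IEEE 1364 VPI types are provided, and vpi_register_systf is stubbed so the sketch builds on its own. The calltf bodies are placeholders.

```c
#include <assert.h>
#include <string.h>

/* Stand-ins mirroring <vpi_user.h>; a real build includes the simulator's
 * header instead of defining these. */
#define vpiSysTask 1
typedef int PLI_INT32;
typedef char PLI_BYTE8;
typedef struct t_vpi_systf_data {
    PLI_INT32 type;
    PLI_INT32 sysfunctype;
    const PLI_BYTE8 *tfname;
    PLI_INT32 (*calltf)(PLI_BYTE8 *);
    PLI_INT32 (*compiletf)(PLI_BYTE8 *);
    PLI_INT32 (*sizetf)(PLI_BYTE8 *);
    PLI_BYTE8 *user_data;
} s_vpi_systf_data;

/* Stub for the simulator's registration entry point. */
static const char *registered[8];
static int reg_cnt;
static void vpi_register_systf(s_vpi_systf_data *d)
{
    registered[reg_cnt++] = d->tfname;
}

/* calltf hooks: the C testbench logic lives behind these (placeholders). */
static PLI_INT32 tb_init_calltf(PLI_BYTE8 *ud) { (void)ud; return 0; }
static PLI_INT32 tb_calltf(PLI_BYTE8 *ud)      { (void)ud; return 0; }
static PLI_INT32 tb_itch_calltf(PLI_BYTE8 *ud) { (void)ud; return 0; }
static PLI_INT32 tb_end_calltf(PLI_BYTE8 *ud)  { (void)ud; return 0; }

/* Register each $tb_* system task with the simulator. */
static void tb_register(void)
{
    static struct { const char *name; PLI_INT32 (*fn)(PLI_BYTE8 *); } tfs[] = {
        { "$tb_init", tb_init_calltf }, { "$tb",     tb_calltf },
        { "$tb_itch", tb_itch_calltf }, { "$tb_end", tb_end_calltf },
    };
    for (unsigned i = 0; i < sizeof tfs / sizeof tfs[0]; i++) {
        s_vpi_systf_data d = {0};
        d.type   = vpiSysTask;
        d.tfname = tfs[i].name;
        d.calltf = tfs[i].fn;
        vpi_register_systf(&d);
    }
}

/* The simulator walks this NULL-terminated table when loading tb.vpi. */
void (*vlog_startup_routines[])(void) = { tb_register, 0 };
```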
If readers want to see an example of a VPI-enabled testbench with dual iverilog and verilator support, I would point you towards my Ethernet Physical Layer testbench.\nsha1: 28a1a6a9fc1a0033b5cac89fde349a17266c27c7\nUpon loading, using VPI, we register four new custom system functions with the simulator.\nEach time these are reached in simulation, the corresponding C code is called. These are $tb_init, $tb, $tb_itch, and $tb_end. They can be called like any other system function. For reference, $random is a system function.\n$tb_init # $tb_init is used to initialize the C part of the testbench and allows the user to provide the path to the NASDAQ ITCH dump file to be used in this simulation.\nIf the file cannot be found, the simulation aborts.\nUsage:\n$tb_init(\u0026#34;\u0026lt;nasdaq_file_path\u0026gt;\u0026#34;); $tb # $tb simulates a 64-bit AXIS bus between a hypothetical UDP block and the MoldUDP64 module.\nUsage:\nlogic axis_ready; logic axis_valid; logic [63:0] axis_data; logic [7:0] axis_keep; logic axis_last; logic tb_finished; $tb(axis_ready, axis_valid, axis_data, axis_keep, axis_last, tb_finished); It reads the axis_ready driven by the MoldUDP64 and correspondingly writes the content of axis_valid, axis_data, axis_keep and axis_last. It also writes out a tb_finished signal to indicate the C testbench has reached the end of the ITCH file and we can safely stop the simulation.\n$tb_itch # $tb_itch drives the expected values on the duplicate ITCH outbound interface. 
Whenever a new ITCH message is decoded these values are compared with the actual output of the ITCH module to check for any differences.\nUsage :\n$tb_itch(\u0026lt;duplicate_itch_outbound_interface\u0026gt;); Conclusion # This top-level testbench allowed me to verify the behavior of the MoldUDP64 and ITCH modules in much more depth than the smaller block-level testbenches ever could.\nIts next upcoming evolution will likely be to add support for verilator for faster iteration time and eventually integrate the MAC, IPv4, and UDP modules.\nRessource # iverilog VPI documentation\nNASDAQ BinaryFILE format specification\n","date":"13 November 2023","externalUrl":null,"permalink":"/projects/top_tb/","section":"Other projects","summary":"Top level testbench for the ITCH and MoldUPD64 modules.","title":"MoldUDP64 and ITCH testbench ","type":"projects"},{"content":"","date":"13 November 2023","externalUrl":null,"permalink":"/tags/simulation/","section":"Tags","summary":"","title":"Simulation","type":"tags"},{"content":"","date":"13 November 2023","externalUrl":null,"permalink":"/tags/testbench/","section":"Tags","summary":"","title":"Testbench","type":"tags"},{"content":"","date":"13 November 2023","externalUrl":null,"permalink":"/tags/testing/","section":"Tags","summary":"","title":"Testing","type":"tags"},{"content":"","date":"13 November 2023","externalUrl":null,"permalink":"/tags/totalview/","section":"Tags","summary":"","title":"TotalView","type":"tags"},{"content":" Introduction # ITCH is a message protocol used in the application layer of financial exchanges that implement the ITCH/OUCH feeds.\nIt is part of the exchange\u0026rsquo;s direct data feed, a low latency feed between the exchange\u0026rsquo;s servers and a client\u0026rsquo;s trading infrastructure.\nNASDAQ ITCH data feed network stack This project is a synthesizable Verilog implementation of an ITCH protocol message parser, used on the client end of the link.\nAlthough multiple exchanges use ITCH in 
their data feeds, the format of these messages varies. In order to reduce the amount of additional work required to add support for new exchanges, the majority of this RTL is procedurally generated.\nBy default, this module supports NASDAQ\u0026rsquo;s TotalView ITCH 5.0 message format.\nEssenceia/ITCH RTL implementation of the Nasdaq ITCH protocol decoder. Verilog 7 6 ITCH # The ITCH protocol is an integral part of the direct exchange data feed. This protocol delineates a series of exchange-specific binary messages that convey essential exchange status information, including the tracking of orders, administrative messages, and exchange event notifications. It is exclusively used for outbound market data feeds and does not support order entry.\nThe messages comprising the ITCH protocol are delivered via MoldUDP64 packets, which ensure proper sequencing and tracking.\nIt is important to note that there is no universal implementation of ITCH; instead, each exchange defines its own message formats. For instance, NASDAQ\u0026rsquo;s version of ITCH is known as TotalView ITCH, and the Australian Securities Exchange (ASX) uses ASX ITCH.\nMessage format # Messages are of defined length based on their Message Type.\nI will be using the TotalView ITCH version 5.0 message format in the following examples.\nThe modules used in the following examples will utilize an 8-byte payload width for the UDP-\u0026gt;MoldUDP64 interface.\nThere is a predefined format for each Message Type with predefined fields.\nFor example, the System Event message, used to signal a market or data feed handler event, has a Message Type of 0x53 (S in ASCII), a total length of 12 bytes, and has the following format:\nSystem Event message format, part of NASDAQ\u0026rsquo;s TotalView-ITCH version 5.0 The fields can belong to one of the following four types:\nUnsigned integer : This is the most common type, used for message integer fields, and it is represented in big endian. 
Price : Integer fields that need to be converted to a fixed point decimal format. There are two sub-type for Price: Price(4) with 4 decimal places and Price(8) with 8. ASCII : Text fields that are left-justified and padded on the right with spaces. Timestamp : A 6-byte unsigned integer representing the number of nanoseconds elapsed since midnight. Automatic generation # Because there is no single ITCH protocol message format, and because the RTL code for the decoder is painfully repetitive, I have decided to automatically generate the majority of the Verilog code for this module.\nGeneration flow from XML to Verilog module. XML # The ITCH message format is described in an XML file.\nThis XML file is also used to generate the code for the C ITCH library associated with this project. This library is used in the HFT project\u0026rsquo;s self-checking test bench and in my custom tools.\nThe following is the description of the System Event message :\n\u0026lt;Struct name=\u0026#34;system_event\u0026#34; len=\u0026#34;12\u0026#34; id=\u0026#34;S\u0026#34; database=\u0026#34;true\u0026#34;\u0026gt; \u0026lt;Field name=\u0026#34;message_type\u0026#34; offset=\u0026#34;0\u0026#34; len=\u0026#34;1\u0026#34; type=\u0026#34;char_t\u0026#34;/\u0026gt; \u0026lt;Field name=\u0026#34;stock_locate\u0026#34; offset=\u0026#34;1\u0026#34; len=\u0026#34;2\u0026#34; type=\u0026#34;u16_t\u0026#34;/\u0026gt; \u0026lt;Field name=\u0026#34;tracking_number\u0026#34; offset=\u0026#34;3\u0026#34; len=\u0026#34;2\u0026#34; type=\u0026#34;u16_t\u0026#34;/\u0026gt; \u0026lt;Field name=\u0026#34;timestamp\u0026#34; offset=\u0026#34;5\u0026#34; len=\u0026#34;6\u0026#34; type=\u0026#34;u48_t\u0026#34;/\u0026gt; \u0026lt;Field name=\u0026#34;event_code\u0026#34; offset=\u0026#34;11\u0026#34; len=\u0026#34;1\u0026#34; type=\u0026#34;eSystemEvent\u0026#34;/\u0026gt; \u0026lt;/Struct\u0026gt; Struct.name : message type Sturct.len : total length in bytes of this message Struct.id : ascii code for this 
message type Struct.database : identifies if we should include this Struct in our generation, currently unused Field.name : name of the field Field.offset : field start position, offset in bytes from the start of the message Field.len : length in bytes of this field Field.type : type of this field, unused by ITCH module, used by C libraries to indicate how to manipulate the data. This XML was originally authored by the github user doctorbigtime for his own message parser written in Rust. All credits for this XML belong to him.\nPython script # The itch_msg_to_rtl.py Python script reads this XML and translates the outlined message formats into Verilog code.\nThese generated sequences are then written into multiple Verilog files in the gen folder.\nInclude in module # Our main module assembles the code by using the include directive to include the generated code files.\nArchitecture # This module is a message decoder, internally it works by accumulating message bytes, identifying the message type and routing the message field data to the wires associated with the message type.\nITCH module overview. Inbound message bytes are sent from the MoldUDP64 module, and its output is connected to the trading algorithm. The outbound early interface is optional. Internally, the message decoder accumulates the message bytes received from the MoldUDP64 module into the internal data_q and ov_data_q flops. The contents of the data_q flops are connected to the corresponding outbound decoded message fields.\nThe number of bytes received for the current message is tracked by the data_cnt_q counter. 
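As a host-side illustration of what the generated decode logic must do, the following C sketch decodes the 12-byte System Event message described in the XML above, treating all integer fields as big endian. The struct and helper names are illustrative; this is neither the generated RTL nor the C library's actual API.

```c
#include <assert.h>
#include <stdint.h>

/* Decoded System Event message; field names and offsets follow the XML. */
typedef struct {
    char     message_type;    /* offset 0,  1 byte, 'S' (0x53)         */
    uint16_t stock_locate;    /* offset 1,  2 bytes                    */
    uint16_t tracking_number; /* offset 3,  2 bytes                    */
    uint64_t timestamp;       /* offset 5,  6 bytes, ns since midnight */
    char     event_code;      /* offset 11, 1 byte                     */
} system_event_t;

/* Load an n-byte big-endian integer from p. */
static uint64_t be_load(const uint8_t *p, int n)
{
    uint64_t v = 0;
    for (int i = 0; i < n; i++)
        v = (v << 8) | p[i];
    return v;
}

/* Decode a raw 12-byte System Event message. */
static system_event_t decode_system_event(const uint8_t *m)
{
    system_event_t e;
    e.message_type    = (char)m[0];
    e.stock_locate    = (uint16_t)be_load(m + 1, 2);
    e.tracking_number = (uint16_t)be_load(m + 3, 2);
    e.timestamp       = be_load(m + 5, 6);
    e.event_code      = (char)m[11];
    return e;
}
```

The generated Verilog performs the same slicing in hardware, routing each byte range of data_q to the matching itch_system_event_* output wires.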
By examining the first byte of each message, we can identify the Message Type and determine the number of expected bytes to collect.\nThe cycle after the entire message has been received, the validity signal on the ITCH outbound interface corresponding to this Message Type is asserted.\nOverlap # I refer to an overlap as a case where, from the perspective of the MoldUDP64 module, the last bytes of a message and the first bytes of a new message are transmitted within the same UDP-\u0026gt;MoldUDP64 payload.\nPayload containing data of two messages, having its data split onto both outbound MoldUDP64 interfaces. Because overlap only occurs when there is at least 1 byte of the previous message data in the payload, and the length field is 2 bytes, our overlap data is at most N-3 bytes wide for an N-byte payload. The overlapping bytes are the first bytes of this new message.\nDue to choices made regarding how to handle overlap cases, the ITCH module has two inbound interfaces by which it can accept new message bytes.\nIn order to avoid corrupting the value of the byte count data_cnt_q and the flopped data data_q for the finishing message, data pertaining to these overlapping bytes will be stored in dedicated flops ov_data_q and ov_cnt_q, and merged with the remainder of its message in the following cycle.\nInterfaces # Input interfaces # The inbound interface is connected to the MoldUDP64 module and used to receive message bytes.\nMessage interface # The message interface is the standard interface used to transmit all message bytes, with the exception of the overlapping bytes.\ninput valid_i, input start_i, input [KEEP_LW-1:0] len_i, input [AXI_DATA_W-1:0] data_i, valid_i : signals the validity of data on this interface start_i : signals the start of a new message len_i : length of the valid data in bytes data_i : data bytes Overlap interface # The overlap interface is used exclusively for transmitting the overlapping bytes. 
Due to the conditions in which an overlap occurs, there is no need for a start signal, as the start is implied when we have a valid overlap.\ninput ov_valid_i, input [OV_KEEP_LW-1:0] ov_len_i, input [OV_DATA_W-1:0] ov_data_i, ov_valid_i : an overlap has occurred, data on this interface is valid. Implies the start of a new message ov_len_i : length of the valid data in bytes ov_data_i : overlapping bytes Output interfaces # The output of this module is intended to be connected to a trading algorithm.\nThere are two output interfaces. The first is the standard output interface, which becomes valid once all the bytes of the message have been received.\nThe second interface is optional, and is an early interface used to identify the type of the message currently being received, and which message fields have received all their data.\nTo include this interface, declare the EARLY macro.\nStandard interface # Standard outbound decoder interface, contains fully decoded messages.\noutput logic itch_\u0026lt;message_type\u0026gt;_v_o, output logic [\u0026lt;field length\u0026gt;-1:0] itch_\u0026lt;message_type\u0026gt;_\u0026lt;field_name\u0026gt;_o, ... output logic [\u0026lt;field length\u0026gt;-1:0] itch_\u0026lt;message_type\u0026gt;_\u0026lt;field_name\u0026gt;_o, itch_\u0026lt;message_type\u0026gt;_v_o : valid signal, a message of \u0026lt;message_type\u0026gt; has been fully received itch_\u0026lt;message_type\u0026gt;_\u0026lt;field_name\u0026gt;_o : message field Early interface # This optional early interface is used to start triggering decisions as soon as individual data fields have been fully received, eliminating the need to wait for the complete reception of all the message bytes.\noutput logic itch_\u0026lt;message_type\u0026gt;_early_v_o, output logic itch_\u0026lt;message_type\u0026gt;_\u0026lt;field_name\u0026gt;_early_v_o, ... 
output logic itch_\u0026lt;message_type\u0026gt;_\u0026lt;field_name\u0026gt;_early_v_o, itch_\u0026lt;message_type\u0026gt;_early_v_o : valid signal, decoding a message of type \u0026lt;message_type\u0026gt;. itch_\u0026lt;message_type\u0026gt;_\u0026lt;field_name\u0026gt;_early_v_o : valid signal, all bytes of \u0026lt;field_name\u0026gt; have been received. It should only be used when the associated early \u0026lt;message_type\u0026gt; valid signal is high. The field data bytes will be on the standard interface. Example # In the following example, the ITCH module with an early interface is decoding two messages. The first is a 21-byte-long Snapshot message, and the second is a 39-byte-long anonymous Add Order message.\nAll values are represented using hexadecimal, with the exception of the data byte counter data_cnt_q in yellow, which uses decimal. Snapshot message format, Message Type=0x47 Add Order message format, Message Type=0x41 Since our Snapshot message is 21 bytes long, and since, in our MoldUDP64 packet, there is a 2-byte-long length field before the start of each new message\u0026rsquo;s data bytes, the first byte of the Add Order message will overlap.\nThe last bytes of the Snapshot message and the first byte of the Add Order message overlap within an 8 byte payload. Due to this, the start of the Add Order message will be sent through the overlap interface. We can observe that the data_cnt_q counter will delay accounting for this overlapping byte by one cycle, giving us the needed time to finish processing the previous Snapshot message.\nWave view of the ITCH module behavior including the optional early interface.
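The byte accounting in this example is easy to check in software. The helper below splits a payload given how many bytes of the message in flight remain, assuming (as described above) a 2-byte length field sitting between the two messages; the function and its names are mine, not the RTL's.

```python
def split_overlap(payload: bytes, remaining: int):
    """Split a payload into (tail of the finishing message, overlap bytes).

    `remaining` is the byte count still owed to the message in flight;
    a 2-byte length field separates it from the next message's bytes.
    """
    tail = payload[:remaining]          # finishes the current message
    overlap = payload[remaining + 2:]   # skips the 2-byte length field
    return tail, overlap

# The article's example: an 8-byte payload carrying the last 5 Snapshot
# bytes, the 2-byte length field, and 1 overlapping Add Order byte.
tail, overlap = split_overlap(bytes(range(8)), remaining=5)
assert len(tail) == 5 and len(overlap) == 1
# Maximum overlap for an N-byte payload:
# N - 1 (previous message byte) - 2 (length field) = N - 3.
assert len(split_overlap(bytes(range(8)), remaining=1)[1]) == 8 - 3
```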
Resources # ASX Trade ITCH Specification\n","date":"1 November 2023","externalUrl":null,"permalink":"/projects/itch/","section":"Other projects","summary":"Design of the ITCH module.","title":"ITCH RTL implementation","type":"projects"},{"content":" Introduction # When doing embedded systems development, it can sometimes be hard to find a development board with the desired features at an affordable price. A solution to this is to learn how to design and manufacture your own custom development boards.\nDevelopment board. This project is my first custom development board. It features :\nthe STM32H750VBT6TR MCU containing an ARM Cortex-M7 core; an SWD debug interface with a pinout compatible with the 20 pin J-Link probe; a USB-B mini connector; a Micro SD card connector. Essenceia/stm32h750-dev-board Small embedded dev board with an STM32H7 and a full J-link connector The USB\u0026rsquo;s data transfer functionality and the SD card reader have not been tested as of writing. I suggest double checking the PCB schematics for these if you intend to use them in your own project. An anniversary present # I met my wonderful husband over 10 years ago; we were high school love birds. Today he is a talented low level C developer who writes kernels for fun.\nHe likes targeting microcontrollers but was often disappointed by the lack of a proper JTAG or SWD debugging interface on the development boards he owned.\nAs such, this board was made as my 10 year anniversary present to him.\nDebug interface # One of the goals of this design is to be able to directly plug the debug probe onto the board without the need for any additional adapter.\nBoard connected to our J-Link J-Link connector pinout # We are using a J-Link as our debug probe.
It has a 20 pin connector and supports SWD using the following connector pinout:\nJ-Link connector pinout for SWD The following table lists the J-Link / J-Trace SWD pinout.\nPin Signal Type Description 1 VTref Input This is the target reference voltage. It is used to check if the target has power, to create the logic-level reference for the input comparators and to control the output logic levels to the target. It is normally fed from Vdd of the target board and must not have series resistors. 2 Vsupply NC This pin is not connected in the J-Link. It is reserved for compatibility with other equipment. Connect to Vdd or leave it open in target system. 3 Not used NC This pin is not used by the J-Link. If the device may also be accessed via JTAG, this pin may be connected to nTRST, otherwise leave it open. 5 Not used NC This pin is not used by the J-Link. If the device may also be accessed via JTAG, this pin may be connected to TDI, otherwise leave it open. 7 SWDIO I/O Single bi-directional data pin. 9 SWCLK Output Clock signal to target CPU. It is recommended that this pin is pulled to a defined state of the target board. Typically connected to TCK of target CPU. 11 Not used NC This pin is not used by the J-Link. This pin is not used by J-Link when operating in SWD mode. If the device may also be accessed via JTAG, this pin may be connected to RTCK, otherwise leave it open. 13 SWO Input Serial Wire Output trace port. (Optional, not required for SWD communication.) 15 nRESET I/O Target CPU reset signal. Typically connected to the RESET pin of the target CPU, which is typically called \u0026ldquo;nRST\u0026rdquo;, \u0026ldquo;nRESET\u0026rdquo; or \u0026ldquo;RESET\u0026rdquo;. This signal is an active low signal. 17 Not used NC This pin is not connected in the J-Link. 19 5V-Supply Output This pin is used to supply power to some eval boards. Pins 4, 6, 8, 10, 12, 14, 16, 18, 20 are GND pins connected to GND in J-Link. 
They should also be connected to GND on the board.\nJ-Link connections to the board. All SWD J-Link pins are connected with the exception of the 5V-Supply, as even when the debug probe is connected, power is still supplied over USB.\nMounting the connector to the PCB # The connector should be mounted with the slot facing up, away from the MCU as shown below.\nMale 20 pin J-Link connector mounting direction on PCB front face. CAD # This board was designed using KiCad 7.0.8 and all project files are available for download in the following github repository.\nSchematic # Full schematics of the board :\nSchematics of the development board. PCB # Computer rendering of the PCB :\nFront view of the PCB. Back view of the PCB. Final result :\nUnmounted spare PCBs, to be used for version 2. Bill of Materials # Item # Designator Qty Manufacturer Mfg Part # Description / Value Package/Footprint Type Your Instructions / Notes 1 DC6,DC3,DC5,DC4,DC8,DC9,DC7,DC11,DC1,DC2 10 KEMET C0402C104K4RAC 100nF 0402 SMD 2 U3 1 Texas Instruments TLV1117-33CDCYRG3 TLV1117-33 SOT-223 SMD 3 R4,R5,R3 3 SEI Stackpole RMCF0603JJ1K00 1k 0603 SMD 4 Y2 1 ECS Inc. ECS-250-9-37B2-CKM-TR 25MHz 9uF 2.0x1.6mm SMD 5 R15,R14,R13,R7,R16,R12,R10,R11 8 YAGEO RC0402FR-0710KL 10k 0402 SMD 6 U1 1 STMicroelectronics STM32H750VBT6TR / STM32H742VIT6 STM32H750VBTx / STM32H742VI LQFP-100 QFP MCU for v1 and v2, pin compatible 7 J2 1 On Shore Technology Inc.
302-S201 Conn_ARM_JTAG_SWD_20 THD Through Hole 8 C8,C5,C6,C7,C1,C2 6 Samsung Electro-Mechanics CL31A476MQHNNNE 47uF/3528 1206 SMD 9 D2,D1 2 EVERLIGHT 19-213SYGC/S530-E2/5T LED 0603 SMD 10 R1,R2 2 YAGEO RC0402FR-0722RL 22 0402 SMD 11 R6,R8 2 SEI Stackpole RMCA0603JT510R 510 0603 SMD 12 C4,C3 2 Murata Electronics GJM1555C1H8R0DB01D 8pF 0402 SMD 13 J3 1 Molex 1040310811 Micro_SD_Card_Det1 SMD 14 D3 1 Toshiba CRS30I40A(TE85L,QM SS34 SOD-123F SMD 15 L1,L2 2 TAI-TECH FCM1608KF-102T02 1KB 0603 SMD 16 SW1 1 C\u0026amp;K PTS636 SM43 SMTR LFS SW_RESET 6.0x3.5mm SMD 17 U2 1 STMicroelectronics USBLC6-2P6 USBLC6-2P6 SOT-666 SMD 18 J1 1 Adam Tech MUSB-B5-S-RA-SMT-PP-T/R USB_B_Mini SMD 19 J4,J5 2 Sullins Connector Solutions PPTC252LFBN-RC Conn_02x25_Odd_Even Through Hole For reference, I recently ordered components for 3 version 2 boards in Canada and spent 107 CAD at Mouser.\nVersion 2 # Now that the first version is up and running it is time to start thinking of improvements for the second version.\nThe first revision of the board uses the STM32H750VBT6TR MCU, which only has 128kB of flash. Now that I have confirmed that the board is working it is time to upgrade to something with more flash.\nThis new revision will keep the existing PCB and components but drop in the STM32H742VIT6 as the MCU. 
This MCU features 2 MB of flash, 1 MB of RAM and is otherwise the same chip with regard to the features that matter to us.\n","date":"9 October 2023","externalUrl":null,"permalink":"/projects/dev_board/","section":"Other projects","summary":"Designing a small stm32h750 development board as an anniversary present","title":"Custom STM32H750 embedded development board","type":"projects"},{"content":"","date":"9 October 2023","externalUrl":null,"permalink":"/tags/electronics/","section":"Tags","summary":"","title":"Electronics","type":"tags"},{"content":"","date":"9 October 2023","externalUrl":null,"permalink":"/tags/embedded/","section":"Tags","summary":"","title":"Embedded","type":"tags"},{"content":"","date":"9 October 2023","externalUrl":null,"permalink":"/tags/manufacturing/","section":"Tags","summary":"","title":"Manufacturing","type":"tags"},{"content":"","date":"9 October 2023","externalUrl":null,"permalink":"/tags/pcb/","section":"Tags","summary":"","title":"Pcb","type":"tags"},{"content":"","date":"2 October 2023","externalUrl":null,"permalink":"/tags/ethernet/","section":"Tags","summary":"","title":"Ethernet","type":"tags"},{"content":" Introduction # This project aims to implement the Ethernet physical layer for 10Gb and 40Gb fiber optic links.\nThis post is a work in progress and currently used as a means to share schematics.\nEssenceia/ethernet-physical-layer RTL implementation of the ethernet physical layer PCS for 10GBASE-R and 40GBASE-R. Architecture # PCS # High level schematics and design considerations for the 10GBASE-R and 40GBASE-R RX and TX PCSs.\n10G RX # Vanilla # 10GBASE-R RX path, optional xgmii interface Optional CDC # The CDC is optional. We can configure the design in two ways :\nWithout CDC : Reduces latency for our latency-optimized designs, as we can get away without having different clock frequencies.
Instead, we would simply need to compensate for the de-phasing between the SerDes-derived clock and the rest of the design, also running at 161.13MHz. On the flip side, all of the downstream design would need to support invalid data one cycle every 32 cycles. Because of this we also need a signal_ok signal alongside the data to help differentiate a cycle with no data from a signal loss.\nWith CDC : The default implementation can still include the optional CDC, allowing different clock frequencies and thus removing the need to support invalid data cycles.\n10GBASE-R RX path, optional xgmii interface and CDC 10G TX # I have decided to select the design with the optional CDC for 10G.\nOptional CDC # 10GBASE-R TX path, optional xgmii interface and CDC 40G RX # Because I need all lanes to be predictably valid at the same cycle to complete the descrambling, I cannot have a design with an optional CDC. This is made less consequential as I do not have low latency requirements for the 40GBASE-R PCS.\n40GBASE-R RX path, optional xgmii interface 40G TX # Each lane SerDes block is driven by the common 161.13MHz clock coming from the PCS.\nAll lanes are in sync with one another, and so are the gearboxes. Thus each one should be ready to accept a new 66b data block every cycle.\nThe ability to accept a new data block is signaled by the gearboxes using the ready signal. Since they are all in sync, all those ready signals should have the same value. Thus we can connect any of those to the CDC.\n40GBASE-R TX path, all SerDes blocks are clocked from the same 161.13MHz clock coming from the PCS, optional xgmii interface Testing # 10G PCS loopback # Clock\nClock framework overview Reset\nThe reset signal for the PCS modules is controlled by a reset controller that keeps the logic in reset until all PLLs and SerDes blocks have achieved lock.\nThis reset controller runs at a frequency of 50MHz, so we first need to synchronize it with the RX PCS domain.
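The one-invalid-cycle cadence mentioned in the CDC discussion above falls out of the 64b/66b rate mismatch. Here is a toy bit-accounting model (mine, not the repo's RTL) of a gearbox fed 64 bits per cycle while emitting 66-bit blocks: over a 33-cycle period it emits 32 valid blocks, i.e. one idle cycle for every 32 valid ones.

```python
def gearbox_valids(cycles: int, in_width: int = 64, blk_width: int = 66):
    """Model a gearbox fed `in_width` bits per cycle that emits
    `blk_width`-bit blocks: returns one bool per cycle
    (True = a block was emitted that cycle)."""
    bits, valids = 0, []
    for _ in range(cycles):
        bits += in_width           # bits arriving from the SerDes
        if bits >= blk_width:      # enough buffered for a 66b block
            bits -= blk_width
            valids.append(True)
        else:
            valids.append(False)   # downstream sees an invalid cycle
    return valids

v = gearbox_valids(33)
# 33 cycles x 64 bits = 32 blocks x 66 bits: 32 valid cycles, 1 idle.
assert sum(v) == 32 and v.count(False) == 1
```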
The resulting signal is nreset, which is used to reset the RX PCS.\nThe reset signal for the TX PCS is driven by the pcs_loopback module. This module compensates for the phase difference between both PCS clocks while also introducing an essential additional cycle delay. This synchronization ensures that both PCS gearboxes are aligned with respect to when the RX PCS sends valid data and when the TX PCS is ready to accept new data.\nPCS modules reset signals overview. ","date":"2 October 2023","externalUrl":null,"permalink":"/projects/pcs/","section":"Other projects","summary":"RTL implementation of the Ethernet Physical Coding Sublayer (PCS) for 10Gb and 40Gb fiber optics.","title":"Ethernet 10GBASE-R and 40GBASE-R PCS","type":"projects"},{"content":"","date":"2 October 2023","externalUrl":null,"permalink":"/tags/systemverilog/","section":"Tags","summary":"","title":"Systemverilog","type":"tags"},{"content":" Introduction # One of my current projects is to build the Ethernet physical layer for 10Gb (10GBASE-R) and 40Gb (40GBASE-R) fiber. In order to create a testbench I went out and bought a decommissioned Redstone D2020 enterprise switch off ebay.\nAlthough there is practically no documentation on this switch and I have no prior experience working with any network equipment whatsoever, it was cheap and I believed that troubleshooting a system was a great opportunity to acquire that missing experience.\n2 weeks later, this beauty showed up!\nCelestica Redstone D2020 I am writing this article to document my experience, and in the hope that it may be useful to future owners of a Redstone D2020.\nCelestica Redstone D2020 # The Celestica Redstone D2020 is a 1U data center switch with 48 10GbE SFP+ capable ports and 4 QSFP+ 40GbE capable ports.
It has two 460W power supplies for redundancy, 5 cooling fans, 1 Ethernet RJ45, 1 console RJ45 and 1 USB type A port.\nAnother big selling point is that, unlike some other models, it doesn\u0026rsquo;t require any license to operate.\nI got mine from UNIXSurplusNet on ebay for $150.\nThe seller provided a test report containing some very handy information such as the admin username and password, the console serial configuration, and some general system information.\nTest report provided with the switch by the ebay seller. Ethernet ASIC # Thanks to the seller-provided test report we learn that the main IC is showing up as a Broadcom Trident 56846.\nThis is probably from the first Trident generation, the BCM56846KFRBG of the Broadcom BCM56840 family that has since been discontinued. This appears to be a custom Broadcom ASIC targeting 10Gb Ethernet applications with 64 integrated 10GBASE-KR capable serial PHYs. In our switch, 16 of these modules are configured such that 4 lanes are bonded together to form our 4 40GBASE-R ports.\nBroadcom Ethernet IC configuration in our switch I was unfortunately unable to find a full data-sheet describing this IC\u0026rsquo;s internals in detail.\nAs such, I did the next best thing I could think of\u0026hellip;\nPopping the lid open, we discover a single gorgeous multilayer PCB.\nJudging by the traces on the PCB coming from the Ethernet connector cages, the Broadcom IC is likely under the massive passive cooling block.\nSwitch PCB top face, we can see the PCB traces going from the Ethernet connector cages to below the big metal passive cooler Looking at the product brief it appears this Broadcom IC doesn\u0026rsquo;t feature a CPU; rather, it acts as a network interface connected to the CPU via PCIe.\nBCM56840 system schematics from the product brief (https://docs.broadcom.com/doc/12358267).
CPU # Our system\u0026rsquo;s CPU is likely located below the black cooling block, as hinted by the DDR3 SK hynix memory chips surrounding another imposing black passive cooler.\nSwitch CPU surrounded by multiple SK hynix DDR3 memory chips. Our processor\u0026rsquo;s chip is a Freescale P2020, a dual-core PowerPC system with 2GB of external ECC DDR3 DRAM. Here we can notice an interesting inconsistency : the switch data brief lists the CPU as running at 800MHz, but when we read the contents of /proc/cpuinfo the cores are reported as running at 1.2GHz.\n# head -n 20 /proc/cpuinfo processor : 0 cpu : e500v2 clock : 1199.999988MHz revision : 5.1 (pvr 8021 1051) bogomips : 100.00 processor : 1 cpu : e500v2 clock : 1199.999988MHz revision : 5.1 (pvr 8021 1051) bogomips : 100.00 total bogomips : 200.00 timebase : 50000000 platform : P2020 CEL model : fsl,P2020 Memory : 2048 MB Power and cooling # Behind the processor we have a series of cooling fans flanked on both sides by our power blocks.\nBoth the power blocks and the 5 fans have connectors to the PCB, making them detachable. The fans can be easily detached, and come off as a single block. In practice only 4 fans are needed for operation; the 5th is redundant, so that one fan block can safely be removed at any time during operation.\nDetachable fan blocks, these were detached when the switch was not in operation The same goes for the power blocks: only one of the 460W bricks is needed to power the switch, even at maximum load.\nFPGA # Interestingly, this PCB also features 4 Lattice FPGAs of the MachXO2 family.\nLattice LCMX02-1200UHC FPGA on the PCB. These are relatively small FPGAs with only about 1280 LUTs each and are likely used as I2C bus controllers for accessing the Digital Diagnostic Monitoring Interface (DDMI) on the optical transceivers.\nLattice semiconductor MachXO2 family datasheet, overview of FPGA features.
We have the XO2-1200U in our switch. For those unfamiliar with optical transceivers, this interface allows real-time access to the transceiver\u0026rsquo;s operating parameters, and it includes a system of alarm and warning flags which alerts the host system when particular operating parameters are outside of a factory-set normal operating range. Additionally, this also includes information about the transceiver itself, such as the vendor, its laser wavelength, its supported link length, and more.\nInternally each transceiver features a small microcontroller in charge of reporting this diagnostic information, communicating it to the wider system via the 2-wire serial I2C bus.\nI2C bus connecting transceivers to Broadcom ASIC Since I2C is a shared medium and multiple transceivers are connected to the same bus, the FPGA acts as the I2C master of this bus, as well as the controller allowing the Broadcom Ethernet ASIC to interface with it.\nThanks to this we can obtain information on the internal status of our connected transceivers.\nCommands like show fiber-ports optical-transceiver give us the latest internal operating parameters as read by the transceiver\u0026rsquo;s internal microcontroller and reported over I2C to the system. Using this command we can get information on the transceiver\u0026rsquo;s temperature, input and output signal strength, and operating voltage.\n(Routing) #show fiber-ports optical-transceiver all Output Input Port Temp Voltage Current Power Power TX LOS [C] [Volt] [mA] [dBm] [dBm] Fault -------- ---- ------- ------- ------- ------- ----- --- 0/49 30.5 3.292 N/A -4.737 -19.318 No Yes 0/51 32.9 3.288 N/A -4.665 -9.397 No No Temp - Internally measured transceiver temperatures. Voltage - Internally measured supply voltage. Current - Measured TX bias current. Output Power - Measured optical output power relative to 1mW. Input Power - Measured optical power received relative to 1mW. TX Fault - Transmitter fault.
LOS - Loss of signal. Here I can see that one of my transceivers has lost the signal. The received optical power, Input Power = -19.318 dBm, might indicate some dust in my optical connections, or may just be a bad contact.\nCommands like show fiber-ports optical-transceiver-info report the contents of sections of the transceiver\u0026rsquo;s EEPROM and present them in a readable format. This includes the unit\u0026rsquo;s vendor, its serial number, its part number, and what 802.3 physical medium it is compliant with.\n(Routing) #show fiber-ports optical-transceiver-info all Link Link Nominal Length Length Bit 50um 62.5um Rate Port Vendor Name [m] [m] Serial Number Part Number [Mbps] Rev Compliance -------- ---------------- --- ---- ---------------- ---------------- ----- ---- ---------------- 0/49 AVAGO 0 0 QF2606UK AFBR-79EQDZ-JU1 10300 01 40GBase-SR4 0/51 AVAGO 0 0 QF1803PC AFBR-79EQDZ-JU1 10300 01 40GBase-SR4 Here I have 2 40Gb transceivers compliant with IEEE 802.3 Physical Medium Dependent (PMD) type 40GBASE-SR4 as outlined in clause 86.\nIEEE clause 86 Physical Medium Dependent (PMD), summary for 40GBASE-SR4 medium requirements This is the 4-lane optical physical layer compatible with the 40GBASE-R4 PMA and the 40GBASE-R PCS, which I am currently working on.\nConnecting to the switch console # My original plan was to gain access to the switch command line interface via the console port.\nRJ45 to USB serial cable To this end, I had acquired an RJ45 to USB cable and configured my PC\u0026rsquo;s serial to match the seller-provided serial configuration.\n(Routing) #show serial Serial Port Login Timeout (minutes)............ 5 Baud Rate (bps)................................ 9600 Character Size (bits).......................... 8 Flow Control................................... Disable Stop Bits...................................... 1 Parity.........................................
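Since the optical power columns above are reported in dBm (decibels relative to 1mW), converting them back to milliwatts makes the difference between the two ports concrete; a quick sketch:

```python
def dbm_to_mw(dbm: float) -> float:
    """Convert a dBm reading (dB relative to 1 mW) to milliwatts."""
    return 10 ** (dbm / 10)

assert abs(dbm_to_mw(0.0) - 1.0) < 1e-12   # 0 dBm is exactly 1 mW
# Port 0/49's -19.318 dBm input is ~0.0117 mW, roughly 10x weaker than
# port 0/51's -9.397 dBm (~0.115 mW) -- consistent with its LOS flag.
assert dbm_to_mw(-19.318) < dbm_to_mw(-9.397)
```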
none Initially all seemed to be going well: the serial cable was correctly detected as a USB-to-serial device, as evidenced by my dmesg logs.\npitchu /etc \u0026gt;sudo dmesg | tail [55357.778160] usb 1-4: new full-speed USB device number 7 using xhci_hcd [55357.931485] usb 1-4: New USB device found, idVendor=1a86, idProduct=7523, bcdDevice= 2.54 [55357.931501] usb 1-4: New USB device strings: Mfr=0, Product=2, SerialNumber=0 [55357.931503] usb 1-4: Product: USB2.0-Ser! [55357.936519] ch341 1-4:1.0: ch341-uart converter detected [55357.949546] ch341-uart ttyUSB0: break control not supported, using simulated break [55357.949663] usb 1-4: ch341-uart converter now attached to ttyUSB0 I was using picocom as a serial terminal with a baud rate of 9600, no flow control, a character size of 8, 1 stop bit and no parity.\npitchu /dev/serial \u0026gt;sudo picocom -b 9600 /dev/ttyUSB0 --omap delbs picocom v3.1 port is : /dev/ttyUSB0 flowcontrol : none baudrate is : 9600 parity is : none databits are : 8 stopbits are : 1 escape is : C-a local echo is : no noinit is : no noreset is : no hangup is : no nolock is : no send_cmd is : sz -vv receive_cmd is : rz -vv -E imap is : omap is : delbs, emap is : crcrlf,delbs, logfile is : none initstring : none exit_after is : not set exit is : no Type [C-a] [C-h] to see available commands Terminal ready Yet, nothing happened. There was never any response from the console port.
It was as if I was sending commands into the void.\nEven after trying multiple different serial terminals, such as minicom and screen, as well as different serial configurations, I didn\u0026rsquo;t seem to find a way to successfully connect to the switch.\nWas there something wrong with the switch? Was it not booting properly?\nChecking switch liveness # At this point the switch was powered and connected via its console port to my PC but it was not connected to my network.\nAlthough the fans were spinning and some LEDs were blinking, I wanted to check if the switch systems had been successfully started. I connected the RJ45 management port directly to my PC and started scanning network traffic on this link using wireshark.\nFor context, the first 3 bytes of a MAC address contain the vendor identifier (22 of those bits uniquely identify the equipment vendor), and Celestica has the vendor identifier 0x00e0ec. Our switch\u0026rsquo;s MAC address is 00:e0:ec:38:e5:d5.\nSwitch MAC addresses After a little while an ICMP message originating from the MAC address 00:e0:ec:38:e5:d5 was captured. We can spot our switch\u0026rsquo;s MAC address as the source MAC in the packet\u0026rsquo;s MAC header.\n0000 33 33 00 00 00 16 00 e0 ec 38 e5 d5 86 dd 60 00 33.......8....`. ^^ ^^ ^^ ^^ ^^ ^^ 0010 00 00 00 24 00 01 00 00 00 00 00 00 00 00 00 00 ...$............ 0020 00 00 00 00 00 00 ff 02 00 00 00 00 00 00 00 00 ................ 0030 00 00 00 00 00 16 3a 00 05 02 00 00 01 00 8f 00 ......:......... 0040 89 7c 00 00 00 01 04 00 00 00 ff 02 00 00 00 00 .|.............. 0050 00 00 00 00 00 01 ff 38 e5 d5 .......8..
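The vendor lookup described above is easy to script. The small helper below (mine, hypothetical) extracts the 24-bit OUI prefix from a colon-separated MAC address; the Celestica identifier value comes from this article.

```python
def oui(mac: str) -> int:
    """Return the 24-bit OUI (first 3 bytes) of a colon-separated MAC address."""
    parts = mac.split(":")[:3]          # keep the vendor prefix bytes
    return int("".join(parts), 16)      # e.g. "00"+"e0"+"ec" -> 0x00e0ec

# The switch's MAC starts with Celestica's vendor identifier 0x00e0ec.
assert oui("00:e0:ec:38:e5:d5") == 0x00E0EC
```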
This confirmed that our switch was indeed working correctly, so now we just needed to find another way in.\nTelnet # Since I now had proof that the switch was working properly, and that the basic networking features were running, as evidenced by the ICMP packet, I decided to check if there wasn\u0026rsquo;t also an ssh port open.\nAt this point I connected the switch to my home\u0026rsquo;s router and started scanning my network using nmap, to see if the switch had an assigned IP.\npitchu /dev/serial \u0026gt;nmap -sn 192.168.4.0/24 Starting Nmap 7.94 ( https://nmap.org ) at 2023-10-03 16:13 PDT Nmap scan report for 192.168.4.1 Host is up (0.018s latency). Nmap scan report for 192.168.4.22 Host is up (0.0075s latency). Nmap scan report for 192.168.4.81 Host is up (0.00045s latency). Nmap scan report for 192.168.4.106 Host is up (0.015s latency). Nmap done: 256 IP addresses (4 hosts up) scanned in 3.09 seconds The switch\u0026rsquo;s address was 192.168.4.106, and I then proceeded to check what ports were open.\npitchu /dev/serial \u0026gt;nmap --top-ports 1000 192.168.4.106 Starting Nmap 7.94 ( https://nmap.org ) at 2023-10-03 16:16 PDT Nmap scan report for 192.168.4.106 Host is up (0.0039s latency). Not shown: 998 closed tcp ports (conn-refused) PORT STATE SERVICE 23/tcp open telnet 80/tcp open http Nmap done: 1 IP address (1 host up) scanned in 0.19 seconds While I had initially hoped for an open ssh port, telnet can also provide access to a virtual terminal. Although telnet is sometimes considered lesser than ssh because it is less secure, for my local, test-oriented use case it is just as good.\nI then opened a telnet connection and logged into the admin user session.\npitchu /dev \u0026gt;telnet 192.168.4.106 23 Trying 192.168.4.106... Connected to 192.168.4.106. Escape character is \u0026#39;^]\u0026#39;.
User:admin Password: (Routing) \u0026gt; Success, we are in :partying_face:\nGetting access to linux shell # When connecting to the switch, by default we log into a networking-specific command line interface and not a linux shell. This CLI is very similar to the one used by Dell for their S4048-ON System.\nBy entering the ? character we can view all available commands.\n(Routing) \u0026gt;? enable Enter into user privilege mode. help Display help for various special keys. logout Exit this session. Any unsaved changes are lost. password Change an existing user\u0026#39;s password. ping Send ICMP echo packets to a specified IP address. quit Exit this session. Any unsaved changes are lost. show Display Switch Options and Settings. telnet Telnet to a remote host. By default we are logged into an unprivileged session, as signified by the \u0026gt; in our prompt.\nWe can elevate our privilege level using the enable command, which also expands our available commands.\nWe can also confirm that we have entered privileged mode thanks to the # in our prompt.\n(Routing) \u0026gt;enable (Routing) #? application Start or stop an application. arp Purge a dynamic or gateway ARP entry. bcmsh Enter into BCM Shell boot Marks the given image as active for subsequent re-boots. cablestatus Isolate the problem in the cable attached to an interface. capture Enable CPU packets capturing. clear Reset configuration to factory defaults. configure Enter into Global Config Mode. copy Uploads or Downloads file. debug Configure debug flags. delete Deletes the given image or the language pack file. dir Display directory information. disconnect Close remote console session(s). dot1x Configure dot1x privileged exec parameters. enable Set the password for the enable privilege level. erase Erase configuration file. exit To exit from the mode. filedescr Sets text description for a given image. help Display help for various special keys. hostname Change the system hostname. ip Configure IP parameters.
linuxsh Enter into Linux Shell logout Exit this session. Any unsaved changes are lost. network Configuration for inband connectivity. ping Send ICMP echo packets to a specified IP address. quit Exit this session. Any unsaved changes are lost. release To release IP Address. reload Reset the switch. renew To renew IP Address. script Apply/Delete/List/Show/Validate Configuration Scripts. serviceport Specify the serviceport parameters / protocol. set Set Router Parameters. show Display Switch Options and Settings. snmp-server Configure SNMP server parameters. sshcon Configure SSH connection parameters. telnet Telnet to a remote host. telnetcon Configure telnet connection parameters. terminal Set terminal line parameters. traceroute Trace route to destination. udld Reset UDLD disabled interfaces. vlan Type \u0026#39;vlan database\u0026#39; to enter into VLAN mode. watchdog Enable/Disable/Clear watchdog timer settings. write Configures save options. Unfortunately, this is a dedicated CLI and I would like to have access to the full linux shell. Now that we are in privilege mode we can escape this CLI and access the linux shell by using linuxsh.\n(Routing) #linuxsh Trying 127.0.0.1... Connected to 127.0.0.1 Linux System Login # pwd /mnt/application To recap : Privilege escalation using the CLI to get the root linux shell. I now felt right at home.\nReducing the noise # At idle the fan duty cycle is set to 60%, stated otherwise : this switch is cosplaying as a jet engine. :rocket:\nObviously this isn\u0026rsquo;t going to fly.\nThe first order of business is to make the noise a little more bearable.\nI can reduce the fan\u0026rsquo;s PWM by overwriting the contents of /sys/class/thermal/manual_pwm. This value is bound within the [0;255] range. 
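Since manual_pwm takes a raw 8-bit value rather than a percentage, a small helper (hypothetical, mine) makes the duty-cycle arithmetic explicit:

```python
def pwm_register(duty_percent: float) -> int:
    """Map a duty cycle in percent to the 8-bit value expected by
    /sys/class/thermal/manual_pwm, clamped to [0; 255]."""
    value = round(duty_percent / 100 * 255)
    return max(0, min(255, value))

assert pwm_register(60) == 153     # the noisy ~60% idle default
assert pwm_register(15.7) == 40    # ~15% duty -> register value 40
```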
It is apparently advised to keep the temperature of all internal components of the switch below 50 degrees Celsius.\nSo far a ~15% duty cycle seems to be a good compromise given my use case.\n# echo 40 \u0026gt; /sys/class/thermal/manual_pwm Check thermals # To check thermals, either exit linuxsh using the exit command and check the equipment\u0026rsquo;s status using show environment :\n# exit Connection closed by foreign host. (Routing) #show environment Temp (C)....................................... 37 Fan Speed, RPM................................. 3181 Fan Duty Level................................. 16% Temperature traps range: 0 to 45 degrees (Celsius) Temperature Sensors: Unit Sensor Description Temp (C) State Max_Temp (C) ---- ------ ---------------- ---------- -------------- -------------- 1 1 lm75_p2020 30 Normal 30 1 2 lm75_bcm56846 37 Normal 39 1 3 lm75_LIA 32 Normal 32 1 4 lm75_RIA 27 Normal 27 1 5 lm75_ROA 26 Normal 26 1 6 lm75_psuinlet1 27 Normal 33 1 7 lm75_psuinlet2 26 Normal 26 Fans: Unit Fan Description Type Speed Duty level State ---- --- -------------- --------- ------------- ------------- -------------- 1 1 Fan-1 Removable 3181 16% Operational 1 2 Fan-2 Removable 3186 16% Operational 1 3 Fan-3 Removable 3171 16% Operational 1 4 Fan-4 Removable 3215 16% Operational 1 5 Fan-5 Removable 3178 16% Operational Power Modules: Unit Power supply Description Type State ---- ------------ ---------------- ---------- -------------- 1 1 PS-1 Removable Operational 1 2 PS-2 Removable Operational Or read the contents of the *_temp files in the /sys/class/thermal folder.\n# cd /sys/class/thermal/ # ls LIA_temp bcm56846_temp fan3speed manual_pwm psu2_status RIA_temp fan1speed fan4speed p2020_temp psuinlet1_temp ROA_temp fan2speed fan5speed psu1_status psuinlet2_temp Since cat is not installed by default, I am using head as a replacement to quickly read those files.\n# head ROA_temp 28 Scripts are removed at reboot # I had written a small script to rewrite the
fan\u0026rsquo;s PWM after boot, which I had named rc.local and placed in /etc/init.d with execute permissions.\n#!/bin/sh echo 30 \u0026gt; /sys/class/thermal/manual_pwm exit 0 This script was confirmed to work when invoked manually from the shell.\nUnfortunately, after reboot not only did the changes not take effect, but the script itself was gone.\nThis may be a symptom that the root file system is getting mounted at boot from an image, and since I am modifying the mounted version and not the original one, my changes are not permanent. Finding a workaround for this will be the subject of a later post.\nClosing remarks # From initially receiving what amounted to a black box with no networking equipment knowledge, I now have a working switch, a better understanding of its internals, root access to its Linux shell, and network-equipment knowledge upgraded through troubleshooting and experimentation.\nMoving forward I plan to continue looking for a way to set the fan PWM automatically after boot, start experimenting by writing a few static routing tables, and maybe open an ssh tunnel to replace telnet.\nI would like to thank EmbeddedKen for helping me figure out the use of the Lattice FPGAs, ThomasC and reddit user bvcb907 for their very insightful answers from 3 years ago on the reddit thread related to this switch.\nResources # Redstone D2020 data brief\nDigital Diagnostic Monitoring Interface for SFP and SFP+ Optical Transceivers\nReddit : Redstone D2020 48x 10GbE SFP+ \u0026amp; 4x QSFP Switch???\nIs Broadcom’s chip powering Juniper’s Stratus?\nHIGH-CAPACITY STRATAXGS® ETHERNET SWITCH FAMILY WITH INTEGRATED 10G SERIAL PHY\nQorIQ® P2020 and P2010 Dual- and Single-Core Communications Processors\nList of MAC addresses with vendor identifiers.\nDell Command Line Reference Guide for the S4048–ON System 9.14.2.5\n","date":"2 October 2023","externalUrl":null,"permalink":"/projects/d2020_p1/","section":"Other projects","summary":"No documentation, no
problem","title":"Troubleshooting the redstone D2020","type":"projects"},{"content":" Hi, I\u0026rsquo;m Julia 👋 # I\u0026rsquo;m an RTL designer and serial tinkerer.\nMe and my little robot. I built this hexapod robot from scratch when I was 18; it was my first major project and it opened my eyes to the joy of figuring things out. Although I was born in France, I have lived in 6 different countries, and am currently living in the US.\nProfessionally, I have a Masters in Electrical and Computer Engineering, have previously worked as a CPU designer at ARM and an FPGA engineer at Optiver.\nIn this blog, I document my most recent projects, in the hope of helping others in their own endeavors.\nIf you like what I do, don\u0026rsquo;t hesitate to give me a star and a follow on github.\n","date":"2 August 2023","externalUrl":null,"permalink":"/about/","section":"Tales on the wire","summary":"","title":"About me","type":"page"},{"content":" Introduction # MoldUDP64 is a networking protocol used in the NASDAQ native market data feed.\nIt is built on top of UDP and its output feeds into the ITCH layer.\nAs such, I implemented this MoldUDP64 module as part of my NASDAQ HFT project.\nNASDAQ ITCH data feed network stack Due to the nature of this application, the design of this module is optimized for low latency. Power and area are secondary concerns.\nAs of today this module is fully combinational, and doesn\u0026rsquo;t require any pipelining.\nThis article aims to serve as an accessible introduction to my design process thus far when implementing this module. For more technical documentation, please refer to my github.\nEssenceia/MoldUPD64 RTL implementation of a MoldUPD64 receiver. Verilog 9 4 I do not have any professional experience in the domain, and as such my understanding may be flawed. If you see any mistakes please contact me, so that I can correct them. 
MoldUDP64 # Each client obtains the data feed from a NASDAQ server via UDP multicast.\nEach packet is only transmitted once. If a client misses a packet, it needs to detect this by itself, and make a retransmission request to the dedicated re-request NASDAQ server. This server will respond with the missing packets via UDP unicast.\nHigh level overview of the client architecture, not including many modules; the most relevant to this article are the ITCH and Trading algorithm modules. Payload format # The MoldUDP64 packet was designed with efficiency and minimal overhead in mind. As such its format is quite simple. MoldUDP64 packet Header # Each packet header contains a 10 byte session id, an 8 byte sequence number and a 2 byte message count.\nMoldUDP64 packet header The session id and sequence number fields are used to keep track of missing messages.\nThe session id keeps track of what sequence of messages we are currently receiving.\nEach message within a session is individually tracked using a unique sequence number. The sequence number of a header indicates the sequence number of the first message in the packet. The messages following this first message are implicitly numbered sequentially.\nThe message count tells us how many messages our packet will contain.\nAs such, we can predetermine the next sequence number we expect to receive.\nsequence_number_next = sequence_number + message_count If the sequence number of the next packet doesn\u0026rsquo;t match, either the packet has been re-ordered within the network and will arrive later, or we have missed a packet.\nMessages # Internally this packet can contain 0 or more messages, each of variable length. Each of these messages is preceded by a 2 byte length field, followed by length bytes of message data. 
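The header fields and next-sequence-number arithmetic described above can be sketched as a small C software model. This is illustrative only, not the RTL: the struct and function names are mine, and big-endian wire encoding of the numeric fields is an assumption based on the NASDAQ specification.

```c
#include <stdint.h>
#include <string.h>

/* MoldUDP64 header per the article: 10 byte session id,
 * 8 byte sequence number, 2 byte message count (20 bytes total).
 * Big-endian byte order on the wire is assumed here. */
typedef struct {
    uint8_t  session[10];
    uint64_t seq;
    uint16_t count;
} mold_hdr_t;

static uint64_t be64(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}

static mold_hdr_t parse_header(const uint8_t pkt[20])
{
    mold_hdr_t h;
    memcpy(h.session, pkt, 10);
    h.seq   = be64(pkt + 10);
    h.count = (uint16_t)((pkt[18] << 8) | pkt[19]);
    return h;
}

/* sequence_number_next = sequence_number + message_count */
static uint64_t next_seq(const mold_hdr_t *h)
{
    return h->seq + h->count;
}
```

A receiver detects a gap (or re-ordering) when the next packet's sequence number differs from next_seq of the previous one.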
MoldUDP64 message format Each message\u0026rsquo;s data will contain an ITCH message.\nArchitecture # Internally, our module receives new data via a point to point AXI stream interface, to which it is connected as a client, and outputs the message data to the ITCH module.\nSimplified architecture, showing the MoldUDP64 module\u0026rsquo;s connection to UDP for input data and to ITCH for output data. As long as the MoldUDP64 module is ready to accept new data, if any data is available, it will receive a new AXI payload at each cycle. The payload\u0026rsquo;s data will be transmitted on the axis_tdata bus and the validity of each data byte is indicated by the axis_tkeep.\nAs this implementation targets low latency, one of the major design goals is to always be ready to accept new data without any corner case.\nWe will be using an AXI stream data axis_tdata width of 64 bits for our illustration, but ultimately the goal is to make this parameterizable. Message overlap # As this implementation targets low latency, we want to be able to accept a new payload each cycle.\nAs of writing, 3 iterations of this module have been written, each aiming to improve latency.\nVersion 1 # In the first version, I wanted to have the ITCH message data aligned on the data stream\u0026rsquo;s width with no bubbles in the data, and have only one message getting sent to the ITCH module at a time.\nAdditionally, I wanted all message data bytes sent in a payload to be valid, with the last payload being the only exception.\nThis created a corner case when two different messages were on the same payload. Since we could only send one message at a time, we needed to back pressure the UDP module to leave time to purge the end of the previous message before receiving the new message.\nBack pressure when two messages arrive within the same payload. 
The purge spreads over multiple cycles To make matters worse, contrary to the example just above, I was also waiting to have accumulated a full payload\u0026rsquo;s worth of valid message data before sending it out.\nThe initial motivation for doing so was to simplify the logic on the ITCH side, as all payloads would be guaranteed to be 8 bytes wide with the exception of the last.\nAdditionally, it allows us to avoid applying back pressure in examples like the above, and start accumulating bytes to complete our output vector.\nIn practice this doesn\u0026rsquo;t allow us to get rid of back pressure as there is a limit to the number of bytes we can store in our flop, and we still ultimately need to resort to back pressure to purge it. The only reason there was no back pressure in the previous example is that I worked under the assumption that our flop was empty at t.\nAssembling a full 8 bytes of valid data before sending it to the ITCH module. We are assuming that the flop used to store the last message data byte is empty at t. As such this increases the complexity of the MoldUDP64 logic and doesn\u0026rsquo;t help with latency that much.\nVersion 2 # Instead of having only one ITCH module, I duplicated this module in order to start sending message data to the other ITCH module in a round robin fashion when the overlap occurs.\nAdditionally, since ITCH messages are guaranteed to be longer than the payload width, and we only mark the ITCH message as valid once all the bytes have been received, I added a multiplexer to the output of the ITCH modules to have a single ITCH message interface connected to the trading algorithm.\nBack pressure when two messages arrive within the same payload. 
The purge spreads over multiple cycles This implementation was ultimately scrapped as :\nthe demux inside the MoldUDP64 module, and the muxes between the ITCH and the trading algorithm module, add avoidable logic levels on this critical data feed path.\nthe duplicated logic increases wire delay due to its increased size.\nthe guarantee that only one ITCH module\u0026rsquo;s output would be valid at a time stopped being true when considering early decoded ITCH signals. These signals are used to send the data of fully received ITCH message fields from a still incomplete ITCH message such that the trading algorithm can start without having received the full ITCH message yet.\nVersion 3 # This third iteration, the most recent as of writing, aims to leverage the fact that the message data will always be larger than the width of the payload.\nIn order to make this version possible, I had to add some complexity to the logic of the ITCH module such that I could relax the requirements on the number of message data bytes that could be obtained from each payload.\nPreviously, I expected 8 bytes from every payload with the exception of the last.\nI have added a second outbound data interface used to store message data bytes in the event two messages overlap on the same payload.\nIt is called the ov interface, for \u0026ldquo;overlap\u0026rdquo;, and because of its nature, valid bytes on the overlap are always the first bytes of a new message.\nPayload containing data of two messages, having its data split onto both outbound MoldUDP64 interfaces. Because overlap only occurs when there is at least 1 byte of the previous message data in the payload, and the length field is 2 bytes, our overlap data is at most 5 bytes wide for an 8 byte payload. In this example N=8. Because the presence of an overlap coincides with the end of the previous message, and because I wanted to have only one ITCH module, within the ITCH module these bytes are flopped as the finishing message is drained. 
They are then appended to the start of the new ITCH message data. In these cases we will be writing more than 8 bytes of data per cycle. I will elaborate on this more in a future ITCH module write-up.\nOverlap signals are transmitted in parallel with normal message signals. There is only one ITCH module and a single interface between it and the Trading algorithm. Conclusion # There is still a lot of room for improvement. For instance, there is no guarantee that the payload size will remain 8 bytes. If it drops under 4 bytes, some corner cases like the overlap will cease to exist.\nThe next write-up will likely be on the ITCH module. If you wish to be notified when this article is published, shoot me an email.\nAdditional resources # NASDAQ MoldUDP64 v1.0 specification\nTotalView ITCH 5.0 specification\nAMBA 4 AXI4 stream protocol specification\n","date":"2 August 2023","externalUrl":null,"permalink":"/projects/moldudp64/","section":"Other projects","summary":"Discussing the design of the current MoldUDP64 module.","title":"MoldUDP64 RTL implementation","type":"projects"},{"content":"","date":"30 July 2023","externalUrl":null,"permalink":"/tags/aes/","section":"Tags","summary":"","title":"AES","type":"tags"},{"content":" Introduction # The Advanced Encryption Standard (AES) is a widely used block cipher encryption algorithm. One of my past projects called for the RTL implementation of a version of AES for both encoding and decoding. This blog post is a presentation of this Verilog project.\naes128 encryption simulation Essenceia/AES RTL implementaion of 128 bit Advanced Encryption Standard (AES) encyption algorithm C 11 1 This code wasn\u0026rsquo;t optimized for power, performance, area or hardened against side channel attacks. Advanced Encryption Standard (AES) # Before we begin, here is a quick introduction to the AES algorithm :\nThe AES algorithm is a symmetric block cipher that can encrypt (encipher) and decrypt (decipher) information. 
Encryption converts data to an unintelligible form called ciphertext; decrypting the ciphertext converts the data back into its original form, called plaintext.\nThe AES algorithm is capable of using cryptographic keys of 128, 192, and 256 bits to encrypt and decrypt data in blocks of 128 bits. These different “flavors” may be referred to as “AES-128”, “AES-192”, and “AES-256”.\nOur implementation is of the AES-128 flavor.\nOverview of AES # As mentioned earlier, AES is a block cipher algorithm; it encrypts/decrypts over multiple rounds. Each round will receive a 16 byte data block and a key, and generate a new version of this data block, as well as a new key for the next round.\nLet\u0026rsquo;s emphasize here that both the key and the data will be updated each round.\nThe key size determines the number of rounds. There are :\n10 rounds for a 128 bit key 12 rounds for the 192 bit key 14 rounds for the 256 bit key. Encrypting the data # Encrypting a plaintext data block to ciphertext is done by applying the following transforms : SubBytes, ShiftRows, MixColumns and AddRoundKey.\nThe basic AES-128 cryptographic architecture, credit Arrag, Sliman \u0026amp; Hamdoun, A. \u0026amp; Tragha, Abderrahim \u0026amp; Khamlich, Salaheddine For the initial round, only the AddRoundKey transform is applied.\nFor the middle rounds, all the transforms are performed.\nFor the final round, only the SubBytes, ShiftRows and AddRoundKey transforms are applied.\nWe will elaborate more on these transforms later in the article.\nInternally the 16 data bytes are mapped onto a 4x4 byte matrix.\nIn the article, we will be using the terms row and column to refer to the rows and columns of this matrix.\nInput byte data mapping onto a 4x4 byte matrix, source : FIPS-197 Announcing the ADVANCED ENCRYPTION STANDARD (AES) Key scheduling # During each encryption round, a new key is computed based on the round index, and on the key used during this round. 
This is called Key Expansion, and the algorithm that creates this new key for each round is called the Key Scheduler.\nThe new key is obtained by passing the last column of the old key through the following transforms : RotWord, SubWord, Rcon, then xor-ing the result with the first column\u0026rsquo;s original value, and then propagating this back through all the columns.\nAES key scheduling for 128-bit key, credit By Sissssou - Own work, CC BY-SA 4.0 Decryption overview # The AES algorithm can be inverted to perform the decoding operations. This is done by applying the inverse cipher transforms in reverse order.\nEncryption # SubBytes # The SubBytes transform is a byte substitution based on a table called the S-box.\nEach byte of the 16-byte data block will be substituted by its S-box equivalent.\nSubBytes applies the S-box to each byte of the data The substitution table is as follows, with the x and y indexes corresponding to the hexadecimal value of the xy data to be substituted.\nS-box : substitute values for the byte xy (in hexadecimal format). Most implementations store this table in memory and access it using the value of the byte to be substituted as an offset.\nBecause we are implementing this functionality in hardware and would ideally like each round to only take 1 cycle, using this method would force us to implement 16 memories, each 256 entries deep and 8 bits wide, to perform this operation in parallel. 
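As an aside, for a software golden model the 256-entry table need not be stored at all: the S-box can be generated from its mathematical definition, the multiplicative inverse in GF(2^8) followed by an affine transform. The C sketch below uses the well-known exp/log-walk construction (it is not the Boyar-Peralta circuit used in the RTL, just a compact way to produce the same table):

```c
#include <stdint.h>

#define ROTL8(x, shift) ((uint8_t)(((x) << (shift)) | ((x) >> (8 - (shift)))))

static uint8_t sbox[256];

/* Walk the field: p is repeatedly multiplied by the generator 3 while q is
 * divided by 3, so q is always the multiplicative inverse of p in GF(2^8).
 * The affine transform is then applied to q to produce sbox[p]. */
static void init_sbox(void)
{
    uint8_t p = 1, q = 1;
    do {
        /* p *= 3 in GF(2^8), reducing by the AES polynomial 0x11b */
        p = p ^ (uint8_t)(p << 1) ^ ((p & 0x80) ? 0x1b : 0);
        /* q /= 3 in GF(2^8) */
        q ^= (uint8_t)(q << 1);
        q ^= (uint8_t)(q << 2);
        q ^= (uint8_t)(q << 4);
        if (q & 0x80)
            q ^= 0x09;
        /* affine transform */
        sbox[p] = (uint8_t)(q ^ ROTL8(q, 1) ^ ROTL8(q, 2) ^
                            ROTL8(q, 3) ^ ROTL8(q, 4) ^ 0x63);
    } while (p != 1);
    sbox[0] = 0x63; /* 0 has no inverse; defined as 0x63 */
}
```

This reproduces the FIPS-197 table, e.g. the spec's worked example S(0x53) = 0xed.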
To avoid that cost, we looked for a more efficient way.\nIt turns out that the S-box logic can be minimized, as shown by Boyar and Peralta in their 2009 paper A new combinational logic minimization technique with applications to cryptology.\nOur S-box is a translation of the circuit they proposed in this paper.\nThe result is far from being human readable, but the produced output matches perfectly with the substitution table and is much cheaper logic-wise.\nIf the reader is as doubtful of the logic\u0026rsquo;s equivalence as I was when I implemented it, they can take a look at the test bench I wrote to verify that equivalence.\nLink to code ShiftRows # The ShiftRows transform performs a left cyclical byte shift of each data row, with an offset based on the row\u0026rsquo;s index.\nCyclical left shift of the data rows : FIPS-197 Announcing the ADVANCED ENCRYPTION STANDARD (AES) This transform is easily implementable in hardware.\nLink to code MixColumns # Unlike what this transform\u0026rsquo;s name might suggest, it isn\u0026rsquo;t as simple as shuffling columns.\nTransform operates on the data column-by-column, source : FIPS-197 Announcing the ADVANCED ENCRYPTION STANDARD (AES) This transform takes each column, and treats it as a 4 term polynomial. 
This polynomial is then multiplied by a constant 4x4 matrix.\nMix column matrix multiplication, source : FIPS-197 Announcing the ADVANCED ENCRYPTION STANDARD (AES) The catch here is that these operations are done in a Galois field and as such, the meanings of the \u0026ldquo;sum\u0026rdquo; and \u0026ldquo;product\u0026rdquo; operations are not the usual ones.\nGalois field arithmetic, source : FIPS-197 Announcing the ADVANCED ENCRYPTION STANDARD (AES) Because we are dealing with Galois field arithmetic, all these operations can be implemented using basic xor \\(\\oplus\\) and and \\(\\bullet\\) operations, and again, translate easily into hardware.\nLink to code AddRoundKey # This transform is a simple xor between each data column and the corresponding column of the current round\u0026rsquo;s key.\nxor each column of the data with the corresponding column of the key, source : FIPS-197 Announcing the ADVANCED ENCRYPTION STANDARD (AES) The key is obtained for each round as a result of the key scheduling transforms.\nLink to code RotWord # This transform is part of the key scheduling and consists of a simple one byte left cyclical shift (rotation) of the last column of the key.\n$$ [ a_{0}, a_{1}, a_{2}, a_{3} ] \\to [a_{1}, a_{2}, a_{3}, a_{0} ] $$Link to code SubWord # SubWord is the key scheduling\u0026rsquo;s equivalent of the SubBytes step described earlier. 
It performs a byte substitution using the S-box, but this time on the last column of the key.\nAs such, our implementation re-uses the same S-box module as the SubBytes transform : Rcon # This transform involves xor-ing the last byte of the last column of the key with a constant whose value depends on the index of the current round.\nIn AES-128, the constants are the following :\nround 1 2 3 4 5 6 7 8 9 10 constant (hex) 8\u0026rsquo;h01 8\u0026rsquo;h02 8\u0026rsquo;h04 8\u0026rsquo;h08 8\u0026rsquo;h10 8\u0026rsquo;h20 8\u0026rsquo;h40 8\u0026rsquo;h80 8\u0026rsquo;h1b 8\u0026rsquo;h36 constant (bin) 8\u0026rsquo;b1 8\u0026rsquo;b10 8\u0026rsquo;b100 8\u0026rsquo;b1000 8\u0026rsquo;b10000 8\u0026rsquo;b100000 8\u0026rsquo;b1000000 8\u0026rsquo;b10000000 8\u0026rsquo;b11011 8\u0026rsquo;b110110 Looking at the binary representation we can see a pattern emerge :\nfrom round 1 to 8 Rcon is a 1 bit left shift.\nafter round 8 Rcon overflows, its new value gets set to 8\u0026rsquo;h1b and the pattern of shifting left by 1 bit continues.\nOur implementation for obtaining the next Rcon is based on this simple observation. Decryption # As mentioned earlier, the decryption procedure only consists of applying the inverse of the encoding transforms, in reverse order.\nAES-128 decryption, credit braincoke The inverted transforms being closely related to the original transforms, we will not elaborate on their behavior in this article.\nHowever, the interested reader can find their implementations using the following links.\nInvSubBytes InvShiftRows InvMixColumns InvRotWord InvRcon Testing # In order to test the correctness of our AES implementation, our test bench compares the result of our simulation against the output of a golden model coded in C.\nFor more information, as well as instructions on how to run the test bench, please see the README.\nResources # If after reading this article, the reader desires a more in depth explanation of the AES algorithm, I would recommend reading the excellent 
write-up on this topic at braincoke. In particular the articles covering :\nencryption,\ndecryption\nkey scheduling.\nOfficial AES specification, link to pdf : Federal Information Processing Standard Publication 197 - Specification for the ADVANCED ENCRYPTION STANDARD (AES)\n","date":"30 July 2023","externalUrl":null,"permalink":"/projects/aes/","section":"Other projects","summary":"Learn more about me and why I am starting this blog.","title":"AES 128b RTL implementation","type":"projects"},{"content":" Introduction # In this project we have implemented a feature-reduced version of the BLAKE2 cryptographic hash function in synthesizable RTL using Verilog.\nBLAKE2b hash simulation wave view: it takes 12 cycles to produce a result for one block. Essenceia/Blake2 Blake2 RTL implementation Verilog 5 1 BLAKE2 # BLAKE2 is specified in RFC7693.\nThe algorithm receives plaintext data, breaks it down into blocks of 32 or 64 bytes and produces a hash of 16 or 32 bytes for each block.\nIn practice BLAKE2 is used in applications ranging from password hashing to proof of work for cryptocurrencies.\nThere are 2 main flavors of BLAKE2:\nblock size (bytes) hash size (bytes) BLAKE2b 64 32 BLAKE2s 32 16 Our code is written in a parametric fashion, to support both the b and s flavors.\nThis hash function works on individual blocks of data or on data streams.\nThis code was written to be configured as both the b and s variants, but only the b variant has been thoroughly tested thus far. This implementation does not currently support secret keys or streaming data for compression : it only accepts one block at a time. Function overview # BLAKE2 takes plaintext data, breaks it down into blocks of 32 or 64 bytes, and passes each of these blocks through the compression function. The main loop in this function includes the permutation function and the mixing function; this loop will be called 10 or 12 times. 
flowchart TD subgraph T0[\" \"] B--\u003eI[for each message block] F--\u003eG[end for] I--\u003eJ; G--\u003eI subgraph T1[\" \"] J[Init block]--\u003eE[for round=0..N]; E--\u003eC[Permutation]; C--\u003eD[Mixing]; D--\u003eF[end for]; F--\u003eE; end end A(Plaintext)--\u003eB[Initialize algorithm]; G--\u003eH(Hash); click C \"#permutation function\" _blank click D \"#mixing function\" _blank style T0 fill:#8b5cf6; style T1 fill:#d946ef; The number of rounds is dependent on the flavor of BLAKE2 :\nBLAKE2b BLAKE2s rounds 12 10 Permutation function # Within the compression function loop, at the start of each round, we calculate a new 16 entry wide selection array s based on a predefined pattern shown below :\nIndex 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 SIGMA[0] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 SIGMA[1] 14 10 4 8 9 15 13 6 1 12 0 2 11 7 5 3 SIGMA[2] 11 8 12 0 5 2 15 13 10 14 3 6 7 1 9 4 SIGMA[3] 7 9 3 1 13 12 11 14 2 6 5 10 4 0 15 8 SIGMA[4] 9 0 5 7 2 4 10 15 14 1 11 12 6 8 3 13 SIGMA[5] 2 12 6 10 0 11 8 3 4 13 7 5 15 14 1 9 SIGMA[6] 12 5 1 15 14 13 4 10 0 7 6 3 9 2 8 11 SIGMA[7] 13 11 7 14 12 1 3 9 5 0 15 4 8 6 2 10 SIGMA[8] 6 15 14 9 11 3 0 8 12 2 13 7 1 4 10 5 SIGMA[9] 10 2 8 4 7 6 1 5 15 11 9 14 3 12 13 0 This array s is then used to select the indexes used to access data from our message block vector m.\nIn the C language, this operation can be written simply as m[sigma[round][x]], where round is the current round and x the initial array index.\nIn hardware, this function is admittedly the most costly component of this entire algorithm as it requires a lot of muxing logic :\nOne 64 bit wide, 10 deep mux to select, depending on which round we are at, the correct s select values. Link to code 16 64 bit wide, 16 deep muxes used to assign the new values of array m, using values of s for the select. Link to code Mixing function # This function is also part of the compression function loop. It is referred to in the official specification as G. 
It takes in six words (four words of the working vector v plus the two message words x and y) and produces four new words.\nFUNCTION G(v[0..15], a, b, c, d, x, y) | | v[a] := (v[a] + v[b] + x) mod 2**w | v[d] := (v[d] ^ v[a]) \u0026gt;\u0026gt;\u0026gt; R1 | v[c] := (v[c] + v[d]) mod 2**w | v[b] := (v[b] ^ v[c]) \u0026gt;\u0026gt;\u0026gt; R2 | v[a] := (v[a] + v[b] + y) mod 2**w | v[d] := (v[d] ^ v[a]) \u0026gt;\u0026gt;\u0026gt; R3 | v[c] := (v[c] + v[d]) mod 2**w | v[b] := (v[b] ^ v[c]) \u0026gt;\u0026gt;\u0026gt; R4 | | RETURN v[0..15] | END FUNCTION. Internally it is composed of simple operations :\n3 way unsigned integer add modulo 2**w (w = 32 or 64) circular right shift xor The G function is called in the compression loop, with the following arguments:\nv := G(v, 0, 4, 8, 12, m[s[ 0]], m[s[ 1]]) v := G(v, 1, 5, 9, 13, m[s[ 2]], m[s[ 3]]) v := G(v, 2, 6, 10, 14, m[s[ 4]], m[s[ 5]]) v := G(v, 3, 7, 11, 15, m[s[ 6]], m[s[ 7]]) v := G(v, 0, 5, 10, 15, m[s[ 8]], m[s[ 9]]) v := G(v, 1, 6, 11, 12, m[s[10]], m[s[11]]) v := G(v, 2, 7, 8, 13, m[s[12]], m[s[13]]) v := G(v, 3, 4, 9, 14, m[s[14]], m[s[15]]) In practice we have already obtained the values of x and y from having calculated the values for m[s[x]] as part of the permutation function.\nThe values of a, b, c and d are constants.\nValues for R1, R2, R3, and R4 depend on the flavor of BLAKE2 we are implementing.\nBLAKE2b BLAKE2s R1 32 16 R2 24 12 R3 16 8 R4 63 7 As such, this function easily maps onto hardware with minimal cost. 
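For reference, the G pseudocode above translates almost line for line into C. A sketch for the b flavor only: using uint64_t makes the mod 2**64 additions come for free, and the rotation amounts R1..R4 = 32, 24, 16, 63 are the BLAKE2b constants from RFC 7693.

```c
#include <stdint.h>

/* 64-bit rotate right: the ">>>" of the specification. */
#define ROTR64(x, n) (((x) >> (n)) | ((x) << (64 - (n))))

/* BLAKE2b mixing function G (RFC 7693). Additions on uint64_t are
 * naturally mod 2**64; R1..R4 = 32, 24, 16, 63 for the b flavor. */
static void g(uint64_t v[16], int a, int b, int c, int d,
              uint64_t x, uint64_t y)
{
    v[a] = v[a] + v[b] + x;
    v[d] = ROTR64(v[d] ^ v[a], 32); /* R1 */
    v[c] = v[c] + v[d];
    v[b] = ROTR64(v[b] ^ v[c], 24); /* R2 */
    v[a] = v[a] + v[b] + y;
    v[d] = ROTR64(v[d] ^ v[a], 16); /* R3 */
    v[c] = v[c] + v[d];
    v[b] = ROTR64(v[b] ^ v[c], 63); /* R4 */
}
```

The eight G calls per round then simply pass the constant a, b, c, d index sets and the permuted message words m[s[…]] shown above.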
Link to code Testing # To test our implementation, we compare the output of our simulated implementation with the test vector produced by a golden model.\nIn this case our golden model was the C implementation of BLAKE2 found in the appendix of the RFC7693 specification.\nFor more instructions on running the test bench, see the README.\nResources # BLAKE2 specification RFC7693\n","date":"29 July 2023","externalUrl":null,"permalink":"/projects/blake2/","section":"Other projects","summary":"Learn more about me and why I am starting this blog.","title":"BLAKE2 RTL implementation","type":"projects"},{"content":" DFT: Designed For Trouble # Mandatory JTAG instructions # The four IEEE 1149.1 defined mandatory JTAG instructions are IDCODE, BYPASS, SAMPLE/PRELOAD, and EXTEST.\nBoundary Scan # The EXTEST instruction is used for sampling external pins and loading output pins with data. The data from the output latch will be driven out on the pins as soon as the EXTEST instruction is loaded into the JTAG IR-register. Therefore, the SAMPLE/PRELOAD should also be used for setting initial values to the scan ring, to avoid damaging the board when issuing the EXTEST instruction for the first time. SAMPLE/PRELOAD can also be used for taking a snapshot of the external pins during normal operation of the part.\nsource : https://onlinedocs.microchip.com/oxy/GUID-74F8229E-4C43-4FA0-BE7D-1AA303C6F8A4-en-US-6/GUID-86F1AB9A-120D-42D4-8B59-7A9F04E34236.html?hl=extest\nEXTEST # The active states are:\nCapture-DR: Data on the external pins are sampled into the Boundary-scan Chain. Shift-DR: The Internal Scan Chain is shifted by the TCK input. Update-DR: Data from the scan chain is applied to output pins. 
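The Capture-DR / Shift-DR / Update-DR sequence above can be made concrete with a small bit-banged TAP walk. The C sketch below is hypothetical (a real driver would also pulse TCK and drive TDI; here jtag_clock just records TMS); only the TMS sequence, which follows the IEEE 1149.1 TAP state machine, is the point.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical bit-bang layer: record the TMS walk instead of
 * toggling real pins, so the sequence can be inspected. */
static uint8_t tms_log[64];
static size_t  tms_len;
static void jtag_clock(int tms) { tms_log[tms_len++] = (uint8_t)tms; }

/* From Run-Test/Idle, reach Shift-DR. Passing through Capture-DR is what
 * samples the external pins into the boundary-scan chain under EXTEST. */
static void enter_shift_dr(void)
{
    jtag_clock(1); /* Run-Test/Idle  -> Select-DR-Scan */
    jtag_clock(0); /* Select-DR-Scan -> Capture-DR     */
    jtag_clock(0); /* Capture-DR     -> Shift-DR       */
}

/* Leave Shift-DR and latch the shifted data onto the output pins. */
static void update_dr(void)
{
    jtag_clock(1); /* Shift-DR  -> Exit1-DR (clocks the last data bit) */
    jtag_clock(1); /* Exit1-DR  -> Update-DR                           */
    jtag_clock(0); /* Update-DR -> Run-Test/Idle                       */
}
```

While in Shift-DR, each additional TCK pulse with TMS held low shifts one bit through the internal scan chain.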
Vivado emulation # Driving multiple clocks means finding multiple clock-capable pins\nVivado% get_package_pins -filter {IS_CLK_CAPABLE == 1 \u0026amp;\u0026amp; BANK == 14} M18 M19 L17 K17 N17 P17 P18 R18 Bank 14 is the Basys3 PmodC port.\nResources # AVR JTAG System Overview : https://onlinedocs.microchip.com/oxy/GUID-74F8229E-4C43-4FA0-BE7D-1AA303C6F8A4-en-US-6/GUID-86F1AB9A-120D-42D4-8B59-7A9F04E34236.html?hl=extest\n","externalUrl":null,"permalink":"/projects/dft_asic/jtag/","section":"Other projects","summary":"","title":"","type":"projects"},{"content":"Power connection wasn\u0026rsquo;t being made.\nI am using the default hierarchical macro integration. My design\u0026rsquo;s main power is on TopMetal1, and I was expecting SRAM power ports to be on layer N-1, so Metal5. Yet looking at the SRAM macro lef file shows me that power is actually expected on Metal4.\nhttps://github.com/IHP-GmbH/IHP-Open-PDK/blob/a2bf8ea81aee7d0fcdd6d62168edca0d7d0bcb08/ihp-sg13g2/libs.ref/sg13g2_sram/lef/RM_IHPSG13_1P_256x8_c3_bm_bist.lef#L146C1-L153C8\nGood problem to have, since this means I can integrate SRAM as a 2 layer deep macro. Problem is: how do I drop the expected power grid one more layer ?\nCurrently the OpenROAD power delivery network grid builder doesn\u0026rsquo;t find the pad:\n[09:20:27] WARNING [PDN-0232] The grid \u0026#34;macro - m_ihp_sram\u0026#34; (Instance) does not contain any shapes or vias. openroad.py:297 [09:20:27] ERROR [PDN-0233] Failed to generate full power grid. This is only made more puzzling to me as I think I am already creating a connection:\nadd_global_connection \\ -net $power_net \\ -inst_pattern $instance_name \\ -pin_pattern $power_pin \\ -power Global connection might not mean what I think it means\u0026hellip;\nI asked for help un-stupidding myself on the Tiny Tapeout discord and tnt jumped in pointing me to this mismatch in power connection layers. 
Then mole99 also dropped by and pointed me to the crown jewel I was looking for: an example of a full IHP chip (using the new librelane chip flow) that just so happened to be using the SAME SRAM MACRO !!!!\nAlright, calm my excitement, not exactly the same macro, but one of the same family, aka: close enough!\nAnd guess what, he has a custom PDN tcl script for setting up the power delivery to these SRAM macros, which I will now proceed to unashamedly rip off. https://github.com/IHP-GmbH/ihp-sg13g2-librelane-template/blob/da17746e19984826dd780ce778b6bb40dbf54544/librelane/config.yaml#L196\nOh yeah baby, here we go! We have found the magical missing ingredient:\ndefine_pdn_grid \\ -macro \\ -instances \u0026#34;\\ i_chip_core.sram_0\u0026#34; \\ -name sram_NS \\ -starts_with POWER add_pdn_stripe \\ -grid sram_NS \\ -layer Metal5 \\ -width 2.81 \\ -pitch 11.24 \\ -offset 2.81 \\ -spacing 2.81 \\ -nets \u0026#34;VSS VDD\u0026#34; \\ -starts_with POWER add_pdn_connect \\ -grid sram_NS \\ -layers \u0026#34;Metal4 Metal5\u0026#34; add_pdn_connect \\ -grid sram_NS \\ -layers \u0026#34;Metal5 TopMetal1\u0026#34; My version :\ndefine_pdn_grid \\ -macro \\ -instances \u0026#34;m_ihp_sram\u0026#34; \\ -name sram_NS \\ -starts_with POWER add_pdn_stripe \\ -grid sram_NS \\ -layer Metal5 \\ -width 2.81 \\ -pitch 11.24 \\ -offset 2.81 \\ -spacing 2.81 \\ -nets \u0026#34;VGND VPWR\u0026#34; \\ -starts_with POWER add_pdn_connect \\ -grid sram_NS \\ -layers \u0026#34;Metal4 Metal5\u0026#34; add_pdn_connect \\ -grid sram_NS \\ -layers \u0026#34;Metal5 TopMetal1\u0026#34; So what is this party all about ?\nDefining a new power grid # Create a new power grid over the m_ihp_sram macro called sram_NS, with the first strap being POWER (default is GROUND). 
The reference to the macro instance is used to define this new power grid\u0026rsquo;s area.\ndefine_pdn_grid \\ -macro \\ -instances \u0026#34;m_ihp_sram\u0026#34; \\ -name sram_NS \\ -starts_with POWER Adding stripes # Recall how the macro expects power on Metal4 and we have power on TopMetal1 ? This step adds a power grid to the intermediary Metal5 layer to help bridge the gap.\nDefine a pattern of the power and ground stripes to be added to Metal5 in the sram_NS grid. The width, pitch, offset and spacing parameters are used to define the straps topology. They must be sized wide enough and be plentiful enough to provide sufficient power across the entire macro, even during heavy load (hello IR drop, I did not miss you).\nThese stripes are likely oversized to prevent any headaches.\nadd_pdn_stripe \\ -grid sram_NS \\ -layer Metal5 \\ -width 2.81 \\ -pitch 11.24 \\ -offset 2.81 \\ -spacing 2.81 \\ -nets \u0026#34;VGND VPWR\u0026#34; \\ -starts_with POWER Static IR drop analysis shows we are A OK, though I am forced to recognize that a complete analysis would also encompass a dynamic IR drop analysis.\nVPWR report :\n########## IR report ################# Net : VPWR Corner : nom_typ_1p20V_25C Total power : 3.05e-04 W Supply voltage : 1.20e+00 V Worstcase voltage: 1.20e+00 V Average voltage : 1.20e+00 V Average IR drop : 1.25e-06 V Worstcase IR drop: 9.84e-05 V Percentage drop : 0.01 % ###################################### VGND report :\n########## IR report ################# Net : VGND Corner : nom_typ_1p20V_25C Total power : 3.05e-04 W Supply voltage : 0.00e+00 V Worstcase voltage: 9.42e-05 V Average voltage : 1.22e-06 V Average IR drop : 1.22e-06 V Worstcase IR drop: 9.42e-05 V Percentage drop : 0.01 % ###################################### Connecting the sram power grid to main power grid # In this process, there is no district level authority to inspect my installation before connecting it to the grid, unlike if I was to install a solar array on my roof.\nAll 
that is left is to call add_pdn_connect to hook up the VPWR and VGND power nets from my newly defined sram_NS grid to the corresponding power nets on Metal4 and TopMetal1.\nadd_pdn_connect \\ -grid sram_NS \\ -layers \u0026#34;Metal4 Metal5\u0026#34; add_pdn_connect \\ -grid sram_NS \\ -layers \u0026#34;Metal5 TopMetal1\u0026#34; Results # TopMetal1, represented in light blue, is the project\u0026rsquo;s main power grid extending over all of the core, and is now connected to our newly defined sram_NS grid.\nIn red we can see the custom straps we have added to Metal5, now bridging TopMetal1 and the SRAM macro\u0026rsquo;s own power straps on Metal4, represented in green.\n","externalUrl":null,"permalink":"/projects/ihp_sram/notes/","section":"Other projects","summary":"","title":"","type":"projects"},{"content":"","externalUrl":null,"permalink":"/authors/","section":"Authors","summary":"","title":"Authors","type":"authors"},{"content":"","externalUrl":null,"permalink":"/categories/","section":"Categories","summary":"","title":"Categories","type":"categories"},{"content":"Documenting my attempt at building a NASDAQ compatible HFT FPGA from scratch in my living room.\n","externalUrl":null,"permalink":"/hft/","section":"HFT","summary":"","title":"HFT","type":"hft"},{"content":"","externalUrl":null,"permalink":"/series/","section":"Series","summary":"","title":"Series","type":"series"}]