Audio Hacking On The ESP8266 – Part 2

With what we learned about the ESP8266 in the first part of Audio Hacking On The ESP8266, we now move on to building a sample-playing keyboard, or rompler.

Unlike our drum sampler, this one plays the sample chromatically and polyphonically.

The EMU-II in the picture above used 8-bit DPCM samples at the odd sample rate of 27.7 kHz. To keep things simple we are going to use 16-bit signed samples at a rate of 32 kHz.

The EMU-II was 8-voice polyphonic, but I’m no good at writing voice assigners, so our sampler will be 128-voice fully polyphonic.

In theory you can play all 128 MIDI keys at once, each with its own envelope, but in practice the polyphony will be lower because processing power is limited.

Our definitions

#include <Arduino.h> 
#include "ESP8266WiFi.h"
#include <i2s.h>
#include <i2s_reg.h>
#include <pgmspace.h>
#include <Ticker.h>
uint32_t i2sACC;
uint8_t i2sCNT=32;
uint16_t DAC=0x8000;
uint16_t err;

//Envelope and VCA parameters
volatile uint8_t ENVcnt=8; //16mS env resolution
int16_t VCA[128]; //VCA levels
volatile uint8_t ATTACK=30; // ENV Attack rate 1-255
volatile uint8_t RELEASE=3; // ENV Release rate 1-255

//Sample parameters and tables
uint32_t FREQ[128]; //Phase accumulators
uint32_t SPNT[128]; //Sample pointers
uint32_t LOOP1[128]; //Start of loop segment in sample
uint32_t LOOP2[128]; //End of loop segment in sample
uint32_t SLEN[128]; //Length of sample

As you can see, each parameter is a table with 128 entries, one per MIDI key.

The parameters for each key on the keyboard are:

  • Phase accumulator (explained later)
  • Sample pointer (the linear counter for time inside the sample)
  • LOOP1 (the starting point of a sustain loop)
  • LOOP2 (the end point of the loop, where playback jumps back to LOOP1)
  • Length (the total length of the sample in 16-bit words)

The setup

This is our setup routine using MIDI input to play the samples.

void setup() {
  WiFi.forceSleepBegin(); //Turn off WiFi radio
  delay(1); //Wait for it to turn off
  Serial.begin(31250); //Start the serial port with default MIDI baudrate
  Serial.swap(); //Move the TX and RX GPIOs to 15 and 13
  i2s_begin(); //Start the i2s DMA engine
  i2s_set_rate(32000); //Set sample rate
  pinMode(2, INPUT); //restore GPIOs taken by i2s
  pinMode(15, INPUT);
  timer1_attachInterrupt(onTimerISR); //Attach our sampling ISR
  timer1_enable(TIM_DIV16, TIM_EDGE, TIM_SINGLE);
  timer1_write(2000); //Service at 2mS interval
}

It turns off the WiFi radio, bumps the CPU frequency up to 160 MHz, sets up the UART for MIDI, starts the i2s DMA engine and turns on the timer.

We also need the Timer interrupt that takes care of loading the DMA at a 2mS interval.

void ICACHE_RAM_ATTR onTimerISR(){ //Code needs to be in IRAM because it's an ISR
  while (!(i2s_is_full())) { //Don’t block the ISR if the buffer is full
    DAC=samplerTick(); //Calculate current sample value
    //Pulse Density Modulated 16-bit I2S DAC
    for (uint8_t i=0;i<32;i++) {
      i2sACC=i2sACC<<1; //Make room for the next output bit
      if(DAC >= err) {
        i2sACC|=1; //Output a 1 and carry the remainder forward
        err += 0xFFFF-DAC;
      } else {
        err -= DAC; //Output a 0
      }
    }
    bool flag=i2s_write_sample(i2sACC); //Queue the 32 bits for DMA
  }

  //Envelope handler
  if (!(ENVcnt--)) { //Calculate ENV every 16mS
    ENVcnt=8; //Restart the divider
    for (uint8_t envcnt=0;envcnt<128;envcnt++) { //128 VCA's
      if ((MIDItable[envcnt]>0)&&(VCA[envcnt]<255)) { //Key held: attack phase
        VCA[envcnt]+=ATTACK;
        if (VCA[envcnt]>255) VCA[envcnt]=255;
      }
      if ((MIDItable[envcnt]==0)&&(VCA[envcnt]>0)) { //Key released: release phase
        VCA[envcnt]-=RELEASE;
        if (VCA[envcnt]<0) VCA[envcnt]=0;
      }
    }
  }
  timer1_write(2000);//Next in 2mS
}
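
The pulse-density loop in the handler is easy to try on a desktop compiler. The sketch below pulls it out into a standalone function; the names pdmWord and onesIn are made up for this illustration and are not part of the ESP8266 libraries, and the function assumes the same 32-bits-per-sample scheme as the handler.

```cpp
#include <cstdint>

// Hypothetical helper: turn one 16-bit DAC value into 32 one-bit output slots.
// err carries the running error from one call to the next, like the global err.
uint32_t pdmWord(uint16_t dac, uint16_t &err) {
  uint32_t word = 0;
  for (uint8_t i = 0; i < 32; i++) {
    word <<= 1;                 // make room for the next bit
    if (dac >= err) {
      word |= 1;                // emit a 1
      err += 0xFFFF - dac;      // cannot overflow because err <= dac here
    } else {
      err -= dac;               // emit a 0, cannot underflow because err > dac
    }
  }
  return word;
}

// Hypothetical helper: count the 1 bits in a word.
int onesIn(uint32_t w) {
  int n = 0;
  while (w) { n += w & 1; w >>= 1; }
  return n;
}
```

The louder the DAC value, the more 1 bits end up in the word: a full-scale input gives all ones, and a mid-scale input gives roughly half ones.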

Inside the Timer handler we also run our envelope generators at a 16mS interval.

Each key has its own Attack/Release volume envelope. Sustain is always at full level.
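
To get a feel for the ATTACK and RELEASE values, here is a minimal desktop sketch of one envelope step for a single key. The function names envelopeStep and stepsToSettle are invented for this example, and the ramp shape is assumed from the clamps to 255 and 0 in the timer handler.

```cpp
#include <cstdint>

// Hypothetical helper: one envelope update for one key.
// vca is the current level (0..255), keyDown is true while the key is held.
int16_t envelopeStep(int16_t vca, bool keyDown, uint8_t attack, uint8_t release) {
  if (keyDown && vca < 255) {
    vca += attack;              // ramp up while the key is held
    if (vca > 255) vca = 255;   // sustain at full level
  }
  if (!keyDown && vca > 0) {
    vca -= release;             // ramp down after the key is released
    if (vca < 0) vca = 0;
  }
  return vca;
}

// Hypothetical helper: how many envelope steps until the level stops changing.
int stepsToSettle(int16_t vca, bool keyDown, uint8_t attack, uint8_t release) {
  int steps = 0;
  for (;;) {
    int16_t next = envelopeStep(vca, keyDown, attack, release);
    if (next == vca) return steps;
    vca = next;
    steps++;
  }
}
```

With ATTACK=30 a key reaches full level in 9 steps, and with RELEASE=3 it fades from full level to silence in 85 steps.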

MIDI handler

Our loop() checks whether serial data is available and, if so, runs the MIDI processor.

void loop() {
  if (Serial.available()) processMIDI(Serial.read()); //Pass each received byte to the MIDI processor
}

If MIDI data is available it processes it.

void processMIDI(uint8_t MIDIRX) { //MIDI processor
  /*
  Handling “Running status”
  1. Buffer is cleared (ie, set to 0) at power up.
  2. Buffer stores the status when a Voice Category Status (ie, 0x80 to 0xEF) is received.
  3. Buffer is cleared when a System Common Category Status (ie, 0xF0 to 0xF7) is received.
  4. Nothing is done to the buffer when a RealTime Category message is received.
  5. Any data bytes are ignored when the buffer is 0.
  */

  if ((MIDIRX>0xBF)&&(MIDIRX<0xF8)) {
    MIDIRUNNINGSTATUS=0;
    MIDISTATE=0;
    return;
  }
  if (MIDIRX>0xF7) return;
  if (MIDIRX & 0x80) {
  if (MIDIRX < 0x80) {
    if (!MIDIRUNNINGSTATUS) return;
    if (MIDISTATE==1) {
    if (MIDISTATE==2) {



The handleMIDInoteON/OFF handlers write to the MIDI mapping table, which shows which keys are depressed.
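
The running-status rules can be tried without any hardware. The following is only a sketch of how such a parser might look, not the code from this project: MidiParser, feed and keyDown are names invented here, the MIDI channel is ignored, and a Note On with velocity 0 is treated as a Note Off, which is the usual MIDI convention.

```cpp
#include <cstdint>

// Hypothetical stand-alone parser following the running-status rules above.
struct MidiParser {
  uint8_t runningStatus = 0;   // 0 means no status stored
  uint8_t state = 0;           // 1 = expecting note number, 2 = expecting velocity
  uint8_t note = 0;
  uint8_t keyDown[128] = {};   // velocity of each held key, 0 = released

  void feed(uint8_t rx) {
    if (rx > 0xF7) return;                                    // real-time: leave the buffer alone
    if (rx > 0xBF) { runningStatus = 0; state = 0; return; }  // unsupported status: clear the buffer
    if (rx & 0x80) { runningStatus = rx; state = 1; return; } // 0x80..0xBF: store the status
    if (!runningStatus) return;                               // data byte with no status: ignore
    if (state == 1) { note = rx; state = 2; return; }         // first data byte
    state = 1;                                                // second data byte, ready for running status
    uint8_t type = runningStatus & 0xF0;
    if (type == 0x90 && rx > 0) keyDown[note] = rx;           // Note On
    else if (type == 0x80 || type == 0x90) keyDown[note] = 0; // Note Off, or Note On with velocity 0
  }
};
```

Feeding it the bytes 0x90, 60, 100, 62, 90 turns on two keys with a single status byte, which is exactly the case running status exists for.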

The sample engine

This is where all the sample counters are handled and all the different samples are looped and summed.

Each key's frequency differs from its neighbour's by a factor of the twelfth root of two, about 1.0595. Since every key is processed at the same 32 kHz rate, how do we get the right frequency for each key?

The answer is the phase accumulator. It counts fractions of one sample tick.

The counters are 31 bits wide and live in 32-bit words, with bit 31 acting as the overflow flag. For the highest key we add 0x80000000 on every tick, and whenever the counter overflows into bit 31 the sample pointer advances one step.

For one octave below we add 0x40000000, which overflows every second tick and gives half the frequency. One semitone below is 0x80000000 / 1.0595, and so on.
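
The increments do not have to be worked out by hand. As a rough sketch, the helper below computes them on a desktop machine; phaseIncrement is a name made up for this example, and it assumes the root key adds 0x80000000 per tick.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper: phase increment for a key that lies
// semitonesBelowRoot semitones under the key the sample was recorded at.
uint32_t phaseIncrement(int semitonesBelowRoot) {
  const double fullTick = 2147483648.0;   // 0x80000000, one sample per tick
  double ratio = std::pow(2.0, semitonesBelowRoot / 12.0);
  return static_cast<uint32_t>(std::llround(fullTick / ratio));
}
```

phaseIncrement(0) gives 0x80000000 and phaseIncrement(12) gives 0x40000000, one octave down.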

The ticker for the C3 octave looks like this:

uint16_t samplerTick() { //Calculate total sample value for each playing note
  int32_t total=0;

  if ((VCA[48+0])&&(SPNT[48+0]<SLEN[48+0])) { //If VCA is active and the sample has not reached end
    FREQ[48+0]+=1073741824; //Add frequency to the phase accumulator for C3 key
    if (FREQ[48+0]&0x80000000) { //If phase accumulator overflows
      FREQ[48+0]&=0x7FFFFFFF; //Trim off MSB
      if ((SPNT[48+0]>LOOP2[48+0])&&(MIDItable[48+0])) SPNT[48+0]=LOOP1[48+0]; //Check if we're in a loop
      total+=(((pgm_read_word_near(SAMPLE + SPNT[48+0])^32768)-32768)*VCA[48+0])>>8; //Add the sample value to total with ENV scaling
      SPNT[48+0]++; //Increment sample pointer
    }
  }

  if ((VCA[49+0])&&(SPNT[49+0]<SLEN[49+0])) {
    FREQ[49+0]+=1137589835; //Add frequency to counter for C#3 key
    if (FREQ[49+0]&0x80000000) {
      FREQ[49+0]&=0x7FFFFFFF;
      if ((SPNT[49+0]>LOOP2[49+0])&&(MIDItable[49+0])) SPNT[49+0]=LOOP1[49+0];
      total+=(((pgm_read_word_near(SAMPLE + SPNT[49+0])^32768)-32768)*VCA[49+0])>>8;
      SPNT[49+0]++;
    }
  }

  if ((VCA[50+0])&&(SPNT[50+0]<SLEN[50+0])) {
    FREQ[50+0]+=1205234447; //Add frequency to counter for D3 key
    if (FREQ[50+0]&0x80000000) {
      FREQ[50+0]&=0x7FFFFFFF;
      if ((SPNT[50+0]>LOOP2[50+0])&&(MIDItable[50+0])) SPNT[50+0]=LOOP1[50+0];
      total+=(((pgm_read_word_near(SAMPLE + SPNT[50+0])^32768)-32768)*VCA[50+0])>>8;
      SPNT[50+0]++;
    }
  }

  if ((VCA[51+0])&&(SPNT[51+0]<SLEN[51+0])) {
    FREQ[51+0]+=1276901416; //Add frequency to counter for D#3 key
    if (FREQ[51+0]&0x80000000) {
      FREQ[51+0]&=0x7FFFFFFF;
      if ((SPNT[51+0]>LOOP2[51+0])&&(MIDItable[51+0])) SPNT[51+0]=LOOP1[51+0];
      total+=(((pgm_read_word_near(SAMPLE + SPNT[51+0])^32768)-32768)*VCA[51+0])>>8;
      SPNT[51+0]++;
    }
  }

  if ((VCA[52+0])&&(SPNT[52+0]<SLEN[52+0])) {
    FREQ[52+0]+=1352829926; //Add frequency to counter for E3 key
    if (FREQ[52+0]&0x80000000) {
      FREQ[52+0]&=0x7FFFFFFF;
      if ((SPNT[52+0]>LOOP2[52+0])&&(MIDItable[52+0])) SPNT[52+0]=LOOP1[52+0];
      total+=(((pgm_read_word_near(SAMPLE + SPNT[52+0])^32768)-32768)*VCA[52+0])>>8;
      SPNT[52+0]++;
    }
  }

  if ((VCA[53+0])&&(SPNT[53+0]<SLEN[53+0])) {
    FREQ[53+0]+=1433273379; //Add frequency to counter for F3 key
    if (FREQ[53+0]&0x80000000) {
      FREQ[53+0]&=0x7FFFFFFF;
      if ((SPNT[53+0]>LOOP2[53+0])&&(MIDItable[53+0])) SPNT[53+0]=LOOP1[53+0];
      total+=(((pgm_read_word_near(SAMPLE + SPNT[53+0])^32768)-32768)*VCA[53+0])>>8;
      SPNT[53+0]++;
    }
  }

  if ((VCA[54+0])&&(SPNT[54+0]<SLEN[54+0])) {
    FREQ[54+0]+=1518500249; //Add frequency to counter for F#3 key
    if (FREQ[54+0]&0x80000000) {
      FREQ[54+0]&=0x7FFFFFFF;
      if ((SPNT[54+0]>LOOP2[54+0])&&(MIDItable[54+0])) SPNT[54+0]=LOOP1[54+0];
      total+=(((pgm_read_word_near(SAMPLE + SPNT[54+0])^32768)-32768)*VCA[54+0])>>8;
      SPNT[54+0]++;
    }
  }

  if ((VCA[55+0])&&(SPNT[55+0]<SLEN[55+0])) {
    FREQ[55+0]+=1608794973; //Add frequency to counter for G3 key
    if (FREQ[55+0]&0x80000000) {
      FREQ[55+0]&=0x7FFFFFFF;
      if ((SPNT[55+0]>LOOP2[55+0])&&(MIDItable[55+0])) SPNT[55+0]=LOOP1[55+0];
      total+=(((pgm_read_word_near(SAMPLE + SPNT[55+0])^32768)-32768)*VCA[55+0])>>8;
      SPNT[55+0]++;
    }
  }

  if ((VCA[56+0])&&(SPNT[56+0]<SLEN[56+0])) {
    FREQ[56+0]+=1704458900; //Add frequency to counter for G#3 key
    if (FREQ[56+0]&0x80000000) {
      FREQ[56+0]&=0x7FFFFFFF;
      if ((SPNT[56+0]>LOOP2[56+0])&&(MIDItable[56+0])) SPNT[56+0]=LOOP1[56+0];
      total+=(((pgm_read_word_near(SAMPLE + SPNT[56+0])^32768)-32768)*VCA[56+0])>>8;
      SPNT[56+0]++;
    }
  }

  if ((VCA[57+0])&&(SPNT[57+0]<SLEN[57+0])) {
    FREQ[57+0]+=1805811301; //Add frequency to counter for A3 key
    if (FREQ[57+0]&0x80000000) {
      FREQ[57+0]&=0x7FFFFFFF;
      if ((SPNT[57+0]>LOOP2[57+0])&&(MIDItable[57+0])) SPNT[57+0]=LOOP1[57+0];
      total+=(((pgm_read_word_near(SAMPLE + SPNT[57+0])^32768)-32768)*VCA[57+0])>>8;
      SPNT[57+0]++;
    }
  }

  if ((VCA[58+0])&&(SPNT[58+0]<SLEN[58+0])) {
    FREQ[58+0]+=1913190429; //Add frequency to counter for A#3 key
    if (FREQ[58+0]&0x80000000) {
      FREQ[58+0]&=0x7FFFFFFF;
      if ((SPNT[58+0]>LOOP2[58+0])&&(MIDItable[58+0])) SPNT[58+0]=LOOP1[58+0];
      total+=(((pgm_read_word_near(SAMPLE + SPNT[58+0])^32768)-32768)*VCA[58+0])>>8;
      SPNT[58+0]++;
    }
  }

  if ((VCA[59+0])&&(SPNT[59+0]<SLEN[59+0])) {
    FREQ[59+0]+=2026954652; //Add frequency to counter for B3 key
    if (FREQ[59+0]&0x80000000) {
      FREQ[59+0]&=0x7FFFFFFF;
      if ((SPNT[59+0]>LOOP2[59+0])&&(MIDItable[59+0])) SPNT[59+0]=LOOP1[59+0];
      total+=(((pgm_read_word_near(SAMPLE + SPNT[59+0])^32768)-32768)*VCA[59+0])>>8;
      SPNT[59+0]++;
    }
  }

  if ((VCA[60+0])&&(SPNT[60+0]<SLEN[60+0])) {
    FREQ[60+0]+=2147483648; //Add frequency to counter for C4 key, this overflows every tick thus 32 kHz
    if (FREQ[60+0]&0x80000000) {
      FREQ[60+0]&=0x7FFFFFFF;
      if ((SPNT[60+0]>LOOP2[60+0])&&(MIDItable[60+0])) SPNT[60+0]=LOOP1[60+0];
      total+=(((pgm_read_word_near(SAMPLE + SPNT[60+0])^32768)-32768)*VCA[60+0])>>8;
      SPNT[60+0]++;
    }
  }

  if (total>32767) total=32767; //Clip to max
  if (total<-32767) total=-32767; //Clip to min
  total+=32768; //Center value
  return total;
}

For each key in the octave, if it’s still sounding (VCA>0) and the sample has not passed the end, we add the frequency to the phase accumulator for that key.

When the accumulator overflows, we check whether the key is held and the sample pointer has passed the loop end point, and in that case jump the sample pointer back to the beginning of the loop.

Then we fetch the sample value and scale its volume by the VCA value.

You have probably figured out by now that we can have multi-samples if we want, because each key has its own sample parameters.

Finally, since we are adding samples together, the sum could exceed the dynamic range of our signal, so we clip it to the limits.
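
The same per-key steps can be written once and tested on a desktop. This is a sketch only: the Voice struct, voiceTick and mixToDac are names invented here, and the sample sits in an ordinary RAM array instead of flash so that it compiles off-device.

```cpp
#include <cstdint>

// Hypothetical grouping of the per-key tables into one struct.
struct Voice {
  uint32_t phase = 0;       // FREQ[] in the article
  uint32_t increment = 0;   // per-key constant added on every tick
  uint32_t pos = 0;         // SPNT[]
  uint32_t loopStart = 0;   // LOOP1[]
  uint32_t loopEnd = 0;     // LOOP2[]
  uint32_t length = 0;      // SLEN[]
  int16_t level = 0;        // VCA[], 0..255
  bool keyDown = false;     // MIDItable[] != 0
};

// Advance one voice by one output tick and return its contribution to the mix.
// Returns 0 when the voice is silent or no new sample is due on this tick.
int32_t voiceTick(Voice &v, const int16_t *sample) {
  if (v.level == 0 || v.pos >= v.length) return 0;
  v.phase += v.increment;
  if (!(v.phase & 0x80000000u)) return 0;   // no overflow yet
  v.phase &= 0x7FFFFFFFu;                   // trim off the overflow bit
  if (v.pos > v.loopEnd && v.keyDown) v.pos = v.loopStart;
  int32_t out = (static_cast<int32_t>(sample[v.pos]) * v.level) >> 8;
  v.pos++;
  return out;
}

// Clip the summed voices and shift them into the unsigned DAC range.
uint16_t mixToDac(int32_t total) {
  if (total > 32767) total = 32767;
  if (total < -32767) total = -32767;
  return static_cast<uint16_t>(total + 32768);
}
```

A voice with an increment of 0x40000000 produces a new sample on every second call, which is the one-octave-down case.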

I showed the C3 octave as the example because C4, at one full tick per sample or 32 kHz, is the key we record our samples at.

But we are going to need 9 more octaves of this.

To be continued…