Upgrading Binary(Set)Attribute

Warning

The behavior of BinaryAttribute and BinarySetAttribute has changed in backwards-incompatible ways as of the 6.0 release of PynamoDB.

To prevent data corruption, use legacy_encoding=True for existing binary attributes.

Context

PynamoDB version 5 (and lower) had two bugs in the way they handled binary attributes, which were addressed in PynamoDB 6:

  • Top-level binary attributes (i.e. within a Model) were being Base64-encoded twice. For elements in BinarySetAttribute, each element was being encoded twice.

    This behavior was an oversight and resulted in larger item sizes and non-standard semantics.

    For example…

    The 4 bytes CA FE F0 0D should’ve been sent over the wire as yv7wDQ== (single round of Base64 encoding). Server-side, they would’ve been decoded and stored as 4 bytes. Instead, they were put through an extra round of Base64 encoding, thus sending eXY3d0RRPT0= over the wire. Server-side, they decoded into yv7wDQ== (in bytes, 79 76 37 77 44 51 3D 3D) and stored as 8 bytes.

  • Nested binary attributes (i.e. within a MapAttribute and ListAttribute) were being wrapped in an additional layer of Base64 encoding on every serialization roundtrip.

    Not only it prevented them from being deserialized correctly, but also the model would also grow in size exponentially until it hit the DynamoDB item limit of 400KB. For this reason we conclude that BinaryAttribute and BinarySetAttribute were not used in practice within maps and lists before PynamoDB 6.0 and thus there is no practical reason you would want legacy_encoding=True for them.

Guidance

Top-level binary attributes

  • In models existing at the time of an upgrade from PynamoDB 5 (or lower), use legacy_encoding=True.

    Note

    In PynamoDB 6 we require this new parameter to be explicitly set to prevent inadvertent data corruption during upgrades. By setting it to True during an upgrade, the developer marks the attribute as pre-existing and thus requiring legacy handling.

    For example:

     class SomeExistingModel(Model):
    -  my_binary = BinaryAttribute()
    +  my_binary = BinaryAttribute(legacy_encoding=True)
    
     class SomeExistingModel(Model):
    -  my_binary = BinarySetAttribute()
    +  my_binary = BinarySetAttribute(legacy_encoding=True)
    

    After the version upgrade is complete, you can consider adding a new binary attribute and migrating the data.

  • In new models, use legacy_encoding=False.

    class NewModel(Model):
      my_binary = BinaryAttribute(legacy_encoding=False)
      my_binary_set = BinarySetAttribute(legacy_encoding=False)
    

Nested binary attributes

  • In maps, use legacy_encoding=False.

    class MyMap(MapAttribute):
      binary = BinaryAttribute(legacy_encoding=False)
      binary_set = BinarySetAttribute(legacy_encoding=False)
    
  • In raw maps, normal (non-legacy) encoding will be used.

    class MyModel(Model):
      my_raw_map = MapAttribute()
    
    my_model = MyModel()
    my_model.my_raw_map = MapAttribute(binary=b'foo')
    
  • In lists, normal (non-legacy) encoding will be used.

    This applies to both ListAttribute(of=BinaryAttribute) and of=BinarySetAttribute as well as when of=... is not specified (for bytes and Set[bytes] elements).

    For example:

    class MyModel(Model):
      binary_list = ListAttribute(of=BinaryAttribute)
      binary_set_list = ListAttribute(of=BinarySetAttribute)
      mixed_list = ListAttribute()
    
    
    model = MyModel()
    model.binary_list = [b'\xCA', b'\xFE']
    model.binary_set_list = [{b'\xCA', b'\xFE'}, {b'\xF0', b'\x0D'}]
    model.mixed_list = [
       b'\xCA\xFE',
       {b'\xF0', b'\x0D'},
    ]
    

Migrating

Since PynamoDB 6 is compatible with existing data through legacy_encoding=True, you do not need to migrate data during an upgrade. Whether you want to migrate data depends on your use case. Advantages include smaller item sizes and more standardized serialization. However, for large tables, there might be significant cost and engineering complexity involved.

Warning

Be sure to have an up-to-date backup of your data.

These are the typical steps to migrate an attribute:

  1. Double-write to both the old and new attribute. Read from the new, falling back to the old.

class SomeExistingModel(Model):
   _my_binary_v1 = BinaryAttribute(legacy_encoding=True, attr_name='my_binary')
   _my_binary_v2 = BinaryAttribute(legacy_encoding=False, attr_name='my_binary_v2')

   @property
   def my_binary() -> bytes:
     return self._my_binary_v1 if self._my_binary_v2 is None else self._my_binary_v2

   @my_binary.setter
   def my_binary(value: bytes) -> None:
     self._my_binary_v1 = value
     self._my_binary_v2 = value

   def save(self, *args, **kwargs):
     self.my_binary_v2 = self._my_binary_v1
     return super().save(*args, **kwargs)
  1. Change the old attribute to be optional:

     class SomeExistingModel(Model):
    -   _my_binary_v1 = BinaryAttribute(legacy_encoding=True, attr_name='my_binary')
    +   _my_binary_v1 = BinaryAttribute(legacy_encoding=True, attr_name='my_binary', null=True)
    

    and rather than double-write to it, unset it by assigning None:

     @my_binary.setter
     def my_binary(value: bytes) -> None:
    -  self._my_binary_v1 = value
    +  self._my_binary_v1 = None
       self._my_binary_v2 = value
    
     def save(self, *args, **kwargs):
    -  self.my_binary_v2 = self._my_binary_v1
    +  if self._my_binary_v1 is not None:
    +    self.my_binary_v2 = self._my_binary_v1
    +    self._my_binary_v1 = None
       return super().save(*args, **kwargs)
    

    At this point, you can either let natural migration run its course (as your online system re-saves models), or you can perform a one-time migration by scanning the table and re-saving each item.

  2. Once migration is done, remove the old attribute and all migration logic.

class SomeExistingModel(Model):
   my_binary = BinaryAttribute(legacy_encoding=False, attr_name='my_binary_v2')