dynamodb, sets and boto3

Still trying to wrap my head around the right way to structure data in Dynamodb. There are scalars, documents, and sets. I’m not clear on why documents and sets are distinct types. Also, I find the Query and Scan documentation impenetrable.
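For what it’s worth, here’s how the distinction shakes out in boto3’s resource API (my own toy example, not from the docs): Python lists and dicts are serialized as document types, while Python sets become set types.

# document types: ordered, nestable, mixed types and duplicates allowed
{'tags_list': ['bali', 'bali', 'beach'],       # List (L)
 'meta': {'camera': 'X100T', 'iso': 200}}      # Map (M)

# set types: flat, unordered, unique values of a single scalar type
{'tags': {'bali', 'beach'}}                    # StringSet (SS)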

Anyway, there are several fields associated with a media object that I’d like to be multi-valued, namely tags and location hierarchy. (Tags are obvious, but location hierarchy might require a bit of unpacking: it just means I want to store the fact that a photo was taken in Bali, and also in Indonesia, and also in Gianyar, such that I can ask for photos taken in Bali.)

It seems like the right way to do this in Dynamodb is to use Scan. It is so weird that the correct pattern is a full table Scan; every aspect of my MySQL scaling experience cries out in horror. I could build a secondary index around tags and another around geo (though index keys have to be scalars, so I couldn’t index a StringSet directly), but then I have to pay for the throughput for those indexes, and these aren’t fields I’m going to be querying often.

To store a set of strings (a StringSet) in Dynamodb using Boto3:

table.update_item(
    Key={'path_md5': path_md5},
    UpdateExpression='SET Colors = :c',
    ExpressionAttributeValues={':c': {'Black', 'Green'}})

And then to pull it back out again

from boto3.dynamodb.conditions import Attr

resp = table.scan(
    FilterExpression=Attr('Colors').contains('Green')
)
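The same pattern covers the location hierarchy from above. A sketch, assuming a hypothetical locations attribute on each media item:

# store every level of the hierarchy as a single StringSet
table.update_item(
    Key={'path_md5': path_md5},
    UpdateExpression='SET locations = :l',
    ExpressionAttributeValues={':l': {'Indonesia', 'Bali', 'Gianyar'}})

# "photos taken in Bali" is then a contains() filter on a Scan
resp = table.scan(FilterExpression=Attr('locations').contains('Bali'))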

Lambda from SNS or CLI

One of the really exciting promises of Lambda and the modern AWS services is the idea of an event bus deeply built into the infrastructure. The reality is definitely clunkier.

That said, it still feels like TheRightWay(tm) to deploy lambdas is to tie them together using SNS.

However, when you’re developing lambdas it’s often much simpler to invoke them directly from the CLI.

Which means your lambda needs to deal with polymorphic events.

Here is the pattern I’ve developed.

import json
import logging

def handler(event, context):
    # ... logic ...

    if 'Records' in event:
        message = extract_sns_message(event, context)
    else:
        # else called directly
        message = event

    # ... logic ...

def extract_sns_message(event, context):
    for record in event['Records']:
        logging.info('looking at {}'.format(record))
        if 'aws:sns' == record['EventSource']:
            message = json.loads(record['Sns']['Message'])
            return message
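For reference, the SNS-delivered event that extract_sns_message unwraps looks roughly like this (abbreviated, with hypothetical bucket/key values; Message arrives as a JSON string, hence the json.loads):

{
    'Records': [{
        'EventSource': 'aws:sns',
        'Sns': {'Message': '{"bucket": "my-bucket", "key": "photo.jpg"}'}
    }]
}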

Then I can either invoke the lambda via SNS

    message = {'bucket': key.bucket_name, 'key': key.key}

    # MessageStructure='json' expects a JSON object keyed by protocol;
    # the 'default' entry covers any subscriber without a protocol-specific message
    sns.publish(
        TargetArn=config.input_sns_topic,
        Message=json.dumps({'default': json.dumps(message)}),
        MessageStructure='json'
    )

Or directly

aws lambda invoke --function-name Dynamo_PutPy --region us-west-2 outputfile.txt
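Since a directly-invoked event is just the message itself, the same shape can go straight on the command line (hypothetical bucket/key values):

aws lambda invoke --function-name Dynamo_PutPy --region us-west-2 \
    --payload '{"bucket": "my-bucket", "key": "photo.jpg"}' outputfile.txt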

a dynamodb μ-service

Dynamodb is kind of a head trip. A friend recently described dealing with the managed AWS services as requiring a “kremlinology of Amazon”. That resonates. I feel like a blind man reading the docs, trying to intuit the underlying architecture and the operationally appropriate ways to use it. This is uncomfortable for me; I tend to be a deep-in-the-ops style engineer. Growth requires discomfort. So I’m sticking with it.

I spent a lot of time trying to figure out how to do the equivalent of a MySQL-style INSERT-or-UPDATE query. I was never able to get as much expressiveness as I can out of SQL, but I did discover that the PutItem API is kind of a toy and that you should probably be using the UpdateItem API, which is much more flexible and gives you at least some INSERT-or-UPDATE-ness.
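The difference, sketched with a hypothetical width attribute: PutItem replaces the entire item, while UpdateItem creates the item if it’s missing and otherwise merges new attributes into what’s already there.

# put_item replaces the whole item; any attributes not listed here are lost
table.put_item(Item={'path_md5': path_md5, 'width': 1024})

# update_item creates-or-merges, which is the INSERT-or-UPDATE-ish part
table.update_item(
    Key={'path_md5': path_md5},
    UpdateExpression='SET width = :w',
    ExpressionAttributeValues={':w': 1024})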

However, unlike PutItem, it has an awful syntax. So I wrote a helper to turn a Python dict into a DynamoDb-style UpdateItem block.

Code here

update_expression, update_expression_names, update_values = generate_update(item)

resp = table.update_item(
    Key=key,
    UpdateExpression=update_expression,
    ExpressionAttributeNames=update_expression_names,
    ExpressionAttributeValues=update_values)

print(resp)

def generate_update(item):
    # build "set #k1=:k1, #k2=:k2, ..." from the dict's keys
    update_expression = ["#{}=:{}".format(k, k) for k in item]
    update_expression = "set " + ', '.join(update_expression)
    # #name placeholders sidestep DynamoDb's long list of reserved words
    update_expression_names = {"#{}".format(k): k for k in item}
    # :value placeholders carry the actual values
    update_values = {":{}".format(k): item[k] for k in item}
    return update_expression, update_expression_names, update_values
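To make the shapes concrete, here’s what the helper produces for a small sample dict (my own example values; attribute order may vary since dicts are unordered):

item = {'width': 1024, 'format': 'jpeg'}
generate_update(item)
# -> ('set #width=:width, #format=:format',
#     {'#width': 'width', '#format': 'format'},
#     {':width': 1024, ':format': 'jpeg'})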

With that capability in place I can now fan out different types of computation to different Lambdas and have them each call my Dynamo_Put Lambda, which will update the record with new info.

Lambdas v0.1

I can’t even node. I want to. I’m intrigued by the community, but ugh it makes me feel dumb, and so my options for native Lambda are either Java or Python (2.7) at the moment. So Python!

Getting Started

I tried to walk through the tutorials that use the console. I was left feeling utterly baffled regarding cause and effect and what the fuck is going on.

Using the CLI worked better. This post finally got me started on the right path.

The one bit that’s easier to do through the console is making sure you’ve got an IAM role for Lambda execution that has, at a minimum:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*"
    }
  ]
}

Otherwise you’re just running blind. Beyond that, attach Amazon managed policies as needed per service.
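For the CLI path, a sketch of the sequence (role ARN and file names are my own placeholders; Dynamo_PutPy is the function from above):

# zip up the handler
zip function.zip lambda_function.py

# create the function against an existing execution role
aws lambda create-function \
    --function-name Dynamo_PutPy \
    --runtime python2.7 \
    --role arn:aws:iam::ACCOUNT_ID:role/lambda-execution-role \
    --handler lambda_function.handler \
    --zip-file fileb://function.zip

# push updated code on later iterations
aws lambda update-function-code \
    --function-name Dynamo_PutPy \
    --zip-file fileb://function.zip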

Success to date

I’ve open sourced what I have to date, along with some sketchy notes. It reads photos from a bucket, calculates a bunch of hashes, extracts EXIF, etc, and then sends a JSON blob to another Lambda that writes to DynamoDb.

Code is here

Photos project

The photos project is an attempt to cobble together tools to make the photos I’ve stored in S3 buckets useful, most recently motivated by the death of Picturelife.

The constraints for the project are:

  • I’ve already got a bunch of photos uploaded to S3. I could upload them to some other cheap cloud storage (aka Google), but I trust AWS’s long-term support more than I trust Google’s. (Yes, I’m still bitter about Reader.)

  • There are a ton of free options for hosting one’s photos these days that are pretty damn good, e.g. Google Photos, so it isn’t reasonable to spend a lot of money on this.

  • Especially given that most photos are never looked at.

  • Besides, being cheap is an interesting constraint that unlocks other possibilities.

  • That leads me to want to use Lambda and Dynamodb, as I don’t want to be paying for always-on compute. Lambda is free at rest, and Dynamodb throughput can be cranked down to less than a dollar a month (per table, per index).

And boy are they a pain in the ass.