Amazon DynamoDB: A Crash Course


Amazon DynamoDB is a key-value and document database. It is fully managed in the cloud, although you can run a limited local edition on premises for testing purposes.

It differs quite a bit from MongoDB and most other NoSQL databases. This article walks through the main parts of the API it provides. Note that most things done through the API can also be done through the AWS Management Console (the web interface of AWS). Treat this article as a crash course in using some of the features.

I will start with the basics and the CRUD operations, then move on to TTL and some other concepts of the DynamoDB ecosystem. References will be available at the end. The .NET client is used throughout; however, AWS provides clients for other programming languages, and everything done here can be replicated in those as well as in the AWS CLI.

Before starting: AWS recently announced SQL-like query support for DynamoDB, based on PartiQL.
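As a rough sketch of what that SQL-like (PartiQL) support can look like from the .NET SDK, the snippet below runs a SELECT through ExecuteStatementAsync. The table name Movies and its Year attribute are hypothetical, and the client is assumed to pick up credentials and a region from the environment (the client itself is introduced later in this article).

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;

public static class PartiQlSketch
{
    public static async Task Main()
    {
        // Assumes credentials and region are available in the environment.
        using var client = new AmazonDynamoDBClient();

        var request = new ExecuteStatementRequest
        {
            // "Movies" and "Year" are hypothetical names used for illustration.
            // Identifiers are double-quoted; ? marks a positional parameter.
            Statement = "SELECT * FROM \"Movies\" WHERE \"Year\" = ?",
            Parameters = new List<AttributeValue>
            {
                new AttributeValue { N = "2021" }
            }
        };

        var response = await client.ExecuteStatementAsync(request);

        // Each item is a map of attribute names to attribute values.
        foreach (var item in response.Items)
        {
            Console.WriteLine(string.Join(", ", item.Keys));
        }
    }
}
```

The same key rules described in the rest of this article still apply: a statement that does not filter on the partition key ends up scanning the table.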

This article has a lot of theory to cover, so it will be reading heavy.

Basic table theory:

I will assume you have a .NET application set up.

Let’s install the .NET client for DynamoDB. Note that it is part of the AWS SDK but is installed as a separate package:

dotnet add package AWSSDK.DynamoDBv2

There is no concept of a database instance here, since everything lives in the cloud. The data is stored redundantly and encrypted by default. The unit you work with is therefore the table. A table is basically the DynamoDB equivalent of an RDBMS table.

A table can theoretically hold an unlimited number of records (items). What defines a table is its primary key, and what makes a record unique is the value or values of its primary key. Note that a primary key consists of one or two attributes. An attribute is roughly a column from the RDBMS world. A primary key can be one of the following combinations:

  • Partition (Hash) Key
  • Partition (Hash) and Sort (Range) Keys

You can’t change the primary key once the table has been created. The only mandatory attributes in a record are the primary key attributes, and they must be of the type defined when creating the table. Other attributes may or may not exist. Unlike an RDBMS, there is no fixed schema: each record can have different attributes with different types.

This is important because if you don’t know at least the partition key, you can’t query the data. You can scan the table without any key value, but a scan is an O(n) operation that runs through all the records, in other words a full table scan. As a result, patterns such as pagination are done differently, and search is limited if you rely solely on the database.

That is not all: a table can also expose alternative keys through secondary indexes, which are discussed later.

Capacity units:

A table is treated somewhat like a database: each table has its own ecosystem and is not part of a database. It therefore makes sense to think about how throughput is billed and carried out per table. A table has Read Capacity Units (RCUs) and Write Capacity Units (WCUs). These are the units of work done on a table. One RCU covers a read of up to 4 KB, and one WCU covers a write of up to 1 KB.

So how do these work? Imagine I am writing a 500-byte item to the table. This consumes 1 WCU even though it is not a full 1 KB; even 1 byte counts as one WCU. Writing a 5 KB item consumes 5 WCUs. If you use DynamoDB’s transactions, writes consume more WCUs (transactions are discussed later).

Reads are more forgiving. They work the same way as WCUs, but for RCUs you need to consider whether you are reading with eventual or strong consistency. Which one you need is a design choice determined by your project, and you can ask for either one when reading.

A single RCU covers one strongly consistent read of up to 4 KB per second, or two eventually consistent reads of up to 4 KB each (8 KB in total). What if both kinds of reads are going on at the same time? Each consistency type is calculated on its own, i.e. strongly consistent reads are counted separately from eventually consistent reads.

An example: suppose I am reading 101 KB of data in an hourly batch job and the reads must be strongly consistent. Hence:

101 / 4 = 25.25

Rounding 25.25 up gives 26, which is the RCU cost. An eventually consistent read costs fewer RCUs at the expense of consistency. Since this is a batch job that will run again next hour, nothing is lost by reading eventually consistent data, so divide 26 by 2, which yields a cost of 13 RCUs. Note that capacity units are measured per second.
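The arithmetic above can be sketched as a few small helpers. This is only an illustration of the rounding rules described in this section, not an official calculator; the class and method names are made up for this example, and the payload is treated as a single read or write for simplicity.

```csharp
using System;

public static class CapacityUnits
{
    // One WCU covers a write of up to 1 KB; anything smaller still costs 1 WCU.
    public static long WriteUnits(double sizeKb) =>
        Math.Max(1, (long)Math.Ceiling(sizeKb));

    // One RCU covers a strongly consistent read of up to 4 KB.
    public static long StronglyConsistentReadUnits(double sizeKb) =>
        Math.Max(1, (long)Math.Ceiling(sizeKb / 4.0));

    // An eventually consistent read costs half of the strongly consistent cost.
    public static double EventuallyConsistentReadUnits(double sizeKb) =>
        StronglyConsistentReadUnits(sizeKb) / 2.0;

    public static void Main()
    {
        Console.WriteLine(WriteUnits(0.5));                     // roughly a 500-byte item: prints 1
        Console.WriteLine(WriteUnits(5));                       // prints 5
        Console.WriteLine(StronglyConsistentReadUnits(101));    // prints 26
        Console.WriteLine(EventuallyConsistentReadUnits(101));  // prints 13
    }
}
```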

So now we know what the DynamoDB ecosystem means by capacity units. What use are they? A lot. When you provision a new table, you either choose how many capacity units are available per second or use on-demand mode, which scales with the traffic. This is important. When you set the number of capacity units per second, you pay for them whether they are used or not, and requests are throttled when you exceed them. On the other hand, on-demand mode is the safer choice when you don’t know the traffic, but you need to know how DynamoDB does the scaling.

Pricing wise, on-demand costs more than provisioning the capacity you will actually use. Writes also cost more than reads, and prices differ between regions.

It does not end there: provisioned capacity can also use auto scaling, which aims to be the middle ground between on-demand and fixed provisioned capacity.

The following article goes into depth on what you should know: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.ReadWriteCapacityMode.html

Creating a table:

Every API call has its companion documentation. Since this is a crash course, I will not cover every option. First, create a client:

var dynamodbClient = new AmazonDynamoDBClient();

All clients must be disposed of at the end to release resources. This client will try to use the IAM credentials and region available in your environment. On EC2 instances this is enough, provided the instance profile has the IAM policy required to access DynamoDB. You can also pass specific credentials like so:

var dynamodbClient = new AmazonDynamoDBClient(
    new BasicAWSCredentials(
        // Access key
        "hghNJbrgjbrjf",
        // Secret key
        "Kurh5nXHnbndaL"),
    RegionEndpoint.APEast1);

You can pass more advanced credentials if you want; refer to the documentation for your needs.

Now that we have a client, let’s make the first API call to create a table:

string tableName = "table";
string partitionKey = "Year";
string sortKey = "Name";
var createTableRequest = new CreateTableRequest
{
    TableName = tableName,
    BillingMode = BillingMode.PAY_PER_REQUEST,
    KeySchema = new List<KeySchemaElement>
    {
        new KeySchemaElement
        {
            AttributeName = partitionKey,
            KeyType = KeyType.HASH
        },
        new KeySchemaElement
        {
            AttributeName = sortKey,
            KeyType = KeyType.RANGE
        }
    },
    AttributeDefinitions = new List<AttributeDefinition>
    {
        new AttributeDefinition
        {
            AttributeName = partitionKey,
            AttributeType = ScalarAttributeType.N
        },
        new AttributeDefinition
        {
            AttributeName = sortKey,
            AttributeType = ScalarAttributeType.S
        }
    }
};

var createTableResponse = await dynamodbClient.CreateTableAsync(createTableRequest);

Note that the SDK models everything in classes. Here I am creating a table called table with a primary key consisting of a hash key Year and a range key Name. The hash key has a number type (N) and the range key has a string type (S). I set the table to be billed on demand. The create table call returns a successful HTTP 200 status immediately, but the operation is asynchronous on the service side, which means the table is not yet ready at that point. I will wait for 10 seconds, which is usually enough time for a table to be created (it may take longer), and then check:

await Task.Delay(10000);

var describeTableRequest = new DescribeTableRequest
{
    TableName = tableName
};

var describeTableResponse = await dynamodbClient.DescribeTableAsync(describeTableRequest);

Now let’s assume the table is created. The return type is DescribeTableResponse, whose Table property is a TableDescription holding all the information and metadata about the table, including its status.
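Rather than assuming a fixed delay is enough, one option is to poll DescribeTable until the table reports an ACTIVE status. The following is a minimal sketch of that idea; the TableWaiter class, its method name and the default attempt and delay values are assumptions made for this example.

```csharp
using System.Threading.Tasks;
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;

public static class TableWaiter
{
    // Polls DescribeTable until the table is ACTIVE or the attempts run out.
    // Returns true when the table became ACTIVE, false otherwise.
    public static async Task<bool> WaitUntilActiveAsync(
        IAmazonDynamoDB client,
        string tableName,
        int maxAttempts = 30,
        int delayMilliseconds = 2000)
    {
        for (var attempt = 0; attempt < maxAttempts; attempt++)
        {
            try
            {
                var response = await client.DescribeTableAsync(
                    new DescribeTableRequest { TableName = tableName });

                if (response.Table.TableStatus == TableStatus.ACTIVE)
                {
                    return true;
                }
            }
            catch (ResourceNotFoundException)
            {
                // The table may not be visible yet right after CreateTable.
            }

            await Task.Delay(delayMilliseconds);
        }

        return false;
    }
}
```

With the client and table name from above, it could be called as `bool ready = await TableWaiter.WaitUntilActiveAsync(dynamodbClient, tableName);`.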

Basic CRUD operations:

CRUD = Create Read Update Delete

To add/write/put an item we issue a PutItem API call:

var putItemRequest = new PutItemRequest
{
    TableName = tableName,
    Item = new Dictionary<string, AttributeValue>
    {
        {partitionKey, new AttributeValue
        {
            N = 2021.ToString()
        }},
        {sortKey, new AttributeValue
        {
            S = "John Doe"
        }},
        {"Description", new AttributeValue
        {
            S = "This is a string stored in DynamoDb"
        }}
    }
};

var putItemResponse = await dynamodbClient.PutItemAsync(putItemRequest);

Here I am adding an item with a Description attribute. This attribute does not need to be present in items inserted later, and it can even have a different data type in those items. It does not matter: this is NoSQL, the land of flexibility.

Reading the item we just wrote returns a map of attribute name and attribute value pairs as a Dictionary<string, AttributeValue>, the same data type we used to write the values above.

var getItemRequest = new GetItemRequest
{
    TableName = tableName,
    Key = new Dictionary<string, AttributeValue>
    {
        {partitionKey, new AttributeValue
        {
            N = 2021.ToString()
        }},
        {sortKey, new AttributeValue
        {
            S = "John Doe"
        }}
    }
};

var getItemResponse = await dynamodbClient.GetItemAsync(getItemRequest);
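To get plain .NET values out of the returned map, you read the typed property of each AttributeValue (S for strings, N for numbers, which arrive as strings). The helper below is a small sketch of that; the ItemReader class and its method names are hypothetical, and it treats a missing item or attribute as null.

```csharp
using System.Collections.Generic;
using System.Globalization;
using Amazon.DynamoDBv2.Model;

public static class ItemReader
{
    // Returns the string (S) value of an attribute, or null when it is absent.
    public static string GetString(Dictionary<string, AttributeValue> item, string name) =>
        item != null && item.TryGetValue(name, out var value) ? value.S : null;

    // Returns the number (N) value parsed as an int, or null when it is absent.
    // DynamoDB transmits numbers as strings, so the value is parsed here.
    public static int? GetInt(Dictionary<string, AttributeValue> item, string name)
    {
        if (item == null || !item.TryGetValue(name, out var value) || value.N == null)
        {
            return null;
        }

        return int.Parse(value.N, CultureInfo.InvariantCulture);
    }
}
```

For the item written earlier, `ItemReader.GetString(getItemResponse.Item, "Description")` would return the stored description and `ItemReader.GetInt(getItemResponse.Item, "Year")` the year.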

Updating this item requires some work, because we need to explore update expressions. Since you can update many different things, there are many operations and built-in functions that are DynamoDB specific. The following link lists all the available update operations to get you comfortable: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Expressions.UpdateExpressions.html

I will update the item by adding a new attribute called Gender as a string.

string genderAttributeName = "Gender";
string genderAttributeValueName = ":GenderVal";
var updateItemRequest = new UpdateItemRequest
{
    TableName = tableName,
    Key = new Dictionary<string, AttributeValue>
    {
        {
            partitionKey, new AttributeValue
            {
                N = 2021.ToString()
            }
        },
        {
            sortKey, new AttributeValue
            {
                S = "John Doe"
            }
        }
    },
    UpdateExpression = $"SET {genderAttributeName} = {genderAttributeValueName}",
    ExpressionAttributeValues = new Dictionary<string, AttributeValue>
    {
        {
            genderAttributeValueName,
            new AttributeValue("Male")
        }
    }
};

var updateItemResponse = await dynamodbClient.UpdateItemAsync(updateItemRequest);

Look at the way the SET operation is used: SET overwrites an attribute that exists with a new value, or adds the attribute if it does not exist. The new values live in a Dictionary<string, AttributeValue> assigned to the ExpressionAttributeValues property. What is going on here? The name of an attribute is separate from the placeholder for its new value (the placeholder starts with a colon). This gives some flexibility, and you can also use it with condition expressions, which are very useful. To go overboard, you can also substitute the attribute name in the update expression with a placeholder of its own for full flexibility in naming things, but that is confusing until you understand how it works.

Note that you cannot update the primary key. Instead, you need to write a new item with the different primary key and delete the old one.
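As a sketch of the name substitution and conditional behaviour mentioned above, the request below uses a #G placeholder for the attribute name, a :g placeholder for the value, and a condition so the update only applies when Gender is not set yet. The class name, method name and placeholder names are made up for this example; the key attribute names match the table created earlier.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;

public static class ConditionalUpdateSketch
{
    // Sets Gender only if the attribute is not present yet.
    // Returns false when the condition was not met.
    public static async Task<bool> SetGenderIfMissingAsync(
        IAmazonDynamoDB client, string tableName, int year, string name, string gender)
    {
        var request = new UpdateItemRequest
        {
            TableName = tableName,
            Key = new Dictionary<string, AttributeValue>
            {
                { "Year", new AttributeValue { N = year.ToString() } },
                { "Name", new AttributeValue { S = name } }
            },
            // #G stands in for the attribute name, :g for the new value.
            UpdateExpression = "SET #G = :g",
            // Only apply the update when the attribute does not exist yet.
            ConditionExpression = "attribute_not_exists(#G)",
            ExpressionAttributeNames = new Dictionary<string, string>
            {
                { "#G", "Gender" }
            },
            ExpressionAttributeValues = new Dictionary<string, AttributeValue>
            {
                { ":g", new AttributeValue { S = gender } }
            }
        };

        try
        {
            await client.UpdateItemAsync(request);
            return true;
        }
        catch (ConditionalCheckFailedException)
        {
            // The item already has a Gender attribute, so nothing was changed.
            return false;
        }
    }
}
```

Name placeholders are also the way around attribute names that clash with DynamoDB reserved words, which is one practical reason to use them.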

Now let’s delete the entire item:

var deleteItemRequest = new DeleteItemRequest
{
    TableName = tableName,
    Key = new Dictionary<string, AttributeValue>
    {
        {
            partitionKey, new AttributeValue
            {
                N = 2021.ToString()
            }
        },
        {
            sortKey, new AttributeValue
            {
                S = "John Doe"
            }
        }
    },
};

var deleteItemResponse = await dynamodbClient.DeleteItemAsync(deleteItemRequest);

That’s it. The item is gone.

All of the above operations can take part in a transaction. A transaction is an all-or-nothing operation, and that is guaranteed. It costs more capacity units. Their use is not covered in this article; however, they are really important where a unit of work involves many operations in your design. They have some nuances and are not straightforward, so you must know how they work before using them. The link to how they work is below: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/transaction-apis.html

Batch read, write and delete operations also exist. They are the same as the CRUD operations above, but done in quantity.

A PutItem operation also overwrites an item if one with the same primary key already exists. Unlike UpdateItem, it replaces the entire item rather than changing individual attributes.

Scanning:

Enough with CRUD, let’s list everything available in the table.

var scanTableRequest = new ScanRequest
{
    TableName = tableName
};

var scanTableResponse = await dynamodbClient.ScanAsync(scanTableRequest);

This brings back everything available; however, a single response is limited to 1 MB of data. If anything remains unscanned, the response contains a pagination key (LastEvaluatedKey) that you pass to the next scan operation (as ExclusiveStartKey) to get the rest. The pagination key is basically the primary key of the last item scanned.

You can limit the number of items, and you can do server-side filtering. What is also nice is that you can project the attributes returned: when doing an aggregation, you can return just the single attribute you need, which keeps the response small. Note that attribute projection is also available on the other read operations.

A scan does not guarantee any order. In other words, you must sort the data yourself, for example by date, if you are looking for time-based ordering.
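Pagination and projection can be combined in a loop like the one sketched below. It keeps scanning until no pagination key is returned and only asks for the two key attributes. The ScanSketch class and method names are made up for this example, and the #Y and #N placeholders are used because Year and Name clash with DynamoDB reserved words.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;

public static class ScanSketch
{
    // Scans the whole table page by page, returning only the key attributes.
    public static async Task<List<Dictionary<string, AttributeValue>>> ScanAllKeysAsync(
        IAmazonDynamoDB client, string tableName)
    {
        var items = new List<Dictionary<string, AttributeValue>>();
        Dictionary<string, AttributeValue> startKey = null;

        do
        {
            var request = new ScanRequest
            {
                TableName = tableName,
                // Return only the two key attributes of each item.
                ProjectionExpression = "#Y, #N",
                ExpressionAttributeNames = new Dictionary<string, string>
                {
                    { "#Y", "Year" },
                    { "#N", "Name" }
                },
                // Null on the first page; the previous page's key afterwards.
                ExclusiveStartKey = startKey
            };

            var response = await client.ScanAsync(request);

            if (response.Items != null)
            {
                items.AddRange(response.Items);
            }

            startKey = response.LastEvaluatedKey;
        }
        while (startKey != null && startKey.Count > 0);

        return items;
    }
}
```

Keep in mind that this still reads every item in the table, so it is only reasonable for small tables or occasional jobs.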

Querying:

To really get what you want, you can query a table much like you would with a SQL statement, but in a more limited way and over HTTP requests. Querying only works on keys: you must supply the partition key and can optionally add a condition on the sort key. A local secondary index can stand in for the sort key. I will not cover querying.

TTL and timed items:

TTL = Time To Live

An item with a TTL has a lifespan, and after that lifespan it is deleted by DynamoDB. The deletion does not consume WCUs, but it does not happen immediately. In my experience it takes 10-15 minutes after expiry for an item to be deleted in my region. It may take more or less time.

This is really easy to do in DynamoDB, and it is a common case for NoSQL datastores, especially when working with user sessions.

To make this possible, TTL needs to be enabled on the table. You can’t do this when creating the table; it must be done in a separate API call after the table is created.

var ttlRequest = new UpdateTimeToLiveRequest
{
    TableName = tableName,
    TimeToLiveSpecification = new TimeToLiveSpecification
    {
        Enabled = true,
        AttributeName = "TTL"
    }
};

var ttlResponse = await dynamodbClient.UpdateTimeToLiveAsync(ttlRequest);

Quite a simple request. Note that the attribute does not have to be named TTL.

Items which have no TTL attribute are never queued for deletion. Also, a TTL value that is not a DynamoDB Number is ignored. The Number must be a Unix epoch timestamp in seconds, and values more than 5 years in the past are ignored.

long timeToLive = 1613439761;

You add the attribute to an item like you would any other attribute.

For more on TTL in DynamoDB, check this link: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/howitworks-ttl.html
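Instead of hard-coding the number, the expiry can be computed from the current time. The sketch below shows one way to get a Unix epoch value in seconds; the TtlSketch class and method names are made up for this example.

```csharp
using System;

public static class TtlSketch
{
    // Returns the Unix epoch time in seconds at which an item should expire.
    public static long ExpiryEpochSeconds(DateTimeOffset now, TimeSpan lifetime) =>
        now.Add(lifetime).ToUnixTimeSeconds();

    public static void Main()
    {
        // One hour after the example timestamp used above.
        var start = DateTimeOffset.FromUnixTimeSeconds(1613439761);
        Console.WriteLine(ExpiryEpochSeconds(start, TimeSpan.FromHours(1))); // prints 1613443361

        // In practice the starting point would be the current time.
        Console.WriteLine(ExpiryEpochSeconds(DateTimeOffset.UtcNow, TimeSpan.FromHours(1)));
    }
}
```

The resulting value is then stored as a Number, for example `new AttributeValue { N = expiry.ToString() }`, under the attribute name given in the TTL specification.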

Indexing:

More often than not you wish to fetch data by date, count, etc. but cannot do so efficiently. In DynamoDB you could do this with a scan for small datasets. However, you should not keep doing that for bigger datasets. We need indexes for efficiency as well as for new access patterns: your application evolves over time and you want to get more out of your data. Two types of indexes exist.

A local secondary index (LSI) replaces the range/sort key of the primary key for query purposes. Put another way, you get multiple primary keys that share the same partition key. An LSI can also be sparse: only items that actually carry the index's sort key attribute appear in it, so a table with several such keys may contain items that show up under only one of them. On top of that, an LSI can project one of the following:

  • Keys only
  • All
  • Include

A keys-only index stores only the keys in the index, so reading the entire item causes an extra lookup in the table. An All index stores a duplicate of the item in the index, which may be great depending on your use case but consumes roughly double the WCUs. An Include index is the happy medium where you include only the attributes you need.

A global secondary index (GSI) is a different beast. It is basically the same data but with a different primary key (including a different partition key), which suits different applications reading from the same source, such as data warehousing. Internally, a GSI is stored separately from the base table.

Covering how these two types are created and used would result in a huge article that may confuse you, the reader. The DynamoDB docs and other resources cover it better.

Indexing introduces new complexity, more cost and potentially slower reads/writes. I suggest you design the primary key around the way you access the data, but that is not always possible.

Other features:

Backups of DynamoDB tables can be taken on demand or scheduled with retention periods. A backup is done asynchronously, and the time it takes varies and is not known in advance. The same goes for restoring from a backup. While a backup is in progress, the table cannot be deleted. More on backups here: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/backuprestore_HowItWorks.html

When an event occurs in a table, it is written to a DynamoDB Stream. An event may be, for example, an insertion into the table. It is held in the stream for 24 hours. It can be handled by AWS Lambda, which is powerful. Note that the stream must be enabled on the table for this to happen. More below: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html

Things to mention:

Another thing I like is testing against DynamoDB locally. There is an official build from AWS that you can install and run, or spin up as a Docker container, from: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBLocal.DownloadingAndRunning.html Obviously not all features are supported in the local build.

Below is a snippet of me connecting to the running local build I have installed:

var dynamodbClient = new AmazonDynamoDBClient(new AmazonDynamoDBConfig
{
    ServiceURL = "http://localhost:8000"
});

Note that the DynamoDB API has no concept of a connection; you do everything with HTTP requests. This is due to the distributed nature of the system.

Disposing the client:

dynamodbClient.Dispose();

Conclusion:

A lot of ground has been covered here. It is quite an interesting system. I wanted to cover more things, but I cannot write a huge article and also cover things I have limited experience with. The other great resources linked below help a lot.

Thanks for reading this.

References: