Integration Tests using a Database and Spring Boot

Author unknown

Integration tests are a crucial part of testing most applications. In my experience, some projects even go against the famous pyramid diagram and have more of these than unit tests. Take web systems, where a framework like Spring Boot frees us from the burden of implementing input validation, role-based access control, etc.; consequently, a good amount of unit testing is already done inside the library itself.

When I work on a REST API, I often find that most acceptance tests, checking various use cases and scenarios, are of the integration type. In his article – A Guide to Robust Unit and Integration Tests with JUnit – Joshua Hayden describes how JUnit can be used for writing and running tests. I would like to build on that and add more details about testing multiple components of the system.

I like a quote from an article by James O. Coplien: “Create system tests with good feature coverage (not code coverage) — remembering that proper response to bad inputs or other unanticipated conditions is part of your feature set.”

Let’s assume our project is a REST API for an online education system, where a user can register, find a course, enroll and submit homework. It would be difficult to test all features of our app without integration tests, as in most cases that would require creating and modifying multiple resources, hence database access. For instance, to test enrollment, we would need a registered and activated user and an existing course. In my opinion, mocking the persistence layer for this would only be a waste of time. So, let’s dive in.

To run tests against a database, we have to overcome several challenges:

  • We need a separate DB and corresponding configuration
  • A database should be easy to build from scratch without manual intervention
  • We need isolation for each test or test suite, so they won’t pollute data for other tests.

For the sake of example, I will solve them using Spring Boot; however, the same ideas can be applied with other frameworks and technologies.

 

In-Memory Database vs Production

I’ve often received advice to use in-memory databases for my integration tests, but I’m still very skeptical about it. Even though all popular databases follow some standards, they have lots of vendor-specific features, which is what makes writing a fully portable app difficult. For instance, not only might an in-memory H2 database act differently from MySQL, but two versions of MySQL might give different results in some cases. It’s error-prone to account for every such detail or deprecated feature, so for my tests I always prefer the exact vendor and version used in the production environment. (Container technologies, like Docker, make our lives easy with this decision.)

Let’s move on with a real database and decide how to configure it. If we start up a Docker container on every `test` command and destroy it afterwards, we can get away with one DB. But if we also need a persistent database for manual testing, which must not be cleared on every test run, then we’ll need an additional temporary DB. In Spring Boot, we can either create a test/resources/application.properties file with the connection URL of such a database or use a test profile. This might vary among frameworks, but the key is to solve the task without manual intervention or commenting out configuration just to run the tests.
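For instance, the test override can be as small as this (the URL and credentials here are hypothetical; Spring Boot automatically picks up test/resources/application.properties during a test run):

# test/resources/application.properties
spring.datasource.url=jdbc:mysql://localhost:3307/edu_test
spring.datasource.username=edu_test
spring.datasource.password=secret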

 

Rebuilding a Database

To build a database from scratch, one can keep an SQL script of the structure exported from a DB. But I prefer a nicer method called Database Migrations. It keeps the versioning of the database and the application code in sync, and it also makes rebuilding a DB very easy.
You can also learn about database migrations from this article: Database Migrations: Turning Caterpillars into Butterflies
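As a sketch of the idea: a Flyway-style versioned migration is just an SQL file whose name carries the version, replayed automatically on an empty database. The file name and table below are hypothetical:

-- V1__create_course.sql
CREATE TABLE course (
  id BIGINT AUTO_INCREMENT PRIMARY KEY,
  title VARCHAR(200) NOT NULL
);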

 

Isolated Tests

Test isolation is good for several reasons: isolated tests can be run in parallel, and when one test updates the database, the others are safe from having their data broken. On the internet you might find the suggestion to use transactions for such tests, which won’t be committed and will be rolled back after test completion. In Spring Boot this can be achieved by adding the @Transactional annotation to a test method.
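A minimal sketch of that approach (the test class, repository and entity names are hypothetical):

@SpringBootTest
public class EnrollmentTest {

	@Autowired
	private EnrollmentRepository repository;

	@Test
	@Transactional  // everything saved here is rolled back after the test
	public void savesEnrollment() {
		repository.save(new Enrollment("user-1", "course-1"));
		assertEquals(1, repository.count());
	}
}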

I don’t like this solution, because if we actually use transactions inside our business logic, we’ll end up with nested transactions, which can have unexpected results in corner cases. I also find it very helpful that by running some tests I can quickly populate the database with various scenarios and then continue manual testing or debugging using the persisted data.

Generally, I prefer to do it this way: dropping and rebuilding the database before each test is an expensive operation, so it is better to rebuild it with migrations only once before the whole test run (this way the migration scripts get tested as well). Then I use plain SQL to truncate the tables before each test.

One way of achieving this:

Add a configuration class, which will run once before all tests and reset the migrations there (Flyway is a tool for database migrations):

@TestConfiguration
public class TestConfig {
	@Bean
	public FlywayMigrationStrategy clean() {
		return flyway -> {
			flyway.clean();
			flyway.migrate();
		};
	}
}

Add the truncate SQL script via an annotation on a base class for integration tests. It will be run before each test method:

@Sql(executionPhase = ExecutionPhase.BEFORE_TEST_METHOD, 
	scripts = "classpath:ge/demo/api/sql/beforeTest.sql")

Sometimes it might be reasonable to group several tests into a test suite and reset the database only before the suite and not before every single test.
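One way to sketch this with JUnit 5 is a per-class lifecycle, so the reset runs once for the whole class (reusing the script path from above; the class name is hypothetical):

@SpringBootTest
@TestInstance(TestInstance.Lifecycle.PER_CLASS)
public class EnrollmentSuiteTest {

	@Autowired
	private DataSource dataSource;

	@BeforeAll
	public void resetDatabase() throws Exception {
		// runs once before the whole class instead of before each test method
		try (Connection connection = dataSource.getConnection()) {
			ScriptUtils.executeSqlScript(connection,
					new ClassPathResource("ge/demo/api/sql/beforeTest.sql"));
		}
	}

	// ... tests sharing the same initial data
}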

Georgian Capital letters added to Unicode. Now what?

Last May, Unicode approved 46 capital letters of the Georgian Mkhedruli alphabet.

Maybe it’s a bit early, but operating systems will support this change eventually anyway. Out of curiosity I decided to do a little research on what will change for us developers, and I’m sharing it in this article.

A few definitions, just in case:
Unicode – a standard which maps every symbol to a unique number. It also describes specific rules for different languages. This standard is used all over the technical world, and everyone who needs to process or represent text follows it – operating systems, platforms, browsers…

UTF-8 – Unicode defines the list of symbols and their number codes, but it does not care how this information is stored in memory. There are various encoding algorithms for that. UTF-8 is one of the most popular, as it uses memory optimally and does not spend extra bytes on a symbol that fits into one. Other encoding examples are UCS-2, UTF-16, UTF-32…
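
The variable length is easy to observe, e.g. from Java (the byte counts are a property of UTF-8 itself):

"a".getBytes(StandardCharsets.UTF_8).length  => 1
"ж".getBytes(StandardCharsets.UTF_8).length  => 2
"ა".getBytes(StandardCharsets.UTF_8).length  => 3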

Changes in the standard cause changes in implementations, and that does not happen immediately. For instance, the Georgian currency symbol ₾ was added in Unicode version 8 on 17 May 2015, and the Windows update for the symbol was released on 19 January 2016.

Operating systems should update keyboard drivers to let Georgian users write capital letters in CAPS mode (there are 33 letters in Georgian, so the shift+symbol combinations are already taken). System fonts should also be updated, so the correct symbols appear during font fallback.

Because the capital and small versions of the same letter have different codes, software developers usually need to take special care – until now only for other languages, now for Georgian too. For instance, when it is necessary to compare strings, search, match regex patterns, sort, store into a database, etc.

 

Database

MS SQL Server has built-in Unicode support and follows the standard in its operations anyway. Just make sure it implements the correct version: SQL Fiddle

It’s different with MySQL – here each database, table or even column can have a collation defined, based on what kind of information it stores. We are accustomed to using utf8_general_ci, as it ‘processes’ Georgian letters too. However, unlike utf8_unicode_ci, this collation does not implement the Unicode collation rules completely. It was generally used for its better performance, but with modern processors there is not much difference. utf8_unicode_ci will correctly process the new Georgian alphabet after a version upgrade.

Here is an example:
Together with the unique codes, Unicode also defines the order of symbols, which is used during sorting. E.g. in this list all kinds of the Georgian letter ‘ა’ are listed together – Nuskhuri, Asomtavruli and Mkhedruli. Then the versions of the letter ‘ბ’ appear, and so on. The new capital letters will probably be added in the same way.

SQL Fiddle

CREATE TABLE IF NOT EXISTS `test` (
  `content` varchar(200) NOT NULL
) DEFAULT CHARSET=utf8 COLLATE utf8_general_ci;
INSERT INTO `test` (`content`) VALUES
  ('აბგ'),  ('ააააა'),  ('Ⴁააააა'),  ('Ⴀააააა'),  ('bcd'),  ('ab.'),  ('Ⴄ'),  ('ж'),  ('Ж'),  ('ц'),  ('Ц');
  

CREATE TABLE IF NOT EXISTS `test_better` (
  `content` varchar(200) NOT NULL
) DEFAULT CHARSET=utf8 COLLATE utf8_unicode_ci;
INSERT INTO `test_better` (`content`) VALUES
  ('აბგ'),  ('ააააა'),  ('Ⴁააააა'),  ('Ⴀააააა'),  ('bcd'),  ('ab.'),  ('Ⴄ'),  ('ж'),  ('Ж'),  ('ц'),  ('Ц');


select * from `test` d order by d.content;
select * from `test_better` d order by d.content;

Result:

ab., bcd, Ж, ж, ц, Ц, Ⴀააააა, Ⴁააააა, Ⴄ, ააააა, აბგ
ab., bcd, ж, Ж, ц, Ц, ააააა, Ⴀააააა, აბგ, Ⴁააააა, Ⴄ

The recently released MySQL 8 beta implements Unicode version 9; our capital letters arrive in version 11 🙂

 

Javascript

Although there are many implementations, we can’t ignore V8, so I’ll base the discussion on it.

JavaScript has Unicode support, but some things are still problematic (e.g. Unicode regex). If we need sorting or filtering on our site, then the ordinary sort won’t work any more and we should use a locale-aware comparison (Intl.Collator), which takes the Unicode rules into account. For instance:

let a = ['აბგ','ააააა','Ⴁააააა','Ⴀააააა','bcd','ab.','Ⴄ','ж','Ж','ц','Ц'];
console.log(a.sort());
console.log(a.sort(Intl.Collator('ru').compare));

Unfortunately, it has no support for the Georgian collation at all, so we cannot sort Mkhedruli together with Nuskhuri and Asomtavruli correctly. Well, this is a very rare case anyway, so no need to worry. The ordinary sort function sorts by code points, so the order follows the alphabet (with the exception of capital letters).

That problem with capitals can be solved by converting strings to the same case. Giorgi suggested an idea:

myArray.sort(function (s1, s2) {
  return s1.toLowerCase() < s2.toLowerCase() ? -1 : s1.toLowerCase() > s2.toLowerCase() ? 1 : 0;
});

It will probably work correctly for Georgian, too, once V8 updates its Unicode implementation. Currently it already works like that for Asomtavruli and Nuskhuri: "Ⴀ".toLowerCase() => "ⴀ"

It seems that, since the standard defined Asomtavruli as CAPITAL and Nuskhuri as SMALL, these two alphabets are implemented as the cases of a single one instead of two completely different alphabets. (V8 source file: unicode.cc – the code points are mapped directly.)
For now Mkhedruli is caseless. It is interesting how the new letters will be marked; I think there is no other language with two kinds of capital letters.
Either way, this requires a version upgrade.

Then I remembered that V8 is an open-source project, and a volunteer could add the Georgian locale. For the time being, this results in an empty array:

Intl.Collator.supportedLocalesOf('ka')

 

Java

Java is not in a hurry to upgrade either. JDK 9, with the Unicode 8 implementation (where the Lari currency symbol was added), was released two years later – in September 2017.
Here strings are compared with ‘equals’. In the future we’ll need to use the ‘equalsIgnoreCase’ method for Georgian, too:

"Ⴀ".equals("ⴀ")  => false
"Ⴀ".equalsIgnoreCase("ⴀ")  => true

Since one capital Georgian alphabet already exists, I’m testing with it – we just don’t use that alphabet much in practice.

Also, we can’t search with regex directly. The ordinary i (ignore case) flag does not work, as Unicode needs to be processed differently. So we should write:

"A".matches("(?i)[a]")  => true
"Ⴀ".matches("(?i)[ⴀ]") => false

Pattern.compile("[ⴀ]", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE).matcher("Ⴀ").matches();  => true

Correspondingly, we should consider this wherever we use strings – maps, sets, etc.
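
For instance, a sorted set can be given a case-folding comparator; whether it will pair the new Mkhedruli capitals correctly again depends on the JDK’s Unicode tables (for Asomtavruli and Nuskhuri it already works):

TreeSet<String> words = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
words.add("Ⴀ");
words.contains("ⴀ")  => true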

 

PHP

Working with Unicode strings in PHP is generally not pleasant at all, and more conversions will have to be added here, too.

 

 
We’ll also need changes in another place – the very convenient search tools grep and the like. The case-insensitive option of grep does not work even now for the existing Georgian Asomtavruli capital alphabet. I hope the Unicode changes will be reflected in their upgrades, too. They are great tools for regex filtering and searching in large (or small) texts and files.

Many Georgian application systems won’t be able to upgrade their platforms quickly, as the testing would take a huge amount of time. They will probably add some conversions and validations on the front end, to prevent capital strings in user input from reaching old Java or other systems.

Overall, I like that the capital letters were added (as a result of several people’s hard work). They are an important part of the Georgian language and should not be lost.

Do you have any ideas about what else will need to be changed?

Some resources about the topic:
On.ge – UNICODE-მა ქართული მხედრული ანბანის 46 მთავრული ასონიშანი დაამტკიცა [Unicode approved the 46 Mtavruli capital letters of the Georgian Mkhedruli alphabet]
DevFest 2016: Akaki Razmadze – ❤ [I LOVE UNICODE]
DevFest 2017: Akaki Razmadze – გუტენბერგი, სტივ ჯობსი, გუგლი, ხინკალი, უნიკოდი [Gutenberg, Steve Jobs, Google, Khinkali, Unicode]
DevFest 2016: Michael Everson – The confusing case history of Georgian in Unicode

Memory optimization with multidimensional arrays in Java

Yesterday I hit a memory limit while submitting a problem solution to an online judge – all because I used the wrong declaration of a multidimensional array. In C or C++ this would not be a problem, but since in Java everything is an object (except primitive types), allocated memory has to be calculated differently for multidimensional arrays.

For instance, in C++ the size of this array is 96 bytes (assuming a 4-byte int):
int arr[2][3][4];
2 * 3 * 4 = 24 elements, 4 bytes each for the int type, giving a total of 24 * 4 = 96.
If we declare the array as int arr[4][3][2], the result is the same.

Let’s compare these two declarations in Java:
int arr[][][] = new int[40000][200][2];
int arr[][][] = new int[2][200][40000];

From the data storage viewpoint they are not much different, yet the first one requires much more memory. Here is why:
1. An array in Java is an object itself. Every object (the object itself on the heap, not the reference) holds a few additional bytes of header information. This header contains data necessary for the VM, which uses it for the garbage collector and other purposes. As far as I know, it generally takes 8 bytes on a 32-bit machine and 16 bytes on a 64-bit one. Apart from this, an array object holds its size – an additional 4 bytes. Padding in memory might take a few more bytes. So, I won’t attempt precise calculations; let’s just assume that each array object (excluding its elements) takes X bytes.

2. int[a][b] – in Java this is ‘a’ arrays, each containing ‘b’ elements, i.e. we have a + 1 objects instead of just one.
For a three-dimensional int[a][b][k] it is a * b + a + 1 objects, and so on.

Now let’s calculate and compare sizes of these arrays:
(40,000 * 200 + 40,000 + 1) * X + (40,000 * 200 * 2 * 4)  →  8,040,001 array objects
(2 * 200 + 2 + 1) * X + (40,000 * 200 * 2 * 4)  →  403 array objects

Clearly, the second part, which calculates the total size of the elements based on the int type, is the same in both cases. But according to the first part, the first declaration creates 8,039,598 excess objects and consequently takes considerably more memory.

By the way, I could not verify whether this number is real with a profiler – do you have any idea how to check it?
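
A rough way I can think of to at least see the difference in size (only a sketch – System.gc() is merely a hint to the VM, so the numbers are approximate):

public class ArrayMemory {
	public static void main(String[] args) {
		Runtime rt = Runtime.getRuntime();
		System.gc();
		long before = rt.totalMemory() - rt.freeMemory();

		int[][][] arr = new int[40000][200][2];  // compare with new int[2][200][40000]

		System.gc();
		long after = rt.totalMemory() - rt.freeMemory();
		System.out.println("approximately " + (after - before) + " bytes");
		System.out.println(arr.length);  // keep the array reachable
	}
}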

Dynamic Array

A dynamic array is a data structure of variable length. One can insert, retrieve or delete an element with random access – i.e. reaching any element takes a fixed amount of time, independent of the overall size of the array.

You might come across this structure by different names – dynamic array, growable array, resizable array, dynamic table, mutable array, array list.

To guarantee random access, it is necessary to allocate contiguous memory for the array, i.e. its elements must be stored next to each other.

In the case of a static array this is not a problem, as we define the length in advance. But sometimes we don’t know the length yet. So, how much contiguous memory should we allocate?

Clearly, there is no point in allocating huge memory just in case. We should not reserve a million bytes in advance just to store 20 elements.

This is where the dynamic array comes in. In many programming languages the problem is solved this way:
the dynamic array is backed by a static array, which starts out small. When the static array is full and the user tries to add more elements, a larger static array is created behind the scenes, the existing elements are copied into it, and the old array is deleted. Consequently, inserting some elements into a dynamic array takes more time than inserting others.
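
A minimal sketch of this mechanism in Java (real implementations such as ArrayList are more elaborate; here the growth factor is fixed at 2):

public class DynamicArray {
	private int[] data = new int[4];  // small initial backing array
	private int size = 0;

	public void add(int value) {
		if (size == data.length) {
			// full: create a twice larger array and copy the elements over
			int[] bigger = new int[data.length * 2];
			System.arraycopy(data, 0, bigger, 0, size);
			data = bigger;
		}
		data[size++] = value;
	}

	public int get(int index) {
		return data[index];  // random access: a single lookup
	}
}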

With this solution in mind, we should answer an important question: by what factor should we grow the static array’s length?

It should be noted that, once again, we are looking for a balance between performance and wasted memory. If we grow the array by only one element at a time, rewriting the whole array on every new element will take too much time. But if we grow by a large factor, we might end up with a largely empty array.

The usual choice is 2, sometimes slightly altered to match particular requirements.

You might find different variants in various programming languages – e.g. vector in C++ grows by a factor of 2. Vector from the Java standard library also uses a factor of 2, though you can change the behavior through a constructor argument. The backing array of Java’s ArrayList grows by a factor of 3/2; that of HashMap by 2. In the C implementation of Python the number is a bit odd – approximately 9/8. Here is the source. And here is an explanation.

If the programmer knows the approximate size of the array in advance, they can configure the dynamic array accordingly. E.g. vector in C++ has a reserve function, which reserves memory of the given size.

The ArrayList and HashMap classes in Java have a constructor parameter initialCapacity. In the case of HashMap, growing means not only rewriting the backing array behind the scenes but also rehashing the entries.

If performance is critical, this parameter can be used. I carried out several experiments and saw the difference; however, in ordinary tasks it is not noticeable. Even with a factor of 2 and a million elements, the array is rewritten only about 20 times.
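
For instance, pre-sizing looks like this (the element count is arbitrary):

// pre-sizing skips the ~20 intermediate grow-and-copy steps
List<Integer> list = new ArrayList<>(1_000_000);
for (int i = 0; i < 1_000_000; i++) {
	list.add(i);
}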

In the beginning I mentioned that inserting an element can take anywhere from O(1) to O(n) time, where n is the total number of elements. Despite this, based on amortized analysis, the insertion time of a dynamic array is defined as O(1).

The idea of amortized analysis is to consider both the slow and the fast operations of an algorithm, since they may balance each other out. When estimating performance we usually reach for the worst-case scenario, but sometimes it is possible to account for how rarely the expensive operations occur.

Let’s calculate the time for filling a dynamic array of n elements:
If we double the length of the array each time, we can estimate the number of rewrites like this:

Let’s start from the end. At the very end, all n elements have to be rewritten. Before that, only half of the elements; before that, a quarter, and so on:
n + n/2 + n/4 + n/8 + … = n (1 + 1/2 + 1/4 + 1/8 + …) = 2n

Let’s also add the time of inserting each new element, and we get 3n. Taking the average per element, we get O(3) = O(1) time for inserting an element.