Introduction
As a developer, you might already be familiar with Kotlin notebooks and may even have experimented with them a bit. But chances are, you're still trying to figure out how to apply them effectively in real-world scenarios. Often, the assumption is that they’re mainly useful for data scientists and data analysts. I've been there too. In fact, I had to dive deep into machine learning, train models, and show how Kotlin notebooks could be helpful before I gave them a proper chance. I’ll probably share more on that topic in another post. For now, though, I want to emphasize that you don't need extensive machine learning experience to follow the concepts here.
What if you need to quickly write down a formula, process some data, or test an idea you saw in a tech talk or an article? Before, I’d either use Kotlin Playground, open my IDE to start a new file, or just add to an existing one. The tricky part came when I had to save those changes without messing up my main work - like committing those experimental changes or stashing them away, only to clean up later.
But with Kotlin notebooks, things got a lot easier. Trying out new ideas is very straightforward. Also, unlike scripts that you have to rerun after each attempt, Kotlin notebooks allow you to run heavy calculations in one cell, store data in variables, and then reuse and do more with that data in further cells without needing to rerun the whole code. This efficiency significantly speeds up the iterative process of testing and refining your ideas.
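To make the cell-by-cell workflow concrete, here is a minimal sketch of how state carries over between cells. The prime search is only a stand-in for any expensive computation, and the names isPrime, primes, primeCount and largestPrime are illustrative, not taken from a real project. The comments mark where the cell boundaries would be.

```kotlin
// Cell 1: an "expensive" computation, executed once and kept in notebook state
fun isPrime(n: Int): Boolean {
    if (n < 2) return false
    var divisor = 2
    while (divisor * divisor <= n) {
        if (n % divisor == 0) return false
        divisor++
    }
    return true
}

val primes: List<Int> = (2..100_000).filter(::isPrime)

// Cell 2: reuses `primes` without recomputing it
val primeCount = primes.size

// Cell 3: another experiment on the same data
val largestPrime = primes.last()
```

Once the first cell has run, the later cells can be edited and rerun as often as needed without repeating the expensive step.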
Practical Uses
In my daily work, I find numerous uses for Kotlin notebooks:
- Writing Small Scripts: Automating tasks or simplifying repetitive actions quickly.
- Sandboxing:
  - Library Trials: Experimenting with new libraries directly without affecting the main project.
  - Variant Testing: Easily comparing different approaches in separate cells without cluttering my workspace with commented-out code.
- Data Analysis:
  - Handy File Analysis: A quick look-through or a deep dive into data provided in files.
  - Data Source Connections: Linking different data sources for comparison, e.g. comparing files with data in databases.
  - Chart Building: Fast visualization of data or results through charts.
- Sharing Concepts: Combining code with text to share ideas with the team and stakeholders via GitLab snippets or GitHub gists.
Prerequisites
For these demonstrations, I'm using IntelliJ IDEA, equipped with the Kotlin Notebook plugin and all necessary dependencies. If you’d like to follow along step-by-step, I recommend setting up your environment similarly. Here is a short installation guide.
Writing Small Scripts
Imagine you have a list of MySQL datetime strings that require reformatting for further use. While this could certainly be managed with a simple script, let me demonstrate how the same task can be efficiently handled within a Kotlin notebook, showcasing another aspect of its utility:
In [1]:
import java.time.LocalDateTime
import java.time.ZoneId
import java.time.ZonedDateTime
import java.time.format.DateTimeFormatter
val rawDateTimes = listOf(
    "2024-03-11 23:51:42",
    "2024-02-23 01:12:00",
    "2023-12-31 23:59:59",
)
val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
val zonedDateTimes = rawDateTimes.map {
    val localDateTime = LocalDateTime.parse(it, formatter)
    ZonedDateTime.of(localDateTime, ZoneId.of("CET"))
}
println(zonedDateTimes.joinToString("\n") { it.format(DateTimeFormatter.ISO_ZONED_DATE_TIME) })
Out [1]:
2024-03-11T23:51:42+01:00[CET]
2024-02-23T01:12:00+01:00[CET]
2023-12-31T23:59:59+01:00[CET]
Kotlin is designed to improve the development experience, offering various ways to create and manage code. You could accomplish similar tasks using Kotlin scratch files (where it's also possible to load modules and reuse project code). However, testing multiple versions would require creating multiple scratch files. In contrast, Kotlin notebooks allow you to use cells for different versions or tests within the same notebook, simplifying the workflow and reducing clutter.
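As an illustration of keeping variants side by side, a second cell could try a different interpretation of the same kind of timestamps, for example converting them to UTC. This is only a sketch: the function name toUtcIso, the Europe/Berlin zone and the sample values are assumptions made for the example, not part of the original script.

```kotlin
import java.time.LocalDateTime
import java.time.ZoneId
import java.time.ZoneOffset
import java.time.format.DateTimeFormatter

// Same MySQL-style pattern as in the cell above
val mysqlFormat: DateTimeFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

// Variant: treat the value as local time in the given zone, then express it in UTC
fun toUtcIso(raw: String, zone: ZoneId = ZoneId.of("Europe/Berlin")): String =
    LocalDateTime.parse(raw, mysqlFormat)
        .atZone(zone)
        .withZoneSameInstant(ZoneOffset.UTC)
        .format(DateTimeFormatter.ISO_OFFSET_DATE_TIME)

// One winter and one summer timestamp, to see the daylight-saving offset change
val utcDateTimes = listOf("2024-03-11 23:51:42", "2024-07-01 12:00:00").map { toUtcIso(it) }
```

Both variants can stay in the notebook, so the outputs are easy to compare before deciding which one to keep.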
Sandboxing
Experimenting with libraries or incorporating them into your scripts is straightforward in Kotlin notebooks. You can start writing code with minimal setup, and that setup can itself be done in the notebook.
For instance, let's say we need to obtain an air quality index. I chose a specific API for this example because it provides a free demo URL, making it accessible for demonstration purposes. The task is accomplished in three steps, each corresponding to a separate cell in the notebook.
Step 1: Loading Libraries
It is worth mentioning that Kotlin notebooks provide several ways to add libraries:
- Gradle-style syntax
- DependsOn annotation
- %use magic command
You can find a more detailed explanation here.
For this example, I want to use the USE (giggle) function from jupyter.kotlin. This function allows you to include libraries in the same way you would in your build.gradle.kts file:
In [1]:
USE {
    dependencies {
        implementation("io.ktor:ktor-client-cio-jvm:2.3.9")
        implementation("org.jetbrains.kotlinx:kotlinx-serialization-json:1.6.3")
    }
}
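For comparison, the other two methods from the list might look roughly like the following, each in its own cell. Treat this as a sketch rather than a verified setup: the Maven coordinates are copied from the cell above, while the ktor-client and serialization descriptor names are assumptions about what the %use registry offers in your notebook version, so check the list of supported libraries before relying on them.

```kotlin
// Variant A: annotation-based dependencies, one annotation per artifact
@file:DependsOn("io.ktor:ktor-client-cio-jvm:2.3.9")
@file:DependsOn("org.jetbrains.kotlinx:kotlinx-serialization-json:1.6.3")

// Variant B: predefined library descriptors via a magic command.
// This is a notebook directive, not Kotlin code, so it is shown as a comment;
// it would go in its own cell instead of the annotations above.
// %use ktor-client, serialization
```

The annotation and Gradle-style forms give explicit control over versions, whereas %use trades that control for brevity.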
Step 2: Defining the Response Structure
In [2]:
import kotlinx.serialization.Serializable
import kotlinx.serialization.json.Json
import kotlinx.serialization.json.JsonObject
@Serializable
data class City(val name: String)
@Serializable
data class AirQualityData(val aqi: Int, val city: City)
@Serializable
data class AirQualityResponse(val data: AirQualityData)
Step 3: Retrieving and Displaying the Data
In [3]:
import io.ktor.client.*
import io.ktor.client.engine.cio.*
import io.ktor.client.request.*
import io.ktor.client.request.forms.*
import io.ktor.client.statement.*
import io.ktor.http.*
import kotlinx.coroutines.runBlocking
val client = HttpClient(CIO)
val responseText = runBlocking {
    val response = client.get("http://api.waqi.info/feed/shanghai") {
        parameter("token", "demo")
    }
    response.bodyAsText()
}
val json = Json { ignoreUnknownKeys = true }
val airQualityResponse = json.decodeFromString<AirQualityResponse>(responseText)
airQualityResponse
Out [3]:
AirQualityResponse(data=AirQualityData(aqi=155, city=City(name=Shanghai (上海))))
Data Analysis
I want to start by stating that I am not a data analyst. I'm a software engineer. However, I still find myself analyzing data from time to time. I'll share an example from my personal experience to illustrate this point. Since I often work with hotels, they'll serve as the basis for this example.
Comparing Files
Let's assume I support a platform for keeping track of hotel inventory. Imagine an Online Travel Agency (OTA) uploads a file containing approximately 344k rows of different hotels, but only about 341k end up in the system. They assert that all their records are perfectly correct and are baffled by the discrepancy in numbers. This type of investigation is something I might find myself working on. Typically, every developer believes their system is functioning perfectly, so the responsibility falls on me to determine whether the issue lies with the data provided by the OTA or with our system.
Manually sifting through such large files is not an easy task. This is a hypothetical case designed to highlight how challenging it would be to identify discrepancies manually. But even if we were dealing with smaller files (say, about 1k records with a 50-record difference), where manual research might be somewhat feasible, it would still be tedious and time-consuming.
So, what do we do?
Given: Input file and result file.
Objective: To determine why the number of hotels differs.
For the purposes of this example, I found a public dataset with hotel information. It was quite large, so I only took about 341k hotels from it and removed the hotel descriptions.
I will start by loading the file. Knowing that I'll have to work with tables, I decided to use the Kotlin DataFrame library.
The DataFrame library can be loaded with the %use magic command, which I mentioned in the previous example. Most popular libraries can be loaded with this command, so you don't have to add them the way I did in the "Sandboxing" example.
In [1]:
%use dataframe
After that, I loaded the data from the file, which could have been provided by the OTA:
In [2]:
val inputFileDf = DataFrame.readCSV("../data/hotels_no_info_with_duplicates.csv")
inputFileDf
Out [2]:
+------------+----------------+----------+---------------+-----------+-----------------------------+-------------+---------------------------------------------------------------------+-----------------------------------------------------------------------+---------------+-----------------------------------+----------------+-----------+-----------------------------------------------------------------+
| countyCode | countyName | cityCode | cityName | HotelCode | HotelName | HotelRating | Address | Attractions | FaxNumber | Map | PhoneNumber | PinCode | HotelWebsiteUrl |
+------------+----------------+----------+---------------+-----------+-----------------------------+-------------+---------------------------------------------------------------------+-----------------------------------------------------------------------+---------------+-----------------------------------+----------------+-----------+-----------------------------------------------------------------+
| BR | Brazil | 104695 | Abelardo Luz | 1436268 | Quedas Park Hotel | ThreeStar | Av. Fermino Martins Neto 2395 Fairro Vila Ceres Bairro SC 89830-000 | null | null | -26.551453 | -52.321371 | 55-46999810033 | 89830-000 | http://www.quedasparkhotel.com.br/ |
| BR | Brazil | 101609 | Abre-Campo | 1436644 | Memorial Hotel Abre Campo | ThreeStar | Rodovia BR 262 KM 96, s/n Minas Gerais 35365-000 | | null | -20.30706 | -42.46655 | +553138721754 | 35365-000 | https://www.booking.com/hotel/br/vision-express-abre-campo.html |
| IT | Italy | 113140 | Camaiore | 1223545 | Hotel Caesar | FourStar | Viale Bernardini 325 55043 Lido di CamaioreLucca | null | +390584610888 | 43.9052 | 10.21682 | +390584617841 | 55043 | https://www.booking.com/hotel/it/caesar.html |
| GB | United Kingdom | 138878 | Staffordshire | 5256963 | Glen's Cottage | FourStar | Lane Head Farm Longnor Buxton | null | null | 53.1688 | -1.89911 | null | SK17 0NG | https://www.booking.com/hotel/gb/glens-cottage.html |
| FR | France | 132258 | Pleurtuit | 5267044 | Gite De La Caminais | All | La Caminais | null | null | 48.57518 | -2.03045 | null | 35730 | https://www.booking.com/hotel/fr/gite-de-la-caminais.html |
| CY | Cyprus | 122542 | Kato Paphos | 5168981 | Aphrodite Gardens Apt 27 | ThreeStar | Soteraki Markidi Avenue | Distances are displayed to the nearest 0.1 mile and kilometer. ... | null | 34.760602 | 32.4228 | 357-99566007 | 8036 | null |
| GB | United Kingdom | 124857 | Leith | 5338488 | Property in Leith | ThreeStar | Great Junction Street | Distances are displayed to the nearest 0.1 mile and kilometer. ... | null | 55.973775 | -3.176405 | 44-7441909799 | EH6 5LJ | null |
| DE | Germany | 123299 | Königssee | 1207939 | Sulzbergeck | FourStar | Sulzbergweg 21 | null | null | 47.61389 | 12.99043 | null | 83471 | https://www.booking.com/hotel/de/sulzbergeck.html |
| FR | France | 104669 | Calvados | 1299062 | Le Trophee By M Hotel Spa | ThreeStar | 81 Rue General Leclerc 14800 Deauville | Distances are displayed to the nearest 0.1 mile and kilometer. ... | 33-23-1880794 | 49.359119 | 0.071808 | 33-23-1884586 | 14800 | http://www.letrophee.com/ |
| HR | Croatia | 148316 | Tribunj | 5291108 | Apartments Laura I Angelina | All | Starine 12 B | null | null | 43.757923126221 | 15.742042541504 | null | 22212 | null |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
+------------+----------------+----------+---------------+-----------+-----------------------------+-------------+---------------------------------------------------------------------+-----------------------------------------------------------------------+---------------+-----------------------------------+----------------+-----------+-----------------------------------------------------------------+
Let's quickly count the provided hotels:
In [3]:
inputFileDf.count()
Out [3]:
344318
Now, I will load the result data, which is stored in the system. Since we do not have access to the database for this example, we'll simply load another file. It's also reasonable to assume that this file was provided by the OTA, for example as an export they downloaded from their account on the platform.
In [4]:
val resultFileDf = DataFrame.readCSV("../data/hotels_no_info_distinct.csv")
resultFileDf
Out [4]:
+------------+------------+----------+--------------+-----------+---------------------------+-------------+--------------------------------------------------------------------------+-------------------------------------------------------+------------------+-------------------------+----------------+------------+------------------------------------------------------------------+
| countyCode | countyName | cityCode | cityName | HotelCode | HotelName | HotelRating | Address | Attractions | FaxNumber | Map | PhoneNumber | PinCode | HotelWebsiteUrl |
+------------+------------+----------+--------------+-----------+---------------------------+-------------+--------------------------------------------------------------------------+-------------------------------------------------------+------------------+-------------------------+----------------+------------+------------------------------------------------------------------+
| BR | Brazil | 104695 | Abelardo Luz | 1436268 | Quedas Park Hotel | ThreeStar | Av. Fermino Martins Neto 2395 Fairro Vila Ceres Bairro SC 89830-000 | null | null | -26.551453 | -52.321371 | 55-46999810033 | 89830-000 | http://www.quedasparkhotel.com.br/ |
| BR | Brazil | 101609 | Abre-Campo | 1436644 | Memorial Hotel Abre Campo | ThreeStar | Rodovia BR 262 KM 96, s/n Minas Gerais 35365-000 | | null | -20.30706 | -42.46655 | +553138721754 | 35365-000 | https://www.booking.com/hotel/br/vision-express-abre-campo.html |
| BR | Brazil | 101609 | Abre-Campo | 1892089 | Hotel Sao Jorge | All | Rua Professora Yolanda Brando 29 CentroAbre CampoMinas Gerais 35365-000 | null | null | -20.30013 | -42.47471 | +553138721425 | 35365-000 | https://www.booking.com/hotel/br/sao-jorge-abre-campo.html |
| BR | Brazil | 101611 | Acailandia | 1432637 | Hotel Golden Lis | ThreeStar | Rod. Br 010 Km 1415 65926-000 Acailandia (Maranhao) | null | null | -4.95416 | -47.50305 | null | 65926-000 | https://www.booking.com/hotel/br/golden-lis-hoteis-ltda-epp.html |
| BR | Brazil | 101611 | Acailandia | 1433597 | Hotel Lara's | ThreeStar | Rua Duque de Caxias, no 1272 - Centro, Aailandia 65930-000 | null | null | -4.95122 | -47.49812 | +559935921000 | 65930-000 | http://www.hotellaras.com.br/ |
| BR | Brazil | 101611 | Acailandia | 1540808 | Vera Cruz Business Hotel | FourStar | BR 222 300 Bairro Jardim AmericaAcailndiaMaranhao 65930-000 | null | null | -4.94952 | -47.48089 | +559935384257 | 65930-000 | https://www.booking.com/hotel/br/vera-cruz-business.html |
| BR | Brazil | 101611 | Acailandia | 1585468 | Genova Palace Hotel | All | Rodovia 222 QD 28 LT 01Vila Sao Francisco | null | null | -4.94774 | -47.47794 | +559935381572 | 65930-000 | https://www.booking.com/hotel/br/genova-palace.html |
| BR | Brazil | 101611 | Acailandia | 1717926 | Vip Palace Hotel | All | Rua Sao Francisco 928 AcailandiaMaranhao 65930-000 | null | null | -4.95021 | -47.50011 | null | 65930-000 | https://www.booking.com/hotel/br/vip-palace-acailandia.html |
| BR | Brazil | 106068 | Acara | 1430684 | Hotel Acarau Riviera | ThreeStar | Rua Prefeito Raimundo Rocha 477 Centro 62580-000 Acarau (Ceara) | null | +55(0)8636611690 | -2.88838 | -40.11592 | (88) 3661-1690 | 62.580-000 | https://www.booking.com/hotel/br/acarau-riveira.html |
| BR | Brazil | 106068 | Acara | 1543389 | Castelo Encantado | All | Rua Santo Antonio 1550 AcaraCearo 62580-000 | null | null | -2.87838 | -40.12185 | (88) 9925-5917 | 62580-000 | https://www.booking.com/hotel/br/castelo-encanado.html |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
+------------+------------+----------+--------------+-----------+---------------------------+-------------+--------------------------------------------------------------------------+-------------------------------------------------------+------------------+-------------------------+----------------+------------+------------------------------------------------------------------+
Next, let's check the count and note the difference:
In [5]:
resultFileDf.count()
Out [5]:
340903
In [6]:
inputFileDf.count() - resultFileDf.count()
Out [6]:
3415
We observed an actual difference of 3415 rows. This is the discrepancy the OTA pointed out and asked us to investigate.
I'll take a thorough approach to this investigation since the root of the problem is not immediately apparent (unless you read the file names🙂). The first step is to narrow down the potential issues. Our dataset includes a countyName field, so I'll begin by grouping the data by countyName to see if there are any noticeable discrepancies in the number of hotels listed for each country.
First, let's examine the input file:
In [7]:
val groupedInputFileDf = inputFileDf
    .groupBy("countyName")
    .count()
    .sortBy("countyName")
groupedInputFileDf
Out [7]:
+----------------+-------+
| countyName | count |
+----------------+-------+
| Brazil | 29770 |
| Croatia | 6967 |
| Cyprus | 3238 |
| France | 56803 |
| Germany | 53601 |
| Greece | 26632 |
| Italy | 96923 |
| Ukraine | 1556 |
| United Kingdom | 68828 |
+----------------+-------+
And now, let's look at the result file:
In [8]:
val groupedResultFileDf = resultFileDf
    .groupBy("countyName")
    .count()
    .sortBy("countyName")
groupedResultFileDf
Out [8]:
+----------------+-------+
| countyName | count |
+----------------+-------+
| Brazil | 29467 |
| Croatia | 6891 |
| Cyprus | 3202 |
| France | 56247 |
| Germany | 53039 |
| Greece | 26369 |
| Italy | 95977 |
| Ukraine | 1542 |
| United Kingdom | 68169 |
+----------------+-------+
Instead of comparing them manually, why not leverage the power of a Kotlin notebook along with DataFrames? I will join the two DataFrames to compare the results.
In [9]:
val joinedDf = groupedInputFileDf.join(groupedResultFileDf, "countyName").sortBy("countyName")
joinedDf
Out [9]:
+----------------+-------+--------+
| countyName | count | count1 |
+----------------+-------+--------+
| Brazil | 29770 | 29467 |
| Croatia | 6967 | 6891 |
| Cyprus | 3238 | 3202 |
| France | 56803 | 56247 |
| Germany | 53601 | 53039 |
| Greece | 26632 | 26369 |
| Italy | 96923 | 95977 |
| Ukraine | 1556 | 1542 |
| United Kingdom | 68828 | 68169 |
+----------------+-------+--------+
Let's dive even deeper and calculate the difference between the two DataFrames, which is also quite easy. For clarity, I'll rename the count columns to inputCount and resultCount. Additionally, I'll use a slightly different join method to illustrate another way of achieving the same outcome.
In [10]:
val joinedWithDiffDf = groupedInputFileDf
    .rename("count" to "inputCount")
    .innerJoin(groupedResultFileDf.rename("count" to "resultCount")) {
        "countyName" match right["countyName"]
    }.add("diff") {
        it["inputCount"] as Int - it["resultCount"] as Int
    }
joinedWithDiffDf.sortBy("countyName")
Out [10]:
+----------------+------------+-------------+------+
| countyName | inputCount | resultCount | diff |
+----------------+------------+-------------+------+
| Brazil | 29770 | 29467 | 303 |
| Croatia | 6967 | 6891 | 76 |
| Cyprus | 3238 | 3202 | 36 |
| France | 56803 | 56247 | 556 |
| Germany | 53601 | 53039 | 562 |
| Greece | 26632 | 26369 | 263 |
| Italy | 96923 | 95977 | 946 |
| Ukraine | 1556 | 1542 | 14 |
| United Kingdom | 68828 | 68169 | 659 |
+----------------+------------+-------------+------+
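If DataFrame is not at hand, the same group, count and subtract idea can be sketched with the standard library alone. The HotelRow class and the tiny sample lists below are made up for illustration; only the per-country grouping mirrors the cells above. Note one difference: unlike the inner join, this version also keeps countries that appear in just one of the two inputs.

```kotlin
// Hypothetical, trimmed-down row type: only the fields needed for counting
data class HotelRow(val countryName: String, val hotelCode: Int, val cityCode: Int)

// Number of rows per country
fun countByCountry(rows: List<HotelRow>): Map<String, Int> =
    rows.groupingBy { it.countryName }.eachCount()

// Input count minus result count per country; a missing country counts as 0
fun countDiff(input: Map<String, Int>, result: Map<String, Int>): Map<String, Int> =
    (input.keys + result.keys).sorted()
        .associateWith { country -> (input[country] ?: 0) - (result[country] ?: 0) }

// Made-up sample data: one Brazilian hotel is listed twice in the input
val sampleInput = listOf(
    HotelRow("Brazil", 1, 10), HotelRow("Brazil", 1, 10), HotelRow("Italy", 2, 20),
)
val sampleResult = listOf(HotelRow("Brazil", 1, 10), HotelRow("Italy", 2, 20))

val diffByCountry = countDiff(countByCountry(sampleInput), countByCountry(sampleResult))
```

For anything beyond a quick check, the DataFrame version remains the more convenient option, since it keeps the tabular output and column names.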
This approach didn't clarify much, as we observed discrepancies across all countries. Given that a combination of hotel code and city code uniquely identifies a hotel, I will next group the input file by hotel code and city code to check for any duplicates.
In [11]:
val groupedByUniqueHotelDf = inputFileDf.groupBy("HotelCode", "cityCode").count()
groupedByUniqueHotelDf
Out [11]:
+-----------+----------+-------+
| HotelCode | cityCode | count |
+-----------+----------+-------+
| 1436268 | 104695 | 2 |
| 1436644 | 101609 | 2 |
| 1223545 | 113140 | 2 |
| 5256963 | 138878 | 2 |
| 5267044 | 132258 | 2 |
| 5168981 | 122542 | 2 |
| 5338488 | 124857 | 2 |
| 1207939 | 123299 | 2 |
| 1299062 | 104669 | 2 |
| 5291108 | 148316 | 3 |
| ... | ... | ... |
+-----------+----------+-------+
Bingo! I believe I've pinpointed the issue. It appears that the agency's data was indeed correct; however, the same hotel was sometimes listed multiple times. My next step is to validate all duplicates and verify if their count equates to the discrepancy of 3415.
First, I filter the groupedByUniqueHotelDf DataFrame to isolate only the duplicates. Then I subtract 1 from the count of each group to account for the first occurrence, which isn't considered a duplicate.
In [12]:
val duplicates = groupedByUniqueHotelDf
    .filter { it["count"] as Int > 1 }
    .add("onlyDuplicates") { it["count"] as Int - 1 }
duplicates
Out [12]:
+-----------+----------+-------+----------------+
| HotelCode | cityCode | count | onlyDuplicates |
+-----------+----------+-------+----------------+
| 1436268 | 104695 | 2 | 1 |
| 1436644 | 101609 | 2 | 1 |
| 1223545 | 113140 | 2 | 1 |
| 5256963 | 138878 | 2 | 1 |
| 5267044 | 132258 | 2 | 1 |
| 5168981 | 122542 | 2 | 1 |
| 5338488 | 124857 | 2 | 1 |
| 1207939 | 123299 | 2 | 1 |
| 1299062 | 104669 | 2 | 1 |
| 5291108 | 148316 | 3 | 2 |
| ... | ... | ... | ... |
+-----------+----------+-------+----------------+
Next, I'll sum the onlyDuplicates column to check if the total number of duplicate entries matches 3415.
In [13]:
duplicates.sum("onlyDuplicates")
Out [13]:
3415
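The check rests on a simple identity: summing (count - 1) over all groups equals the total number of rows minus the number of distinct keys. With the numbers above, that means the 344318 input rows contain 344318 - 3415 = 340903 distinct (HotelCode, cityCode) pairs, exactly the row count of the result file. A small stand-alone sketch of that reasoning follows; the HotelKey class and the six sample keys are invented for the example and are not the real dataset.

```kotlin
// Hypothetical composite key: (HotelCode, cityCode)
data class HotelKey(val hotelCode: Int, val cityCode: Int)

// Sums (count - 1) over every key that occurs more than once
fun surplusRows(keys: List<HotelKey>): Int =
    keys.groupingBy { it }.eachCount()
        .values
        .filter { it > 1 }
        .sumOf { it - 1 }

// Made-up sample: one key twice, one key three times, one key once
val sampleKeys = listOf(
    HotelKey(1, 10), HotelKey(1, 10),
    HotelKey(2, 20), HotelKey(2, 20), HotelKey(2, 20),
    HotelKey(3, 30),
)

val surplus = surplusRows(sampleKeys)

// The same number, computed as total rows minus distinct keys
val viaDistinct = sampleKeys.size - sampleKeys.distinct().size
```

If the two values ever disagree on real data, the grouping key is probably not the one the system uses to decide uniqueness.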
And indeed it does: the assumption was correct. I contacted the agency and informed them about the problem. They will investigate the issue and provide us with the correct data. But I am also sure that they will ask us for the list of those duplicates, so that they do not have to look for them manually. Sure, let's try to do that.
In [14]:
val duplicatedRawHotelsDf = inputFileDf
    .innerJoin(duplicates, "HotelCode", "cityCode")
    .sortBy("HotelCode", "cityCode")
duplicatedRawHotelsDf
Out [14]:
+------------+------------+----------+-------------------+-----------+--------------------------------+-------------+-----------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------+------------------------+----------------+---------+---------------------------------------------------------------------+-------+----------------+
| countyCode | countyName | cityCode | cityName | HotelCode | HotelName | HotelRating | Address | Attractions | FaxNumber | Map | PhoneNumber | PinCode | HotelWebsiteUrl | count | onlyDuplicates |
+------------+------------+----------+-------------------+-----------+--------------------------------+-------------+-----------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------+------------------------+----------------+---------+---------------------------------------------------------------------+-------+----------------+
| GR | Greece | 133257 | Portaria | 1000819 | Valeni Boutique Hotel & Spa | FourStar | Regional Road Volou Portarias 37011 Portaria | Distances are displayed to the nearest 0.1 mile and kilometer. <br /> <p>Theophilos Museum - 0.5 km / 0.3 mi <br /> Panagia Portarea Church - 1 km / 0.6 mi <br /> Path of the Centaurs - 1.4 km / 0.9 mi <br /> Museum of Folk Art and History of Pelion - 4.8 km / 3 mi <br /> Argonafton Promenade - 7.2 km / 4.5 mi <br /> Volos General Hospital - 7.4 km / 4.6 mi <br /> Volos Town Hall - 7.6 km / 4.7 mi <br /> Archaeological Museum of Volos - 7.7 km / 4.8 mi <br /> Volos Port - 7.8 km / 4.9 mi <br /> Canyoning Hellenic - 8 km / 5 mi <br /> Volos Municipal Theatre - 8.1 km / 5 mi <br /> Tsalapatas Brickworks Museum - 8.8 km / 5.5 mi <br /> Anavros - 10.6 km / 6.6 mi <br /> Alikes Beach - 13.5 km / 8.4 mi <br /> Pelion Ski Centre - 14.1 km / 8.7 mi <br /> </p><p>The preferred airport for Valeni Boutique Hotel is Volos (VOL) - 57.3 km / 35.6 mi </p> | 30-24280-90205 | 39.387431 | 22.992011 | +302428090206 | 37011 | http://www.valeni.gr/en | 2 | 1 |
| GR | Greece | 133257 | Portaria | 1000819 | Valeni Boutique Hotel & Spa | FourStar | Regional Road Volou Portarias 37011 Portaria | Distances are displayed to the nearest 0.1 mile and kilometer. <br /> <p>Theophilos Museum - 0.5 km / 0.3 mi <br /> Panagia Portarea Church - 1 km / 0.6 mi <br /> Path of the Centaurs - 1.4 km / 0.9 mi <br /> Museum of Folk Art and History of Pelion - 4.8 km / 3 mi <br /> Argonafton Promenade - 7.2 km / 4.5 mi <br /> Volos General Hospital - 7.4 km / 4.6 mi <br /> Volos Town Hall - 7.6 km / 4.7 mi <br /> Archaeological Museum of Volos - 7.7 km / 4.8 mi <br /> Volos Port - 7.8 km / 4.9 mi <br /> Canyoning Hellenic - 8 km / 5 mi <br /> Volos Municipal Theatre - 8.1 km / 5 mi <br /> Tsalapatas Brickworks Museum - 8.8 km / 5.5 mi <br /> Anavros - 10.6 km / 6.6 mi <br /> Alikes Beach - 13.5 km / 8.4 mi <br /> Pelion Ski Centre - 14.1 km / 8.7 mi <br /> </p><p>The preferred airport for Valeni Boutique Hotel is Volos (VOL) - 57.3 km / 35.6 mi </p> | 30-24280-90205 | 39.387431 | 22.992011 | +302428090206 | 37011 | http://www.valeni.gr/en | 2 | 1 |
| FR | France | 130867 | Palavas-les-Flots | 1000991 | Les Alizes | ThreeStar | 6 Boulevard Marechal Joffre 34250 Palavas ves Flots | null | +33467503619 | 43.52775 | 3.93228 | (33) 467680180 | 34250 | https://www.booking.com/hotel/fr/les-alizes-palavas-les-flots.html | 2 | 1 |
| FR | France | 130867 | Palavas-les-Flots | 1000991 | Les Alizes | ThreeStar | 6 Boulevard Marechal Joffre 34250 Palavas ves Flots | null | +33467503619 | 43.52775 | 3.93228 | (33) 467680180 | 34250 | https://www.booking.com/hotel/fr/les-alizes-palavas-les-flots.html | 2 | 1 |
| IT | Italy | 138355 | Sperlonga | 1001187 | Residence Costa Di Kair Ed Din | ThreeStar | Via Fontana della Camera 18 04029 Sperlonga | Distances are displayed to the nearest 0.1 mile and kilometer. <br /> <p>Spiaggia di Ponente - 1.3 km / 0.8 mi <br /> Torre Truglia - 1.5 km / 1 mi <br /> Sperlonga Port - 1.6 km / 1 mi <br /> Spiaggia di Levante - 1.6 km / 1 mi <br /> Gulf of Gaeta - 1.8 km / 1.1 mi <br /> National Archaeological Museum of Sperlonga - 2.5 km / 1.6 mi <br /> Villa di Tiberio - 2.7 km / 1.7 mi <br /> Bazzano - 3.4 km / 2.1 mi <br /> Capratica - 4.6 km / 2.9 mi <br /> Sant Agostino - 6.1 km / 3.8 mi <br /> San Vito - 8.8 km / 5.5 mi <br /> 300 Steps Beach - 9.5 km / 5.9 mi <br /> Arenauta Beach - 10.4 km / 6.5 mi <br /> Quaranta Remi - 12.7 km / 7.9 mi <br /> Ariana Beach - 12.7 km / 7.9 mi <br /> </p> | 39-771-548164 | 41.263233 | 13.444079 | 39-771-549634 | 04029 | http://www.costakair.it/ | 2 | 1 |
| IT | Italy | 138355 | Sperlonga | 1001187 | Residence Costa Di Kair Ed Din | ThreeStar | Via Fontana della Camera 18 04029 Sperlonga | Distances are displayed to the nearest 0.1 mile and kilometer. <br /> <p>Spiaggia di Ponente - 1.3 km / 0.8 mi <br /> Torre Truglia - 1.5 km / 1 mi <br /> Sperlonga Port - 1.6 km / 1 mi <br /> Spiaggia di Levante - 1.6 km / 1 mi <br /> Gulf of Gaeta - 1.8 km / 1.1 mi <br /> National Archaeological Museum of Sperlonga - 2.5 km / 1.6 mi <br /> Villa di Tiberio - 2.7 km / 1.7 mi <br /> Bazzano - 3.4 km / 2.1 mi <br /> Capratica - 4.6 km / 2.9 mi <br /> Sant Agostino - 6.1 km / 3.8 mi <br /> San Vito - 8.8 km / 5.5 mi <br /> 300 Steps Beach - 9.5 km / 5.9 mi <br /> Arenauta Beach - 10.4 km / 6.5 mi <br /> Quaranta Remi - 12.7 km / 7.9 mi <br /> Ariana Beach - 12.7 km / 7.9 mi <br /> </p> | 39-771-548164 | 41.263233 | 13.444079 | 39-771-549634 | 04029 | http://www.costakair.it/ | 2 | 1 |
| DE | Germany | 148643 | Weisenbach | 1001928 | Gasthaus zur Krone | All | Jakob-Bleyer-Strase 21 76599 Weisenbach | | null | 48.7221863 | 8.3579674 | +4972243140 | 76599 | https://www.booking.com/hotel/de/gasthaus-zur-krone-weisenbach.html | 2 | 1 |
| DE | Germany | 148643 | Weisenbach | 1001928 | Gasthaus zur Krone | All | Jakob-Bleyer-Strase 21 76599 Weisenbach | | null | 48.7221863 | 8.3579674 | +4972243140 | 76599 | https://www.booking.com/hotel/de/gasthaus-zur-krone-weisenbach.html | 2 | 1 |
| DE | Germany | 136928 | Scharbeutz | 1001957 | Bayside | FourStar | Strandallee 130a 23683 Scharbeutz | null | (49) 45036096101 | 54.02822 | 10.75645 | (49) 450360960 | 23683 | https://www.booking.com/hotel/de/bayside.html | 2 | 1 |
| DE | Germany | 136928 | Scharbeutz | 1001957 | Bayside | FourStar | Strandallee 130a 23683 Scharbeutz | null | (49) 45036096101 | 54.02822 | 10.75645 | (49) 450360960 | 23683 | https://www.booking.com/hotel/de/bayside.html | 2 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
+------------+------------+----------+-------------------+-----------+--------------------------------+-------------+-----------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------+------------------------+----------------+---------+---------------------------------------------------------------------+-------+----------------+
Now I'll save the duplicates to a file so I can send them to the OTA:
In [15]:
duplicatedRawHotelsDf.writeCSV("../data/duplicated_hotels.csv")
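If you don't have the DataFrame library at hand, the idea of collecting duplicated rows can be sketched in plain Kotlin. This is only an illustration: the `RawHotel` class and the `findDuplicates` function are hypothetical stand-ins for the CSV rows and are not part of my project.

```kotlin
// Hypothetical, simplified stand-in for one row of the input file.
data class RawHotel(val hotelCode: Int, val hotelName: String, val cityName: String)

// Returns every row that occurs more than once, keeping all of its copies.
fun findDuplicates(rows: List<RawHotel>): List<RawHotel> =
    rows.groupBy { it }               // data classes compare by value, so identical rows share a key
        .filterValues { it.size > 1 } // keep only the groups with more than one copy
        .values
        .flatten()

fun main() {
    val rows = listOf(
        RawHotel(1001957, "Bayside", "Scharbeutz"),
        RawHotel(1001957, "Bayside", "Scharbeutz"),
        RawHotel(1001928, "Gasthaus zur Krone", "Weisenbach"),
    )
    println(findDuplicates(rows).size) // prints 2
}
```

Both copies of a duplicated row are kept on purpose, which matches the way the duplicates show up twice in the output above.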
Leveraging Project Code
Addressing the duplicates issue brought up a noteworthy point: comparing two files of identical structure might initially seem simpler than extracting data from a database. Yet, this perspective shifts when you can leverage your existing project's code.
Kotlin notebooks allow for the integration of your project’s code by adding your repository to the list of available modules.
For this task, let's use FileHotelRepository from my project for data loading and comparison. This repository reads from files; setting up a database just for this post would have been overkill, but that's hardly a limitation. Whatever your code does in the project, it will also do in a Kotlin notebook. So if your project includes a PostgresHotelRepository, you could fetch data directly from your database, provided you pass the correct configuration. The key takeaway is that your project's capabilities extend into the notebook environment, so you can handle data as if you were working inside the project itself.
Disclaimer: If you're considering using Kotlin notebooks to test real-time changes in your repository, manage your expectations. Typically, changes made to a repository used as a module in a Kotlin notebook are not automatically reflected in the notebook, because the notebook captures the state of the module at the time of inclusion. To see the latest changes, you'll need to manually update the dependency reference in your notebook. That may mean specifying a new version of your module if you use versioned releases, or restarting the notebook's kernel so that it picks up the updated module. For a more efficient development cycle, consider moving the code under test directly into the notebook. Once you've confirmed it works, you can integrate it back into your project. This minimizes the overhead of frequently updating dependencies and makes the most of the notebook environment for rapid testing. Your project's code is a library in the notebook environment, so treat it accordingly.
Next, we'll demonstrate initializing a repository, loading data, and converting it into a DataFrame:
In [16]:
import hotels.manager.app.infrastructure.FileHotelRepository
val fileHotelRepository = FileHotelRepository("../data")
In [17]:
val loadedHotels = fileHotelRepository.getAll(12345)
val loadedHotelsDf = loadedHotels.toDataFrame()
loadedHotelsDf
Out [17]:
+--------+---------+----------+--------------+-----------+---------------------------+--------+
| locale | country | cityCode | city | hotelCode | hotelName | rating |
+--------+---------+----------+--------------+-----------+---------------------------+--------+
| BR | Brazil | 104695 | Abelardo Luz | 1436268 | Quedas Park Hotel | THREE |
| BR | Brazil | 101609 | Abre-Campo | 1436644 | Memorial Hotel Abre Campo | THREE |
| BR | Brazil | 101609 | Abre-Campo | 1892089 | Hotel Sao Jorge | ALL |
| BR | Brazil | 101611 | Acailandia | 1432637 | Hotel Golden Lis | THREE |
| BR | Brazil | 101611 | Acailandia | 1433597 | Hotel Lara's | THREE |
| BR | Brazil | 101611 | Acailandia | 1540808 | Vera Cruz Business Hotel | FOUR |
| BR | Brazil | 101611 | Acailandia | 1585468 | Genova Palace Hotel | ALL |
| BR | Brazil | 101611 | Acailandia | 1717926 | Vip Palace Hotel | ALL |
| BR | Brazil | 106068 | Acara | 1430684 | Hotel Acarau Riviera | THREE |
| BR | Brazil | 106068 | Acara | 1543389 | Castelo Encantado | ALL |
| ... | ... | ... | ... | ... | ... | ... |
+--------+---------+----------+--------------+-----------+---------------------------+--------+
And now, let's compare the counts by country, just as we did previously with files:
In [18]:
val loadedHotelsGroupedDf = loadedHotelsDf
    .groupBy { country }
    .count()
    .rename("count" to "resultCount")
val inputWithResultReworkedDf = groupedInputFileDf
    .rename("count" to "inputCount")
    .join(loadedHotelsGroupedDf) { it.countyName match right.country }
    .add("diff") { it["inputCount"] as Int - it["resultCount"] as Int }
inputWithResultReworkedDf
Out [18]:
+----------------+------------+-------------+------+
| countyName | inputCount | resultCount | diff |
+----------------+------------+-------------+------+
| Brazil | 29770 | 29467 | 303 |
| Croatia | 6967 | 6891 | 76 |
| Cyprus | 3238 | 3202 | 36 |
| France | 56803 | 56247 | 556 |
| Germany | 53601 | 53039 | 562 |
| Greece | 26632 | 26369 | 263 |
| Italy | 96923 | 95977 | 946 |
| Ukraine | 1556 | 1542 | 14 |
| United Kingdom | 68828 | 68169 | 659 |
+----------------+------------+-------------+------+
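The same comparison can be sketched without DataFrames, using only the Kotlin standard library. This is a minimal illustration under my own assumptions: `countDiffByCountry` is a hypothetical helper, and each hotel is reduced to its country name.

```kotlin
// Counts hotels per country in the input and in the loaded result,
// then returns the difference (input minus result) for each input country.
fun countDiffByCountry(input: List<String>, result: List<String>): Map<String, Int> {
    val inputCounts = input.groupingBy { it }.eachCount()
    val resultCounts = result.groupingBy { it }.eachCount()
    // Assumption for this sketch: a country missing from the result counts as zero.
    return inputCounts.mapValues { (country, count) -> count - (resultCounts[country] ?: 0) }
}

fun main() {
    val input = listOf("Brazil", "Brazil", "Brazil", "Cyprus", "Cyprus")
    val result = listOf("Brazil", "Brazil", "Cyprus", "Cyprus")
    println(countDiffByCountry(input, result)) // prints {Brazil=1, Cyprus=0}
}
```

A positive difference means the input file contained more rows for that country than ended up in the result, which is the pattern the duplicates produce in the table above.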
But doing the same old thing again felt redundant. So, I decided to focus on something more useful: saving the duplicates using our repository's save method. The catch? The duplicates had to be of the Hotel type to fit the function.
I gave it a go, trying to turn those duplicates into a list of Hotel objects to save them.
In [19]:
import hotels.manager.app.domain.Hotel
duplicatedRawHotelsDf.cast<Hotel>().toList()
Out [19]:
Can not find column `locale` in DataFrame
java.lang.IllegalStateException: Can not find column `locale` in DataFrame
at org.jetbrains.kotlinx.dataframe.impl.api.ToListKt.toListImpl(toList.kt:42)
at Line_46_jupyter.<init>(Line_46.jupyter.kts:5)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:499)
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:480)
at kotlin.script.experimental.jvm.BasicJvmScriptEvaluator.evalWithConfigAndOtherScriptsResults(BasicJvmScriptEvaluator.kt:122)
at kotlin.script.experimental.jvm.BasicJvmScriptEvaluator.invoke$suspendImpl(BasicJvmScriptEvaluator.kt:48)
at kotlin.script.experimental.jvm.BasicJvmScriptEvaluator.invoke(BasicJvmScriptEvaluator.kt)
at kotlin.script.experimental.jvm.BasicJvmReplEvaluator.eval(BasicJvmReplEvaluator.kt:49)
at org.jetbrains.kotlinx.jupyter.repl.impl.InternalEvaluatorImpl$eval$resultWithDiagnostics$1.invokeSuspend(InternalEvaluatorImpl.kt:127)
at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:104)
at kotlinx.coroutines.EventLoopImplBase.processNextEvent(EventLoop.common.kt:277)
at kotlinx.coroutines.BlockingCoroutine.joinBlocking(Builders.kt:95)
at kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking(Builders.kt:69)
at kotlinx.coroutines.BuildersKt.runBlocking(Unknown Source)
at kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking$default(Builders.kt:48)
at kotlinx.coroutines.BuildersKt.runBlocking$default(Unknown Source)
at org.jetbrains.kotlinx.jupyter.repl.impl.InternalEvaluatorImpl.eval(InternalEvaluatorImpl.kt:127)
at org.jetbrains.kotlinx.jupyter.repl.impl.CellExecutorImpl$execute$1$result$1.invoke(CellExecutorImpl.kt:80)
at org.jetbrains.kotlinx.jupyter.repl.impl.CellExecutorImpl$execute$1$result$1.invoke(CellExecutorImpl.kt:78)
at org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl.withHost(ReplForJupyterImpl.kt:711)
at org.jetbrains.kotlinx.jupyter.repl.impl.CellExecutorImpl.execute(CellExecutorImpl.kt:78)
at org.jetbrains.kotlinx.jupyter.repl.execution.CellExecutor$DefaultImpls.execute$default(CellExecutor.kt:12)
at org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl.evaluateUserCode(ReplForJupyterImpl.kt:534)
at org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl.access$evaluateUserCode(ReplForJupyterImpl.kt:128)
at org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl$evalEx$1.invoke(ReplForJupyterImpl.kt:420)
at org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl$evalEx$1.invoke(ReplForJupyterImpl.kt:417)
at org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl.withEvalContext(ReplForJupyterImpl.kt:398)
at org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl.evalEx(ReplForJupyterImpl.kt:417)
at org.jetbrains.kotlinx.jupyter.messaging.IdeCompatibleMessageRequestProcessor$processExecuteRequest$1$response$1$1.invoke(IdeCompatibleMessageRequestProcessor.kt:139)
at org.jetbrains.kotlinx.jupyter.messaging.IdeCompatibleMessageRequestProcessor$processExecuteRequest$1$response$1$1.invoke(IdeCompatibleMessageRequestProcessor.kt:138)
at org.jetbrains.kotlinx.jupyter.execution.JupyterExecutorImpl$Task.execute(JupyterExecutorImpl.kt:42)
at org.jetbrains.kotlinx.jupyter.execution.JupyterExecutorImpl$executorThread$1.invoke(JupyterExecutorImpl.kt:82)
at org.jetbrains.kotlinx.jupyter.execution.JupyterExecutorImpl$executorThread$1.invoke(JupyterExecutorImpl.kt:80)
at kotlin.concurrent.ThreadsKt$thread$thread$1.run(Thread.kt:30)
Turns out, you can't just cast a DataFrame to a list of Hotels, because the column names in the DataFrame don't match the property names of the Hotel class. Good news, though: the DataFrame library has a neat function for this conversion, convertTo, once you map out the names (yes, you will have to spend some time there).
In [20]:
import hotels.manager.app.domain.Hotel
import hotels.manager.app.domain.HotelRating
// Reads a column as a String and trims surrounding whitespace
fun <T> DataRow<T>.getStringValue(columnName: String) = (this[columnName] as String).trim()
// Fill each Hotel property from the differently named column of the raw file
val duplicatedHotelObjectsDf = duplicatedRawHotelsDf.convertTo<Hotel> {
    fill { col<String>("locale") }.with { it.getStringValue("countyCode") }
    fill { col<String>("country") }.with { it.getStringValue("countyName") }
    fill { col<Int>("cityCode") }.with { it["cityCode"] as Int }
    fill { col<String>("city") }.with { it.getStringValue("cityName") }
    fill { col<Int>("hotelCode") }.with { it["HotelCode"] as Int }
    fill { col<String>("hotelName") }.with { it.getStringValue("HotelName") }
    // Map the raw rating text to the enum; unknown values become null
    fill { col<HotelRating?>("rating") }.with { row ->
        when ((row["HotelRating"] as String).lowercase()) {
            "onestar" -> HotelRating.ONE
            "twostar" -> HotelRating.TWO
            "threestar" -> HotelRating.THREE
            "fourstar" -> HotelRating.FOUR
            "fivestar" -> HotelRating.FIVE
            "all" -> HotelRating.ALL
            else -> null
        }
    }
}
val duplicatedHotelList = duplicatedHotelObjectsDf.toList()
fileHotelRepository.save(111222, *duplicatedHotelList.toTypedArray())
Out [20]:
Done. Saved 6804 hotels for 111222 to a file.
And there you have it, saved right through the project's repository. It’s pretty cool how you can mix in this kind of one-off test code with your project without running the whole app. After seeing how it works, maybe you’ll be tempted to try it out too.
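One part of that conversion, the rating mapping, is easy to pull out into a plain function that you can try in a separate cell. The sketch below is self-contained, so it declares its own `HotelRating` enum as a hypothetical copy of the project's one, and `parseRating` is a name I made up for illustration. Unlike the cell above, it also trims the raw value.

```kotlin
// Hypothetical copy of the project's enum, limited to the values used in this post.
enum class HotelRating { ONE, TWO, THREE, FOUR, FIVE, ALL }

// Maps the raw file value (for example "FourStar") to the enum.
// Matching is case-insensitive; unknown values become null.
fun parseRating(raw: String): HotelRating? = when (raw.trim().lowercase()) {
    "onestar" -> HotelRating.ONE
    "twostar" -> HotelRating.TWO
    "threestar" -> HotelRating.THREE
    "fourstar" -> HotelRating.FOUR
    "fivestar" -> HotelRating.FIVE
    "all" -> HotelRating.ALL
    else -> null
}

fun main() {
    println(parseRating("FourStar")) // prints FOUR
    println(parseRating(" All "))    // prints ALL
    println(parseRating("unknown"))  // prints null
}
```

Keeping the mapping in its own function makes it simple to check the edge cases (unexpected casing, unknown values) before running the full conversion on thousands of rows.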
Chart Building
I'm not sure what your go-to method is for building charts when you need them, but I highly recommend giving Kandy a shot in Kotlin notebooks.
I used the same hotel data for the example. First, I load the hotels, group them by country, and count them:
In [1]:
%use kandy, dataframe
In [2]:
val hotelsDf = DataFrame.readCSV("../data/hotels_no_info_distinct.csv")
val hoteldGroupedByCountryDf = hotelsDf.groupBy { it["countyName"] }.count()
hoteldGroupedByCountryDf
Out [2]:
+----------------+-------+
| countyName | count |
+----------------+-------+
| Brazil | 29467 |
| Croatia | 6891 |
| Cyprus | 3202 |
| France | 56247 |
| Germany | 53039 |
| Greece | 26369 |
| Italy | 95977 |
| Ukraine | 1542 |
| United Kingdom | 68169 |
+----------------+-------+
Then, turning that data into a visual representation like this:
In [3]:
hoteldGroupedByCountryDf.plot {
    layout.title = "Hotels by country"
    bars {
        x("countyName") { axis.name = "Country" }
        y("count") { axis.name = "Number of hotels" }
        fillColor("countyName") { legend.name = "Country" }
    }
}
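Out [3]:
[Bar chart "Hotels by country": x-axis "Country", y-axis "Number of hotels", bars colored by country]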
This is a basic example, but Kandy can handle much more complex charting. And you're not limited to DataFrames either: Kotlin lists work perfectly well.
I could throw a bunch of examples at you, but Kandy has this covered with some great ones already. Instead of repeating what they've done, I suggest checking them out here.
Sharing Concepts
Last but certainly not least, sharing concepts with Kotlin notebooks is very straightforward. You can gather all your code alongside explanations of its purpose or the logic behind it. This is particularly handy for evaluating different solutions: simply lay them out side by side in the same notebook for easy comparison.
What I really appreciate is the ability to share these notebooks as GitLab snippets or GitHub gists, making them accessible and viewable online. It’s a great way to collaborate or get feedback on your work.
All the notebooks, along with the data and code used for this post, are available in this repository.
Conclusion
If you're already a Kotlin developer, I believe Kotlin notebooks are a tool that should definitely be in your toolbox. And if you're not, perhaps trying out Kotlin in notebooks might just convert you 🙂.
In this post, I've shared how I utilize Kotlin notebooks for writing scripts, analyzing files, building charts, and incorporating project code. That last part alone opens up a myriad of possibilities for experimentation and solution evaluation.
Initially, it was challenging for me to find applications for Kotlin notebooks since I'm neither a data analyst nor a data scientist by trade. But now, it's become second nature. If you've discovered other ways to optimize your work with Kotlin notebooks, I'm eager to hear about them!