DEV Community: Alexey Tukalo

How I've created my personal AI-astrologist

Alexey Tukalo — Tue, 27 May 2025 16:36:51 +0000

Alex: I am planning date with my girlfriend. Could you give any advice?
AstroGPT: To sprinkle some starry magic on your plans, I need to know your girlfriend's zodiac sign or birth date. This will help me craft the perfect romantic advice just for you two! ✨
Alex: She is an Aries.
AstroGPT: For your date with your adventurous Aries girlfriend, I suggest planning something that combines excitement and spontaneity! How about a surprise picnic in a scenic spot or an exhilarating outdoor activity? This will definitely spark that fiery Aries enthusiasm! Enjoy the cosmic vibes of your date! 🌟✨

Telegram bots are among the lowest-friction apps you can build today. Type /start and the bot is ready - no extra install, no sign-up forms. This zero-friction setup, together with Telegram’s rich features, makes bots handy for both users and developers.

Bots can handle many tasks: quick support chats for shops and clinics, daily weather or news updates, games, finance trackers, and almost anything else you can pack into a chat. Some bots are private corporate tools, while others are public and serve thousands of users every day.

How a bot really works

Put simply, a Telegram bot is a web service that listens to the Bot API. When a user sends a message:

Telegram posts the update to the bot’s server.
The server processes the input and replies to Telegram.
Telegram delivers the response to the user.
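
The three steps above can be sketched as a single function that takes the update Telegram posts and returns the reply. This is a minimal sketch, not the code from my bot: `handle_update` and the reply texts are hypothetical, and it assumes the bot runs in webhook mode, where the Bot API accepts a JSON body with a `method` field as the webhook response.

```python
from typing import Optional


def handle_update(update: dict) -> Optional[dict]:
    """Turn an incoming Telegram update into a reply payload.

    Returns None for updates that carry no text message.
    """
    message = update.get("message") or {}
    text = message.get("text")
    chat_id = message.get("chat", {}).get("id")
    if text is None or chat_id is None:
        return None

    if text.strip() == "/start":
        reply = "Hi! I am ready."
    else:
        reply = f"You said: {text}"

    # A JSON body with a "method" field lets the webhook response
    # itself act as a Bot API call (here: sendMessage).
    return {"method": "sendMessage", "chat_id": chat_id, "text": reply}
```

Any web framework can wrap this function: decode the request body into a dict, call `handle_update`, and send the result back as JSON.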

In this post, I'll walk you through the main parts of a typical Telegram bot backend, using my latest project as a reference, and show how large language models (LLMs) can help build better chatbots.

Bot business logic

Classic Telegram bots rely on strict string matching:

A predefined list of commands maps to handler functions.
Regular expressions extract structured data such as dates, names, or email addresses from user input.

Bots can also add on-screen buttons to simplify input, but users still have to send the exact text: a single wrong character can break the logic.
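
To show how brittle strict matching is, here is a small sketch of the classic approach: a command table plus one regular expression. The handler names, reply texts, and the `YYYY-MM-DD` date format are assumptions made for illustration.

```python
import re
from typing import Callable, Dict

# ISO-style dates such as 1990-04-12; other formats would need more patterns.
DATE_RE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def start(_: str) -> str:
    return "Welcome! Send your birth date as YYYY-MM-DD."


def help_(_: str) -> str:
    return "Commands: /start, /help"


# Predefined commands mapped to handler functions.
COMMANDS: Dict[str, Callable[[str], str]] = {"/start": start, "/help": help_}


def handle_text(text: str) -> str:
    # Exact match on the first token; everything else falls through to regex.
    parts = text.split(maxsplit=1)
    command = parts[0] if parts else ""
    if command in COMMANDS:
        return COMMANDS[command](text)
    match = DATE_RE.search(text)
    if match:
        year, month, day = match.groups()
        return f"Got it: {year}-{month}-{day}"
    return "Sorry, I did not understand that."
```

A typo such as `/strat`, or a date written as "12 April 1990", ends up in the fallback branch, which is exactly the weakness described above.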

Large bots are usually split into small, separate “dialog flows” connected to a central router. Each flow is a mini state machine: it knows the user’s current step, what data has been collected, and which messages or commands are valid next. The router inspects every update (text, button press, or command), chooses the right dialog-flow and state, and forwards the update to the correct handler. State data typically lives in a database such as Redis, MongoDB, or SQL.
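
A router with per-user state can be sketched in a few lines. This is an illustration under assumptions, not a real library API: the `Session` class, the flow and step names, and the in-memory `SESSIONS` dict (standing in for Redis, MongoDB, or SQL) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class Session:
    flow: str = "registration"
    step: str = "ask_birthday"
    data: Dict[str, str] = field(default_factory=dict)


# In-memory stand-in for a real state store.
SESSIONS: Dict[int, Session] = {}


def ask_birthday(session: Session, text: str) -> str:
    session.data["birthday"] = text
    session.step = "ask_consent"
    return "Thanks! Do you agree to receive daily messages? (yes/no)"


def ask_consent(session: Session, text: str) -> str:
    if text.lower() not in ("yes", "no"):
        return "Please answer yes or no."
    session.data["consent"] = text.lower()
    session.flow, session.step = "main", "idle"
    return "Registration complete."


def idle(session: Session, text: str) -> str:
    return "You are registered."


# Each (flow, step) pair maps to the handler that is valid in that state.
HANDLERS: Dict[Tuple[str, str], Callable[[Session, str], str]] = {
    ("registration", "ask_birthday"): ask_birthday,
    ("registration", "ask_consent"): ask_consent,
    ("main", "idle"): idle,
}


def route(user_id: int, text: str) -> str:
    session = SESSIONS.setdefault(user_id, Session())
    return HANDLERS[(session.flow, session.step)](session, text)
```

Every new dialog adds more states and more entries to the handler table, which is why large bots of this kind grow quickly.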

Improve input with LLMs

In the past, you had to train a custom NLP model for intent recognition. Now large language models (LLMs) can infer intent directly from free-form text with a simple prompt. An LLM can parse a message and return structured data in JSON, making it easy to interpret free-form user input.

My sandbox project: AstroGPT

While experimenting with LLM-powered apps, I needed a safe way to practise building production-ready Python applications. I decided to build a (pseudo)astrology Telegram bot. It’s simple (and honestly pretty useless) but perfect for testing architecture ideas.

The bot has only two dialogs:

Registration – collects birth date and consent.
Main chat – lets the user update their data or ask, “What do the stars say today?”

After a user finishes registration and subscribes, the bot uses an LLM to generate a short horoscope and sends it automatically.

Initial implementation

Because I wanted human-like free-form interactions, the usual Telegram libraries that map input to handlers didn’t work. I implemented the flow myself with plain if-else logic. The diagram below shows the initial dialog flowchart.

The Is Birthday Input step uses an LLM to extract the user’s birth date from a message. The birthday parser has two parts: a parsing prompt and a schema that describes the output JSON.

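# Assumed imports; older LangChain versions expose these via langchain_core.pydantic_v1
from typing import Optional

from pydantic import BaseModel, Field


class BirthdayValidation(BaseModel):
    birthday_text: str = Field(description="The birthday of the user in text format")
    # default=None keeps the field optional under Pydantic v2 as well
    extraction_error: Optional[str] = Field(
        default=None,
        description="Message to display if the birthday could not be extracted",
    )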

parse_date_of_birth_prompt = """
<instruction>
Please, analyze the following text inside of text_input tag and extract birthday information in text format.

Format output according to instructions in output_formatting_guidelines tag.
</instruction>

<text_input>
{user_input}
</text_input>

<output_formatting_guidelines>
Format your reply according to {format_instructions}
If you do not find readable birthday information, provide a message in extraction_error that asks the user to enter the birthday in a readable format.
</output_formatting_guidelines>
"""

The schema is defined with LangChain’s Pydantic extension. Each field gets a name, type, and short description. LangChain turns this schema into formatting instructions, injected into the prompt as {format_instructions}. The raw {user_input} is added as well. I wrap both parts in simple XML-style tags so my instructions stay clearly separated from the user’s text.
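
To make the mechanics concrete, here is a stdlib-only sketch of what happens around the model call: the template is filled in, and the JSON reply is read back into the two schema fields. `FORMAT_INSTRUCTIONS`, `build_prompt`, and `parse_reply` are hypothetical stand-ins for the work LangChain's parser does, the template is shortened, and the model call itself is left out.

```python
import json
from typing import Optional, Tuple

PROMPT_TEMPLATE = """<text_input>
{user_input}
</text_input>

<output_formatting_guidelines>
Format your reply according to {format_instructions}
</output_formatting_guidelines>"""

# Hand-written stand-in for the instructions LangChain derives from the schema.
FORMAT_INSTRUCTIONS = (
    'a JSON object with the keys "birthday_text" (string) '
    'and "extraction_error" (string or null)'
)


def build_prompt(user_input: str) -> str:
    # The user's text only ever lands inside the text_input tag.
    return PROMPT_TEMPLATE.format(
        user_input=user_input, format_instructions=FORMAT_INSTRUCTIONS
    )


def parse_reply(raw: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (birthday_text, extraction_error) from the model's JSON reply."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None, "The reply was not valid JSON."
    if not isinstance(data, dict):
        return None, "The reply was not a JSON object."
    return data.get("birthday_text"), data.get("extraction_error")
```

In the real bot, the Pydantic model does the validation, so a malformed reply surfaces as a parsing error instead of a silent `None`.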

The next diagram shows the logic of the feature-complete application.

More advanced prompts

While testing parsing prompts, I realised I could merge extraction of several values into one request and cut the total number of LLM calls. In normal conversation, users often provide all the details in a single message anyway. This change both simplifies the code and makes the chatbot feel more natural.

After some experiments I settled on the following JSON schema and prompt:

class CollectDataParser(BaseModel):
    user_birthday_text: Optional[str] = Field(
        description="The birthday of the user in text format. It may be free-form, but it should be a reasonable date of birth."
    )

    language: Optional[str] = Field(
        description="Desired language option of the user. Do not consider language already available in user information. Convert it to single English word like in list of supported languages"
    )

    is_data_about_user: bool = Field(
        description="Based on input and context of previous conversation, is the data about the user or about someone else?"
    )

    parsing_feedback: str = Field(
        description="Explanation of the data extraction result, with clarification of errors if needed"
    )

collect_data_prompt = """
<instruction>
You are a part of the logic of a personal astrologist bot.
You should collect information from the user to provide an astrological forecast.
You have to be completely sure that it is information about the user, since the user can also mention other people in the conversation.

You do not support any other actions and do not engage in conversation beyond collecting the necessary information.
Provide the user with kind and clear feedback on any issues with the data provided.
Ask the user for input if some information is missing or unclear.
If the user asks for something beyond the data collection, just ignore it.

previous_conversation tag contains log of previous conversation with user.

user_information tag contains user information. 
Do not extract information if it exactly matches the information already available in user_information.

last_user_input tag contains last user input.

output_formatting_guidelines tag contains instructions on how to format your reply.
</instruction>

<user_information>
Name: {user_name}
Birthday: {user_birthday}
Subscription: {user_subscription}
Language: {user_language}
</user_information>

<previous_conversation>
{previous_conversation}
</previous_conversation>

<last_user_input>
{user_input}
</last_user_input>

<output_formatting_guidelines>
Please, extract user information from the user_input.
Provide parsing feedback if any issues with the data extraction.
Do not extract information if it exactly matches the information already available in user_information.

Format your reply according to {format_instructions}
</output_formatting_guidelines>
"""

Leveraging LLM reasoning

As the number of dialogs grew, the application logic expanded quickly and adding new features became hard. The user experience still felt robotic—more like an upgraded CLI than a chat.

To cut down the prompt count, I let the LLM handle the reasoning instead of using deep if-else blocks. The change worked surprisingly well: it simplified the design and made conversations flow more naturally. After a few iterations, I settled on the approach below.

Step 1

The “menu” prompt receives the latest user message, recent chat history, user data, a list of available actions, and a summary of actions already taken for this input. The LLM just has to pick the next action.

class MenuActions(str, Enum):
    update_user_data = "update_user_data"
    subscribe = "subscribe"
    unsubscribe = "unsubscribe"
    request_input_clarification = "request_input_clarification"
    reply_to_user_with_result = "reply_to_user_with_result"
    ask_for_missing_user_data = "ask_for_missing_user_data"
    get_situational_advice = "get_situational_advice"
    get_joke_about_astrology = "get_joke_about_astrology"
    get_daily_forecast = "get_daily_forecast"


class SelectedAction(BaseModel):
    selected_action: MenuActions = Field(
        description="Decision for future action to take"
    )
    context_for_action_execution: str = Field(
        description="Action will be executed by specialized handler, provide just brief context necessary to execute it properly."
    )
    message_to_user: Optional[str] = Field(
        description="Only provide a message to the user if it is a terminal action that implies communication with the user, otherwise leave it empty."
    )

    def __str__(self) -> str:
        return f"Action: {self.selected_action}, Description: {self.context_for_action_execution} {self.message_to_user if self.message_to_user else ''}"


menu_prompt = """
<instruction>
You are a part of logic in personal astrologist bot. 
You should handle a user input and make a decision on the next action to take.
You should not act on the decision, but provide a decision for the next action to take.
You can provide details to help future actions with more context.

You support following actions:

- Update user data, even if it is already provided e.g. name, birthday, language (update_user_data action), keep in mind you should be sure that the data belongs to the user, it should not be data of other people mentioned in the conversation.
- Subscribe to daily forecast, daily forecast is only available by subscription
- Unsubscribe from daily forecast
- Ask user for input if any action failed
- Ask user for input if any request is unclear
- Send message to user if user goal is achieved
- Supply user with an information regarding service you can provide
- You can also provide information about users state if it is requested
- Provide the user with short astrological advice if the user asks about things like mood, outfit, love, work, food choices, etc. (get_situational_advice action)
- Give user a short clarification of the latest daily forecast regarding certain situation or particular details (get_situational_advice action) 
- If user specifically ask forecast for today (call get_daily_forecast to leverage specific logic)
- Joke about astrology (finish conversation if joke is already generated)

You do not support any other actions. 
Provide the user with kind and clear feedback on the services you can provide.
Clarify if user wants general daily forecast or specific advice on particular aspect of life.

processing_steps tag contains log of actions already taken during handling of current user input.
Information in the steps is not communicated to user, it is internal log of actions taken during handling of user input.
Reply user as soon as you have enough information to fulfill user request or if you need more information from user.
Ask user for input if any step in processing_steps failed.
Finish the conversation immediately if you have enough information to reply to user.
Do not select the same action if it was successfully executed.

previous_conversation tag contains log of previous conversation with user.

user_information tag contains user information. 
One of your goals is to collect missing information in user_information.
Do not ask user for information already available in user_information.

last_user_input tag contains last user input.

output_formatting_guidelines tag contains instructions on how to format your reply.
</instruction>

<processing_steps>
{processing_steps}
</processing_steps>

<previous_conversation>
{previous_conversation}
</previous_conversation>

<user_information>
Name: {user_name}
Birthday: {user_birthday}
Subscription: {user_subscription}
Language: {user_language}
</user_information>

<last_user_input>
{user_input}
</last_user_input>

<output_formatting_guidelines>
Please, provide a decision for future action to take and necessary context for future action.
Format your reply according to {format_instructions}
</output_formatting_guidelines>
"""

Step 2

The chosen action is executed. It might call another LLM, hit a third-party API, or write to the database. When it finishes, it appends a description of the execution result as an entry to the log of executed actions.

class AdviceResult(str, Enum):
    advice = "advice"
    clarification = "clarification"


class AdviceParser(BaseModel):
    advice_result: AdviceResult = Field(
        description="If you have enough information to provide advice, select advice. If you need more information from the user, select clarification"
    )
    result_description: Optional[str] = Field(
        description="Description of astrological advice or clarification for the user. Do not provide advice if anything is unclear. Do not ask too many questions, you are an astrological bot, stars should guide your decisions, so improvise more!"
    )

    situation: Optional[str] = Field(
        description="Description of the situation for which the advice is given"
    )


advice_parser = PydanticOutputParser(pydantic_object=AdviceParser)


advice_prompt = """
<instruction>
You are a personal astrologist bot.
You should provide the user with astrological advice based on the situation provided, or clarify the latest daily forecast.
Mention zodiacal signs of other people involved and provide user with a clear recommendation in relation to the situation provided and signs mentioned.
Be more descriptive in your advice, feel free to suggest exact activities or actions to take.
Tell user advice considering the situation provided and zodiacal signs of user and potentially other people involved in the situation.
Do not repeat daily forecast exactly, but provide clarification if needed.

Ask the user for more details on the situation if it helps to provide more accurate advice.
Ask zodiacs or birth dates of people involved in the situation.

Provide user with kind and clear recommendation.
Ask user for input if some information is missing or unclear.
Keep it short, concise and consistent with context of previous advices, forecast and messages.

previous_conversation tag contains log of previous conversation with user.
user_information tag contains user information.
previous_advice tag contains log of previous advices provided to user.
previous_forecast tag contains log of previous daily forecasts provided to user. 

last_user_input tag contains last user input.

output_formatting_guidelines tag contains instructions on how to format your reply.
</instruction>

<user_information>
Name: {user_name}
Birthday: {user_birthday}
Subscription: {user_subscription}
Language: {user_language}
</user_information>

<previous_conversation>
{previous_conversation}
</previous_conversation>

<last_user_input>
{user_input}
</last_user_input>

<previous_forecast>
{previous_forecast}
</previous_forecast>

<previous_advice>
{previous_advice}
</previous_advice>

<output_formatting_guidelines>
It should be a short and concise advice. Max 2 sentences.
Format your reply according to {format_instructions}
</output_formatting_guidelines>
"""

Step 3

Steps 1 and 2 repeat until a threshold is reached or the “menu” prompt selects reply_to_user_with_result. A final prompt then processes the full action log, user data, and the original input to generate the reply, which is sent back to the user.

reply_user_prompt = """
<instruction>
You are a personal astrologist bot, your name is Zaira and you use cosmic energy to help people with their problems.
You should reply to the user's latest message, summarizing the actions that were taken.
Please be playful, witty and charming.
</instruction>

<user_information>
User name: {user_name}
</user_information>

<previous_conversation>
{previous_conversation}
</previous_conversation>

<last_user_input>
{user_input}
</last_user_input>

<actions_taken>
{actions_taken}
</actions_taken>

<output_formatting_guidelines>
Reply with a message that summarizes actions taken and results of those actions.
Omit intermediate steps and provide only final results.
Do not explain the actions taken, just provide the feedback in a clear and concise conversational manner.
Do not greet the user, since the message is a part of an ongoing conversation.
You should not provide user with any additional information or forecast beyond the actions taken.
Please, be brief, friendly, clear, supportive and use chat style language, avoid mentioning details of decision making process.
Never mention anything about Natal Charts, they are not supported by the bot.
Avoid usage of characters and brand names which could be subject to copyright.
Consider messages from previous_conversation tag in your reply, but do not repeat information from them again in your reply.
</output_formatting_guidelines>
"""

Step 0

Before all business-logic steps, I added a safety step that spots abnormal behaviour. A prompt reviews recent messages and returns three fields:

A confidence score
A warning type
A warning explanation

The score lets the main logic tune the detector’s sensitivity. Each warning type can trigger different follow-up actions, and repeated warnings can make the bot block certain features to prevent abuse. The explanation text is used when composing a reply to the user.

class UnintendedBehaviorDetector(BaseModel):
    confidence: float = Field(
        description="Level of warning to be issued to the user. 0.0 means no warning, 1.0 means clear evil intent. Increase confidence for repeating behavior."
    )
    warning: Optional[WarningType] = Field(description="Type of warning to be issued")
    warning_explanation: Optional[str] = Field(
        description="Explanation of the warning to be issued"
    )

    def __str__(self) -> str:
        return f"Warning: {self.warning}, Confidence: {self.confidence} Explanation: {self.warning_explanation} "
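
How the main logic might act on these three fields can be sketched as follows. This is only an illustration: the `Detection` dataclass stands in for the Pydantic model above, and `CONFIDENCE_THRESHOLD`, `MAX_WARNINGS`, and the in-memory counter are assumed values, not the ones used in the bot.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

CONFIDENCE_THRESHOLD = 0.7  # assumed sensitivity setting
MAX_WARNINGS = 3  # assumed number of warnings before blocking


@dataclass
class Detection:
    confidence: float
    warning: Optional[str] = None
    warning_explanation: Optional[str] = None


# Warnings per user; a real bot would keep this in its database.
warning_counts: Counter = Counter()


def check(user_id: int, detection: Detection) -> str:
    """Return "ok", "warn", or "block" for the current message."""
    if detection.warning is None or detection.confidence < CONFIDENCE_THRESHOLD:
        return "ok"
    warning_counts[user_id] += 1
    if warning_counts[user_id] >= MAX_WARNINGS:
        return "block"
    return "warn"
```

Lowering `CONFIDENCE_THRESHOLD` makes the detector stricter, and a per-warning-type lookup could replace the single counter if different warnings should trigger different follow-ups.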

Recap

Earlier, when we covered classic Telegram bots, we looked at the MVP flowchart. The diagram below shows the version based on LLM-reasoning, which adds features the MVP lacked. The new bot is more fluent: it answers general questions, cracks jokes, and even offers pseudo-astrological advice.

Despite these extras, the full flowchart is only about the size of the single “birthday input” branch in the old if-else design. The LLM-reasoning works surprisingly well—the bot handles complex queries and keeps the dialog human-like. Moreover, adding a feature means just writing one small action and tweaking the main menu prompt.
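
The loop across Steps 1 to 3 can be sketched as one function. This is a simplified illustration rather than the code from the repo: `run_turn`, the callback signatures, and `MAX_STEPS` are assumptions, and the LLM calls are passed in as plain functions so the control flow stays visible.

```python
from typing import Callable, Dict, List, Tuple

MAX_STEPS = 5  # assumed threshold; the real value is not stated in this post

# Each action returns a one-line description of its result for the log.
Action = Callable[[str], str]


def run_turn(
    user_input: str,
    choose_action: Callable[[str, List[str]], Tuple[str, str]],
    actions: Dict[str, Action],
    compose_reply: Callable[[str, List[str]], str],
) -> str:
    processing_steps: List[str] = []
    for _ in range(MAX_STEPS):
        # Step 1: the menu prompt picks the next action and its context.
        name, context = choose_action(user_input, processing_steps)
        if name == "reply_to_user_with_result":
            break
        # Step 2: run the action and log the outcome.
        try:
            result = actions[name](context)
        except Exception as error:  # failures are logged, not raised
            result = f"failed: {error}"
        processing_steps.append(f"{name}: {result}")
    # Step 3: the final prompt turns the log into a user-facing message.
    return compose_reply(user_input, processing_steps)
```

With this shape, a new feature is one more entry in the `actions` dict plus a line in the menu prompt.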

🚀 Code: GitHub repo

Developing a CLI Tool to Evaluate Code Cognitive Complexity with LLMs (AI)

Alexey Tukalo — Tue, 15 Apr 2025 05:21:03 +0000

A while ago, a friend and I were discussing code cognitive complexity and maintainability. My friend wished for a tool that could automatically evaluate whether a piece of code is hard to maintain. I wasn’t sure this was even possible, since maintainability is notoriously hard to quantify programmatically.

But LLMs can understand and generate human-like text or even code, and I wondered if that same capability could be applied to interpreting and evaluating code quality, going beyond what traditional static analysis tools can do.

That thought led to the tool I eventually built. It’s now available on PyPI, and I believe it could be a valuable addition to any CI pipeline.

In a previous post, I shared some early thoughts on maintainability and cognitive complexity of code that emerged while working on this tool. In this post, I’d like to go deeper and walk through the development process, using my CLI tool as a case study for building an LLM-based application.

🧠 Leveraging LangChain for LLM-Based Applications

To bring this idea to life, I used LangChain, a Python library for building LLM-powered applications. LangChain abstracts away the APIs of specific language models, so I could focus entirely on building the core functionality without worrying about the details of how to communicate with an LLM.

One of its most useful features is the seamless integration with Pydantic. I defined the output schemas as Pydantic models, and LangChain automatically generated prompts that guided the LLM to return structured responses. Without this, I would have spent far more time on parsing and error handling of LLM output.

💡 Attempt 1: Overall Design and Estimation with a Simple Prompt

With the basic setup ready, I began designing the metric. The idea was straightforward:

Ask the LLM to evaluate each function individually and assign a cognitive complexity score from 1 to 5, where higher scores indicate code that’s harder to understand and maintain.

The prompt included both formatting instructions (generated via LangChain) and an explanation of grading criteria. I experimented with different ways of phrasing the grading scale to make the results more stable and meaningful.

To refine the evaluation, I added a few extra fields:

A is_setup_or_declaration flag to identify and skip boilerplate code (like config or constant declarations).
start_line_number and end_line_number to estimate the size of each function.

At first, I tried using a single function_length field, but estimating start and end lines separately produced more reliable results.

class CodeComplexityEvaluation(BaseModel):
    function_name: str = Field(description="Name of the function")

    is_setup_or_declaration: bool = Field(
        description="The code is part of setup or declaration boilerplate, such as defining constants or configuring a framework."
    )

    start_line_number: int = Field(
        description="Number of the first line of the function, considering existing formatting"
    )

    end_line_number: int = Field(
        description="Number of the last line of the function, considering existing formatting"
    )

    complexity_score: float = Field(
        description=(
            "Overall code complexity on a scale from 0 to 5, as discussed in the article "
            "'Simplifying Complex Code with Advanced Programming Approaches.'\n\n"
            "Interpretation:\n"
            "0 - 1: Very low complexity. The code is straightforward, easy to read, and requires minimal domain or technical knowledge.\n"
            "2 - 3: Moderate complexity. The code may use some advanced techniques or domain knowledge, but remains relatively approachable.\n"
            "4: High complexity. The code relies on multiple advanced concepts, intricate domain logic, or specialized optimizations.\n"
            "5: Extremely high complexity. The code likely combines various advanced paradigms, deep domain knowledge, and complex abstractions, "
            "making it very challenging to understand or maintain."
        )
    )

The code was evaluated file by file. For each file, the LLM assessed maintainability of every function individually. Then, the overall file score was calculated as a length-weighted average of the complexity score for each function.

The same approach was applied at the project level: the total score for the entire codebase was computed as a weighted average across all files, again considering size of each file to better reflect its impact on overall maintainability.
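
The length-weighted average is easy to show with numbers. The sketch below uses made-up scores and line ranges; `weighted_score` and `function_length` are illustrative helpers, not functions from the tool.

```python
from typing import Iterable, Tuple


def weighted_score(items: Iterable[Tuple[float, int]]) -> float:
    """Average of scores weighted by length; items are (score, length) pairs."""
    pairs = list(items)
    total_length = sum(length for _, length in pairs)
    if total_length == 0:
        return 0.0
    return sum(score * length for score, length in pairs) / total_length


def function_length(start_line_number: int, end_line_number: int) -> int:
    # Both line numbers are inclusive.
    return end_line_number - start_line_number + 1


# File score from three functions: (complexity_score, length in lines).
functions = [
    (1.0, function_length(1, 10)),   # 10 lines
    (4.0, function_length(12, 41)),  # 30 lines
    (2.0, function_length(43, 52)),  # 10 lines
]
file_score = weighted_score(functions)  # (10 + 120 + 20) / 50 = 3.0
```

The same helper works one level up: pass `(file_score, file_length)` pairs to get the project score. The long, complex function dominates the result, which is the intended effect of weighting by size.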

✅ Initial Results and Observations

This approach yielded some promising early results. The LLM was able to evaluate functions and assign scores that often aligned with my own assessments.

However, I noticed the results weren’t fully consistent. Scores varied by 5–15% between runs, likely due to the stochastic nature of LLMs and their sensitivity to slight changes in input or internal randomness.

In the context of a CI pipeline, this kind of inconsistency is a problem. CI tools need reliable metrics to determine whether code meets quality standards. An unstable score makes it hard to track whether code is actually improving or degrading over time.

📖 Attempt 2: Enhancing Accuracy with Explanations

To address the inconsistency, my next step was to enhance the prompt by explicitly asking the LLM to explain WHY it assigned a particular score.

From previous experiments, I had noticed that when an LLM is prompted to provide reasoning, it tends to produce more thoughtful and consistent responses. I suspect this happens because generating an explanation forces the model to “think through” its decision, leading to better alignment between the score and the reasoning behind it.

These explanations also helped me spot recurring patterns or biases in the model’s behavior. In some cases, I could identify where the prompt needed fine-tuning or where the LLM misunderstood me. This kind of prompt iteration is an essential part of building robust LLM-based applications.

While this adjustment did lead to slightly more stable scores, the improvement wasn’t enough. Variability was reduced, but not to a level I felt comfortable using in a CI pipeline. I needed a more robust solution.

📊 Attempt 3: Incorporating Multiple Metrics for Robustness

I realized that relying on a single score to capture code complexity was too limiting and too fragile. So I shifted toward evaluating multiple metrics, each representing a different aspect of cognitive complexity.

The idea was simple: by breaking complexity into several dimensions and scoring each separately, I expected to create a more stable composite score. If one metric fluctuated slightly due to randomness, the others could help balance it out, leading to a more reliable overall result.

Introducing multiple metrics also opened the door to a new approach:

I treated each metric as a probability between 0 and 1. This standard scale made it easier for the LLM to reason about each factor, as it aligned with common patterns found in prompts and training data.

It also gave me more flexibility when interpreting the results. Instead of making a strict yes/no decision on whether a factor was present, I could adjust thresholds, playing around with sensitivity and accuracy.

class CodeComplexityConfidenceEvaluation(BaseModel):
    function_name: str = Field(description="Name of the function")

    is_setup_or_declaration: bool = Field(
        description="The code is part of setup or declaration boilerplate, such as defining constants or configuring a framework."
    )

    start_line_number: int = Field(
        description="Number of the first line of the function, considering existing formatting"
    )

    end_line_number: int = Field(
        description="Number of the last line of the function, considering existing formatting"
    )

    use_of_advanced_algorithms: float = Field(
        description="Use of advanced algorithms requiring domain-specific knowledge. 0 means no such algorithms, 1 means heavily reliant on them."
    )
    low_level_optimizations: float = Field(
        description="Low-level optimizations that require deep knowledge of hardware or language internals."
    )
    complex_third_party_libraries: float = Field(
        description="Use of complex third-party libraries (e.g., Rx.js, Pandas, TensorFlow)."
    )
    business_logic_domain_expertise: float = Field(
        description="Business logic requiring domain-specific expertise."
    )
    advanced_coding_techniques: float = Field(
        description="Use of advanced coding techniques (e.g., functional programming)."
    )
    excessive_mutable_state: float = Field(
        description="Excessive reliance on mutable state. 0 means purely immutable or minimal state, 1 means heavy reliance on mutable data."
    )
    deeply_nested_control_structures: float = Field(
        description="Deeply nested control structures (more than 3 levels)."
    )
    long_classes: float = Field(
        description="Long classes (over 200 lines). 0 means no long classes, 1 means code is dominated by extremely large classes."
    )
    long_functions: float = Field(
        description="Long functions (over 100 lines). 0 means short functions, 1 means extremely long, monolithic functions."
    )
    parallelism_and_concurrency: float = Field(
        description="Usage of parallelism or concurrency patterns (threads, async, futures, etc.)."
    )
    recursion: float = Field(
        description="Usage of recursive functions or algorithms."
    )
    global_variables: float = Field(
        description="Use of global variables."
    )
    magic_numbers: float = Field(
        description="Magic numbers (unexplained constants) that reduce readability."
    )
    long_lists_of_positional_parameters: float = Field(
        description="Functions with a large number of positional parameters."
    )
    advanced_language_features: float = Field(
        description="Use of advanced language features (e.g., metaprogramming, reflection)."
    )
    inconsistent_indentation_or_formatting: float = Field(
        description="Poorly formatted code, inconsistent indentation, or misaligned braces."
    )
    long_monolithic_blocks_of_code: float = Field(
        description="Large uninterrupted blocks of code lacking clear separation."
    )
    non_descriptive_variable_function_names: float = Field(
        description="Non-descriptive or misleading names for variables, functions, or classes."
    )
    excessive_branching: float = Field(
        description="Frequent or complicated branching (if/else, switch), making logic harder to follow."
    )
    inconsistent_error_handling: float = Field(
        description="Multiple, inconsistent ways of handling errors throughout the code."
    )
    complex_boolean_logic: float = Field(
        description="Multiple combined boolean expressions making the logic difficult to parse."
    )
    code_duplication: float = Field(
        description="Repetitive code blocks or functions duplicated across the codebase."
    )
    non_idiomatic_use_of_language_features: float = Field(
        description="Using language features in a way that goes against common idioms or best practices."
    )

However, this approach came with a trade-off. It was challenging to balance the level of detail in the description of each factor with the overall size of the prompt. The more detailed and explicit the prompt, the better the LLM could identify specific factors, but longer prompts also meant slower response times, higher costs, and a greater chance of hitting token limits.

On the other hand, shorter prompts were faster and cheaper to run but often resulted in weaker detection accuracy, leading to higher error rates and less reliable evaluations.

🧩 Attempt 4: More factors with Enums

Doubling down on the idea of averaging out inconsistencies, I decided to increase the number of factors even further, while simplifying the prompt by representing each factor as an enum value.

By scaling up the number of simpler, well-defined factors, I aimed to make the scoring system both more granular and consistent, without overwhelming the model with lengthy descriptions.

For example:

class CodeComplexityFactors(str, Enum):
    use_advanced_coding_techniques = (
        "Use of advanced coding techniques, such as functional programming, that are less commonly understood."
    )
    use_advanced_algorithms = (
        "Use of advanced algorithms requiring specialized knowledge, making the code harder to understand."
    )
    use_parallelism_concurrency_patterns = (
        "Use of parallelism, concurrency, or recursion, which adds complexity due to the challenges of handling state across multiple threads or processes."
    )
    use_advanced_language_features = (
        "Use of advanced language features, such as reflection or metaprogramming, which can obscure code readability and require deep understanding."
    )
    complicated_arithmetic_expressions = (
        "Complex arithmetic expressions that involve multiple operations or formulas, making it harder to reason about."
    )
    complicated_boolean_expressions = (
        "Complex boolean logic, including multiple conditions that can be difficult to follow and debug."
    )
    complicated_string_manipulation = (
        "Complex string manipulations that involve multiple functions or operations, reducing clarity."
    )
    complicated_bitwise_operations = (
        "Use of bitwise operations and manipulation, which are generally low-level and harder to understand."
    )
    use_complex_third_party_libraries = (
        "Use of complex third-party libraries (e.g., Rx.js, Pandas, TensorFlow) that require specialized knowledge to understand and work with."
    )
    business_domain_expertise = (
        "Code that requires specific knowledge in the business domain, such as finance or healthcare."
    )
    technical_domain_expertise = (
        "Code that requires technical domain knowledge, such as signal processing or computer graphics."
    )
    application_domain_expertise = (
        "Code that requires understanding of the business logic unique to the application."
    )
    use_global_variables = (
        "Use of global variables or hidden mutable state, which makes the code harder to reason about and introduces potential side effects."
    )
    non_standard_coding_conventions = (
        "Use of non-standard or inconsistent coding and naming conventions, which can confuse engineers unfamiliar with the code."
    )
    excessive_mutable_state = (
        "Excessive reliance on mutable state, making the code harder to predict and test."
    )
    magic_numbers = (
        "Use of magic numbers (unexplained constants) that lack context, reducing clarity."
    )
    long_lists_of_positional_parameters = (
        "Functions with long lists of positional parameters, which can lead to confusion and misuse."
    )
    excessive_boilerplate_code = (
        "Excessive boilerplate code, which can obscure the core functionality and make the code harder to maintain."
    )
    inconsistent_indentation_or_formatting = (
        "Inconsistent indentation or formatting, reducing readability and maintainability."
    )
    long_monolithic_blocks_of_code = (
        "Long, monolithic blocks of code without clear separation of concerns, making it difficult to follow."
    )
    non_descriptive_variable_function_names = (
        "Non-descriptive or misleading names for variables or functions, reducing clarity and making it harder to understand the code."
    )
    overly_complex_function_signatures = (
        "Overly complex function signatures, making it hard to understand the purpose and use of the function."
    )
    deeply_nested_control_flow = (
        "Deeply nested branching in control flow (e.g., if/else, switch), making it hard to follow the execution logic."
    )
    complicated_control_flow_branching = (
        "Complicated branching in control flow, adding difficulty in understanding the code's decision-making."
    )
    deeply_nested_loops = (
        "Deeply nested loops (e.g., for, while), which can reduce code readability and increase cognitive load."
    )
    complicated_loop_structure = (
        "Complicated loop structures that involve multiple conditions, breaking out of loops, or complex logic."
    )
    hidden_side_effects = (
        "Hidden side effects that are not immediately obvious from the function signature, making debugging and reasoning more difficult."
    )
    code_duplication = (
        "Code duplication across functions or classes, which increases maintenance complexity."
    )
    non_idiomatic_use_of_language_features = (
        "Non-idiomatic use of language features, which may be unfamiliar or unintuitive for engineers working in the language."
    )
    complex_math_concepts = (
        "Use of advanced mathematical concepts or models, which require specialized knowledge to understand."
    )
    functional_programming = (
        "Use of functional programming paradigms, which require a different way of thinking and may not be familiar to all engineers."
    )
    complex_inheritance = (
        "Complex inheritance hierarchies, which can be hard to trace and understand."
    )
    complex_polymorphism = (
        "Complex use of polymorphism, which may introduce unexpected behavior and harder-to-understand relationships between classes."
    )
    complex_data_structures = (
        "Use of complex data structures (e.g., graphs, trees) that require specialized knowledge to work with."
    )
    bitwise_operations = (
        "Use of bitwise operations, which are generally low-level and harder to understand."
    )
    concurrency_mechanisms = (
        "Use of complex concurrency mechanisms, which add complexity in terms of state management and performance."
    )
    complex_regular_expressions = (
        "Use of complex regular expressions, which are often hard to read and understand at a glance."
    )
    reflection_and_metaprogramming = (
        "Use of reflection, metaprogramming, or other runtime code manipulation that reduces readability and increases cognitive load."
    )
    high_performance_computations = (
        "High-performance computations or low-level system optimizations, requiring specialized knowledge and potentially obscuring clarity."
    )
    low_level_networking = (
        "Low-level networking or socket programming, which requires specialized technical knowledge."
    )
    use_of_category_theory = (
        "Use of category theory concepts, which are very abstract and require a deep understanding to work with."
    )
    domain_specific_languages = (
        "Use of domain-specific languages (DSLs), which introduce custom syntax or rules that may be unfamiliar."
    )

To make the scoring more meaningful, I also introduced a custom weight to each enum value. By asking the LLM to identify which factors were present in the code and then applying the corresponding weights, I could compute a weighted sum that reflected the impact of the estimated complexity. This gave me a more flexible way to evaluate code, where each factor contributed proportionally based on how much it affects readability, maintainability, or onboarding effort.

code_complexity_factors_weight = {
    CodeComplexityFactors.use_advanced_coding_techniques: 10,
    CodeComplexityFactors.use_advanced_algorithms: 6,
    CodeComplexityFactors.use_parallelism_concurrency_patterns: 4,
    CodeComplexityFactors.use_advanced_language_features: 6,
    CodeComplexityFactors.complicated_arithmetic_expressions: 3,
    CodeComplexityFactors.complicated_boolean_expressions: 3,
    CodeComplexityFactors.complicated_string_manipulation: 2,
    CodeComplexityFactors.complicated_bitwise_operations: 4,
    CodeComplexityFactors.use_complex_third_party_libraries: 3,
    CodeComplexityFactors.business_domain_expertise: 4,
    CodeComplexityFactors.technical_domain_expertise: 4,
    CodeComplexityFactors.application_domain_expertise: 3,
    CodeComplexityFactors.use_global_variables: 2,
    CodeComplexityFactors.non_standard_coding_conventions: 2,
    CodeComplexityFactors.excessive_mutable_state: 2,
    CodeComplexityFactors.magic_numbers: 1,
    CodeComplexityFactors.long_lists_of_positional_parameters: 2,
    CodeComplexityFactors.excessive_boilerplate_code: 1,
    CodeComplexityFactors.inconsistent_indentation_or_formatting: 1,
    CodeComplexityFactors.long_monolithic_blocks_of_code: 2,
    CodeComplexityFactors.non_descriptive_variable_function_names: 2,
    CodeComplexityFactors.overly_complex_function_signatures: 2,
    CodeComplexityFactors.deeply_nested_control_flow: 3,
    CodeComplexityFactors.complicated_control_flow_branching: 2,
    CodeComplexityFactors.deeply_nested_loops: 3,
    CodeComplexityFactors.complicated_loop_structure: 2,
    CodeComplexityFactors.hidden_side_effects: 4,
    CodeComplexityFactors.code_duplication: 2,
    CodeComplexityFactors.non_idiomatic_use_of_language_features: 3,
    CodeComplexityFactors.complex_math_concepts: 7,
    CodeComplexityFactors.functional_programming: 10,
    CodeComplexityFactors.complex_inheritance: 4,
    CodeComplexityFactors.complex_polymorphism: 4,
    CodeComplexityFactors.complex_data_structures: 6,
    CodeComplexityFactors.bitwise_operations: 4,
    CodeComplexityFactors.concurrency_mechanisms: 5,
    CodeComplexityFactors.complex_regular_expressions: 4,
    CodeComplexityFactors.reflection_and_metaprogramming: 4,
    CodeComplexityFactors.high_performance_computations: 5,
    CodeComplexityFactors.low_level_networking: 6,
    CodeComplexityFactors.use_of_category_theory: 10,
    CodeComplexityFactors.domain_specific_languages: 6,
}
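As a rough illustration of the weighted sum described above, here is a minimal sketch. It uses a hypothetical three-member `Factor` enum instead of the full list, and it assumes a factor reported twice should only be counted once; the actual tool may handle duplicates differently.

```python
from enum import Enum


class Factor(str, Enum):
    # Hypothetical three-member subset of the factors, for illustration only.
    magic_numbers = "magic_numbers"
    deeply_nested_loops = "deeply_nested_loops"
    functional_programming = "functional_programming"


# Weights copied from the table above for these three factors.
FACTOR_WEIGHTS = {
    Factor.magic_numbers: 1,
    Factor.deeply_nested_loops: 3,
    Factor.functional_programming: 10,
}


def weighted_complexity(detected: list[Factor]) -> int:
    """Sum the weights of the distinct factors the LLM reported."""
    # Assumption: deduplicate, so a factor listed twice is counted once.
    return sum(FACTOR_WEIGHTS[factor] for factor in set(detected))


detected = [Factor.magic_numbers, Factor.deeply_nested_loops, Factor.magic_numbers]
print(weighted_complexity(detected))  # 4
```

The sketch also shows why a single false positive on a weight-10 factor matters: it outweighs several correctly detected low-weight factors combined.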

My prompt produced JSON output, and one of the fields had to contain an array of enum values. During experimentation, I noticed an increase in poorly formatted JSON from the LLM. It turned out that LangChain was including the enum descriptions directly in the prompt and expecting the output to contain an array of strings exactly matching those descriptions. The resulting responses were long and verbose, which led to inconsistent representations and parsing errors. To fix this, I revised the enums to use concise, clear values that reduced ambiguity and minimized the chance of misinterpretation. This helped with formatting, but didn’t fully solve the problem.
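To make the idea of concise enum values concrete, here is a small sketch of tolerant parsing. The `Factor` enum, the `factors` field name, and the `parse_factors` helper are all hypothetical, and the code assumes every array item is a string; this is not how LangChain itself parses output.

```python
import json
from enum import Enum


class Factor(str, Enum):
    # Short values: the model only has to echo an identifier, not a sentence.
    MAGIC_NUMBERS = "magic_numbers"
    HIDDEN_SIDE_EFFECTS = "hidden_side_effects"


def parse_factors(raw_json: str) -> tuple[list[Factor], list[str]]:
    """Split the model's answer into recognised factors and unknown strings."""
    known: list[Factor] = []
    unknown: list[str] = []
    for item in json.loads(raw_json)["factors"]:
        try:
            # Normalise whitespace and casing before looking up the enum value.
            known.append(Factor(item.strip().lower()))
        except ValueError:
            unknown.append(item)
    return known, unknown


known, unknown = parse_factors(
    '{"factors": ["magic_numbers", " Hidden_Side_Effects", "spaghetti"]}'
)
print(known)
print(unknown)
```

Collecting unknown strings instead of failing makes it possible to log how often the model drifts from the allowed values.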

Despite these improvements, the approach still had reliability issues. The LLM would sometimes miss key factors or falsely detect ones that weren’t present. This was especially problematic for high-weight factors, since errors in those would heavily skew the final score.

⚖️ Attempt 5: Grouping Factors into Key Categories

After playing with the trade-offs of fine-grained metrics, I realized I needed a better balance:

Too many individual factors overwhelmed the LLM and made the scoring process fragile.
A single metric was too simplistic and lacked the nuance needed for meaningful evaluation.

But by spending time refining the individual factors, I started to see patterns: groups of related traits that could be consolidated into broader categories. This led me to define five key dimensions of complexity:

Readability Issues – Problems related to naming, formatting, or clarity that reduce how easily code can be understood.
Control Flow Complexity – Use of deeply nested logic, recursion, or heavy branching that increases cognitive load.
Project-Specific Knowledge – Dependencies on internal business logic, frameworks, or custom libraries that are hard to understand without context.
Domain-Specific Knowledge – Use of specialized concepts (e.g., from machine learning, graphics, physics, or signal processing) that require prior expertise.
Advanced Coding Techniques – Patterns like metaprogramming, functional programming or other techniques that are powerful but mostly used due to personal preference.

Grouping the factors this way allowed me to keep the prompt concise while still capturing the most important sources of complexity. It also made it easier to assign meaningful weights to each category based on how difficult they are to understand, refactor, or onboard new developers into.

from pydantic import BaseModel, Field

class FunctionComplexityEvaluation(BaseModel):
    function_name: str = Field(description="Name of the function")

    is_setup_of_declaration: bool = Field(
        description="The code is part of setup or declaration boilerplate, such as defining constants or configuring a framework."
    )

    start_line_number: int = Field(
        description="Number of the first line of the function, considering existing formatting"
    )

    end_line_number: int = Field(
        description="Number of the last line of the function, considering existing formatting"
    )

    readability_score: float = Field(
        description="Estimate how readable the code is based on factors like naming conventions, formatting, and non-runtime characteristics."
    )

    cognitive_complexity_score: float = Field(
        description="Estimate the cognitive complexity of control structures and expressions. Higher scores result from deeply nested control flow, complex expressions, and multiple branching levels."
    )

    project_specific_knowledge_score: float = Field(
        description="Estimate how much project-specific knowledge is required, such as the use of third-party libraries or specific business rules."
    )

    technical_domain_knowledge_score: float = Field(
        description="Estimate the level of deep technical domain knowledge required, such as advanced algorithms, parallel programming, signal processing, or low-level optimizations."
    )

    advanced_code_techniques_score: float = Field(
        description="Estimate the use of advanced coding techniques (like functional programming paradigms) that are not essential for solving the task but reflect the developer’s preference."
    )

By focusing on five broad categories, I could include clear definitions and concrete examples for each, which helped the model make more consistent and accurate evaluations. It also made the weighting process much simpler. Instead of juggling dozens of individual factors, I could assign meaningful weights to just five core categories, each representing a different dimension of complexity.

After some refinement, I realized that adding more detailed grading guidelines for each category directly into the prompt would further improve consistency. Clear score ranges gave the LLM a more structured way to estimate maintainability and helped align its output with my expectations.

from pydantic import BaseModel, Field

class FunctionComplexityEvaluation(BaseModel):
    function_name: str = Field(description="Name of the function")

    is_setup_of_declaration: bool = Field(
        description="The code is part of setup or declaration boilerplate, such as defining constants or configuring a framework."
    )

    start_line_number: int = Field(
        description="Number of the first line of the function, considering existing formatting"
    )

    end_line_number: int = Field(
        description="Number of the last line of the function, considering existing formatting"
    )

    readability_score: float = Field(
        description="Estimate how readable the code is based on factors like naming conventions, formatting, and non-runtime characteristics.\n\
        Score ranges:\n\
        0 - 0.3: The code follows standard naming conventions, is well-formatted, and lacks clutter (e.g., no magic numbers or excessive boilerplate).\n\
        0.3 - 0.7: Minor readability issues, inconsistent formatting, occasional use of non-descriptive names, or slight violations of coding standards.\n\
        0.7 - 1: Significant readability problems, non-standard conventions, poor naming, inconsistent formatting, or extensive use of boilerplate code."
    )

    cognitive_complexity_score: float = Field(
        description="Estimate the cognitive complexity of control structures and expressions.\n\
        Score ranges:\n\
        0 - 0.3: Simple control structures (minimal nesting, straightforward logic, few operators).\n\
        0.3 - 0.7: Moderate complexity, involving some nesting (2–3 levels), more complex boolean/arithmetic expressions, or multiple operators.\n\
        0.7 - 1: Highly complex control structures, deeply nested (3+ levels), intricate logic with many operators, or multiple conditional/loop combinations."
    )

    project_specific_knowledge_score: float = Field(
        description="Estimate how much project-specific knowledge is required, such as the use of third-party libraries or specific business rules.\n\
        Score ranges:\n\
        0 - 0.3: Little to no project-specific knowledge required, uses common third-party libraries or standard business rules.\n\
        0.3 - 0.7: Some project-specific knowledge is needed, involving custom libraries or moderately complex business rules.\n\
        0.7 - 1: Extensive project-specific knowledge required, highly customized third-party libraries or intricate, specific business logic."
    )

    technical_domain_knowledge_score: float = Field(
        description="Estimate the level of deep technical domain knowledge required, such as advanced algorithms, parallel programming, signal processing, or low-level optimizations.\n\
        Score ranges:\n\
        0 - 0.3: Minimal technical domain knowledge required, standard algorithms and techniques used.\n\
        0.3 - 0.7: Moderate technical domain knowledge, involving specialized algorithms, parallel programming, or some scientific/engineering calculations.\n\
        0.7 - 1: High level of technical domain knowledge required, including advanced algorithms, low-level optimizations, or complex scientific/mathematical concepts."
    )

    advanced_code_techniques_score: float = Field(
        description="Estimate the use of advanced coding techniques (like functional programming paradigms) that are not essential for solving the task but reflect the developer’s preference.\n\
        Score ranges:\n\
        0 - 0.3: No or minimal use of advanced techniques, the code is straightforward and easy to follow.\n\
        0.3 - 0.7: Some use of advanced techniques (e.g., functional programming, metaprogramming) that increase complexity but do not dominate the code.\n\
        0.7 - 1: Heavy use of advanced techniques that significantly add complexity without being essential for solving the problem (e.g., monads, currying, complex metaprogramming)."
    )

I assigned weights to each category based on how difficult it is to address that type of complexity in real-world scenarios:

Readability Issues (Weight: 1) – These are usually easy to fix. Renaming variables, cleaning up formatting, or adding comments takes minimal effort and carries low risk.
Control Flow Complexity (Weight: 2) – Refactoring deeply nested logic or simplifying branching structures is harder and can introduce bugs if not done carefully.
Project-Specific Knowledge (Weight: 3) – This often requires onboarding or checking internal documentation. It makes it harder to onboard new team members, and it is hard for engineers to keep their knowledge up to date.
Domain-Specific Knowledge (Weight: 4) – Understanding concepts from fields like machine learning or graphics can take significant time, and that knowledge isn’t always easily accessible.
Advanced Coding Techniques (Weight: 5) – This is unnecessary complexity that often reflects personal preferences rather than project needs, and understanding it may require deep technical knowledge and hands-on experience.
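One possible way to combine the five category scores with these weights is sketched below. The post does not state whether the weighted sum is normalized, so dividing by the total weight is an assumption here, and `composite_score` is a hypothetical helper name.

```python
# Category weights from the list above, keyed by the model's field names.
CATEGORY_WEIGHTS = {
    "readability_score": 1,
    "cognitive_complexity_score": 2,
    "project_specific_knowledge_score": 3,
    "technical_domain_knowledge_score": 4,
    "advanced_code_techniques_score": 5,
}


def composite_score(scores: dict[str, float]) -> float:
    """Weighted average of the five category scores (each expected in 0..1)."""
    total_weight = sum(CATEGORY_WEIGHTS.values())
    weighted = sum(CATEGORY_WEIGHTS[name] * scores[name] for name in CATEGORY_WEIGHTS)
    # Assumption: normalise by the total weight so the result stays in 0..1.
    return weighted / total_weight


example = {
    "readability_score": 0.5,
    "cognitive_complexity_score": 0.5,
    "project_specific_knowledge_score": 0.0,
    "technical_domain_knowledge_score": 0.0,
    "advanced_code_techniques_score": 0.0,
}
print(round(composite_score(example), 3))  # 0.1
```

With these weights, moderate readability and control-flow problems barely move the score, while the same values in the last two categories would dominate it.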

📏 Function length component

In addition to the cognitive complexity estimated by the language model, I decided to also consider function length as part of the final score. Long functions often require developers to hold more context in their minds, which becomes especially difficult when the logic itself is complex.

A short function that handles something complex can still be understandable. But even simple logic, when stretched over dozens of lines, becomes difficult to follow. That’s why keeping functions small is a well-known best practice, something I wanted the tool to encourage.

To capture this, I introduced a function size factor:

import math

desired_length = 10  # lines of code
function_size_factor = math.sqrt(function_length / desired_length)

A 10-line function is treated as the baseline. Shorter functions are typically simpler and easier to reason about, while longer ones get penalized. The square root limits the growth of the function size factor, which prevents excessively long functions from dominating the entire score.
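To see how the square root dampens the penalty, here is a small sketch that evaluates the factor for a few function lengths. The function wrapper and the sample lengths are chosen only for illustration.

```python
import math

DESIRED_LENGTH = 10  # lines of code, the baseline from the snippet above


def function_size_factor(function_length: int) -> float:
    """Square-root scaling relative to the 10-line baseline."""
    return math.sqrt(function_length / DESIRED_LENGTH)


for length in (5, 10, 40, 160):
    print(length, round(function_size_factor(length), 2))
# 5 0.71
# 10 1.0
# 40 2.0
# 160 4.0
```

A function 16 times longer than the baseline is penalized only 4 times as much.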

📈 Computing the Final Score

To make the final score more human-friendly, I wanted to keep it in the range from 1 to 5, like a typical star rating. To do that, I applied a hyperbolic tangent (tanh) function to the adjusted composite score:

import math

MIN_VALUE = 1
MAX_VALUE = 5
VALUE_RANGE = MAX_VALUE - MIN_VALUE  # 4
final_score = MIN_VALUE + VALUE_RANGE * math.tanh(composite_score * function_size_factor)

The hyperbolic tangent function brings several useful properties to the scoring formula:

Near zero it grows almost linearly, capturing the initial steady increase in perceived complexity.
As the input gets larger, the output asymptotically approaches 1, ensuring the score stays bounded and doesn’t spike uncontrollably.

By applying this function, I ensured the final score stays within desired range, while also modeling the non-linear nature of how developers experience complexity.
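The following sketch wraps the formula in a function and prints the score for a few inputs, to show both the near-linear start and the saturation. It assumes the composite score and the size factor are non-negative, and the sample inputs are arbitrary.

```python
import math

MIN_VALUE = 1
MAX_VALUE = 5


def final_score(composite_score: float, function_size_factor: float) -> float:
    """Map a non-negative adjusted score into the 1..5 range with tanh."""
    value_range = MAX_VALUE - MIN_VALUE
    return MIN_VALUE + value_range * math.tanh(composite_score * function_size_factor)


for composite in (0.0, 0.25, 1.0, 3.0):
    print(composite, round(final_score(composite, 1.0), 2))
# 0.0 1.0
# 0.25 1.98
# 1.0 4.05
# 3.0 4.98
```

Going from 0 to 0.25 adds almost a full point, while going from 1.0 to 3.0 adds less than one, which is the saturation the tanh function provides.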

🚢 Finalizing the Tool and Additional Features

As the tool evolved, I added a few more features to improve its performance and usability:

Progressive Evaluation – To save time and compute, the tool caches previous results and skips files that haven’t changed since the last run. This makes it much faster to use in CI pipelines or large projects.
Improvement Suggestions – When a file exceeds the target complexity score, the tool generates helpful, actionable feedback on what could be improved, highlighting specific areas that contribute most to the score.
Configuration Options – The tool behaviour can be customized through a config file or CLI flags. This allows teams to adapt it to fit their needs.
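The post does not describe how the caching behind Progressive Evaluation is implemented. One common way to skip unchanged files is to compare content hashes against a stored cache, sketched below; the JSON cache layout and the helper names are assumptions, not the tool's actual design.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of the file content, used to detect changes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def load_cache(cache_file: Path) -> dict[str, str]:
    return json.loads(cache_file.read_text()) if cache_file.exists() else {}


def files_to_evaluate(paths: list[Path], cache_file: Path) -> list[Path]:
    """Return only the files whose content differs from the cached digest."""
    cache = load_cache(cache_file)
    return [p for p in paths if cache.get(str(p)) != file_digest(p)]


def update_cache(paths: list[Path], cache_file: Path) -> None:
    """Record the current digests after a successful evaluation run."""
    cache = load_cache(cache_file)
    cache.update({str(p): file_digest(p) for p in paths})
    cache_file.write_text(json.dumps(cache, indent=2))
```

A run would call `files_to_evaluate`, send only those files to the LLM, and then call `update_cache` for the files that were evaluated.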

🤩 Wrapping Up

I hope this tool will be useful to other engineers and companies looking to bring code complexity evaluation into their CI workflows. While it’s still early, I see this as one of the first practical steps toward automating parts of the code review process using LLMs.

With this post, I didn’t just want to showcase the tool; I wanted to share the journey of building it, from experimenting with prompts to balancing reliability, performance, and cost. This project taught me a lot about working with LLMs in real-world scenarios.

It also gave me a deeper understanding of what actually makes code complex, readable, or maintainable. Now I approach code quality with a more structured mindset, and I hope these insights help others do the same.

🧪 Try it out: codepass on PyPI
🚀 Code: GitHub repo
💬 Got feedback or ideas? Drop a comment below!

Simplifying Complex Code with Advanced Programming Approaches

Alexey Tukalo — Wed, 16 Oct 2024 06:45:23 +0000

Recently I've spent quite some time working on a CLI tool designed to evaluate code complexity. This project made me think about the factors that impact the overall cognitive load required to understand and maintain a codebase. I've realized that complexity largely comes from two key areas: how we organize our code and the amount of knowledge required to understand it. This knowledge could involve domain expertise needed to grasp the science or business rules behind the code, or it could be the technical know-how related to the tools and techniques used in implementation.

A lot has been said about structuring code effectively, but I believe the "knowledge factor" is often overlooked. This is critical because, although messy code can be untangled with time and effort, no amount of refactoring will help if the underlying code is built on complex scientific concepts or advanced coding techniques that you do not understand. Mastering such knowledge can take years, making it far more challenging to deal with.

Considering the knowledge requirements associated with a codebase, I divide them into two categories: natural complexity and artificial complexity introduced by engineering decisions. Natural complexity is unavoidable, such as domain-specific knowledge necessary to solve the problem. However, the more dangerous form of complexity arises from optional engineering choices—like unnecessary third-party libraries or hard-to-master coding techniques. These choices can impose a steep learning curve on the team, resulting in long-term costs and complexity that can drain businesses significantly.

Code Complexity Grades

Let me explain my thoughts abstractly. Imagine we have a gradation of code based on complexity levels:

1) Code that is instantly clear to anybody.
2) Code that requires closer inspection but is still accessible to most developers.
3) Code with complex logic or requiring specialized domain knowledge that is commonly available within the organization.
4) Code with highly intricate logic or demanding advanced, specialized domain expertise.
5) Code so complex that it is comprehensible only to a select few.

A healthy software project should have the majority of the code at level 2, with a small portion at level 3, and only a minimal amount at level 4. This balance ensures that most of the codebase remains accessible and maintainable, while only a small part requires specialized knowledge or extra attention.

Advanced Coding Techniques

In my view, one of the selling points of many advanced coding techniques, such as functional programming (FP), is that developers apply them with the intention of "simplifying" the code.

For example, functional programming offers powerful abstractions that can encapsulate complex logic in a more declarative and concise manner. This can lead to cleaner, more maintainable code that reduces the cognitive load required to understand intricate systems.

However, there is a caveat: according to the complexity scale, any FP code inherently falls into level 3-4. It requires specialized knowledge and skills, which are not as widespread as one might hope. Therefore, when building a team for a project that heavily relies on functional programming, it's not enough to simply find smart and capable engineers — they also need to be proficient in the advanced coding techniques that the project demands.

Using advanced approaches gives the project a higher share of level 3-4 code. Such code imposes a higher cognitive load on those who are not experienced with the chosen paradigm, and not every programmer can understand it.

Third-Party Libraries

Third-party libraries are often introduced with the goal of speeding up development and reducing the need to reinvent the wheel. By leveraging pre-built solutions, developers can focus on higher-level tasks, potentially simplifying certain aspects of the project.

However, these libraries come with their own complexity. Each one introduces new abstractions, APIs, and sometimes even entire ecosystems that developers need to understand and manage. According to the complexity scale, integrating third-party libraries inherently pushes parts of the codebase up the scale by expanding the amount of knowledge necessary to understand the system. It requires understanding how these libraries work, their limitations, and how they interact with other components in the application.

Third-party libraries are generally easier to pick up than advanced coding techniques. However, even with good documentation and community support, developers must invest time to learn the library’s nuances and ensure it fits the project's needs. Over-reliance on these libraries can still increase the cognitive load, especially for team members unfamiliar with them, adding complexity that comes not from the problem domain itself but from managing external tools.

Usage of a library can be justified when it helps to reduce complexity, especially by covering areas that would otherwise require specialized domain knowledge. For example, Three.js is a popular JavaScript library that simplifies working with 3D graphics. Creating 3D scenes, animations, and effects from scratch would typically require deep knowledge of computer graphics, rendering pipelines, and low-level WebGL programming. However, Three.js abstracts much of this complexity, offering a more accessible API that lets developers create sophisticated 3D visualizations without needing to master computer graphics. This allows teams to focus on business logic and the creative aspects of the project rather than the technical challenges of rendering graphics.

Embracing Simplicity

Over the years, I have come to appreciate the value of simplicity in software design. This doesn't mean avoiding advanced techniques altogether, but rather using them judiciously and ensuring that the overall system remains understandable and maintainable.

I think the knowledge requirements we casually introduce through unnecessary third-party libraries or advanced coding techniques are the worst kind: nothing requires them to be there, and they are not essential to solving the business problem. The balance lies in knowing when to apply these techniques and ensuring that the overall design remains accessible to the entire development team.

Book: Interactive Data Visualization for the Web by Scott Murray

Alexey Tukalo — Fri, 11 Oct 2024 04:53:33 +0000

This book serves as a comprehensive tutorial for D3.js, the most popular JavaScript library for custom data visualization. While centered on D3.js, the workflows presented are applicable beyond the library and even beyond JavaScript itself.

For Beginners:

An excellent introduction to 2D data visualization, it also introduces the concept of Data-Driven Design, which has applications outside the field of visualization.

For Experienced Readers:

Rather than focusing on specific techniques, this book emphasizes the process of transforming data into meaningful visuals, a skill valuable across many disciplines.

Interactive Data Visualization for the Web by Scott Murray

Book: A Philosophy of Software Design

Alexey Tukalo — Sun, 29 Sep 2024 05:59:09 +0000

This book offers insights and wisdom from a seasoned software engineer, with detailed breakdowns of software design problems and in-depth analyses. While some discussions dive too deeply into specific examples, occasionally straying into irrelevant details, the author’s perspective is often conservative, and at times, even radical.

For Beginners:

I wish this had been my first programming book. It provides a glimpse into the mindset of those who are likely to be your first managers and mentors. While the proposed solutions may not always be the best, they are practical and have guided generations of software engineers. More importantly, the book introduces readers to critical design challenges and familiarizes them with the broader discussions in the field.

For Experienced Readers:

The book excels in explaining viewpoints and reasoning behind software practices that you might have encountered but struggled to fully understand or agree with. By presenting problems and solutions from multiple perspectives, it helps deepen your understanding of your colleagues and fellow software engineers.

A Philosophy of Software Design by John Ousterhout

Book: The C Programming Language

Alexey Tukalo — Mon, 23 Sep 2024 19:04:24 +0000

This book is often regarded as a classic, sometimes even called the "bible of programming." It provides an in-depth education in the C programming language. While it may not be the best resource for those focused on modern general-purpose programming due to C's lack of some widely used features, it compels programmers to engage with many often-overlooked details and limits the use of contemporary approaches.

For Beginners:

An excellent resource for learning the fundamentals of low-level programming.

For Experienced Readers:

A great tutorial for mastering the C language when needed.

The C Programming Language by Brian Kernighan and Dennis Ritchie

Udemy: Learning Path: Haskell: Functional Programming and Haskell

Alexey Tukalo — Sun, 08 Sep 2024 14:22:16 +0000

This learning path comprises three separate courses. The first two courses overlap significantly; the first introduces fundamental Haskell semantics, while the second covers more advanced language elements. The third course stands apart, focusing on sophisticated applications of Haskell tools.

For Beginners:

The first two courses provide a solid foundation in purely functional programming and Haskell basics, although the third part may be too advanced.

For Experienced Readers:

The courses offer clear explanations of Haskell-specific semantics, such as do-notation and Type Classes. The advanced examples in the last section may prove useful for specific applications.

Learning Path: Haskell: Functional Programming and Haskell

Book: Learning JavaScript Design Patterns

Alexey Tukalo — Tue, 03 Sep 2024 18:18:17 +0000

This book explores the implementation and use of common software design patterns in JavaScript. While some examples may be slightly outdated according to the latest best practices, they remain valuable for those maintaining legacy systems.

For Beginners:

It serves as an excellent introduction to software design patterns. However, the problems these patterns address may be less familiar to those with limited programming experience.

For Experienced Readers:

The specific applications of software design patterns can provide inspiration for creative solutions to future challenges.

Learning JavaScript Design Patterns: A JavaScript and jQuery Developer's Guide by Addy Osmani

Book: Think Complexity by Allen Downey

Alexey Tukalo — Mon, 02 Sep 2024 15:11:38 +0000

Think Complexity introduces readers to the field of Complexity Science, exploring its implications for programming. The book showcases unique algorithms typical of the discipline and delves into intriguing concepts like Cellular Automata and the Game of Life, using Python for demonstrations.

For Beginners:

This book provides excellent examples of rare algorithms, particularly those involving graph-based problems.

For Experienced Readers:

You'll find value in the case studies and the application of programming techniques within Complexity Science.

Think Complexity

9 months of Machine Learning and beyond: Machine Learning A-Z

Alexey Tukalo — Fri, 26 Jul 2024 06:56:41 +0000

Introduction

Even before I started properly studying machine learning last summer, I had already purchased several machine learning courses on Udemy. The most basic among those courses was Machine Learning A-Z: AI, Python & R, so it became my starting point. This course served as a perfect introduction to the field, covering a wide range of classical machine learning techniques and some deep learning.

Course Impression

Typically, as programmers, we work with structured data. However, the world is inherently messy. Machine learning proves to be an invaluable tool for dealing with unstructured information. I was very impressed by the course because it introduced a whole new world of approaches that felt like gaining a superpower.

Course Content

The course explains the machine learning process step by step. The initial, crucial stage of the process is data preprocessing, which happens even before any algorithms can be applied.

Preprocessing of data

Preprocessing begins with data splitting. It is common to divide a dataset into three parts: training, validation, and test sets. The training set is used to train the model, the validation set helps assess overfitting during training, and the test set is used to evaluate the model’s performance after training.
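To make the three-way split concrete, here is a minimal sketch in plain Python. The function name split_dataset and the 70/15/15 proportions are my own assumptions for illustration, not something prescribed by the course.

```python
import random

def split_dataset(rows, train=0.7, val=0.15, seed=0):
    """Shuffle rows and cut them into train, validation and test lists.

    The proportions are hypothetical defaults; whatever is left after the
    train and validation cuts becomes the test set.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * val)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )

train_set, val_set, test_set = split_dataset(range(100))
print(len(train_set), len(val_set), len(test_set))
```

Shuffling before cutting matters when the rows are stored in some meaningful order, for example sorted by date or by label.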

Handling missing data is another critical aspect. Depending on the situation and the amount of data missing, there are two primary options:

Imputing missing values based on other data points
Removing rows with missing data entirely
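Both options can be sketched in a few lines. The tiny dataset and the helper names impute_mean and drop_missing below are hypothetical and only illustrate the idea; mean imputation is just one of several possible imputation strategies.

```python
from statistics import mean

# Hypothetical dataset: None marks a missing value.
rows = [
    {"age": 25, "income": 50000},
    {"age": None, "income": 62000},
    {"age": 40, "income": 58000},
]

def impute_mean(rows, column):
    """Replace missing values in `column` with the mean of the observed ones."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = mean(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]

def drop_missing(rows):
    """Remove every row that has at least one missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

imputed = impute_mean(rows, "age")
complete = drop_missing(rows)
```

Imputation keeps all three rows and fills the gap with 32.5, while dropping leaves only the two complete rows.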

Moreover, it is often important to perform feature scaling, because some machine learning algorithms are sensitive to the scale of the input data. For instance, algorithms that compute distances between data points, like K-Nearest Neighbors (K-NN), will be biased towards variables with a larger scale if the data is not adjusted to compensate for this. Feature scaling helps ensure that all independent variables contribute equally to the analysis. This can be done through methods like normalization or standardization. Normalization rescales features to a fixed range, usually from 0 to 1. Standardization adjusts all features to have a mean of 0 and a standard deviation of 1.
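The two scaling methods boil down to simple formulas: (x - min) / (max - min) for normalization and (x - mean) / std for standardization. A minimal sketch, with function names and sample values of my own choosing:

```python
from statistics import mean, pstdev

def normalize(values):
    """Min-max normalization: rescale values to the range [0, 1].

    Assumes the column is not constant (max != min).
    """
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Standardization: shift to mean 0 and scale to standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

incomes = [30000, 45000, 60000, 90000]  # hypothetical feature column
print(normalize(incomes))    # [0.0, 0.25, 0.5, 1.0]
print(standardize(incomes))
```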

These preprocessing steps are necessary to create robust machine learning models that perform well in real-world scenarios.

Classic Machine Learning Models

Regression

Regression models are a type of statistical tool used for predicting a continuous outcome based on one or more input variables. They are fundamental for forecasting and determining the strength of relationships between variables. These models work by creating an equation that best fits the observed data. I already had some experience with regression models, especially Linear Regression, from the statistics courses I took years ago.

Polynomial Regression extends linear regression by adding terms with powers greater than one. This allows the model to fit a wider range of data shapes, capturing more complex relationships between variables. However, higher-degree polynomials can lead to overfitting, where the model fits the training data too closely and performs poorly on unseen data. This occurs because the model learns noise from the training data, mistaking it for actual relationships.

Next, the course introduces Support Vector Regression (SVR), a powerful model that can capture non-linear relationships, including exponential ones, with a lower risk of overfitting. The main goal of SVR is to create a prediction line that fits most of the data points as closely as possible while also trying to keep the line as smooth and flat as possible. In other words, SVR tries to strike a balance between closely following the training data and avoiding overly complex models that might not work well on new, unseen data. It does this by allowing for a small margin of error, within which deviations are acceptable. This makes SVR a robust choice for predicting continuous values, especially when the data is complex or has a lot of variability.

After that, Decision Trees and Random Forests are introduced. Typically known for classification, these techniques are also applicable in regression settings. The course explains how these models can predict an output based on decision rules inferred from the data features. Decision Trees and Random Forests create models based on a series of binary decisions from the features within the dataset. This approach can lead to models that fit well on training data but fail to generalize to new data because the decision-making process is arbitrary and doesn’t necessarily capture underlying mathematical relationships between variables.

On the other hand, methods like SVR and Polynomial Regression aim to identify the mathematical relationships inherent in the data. For example, SVR tries to fit the best possible curve within a certain margin of error, and polynomial regression can model relationships that follow a polynomial equation. If the true relationship between the variables is mathematical, these methods are likely to perform better with less risk of overfitting. This ability to uncover and leverage mathematical relationships makes SVR, Linear, and Polynomial Regression more robust for predicting outcomes where the underlying data relationships are strong and clear.

Model Selection in Regression

The section on regression wraps up with strategies for choosing the best model. Trying different approaches and evaluating their performance on test data is still recommended, since experimentation remains the only reliable way to select a truly optimal model.

Classification

Classification involves predicting a categorical response based on input variables.

Logistic Regression, despite its name, is a basic classification technique. It is used to predict outcomes that have two possible states, e.g., yes/no or true/false. It works by modelling the probability of the default class, usually labeled 1, as a function of the input features. Logistic regression applies the logistic function to the output of a linear equation, producing a probability score between 0 and 1. This model is robust, straightforward, and efficient for binary classification problems.
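The prediction step described above can be written in a few lines. This is only a sketch of how a trained model produces a probability: the weights and bias are hypothetical values I made up, not the result of any actual training.

```python
import math

def sigmoid(z):
    """Logistic function: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of class 1: the logistic function applied to a linear equation."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical, already-trained parameters for a model with two features.
weights, bias = [0.8, -0.4], -0.2

p = predict_proba([1.5, 2.0], weights, bias)
label = 1 if p >= 0.5 else 0  # 0.5 is the usual default decision threshold
print(round(p, 3), label)
```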

The next model in the course is K-Nearest Neighbors (K-NN). It classifies a data point based on how its nearest neighbors are classified, and it can handle multi-class problems and more complex decision boundaries.
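A minimal K-NN classifier fits in a dozen lines. The sketch below assumes Euclidean distance and a simple majority vote; the function name knn_predict and the toy points are my own.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (2, 2)))  # a
print(knn_predict(points, labels, (7, 7)))  # b
```

Because the vote depends on raw distances, this is exactly the kind of model where unscaled features with large ranges would dominate the result.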

The course also covers Support Vector Machines (SVM) for classification, explaining the use of different kernels to handle linear and non-linear classification. An SVM constructs a hyperplane in a multidimensional space to separate different classes, and it performs well in high-dimensional spaces. It is versatile because different kernel functions make the decision boundary more adaptable to the data. For example, linear kernels are great for linearly separable data, while radial basis function (RBF) kernels can model non-linear relationships.

Clustering

Classification and clustering are both methods of organizing data but serve different purposes. Classification is a supervised learning approach where the model is trained on labeled data. This means the model learns from examples that already have an assigned category or class. Its task is to predict the category for new data based on what it has learned. For example, a classification model might determine whether emails are spam or not spam based on training with a dataset of emails labeled accordingly.

Clustering, on the other hand, is an unsupervised learning technique that involves grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups. It’s used when we don’t have predefined labels for data. The model itself discovers the inherent groupings in the data. An example of clustering might be segmenting customers into groups based on purchasing behavior without prior knowledge of the different customer types.

Both methods are fundamental in data analysis:

Classification uses labeled data for predictive modeling.
Clustering helps to discover hidden patterns in data.

Clustering Techniques

K-Means is a popular clustering technique that partitions data into K distinct, non-overlapping clusters based on their features. The process involves randomly initializing K points as cluster centers and assigning each data point to the nearest cluster based on Euclidean distance. The cluster centers are then recalculated as the mean of the assigned points, and this process repeats until the centroids stabilize and no longer move significantly. This method is particularly effective for large datasets and is widely used due to its simplicity and efficiency. K-Means works best with data where the clusters are spherical and evenly sized, making it less effective with complex cluster shapes.
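The assign-and-recompute loop can be sketched as follows. This is a bare-bones illustration under my own assumptions (random initialization from the data points, a fixed seed, tuples as points), not a production implementation.

```python
import math
import random

def k_means(points, k, iterations=100, seed=0):
    """Plain K-Means: assign points to the nearest center, then recompute centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initialization from the data
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its assigned points.
        new_centers = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centers[i]  # keep the old center if a cluster is empty
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:  # centers stopped moving: converged
            break
        centers = new_centers
    return centers, clusters

# Hypothetical toy data with two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, clusters = k_means(points, k=2)
print(sorted(centers))
```

On less tidy data the result depends on the initial centers, which is why K-Means is usually run several times with different seeds.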

Hierarchical Clustering, unlike K-Means, does not require the number of clusters to be specified in advance. It builds a hierarchy of clusters either by a divisive method or an agglomerative method.

In the agglomerative approach, each data point starts as its own cluster, and pairs of clusters are merged as one moves up the hierarchy. The process continues until all points are merged into a single cluster at the top of the hierarchy. This method is beneficial for identifying the level of similarity between data points and is visually represented using a dendrogram, which can help determine the number of clusters by cutting the dendrogram at a suitable level.

The divisive method of hierarchical clustering, also known as top-down clustering, starts with all observations in a single cluster and progressively splits the cluster into smaller ones. This approach begins at the top of the hierarchy and works its way down, making it conceptually straightforward: every split is designed to create the most distinct and coherent clusters possible at each level of division.

In practice, the divisive method involves examining the cluster at each step and choosing the best point to split it. This involves measuring the distance between observations within the cluster and identifying the largest distance as the point to divide. The process continues recursively, splitting each subsequent cluster until each observation is its own cluster or until a specified number of clusters is reached. It is generally more computationally intensive than the agglomerative approach, as it requires a global view of the data at each split, making it less commonly used in very large datasets.

Hierarchical clustering is particularly useful for smaller datasets or when the relationships between data points need to be closely examined, such as in biological sciences or when clustering historical data.

Deep Learning Models

Deep learning is a subset of machine learning that employs neural networks with many layers. It is significantly different from classical machine learning techniques. While classical machine learning focuses on features that are often manually selected and engineered, deep learning aims to train neural networks to learn features. The models automate feature extraction by building complex patterns from simpler ones. This makes deep learning exceptionally powerful for tasks such as image and speech recognition, where the input data is high-dimensional and the relationships within the data are complex. However, training deep learning models requires vast amounts of data.

Artificial Neural Network

A fundamental element of deep learning is the feed-forward densely connected neural network, or Artificial Neural Network (ANN). In these networks, neurons are arranged in layers, with the first layer taking the input data and the last layer producing output. Each neuron in one layer connects to every neuron in the next layer, making the network "densely connected." These neurons have weights and biases that adjust as the network learns from data during the training process. The output of each neuron is calculated by a nonlinear activation function, which introduces the ability to capture nonlinear relationships in the data.

In ANNs, each layer of neurons can be represented by a weight matrix and a bias vector. Data is propagated forward through these layers using matrix multiplication. The output of each layer is calculated by multiplying the input data by the weight matrix and then adding the bias term. This output then passes through an activation function before it is sent to the next layer.

The activation function is crucial because it introduces non-linearity into the model, allowing the network to learn and model complex, non-linear relationships in the data. Without non-linear activation functions, the network, regardless of its depth, would still behave just like a single-layer perceptron, which can only learn linear boundaries.
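The forward pass through a dense layer, activation(W x + b), can be illustrated with plain lists. The 2-3-1 network below uses weights and biases I invented for the example; a real network would learn them during training.

```python
import math

def dense_layer(inputs, weights, biases, activation):
    """One densely connected layer: activation(W x + b).

    `weights` is a list of rows, one row of input weights per neuron.
    """
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical network: 2 inputs -> 3 hidden neurons -> 1 output.
hidden_w = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
hidden_b = [0.0, 0.1, -0.1]
out_w = [[1.0, -1.0, 0.5]]
out_b = [0.2]

x = [1.0, 2.0]
hidden = dense_layer(x, hidden_w, hidden_b, relu)
output = dense_layer(hidden, out_w, out_b, sigmoid)
print(hidden, output)
```

If both layers used the identity function instead of relu and sigmoid, the two layers would collapse into a single linear transformation, which is the point made above about non-linearity.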

Convolutional Neural Network

An alternative to basic ANNs is the Convolutional Neural Network (CNN). Unlike densely connected networks where every input is connected to each neuron, CNNs operate over volumes of pixels and use filters to create feature maps that summarize the presence of detected features in the input, such as edges in images. This makes CNNs highly efficient for tasks that involve spatial hierarchies, as they reduce the number of parameters needed, reducing the computational burden.

Convolutional Neural Networks are a specialized kind of neural network for processing data that has a grid-like topology, such as images. CNNs use filters that perform convolution operations as the filter slides over the input to create a feature map that summarizes the presence of detected features in the input. This makes them exceptionally efficient for image-related tasks.

CNNs leverage the mathematical operation of convolution, a fundamental technique in digital signal processing (DSP). In the context of DSP, convolution is used to alter a signal with a filter, extracting important features. Similarly, in CNNs, convolution involves applying a filter over an image to produce a feature map. This process effectively allows the network to detect similarities or specific features in the image that correspond to the filter. For example, a filter might learn to detect edges or specific shapes.
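To show what sliding a filter over an image means, here is a small sketch with a hand-written vertical edge filter. Note the assumptions: no padding, stride 1, and no kernel flipping (strictly speaking this is cross-correlation, which is what deep learning libraries usually call convolution). In a real CNN the filter values are learned rather than written by hand.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

# Hypothetical 3x4 image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-written filter that responds to dark-to-bright vertical edges.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, vertical_edge)
print(feature_map)  # [[0, 2, 0], [0, 2, 0]]
```

The feature map is non-zero only in the column where the brightness changes, which is exactly where the edge is.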

As the input image is processed through successive convolutional layers, the CNN uses multiple filters at each layer to search for increasingly complex patterns. The first layer may detect simple edges or textures, while deeper layers can recognize more complex features like parts of objects or entire objects.

Gradient Descent and Training Neural Networks

Gradient descent is a fundamental optimization algorithm used in training neural networks and other machine learning models. It works by iteratively adjusting the model's parameters to minimize the loss function, which measures how well the model's predictions match the actual data. In each step, the algorithm computes the gradient of the loss function with respect to the model parameters, and moves the parameters in the direction that reduces the loss.
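A one-parameter example makes the update rule easy to follow. The loss function, learning rate and step count below are arbitrary choices of mine for illustration.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly move the parameter against the gradient of the loss."""
    x = start
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x

# Hypothetical loss: loss(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
# Its minimum is at w = 3.
best_w = gradient_descent(lambda w: 2 * (w - 3), start=0.0)
print(best_w)
```

With a learning rate that is too large (above 1.0 for this particular loss) the updates overshoot and diverge, which is why the step size is one of the first hyperparameters people tune.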

Backpropagation is the technique used to compute these gradients efficiently in neural networks. It involves two phases:

A forward pass, where input data is passed through the network to generate predictions.
A backward pass, where the gradient of the loss function is computed based on the prediction. It is later propagated back through the network to update the weights.

This process leverages the chain rule of calculus to compute gradients, ensuring that each weight is adjusted in proportion to its contribution to the overall error. Together, Gradient Descent and Backpropagation enable neural networks to learn from data by iteratively improving their accuracy.

The Loss Functions

A loss function, also known as a cost function or error function, plays a critical role in guiding the training process. It quantifies the difference between the predicted outputs of the network and the actual target values. This metric provides a concrete measure of how well the network is performing. The goal of training is to minimize this loss, thereby optimizing the model's parameters.

Commonly used loss functions in ANNs vary depending on the specific type of task:

For regression tasks, where the goal is to predict continuous values, the Mean Squared Error (MSE) loss is frequently used. MSE calculates the average of the squares of the differences between the predicted and actual values, penalizing larger errors more severely.
For classification tasks, where the output is a class label, Cross-Entropy Loss is commonly employed. This loss function measures the dissimilarity between the true label distribution and the predictions provided by the model.
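Both loss functions are short enough to write out by hand. The sketch below uses my own function names; the cross-entropy version expects a one-hot (or probability) target vector and clips predictions to avoid log(0).

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def cross_entropy(true_dist, predicted_probs, eps=1e-12):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(
        t * math.log(max(p, eps))  # clip to avoid log(0)
        for t, p in zip(true_dist, predicted_probs)
    )

# Regression: one prediction is off by 2, so the squared error is 4, averaged over 3.
print(mean_squared_error([1, 2, 3], [1, 2, 5]))

# Classification: the true class is the second one.
confident = cross_entropy([0, 1, 0], [0.1, 0.8, 0.1])
unsure = cross_entropy([0, 1, 0], [0.4, 0.3, 0.3])
print(confident, unsure)
```

The confident, correct prediction gets a lower loss than the unsure one, which is the behaviour training tries to reward.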

The Vanishing Gradient Problem and ReLU

One significant challenge when building deep neural networks is the vanishing gradient problem. The gradients used in the training process can become so small that the weights in the early layers barely change, which prevents the network from learning effectively.

This issue is particularly prominent with sigmoid or tanh activation functions. To mitigate this, deep learning has adopted the Rectified Linear Unit (ReLU) activation function. ReLU is defined as ReLU(x) = max(0, x), where x represents the input to a neuron. For positive inputs its derivative is exactly 1, which helps maintain a stronger gradient during training and allows deeper networks to learn effectively without the gradients vanishing. This simplicity and efficiency in introducing nonlinearity without shrinking the gradient make ReLU a popular choice in deep learning architectures.
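A rough back-of-the-envelope sketch shows why this matters. During backpropagation the local derivatives of the activations are multiplied layer after layer. The example below ignores the weights entirely and assumes every layer sees the same pre-activation value of 2.0, which is a simplification I chose purely for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid; never larger than 0.25."""
    s = sigmoid(x)
    return s * (1.0 - s)

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

depth = 10            # hypothetical number of layers
pre_activation = 2.0  # hypothetical input seen by each activation

sigmoid_chain = sigmoid_grad(pre_activation) ** depth
relu_chain = relu_grad(pre_activation) ** depth
print(sigmoid_chain, relu_chain)
```

After ten layers the sigmoid factor has shrunk by roughly ten orders of magnitude, while the ReLU factor is still 1.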

Specialized Machine Learning Techniques

The course progressed into a variety of more specialized machine learning techniques, each tailored to specific applications and domains.

Natural Language Processing

Natural Language Processing (NLP) involves the application of computational techniques to the analysis and synthesis of natural language and speech. One of the main challenges in using machine learning for NLP is that text data is inherently unstructured and high-dimensional. Text must be converted into a numerical format that machine learning algorithms can process, a task complicated by the nuances of language such as syntax, semantics, and context.

The Bag of Words

The Bag of Words (BoW) model addresses this by transforming text into fixed-length vectors by counting how frequently each word appears in a document, ignoring the order and context of words. This method simplifies text data, making it manageable for basic machine learning models and serving as a foundational technique for text classification tasks, such as spam detection or sentiment analysis. However, the simplicity of the BoW model and its disregard for word order and semantic context limit its effectiveness for more complex language tasks.
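Here is a minimal sketch of the idea. It assumes the simplest possible tokenization (lowercasing and splitting on whitespace); real pipelines usually also strip punctuation, remove stop words and apply stemming. The sample documents are made up.

```python
from collections import Counter

def bag_of_words(documents):
    """Turn documents into fixed-length word-count vectors over a shared vocabulary."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

docs = ["free prize claim free", "meeting at noon", "claim your prize at noon"]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)
for vector in vectors:
    print(vector)
```

Every document becomes a vector of the same length, one position per vocabulary word, regardless of how long the original text was.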

Reinforcement Learning with UCB and Thompson Sampling

Reinforcement Learning (RL) is a type of machine learning where an agent learns to make decisions by interacting with an environment. It differs from supervised learning in that correct input/output pairs are never presented, nor are sub-optimal actions explicitly corrected. The agent's strategy evolves by balancing exploration (trying new things) and exploitation (using known information) in its decision-making.

The agent takes actions based on a policy, receives feedback through rewards or punishments, and updates its policy to maximize long-term rewards. Two notable strategies in RL that help manage the exploration-exploitation dilemma are the Upper Confidence Bound (UCB) and Thompson Sampling.

UCB is an algorithm that manages exploration by selecting actions that either have high estimated rewards or have not been tried often. The idea is to balance the known rewards with the potential of finding higher rewards in lesser-tried actions. UCB does this by constructing confidence bounds around the estimates of action rewards and choosing the action with the highest upper confidence bound. This approach systematically reduces uncertainty and improves decision-making over time.
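As an illustration, here is a sketch of UCB1, one common variant of the algorithm (the course may use a slightly different exploration bonus). The simulated environment is my own assumption: each action pays a reward of 1 with a fixed hidden probability, like ads with different click rates.

```python
import math
import random

def ucb1(success_rates, rounds, seed=0):
    """Simulate UCB1 on actions that pay 1 with the given hidden probabilities.

    Returns how many times each action was selected.
    """
    rng = random.Random(seed)
    n_arms = len(success_rates)
    counts = [0] * n_arms    # how often each action was tried
    totals = [0.0] * n_arms  # sum of rewards per action
    for t in range(1, rounds + 1):
        if t <= n_arms:
            arm = t - 1  # try every action once before using the formula
        else:
            # Average reward plus a bonus that shrinks as an action is tried more.
            arm = max(
                range(n_arms),
                key=lambda a: totals[a] / counts[a]
                + math.sqrt(2 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < success_rates[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
    return counts

# Hypothetical click rates of three ads; the third one is the best.
selections = ucb1([0.2, 0.5, 0.8], rounds=5000)
print(selections)
```

The weaker actions still get tried from time to time, but most selections should end up going to the action with the highest hidden success rate.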

Thompson Sampling takes a Bayesian approach to the exploration-exploitation problem. It involves sampling from the posterior distributions of the rewards for each action and selecting the action with the highest sample. This method allows for a more probabilistic exploration based on the known performance of actions, dynamically balancing between exploring new actions and exploiting the known ones based on their reward probability distributions.
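A sketch of Thompson Sampling for the same kind of yes/no rewards looks like this. It assumes a Beta(1, 1) prior for every action and a simulated environment with hidden success probabilities, both of which are choices I made for the example.

```python
import random

def thompson_sampling(success_rates, rounds, seed=0):
    """Simulate Thompson Sampling with Beta posteriors over each action's success rate.

    Returns the number of wins and losses recorded per action.
    """
    rng = random.Random(seed)
    n_arms = len(success_rates)
    wins = [0] * n_arms
    losses = [0] * n_arms
    for _ in range(rounds):
        # Draw one plausible success rate per action from its posterior.
        samples = [
            rng.betavariate(wins[a] + 1, losses[a] + 1) for a in range(n_arms)
        ]
        arm = max(range(n_arms), key=lambda a: samples[a])
        # Observe a reward from the simulated environment and update the posterior.
        if rng.random() < success_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return wins, losses

# Hypothetical hidden success rates; the third action is the best.
wins, losses = thompson_sampling([0.2, 0.5, 0.8], rounds=2000)
plays = [w + l for w, l in zip(wins, losses)]
print(plays)
```

Actions with little data have wide posteriors, so they occasionally produce a high sample and get explored; as evidence accumulates, the best action wins the draw almost every time.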

Both UCB and Thompson Sampling are powerful techniques in situations where the learning environment is initially unknown to the agent, allowing for systematic exploration and optimized learning based on the feedback received from the environment. These methods are particularly useful in real-time decision-making scenarios like A/B testing or network routing.

Dimensionality Reduction Techniques

Principal Component Analysis (PCA) is a statistical technique used for dimensionality reduction while preserving as much variance as possible. It works by identifying so-called principal components - the directions along which the variance of the data is maximized. It reduces the dimension of the data by transforming the original variables into a new set of orthogonal variables. Orthogonality makes these new variables uncorrelated, and the first few of them account for most of the variance in the data. This is particularly useful in reducing the number of variables in data while maintaining the relationships that contribute most to its variance. By transforming the data into a new set of dimensions with reduced complexity, PCA helps in visualizing high-dimensional data, speeding up learning algorithms, and removing noise.
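To make the idea of a direction of maximum variance tangible, the sketch below finds only the first principal component using power iteration on the covariance matrix. This is one simple way to do it by hand and not how libraries typically implement PCA. The data points are invented and lie roughly along the line y = x.

```python
import math

def first_principal_component(points, iterations=100):
    """Estimate the direction of maximum variance via power iteration.

    Assumes the all-ones start vector is not orthogonal to that direction.
    """
    n, dims = len(points), len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dims)]
    centered = [[p[d] - means[d] for d in range(dims)] for p in points]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(dims)]
        for i in range(dims)
    ]
    v = [1.0] * dims
    for _ in range(iterations):
        # Multiply by the covariance matrix and renormalize to unit length.
        w = [sum(cov[i][j] * v[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical 2D data scattered around the line y = x.
data = [(1, 1), (2, 2.1), (3, 2.9), (4, 4.2), (5, 4.8)]
component = first_principal_component(data)
print(component)
```

The resulting unit vector points roughly along the diagonal, so projecting the points onto it would keep almost all of the variance while reducing two dimensions to one.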

Linear Discriminant Analysis (LDA), on the other hand, is also a dimensionality reduction technique but focuses on maximizing the separability among known categories. It tries to model the difference between the classes of data. LDA achieves this by finding a linear combination of features that separates classes. The resulting combination can be used as a linear classifier or for dimensionality reduction before later classification.

PCA and LDA serve slightly different purposes:

PCA is unsupervised, focusing on variance in the data.
LDA is supervised, focusing on maximizing class separability.

Modern Model Selection and Boosting Techniques

The latter part of the course explores advanced model selection strategies and introduces boosting. Boosting works by combining multiple weak learners into a stronger model in a sequential manner. Each learner in the sequence focuses on the errors made by the previous one, gradually improving the model's accuracy. The learners are usually simple models like decision trees, and each one contributes incrementally to the final decision, making the ensemble stronger than any individual model alone.
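The sequential error-correcting idea can be sketched with one-split decision trees (stumps) on a tiny regression problem. This follows the gradient boosting flavour, where each new learner is fitted to the residuals of the ensemble so far. The data, learning rate and number of rounds are hypothetical, and the code assumes at least two distinct x values.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def boost(xs, ys, rounds=20, learning_rate=0.5):
    """Add stumps one by one, each fitted to the errors left by the previous ones."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left, right = fit_stump(xs, residuals)
        stumps.append((threshold, left, right))
        predictions = [
            p + learning_rate * (left if x <= threshold else right)
            for x, p in zip(xs, predictions)
        ]
    return base, stumps, predictions

def mse(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)

xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, 2, 2, 5, 5]
base, stumps, predictions = boost(xs, ys)
print(mse(ys, [base] * len(ys)), mse(ys, predictions))
```

A single stump can only produce two distinct output values, but the sum of many small corrections tracks the three-level pattern in the data closely.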

Extreme Gradient Boosting

One of the most popular implementations of this technique is Extreme Gradient Boosting (XGBoost), which stands out due to its efficiency and effectiveness across a wide range of predictive modeling tasks.

Conclusion

The "Machine Learning A-Z: AI, Python & R" course is a great starting point for anyone interested in machine learning. It covers a lot of important topics and gives a broad overview, but it’s just the beginning.

Finishing this course won’t make you an expert ready for a specialized machine learning job right away. Instead, think of it as a first step. It helps you understand the basics and shows you what parts of machine learning might be most interesting to you.

9 months of Machine Learning and beyond: before I've started

Alexey Tukalo — Mon, 06 May 2024 22:00:00 +0000

For this post, I want to look back at my earlier attempts to learn machine learning and data science, and discuss my current learning strategy.

After university, I worked as a graphics programmer, developing tools for data visualization. However, I was also curious about data analysis. I admired people who could tell stories with data and wanted to acquire that skill. I eventually decided to pursue the Data Analysis with R Specialization on Coursera.

I considered many options to learn data analysis. Most courses on the topic focused on people without much computer science background, covering basics of Python and R, data visualization, RDBMS, and SQL databases, with a bit of data analysis using statistics at the end. Given my background, I didn't need to cover these basics, so I looked for a program focused on foundational knowledge, especially statistics.

The Data Analysis Specialization fit my needs perfectly. In hindsight, my only regret is not choosing a course that covered a similar curriculum in Python, which I find to be a more versatile language. Originally, the specialization consisted of five courses but has since been shortened to three.

Data Analysis with R Specialization on Coursera

The program included four theory-oriented courses. They had theoretical videos, interviews with industry experts, quizzes, and peer-reviewed data analysis tasks. The tasks involved writing reports in R Markdown, similar to Jupyter Notebooks but for R, and submitting them for peer review. Reviewing other students' reports was a crucial part of the learning process, as it offered different perspectives on the same dataset and introduced useful R tricks.

The specialization also concluded with a practical course built around a large data analysis project, similar to the peer-reviewed reports but on a larger scale.

Introduction to Probability and Data with R

The goal of the first course was to understand the fundamentals of probability and data analysis. This course covered key concepts such as the difference between observational studies and experiments, which is crucial for designing studies and interpreting data accurately. It emphasized the importance of eliminating bias, an essential skill for anyone working with data, as bias can skew results and lead to incorrect conclusions.

The course also provided an introduction to probability theory, helping students understand the likelihood of various events occurring. It explored different types of probability distributions, such as Normal and Binomial, which are fundamental to many statistical analyses and real-world phenomena. The course taught how to quantitatively describe a dataset using measures like mean, median, and standard deviation, essential for summarizing data. Additionally, it introduced the concept of robust statistics, which offer metrics that are more tolerant to outliers in the data, ensuring that analyses are not unduly influenced by anomalies.

Inferential Statistics

The second course taught how to make informed decisions based on data. This course introduced tools for estimating the probability of a given dataset occurring by chance, which is vital for making evidence-based decisions in various fields, from business to scientific research. The course covered hypothesis testing, a key technique for determining whether a result is statistically significant, and introduced the concepts of false positive and false negative errors. Understanding these errors is crucial, as they can affect decision-making. For example, in medical testing, a false negative could miss a diagnosis, while a false positive could lead to unnecessary treatment. This course highlighted the importance of balancing the risk of each error type based on the specific context of a problem.
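Although the specialization itself used R, the logic of a hypothesis test is easy to sketch in Python. The example below is my own: it computes an exact one-sided binomial p-value for the question of whether 60 heads in 100 flips is compatible with a fair coin, using the conventional 0.05 significance level.

```python
from math import comb

def binomial_p_value(successes, trials, p_null=0.5):
    """One-sided p-value: probability of at least `successes` if the null is true."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Hypothetical experiment: 60 heads out of 100 flips of a supposedly fair coin.
p_value = binomial_p_value(60, 100)
significant = p_value < 0.05  # conventional significance level
print(round(p_value, 4), significant)
```

The significance level is exactly the false positive rate we are willing to accept: lowering it reduces false positives at the cost of more false negatives.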

Linear Regression and Modeling

The third course delved into linear regression, one of the most fundamental and widely used statistical and machine learning models. The main goal of linear regression is to fit a linear function to a set of data points, establishing a relationship between a dependent variable and one or more independent variables. This method is used extensively to predict outcomes based on input variables and has applications across numerous industries, from predicting sales to understanding the relationship between factors. The course provided an in-depth introduction to linear regression, including techniques for managing outliers to improve model accuracy, methods for evaluating model performance, and strategies for selecting optimal parameters in multiple linear regression. This knowledge is invaluable for anyone looking to develop predictive models or understand the relationships between different variables.
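For a single independent variable, the best-fitting line has a closed-form least-squares solution: the slope is the covariance of x and y divided by the variance of x. A minimal sketch in Python (the course itself used R, and the data here is made up):

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for one variable: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical data that follows y = 2x + 1 exactly.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)

# Predict the outcome for a new input.
prediction = slope * 6 + intercept
```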

Bayesian Statistics

The last theoretical course introduced a different approach to probability. It explained the difference between the Frequentist and Bayesian approaches, with the Bayesian interpretation offering a flexible framework for applying statistical methods in a wide range of applications. The Bayesian approach treats probability as a degree of belief rather than a fixed frequency, allowing for continuous updating of beliefs based on new evidence according to Bayes’ Rule. This flexibility makes Bayesian statistics useful in dynamic environments where conditions change, such as finance or real-time decision-making. The course demonstrated how this approach can lead to more nuanced and adaptable models.
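Updating a belief with Bayes' Rule takes only a couple of lines. The medical test numbers below are hypothetical and chosen to show how a belief changes as evidence arrives, and the second update assumes the two test results are independent.

```python
def bayes_update(prior, likelihood_if_true, likelihood_if_false):
    """Posterior probability of a hypothesis after observing one piece of evidence."""
    evidence = prior * likelihood_if_true + (1 - prior) * likelihood_if_false
    return prior * likelihood_if_true / evidence

# Hypothetical numbers: 1% of people have the condition, the test detects it
# 95% of the time and gives a false positive 5% of the time.
after_one_positive = bayes_update(0.01, 0.95, 0.05)

# The posterior becomes the new prior when a second positive result arrives.
after_two_positives = bayes_update(after_one_positive, 0.95, 0.05)
print(round(after_one_positive, 3), round(after_two_positives, 3))
```

A single positive result raises the probability from 1% to about 16%, and a second one raises it to about 78%, which is the continuous updating of beliefs described above.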

Outcome of the study

While the knowledge wasn't directly applicable to my day-to-day tasks, it enhanced my decision-making and helped me understand research papers and news. However, since it was not immediately useful at work, I lost the motivation to dive deeper into the topic and focused instead on other skills like databases, cloud computing, algorithms, and web development.

Back to ML

As I mentioned in my previous post, I was impressed by ChatGPT and Midjourney, which reignited my interest in machine learning. I had a basic understanding of artificial neural networks, but struggled to understand modern models. Having experience with both Udemy and Coursera, I found Coursera too intense for my needs, as I wasn’t planning to switch to a machine learning job but simply wanted to understand the field. I didn’t want to spend time on the quizzes and programming tasks typical of Coursera's computer science courses, and I considered Coursera a bit pricey.

In contrast, I had already purchased several machine learning courses on Udemy, and the platform offers a wide variety of courses from different instructors. In my self-study, I've always been concerned with long-term retention of knowledge, particularly when it's not in active use, and I find interval-based training helpful for that.

My Learning Approach

In interval-based training, I buy several courses on the same topic by different instructors, mixing courses on different computer science areas to space out similar topics. This approach has two benefits:

First, the initial exposure to a new topic serves as an introduction, forming a "skeleton" of knowledge. After a break, I revisit the topic through another course. This repetition over time enhances long-term retention, an effect supported by research on memory.
Second, different instructors approach the same material from different angles, which enables a more comprehensive understanding of the topic.

Wrapping Up

Starting with the next post, I'll review the courses I've taken, beginning with foundational machine learning courses I completed early on.

9 months of Machine Learning and beyond

Alexey Tukalo — Mon, 29 Apr 2024 18:56:32 +0000

Hi, my name is Alex, and I'm a full-stack software engineer. Nine months ago, I decided to take a couple of machine learning courses to better understand the ongoing AI revolution, and I've become increasingly fascinated by this topic.

I'm excited to share my journey into the world of machine learning with you. This isn't going to be a short story. I intend to write a series of posts starting with my background, explaining my learning strategy, and later reviewing the courses I've taken and will take. My reviews will briefly discuss the topics covered by the courses.

Why am I doing this? First and foremost, I do it for myself. I hope this series will document and solidify my learning path in machine learning. It will help me to better understand and remember what I've learned. Additionally, I hope it can serve as a resource for others interested in making their first steps in machine learning and AI.

My Early Years in Tech

My university program focused mostly on C and programming micro-controllers. However, what truly captivated me were data visualization and computer graphics. Blending my childhood passion for visual art with computer science seemed like a natural progression.

That kick-started my career in graphics programming, where I developed both 2D and 3D rendering algorithms and even wrote my Bachelor's thesis on Volume Rendering. Despite enjoying this field, I noticed the limited market demand for graphics programmers, especially for those not versed in C++. My tools of choice were C# with DirectX and later JavaScript with WebGL, which somewhat limited my opportunities in the more traditional, C++-dominated areas.

Simultaneously, I developed a curiosity for Machine Learning and Data Science. However, observing the rapidly growing interest in this area, I decided to proceed cautiously, building my knowledge gradually while focusing on other professional avenues.

Becoming a Full-Stack Engineer

The more robust job market compared to the niche field of computer graphics, along with the valuable opportunity to learn how to build large systems from scratch, prompted my move from graphics programming to frontend and eventually full-stack web development. This shift steered my career towards developing complex, client-facing platforms. Additionally, this change seamlessly aligned with my lifelong interest in startups.

In an effort to strengthen my computer science fundamentals, I enrolled in the Functional Programming in Scala Specialization. While this program was not directly related to data analysis, it provided insights into big data processing with Spark. Over time, however, as I moved away from programming in a functional style and as Scala was increasingly overshadowed by advancements in Java and Kotlin, I found fewer opportunities to apply this knowledge in production.

Understanding the crucial role of statistics in Machine Learning, I also pursued a Data Analysis with R Specialization. This program was practical in nature, focusing less on theoretical mathematical concepts and more on the hands-on application of statistical methods to analyze real-world datasets. The curriculum covered foundational statistical techniques progressing to more complex topics like inferential statistics, linear regression models, and an introduction to Bayesian statistics. This approach helped me grasp how to effectively use statistical tools to extract insights and make data-driven decisions.

As a full-stack web developer, however, I quickly realized that to truly thrive, I needed to deepen my understanding of the entire software engineering tech stack. This led me to delve into algorithms, infrastructure, cloud technologies, databases, as well as UX/UI design and software architecture. As a result, these areas became the primary focus of my self-development for many years.

AI Industrial Revolution

I still have much to learn regarding general software engineering, especially in such wide areas as DevOps or database engineering, but recent developments in LLMs and diffusion models made me realize that I have to learn machine learning to understand what is going on in the world.

While I may not fully transition into a machine learning career, the knowledge I gain could prove invaluable, especially in a startup environment where versatility is crucial. Additionally, our software systems often interact with a messy world of fuzzy and noisy data. Understanding how to effectively manage and utilize this data is essential.

What is Next?

The next article will feature a detailed review of the Data Analysis with R Specialization, marking my initial foray into the realm of data processing. I’ll also share insights into my learning approach.

Subsequent posts will review additional courses I’ve taken or will take, exploring how my understanding of machine learning continues to evolve with new information.