<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Marcos</title>
    <description>The latest articles on DEV Community by Marcos (@marcostx).</description>
    <link>https://dev.to/marcostx</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F969468%2Fd7dc867a-7ee7-4364-9d5b-98f6cb0e7ae9.jpeg</url>
      <title>DEV Community: Marcos</title>
      <link>https://dev.to/marcostx</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/marcostx"/>
    <language>en</language>
    <item>
      <title>Paper Notes - From Mind to Machine: The Rise of Manus AI as a Fully Autonomous Digital Agent</title>
      <dc:creator>Marcos</dc:creator>
      <pubDate>Sat, 26 Jul 2025 20:18:54 +0000</pubDate>
      <link>https://dev.to/marcostx/paper-notes-from-mind-to-machine-the-rise-of-manus-ai-as-a-fully-autonomous-digital-agent-546a</link>
      <guid>https://dev.to/marcostx/paper-notes-from-mind-to-machine-the-rise-of-manus-ai-as-a-fully-autonomous-digital-agent-546a</guid>
      <description>&lt;h1&gt;
  
  
  Manus AI Research Paper Summary
&lt;/h1&gt;

&lt;h2&gt;
  
  
  1. Paper Metadata
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Minjie Shen¹ and Qikai Yang²&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Publication Venue:&lt;/strong&gt; arXiv&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Year of Publication:&lt;/strong&gt; May 2025&lt;br&gt;&lt;br&gt;
&lt;strong&gt;DOI/URL:&lt;/strong&gt; arXiv:2505.02024v1&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Key Objectives &amp;amp; Research Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What problem does the paper address?
&lt;/h3&gt;

&lt;p&gt;The paper reviews Manus AI, an important player in the agentic AI systems landscape.&lt;/p&gt;

&lt;h3&gt;
  
  
  What are the main research questions/hypotheses?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Provide a comprehensive overview and examination of Manus AI&lt;/li&gt;
&lt;li&gt;Examine its architecture&lt;/li&gt;
&lt;li&gt;Explore its applications in industry&lt;/li&gt;
&lt;li&gt;Compare it with other technologies from OpenAI, Google, DeepMind, and Anthropic to highlight where Manus stands out&lt;/li&gt;
&lt;li&gt;Discuss limitations and future improvements&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why is this research important for LLMs?
&lt;/h3&gt;

&lt;p&gt;Given the impact of this new agentic solution, deep-dive efforts like this are important for evaluating its internals (from an outsider's perspective) and expanding the discussion.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Methodology &amp;amp; Approach
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Model Architecture
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Multi-agent architecture&lt;/strong&gt; with three complementary agents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Planner Agent:&lt;/strong&gt; Breaks down the user request into manageable sub-tasks and produces a step-by-step plan to achieve the outcome&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution Agent:&lt;/strong&gt; Takes the plan and invokes the needed operations or tools to perform the required actions for each step&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verification Agent:&lt;/strong&gt; The quality-control component. It watches the Execution Agent's actions, checks results for accuracy and completeness, guarantees they meet the expected requirements, and can correct course or re-trigger planning if needed&lt;/li&gt;
&lt;/ul&gt;
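&lt;p&gt;The three-agent flow above can be sketched as a simple plan/execute/verify loop. This is a minimal illustration with hypothetical function names; the paper does not disclose Manus AI's actual implementation:&lt;/p&gt;

```python
def plan(request):
    # A real planner would call an LLM; here we fake a fixed decomposition.
    return [f"step {i} of '{request}'" for i in (1, 2, 3)]

def execute(step):
    # A real executor would invoke the tool or API each step requires.
    return f"result of {step}"

def verify(step, result):
    # A real verifier would check accuracy and completeness, possibly via an LLM.
    return result.startswith("result of")

def run_agent(request, max_retries=2):
    results = []
    for step in plan(request):
        for _attempt in range(max_retries + 1):
            result = execute(step)
            if verify(step, result):
                results.append(result)
                break
        else:
            # Verification kept failing: a real system would re-trigger planning here.
            raise RuntimeError(f"could not complete {step!r}")
    return results

print(run_agent("summarize stock news"))
```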

&lt;h3&gt;
  
  
  Tool Integration Capability
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Interfaces with external applications and APIs&lt;/li&gt;
&lt;li&gt;Browses the web (e.g., can call a browser to retrieve stock prices)&lt;/li&gt;
&lt;li&gt;Invokes these tools via natural language&lt;/li&gt;
&lt;li&gt;This capability lets Manus extend its knowledge beyond the model weights, accessing real-time information and specialized functions&lt;/li&gt;
&lt;/ul&gt;
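&lt;p&gt;A common way to implement this kind of tool integration is a registry mapping tool names to callables, against which the model's natural-language tool call is resolved. The scheme and names below are assumptions for illustration, not Manus AI's real mechanism:&lt;/p&gt;

```python
TOOLS = {}

def tool(name):
    # Decorator that registers a callable under a tool name.
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("get_stock_price")
def get_stock_price(symbol):
    # A real tool would hit a market-data API; this stub returns a fixed value.
    return {"symbol": symbol, "price": 123.45}

def call_tool(name, **kwargs):
    # The dispatch step a natural-language tool call would ultimately resolve to.
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**kwargs)

print(call_tool("get_stock_price", symbol="ACME"))
```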

&lt;h3&gt;
  
  
  Training Techniques
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;RLHF (Reinforcement Learning from Human Feedback)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Adapts to open-ended/unfamiliar situations instead of following fixed rules, as many AI systems do&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key difference:&lt;/strong&gt; Context-aware decision making&lt;/li&gt;
&lt;li&gt;Maintains an internal memory slot holding context about intermediate results as it works through the problem&lt;/li&gt;
&lt;li&gt;This allows dynamic tracking of task state, informing the next action&lt;/li&gt;
&lt;li&gt;Incorporates human-like reasoning, inferring user goals and using critical thinking to automatically establish the steps to achieve them&lt;/li&gt;
&lt;/ul&gt;
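&lt;p&gt;The "internal memory slot" idea can be pictured as a scratchpad of intermediate results whose accumulated context informs each next action. The names below are assumptions; the paper gives no implementation details:&lt;/p&gt;

```python
class Scratchpad:
    # Accumulates intermediate results so each next action can see prior state.
    def __init__(self):
        self.entries = []

    def remember(self, step, result):
        self.entries.append({"step": step, "result": result})

    def context(self):
        # The running context a real agent would feed into its next prompt.
        return "; ".join(f"{e['step']}={e['result']}" for e in self.entries)

pad = Scratchpad()
pad.remember("fetch price", "123.45")
pad.remember("compare to yesterday", "up 2 percent")
print(pad.context())
```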

&lt;h3&gt;
  
  
  Environment
&lt;/h3&gt;

&lt;p&gt;Creates a controlled runtime environment&lt;/p&gt;

&lt;h3&gt;
  
  
  Modality
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-modal and multitask learning:&lt;/strong&gt; text, image, audio, code (inputs/outputs)&lt;/li&gt;
&lt;li&gt;Large and scalable neural network architecture to handle this type of data&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Evaluation Metrics
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GAIA test:&lt;/strong&gt; Benchmark to evaluate AI ability to reason, use tools, and automate real-world tasks

&lt;ul&gt;
&lt;li&gt;Outperformed GPT-4&lt;/li&gt;
&lt;li&gt;Exceeded the previous leader in GAIA by 65%&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Objective completion (during training):&lt;/strong&gt; RLHF guided by a reward mechanism for successfully completed objectives&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  4. Key Findings &amp;amp; Contributions
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Manus AI is a general-purpose AI agent introduced in early 2025 by a Chinese company called Monica.im&lt;/li&gt;
&lt;li&gt;Focus on planning, executing and validating complex end-to-end tasks to produce solid results&lt;/li&gt;
&lt;li&gt;Eliminates the need for step-by-step prompting, which is a game changer&lt;/li&gt;
&lt;li&gt;Combines large-scale machine learning models with an intelligent agent framework, setting it apart as a breakthrough in autonomous artificial intelligence&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  5. Strengths &amp;amp; Limitations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Strengths
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Autonomous work:&lt;/strong&gt; Requires less human interaction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Versatility:&lt;/strong&gt; A sophisticated generalist with consistent results across different modalities and domains&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State-of-the-art results:&lt;/strong&gt; Tops benchmarks evaluating AI reasoning, tool use, and real-world task automation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool use:&lt;/strong&gt; Highly effective at integrating with external tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adaptive learning:&lt;/strong&gt; Improves based on user interaction&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Limitations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Explainability:&lt;/strong&gt; An opaque decision-making process; it is not easy to trace why the system makes a given decision&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability:&lt;/strong&gt; The Verification Agent is not infallible and does not prevent the inner models from hallucinating&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security and privacy:&lt;/strong&gt; Manus often needs access to external data sources that may contain sensitive information, raising security concerns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Computational resources:&lt;/strong&gt; The multi-agent architecture can demand significant processing power, implying high costs for real-world applications&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ethical issues:&lt;/strong&gt; Fully automating decisions raises problems such as wrong judgments in financial processes or bias in legal decisions&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  6. Critical Analysis &amp;amp; Personal Insights
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;It is telling that when the authors reflect on social impacts, they cite no prior work and merely reproduce common-sense views about AI's impact on society&lt;/li&gt;
&lt;li&gt;The results mentioned in the benchmark section are vague&lt;/li&gt;
&lt;li&gt;Like many AI papers, it has a promotional tone in many parts, e.g. "significant leap in AI capabilities"&lt;/li&gt;
&lt;li&gt;The architecture discussion lacks depth, offering only a high-level explanation&lt;/li&gt;
&lt;li&gt;The discussion of ethical safeguards is shallow; more open discussion is clearly needed, given the paper's focus on a fully autonomous system&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>llm</category>
      <category>manus</category>
      <category>ai</category>
    </item>
    <item>
      <title>Reflections on the talk "AI and the ethical and social challenges" by Mark Coeckelbergh</title>
      <dc:creator>Marcos</dc:creator>
      <pubDate>Fri, 25 Oct 2024 01:42:57 +0000</pubDate>
      <link>https://dev.to/marcostx/reflexoes-sobre-a-palestra-ia-e-os-desafios-eticos-e-sociais-de-mark-coeckelbergh-482l</link>
      <guid>https://dev.to/marcostx/reflexoes-sobre-a-palestra-ia-e-os-desafios-eticos-e-sociais-de-mark-coeckelbergh-482l</guid>
      <description>&lt;p&gt;Disclaimer: Esse post tem o intuito de refletir e apresentar algumas críticas (perspectivas) a uma palestra sobre ética na Inteligência Artificial que participei. O único intuito é apresentar ideias e enriquecer o debate que julgo crucial para sociedade, com isso, não tenho nenhuma pretenção de julgar pessoalmente ninguém muito menos desmerecer o trabalho feito, pelo contrário, refletir a partir da obra e do que foi discutido. &lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Recently, I had the opportunity to attend a talk by the researcher Mark Coeckelbergh, author of the book AI Ethics (2020), which I also read recently. The talk, titled "AI and the ethical and social challenges", covered much of what is debated in the book and touched on relevant, more recent discussions that, given the book's age, were not systematically addressed there, such as LLMs (Large Language Models). Among the topics covered were: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI and privacy&lt;/li&gt;
&lt;li&gt;transparency&lt;/li&gt;
&lt;li&gt;regulation&lt;/li&gt;
&lt;li&gt;surveillance&lt;/li&gt;
&lt;li&gt;disinformation&lt;/li&gt;
&lt;li&gt;climate crisis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The debate also included Fernanda Martins, R&amp;amp;D director at Internet Lab (moderator of the debate), Renata Mielli, coordinator of the Brazilian Internet Steering Committee (CGI), and Diogo Cortiz, professor at PUC-SP, both relevant actors on the topic from political and technical standpoints. &lt;/p&gt;

&lt;p&gt;I will not make an in-depth analysis, because I believe both the presentation and the debate were interesting in terms of the topics covered and the problems pointed out by Mark and the other participants. Therefore, I would just like to highlight three points I consider important and explore them in this critique. &lt;/p&gt;

&lt;h3&gt;
  
  
  The ideal within bourgeois democracy
&lt;/h3&gt;

&lt;p&gt;Everything the author says about making "democracies" stronger centers on the concepts of liberal democracy, without taking into account that this is a bourgeois democracy, which under capitalism has a class bias clearly oriented toward strengthening the holders of capital. In turn, bourgeois democracy is itself the origin of many of the problems delegated solely to the scope of an "AI ethics", and making "democracies stronger" would not, by itself, solve problems that arise materially from the political and economic system in question.&lt;/p&gt;

&lt;p&gt;At one point in the talk, Mark presented a slide listing, among the important steps toward making democracy stronger, the need to "question capitalism and private property", with which I fully agree. However, both in the talk and in the book I missed a broadening of this debate, its respective problems, and possible paths for a confrontation so crucial to this end.&lt;/p&gt;

&lt;h3&gt;
  
  
  Suppression of sovereign national technologies
&lt;/h3&gt;

&lt;p&gt;One of the points raised by Renata Mielli that I found fundamental to the debate, specifically for the internal Brazilian debate and for other nations of dependent capitalism, was the difficulty of regulating development given the low capacity to develop national AI technologies, leaving only the customization of existing models produced abroad.&lt;/p&gt;

&lt;p&gt;In this sense, economic strengthening and technological advancement (or even the prospect of advancement) with the clear and fundamental aim of creating a strong national hub for the development and application of AI in Global South countries appears as a major threat to the countries of the Global North, which see them only as consumer markets, thereby blocking their emancipation and attacking their sovereignty. &lt;/p&gt;

&lt;p&gt;Obviously, this does not happen directly, but often implicitly through the functional mechanisms of bourgeois liberal democracy, which strengthens monopolies and the concentration of political power, which in turn becomes economic power, highly concentrated in the Global North. &lt;/p&gt;

&lt;p&gt;On this theme, Diogo Cortiz brought another point to the debate: the "political economy dilemma". The central idea is the dynamic in which closed Big Tech platforms use our data (and the same holds for other countries of dependent capitalism) to train models specialized in our language, with the justification of ensuring adequate calibration for our use. This dynamic, combined with continuous policies of disincentive and degradation of science in Brazil, clearly makes us hostages of Big Tech. Not to mention the bizarre role of producing valuable inputs (data from Brazilian citizens) for private companies of the Global North with nothing in return.&lt;/p&gt;

&lt;p&gt;Finally, the lack of even minimally reformist initiatives from the governments of these Global South countries to contest the status quo that crushes local technological development also contributes to this low national technological capacity for AI sovereignty in the Global South.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lack of incentive mechanisms
&lt;/h3&gt;

&lt;p&gt;Although I really enjoy the topic of ethics in Artificial Intelligence, I believe the focus is often on devising principles or a set of ideas to guide the production, dissemination, and use of AI, with little effort to understand and describe what the incentives for adopting these principles would be in a capitalist world.&lt;/p&gt;

&lt;p&gt;Without incentive mechanisms, these principles become ineffective and can easily be overridden by economic incentives, reducing them to mere marketing tools rather than effective guidance toward the ethical behavior desired and expected of the actors in this ecosystem. Treating the principles of an ethical AI as a substitute for, or as less important than, regulatory policies, which will effectively delineate practical political changes in the use of this technology, seems rather like wishful thinking.&lt;/p&gt;

&lt;p&gt;Similarly, this reminds me of the way some Big Tech companies created groups for under-represented minorities, e.g. LGBTQIA+ people, Black people, and people with disabilities, without equipping them with any kind of power to effectively put their interests on the agenda in the companies' strategic decisions. I would not call it programmed inefficiency, because I believe there is serious, well-intentioned work in the field of AI ethics to study and address the existing and potential implications of this technology, but I believe an expansion of the debate into the political field of incentives that effectively enable the use of these principles is more than necessary.&lt;/p&gt;

&lt;p&gt;An extensive and interesting discussion of this topic comes from Luke Munn, a researcher at the University of Queensland, in the article The uselessness of AI ethics (2022), so I leave a link to the study at the end as a reference.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final remarks
&lt;/h2&gt;

&lt;p&gt;Overall, it was a very productive debate, full of interesting questions for the central discussion of the ethical and social implications of the use of AI in our society. I hope to have contributed at some level to broadening this debate, which I consider indispensable for a society that manages to socialize the immense fruits of AI's technological advances.&lt;/p&gt;

&lt;p&gt;Link to the talk: &lt;iframe width="710" height="399" src="https://www.youtube.com/embed/NOm113YYMFo"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://link.springer.com/article/10.1007/s43681-022-00209-w" rel="noopener noreferrer"&gt;The uselessness of AI ethics, 2022&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>it's all about the least worst combination of trade-offs</title>
      <dc:creator>Marcos</dc:creator>
      <pubDate>Wed, 05 Jun 2024 01:02:03 +0000</pubDate>
      <link>https://dev.to/marcostx/its-all-about-the-least-worst-combination-of-trade-offs-17fc</link>
      <guid>https://dev.to/marcostx/its-all-about-the-least-worst-combination-of-trade-offs-17fc</guid>
      <description>&lt;p&gt;i remember that early in my computer science career in the industry hearing a lot of silver bullet frameworks/packages that fit all the cases you want and (apparently) didn't have any disadvantages or drawbacks. "using this MVP framework X will solve your PHP development problems", "the best API interface for Java is Y", and similar pitches were quite common. At the time, the community was infested with solution evangelists and it was so common that providing critical analysis of the usage or its limitation was like desacrating. &lt;/p&gt;

&lt;p&gt;these frameworks were often marketed with grand promises, and I saw minimal discussion of the context or implications of their usage. at first, these claims seemed incredibly attractive, especially for someone new to the field, as they promised a panacea for all the prevalent issues. however, it didn't take long to realize that every framework or package had its own set of limitations, and the context in which they were used mattered significantly. this background made me more appreciative of any resource that honestly addressed the complexity of architectural decisions.&lt;/p&gt;

&lt;p&gt;last week, I read the book "Software Architecture: The Hard Parts", which I really liked. by the way, the hard parts the author refers to are the difficult choices and the foundational parts of a software design - the ones that should change less than the "soft" ones. the book focuses mainly on the architectural decisions involved in modern software development and offers several guides on how to better evaluate the implications of different solutions to these problems: trade-off analysis. the rigorous attention to decision-making frameworks ensures that readers are not just passively absorbing information but actively engaging with the flow of thinking to understand the nuanced consequences of each choice. the author discusses how these decisions can have long-lasting impacts on maintainability, scalability, and performance, emphasizing that what works well in one context might become a bottleneck in another - which, for me, makes this book special. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9xgfetole6fqw3u3lk4w.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9xgfetole6fqw3u3lk4w.jpg" alt="Image description" width="762" height="1000"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;the author demystifies the concept of a one-size-fits-all solution for software architecture by focusing on identifying the trade-offs inherent in all design decisions. one of the first quotes in the book is "don't try to find the best design in software architecture, instead, strive for the least worst combination of trade-offs". that really caught my eye and made me curious enough to finish the book. the examples and case studies provided are rich with scenarios demonstrating how even small decisions can have cascading effects on a software system's future evolution. by enforcing this critical mindset, the book arms readers with the tools to make more informed decisions, rather than chasing an elusive "perfect" architecture or becoming "evangelizers".&lt;/p&gt;

&lt;p&gt;the good parts of the book are &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;real-world examples to follow the theoretical assumptions the author presents&lt;/li&gt;
&lt;li&gt;systematic guidance on how to decompose complex systems&lt;/li&gt;
&lt;li&gt;a large and useful catalog of common trade-offs in modern software architecture decisions&lt;/li&gt;
&lt;li&gt;personal tips from the author alongside the chapter discussions &lt;/li&gt;
&lt;li&gt;amazing and intuitive diagrams&lt;/li&gt;
&lt;li&gt;comprehensive language and storytelling &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;the weak parts of the book are&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;not so recommended for beginner software developers (but I don't think that's the book's target)&lt;/li&gt;
&lt;li&gt;despite the chapters having a good interconnection, sometimes the author refers to a chapter discussion without making it easier for the reader to remember the context of what was discussed and its implications&lt;/li&gt;
&lt;li&gt;personal tips from the author alongside the chapter discussions (also a weak part here because in some cases I thought it biased the discussion)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;overall, I think it's definitely an amazing read, and I really recommend it to every programmer who builds modern distributed software, which consequently involves a lot of design decisions that should be carefully evaluated in order to develop robust systems. The concepts presented in this book are not just applicable to large companies but can also be adapted for startups or smaller teams, making it versatile and essential reading for anyone involved in software design and development.&lt;/p&gt;

</description>
      <category>books</category>
      <category>softwareengineering</category>
      <category>architecture</category>
    </item>
    <item>
      <title>GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models - generated by Summarizepaper.com</title>
      <dc:creator>Marcos</dc:creator>
      <pubDate>Fri, 10 Nov 2023 13:53:04 +0000</pubDate>
      <link>https://dev.to/marcostx/gpts-are-gpts-an-early-look-at-the-labor-market-impact-potential-of-large-language-models-generated-by-summarizepapercom-1569</link>
      <guid>https://dev.to/marcostx/gpts-are-gpts-an-early-look-at-the-labor-market-impact-potential-of-large-language-models-generated-by-summarizepapercom-1569</guid>
      <description>&lt;h2&gt;
  
  
  Generative Pre-trained Transformers and Their Potential Impact on the U.S. Labor Market
&lt;/h2&gt;

&lt;p&gt;In a recent study, researchers investigated the potential implications of Generative Pre-trained Transformer (GPT) models and related technologies for the U.S. labor market. The findings indicate that approximately 80% of the U.S. workforce could have at least 10% of their work tasks affected by GPTs, while around 19% of workers may see at least 50% of their tasks impacted, regardless of wage level or industry type. The paper proposes a new rubric for understanding Large Language Model (LLM) capabilities and their potential effects on jobs, and suggests that LLMs like GPT-4 are likely to have pervasive impacts across industries due to their wide range of applications and capabilities.&lt;/p&gt;

&lt;h2&gt;
  
  
  Background Information
&lt;/h2&gt;

&lt;p&gt;Generative Pre-trained Transformer (GPT) is a type of language model developed by OpenAI that has been used in natural language processing applications such as text generation, question answering, machine translation, summarization, and, more recently, dialogue systems for chatbots and virtual assistants such as Alexa or Siri. It is based on transformer networks, which use attention mechanisms to learn contextual relationships between words in a sentence or document; this allows them to generate text coherent with its context without access to external data sources beyond what they were pre-trained on. The introduction of GPTs into the labor market could bring both positive and negative consequences depending on how they are implemented. While they can automate certain processes, increasing efficiency and reducing the costs associated with manual labor, they also threaten existing job roles if not managed properly, since some tasks could be performed entirely by machines powered by these models rather than by human workers, leading to job losses across sectors over time if no alternative employment opportunities are available within the same industry or elsewhere.&lt;/p&gt;

&lt;h2&gt;
  
  
  Study Overview
&lt;/h2&gt;

&lt;p&gt;To assess occupations based on their correspondence with GPT capabilities, the researchers incorporated both human expertise and classifications from GPT-4 into a new rubric, which they then applied to occupational data on the U.S. economy, using human annotators alongside GPT-4 itself as classifiers. This allowed them to measure overall exposure levels without distinguishing between the labor-augmenting and labor-displacing effects of introducing these models into particular industries, sectors, or workplaces. The analysis indicated that Generative Pre-trained Transformers exhibit characteristics of general-purpose technologies (GPTs), suggesting that even if development were halted today, subsets of machine learning software would still meet the criteria for general-purpose technology once the collective development of complementary technologies is taken into account. This implies that LLMs like GPT-4 are indeed general-purpose technologies with notable economic, social, and policy implications going forward.&lt;/p&gt;

&lt;h2&gt;
  
  
  Findings &amp;amp; Implications
&lt;/h2&gt;

&lt;p&gt;The findings suggest that approximately 80% of the U.S. workforce could have at least 10% of their work tasks affected by the introduction of GPTs, while around 19% of workers may see at least 50% of their tasks impacted, regardless of wage level or industry type. Higher-income jobs potentially face greater exposure, but the impact is not limited to industries with higher recent productivity growth: information-processing industries exhibit high exposure, while manufacturing, agriculture, and mining demonstrate lower exposure. This research provides valuable insight into the potential implications of generative pre-trained transformers and related technologies for the U.S. labor market, highlighting the need for further investigation to understand the full scope of their impact before implementing them in workplaces at large scale. Additionally, given that LLMs like GPT-4 already possess the characteristics of a general-purpose technology, organizations should take extra care to ensure proper management and usage in order to avoid displacing existing job roles while reaping the benefits of automation, increased efficiency, and the cost savings associated with implementation. Overall, the results suggest significant potential economic, social, and policy implications arising from the use of these technologies in the U.S. labor market; further research is needed to better understand how their impact will be felt across different sectors so that appropriate measures can be taken to mitigate any negative consequences.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>paper</category>
    </item>
    <item>
      <title>Review &amp; Support time</title>
      <dc:creator>Marcos</dc:creator>
      <pubDate>Sun, 02 Apr 2023 22:29:11 +0000</pubDate>
      <link>https://dev.to/marcostx/review-support-time-2ho6</link>
      <guid>https://dev.to/marcostx/review-support-time-2ho6</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In today's fast-paced world of software development, it's easy to get caught up in the excitement of building new features and pushing out code. However, it's equally important to set aside time for support activities such as reviewing RFCs and pull requests, answering questions in channels, guiding new people, and contributing to incident resolution. &lt;/p&gt;

&lt;p&gt;Failing to reserve time for support activities can lead to a negative impact on team morale and burnout. Without support from their colleagues, team members may feel overwhelmed, stressed, or isolated, leading to a decrease in productivity and an increase in turnover. Moreover, not having enough time to address critical incidents can lead to system downtime and service outages, which can be costly for both the company and the customers. Therefore, setting aside time for support activities not only benefits the team's work but also creates a culture of collaboration and support that leads to a healthier work environment.&lt;/p&gt;

&lt;p&gt;In this blog post, we'll explore why support is crucial for software development teams and the cultural impact of having these initiatives.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Importance of Support
&lt;/h2&gt;

&lt;p&gt;Support activities are essential for ensuring that software development teams function effectively. Here are some reasons why:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Improves Code Quality
&lt;/h3&gt;

&lt;p&gt;By proactively reviewing RFCs and pull requests, team members can identify potential issues before they become significant problems. This approach promotes a culture of continuous improvement and helps to ensure that the codebase is of high quality, reliable, and efficient. It also allows team members to share their knowledge and best practices, which helps to ensure that everyone is aligned with the team's standards and goals. Additionally, by catching issues early, the team can avoid potential delays, reduce technical debt, and streamline the development process. Therefore, investing time in reviewing RFCs and pull requests is a critical part of building high-quality software and achieving long-term success.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Fosters Collaboration
&lt;/h3&gt;

&lt;p&gt;Answering questions in channels and guiding new team members are crucial for building a collaborative team culture. When team members share their knowledge and expertise, they create an environment that fosters continuous learning and growth. Moreover, by helping new team members to onboard quickly, they can feel more confident and empowered to contribute to the team's goals. This approach helps to reduce barriers, increase cross-functional understanding, and encourage innovation. Additionally, by working together to solve problems, team members can identify new solutions and opportunities for improvement. Ultimately, building a collaborative culture through answering questions and guiding new team members not only benefits the team's work but also creates a positive and inclusive work environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Increases Efficiency
&lt;/h3&gt;

&lt;p&gt;By actively contributing to incident resolution, team members can minimize downtime and ensure the smooth operation of the software. This approach is essential for maintaining high levels of efficiency and productivity, leading to a positive impact on the team's and the organization's bottom line. Incident resolution often requires the collaboration of different team members, and by actively contributing to resolving an issue, team members can share their insights and expertise, which can help to uncover the root cause of the issue and prevent it from happening again in the future. Therefore, by contributing to incident resolution, team members not only help to reduce downtime but also ensure the reliability and quality of the software and enhance the customer's overall experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cultural Impact of Support Initiatives
&lt;/h2&gt;

&lt;p&gt;Having support initiatives in place can have a significant cultural impact on software development teams. Here are some examples:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Encourages a Culture of Learning
&lt;/h3&gt;

&lt;p&gt;By providing support and guidance to new team members, more experienced team members can help to create a culture of learning within the team. This can lead to a more engaged and motivated workforce, which can have a positive impact on productivity and retention. Furthermore, mentoring relationships between more experienced and newer team members can build trust and camaraderie, leading to a more collaborative and supportive work environment. Ultimately, investing in the development of new team members is an investment in the team's long-term success and helps to ensure the continuity of the team's culture of learning and growth.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Promotes Transparency
&lt;/h3&gt;

&lt;p&gt;By reviewing RFCs and pull requests, team members can ensure that everyone is on the same page and that there are no surprises down the line. This helps to promote transparency within the team and can lead to better communication and collaboration.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Builds Trust
&lt;/h3&gt;

&lt;p&gt;By contributing to incident resolution, team members can build trust with each other and with external stakeholders. This helps to create a culture of accountability and responsibility, which can lead to better outcomes for the team and the organization as a whole.&lt;/p&gt;

&lt;p&gt;Moreover, by working together to resolve complex issues, team members can learn from each other and develop new skills, which can contribute to a more skilled and effective team over time. This culture of trust can extend beyond the team and to external stakeholders, such as customers or partners, who rely on the team to deliver high-quality and reliable software. Thus, a culture of accountability and responsibility built through incident resolution can have a ripple effect that extends throughout the organization, creating a positive impact on overall performance.&lt;/p&gt;

&lt;p&gt;In addition to building trust, contributing to incident resolution also helps team members develop their problem-solving skills. When team members encounter complex problems and work together to find solutions, they have an opportunity to learn and practice critical thinking, analytical reasoning, and creative problem-solving skills. These skills are essential for success in any role and are highly valued in today's competitive job market. &lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In conclusion, setting aside time for support activities is crucial for software development teams. By doing so, teams can improve code quality, foster collaboration, and increase efficiency. Moreover, having support initiatives in place can have a significant cultural impact, promoting a culture of learning, transparency, and trust. So, if you're not already doing so, make sure to prioritize support activities in your software development team.&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>culture</category>
    </item>
    <item>
      <title>Attention models: a brief overview</title>
      <dc:creator>Marcos</dc:creator>
      <pubDate>Sat, 04 Mar 2023 18:46:27 +0000</pubDate>
      <link>https://dev.to/marcostx/attention-models-a-brief-timeline-593l</link>
      <guid>https://dev.to/marcostx/attention-models-a-brief-timeline-593l</guid>
      <description>&lt;p&gt;Machine learning has revolutionized various fields over the past decade, including computer vision, natural language processing, and speech recognition. Attention models have emerged as a powerful technique in machine learning, enabling models to selectively focus on relevant parts of the input, which has resulted in significant performance improvements in various tasks. From their first proposal with neural machine translation, attention models have rapidly evolved and have become a key component of many state-of-the-art machine learning models. In this article, we will provide a brief history of attention models in machine learning, including their evolution, major breakthroughs, current advancements, and impact of attention models on the field of machine learning. Here, we focus on temporal attention for some but the core functions are similar for other types like spatial attention. &lt;/p&gt;

&lt;p&gt;We will start by briefly reviewing the foundations, such as the basic concepts of Recurrent Neural Networks (RNNs). Then we cover the history of the attention mechanism, its definition, and the different formulations proposed in the literature. &lt;/p&gt;

&lt;h2&gt;
  
  
  Recurrent Neural Networks
&lt;/h2&gt;

&lt;p&gt;This neural network family is commonly adopted when dealing with sequential data x₁, ..., xₜ. The main idea is that the outputs from the previous step are fed to the current step, creating a recurrent dependency among the outputs. This means that, in theory, RNNs can “memorize” computations made over a long period to produce the current response, although this does not happen in practice. RNNs have proven effective in learning time-dependent signals whose structure varies over short periods. However, when there are long-term dependencies in the data, these methods suffer from the “vanishing gradient” problem, which occurs when gradients propagated through many steps tend to “vanish”, i.e., become close to zero [1]. The best-known architecture for addressing this problem is the Long Short-Term Memory (LSTM) network [2]. LSTMs follow a gated scheme based on creating paths through time along which gradients can flow for long durations.&lt;/p&gt;
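&lt;p&gt;To make the recurrence concrete, here is a minimal sketch of a vanilla RNN forward pass (the names rnn_forward, W_xh, W_hh, and b_h are illustrative, not from any specific library):&lt;/p&gt;

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, b_h):
    """Vanilla RNN over a sequence: each hidden state depends on the
    current input and the previous hidden state, which is what lets the
    network carry context forward in time."""
    h = np.zeros(W_hh.shape[0])
    hidden_states = []
    for x_t in x_seq:
        # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        hidden_states.append(h)
    return np.stack(hidden_states)
```

&lt;p&gt;Because the hidden state at step t is a function of the hidden state at step t-1, gradients must flow backward through every step during training, which is exactly where the vanishing gradient problem arises.&lt;/p&gt;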

&lt;h2&gt;
  
  
  Attention Mechanism
&lt;/h2&gt;

&lt;p&gt;RNNs can be used to map an input sequence to an output sequence, and the two usually have different lengths. This idea can be used in several applications, such as speech recognition, question answering, and machine translation. An RNN encoder processes an input and emits a context vector &lt;em&gt;C&lt;/em&gt;, usually computed using an aggregation function over the encoder hidden layers. Subsequently, an RNN decoder, based on the fixed-length vector &lt;em&gt;C&lt;/em&gt;, generates the output sequence Y = (y₁, ..., yₜ). The major difference between this model and other architectures is that the input and output sizes can vary. Sutskever et al. [3] independently developed an encoder-decoder architecture that obtained state-of-the-art results on the English-to-French translation task from the Conference on Machine Translation (WMT) 2014 workshop.&lt;/p&gt;
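&lt;p&gt;As a sketch, the aggregation step that produces the fixed-length context vector &lt;em&gt;C&lt;/em&gt; can be as simple as taking the last encoder hidden state or averaging all of them (the function and argument names below are our own):&lt;/p&gt;

```python
import numpy as np

def encode_to_context(hidden_states, mode="last"):
    """Collapse the encoder hidden states (one row per time step) into a
    single fixed-length context vector C."""
    if mode == "last":
        return hidden_states[-1]       # last hidden state
    return hidden_states.mean(axis=0)  # mean pooling over time
```

&lt;p&gt;Whatever the aggregation, the entire input sequence ends up squeezed into one vector, which is the bottleneck that motivated attention in the first place.&lt;/p&gt;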

&lt;p&gt;A potential issue with this encoder-decoder approach is that the neural network needs to encode all the necessary information of a source sentence into a fixed-length vector. This may make it difficult for the network to cope with long sentences, where the last representation of the RNN fails to capture important information due to problems like exploding gradients. A more effective approach is to read the entire sequence and then produce the translated words one at a time, each time focusing on different relevant parts of the input sequence [4]. This mechanism was proposed by Bahdanau et al. [5], using the context vector to align the source and target by “attending to” certain parts of the input. Figure 1 illustrates the proposed attention method. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8wTajHUu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e83twjq3h7be6nj18eff.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8wTajHUu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e83twjq3h7be6nj18eff.png" alt="&amp;lt;br&amp;gt;
" width="391" height="332"&gt;&lt;/a&gt;&lt;em&gt;Figure 1: Illustration of the attention mechanism, known as Soft Attention, proposed by Bahdanau et al. [5] for neural machine translation. The method works by computing the weighted average of all the hidden representations h with attention weights α to form the context vector c. The attention weights α(t) are continuous values between [0, 1] learned by a feedforward neural network which is jointly trained with the proposed model. Figure reproduced from Goodfellow et al. [4].&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;More specifically, the network learns the attention weights by incorporating an additional Feedforward Network (FFN), trained jointly with the main architecture, that produces the attention weights as a function of a candidate hidden state and a query state [6]. The whole idea is inspired by Neuroscience, based on the observation that many animals focus on specific parts of their visual inputs to compute adequate responses [7]. The idea has been successfully translated into neural networks, so that models focus their actions on relevant regions rather than using all available information.&lt;/p&gt;

&lt;p&gt;Typically, there are three types of attention: (1) hard attention, (2) soft attention, and (3) self-attention. In soft attention, we compute a weight aᵢ for each input xᵢ and use it to calculate a weighted average of the xᵢ as the recurrent network input. These weights sum to 1, so each can be interpreted as the probability that xᵢ is the area we should pay attention to. Hard attention instead employs a stochastic sampling process to focus on specific regions of the input. On the other hand, self-attention, first introduced for machine reading by Cheng et al. [8], computes a representation of an input sequence by relating different positions of the sequence itself (Figure 2). The authors observed that the self-attention mechanism enables the LSTM to learn the correlation between the current word and past words of the sentence. Xu et al. [9] explored soft and hard attention for the neural image captioning task. In this problem, the model needs to generate captions given an image. The authors adopted an encoder-decoder design, where a CNN (encoder) provides features to an LSTM (decoder) modified with an attention mechanism that allows the decoder to focus on relevant parts of the input. For this purpose, the decoder uses the previous hidden state, the previously generated word, and the context vector to generate the next word using an attention function.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--X8Kk8Q1G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i1o31dqstiluztzxa6eb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--X8Kk8Q1G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i1o31dqstiluztzxa6eb.png" alt="" width="880" height="557"&gt;&lt;/a&gt;&lt;em&gt;Figure 2: Self-attention scores of the model proposed by Cheng et al. [8] for machine reading. The current word being analyzed is expressed by the red color and the blue represents the sentence importance.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In short, the attention mechanism can be seen as a dynamic pooling in which the weights are learned along with the training. In some cases, these weights can be used as a tool to provide interpretability for the model, although this does not hold in all scenarios [10]. This is an important feature given the growing interest in fairness and transparency in Machine Learning. Doughty et al. [11] employed multiple attention filters to discover essential parts of long videos for the skill determination task. They also introduced a loss function that encourages the attention filters to be complementary. &lt;/p&gt;

&lt;p&gt;Another interesting and intuitive attention-based method was proposed by Hermann et al. [12] for reading comprehension on real documents. In addition to introducing a new supervised reading comprehension dataset, covering a gap in the literature, the authors also built four deep learning models incorporating attention mechanisms in RNNs. The attention weights allow the models to focus on specific parts of the document to answer the question. Comparing the results obtained by the attention-based methods to a range of baselines and traditional heuristic methods, the authors obtained state-of-the-art results on the proposed dataset. Figure 3 shows the attention heat maps obtained by one of the proposed attention-based methods.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5lpPE8OJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bz4eye0gm5ofydszy7xc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5lpPE8OJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bz4eye0gm5ofydszy7xc.png" alt="" width="880" height="379"&gt;&lt;/a&gt;&lt;em&gt;Figure 3: Attention scores of one of the attention-based methods proposed by Hermann et al. [12] for reading comprehension task. The model works by focusing on specific parts of the document that better fits to the question in the bottom. The crucial part here is that there are a lot of text being ignored in this context.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Motivated by the computational inefficiency of RNNs, whose sequential computation inhibits parallelization, Vaswani et al. [13] introduced the Transformer network for machine translation. This architecture is built without recurrences or convolutions, using only attention blocks. The model consists of six encoder blocks and six decoder blocks, each of which is built from the same modules: a Feed-Forward Network and Multi-Head Self-Attention. First, the Multi-Head Self-Attention layer helps the encoder look at other words in the input sentence as it encodes a specific word. This module is named Multi-Head because several attention layers are stacked in parallel, applied to different linear transformations of the same input [14]. The output of the Multi-Head Self-Attention step is then fed into the position-wise Feed-Forward Network, which consists of two linear transformations with a ReLU activation in between. The Transformer achieved state-of-the-art performance on English-to-German and English-to-French translation, producing higher translation accuracy while allowing substantial parallel processing and using no recurrent component. &lt;/p&gt;

&lt;p&gt;Figure 4 illustrates the whole architecture of the Transformer proposed by Vaswani et al.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uZ1yqw51--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w2i4n33nmqv677fxsmlf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uZ1yqw51--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w2i4n33nmqv677fxsmlf.png" alt="" width="430" height="620"&gt;&lt;/a&gt;&lt;em&gt;Figure 4: Transformer network proposed by Vaswani et al. [13] for machine translation. The authors introduces a new encoder-decoder architecture. The encoder is composed of a stack of multi-head attention and feed-forward layers, each of them has a residual connection and a normalization layer. The decoder works similarly, with the addition of a third sub-layer that applies multi-head attention over the output of the encoder.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Due to the success of the Transformer, many variants have been proposed in the literature, improving the original model in terms of computation and memory efficiency. The main focus of improvement is the self-attention module, responsible for computing similarity scores for all pairs of sequence positions. The original formulation is O(n²) in both time and space, which hurts efficiency for longer input sequences. Most of the proposed methods are based on the concept of sparse attention, which applies attention to a subset of the input sequence [15]. One of the first approaches to cope with this problem was the Image Transformer (Parmar et al. [16]). For a more detailed review of Transformer-based approaches focusing on efficiency improvements, please refer to Tay et al. [15].&lt;/p&gt;

&lt;p&gt;Attention mechanisms are an effective way for neural networks to enhance their capability and interpretability [17]. In sequence learning, attention is broadly employed in encoder-decoder models to overcome the limitation of encoding the input sequence into one fixed-length vector from which to decode each output time step. To solve this, attention mechanisms learn a vector of scores, one for each observed time step t, representing its relevance. The attention module tells the decoder to look more at targeted sub-components of the source to be translated. Pei et al. [17] combined gated RNNs and attention networks to detect salient parts of the sequence and encode this information through a custom RNN. The temporal attention weights returned by this mechanism provide a meaningful measure of the salience of each time step in the sequence, which gives the model a higher degree of interpretability. They showed that the learned weights automatically filter noisy parts of the sequence, producing a reasonable interpretation of each time step's relevance to the model (see Figure 5). One disadvantage is using the last hidden representation as input to the fully connected network: since it contains information from all previous frames, it retains high values along the video.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FWEhCfzk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mvv2x7hi55gh5xit8oc6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FWEhCfzk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mvv2x7hi55gh5xit8oc6.png" alt="" width="831" height="794"&gt;&lt;/a&gt;&lt;em&gt;Figure 5: Overview of the Temporal Attention-gated Model architecture [81]. The bottom is the Temporal Attention Module, which generates attention weights for each frame. At the top, Recurrent Attention-Gated Units use these weights to refine the internal representation.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We can express an attention mechanism as a function that maps an input vector to an output based on a weight vector, where the weight vector represents the relevance of each input. Internally, the network learns to focus on specific salient “regions” and to capture somewhat global information, rather than inferring solely from one hidden state. Focusing on one family of attention methods commonly used in machine translation [5], we can describe the attention mechanism for video classification by the following steps.&lt;/p&gt;

&lt;p&gt;Given an input sequence h of length N, the attention mechanism produces a context vector &lt;em&gt;c&lt;/em&gt; computed as a weighted sum of hᵢ, where the alignment scores are the weights.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;c = ∑ᵢ₌₁ ᴺ aᵢhᵢ
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The weight aᵢ of each hᵢ is computed by&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aᵢ = exp(eᵢ) / ∑ⱼ₌₁ ⁿ exp(eⱼ),

eᵢ = α(sₜ₋₁,hₜ),

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;where α, known as the alignment model, is a single-layer neural network that computes a match between the inputs around the current position &lt;em&gt;i&lt;/em&gt;. The network computes the weight based on the annotation vector hᵢ and the previous hidden state vector of the RNN. We refer to this vector as the key, since the attention model seeks the most related weighted average of the input vectors according to these keys. This network can be jointly trained with the other components of the RNN.&lt;/p&gt;
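&lt;p&gt;Putting the two equations above together, a Bahdanau-style attention step can be sketched like this (the parameter names W_s, W_h, and v stand for the alignment network's learned weights and are illustrative):&lt;/p&gt;

```python
import numpy as np

def additive_attention(s_prev, H, W_s, W_h, v):
    """e_i = v . tanh(W_s s_prev + W_h h_i); a = softmax(e); c = sum_i a_i h_i."""
    e = np.array([v @ np.tanh(W_s @ s_prev + W_h @ h_i) for h_i in H])
    e = e - e.max()                    # numerical stability
    a = np.exp(e) / np.exp(e).sum()    # attention weights, sum to 1
    c = a @ H                          # context vector: weighted sum of h_i
    return c, a
```

&lt;p&gt;The decoder then uses the context vector c together with its own state to produce the next output, and the weights a double as an alignment map between source and target.&lt;/p&gt;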

&lt;p&gt;Zadeh et al. [18] proposed a new neural architecture for multimodal sequential learning called the Memory Fusion Network (MFN), which explicitly accounts for interactions between views in the neural architecture and continuously models them through time. This model has three modules: 1) System of LSTMs: multiple LSTMs, one for each view, encoding view-specific interactions; 2) Delta-memory Attention Network: a special attention mechanism to discover temporal interactions across the System of LSTMs; 3) Multi-view Gated Memory: storing cross-view interactions over time. With the best configurations, this approach achieved state-of-the-art results on six different multimodal datasets. &lt;br&gt;
Girdhar et al. [19] extended the well-known Transformer network to action recognition and localization in videos. This architecture discards RNNs, replacing them with multi-head attention modules that learn to attend to the frames’ relevant regions. Figure 6 shows an overview of the architecture. Meng et al. [20] introduced a model that combines spatial and temporal attention simultaneously, employing a set of regularizers to force the attention models to attend to coherent regions of the videos. The model was evaluated for action classification, and also for action localization, trained in a weakly supervised manner.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tmrLE3oZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jcfgvdg2nwbht1oqje6y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tmrLE3oZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jcfgvdg2nwbht1oqje6y.png" alt="" width="880" height="264"&gt;&lt;/a&gt;&lt;em&gt;Figure 6: Diagram of the Action Transformer network for action localization in videos proposed by Girdhar et al.[19]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In summary, the attention mechanism has been widely adopted in the literature for its simplicity (in the case of soft attention) and for its benefits. One of these benefits is the ordering of inputs by relevance, which can implicitly help filter input noise. Another important feature is the interpretability of the weights associated with the inputs xᵢ, which can serve as a good indicator of the relevance of each time step to the scene. This may be useful in systems where not only the final prediction is needed but also some other indicator that reinforces that decision. Finally, this mechanism has the flexibility to be added anywhere in the network, as long as it makes sense for the desired purpose. &lt;/p&gt;

&lt;p&gt;And, yes, the cover image of this post was automatically generated by an &lt;a href="https://openai.com/product/dall-e-2"&gt;attention-based deep neural network&lt;/a&gt; :)&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;[1] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(02):107–116, 1998.&lt;br&gt;
[2] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.&lt;br&gt;
[3] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.&lt;br&gt;
[4] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning, volume 1. MIT Press, Cambridge, 2016.&lt;br&gt;
[5] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations, 2015.&lt;br&gt;
[6] Sneha Chaudhari, Gungor Polatkan, Rohan Ramanath, and Varun Mithal. An attentive survey of attention models. arXiv preprint arXiv:1904.02874, 2019.&lt;br&gt;
[7] Laurent Itti, Christof Koch, and Ernst Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254–1259, 1998.&lt;br&gt;
[8] Jianpeng Cheng, Li Dong, and Mirella Lapata. Long short-term memory-networks for machine reading. In Conference on Empirical Methods in Natural Language Processing, pages 551–561, 2016.&lt;br&gt;
[9] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057, 2015.&lt;br&gt;
[10] Sofia Serrano and Noah A. Smith. Is attention interpretable? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2931–2951, Florence, Italy, July 2019.&lt;br&gt;
[11] Hazel Doughty, Walterio Mayol-Cuevas, and Dima Damen. The pros and cons: Rank-aware temporal attention for skill determination in long videos. In IEEE Conference on Computer Vision and Pattern Recognition, pages 7862–7871, 2019.&lt;br&gt;
[12] Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1693–1701, 2015.&lt;br&gt;
[13] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010, 2017.&lt;br&gt;
[14] Sneha Chaudhari, Gungor Polatkan, Rohan Ramanath, and Varun Mithal. An attentive survey of attention models. arXiv preprint arXiv:1904.02874, 2019.&lt;br&gt;
[15] Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey. arXiv preprint arXiv:2009.06732, 2020.&lt;br&gt;
[16] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. In International Conference on Machine Learning, pages 4055–4064, 2018.&lt;br&gt;
[17] Wenjie Pei, Tadas Baltrušaitis, David MJ Tax, and Louis-Philippe Morency. Temporal attention-gated model for robust sequence classification. In IEEE Conference on Computer Vision and Pattern Recognition, pages 820–829, 2017.&lt;br&gt;
[18] Amir Zadeh, Paul Pu Liang, Navonil Mazumder, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. Memory fusion network for multi-view sequential learning. arXiv preprint arXiv:1802.00927, 2018.&lt;br&gt;
[19] Rohit Girdhar, Joao Carreira, Carl Doersch, and Andrew Zisserman. Video action transformer network. In IEEE Conference on Computer Vision and Pattern Recognition, pages 244–253, 2019.&lt;br&gt;
[20] Lili Meng, Bo Zhao, Bo Chang, Gao Huang, Wei Sun, Frederick Tung, and Leonid Sigal. Interpretable spatio-temporal attention for video action recognition. In IEEE International Conference on Computer Vision Workshops, 2019.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>research</category>
      <category>ai</category>
    </item>
    <item>
      <title>Staff Engineer: Leadership beyond the management track - Book Notes</title>
      <dc:creator>Marcos</dc:creator>
      <pubDate>Sat, 11 Feb 2023 14:19:15 +0000</pubDate>
      <link>https://dev.to/marcostx/on-staff-engineer-leadership-beyond-the-management-track-31p8</link>
      <guid>https://dev.to/marcostx/on-staff-engineer-leadership-beyond-the-management-track-31p8</guid>
      <description>&lt;p&gt;I recently read Will Larson's book Staff Engineer: Leadership beyond the management track. As the name suggests, the book covers a career alternative for more senior engineers that don't feel comfortable with management positions and still want to be involved in complex technical problems. Besides being a broad role with responsibilities varying from company to company, the book defines some key archetypes and develops along the chapters what is a job description for each of them.&lt;/p&gt;

&lt;p&gt;I discovered this book after listening to the amazing podcast &lt;a href="https://open.spotify.com/episode/3ul37YJskNvt7oJfRFSk0Z?si=e4be6c006a1a4a53" rel="noopener noreferrer"&gt;Staff+&lt;/a&gt;, presented by &lt;a href="https://twitter.com/paulo_zip" rel="noopener noreferrer"&gt;Paulo Vasconcellos&lt;/a&gt;, &lt;a href="https://twitter.com/marlesson" rel="noopener noreferrer"&gt;Marlesson Santana&lt;/a&gt;, and &lt;a href="https://twitter.com/flavioclesio" rel="noopener noreferrer"&gt;Flavio Clesio&lt;/a&gt;, talking about this new technical path for engineers. The whole conversation grabbed my attention because, since the beginning of my career, I haven't seen the management track as a good path for me.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;Difficulty in finding resources for professionals above the senior software developer level&lt;br&gt;
A very mysterious notion of impact for more senior engineers&lt;br&gt;
What if you want to advance in your career without becoming a conventional manager?&lt;/p&gt;

&lt;p&gt;Four Staff+ archetypes&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tech Lead&lt;/strong&gt;: guides the approach and execution of a particular team&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Architect&lt;/strong&gt;: responsible for the direction, quality and approach within a critical area&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Solver&lt;/strong&gt;: digs deep into arbitrarily complex problems and finds an appropriate path forward&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Right Hand&lt;/strong&gt;: expands the attention of an executive by borrowing their scope and authority to operate particularly complex organizations.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What do Staff Engs actually do?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;It depends a lot on the team's needs and the engineer's strengths.&lt;/li&gt;
&lt;li&gt;Work on strategic projects, bringing technical design while helping the team make progress.&lt;/li&gt;
&lt;li&gt;Senior SE + setting &amp;amp; editing technical direction, mentorship, and adding engineering context to organizational decisions.&lt;/li&gt;
&lt;li&gt;You don't have to be a hero and solve everything: the main goal is to help the team evolve.&lt;/li&gt;
&lt;li&gt;Balance between mentorship and sponsorship.&lt;/li&gt;
&lt;li&gt;Staff+ engineers will often be in organizational decision-making rooms: it's time to bring engineering context to the table.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv20nelwvgdlyk68eksyj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv20nelwvgdlyk68eksyj.png" alt="Image description" width="800" height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Operating at Staff
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;write an engineering strategy&lt;/li&gt;
&lt;li&gt;stay aligned with those in positions of authority&lt;/li&gt;
&lt;li&gt;blend your vision with others'&lt;/li&gt;
&lt;li&gt;create space for others&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Important takes
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;more senior -&amp;gt; less time to accomplish tasks&lt;/li&gt;
&lt;li&gt;avoid snacking: doing easy, low-impact tasks&lt;/li&gt;
&lt;li&gt;avoid preening: doing easy, high-visibility tasks&lt;/li&gt;
&lt;li&gt;avoid chasing ghosts: doing very complex but low-impact tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgrh18w4skbnirbtc542t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgrh18w4skbnirbtc542t.png" alt="Image description" width="800" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Managing technical quality
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Balance technical quality with business deliveries&lt;/li&gt;
&lt;li&gt;Accountability&lt;/li&gt;
&lt;li&gt;Systems thinking&lt;/li&gt;
&lt;li&gt;Data models that are tolerant to evolution&lt;/li&gt;
&lt;li&gt;Technical vectors: make technical decisions with vectors that point in the same direction

&lt;ul&gt;
&lt;li&gt;Give direct feedback when someone acts out of alignment with the planned approach&lt;/li&gt;
&lt;li&gt;Redefine the engineering strategy when needed&lt;/li&gt;
&lt;li&gt;Communicate your approach through existing and future processes/tools&lt;/li&gt;
&lt;li&gt;Onboarding process&lt;/li&gt;
&lt;li&gt;Conway's Law: organizations build software that reflects their structure&lt;/li&gt;
&lt;li&gt;Points to consider when defining quality&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;To lead, you have to follow: not only adopt good practices, but also share them with the teams&lt;/li&gt;

&lt;li&gt;Don't aim to never be wrong; aim to be right while creating space for others, without creating a bad atmosphere&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj2m1gm35ybhf21czf6mz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj2m1gm35ybhf21czf6mz.png" alt="Image description" width="800" height="630"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating space for others
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;ask the right questions to avoid missteps&lt;/li&gt;
&lt;li&gt;bring everyone to the table&lt;/li&gt;
&lt;li&gt;take notes&lt;/li&gt;
&lt;li&gt;you don't need to weigh in on absolutely everything&lt;/li&gt;
&lt;li&gt;sponsorship: promote the person, giving them visibility, opportunity, advice, and context, but the final responsibility lies with them&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Build a network of peers
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;be easy to find&lt;/li&gt;
&lt;li&gt;be visible&lt;/li&gt;
&lt;li&gt;internal networks - within squads, for example&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Present to executives
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;from a certain point in your career, what will make the difference is how effectively you can influence executives&lt;/li&gt;
&lt;li&gt;don't fight feedback&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Getting the title where you are
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;the path is not very clear at many companies; some may even "prevent" certain promotions&lt;/li&gt;
&lt;li&gt;opportunities for staff are not evenly distributed&lt;/li&gt;
&lt;li&gt;sometimes it's worth trying out the engineering management track&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Promotion packets
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;What staff projects did you do? What impact did they have? How did you improve the company? How much money did your impact generate?&lt;/li&gt;
&lt;li&gt;have an answer for why you are seeking this promotion&lt;/li&gt;
&lt;li&gt;don't be anxious; these promotions often take time to materialize&lt;/li&gt;
&lt;li&gt;bring the matter up in 1:1s with your manager&lt;/li&gt;
&lt;li&gt;write a promotion packet document&lt;/li&gt;
&lt;li&gt;edit it with your peers and manager&lt;/li&gt;
&lt;li&gt;periodically review this document&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Find your sponsor
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;a staff engineer is not just a faster senior engineer&lt;/li&gt;
&lt;li&gt;don't play team games alone; you will lose&lt;/li&gt;
&lt;li&gt;find people who give visibility to your achievements and increase your value in the company&lt;/li&gt;
&lt;li&gt;be direct with your sponsor: show that you want to be a recognized staff engineer within the company&lt;/li&gt;
&lt;li&gt;if that doesn't work, consider switching companies&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Staff projects
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;it's not written anywhere, but usually you have to deliver a staff project to get the promotion&lt;/li&gt;
&lt;li&gt;complex and ambiguous projects&lt;/li&gt;
&lt;li&gt;diverse stakeholders from different areas&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Get in the room, and stay there
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;don't just participate once in important decisions; consistently be there&lt;/li&gt;
&lt;li&gt;be someone who matters in the room: bring important ideas or questions&lt;/li&gt;
&lt;li&gt;it's important to have someone sponsoring your presence there&lt;/li&gt;
&lt;li&gt;be aligned with your manager&lt;/li&gt;
&lt;li&gt;speak clearly and concisely: contribute by speaking less&lt;/li&gt;
&lt;li&gt;prioritize being useful&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Being visible
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;make sure your work is being seen&lt;/li&gt;
&lt;li&gt;write long-lived documents&lt;/li&gt;
&lt;li&gt;lead forums within the company&lt;/li&gt;
&lt;li&gt;contribute to the company blog&lt;/li&gt;
&lt;li&gt;share notes about your work&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>softwaredevelopment</category>
      <category>opensource</category>
      <category>community</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Liquid Roles: Thoughts</title>
      <dc:creator>Marcos</dc:creator>
      <pubDate>Sat, 28 Jan 2023 13:35:58 +0000</pubDate>
      <link>https://dev.to/marcostx/liquid-roles-thoughts-3a50</link>
      <guid>https://dev.to/marcostx/liquid-roles-thoughts-3a50</guid>
      <description>&lt;p&gt;In this post, I will show my personal &lt;strong&gt;opinions&lt;/strong&gt; regarding data roles. The emphasis on opinions is designed since the purpose is not to provide any kind of career advice but just to share some perceptions observed working as a computer scientist.&lt;/p&gt;

&lt;p&gt;In 2017, I started my career as a data scientist only because it was the role that matched the things I was studying in my master's degree. I remember working with visualization, optimization, data analysis, ETL preparation, ML model training/serving, and also communicating directly with customers. Maybe this was common for many people due to the broad job description of the role. Of course, it is not the optimal way to structure roles, but at a startup you quickly discover that everyone is accumulating responsibilities. &lt;/p&gt;

&lt;p&gt;Apart from how chaotic this routine sometimes was, it opened my eyes to how the whole business of the company works and helped me build a mental model of how the things I was doing connected to what other engineers, business analysts &amp;amp; salespeople were doing. For me, this sense of how the business you are involved in works might be worth more than many years of technical learning - with all due respect, and I obviously think having both is best - because it drives your focus/prioritization toward the things that will be most beneficial for this &lt;strong&gt;'business gear'&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Note that I am not saying that you must prepare features for a risk model, automate a multi-instance deployment, create dashboards/reports for monitoring, and implement business rules to use the model predictions. Just be aware that these job descriptions vary a lot based on several things: the project, the team/company's needs, the market, and last but not least, YOUR preferences. For example, my transition from data scientist to Machine Learning Engineer (MLE) was shaped by off-the-job study of MLOps practices and their many benefits for companies, as they became a global trend in data. I started to incrementally bring up discussions about development practices that we could adopt, highlighting the operational benefits involved. Fortunately, I had a lot of support for the ideas I was bringing, mainly because the mid/long-term value to the business was clear to everyone. I didn't get promoted or change my role to MLE solely because of this initiative, and that is the point: tech roles are natively &lt;strong&gt;liquid roles&lt;/strong&gt;. Today, training a neural network for speech recognition is in your scope; tomorrow, creating an API to accelerate partition distribution over many clusters could be your new scope.&lt;/p&gt;

&lt;p&gt;My claim here is that adding business value to the company can extend beyond your job description, and you can embrace this by focusing on the things you like to do/study/improve even when they go beyond your current role. &lt;strong&gt;Tech roles come and go&lt;/strong&gt;, but at the end of the day what matters is whether you are adding value to the company and making that clear to the stakeholders (of course).&lt;/p&gt;

</description>
      <category>career</category>
    </item>
    <item>
      <title>"Hello me, meet the real me"</title>
      <dc:creator>Marcos</dc:creator>
      <pubDate>Tue, 17 Jan 2023 02:38:12 +0000</pubDate>
      <link>https://dev.to/marcostx/hello-me-meet-the-real-me-52dm</link>
      <guid>https://dev.to/marcostx/hello-me-meet-the-real-me-52dm</guid>
      <description>&lt;p&gt;Hello! I will use this space as my blog to share thoughts on different subjects (mostly related to work but not only).&lt;/p&gt;

&lt;p&gt;The purpose here is not to dive deep into complex subjects but to try to put into words some of the many perspectives I have on certain topics. I've always believed that one of the best ways to guarantee that you have learned something is to write or talk about the subject as much as you can. However, we often skip this very important phase of learning and fool our brain by saying "Yeah yeah, you already got it, NEXT!" [1]&lt;/p&gt;

&lt;p&gt;One of the many things that inspired me was Flavio Clesio's blog [2] and his talks. The topics and the way he approaches them - showing deep knowledge of how to introduce a subject, start the discussion, and highlight each trade-off involved - really grabbed my attention from beginning to end. Another was this post [3] about a switch from Data Science to an MLE role, which echoed many of my own arguments for making my career switch too. And, last but not least, this "passive-aggressive" post [4] also motivated me haha.&lt;/p&gt;

&lt;p&gt;This is also an opportunity to push myself to write in English - although I don't intend to write all posts in English - because it is part of my daily routine, but right now it is not as instantaneous as I would like. I expect this practice to help me organize my thoughts clearly and effectively.&lt;/p&gt;

&lt;p&gt;Ah, and about the headings, yeah ... I am a big fan of Megadeth ;D&lt;/p&gt;

&lt;p&gt;[1] &lt;a href="https://www.amazon.com.br/How-We-Learn-Surprising-Happens/dp/0812984293" rel="noopener noreferrer"&gt;https://www.amazon.com.br/How-We-Learn-Surprising-Happens/dp/0812984293&lt;/a&gt;&lt;br&gt;
[2] &lt;a href="https://flavioclesio.com/" rel="noopener noreferrer"&gt;https://flavioclesio.com/&lt;/a&gt;&lt;br&gt;
[3] &lt;a href="https://ryxcommar.com/2022/11/27/goodbye-data-science/" rel="noopener noreferrer"&gt;https://ryxcommar.com/2022/11/27/goodbye-data-science/&lt;/a&gt;&lt;br&gt;
[4] &lt;a href="https://startafuckingblog.com/" rel="noopener noreferrer"&gt;https://startafuckingblog.com/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>gratitude</category>
      <category>welcome</category>
    </item>
  </channel>
</rss>
