Author: Lucas Ribeiro
Date: June 17, 2025
Abstract: Deep Learning networks with Multiple Inputs and Multiple Outputs (MIMO-DL) are increasingly used in complex domains requiring the processing of diverse input data streams to generate multiple predictions or inferences. However, the inherent complexity of these architectures often results in "black-box" models, making it difficult to interpret how specific inputs influence corresponding outputs. This paper proposes a novel mechanism called Multi-layer Attentional Interception Layer (MAIL). MAIL is a customizable layer that can be integrated into MIMO-DL architectures to provide granular interpretability, allowing for the "interception" and analysis of learned interactions between subsets of specific inputs and outputs. We present the theoretical formulation of MAIL, a detailed Python implementation using TensorFlow/Keras, and discuss its potential to advance the interpretability of MIMO-DL systems.
Keywords: Multiple Inputs Multiple Outputs (MIMO), Deep Learning, Interpretability, Attention, Neural Networks, Keras, Python, XAI (Explainable AI).
1. Introduction
Deep neural networks have demonstrated remarkable success across a wide range of applications. Particularly, systems with Multiple Inputs and Multiple Outputs (MIMO) are essential in scenarios where diverse sources of information need to be processed to generate a set of responses or predictions. Examples include recommendation systems, robotics, signal processing in telecommunications (e.g., Massive MIMO), and modeling complex systems in healthcare.
Despite their predictive power, the interpretability of deep learning models, especially MIMO ones, remains a significant challenge. The ability to understand which inputs or input features are most influential for which specific outputs is crucial for model debugging, domain knowledge validation, trust-building, and ensuring fairness. Traditional interpretability approaches often provide global insights or are applied post-hoc, potentially not fully capturing the specific internal dynamics of input-output pathways in MIMO systems.
Attention mechanisms have proven effective in highlighting relevant parts of the input that contribute to a given output, particularly in natural language processing and computer vision tasks. Inspired by this success, we propose the Multi-layer Attentional Interception Layer (MAIL), a neural layer designed to be integrated into MIMO-DL models. MAIL aims to explicitly learn and expose attention weights governing the relationships between groups of specific inputs and outputs, allowing for a clear "interception" of these influences.
Our contributions are:
- The formulation of a new attentional layer, MAIL, for MIMO-DL systems.
- A detailed implementation of the MAIL layer in Python using TensorFlow/Keras, demonstrating its practical applicability.
- A discussion on how MAIL can be utilized to enhance interpretability and facilitate the analysis of MIMO-DL models.
2. Related Work
2.1. MIMO Neural Networks
MIMO architectures in deep learning vary considerably, from simple concatenations of input feature vectors processed by a shared network to more complex structures with multiple processing branches that eventually merge or generate independent outputs. The Keras Functional API, for example, facilitates the creation of such models. The central challenge lies in managing and interpreting the flow of information through these multiple pathways. Works like MixMo explore ways to mix multiple inputs for multiple outputs through sub-networks.
2.2. Attention Mechanisms
Attention mechanisms were introduced to allow models to focus on specific parts of the input sequence when generating an output. The core concept involves calculating attention weights (scores) which are then used to create a weighted representation of the inputs. Variations such as self-attention and multi-head attention have become fundamental components of state-of-the-art architectures like Transformers. The application of attention in MIMO systems, while promising, is still a developing area, with some research focused on specific applications like channel estimation in wireless communications.
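For concreteness, the sketch below shows this basic computation in TensorFlow: similarity scores between a query and a set of keys are normalized with a softmax and used to form a weighted representation of the values. The function and variable names are illustrative only and are not taken from any of the cited works.

```python
import tensorflow as tf

def simple_attention(query, keys, values):
    """Minimal scaled dot-product attention sketch (illustrative, not from a specific paper)."""
    # query: (batch, d), keys/values: (batch, seq_len, d)
    scores = tf.einsum('bd,bsd->bs', query, keys)                      # similarity scores
    scores /= tf.math.sqrt(tf.cast(tf.shape(keys)[-1], tf.float32))    # scale by sqrt(d)
    weights = tf.nn.softmax(scores, axis=-1)                           # attention weights sum to 1
    context = tf.einsum('bs,bsd->bd', weights, values)                 # weighted representation
    return context, weights

q = tf.random.normal((2, 8))
kv = tf.random.normal((2, 5, 8))
context, weights = simple_attention(q, kv, kv)
print(weights.shape)  # (2, 5): one weight per input position
```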
2.3. Interpretability in Deep Learning (XAI)
Interpretability in machine learning, and more specifically in deep learning, is an active research field. XAI methods can be broadly categorized into:
- Inherently interpretable models: Models like shallow decision trees, linear regression, or generalized additive models (GAMs).
- Post-hoc methods: Techniques that explain an already trained model, such as LIME, SHAP, or gradient-based analysis.
- Attention-based methods: Where the attention weights themselves can serve as a form of explanation, indicating which parts of the input were considered important.
Researchers from institutions like Stanford have actively explored interpretability, including optimizing models to be inherently interpretable or developing new explanation techniques. Our work aligns with the idea of building interpretability directly into the model's architecture through custom attention mechanisms.
3. Proposed Methodology: MAIL (Multi-layer Attentional Interception Layer)
We propose a MAIL layer that can be inserted into a MIMO-DL architecture. The core idea is that, for a set of `N` input streams and `M` desired output streams, the MAIL layer learns attentional representations that explicitly model the contribution of each input stream (or a processed combination thereof) to each output stream.
Conceptual Architecture of MAIL:
- Multiple Inputs: The layer accepts a list of input tensors `[X_1, X_2, ..., X_N]`, where each `X_i` represents a distinct data stream.
- Input Processing/Combination (optional, but recommended): Before the main attention mechanism, inputs can be processed individually (e.g., by CNNs, RNNs, or Dense layers) and/or combined (e.g., concatenation, weighted sum). To simplify the initial presentation of MAIL, we assume the inputs are concatenated, forming a tensor `X_concat`.
- Attention Heads per Output: For each of the `M` output streams, MAIL instantiates a dedicated "attention head." Each attention head `j` (for `j = 1, ..., M`) is responsible for learning a set of attention weights `alpha_j` over `X_concat`. These weights indicate the relevance of different features in `X_concat` for generating the output `Y_j`. Mathematically, the attention weights for head `j` can be calculated, for example, by a small neural network (e.g., a Dense layer with softmax activation) that maps `X_concat` to the weights: `e_j = Dense_j(X_concat)`, `alpha_j = softmax(e_j)`.
- Attention Application: The attention weights `alpha_j` are then used to modulate `X_concat`, creating a context representation `C_j` specific to output `j`: `C_j = alpha_j * X_concat` (element-wise multiplication).
- Generation of Multiple Outputs: Each context vector `C_j` is then processed by an output sub-network (e.g., one or more Dense layers) to produce the final output `Y_j`: `Y_j = OutputDense_j(C_j)`.
- Interception: The learned attention weights `alpha_j` of each output head can be extracted and visualized. This allows "intercepting" and analyzing which parts of the concatenated inputs (and, by extension, which original input streams, when the mapping is clear) were considered most important for each specific output task.
This architecture allows the model to dynamically learn to prioritize different aspects of the combined inputs for each of its output tasks. "Interceptability" comes from the ability to inspect the `alpha_j` vectors, which provide a proxy for the importance of input features for each output.
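As a bridge to the full implementation in Section 4, the following minimal sketch expresses the per-head equations above directly in Keras. The dimensions (96 concatenated features, a 10-dimensional output) are illustrative placeholders.

```python
import tensorflow as tf
from tensorflow.keras.layers import Dense

# Single MAIL attention head j, following e_j = Dense_j(X_concat),
# alpha_j = softmax(e_j), C_j = alpha_j * X_concat, Y_j = OutputDense_j(C_j).
# Sizes are illustrative; Section 4 gives the full, reusable layer.
x_concat = tf.random.normal((4, 96))      # (batch, concatenated input features)

dense_j = Dense(96)                       # produces scores e_j, one per concatenated feature
output_dense_j = Dense(10)                # output sub-network for stream j

e_j = dense_j(x_concat)
alpha_j = tf.nn.softmax(e_j, axis=-1)     # attention weights over the features
c_j = alpha_j * x_concat                  # element-wise modulation of X_concat
y_j = output_dense_j(c_j)
print(alpha_j.shape, y_j.shape)           # (4, 96) (4, 10)
```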
4. Implementation in Python with TensorFlow/Keras
Below, we present an implementation of the MAIL layer as a custom Keras layer, used together with the Keras Functional API.
```python
import tensorflow as tf
from tensorflow.keras.layers import Layer, Dense, Input
from tensorflow.keras.models import Model
import numpy as np


class MIMOAttentionLayer(Layer):
    """
    Multi-layer Attentional Interception Layer (MAIL).

    This layer receives multiple inputs, concatenates them, and then applies
    separate attention mechanisms to generate multiple outputs.
    Attention weights can be extracted for interpretability.
    """

    def __init__(self, num_output_streams, output_stream_dims, attention_hidden_units=None, **kwargs):
        """
        Args:
            num_output_streams (int): The number of desired output streams (M).
            output_stream_dims (list or tuple): The dimensionality of each output
                stream, e.g. (64, 32) for two outputs with 64 and 32 dims.
            attention_hidden_units (int, optional): Number of units in the internal
                dense layer used to calculate attention scores. If None, the
                concatenated input dimension is used.
        """
        super().__init__(**kwargs)
        if not isinstance(output_stream_dims, (list, tuple)) or len(output_stream_dims) != num_output_streams:
            raise ValueError("`output_stream_dims` must be a list or tuple with `num_output_streams` elements.")
        self.num_output_streams = num_output_streams
        self.output_stream_dims = output_stream_dims
        self.attention_hidden_units = attention_hidden_units
        # Lists to store the attention and output sublayers for each stream.
        self.attention_score_layers = []
        self.output_processing_layers = []
        self.learned_attention_weights = []  # Holds the weights from the most recent call.

    def build(self, input_shape):
        """
        Creates the layer's sublayers.

        Args:
            input_shape (list of tuples): Shapes of the input tensors,
                e.g. [(None, 128), (None, 64)] for two inputs.
        """
        if not isinstance(input_shape, list) or len(input_shape) < 1:
            raise ValueError("Input to MIMOAttentionLayer must be a list of tensors.")

        # Inputs are concatenated along the feature axis; compute the concatenated dimension.
        self.concatenated_input_dim = sum(int(shape[-1]) for shape in input_shape)
        attention_units = (self.attention_hidden_units
                           if self.attention_hidden_units is not None
                           else self.concatenated_input_dim)

        for i in range(self.num_output_streams):
            # First Dense of head i: projects the concatenated input to attention scores.
            self.attention_score_layers.append(
                Dense(attention_units, activation='tanh', name=f'attention_scorer_{i}')
            )
            # Second Dense of head i: produces one softmax weight per concatenated feature.
            # Here, for simplicity, attention modulates the features of the concatenated input.
            self.attention_score_layers.append(
                Dense(self.concatenated_input_dim, activation='softmax', name=f'attention_weights_{i}')
            )
            # Output sublayer of head i: maps the attention-weighted input to the stream's output.
            self.output_processing_layers.append(
                Dense(self.output_stream_dims[i], activation='linear', name=f'output_stream_{i}')
            )
        super().build(input_shape)

    def call(self, inputs):
        """
        Forward pass.

        Args:
            inputs (list of Tensors): List of input tensors.

        Returns:
            list of Tensors: One output tensor per stream.
        """
        if not isinstance(inputs, list) or len(inputs) < 1:
            raise ValueError("Input to MIMOAttentionLayer must be a list of tensors.")

        if len(inputs) > 1:
            concatenated_inputs = tf.concat(inputs, axis=-1)
        else:
            concatenated_inputs = inputs[0]  # A list with a single tensor.

        output_streams = []
        current_attention_weights = []  # Weights computed in this call.

        for i in range(self.num_output_streams):
            # Each head learns to weight the features of the concatenated input for its
            # own output task. Head i uses sublayers i*2 (tanh scorer) and i*2 + 1
            # (softmax weights); more elaborate scoring (e.g., Bahdanau-style) could be
            # substituted here.
            attention_hidden = self.attention_score_layers[i * 2](concatenated_inputs)     # (batch, attention_units)
            attention_weights = self.attention_score_layers[i * 2 + 1](attention_hidden)   # (batch, concatenated_input_dim)
            current_attention_weights.append(attention_weights)

            # Apply the attention weights to the concatenated input
            # (element-wise / Hadamard product).
            attended_inputs = concatenated_inputs * attention_weights

            # Process the weighted input to generate output stream i.
            stream_output = self.output_processing_layers[i](attended_inputs)
            output_streams.append(stream_output)

        # Store the weights for possible external inspection. Only the tensors from the
        # most recent call are kept; for systematic extraction, expose the weights as
        # model outputs or collect them via a Keras Callback.
        self.learned_attention_weights = current_attention_weights
        return output_streams

    def get_config(self):
        config = super().get_config()
        config.update({
            "num_output_streams": self.num_output_streams,
            "output_stream_dims": self.output_stream_dims,
            "attention_hidden_units": self.attention_hidden_units,
        })
        return config

    @classmethod
    def from_config(cls, config):
        return cls(**config)
# Example of MAIL layer usage:
if __name__ == '__main__':
    # Model inputs.
    input_a_dim = 128
    input_b_dim = 64
    input_a = Input(shape=(input_a_dim,), name='input_A')
    input_b = Input(shape=(input_b_dim,), name='input_B')

    # Desired outputs:
    #   Output 1: regression with 10 values
    #   Output 2: binary classification (1 value with sigmoid, or 2 with softmax)
    #   Output 3: regression with 5 values
    num_outputs = 3
    output_dims = [10, 1, 5]  # Dimensionality of each output.

    # Optional individual processing before MAIL.
    # (MAIL could also be applied directly to the raw inputs: mail_layer([input_a, input_b]).)
    processed_a = Dense(64, activation='relu')(input_a)
    processed_b = Dense(32, activation='relu')(input_b)

    # Instantiate and apply the MAIL layer (concatenated dimension is 64 + 32 = 96).
    mail_layer = MIMOAttentionLayer(num_output_streams=num_outputs,
                                    output_stream_dims=output_dims,
                                    attention_hidden_units=64,
                                    name='mail_processing')
    output_streams = mail_layer([processed_a, processed_b])

    # Rename the outputs for clarity in `model.summary()` and apply final activations
    # (the MAIL layer itself uses 'linear' activations).
    output_1 = Dense(output_dims[0], activation='linear', name='output_Reg10')(output_streams[0])
    output_2 = Dense(output_dims[1], activation='sigmoid', name='output_ClassBin')(output_streams[1])
    output_3 = Dense(output_dims[2], activation='linear', name='output_Reg5')(output_streams[2])

    # Build the model. (The MAIL outputs could also be used directly:
    # model = Model(inputs=[input_a, input_b], outputs=output_streams).)
    model = Model(inputs=[input_a, input_b], outputs=[output_1, output_2, output_3])

    # Compile the model; each output gets its own loss function and metrics.
    model.compile(optimizer='adam',
                  loss={'output_Reg10': 'mse',
                        'output_ClassBin': 'binary_crossentropy',
                        'output_Reg5': 'mae'},
                  metrics={'output_ClassBin': ['accuracy']})
    model.summary()

    # Dummy data for testing.
    num_samples = 100
    X_a_dummy = np.random.rand(num_samples, input_a_dim)
    X_b_dummy = np.random.rand(num_samples, input_b_dim)
    Y_1_dummy = np.random.rand(num_samples, output_dims[0])
    Y_2_dummy = np.random.randint(0, 2, size=(num_samples, output_dims[1]))
    Y_3_dummy = np.random.rand(num_samples, output_dims[2])

    # Train the model.
    print("\nStarting dummy training...")
    history = model.fit([X_a_dummy, X_b_dummy],
                        {'output_Reg10': Y_1_dummy,
                         'output_ClassBin': Y_2_dummy,
                         'output_Reg5': Y_3_dummy},
                        epochs=3, batch_size=32, verbose=1)
    print("Dummy training completed.")

    # "Intercepting" attention weights after training.
    #
    # `mail_layer.learned_attention_weights` only holds the tensors from the most recent
    # call. Because the layer was invoked on symbolic tensors while building the functional
    # model (and inside tf.function during training), these are symbolic tensors, not
    # concrete values. For robust analysis, collect the weights at prediction time, either
    # by building a model that also returns them as outputs or by using a Keras Callback
    # (see the extraction sketch below).
    last_batch_attention_weights = mail_layer.learned_attention_weights
    if last_batch_attention_weights:
        print("\nAttention weight tensors held by the layer instance:")
        for i, weights in enumerate(last_batch_attention_weights):
            # Only the shape is meaningful here; the tensors themselves are symbolic.
            print(f"  Attention weights for Output {i + 1}: shape {weights.shape}")
```
Implementation Explanation:
- `__init__`: Initializes the number of output streams, their dimensions, and the optional number of hidden units for the attention layers.
- `build`: Creates the necessary sublayers. For each output stream, two sequential `Dense` layers (one with `tanh` and another with `softmax` over the concatenated input dimension) are created to calculate the attention weights, plus one `Dense` layer to process the weighted input and generate the stream's output.
- `call`:
  - Inputs are concatenated (if multiple).
  - For each output stream `i`:
    - Attention weights (`attention_weights`) are calculated by the corresponding `Dense` sublayers, with the `softmax` making the weights sum to 1 (behaving like importance probabilities) over the features of the concatenated input.
    - The concatenated input is weighted by element-wise multiplication with `attention_weights`.
    - The weighted input (`attended_inputs`) is passed through the output processing layer to generate `stream_output`.
  - The calculated attention weights (`current_attention_weights`) are stored in the instance variable `self.learned_attention_weights` for inspection (only the weights of the most recent call are kept).
  - Returns a list of output tensors.
- `get_config` / `from_config`: Allow the layer to be serialized and deserialized by Keras (a minimal save/load sketch follows this list).
- Usage example: Demonstrates how to instantiate the MAIL layer in a Keras model with two inputs and three outputs, compile it, and train it with dummy data. It also outlines how attention weights can be extracted, highlighting that the most robust way is to build a model that explicitly returns these weights as part of its outputs.
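Since the layer implements `get_config`/`from_config`, a trained model containing it can be saved and reloaded. A minimal sketch, assuming the `model` built in the listing above and a recent TensorFlow/Keras version that supports the `.keras` saving format; the file name is purely illustrative:

```python
import tensorflow as tf

# Save the trained MIMO model (file name is illustrative).
model.save('mimo_mail_model.keras')

# Reload it, telling Keras how to rebuild the custom MAIL layer from its config.
reloaded = tf.keras.models.load_model(
    'mimo_mail_model.keras',
    custom_objects={'MIMOAttentionLayer': MIMOAttentionLayer},
)
reloaded.summary()
```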
5. Experiments and Results (Conceptual)
To validate the MAIL layer, a set of hypothetical experiments would be conducted:
- Dataset: A synthetic or real dataset with multiple heterogeneous inputs (e.g., tabular data, time series, text embeddings) and multiple output tasks (e.g., one regression and two classifications). For example, in an industrial predictive maintenance scenario:
- Inputs: Sensor data (vibration, temperature, pressure), maintenance logs (text processed into embeddings), machine specifications (tabular).
- Outputs: Risk of failure (regression), probable failure type (classification), remaining useful life (regression).
- Baseline Model: A standard MIMO-DL architecture without the MAIL layer (e.g., simple concatenation of processed inputs followed by branches for each output).
- Model with MAIL: The same baseline architecture but with the MAIL layer inserted before the output branches.
- Metrics:
- Task Performance: Appropriate metrics for each output task (e.g., MSE for regression, Accuracy/F1-score for classification).
- Interpretability: Qualitative analysis of the attention weights (`alpha_j`). Visualizations of the weights can show which input features (or which original input streams, if the mapping is clear after concatenation) receive the most attention for each output task; a minimal visualization sketch follows the Expected Results list below. For example, it is expected that to predict "failure type," the "maintenance logs" might receive higher attention, while for "risk of failure," the "sensor data" would be weighted more heavily.
- Expected Results:
- The model with MAIL should achieve comparable or slightly superior performance to the baseline model, due to attention's ability to focus on relevant features.
- Analysis of the attention weights should provide insights into the input-output relationships learned by the model, ideally aligning with domain knowledge or revealing new interactions. For example, if input `X_1` is consistently weighted more heavily for output `Y_1` than for `Y_2`, this provides interpretable evidence of information-flow specialization.
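To support the qualitative analysis described above, attention matrices extracted as in Section 4 can be rendered as heatmaps. A minimal sketch, assuming matplotlib is available; the random Dirichlet data and output names merely stand in for real extracted weights:

```python
import matplotlib.pyplot as plt
import numpy as np

# `attn` stands in for a list with one (batch, concatenated_dim) array per output stream,
# e.g. the result of the extraction sketch in Section 4; here it is faked with random rows
# that sum to 1, mimicking softmax attention weights.
attn = [np.random.dirichlet(np.ones(96), size=5) for _ in range(3)]
output_names = ['Risk of failure', 'Failure type', 'Remaining useful life']

fig, axes = plt.subplots(len(attn), 1, figsize=(10, 6), sharex=True)
for ax, weights, name in zip(axes, attn, output_names):
    ax.imshow(weights, aspect='auto', cmap='viridis')   # rows: samples, cols: concatenated features
    ax.set_ylabel(name, rotation=0, ha='right')
axes[-1].set_xlabel('Concatenated input feature index')
fig.suptitle('Per-output attention weights (MAIL)')
plt.tight_layout()
plt.show()
```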
6. Discussion
The proposed MAIL layer offers a mechanism to dissect the complex interactions within MIMO-DL models. By forcing the model to learn explicit attention weights for each output pathway, we gain a window into its internal workings. "Intercepting" these weights allows researchers and practitioners to:
- Validate model behavior: Verify if the model is focusing on relevant features as expected by domain knowledge.
- Discover new relationships: Identify unexpected interactions between inputs and outputs that could lead to new hypotheses.
- Debug the model: If a specific output is underperforming, analyzing attention weights might indicate whether the model is failing to attend to the correct inputs.
- Improve model architecture: Insights about feature importance can guide feature engineering or the design of more efficient architectures.
Limitations:
- Interpretability provided by attention weights is not a definitive causal explanation but rather a correlation learned by the model.
- If inputs are extensively pre-processed and transformed before the MAIL layer, mapping attention weights back to original features can be complex.
- The complexity of the MAIL layer itself increases with the number of input/output streams and data dimensionality.
Future Work:
- Explore more sophisticated attention mechanisms within the MAIL layer (e.g., location-based attention, hierarchical self-attention between input streams).
- Develop more advanced visualization methods for attention weights in MIMO contexts.
- Apply MAIL to real-world problems in domains like healthcare, finance, and autonomous systems to evaluate its practical utility.
- Integrate the MAIL layer with other XAI techniques to obtain richer and more robust explanations.
7. Conclusion
The MAIL (Multi-layer Attentional Interception Layer) is a novel approach for embedding interpretability into deep neural networks with multiple inputs and multiple outputs. By explicitly learning the relevance of inputs to specific outputs through dedicated attention heads, MAIL allows for the "interception" and analysis of these relationships. The provided Python implementation demonstrates the feasibility of integrating such a layer into existing deep learning workflows. We believe MAIL represents a step towards more transparent and understandable MIMO-DL models, facilitating their adoption in critical applications.
8. References
- Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. (Reference for original attention)
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30. (Reference for Transformers and Multi-Head Attention)
- Galassi, A., Lippi, M., & Torroni, P. (2020). Attention in natural language processing. IEEE Transactions on Neural Networks and Learning Systems, 32(10), 4291-4313. (Survey on attention in NLP)
- Chaudhari, S., Mithal, V., Polatkan, G., Ramanath, R., & Bera, A. (2021). An attentive survey of attention models. ACM Transactions on Intelligent Systems and Technology (TIST), 12(5), 1-32. (Comprehensive survey on attention models)
- Samek, W., Wiegand, T., & Müller, K. R. (2017). Explainable artificial intelligence: Understanding, visualizing and interpreting deep learning models. ITU Journal: ICT Discoveries, 1(1), 39-48. (Overview of XAI)
- TensorFlow Core. Attention layers. (Accessed on June 2025). Available at: https://www.tensorflow.org/api_docs/python/tf/keras/layers/Attention
- Rame, A., & Cord, M. (2021). MixMo: Mixing Multiple Inputs for Multiple Outputs via Deep Subnetworks. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). (Example of MIMO architecture)
- Xu, D., Cheng, W., Luo, D., Liu, X., & Zhang, X. (2019). A Survey on Multi-output Learning. arXiv preprint arXiv:1907.10042. (Survey on multi-output learning).
- Sabath, A. (2021). Scikeras Tutorial: A MIMO Wrapper for CapsNet Hyperparameter Tuning with Keras. Towards Data Science. (Use of Keras for MIMO).
- Bhatia, S. (N.d.). Combining Multiple Features and Multiple Outputs Using Keras Functional API. Analytics Vidhya. (Example of Keras API for MIMO).
- MathWorks. Import Keras Layers. (Accessed on June 2025). (Support for MIMO in tools).
- Zhang, C., Li, Y., Liu, P., & Li, G. Y. (2021). An Attention-Aided Deep Learning Framework for Massive MIMO Channel Estimation. arXiv preprint arXiv:2108.09605. (Attention in MIMO for communications).
- Yu, W. (2021). A Learning Approach to the Optimization of Massive MIMO Systems. (Seminar Video, Stanford or similar, on DL in Massive MIMO).
- Gregor, K., & LeCun, Y. (2010). Learning Fast Approximations of Sparse Coding. ICML. (Reference for "unrolling" which can inspire interpretability). (The paper "Algorithm Unrolling: Interpretable, Efficient Deep Learning..." discusses how "unrolling" iterative algorithms can lead to more interpretable DL architectures.)
- Rudin, C. (2019). Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5), 206-215. (Advocacy for inherently interpretable models). (Prof. Cynthia Rudin's lab focuses on interpretability).
- DataCamp. (2024). What is Attention and Why Do LLMs and Transformers Need It? (Article explaining attention).
- Wu, M. (2022). Optimizing for Interpretability in Deep Neural Networks. (Stanford Seminar on interpretability).
- Fraunhofer HHI. Interpretable Machine Learning. (Research page on XAI).
- Nguyen, T. H. D., et al. (2023). On the Combination of Multi-Input and Self-Attention for Sign Language Recognition. International Conference on Applied Science and Engineering (ICASE). (Combination of Multi-Input and Attention).
- Hasan, M. K., et al. (2023). Implementation of the deep learning method for signal detection in massive-MIMO-NOMA systems. Scientific Reports. (DL in Massive MIMO systems).
- OpenReview. (2022). MIMONets: Multiple-Input-Multiple-Output Neural Networks Exploiting Computation in Superposition. (Paper on MIMONets).
- Analytics Vidhya. (2025). Understanding Attention Mechanisms Using Multi-Head Attention. (Article on Multi-Head Attention).
- SRI Lab, EPFL. Reliable and Interpretable Artificial Intelligence. (Research focus on reliable and interpretable AI).
- GeeksforGeeks. (2025). Multi-Head Attention Mechanism. (Tutorial on Multi-Head Attention).
- Lakkaraju, H. (2022). Stanford Seminar - ML Explainability Part 2 I Inherently Interpretable Models. (Stanford seminar video on interpretable models).
- Pal, S. & Gulli, A. (2017). 2 ways to customize your deep learning models with Keras. Packt. (On customization in Keras).
- TensorFlow Core. Custom layers. (Accessed on June 2025). Available at: https://www.tensorflow.org/guide/keras/custom_layers.