How much time do you spend debugging, implementing features, or refactoring code? These tasks can consume hours. Pair programming often speeds them up, but you may find yourself working on a solo project. In the era of AI, you can integrate tools directly into your favorite editor and gain a constant collaborator that helps improve your development workflow.
General-purpose LLMs like ChatGPT or Gemini can help you with these tasks. However, nothing compares to using a tool without having to leave your editor. In Visual Studio Code, you can use Copilot by installing the extension and logging in with your GitHub account. Let's see how it works.
To demonstrate, I selected a collection of Python scripts designed to generate test data for database schemas, which you can find in this repository.
Code Review
The following code block is from the sql.py script that generates data for MySQL and PostgreSQL:
```python
if __name__ == "__main__":
    num_cores = cpu_count() - 1
    with Pool() as pool:
        data = pd.concat(pool.map(create_dataframe, range(num_cores)))
    data.to_sql(name='employees', con=engine, if_exists='append', index=False, dtype=schema)
    with engine.connect() as conn:
        conn.execute("ALTER TABLE employees ADD id INT NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;")
        # conn.execute("ALTER TABLE employees ADD COLUMN id SERIAL PRIMARY KEY;")
```
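For context, `create_dataframe` comes from `modules/dataframe.py` and builds one batch of synthetic rows per worker process. Its real implementation lives in the repository; a minimal stand-in (the column names and row count here are illustrative, not the repository's actual schema) might look like:

```python
import random
import string

import pandas as pd

ROWS_PER_WORKER = 1_000  # hypothetical batch size; the real value lives in the repo


def create_dataframe(worker_id: int) -> pd.DataFrame:
    """Generate one worker's batch of synthetic employee rows."""
    rng = random.Random(worker_id)  # per-worker seed keeps batches distinct
    names = [
        "".join(rng.choices(string.ascii_lowercase, k=8))
        for _ in range(ROWS_PER_WORKER)
    ]
    salaries = [rng.randint(30_000, 120_000) for _ in range(ROWS_PER_WORKER)]
    return pd.DataFrame({"name": names, "salary": salaries})
```

Each worker returns an independent DataFrame, which is why the main block can simply `pd.concat` the results of `pool.map`.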
I asked Copilot to perform a code review: I selected the code block, right-clicked, and chose Generate Code > Review. The tool then analyzed the logic and provided seven suggestions to improve the script's quality.
Copilot made the following suggestions:

- Process Safety: It warned that cpu_count() - 1 could result in zero cores on single-core systems, causing the multiprocessing pool to fail or hang.
- Database Compatibility: It identified that the ALTER TABLE statement was MySQL-specific and would fail on other backends like PostgreSQL or SQLite.
- Security: To prevent SQL injection, it recommended wrapping raw SQL strings with the sqlalchemy.text() function.
- Performance: For large datasets, it suggested using the chunksize parameter in to_sql() to ensure efficient batching.
- Input Validation: It noted that pd.concat would fail if create_dataframe didn't return a valid DataFrame for every call.
- Explicit Control: It recommended passing the processes argument explicitly to the Pool for better clarity and execution control.
- Documentation: It suggested documenting the purpose of the schema argument, especially when imported from custom modules.
After reviewing the suggestions, I manually implemented the changes:
```python
if __name__ == "__main__":
    num_cores = cpu_count() if cpu_count() <= 2 else cpu_count() - 1
    with Pool(processes=num_cores) as pool:
        data = pd.concat(pool.map(create_dataframe, range(num_cores)))
    data.to_sql(name='employees', con=engine, if_exists='append', index=False, dtype=schema)
    # engine.begin() commits the DDL automatically; text() is imported from sqlalchemy.
    with engine.begin() as conn:
        conn.execute(text("ALTER TABLE employees ADD id INT NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;"))
        # For PostgreSQL, use this statement instead:
        # conn.execute(text("ALTER TABLE employees ADD COLUMN id SERIAL PRIMARY KEY;"))
```
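One suggestion I did not fold into the rewrite above is the chunksize batching. Applied to the same to_sql() call, it looks like the following sketch, which runs against in-memory SQLite so it needs no server; the batch size is an arbitrary placeholder, not a recommendation:

```python
import pandas as pd
from sqlalchemy import create_engine, text

# In-memory SQLite stands in for the MySQL/Postgres engine from base.py,
# purely to demonstrate the chunksize parameter Copilot suggested.
engine = create_engine("sqlite:///:memory:")
data = pd.DataFrame({"name": ["ada", "bob", "eve"],
                     "salary": [50_000, 60_000, 70_000]})

# chunksize splits the INSERT into batches of N rows; 2 only makes sense
# for this tiny frame. Real workloads use values in the thousands.
data.to_sql(name="employees", con=engine, if_exists="append",
            index=False, chunksize=2)
```

Batching keeps memory usage bounded and avoids oversized single statements when the generated dataset grows large.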
Refactoring
Within the modules/ directory, you will find the core logic of the generator:
- base.py: Handles the database connection setup and session management for both MySQL and PostgreSQL.
- dataframe.py: Contains the logic to generate synthetic data and store it temporarily in a pandas DataFrame before ingestion.
- schema.py: Defines the database schema and data types required for cross-database compatibility.
I opened the base.py script, which originally looked like this:
```python
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine("mysql+pymysql://user:password@localhost/company")
# engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/company")

Session = sessionmaker(bind=engine)
```
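For reference, a sessionmaker factory like the one above is consumed roughly as follows. This is a sketch against in-memory SQLite, not the repository's exact call sites:

```python
from sqlalchemy import create_engine, text
from sqlalchemy.orm import sessionmaker

# In-memory SQLite replaces the MySQL engine so the sketch runs anywhere.
engine = create_engine("sqlite:///:memory:")
Session = sessionmaker(bind=engine)

# Callers open short-lived sessions from the factory; the context manager
# closes the session and releases the connection on exit.
with Session() as session:
    result = session.execute(text("SELECT 1")).scalar()
```

Hardcoding one engine URL (with the alternative commented out) is exactly what the refactoring below replaces with environment-driven selection.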
Then I asked the tool to enable dynamic database connection selection for all the technologies supported by this project. The prompt used was the following:
Refactor my database logic into base.py. Create a function get_client() that returns either a SQLAlchemy Session (for MySQL/Postgres) or a PyMongo MongoClient (for MongoDB) based on the DB_TYPE environment variable. Ensure the function handles the connection logic for all three databases using credentials from os.getenv. Use PyMongo instead of Motor to keep the workflow synchronous. Follow PEP 8 strictly and include type hints to distinguish between the SQL session and the NoSQL client.
After analyzing the prompt, Copilot reviews the code and suggests changes that you can apply directly from the chat interface.
After applying the changes, this is what the script looks like:
```python
import os
from typing import Tuple, Union

from pymongo import MongoClient
from sqlalchemy import create_engine
from sqlalchemy.engine import Engine


def _build_sql_url(db_type: str) -> str:
    """Build a SQLAlchemy URL for MySQL or PostgreSQL."""
    user = os.getenv("DB_USER", "")
    password = os.getenv("DB_PASSWORD", "")
    host = os.getenv("DB_HOST", "localhost")
    port = os.getenv("DB_PORT")
    db_name = os.getenv("DB_NAME", "")
    if not db_name:
        raise ValueError("DB_NAME environment variable must be set")
    if db_type == "mysql":
        driver = "mysql+pymysql"
        port = port or "3306"
    elif db_type == "postgres":
        driver = "postgresql+psycopg2"
        port = port or "5432"
    else:
        raise ValueError(f"Unsupported SQL DB_TYPE: {db_type!r}")
    return f"{driver}://{user}:{password}@{host}:{port}/{db_name}"


def _build_mongo_uri() -> str:
    """Return a MongoDB URI using authSource=admin."""
    user = os.getenv("DB_USER", "")
    password = os.getenv("DB_PASSWORD", "")
    host = os.getenv("DB_HOST", "localhost")
    port = os.getenv("DB_PORT", "27017")
    db_name = os.getenv("DB_NAME", "")
    if not db_name:
        raise ValueError("DB_NAME environment variable must be set")
    return (
        f"mongodb://{user}:{password}@{host}:{port}/{db_name}"
        "?authSource=admin"
    )


def get_client() -> Tuple[str, Union[Engine, MongoClient]]:
    """Return the DB_TYPE and a matching client.

    For 'mysql' or 'postgres', returns a SQLAlchemy Engine;
    for 'mongodb', returns a pymongo.MongoClient.
    """
    db_type = os.getenv("DB_TYPE", "mysql").lower()
    if db_type in {"mysql", "postgres"}:
        url = _build_sql_url(db_type)
        engine = create_engine(url)
        return db_type, engine
    if db_type == "mongodb":
        uri = _build_mongo_uri()
        return db_type, MongoClient(uri)
    raise ValueError(f"Unsupported DB_TYPE: {db_type!r}")
```

Note that Copilot's output diverged slightly from the prompt: get_client() returns a (db_type, client) tuple, and for the SQL backends the client is a SQLAlchemy Engine rather than a Session, so the type hints and docstring reflect that.
You must create a .env file and set the environment variables for the database connection details:
```
# .env.example
DB_TYPE=mysql
DB_USER=your_user
DB_PASSWORD=your_password
DB_HOST=localhost
DB_PORT=3306
DB_NAME=your_database
```
Here, DB_TYPE can be set to any of the following values: mysql, postgres, or mongodb.
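The scripts read their settings through os.getenv, so the .env file has to reach the process environment somehow. The python-dotenv package is the usual tool for this; how the repository actually loads the file is an assumption on my part. A stdlib-only loader would look like:

```python
import os
from pathlib import Path


def load_env(path: str = ".env") -> None:
    """Minimal stdlib .env loader; python-dotenv is the usual choice.

    Variables already present in the environment win over file values.
    """
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        # Skip blanks, comments, and malformed lines.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())
```

Call load_env() once at startup, before get_client() reads DB_TYPE and friends.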
Now that the database connection selection is configured, the sql.py script must be adapted to automatically select the database and insert the data based on the environment variables in the .env file:
```python
if __name__ == "__main__":
    num_cores = cpu_count() if cpu_count() <= 2 else cpu_count() - 1
    db_type, client = get_client()
    with Pool(processes=num_cores) as pool:
        data = pd.concat(pool.map(create_dataframe, range(num_cores)))
    if db_type in {"mysql", "postgres"}:
        data.to_sql(name='employees', con=client, if_exists='append', index=False, dtype=schema)
        with client.begin() as conn:
            if db_type == "mysql":
                conn.execute(text("ALTER TABLE employees ADD id INT NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;"))
            elif db_type == "postgres":
                conn.execute(text("ALTER TABLE employees ADD COLUMN id SERIAL PRIMARY KEY;"))
    elif db_type == "mongodb":
        data_dict = data.to_dict('records')
        db = client["company"]
        collection = db["employees"]
        collection.insert_many(data_dict)
```
Since a single entry point now handles all three databases, rename the sql.py script to generate_data.py and delete the mongodb.py script.
Running Tests
Configure a database instance to validate that the script works. Run one of the following commands to start the corresponding container, depending on the database you're using for your project.
- MariaDB:
```shell
docker run -d \
  --name mariadb \
  -p 3306:3306 \
  -e MARIADB_ROOT_PASSWORD=12345 \
  -e MARIADB_DATABASE=company \
  mariadb
```
- PostgreSQL:
```shell
docker run -d \
  --name postgres \
  -p 5432:5432 \
  -e POSTGRES_PASSWORD=12345 \
  -e POSTGRES_DB=company \
  postgres
```
- MongoDB:
```shell
docker run -d \
  --name db-mongo \
  -p 27017:27017 \
  -e MONGO_INITDB_ROOT_USERNAME=username \
  -e MONGO_INITDB_ROOT_PASSWORD=12345 \
  mongo
```
Then edit the .env file and replace the values of the environment variables with the connection details.
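For example, for the MariaDB container above, the matching .env values would be the following (root is the image's default superuser; substitute a dedicated account if you create one):

```
DB_TYPE=mysql
DB_USER=root
DB_PASSWORD=12345
DB_HOST=localhost
DB_PORT=3306
DB_NAME=company
```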
Now you can run the script:
```shell
python generate_data.py
```
The script is designed to create data for a database that stores information about employees, but you can adapt it to your needs.
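To confirm the rows actually landed, a quick read-back is enough. This is sketched against in-memory SQLite so it runs without a server; point the URL at your real database (e.g. a mysql+pymysql:// connection string) to check the actual load:

```python
import pandas as pd
from sqlalchemy import create_engine

# SQLite stand-in; replace the URL with your MySQL/Postgres connection string.
engine = create_engine("sqlite:///:memory:")
pd.DataFrame({"name": ["ada"], "salary": [50_000]}).to_sql(
    "employees", engine, index=False)

# read_sql pulls the query result back into a DataFrame for inspection.
check = pd.read_sql("SELECT COUNT(*) AS n FROM employees", engine)
print(check)
```

For MongoDB, the equivalent check would be a collection.count_documents({}) call against the company database.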
Conclusion
While GitHub Copilot is a powerful tool to integrate within your workflow, it is essential to remember that you must always review and validate the suggested code, as it may not work as expected. The quality of the output depends heavily on your input; therefore, your prompts should be as descriptive as possible, providing clear context about your architectural requirements and dependencies.

