Python

This describes Python-specific guidance for checking code into //piper/third_party/py.

IMPORTANT: Read go/thirdparty first.

NOTE: Python packages are installed in subdirectories of //piper/third_party/py (see also the Epydoc page for //third_party/py).

Overview

Adding third-party code to google3 is relatively straightforward, but it has a couple of additional steps to ensure that legal and correctness requirements are met.

In brief, the overall process is:

  1. Follow the steps (see following sections) for importing your code. Be sure to follow go/pristinecopy.
  2. Create a CL and send it for review.

Adding from external sources of truth (GitHub, etc.)

This is for code that doesn’t come from elsewhere in google3. This includes code from GitHub, Bitbucket, PyPI, and Git-on-Borg.

IMPORTANT: You must follow go/pristinecopy. Your code will not be approved otherwise.

To import external code, follow these steps:

Once your code structure is done, follow the steps for code review.

Adding from internal sources of truth

This is for code that is either developed directly in third_party or elsewhere in google3.

  • Check that the package name (the third_party/py directory name) is unused in the global Python ecosystem. Check PyPI, GitHub, or other popular sites. IMPORTANT: Your Python package name must be unused in the external Python ecosystem. See Unique Package Names.
  • Follow other Common requirements.
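One part of the name check can be scripted. The following is a minimal sketch, assuming PyPI's public JSON endpoint (https://pypi.org/pypi/<name>/json), which returns 404 for unregistered projects. Note that PyPI indexes distribution names, which usually but not always match the top-level import name, so a negative result is only a hint and does not replace checking GitHub and other sites. The function names are illustrative.

```python
import re
import urllib.error
import urllib.request


def normalize(name):
  """Normalizes a project name the way PyPI does (PEP 503)."""
  return re.sub(r"[-_.]+", "-", name).lower()


def pypi_json_url(name):
  """Builds the PyPI JSON API URL for a project name."""
  return f"https://pypi.org/pypi/{normalize(name)}/json"


def is_registered_on_pypi(name, timeout=10.0):
  """Returns True if a project with this name exists on PyPI.

  Performs a network request, so it only works where pypi.org is reachable.
  """
  try:
    with urllib.request.urlopen(pypi_json_url(name), timeout=timeout) as response:
      return response.status == 200
  except urllib.error.HTTPError as err:
    if err.code == 404:
      return False  # No project registered under this name.
    raise
```

For example, is_registered_on_pypi("spam") reports whether the distribution name spam is already taken on PyPI.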

Once your code structure is done, follow the steps for code review.

Externally unique package names

Because Python packages share a global namespace, packages need to have unique top-level names in the larger Python ecosystem. If two packages try to use the same name, it effectively becomes impossible for a program and its transitive dependencies to use both at the same time.

As an example of how this creates an impossible situation, consider a program, app.py, that imports one and two. one and two are unrelated to each other, are owned by different people, and neither knows nor cares about the other. one imports conflict, intending to get Alice’s conflict library to do Alice stuff. two also imports conflict, but intends to get Bob’s conflict library to do Bob stuff.

If app.py depends on Alice’s conflict library, then two doesn’t work. If it depends on Bob’s, one doesn’t work. But app.py needs both one and two, otherwise it isn’t very useful. So now app.py is stuck, and one and two are mutually exclusive with each other.

Because this problem applies transitively, the only real way to avoid it is for Alice’s and Bob’s library to have different top-level names.
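The collision can be reproduced in a few lines. This is a minimal sketch using temporary directories to stand in for Alice's and Bob's libraries; the OWNER attribute and function name are made up for the illustration. Whichever directory comes first on sys.path wins, and every later import of conflict in the same process gets that same module object.

```python
import importlib
import pathlib
import sys
import tempfile


def resolve_conflict():
  """Creates two `conflict` modules and imports the name twice.

  Returns (owner_of_imported_module, second_import_is_same_object).
  """
  with tempfile.TemporaryDirectory() as root:
    search_dirs = []
    for owner in ("alice", "bob"):
      owner_dir = pathlib.Path(root) / owner
      owner_dir.mkdir()
      (owner_dir / "conflict.py").write_text(f"OWNER = {owner!r}\n")
      search_dirs.append(str(owner_dir))

    sys.path[:0] = search_dirs  # Alice's directory is searched first.
    importlib.invalidate_caches()
    sys.modules.pop("conflict", None)
    try:
      first = importlib.import_module("conflict")   # What `one` would get.
      second = importlib.import_module("conflict")  # What `two` would get.
      return first.OWNER, second is first
    finally:
      # Undo the changes so the demo leaves the interpreter untouched.
      sys.modules.pop("conflict", None)
      for entry in search_dirs:
        sys.path.remove(entry)
```

Both imports resolve to Alice's module, so code written against Bob's library has no way to reach it.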

NOTE: Your package name doesn’t have to be absolutely, unequivocally globally unique. It just has to be a name that is unused to the best of your knowledge. There is no central authority for Python package name assignment.

TIP: If you’re part of a larger project, you can use namespace packages to independently release sub-packages within a top-level package, thus avoiding any potential name collisions.
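As an illustration of the namespace package tip, here is a minimal sketch of standard Python (PEP 420) namespace packages, where two independently released directories contribute sub-packages to one top-level name. The names demo_namespace, dist_a, and dist_b are hypothetical, and the handling of namespace packages inside third_party/py may differ from plain Python.

```python
import importlib
import pathlib
import sys
import tempfile

NAMESPACE = "demo_namespace"  # Hypothetical shared top-level name.


def import_from_two_distributions():
  """Imports two sub-packages that live in separate directory trees."""
  with tempfile.TemporaryDirectory() as root:
    roots = []
    for dist, sub in (("dist_a", "eggs"), ("dist_b", "bacon")):
      dist_root = pathlib.Path(root) / dist
      sub_pkg = dist_root / NAMESPACE / sub
      sub_pkg.mkdir(parents=True)
      # Only the sub-package gets an __init__.py. The shared top-level
      # directory has none, which makes it a namespace package.
      (sub_pkg / "__init__.py").write_text(f"NAME = {sub!r}\n")
      roots.append(str(dist_root))

    sys.path[:0] = roots
    importlib.invalidate_caches()
    try:
      eggs = importlib.import_module(f"{NAMESPACE}.eggs")
      bacon = importlib.import_module(f"{NAMESPACE}.bacon")
      return eggs.NAME, bacon.NAME
    finally:
      # Remove the demo modules and path entries again.
      for name in [m for m in sys.modules
                   if m == NAMESPACE or m.startswith(NAMESPACE + ".")]:
        del sys.modules[name]
      for entry in roots:
        sys.path.remove(entry)
```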

Common requirements

  • Python 3 support: the code must support Python 3. This means setting srcs_version="PY3" and python_version="PY3".
  • As-installed file layout: the layout of the .py files must be the same as when installed. It’s not uncommon for the source layout to be different than the installed layout because e.g. all the code is under a src sub-directory or setup.py moves files around.
  • Directory name matches import name: The third_party/py directory name must match the top-level import name. i.e. third_party/py/spam must be imported as import spam.
  • Any additional go/thirdparty requirements.

Code Review

Once your code is ready, send a code review to third-party-*removed* to have a reviewer assigned automatically. A third_party/py/OWNERS reviewer will then review your CL, verify it is properly imported, and approve your CL. The whole process typically takes a couple of days.

It is OK for building and testing to fail in the initial CL because go/pristinecopy prevents you from making fixes. Adding tests and build rules is still required, though, so that we can verify the package is being set up correctly.

NOTE: Code review by third-party-removed is only required for the initial CL adding a new package to third_party/py. Subsequent CLs should follow the regular Google review process (LGTM by anyone, Approval by package owner). The auto-assigner will try to detect whether a package is new, but it may fail. If it thinks your package is new when it isn’t, you do not need to wait for review before submitting. If you need or want third-party-removed review and the auto-assigner isn’t assigning anyone, send it to emailremoved@ for review.

Using third-party packages that have already been installed

To use a module named PIL, you need to add an import statement to your code, and list it as a BUILD dependency.

Code in my_main_binary.py:

import google3  # Set up import path (not necessary with hermetic Python)
import PIL.Image  # Import the submodule explicitly, then use it as normal
...
def MyFunc():
  c = PIL.Image.open(...)
  ...

BUILD rule:

py_binary(
    name = "myprogram",
    ...
    deps = [ ...
        "//third_party/py/PIL:pil",
    ],
    ...
)

If your binary is not using our default Hermetic Python runtime under Blaze, your program must directly or indirectly import the google3 package before importing any third-party code. If you do the imports in the incorrect order, you will get an ImportError.

If you invoke the Python interpreter interactively, or run a Python program without going through the google3 build system, the infrastructure will try to make your imports work (as long as your program is inside a recognizable google3 source tree).

In the example above, PIL is actually a package, and Image is a module inside that package. If you are attempting to use a plain old single level module, you’d use import lines like this:

import google3  # (if necessary)
import SOAPpy
...
SOAPpy.foo(...)

and the corresponding BUILD rule

    deps = [ ...
        "//third_party/py/SOAPpy",
    ],

Installing new third-party packages

Preferred method: install with Puppy

The preferred method for importing new Python packages is to use Puppy (go/puppy-python), a command line tool for gLinux. It will transform GitHub projects to third_party/py format and generate a go/copybara config to make future updates easier.

See go/puppy-python for detailed documentation. As a quick example, importing a package hosted at https://github.com/google/example can be done by running the following command (note that this needs the ‘quilt’ and ‘python3-venv’ Debian packages installed):

blaze run //devtools/python/janitor/puppy -- \
  --new https://github.com/google/example

This generates a CL importing the example package into //third_party/py, as well as a Copybara configuration file which can be used to easily bring in future updates, apply Google-specific patches, and have go/3pp-upgrade-service register CaaS for you. You’ll still need to write the BUILD file yourself; see the BUILD section below for details.

Creating a Copybara config

Setting up Copybara is less daunting than the large volume of docs and configuration options might imply. We highly recommend using Puppy to, at the least, generate a base Copybara config. Once generated, you can modify it as you please.

For detailed Copybara documentation, see go/copybara; for git-specific docs, see go/copybara-git-sot.

If you have to manually create a Copybara config, please use the git_to_third_party_py macro unless it doesn’t work for your case. There are usually three things you need to do:

  • Move and rename files: this is done using core.move().
  • Apply source modifications: this is done using transformations (e.g. core.replace() or patches, which are a special type of transformation).
  • Ignore extraneous files: this is done using the exclude parameter of glob() when passing in the list of files to include to the git_files parameter of git_to_third_party_py.

Putting this all together, here is a very basic and minimal Copybara config to get you started:

load("//devtools/python/janitor/puppy/puppy", "git_to_third_party_py")

git_to_third_party_py(
    git_origin_url = "https://github.com/project/spam",
    # Your package will be imported to //third_party/py/spam.
    python_package_name = "spam",
    git_files = glob(
        include = ["**"],
        exclude = ["bad_spam.py", "docs/example.md", "samples/*"],
    ),
    transformations = [
        core.move("src", "")
    ],
    patching_enabled = False,
    version_selector = None,
    git_ref = "main",
)

Modifying source code

It’s not uncommon for third party code to need a few minor modifications.

There are several ways to do this:

  1. Copybara transformations: best for simple, regex based find-replace changes.
  2. Copybara patching: best for complicated modifications that a Copybara transformation can’t do.
  3. Manually applied patches: best when you’re not using Copybara.
  4. Directly modify the code: an option of last resort, and only if you’re not using Copybara.

BEST PRACTICE: Using Copybara is the best way to manage modifications to third party source. The main advantage is that, when you later upgrade the library, the modifications will be re-applied and you don’t have to figure out what was done months ago.

BEST PRACTICE: Avoid making changes just to satisfy our internal linter or formatter. Those changes unnecessarily make future upgrades difficult.

Copybara transformations

Copybara transformations are a quick and simple way to make sed-like modifications to the source code. Custom Starlark code can also be used to create more advanced transformations.

See the Copybara API reference for detailed documentation about Copybara’s API. We only list the ones you’re most likely to need.

  • core.replace(): regex-based find-replace transformation. It’s ideal for, e.g., replacing problematic imports.
  • core.move(): Rename files and directories. Ideal for e.g., moving code out of a src sub-directory, renaming LICENSE files, and turning non-packages into packages.
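To make the "replacing problematic imports" use case concrete, here is the kind of rewrite involved, sketched in plain Python with re.sub. This is not Copybara's API, only an illustration of a regex-based find-replace; the spam._vendor import path and the function name are hypothetical.

```python
import re

# Hypothetical problem: upstream imports a vendored copy of a dependency,
# and we want the top-level package instead.
_VENDORED_IMPORT = re.compile(
    r"^([ \t]*)from spam\._vendor import (\w+)$", re.MULTILINE)


def rewrite_vendored_imports(source):
  """Rewrites `from spam._vendor import X` to `import X`, keeping indentation."""
  return _VENDORED_IMPORT.sub(r"\1import \2", source)
```

Lines that don't match the pattern are left untouched, which is what keeps this kind of transformation safe to re-apply on every upgrade.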

Copybara patches

Copybara patches are ideal for complicated modifications that regular transformations can’t do.

The basic way to do this is:

  1. Create a patches directory and add patch files to it (example). See Manual patches for how to generate and maintain patch files.
  2. Create a patches/series file that lists the patch file names in the order to apply them (example)
  3. Pass patching_enabled = True to the git_to_third_party_py macro. Also remember to:
    • Add the patch file to the CL.
    • Add the patch file to the series file.

Copybara will then apply the patches when it imports code, after having applied the other Copybara transformations.
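Forgetting to list a patch in the series file, or listing one that was never added, is easy to do. The following is a minimal sketch of a consistency check, assuming a quilt-style series file with one patch name per line, optional trailing options, and # comments. The function names are illustrative and not part of Copybara or Puppy.

```python
def parse_series(series_text):
  """Returns the patch names from a series file, in application order."""
  names = []
  for line in series_text.splitlines():
    entry = line.split("#", 1)[0].strip()  # Drop comments and whitespace.
    if entry:
      names.append(entry.split()[0])  # Ignore options after the name.
  return names


def check_patches(series_text, patch_files):
  """Compares the series file against the patch files in the directory.

  Returns (listed_but_missing, present_but_unlisted).
  """
  listed = parse_series(series_text)
  missing = [name for name in listed if name not in patch_files]
  unlisted = sorted(set(patch_files) - set(listed))
  return missing, unlisted
```

Both returned lists should be empty before the CL is sent for review.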

In the end, you should have a config similar to this:

load("//devtools/python/janitor/puppy/puppy", "git_to_third_party_py")
git_to_third_party_py(
    ...
    patching_enabled = True,
)

Manual patches

Manual patches are ideal for when you can’t use Copybara. Their main advantage is, because they record what changed, they can be easily re-applied to future imports of the third party code.

Patches are, by convention, kept in a patches sub-directory of the third party code.

To aid generating and managing patches, you can use go/qu4, which is a tool that will track changes to files and generate diffs for you.

Once you have a patch file, you can apply it to the code, then send a separate CL (after the initial, pristine code import) as a direct modification.
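To show what a patch file records, here is a minimal sketch that produces a unified diff with Python's standard difflib module. It is not a substitute for the tooling mentioned above; the a/ and b/ path prefixes follow the common git convention, and the function name is illustrative.

```python
import difflib


def make_patch(original, modified, path):
  """Returns a unified diff that turns `original` into `modified`."""
  diff = difflib.unified_diff(
      original.splitlines(keepends=True),
      modified.splitlines(keepends=True),
      fromfile=f"a/{path}",
      tofile=f"b/{path}",
  )
  return "".join(diff)
```

The result can be saved under the patches sub-directory. An empty string means the two versions are identical.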

Direct modifications

Direct modifications of the source should be considered a last resort. The main disadvantage of them is they make later upgrades more difficult: the changes will be lost, and someone has to go through the change history to figure out what to reapply and how to reapply it.

In any case, doing this is simple: just modify the code and send a CL. It’s strongly suggested to create manual patches for any changes so that later upgrades are easier.

IMPORTANT: Remember go/pristinecopy: source modifications are not allowed in the initial CL (Copybara excepted).

Dependencies

If your third-party package X depends on another third-party package Y, install Y at the top level //third_party/py/Y instead of trying to put it as a subdir inside your //third_party/py/X.

Directory structure

NOTE: When installing a new package, if possible please follow the preferred method section instead of generating the folder structure manually. This will allow you to automatically pull in future package updates, and reduce the amount of time you’ll have to spend updating your CL to fix structural issues.

One of the main considerations when introducing new software in //piper/third_party/py is to ensure that the way the new software is imported by other Python code remains the same inside Google as outside. This is an important concern so that software in //piper/third_party/py that depends on other software in //piper/third_party/py needn’t be modified to reflect a Google-specific way of importing one of its dependencies.

For example, if the spam software is typically imported with:

import spam
from spam import bacon

it should work the same way inside google3.

There is magic in //piper/…/__init__.py that will ensure statements like import xyz or from xyz import zzx will find software from //third_party/py/xyz once google3 has been imported. The following sections explain how to make sure you install the software in //piper/third_party/py in a way that will allow everything to work.

You will know that you’ve installed your software correctly into //piper/third_party if, after building a script that depends on it, it can import and use your software in the same way the upstream examples do.

Determine Python package type

If it’s not apparent how the Python code is packaged, here’s some guidance on how to figure it out:

Packages

When a third-party software spam is installed as a Python package (a directory with an __init__.py file), we just duplicate the package structure under third_party/py/spam and everything will automatically work. An example of this structure would be:

google3/
  third_party/
    py/
      spam/
        BUILD
        METADATA
        OWNERS
        __init__.py
        bacon.py

You can recognize your software is being distributed as a package if it has an __init__.py file, accompanied by zero or more other Python files or binary extensions.

With the above structure, the following would work:

import google3  # (if necessary)
import spam
from spam import bacon

Often, packages are distributed as source packages containing a Python package: they will have a setup.py file describing how to install the package, and the actual Python package as a subdirectory (alongside the setup.py file or in a src subdirectory). The Python package is the thing that needs to be duplicated in //third_party/py. The preferred installation method takes care of this automatically.

Namespace packages

Namespace packages are mostly treated as if they weren’t namespaced.

  • Add the code, as usual, as a sub-directory of its containing package.
  • In the parent package’s METADATA, set third_party { type: GROUP }.

For example, given a spam.eggs namespaced package, the file layout should resemble:

# Relative to third_party/py
spam/METADATA  # type: GROUP
spam/eggs/METADATA
spam/eggs/OWNERS, BUILD, LICENSE, etc
spam/eggs/*.py, etc

Third-party software not distributed as a package

Some third-party libraries are not structured as Python packages: there will be no __init__.py file, and typically just one single Python source file, e.g. eggs.py, that gets imported with import eggs.

In this case, the library must be transformed into a package in order for the Google machinery to work. You do that by creating a file named //piper/third_party/py/eggs/__init__.py, and placing the contents of eggs.py inside it:

google3/
  third_party/
    py/
      eggs/
        BUILD
        METADATA
        OWNERS
        __init__.py     # Has the contents of eggs.py.

The following will work:

import google3  # (if necessary)
import eggs
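The conversion itself is mechanical. Here is a minimal sketch in plain Python, using a temporary directory in place of third_party/py and a made-up module named demo_eggs. It shows that the import statement is the same before and after the module becomes a package.

```python
import importlib
import pathlib
import shutil
import sys
import tempfile


def module_to_package(module_file, dest_root):
  """Copies eggs.py to <dest_root>/eggs/__init__.py and returns the package dir."""
  module_file = pathlib.Path(module_file)
  package_dir = pathlib.Path(dest_root) / module_file.stem
  package_dir.mkdir(parents=True)
  shutil.copyfile(module_file, package_dir / "__init__.py")
  return package_dir


def demo():
  """Converts a single-file module and imports the resulting package."""
  with tempfile.TemporaryDirectory() as root:
    upstream = pathlib.Path(root) / "upstream"
    upstream.mkdir()
    (upstream / "demo_eggs.py").write_text("def fry():\n  return 'fried'\n")

    dest = pathlib.Path(root) / "py"  # Stands in for third_party/py.
    package_dir = module_to_package(upstream / "demo_eggs.py", dest)

    sys.path.insert(0, str(dest))
    importlib.invalidate_caches()
    try:
      module = importlib.import_module("demo_eggs")
      return package_dir.name, module.fry()
    finally:
      sys.modules.pop("demo_eggs", None)
      sys.path.remove(str(dest))
```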

It may also be the case that, in addition to eggs.py, the software includes some private helpers that are not meant to be imported by the user of the software. For example, if in the case above a module _eggs.py was also included, it’s fine to ship it in the same directory, thus:

google3/
  third_party/
    py/
      eggs/
        BUILD
        METADATA
        OWNERS
        __init__.py     # Has the contents of eggs.py.
        _eggs.py

In this case, import _eggs will not work except when in __init__.py or other files in that directory; but that shouldn’t be a concern since it’s a private module.

Finally, if the software consists of several modules, e.g. milk.py and chocolate.py, all of which should be importable by the user as top-level modules (that is, import milk, chocolate should work), please get in touch with third-party-removed to devise a sensible solution for your case. But this would be very atypical.

Extraneous files

Because open source projects and google3 differ in their development tooling, open source code typically has many files that aren’t relevant to google3. Since they go unused in google3, their presence is confusing.

In particular, remove files that appear to be part of the build/test process, including, but not limited to:

  • setup.py, setup.cfg, MANIFEST, requirements.txt et al
  • Makefiles, configure scripts, et al
  • Project config files: mypy.ini, pytest.ini, py.typed, tox.ini, pyproject.toml, appveyor.yml, etc
  • Dot files

It’s recommended to also remove the following:

  • Unused code, such as sample or example code.
  • Unbuilt/unused binaries.
  • Documentation. You may keep it if you wish, but usually it is only readable in source form because, e.g., g3doc won’t render third party docs.

See Puppy’s git_exclude list for more examples. To exclude these files:

  • If you’re using Copybara: add the patterns to origin’s exclude patterns.
  • If you’re using Puppy: add the patterns to git_exclude
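Before writing exclude patterns, it can help to list what an import would drag in. The following is a minimal sketch that flags candidates using only the file names from the lists above; the pattern tuple and function name are illustrative and are not Puppy's actual git_exclude list.

```python
import fnmatch

# File-name patterns taken from the lists above. ".*" covers dot files.
EXTRANEOUS_PATTERNS = (
    "setup.py", "setup.cfg", "MANIFEST", "requirements.txt",
    "Makefile", "configure",
    "mypy.ini", "pytest.ini", "py.typed", "tox.ini",
    "pyproject.toml", "appveyor.yml",
    ".*",
)


def find_extraneous(relative_paths):
  """Returns the paths that look like build, test, or dot-file clutter.

  Paths are "/"-separated and relative to the upstream repository root.
  """
  flagged = []
  for path in relative_paths:
    parts = path.split("/")
    in_dot_dir = any(part.startswith(".") for part in parts[:-1])
    matches = any(
        fnmatch.fnmatchcase(parts[-1], pattern)
        for pattern in EXTRANEOUS_PATTERNS)
    if in_dot_dir or matches:
      flagged.append(path)
  return flagged
```

The flagged paths are a starting point for the exclude list and still need a manual look, since some projects load files such as py.typed at run time.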

Writing BUILD rules

Create a BUILD file with a single py_library rule and at least one test. This will look like:

py_library(
    name = "spam",
    srcs = [
        "__init__.py",
        "bacon.py",
        ...
    ],
    srcs_version = "PY3ONLY",
)

py_test(
    name = "spam_test",
    srcs = ["spam_test.py"],
    srcs_version = "PY3ONLY",
    python_version = "PY3",
    deps = [
        ":spam",
        "//testing/pybase"
    ],
)

The code must support Python 3. If it does not, expect strong pushback from your third-party-removed reviewer. Exceptions will be rare, likely involving you immediately taking on porting the code to PY3 in a child CL.

Determine test framework

If it’s not apparent what test framework the third party code uses, here are some ways to figure it out:

  • If pytest is imported somewhere, then it likely uses pytest.
  • If test files don’t have an if __name__ == '__main__': ... block, then it likely uses pytest.
  • If unittest.main() is called, then it uses the stdlib’s unittest.
  • If absltest.main() is called, then it uses absltest.
  • If nose is mentioned, then it likely uses nose.
  • If it has tests, and those tests have an if __name__ == '__main__': ... block, but it doesn’t appear to use unittest or absltest, then it can probably be treated the same as if it was using unittest.

Creating a basic test

If the third party code lacks tests, then you need to create a basic test to ensure your targets build and can be imported. Here is a simple template to copy:


import spam
import unittest


class SpamTest(unittest.TestCase):
  def test_basic(self):
    self.assertTrue(spam.some_attribute)

if __name__ == '__main__':
  unittest.main()

Building extension modules

Always build Python binary extensions using a py_extension rule (or py_wrap_cc in the case of a SWIG-wrapped extension). This sets the appropriate options for the compiler and, more importantly, configures the library for loading dynamically at run time, together with its dependencies. This ensures, for example, that binaries in google3 depending on two packages in //piper/third_party/py, that in turn depend themselves on OpenSSL, will only load a single copy of OpenSSL.

Do not use cc_binary or cc_library to build Python binary extensions. If you come across any documentation recommending that you do so, please contact emailremoved@ for investigation.

Here’s a sample BUILD file for a Python library with one binary extension:

py_library(
    name = "spam",
    srcs = [
        "__init__.py",
        "bacon.py",
    ],
    deps = [
        ":_eggs",
    ],
)

py_extension(
    name = "_eggs",
    outs = ["_eggs.so"],
    srcs = [
        "eggs.c",
        "util.c",
    ],
    deps = [
        "//third_party/python_runtime:headers",
    ],
)

Some important notes:

  • //third_party/python_runtime:headers is always needed as a dependency; this will load the version of Python associated with the Crosstool version in use.
  • If the binary extension requires some library to work, add it in deps, e.g. //third_party/openssl:crypto.
  • If the C code requires some extra options for the compiler, you can use the copts attribute; however, you will need to add "$(PYTHON_EXTENSION_COPTS)" to it, since that is the default for py_extension.
  • If the third-party software ships several binary extensions (several .so files), and they all share some utility code in a common file (util.c, for example), do not include that file in the srcs attribute of each extension. Instead, create a separate cc_library with the utility code, and add it to the extensions as a dependency. (See an example in //piper/third_party/py/OpenSSL/BUILD)

Other gotchas

  • Only //third_party/python_runtime:headers is needed as a dependency for Python extensions. In particular, Python extensions must never depend on //third_party/python_runtime:embedded_interpreter, which brings in libpython itself: binary extensions will always be loaded into a process that embeds this library already (be it the Python interpreter, or some other process), and duplication would result in hard-to-diagnose crashes.

Precompiled extension modules

Being able to run Python code without going through the BUILD system is sometimes desired. However, this requires checking in compiled binaries for all extension modules, which brings with it a high maintenance burden on the packages’ owners and on other teams (e.g. the compiler and Python teams.) Please consider whether you really need this, as it’s become exceedingly rare in google3.

If there is a real need to provide precompiled extension modules, the code can be structured to make this possible. Do realize that you are committing yourself (and the other owners of your package) to regular maintenance to rebuild the package with newer compilers and Python versions. For each extension module, provide a precompiled version. Then provide Python code that, at run-time, will first try to locate the shared library in the build system, then fall back to a precompiled version.
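The run-time fallback described above can be sketched as a helper that tries a list of candidate module names in order. This is a minimal illustration, not an existing google3 utility; the module names in the usage note are hypothetical.

```python
import importlib


def import_first_available(module_names):
  """Imports and returns the first module in `module_names` that loads.

  Raises ImportError listing every failure if none can be imported.
  """
  errors = []
  for name in module_names:
    try:
      return importlib.import_module(name)
    except ImportError as err:
      errors.append(f"{name}: {err}")
  raise ImportError("no candidate could be imported: " + "; ".join(errors))
```

A package could then load its extension with something like import_first_available(["spam._eggs", "spam._precompiled._eggs"]), where the first name is the build-system version and the second is the checked-in precompiled one.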

Alternate build mechanisms

Using distutils

Python comes with a standard mechanism for packaging and installing third-party modules. However, it has a few notable limitations. In particular, it’s not very good for cross-compiles (compiling at corp to run in prod is essentially a cross-compile).

In theory, one could pass in all the right flags to make it work. However, it probably would be easier to simply write a standard google3 build rule.

Reviewer Checklist

  • [ ] Initial checkin is pristine: go/pristinecopy
  • [ ] OWNERS file lists at least two owners
  • [ ] METADATA’s third_party.url.value (type: GIT) and copy.bara.sky’s git.origin refer to the same repository.
  • [ ] Has at least one usable target (e.g. py_library) and test (e.g. py_test).
  • [ ] Python 3 support is tested.
  • [ ] Tests call unittest.main() OR pytest build rules are used.
  • [ ] Directory name matches imported name.
  • [ ] Import name is unclaimed in the Python ecosystem (only applies to to-be-open-sourced packages).
  • [ ] File layout is the “as installed” layout.

The following can be copy/pasted into the CL description to aid validation and tracking on a per-CL basis:

Startblock:
  has LGTM from http://linkremoved/
  has tag PRISTINE
  has tag METADATA_URL_MATCHES_COPYBARA_URL
  has tag HAS_RULES_AND_TESTS
  has tag SUPPORTS_PY3
  has tag DIRNAME_MATCHES_IMPORT_NAME
  has tag IMPORT_NAME_OK
  has tag FILE_LAYOUT_OK

WANT_LGTM=all

Add startblock to the reviewers to enforce the above. Startblock will then wait for the above tags to be added (e.g. PRISTINE=yes) before approving.

Design notes

NOTE: This section describes the pre-Hermetic-Python implementation. One feature of Hermetic Python is that import google3 is no longer required for third_party/py/ imports. If your py_binary and py_test rules use the default hermetic Python runtime, you can skip adding it to your main .py files. py_library sources never need import google3.

Import path manipulation is done in the file //piper/…/__init__.py. This code runs when the program first executes ‘import google3’ or ‘from google3.x.y import z’.

In particular, this line runs:

_SetupThirdParty(sys.path, _google3_path)

where _google3_path is a list of all the “google3-like” directories found, which might be like

["/path/to/.../src3", "/path/to/.../READONLY",]

For each of these paths, the code uses a set of heuristics to decide if there is a //piper/third_party/py subdirectory and whether it should be used, and where in the import search path it should be inserted (order is important). It also tries to detect cases where third-party modules were imported from site-packages but also exist in //piper/third_party/py. The latter should not happen with GRTE Python (which we have used since 2008), but people sometimes use the system Python instead.

The heuristics also try to be careful about I/O operations to avoid adding any extra I/O overhead or possible hangs in cases where the previous naive code would not have been slow or had a hang due to NFS flakiness or any other reason.

Some modules in //piper/third_party/py are back-ports of modules that were added to the Python standard library in a later version of Python. In these cases, we want to use the standard library version of the module if it exists. This behavior is accomplished by code in //piper/…/__init__.py, which adds //piper/third_party/py to sys.path after the standard library directories, but before the site-packages directory.
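The ordering rule can be illustrated with a simplified sketch. This is not the actual _SetupThirdParty implementation, which applies additional heuristics; it only shows the "after the standard library, before site-packages" placement on a plain list of POSIX-style paths. The function name and the paths in the usage note are hypothetical.

```python
import posixpath


def insert_before_site_packages(search_path, third_party_dir):
  """Returns a copy of `search_path` with `third_party_dir` placed before
  the first site-packages entry, or appended if there is none."""
  result = [entry for entry in search_path if entry != third_party_dir]
  for index, entry in enumerate(result):
    if posixpath.basename(entry.rstrip("/")) == "site-packages":
      result.insert(index, third_party_dir)
      return result
  result.append(third_party_dir)
  return result
```

With a search path of ["/usr/lib/python3", "/usr/lib/python3/site-packages"], the third-party directory lands between the two entries, so a standard library module shadows its back-port while the back-port still shadows anything in site-packages.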

It’s easy to get the import order “wrong”, and such an error might not cause problems until the program is run in a different context or on a different platform. For this reason, the code in //piper/…/__init__.py searches for any third-party modules that were erroneously imported from site-packages and issues a warning. It is treated as a warning instead of an error until such time as we’re sure that there is no legitimate circumstance where this might be desired.

Historical note

Until 2006, there were two choices for third-party packages. Either use modules that are locally installed in /usr/lib/pythonX.Y/site-packages, or use modules that were in Piper at //piper/third_party/python (which was different than //piper/third_party/py above).

In reality, the first choice was more like 14 choices, because there were at least 14 different Python installations in active use at Google – this was pre-GRTE, which provides a single installation for all systems. So, there were as many as 14 different versions of the MySQLdb package in use.

Having so many different versions is a giant mess that makes it extremely hard to reason about our codebase, or make changes to it, without massive breakage.

The scheme described in this document was how we got down to one version for most packages.

Except as otherwise noted, the content of this page is licensed under CC-BY-4.0 license. Third-party product names and logos may be the trademarks of their respective owners.