feat: Refactor repositories download contents#4153

Open
stevehipwell wants to merge 7 commits into google:master from stevehipwell:fix-download-contents

@stevehipwell
Contributor

@stevehipwell stevehipwell commented Apr 14, 2026

This PR refactors the behaviour of DownloadContents & DownloadContentsWithMeta with the former now being a direct passthrough to the latter as the only difference was the signature. The code has been refactored to use the API directly instead of via an unnecessary layer of indirection.

I've added an OpenAPI update to this PR as it proves that the updated code works against GitHub.

This change is required for #4151.

@codecov

codecov bot commented Apr 14, 2026

Codecov Report

❌ Patch coverage is 28.84615% with 37 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.68%. Comparing base (1d6a852) to head (52d75be).
⚠️ Report is 1 commits behind head on master.

Files with missing lines | Patch % | Lines
example/contents/main.go | 0.00% | 34 Missing ⚠️
github/repos_contents.go | 83.33% | 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4153      +/-   ##
==========================================
- Coverage   93.83%   93.68%   -0.15%     
==========================================
  Files         209      210       +1     
  Lines       19685    19695      +10     
==========================================
- Hits        18472    18452      -20     
- Misses       1015     1047      +32     
+ Partials      198      196       -2     

Comment thread tools/metadata/main_test.go Outdated
@stevehipwell
Contributor Author

@gmlewis can we get this merged?

Collaborator

@gmlewis gmlewis left a comment

I'm quite concerned about this PR because it appears to me that the behavior of following redirects has been deleted and there are many unit tests that have also simply been deleted without comment or explanation. One of the great things about unit tests is that when major refactors are performed like this one, if the unit tests are left alone we can easily detect regressions. As it is in this PR, however, where a major refactor happens and unit tests are also heavily refactored and/or deleted, it is hard to tell what is actually happening.

Can this be broken down into 3 PRs?

  1. Update the openapi_operations.yaml file - I'll do that myself momentarily.
  2. Refactor the download methods without modifying unit tests
  3. Refactor and/or delete unit tests

Comment thread openapi_operations.yaml Outdated
@stevehipwell
Contributor Author

@gmlewis let me take a look, but the main problem here is that the tests appear to be tightly coupled to the implementation, with mocks designed to make the tests pass rather than to mirror the actual API. I'll add the deleted tests back, but the mocks will need to be refactored to add the schema-required download_url to the content payload.

On a slight tangent, shouldn't the mock payloads be validated against the schema?

@gmlewis
Collaborator

gmlewis commented Apr 15, 2026

> On a slight tangent, shouldn't the mock payloads be validated against the schema?

Yes, they probably should. I don't remember when GitHub v3 API docs started sharing schemas for endpoints, but it is possible that these were written prior to that.

I think my biggest concern is following redirects because I remember a bunch of issues devoted solely to this topic, and to my shock and disappointment, I don't see any of the unit tests actually testing out following redirects and I could have sworn that it took a good deal of effort to get those unit tests to pass at one point. :-(

@stevehipwell
Contributor Author

> I think my biggest concern is following redirects because I remember a bunch of issues devoted solely to this topic, and to my shock and disappointment, I don't see any of the unit tests actually testing out following redirects and I could have sworn that it took a good deal of effort to get those unit tests to pass at one point. :-(

@gmlewis there are no redirects in the removed tests. The old code ignored the download_url already present on the file content response and, when the content wasn't returned inline, made an unnecessary call to the same get content endpoint for the parent directory. I've only removed that unnecessary step; if the content isn't returned inline, we still fetch it from the download link using exactly the same pattern.

FYI, the following example snippet errors with the current code but passes with the updated code: the last file requested is at an index greater than 1000 in its directory and is larger than 1 MB, so its content isn't returned inline.

package main

import (
	"context"
	"fmt"
	"io"
	"os"

	"github.com/google/go-github/v84/github"
)

// downloadContents downloads the contents of a file in a repository and returns it as a byte slice.
func downloadContents(ctx context.Context, client *github.Client, owner, repo, path, ref string) ([]byte, error) {
	rc, _, err := client.Repositories.DownloadContents(ctx, owner, repo, path, &github.RepositoryContentGetOptions{Ref: ref})
	if err != nil {
		return nil, err
	}
	defer rc.Close()

	by, err := io.ReadAll(rc)
	if err != nil {
		return nil, err
	}

	fmt.Printf("Downloaded %v/%v/%v as %d bytes\n", owner, repo, path, len(by))
	return by, nil
}

func main() {
	client := github.NewClient(nil)

	t := []struct {
		owner string
		repo  string
		path  string
		ref   string
	}{
		{"google", "go-github", "README.md", "master"},
		{"github", "rest-api-description", "descriptions/api.github.com/api.github.com.2026-03-10.yaml", "main"},
		{"ScoopInstaller", "Main", "bucket/yq.json", "master"},
		{"stevehipwell", "scoop-main-bucket", "bucket/zzztest.bin", "test-content"},
	}

	for _, v := range t {
		if _, err := downloadContents(context.Background(), client, v.owner, v.repo, v.path, v.ref); err != nil {
			fmt.Printf("Error: %v\n", err)
			os.Exit(1)
		}
	}
}

@stevehipwell
Contributor Author

@gmlewis I've added back the removed tests and undone some of the cosmetic changes to make the diff clearer that none of the actual tests have changed (it's only the mocks). I haven't rebased to fix the conflict yet in case you want to look at anything first?

Comment thread example/contents/main.go Outdated
@gmlewis
Collaborator

gmlewis commented Apr 16, 2026

> @gmlewis I've added back the removed tests and undone some of the cosmetic changes to make the diff clearer that none of the actual tests have changed (it's only the mocks). I haven't rebased to fix the conflict yet in case you want to look at anything first?

Thank you, @stevehipwell!
This looks great. Yes, let's proceed with this PR and get this in so you can continue to make progress on the context overhaul.

@stevehipwell stevehipwell force-pushed the fix-download-contents branch from 19bc2aa to d25a1ae on April 16, 2026 12:41
@stevehipwell
Contributor Author

@gmlewis I've rebased this and it should be good to go.

@alexandear I've updated the example to be closer to the other patterns and to have a valid comment.

Comment thread github/repos_contents.go
return nil, fileContent, resp, err
}

for _, contents := range dirContents {
Collaborator

@gmlewis gmlewis Apr 16, 2026

After closer inspection, the docs here:
https://docs.github.com/en/rest/repos/contents?apiVersion=2022-11-28#get-repository-content
say that contents from a repo directory can be downloaded with this endpoint.

Before, I said I was concerned about losing the functionality of following redirects, specifically in these lines 204-220. However, this code is not following redirects, it is downloading the contents of a directory.

Are we losing that capability in this PR?

I'm wondering why there are no unit tests that exercise the ability to download the contents from a repo directory?

Contributor Author

I'm not sure that I follow your concern. The code in lines 204-220 is only triggered when a file is larger than 1 MB or the input is invalid (a directory, not a file). For files larger than 1 MB, the updated code uses the download link already returned instead of making an additional API call and iterating through all of the directory's files. For invalid input, the updated code errors early, whereas this code runs all the way to the end before erroring.

I can add a test to show this behaviour? As there wasn't already a test and you asked for the tests to be aligned I didn't add one when I spotted that it was missing earlier.
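
A minimal sketch of the flow described in this comment, using only the standard library. It is not the code in this PR: `fileContent`, `openContent`, `fetcher`, `fakeFetch` and the example URL are hypothetical names introduced for illustration, and the field set is a trimmed-down assumption about the contents response.

```go
package main

import (
	"bytes"
	"encoding/base64"
	"errors"
	"fmt"
	"io"
	"strings"
)

// fileContent is a hypothetical, trimmed-down view of a contents response.
type fileContent struct {
	Type        string // "file", "dir", ...
	Encoding    string // "base64" when the content is returned inline
	Content     string
	DownloadURL string
}

// errNotFile is returned early when the requested path is not a file.
var errNotFile = errors.New("path is not a file")

// fetcher stands in for an HTTP GET on the download link (assumption).
type fetcher func(url string) (io.ReadCloser, error)

// openContent fails early for non-files, decodes inline content when it is
// present, and otherwise fetches the body from the download link.
func openContent(fc fileContent, fetch fetcher) (io.ReadCloser, error) {
	if fc.Type != "file" {
		return nil, errNotFile
	}
	if fc.Encoding == "base64" && fc.Content != "" {
		// Inline base64 payloads may be wrapped across lines.
		raw, err := base64.StdEncoding.DecodeString(strings.ReplaceAll(fc.Content, "\n", ""))
		if err != nil {
			return nil, err
		}
		return io.NopCloser(bytes.NewReader(raw)), nil
	}
	if fc.DownloadURL == "" {
		return nil, errors.New("no inline content and no download link")
	}
	return fetch(fc.DownloadURL)
}

// fakeFetch is a test double that echoes the URL instead of calling the network.
func fakeFetch(url string) (io.ReadCloser, error) {
	return io.NopCloser(strings.NewReader("fetched " + url)), nil
}

// readAll drains and closes rc; it returns an empty string on a read error.
func readAll(rc io.ReadCloser) string {
	defer rc.Close()
	b, err := io.ReadAll(rc)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	cases := []fileContent{
		{Type: "file", Encoding: "base64", Content: base64.StdEncoding.EncodeToString([]byte("hi"))},
		{Type: "file", Encoding: "none", DownloadURL: "https://example.invalid/big.bin"},
		{Type: "dir"},
	}
	for _, fc := range cases {
		rc, err := openContent(fc, fakeFetch)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(readAll(rc))
	}
}
```

The point of the sketch is that only one contents response is needed per file: the directory case is rejected before any download is attempted, and the large-file case never touches the parent directory listing.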

Contributor Author

I've added tests for calling a directory to show that it errors.

Collaborator

If I'm reading https://docs.github.com/en/rest/repos/contents?apiVersion=2022-11-28#get-repository-content correctly, the provided link can point to a repo directory and ALL the contents of that directory will be downloaded. Am I reading that wrong? Or are you saying that even though the docs claim this feature, it doesn't actually work?

I don't have time to investigate this myself at the moment, so any insight you can provide would help tremendously.

Contributor Author

I've investigated, and when calling the endpoint on a directory, any child directories we get back have an empty download link. If you can download a whole directory, you probably need to use the raw content type.

Also, I don't see that in the description. Where are you seeing it?

Collaborator

Some context:

Also the unit tests that are modified in this PR show that a list could historically be returned which represented the names of items within a directory.

When I'm off my phone I'll look at the official docs again and quote the part that I'm concerned about.

Contributor Author

The API will return a list if you ask for the parent dir contents; my point here is that it's unnecessary.

The first PR you link above just copies the download function and also returns the metadata. The second PR adds a check for the content in the initial API call.

AFAIK the content API has always returned the download URL for a file, so the dir call and loop have always been unnecessary. Remember that both calls go to the same API, and I can't believe that even GitHub would skip the download URL in the specific response and make you make a second call whose response is also limited.

Collaborator

> The API will return a list if you ask for the parent dir contents; my point here is that it's unnecessary.
>
> The first PR you link above just copies the download function and also returns the metadata. The second PR adds a check for the content in the initial API call.
>
> AFAIK the content API has always returned the download URL for a file, so the dir call and loop have always been unnecessary. Remember that both calls go to the same API, and I can't believe that even GitHub would skip the download URL in the specific response and make you make a second call whose response is also limited.

OK, I'm trying to write an example that lists the contents in a directory, and I'm not getting it to work.

Here are the paragraphs that concern me:

Gets the contents of a file or directory in a repository. Specify the file path or directory with the path parameter. If you omit the path parameter, you will receive the contents of the repository's root directory.

application/vnd.github.object+json: Returns the contents in a consistent object format regardless of the content type. For example, instead of an array of objects for a directory, the response will be an object with an entries attribute containing the array of objects.

If the content is a directory, the response will be an array of objects, one object for each item in the directory. When listing the contents of a directory, submodules have their "type" specified as "file". Logically, the value should be "submodule". This behavior exists for backwards compatibility purposes. In the next major version of the API, the type will be returned as "submodule".
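
The two response shapes described in these paragraphs (a bare array for a directory, or a single object carrying an `entries` attribute) can be told apart when decoding. The following is a rough sketch under assumptions: `entry` and `decodeContents` are hypothetical names, only a few fields are modelled, and the JSON payloads are hand-written examples rather than real API output.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// entry is a hypothetical, trimmed-down item in a contents response.
type entry struct {
	Name    string  `json:"name"`
	Type    string  `json:"type"`
	Entries []entry `json:"entries,omitempty"`
}

// decodeContents accepts either shape. It returns the single object when the
// body is a JSON object, or the listing when the body is a JSON array.
func decodeContents(body []byte) (*entry, []entry, error) {
	trimmed := bytes.TrimSpace(body)
	if len(trimmed) > 0 && trimmed[0] == '[' {
		var list []entry
		if err := json.Unmarshal(trimmed, &list); err != nil {
			return nil, nil, err
		}
		return nil, list, nil
	}
	var e entry
	if err := json.Unmarshal(trimmed, &e); err != nil {
		return nil, nil, err
	}
	return &e, nil, nil
}

func main() {
	// Hand-written payloads: an array listing and an object with entries.
	arrayBody := []byte(`[{"name":"a.txt","type":"file"},{"name":"sub","type":"dir"}]`)
	objectBody := []byte(`{"name":"docs","type":"dir","entries":[{"name":"a.txt","type":"file"}]}`)

	_, list, err := decodeContents(arrayBody)
	fmt.Println(len(list), err)

	obj, _, err := decodeContents(objectBody)
	fmt.Println(obj.Name, len(obj.Entries), err)
}
```

Either shape only lists the items in a directory; neither carries the file bodies themselves.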

Before we rip out functionality that someone might miss, though, I would like another set of eyes on this.

@alexandear - what are your thoughts about ripping out the for loops that are being removed in this PR?
Will anyone miss them?

If I'm reading @stevehipwell's arguments correctly, he is saying that they never actually did anything, although we have a hint of proof that at one point they did something because he had to remove parts of the unit tests (that contained objects with arrays) to get tests to pass... so that is another one of my concerns.

Contributor Author

@gmlewis I'm not saying they didn't do anything; I'm saying the implementation was inefficient and unnecessary. The mocks needed updating because they were implemented to make the tests pass.

So, from first principles: the new mocks actually match the API schema, and the new code functions correctly and mirrors the behaviour of the old code. The only difference in functionality is that the new code doesn't fail when getting the content of a file that's larger than 1 MB and at an index greater than 1000 in its directory.
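
To illustrate the failure case mentioned here, a small simulation of the old lookup pattern, assuming the parent directory listing is capped at 1000 entries as described in this thread. `listDir`, `findByScan`, `listingLimit` and the file names are hypothetical and nothing here calls GitHub.

```go
package main

import "fmt"

// listingLimit models the cap on a directory listing (assumption taken from
// the thread: files at an index greater than 1000 are not returned).
const listingLimit = 1000

// item is a hypothetical directory entry with its download link.
type item struct {
	Name        string
	DownloadURL string
}

// listDir simulates listing a directory that really holds n files; the
// result is truncated to listingLimit entries.
func listDir(n int) []item {
	if n > listingLimit {
		n = listingLimit
	}
	items := make([]item, n)
	for i := range items {
		name := fmt.Sprintf("file%04d.bin", i)
		items[i] = item{Name: name, DownloadURL: "https://example.invalid/" + name}
	}
	return items
}

// findByScan mimics the old approach: scan the parent listing for the file
// name and return its download link if it is present.
func findByScan(listing []item, name string) (string, bool) {
	for _, it := range listing {
		if it.Name == name {
			return it.DownloadURL, true
		}
	}
	return "", false
}

func main() {
	listing := listDir(1500) // the directory holds 1500 files, the listing does not

	for _, name := range []string{"file0999.bin", "file1200.bin"} {
		url, ok := findByScan(listing, name)
		fmt.Println(name, ok, url)
	}
}
```

In this simulation the scan cannot find any file beyond the cap, whereas a download link taken from the file's own response does not depend on where the file sits in its directory.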

Collaborator

> @gmlewis I'm not saying they didn't do anything; I'm saying the implementation was inefficient and unnecessary. The mocks needed updating because they were implemented to make the tests pass.
>
> So, from first principles: the new mocks actually match the API schema, and the new code functions correctly and mirrors the behaviour of the old code. The only difference in functionality is that the new code doesn't fail when getting the content of a file that's larger than 1 MB and at an index greater than 1000 in its directory.

OK, thank you, @stevehipwell. Sounds good to me.
I know @alexandear already approved, but let's please just wait for one more confirmation before merging.
Thank you for your patience with me! I appreciate it.

Signed-off-by: Steve Hipwell <steve.hipwell@gmail.com>
@stevehipwell stevehipwell force-pushed the fix-download-contents branch from 280f864 to 52d75be on April 16, 2026 15:21
3 participants