
I Accidentally Deleted 7TB of Videos Before Going to Production
[Hello HN: You can read the discussion and comments here]
[I also want to preface this whole post by saying that I’m a Junior Developer with less than one year of actual experience. Some of the things that might seem obvious to some might not be so for me, thanks!]
This is a story that some won’t understand. It involves bad practices and errors from multiple parties in a world that might seem foreign to the “Silicon Valley” world but paints an accurate picture of what development is for small IT companies around the world.
I’m currently working at a tiny development company (10 employees) in Italy. We develop and manage websites and tools for local businesses. Other than that we landed a big contract for one of the biggest gym companies in Italy, the UK and South Africa. You might expect that given the size they know what they’re doing, but that’s hardly the truth. It’s easy to point the finger and accuse someone, but that’s not what I will be doing here (especially since we all make mistakes and bad decisions as you will see) so I just want to objectively describe what’s going on.
The project
I’m under an NDA, so I can’t disclose too much, but suffice it to say that we’re currently working on a project that needs to use videos hosted on Vimeo. Currently the company uses VimeoOTT, a platform that provides a stock frontend for the content, and they wanted to migrate to Vimeo Enterprise. There were roughly 500 videos on VimeoOTT that had to be transferred to Enterprise, and Vimeo doesn’t provide an easy way of doing it. I wrote to the support team around October asking whether a migration was possible, and they told us that they “will look into it”. We heard nothing more from them after that.
This meant that the upload had to be done again. I proposed building a custom API script that downloads videos from OTT and uploads them to Enterprise (and to our product as well), but management rejected the proposal and decided to pay a person to do it manually instead. In the months following October, that person uploaded the 500 videos from OTT plus 400 new ones, reaching around 9TB of the 11TB granted to us with the Enterprise plan. All was going well (even if it wasn’t really efficient). Then April arrived.
The problem
At one point, without letting us know anything, Vimeo decided it was a great idea to comply with our request and dumped all the videos present on OTT onto the new platform. No questions were asked, and apparently no one at Vimeo cared that:
1. They were duplicating videos that were already uploaded.
2. The total size of the videos was now around 15TB, 4 over the limit.
This meant that unless we deleted stuff no one was able to upload videos anymore. We asked Vimeo if it was possible to revert the change, but we received a negative answer. The worst part? We had to go live in about a week.
It was time to delete the extra videos, and I was the one in charge of doing it. Sadly, I made a giant mistake.
The “solution”
(For context, I’ve been working with React for the last 7 months. This will kinda explain what went wrong in a bit.)
Luckily in our DB we had a VimeoId assigned to each video, so the first solution that came to my mind was:
for each video in vimeo:
    if video not in our_vimeo_ids:
        delete("api.vimeo.com/videos/{video}")
Both requests were paginated (in a slightly different way) so the actual code I wrote was:
import requests

page = 0
url = f"https://api.ourservice.com/media?page={page}&step=100"
our_ids = []
for i in range(10):
    page = i
    res = requests.get(url)
    videos = res.json()['list']
    ids = [... for video in videos]  # per-video ID expression lost in the source
    our_ids += ids

next = '/me?page=1'
vimeo_ids = []
while next is not None:
    res = requests.get(f'https://api.vimeo.com/videos{next}')
    res = res.json()
    videos = res['data']
    ids = [... for video in videos]  # per-video ID expression lost in the source
    vimeo_ids += ids
    next = res['pagination']['next']

for id in vimeo_ids:
    if id not in our_ids:
        requests.delete(f'https://api.vimeo.com/videos/{id}')
I think you can easily spot the error. I know I can now, but at the time the code seemed completely correct to me. In case you can’t see it and want a spoiler, here it is:
url = f"https://api.ourservice.com/media?page={page}&step=100"
our_ids = []
for i in range(10):
    page = i
    res = requests.get(url)
I was so used to React that for some reason my mind thought that url would refresh itself as soon as the page variable changed, which of course is not the case: the f-string is evaluated once, when url is assigned, with page still equal to 0. That meant that with this script I deleted from Vimeo all videos that weren’t on the first page of our db.
There was another issue: I tested the code, but to do so I reused the same flawed loop from the first example.
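One way to avoid this class of bug is to build the URL inside the loop, so it is recomputed for every page. The sketch below is only an illustration, not the code that was actually run: `fetch_page` is a hypothetical stand-in for the `requests.get(...).json()['list']` call, and the `id` field name is an assumption.

```python
from typing import Callable, Dict, List

BASE_URL = "https://api.ourservice.com/media"  # endpoint as written in the post


def page_url(page: int, step: int = 100) -> str:
    """Build the URL for a single page; evaluated on every call."""
    return f"{BASE_URL}?page={page}&step={step}"


def collect_ids(fetch_page: Callable[[str], List[Dict]], pages: int = 10) -> List[str]:
    """Collect the IDs from every page.

    `fetch_page` is a hypothetical helper that takes a URL and returns the
    list of video records found on that page.
    """
    ids: List[str] = []
    for page in range(pages):
        videos = fetch_page(page_url(page))  # URL rebuilt inside the loop
        ids.extend(video["id"] for video in videos)  # "id" is an assumed field name
    return ids
```

Passing the fetcher in as an argument also makes the loop easy to check against a fake that records which URLs were requested, which would have exposed the repeated first page immediately.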
page = 0
url = f"https://api.ourservice.com/media?page={page}&step=100"
our_ids = []
for i in range(10):
    page = i
    res = requests.get(url)
    videos = res.json()['list']
    ids = [... for video in videos]  # per-video ID expression lost in the source
    for id in ids:
        res = requests.get(f'https://api.vimeo.com/videos/{id}')
        if res.status_code != 200:
            print(f"There was something wrong. You have deleted a wrong video -> {id}")
I also did some manual testing, but the testing was done only on the first page of our db. A series of mistakes that could’ve probably been easily prevented.
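For destructive operations like this one, a common safeguard is to compute the deletion plan first and refuse to proceed if it looks implausible. What follows is a hedged sketch, not what was actually run; `plan_deletions` and the `max_fraction` threshold are hypothetical names chosen for illustration.

```python
from typing import Iterable, List


def plan_deletions(
    remote_ids: Iterable[str],
    keep_ids: Iterable[str],
    max_fraction: float = 0.5,
) -> List[str]:
    """Return the remote IDs that are not in keep_ids, without deleting anything.

    Raises ValueError if the plan would remove more than `max_fraction` of the
    remote videos, which usually means the keep list is incomplete.
    """
    remote = list(remote_ids)
    keep = set(keep_ids)
    to_delete = [vid for vid in remote if vid not in keep]
    if remote and len(to_delete) / len(remote) > max_fraction:
        raise ValueError(
            f"Refusing to delete {len(to_delete)} of {len(remote)} videos; "
            "check that the keep list covers every page."
        )
    return to_delete
```

With the rough numbers from this story (around 900 videos to keep out of roughly 1,400 on Vimeo), the expected plan touches about a third of the library, while a keep list containing only the first page would flag over 90% of it and trip the guard.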
The aftermath
The good news is that the videos were still physically backed up in a Google Drive folder and the info about them was still in our db. The bad news is that this was on Friday, and we needed to have the videos back up by Tuesday morning at the latest. We had to upload ~8TB of data with a 30MB/s connection. Not ideal, and I had to think of something fast.
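As a back-of-the-envelope check on how tight that was, the snippet below estimates the raw transfer time. It assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), that 30MB/s means megabytes per second, and a fully saturated link with no protocol overhead.

```python
def transfer_hours(size_tb: float, rate_mb_per_s: float) -> float:
    """Hours needed to move size_tb terabytes at rate_mb_per_s megabytes per second."""
    seconds = (size_tb * 1e12) / (rate_mb_per_s * 1e6)
    return seconds / 3600


hours = transfer_hours(8, 30)
print(f"{hours:.1f} hours, about {hours / 24:.1f} days")
# prints "74.1 hours, about 3.1 days"
```

Under those assumptions the upload alone takes a little over three days of uninterrupted transfer, which leaves very little margin between Friday and Tuesday morning.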
The first solution that came to my mind was to use Google Drive APIs. We had the filenames of all the videos uploaded on our db, so I quickly wrote some code that looked like this:
page = 0
file_names = get_our_filenames(page)  # This time without the mistake in the for loop
for name in file_names:
    download_and_save_from_drive(name)
    upload_to_vimeo(name)
This meant that I could run the script multiple times with different pages, thus “parallelizing” the procedure across different networks. (I also thought about running it in a high-speed environment, but at the time we didn’t have a convenient place without enormous egress fees.) It was still not ideal, since saturating our upload bandwidth for 4 days wasn’t the best choice. Then an idea came to my mind:
The solution
Was there a way to upload videos directly from Google Drive to Vimeo? I checked their Upload page and sure enough, there was one! There was a small problem though: it was only a manual solution and there were no APIs to automate it. The good thing about it, though, was that the uploads were near instant. Maybe there is a better solution to this, but I don’t know of one, so my reaction to this discovery was to boot up Playwright.
Playwright is an automated E2E testing tool that can be used to simulate user interaction. That means it can also programmatically click on websites. See where this is going?
Here’s the code: (I just started using Playwright and it had to be written really quickly so excuse me for it being ugly)
import { test } from '@playwright/test';

// Small sleep helper; its definition is not shown in the post, so this one is assumed
const timeout = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

test('Videos', async ({ page }) => {
  // We login into vimeo
  await page.goto('https://vimeo.com/upload/videos');
  await page.fill('input[name="email"]', 'xxx');
  await page.fill('input[name="password"]', 'xxx');
  await page.click('input[type=submit]');

  // We click on the Drive button and then login into Google Drive
  // We need to manage it as an iframe
  const [popup] = await Promise.all([
    page.waitForEvent('popup'),
    page.click('text=Drive'),
  ]);
  await popup.fill('input[type="email"]', 'xxx');
  await popup.click('button:has-text("Next")');
  await popup.fill('input[type="password"]', 'xxx');
  await popup.click('button:has-text("Next")');
  await timeout(5000);

  // For all the filenames we obtained before we upload them
  // (`videos` is that list of filenames; part of this loop header was lost in
  // the source, so the bound and the `i > 0` condition are reconstructed guesses)
  for (let i = 0; i < videos.length; i++) {
    if (i > 0) {
      await page.click('text=Drive');
      await timeout(5000);
    }
    let found = false;
    while (!found) {
      for (let frame of page.frames()) {
        const searchbox = await frame.$('input[aria-label="Search terms"]');
        const button = await frame.$('div[data-tooltip="Search"]');
        if (searchbox) {
          await searchbox.fill(videos[i]);
          await button.click();
          found = true; // stop polling once the search has been submitted
        }
      }
    }
    await timeout(5000);
    // Whenever we search google regenerates the iframe so we have to search again
    for (let frame of page.frames()) {
      const temp = await frame.$('table[role="listbox"] div[tabindex="0"]');
      if (temp) {
        const select = await frame.$('div[id="picker:ap:2"]');
        await select.click();
      }
    }
    await page.goto('https://vimeo.com/upload/videos');
  }
});
This code is bad (notice the timeouts to combat the flakiness of .click() in Playwright), but it does its job, except for one important thing: I didn’t manage to get the click working on the video that was found, only on the “Select” button. I have no clue to this day how to make it work, which meant that with this code I had to manually click every 10 seconds to select the video and make the program continue. I did it for 10 minutes, then I asked myself why I was doing it.
I downloaded an autoclicker (xclicker) and set it to click every 5 seconds. Lo and behold, roughly 13 seconds per video, 1000 videos, 4 hours later all of the videos were uploaded. Only one thing left to do:
They now had new vimeoIds so I had to go back to our db and update all the videos with the correct values. This was simple to do with a python script similar to the previous ones.
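A sketch of what that remapping step might look like is below. It is an illustration under assumptions, not the script that was actually used: it assumes each re-uploaded Vimeo video kept its original file name, and the record fields (`filename`, `name`, `id`) are hypothetical.

```python
from typing import Dict, List, Tuple


def build_id_updates(
    db_records: List[Dict[str, str]],
    vimeo_videos: List[Dict[str, str]],
) -> Tuple[Dict[str, str], List[str]]:
    """Match DB records to re-uploaded Vimeo videos by file name.

    Returns (updates, unmatched): `updates` maps file name -> new Vimeo ID,
    and `unmatched` lists file names with no match or more than one match,
    which are left for manual review instead of being guessed.
    """
    by_name: Dict[str, List[str]] = {}
    for video in vimeo_videos:
        by_name.setdefault(video["name"], []).append(video["id"])

    updates: Dict[str, str] = {}
    unmatched: List[str] = []
    for record in db_records:
        candidates = by_name.get(record["filename"], [])
        if len(candidates) == 1:
            updates[record["filename"]] = candidates[0]
        else:
            unmatched.append(record["filename"])
    return updates, unmatched
```

Computing the full mapping before writing anything to the db keeps this step reviewable, in the same spirit as testing destructive operations more carefully.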
And finally, all the videos were uploaded and the day was saved.
Conclusion
What does this teach us? Well, it teaches me to do more diverse tests when doing destructive operations. It should also probably teach something to Vimeo and to my contractor, but I doubt it will (and yes, the upload for some reason is still manual to this day. Go figure!).
2022 © thevinter