Your requirement is quite beautiful.
I am going to do it.
I will code it in Java.
I can approach it in two ways:
1) Using Selenium (Web Driver)
2) Direct HTTP calls
Let me know which way you prefer.
In the first approach:
WebDriver will consume a lot of your CPU resources, but it is the more reliable option, since it works on the rendered DOM structure and should therefore work for practically any website.
It will open your browser, load the URL, parse the page, and repeat this for each link it finds.
In the second approach:
Java will make HTTP calls directly, and the response may come back as HTML, XML, JSON, or a serialized object.
The tricky part is telling which responses are API-generated output and which are HTML pages.
If you know for sure that the websites you will be targeting return HTML responses only, we can definitely go with the second approach.
Let me know what you think.
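To give an idea of how the second approach could tell HTML apart from API output, here is a rough sketch that looks at the Content-Type response header. The class and method names (ResponseKind, classify, fetchKind) are only my placeholders, and it assumes Java 11 or later and that the server sends an honest Content-Type, which is not always the case.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Locale;

// Hypothetical helper: names are illustrative, not part of any agreed design.
class ResponseKind {

    // Map a Content-Type header value to a coarse category.
    static String classify(String contentType) {
        if (contentType == null || contentType.isBlank()) {
            return "UNKNOWN";
        }
        // Drop parameters such as "; charset=UTF-8" and normalise the case.
        String mediaType = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        if (mediaType.equals("text/html") || mediaType.equals("application/xhtml+xml")) {
            return "HTML";
        }
        if (mediaType.equals("application/json") || mediaType.endsWith("+json")) {
            return "JSON";
        }
        if (mediaType.equals("application/xml") || mediaType.equals("text/xml")
                || mediaType.endsWith("+xml")) {
            return "XML";
        }
        return "OTHER";
    }

    // Fetch a URL and classify it by its Content-Type header.
    // The body is discarded here; a real crawler would read it for HTML pages.
    static String fetchKind(String url) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<Void> response =
                client.send(request, HttpResponse.BodyHandlers.discarding());
        return classify(response.headers().firstValue("Content-Type").orElse(null));
    }
}
```

With something like this, only the links classified as HTML would be parsed further for anchors, and the rest would just be recorded or skipped, depending on what you want.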
Another thing is regarding your CSV output file.
On the left side you have maintained the hierarchy.
I have a question about it.
Should the text printed on the left side of the CSV be the URL path split by the forward slash '/', or the text inside the anchor tag that describes the link?
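If it is the first option, the hierarchy levels could be derived from the URL roughly like this. This is only a sketch: UrlHierarchy and pathSegments are names I made up, example.com is a placeholder, and how the levels map to your CSV columns is still to be confirmed.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a URL path into its hierarchy levels.
class UrlHierarchy {

    // Return the non-empty path segments of a URL, in order.
    // The query string and fragment are ignored because only getPath() is used.
    static List<String> pathSegments(String url) {
        String path = URI.create(url).getPath();
        List<String> segments = new ArrayList<>();
        if (path == null) {
            return segments;
        }
        for (String part : path.split("/")) {
            // Skip the empty piece produced by the leading slash.
            if (!part.isEmpty()) {
                segments.add(part);
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        // Placeholder URL used purely for illustration.
        System.out.println(pathSegments("https://example.com/products/shoes/running"));
    }
}
```

For the anchor text option, the levels would instead come from the link text collected while crawling, so the left side would read more like the site's own menu.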