Launch HN: Infra.new (YC W23) – DevOps copilot with guardrails built in

52 points by TankeJosh 2 days ago | 35 comments

ashishb
Congrats on the launch.
Here are my two cents, as I am very familiar with this space[1][2]
The problem with trying to position your product as "an easy way to deploy on GCP" or "an easier way to do K8s" is that your product is always limited by what the underlying platform directly offers. I know multiple K8s management startups (in the pre-LLM era) that failed because of this.
You are not required to, but you will be tempted to build a 1:1 mapping to the concepts of the underlying systems. So, anyone using your product has to learn both the underlying platform (e.g., GCP) and your system. And the problem is that all of those concepts have been derived either directly or indirectly from AWS or K8s, both of which focus on SREs much more than on software engineers.
The second problem is that there are now two interfaces to change something - one is infra.new, and another is the underlying platform directly. Your system will have to catch the drift in deployment when someone goes and changes the underlying platform.
The only major way to win is to have your own deployment system, e.g., an alternative to vercel.com, Render.com, or https://railway.com.
```
  - Vercel - a deployment system for frontend engineers (that's my perception)
  - Render/Railway - a deployment system for backend software engineers (that's my perception)
```
This approach is not guaranteed to succeed, but you are no longer limited to using the underlying platform's concepts.
```
  1. - https://github.com/ashishb/gabo
  2. - https://ashishb.net/programming/how-to-deploy-side-projects-as-web-services-for-free/
```
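The drift problem raised above (someone changing the underlying platform outside the tool) can be sketched as a periodic check. This is a minimal illustration, not how infra.new works: it assumes the infrastructure is managed with Terraform, that the `terraform` CLI is on the PATH, and that the working directory is already initialized. The function names `classify_plan_exit_code` and `check_for_drift` are hypothetical. It relies on the `-detailed-exitcode` flag of `terraform plan`, which exits with 0 when there is no diff, 1 on error, and 2 when changes are present.

```python
import subprocess

# Exit codes of `terraform plan -detailed-exitcode`:
# 0 = no changes, 1 = error, 2 = changes present.
PLAN_STATUS = {0: "in_sync", 1: "error", 2: "changes_present"}


def classify_plan_exit_code(code: int) -> str:
    """Map a plan exit code to a status label (labels are hypothetical)."""
    return PLAN_STATUS.get(code, "unknown")


def check_for_drift(workdir: str) -> str:
    """Run a plan in an initialized Terraform directory and classify it.

    Assumes the `terraform` binary is installed; not executed in this sketch.
    """
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit_code(result.returncode)
```

Note that "changes_present" covers both out-of-band drift and configuration changes that have not been applied yet, so a real check would still need a human (or a further step) to tell the two apart.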
- TankeJosh
  Appreciate the detailed feedback and definitely agree that wrapping these cloud services is a bad idea. Our last product did this and it went exactly how you described.
  Our goal isn’t really to make deploying “easy” per se, we mainly want to help infra / DevOps teams make better configuration changes faster by blending AI code gen with specialized RAG + static analysis + human review. The cool thing about using LLMs for this use case is that we don’t need to do the 1:1 mapping you described, we can instead just teach the agent to use the underlying systems directly.
  We like to think of ourselves as the anti-PaaS since we help engineering teams manage their own platform. Most of these teams already use Terraform and can continue to manage their infra however they like, they'll just do it faster and probably catch some issues that slipped through the cracks before.
  Our launch post did a bad job mentioning this focus on infra teams, so I apologize if that caused any confusion! Maybe "the Cursor for infra teams" would be a better way to describe infra.new
chrisweekly
Agreed this is a problem space that merits a robust solution (i.e., emphasis on the guardrails and "rabbit hole" risks). Glad to see the founders' pedigree, and the rec to use it in dev envs is a signal that increases trust, from my devops-adjacent pov. Bookmarked, I hope to get time to check this out before too long. In any case, good luck!
- TankeJosh
  Thank you for the feedback! Would love to hear your thoughts when you have a chance to try it.
candiddevmike
Terraform is far too brittle of a tool/language to use LLMs IMO, even with guardrails. Naive changes like the kind LLMs tend to do can be catastrophic... I'm struggling to see the benefits of this product compared to more clickops-y infra LLM tools that just build out the infra without the GitHub shenanigans. I don't see a market where folks are savvy enough to understand the value of what you're doing here but unable to just have copilot or whatever generate all of this for them.
- calebtv
  Yeah Terraform does have some sharp edges, but we chose it because it's the most widely used IaC language and lets you review any changes the agent wants to make before actually applying them in your cloud. One issue with an agent making click-ops changes is there's little to no visibility into what actually changed other than going through the console to verify it yourself. Terraform also lets you do more static checks like cost estimates and policy enforcement before deploying the changes.
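The "review before applying" point above can be illustrated with a small static check over a plan. This is a sketch under assumptions, not infra.new's actual guardrail: it assumes a plan exported as JSON (for example via `terraform show -json` on a saved plan file) and uses the `resource_changes` list with its `address` and `change.actions` fields. The function name `destructive_changes` and the sample data are made up for illustration.

```python
from typing import List


def destructive_changes(plan: dict) -> List[str]:
    """Return addresses of resources the plan would delete or replace."""
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        # A replacement shows up as a delete combined with a create.
        if "delete" in actions:
            flagged.append(rc["address"])
    return flagged


# Hypothetical plan fragment, shaped like Terraform's JSON plan output.
sample_plan = {
    "resource_changes": [
        {"address": "google_sql_database_instance.main",
         "change": {"actions": ["delete", "create"]}},
        {"address": "google_cloud_run_service.api",
         "change": {"actions": ["update"]}},
        {"address": "google_storage_bucket.logs",
         "change": {"actions": ["no-op"]}},
    ]
}

flagged = destructive_changes(sample_plan)
```

A check like this could run in CI and block the merge, or require an extra approval, whenever `flagged` is non-empty.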
  candiddevmike
  So there was originally BuildFlow[0], then LaunchFlow[1], and now infra.new? How many pivots are you folks going to do? What happened to all the customers you had on these products?
  0 - https://news.ycombinator.com/item?id=35169256
  1- https://news.ycombinator.com/item?id=40063124
  calebtv
  This will hopefully be our last pivot :) BuildFlow never took off and we're still supporting the companies who use the LaunchFlow Python SDK.
solatic
There are different personas in this space with different needs. It sounds like you're trying to reach the User who is currently doing ClickOps in cloud consoles to help set up their initial infrastructure and is subsequently getting lost.
Your risks include: (1) if the User is not proficient with Terraform and similar tools, will they appreciate being given Terraform code that they don't know how to deal with, and the additional overhead compared to just, well, ClickOps'ing their way through cloud consoles, particularly since ClickOps is a fundamentally free product? (2) If the User is proficient with Terraform, won't they already have some ideas in mind about how to modularize their codebase for long-term maintainability? How do you address the "too many resources take too long to plan" problem?
What you're describing is cool but I'm wondering who your target persona is, what value you provide above just ClickOps or running terraform plan locally, and whether that value solves enough of a pain point that people will be willing to pay?
fabiofzero
Do you have any plans to expand to non-US based clouds? This is an urgent concern these days.
- calebtv
  We do want to support clouds beyond AWS, GCP, and Azure. That's one of the reasons we chose Terraform, since it can be extended to any cloud, and also why we plan on focusing on Kubernetes next.
cloudking
This looks like a good concept, what I'm more interested in is a "DevOps agent" that can analyze cloud resources and diagnose problems, suggest optimizations etc.
The deployment part isn't a pain point, the maintenance and optimization is.
- TankeJosh
  We’d like to add more background maintenance / optimization features after we get the 0→1 use case solid. The agent can do some basic analysis using its workflow tool + cloud CLIs, but it's definitely not designed for anything beyond diagnosing deployment errors.
  Are you wanting something that runs during CI/CD? Or something that is constantly scanning your cloud looking for issues and potential improvements?
  cloudking
  Something that periodically scans for issues, compliance, and improvements. Also something I can deploy to address specific tasks, upgrades, migrations, etc.
gitroom
I think it's a cool idea and I get why folks want more guardrails, but I always hit headaches with Terraform breaking stuff. Do you think better checks can really stop big issues, or just make them quicker to catch?
- TankeJosh
  Ideally both. The agent is really good at keeping environment and module configurations separate, which IMO gets rid of a lot of the common headaches.
  What sort of breakages have you experienced with Terraform?
bberenberg
Bug report: I went through the guest credits, signed up for an account via Google OAuth, and then got dumped to a screen which says "You don't have access to this chat".
- TankeJosh
  Sorry about that, there's a race condition in the client that occasionally happens when claiming your guest chats. Reloading the page should fix it.
  bberenberg
  No, this is persistent. The URL works fine in an incognito window, but it never bound to the main account.
  TankeJosh
  Do you mind sending me an email with your chat ID so I can dig into it? josh@launchflow.com
stackskipton
As an Ops/SRE/DevOps/Platform Engineer/Whatever person, I watched the Loom and browsed the website. My initial thought is that I'd recommend this to almost no one.
Ops is a skill just like SQL/Backend/Frontend is. Most Devs here would recoil at the thought of me writing Frontend with some LLM that swore up and down it would protect me from the common SPA JS bugs/footguns I run into. I recoil at the thought of vibing your way through Terraform/OpenTofu against Cloud Resources that you don't understand and that are loaded with footguns. Also, Terraform/OpenTofu is riddled with footguns by itself. The fact that the 4 examples on their website are Kubernetes/Kubernetes/VM/Cloud Run is scary. If you need an LLM to run Kubernetes, you shouldn't be running it. Cert Manager, External DNS and other things are complete grues that will eat you.
What would I recommend? Something that takes a container you create and runs it for you: GCP Cloud Run/Azure Web App/AWS Something/Heroku/Fly.io. (I work with GCP/Azure.) The database should be Cloud Managed. If that's outside the budget, then a cheap VPS from a company like Vultr/DigitalOcean with Docker Compose is my recommendation. It's simple, easy to understand, and easy to write a simple GitHub Action for. Once you need scaling, you can hire an Ops person and they can wrangle Terraform/Kubernetes.
- TankeJosh
  100% agree with everything you’re saying here. This tool is designed to make the infra / DevOps person more effective by spending less time on tedious tasks and instead focus on the high level architecture and cost of their infrastructure. The deployment feature is mainly for testing any configurations in dev / staging environments before integrating with GitOps.
  The workflow we imagine is an infra team either managing everything on their own, or providing private Terraform modules that the rest of their developers can ask the agent to configure for them. For teams that go this developer self-service route, you can also set custom validation rules and default configurations that are shared across your team.
  Looking back on our launch post, we definitely did not highlight who this product was built for well enough. I’d love to hear if there are any aspects of the tool that you think would be helpful for your work!
  stackskipton
  None that I can think of. Maybe an LLM that knows Terraform would help me out, BUT I'm not sure I'd pay for it because most Ops people don't spend their entire life writing Terraform. They write a few modules, publish them, update them occasionally and that's it.
  Most of my work is hard political stuff, which is to say getting Devs/Leadership to give a passing care about Ops. Outside FAANG/FAANG types, most companies are fine with Devs going "Light is green, trap is clean" and not caring about the containment field (Ops). Paging me out at 2AM is not something Devs can get in trouble for.
  This is a common thing I see with most FAANG founders. People coming from Google Operations think, ok, it's probably 10x worse than the horrors I saw. No, it's much worse.
  TankeJosh
  Maybe we should add a special agent mode to help with planning internal politics strategy ;)
  I am curious what the handoff looks like between you and the devs you work with. Do they self-serve using the modules you publish? Or is there some sort of dev portal that abstracts away Terraform?
  stackskipton
  For Infrastructure, we are moving to a single monorepo where they PR into the repo and GHA runs it. Most of the time, they are using Modules we wrote, but we don't run a lot of cloud native stuff. Most of it is Database, possibly Redis, maybe storage, occasionally Pub/Sub. Modules are supposed to load up rules and forward them to the teams in Pagerduty, but that doesn't always happen, which falls through to us. I'd say most teams' infrastructure changes only 1-2x a year, except when a new project is getting spun up.
  Applications can be put into a container and tossed into Kubernetes. We use Kustomize + Templates for most applications, but occasionally those will need to be modified. I'd say that happens once a week.
  The other option is an ungodly Chef setup that will deploy their applications from Jenkins. We actually package their system up in a .deb package that is pushed to a subset of boxes, which is an absolute nightmare I luckily don't have to deal with. We went full "Write your own Kubernetes". Never go full "Write your own Kubernetes" (https://www.macchaffee.com/blog/2024/you-have-built-a-kubern... NOT THE AUTHOR)
  The hand off is a Container or "My application builds on Jenkins." If everything is running normally, there is nothing to hand off. It's when it's not that I get paged, and the lack of a hand off becomes frustrating. We are also isolated to our group.
  TankeJosh
  Thanks for sharing, this is really helpful info!
  In its current form, infra.new would probably be most helpful when setting up new projects or migrating any old apps to this single monorepo setup, but it also sounds like Terraform isn't a huge pain point for your team.
  I am interested to learn if we can help with these 2am pages though. Are those set up by you? Or the developers? Would an agent that helps improve observability / alerts configuration be interesting to you?
  stackskipton
  >I am interested to learn if we can help with these 2am pages though. Are those set up by you? Or the developers?
  Could be me or the developers. Sometimes it's my infrastructure acting up (thanks, Azure, for that failed Kubernetes upgrade). Or it could be that a Dev Team ran into something and paged out the Ops team because A) Maybe it's infrastructure. B) Ops teams tend to have the best troubleshooters, something in our Ops DNA. C) They can, and their managers never want to explain "Well, we found it was DNS but because Ops was not on the call, it took 15 minutes for us to wake them up." D) They likely need our support to run this one-off Kubernetes Job or rush out a deployment or other such thing.
  > Would an agent that helps improve observability / alerts configuration be interesting to you?
  That's what Datadog has sold us already (I'm not impressed) so it's a crowded marketplace. ;) I'm personally not in the market for anything, so I'm not a potential customer. If you were looking for another pivot, please, for the love of all that is holy, have it plug into Prometheus (PromQL) natively. If I have to set up another beeping sidecar to deal with logs and metrics, I'm going to hurt someone. Also, hooking logs to some LLM/AI is a terrible idea, don't even think about it.
- ryanisnan
  I can't help but completely agree. I could see an abstraction like this working really well on top of a simpler foundation, and I desperately think that's what the industry needs.
  But as it stands, on top of Helm/Kubernetes/Terraform/AWS/GCP/etc., there are so many dragons.
  I wonder if this tool could be repurposed to leverage much simpler cloud concepts that inherently require less operational skills to maintain.
  TankeJosh
  There are definitely a lot of dragons, but there are also a lot of teams that use these tools and need help slaying those dragons :)
  We chose to focus on IaC because we think general coding agents are going to take a long time to solve all the edge cases well.
  I do think a similar tool built on something simpler would be really interesting. I’ve been tempted to try this with our previous product: https://docs.launchflow.com/
conductr
The pricing makes me think too much. Tokens? Runners? I lose interest immediately on product pages when my brain goes into 'WTF is this going to cost me?' mode
- TankeJosh
  Yeah I don’t love the token-based pricing either, not sure if we'll keep it long term.
  In practice you won't hit these limits, and if you do then you're probably a power user and we'll happily give you free quota in exchange for feedback :)
65
It would be nice if this supported the AWS CDK.
- TankeJosh
  It's on the roadmap! The current plan is to add CDK docs / static checks after we finish adding Kubernetes support.
curtisszmania
Love seeing tools that prioritize infrastructure reliability. Well done!
- TankeJosh
  Thank you for the kind words! Would love to hear your thoughts if you have a chance to try it.